UBC Theses and Dissertations

In silico approaches to investigating mechanisms of gene regulation Ho Sui, Shannan Janelle 2008

IN SILICO APPROACHES TO INVESTIGATING MECHANISMS OF GENE REGULATION

by

SHANNAN JANELLE HO SUI
B.Sc., The University of British Columbia, 2000

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

in

THE FACULTY OF GRADUATE STUDIES
(Genetics)

THE UNIVERSITY OF BRITISH COLUMBIA
(Vancouver)

March 2008

© Shannan Janelle Ho Sui, 2008

Abstract

Identification and characterization of regions influencing the precise spatial and temporal expression of genes is critical to our understanding of gene regulatory networks. Connecting transcription factors to the cis-regulatory elements that they bind and regulate remains a challenging problem in computational biology. The rapid accumulation of whole genome sequences and genome-wide expression data, and advances in alignment algorithms and motif-finding methods, provide opportunities to tackle the important task of dissecting how genes are regulated. Genes exhibiting similar expression profiles are often regulated by common transcription factors. We developed a method for identifying statistically overrepresented regulatory motifs in the promoters of co-expressed genes using weight matrix models representing the specificity of known factors. Application of our methods to yeast fermenting in grape must revealed elements that play important roles in utilizing carbon sources. Extension of the method to metazoan genomes via incorporation of comparative sequence analysis facilitated identification of functionally relevant binding sites for sets of tissue-specific genes, and for genes showing similar expression in large-scale expression profiling studies. Further extensions address alternative promoters for human genes and coordinated binding of multiple transcription factors to cis-regulatory modules. Sequence conservation reveals segments of genes of potential interest, but the degree of sequence divergence among human genes and their orthologous sequences varies widely.
Genes with a small number of well-distinguished, highly conserved noncoding elements proximal to the transcription start site may be well-suited for targeted laboratory promoter characterization studies. We developed a “regulatory resolution” score to prioritize lists of genes for laboratory gene regulation studies based on the conservation profile of their promoters. Additionally, genome-wide comparisons of vertebrate genomes have revealed surprisingly large numbers of highly conserved noncoding elements (HCNEs) that cluster near genes associated with transcription and development. To further our understanding of the genomic organization of regulatory regions, we developed methods to identify HCNEs in insects. We find that HCNEs in insects are similar in function and organization to their vertebrate counterparts. Our data suggest that microsynteny in insects has been retained to keep large arrays of HCNEs intact, forming genomic regulatory blocks that surround the key developmental genes they regulate.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
List of Abbreviations and Acronyms
Acknowledgements
Co-authorship Statement
Chapter 1: Introduction
  1.1 Background and significance
  1.2 Structure of eukaryotic regulatory regions
    1.2.1 The core promoter
    1.2.2 Complexities in promoter architecture
    1.2.3 Enhancers
    1.2.4 Chromatin structure and its effects on gene regulation
    1.2.5 Current strategies for improving our understanding of gene regulation
  1.3 Laboratory-based methods for regulatory region identification
    1.3.1 Reporter constructs
    1.3.2 DNA binding assays
    1.3.3 In vitro selection
    1.3.4 Transcript profiling to locate proximal promoters
  1.4 Computational methods for regulatory element prediction
    1.4.1 Repositories of gene regulatory information
    1.4.2 Computational identification of TFBSs
  1.5 Conservation analysis for the identification of regulatory regions
    1.5.1 Algorithms for multi-species sequence alignments
    1.5.2 Identifying and measuring evolutionary constraint
    1.5.3 Highly conserved noncoding elements in metazoan genomes
  1.6 Thesis overview and chapter objectives
  1.7 References
Chapter 2: oPOSSUM: Identification of Over-represented Transcription Factor Binding Sites in Sets of Co-expressed Genes
  2.1 Introduction
  2.2 Methods
    2.2.1 Automated retrieval of human-mouse orthologs
    2.2.2 Phylogenetic footprinting
    2.2.3 Detection of TF binding sites
    2.2.4 Discovery of over-represented binding sites
    2.2.5 NF-κB microarray experiment
    2.2.6 Simulations using random sampling
    2.2.7 Parameter selection for validation studies
  2.3 Results
    2.3.1 Validation using reference gene sets
    2.3.2 Application to transcript profiling data
    2.3.3 Specificity assessment
    2.3.4 Noise tolerance
    2.3.5 Web implementation
    2.3.6 The oPOSSUM application programming interface (API)
  2.4 Discussion
    2.4.1 Performance
    2.4.2 Challenges
    2.4.3 Utility
  2.5 Conclusions
  2.6 References
Chapter 3: Integrated Tools for Analysis of Regulatory Motif Over-representation
  3.1 Introduction
  3.2 Methods
    3.2.1 Over-representation analysis
    3.2.2 Species-specific Databases
    3.2.3 Transcription Factor Binding Site Prediction
  3.3 Results
    3.3.1 Human SSA
    3.3.2 Human CSA
    3.3.3 Worm SSA
    3.3.4 Yeast SSA
  3.4 Discussion
  3.5 Conclusions
  3.6 References
Chapter 4: Dynamics of the Yeast Transcriptome During Wine Fermentation Reveals a Novel Fermentation Stress Response
  4.1 Introduction
  4.2 Methods
    4.2.1 Strains, media and growth conditions
    4.2.2 Analysis of fermenting wine
    4.2.3 Microarray analysis
    4.2.4 Analyses of gene expression data
    4.2.5 Functional enrichment of genes in clusters
    4.2.6 Identification of regulatory elements over-represented in gene clusters
  4.3 Results
    4.3.1 Clustering of genes based on temporal expression profiles
    4.3.2 The fermentation stress response
    4.3.3 Over-representation of functional annotations
    4.3.4 Statistical enrichment of predicted transcription factor binding sites
    4.3.5 Entry into stationary growth
  4.4 Discussion
    4.4.1 Fermentation of grape must induces a novel and adaptive response
    4.4.2 Attenuated glucose repression
    4.4.3 Proposed models for alleviation of glucose repression
    4.4.4 Ethanol as a regulator of fermentation
    4.4.5 Osmotic stress
    4.4.6 Nature of the signal for entry into stationary growth
  4.5 Conclusions
  4.6 References
Chapter 5: Ranking Candidate Genes for Promoter Construct Design
  5.1 Introduction
  5.2 Methods
    5.2.1 Identification of the boundaries for analysis
    5.2.2 Nonexonic conserved regions
    5.2.3 Score definition
    5.2.4 Manual promoter interpretation
  5.3 Results
    5.3.1 Genome-wide distribution of regulatory resolution scores
    5.3.2 Features of genes with the highest, average and lowest regulatory resolution
    5.3.3 Regulatory resolution scores capture aspects of the manual curation process
  5.4 Discussion
  5.5 Conclusions
  5.6 References
Chapter 6: Genomic Regulatory Blocks Underlie Extensive Microsynteny Conservation Among Insects
  6.1 Introduction
  6.2 Methods
    6.2.1 Sequences and annotations
    6.2.2 HCNE Detection
    6.2.3 Computation of feature densities and density peak detection
    6.2.4 Identification of synteny blocks and RA sequence among flies
    6.2.5 Analysis of extent of Dmel-Agam synteny
    6.2.6 Core promoter analysis
    6.2.7 Expression analysis
  6.3 Results
    6.3.1 Highly conserved noncoding elements are enriched in large synteny blocks
    6.3.2 HCNE arrays are centrally positioned in large synteny blocks that span multiple genes
    6.3.3 HCNE-associated genes are in large blocks of conserved microsynteny between fly and mosquito
    6.3.4 HCNE-associated genes have specific types of core promoters
    6.3.5 HCNE arrays mark regulatory domains maintained in evolution
  6.4 Discussion
    6.4.1 Experimental evidence for long-range regulation and GRBs in Drosophila
    6.4.2 Regulatory HCNE arrays are a fundamental feature of metazoan genomes
    6.4.3 Responsiveness of genes to long-range enhancers
  6.5 Conclusions
  6.6 References
Chapter 7: Discussion and Conclusions
  7.1 Human/mouse regulation studies
  7.2 Regulation in model organisms
  7.3 Limitations of conservation analysis
  7.4 Future directions for gene regulation studies
  7.5 References
Appendices
  Appendix 1
  Appendix 2
  Appendix 3
  Appendix 4

List of Tables

Table 1.1 Selected resources for gene regulatory information
Table 2.1 Statistically over-represented TF binding sites in reference gene sets
Table 2.2 Statistically over-represented TFBSs in gene expression data sets
Table 2.3 Predefined values for phylogenetic footprinting and TFBS detection available in oPOSSUM’s default mode
Table 3.1 Validation of the human FoxM1-regulated gene cluster
Table 3.2 Validation of the c-Fos-regulated gene cluster
Table 3.3 Validation of skeletal muscle genes identified by Moran et al. and Tomczak et al.
Table 3.4 Validation of worm skeletal muscle genes using worm profiles
Table 3.5 Validation of the yeast CLB2 gene cluster
Table 4.1 Summary of gene counts
Table 4.2 Fermentation Stress Response Genes
Table 4.3 Highest ranked transcription factor binding site motifs for FSR genes
Table 5.1 Number of genes in each Pleiades promoter assessment class
Table 5.2 Pleiades Promoter Project genes approved for MiniPromoter design
Table 6.1 Core promoter classification of Dmel genes

List of Figures

Figure 1.1 Regulation of transcription
Figure 1.2 Construction of quantitative models for transcription factor binding specificity
Figure 2.1 The oPOSSUM system for identifying over-represented TFBSs in sets of co-expressed genes
Figure 2.2 Relationship between the Fisher p-values and Z-scores for the muscle, liver and NF-kappaB reference sets
Figure 2.3 Percentage of trials that produced false positive (FP) predictions
Figure 2.4 Noise tolerance
Figure 2.5 The oPOSSUM result report for the identification of over-represented TFBSs in sets of co-expressed genes
Figure 3.1 Determination of one-to-one orthologs for human and mouse genes
Figure 3.2 Identification of transcription start regions (TSRs) using a combination of Ensembl annotations and CAGE data
Figure 3.3 oPOSSUM output
Figure 4.1 Clustering of genes based on temporal expression profiles
Figure 4.2 Fermentation stress response
Figure 4.3 Functional annotations of genes in the FSR using biological process GO Slim Terms from SGD
Figure 4.4 Heat map of the association of GO terms to clusters
Figure 4.5 Growth of Vin14 in Riesling grape juice containing either 191 mg/L N, 300 mg/L N, 600 mg/L N or 900 mg/L N
Figure 5.1 Genome-wide distribution of regulatory resolution scores
Figure 5.2 Genes with (A) the highest, (B) average and (C) the lowest regulatory resolution scores
Figure 5.3 Application of regulatory resolution scores to manually curated promoters
Figure 6.1 HCNE arrays are centrally positioned in large synteny blocks
Figure 6.2 Genes associated with HCNE arrays tend to be in large fly-mosquito synteny blocks
Figure 6.3 Associations between core promoter types and gene functions
Figure 6.4 HCNE-clusters spanning co-regulated genes and boundary agreement among synteny blocks, HCNE clusters and Polycomb binding regions
Figure 6.5 The E32 enhancer trap insertion at the Dmel decapentaplegic (dpp) locus

List of Abbreviations and Acronyms

AIDS: Acquired Immunodeficiency Syndrome
CAGE: Cap Analysis of Gene Expression
CAT: Chloramphenicol Acetyltransferase
ChIP: Chromatin Immunoprecipitation
CRM: Cis-regulatory Module
CSA: Combination Site Analysis
CSRE: Carbon Source Responsive Element
DPE: Downstream Promoter Element
EM: Expectation Maximization
EMSA: Electrophoretic Mobility Shift Assay
ENCODE: Encyclopedia of DNA Elements
GFP: Green Fluorescent Protein
GO: Gene Ontology
GRB: Genomic Regulatory Block
HCNE: Highly Conserved Noncoding Element
IBSD: Inter-binding Site Distance
Inr: Initiator
MCS: Multi-species Conserved Sequences
Pc: Polycomb
PCR: Polymerase Chain Reaction
PFM: Position Frequency Matrix
Phylo-HMM: Phylogenetic Hidden Markov Model
PWM: Position Weight Matrix
RT-PCR: Reverse Transcription Polymerase Chain Reaction
RP: Regulatory Potential
SSA: Single Site Analysis
SNP: Single Nucleotide Polymorphism
STRE: Stress Response Element
TAF: TATA-binding Associated Factor
TBP: TATA Binding Protein
TF: Transcription Factor
TFBS: Transcription Factor Binding Site
TSS: Transcription Start Site
UCSC: University of California, Santa Cruz

Acknowledgements

I would like to express my sincere gratitude to everyone who has been a part of this journey. It has been a pleasure to work with many inspiring and talented individuals. First and foremost, I would like to thank my thesis advisor, Wyeth Wasserman, for his scientific and personal guidance, numerous opportunities to expand my knowledge and experience as a researcher, and for having faith in my abilities. Your patience and pleasant disposition have made my studies a unique and rewarding experience.
Thank you to past and present members of the Wasserman group who collaborated on the work, and with whom I’ve shared many interesting and stimulating discussions, including Danielle Kemmer, David Arenillas, Jochen Brumm, Debra Fulton, Andrew Kwon and Elodie Portales-Casamar. Thank you also to everyone at the CMMT; the friendly atmosphere has made my graduate studies a pleasant and memorable experience.

To my collaborators, Boris Lenhard, Albin Sandelin, and Pär Engström, formerly at the Karolinska Institute; James Mortimer and Brian Kennedy at Merck-Frosst; Jennifer Bryan in the UBC Statistics department; and Hennie van Vuuren at the UBC Wine Research Centre, thank you for productive and positive collaborations. I have thoroughly enjoyed working with you and have learned a great deal in the process. I would also like to thank the systems support personnel, Jonathan Lim, Miroslav Hatas, and Jonathan Falkowski, as well as Dora Pak for helping me with administrative issues.

I am particularly grateful to the members of my graduate supervisory committee for years of mentorship and support. They include Fiona Brinkman, the first person to encourage me to pursue graduate work in bioinformatics; Elizabeth Conibear, for taking time to provide friendly guidance and advice; and Francis Ouellette, who has been instrumental in introducing me to the bioinformatics research community. Thank you all for believing in me.

I would also like to acknowledge the Natural Sciences and Engineering Research Council, the Michael Smith Foundation for Health Research, the Child and Family Research Institute, and the CIHR/MSFHR Strategic Training Program in Bioinformatics for salary, research and travel funding.

I have been blessed with many wonderful friends who have helped me weather the challenges and celebrate the triumphs of graduate school. Though it would be too much to acknowledge you all individually, I am truly grateful for your friendship.
Finally, I would like to thank my parents, Loraine and Keith, and my sister, Tracey, for their unwavering love and encouragement; and Robin, for more than words can express.

Co-authorship Statement

The work presented in this Ph.D. thesis was in part due to collaboration with other researchers. Those involved in the research are listed in the publication citations in each of the chapters. Their individual contributions to the research are outlined below.

Chapter 2: I prototyped the initial oPOSSUM system, developed the method, devised the scoring measures, and validated the resource on reference gene sets and microarray data. Christopher Walsh contributed a small piece of code to the prototype, while David Arenillas converted the prototype to production level software. James Mortimer and Brian Kennedy provided the NF-κB microarray data. Jochen Brumm provided statistical advice. I wrote the published manuscript.

Chapter 3: I identified and designed the research program for integrating and improving the oPOSSUM system. I designed and implemented new procedures for dealing with multiple orthologous human/mouse promoters, was fully responsible for design and implementation of the yeast-specific resource, upgraded the human-specific single site analysis web service in collaboration with David Arenillas, and contributed code and data for the worm-specific resource, which was also partially built by Andrew Kwon. I performed all validations and testing of the integrated system. Debra Fulton modified the combination site analysis web service to accommodate alternative promoters. I wrote the published manuscript.

Chapter 4: Virginia Marks performed the microarray experiments and some preliminary data analysis. Jochen Brumm normalized and processed the microarray data. Jenny Bryan performed the seeded clustering and provided input for the GO term analysis. Danie Erasmus performed the ethanol/nitrogen experiment.
I was responsible for regulatory and functional data analysis of the gene clusters and identified the transcriptional networks that are activated during fermentation. I wrote the published manuscript with input from Hennie van Vuuren and Wyeth Wasserman.

Chapter 5: Wyeth Wasserman and I identified the research program. Elodie Portales-Casamar provided the curated scores produced by the Pleiades Promoter Project Curation Team. I designed and performed the research and data analyses. I wrote the manuscript.

Chapter 6: I was responsible for the initial design and implementation of methods to identify highly conserved noncoding elements (HCNEs) in insects. I contributed pieces of code for the analysis of alignments and synteny blocks. I characterized the functional properties and genomic distribution of HCNEs, and analyzed HCNE arrays for their association with polycomb binding domains. Pär Engström designed and implemented methods to correlate properties of syntenic blocks with arrays of HCNEs, analyzed the core promoter architecture of HCNE-associated genes, and wrote the published manuscript.

Chapter 1: Introduction

1.1 Background and significance

Gene expression in eukaryotes is a highly regulated process. Diverse mechanisms have evolved to ensure that genes are expressed at the right time, in the appropriate tissues and under specific conditions. This is critically important for processes such as development that require precise spatial and temporal expression of genes for the differentiation of cell types, body patterning and organ development. In addition, rapid modulation of gene expression levels allows cells to respond dynamically to changing environments.
A growing body of literature provides evidence that gene regulation plays key roles in many human diseases: the transcription factors TCF7L2 and PTF1A, as well as microsatellite polymorphisms upstream of the insulin gene, have been linked to diabetes (1-3), NRF2 has been associated with asthma (4), PPARγ has been linked with obesity (5), and a number of noncoding single nucleotide polymorphisms (SNPs) of the CCR5 cis-regulatory region have been shown to modulate the transmission and progression of acquired immunodeficiency syndrome (AIDS) (6-8). With the abundance of reports linking regulatory polymorphisms to genetic diseases and susceptibility, combined with a wide variety of model organism studies demonstrating dramatic phenotypic consequences arising from mutations in transcription factor genes (9;10), it is tempting to speculate that disruptions in gene regulatory programs may be amongst the most important contributors to diseases.

Much of our understanding of the fundamental properties of living cells (how they grow and divide, how they regulate the expression of their genetic information, and how they use and store energy) has been made possible through studies of simple and tractable model organisms. In fact, the first major breakthrough in characterizing the regulatory dynamics of a system was achieved by studying lactose catabolism in the bacterium Escherichia coli (11). The utility of model organisms is reflected in the considerable number of genes conserved in evolution from yeast to man: 46% of Saccharomyces cerevisiae, 43% of Caenorhabditis elegans, 61% of Drosophila melanogaster, and up to 99% of mouse genes have a human homolog (12;13). Studies of complex eukaryotes like yeasts, nematodes, flies, fishes, mice and humans, each a representative of the diversity of life, have been instrumental in improving our understanding of how genes are regulated.
Yeast research, in particular, has paved the way via technological innovations such as two-hybrid analysis, high-throughput protein purification and localization, gene expression profiling, protein arrays, and genome-wide chromatin immunoprecipitation (ChIP). This work, in turn, has provided insights into how DNA binding proteins, DNA sequence elements, components of the transcriptional machinery, chromatin structure, and signaling pathways combine in the circuitry of gene regulation. Studies in fruit flies and worms have made particularly important contributions to the understanding of developmental processes and morphology in metazoans. Likewise, homology between human and mouse genes allows genetics researchers to identify disruptions in certain genes in the mouse that result in phenotypes that closely resemble human diseases. Thus, the availability of the genome sequences for these model organisms, as well as for other species, provides broad opportunities to begin the complex task of unraveling the regulatory controls of gene activity.

1.2 Structure of eukaryotic regulatory regions

In a simplified model of eukaryotic transcriptional regulation, the basal transcription factors bind to the transcription start site (TSS) to recruit and stabilize RNA Polymerase II at the promoter (Figure 1.1). Precise control over the initiation of transcription is further aided by binding of sequence-specific transcription factors to their cognate transcription factor binding sites (TFBSs) within enhancer sequences, which can occur immediately upstream of the gene’s TSS, within introns, or at distal locations of up to hundreds of kilobases upstream or downstream from the gene. Adding to the complexity is that transcription factors appear to operate cooperatively in modules, with enhancers typically containing binding sites for multiple proteins that work together to regulate gene expression.
At a higher level, transcriptional regulation is tightly linked to chromatin structure. DNA is packaged into nucleosomes, which limits the accessibility of nucleosomal DNA to transcription factors. Proteins involved in histone acetylation and nucleosome remodeling alter the nucleosome structure, allowing transcription factors to gain access and bind to the DNA. The following sections discuss the complexities of eukaryotic gene regulation in more detail.

1.2.1 The core promoter

The core promoter is responsible for guiding RNA polymerase II to the correct TSS of a gene. It includes canonical sequence features that can extend ~35 bp upstream and/or downstream of the transcription initiation site, and that are sufficient for recognition by the basal transcription machinery, the minimal complement of factors that are essential to reconstitute accurate transcription in vitro from an isolated core promoter. The majority of core promoter elements identified thus far serve as recognition sites for subunits of the basal transcription factor TFIID, which contains the TATA-binding protein (TBP) and several TBP-associated factors (TAFs) (14).

Figure 1.1 Regulation of transcription. An increasingly complex model of transcriptional regulation has emerged in which genes are regulated by the interplay of DNA-binding transcription factors and co-activators, which in turn are subject to regulation by chromatin. Courtesy of (15) with kind permission from Springer Science and Business Media.

The best known of the core promoter elements is the A/T-rich TATA-box, located 25-30 bp upstream of the TSS of metazoan genes. A stable preinitiation complex can form in vitro on TATA-dependent core promoters following association with the TATA-binding protein.
Although the majority of early studies of core promoters revealed the presence of a TATA-box, the percentage of TATA-driven transcription is less than previously believed; an estimated 36% of Drosophila genes and 32% of human genes contain this element (16;17). Analysis of human and mouse promoters indicates that TATA-box-containing promoters are highly conserved across species and are commonly associated with tissue-specific genes (18).

The initiator (Inr) refers to a conserved sequence element that encompasses the TSS and is composed, in part, of a purine at the TSS (+1) and a pyrimidine at the -1 position (14). It is functionally similar to the TATA-box, and can direct accurate transcription independently or in conjunction with other core promoter elements, including the TATA-box and the downstream promoter element (DPE). The DPE is located ~30 bp downstream of the start site and is recognized by two distinct TAFs. In Drosophila, 36% of predicted Inr elements are associated with a TATA-box, 12% are associated with a DPE, and a further 6% are predicted to function independently (17). Although it is unclear how many mammalian core promoters rely on the Inr element, the presence of CG, TG or CA dinucleotides at the TSS is associated with greater transcriptional activity (18).

A computational approach to identify over-represented motifs within a set of ~2000 Drosophila core promoters led to the discovery of additional sequence elements that may be important for initiation of transcription (19). Of note is the motif 10 element (MTE), which was later shown to promote transcription by RNA polymerase II when located precisely at positions +18 to +27 (20). MTE requires the Inr. It functions independently of the TATA-box and DPE, but can also work synergistically with the TATA-box or the DPE, increasing transcription when added to a heterologous core promoter. Also of note is the combination of motif 1 and motif 6 (M1/M6).
M1 tends to occur close to the TSS, and there is a marked spatial preference for M6 to be located 20 bp upstream of M1. In addition, there exists a bias against co-occurrence of M1 with the Inr and M6 with the TATA-box, suggesting that the M1/M6 combination may serve as an alternative to the TATA/Inr module (17).

The variability of core promoter architecture suggests that core promoters are active contributors to combinatorial gene regulation. Several hypotheses have been proposed to account for the observed diversity, including functional equivalence of core promoter modules, recognition of the core promoters by different proteins to regulate subsets of genes, or selective communication of core promoter elements with distal enhancer regions and specific transcription factors (reviewed by (21)).

1.2.2 Complexities in promoter architecture

Recent data revealed that the majority of mammalian promoters do not initiate transcription at a single, well-defined site but rather at multiple alternative initiation sites spread across a region (18). For clarity, this is conceptually different from alternative promoters described below, in which core promoters are separated by clear genomic space. Promoters with diffuse regions of initiation are strongly associated with CpG islands¹, and in 90% of cases, transcription is initiated independently of a TATA-box. In contrast, TATA-boxes are significantly over-represented in promoters showing sharp TSSs, separating mammalian promoters into two classes. Genes that are ubiquitously expressed tend to be regulated by promoters with CpG islands and broad TSSs, while TATA-box promoters are associated with tissue-specific genes and tend to be highly conserved across species (18). The methylation status of CpG islands has been correlated with transcriptional activity, with the promoters of inactive genes being methylated to suppress their expression, while lack of methylation in these regions promotes accessibility to regulatory proteins (23).

¹ CpG islands are long stretches of DNA with a high GC content and where CpG dinucleotides occur at significantly higher levels than is typical for the genome as a whole (22). CpG dinucleotides generally occur infrequently in vertebrate DNA unless there is selective pressure to retain them, or a region is preferentially not methylated. This is because methylation of cytosine in CG dinucleotides often results in conversion to TG dinucleotides by spontaneous deamination.

Alternative promoters are an important mechanism by which diversity and flexibility in gene expression patterns can be generated. Transcripts that differ in their 5’ ends can vary in turnover rates and translation efficiencies, and can encode protein products with different amino-terminal regions, depending on tissue type and developmental stage. Estimates of alternative promoter usage based on CAGE tags suggest that 58% of protein-coding transcriptional units have multiple alternative promoters (18). Such diversity in the location of transcription initiation sites poses significant challenges to delineating regulatory sequences.

1.2.3 Enhancers

Enhancers are key regulatory elements in higher eukaryotes that activate gene expression from distal locations, in some instances up to a megabase from their target genes (24). This long-range activation is independent of enhancer orientation on DNA as well as their position upstream or downstream of the promoter (25). Although the term “enhancers” emphasizes their interactions with transcriptional activators, enhancer elements are also found to interact with transcriptional repressors, resulting in repression of their target genes. A gene can be regulated by multiple enhancers, each acting as a regulatory module that executes a function that is a subfraction of the overall combined regulatory function (26).
The activities of individual enhancers are often restricted to specific cell types, specific stages of development and/or environmental conditions. Thus, enhancers integrate environmental and developmental information to regulate the expression of an individual gene in a biologically appropriate manner.

The current model of enhancer action is based primarily on the concept of “recruitment”, in which transcriptional activators bind to distal locations, recruit chromatin modifying complexes to the gene, and/or bring the transcriptional machinery to the transcription initiation site (27). Several models have been proposed to explain enhancer action over long genomic distances, including (i) DNA looping to bring enhancer-bound activators into close proximity to proteins at the promoter, (ii) tracking along the DNA from the enhancer, which acts as a loading platform for transcriptional activators, and (iii) spreading-looping, whereby activators induce cooperative binding of proteins to several closely spaced sites in the intervening region, resulting in a series of small loops between the enhancer and promoter (reviewed in (28)).

The binding of activators and repressors to enhancers is mediated by specific contacts with certain combinations of TFBSs. In metazoan genomes, the number of sites ranges from four to eight, with the average value near five (26). TFBSs are usually short sequence motifs (5-15 bp) and are frequently degenerate. The degeneracy, though beneficial in that it allows for modulation of the levels of transcription based on the strength of binding, makes the prediction of TFBSs extremely challenging.

1.2.4 Chromatin structure and its effects on gene regulation

All of the above-mentioned interactions necessary for regulating transcription take place within the context of chromatin, wherein dynamic structural characteristics play an important role in gene regulation.
The basic repeating structural unit of chromatin is the nucleosome, composed of ~146 bp of DNA wrapped around an octamer of histone proteins comprising two copies each of histones H2A, H2B, H3 and H4. Packaging of promoter DNA in nucleosomes inhibits transcription (29;30). Changes in nucleosome structure accompany transcriptional activation, as shown by the conversion of promoter DNA from a nuclease-resistant to a nuclease-accessible (“hypersensitive”) state upon transcriptional activation. A genome-wide study of nucleosome occupancy showed that nucleosomes are depleted from active regulatory elements throughout the yeast genome in vivo, and that the level of nucleosome occupancy is inversely proportional to the transcription initiation rate at the promoter (31).

Changes in chromatin structure are affected by methylation of DNA (associated with gene repression), modifications of specific lysine residues in histone proteins, as well as preferential use of particular histone variants. Epigenetic studies have shown that specific post-translational modifications to histone tails influence the structure of the nucleosome and its interactions with DNA and regulatory proteins, leading to the “histone code” hypothesis (32). Recent genome-wide studies of chromatin structure show that active promoters are associated with H3-lysine-9 (H3K9), H3K14 and H4 acetylation, and transcriptional activity appears to increase with an increasing gradient of H3K4 mono-, di- and trimethylation, with H3K4me3 marking active promoters exclusively. In contrast, trimethylation of H3K27 localizes to the promoters of repressed genes and plays an important role in long-term gene silencing. Repressive methylation is mediated via interactions with the Polycomb group (PcG) proteins, comprising Polycomb repressive complex 1 (PRC1) and the ESC-E(Z) complex.
Evidence shows that the latter complex specifically methylates H3K27, which in turn, facilitates binding of PRC1 to maintain the silent state of target genes (33).

Chromatin modifications are also associated with restricting the effects of long-range enhancers to specific promoters via the action of sequences called insulators. Insulators can block the effects of enhancers in a position-specific manner or form chromatin boundaries through repressive chromatin modifications (34).

1.2.5 Current strategies for improving our understanding of gene regulation

The layers of complexity inherent in regulating the transcription of genes pose significant challenges to researchers hoping to gain a better understanding of this process. To provide an infrastructure to study the functional landscape encoded in the human genome, the Encyclopedia of DNA Elements (ENCODE) Project was initiated in 2003 by an international consortium of researchers (35). The pilot phase of the project focused on 30 megabases (1%) of the human genome, and has not only demonstrated that the complexity is far greater than previously believed, but has also driven major experimental and computational technological advances for detecting and characterizing genomic features (36). Classical experimental procedures and high-throughput approaches for studying gene regulation will be the focus of the following section.

1.3 Laboratory-based methods for regulatory region identification

Numerous laboratory procedures exist for identifying potential regulatory sequences. Methods for characterizing and implicating a sequence element in regulating expression typically involve searching for evidence of (i) activation or repression of a reporter gene and/or (ii) binding of a transcription factor to the DNA sequence of interest.
High-throughput methods that demarcate the 5’ ends of genes also contribute to an improved understanding of gene regulation because the regions surrounding TSSs are known to contain gene regulatory elements (18). It is important to note that the results from these methods depict regulation at a static point in time, and that the overall picture of gene regulation may look quite different depending on the cell type and developmental stage in which the experiment is performed.

1.3.1 Reporter constructs

Reporter genes are used to assay for the ability of a particular sequence or promoter to drive expression via the action of a trans-acting protein. To act as a reporter, the gene must induce a measurable phenotype. Commonly used reporters include green fluorescent protein (GFP), which glows under UV light, and luciferase, an enzyme that catalyzes a reaction with luciferin to produce light. Other frequently used reporters include the lacZ gene, whose protein product cleaves X-gal to produce a dark blue precipitate, and chloramphenicol acetyltransferase (CAT), which confers resistance to the antibiotic chloramphenicol. Creation of a reporter construct involves placing the reporter gene under the control of the putative regulatory region, i.e. the promoter or predicted enhancer, and then quantitatively measuring the activity of the gene product. Once the ability of a region to modulate expression has been confirmed, mutagenesis studies using reporter constructs can further delineate sequence elements required for regulation.

As mentioned previously, multiple cis-regulatory modules may direct subsets of the overall pattern of expression for a gene. To test the combined action of multiple enhancers, DNA minipromoters containing selected combinations of cis-regulatory modules can be constructed.
1.3.2 DNA binding assays

DNA binding can be detected using a binding assay, such as an electrophoretic mobility shift assay (EMSA) or chromatin immunoprecipitation experiments (ChIP). EMSA detects protein-DNA complexes that exhibit impeded movement compared to unbound DNA when subjected to gel electrophoresis. An antibody that recognizes a transcription factor of interest can be used to form even larger complexes that are “supershifted”, facilitating unambiguous identification of the bound protein.

ChIP assays are designed to detect binding of proteins to endogenous chromatin, thus identifying protein-DNA interactions that occur in vivo. In this method, the transcription factors are cross-linked to the chromatin to which they are bound by formaldehyde treatment. After the cells have been lysed and the DNA sheared or enzymatically digested, an antibody specific for the protein in question is used to immunoprecipitate the protein-DNA complexes. The DNA from the isolated protein-DNA fraction is purified and amplified by the Polymerase Chain Reaction (PCR) to identify the sequence that is bound by the transcription factor.

The conventional ChIP method has been extended to a high-throughput assay by combining ChIP with DNA microarrays (37;38). Referred to as ChIP-on-chip or ChIP-chip, the method allows for efficient identification of protein-DNA interactions on a genome-wide scale. After immunoprecipitation, purification and amplification of the bound sequences, the DNA is fluorescently labeled and hybridized to DNA microarrays that contain probes covering the genomic regions of interest. A control DNA sample labeled with a different fluorescent dye is also hybridized to the array, with the result that DNA array elements corresponding to the genomic binding sites for the protein are identified as those that display significantly stronger fluorescence signal in the ChIP DNA channel than the control.
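The comparison of the two channels can be sketched as a per-probe log ratio. This is a minimal illustration only, not the analysis used in the cited studies: the function name `enriched_probes`, the example intensities, and the fixed two-fold cutoff are all assumptions made here, and published analyses rely on normalization and statistical error models rather than a simple threshold.

```python
import math

def enriched_probes(chip, control, min_log2_ratio=1.0):
    """Return probe IDs whose ChIP signal exceeds the control by a log2 cutoff.

    chip, control: dicts mapping probe ID -> background-corrected intensity (> 0).
    min_log2_ratio: hypothetical cutoff; 1.0 corresponds to two-fold enrichment.
    """
    hits = []
    for probe_id in sorted(chip):
        log_ratio = math.log2(chip[probe_id] / control[probe_id])
        if log_ratio >= min_log2_ratio:
            hits.append(probe_id)
    return hits

# Made-up intensities for three array probes (illustrative values only)
chip = {"probe_a": 1200.0, "probe_b": 310.0, "probe_c": 900.0}
control = {"probe_a": 300.0, "probe_b": 290.0, "probe_c": 850.0}
print(enriched_probes(chip, control))
```

Only `probe_a` shows a four-fold stronger signal in the ChIP channel, so it is the only probe reported under this cutoff.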
1.3.3 In vitro selection

SELEX (Systematic Evolution of Ligands by Exponential Enrichment) is an in vitro system that identifies double-stranded oligonucleotides that bind the protein of interest from a random pool of sequences (39). In this case, after incubation of the transcription factor with a randomly generated pool of oligonucleotides, formed complexes are purified by immunoprecipitation, and the bound DNA is amplified by PCR. This DNA is then used in further rounds of binding, immunoprecipitation, and amplification, until specific binding is detectable. The resulting high affinity sequences are often referred to as aptamers. While these sequences reflect the strongest binding sites or consensus, the method can be overly specific, missing important weaker sites that may contribute to modulating expression.

1.3.4 Transcript profiling to locate proximal promoters

Advances in sequencing and expression profiling techniques have led to the development of high-throughput technologies for efficient measurement of transcript abundance on a genome-wide scale. A number of these technologies, such as cap analysis of gene expression (CAGE) (40) and 5’-end serial analysis of gene expression (5’ SAGE) (41;42), are designed to identify sequence tags corresponding to the 5’ ends of mRNA transcripts at the cap sites, and facilitate the delineation of proximal promoter regions and potential regulatory sequences. In both methods, linkers are attached to the 5’ ends of full-length enriched cDNAs to introduce a recognition site for the class II restriction endonuclease MmeI, which cleaves the first 20 bp of the cDNA. After amplification, the sequence tags are concatenated for high-throughput sequencing. Unlike conventional microarrays, which profile transcript expression without distinguishing between alternate 5’ ends, the above-mentioned methods provide base-pair resolution of the expression profiles of TSSs.
More recently, high density tiling arrays have been used to profile transcribed regions (43;44). This technology, combined with the rapid amplification of cDNA ends (RACE), can determine the start and termination positions of coding and noncoding RNAs (43). Expressed sequence tags (ESTs) and full-length cDNA sequences can also be used to characterize the 5’ ends of transcripts and identify putative proximal promoter regions (45); however, these methods are less reliable than direct sequencing of 5’ cap-trapped cDNAs.

1.4 Computational methods for regulatory element prediction

1.4.1 Repositories of gene regulatory information

Computational biologists rely heavily on reference data collections of regulatory sequences to produce models of transcription factor binding specificity. However, much of the experimental data linking transcription factors to target sequences is dispersed in a haphazard manner among disparate sources (Table 1.1). Two widely accessed databases housing transcription factor motif models are JASPAR (46) and TRANSFAC (47). JASPAR, a repository of high quality motif models, has low coverage of human transcription factors, while TRANSFAC, a central tool for bioinformatics, is subject to commercial funding and certain restrictions. In recent years, efforts have been made to create a unified, open access resource to house literature-curated regulatory regions and TFBSs. This has led to the creation of the ORegAnno database, which additionally contains the locations of genetic variations known to alter TFBSs (58), and PAZAR, which allows individual catalogues of regulatory sequences to be maintained and disseminated separately within a larger “information mall” schema (66). The success of community-based resources is highly dependent on the community to import, annotate and maintain their regulatory sequence collections, thus enabling computational biologists to harness the power of multiple, well-curated and updated regulatory sequence data sets.

Table 1.1 Selected resources for gene regulatory information

Name | Description
ABS (48) | Database of annotated regulatory binding sites from orthologous human, mouse and rat promoters.
DBTSS (49) | Database of transcriptional start sites based on experimentally determined 5’ end sequences of full-length cDNAs. Additionally, it contains predicted TFBSs and conservation information.
Drosophila DNase I Footprint Database (50) | Systematic curation and genome annotation of 1,365 DNase I footprints for D. melanogaster.
EDGEdb (51) | C. elegans differential gene expression database contains regulatory element, transcription factor, protein-DNA and protein-protein interaction, and gene expression information.
EPD (52) | Eukaryotic promoter database is an annotated non-redundant collection of eukaryotic POL II promoters with experimentally determined TSSs. It includes both conventional, targeted studies of individual genes as well as large-scale annotation projects.
ERTargetDB (53) | Estrogen receptor target database integrates information from ChIP-chip experiments and promoter sequence conservation.
Hardison Lab | Compilation of data useful in comparative genomics and in erythroid gene regulation studies (http://www.bx.psu.edu/~ross/dataset/DatasetHome.html).
JASPAR (54) | High quality transcription factor binding profile database.
LSPD | Liver specific gene promoter database (http://rulai.cshl.edu/LSPD/).
MPromDb (55) | Mammalian promoter database is an integrated database containing experimentally supported annotation of TSSs, cis-regulatory elements, CpG islands and ChIP-chip experiments.
MTIR | Summary of published information on muscle-specific transcriptional regulation (http://www.cbil.upenn.edu/MTIR/HomePage.html).
ooTFD (56;57) | Object-oriented transcription factor database that contains literature-derived transcription factor binding information.
ORegAnno (58) | Open regulatory annotation database is an open database for the curation of known regulatory elements from scientific literature.
REDfly (59) | Regulatory element database for Drosophila seeks to include all experimentally verified fly cis-regulatory modules and TFBSs, along with their DNA sequence, their associated genes, and the expression patterns they direct.
SCPD (60) | Promoter database for the yeast Saccharomyces cerevisiae contains experimentally mapped TFBSs and transcriptional start sites, as well as binding affinity and expression data where available.
TRANSFAC (47) | Eukaryotic transcription factors, their genomic binding sites and DNA binding profiles.
TRED (61) | Transcriptional regulatory element database contains genome-wide human, mouse and rat promoter annotation, derived from a combination of automation and curation, as well as curated transcription factor binding and regulation information.
TRRD (62) | Transcription regulatory regions database contains experimentally confirmed information on structural and functional organization of transcription regulatory regions of eukaryotic genes.
VISTA Enhancer Browser (63) | Experimental data for conserved noncoding human sequences that have been tested for tissue-specific enhancer activity in transgenic mice.
Yeast Regulatory Map (64;65) | A regulatory map of the yeast genome based on motifs derived from genome-wide chromatin immunoprecipitation data for 203 transcription factors (http://fraenkel.mit.edu/Harbison/).

1.4.2 Computational identification of TFBSs

Identifying the locations of binding sites for a characterized transcription factor requires modeling of the binding affinity of the protein. Construction of a model of transcription factor binding affinity is dependent on the availability of multiple known target sequences for the transcription factor of interest. The sequences are aligned to detect the key features that are recognized by the transcription factor, i.e. which residues are conserved and which residues are variable.
In the simplest form, the binding site is modeled using a consensus sequence, which assigns a letter to represent the nucleotide position in each column. While providing more information than a single sequence, the consensus does not model the distribution of nucleotides in each position, and results in a net loss of information. Searching with the consensus leads to under-representation of true sites.

Most commonly, however, motifs are modeled using matrix-based profiles that provide quantitative descriptions of known binding sites for the transcription factor of interest (67). Useful matrices produce scores that directly correlate with the binding energies between the transcription factor and the binding site (68). The matrix-based approach assumes that the nucleotide occurrence at each position contributes independently to the binding, such that the total energy of the interaction is the sum of the energies of the individual contacts. This assumption likely does not reflect the true biological characteristics of binding sites (69;70). In the best cases, where a large amount of high quality binding information is available, higher order models can offer slight improvements (70;71).

The first step in the construction of a matrix model involves enumerating the number of occurrences of each nucleotide at each position in the multiple sequence alignment to create a position frequency matrix (PFM) (Figure 1.2). The PFM is transformed to a position weight matrix (PWM) using Equations 1 and 2, which normalize the counts based on the total number of binding site sequences, account for the genomic nucleotide distribution, and convert the matrix to a log-scale for efficient computational analysis:

$$W_{b,i} = \log_2 \frac{p(b,i)}{p(b)} \tag{1}$$

where $W_{b,i}$ is the PWM value of base b in position i, p(b,i) is the corrected probability of base b in position i, and p(b) is the background probability of base b.
The formula for the corrected probability of base b in position i is given below:

$$p(b,i) = \frac{f_{b,i} + s(b)}{N + \sum_{b' \in \{A,C,G,T\}} s(b')} \tag{2}$$

where $f_{b,i}$ is the count of base b at position i, N is the total number of sites, and s(b) is the pseudocount correction that is optionally applied to small samples of binding sites.

Figure 1.2 Construction of quantitative models for transcription factor binding specificity. An aligned set of TFBSs can be converted into a position frequency matrix (PFM) that quantitatively describes the alignment by enumerating the frequency of each nucleotide in each column. A graphical display (a “logo”) of a binding profile can be generated that displays the binding preferences in proportion to the strength of the pattern at each position. The values in a PFM are converted to a log-scale for simplified arithmetic in scoring potential sites in a representation referred to as a position weight matrix (PWM).

Binding Sites: CTATTTTTGG, CTATAATTAA, TTATTTTTAG, CTATTAATAG, CTATATTAAG, CTAATATTAG, ATATTTATAG, CTCTTTATAG, TTATTTATAG, CTACTTTTAT

Position Frequency Matrix (PFM)
Position | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
A | 1 | 0 | 9 | 1 | 2 | 3 | 6 | 1 | 9 | 1
C | 7 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0
G | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 8
T | 2 | 10 | 0 | 8 | 8 | 7 | 4 | 9 | 0 | 1

[Binding Site Logo]

Position Weight Matrix (PWM)
Position | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
A | 0.0 | -1.2 | 2.4 | 0.0 | 0.6 | 1.0 | 1.9 | 0.0 | 2.4 | 0.0
C | 1.9 | -1.2 | 0.0 | 0.0 | -1.2 | -1.2 | -1.2 | -1.2 | -1.2 | -1.2
G | -1.2 | -1.2 | -1.2 | -1.2 | -1.2 | -1.2 | -1.2 | -1.2 | 0.0 | 2.3
T | 0.6 | 2.6 | -1.2 | 2.3 | 2.3 | 2.1 | 1.4 | 2.4 | -1.2 | 0.0

Example site: C T A T A T T T A A; Score = 18.1 (89% of maximum)

To search a sequence for potential binding sites, the PWM is slid along the sequence and a quantitative score at each position is computed as the sum of the PWM values for each nucleotide in the site.

Sometimes, although the precise locations of TFBSs are unknown, there is evidence to suggest that target sites for a factor of interest reside within longer stretches of DNA.
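To make the construction and scoring steps concrete, the sketch below counts bases in a set of aligned sites, converts the counts to a PWM following Equations 1 and 2, and slides the PWM along a sequence. It is an illustration under stated assumptions rather than the implementation used in this thesis: the function names, the toy sites, the uniform background of 0.25, and the pseudocount s(b) = sqrt(N) × p(b) are choices made here, so the values it produces are not expected to match those in Figure 1.2. The final lines instead take the PWM values printed in Figure 1.2 as given and sum them for the example site.

```python
import math

BASES = "ACGT"

def build_pfm(sites):
    """Count each base at each position of equal-length aligned sites."""
    width = len(sites[0])
    return {b: [sum(site[i] == b for site in sites) for i in range(width)]
            for b in BASES}

def pfm_to_pwm(pfm, background=None):
    """Apply Equations 1 and 2. The pseudocount choice is an assumption."""
    background = background or {b: 0.25 for b in BASES}
    width = len(pfm["A"])
    n_sites = sum(pfm[b][0] for b in BASES)                           # N
    pseudo = {b: math.sqrt(n_sites) * background[b] for b in BASES}  # s(b)
    total = n_sites + sum(pseudo.values())
    return {b: [math.log2(((pfm[b][i] + pseudo[b]) / total) / background[b])
                for i in range(width)]
            for b in BASES}

def score_site(pwm, site):
    """Sum the PWM values for each nucleotide in a candidate site."""
    return sum(pwm[base][i] for i, base in enumerate(site))

def scan(pwm, sequence):
    """Slide the PWM along the sequence; return (start, score) per window."""
    width = len(pwm["A"])
    return [(start, score_site(pwm, sequence[start:start + width]))
            for start in range(len(sequence) - width + 1)]

# Toy example: four made-up aligned sites and a short sequence to scan
toy_pwm = pfm_to_pwm(build_pfm(["TATAAA", "TATATA", "TATAAG", "TTTAAA"]))
best_start, best_score = max(scan(toy_pwm, "GGTATAAAGG"), key=lambda hit: hit[1])
print(best_start, round(best_score, 2))

# PWM values copied from Figure 1.2, scored for the example site shown there
FIGURE_PWM = {
    "A": [0.0, -1.2, 2.4, 0.0, 0.6, 1.0, 1.9, 0.0, 2.4, 0.0],
    "C": [1.9, -1.2, 0.0, 0.0, -1.2, -1.2, -1.2, -1.2, -1.2, -1.2],
    "G": [-1.2, -1.2, -1.2, -1.2, -1.2, -1.2, -1.2, -1.2, 0.0, 2.3],
    "T": [0.6, 2.6, -1.2, 2.3, 2.3, 2.1, 1.4, 2.4, -1.2, 0.0],
}
print(round(score_site(FIGURE_PWM, "CTATATTTAA"), 1))  # 18.1, as in Figure 1.2
```

In the toy scan the highest-scoring window begins at offset 2, where the sequence contains TATAAA. Keeping the matrix as a dictionary of per-base lists mirrors the row layout of Figure 1.2; an array-based representation would be a natural alternative for genome-scale scans.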
This is the case for target sequences derived from ChIP-chip experiments or co-expression analysis. Matrix-based binding models for motifs can be generated from sets of related sequences using probabilistic pattern discovery algorithms, under the expectation that functional binding sites will provide sufficient signal to be detected as over-represented patterns when compared to background sequences. Probabilistic methods are fast and produce quantitative binding profiles. Examples include expectation maximization (EM) (72;73) and Gibbs sampling (74). A full presentation of pattern discovery procedures is beyond the scope of this thesis, but the interested reader can refer to several review articles (75-79).

The application of weight matrix models to predict TFBSs in metazoan genomes suffers from significant limitations in practice. Binding sites are typically short and degenerate, and are likely to occur with high frequency in large genomes. Also, DNA is packaged into chromatin in the nucleus, which prevents transcription factors from accessing and binding non-functional sites (80). Consequently, although scores produced by TFBS models show high correlation with binding affinity, and 95% of potential HNF1 binding sites predicted by a profile were indeed bound by the protein in vitro (81), the correlation between predictions and functional sites in vivo is poor (82). Thus, motif models in isolation are insufficient to discriminate functional sites, and additional information is needed to reduce false positives.

The specificity of cis-regulatory element prediction can be enhanced by incorporating combinatorial patterns of DNA motifs into search algorithms. This approach is motivated by cooperative interactions that have been observed between transcription factors bound at clustered binding sites, or cis-regulatory modules (CRMs) (26;83;84).
Rule-based models define motifs, interactions and spacing requirements for CRMs based on studies of composite response elements (26;85;86). Often the available data are insufficient to establish rules for regulatory architecture; more flexible approaches have therefore been developed to identify clusters of predicted sites enriched in a set of sequences, sometimes incorporating conservation analysis to further improve their specificity (87-92).

1.5 Conservation analysis for the identification of regulatory regions

Phylogenetic footprinting of orthologous sequences from organisms at appropriate evolutionary distances has emerged as a powerful tool for enhancing the specificity of TFBS predictions (93). The underlying expectation is that selective pressure causes regulatory elements to evolve at a slower rate than the non-functional surrounding sequences. Therefore, highly constrained noncoding regions among a set of orthologous sequences are considered excellent candidates as regulatory elements. Phylogenetic footprinting of human and mouse sequences increases specificity by eliminating ~80% of noncoding sequence (94); approximately 70% of functional binding sites can be detected in conserved regions of human-mouse alignments (94;95). Thus, although using conservation limits our ability to detect binding sites that have evolved in a species-specific manner or those subject to turnover (95-97), there is much to be gained from the dramatic reduction in noise.

Determining the appropriate evolutionary distance to identify functionally constrained regions can be a major challenge, as rates of sequence change vary widely across a genome. Thus, one cannot make global statements about the optimal pair of genomes appropriate for comparison. Comparisons of human and mouse sequences comprise the bulk of comparative genomic studies (93;98-100), but successful examples also exist for comparisons of human sequence with chicken (101) or pufferfish (102).
Many stringently constrained noncoding sequences act as developmental enhancers in gain-of-function assays (24;103).

1.5.1 Algorithms for multi-species sequence alignments

The rapid accumulation of metazoan genome sequences has facilitated the development of new algorithms to deal with multi-species sequence comparisons. Numerous multiple sequence alignment methods have been developed, most of which progressively construct a multiple alignment by successive applications of a pairwise alignment algorithm (104-106). Alignments produced by different multiple sequence alignment methods are consistent on a large scale, but show significant discrepancies at the nucleotide level in terms of small genomic rearrangements, sensitivity (sequence coverage) and specificity (alignment accuracy) (107).

1.5.2 Identifying and measuring evolutionary constraint

Just as important as generating alignments is identifying and measuring evolutionary constraint. A variety of approaches have been developed that take both the sequence alignment and a species distance tree as input, and return quantitative scores that assign a level of constraint to a genomic position or small window within the species of interest. For example, PhastCons uses a two-state phylogenetic hidden Markov model (phylo-HMM) to calculate the probability of being in the conserved state for every alignable base in the reference genome. The states are based on phylogenetic models representing the average rate of substitution in conserved and nonconserved sequences (108). The free parameters of the model are estimated using maximum likelihood.

Alternatively, “multi-species conserved sequences” (MCS) can be used to measure evolutionary constraint (109).
In the binomial-based MCS approach, the relative contribution of each species’ sequence is weighted by its baseline neutral substitution rate (relative to the reference genome) such that conserved sequences from more diverged species make a greater relative contribution to the conservation score than those from less diverged species. The weighting is calculated as the cumulative binomial probability of detecting the observed number of base identities in each 25-base window, given the neutral substitution rate calculated using four-fold degenerate positions. An alternative parsimony-based MCS score measures conservation within each column of the multi-sequence alignment using a phylogenetic parsimony score that reflects the minimal number of substitutions needed along the branches of an established phylogenetic tree to account for the observed bases. A p-value associated with the derived parsimony score is calculated under a continuous-time Markov model of neutral evolution, and converted to a conservation score such that higher scores reflect higher conservation.

Other established measures of constraint include GERP and “composite alignability.” GERP estimates evolutionary rates for individual alignment columns and compares them with a tree describing the neutral substitution rates relating the species under consideration (110). It identifies candidate constrained elements by annotating those regions that exhibit fewer than expected substitutions. Composite alignability is computed as the average of the pairwise alignabilities weighted by branch length to the reference genome, where alignability is computed as the fraction of bases in the reference sequence aligning to another species, regardless of whether the aligned base is a match, mismatch or gap (111). Composite alignability reflects conservation of a region between two species but does not necessarily mean that the region is subject to purifying selection.
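The window-based binomial calculation described above can be illustrated for a single pairwise alignment. The sketch below is not the published MCS implementation: it handles one non-reference species only (no cross-species weighting), and the function names, the gap handling and the parameter `neutral_identity` (the per-base probability of identity expected under neutral evolution, which would be estimated from four-fold degenerate positions) are assumptions introduced for illustration.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k, n + 1))


def window_pvalues(ref, other, neutral_identity, window=25):
    """Score each fixed-size window of a pairwise alignment.

    ref and other are equal-length aligned strings ('-' marks a gap).
    Returns (start, identities, p_value) per window, where p_value is the
    probability of seeing at least that many identities under neutrality.
    """
    if len(ref) != len(other):
        raise ValueError("aligned sequences must have equal length")
    results = []
    for start in range(len(ref) - window + 1):
        pairs = zip(ref[start:start + window], other[start:start + window])
        # Assumption: a column counts as an identity only if both bases
        # are present and equal; gapped columns count as non-identities.
        identities = sum(1 for a, b in pairs if a == b and a != "-")
        p_value = binomial_upper_tail(identities, window, neutral_identity)
        results.append((start, identities, p_value))
    return results
```

Smaller p-values indicate windows with more identities than neutral divergence would predict. With a lower `neutral_identity` (a more diverged species), the same number of identities yields a smaller p-value, which is the sense in which conservation in more diverged species contributes more strongly.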
Results from the ENCODE Pilot Project (36) demonstrate that binomial-based MCS scores select for only a very highly constrained subset of regulatory elements and miss many other regions that are under constraint. The sliding window approach used by this method is less sensitive to many of the smaller elements identified by PhastCons and GERP; however, it is also less prone to annotating isolated, short sequences that are highly similar between the reference and distantly related species and that may be alignment artifacts (107). PhastCons scores perform well at discriminating specific promoters (defined as having validated activity in 1-5 of 16 cell lines assayed), while “composite alignability” performs best for ubiquitous promoters and putative transcriptional regulatory regions derived from the ENCODE protein occupancy and chromatin modification data (111).

While many regulatory sequences are constrained through evolution, conservation alone is not sufficient to differentiate regulatory sequences from other functional regions such as coding exons and noncoding RNA genes. Regulatory Potential (RP) scores attempt to combine evolutionary constraint within alignments between human and rodent sequences with features characteristic of known regulatory regions (112;113). The method uses a fifth-order hidden Markov model (HMM) based on the frequencies of short alignment patterns in regulatory regions to distinguish regulatory DNA from neutral sites, with the idea that short, absolutely conserved segments are more representative of regulatory potential than coding sequences with variable third positions or freely evolving neutral sequences. Evaluation of MCS, PhastCons and RP scores showed that all three methods correctly identify 50%-60% of the CRMs in the well-studied HBB gene complex, with RP scores performing better than the conservation measures (114).
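To make the two-state idea behind PhastCons more concrete, the toy example below computes, for each alignment column, the posterior probability of a "conserved" hidden state using the forward-backward algorithm. It is a deliberately simplified stand-in and not PhastCons itself: each column is reduced to a binary observation (1 if all species agree, 0 otherwise) instead of being scored with phylogenetic substitution models, and the emission, transition and start probabilities are hand-picked assumptions, whereas the real method estimates its free parameters by maximum likelihood.

```python
def posterior_conserved(obs, emit, trans, start):
    """Posterior P(state = conserved) per column of a two-state HMM.

    obs   : list of 0/1 observations (1 = column identical across species)
    emit  : emit[s][o], probability that state s emits observation o
    trans : trans[r][s], probability of moving from state r to state s
    start : start[s], initial state probabilities
    State 0 is 'conserved', state 1 is 'nonconserved'.
    """
    states = (0, 1)
    fwd, scale = [], []
    for t, o in enumerate(obs):
        if t == 0:
            row = [start[s] * emit[s][o] for s in states]
        else:
            row = [emit[s][o] * sum(fwd[-1][r] * trans[r][s] for r in states)
                   for s in states]
        c = sum(row)  # scaling factor keeps values from underflowing
        fwd.append([x / c for x in row])
        scale.append(c)
    bwd = [[1.0, 1.0] for _ in obs]
    for t in range(len(obs) - 2, -1, -1):
        o = obs[t + 1]
        bwd[t] = [sum(trans[s][r] * emit[r][o] * bwd[t + 1][r] for r in states)
                  / scale[t + 1] for s in states]
    posteriors = []
    for f, b in zip(fwd, bwd):
        weights = [f[s] * b[s] for s in states]
        posteriors.append(weights[0] / (weights[0] + weights[1]))
    return posteriors


# Hand-picked illustrative parameters (assumptions, not fitted values).
EMIT = [[0.1, 0.9],   # conserved: columns are usually identical
        [0.5, 0.5]]   # nonconserved: identity no better than chance here
TRANS = [[0.9, 0.1],
         [0.1, 0.9]]
START = [0.5, 0.5]

columns = [1] * 20 + [0] * 20
scores = posterior_conserved(columns, EMIT, TRANS, START)
```

In this toy setting the posterior is high across the run of identical columns and low across the variable run, which mirrors how a per-base conservation probability is read from the model.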
1.5.3 Highly conserved noncoding elements in metazoan genomes

Comparisons of the human genome against the genomes of multiple vertebrate species have revealed an abundance of highly conserved noncoding DNA elements (HCNEs) that appear to have been “frozen” throughout vertebrate development (115;116). Though the exact number of such elements varies depending on the precise definition of similarity and the genomes used in the comparison, HCNEs are consistently found in clusters near genes encoding transcription factors and key regulators of development (103;115;116). They constitute an interesting class of genetic elements that have thus far been primarily implicated as enhancers of transcription (24;103;117-119), and to a lesser extent, as noncoding RNAs or regions that form RNA secondary structures important for post-transcriptional processing (108;109;120). The emerging model of HCNE enhancer function is that arrays of HCNEs define regions of regulatory inputs of their target genes, and that the full complement of those inputs results in the actual expression pattern of the genes (121;122).

The extreme levels of sequence conservation in HCNEs led many to speculate that these regions are necessary and crucial for proper development and viability. However, deletions of large genomic regions containing numerous HCNEs had no appreciable phenotype (123). Even more remarkable is that knockout mice lacking some of the most conserved HCNEs – “ultraconserved elements” spanning 200 bp or more that are identical in alignments of human, mouse and rat sequences (116) – were viable and fertile with no ill effects in terms of growth, longevity, pathology and metabolism, suggesting that extreme sequence constraint does not necessarily reflect crucial functions required for viability (124).
Furthermore, introduction of two separate 16 bp insertions into a HCNE within an ancient conserved enhancer, Dc2, located near the dachshund gene, did not cause any detectable modification of its in vivo activity (125). The authors suggest that HCNEs are tolerant of change, and that either purifying selection may not be as strong as previously predicted or some unknown property also constrains this HCNE sequence.

The few studies that have investigated HCNEs in other organisms show that they mirror some of the properties of vertebrate HCNEs. HCNEs in worms and insects frequently reside near developmental regulatory genes and show similar base composition to vertebrate HCNEs (108;126). The fraction of the genome within HCNEs increases with increasing genome size and biological complexity, so that while most conserved bases in vertebrates and insects apparently do not code for proteins, most in worms and yeasts do. This suggests elaborate mechanisms for gene regulation via noncoding sequences in complex eukaryotes (108).

Determination of the evolutionary forces that have given rise to HCNEs is an active area of research. Some HCNEs have clear repetitive origins, and are likely to be mobile element instances exapted into putative cis-regulatory roles (127;128). Analyses of single nucleotide polymorphism (SNP) allele frequencies within human and fly HCNEs argue that HCNEs are maintained by purifying selection, and are not simply regions of the genome with extremely low mutation rates as predicted by the mutation cold spot hypothesis (129-131).

1.6 Thesis overview and chapter objectives

The maturation of genomics technologies for sequencing and expression profiling, as well as improvements in motif discovery and alignment algorithms, creates new opportunities to gain insight into the regulatory programs of eukaryotic cells.
We hypothesize that the analysis of sequence conservation and predicted transcription factor target sequences can provide functional insights into the regulatory mechanisms active in a cell or tissue. The work traverses multiple species, both providing species-specific predictions and allowing inferences about pathways and processes that have been conserved through evolution. The research is geared towards producing both bioinformatics approaches and specific software to be used by the research community to generate testable hypotheses and motivate specific experiments for the study of transcriptional regulation.

Gene expression profiling studies identify sets of genes with similar patterns of expression in diverse biological contexts. Based on the assumption that subsets of co-expressed genes are also co-regulated, we tested the hypothesis that underlying regulatory networks could be determined from co-expression data via identification of statistically over-represented regulatory motifs. At the time, several studies had successfully linked co-expression to regulation for sets of yeast genes using this approach (reviewed in (132;133)). However, the more complex mechanisms for transcriptional regulation employed by metazoans, including long-range regulation and synergistic control via multiple transcription factors, as well as far greater proportions of noncoding sequence, pose significant challenges. Thus, when I began this research, few resources existed to adequately address the regulatory analysis of co-expressed human genes in a systematic and user-friendly way. Available methods limited their analysis to short upstream regions of related human genes (ranging from 600 bp to 1200 bp) to reduce noise (134-136), potentially missing important enhancer elements that can reside thousands of bases from the TSS.
A number of studies had shown that functional binding sites could be identified in long stretches of upstream sequence using conservation between human and mouse to filter noise (94;137;138). We endeavored to extend these methods to all human genes and to create a system for identifying over-represented, evolutionarily conserved motifs in sets of human genes. In Chapter 2 (as published in (139)), we describe oPOSSUM, an integrated platform that combines a pre-computed database of predicted, evolutionarily conserved TFBSs, derived from phylogenetic footprinting and mature motif identification tools, with statistical methods to assess enrichment of regulatory sites within sets of biologically linked genes. oPOSSUM’s user-friendly interface, robust implementation and use of two statistical frameworks for assessing over-representation make it an important contribution to this active research area.

High-throughput alignment of orthologous human and mouse sequences presented some challenges during the development of oPOSSUM. We had defined a single TSS for each human and mouse gene based on the 5’-end of the first exon of each gene, but the presence of multiple alternative promoters led to alignment failures when 5’-most exons differed between species. To address this shortcoming, we developed an approach to map the locations of alternative TSSs and identify multiple corresponding promoter regions in the alignments (Chapter 3, as published in (140)). Our goal was to improve the percentage of successfully aligned orthologs, and we hypothesized that the incorporation of alternative conserved promoters could improve the motif signal within a set of gene sequences. Further advances included developing regulatory analysis platforms for yeast and nematode sequences, and extending oPOSSUM to identify enriched CRMs via analysis of combinations of clustered binding sites (141).
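As a generic illustration of what assessing over-representation can mean, the sketch below asks whether genes carrying a predicted site for a given factor are more frequent in a co-expressed set than expected from a background gene set, using a one-sided hypergeometric tail (equivalent to a one-sided Fisher exact test). This overview does not specify oPOSSUM's statistics, so the code should be read as a common textbook formulation under stated assumptions, not as the platform's implementation; the function name, the per-gene counting scheme and the assumption that the target set is drawn from the background are all introduced here for illustration.

```python
from math import comb


def hypergeom_upper_tail(hits_in_set, set_size, hits_in_background, background_size):
    """P(X >= hits_in_set) when set_size genes are drawn without replacement
    from background_size genes, of which hits_in_background carry the motif."""
    max_hits = min(set_size, hits_in_background)
    total = comb(background_size, set_size)
    favourable = sum(
        comb(hits_in_background, k)
        * comb(background_size - hits_in_background, set_size - k)
        for k in range(hits_in_set, max_hits + 1)
    )
    return favourable / total


# Hypothetical numbers: 8 of 10 co-expressed genes carry a predicted site,
# against 50 of 1000 genes in the background.
p_value = hypergeom_upper_tail(8, 10, 50, 1000)
```

A small p-value suggests that the motif is enriched in the co-expressed set; in practice such values would be computed for every binding profile and corrected for multiple testing.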
In Chapter 4 (as published in (142)), we demonstrate the use of applied genomics technologies and bioinformatics to make predictions about regulatory pathways and to generate testable hypotheses. Despite a multitude of studies describing yeast stress responses, few have examined the complex mechanisms employed by S. cerevisiae to cope with changing nutritional status and increasing ethanol concentration during long-term alcoholic fermentation in natural grape must. The objective of this work was to use gene expression profiling, as well as functional annotations and motif over-representation, to make novel discoveries of genes and pathways that enable yeast cells to adapt and survive in such a hostile environment. As a large fraction of stress-responsive genes remain uncharacterized, we hypothesized that they are only activated under conditions not typically tested in the laboratory.

Once a group of genes is identified as having interesting expression or regulation properties in a genomics study, the next challenge is laboratory validation. Laboratory research is a bottleneck because such studies are slow and labour intensive. We observed that manual selection of candidate genes for synthetic promoter design and reporter gene assays follows several general guidelines, one of which is the evaluation of conserved regions in the vicinity of TSSs that may act as regulatory sequences. Our extensive use of phylogenetic footprinting highlighted diverse levels of conservation in the promoters of different genes, which suggested a mechanism by which we can prioritize gene candidates in an automated way. In Chapter 5 (manuscript in preparation), we explore a scoring procedure, based on features of gene conservation profiles from multi-species sequence alignments, to rank genes and their candidate regulatory regions for targeted laboratory-based gene regulation experiments.
In Chapter 6 (as published in (143)), we investigate patterns of extreme sequence conservation among insect genomes. HCNEs in vertebrates have been implicated as long-range enhancers of gene expression, particularly for developmental regulators (103;115;116;125), and we hypothesized that this might be a common feature of metazoan genomes. The objective of this work was to identify HCNEs in the insect lineage, and to investigate their genomic organization and functional roles. In particular, we wished to find evidence for the role and mechanism of HCNEs in gene regulation.

This thesis, which addresses a broad spectrum of computational genomics studies and resources, contributes to the analysis and understanding of eukaryotic gene regulation. The common thread unifying these chapters is the use of sequence analysis combined with large-scale genomics data to decipher potential regulatory mechanisms, and each chapter makes an independent contribution towards achieving this goal.

1.7 References

1. Kennedy,G.C., German,M.S. and Rutter,W.J. (1995) The minisatellite in the diabetes susceptibility locus IDDM2 regulates insulin transcription. Nat.Genet., 9, 293-298.
2. Grant,S.F., Thorleifsson,G., Reynisdottir,I., Benediktsson,R., Manolescu,A., Sainz,J., Helgason,A., Stefansson,H., Emilsson,V., Helgadottir,A. et al. (2006) Variant of transcription factor 7-like 2 (TCF7L2) gene confers risk of type 2 diabetes. Nat.Genet., 38, 320-323.
3. Sellick,G.S., Barker,K.T., Stolte-Dijkstra,I., Fleischmann,C., Coleman,R.J., Garrett,C., Gloyn,A.L., Edghill,E.L., Hattersley,A.T., Wellauer,P.K. et al. (2004) Mutations in PTF1A cause pancreatic and cerebellar agenesis. Nat.Genet., 36, 1301-1305.
4. Li,N. and Nel,A.E. (2006) Role of the Nrf2-mediated signaling pathway as a negative regulator of inflammation: implications for the impact of particulate pollutants on asthma. Antioxid.Redox.Signal., 8, 88-98.
5.
Meirhaeghe,A., Fajas,L., Helbecque,N., Cottel,D., Lebel,P., Dallongeville,J., Deeb,S., Auwerx,J. and Amouyel,P. (1998) A genetic polymorphism of the peroxisome proliferator-activated receptor gamma gene influences plasma leptin levels in obese humans. Hum.Mol.Genet., 7, 435-440.
6. Bream,J.H., Young,H.A., Rice,N., Martin,M.P., Smith,M.W., Carrington,M. and O'Brien,S.J. (1999) CCR5 promoter alleles and specific DNA binding factors. Science, 284, 223.
7. Martin,M.P., Dean,M., Smith,M.W., Winkler,C., Gerrard,B., Michael,N.L., Lee,B., Doms,R.W., Margolick,J., Buchbinder,S. et al. (1998) Genetic acceleration of AIDS progression by a promoter variant of CCR5. Science, 282, 1907-1911.
8. McDermott,D.H., Zimmerman,P.A., Guignard,F., Kleeberger,C.A., Leitman,S.F. and Murphy,P.M. (1998) CCR5 promoter polymorphism and HIV-1 disease progression. Multicenter AIDS Cohort Study (MACS). Lancet, 352, 866-870.
9. Look,A.T. (1997) Oncogenic transcription factors in the human acute leukemias. Science, 278, 1059-1064.
10. Lee,J., Ahnn,J. and Bae,S.C. (2004) Homologs of RUNX and CBF beta/PEBP2 beta in C. elegans. Oncogene, 23, 4346-4352.
11. Jacob,F. and Monod,J. (1961) Genetic regulatory mechanisms in the synthesis of proteins. J.Mol.Biol., 3, 318-356.
12. Lander,E.S., Linton,L.M., Birren,B., Nusbaum,C., Zody,M.C., Baldwin,J., Devon,K., Dewar,K., Doyle,M., FitzHugh,W. et al. (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860-921.
13. Waterston,R.H., Lindblad-Toh,K., Birney,E., Rogers,J., Abril,J.F., Agarwal,P., Agarwala,R., Ainscough,R., Alexandersson,M., An,P. et al. (2002) Initial sequencing and comparative analysis of the mouse genome. Nature, 420, 520-562.
14. Smale,S.T. and Kadonaga,J.T. (2003) The RNA polymerase II core promoter. Annu.Rev.Biochem., 72, 449-479.
15. Wasserman,W.W. and Krivan,W. (2003) In silico identification of metazoan transcriptional regulatory regions. Naturwissenschaften, 90, 156-166.
16.
Suzuki,Y., Tsunoda,T., Sese,J., Taira,H., Mizushima-Sugano,J., Hata,H., Ota,T., Isogai,T., Tanaka,T., Nakamura,Y. et al. (2001) Identification and characterization of the potential promoter regions of 1031 kinds of human genes. Genome Res., 11, 677-684.
17. Ohler,U. (2006) Identification of core promoter modules in Drosophila and their application in accurate transcription start site prediction. Nucleic Acids Res., 34, 5943-5950.
18. Carninci,P., Sandelin,A., Lenhard,B., Katayama,S., Shimokawa,K., Ponjavic,J., Semple,C.A., Taylor,M.S., Engstrom,P.G., Frith,M.C. et al. (2006) Genome-wide analysis of mammalian promoter architecture and evolution. Nat.Genet., 38, 626-635.
19. Ohler,U., Liao,G.C., Niemann,H. and Rubin,G.M. (2002) Computational analysis of core promoters in the Drosophila genome. Genome Biol., 3, RESEARCH0087.
20. Lim,C.Y., Santoso,B., Boulay,T., Dong,E., Ohler,U. and Kadonaga,J.T. (2004) The MTE, a new core promoter element for transcription by RNA polymerase II. Genes Dev., 18, 1606-1617.
21. Smale,S.T. (2001) Core promoters: active contributors to combinatorial gene regulation. Genes Dev., 15, 2503-2508.
22. Bird,A.P., Taggart,M.H., Nicholls,R.D. and Higgs,D.R. (1987) Non-methylated CpG-rich islands at the human alpha-globin locus: implications for evolution of the alpha-globin pseudogene. EMBO J., 6, 999-1004.
23. Cross,S.H. and Bird,A.P. (1995) CpG islands and genes. Curr.Opin.Genet.Dev., 5, 309-314.
24. Nobrega,M.A., Ovcharenko,I., Afzal,V. and Rubin,E.M. (2003) Scanning human gene deserts for long-range enhancers. Science, 302, 413.
25. Banerji,J., Rusconi,S. and Schaffner,W. (1981) Expression of a beta-globin gene is enhanced by remote SV40 DNA sequences. Cell, 27, 299-308.
26. Arnone,M.I. and Davidson,E.H. (1997) The hardwiring of development: organization and function of genomic regulatory systems. Development, 124, 1851-1864.
27. Ptashne,M. and Gann,A. (1997) Transcriptional activation by recruitment. Nature, 386, 569-577.
28.
Bondarenko,V.A., Liu,Y.V., Jiang,Y.I. and Studitsky,V.M. (2003) Communication over a large distance: enhancers and insulators. Biochem.Cell Biol., 81, 241-251.
29. Knezetic,J.A. and Luse,D.S. (1986) The presence of nucleosomes on a DNA template prevents initiation by RNA polymerase II in vitro. Cell, 45, 95-104.
30. Han,M. and Grunstein,M. (1988) Nucleosome loss activates yeast downstream promoters in vivo. Cell, 55, 1137-1145.
31. Lee,C.K., Shibata,Y., Rao,B., Strahl,B.D. and Lieb,J.D. (2004) Evidence for nucleosome depletion at active regulatory regions genome-wide. Nat.Genet., 36, 900-905.
32. Strahl,B.D. and Allis,C.D. (2000) The language of covalent histone modifications. Nature, 403, 41-45.
33. Cao,R., Wang,L., Wang,H., Xia,L., Erdjument-Bromage,H., Tempst,P., Jones,R.S. and Zhang,Y. (2002) Role of histone H3 lysine 27 methylation in Polycomb-group silencing. Science, 298, 1039-1043.
34. Bell,A.C., West,A.G. and Felsenfeld,G. (2001) Insulators and boundaries: versatile regulatory elements in the eukaryotic genome. Science, 291, 447-450.
35. ENCODE Project Consortium (2004) The ENCODE (ENCyclopedia Of DNA Elements) Project. Science, 306, 636-640.
36. Birney,E., Stamatoyannopoulos,J.A., Dutta,A., Guigo,R., Gingeras,T.R., Margulies,E.H., Weng,Z., Snyder,M., Dermitzakis,E.T., Thurman,R.E. et al. (2007) Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature, 447, 799-816.
37. Wu,J., Smith,L.T., Plass,C. and Huang,T.H. (2006) ChIP-chip comes of age for genome-wide functional analysis. Cancer Res., 66, 6899-6902.
38. Bulyk,M.L. (2006) DNA microarray technologies for measuring protein-DNA interactions. Curr.Opin.Biotechnol., 17, 422-430.
39. Pollock,R. and Treisman,R. (1990) A sensitive method for the determination of protein-DNA binding specificities. Nucleic Acids Res., 18, 6197-6204.
40. Shiraki,T., Kondo,S., Katayama,S., Waki,K., Kasukawa,T., Kawaji,H., Kodzius,R., Watahiki,A., Nakamura,M., Arakawa,T.
et al. (2003) Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proc.Natl.Acad.Sci.U.S.A, 100, 15776-15781.
41. Wei,C.L., Ng,P., Chiu,K.P., Wong,C.H., Ang,C.C., Lipovich,L., Liu,E.T. and Ruan,Y. (2004) 5' Long serial analysis of gene expression (LongSAGE) and 3' LongSAGE for transcriptome characterization and genome annotation. Proc.Natl.Acad.Sci.U.S.A, 101, 11701-11706.
42. Hashimoto,S., Suzuki,Y., Kasai,Y., Morohoshi,K., Yamada,T., Sese,J., Morishita,S., Sugano,S. and Matsushima,K. (2004) 5'-end SAGE for the analysis of transcriptional start sites. Nat.Biotechnol., 22, 1146-1149.
43. Kapranov,P., Drenkow,J., Cheng,J., Long,J., Helt,G., Dike,S. and Gingeras,T.R. (2005) Examples of the complex architecture of the human transcriptome revealed by RACE and high-density tiling arrays. Genome Res., 15, 987-997.
44. Cheng,J., Kapranov,P., Drenkow,J., Dike,S., Brubaker,S., Patel,S., Long,J., Stern,D., Tammana,H., Helt,G. et al. (2005) Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution. Science, 308, 1149-1154.
45. Halees,A.S., Leyfer,D. and Weng,Z. (2003) PromoSer: A large-scale mammalian promoter and transcription start site identification service. Nucleic Acids Res., 31, 3554-3559.
46. Sandelin,A., Alkema,W., Engstrom,P., Wasserman,W.W. and Lenhard,B. (2004) JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res., 32, D91-D94.
47. Wingender,E., Dietze,P., Karas,H. and Knuppel,R. (1996) TRANSFAC: a database on transcription factors and their DNA binding sites. Nucleic Acids Res., 24, 238-241.
48. Blanco,E., Farre,D., Alba,M.M., Messeguer,X. and Guigo,R. (2006) ABS: a database of Annotated regulatory Binding Sites from orthologous promoters. Nucleic Acids Res., 34, D63-D67.
49. Suzuki,Y., Yamashita,R., Nakai,K. and Sugano,S. (2002) DBTSS: DataBase of human Transcriptional Start Sites and full-length cDNAs.
Nucleic Acids Res., 30, 328-331.
50. Bergman,C.M., Carlson,J.W. and Celniker,S.E. (2005) Drosophila DNase I footprint database: a systematic genome annotation of transcription factor binding sites in the fruitfly, Drosophila melanogaster. Bioinformatics., 21, 1747-1749.
51. Barrasa,M.I., Vaglio,P., Cavasino,F., Jacotot,L. and Walhout,A.J. (2007) EDGEdb: a transcription factor-DNA interaction database for the analysis of C. elegans differential gene expression. BMC.Genomics, 8, 21.
52. Schmid,C.D., Perier,R., Praz,V. and Bucher,P. (2006) EPD in its twentieth year: towards complete promoter coverage of selected model organisms. Nucleic Acids Res., 34, D82-D85.
53. Jin,V.X., Sun,H., Pohar,T.T., Liyanarachchi,S., Palaniswamy,S.K., Huang,T.H. and Davuluri,R.V. (2005) ERTargetDB: an integral information resource of transcription regulation of estrogen receptor target genes. J.Mol.Endocrinol., 35, 225-230.
54. Vlieghe,D., Sandelin,A., De Bleser,P.J., Vleminckx,K., Wasserman,W.W., van Roy,F. and Lenhard,B. (2006) A new generation of JASPAR, the open-access repository for transcription factor binding site profiles. Nucleic Acids Res., 34, D95-D97.
55. Sun,H., Palaniswamy,S.K., Pohar,T.T., Jin,V.X., Huang,T.H. and Davuluri,R.V. (2006) MPromDb: an integrated resource for annotation and visualization of mammalian gene promoters and ChIP-chip experimental data. Nucleic Acids Res., 34, D98-103.
56. Ghosh,D. (1990) A relational database of transcription factors. Nucleic Acids Res., 18, 1749-1756.
57. Ghosh,D. (2000) Object-oriented transcription factors database (ooTFD). Nucleic Acids Res., 28, 308-310.
58. Montgomery,S.B., Griffith,O.L., Sleumer,M.C., Bergman,C.M., Bilenky,M., Pleasance,E.D., Prychyna,Y., Zhang,X. and Jones,S.J. (2006) ORegAnno: an open access database and curation system for literature-derived promoters, transcription factor binding sites and regulatory variation. Bioinformatics, 22, 637-640.
59. Gallo,S.M., Li,L., Hu,Z. and Halfon,M.S.
(2006) REDfly: a Regulatory Element Database for Drosophila. Bioinformatics, 22, 381-383.
60. Zhu,J. and Zhang,M.Q. (1999) SCPD: a promoter database of the yeast Saccharomyces cerevisiae. Bioinformatics., 15, 607-611.
61. Zhao,F., Xuan,Z., Liu,L. and Zhang,M.Q. (2005) TRED: a Transcriptional Regulatory Element Database and a platform for in silico gene regulation studies. Nucleic Acids Res., 33, D103-D107.
62. Kolchanov,N.A., Ignatieva,E.V., Ananko,E.A., Podkolodnaya,O.A., Stepanenko,I.L., Merkulova,T.I., Pozdnyakov,M.A., Podkolodny,N.L., Naumochkin,A.N. and Romashchenko,A.G. (2002) Transcription Regulatory Regions Database (TRRD): its status in 2002. Nucleic Acids Res., 30, 312-317.
63. Visel,A., Minovitsky,S., Dubchak,I. and Pennacchio,L.A. (2007) VISTA Enhancer Browser--a database of tissue-specific human enhancers. Nucleic Acids Res., 35, D88-D92.
64. Harbison,C.T., Gordon,D.B., Lee,T.I., Rinaldi,N.J., Macisaac,K.D., Danford,T.W., Hannett,N.M., Tagne,J.B., Reynolds,D.B., Yoo,J. et al. (2004) Transcriptional regulatory code of a eukaryotic genome. Nature, 431, 99-104.
65. Macisaac,K.D., Wang,T., Gordon,D.B., Gifford,D.K., Stormo,G.D. and Fraenkel,E. (2006) An improved map of conserved regulatory sites for Saccharomyces cerevisiae. BMC.Bioinformatics, 7, 113.
66. Portales-Casamar,E., Kirov,S., Lim,J., Lithwick,S., Swanson,M.I., Ticoll,A., Snoddy,J. and Wasserman,W.W. (2007) PAZAR: a framework for collection and dissemination of cis-regulatory sequence annotation. Genome Biol., 8, R207.
67. Stormo,G.D. (2000) DNA binding sites: representation and discovery. Bioinformatics., 16, 16-23.
68. Stormo,G.D. and Fields,D.S. (1998) Specificity, free energy and information content in protein-DNA interactions. Trends Biochem.Sci., 23, 109-113.
69. Man,T.K. and Stormo,G.D. (2001) Non-independence of Mnt repressor-operator interaction determined by a new quantitative multiple fluorescence relative affinity (QuMFRA) assay. Nucleic Acids Res., 29, 2471-2478.
70.
Bulyk,M.L., Johnson,P.L. and Church,G.M. (2002) Nucleotides of transcription factor binding sites exert interdependent effects on the binding affinities of transcription factors. Nucleic Acids Res., 30, 1255-1261. 71. Benos,P.V., Bulyk,M.L. and Stormo,G.D. (2002) Additivity in protein-DNA interactions: how good an approximation is it? Nucleic Acids Res., 30, 4442-4451. 72. Bailey,T.L. and Elkan,C. (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc.Int.Conf.Intell.Syst.Mol.Biol., 2, 28-36. 73. Lawrence,C.E. and Reilly,A.A. (1990) An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins, 7, 41-51. 74. Lawrence,C.E., Altschul,S.F., Boguski,M.S., Liu,J.S., Neuwald,A.F. and Wootton,J.C. (1993) Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262, 208-214.  35  75. GuhaThakurta,D. (2006) Computational identification of transcriptional regulatory elements in DNA sequence. Nucleic Acids Res., 34, 3585-3598. 76. Das,M.K. and Dai,H.K. (2007) A survey of DNA motif finding algorithms. BMC.Bioinformatics, 8 Suppl 7, S21. 77. D'haeseleer,P. (2006) How does DNA sequence motif discovery work? Nat.Biotechnol., 24, 959-961. 78. Wei,W. and Yu,X.D. (2007) Comparative analysis of regulatory motif discovery tools for transcription factor binding sites. Genomics Proteomics.Bioinformatics, 5, 131-142. 79. Tompa,M., Li,N., Bailey,T.L., Church,G.M., De Moor,B., Eskin,E., Favorov,A.V., Frith,M.C., Fu,Y., Kent,W.J. et al. (2005) Assessing computational tools for the discovery of transcription factor binding sites. Nat.Biotechnol., 23, 137-144. 80. Mellor,J. (2005) The dynamics of chromatin remodeling at promoters. Mol.Cell, 19, 147-157. 81. Tronche,F., Ringeisen,F., Blumenfeld,M., Yaniv,M. and Pontoglio,M. 
(1997) Analysis of the distribution of binding sites for a tissue-specific transcription factor in the vertebrate genome. J.Mol.Biol., 266, 231-245. 82. Wasserman,W.W. and Sandelin,A. (2004) Applied bioinformatics for the identification of regulatory elements. Nat.Rev.Genet., 5, 276-287. 83. Yuh,C.H., Bolouri,H. and Davidson,E.H. (2001) Cis-regulatory logic in the endo16 gene: switching from a specification to a differentiation mode of control. Development, 128, 617-629. 84. Pearce,D., Matsui,W., Miner,J.N. and Yamamoto,K.R. (1998) Glucocorticoid receptor transcriptional activity determined by spacing of receptor and nonreceptor DNA sites. J.Biol.Chem., 273, 30081-30085. 85. Fickett,J.W. (1996) Coordinate positioning of MEF2 and myogenin binding sites. Gene, 172, GC19-GC32. 86. Kel,A.E., Kel-Margoulis,O.V., Farnham,P.J., Bartley,S.M., Wingender,E. and Zhang,M.Q. (2001) Computer-assisted identification of cell cycle-related genes: new targets for E2F transcription factors. J.Mol.Biol., 309, 99-120. 87. Johansson,O., Alkema,W., Wasserman,W.W. and Lagergren,J. (2003) Identification of functional clusters of transcription factor binding motifs in genome sequences: the MSCAN algorithm. Bioinformatics., 19 Suppl 1, I169I176.  36  88. Wasserman,W.W. and Fickett,J.W. (1998) Identification of regulatory regions which confer muscle-specific gene expression. J.Mol.Biol., 278, 167-181. 89. Krivan,W. and Wasserman,W.W. (2001) A predictive model for regulatory sequences directing liver-specific transcription. Genome Res., 11, 1559-1566. 90. Frith,M.C., Hansen,U. and Weng,Z. (2001) Detection of cis-element clusters in higher eukaryotic DNA. Bioinformatics, 17, 878-889. 91. Berman,B.P., Nibu,Y., Pfeiffer,B.D., Tomancak,P., Celniker,S.E., Levine,M., Rubin,G.M. and Eisen,M.B. (2002) Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome. Proc.Natl.Acad.Sci.U.S.A, 99, 757-762. 92. 
Halfon,M.S., Grad,Y., Church,G.M. and Michelson,A.M. (2002) Computationbased discovery of related transcriptional regulatory modules and motifs using an experimentally validated combinatorial model. Genome Res., 12, 1019-1028. 93. Wasserman,W.W., Palumbo,M., Thompson,W., Fickett,J.W. and Lawrence,C.E. (2000) Human-mouse genome comparisons to locate regulatory sites. Nat.Genet., 26, 225-228. 94. Lenhard,B., Sandelin,A., Mendoza,L., Engstrom,P., Jareborg,N. and Wasserman,W.W. (2003) Identification of conserved regulatory elements by comparative genome analysis. J.Biol., 2, 13. 95. Dermitzakis,E.T. and Clark,A.G. (2002) Evolution of transcription factor binding sites in Mammalian gene regulatory regions: conservation and turnover. Mol.Biol.Evol., 19, 1114-1121. 96. Moses,A.M., Pollard,D.A., Nix,D.A., Iyer,V.N., Li,X.Y., Biggin,M.D. and Eisen,M.B. (2006) Large-scale turnover of functional transcription factor binding sites in Drosophila. PLoS.Comput.Biol., 2, e130. 97. Doniger,S.W. and Fay,J.C. (2007) Frequent gain and loss of functional transcription factor binding sites. PLoS.Comput.Biol., 3, e99. 98. Sauer,T., Shelest,E. and Wingender,E. (2006) Evaluating phylogenetic footprinting for human-rodent comparisons. Bioinformatics, 22, 430-437. 99. Levy,S., Hannenhalli,S. and Workman,C. (2001) Enrichment of regulatory signals in conserved non-coding genomic sequence. Bioinformatics., 17, 871-877. 100. Hardison,R.C., Oeltjen,J. and Miller,W. (1997) Long human-mouse sequence alignments reveal novel regulatory elements: a reason to sequence the mouse genome. Genome Res., 7, 959-966.  37  101. Uchikawa,M., Takemoto,T., Kamachi,Y. and Kondoh,H. (2004) Efficient identification of regulatory sequences in the chicken genome by a powerful combination of embryo electroporation and genome comparison. Mech.Dev., 121, 1145-1158. 102. Ahituv,N., Rubin,E.M. and Nobrega,M.A. (2004) Exploiting human--fish genome comparisons for deciphering gene regulation. 
Hum.Mol.Genet., 13 Spec No 2, R261-R266. 103. Woolfe,A., Goodson,M., Goode,D.K., Snell,P., McEwen,G.K., Vavouri,T., Smith,S.F., North,P., Callaway,H., Kelly,K. et al. (2005) Highly conserved noncoding sequences are associated with vertebrate development. PLoS.Biol., 3, e7. 104. Blanchette,M., Kent,W.J., Riemer,C., Elnitski,L., Smit,A.F., Roskin,K.M., Baertsch,R., Rosenbloom,K., Clawson,H., Green,E.D. et al. (2004) Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res., 14, 708-715. 105. Bray,N. and Pachter,L. (2004) MAVID: constrained ancestral alignment of multiple sequences. Genome Res., 14, 693-699. 106. Brudno,M., Do,C.B., Cooper,G.M., Kim,M.F., Davydov,E., Green,E.D., Sidow,A. and Batzoglou,S. (2003) LAGAN and Multi-LAGAN: efficient tools for largescale multiple alignment of genomic DNA. Genome Res., 13, 721-731. 107. Margulies,E.H., Cooper,G.M., Asimenos,G., Thomas,D.J., Dewey,C.N., Siepel,A., Birney,E., Keefe,D., Schwartz,A.S., Hou,M. et al. (2007) Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome. Genome Res., 17, 760-774. 108. Siepel,A., Bejerano,G., Pedersen,J.S., Hinrichs,A.S., Hou,M., Rosenbloom,K., Clawson,H., Spieth,J., Hillier,L.W., Richards,S. et al. (2005) Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res., 15, 1034-1050. 109. Margulies,E.H., Blanchette,M., Haussler,D. and Green,E.D. (2003) Identification and characterization of multi-species conserved sequences. Genome Res., 13, 2507-2518. 110. Cooper,G.M., Stone,E.A., Asimenos,G., Green,E.D., Batzoglou,S. and Sidow,A. (2005) Distribution and intensity of constraint in mammalian genomic sequence. Genome Res., 15, 901-913. 111. King,D.C., Taylor,J., Zhang,Y., Cheng,Y., Lawson,H.A., Martin,J., Chiaromonte,F., Miller,W. and Hardison,R.C. (2007) Finding cis-regulatory elements using comparative genomics: some lessons from ENCODE data. Genome Res., 17, 775-786.  38  112. 
Elnitski,L., Hardison,R.C., Li,J., Yang,S., Kolbe,D., Eswara,P., O'Connor,M.J., Schwartz,S., Miller,W. and Chiaromonte,F. (2003) Distinguishing regulatory DNA from neutral sites. Genome Res., 13, 64-72. 113. Kolbe,D., Taylor,J., Elnitski,L., Eswara,P., Li,J., Miller,W., Hardison,R. and Chiaromonte,F. (2004) Regulatory potential scores from genome-wide three-way alignments of human, mouse, and rat. Genome Res., 14, 700-707. 114. King,D.C., Taylor,J., Elnitski,L., Chiaromonte,F., Miller,W. and Hardison,R.C. (2005) Evaluation of regulatory potential and conservation scores for detecting cis-regulatory modules in aligned mammalian genome sequences. Genome Res., 15, 1051-1060. 115. Sandelin,A., Bailey,P., Bruce,S., Engstrom,P.G., Klos,J.M., Wasserman,W.W., Ericson,J. and Lenhard,B. (2004) Arrays of ultraconserved non-coding regions span the loci of key developmental genes in vertebrate genomes. BMC.Genomics, 5, 99. 116. Bejerano,G., Pheasant,M., Makunin,I., Stephen,S., Kent,W.J., Mattick,J.S. and Haussler,D. (2004) Ultraconserved elements in the human genome. Science, 304, 1321-1325. 117. Boffelli,D., McAuliffe,J., Ovcharenko,D., Lewis,K.D., Ovcharenko,I., Pachter,L. and Rubin,E.M. (2003) Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science, 299, 1391-1394. 118. Spitz,F., Gonzalez,F. and Duboule,D. (2003) A global control region defines a chromosomal regulatory landscape containing the HoxD cluster. Cell, 113, 405417. 119. Pennacchio,L.A., Ahituv,N., Moses,A.M., Prabhakar,S., Nobrega,M.A., Shoukry,M., Minovitsky,S., Dubchak,I., Holt,A., Lewis,K.D. et al. (2006) In vivo enhancer analysis of human conserved non-coding sequences. Nature, 444, 499502. 120. Bejerano,G., Haussler,D. and Blanchette,M. (2004) Into the heart of darkness: large-scale clustering of human non-coding DNA. Bioinformatics., 20 Suppl 1, I40-I48. 121. Calle-Mustienes,E., Feijoo,C.G., Manzanares,M., Tena,J.J., Rodriguez-Seguel,E., Letizia,A., Allende,M.L. 
and Gomez-Skarmeta,J.L. (2005) A functional survey of the enhancer activity of conserved non-coding sequences from vertebrate Iroquois cluster gene deserts. Genome Res., 15, 1061-1072. 122. Kimura-Yoshida,C., Kitajima,K., Oda-Ishii,I., Tian,E., Suzuki,M., Yamamoto,M., Suzuki,T., Kobayashi,M., Aizawa,S. and Matsuo,I. (2004) Characterization of the pufferfish Otx2 cis-regulators reveals evolutionarily conserved genetic mechanisms for vertebrate head specification. Development, 131, 57-71. 39  123. Nobrega,M.A., Zhu,Y., Plajzer-Frick,I., Afzal,V. and Rubin,E.M. (2004) Megabase deletions of gene deserts result in viable mice. Nature, 431, 988-993. 124. Ahituv,N., Zhu,Y., Visel,A., Holt,A., Afzal,V., Pennacchio,L.A. and Rubin,E.M. (2007) Deletion of ultraconserved elements yields viable mice. PLoS.Biol., 5, e234. 125. Poulin,F., Nobrega,M.A., Plajzer-Frick,I., Holt,A., Afzal,V., Rubin,E.M. and Pennacchio,L.A. (2005) In vivo characterization of a vertebrate ultraconserved enhancer. Genomics, 85, 774-781. 126. Vavouri,T., Walter,K., Gilks,W.R., Lehner,B. and Elgar,G. (2007) Parallel evolution of conserved non-coding elements that target a common set of developmental regulatory genes from worms to humans. Genome Biol., 8, R15. 127. Lowe,C.B., Bejerano,G. and Haussler,D. (2007) Thousands of human mobile element fragments undergo strong purifying selection near developmental genes. Proc.Natl.Acad.Sci.U.S.A, 104, 8005-8010. 128. Bejerano,G., Lowe,C.B., Ahituv,N., King,B., Siepel,A., Salama,S.R., Rubin,E.M., Kent,W.J. and Haussler,D. (2006) A distal enhancer and an ultraconserved exon are derived from a novel retroposon. Nature, 441, 87-90. 129. Casillas,S., Barbadilla,A. and Bergman,C.M. (2007) Purifying selection maintains highly conserved noncoding sequences in Drosophila. Mol.Biol.Evol., 24, 22222234. 130. Drake,J.A., Bird,C., Nemesh,J., Thomas,D.J., Newton-Cheh,C., Reymond,A., Excoffier,L., Attar,H., Antonarakis,S.E., Dermitzakis,E.T. et al. 
(2006) Conserved noncoding sequences are selectively constrained and not mutation cold spots. Nat.Genet., 38, 223-227. 131. Katzman,S., Kern,A.D., Bejerano,G., Fewell,G., Fulton,L., Wilson,R.K., Salama,S.R. and Haussler,D. (2007) Human genome ultraconserved elements are ultraselected. Science, 317, 915. 132. Li,H. and Wang,W. (2003) Dissecting the transcription networks of a cell using computational genomics. Curr.Opin.Genet.Dev., 13, 611-616. 133. Futcher,B. (2002) Transcriptional regulatory networks and the yeast cell cycle. Curr.Opin.Cell Biol., 14, 676-683. 134. Elkon,R., Linhart,C., Sharan,R., Shamir,R. and Shiloh,Y. (2003) Genome-wide in silico identification of transcriptional regulators controlling the cell cycle in human cells. Genome Res., 13, 773-780. 135. Zheng,J., Wu,J. and Sun,Z. (2003) An approach to identify over-represented ciselements in related sequences. Nucleic Acids Res., 31, 1995-2005. 40  136. Sharan,R., Ovcharenko,I., Ben Hur,A. and Karp,R.M. (2003) CREME: a framework for identifying cis-regulatory modules in human-mouse conserved segments. Bioinformatics., 19 Suppl 1, i283-i291. 137. Dieterich,C., Wang,H., Rateitschak,K., Luz,H. and Vingron,M. (2003) CORG: a database for COmparative Regulatory Genomics. Nucleic Acids Res., 31, 55-57. 138. Loots,G.G., Ovcharenko,I., Pachter,L., Dubchak,I. and Rubin,E.M. (2002) rVista for comparative sequence-based discovery of functional transcription factor binding sites. Genome Res., 12, 832-839. 139. Ho Sui,S.J., Mortimer,J.R., Arenillas,D.J., Brumm,J., Walsh,C.J., Kennedy,B.P. and Wasserman,W.W. (2005) oPOSSUM: identification of over-represented transcription factor binding sites in co-expressed genes. Nucleic Acids Res., 33, 3154-3164. 140. Ho Sui,S.J., Fulton,D.L., Arenillas,D.J., Kwon,A.T. and Wasserman,W.W. (2007) oPOSSUM: integrated tools for analysis of regulatory motif over-representation. Nucleic Acids Res., 35, W245-W252. 141. Huang, S. S., Fulton, D. L., Arenillas, D. 
J., Perco, P., Ho Sui, S. J., Mortimer, J. R., and Wasserman, W. W. (2006) Identification of over-represented combinations of transcription factor binding sites in sets of co-expressed genes. In Advances in Bioinformatics and Computational Biology. Imperial College Press, London, UK, Vol. 3, pp 247-256. 142. Marks, V.D., Ho Sui, S.J., Erasmus, D., van der Merwe, G.K., Brumm, J., Wasserman, W.W., Bryan, J. and van Vuuren, H.J.J. (2008). Dynamics of the yeast transcriptome during wine fermentation reveals a novel fermentation stress response. FEMS Yeast Res., 8, 35-52. 143. Engstrom,P.G., Ho Sui,S.J., Drivenes,O., Becker,T.S. and Lenhard,B. (2007) Genomic regulatory blocks underlie extensive microsynteny conservation in insects. Genome Res., 17, 1898-1908.  41  Chapter 2: oPOSSUM: Identification of Over-represented Transcription Factor Binding Sites in Sets of Co-expressed Genes 2 2.1 Introduction DNA microarrays profile patterns of gene expression changes on a genome-wide scale, elucidating sets of genes coordinately expressed under specific conditions. Recent improvements in bioinformatics methods for the analysis of sequences regulating transcription have made it possible to elucidate potential factors involved in key regulatory networks underlying a transcriptional response. The enumeration of such networks, by identifying genes with similar patterns of expression and shared cisregulatory motifs, is crucial to advancing our understanding of biological pathways and processes. Transcriptional regulation of gene expression is a tightly controlled process that involves the synchronized binding of trans-acting transcription factors (TFs) to numerous binding sites (TFBS) in the regions surrounding a gene’s transcription start site, as well as to enhancer regions that mediate gene activation from distal locations. 
The binding specificities of TFs to their cognate DNA binding motifs are typically modeled using position-specific scoring matrices (PSSMs) (1), which are constructed from alignments of binding site sequences that have been characterized experimentally or identified in high-throughput protein-DNA binding assays (2;3). These PSSMs are catalogued in databases such as TRANSFAC (4) and JASPAR (5).

² A version of this chapter has been published. Ho Sui, S.J., Mortimer, J.R., Arenillas, D.J., Brumm, J., Walsh, C.J., Kennedy, B.P. and Wasserman W.W. (2005) oPOSSUM: Identification of over-represented transcription factor binding sites in co-expressed genes. Nucleic Acids Research. 33(10):3154-64.

The use of PSSMs to detect individual TFBSs is well-established (6). However, application of these models typically yields a large number of false positive predictions due to the short, degenerate nature of TFBS motifs. For example, assuming equiprobable bases, a fully specified 6 bp motif has about a 1 in 4000 chance of occurring at random at any given position (4^6 = 4096), while tolerating any nucleotide at just one highly variable position raises the rate to about 1 in 1000 (4^5 = 1024). Dramatic improvements in the specificity of TFBS prediction are attained by limiting the search space to regions of conserved, noncoding DNA using a comparative genomics approach known as phylogenetic footprinting (7-10). Based on the assumption that functional DNA sequences are subject to greater selective pressure, and are therefore conserved across moderately diverged organisms, comparison of sequences from orthologous genes can highlight functional noncoding DNA, providing clues to where regulatory sequences may be located. Phylogenetic footprinting removes, on average, 80% of the sequence from consideration (11), and estimates have placed the proportion of TFBSs occurring within conserved regions when comparing human and mouse sequences at approximately 70% (11;12).
Thus, while the use of phylogenetic footprinting limits our ability to detect binding sites that have evolved in a species-specific manner, the drastic reduction in noise increases specificity, and this gain far outweighs the decrease in sensitivity.

Even with the improved performance conferred by phylogenetic footprinting, most predicted TFBSs are non-functional. By incorporating gene expression data into the analysis procedures, we can improve our capacity to discriminate functional binding sites from potential false positive matches. This paper describes a new method, oPOSSUM, which identifies statistically over-represented, conserved TFBSs in the promoters of co-expressed genes. Based on the assumption that some subset of the co-expressed genes is co-regulated by one or more common TFs, we reason that the observed number of binding sites for those TFs should be greater than would be expected by chance. oPOSSUM integrates a pre-computed database of predicted, conserved TFBSs, derived from phylogenetic footprinting and TFBS detection algorithms, with statistical methods for calculating over-representation (Figure 2.1).

[Figure 2.1: flow diagram. Regulatory analysis pipeline: 1. Automated sequence retrieval from Ensembl; 2. Phylogenetic footprinting using ORCA; 3. Detection of TFBS using JASPAR PSSMs; 4. Database of predicted, evolutionarily conserved TFBSs. Web-based interface: Z-score and Fisher exact probability, producing a ranked list of over-represented transcription factor binding sites.]

Figure 2.1 The oPOSSUM system for identifying over-represented TFBSs in sets of co-expressed genes. The system is built upon a database of conserved TFBSs for human-mouse orthologs, derived from an analysis pipeline that combines phylogenetic footprinting with TFBS identification using the JASPAR library of PSSMs.
Given a set of human or mouse genes, the pipeline (1) retrieves the genomic DNA sequence for the human and mouse genes plus 5000 bp of upstream sequence, (2) performs an alignment of the orthologous sequences and extracts noncoding DNA subsequences that are conserved above a predefined threshold, (3) searches the subsequences for matches to TFBS profiles contained in JASPAR, and (4) stores the results in the oPOSSUM database. Upon querying the web-based interface with a list of co-expressed genes, oPOSSUM retrieves the TFBS counts for each gene in the list and computes two statistics (Z-score, Fisher exact test) to measure over-representation of TFBSs in the set relative to a background composed of all genes in the oPOSSUM database.

The oPOSSUM system was validated using curated regulatory region collections for genes expressed in a tissue-specific manner, and published targets of the nuclear factor NF-κB. oPOSSUM was then applied to two published transcript profiling data sets, as well as to a new analysis of expression addressing NF-κB inhibition. The results demonstrate that oPOSSUM is able to identify the TFs expected to mediate changes in gene expression through the detection of over-represented TFBSs. Simulations using random sampling gave low false positive rates and revealed tolerance for some noise in the gene sets.

2.2 Methods

2.2.1 Automated retrieval of human-mouse orthologs

The Ensembl software system (13) provides a flexible bioinformatics framework to retrieve sequences and annotations for genes from multiple organisms. Sets of orthologous human and mouse genes are available via EnsMART, a computationally convenient interface to genome annotations. To avoid aligning paralogs (genes which have diverged due to gene duplication in a common ancestor), human genes mapping to more than one mouse gene (and vice versa) are filtered to obtain a set of one-to-one orthologous pairs.
For each human-mouse orthologous pair, repeat-masked sequences are retrieved encompassing the region from 5000 bp upstream of the annotated transcription start site (TSS) to either the 3’ end of the gene or, in the case of long genes, 50000 bp downstream of the TSS. For genes with multiple annotated TSSs, the 5’-most TSS is selected.

2.2.2 Phylogenetic footprinting

Orthologous sequences are aligned using ORCA (Arenillas and Wasserman, unpublished), a pairwise global progressive alignment algorithm similar to LAGAN (14). ORCA first identifies short segments of high similarity between orthologous genes by performing a local BLASTN alignment using the Bl2Seq algorithm (Version 2.2.5) (15), and then aligns the regions between such segments through the more time-consuming Needleman-Wunsch algorithm (NW) (16) to obtain an overall global alignment of the two sequences. The process is recursive; regions that are too long to align using NW are re-aligned with BLASTN using less stringent parameters. The process comes to a halt when either the regions are short enough to perform NW successfully (the product of the input sequence lengths does not exceed 100 MB), or the minimum BLASTN word size of 7 has been reached. The first iteration of BLASTN was performed using the following parameters: penalty for a nucleotide mismatch = -7; expectation value = 0.10; word size = 15; default values were used for the remaining parameters. For each subsequent run of BLASTN, the nucleotide mismatch score was incremented by 2 and the word size was decremented by 4. NW global alignments used a match score = 3; mismatch score = -1; gap open penalty = 20; and gap extension penalty = 0. Three dynamically selected and progressively more stringent conservation thresholds are applied.
Specifically, each alignment is scanned using a 100 bp sliding window, the percent sequence identity within each window is calculated, and the top 10%, 20%, and 30% of all windows (excluding those overlapping a coding region) are retained. Minimum identity thresholds of 70%, 65% and 60% are required for the high, medium and low conservation levels, respectively. The use of dynamically computed thresholds versus fixed sequence identity cutoffs is motivated by the variable rates of evolution for each gene in each genome.

2.2.3 Detection of TF binding sites

The conserved noncoding regions of the promoters are searched for matches to all TFBS profiles in the JASPAR database with information content greater than eight bits, using the TFBS suite of Perl modules for regulatory sequence analysis (17). Excluding profiles with low information content (a measure of the specificity of a profile) reduces spurious hits. A predicted binding site for a given TF model is reported if the site occurs in both the human and mouse sequences above a threshold PSSM score of 75%, and at equivalent positions in the alignment. Overlapping sites for the same TF are filtered such that only the highest scoring site is kept. The location, score, orientation, and local sequence conservation level of each TFBS match in the human and mouse genes are stored in the oPOSSUM database.

2.2.4 Discovery of over-represented binding sites

Two statistical measures were calculated to determine which, if any, TFBSs were over-represented in the set of promoters for co-expressed genes. These represent two distinct models for counting the occurrences of binding sites.

Z-score calculation for determining TFBSs that occur more frequently than expected. The Z-score uses a simple binomial distribution model to compare the rate of occurrence of a TFBS in the set of co-expressed genes to the expected rate estimated from the pre-computed background set.
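To make the binomial model concrete, the following is a minimal Python sketch of the Z-score, not the oPOSSUM implementation itself. The function name, argument names and the example counts are illustrative assumptions; the symbols x, n, B and N follow the definitions given in this section (binding site nucleotides and total conserved noncoding nucleotides in the co-expressed and background sets).

```python
import math


def tfbs_z_score(x: int, n: int, B: int, N: int) -> float:
    """Z-score for over-representation of one TFBS profile (illustrative sketch).

    x: binding site nucleotides observed in the co-expressed set
    n: conserved noncoding nucleotides examined in the co-expressed set
    B: binding site nucleotides in the background set
    N: conserved noncoding nucleotides examined in the background set
    """
    p = B / N                            # per-nucleotide probability of "success"
    mu = n * p                           # expected count; equals B * C with C = n / N
    sigma = math.sqrt(n * p * (1 - p))   # binomial standard deviation
    return (x - mu - 0.5) / sigma        # normal approximation with continuity correction


# Invented counts, for illustration only.
z = tfbs_z_score(x=150, n=10_000, B=50_000, N=10_000_000)
print(f"Z-score = {z:.2f}")
```

With these made-up counts the expected number of binding site nucleotides is 50, so an observed count of 150 gives a Z-score above the cutoff of 10 used later in this chapter.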
For a given TFBS, let the random variable $X$ denote the number of predicted binding site nucleotides in the conserved noncoding regions of the co-expressed gene set, and let $B$ be the number of predicted binding site nucleotides in the conserved noncoding regions of the background set. We use a binomial model with $n$ events, where $n$ is the total number of nucleotides examined (i.e. the total number of nucleotides in the conserved noncoding regions) from the co-expressed genes, and $N$ is the total number of nucleotides examined from the background gene set. The expected value of $X$ is then $\mu = BC$, where $C = n/N$ (i.e. the ratio of sample sizes). Taking $p = B/N$ as the probability of success, the standard deviation is given by $\sigma = \sqrt{np(1-p)}$. Now, let $x$ be the observed number of binding site nucleotides in the conserved noncoding regions of the co-expressed genes. By applying the Central Limit Theorem and using the normal approximation to the binomial distribution with a continuity correction, the Z-score is calculated as

$$z = \frac{x - \mu - 0.5}{\sigma}.$$

Thus the Z-score indicates a significant difference in the rate of occurrence of sites, and is particularly good for detecting increased prevalence of common sites.

One-tailed Fisher exact probability for determining TFBSs that occur in a significant number of the co-expressed genes. In contrast to the Z-score, the one-tailed Fisher exact probability compares the proportion of co-expressed genes containing a particular TFBS to the proportion of the background set that contains the site to determine the probability of a non-random association between the co-expressed gene set and the TFBS of interest. It is calculated using the hypergeometric probability distribution that describes sampling without replacement from a finite population consisting of two types of elements (18).
Therefore, the number of times a TFBS occurs in the promoter of an individual gene is disregarded, and instead, the TFBS is considered as either present or absent. A significant value for the Fisher exact probability indicates that a significant proportion of the genes contain the site, and the measure is particularly good for rare TFBSs. Fisher exact probabilities were calculated using the R Statistics package (http://www.r-project.org).

2.2.5 NF-κB microarray experiment

A list of genes differentially expressed during interruption of the NF-κB pathway by a specific NF-κB inhibitor was obtained from an unpublished microarray experiment. Appendix 1 Table S1 contains sufficient information to reproduce or challenge the in silico promoter analysis described in this text. The design of the experiment is briefly described here. Human umbilical vein endothelial cells (HUVEC) in the treatment condition were pre-treated with 10 µM of NF-κB inhibitor, followed by stimulation of the NF-κB signaling pathway with 0.1 ng/mL IL-1B one hour later (t=0 hours). A second sample of the same culture was treated with 0.1 ng/mL IL-1B only at t=0. A third sample received only vehicle treatment (0.33% dimethyl sulfoxide) at t=0. From each condition, total RNA was isolated at 6 hours using the RNeasy midi kit (Qiagen, USA). The entire paradigm was repeated three times on separate batches of HUVEC, generating nine samples. Equal amounts of RNA were pooled from the three IL-1B treated samples, as the control channel. Each sample, i.e. the three inhibitor treated, the three vehicle treated, and the pool of IL-1B treatment alone, was split in two and labeled with either the Cy3 or Cy5 fluorescent dye (Agilent, USA). Using a two-color microarray system, the labeled cDNA from treatment and control conditions was hybridized to an oligonucleotide microarray representing 23000 human genes (Agilent, USA) as follows: (i) three replicates of individual vehicle treated samples vs. the pool of IL-1B samples; (ii) three replicates of the pool of IL-1B samples vs. individual IL-1B + NF-κB inhibitor treated samples; (iii) same as (i) with fluors reversed; (iv) same as (ii) with fluors reversed. After quantification of the raw data, normalization and combination of the technical, fluor-reversed replicates using the Rosetta Resolver® (version 3.0) gene-expression-data-analysis system (19), an error-weighted ANOVA analysis was performed across replicates in the two groups. Biological replicates were then combined using the Rosetta Resolver® (version 3.0) error model. For our analysis we focused on a list of 508 sequences showing significantly decreased levels of expression in inhibitor-treated cells, defined by an ANOVA p-value ≤ 0.01, an error-model p-value ≤ 0.01 and a fold-change ≥ 1.3. The 508 sequences were mapped to 326 unique Ensembl gene IDs by identifying gene models from Ensembl V19 (build 34a) which overlapped with the probe sequences. The down-regulated genes were submitted for analysis by oPOSSUM.

2.2.6 Simulations using random sampling

To estimate the false positive rate, we tested oPOSSUM on randomly generated subsets of genes from the oPOSSUM database to determine how frequently TFBSs are identified as over-represented by chance, and to assess the validity of the selected Z-score and Fisher p-value cutoffs of 10 and 0.01, respectively. We created 100 independent sets, each containing 15 genes. These were submitted to oPOSSUM, and the number of TFBSs significantly over-represented during each trial was counted. The number of trials that generated significant TFBSs over 100 independent trials gives us a measure of the false positive rate. We repeated this process for gene lists of 50, 100 and 200 randomly selected genes to see what, if any, effect the number of genes in the list has on the false positive rate.
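The random-sampling procedure above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the procedure as run on the oPOSSUM database: it tracks a single hypothetical TFBS profile, uses only the one-tailed Fisher exact probability (computed directly from the hypergeometric distribution rather than with R), and draws from an invented background of 1000 genes, 100 of which contain the site. All function and variable names are made up for the example.

```python
import math
import random


def fisher_one_tailed(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which contain the TFBS (hypergeometric upper tail)."""
    total = math.comb(N, n)
    tail = sum(
        math.comb(K, i) * math.comb(N - K, n - i)
        for i in range(k, min(n, K) + 1)
    )
    return tail / total


def false_positive_rate(background, set_size, trials=100, p_cutoff=0.01, seed=0):
    """Fraction of random gene sets in which the TFBS is called over-represented.

    background: dict mapping gene ID -> True if the gene contains the TFBS.
    """
    rng = random.Random(seed)          # fixed seed so the sketch is repeatable
    genes = list(background)
    N = len(genes)
    K = sum(background.values())       # genes in the background with the site
    significant = 0
    for _ in range(trials):
        sample = rng.sample(genes, set_size)
        k = sum(background[g] for g in sample)
        if fisher_one_tailed(k, set_size, K, N) <= p_cutoff:
            significant += 1
    return significant / trials


# Invented background: 1000 genes, the first 100 of which contain the site.
toy_background = {f"gene{i}": i < 100 for i in range(1000)}
for size in (15, 50, 100, 200):
    rate = false_positive_rate(toy_background, size)
    print(f"set size {size}: false positive rate {rate:.2f}")
```

Because the genes are drawn at random from the background itself, any set flagged at the 0.01 cutoff is a false positive by construction, which is the logic of the simulation described in this section.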
Next we investigated the amount of noise oPOSSUM can tolerate by adding increasing numbers of randomly selected genes from the oPOSSUM database to our reference gene sets. For the muscle- and liver-specific gene sets, we added 5, 10, 15, 20, 25, 30, 40, 50, 75, and 100 randomly selected genes, and submitted them to oPOSSUM. Additional increments of 150 and 300 genes were tested for the larger set of NF-κB target genes. This process was repeated 100 times for each noise level. The average Z-scores and Fisher p-values for the Mef-2, HNF-1 and NF-κB TFBS profiles over 100 independent trials for each noise level were recorded.

2.2.7 Parameter selection for validation studies

For all of the analyses presented in this study, we examined the promoter region encompassing 5000 bp upstream and 5000 bp downstream of the TSS, used the highest conservation level to extract conserved, noncoding regions (top 10% of conserved regions with a minimum of 70% sequence identity), and required a PSSM score greater than 85% for predicted binding sites, using only the vertebrate-specific PSSMs in JASPAR.

2.3 Results

The oPOSSUM database was constructed from an initial set of 14083 orthologs from human and mouse, obtained by selecting only “one-to-one” human-mouse orthologs from Ensembl (13). Of these, 4921 (34.9%) of the ortholog sequence pairs failed to produce reasonable alignments of the promoter regions, due largely to an inability to reconcile transcription start site positions as a result of alternative promoter usage by orthologs, and to a lesser degree, as a consequence of low nucleotide sequence similarity between assigned orthologous gene pairs, genes within genes, and transcription start sites located within exons of upstream genes on the opposite strand. Attempts to align a subset of the failed promoter pairs using the LAGAN algorithm produced similar results (not shown).
An additional 456 (3.2%) ortholog pairs successfully aligned but did not contain conserved, noncoding regions (minimum of 100 bp with greater than 60% identity) in the target region spanning from 5000 bp upstream of the transcription start site (TSS) to 5000 bp downstream of the TSS. Of the remaining 8706 genes with conserved promoters, 8698 contained matches to one or more TFBS profiles (PSSM cutoff of 75%), producing 2.4 × 10^6, 3.3 × 10^6 and 4.1 × 10^6 conserved predicted binding sites at the high, medium and low conservation levels, respectively (see Methods).

2.3.1 Validation using reference gene sets

The muscle and liver regulatory region collections catalogue experimentally verified TFBSs that confer muscle- and liver-specific gene expression, respectively (20;21). We searched the literature for additional experimentally verified sites in human and mouse, adding 8 liver-specific and 5 muscle-specific promoters to these collections (available at http://www.cisreg.ca/tjkwon/). In addition to these tissue-specific genes, we compiled a list of 61 known targets of the nuclear factor NF-κB (22). We used these reference sets to assess oPOSSUM’s ability to discriminate functionally relevant TFBSs and to empirically determine appropriate thresholds for our scoring measures. oPOSSUM calculates two statistical measures for binding site over-representation, one at the gene level (Fisher exact test) and the other based on the ratio of TFBSs to nucleotides (Z-score). Figure 2.2 shows the correlation between the scores for each reference set. Clearly, scores for the majority of TFBSs cluster at the bottom right corner of the graph for all reference sets, with Z-scores ranging from -10 to 10 and Fisher p-values ranging from 0.02 to 1. For each reference set, we also ranked the top ten binding sites, ordered by Z-score, along with associated Fisher p-values (Table 2.1).
In each case, the TFs were further investigated for experimentally verified evidence in the given tissue or system.

Muscle-specific regulatory region collection. Studies of skeletal muscle expression have revealed five primary classes of TFs that contribute to skeletal muscle-specific expression: Myf (MyoD), Mef-2, SRF, TEF-1, and Sp-1 (21). Submission of the 25 genes of human, mouse or rat origin in the muscle regulatory collection resulted in 14 pairs of orthologs being analyzed. oPOSSUM ranked SRF, TEF-1, Mef-2 and Myf as the top four most significant profiles (Table 2.1A). In fact, all of these TFs had Fisher p-values less than 0.01 and, with the exception of Myf, had Z-scores greater than 10, considerably higher than for all other TFs (Figure 2.2). Sp-1 was ranked tenth but without sufficiently convincing scores to discriminate it from the remainder of the TFBSs (Figure 2.2); this is not surprising given that it is a ubiquitous activator of numerous genes in the human genome (23).

Liver-specific regulatory region collection. Based on a collection of genes expressed either exclusively in liver hepatocytes or in a small number of tissues including liver hepatocytes, previous studies have found that hepatocyte-specific gene expression can be governed by the combined action of four primary TFs: HNF-1, HNF-3, HNF-4, and c/EBP (20). (There are additional regulatory programs that are controlled independently of these factors in hepatocytes.) Using this established list of 22 genes, we were able to analyze 11 orthologous gene pairs. Predicted HNF-1 sites were the most significantly over-represented TFBSs in the promoters of genes from the liver collection using both the Z-score and Fisher measures (Table 2.1B). In fact, with a Z-score of 32.5, which is almost three times greater than the next most significant TFBS profile from JASPAR, and a Fisher p-value of 1.5e-04, HNF-1 clearly segregates from the remaining TFBS profiles in this reference set (Figure 2.2).
c/EBP ranked third, but was not sufficiently over-represented to exceed the significance cutoffs of 10 and 0.01 for the Z-score and Fisher measures, respectively.

[Figure 2.2: scatter plot of Z-score (y-axis, -20 to 40) against Fisher p-value (x-axis, 1.0E-09 to 1.0E-01) for the muscle, liver and NF-κB reference sets, with the Z-score and Fisher cutoffs marked. Labelled points: p65, SRF, c-Rel, NF-κB, HNF-1, p50, TEF-1, MEF2, FREAC-2, Myf, cEBP, SP1, HNF-3β.]

Figure 2.2 Relationship between the Fisher p-values and Z-scores for the muscle, liver and NF-κB reference sets
Based on the distribution of scores for the reference sets, a Z-score cutoff of 10 and a Fisher p-value cutoff of 0.01 were empirically selected as threshold levels to be used for testing. TFBSs that have functional relevance are labeled.

Table 2.1 Statistically over-represented TF binding sites in reference gene sets

A. Muscle-specific (25 input; 14 analyzed)
TF               Rank   Z-score   Fisher p-value
SRF *            1      35.93     3.93E-04
TEF-1 *          2      15.84     5.48E-05
MEF-2 *          3      15.26     2.77E-04
Myf *            4      8.585     9.81E-03
S8               5      8.168     1.49E-01
Yin-Yang         6      6.396     6.79E-02
RORalfa-2        7      5.697     8.70E-02
deltaEF1         8      5.514     9.76E-03
Nkx              9      5.492     1.17E-01
SP1 *            10     4.671     6.34E-02

B. Liver-specific (22 input; 11 analyzed)
TF               Rank   Z-score   Fisher p-value
HNF-1 *          1      32.46     1.51E-04
FREAC-2          2      11.46     7.68E-03
cEBP *           3      8.477     2.29E-01
FREAC-4          4      7.522     1.86E-01
c-FOS            5      6.286     2.73E-02
HLF              6      5.454     4.20E-02
Chop-cEBP        7      5.313     7.93E-02
SRY              8      5.000     1.78E-01
Tal1beta-E47S    9      4.338     6.30E-01
Hen-1            10     3.117     5.19E-01

C. Known NF-κB targets (61 input; 33 analyzed)
TF               Rank   Z-score   Fisher p-value
p65 *            1      35.60     1.18E-09
c-REL *          2      33.14     5.94E-08
p50 *            3      27.62     5.74E-07
NF-kappaB *      4      27.00     1.62E-07
SPI-B            5      13.92     1.92E-02
Irf-2            6      12.88     8.69E-02
NRF-2            7      6.468     1.22E-01
Evi-1            8      5.959     2.04E-01
Elk-1            9      4.912     2.49E-01
MZF_5-13         10     4.908     1.06E-01

TF binding sites detected by oPOSSUM with the ten most highly ranked Z-scores or with Fisher p-value < 0.01. * TFs with experimentally verified sites in the reference sets.
The number of genes used as input and the number of genes analyzed by oPOSSUM (i.e. genes that have an unambiguous mouse ortholog) are shown in brackets. See Methods for how the Z-score and Fisher p-values were calculated.

Known NF-κB target genes. The NF-κB/Rel family of TFs, which includes RELA (p65), NF-κB1 (p50, p105), NF-κB2 (p52, p100), c-REL, and RELB, plays a central role in regulating the immune response (24). oPOSSUM was applied to a set of 61 known NF-κB-regulated genes (22), which include a large number of cytokines and immunoreceptors, and to a lesser extent, antigen presentation proteins, cell adhesion molecules, acute phase proteins, stress response genes, and TFs. Of the 61 human genes submitted to oPOSSUM, 33 were mapped to mouse orthologs and subsequently analyzed. The binding sites for NF-κB, c-REL, p65, and p50, all members of the NF-κB family of TFs, ranked as the top four most over-represented TFBSs using either the Z-score or the Fisher p-value (Table 2.1C). Figure 2.2 shows that they were indeed the only TFBSs with significant scores discriminating them from other sites, with Z-scores as high as 35.6 and Fisher p-values as low as 1.2e-09. Based on the results obtained from the three reference gene sets, we decided empirically to use a Z-score cutoff of 10 and a Fisher p-value cutoff of 0.01 to identify TFBSs for each of our test sets.

2.3.2 Application to transcript profiling data

The reference collections used above are curated sets of genes. In contrast, high-throughput transcript profiling studies typically produce clusters of hundreds of co-expressed genes, of which only a small subset is likely to be co-regulated by a given factor. We assessed oPOSSUM’s performance on three sets of genes derived from transcript profiling experiments, and report the results in Table 2.2.
For each set of co-expressed genes, we list the top ten over-represented TFBSs, as determined by the Z-score, as well as any additional TFBSs with significant Fisher p-values (p<0.01).

Table 2.2 Statistically over-represented TFBSs in gene expression data sets

A. c-Myc-induced Genes (53 input; 30 analyzed)
TF            Class             Rank   Z-score   Fisher     No. Genes
Myc-Max *     bHLH-ZIP          1      32.41     1.17E-04   7
ARNT          bHLH              2      23.82     1.56E-04   12
Max           bHLH-ZIP          3      21.89     7.40E-03   9
SP1           ZN-FINGER, C2H2   4      20.90     2.04E-02   14
USF           bHLH-ZIP          5      17.01     2.39E-02   10
MZF_1-4       ZN-FINGER, C2H2   6      14.96     1.35E-01   20
Staf          ZN-FINGER, C2H2   7      11.12     7.59E-02   2
Ahr-ARNT      bHLH              8      10.87     2.05E-01   12
SAP-1         ETS               9      10.41     2.57E-03   9
n-MYC         bHLH-ZIP          10     9.821     4.71E-01   11

B. c-Fos-induced Genes (150 input; 98 analyzed)
TF            Class             Rank   Z-score   Fisher     No. Genes
c-FOS *       bZIP              1      11.01     2.94E-02   40
CREB          bZIP              2      8.728     2.45E-01   11
SP1           ZN-FINGER, C2H2   3      8.015     1.14E-02   38
E2F           Unknown †         4      3.995     1.12E-01   15
Myc-Max       bHLH-ZIP          5      3.898     3.21E-01   5
HLF           bZIP              6      3.249     1.84E-01   10
Pbx           HOMEO             7      2.878     1.38E-01   6
FREAC-2       FORKHEAD          8      1.763     6.35E-02   20
HLF           bZIP              9      1.632     3.32E-02   32
Myc-Max       bHLH-ZIP          10     1.314     5.53E-01   12

C. Genes Down-regulated by the NF-κB inhibitor (326 input; 170 analyzed)
TF            Class             Rank   Z-score   Fisher     No. Genes
p65 *         REL               1      27.73     7.78E-11   46
NF-kappaB *   REL               2      24.11     8.76E-08   49
c-REL *       REL               3      21.31     3.76E-07   58
p50 *         REL               4      15.60     9.71E-05   19
Irf-2         TRP-CLUSTER       5      13.30     1.30E-02   3
Irf-1         TRP-CLUSTER       6      12.59     1.50E-03   22
SPI-B         ETS               7      12.45     9.06E-04   117
FREAC-4       FORKHEAD          8      11.05     2.55E-04   71
SRY           HMG               9      10.52     2.81E-04   85
Pbx           HOMEO             10     9.79      8.58E-02   10
Sox-5         HMG               12     9.00      2.50E-04   72
cEBP          bZIP              13     8.26      2.63E-04   44
c-FOS         bZIP              14     7.52      2.70E-03   71
HFH-2         FORKHEAD          15     7.36      1.68E-03   46
Nkx           HOMEO             16     6.74      4.57E-03   106
HNF-3beta     FORKHEAD          28     2.77      4.70E-03   46
deltaEF1      HMG               40     1.38      1.16E-03   149

TF binding sites detected by oPOSSUM with the ten most highly ranked Z-scores or with Fisher p-value < 0.01. * TFs over-expressed or inhibited in gene expression studies.
The number of genes used as input and the number of genes analyzed by oPOSSUM (i.e. genes that have an unambiguous mouse ortholog) are shown in brackets. † Although E2F is annotated as “unknown” in the JASPAR database, it is structurally defined as a member of the “winged helix” class of proteins.

c-Myc SAGE experiment. The c-Myc transcription factor, which dimerizes with the Max protein, is a key regulator of cell proliferation, differentiation and apoptosis (25;26). Using serial analysis of gene expression (SAGE), Menssen and Hermeking identified 216 different SAGE tags corresponding to unique mRNAs that were induced after adenoviral expression of c-Myc in human umbilical vein endothelial cells (HUVEC) (26). The induction of 53 genes was confirmed using microarray analysis and RT-PCR. We analyzed the 53 genes with oPOSSUM and found that the binding sites of Myc-Max heterodimers are indeed the most significantly over-represented (Table 2.2A); Myc-Max sites were identified in seven of the genes. Matches to the binding profile for homodimers of Max, c-Myc’s interacting partner, were also highly over-represented (present in nine genes, giving a high Z-score of 21.9). The binding profile for a related protein, n-Myc, ranked amongst the top ten most over-represented profiles.

c-Fos microarray experiment. In a study examining the role of transcriptional repression in oncogenesis, Ordway et al. used microarrays to compare the gene expression profile of 208F fibroblasts transformed by c-Fos against the profile of the parental 208F rat fibroblast cell line (27). We mapped the list of 252 induced genes to 150 human orthologs, which were submitted to oPOSSUM. As expected, the c-Fos TFBS was ranked as the most over-represented TFBS in the promoters of the induced genes, with a Z-score of 11.0 and a Fisher p-value of 2.9E-02 (Table 2.2B). c-Fos sites were identified in 40 of the co-expressed genes.

NF-κB microarray experiment.
In HUVEC cells, interleukin 1B treatment precipitates an inflammatory response observable as an induction of mRNA expression. This response can be modulated by the inhibition of the NF-κB signaling pathway (28). We assessed oPOSSUM’s performance on 326 genes that showed decreased levels of expression in interleukin-1B-stimulated HUVEC cells treated with an NF-κB inhibitor as compared to IL-1B-stimulated HUVEC cells. Binding sites for the NF-κB/Rel family of TFs were the most over-represented (present in approximately 50 genes) in the inhibitor-modulated genes (Table 2.2C). Other over-represented TFBSs included those of the immune-related factors Irf-1, Irf-2 and SPI-B.

2.3.3 Specificity assessment

Based on the reference gene sets and expression data, oPOSSUM successfully identifies TFBSs that play a functional role in the regulation of sets of co-expressed genes. In the majority of cases, a Z-score greater than 10 and a Fisher p-value less than 0.01 effectively discriminated the known sites within each set of reference genes. To assess how many of the over-represented TFBSs may be expected by chance and to ascertain whether the qualitatively observed thresholds are appropriate, we tested oPOSSUM on randomly generated subsets of genes from the oPOSSUM database. In Figure 2.3 we show the percentage of trials that produced TFBS predictions for random sets of genes, providing a measure of the false positive rate. For a set of 15 genes, using the Z-score alone, 23% of the trials produced one false positive prediction, 19% produced two false positives, and so forth, for an overall false positive rate of 66%. Using only the Fisher exact test for a set of 15 genes, we obtain an overall false positive rate of 28%. Thus, when used in isolation, each of the scoring measures results in surprisingly high false positive rates (average of 63% for the Z-score and 31% for the Fisher test), which are dramatically reduced by combining the scores.
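
As a concrete illustration of how the two cutoffs interact, the short Python sketch below applies them to the muscle reference set values reported in Table 2.1A. The function name and its flags are hypothetical and serve only to show the filtering logic; this is not the oPOSSUM implementation.

```python
Z_CUTOFF = 10.0        # Z-score must exceed this value
FISHER_CUTOFF = 0.01   # Fisher p-value must fall below this value

# (profile, Z-score, Fisher p-value) as listed in Table 2.1A.
MUSCLE = [
    ("SRF", 35.93, 3.93e-04),
    ("TEF-1", 15.84, 5.48e-05),
    ("MEF-2", 15.26, 2.77e-04),
    ("Myf", 8.585, 9.81e-03),
    ("S8", 8.168, 1.49e-01),
    ("Yin-Yang", 6.396, 6.79e-02),
    ("RORalfa-2", 5.697, 8.70e-02),
    ("deltaEF1", 5.514, 9.76e-03),
    ("Nkx", 5.492, 1.17e-01),
    ("SP1", 4.671, 6.34e-02),
]


def passing(rows, use_z=True, use_fisher=True):
    """Return the profile names that satisfy the selected cutoffs."""
    selected = []
    for name, z_score, fisher_p in rows:
        if use_z and not z_score > Z_CUTOFF:
            continue
        if use_fisher and not fisher_p < FISHER_CUTOFF:
            continue
        selected.append(name)
    return selected


print(passing(MUSCLE))               # both cutoffs combined
print(passing(MUSCLE, use_z=False))  # Fisher cutoff alone
```

For this set, the Fisher cutoff alone also admits Myf and deltaEF1, while the combined filter keeps only SRF, TEF-1 and MEF-2.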
By applying both the Z-score and the Fisher p-value cutoffs to the randomly selected sets, we observed an average false positive rate of 15%. The specificity when using the combination of scores (Z&F) appears consistent across gene sets of different sizes. Thus, with sets as large as 100-200 genes, which is typical of clustered expression data, ~86% of the time no spurious results are observed.

[Figure 2.3: stacked bar chart. Y-axis: percentage of trials that produced a FP prediction (0 to 100). X-axis: scoring measure (Z, F, Z&F) for sets of 15, 50, 100 and 200 genes. Bar segments: 1, 2, 3 and >3 FP binding sites.]

Figure 2.3 Percentage of trials that produced false positive (FP) predictions
Sets containing 15, 50, 100, and 200 randomly selected genes were generated and submitted to oPOSSUM (100 trials each). Each segment of the bar represents the percentage of trials where n TFBSs were over-represented by chance using the Z-score and Fisher p-value cutoffs. Symbols: Z = Z-score>10; F = Fisher<0.01; Z&F = Z-score>10 and Fisher<0.01.

2.3.4 Noise tolerance

Next we performed simulations to investigate the amount of noise oPOSSUM can tolerate. To do this, we added from 5 to 300 randomly selected genes to the reference gene sets, and applied oPOSSUM to determine what proportion of the sets could be noise before losing our ability to elucidate the TFBSs mediating tissue-specific and pathway-specific expression.
We considered the Mef-2, HNF-1 and NF-κB binding site profiles to be representative of each set, and plotted their average Z-scores and Fisher p-values over 100 trials against the proportion of noise in the set. The muscle, liver and NF-κB data sets can tolerate up to 60% of the gene list being noise using the Z-score (Figure 2.4A) and up to 50% using the Fisher p-value (Figure 2.4B). There is significant variation in the degree of noise tolerance amongst the three sets of genes: the NF-κB set is able to tolerate up to 80% of the set being noise versus only 50% for the muscle set. Figure 2.4 shows that the Z-score decreases linearly and the Fisher p-value increases logarithmically with increasing noise for all three sets of genes.

[Figure 2.4: two line plots against noise (x-axis, 0 to 1) for MEF2, HNF-1 and NF-kappaB, each with the cutoff marked. (A) Z-score (0 to 35). (B) Fisher p-value (1.0E-08 to 1.0E+00).]

Figure 2.4 Noise tolerance
Increasing numbers of randomly selected genes were added to the muscle, liver and NF-κB reference sets to assess the effect of noise on (A) the Z-score and (B) Fisher exact probability statistical measures. The amount of noise is represented as the fraction of all genes in the set that were randomly selected. Average Z-scores and Fisher p-values for Mef-2, HNF-1 and NF-κB over 100 trials for each noise level are shown to represent the muscle, liver and NF-κB reference sets, respectively. Suggested cutoffs for the Z-score and Fisher p-value are shown by the dotted red lines.

2.3.5 Web implementation

The approach described for the detection of over-represented conserved TFBSs in sets of co-expressed genes has been implemented as a flexible, user-friendly website available from www.cisreg.ca. The implementation allows for analysis in default and custom modes.
In the default mode, conserved human and mouse TFBS counts have been pre-calculated and stored using combinations of pre-defined values for the following three parameters: (i) the amount of sequence relative to the TSS to be included in the analysis, (ii) the level of interspecies conservation required, and (iii) the PSSM score required for a hit to be reported (Table 2.3).

Table 2.3 Predefined values for phylogenetic footprinting and TFBS detection available in oPOSSUM’s default mode

Level   Conservation †                      PSSM Score   Promoter Region ‡
1       Top 30th percentile (minimum 60%)   75%          -5000 to +5000
2       Top 20th percentile (minimum 65%)   80%          -2000 to +2000
3       Top 10th percentile (minimum 70%)   85%          -2000 to 0

† Conservation thresholds based on percentiles are determined by first calculating the amount of sequence identity for all windows of size 100 bp, removing coding regions, and then finding the value above which the top x% of scores reside. ‡ Relative to the transcription start site.

Users simply select a pre-defined set of parameters, select a set of TFBS profiles to be included in the analysis, and submit a list of gene identifiers (Ensembl, GenBank, RefSeq, and LocusLink identifiers are presently supported) for analysis. oPOSSUM retrieves the TFBS hits matching the specified criteria for each gene in the list, calculates a Fisher exact probability and Z-score for the classes of TFBSs found in the set of genes, and returns ranked lists of TFBSs for each statistical test (Figure 2.5A). This operation is fast (less than 30 seconds for each of the reference sets) due to the pre-calculation of background frequencies. Pop-up windows for each TFBS display the genes in which the site has been located, as well as the site’s co-ordinates and score (Figure 2.5B). Furthermore, the TFBSs are linked to the JASPAR database for easy access to information regarding the binding site profiles.
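
To make the gene-level measure concrete, the following Python sketch computes a one-tailed Fisher exact probability from a 2 × 2 table of gene counts. It is a generic textbook formulation written for illustration; the function name, the argument layout and the treatment of the foreground and background sets as disjoint groups are assumptions, and the exact contingency table used by oPOSSUM is the one defined in Methods.

```python
from math import comb


def fisher_one_tailed(fg_with: int, fg_without: int,
                      bg_with: int, bg_without: int) -> float:
    """One-tailed Fisher exact p-value for over-representation.

    Arguments are gene counts (assumed layout, for illustration only):
      fg_with / fg_without : co-expressed genes with / without the site
      bg_with / bg_without : background genes with / without the site
    Returns P(X >= fg_with) under the hypergeometric distribution.
    """
    n_fg = fg_with + fg_without        # size of the co-expressed set
    with_site = fg_with + bg_with      # all genes carrying the site
    without_site = fg_without + bg_without
    total = comb(with_site + without_site, n_fg)
    upper = min(n_fg, with_site)
    # Sum the probabilities of tables at least as extreme as the observed one.
    tail = sum(
        comb(with_site, x) * comb(without_site, n_fg - x)
        for x in range(fg_with, upper + 1)
    )
    return tail / total


# Toy numbers only: 3 of 4 foreground genes and 1 of 4 background genes hit.
print(fisher_one_tailed(3, 1, 1, 3))
```

Because only gene counts enter the calculation, a gene with many copies of a site contributes no more than a gene with one, which is why the text pairs this test with a Z-score based on site occurrences.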
In the custom mode, users are not restricted to the pre-defined parameter values for the PSSM score and promoter region, and are given the option to supply user-defined background sets. Users might be motivated to introduce their own background sets if there is prior biological evidence linking sequence composition to expression in the tissue or condition studied. The customization option provides users with more control, and results in more variable processing speeds depending on the size of the background set and the parameters selected.

[Figure 2.5: screenshots of the oPOSSUM website, panels (A) and (B).]

Figure 2.5 The oPOSSUM result report for the identification of over-represented TFBSs in sets of co-expressed genes
(A) Results report showing the selected parameters, genes included and excluded in the analysis, and summary tables containing the Fisher exact probability scores and Z-scores for each TFBS (only the first few results are shown for each statistical test in this figure). (B) Pop-up window displaying genes that contain a particular TFBS (in this case, Mef-2), as well as the site locations and scores.

2.3.6 The oPOSSUM application programming interface (API)

The oPOSSUM API, based on a set of object-oriented Perl modules, provides an interface to the oPOSSUM database and defines data objects for facilitating statistical (Fisher and Z-score) analysis. A set of modules at the top level of the API tree model each of the data objects in the oPOSSUM database. Briefly, the current version of the API includes modules for connecting to the oPOSSUM database and retrieving gene indices, orthologous gene pairs, conserved region information, TFBS matches, and other types of data, for running the Z-score and Fisher analyses, and for storing the input and output of these analysis modules. The API with accompanying documentation is available through the oPOSSUM website.

2.4 Discussion

Regulatory analysis of the promoters of co-expressed genes can give rise to hypotheses about the factors, TFBSs, and putative pathways involved in generating the observed expression patterns. Our integrated approach to regulatory analysis incorporates public data sources, cross-species conservation and complementary statistical methods to identify over-represented motifs. We validate the method using updated reference sets of muscle-specific and liver-specific regulatory regions, and a new set of NF-κB-regulated genes. We demonstrate the utility of this technique for analyzing experimental data with three independent gene expression studies. We show the robustness of this method through computationally assessed rates of false positives and noise tolerance. The procedure has been implemented as a user-friendly, flexible website called oPOSSUM and as a Perl API. In short, we illustrate herein that oPOSSUM is a novel, validated, useful, robust, user-friendly means for analysts to explore potential regulatory mechanisms in their expression experiments.

2.4.1 Performance

In the case of the muscle regulatory collection, oPOSSUM ranked four of the five documented TFBSs as the four most over-represented sites. In fact, the only three profiles to surpass the specified Z-score and Fisher p-value cutoffs were those for the muscle-specific TFs SRF, TEF-1, and Mef-2. Similar results were obtained for the extended liver regulatory collection, which contains genes with experimentally verified HNF-1/3/4 and c/EBP binding sites. oPOSSUM analysis resulted in two over-represented TFBSs, including the top-ranked HNF-1, followed by forkhead related activator 2 (FREAC-2), a member of the forkhead box family of eukaryotic DNA binding proteins, which includes FREAC-2, FREAC-4, HNF-3β, and HNF-4. c/EBP, though not considered significantly over-represented based on our empirical cutoffs, ranked third.
The JASPAR database does not currently contain a binding profile for HNF-4, and so this TF could not be included in the analysis. The liver set illustrates that the lack of high-quality PSSM profiles for all TFs in the human genome is a key limitation of this method, and of the field as a whole. Application of oPOSSUM to a set of known targets of NF-κB resulted in all four NF-κB-related profiles ranking at the top of the list of over-represented TFBS profiles, with markedly significant scores. Within the ranked list, though not exceeding the thresholds, we observed other immune response-related TFBS profiles. For example, the interferon regulatory factors Irf-1 and Irf-2 (ranked #8 and #6, respectively) are known regulators of the host defense in response to viral infection or cytokine stimulation; they regulate interferon (IFN) and IFN-inducible genes, and also form interactions with SPI-1 and SPI-B (ranked #5) to induce the activity of various cytokines (29-31). It is worth noting that the default threshold values are simply suggestions, and less conservative cutoffs may yield valuable insights as well.

While the system behaved well for the validation collections, we wished to assess its utility for the analysis of larger, more heterogeneous experimental data. In each of the two published gene expression data sets, derived from the ectopic expression of c-Myc and c-Fos respectively, oPOSSUM clearly and appropriately ranked the corresponding TFBSs as being the most significantly over-represented. Further evidence of oPOSSUM’s utility in analyzing gene expression data was obtained by applying oPOSSUM to a set of genes that showed decreased expression in an experiment examining the effect of a known inhibitor of the NF-κB signaling pathway.
This set of genes is distinct from the NF-κB reference set in that the experiment examines an interleukin-induced immune response in a cellular system, and potentially contains a large number of mRNAs that are independent of NF-κB signaling. Still, oPOSSUM identified the NF-κB binding sites (NF-κB, c-REL, p50, p65) as being significantly over-represented, as well as the same TFs involved in the immune response that were identified in the NF-κB test set (Irf-1, Irf-2, SPI-B). Taken together, the three experimental analyses illustrate the power of promoter sequence analysis to identify the TFs governing gene expression changes observed in heterogeneous microarray and SAGE data.

2.4.2 Challenges

A common problem for promoter analysis is circularity. Binding sites that have been experimentally verified in genes are used to construct binding site profiles, which in turn, are used to search for binding sites in sets containing the original genes. In this study, the Mef-2, SRF, c-REL, p50, p65, Myc-Max, and c-Fos binding site profiles were constructed based on SELEX experiments, in which in vitro binding experiments are used to isolate suitable binding sites for a particular TF from random oligonucleotides (2). Thus, we can be sure that at least for the SELEX-based profiles, we have avoided any circularity.

At present, a key limitation to the oPOSSUM analysis is the scarcity of annotated binding site profiles. JASPAR, the underlying database supporting oPOSSUM, contains 111 high-quality binding site profiles representing 25 structural classes. When analyzing expression data sets, it is worth keeping in mind that although the TF mediating the observed response may not be present in the sparse JASPAR database, it is possible that a TF that recognizes a similar motif may be identified as being over-represented.
For example, looking at the results for the c-Myc experiment in Table 2.2A, it is evident that the over-represented TFBSs we observe are predominantly bound by TFs containing the basic helix-loop-helix (bHLH) domain, and in particular, by TFs within the bHLH-ZIP (basic helix-loop-helix / leucine zipper) structural class. In fact, four of the five TFs in JASPAR that belong to the bHLH-ZIP class rank amongst the top ten profiles. This is also true for the NF-κB-related data sets where we see a clear over-representation of TFs belonging to the Rel class of TFs (Tables 2.1C and 2.2C). It is important to consider whether a match may be indicative of a member of a structural class, rather than the specific profiled TF. A major challenge, however, is that zinc-finger proteins make up the largest class of TF proteins, comprising ~47% of the estimated 1445 TFs identified in mammalian genomes (32). Cys2-His2 zinc-fingers are the most versatile of the DNA-recognition domains, and variations in amino acid sequence enable them to bind to a diverse range of DNA sequences. In addition, zinc-finger proteins in mammalian genomes use multiple, tandem fingers to interact with arrays of subsites, providing a degree of modularity and exceptional adaptability (33). JASPAR currently contains only 17 zinc-finger binding profiles. We will continue our ongoing efforts to expand the JASPAR collection and incorporate new information as it becomes available.

Analysis of the false positive rates using random sampling revealed that, when used in isolation, the Z-score and Fisher tests result in high false positive rates that can be reduced by combining the two scoring measures. Although applying a multiple testing correction could improve performance for the Fisher measure, the Z-scores we obtain are extremely large, such that a correction for the one hundred or so TFBS profiles being tested has negligible impact on the Z-score results.
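
The size of this effect can be checked with a short calculation. The Python sketch below assumes, purely for illustration, that a Z-score behaves like a standard normal deviate under the null hypothesis (an assumption that the random-sampling results suggest holds only loosely) and uses the 111 JASPAR profiles quoted in the text as the number of tests. The function names are hypothetical.

```python
from math import erfc, sqrt

N_PROFILES = 111  # number of JASPAR profiles quoted in the text


def upper_tail_p(z: float) -> float:
    """One-sided p-value for a standard normal deviate (assumed null model)."""
    return 0.5 * erfc(z / sqrt(2.0))


def bonferroni(p: float, n_tests: int) -> float:
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p * n_tests)


# A Z-score at the cutoff of 10, before and after adjustment.
p_at_cutoff = upper_tail_p(10.0)
print(p_at_cutoff, bonferroni(p_at_cutoff, N_PROFILES))

# Fisher p-values from Table 2.1A: TEF-1 and Myf, adjusted for 111 tests.
print(bonferroni(5.48e-05, N_PROFILES), bonferroni(9.81e-03, N_PROFILES))
```

Under these assumptions a Z-score of 10 stays far below any conventional threshold after adjustment, whereas a Fisher p-value close to the 0.01 cutoff, such as the 9.81E-03 reported for Myf, would no longer pass.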
Instead, we have opted to empirically derive threshold cutoffs based on our reference data. Furthermore, our experience with oPOSSUM suggests that it is the ranks of the binding site profiles rather than the specific values of the scores that are indicative of functionally relevant TFBSs. For these reasons, and in light of the binding similarities within factor families, we have abstained from making a Bonferroni correction to adjust for multiple testing. In the future, we may introduce an option for users to make this adjustment that is based on an improved statistical model.

2.4.3 Utility

oPOSSUM is the first integrated, web-based tool for analyzing sets of co-expressed genes that incorporates cross-species comparisons, PSSM-based promoter motif detection, and statistical methods for the identification of over-represented TFBSs with a pre-computed database. Other resources are available for detecting and visualizing binding sites within the conserved regions of human genes (ConSite (11), rVISTA (34), dbTSS (35), CONREAL (36), CORG (37)), as well as for identifying statistically over-represented motifs in the promoters of related sequences (Clover (38), OTFBS (39), PRIMA (40)). Other comparable tools that integrate all of these approaches include the Toucan workbench for regulatory sequence analysis (41), CONFAC (42), and CRÈME (43). Unlike Toucan, oPOSSUM employs a pre-computed database of conserved TFBSs, eliminating the long processing times involved in retrieving sequences, performing alignments, and detecting motifs via PSSMs. Furthermore, the use of two complementary statistical tests to determine over-represented TFBSs is unique to oPOSSUM, and attempts to address the inherent problems involved in analyzing conserved regions of promoters for TFBSs, which include variation in conservation properties from one orthologous gene pair to another and multiple occurrences of a particular TFBS in the promoter of a single gene.
2.5 Conclusions The oPOSSUM system is under continued development. As new information accumulates, we intend to expand the orthology mapping, increase the number of TFBS profiles supported in JASPAR, include the option for users to specify alternative promoters (TSSs), and improve the over-representation analysis. We believe that this approach to regulatory analysis will be helpful to researchers hoping to elucidate transcriptional pathways from gene expression data.  70  2.6 References 1. Stormo,G.D. (2000) DNA binding sites: representation and discovery. Bioinformatics., 16, 16-23. 2. Pollock,R. and Treisman,R. (1990) A sensitive method for the determination of protein-DNA binding specificities. Nucleic Acids Res., 18, 6197-6204. 3. Bulyk,M.L., Gentalen,E., Lockhart,D.J. and Church,G.M. (1999) Quantifying DNA-protein interactions by double-stranded DNA arrays. Nat.Biotechnol., 17, 573-577. 4. Wingender,E., Dietze,P., Karas,H. and Knuppel,R. (1996) TRANSFAC: a database on transcription factors and their DNA binding sites. Nucleic Acids Res., 24, 238241. 5. Sandelin,A., Alkema,W., Engstrom,P., Wasserman,W.W. and Lenhard,B. (2004) JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res., 32, D91-D94. 6. Wasserman,W.W. and Krivan,W. (2003) In silico identification of metazoan transcriptional regulatory regions. Naturwissenschaften, 90, 156-166. 7. Duret,L. and Bucher,P. (1997) Searching for regulatory elements in human noncoding sequences. Curr.Opin.Struct.Biol., 7, 399-406. 8. Hardison,R.C., Oeltjen,J. and Miller,W. (1997) Long human-mouse sequence alignments reveal novel regulatory elements: a reason to sequence the mouse genome. Genome Res., 7, 959-966. 9. Koop,B.F. (1995) Human and rodent DNA sequence comparisons: a mosaic model of genomic evolution. Trends Genet., 11, 367-371. 10. Wasserman,W.W., Palumbo,M., Thompson,W., Fickett,J.W. and Lawrence,C.E. 
(2000) Human-mouse genome comparisons to locate regulatory sites. Nat.Genet., 26, 225-228.
11. Lenhard,B., Sandelin,A., Mendoza,L., Engstrom,P., Jareborg,N. and Wasserman,W.W. (2003) Identification of conserved regulatory elements by comparative genome analysis. J.Biol., 2, 13.
12. Dermitzakis,E.T. and Clark,A.G. (2002) Evolution of transcription factor binding sites in Mammalian gene regulatory regions: conservation and turnover. Mol.Biol.Evol., 19, 1114-1121.
13. Clamp,M., Andrews,D., Barker,D., Bevan,P., Cameron,G., Chen,Y., Clark,L., Cox,T., Cuff,J., Curwen,V. et al. (2003) Ensembl 2002: accommodating comparative genomics. Nucleic Acids Res., 31, 38-42.
14. Brudno,M., Do,C.B., Cooper,G.M., Kim,M.F., Davydov,E., Green,E.D., Sidow,A. and Batzoglou,S. (2003) LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res., 13, 721-731.
15. Tatusova,T.A. and Madden,T.L. (1999) BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences. FEMS Microbiol.Lett., 174, 247-250.
16. Needleman,S.B. and Wunsch,C.D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J.Mol.Biol., 48, 443-453.
17. Lenhard,B. and Wasserman,W.W. (2002) TFBS: Computational framework for transcription factor binding site analysis. Bioinformatics., 18, 1135-1136.
18. Fleiss,J.L. (1981) Statistical methods for Rates and Proportions. John Wiley, New York.
19. Hughes,T.R., Marton,M.J., Jones,A.R., Roberts,C.J., Stoughton,R., Armour,C.D., Bennett,H.A., Coffey,E., Dai,H., He,Y.D. et al. (2000) Functional discovery via a compendium of expression profiles. Cell, 102, 109-126.
20. Krivan,W. and Wasserman,W.W. (2001) A predictive model for regulatory sequences directing liver-specific transcription. Genome Res., 11, 1559-1566.
21. Wasserman,W.W. and Fickett,J.W. (1998) Identification of regulatory regions which confer muscle-specific gene expression. J.Mol.Biol., 278, 167-181.
22.
Baeuerle,P.A. and Baichwal,V.R. (1997) NF-kappa B as a frequent target for immunosuppressive and anti-inflammatory molecules. Adv.Immunol., 65, 111-137.
23. Emami,K.H., Burke,T.W. and Smale,S.T. (1998) Sp1 activation of a TATA-less promoter requires a species-specific interaction involving transcription factor IID. Nucleic Acids Res., 26, 839-846.
24. Li,Q. and Verma,I.M. (2002) NF-kappaB regulation in the immune system. Nat.Rev.Immunol., 2, 725-734.
25. Amati,B., Dalton,S., Brooks,M.W., Littlewood,T.D., Evan,G.I. and Land,H. (1992) Transcriptional activation by the human c-Myc oncoprotein in yeast requires interaction with Max. Nature, 359, 423-426.
26. Menssen,A. and Hermeking,H. (2002) Characterization of the c-MYC-regulated transcriptome by SAGE: identification and analysis of c-MYC target genes. Proc.Natl.Acad.Sci.U.S.A, 99, 6274-6279.
27. Ordway,J.M., Williams,K. and Curran,T. (2004) Transcription repression in oncogenic transformation: common targets of epigenetic repression in cells transformed by Fos, Ras or Dnmt1. Oncogene, 23, 3737-3748.
28. Epinat,J.C. and Gilmore,T.D. (1999) Diverse agents act at multiple levels to inhibit the Rel/NF-kappaB signal transduction pathway. Oncogene, 18, 6896-6909.
29. Marecki,S., Riendeau,C.J., Liang,M.D. and Fenton,M.J. (2001) PU.1 and multiple IFN regulatory factor proteins synergize to mediate transcriptional activation of the human IL-1 beta gene. J.Immunol., 166, 6829-6838.
30. Meraro,D., Gleit-Kielmanowicz,M., Hauser,H. and Levi,B.Z. (2002) IFN-stimulated gene 15 is synergistically activated through interactions between the myelocyte/lymphocyte-specific transcription factors, PU.1, IFN regulatory factor-8/IFN consensus sequence binding protein, and IFN regulatory factor-4: characterization of a new subtype of IFN-stimulated response element. J.Immunol., 168, 6224-6231.
31. Taniguchi,T., Ogasawara,K., Takaoka,A. and Tanaka,N. (2001) IRF family of transcription factors as regulators of host defense.
Annu.Rev.Immunol., 19, 623-655.
32. Gray,P.A., Fu,H., Luo,P., Zhao,Q., Yu,J., Ferrari,A., Tenzen,T., Yuk,D.I., Tsung,E.F., Cai,Z. et al. (2004) Mouse brain organization revealed through direct genome-scale TF expression analysis. Science, 306, 2255-2257.
33. Urnov,F.D. and Rebar,E.J. (2002) Designed transcription factors as tools for therapeutics and functional genomics. Biochem.Pharmacol., 64, 919-923.
34. Loots,G.G., Ovcharenko,I., Pachter,L., Dubchak,I. and Rubin,E.M. (2002) rVista for comparative sequence-based discovery of functional transcription factor binding sites. Genome Res., 12, 832-839.
35. Suzuki,Y., Yamashita,R., Nakai,K. and Sugano,S. (2002) DBTSS: DataBase of human Transcriptional Start Sites and full-length cDNAs. Nucleic Acids Res., 30, 328-331.
36. Berezikov,E., Guryev,V., Plasterk,R.H. and Cuppen,E. (2004) CONREAL: conserved regulatory elements anchored alignment algorithm for identification of transcription factor binding sites by phylogenetic footprinting. Genome Res., 14, 170-178.
37. Dieterich,C., Wang,H., Rateitschak,K., Luz,H. and Vingron,M. (2003) CORG: a database for COmparative Regulatory Genomics. Nucleic Acids Res., 31, 55-57.
38. Frith,M.C., Fu,Y., Yu,L., Chen,J.F., Hansen,U. and Weng,Z. (2004) Detection of functional DNA motifs via statistical over-representation. Nucleic Acids Res., 32, 1372-1381.
39. Zheng,J., Wu,J. and Sun,Z. (2003) An approach to identify over-represented cis-elements in related sequences. Nucleic Acids Res., 31, 1995-2005.
40. Elkon,R., Linhart,C., Sharan,R., Shamir,R. and Shiloh,Y. (2003) Genome-wide in silico identification of transcriptional regulators controlling the cell cycle in human cells. Genome Res., 13, 773-780.
41. Aerts,S., Thijs,G., Coessens,B., Staes,M., Moreau,Y. and De Moor,B. (2003) Toucan: deciphering the cis-regulatory logic of coregulated genes. Nucleic Acids Res., 31, 1753-1764.
42. Karanam,S. and Moreno,C.S.
(2004) CONFAC: automated application of comparative genomic promoter analysis to DNA microarray datasets. Nucleic Acids Res., 32, W475-W484.
43. Sharan,R., Ovcharenko,I., Ben Hur,A. and Karp,R.M. (2003) CREME: a framework for identifying cis-regulatory modules in human-mouse conserved segments. Bioinformatics., 19 Suppl 1, i283-i291.

Chapter 3: Integrated Tools for Analysis of Regulatory Motif Overrepresentation³

3.1 Introduction

Functional genomics research often generates lists of genes with observed common properties, such as coordinated expression. For many studies, a key challenge is the generation of relevant and testable hypotheses about the regulatory networks and pathways that underlie observed co-expression. Our strategy for elucidating regulatory mechanisms identifies over-represented sequence motifs that are present in the upstream regulatory regions of genes. The motifs may represent transcription factor binding sites (TFBSs) that have a role in regulating expression. oPOSSUM (1) and oPOSSUM2 (2) were developed to identify over-represented, predicted TFBSs and combinations of predicted TFBSs, respectively, in sets of human and mouse genes. The user inputs a list of related genes, selects the TFBS profile set to be included in the analysis, and the algorithm determines which, if any, predicted TFBSs occur in the promoters of the set of input genes more often than would be expected by chance. Both analytic approaches rely on a database of aligned, orthologous human and mouse sequences, and the delineation of conserved regions within which TFBS predictions are analyzed. While the approach does not explicitly address uncharacterized transcription factors (TFs), the effective coverage is broadened by the fact that members within certain structural families of TFs can exhibit similarities in binding specificity.

³ A version of this chapter has been published. Ho Sui, S.J., Fulton, D.L., Arenillas, D.J., Kwok, A.T. and Wasserman, W.W. (2007).
oPOSSUM: Integrated tools for analysis of regulatory motif over-representation. Nucleic Acids Research. 35 Web Server Issue:W245-52.

While intra-class similarity is not always the case, as exemplified by the zinc-finger family of TFs (3), the observation holds true for many TF families (4;5). Here we describe the new release of the oPOSSUM system, which integrates the two previously developed applications, and has been expanded to accommodate new species (yeast and worms). It also includes new methods for orthology assignment, transcription start site (TSS) determination, and sequence alignment.

3.2 Methods

3.2.1 Over-representation analysis

oPOSSUM single site analysis (SSA). The oPOSSUM system for identifying over-represented TFBSs in sets of co-expressed genes first focused on single site analysis (1). Two scores were developed to assess over-representation, one at the TFBS occurrence level and the other at the gene level. The Z-score, based on the normal approximation to the binomial distribution, indicates how far and in what direction the number of TFBS occurrences deviates from the background distribution's mean. The second score, the Fisher exact test, indicates if the proportion of genes containing the TFBS is greater than would be expected by chance. TFBS predictions situated within overlapping alternative promoters are counted only once when calculating over-representation in human and mouse genes. For C. elegans genes in operons, TFBS predictions in the upstream region of the first gene in the operon apply to all genes in the operon.

oPOSSUM combination site analysis (CSA). TFBSs do not act in isolation to initiate the transcription process. Transcriptional regulation can be viewed as mediated by arrays of cis-regulatory sequences, termed cis-regulatory modules (CRMs), which are bound by multiple TFs. In oPOSSUM2, Huang et al. (2006) (2) address the detection of over-represented sets of TFBSs in the promoters of a set of co-expressed genes.
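As a rough illustration of the two single-site scores described under SSA, the following Python sketch computes a Z-score from a normal approximation to the binomial distribution and a one-sided Fisher exact P-value from gene counts. It is a minimal sketch and not the oPOSSUM implementation: the function names, the choice of `n_trials` as the number of opportunities for a site to occur, the omission of any continuity correction, and the treatment of target and background genes as disjoint groups are all assumptions made for illustration.

```python
from math import comb, sqrt


def binomial_z_score(observed_hits, n_trials, background_rate):
    """Z-score of an observed count under a normal approximation to Binomial(n, p).

    background_rate (p) is the per-trial probability of a site estimated from
    the background set; n_trials is a hypothetical count of opportunities.
    """
    mean = n_trials * background_rate
    sd = sqrt(n_trials * background_rate * (1.0 - background_rate))
    return (observed_hits - mean) / sd


def fisher_exact_greater(target_with, target_total, background_with, background_total):
    """One-sided Fisher exact P-value for enrichment of site-containing genes.

    Assumption: target and background genes are treated as disjoint groups.
    """
    pool_total = target_total + background_total
    pool_with = target_with + background_with
    denominator = comb(pool_total, target_total)
    upper = min(target_total, pool_with)
    # Sum the hypergeometric probabilities of seeing at least target_with genes.
    tail = sum(
        comb(pool_with, i) * comb(pool_total - pool_with, target_total - i)
        for i in range(target_with, upper + 1)
    )
    return tail / denominator


# Hypothetical numbers, for illustration only.
z = binomial_z_score(observed_hits=15, n_trials=100, background_rate=0.1)
p = fisher_exact_greater(target_with=3, target_total=3,
                         background_with=0, background_total=3)
```

The Z-score responds to the total number of site occurrences, whereas the Fisher test responds only to how many genes carry at least one site, which is why the two can rank the same profile differently.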
In brief, the method reduces combinatorial complexity through an initial clustering step, which partitions similar TFBS profiles into groups, herein denoted TFBS classes, followed by an analysis step that determines a representative profile for each TFBS class. These representative profiles are used to detect over-represented sets of TFBS classes. Since each distinct, over-represented set of detected TFBS classes, herein described as a TFBS class combination, implicates the over-representation of one or more underlying TFBS profile-specific combinations, each of these TFBS class combinations is expanded to all possible TFBS profile-specific combinations (for the indicated classes), and all combinations are then analyzed for over-representation. Furthermore, given that CRMs can contain locally dense clusters of TFBSs, the system also provides for the specification of an inter-binding site distance (IBSD) constraint to confine the number of TFBS combinations that are investigated. A scoring scheme, adopted from the Fisher exact test, utilizes two sets of TFBS (class or profile-specific) combination counts to compare the degree of their over-representation: 1) the number found in the promoters of the co-expressed gene set versus 2) the number found in the promoters of genes in a background set (all genes in the database). TFBS combinations occurring in multiple alternative gene promoter regions are counted only once.

3.2.2 Species-specific Databases

In addition to enhancements to the human/mouse oPOSSUM database, we introduce new species databases for studies of over-represented TFBSs in yeast and worms. While the SSA over-representation analysis remains the same for all species, differences in gene structure require that the construction of the underlying databases be particular to each species.

Human/mouse. Ambiguities in ortholog assignments and the definition of TSS positions are major challenges when performing alignments for a large proportion of human and mouse genes.
We have expanded the human/mouse database through (i) the discrimination of potential orthologs from predicted paralogs based on upstream sequence similarity (Figure 3.1), and (ii) the delineation of alternative promoters for human and mouse genes (Figure 3.2) to address the alignment failure observed in previous database builds.

[Figure 3.1 flowchart: EnsEMBL-defined human/mouse homologs (16,058) are filtered by ortholog type into one-to-one ortholog pairs (14,967) and one-to-many and many-to-many homologs (1,513). The latter are mapped to the UCSC whole-genome alignments, giving one-to-one ortholog pairs predicted from upstream sequence similarity (195) and genes without a one-to-one ortholog based on upstream sequence similarity (847). In total, 15,162 orthologous gene pairs proceed to TSR determination.]

Figure 3.1 Determination of one-to-one orthologs for human and mouse genes

Starting with an initial set of homologs downloaded from EnsEMBL v41 (6), all homologs annotated as “one2one” are extracted. To select the closest putative ortholog pairs from homologs with “one2many” or “many2many” relationships, we check for upstream conservation using the whole-genome human-mouse alignments (7). We reannotate unambiguously aligned homologs (i.e. where the region 5000 bp upstream of the human TSS maps to a single region in the mouse net alignment) as putative one-to-one orthologs, adding 195 gene pairs to our set, and bringing the total number of orthologs to 15162 (Figure 3.1).

The original release of oPOSSUM had the drawback that only 65% of human genes could be reliably aligned to their mouse orthologs (8). Closer inspection revealed that the large majority of alignment failures could be attributed to alternative promoter usage between human and mouse genes. To improve our alignments, we determine putative alternative transcription start sites (TSSs) for the human and mouse genes (Figure 3.2). For each gene, the entire repertoire of transcripts from both EnsEMBL core genes and EST genes is retrieved.
Transcripts derived from EnsEMBL EST gene annotations are henceforth referred to as “lower confidence” transcripts. The TSSs for all transcripts are recorded, followed by a clustering step such that TSSs within 500 bp of one another are merged to form a transcriptional start region (TSR). This step is motivated by observations that genes with CpG island-associated promoters often have broad distributions of TSSs (9). For each TSR containing a transcript annotated as “known” or “novel”, we accept the TSR as is. For TSRs based solely on EST gene transcripts, we require additional evidence for transcription initiation. Cap analysis of gene expression (CAGE) (10) is a high-throughput method used to measure expression levels by counting large numbers of sequenced capped 5' ends of transcripts, termed CAGE tags. The 5' capping method facilitates accurate mapping of TSRs. We require that lower confidence TSRs contain a minimum of 5 CAGE tags to be accepted.

[Figure 3.2 flowchart: TSS positions are located for core EnsEMBL genes (hs18/mm8; “higher confidence”) and EnsEMBL EST genes (hs18/mm8; “lower confidence”), and the TSSs are clustered with a 500 bp window into transcription start regions (TSRs). Fantom3 CAGE tags (hs17/mm5) are converted to hs18/mm8 with LiftOver (UCSC) and mapped to the TSRs. The output is the set of putative TSRs for KNOWN/NOVEL transcripts and for PUTATIVE transcripts with ≥5 CAGE tags.]

Figure 3.2 Identification of transcription start regions (TSRs) using a combination of EnsEMBL annotations and CAGE data. To improve our alignments, we determine putative alternative TSSs for the human and mouse genes. For each gene, the entire repertoire of transcripts from both EnsEMBL core genes and EST genes is retrieved. The TSSs for all transcripts are recorded, followed by a clustering step such that TSSs within 500 bp of one another are merged to form a transcriptional start region (TSR).
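The TSR construction described above can be sketched in a few lines of Python. This is only an illustrative reading of the procedure, not the pipeline code: the `Tss` record, the function names, the single-linkage interpretation of "within 500 bp of one another" (each TSS joins the current cluster when it lies at most 500 bp from the previous TSS), and the rule that a CAGE tag supports a TSR when it falls between the first and last TSS of the cluster are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Tss:
    position: int          # coordinate on one chromosome and strand
    high_confidence: bool  # True for core "known"/"novel" transcripts, False for EST-only


def cluster_tss(tss_list, max_gap=500):
    """Group TSSs so that neighbouring sites at most max_gap bp apart share a TSR."""
    clusters = []
    for tss in sorted(tss_list, key=lambda t: t.position):
        if clusters and tss.position - clusters[-1][-1].position <= max_gap:
            clusters[-1].append(tss)
        else:
            clusters.append([tss])
    return clusters


def accept_tsrs(clusters, cage_positions, min_cage_tags=5):
    """Keep TSRs with a higher-confidence transcript, or with enough CAGE tags."""
    accepted = []
    for cluster in clusters:
        start, end = cluster[0].position, cluster[-1].position
        if any(t.high_confidence for t in cluster):
            accepted.append((start, end))
            continue
        # Lower-confidence TSR: count CAGE tags falling inside the region.
        n_tags = sum(start <= pos <= end for pos in cage_positions)
        if n_tags >= min_cage_tags:
            accepted.append((start, end))
    return accepted


# Hypothetical coordinates, for illustration only.
example_tss = [Tss(100, True), Tss(400, False), Tss(2000, False),
               Tss(2300, False), Tss(9000, False)]
example_cage = [2000, 2010, 2100, 2200, 2300, 9000]
example_tsrs = accept_tsrs(cluster_tss(example_tss), example_cage)
```

In this toy input the first region is accepted because it contains a higher-confidence transcript, the second because it holds five CAGE tags, and the isolated EST-only TSS at 9000 is discarded.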
For each TSR containing a transcript annotated as “known” or “novel”, we accept the TSR as is. For TSRs based solely on EST gene transcripts, we require a minimum of 5 CAGE tags as evidence for transcription initiation.

For each human/mouse orthologous pair, we determine the coordinates of the longest region from the UCSC genome alignments (7) spanning all transcripts plus an additional 10 kb of upstream sequence. The orthologous sequences are retrieved and realigned using ORCA, a pairwise global progressive alignment algorithm (described in (1)) to optimally align short, conserved blocks within longer global alignments. If possible, TSRs from human and mouse are paired in the alignment. We apply three dynamically computed conservation thresholds, listed from most to least stringent, corresponding to the top 10%, 20%, and 30% of all 100 bp noncoding windows, with minimum percent identities of 70%, 65%, and 60%, respectively.

C. elegans/C. briggsae. To facilitate transcriptional regulatory analysis of the numerous gene expression studies performed in C. elegans, we have implemented a worm version of oPOSSUM. While the database structure and pipeline procedure are very similar to those used for the human/mouse database, there are small modifications that allow for mapping of genes to their operons, as defined by Blumenthal et al. (11). In addition, nucleotide identity thresholds for conserved regions were reduced to 60%, 55%, and 50% for the top 10%, 20%, and 30% of noncoding windows, respectively, to account for the greater sequence divergence between C. elegans and C. briggsae compared to human and mouse. The set of orthologs for C. elegans and C. briggsae is defined by one-to-one InParanoid clusters (12) from WormBase (WS160) (13). After filtering overlapping genes, 10592 orthologous gene pairs (of which 2140 genes are in operons) remain for alignment.
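One way to read the dynamically computed conservation thresholds given above for the human/mouse and worm databases is sketched below in Python: for each level, the cutoff is the percent identity that admits the top fraction of 100 bp noncoding windows, but never lower than the stated minimum. This is an assumed interpretation for illustration, and the function and variable names are hypothetical.

```python
def conservation_cutoff(window_identities, top_percent, min_identity):
    """Identity cutoff admitting the top_percent best windows, floored at min_identity."""
    ranked = sorted(window_identities, reverse=True)
    # Integer arithmetic avoids floating-point surprises when taking the fraction.
    n_top = max(1, len(ranked) * top_percent // 100)
    return max(ranked[n_top - 1], min_identity)


# (top percent of windows, minimum percent identity) as stated in the text.
HUMAN_MOUSE_LEVELS = [(10, 70), (20, 65), (30, 60)]
WORM_LEVELS = [(10, 60), (20, 55), (30, 50)]


def cutoffs_for_levels(window_identities, levels):
    """Return one cutoff per conservation level for a single alignment."""
    return [conservation_cutoff(window_identities, top, floor) for top, floor in levels]


# Hypothetical percent identities of ten noncoding windows.
example_identities = [90, 85, 80, 75, 72, 68, 66, 62, 55, 40]
example_cutoffs = cutoffs_for_levels(example_identities, HUMAN_MOUSE_LEVELS)
```

For a poorly conserved alignment the minimum identity takes over, so the cutoff never drops below the floor for that level.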
Alignments are performed on the orthologous gene sequences plus 2 kb of upstream sequence (relative to the start codon) for C. elegans, and 4 kb of upstream sequence for C. briggsae. Annotations are not as mature for C. briggsae, and the longer upstream region aids in the alignment of the worm promoter sequences. Alternative promoters have not been considered in this first version; however, should CAGE data or other reliable means for annotating TSSs in worms become available, efforts will certainly be made to include them. Of the 10592 worm orthologs, 9331 (88%) successfully align.

Yeast. The analysis of yeast promoters is simplified by the more compact nature of the yeast genome. This characteristic diminishes the requirement for comparative methods to reduce the search space and noise inherent in larger genomes. Computational methods using S. cerevisiae sequences alone have successfully been used to identify regulatory elements associated with known sets of related genes (14;15). We opted to exclude phylogenetic footprinting for yeast, and instead, select promoter sequences corresponding to the 5' untranslated region 1000 bp immediately upstream of the start codon of each open reading frame (ORF). Note that for all applications, users have the option to further restrict the search space if they wish. The sequences were downloaded from the Saccharomyces Genome Database (16).

3.2.3 Transcription Factor Binding Site Prediction

For the metazoan species, we search for matches to TFBS profiles contained in the JASPAR CORE and JASPAR PhyloFACTS database collections (17;18). Additionally, we include a set of profiles compiled for C. elegans TFs from literature review for Worm SSA (http://www.cisreg.ca/oPOSSUM/data/). Binding sites are predicted for the sequences using the TFBS suite of Perl modules for regulatory sequence analysis (19).
A predicted binding site for a given TF model is reported if the site occurs in the promoters of both orthologs above a threshold PSSM score of 70% and at equivalent positions in the alignment. Overlapping sites for the same TF are filtered such that only the highest scoring motif is kept. The genomic location, profile score, motif orientation, and local sequence conservation level of each TFBS match in orthologous genes are stored in the respective species databases. For S. cerevisiae, we compiled a collection of yeast-specific TFBS motifs from both the Yeast Regulatory Sequence Analysis (YRSA) system (20) and the literature (http://www.cisreg.ca/oPOSSUM/data), and record the genomic location, profile score and motif orientation for each prediction.

3.3 Results

Each oPOSSUM component was validated on sets of reference genes. The results of all validations are available at http://cisreg.ca/oPOSSUM/data/. Below we describe a single validation for each system.

3.3.1 Human SSA

The human/mouse database was expanded by discriminating potential orthologs from predicted paralogs based on upstream sequence similarity (Figure 3.1), and using multiple alternative promoters for human and mouse genes (Figure 3.2). While the inclusion of promoter comparisons for candidate ortholog assignment may be controversial, the impact is marginal as less than 1.3% of gene pairs were derived from this approach. This brings the total number of orthologs to 15162. Despite improvements in EnsEMBL’s ortholog prediction, this is only 1079 more orthologs than were present in our previous database build. Based on the small incremental increases in mapped orthologs, we may be nearing the upper bound for the number of genes in human and mouse that are truly orthologous and detectable by sequence conservation.

Approximately one quarter of genes have multiple TSRs based on our procedure (Appendix 2 Figure S1).
This observation is consistent with a report that 18–20% of protein-coding genes use alternative promoters (21), but is much less than the reported 58% of protein-coding transcriptional units with two or more alternative promoters based on the presence of nonoverlapping CAGE tag clusters (22). Of the 15162 orthologous gene pairs supplied as input to the oPOSSUM pipeline, 15121 (99.7%) successfully align, and 15027 (99.1%) have non-exonic conserved regions above 60% nucleotide identity. This is a significant improvement over the previous version of oPOSSUM (1).

Wonsey and Follettie performed a microarray analysis of genes that are transcriptionally regulated by FoxM1, a member of the forkhead family of TFs, using BT20 cells that had been transfected with FoxM1 siRNA (21). They identified a set of 27 genes that were specifically regulated in cells transfected with FoxM1 siRNA. The 27 Affymetrix UG144A identifiers were mapped to 27 EnsEMBL gene identifiers and submitted to Human SSA with default parameters. Of these, 22 genes had a unique mouse ortholog and were used in the oPOSSUM analysis. While a specific profile for FoxM1 is not present in JASPAR CORE, other members of the forkhead family were ranked in the top ten highest scoring TFBS profiles (Table 3.1). There is also a known association between HNF4, the highest scoring TFBS profile, and the forkhead TF, FOXO1, in the regulation of gluconeogenic gene expression in hepatocytes (22), which may explain the over-representation of the HNF4 profile.

We previously identified over-represented Fos binding sites in a set of genes induced after transformation by c-Fos in rat fibroblast cells (1;23). We analyzed 160 orthologous genes from the original list of 252 induced genes. This is a notable improvement over the previous version, where only 98 genes were included in the oPOSSUM analysis. The Fos TFBS profile ranked second in the list of over-represented TFBSs (Table 3.2).

Table 3.1 Validation of the human FoxM1-regulated gene cluster

JASPAR CORE  TF Class          IC     Target gene hits  Background TFBS rate  Target TFBS rate  Z-score  Fisher P-value
HNF4         Nuclear Receptor   9.62  13                0.0054                0.0085            7.19     2.64E-02
Fos          bZIP              10.67  15                0.0111                0.0146            5.72     4.29E-01
Pbx          Homeo             14.64   5                0.0019                0.0033            5.57     3.10E-01
FOXI1        Forkhead          13.18  16                0.0153                0.0186            4.49     9.05E-02
RORA1        Nuclear Receptor  17.42   4                0.0020                0.0029            3.54     5.04E-01
TAL1-TCF3    bHLH              14.07  12                0.0052                0.0066            3.30     5.88E-02
Staf         Zn-Finger, C2H2   17.54   3                0.0014                0.0021            3.16     3.03E-01
Foxa2        Forkhead          12.43  13                0.0152                0.0174            3.04     4.83E-01
Foxd3        Forkhead          12.94  13                0.0172                0.0194            2.93     5.27E-01
TEAD         TEA               15.67   6                0.0028                0.0037            2.85     4.70E-01

Table 3.2 Validation of the c-Fos-regulated gene cluster

JASPAR CORE  TF Class          IC     Target gene hits  Background TFBS rate  Target TFBS rate  Z-score  Fisher P-value
RREB1        Zn-Finger, C2H2   22.29   3                0.0001                0.0003            10.19    6.12E-02
Fos          bZIP              10.67  60                0.0044                0.0058             9.49    2.99E-01
RORA1        Nuclear Receptor  17.43  13                0.0006                0.0010             8.49    3.27E-02
SP1          Zn-Finger, C2H2    9.72  54                0.0046                0.0057             6.59    1.07E-01
NR3C1        Nuclear Receptor  14.75   5                0.0003                0.0005             6.03    1.37E-01

JASPAR PhyloFACTS  Similar To  IC     Target gene hits  Background TFBS rate  Target TFBS rate  Z-score  Fisher P-value
TGANTCA            AP-1        12.06   46               0.0011                0.0023            18.05    1.40E-04
GGGYGTGNY                      14.18   82               0.0059                0.0083            15.64    4.98E-02
TGASTMAGC          NF-E2       16.60   43               0.0013                0.0024            15.64    1.19E-03
GGARNTKYCCA                    17.13   44               0.0016                0.0026            12.54    1.11E-03
GGGAGGRR           MAZ         14.00  111               0.0171                0.0202            11.98    3.16E-01

Inspection of the results using the JASPAR PhyloFACTS profiles with default parameters illustrates how inclusion of this new set of profiles provides additional, meaningful information. The highest ranked PhyloFACTS motif (TGANTCA) is noted by JASPAR as being most similar to the binding profile for AP-1, and the third highest scoring motif (TGASTMAGC) is most similar to the bZIP TF NF-E2.
AP-1 complexes are composed of Fos and Jun proteins, and the structurally related NF-E2 and AP-1 TFs bind similar sequence motifs (24).

3.3.2 Human CSA

The combination site analysis was validated on a set of mouse skeletal muscle genes comprising the union of the results of the microarray studies of Moran et al. (25) and Tomczak et al. (26). To avoid circularity, we removed muscle-specific genes used to generate the JASPAR binding site profiles for Mef-2, Myf, Sp-1, SRF, and Tef. Binding sites for these factors occur in clusters in cis-regulatory modules that contribute to skeletal muscle-specific expression (27). Table 3.3 lists the top five over-represented pairwise TFBS combinations for this set of genes, along with the JASPAR class each TF profile clustered to, and the Fisher score obtained for each pair. The five most over-represented pairs of TFBS profiles include combinations of Mef-2, SRF and Sp-1.

Table 3.3 Validation of skeletal muscle genes identified by Moran et al. and Tomczak et al.

TF name (Class ID)  TF class name  TF name (Class ID)    TF class name    Score
MEF2A (class 4)     MADS           Myf (class 22)        bHLH             1.65E-06
MEF2A (class 4)     MADS           ZNF42_1-4 (class 25)  Zn-finger, C2H2  4.24E-06
Myf (class 22)      bHLH           SRF (class 1)         MADS             2.52E-05
SP1 (class 31)      Zn-finger, C2H2  SRF (class 1)       MADS             2.68E-05
Agamous (class 1)   MADS           MEF2A (class 4)       MADS             7.63E-05

The inclusion of alternative promoters provides notable improvements in the Human SSA and Human CSA analyses. The same data sets were used to validate our previous and current human oPOSSUM analysis systems. Demarcation of additional promoter boundaries increases the signal in the discovery process, improving the signal for both over-represented single TFBSs and combinations of TFBSs in the gene sets analyzed.

3.3.3 Worm SSA

Worm SSA was tested on a set of well-characterized nematode muscle genes (28).
Analysis of 1000 bp of upstream sequence, using the top 10% of conserved regions (minimum of 60% sequence identity), a matrix match threshold of 80% and the worm profiles, identified the putative muscle1 motif with a Z-score of 20.6 and a Fisher score less than 0.01 (Table 3.4). This is, however, somewhat circular, given that 19 of the 41 input genes were used to generate the putative muscle-specific worm profiles. Analysis using the JASPAR CORE profiles ranked SP1 and Su(H) within the top ten scoring profiles (Appendix 2 Table S2). Studies in Xenopus and Drosophila provide evidence that MyoD triggers Notch signaling through Su(H) for muscle determination (29;30). Although SP1 has been implicated in muscle CRMs, it is a general transcription factor involved in the expression of many different genes and binds to GC-rich motifs.

Table 3.4 Validation of worm skeletal muscle genes using worm profiles

Worm Profile  Status    IC     Target gene hits  Background TFBS rate  Target TFBS rate  Z-score  Fisher P-value
Muscle1       Putative  11.34   5                0.0023                0.0054             8.01    1.89E-02
Muscle3       Putative  16.67   4                0.0035                0.0038             0.61    3.66E-01
Muscle2       Putative  11.97   2                0.0018                0.0018            -0.22    3.97E-01
LIN-14        Putative   9.13  10                0.0154                0.0127            -2.79    4.00E-01

3.3.4 Yeast SSA

The yeast CLB2 gene cluster is comprised of 32 genes whose pattern of expression peaks at late G2/early M phase of the cell cycle. Transcription of these genes is regulated by two TFs: FKH, which is a component of the TF SFF, and MCM1, a member of the early cell cycle box (ECB) binding complex. Analysis of 500 bp of upstream sequence using a matrix match threshold of 85% ranked ECB, MCM1 and FKH1 in the top five scoring TFBS profiles (Table 3.5), which is consistent with the literature (31).

Table 3.5 Validation of the yeast CLB2 gene cluster

Yeast Profile  TF Class        IC     Target gene hits  Background TFBS rate  Target TFBS rate  Z-score  Fisher P-value
ECB            Unclassified    16.65  13                0.0019                0.0131            32.87    8.68E-09
MCM1           MADS             9.15  10                0.0073                0.0165            13.71    1.08E-02
FKH1           Forkhead        13.28  30                0.0305                0.0473            12.26    4.05E-02
CCA            Unclassified    16.93   3                0.0017                0.0040             7.08    2.02E-01
LYS14          C6_Zinc finger  17.02   6                0.0030                0.0053             5.20    9.41E-02

3.4 Discussion

The four oPOSSUM systems, Human SSA, Human CSA, Worm SSA, and Yeast SSA, have been integrated into a user-friendly website at www.cisreg.ca/oPOSSUM. We recommend that users of the system begin with the SSA to quickly identify TFBSs that may be relevant to their input data sets. For sets of human and mouse genes, this can be followed with the CSA, which takes longer to process, but which can provide insights into TFBSs that may be acting in concert to regulate the set of genes.

The web implementation allows for analysis in default and custom modes. Default mode processing is faster as TFBS counts have been pre-calculated and stored for predefined conservation levels, matrix match thresholds and promoter lengths. In either mode, the user is required to select a species and to enter a list of gene identifiers (EnsEMBL, RefSeq, HGNC and Entrez Gene are supported for human). A number of options are available to specify the TFBS profile set to be used in the analysis. Finally, the conservation level, matrix match threshold and the promoter length can be varied. In the custom mode, users may define their own background set, which provides users with more control, but results in more variable processing speeds depending on the size of the background set and the parameters selected.

Upon submission, oPOSSUM SSA generates a summary of the input parameters, and produces a single table that ranks the over-represented TFBSs by descending Z-score.
The table may be sorted by TF name, TF class, supergroup, information content (IC), Z-score and Fisher score (Figure 3.3A). Pop-up windows linked to each TFBS foreground count display the genes in which the putative site is located, the promoter region(s) for each gene, as well as the TFBS’s co-ordinates and score (Figure 3.3B). TFBSs that occur in overlapping promoter regions are marked by an asterisk and highlighted in yellow. The TF names are linked to the JASPAR database for easy access to information regarding the binding site profiles. The output for oPOSSUM CSA is similar, providing (i) a ranked list of over-represented TFBS class combinations, and (ii) a list of the most significant TFBS combinations (found in the set of expanded top-ranked class combinations).

Figure 3.3 oPOSSUM output. (A) A screenshot of the output of the oPOSSUM Human SSA analysis, which ranks over-represented TFBS profiles by Z-score. The arrows allow the user to sort and re-order the results by Fisher p-value, TF name, TF class, TF supergroup, or TF profile information content (IC). Each TF name links to a pop-up window displaying the TFBS profile information. (B) Pop-up window displaying genes that contain a particular TFBS (in this case, SRF), as well as the promoter coordinates associated with each gene, and the motif locations and scores. Sites in overlapping alternative promoters are highlighted for emphasis. Such sites are only counted once in the statistical analysis.

Because the statistics employed assume that DNA sequences are randomly generated, there is little reason to accept the calculated scores as accurate reflections of significance. Instead, as suggested in the original published description of the oPOSSUM algorithm, we recommend that the scores be used as rankings rather than significance measures. For this reason, a multiple testing correction is not applied as it does not alter the relative ranks.
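The point that a multiple testing correction leaves the relative ranks unchanged can be checked with a short Python sketch. It applies a plain Bonferroni adjustment (multiply each P-value by the number of tests and cap at 1) to the five Fisher P-values shown in Table 3.5 and compares the orderings. The function names are hypothetical, and using only the five displayed rows as the number of tests is an assumption made to keep the example small; the real number of profiles tested is larger.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted P-values: multiply by the number of tests, cap at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


def rank_order(values):
    """Indices of the values sorted from smallest to largest (stable for ties)."""
    return sorted(range(len(values)), key=lambda i: values[i])


# Fisher P-values from Table 3.5, in table order: ECB, MCM1, FKH1, CCA, LYS14.
fisher_p = [8.68e-09, 1.08e-02, 4.05e-02, 2.02e-01, 9.41e-02]
adjusted_p = bonferroni(fisher_p)

# The adjustment is monotone, so the ordering of profiles is preserved.
same_ranking = rank_order(fisher_p) == rank_order(adjusted_p)
```

Capping at 1 can create ties among the weakest profiles, but it never reverses the order of two profiles, so any ranking-based reading of the output is unaffected.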
Empirically, we determined that TFBS profiles with Z-scores equal to or exceeding 10 and Fisher scores less than or equal to 0.01 facilitate the identification of relevant TFBSs for our sets of reference genes (1). However, these are relatively stringent thresholds, and we encourage users to examine the scores of top-ranked TFBS profiles before applying any cutoffs.

We provide a consistent display for all four systems. However, there are slight differences between the systems, such as different parameters for selection on the input pages, which are relevant for each species database and system. Also, due to the longer processing times required to compute combinations of TFBSs, Human CSA queues the analysis request on the server and emails the completed results to the user.

Lastly, plant and insect matrices are available for inclusion in the analysis, based on the observation that members of the same structural family of TFs often bind to similar sequences. The MADS family of TFs is an excellent example of conservation of binding domains between plants and vertebrates (32;33), and there are numerous examples of conservation of binding domains across vertebrates, flies and worms. Thus, in cases where a profile for the TF of interest is not available in the database, oPOSSUM can still provide insights into the underlying regulation by suggesting a particular TF family that may be involved.

3.5 Conclusions

The oPOSSUM system is under continued development. Efforts are underway to allow users to submit custom TF profiles to be included in the analysis. An improved search method for nuclear hormone receptors, which typically contain two half sites separated by a variable length spacer, has been developed and will be included in a future release. We will continue to add TFBS profiles as they become available, with an emphasis on expanding the repertoire of worm TFBS profiles.
We believe the oPOSSUM web server is and will continue to be a useful resource for researchers attempting to move from observed co-expression to infer mechanisms of co-regulation.

3.6 References

1. Ho Sui,S.J., Mortimer,J.R., Arenillas,D.J., Brumm,J., Walsh,C.J., Kennedy,B.P. and Wasserman,W.W. (2005) oPOSSUM: identification of over-represented transcription factor binding sites in co-expressed genes. Nucleic Acids Res., 33, 3154-3164.
2. Huang,S.S., Fulton,D.L., Arenillas,D.J., Perco,P., Ho Sui,S.J., Mortimer,J.R. and Wasserman,W.W. (2006) Identification of over-represented combinations of transcription factor binding sites in sets of co-expressed genes. In Advances in Bioinformatics and Computational Biology. Imperial College Press, London, UK, Vol. 3, pp. 247-256.
3. Urnov,F.D. and Rebar,E.J. (2002) Designed transcription factors as tools for therapeutics and functional genomics. Biochem.Pharmacol., 64, 919-923.
4. Luscombe,N.M., Austin,S.E., Berman,H.M. and Thornton,J.M. (2000) An overview of the structures of protein-DNA complexes. Genome Biol., 1, REVIEWS001.
5. Sandelin,A. and Wasserman,W.W. (2004) Constrained binding site diversity within families of transcription factors enhances pattern discovery bioinformatics. J.Mol.Biol., 338, 207-215.
6. Hubbard,T., Barker,D., Birney,E., Cameron,G., Chen,Y., Clark,L., Cox,T., Cuff,J., Curwen,V., Down,T. et al. (2002) The Ensembl genome database project. Nucleic Acids Res., 30, 38-41.
7. Schwartz,S., Kent,W.J., Smit,A., Zhang,Z., Baertsch,R., Hardison,R.C., Haussler,D. and Miller,W. (2003) Human-mouse alignments with BLASTZ. Genome Res., 13, 103-107.
8. Ho Sui,S.J., Mortimer,J.R., Arenillas,D.J., Brumm,J., Walsh,C.J., Kennedy,B.P. and Wasserman,W.W. (2005) oPOSSUM: identification of over-represented transcription factor binding sites in co-expressed genes. Nucleic Acids Res., 33, 3154-3164.
9. Carninci,P., Sandelin,A., Lenhard,B., Katayama,S., Shimokawa,K., Ponjavic,J., Semple,C.A., Taylor,M.S., Engstrom,P.G., Frith,M.C. et al. (2006) Genome-wide analysis of mammalian promoter architecture and evolution. Nat.Genet., 38, 626-635.
10. Shiraki,T., Kondo,S., Katayama,S., Waki,K., Kasukawa,T., Kawaji,H., Kodzius,R., Watahiki,A., Nakamura,M., Arakawa,T. et al. (2003) Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proc.Natl.Acad.Sci.U.S.A, 100, 15776-15781.
11. Blumenthal,T., Evans,D., Link,C.D., Guffanti,A., Lawson,D., Thierry-Mieg,J., Thierry-Mieg,D., Chiu,W.L., Duke,K., Kiraly,M. et al. (2002) A global analysis of Caenorhabditis elegans operons. Nature, 417, 851-854.
12. O'Brien,K.P., Remm,M. and Sonnhammer,E.L. (2005) Inparanoid: a comprehensive database of eukaryotic orthologs. Nucleic Acids Res., 33, D476-D480.
13. Stein,L., Sternberg,P., Durbin,R., Thierry-Mieg,J. and Spieth,J. (2001) WormBase: network access to the genome and biology of Caenorhabditis elegans. Nucleic Acids Res., 29, 82-86.
14. Zhang,M.Q. (1999) Promoter analysis of co-regulated genes in the yeast genome. Comput.Chem., 23, 233-250.
15. Tavazoie,S., Hughes,J.D., Campbell,M.J., Cho,R.J. and Church,G.M. (1999) Systematic determination of genetic network architecture. Nat.Genet., 22, 281-285.
16. Cherry,J.M., Adler,C., Ball,C., Chervitz,S.A., Dwight,S.S., Hester,E.T., Jia,Y., Juvik,G., Roe,T., Schroeder,M. et al. (1998) SGD: Saccharomyces Genome Database. Nucleic Acids Res., 26, 73-79.
17. Vlieghe,D., Sandelin,A., De Bleser,P.J., Vleminckx,K., Wasserman,W.W., van Roy,F. and Lenhard,B. (2006) A new generation of JASPAR, the open-access repository for transcription factor binding site profiles. Nucleic Acids Res., 34, D95-D97.
18. Sandelin,A., Alkema,W., Engstrom,P., Wasserman,W.W. and Lenhard,B. (2004) JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res., 32, D91-D94.
19. Lenhard,B. and Wasserman,W.W. (2002) TFBS: Computational framework for transcription factor binding site analysis. Bioinformatics, 18, 1135-1136.
20. Sandelin,A., Hoglund,A., Lenhard,B. and Wasserman,W.W. (2003) Integrated analysis of yeast regulatory sequences for biologically linked clusters of genes. Funct.Integr.Genomics, 3, 125-134.
21. Wonsey,D.R. and Follettie,M.T. (2005) Loss of the forkhead transcription factor FoxM1 causes centrosome amplification and mitotic catastrophe. Cancer Res., 65, 5181-5189.
22. Lin,J., Tarr,P.T., Yang,R., Rhee,J., Puigserver,P., Newgard,C.B. and Spiegelman,B.M. (2003) PGC-1beta in the regulation of hepatic glucose and energy metabolism. J.Biol.Chem., 278, 30843-30848.
23. Ordway,J.M., Williams,K. and Curran,T. (2004) Transcription repression in oncogenic transformation: common targets of epigenetic repression in cells transformed by Fos, Ras or Dnmt1. Oncogene, 23, 3737-3748.
24. Daftari,P., Gavva,N.R. and Shen,C.K. (1999) Distinction between AP1 and NF-E2 factor-binding at specific chromatin regions in mammalian cells. Oncogene, 18, 5482-5486.
25. Moran,J.L., Li,Y., Hill,A.A., Mounts,W.M. and Miller,C.P. (2002) Gene expression changes during mouse skeletal myoblast differentiation revealed by transcriptional profiling. Physiol Genomics, 10, 103-111.
26. Tomczak,K.K., Marinescu,V.D., Ramoni,M.F., Sanoudou,D., Montanaro,F., Han,M., Kunkel,L.M., Kohane,I.S. and Beggs,A.H. (2004) Expression profiling and identification of novel genes involved in myogenic differentiation. FASEB J., 18, 403-405.
27. Wasserman,W.W. and Fickett,J.W. (1998) Identification of regulatory regions which confer muscle-specific gene expression. J.Mol.Biol., 278, 167-181.
28. GuhaThakurta,D., Schriefer,L.A., Waterston,R.H. and Stormo,G.D. (2004) Novel transcription regulatory elements in Caenorhabditis elegans muscle genes. Genome Res., 14, 2457-2468.
29. Rusconi,J.C. and Corbin,V. (1998) Evidence for a novel Notch pathway required for muscle precursor selection in Drosophila. Mech.Dev., 79, 39-50.
30. Wittenberger,T., Steinbach,O.C., Authaler,A., Kopan,R. and Rupp,R.A. (1999) MyoD stimulates delta-1 transcription and triggers notch signaling in the Xenopus gastrula. EMBO J., 18, 1915-1922.
31. Kreiman,G. (2004) Identification of sparsely distributed clusters of cis-regulatory elements in sets of co-expressed genes. Nucleic Acids Res., 32, 2889-2900.
32. Alvarez-Buylla,E.R., Pelaz,S., Liljegren,S.J., Gold,S.E., Burgeff,C., Ditta,G.S., Ribas,d.P., Martinez-Castilla,L. and Yanofsky,M.F. (2000) An ancestral MADS-box gene duplication occurred before the divergence of plants and animals. Proc.Natl.Acad.Sci.U.S.A, 97, 5328-5333.
33. Shore,P. and Sharrocks,A.D. (1995) The MADS-box family of transcription factors. Eur.J.Biochem., 229, 1-13.

Chapter 4: Dynamics of the Yeast Transcriptome During Wine Fermentation Reveals a Novel Fermentation Stress Response⁴

4.1 Introduction

Strains of Saccharomyces cerevisiae are industrially important owing to their extensive use in baking, brewing, wine-making and the production of fuel ethanol. S. cerevisiae has complex regulatory networks to sense, respond, and adapt to changing environments, but the regulation of metabolic pathways and the mechanisms for adapting to the extreme conditions in industrial processes, such as fermentation, have not been well studied. In the natural environment of grape must fermentation, S. cerevisiae is subject to high osmotic pressure, hypoxia, high concentrations of sugar and low nitrogen levels. In addition, as the fermentation progresses, ethanol levels can reach 16% (v/v); in standard laboratory media the ethanol concentration never exceeds 1% (v/v). During fermentation, stationary phase growth is reached within approximately 48 hours, but the yeast actively ferments and then survives for months in wine.
Many of the 1253 genes that have not been characterized (1) may well play an important role during the later adaptive stages of fermentation.

Perturbation of environmental conditions has led to the identification of a multitude of stress responses (see (2) for a review). While numerous genomics studies have addressed specific responses, a few studies have examined the common components of generalized environmental stress. Large-scale transcriptome analyses of short-term responses have identified the ~868 gene “environmental stress response” (ESR) (3) and the ~499 gene “common environmental response” (CER) (4), which overlap by ~337 genes (Appendix 3 Figure S1; all numbers approximate due to changing gene annotations over time). Functions of genes induced by stresses in both studies include carbohydrate metabolism, protein degradation, response to reactive oxygen species and protein folding, as well as genes with stress response elements (STREs) in their promoters. Repressed genes include those involved in translation and protein synthesis, cytoplasmic ribosomal proteins, and tRNA synthesis. Both the ESR and CER studies identified many genes with unknown functions that respond to stress.

⁴ A version of this chapter has been published. Marks, V.D.*, Ho Sui, S.J.*, Erasmus, D., van der Merwe, G.K., Brumm, J., Wasserman, W.W., Bryan, J. and van Vuuren, H.J.J. (2008). Dynamics of the yeast transcriptome during wine fermentation reveals a novel fermentation stress response. FEMS Yeast Research. 8(1):35-52. * Joint first authors.

In contrast to the transient stress responses, S. cerevisiae is capable of making adaptive changes to grow either aerobically or anaerobically depending on environmental conditions. In the presence of glucose at concentrations that exceed 0.5% (w/v), the yeast will ferment regardless of the presence of oxygen. Genes involved in respiratory metabolism and gluconeogenesis can also be regulated by carbon source (5).
When glucose is depleted, the cells undergo a diauxic shift and respire ethanol produced during fermentation (6). During the transition, the transcriptional repressor Mig1p is exported from the nucleus to the cytoplasm. Mig1p is a key regulator of carbon catabolite repression; the nuclear export of Mig1p results in derepression of genes required for the utilization of alternative carbon sources (7;8). The exact mechanism of signal transduction in response to glucose is not entirely understood. Furthermore, little is known about how yeast metabolizes carbon when both glucose and ethanol are in excess.

The incorporation of transcript profiling technologies into enology research has led to a series of observations about gene expression during the early stages of fermentation. Several publications have demonstrated a dramatic change in gene expression patterns during the transition to stationary growth (9-11). Differences in gene expression patterns between commercial strains of wine yeast revealed a common pattern of stress response between wine strains at the beginning of vinification (12;13). Rossignol et al. explored the genomic expression patterns of S. cerevisiae during a three-day fermentation in a synthetic medium (14). In this latter study, stationary phase changes were noted, along with the observation that a large number of stress response processes were active. To identify how S. cerevisiae responds adaptively to long-term environmental stresses present during grape must fermentation, the yeast transcriptome was profiled over a period of 15 days.

4.2 Methods

4.2.1 Strains, media and growth conditions

An industrial wine strain of S. cerevisiae, Vin13 (Anchor Yeast, South Africa), was used. The yeast was inoculated into 1 L of filter-sterilized Riesling grape juice (15) to a final cell count of 4 × 10⁶/ml. The grape juice contained 214 g/L sugar (equimolar amounts of glucose and fructose), 70 mg/L ammonia, and 140 mg/L free α-amino nitrogen (15).
Incubation was stationary at 20 °C without the addition of oxygen. Fermentations were conducted in triplicate and cells were harvested from fermentations after 0.5 %, 2 %, 3.5 %, 7 %, and 10 % (v/v) ethanol had been produced.

4.2.2 Analysis of fermenting wine

Enzymatic kits were used to determine glucose and fructose concentrations according to the manufacturer's instructions (Roche Molecular Biochemicals, Laval, QC). Ethanol was measured by HPLC as described previously (16). Water activity was measured using an Aqualab Series 3 water activity meter (Decagon Devices, Pullman, WA). All measurements were done in triplicate.

4.2.3 Microarray analysis

Total RNA was extracted, and isolation of mRNA and cDNA synthesis were done according to Causton et al. (4). Double-stranded cDNA purification, cRNA synthesis and fragmentation were done according to Marks et al. (15). Affymetrix Yeast Genome S98 Chips were used (Affymetrix, Santa Clara, CA). Preparation of hybridization solution, hybridization, and washing, staining and scanning of yeast arrays were done as described by the manufacturer (Eukaryotic Arrays GeneChip Expression Analysis and Technical Manual, Affymetrix, Santa Clara, CA). Washing and staining were done as previously described by Marks et al. (15).

4.2.4 Analyses of gene expression data

Feature intensities from Affymetrix S98 chips were pre-processed using the robust multi-array average procedure (17) to obtain probe-set-summarized, normalized gene expression data. Of the 9335 probesets, 6299 probesets were retained that map to verified, uncharacterized or dubious ORFs from the Saccharomyces Genome Database (SGD) (18). Differences in gene expression patterns between commercial wine yeast strains at the beginning of vinification are well characterized (12;13); the first two time points were therefore not included in the analysis. However, data obtained for these two time points were deposited in ArrayExpress (GSE8536).
Therefore, for each of the 6299 probesets, the primary data consist of three independent biological replicates at each of five time points (the ethanol concentrations of 0.5 %, 2 %, 3.5 %, 7 %, and 10 % (v/v) were nominally coded as 24, 48, 60, 120, and 340 hours, respectively). For the purposes of temporal modeling, the predictor $t$ was defined as the logarithm of the sampling time (in hours), centered around the associated log-time midpoint. The 15 expression measurements for any probeset $g$ were summarized with this quadratic polynomial:

$Y_g(t) = \beta_{0g} + \beta_{1g} t + \beta_{2g} t^2 + \varepsilon_g(t)$.

The intercept $\beta_{0g}$ captures the overall expression level, whereas the more interesting pair of temporal parameters $(\beta_{1g}, \beta_{2g})$ provides a probeset-specific summary of the expression change observed during fermentation (see Figure 4.1); the linear term $\beta_{1g}$ reflects a general trend (e.g. up vs. down) and the quadratic term $\beta_{2g}$ reflects overall shape (e.g. concave vs. convex). A statistical test for differential expression was conducted with the null hypothesis $(\beta_{1g}, \beta_{2g}) = (0, 0)$. The p-values from this F-test were converted to q-values (19), and the estimated proportion of null genes was ~10%. We classified a probeset as differentially expressed if it had a q-value of 0.001 or less and a predicted fold-change of 2 or more at one or more fermentation time points, based on the model.

The temporal parameters $(\beta_{1g}, \beta_{2g})$ were used as the input features for supervised clustering (20), in order to identify groups of genes with similar temporal trends. The clusters were anchored by 20 genes, selected to span the observed $(\beta_{1g}, \beta_{2g})$ values. A probeset was assigned to cluster $k$ if the squared Euclidean distance to cluster anchor $k$ was smaller than that to any other anchor. All analysis was conducted within the R statistical software environment (21) and relied on the libraries affy (22;23) and qvalue (24). The analyzed data set is available online at http://www.cisreg.ca/shosui/FSR/.
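The temporal model and the anchor-based cluster assignment described in this section can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the R code used for the analysis: the midpoint is taken as the centre of the log-time range, the expression values are synthetic and noise-free, the three anchors are hypothetical, and all function names are invented for the example.

```python
import math


def centered_log_times(hours):
    """Log sampling times centred on the midpoint of the log-time range (assumed definition)."""
    logs = [math.log(h) for h in hours]
    mid = (min(logs) + max(logs)) / 2.0
    return [v - mid for v in logs]


def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_quadratic(t, y):
    """Least-squares estimates (b0, b1, b2) for y = b0 + b1*t + b2*t^2 via the normal equations."""
    powers = [sum(ti ** k for ti in t) for k in range(5)]
    a = [[powers[i + j] for j in range(3)] for i in range(3)]
    b = [sum(yi * ti ** i for ti, yi in zip(t, y)) for i in range(3)]
    return solve3(a, b)


def nearest_anchor(point, anchors):
    """Index of the anchor with the smallest squared Euclidean distance to point."""
    def d2(anchor):
        return sum((p - q) ** 2 for p, q in zip(point, anchor))
    return min(range(len(anchors)), key=lambda k: d2(anchors[k]))


hours = [24, 48, 60, 120, 340]                                # nominal coding from the text
t = [v for v in centered_log_times(hours) for _ in range(3)]  # three replicates per time point
y = [8.0 + 1.5 * ti - 0.5 * ti ** 2 for ti in t]              # synthetic, noise-free expression
b0, b1, b2 = fit_quadratic(t, y)

anchors = [(2.0, 0.0), (0.0, 0.0), (-2.0, 0.0)]               # hypothetical (b1, b2) anchors
cluster = nearest_anchor((b1, b2), anchors)
print(round(b0, 6), round(b1, 6), round(b2, 6), cluster)
```

Only the pair (b1, b2) is passed to the assignment step, mirroring the use of the temporal parameters, and not the intercept, as clustering features.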
4.2.5 Functional enrichment of genes in clusters

Frequency of individual Gene Ontology (GO) annotation terms was used as a basis to assess biological significance of the gene expression data (25). Because the vast majority of GO terms are associated with very few genes (i.e. most terms are highly specific), a truncated version of the GO database was used. The truncated GO database consisted of 475 terms, selected for being associated with more than 10 genes. The level, or number of parents in the branching GO hierarchy, was also used as a tool in the analysis. The association between the clusters and GO terms was studied using contingency tables. The chi-square statistic (and its conventional p-value) was used to summarize evidence for enrichment or depletion of genes having a functional annotation within gene clusters.

4.2.6 Identification of regulatory elements over-represented in gene clusters

Promoter sequences corresponding to the 5' untranslated region 500 base pairs upstream of the initial ATG for each open reading frame (ORF) were downloaded from SGD. A collection of yeast-specific transcription factor binding site (TFBS) motifs was compiled from the Yeast Regulatory Sequence Analysis (YRSA) system (26) and from the literature (http://www.cisreg.ca/shosui/FSR/). Patterns matching the set of 44 TFBS weight matrix profiles in the collection were found using the TFBS suite of Perl regulatory analysis modules (27). The locations and scores for hits with matrix match scores exceeding 80% of the normalized score range were noted. For each of the clusters, and for two “combined clusters” comprising clusters 1-6 and clusters 18-20, we calculated two statistical measures, a Z-score and a Fisher exact probability, to determine which, if any, of the TFBSs were over-represented in the gene clusters relative to a background set comprising all yeast promoter sequences (28).
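As a rough illustration of these two measures, the Python sketch below computes a continuity-corrected Z-score of the form z = (x - mu - 0.5) / sigma at the nucleotide level and a one-tailed Fisher exact probability at the gene level. It rests on assumptions not spelled out here: mu = n*p and sigma = sqrt(n*p*(1-p)) for n target nucleotides and background rate p, and a 2x2 table that separates the cluster from the remaining genes. Function and parameter names are hypothetical, and the input numbers are made up.

```python
import math


def z_score(x, n, p):
    """Continuity-corrected Z-score using the normal approximation to the binomial.

    x: binding-site nucleotides observed in the co-expressed promoters
    n: total promoter nucleotides in the co-expressed set (assumed definition)
    p: rate of binding-site nucleotides in the background promoters
    """
    mu = n * p
    sigma = math.sqrt(n * p * (1.0 - p))
    return (x - mu - 0.5) / sigma


def fisher_one_tailed(a, b, c, d):
    """Probability of a or more cluster genes carrying the site, with margins fixed.

    a: cluster genes with the site      b: cluster genes without it
    c: other genes with the site        d: other genes without it
    """
    total = a + b + c + d
    with_site = a + c
    cluster_size = a + b
    denom = math.comb(total, cluster_size)
    upper = min(cluster_size, with_site)
    return sum(
        math.comb(with_site, k) * math.comb(total - with_site, cluster_size - k)
        for k in range(a, upper + 1)
    ) / denom


# Made-up inputs for illustration only.
z = z_score(x=131, n=10000, p=0.01)
p_value = fisher_one_tailed(3, 1, 1, 3)
print(z, p_value)
```

The two statistics answer different questions: the Z-score is sensitive to many sites concentrated in a few promoters, whereas the Fisher probability counts each gene once however many sites it carries.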
The Z-score measures how frequently a particular TFBS occurs in the promoters of co-expressed genes in a cluster and compares it to the frequency of occurrence in the background set, determining over-representation of the motif at the nucleotide level. Briefly,

$z = \frac{x - \mu - 0.5}{\sigma}$,

where $x$ is the observed number of binding site nucleotides for a given TFBS in the co-expressed set, $\mu$ is the expected number of binding site nucleotides based on the background sequences, and $\sigma$ is the standard deviation. The score is based on the normal approximation to the binomial distribution. In contrast, the Fisher exact probability compares the proportion of genes in a cluster that contain a particular TFBS to the proportion of genes in the background that contain the TFBS, determining over-representation at the gene level. Fisher's exact test computes the probability, given the observed marginal frequencies, of obtaining exactly the frequencies observed and any configuration more extreme (29).

4.3 Results

4.3.1 Clustering of genes based on temporal expression profiles

To investigate how yeast adapts to the harsh conditions during the fermentation of grape must, genome-wide transcription was assayed at five time points during alcoholic fermentation. Correlation coefficients for the biological replicates at individual time points ranged from 0.9575 to 0.9950. The full data set is available at http://www.cisreg.ca/shosui/FSR.

For each gene, the temporal trend throughout the fermentation process was summarized with two parameter values that describe the overall change and shape of the expression profile. A seeded clustering algorithm was applied in order to group the genes around the expression patterns exhibited by 20 genes whose temporal trends span the range of patterns observed across the entire transcriptome (Figure 4.1).
For each gene, a test was conducted for temporal trend, based on the null hypothesis that the true expression pattern is a flat line (i.e., no change at any time), and these gene-specific p-values were converted to q-values to control the false discovery rate (19;30). Differentially expressed genes were identified as genes having a q-value less than 0.001 and exhibiting a minimum of two-fold change in expression at one or more time points during fermentation. Therefore, the definition of differential expression requires a statistically significant temporal trend as well as an expression change that is large in absolute magnitude. There were 2550 genes that met these criteria, which corresponds to 42% of the yeast genome (Table 4.1). Induction of 1123 genes was sustained until the end of fermentation (clusters 1-10), while 1279 genes were repressed (clusters 14-20). The expression of 148 genes was transiently induced and returned to steady state levels by the final time point (clusters 11 and 12) (Figure 4.1). The expression of 1876 genes was unaffected (cluster 13, Figure 4.1).

4.3.2 The fermentation stress response

Under enological conditions, the yeast cell is exposed to changing nutrient concentrations and diverse forms of stress. S. cerevisiae responds transcriptionally to stress by general and/or stimulus-specific response mechanisms.
In contrast to the transient responses observed in laboratory studies (3;4), sustained global changes in transcript abundance were observed for the duration of the fermentation process (15 days), indicating an adaptive response to fermentation stress (Table 4.2). The differentially expressed genes in clusters 1-6, i.e. genes exhibiting sustained and dramatic induction that persists until the final time point (Table 4.2), are defined as fermentation stress response (FSR) genes. Genes that are down-regulated are excluded from the FSR as they are primarily involved in protein biosynthesis and ribosomal processing, functions known to be repressed under stress and in the stationary growth phase. A list of the 223 FSR genes is shown in Table 4.2. These genes show a four-fold or more change in expression at some point during fermentation. Of the 223 FSR genes, 20% overlap with the transient global stress responses characterized in the ESR or CER (Figure 4.2, Table 4.2). A further 18% overlap with genes that respond to stress related to osmotic pressure, nitrogen depletion, ethanol increase, and oxidative stress (Table 4.2). For comparison, more than half (55%) of the repressed genes in clusters 18-20 overlap with the ESR or CER (Appendix 3 Figure S1).

Table 4.1 Summary of gene counts

Column groups:
  Trend: statistically significant temporal trend (q-value ≤ 0.001), count and proportion
  FC: minimum of two-fold change between any two time points (FC ≥ 2), count and proportion
  DE: differentially expressed (q-value ≤ 0.001 and FC ≥ 2), count and proportion
  Sustained: sustained response (minimum of two-fold change at final time point), count
  Strength: strength of long-term response (smallest final expression change for the cluster)

Cluster       Genes  Trend  Trend  FC     FC     DE     DE     Sustained  Strength  Description of the Response
                     Count  Prop.  Count  Prop.  Count  Prop.
1             5      5      1.00   5      1.00   5      1.00   5          22.73     Sustained induction
2             8      8      1.00   8      1.00   8      1.00   8          13.62     Sustained induction
3             11     11     1.00   11     1.00   11     1.00   11         20.03     Sustained induction
4             63     63     1.00   63     1.00   63     1.00   63         4.83      Sustained induction
5             37     37     1.00   37     1.00   37     1.00   37         4.03      Sustained induction
6             100    100    1.00   100    1.00   100    1.00   100        4.38      Sustained induction
7             356    337    0.95   350    0.98   331    0.93   350        1.87      Sustained induction
8             345    307    0.89   250    0.72   239    0.69   191        1.11      Sustained induction
9             336    239    0.71   190    0.57   162    0.48   114        1.37      Sustained induction
10            182    178    0.98   182    1.00   178    0.98   132        1.43      Sustained induction
11            50     48     0.96   50     1.00   48     0.96   3          1.01      Transient induction
12            538    285    0.53   121    0.22   100    0.19   0          1.00      Transient induction
13            1876   161    0.09   1      0.00   0      0.00   0          1.00      No change
14            311    222    0.71   177    0.57   157    0.50   131        1.01      Sustained repression
15            372    311    0.84   241    0.65   226    0.61   137        1.36      Sustained repression
16            1123   760    0.68   473    0.42   450    0.40   473        1.33      Sustained repression
17            62     61     0.98   62     1.00   61     0.98   60         1.97      Sustained repression
18            368    366    0.99   368    1.00   366    0.99   368        3.45      Sustained repression
19            7      7      1.00   7      1.00   7      1.00   7          4.61      Sustained repression
20            41     41     1.00   41     1.00   41     1.00   41         8.66      Sustained repression
Simple Total  6191   3547   0.57   2737   0.44   2590   0.42   2231
Unique Total  6088   3493   0.57   2688   0.44   2549   0.42   2186

Genes in the FSR (clusters 1-6) are shaded in green. As some genes occur in multiple clusters, we provide the total number of unique genes in the data set in addition to a simple total of the counts in each cluster.

Figure 4.1 Clustering of genes based on temporal expression profiles
The centre graph shows the entire transcriptome plotted with the coefficients calculated from the expression model. The clusters are anchored by the genes numbered in black. The temporal expression of these genes is displayed in the 20 line graphs on the periphery. The dotted lines indicate two-fold change in expression levels. The number of genes in each cluster is indicated in the cluster.

[Figure 4.2: Venn diagram of the ESR (induced), CER (induced) and FSR gene sets; region counts as extracted: 133, 112, 81, 15, 22, 8, 178]

Figure 4.2 Fermentation stress response.
Venn diagram showing the number of genes associated with two well-known transcriptionally characterized stress responses. ESR – Genes induced in the Environmental Stress Response (3); CER – Genes induced in the Common Environmental Response (4); FSR – Fermentation Stress Response genes. Refer also to Appendix 3 Figure S1.

Table 4.2 Fermentation Stress Response Genes
The overlap with other stress responses is shown: (1) environmental stress response (ESR), (2) common environmental response (CER), (3) salt/sorbitol-induced osmotic stress, (4) sugar-induced osmotic stress, (5) short-term ethanol stress, (6) genes involved in nitrogen limitation, and (7) genes involved in oxidative stress. Boldface indicates genes that have unknown biological process. An asterisk (*) adjacent to the GO term indicates that the gene product has more than one associated GO term, and only the term most commonly used for annotation is shown. Cl. refers to the cluster number.

Cl. 1 1 1  Gene HSP30 GAC1 GAT1  Maximum Fold Change 80.11 44.21 29.10  1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4  CSR2 MCH5 CYB2 INO1 PHM8 PUT4 HSP26 MEP2 SPI1 SHC1 YMR244W IRC8 YJL150W YOL047C MBR1 YDL068W YOR318C SPS1 RDH54 PES4 APJ1 RPI1 PUT1 RSB1 HPA2 MFA1  24.68 22.73 68.30 39.14 28.99 27.37 24.45 24.36 23.51 20.00 32.27 32.13 31.22 29.44 27.90 26.56 25.51 24.93 23.27 21.96 20.03 19.20 19.06 18.74 18.35 17.69  4 4 4 4 4 4 4 4 4  SUE1 ATG1 YFR017C VHS1 GAP1 PDR15 YLR168C YBR284W VHR1  17.68 17.61 17.51 16.06 15.99 15.85 15.83 15.70 14.77  GO Biological Process response to stress meiosis* transcription initiation from RNA polymerase II promoter* cell wall organization and biogenesis* riboflavin transport electron transport inositol metabolic process biological process unknown proline catabolic process* response to stress* pseudohyphal growth* biological process unknown sporulation (sensu Fungi)* biological process unknown biological process unknown biological process unknown biological process 
unknown aerobic respiration biological process unknown biological process unknown protein amino acid phosphorylation* meiotic recombination* biological process unknown biological process unknown thiamin biosynthetic process* glutamate biosynthetic process* response to toxin* histone acetylation pheromone-dependent signal transduction during conjugation with cellular fusion protein catabolic process autophagy* biological process unknown protein amino acid phosphorylation* amino acid transport* transport biological process unknown telomere maintenance regulation of transcription from RNA polymerase II promoter*  ORF Status V V V V V V V V V V V V V U U D U V D D V V V V V V V V V V V U V V V U U U  Other Response 1 2 3 4 5 6 7 x x x x x x x x x x x x x x x x x x x x x x  x  x x x x x  x x  x x x x x x  107  x  Cl. 4  Gene AGX1  4 4 4 4 4 4 4 4 4 4 4 4 4 4  GSC2 STR3 RIM8 YBL065W ACS1 JID1 YLL020C HXT6 YKL071W DAL80 PTR2 TSL1 PIN3 DOT6  4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4  QNQ1 PDE1 GSY2 UPS1 YBR085C-A YCR061W YER158C RGT1 CRC1 IRC9 DIA3 GPX1 YNL194C CTA1 MOD5 ATG8 HSP78 VID24 TPO4 YLL056C SSD1 YIR016W AVT4 MGA1 SSE2 HXT7 YMR244CA KRE1 IRC15 YOR1 OPY2  4 4 4 4  Maximum Fold Change 14.62  7.69 7.65 7.58 7.57 7.52 7.50 7.43 7.26 7.16 7.11 7.07 6.71 6.63 6.48 6.27 6.25 6.24 6.22 6.11 5.97 5.87 5.85 5.80 5.78 5.66 5.55 5.54  GO Biological Process glycine biosynthetic process, by transamination of glyoxylate cell wall organization and biogenesis* methionine biosynthetic process meiosis* biological process unknown histone acetylation* biological process unknown biological process unknown hexose transport biological process unknown transcription* peptide transport* response to stress* actin cytoskeleton organization and biogenesis regulation of transcription from RNA polymerase II promoter* biological process unknown cAMP-mediated signaling glycogen biosynthetic process mitochondrial protein processing biological process unknown regulation of cell size 
biological process unknown glucose metabolic process* fatty acid metabolic process biological process unknown pseudohyphal growth* response to oxidative stress sporulation (sensu Fungi) OX and reactive OX species metabolic process* tRNA modification protein targeting to vacuole* response to stress* vesicle-mediated transport* polyamine transport response to toxin response to drug* biological process unknown amino acid export from vacuole filamentous growth protein folding* hexose transport biological process unknown  5.51 5.42 5.16 5.11  cell wall organization and biogenesis biological process unknown telomere maintenance* cell cycle arrest in response to pheromone*  13.38 12.50 11.21 11.15 10.79 10.41 10.18 9.76 9.71 9.43 8.62 8.11 7.88 7.81  ORF Status V V V V D V V D V U V V V V V U V V V U U U V V D V V V V V V V V V U V U V V V V U  Other Response 1 2 3 4 5 6 7 x x x x x  x x x x x x x x x  x x x x x x  x  x x x x  x x x  x x x x  x x x x  x x x x  x  x x x x x x  V U V V  108  x  Cl. 
4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5  Gene YMR253C MRP8 VBA2 MTR4 DIP5 PLB3 YJL149W BTN2 MUD1 ADR1 YDR042C ULP2 YBR099C YDL012C URN1 PHO80 HAP1  Maximum Fold Change 5.10 4.88 4.83 21.43 19.61 14.78 14.43 12.82 12.72 12.61 12.49 11.21 10.95 10.54 10.40 9.88 9.63  GO Biological Process biological process unknown translation basic amino acid transport ribosome biogenesis and assembly* amino acid transport phosphatidylserine catabolic process* biological process unknown retrograde transport, endosome to Golgi* nuclear mRNA splicing, via spliceosome transcription* biological process unknown mitotic spindle checkpoint* biological process unknown biological process unknown biological process unknown telomere maintenance* positive regulation of transcription from RNA polymerase II promoter*  ORF Status U V V V V V U V V V U V D U U V V  5 5 5 5 5 5  SNG1 SRT1 MKS1 SLD2 HRP1 NRD1  9.58 9.43 9.30 8.98 8.45 8.36  response to drug protein amino acid glycosylation regulation of nitrogen utilization* DNA strand elongation during DNA replication mRNA cleavage* transcription termination from Pol II promoter, RNA polymerase(A)-independent  V V V V V V  5 5 5 5 5 5 5 5 5 5 5 5  CHS1 SIS1 ISA1 AQR1 VID27 RCL1 ECM3 YML089C MPC54 YOR378W TOR2 DST1  8.35 7.59 7.58 7.14 7.09 6.63 6.32 6.29 6.28 5.79 5.78 5.66  V V V V V V V D V U V V  5 5 5 5 5 6 6 6  STE5 YGL041C SRC1 SSY5 STU1 TPO2 MUP3 NRG1  5.60 5.25 5.08 5.02 4.93 17.87 17.09 16.02  6  ARG82  14.69  cytokinesis, completion of separation* protein folding* telomere maintenance* drug transport* biological process unknown ribosome biogenesis and assembly* cell wall organization and biogenesis biological process unknown spore wall assembly (sensu Fungi) biological process unknown ribosome biogenesis and assembly* RNA elongation from RNA polymerase II promoter* invasive growth (sensu Saccharomyces)* biological process unknown mitotic sister chromatid segregation protein processing* microtubule nucleation polyamine transport amino acid 
transport regulation of transcription from RNA polymerase II promoter* response to drug*  V D V V V V V V  Other Response 1 2 3 4 5 6 7 x x x x  x x  x  x  x  V  109  Cl. 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6  Gene RTS3 HRK1 IRC20 GIP2 YKL070W GAT2 SAP155 PMC1 XBP1 MEP1 YIL066W-A PKH2 YMR102C YDL010W SWI1  Maximum Fold Change 14.44 14.38 12.40 11.83 11.82 11.58 11.45 11.16 10.83 10.78 10.30 10.15 9.70 9.37 9.31  GO Biological Process protein amino acid dephosphorylation cell ion homeostasis biological process unknown protein amino acid dephosphorylation* response to toxin transcription G1/S transition of mitotic cell cycle calcium ion homeostasis* response to stress nitrogen utilization* biological process unknown protein amino acid phosphorylation* biological process unknown biological process unknown regulation of transcription from RNA polymerase II promoter* spore wall assembly (sensu Fungi)* biological process unknown response to desiccation telomere maintenance* biological process unknown polyamine transport chitin- and beta-glucan-containing cell wall organization and biogenesis  ORF Status U V U V U V V V V V D V U U V  6 6 6 6 6 6 6  GIS1 YGR146C YJL144W RPN4 YIL152W TPO1 YLR194C  9.12 9.09 8.99 8.77 8.51 8.35 8.28  V U U V U V U  6 6 6 6 6 6  FRT1 KNS1 YDR186C AHC1 SKN7 MET4  8.25 8.25 7.77 7.72 7.62 7.62  response to stress protein amino acid phosphorylation biological process unknown histone acetylation response to oxidative stress* positive regulation of transcription from RNA polymerase II promoter*  U V U V V V  6 6 6 6 6 6 6 6 6  PDR5 YPL230W YLL020C TOS3 ZEO1 MUB1 RIM15 ROG3 HAC1  7.58 7.51 7.41 7.39 7.32 7.23 6.99 6.91 6.88  V U D V V V V U V  6 6  KTR2 FLO10  6.87 6.82  6 6 6 6 6  OPI9 SKS1 YPL136W TPO3 SUR1  6.67 6.58 6.53 6.52 6.39  response to drug* biological process unknown biological process unknown protein amino acid phosphorylation* telomere maintenance* regulation of cell budding protein amino acid phosphorylation* biological process unknown 
regulation of transcription from RNA polymerase II promoter* protein amino acid N-linked glycosylation* flocculation via cell wall protein-carbohydrate interaction biological process unknown protein amino acid phosphorylation* biological process unknown polyamine transport sphingolipid biosynthetic process*  V V  Other Response 1 2 3 4 5 6 7 x x x  x  x x x x x  x  x x  x x x x x  x  x x x x  x x x  x x  x  D V D V V  110  Cl. 6 6 6 6 6 6 6 6 6 6 6  Maximum Fold Change 6.36 6.19 6.10 6.07 6.04 6.01 6.00 5.99 5.97 5.91 5.91  6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6  Gene ISF1 MPH1 YOL036W YNL144C NAB6 AFG3 PGA1 YOR390W SPG1 MCH1 YML116WA SSL2 FSP2 EDC2 OSW2 PSR2 PSK1 YAK1 OAF1 PDR1 GPI18 SNQ2 IXR1 MDN1 YLR297W PAU21 YER093C-A RAD54 RAD7  6 6 6 6 6 6 6 6 6 6 6 6 6 6 6  UBC8 VPS72 MGR1 YOR152C TSC11 ISU1 YMR252C BUL1 RPO21 SPG5 KSP1 INP1 LEE1 PEP12 YPS3  5.01 5.01 5.00 5.00 4.99 4.97 4.96 4.96 4.81 4.80 4.78 4.77 4.75 4.75 4.75  6 6  YMR085W YML081W  4.62 4.59  5.86 5.85 5.83 5.75 5.74 5.66 5.51 5.49 5.47 5.38 5.37 5.34 5.27 5.26 5.17 5.17 5.17 5.16  GO Biological Process aerobic respiration DNA repair biological process unknown biological process unknown RNA metabolic process translation* secretory pathway biological process unknown biological process unknown transport biological process unknown transcription from RNA polymerase II promoter* biological process unknown deadenylation-dependent decapping spore wall assembly (sensu Fungi) response to stress protein amino acid phosphorylation* protein amino acid phosphorylation peroxisome organization and biogenesis* response to drug* GPI anchor biosynthetic process response to drug* DNA repair rRNA processing* biological process unknown biological process unknown biological process unknown chromatin remodeling* nucleotide-excision repair, DNA damage recognition protein monoubiquitination* protein targeting to vacuole* mitochondrial genome maintenance biological process unknown cell wall organization and biogenesis* iron ion 
homeostasis* biological process unknown mitochondrion inheritance* transcription from RNA polymerase II promoter biological process unknown protein amino acid phosphorylation peroxisome inheritance biological process unknown Golgi to vacuole transport chitin- and beta-glucan-containing cell wall organization and biogenesis* biological process unknown biological process unknown  ORF Status V V U U U V U U U V D V V V U V V V V V U V V V U U U V U V V U U V V U V V U V U V V V  Other Response 1 2 3 4 5 6 7 x x  x x  x x x x  x  x x  x  x x  x x  x x x  x  x x x  x x  x x  U U  111  Cl. 6 6 6 6 6 6  Gene PTK1 YIL067C YMR291W ASG1 ATO3 SRX1  Maximum Fold Change 4.56 4.55 4.53 4.43 4.41 4.41  GO Biological Process polyamine transport biological process unknown biological process unknown biological process unknown nitrogen utilization* response to oxidative stress  ORF Status V U U U V V  Other Response 1 2 3 4 5 6 7  x x x  x  Of the FSR genes, 28% lack a GO biological process annotation – they are uncharacterized. Of those that are characterized, more than half are associated with cellular and metabolic processes that facilitate the adaptation of the yeast to the continuously changing nutrient environment and the consequences thereof. Specifically, examination of the GO biological processes for these genes reveals that prevalent functional associations include transport (18%), organelle organization and biogenesis (15%), protein modification processes (13%), RNA metabolic processes (12%), response to stress (11%), and transcription (11%) (Figure 4.3).  4.3.3 Over-representation of functional annotations Genes in the 20 clusters were tested for biased association with a subset of GO biological process terms and for predicted transcription factor binding sites within promoter regions. 
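As an illustration of how a cluster can be tested for biased association with a GO term, the following minimal sketch computes a Pearson chi-square statistic (the statistic on which the shading of Figure 4.4 is based) for a 2 x 2 contingency table. The table layout, the function names and the counts are assumptions made for illustration only; this is not the analysis code used in this chapter.

```python
def chi_square_2x2(in_cluster_with_term, in_cluster_without_term,
                   outside_with_term, outside_without_term):
    """Pearson chi-square statistic (no continuity correction) for a 2 x 2 table."""
    a, b = in_cluster_with_term, in_cluster_without_term
    c, d = outside_with_term, outside_without_term
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every row and column total must be non-zero")
    return n * (a * d - b * c) ** 2 / denominator


def observed_minus_expected(in_cluster_with_term, in_cluster_without_term,
                            outside_with_term, outside_without_term):
    """Signed difference between observed and expected annotated genes in the cluster.

    A positive value points to enrichment, a negative value to depletion.
    """
    a, b = in_cluster_with_term, in_cluster_without_term
    c, d = outside_with_term, outside_without_term
    expected = (a + b) * (a + c) / (a + b + c + d)
    return a - expected


# Hypothetical counts: 30 genes in the cluster (10 annotated to the term),
# 70 genes outside the cluster (30 annotated to the term).
statistic = chi_square_2x2(10, 20, 30, 40)
difference = observed_minus_expected(10, 20, 30, 40)
print(f"chi-square = {statistic:.4f}, observed - expected = {difference:.1f}")
```

The statistic alone does not distinguish enrichment from depletion, which is why the sketch reports the sign of the observed-minus-expected difference separately.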
The fifty GO terms with the most biased over-representation in one or more clusters were identified, and the enrichment or depletion of genes belonging to these particular GO terms was plotted in Figure 4.4. We observe enrichment of terms associated with cellular respiration amongst genes displaying transient induction trends in clusters 10 and 11. Consistent with changes in cell growth and metabolism, protein metabolism and ribosomal biogenesis are enriched in genes displaying prolonged and dramatic repression. Among the FSR genes, nitrogen utilization is prevalent.

Figure 4.3 Functional annotations of genes in the FSR using biological process GO Slim Terms from SGD. The distribution of terms for the rest of the yeast genome is shown for comparison. [Bar chart; axis: proportion of genes annotated to process (0.0 to 0.3); categories: biological process unknown, transport, organelle organization and biogenesis, protein modification process, RNA metabolic process, response to stress, transcription, cell wall organization and biogenesis, DNA metabolic process, carbohydrate metabolic process, cell cycle, signal transduction; series: FSR, Yeast Genome - FSR.]

Figure 4.4 Heat map of the association of GO terms to clusters. White represents depletion and enrichment increases with darker squares (based on the chi-square statistic).

4.3.4 Statistical enrichment of predicted transcription factor binding sites

Statistically enriched transcription factor binding sites (TFBSs) were ranked using Z-scores; the enrichment was then further evaluated with Fisher exact probabilities to determine enrichment of the sites at the gene level (28). Table 4.3 shows the ten highest ranking predicted binding sites among promoters of the FSR genes. A full listing of scores for each of the 20 clusters is available in Appendix 3 Table S2.
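The two scores can be sketched as follows. This is a minimal illustration, assuming a normal approximation to the binomial for the Z-score and a one-tailed hypergeometric sum for the Fisher exact probability; it is not the oPOSSUM implementation (28), and all function and parameter names are hypothetical. The last step applies the cut-offs noted with Table 4.3 (Z-score > 10 and Fisher score < 0.01) to the scores listed in that table.

```python
from math import comb, sqrt


def z_score(target_hits, target_positions, background_hits, background_positions):
    """Z-score for the rate of motif hits in target promoters versus background.

    Assumes a normal approximation to the binomial: the background hit rate
    gives the expected count and standard deviation for the target set.
    """
    rate = background_hits / background_positions
    expected = rate * target_positions
    std_dev = sqrt(target_positions * rate * (1.0 - rate))
    return (target_hits - expected) / std_dev


def fisher_one_tailed(target_with, target_without, other_with, other_without):
    """Probability of seeing at least `target_with` site-containing genes in the
    target set, given fixed margins (one-tailed Fisher exact test)."""
    total = target_with + target_without + other_with + other_without
    with_site = target_with + other_with
    target_size = target_with + target_without
    upper = min(with_site, target_size)
    favourable = sum(
        comb(with_site, k) * comb(total - with_site, target_size - k)
        for k in range(target_with, upper + 1)
    )
    return favourable / comb(total, target_size)


def passes_thresholds(z, fisher, z_min=10.0, fisher_max=0.01):
    """Apply the Z-score > 10 and Fisher score < 0.01 cut-offs."""
    return z > z_min and fisher < fisher_max


# (motif, Z-score, Fisher score) as listed in Table 4.3
table_4_3 = [
    ("ADR1P", 19.33, 8.21e-02), ("PDR3", 18.35, 7.26e-04),
    ("STRE", 17.82, 7.57e-04), ("CSRE", 14.81, 1.40e-03),
    ("UASPHR", 13.78, 4.60e-02), ("LEU3", 11.74, 9.74e-04),
    ("MIG1c", 11.70, 4.37e-03), ("MIG1b", 10.53, 6.29e-02),
    ("UME6", 10.20, 8.75e-03), ("CAR1_r", 8.38, 1.23e-01),
]
starred = [motif for motif, z, p in table_4_3 if passes_thresholds(z, p)]
print(starred)
```

The motifs that pass both cut-offs are the ones marked with an asterisk in Table 4.3.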
Binding sites for Mig1p and Adr1p, as well as the carbon source responsive element (CSRE) bound by Cat8p, are over-represented in the FSR genes (Table 4.3).

Table 4.3 Highest ranked transcription factor binding site motifs for FSR genes

Rank | TFBS Motif | Functional Annotation | Z-score | Fisher Score | Genes Containing Site
1 | ADR1P | Carbohydrate metabolism, glucose repression | 19.33 | 8.21E-02 | 166
2 | PDR3 * | Chemical agent resistance; detoxification | 18.35 | 7.26E-04 | 65
3 | STRE * | Stress response element | 17.82 | 7.57E-04 | 48
4 | CSRE * | Carbon source responsive element; gluconeogenesis | 14.81 | 1.40E-03 | 136
5 | UASPHR | DNA repair; DNA damage response | 13.78 | 4.60E-02 | 189
6 | LEU3 * | Amino acid biosynthesis | 11.74 | 9.74E-04 | 47
7 | MIG1c * | Carbohydrate metabolism; glucose repression | 11.70 | 4.37E-03 | 132
8 | MIG1b | Carbohydrate metabolism; glucose repression | 10.53 | 6.29E-02 | 73
9 | UME6 * | Amino acid metabolism; nitrogen metabolism, mitotic cycle and cell cycle control | 10.20 | 8.75E-03 | 114
10 | CAR1_r | Nitrogen metabolism | 8.38 | 1.23E-01 | 114

* TF motifs with Z-score > 10 and Fisher scores < 0.01. This combination of empirically-derived thresholds has been used to discriminate relevant binding sites in reference sets of genes while reducing the false positive rate to less than 10% in simulations using random promoter sequences (29).

4.3.5 Entry into stationary growth

To test the hypothesis that nitrogen limitation is responsible for yeast cells entering into stationary phase during fermentation, fermentations that contained increasing concentrations of di-ammonium phosphate (DAP) were conducted. Yeast growth, nitrogen utilization, and ethanol production were monitored. Surprisingly, yeast cells entered into stationary phase at the same time point in all the fermentations, independent of nitrogen or carbon depletion (Figure 4.5). However, the common factor in all instances was an ethanol concentration of ~2.0% (v/v).
No further growth occurred after 50 hours when the ethanol concentration reached ~5.0% (v/v), despite the fact that nitrogen was in excess. After 32 hours, the ammonium concentration in the control grape must, and must supplemented with 300, 600 or 900 mg/L DAP, was 0, 63, 262 and 480 mg/L, respectively. The free alpha amino nitrogen content was 13, 23, 39 and 59 mg/L, respectively. This experiment was repeated in yeast nitrogen base (Difco) containing 21.4% sugars (equimolar amounts of glucose and fructose); nitrogen was adjusted with DAP to 191, 300, 600, or 900 mg/L. Yeast cells entered stationary phase after 48 hours when the ethanol concentration reached ~2% (v/v) (data not shown).

4.4 Discussion

Numerous studies have investigated the response of yeast to stress, mostly under laboratory growth conditions. In contrast, long-term alcoholic fermentation represents a more complex set of conditions, which accounts for the distinctive response observed in the FSR. Analysis of the genes that respond during alcoholic fermentation reveals the activation of several overlapping responses, such as osmotic stress, glucose repression, hypoxia/anoxia, nutrient depletion, and increasing ethanol concentrations.

Figure 4.5 Growth of Vin13 in Riesling grape juice containing either 191 mg/L N, 300 mg/L N, 600 mg/L N or 900 mg/L N. Nitrogen content of Riesling grape juice containing 191 mg/L N was adjusted to either 300 mg/L N, 600 mg/L N, or 900 mg/L N using di-ammonium phosphate. ADY of Vin13 was inoculated to 3 x 10^6 cells/mL in 250 mL Kimax bottles fitted with vapour locks containing 200 mL Riesling grape juice. Ethanol concentrations (% v/v) are indicated on the graph. [Line plot; y-axis: Log OD600nm; x-axis: Time (hours), 0 to 100; series: 191 mg/L N (Control-No DAP), 300 mg/L N, 600 mg/L N, 900 mg/L N; points A and B marked; ethanol annotations: 1.32 - 1.38, 2.06 - 2.16, 2.80 - 3.00 and 3.36 - 3.86 % v/v.]
4.4.1 Fermentation of grape must induces a novel and adaptive response

The stress response of a single industrial wine yeast strain, Vin13, in fermenting Riesling grape must was examined; 40% of the yeast genome significantly changed expression levels to mediate long-term adaptation to fermenting grape must. Among the genes that changed expression levels, 223 FSR genes were identified that are permanently induced at various points during fermentation. Approximately 62% of the FSR form a unique subset of genes that have not previously been implicated in the ESR, CER or selected short-term studies of osmotic pressure, ethanol stress, oxidative stress or nitrogen depletion (Table 4.2). Of the FSR genes, 28% lack a GO biological process annotation. Those that are characterized appear to play roles in transport, organelle organization and biogenesis, RNA metabolism, transcription, and response to stress (Figure 4.3). The large set of transport-related proteins is primarily involved in glucose uptake, nitrogen regulation, vacuolar function (important for growth under ethanol stress (31)) and detoxification, reflecting the integrated response to nutrients in the environment, multiple stresses, and toxins produced during vinification. It must be noted that the composition of grape musts can vary significantly with respect to the sugar concentration, and lipid, nitrogen and vitamin compositions and concentrations. The transcriptional response of other industrial wine yeast strains in different grape musts should therefore be investigated to determine whether any additional FSR genes exist.

4.4.2 Attenuated glucose repression

S. cerevisiae can grow oxidatively on many non-fermentative carbon sources such as pyruvate, lactate, acetate, and ethanol. These compounds are oxidized in the citric acid cycle and ATP is generated by reoxidation of reduced co-enzymes in the mitochondrial electron transport system by oxidative phosphorylation.
Several genes needed for oxidative energy metabolism, mitochondrial function and the catabolism of nonfermentable carbon sources are repressed by high concentrations of glucose (32;33). At the onset of fermentation, the sugar concentration in the grape must was 21.4% (w/v) (equimolar amounts of glucose and fructose). These data show an increase in expression of multiple glucose-repressed genes, indicating a partial attenuation of classic glucose repression during fermentation. Young et al. identified 40 of the most highly glucose-repressed genes based on expression ratios from derepressed versus repressed growth conditions (34). Twenty-two of these genes are up-regulated during fermentation (Appendix 3 Table S3). Furthermore, genes involved in mitochondrial respiration/oxidative phosphorylation (Appendix 3 Table S4), hypoxic genes involved in heme, sterol and unsaturated fatty acid biosynthesis (Appendix 3 Table S5), and genes associated with oxidative stress (Appendix 3 Table S6) were all induced during fermentation while the glucose concentration in the media should still be repressive. Several genes annotated to “carbohydrate metabolism” in clusters 9 and 10 (Appendix 3 Table S7) are reported to be glucose repressed (5;35;36); many of these genes are involved in glycolysis and gluconeogenesis. The induction of numerous glucose-repressed and oxygen-regulated genes indicates that cellular respiration may not be fully repressed during fermentation.

The regulators Mig1p, Adr1p and Cat8p play pivotal roles in the transcription of glucose-repressed genes. Mig1p binds to promoter sequences and represses the expression of many genes in yeast cells growing in high concentrations of glucose. Conversely, Cat8p and Adr1p are carbon source-responsive transcription factors shown to activate the expression of genes for metabolism of non-fermentative carbon sources when glucose is depleted (5).
Statistically enriched TFBSs in the promoters of FSR genes were ranked using Z-scores; the enrichment was then further evaluated with Fisher exact probabilities to determine enrichment of the sites at the gene level. Binding sites for Mig1p and Adr1p, as well as the carbon source responsive element (CSRE) bound by Cat8p, were among the ten highest-ranked TFBS motifs in the FSR genes (Table 4.3), lending further support to the hypothesis that glucose repression is not fully functional over the course of fermentation.

The HXK2 gene is highly expressed during growth on glucose (37). Hxk2p is involved in both hexose phosphorylation and the regulation of glucose repression (38;39). Hxk2p interacts with Mig1p to form a complex located in the nucleus that mediates glucose repression (40). Deletion of HXK2 leads to expression of a number of genes normally subject to carbon catabolite repression (41). Our data showed that HXK2 expression was down-regulated 18-fold during fermentation. The significant down-regulation of HXK2 expression during fermentation of grape must may therefore contribute to the expression of genes that are normally repressed by glucose.

4.4.3 Proposed models for alleviation of glucose repression

A number of explanations are proposed for the observed attenuation of glucose repression. Firstly, there could be genetic differences between the industrial yeast strain used (Vin13) and laboratory strains, resulting in an altered response to glucose. There is no evidence to suggest this is the case, as Vin13 has been shown to respond comparably to laboratory strains to osmotic stress and nitrogen catabolite repression (15;16). Secondly, stress, particularly increasing ethanol concentration, could disrupt the structure of membranes, affecting membrane-bound glucose sensors, and/or alter the structure of other proteins involved in signaling.
Thirdly, the increase in expression of glucose-repressed genes might be due to the activation of the retrograde response pathway. Because active dry yeast is prepared in a highly aerobic environment, the yeast cells contain large numbers of respiratory-efficient mitochondria at inoculation. Prolonged fermentative metabolism and aging of the cells would cause mitochondrial dysfunction, potentially inducing the retrograde response. This is unlikely as no peroxisomal PEX genes were induced. Fourthly, an as yet unidentified ethanol-sensing mechanism that functions in the presence of excess glucose might exist. Finally, HXK2 is significantly down-regulated (18-fold), and may contribute to relieving glucose repression during wine fermentations (38-41). The down-regulation of HXK2 may also be partially responsible for stuck alcoholic wine fermentations.

The most noticeable change during wine fermentation is the decrease in fermentable sugars and the accompanying rapid increase in ethanol concentration. Ethanol’s dual role as a stressor to the cell and as a potential carbon source during the respiration that may follow fermentation, together with our results, suggests that the yeast cell has adopted a unique mechanism to respond to the presence of ethanol independent of the glucose concentration. The response of yeast to ethanol is not detected in classic laboratory studies because low concentrations of glucose (2% w/v) will yield low concentrations of ethanol (< 1% v/v). To further complicate matters, oxygen is limiting during grape must fermentations, thereby preventing the effective utilization of ethanol as a carbon source. Furthermore, down-regulation of HXK2 expression is not observed during short-term laboratory fermentations. It is likely that glucose repression is alleviated in response to both ethanol and oxygen in the environment, together with limited or no Hxk2p.
4.4.4 Ethanol as a regulator of fermentation

The main product of fermentation is ethanol, and wines contain as much as 16% (v/v) of this compound. Ethanol inhibits yeast growth and viability as it negatively affects membrane integrity as well as intracellular and membrane-related processes (42-46). Ethanol, at concentrations affecting growth and fermentation rates (3-10% (v/v)), causes potent activation of the plasma membrane H+-ATPase, a probable mechanism to regulate the yeast cell’s internal pH (47). Although ethanol is the main stress factor during fermentation, relatively little is known of its effect on the transcriptome and proteome of yeast. The short-term response of a laboratory strain of S. cerevisiae to a 7% (v/v) ethanol spike during early exponential phase was investigated by global expression analysis (42). The yeast responded by increasing expression of genes involved in energy metabolism, stress response, protein trafficking, and ionic homeostasis. Comparison with the Alexandre et al. data set (42) reveals that 41 ethanol stress genes were induced during fermentation; included within the overlapping genes are the heat shock genes HSP104, HSP26, HSP30, HSP42, HSP78, SSA1, SSA4 and SSE1, and the ethanol stress gene GRE3. Other ethanol stress genes that were not activated in the general response in our study included HSP12, GPD1, ALD2, HSP82, HOR2 and DAK1. Ten short-term ethanol stress genes are present in the FSR, comprising 5% of the FSR (Table 4.2).

A genome-wide screen of the yeast deletion collection for ethanol-sensitive mutants on complex medium supplemented with 2% glucose and 6% ethanol identified 46 mutants with impaired growth (31). Genes that were required for growth included those involved in the general stress pathway, cell integrity pathway, vacuolar function and mitochondrial function. Two of these (BEM2, SIT4) exhibited slow growth when further tested in fermentations supplemented with 20% glucose.
None of the identified genes are part of the FSR; eight were induced in clusters 7-12 and ten were repressed in clusters 15-20. However, other genes with similar functional profiles are present in the FSR, such as genes involved in vacuolar transport and mitochondrial function (Table 4.2), suggesting the activation of these pathways to regulate the response to ethanol.

4.4.5 Osmotic stress

The high sugar concentration (21.4% w/v) in the Riesling grape must is a source of osmotic stress to the yeast cell. This stress is partially relieved during fermentation due to the conversion of sugar to ethanol and CO2, as evidenced by an increase in water activity from an initial value of 0.965 to 0.987 at the last sampling point when ethanol is approximately 10% (v/v). The osmo-regulatory response of S. cerevisiae results in the enhanced production and intracellular accumulation of glycerol as the main compatible solute to counter-balance osmotic pressure (48). This is mediated by the High Osmolarity Glycerol (HOG) pathway (49). Genes induced by high salt or high sorbitol were compared to the genes in this data set; a total of 104 of 186 genes induced by high salt/sorbitol (50) were induced during the course of fermentation; 22 of these genes were present in the FSR (Table 4.2). Analysis of gene expression patterns in a wine yeast strain subjected to 40% (w/v) sugar stress identified 589 genes with altered expression patterns (16). It was found that 232 of the sugar-induced stress genes are also induced during fermentation; 50 of these are in the FSR (Table 4.2). Osmotic stress-responding genes, based on the union of the Rep et al. (50) and Erasmus et al. (16) data sets, comprise 27% of the FSR, indicating that osmotic stress plays a role in the FSR.

4.4.6 Nature of the signal for entry into stationary growth

Entry into stationary phase is common to adverse conditions and is reflected by the dramatic reduction of ribosome biogenesis (51-55).
Consistently, the majority of repressed genes are preferentially annotated to ribosome biogenesis and assembly (Figure 4.3). Analysis of TFBS motifs in repressed genes shows enrichment of the RRPE (ribosomal RNA processing element), PAC (Polymerase A and C) and RAP1 (repressor activating protein 1) motifs (Appendix 3 Table S2). These motifs are conserved within the promoters of many genes implicated in cell growth and ribosome synthesis (56-60).

Under laboratory conditions (i.e. low concentrations of fermentable carbon sources, such as the 2% glucose present in standard media), yeast cells enter diauxic growth when glucose is depleted and subsequently enter into stationary phase when ethanol is depleted (6;61). Carbon sources never become depleted during vinification, and carbon depletion, therefore, cannot be responsible for entry of yeast cells into stationary phase during fermentation. Nitrogen sources, however, can become limiting during fermentation, and nitrogen limitation is the main cause of problematic fermentations. The data show that yeast cells in fermenting grape musts enter stationary phase as early as 32 hours after inoculation, when glucose is in excess and the ethanol concentration is ~2.0% (v/v) (Figure 4.5). This is consistent with reports by Rossignol et al. (14). In fermentations containing different concentrations of nitrogen, the yeast cells entered into stationary phase at the same time point, i.e. when the ethanol concentration was ~2.0% (v/v), independent of nitrogen or carbon depletion (Figure 4.5). Despite excess nitrogen, no further growth was observed after 50 hours when the ethanol concentration reached ~5.0% (v/v). Similar results were obtained in Yeast Nitrogen Base (Difco) containing 21.4% sugars (equimolar amounts of glucose and fructose). Ethanol produced during fermentation thus seems to be the trigger for entry into stationary phase.
4.5 Conclusions

Wine fermentations subject yeast to a barrage of stressors, including osmotic pressure, hypoxia, nitrogen depletion, and increasing ethanol concentrations. Global genomic expression patterns over the duration of fermentation revealed an integrated compendium of stress responses as well as a novel long-term adaptive response, which we refer to as an FSR. Approximately 28% of FSR genes have not yet been characterized. FSR genes exhibit sustained and dramatic induction under fermentation conditions, and further studies will be required to elucidate their roles during wine making conditions that differ considerably from standard laboratory conditions. These results suggest that ethanol acts as a signal that activates a hitherto unidentified ethanol signal transduction pathway regulating genes in the FSR. The identification and characterization of regulatory circuits that govern the FSR will provide insight into the remarkable ability of S. cerevisiae to flourish and ferment grape must and then survive the hostile environment of wine for months. A concerted study of yeast during wine fermentation will also lead to annotation of many of the orphan genes in the FSR. We cannot, however, exclude the possibility that fermentation of different grape musts that vary in composition might reveal more FSR genes. Finally, our results indicate that, contrary to previous reports, growth arrest of yeast cells was not due to depletion of carbon or nitrogen in fermenting grape must; ethanol seems to be the trigger for entry into stationary phase. These data suggest that studies restricted to standard laboratory conditions are inadequate to understand the regulation of yeast metabolism in industrial fermentations, and that the regulatory role of ethanol during wine fermentation should be explored.

4.6 References

1. Pena-Castillo,L. and Hughes,T.R. (2007) Why are there still over 1000 uncharacterized yeast genes? Genetics, 176, 7-14.
2. Hohmann,S. and Mager,W.H.
(2003) Yeast Stress Responses. Springer-Verlag, Berlin.
3. Gasch,A.P., Spellman,P.T., Kao,C.M., Carmel-Harel,O., Eisen,M.B., Storz,G., Botstein,D. and Brown,P.O. (2000) Genomic expression programs in the response of yeast cells to environmental changes. Mol.Biol.Cell, 11, 4241-4257.
4. Causton,H.C., Ren,B., Koh,S.S., Harbison,C.T., Kanin,E., Jennings,E.G., Lee,T.I., True,H.L., Lander,E.S. and Young,R.A. (2001) Remodeling of yeast genome expression in response to environmental changes. Mol.Biol.Cell, 12, 323-337.
5. Schuller,H.J. (2003) Transcriptional control of nonfermentative metabolism in the yeast Saccharomyces cerevisiae. Curr.Genet., 43, 139-160.
6. DeRisi,J.L., Iyer,V.R. and Brown,P.O. (1997) Exploring the metabolic and genetic control of gene expression on a genomic scale. Science, 278, 680-686.
7. De Vit,M.J., Waddle,J.A. and Johnston,M. (1997) Regulated nuclear translocation of the Mig1 glucose repressor. Mol.Biol.Cell, 8, 1603-1618.
8. Nehlin,J.O. and Ronne,H. (1990) Yeast MIG1 repressor is related to the mammalian early growth response and Wilms' tumour finger proteins. EMBO J., 9, 2891-2898.
9. Devantier,R., Scheithauer,B., Villas-Boas,S.G., Pedersen,S. and Olsson,L. (2005) Metabolite profiling for analysis of yeast stress response during very high gravity ethanol fermentations. Biotechnol.Bioeng., 90, 703-714.
10. Puig,S. and Perez-Ortin,J.E. (2000) Stress response and expression patterns in wine fermentations of yeast genes induced at the diauxic shift. Yeast, 16, 139-148.
11. Varela,C., Cardenas,J., Melo,F. and Agosin,E. (2005) Quantitative analysis of wine yeast gene expression profiles under winemaking conditions. Yeast, 22, 369-383.
12. Zuzuarregui,A. and del Olmo,M.L. (2004) Expression of stress response genes in wine strains with different fermentative behavior. FEMS Yeast Res., 4, 699-710.
13. Zuzuarregui,A., Monteoliva,L., Gil,C. and del Olmo,M.
(2006) Transcriptomic and proteomic approach for understanding the molecular basis of adaptation of Saccharomyces cerevisiae to wine fermentation. Appl.Environ.Microbiol., 72, 836-847.
14. Rossignol,T., Dulau,L., Julien,A. and Blondin,B. (2003) Genome-wide monitoring of wine yeast gene expression during alcoholic fermentation. Yeast, 20, 1369-1385.
15. Marks,V.D., van der Merwe,G.K. and van Vuuren,H.J. (2003) Transcriptional profiling of wine yeast in fermenting grape juice: regulatory effect of diammonium phosphate. FEMS Yeast Res., 3, 269-287.
16. Erasmus,D.J., van der Merwe,G.K. and van Vuuren,H.J. (2003) Genome-wide expression analyses: Metabolic adaptation of Saccharomyces cerevisiae to high sugar stress. FEMS Yeast Res., 3, 375-399.
17. Irizarry,R.A., Hobbs,B., Collin,F., Beazer-Barclay,Y.D., Antonellis,K.J., Scherf,U. and Speed,T.P. (2003) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4, 249-264.
18. Cherry,J.M., Adler,C., Ball,C., Chervitz,S.A., Dwight,S.S., Hester,E.T., Jia,Y., Juvik,G., Roe,T., Schroeder,M. et al. (1998) SGD: Saccharomyces Genome Database. Nucleic Acids Res., 26, 73-79.
19. Storey,J.D. and Tibshirani,R. (2003) Statistical significance for genomewide studies. Proc.Natl.Acad.Sci.U.S.A, 100, 9440-9445.
20. Bryan,J. (2004) Problems in gene clustering based on gene expression data. J.Multivariate Anal., 90, 44-66.
21. R Development Core Team. R: A language and environment for statistical computing. http://www.r-project.org . 2006. Vienna, Austria, R Foundation for Statistical Computing.
22. Gautier,L., Cope,L., Bolstad,B.M. and Irizarry,R.A. (2004) affy--analysis of Affymetrix GeneChip data at the probe level. Bioinformatics, 20, 307-315.
23. Gentleman,R.C., Carey,V.J., Bates,D.M., Bolstad,B., Dettling,M., Dudoit,S., Ellis,B., Gautier,L., Ge,Y., Gentry,J. et al. (2004) Bioconductor: open software development for computational biology and bioinformatics.
Genome Biol., 5, R80.
24. Dabney,A. and Storey,J. qvalue: Q-value estimation for false discovery rate control. (R package version 1.1). 2006.
25. Ashburner,M., Ball,C.A., Blake,J.A., Botstein,D., Butler,H., Cherry,J.M., Davis,A.P., Dolinski,K., Dwight,S.S., Eppig,J.T. et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat.Genet., 25, 25-29.
26. Sandelin,A., Hoglund,A., Lenhard,B. and Wasserman,W.W. (2003) Integrated analysis of yeast regulatory sequences for biologically linked clusters of genes. Funct.Integr.Genomics, 3, 125-134.
27. Lenhard,B. and Wasserman,W.W. (2002) TFBS: Computational framework for transcription factor binding site analysis. Bioinformatics, 18, 1135-1136.
28. Ho Sui,S.J., Mortimer,J.R., Arenillas,D.J., Brumm,J., Walsh,C.J., Kennedy,B.P. and Wasserman,W.W. (2005) oPOSSUM: identification of over-represented transcription factor binding sites in co-expressed genes. Nucleic Acids Res., 33, 3154-3164.
29. Fleiss,J.L. (1981) Statistical methods for Rates and Proportions. John Wiley, New York.
30. Benjamini,Y. and Hochberg,Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J.R.Statist.Soc.B, 57, 289-300.
31. van Voorst,F., Houghton-Larsen,J., Jonson,L., Kielland-Brandt,M.C. and Brandt,A. (2006) Genome-wide identification of genes required for growth of Saccharomyces cerevisiae under ethanol stress. Yeast, 23, 351-359.
32. Gancedo,J.M. (1998) Yeast carbon catabolite repression. Microbiol.Mol.Biol.Rev., 62, 334-361.
33. Johnston,M. (1999) Feasting, fasting and fermenting. Glucose sensing in yeast and other cells. Trends Genet., 15, 29-33.
34. Young,E.T., Dombek,K.M., Tachibana,C. and Ideker,T. (2003) Multiple pathways are co-regulated by the protein kinase Snf1 and the transcription factors Adr1 and Cat8. J.Biol.Chem., 278, 26146-26158.
35. Dennis,R.A. and McCammon,M.T.
(1999) Acn9 is a novel protein of gluconeogenesis that is located in the mitochondrial intermembrane space. Eur.J.Biochem., 261, 236-243.
36. Lodi,T., Alberti,A., Guiard,B. and Ferrero,I. (1999) Regulation of the Saccharomyces cerevisiae DLD1 gene encoding the mitochondrial protein D-lactate ferricytochrome c oxidoreductase by HAP1 and HAP2/3/4/5. Mol.Gen.Genet., 262, 623-632.
37. Moreno,F. and Herrero,P. (2002) The hexokinase 2-dependent glucose signal transduction pathway of Saccharomyces cerevisiae. FEMS Microbiol.Rev., 26, 83-90.
38. Entian,K.D., Kopetzki,E., Frohlich,K.U. and Mecke,D. (1984) Cloning of hexokinase isoenzyme PI from Saccharomyces cerevisiae: PI transformants confirm the unique role of hexokinase isoenzyme PII for glucose repression in yeasts. Mol.Gen.Genet., 198, 50-54.
39. Rose,M., Albig,W. and Entian,K.D. (1991) Glucose repression in Saccharomyces cerevisiae is directly associated with hexose phosphorylation by hexokinases PI and PII. Eur.J.Biochem., 199, 511-518.
40. Ahuatzi,D., Riera,A., Pelaez,R., Herrero,P. and Moreno,F. (2007) Hxk2 regulates the phosphorylation state of Mig1 and therefore its nucleocytoplasmic distribution. J.Biol.Chem., 282, 4485-4493.
41. Diderich,J.A., Raamsdonk,L.M., Kuiper,A., Kruckeberg,A.L., Berden,J.A., Teixeira de Mattos,M.J. and Van Dam,K. (2002) Effects of a hexokinase II deletion on the dynamics of glycolysis in continuous cultures of Saccharomyces cerevisiae. FEMS Yeast Res., 2, 165-172.
42. Alexandre,H., Ansanay-Galeote,V., Dequin,S. and Blondin,B. (2001) Global gene expression during short-term ethanol stress in Saccharomyces cerevisiae. FEBS Lett., 498, 98-103.
43. Ingram,L.O. and Buttke,T.M. (1984) Effects of alcohols on micro-organisms. Adv.Microb.Physiol, 25, 253-300.
44. Leao,C. and Van Uden,N. (1984) Effects of ethanol and other alkanols on passive proton influx in the yeast Saccharomyces cerevisiae. Biochim.Biophys.Acta, 774, 43-48.
45. Lloyd,D., Morrell,S., Carlsen,H.N., Degn,H., James,P.E.
and Rowlands,C.C. (1993) Effects of growth with ethanol on fermentation and membrane fluidity of Saccharomyces cerevisiae. Yeast, 9, 825-833.
46. Piper,P.W. (1995) The heat shock and ethanol stress responses of yeast exhibit extensive similarity and functional overlap. FEMS Microbiol.Lett., 134, 121-127.
47. Rosa,M.F. and Sa-Correia,I. (1991) In vivo activation by ethanol of plasma membrane ATPase of Saccharomyces cerevisiae. Appl.Environ.Microbiol., 57, 830-835.
48. Hohmann,S. (2002) Osmotic stress signaling and osmoadaptation in yeasts. Microbiol.Mol.Biol.Rev., 66, 300-372.
49. Brewster,J.L., de Valoir,T., Dwyer,N.D., Winter,E. and Gustin,M.C. (1993) An osmosensing signal transduction pathway in yeast. Science, 259, 1760-1763.
50. Rep,M., Krantz,M., Thevelein,J.M. and Hohmann,S. (2000) The transcriptional response of Saccharomyces cerevisiae to osmotic shock. Hot1p and Msn2p/Msn4p are required for the induction of subsets of high osmolarity glycerol pathway-dependent genes. J.Biol.Chem., 275, 8290-8300.
51. Jorgensen,P., Rupes,I., Sharom,J.R., Schneper,L., Broach,J.R. and Tyers,M. (2004) A dynamic transcriptional network communicates growth potential to ribosome synthesis and critical cell size. Genes Dev., 18, 2491-2505.
52. Miyoshi,K., Miyakawa,T. and Mizuta,K. (2001) Repression of rRNA synthesis due to a secretory defect requires the C-terminal silencing domain of Rap1p in Saccharomyces cerevisiae. Nucleic Acids Res., 29, 3297-3303.
53. Miyoshi,K., Shirai,C. and Mizuta,K. (2003) Transcription of genes encoding trans-acting factors required for rRNA maturation/ribosomal subunit assembly is coordinately regulated with ribosomal protein genes and involves Rap1 in Saccharomyces cerevisiae. Nucleic Acids Res., 31, 1969-1973.
54. Nomura,M. (2001) Ribosomal RNA genes, RNA polymerases, nucleolar structures, and synthesis of rRNA in the yeast Saccharomyces cerevisiae. Cold Spring Harb.Symp.Quant.Biol., 66, 555-565.
55. Warner,J.R.
(1999) The economics of ribosome biosynthesis in yeast. Trends Biochem.Sci., 24, 437-440.
56. Fingerman,I., Nagaraj,V., Norris,D. and Vershon,A.K. (2003) Sfp1 plays a key role in yeast ribosome biogenesis. Eukaryot.Cell, 2, 1061-1068.
57. Hughes,J.D., Estep,P.W., Tavazoie,S. and Church,G.M. (2000) Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J.Mol.Biol., 296, 1205-1214.
58. Li,B., Nierras,C.R. and Warner,J.R. (1999) Transcriptional elements involved in the repression of ribosomal protein synthesis. Mol.Cell Biol., 19, 5393-5404.
59. Moehle,C.M. and Hinnebusch,A.G. (1991) Association of RAP1 binding sites with stringent control of ribosomal protein gene transcription in Saccharomyces cerevisiae. Mol.Cell Biol., 11, 2723-2735.
60. Wade,C., Shea,K.A., Jensen,R.V. and McAlear,M.A. (2001) EBP2 is a member of the yeast RRB regulon, a transcriptionally coregulated set of genes that are required for ribosome and rRNA biosynthesis. Mol.Cell Biol., 21, 8638-8650.
61. Johnston,M. and Carlson,M. (1992) The Molecular Biology of the Yeast Saccharomyces: Gene Expression. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY, USA.

Chapter 5: Ranking Candidate Genes for Promoter Construct Design⁵

5.1 Introduction

Gene therapy is an experimental technique that uses genes to treat or prevent disease. Strategies include replacing a nonfunctional gene that causes disease with a functioning copy of the gene, inactivating or repairing a mutated gene that is functioning incorrectly, or introducing a new gene into the body to help fight a disease. Gene therapy holds promise for many diseases, with the majority of clinical trials to date focused on cancer, inherited monogenic diseases and cardiovascular disease (1).
Despite preclinical success, clinical trials have highlighted significant challenges, such as problems integrating therapeutic DNA into the genome and the risk of an immune response. Another major challenge is to deliver and express genes in the appropriate cells without harming non-target cells (2). One approach, called “transcriptional targeting”, uses promoters that drive expression of the gene in a tissue- or tumour-specific manner, and has the potential to improve the safety and efficacy of gene therapy (3).

Transcriptional targeting requires (i) the identification of genes selectively expressed in the cell type of interest, and (ii) the identification of DNA regulatory sequences that direct the expression of those genes. The former is readily addressed using results from gene expression profiling studies. However, isolating tissue-specific regulatory sequences to construct a targeted promoter is more difficult.

Comparative sequence analysis can delineate potential regulatory regions from alignments of orthologous sequences from organisms separated by varying evolutionary distances. This approach, termed “phylogenetic footprinting,” rests on the expectation that orthologous sequences that serve a function common to the species being considered will have accumulated fewer changes relative to neutral DNA, and can therefore be detected as conserved regions in alignments. Sequence comparisons of individual genes and long genomic regions have confirmed that conserved noncoding sequences between related species do, indeed, harbour regulatory elements (4-7). Some regulatory regions have been conserved through long periods of evolution, as is the case for enhancers regulating key developmental processes in vertebrates (7-10).

⁵ A version of this chapter will be submitted for publication. Ho Sui, S.J., Portales-Casamar, E. and Wasserman W.W. Ranking Candidate Genes for Promoter Construct Design.
Many algorithms exist to detect conservation within pairwise alignments of sequences (11-14), as well as across alignments of sequences from multiple species (15-19). Conservation within pairwise alignments is determined using defined sequence identity thresholds, while algorithms that identify multi-species conserved sequences consider both evolutionary distance and phylogenetic relationships (15;16). PhastCons, one commonly used approach for detecting multi-species sequence conservation, uses a phylogenetic hidden Markov model to estimate the likelihood that a particular sequence is among the most highly conserved in a genome, allowing for mutation rate variation in different lineages (15). Features other than degree of constraint can also discriminate regulatory regions from neutral DNA: regulatory potential (RP) scores measure how closely the patterns in human-rodent alignments resemble those in experimentally defined regulatory regions (20;21). Both PhastCons and RP have been successful at identifying cis-regulatory elements in human promoters (6;22).

Based on whole-genome comparisons of multiple vertebrate species, numerous regulatory elements have been predicted to exist in the human genome (23;24). Conservation of noncoding sequence, however, is insufficient to prove regulatory function. Instead, the functional characterization of a sequence as an enhancer or repressor element requires a reporter gene assay or similar laboratory study. Constructing a reporter system involves placing the putative regulatory sequence upstream of a reporter gene, such as green fluorescent protein (GFP), luciferase or lacZ, whose expression changes measurably in response to the action of the sequence.
The most convincing evidence for enhancer activity is derived from in vivo reporter assays that make use of transgenic animals (8;25). As these experiments are costly, slow and laborious, most confirmatory studies to date have been performed in cell culture.

Several groups have now begun the important task of characterizing potential regulatory sequences on a larger scale. Woolfe et al. used a rapid in vivo assay system in zebrafish embryos to confirm tissue-specific enhancer activity for 23 of 25 noncoding regions highly conserved between human and the pufferfish, Fugu rubripes (10). Pennacchio et al. characterized the in vivo enhancer activity of 167 extremely conserved sequences using transgenic mouse enhancer assays, and reported that 45% functioned reproducibly as tissue-specific enhancers during embryonic development (8). High-throughput luciferase-based transient transfection assays in human cultured cell lines validated 25% of 163 regulatory region predictions based on ChIP-on-chip (chromatin immunoprecipitation combined with genomic microarrays) data (26), as well as 91% of 152 predicted promoters located directly upstream of full-length cDNA transcripts (27). Based on the work of Bronson et al. (28), Farhadi et al. used single-copy reporter constructs in transgenic mice to evaluate the regulatory significance of conserved noncoding MBP sequences (29). The Pleiades Promoter Project (www.pleiades.org) is using a similar approach to generate and test conserved noncoding regions from human and mouse orthologs for their ability to drive tissue-specific expression in transgenic mice.
Starting from a list of 2780 genes identified as being selectively expressed in brain (De Souza et al., manuscript in preparation), the end goal of the Pleiades Promoter Project is to generate 220 fully characterized, human DNA MiniPromoters (less than 4 kb) to drive gene expression in defined brain regions of therapeutic interest for diseases such as Alzheimer's and Parkinson's (30).

The gene selection for MiniPromoter design is performed on a gene-by-gene basis. After evaluating the literature reporting known regulatory sequences for the gene of interest, curators examine the gene in a genome browser, such as the UCSC Genome Browser (31), seeking features that aid the design of a promoter construct. These include a well-defined transcription start site (TSS), defined boundaries for the promoter region (determined by examining the location of upstream, downstream and overlapping genes, as well as the length of the gene), and the presence of highly conserved noncoding sequences within the defined region that can be incorporated in the promoter-reporter construct. Genes that contain a small number of easily distinguished, highly conserved noncoding regions proximal to their TSSs are favourable candidates for MiniPromoter design, compared to genes that have evolved slowly or genes for which the noncoding sequence has diverged beyond our ability to detect conservation. Slowly evolving genes contain many blocks of conserved noncoding sequence that span the entire promoter, requiring a prohibitively large number of experiments to decipher their regulation, while phylogenetic footprinting of highly diverged genes provides little insight into where regulatory regions may reside. Manual curation of each gene candidate for promoter construct design is slow and tedious; a method to rapidly rank genes would therefore be useful.
Here we describe the development of a "regulatory resolution score" to prioritize candidate genes and guide promoter construct design based on conservation profiles. The score is intended to reflect human perception of what constitutes a good candidate gene for further study of regulatory function in the laboratory.

5.2 Methods

5.2.1 Identification of the boundaries for analysis

For each gene in the Ensembl human genome database (version 46) (32), we define the transcription start site (TSS) as the start position of the 5’-most exon annotated for the gene. We then determine the boundaries of the region to be analyzed relative to the TSS as follows. In most cases, the upstream boundary is defined as the start or end position of the upstream gene (depending on the upstream gene’s orientation). If the upstream gene is very close, i.e. less than 1 kb from the start site of the gene of interest, we extend the analysis to introns of the upstream gene located within 10 kb of the TSS. In most cases, the downstream boundary is the end of the gene of interest. However, if the gene is long, only intronic regions within 30 kb downstream of the TSS are used. If the gene is short, we extend the downstream boundary to 2 kb from the TSS of the gene of interest.

5.2.2 Nonexonic conserved regions

PhastCons scores and PhastCons conserved elements computed from 17-way vertebrate multi-species alignments (15) were downloaded from the UCSC Genome Browser database (31). Only PhastCons conserved elements that are both 20 bp or longer and non-overlapping with annotated human mRNAs or Ensembl human gene annotations are retained for analysis. PhastCons conserved elements residing within 100 bp of one another are chained together (excluding the intervening regions), and considered part of a single longer conserved region.
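The filtering and chaining steps of Section 5.2.2 can be sketched as follows. This is a minimal illustration rather than the code used for the analysis: the function names, the half-open (start, end) coordinate convention, and the reading of "within 100 bp" as a gap of at most 100 bp are all assumptions made for the example, and the coordinates are hypothetical.

```python
def chain_conserved_elements(elements, exonic, min_len=20, max_gap=100):
    """Filter conserved elements and chain close neighbours into regions.

    elements, exonic: lists of (start, end) half-open intervals on one
    chromosome (an assumed convention). Returns a list of chains, each a
    list of member elements; intervening sequence is not included.
    """
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    # Keep elements that are long enough and do not overlap annotated
    # mRNAs or gene annotations.
    kept = sorted(
        e for e in elements
        if e[1] - e[0] >= min_len and not any(overlaps(e, x) for x in exonic)
    )

    chains = []
    for e in kept:
        # Distance from the end of the previous kept element to this one.
        if chains and e[0] - chains[-1][-1][1] <= max_gap:
            chains[-1].append(e)
        else:
            chains.append([e])
    return chains


def conserved_length(chain):
    """Total conserved bases in a chain, excluding gaps between members."""
    return sum(end - start for start, end in chain)


# Hypothetical coordinates, for illustration only.
elements = [(0, 30), (100, 125), (400, 410), (500, 560), (900, 950)]
exonic = [(890, 1000)]
regions = chain_conserved_elements(elements, exonic)
for region in regions:
    print(region, conserved_length(region))
```

In this toy input the 10 bp element and the element overlapping an annotated transcript are discarded, and the first two elements merge into a single conserved region of 55 conserved bases.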
5.2.3 Score definition

Genes containing only a few well-defined regions predicted by conservation analysis to function as regulatory inputs make the easiest targets for promoter construct design. To facilitate rapid ranking of genes for promoter construct design based on their conservation profiles, we define a raw regulatory resolution score as:

\[ \text{raw score} = \log_{10}\left( \frac{\sum_{i=1}^{n} l_i \, (c_i - b)}{n^{2}} \right) \]

where l_i is the length of conserved region i, c_i is the “conservation level” of that region (i.e. the mean phastCons score for the conserved region), b is the baseline conservation level (i.e. the mean phastCons score for the entire region analyzed) and n is the number of conserved regions. Thus, for each conserved region, we consider the amount of conserved sequence and how well-distinguished the region is from the background, and we penalize genes with many conserved regions.

After computing the raw score, we normalize it to obtain a value between 0 and 1 using the following formula:

\[ \text{normalized score} = \frac{\text{raw score} - \min(\text{raw score})}{\max(\text{raw score}) - \min(\text{raw score})} \]

where the minimum and maximum are taken over all genes. The resulting normalized score is henceforth referred to as the “regulatory resolution score” or simply, the “score.”

5.2.4 Manual promoter interpretation

Promoters for 317 genes were manually assessed for the Pleiades Project. Suitability for promoter construct design was annotated based on a number of gene features, including (i) published literature describing experimentally defined regulatory sequences for the gene of interest, (ii) the location of the transcription start points, (iii) the boundaries of analysis, i.e. the amount of noncoding sequence to be analyzed upstream and downstream of the gene of interest, and (iv) the number and qualitative conservation level of conserved regions located proximal to the TSS within the defined boundaries.
The 317 genes were assigned to one of five classes describing their suitability for MiniPromoter design: (1) No (not suitable), (2) Unfavourable, (3) OK (fair), (4) Favourable, and (5) Yes (good candidate).

5.3 Results

5.3.1 Genome-wide distribution of regulatory resolution scores

Regulatory resolution scores were computed for all Ensembl human genes (Figure 5.1). Of 22298 genes tested, 2411 did not contain any conserved PhastCons elements (Figure 5.1B), and we therefore could not compute regulatory resolution scores for these genes. The distribution of scores is slightly skewed toward lower values, with a median score of 0.34 and a mean score of 0.36 (Figure 5.1A). Genes with up to 5 conserved regions receive higher scores (Figure 5.1C), with genes in the top 20% of scores having an average of 2.2 conserved regions per gene. High scores are also preferentially assigned to genes with less than 1000 bp of conserved nonexonic nucleotides (Figure 5.1D), with an average of 330 bp of conserved sequence per gene for genes scoring in the top 20%.

5.3.2 Features of genes with the highest, average and lowest regulatory resolution

The ADCK5 locus was assigned the highest regulatory resolution score due to the presence of a single, fairly long (1277 bp), very highly conserved region within the upstream intergenic region (Figure 5.2A). Two smaller conserved regions directly upstream of ADCK5 in the “17-Way Most Cons” track are excluded from the analysis: the larger of the two overlaps with human mRNAs, while the smaller conserved element is only 10 bp long (Figure 5.2A). The low baseline conservation level across the entire region further contributed to the high score. ADCK5 encodes a putative protein kinase whose function is not yet clear (33). Characterization of the large conserved region upstream of this gene could provide clues to its function.
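To illustrate why a single long, well-separated conserved region scores so highly, the score from Section 5.2.3 can be sketched in a few lines. The sketch reads the raw score as the base-10 logarithm of the summed l(c − b) over all conserved regions divided by n², followed by min-max normalization across genes. The function names, the handling of undefined cases, and every conservation level and baseline below are assumptions chosen for illustration; only the 1277 bp region length comes from the text.

```python
import math


def raw_score(regions, baseline):
    """Raw regulatory resolution score for one gene.

    regions: list of (length_bp, mean_phastcons), one per conserved region.
    baseline: mean phastCons score over the whole analysed region.
    Returns None when the score is undefined: no conserved regions, or a
    non-positive sum (a case the text does not discuss).
    """
    n = len(regions)
    if n == 0:
        return None
    total = sum(length * (cons - baseline) for length, cons in regions)
    if total <= 0:
        return None
    return math.log10(total / n ** 2)


def normalize(raw_scores):
    """Min-max scale the defined raw scores to [0, 1]; None stays None."""
    defined = [s for s in raw_scores if s is not None]
    if not defined:
        return [None for _ in raw_scores]
    lo, hi = min(defined), max(defined)
    if hi == lo:
        # Degenerate case (all scores equal); returning 0.0 is arbitrary.
        return [None if s is None else 0.0 for s in raw_scores]
    return [None if s is None else (s - lo) / (hi - lo) for s in raw_scores]


# Hypothetical loci: (conserved regions, baseline conservation level).
loci = {
    "one_long_region": ([(1277, 0.90)], 0.05),
    "four_short_regions": ([(60, 0.70), (50, 0.70), (40, 0.70), (36, 0.70)], 0.10),
    "ninety_small_regions": ([(103, 0.80)] * 90, 0.45),
    "no_conserved_regions": ([], 0.02),
}
names = list(loci)
raws = [raw_score(*loci[name]) for name in names]
scores = dict(zip(names, normalize(raws)))
for name in names:
    print(name, scores[name])
```

With these invented inputs the single-region locus receives the maximum score and the locus with ninety regions the minimum, even though it contains far more conserved sequence, because the n² denominator and the higher baseline both work against it. The locus without conserved regions has no score, mirroring the genes for which no score could be computed.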
The ELOVL3 locus (Figure 5.2B) encodes a long-chain fatty acid elongase found in liver, skin, and brown adipose tissues (34). In terms of its regulatory resolution score, ELOVL3 is an example of an average gene, receiving a score of 0.35, close to the genome-wide mean of 0.36, for four small, fairly conserved nonexonic regions containing a total of 186 bp of sequence within the boundaries of the analysis.

The lowest scoring gene, NR4A3, encodes a member of the steroid-thyroid hormone-retinoid receptor superfamily, and may act as a transcriptional activator (35). NR4A3 owes its low score to the presence of 90 small conserved regions, containing a total of 9281 bp of nonexonic sequence, that are distributed across the entire locus (Figure 5.2C). The NR4A proteins are among the most evolutionarily ancient nuclear receptors (36). The majority of the NR4A3 locus is conserved, and the conservation profile reveals few insights into the location of potential regulatory regions for targeted promoter construct design.

Figure 5.1 Genome-wide distribution of regulatory resolution scores
(A) Histogram of scores. (B) Summary statistics showing the score by quartiles (Qu.), as well as the median and mean score: Min 0.00; 1st Qu. 0.25; Median 0.34; Mean 0.36; 3rd Qu. 0.45; Max 1.00; NA’s 2411. (C) Boxplot showing the distribution of the number of conserved regions by score intervals. (D) Boxplot showing the distribution of the number of conserved bases by score intervals. The boxes in both boxplots are drawn with widths proportional to the square-roots of the number of observations in the groups.

Figure 5.2 Genes with (A) the highest, (B) average and (C) the lowest regulatory resolution scores
(A) ADCK5: 1277 conserved bases in 1 conserved region. Score = 1.00. (B) ELOVL3: 186 conserved bases in 4 conserved regions. Score = 0.35. (C) NR4A3: 9281 conserved bases in 90 conserved regions. Score = 0.00. Each screenshot from the UCSC Genome Browser displays: conserved nonexonic elements (CNEs) used in the analysis (see Methods); UCSC gene predictions based on RefSeq, GenBank and UniProt data (31); transcripts for Ensembl genes based on mRNA and protein evidence (32); a dense display of human mRNAs from GenBank (37); CpG islands defined as regions with ≥50% GC content, ≥200 bp in length, and an observed CG to expected CG ratio ≥ 0.6; evolutionary conservation in 17 vertebrates, including mammalian, amphibian, bird and fish species, based on Multiz alignments (19) and PhastCons scores (15); and predictions of conserved elements produced by the PhastCons program (17-way Most Cons).

5.3.3 Regulatory resolution scores capture aspects of the manual curation process

One of the major goals of the Pleiades Promoter Project is to identify genes selectively expressed in specific brain regions and to annotate their regulatory sequences (30). Data curators manually assessed 317 genes and assigned them to one of five classes describing their suitability for promoter construct design (see Methods). The number of curated genes in each category is shown in Table 5.1. The regulatory resolution is a major contributor to the manual classification.
However, the availability of laboratory studies describing experimentally determined regulatory regions could promote a poorly resolved gene to a better class. Likewise, multiple promoters could result in a lower manual assessment for a well-resolved gene.

Table 5.1 Number of genes in each Pleiades promoter assessment class
Manual assessment | No | Unfavourable | OK | Favourable | Yes | Total
Number of genes | 41 | 42 | 85 | 105 | 44 | 317

We compared the regulatory resolution scores for the 317 genes to their curated classifications, and plotted the distribution of scores within each category (Figure 5.3A). Though the distributions overlap among the different classes, in general, the scores increase as the promoter classification becomes more favourable (Figure 5.3B). This indicates that the regulatory resolution scores are capable of capturing and automating some aspects of the manual curation procedure. The increases in scores from the “No” category to the “Favourable” and “Yes” categories are significant (p < 1e-05 in both cases; Wilcoxon test), as are the increases in scores from the “Unfavourable” to the “Favourable” and “Yes” categories (p < 0.01 in both cases; Wilcoxon test). Surprisingly, there was little difference in scores between the “Unfavourable” and “OK” categories.

Of the initial set of 2780 genes, 62 have been approved by the Pleiades Curation Team for MiniPromoter design (57/62 were included in the 317 curated gene set) (Table 5.2). Regulatory resolution scores for the approved genes are higher than those for the 255 genes remaining from the curated set (p = 6.3e-04; Wilcoxon test).

Figure 5.3 Application of regulatory resolution scores to manually curated promoters
(A) Boxplot showing the score distribution for genes in manually curated promoter classes describing the suitability for MiniPromoter design: (1) No, (2) Unfavourable, (3) OK, (4) Favourable, and (5) Yes.
The boxes in the boxplot are drawn with widths proportional to the square-roots of the number of observations in the groups. (B) Mean and median scores versus the manual promoter assessment.

Table 5.2 Pleiades Promoter Project genes approved for MiniPromoter design
Gene | Description | Score | Manual Assessment
ADORA2A | Adenosine A2a receptor | 0.52 | 3
ATP6V1C2 | vacuolar H+ ATPase C2 isoform b | 0.41 | 4
AVP | arginine vasopressin | 0.65 | 5
C8orf46 | RIKEN mouse cDNA 3110035E14 gene, mRNA | 0.31 | 4
CARTPT | Cocaine- and amphetamine-regulated transcript protein precursor | 0.29 | 3
CCKBR | Gastrin/cholecystokinin type B receptor | 0.33 | 4
CCL27 | Small inducible cytokine A27 precursor | 0.32 | 3
CD68 | Macrosialin precursor | 0.40 | 5
CLDN5 | Claudin 5 | 0.47 | 4
CRH | Corticoliberin precursor | 0.32 | 5
CRLF1 | cytokine receptor-like factor 1 | 0.30 | 5
CX3CR1 | CX3C chemokine receptor 1 | 0.30 | 4
DBH | dopamine beta hydroxylase | 0.38 | 5
DCX | Doublecortin | 0.13 | 4
DDC | dopa decarboxylase | 0.66 | 4
DRD1 | dopamine receptor D1 (mouse dopamine receptor D1A) | 0.50 | 5
FEV | FEV (ETS oncogene family) | 0.40 | 4
FEZF2 | zinc finger protein 312 | 0.27 | 5
FIBCD1 | fibrinogen C domain containing 1 | 0.23 | 3
GABRA6 | gamma-aminobutyric acid (GABA-A) receptor, subunit alpha 6 | 0.21 | 5
GAL | Galanin | 0.61 | 5
GCHFR | GTP cyclohydrolase I feedback regulator | 0.48 | 5
GFAP | Glial fibrillary acidic protein, astrocyte | 0.24 | 5
GPR88 | G-protein coupled receptor 88 | 0.52 | 4
GPX3 | glutathione peroxidase 3 | 0.42 | 4
GRP | gastrin releasing peptide | 0.32 | N/A
HAP1 | huntington-associated protein 1 | 0.41 | N/A
HBEGF | heparin-binding EGF-like growth factor | 0.19 | 4
HCRT | Hypocretin | 0.69 | 4
HSPA12B | Heat shock 70kD protein 12B | 0.35 | 4
HTR1A | 5-hydroxytryptamine 1A receptor (5-HT-1A) (Serotonin receptor 1A) | 0.26 | N/A
ICMT | isoprenylcysteine carboxyl methyltransferase | 0.46 | 3
LCT | Lactase-phlorizin hydrolase precursor | 0.40 | 3
MKI67 | Antigen KI-67 | 0.33 | 4
NR2E1 | Orphan nuclear receptor NR2E1 | 0.25 | 3
NTSR1 | neurotensin receptor 1 | 0.36 | 4
OLIG1 | oligodendrocyte transcription factor 1 | 0.49 | 4
OXT | Oxytocin-neurophysin 1 precursor | 0.51 | 5
PCP2 | Purkinje cell protein 2 (L7) | 0.48 | N/A
PITX3 | Pituitary homeobox 3 | 0.37 | 5
PKP2 | plakophilin 2 | 0.16 | 4
POGZ | Pogo transposable element with ZNF domain | 0.15 | 3
RAMP3 | receptor (calcitonin) activity modifying protein 3 | 0.59 | 3
RGS16 | regulator of G-protein signaling 16 | 0.54 | 5
RLBP1L2 | retinaldehyde binding protein 1-like 2 | 0.20 | 4
S100B | S100 protein, beta polypeptide, neural | 0.37 | 4
SLC6A2 | solute carrier family 6 (neurotransmitter transporter, noradrenalin), member 2 | 0.28 | 4
SLC6A3 | solute carrier family 6 (neurotransmitter transporter, dopamine), member 3 | 0.42 | N/A
SLC6A4 | solute carrier family 6 (neurotransmitter transporter, serotonin), member 4 | 0.23 | 4
SLC6A5 | solute carrier family 6 (neurotransmitter transporter, glycine), member 5 | 0.27 | 4
SLC7A5 | Solute carrier family 7, member 5 | 0.47 | 4
SLITRK6 | SLIT and NTRK-like protein 6 precursor | 0.39 | 3
STX1A | syntaxin 1A (brain) | 0.25 | 3
TAC1 | tachykinin 1 | 0.33 | 5
TAC3 | tachykinin 2 (Mouse Tac2) | 0.44 | 5
TBR1 | T-box brain gene 1 | 0.37 | 5
THY1 | Thy-1 membrane glycoprotein precursor | 0.36 | 5
TNNT1 | Troponin T, slow skeletal muscle | 0.31 | 4
TRH | thyrotropin releasing hormone | 0.44 | 4
UGT8 | UDP galactosyltransferase 8A | 0.26 | 4
VIM | Vimentin | 0.30 | 4
VIP | vasoactive intestinal polypeptide | 0.35 | 4

5.4 Discussion

A major goal of the Pleiades Promoter Project is to develop and characterize 220 human DNA MiniPromoters to be used as tools to drive expression in defined brain regions. The selection process requires time-consuming and laborious analysis of candidate gene sequences. Thus, we developed a regulatory resolution score to rapidly rank genes selectively expressed in brain based on their conservation profiles, effectively guiding the promoter design process towards those genes that are more likely to produce favourable results.
We have demonstrated that the defined regulatory resolution score, when applied to PhastCons conserved regions from 17-way vertebrate multi-species comparisons, reflects aspects of the manual curation procedure, with higher scores being assigned to genes judged as favourable candidates by the Pleiades team. Large-scale screens of gene regulatory function are becoming increasingly prevalent, and methods that rapidly rank gene candidates for experimental validation are needed.

Evaluation of the regulatory resolution scores revealed that the n² term in the denominator is punitive, and as a result, most of the highest scoring genes contain fewer than five conserved regions. However, it should be noted that the chaining of PhastCons conserved elements residing within 100 bp of each other reduces the impact of this penalty. A less conservative score using the term n n in the denominator was also evaluated. Though it had some effect on the relative ranks of candidate genes, the greatest impact on ranks was on genes with scores in the 25th to 75th percentiles (data not shown).

As is the case for most bioinformatics approaches, a number of parameters have been arbitrarily selected, including the boundaries for analysis, the minimal size of PhastCons conserved elements, and the separation distance used in chaining. These parameters can be varied depending on the goals of each project and the experimental system being used to verify regulatory function. For the Pleiades Promoter Project, we wished to obtain a few short regions to create MiniPromoter DNA constructs to be tested in transgenic mice, and that could later be used in adenovirus vectors for gene therapy. However, a project focused on long-range enhancers would require analysis of more upstream and downstream sequence.
Conserved regions separated by larger intervening sequences could be chained if bacterial artificial chromosome (BAC) transgenic constructs are used; BACs can accommodate DNA inserts of up to 300 kb, allowing distal regulatory sequences located tens or hundreds of kilobases away to be included (38). Though we chose to use the PhastCons scores to measure evolutionary constraint, there is no reason why other conservation measures could not be used, including defined sequence identity thresholds for pairwise alignments of promoter regions or multi-species conserved sequence (MCS) scores (16).

Access to a regulatory resolution score will have utility for numerous projects aimed at investigating gene regulation. There are also open research questions that can be explored. Pattern discovery algorithms are used to identify motifs over-represented in a set of sequences relative to a background model. The expectation is that some of the discovered motifs function as transcription factor binding sites. MEME (Multiple EM for Motif Elicitation) allows sequences to be weighted so that each sequence contributes to the resulting motifs in proportion to its weight (39). One could weight sequences by their regulatory resolution scores such that promoter regions containing a few highly conserved regions contribute to the pattern discovery process to a greater degree. Along the same lines, using oPOSSUM (40), it would be interesting to see whether the signal in motif over-representation analysis is stronger for the better-resolved subset of genes in a co-expressed set of genes. Regulatory resolution scores could be helpful in prioritizing lists of predicted cis-regulatory modules (CRMs) to test for enhancer function, as the vast majority of computational approaches for the discovery of CRMs result in large numbers of predictions, sometimes predicting modules for hundreds of genes (41-44).
Finally, it would be interesting to see whether expression pattern is related to regulatory resolution.

5.5 Conclusions

Genome-scale projects often generate long lists of candidate genes that require experimental validation. Motivated by the Pleiades Promoter Project, we developed a score to facilitate rapid ranking of genes for in vivo enhancer assays, and managed to capture some aspects of the manual curation procedure employed by the Pleiades Curation Team. The score is a prioritization procedure, and manual efforts are still required to design the MiniPromoter constructs to be tested in transgenic mice. Its utility lies in its ability to provide a starting point and guide the MiniPromoter design process.

5.6 References

1. Edelstein,M.L., Abedi,M.R., Wixon,J. and Edelstein,R.M. (2004) Gene therapy clinical trials worldwide 1989-2004-an overview. J.Gene Med., 6, 597-602.
2. Waehler,R., Russell,S.J. and Curiel,D.T. (2007) Engineering targeted viral vectors for gene therapy. Nat.Rev.Genet., 8, 573-587.
3. Sadeghi,H. and Hitt,M.M. (2005) Transcriptionally targeted adenovirus vectors. Curr.Gene Ther., 5, 411-427.
4. Wasserman,W.W., Palumbo,M., Thompson,W., Fickett,J.W. and Lawrence,C.E. (2000) Human-mouse genome comparisons to locate regulatory sites. Nat.Genet., 26, 225-228.
5. Hardison,R.C. (2000) Conserved noncoding sequences are reliable guides to regulatory elements. Trends Genet., 16, 369-372.
6. King,D.C., Taylor,J., Zhang,Y., Cheng,Y., Lawson,H.A., Martin,J., Chiaromonte,F., Miller,W. and Hardison,R.C. (2007) Finding cis-regulatory elements using comparative genomics: some lessons from ENCODE data. Genome Res., 17, 775-786.
7. Nobrega,M.A., Ovcharenko,I., Afzal,V. and Rubin,E.M. (2003) Scanning human gene deserts for long-range enhancers. Science, 302, 413.
8. Pennacchio,L.A., Ahituv,N., Moses,A.M., Prabhakar,S., Nobrega,M.A., Shoukry,M., Minovitsky,S., Dubchak,I., Holt,A., Lewis,K.D. et al.
(2006) In vivo enhancer analysis of human conserved non-coding sequences. Nature, 444, 499-502.
9. Ahituv,N., Rubin,E.M. and Nobrega,M.A. (2004) Exploiting human-fish genome comparisons for deciphering gene regulation. Hum.Mol.Genet., 13 Spec No 2, R261-R266.
10. Woolfe,A., Goodson,M., Goode,D.K., Snell,P., McEwen,G.K., Vavouri,T., Smith,S.F., North,P., Callaway,H., Kelly,K. et al. (2005) Highly conserved noncoding sequences are associated with vertebrate development. PLoS.Biol., 3, e7.
11. Lenhard,B., Sandelin,A., Mendoza,L., Engstrom,P., Jareborg,N. and Wasserman,W.W. (2003) Identification of conserved regulatory elements by comparative genome analysis. J.Biol., 2, 13.
12. Loots,G.G., Ovcharenko,I., Pachter,L., Dubchak,I. and Rubin,E.M. (2002) rVista for comparative sequence-based discovery of functional transcription factor binding sites. Genome Res., 12, 832-839.
13. Bray,N., Dubchak,I. and Pachter,L. (2003) AVID: A global alignment program. Genome Res., 13, 97-102.
14. Brudno,M., Do,C.B., Cooper,G.M., Kim,M.F., Davydov,E., Green,E.D., Sidow,A. and Batzoglou,S. (2003) LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res., 13, 721-731.
15. Siepel,A., Bejerano,G., Pedersen,J.S., Hinrichs,A.S., Hou,M., Rosenbloom,K., Clawson,H., Spieth,J., Hillier,L.W., Richards,S. et al. (2005) Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res., 15, 1034-1050.
16. Margulies,E.H., Blanchette,M., Haussler,D. and Green,E.D. (2003) Identification and characterization of multi-species conserved sequences. Genome Res., 13, 2507-2518.
17. Margulies,E.H., Cooper,G.M., Asimenos,G., Thomas,D.J., Dewey,C.N., Siepel,A., Birney,E., Keefe,D., Schwartz,A.S., Hou,M. et al. (2007) Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome. Genome Res., 17, 760-774.
18. Boffelli,D., McAuliffe,J., Ovcharenko,D., Lewis,K.D., Ovcharenko,I., Pachter,L.
and Rubin,E.M. (2003) Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science, 299, 1391-1394.
19. Blanchette,M., Kent,W.J., Riemer,C., Elnitski,L., Smit,A.F., Roskin,K.M., Baertsch,R., Rosenbloom,K., Clawson,H., Green,E.D. et al. (2004) Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res., 14, 708-715.
20. Elnitski,L., Hardison,R.C., Li,J., Yang,S., Kolbe,D., Eswara,P., O'Connor,M.J., Schwartz,S., Miller,W. and Chiaromonte,F. (2003) Distinguishing regulatory DNA from neutral sites. Genome Res., 13, 64-72.
21. Kolbe,D., Taylor,J., Elnitski,L., Eswara,P., Li,J., Miller,W., Hardison,R. and Chiaromonte,F. (2004) Regulatory potential scores from genome-wide three-way alignments of human, mouse, and rat. Genome Res., 14, 700-707.
22. King,D.C., Taylor,J., Elnitski,L., Chiaromonte,F., Miller,W. and Hardison,R.C. (2005) Evaluation of regulatory potential and conservation scores for detecting cis-regulatory modules in aligned mammalian genome sequences. Genome Res., 15, 1051-1060.
23. Bejerano,G., Pheasant,M., Makunin,I., Stephen,S., Kent,W.J., Mattick,J.S. and Haussler,D. (2004) Ultraconserved elements in the human genome. Science, 304, 1321-1325.
24. Sandelin,A., Bailey,P., Bruce,S., Engstrom,P.G., Klos,J.M., Wasserman,W.W., Ericson,J. and Lenhard,B. (2004) Arrays of ultraconserved non-coding regions span the loci of key developmental genes in vertebrate genomes. BMC.Genomics, 5, 99.
25. Arnone,M.I. and Davidson,E.H. (1997) The hardwiring of development: organization and function of genomic regulatory systems. Development, 124, 1851-1864.
26. Trinklein,N.D., Karaoz,U., Wu,J., Halees,A., Force,A.S., Collins,P.J., Zheng,D., Zhang,Z.D., Gerstein,M.B., Snyder,M. et al. (2007) Integrated analysis of experimental data sets reveals many novel promoters in 1% of the human genome. Genome Res., 17, 720-731.
27. Trinklein,N.D., Aldred,S.J., Saldanha,A.J. and Myers,R.M.
(2003) Identification and functional analysis of human transcriptional promoters. Genome Res., 13, 308-312.
28. Bronson,S.K., Plaehn,E.G., Kluckman,K.D., Hagaman,J.R., Maeda,N. and Smithies,O. (1996) Single-copy transgenic mice with chosen-site integration. Proc.Natl.Acad.Sci.U.S.A, 93, 9067-9072.
29. Farhadi,H.F., Lepage,P., Forghani,R., Friedman,H.C., Orfali,W., Jasmin,L., Miller,W., Hudson,T.J. and Peterson,A.C. (2003) A combinatorial network of evolutionarily conserved myelin basic protein regulatory sequences confers distinct glial-specific phenotypes. J.Neurosci., 23, 10214-10223.
30. Portales-Casamar,E., Kirov,S., Lim,J., Lithwick,S., Swanson,M.I., Ticoll,A., Snoddy,J. and Wasserman,W.W. (2007) PAZAR: a framework for collection and dissemination of cis-regulatory sequence annotation. Genome Biol., 8, R207.
31. Kuhn,R.M., Karolchik,D., Zweig,A.S., Trumbower,H., Thomas,D.J., Thakkapallayil,A., Sugnet,C.W., Stanke,M., Smith,K.E., Siepel,A. et al. (2007) The UCSC genome browser database: update 2007. Nucleic Acids Res., 35, D668-D673.
32. Hubbard,T.J., Aken,B.L., Beal,K., Ballester,B., Caccamo,M., Chen,Y., Clarke,L., Coates,G., Cunningham,F., Cutts,T. et al. (2007) Ensembl 2007. Nucleic Acids Res., 35, D610-D617.
33. Nusbaum,C., Mikkelsen,T.S., Zody,M.C., Asakawa,S., Taudien,S., Garber,M., Kodira,C.D., Schueler,M.G., Shimizu,A., Whittaker,C.A. et al. (2006) DNA sequence and analysis of human chromosome 8. Nature, 439, 331-335.
34. Tvrdik,P., Westerberg,R., Silve,S., Asadi,A., Jakobsson,A., Cannon,B., Loison,G. and Jacobsson,A. (2000) Role of a new mammalian gene family in the biosynthesis of very long chain fatty acids and sphingolipids. J.Cell Biol., 149, 707-718.
35. Hedvat,C.V. and Irving,S.G. (1995) The isolation and characterization of MINOR, a novel mitogen-inducible nuclear orphan receptor. Mol.Endocrinol., 9, 1692-1700.
36. Pei,L., Castrillo,A., Chen,M., Hoffmann,A. and Tontonoz,P.
(2005) Induction of NR4A orphan nuclear receptor expression in macrophages in response to inflammatory stimuli. J.Biol.Chem., 280, 29256-29262.
37. Benson,D.A., Karsch-Mizrachi,I., Lipman,D.J., Ostell,J. and Wheeler,D.L. (2007) GenBank. Nucleic Acids Res., 35, D21-D25.
38. Yang,Z., Jiang,H., Chachainasakul,T., Gong,S., Yang,X.W., Heintz,N. and Lin,S. (2006) Modified bacterial artificial chromosomes for zebrafish transgenesis. Methods, 39, 183-188.
39. Bailey,T.L., Williams,N., Misleh,C. and Li,W.W. (2006) MEME: discovering and analyzing DNA and protein sequence motifs. Nucleic Acids Res., 34, W369-W373.
40. Ho Sui,S.J., Mortimer,J.R., Arenillas,D.J., Brumm,J., Walsh,C.J., Kennedy,B.P. and Wasserman,W.W. (2005) oPOSSUM: identification of over-represented transcription factor binding sites in co-expressed genes. Nucleic Acids Res., 33, 3154-3164.
41. Sharan,R., Ovcharenko,I., Ben Hur,A. and Karp,R.M. (2003) CREME: a framework for identifying cis-regulatory modules in human-mouse conserved segments. Bioinformatics., 19 Suppl 1, i283-i291.
42. Krivan,W. and Wasserman,W.W. (2001) A predictive model for regulatory sequences directing liver-specific transcription. Genome Res., 11, 1559-1566.
43. Zhao,G., Schriefer,L.A. and Stormo,G.D. (2007) Identification of muscle-specific regulatory modules in Caenorhabditis elegans. Genome Res., 17, 348-357.
44. Pennacchio,L.A., Loots,G.G., Nobrega,M.A. and Ovcharenko,I. (2007) Predicting tissue-specific enhancers in the human genome. Genome Res., 17, 201-211.

Chapter 6: Genomic Regulatory Blocks Underlie Extensive Microsynteny Conservation Among Insects (footnote 6)

6.1 Introduction

Long-range cis-regulation in vertebrates has recently been the focus of much attention, driven by the genome-wide discovery of highly conserved noncoding elements (HCNEs) found to span the loci of developmental regulatory genes.
After a series of observations of high levels of conservation of individual developmental enhancers, whole-genome comparisons revealed an abundance of HCNEs that tend to cluster along chromosomes. The clusters most often coincide with genes encoding developmental and differentiation-related transcription factors. Many HCNEs have been characterized as long-range enhancers, first in studies of individual genes (1-4), followed by systematic studies in zebrafish, Xenopus and mouse (5-8). Genome-wide analyses of HCNE sequences have detected several over-represented motifs that are believed to be associated with context-specific enhancer activity (9;10). The emerging model is that an array of HCNEs defines a region of regulatory inputs of its target gene(s), and that the full complement of those inputs results in the actual expression pattern of the gene (2;5;6;8). It is plausible to speculate that the genes with the most complex spatiotemporal expression should have more complex regulatory inputs. This is in full agreement with the finding that the targets of the most elaborate arrays of HCNEs are genes encoding developmental regulators and genes for proteins that regulate axonal guidance and related processes in the central nervous system (11).

Footnote 6: A version of this chapter has been published. Engström, P.G., Ho Sui, S.J., Drivenes, Ø., Becker, T.S., Lenhard, B. (2007). Genomic regulatory blocks underlie extensive microsynteny conservation among insects. Genome Research. 17:1898-1908.

Many HCNE arrays span large gene-free regions – so-called gene deserts – around their target genes (12). However, very often the regions spanned by HCNEs contain genes whose biological functions and expression patterns are unrelated to those of the presumptive target genes.
These unrelated genes, which we refer to as bystander genes, are independent of the regulatory input of HCNE arrays, but the pressure to maintain HCNE arrays has kept bystander and target genes together for hundreds of millions of years (13). We termed the HCNE-spanned regions containing such genes genomic regulatory blocks (GRBs), and found GRBs to correspond to the longest regions of conserved gene order across vertebrate genomes. In this paper, we use the term microsynteny conservation to denote the preservation of close proximity among genes through evolution, and we refer to chromosomal regions that have been largely maintained in evolution as synteny blocks (14;15). The fruit fly Drosophila melanogaster (Dmel) has been used for a century as a model organism for studies of genetics, animal development, behavior and many other aspects of biology. It is remarkable that most developmental regulatory genes in the fly have conserved orthologs in vertebrates, often with analogous functions (16), and that many of these genes are associated with HCNEs in both fly and vertebrates (17;18). Although insect HCNEs have not been studied as extensively as vertebrate HCNEs, the trends described are similar, strongly suggesting that most HCNEs function as developmental regulatory elements in vertebrates and insects alike. In both insect and vertebrate genomes, most bases that are conserved above neutral evolution rates appear to be noncoding (19). More than 20,000 intronic and intergenic elements are perfectly conserved over at least 50 bp between Dmel and the closely related D. pseudoobscura (Dpse), and are most abundant in the vicinity of developmental transcription factor genes (17). A recent search for HCNEs conserved between Dmel and the more distantly related D. virilis (Dvir) revealed several elements that coincide with characterized developmental enhancers (20).
Regions of conserved microsynteny have been found between Dmel and the malaria mosquito Anopheles gambiae (Agam) though these organisms diverged about 250 million years ago (15). A recent comparison of twelve insect genomes demonstrated microsynteny conservation among more distantly related insects (21). This comparison also showed that the distribution of insect synteny block lengths is incompatible with a model where genes have been randomly shuffled in evolution, and would be better explained by the existence of rearrangement hotspots - regions that have been shuffled more than others in evolution. The same trend has been observed in comparisons of mammalian genomes (14;22;23). In vertebrates, conserved microsynteny can at least in part be explained by the occurrence of GRBs (13). In this study, we present evidence for the existence of GRBs in insects and their functional equivalence to those in vertebrates. We have identified 6779 HCNEs shared among five different Drosophila species, demonstrating that fly genomes contain an extensive core repertoire of HCNEs. We show that an equivalent organization can be observed in orthologous mosquito loci through comparisons of the genome sequences of Anopheles gambiae and Aëdes aegypti, and that the maintenance of HCNE clusters is likely to underlie preservation of microsynteny between flies and mosquitoes. The regions of HCNE arrays and microsynteny conservation also contain unrelated genes, probably in a similar way to bystander genes in vertebrate GRBs (13). We provide genome-wide evidence that these genes generally differ from target genes in their type of core promoter – which might for the first time explain on a genome-wide level why bystander genes do not specifically respond to long-range regulation in the region.
Finally, we report a striking correspondence between Polycomb binding regions and several Drosophila GRBs, and discuss the occurrence of GRBs as an ancient and fundamental feature of metazoan genomes.

6.2 Methods

6.2.1 Sequences and annotations

We used the following genome assemblies: Dmel release 4 (Berkeley Drosophila Genome Project); Dpse release 1.03 (Baylor HGSC); Dana, Dvir and Dmoj Aug. 2005 (Agencourt); Agam MOZ2 (The International Anopheles Genome Project); Aaeg AaegL1 (The Broad Institute and TIGR) and A. mellifera Amel_2.0 (Baylor HGSC). We obtained Aaeg sequences from Ensembl (24), and the other genome sequences, pairwise chained BLASTZ alignments between them, and annotations from the UCSC Genome Browser Database (25). We used FlyBase v. 4.3 gene and CDS annotations (26) and Dmel GO annotations (rev. 1.93) from www.geneontology.org.

6.2.2 HCNE Detection

We identified elements highly conserved among flies by scanning pairwise BLASTZ net whole-genome alignments (22) between Dmel and each of the other four Drosophila species for regions with at least 98% identity over 50 alignment columns. Highly conserved elements were merged if they overlapped on the Dmel assembly. We discarded elements whose Dmel coordinates overlapped with any exon in FlyBase 4.3 genes, RefSeq genes, Dmel cDNA sequences from GenBank or GENSCAN predictions. Remaining elements from each pairwise comparison were intersected based on their Dmel coordinates, to obtain elements conserved among all five species. Such elements spanning at least 50 bp of Dmel sequence were considered fly HCNEs. To detect mosquito HCNEs at selected Agam loci, we identified homologous Aaeg contigs by inspecting translated BLAT alignments in Ensembl v. 42-43 (24). We aligned Agam and Aaeg sequences with Shuffle-LAGAN v.
2.0 (27) with default settings and used the resulting alignments to identify HCNEs as described for flies above, but using a lower identity threshold (80%) and removing elements that overlapped exons by comparing with the following UCSC Genome Browser database annotations on the Agam assembly: Ensembl genes, Agam cDNAs from GenBank, aligned Dmel proteins and GENSCAN predictions. To assess conservation of Drosophila HCNEs in Agam, we used a BLASTZ net alignment from Dmel to Agam.

6.2.3 Computation of feature densities and density peak detection

For images of loci, we computed HCNE densities by a sliding-window approach to provide easily interpreted density values. For genome-wide detection of density peaks, we required smoothed curves and therefore used the density function in R (http://www.R-project.org) with a Gaussian kernel and the indicated bandwidths (30,000 unless stated) to compute one density value every kb. We detected peaks by searching for density values that were higher than both a threshold value and the five preceding and five following values along the chromosome arm.

6.2.4 Identification of synteny blocks and RA sequence among flies

To identify synteny blocks, we made extensive use of the utilities and C library functions in the UCSC Genome Browser source package (http://genome.ucsc.edu/FAQ/FAQlicense). Starting from pairwise chained BLASTZ alignments (chains) between the Dmel genome and each of the four other genomes, we constructed pairwise net alignments (nets) by running the program chainNet with option -minSpace=1. chainNet filters a set of chains to retain only the best alignment for each position in one of the genomes (22). The chainNet algorithm tends to prioritize large chains and therefore its output is suitable for identifying synteny blocks.
For each of the four pairwise genome comparisons, we constructed two sets of nets (one from the perspective of each genome), and used them to filter the chains into a set of reciprocal-best chains (rb-chains) that only contain alignment columns included in the nets for both genomes. To find the bases in the Dmel sequence that were aligned in a reciprocal-best manner in all four pairwise genome comparisons (RA sequence), we identified the Dmel bases that were in ungapped blocks (i.e. were aligned to some base) in all four sets of rb-chains. We constructed pairwise synteny blocks from rb-chains in three steps: (1) Rb-chains were split at gaps that spanned nets if, within the gap, nets for either genome contained at least 10 kb in ungapped blocks. We used nets to split rb-chains because they include alignments that are not reciprocal-best, thus allowing us to capture synteny breaks caused, for example, by species-specific duplications. Only rb-chains that contained ≥10 kb in ungapped blocks after this step were retained. (2) We classified regions spanned by multiple (nested) rb-chains as being outside synteny blocks, and truncated nested rb-chains accordingly. Again, rb-chains containing <10 kb in ungapped blocks were discarded. (3) To avoid artificial synteny breaks due to failure to link scaffolds together in any of the non-Dmel assemblies, we joined rb-chains that were nearest neighbors along the same Dmel chromosome arm, but on different scaffolds in the non-Dmel assembly, unless the gap between the rb-chains in either genome contained nets with at least 10 kb of sequence in ungapped blocks (i.e. the same criterion as used to split chains in step 1 above). The set of rb-chains after this third step constituted our pairwise synteny blocks. Although joining of chains may overestimate synteny in pairwise comparisons, any such effects should be minimal after pairwise synteny blocks are intersected into 5-way synteny blocks.
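The intersection used above to define RA sequence (Dmel bases that lie in ungapped blocks in all four sets of rb-chains) can be illustrated with a small interval sweep. This is only a sketch under stated assumptions: the function names and the half-open coordinate convention are hypothetical, each input list is assumed to be sorted and non-overlapping on a single chromosome arm, and the analysis itself relied on the UCSC Genome Browser utilities rather than code like this.

```python
from typing import List, Tuple

# Hypothetical representation: half-open [start, end) intervals on one Dmel arm.
Interval = Tuple[int, int]


def intersect_two(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Intersect two sorted, non-overlapping interval lists."""
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def intersect_all(sets: List[List[Interval]]) -> List[Interval]:
    """Bases present in every interval set (e.g. all four pairwise comparisons)."""
    result = sets[0]
    for other in sets[1:]:
        result = intersect_two(result, other)
    return result


# Toy ungapped blocks from four pairwise comparisons (made-up coordinates).
pairwise = [
    [(0, 100), (200, 300)],
    [(50, 250)],
    [(0, 80), (90, 260)],
    [(60, 400)],
]
ra_sequence = intersect_all(pairwise)
```

Because touching intervals are never merged, the same sweep keeps neighbouring blocks distinct, which is the behaviour one would want when intersecting block coordinates rather than aligned bases.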
We created 5-way synteny blocks by intersecting the sets of pairwise synteny blocks based on their coordinates on the Dmel assembly: any two Dmel bases were assigned to the same 5-way synteny block if and only if they were part of the same synteny block in each of the pairwise comparisons. We discarded 5-way synteny blocks that did not contain at least 10 kb in ungapped alignments across all pairwise synteny blocks.

6.2.5 Analysis of extent of Dmel-Agam synteny

To identify Dmel-Agam synteny blocks, we first computed reciprocal-best BLASTZ net alignments between Dmel and Agam as described for fly comparisons above. We then constructed a graph where two alignments (nodes) were connected if they were separated by ≤100 kb in both genomes (not considering strand, to allow local inversions within synteny blocks). We then considered each connected component in the graph to be one synteny block. The threshold of 100 kb is arbitrary; we tested several values in the range 0-300 kb with similar results. Considering all protein-coding FlyBase genes, we assigned a gene to a synteny block if that gene had a transcript with at least 25% of its CDS aligned to the syntenic Agam locus. Genes that belonged to multiple blocks according to this rule were excluded.

6.2.6 Core promoter analysis

We assigned a McPromoter prediction (28) to a FlyBase transcript if it was within 250 bp upstream of the annotated start site of the transcript or within the noncoding part of its first exon. In rare cases where multiple promoter predictions satisfied these criteria, the prediction closest to the annotated start site was chosen. For illustrated loci, core promoter assignments to genes were reviewed and changed if available transcript data motivated modifications to FlyBase gene models.

6.2.7 Expression analysis

To assign expression values to genes, we processed FlyBase gene models as follows.
Because the expression signals from the tiling array study (29) are not strand-specific, we masked parts of exons that overlapped exons on the other genomic strand. We disregarded any gene that had more than half of its total exon sequence masked. For each remaining gene i, we computed its maximum transfrag coverage $c_i^{\max} = \max_j c_{ij}$, where $c_{ij}$ is the number of unmasked exon bases covered by transfrags for gene i at time point j. Any gene i with $c_i^{\max}$ greater than or equal to 70% of its unmasked exon sequence was considered expressed [a similar criterion was used in the original analysis of the data (29)]; other genes were assigned an expression value of 0 for all time points. If two expressed genes (annotated on the same strand) shared unmasked exon sequence, only the gene with the highest $c^{\max}$ was considered further; we were not interested in comparing expression profiles between genes that share the same transcriptional unit. Each retained gene was then, for each time point, assigned an expression value equal to the median signal over its unmasked exon sequence. Only genes that showed at least a twofold difference in expression values between some time points were used in comparisons of expression profiles.

6.3 Results

We identified HCNEs in pairwise alignments between the euchromatic genome sequences of D. melanogaster (Dmel) and four other Drosophila species - D. ananassae (Dana), D. pseudoobscura (Dpse), D. virilis (Dvir), and D. mojavensis (Dmoj) - selected based on the state of their genome assemblies, availability of whole-genome sequence alignments to Dmel and phylogenetic relationships (Appendix 4 Figure S1). We required HCNEs to be conserved at 98% identity over at least 50 bp in all four pairwise comparisons. To focus on elements that are most likely to function in regulation of transcription, we discarded elements that partially or entirely overlapped exons (8;9;17;30). There were 6779 HCNEs, with a median size of 59 bp and a maximum of 157 bp.
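The identity criterion just stated (at least 98% identity over 50 alignment columns) can be sketched as a sliding-window scan over one pairwise alignment. This is an illustrative sketch only, not the pipeline used in the study: the function name is hypothetical, the two inputs are assumed to be equal-length gapped alignment rows, gap columns are assumed to count as mismatches, and the threshold is expressed as an integer number of matching columns (49 of 50) to avoid floating-point comparisons.

```python
from typing import List, Tuple


def conserved_regions(row_a: str, row_b: str,
                      window: int = 50,
                      min_matches: int = 49) -> List[Tuple[int, int]]:
    """Return merged half-open column ranges covered by windows of `window`
    alignment columns that contain at least `min_matches` identical bases."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    n = len(row_a)
    # 1 where both rows carry the same base; gaps ('-') never match.
    match = [1 if (x == y and x != "-") else 0
             for x, y in zip(row_a.upper(), row_b.upper())]
    regions: List[Tuple[int, int]] = []
    if n < window:
        return regions
    count = sum(match[:window])
    for start in range(n - window + 1):
        if start > 0:
            # Slide the window one column to the right.
            count += match[start + window - 1] - match[start - 1]
        if count >= min_matches:
            end = start + window
            if regions and start < regions[-1][1]:
                # Overlapping qualifying windows are merged into one element.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions


# Toy alignment: 60 identical columns, 10 mismatches, 50 identical columns.
row_a = "A" * 120
row_b = "A" * 60 + "C" * 10 + "A" * 50
elements = conserved_regions(row_a, row_b)
```

In the study, elements found this way in each pairwise comparison were additionally filtered against exon annotations and intersected across species; those steps are not shown here.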
Consistent with earlier observations for flies (17), nematodes (18) and vertebrates (12;30), we found regions of high HCNE-density to be strongly enriched for genes encoding developmental transcriptional regulators (Appendix 4 Table S1).

6.3.1 Highly conserved noncoding elements are enriched in large synteny blocks

To study the distribution of HCNEs with respect to regions of microsynteny, we identified synteny blocks conserved among all five fly genomes as described in Methods. None of the four species that we compared to Dmel has a finished genome assembly. Nevertheless, our results indicate that reliable synteny blocks can be constructed because most of the sequence is in very large scaffolds. Although the synteny blocks included few scaffolds, they spanned 76% of the Dmel euchromatic sequence (Supplemental Table S2). We distinguish between the span of a synteny block, which we define as the entire genomic region between the extreme borders of the block, and its coverage, meaning the reciprocally aligned, syntenic bases in the block. Of the HCNEs, 94% were entirely spanned by synteny blocks, and 86% had at least 98% of their sequence covered by synteny blocks. We wished to compare the coverage of HCNE sequence by synteny blocks to the coverage of coding sequence (CDS) while controlling for the fact that the latter is less conserved overall. We therefore identified the bases in the Dmel sequence that were aligned in a reciprocal-best manner in all four pairwise genome comparisons [reciprocally-best aligned (RA) sequence], and measured the fraction of them that was covered by synteny blocks. Remarkably, 90% of RA-HCNE sequence was covered by synteny blocks, compared to only 75% of RA-CDS. RA-HCNE sequence was enriched in large synteny blocks compared to RA-CDS (Figure 6.1A).
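The distinction drawn above between the span of a synteny block and its coverage, and the fraction of a feature class covered by synteny blocks, can be made concrete with a few lines of interval arithmetic. The sketch below is illustrative only: the function names and toy coordinates are hypothetical, intervals are assumed to be half-open, sorted and non-overlapping, and a synteny block is represented simply as the list of its aligned segments on Dmel.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) in Dmel coordinates


def block_span(segments: List[Interval]) -> int:
    """Length of the region between the extreme borders of a block."""
    return max(e for _, e in segments) - min(s for s, _ in segments)


def block_coverage(segments: List[Interval]) -> int:
    """Number of aligned (syntenic) bases inside a block."""
    return sum(e - s for s, e in segments)


def overlap_bases(features: List[Interval], covered: List[Interval]) -> int:
    """Feature bases that fall inside the covered intervals."""
    total = 0
    j = 0
    for fs, fe in features:
        # Skip covered intervals that end before this feature starts.
        while j < len(covered) and covered[j][1] <= fs:
            j += 1
        k = j
        while k < len(covered) and covered[k][0] < fe:
            total += min(fe, covered[k][1]) - max(fs, covered[k][0])
            k += 1
    return total


def fraction_covered(features: List[Interval], covered: List[Interval]) -> float:
    """Fraction of feature sequence (e.g. RA-HCNE or RA-CDS) that is covered."""
    feature_bases = sum(e - s for s, e in features)
    return overlap_bases(features, covered) / feature_bases


# Toy block with three aligned segments and two features (made-up numbers).
block = [(100, 200), (250, 400), (900, 1000)]
features = [(150, 260), (950, 1050)]
span = block_span(block)
coverage = block_coverage(block)
frac = fraction_covered(features, block)
```

In this toy case the block spans 900 bp but covers only 350 bp, which shows why the two quantities are reported separately.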
6.3.2 HCNE arrays are centrally positioned in large synteny blocks that span multiple genes

We identified 164 peaks of HCNE density on Dmel chromosomes 2, 3 and X by first using a Gaussian kernel to compute local HCNE density at positions spaced 1 kb throughout the euchromatic sequence, and then locating peaks in the resulting density distribution. Many peaks of HCNE density are contained within single synteny blocks and are centrally positioned within those blocks (Figure 6.1B and C and Appendix 4 Figure S2). Only 5/164 HCNE density peaks were located outside synteny blocks. In contrast, RA-CDS density tends to peak near synteny breaks, confirming that the lower HCNE density in these regions is unlikely to be caused by variations in alignment quality (Figure 6.1B and C and Appendix 4 Figure S2). We confirmed the statistical significance of these trends by permutation tests (Figure 6.1D) and found the trends to persist across a wide range of parameter settings (Appendix 4 Figures S3 and S4). RA-CDS density peaks were most frequent in small synteny blocks (Appendix 4 Figure S3). This trend is consistent with the above findings, but can also partially be explained by variability in intron size between genes. The frequency of HCNE density peaks per sequence length increased with synteny block size and nearly all synteny blocks that contained a large HCNE density peak also contained multiple genes (Appendix 4 Figure S3). These findings strongly suggest that large regions containing multiple genes have maintained microsynteny in order to preserve arrays of HCNEs.

Figure 6.1 HCNE arrays are centrally positioned in large synteny blocks. (A) RA-HCNE sequence is enriched in large synteny blocks compared to RA-CDS. Dashed lines show the distributions when sequence not covered by any synteny block is excluded. (B) HCNE density, RA-CDS density and synteny blocks on Dmel chromosome arm 2L. Synteny blocks (green boxes with black borders) are shown between the density curves and in the area under them. Density peaks were detected above a threshold (gray line) set to cover 80% of the area under the density curve for the chromosome arm. In the magnified section, HCNE density peaks are labeled with inferred regulatory target genes located in the same synteny block as the HCNE density peak. (C) Line histogram of position of density peaks within synteny blocks. For each density peak that was located within a synteny block, we computed the distance between the peak and the synteny break closest to it, and scaled the distances to [0, 0.5] by dividing by synteny block size. Dashed lines show distributions from 10,000 randomizations where synteny blocks were ordered independently of density peaks (Appendix 4 Figure S3). (D) Histogram of median distance in each of the 10,000 randomizations. Arrows indicate medians for the non-randomized data, and one-sided p-values indicate the fraction of randomizations having equal or more extreme medians.

6.3.3 HCNE-associated genes are in large blocks of conserved microsynteny between fly and mosquito

The ct locus (Figure 6.2A) is one of the more extreme examples of the genome-wide trends described above. This synteny block contains the highest HCNE density peak on the Dmel X chromosome (Appendix 4 Figure S2) and HCNE densities are high throughout most of the block. The block is flanked by regions of higher gene density than within the block. The ct gene encodes the homeodomain protein Cut, which regulates cell-fate decisions in multiple lineages (31). ct has been maintained in microsynteny with at least nine other genes (underlined in Figure 6.2A) throughout the five Drosophila genomes investigated here.
There is little evidence of a functional relationship between ct and any of these nine genes: five are unannotated, and the remaining four encode a sarcoplasmic calcium-binding protein (CBP), a putative protein phosphatase (CG15035), a putative Na+/solute symporter (CG9657) and a putative forkhead transcription factor (CHES-1-like). CHES-1-like may have a regulatory role, but its function is unknown. ct has been maintained in microsynteny with four genes (CBP, CheA7a, CG9657 and CG15478) between Dmel and Agam. Strikingly, these genes delimit roughly the same Dmel region as the five-way fly synteny block: the region spanned by the HCNE-cluster (Figure 6.2A).

Figure 6.2 Genes associated with HCNE arrays tend to be in large fly-mosquito synteny blocks. (A,B) Examples of synteny blocks. Gaps within synteny blocks are colored yellow. Green lines connect genes in conserved microsynteny between Dmel and Agam. Microsynteny conservation between Dmel and Agam was determined by examining chained BLASTZ and tBLASTn alignments in the UCSC Genome Browser (32). Sometimes only parts of genes could be matched (e.g. in the case of ct). Aaeg contigs aligned to the Agam assembly are shown with regions having ≥ 50% identity over 50 bp in black and other regions in yellow. HCNE densities were computed as the fraction of bases in HCNEs in sliding windows of 40 kb. The UCSC Genome Browser was used in making the images. (A) The ct locus in Dmel (upper panel) and Agam (lower panel). ct and nine other genes (underlined) show strong evidence of being in conserved microsynteny among the five flies. The orange line indicates a noncoding BLASTZ match between Dmel and Agam and hints at the location of the first ct exon in Agam. Comparison of HCNE density curves also supports that the first ct exon in Agam is in the area indicated by the orange line. Supporting a common origin of the HCNE clusters at the ct loci in flies and mosquitoes, the HCNE density curves have similar shapes.
Two density peaks are visible in both organisms: one between CG9657 and ct, and the other within the borders of ct. The developmental transcriptional regulator brk (33) is centrally positioned in an adjacent synteny block. CG9650, which dominates a neighboring HCNE-rich synteny block, is expressed in developing CNS and PNS and encodes a putative C2H2 zinc finger protein (34). (B) The tailup (tup) locus in Dmel (upper panel) and Agam (lower panel). tup is in conserved microsynteny with CG18397 among the five flies, Agam and Aaeg. tup encodes a homeodomain transcription factor involved in development (35). CG18397 is predicted to encode a protein with an AMP-dependent synthetase and ligase domain. In both flies and mosquitoes, HCNEs are found throughout the synteny block. Some HCNEs are within introns of tup and CG18397. This, combined with the lack of evidence for a functional relationship between the two genes, indicates that they have been kept in proximity in order to maintain the HCNE array. (C) For each gene that we could assign to a synteny block, we measured the span of its synteny block excluding the region spanned by the gene itself (in order to control for differences in gene size). Each curve shows the cumulative distribution of synteny block span, measured in Dmel bp, around genes in a particular category. Categories were defined from GO biological process annotation and HCNE density. The category “any biological process” contains all genes annotated with a GO biological process term other than “biological process unknown”. Genes in HCNE-dense regions overlap a 40 kb region where at least 1% (400 bp) of the sequence is in HCNEs. Numbers in parentheses indicate the number of genes annotated to the indicated process and assigned to a single synteny block.
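The sliding-window HCNE density referred to in the caption above (fraction of bases in HCNEs per 40 kb window, with a region called HCNE-dense when at least 1% of a window is in HCNEs) can be sketched as follows. This is a minimal illustration rather than the code behind the figure: the function name is hypothetical, the step size between windows is an assumption because the text does not state it, and the quadratic loop is chosen for readability, not speed.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end)


def window_density(elements: List[Interval],
                   region_length: int,
                   window: int = 40_000,
                   step: int = 1_000) -> List[Tuple[int, float]]:
    """Return (window_start, fraction of window bases inside elements)."""
    densities = []
    for start in range(0, region_length - window + 1, step):
        end = start + window
        # Bases of each element that fall inside the current window.
        covered = sum(max(0, min(e, end) - max(s, start)) for s, e in elements)
        densities.append((start, covered / window))
    return densities


# Toy locus of 100 kb with two HCNEs (made-up coordinates).
hcnes = [(10_000, 10_200), (30_000, 30_300)]
dens = window_density(hcnes, region_length=100_000, step=10_000)

# Windows meeting the 1% criterion used to call a region HCNE-dense.
dense_starts = [start for start, d in dens if d >= 0.01]
```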
To investigate whether maintained fly-mosquito microsynteny at the ct locus could be explained by a selective pressure to keep the HCNE-cluster intact, we searched for HCNEs conserved between Agam and the yellow fever mosquito Aëdes aegypti (Aaeg) at the ct locus in mosquitoes. Indeed, there is a distinct island of mosquito-specific HCNEs confined to the region of the fly-mosquito synteny block (Figure 6.2A). The picture is similar at several other loci, including the locus of the homeodomain transcription factor gene tailup (tup, Figure 6.2B; see also Appendix 4 Figures S5-S7). Curiously, few noncoding elements are highly conserved between flies and mosquitoes (17). Only 612/6779 (9%) of our Drosophila HCNEs are at least partially aligned to the Agam sequence in a precomputed whole-genome alignment and only 264 (4%) are aligned with at least 30 base identities. The examples presented here suggest that, while many elements may have diverged beyond our ability to align them, the selective pressure to maintain their clusters has resulted in microsynteny conservation, which is detectable because protein-coding sequences align between Drosophila and Anopheles.

To quantitatively assess whether genes regulated by HCNE arrays are more likely to be in large regions of microsynteny between Dmel and Agam, we constructed synteny blocks between the two genomes, using a more relaxed approach than among the Drosophila species because of the large evolutionary distance between flies and mosquitoes (see Methods). We then measured the span of Dmel-Agam synteny blocks around Dmel genes from several categories, including genes in HCNE-dense regions and genes annotated with Gene Ontology (GO) biological process terms that have been found to be associated with genes spanned by HCNE arrays (GO terms “multicellular organismal development” and “regulation of transcription, DNA-dependent”; see Appendix 4 Table S1 and (17)).
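The span measurement described above (the synteny block span excluding the region covered by the gene itself) and a one-sided comparison of two span distributions can be sketched as follows. The helper names and all coordinates are hypothetical, and the statistic shown is the plain one-sided two-sample Kolmogorov-Smirnov distance without a p-value; it is meant only to make the comparison of cumulative distributions concrete.

```python
def flanking_span(block_start, block_end, gene_start, gene_end):
    """Synteny block span in bp, excluding the part covered by the gene."""
    overlap = max(0, min(block_end, gene_end) - max(block_start, gene_start))
    return (block_end - block_start) - overlap


def ecdf(values, x):
    """Empirical cumulative distribution of `values` evaluated at x."""
    return sum(v <= x for v in values) / len(values)


def ks_one_sided(sample_a, sample_b):
    """One-sided KS statistic D = max_x [F_b(x) - F_a(x)].

    D is large when sample_a tends to contain larger values than sample_b,
    i.e. when the cumulative curve of sample_a lies below that of sample_b.
    """
    points = sorted(set(sample_a) | set(sample_b))
    return max(ecdf(sample_b, x) - ecdf(sample_a, x) for x in points)


# Hypothetical example: a 100 kb block containing a 30 kb gene.
print(flanking_span(0, 100_000, 20_000, 50_000))

# Hypothetical spans (bp) for an HCNE-related and a comparison category.
hcne_related = [300_000, 400_000, 500_000]
comparison = [100_000, 200_000, 300_000]
print(ks_one_sided(hcne_related, comparison))
```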
Genes in the HCNE-related categories tended to lie within more extensive blocks of synteny than other types of genes (Figure 6.2C). To better pinpoint the genes that are targets of HCNE arrays, we intersected the HCNE-related categories. Genes that were located in HCNE-dense regions, annotated to be involved in development, and annotated as transcriptional regulators were within significantly larger synteny blocks than genes from any of the non-HCNE-related categories (p < 10^-7 in pairwise one-tailed Kolmogorov-Smirnov tests against each of the categories “cellular protein metabolic process”, “cell organization and biogenesis”, “transport”, “signal transduction” and “any biological process”).

6.3.4 HCNE-associated genes have specific types of core promoters

Data from this and earlier work suggest a model where insect and vertebrate HCNE arrays represent clusters of enhancers that specify expression programs for only a small subset of the genes that they span. How enhancer activity is specifically directed towards certain genes at HCNE-spanned loci is unknown. It has been demonstrated that enhancers can selectively target certain promoters (36;37) and that this selectivity may be facilitated by the occurrence of different core promoter types (38;39). A recent investigation of core promoters in Dmel classified them into five major types based on motif content: TATA box followed by initiator (TATA/Inr), initiator followed by downstream promoter element (Inr/DPE), Motif 6 followed by Motif 1 (Motif 1/6), DNA replication element (DRE), and promoters containing only initiator, but none of the other elements (Inr only) (28). Based on these observations, the author designed a program (McPromoter) that predicts core promoters in the Dmel genome with high accuracy and classifies them as one of the five types.
Hypothesizing that enhancers in HCNE arrays may target specific genes within “striking distance” on the basis of their core promoter architecture, we used the genome-wide McPromoter predictions to investigate core promoter properties of likely target genes. Of 81 developmental transcriptional regulators located in HCNE-dense regions, 56 have a promoter prediction close to one or more annotated transcription start sites. Of these 56 genes, 53 (95%) are associated with a prediction of a type containing an Inr motif (Inr only, Inr/DPE or TATA/Inr; see also Table 6.1). For comparison, only 2251 (39%) of all 5824 genes assigned a promoter prediction have a prediction with an Inr motif. The enrichment is strongest for genes with Inr only core promoters (p = 0.005, compared to Inr/DPE enrichment, by Fisher’s exact test). For examples of genes with different core promoter types, see Figures 6.2, 6.4 and 6.5, where gene models are colored according to associated promoter predictions (see also Appendix 4 Figures S5, S6, S8 and S9).

Table 6.1 Core promoter classification of Dmel genes

Core promoter class | All protein-coding genes | Transcriptional regulators (a) | Developmental genes (b) | Genes in HCNE-dense regions (c) | Developmental transcriptional regulators in HCNE-dense regions
1. Inr only         | 439 (8%)   | 72 (17%)  | 138 (16%) | 44 (20%)  | 24 (43%)
2. Inr/DPE          | 784 (13%)  | 75 (17%)  | 196 (23%) | 60 (28%)  | 18 (32%)
3. DRE              | 2162 (37%) | 140 (32%) | 250 (29%) | 34 (16%)  | 3 (5%)
4. Motif 1/6        | 1553 (27%) | 105 (24%) | 194 (23%) | 39 (18%)  | 2 (4%)
5. TATA/Inr         | 1110 (19%) | 73 (17%)  | 149 (18%) | 54 (25%)  | 14 (25%)
Class 1, 2 or 5     | 2251 (39%) | 204 (47%) | 453 (53%) | 150 (69%) | 53 (95%)
Any class           | 5824       | 434       | 849       | 217       | 56
Total               | 13733      | 768       | 1383      | 684       | 81

Genes were counted in more than one promoter category if they had different types of core promoter predictions for different alternative start sites. Percentages are relative to the number of classified genes.
(a) Protein-coding genes annotated with GO term GO:0006355 (regulation of transcription, DNA-dependent).
(b) Protein-coding genes annotated with GO term GO:0007275 (multicellular organismal development).
(c) Protein-coding genes overlapping a 40 kb region where at least 1% (400 bp) of the sequence is covered by HCNEs.

To explore the association between core promoter types and gene functions, we performed a systematic search for enrichment of different GO annotations within each of the five core promoter classes (Figure 6.3A and Appendix 4 Table S3). Consistent with the above results, we found enrichment for transcription factors among genes with Inr-only core promoters (p<10^-10) and enrichment for genes involved in developmental processes among genes with Inr-only core promoters (p<10^-13) and genes with Inr/DPE core promoters (p<10^-9). The latter gene set is also enriched for genes involved in ion transport (p<10^-4) and cell adhesion (p<10^-5). On the other hand, the set of genes with DRE core promoters is enriched for housekeeping functions (translation, p<0.001) and mitochondrial proteins (p<10^-4). Genes with Motif 1/6 promoters showed a particular enrichment for RNA polymerase II components (p<0.01), which also perform a housekeeping function. Although the set of genes with TATA/Inr core promoters appears to share some of the trends observed for Inr-only and Inr/DPE promoters, these trends are not statistically significant for the TATA/Inr-regulated genes, which instead are enriched for genes encoding proteins with more specialized, tissue-specific functions, such as cuticle constituents (p<10^-23), odorant binding proteins (p<0.01) and defense-related proteins (p<0.01). This finding is in agreement with results from mammals, where genes with TATA box core promoters tend to be expressed in tissue-specific contexts (40). All p-values were adjusted for multiple testing using the Bonferroni method (see also Appendix 4 Table S3).
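An enrichment test of this kind, followed by a Bonferroni adjustment, can be sketched as follows. This is a minimal illustration using a one-sided hypergeometric tail, which corresponds to a one-sided Fisher's exact test on the 2x2 table; it is not necessarily the exact procedure behind the p-values reported above. The counts are taken from Table 6.1 (53 of 56 classified developmental transcriptional regulators in HCNE-dense regions versus 2251 of 5824 classified genes with an Inr-containing prediction), while the function names and the number of tests used for the adjustment are assumptions.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which carry the property of interest."""
    numerator = sum(comb(K, i) * comb(N - K, n - i)
                    for i in range(k, min(n, K) + 1))
    return numerator / comb(N, n)


def bonferroni(p, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p * n_tests)


# Counts from Table 6.1.
N = 5824   # genes with any core promoter prediction
K = 2251   # of these, genes with an Inr-containing prediction
n = 56     # classified developmental regulators in HCNE-dense regions
k = 53     # of these, genes with an Inr-containing prediction

p = hypergeom_upper_tail(k, n, K, N)
n_tests = 100  # hypothetical number of tests, for illustration only
print(p, bonferroni(p, n_tests))
```

Exact integer arithmetic with `math.comb` avoids the rounding problems that arise when very small tail probabilities are accumulated in floating point.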
Figure 6.3 Associations between core promoter types and gene functions
(A) Bars show the fraction of genes in each core promoter category that are annotated with indicated GO terms. All GO terms shown are significantly associated with a core promoter category at Bonferroni-adjusted p<0.01 (see also Appendix 4 Table S3).
(B) Violin plots (boxplots with added kernel density curves) show distributions of Pearson correlation coefficients for expression correlations between randomly selected gene pairs taken from pairs of core promoter categories indicated by colored rectangles below the plots. High correlations are frequent between genes with DRE core promoters and genes with Motif 1/6 core promoters, as well as among genes within each of those categories. Each distribution is based on a sample of 1000 randomly selected gene pairs. Genes were not compared against themselves.

To explore gene expression correlations among genes with different core promoter types, we used a published tiling array dataset consisting of gene expression measurements across the Dmel genome at twelve time points during the 24 hours of embryonic development (29). Consistent with a housekeeping nature of genes with DRE or Motif 1/6 core promoters, we found that randomly selected gene pairs from these sets often have highly correlated expression profiles, unlike gene pairs from the other sets (Figure 6.3B). Genes in these two promoter categories also have the highest detection rates: 1423 (66%) of 2162 genes with DRE promoters and 1031 (66%) of 1553 genes with Motif 1/6 promoters were detected as expressed at some time point. Genes with TATA/Inr promoters have the lowest detection rate (46%; significantly different from genes with DRE or Motif 1/6 promoters: p<10^-15, chi-square test), consistent with more specialized roles for genes with TATA/Inr promoters.
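The sampling of expression correlations between random gene pairs (1000 pairs per pair of promoter categories, never comparing a gene against itself) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the seed and the toy expression profiles are hypothetical, and real profiles would hold one value per time point of the tiling array series.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def sample_pair_correlations(profiles_a, profiles_b, n_pairs=1000, seed=0):
    """Correlations for randomly chosen gene pairs, one gene per category.

    profiles_a, profiles_b: dicts mapping gene name to expression profile.
    Pairs of identical genes are skipped, so the two categories together
    must contain at least two distinct genes.
    """
    rng = random.Random(seed)
    genes_a, genes_b = sorted(profiles_a), sorted(profiles_b)
    correlations = []
    while len(correlations) < n_pairs:
        g1, g2 = rng.choice(genes_a), rng.choice(genes_b)
        if g1 == g2:
            continue  # genes are not compared against themselves
        correlations.append(pearson(profiles_a[g1], profiles_b[g2]))
    return correlations


# Toy profiles (hypothetical values, four time points for brevity).
toy_profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
sampled = sample_pair_correlations(toy_profiles, toy_profiles, n_pairs=10)
print(min(sampled), max(sampled))
```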
6.3.5 HCNE arrays mark regulatory domains maintained in evolution

While the data presented here suggest that the need to maintain HCNE clusters is a major reason for microsynteny conservation in insects, other reasons for microsynteny conservation exist. A genome-wide comparison of Dmel-Dpse synteny blocks to changes in gene expression throughout the Dmel life cycle suggested that microsynteny is preserved at some loci in order to maintain co-regulation of neighboring genes (Stolc et al. 2004; see also erratum at http://bussemaker.bio.columbia.edu/papers/Science2004/). Figure 6.4 shows two loci that are likely to be under dual pressures to maintain HCNE arrays and co-regulated genes. Each of these loci contains two co-expressed and paralogous developmental regulatory genes (elB/noc, H15/mid), spanned by a HCNE cluster that delimits roughly the same genomic region as its surrounding synteny block, suggesting that these loci constitute genomic regulatory blocks with dual targets for some of the enhancer activity likely contained in their HCNEs.

Figure 6.4 HCNE-clusters spanning co-regulated genes and boundary agreement among synteny blocks, HCNE clusters and Polycomb binding regions
Gene models are colored by predicted core promoter type as in Figure 6.2. Only selected genes are labeled.
(A) The paralogous zinc finger genes elB and noc, implicated in tracheal and appendage development, have different, but partially overlapping, spatial expression patterns during embryonic development (41), and are co-expressed in larval leg and wing discs (42). Among the five flies, elB and noc are in conserved microsynteny with a tRNA gene and at least three protein-coding genes (underlined), which have no evidence of being functionally related to elB or noc: pburs encodes a subunit of the hormone bursicon required for wing expansion and associated cuticle changes after flies emerge from pupae (43); CG3474 is predicted to encode a cuticle component; CG4218 is predicted to encode a protein of unknown function.
(B) The paralogous T-box genes H15 and mid are involved in regulation of heart development and have similar spatial expression patterns during embryonic development (44;45). They are in conserved microsynteny with four other genes (underlined) among the five flies: CG12512, predicted to encode a protein with an AMP-dependent synthetase and ligase domain; nompC, encoding a mechanosensory transduction channel (46); and two genes of unknown function. The developmental regulators vri (47) and tomb (48) are centrally positioned in neighboring synteny blocks. Two transcript isoforms are shown for tkv because it has two major transcription start sites with different types of core promoter predictions.

Further evidence for the existence of large regulatory domains in Drosophila genomes comes from genome-wide mapping of Polycomb binding sites in embryonic cell lines, where Polycomb was found to bind large regions, preferentially around developmental regulators (49;50). Similar findings have been reported for human embryonic stem cells, where the Polycomb repressive complex 2 subunit SUZ12 shows a strong tendency to bind across developmental transcription factor genes and around HCNEs (51). We inspected the Dmel Polycomb binding regions determined by Tolhuis et al. (49) and noted an association with HCNEs, as expected. Tolhuis et al. interrogated ~30% of the Dmel genome, and found that 10% of the interrogated sequence corresponds to large Polycomb binding regions (Pc domains). HCNE sequence is more than two-fold enriched in these Pc domains: 114 kb of the sequence interrogated by Tolhuis et al. corresponds to HCNEs, and 23% of this HCNE sequence is within Pc domains.
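The "more than two-fold" figure follows directly from the numbers just given: if HCNE sequence were spread evenly over the interrogated sequence, about 10% of it would fall within Pc domains, whereas 23% does. A minimal worked sketch of that arithmetic (the variable and function names are illustrative only):

```python
def fold_enrichment(observed_fraction, expected_fraction):
    """Ratio of the observed fraction to the fraction expected by chance."""
    return observed_fraction / expected_fraction


hcne_kb_interrogated = 114     # kb of HCNE sequence in the interrogated part
pc_fraction_of_sequence = 0.10  # share of interrogated sequence in Pc domains
hcne_fraction_in_pc = 0.23      # share of HCNE sequence within Pc domains

fold = fold_enrichment(hcne_fraction_in_pc, pc_fraction_of_sequence)
hcne_kb_in_pc = hcne_kb_interrogated * hcne_fraction_in_pc

print(round(fold, 2))           # about 2.3-fold
print(round(hcne_kb_in_pc, 1))  # roughly 26 kb of HCNE sequence in Pc domains
```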
The association of HCNEs with Pc domains is significant (p < 10^-5; Wilcoxon test) when one compares the density of HCNEs in Pc domains to the density of HCNEs in regions randomly sampled from the part of the genome interrogated by Tolhuis et al. and with similar size distribution as the Pc domains. Interestingly, we also found a very good agreement between the boundaries of synteny blocks, HCNE clusters and Pc domains at a number of loci, including the three shown in Figures 6.2B and 6.4 (see also Appendix 4 Figures S8 and S9). These examples indicate that synteny blocks, HCNE clusters and Polycomb binding regions can independently pinpoint the same large regulatory domains in insect genomes, suggesting that they reveal different aspects of the same evolutionarily conserved regulatory mechanism.

6.4 Discussion

6.4.1 Experimental evidence for long-range regulation and GRBs in Drosophila

Genomic regulatory blocks (GRBs) are regions containing long-range regulatory elements that have been interlocked in cis with their target genes as well as unrelated genes (13). We show here that this concept also applies to insect genomes. In the zebrafish genome, GRBs were discovered through enhancer detection events where the reporter insertion was close to or in a bystander gene, yet recapitulated the expression pattern of the target gene further away (13). Since enhancer detection has been performed extensively in Drosophila, we searched for examples of such insertions near bystander genes in the literature. Such insertions can be used to support the notion that regulatory elements form GRBs and thereby conserve microsynteny. The most striking example we found is the E32 enhancer detection line, which represents an insertion in the 5’ untranslated region of out at first (37). The insertion replicates part of the expression pattern of decapentaplegic (dpp), a developmental regulatory gene located 33 kb away.
The region between dpp and the insertion contains a gene desert with HCNEs (Figure 6.5) corresponding to known enhancers and a nonessential gene (Slh) with an unrelated expression pattern (37). We consider this a typical example of a GRB in Drosophila. At the loci of developmental regulators teashirt, engrailed and u-shaped, insertions confirm that regulatory information is present at large distances from target genes, but do not directly show that this information is present inside bystander genes (52). Notably, dpp, teashirt, engrailed and a gene desert with insertions upstream of u-shaped all coincide with Polycomb binding regions determined at high resolution (49) (Figure 6.5 and Appendix 4 Figures S8 and S9). We conclude that, although it is commonly accepted that enhancer detection insertions in Drosophila reproduce the expression pattern of the nearest gene, these examples show that there are exceptions, in agreement with our enhancer detection results in zebrafish and in agreement with the notion that GRBs occur in both insects and vertebrates (13).

Figure 6.5 The E32 enhancer trap insertion at the Dmel decapentaplegic (dpp) locus
Gene models are colored by predicted core promoter type as in Figure 6.2. Two transcript isoforms are shown for dpp because this gene has different core promoter predictions for two of its major transcription start sites (other dpp start sites are not shown, see (53)). The HCNE-spanned gene desert downstream of dpp contains several conserved enhancers that specify dpp expression in imaginal discs (37). Although the neighboring, divergently transcribed genes SLY1 homologous (Slh) and out at first (oaf) are insensitive to the array of dpp enhancers and have different expression patterns, the enhancer trap insertion E32, inserted into the 5’-untranslated region of oaf (arrow), reproduces part of the dpp expression pattern in imaginal discs (37). Five other genes (underlined) are in conserved microsynteny with dpp, Slh and oaf among the investigated flies.

6.4.2 Regulatory HCNE arrays are a fundamental feature of metazoan genomes

Most target genes in Drosophila GRBs appear to be developmental regulatory genes that have well-conserved vertebrate orthologs spanned by equivalent arrays of HCNEs (12). In addition to noncoding conservation and the types of genes they contain, other parallels between GRBs in insects and vertebrates are evident. They often harbor relatively long regions devoid of genes (gene deserts) (54), and are characterized by microsynteny conserved deep in evolution (13). Our demonstration of similarly organized HCNE arrays at orthologous Drosophila and Anopheles loci (where gene order has been partially preserved) reveals that microsynteny conservation, while constrained by regulatory elements, can outlive the sequence conservation of those elements.

The match between synteny blocks, HCNE arrays, and experimentally determined Polycomb binding regions in Drosophila is striking and supports the notion that these features are signatures of GRBs. In vertebrates, Polycomb group proteins are also preferentially found at the loci of developmental regulatory genes (51;55), were shown to bind to evolutionarily conserved CpG islands that overlap large portions of developmental regulatory genes (56), and directly control CpG methylation (57). Even though insects do not have genome methylation or CpG islands, one can speculate that Polycomb binding regions in Drosophila are functionally equivalent to conserved CpG islands in mammals. At present, it is unknown whether those regions in insects have any specific sequence properties analogous to CpG islands.
Together with a recent demonstration of the presence of HCNE clusters in nematode genomes of the genus Caenorhabditis (18), our findings indicate that arrays of HCNEs are central to developmental regulation of most, if not all, Metazoa. The association of HCNEs with orthologous genes among nematodes, insects and vertebrates (13;18) suggests that long-range regulation and clusters/arrays of HCNEs are an ancient property of metazoan genomes. The role of HCNEs in constraining microsynteny has not yet been explored beyond vertebrates and insects, however.

6.4.3 Responsiveness of genes to long-range enhancers

The apparent unresponsiveness of bystander genes to long-range enhancers in GRBs remains mysterious. Distance does not seem to be crucial for enhancer action (58;59). In the study mentioned above (37), the Drosophila gene out at first does not normally react to dpp enhancers but did so after exchanging its promoter with a dpp promoter. Thus, one explanation for enhancer specificity could be differential responsiveness of core promoters to enhancers (60). In mammals, different types of core promoters have been clearly shown to be related to different modes of regulation (40). In Drosophila, a recent study classified many known promoter regions into a number of different subtypes according to the principal motif (or combinations thereof) they contain (28). In this work we have shown that this classification discriminates between developmental genes (Inr with or without DPE), housekeeping genes (DRE or Motif 1/6) and tissue-specific genes (TATA). Based on these results, we speculate that it is the Inr type of promoters without TATA boxes that are most likely to respond to long-range regulation. Indeed, inspection of dozens of Drosophila GRBs strongly supports the hypothesis that non-responsive bystander genes, with expression patterns unrelated to the target gene in the same region, have core promoters of the DRE or Motif 1/6 types.
In this way, Ohler's classification of Drosophila core promoters is more powerful than that for vertebrate promoters made by Carninci et al. (40); in vertebrates, we still do not know the fundamental difference between core promoters for housekeeping and developmental regulatory genes, which both seem to have CpG island core promoters, most without TATA-boxes and with "broad"-type transcription start regions.

While borders of GRBs can be identified as synteny block boundaries by comparative genomics, it is still unclear how the cellular machinery recognizes those borders. Some regulatory domains are known to be delimited by insulator elements, which bind proteins that block the reach of enhancers or inhibit the spread of repressed chromatin (61). Recent studies have revealed an abundance of putative insulator elements bound by the enhancer-blocking protein CTCF in mammalian genomes, and predicted a similar number of binding sites in Tetraodon (62;63). Human CTCF is functionally conserved in Drosophila, where several other enhancer-blocking proteins also are known (64). It will be interesting to see whether insulator elements are present at the borders of vertebrate and Drosophila GRBs.

6.5 Conclusions

The evidence presented in this paper establishes GRBs as a fundamental property of metazoan genomes. The long distances of regulatory elements from their developmental regulatory target genes will have to be taken into account in future studies of these genes and their regulatory networks. Additionally, these findings provide guidelines for designing enhancer trap experiments and their interpretation, including an informed choice of core promoter type for enhancer trap constructs.

6.6 References

1. Gottgens,B., Barton,L.M., Gilbert,J.G., Bench,A.J., Sanchez,M.J., Bahn,S., Mistry,S., Grafham,D., McMurray,A., Vaudin,M. et al. (2000) Analysis of vertebrate SCL loci identifies conserved enhancers. Nat.Biotechnol., 18, 181-186.
2. Kimura-Yoshida,C., Kitajima,K., Oda-Ishii,I., Tian,E., Suzuki,M., Yamamoto,M., Suzuki,T., Kobayashi,M., Aizawa,S. and Matsuo,I. (2004) Characterization of the pufferfish Otx2 cis-regulators reveals evolutionarily conserved genetic mechanisms for vertebrate head specification. Development, 131, 57-71.
3. Milewski,R.C., Chi,N.C., Li,J., Brown,C., Lu,M.M. and Epstein,J.A. (2004) Identification of minimal enhancer elements sufficient for Pax3 expression in neural crest and implication of Tead2 as a regulator of Pax3. Development, 131, 829-837.
4. Sumiyama,K. and Ruddle,F.H. (2003) Regulation of Dlx3 gene expression in visceral arches by evolutionarily conserved enhancer elements. Proc.Natl.Acad.Sci.U.S.A, 100, 4030-4034.
5. Calle-Mustienes,E., Feijoo,C.G., Manzanares,M., Tena,J.J., Rodriguez-Seguel,E., Letizia,A., Allende,M.L. and Gomez-Skarmeta,J.L. (2005) A functional survey of the enhancer activity of conserved non-coding sequences from vertebrate Iroquois cluster gene deserts. Genome Res., 15, 1061-1072.
6. Pennacchio,L.A., Ahituv,N., Moses,A.M., Prabhakar,S., Nobrega,M.A., Shoukry,M., Minovitsky,S., Dubchak,I., Holt,A., Lewis,K.D. et al. (2006) In vivo enhancer analysis of human conserved non-coding sequences. Nature, 444, 499-502.
7. Shin,J.T., Priest,J.R., Ovcharenko,I., Ronco,A., Moore,R.K., Burns,C.G. and MacRae,C.A. (2005) Human-zebrafish non-coding conserved elements act in vivo to regulate transcription. Nucleic Acids Res., 33, 5437-5445.
8. Woolfe,A., Goodson,M., Goode,D.K., Snell,P., McEwen,G.K., Vavouri,T., Smith,S.F., North,P., Callaway,H., Kelly,K. et al. (2005) Highly conserved noncoding sequences are associated with vertebrate development. PLoS.Biol., 3, e7.
9. Bailey,P.J., Klos,J.M., Andersson,E., Karlen,M., Kallstrom,M., Ponjavic,J., Muhr,J., Lenhard,B., Sandelin,A. and Ericson,J. (2006) A global genomic transcriptional code associated with CNS-expressed genes. Exp.Cell Res., 312, 3108-3119.
10. Pennacchio,L.A., Loots,G.G., Nobrega,M.A. and Ovcharenko,I. (2007) Predicting tissue-specific enhancers in the human genome. Genome Res., 17, 201-211.
11. Lindblad-Toh,K., Wade,C.M., Mikkelsen,T.S., Karlsson,E.K., Jaffe,D.B., Kamal,M., Clamp,M., Chang,J.L., Kulbokas,E.J., III, Zody,M.C. et al. (2005) Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature, 438, 803-819.
12. Sandelin,A., Bailey,P., Bruce,S., Engstrom,P.G., Klos,J.M., Wasserman,W.W., Ericson,J. and Lenhard,B. (2004) Arrays of ultraconserved non-coding regions span the loci of key developmental genes in vertebrate genomes. BMC.Genomics, 5, 99.
13. Kikuta,H., Laplante,M., Navratilova,P., Komisarczuk,A.Z., Engstrom,P.G., Fredman,D., Akalin,A., Caccamo,M., Sealy,I., Howe,K. et al. (2007) Genomic regulatory blocks encompass multiple neighboring genes and maintain conserved synteny in vertebrates. Genome Res., 17, 545-555.
14. Pevzner,P. and Tesler,G. (2003) Genome rearrangements in mammalian evolution: lessons from human and mouse genomes. Genome Res., 13, 37-45.
15. Zdobnov,E.M., von Mering,C., Letunic,I., Torrents,D., Suyama,M., Copley,R.R., Christophides,G.K., Thomasova,D., Holt,R.A., Subramanian,G.M. et al. (2002) Comparative genome and proteome analysis of Anopheles gambiae and Drosophila melanogaster. Science, 298, 149-159.
16. Carroll,S.B. (2005) Endless Forms Most Beautiful: The New Science of Evo Devo and the Making of the Animal Kingdom. W.W. Norton & Company, New York, N.Y.
17. Glazov,E.A., Pheasant,M., McGraw,E.A., Bejerano,G. and Mattick,J.S. (2005) Ultraconserved elements in insect genomes: a highly conserved intronic sequence implicated in the control of homothorax mRNA splicing. Genome Res., 15, 800-808.
18. Vavouri,T., Walter,K., Gilks,W.R., Lehner,B. and Elgar,G. (2007) Parallel evolution of conserved non-coding elements that target a common set of developmental regulatory genes from worms to humans. Genome Biol., 8, R15.
19. Siepel,A., Bejerano,G., Pedersen,J.S., Hinrichs,A.S., Hou,M., Rosenbloom,K., Clawson,H., Spieth,J., Hillier,L.W., Richards,S. et al. (2005) Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res., 15, 1034-1050.
20. Papatsenko,D., Kislyuk,A., Levine,M. and Dubchak,I. (2006) Conservation patterns in different functional sequence categories of divergent Drosophila species. Genomics, 88, 431-442.
21. Zdobnov,E.M. and Bork,P. (2007) Quantification of insect genome divergence. Trends Genet., 23, 16-20.
22. Kent,W.J., Baertsch,R., Hinrichs,A., Miller,W. and Haussler,D. (2003) Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc.Natl.Acad.Sci.U.S.A, 100, 11484-11489.
23. Murphy,W.J., Larkin,D.M., Everts-van der Wind,A., Bourque,G., Tesler,G., Auvil,L., Beever,J.E., Chowdhary,B.P., Galibert,F., Gatzke,L. et al. (2005) Dynamics of mammalian chromosome evolution inferred from multispecies comparative maps. Science, 309, 613-617.
24. Hubbard,T.J., Aken,B.L., Beal,K., Ballester,B., Caccamo,M., Chen,Y., Clarke,L., Coates,G., Cunningham,F., Cutts,T. et al. (2007) Ensembl 2007. Nucleic Acids Res., 35, D610-D617.
25. Kuhn,R.M., Karolchik,D., Zweig,A.S., Trumbower,H., Thomas,D.J., Thakkapallayil,A., Sugnet,C.W., Stanke,M., Smith,K.E., Siepel,A. et al. (2007) The UCSC genome browser database: update 2007. Nucleic Acids Res., 35, D668-D673.
26. Crosby,M.A., Goodman,J.L., Strelets,V.B., Zhang,P. and Gelbart,W.M. (2007) FlyBase: genomes by the dozen. Nucleic Acids Res., 35, D486-D491.
27. Brudno,M., Malde,S., Poliakov,A., Do,C.B., Couronne,O., Dubchak,I. and Batzoglou,S. (2003) Glocal alignment: finding rearrangements during alignment. Bioinformatics., 19 Suppl 1, I54-I62.
28. Ohler,U. (2006) Identification of core promoter modules in Drosophila and their application in accurate transcription start site prediction. Nucleic Acids Res., 34, 5943-5950.
29. Manak,J.R., Dike,S., Sementchenko,V., Kapranov,P., Biemar,F., Long,J., Cheng,J., Bell,I., Ghosh,S., Piccolboni,A. et al. (2006) Biological function of unannotated transcription during the early development of Drosophila melanogaster. Nat.Genet., 38, 1151-1158.
30. Bejerano,G., Pheasant,M., Makunin,I., Stephen,S., Kent,W.J., Mattick,J.S. and Haussler,D. (2004) Ultraconserved elements in the human genome. Science, 304, 1321-1325.
31. Nepveu,A. (2001) Role of the multifunctional CDP/Cut/Cux homeodomain transcription factor in regulating differentiation, cell growth and development. Gene, 270, 1-15.
32. Kent,W.J., Sugnet,C.W., Furey,T.S., Roskin,K.M., Pringle,T.H., Zahler,A.M. and Haussler,D. (2002) The human genome browser at UCSC. Genome Res., 12, 996-1006.
33. Moser,M. and Campbell,G. (2005) Generating and interpreting the Brinker gradient in the Drosophila wing. Dev.Biol., 286, 647-658.
34. McGovern,V.L., Pacak,C.A., Sewell,S.T., Turski,M.L. and Seeger,M.A. (2003) A targeted gain of function screen in the embryonic CNS of Drosophila. Mech.Dev., 120, 1193-1207.
35. Thor,S. and Thomas,J.B. (1997) The Drosophila islet gene governs axon pathfinding and neurotransmitter identity. Neuron, 18, 397-409.
36. Li,X. and Noll,M. (1994) Compatibility between enhancers and promoters determines the transcriptional specificity of gooseberry and gooseberry neuro in the Drosophila embryo. EMBO J., 13, 400-406.
37. Merli,C., Bergstrom,D.E., Cygan,J.A. and Blackman,R.K. (1996) Promoter specificity mediates the independent regulation of neighboring genes. Genes Dev., 10, 1260-1270.
38. Butler,J.E. and Kadonaga,J.T. (2001) Enhancer-promoter specificity mediated by DPE or TATA core promoter motifs. Genes Dev., 15, 2515-2519.
39. Ohtsuki,S., Levine,M. and Cai,H.N. (1998) Different core promoters possess distinct regulatory activities in the Drosophila embryo. Genes Dev., 12, 547-556.
40. Carninci,P., Sandelin,A., Lenhard,B., Katayama,S., Shimokawa,K., Ponjavic,J., Semple,C.A., Taylor,M.S., Engstrom,P.G., Frith,M.C. et al. (2006) Genome-wide analysis of mammalian promoter architecture and evolution. Nat.Genet., 38, 626-635.
41. Dorfman,R., Glazer,L., Weihe,U., Wernet,M.F. and Shilo,B.Z. (2002) Elbow and Noc define a family of zinc finger proteins controlling morphogenesis of specific tracheal branches. Development, 129, 3585-3596.
42. Weihe,U., Dorfman,R., Wernet,M.F., Cohen,S.M. and Milan,M. (2004) Proximodistal subdivision of Drosophila legs and wings: the elbow-no ocelli gene complex. Development, 131, 767-774.
43. Luo,C.W., Dewey,E.M., Sudo,S., Ewer,J., Hsu,S.Y., Honegger,H.W. and Hsueh,A.J. (2005) Bursicon, the insect cuticle-hardening hormone, is a heterodimeric cystine knot protein that activates G protein-coupled receptor LGR2. Proc.Natl.Acad.Sci.U.S.A, 102, 2820-2825.
44. Miskolczi-McCallum,C.M., Scavetta,R.J., Svendsen,P.C., Soanes,K.H. and Brook,W.J. (2005) The Drosophila melanogaster T-box genes midline and H15 are conserved regulators of heart development. Dev.Biol., 278, 459-472.
45. Reim,I., Mohler,J.P. and Frasch,M. (2005) Tbx20-related genes, mid and H15, are required for tinman expression, proper patterning, and normal differentiation of cardioblasts in Drosophila. Mech.Dev., 122, 1056-1069.
46. Walker,R.G., Willingham,A.T. and Zuker,C.S. (2000) A Drosophila mechanosensory transduction channel. Science, 287, 2229-2234.
47. George,H. and Terracol,R. (1997) The vrille gene of Drosophila is a maternal enhancer of decapentaplegic and encodes a new member of the bZIP family of transcription factors. Genetics, 146, 1345-1363.
48. Jiang,J., Benson,E., Bausek,N., Doggett,K. and White-Cooper,H. (2007) Tombola, a tesmin/TSO1-family protein, regulates transcriptional activation in the Drosophila male germline and physically interacts with always early. Development, 134, 1549-1559.
49. Tolhuis,B., de Wit,E., Muijrers,I., Teunissen,H., Talhout,W., van Steensel,B. and van Lohuizen,M. (2006) Genome-wide profiling of PRC1 and PRC2 Polycomb chromatin binding in Drosophila melanogaster. Nat.Genet., 38, 694-699.
50. Schwartz,Y.B., Kahn,T.G., Nix,D.A., Li,X.Y., Bourgon,R., Biggin,M. and Pirrotta,V. (2006) Genome-wide analysis of Polycomb targets in Drosophila melanogaster. Nat.Genet., 38, 700-705.
51. Lee,T.I., Jenner,R.G., Boyer,L.A., Guenther,M.G., Levine,S.S., Kumar,R.M., Chevalier,B., Johnstone,S.E., Cole,M.F., Isono,K. et al. (2006) Control of developmental regulators by Polycomb in human embryonic stem cells. Cell, 125, 301-313.
52. Hayashi,S., Ito,K., Sado,Y., Taniguchi,M., Akimoto,A., Takeuchi,H., Aigaki,T., Matsuzaki,F., Nakagoshi,H., Tanimura,T. et al. (2002) GETDB, a database compiling expression patterns and molecular locations of a collection of Gal4 enhancer traps. Genesis., 34, 58-61.
53. St Johnston,R.D., Hoffmann,F.M., Blackman,R.K., Segal,D., Grimaila,R., Padgett,R.W., Irick,H.A. and Gelbart,W.M. (1990) Molecular organization of the decapentaplegic gene in Drosophila melanogaster. Genes Dev., 4, 1114-1127.
54. Ovcharenko,I., Loots,G.G., Nobrega,M.A., Hardison,R.C., Miller,W. and Stubbs,L. (2005) Evolution and functional classification of vertebrate gene deserts. Genome Res., 15, 137-145.
55. Boyer,L.A., Plath,K., Zeitlinger,J., Brambrink,T., Medeiros,L.A., Lee,T.I., Levine,S.S., Wernig,M., Tajonar,A., Ray,M.K. et al. (2006) Polycomb complexes repress developmental regulators in murine embryonic stem cells. Nature, 441, 349-353.
56. Tanay,A., O'Donnell,A.H., Damelin,M. and Bestor,T.H. (2007) Hyperconserved CpG domains underlie Polycomb-binding sites. Proc.Natl.Acad.Sci.U.S.A, 104, 5521-5526.
57. Vire,E., Brenner,C., Deplus,R., Blanchon,L., Fraga,M., Didelot,C., Morey,L., Van Eynde,A., Bernard,D., Vanderwinden,J.M. et al. (2006) The Polycomb group protein EZH2 directly controls DNA methylation. Nature, 439, 871-874.
58. Ellingsen,S., Laplante,M.A., Konig,M., Kikuta,H., Furmanek,T., Hoivik,E.A. and Becker,T.S. (2005) Large-scale enhancer detection in the zebrafish genome. Development, 132, 3799-3811.
59. Nobrega,M.A., Ovcharenko,I., Afzal,V. and Rubin,E.M. (2003) Scanning human gene deserts for long-range enhancers. Science, 302, 413.
60. Smale,S.T. (2001) Core promoters: active contributors to combinatorial gene regulation. Genes Dev., 15, 2503-2508.
61. Valenzuela,L. and Kamakaka,R.T. (2006) Chromatin insulators. Annu.Rev.Genet., 40, 107-138.
62. Kim,T.H., Abdullaev,Z.K., Smith,A.D., Ching,K.A., Loukinov,D.I., Green,R.D., Zhang,M.Q., Lobanenkov,V.V. and Ren,B. (2007) Analysis of the vertebrate insulator protein CTCF-binding sites in the human genome. Cell, 128, 1231-1245.
63. Xie,X., Mikkelsen,T.S., Gnirke,A., Lindblad-Toh,K., Kellis,M. and Lander,E.S. (2007) Systematic discovery of regulatory motifs in conserved regions of the human genome, including thousands of CTCF insulator sites. Proc.Natl.Acad.Sci.U.S.A, 104, 7145-7150.
64. Moon,H., Filippova,G., Loukinov,D., Pugacheva,E., Chen,Q., Smith,S.T., Munhall,A., Grewe,B., Bartkuhn,M., Arnold,R. et al. (2005) CTCF is conserved from Drosophila to humans and confers enhancer blocking of the Fab-8 insulator. EMBO Rep., 6, 165-170.

Chapter 7: Discussion and Conclusions

In the past few years, tremendous progress has been made in the computational analysis of gene regulation. Much of this progress has been driven by the rapid accumulation of diverse data, including multiple genome sequences, large-scale gene expression data, and in vivo binding information from ChIP-chip experiments. New computational tools have been developed to process and utilize the different types of data, helping researchers to extract biologically relevant information from complex data that often contains a great deal of noise. This thesis describes novel strategies for deciphering the mechanisms by which genes are regulated via computational analysis.
The predominant goal has been to integrate large-scale genomic data sets, comparative sequence analysis and motif detection. This work additionally generates novel hypotheses regarding the use and interpretation of biological data and bioinformatics tools. From yeast to flies to worms to humans, this work spans a wide range of organisms yet remains focused on improving our understanding of how regulatory networks can modulate gene expression.

7.1 Human/mouse regulation studies

A major biological hypothesis is that co-expression can be linked to regulation through the analysis of over-represented transcription factor binding sites, predicting transcriptional networks that are activated in various contexts. While success has been demonstrated in simple eukaryotes such as yeast (1;2), the human genome presents a greater challenge. Of the ~3 billion base pairs that make up the human genome, only 1.5% is estimated to code for protein (3), leaving a vast amount of sequence to search for short, modular cis-regulatory elements. Furthermore, algorithms for detecting regulatory motifs based on weight-matrix models predict many binding sites that are readily bound in vitro but have no functional role in vivo (4). This necessitates using additional information to guide the search for cis-regulatory elements. The oPOSSUM system for identification of over-represented transcription factor binding sites (Chapter 2) was developed with the principal aim of integrating evolutionary conservation with motif discovery to reduce the search space to regions more likely to be functional. oPOSSUM is one of the first tools to address over-representation of conserved regulatory motifs for sets of human genes.
Its major contributions lie in providing access to a database of conserved predicted transcription factor binding sites for human and mouse (both through a user-friendly web-based interface and an application programming interface (API) for more advanced users), statistical methods based on clearly stated models, and its robust implementation. oPOSSUM has been well received by the research community, and has been used to generate novel hypotheses about regulatory mechanisms that may play roles in human diseases (5;6).

Cross-species comparisons in higher eukaryotes have been used throughout this thesis to delineate potential regulatory regions. A major contribution is the development of methods to analyze pairwise conservation and identify conserved noncoding sequences in human and mouse sequences. Aligning the orthologous gene sequences to identify conserved promoters is nontrivial as many genes contain multiple alternative transcription start sites to generate diversity and direct temporal- and tissue-specific processes. We hypothesized that by using whole-genome alignments and CAGE transcript data, we could resolve orthologous alternative promoters for inclusion in oPOSSUM, thereby increasing the signal for relevant motifs in sets of biologically related genes. Including the alternative promoter data improved motif over-representation analysis relative to using a single start site, despite a parallel increase in noise (Chapter 3).

Phylogenetic footprinting, which aims to characterize the genomic locations of cis-regulatory elements, does not tell us the functions of the cis-elements. The necessary functional characterization requires costly and time-consuming experimental assays. To minimize the validation time, we hypothesized that features of a gene’s conservation profile could be used to prioritize candidate genes with predicted cis-elements for promoter-reporter gene studies (Chapter 5).
The goal was to capture and automate aspects of the manual promoter assessment procedure used in the Pleiades Promoter Project (www.pleiades.org), in which researchers are generating and characterizing regulatory regions that drive expression in defined brain regions (8). The regulatory resolution scores correlated with subjective manual assessments of candidate genes, but because additional information, such as literature-derived regulatory information and the number of TSSs, was not incorporated, regulatory resolution scores within each manual category showed considerable variation.

7.2 Regulation in model organisms

Research on model organisms has led to a wealth of knowledge about mechanisms of inheritance and animal development, as well as providing insight into gene function and core biological processes. After the discovery of surprisingly large numbers of highly conserved noncoding elements (HCNEs) associated with key developmental processes in vertebrates (9;10), we hypothesized that similar genetic elements exist in other metazoan genomes. To address this, we used multi-species alignments to conduct studies of HCNEs in insects and examined their genomic organization and functional roles (Chapter 6). We concluded that insect HCNEs, like their vertebrate counterparts, cluster near genes involved in development and transcription. A primary result was the observation that arrays of HCNEs in insects are maintained in microsynteny with their regulatory targets as well as bystander genes, like the genomic regulatory blocks (GRBs) observed in vertebrates (11). We showed an association between the boundaries of GRBs and polycomb binding domains, which are epigenetically modified regions involved in gene silencing, to support the hypothesis that GRBs correspond to regulatory domains.
Furthermore, we suggested a mechanism for long-range enhancer-promoter specificity through the recognition of different core promoter architectures for developmental regulators. This work motivates continued study of HCNEs as enhancers in insects, while other studies in insects suggest roles for highly conserved elements (in coding and noncoding regions) in mRNA splicing or as microRNAs (7;12;13).

Although important aspects of cell/tissue development are conserved between mammals and the distant model organisms (flies and/or worms), long evolutionary distances have largely precluded computational inference of human regulatory mechanisms based on binding sites in model organism genes. As a first step towards a method for long-distance mapping of regulatory programs for key developmental processes, we developed a worm-specific version of oPOSSUM (Chapter 3). Our attempt to apply methods from human/mouse comparisons to orthologous C. elegans and C. briggsae sequences highlighted new challenges and limitations. The evolutionary divergence among the nematode species (80-110 million years (14)) appears to be too great for the standard alignment-based approach used for human and mouse, and several known binding sites in our reference sets could not be detected. A similar problem was encountered by GuhaThakurta et al. when searching for novel muscle-specific regulatory elements in C. elegans (15). Despite this challenge, many regulatory motifs in C. elegans are also present in the promoters of their C. briggsae orthologs, indicating that they are functionally conserved in the nematodes (15;16). We are currently exploring modified versions of the oPOSSUM system that compare the predicted binding sites in C. elegans sequences with orthologous pairs of gene sequences from the more similar C. briggsae and C. remanei genomes (17). A further limitation of computational regulatory analysis in worms is the lack of available TFBS profiles for C. elegans TFs.
This need is likely to be addressed in the near future by the modENCODE Project (www.modencode.org) which, as part of its mission to identify all of the sequence-based functional elements in popular model organisms, aims to map transcription factor binding sites in the C. elegans and D. melanogaster genomes using ChIP-chip and ChIP-sequencing technologies.

This thesis additionally presents hypotheses about gene regulatory networks present in the yeast Saccharomyces cerevisiae. Together with the Wine Research Centre at UBC, we used applied genomics and bioinformatics technologies to study long-term yeast fermentation in grape must. A major biological result was the identification of a novel stress response activated during fermentation, and the characterization of that response using over-representation of functional annotations and regulatory motif predictions (Chapter 4). Our findings suggest roles for 62 uncharacterized genes in stress; indeed, one gene has since been implicated in ethanol tolerance and a second in regulating osmotic pressure (Hennie van Vuuren, personal communication). Because yeast genes are separated by short intergenic regions that are on average about 500 bp (18), we did not make use of evolutionary conservation in the regulatory motif overrepresentation analysis. However, seven Saccharomyces genomes are now available (19;20), and investigating the evolution of the identified stress regulatory elements is an intriguing future direction.

7.3 Limitations of conservation analysis

Although conservation has provided us with a great deal of insight into gene regulatory mechanisms through the delineation of potential regulatory sequences, it is unlikely to be able to identify all functionally important elements, even with the accumulation of new genomes.
Conservation is useful for detecting regulation that has been retained through evolution, but it provides only limited insight into changes that impact gene regulation in a species-specific manner. The choice of species for comparison influences our ability to identify functional regulatory motifs using phylogenetic footprinting, as do the choice of algorithm, owing to differences in alignment accuracy (21), and transcription factor binding site evolution. Transcription factor binding sites are subject to less stringent selection than protein-coding regions. This is because binding site turnover, where new binding sites arise from random mutations to replace the function of lost sites, can easily occur given the short length of the sites and the degeneracy of transcription factor binding (22;23). In addition, new binding sites can arise following transposition of mobile elements (24;25). The even-skipped stripe 2 enhancer (S2E) is a well-studied example where substantial divergence among four Drosophila species has occurred at the sequence level, including large insertions and deletions in the spacers between known binding sites, single nucleotide substitutions in binding sites, and gains and losses of binding sites for important activators; yet despite these differences, function, defined in terms of the expression pattern of its target gene, remains highly conserved (26;27). Evidence is accumulating for widespread evolutionary turnover of transcription factor binding sites (23;28-30), as well as of transcription start positions (31). The loss and gain of binding sites and transcription start positions likely contribute to species-specific changes in gene expression. Alignment algorithms that take into account the more flexible architecture of regulatory sequences have emerged that use binding site predictions to anchor and guide alignment (32) or incorporate models of binding site evolution into the alignment process (33).
Despite this, it is evident that for a large number of transcriptional regulatory elements, conservation analysis is insufficient and additional information is needed.

7.4 Future directions for gene regulation studies

A major bottleneck for improving our understanding of human gene regulation is experimental validation of the functions of computationally predicted regulatory sequences. Only a few data sets exist for which the cis-regulatory elements/modules are well-characterized. The muscle regulatory region collection (34), for example, is used time and time again to judge the performance of motif discovery methods, making the majority of predictions difficult to assess and, at times, leading to overtraining. Thus, we are in dire need of new data. Though a number of groups have begun large-scale enhancer validation experiments (35-39), generating appropriate data to test new computational gene regulation approaches also requires experimental determination of the regulatory elements within enhancers, as well as the transcription factors that bind to them. ChIP-chip data from the pilot phase of the ENCODE Project has elucidated a small fraction of the in vivo binding site locations for human transcription factors (40), and more data is expected in time as the project scales to the whole genome. Additionally, experiments investigating epigenetic modifications have illuminated the role of histone modifications in regulating gene expression, and have potential for delineation of functional regulatory domains (41;42).

The vast majority of gene regulation studies represent “the view from the genome” (43). They attempt to model all of the information encoded into the regulatory DNA, and describe the organization of cis-regulatory systems in terms of every interaction that can occur under any circumstance.
However, diverse data is accumulating describing regulatory events in specific cell types and at specific developmental time points (40), so that we can now begin to model regulatory networks in a more dynamic manner, i.e. using “the view from the nucleus” (43). Most computational approaches to identify regulatory mechanisms to date take advantage of DNA sequence information and RNA expression levels, but few utilize information on protein sequences, chromatin properties and the three-dimensional structure and position of transcription factors. Protein data can be used to enhance our understanding of transcription factor binding specificity, linking transcription factors to the sites they regulate. Chromatin modification data can be used, much like sequence conservation data was applied in this thesis, to indicate those regions more likely to serve as regulatory sequences for a cell in a specific environment. Increased understanding of the three-dimensional properties of transcription factors, including both the protein structure and the spatial position of proteins in the nucleus, will allow computational biologists to move beyond DNA sequences to generate predictive models of the nucleus. At this time, computational analysis of a well-designed experiment with direct modulation of transcription factor activity can identify regulatory programs or regulons active in a specific type of cell in a specific condition. With the growing arsenal of experimental data and more integrative analysis methods, it may soon become feasible for computational approaches to make predictive models for the regulation of each gene within the larger regulon.

7.5 References

1. Roth,F.P., Hughes,J.D., Estep,P.W. and Church,G.M. (1998) Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat.Biotechnol., 16, 939-945.
2. Hughes,J.D., Estep,P.W., Tavazoie,S. and Church,G.M. (2000) Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J.Mol.Biol., 296, 1205-1214.
3. Lander,E.S., Linton,L.M., Birren,B., Nusbaum,C., Zody,M.C., Baldwin,J., Devon,K., Dewar,K., Doyle,M., FitzHugh,W. et al. (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860-921.
4. Wasserman,W.W. and Sandelin,A. (2004) Applied bioinformatics for the identification of regulatory elements. Nat.Rev.Genet., 5, 276-287.
5. Newman,J.C., Bailey,A.D. and Weiner,A.M. (2006) Cockayne syndrome group B protein (CSB) plays a general role in chromatin maintenance and remodeling. Proc.Natl.Acad.Sci.U.S.A, 103, 9613-9618.
6. Gazel,A., Banno,T., Walsh,R. and Blumenberg,M. (2006) Inhibition of JNK promotes differentiation of epidermal keratinocytes. J.Biol.Chem., 281, 20530-20541.
7. Siepel,A., Bejerano,G., Pedersen,J.S., Hinrichs,A.S., Hou,M., Rosenbloom,K., Clawson,H., Spieth,J., Hillier,L.W., Richards,S. et al. (2005) Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res., 15, 1034-1050.
8. Portales-Casamar,E., Kirov,S., Lim,J., Lithwick,S., Swanson,M.I., Ticoll,A., Snoddy,J. and Wasserman,W.W. (2007) PAZAR: a framework for collection and dissemination of cis-regulatory sequence annotation. Genome Biol., 8, R207.
9. Bejerano,G., Pheasant,M., Makunin,I., Stephen,S., Kent,W.J., Mattick,J.S. and Haussler,D. (2004) Ultraconserved elements in the human genome. Science, 304, 1321-1325.
10. Sandelin,A., Bailey,P., Bruce,S., Engstrom,P.G., Klos,J.M., Wasserman,W.W., Ericson,J. and Lenhard,B. (2004) Arrays of ultraconserved non-coding regions span the loci of key developmental genes in vertebrate genomes. BMC.Genomics, 5, 99.
11. Kikuta,H., Laplante,M., Navratilova,P., Komisarczuk,A.Z., Engstrom,P.G., Fredman,D., Akalin,A., Caccamo,M., Sealy,I., Howe,K. et al. (2007) Genomic regulatory blocks encompass multiple neighboring genes and maintain conserved synteny in vertebrates. Genome Res., 17, 545-555.
12. Tran,T., Havlak,P. and Miller,J. (2006) MicroRNA enrichment among short 'ultraconserved' sequences in insects. Nucleic Acids Res., 34, e65.
13. Glazov,E.A., Pheasant,M., McGraw,E.A., Bejerano,G. and Mattick,J.S. (2005) Ultraconserved elements in insect genomes: a highly conserved intronic sequence implicated in the control of homothorax mRNA splicing. Genome Res., 15, 800-808.
14. Stein,L.D., Bao,Z., Blasiar,D., Blumenthal,T., Brent,M.R., Chen,N., Chinwalla,A., Clarke,L., Clee,C., Coghlan,A. et al. (2003) The genome sequence of Caenorhabditis briggsae: a platform for comparative genomics. PLoS.Biol., 1, E45.
15. GuhaThakurta,D., Schriefer,L.A., Waterston,R.H. and Stormo,G.D. (2004) Novel transcription regulatory elements in Caenorhabditis elegans muscle genes. Genome Res., 14, 2457-2468.
16. Efimenko,E., Bubb,K., Mak,H.Y., Holzman,T., Leroux,M.R., Ruvkun,G., Thomas,J.H. and Swoboda,P. (2005) Analysis of xbx genes in C. elegans. Development, 132, 1923-1934.
17. Kiontke,K., Gavin,N.P., Raynes,Y., Roehrig,C., Piano,F. and Fitch,D.H. (2004) Caenorhabditis phylogeny predicts convergence of hermaphroditism and extensive intron loss. Proc.Natl.Acad.Sci.U.S.A, 101, 9003-9008.
18. Dujon,B. (1996) The yeast genome project: what did we learn? Trends Genet., 12, 263-270.
19. Cliften,P., Sudarsanam,P., Desikan,A., Fulton,L., Fulton,B., Majors,J., Waterston,R., Cohen,B.A. and Johnston,M. (2003) Finding functional features in Saccharomyces genomes by phylogenetic footprinting. Science, 301, 71-76.
20. Kellis,M., Patterson,N., Endrizzi,M., Birren,B. and Lander,E.S. (2003) Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature, 423, 241-254.
21. Pollard,D.A., Moses,A.M., Iyer,V.N. and Eisen,M.B. (2006) Detecting the limits of regulatory element conservation and divergence estimation using pairwise and multiple alignments. BMC.Bioinformatics, 7, 376.
22. Stone,J.R. and Wray,G.A. (2001) Rapid evolution of cis-regulatory sequences via local point mutations. Mol.Biol.Evol., 18, 1764-1770.
23. Dermitzakis,E.T. and Clark,A.G. (2002) Evolution of transcription factor binding sites in Mammalian gene regulatory regions: conservation and turnover. Mol.Biol.Evol., 19, 1114-1121.
24. Medstrand,P., van de Lagemaat,L.N., Dunn,C.A., Landry,J.R., Svenback,D. and Mager,D.L. (2005) Impact of transposable elements on the evolution of mammalian gene regulation. Cytogenet.Genome Res., 110, 342-352.
25. Britten,R.J. (1997) Mobile elements inserted in the distant past have taken on important functions. Gene, 205, 177-182.
26. Ludwig,M.Z., Patel,N.H. and Kreitman,M. (1998) Functional analysis of eve stripe 2 enhancer evolution in Drosophila: rules governing conservation and change. Development, 125, 949-958.
27. Ludwig,M.Z., Palsson,A., Alekseeva,E., Bergman,C.M., Nathan,J. and Kreitman,M. (2005) Functional evolution of a cis-regulatory module. PLoS.Biol., 3, e93.
28. Emberly,E., Rajewsky,N. and Siggia,E.D. (2003) Conservation of regulatory elements between two species of Drosophila. BMC.Bioinformatics, 4, 57.
29. Moses,A.M., Pollard,D.A., Nix,D.A., Iyer,V.N., Li,X.Y., Biggin,M.D. and Eisen,M.B. (2006) Large-scale turnover of functional transcription factor binding sites in Drosophila. PLoS.Comput.Biol., 2, e130.
30. Doniger,S.W. and Fay,J.C. (2007) Frequent gain and loss of functional transcription factor binding sites. PLoS.Comput.Biol., 3, e99.
31. Frith,M.C., Ponjavic,J., Fredman,D., Kai,C., Kawai,J., Carninci,P., Hayashizaki,Y. and Sandelin,A. (2006) Evolutionary turnover of mammalian transcription start sites. Genome Res., 16, 713-722.
32. Blanco,E., Messeguer,X., Smith,T.F. and Guigo,R. (2006) Transcription factor map alignment of promoter regions. PLoS.Comput.Biol., 2, e49.
33. Bais,A.S., Grossmann,S. and Vingron,M. (2007) Incorporating evolution of transcription factor binding sites into annotated alignments. J.Biosci., 32, 841-850.
34. Wasserman,W.W. and Fickett,J.W. (1998) Identification of regulatory regions which confer muscle-specific gene expression. J.Mol.Biol., 278, 167-181.
35. Trinklein,N.D., Karaoz,U., Wu,J., Halees,A., Force,A.S., Collins,P.J., Zheng,D., Zhang,Z.D., Gerstein,M.B., Snyder,M. et al. (2007) Integrated analysis of experimental data sets reveals many novel promoters in 1% of the human genome. Genome Res., 17, 720-731.
36. Trinklein,N.D., Aldred,S.J., Saldanha,A.J. and Myers,R.M. (2003) Identification and functional analysis of human transcriptional promoters. Genome Res., 13, 308-312.
37. Ellingsen,S., Laplante,M.A., Konig,M., Kikuta,H., Furmanek,T., Hoivik,E.A. and Becker,T.S. (2005) Large-scale enhancer detection in the zebrafish genome. Development, 132, 3799-3811.
38. Woolfe,A., Goodson,M., Goode,D.K., Snell,P., McEwen,G.K., Vavouri,T., Smith,S.F., North,P., Callaway,H., Kelly,K. et al. (2005) Highly conserved noncoding sequences are associated with vertebrate development. PLoS.Biol., 3, e7.
39. Pennacchio,L.A., Ahituv,N., Moses,A.M., Prabhakar,S., Nobrega,M.A., Shoukry,M., Minovitsky,S., Dubchak,I., Holt,A., Lewis,K.D. et al. (2006) In vivo enhancer analysis of human conserved non-coding sequences. Nature, 444, 499-502.
40. Birney,E., Stamatoyannopoulos,J.A., Dutta,A., Guigo,R., Gingeras,T.R., Margulies,E.H., Weng,Z., Snyder,M., Dermitzakis,E.T., Thurman,R.E. et al. (2007) Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature, 447, 799-816.
41. King,D.C., Taylor,J., Zhang,Y., Cheng,Y., Lawson,H.A., Martin,J., Chiaromonte,F., Miller,W. and Hardison,R.C. (2007) Finding cis-regulatory elements using comparative genomics: some lessons from ENCODE data. Genome Res., 17, 775-786.
42. Zhang,Z.D., Paccanaro,A., Fu,Y., Weissman,S., Weng,Z., Chang,J., Snyder,M. and Gerstein,M.B. (2007) Statistical analysis of the genomic distribution and correlation of regulatory elements in the ENCODE regions. Genome Res., 17, 787-797.
43. Arnone,M.I. and Davidson,E.H. (1997) The hardwiring of development: organization and function of genomic regulatory systems. Development, 124, 1851-1864.

Appendices

Appendix 1

Supplemental Table S1 for Chapter 2 of this thesis is available online.
http://nar.oxfordjournals.org/content/vol33/issue10/images/data/3154/DC1/gki624.txt
Microarray expression data for genes with significantly reduced expression levels in HUVEC cells stimulated with IL-1B in the presence of an inhibitor of the NF-kappaB signaling pathway.

Appendix 2

Table S1 shows the distribution of transcription start regions for human and mouse genes based on Ensembl and CAGE annotations. We indicate how many genes (and what percentage of genes, in brackets) are supported by CAGE evidence. As expected, far more TSRs from known transcripts are supported by CAGE tags compared to TSRs derived from putative EST gene transcripts (52% versus 6% for human; 62% versus 9% for mouse). For our analysis, we consider all TSRs derived from known or novel transcripts in the core EnsEMBL annotation, as well as those TSRs derived from putative EST gene transcripts with CAGE support. Thus, we have 20349 TSRs (18497+790+1061) and 20059 TSRs (16757+739+2563) for human and mouse genes, respectively, with an average of 1.3 TSRs per gene. Approximately one quarter of genes have multiple TSRs based on our procedure (Figure S1).

Table S1.
Distribution of transcription start regions for human and mouse genes

                               Known        Novel      Putative *   Totals
Human
  Transcription start regions  18497        790        18548        37835
  TSRs with ≥ 5 CAGE tags      9561 (52%)   255 (32%)  1061 (6%)    10877 (29%)
Mouse
  Transcription start regions  16757        739        27976        45472
  TSRs with ≥ 5 CAGE tags      10455 (62%)  132 (18%)  2563 (9%)    13150 (29%)

* Derived from EnsEMBL EST Genes

[Figure S1: pie charts for Human (20,349) and Mouse (20,059) showing the proportion of genes with 1, 2, 3, 4, 5 or >5 TSRs.]
Figure S1. Number of TSRs per gene.

Table S2. oPOSSUM Worm SSA Results using JASPAR CORE TFBS profiles

JASPAR CORE TF  TF Class            TF Supergroup  Background  Background    Target     Target        Background  Target     Z-score  Fisher
                                                   gene hits   gene nonhits  gene hits  gene nonhits  TFBS rate   TFBS rate           score
SP1             ZN-FINGER, C2H2     vertebrate     448         10144         5          29            0.0026      0.0127     15.72    1.38E-02
SU_h *          IPT/TIG domain      insect         64          10528         2          32            0.0005      0.0051     15.36    1.88E-02
GAMYB           TRPCLUSTER          plant          1333        9259          9          25            0.0085      0.0238     13.25    2.21E-02
TCF1            HOMEO               vertebrate     101         10491         2          32            0.0007      0.0045     10.66    4.27E-02
ZNF42_5-13      ZN-FINGER, C2H2     vertebrate     825         9767          7          27            0.005       0.0143     10.42    1.46E-02
Dof2            ZN-FINGER, DOF      plant          3447        7145          13         21            0.0194      0.0362     9.616    2.95E-01
PBF             ZN-FINGER, DOF      plant          3444        7148          14         20            0.0164      0.0318     9.587    1.85E-01
MNB1A           ZN-FINGER, DOF      plant          3444        7148          14         20            0.0164      0.0318     9.587    1.85E-01
HMG-1           HMG                 plant          3556        7036          12         22            0.0308      0.0515     9.472    4.80E-01
Agamous         MADS                plant          1151        9441          6          28            0.0078      0.0175     8.698    1.59E-01

Results are ordered by Z-score and the following parameters were used: Conservation level - Top 30% of conserved regions; Matrix match threshold 80%.

Appendix 3

This appendix contains Supplemental Tables S2-S7 and Supplemental Figure S1 for Chapter 4 of this thesis. Supplemental Table S1 is in a separate document (tab-delimited file) available online at http://www.cisreg.ca/shosui/FSR/Sup_1.txt.
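As a hedged illustration of how the Fisher scores in the over-representation tables of these appendices (the oPOSSUM worm results in Appendix 2 and the motif enrichment by cluster in Appendix 3) relate to the reported hit and non-hit counts, the following Python sketch computes a one-tailed Fisher exact probability from a 2x2 table. It assumes that the score is the upper-tail hypergeometric probability and that target and background genes are treated as disjoint groups; both choices, and the function name `fisher_one_tailed`, are illustrative assumptions rather than the oPOSSUM implementation.

```python
from math import comb


def fisher_one_tailed(tg_hits, tg_nonhits, bg_hits, bg_nonhits):
    """Upper-tail Fisher exact probability for a 2x2 hit/non-hit table.

    Assumption (not taken from the thesis): target and background genes
    are disjoint groups, and the score is the probability of observing at
    least `tg_hits` target genes with a predicted site under the
    hypergeometric null model.
    """
    n_target = tg_hits + tg_nonhits            # genes in the target set
    n_total = n_target + bg_hits + bg_nonhits  # all genes in the table
    k_hits = tg_hits + bg_hits                 # all genes with a predicted site
    upper = min(n_target, k_hits)
    # Sum exact integer counts first, then divide once to limit rounding error.
    numerator = sum(
        comb(k_hits, k) * comb(n_total - k_hits, n_target - k)
        for k in range(tg_hits, upper + 1)
    )
    return numerator / comb(n_total, n_target)


if __name__ == "__main__":
    # Counts copied from the STRE row with 5 target hits and 3 non-hits
    # in the Appendix 3 motif enrichment table.
    print(f"{fisher_one_tailed(5, 3, 901, 5772):.3g}")
```

Exact integer arithmetic via `math.comb` is used here so that the sketch needs no third-party packages; a statistics library routine for Fisher's exact test could be substituted if one is available.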
[Figure S1: Venn diagrams with panels labeled ESR, CER and "Differentially expressed during fermentation"; CER (induced), ESR (induced) and FSR (induced); CER (repressed), ESR (repressed) and Clusters 18-20 (repressed). Region counts are not recoverable from the extracted text.]

Figure S1. Comparison of characterized stress responses. Venn diagram showing the number of induced or repressed genes in fermentation that are associated with transcriptionally characterized stress responses. Environmental Stress Response (ESR), Common Environmental Response (CER), Fermentation Stress Response (FSR).

Table S2. Enrichment of regulatory motifs by cluster.

Rank  PWM     TFBS Name  Tg rate  Bg rate  Zscore  Tg hits  Tg nonhits  Bg hits  Bg nonhits  Fisher

Cl. 1
1     MY0044  PDS        0.036    0.014    9.15    4        1           2834     3839        1.08E-01
2     MY0011  LEU3       0.013    0.003    8.35    2        3           887      5786        1.35E-01
3     MY0042  UME6       0.026    0.013    5.78    3        2           2873     3800        3.72E-01
4     MY0041  MIG1c      0.029    0.017    4.67    4        1           3357     3316        1.92E-01
5     MY0030  MIG1b      0.019    0.011    4.20    2        3           1863     4810        4.29E-01
6     MY0009  HAP2_3_4   0.024    0.014    4.15    4        1           3432     3241        2.06E-01
7     MY0035  SWI5       0.029    0.018    4.01    3        2           3512     3161        5.49E-01
8     MY0026  RLM1       0.022    0.014    3.11    3        2           2194     4479        2.03E-01
9     MY0015  MIG1a      0.005    0.002    2.95    1        4           529      6144        3.39E-01
10    MY0018  PDR3       0.010    0.006    2.60    2        3           1332     5341        2.62E-01

Cl. 2
1     MY0025  STRE       0.018    0.004    14.53   5        3           901      5772        1.78E-03
2     MY0029  ADR1P      0.042    0.022    8.88    7        1           4687     1986        2.60E-01
3     MY0011  LEU3       0.011    0.003    8.24    3        5           887      5786        7.88E-02
4     MY0028  XBP1       0.174    0.137    6.74    8        0           6654     19          9.77E-01
5     MY0042  UME6       0.025    0.013    6.44    5        3           2873     3800        2.25E-01
6     MY0038  GCR1       0.039    0.025    5.56    7        1           6113     560         8.59E-01
7     MY0006  ECB        0.015    0.008    5.44    3        5           1426     5247        2.34E-01
8     MY0037  UASPHR     0.057    0.041    5.06    7        1           5368     1305        5.17E-01
9     MY0004  CCA        0.016    0.010    4.11    3        5           1721     4952        3.42E-01
10    MY0032  SCB        0.067    0.052    4.06    8        0           6511     162         8.22E-01

Cl. 3
1     MY0016  OAF1       0.004    0.001    8.06    1        10          124      6549        1.88E-01
2     MY0004  CCA        0.018    0.010    5.96    6        5           1721     4952        3.99E-02
3     MY0037  UASPHR     0.055    0.041    5.02    10       1           5368     1305        3.36E-01
4     MY0043  GATA       0.096    0.079    4.54    11       0           6651     22          9.64E-01
5     MY0001  ABF1       0.022    0.015    4.38    6        5           3471     3202        5.55E-01
6     MY0008  GCN4       0.020    0.015    3.30    8        3           3436     3237        1.34E-01
7     MY0010  HSE        0.012    0.008    2.79    5        6           1850     4823        1.63E-01
8     MY0028  XBP1       0.148    0.137    2.39    11       0           6654     19          9.69E-01
9     MY0021  REB1       0.009    0.007    1.91    2        9           1940     4733        8.74E-01
10    MY0040  FKH1       0.074    0.068    1.69    11       0           6485     188         7.30E-01

Cl. 4
1     MY0029  ADR1P      0.036    0.022    17.80   51       12          4687     1986        3.92E-02
2     MY0005  CSRE       0.030    0.017    16.74   46       17          3395     3278        3.00E-04
3     MY0025  STRE       0.010    0.004    16.58   19       44          901      5772        5.05E-04
4     MY0018  PDR3       0.012    0.006    14.97   20       43          1332     5341        1.89E-02
5     MY0016  OAF1       0.003    0.001    12.18   4        59          124      6549        3.13E-02
6     MY0042  UME6       0.020    0.013    11.21   34       29          2873     3800        5.40E-02
7     MY0030  MIG1b      0.016    0.011    10.09   24       39          1863     4810        5.25E-02
8     MY0041  MIG1c      0.023    0.017    8.06    37       26          3357     3316        1.14E-01
9     MY0039  CAR1_r     0.016    0.012    6.31    35       28          3147     3526        1.15E-01
10    MY0020  RAP1       0.025    0.020    6.20    42       21          4214     2459        3.32E-01

Cl. 5
1     MY0018  PDR3       0.012    0.006    12.39   15       22          1332     5341        3.41E-03
2     MY0013  MCB        0.013    0.008    8.31    18       19          2085     4588        2.04E-02
3     MY0020  RAP1       0.027    0.020    5.79    28       9           4214     2459        7.72E-02
4     MY0004  CCA        0.013    0.010    4.75    13       24          1721     4952        1.35E-01
5     MY0037  UASPHR     0.047    0.041    4.32    28       9           5368     1305        8.27E-01
6     MY0035  SWI5       0.022    0.018    4.18    23       14          3512     3161        1.60E-01
7     MY0036  TBP        0.120    0.111    3.84    37       0           6655     18          9.05E-01
8     MY0032  SCB        0.057    0.052    3.06    36       1           6511     162         7.74E-01
9     MY0040  FKH1       0.074    0.068    3.04    35       2           6485     188         9.14E-01
10    MY0029  ADR1P      0.025    0.022    2.92    27       10          4687     1986        4.37E-01

Cl. 6
1     MY0037  UASPHR     0.053    0.041    13.28   88       11          5368     1305        1.90E-02
2     MY0029  ADR1P      0.029    0.022    11.34   73       26          4687     1986        2.62E-01
3     MY0015  MIG1a      0.004    0.002    10.37   12       87          529      6144        9.49E-02
4     MY0028  XBP1       0.152    0.137    9.87    99       0           6654     19          7.56E-01
5     MY0011  LEU3       0.006    0.003    9.80    24       75          887      5786        2.48E-03
6     MY0025  STRE       0.006    0.004    9.63    19       80          901      5772        7.24E-02
7     MY0041  MIG1c      0.022    0.017    8.71    63       36          3357     3316        5.46E-03
8     MY0005  CSRE       0.022    0.017    8.52    60       39          3395     3278        3.39E-02
9     MY0018  PDR3       0.008    0.006    8.39    25       74          1332     5341        1.21E-01
10    MY0039  CAR1_r     0.015    0.012    7.19    52       47          3147     3526        1.69E-01

Cl. 7
1     MY0025  STRE       0.006    0.004    17.35   70       257         901      5772        9.06E-05
2     MY0029  ADR1P      0.027    0.022    15.40   249      78          4687     1986        1.19E-02
3     MY0030  MIG1b      0.013    0.011    11.28   105      222         1863     4810        5.80E-02
4     MY0018  PDR3       0.008    0.006    11.08   81       246         1332     5341        2.23E-02
5     MY0016  OAF1       0.002    0.001    10.90   12       315         124      6549        2.48E-02
6     MY0005  CSRE       0.021    0.017    10.88   176      151         3395     3278        1.63E-01
7     MY0039  CAR1_r     0.015    0.012    9.76    176      151         3147     3526        1.08E-02
8     MY0044  PDS        0.017    0.014    8.92    151      176         2834     3839        1.03E-01
9     MY0042  UME6       0.016    0.013    8.82    155      172         2873     3800        6.82E-02
10    MY0041  MIG1c      0.019    0.017    8.54    177      150         3357     3316        9.80E-02

Cl. 8
1     MY0040  FKH1       0.078    0.068    14.45   237      1           6485     188         9.61E-03
2     MY0026  RLM1       0.017    0.014    9.16    89       149         2194     4479        8.39E-02
3     MY0021  REB1       0.009    0.007    6.69    85       153         1940     4733        1.74E-02
4     MY0018  PDR3       0.007    0.006    5.76    54       184         1332     5341        1.71E-01
5     MY0003  CBF1       0.006    0.005    5.28    51       187         1209     5464        1.14E-01
6     MY0010  HSE        0.010    0.008    4.48    69       169         1850     4823        3.58E-01
7     MY0024  STE12      0.020    0.019    4.40    140      98          3790     2883        2.91E-01
8     MY0028  XBP1       0.141    0.137    3.75    238      0           6654     19          5.13E-01
9     MY0036  TBP        0.115    0.111    3.64    238      0           6655     18          5.32E-01
10    MY0037  UASPHR     0.043    0.041    3.49    190      48          5368     1305        6.30E-01

Cl. 9
1     MY0007  GAL4       0.001    0.000    20.69   3        155         28       6645        3.40E-02
2     MY0026  RLM1       0.018    0.014    8.59    67       91          2194     4479        8.37E-03
3     MY0025  STRE       0.005    0.004    7.59    33       125         901      5772        7.50E-03
4     MY0009  HAP2_3_4   0.017    0.014    7.36    90       68          3432     3241        9.76E-02
5     MY0044  PDS        0.018    0.014    7.31    86       72          2834     3839        1.83E-03
6     MY0042  UME6       0.016    0.013    7.05    81       77          2873     3800        2.44E-02
7     MY0005  CSRE       0.021    0.017    6.80    86       72          3395     3278        2.11E-01
8     MY0030  MIG1b      0.013    0.011    6.14    52       106         1863     4810        9.96E-02
9     MY0029  ADR1P      0.025    0.022    5.92    114      44          4687     1986        3.36E-01
10    MY0012  LYS14      0.016    0.014    5.92    69       89          2601     4072        1.33E-01

Cl. 10
1     MY0025  STRE       0.007    0.004    13.50   41       133         901      5772        2.86E-04
2     MY0016  OAF1       0.002    0.001    12.22   7        167         124      6549        4.89E-02
3     MY0044  PDS        0.019    0.014    12.19   93       81          2834     3839        2.58E-03
4     MY0029  ADR1P      0.027    0.022    12.07   134      40          4687     1986        3.01E-02
5     MY0039  CAR1_r     0.016    0.012    12.01   102      72          3147     3526        1.79E-03
6     MY0042  UME6       0.016    0.013    6.56    83       91          2873     3800        1.26E-01
7     MY0011  LEU3       0.004    0.003    5.34    30       144         887      5786        8.44E-02
8     MY0018  PDR3       0.007    0.006    5.24    44       130         1332     5341        5.39E-02
9     MY0007  GAL4       0.001    0.000    5.13    2        172         28       6645        1.76E-01
10    MY0038  GCR1       0.027    0.025    4.30    162      12          6113     560         2.95E-01

Cl. 11
1     MY0007  GAL4       0.002    0.000    15.65   2        46          28       6645        1.91E-02
2     MY0005  CSRE       0.025    0.017    9.02    34       14          3395     3278        4.09E-03
3     MY0002  AFT1       0.014    0.009    8.04    22       26          1818     4855        4.58E-03
4     MY0009  HAP2_3_4   0.020    0.014    7.29    32       16          3432     3241        2.42E-02
5     MY0029  ADR1P      0.028    0.022    6.92    39       9           4687     1986        6.18E-02
6     MY0025  STRE       0.007    0.004    6.82    12       36          901      5772        2.39E-02
7     MY0030  MIG1b      0.015    0.011    6.38    18       30          1863     4810        9.70E-02
8     MY0017  PAC        0.015    0.011    6.06    29       19          2785     3888        7.16E-03
9     MY0034  ROX1       0.028    0.024    3.85    34       14          4212     2461        1.70E-01
10    MY0044  PDS        0.017    0.014    3.08    22       26          2834     3839        3.71E-01

Cl. 12
1     MY0044  PDS        0.020    0.014    9.70    56       42          2834     3839        2.59E-03
2     MY0007  GAL4       0.001    0.000    9.34    2        96          28       6645        6.94E-02
3     MY0005  CSRE       0.023    0.017    9.14    60       38          3395     3278        2.63E-02
4     MY0019  PHO4       0.003    0.002    8.35    11       87          379      6294        2.47E-02
5     MY0011  LEU3       0.005    0.003    6.91    19       79          887      5786        5.88E-02
6     MY0014  MET31_32   0.006    0.004    6.13    21       77          1162     5511
7     MY0029  ADR1P      0.024    0.022    4.29    75       23          4687     1986        1.05E-01
8     MY0012  LYS14      0.016    0.014    4.04    42       56          2601     4072        2.48E-01
9     MY0016  OAF1       0.001    0.001    3.91    3        95          124      6549        2.79E-01
10    MY0030  MIG1b      0.012    0.011    2.80    33       65          1863     4810        1.27E-01

Cl. 13 (0 genes differentially expressed)

Cl. 14
1     MY0010  HSE        0.012    0.008    11.11   57       97          1850     4823        8.29E-03
2     MY0032  SCB        0.060    0.052    10.40   151      3           6511     162         4.86E-01
3     MY0012  LYS14      0.017    0.014    8.10    75       79          2601     4072        9.64E-03
4     MY0023  RRPE       0.040    0.035    8.06    136      18          5417     1256        1.27E-02
5     MY0028  XBP1       0.146    0.137    6.78    154      0           6654     19          6.48E-01
6     MY0034  ROX1       0.028    0.024    6.66    105      49          4212     2461        1.14E-01
7     MY0033  PHO2       0.112    0.106    4.97    8        146         379      6294        6.52E-01
8     MY0030  MIG1b      0.012    0.011    4.87    47       107         1863     4810        2.65E-01
9     MY0004  CCA        0.011    0.010    4.76    45       109         1721     4952        1.92E-01
10    MY0037  UASPHR     0.044    0.041    3.59    132      22          5368     1305        5.91E-02

Cl. 15
1     MY0016  OAF1       0.002    0.001    11.02   8        215         124      6549        6.37E-02
2     MY0031  MCM1       0.090    0.083    9.20    189      34          5576     1097        3.58E-01
3     MY0037  UASPHR     0.045    0.041    6.31    182      41          5368     1305        3.69E-01
4     MY0023  RRPE       0.038    0.035    6.08    187      36          5417     1256        1.79E-01
5     MY0012  LYS14      0.016    0.014    6.06    101      122         2601     4072        3.43E-02
6     MY0014  MET31_32   0.005    0.004    6.01    50       173         1162     5511        3.55E-02
7     MY0033  PHO2       0.111    0.106    5.08    222      1           6662     11          9.44E-01
8     MY0036  TBP        0.115    0.111    4.14    223      0           6655     18          5.53E-01
9     MY0003  CBF1       0.005    0.005    3.44    47       176         1209     5464        1.50E-01
10    MY0001  ABF1       0.016    0.015    3.18    123      100         3471     3202        1.96E-01

Cl. 16
1     MY0023  RRPE       0.041    0.035    15.94   400      45          5417     1256        6.65E-07
2     MY0022  RPN4       0.009    0.007    11.69   160      285         1914     4759        7.85E-04
3     MY0033  PHO2       0.113    0.106    11.16   444      1           6662     11          8.30E-01
4     MY0031  MCM1       0.088    0.083    10.00   379      66          5576     1097        2.07E-01
5     MY0017  PAC        0.013    0.011    9.93    213      232         2785     3888        6.63E-03
6     MY0034  ROX1       0.027    0.024    8.39    304      141         4212     2461        1.51E-02
7     MY0024  STE12      0.021    0.019    7.32    264      181         3790     2883        1.60E-01
8     MY0032  SCB        0.055    0.052    6.22    440      5           6511     162         4.52E-02
9     MY0036  TBP        0.114    0.111    4.53    445      0           6655     18          3.12E-01
10    MY0003  CBF1       0.005    0.005    4.31    87       358         1209     5464        2.42E-01

Cl. 17
1     MY0011  LEU3       0.007    0.003    11.62   17       44          887      5786        2.12E-03
2     MY0023  RRPE       0.046    0.035    11.09   56       5           5417     1256        1.85E-02
3     MY0012  LYS14      0.020    0.014    9.94    33       28          2601     4072        1.22E-02
4     MY0031  MCM1       0.096    0.083    8.49    51       10          5576     1097        5.80E-01
5     MY0018  PDR3       0.009    0.006    8.23    17       44          1332     5341        8.80E-02
6     MY0009  HAP2_3_4   0.019    0.014    7.87    38       23          3432     3241        5.86E-02
7     MY0034  ROX1       0.030    0.024    6.95    47       14          4212     2461        1.51E-02
8     MY0004  CCA        0.012    0.010    4.49    19       42          1721     4952        2.08E-01
9     MY0043  GATA       0.086    0.079    4.35    60       1           6651     22          9.82E-01
10    MY0040  FKH1       0.072    0.068    3.11    60       1           6485     188         4.85E-01

Cl. 18
1     MY0023  RRPE       0.044    0.035    23.54   332      33          5417     1256        2.93E-07
2     MY0020  RAP1       0.025    0.020    14.88   259      106         4214     2459        1.33E-03
3     MY0017  PAC        0.014    0.011    13.21   187      178         2785     3888        2.31E-04
4     MY0024  STE12      0.022    0.019    10.60   222      143         3790     2883        7.19E-02
5     MY0004  CCA        0.012    0.010    9.65    113      252         1721     4952        1.78E-02
6     MY0033  PHO2       0.111    0.106    6.86    365      0           6662     11          5.56E-01
7     MY0006  ECB        0.009    0.008    6.29    92       273         1426     5247        4.93E-02
8     MY0002  AFT1       0.011    0.009    5.16    114      251         1818     4855        5.58E-02
9     MY0021  REB1       0.008    0.007    5.07    122      243         1940     4733        4.39E-02
10    MY0032  SCB        0.054    0.052    3.44    357      8           6511     162         4.76E-01

Cl. 19
1     MY0005  CSRE       0.041    0.017    10.70   6        1           3395     3278        6.86E-02
2     MY0030  MIG1b      0.027    0.011    9.79    5        2           1863     4810        2.11E-02
3     MY0041  MIG1c      0.034    0.017    8.09    6        1           3357     3316        6.47E-02
4     MY0002  AFT1       0.017    0.009    4.70    3        4           1818     4855        2.92E-01
5     MY0014  MET31_32   0.009    0.004    4.58    2        5           1162     5511        3.52E-01
6     MY0023  RRPE       0.046    0.035    3.55    6        1           5417     1256        6.09E-01
7     MY0019  PHO4       0.004    0.002    3.34    1        6           379      6294        3.36E-01
8     MY0012  LYS14      0.020    0.014    3.20    3        4           2601     4072        5.58E-01
9     MY0032  SCB        0.064    0.052    3.13    7        0           6511     162         8.42E-01
10    MY0029  ADR1P      0.027    0.022    2.36    7        0           4687     1986        8.45E-02

Cl. 20
1     MY0020  RAP1       0.036    0.020    17.02   34       7           4214     2459        5.09E-03
2     MY0023  RRPE       0.053    0.035    15.21   38       3           5417     1256        3.67E-02
3     MY0002  AFT1       0.017    0.009    11.49   16       25          1818     4855        6.87E-02
4     MY0042  UME6       0.018    0.013    6.98    22       19          2873     3800        1.14E-01
5     MY0024  STE12      0.024    0.019    6.32    28       13          3790     2883        9.15E-02
6     MY0004  CCA        0.013    0.010    5.30    16       25          1721     4952        4.41E-02
7     MY0035  SWI5       0.023    0.018    5.20    27       14          3512     3161        6.15E-02
8     MY0032  SCB        0.059    0.052    4.51    41       0           6511     162         3.66E-01
9     MY0001  ABF1       0.017    0.015    3.56    26       15          3471     3202        9.63E-02
10    MY0018  PDR3       0.007    0.006    2.23    10       31          1332     5341        2.95E-01

Cl. 1-6
1     MY0029  ADR1P      0.030    0.022    19.33   166      56          4687     1986        8.21E-02
2     MY0018  PDR3       0.010    0.006    18.35   65       157         1332     5341        7.26E-04
3     MY0025  STRE       0.007    0.004    17.82   48       174         901      5772        7.57E-04
4     MY0005  CSRE       0.023    0.017    14.81   136      86          3395     3278        1.40E-03
5     MY0037  UASPHR     0.049    0.041    13.78   189      33          5368     1305        4.60E-02
6     MY0011  LEU3       0.005    0.003    11.74   47       175         887      5786        9.74E-04
7     MY0041  MIG1c      0.021    0.017    11.70   132      90          3357     3316        4.37E-03
8     MY0030  MIG1b      0.014    0.011    10.53   73       149         1863     4810        6.29E-02
9     MY0042  UME6       0.017    0.013    10.20   114      108         2873     3800        8.75E-03
10    MY0039  CAR1_r     0.015    0.012    8.38    114      108         3147     3526        1.23E-01

Cl. 7-10
1     MY0025  STRE       0.006    0.004    18.76   172      720         901      5772        4.60E-06
2     MY0029  ADR1P      0.025    0.022    16.29   652      240         4687     1986        4.21E-02
3     MY0044  PDS        0.017    0.014    13.26   426      466         2834     3839        1.58E-03
4     MY0039  CAR1_r     0.014    0.012    12.61   473      419         3147     3526        5.63E-04
5     MY0016  OAF1       0.001    0.001    11.27   26       866         124      6549        2.73E-02
6     MY0018  PDR3       0.007    0.006    10.62   204      688         1332     5341        2.47E-02
7     MY0005  CSRE       0.019    0.017    10.50   473      419         3395     3278        1.21E-01
8     MY0030  MIG1b      0.012    0.011    9.88    272      620         1863     4810        5.95E-02
9     MY0042  UME6       0.015    0.013    9.87    408      484         2873     3800        6.91E-02
10    MY0026  RLM1       0.016    0.014    9.16    320      572         2194     4479        4.09E-02

Cl. 11-12
1     MY0007  GAL4       0.001    0.000    16.75   4        142         28       6645        4.55E-03
2     MY0005  CSRE       0.024    0.017    12.67   94       52          3395     3278        7.68E-04
3     MY0044  PDS        0.019    0.014    9.73    77       69          2834     3839        8.54E-03
4     MY0029  ADR1P      0.026    0.022    7.50    114      32          4687     1986        2.26E-02
5     MY0011  LEU3       0.005    0.003    6.85    28       118         887      5786        3.04E-02
6     MY0030  MIG1b      0.013    0.011    5.97    51       95          1863     4810        4.02E-02
7     MY0019  PHO4       0.003    0.002    5.74    13       133         379      6294        7.62E-02
8     MY0002  AFT1       0.011    0.009    4.86    49       97          1818     4855        5.69E-02
9     MY0025  STRE       0.005    0.004    4.33    26       120         901      5772        8.72E-02
10    MY0009  HAP2_3_4   0.016    0.014    3.63    79       67          3432     3241        2.89E-01

Cl. 14-17
1     MY0023  RRPE       0.040    0.035    20.66   776      104         5417     1256        7.18E-08
2     MY0031  MCM1       0.089    0.083    15.36   745      135         5576     1097        2.19E-01
3     MY0033  PHO2       0.112    0.106    13.04   878      2           6662     11          8.14E-01
4     MY0034  ROX1       0.027    0.024    10.76   588      292         4212     2461        1.72E-02
5     MY0017  PAC        0.012    0.011    8.55    404      476         2785     3888        1.03E-02
6     MY0032  SCB        0.055    0.052    8.07    867      13          6511     162         4.43E-02
7     MY0012  LYS14      0.015    0.014    7.61    379      501         2601     4072        1.10E-02
8     MY0036  TBP        0.114    0.111    5.15    879      1           6655     18          3.33E-01
9     MY0022  RPN4       0.008    0.007    5.14    275      605         1914     4759        6.27E-02
10    MY0004  CCA        0.010    0.010    4.53    234      646         1721     4952        3.18E-01

Cl. 18-20
1     MY0023  RRPE       0.045    0.035    27.48   364      36          5417     1256        7.51E-08
2     MY0020  RAP1       0.026    0.020    19.48   283      117         4214     2459        1.12E-03
3     MY0017  PAC        0.014    0.011    12.48   202      198         2785     3888        3.67E-04
4     MY0024  STE12      0.022    0.019    12.01   247      153         3790     2883        2.88E-02
5     MY0004  CCA        0.012    0.010    10.75   126      274         1721     4952        7.57E-03
6     MY0002  AFT1       0.011    0.009    9.21    125      275         1818     4855        4.72E-02
7     MY0033  PHO2       0.111    0.106    6.91    400      0           6662     11          5.27E-01
8     MY0032  SCB        0.055    0.052    5.10    392      8           6511     162         3.70E-01
9     MY0021  REB1       0.008    0.007    4.61    133      267         1940     4733        4.33E-02
10    MY0005  CSRE       0.018    0.017    2.83    211      189         3395     3278        2.49E-01

TFBS profiles with Z-scores ≥ 10 and Fisher scores ≤ 0.01 are highlighted in boldface. Abbreviations: Cl., cluster; Tg, target; Bg, background.

Table S3. Glucose repressed genes induced during fermentation. Forty most highly glucose repressed genes (Young et al., 2003). Genes in the FSR are shaded in green. An asterisk (*) indicates glucose repressed genes (as reviewed by Schuller, 2003).

Gene      Cluster  Max. FC  DE
ACS1      4        10.79    TRUE
CTA1      4        6.48     TRUE
YPR151C   4        17.68    TRUE
ATO3      6        4.41     TRUE
SPG1      6        5.97     TRUE
REG2      7        3.20     TRUE
ICL1      7        2.26     FALSE
SPS100    7        3.38     TRUE
PCK1      7        3.26     TRUE
FDH1      7        3.89     TRUE
YPR150W   7        2.11     TRUE
ADY2      8        2.88     TRUE
DMC1      8        2.05     TRUE
YKL187C   8        1.79     FALSE
YMR107W   8        3.53     TRUE
CIT3      8        2.18     TRUE
JEN1      9        2.47     TRUE
IDP3      9        1.66     FALSE
FDH2      9        1.62     FALSE
POX1      10       3.78     TRUE
FOX2      10       3.49     TRUE
IDP2*     10       3.45     TRUE
FBP1*     10       4.11     TRUE
YNL195C   10       4.42     TRUE
FUN34     10       5.33     TRUE
POT1      11       3.82     TRUE
YAT1      13       1.17     FALSE
YEL008W   13       1.06     FALSE
YER024W   13       1.07     FALSE
YGR067C   13       1.33     FALSE
YGR243W   13       1.11     FALSE
GUT1      13       1.30     FALSE
BOP2      13       1.38     FALSE
YMR206W   13       1.11     FALSE
PDH1      13       1.40     FALSE
ICL2      13       1.10     FALSE
YIL057C   14       2.01     TRUE
YNL013C   14       1.50     FALSE
SFC1      16       1.64     FALSE
YLR126C   16       1.57     TRUE

Table S4. Induced genes involved in mitochondrial respiration/oxidative phosphorylation. FSR genes are highlighted in green.

Gene      Cluster  Max. FC
CYB2      2        68.30
MBR1      3        27.90
HAP1      5        9.63
ISF1      6        6.36
COX23     7        3.28
IDH1      7        2.22
NCA2      8        3.51
IDH2      8        2.65
CIT3      8        2.18
MDL2      8        2.12
SDH1      9        2.77
ATP18     9        2.59
COX14     9        2.52
ATP15     9        2.50
COX16     9        2.20
CIT1*     9        2.20
KGD2      9        2.09
COQ9      9        2.09
QCR10*    10       18.42
INH1      10       12.71
COX13*    10       9.41
COX5A*    10       9.21
CYC1*     10       8.97
PET10     10       8.06
RIP1      10       7.46
CYT1      10       6.96
CYC7      10       6.27
COX7*     10       5.65
QCR6*     10       5.24
QCR9*     10       5.15
COR1      10       4.90
SDH4*     10       4.76
DLD1*     10       4.11
NDI1*     10       3.92
QCR8*     10       3.83
SDH2*     10       3.52
QCR7*     10       3.36
STF1      10       3.03
PET100    10       3.02
AAC1      10       3.00
KGD1*     10       2.97
COX4*     11       7.05
PET9      11       5.69
QCR2*     11       5.35
ATP20     11       4.24
COX6*     11       4.14
ATP3      11       3.86
COX8*     11       3.52
ATP2      11       2.96
ATP7      11       2.91
ETR1      11       2.85
MCR1      11       2.81
ATP1      12       2.77
RSM24     12       2.66
SDH3      12       2.63
SHY1      12       2.62
COX9      12       2.62
LSC1      12       2.55
TAZ1      12       2.34
MDH1      12       2.32
YJL045W   12       2.20
CYB5      12       2.20
COQ5      12       2.10

Table S5. Induced genes involved in sterol biosynthesis.

Gene      Cluster  Max. FC
ATG26     7        3.10
PDR16     7        3.01
ARE1      8        2.40
ECM22     8        2.02
UPC2      8        4.00
ERG5      9        2.59
YEH1      9        2.50
HMG1      10       2.89
ERG13     11       3.08
MCR1      11       2.81
CYB5      12       2.20
ERG1      12       2.35
ERG8      12       2.74
ERG9      12       2.18
IDI1      12       2.31

Table S6. Induced genes involved in oxidative stress. FSR genes are highlighted in green.

Gene      Cluster  Max. FC
CTA1      4        6.48
GPX1      4        6.71
RIM15     6        6.99
SKN7      6        7.62
SNQ2      6        5.37
SRX1      6        4.41
GSH1      7        3.85
SCH9      7        4.07
ZWF1      7        3.37
YAP1      8        2.77
PRX1      10       4.64
TRX3      10       5.47
CTT1      11       10.73
MCR1      11       2.81
MXR1      11       3.13
SOD2      11       6.90
GRX2      12       2.58
SOD1      12       2.49
TSA2      12       2.07

Table S7.
Induced genes annotated to carbohydrate metabolic process. Genes in the FSR are shaded in green. An asterisk (*) indicates glucose repressed genes (as reviewed by Schuller, 2003).

Gene      Cluster  Maximum Fold Change
GAC1      1        44.21
INO1      2        39.14
SHC1      2        20.00
GSC2      4        13.38
GSY2      4        7.58
RGT1      4        7.26
TSL1      4        8.11
VID24     4        6.22
ADR1      5        12.61
GIP2      6        11.83
NRG1      6        16.02
PSK1      6        5.66
TOS3      6        7.39
UBC8      6        5.01
ATH1      7        3.59
FBP26     7        4.23
GID8      7        3.91
GLC8      7        2.40
GLK1      7        2.78
GUT2      7        3.60
IDH1      7        2.22
IMP2'     7        2.90
KRE6      7        3.65
NTH1      7        2.91
PCK1      7        3.26
PFK26     7        3.41
SGA1      7        2.32
TPS1      7        3.16
TPS2      7        3.79
ZWF1      7        3.37
AAT1      8        2.46
CIT3      8        2.18
IDH2      8        2.65
KNH1      8        2.57
PSK2      8        2.30
STD1      8        2.22
AMS1      9        2.65
CIT1*     9        2.20
GAL1*     9        2.16
GAL4*     9        2.34
GAL80*    9        2.42
GID7      9        2.52
KGD2      9        2.09
KRE9      9        2.23
MLS1*     9        2.65
PFK27*    9        2.60
RMD5      9        2.76
SDH1*     9        2.77
SGA1      9        2.35
VID30     9        2.10
GLC3      10       8.16
GDB1      10       5.94
ACN9      10       5.59
BMH2      10       2.91
DLD1*     10       4.11
FBP1*     10       4.11
FYV10     10       2.74
UGA2      10       3.38
GND2      10       4.02
HAP4*     10       4.10
IDP2*     10       3.45
KGD1*     10       2.97
NTH2      10       4.06
PGM2      10       3.48
PIG2      10       3.88
SDH2*     10       3.52
SDH4*     10       4.76
TKL2      10       4.19
YJR096W   10       6.18
GRE3      11       2.97
SOL4      11       6.24
GCY1      12       2.64
GLO4      12       2.17
LSC1      12       2.55
MDH1      12       2.32
MDH2      12       2.35
PCL10     12       2.09
PCL7      12       2.26
SDH3      12       2.63

Appendix 4 This appendix contains Supplemental Tables S1-2 and Supplemental Figures S1-7 for Chapter 6 of this thesis. Supplemental Table S3 is in a separate document (Excel workbook) available online at http://www.genome.org/content/vol0/issue2007/images/data/gr.6669607/DC2/Engstrom_TableS3.xls.

Figure S1. Phylogenetic tree of Drosophila species. Shown are Drosophila species for which whole-genome alignments to D. melanogaster were available in the UCSC Genome Browser Database (http://genome.ucsc.edu) at the time of this study. The species used in this study are shown in red. Although the remote phylogenetic position of D.
grimshawi suggests that including that organism would have been informative, we chose not to because its genome assembly appeared to be in an early state (25,052 scaffolds compared to at most 13,772 scaffolds for any of the included species; see Supplemental Table S2). Adapted from the figure at http://rana.lbl.gov/drosophila/.

Table S1. Overlap with HCNE-dense regions for selected gene categories

Gene category (genes annotated with GO term(s))                                                                     Number of genes overlapping HCNE-dense regions a  Total amount of gene sequence within HCNE-dense regions b
generation of precursor metabolites and energy (GO:0006091)                                                         14/491 (2.85%)                                    29k/2056k (1.40%)
cellular protein metabolic process (GO:0044267)                                                                     77/2030 (3.79%)                                   334k/11549k (2.90%)
cell organization and biogenesis (GO:0016043)                                                                       76/1680 (4.52%)                                   775k/15021k (5.16%)
transport (GO:0006810)                                                                                              69/1517 (4.55%)                                   185k/10115k (1.83%)
signal transduction (GO:0007165)                                                                                    83/1338 (6.20%)                                   876k/15800k (5.54%)
multicellular organismal development (GO:0007275)                                                                   133/1380 (9.64%)                                  1829k/18544k (9.86%)
regulation of transcription, DNA-dependent (GO:0006355)                                                             103/768 (13.41%)                                  1385k/7736k (17.90%)
multicellular organismal development (GO:0007275) AND regulation of transcription, DNA-dependent (GO:0006355)      81/334 (24.25%)                                   1234k/5313k (23.22%)
All genes                                                                                                           684/13733 (4.98%)                                 4001k/67464k (5.93%)

a Ratio between number of genes overlapping HCNE-dense regions and the total number of genes in the category.
b Ratio between amount of gene sequence covered by HCNE-dense regions and the total amount of sequence spanned by genes in the category.
HCNE-dense regions were identified by sliding a 40 kb window across the genome in steps of 1 kb and reporting windows for which at least 1% (400 bp) of the sequence was covered by HCNEs. Overlapping HCNE-dense windows were merged, resulting in 421 HCNE-dense regions covering a total of 13.5 Mb (11.41% of the investigated sequence).

Table S2.
Properties of pairwise and 5-way synteny blocks among Dmel and four other Drosophila species

                                       Pairwise synteny blocks between Dmel and indicated query species                                                     Five-way synteny blocks
                                       Dana                    Dpse                                                 Dvir                    Dmoj
Number of synteny blocks               627                     814                                                  920                     923                  899
Number of query sequences included     30 / 13,772 scaffolds   16 / 16 ultrascaffolds; 26 / 2649 scaffolds+contigs  24 / 13,562 scaffolds   11 / 6843 scaffolds  n.c.
Number of query sequence fusions a     11                      5                                                    7                       3                    n.c.
Amount of query sequence spanned       114 Mb (49%)            120 Mb (85%)                                         131 Mb (63%)            134 Mb (69%)         n.c.
Amount of Dmel sequence spanned        111 Mb (94%)            110 Mb (93%)                                         107 Mb (90%)            105 Mb (89%)         90 Mb (76%)
Number of Dmel genes spanned b         12863 (94%)             12434 (91%)                                          11690 (85%)             11451 (83%)          8958 (65%)
Number of Dmel genes covered c         11631 (85%)             10685 (78%)                                          9531 (69%)              9279 (68%)           6308 (46%)

n.c., not calculated
a To avoid artificial synteny breaks due to scaffold breaks in the assembly, our algorithm for synteny block construction includes a step for fusing scaffolds that, when combined, show colinearity with the Dmel sequence (see Methods).
b We considered a gene to be spanned by a synteny block if at least 90% of its total CDS was within the extreme borders of the block on the Dmel genome.
c We considered a gene to be covered by a synteny block if the block contained aligned sequence for at least 60% of the gene's total CDS.

None of the four species that we compared to Dmel has a finished genome assembly. Dana appears to have the most fragmented assembly, consisting of 13,772 scaffolds. Nevertheless, our results indicate that reliable synteny blocks can be constructed because most of the sequence is in very large scaffolds. Although the pairwise synteny blocks included few scaffolds (e.g., 30 Dana scaffolds were included), they spanned more than 89% of the Dmel euchromatic sequence.
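The "spanned" and "covered" criteria in the footnotes to Table S2 both reduce to measuring what fraction of a gene's CDS falls inside a set of intervals. The following is a minimal sketch under stated assumptions: coordinates are half-open (start, end) pairs on one Dmel chromosome arm, CDS exons do not overlap one another, and all function names are hypothetical rather than taken from the original pipeline.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def cds_fraction_within(cds_exons, regions):
    """Fraction of total CDS length that lies inside `regions`."""
    regions = merge_intervals(regions)
    total = sum(end - start for start, end in cds_exons)
    if total == 0:
        return 0.0
    inside = 0
    for ex_start, ex_end in cds_exons:
        for r_start, r_end in regions:
            # Overlap length of two half-open intervals (0 if disjoint).
            inside += max(0, min(ex_end, r_end) - max(ex_start, r_start))
    return inside / total


def is_spanned(cds_exons, block, min_fraction=0.90):
    """Spanned: at least 90% of the CDS lies within the block borders."""
    return cds_fraction_within(cds_exons, [block]) >= min_fraction


def is_covered(cds_exons, aligned_segments, min_fraction=0.60):
    """Covered: aligned sequence exists for at least 60% of the CDS."""
    return cds_fraction_within(cds_exons, aligned_segments) >= min_fraction


if __name__ == "__main__":
    cds = [(100, 200), (300, 400)]                 # toy gene, 200 bp of CDS
    print(is_spanned(cds, (0, 390)))               # 190 of 200 bp inside
    print(is_covered(cds, [(100, 170), (300, 360)]))
```

A gene can be spanned without being covered, since the block borders may enclose CDS that has no aligned sequence in the query species.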
As expected, the number of synteny blocks increased with evolutionary distance between Dmel and the compared species (consistent with an accumulation of rearrangements over evolutionary time). Notably, for Dvir and Dmoj, which are equidistant from Dmel, we obtained nearly identical synteny block counts. Not unexpectedly, the span of synteny blocks over the Dmel sequence and over gene annotations on that sequence decreased with increasing evolutionary distance to the compared species.

Figure S2. Densities of reciprocally aligned coding sequences (RA-CDS) and HCNEs across the Dmel genome. See the legend for Figure 6.1 where chromosome arm 2L is depicted.

Figure S3. HCNE density peaks tend to be centrally located in large synteny blocks that contain multiple genes. See also Figure 6.1. (A) Randomization strategy. Nonsyntenic regions were excised and the order of synteny blocks was randomized, while the positions of peaks were maintained. (B) Effects of varying the bandwidth for density computation and threshold for peak detection. Symbols connected by dashed lines represent medians over 10 randomizations. The trend for HCNE density peaks to be centrally positioned within synteny blocks persisted at all parameter settings. The trend for RA-CDS peaks to be positioned close to synteny breaks disappeared at high density computation bandwidths, which appear too high for a relevant comparison with synteny blocks (Supplemental Figure S4). Unless otherwise noted, results presented in this paper were obtained with a bandwidth of 30,000 and threshold of 0.8. (C) The frequency of HCNE density peaks per sequence length increases with synteny block size. Based on their span in the Dmel genome, the synteny blocks on Dmel chromosomes 2, 3 and X were grouped into six groups of 149-150 blocks each. For each group, the range of synteny block spans and their total span are indicated on the x-axis.
The y-value for each group shows the ratio between the total number of peaks and the total number of bases spanned by the synteny blocks in the group. (D) Histogram of number of genes covered by synteny blocks that span HCNE density peaks. Out of all 136 synteny blocks that spanned a HCNE density peak, 120 (88%) also covered multiple genes. As in Table 1, we considered a gene to be covered by a synteny block if the block contained aligned sequence for at least 60% of the gene's total CDS.

Figure S4. Examples of HCNE and RA-CDS density curves computed with different bandwidths. Shown is the D. melanogaster genomic region around the homothorax (hth) gene.

Figure S5. The grn locus in Dmel (upper panel), compared to a wider region in Agam (lower panel). Notation as in Figure 6.2. grn and five other protein-coding genes (underlined) show strong evidence of being in conserved microsynteny among the five investigated flies. No functional relationship has been described between any of these genes, although both grn and ada2b have regulatory functions in development (Brown and Castelli-Gair Hombria, 2000, Development 127:4867; Qi et al., 2004, Mol Cell Biol, 24:8080). Between Dmel and Agam, grn has been maintained in microsynteny with two downstream genes. The HCNE density peak over the grn ortholog in Agam is the largest within the 850 kb region examined. Note the elevated gene density around the right border of the synteny block in Dmel.

Figure S6. The hth locus in Dmel. Notation as in Figure 2. The developmental regulatory homeobox gene homothorax (hth) occupies one of the most HCNE-dense synteny blocks in the Dmel genome. Its human orthologs MEIS1 and MEIS2 are surrounded by an exceptional number of noncoding elements conserved across vertebrates (Sandelin et al., 2004, BMC Genomics 5:99). Dmel hth is in conserved microsynteny with at least nine other genes (underlined) among the five investigated flies.
Six of these genes encode proteins of diverse classes: Irp-1B, an iron regulatory protein uniformly expressed during development (Muckenthaler et al., 1998, Eur J Biochem 254:230); Rrp46, a component of the RNA-processing exosome (Graham et al., 2006, Mol Biol Cell 17:1399); Cyp12e1, a cytochrome P450 protein; CG6465, a putative peptidase; CG14688, a putative phytanoyl-CoA dioxygenase; and Skeletor, a chromosomal protein that relocalizes during mitosis (Walker et al., 2000, J Cell Biol 151:1401). The remaining three genes are unannotated. The function of an additional protein expressed from the CG14681/skeletor transcriptional unit is unknown (Walker et al., 2000). Micro-RNA 318, which was cloned from adult flies (Aravin et al., 2003, Dev Cell, 5:337), aligns to the hth locus in all five flies, but at different locations relative to irp-1b, indicating that local rearrangements have occurred at the edge of the synteny block. hth is the only gene in the synteny block known to have a key regulatory role in development.

hth is in conserved microsynteny with the gene Rrp46 among flies, mosquitoes, bees and beetles, although there is no evidence for a functional relationship between hth and Rrp46, which encodes a component of the RNA-processing exosome (Graham et al., 2006). In Dmel the two genes are separated by a region of 95 kb that is HCNE-dense and gene-sparse, containing only four genes, all of which are in conserved microsynteny with hth among the investigated flies. Similarly, the honeybee (Apis mellifera) orthologs of hth and Rrp46 are separated by a gene-sparse ~130 kb region (devoid of matches to known Dmel proteins in the UCSC Genome Browser). In Agam and the beetle Tribolium castaneum, the intervening regions also appear to be gene-free, but are smaller (~8 kb and ~10 kb, respectively; estimated from alignments of Dmel transcripts to the 2005-10-11 version of the T.
castaneum assembly, using the BLAST interface at the Baylor HGSC website, www.hgsc.bcm.tmc.edu). In Aaeg, there appears to be a gene desert of at least 45 kb downstream of hth. (Since this gene desert is at the end of a supercontig, we could not determine whether hth and Rrp46 are linked in Aaeg.)

Taken together, these data suggest that a regulatory region is present downstream of hth in many insect species, underlies microsynteny conservation of multiple fly genes, but has been extensively modified in evolution and possibly lost in some insect species, including Agam. Consistent with the variations among insects in the size of the region separating hth and Rrp46, regulatory sequences appear to have diverged greatly at the mosquito hth locus. Little noncoding sequence is highly conserved between the hth loci of Agam and Aaeg: using the same detection criteria as for the ct, tup and grn loci shown in Figure 2 and Supplemental Figure S5, we only found three HCNEs within hth introns and none in the region between hth and Rrp46. We reasoned that closer sequence comparisons may be required to detect the regulatory elements at the mosquito hth locus, so we compared the Aaeg supercontig harboring the locus to trace reads from the genome of Culex pipiens, which is more closely related to Aaeg than Agam. Indeed, the introns of Aaeg hth contain numerous elements conserved in Culex at higher levels than most surrounding exons (Supplemental Figure S7), confirming that HCNEs are abundant at the hth locus in both flies and mosquitoes.

Figure S7. Elements at the Aaeg hth locus conserved in Culex pipiens. The entire supercontig1.597 from assembly AaegL1 is shown. Conserved elements that overlap Vectorbase transcripts or matches of Dmel proteins to the supercontig are colored red. Conserved elements that by at least half of their length overlap repeats (annotated in Ensembl 43) are colored cyan. Remaining conserved elements are colored black.
To identify conserved elements, we downloaded the September 2006 release of Culex pipiens shotgun sequencing trace reads from Vectorbase (http://www.vectorbase.org) and searched them for matches to supercontig1.597, using blastn (Altschul et al., 1997, Nucleic Acids Res 25:3389) with default search parameters (but setting option -b to 100000 so that all alignments with a bit-score above 70 were included in the output). BLAST alignments with a bit-score above 70 were then searched for regions with identity above a threshold value (in the range 85% to 100%, as indicated in the figure) over 50 alignment columns. Conserved elements were merged if they overlapped on the Aaeg supercontig. The bit-score threshold of 70 that we used should be sufficiently low to retain all BLAST alignments containing elements conserved at 95% identity over 50 columns, but may exclude some conserved at 85-90% identity.

Figure S8. Enhancer trap insertions described by Hayashi et al.
Arrows indicate locations of enhancer trap insertions, and are labeled with strain number and expression pattern as described by Hayashi et al. (2002, Genesis 34:58). Gene models are colored by predicted core promoter type as in Figure 2.
(A) Insertion 731 is about 44 kb upstream of engrailed (en) and expressed in the posterior compartment of imaginal discs, similar to en (reviewed by Hidalgo, 1996, Trends Genet 12:1). In contrast, the neighboring gene toutatis (tou) is expressed ubiquitously in wing imaginal discs (Vanolst et al., 2005, Development 132:4327). Insertion 934 has an expression pattern similar to that of insertion 731, although the two insertions are about 40 kb apart. We failed to find flanking sequence for insertion 934 in any database, and therefore could not determine its orientation and exact position; the dashed arrow indicates its approximate position based on Figure 2 in Hayashi et al. (2002).
(B) Insertion 22 is about 33 kb upstream of u-shaped (ush), on the opposite strand, and captures the expression pattern of ush in imaginal discs, but not in the embryo. Insertion 5133 is about 50 bp upstream of ush, on the same strand, and has an embryonic expression pattern similar to that of ush (Fosset et al., 2000, PNAS 97:7348).

Figure S9. Enhancer trap insertions around Dmel genes teashirt (tsh) and tiptop (tio)
Arrows indicate locations of enhancer trap insertions in the same orientation as tsh and tio (arrows above panel) and the reverse orientation (arrows below panel), and are labeled with strain number and annotated embryonic expression pattern from GETDB (http://flymap.lab.nic.gac.jp). Gene models are colored by predicted core promoter type as in Figure 2. Insertions 103, 3001 and 514 are all within 300 bp of the annotated transcription start site for tsh. Insertion 2364 is about 10 kb downstream of tsh on the opposite strand. Consistent with the expression annotation for these insertions, tsh is expressed in the part of the trunk of the embryo that gives rise to the central nervous system (CNS) and epidermis, and in some other regions, including part of the visceral mesoderm around the midgut (Fasano et al., 1991, Cell 64:63). Insertion 6209 is about 350 bp upstream of tio. Insertions 706, 707 and 1547 are all about 200 bp downstream of tio. tio is expressed in parts of the embryonic CNS (Laugier et al., 2005, Dev Biol 283:446), consistent with the expression annotation for the surrounding insertions. For this figure, we combined the tio and CR33987 gene models from FlyBase, because CR33987 appears to contain the sequence for the first exon of tio.
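
The conserved-element detection described in the legend of Figure S7 (keep BLAST alignments with a bit-score above 70, scan each for 50-column windows meeting an identity threshold, then merge overlapping elements on the supercontig) can be illustrated with a minimal Python sketch. This is not the code used for the analysis. The function names, the tuple layout of a hit, the half-open coordinate convention, and the reading of "identity above a threshold" as "at least the threshold" (so that 100% is attainable) are all assumptions made for illustration. Parsing of BLAST output and handling of minus-strand hits are left out; alignments are assumed to be already oriented on the reference strand.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on the reference sequence
Hit = Tuple[float, int, str, str]  # (bit-score, ref start, ref row, hit row); hypothetical layout


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge intervals that overlap on the reference."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start < merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def conserved_windows(ref_aln: str, hit_aln: str, ref_start: int,
                      window: int = 50, min_identity_pct: int = 95) -> List[Interval]:
    """Reference intervals covered by windows of `window` alignment columns
    whose percent identity is at least `min_identity_pct` (assumed reading)."""
    if len(ref_aln) != len(hit_aln):
        raise ValueError("aligned rows must have equal length")
    ref_aln, hit_aln = ref_aln.upper(), hit_aln.upper()
    n = len(ref_aln)
    if n < window:
        return []
    # 1 for an identical, non-gap column, else 0
    match = [int(a == b and a != "-") for a, b in zip(ref_aln, hit_aln)]
    # offset[i] = number of reference bases in alignment columns [0, i)
    offset = [0]
    for base in ref_aln:
        offset.append(offset[-1] + (base != "-"))
    found: List[Interval] = []
    count = sum(match[:window])
    for i in range(n - window + 1):
        if i > 0:  # slide the window one column to the right
            count += match[i + window - 1] - match[i - 1]
        if count * 100 >= min_identity_pct * window:
            found.append((ref_start + offset[i], ref_start + offset[i + window]))
    return merge_intervals(found)


def elements_from_hits(hits: Iterable[Hit], min_bitscore: float = 70.0,
                       window: int = 50, min_identity_pct: int = 95) -> List[Interval]:
    """Apply the bit-score filter, scan each alignment, and merge across hits."""
    found: List[Interval] = []
    for bitscore, ref_start, ref_aln, hit_aln in hits:
        if bitscore > min_bitscore:
            found.extend(conserved_windows(ref_aln, hit_aln, ref_start,
                                           window, min_identity_pct))
    return merge_intervals(found)
```

Calling `elements_from_hits` once per identity threshold (for example 85, 90, 95 and 100) would give one set of merged elements per threshold, which is one plausible way to obtain the per-threshold tracks that the legend refers to. Integer arithmetic is used for the identity test to avoid floating-point rounding at window boundaries.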
