UBC Faculty Research and Publications

Arrays of ultraconserved non-coding regions span the loci of key developmental genes in vertebrate genomes Sandelin, Albin; Bailey, Peter; Bruce, Sara; Engström, Pär G; Klos, Joanna M; Wasserman, Wyeth W; Ericson, Johan; Lenhard, Boris Dec 21, 2004

BioMed Central
BMC Genomics
Open Access
Research article

Arrays of ultraconserved non-coding regions span the loci of key developmental genes in vertebrate genomes

Albin Sandelin†1, Peter Bailey†2, Sara Bruce1,3, Pär G Engström1, Joanna M Klos2, Wyeth W Wasserman4, Johan Ericson*2 and Boris Lenhard*1

Address: 1Center for Genomics and Bioinformatics, Karolinska Institutet, Stockholm, Sweden, 2Department of Cell and Molecular Biology, Karolinska Institutet, Stockholm, Sweden, 3Department of Biosciences at Novum, Karolinska Institutet, Stockholm, Sweden and 4Centre for Molecular Medicine, Department of Medical Genetics, University of British Columbia, Vancouver, Canada

Email: Albin Sandelin - albin.sandelin@cgb.ki.se; Peter Bailey - peter.bailey@cmb.ki.se; Sara Bruce - sara.bruce@biosci.ki.se; Pär G Engström - par.engstrom@cgb.ki.se; Joanna M Klos - joanna.klos@cmb.ki.se; Wyeth W Wasserman - wyeth@cmmt.ubc.ca; Johan Ericson* - johan.ericson@cmb.ki.se; Boris Lenhard* - Boris.Lenhard@cgb.ki.se

* Corresponding authors    † Equal contributors

Published: 21 December 2004
BMC Genomics 2004, 5:99 doi:10.1186/1471-2164-5-99
Received: 02 December 2004
Accepted: 21 December 2004
This article is available from: http://www.biomedcentral.com/1471-2164/5/99
© 2004 Sandelin et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background: Evolutionarily conserved sequences within or adjoining orthologous genes often serve as critical cis-regulatory regions. Recent studies have identified long, non-coding genomic regions that are perfectly conserved between human and mouse, termed ultra-conserved regions (UCRs). Here, we focus on UCRs that cluster around genes involved in early vertebrate development; genes conserved over 450 million years of vertebrate evolution.

Results: Based on a high resolution detection procedure, our UCR set enables novel insights into vertebrate genome organization and regulation of developmentally important genes. We find that the genomic positions of deeply conserved UCRs are strongly associated with the locations of genes encoding key regulators of development, with particularly strong positional correlation to transcription factor-encoding genes. Of particular importance is the observation that most UCRs are clustered into arrays that span hundreds of kilobases around their presumptive target genes. Such a hallmark signature is present around several uncharacterized human genes predicted to encode developmentally important DNA-binding proteins.

Conclusion: The genomic organization of UCRs, combined with previous findings, suggests that UCRs act as essential long-range modulators of gene expression. The exceptional sequence conservation and clustered structure suggest that UCR-mediated molecular events involve greater complexity than traditional DNA binding by transcription factors. The high-resolution UCR collection presented here provides a wealth of target sequences for future experimental studies to determine the nature of the biochemical mechanisms involved in the preservation of arrays of nearly identical non-coding sequences over the course of vertebrate evolution.

Background

Comparative genome sequence analysis, often termed phylogenetic footprinting, has proven successful for the identification of cis-regulatory regions [1,2]. Recent computational and experimental studies have identified a small number of large, highly conserved enhancers, or 'global control regions', associated with the regulation of important developmental genes such as DACH [3], SOX9 [4], Dlx bigene [5,6], and HOX-D [7,8] clusters. These regulatory regions can act at distances of several hundred kilobases from their target genes, while at the same time conferring an equivalent expression pattern to reporter genes over much shorter distances (e.g. [3]). A recent computational analysis shows that such highly conserved elements (termed ultra-conserved elements (UCRs)) occur far more often than expected [9]. In the study by Bejerano et al., UCRs are defined as regions longer than 200 base pairs (bp) that are perfectly conserved between human and mouse. The study reports a significant association of a non-transcribed subset of those elements with DNA-binding proteins; an equivalent observation has been made independently by Boffelli et al. [10] for a limited number of the most highly conserved elements between human and pufferfish. The stringent criteria for conservation applied in the two studies miss many known enhancer elements that are shorter than 200 bp but highly conserved across all vertebrates. For instance, in a recently published study, Sabarinadh et al.
[11] described a number of non-transcribed regions flanking the genes of the HoxD gene cluster that are highly conserved across vertebrate genomes.

In this paper, we define a set of UCRs using high-resolution criteria that detect segments conserved between the human, mouse and pufferfish genomes. Analysis of this set provides insights into a previously unrealized organizational structure of UCRs in vertebrate genomes. We show that clusters of UCRs are globally associated with many of the genes that act as master regulators during vertebrate development. The clustered distribution of these regions along chromosomes and, importantly, around their presumptive target genes suggests that gene regulation involves the coordinated action of numerous, widely dispersed elements.

Results

Definition and genomic environment of ultra-conserved non-coding regions (UCRs)

We initiated this study by applying comparative genomics to identify putative regulatory regions for a number of evolutionarily conserved homeodomain transcription factors that control neural cell fate determination [12,13]. When we examined the genomic landscapes surrounding homeodomain gene loci, we consistently found non-coding regions that exhibited a striking degree of sequence conservation between human and mouse over a minimum of 50 bp. Many of these regions are at least partially conserved over extended periods of evolution. The observed nucleotide identities between human and mouse sequences exceed even those of exon sequences encoding identical proteins. Such striking sequence conservation has previously been anecdotally associated with long-range enhancers for several developmental genes [3-8].

To test whether the association of UCRs with regulatory genes reflected a global genomic trend, we identified a comprehensive set of human/mouse/pufferfish UCRs for detailed analysis. We defined minimum requirements for a UCR (see Methods) and performed a genome-scale computational analysis that retrieved 3583 human/mouse/pufferfish UCRs. Since one of the requirements is that the UCRs do not overlap actively transcribed genomic regions, they would belong to the type II UCRs defined by Bejerano et al. [9].

The median UCR length was 125 bp, but extreme lengths (>1000 bp) were observed. Qualitative assessment of the "genescapes" (the gene structures) surrounding UCRs revealed them to be present either in introns, in dense clusters around a group of genes or in 'gene deserts' (up to several thousand kilobases from known genes). There appeared to be a strong association between the locations of our set of UCRs and genes encoding transcription factors – even stronger than that reported by Bejerano et al. [9] [see Additional file 1 and 2]. This observation is tested quantitatively in the subsequent analysis.

UCRs are strongly associated with DNA-binding proteins

To quantitatively assess the characteristics of genes proximal to UCRs, we analyzed the over-representation of gene annotations. We retrieved the InterPro [14] domain annotation for all genes adjacent to or containing UCRs. A statistical assessment (Fisher's exact test) of the observed domain biases for these genes was performed to assess the probability that the domain distributions were the same for the UCR genes as for the set of all genes. Even with a conservative (Bonferroni) correction for multiple testing [15], structural domains of transcription factors are significantly over-represented (P-value 9.33e-66) within the gene annotations (Table 1) [all domains are listed in Additional file 3 and 4]. In order to obtain robust results, we chose the four domains from Table 1 present in the highest number of proteins (homeobox, C2H2 zinc finger, forkhead and nuclear steroid receptor). We examined the extent to which all known genes containing each of these four transcription factor domains co-localize with UCRs (Figure 1). We found that a high proportion of these genes (163/1084; P-value 7.33e-11) are in genomic neighborhoods (<8 kb) of UCRs: more than 30% of all homeodomain-encoding genes have a UCR within 8 kb (90/237; P-value 8.67e-11), and more than 55% have one within 100 kb (133/237; P-value 7.78e-11). The UCR association rates (the fraction of genes with a UCR closer than 8 kb, compared to the expected value) for genes encoding forkhead (8/31; P-value 6.6e-11), nuclear steroid receptor (9/38; P-value 2.81e-9) or zinc finger domains (56/751; P-value 8.12e-11) were significant as well. These data provide strong evidence that the UCRs are spatially associated with genes encoding regulatory proteins.

UCR clusters encompass the entire gene loci of key developmental genes

In order to visualize the distribution of UCR locations across the human genome, we generated a UCR density map for each chromosome [see Additional file 5]. Figure 2a shows such a map for chromosome 2. Visual inspection reveals an obvious qualitative tendency of UCRs to occur in large clusters, which was validated by a quantitative comparison of the distributions of nearest-neighbor distances between UCRs and a neutral background model (P-value 8.02e-16; Kolmogorov-Smirnov test). There is no observed correlation between regions of high gene density and UCRs, consistent with previously reported observations that larger conserved regions can be located in gene deserts [3]. As previously noted, many of the UCRs are adjacent to homeobox protein-encoding genes (Figure 1a, Figure 2b). It is interesting to note that the over-representation of UCRs near homeobox genes extends up to 300 kbp away from the transcription start site (Figure 1b).
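As an illustration of the clustering comparison mentioned above (nearest-neighbor distances between UCRs versus a neutral background, compared with a Kolmogorov-Smirnov test), the following Python sketch computes the two-sample KS statistic from scratch. The function names, the toy coordinates and the uniform random background are assumptions made for this example; the text does not give the study's own implementation, and only the statistic D (not the P-value) is computed here.

```python
import random


def consecutive_distances(positions):
    """Distances between neighboring positions along one chromosome."""
    ordered = sorted(positions)
    return [b - a for a, b in zip(ordered, ordered[1:])]


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    largest_gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past every value <= x, then compare the CDFs at x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        largest_gap = max(largest_gap, abs(i / len(a) - j / len(b)))
    return largest_gap


def random_background(n_regions, chrom_length, rng):
    """Hypothetical neutral model: region positions drawn uniformly at random."""
    return [rng.randrange(chrom_length) for _ in range(n_regions)]


# Toy "observed" positions forming two tight clusters (not real UCR coordinates).
rng = random.Random(0)
observed = [100_000 + 500 * k for k in range(20)] + [700_000 + 500 * k for k in range(20)]
background = random_background(len(observed), 1_000_000, rng)
d = ks_statistic(consecutive_distances(observed), consecutive_distances(background))
print(f"KS statistic D = {d:.3f}")
```

A value of D near 0 means the two distance distributions are similar, while a value near 1 means they barely overlap. Converting D into a P-value needs the KS distribution, which a statistics package would normally supply.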
Thisis consistent with numerous observations that controlregions need not be proximal to targeted genes, but can belocated hundreds of kilobases from the transcription startsite [3,7,16]. A similar trend is observed for UCRs nearof UCRs can span regions of several hundred kilobasesaround inferred target genes. For the 50 largest UCR clus-ters we generated comprehensive views of the chromo-somal neighborhood (Figure 3). We find that 41 of the 50clusters span one or more genes known to be expressed inembryonal development, including fundamental masterregulator genes (i.e. the HoxD cluster, Nkx6.1, Nkx2.2 andPbx3) [for detailed annotated lists of genes associatedwith UCR clusters, see Additional files 6 and 7]. To pro-vide access to the entire set of UCRs, we have imple-mented a basic UCR browser http://mordor.cgb.ki.se/UCRbrowse/ with links to the UCSC genome browser[17].Rare duplications of UCRs across evolutionWe performed a global pairwise comparison of all UCRs,in order to determine if UCR duplication was commonacross evolution. We discovered only five sets of dupli-cated UCRs, all of which are adjacent to correspondingduplicated genes. For example, duplicated UCRs arepresent in the introns of SOX5 (on chromosome 12) andSOX6 (on chromosome 11), two highly similar genesinvolved in chondrocyte differentiation [18]. Of specialinterest is the conservation of UCRs in the Iroquois (IRX)gene clusters. IRX genes are situated in two clusters ofthree genes each, present on human chromosomes 5 and16 [19]. Similarly positioned arrays of UCRs are present ineach of the four intergenic regions between the IRX genesTable 1: Over-representation of protein domains in genes flanking UCRs. Bonferroni-corrected and uncorrected Fisher Exact Test p-values are shown for the 16 most over-represented InterPro domains. Typical transcription factor domains (DNA binding domains) are indicated in bold. 
A full list of all InterPro domains with P-values is given in [Additional file 3].Domain description INTERPRO ID Fisher test P value Corrected P valueHTH_lambrepressr IPR000047 6.40E-20 5.36E-17Homeobox IPR001356 1.60E-12 1.34E-09Antennapedia IPR001827 1.37E-10 1.15E-07Paired_box IPR001523 2.39E-05 2.00E-02HLH_basic IPR001092 2.40E-05 2.01E-02POU_domain IPR000327 3.06E-05 2.56E-02Homeo_OAR IPR003654 3.08E-05 2.58E-02TF_Fork_head IPR001766 6.15E-05 5.15E-02Znf_C4steroid IPR001628 7.45E-05 6.23E-02Hormone_rec_lig IPR000536 1.06E-04 8.86E-02HMG_12_box IPR000910 1.81E-04 1.51E-01Stdhrmn_receptor IPR001723 2.63E-04 2.20E-01COUP_TF IPR003068 7.62E-04 6.38E-01LIM IPR001781 1.10E-03 9.18E-01RtnoidX_receptor IPR000003 1.28E-03 1.07E+00FN_III IPR003961 2.57E-03 2.15E+00Page 3 of 9(page number not for citation purposes)C2H2 zinc finger genes, with over-representation of UCRsextending up to 150 kbp away (Figure 1c). Large clusters(Figure 4). The great majority of UCRs, while conservedacross vertebrate evolution, show no similarity betweenBMC Genomics 2004, 5:99 http://www.biomedcentral.com/1471-2164/5/99the clusters within the species. An intriguing exception isthe set of four UCRs that are highly similar in both clusterposition and nucleotide sequence.DiscussionThe human genome contains numerous ultra-conservedregulatory sequences that are shared broadly across verte-brates. These UCRs occur in arrays of highly conserved regula-tory elements spanning large chromosomal regions. Theclusters are co-localized with genes encoding key proteinsfor the regulation of development, with a particular corre-lation with genes encoding transcription factors. Thestrength of association between UCRs and diverse classesof DNA binding transcription factors validates that a rela-tively simple definition of UCRs captures a biologicallymeaningful set of functional sequences. 
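The association test behind Table 1 can be illustrated with a short Python sketch: a one-sided Fisher's exact test on four counts, followed by the Bonferroni correction (multiplying the P-value by the number of domains tested, 837). The function names and the toy counts are assumptions for illustration only. The study itself used the R statistical package, and the text does not state whether its background counts excluded the sample genes, so that choice is left to the caller.

```python
from math import comb


def fisher_one_sided(sample_pos, sample_neg, background_pos, background_neg):
    """P-value for at least `sample_pos` positives in the sample (hypergeometric tail)."""
    total = sample_pos + sample_neg + background_pos + background_neg
    positives = sample_pos + background_pos
    sample_size = sample_pos + sample_neg
    upper = min(positives, sample_size)
    # Count the tables that are at least as extreme as the observed one.
    favorable = sum(
        comb(positives, k) * comb(total - positives, sample_size - k)
        for k in range(sample_pos, upper + 1)
    )
    return favorable / comb(total, sample_size)


def bonferroni(p_value, n_tests):
    """Uncapped Bonferroni correction, so values above 1 can occur, as in Table 1."""
    return p_value * n_tests


# Toy 2x2 table (hypothetical counts, not taken from the study).
print(fisher_one_sided(3, 1, 1, 3))
# First row of Table 1: uncorrected P-value times 837 tested domains.
print(bonferroni(6.40e-20, 837))
```

Leaving the corrected value uncapped reproduces the last two rows of Table 1, where the corrected P-values exceed 1; many implementations cap the result at 1 instead.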
The presence of non-coding UCRs is predictive for the presence of genes implicated in development, differentiation and malignancies. The list presented in [Additional file 6] hints at potentially crucial roles of currently uncharacterized transcription factor genes, while the collection of reported UCRs provides a wealth of regulatory locations for further study.

Exceptional mechanisms are brought to bear to retain UCRs over hundreds of millions of years of parallel evolution. UCRs are more strongly conserved than sequences encoding identical proteins, and exhibit sequence identity exceeding essentially all known cis-regulatory sequences. The retention properties suggest that UCRs have important functions in the vertebrate genome.

The observed UCRs could fall into multiple functional categories, including enhancers of transcription, regulators of chromatin structure and unknown genes for non-coding transcripts. A small subset of UCRs have been identified previously as enhancers of transcription [3,7]. The high conservation and length of UCRs compared to binding sites for single transcription factors suggest that the mode of regulation must involve more than the binding of a small number of transcription factors. Homotypic clusters of binding sites, as seen in developmental genes in Drosophila melanogaster [20], represent one regulatory mechanism that could explain the occurrence of long, conserved non-coding regions. However, as transcription factors tolerate considerable variation between functional binding sites, a homotypic cluster of binding sites cannot by itself account for the extreme level of conservation observed in UCRs. Alternatively, the recent emergence of the role of microRNAs in regulation suggests that there could be additional non-coding genes in the human genome, perhaps at the sites of ultra-conservation.

Figure 1: Spatial correlation of transcription factor gene families to UCRs in the human genome. A. Cumulative distribution of distances to the closest UCR for selected subsets of genes. Distance is measured to the closer end of the transcript mapping (either 3' or 5'). The majority of the major classes of transcription factors are closer to UCRs than random genes. B, C. Occurrence of UCRs around selected subsets of genes. This plot summarizes the distribution of distances to all UCRs on the same chromosome for each gene in the subset. There is a visible over-representation of UCRs up to 300 kb from homeobox genes, and up to 150 kb from C2H2 zinc finger genes.
[Figure 1 panels. a) Distances to closest UCR, cumulative distribution; x-axis: distance to UCR (bp); y-axis: fraction; series: homeobox, C2H2 zinc finger, forkhead and nuclear receptor genes, each with its non-member gene set. b) Distribution of distances to all cis-UCRs (homeobox genes versus non-homeobox genes). c) Distribution of distances to all cis-UCRs (C2H2 zinc finger genes versus non-C2H2 genes).]

The clustering of UCRs suggests that UCR-mediated transcriptional regulation may involve molecular events on a greater scale, possibly involving chromatin structure. This potential link to chromatin structure is suggested by the striking pattern of UCRs in the IRX gene clusters.
Most of the UCRs have no similarity between the two clusters, with the exception of a set of four UCRs that have retained both mutual sequence similarity and spatial position (Figure 4). It is tempting to assume that the retention of their mutual similarity is a consequence of IRX cluster co-regulation, the mechanism of which remains unknown.

Based on the preservation of nearly identical sequences over ~450 million years of vertebrate evolution, it is reasonable to postulate the influence of exceptional biochemical mechanisms. Numerous hypotheses could account for the observed data, broadly falling into two categories: active mechanism(s) resulting in the decrease of mutational frequency in UCRs, or negative pressure consistent with evolutionary selection against such mutations. Given the breadth of possibilities, we defer further speculation until more data emerge.

Conclusion

Since Bejerano et al. [9] focused on larger regions (200 bp) of perfect nucleotide identity compared to our more permissive settings (95% sequence identity over 50 bp), the genomic arrangement of UCR-containing regions with respect to their presumptive target genes was not fully realized. Our findings include critical new information about UCR clusters, particularly with regard to patterns of conservation, their genomic organization, and the insights they provide into potential chromatin-regulating mechanisms. These mysterious regions retained over hundreds of millions of years of evolution appear to contribute to a novel mechanism of developmental regulation. Detailed studies of UCRs that will ensue from the discoveries reported here promise to advance our understanding of vertebrate development.

Methods

Definition of UCRs applied in this study

We defined UCRs as non-protein-coding genomic regions having a sequence identity > 95% over a sliding window of length 50 bp in human/mouse comparison (based on the tight alignments track from the UCSC genome browser database [17], using human and mouse assemblies hg15 and mm3, respectively). As a further constraint, a UCR must overlap with sequences conserved between the human and pufferfish genomes, as defined in the UCSC genome browser database (a BLAT [21] alignment between human and pufferfish with a minimum BLAT score of 20). In order to avoid inclusion of coding sequence, we required that a UCR must not overlap a mouse or human cDNA mapped to the genome (based on cDNA tracks from the UCSC genome browser database [17]) or overlap putative coding regions predicted by GenScan [22].

Figure 2: Genomic distributions of UCRs and transcription factor genes. A. Distribution of UCRs on human chromosome 2 is shown in yellow, and total gene density along the chromosome is shown in blue (top track). Note the lack of correlation between gene density and UCR density. Positions of homeobox domain-containing genes are marked in red, and generally coincide with local maxima of UCR density. The remaining UCR density peaks coincide with genes for transcription factors belonging to structural classes other than homeobox. B. Close-up of a UCR cluster coinciding with the HoxD gene cluster. The HoxD cluster coincides with one of the larger UCR density peaks on chromosome 2, and is associated with nine UCRs. UCR locations are shaded in yellow.
[Figure 2 labels. Tracks: homeobox-containing genes, gene density, UCR density, human-mouse conservation, pufferfish conservation, RefSeq genes. Panel B shows HOXD13, HOXD12, HOXD11, HOXD10, HOXD9, HOXD8, HOXD4 and HOXD3 over chromosome 2 coordinates 176915000 to 176995000.]

Figure 3: Genomic landscape surrounding the most prominent UCR clusters in the human genome.
UCRs were counted by sliding a 500 kb window along the chromosomes. Overlapping UCR-containing windows were merged into a single cluster span. Each panel shows a 4 Mb region around the corresponding UCR cluster. The cluster span coordinates correspond to the human genome NCBI build 33 (UCSC hg15, April 2003). Transcription factor genes are colored according to structural class. UCR clusters are visibly correlated with transcription factor genes; other developmental regulators that do not contain any of the probed protein domains were located manually (boxed), such as the autism susceptibility gene (chromosome 7, number 37) and the DACH gene (chromosome 13, number 10). The numbers correspond to annotations in [Additional files 6 and 7]. The figure was created with the help of the Bio::Graphics Perl library [27].

Calculation of UCR and gene distributions

The distribution of UCRs in the genome was calculated by counting the number of UCRs within a 500 kilobase (kb) window, which was slid over each chromosome in 100 kb steps. The same approach was used to estimate the gene density, specifically by summing the number of bases within the window that aligned with human mRNA (from the UCSC Genome Browser database).

Gene-UCR distance calculation

Distances between a given gene and UCR on the same chromosome were defined as the shortest distance between the starting points and/or endpoints of UCR and gene in the human genome (UCSC assembly hg15), using EnsEMBL [23] gene annotation. Genes based solely on ESTs or computational predictions were not included.

Estimation of significance of gene-UCR distances

The distances from genes within a set (for instance, all forkhead domain-containing genes) to the closest UCRs were calculated as above.
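The distance definition above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: genes and UCRs are represented as (start, end) tuples on a single chromosome, the function names and coordinates are hypothetical, and the same endpoint-to-endpoint rule is applied to a UCR lying inside a gene, which is one reading of the definition rather than a confirmed detail of the study.

```python
def interval_distance(gene, ucr):
    """Shortest distance between any endpoint of the gene and any endpoint of the UCR."""
    gene_start, gene_end = gene
    ucr_start, ucr_end = ucr
    return min(
        abs(g - u)
        for g in (gene_start, gene_end)
        for u in (ucr_start, ucr_end)
    )


def closest_ucr_distance(gene, ucrs):
    """Distance from one gene to its nearest UCR on the same chromosome."""
    return min(interval_distance(gene, ucr) for ucr in ucrs)


def fraction_within(genes, ucrs, threshold):
    """Fraction of genes whose closest UCR is nearer than `threshold` base pairs."""
    hits = sum(1 for gene in genes if closest_ucr_distance(gene, ucrs) < threshold)
    return hits / len(genes)


# Hypothetical coordinates on one chromosome (not taken from the study).
genes = [(1_000, 5_000), (100_000, 120_000), (300_000, 310_000)]
ucrs = [(7_000, 7_125), (105_000, 105_200), (500_000, 500_100)]
print(fraction_within(genes, ucrs, 8_000))
```

With thousands of UCRs per genome, sorting the UCR coordinates and using a binary search would avoid the all-against-all comparison; the brute-force form is kept here for readability.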
The expected fraction of gene-UCR distances smaller than 8 kb was estimated by simulation: UCR genome coordinates were randomly chosen and distances measured as above. The simulation process was repeated 1000 times and the average fraction reported. In order to estimate whether the observed distribution was significantly different from the expected one, we used the chi-squared test.

Estimation of domain over-representation in genes closest to UCRs

For each UCR, the closest upstream and downstream gene within 2 Mbp was identified (UCRs inside introns of genes were analyzed separately). EnsEMBL InterPro [14] domain annotation was used to build a contingency table consisting of the positive sample counts (number of genes in the set containing a certain domain), negative sample counts (number of remaining genes in the set), background positives (number of genes containing the same domain in the genome) and background negatives (remaining genes). For clarity, a given gene was only counted once, and multiple occurrences of the same domain within the same protein were not counted.

For each domain found in the UCR-proximal genes, we tested the null hypothesis that the sample and background sets are drawn from the same population versus the alternative hypothesis that the sample set has a higher frequency of the domain, using Fisher's Exact Test [24] from the R statistical package http://www.r-project.org. Since the number of tests is considerable, we corrected for multiple sampling using the conservative Bonferroni method [15], in which the P-value from the Fisher test is multiplied by the number of unique domains tested (837). An analogous analysis was performed with genes containing one or more UCRs within their introns [see Additional file 4].

Figure 4: Sets of UCRs sharing high sequence similarity are involved in regulation of related genes: the case of Iroquois gene clusters. Four similarly positioned UCRs are located within the two Iroquois gene clusters at chromosomes 5 and 16. Block arrows indicate significant sequence similarity. The arrow width is inversely proportional to the alignment BLASTN E-value. There are additional shorter blocks of similarity between the two three-gene clusters; however, most UCRs have diverged between the two clusters, while still preserved across vertebrates.
[Figure 4 labels. chr5 (+): IRX4, IRX2, IRX1; chr16 (-): IRX6, IRX5, IRX3. Legend: IRX gene, BLAST similarity (reverse complement), BLAST similarity, ultra-conserved region, gene duplication.]

Estimation of clustering tendency

We used the distances between consecutive UCRs as a statistic indicating clustering. A neutral background distance distribution was created by assigning UCRs genome coordinates randomly, and subsequently measuring distances between consecutive UCRs. This process was repeated 1000 times. We compared the distance distribution between naturally occurring UCRs and the background using the Kolmogorov-Smirnov test [25], which tests whether two samples are drawn from the same distribution.

UCR sequence similarity analysis

All possible pairs of UCRs were aligned using NCBI BLASTN [26] with standard settings. For any pair to be reported as near-identical, we required an HSP of at least 50 bp and a pairwise sequence identity exceeding 75%.

Abbreviations

UCR – ultraconserved non-coding region; bp – base pairs; kbp – 10^3 base pairs

Authors' contributions

AS collected the data and performed most steps of bioinformatic and statistical analysis presented in the paper. He produced all the Figures in the paper and Table 1, and co-wrote the manuscript.
PB made initial analyses of putative regulatory elements on selected genes involved in neural tube development. He discovered a number of super-conserved regions in the process, which helped create the rules for their genome-wide computational detection. He also co-wrote the first versions of the manuscript. SB participated in the annotation of the gene set and in the creation of software for the visualization of results. PE prepared genome sequence and annotation data for human, mouse and pufferfish. He and AS designed the statistical tests applied in the study. JK participated in the initial analyses and data extraction with PB. WWW participated in result interpretation, design of statistical tests, and writing later versions of the manuscript. JE initiated and co-supervised the study, which has its roots in his research in developmental neurobiology. He also co-wrote the manuscript. BL designed and supervised the bioinformatic study, developed the initial framework for the analysis of the genomic sequences, made an independent observation about the high incidence of clustering of super-conserved regions around genes encoding DNA-binding proteins, and annotated the UCR clusters with co-localizing genes. He also co-wrote the manuscript.

Additional material

Additional File 1: Genescape around 50 randomly selected UCRs. Selected UCRs are shown as yellow triangles, other UCRs as light yellow triangles. Genes are colored by domain (red = homeobox, green = C2H2 zinc fingers, pink = nuclear receptors, blue = forkhead).
File: http://www.biomedcentral.com/content/supplementary/1471-2164-5-99-S1.png

Additional File 2: Genescape around 50 randomly selected genes. UCRs are shown as light yellow triangles. Color coding of genes as above.
File: http://www.biomedcentral.com/content/supplementary/1471-2164-5-99-S2.png

Additional File 3: Complete list of protein domains in genes flanking UCRs.
Each tested domain is listed along with corrected and uncorrected P-value as in Table 1.
File: http://www.biomedcentral.com/content/supplementary/1471-2164-5-99-S3.html

Additional File 4: Complete list of protein domains in genes with UCR(s) in intron(s). Each tested domain is listed along with corrected and uncorrected P-value as in Table 1.
File: http://www.biomedcentral.com/content/supplementary/1471-2164-5-99-S4.htm

Additional File 5: UCR distribution in the human genome. UCR density (pink) and gene density (blue) are shown for each chromosome. Densities are calculated as described in Methods.
File: http://www.biomedcentral.com/content/supplementary/1471-2164-5-99-S5.png

Additional File 6: Genes associated with enumerated UCR clusters from Figure 3. UCRs were counted by sliding a 500 kb window along the chromosomes. Overlapping UCR-containing windows were merged into a single cluster span. The cluster span coordinates correspond to the human genome NCBI build 33 (UCSC hg15, April 2003).
A more exhaustive list is found in Additional File 7.
[http://www.biomedcentral.com/content/supplementary/1471-2164-5-99-S6.htm]

Additional File 7
Extended list of UCR clusters. An extended, but less annotated, version of Additional File 6.
[http://www.biomedcentral.com/content/supplementary/1471-2164-5-99-S7.htm]

BMC Genomics 2004, 5:99 http://www.biomedcentral.com/1471-2164/5/99

Acknowledgements
AS and BL were supported in part by funding from Pharmacia Corporation (now Pfizer). JE is supported by the Royal Swedish Academy of Sciences, by a donation from the Wallenberg Foundation, The Swedish Foundation for Strategic Research, The Wallenberg Foundation, The Swedish National Research Council and the EC network grants: Brainstem Genetics: QLRT-2000-01467 and Stembridge: QLG3-CT-2002-01141. W.W. is supported by the Michael Smith Foundation for Health Research and the Canadian Institutes of Health Research.

References
1. Lenhard B, Sandelin A, Mendoza L, Engstrom P, Jareborg N, Wasserman WW: Identification of conserved regulatory elements by comparative genome analysis. J Biol 2003, 2:13.
2. Ureta-Vidal A, Ettwiller L, Birney E: Comparative genomics: genome-wide analysis in metazoan eukaryotes. Nat Rev Genet 2003, 4:251-262.
3. Nobrega MA, Ovcharenko I, Afzal V, Rubin EM: Scanning human gene deserts for long-range enhancers. Science 2003, 302:413.
4. Bagheri-Fam S, Ferraz C, Demaille J, Scherer G, Pfeifer D: Comparative genomics of the SOX9 region in human and Fugu rubripes: conservation of short regulatory sequence elements within large intergenic regions. Genomics 2001, 78:73-82.
5. Sumiyama K, Irvine SQ, Ruddle FH: The role of gene duplication in the evolution and function of the vertebrate Dlx/distal-less bigene clusters. J Struct Funct Genomics 2003, 3:151-159.
6. Ghanem N, Jarinova O, Amores A, Long Q, Hatch G, Park BK, Rubenstein JL, Ekker M: Regulatory roles of conserved intergenic domains in vertebrate Dlx bigene clusters. Genome Res 2003, 13:533-543.
7. Spitz F, Gonzalez F, Duboule D: A global control region defines a chromosomal regulatory landscape containing the HoxD cluster. Cell 2003, 113:405-417.
8. Santini S, Boore JL, Meyer A: Evolutionary conservation of regulatory elements in vertebrate Hox gene clusters. Genome Res 2003, 13:1111-1122.
9. Bejerano G, Pheasant M, Makunin I, Stephen S, Kent WJ, Mattick JS, Haussler D: Ultraconserved elements in the human genome. Science 2004, 304:1321-1325.
10. Boffelli D, Nobrega MA, Rubin EM: Comparative genomics at the vertebrate extremes. Nat Rev Genet 2004, 5:456-465.
11. Sabarinadh C, Subramanian S, Tripathi A, Mishra RK: Extreme conservation of noncoding DNA near HoxD complex of vertebrates. BMC Genomics 2004, 5:75.
12. Cornell RA, Ohlen TV: Vnd/nkx, ind/gsh, and msh/msx: conserved regulators of dorsoventral neural patterning? Curr Opin Neurobiol 2000, 10:63-71.
13. Briscoe J, Ericson J: Specification of neuronal fates in the ventral neural tube. Curr Opin Neurobiol 2001, 11:43-49.
14. Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Barrell D, Bateman A, Binns D, Biswas M, Bradley P, Bork P, Bucher P, Copley RR, Courcelle E, Das U, Durbin R, Falquet L, Fleischmann W, Griffiths-Jones S, Haft D, Harte N, Hulo N, Kahn D, Kanapin A, Krestyaninova M, Lopez R, Letunic I, Lonsdale D, Silventoinen V, Orchard SE, Pagni M, Peyruc D, Ponting CP, Selengut JD, Servant F, Sigrist CJ, Vaughan R, Zdobnov EM: The InterPro Database, 2003 brings increased coverage and new features. Nucleic Acids Res 2003, 31:315-318.
15. Westfall PH, Wolfinger RD: Multiple Tests with Discrete Distributions. The American Statistician 1997, 51:3-8.
16. Lettice LA, Horikoshi T, Heaney SJ, van Baren MJ, van der Linde HC, Breedveld GJ, Joosse M, Akarsu N, Oostra BA, Endo N, Shibata M, Suzuki M, Takahashi E, Shinka T, Nakahori Y, Ayusawa D, Nakabayashi K, Scherer SW, Heutink P, Hill RE, Noji S: Disruption of a long-range cis-acting regulator for Shh causes preaxial polydactyly. Proc Natl Acad Sci U S A 2002, 99:7548-7553.
17. Karolchik D, Baertsch R, Diekhans M, Furey TS, Hinrichs A, Lu YT, Roskin KM, Schwartz M, Sugnet CW, Thomas DJ, Weber RJ, Haussler D, Kent WJ: The UCSC Genome Browser Database. Nucleic Acids Res 2003, 31:51-54.
18. de Crombrugghe B, Lefebvre V, Behringer RR, Bi W, Murakami S, Huang W: Transcriptional mechanisms of chondrocyte differentiation. Matrix Biol 2000, 19:389-394.
19. Gomez-Skarmeta JL, Modolell J: Iroquois genes: genomic organization and function in vertebrate neural development. Curr Opin Genet Dev 2002, 12:403-408.
20. Lifanov AP, Makeev VJ, Nazina AG, Papatsenko DA: Homotypic regulatory clusters in Drosophila. Genome Res 2003, 13:579-588.
21. Kent WJ: BLAT--the BLAST-like alignment tool. Genome Res 2002, 12:656-664.
22. Burge C, Karlin S: Prediction of complete gene structures in human genomic DNA. J Mol Biol 1997, 268:78-94.
23. Birney E, Andrews D, Bevan P, Caccamo M, Cameron G, Chen Y, Clarke L, Coates G, Cox T, Cuff J, Curwen V, Cutts T, Down T, Durbin R, Eyras E, Fernandez-Suarez XM, Gane P, Gibbins B, Gilbert J, Hammond M, Hotz H, Iyer V, Kahari A, Jekosch K, Kasprzyk A, Keefe D, Keenan S, Lehvaslaiho H, McVicker G, Melsopp C, Meidl P, Mongin E, Pettett R, Potter S, Proctor G, Rae M, Searle S, Slater G, Smedley D, Smith J, Spooner W, Stabenau A, Stalker J, Storey R, Ureta-Vidal A, Woodwark C, Clamp M, Hubbard T: Ensembl 2004. Nucleic Acids Res 2004, 32 Database issue:D468-70.
24. Mehta CR, Patel NR: FEXACT: A Fortran subroutine for Fisher's exact test on unordered r*c contingency tables. ACM Transactions on Mathematical Software 1986, 12:154-161.
25. Conover WJ: Practical nonparametric statistics. New York, John Wiley & Sons; 1971.
26. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25:3389-3402.
27. Stein LD, Mungall C, Shu S, Caudy M, Mangone M, Day A, Nickerson E, Stajich JE, Harris TW, Arva A, Lewis S: The generic genome browser: a building block for a model organism system database. Genome Res 2002, 12:1599-1610.

