TFCat: the curated catalog of mouse and human transcription factors Fulton, Debra L; Sundararajan, Saravanan; Badis, Gwenael; Hughes, Timothy R; Wasserman, Wyeth W; Roach, Jared C; Sladek, Rob Mar 12, 2009

Open Access
Software
Fulton et al. 2009, Volume 10, Issue 3, Article R29

TFCat: the curated catalog of mouse and human transcription factors

Debra L Fulton*, Saravanan Sundararajan†, Gwenael Badis‡, Timothy R Hughes‡, Wyeth W Wasserman¤*, Jared C Roach¤§ and Rob Sladek¤†

Addresses: *Department of Medical Genetics, Centre for Molecular Medicine and Therapeutics - Child and Family Research Institute, University of British Columbia, West 28th Avenue, Vancouver, V5Z 4H4, Canada. †Departments of Medicine and Human Genetics, McGill University and Genome Quebec Innovation Centre, Dr. Penfield Avenue, Montreal, H3A 1A4, Canada. ‡Banting and Best Department of Medical Research, University of Toronto, College Street, Toronto, M5S 3E1, Canada. §Center for Developmental Therapeutics, Seattle Children's Research Institute, Olive Way, Seattle, 98101, USA.

¤ These authors contributed equally to this work.

Correspondence: Rob Sladek.

© 2009 Fulton et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Unravelling regulatory programs governed by transcription factors (TFs) is fundamental to understanding biological systems. TFCat is a catalog of mouse and human TFs based on a reliable core collection of annotations obtained by expert review of the scientific literature. The collection, including proven and homology-based candidate TFs, is annotated within a function-based taxonomy and DNA-binding proteins are organized within a classification system.
All data and user-feedback mechanisms are available at the TFCat portal.

Published: 12 March 2009
Genome Biology 2009, 10:R29 (doi:10.1186/gb-2009-10-3-r29)
Received: 5 December 2008
Revised: 26 February 2009
Accepted: 12 March 2009
The electronic version of this article is the complete one and can be found online.

The functional properties of cells are determined in large part by the subset of genes that they express in response to physiological, developmental and environmental stimuli. The coordinated regulation of gene transcription, which is critical in maintaining this adaptive capacity of cells, relies on proteins called transcription factors (TFs), which control profiles of gene activity and regulate many different cellular functions by interacting directly with DNA [1,2] and with non-DNA binding accessory proteins [3,4]. While the biochemical properties and regulatory activities of both DNA-binding and accessory TFs have been experimentally characterized and extensively documented (for example, in textbooks devoted to TFs [5,6]), a well-validated and comprehensive catalog of TFs has not been assembled for any mammalian species.

Many gene transcription studies have linked the subset of TFs that bind specific DNA sequences to the activation of individual genes and, more recently, these have been pursued on a genome-wide basis using high-throughput laboratory studies (for example, by performing chromatin-immunoprecipitation) as well as computational analyses (for example, by identifying over-represented DNA motifs within promoters of co-expressed genes). To facilitate such efforts, inventories of TFs have been assembled for Drosophila and Caenorhabditis species as well as for specific subfamilies of mammalian TFs (Table 1).
Since only a limited number of protein structures can mediate high-affinity DNA interactions, collections of TF subfamilies have been constructed using predictive sequence-based models for DNA-binding domains (DBDs) [7-10]. For example, the PFAM Hidden Markov Model (HMM) database [11] and Superfamily HMMs [12] have been applied to sets of peptide sequences to identify nearly 1,900 putative TFs in the human genome [10] and over 750 fly TFs, of which 60% were well-characterized site-specific binding proteins [13]. While these collections have emphasized DNA binding proteins, recent evidence suggests that the contributions of accessory TFs may be equally or more important in establishing the spatio-temporal regulation of gene activity. For example, microarray-based chromatin immunoprecipitation studies have highlighted the key regulatory contributions of histone modifying TFs to the control of gene expression [14]. Therefore, any comprehensive study of TFs must extend beyond a narrow focus on DNA binding proteins to serve as a foundation for regulatory network analyses.

The four research laboratories contributing to this report were originally pursuing parallel efforts to compile reference collections of bona fide mammalian TFs. In order to maximize the quality and breadth of our gene curation, we combined our efforts to create a single, literature-based catalog of mouse and human TFs (called TFCat). The collection of annotations is based on published experimental evidence. Each TF gene was assigned to a functional category within a hierarchical classification system based on evidence supporting DNA binding and transcriptional activation functions for each protein. DNA-binding proteins were categorized using an established structure-based classification system [15]. A blind, random sample of the functional assessments provided by each expert was used to assess the quality of the gene annotations.
The evidence-based subset of TFs was used to computationally predict additional un-annotated genes likely to encode TFs. The resulting collection is available for download from the TFCat portal and is also accessible via a wiki to encourage community input and feedback to facilitate continuous improvement of this resource.

TF gene candidate selection, the annotation process, and quality assurance

Prior to the initiation of the TFCat collaboration, each of the four participating laboratories constructed mouse TF datasets using manual text-mining and computational-based approaches. As each dataset was created specifically to suit the needs of the research lab that generated it, combinations of overlapping and distinct procedures were applied to collect and filter each dataset (Figure S1 in Additional data file 1). These four, independently established, putative TF datasets laid the foundation for this joint initiative.

To ensure the comprehensiveness and utility of our reference collection, we broadly defined a TF as any protein directly involved in the activation or repression of the initiation of synthesis of RNA from a DNA template. Incorporating this standard, the union of the four sets yielded 3,230 putative mouse TFs (referred to as the UPTF). As complete manual curation of all literature to evaluate TFs is not practical, our curation efforts were prioritized to maximize the number of reviews conducted for UPTFs linked to papers. A manual survey of PubMed abstracts was performed, using available gene symbol identifiers and aliases, to identify genes for which experimental evidence of TF function might exist. Since standardized naming conventions have not been fully applied in the older literature, the associations between abstracts and genes may be incomplete or inaccurate due to the redundant use of the same identifiers for two or more genes.
In addition, we did not consider abstracts that made no mention of the gene identifiers of interest or those that, by their description, were unlikely to have conducted transcription regulation-related analyses. From this list of 3,230 putative mouse TFs, coarse precuration identified 1,200 putative TFs with scientific papers describing their biochemical or gene regulatory activities in the PubMed database [16]. The majority of predicted TFs (2,030 of 3,230) had no substantive literature evidence supporting their molecular function. The remaining 1,200 transcription factor candidates (TFCs) were prioritized for expert annotation.

Table 1. Transcription factor data resources
Resource | Organism | Reference/URL
Human KZNF Gene Catalog | Human | Huntley et al. (2006) [68]/[69]
Database of bZIP Transcription Factors | Human | Ryu et al. (2007) [70]/[71]
The Drosophila Transcription Factor Database | Fly | Adryan et al. (2006) [13]/[72]
wTF2.0: a collection of predicted C. elegans transcription factors | Worm | Reece-Hoyes et al. (2005) [73]/[74]

Genes belonging to the TFC set that were associated with two or more papers in PubMed were selected and randomly assigned for evaluation by one or more of 17 participating reviewers. Gene annotations were primarily performed by a single reviewer, with the exception of 20 genes assigned to multiple reviewers for initial training purposes and 50 genes assigned to pairs of reviewers for a quality assurance assessment. In total, 1,058 genes (Table 2) have been reviewed. For each candidate, a TF confidence judgment was assigned (Table 3) based on the literature surveyed. Annotation of each TFC required evidence of transcriptional regulation and/or DNA-binding (for example, a reporter gene assay and/or DNA-binding assay).
A text summary of the experimental evidence was extracted and entered by the reviewer, along with the PubMed ID, the species under study, and the reviewer's perception of the strength of the evidence supporting their judgment. Although reviewers were not obligated to continue beyond two types of experimental support, they were encouraged to review multiple papers where feasible. Based on their literature review, annotators were required to classify their determination of each TFC into a positive (TF gene or TF gene candidate), neutral (no data or conflicting data) or negative group (not a TF or likely not). Of the 1,058 TFCs reviewed, 83% were found to have sufficient experimental evidence to be classified either as a TF gene or as a TF gene candidate.

To simplify data collection and curation, we focused our literature evidence collection and annotation efforts on mouse genes. However, literature pertaining to mouse genes and their human (or other mammalian) orthologs was used interchangeably as evidence for the annotations. Roughly 83% of the annotation literature evidence surveyed was based on a combination of mouse and human data, with roughly equal numbers of papers pertaining to each of these species. Mouse TF genes were associated with their putative human ortholog using the NCBI's HomoloGene resource [16]. With the exception of 40 mouse genes, putative ortholog pairs were matched using defined HomoloGene groups. All but 13 of the remaining 40 were mapped using ortholog relationships in the Mouse Genome Database [17]. Each gene's predicted human ortholog is included in the download data and in the published wiki data.

Depending upon the subset of available papers reviewed for a given TFC, two curators could arrive at different judgments. To ascertain the consistency and quality of our reviewing approach and judgment decisions, we randomly selected 50 genes for re-review and assigned each to a second expert (Tables S1 and S2 in Additional data file 1).
Out of the 100 annotations (2 reviews each for 50 genes), 37 paired gene judgments (74 annotations) were concordant and 13 paired gene judgments (26 annotations) were discordant. Examination of the discordant pairs suggested that review of different publications may have produced the disagreement in annotation. To further evaluate this assumption, we extracted a non-quality assurance (non-QA) sample of multiple annotations where different reviewers curated the same genes or gene family members using the same articles (Table S3 in Additional data file 1) and found that these curation judgments were in perfect agreement. Under the assumption that judgment conflicts identified in the QA sample would be resolved in favor of one of the assigned judgment calls, so that one annotation in each of the 13 discordant pairs would change, we conclude that 13% of judgments (13 of the 100 QA annotations) may be altered after additional annotation, suggesting that a system to enable continued review would be beneficial.

Since mouse and human TFs have been evolutionarily conserved among distantly related species [18], we assessed the coverage of our curated TF collection by comparing it with a list of expert annotated fly TFs documented in the FlyTF database [13]. Over half (443 of 753) of the FlyTF genes were found in NCBI HomoloGene groups, producing 184 fly TF-containing clusters that also contained mouse homologs. More than 85% (164 of 184) of these homologous TF genes were in the UPTF set. Inspection of the 20 putative mouse homologs of fly TFs absent from the UPTF set led to the inclusion of 5 genes in both the UPTF and the TFC sets for future curation, while there were no published studies involving the mammalian proteins for the remaining 15 genes. We also assessed TFCat's coverage by comparing it with a classic collection of TFs prepared prior to the completion of the mouse genome [6]. After mapping 506 TFs to Entrez Gene identifiers, we found that 463 were present in the UPTF and 423 were members of the TFC gene list.
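The quality-assurance and FlyTF coverage figures quoted above can be re-derived from the reported counts. The following is a minimal sketch in Python, not part of the original analysis; the variable names and the `pct` helper are illustrative, and the step from 13 discordant pairs to 13% assumes, as the text does, that exactly one annotation in each discordant pair would be changed.

```python
# Counts transcribed from the text; names are illustrative only.
concordant_pairs = 37
discordant_pairs = 13
total_pairs = concordant_pairs + discordant_pairs   # 50 re-reviewed genes
total_annotations = 2 * total_pairs                 # 2 reviews per gene

# Assumption (from the text): each conflict is resolved in favor of one of
# the two calls, so one annotation per discordant pair is altered.
altered_fraction = discordant_pairs / total_annotations

# FlyTF coverage counts.
flytf_genes = 753
flytf_in_homologene = 443
clusters_with_mouse_homologs = 184
clusters_in_uptf = 164


def pct(numerator: int, denominator: int) -> float:
    """Return numerator/denominator as a percentage."""
    return 100.0 * numerator / denominator


print(f"Annotations possibly altered: {100 * altered_fraction:.1f}%")
print(f"FlyTF genes in HomoloGene groups: {pct(flytf_in_homologene, flytf_genes):.1f}%")
print(f"Homologous clusters in UPTF: {pct(clusters_in_uptf, clusters_with_mouse_homologs):.1f}%")
```

The same pattern can be applied to any other proportion reported in the text when checking a reconstructed table against the prose.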
The remaining 43 genes were added to the UPTF and the TFC list was extended to include 83 additional genes. From these analyses, we conclude that TFCat contains a large majority of known TFs.

Identification and classification of DNA binding proteins

Genes positively identified as TFs were categorized using a taxonomy to document their functional properties identified in the literature review (Table 4). Notably, 65% (571 of 882) of the genes judged as TFs were reported to act through a DNA binding mechanism and 94% (535 of 571) of these DNA-binding TFs were found to act through sequence-specific interactions mediated by a small number of protein structural domains (Table 5).

Table 2. TFCat catalog statistics
Statistic | Number of genes | Percentage
Total number of genes annotated | 1,058 | 100%
Proportion of genes with positive TF judgments | 882 | 83%
Proportion of positive TFs with DNA-binding activity | 571 | 65%
Proportion of DNA-binding TFs that are (double-stranded) sequence-specific | 535 | 94%

Members of a DNA-binding TF family share strongly conserved DNA binding domains that, in most cases, have overlapping affinity for DNA-sequences; therefore, a prediction of a TF binding site can suggest a role for the family but does not implicate specific family members. As such, a TF DNA-binding classification system is an essential resource for many promoter sequence analyses in which researchers should prioritize potential trans-acting candidates from a set of equally suitable candidate TFs within a structural class. Capitalizing on large-scale computational efforts for the prediction of protein domains [11,12,19-21], we analyzed each of the TFCat DNA-binding TF protein sequences with the full set of PFAM and Superfamily HMM domain models to predict DBD structures. A total of 20 Superfamily structure types were identified in our set, along with 54 PFAM DBD models (Table S4 in Additional data file 1).
Where possible, we linked each double-stranded DNA-binding TF to a family within an established DNA-binding structural classification system [15] that was developed initially to organize the DNA-bound protein crystal structures found in the Protein Data Bank (PDB) [22]. In light of more recent studies, along with a modification of classification requirements (see Materials and methods), an additional set of 16 DBD family classes was added to the system to map domain structures (Table S5 in Additional data file 1).

The DNA binding domain analysis offers some noteworthy observations. The homeodomain-containing genes are prominently represented in our set, comprising 24% (131 of 545) of the classified DBD TFs and 16% of all predicted domain occurrences. The beta-beta-alpha zinc-finger and helix-loop-helix TF families account for 14% (79 of 545) and 13% (71 of 545) of the classified genes, respectively. Given the abundance of zinc-finger proteins in the eukaryotic genomes [23] and recent predictions that this DNA-binding structure makes up a significant portion of all TFs [10], this class may be under-represented. On the other hand, since zinc-finger containing genes are involved in a wide variety of functions, the number of predicted zinc-finger proteins that possess a TF role may be overestimated. In addition, it is likely that certain families of TFs, with central roles in well-studied areas of biology, have been more widely covered in the literature, which may account for the prevalence of literature support for homeodomain TFs.

The majority (392 of 545) of the classified DBD TFs in our list contain a single DNA interaction domain; however, a notable portion (145 of 545) of genes belonging to just a few protein families contain more than one instance of their designated DBD structure.
These multiple instances predominantly reside in TFs containing zinc-finger, helix-turn-helix, and leucine zipper domains (Table S6 in Additional data file 1). While most TFs contained single or multiple copies of a single DNA binding motif, our predictions identified eight TFs with two distinct DBDs (Table S7 in Additional data file 1). We removed the second zinc finger-type domain prediction for two of the genes (Atf2 and Atf7) as this domain is characterized as a transactivation domain in Atf2 [24] and may have a similar function in family member Atf7. All other predicted gene domains were retained, based on literature that supported their activity or failed to support their removal. Four PFAM DBD models detected in eight proteins are not represented by a solved structure and, therefore, could not be directly appointed in the classification system (see Table 5, Protein group 999). In addition, three nuclear factor I (NFI) proteins were annotated with DNA-binding evidence and predicted to contain a SMAD MH1 DBD. Interestingly, a recent study noted that the DBDs of NFI and SMAD-MH1 share significant sequence similarity [25]. These TFs were also assigned to their own family in the unclassified protein group (Table 5, and Table S5 in Additional data file 1, Protein group 999 and Protein family 905).

Table 3. TFCat judgment classifications
Judgment classification | Number of annotations | % of annotations
TF gene | 733 | 61.9
TF gene candidate | 256 | 21.7
Probably not a TF - no evidence that it is a TF | 41 | 3.5
Not a TF - evidence that it is not a TF | 30 | 2.5
Indeterminate - there is no evidence for or against this gene's role as a TF | 114 | 9.6
TF evidence conflict - there is evidence for and against this gene's role as a TF | 10 | 0.8

Table 4. TFCat taxonomy classifications
Taxonomy classification | Number of annotations | % of annotations
Basal transcription factor | 39 | 3.7
DNA-binding: non-sequence-specific | 30 | 2.9
DNA-binding: sequence-specific | 591 | 56.5
DNA-binding: single-stranded RNA/DNA binding | 20 | 1.9
Transcription factor binding: TF co-factor binding | 315 | 30.1
Transcription regulatory activity: heterochromatin interaction/binding | 51 | 4.9

Table 5. DNA-binding TF gene classification counts
Protein group | Protein group description | Protein family | Protein family description | Gene count | Predicted occurrences
1.1 | Helix-turn-helix group | 2 | Homeodomain family | 131 | 160
1.1 | Helix-turn-helix group | 100 | Myb domain family | 7 | 16
1.1 | Helix-turn-helix group | 109 | Arid domain family | 5 | 5
1.1 | Helix-turn-helix group | 999 | No family level classification | 2 | 2
1.2 | Winged helix-turn-helix | 13 | Interferon regulatory factor | 7 | 7
1.2 | Winged helix-turn-helix | 15 | Transcription factor family | 10 | 11
1.2 | Winged helix-turn-helix | 16 | Ets domain family | 23 | 23
1.2 | Winged helix-turn-helix | 101 | GTF2I domain family | 2 | 12
1.2 | Winged helix-turn-helix | 102 | Forkhead domain family | 26 | 26
1.2 | Winged helix-turn-helix | 103 | RFX domain family | 4 | 4
1.2 | Winged helix-turn-helix | 111 | Slide domain family | 1 | 1
2.1 | Zinc-coordinating group | 17 | Beta-beta-alpha-zinc finger family | 79 | 450
2.1 | Zinc-coordinating group | 18 | Hormone-nuclear receptor family | 43 | 43
2.1 | Zinc-coordinating group | 19 | Loop-sheet-helix family | 1 | 1
2.1 | Zinc-coordinating group | 104 | GATA domain family | 7 | 12
2.1 | Zinc-coordinating group | 105 | Glial cells missing (GCM) domain family | 2 | 2
2.1 | Zinc-coordinating group | 106 | MH1 domain family | 3 | 3
2.1 | Zinc-coordinating group | 114 | Non methyl-CpG-binding CXXC domain | 2 | 4
2.1 | Zinc-coordinating group | 999 | No family level classification | 2 | 2
3 | Zipper-type group | 21 | Leucine zipper family | 41 | 64
3 | Zipper-type group | 22 | Helix-loop-helix family | 71 | 71
4 | Other alpha-helix group | 28 | High mobility group (Box) family | 24 | 28
4 | Other alpha-helix group | 29 | MADS box family | 4 | 4
4 | Other alpha-helix group | 107 | Sand domain family | 3 | 3
4 | Other alpha-helix group | 115 | NF-Y CCAAT-binding protein family | 2 | 2
5 | Beta-sheet group | 30 | TATA box-binding family | 1 | 2
6 | Beta-hairpin-ribbon group | 34 | Transcription factor T-domain | 11 | 11
6 | Beta-hairpin-ribbon group | 108 | Methyl-CpG-binding domain, MBD family | 2 | 2
7 | Other | 37 | Rel homology region family | 10 | 10
7 | Other | 38 | Stat protein family | 6 | 6
7 | Other | 110 | Runt domain family | 3 | 3
7 | Other | 112 | Beta_Trefoil-like domain family | 2 | 2
7 | Other | 113 | DNA-binding LAG-1-like domain family | 2 | 2
8 | Enzyme group | 47 | DNA polymerase-beta family | 1 | 7
999 | Unclassified structure | 901 | CP2 transcription factor domain family | 3 | 3
999 | Unclassified structure | 902 | AF-4 protein family | 1 | 1
999 | Unclassified structure | 903 | DNA binding homeobox and different transcription factors (DDT) domain family | 1 | 1
999 | Unclassified structure | 904 | AT-hook domain family | 3 | 6
999 | Unclassified structure | 905 | Nuclear factor I - CCAAT-binding transcription factor (NFI-CTF) family | 3 | 3

A group of ten literature-based DNA-binding TFs had no predicted DBDs (Table S8 in Additional data file 1). The absence of detected DBDs may be due, in part, to the limited sensitivity of the models.
For example, the Tcf20 gene (alias Spbp) purportedly contains a novel type of DBD with an AT hook motif [26] that was not predicted by the corresponding AT hook PFAM model. Restricted model representation is also likely the reason for the missing domain predictions of the C4 zinc finger domain in the Nr0b1 gene and the basic helix-loop-helix (bHLH) domain in the Spz1 gene. Similarly, four DBDs detected with protein group class-level Superfamily models (specifically for zinc coordinating and helix-turn-helix models) could not be further delineated to a protein family level assignment (Table S9 in Additional data file 1), suggesting that their sequences deviate from the family-specific properties represented in PFAM. It is quite possible that domains involved in DNA binding by human and mouse TFs remain to be discovered.

Most TF DNA-protein interactions occur when the DNA is in a double-stranded state; however, a small number of TF proteins preferentially bind single-stranded DNA [27,28]. We identified in the literature review a set of 16 single-stranded DNA-binding TFs, of which 12 contain HMM-predicted protein domains that are characterized as single-stranded RNA-DNA-binding (Table S10 in Additional data file 1). There may be other DBD TFs in our list that act on both single-stranded DNA and double-stranded DNA but were not classified in the single-stranded DNA DBD taxonomy because this property was not specifically characterized in the literature reviewed. The distinction and overlap between single-stranded DNA and double-stranded DNA binding TFs warrants future attention.

Generation and assessment of mouse-human TF homology clusters to predict additional putative TFs

Since a transcriptional role can be inferred for closely related TF homologs [7,29-31], researchers interested in the analysis of gene regulatory networks would benefit from access to a broad data collection of both experimentally validated TFs and their homologs.
The curated TF gene list was used to identify putative mouse TF homologs in the genome-wide RefSeq collection that have not yet been annotated in our catalog or that were not evaluated because they lack PubMed literature evidence. While sequence homology is often used in preliminary analyses to infer similar protein structure and function, its success may be limited when similar protein structures have low sequence similarity [32] or short homologous protein domains. Based on recent evidence that over 15% of predicted domain families have an average length of 50 amino acids or less [33], we evaluated whether pruning BLAST-derived clusters using a previously published sequence similarity metric [34] could be further improved by explicitly including domain information. Our evaluation of both pruning methods indicated that the inclusion of domain knowledge improved homolog cluster content (Figures S2 and S3 in Additional data file 1). We therefore incorporated both domain structure predictions, using HMMs, and sequence similarity in our homology-based approach to predict additional TF genes.

The homolog prediction and clustering process yielded 227 homolog clusters containing 3,561 genes (3,419 unique genes). The vast majority of the genes (3,284 of 3,561) are associated with only 1 cluster each, although 128 genes were members of 2 clusters and 7 genes were present in 3 clusters (3,284 + 128 + 7 = 3,419 unique genes; 3,284 + 2 × 128 + 3 × 7 = 3,561 cluster memberships). We also identified 72 single gene clusters (singletons), which included 36 TF genes that had only significant BLAST matches to themselves, 12 genes that derived BLAST hits that did not satisfy the homolog candidate cut-offs, 21 genes with cluster members that did not satisfy the pruning criteria, and 3 genes that had no RefSeq model sequence. While our TF-seeded homology inference analysis used cut-offs that likely pruned some false negatives, in an effort to emphasize specificity, it is likely that these singletons represent TFs that share common protein structural features with low sequence similarity.

The curated TF set contains some proteins with properties not commonly associated with TF function. For example, our catalog included the cyclin dependent kinases (cdk7, cdk8, and cdk9), which are reported to directly activate gene transcription (for a review, see [35]). As a result, the homolog analysis of TFs identified numerous other protein kinases that will likely have no direct involvement in transcription. Similarly, larger clusters seeded by TFs containing other domains not frequently associated with transcription, such as calcium-binding, ankyrin repeats, armadillo repeats, dehydrogenase, and WD40, also attracted false TF predictions.

To assign a quantitative confidence metric for the large clusters of TF predictions, we developed a scoring procedure based on protein domain associations to TF activity annotations from the Gene Ontology (GO) molecular function subtree [36]. The cluster confidence metric was employed using a four-tier ranking system for clusters containing more than ten gene members (42 out of 227 homolog clusters). The majority of these clusters (52% or 22 clusters) received high scores, indicating that they contain a high proportion of TF genes. Given that GO currently annotates only 39% of the TF genes in our catalog in the TF activity node in the molecular function subtree (Table S11 in Additional data file 1), we expect that less frequently occurring protein domains found in small homolog clusters may not yet be represented in GO. Therefore, we did not analyze clusters containing fewer than ten members and we anticipate future refinements in the homolog cluster confidence rankings as TF gene annotation is expanded in GO.

We incorporated our curated set and cluster counts in an analysis to estimate both the total number of TFs and, as a smaller subset, the number of double-stranded DNA-binding proteins (see Materials and methods). The cluster counts were adjusted using the observed approximate mean TF (OAMTF) proportions associated with each rank level (Table 6) to account for false positives. From this mouse RefSeq-based analysis, we arrived at an estimate of 2,355 DNA-binding and accessory TFs. Since peptide sequence-dependent analyses can result in both omissions and false predictions of homologous protein structures, readers should regard this figure as a 'best-guess' approximation [32]. A similar analysis conducted over the homolog clusters containing double-stranded DNA-binding TFs resulted in an estimate of 1,510 DNA-interacting TFs. We also performed an extraction of DBD-containing genes from the Ensembl database using the DBDs defined in TFCat. This analysis derived a list of 1,507 putative DNA-binding TFs. These estimates agree well with earlier publications [10,37,38].

Maintenance and access of TFCat annotation data

All gene annotations, mouse homolog clusters and human orthologs are published in the TFCatWiki, which is accessible from the TFCat portal. Each wiki article page houses the annotation information for one gene with its content secured against modification. Each gene article page is associated with a discussion page, which is available for comments and feedback by all wiki users. Wiki users can specify that they wish to receive periodic e-mail notification of lists of gene wiki pages and their associated discussion pages that have been updated.
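The rank-based adjustment of cluster counts described earlier in this section can be illustrated with a short Python sketch. The thresholds and OAMTF fractions below are transcribed from Table 6; everything else is an assumption made for illustration, since the actual estimation procedure is given in the Materials and methods. In particular, the `Cluster` fields, the function names, the example numbers, and the choice to count curated TFs in full while scaling only the unannotated genes are hypothetical.

```python
from dataclasses import dataclass

# (lower bound on Cn, rank, OAMTF fraction), transcribed from Table 6.
RANK_TABLE = [
    (0.20, 1, 0.95),
    (0.10, 2, 0.75),
    (0.03, 3, 0.35),
    (0.00, 4, 0.15),
]


def rank_for_score(cn: float) -> int:
    """Map a cluster confidence score Cn to its Table 6 rank (1 to 4)."""
    if cn < 0:
        raise ValueError("Cn must be non-negative")
    for lower_bound, rank, _ in RANK_TABLE:
        if cn >= lower_bound:
            return rank
    raise AssertionError("unreachable: the last lower bound is 0.0")


def oamtf_for_rank(rank: int) -> float:
    """Return the OAMTF fraction that Table 6 associates with a rank."""
    for _, table_rank, fraction in RANK_TABLE:
        if table_rank == rank:
            return fraction
    raise ValueError(f"unknown rank: {rank}")


@dataclass
class Cluster:
    """Hypothetical summary of one large homolog cluster."""
    score: float       # Cn
    curated_tfs: int   # genes already judged to be TFs
    unannotated: int   # genes without a curated judgment


def expected_tf_count(cluster: Cluster) -> float:
    """Assumed adjustment: curated TFs count fully, while unannotated
    genes are weighted by the OAMTF fraction of the cluster's rank."""
    fraction = oamtf_for_rank(rank_for_score(cluster.score))
    return cluster.curated_tfs + fraction * cluster.unannotated


# Made-up example clusters, not data from the paper.
clusters = [Cluster(0.25, 5, 20), Cluster(0.05, 2, 40)]
print(sum(expected_tf_count(c) for c in clusters))
```

Keeping the thresholds in a single table makes it straightforward to revise the ranking as GO annotation coverage improves, which the text anticipates.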
Semantic features and functional capabilities are included in the wiki implementation to facilitate easy access to all gene annotation data.

We established a TFCat annotation feedback system workflow process (Figure S4 in Additional data file 1) to encourage continuous improvement of the catalogued gene entries. An issue tracking management system is integrated with the wiki to capture, queue, and track feedback contributions for follow-up by the wiki annotator. Wiki users may view a gene's feedback report summaries and current workflow status through an inquiry made available on each gene's article page. Gene annotation changes, entered through our internally accessible TFCat annotation system, will be flagged and forwarded to the wiki through an automated updating process. Community members who wish to directly contribute to the wiki contents through the backend web application (Figure S5 in Additional data file 1) may contact the authors.

The complete TF catalog resource can be downloaded from our website [39]. The website application enables download of the complete list or a subset of annotated genes by assigned judgment, functional taxonomy, and DNA-binding classification. The data extraction is run real-time against a relational database providing access to the most current TF catalog data.

Catalog characteristics, comparisons, and utility

The comprehensive catalog of TFs contained in TFCat provides an important resource for investigators studying gene regulation and regulatory networks in mammals. The curation effort assessed the scientific literature for 3,230 putative mouse and human TFs, including detailed evaluation of papers describing the molecular function of 1,058 TFCs, to identify 882 confirmed human and mouse TFs. Each TF was further described within TFCat using a newly developed TF taxonomy. DNA binding proteins, a subset of TFs, were mapped to a structural classification system.
As an aid to researchers, an expanded set of putative TFs was generated through a homology-based sequence analysis procedure. Online access to the annotations and homology data is facilitated through a wiki system. An annotation feedback system, linked from the wiki, enables reporting and tracking of community input. An additional website application offers capabilities to extract all or a subset of the catalog data for file download.

Table 6. Large cluster ranking criteria

$C_n$ | Rank | Implication for unannotated genes in cluster | Fraction of observed approximate mean TFs (OAMTF)
----- | ---- | -------------------------------------------- | -------------------------------------------------
$C_n \geq 0.20$ | 1 | The majority of genes are likely TFs | 95%
$0.10 \leq C_n < 0.20$ | 2 | A higher proportion of genes are likely TFs | 75%
$0.03 \leq C_n < 0.10$ | 3 | A higher proportion of genes are likely not TFs | 35%
$0.00 \leq C_n < 0.03$ | 4 | The majority of genes are likely not TFs | 15%

For many researchers, the greatest utility of TFCat is the provision of an organized and comprehensive list of DNA-binding proteins. The protein-DNA structural classification system used to organize the DBD TFs in the catalog was originally proposed by Harrison [40], further modified by Luisi [41] and extended by Luscombe et al. [15]. The DBD analysis and gene/domain counts (Table 5) confirmed that well-known DBD families are represented. The DNA-binding classification system was extended with new family classes to accommodate the majority of predicted DNA-binding structures in our curated TF set (Table 5; Table S5 in Additional data file 1). A new family category was included for unrepresented, double-stranded TF protein-DNA binding mechanisms that were supported by PDB structures or publications. Similar to the analysis and classification performed by Luscombe et al. [15], we added structural domain families that were characterized by distinct DNA-binding mechanisms. However, unlike the Luscombe et al.
approach, we did not consider biological function in our classification decisions. To preserve the properties of the system, the necessary extensions were made within the existing protein groups.

The value in having inventories of TFs has spurred previous efforts to compile collections of DNA-binding proteins. To evaluate the comprehensiveness of our curated collection, we compared it with the gene annotations provided by GO, and compared our DBD classification analysis with the domains found in a DBD collection [42]. GO assigns molecular function labels to proteins, including functions falling under the broad category of transcription. The challenge of annotating all genes is daunting and, therefore, it was not a surprise that only 39% (343) of our expert curated collection of TFs has thus far been associated with GO terms linked to transcription (Table S11 in Additional data file 1).

While TFCat is unique in its evidence-based approach to identify mouse and human TFs, there are other compilations of TF binding domain models and predictions of domain-containing proteins. For example, a catalog of sequence-specific DNA-binding TFs (which we will refer to as DBDdb) has been compiled using HMMs to catalog double-stranded and single-stranded sequence-specific DBDs [42]. Comparison of the double-stranded DNA-binding subdivision of TFCat with the predictions in DBDdb highlights some key differences between these efforts (Tables S12-S14 in Additional data file 1). For example, the TFCat DNA-binding subdivision includes only TFs with published evidence from mammalian studies, whereas the DBDdb collection includes domain predictions based on evidence of sequence-specific DNA binding in any organism. While the two TF resources overlap, they serve complementary purposes.
DBDdb is a set of computational predictions generated with protein motif models associated with sequence-specific single or double-stranded binding domains, while TFCat is an expert-curated, highly specific resource that targets the organized identification of all TFs, regardless of DNA binding, in human and mouse. For example, the high mobility group (HMG) domain TFs, which exhibit both specific and non-specific DNA-binding, are excluded from DBDdb but included in TFCat. Moreover, TFCat included only TFs with literature support in mammalian cells, which excludes certain domains included in DBDdb. For example, CG-I has been shown to regulate gene transcription in fly [43] but not in mammals [44].

To complement our large set of curated TF proteins, we conducted a sequence-based homology analysis, propagated from our positively judged TFs, to predict additional TF encoding genes. We applied a confidence ranking metric to predict the number of false positives included in larger homolog clusters (Table 6), which should be considered when extracting un-annotated, predicted TFs. Future adaptations of the TFCat resource could include literature-based judgments of TF homolog predictions. While the homolog clusters as provided are an essential and useful supplement to our evidence-based TF catalog, future predictions may benefit from further structure-based homology research.

Creation of a comprehensive TF catalog provides an important first step in unraveling where, when and how each TF acts. For example, a number of recently published genome-scale studies constructed lists of predicted TFs prior to investigating the spatial and temporal expression characteristics of sets of regulatory proteins [8,9,45,46], in advance of conducting a phylogenetic analysis of genes involved in transcription [47], and as initial input to the analysis of conserved non-coding regions in TF orthologs [48]. The set of literature evidence-supported TFs in TFCat will provide an important foundation for similar future studies.

TF catalogs will become increasingly important and necessary to facilitate the investigation and analysis of TF-directed biological systems. Recent ground-breaking stem cell studies [49,50] have shown the central role of TFs in regulating stem cell pluripotency and differentiation. Understanding the central role of TFs in the control of cellular differentiation has therefore taken on increased importance. Computational predictions in regulatory network analysis of cellular differentiation often highlight a pattern consistent with binding of a structural class of TFs, but fail to delineate which TF class member is acting. TFCat will serve as a reference and organizing framework through which such linkages can progress towards the detailed investigation of candidate TF regulators.

Materials and methods

Creation of four independent murine and human TF preliminary candidate data sets

Four TF collections were compiled by four independent approaches. All data sets are available on the TFCat portal.

Dataset I

A list of 986 human genes considered 'very likely' plus 913 considered 'possibilities' to code for TFs was manually curated in February 2004 [51] using personal knowledge combined with information in LocusLink (now Entrez Gene), the Online Mendelian Inheritance in Man database (OMIM) [52], and PubMed [16]. Selection was guided by the following definition of a TF: 'a protein that is part of a complex at the time that complex binds to DNA with the effect of modifying transcription'. Inclusion was necessarily subjective for two reasons: the definition of 'transcription factor' is difficult to precisely constrain; and there was not enough information available for many genes to be certain of their function.
Genes that primarily mediate DNA repair (for example, ERCC6) or chromatin conformation (for example, CBX1) were excluded. To be considered, a gene had to have an Entrez Gene entry with a GenBank accession number. Text-based searches for the terms 'transcription factor' or 'homeobox' were used to identify Entrez Gene entries for further analysis. GO node descriptions including the terms 'nucleic acid binding', 'DNA binding', and 'transcription' were used as a supplement to guide gene selection. A total of 998 TFs were present in the set following this initial compilation. After February 2004, periodic additions were made based on new reports in the literature.

Dataset II

The objective of this analysis was to identify a comprehensive list of DBDs for TF gene candidate extraction. Firstly, the SwissProt database [53] protein entries (obtained in April 2005) were scanned for descriptors or assigned PFAM [11] and/or Interpro [54] domains (downloaded in April 2005) indicating DNA-binding, DNA-dependent, and transcription. The extracted gene set was then further extended by including SwissProt gene entries that had assignments to the biological process GO node GO:0006355 (regulation of DNA transcription, DNA-dependent) and SwissProt records with text descriptions that included JASPAR database transcription factor binding site class names [55]. A list of unique DBDs was compiled from this extraction. All domains were manually reviewed for evidence strongly suggesting DNA binding and transcription factor activity using both Interpro and PFAM domain descriptions and associated literature references. Domains that did not meet these criteria were pruned from the list.
Both known and putative TF genes were extracted from the Ensembl V29 database [56] using the TF DBD PFAM-based list, yielding a set of 1,266 mouse and 1,500 human DNA-binding TF candidates.

Dataset III

GO trees were constructed for all mouse and human entries in Entrez Gene by starting with the leaf term from gene2go [36] (downloaded July 19th, 2005) and enumerating all parent terms using file version 200507-termdb.rdf-xml. As we were interested in all genes that could be involved in altering transcription, genes were selected if they had any annotation (including Inferred Electronic Annotations) to GO terms with descriptors 'transcription regulator activity', 'transcription factor activity' and/or 'transcription factor binding' in their tree. We identified 970 mouse genes and 1,203 human genes using this method. As this first extraction did not identify all family members of a putative transcription factor, we performed an additional extraction using the term searches 'DNA binding' and 'transcription factor' against the domain information in the Interpro database [54]. The resulting genes were mapped to Entrez Gene entries using the Affymetrix annotation for the MOE-430 v2 chip. Merging the two lists and removing duplicate entries resulted in 2,131 mouse and 2,900 human candidate genes involved in transcriptional regulation.

Dataset IV

We assembled approximately 350,000 isoforms representing approximately 48,000 known and predicted protein-coding mouse genes by mapping seven collections of known and predicted mRNAs to the mouse chromosomes, and clustering them on the basis of overlap (see [57] for source sequences, a representative mRNA from each cluster, and a description of the clustering method). We then assembled 36 known transcription-factor DBDs from PFAM and SMART [58], and screened the approximately 350,000 isoforms using the HMMER software [59] to identify approximately 2,500 known or predicted genes containing at least one of the 36 domains. To map the International Regulome Consortium entries to Entrez Gene, the sequences [60] were compared with RefSeq sequences using BLAST. Only sequences with an expectation value of at most $10^{-5}$ were selected and subsequently mapped to Entrez Gene using the Gene2Refseq table.

Standardizing TF gene candidate annotation

A website annotation tool and MySQL database were developed to standardize and centralize the annotation effort (Figure S5 in Additional data file 1). TF candidate judgments and a high-level taxonomy classification system were established (Tables 3 and 4) for this web-based annotation process. The secure website enables access to only those genes assigned to each annotator. Each gene annotation required input of text summarizing the journal article evidence that, to some degree, supported or refuted the judgment of a gene (or the gene's ortholog in a closely related species) as a TF. One or more PubMed journal articles were summarized in the reviewer comments and a final judgment and general taxonomy classification were assigned.

Ten trial genes, randomly selected from the list of TFCs, were assessed by four reviewers. The set of annotations for each trial gene was evaluated for literature evidence selected and annotation content and formatting. This evaluation was used to develop annotation evidence guidelines and a suggested general documentation format for the annotation process, which was included in the annotator help guidelines.

Selection and annotation of a subset of TF candidates

The mouse TF candidate datasets were merged, using mapped NCBI Entrez Gene identifiers, into a single non-redundant dataset. Gene2PubMed file counts were extracted and merged by Entrez Gene ID. Genes were manually pre-curated for evidence supporting TF activity by scanning NCBI PubMed abstracts (where available) using both standard gene
symbols and aliases and examining GeneRIF entries for each gene in the dataset. Genes with literature evidence suggesting TF function were included in the list of TFCs to be annotated. A set of TFCs associated with two or more PubMed abstracts (based on Gene2Pubmed data and excluding the large annotation project articles) was extracted from the TFC list and randomly assigned to each of 17 reviewers based on pre-determined reviewer allocation counts. Each TFC was reviewed and judged by the assigned reviewer for TF evidence in the literature as described above. We also extracted and entered the PubMed information accompanying 22 TF DNA-binding profiles from the JASPAR database [55].

During this research project, the Entrez Gene numbers were maintained using the NCBI Gene History file. TFCat gene identifiers were updated (changed, merged, or deleted) if a corresponding change was recorded in this file.

Randomly sampled quality assessment and auditing of TF annotations

TF gene candidates were randomly selected from each reviewer-assigned gene set based on the assigned proportions across all reviewers to form a list of 50 genes for annotation QA testing. Each gene was allocated to two reviewers for annotation in a blind QA test. The QA gene annotations were extracted and reviewed for TF judgment and taxonomy classification consistency. A second round of annotation auditing was performed to ensure consistency in the recorded annotation data. All annotations were examined for alignment of PubMed evidence reviewed and assigned judgment and functional taxa. Misaligned annotations were forwarded to the annotator for review and revision.

TFC quality assurance comparisons

To assess sensitivity (coverage) in our initial curated TF list, we compared our gene set with TF genes identified in two TF collections.
Approximately 800 gene symbols listed in a TF textbook index, authored by Joseph Locker [6], were manually reviewed and mapped, where possible, to 506 mouse Entrez Gene identifiers using gene descriptions and citations provided in the text. A TF comparison was also performed against the list of annotated fly TFs found in the FlyTF database [13] by mapping, where possible, FlyBase identifiers to NCBI gene identifiers to locate their corresponding mouse homolog in a HomoloGene group [16].

Upon completion of the TFCat curation phase, we performed comparisons with GO [36] and the DBD Transcription Factor Prediction Database resource [42]. To compare our curated set with GO, we developed software to enumerate the number of our TF genes in the GO molecular function subtree under the 'transcription regulator activity' node. We used the Mouse Xref file found in the GO Annotation Database [61] to map the TF Entrez gene numbers to the gene identifiers available in the GO database. The DBD resource comparison involved downloading the mouse (Mus musculus 49_37 b) and human (Homo sapiens 49_36 k) predicted TF sets and development of software to extract all DBD models identified in those records. We then compared the domains found in the DBD mouse/human set with those domain models annotated as DNA-binding in our curated TF set.

Human-mouse ortholog assignment

Human-mouse predicted orthologs were assigned using NCBI HomoloGene groups [16] with one-to-one relationships between the mouse and human genes. Those few genes that did not have a one-to-one relationship were manually inspected and, when available, a preference was given to the human non-predicted RefSeq gene model or an assignment was made using the closest Blast alignment scores between a mouse and human gene pair. Where HomoloGene entries were not available for both human and mouse, ortholog assignments identified in the Mouse Genome Database were used.

TF DNA-binding structure analysis and classification

A DNA-binding protein classification system, an extension of the work from Luscombe et al. [15], was utilized to classify all genes judged as TFs with DNA-binding activity. Structural assignments were made utilizing the HMMER software to enumerate a full set of Superfamily (SCOP-based) HMMs [12] with a threshold of 0.02 and PFAM HMMs [11] for each gene using gathering threshold cut-offs and a calculated model significance value ≤ $10^{-2}$. The Superfamily domain sequences predicted in the TF gene set were subjected to a PFAM HMM analysis to identify PFAM domain models that are satisfied by the same sequences (Table S4 in Additional data file 1). Both redundant and non-redundant models were then mapped to the DNA-binding structure classification using model structural descriptions and based on review of related literature for PDB entries that contain these domains.

The DNA-binding classification was extended with additional family classes to accommodate the predicted DNA-binding structures encountered in the curated set of DBD TFs (Table 5; Table S5 in Additional data file 1). To evaluate the structural similarity of DBDs, we performed alignments using the protein structure comparison web tool Secondary Structure Matching (SSM) [62]. We identified PDB entries for each of the new DBD families, with a preference for DNA-bound structures. The DBD chains of each PDB entry were aligned with the entire PDB archive (incorporating lowest acceptable matches of 40% and defaulting the remaining parameters) to identify similar DBD structures based on Q-score metric clustering results. A new protein family classification was established if the structure aligned only to itself or was clustered (by Q-value) within its own set of family class structures. In a few cases, where a structure aligned reasonably well with another family in the classification system, PubMed articles were consulted to derive a final decision and any borderline cases were noted and described in the family class description text (Table S5 in Additional data file 1). Each DNA-binding TF was then assigned to one or more DNA-binding families in the classification system if it was predicted to contain the related DBD structure.

Identification of homolog sets for mouse TF genes

A homolog analysis process was implemented that considers both sequence similarity and predicted protein domain commonality, and uses a computationally simplified clustering approximation, loosely motivated by proportional linkage clustering [63]. We initially identified sequence similarity using BLASTALL [64] analysis over a full mouse protein RefSeq [65] dataset with an expect value cut-off of $10^{-3}$ and enumerated all HMM PFAM domains over an extracted full representation of the mouse genome using NCBI RefSeq sequences. To extract putative homolog candidates for each TF gene, we incorporated a metric, originally proposed by Li et al. [34], which considers the ratio of aligned sequence length to the entire length of each sequence. Given the focus on mouse genes, the formula for this metric, which we will refer to as metric $I'_s$, was revised to utilize sequence similarity rather than identity. Our metric is computed as:

$$I'_s = S \times \min\left(\frac{n_1}{L_1}, \frac{n_2}{L_2}\right)$$

where $S$ is the proportion of similar amino acids (as defined by the BLOSUM62 matrix) across the hit, $L_i$ is the length of sequence $i$ ($i$ is the query or hit sequence), and $n_i$ is the number of amino acids in the aligned region of sequence $i$.
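As a minimal sketch of the similarity-weighted coverage metric defined above, the following Python function computes $I'_s$ for one pairwise alignment. The function name, the argument names, and the example numbers are illustrative assumptions and are not taken from the TFCat software; $S$ is assumed to have been computed beforehand as the fraction of BLOSUM62-similar positions across the hit.

```python
def similarity_coverage(s, n_query, len_query, n_hit, len_hit):
    """Return I'_s = S * min(n1/L1, n2/L2) for one pairwise alignment.

    s          -- proportion of similar residues across the hit (0..1)
    n_query    -- number of query residues inside the aligned region
    len_query  -- full length of the query sequence
    n_hit      -- number of hit residues inside the aligned region
    len_hit    -- full length of the hit sequence
    """
    if len_query <= 0 or len_hit <= 0:
        raise ValueError("sequence lengths must be positive")
    if not 0.0 <= s <= 1.0:
        raise ValueError("s must be a proportion between 0 and 1")
    # The smaller of the two coverage ratios limits the score.
    coverage = min(n_query / len_query, n_hit / len_hit)
    return s * coverage


# Hypothetical alignment: 80 aligned residues in a 400-residue query,
# 80 aligned residues in a 200-residue hit, 60% similar positions.
example = similarity_coverage(0.6, 80, 400, 80, 200)
```

Because the minimum of the two coverage ratios is used, a match confined to a short region of a long protein yields a low value even when the aligned region itself is highly similar.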
We considered only homolog candidates that had a maximum hit significance of $10^{-4}$ and allowed for a high level of sensitivity by requiring that the computed $I'_s$ values were at least 0.06. We did not include any genes that had been reviewed and deemed not TFs.

Our survey of a set of TF gene family sequence characteristics suggested that some known DBDs were contained in a small fraction of the total TF protein sequence. However, similarly short alignments between a TF gene and other hit sequences (low $I'_s$ values) can yield a significant number of false positives. We used well-documented SRY-related HMG-box transcription factor (Sox) and Forkhead transcription factor (Fox) TF families (Tables S15 and S16 in Additional data file 1) to evaluate two cluster pruning strategies and selected an approach that increased cluster specificity (proportion of members of a test set in a cluster) without decreasing cluster sensitivity (number of cluster members that are members of a test set). To evaluate cluster pruning of the Blast-based clusters using strictly an $I'_s$ threshold method, we computed cluster sensitivity and cluster specificity over an increasing range of $I'_s$ values, using the Sox and Fox validation sets (Figures S2 and S3 in Additional data file 1). An $I'_s$ value was computed between the query sequence and every member in the cluster and a member (gene) was pruned if the $I'_s$ did not satisfy a cut-off threshold. Cluster sensitivity and cluster specificity were computed for the range of $I'_s$ values and compared. We then assessed a second cluster pruning approach over a successive range of $I'_s$ values requiring that all predicted domains in a cluster member (gene) match the query gene or, when this criterion could not be met, a particular $I'_s$ value threshold be satisfied (Figures S2 and S3 in Additional data file 1). Inclusion of a domain-based method as a primary criterion for pruning with the incorporation of a stricter $I'_s$ value criterion when the domains did not match, in most cases, maintained cluster sensitivity while preserving or improving cluster specificity. Importantly, higher cluster sensitivity and cluster specificity levels enabled comprehensive Sox HMG and Fox Forkhead families to emerge when we applied a proportional linkage clustering approximation approach to merge the overlapping clusters (Figures S6 and S7 in Additional data file 1). While the sole application of an $I'_s$ value as a pruning criterion may not generate comprehensive TF family clusters (compare panel B in Figures S6 and S7 in Additional data file 1), our analyses suggested that this metric on its own, implemented with higher parameter values, is useful for identifying closely related subfamily members (Figure S8 in Additional data file 1). Motivated by these assessment results, we implemented a cluster pruning step that required that either all predicted PFAM enumerated domains in the TF gene be matched in a homolog candidate or that the $I'_s$ value between the query TF gene and its homolog hit be no smaller than 0.21 with a sequence similarity no less than 30%. This resulted in 830 overlapping sets consisting of 48,555 members in total.

To cluster and merge the sets, we implemented a method that considers a proportional linkage median-based relationship between sets. The algorithm performed iterations of set merges, combining two sets $S$ and $T$ if at least half of the genes in the smaller set matched genes in the larger set, that is, if there were at least $\min(|S|,|T|)/2$ matching genes. To mitigate the cluster attraction strength properties of initially larger and possibly noisier clusters, the merge process iteratively considered and executed merging over smaller to progressively larger cluster cardinalities using increments of 10. Cluster membership attained a steady-state convergence within 700 iterations.

A cluster confidence metric was developed to measure the number of potential false positives in a large (cardinality > 10) homolog cluster using predicted domain content. We mapped the mouse genes with the enumerated PFAM domains to terms in the GO molecular function subtree. We tallied the number of times a specific domain is contained within a gene annotated to the transcription regulator activity node and its child nodes versus the number of times the domain is found in a gene annotated to some other activity node to compute a probability of a particular domain $P_d$ being associated with TF function. The majority of GO annotation evidence codes were included, with the following exceptions: IEA (Inferred from Electronic Annotation), ISS (Inferred from Sequence or Structural Similarity), and RCA (Inferred from Reviewed Computational Analysis). To evaluate cluster confidence $C_n$, we first enumerated the number of genes that contain a specific domain within a cluster $C_d$ and the number of genes in each cluster $C_g$ to weight a domain's association to TF activity:

$$N_d = \frac{C_d}{C_g} P_d$$

and, secondly, included those cluster domains that satisfy $D = \{C_d \geq \lfloor C_g/4 \rfloor\}$ to compute $C_n$, using the following equation:

$$C_n = \frac{\sum_{i \in D} N_{d_i}}{|D|}$$

All cluster confidence values and cluster membership were reviewed and qualitatively assessed based on the proportion of verified TFs and binned into four partitions with associated confidence rankings (Table 6).

To derive an estimate for the total number of TFs in the human and mouse species, we computed the number of known and predicted TF homologs and adjusted this amount by the cluster rank OAMTF (Table 6) to obtain a prediction of 2,355 DNA-binding and accessory TFs. To obtain a ballpark figure for a total number of DBD TFs, we performed a separate homolog clustering analysis seeded by genes curated with double-stranded DNA-binding activity and reduced the counts using the OAMTF proportions by cluster rank, where applicable.
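The confidence calculation and the OAMTF adjustment described above can be sketched as follows. This is one illustrative reading of the method and not the authors' code: the function names, the input layout (a mapping from domain to the number of cluster genes containing it, and a mapping from domain to $P_d$), the use of a floor in the domain inclusion rule, and the choice to apply the OAMTF proportion to the unannotated genes of a large cluster are all assumptions. Only the rank boundaries and OAMTF proportions are taken from Table 6.

```python
import math

# (lower bound on C_n, rank, OAMTF proportion), as listed in Table 6.
RANKS = [(0.20, 1, 0.95), (0.10, 2, 0.75), (0.03, 3, 0.35), (0.00, 4, 0.15)]


def cluster_confidence(domain_gene_counts, cluster_size, p_d):
    """Return C_n: the mean of N_d = (C_d / C_g) * P_d over retained domains.

    domain_gene_counts -- {domain: number of cluster genes containing it} (C_d)
    cluster_size       -- number of genes in the cluster (C_g)
    p_d                -- {domain: probability of association with TF function}
    """
    # Assumed inclusion rule: keep domains present in at least floor(C_g / 4) genes.
    min_genes = math.floor(cluster_size / 4)
    retained = [d for d, c_d in domain_gene_counts.items() if c_d >= min_genes]
    if not retained:
        return 0.0
    n_d = [(domain_gene_counts[d] / cluster_size) * p_d.get(d, 0.0) for d in retained]
    return sum(n_d) / len(retained)


def rank_and_oamtf(c_n):
    """Map a confidence value to its (rank, OAMTF proportion) from Table 6."""
    for lower, rank, oamtf in RANKS:
        if c_n >= lower:
            return rank, oamtf
    raise ValueError("C_n must be non-negative")


def adjusted_unannotated(c_n, n_unannotated):
    """Expected number of true TFs among a cluster's unannotated genes (assumed use of OAMTF)."""
    _, oamtf = rank_and_oamtf(c_n)
    return oamtf * n_unannotated


# Hypothetical cluster of 40 genes with two placeholder domains.
counts = {"domain_A": 36, "domain_B": 5}
probabilities = {"domain_A": 0.9, "domain_B": 0.1}
c_n = cluster_confidence(counts, 40, probabilities)
```

In this hypothetical cluster only `domain_A` passes the inclusion rule, so the rare `domain_B` does not dilute the score; summing `adjusted_unannotated` over clusters and adding the curated TFs would give an overall estimate in the spirit of the one reported in the text.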
The homolog-based analysis generated an estimate of 1,510 DBD TFs. To support our DBD homology-based count analysis, we developed PERL scripts to query the mouse Ensembl mus_musculus_core_47_37 and ensembl_mart_47 databases for extraction of predicted DNA-binding TFs using the identified PFAM DBDs in TFCat. This extraction produced a total of 1,507 Ensembl mouse genes (1,416 records supported by Mouse Genome Informatics (MGI); 23 RefSeq and Entrez Gene sourced records; 29 Uniprot/SPTREML predicted genes; and 39 Ensembl predicted gene models).

Website download access, wiki publication and annotation feedback

The MediaWiki software was used to implement the TFCatWiki, with some modifications and additions made to the base software code and configuration files. We included the Semantic MediaWiki [66] extension to facilitate access and searching. Each article page contains the annotation information for one gene and has been configured to disallow edits, while all associated discussion pages remain open for contribution. Software was developed to extract data from the TFCat wiki database to create the wiki pages.

We implemented a feedback tracking function using the MantisBT software system [67], a well-established, open-source, issue monitoring system, to accommodate tracking and follow-up management of TFCat feedback contributions. PHP interfaces and software were developed to populate MediaWiki user information to the feedback system and provide direct query access to feedback records by gene. We also integrated new data update flagging mechanisms into our internally available TFCat annotation software tool to identify new or modified gene annotation information that requires re-population to the gene wiki page.

The MediaWiki software includes a Watch function, which issues individual e-mails when information is changed on a wiki page by a wiki user. We developed an e-mail feature that optionally provides lists of wiki pages that have been changed via the backend auto-update process. To enable this feature, we developed an external PHP program (MediaWiki) hook and an associated MySQL database table to solicit user entry and capture of desired e-mail parameter options and notification frequency. An e-mail notification process was developed that issues e-mails for wiki content updates based on user-selected parameters.

Abbreviations

DBD: DNA-binding domain; DBDdb: DBD Transcription Factor Database; Fox: Forkhead transcription factor family; GO: Gene Ontology; HMG: high mobility group; HMM: hidden Markov model; NFI: nuclear factor I; OAMTF: observed approximate mean TF; PDB: Protein Data Bank; QA: quality assurance; Sox: SRY-related HMG-box transcription factor family; TF: transcription factor; TFC: transcription factor candidate; UPTF: union of putative TFs.

Authors' contributions

Initial putative TF datasets were created by JR (dataset I), DLF (dataset II), SS (dataset III), and GB (dataset IV). SS created the merged dataset and performed an NCBI mapping for dataset IV. DLF designed, implemented, and populated the centralized TFCat database and annotation website tool. SS provided some text data extractions for the TFCat database. RS and DLF precurated the unified dataset. JR, RS, DLF, SS, GB, TH, and WWW acted as the core group of gene annotators. DLF performed the TF reference collection comparisons. Annotation audits were performed by DLF, WWW, RS, and SS. DLF established and implemented the structural classification mapping methodology and performed the analysis of DNA-binding structures to extend the DNA-binding structural classification. DLF devised and implemented the homolog analysis and gene clustering process. DLF, SS, and RS worked on the wiki gene page format. DLF designed, developed and implemented the wiki. DLF developed and implemented the website TFCat data download portal. WWW, JR, and RS provided co-supervision for this project, with the implementation led by DLF. DLF wrote the draft of the manuscript, with further modifications and edits contributed by WWW, RS, JR and SS. All authors read and approved the final manuscript.

Additional data files

The following additional data is available with the online version of this paper: a PDF that includes Tables S1-S16 and Figures S1-S8 (Additional data file 1).

Additional data file 1: Tables S1-S16 and Figures S1-S8. Table S1: gene annotation judgment summary counts from the quality assurance assessment process. Table S2: quality assurance gene pair judgment annotations. Table S3: independent annotations of TFs when the same PubMed evidence was used. Table S4: PFAM and Superfamily group model DBD predictions for the annotated TF genes. Table S5: DNA-binding classification extensions added to the Luscombe et al. [15] classification system. Table S6: protein class counts of genes predicted to contain multiple instances of the same DBD. Table S7: DNA-binding TFs predicted to contain two different DNA-binding classes. Table S8: DNA-binding TFs that do not contain a detected DBD. Table S9: DNA-binding TFs with no detected protein family-level model. Table S10: single-stranded DNA-binding TFs. Table S11: summary of the counts enumerated in the TFCat comparison with GO. Table S12: summary of the counts enumerated in the comparison of TFCat classified HMM DBDs with the DBD database (DBDdb). Table S13: Superfamily comparisons with DBDdb. Table S14: DBD comparisons with DBDdb. Table S15: Fox family test set genes. Table S16: Sox family test set genes. Figure S1: Venn diagram of the overlap of the four initial TF datasets. Figure S2: plots for the analysis of cluster pruning methods using the Fox test set. Figure S3: plots for the analysis of cluster pruning methods using the Sox test set. Figure S4: TFCat annotation workflow implementation.
Figure S5: screenshots of the backend web-based TFCat annotation tool. Figure S6: Fox-containing cluster membership for the evaluated cluster pruning methods. Figure S7: Sox-containing cluster membership for the evaluated cluster pruning methods. Figure S8: an example of a pruned Fox-containing cluster generated using the $I'_s$ value threshold of 0.21.

Acknowledgements

We are grateful for the annotations provided by A Ticoll, E Portales-Casamar, M Swanson, S Lithwick, W Cheung, SJ Ho Sui, D Martin, T Kwon, and A Chou. We would like to thank T Siggers for his helpful review of a preliminary version of the DNA-binding classification extension work. WWW is a Canadian Institutes of Health Research (CIHR) New Investigator and Michael Smith Foundation for Health Research (MSFHR) Scholar and his research is supported by a CIHR grant. JR received support from the National Institute of Allergy and Infectious Diseases (National Institutes of Health) and the Institute for Systems Biology. RS is a Chercheur-boursier of the Fonds de la recherche en santé du Québec and receives operating funds from CIHR to support his research. DLF is supported by a CIHR Canada Graduate Scholarship Doctoral Award and MSFHR Graduate Scholarship award. SS is supported by The McGill University Health Centre Research Institute and the Ontario Institute for Cancer Research. GB is supported by a CIHR Post doctoral fellowship and TRH is supported by a CIHR Grant. Computer hardware resources utilized for this project were supported by the Gene Regulation Bioinformatics Laboratory funded by Canada Foundation for Innovation.

References

1. Garvie CW, Wolberger C: Recognition of specific DNA sequences. Mol Cell 2001, 8:937-946.
2. Halford SE, Marko JF: How do site-specific DNA-binding proteins find their targets? Nucleic Acids Res 2004, 32:3040-3052.
3. Rescan PY: Regulation and functions of myogenic regulatory factors in lower vertebrates. Comp Biochem Physiol B Biochem Mol Biol 2001, 130:1-12.
4.
Rosenfeld MG, Lunyak VV, Glass CK: Sensors and signals: a coac-tivator/corepressor/epigenetic code for integrating signal-dependent programs of transcriptional response.  Genes Dev2006, 20:1405-1428.5. Latchman DS: Eukaryotic Transcription Factors London, San Diego, CA:Elsevier Academic Press; 2004. 6. Locker J: Transcription Factors Oxford, San Diego, CA: Bios, AcademicPress; 2001. 7. Babu MM, Luscombe NM, Aravind L, Gerstein M, Teichmann SA:Structure and evolution of transcriptional regulatory net-works.  Curr Opin Struct Biol 2004, 14:283-291.8. Gray PA, Fu H, Luo P, Zhao Q, Yu J, Ferrari A, Tenzen T, Yuk DI,Tsung EF, Cai Z, Alberta JA, Cheng LP, Liu Y, Stenman JM, ValeriusMT, Billings N, Kim HA, Greenberg ME, McMahon AP, Rowitch DH,Stiles CD, Ma Q: Mouse brain organization revealed throughdirect genome-scale TF expression analysis.  Science 2004,306:2255-2257.9. Huntley S, Baggott DM, Hamilton AT, Tran-Gyamfi M, Yang S, Kim J,Gordon L, Branscomb E, Stubbs L: A comprehensive catalog ofhuman KRAB-associated zinc finger genes: insights into theevolutionary history of a large family of transcriptionalrepressors.  Genome Res 2006, 16:669-677.10. Messina DN, Glasscock J, Gish W, Lovett M: An ORFeome-basedanalysis of human transcription factor genes and the con-struction of a microarray to interrogate their expression.Genome Res 2004, 14:2041-2047.11. Bateman A, Birney E, Durbin R, Eddy SR, Finn RD, Sonnhammer EL:Pfam 3.1: 1313 multiple alignments and profile HMMs matchthe majority of proteins.  Nucleic Acids Res 1999, 27:260-262.12. Gough J: The SUPERFAMILY database in structural genom-ics.  Acta Crystallogr D Biol Crystallogr 2002, 58:1897-1900.13. Adryan B, Teichmann SA: FlyTF: a systematic review of site-spe-cific transcription factors in the fruit fly Drosophila mela-nogaster.  Bioinformatics 2006, 22:1532-1533.14. Xi H, Shulha HP, Lin JM, Vales TR, Fu Y, Bodine DM, McKay RD, Che-15. 
Luscombe NM, Austin SE, Berman HM, Thornton JM: An overviewof the structures of protein-DNA complexes.  Genome Biol2000, 1:REVIEWS001.16. Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K,Chetvernin V, Church DM, DiCuccio M, Edgar R, Federhen S, GeerLY, Kapustin Y, Khovayko O, Landsman D, Lipman DJ, Madden TL,Maglott DR, Ostell J, Miller V, Pruitt KD, Schuler GD, Sequeira E,Sherry ST, Sirotkin K, Souvorov A, Starchenko G, Tatusov RL, Tatus-ova TA, Wagner L, Yaschenko E: Database resources of theNational Center for Biotechnology Information.  Nucleic AcidsRes 2007, 35:D5-12.17. Eppig JT, Blake JA, Bult CJ, Kadin JA, Richardson JE: The mousegenome database (MGD): new features facilitating a modelsystem.  Nucleic Acids Res 2007, 35:D630-637.18. Coulier F, Popovici C, Villet R, Birnbaum D: MetaHox gene clus-ters.  J Exp Zool 2000, 288:345-351.19. Marchler-Bauer A, Anderson JB, Derbyshire MK, DeWeese-Scott C,Gonzales NR, Gwadz M, Hao L, He S, Hurwitz DI, Jackson JD, Ke Z,Krylov D, Lanczycki CJ, Liebert CA, Liu C, Lu F, Lu S, Marchler GH,Mullokandov M, Song JS, Thanki N, Yamashita RA, Yin JJ, Zhang D,Bryant SH: CDD: a conserved domain database for interactivedomain family analysis.  Nucleic Acids Res 2007, 35:D237-240.20. Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structuralclassification of proteins database for the investigation ofsequences and structures.  J Mol Biol 1995, 247:536-540.21. Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, ThorntonJM: CATH-a hierarchic classification of protein domain struc-tures.  Structure 1997, 5:1093-1108.22. Bernstein FC, Koetzle TF, Williams GJ, Meyer EF Jr, Brice MD, Rodg-ers JR, Kennard O, Shimanouchi T, Tasumi M: The Protein DataBank: a computer-based archival file for macromolecularstructures.  J Mol Biol 1977, 112:535-542.23. Laity JH, Lee BM, Wright PE: Zinc finger proteins: new insightsinto structural and functional diversity.  Curr Opin Struct Biol2001, 11:39-46.24. 
Nagadoi A, Nakazawa K, Uda H, Okuno K, Maekawa T, Ishii S,Nishimura Y: Solution structure of the transactivation domainof ATF-2 comprising a zinc finger-like subdomain and a flex-ible subdomain.  J Mol Biol 1999, 287:593-607.25. Stefancsik R, Sarkar S: Relationship between the DNA bindingdomains of SMAD and NFI/CTF transcription factors definesa new superfamily of genes.  DNA Sequence 2003, 14:233-239.26. Rekdal C, Sjottem E, Johansen T: The nuclear factor SPBP con-tains different functional domains and stimulates the activityof various transcriptional activators.  J Biol Chem 2000,275:40288-40300.27. Horn G, Hofweber R, Kremer W, Kalbitzer HR: Structure andfunction of bacterial cold shock proteins.  Cell Mol Life Sci 2007,64:1457-1470.28. Swamynathan SK, Nambiar A, Guntaka RV: Role of single-strandedDNA regions and Y-box proteins in transcriptional regula-tion of viral and cellular genes.  FASEB J 1998, 12:515-522.29. Gasperowicz M, Otto F: Mammalian Groucho homologs:redundancy or specificity?  J Cell Biochem 2005, 95:670-687.30. Hamilton AT, Huntley S, Tran-Gyamfi M, Baggott DM, Gordon L,Stubbs L: Evolutionary expansion and divergence in theZNF91 subfamily of primate-specific zinc finger genes.Genome Res 2006, 16:584-594.31. Lemons D, McGinnis W: Genomic evolution of Hox gene clus-ters.  Science 2006, 313:1918-1922.32. Rost B: Twilight zone of protein sequence alignments.  ProteinEng 1999, 12:85-94.33. Liu J, Rost B: Domains, motifs and clusters in the protein uni-verse.  Curr Opin Chem Biol 2003, 7:5-11.34. Li WH, Gu Z, Wang H, Nekrutenko A: Evolutionary analyses ofthe human genome.  Nature 2001, 409:847-849.35. Malumbres M, Barbacid M: Mammalian cyclin-dependentkinases.  Trends Biochem Sci 2005, 30:630-641.36. 
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM,Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M,Rubin GM, Sherlock G: Gene ontology: tool for the unificationof biology. The Gene Ontology Consortium.  Nat Genet 2000,25:25-29.Genome Biology 2009, 10:R29noweth JG, Tesar PJ, Furey TS, Ren B, Weng Z, Crawford GE: Iden-tification and characterization of cell type-specific andubiquitous chromatin regulatory structures in the humangenome.  PLoS Genet 2007, 3:e136.37. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J,Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, HarrisK, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P,McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J,29.14 Genome Biology 2009,     Volume 10, Issue 3, Article R29       Fulton et al. RRaymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, et al.: Initialsequencing and analysis of the human genome.  Nature 2001,409:860-921.38. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, SmithHO, Yandell M, Evans CA, Holt RA, Gocayne JD, Amanatides P,Ballew RM, Huson DH, Wortman JR, Zhang Q, Kodira CD, ZhengXH, Chen L, Skupski M, Subramanian G, Thomas PD, Zhang J, GaborMiklos GL, Nelson C, Broder S, Clark AG, Nadeau J, McKusick VA,Zinder N, et al.: The sequence of the human genome.  Science2001, 291:1304-1351.39. TFCat Portal Resource   []40. Harrison SC: A structural taxonomy of DNA-binding domains.Nature 1991, 353:715-719.41. Lilley DMJ: DNA-Protein: Structural Interactions Oxford: IRL Press atOxford University Press; 1995. 42. Kummerfeld SK, Teichmann SA: DBD: a transcription factor pre-diction database.  Nucleic Acids Res 2006, 34:D74-81.43. Han J, Gong P, Reddig K, Mitra M, Guo P, Li HS: The fly CAMTAtranscription factor potentiates deactivation of rhodopsin, aG protein-coupled light receptor.  Cell 2006, 127:847-858.44. 
Finkler A, Ashery-Padan R, Fromm H: CAMTAs: calmodulin-bind-ing transcription activators from plants to human.  FEBS Lett2007, 581:3893-3898.45. Choi MY, Romer AI, Hu M, Lepourcelet M, Mechoor A, Yesilaltay A,Krieger M, Gray PA, Shivdasani RA: A dynamic expression surveyidentifies transcription factors relevant in mouse digestivetract development.  Development 2006, 133:4119-4129.46. Kong YM, Macdonald RJ, Wen X, Yang P, Barbera VM, Swift GH: Acomprehensive survey of DNA-binding transcription factorgene expression in human fetal and adult organs.  Gene Expres-sion Patterns 2006, 6:678-686.47. Coulson RM, Ouzounis CA: The phylogenetic diversity ofeukaryotic transcription.  Nucleic Acids Res 2003, 31:653-660.48. Lee AP, Yang Y, Brenner S, Venkatesh B: TFCONES: a databaseof vertebrate transcription factor-encoding genes and theirassociated conserved noncoding elements.  BMC Genomics2007, 8:441.49. Takahashi K, Tanabe K, Ohnuki M, Narita M, Ichisaka T, Tomoda K,Yamanaka S: Induction of pluripotent stem cells from adulthuman fibroblasts by defined factors.  Cell 2007, 131:861-872.50. Yu J, Vodyanik MA, Smuga-Otto K, Antosiewicz-Bourget J, Frane JL,Tian S, Nie J, Jonsdottir GA, Ruotti V, Stewart R, Slukvin II, ThomsonJA: Induced pluripotent stem cell lines derived from humansomatic cells.  Science 2007, 318:1917-1920.51. Roach JC, Smith KD, Strobe KL, Nissen SM, Haudenschild CD, ZhouD, Vasicek TJ, Held GA, Stolovitzky GA, Hood LE, Aderem A: Tran-scription factor expression in lipopolysaccharide-activatedperipheral-blood-derived mononuclear cells.  Proc Natl Acad SciUSA 2007, 104:16245-16250.52. Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA:Online Mendelian Inheritance in Man (OMIM), a knowledge-base of human genes and genetic disorders.  Nucleic Acids Res2005, 33:D514-517.53. 
Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A,Gasteiger E, Martin MJ, Michoud K, O'Donovan C, Phan I, Pilbout S,Schneider M: The SWISS-PROT protein knowledgebase andits supplement TrEMBL in 2003.  Nucleic Acids Res 2003,31:365-370.54. Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D,Bradley P, Bork P, Bucher P, Cerutti L, Copley R, Courcelle E, Das U,Durbin R, Fleischmann W, Gough J, Haft D, Harte N, Hulo N, KahnD, Kanapin A, Krestyaninova M, Lonsdale D, Lopez R, Letunic I, Mad-era M, Maslen J, McDowall J, Mitchell A, Nikolskaya AN, et al.: Inter-Pro, progress and status in 2005.  Nucleic Acids Res 2005,33:D201-205.55. Sandelin A, Alkema W, Engstrom P, Wasserman WW, Lenhard B:JASPAR: an open-access database for eukaryotic transcrip-tion factor binding profiles.  Nucleic Acids Res 2004, 32:D91-94.56. Hubbard T, Andrews D, Caccamo M, Cameron G, Chen Y, Clamp M,Clarke L, Coates G, Cox T, Cunningham F, Curwen V, Cutts T, DownT, Durbin R, Fernandez-Suarez XM, Gilbert J, Hammond M, HerreroJ, Hotz H, Howe K, Iyer V, Jekosch K, Kahari A, Kasprzyk A, Keefe D,Keenan S, Kokocinsci F, London D, Longden I, McVicker G, et al.:Ensembl 2005.  Nucleic Acids Res 2005, 33:D447-453.Ponting CP, Bork P: SMART 4.0: towards genomic data integra-tion.  Nucleic Acids Res 2004, 32:D142-144.59. HMMER - Profile HMM Software for Protein Sequence Anal-ysis   []60. The International Regulome Consortium   []61. Gene Ontology Annotation (GOA) Database   []62. Krissinel E, Henrick K: Secondary-structure matching (SSM), anew tool for fast protein structure alignment in three dimen-sions.  Acta Crystallogr D Biol Crystallogr 2004, 60:2256-2268.63. William D, Herbert E: Investigation of proportional link linkageclustering methods.  J Classification 1985, 2:239-254.64. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic localalignment search tool.  J Mol Biol 1990, 215:403-410.65. 
Pruitt KD, Tatusova T, Maglott DR: NCBI reference sequences(RefSeq): a curated non-redundant sequence database ofgenomes, transcripts and proteins.  Nucleic Acids Res 2007,35:D61-65.66. Markus Krötzscha DV, Völkelb Max, Hallerb Heiko, Studer Rudi:Semantic Wikipedia.  J Web Semantics 2007, 5:251-261.67. MantisBT Issue Tracking Software   []68. Huntley S, Baggott DM, Hamilton AT, Tran-Gyamfi M, Yang S, Kim J,Gordon L, Branscomb E, Stubbs L: A comprehensive catalog ofhuman KRAB-associated zinc finger genes: insights into theevolutionary history of a large family of transcriptionalrepressors.  Genome Res 2006, 16:669-677.69. Human KZNF Gene Catalog   []70. Ryu T, Jung J, Lee S, Nam HJ, Hong SW, Yoo JW, Lee DK, Lee D:bZIPDB: a database of regulatory information for humanbZIP transcription factors.  BMC Genomics 2007, 8:136.71. bZIPDB - Database of bZIP Transcription Factors   []72. FlyTF - The Drosophila Transcription Factor Database[]73. Reece-Hoyes JS, Deplancke B, Shingles J, Grove CA, Hope IA, Wal-hout AJ: A compendium of Caenorhabditis elegans regulatorytranscription factors: a resource for mapping transcriptionregulatory networks.  Genome Biol 2005, 6:R110.74. A Collection of Predicted C. elegans Transcription Factors[]Genome Biology 2009, 10:R2957. International Regulome Consortium Mouse GenomeProject: Mouse Gene List   []58. Letunic I, Copley RR, Schmidt S, Ciccarelli FD, Doerks T, Schultz J,

