The Gene Set Builder: collation, curation, and distribution of sets of genes
Yusuf, Dimas; Lim, Jonathan S; Wasserman, Wyeth W
Dec 21, 2005

Full Text

BMC Bioinformatics (BioMed Central), Open Access

Database

The Gene Set Builder: collation, curation, and distribution of sets of genes

Dimas Yusuf1, Jonathan S Lim1 and Wyeth W Wasserman*1,2

Address: 1Centre for Molecular Medicine and Therapeutics (CMMT), Child & Family Research Institute, Vancouver, Canada and 2Department of Medical Genetics, University of British Columbia, Vancouver, Canada

* Corresponding author

Abstract

Background: In bioinformatics and genomics, there are many applications designed to investigate the common properties of a set of genes. Often, these multi-gene analysis tools attempt to reveal sequential, functional, and expressional ties. However, while tremendous effort has been invested in developing tools that can analyze a set of genes, minimal effort has been invested in developing tools that can help researchers compile, store, and annotate gene sets in the first place. As a result, the process of making or accessing a set often involves tedious and time-consuming steps such as finding identifiers for each individual gene. These steps are often repeated extensively to shift from one identifier type to another, or to recreate a published set. In this paper, we present a simple online tool which, with the help of the gene catalogs Ensembl and GeneLynx, can help researchers build and annotate sets of genes quickly and easily.

Description: The Gene Set Builder is a database-driven, web-based tool designed to help researchers compile, store, export, and share sets of genes. This application supports the 17 eukaryotic genomes found in version 32 of the Ensembl database, which includes species from yeast to human. User-created information such as sets and customized annotations are stored to facilitate easy access. Gene sets stored in the system can be "exported" in a variety of output formats: as lists of identifiers, in tables, or as sequences.
In addition, gene sets can be "shared" with specific users to facilitate collaborations or fully released to provide access to published results. The application also features a Perl API (Application Programming Interface) for direct connectivity to custom analysis tools. A downloadable Quick Reference guide and an online tutorial are available to help new users learn its functionalities.

Conclusion: The Gene Set Builder is an Ensembl-facilitated online tool designed to help researchers compile and manage sets of genes in a user-friendly environment. The application can be accessed online.

Published: 21 December 2005
BMC Bioinformatics 2005, 6:305 doi:10.1186/1471-2105-6-305
Received: 08 August 2005
Accepted: 21 December 2005

© 2005 Yusuf et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

Grouping genes into "sets" has become an intuitive and commonplace practice in bioinformatics and genomics research. Many bioinformatics applications can analyze sequential, structural, functional, and expressional ties between genes in a given set. For instance, the oPOSSUM system can identify over-represented transcription-factor binding sites in a group of co-expressed genes [1]. Similarly, GOToolBox can identify Gene Ontology terms which are over-represented in the annotations of a set of genes [2].
In short, a new generation of analysis methods requires sets of genes as inputs.

Despite an abundance of these multi-gene investigative tools, to our knowledge, no published tools exist which help researchers compile, store, and share sets of genes. Consequently, researchers often revert to the time-tested method of copying and pasting gene identifiers and annotations into a spreadsheet or a text file. While this technique may be convenient for building small sets of genes, it becomes burdensome for large or shared collections.

In this paper, we present the Gene Set Builder, a web-based system designed to help researchers quickly build, sort, and annotate sets of genes in a user-friendly environment. This application features a "point and click" interface that lets users search and import genes in batches; synchronize missing and outdated gene annotations with currently available information; compile and export gene sets as FASTA sequences, cDNA transcripts, tables, or as lists of identifiers; share data with other users; and create sets of homologs to facilitate comparative studies across species.

Figure 1. A screen capture of Gene Set Builder. This "special edition" user interface utilizes a Flash-based navigation system, complete with animation and tool tips.

Construction and content

Code
Gene Set Builder is written in the Perl programming language. The Perl backend uses several third-party modules including CGI, DBI, DBD-mysql, and the GeneLynx API [3]. Components of BioPerl [4] are used to access genomic and cDNA sequences. Similar to other web-based applications, the Perl scripts are executed through CGI to generate an HTML-based user interface.

Database
User-created information is maintained in a password-protected MySQL database.
An outline of the database structure is shown in Additional file 1.

User interface
Driven by HTML, JavaScript, and Macromedia Flash, Gene Set Builder's interface is designed to be intuitive, flexible, and graphically rich. It features a navigation system with three main categories: "Genes", "Sets", and "Shared" (Figure 1). When users click on a category, a list of relevant functions is displayed. For instance, clicking on the "Genes" category will display the "Add genes to set", "Import a list of genes", "Search", and "Synchronize" functions. An HTML-based navigation system ("classic interface") is available to accommodate web browsers without the Flash plug-in.

API
A Perl Application Programming Interface has been developed to help advanced users retrieve data directly from the Gene Set Builder database. This API can obtain gene and set annotations including names, symbols, descriptions, comments, confidence ratings, and identifiers.

Utility

Here we discuss the use of Gene Set Builder: building and sharing gene sets, data annotation, exporting, and using the API.

Tutorial
We have created a number of resources to help new users learn how to use Gene Set Builder. On the homepage, one can access multimedia walkthroughs of the system's essential features and download a Quick Reference guide.

Building and sharing a set of genes
Genes can be imported to the "Add genes to set" staging area in two ways: (1) users can search for genes individually via search engines which access BioMART [5] and GeneLynx, or (2) users can input gene or protein identifiers from diverse resources. For convenience, this mass import tool accepts gene symbols (e.g. HUGO-approved symbols [6]), Affymetrix gene identifiers, or accession numbers from Entrez Gene [7], Ensembl, UniProt [8], Swiss-Prot [9] or RefSeq [10]. From the genes returned to the staging area, users can select specific genes for inclusion in new or existing sets.

Sets can be created from other sets as well. The "Create a homolog set" tool can be used to generate a set of homologous genes from an existing single or multi-species set. This feature is based on Ensembl homology annotations. In addition, users can make copies of sets.

The attributes and contents of a gene set can be modified: users can add or remove genes from sets at any time; sets can be commented, shared, "unshared", renamed, deleted, rendered accessible via the API, or protected from accidental changes via a "lock" feature. Users can also make their gene sets available for public access via the "share set" feature. While visitors can view, export, and copy a shared set, only the set's owner has the privilege to change, delete, or withdraw the set from the public domain. For convenience and testing purposes, Gene Set Builder is preloaded with widely-used sets, such as the ESR Dataset [11] and the S. cerevisiae Cell Cycle-regulated Gene Set [12].

Custom annotations and data management
Gene Set Builder can retrieve UniProt, Entrez Gene, RefSeq, and GeneLynx identifiers via BioMART annotations. This task is mediated by the "Synchronize" feature, which can be found under "Genes" in the menu. Users may annotate the confidence of a gene's membership in a set via a 5-point scale displayed as a column of star icons. Users can also attach comments to genes in a general and set-specific context. In addition to comments and confidence ratings, we have included search functions to locate or eliminate genes in the workspace by species or keyword. This search engine supports Boolean operators such as AND, OR, and NOT. Users can tag genes and sets so they can be easily retrieved in the future.

Exporting
To accommodate analysis tools which accept gene identifiers as inputs, such as oPOSSUM and GoMiner, a set of genes can be exported as a list of Entrez Gene, Ensembl, RefSeq, UniProt, or GeneLynx identifiers. When exporting as a list, the user can exclude genes based on their confidence ratings and/or species. FASTA-formatted sequences can also be created, with the option to specify upstream and downstream flanking basepairs for regulatory sequence analysis. Gene Set Builder can also generate a table populated with gene identifiers and descriptions, which the user can save and open with a word processing or spreadsheet application.

Using the API
The API-enabling feature in Gene Set Builder is treated as an Export function which copies the desired gene set into the "open-access" portion of the database. Data stored in this area can be retrieved via the Perl API or a MySQL client. When exporting sets in this format, users can choose to divulge only specific gene identifiers and annotation components, as the data will become accessible to other API users. Developers of online services may use the API to allow users to directly submit their sets for analysis.

Discussion

Similar tools
To our knowledge, no tool in bioinformatics exists in isolation with the unique function of helping users build and manage sets of genes. Although the Gene Set Builder system shares similar properties with other multi-gene tools such as the Sequence Retrieval System (SRS) [13], SeqHound [14], and WebGestalt [15], it does not share the same fundamental concept, nor does it fit into the same categorical niche. Gene Set Builder's primary role is to help users build, annotate, and import gene sets in detail. SRS and SeqHound are focused more towards the computational aspects of working with a set of genes, not the long-term management and sharing of sets. WebGestalt provides users with an array of set analysis functions, but it does not facilitate the creation, maintenance, and sharing of sets.
We are exploring mechanisms to directly submit Gene Set Builder sets to other tools such as WebGestalt.

Benefits
Gene Set Builder offers users three major benefits: (1) it can help users annotate a pre-existing set of genes through synchronization with the Ensembl database to obtain alternate identifiers and descriptions; (2) it can aid in collaborative efforts by allowing team members to store gene sets in a central location where they can be easily accessed; and (3) it can be used as an aid for publication by allowing users to share their sets of genes with the community at large.

Most importantly, Gene Set Builder facilitates the storage of gene sets in a relational database as opposed to a text file, while offering a friendly environment that automates time-consuming tasks. The application's searching, annotating, and sharing features give users flexibility and convenience. Thus, users benefit from access to curated sets provided by other users, from the capacity to build sets in collaboration with others, and from the ability to shift from one set of identifiers to another.

Limitations
While Gene Set Builder offers advantages to users, it does have several technical limitations which we hope to address in the future. One limitation involves Gene Set Builder's reliance upon Ensembl and GeneLynx for annotation data. Due to this dependency, users cannot build sets of prokaryotic or viral genes, nor include genes from non-supported eukaryotic organisms. Ideally, Gene Set Builder would eventually interact with systems such as Entrez Gene, the UCSC Genome Browser [16], and the Comprehensive Microbial Resource at TIGR [17]. As similar API resources emerge or mature for these systems, we will work to expand Gene Set Builder's compatibility.

While Gene Set Builder is fully compatible with recent versions of the Internet Explorer and Firefox web browsers, it renders inconsistently when viewed in older releases of Netscape and Safari. These problems stem from insufficient JavaScript support. To overcome this difficulty, we have implemented a "safe mode" state which can be toggled to increase usability when the system is being accessed via a less compatible browser.

Conclusion

To our knowledge, the creation of general purpose gene set building tools has remained virtually unexplored. Gene Set Builder is our vision of what an application of this type can provide. It fulfils the needs of users interested in forming, annotating, sharing and exporting sets of genes.

Availability and requirements

The Gene Set Builder can be accessed online, and it is available without charge to all users. A guest account is available for those who are interested in testing the system. We recommend a monitor screen resolution of at least 800 by 600 pixels (1024 by 768 pixels is preferred), in thousands of colours, and a recent web browser with JavaScript and the Macromedia Flash 7 plug-in installed and enabled.

List of abbreviations used

API: Application Programming Interface; CSS: Cascading Style Sheets; DBI: Database interface; FASTA: Fast-All; GSB: Gene Set Builder; HTML: Hypertext Markup Language; UCSC: University of California at Santa Cruz; UI: User Interface.

Authors' contributions

WWW conceptualized the idea and directed the development process. JSL programmed components and suggested approaches. DY designed the user interface, developed the software, and wrote the manuscript with revisions provided by JSL and WWW.

Acknowledgements

We thank our lab colleagues for their assistance during the development and testing of the Gene Set Builder: David Arenillas and Carol Huang for technical help; Elodie Portales-Cassamar and Shannan J. Ho Sui for testing and feedback.

This project was financially supported by funding from Merck Frosst and the Canadian Institutes of Health Research (CIHR). D.Y.'s work was initially funded as a Mini Med School high school scholar by the Children and Family Research Institute (CFRI); W.W.W. is a CIHR New Investigator and a Scholar of the Michael Smith Foundation for Health Research (MSFHR).

References

1. Ho Sui SJ, Mortimer JR, Arenillas DJ, Brumm J, Walsh CJ, Kennedy BP, Wasserman WW: oPOSSUM: identification of over-represented transcription factor binding sites in co-expressed genes. Nucleic Acids Res 2005, 33(10):3154-64.
2. Martin D, Brun C, Remy E, Mouren P, Thieffry D, Jacq B: GOToolBox: functional analysis of gene datasets based on Gene Ontology. Genome Biol 2004, 5(12):R101.
3. Lenhard B, Hayes WS, Wasserman WW: GeneLynx: a gene-centric portal to the human genome. Genome Research 2001, 11(12):2151-2157.
4. Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JG, Korf I, Lapp H, Lehvaslaiho H, Matsalla C, Mungall CJ, Osborne BI, Pocock MR, Schattner P, Senger M, Stein LD, Stupka E, Wilkinson MD, Birney E: The Bioperl Toolkit: Perl Modules for the Life Sciences. Genome Research 2002, 12:1611-1618.
5. Hubbard T, Andrews D, Caccamo M, Cameron G, Chen Y, Clamp M, Clarke L, Coates G, Cox T, Cunningham F, Curwen V, Cutts T, Down T, Durbin R, Fernandez-Suarez XM, Gilbert J, Hammond M, Herrero J, Hotz H, Howe K, Iyer V, Jekosch K, Kahari A, Kasprzyk A, Keefe D, Keenan S, Kokocinsci F, London D, Longden I, McVicker G, Melsopp C, Meidl P, Potter S, Proctor G, Rae M, Rios D, Schuster M, Searle S, Severin J, Slater G, Smedley D, Smith J, Spooner W, Stabenau A, Stalker J, Storey R, Trevanion S, Ureta-Vidal A, Vogel J, White S, Woodwark C, Birney E: Ensembl 2005. Nucleic Acids Res 2005:D447-D453.
6. Povey S, Lovering R, Bruford E, Wright M, Lush M, Wain H: The HUGO Gene Nomenclature Committee (HGNC). Human Genetics 2001, 109(6):678-680.
7. Maglott D, Ostell J, Pruitt KD, Tatusova T: Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res 2005, 33:D54-D58.
8. Bairoch A, Apweiler R, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA, O'Donovan C, Redaschi N, Yeh LS: The Universal Protein Resource (UniProt). Nucleic Acids Res 2005, 33:D154-D159.
9. Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O'Donovan C, Phan I, Pilbout S, Schneider M: The Swiss-Prot protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res 2003, 31:365-370.
10. Pruitt KD, Tatusova T, Maglott DR: NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 2005, 33(Database):D501-D504.
11. Gasch AP, Spellman PT, Kao CM, Carmel-Harel O, Eisen MB, Storz G, Botstein D, Brown PO: Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell 2000, 11(12):4241-4257.
12. Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 1998, 9(12):3273-3297.
13. Veldhoven A, de Lange D, Smid M, de Jager V, Kors JA, Jenster G: Storing, linking, and mining microarray databases using SRS. BMC Bioinformatics 2005, 6:192.
14. Michalickova K, Bader GD, Dumontier M, Lieu H, Betel D, Isserlin R, Hogue CW: SeqHound: biological sequence and structure database as a platform for bioinformatics research. BMC Bioinformatics 2002, 3:32.
15. Zhang B, Kirov S, Snoddy J: WebGestalt: an integrated system for exploring gene sets in various biological contexts. Nucleic Acids Res 2005, 33(Web Server):W741-8.
16. Karolchik D, Baertsch R, Diekhans M, Furey TS, Hinrichs A, Lu YT, Roskin KM, Schwartz M, Sugnet CW, Thomas DJ, Weber RJ, Haussler D, Kent WJ: The UCSC Genome Browser Database. Nucleic Acids Res 2003, 31(1):51-4.
17. Peterson JD, Umayam LA, Dickinson T, Hickey EK, White O: The Comprehensive Microbial Resource. Nucleic Acids Res 2001, 29(1):123-5.

Additional material

Additional File 1
In the internal portion, gene and set objects are unified by the "Genes in sets" table, which uses multiple key entries to assign genes into sets. This database structure allows gene and set objects to exist independently. The open-access tables, "API Genes" and "API Sets", are set up specifically for API connectivity. Sets of genes exported by the user for API use are copied into these two tables.
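
The database structure described for Additional File 1 (a "Genes in sets" table linking independent gene and set records, plus open-access "API Genes" and "API Sets" tables that receive exported copies) can be illustrated with a small sketch. This is not the authors' schema: every table name, column name, and function below is hypothetical, and SQLite is used in place of MySQL only so that the example runs without a server.

```python
import sqlite3


def build_schema(conn):
    """Create a hypothetical layout: internal tables plus open-access copies."""
    conn.executescript(
        """
        -- Internal portion: genes and sets exist independently of each other.
        CREATE TABLE genes (
            gene_id INTEGER PRIMARY KEY,
            symbol  TEXT NOT NULL,
            species TEXT NOT NULL
        );
        CREATE TABLE sets (
            set_id INTEGER PRIMARY KEY,
            name   TEXT NOT NULL,
            owner  TEXT NOT NULL
        );
        -- "Genes in sets": a composite key assigns genes to sets.
        CREATE TABLE genes_in_sets (
            set_id     INTEGER NOT NULL REFERENCES sets(set_id),
            gene_id    INTEGER NOT NULL REFERENCES genes(gene_id),
            confidence INTEGER CHECK (confidence BETWEEN 1 AND 5),
            PRIMARY KEY (set_id, gene_id)
        );
        -- Open-access portion: only exported copies live here.
        CREATE TABLE api_sets (
            set_id INTEGER PRIMARY KEY,
            name   TEXT NOT NULL
        );
        CREATE TABLE api_genes (
            set_id  INTEGER NOT NULL,
            gene_id INTEGER NOT NULL,
            symbol  TEXT NOT NULL,
            PRIMARY KEY (set_id, gene_id)
        );
        """
    )


def export_for_api(conn, set_id):
    """Copy one set and its member genes into the open-access tables."""
    conn.execute(
        "INSERT OR REPLACE INTO api_sets "
        "SELECT set_id, name FROM sets WHERE set_id = ?",
        (set_id,),
    )
    # Clear any earlier copy so that repeated exports stay consistent.
    conn.execute("DELETE FROM api_genes WHERE set_id = ?", (set_id,))
    conn.execute(
        "INSERT INTO api_genes (set_id, gene_id, symbol) "
        "SELECT m.set_id, g.gene_id, g.symbol "
        "FROM genes_in_sets AS m JOIN genes AS g ON g.gene_id = m.gene_id "
        "WHERE m.set_id = ?",
        (set_id,),
    )


def genes_in_api_set(conn, set_id):
    """Return the gene symbols visible to API clients for one exported set."""
    rows = conn.execute(
        "SELECT symbol FROM api_genes WHERE set_id = ? ORDER BY symbol",
        (set_id,),
    )
    return [row[0] for row in rows]
```

Because the copy is made at export time, later edits to the internal set do not reach API clients until the set is exported again. That is one plausible reading of "copied into these two tables", not a confirmed behaviour of the real system.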

