UBC Faculty Research and Publications

Ulysses - an application for the projection of molecular interactions across species
Kemmer, Danielle; Huang, Yong; Shah, Sohrab P; Lim, Jonathan; Brumm, Jochen; Yuen, Macaire M; Ling, John; Xu, Tao; Wasserman, Wyeth W; Ouellette, BF F
Dec 2, 2005

Full Text

Open Access
Software
Volume 6, Issue 12, Article R106

Ulysses - an application for the projection of molecular interactions across species

Danielle Kemmer*†, Yong Huang‡, Sohrab P Shah‡¥, Jonathan Lim†, Jochen Brumm†, Macaire MS Yuen‡, John Ling‡, Tao Xu‡, Wyeth W Wasserman†§ and BF Francis Ouellette‡§¶

Addresses: *Center for Genomics and Bioinformatics, Karolinska Institutet, 171 77 Stockholm, Sweden. †Centre for Molecular Medicine and Therapeutics, University of British Columbia, Vancouver V5Z 4H4, BC, Canada. ‡UBC Bioinformatics Centre, University of British Columbia, Vancouver V6T 1Z4, BC, Canada. §Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada. ¶Michael Smith Laboratories, University of British Columbia, Vancouver V6T 1Z4, BC, Canada. ¥Department of Computer Science, University of British Columbia, Vancouver V6T 1Z4, BC, Canada.

Correspondence: Wyeth W Wasserman. E-mail: wyeth@cmmt.ubc.ca

© 2005 Kemmer et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Projecting molecular interactions across species
Ulysses, new software for the parallel analysis and display of protein interactions detected in various species, is described.

Abstract

We developed Ulysses as a user-oriented system that uses a process called Interolog Analysis for the parallel analysis and display of protein interactions detected in various species. Ulysses was designed to perform such Interolog Analysis by the projection of model organism interaction data onto homologous human proteins, and thus serves as an accelerator for the analysis of uncharacterized human proteins.
The relevance of projections was assessed and validated against published reference collections. All source code is freely available, and the Ulysses system can be accessed via a web interface at http://www.cisreg.ca/ulysses.

Published: 2 December 2005
Genome Biology 2005, 6:R106 (doi:10.1186/gb-2005-6-12-r106)
Received: 23 February 2005
Revised: 3 August 2005
Accepted: 8 November 2005
The electronic version of this article is the complete one and can be found online at http://genomebiology.com/2005/6/12/R106

Rationale

The catalogue of human protein-encoding genes is largely enumerated [1], but the task of discerning the functions of these genes remains a formidable challenge. A significant fraction of protein-encoding genes are entirely novel; the cellular roles of the proteins remain a mystery. As model organism genome sequences have been available for several years, a modest compendium of functional genomics data has emerged for these organisms. To capitalize on these data for the functional annotation of human genes, one can project model organism gene properties onto homologous human genes [2]. Although the properties of homologous genes are often predicted based on recorded annotations of genes with similar sequences, such mappings only begin to capitalize on available data.

The increasing body of genomics data allows functions to be predicted using 'Guilt by Association' (GBA) methods. In GBA, the function of a gene is inferred from the functions of genes with which it interacts (for example, protein contact) or parallels (for example, co-expression). Observation of mutually consistent interactions in multiple species improves the predictive performance of GBA methods, a process named Interolog Analysis [2,3]. Early demonstrations of the utility of Interolog Analysis, although limited to the analysis of model organism data, offer promise for the accelerated annotation of human genes.

Prediction of human gene function based on Interolog Analysis requires an underlying set of bioinformatics resources and algorithms to make unified data accessible to the community. First, functional genomics data must be accessible through reference databases. Second, the relationships between homologous genes must be mapped by a suitable comparison procedure. Finally, the relationships must be rendered accessible to the broad community through an intuitive interface. A system incorporating these three components would be a powerful tool for laboratory investigators seeking to capitalize on existing genomics data.

Despite substantial success in sequencing genomes, large-scale functional studies have been reported for only a few common model organisms. Key reports have addressed protein-protein interactions in Saccharomyces cerevisiae [4-6], Drosophila melanogaster [7-9], and Caenorhabditis elegans [10]. In addition to these screens, functional studies have linked genes by tackling such topics as patterns of co-expression [11], genetic interactions [12], and sub-cellular co-localization [13]. The diverse data from the functional studies have been rendered publicly accessible in species-specific repositories [14-16].
Large databases that have emerged to consolidate the diverse functional genomics data include leading examples like the Biomolecular Interaction Network Database (BIND) [17], DIP [18], and MINT [19].

To manage the combination of interaction data and genome annotation, data warehouses have emerged such as EnsMart [20], SeqHound [21], and Atlas [22]. All three examples store heterogeneous biological data in a relational schema, allowing for rapid retrieval using Structured Query Language (SQL) via an integrated application programming interface (API), or via a web graphical user interface.

In order to draw conclusions about human genes from model organism data, it is essential to possess a map enumerating gene homology relationships among species. The fundamental assumption is that direct gene orthologs (genes separated only by speciation) typically occupy the same functional niche [23]. Leading systems such as COGs [24,25] and Inparanoid [26] continue to unravel the complex evolutionary relationships between genes. As shown by these efforts, the stringent demands for orthology mapping are challenging, so it is often more feasible to group homologs. The National Center for Biotechnology Information's (NCBI) HomoloGene [27] provides such a high-throughput map suitable for incorporation into larger analyses that address many organisms. The establishment of evolutionary relationships between genes remains a topic of active investigation.

Figure 1. Interologs mapping of conserved protein networks across multiple species (each plane corresponds to a species). Orthologous proteins are defined and protein interactions identified in each model organism. Virtual human protein networks are generated by projecting the observed interactions across all planes onto homologous human genes. HID, HomoloGene identifier. [Figure labels: Networks; Projections; Human, Fly, Worm, Yeast; HID 1, HID 2, HID 3.]

Biological interpretation of integrated data is greatly aided by tools for visualization of properties. Multiple platforms for the visualization and manipulation of protein interaction networks [28-32] provide users with interfaces to complex interaction data. Interolog Analysis has emerged as a powerful means to predict the function of genes [2,33-36]. Existing Interolog Analysis tools, like the Interolog database [3] and STRING [37], convey information about protein associations across species using databases, homology maps, and simple visualization methods. These visualization tools, however, are restricted to single views that fail to convey the evidence from each species.

We report the construction and assessment of a novel Interolog system for the exploration of human genes based on gene-gene interactions in yeast, fly, and worm (Figure 1). The system displays composite interaction networks composed of protein associations detected in the model organisms. The system unites the Atlas database, HomoloGene mappings, and a new Interolog visualization tool, all accessed via a user-friendly web interface entitled Ulysses [38]. We assessed the performance of the underlying Interolog algorithm against published reference collections of protein interactions, revealing a statistically significant ability to link genes to the correct networks. Redundantly observed gene-gene associations across datasets or species are demonstrated to be remarkably specific. We applied the most accurate parameters to predict human protein interactions and new candidate members for inclusion in known pathways and complexes.

Model organism data to predict human protein interactions

The available pool of curated annotations of protein-protein interactions in reference databases is sparse; only a small subset of the interactome (the complete collection of all functionally relevant protein-protein interactions) is present. The Human Protein Reference Database (HPRD) [39] is the largest curated collection of documented human protein interactions. To assess the relevance of observed interactions between model organism proteins for the prediction of human interactions, we determined the overlap between protein interactions in the HPRD reference dataset and homologous interactions from model organisms represented in BIND [17]. Reflecting the sparse coverage of the interactome, only 80 such interactions were found. The sparse coverage of bona fide protein-protein interactions is problematic for evaluating the performance of predictive methods. Previous studies have assessed the quality of interaction data on the basis of protein interactor pairs sharing the same annotated GO-terms [33,35,40,41] or pathway assignments [42]. While such measures are often supportive of the predictive performance of methods, we believe such criteria suffer from a focus on the strongest and most easily observed interactions.

Table 1. Yeast protein interactions reported in BIND confirmed by co-localization

                   Total     Independently confirmed interactions   Bin match      Exact match
Low-throughput     1,753     565                                    448 (79%)      335 (59%)
High-throughput    54,439    4,485                                  3,464 (77%)    1,096 (24%)

Data were from BIND freeze 20 April 2005. Bin matches refer to protein interactors localizing to the same major cellular compartments (nucleus, cytoplasm, extra-cellular space). Exact matches refer to specific sub-cellular locations captured by GO annotations.

Table 2. Composition of localization bins

Column headings: Cytoplasm (C), Nucleus (N), Extra-cellular (E), Other
Row 1: Cytoplasm, Nucleus, Extra-cellular, Plasma membrane
Row 2: Mitochondria, Nucleolus, Membrane
Row 3: Endoplasmic reticulum, Centrosome
Row 4: Golgi apparatus, Other
Rows 5-9: Lysosome; Endosome; Sarcoplasmic reticulum; Peroxisome; Ribosome

Detailed localization labels from HPRD were used to assign each sub-cellular compartment to at least one of the three major cellular localizations. Localization labels that could not be classified were excluded from the analysis (other).

To gain a broader assessment of the relevance of mapping interactions from model organism proteins onto corresponding homologous human proteins, we elected to apply a compartment-based assessment of the Interolog Analysis. As protein interactions preferentially occur between proteins residing in the same sub-cellular compartment [13,43], interactions between two proteins were considered to be true if both interactors co-localized to the same sub-cellular location. To validate this approach, we analyzed yeast interactions reported in BIND that have an annotated Gene Ontology (GO) [44] localization label. We distinguished between low-throughput (LTP: fewer than 40 interaction records in the same publication, using the same experimental method) and high-throughput (HTP) data and counted interactions supported by at least two independent reports (Table 1). For LTP and HTP experiments, respectively, 79% and 77% of the interactors from the redundantly observed interactions matched major sub-cellular compartments (nucleus, cytoplasm, extra-cellular space), both statistically significant in comparison to background levels. Exact matches to highly specific GO compartments were 59% for LTP and 24% for HTP data. This difference at the specific compartment level reflects the tendency for well-studied genes (those that have been the focus of LTP studies) to be deeply annotated.
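The LTP/HTP split and the redundancy filter described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the record layout `(protein_a, protein_b, publication_id, method)`, the function names, and the reading of "independent reports" as distinct publications are all hypothetical choices made for the example.

```python
from collections import defaultdict

# Threshold from the text: fewer than 40 records per (publication, method) is LTP.
LTP_MAX_RECORDS = 40


def classify_throughput(records):
    """Label each record 'LTP' or 'HTP'.

    records: list of (protein_a, protein_b, publication_id, method) tuples
    (a hypothetical layout chosen for this sketch).
    """
    group_sizes = defaultdict(int)
    for _a, _b, pub, method in records:
        group_sizes[(pub, method)] += 1
    return [
        "LTP" if group_sizes[(pub, method)] < LTP_MAX_RECORDS else "HTP"
        for _a, _b, pub, method in records
    ]


def independently_confirmed(records, min_reports=2):
    """Return unordered protein pairs seen in at least min_reports publications.

    Assumption: two reports count as independent when they come from
    different publications.
    """
    pubs_per_pair = defaultdict(set)
    for a, b, pub, _method in records:
        pubs_per_pair[frozenset((a, b))].add(pub)
    return {pair for pair, pubs in pubs_per_pair.items() if len(pubs) >= min_reports}


# Toy data with made-up protein and publication names.
records = [
    ("A", "B", "pub1", "two-hybrid"),
    ("B", "A", "pub2", "affinity purification"),
    ("A", "C", "pub1", "two-hybrid"),
]
records += [(f"P{i}", f"Q{i}", "pub3", "two-hybrid") for i in range(40)]

labels = classify_throughput(records)
confirmed = independently_confirmed(records)
```

Here the two `pub1` two-hybrid records are labeled LTP, the 40 records from `pub3` are labeled HTP, and only the A-B pair counts as independently confirmed.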
Given the correlation between interaction and general sub-cellular localization of yeast proteins, we adopted the criterion of co-localization to assess the predictive value of Interolog Analysis for the study of human protein interactions.

We mapped all human RefSeq identifiers for proteins in the HPRD database (6,141 proteins) to HomoloGene identifiers (5,308 HomoloGene groups). Each HomoloGene interactor was assigned to one or more cell compartment(s) based on the curated HPRD annotations (Table 2). As a control data set for the rate of co-localization for arbitrary pairs of interactors, we randomly created 60,000 pairings of the HomoloGene groups represented in the HPRD data. HomoloGene identifiers were retrieved for S. cerevisiae, D. melanogaster, and C. elegans proteins reported as interactors in the BIND database. For each model organism interactor mapping to the same HomoloGene as an HPRD human protein, the sub-cellular compartment (as defined by HPRD) was noted (Figure 2). For 28,254 interactions, both interactors were annotated as localizing within at least one cellular compartment (Table 3). In a second step, for each of these pairs, we determined if both protein interactors co-localized to the same cellular location, that is, if they shared at least one cellular compartment. For BIND-reported interacting pairs, co-localization was true for between 75% and 97%, depending on the species and method (Table 4). Compared to the background rate of 66% for the randomly generated pairs of interactors (which reflects the fact that many proteins are annotated with multiple localizations), every category was significantly biased towards co-localization.
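The co-localization criterion (sharing at least one compartment) and the comparison against random pairs can be sketched as follows. The function names and data layout are assumptions made for illustration. The significance test shown is a one-sided Fisher exact test written directly from the hypergeometric distribution; the paper names the Fisher exact test but does not say how it was configured, so the values this sketch returns need not reproduce the p-values reported in Table 4.

```python
from math import comb


def colocalized(compartments_a, compartments_b):
    """True if two proteins share at least one compartment (criterion from the text)."""
    return bool(set(compartments_a) & set(compartments_b))


def colocalization_counts(pairs, localization):
    """Return (co-localized, not co-localized) counts.

    Pairs in which either protein lacks an annotation are skipped.
    localization: dict mapping a protein ID to a set of compartment names
    (hypothetical layout for this sketch).
    """
    hits = misses = 0
    for a, b in pairs:
        if a in localization and b in localization:
            if colocalized(localization[a], localization[b]):
                hits += 1
            else:
                misses += 1
    return hits, misses


def fisher_exact_greater(obs_hits, obs_misses, bg_hits, bg_misses):
    """One-sided Fisher exact p-value for enrichment of co-localized pairs.

    Probability of drawing at least obs_hits co-localized pairs when the
    observed set is a random sample of the pooled observed + background pairs.
    """
    n_obs = obs_hits + obs_misses
    total_hits = obs_hits + bg_hits
    total = n_obs + bg_hits + bg_misses
    upper = min(n_obs, total_hits)
    tail = sum(
        comb(total_hits, k) * comb(total - total_hits, n_obs - k)
        for k in range(obs_hits, upper + 1)
    )
    return tail / comb(total, n_obs)


# Toy annotations with made-up protein names; "D" has no annotation.
toy_localization = {
    "A": {"nucleus"},
    "B": {"nucleus", "cytoplasm"},
    "C": {"extra-cellular"},
}
toy_counts = colocalization_counts([("A", "B"), ("A", "C"), ("A", "D")], toy_localization)

# Counts from the yeast and fly two-hybrid columns of Table 4
# (BIND co-localized, BIND not, random co-localized, random not).
p_yeast = fisher_exact_greater(102, 3, 18656, 9493)
p_fly = fisher_exact_greater(130, 42, 18628, 9454)
```

Python integers are arbitrary precision, so the binomial coefficients are exact and only the final division is rounded to a float. A library routine such as SciPy's `fisher_exact` would be the usual choice in practice; the hand-written version is used here only to keep the sketch free of dependencies.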
The success rates for yeast two-hybrid (Y2H) data reached 87% in worm, but only 75% in fly. This observation agrees with a recent study [33], where the authors attributed greater confidence to protein interactions originating from the published HTP experiments for S. cerevisiae and C. elegans compared to the published results for D. melanogaster.

Table 3. Data resources for performance evaluation

Source                                   Interactions
Randomly generated human (HPRD)          59,981*
Model organisms (BIND)                   32,930
Total pairs                              92,911
Pairs mapped to cellular compartments    28,254

*Redundancy was eliminated from the initial 60,000 random human interactions.

Table 4. Cross-classification of interaction and localization - single projections

                      Yeast two-hybrid                                      Complex purification
                      Yeast             Worm              Fly               Yeast
Interactions          Random   BIND     Random   BIND     Random   BIND     Random   BIND
No co-localization     9,493      3      9,475     21      9,454     42      9,314    182
Co-localization       18,656    102     18,614    144     18,628    130     17,411  1,347
Total                 28,149    105     28,089    165     28,082    172     26,725  1,529
Success rate (%)       66.28  97.14      66.27  87.27      66.33  75.58      65.15  88.10
p-value                     4.3e-09          8.67e-08           0.0111             2e-16

Interactions from model organisms reported in BIND for which both interactors could be mapped to human homologs (HPRD) were evaluated for co-localization. Random interactions generated for HPRD interactors are shown as control datasets.

To identify predictions of greater specificity, we determined the co-localization rates for proteins for which 'double linkage' interactions were observed, where 'double linkage' refers to interactions supported either by two different experimental methods for a single organism or in data from two different species (Table 5). As for single linkage interactions, the background co-localization rate for randomly selected pairs of interactors was 66%.
For those interacting pairs with double linkage in BIND, 100% co-localization was observed. Even though our results were concordant with earlier reports [3,33,43], the number of 'double linkage' interactions (n = 4 to 28) was too sparse to achieve statistical significance, but the perfect predictive specificity is qualitatively noteworthy.

Table 5. Cross-classification of interaction and localization - double projections

                      Yeast two-hybrid                                      Yeast two-hybrid/complex purification
                      Yeast/worm        Yeast/fly         Fly/worm          Yeast
Interactions          Random   BIND     Random   BIND     Random   BIND     Random   BIND
No co-localization     9,472      0      9,451      0      9,433      0      9,311      0
Co-localization       18,520      8     18,532      6     18,488      4     17,337     28
Total                 27,992      8     27,983      6     27,921      4     26,648     28
Success rate (%)       66.16    100      66.23    100      66.22    100      65.06    100

Double linkages from model organisms for which each interaction was either reported in at least two different species or datasets were evaluated for co-localization. Random interactions between HPRD interactors are displayed for control.

Figure 2. Distribution of RefSeq/HomoloGene proteins across HPRD cellular localization bins. Protein interactors from BIND were mapped to HomoloGene to delineate homologs across the four organisms, and to associate each protein to a sub-cellular compartment. [Venn diagram of the Nucleus, Cytoplasm, and Extra-cellular bins; Other: 910.]

Negative control data

Because a curated reference collection of non-interacting human proteins is lacking and because pairs of proteins residing in different sub-cellular compartments are less likely to interact [45], we assessed the noise in the interaction data by the frequency with which HomoloGene interactors were annotated with incompatible localizations. We evaluated proteins localizing to the nucleus, the cytoplasm, and the extra-cellular space. We considered all model organism protein interactions for which both interactors mapped to a HomoloGene containing a human protein with annotated localization in the HPRD database. We found that 'true' interactions, that is, interactions between two model organism proteins annotated with the same compartment, accounted for 91% and inconsistencies were observed in 9% of the cases. As proteins can exist in different compartments at different times, and the curated HPRD annotations are restricted to the available literature, the inconsistencies should be viewed as an upper bound on the false classification rate. It is noteworthy that there were no inconsistencies for the double linkage interactions.

Network expansion and detection (multi-protein interactions)

KEGG [46] and PINdb [47] are curated annotation databases describing biological pathways and complexes. To demonstrate the capacity of Ulysses to detect new components of these known pathways and complexes, we identified candidates based on the following double linkage criteria: the candidate interacted with two or more pathway members in one organism; or the candidate interacted with homologous proteins of pathway members in two or more species. Based on these criteria and after mapping all pathway and complex components to HomoloGene, 14 HomoloGenes were newly associated with 11 pathways and complexes previously described in KEGG and PINdb (Additional data file 1).
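The two double linkage criteria for pathway candidates can be sketched as below. The sketch assumes all proteins have already been mapped to HomoloGene identifiers (HIDs) and reads the second criterion as "interacts with a homolog of some pathway member in at least two species"; the function name and input layout are hypothetical.

```python
from collections import defaultdict


def double_linkage_candidates(interactions, pathway_members):
    """Return HIDs outside the pathway that satisfy either double linkage criterion.

    interactions: iterable of (species, hid_a, hid_b) tuples.
    pathway_members: set of HIDs annotated to one pathway or complex.
    """
    # candidate HID -> species -> set of pathway members it interacts with
    partners = defaultdict(lambda: defaultdict(set))
    for species, a, b in interactions:
        for candidate, member in ((a, b), (b, a)):
            if member in pathway_members and candidate not in pathway_members:
                partners[candidate][species].add(member)

    candidates = set()
    for candidate, by_species in partners.items():
        # Criterion 1: two or more pathway members within a single organism.
        two_members_one_species = any(len(m) >= 2 for m in by_species.values())
        # Criterion 2: pathway-member partners observed in two or more species.
        two_species = len(by_species) >= 2
        if two_members_one_species or two_species:
            candidates.add(candidate)
    return candidates


# Toy example with made-up HIDs: the pathway contains HIDs 1, 2 and 3.
pathway = {1, 2, 3}
toy_interactions = [
    ("yeast", 10, 1), ("yeast", 10, 2),  # HID 10: two members in yeast
    ("fly", 11, 1), ("worm", 3, 11),     # HID 11: members in fly and worm
    ("yeast", 12, 1),                    # HID 12: single linkage only
    ("yeast", 1, 2),                     # member-member edge, ignored
]
found = double_linkage_candidates(toy_interactions, pathway)
```

In the toy data, HIDs 10 and 11 qualify and HID 12 does not.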
Several of these candidates have been previously linked to the pathways or processes in the scientific literature, but have not yet been annotated as such in the reference databases.

Based on the ability of the Ulysses system to identify candidates for inclusion in known networks, we sought to uncover interconnected networks within which each member is connected to at least two other members. Extracting all pairs of HomoloGene proteins supported by two or more datasets, for which there was at least one human homolog for each interactor, we were able to identify 127 distinct HomoloGenes involved in 82 interactions. Amongst these observed high confidence pairwise interactions (Table 6 and Additional data file 2) were two apparently novel interactions involving disease-linked genes. The YEATS4 gene, a poorly characterized gene known as glioma-amplified sequence 41, was linked to DMAP1, a DNA methyltransferase-associated protein. The DGCR14 gene from the DiGeorge Syndrome critical region was found to interact with VDP, a vesicle docking protein linked to the Golgi. Table 6 specifies candidate interactions for which we could not identify existing support, while Additional data file 2 lists those interactions that appear consistent with established literature.

Grouping of overlaps in these high confidence interactions revealed previously characterized networks, including highly conserved pathways and complexes.

We recovered elements of the spliceosome, including seven core small nuclear ribonucleoprotein particle (snRNP) components (LSM1, 2, 4, 5, 7, 8, SNRPD2), four U2 and U3 snRNP-specific proteins (SF3A3, IMP3, IMP4, MPHOSPH10), a splicing factor (PRPF19), as well as a protein usually associated with the PRPF19 complex (CRNKL1) known to interact with the spliceosome [48].

Two clusters were observed composed of proteins required for DNA replication and repair, as well as replication-dependent structural proteins.
One cluster contained all five subunits (RFC1, 2, 3, 4, 5) of an accessory factor for DNA replication, replication factor C (RF-C). The other cluster contained four nucleosomal proteins: three members of the H2A histone family (H2AFE, H2AFJ, H2AFN), all connected to the nucleosome assembly protein 1-like 1 (NAP1L1).

We also identified a network of 19 interconnected proteasome subunits. We found five core alpha (PSMA1, 2, 3, 5, 7) and four core beta subunits (PSMB3, 4, 5, 7) from the 20S proteasome, as well as nine subunits from the 19S regulatory complex. We located the proteasome regulatory particle subunit PSMD6 interacting with PSMD3, a non-ATPase subunit of the 19S regulatory complex.

These examples of functional networks among protein members of well conserved cellular complexes and pathways validate our approach to detect biologically meaningful protein interactions in human by overlaying and projecting interaction data originating from diverse model organisms.

To date, the limiting factor for network discovery is the sparse protein interaction data. As more association data are generated for the core model organisms, the Ulysses Interolog analysis system will facilitate greater inference of network members.

Table 6. Human protein interaction predictions supported by redundant observations for homologous proteins in model organisms

HomoloGene ID 1   Gene symbol 1   HomoloGene ID 2   Gene symbol 2
5257              XAB1            7006              ATPBD1B
6136              NACA            932               LOC391040
6127              NMD3            3139              COX5A
5998              NOL10           5682              AATF
5601              PSF1            5759              SLD5
5754              MCEMP1          5436              TRAPPC2
5368              C20orf14        5574              UBE2I
12733             MGC4093         8440              EPPB9
1220              ACTG1           4643              LOC401076
3531              WDR39           6115              CGI-128
5257              XAB1            6487              ATPBD1C
5715              BCCIP           755               RPL23
682               IMPDH2          1080              EIF2B1
5356              PP              1080              EIF2B1
20319             ARL1            3444              ATP6V0D1
1776              MAGOH           3744              RBM8A
5699              SKP1A           4485              NEDD8
10422             DMAP1           4760              YEATS4
2900              ZNF259          6872              RBX1
10363             KCTD5           9180              GORASP2
2754              VDP             11184             DGCR14

Double linkage criteria (see Table 5) revealed high confidence protein associations. Interacting partners 1 and 2 are listed with their human gene symbols and HomoloGene groups. Previously known interactions are reported in Additional data file 2.

Ulysses web interface for analysis and visualization of networks

To bring the power of multi-organism network analysis to laboratory researchers, a web-based interface to the Ulysses system was implemented [38] (Figure 3). A user enters the database with a gene of interest by submitting either the gene name or symbol, an accession ID, or even by pasting the protein sequence of the corresponding gene product. The system calls to the Atlas database and returns all interactions reported in the BIND database for homologous proteins in the model organisms, as well as the secondary interactions to the direct partners of the reference gene. These primary and secondary interactors are plotted and displayed in a series of network windows for each species. The option to individually display species-specific protein networks allows the user to trace back the origin of the projected data; the user can assess projections based on the source of evidence. The user can further choose to display a composite image overlaying interaction data for homologous genes in all organisms, or limit the view to an individual species. The original protein of interest and its homologs are clearly labeled across all organisms.
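The retrieval step described above, in which the homolog of the query gene is looked up in each model organism and its primary and secondary interactors are collected, can be sketched as follows. This is not the Atlas API: the dictionaries standing in for the homology map and the interaction tables, and the simplification to one homolog per species, are assumptions made for the example.

```python
from collections import defaultdict


def build_adjacency(interactions):
    """Map each protein to the set of its direct interaction partners."""
    adjacency = defaultdict(set)
    for a, b in interactions:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def primary_and_secondary(adjacency, query):
    """Return (primary, secondary) interactor sets for one query protein.

    Primary interactors are direct partners of the query; secondary
    interactors are partners of those partners, excluding the query and
    the primary set.
    """
    primary = set(adjacency.get(query, ())) - {query}
    secondary = set()
    for partner in primary:
        secondary |= adjacency.get(partner, set())
    secondary -= primary | {query}
    return primary, secondary


def species_networks(query_hid, homologs, interactions_by_species):
    """Collect the query's neighborhood in every species that has a homolog.

    homologs: dict mapping (HomoloGene ID, species) to one protein ID
    (a simplification; a real homology group may hold several proteins).
    """
    result = {}
    for species, interactions in interactions_by_species.items():
        protein = homologs.get((query_hid, species))
        if protein is None:
            continue  # no homolog recorded for this species
        adjacency = build_adjacency(interactions)
        result[species] = primary_and_secondary(adjacency, protein)
    return result


# Toy data with made-up identifiers.
toy_homologs = {(101, "yeast"): "yA", (101, "fly"): "fA"}
toy_by_species = {
    "yeast": [("yA", "yB"), ("yB", "yC")],
    "fly": [("fA", "fB")],
    "worm": [("wX", "wY")],
}
networks = species_networks(101, toy_homologs, toy_by_species)
```

Returning one result per species keeps the source of each projected interaction visible, which mirrors the per-species windows of the interface.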
Ineach display mode, 'starburst' proteins, defined as proteinsinvolved in excess of a user-defined number of interactions,are color-coded and easily identified (such 'starbursts' mayrepresent genes prone to false interactions in HTP studies).These 'starbursts' can be displayed in either a compactedfashion or expanded. Individual protein interactions arelinked to publications citing the corresponding association.The database also links each gene in organism-specific net-works to gene information in external resources such asGeneLynx [49], SGD [16], WormBase [15], and FlyBase [14].Utility and comparison to other systemsHere we described an exploratory Interolog Analysis frame-work for the inference of protein function. We demonstrate,by overlaying protein interaction data sets, dramaticimprovements in the specificity of projected 'dual-linkage'interactions compared to those based on a single study.Through a novel interface, we provide a means for the broadcommunity of researchers to use Interolog Analysis for thedirected study of specific pathways or processes.Ulysses represents a significant advance in the graphical dis-play of protein interaction data for comparative genomics.Visualization tools for the study of protein and genetic net-works have been available for many years, including Cyto-scape [32], Osprey [31], and ProViz [28]. These useful toolshave enabled researchers to display networks for a single spe-knowledge, only two software tools provide interfaces forcomparative analysis of protein interactions (Interolog Anal-ysis). POINT [36] displays pairwise network diagrams; how-ever, positions of homologous proteins are not preservedbetween panes, making visual interpretation exceedingly dif-ficult. The mature STRING system [37] features an excellentunderlying data collection. 
The STRING visual interface forcomparative analysis, however, is restricted to a compositeplot - there is no parallel display for individual species.Although the underlying data in STRING is robust, only themost advanced users of the system can extract the informa-tion provided intuitively in the Ulysses interface. ThusUlysses is unique in its capacity for parallel display of interac-tion data from multiple species for comparative analysis andbiological interpretation.A limiting factor for inference of new protein clusters andextension of known clusters is the sparse existing coverage ofinteractions in genomics data. Even though proteome-scaleanalyses have been conducted for several organisms [4,7,10],the lack of overlapping interactions limits the impact of theanalysis of interactions shared by homologs. In this study, wefound that interactions observed in multiple studies (forhomologous proteins) are highly reliable (Table 5). As moreextensively overlapping interaction data sets emerge, Inter-olog Analysis will allow for expanded functional annotation ofhuman genes. Individual uncharacterized genes will be linkedto known cellular pathways and complexes, and we anticipatethe discovery of new functional units. To this end, we stronglyencourage protein interaction screens of additional organ-isms and deeper coverage of the primary model organisms, asthe depth of data is critical to increasing the utility of Inter-olog Analysis.The homology mapping obtained from HomoloGene was con-venient for the Ulysses system. 
Because homology mapping across organisms remains an issue of debate, however, future releases of Ulysses will offer an option to choose between different resources, possibly including well-established systems [24,26,27].

Even though the small size of the present body of functional genomics data does not allow for extended de novo discovery of cellular networks, detection of known complexes and pathways demonstrates Ulysses' capacity to successfully identify biological networks. Ulysses is available without restriction as an internet-based resource or as downloadable code for developers [38]. The novel interface partitions data into discrete planes, offering an intuitive means of performing Interolog Analysis.

Figure 3 (see previous page). Screenshot of the Ulysses interface. The user-specified protein is shown in blue, and interacting proteins are displayed in green. Proteins with greater than three interactions (the 'starburst' threshold) are marked with a magenta-colored cross. The colors and 'starburst' threshold are user-adjustable parameters. Species-specific interactions are displayed in the panel of windows on the left. In this figure, the central graph displays a composite image identifying each node with its HomoloGene identifier. By selecting a species window, the species-specific interactions will be displayed along with the identity of the individual protein interactors.

cies or data set. Each of these tools requires submission of a pre-computed table of results, whereas Ulysses both performs the data analysis and renders a visual display. To our

Materials and methods

Database implementation
All data were stored within the Atlas database system [22,50]. The Atlas data warehouse provides a framework for integrating data from diverse systems within a unified environment. All data sets were imported from indicated databases using the SQL interface or Java API. All software and scripts used to extract data from the Atlas system are available by request.

Interaction data
Protein interaction data were obtained from BIND [51] (freeze August 2004). Direct protein-protein interactions from yeast two-hybrid experiments and indirect associations from protein complex purification experiments were extracted. Table 7 reports the number of unique interactions and interactors (proteins) acquired for each method and model organism. For the online system, protein interaction data from BIND are updated automatically. At the time of publication, the interaction data underlying the Ulysses system were updated as of October 2005.

Table 7. Model organism protein interaction datasets

                  Yeast                       Fly                         Worm
Source            Interactions  Interactors   Interactions  Interactors   Interactions  Interactors
BIND - Y2H        6,799         3,837         18,899        6,785         5,100         2,907
HomoloGene        2,110         1,562         4,448         2,614         1,639         1,170
BIND - complex    56,109        2,356         8             7             -             -
HomoloGene        24,733        1,530         -             1             -             -

HomoloGene interactions indicate the number of BIND (freeze 4 August 2004) interactions for which both interactors could be mapped to human genes by HomoloGene.

Homology mapping

HomoloGene
HomoloGene [52] is an NCBI resource providing computationally identified homologs to human protein reference sequences derived from the RefSeq collection [53]. We used data from HomoloGene freeze July 2004, which included 26,797 HomoloGene groups and 108,734 unique genes. The HomoloGene dataset was seeded by a non-redundant human RefSeq protein sequence collection and compared using protein-protein BLAST [54] to RefSeq protein sequences from model organisms. After mapping the protein sequences back to their respective genomes, both distance (Ka/Ks ratios [55]) and synteny were assessed to identify false pairings.

Ortholog mapping for model organisms
For proteins from each of the three included model organisms (worm, fly, and yeast), unique GenBank protein geninfo (gi) numbers were extracted from BIND. These identifiers were mapped to corresponding identifiers in the RefSeq collection and the RefSeq IDs were used to select homology sets in HomoloGene. For BIND sequences without a mapping to a RefSeq sequence, BLAST analysis was performed against a database of all RefSeq sequences represented in the HomoloGene system. Parameters were set to an e-value cutoff of 10^-20, and sequences were only included in the set if the matching portion included the entirety (100%) of the query sequence. At the time of publication, homology mappings through HomoloGene were updated as of September 2005.

Reference data sets and evaluation criteria
The HPRD is a collection of hand-curated reports on human proteins extracted from the scientific literature [39]. The HPRD collection (HPRD freeze July 2004: 13,469 proteins, 26,893 protein interactions) was uploaded into the Atlas database, and protein identifiers were mapped to corresponding HomoloGene and RefSeq identifiers. The HPRD annotations include reported sub-cellular locations for each protein.

Statistical evaluation

Interaction data set from model organisms
A total of 32,930 binary and protein complex interactions were obtained from BIND for which both interactors had been successfully mapped to HomoloGene homology groups. These interactions constitute the observed data and were assessed relative to the HPRD reference set.

Sampling from HPRD
We generated 60,000 random pairings of all interactors (proteins) present in HPRD bearing a localization label. After eliminating redundancy, we used this set to determine the sub-cellular co-localization. Statistical significance was evaluated using the Fisher exact test.

Visualization and web interface
The Ulysses visualization system dynamically generates images for display in a web browser. The visualization problem was divided into two tasks: graph network layout and image rendering. The open source JUNG (Java Universal Network/Graph) Framework [56] was used for modeling the network structure, based on interaction data extracted from the Atlas database via the Atlas API. Image rendering and web page generation were performed by a Java framework composed of the following components: JavaServer Pages (JSPs), standard Java libraries included with J2SE 1.5.0 [57], and the Java Advanced Imaging (JAI) libraries [58]. JSPs were used to unite the various components. The visualization application is deployed using the Tomcat web application server [59]. The network layout is defined using all reported HomoloGene sets in all organisms, and the species-specific images are constructed by limiting the display to proteins participating in interactions within the species.
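The shared-layout idea can be illustrated with a minimal Python sketch. It is not the Ulysses implementation, which was built on JUNG in Java; the function names, the toy HomoloGene group numbers, and the circular layout are all assumptions made for illustration, since the layout algorithm actually used is not stated here. Node positions are computed once from the union of all species' interactions, and each species view only filters nodes while reusing those positions.

```python
import math


def union_layout(interactions_by_species):
    """Assign one (x, y) position per homology group using every species' data."""
    groups = sorted({g for pairs in interactions_by_species.values()
                     for pair in pairs for g in pair})
    n = len(groups)
    # Hypothetical circular layout, chosen only to keep the sketch short.
    return {g: (math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
            for i, g in enumerate(groups)}


def species_view(layout, pairs):
    """Keep only groups interacting in this species, reusing shared positions."""
    nodes = {g for pair in pairs for g in pair}
    return {g: layout[g] for g in nodes}, list(pairs)


# Toy data: invented homology group identifiers, not real HomoloGene records.
interactions_by_species = {
    "yeast": [(101, 102), (102, 103)],
    "fly": [(102, 103), (103, 104)],
    "worm": [(101, 104)],
}

layout = union_layout(interactions_by_species)
views = {species: species_view(layout, pairs)
         for species, pairs in interactions_by_species.items()}
```

In this sketch, group 102 receives identical coordinates in the yeast and fly views, while group 104 is simply absent from the yeast view because it has no yeast interaction.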
This process allows for the positions of homologous genes to be maintained across species.

Additional data files
The following additional data are available with the online version of this paper. Additional data file 1 is a table showing new HomoloGene associations with known pathways and complexes described in KEGG and PINdb. Additional data file 2 lists the human protein interaction predictions supported by redundant observations for homologous proteins in model organisms.

Additional data file 1
Table showing new HomoloGene associations with known pathways and complexes described in KEGG and PINdb. Species include S. cerevisiae (Sc), C. elegans (Ce), and D. melanogaster (Dm). These candidate connections included genes linked to core biological functions. New candidate members were identified for protein degradation processes, including ubiquitin-dependent protein catabolism and protein degradation via the proteasome. Additionally, candidates were linked by interactions to the ribosome, maintenance, and nuclear export. New candidate components were linked to the highly conserved RNA polymerase complex. Finally, novel associations were predicted with membrane and vesicle-based targeting proteins. These examples illustrate the capacity of the Ulysses system to reveal starting points for study of new complexes or to orient the exploration of new candidate members of known functional units.

Additional data file 2
Human protein interaction predictions supported by redundant observations for homologous proteins in model organisms. Double linkage criteria (see Table 5) reveal high confidence in associations. Interacting partners 1 and 2 are listed with the human gene symbols and HomoloGene groups.

Acknowledgements
The authors would like to thank Dr Christer Höög for insightful discussions. B.F.F.O. acknowledges the University of British Columbia for support of this project. W.W.W. acknowledges support from the Canadian Institutes of Health Research and the Michael Smith Foundation for Health Research.
This work was supported by funding from Merck (to the Centre for Molecular Medicine and Therapeutics) and the Pfizer Corporation (D.K.). J.B. is supported by a predoctoral scholarship from the Canadian Institutes of Health Research. We thank Stefanie Butland for critical reviews of this manuscript, and Miroslav Hatas and Jonathan Falkowski for systems and software installation, and continuing maintenance of the Ulysses server.

References
1. Southan C: Has the yo-yo stopped? An assessment of human protein-coding gene number. Proteomics 2004, 4:1712-1726.
2. Matthews LR, Vaglio P, Reboul J, Ge H, Davis BP, Garrels J, Vincent S, Vidal M: Identification of potential interaction networks using sequence-based searches for conserved protein-protein interactions or "interologs". Genome Res 2001, 11:2120-2126.
3. Yu H, Luscombe NM, Lu HX, Zhu X, Xia Y, Han JD, Bertin N, Chung S, Vidal M, Gerstein M: Annotation transfer between genomes: protein-protein interologs and protein-DNA regulogs. Genome Res 2004, 14:1107-1118.
4. Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D, Narayan V, Srinivasan M, Pochart P, et al.: A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 2000, 403:623-627.
5. Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon AM, Cruciat CM, et al.: Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 2002, 415:141-147.
6. Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, Adams SL, Millar A, Taylor P, Bennett K, Boutilier K, et al.: Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 2002, 415:180-183.
7. Giot L, Bader JS, Brouwer C, Chaudhuri A, Kuang B, Li Y, Hao YL, Ooi CE, Godwin B, Vitols E, et al.: A protein interaction map of Drosophila melanogaster. Science 2003, 302:1727-1736.
8. Formstecher E, Aresta S, Collura V, Hamburger A, Meil A, Trehin A, Reverdy C, Betin V, Maire S, Brun C, et al.: Protein interaction mapping: a Drosophila case study. Genome Res 2005, 15:376-384.
9. Stanyon CA, Liu G, Mangiola BA, Patel N, Giot L, Kuang B, Zhang H, Zhong J, Finley RL Jr: A Drosophila protein-interaction map centered on cell-cycle regulators. Genome Biol 2004, 5:R96.
10. Li S, Armstrong CM, Bertin N, Ge H, Milstein S, Boxem M, Vidalain PO, Han JD, Chesneau A, Hao T, et al.: A map of the interactome network of the metazoan C. elegans. Science 2004, 303:540-543.
11. Stuart JM, Segal E, Koller D, Kim SK: A gene coexpression network for global discovery of conserved genetic modules. Science 2003, 21:21.
12. Tong AH, Lesage G, Bader GD, Ding H, Xu H, Xin X, Young J, Berriz GF, Brost RL, Chang M, et al.: Global mapping of the yeast genetic interaction network. Science 2004, 303:808-813.
13. Huh WK, Falvo JV, Gerke LC, Carroll AS, Howson RW, Weissman JS, O'Shea EK: Global analysis of protein localization in budding yeast. Nature 2003, 425:686-691.
14. The FlyBase database of the Drosophila genome projects and community literature. Nucleic Acids Res 2003, 31:172-175.
15. Harris TW, Chen N, Cunningham F, Tello-Ruiz M, Antoshechkin I, Bastiani C, Bieri T, Blasiar D, Bradnam K, Chan J, et al.: WormBase: a multi-species resource for nematode biology and genomics. Nucleic Acids Res 2004:D411-417.
16. Christie KR, Weng S, Balakrishnan R, Costanzo MC, Dolinski K, Dwight SS, Engel SR, Feierbach B, Fisk DG, Hirschman JE, et al.: Saccharomyces Genome Database (SGD) provides tools to identify and analyze sequences from Saccharomyces cerevisiae and related sequences from other organisms. Nucleic Acids Res 2004:D311-314.
17. Alfarano C, Andrade CE, Anthony K, Bahroos N, Bajec M, Bantoft K, Betel D, Bobechko B, Boutilier K, Burgess E, et al.: The Biomolecular Interaction Network Database and related tools 2005 update. Nucleic Acids Res 2005:D418-424.
18. Xenarios I, Rice DW, Salwinski L, Baron MK, Marcotte EM, Eisenberg D: DIP: the database of interacting proteins. Nucleic Acids Res 2000, 28:289-291.
19. Zanzoni A, Montecchi-Palazzi L, Quondam M, Ausiello G, Helmer-Citterich M, Cesareni G: MINT: a Molecular INTeraction database. FEBS Lett 2002, 513:135-140.
20. Kasprzyk A, Keefe D, Smedley D, London D, Spooner W, Melsopp C, Hammond M, Rocca-Serra P, Cox T, Birney E: EnsMart: a generic system for fast and flexible access to biological data. Genome Res 2004, 14:160-169.
21. Michalickova K, Bader GD, Dumontier M, Lieu H, Betel D, Isserlin R, Hogue CW: SeqHound: biological sequence and structure database as a platform for bioinformatics research. BMC Bioinformatics 2002, 3:32.
22. Shah SP, Huang Y, Xu T, Yuen MMS, Ling J, Ouellette BFF: Atlas - A data warehouse for integrative bioinformatics. BMC Bioinformatics 2005, 6:34.
23. Gabaldon T, Huynen MA: Prediction of protein function and pathways in the genome era. Cell Mol Life Sci 2004, 61:930-944.
24. Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, et al.: The COG database: an updated version includes eukaryotes. BMC Bioinformatics 2003, 4:41.
25. Tatusov RL, Galperin MY, Natale DA, Koonin EV: The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res 2000, 28:33-36.
26. O'Brien KP, Remm M, Sonnhammer EL: Inparanoid: a comprehensive database of eukaryotic orthologs. Nucleic Acids Res 2005, 33:D476-480.
27. Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Church DM, DiCuccio M, Edgar R, Federhen S, Helmberg W, et al.: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2005:D39-45.
28. Iragne F, Nikolski M, Mathieu B, Auber D, Sherman D: ProViz: protein interaction visualization and exploration. Bioinformatics 2005, 21:272-274.
29. Hanisch D, Sohler F, Zimmer R: ToPNet-an application for interactive analysis of expression data and biological networks. Bioinformatics 2004, 20:1470-1471.
30. Suzuki H, Saito R, Kanamori M, Kai C, Schonbach C, Nagashima T, Hosaka J, Hayashizaki Y: The mammalian protein-protein interaction database and its viewing system that is linked to the main FANTOM2 viewer. Genome Res 2003, 13:1534-1541.
31. Breitkreutz BJ, Stark C, Tyers M: Osprey: a network visualization system. Genome Biol 2003, 4:R22.
32. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T: Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 2003, 13:2498-2504.
33. Lehner B, Fraser AG: A first-draft human protein-interaction map. Genome Biol 2004, 5:R63.
34. Sharan R, Suthram S, Kelley RM, Kuhn T, McCuine S, Uetz P, Sittler T, Karp RM, Ideker T: Conserved patterns of protein interaction in multiple species. Proc Natl Acad Sci USA 2005, 102:1974-1979.
35. Brown KR, Jurisica I: Online predicted human interaction database. Bioinformatics 2005, 21:2076-2082.
36. Huang TW, Tien AC, Huang WS, Lee YC, Peng CL, Tseng HH, Kao CY, Huang CY: POINT: a database for the prediction of protein-protein interactions based on the orthologous interactome. Bioinformatics 2004, 20:3273-3276.
37. von Mering C, Jensen LJ, Snel B, Hooper SD, Krupp M, Foglierini M, Jouffre N, Huynen MA, Bork P: STRING: known and predicted protein-protein associations, integrated and transferred across organisms. Nucleic Acids Res 2005:D433-437.
38. Ulysses [http://www.cisreg.ca/ulysses]
39. Peri S, Navarro JD, Amanchy R, Kristiansen TZ, Jonnalagadda CK, Surendranath V, Niranjan V, Muthusamy B, Gandhi TK, Gronborg M, et al.: Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Res 2003, 13:2363-2371.
40. Deng M, Tu Z, Sun F, Chen T: Mapping Gene Ontology to proteins based on protein-protein interaction data. Bioinformatics 2004, 20:895-902.
41. Lin N, Wu B, Jansen R, Gerstein M, Zhao H: Information assessment on predicting protein-protein interactions. BMC Bioinformatics 2004, 5:154.
42. Walhout AJ, Sordella R, Lu X, Hartley JL, Temple GF, Brasch MA, Thierry-Mieg N, Vidal M: Protein interaction mapping in C. elegans using proteins involved in vulval development. Science 2000, 287:116-122.
43. von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P: Comparative assessment of large-scale data sets of protein-protein interactions. Nature 2002, 417:399-403.
44. Harris MA, Clark J, Ireland A, Lomax J, Ashburner M, Foulger R, Eilbeck K, Lewis S, Marshall B, Mungall C, et al.: The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res 2004, 32:D258-261.
45. Jansen R, Yu H, Greenbaum D, Kluger Y, Krogan NJ, Chung S, Emili A, Snyder M, Greenblatt JF, Gerstein M: A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science 2003, 302:449-453.
46. Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M: The KEGG resource for deciphering the genome. Nucleic Acids Res 2004, 32:D277-280.
47. Luc PV, Tempst P: PINdb: a database of nuclear protein complexes from human and yeast. Bioinformatics 2004, 20:1413-1415.
48. Jurica MS, Moore MJ: Pre-mRNA splicing: awash in a sea of proteins. Mol Cell 2003, 12:5-14.
49. Lenhard B, Hayes WS, Wasserman WW: GeneLynx: a gene-centric portal to the human genome. Genome Res 2001, 11:2151-2157.
50. Atlas Integrated Database System [http://bioinformatics.ubc.ca/atlas]
51. Bader GD, Betel D, Hogue CW: BIND: the Biomolecular Interaction Network Database. Nucleic Acids Res 2003, 31:248-250.
52. HomoloGene [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=homologene]
53. Wheeler DL, Church DM, Edgar R, Federhen S, Helmberg W, Madden TL, Pontius JU, Schuler GD, Schriml LM, Sequeira E, et al.: Database resources of the National Center for Biotechnology Information: update. Nucleic Acids Res 2004:D35-40.
54. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215:403-410.
55. Hurst LD: The Ka/Ks ratio: diagnosing the form of sequence evolution. Trends Genet 2002, 18:486.
56. Java Universal Network/Graph Framework [http://jung.sourceforge.net]
57. Java Technology [http://java.sun.com]
58. Java Advanced Imaging (JAI) API [http://java.sun.com/products/java-media/jai/]
59. Apache Tomcat [http://jakarta.apache.org/tomcat/]
60. Trotta CR, Lund E, Kahan L, Johnson AW, Dahlberg JE: Coordinated nuclear export of 60S ribosomal subunits and NMD3 in vertebrates. EMBO J 2003, 22:2841-2851.
61. Gadal O, Strauss D, Kessl J, Trumpower B, Tollervey D, Hurt E: Nuclear export of 60s ribosomal subunits depends on Xpo1p and requires a nuclear export sequence-containing factor, Nmd3p, that associates with the large subunit protein Rpl10p. Mol Cell Biol 2001, 21:3405-3415.
62. Ho JH, Kallstrom G, Johnson AW: Nmd3p is a Crm1p-dependent adapter protein for nuclear export of the large ribosomal subunit. J Cell Biol 2000, 151:1057-1066.
