Structural characterization of genomes by large scale sequence-structure threading Cherkasov, Artem; Jones, Steven J Apr 3, 2004

BioMed Central
BMC Bioinformatics
Open Access
Research article

Structural characterization of genomes by large scale sequence-structure threading

Artem Cherkasov*1,2 and Steven JM Jones1

Address: 1Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, Canada and 2Faculty of Medicine, University of British Columbia, Vancouver, British Columbia, Canada

Email: Artem Cherkasov* - artc@interchange.ubc.ca; Steven JM Jones - sjones@bcgsc.ca
* Corresponding author

Abstract

Background: Using sequence-structure threading we have conducted a structural characterization of the complete proteomes of 37 archaeal, bacterial and eukaryotic organisms (including worm, fly, mouse and human) totaling 167,888 genes.

Results: The reported data represent the first reasonably general evaluation of the performance of full sequence-structure threading on multiple genomes, providing an opportunity to evaluate its general applicability to large-scale studies. According to the estimated results, sequence-structure threading has assigned protein folds to more than 60% of eukaryotic, 68% of archaeal and 70% of bacterial proteomes. The repertoires of protein classes, architectures, topologies and homologous superfamilies (according to the CATH 2.4 classification) have been established for distant organisms and superkingdoms. It has been found that the average abundance of CATH classes decreases from "alpha and beta" to "mainly beta", followed by "mainly alpha" and "few secondary structures". The 3-layer (aba) sandwich has been characterized as the most abundant protein architecture and the Rossmann fold as the most common topology.

Conclusion: The analysis of genomic occurrences of CATH 2.4 protein homologous superfamilies and topologies has revealed the power-law character of their distributions.
The corresponding double logarithmic "frequency – genomic occurrence" dependences, characteristic of scale-free systems, have been established for individual organisms and for the three superkingdoms.

Supplementary materials to this work are available at [1].

Published: 03 April 2004
Received: 09 October 2003
Accepted: 03 April 2004
BMC Bioinformatics 2004, 5:37
This article is available from: http://www.biomedcentral.com/1471-2105/5/37
© 2004 Cherkasov and Jones; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.

Background

Recent world-wide progress in sequencing projects has led to exponential growth of genomic information and launched a race in the area of structural genomics. The development of bioinformatics tools for gene prediction and of methods for sequence matching and threading has allowed structural evaluation of sizable portions of complete proteomes and has given a boost to the emerging areas of comparative structural genomics and proteomics.

The investigation of the repertoires of protein structures and functions employed by species at different taxonomy levels has led to numerous important discoveries in life science. Thus, the analysis of fold distribution patterns across the three domains of life helped to identify instances of intra- and inter-species lateral gene transfer [2,3]; the characterization of common and unique protein folds provided valuable information about potential drug targets; and the discovery of the scale-free character of gene propagation allowed the estimation of a finite number of basic protein shapes [4].

There is little doubt about the great prospects of comparative proteomics, but there is no shortage of challenges either.
An adequate yet general estimation of the three-dimensional structure of a protein from its amino acid sequence remains the biggest obstacle in the field. There are two broad approaches to protein structure prediction: ab initio modeling and fold recognition. The former relies on well-understood principles guiding the folding of isolated amino-acid sequences into energetically favorable three-dimensional conformations. Although they are physically justified, these methods do not yet possess the accuracy, speed and reliability needed for large-scale proteome studies.

Fold recognition techniques utilize the wealth of experimentally determined protein structures accumulated in the Protein Data Bank [5]. Protein scientists have carefully analyzed these structures to determine a redundant repertoire of structural motifs used by nature to build proteins. The motifs have been catalogued into numerous standard libraries of protein folds (such as SCOP, CATH, FSSP, MMDB, LPFC, VAST, ASTRAL, SUPERFAMILY), in which protein "building parts" are classified at several hierarchy levels [6-14].

The fold recognition approaches can be divided into two broad classes by the way in which they utilize libraries of standard folds. The first group, called profile-based approaches, represents structural information in a linear form called a profile. A profile reflects the statistically derived probabilities of the occurrence of residues in a particular structure [15-22]. The profile-based fold recognition methods use conventional sequence alignment tools (such as BLAST, FASTA, PSI-BLAST and hidden Markov models) to find matches between a probe sequence of unknown structure and the appropriate library entry. The profile-based approaches are very rapid: modern text alignment algorithms and computer hardware make it routine to process several medium-sized proteomes on a single CPU in a day. At the same time, an unknown protein can only be characterized by a profile if it has reasonable sequence similarity with one or more proteins of known structure.

Figure 1. The estimated coverage of superkingdoms by medium-, high- and very high quality threading predictions.

The second strategy for using fold libraries is sequence threading. Threading utilizes empirical pair potentials that score the likelihood of two residues being at a certain distance in space. This approach is based upon the assumption that seemingly countless different proteins fold into a limited number of shapes (estimates vary from 4,000 to 10,000+) [4,23] and that nearly all protein structures can be described based upon these shapes. Threading attempts to assign a fold to a protein sequence by sampling it onto each member of a fold library, using pseudo-energy as a measure of fit [24-29].

Being independent of sequence information, threading has been shown to make accurate predictions even in the "twilight zone" of <25% sequence identity, where sequence-based approaches normally fail. When benchmarked on a set of proteins with known 3D structures, sequence-structure threading demonstrated accurate performance even well below the 25% sequence identity level [30]. However, in contrast to the profile-based methods, the threading-based approaches are too slow to be widely applicable to large-scale structural genomics.

Figure 2. The estimated coverage of proteomes by medium-, high- and very high quality threading predictions.

Another original approach, GenTHREADER, which uses sequence-sequence threading, could assign known CATH folds to 46 percent of the Mycoplasma genitalium proteome (containing 468 ORFs) within one day [31]. This fast and accurate approach utilizes some features of classic threading techniques but also relies largely on traditional sequence alignment.

Numerous comparative structural genomic studies have been reported to date, describing dozens of structurally characterized proteomes [3,14,32-35]. Large collections of protein fold predictions across genomes are currently available on-line [36-38]. It should be stressed, however, that all these structural predictions have been performed by various automated sequence-sequence matching techniques. Up until now, no comparative studies capitalizing on full sequence-structure threading (viewed by some as the most accurate and comprehensive approach) have been reported.

Figure 3. Pie charts of total and superkingdom-specific distributions of protein classes.

In our recent work, we have used large-scale full sequence-structure threading to discover novel bacterial virulence factors mimicking host functions.
The hypothesis was that, under selective pressure, pathogen genes have evolved to encode proteins that functionally mimic host proteins independently of significant primary sequence similarity. We suggested that such bacterial "mimickers" could be considered as potential virulence factors and, thus, the objective was to identify pathogen genes encoding proteins with low sequence identity but high structural similarity to host counterparts. Since threading remains the only reliable alternative for comparing proteins with limited sequence identity, we adopted the THREADER program [28], which we customized for large-scale distributed processing.

To achieve the described objectives we aimed to process a sufficient number of complete genomes from all domains of life, covering a range of parasitic and free-living species. Thus, we have performed sequence-structure threading for more than 30 complete proteomes of organisms from the Bacteria, Archaea and Eukaryote superkingdoms. Specific aspects of the discovery of bacterial virulence factors by full threading will be discussed in a separate report. In the present work we utilize the generated information in the conventional form of comparative structural genomic analysis and discuss the applicability of classical full threading to large-scale analysis.

Results

The THREADER fold recognition program uses the CATH fold library, which has four hierarchical levels of protein classification: classes, architectures, topologies and homologous superfamilies [27]. The CATH classes are determined by protein secondary structure composition; the architectures reflect the overall shape of the protein domain; the topologies depend on both the overall shape and the connectivity of the domain; and the homologous superfamily level groups domains with significant sequence similarity [39].

In contrast to profile-based approaches, threading cannot readily specify several distinct structural domains for one sequence; rather, it tends to associate the entire sequence with a particular CATH entity. THREADER samples a raw sequence onto domains from the CATH library to produce multiple scores quantifying different aspects of the threading. One of them, a Z score of the weighted sum of threading and solvation energies, is regarded as the measure of overall goodness of fit between the probe sequence and the library fold. The results of the threading can be considered at three levels of prediction accuracy. When the Z score is above 3.5, the match between the fold and the probe sequence is considered very significant. Hits with Z > 2.9 are regarded as significant, and Z values between 2.7 and 2.9 represent possibly correct threading predictions [28].

Figure 4. Venn diagrams of the distribution of distinct CATH domains shared by species from three domains of life.

Figure 5. Venn diagrams of the distribution of distinct CATH topologies shared by species from three domains of life.

Threading is a computationally intensive procedure that requires several CPU hours to process a single protein sequence. To allow for large-scale threading of complete proteomes, we have implemented an automated parallel protocol for distributed processing of THREADER on a Beowulf cluster.
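
The three Z-score confidence levels quoted above from [28] can be illustrated with a short Python sketch. This is not part of THREADER or of the authors' pipeline: the function names, the "unassigned" label and the handling of scores that fall exactly on a cutoff are assumptions made only for illustration.

```python
from collections import Counter

# Cutoffs quoted in the text for the THREADER Z score [28].
VERY_SIGNIFICANT = 3.5
SIGNIFICANT = 2.9
POSSIBLY_CORRECT = 2.7


def confidence_tier(z: float) -> str:
    """Map a Z score to a confidence tier (boundary handling is assumed)."""
    if z > VERY_SIGNIFICANT:
        return "very significant"
    if z > SIGNIFICANT:
        return "significant"
    if z >= POSSIBLY_CORRECT:
        return "possibly correct"
    return "unassigned"  # hypothetical label for hits below every cutoff


def coverage(best_z_per_protein: list[float]) -> dict[str, float]:
    """Cumulative fraction of proteins whose best hit reaches each cutoff."""
    n = len(best_z_per_protein)
    if n == 0:
        raise ValueError("at least one Z score is required")
    tiers = Counter(confidence_tier(z) for z in best_z_per_protein)
    very = tiers["very significant"]
    high = very + tiers["significant"]
    acceptable = high + tiers["possibly correct"]
    return {
        "very significant": very / n,
        "significant or better": high / n,
        "possibly correct or better": acceptable / n,
    }
```

For the invented input `[4.0, 3.0, 2.8, 1.0]`, `coverage` returns cumulative fractions of 0.25, 0.5 and 0.75. The cumulative form mirrors the way genome coverage is reported at three accuracy levels in this work.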
The distributed processing made it possible to perform the threading-based structural characterization of large sets of genomic information, including the complete proteomes of human and major bacterial pathogens (the overall processing took several dozen years of single-CPU time).

We anticipated that such threading would provide wider genome coverage and more accurate structural predictions than traditional sequence-based approaches and would thus provide valuable insight into the fold composition of the studied proteomes according to the CATH classification. We also expected that the results would allow a general evaluation of the accuracy, genomic coverage and CPU usage of classical full threading, in order to assess the feasibility of trading its lower processing speed for a higher quality of predictions in large-scale studies.

The performance of large-scale threading

Using sequence-structure threading we have processed the complete proteomes of 37 organisms (25 bacteria, 6 eukaryotes and 6 archaea) totaling 167,888 sequences. The threading has produced 36,240 predictions with Z scores above the 3.5 threshold, which correspond to 21.58 percent of the processed sequences. The fractions of protein structure predictions with high (Z > 2.9) and acceptable (Z > 2.7) accuracy averaged 64.7 and 80.7 percent respectively. The estimated figures of genome coverage by threading for the studied organisms are listed in Table 1 (see additional file 1) for the three levels of prediction accuracy. These parameters are plotted in Figures 1 and 2 for the studied organisms and the three superkingdoms.

It should be noted that the produced structural predictions provide the first reasonably comprehensive evaluation of classic sequence-structure threading in large-scale structural genomics studies. Data from Table 1 (see additional file 1) demonstrate that the average accuracy of protein structure prediction is slightly better for microbial organisms: protein folds have been confidently assigned to more than 60% of eukaryotic genomes, about 68% of archaeal genomes and 70% of bacterial genomes (here and below, a protein is considered to be assigned to a particular CATH homologous superfamily if the corresponding Z score is above 2.9). The better coverage for bacteria may perhaps be explained by the fact that bacterial proteins have undergone more extensive experimental characterization and, thus, the Protein Data Bank is heavily biased toward bacterial data.

It is difficult to compare the estimated genome coverage by threading directly with the performance of the sequence-based methods, as the latter do not grade their predictions by levels of accuracy. It is known, however, that the gene coverage of profile-based approaches varies between 10 and 45 percent [3,14,33,40-45]. Some newer automated genome annotation techniques could assign up to 62% of certain genomes [46]. Therefore, the estimated results of full sequence-structure threading can be viewed as generally 6–8% better than, or comparable to, those obtained in similar studies. One can argue, however, that with the less strict cutoff of Z = 2.7 (corresponding to "possibly correct" predictions according to THREADER) the genome coverage goes up to 90% (see Table 1 in additional file 1).

On the other hand, as can be seen from Table 2 (see additional file 2), a sizable fraction of the estimated predictions corresponds to multi-domain protein folds according to CATH 2.4, which was the default library for THREADER. Unlike SCOP, 3D-PSSM or other similar databases created with a great deal of human insight and curation, the CATH collection is based on automated classification protocols [7]. Apparently, at the time when CATH 2.4 was created, those protocols could not sufficiently distinguish individual domains in the most complex multi-fold entries. However, in the very latest 2.5 release of CATH (which became available on-line on Dec. 01, 2003) all multi-domain entries have been split into simpler α, β and (α+β) components. Thus, in view of these recent changes, our multi-domain predictions can also be considered unsuccessful in assigning defined classes to the corresponding proteins. This reduces the C-level genome coverage by sequence-structure threading to 12%, 38% and 45% for very significant, significant and possibly correct predictions respectively. At the same time it should be stressed that, although the multi-domain predictions do not directly contribute to knowledge about the representation of α, β, α+β and "few secondary structures" elements in complete proteomes, they do provide meaningful insight into the three-dimensional structures of the target sequences.

Figure 6. Venn diagrams of the distribution of distinct CATH architectures shared by species from three domains of life.

Thus, we can summarize that, in our experience, the precision and genome coverage achieved by full threading may not compensate for the more than 3 hours of CPU time we had to spend to process a single sequence with the THREADER package. At the same time, it should not be underestimated that the reported data represent the first broad application of full sequence-structure threading to multiple genomes, with all its pluses and minuses. There is no doubt in our mind that sequence-structure threading currently remains the only suitable instrument for protein structure prediction when no sequence homology information is available. Full sequence-structure threading can be used as a powerful complementary approach for structural genomic studies, and the reported results (independent of any sequence homology information) can be viewed as highly complementary to the existing structural genomic databases. We have therefore developed a web-based interface [1] for open access to our results. It should also be stressed that the accuracy of the threading could probably be improved further by using pre-computed protein secondary structures and domain boundaries. However, these approaches have not been used, since we tried to minimize human intervention and maximize the automation of large-scale protein structure prediction.

Figure 7. General genomic occurrence of frequencies of protein topologies (in % of totals).

Discussion

Fold repertoires of eukaryotes, bacteria and archaea

The statistics of protein fold distribution represent one of the most important aspects of structural genomics. Information about the most abundant and unique folds can provide valuable insight into evolutionary relations and can serve as an important source of drug target information [4,34]. Previous studies have produced several very similar lists of the most abundant protein folds according to the SCOP classification. Thus, Wolf et al. identified the P-loop NTPase as the most abundant fold in all three superkingdoms. The next most common protein structures in bacteria and archaea were characterized as the ferredoxin-like fold, TIM barrel and methyltransferase, whereas in eukaryotic proteomes the most common fold was followed by protein kinase, β-propeller and TIM barrel. Muller et al. also named the P-loop NTPase as the most common fold, while other members of their "top five" lists varied. Hegyi et al. developed a fold ranking system which produced a similar fold abundance order: P-loop NTPase, ferredoxin, TIM barrel. It has also been generally agreed that a significant fraction of known protein folds can be found in all three major groups of organisms [47] and that the distribution of folds within organisms can differ significantly [33,47].

Figure 8. General genomic occurrence of frequencies of protein domains (in % of totals).

Using the generated threading results we have estimated the distribution of CATH classifications among the three major groups of organisms. The populations of the "mainly alpha", "mainly beta", "alpha and beta" and "few secondary structures" classes are presented in Figure 3 for eukaryotes, bacteria and archaea.

According to the numerical data from Table 2 (see additional file 2), apart from multi-domain predictions accounting for more than 44% of the threading results (and requiring further separation into conventional classes), the most abundant protein class is "alpha and beta" (36.6%), followed by "mainly beta" (10.8%), "mainly alpha" (8.4%) and "few secondary structures" (0.2%). The difference in the abundance of classes among the three major groups of organisms is noticeable. More than half of the archaeal proteins belong to the "alpha and beta" class, while no proteins with "few secondary structures" have been detected. The distributions of eukaryotic and bacterial protein classes are similar. The only difference is that the proportions of secondary structures are more homogeneous for eukaryotes: their "mainly alpha" and "mainly beta" proteins have higher occurrences, and "alpha and beta" proteins have lower occurrences, when compared to bacteria. The estimated higher proportion of multi-domain predictions agrees with the results of global studies by Teichmann and co-authors [48,49], which assign larger parts of prokaryote and particularly eukaryote proteomes to multi-domain folds.

From the total of 1893 default CATH homologous superfamilies used by THREADER, we have identified 1520 as being present in the organisms studied. The data generated suggest that eukaryotic species contain the largest fraction of the established H-classifications (1447), while 1059 distinct homologous superfamilies have been found in bacteria and 565 in archaea.

Figure 9. Genomic occurrence of frequencies of protein domains in eukaryotes.

The 3bct00 CATH fold, corresponding to the "Armadillo repeat" topology, "horseshoe" architecture and "mainly alpha" class, has the highest total count (2865) within the studied proteomes. Two other folds, 1gdoA0 (class: alpha and beta; architecture: 4-layer sandwich; topology: glutamine phosphoribosylpyrophosphate) and 1gotB0 (mainly beta; 7 propeller; methylamine dehydrogenase), have been counted 2728 and 2283 times respectively. This abundance order changes to 1gdoA0, 3bct00, 1gotB0 if the fold abundance is calculated as the sum of the fractions of each distinct protein fold in the individual proteomes.

The repertoires of protein functions in the studied organisms can be illustrated by the distributions of protein topologies. In total, 588 out of 703 CATH topologies have been identified in the studied proteomes. It has been found that eukaryotic organisms contain the largest variety of protein topologies: 563.
Bacterial species contain 439 topologies and archaeal species 275.

Two CATH topologies, the Rossmann fold and the TIM barrel, have the highest frequency of occurrence. These two topologies are the most abundant in all the studied organisms except Plasmodium falciparum (which contains an unusually large fraction of hydrolases). The Rossmann fold and TIM barrel account for 14 to 28 percent of the topology composition of the individual proteomes, which is likely determined by the known multi-functionality of these folds. Several other highly abundant topologies have also been identified within the studied organisms. These folds, mostly associated with transport and metabolism functions, include hydrolase, oxidoreductase, neuraminidase, transferase, Armadillo repeat, glutamine phosphoribosylpyrophosphate, methylamine dehydrogenase, and isomerase/synthase bifunctional proteins. Some species contain large fractions of phosphotransferases, binding proteins, sugar transport proteins and isomerases, which are also related to transport and metabolism. Various toxin folds also have high abundance, particularly in the bacterial genomes. The list of the top 10 common topologies identified by the current study is presented in Table 3 (see additional file 3).

The ranking of protein topologies has been conducted in three different ways: based on the topology count for the entire dataset, using the sum of the fractions of distinct topologies within individual genomes, and by numerical ranking of topologies within organisms. All three approaches have produced very similar abundance orders, identifying the Rossmann fold and TIM barrel as the top-ranked protein topologies.

The threading results have also illustrated the fact that there are very few protein topologies with a high frequency of genomic occurrence; the overwhelming majority of protein topologies occur quite rarely (the uneven character of the distribution of protein folds will be discussed in greater detail later).
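
The three ranking schemes mentioned above can be sketched in Python as follows. The text does not specify how the third scheme ("numerical ranking of topologies within organisms") was aggregated, so the rank-sum used here is an assumption, as are all function names. The counts in `TOY_GENOMES` are invented for illustration and are not taken from the study.

```python
from collections import Counter, defaultdict

# Hypothetical per-proteome topology counts (invented, not study data).
TOY_GENOMES = {
    "org_A": {"Rossmann fold": 50, "TIM barrel": 30, "Armadillo repeat": 5},
    "org_B": {"Rossmann fold": 20, "TIM barrel": 25, "Armadillo repeat": 1},
    "org_C": {"Rossmann fold": 300, "TIM barrel": 100, "Armadillo repeat": 400},
}


def rank_by_total_count(genomes):
    """Scheme 1: order topologies by their count over the whole dataset."""
    totals = Counter()
    for counts in genomes.values():
        totals.update(counts)
    return sorted(totals, key=lambda t: (-totals[t], t))


def rank_by_sum_of_fractions(genomes):
    """Scheme 2: order by the sum of per-proteome fractions."""
    scores = defaultdict(float)
    for counts in genomes.values():
        n = sum(counts.values())
        if n == 0:
            continue  # skip empty proteomes to avoid division by zero
        for topology, count in counts.items():
            scores[topology] += count / n
    return sorted(scores, key=lambda t: (-scores[t], t))


def rank_by_rank_sum(genomes):
    """Scheme 3 (assumed form): order by the sum of within-proteome ranks."""
    all_topologies = set().union(*genomes.values())
    worst = len(all_topologies)  # rank given to topologies absent from a proteome
    rank_sum = dict.fromkeys(all_topologies, 0)
    for counts in genomes.values():
        ordered = sorted(counts, key=lambda t: (-counts[t], t))
        position = {t: i + 1 for i, t in enumerate(ordered)}
        for topology in all_topologies:
            rank_sum[topology] += position.get(topology, worst)
    return sorted(rank_sum, key=lambda t: (rank_sum[t], t))
```

With the toy data, the total count puts "Armadillo repeat" first because one large proteome dominates the sum, whereas the two per-proteome schemes both return Rossmann fold, TIM barrel, Armadillo repeat. This shows why normalizing within each proteome can change the order relative to raw totals.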
The uneven distribution makes it difficult to produce a very precise topology abundance ranking and to compare the results of different structural genomics studies (which are also biased by the choice of the particular organisms studied). Nonetheless, it is fair to conclude that the estimated CATH T-ranking generally agrees with the previous studies mentioned above, which identified the P-loop NTPase, TIM barrel and Rossmann fold as the most abundant folds according to the SCOP classification [3,33,47]. The similarity is even more noticeable considering that the organizations of the SCOP and CATH 2.4 libraries are quite distinct.

According to the threading results, the most abundant protein architectures can be placed in the following order: 3-layer (aba) sandwich, barrel, 2-layer sandwich. The top three are followed by non-bundle, 4-layer sandwich, horseshoe and 6-bladed propeller. As can be noted from the list (and as previously mentioned by other authors), the most abundant protein folds possess rather high symmetry. Perhaps the corresponding symmetric protein structures correspond to energetically more favorable configurations and, hence, have some evolutionary advantage.

30 out of 31 distinct known protein architectures have been identified in the studied organisms. Eukaryotic proteomes contain all 30 identified architectures, while 26 architectures have been identified in the studied bacteria and 22 in archaea. The distribution of architectures can also be characterized as highly uneven. Remarkably, one protein architecture (a super-fold) has been identified on only one occasion, and several others could be found fewer than four or five times. Not surprisingly, all these architectures also appeared to be superkingdom-specific. Figure 6 represents a Venn diagram of the established distribution of protein architectures in the three superkingdoms.

Figures 4, 5 and 6 demonstrate that bacterial and archaeal proteins do not form any superkingdom-specific architectures and do not share any architectures that would also be absent from eukaryotes. Eukaryotes, however, exclusively possess four protein architectures not present elsewhere. The already mentioned eukaryotic super-fold is associated with bactericidal function and is involved in eukaryotic host defense. In addition, the 5-stranded propeller, orthogonal prism and aligned prism architectures can only be found in eukaryotes. The corresponding configurations have been adopted by various membrane-associated proteins in eukaryotes. These eukaryotic protein architectures likely evolved to accommodate the specifics of eukaryotic cell wall composition. The orthogonal prism configuration is related to lectins and mannose-binding function. Since the corresponding architecture has been observed only in mouse, it is reasonable to speculate that it may be related to certain mammalian-specific features. Four other architectures are exclusively shared between eukaryotes and bacteria: ribbon, 3-layer sandwich, distorted sandwich and irregular architecture. It seems to be a general observation that low-complexity and irregular structures are absent from archaea; species of this group also lack the entire "few secondary structures" protein class. It has previously been outlined by Gerstein and Levitt that small folds prevail in eukaryotic proteomes, as they are mostly involved in intercellular communication and regulation in vertebrates [46]. This may explain why representatives of the smaller "few secondary structures" class have not been found in archaea, while 42 of them have been identified in the human and 82 in the mouse proteomes.

The distribution of CATH topologies among the three superkingdoms follows similar trends. 588 out of the total of 703 CATH protein topologies have been identified in the studied organisms. As can be seen from Figure 5, most of them are present in at least two superkingdoms. 265 protein topologies can be found in all three species groups. These compose almost the entire topology repertoire of the archaea, which exclusively share only 4 topologies with bacteria and 5 with eukaryotes. Only one exclusively archaeal topology has been found thus far. This observation agrees with previous studies that have pointed to the near absence of archaea-specific SCOP folds [33,47]. These findings provide another example of the uniqueness of the archaeal fold repertoire (in addition to the previously indicated absence of superkingdom-specific architectures, the unusually high fraction of "alpha and beta" proteins and the lowered "low-complexity" content). The low proportion of accurate structural predictions for archaeal proteins can also be viewed as supporting this idea (see Table 1 in additional file 1).

Figure 10. Genomic occurrence of frequencies of protein domains in bacteria.

Another important observation of this work coincides with previous findings that bacteria and eukaryotes share a significant fraction of their fold repertoires. The threading results indicate that, out of 563 CATH topologies found in eukaryotes and 439 in bacteria, 152 are exclusively shared between these two superkingdoms. This figure constitutes more than 25 percent of the entire pool of identified protein topologies. This is in good agreement with the results of Wolf et al., who indicated that more than 20 percent of SCOP folds can be shared between bacteria and eukaryotes. The fraction of protein homologous superfamilies exclusively shared between eukaryotic and bacterial organisms is even greater. Our results demonstrate that bacteria share almost 45 percent of their homologous superfamilies with eukaryotes (474 out of 1059), while the H-level sharing between bacteria and archaea is very limited (see Figure 4). The situation with the estimated superkingdom-specific homologous superfamilies is very similar to the picture of the distribution of protein topologies: only 10 protein homologous superfamilies can be found exclusively in archaea and 53 in bacteria, while the eukaryotes contain 428 unique H-representatives (out of the total of 1447).

Figures 4, 5 and 6 illustrate similar distribution trends at all three levels of CATH classification. The results demonstrate that bacterial and eukaryotic organisms have similar protein organizations and share a high degree of commonality. The similarities in their protein repertoires may illustrate the relevance of lateral gene transfer between bacteria and eukaryotes as well as the strength of ancestral relationships between bacteria and eukaryotic organelles [50]. Archaeal organisms demonstrate quite distinct trends in protein organization: they seem to lack superkingdom-specific topologies and architectures and, in addition, they do not contain proteins with irregular or low-complexity structures.

Figure 11. Genomic occurrence of frequencies of protein domains in archaea.

At the moment, it is difficult to say whether it is possible to identify any species-specific protein topologies, as we have processed only a limited number of proteomes. At the same time, the structural characterization of complete proteomes by threading is in continuous progress and new conclusions may emerge.

Folds occurrences and power-law behavior of fold distributions

It has recently been demonstrated by several independent studies that protein fold distribution is drastically uneven: there are a few extremely common protein structures, while most protein folds occur very infrequently. It has been laws.
The double logarithmic linear plots could be estab-lished for distribution of protein folds by the number offamilies, distribution of families by the number ofdomains, etc [4,34,35,51,52]. These findings laid thefoundation for characterizing the evolution of the proteinuniverse in terms of a growing scale-free system in whichindividual genes are represented as the nodes of a propa-gating network. The estimated scale-free character of sucha network indicates a preference to duplicate genes encod-ing for already common protein folds [35].We have analyzed structural predictions generated by thethreading for the frequency of occurrence of particularCATH classes, architectures, topologies and homologoussuperfamilies. The frequency distributions have beenestablished for the studied organisms and for superking-Genomic occurrence of frequencies of protein domains in C. elegans and M. genitaliumFigure 12Genomic occurrence of frequencies of protein domains in C. elegans and M. genitalium.Page 13 of 16(page number not for citation purposes)previously shown that the occurrence of SCOP proteinfamilies, superfamilies and folds follow asymptotic powerdoms. The double logarithmic plots estimated forfrequencies of total genomic occurrence of proteinBMC Bioinformatics 2004, 5 http://www.biomedcentral.com/1471-2105/5/37homologous superfamilies, topologies and architecturesfor all proteins combined are presented in Figures 7, 8.It can readily be seen from the figures that these depend-ences can be described by a power-law f(i)~i-b functionrelating occurrences i of CATH classifications with theircorresponding frequencies of occurrence f(i). The totaldistribution of protein homologous superfamiliesdetermined for all the studied proteins has produced anexponent b = 0.729. 
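The double logarithmic fit described above can be sketched as follows. This is a minimal illustration only: the fitting procedure actually used in this work is not specified in the text, so ordinary least squares on log-transformed values is an assumption, and the function name `fit_power_law` and the synthetic input data are hypothetical.

```python
import math


def fit_power_law(occurrences, frequencies):
    """Fit f(i) = a * i**(-b) by least squares on log f = log a - b * log i.

    Returns the pair (a, b). Assumes all inputs are strictly positive.
    """
    xs = [math.log(i) for i in occurrences]
    ys = [math.log(f) for f in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope and intercept of the straight line in log-log coordinates.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # The slope of the log-log line equals -b; the intercept equals log a.
    return math.exp(intercept), -slope


# Synthetic, noise-free data generated from a known power law
# (hypothetical values, not taken from the study).
occ = [1, 2, 4, 8, 16, 32]
freq = [100.0 * i ** -0.75 for i in occ]
a, b = fit_power_law(occ, freq)
print(f"a = {a:.3f}, b = {b:.3f}")
```

Because such a fit is unweighted, every point on the plot contributes equally, so a region of the plot that is dense in points has a correspondingly large influence on the fitted exponent.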
The power factor for the distribution of frequencies of genomic occurrence of protein topologies has been established as b = 0.409.

The estimated values of the b exponent appeared to be outside the typical range (1.5 to 3) characteristic of scale-free systems. Figures 7 and 8 demonstrate that the fitted $f(i) = a i^{-b}$ double logarithmic functions clearly deviate from the upper end of the distribution trends, favoring lowered magnitudes of b. The obvious reason for such behavior is that the majority of points in Figures 7, 8 are concentrated at the lower parts of the distributions (the areas corresponding to higher genomic occurrence). Such "tailing" puts a significant statistical weight on the lower part of the funnel-like dependence, so the trend line tends to fit the majority of the data points at the bottom of a graph. Similar deviations of power-law trend lines can also be recognized in Figures 9, 10, 11, illustrating the genome occurrences of protein homologous superfamilies within the three superkingdoms.

However, it should be noted that double logarithmic dependences estimated for superkingdoms have a more pronounced scale-free character. The corresponding power factors b for distribution of protein homologous superfamilies increase from -1.0976 for archaeal, through -0.8612 for eukaryotic, to -0.8182 for bacterial proteomes.

Parameters a and b of the power law dependences $f(i) = a i^{-b}$ relating the genomic occurrences $i$ with the frequencies $f(i)$ of protein homologous superfamilies and topologies have also been calculated for the distinct organisms and are presented in Table 4 (see additional file 4).

The numbers in the table illustrate that the magnitude of b can vary significantly among the species. Thus, Figure 12 plots the genomic occurrences of homologous superfamilies frequencies within the genomes of C. elegans and M. genitalium, where the difference in b factors for the two organisms (-1.96 and -0.97 respectively) can readily be recognized.

It is difficult to speculate at this point whether the estimated deviation of the power exponents from the scale-free range truly reflects specific aspects of distribution of protein folds in organisms, or merely results from the poor ability of the power function to describe funnel-like dependences.

It is clear, however, that the estimated results illustrate the need for development of new statistical functions describing protein fold distributions in a more accurate way than the conventional double logarithmic "frequency – genomic occurrence" dependences. The development of such new statistical functions and tools is currently underway. We expect that new statistical approaches will help us to answer the questions raised by the reported study.

Conclusions

We have analyzed the results of the large-scale automated threading procedure applied to complete proteomes of 6 eukaryotic, 25 bacterial and 6 archaeal organisms. The coverage and reliability of the unmodified full threading procedure have been assessed for large-scale automated protein structure predictions. The sequence-structure threading allowed satisfactory assignment of structures to more than 60% of eukaryotic, 68% of archaeal and 70% of the bacterial proteomes analyzed.

The fold recognition results have also been estimated for very high and lower levels of prediction confidence; the estimated accuracy, genomic coverage and CPU usage of the classical full threading have generally demonstrated that trading the lower processing speed of the method for its higher quality of predictions may not be justified for large scale studies.

The current work relying on sequence-structure threading has identified the most abundant and unique CATH 2.4 folds in individual species and superkingdoms. The 3-layer (aba) sandwich has been characterized as the most abundant protein architecture and the Rossmann fold as the most common topology.

The results highlight similarities and differences in the protein compositions of eukaryotes, bacteria and archaea. It has been found that eukaryotes share a significant portion of their protein repertoires with bacteria, which illustrates the intensity of their ancestral relationships. The protein composition of archaeal organisms was characterized as being quite distinct and generally missing low complexity and irregular protein structures.

It has been found that the distributions of protein homologous superfamilies and topologies in the studied organisms and superkingdoms obey the power law dependence characteristic of scale-free systems. The corresponding double logarithmic "frequency – genomic occurrence" dependences characteristic of scale-free systems have been established for individual organisms and for the three superkingdoms.

Methods

Threading has been carried out by the THREADER2 [28] program with default parameters. The CATH 2.4 fold assembly has been used as a library of standard folds.

The large-scale threading has been conducted on a Beowulf cluster with 52 dual processor blades (2 × 1 GHz, 1 GB RAM).
The automated control has been implemented by PVM-supported Perl scripts.

The threading results have been stored and manipulated within a MySQL database.

Authors' contributions

SJ has developed the general concept of the work and participated in drawing the conclusions; AC has performed the fold prediction and carried out all the calculations.

Additional material

Additional File 1. Accuracy of structural characterization of distant genomes by threading. [http://www.biomedcentral.com/content/supplementary/1471-2105-5-37-S1.doc]
Additional File 2. Distribution of major classes of proteins from distinct organisms. [http://www.biomedcentral.com/content/supplementary/1471-2105-5-37-S2.doc]
Additional File 3. The most abundant folds for the studied organisms. [http://www.biomedcentral.com/content/supplementary/1471-2105-5-37-S3.doc]
Additional File 4. List of genomic properties characterizing the power law distribution of protein hierarchies. [http://www.biomedcentral.com/content/supplementary/1471-2105-5-37-S4.doc]

Acknowledgements

The work has been funded by the Vancouver Hospital and Health Sciences Centre research award for AC and by the Functional Pathogenomics of Mucosal Immunity project, funded by Genome Prairie, Genome BC and their industry partners, Inimex Pharmaceuticals and Pyxis Genomics. SJMJ is a Michael Smith Foundation for Health Research Scholar.

References

1. [http://www.pathogenomics.bc.ca/cgi-bin/threader/threader_folds.cgi].
2. Yanai I, Wolf Y, Koonin EV: Evolution of gene fusions: horizontal transfer versus independent events. Genome Biology 2002, 3(5):1-0024.
3. Hegyi H, Lin J, Greenbaum D, Gerstein M: Structural genomics analysis: characteristics of atypical, common and horizontally transferred folds. Proteins: Struct Funct Genet 2002, 47:126-141.
4. Koonin EV, Wolf YI, Karev GP: The structure of the protein universe and genome evolution. Nature 2002, 420:218-223.
5. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The protein data bank. Nucl Acids Res 2000, 28:235-242.
6. Lo Conte L, Brenner SE, Hubbard TJ, Chothia C, Murzin AG: SCOP database in 2002: refinements accommodate structural genomics. Nucleic Acids Res 2002, 30:264-267.
7. Orengo CA, Jones DT, Thornton JM: Protein superfamilies and domain superfolds. Nature 1994, 372:631-634.
8. Gibrat JF, Madej T, Bryant SH: Surprising similarities in structure comparison. Curr Opin Struct Biol 1996, 6:377-385.
9. Holm L, Sander C: Dali/FSSP classification of three-dimensional protein folds. Nucleic Acids Res 1997, 25:231-234.
10. Orengo CA, Flores TP, Taylor WR, Thornton JM: Identification and classification of protein fold families. Protein Eng 1993, 6:485-500.
11. Schmidt R, Gerstein M, Altman R: LPFC: an internet library of protein family core structures. Protein Sci 1997, 6:246-248.
12. Madej T, Gibrat J-F, Bryant SH: Threading a database of protein cores. Proteins 1995, 23:356-369.
13. Brenner SE, Koehl P, Levitt M: The ASTRAL compendium for sequence and structure analysis. Nucl Acids Res 2000, 28:254-256.
14. Gough J, Karplus K, Hughey R, Chothia C: Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J Mol Biol 2001, 313:903-919.
15. Russell RB, Saqi MAS, Bates PA, Sayle RA, Sternberg MJE: Recognition of analogous and homologous protein folds – assessment of prediction success and associated alignment accuracy using empirical substitution matrices. Protein Engineering 1998, 11:1-9.
16. Bowie JU, Luthy R, Eisenberg D: A method to identify protein sequences that fold into a known three-dimensional structure. Science 1991, 253:164-170.
17. Bates A, Jackson RM, Sternberg MJE: Genomes, Molecular Biology and Drug Discovery. London, Academic Press; 1996.
18. Russell RB, Copley RR, Barton GJ: Protein fold recognition by mapping predicted secondary structure. J Mol Biol 1996, 259:349-365.
19. Rice DW, Eisenberg D: A 3D-1D substitution matrix for protein fold recognition that includes predicted secondary structure of the sequence. J Mol Biol 1997, 267:1026-1038.
20. Rost B, Schneider R, Sander C: Protein fold recognition by prediction-based threading. J Mol Biol 1997, 270:471-480.
21. Defay TR, Cohen FE: Multiple sequence information for threading algorithms. J Mol Biol 1996, 262:314-323.
22. Schaffer AA, Wolf YI, Ponting CP, Koonin EV, Aravind L, Altschul SF: IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices. Bioinformatics 1999, 15:1000-1011.
23. Machalek AZ: Structural genomics: a slice of the proteomics pie. ASM News 2001, 67:441-446.
24. Godzik A, Skolnick J: Sequence-structure matching in globular proteins: application to supersecondary and tertiary structure determination. Proc Natl Acad Sci 1992, 89:12098-12102.
25. Bryant SH, Altschul SF: Statistics of sequence-structure threading. Curr Opin Struct Biol 1995, 5:236-244.
26. Murzin AG, Bateman A: Distant homology recognition using structural classification of proteins. Proteins 1997, Suppl 1:105-112.
27. Jones DT, Miller RT, Thornton JM: Successful protein fold recognition by optimal sequence threading validated by rigorous blind testing. Proteins 1995, 23:387-397.
28. Jones DT, Taylor WR, Thornton JM: A new approach to protein fold recognition. Nature 1992, 358:86-89.
29. Taylor WR: Multiple sequence threading: an analysis of alignment quality and stability. J Mol Biol 1997, 269:902-943.
30. Levitt M: Competitive assessment of protein fold recognition and alignment accuracy. Proteins 1997, Suppl 1:92-104.
31. Jones DT: GenTHREADER: an efficient and reliable protein fold recognition method for genome sequences. J Mol Biol 1999, 287:797-815.
32. Wolf YI, Aravind L, Koonin EV: Rickettsiae and Chlamydiae: evidence of horizontal gene transfer and gene exchange. Trends Genet 1999, 15:173-175.
33. Wolf YI, Brenner SE, Bash PA, Koonin EV: Distribution of protein folds in the three superkingdoms of life. Genome Research 1999, 9:17-26.
34. Luscombe NM, Qian J, Zhang Z, Johnson T, Gerstein M: The dominance of the population by a selected few: power-law behavior applies to a wide variety of genomic properties. Genome Biology 2002, 3(8):0040.1-0040.7.
35. Qian J, Luscombe NM, Gerstein M: Protein family and occurrence in genomes: power-law behavior and evolutionary model. J Mol Biol 2001, 313:673-681.
36. [http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY].
37. [http://bioinf.cs.ucl.ac.uk/psipred].
38. [http://www.sbg.bio.ic.ac.uk/3dgenomics].
39. Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM: CATH – A Hierarchic Classification of Protein Domain Structures. Structure 1997, 5:1093-1108.
40. Peitsch MC: PROMOD and SWISS-MODEL – Internet-based tools for automated comparative protein modeling. Biochem Soc Trans 1996, 24:274-279.
41. Peitsch MC, Wilkins MR, Tonella L, Sanchez JC, Appel RD, Hochstrasser DF: Large scale protein modeling and integration with the SWISS-PROT and SWISS-2DPAGE databases: the example of Escherichia coli. Electrophoresis 1997, 18:498-501.
42. Marti-Renom MA, Stuart AC, Fiser A, Sanchez R, Melo F, Sali A: Comparative protein structure modeling of genes and genomes. Ann Rev Biophys Biomol Struct 2000, 29:291-325.
43. Sanchez R, Sali A: Large-scale protein structure prediction of the Saccharomyces cerevisiae genome. Proc Natl Acad Sci USA 1998, 95:13597-13602.
44. Fischer D, Eisenberg D: Assigning folds to the proteins encoded by the genome of Mycoplasma genitalium. Proc Natl Acad Sci USA 1997, 94:11929-11934.
45. Muller A, MacCallum RM, Sternberg MJE: Structural characterization of human proteome. Genome Research 2002, 12:1625-1641.
46. Iliopoulos I, Tsoka S, Andrade MA, Janssen P, Audit B, Tramontano A, Valencia A, Leroy C, Sander C, Ouzounis CA: Genome sequences and great expectations. Genome Biology 2001, 2:0001. INTERACTIONS. Epub 2000.
47. Gerstein M, Levitt M: A structural census of the current population of protein sequences. Proc Natl Acad Sci 1997, 94:11911-11916.
48. Apic G, Huber W, Teichmann SA: Multi-domain protein families and domain pairs: comparison with known structures and a random model of domain recombination. J Struct Funct Genomics 2003, 4:67-78.
49. Teichmann SA, Park J, Chothia C: Structural assignments to the Mycoplasma genitalium proteins show extensive gene duplications and domain rearrangements. Proc Natl Acad Sci USA 1998, 95:14658-14663.
50. Brinkman FSL, Blanchard JL, Cherkasov A, Av-Gay Y, Brunham RC, Fernandez RC, Finlay BB, Otto SP, Ouellette BF, Keeling PJ, Rose AM, Hancock REW, Jones SJM: Evidence that plant-like genes in Chlamydia species reflect an ancestral relationship between Chlamydiaceae, cyanobacteria and the chloroplast. Genome Research 2002, 12:1159-1167.
51. Rzhetsky A, Gomez SM: Birth of scale-free molecular networks and the number of distinct DNA and protein domains per genome. Bioinformatics 2001, 17:988-996.
52. Yanai I, Camacho CJ, DeLisi C: Predictions of gene family distributions in microbial genomes: evolution by gene duplication and modification. Phys Rev Lett 2000, 85:2641-2644.

