UBC Faculty Research and Publications

An approach to large scale identification of non-obvious structural similarities between proteins Cherkasov, Artem; Jones, Steven J May 17, 2004


BMC Bioinformatics (BioMed Central)
Open Access
Research article

An approach to large scale identification of non-obvious structural similarities between proteins
Artem Cherkasov*1 and Steven JM Jones2

Address: 1Division of Infectious Diseases, Department of Medicine, Faculty of Medicine, University of British Columbia, Vancouver, British Columbia, Canada and 2Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, Canada
Email: Artem Cherkasov* - artc@interchange.ubc.ca; Steven JM Jones - sjones@bcgsc.ca
* Corresponding author

Abstract
Background: A new sequence independent bioinformatics approach allowing genome-wide search for proteins with similar three dimensional structures has been developed. By utilizing the numerical output of the sequence threading it establishes putative non-obvious structural similarities between proteins. When applied to the testing set of proteins with known three dimensional structures the developed approach was able to recognize structurally similar proteins with high accuracy.

Results: The method has been developed to identify pathogenic proteins with low sequence identity and high structural similarity to host analogues. Such protein structure relationships would be hypothesized to arise through convergent evolution or through ancient horizontal gene transfer events, now undetectable using current sequence alignment techniques. The pathogen proteins, which could mimic or interfere with host activities, would represent candidate virulence factors. The developed approach utilizes the numerical outputs from the sequence-structure threading. It identifies the potential structural similarity between a pair of proteins by correlating the threading scores of the corresponding two primary sequences against the library of the standard folds.
This approach allowed up to 64% sensitivity and 99.9% specificity in distinguishing protein pairs with high structural similarity.

Conclusion: Preliminary results obtained by comparison of the genomes of Homo sapiens and several strains of Chlamydia trachomatis have demonstrated the potential usefulness of the method in the identification of bacterial proteins with known or potential roles in virulence.

Published: 17 May 2004
Received: 02 February 2004
Accepted: 17 May 2004
BMC Bioinformatics 2004, 5:61
This article is available from: http://www.biomedcentral.com/1471-2105/5/61
© 2004 Cherkasov and Jones; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.

Background
Pathogen proteins often manipulate host cellular functions by mimicking host activities. In some cases, mimicry is achieved through virulence factors that are direct homologues of host proteins that have been incorporated into the genome of the pathogen through horizontal gene transfer (HGT) [1,2]. In others, convergent evolution has produced new effectors that, although having no obvious amino acid sequence similarity to host factors, mimic them at the structural level [3].

Our recent research was conducted on the discovery of novel bacterial virulence factors through identification of pathogen genes that share a higher degree of sequence similarity to host genes than would otherwise be expected based on their phylogeny, suggesting their likely acquisition by HGT [4]. To achieve this objective we developed novel bioinformatics tools to identify genes in complete bacterial genomes, which may be cases of HGT from eukaryotes.
Based on a combined analysis of 136,195 genes from 36 bacterial and eukaryotic genome sequences [4], we identified no definitive cases of "recent" (defined as approximately since the divergence of mammals from other amniotes) HGT between bacteria and multicellular eukaryotes, including human genes recently sequenced in the Human Genome Project [5].

We have established that, within the limitations of the dataset used, there was a notable lack of genes in the human and other genomes of multicellular eukaryotes that were highly similar to genes from any bacterial species examined. While this analysis did show that bacterial pathogens do contain "host-like" genes that may function as mimics, for the most part these appear to be primarily cases of either maintenance of an orthologous gene that was lost in other lineages, or ancient HGT [4].

The extent to which convergent evolution has played a role in the evolution of pathogens, as a mechanism alternative to HGT by which pathogens acquire host-mimicking virulence factors, has yet to be established. Such genes and their corresponding proteins would usually have sequences distinct from those of the molecule they mimic, but would typically have evolved to imitate, at least in part, the shape and critical chemical groups on the surface of the functional homologues.

In the present work we describe our new efforts to identify pathogen genes encoding proteins that have low sequence identity but potentially high structural similarity with host proteins. The hypothesis is that, under selective pressure, pathogen genes have evolved to encode proteins that functionally mimic host proteins independently of significant primary sequence similarity.
Such bacterial proteins mimicking host functions can, therefore, be considered as potential virulence factors.

Results and Discussion
A genome-wide search for pathogen proteins that have low sequence similarity but significant structural resemblance to host proteins represents an opportunity for new insight into infectious agents, host cell biology and the mechanisms of pathogenesis. Theoretically, such a task should require comprehensive modeling of the three-dimensional (3D) structures of proteins, which is not yet achievable with useful accuracy, nor is it computationally feasible on the scale of thousands of sequences.

The existing methods of fold recognition can broadly be divided into two types. In the first, the information is represented in linear form, called a profile, which is based on empirically derived scores for the expected occurrence of residues in a particular structure [6-13]. This type of approach is relatively rapid, but an unknown protein can only be characterized if it has reasonable sequence similarity with protein(s) of known structure. The second strategy is threading, which involves using pair potentials that score the likelihood of two residues being at a certain distance. This approach is based upon the assumption that nature has made certain economic decisions wherein countless different proteins fold into a limited number of shapes (estimated at approximately 4000 [14]) and that nearly all natural protein structures can be described based upon these shapes. Threading attempts to assign folds for a protein of unknown structure by sampling it onto each member of a library of known folds using pseudo-energy as a measure of fit [15-20]. Threading approaches have been shown to make accurate predictions even in a "twilight zone" of <25% sequence identity, where sequence-based approaches normally fail [21].

Presently, however, neither profile-based nor threading-based approaches are capable of direct identification of structurally similar proteins from two different sources (such as distinct organismal protein datasets).

The method
In order to allow structural comparisons to be made across genomes (where primary sequence identity is limited) we have adopted an indirect approach to identifying potential protein structural similarities, based upon a broader utilization of the numerical outputs derived from threading applications. For each raw sequence threaded onto the available 1893 model folds using the THREADER2 package [18] we derived Z scores representing the weighted sum of pairwise and solvation energies. Our hypothesis is that each sequence will have its own unique threading profile against a library of model folds and that, therefore, by correlating these "fingerprints" for two linear sequences it is possible to estimate the degree of potential 3D structural similarity. To support this argument, we have examined whether complete sets of threading scores do indeed correlate for two structurally similar proteins. A dataset of 866,631 selected pairwise alignments of protein chains with "possible biologically interesting similarities", covering the whole 0–100% range of sequence similarity, has been used for the study.

This dataset has been generated by an "all against all" comparison of protein chains in PDB by the authors of the combinatorial extension (CE) algorithm [22,23]. The CE approach has been shown to accurately identify the similarity of protein structures using a dynamic programming algorithm determining the RMSD and a significance score "Z" for the optimal structural alignment. The publicly available CE dataset includes the alignments of proteins (with length difference no more than 30%) corresponding to a Z value above the 3.5 threshold. Notably, 783,841, or more than 90%, of the sequence alignments in the set have sequence identity below 20%. This circumstance makes the set very favorable for testing the threading-based approach.

We have downloaded the CE dataset from [24] and processed the protein chains with the THREADER2 package to produce their threading profiles against 1893 library folds. For every aligned pair of proteins from the dataset we have estimated the correlation between their threading scores. If a set of threading scores against a standard library is indeed sequence specific, then for proteins with known structures we should be able to observe a defined relationship between the coefficients of correlation of threading scores and the parameters of protein 3D structural similarity.

In Figure 1 (color-coded according to the density of points) the estimated squared coefficients of correlation between threading scores are plotted against RMSD values for 846,534 selected CE pairwise alignments of proteins with known structure. As can readily be seen, the meaningful correlations between threading scores (squared correlation coefficients R² above 0.7) correspond to higher quality structural alignments with lower RMSD.

An RMSD value of 2 Å is normally considered a threshold value distinguishing pairs of structurally similar proteins. Thus, as anticipated, the thresholds R² ~ 0.7 and RMSD ~ 2 Å clearly separate the two most populated areas, which correspond to pairs of proteins with low and high structural similarity: (R² < 0.7; RMSD > 2 Å) and (R² > 0.7; RMSD < 2 Å).

Figure 1. RMSD values of pairwise alignments of representative protein chains versus the corresponding parameters R.

A quantitative assessment of protein structural similarity is a rather ambiguous task: some structural alignments combine an alignment score Z above the threshold with excessively high RMSD values, while, conversely, there are a few well-superimposed protein pairs in the CE dataset with rather low Z. Therefore, to further support the previous observations, we have introduced an additional parameter (which we call the structure similarity score, SSS) calculated as the structural alignment score Z (ranging from 0 to 10) divided by the sum of the corresponding RMSD value and a factor of 10: SSS = Z/(10+RMSD). Thus, SSS is normalized to the [0–1] range, where 1 corresponds to a pair of completely similar protein structures superimposed with Z = 10 and RMSD = 0.

In Figure 2, the SSS parameters calculated for 846,534 alignments are plotted against the coefficients of correlation between threading scores.
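The two quantities compared in Figures 1 and 2 can be sketched in a few lines. This is only an illustration under stated assumptions: the text does not say which correlation coefficient was used, so the Pearson coefficient is assumed here, and the function names and the five-value toy profiles are hypothetical (a real profile would hold 1893 Z scores, one per library fold).

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length threading profiles (assumed measure)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / sqrt(var_x * var_y)


def structure_similarity_score(z, rmsd):
    """SSS = Z / (10 + RMSD), as defined in the text (Z in 0..10, RMSD in angstroms)."""
    return z / (10.0 + rmsd)


# Hypothetical toy profiles: Z scores of two sequences against five folds.
profile_a = [1.2, -0.4, 2.9, 0.3, -1.1]
profile_b = [1.0, -0.2, 3.1, 0.1, -0.9]

r = pearson_r(profile_a, profile_b)
r_squared = r * r  # compared against the R^2 ~ 0.7 threshold in the text
sss_identical = structure_similarity_score(10.0, 0.0)  # upper bound of SSS
```

With these toy profiles `r_squared` lies well above the 0.7 threshold, and `sss_identical` equals 1, the value the text assigns to a pair superimposed with Z = 10 and RMSD = 0.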
The points on the graph can be conventionally divided into three major areas of low (SSS < 0.4), medium (0.4 < SSS < 0.6) and high (SSS > 0.6) structural similarity. The graph indicates that the vast majority of protein pairs with correlated threading scores (R² > 0.7) fall into the areas of medium and high structural similarity.

Figure 2. Structure similarity scores (SSS) values for pairwise alignments of representative protein chains versus the corresponding parameters R.

To assess the distinguishing power of the R² cutoff we have estimated the sensitivity and specificity of the approach in recognizing pairs of superimposed proteins with RMSD < 2 Å. The calculated sensitivity and specificity parameters are plotted in Figure 3 for the entire [0–1] range of R². According to the estimated numbers (true negative predictions (TN): 790,630; true positive predictions (TP): 18,060; false negative predictions (FN): 28,623; false positive predictions (FP): 9,221) the approach achieves 99% specificity, TN/(FP+TN), in distinguishing protein pairs with RMSD < 2 Å when the threshold R² = 0.68 is used. The corresponding sensitivity, TP/(TP+FN), of the method reaches 38.7%. The positive predictive value, TP/(TP+FP), and the negative predictive value, TN/(TN+FN), are 66.2% and 96.5% respectively. A similar evaluation of protein pairs with a medium or high degree of similarity (SSS > 0.4) also relates the 99% specificity level to the R = 0.68 threshold, while the corresponding sensitivity can be estimated as 47% (Figure 4). The sensitivity of the developed approach improves as the structural similarity criterion becomes stricter.
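The ratios quoted above for the RMSD < 2 Å criterion follow directly from the four confusion counts. A minimal sketch that recomputes them from the reported TP, TN, FP and FN; the function name `confusion_metrics` and the dictionary keys are illustrative choices, not part of the original work.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Return the four ratios defined in the text from raw confusion counts."""
    return {
        "sensitivity": tp / (tp + fn),  # TP/(TP+FN)
        "specificity": tn / (tn + fp),  # TN/(FP+TN)
        "ppv": tp / (tp + fp),          # positive predictive value, TP/(TP+FP)
        "npv": tn / (tn + fn),          # negative predictive value, TN/(TN+FN)
    }


# Counts reported in the text for the R^2 = 0.68 threshold and RMSD < 2 angstroms.
metrics = confusion_metrics(tp=18060, tn=790630, fp=9221, fn=28623)

for name, value in metrics.items():
    print(f"{name}: {100 * value:.1f}%")
```

Run on the reported counts, this gives a sensitivity of 38.7%, a positive predictive value of 66.2% and a negative predictive value of 96.5%, matching the text; the specificity evaluates to 98.8%, which the text rounds to 99%.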
For highly similar proteins (alignments producing SSS > 0.6) from the CE dataset the sensitivity reaches 92%.

Thus, it is feasible to conclude that the coefficient of correlation between 1893 threading scores for 2 raw sequences can adequately indicate putative three-dimensional similarity between the corresponding protein structures. When the threshold R² = 0.68 is used the general accuracy of the developed approach, (TP+TN)/(TP+TN+FP+FN), is 95.1%.

The training set has also been used to estimate the receiver operating characteristic (ROC) plots. The ROC plot is obtained by plotting all sensitivity values (true positive fraction) on the y axis against their equivalent (1 − specificity) values (false positive fraction) on the x axis for all available thresholds. The area under the ROC curve (AUC) is usually taken as an important single measure of the overall accuracy of an approach that does not depend on a particular threshold.

We have plotted the sensitivity of the developed approach against the corresponding (1 − specificity) values over the 0–1.0 range of the R threshold with a step of 0.01. The calculation has been conducted on the training set of low sequence similarity alignments. The resulting ROC curve is presented in Figure 5. As can be seen, the area under the resulting ROC curve covers more than half of the chart, reflecting the fact that the approach performs better than chance. Therefore, the developed approach has valid general predictive power and thus provides an opportunity to investigate potential structural similarity between two proteins without actual modeling of their structures.

Figure 3. Sensitivity and selectivity of the developed approach in distinguishing meaningful (RMSD < 2 Å) protein structure alignments.

Figure 4. Sensitivity and selectivity of the developed approach in distinguishing meaningful (SSS > 0.4) protein structure alignments.

Figure 5. ROC plot describing the ability of the approach to distinguish superimposed protein pairs with RMSD below the 2 Å threshold.

In addition, it should also be pointed out that the developed approach does not impose a requirement for high quality assignment of a sequence to particular fold(s) by threading. In fact, one or even both proteins may not be assigned by THREADER2 to any known fold, but their resulting threading profiles can still be used to reveal the existing structural similarity.

To illustrate this point, the threading scores for the sequences of staphylokinases 1BUI:C and 2SAK are presented in Figure 6 as histograms. As can readily be seen, neither of these sequences could be assigned to a certain fold, as the minimal threshold for reliable fold recognition is 3.5 (displayed as a horizontal bar in Figure 6).
At the same time, the structures of proteins 1BUI:C and 2SAK can be superimposed with RMSD = 1.3 Å and therefore are very similar.

Despite the fact that neither 1BUI:C nor 2SAK could be assigned to a certain fold, their threading profiles possess a correlation coefficient, R, of 0.83, clearly indicating high structural similarity between 1BUI:C and 2SAK.

Figure 6. Z values of pseudo energies of threading of protein chains 1BUI:C and 2SAK through 1893 CATH model folds.

The developed approach has further been applied to a random set of proteins with known structure. Using the CE program, 1800 randomly selected proteins from the Protein Data Bank (PDB) have been superimposed on an "all against all" basis to generate 3,240,000 redundant structural alignments (a small fraction of the alignments could not be produced). At the same time, all 1800 corresponding sequences have been processed with the THREADER2 package and the generated threading score datasets have been cross-correlated to produce 3,222,731 correlation coefficients.

The generated RMSD values for random pairwise structural alignments are plotted against the corresponding R² parameters in Figure 7. The shape of the graph resembles the previously obtained well-like "RMSD vs R²" dependence for the selected CE dataset. The major areas of true positive and true negative observations for the random dataset can be separated by an R² threshold value of 0.5. A similar cutoff level can be observed in Figure 8, representing the relationship between SSS and R² parameters for the 3,222,731 random structural alignments under consideration. The areas of protein alignments with low (SSS < 0.4) and medium (0.4 < SSS < 0.6) structural similarity are clearly separated by the R² ~ 0.5 threshold.
Very few highly similar proteins (with SSS > 0.6) have been observed within the random dataset.

Thus, when applied to the random dataset of proteins with known 3D structures, the developed approach (operating at the R² = 0.68 cutoff value) has a sensitivity of 50% for the alignments with RMSD < 2 Å (651 FP, 23,432 FN, 23,765 TP and 3,174,891 TN) and 64% if SSS = 0.4 is used as the criterion of similar structures. The specificity in both cases remains at the 99.9% level. On the random dataset of aligned protein structures the R² = 0.68 threshold value identifies protein pairs with SSS > 0.6 with a sensitivity of 97%. When the SSS value reaches >= 0.7, both the sensitivity and the specificity of the developed approach stay around 99% (Figure 9).

Figure 7. RMSD values of pairwise alignments of randomly selected protein chains versus the corresponding parameters R.

The results obtained on the selected and random datasets of proteins with known structures allow us to conclude that the developed approach is able to identify, with reasonable accuracy, proteins with medium and high levels of structural similarity. To address the question of whether the developed approach is sequence dependent, we have estimated its sensitivity and selectivity in distinguishing structurally similar proteins (with SSS > 0.6) at the 0–20%, 20–40% and 40–60% levels of sequence identity. The corresponding results, presented in Figure 10, illustrate that the predictive power of the developed approach varies with the level of protein sequence similarity. Apparently, the threading profiles of structurally similar proteins resemble each other less as their sequence identity drops.
This observation is somewhat contradictory, since threading is considered to be independent of sequence identity information.

Considering the specific need for the approach to recognize structurally similar proteins with low sequence identity, and taking into account the large number of protein alignments with SSS < 0.6 (negative counts) in the investigated CE database, we have compiled an additional training set. The set only included protein pairs with low sequence identity (<20%) and had equal representation of protein alignments with SSS below and above the 0.6 threshold.

Figure 8. Structure similarity scores (SSS) for pairwise alignments of randomly selected protein chains versus the corresponding parameters R.

Figure 9. Distinguishing power of the developed approach at different levels of protein structural similarity.

Thus, 244 pairwise sequence alignments with low similarity have been extracted from the CE set. This comprised all 122 alignments with SSS above the 0.6 threshold and 122 randomly selected ones with SSS < 0.6. The use of the R = 0.68 threshold has yielded the following predictions: TP: 44, TN: 122, FP: 0, FN: 78. These correspond to 36% sensitivity, TP/(TP+FN), 100% specificity, TN/(FP+TN), 100% positive predictive value, TP/(TP+FP), and 61% negative predictive value, TN/(TN+FN).

The estimated numbers allow us to conclude that the developed approach, utilizing the quantitative outputs of threading, possesses useful sensitivity and specificity in recognizing proteins with low sequence identity (below 20%) and high structural similarity (SSS > 0.6).
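The ROC analysis used in the validation above (sensitivity against 1 − specificity for R cutoffs from 0 to 1.0 in steps of 0.01, summarized by the area under the curve) can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function names, the toy scores and the toy labels are hypothetical, with a label of 1 standing in for a structurally similar pair (for example RMSD < 2 Å).

```python
def roc_points(scores, labels, step=0.01):
    """Return sorted (false positive rate, sensitivity) pairs for cutoffs 0..1."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need both positive and negative examples")
    n_steps = int(round(1.0 / step))
    points = []
    for i in range(n_steps + 1):
        cutoff = i * step
        # A pair is predicted "similar" when its score reaches the cutoff.
        tp = sum(1 for s, y in zip(scores, labels) if y and s >= cutoff)
        fp = sum(1 for s, y in zip(scores, labels) if not y and s >= cutoff)
        points.append((fp / negatives, tp / positives))
    points.append((0.0, 0.0))  # cutoff above every score: nothing predicted similar
    return sorted(points)


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical toy data: correlation scores and similarity labels for 8 pairs.
toy_scores = [0.95, 0.80, 0.72, 0.40, 0.65, 0.30, 0.20, 0.10]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_auc = auc(roc_points(toy_scores, toy_labels))
```

For this toy set 15 of the 16 positive-negative pairs are ranked correctly, so `toy_auc` is 0.9375; a value of 0.5 would correspond to chance performance.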
This makes it suitable for genome-scale studies.

Identification of structural homologues between H. sapiens and C. trachomatis proteins
The developed approach has been used to test the hypothesis that the pathogenicity of microorganisms can depend on the number of proteins in their genomes that mimic the structures of host analogues. In order to evaluate this assumption we have examined human structural homologues among the proteins of Chlamydia trachomatis. First of all, the currently available proteomes of Homo sapiens and Chlamydia trachomatis strain D (30585 and 894 entries respectively) have been processed with THREADER2.

The generated threading profiles of human and Chlamydia proteins have then been compared using the developed approach, aiming to produce 30585 * 894 = 27,342,990 parameters R. All proteins from the two genomes have also been compared in an "all-against-all" manner for sequence similarity to identify those pairs with no sequence homology but similar threading profiles (high R scores).

Overall, we were able to produce 25,649,384 pairwise comparisons of threading profiles of human and Chlamydia proteins (some short sequences have been rejected by the threading). Out of these, only 636 protein alignments produced a sequence similarity value above 20%. Among these 636 pairs of similar proteins, 86 (or 13.5 percent) have an R parameter above the 0.68 threshold. The fraction of structurally similar proteins among those with low sequence identity (<20%) is 5.5 percent: 1,409,914 out of 25,648,748 alignments. This is 1.5-fold higher than the proportion of potentially similar proteins found in the training set of the CE sequence alignments (27,281 (FP+TP) out of the total of 846,534, or 3.2 percent). This is an interesting finding, considering that the CE set of "possible biologically interesting similarities" is already heavily enriched with structurally similar proteins. On the other hand, these findings demonstrate that the CE training set we have used for the threshold estimation can be used as a rather adequate representation of a bacterial genome.

We have also compared the estimated positive count ratio of 5.5 percent with the corresponding number for randomly sampled PDB chain alignments. In this case we have found a much more significant, 8-fold difference: 5.5 versus 0.7 percent (the latter can be calculated as the sum of 23,765 true positive and 651 false positive predictions for 3,222,731 random sequence alignments). Such a rather elevated occurrence of Chlamydia proteins with potential structural similarity to human counterparts may illustrate the importance of convergent evolution.

From the pool of human and Chlamydia protein alignments we have identified 40 pairs of single domain proteins with no detected sequence similarity (at E = 0.00001) and the highest R scores, which are presented in Table 1. Multi-domain proteins have not been considered in the study, to simplify the exercise.

If our assumptions about the convergent nature of bacterial virulence mimicry are correct, then we would expect some chlamydial virulence factors to be found in the table. Evaluation of the table indicates that a large part of the presented Chlamydia proteins has indeed been previously identified as potential virulence factors. Thus, five Chlamydia trachomatis putative outer membrane proteins (F, H, I, E and A types, corresponding to table entries 2, 15, 17, 25 and 27 respectively) have been detected by the developed approach as top virulence candidates. An important role of these proteins in Chlamydia antigenic polymorphism has been previously underlined in [25].

Figure 10. Distinguishing power of the developed approach at different levels of protein sequence identity.

Among other proteins from Table 1, gi3328486gbAAC67681.1 (entry 26) is known as a Chlamydia virulence factor responsible for pathogen survival in a Ca-deficient environment; protein gi3328822gbAAC67993.1 (entry 29) is a heat shock protein, one of the potential Chlamydia virulence factors; protein YopC (entry 35) is involved in secretion of pathogenic genetic material.

Chlamydia is an "energy parasite" [25], importing ATP from host cells. Thus, it came as no surprise that the ATP transport protein gi3328511gbAAC67704.1 (kinase fold) has also been identified as a potential virulence factor (entry 10). Two other potential Chlamydia virulence factors perform transport functions: transpeptidase gi3329155gbAAC68296.1 and protein translocase gi3329345gbAAC68468.1 (entries 12 and 3 respectively).

The majority of the other Chlamydia proteins presented in Table 1 can be divided into proteases (entries 23, 31, 37: proteases; entry 34: metalloprotease) and proteins related to DNA transcription (entries 6, 8: transcription proteins; 20: nucleotide transport; 19: DNA isomerase; 7, 9, 40: ribonucleases, riboreductase; 4, 13, 18, 32: Gly, isoLeu, Ala and Leu tRNA synthetases).

Thus, the preliminary results allow us to conclude that out of the 33 top Chlamydia hits with assigned functions presented in the table, up to 11 proteins either have been previously identified as pathogenic virulence factors or possess defined virulence characteristics.
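The all-against-all comparison of host and pathogen threading profiles described in this section can be sketched as below. The sketch assumes the Pearson coefficient as the correlation measure (the text does not name one) and relies on the fact that the dot product of two centred, unit-length vectors equals their Pearson correlation, so each profile is standardized once and reused for every pair. The function names, protein identifiers and four-value toy profiles are hypothetical, and the separate sequence similarity filter applied in the study is not reproduced here.

```python
from math import sqrt


def standardize(profile):
    """Centre a threading profile and scale it to unit length."""
    n = len(profile)
    mean = sum(profile) / n
    centred = [v - mean for v in profile]
    norm = sqrt(sum(c * c for c in centred))
    if norm == 0:
        raise ValueError("a constant profile has no defined correlation")
    return [c / norm for c in centred]


def cross_correlate(host_profiles, pathogen_profiles, r_cutoff=0.68):
    """Return (host_id, pathogen_id, R) for all pairs with R above the cutoff."""
    host_std = {k: standardize(v) for k, v in host_profiles.items()}
    pathogen_std = {k: standardize(v) for k, v in pathogen_profiles.items()}
    hits = []
    for host_id, h in host_std.items():
        for pathogen_id, p in pathogen_std.items():
            r = sum(a * b for a, b in zip(h, p))  # Pearson R of the raw profiles
            if r > r_cutoff:
                hits.append((host_id, pathogen_id, r))
    return sorted(hits, key=lambda hit: -hit[2])  # highest R first


# Hypothetical toy profiles; real ones would hold 1893 Z scores each.
host = {"H1": [1.0, 2.0, 3.0, 4.0], "H2": [4.0, 1.0, 3.0, 2.0]}
pathogen = {"P1": [2.0, 4.1, 5.9, 8.0], "P2": [3.0, 1.0, 2.0, 1.5]}

hits = cross_correlate(host, pathogen)
```

In this toy case only the pairs H1/P1 and H2/P2 pass the 0.68 cutoff. For the full comparison the number of pairs is the product of the two proteome sizes, 30585 * 894 = 27,342,990, so standardizing each profile once keeps the per-pair cost at a single dot product.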
The positive predictive value, TP/(TP+FP), of the approach above the separating threshold R² > 0.9 is as high as 92.33% (for 846,534 predictions: 1052 FP, 34,170 FN, 12,513 TP and 798,799 TN). Therefore, it is expected that most if not all of the 40 Chlamydia proteins presented in Table 1 can be reliably considered as structurally highly similar to their human counterparts.

To assess the actual ability of the developed approach to enrich for proteins attributable to virulence we need to evaluate how many virulence factors can be found by chance in a random pool of 33 Chlamydia proteins. This is not a trivial task, as it requires knowledge of the total number of virulence genes in the Chlamydia trachomatis genome. At the moment the exact virulence content of the Chlamydia trachomatis genome remains unknown, so we attempted its evaluation using available literature data.

Thus, an indirect justification for this number can be derived from the results of the work of Fields et al. (1986), who have experimentally identified 81 genes of Salmonella typhimurium responsible for its survival in professional phagocytes [26,27]. Taking, similarly to the previous guess, the real number of virulence factors to be twice as high, the hypothetical virulence content of the Salmonella typhimurium genome can be estimated at around 3.6% (162 out of 4451 genes).

Thus, by analogy, we may expect that about 4 percent of an average bacterial proteome can be assigned to virulence associated proteins. Therefore, there is roughly a 4 percent probability of randomly finding a virulence factor in an arbitrary pool of bacterial genes.

Based on that estimate, we may expect that among the 33 annotated Chlamydia trachomatis proteins presented in Table 1 one or two potential virulence factors could be identified by chance. The fact that there are about 11 of them demonstrates that the developed approach is indeed capable of a 6–10-fold enrichment for bacterial virulence factors.

Conclusions
Virulence factor candidates from bacteria and viruses having low sequence similarity and high 3D similarity to host proteins can be readily identified by the developed approach. Its sensitivity can be further improved as efforts to complete and organize the inventory of model folds succeed [14] (as has been mentioned, THREADER2 takes into account only the approximately 2000 known model folds, which cover only about 50% of the 4000 folds predicted). The developed approach is not only applicable to the identification of potential novel virulence factors in pathogen genomes, but may be broadly used for all kinds of protein similarity studies.

Methods
Sequence similarity search has been conducted with the BLAST program [28] with an E value of 0.00001.

Threading has been carried out by the THREADER2 [18] program with default parameters. The CATH v2.0 (November 2000) fold assembly has been used as the library of standard folds.

The human proteome has been downloaded from the ENSEMBL database; the proteome of Chlamydia trachomatis serovar D from the NCBI site.

Authors' contributions
SJ and AC have developed the general concept of the work and participated in drawing the conclusions; AC has performed the fold prediction and carried out all the calculations.

Acknowledgements
The work has been funded by the Vancouver Hospital and Health Sciences Centre research award for AC and by the Functional Pathogenomics of Mucosal Immunity project, funded by Genome Prairie, Genome BC and their industry partners, Inimex Pharmaceuticals and Pyxis Genomics. SJMJ is a Michael Smith Foundation for Health Research Scholar.

References
1. Davies J: Origins and evolution of antibiotic resistance. Microbiologia 1996, 12:9-16.
2. Wolf YI, Aravind L, Koonin EV: Rickettsiae and Chlamydiae: evidence of horizontal gene transfer and gene exchange. Trends Genet 1999, 15:173-175.
3. Stebbins CE, Galan JE: Maintenance of an unfolded polypeptide by a cognate chaperone in bacterial type III secretion. Nature 2001, 412:70-81.
4. Brinkman FSL, Blanchard JL, Cherkasov A, Av-Gay, Brunham RC, Fernandez RC, Finlay BB, Otto SP, Oullette BF, Keeling PJ, Hancock REW, Rose AM, Jones SJM: Evidence that plant-like genes in Chlamydia species reflect an ancestral relationship between Chlamydiaceae, cyanobacteria and the chloroplast. Genome Research 2002, 12:1159-1167.
5. International Human Genome Sequencing Consortium: Nature 2001, 409:860.
6. Russell RB, Saqi MAS, Bates PA, Sayle RA, Sternberg MJE: Recognition of analogous and homologous protein folds – assessment of prediction success and associated alignment accuracy using empirical substitution matrices. Protein Engineering 1998, 11:1-9.
7. Bowie JU, Luthy R, Eisenberg G: A method to identify protein sequences that fold into a known three-dimensional structure. Science 1991, 253:164-170.
8. Bates A, Jackson RM, Sternberg MJE: Genomes, Molecular Biology and Drug Discovery. Academic Press, London; 1996.
9. Russell RB, Copley RR, Barton GJ: Protein fold recognition by mapping predicted secondary structure. J Molec Biol 1996, 259:349-365.
10. Rice DW, Eisenberg G: A 3D-1D substitution matrix for protein fold recognition that includes predicted secondary structure of the sequence. J Molec Biol 1997, 267:1026-1038.
11. Rost B, Schneider R, Sander C: Protein fold recognition by prediction-based threading. J Molec Biol 1997, 270:471-480.
12. Defay TR, Cohen FE: Multiple sequence information for threading algorithms. J Mol Biol 1996, 262:314-323.
13. Schaffer AA, Wolf YI, Ponting CP, Koonin EV, Aravind L, Altschul SF: IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices. Bioinformatics 1999, 15:1000-1011.
14. Machalek AZ: ASM News 2001, 67:441-447.
15. Godzik A, Skolnick J: Sequence-structure matching in globular proteins: application to supersecondary and tertiary structure determination. Proc Natl Acad Sci 1992, 89:12098-12102.
16. Bryant SH, Altschul SF: Statistics of sequence-structure threading. Curr Opin Struct Biol 1995, 5:236-244.
17. Murzin AG, Bateman A: Distant homology recognition using
18. Jones DT, Taylor WR, Thornton JM: A new approach to protein fold recognition. Nature 1992, 358:86-89.
19. Jones DT, Miller RT, Thornton JM: Successful protein fold recognition by optimal sequence threading validated by rigorous blind testing. Proteins 1995, 23:387-397.
20. Taylor WR: Multiple sequence threading: an analysis of alignment quality and stability. J Molec Biol 1997, 269:902-943.
21. Levitt M: Competitive assessment of protein fold recognition and alignment accuracy. Proteins (Suppl) 1997, 1:92-104.
22. Shindyalov IN, Bourne PE: Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Engineering 1998, 11:739-747.
23. Shindyalov IN, Bourne PE: A database and tools for 3-D protein structure comparison and alignment using the Combinatorial Extension (CE) algorithm. Nucleic Acids Research 2001, 29:228-229.
24. CE Database [http://cl.sdsc.edu/ce.html]
25. Stephens RS, Kalman S, Lammel C, Fan J, Marathe R, Aravind L, Mitchell W, Olinger L, Tatusov RL, Zhao Q, Koonin EV, Davis RW: Genome sequence of an obligate intracellular pathogen of humans: Chlamydia trachomatis. Science 1998, 282:754-759.
26. Fields PI, Swanson RV, Haidaris CG, Heffron F: Mutants of Salmonella typhimurium that cannot survive within the macrophage are avirulent. Proc Natl Acad Sci 1986, 86:5189-5193.
27.
Gahring LC, Heffron F, Finlay BB, Falkow S: Invasion and Replica-tion of Salmonella typhimurium in Animal Cells. Infection andImmunity 1990, 58:443-448.28. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic localalignment search tool. J Mol Biol 1990, 215:403-410.Additional File 1Protein pairs from H. sapien and C. trachomatis with low sequence iden-tity and high structural similarity.Click here for file[http://www.biomedcentral.com/content/supplementary/1471-2105-5-61-S1.doc]yours — you keep the copyrightSubmit your manuscript here:http://www.biomedcentral.com/info/publishing_adv.aspBioMedcentralPage 11 of 11(page number not for citation purposes)structural classification of proteins. Proteins (Suppl) 1997,1:105-112.

