UBC Faculty Research and Publications

Relationship between insertion/deletion (indel) frequency of proteins and essentiality
Chan, Simon K; Hsing, Michael; Hormozdiari, Fereydoun; Cherkasov, Artem
Jun 28, 2007


BioMed Central
BMC Bioinformatics
Open Access
Research article

Relationship between insertion/deletion (indel) frequency of proteins and essentiality

Simon K Chan1,2, Michael Hsing2, Fereydoun Hormozdiari3 and Artem Cherkasov*4

Address:
1 CIHR/MSFHR Strategic Training Program in Bioinformatics, Canada's Michael Smith Genome Sciences Centre, 570 West 7th Ave – Suite 100, Vancouver, BC, V5Z 4S6, Canada
2 Bioinformatics Graduate Program, University of British Columbia, 570 West 7th Ave – Suite 100, Vancouver, BC, V5Z 4S6, Canada
3 School of Computing Science, Simon Fraser University, 8888 University Drive, Burnaby, BC, V5A 1S6, Canada
4 Division of Infectious Diseases, Faculty of Medicine, University of British Columbia, 2733 Heather Street, Vancouver, BC, V5Z 3J5, Canada

Email: Simon K Chan - sichan@bcgsc.ca; Michael Hsing - mhsing@interchange.ubc.ca; Fereydoun Hormozdiari - fhormozd@cs.sfu.ca; Artem Cherkasov* - artc@interchange.ubc.ca
* Corresponding author

Abstract

Background: In a previous study, we demonstrated that some essential proteins from pathogenic organisms contained sizable insertions/deletions (indels) when aligned to human proteins of high sequence similarity. Such indels may provide sufficient spatial differences between the pathogenic protein and human proteins to allow for selective targeting. In one example, an indel difference was targeted via large scale in-silico screening. This resulted in selective antibodies and small compounds which were capable of binding to the deletion-bearing essential pathogen protein without any cross-reactivity to the highly similar human protein. The objective of the current study was to investigate whether indels were found more frequently in essential than non-essential proteins.

Results: We have investigated three species, Bacillus subtilis, Escherichia coli, and Saccharomyces cerevisiae, for which high-quality protein essentiality data is available.
Using these data, we demonstrated with t-test calculations that the mean indel frequencies in essential proteins were greater than those of non-essential proteins in the three proteomes. The abundance of indels in both types of proteins was also shown to be accurately modeled by the Weibull distribution. However, Receiver Operating Characteristic (ROC) curves showed that indel frequencies alone could not be used as a marker to accurately discriminate between essential and non-essential proteins in the three proteomes. Finally, we analyzed the protein interaction data available for S. cerevisiae and observed that indel-bearing proteins were involved in more interactions and had greater betweenness values within Protein Interaction Networks (PINs).

Conclusion: Overall, our findings demonstrated that indels were not randomly distributed across the studied proteomes and were likely to occur more often in essential proteins and those that were highly connected, indicating a possible role of sequence insertions and deletions in the regulation and modification of protein-protein interactions. Such observations will provide new insights into indel-based drug design using bioinformatics and cheminformatics tools.

Published: 28 June 2007
BMC Bioinformatics 2007, 8:227 doi:10.1186/1471-2105-8-227
Received: 6 November 2006
Accepted: 28 June 2007
This article is available from: http://www.biomedcentral.com/1471-2105/8/227
© 2007 Chan et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

Essential genes encode products that are required for the viability of an organism.
There are two major reasons why there is considerable interest in determining the set of essential genes in an organism. Firstly, this will provide insights into the basic requirements needed to sustain a living cell. For example, the sequencing of the parasitic bacterium Mycoplasma genitalium [1] and the subsequent studies to determine its essential genes [2,3] have provided a more in-depth understanding of what constitutes a 'minimum genome.' Secondly, essential proteins in pathogens can potentially be excellent drug targets [4,5], as interfering with the proper functioning of one will likely interfere with an important pathway in the pathogen, thus reducing its threat to the host. However, targeting such essential proteins in a pathogen has one major drawback: essential proteins are often conserved across species, thus a drug that targets an essential protein in a pathogen may also cross-react with a similar host protein [6]. To combat this problem, our laboratory has recently developed a strategy to target insertions/deletions (indels) that occur among the proteins of a pathogen and its human host. For example, Leishmania donovani is a protozoan parasite that infects and inactivates the macrophages of its human host [7]. The main structural difference between the essential elongation factor (EF-1α) protein of L. donovani and that of its human host is a 12 amino acid deletion that occurs in the L. donovani sequence [7]. The 12 amino acid sequence corresponds to a hairpin loop that is present in the human protein, but absent in the L. donovani protein. Using computational chemistry and molecular docking, we were able to develop inhibitors that directly recognized the exposed region in the L. donovani protein without any cross-reactivity to the highly similar human host protein [6,8,9]. Interestingly, this deletion can potentially allow EF-1α from L.
donovani to gain an interaction, relative to human EF-1α, and interact with human tyrosine phosphatase, which leads to inactivation of the host macrophage [7]. With these past studies, we showed that indels can offer enough structural differences to target specific pathogen essential proteins as well as allow them to acquire and/or modify the protein-protein interactions that they are involved in.

Recently, we performed a large scale survey for potentially targetable indels by aligning the complete proteomes of bacterial and protozoan pathogens to the complete human proteome [10]. Our results showed that sizable indels were found in approximately 5–10% of bacterial proteins and as much as 25% of protozoan proteins with respect to human proteins. A large number of those proteins with indels were identified as being essential to their respective pathogens. Therefore, in this current study, we set out to determine if the frequency of indels in essential proteins differed from that of non-essential proteins. Our hypothesis is that essential proteins will likely contain more indels due to the following two observations. Firstly, protein domain profiles characterized in databases such as Pfam [11] showed that protein sequences of the same protein interaction domain contained a large amount of residue variations across multiple species, which implied that a single point mutation in a protein did not have a large impact on the function of protein interaction domains. Secondly, essential proteins undergo stronger selective pressure and thus accumulate point mutations at a slower rate than non-essential proteins [12,13]. Therefore, taking these two considerations together, we propose that formation of indels may be one means by which proteins, especially those that are essential, acquire new interaction sites and/or modify existing ones, and thus their interaction partners. For example, it is well known that protein interaction networks (PINs) tend to be scale-free [14,15]: the majority of the proteins in an interaction network have far fewer interactions than the few highly connected 'hub' proteins. Due to the greater number of interactions that they participate in, hubs tend to be essential proteins. A hub can gain interactions in the network if a gene encoding one of its interacting partners duplicates. This process is known as preferential attachment [14,15]. If an indel were to occur in the interaction site of the duplicate copy of the gene, then the resulting protein may reflect this change through a change in the number of interaction partners.

To our knowledge, the body of work presented here is the first to investigate a possible relationship between indel frequency and essentiality. We chose three species that have complete global knockout data: Bacillus subtilis, Escherichia coli, and Saccharomyces cerevisiae. Specifically, the purpose of this study was to determine 1) whether the mean indel frequency of essential proteins differed from that of non-essential proteins, 2) whether the Weibull distribution could accurately model the indel abundances in both types of proteins, 3) whether the indel frequency of a protein could be used as a marker to predict whether or not a given protein was essential, and 4) whether proteins with indels participated in more interactions than those without. We defined indels as insertions and deletions between proteins of high sequence similarity (at least 50%), regardless of their evolutionary relationship with one another (i.e. not just orthologs between species). This work could potentially locate similar situations to the L. donovani case described and thus further explore the methodology of targeting indels of specific pathogen proteins without cross-reactivity to human host proteins.

Results and discussion

Query and subject species analyzed

To test whether the indel frequency of a protein is related to essentiality, we obtained protein sequences in FASTA format from NCBI RefSeq [16] for B. subtilis, E. coli, and S. cerevisiae. These organisms were chosen because their genomes have been sequenced and global knockout phenotype data were available [17-19]. We referred to these three species as 'query species,' since their respective proteins were the queries in the sequence alignments (Table 1). We referred to the proteins from the query species as 'query proteins.' Essentiality data were available for other organisms besides B. subtilis, E. coli, and S. cerevisiae; however, these data were not produced by complete gene deletion, as in E. coli and S. cerevisiae, or by insertion of a marker, as in B. subtilis, but by transposon mutagenesis (Mycoplasma genitalium [2,3], Haemophilus influenzae [20], Escherichia coli (strain MG1655) [21]) or anti-sense RNA (Staphylococcus aureus (strains RN450 and RN4220) [22]). Transposon mutagenesis can miss essential genes that tolerate transposon insertions as well as produce false negatives due to non-polar insertions. Inhibition by anti-sense RNA is a 'knock down' rather than a knockout of a gene and may not result in the complete removal of the transcript of the target gene. Also, this technique is limited to genes for which adequate expression of the anti-sense RNA can be obtained [17,18]. With these considerations in mind, we performed our analyses with B. subtilis, E. coli, and S. cerevisiae, as the essentiality data for these three organisms were potentially more reliable.

We also downloaded protein sequences, in FASTA format, for 22 bacterial and 15 eukaryote species with fully sequenced genomes. We referred to these species as 'subject species,' since their respective proteins were the subjects in the sequence alignments (Additional file 1).
We referred to the proteins from the subject species as 'subject proteins.' All together, 14,214 query proteins (8342 bacterial and 5872 eukaryote) and 336,086 subject proteins (53,454 bacterial and 282,632 eukaryote) were analyzed.

The comparison of the indel frequencies of essential and non-essential proteins was performed to determine if the frequencies differed in a statistically significant manner. We aligned all NCBI RefSeq proteins from B. subtilis and E. coli against the proteins of 22 bacteria subject species, and S. cerevisiae against the proteins from 15 eukaryote subject species with BLASTP. A gap opened in the query protein could be reported as a deletion in the query protein or as an insertion in the subject protein. Similarly, a gap opened in the subject protein could be reported as a deletion in the subject protein or as an insertion in the query protein. To maintain a consistent naming scheme, we reported gaps with respect to the query protein (Figure 1a). Figure 1b summarizes the steps performed, while Additional file 2 shows a summary of the number of indels and proteins of high sequence similarity for each species-species comparison.

Is there a significant difference between the indel frequencies of essential and non-essential proteins?

To evaluate whether or not the differences between mean indel frequencies of essential and non-essential proteins were statistically significant, we first calculated the frequencies of insertions and deletions of a given minimum length (one to twenty amino acids) for all query species (see Methods). Next, we calculated the mean insertion and deletion frequencies for both essential and non-essential proteins for each query species. Figure 2 contains plots of the mean insertion and deletion frequencies against the minimum insertion and deletion lengths for the three query species. As the figure illustrates, the mean frequencies in the proteins of the three query species decrease as the minimum indel lengths increase, suggesting that short indels are more likely to occur than long indels. Next, we performed t-tests to examine the null hypothesis that the mean indel frequencies of essential and non-essential proteins were equal. We observed that while the absolute differences between the mean indel frequencies were small, the differences were statistically significant as assessed by the t-test calculation (P < 0.05). As seen in the figure, the essential proteins in B. subtilis, E. coli, and S. cerevisiae had significantly different insertion and deletion frequencies from their non-essential counterparts. All significant t-test values were positive for the query species, which suggested that for these three organisms, essential proteins had a greater frequency of indels than non-essential proteins. While both insertions and deletions occurred significantly more often in essential proteins than in non-essential proteins for E. coli and S. cerevisiae, only deletions of minimum length eight to twenty amino acids produced significant results in B. subtilis.

It is interesting to note that while long indels in S. cerevisiae were significant, the greatest t-test value occurred when the indel length was defined as one or more amino acids. A large t-test value suggested that differences between the mean indel frequencies of essential and non-essential proteins were not likely due to chance alone. Furthermore, if indels exactly one amino acid long were randomly distributed across essential and non-essential proteins, then one would expect a longer minimum indel length to produce the greatest t-test value. However, this was not the case, and one explanation for this trend could be that essential proteins contained a higher frequency of indels of exactly one amino acid in length. To test this possible explanation, we re-ran our BLASTP processing scripts for S. cerevisiae, this time checking for indels of length exactly one to twenty amino acids. The results from this new set (data not shown) showed that there was significance even at the one amino acid indel length, and thus confirmed our suspicions.

While these initial t-test results supported our predictions that essential proteins of the three query species would have more indels than their respective non-essential proteins, we reasoned that the frequency of indels produced is at least partially dependent on the specific subject species chosen. To observe how our choice of subject species may have impacted our results, we repeated the t-test analysis with a smaller set of 14 randomly chosen subject species. After performing BLASTP of B. subtilis and E. coli against the proteins of nine sequenced bacterial species and S. cerevisiae against the proteins of five sequenced eukaryote species (Additional file 3), we observed similar trends in that essential proteins had significantly greater indel frequencies than non-essential proteins (P < 0.05) (Additional file 4). For example, in the complete set of bacterial subject proteins (22 bacteria species), E. coli insertions of minimum length 10 to 20 amino acids occurred more frequently in essential proteins than non-essential proteins, while with the smaller set of bacterial subject proteins, insertions of one and seven to twenty amino acids occurred more frequently in essential proteins. Similarly, in the complete set of bacterial subject proteins, B. subtilis deletions of minimum length eight to twenty amino acids occurred more frequently in essential proteins, while in the smaller set, this trend was extended to seven to twenty amino acids.
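As a rough illustration of the two computations behind this comparison, the sketch below counts gap runs of a given minimum length with respect to the query, and computes a t statistic for two groups of per-protein frequencies. The function names `count_indels` and `welch_t`, the toy alignment, and the frequency values are all hypothetical. The unequal-variance (Welch) form of the t-test is an assumption made for the example, since the text does not state which variant was used, and the exact definition of indel frequency is given in Methods rather than here.

```python
import math
import re
from statistics import mean, variance


def count_indels(query_aln, subject_aln, min_len=1):
    """Count gap runs of at least min_len residues, named relative to the query.

    Assumed convention (our reading of the text): a run of '-' in the query
    row is a deletion in the query; a run of '-' in the subject row is an
    insertion in the query.
    """
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned rows must have the same length")
    deletions = sum(1 for run in re.findall(r"-+", query_aln) if len(run) >= min_len)
    insertions = sum(1 for run in re.findall(r"-+", subject_aln) if len(run) >= min_len)
    return insertions, deletions


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    se2_a = variance(sample_a) / n_a  # squared standard error of each mean
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Toy alignment (invented): one 2-residue deletion and one 3-residue insertion.
query_row = "MKT--AYIAKQRQISFVK"
subject_row = "MKTLLAYI---RQISFVK"
print(count_indels(query_row, subject_row, min_len=1))
print(count_indels(query_row, subject_row, min_len=3))

# Invented per-protein indel frequencies, for illustration only.
essential = [0.9, 1.1, 1.4, 1.2]
non_essential = [0.7, 0.8, 1.0, 0.9, 0.6]
t_stat, dof = welch_t(essential, non_essential)
print(f"t = {t_stat:.3f}, df = {dof:.1f}")
```

A positive t means the first group has the larger mean, which matches the sign convention in which positive values indicate more indels in essential proteins. Converting t to a P value requires the t distribution, which statistical packages provide.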
These results showed that while the choice of subject species did alter the specific indel lengths that produced significant results, in general, the trends were consistent with our predictions. The only exception was the shorter insertions of B. subtilis. We observed that insertions of minimum length three, four, and six were found more frequently in non-essential proteins, as indicated by the negative t-test values. However, the longer deletions of B. subtilis, as discussed, followed the predicted trend. With this specific result in mind, we now speculate that perhaps only longer indels, say of length greater than or equal to seven amino acids, are more likely to be found in essential proteins.

Another issue that may have impacted our initial t-test results was the quality of the protein sequences we used. A portion of the proteins we obtained from NCBI RefSeq resulted from computational predictions and/or have not undergone full manual curation. Therefore, sequencing and/or annotation errors of these protein sequences may have resulted in "pseudo-indels" in the BLASTP alignments. To observe how these proteins in the complete set of subject proteins from the 22 bacteria and 15 eukaryotes may have impacted our initial t-test results, we repeated the analyses and performed BLASTP of S. cerevisiae against the smaller set of five randomly chosen eukaryote subject species, but this time only fully curated and reviewed NCBI RefSeq proteins were included. We focused only on S. cerevisiae because all of its respective proteins in NCBI RefSeq were fully curated and reviewed, while this was not the case for any of the proteins from the other two query species. If the resulting trends from this smaller set of subject species varied greatly from those observed with the complete set of subject species (15 eukaryotes), then it would be likely that the results produced from the complete set of subject proteins were caused by the pseudo-indels created by the alignments of the predicted and non-curated NCBI RefSeq proteins. However, this was not the case, as the trends seen with the highly curated proteins were very similar to what was observed in the complete subject species set (Additional file 5). Therefore, we concluded that it was unlikely that the observed trend, in which the indel frequency of essential proteins was greater than that of non-essential proteins, was merely caused by sequencing and/or annotation errors. While we performed this check to further test our results, we wish to remind the reader that sequences in NCBI RefSeq represent a nearly non-redundant collection of sequences and are described as a 'summary' of the currently available information for each sequence [16].

Table 1: Selected query species. The three query species that had completed genome projects and complete global knockout data available.

Query Species                    Domain      Taxonomy ID   Number of Proteins    Essential Genes that could be
                                                           from NCBI RefSeq      mapped to a NCBI RefSeq ID
Bacillus subtilis (strain 168)   Bacteria    224308        4105                  271/271
Escherichia coli (strain K12)    Bacteria    83333         4237                  299/303
Saccharomyces cerevisiae         Eukaryote   4932          5872                  1050/1105

Cumulative insertion and deletion frequencies and approximation by the Weibull distribution

To investigate if the abundance of indels in essential and non-essential proteins could be modeled by consistent statistical distributions, we calculated the cumulative distribution functions (CDF) for the minimum lengths of insertions and deletions in essential and non-essential proteins in the query species (Figure 3). As can be seen in Figure 3, the dependences between the abundance of indels in both essential and non-essential proteins and minimum indel lengths formed typical exponent-like distributions.
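One common way to check whether such a decaying abundance curve is Weibull-like is to linearize it: if the abundance at minimum length x behaves as a Weibull survival function exp(-(x/α)^β), then ln(-ln(abundance)) plotted against ln(x) is a straight line with slope β and intercept -β ln(α). The sketch below is a minimal illustration under that assumption, using synthetic abundances generated from invented parameters; the function names are hypothetical, natural logarithms are an arbitrary choice (the slope β does not depend on the logarithm base, only the intercept does), and this is not the authors' fitting code.

```python
import math


def weibull_sdf(x, alpha, beta):
    """Weibull survival function SDF(x) = exp(-(x / alpha) ** beta)."""
    return math.exp(-((x / alpha) ** beta))


def fit_weibull_linearized(xs, sdf_values):
    """Fit ln(-ln SDF) = beta * ln(x) - beta * ln(alpha) by least squares.

    Requires 0 < SDF < 1 at every point. Returns (alpha, beta, r_squared).
    """
    u = [math.log(x) for x in xs]
    v = [math.log(-math.log(s)) for s in sdf_values]
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    s_uu = sum((a - mean_u) ** 2 for a in u)
    s_vv = sum((b - mean_v) ** 2 for b in v)
    s_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    beta = s_uv / s_uu                    # slope of the fitted line
    intercept = mean_v - beta * mean_u    # equals -beta * ln(alpha)
    alpha = math.exp(-intercept / beta)
    r_squared = s_uv ** 2 / (s_uu * s_vv)
    return alpha, beta, r_squared


# Synthetic check: abundances generated from known, invented parameters.
min_lengths = list(range(1, 21))
abundance = [weibull_sdf(x, alpha=2.0, beta=0.5) for x in min_lengths]
alpha_hat, beta_hat, r2 = fit_weibull_linearized(min_lengths, abundance)
print(f"alpha = {alpha_hat:.3f}, beta = {beta_hat:.3f}, r2 = {r2:.4f}")
```

With real data the points will not fall exactly on a line, and the r2 of the fit gives a simple measure of how well the Weibull form describes the observed abundances.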
Figure 1: Sample alignment and pipeline. A) Sample Alignment: Gaps were reported as insertions/deletions with respect to the query sequence. There are seven insertions (red) and two deletions (blue) in this sample alignment. B) Pipeline: A summary of the steps taken to calculate the mean insertion and deletion frequencies for essential and non-essential proteins in B. subtilis, E. coli, and S. cerevisiae. Steps shown: obtain all RefSeq proteins for three query species with global knock out data (B. subtilis, E. coli, and S. cerevisiae); obtain all RefSeq proteins for 22 bacterial and 15 eukaryote species; align B. subtilis and E. coli proteins to the proteins from the 22 bacterial species with BLASTP, and repeat for S. cerevisiae and the eukaryote species; calculate insertion and deletion frequencies for B. subtilis, E. coli, and S. cerevisiae; calculate mean insertion and deletion frequencies for essential and non-essential proteins for each of the three species.

Figure 2: Mean insertion and deletion frequencies in essential and non-essential proteins plotted against minimum indel length. Mean insertion and deletion frequencies were calculated for essential and non-essential query proteins aligned to proteins from the 22 bacteria or 15 eukaryote species. The t-test statistic is shown for the minimum indel lengths that were found significantly more often in essential (blue bars) than non-essential (purple bars) proteins. Significance was set at P < 0.05. Note that no such difference was observed in insertions within B. subtilis proteins. (Panels: insertions and deletions for S. cerevisiae, E. coli, and B. subtilis; x-axis: minimum insertion or deletion length (aa), 1 to 20; y-axis: mean insertion or deletion frequency.)

In our previous work [10], we demonstrated that the distribution of indels of varying length across all proteins studied could be accurately described by the Weibull distribution:

$$\mathrm{SDF}(x) = \exp\{-(x/\alpha)^{\beta}\}, \quad x \ge 0,\ \beta > 0$$

where SDF(x) is the survival distribution function, α is a scaling factor, and β is a shape parameter that may reflect the evolutionary rates for the occurrence/expansion of indels in the proteomes examined. The Weibull distribution is a statistical function defined within extreme value theory and often used in reliability engineering to model material strength and durability of electronic and mechanical components [23]. The Weibull distribution utilizes a time-to-failure measure to assess the reliability of a system and to predict its stability. A typical time-to-failure experiment involves applying a disruptive stress to a sample of objects representative of the population. The time taken for each object to break (i.e.
to fail) is recorded. The resulting values are then used to determine if the objects in the population follow a Weibull distribution. For example, a recent study [24] characterized the strength of three ceramic materials by applying mechanical stresses of 70–400 MPa/s to determine characteristics of breaking. Similarly, the formation and expansion of indels in the proteome of an organism take place under 'disruptive stress' (evolutionary pressure). An indel 'breaks' or 'fails' when it is lost. Because our previous Weibull analyses only considered indels across all proteins regardless of their essentiality [10], we examined whether the statistical function could accurately describe the abundance of indels in essential and non-essential proteins separately. For each query species the double logarithmic transformation of SDF(x), as represented by the CDF, was calculated and plotted:

$$\log(-\log(\mathrm{SDF}(x))) = \beta\log(x) - \beta\log(\alpha)$$

If the abundance of indels in the three query species could be accurately described by this distribution, then the resulting plots should be linear. We observed that the Weibull distribution could accurately model the dependence between the length of indels and their abundance in the essential and non-essential proteins in the query species, as indicated by the high r2 values (Figure 4). The β parameter is represented by the slopes in each of the graphs in Figure 4 and the values suggested two observations. Firstly, as described previously [25], a β value of less than one indicates that there is reliable growth in the system as the rate of failure is decreasing. In this case, our results indicated that some indels are retained over evolutionary time, suggesting some functional importance. Secondly, while the differences between the β values of essential and non-essential proteins are small, the non-essential proteins in all three query species have greater β values for both insertions and deletions, suggesting that indels occur and expand more readily in non-essential proteins. This observation appeared to be at odds with our earlier observations on the mean indel frequency of essential and non-essential proteins. We wondered how non-essential proteins could acquire and expand their indels at a slightly faster rate while essential proteins, in general, still contained more indels. This observation may be explained by the differences in the evolutionary age of essential and non-essential genes. A recent study into two fungal species [26], Schizosaccharomyces pombe and Saccharomyces cerevisiae, showed that the more ancient a gene was, the more likely it was to be essential. Thus, essential genes may have had more time to accumulate and expand their indels.

Figure 3: Proportion of essential and non-essential proteins with indels plotted against minimum indel length. Insertions are represented by blue bars while deletions are represented by purple bars. (Panels: essential and non-essential proteins of S. cerevisiae, E. coli, and B. subtilis; x-axis: minimum indel length (aa), 1 to 20; y-axis: indel abundance.)

Can indel frequencies be used to discriminate between an essential and non-essential protein?

While the t-test statistic assesses whether or not the difference in the means of a quantifiable trait from two populations is significant, it does not by itself show how well that trait separates individual members of the two populations. Even if the mean indel frequency of essential proteins was statistically different from that of non-essential proteins, if there was a large amount of overlap between the two distributions, it would still be difficult to predict whether a protein was essential or not based merely on its indel frequency. To determine if indel frequencies could be used as a marker to differentiate between essential and non-essential proteins, Receiver Operating Characteristic (ROC) curves were utilized. The Area Under the ROC curve (AUROC) was used as an assessment of the accuracy of the predictions. An AUROC of 1.0 implies that all predictions were correct, suggesting that all essential proteins can be completely separated from non-essential proteins based on some indel frequency threshold. An AUROC of 0.50 suggests that using indel frequency to predict essentiality performs no better than chance (for example, 50% sensitivity at 50% specificity), which is not a useful test. Finally, an AUROC that is less than 0.50 implies that the opposite trend, in which non-essential proteins have a higher frequency of indels than essential proteins, is observed.

We calculated AUROCs for each of the query species. Similar to the t-tests, each of the query species was compared to the other species in the same domain. The AUROC results for all three query species were modest: S. cerevisiae was the only query species to produce AUROCs consistently above 0.50, ranging from 0.57 to 0.59, while B. subtilis and E. coli AUROC values ranged from 0.46 to 0.56 (data not shown). These weak trends were not unexpected, because our reasoning also allowed non-essential proteins to use indels as a way to acquire and/or modify protein-protein interactions.
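The AUROC has a direct probabilistic reading that can make these values easier to interpret: it equals the chance that a randomly chosen essential protein has a higher indel frequency than a randomly chosen non-essential one, with ties counted as one half. The sketch below computes it that way by comparing all pairs. The function name `auroc` and the frequency values are hypothetical, and the pairwise method is chosen for clarity rather than speed; it is not necessarily how the AUROCs reported here were obtained.

```python
def auroc(scores_pos, scores_neg):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    A pair counts 1 when the positive scores higher, 0.5 on a tie, 0 otherwise.
    """
    if not scores_pos or not scores_neg:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for p in scores_pos:
        for q in scores_neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Invented per-protein indel frequencies, for illustration only.
essential = [0.9, 1.1, 1.4, 1.2]
non_essential = [0.7, 0.8, 1.0, 0.9, 0.6]
print(f"AUROC = {auroc(essential, non_essential):.3f}")
```

Under this reading, a value such as 0.57 means that in a little over half of such pairs the essential protein has the higher indel frequency, which illustrates why a significant difference in means can coexist with weak discrimination.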
2 3 4 5 6 7 8 9 10 11 1213 14 15 1617 18 19 20Minimum Deletion Length (aa)Indel Abundance in Non-essential ProteinsS. cerevisiae00. 2 3 4 5 6 7 8 9 10 11 1213 14 15 1617 18 19 20Minimum Indel Length (aa)Indel Abundance in Essential ProteinsBMC Bioinformatics 2007, 8:227 http://www.biomedcentral.com/1471-2105/8/227Page 9 of 13(page number not for citation purposes)Approximation of abundance of indels with the Weibull distributionFigure 4Approximation of abundance of indels with the Weibull distribution. r2 values close to 1.0 indicated that the abundance of insertions (blue points and blue line) and deletions (purple points and purple line) in essential and non-essential proteins of the three query species could be accurately modeled by the Weibull distribution.S. cerevisiae Non-essential Proteinsy = 0.514x - 0.2825R2 = 0.9902y = 0.5315x - 0.2937R2 = 0.9992-0.4-0.3-0.2- 0.5 1 1.5log(x), x = minimum indel length (aa)log(-log(CDF(x)))E. coli Essential Proteinsy = 0.4896x - 0.2661R2 = 0.9833y = 0.5334x - 0.2944R2 = 0.997-0.4-0.3-0.2- 0.5 1 1.5log(x), x = minimum indel length (aa)log(-log(CDF(x)))E. coli Non-essential Proteinsy = 0.6041x - 0.3347R2 = 0.9958y = 0.6367x - 0.3534R2 = 0.9979-0.5-0.4-0.3-0.2- 0.5 1 1.5log(x), x = minimum indel length (aa)log(-log(CDF(x)))B. subtilis Essential Proteinsy = 0.5874x - 0.3259R2 = 0.9932y = 0.4944x - 0.2698R2 = 0.9934-0.5-0.4-0.3-0.2- 0.5 1 1.5log(x), x = minimum indel length (aa)log(-log(CDF(x)))B. subtilis Non-essential Proteinsy = 0.651x - 0.3614R2 = 0.9981y = 0.5955x - 0.3308R2 = 0.9981-0.5-0.4-0.3-0.2- 0.5 1 1.5log(x), x = minimum indel length (aa)log(-log(CDF(x)))S. 
cerevisiae Essential Proteinsy = 0.5211x - 0.287R2 = 0.9993y = 0.4654x - 0.2522R2 = 0.9997-0.4-0.3-0.2- 0.5 1 1.5log(x), x = minimum indel length (aa)log(-log(CDF(x)))BMC Bioinformatics 2007, 8:227 http://www.biomedcentral.com/1471-2105/8/227While our t-tests showed that essential proteins have sig-nificantly more indels than non-essential proteins, theseAUROC results showed that indels were found frequentlyenough in non-essential proteins to make it difficult toaccurately predict whether a protein is essential or notbased solely on its indel frequency. A recent publication[27] identified 14 characteristic sequence features, such ascodon adaptation, hydrophobicity, and localization sig-nals, which are potentially associated with essential genesin fungal genomes. Thus, many different features arelikely indicative of essential proteins and perhaps the pre-dictions based on indel frequency would be more accurateif these other features were considered.Do proteins with indels have different network properties than those without indels?It has been well documented that essential proteins areoften involved in a greater number of interactions (i.e. agreater connectivity) than non-essential proteins [28,29].Because indels tend to occur on the external surface ofproteins, usually as reverse turns or coils within loops[30,31], and these structures play important roles in pro-tein-protein interactions, we reasoned that formation ofindels could be a means by which proteins acquired and/or modified the interactions that they are involved in.Using the protein-protein interaction counts for the 4148S. cerevisiae proteins available from the Munich Informa-tion Center for Protein Sequences (MIPS) database [32],we determined whether indel containing proteins in S.cerevisiae had a greater mean connectivity than those thatdo not. We calculated the mean connectivity of proteinswith and without indels of minimum length of four andten amino acids (Table 2). 
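The comparison of mean connectivity between the two groups can be sketched in a few lines. This is only an illustration, not the authors' Perl implementation: it assumes Welch's unequal-variance form of the t-test (the text does not state which variant was used), and the function name `welch_t` and the interaction counts below are hypothetical values made up for the example.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t, df) for Welch's unequal-variance t-test.

    Assumption: the paper only says "t-test"; Welch's form is chosen here
    because the two groups being compared differ in size.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error contributed by each group (sample variance / n).
    se_a = variance(sample_a) / n_a
    se_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical interaction counts (connectivity), for illustration only.
with_indel = [5, 7, 3, 6, 4, 8, 5, 6]
without_indel = [4, 3, 5, 2, 4, 3, 5, 4]

t_stat, dof = welch_t(with_indel, without_indel)
print(f"t = {t_stat:.3f}, df = {dof:.1f}")
```

Converting t and the degrees of freedom into a P value requires the cumulative distribution function of Student's t distribution, which a statistics library would normally supply; the sketch stops at the test statistic to stay within the standard library.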
While the absolute differences between the mean connectivity of both types of proteins were small, the differences were statistically significant (P < 0.05) as determined by the t-test. Therefore, in general, proteins with indels have more connections than proteins without them. This can be explained by indels creating and/or exposing new interaction sites, which result in new interactions, as was illustrated in the L. donovani example [7].

We also considered whether indel-containing proteins had a greater betweenness than proteins without indels. Betweenness is a measure from graph theory that is determined by counting the number of times a particular vertex is located on the shortest path between two other vertices in a network [33]. From a biological perspective, betweenness accounts for the direct and indirect influences of proteins at a distant location in the network. For example, if two clusters of interacting proteins, A and B, are joined together only through their mutual interaction with protein X, then X would have a high betweenness measure, because if any protein in A is to interact with another protein in B, it must do so through a direct or indirect interaction with protein X.

The naïve method used to calculate the betweenness measure can require up to O(n^3) time and O(n^2) space, making the calculation inefficient. Therefore, we used a faster method developed by Brandes [33], which we implemented in C and executed on a Linux platform. Briefly, this method calculates the betweenness for a particular vertex, v, by first computing the number of times v occurs between any other two vertices, x and y, in the network. Next, a value known as the pair-dependency is calculated. This value is the proportion of shortest paths between vertices x and y that v lies on. This step is repeated for all vertices v, x, and y and the values are summed. Table 2 shows that indel-containing proteins had greater betweenness, suggesting their importance in the S. cerevisiae protein-protein interaction network.

Taken together, these two observations suggested that the presence of indels is related to two network properties (connectivity and betweenness) of proteins in PINs. One application of these results would be in bait-prey pull-down experiments. These results suggest that to increase the coverage of the PIN with each pull-down experiment, the bait should be one that contains an indel, as indel-containing proteins are involved in a greater number of interactions and have greater betweenness.

Conclusion

We previously conducted a large scale analysis of potentially targetable indels in bacterial and protozoan pathogen proteins [10]. In that study, we located many examples of essential pathogen proteins that contained sizable indels. Therefore, the objective of this study was to determine how indels were related to essential and non-essential proteins. To our knowledge, such a relationship had not been previously explored. We further analyzed indels for their ability to discriminate between essential and non-essential proteins and compared two network properties, connectivity and betweenness, of indel-containing and non-indel-containing proteins. We determined that for three species, Bacillus subtilis, Escherichia coli, and Saccharomyces cerevisiae, essential proteins had a greater mean indel frequency than non-essential proteins. The abundance of indels in both types of proteins could be accurately modeled by the Weibull distribution. Furthermore, we demonstrated with ROC curves that accurate discrimination of essential and non-essential proteins based solely on indel frequency could not be achieved. Finally, we showed that indel-containing proteins had different network properties, namely that they had greater connectivity and betweenness, suggesting a possible role of indels in the regulation of interaction partners.

In our analyses, we did not consider the actual location of the indels in the folded three-dimensional protein structures, which is critical for effective drug design. Therefore, some future directions that we will focus on include three-dimensional modeling of indel-containing proteins as well as identifying any functional protein domains that are commonly disrupted by indels. Given that indels can be used to selectively target essential pathogen proteins that have high sequence similarity to human proteins, characterization of these indels will potentially lead to new drug targets for infectious diseases.

Methods

Systematic knockout data and NCBI RefSeq proteins

We conducted a broad literature search to identify fully sequenced genomes for which genome-wide knockout data was available (i.e. protein essentiality is defined). We located complete knockout data for B. subtilis (strain 168) [17], E. coli (strain K12) [18], and S. cerevisiae [19]. For each of these species, we downloaded the complete non-redundant set of proteins ('query proteins') in FASTA format from NCBI RefSeq [16]. In total, 14,214 query proteins were analyzed.
Next, we obtained the list of essential genes and cross-referenced them to an NCBI RefSeq protein ID using an in-house Perl script that utilized BioPerl modules (Version 1.5.1) [34] to search for the gene name in the complete set of RefSeq GenBank protein files for the particular query organism.

NCBI RefSeq proteins for BLAST databases

We searched the Entrez Genome Project section of NCBI [35] for all bacterial and eukaryote genome projects annotated as completed. From this list, a wide range of bacterial and eukaryote species were chosen. We chose species from a wide range of different classes to avoid biasing our results to particular organisms in a specific class. This resulted in 22 bacterial species and 15 eukaryote species (Additional file 1). Next, we obtained the complete set of protein sequences from these selected organisms ('subject proteins') from NCBI RefSeq. In total, this set consisted of 53,454 bacterial and 282,632 eukaryote subject proteins. To further validate our results, we randomly chose nine bacterial and five eukaryote species (Additional file 3) and obtained their respective proteins from NCBI RefSeq (35,429 bacterial and 75,881 eukaryote). We also obtained the fully curated and reviewed proteins for each of the five eukaryote species (54,927 reviewed eukaryote proteins).

BLASTP parameters used to determine alignments

We used formatdb [36] to format the subject protein sequences into BLAST databases. We then conducted BLASTP-based alignment of the B. subtilis and E. coli query proteins against the 53,454 bacterial subject proteins and of the S. cerevisiae query proteins against the 282,632 eukaryote subject proteins. We set a maximum E-value of 10^-5 and considered only sequence alignments with a minimum of 50% similarity. The same parameters were used for the analyses with the smaller set of subject species. The BLASTP alignments were performed on nine IBM machines running the CentOS Linux operating system.

Processing alignments that contain indels

We developed in-house Perl scripts to process the results of the BLASTP alignments and search for indels. For all alignments that matched our BLASTP parameters, we searched for gaps that were opened in the query protein (deletions) and the subject protein (insertions) of minimum X amino acids long, where the values of X ranged from one to twenty amino acids. Note that gaps were reported as insertions or deletions based on the query protein (Figure 1a). For each insertion of minimum X amino acids long, we calculated the Insertion Frequency (IF) as follows:

$$\mathrm{IF} = \frac{I_1 + I_2 + I_3 + \dots + I_{22}}{H_1 + H_2 + H_3 + \dots + H_{22}}$$

where $I_i$ is the number of insertions the query species shares with species i and $H_i$ is the number of proteins that satisfied our alignment parameters between the query species and species i. Similarly, we calculated the Deletion Frequency (DF) as follows:

$$\mathrm{DF} = \frac{D_1 + D_2 + D_3 + \dots + D_{22}}{H_1 + H_2 + H_3 + \dots + H_{22}}$$

where $D_i$ is the number of deletions the query species shared with species i and $H_i$ is the number of proteins that satisfied our alignment parameters between the query species and species i. Note that for S. cerevisiae as the query species, $I_{22}$, $H_{22}$, and $D_{22}$ would be $I_{15}$, $H_{15}$, and $D_{15}$, respectively, as there were only 15 eukaryote subject species.

Table 2: Summary of mean connectivity and betweenness of S. cerevisiae proteins with and without indels. The mean connectivity and betweenness of indel-containing proteins were significantly greater than those of the non-indel-containing proteins. Significance was set at P < 0.05. "With indel" means at least one indel of at least the stated minimum length.

| Min indel length (aa) | Proteins with indel (n) | Mean connectivity, with indel | Proteins without indel (n) | Mean connectivity, without indel | Betweenness, with indel | Betweenness, without indel |
| 4 | 907 | 4.194 | 562 | 3.986 | 15354 | 15133 |
| 10 | 381 | 4.394 | 1088 | 4.017 | 15712 | 15115 |

Calculations and statistical analyses

Receiver Operating Characteristic (ROC) curves and the corresponding Area Under the ROC curve (AUROC) were determined using the R statistical package, version 2.3.1 [37] for Linux-like operating systems and the ROCR package [38]. An ROC curve plots the Sensitivity (True Positives/(True Positives + False Negatives)) against the False Positive Rate (1 - (True Negatives)/(True Negatives + False Positives)). Perl scripts performing t-test calculations were also implemented and significance was set at P < 0.05.

Protein-protein interaction counts

The S. cerevisiae protein-protein interaction counts were obtained from the Munich Information Center for Protein Sequences (MIPS) database [32]. In total, we obtained interaction counts for 4148 proteins. Of the 4148 S.
cerevisiae proteins with interaction counts, 837 (20.2%) were essential. We determined the best match in Homo sapiens using the BLASTP algorithm. Again, we specified a maximum E-value of 10^-5 and that the query and subject proteins shared at least 50% sequence similarity. Using in-house Perl scripts, we then determined which proteins contained at least one indel with a minimum length of four or ten amino acids.

Authors' contributions

SKC acquired the data from various online resources, developed the computer code, performed the analyses, and wrote the manuscript. FH implemented the betweenness algorithm and performed the network analyses. AC conceived of the study, while SKC, MH, and AC participated in its design and interpretation of results. All authors read and approved the final manuscript.

Additional material

Additional File 1: The 22 bacterial and 15 eukaryote subject species utilized. [http://www.biomedcentral.com/content/supplementary/1471-2105-8-227-S1.ppt]

Additional File 2: Indel and similar protein counts for each query species when compared to each subject species. [http://www.biomedcentral.com/content/supplementary/1471-2105-8-227-S2.ppt]

Additional File 3: The nine bacteria and five eukaryote subject species utilized. [http://www.biomedcentral.com/content/supplementary/1471-2105-8-227-S3.ppt]

Additional File 4: Mean insertion and deletion frequencies in essential and non-essential proteins plotted against minimum indel length. Mean insertion and deletion frequencies were calculated for essential and non-essential query proteins aligned to the proteins of the 14 randomly chosen subject species. The t-test statistic is shown for the minimum indel lengths that were found significantly more often in essential (blue bars) than non-essential (purple bars) proteins. Significance was set at P < 0.05. Note that no such difference was observed in deletions within E. coli proteins. Also note that insertions of minimum length three, four, and six amino acids were found more frequently in non-essential than essential proteins of B. subtilis. See text for discussion. [http://www.biomedcentral.com/content/supplementary/1471-2105-8-227-S4.ppt]

Additional File 5: Mean indel frequency calculated with curated eukaryote proteins. Mean indel frequencies were calculated for the curated S. cerevisiae essential and non-essential proteins aligned to the curated proteins of the five randomly chosen subject species. Note that the observed trend in which the mean indel frequency of essential proteins was greater than that of non-essential proteins was also seen with this smaller set of curated proteins, suggesting that the observed trend seen with the proteins from the complete set of subject species was not merely due to sequencing/annotation errors. The t-test statistic is shown for the minimum indel lengths that were found significantly more often in essential (blue bars) than non-essential (purple bars) proteins. Significance was set at P < 0.05. See text for discussion. [http://www.biomedcentral.com/content/supplementary/1471-2105-8-227-S5.ppt]

Acknowledgements

SKC is funded by an award from the CIHR/MSFHR Strategic Training Program in Bioinformatics for Health Research http://bioinformatics.bcgsc.ca. MH is supported by the Michael Smith Foundation for Health Research (MSFHR) and the Natural Sciences and Engineering Research Council (NSERC). AC is funded by Genome Canada and Genome BC through the PRoteomics for Emerging PAthogen REsponse (PREPARE) Project. The authors acknowledge the helpful comments and suggestions provided by the anonymous reviewers.

References

1. Fraser CM, Gocayne JD, White O, Adams MD, Clayton RA, Fleischmann RD, Bult CJ, Kerlavage AR, Sutton G, Kelley JM, Fritchman RD, Weidman JF, Small KV, Sandusky M, Fuhrmann J, Nguyen D, Utterback TR, Saudek DM, Phillips CA, Merrick JM, Tomb JF, Dougherty BA, Bott KF, Hu PC, Lucier TS, Peterson SN, Smith HO, Hutchison CA 3, Venter JC: The minimum gene complement of Mycoplasma genitalium. Science 1995, 270:397-403.
2. Hutchison CA, Peterson SN, Gill SR, Cline RT, White O, Fraser CM, Smith HO, Venter JC: Global transposon mutagenesis and a minimal Mycoplasma genome. Science 1999, 286:2165-2169.
3. Glass JI, Assad-Garcia N, Alperovich N, Yooseph S, Lewis MR, Maruf M, Hutchison CA 3, Smith HO, Venter JC: Essential genes of a minimal bacterium. Proc Natl Acad Sci 2006, 103:425-430.
4. Cole ST: Comparative mycobacterial genomics as a tool for drug target and antigen discovery. Eur Respir J Suppl 2002, 36:78s-86s.
5. Chalker AF, Lunsford RD: Rational identification of new antibacterial drug targets that are essential for viability using a genomics-based approach. Pharmacol Ther 2002, 95:1-20.
6. Nandan D, Lopez M, Ban F, Huang M, Li Y, Reiner NE, Cherkasov A: Indel-based targeting of essential proteins in human pathogens that have close host orthologue(s): discovery of selective inhibitors for Leishmania donovani elongation factor-1α. Proteins 2007, 67:53-64.
7. Nandan D, Reiner NE: Leishmania donovani engages in regulatory interference by targeting macrophage protein tyrosine phosphatase SHP-1. Clin Immunol 2005, 114:266-277.
8. Cherkasov A, Nandan D, Reiner NE: Selective targeting of indel-inferred differences in spatial structures of highly homologous proteins. Proteins 2005, 58:959-954.
9. Li YY, Jones SJ, Cherkasov A: Selective targeting of indel-inferred differences in spatial structures of homologous proteins. J Bioinform Comput Biol 2006, 2:403-414.
10. Cherkasov A, Lee SJ, Nandan D, Reiner NE: Large-Scale Survey for Potentially Targetable Indels in Bacterial and Protozoan Proteins. Proteins 2006, 62:371-380.
11. Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer EL, Studholme DJ, Yeats C, Eddy SR: The Pfam protein families database. Nucleic Acids Res 2004, 32:D138-141.
12. Jordan IK, Rogozin IB, Wolf YI, Koonin EV: Essential genes are more evolutionarily conserved than are nonessential genes in bacteria. Genome Res 2002, 12:962-968.
13. Zhang L, Li WH: Mammalian housekeeping genes evolve more slowly than tissue-specific genes. Mol Biol Evol 2004, 21:236-239.
14. Barabasi AL, Oltvai ZN: Network biology: understanding the cell's functional organization. Nat Rev Genet 2004, 5:101-113.
15. Barabasi AL, Albert R: Emergence of scaling in random networks. Science 1999, 286:509-512.
16. Pruitt KD, Tatusova T, Maglott DR: NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 2005, 33:D501-D504.
17. Kobayashi K, Ehrlich SD, Albertini A, Amati G, Andersen KK, Arnaud M, Asai K, Ashikaga S, Aymerich S, Bessieres P, Boland F, Brignell SC, Bron S, Bunai K, Chapuis J, Christiansen LC, Danchin A, Debarbouille M, Dervyn E, Deuerling E, Devine K, Devine SK, Dreesen O, Errington J, Fillinger S, Foster SJ, Fujita Y, Galizzi A, Gardan R, Eschevins C, Fukushima T, Haga K, Harwood CR, Hecker M, Hosoya D, Hullo MF, Kakeshita H, Karamata D, Kasahara Y, Kawamura F, Koga K, Koski P, Kuwana R, Imamura D, Ishimaru M, Ishikawa S, Ishio I, Le Coq D, Masson A, Mauel C, Meima R, Mellado RP, Moir A, Moriya S, Nagakawa E, Nanamiya H, Nakai S, Nygaard P, Ogura M, Ohanan T, O'Reilly M, O'Rourke M, Pragai Z, Pooley HM, Rapoport G, Rawlins JP, Rivas LA, Rivolta C, Sadaie A, Sadaie Y, Sarvas M, Sato T, Saxild HH, Scanlan E, Schumann W, Seegers JF, Sekiguchi J, Sekowska A, Seror SJ, Simon M, Stragier P, Studer R, Takamatsu H, Tanaka T, Takeuchi M, Thomaides HB, Vagner V, van Dijl JM, Watabe K, Wipat A, Yamamoto H, Yamamoto M, Yamamoto Y, Yamane K, Yata K, Yoshida K, Yoshikawa H, Zuber U, Ogasawara N: Essential Bacillus subtilis genes. Proc Natl Acad Sci 2003, 100:4678-4683.
18. Baba T, Ara T, Hasegawa M, Takai Y, Okumura Y, Baba M, Datsenko KA, Tomita M, Wanner BL, Mori H: Construction of Escherichia coli K-12 in-frame, single-gene knockout mutants: the Keio collection. Mol Syst Biol 2006, 2:2006.0008.
19. Giaever G, Chu AM, Ni L, Connelly C, Riles L, Veronneau S, Dow S, Lucau-Danila A, Anderson K, Andre B, Arkin AP, Astromoff A, El-Bakkoury M, Bangham R, Benito R, Brachat S, Campanaro S, Curtiss M, Davis K, Deutschbauer A, Entian KD, Flaherty P, Foury F, Garfinkel DJ, Gerstein M, Gotte D, Guldener U, Hegemann JH, Hempel S, Herman Z, Jaramillo DF, Kelly DE, Kelly SL, Kotter P, LaBonte D, Lamb DC, Lan N, Liang H, Liao H, Liu L, Luo C, Lussier M, Mao R, Menard P, Ooi SL, Revuelta JL, Roberts CJ, Rose M, Ross-Macdonald P, Scherens B, Schimmack G, Shafer B, Shoemaker DD, Sookhai-Mahadeo S, Storms RK, Strathern JN, Valle G, Voet M, Volckaert G, Wang CY, Ward TR, Wilhelmy J, Winzeler EA, Yang Y, Yen G, Youngman E, Yu K, Bussey H, Boeke JD, Snyder M, Philippsen P, Davis RW, Johnston M: Functional profiling of the Saccharomyces cerevisiae genome. Nature 2002, 418:387-391.
20. Akerley BJ, Rubin EJ, Novick VL, Amaya K, Judson N, Mekalanos JJ: A genome-scale analysis for identification of genes required for growth or survival of Haemophilus influenzae. Proc Natl Acad Sci 2002, 99:966-971.
21. Gerdes SY, Scholle MD, Campbell JW, Balazsi G, Ravasz E, Daugherty MD, Somera AL, Kyrpides NC, Anderson I, Gelfand MS, Bhattacharya A, Kapatral V, D'Souza M, Baev MV, Grechkin Y, Mseeh F, Fonstein MY, Overbeek R, Barabasi AL, Oltvai ZN, Osterman AL: Experimental determination and system level analysis of essential genes in Escherichia coli MG1655. J Bacteriol 2003, 185:5673-5684.
22. Forsyth RA, Haselbeck RJ, Ohlsen KL, Yamamoto RT, Xu H, Trawick JD, Wall D, Wang L, Brown-Driver V, Froelich JM, C KG, King P, McCarthy M, Malone C, Misiner B, Robbins D, Tan Z, Zhu Zy ZY, Carr G, Mosca DA, Zamudio C, Foulkes JG, Zyskind JW: A genome-wide strategy for the identification of essential genes in Staphylococcus aureus. Mol Microbiol 2002, 43:1387-1400.
23. Coles S: An introduction to statistical modeling of extreme values. London: Springer-Verlag; 2001.
24. Teixeira EC, Piascik JR, Stoner BR, Thompson JY: Dynamic fatigue and strength characterization of three ceramic materials. J Mater Sci Mater Med 2007, 18:1219-1224.
25. Cherkasov A, Ho Sui SJ, Brunham RC, Jones SJ: Structural characterization of genomes by large scale sequence-sequence threading: application of reliability analysis in structural genomics. BMC Bioinformatics 2004, 5:101.
26. Decottignies A, Sanchez-Perez I, Nurse P: Schizosaccharomyces pombe essential genes: a pilot study. Genome Res 2003, 13:399-406.
27. Seringhaus M, Paccanaro A, Borneman A, Snyder M, Gerstein M: Predicting essential genes in fungal genomes. Genome Res 2006, 16:1126-1135.
28. Jeong H, Mason SP, Barabasi AL, Oltvai ZN: Lethality and centrality in protein networks. Nature 2001, 411:41.
29. He X, Zhan Z: Why do hubs tend to be essential in protein networks? PLoS Genet 2006, 2:e88.
30. Pascarella S, Argos P: Analysis of insertions/deletions in protein structures. J Mol Biol 1992, 224:461-471.
31. Benner SA, Cohen MA, Gonnet GH: Empirical and structural models for insertions and deletions in the divergent evolution of proteins. Mol Biol Evol 1993, 11:316-324.
32. Mewes HW, Frishman D, Guldener U, Mannhaupt G, Mayer K, Mokrejs M, Morgenstern B, Munsterkotter M, Rudd S, Weil B: MIPS: a database for genomes and protein sequences. Nucleic Acids Res 2002, 30:31-34.
33. Brandes U: A faster algorithm for betweenness centrality. J Math Sociol 2001, 25:163-177.
34. Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JG, Korf I, Lapp H, Lehvaslaiho H, Matsalla C, Mungall CJ, Osborne BI, Pocock MR, Schattner P, Senger M, Stein LD, Stupka E, Wilkinson MD, Birney E: The Bioperl toolkit: Perl modules for the life sciences. Genome Res 2002, 12:1611-1618.
35. Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, Federhen S, Geer LY, Helmberg W, Kapustin Y, Kenton DL, Khovayko O, Lipman DJ, Madden TL, Maglott DR, Ostell J, Pruitt KD, Schuler GD, Schriml LM, Sequeira E, Sherry ST, Sirotkin K, Souvorov A, Starchenko G, Suzek TO, Tatusov R, Tatusova TA, Wagner L, Yaschenko E: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2006, 34:D173-D180.
36. BLAST Binaries [ftp://ftp.ncbi.nih.gov/blast/]
37. CRAN Project [http://www.r-project.org]
38. Sing T, Sander O, Beerenwinkel N, Lengauer T: ROCR: visualizing classifier performance in R. Bioinformatics 2005, 20:3940-3941.

