UBC Faculty Research and Publications

Indel PDB: A database of structural insertions and deletions derived from sequence alignments of closely… Hsing, Michael; Cherkasov, Artem Jun 25, 2008

BMC Bioinformatics (BioMed Central), Open Access

Database

Indel PDB: A database of structural insertions and deletions derived from sequence alignments of closely related proteins

Michael Hsing*1 and Artem Cherkasov2

Address: 1 Bioinformatics Graduate Program, Faculty of Graduate Studies, University of British Columbia, 100-570 West 7th Avenue, Vancouver, BC, V5T 4S6, Canada and 2 Division of Infectious Diseases, Department of Medicine, Faculty of Medicine, University of British Columbia, D 452 HP, VGH, 2733 Heather Street, Vancouver, BC, V5Z 3J5, Canada

Email: Michael Hsing* - mhsing@interchange.ubc.ca; Artem Cherkasov - artc@interchange.ubc.ca

* Corresponding author

Abstract

Background: Insertions and deletions (indels) represent a common type of sequence variation, which is less studied and poses many important biological questions. Recent research has shown that the presence of sizable indels in protein sequences may be indicative of protein essentiality and their role in protein interaction networks. Examples of utilization of indels for structure-based drug design have also been recently demonstrated. Nonetheless, many structural and functional characteristics of indels remain less researched or unknown.

Description: We have created a web-based resource, Indel PDB, representing a structural database of insertions/deletions identified from the sequence alignments of highly similar proteins found in the Protein Data Bank (PDB). Indel PDB utilized large amounts of available structural information to characterize 1-, 2- and 3-dimensional features of indel sites.

Indel PDB contains 117,266 non-redundant indel sites extracted from 11,294 indel-containing proteins. Unlike loop databases, Indel PDB features more indel sequences with secondary structures including alpha-helices and beta-sheets in addition to loops.
The insertion fragments have been characterized by their sequences, lengths, locations, secondary structure composition, solvent accessibility, protein domain association and three-dimensional structures.

Conclusion: By utilizing the data available in Indel PDB, we have studied and presented here several sequence and structural features of indels. We anticipate that Indel PDB will not only enable future functional studies of indels, but will also assist protein modeling efforts and identification of indel-directed drug binding sites.

Published: 25 June 2008
BMC Bioinformatics 2008, 9:293 doi:10.1186/1471-2105-9-293
Received: 19 December 2007
Accepted: 25 June 2008
This article is available from: http://www.biomedcentral.com/1471-2105/9/293
© 2008 Hsing and Cherkasov; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

Insertions/deletions (indels) and amino acid substitutions represent the two most common types of sequence variations observed among similar proteins [1]. Unlike amino acid substitutions, which have been studied intensively in the past years [2,3], indels remain less understood and still pose many biological questions.

Recently, a large-scale indel analysis has been conducted for 136 complete bacterial and protozoan genomes, and the results have shown that up to 5–10% of all proteins contained sizable indels when compared to human homologues [4]. Our research has further shown possible relationships between indels and protein essentiality [5] and the role of indels in protein-protein interactions.
For instance, it has been shown that indel-containing proteins were more likely to be essential than non-indel proteins and were involved in more protein-protein interactions [5]. It has also been suggested that sequence insertions and deletions can change protein-protein interactions and modify protein network characteristics [6].

Moreover, it has also been demonstrated that the structural differences of indel sites between pathogen and host proteins can have valuable therapeutic applications, enabling selective targeting of conserved bacterial proteins while at the same time eliminating drug cross-reaction with the human homologues [7-9]. For instance, it has been shown that a Leishmania elongation factor contains a 12 amino acid sequence deletion compared with its human homolog, and the deletion site has been used for developing small compounds that specifically target the Leishmania protein but not the human protein [9].

Despite the common occurrences of indels and their important roles in protein functions, currently there are no bioinformatics resources that archive structural and sequence information on indel sites derived from sequence alignments of similar proteins. Although early studies have shown us some common features shared by indels in limited datasets [10-15], our understanding of indels can be improved by utilizing the large amount of structural data accumulated in the Protein Data Bank [16].

Thus we present here Indel PDB, a structural database of insertion and deletion sites extracted from aligned protein sequences in PDB. The goal of Indel PDB is to provide a resource of indel 3D structures, which enable various bioinformatics analyses including primary sequence composition, secondary structure assignment, solvent accessibility, length distribution, protein domain association, homology modeling and other comprehensive structural studies.
Some of these applications of Indel PDB have been performed and are reported in this paper.

Indel PDB is different from loop databases, whose scope is limited to protein loops that lack clear secondary structures [17,18]. For instance, in ArchDB [18], which represents one of the most comprehensive loop databases available on the internet, loops are defined as regions that connect the regular secondary structures, extracted from 9587 protein structures. ArchDB classified a total of 58,664 loops (ArchDB95, 13-6-2007) based on their structural similarity with respect to the surrounding secondary structures.

On the other hand, Indel PDB is not limited to loops, but includes all possible gaps (insertions or deletions) present in sequence alignments among closely related proteins in PDB, and therefore such indel sites can possess any possible secondary structures. Although some overlap between Indel PDB and loop databases is expected, Indel PDB features more indel sequences with secondary structures including alpha-helices and beta-sheets in addition to loops. In fact, our analyses have demonstrated that many indels had recognizable 2D structures, in contrast to previous studies that showed most indels had undefined structures and loops [13]. To further distinguish between indels and loops, their differences have been investigated in three aspects: sequence composition, length distribution, and solvent accessibility.

In addition, Indel PDB contains a larger structural database in comparison to ArchDB. Indel PDB consists of 117,266 non-redundant indel structures extracted from 11,294 indel-containing proteins. Both the indel structural data and the analysis results are freely accessible through the Indel PDB website [19].

We believe the data presented in Indel PDB will not only enable future functional studies of indels, but also facilitate protein modeling of indels and the identification of novel drug binding sites against infectious diseases. Thus, potential users of Indel PDB include 1) molecular biologists who wish to study the functions of particular indel sites by integrating information on protein domains, 2) structural biologists who wish to improve protein homology models or to perform a comprehensive indel structural analysis based on the available indel 3D coordinates, and 3) computational chemists who are searching for potential compound-binding sites of new drug leads by the use of a comprehensive indel search engine available at the Indel PDB website.

Construction and content

Construct Indel PDB

Building Indel PDB involved nine steps, each of which is described in this section and depicted in Figure 1. In step one, a total of 38,395 PDB structural files and their sequences were downloaded from the PDB website ([20], dated August 30th 2006). In step two, BLASTCLUST (a part of the BLAST 2.2.13 package) was used to cluster sequences of the downloaded proteins, based on 100% identity. Proteins that had 100% identical sequences were removed, and only one representative protein was included in the next steps. To avoid short protein sequences, only proteins with 70 or more amino acids were selected for the subsequent parts of the analysis. A total of 22,103 proteins met these criteria.

In step three, the 22,103 PDB protein sequences were aligned to each other using BLASTp (version 2.2.13) with the following parameters: e-value ≤ 10⁻⁵ and low-complexity filter turned on. The BLAST search produced 1,299,971 insertion sites, whose locations and sequences were parsed by in-house Perl scripts, and the results were stored in a MySQL database.
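The parsing of insertion sites from a pairwise alignment can be illustrated with a minimal Python sketch. This is not the authors' in-house Perl code; the function name `find_insertions`, the use of `-` as the gap character, and the 1-based query coordinates are assumptions made for illustration only.

```python
def find_insertions(query_aln, subject_aln, query_start=1):
    """Return (start, end, sequence) for each run of query residues that is
    aligned against gaps in the subject, in 1-based query coordinates.

    Hypothetical helper: assumes both strings come from the same pairwise
    alignment and use '-' for gaps.
    """
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned strings must have equal length")

    insertions = []
    pos = query_start - 1      # position of the last query residue consumed
    run_start = None           # query position where the current run began
    run_seq = []               # residues collected for the current run

    for q, s in zip(query_aln, subject_aln):
        if q != "-":
            pos += 1
        if q != "-" and s == "-":
            # Query residue opposite a subject gap: part of an insertion.
            if run_start is None:
                run_start = pos
            run_seq.append(q)
        elif run_start is not None:
            # The run of subject gaps has ended; record the insertion site.
            insertions.append((run_start, pos_end(run_start, run_seq), "".join(run_seq)))
            run_start, run_seq = None, []

    if run_start is not None:
        insertions.append((run_start, pos_end(run_start, run_seq), "".join(run_seq)))
    return insertions


def pos_end(run_start, run_seq):
    """End position of a run that starts at run_start and holds run_seq."""
    return run_start + len(run_seq) - 1
```

For example, `find_insertions("MKTAYIAKQR", "MK---IAKQR")` reports a single three-residue insertion `TAY` at query positions 3 to 5. Gaps in the query (deletions relative to the subject) are deliberately ignored, since each is recovered as an insertion when the roles of query and subject are swapped.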
Because an insertion site in a query sequence implies a deletion site in a subject sequence, and vice versa, only insertions needed to be represented in the database. Therefore, in this paper the word 'indel' refers to insertion sites on query proteins.

In step four, the DSSP program (obtained from [21] in August 2006; original paper [22]) was run on each of the 22,103 protein structures from step #2 to determine the secondary structures and solvent accessibility. During steps one to four, we noticed discrepancies between some 'original' protein sequences as obtained from the PDB website and the 'actual' sequences as stored in the PDB structural files. Such discrepancies were caused by unresolved structural regions or gaps in the crystallographic analysis. Thus, in the fifth step, another BLAST search was performed on the original protein sequences against the actual PDB sequences to identify all the unresolved regions in the structural files. This step is important to ensure that the unresolved structural regions in the PDB files were removed from the subsequent indel analyses, so that Indel PDB represents true insertions or deletions in sequence alignments, not the structural gaps in crystallographic analysis. In step six, the information from step #5 and the original indel positions (step #3) were combined to accurately assign secondary structures and solvent accessibility scores to each of the indel sites.

In the seventh step, indels were selected for inclusion in the Indel PDB database based on the following criteria, which ensured that the indels were the results of significant BLAST alignments. The BLAST alignment criteria were e-value ≤ 10⁻⁵, sequence similarity ≥ 50%, and alignment coverage ≥ 80%. In addition, any indel site that contained an unresolved region in the PDB structure, or low-complexity residues marked by BLAST, was removed from the analysis and excluded from Indel PDB.

In the eighth step, Perl scripts were utilized to extract 3D coordinates of the selected indel sites from the PDB structural files. The indel structures of the same protein were copied into a single PDB file. A total of 11,294 PDB files were produced, which together contained the 3D structures of 488,039 indel sites.

In the final step, an Apache web server was set up on an IBM Pentium D computer, which links to all the necessary indel information and files stored in a local MySQL database. All of the above indel results are stored in two tables, [indel_pdb_summary] and [pdb_blast_alignment], as shown in Figure 2. The connection between the web server and the MySQL database was established through Perl and CGI.

Figure 1. A flowchart for constructing the Indel PDB web server. The numbers in brackets indicate the proteins or indels remaining at each stage of the bioinformatics pipeline.

Comparative analysis of indels

To demonstrate applications of the Indel PDB database, we utilized the indel data to investigate several indel features, including sequence composition, length distribution, secondary structure composition, and solvent accessibility. All of the analyses operated on a non-redundant set of indel sites, which was extracted from the original set of 488,039 indels by grouping together indel sites with the same start and end position on the same protein. The resulting non-redundant set contains 117,266 indel sites. The values required for each of the analyses were retrieved from the MySQL database using Perl scripts.

The analyses of amino acid sequence and secondary structure composition were repeated on both the indel sites and the full-length indel-containing proteins (referred to as indel proteins). Data obtained from indel proteins were treated as background values that were compared to the indel site data.
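The alignment criteria of step seven and the grouping that yields the non-redundant set can be sketched together in Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `IndelHit` record, its field names, and the example identifiers are hypothetical, and the exclusion of unresolved or low-complexity residues is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndelHit:
    """One insertion site found in one query/subject alignment (hypothetical record)."""
    query_id: str      # protein carrying the insertion
    subject_id: str    # protein carrying the corresponding deletion
    start: int         # first residue of the insertion on the query
    end: int           # last residue of the insertion on the query
    evalue: float      # BLAST e-value of the alignment
    similarity: float  # sequence similarity of the alignment, in percent
    coverage: float    # alignment coverage, in percent


def passes_alignment_criteria(hit, max_evalue=1e-5, min_similarity=50.0,
                              min_coverage=80.0):
    """Apply the three alignment thresholds quoted in the text."""
    return (hit.evalue <= max_evalue
            and hit.similarity >= min_similarity
            and hit.coverage >= min_coverage)


def select_non_redundant(hits):
    """Keep hits from significant alignments, then collapse sites that share
    the same protein, start and end into a single non-redundant entry."""
    sites = {(h.query_id, h.start, h.end)
             for h in hits if passes_alignment_criteria(h)}
    return sorted(sites)
```

Using a set keyed on `(query_id, start, end)` mirrors the description above: the same insertion observed against several subject proteins is counted once in the non-redundant set.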
A chi-square test was applied to evaluate whether the differences between indel sites and indel proteins were significant. For instance, in the case of comparing the alpha-helix content (H) between indel sites and indel proteins (our samples), the percentages of residues that were H or non-H in both samples were calculated. Then a chi-square test value was calculated and a P-value was assigned. The same process was repeated for the other secondary structures and for the sequence compositions.

Solvent accessibility was measured as (the number of water molecules in contact with a residue) multiplied by 10, or (residue water-exposed surface in Å²), according to the DSSP program. A two-sample t-test was applied to compare the differences in solvent accessibility between indel sites and indel proteins.

Figure 2. A database schema for Indel PDB.

Length distribution

The indel and loop length distributions were modeled by the Weibull [23] and power law distributions. The Weibull distribution can be described by the function:

$S(x) = \exp\{-(x/\alpha)^{\beta}\}, \quad x \ge 0, \quad \alpha, \beta > 0$

where S(x) is the survival function, and α and β represent a scaling factor and a shape parameter, respectively. The double logarithmic transformation of the Weibull function was performed:

$\log(-\log S(x)) = \beta \log(x) - \beta \log(\alpha)$

The survival function, S(x), is the probability that a variable X has a value greater than a number x. S(x) was calculated by dividing the number of indels with more than x residues by the total number of indels, where x ranges from 1 to 49. If the Weibull distribution can accurately model the indel length distribution, the double logarithmic plot is expected to be linear.
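The empirical survival function and the double logarithmic Weibull plot can be reproduced with a short Python sketch. It uses only the standard library and an ordinary least-squares line; the function names are illustrative, natural logarithms are an assumption (the base only rescales the plot and does not change the slope β), and the paper's own fit was done in MS Excel rather than with this code.

```python
import math


def survival(lengths, max_x):
    """Empirical S(x): fraction of sites with more than x residues, x = 1..max_x."""
    n = len(lengths)
    return {x: sum(1 for length in lengths if length > x) / n
            for x in range(1, max_x + 1)}


def linear_fit(xs, ys):
    """Least-squares slope, intercept and squared Pearson correlation (r^2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r2


def weibull_fit(surv):
    """Fit log(-log S(x)) = beta*log(x) - beta*log(alpha); return (alpha, beta, r2).

    Points with S(x) equal to 0 or 1 are skipped because the double
    logarithm is undefined there.
    """
    points = [(math.log(x), math.log(-math.log(s)))
              for x, s in sorted(surv.items()) if 0.0 < s < 1.0]
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    beta, intercept, r2 = linear_fit(xs, ys)
    alpha = math.exp(-intercept / beta)
    return alpha, beta, r2
```

An r² close to 1 from `weibull_fit(survival(lengths, 49))` would indicate that the double logarithmic plot is close to linear, which is the criterion the text uses for a good Weibull fit.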
The squared Pearson correlation coefficient (r²), as implemented in MS Excel, was used to evaluate the linearity of the resulting plot.

The power law distribution is represented by the function:

$S(x) = a x^{-b}$

The logarithmic transformation of the function is:

$\log S(x) = \log(a) - b \log(x)$

Therefore, the fit of the power law function to the indel/loop length distribution has been evaluated based on the squared Pearson correlation coefficient (r²) of the linear plot.

Location analysis of protein domain and indel

To further study functional aspects of indels presented in Indel PDB, we have investigated the presence of protein domains that were in the proximity of indel sites. First, 9,318 protein domain profiles characterized by Hidden Markov Models (HMM) were obtained from the Pfam database (version 22.0, [24]). Second, the HMMER program (ver 2.3.2, [25]) was utilized to scan each of the 22,103 PDB protein sequences against each of the 9,318 Pfam domain profiles. The scanning processes were performed on a cluster of 50 CPUs to generate outputs, which contained the exact starting and ending amino acid residues where protein domains were located for each of the protein sequences. In step three, the locations of the protein domains were overlaid with the locations of the 117,266 indel sites in the 11,294 indel-containing proteins. From an indel perspective, we calculated the distance between any given indel site and all domains on a given protein. The distance was measured by the number of amino acid residues between the boundary of the indel site and a domain site. If there was an overlap between the residues of the indel and the domain, the distance was assigned a "0".

Based on the locations of the indels and the domains, we have computed the overall percentages of indel sites that overlapped with domains and, vice versa, the overall percentages of domain sites that overlapped with indels.
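The indel-to-domain distance and the overlap percentage can be expressed as a small Python sketch. The function names are hypothetical, coordinates are taken as inclusive 1-based residue ranges, and one detail is an assumption of this sketch rather than something the text specifies: for non-overlapping sites the distance is the difference between the two nearest boundary positions, so directly adjacent sites get a distance of 1 and the value 0 stays reserved for true overlaps.

```python
def indel_domain_distance(indel_start, indel_end, dom_start, dom_end):
    """Distance in residues between an indel site and a domain site.

    Returns 0 when the two inclusive ranges share at least one residue.
    Assumption: otherwise the distance is the difference between the
    nearest boundary positions (adjacent ranges give 1).
    """
    if indel_end < dom_start:        # indel lies entirely before the domain
        return dom_start - indel_end
    if dom_end < indel_start:        # indel lies entirely after the domain
        return indel_start - dom_end
    return 0                         # ranges overlap


def nearest_domain_distance(indel, domains):
    """Smallest distance from one indel (start, end) to any domain on the protein."""
    return min(indel_domain_distance(indel[0], indel[1], d[0], d[1])
               for d in domains)


def fraction_overlapping(indels, domains):
    """Fraction of indel sites that overlap at least one domain."""
    if not indels or not domains:
        return 0.0
    hits = sum(1 for indel in indels
               if nearest_domain_distance(indel, domains) == 0)
    return hits / len(indels)
```

Swapping the arguments, `fraction_overlapping(domains, indels)`, gives the domain perspective described in the text, that is, the share of domain instances that overlap at least one indel.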
In addition, the top 20 protein domains with the highest percentages of overlapping indels were reported with P values determined by Fisher's exact test. Fisher's exact test was performed with a 2 × 2 contingency table (column 1: indel-containing protein, column 2: non-indel-containing protein; row 1: contained a domain that overlapped with an indel, row 2: did not contain a domain that overlapped with an indel).

Utility and Discussion

Overview of Indel PDB

Indel PDB contains sequence and structural data associated with 488,039 (or 117,266 non-redundant) indel sites, extracted from 11,294 indel-containing proteins in PDB. Indel PDB and the indel analysis results are freely accessible to the public on the World Wide Web [19].

An easy way for users to interact with indel data is through a comprehensive indel search engine. Users can search indels using one or more of the following criteria: PDB ID, indel length, secondary structure composition, solvent accessibility score, and proximity to protein domains. In addition, users can specify the sources (species) of query and subject proteins. For example, the various search criteria can be used to identify indels of interest between pathogens and humans as possible drug target binding sites. Furthermore, users can set a specific range on indel length, secondary structure or solvent accessibility to find indel sites that are, for instance, long, mainly alpha-helical, and surface exposed. Moreover, users can search for indels that overlap with certain protein domains by turning on the domain search option, setting the proximity domain distance to '0' and giving a specific domain name or ID (e.g. Peroxidase or PF00141). Such results are useful for studying the functional roles of indels among similar proteins.

Alternatively, a query protein sequence can be submitted and searched against all the indel sequences in Indel PDB by BLASTp. Successful indel hits are displayed to users. As shown in Figure 3, the following information is displayed for each indel site: Query PDB ID (protein that contains the insertion site), Query name, Query source, Subject PDB ID (protein that contains the corresponding deletion site), Subject name, Subject source, BLAST alignment scores, the complete sequence alignment, indel location (start and end position on the query protein), indel length, indel sequence, indel secondary structure composition, and indel solvent accessibility scores. Moreover, each indel protein has been cross-referenced to the UniProt database for comprehensive functional annotations. Furthermore, the page shows the number of protein domains that are in proximity of the indel sites, with links to additional information on the domains. The "help" function on each webpage contains more detailed information on web site navigation and display.

In addition, users can visualize each indel 3D structure on Indel PDB with a Jmol Java applet [26]. As an example, a 3D view of a 14-amino acid indel site, with an alpha-helix structure, between 1EDO_A and 1UZL_A is shown in Figure 4. Indel PDB includes not only the 3D atomic coordinates of each indel, but also the anchoring residues up to 6 amino acids on each side of the indel, which can be used for protein homology modeling of the indel regions. The indel 3D structure files can be downloaded directly from the Indel PDB website.

In the following sections, we demonstrate the applications of Indel PDB to characterize the structural features of indels.
In particular, we studied the sequence composition, length distribution, secondary structure composition, solvent accessibility and domain association of indels in known protein structures. The results obtained are important for understanding the functions of indels and their roles in protein essentiality, protein-protein interactions and drug design. For example, the results on solvent accessibility and secondary structure composition will enable the identification of surface-exposed indel sites with unique structural conformation, which can be applied to design novel drug binding sites for bacteria and their host proteins. Moreover, each indel site in Indel PDB has a start and end location with respect to its PDB sequence, and thus the indel locations can be mapped to nearby protein domains to investigate the functions of the indels and their potential ligand-binding ability. In addition, the sequence composition results enable studying the bias of amino acid usage on indel sites. Finally, the length distribution models of indels can provide insights about indel abundance among related proteins.

Figure 3. An indel detail page on Indel PDB. The navigation buttons on the top of the page provide easy access to different functionality of the website. The indel report on the query protein, 10gs_A, is shown. The screenshot displays indel sites between the query and one of its subject proteins (1ags_A). A detailed BLAST alignment report, followed by an indel summary table, is shown.

Sequence composition of indels

The average sequence composition of the 20 amino acids in each of the 117,266 indel sites was calculated, and the calculation was repeated on the full-length sequences of the 11,294 indel-containing proteins from which those indel sites were extracted.
Indels and their indel proteins were classified into five groups according to their length: indels with ≥ 1, ≥ 5, ≥ 10, ≥ 15, or ≥ 20 residues. Additional file 1 summarizes the sequence composition results and the corresponding chi-square and p values of each indel length category, by comparing indel sites and their indel-containing proteins. The average amino acid composition of indels with different minimal lengths is depicted in Figure 5. As shown in the figure, the average residue percentages of A, D, and E increased, but those of G, L, N, S, T, and Y decreased as the length of indels went up. The rest of the amino acids did not show any clear trend among the different indel-length groups. This result indicates that residues such as alanine (A) and glutamic acid (E), which prefer a helix conformation [27], are more frequently observed as the length of indels increases. This observation is supported by the later analysis of secondary structure composition, which showed an increase of alpha-helix content (H) in indels as length increased.

Figure 6 shows the natural logarithm of the ratio of average amino acid frequency in the indel sites to that in the full-length protein sequences. Some trends can be easily identified from the figure. For instance, indel sites contained more D, P, and Y in comparison to the entire protein sequences, while I, L, Q, T and V were reduced in indel sites. The differences are significant at P value < 0.001, based on the chi-square tests.

To compare the sequence composition of indels to that of loops (protein regions that lack any defined secondary structures), the average amino acid frequency of all loops in the 11,294 indel-containing proteins has been computed. A total of 310,103 loops of various lengths have been identified from the proteins.
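The log-ratio measure used for the composition comparisons (indel sites against full-length proteins, and indel sites against loops) can be sketched in Python as follows. The function names are illustrative, and the frequencies here are pooled over all residues of each sequence set, which is a simplifying assumption; the text describes averaging the composition over sites, which could weight short and long indels differently.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequences):
    """Pooled frequency of each of the 20 standard amino acids.

    Non-standard characters are ignored. Returns a dict of residue -> fraction.
    """
    counts = Counter(res for seq in sequences for res in seq
                     if res in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acids found")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def log_ratio(foreground, background):
    """Natural log of (frequency in foreground / frequency in background).

    Residues absent from either set are skipped, since the logarithm
    would be undefined. Positive values mean enrichment in the foreground.
    """
    fg = composition(foreground)
    bg = composition(background)
    return {aa: math.log(fg[aa] / bg[aa])
            for aa in AMINO_ACIDS if fg[aa] > 0 and bg[aa] > 0}
```

With indel sequences as the foreground, a positive value for a residue corresponds to the kind of enrichment reported for D, P and Y relative to full-length proteins, and a negative value to the kind of depletion reported for I, L, Q, T and V.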
As shown in Figure 7, the indel sites contained more A, D, E, F, K, M, R, W, Y residues in comparison to the loops, but fewer C, I, L, N, P, Q, S, T, V residues. The differences are significant at P value < 0.001, based on the chi-square tests. Additional file 2 summarizes the sequence composition results and the corresponding chi-square and p values of each indel length category, by comparing indel sites and loops.

Figure 4. A 3D view of an indel site by the Jmol applet on Indel PDB. The indel site (insertion) is 14 amino acids long, present on a query protein, 1EDO_A (Beta-keto acyl carrier protein reductase), in an alignment with a subject protein, 1UZL_A (3-oxoacyl-[acyl-carrier protein] reductase). The indel site has an alpha-helix structure.

Figure 5. Amino acid composition of indel sequences.

Figure 6. The difference of average amino acid composition between indel sites and full-length protein sequences. The y-axis shows the natural logarithm of the ratio of average amino acid frequency in the indel sites to that in the full-length protein sequences.

Figure 7. The difference of average amino acid composition between indel sites and loop sequences. The y-axis shows the natural logarithm of the ratio of average amino acid frequency in the indel sites to that in the loop sequences.

Length distribution of indels

Our previous indel studies have shown that indel length distribution could be accurately modeled by the Weibull distribution [4,5].
Therefore, in the current study the Weibull distribution was used to model the length distribution of the 117,266 indel sites and 310,103 loops from Indel PDB. In addition to the Weibull function, the length distributions of indels and loops have been fitted to a power-law function. The survival function over a range of indel or loop lengths (from 1 to 25) was plotted in Figure 8, indicating that there were many short indels/loops but very few longer indels/loops. The numbers of indels and loops both decreased as the length increased; however, the number of loops decreased at a faster rate than that of indels. Moreover, the maximal loop length was 26 amino acids, while the maximal indel length was 50 in the Indel PDB dataset.

Figure 9a shows a double logarithmic plot of the survival function versus the logarithm of the indel length. The plot could be fitted to a straight line with an R² value of 0.9661, indicating a good fit of the indel length distribution by the Weibull distribution. In Figure 9b, the indel length distribution was fitted to a logarithmic transformation of a power law function, with an R² value of 0.9643. The loop length distribution has been fitted to the Weibull function (Figure 10a) with an R² value of 0.991, and to the power law function with an R² value of 0.9237 (Figure 10b). Thus, the Weibull function has a better fit of the length distribution of indels and loops, in comparison to the power law function. In addition, the results suggest that the occurrence of indels in the studied PDB proteins cannot be attributed to random processes (in which case normal distribution behavior would be expected), and the indel lengths are likely to be associated with certain evolutionary mechanisms.

Figure 8. Accumulative indel length distribution. The survival functions of indel and loop lengths were plotted against a range of lengths from 1 to 25 residues.

Figure 9. Indel length modeled by the Weibull and power law distribution. a) The double logarithm of the survival function in the Weibull distribution was plotted against the logarithm of indel length, which ranged from 1 to 49 residues. The plot was fitted to a straight line with R² = 0.9661, indicating a good fit of the Weibull distribution. b) The logarithm of the survival function in the power law distribution was plotted against the logarithm of indel length, which ranged from 1 to 49 residues. The plot was fitted to a straight line with R² = 0.9643.

Secondary structure composition of indels

We have assigned secondary structures to each of the 11,294 indel-containing proteins and their 117,266 indel sites in Indel PDB. Secondary structures were defined and assigned by DSSP [22] and its computer program [21]. Additional file 3 summarizes the secondary structure composition results and the corresponding chi-square and p values of each indel length category, by comparing indel sites and their indel-containing proteins. Figure 11 illustrates that when the indel length increased, there was an increase of alpha-helix content (H) in indels, while the percentages of extended beta-strands (E), H-bonded turns (T), bends (S) and loops decreased. In comparison to the indel proteins (as shown in Figure 12), indel sites have increased percentages of T, S, and loop structures, but reduced contents of alpha helices and beta strands. The differences are significant at P value < 0.001, based on the chi-square tests. The proportion of alpha-helices in shorter indels was lower compared to the indel protein sequences.
However, longer indels had comparable or higher H content than the indel proteins.

In contrast to previous findings [12,13], our results showed that many indel sites had recognizable secondary structures such as alpha-helices and beta-sheets, in addition to loops or turns.

Solvent accessibility of indels

The DSSP program was utilized to calculate the solvent accessibility of the indel proteins and their indel sites. Table 1 indicates that indel sites in all five length groups had higher average solvent accessibility values than the indel proteins. The differences were significant, with reported P values of 0 in the t tests. The result showed that indel sites were more exposed on the protein surface than average residues of the proteins.

In addition, the solvent accessibility of indels has been compared to that of loops. Table 2 indicates that indel sites in all five length groups had higher average solvent accessibility values than the loops, again with reported P values of 0. The result showed that indel sites were more exposed on the protein surface than the loops.

Table 1: Solvent accessibility of indel sites and indel proteins.
Minimal indel length | 1 | 5 | 10 | 15 | 20
Indel – avg. solvent accessibility | 59.53 | 60.36 | 60.03 | 58.46 | 60.50
Indel protein – avg. solvent accessibility | 44.68 | 43.52 | 43.46 | 42.05 | 42.80
t test value | 168.20 | 130.52 | 73.96 | 45.96 | 34.83
P value | 0 | 0 | 0 | 0 | 0
Solvent accessibility was measured as the number of water molecules in contact with a residue multiplied by 10, or as the residue's water-exposed surface in Angstrom^2, according to the DSSP program. The higher the solvent accessibility value, the more exposed the residue is.

Figure 10. Loop length modeled by the Weibull and power law distribution. a) The double logarithm of the survival function in the Weibull distribution was plotted against the logarithm of loop length, which ranged from 1 to 25 residues. The plot was fitted to a straight line with R^2 = 0.991, indicating a good fit of the Weibull distribution. b) The logarithm of the survival function in the power law distribution was plotted against the logarithm of loop length, which ranged from 1 to 25 residues. The plot was fitted to a straight line with R^2 = 0.9237.

Figure 11. Average secondary structure composition of indels. Secondary structures were defined by DSSP (Kabsch and Sander 1983): H = alpha-helix, B = residue in isolated beta-bridge, E = extended strand, participates in beta-ladder, G = 3-helix, I = 5-helix, T = H-bonded turn, S = bend, and loop = undefined structure.

Figure 12. Difference of average secondary structure composition between indel sites and full-length protein sequences. The y-axis shows the natural logarithm of the ratio of average secondary structure frequency in the indel sites to that in the full-length protein sequences.

Proximity of indels and protein domains

Previous studies have suggested roles for indels in the modification of protein functions and interactions. Thus, such indel-directed modifications may be expected to occur in the proximity of protein functional or structural domains. To determine the percentage of indel sites that overlapped with at least one protein domain, we have calculated the relative distance between each indel site and a domain on all 22,103 PDB proteins. The average domain length was 151.4 amino acids, with a minimal length of 8 and a maximal length of 1289. As shown in Table 3, among domains of all possible lengths, 93.66% of all indel sites overlapped with at least one domain.
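The overlap count described above can be sketched in Python as follows. This is only a minimal illustration, not the procedure used to build Indel PDB: it assumes that indel sites and domains are given as inclusive (start, end) residue positions on the same protein, and that an indel counts as overlapping when it shares at least one position with a domain. The function names and the coordinates in the example are hypothetical.

```python
def overlaps(a, b):
    """Return True if two inclusive residue ranges (start, end) share a position."""
    return a[0] <= b[1] and b[0] <= a[1]


def percent_overlapping(indels, domains):
    """Percentage of indel sites that overlap at least one domain."""
    if not indels:
        return 0.0
    hits = sum(1 for indel in indels if any(overlaps(indel, d) for d in domains))
    return 100.0 * hits / len(indels)


# Hypothetical coordinates for a single protein (not taken from Indel PDB).
domains = [(10, 160), (200, 320)]
indels = [(5, 9), (150, 165), (180, 190), (300, 305)]
print(percent_overlapping(indels, domains))
```

Restricting the domain list to entries no longer than a chosen cutoff before calling the function would give the corresponding length-limited percentage.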
When only domains no longer than the average length (<= 152) were considered, 47.33% of all indel sites overlapped with at least one such domain.

From a domain perspective, a total of 31,700 domain instances have been found on the 22,103 PDB proteins. Table 4 indicates that among domains of all possible lengths, 45.22% of the domains overlapped with at least one indel site, while domains no longer than the average length that overlapped with an indel accounted for 25.94% of all 31,700 domains.

In addition, for each indel-overlapping domain, we have calculated the fraction of proteins in which the domain overlapped with an indel, relative to the total number of proteins in which the domain was present. Table 5 shows the top 20 over-represented indel-overlapping domains, with P values determined by Fisher's exact test. Several enzymatic domains such as peroxidase, nitric oxide synthase, and catalase are among the top 20, suggesting possible functional roles of indels in modifying the enzymatic activity of those proteins.

Overall, the results have indicated that a great number of indel sites overlapped with the locations of protein domains, and it is therefore possible to hypothesize that some of these indel sites are associated with changes of protein function through domain modifications during evolution.

Conclusion

We presented Indel PDB, a free web-based resource that contains information on structural insertions and deletions in proteins, derived from alignments of closely related sequences. Indel PDB aims to facilitate bioinformatics analysis of 1-, 2- and 3-dimensional features of indel sites and their roles in protein essentiality, protein-protein interactions, homology modeling and drug design.

The analysis of the database content demonstrated that indel sites had a certain bias of amino acid usage and that indels tended to occur on solvent-exposed areas of proteins.
In addition, it has been shown that protein indels possessed a distinguishable secondary structure composition in which loops, turns and bends were the most abundant structural features, followed by alpha-helix and beta-sheet containing fragments. It has also been demonstrated that the indel length distribution could be accurately described by the Weibull function. Moreover, a great number of indel sites overlapped with locations of protein domains, which suggests a possible association between indel occurrences and modifications of protein function.

We anticipate that further applications of Indel PDB in conjunction with protein domain and drug databases will enable identification of novel indel-based drug binding sites for computer-aided drug discovery.

Availability and requirements

Indel PDB is freely available over the internet on the World Wide Web [19].

Table 2: Solvent accessibility of indel sites and loops.
Minimal indel length | 1 | 5 | 10 | 15 | 20
Indel – avg. solvent accessibility | 59.53 | 60.36 | 60.03 | 58.46 | 60.50
Loop – avg. solvent accessibility | 49.26 | 47.62 | 47.55 | 44.64 | 44.21
t test value | 98.74 | 85.06 | 47.53 | 32.38 | 25.87
P value | 0 | 0 | 0 | 0 | 0
Solvent accessibility was measured as the number of water molecules in contact with a residue multiplied by 10, or as the residue's water-exposed surface in Angstrom^2, according to the DSSP program. The higher the solvent accessibility value, the more exposed the residue is.

Table 3: Percentage of indel sites that overlapped with at least one protein domain.
Domain length | # of indels overlapping with a domain | % | # of indels not overlapping with a domain | % | Total # of indels
<= 1289 | 109835 | 93.66% | 7431 | 6.34% | 117266
<= 152 | 55503 | 47.33% | 61763 | 52.67% | 117266

Table 4: Percentage of domains that overlapped with at least one indel site.
Domain length | # of domains overlapping with an indel | % | # of domains not overlapping with an indel | % | Total # of domains
<= 1289 | 14336 | 45.22% | 17364 | 54.78% | 31700
<= 152 | 8222 | 25.94% | 23478 | 74.06% | 31700

Table 5: Top 20 domains that overlapped with indels.
Domain ID (Pfam) | Domain name | # of proteins with the domain overlapping with an indel | Total # of proteins with the domain | Domain fraction | P value (one tail) from Fisher's exact test
PF00141 | Peroxidase | 88 | 88 | 1 | 1.85E-26
PF00565 | Staphylococcal nuclease homologue | 44 | 44 | 1 | 1.42E-13
PF02876 | Staphylococcal/Streptococcal toxin, beta-grasp domain | 33 | 33 | 1 | 2.33E-10
PF02898 | Nitric oxide synthase, oxygenase domain | 31 | 31 | 1 | 8.94E-10
PF00199 | Catalase | 30 | 30 | 1 | 1.75E-09
PF00232 | Glycosyl hydrolase family 1 | 30 | 30 | 1 | 1.75E-09
PF00502 | Phycobilisome protein | 29 | 29 | 1 | 3.43E-09
PF01327 | Polypeptide deformylase | 24 | 24 | 1 | 9.92E-08
PF00896 | Phosphorylase family 2 | 23 | 23 | 1 | 1.94E-07
PF00490 | Delta-aminolevulinic acid dehydratase | 22 | 22 | 1 | 3.81E-07
PF01742 | Clostridial neurotoxin zinc protease | 21 | 21 | 1 | 7.45E-07
PF00022 | Actin | 20 | 20 | 1 | 1.46E-06
PF00274 | Fructose-bisphosphate aldolase class-I | 20 | 20 | 1 | 1.46E-06
PF00503 | G-protein alpha subunit | 20 | 20 | 1 | 1.46E-06
PF00224 | Pyruvate kinase, barrel domain | 17 | 17 | 1 | 1.10E-05
PF03414 | Glycosyltransferase family 6 | 16 | 16 | 1 | 2.15E-05
PF00217 | ATP:guanido phosphotransferase, C-terminal catalytic domain | 15 | 15 | 1 | 4.21E-05
PF00343 | Carbohydrate phosphorylase | 15 | 15 | 1 | 4.21E-05
PF00113 | Enolase, C-terminal TIM barrel domain | 14 | 14 | 1 | 8.24E-05
PF00162 | Phosphoglycerate kinase | 13 | 13 | 1 | 1.61E-04
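The one-tailed P values in Table 5 come from Fisher's exact test. The sketch below shows one way such a value could be computed as a hypergeometric upper tail. The choice of background is an assumption made here for illustration and is not stated in the text: it treats the 11,294 indel-containing proteins among the 22,103 PDB proteins as the background and asks how likely it is that at least k of the n proteins carrying a domain fall into that group. The function name is hypothetical.

```python
from math import comb


def fisher_one_tail(k, n, K, N):
    """Upper-tail hypergeometric probability P(X >= k).

    N: proteins in the background
    K: background proteins with the property of interest
    n: proteins carrying the domain
    k: proteins carrying the domain that also have the property
    """
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Assumed background: 11,294 indel-containing proteins out of 22,103.
p = fisher_one_tail(k=88, n=88, K=11294, N=22103)
print(f"{p:.2e}")
```

Under this assumed background, a domain found in 88 proteins that all overlap with an indel gives a value of the same order of magnitude as the corresponding Table 5 entry. A different contingency table would change the numbers.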
Authors' contributions

MH acquired and analyzed protein sequence and structural data from PDB, designed and implemented the Indel PDB database and website, carried out the indel analyses, and drafted and revised the manuscript. AC conceived and designed the study, and revised the manuscript.

Additional material

Additional File 1: Table S1. [http://www.biomedcentral.com/content/supplementary/1471-2105-9-293-S1.xls]
Additional File 2: Table S2. [http://www.biomedcentral.com/content/supplementary/1471-2105-9-293-S2.xls]
Additional File 3: Table S3. [http://www.biomedcentral.com/content/supplementary/1471-2105-9-293-S3.xls]

Acknowledgements

The authors acknowledge the technical support from Christopher Fjell and Evgeny Maksakov during the construction of the Indel PDB web server. MH was supported by the Michael Smith Foundation for Health Research (MSFHR) and the Natural Sciences and Engineering Research Council (NSERC). AC was funded by Genome Canada and Genome BC through the PRoteomics for Emerging PAthogen REsponse (PREPARE) project.

References

1. Thorne JL: Models of protein sequence evolution and their applications. Curr Opin Genet Dev 2000, 10(6):602-605.
2. Yang Z, Goldman N, Friday A: Comparison of models for nucleotide substitution used in maximum-likelihood phylogenetic estimation. Mol Biol Evol 1994, 11(2):316-324.
3. Whelan S, Lio P, Goldman N: Molecular phylogenetics: state-of-the-art methods for looking into the past. Trends Genet 2001, 17(5):262-272.
4. Cherkasov A, Lee SJ, Nandan D, Reiner NE: Large-scale survey for potentially targetable indels in bacterial and protozoan proteins. Proteins 2006, 62(2):371-380.
5. Chan SK, Hsing M, Hormozdiari F, Cherkasov A: Relationship between insertion/deletion (indel) frequency of proteins and essentiality. BMC Bioinformatics 2007, 8:227.
6. Wagner A: How the global structure of protein interaction networks evolves. Proc Biol Sci 2003, 270(1514):457-466.
7. Nandan D, Cherkasov A, Sabouti R, Yi T, Reiner NE: Molecular cloning, biochemical and structural analysis of elongation factor-1 alpha from Leishmania donovani: comparison with the mammalian homologue. Biochem Biophys Res Commun 2003, 302(4):646-652.
8. Cherkasov A, Nandan D, Reiner NE: Selective targeting of indel-inferred differences in spatial structures of highly homologous proteins. Proteins 2005, 58(4):950-954.
9. Nandan D, Lopez M, Ban F, Huang M, Li Y, Reiner NE, Cherkasov A: Indel-based targeting of essential proteins in human pathogens that have close host orthologue(s): discovery of selective inhibitors for Leishmania donovani elongation factor-1 alpha. Proteins 2007, 67(1):53-64.
10. Gu X, Li WH: The size distribution of insertions and deletions in human and rodent pseudogenes suggests the logarithmic gap penalty for sequence alignment. J Mol Evol 1995, 40(4):464-473.
11. Qian B, Goldstein RA: Distribution of Indel lengths. Proteins 2001, 45(1):102-104.
12. Pascarella S, Argos P: Analysis of insertions/deletions in protein structures. J Mol Biol 1992, 224(2):461-471.
13. Sibanda BL, Thornton JM: Accommodating sequence changes in beta-hairpins in proteins. J Mol Biol 1993, 229(2):428-447.
14. Fechteler T, Dengler U, Schomburg D: Prediction of protein three-dimensional structures in insertion and deletion regions: a procedure for searching data bases of representative protein fragments using geometric scoring criteria. J Mol Biol 1995, 253(1):114-131.
15. Benner SA, Cohen MA, Gonnet GH: Empirical and structural models for insertions and deletions in the divergent evolution of proteins. J Mol Biol 1993, 229(4):1065-1082.
16. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res 2000, 28(1):235-242.
17. Michalsky E, Goede A, Preissner R: Loops In Proteins (LIP)–a comprehensive loop database for homology modelling. Protein Eng 2003, 16(12):979-985.
18. Espadaler J, Fernandez-Fuentes N, Hermoso A, Querol E, Aviles FX, Sternberg MJ, Oliva B: ArchDB: automated protein loop classification as a tool for structural genomics. Nucleic Acids Res 2004:D185-188.
19. Indel PDB [http://www.cnbi2.com/indel/]
20. Protein Data Bank (PDB) [http://www.rcsb.org]
21. The DSSP software and database [http://swift.cmbi.ru.nl/gv/dssp/]
22. Kabsch W, Sander C: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 1983, 22(12):2577-2637.
23. Coles S: An introduction to statistical modeling of extreme values. London: Springer-Verlag; 2001.
24. Finn RD, Mistry J, Schuster-Bockler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R, Eddy SR, Sonnhammer EL, Bateman A: Pfam: clans, web tools and services. Nucleic Acids Res 2006:D247-251.
25. HMMER: biosequence analysis using profile hidden Markov models [http://hmmer.janelia.org/]
26. Jmol: an open-source Java viewer for chemical structures in 3D [http://www.jmol.org/]
27. Petsko GA, Ringe D: Protein Structure and Function. Waltham, MA: New Science Press; 2004.

