UBC Faculty Research and Publications

Structural characterization of genomes by large scale sequence-structure threading: application of reliability… Cherkasov, Artem; Ho Sui, Shannan J; Brunham, Robert C; Jones, Steven J Jul 26, 2004

BMC Bioinformatics
Research article, Open Access

Structural characterization of genomes by large scale sequence-structure threading: application of reliability analysis in structural genomics

Artem Cherkasov*1, Shannan J Ho Sui2, Robert C Brunham1,3 and Steven JM Jones4

Address: 1Division of Infectious Diseases, Faculty of Medicine, University of British Columbia, Vancouver, British Columbia, Canada, 2Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, British Columbia, Canada, 3University of British Columbia Center for Disease Control, Vancouver, British Columbia, Canada and 4Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, Canada

Email: Artem Cherkasov* - artc@interchange.ubc.ca; Shannan J Ho Sui - shosui@cmmt.ubc.ca; Robert C Brunham - robert.brunham@bccdc.ca; Steven JM Jones - sjones@bcgsc.ca
* Corresponding author

Published: 26 July 2004
Received: 22 April 2004
Accepted: 26 July 2004
BMC Bioinformatics 2004, 5:101 doi:10.1186/1471-2105-5-101
This article is available from: http://www.biomedcentral.com/1471-2105/5/101
© 2004 Cherkasov et al; licensee BioMed Central Ltd. This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background: We establish that the occurrence of protein folds among genomes can be accurately described with a Weibull function. Systems which exhibit Weibull character can be interpreted with reliability theory, which is commonly used in engineering analysis. For instance, Weibull distributions are widely used in reliability, maintainability and safety work to model the time-to-failure of mechanical devices, mechanisms, building constructions and equipment.

Results: We have found that the Weibull function describes protein fold distribution within and among genomes more accurately than the conventional power functions which have been used in a number of structural genomic studies reported to date. It has also been found that the Weibull reliability parameter β for protein fold distributions varies between genomes and may reflect differences in rates of gene duplication in the evolutionary history of organisms.

Conclusions: The results of this work demonstrate that reliability analysis can provide useful insights and testable predictions in the fields of comparative and structural genomics.

Background

Recent advances in network theory have demonstrated a key role of uneven distributions occurring in many natural processes. It has been found that seemingly unrelated systems such as economic, professional, sexual and social networks, airline routing, power line connections, language networks and internet hyperlinks all exhibit a power law decay of the cumulative distribution $P(x) \approx x^{-\gamma}$, where x is the number of links connected to each network node and γ is the value of the exponent, typically varying in the range of 2–3 [1]. The heterogeneous architecture of scale-free networks imparts robustness and error tolerance against random perturbation and is often viewed as a possible common blueprint for naturally occurring large-scale networks. The critical role of the power law distribution has also been acknowledged in many areas of the life sciences: metabolic and other cellular networks, protein interaction maps, brain cellular organization, and food and ecological webs have all been described as scale-free systems. It would be fair to say that the advances in scale-free network studies have revitalized Pareto's original inequality law, introduced more than a century ago [2].

The applicability of scale-free networks has been examined in numerous structural genomics studies. It has been proposed that the genomic occurrence of protein families, superfamilies and folds follows an asymptotic power law:

$$\mathrm{SDF}(GO) = a \, GO^{-b} \qquad (1)$$

where SDF(GO) is the survival distribution function of the genomic occurrence GO of a certain protein family, superfamily or fold. These findings have laid the foundation for characterizing the evolution of the protein universe in terms of a growing scale-free system in which individual genes are represented as nodes of a propagating network [3-7].

In our previous work [9], we used large-scale sequence-structure threading to assign protein folds to 33 genomes from all three superkingdoms of life. It has been found that more than 60% of the studied eukaryotic, 68% of archaeal and 70% of bacterial proteomes could be assigned to defined protein folds by threading.
The estimated results have been used to analyze the distribution of protein architectures, topologies and domains (or homologous superfamilies according to the CATH classification [8]). We have found that the frequencies of genomic occurrence of assigned protein domains (homologous superfamilies) and topologies can be described by a power function (1) with moderate accuracy. According to the formalism of network theory, such a power law representation of the cumulative distribution of node connections governs the scale-free character of the system [10]. At the same time we have noted that the values of the power exponent b estimated in the study generally fall below the 2–3 range typical for scale-free systems (analogous observations can also be noted in a number of similar investigations [3-5]). Table 1 (see Additional file 1) features the estimated parameters a and b along with the corresponding correlation coefficients r² reflecting the goodness of fit of the experimental data to the logarithmic linear plots of (1). Table 1 also lists the total number of analyzed ORFs in each genome and the corresponding number of proteins to which THREADER has confidently assigned a fold.

The lowered values of the power exponent and the modest accuracy of the power law dependences (1) encouraged us to seek alternative approaches to describe protein fold distributions more accurately.

Results

Weibull (reliability) analysis

The Weibull distribution is a general-purpose statistical function defined within Extreme Value Theory [11] and widely used in reliability engineering to model material strength and the durability of electronic and mechanical components or equipment. In the most common case the distribution is described by the two-parameter Weibull survival function

$$\mathrm{SDF}(x) = \exp\left[-\left(\frac{x}{\alpha}\right)^{\beta}\right], \quad x \ge 0,\ \beta > 0 \qquad (2)$$

where α is a scaling factor and β is a shape parameter, also known as the slope [12].

The Weibull analysis operates on life data, i.e. it utilizes time-to-failure (or time under the testing stress) to assess the reliability of a system and to forecast its stability through the parameters of the characteristic life span α and shape β. A typical Weibull experiment applies disruptive stress to multiple samples representative of the population until the tested objects reach a state of failure and produce time-to-failure numbers. The corresponding time-to-failure values form heterogeneous Weibull distributions described by (2).

Application of Weibull function to genomic analysis

The distribution of protein folds in a genome can be viewed much like the behavior of a mechanical system under disruptive testing. It is feasible to stipulate that the increase of genomic abundance of any protein fold occurs under evolutionary pressure. Some folds are able to expand their genomic occurrence over the course of evolution; others have a higher probability of being lost through genetic drift and other random events, i.e. of failing. Considering these analogies, we anticipated that Weibull statistics could provide a natural explanation for the highly heterogeneous abundance of protein folds in genomes. To test this hypothesis we used two independent approaches to examine whether the genomic occurrence of protein topologies and domains can indeed be adequately described by the Weibull function.

First, we employed the maximum likelihood (ML) method [13] to fit the survival distribution function SDF(x) of the genomic occurrences of protein topologies and homologous superfamilies to the Weibull dependence (2). The corresponding Weibull shape parameters have been established by solving the equation

$$\frac{\sum_{i=1}^{n} x_i^{\beta} \ln x_i}{\sum_{i=1}^{n} x_i^{\beta}} - \frac{1}{\beta} - \frac{1}{n}\sum_{i=1}^{n} \ln x_i = 0$$

while the scaling factors have been calculated as

$$\alpha^{\beta} = \frac{1}{n}\sum_{i=1}^{n} x_i^{\beta}$$

where $x_i$ are the n observed values.

The ML method allowed a very accurate description of the distribution of protein folds among the genomes. Figures 1a and 1b feature the survival distributions of CATH topologies and homologous superfamilies among all the studied genomes combined (these experimental, i.e. observed, data curves are marked in red). On the same graphs we have plotted the SDF(GO) values reproduced by equation (2) with the α and β values estimated by the ML approach. These computed blue curves, labeled 'Weibull analytical', follow the experimental distributions (marked in red) very closely.

The corresponding α and β values estimated by the ML approach have been collected in Table 2 (see Additional file 2).

The second way of examining the applicability of the Weibull function (2) was based on the notion that the double logarithmic transformation of SDF(x) leads to the equation of a straight line:

$$\log(-\log(\mathrm{SDF}(x))) = \beta \log(x) - \beta \log(\alpha) \qquad (3)$$

We performed the transformation (3) on the experimental SDF(GO) data to estimate the Weibull coefficients α and β and the squared correlation coefficients r², all of which have also been collected in Table 2 (marked as 'Weibull by plotting').

The r² values from Table 2 demonstrate the very high accuracy of the linear dependences (3) established from the survival distributions of CATH folds in the studied genomes. These parameters also allow the accuracy of the double logarithmic dependences (3) to be compared with the accuracy of the simple logarithmic dependences derived from the power law model (1):

$$\log(\mathrm{SDF}) = \log(a) - b \log(GO) \qquad (4)$$

As mentioned earlier, the dependences (4) have been estimated for the SDF(GO) functions for individual genomes, superkingdoms and for the combined set of proteins. The comparison of the r² values from Table 2 and Table 1, established from the linear functions (3) and (4) respectively, reveals that for all studied cases (individual genomes, superkingdoms, total dependences) the statistical quality of the Weibull dependences (3) is much better than that of the power law function (4). Figures 1a and 1b feature the Weibull distributions estimated by plotting (double logarithmic transformation), which reproduce the experimental SDF(x) curves with remarkable accuracy. The distributions calculated from (3) (labeled 'Weibull by plotting') are much closer to the experimental distributions than the power law curves (labeled 'power law') computed with the conventional power function (1). The Weibull distributions established by the double logarithmic representations (3) ('Weibull by plotting' in Figure 1) are also very close to those calculated by the ML method ('Weibull analytical'). It should be mentioned, however, that despite the close resemblance between the Weibull distributions established by the analytical ML method and the 'double logarithmic' approach, the corresponding values of the α and β parameters in Table 2 differ (owing to the different data fitting algorithms employed by the two methods), and preference should, perhaps, be given to the more stringent ML-derived data.

Figure 1. Observed and recalculated survival distribution functions of genomic occurrences of protein topologies (a) and domains (b) among all the studied genomes combined.

Characteristic conditions for the Weibull distribution

Although the estimated statistical criteria clearly demonstrate the suitability and superiority of a Weibull function over a power function in describing protein fold distributions, we decided to examine several additional criteria characteristic of the Weibull distribution. As suggested by Romeu [14], there are four such characteristic properties inherent to the Weibull function.

The double logarithmic plot of life data (also called 'a Weibull paper') should be linear

As can be seen from Table 2, the estimated r² values in the columns marked 'Weibull by plotting' all lie within the range ~0.95–0.98, which demonstrates that the 'Weibull papers' do indeed describe the protein fold distributions in the studied genomes with high accuracy. Figures 2a and 2b feature the 'Weibull papers' for the distribution of protein topologies and domains among all the studied species and illustrate that the deviations from linearity are very small.

Figure 2. Weibull plots for survival distributions of genomic occurrences of protein topologies (a) and domains (b) among the studied genomes.

The slope of the 'Weibull paper' is an alternative estimator of β

The data from Table 2 demonstrate that the estimated slopes of the 'Weibull papers' are very close to the values of β derived by the analytical maximum likelihood approach.

The $x^{\beta}$ transformation should yield an exponential distribution with mean $\alpha^{\beta}$

The genomic occurrences of protein topologies and domains in the genomes and superkingdoms have been transformed into $GO^{\beta}$ distributions through the power factors β. The exponential character of the resulting distributions has been examined by several statistical tests and has been confirmed in all cases. The observed medians of the exponential distributions $GO^{\beta}$ collected in Table 3 (see Additional file 3) demonstrate strong correlations with the calculated $\alpha^{\beta}$ values.

Characteristic life α of the Weibull distribution lies approximately at the 63% of the population

The values of the Weibull characteristic life at 63% of the distributions have been calculated and collected in Table 3 (at x = α, equation (2) gives SDF = e^(-1) ≈ 0.37, so about 63% of the population lies below α). These values closely match the α values obtained by plotting.

Thus, all four criteria studied indicate that the genomic occurrences of protein topologies and domains can be characterized as true Weibull distributions. To support this notion further we have also considered another important property of the Weibull distribution, the dependence of its median (MDN) on the shape and scale parameters [13]:

$$0.5 = \exp\left[-\left(\frac{MDN}{\alpha}\right)^{\beta}\right] \;\Rightarrow\; \ln \alpha + \frac{1}{\beta}\ln \ln 2 = \ln MDN \qquad (5)$$

To assess the applicability of this condition, we calculated Weibull medians using the sets of α and β parameters estimated by the graphical (double logarithmic transformation) and analytical (ML) approaches. The corresponding 'MDN Calctd' values have been collected in Table 3 along with the observed medians of the Weibull distributions (marked 'MDN Obsrvd'). The high-quality linear dependences between the theoretical and observed medians are presented in Figures 3a and 3b for the topology and domain distributions respectively.
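The graphical route described above can be illustrated with a short Python sketch: a least-squares line through the points (ln x, ln(-ln SDF)) gives β as the slope and α from the intercept, and the median then follows from the relation ln MDN = ln α + ln(ln 2)/β. This is only a minimal illustration under stated assumptions, not the SAS or R procedure used in the study; the function names and the synthetic input values (α = 10, β = 0.7) are hypothetical and chosen solely so that the result can be checked.

```python
import math


def fit_weibull_by_plotting(xs, sdf_values):
    """Fit ln(-ln SDF) = beta*ln(x) - beta*ln(alpha) by ordinary least squares.

    Points with x <= 0 or SDF outside (0, 1) are skipped because the double
    logarithm is undefined there. Returns (alpha, beta, r2).
    """
    points = [(math.log(x), math.log(-math.log(s)))
              for x, s in zip(xs, sdf_values)
              if x > 0 and 0.0 < s < 1.0]
    n = len(points)
    mean_u = sum(u for u, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    s_uu = sum((u - mean_u) ** 2 for u, _ in points)
    s_uv = sum((u - mean_u) * (v - mean_v) for u, v in points)
    s_vv = sum((v - mean_v) ** 2 for _, v in points)
    beta = s_uv / s_uu                      # slope of the Weibull plot
    intercept = mean_v - beta * mean_u      # equals -beta * ln(alpha)
    alpha = math.exp(-intercept / beta)
    r2 = s_uv ** 2 / (s_uu * s_vv)          # squared correlation coefficient
    return alpha, beta, r2


def weibull_median(alpha, beta):
    """Median of the Weibull distribution: ln MDN = ln(alpha) + ln(ln 2)/beta."""
    return math.exp(math.log(alpha) + math.log(math.log(2.0)) / beta)


# Synthetic (hypothetical) survival values generated from alpha = 10, beta = 0.7.
xs = [1, 2, 5, 10, 20, 50, 100]
sdf_values = [math.exp(-((x / 10.0) ** 0.7)) for x in xs]

alpha, beta, r2 = fit_weibull_by_plotting(xs, sdf_values)
mdn = weibull_median(alpha, beta)
print(f"alpha={alpha:.4f} beta={beta:.4f} r2={r2:.6f} median={mdn:.4f}")
```

On real data the plotted estimates would differ somewhat from maximum likelihood estimates, as the text notes, because the two procedures weight the observations differently.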
Figures 3a and 3b demonstrate that the calculated and observed median values are virtually the same, which confirms the validity of condition (5).

Thus, multiple independent tests have demonstrated that the occurrence of protein folds in genomes obeys the Weibull distribution and can therefore be interpreted in terms of reliability theory, which can provide additional insight into fold evolution.

Figure 3. Observed vs. calculated medians of the Weibull distributions established by the maximum likelihood and plotting methods for protein topologies (a) and domains (b).

Discussion

Interpretation of the Weibull parameters

The very fact that we were able to assign the Weibull character to the distributions of the CATH protein topologies and homologous superfamilies within genomes ultimately implies that parameters of genomic occurrence can be classified as extreme values. According to Extreme Value Theory, the Weibull distribution will successfully model life systems for which many competing similar failure processes are "racing" to failure and the first to reach it produces the observed failure time [15]. With regard to genomic occurrence this may suggest that protein folds increase their genomic occurrence in a competitive manner and that those folds having a greater potential to duplicate will continue to duplicate at the cost of less abundant folds, which may ultimately disappear from the genome.

On the other hand, according to reliability theory a Weibull distribution with β > 1 characterizes a life system that increasingly deteriorates. If the shape parameter is smaller than unity (β < 1), there is reliability growth, as the failure rate of the system decreases with time [14]. It is not clear at the moment whether a reliability criterion is directly applicable to protein fold distributions. However, β does indeed describe the skewness of the fold distribution. For example, Caenorhabditis elegans has the lowest calculated value of β among the studied organisms, and this genome has also been characterized by its recent expansion and duplication of several gene families [16]. Presumably, many of these folds are present at lower abundances in other genomes. It could be proposed that such a low β (which, according to reliability theory, characterizes the genome of C. elegans as the most stable among those studied) may reflect the fact that the chances of losing some less abundant fold families are lower for C. elegans. (Considering that more than 70% of the translated ORFs of the C. elegans genome have been covered by the sequence-structure threading, we have assumed that the recently duplicated genes are accordingly represented in the results.) In this context, the reliability of a proteome can be viewed as its ability to maintain and expand its composition without loss of protein folds.

We can speculate that life systems that enjoy evolutionary success will tend to minimize β (β < 1), i.e. to have a more balanced (less heterogeneous) fold representation in their genomes. The fact that most β values presented in Table 2 fall below the unity threshold indicates that, in general, the reliability of genome fold composition increases with time, i.e. fewer protein folds reach the failure state (termination of multiplication and, likely, subsequent evolutionary extinction) as an organism evolves.

Interestingly, little difference has been found between the β shape parameters for topology distributions across the three superkingdoms. All three linear dependences ln(-ln(SDF(GO))) ~ ln(GO) for Bacteria, Eukaryota and Archaea presented in Figures 4a and 4b appear very similar.

Figure 4. Weibull distributions of protein topologies (a) and domains (b) among three superkingdoms (Archaea, Bacteria and Eukaryota).

As already mentioned above, it is difficult to decide at this point whether the observed Weibull character of protein fold distributions can be placed in a larger context. We can only speculate that protein folding preferences may lead to a greater abundance of favorable protein configurations and to the extinction of those folds which are less favorable. Such selection may represent a mechanism of evolutionary search for better protein folds. In any case, the observed phenomenon illustrates the action of natural selection in determining protein fold repertoires and suggests that the propagation of protein folds in a genome occurs in a competitive manner, i.e. more abundant folds tend to expand their genomic presence even further, driving less abundant folds to extinction.

It also remains to be seen whether other properties of genomes and proteomes can be described by Weibull statistics. In our studies we plan to use the Weibull approach to examine other distributions such as the genomic occurrence of transcriptional promoters and regulatory elements, levels of gene expression and the occurrence of protein domains per gene, among others.

Another possible development for reliability analysis in structural genomics might be to investigate whether the standard libraries of protein folds themselves can be adequately described by the Weibull function.
As stated above, in this study we used the CATH standard library of protein folds, one of the most widely accepted and used protein fold classifications. It is not unfeasible that the representation of protein architectures, topologies, homologous superfamilies, etc. in CATH can be adequately described by the Weibull law. It has been previously demonstrated that another widely used fold library, SCOP, does indeed obey the power law [4]. Such observations would not necessarily contradict the uneven character of the fold distributions in individual proteomes or superkingdoms, as a given protein fold library should reflect the proportion of protein fold occurrence in nature. At the same time, we anticipate that the analysis of the standard fold libraries in terms of Weibull distributions may bring additional insight into the field, and it will be carried out in the near future.

To summarize the current work, the use of the Weibull distribution allows a more accurate description of the distributions of protein topologies and domains within and among genomes than the power function used in conventional structural genomic studies. In addition, we were able to establish the Extreme Value relationships for protein fold distributions, which suggest that the protein fold repertoire of an organism most likely arises as a result of competition among folds. This may reflect a mechanism of natural selection searching for optimal protein structures, in which more evolutionarily favorable folds tend to populate the entire genomic space and cause the extinction of less favorable protein configurations.

Conclusions

Use of a Weibull function allows the cumulative distribution of protein topologies and domains within individual genomes and superkingdoms to be described with higher accuracy than the conventional power function used in related studies. The developed approach may be applied to quantifying the distribution of different properties of genomes and can be particularly useful for assessing and comparing fold distributions between different organisms and the possible impact on the "reliability" of organisms of a higher redundancy in their fold composition.

In general, the results of the investigation demonstrate the feasibility and importance of using reliability analysis to improve the bioinformatics analysis of proteomes.

Methods

Assignment of protein folds

The prediction of protein folds has been conducted using the THREADER2 program [17]. A CATH homologous superfamily representative has been assigned to a given protein sequence if THREADER2 produced a Z score above 2.9 for the threading energy. After a CATH entry has been assigned to a protein sequence, the sequence has also been associated with the corresponding higher-level CATH representations: class, architecture and topology.

The translated protein sequences for 33 complete genomes downloaded from the NCBI and ENSEMBL databases have been processed in this manner. The threading computations have been parallelized for processing on a Beowulf cluster consisting of 52 dual-processor blades (2 × 1 GHz, 1 GB RAM). Automated control was implemented with our own PVM-supporting Perl scripts, which distribute and query the individual threading processes over multiple computer servers.

Survival distribution calculation

After the occurrences of distinct classes, architectures, topologies and homologous superfamily representatives have been established within the individual genomes, superkingdoms and in total, the corresponding survival distributions have been computed. First, we established the counts of protein architectures, topologies and domains (homologous superfamilies) with a given genomic occurrence GO. At the next step these numbers have been converted into fractional values.
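The counting and normalization steps of this procedure, together with the cumulative step that yields SDF(GO), can be sketched in a few lines of Python. This is a minimal illustration and not the authors' scripts: the function name, the toy occurrence list, and the convention that SDF(GO) is the fraction of folds whose occurrence is strictly greater than GO are assumptions made for the example.

```python
from collections import Counter


def survival_distribution(occurrences):
    """Return {GO: SDF(GO)} for every integer GO from 0 to max(occurrences).

    `occurrences` holds one genomic occurrence value per fold (hypothetical
    input). SDF(GO) is taken here as the fraction of folds with an
    occurrence strictly greater than GO (an assumed convention).
    """
    total = len(occurrences)
    counts = Counter(occurrences)        # number of folds with a given GO
    remaining = total                    # folds with occurrence > current GO
    sdf = {}
    for go in range(0, max(occurrences) + 1):
        remaining -= counts.get(go, 0)   # drop folds whose occurrence equals GO
        sdf[go] = remaining / total      # convert the count into a fraction
    return sdf


# Toy example: six folds observed 1, 1, 1, 2, 2 and 5 times in a genome.
toy_occurrences = [1, 1, 1, 2, 2, 5]
sdf = survival_distribution(toy_occurrences)
for go, value in sdf.items():
    print(go, round(value, 4))
```

Integer counts are carried through the loop and divided only at the end, so the last value is exactly zero rather than a small floating-point remainder.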
The survival distribution functions SDF(x) have then been computed for genomes, superkingdoms and for the combined set of proteins. The SDF(GO) values have been calculated for each integer GO value in the range from 0 to the maximal GO estimated within the set (genome/superkingdom/total).

Statistical analysis

The fitting of the SDF(GO) ~ GO functions has been conducted with the SAS 9.0 statistical package (SAS Inc.). The power law dependences $\mathrm{SDF}(GO) = a \, GO^{-b}$ have been analyzed as the logarithmic transform

$$\log(\mathrm{SDF}(GO)) = \log(a) - b \log(GO)$$

where the fitting has been conducted for the linear function.

The Weibull-like dependences $\mathrm{SDF}(GO) = \exp(-a \, GO^{b})$ (equivalent to (2) with $b = \beta$ and $a = \alpha^{-\beta}$) have been fitted using both non-linear approximation (by the maximum likelihood method) and linear fitting of the double logarithmic transform:

$$\log(-\log(\mathrm{SDF}(x))) = \beta \log(x) - \beta \log(\alpha)$$

The calculation of median values for the survival distributions has been done with the 'R-project' open source statistical package.

Additional material

Additional File 1
Parameters of power-law dependences for the survival distribution of genomic occurrences $\mathrm{SDF}(GO) = a \, GO^{-b}$.
[http://www.biomedcentral.com/content/supplementary/1471-2105-5-101-S1.doc]

Additional File 2
Parameters α, β and medians (calculated and observed) of Weibull distribution of survival functions of genome occurrences established by maximum likelihood and plotting methods.
[http://www.biomedcentral.com/content/supplementary/1471-2105-5-101-S2.doc]

Additional File 3
Statistical parameters for 'Weibull papers' for genomic occurrences of protein topologies and domains.
[http://www.biomedcentral.com/content/supplementary/1471-2105-5-101-S3.doc]

Acknowledgements

The authors thank Dr. Boris Sobolev (Clinical Epidemiology and Statistics, UBC) for valuable input and help with the statistical analysis of genomic data with SAS.

The work has been funded in part by the Functional Pathogenomics of Mucosal Immunity project, funded by Genome Prairie, Genome BC and their industry partner Inimex Pharmaceuticals, and by the Vancouver Hospital and Health Sciences Centre research award for AC.

RCB acknowledges support from the Canadian Institutes for Health Research (CIHR). Student SHS acknowledges the support of the CIHR/MSFHR Strategic Training Program in Bioinformatics http://bioinformatics.bcgsc.ca. SJMJ is a Michael Smith Foundation for Health Research Scholar.

References

1. Barabasi A-L: Linked: The New Science of Networks. Cambridge, Mass: Perseus Publ; 2002.
2. Pareto V: The New Theories of Economics. Journal of Political Economy 1897, 5:485-502.
3. Luscombe NM, Qian J, Zhang Z, Johnson T, Gerstein M: The dominance of the population by a selected few: power-law behavior applied to a wide variety of genomic properties. Genome Biology 2002, 3(8):0040.1-0040.7.
4. Koonin EV, Wolf YI, Karev GP: The structure of the protein universe and genome evolution. Nature 2002, 420:218-223.
5. Qian J, Luscombe NM, Gerstein M: Protein family and fold occurrence in genomes: power-law behaviour and evolutionary model. J Mol Biol 2001, 313:673-81.
6. Rzhetski A, Gomez SM: Birth of scale-free molecular networks and the number of distinct DNA and protein domains per genome. Bioinformatics 2001, 17:988-996.
7. Yanai I, Camacho CJ, DeLisi C: Predictions of gene family distributions in microbial genomes: evolution by gene duplication and modification. Phys Rev Lett 2000, 85:2641-2644.
8. Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM: CATH-A Hierarchic Classification of Protein Domain Structures. Structure 1997, 5:1093-1108.
9. Cherkasov A, Jones SJM: Structural characterization of genomes by threading. BMC Bioinformatics 2004, 5:37.
10. Barabasi A-L, Albert R: Emergence of scaling in random networks. Science 1999, 286:509-512.
11. Coles S: An Introduction to Statistical Modeling of Extreme Values. London: Springer-Verlag; 2001.
12. Cox DR, Oakes D: Analysis of Survival Data. London, New York: Chapman and Hall; 1984.
13. Wu S-J: Estimation of the parameters of the Weibull distribution with progressively censored data. J Japan Stat Soc 2002, 32:155-163.
14. Romeu JL: Empirical assessment of Weibull distribution. Selected Topics in Assurance Related Technologies 2003, 10:1-6.
15. Gumbel EJ: Statistical Theory of Extreme Values and Some Practical Applications. In National Bureau of Standards Applied Mathematics Series Volume 33. Washington, D.C: U.S. Government Printing Office; 1954.
16. The C. elegans Sequencing Consortium: Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 1998, 282:2012-2018.
17. Jones DT, Taylor WR, Thornton JM: A new approach to protein fold recognition. Nature 1992, 358:86-89.

