Open Collections

UBC Faculty Research and Publications

Prediction of 492 human protein kinase substrate specificities Safaei, Javad; Maňuch, Ján; Gupta, Arvind; Stacho, Ladislav; Pelech, Steven Oct 14, 2011

Full Text

PROCEEDINGS Open Access

Prediction of 492 human protein kinase substrate specificities

Javad Safaei1*, Ján Maňuch1,2, Arvind Gupta1, Ladislav Stacho2, Steven Pelech3,4

From International Workshop on Computational Proteomics, Hong Kong, China. 18-21 December 2010

Abstract

Background: Complex intracellular signaling networks monitor diverse environmental inputs to evoke appropriate and coordinated effector responses. Defective signal transduction underlies many pathologies, including cancer, diabetes, autoimmunity and about 400 other human diseases. Therefore, there is high impetus to define the composition and architecture of cellular communications networks in humans. The major components of intracellular signaling networks are protein kinases and protein phosphatases, which catalyze the reversible phosphorylation of proteins. Here, we have focused on identification of kinase-substrate interactions through prediction of the phosphorylation site specificity from knowledge of the primary amino acid sequence of the catalytic domain of each kinase.

Results: The presented method predicts 488 different kinase catalytic domain substrate specificity matrices in 478 typical and 4 atypical human kinases that rely on both positive and negative determinants for scoring individual phosphosites for their suitability as kinase substrates. This represents a marked advancement over existing methods such as those used in NetPhorest (179 kinases in 76 groups) and NetworKIN (123 kinases), which consider only positive determinants for kinase substrate prediction. Comparison of our predicted matrices with experimentally derived matrices from about 9,000 known kinase-phosphosite substrate pairs revealed a high degree of concordance with the established preferences of about 150 well studied protein kinases.
Furthermore, for many of the better known kinases, the predicted optimal phosphosite sequences were more accurate than the consensus phosphosite sequences inferred by simple alignment of the phosphosites of known kinase substrates.

Conclusions: Application of this improved kinase substrate prediction algorithm to the primary structures of over 23,000 proteins encoded by the human genome has permitted the identification of about 650,000 putative phosphosites, which are posted on the open source PhosphoNET website.

* Correspondence: jsafaei@cs.ubc.ca
1 Department of Computer Science, University of British Columbia, Vancouver, Canada
Full list of author information is available at the end of the article

Safaei et al. Proteome Science 2011, 9(Suppl 1):S6
© 2011 Safaei et al; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

[...] cell signaling pathways contribute to complex communications networks that govern basic and specialized cellular activities [2]. The ability of cells to perceive and correctly respond to their micro-environment is essential for growth, development, homeostasis, defense, and reproduction for tissue repair. Defective cell signaling, which can arise from gene mutation or toxic stimuli, has been linked to over 400 human diseases, including cancer, diabetes, autoimmunity, and neurological disorders [3]. Therefore, it is critical to map and track cell signaling networks with high precision in humans for diagnostic and therapeutic purposes. Protein phosphorylation catalyzed by protein kinases is the predominant mode of reversible post-translational control of proteins in eukaryotic cells.

Protein kinases transfer the gamma phosphate (PO₄²⁻) of ATP to hydroxyl (-OH) groups found on amino acids in substrate proteins. Serine (S), threonine (T) and tyrosine (Y) represent the three amino acid residues most commonly targeted by these protein kinases [4-6]. Of the 23,000 proteins encoded by the human genome, two-thirds have already been demonstrated to be phosphorylated at over 93,000 phosphosites [1]. Many of the targets of protein kinases include other protein kinases, and these enzymes can sequentially regulate each other in complex signaling networks. Our knowledge of the architecture of these kinase communications networks, which span from the cell plasma membrane to deep within the nucleus of cells, is very rudimentary. Most of the protein kinases are expressed in each cell in tens of thousands of copies, but a few are very restricted in their cellular expression patterns and have specialized functions. Fewer than 10,000 kinase–substrate phosphosite interacting pairs have been identified empirically, but probably over 10 million exist.

Domains are substrings of protein sequences that can evolve, function, and exist independently of the rest of the protein chain. The most common domain in protein kinases is the catalytic domain, which carries out the actual phosphorylation of protein substrates. Most of the kinases feature a single highly related catalytic domain, some have two of these catalytic domains, and a few others have atypical catalytic domains. Throughout the catalytic domain of the kinases, specificity-determining residues (SDRs) often directly interact with the side chains of amino acid sequences surrounding phosphosites (i.e. the phosphosite region) in substrates [7]. Kinase-substrate binding conforms to a lock and key model, where a semi-linear phosphosite peptide sequence (surrounding the phosphosite) fits into a kinase active site that includes the SDRs.

Atypical kinases have completely different structures when compared to the typical protein kinases.
They do not possess a catalytic domain similar to those found in typical kinases and appear to have evolved separately. No equivalent catalytic domain has been computed for them using alignment techniques. As a result, SDRs of the atypical kinases have to be searched for over the whole surface of the protein, while for typical kinases they are contained within their catalytic domains. In this work, we have predicted the locations of SDRs in 488 human kinase catalytic domains and generated position-specific scoring matrices (PSSM) for each kinase.

The organization of the paper is as follows. In the section "Related previous studies", we describe previous work related to the prediction of kinase phosphorylation specificities. In "Kinase phosphosite specificity", kinase phosphorylation specificity is mathematically formalized. In "Prediction of PSSM for kinases without substrate data", we propose our prediction algorithm for kinase phosphosite specificities based on consensus sequences of the phosphosite regions, and in "Using profile matrices for prediction", we improve the consensus idea by using profile matrices of each kinase. Finally, in "Results and discussion", we present our results.

Related previous studies

There are many previous studies that aim to predict kinase specificities for protein substrate recognition and identify potential phosphosites. The methods developed are usually based on computing consensus kinase recognition sequences, PSSM matrices or machine learning methods. Scansite [8], artificial neural networks [9], support vector machines [10], conditional random fields [11], and voting based methods [12] are some examples of these approaches. A survey and comparison of some of the mentioned prediction methods is presented in [13]. In addition, NetworKIN [6] and NetPhorest [14] are two significant efforts for modeling protein phosphorylation networks.

NetworKIN employs artificial neural networks and PSSMs to predict kinase domain specificities and uses protein-protein interaction databases such as STRING [15] to increase the accuracy of the prediction.
Those kinases and substrates that are connected directly or indirectly (linked by a short path) in the STRING protein interaction database are better candidates to be selected in the phosphorylation network. NetworKIN covers only 123 of the 516 known human protein kinases, since it does not compute phosphosite specificities for kinases that have no experimentally confirmed phosphosites. NetPhorest has slightly wider coverage with 179 kinases. Similar to NetworKIN, NetPhorest uses a combination of ANN and PSSM matrices for prediction, but it places related kinases within the same group (76 groups in total) and assumes that all kinases in the same group have identical kinase phosphorylation specificities.

All the mentioned methods have two major problems: 1) they can only compute the specificity of those kinases that are in the kinase–phosphosite pair databases; and 2) they are highly dependent on the number of confirmed phosphosites available for each kinase. The training data for all these works is usually retrieved from PhosphoSitePlus [16] and Phospho.ELM [17], which store information on kinase–phosphosite pairs. At this juncture, PhosphoSitePlus has gathered 95,724 phosphorylation sites in 13,157 distinct proteins, while Phospho.ELM has 42,575 sites in 8,718 proteins. For fewer than 9,500 of these phosphosites an upstream kinase is known.

Kinase phosphosite specificity

Generally, there is a pattern in the phosphosite regions that a specific kinase phosphorylates. We shall refer to this pattern as its kinase phosphosite specificity. The reason is that each protein kinase has a unique 3D structure in its active site that is dictated by its own primary amino acid sequence, and only a small subset of peptide substrates would be expected to possess complementary structural conformations that can fully penetrate and fit into the kinase's active site for phosphorylation.
The amino acid sequences surrounding the phospho-acceptor residue in the best peptide substrates can often be aligned to generate a consensus sequence for optimal kinase recognition. Due to redundancies in the properties of various amino acid side chains, a series of kinase substrate analogs are feasible and these can best be represented in a kinase specificity matrix. These matrices display the observed or predicted frequencies of each of the 20 possible common amino acids at each position surrounding and including the phospho-acceptor amino acid residue. PSSM matrices and machine learning methods (e.g. ANN, HMM) can be used to generate a score for a given kinase and a substrate phosphosite region. Higher scores show that the kinase is more likely to phosphorylate that phosphosite. In other words, the score is a measure of kinase phosphosite specificity. To represent the pattern properly, at least 9 amino acids (centered at the phosphosite with four amino acids to the right and left of the site) should be considered [13]. We decided to work with regions of length 15 because by considering six more amino acids we may obtain further information about the specificities of some kinases. Indeed, after computing the profile matrices of several hundred kinases we found that some additional information can be obtained from the added positions –7, –6, –5, +5, +6, and +7 (where 0 is the phosphosite, – means left and + means right of the phosphosite). However, increasing the length of the phosphosite regions to more than 15 may lead to higher noise in the training data, which would make the prediction task harder.

We introduce a new PSSM matrix to predict kinase phosphosite specificities, which is computed in the three steps described below.

Profile matrix

We first compute the probability matrix, called the profile matrix, for each kinase. Assume that it is experimentally known that kinase k phosphorylates n different phosphosite regions {p1, p2, …, pn} of length 15.
The profile matrix $P_k$ of kinase k is a 21 × 15 matrix, where rows represent amino acids (including the unknown amino acid 'x') and columns represent positions in the phosphosite regions; entry $P_k(i, j)$ is the fraction of the n regions that have amino acid i at position j. The symbol 'x' is needed for two reasons: at some positions of the primary structure of the proteins the exact amino acid is not known, and some phosphosites are located close to the N-terminus (C-terminus) of the proteins, so that no amino acid is available to the left (right) of the phosphosite. In both of these cases we use the symbol 'x' to create a consistent training set.

Background frequencies of amino acids

Next, we compute the probability of each amino acid appearing on the surface of proteins. We call these probabilities background frequencies of amino acids and denote them by B(i), where 1 ≤ i ≤ 21. To compute the background frequencies we use all the 93,000 confirmed phosphosite regions in the human proteome. The reason is that all of these confirmed regions are on the surface of proteins, and hence they are a good sample of the protein surface. By examining the profile matrices of the kinases we have determined that positions –3, –2, 0, and +1 are particularly biased for kinase recognition, since all of them had a very low entropy. Therefore, we excluded these positions from the computation of the background frequency of each amino acid.

PSSM matrix

Given the profile matrix of each kinase and the background frequencies of amino acids, the PSSM matrix for kinase k is typically computed using the log odds ratio measure:

$$M_k(i, j) = \log \frac{P_k(i, j)}{B(i)}, \tag{1}$$

where 1 ≤ i ≤ 21 and 1 ≤ j ≤ 15.
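The profile matrix and the log odds ratio of Equation (1) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the row order `AMINO_ACIDS`, the function names, the two toy phosphosite regions, and the uniform background used in the test are all hypothetical. Entries of the profile matrix that are zero map to minus infinity.

```python
import math

# Assumed row order: 20 amino acids in one-letter alphabetical order, then 'x'.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWYx"


def profile_matrix(regions, width=15):
    """Return a 21 x width matrix of per-position amino acid frequencies."""
    counts = [[0] * width for _ in AMINO_ACIDS]
    for region in regions:
        if len(region) != width:
            raise ValueError("every phosphosite region must have the same width")
        for j, residue in enumerate(region):
            counts[AMINO_ACIDS.index(residue)][j] += 1
    n = len(regions)
    return [[c / n for c in row] for row in counts]


def log_odds_pssm(profile, background):
    """Equation (1): log(P_k(i, j) / B(i)); zero frequencies give -inf."""
    return [
        [math.log(p / background[i]) if p > 0 else -math.inf for p in row]
        for i, row in enumerate(profile)
    ]


# Two hypothetical regions of length 15 with the phosphosite S at index 7.
toy_regions = ["GGGGRRASLGGGGGG", "xxxGKRASPGGGGGG"]
toy_profile = profile_matrix(toy_regions)
```

With only a handful of known sites per kinase, most entries of the profile matrix are zero, so most entries of this log odds matrix are minus infinity.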
The problem with this method is that, since the profile matrix $P_k$ computed using experimental data contains many zeros, the resulting PSSM matrix $M_k$ has many –∞ values, and consequently $M_k$ is not smooth enough for the prediction. Various smoothing techniques [18] can be applied to avoid zeros and –∞ values, but we use a different approach which produces better PSSM matrices for prediction:

$$M_k(i, j) = \operatorname{sgn}\bigl(P_k(i, j) - B(i)\bigr) \cdot \bigl|P_k(i, j) - B(i)\bigr|^{1.2}, \tag{2}$$

where the exponent 1.2 was determined experimentally to achieve the best results.

The logic behind this method is similar to the log odds ratio. If the probability of amino acid i at position j of the profile matrix is bigger than the background frequency of i, then that amino acid is a positive determinant, while if it is less than the background frequency it is a negative determinant for the phosphosite region containing i at position j to be recognized by that specific kinase. For a given candidate phosphosite region we want to see more positive and fewer negative determinant amino acids in order to predict it as a phosphosite.

Score of phosphosite region

Having the PSSM matrix $M_k$ for kinase k, we can compute how likely a given candidate phosphosite region $r = r_1 r_2 \ldots r_{15}$ is to be phosphorylated by kinase k. This value is called the kinase specificity score S and is computed as follows:

$$S(k, r) = \sum_{j=1}^{15} M_k(r_j, j). \tag{3}$$

Prediction of PSSM for kinases without substrate data

In this section, we present our algorithm for prediction of PSSM matrices based on their catalytic domains. The idea is that those catalytic domains in different kinases which have similar SDRs tend to have similar patterns in the phosphosite regions. To quantify the similarity of catalytic domains of kinases we perform multiple sequence alignment (MSA) of catalytic domains using the ClustalW algorithm [19].
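Equations (2) and (3) above can be sketched in a few lines of Python. This is a hedged illustration rather than the authors' code: the row order `AMINO_ACIDS`, the function names, and the toy two-column profile are assumptions introduced here, and the functions accept any region width even though the paper uses 15.

```python
import math

# Assumed row order: 20 amino acids in one-letter alphabetical order, then 'x'.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWYx"


def signed_power_pssm(profile, background, exponent=1.2):
    """Equation (2): sgn(P - B) * |P - B| ** exponent, finite for every entry."""
    pssm = []
    for i, row in enumerate(profile):
        scored = []
        for p in row:
            diff = p - background[i]
            # copysign restores the sign removed by abs(); diff == 0 gives 0.0.
            scored.append(math.copysign(abs(diff) ** exponent, diff))
        pssm.append(scored)
    return pssm


def specificity_score(pssm, region):
    """Equation (3): sum the PSSM entries selected by the residues of the region."""
    return sum(pssm[AMINO_ACIDS.index(res)][j] for j, res in enumerate(region))


# Hypothetical profile with two columns: column 0 is always S,
# column 1 is P half of the time and A the other half.
toy_profile = [[0.0, 0.0] for _ in AMINO_ACIDS]
toy_profile[AMINO_ACIDS.index("S")][0] = 1.0
toy_profile[AMINO_ACIDS.index("P")][1] = 0.5
toy_profile[AMINO_ACIDS.index("A")][1] = 0.5
toy_background = [1 / 21] * 21  # assumed uniform background, for illustration only
toy_pssm = signed_power_pssm(toy_profile, toy_background)
```

Residues that are more frequent than the background contribute positive terms to the score and residues that are rarer contribute negative terms, which is how both positive and negative determinants enter the prediction.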
The result of the MSA is not quite accurate, as it has many gaps; therefore, the alignments were manually modified. We perform this alignment on 488 catalytic domains of the typical protein kinases. The length of each kinase catalytic domain after MSA is 247. For 224 domains in the alignment we compute consensus sequences using 6,515 confirmed kinase–phosphosite pairs. Figure 1 represents portions of the catalytic domain after MSA of some of the best characterized kinases, for which the most phosphosites have been identified. To generate the consensus sequence of each kinase, the profile matrix of the kinase is computed using its confirmed phosphosite regions. For each position in the consensus sequence, the amino acid with the maximum probability in that position is selected. If the probability is bigger than 15%, a capital letter is used to represent that amino acid; if it is less than 15% and bigger than 8%, a small letter is used; and if it is less than 8%, the symbol 'x' is used in that position of the consensus sequence. Here 'x' is a "don't care" letter, meaning that any amino acid can appear in that position of the phosphosite region of a kinase. Therefore, kinases that have more 'x' symbols in their consensus sequence are more general and can phosphorylate more sites than the others. In Figure 2 consensus sequences of some of the well studied kinases are presented.

In what follows we use the example in Figure 1 to explain how mutual information and charge information are applied to find SDRs on the catalytic domains of the kinases.

Mutual information

Each position in catalytic domains or consensus sequences can be considered as a random variable which can take 21 different values. Both random variables can take any of the 20 amino acids. In addition, the random variables in domains can also take the gap value ~, while the random variables in consensus sequences can take the unknown value 'x'.
In information theory the mutual information of two random variables is a quantity that measures the mutual dependence of the two variables [20]. We can use this measure here to find out which two positions in consensus and catalytic domain are highly correlated. Formally, the mutual information of two discrete random variables X and Y is defined as:

$$I(X, Y) = \sum_{x \in A} \sum_{y \in B} p(x, y) \log \frac{p(x, y)}{p_1(x)\, p_2(y)},$$

where $p(x, y) = P(X = x, Y = y)$, $p_1(x) = P(X = x)$, and $p_2(y) = P(Y = y)$. The higher the mutual information, the more the random variables are correlated. In our context, X is a position in the kinase catalytic domain, Y is a position in the consensus sequence, A is the set of amino acids plus ~ and B is the set of amino acids plus 'x'.

Charge information

Negatively charged amino acids interact with positively charged ones, and hydrophobic amino acids with hydrophobic ones. Therefore, if a position in the catalytic domains (see Figure 1) tends to have many negatively charged amino acids and a position in the consensus sequences tends to have more positively charged amino acids, it is likely that these two positions are interacting with each other. Therefore, we define the charge dependency C(X, Y) of two positions (random variables), one in kinase catalytic domains (X) and the other in consensus sequences (Y), as follows:

$$C(X, Y) = \sum_{k=1}^{n} R(x_k, y_k), \tag{4}$$

where n is the number of kinases with consensus pairs (in our case 224), R is the residue interaction score of two amino acids (cf. Figure 3), $x_k$ is the amino acid of the kth kinase at position X of the catalytic domain and $y_k$ is the amino acid of the corresponding consensus sequence at position Y.

The residue interaction matrix shown in Figure 3 estimates the strength of a bond created between amino acids in the average case, independent of their distance. Negatively (positively) charged amino acids repel each other (score –2 in the interaction matrix R) and they attract positively (negatively) charged amino acids (score +2).
Figure 1 Kinase catalytic domain alignment. Some of the well characterized protein kinases with critical amino acids in their catalytic domains. Columns 2 to 173 are selected positions in the kinase catalytic domain; Sites is the number of sites. In the rightmost column, the (–3) position of the consensus sequence of each kinase is shown. Strongly positively charged amino acids (R, K) are represented as blue, weakly positively charged histidine as light blue, strongly negatively charged amino acids (E, D) as red, hydrophobic amino acids (L, V, I, F) as green, and proline (P) as brown.

Kinase  Sites    2   32   37   40   69   75   77   89   94   97   98  110  120  135  151  156  161  162  173  (-3)
PKACa     734    E    I    Q    H    E    E    F    E    F    A    Q    D    E    D    L    E    E    I    V     R
PKCa      523    N    I    D    C    E    D    M    E    F    A    E    G    D    D    F    D    E    I    V     R
CK2a1     483    Q    I    K    K    E    D    K    D    F    Y    E    G    H    D    R    Y    E    L    L     e
ERK2      410    T    K    Y    R    D    D    Y    N    Y    Y    Q    N    S    D    Y    W    E    I    I     p
CDK1      393    T    K    V    T    E    D    K    S    S    Y    Q    R    Q    D    E    W    E    V    V     l
Src       385    R    T    S    A    E    S    L    L    D    A    Q    N    A    D    A    K    E    A    S     e
ERK1      292    T    K    Y    R    D    D    Y    N    Y    Y    Q    N    S    D    Y    W    E    I    I     p
CDK2      201    Q    K    V    T    E    D    K    L    S    F    Q    R    Q    D    E    W    E    I    V     x
CaMK2a    178    Q    I    D    K    D    E    F    E    H    Q    Q    G    E    D    F    G    E    V    V     R
p38a      178    Q    K    H    R    H    D    N    D    F    Y    Q    D    S    D    Y    W    E    I    V     x
GSK3b     175    T    K    D    F    D    T    Y    V    L    Y    Q    G    Q    D    Y    Y    E    L    I     p
Akt1      158    E    I    E    H    E    E    F    E    F    A    E    K    E    D    F    E    E    V    V     R
CDK5      147    E    R    V    S    E    D    K    P    S    F    Q    N    Q    D    E    W    D    V    I     x
JNK1      122    Q    K    H    R    E    N    C    H    Y    Y    Q    G    S    D    Y    Y    E    V    V     a
Fyn       120    Q    T    S    S    E    S    L    L    D    A    Q    N    A    D    A    K    E    A    S     x
PKCd      109    I    A    D    C    E    D    M    L    F    A    E    G    D    D    F    D    E    I    V     R
CK1a1     102    K    L    R    Q    D    S    E    M    M    D    Q    N    D    D    L    D    N    A    D     d
Lck        98    K    S    S    A    E    S    V    I    D    A    Q    N    A    D    A    K    E    A    S     d
Plk1       97    V    I    L    P    E    S    L    E    Y    R    Q    R    G    D    L    N    E    V    V     l
PKCb1      86    N    I    D    C    E    D    M    E    F    A    E    G    D    D    F    D    E    I    V     r
Lyn        85    K    T    S    A    E    S    L    L    D    A    Q    N    A    D    A    K    E    A    S     e
PKG1       85    N    I    Q    H    E    E    W    D    F    A    C    G    E    D    F    E    E    I    A     R
Abl        73    T    T    E    E    E    N    L    A    Y    T    Q    N    R    D    A    K    E    S    S     d
InsR       73    T    T    E    E    E    D    K    L    Q    A    E    K    R    D    G    R    E    S    S     x
PAK1       73    T    Q    K    L    E    S    T    E    A    R    E    Q    D    D    M    Y    E    V    V     R
PKCe       72    N    V    D    C    E    D    M    E    F    A    E    G    D    D    F    D    E    I    V     R
PKCz       71    D    V    D    W    E    D    M    E    F    A    E    G    D    D    F    N    E    I    V     r

Figure 2 Kinase consensus sequences. Consensus sequences of some of the kinases for which we have the largest number of experimentally confirmed phosphorylation sites in protein substrates (except for PDK1, which is shown because it is a threonine-specific protein kinase). Phosphorylation sites are marked by bold font at the center of the consensus sequence. The total number of phosphorylation sites for each kinase is shown in the last column.

Kinase name       Consensus sequence   # phosph. sites
PKACa (PRKACA)    xxrxRRlSlxxxxxx      734
PKCa (PRKCA)      xxxxRRxSfKrkkxx      523
CK2a1 (CSNK2A1)   xxxeeedSDdEeeee      483
ERK2 (MAPK1)      xxpxpPlSPtppxxx      410
CDK1 (CDC2)       xxxxlpxSPxkkxxx      393
SRC               xxxeedvYgxvxxxx      385
ERK1              xxpppPlSPtptxxx      292
CDK2              xxxxxpxSPgKkxlx      201
PDK1 (PDPK1)      xxgxttxTFCGTpeY      43

Histidine (H) has a smaller positive charge than lysine (K) and arginine (R). Therefore, scores for it are +1 for interacting with negatively charged amino acids and –1 for interacting with positively charged amino acids. Hydrophobic amino acids attract each other (score +2) while they repel both positively and negatively charged amino acids (score –1). S, T and Y residues have a weak tendency to bind to each other (score +0.5), while they are completely neutral with the other amino acids (score 0). For all the amino acids discussed so far, it is not relevant whether they are in the kinase catalytic domain or the phosphosite region. In both situations the score is the same, which makes the interaction matrix symmetric. However, glycine (G) is favored in the phosphosite region, because it is a small amino acid that creates a pocket on the surface of the region that permits the catalytic domain of the kinase to come closer to the region.
The reason that we do not consider the effect of G in the catalytic domain is that we are unclear about the 3D structure of most kinase catalytic domains, while phosphosite regions are linear or semi-linear.

If we look at Figure 1 we observe that columns 69, 135, and 161 are quite conserved with negatively charged amino acids. Since mostly positively charged amino acids (e.g. arginine (R)) appear at the (–3) position of the consensus sequences of the substrates, these positions have a high charge dependency score C and are strong candidate positions for interaction with the (–3) position of the phosphosite regions. On the other hand, these positions are very conserved and they seem to be uncorrelated with the (–3) position of the phosphosite regions (e.g. when the (–3) position is positively charged or neutral, positions 69, 135, and 161 are still negatively charged). Therefore, we need a criterion to combine the correlation and charge dependency measures. The following equation combines these two measures:

$$C_c(X, Y) = \sum_{x \in A} \sum_{y \in B} R(x, y) \cdot p(x, y) \log \frac{p(x, y)}{p_1(x)\, p_2(y)}, \tag{5}$$

where $C_c(X, Y)$ is called the correlation–charge dependency of two positions, X in catalytic domains and Y in consensus sequences.

Figure 3 Residue interaction matrix. Residue interaction matrix R. Rows show the amino acids in the substrate phosphosite regions and columns are amino acids in the catalytic domain of the kinases. Negatively charged amino acids (D, E) are red, positively charged amino acids (K, R, H) are blue, hydrophobic amino acids (F, I, L, V) green, proline orange, and phosphorylation site residues (S, T, Y) are represented as gray. 'X' corresponds to the absence of an amino acid, which occurs for phosphosites located at the N- and C-termini of proteins. This figure was derived from knowledge of the structure and charge of the amino acid side chains.

      A    C    D    E    F    G    H    I    K    L    M    N    P    Q    R    S    T    V    W    Y    X
A     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
C     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
D     0    0   -2   -2   -1    0    1   -1    2   -1    0    0    0    0    2    0    0   -1    0    0    0
E     0    0   -2   -2   -1    0    1   -1    2   -1    0    0    0    0    2    0    0   -1    0    0    0
F     0    0   -1   -1    2    0   -1    2   -1    2    0    0    0    0   -1    0    0    2    0    0    0
G     0    0    0    0  0.5   -1  0.5    0  0.5    0    0  0.5    0  0.5    1    0    0    0    1  0.5    0
H     0    0    1    1    0    0   -1    0   -1    0    0    0    0    0   -1    0    0    0    0    0    0
I     0    0   -1   -1    2    0   -1    2   -1    2    0    0    0    0   -1    0    0    2    0    0    0
K     0    0    2    2   -1    0   -1   -1   -2   -1    0    0    0    0   -2    0    0   -1    0    0    0
L     0    0   -1   -1    2    0   -1    2   -1    2    0    0    0    0   -1    0    0    2    0    0    0
M     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
N     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
P     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
Q     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
R     0    0    2    2   -1    0   -1   -1   -2   -1    0    0    0    0   -2    0    0   -1    0    0    0
S     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0  0.5  0.5    0    0  0.5    0
T     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0  0.5  0.5    0    0  0.5    0
V     0    0   -1   -1    2    0   -1    2   -1    2    0    0    0    0   -1    0    0    2    0    0    0
W     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
Y     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0  0.5  0.5    0    0  0.5    0
X     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0

Using this hybrid criterion $C_c(X, Y)$ in our example, column 120 gets the maximum correlation–charge dependency in Figure 1. It is usually preferred that, for a particular position in consensus sequences, SDRs in the catalytic domain stay near each other, because they can then easily interact with that position in consensus sequences. For example, positions 120 and 121 should be preferred to positions 120 and 220. However, in the 3D structure of the protein kinase domain, amino acids that are well separated in the sequence could be situated next to each other. In view of such exceptions, we did not include this preference in our model.
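The mutual information and the correlation–charge dependency of Equation (5) can be sketched for one pair of aligned columns as follows. This is a minimal sketch under stated assumptions: probabilities are estimated as simple relative frequencies over the kinases, the function names are hypothetical, and `toy_interaction` encodes only a small fragment of the residue interaction matrix (acidic residue in the domain against basic residue in the phosphosite region, score +2) rather than the full Figure 3.

```python
import math
from collections import Counter


def correlation_charge_dependency(xs, ys, interaction):
    """Equation (5) for one domain column xs and one consensus column ys.

    xs[k] and ys[k] are the residues of kinase k; interaction(x, y) plays
    the role of R(x, y). Pairs that never co-occur contribute zero.
    """
    n = len(xs)
    joint = Counter(zip(xs, ys))
    p1 = Counter(xs)
    p2 = Counter(ys)
    total = 0.0
    for (x, y), count in joint.items():
        pxy = count / n
        total += interaction(x, y) * pxy * math.log(pxy / ((p1[x] / n) * (p2[y] / n)))
    return total


def mutual_information(xs, ys):
    """I(X, Y): the same sum with every interaction weight set to 1."""
    return correlation_charge_dependency(xs, ys, lambda x, y: 1.0)


def toy_interaction(x, y):
    """Hypothetical fragment of R: acidic domain residue vs basic substrate residue."""
    return 2.0 if x in "DE" and y in "RK" else 0.0


# A fully conserved acidic column carries no mutual information with the
# substrate column, although its charge dependency C of Equation (4) is high.
conserved = mutual_information("EEEE", "RRKK")
covarying = correlation_charge_dependency("EEDD", "RRKK", toy_interaction)
```

The conserved column illustrates the argument made above for columns 69, 135, and 161: every mutual information term of a constant column is zero, so the hybrid score ignores it even though the charges match.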
Algorithm 1 computes the best SDRs (positions X in the catalytic domain) for each kinase consensus sequence position Y and their interaction probabilities

$$P(Y \mid X) = \frac{P(X, Y)}{P(X)}$$

using correlation–charge dependencies.

Algorithm 1 Computing SDRs
Input: 224 human kinase catalytic domains and their consensus sequences. Parameter m ≤ 247.
Output: SDRs and their interaction probabilities for each position in the phosphosite region.
1: for j ← 1, 15 do
2:   Let $Y_j$ be the jth position in consensus sequences
3:   for i ← 1, 247 do
4:     Let $X_i$ be the ith position in catalytic domains
5:     Compute $C_c(X_i, Y_j)$
6:   end for
7:   Order positions $X_i$ based on $C_c(X_i, Y_j)$ (decreasingly). Let $Z_{j,k}$ be the kth position in this order.
8:   Output $Z_{j,1}, \ldots, Z_{j,m}$ as SDRs for position $Y_j$ and interaction probabilities $P(Y_j \mid Z_{j,1}), \ldots, P(Y_j \mid Z_{j,m})$.
9: end for

By examination of the x-ray crystallographic 3D structures of 11 protein kinases co-crystallized with peptide substrates, we determined that usually at most seven SDRs may interact with an amino acid position on the substrate phosphosite region; therefore we set the value m in Algorithm 1 to 7. Furthermore, considering higher values for m will result in very smooth specificity matrices that are more or less similar for all the kinases.

Of the 516 known human protein kinases, 478 are typical kinases with 488 known catalytic domains and the remaining 38 are all atypical kinases, for only four of which we have phosphosite specificity data. Algorithm 2 computes the profile and PSSM matrices for 488 catalytic domains, using the SDRs determined by Algorithm 1.
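The ranking step of Algorithm 1 (lines 1 to 8) can be sketched as below. This is an illustrative sketch only: it assumes the correlation–charge dependencies have already been computed into a table `cc`, the function name `select_sdrs` is hypothetical, indices are 0-based rather than 1-based as in the paper, and the interaction probabilities of line 8 are not computed here.

```python
def select_sdrs(cc, m=7):
    """Return, for each consensus position j, the m best domain positions.

    cc[i][j] is the correlation-charge dependency of domain position i and
    consensus position j. Positions are returned best first (0-based).
    """
    n_domain = len(cc)
    n_consensus = len(cc[0])
    sdrs = []
    for j in range(n_consensus):
        # Order all domain positions by decreasing score for this column.
        ranked = sorted(range(n_domain), key=lambda i: cc[i][j], reverse=True)
        sdrs.append(ranked[:m])
    return sdrs


# Hypothetical scores for 3 domain positions and 2 consensus positions.
toy_cc = [
    [0.1, 0.9],
    [0.5, 0.2],
    [0.3, 0.7],
]
toy_sdrs = select_sdrs(toy_cc, m=2)
```

In the paper's setting `cc` would have 247 rows and 15 columns and m would be 7, giving seven candidate SDRs for each of the 15 phosphosite region positions.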
The formula in Line 5 of Algorithm 2 is based on the observation that those interactions which have higher correlation–charge dependency are more important in the estimation of profile matrices.

Algorithm 2 Prediction of PSSM matrices of all kinases.
Input: SDRs and interaction probabilities from Algorithm 1 and 488 catalytic domains.
Output: Profile and PSSM matrices of all kinase catalytic domains.
1: Estimation of the profile matrix of each kinase.
2: for k ← 1, 488 do
3:   for j ← 1, 15 do
4:     Estimation of interaction probabilities
5:     Compute $P(Y_j \mid Z_{j,1}, Z_{j,2}, \ldots, Z_{j,m})$ as
       $$P = \frac{\sum_{\ell=1}^{m} C_c(Z_{j,\ell}, Y_j)\, P(Y_j \mid Z_{j,\ell})}{\sum_{\ell=1}^{m} C_c(Z_{j,\ell}, Y_j)}$$
6:   end for
7:   Store 21 × 15 computed values in profile matrix $P_k$
8: end for
9: Computing PSSM matrices.
10: Compute the background frequencies B using the idea mentioned in the section "Background frequencies of amino acids".
11: Compute the PSSM matrix of each kinase using Equation (2).

Using profile matrices for prediction

In the previous section we used phospho-peptide consensus sequences of each kinase to compute correlation–charge dependency and SDRs, because it was easier to describe. Another idea is to use the profile matrices of each kinase in Algorithm 1 without determination of their consensus sequence. With this strategy we use more information, which might allow for better prediction, while on the other hand it may lead to overfitted results. In "Results and discussion", we will test both of these algorithms (1. consensus based and 2. profile based) and compare the results. In the profile matrix based method, the main difficulty is that for the random variable $Y_j$ (column j in the aligned consensus sequences) we do not have the correlated values of the random variable $X_i$ (column i in the aligned catalytic domains). Instead, for each value in $X_i$ we have 21 different amino acid probabilities of $Y_j$.
Assume $a_{k,i}$ is the amino acid at position i in the aligned catalytic domain of kinase k, and let $p^k_{l,j}$ be the probability of the lth amino acid (1 ≤ l ≤ 21) at position j (1 ≤ j ≤ 15) of the profile matrix of kinase k. Figure 4 represents these notations in a visual manner. Before computing the correlation–charge dependency of two columns (or random variables) $X_i$ and $Y_j$ we compute the probability of amino acids in each random variable. $P(X_i = x)$ is computed by maximum likelihood estimation using the $a_{k,i}$ amino acids as follows:

$$P(X_i = x) = \frac{\sum_{k=1}^{w} \langle x = a_{k,i} \rangle}{w}, \tag{6}$$

where $\langle x = a_{k,i} \rangle$ is the indicator variable taking value one when $a_{k,i}$ is equal to x and zero otherwise, and w is the number of all kinase catalytic domains. $P(Y_j = l)$ or $P(Y_j = y)$ is computed by the following equation:

$$P(Y_j = y) = \frac{\sum_{k=1}^{w} p^k_{y,j}}{w}. \tag{7}$$

Similar to the previous section, we replace $P(X_i = x)$ and $P(Y_j = y)$ with $p_1(x)$ and $p_2(y)$ respectively. $p(x, y)$ is also computed using maximum likelihood estimation as follows:

$$p(x, y) = \frac{\sum_{k=1}^{w} \langle x = a_{k,i} \rangle \cdot p^k_{y,j}}{w}. \tag{8}$$

Having $p_1(x)$, $p_2(y)$, and $p(x, y)$, we can now compute the correlation–charge dependency using Equation (5) and pick the top values for SDRs.

Another modification which should be applied to the consensus method is to change the conditional probability of phospho-peptide positions given SDRs (denoted $P(Y_j \mid Z_{j,l})$) in Line 5 of Algorithm 2. According to Bayes' theorem this probability equals $P(Y_j, Z_{j,l}) / P(Z_{j,l})$, and the numerator and denominator can be computed similarly to Equations (8) and (6) respectively.

Figure 4 Computing correlation charge dependency in profile matrix based module of the predictor. In the left part of the figure the aligned catalytic domains (positions 1 to 247) of the 302 training kinases are shown, and on the right hand side the 21 × 15 profile matrix of each kinase is drawn. The same columns in all the kinase profile matrices create only one random variable, whose correlation to the aligned catalytic domain should be studied.

Results and discussion

In this study we perform four different computational simulations on the proposed predictor. The first simulation evaluates the accuracy of the profile matrices predicted by the consensus and profile-based modules of the predictor. The second simulation builds PSSMs based on the predicted profile matrices, uses them as classifiers for each kinase, and computes the confusion matrices to further determine the accuracy of the predictor. The third simulation compares NetPhorest with our predictor based on our set of kinase-phosphosite pairs. Finally, the last simulation compares NetPhorest and our method on NetPhorest data sets. Each of these tests is explained more thoroughly in the following subsections.

Comparison of profile matrices

In this simulation we compare the accuracy of the predicted matrices with the original matrices computed from experimental kinase-phosphosite pairs. For 308 kinases we could gather 9,012 confirmed phosphosites from PhosphoSitePlus, Phospho.ELM, the scientific literature and other databases. The confirmed kinase-phosphosite pairs were partitioned into training and test sets. The test set contains the top five kinases that have the most phosphosite data.
The reason is that, by choosing those kinases for the test set, we can be almost confident that the experimentally computed profile matrices are correct and reasonable to compare with the predicted matrices. The training set contains 302 kinases with 6,515 experimentally confirmed phospho-peptides. To run our predictor on the training data, we first needed to generate reliable consensus phospho-peptide sequences for each kinase (for the consensus based module of the predictor); therefore we eliminated those kinases having fewer than 10 phospho-peptides. Among the 302 kinases in the training set, 224 kinases had more than 10 phospho-peptides, and we could compute a consensus sequence for each of these 224 kinases using the process explained earlier. From about 516 human kinases, we gathered 488 catalytic domains in 478 typical human protein kinases, aligned these catalytic domains, and used Algorithms 1 and 2 to compute SDRs and profile matrices.

To evaluate these predicted matrices, we also computed the empirical profile matrices of the 302 kinases in the training set by the method described earlier, and compared the results using the sum of squared differences. Figure 5 illustrates how we set up this comparison, and Figure 6 shows the distribution of these errors, including the results for the profile matrix based module of the predictor. The majority of the predicted matrices are very similar to those generated by known substrate alignments. Interestingly, the results on the test set are more accurate (with a sum of squared error of less than 1) than the predicted results on the training set (where the error can reach 10 to 15) for both modules. The reason is that each kinase in the test set has more experimental substrate peptides, and as a result its empirically computed matrix is closer to the correct specificity of the kinase.
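The comparison by sum of squared differences can be illustrated with a short sketch. It assumes both matrices are stored as equally shaped nested lists of probabilities; the function name and the 2×2 toy values are hypothetical and only stand in for the 21×15 profile matrices.

```python
def sum_squared_error(predicted, empirical):
    """Sum of squared differences between two equally shaped matrices."""
    if len(predicted) != len(empirical):
        raise ValueError("matrices must have the same number of rows")
    total = 0.0
    for row_p, row_e in zip(predicted, empirical):
        if len(row_p) != len(row_e):
            raise ValueError("rows must have the same length")
        total += sum((p - e) ** 2 for p, e in zip(row_p, row_e))
    return total


# Toy matrices with made-up values (each column sums to 1).
predicted = [[0.5, 0.25], [0.5, 0.75]]
empirical = [[0.25, 0.25], [0.75, 0.75]]
sse = sum_squared_error(predicted, empirical)
```

An error of zero means the predicted and empirical matrices agree entry by entry; larger values indicate larger disagreement summed over all amino acids and positions.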
In the training set, however, there are many kinases with fewer than 20 to 30 experimentally confirmed substrate peptides, and we may expect that their empirically computed matrices are not close to the correct profile of each kinase. The profile matrix based module uses more information than the consensus module; it does not overfit the data, and it is more accurate on both the test set (a total sum of squared error (SSE) of 1.99 for all five kinases) and the training set (a total SSE of 494.22). As evident in Figure 6, the consensus based method had higher errors, with a total SSE of 584.22 for all kinases in the training set and 2.66 for all kinases in the test set.

Computation of confusion matrices and accuracy
In this simulation, we measured the accuracy of each predicted kinase substrate specificity for the kinases in the test set. For this, we defined a classifier for each kinase and prepared positive and negative instances for each classifier. We used the PSSM of each kinase as the classifier and took the confirmed substrate peptides of each kinase in the test set as positive instances. For negative instances, unlike previous attempts [10,11], we randomly generated peptides from a uniform distribution, as many for each kinase in the test set as it has positive instances. The reason is that if we chose as negative instances those peptides in the substrate protein that are not experimentally confirmed, future studies (e.g. mass spectrometry analyses) might later prove them to be positive instances. Afterward, for any given substrate phospho-peptide, we computed the PSSM score as in Equation (3); if the score was less than zero, the phospho-peptide was declared negative for the kinase in question. Otherwise, we accepted it as a candidate peptide phosphorylatable by the kinase.
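The classification procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the PSSM score is taken to be a sum of position-specific scores (Equation (3) is not reproduced in this section), random negatives are drawn over the 20 standard amino acids rather than the 21 symbols of the profile matrices, and all names and toy score values are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumption: 20 standard residues only
PEPTIDE_LENGTH = 15                   # profile matrices cover 15 positions


def pssm_score(pssm, peptide):
    """Sum position-specific scores (assumed additive form of the score)."""
    return sum(pssm[pos].get(res, 0.0) for pos, res in enumerate(peptide))


def random_negatives(n, rng):
    """Draw n peptides whose residues are sampled uniformly at random."""
    return [
        "".join(rng.choice(AMINO_ACIDS) for _ in range(PEPTIDE_LENGTH))
        for _ in range(n)
    ]


def confusion_counts(pssm, positives, negatives):
    """Apply the rule 'score < 0 means negative' and count the outcomes."""
    tp = sum(pssm_score(pssm, p) >= 0 for p in positives)
    fp = sum(pssm_score(pssm, p) >= 0 for p in negatives)
    return {"TP": tp, "FP": fp,
            "FN": len(positives) - tp, "TN": len(negatives) - fp}


def metrics(c):
    """Accuracy, sensitivity and specificity from confusion counts."""
    total = c["TP"] + c["FP"] + c["FN"] + c["TN"]
    return {
        "AC": (c["TP"] + c["TN"]) / total,
        "SE": c["TP"] / (c["TP"] + c["FN"]),
        "SP": c["TN"] / (c["TN"] + c["FP"]),
    }


# Toy PSSM with made-up scores: one dict of residue scores per position.
toy_pssm = [{} for _ in range(PEPTIDE_LENGTH)]
toy_pssm[7] = {"S": 1.0, "T": 1.0}   # central phospho-acceptor position
toy_pssm[4] = {"R": 2.0, "D": -3.0}  # hypothetical upstream determinant

negatives = random_negatives(3, random.Random(0))
counts = confusion_counts(toy_pssm, ["AAAARAASAAAAAAA"], ["AAAADAASAAAAAAA"])
```

Applied to the consensus based counts reported for PKACa in Figure 7 (TP 443, FP 21, FN 277, TN 699), `metrics` gives AC 0.7931, SE 0.6153 and SP 0.9708 after rounding to four decimals, which matches the values printed in the figure.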
Figure 5 also shows the flow of the data for preparing positive and negative substrate phospho-peptides for the five test kinases. Similar results were computed for all the kinases in the test set: the classifiers were successful in identifying most of the negative instances (low rate of false positives), but they were much less efficient at identifying all the positive instances (high rate of false negatives). Approximately 77% accuracy, 60% sensitivity and 95% specificity were computed for all the classifiers in the test set. Figure 7 presents the exact confusion matrix, accuracy, sensitivity and specificity values for each kinase in the test set for both the consensus and profile matrix based sub-modules of our predictor. The sensitivity of all classifiers in the consensus based method was low, while this disadvantage was eliminated in the profile matrix based method, which had about 10% higher accuracy than the consensus method.

NetPhorest vs. our predictor
In this part we compare the accuracy of NetPhorest and our method on the same kinase-phosphosite pairs. To this end we extracted 1,978 distinct phospho-peptide-kinase pairs from the total of 9,012 pairs discussed at the beginning of this section. This set was then used to measure the accuracy of each predictor. For each phospho-peptide in this set we stored the best kinase (highest score) suggested by NetPhorest and by our method. We then counted how many of the predicted kinases match the original kinases in the 1,978 pairs. Because NetPhorest works with kinase groups and predicts the best kinase group for the input phospho-peptide, whenever the original kinase falls into the predicted kinase group we count it as a match.
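The matching rule can be sketched in a few lines. This is a minimal illustration, assuming a lookup table from a predicted group label to its member kinases; the table, the function names and the way predictions are stored are hypothetical, and only the group label and peptide are taken from the paper's TGFbR2 example.

```python
# Hypothetical lookup from a predicted group label to its member kinases.
GROUP_MEMBERS = {
    "ACTR2_ACTR2B_TGFbR2_group": {"ACTR2", "ACTR2B", "TGFbR2"},
}


def is_hit(true_kinase, predicted, group_members=GROUP_MEMBERS):
    """True if the prediction is the kinase itself or a group containing it."""
    return true_kinase == predicted or true_kinase in group_members.get(
        predicted, set()
    )


def count_hits(pairs, predictions):
    """Count the (peptide, kinase) pairs whose prediction is a hit."""
    return sum(is_hit(kinase, predictions[peptide]) for peptide, kinase in pairs)


pairs = [("TRKLMEFpSEHCAIIL", "TGFbR2")]
predictions = {"TRKLMEFpSEHCAIIL": "ACTR2_ACTR2B_TGFbR2_group"}
hits = count_hits(pairs, predictions)
```

Accepting a whole group as a match favours the group-based predictor, since a group prediction can be right for any of its members while an individual-kinase prediction must name the exact kinase.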
For instance, if the input pair is <TRKLMEFpSEHCAIIL, TGFbR2> and NetPhorest predicts the kinase group ACTR2_ACTR2B_TGFbR2_group for the input phospho-peptide 'TRKLMEFpSEHCAIIL', we accept it as a hit. After running this experiment, we observed that NetPhorest was successful in 72 of the pairs and our proposed method in 82 cases. With this experiment we show that our method outperformed NetPhorest on three counts. Firstly, it has a higher rate of matches with empirical data. Secondly, it separately considers 492 different kinase catalytic domains, whereas NetPhorest matchings are based on 76 groups of related kinases. Thirdly, the 2,497 phospho-peptide pairs for the five well studied kinases PKACa, PKCa, CK2a1, ERK2 and CDK1 are not used in training the classifiers and are kept solely for testing the algorithm, while NetPhorest uses most of these data to improve its accuracy.

Figure 5 Data and process flow of the experiments. The figure shows the order of creating the datasets for the computational simulations and the comparison of our predictor results with current state of the art methods such as NetPhorest and NetworKIN. Flowchart labels: 516 kinases in human; aligned 488 kinase catalytic domains from 478 kinases; Phospho.ELM, PhosphoSitePlus, literature and other small databases; 9,012 phosphosite peptide pairs with 308 kinase domains; test set of 5 kinase domains with the most phospho-peptides (2,497 peptides in total); training set of 302 kinases with 6,515 phospho-peptides in total; kinases with more than 10 peptides; 224 kinases with consensus sequences; compute profile matrices; determine SDRs and profile matrices of 488 kinases; prepare 488 profile matrices of kinase domains; remove atypical kinases; add profile matrices of 4 atypical kinases (ATR, ATM, mTOR and DNAPK) from phosphoprotein substrate alignment data; background (surface) frequency of amino acids; compute PSSM matrices for 488 kinase domains; compare profile matrices for training data; compare profile matrices for test data; generate random negative phospho-peptides (uniform distribution) for each of the 5 kinases, equal in number to their positive data; compute confusion matrices for 5 kinases; extract 1,978 distinct phospho-peptides with one original known kinase each; run NetPhorest and NetworKIN; predict the best kinase for each peptide with each method; compare the best kinases suggested by each method with the original known kinases; available machine learning methods (ANN, SVM, HMM) for those kinases with phospho-peptide data in the literature; PhosphoNET with 93,000 phosphosites.

NetPhorest vs. our predictor based on NetPhorest confirmed sites
In this part we compare the consensus module of the predictor with NetPhorest on the confirmed phospho-peptides in the NetPhorest database. At this juncture, NetPhorest contains 10,261 confirmed phosphosites and has 76 specified groups for a total of 179 kinases linked to phosphorylation of 8,746 of those sites. In this dataset, some phosphosites had more than one kinase phosphorylating them. To make the comparison with NetPhorest easier, we retained only the best kinase for each phosphosite. We also considered only those kinases in our predictor that were included in the list of 179 protein kinases covered by NetPhorest. As a result, the number of kinase-phosphosite pairs was reduced to 6,299. To examine how many of these kinase-phosphosite pairs were consistent with our predictor, we subjected these 6,299 phosphosites to our predictor algorithm to determine which individual kinases were more likely to phosphorylate these sites. We ranked the 179 protein kinases by their calculated PSSM scores for each NetPhorest confirmed phosphosite region. Ideally, the experimentally confirmed kinases for each phosphosite region would have high PSSM scores in our predictor.
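The ranking step can be sketched as follows. This is a minimal illustration, assuming each kinase already has a PSSM score for the phosphosite region; the function name and the toy scores are hypothetical and are not values from the study.

```python
def rank_of_confirmed(scores, confirmed_kinase):
    """Return the 1-based rank of a kinase when sorted by descending score."""
    ordered = sorted(scores, key=lambda kinase: scores[kinase], reverse=True)
    return ordered.index(confirmed_kinase) + 1


# Made-up PSSM scores of five kinases for one phosphosite region.
toy_scores = {"PKACa": 4.2, "PKCa": 3.1, "CK2a1": -1.5, "ERK2": 0.7, "CDK1": -0.3}
rank = rank_of_confirmed(toy_scores, "ERK2")
```

Collecting this rank over all phosphosites gives a rank distribution of the kind summarized in Figure 8.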
We cannot, however, expect these confirmed kinases always to have the maximum PSSM scores: although these kinases were experimentally demonstrated to phosphorylate those phosphosites, it is unclear whether they are always the best possible matches. Figure 8 shows that 1058 NetPhorest kinase groups were similarly predicted by our algorithm as the best kinase groups for the specific phospho-peptides, 651 kinase groups were predicted as the second best kinase groups, and so on. On average each NetPhorest kinase family has 3.3 kinases, and because our algorithm works with individual kinases rather than groups, we adjusted the ranks and intervals for the results from our algorithm accordingly to provide a direct comparison. About 35 percent of the NetPhorest predicted kinase groups corresponded to the top 10 candidate kinases proposed by our algorithm. Therefore, our predictor had prediction accuracy similar to NetPhorest, but we achieved coverage of three times as many different protein kinases, with individual assignments rather than groups of kinases. This result is also shown in our previous work in BIBM 2010 [21].

Figure 6 Comparison of predicted vs. experimentally computed profile matrices. The figure contains four histograms, each representing the sum of squared error between the predicted profile matrices and the experimentally computed profile matrices based on confirmed phospho-peptide pairs for each kinase. The x-axis is the sum of squared error, and the y-axis is the frequency, i.e. the number of matrices that have the specified error. The left histograms are based on the consensus based module of the predictor, and the right histograms on the profile matrix based module. The top histograms show the training set, and the bottom histograms show the results on the five test kinases. The total sum of squared error (SSE) for the consensus based module is 584.89 on the training data and 2.66 on the test set, while the total SSE for the profile matrix based module is 494.22 on the training data and 1.99 on the test set.

Figure 7 Confusion matrices for two modules. The figure includes two tables, which represent the classification power of the consensus module and of the profile matrix module of the predictor. In each table the confusion matrix is given by true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN). Accuracy (AC), sensitivity (SE), and specificity (SP) are computed from the confusion matrices.

Consensus based
Kinase (gene)     UniProt   TP    FP    FN    TN    AC      SE      SP
PKACa (PRKACA)    P17612    443   21    277   699   0.7931  0.6153  0.9708
PKCa (PRKCA)      P17252    308   23    206   491   0.7772  0.5992  0.9553
CK2a1 (CSNK2A1)   P68400    255   54    206   407   0.7180  0.5532  0.8829
ERK2 (MAPK1)      P28482    234   9     175   400   0.7751  0.5721  0.9780
CDK1 (CDC2)       P06493    233   10    170   383   0.7710  0.5674  0.9746

Profile matrix based
Kinase (gene)     UniProt   TP    FP    FN    TN    AC      SE      SP
PKACa (PRKACA)    P17612    661   80    59    640   0.9035  0.9181  0.8889
PKCa (PRKCA)      P17252    462   72    52    442   0.8794  0.8988  0.8599
CK2a1 (CSNK2A1)   P68400    423   133   38    328   0.8145  0.9176  0.7115
ERK2 (MAPK1)      P28482    355   13    54    396   0.9181  0.8680  0.9682
CDK1 (CDC2)       P06493    351   19    42    374   0.9224  0.8931  0.9517

Figure 8 Comparison with NetPhorest predictions. This table shows how many times the NetPhorest kinase groups fall into ranks 1 to 30 as determined by our kinase substrate predictor algorithm. For instance, the first row illustrates that 1058 NetPhorest kinase groups (16.8%) were similarly predicted by our algorithm as the best kinase groups for the specific phospho-peptides. Because every kinase group in NetPhorest contains 3.3 kinases on average, the rank can be adjusted, and we can say that 1058 NetPhorest kinases (and not kinase groups) were similarly predicted by our algorithm as among the best three kinases for the specific phospho-peptides.

Rank        # Kinase Groups   % Kinase Groups   Adjusted Rank
1           1058              16.8              3
2           651               10.3              7
3           517               8.2               10
4           469               7.4               13
5           396               6.3               17
6           282               4.5               20
7           273               4.3               23
8           220               3.5               27
9           171               2.7               30
10          137               2.2               33
1 to 5      3091              49.1              1 to 17
6 to 10     1083              17.2              18 to 33
11 to 15    402               6.4               34 to 50
16 to 20    186               3.0               51 to 66
21 to 25    43                0.7               67 to 83
26 to 30    1                 0.0               84 to 100
>30         1491              23.7              >100

Acknowledgment
This work was supported in part by a CRD grant from the Natural Sciences and Engineering Research Council of Canada and the MITACS Accelerate Internship Program.
This article has been published as part of Proteome Science Volume 9 Supplement 1, 2011: Proceedings of the International Workshop on Computational Proteomics.
The full contents of the supplement are available online.

Author details
1 Department of Computer Science, University of British Columbia, Vancouver, Canada. 2 Department of Mathematics, Simon Fraser University, Burnaby, Canada. 3 Department of Medicine, University of British Columbia, Vancouver, Canada. 4 Kinexus Bioinformatics Corporation, Vancouver, Canada.

Competing interests
The authors declare that they have no competing interests.

Published: 14 October 2011

References
1. Kostich M, English J, Madison V, Gheyas F, Wang L, Qiu P, Greene J, Laz T: Human members of the eukaryotic protein kinase family. Genome Biology 2002, 3(9).
2. Via M: Kinases: From targets to therapeutics. Cambridge Health Institutes Insight Reports 2003, 1:1-124.
3. Pelech S: Kinase profiling: The mysteries unraveled. Future Pharmaceuticals 2006, 1:23-25.
4. Pelech S: Dimerization in protein kinase signaling. Journal of Biology 2006, 5(12):1-7.
5. Linding R, Jensen LJJ, Ostheimer GJ, van Vugt MA, Jørgensen C, Miron IM, Diella F, Colwill K, Taylor L, Elder K, Metalnikov P, Nguyen V, Pasculescu A, Jin J, Park JGG, Samson LD, Woodgett JR, Russell RB, Bork P, Yaffe MB, Pawson T: Systematic discovery of in vivo phosphorylation networks. Cell 2007, 129(7):1415-1426.
6. [].
7. Saunders NFW, Brinkworth RI, Huber T, Kemp BE, Kobe B: Predikin and PredikinDB: a computational framework for the prediction of protein kinase peptide specificity and an associated database of phosphorylation sites. BMC Bioinformatics 2008, 9:245.
8. Obenauer JC, Cantley LC, Yaffe MB: Scansite 2.0: Proteome-wide prediction of cell signaling interactions using short sequence motifs. Nucleic Acids Research 2003, 31(13):3635-3641.
9. Blom N, Gammeltoft S, Brunak S: Sequence and structure-based prediction of eukaryotic protein phosphorylation sites. Journal of Molecular Biology 1999, 294(5):1351-1362.
10. Kim JH, Lee J, Oh B, Kimm K, Koh I: Prediction of phosphorylation sites using SVMs. Bioinformatics 2004, 20(17):3179-3184.
11. Dang THH, Van Leemput K, Verschoren A, Laukens K: Prediction of kinase-specific phosphorylation sites using conditional random fields. Bioinformatics 2008, 24(24):2857-2864.
12. Wan J, Kang S, Tang C, Yan J, Ren Y, Liu J, Gao X, Banerjee A, Ellis LB, Li T: Meta-prediction of phosphorylation sites with weighted voting and restricted grid search parameter selection. Nucleic Acids Research 2008, 36(4).
13. Miller ML, Blom N: Kinase-specific prediction of protein phosphorylation sites. Methods in Molecular Biology 2009, 527:299-310.
14. Miller MLL, Jensen LJJ, Diella F, Jørgensen C, Tinti M, Li L, Hsiung M, Parker SA, Bordeaux J, Sicheritz-Ponten T, Olhovsky M, Pasculescu A, Alexander J, Knapp S, Blom N, Bork P, Li S, Cesareni G, Pawson T, Turk BE, Yaffe MB, Brunak S, Linding R: Linear motif atlas for phosphorylation-dependent signaling. Science Signaling 2008, 1(35):ra2.
15. Jensen LJ, Kuhn M, Stark M, Chaffron S, Creevey C, Muller J, Doerks T, Julien P, Roth A, Simonovic M, Bork P, von Mering C: STRING 8-a global view on proteins and their functional interactions in 630 organisms. Nucleic Acids Research 2009, 37(Database issue):D412-416.
16. [].
17. Dinkel H, Chica C, Via A, Gould CM, Jensen LJ, Gibson TJ, Diella F: Phospho.ELM: a database of phosphorylation sites - update 2011. Nucleic Acids Research 2011, 39(suppl 1):D261-D267.
18. Chen SF, Goodman J: An empirical study of smoothing techniques for language modeling. Proceedings of the 34th annual meeting on Association for Computational Linguistics. Morristown, NJ, USA: Association for Computational Linguistics; 1996, 310-318.
19. Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD, Gibson TJ, Higgins DG: Clustal W and Clustal X version 2.0. Bioinformatics 2007, 23(21):2947-2948.
20. Cover TM, Thomas JA: Elements of Information Theory. 2nd edition. Wiley-Interscience; 2006.
21. Safaei J, Manuch J, Gupta A, Stacho L, Pelech S: Prediction of human protein kinase substrate specificities. BIBM 2010, 259-264.

doi:10.1186/1477-5956-9-S1-S6
Cite this article as: Safaei et al.: Prediction of 492 human protein kinase substrate specificities. Proteome Science 2011 9(Suppl 1):S6.

