He et al. J Cheminform (2017) 9:24 DOI 10.1186/s13321-017-0209-zRESEARCH ARTICLESimBoost: a read-across approach for predicting drug–target binding affinities using gradient boosting machinesTong He1†, Marten Heidemeyer1†, Fuqiang Ban2, Artem Cherkasov2‡ and Martin Ester1*‡ Abstract Computational prediction of the interaction between drugs and targets is a standing challenge in the field of drug discovery. A number of rather accurate predictions were reported for various binary drug–target benchmark data-sets. However, a notable drawback of a binary representation of interaction data is that missing endpoints for non-interacting drug–target pairs are not differentiated from inactive cases, and that predicted levels of activity depend on pre-defined binarization thresholds. In this paper, we present a method called SimBoost that predicts continuous (non-binary) values of binding affinities of compounds and proteins and thus incorporates the whole interaction spectrum from true negative to true positive interactions. Additionally, we propose a version of the method called SimBoostQuant which computes a prediction interval in order to assess the confidence of the predicted affinity, thus defining the Applicability Domain metrics explicitly. We evaluate SimBoost and SimBoostQuant on two established drug–target interaction benchmark datasets and one new dataset that we propose to use as a benchmark for read-across cheminformatics applications. We demonstrate that our methods outperform the previously reported models across the studied datasets.Keywords: Read-across, Gradient boosting, Drug–target interaction, Prediction interval, Applicability Domain, QSAR© The Author(s) 2017. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.BackgroundFinding a compound that selectively binds to a particu-lar protein is a highly challenging and typically expen-sive procedure in the drug development process, where more than 90% of candidate compounds fail due to cross-reactivity and/or toxicity issues. It is therefore an impor-tant topic in drug research to gain knowledge about the interaction of compounds and target proteins through computational methods. Such in silico approaches are capable of speeding up the experimental wet lab work by systematically prioritizing the most potent compounds and help predicting their potential side effects.Recent studies [1] have demonstrated that machine learning-based approaches have the potential to predict compound-protein interactions on a large scale by learn-ing from limited interaction data supplemented with information on the similarity among compounds and among proteins. Incorporating the similarity between drugs and between targets to infer the interaction of untested drug–target pairs is the essence of the read-across methodology [2].The datasets commonly used for the training and evaluation of such machine learning-based prediction methods are the Enzymes, Ion Channels, Nuclear Recep-tor, and G Protein-Coupled Receptor datasets [3]. These datasets contain binary labels Y(i,j) = 1 if drug–target pair (di, tj) is known to interact (as shown by wet lab experiments) and Y(i,j) = 0 if either (di, tj) is known to not interact or if the interaction of (di, tj) is unknown. The datasets tend to be biased towards drugs and tar-gets that are considered to be more important or easier Open Access*Correspondence: ester@cs.sfu.ca; ester@sfu.ca †Tong He and Marten Heidemeyer contributed equally to this work‡Artem Cherkasov and Martin Ester labs contributed equally to the work1 School of Computing Science, Simon Fraser University, 8888 University Drive, Burnaby, BC V5A 1S6, CanadaFull list of author information is available at the end of the articlePage 2 of 14He et al. J Cheminform (2017) 9:24 to test experimentally. As elaborated in [4], the use of such binary datasets has two major limitations: (1) true-negative interactions and missing values are not differ-entiated, and (2) a given compound-target interaction is treated as a binary on–off relationship, although it is more informative to use a continuous value that quanti-fies how strongly a compound binds to a target.The study of [4] introduces two continuous interaction datasets and the continuous evaluation metric CI and presents a read-across method KronRLS which predicts continuous binding affinities. The prediction in Kron-RLS is based on a similarity score for each drug–target pair, where the similarity of drug–target pairs is defined through the Kronecker product of a drug–drug similar-ity matrix and a target–target similarity matrix. Another method that has previously been shown to achieve high performance in drug target interaction prediction is Matrix Factorization (MF) [5–7], which in its simplest formulation learns to predict drug target interaction just from the given binding values without incorporating sim-ilarity information among drugs and among targets.Intuitively both KronRLS and MF share the limitation of capturing only linear dependencies in the training data. To the best of our knowledge, no non-linear meth-ods for drug–target interaction prediction have been presented in the literature. Furthermore, we believe that due to the biased nature of the training datasets it is nec-essary to assign a confidence score to a prediction. As emphasized in [8] it is important to address the uncer-tainty of the predictions of read-across approaches, but previous methods have neglected this need.In this paper, we propose a novel non-linear method, SimBoost, for continuous drug–target binding affinity prediction, and a version SimBoostQuant, using quantile regression to estimate a prediction interval as a meas-ure of confidence. Given a training dataset of continuous binding affinities and the similarities among drugs and among targets, SimBoost constructs features for drugs, targets, and drug–target pairs, and uses gradient boost-ing machines to predict the binding affinity for a drug–target pair and to generate a prediction interval. Besides gradient boosting, another non-linear method that can predict the value of some dependent variable and gener-ate a prediction interval is random forests [9]. We have two reasons for choosing gradient boosting over ran-dom forests. First, all the trees in a random forest can be seen as identically distributed. Thus, if their prediction is biased then the average of the prediction is also biased, which may lead to a less accurate final result. Second, the random forest algorithm for quantile regression intro-duced in [9] produces the same tree structure as the usual random forest algorithm and only changes the way in which predictions are generated for the leaf nodes. This implies that the trees are not grown in a shape optimized for quantile regression. Our proposed gradient boosting method overcomes both limitations.Gradient boosting machines have been employed in previous QSAR studies [10, 11]. Svetnik et al. [11] com-pares the performance of gradient boosting machines against commonly used QSAR methods such as support vector machines for regression and classification prob-lems involving only compounds. Singh and Shikha [10] utilizes gradient boosting machines to predict toxic effects of nanomaterials. In both studies, gradient boosting machines show promising results in terms of prediction performance, speed and robustness. A major difference of our work compared to these previous studies is the prob-lem formulation: In [10, 11] a prediction is made for a sin-gle entity (nanomaterial or compound), and descriptors for the compounds/nanomaterials are given. In the drug–target setting, on the other hand, we make predictions for pairs of entities, i.e. one drug and one target. Therefore, we present a novel feature engineering step on which our method relies in the learning and prediction phases.Related workTraditional methods for drug target interaction predic-tion typically focus on one particular target of interest. These approaches can again be divided into two types which are target-based approaches [12–14] and ligand-based approaches [15–18]. In target-based approaches the molecular docking of a candidate compound with the protein target is simulated, based on the 3D structure of the target (and the compound). This approach is widely utilized to virtually screen compounds against target pro-teins; however this approach is not applicable when the 3D structure of a target protein is not available which is often the case, especially for G-protein coupled receptors and ion channels. The intuition in ligand-based methods is to model the common characteristics of a target, based on its known interacting ligands (compounds). One inter-esting example for this approach is the study [4] which utilizes similarities in the side-effects of known drugs to predict new drug–target interactions. However, the ligand-based approach may not work well if the number of known interacting ligands of a protein target is small.To allow more efficient predictions on a larger scale, i.e. for many targets simultaneously, and to overcome the limitations of the traditional methods, machine learning based approaches have attracted much atten-tion recently. In the chemical and biologicals sciences, machine learning-based approaches have been known as (multi-target) Quantitative structure–activity relation-ship (QSAR) methods, which relate a set of predictor Page 3 of 14He et al. J Cheminform (2017) 9:24 variables, describing the physico-chemical properties of a drug–target pair, to the response variable, representing the existence or the strength of an interaction.Current machine learning methods can be classified into two types, which are feature-based and similarity-based approaches. In feature-based methods, known drug–target interactions are represented by feature vectors generated by combining chemical descriptors of drugs with descrip-tors for targets [19–23]. With these feature vectors as input, standard machine learning methods such as Support Vec-tor Machines (SVM), Naïve Bayes (NB) or Neural Net-works (NN) can be used to predict the interaction of new drug–target pairs. Vina et al. [24] proposes a method tak-ing into consideration only the sequence of the target and the chemical connectivity of the drug, but without relying on geometry optimization or drug–drug and target–tar-get similarities. Cheng et al. [25] introduces a multi-target QSAR method that integrates chemical substructures and protein sequence descriptors to predict interactions for G-protein coupled receptors and kinases based on two comprehensive data sets derived from the ChEMBL data-base. Merget et al. [26] evaluates different machine learning methods and data balancing schemes and reports that ran-dom forests yielded the best activity prediction and allowed accurate inference of compound selectivity.In similarity-based methods [3, 27–32], similarity matrices for both the drug–drug pairs and the target–tar-get pairs are generated. Different types of similarity met-rics can be used to generate these matrices [33]; typically, chemical structure fingerprints are used to compute the similarity among drugs and a protein sequence align-ment score is used for targets. One of the simplest ways of using the similarities is a Nearest Neighbor classifier [28], which predicts new interactions from the weighted (by the similarity) sum of the interaction profiles of the most similar drugs/targets. The Kernel method proposed in [27] computes a similarity for all drug–target pairs (a pairwise-kernel) using the drug–drug and target–tar-get similarities and then uses this kernel of drug–target pairs with known labels to train an SVM-classifier. The approaches presented in [28–30] represent drug–target interactions by a bipartite graph and label drug–target pairs as +1 if the edge exists or −1, otherwise. For each drug and for each target, a separate SVM (local model) is trained, which predicts interactions of that drug (target) with all targets (drugs). The similarity matrices are used as kernels for those SVMs, and the final prediction for a pair is obtained by averaging the scores for the respective drug SVM and target SVM.All of the above machine-learning based methods for drug–target interaction prediction formulate the task as a binary classification problem, with the goal to classify a given drug–target pair as binding or non-binding. As pointed out in [4], drawbacks of the binary problem for-mulation are that true-negative interactions and untested drug–target pairs are not differentiated, and that the whole interaction spectrum, including both true-positive and true-negative interactions, is not covered well. Pahik-kala et al. [4] introduces the method KronRLS which pre-dicts continuous drug–target binding affinity values. To the best of our knowledge, KronRLS is the only method in the literature which predicts continuous binding affinities, and we give a detailed introduction to KronRLS below, since we use it as baseline in our experiments. Below, we also introduce Matrix Factorization as it was used in the literature for binary drug–target interaction prediction and as it plays an important role in our proposed method.KronRLSRegularized Least Squares Models (RLS) have previously been shown to be able to predict binary drug target inter-action with high accuracy [31]. KronRLS as introduced in [4] can be seen as a generalization of these models for the prediction of continuous binding values. Given a set {di} of drugs and a set {tj} of targets, the training data consists of a set X = {x1, . . . , xm} of drug–target pairs (X is a sub-set of {di × tj}) and an associated vector y = y1, . . . , ym of continuous binding affinities. The goal is to learn a pre-diction function f (x) for all possible drug–target pairs x ∈ {di × tj}, i.e. a function that minimizes the objective:In the objective function, ||f ||2k is the norm of f , which is associated to a kernel function k (described below), and > 0 is a user defined regularization parameter. A minimizer of the above objective can be expressed asThe kernel function k is a symmetric similarity meas-ure between two of the m drug–target pairs, which can be represented by an m×m matrix K . For two individ-ual similarity matrices Kd and Kt for the drugs and tar-gets respectively, a similarity matrix for each drug–target pair can be computed as Kd ⊗ Kt, where ⊗ stands for the Kronecker product. If the training set X contains every possible pair of drugs and targets, K can be computed as K = Kd ⊗ Kt and the parameter vector a can be learnt by solving the following system of linear equations:where I is the di × tj identity matrix. If only a subset of {di × tj} is given as training data, the vector y has missing J (f ) =m∑i=1(yi − f (xi))2 + ||f ||2kf (x) =m∑i=1aik(x, xi)(K + I)a = yPage 4 of 14He et al. J Cheminform (2017) 9:24 values. To learn the parameter a, [4] suggests to use con-jugate gradient with Kronecker algebraic optimization to solve the system of linear equations.Matrix factorizationThe Matrix Factorization (MF) technique has been demonstrated to be effective especially for personalized recommendation tasks [34], and it has been previously applied for drug–target interaction prediction [5–7]. In MF, a matrix M ∈ Rd×t (for the drug–target predic-tion task, M represents a matrix of binding affinities of d drugs and t targets) is approximated by the product of two latent factor matrices P ∈ Rk×d and Q ∈ Rk×t.The factor matrices P and Q are learned by minimiz-ing the regularized squared error on the set of observed affinities κ:The term (mi,j − qTi pj)2 represents the fit of the learned parameters to the observed binding affinities. The term (||p||2 + ||q||2) penalizes the magnitudes of the learned parameters to prevent overfitting, and the constant controls the weight of the two terms. With learned matri-ces P and Q, a matrix M′ with predictions for all drug–target pairs can be computed as:In SimBoost, the columns of the factor matrices P and Q are utilized as parts of the feature vectors for the drugs and targets respectively and thus Matrix Factorization is used as a feature extraction step.minQ,P∑(di ,tj)∈κ(mi,j − qTi pj)2 + (||p||2 + ||q||2)M′ = PTQMethodsProblem definitionWe assume input data in the format (M,D,T ), where M is a matrix with continuous values where Mi,j represents the binding affinity of drug i and target j. D is a similar-ity matrix of drugs, and T is a similarity matrix of targets. Specifically, we define Mi,· as the i-th row of M, and M·,j as the j-th column of M. Similarly, we define Di,· as the i-th row of D, and T·,j as the j-th row of T . Only a subset of the elements of M is observed, and our goal is to predict all the non-observed values in M with the given information.SimBoost and SimBoostQuantOur proposed method, called SimBoost, constructs fea-tures for each drug, each target and each drug–target pair. These features represent the properties of drugs, targets and drug–target pairs, respectively. SimBoost associates a feature vector with each pair of one drug and one target. From pairs with observed binding affini-ties, it trains a gradient boosting machine model to learn the nonlinear relationships between the features and the binding affinities. Once the model is trained, SimBoost can make predictions of the binding affinities for unob-served drug–target pairs, based on their known features.We also propose a version of SimBoost, called Sim-BoostQuant, which computes the confidence of the pre-diction by using quantile regression to learn a prediction interval for a given drug–target pair as a measure of the confidence of the prediction.Figure 1 illustrates the workflow of SimBoost and Sim-BoostQuant, consisting of the three steps of feature engi-neering, gradient boosting trees and prediction interval. These steps are introduced in the following.Fig. 1 The workflow of SimBoost and SimBoostQuantPage 5 of 14He et al. J Cheminform (2017) 9:24 Feature engineeringWe define three types of features to describe the proper-ties of drugs, targets and drug–target pairs.Type 1 features: We extract features for each single object (i.e. drug or target). • Number of observations in M for the object (n.obs).• The number of observations in the corresponding row/column of M. • Average of all similarity scores of the object (ave.sim).• For drug i, the average of Di,·.• For target j, the average of Tj,·. • Histogram of the similarity values of the object (hist.sim).• A vector of frequencies of the similarity values, where the number of bins is an input parameter. • Average of the observed values for the object in M (ave.val).• For drug i, the average of Mi,·.• For target j, the average of M·,j.Type 2 features: We build two networks, one for drugs and another one for targets, from D and T , respectively. The nodes are drugs or targets, and an edge between two nodes exists if their similarity is above a user-defined threshold. We define the following features for each node. • Number of neighbours (num.nb). • The similarity values of the k-nearest neighbours of the node (k.sim). • The average of the Type 1 features among the k-near-est neighbours of the node (k.ave.feat), simply aver-aging these vectors from different objects, which results in a vector of the same length. • The average of the Type 1 features among the k-near-est neighbours of the node, weighted by the similarity values (k.w.ave.feat). • Betweenness, closeness and eigenvector centrality of the node as introduced in [35] (bt, cl, ev). • PageRank score as described in [36] (pr).Type 3 features: We build a network for drugs and tar-gets from M. The nodes are drugs and targets, and an edge connects a drug and a target, with the binding affin-ity as the edge weight. We define the following features for each drug–target pair. • Latent vectors from matrix factorization (mf).• The latent vector for the drug or the target, obtained by matrix factorization of M. • Weighted scores from drug to target’s neighbours (if any) (d.t.ave).• If drug i has observed affinities with target j‘s neighbours, average the values. • Weighted scores from target to drug’s neighbours (if any) (t.d.ave).• If target j has observed affinities with drug i‘s neighbours, average the values. • Betweenness, closeness and eigenvector centrality of the node (d.t.bt, d.t.cl, d.t.ev). • PageRank score (d.t.pr).For each drug and target we build Type 1 and Type 2 feature vectors, and for each drug–target pair we build a Type 3 feature vector. For each drug–target pair (di, tj), we build a feature vector by concatenating the Type 1 and Type 2 feature vectors for di and tj, and the Type 3 feature vector for (di, tj), as illustrated in Table 1.Gradient boosting regression treesTo predict the continuous binding affinity for drug–tar-get pairs, we train a supervised learning model based on the features defined in the Feature Engineering section. We choose the gradient boosting machine as our model, which was originally proposed in [37], because of its fol-lowing benefits [38, 39]: • Accuracy: the boosting algorithm is an ensemble model, which trains a sequence of “weak learners” to gradually achieve a good accuracy. • Efficiency: the training process can be parallelized, greatly reducing the training time.In the following, we provide a brief introduction to a variant of this model, gradient boosting regression trees, which we use in our methods. The details are described in [38, 39]. In the common supervised learning scenario, the data set can be represented by a set containing n paired feature vectors and labels: D = {(xi, yi)} (|D| = n). In the context of our task, xi ∈ Rd is the vector of features of the i-th drug–target pair, while yi ∈ R is its binding affinity.In the gradient boosting regression trees model, the score yˆi predicted for input xi is given by the following functional form:Table 1 Structure of feature vector for (di , tj)Type 1 of di Type 1 of tj Type 2 of di Type 2 of tj Type 3 of (di , tj)Page 6 of 14He et al. J Cheminform (2017) 9:24 where K is the number of regression trees and F is the space of all possible trees.To learn the set of trees {fk}, we define the following regularized objective function:where l is a differentiable loss function that evaluates the prediction yˆi with regard to the known binding affinity yi. The second term Ω measures the complexity of the model (i.e., the set of trees) to avoid overfitting. With this objective function, a simple and predictive set of boosted trees will be selected as the best model.Because the model includes trees as parameters, we cannot directly use traditional optimization methods in Euclidean space to find the solution. Instead, we train the model additively: at each iteration t, F is searched to find a new tree ft that optimizes the objective function, which is then added to the ensemble. Formally, let yˆ(t)i be the prediction for the i-th pair at the t-th iteration, the model finds ft that optimizes the following objective.This objective means that the model adds the best func-tion to the set. In the general setting, the above objective is still hard to optimize, and we approximate the objec-tive using the second order Taylor expansion.where gi = ∂yˆ(t−1) l(yi, yˆ(t−1)i ) and hi = ∂2yˆ(t−1) l(yi, yˆ(t−1)i ). We can remove the constant terms to obtain the follow-ing approximate objective at step t:A gradient boosting algorithm iteratively adds trees that optimize L˜(t) for a number of user-specified iterations.Usually in supervised learning tasks for continuous data, we use the squared loss function L = (y− yˆ)2 to compute the error. The first and second order gradient of this loss function are:yˆi = φ(xi) =K∑k=1fk(xi), fk ∈ FL(φ) =∑il(yˆi, yi)+∑kΩ(fk)L(t) =n∑i=1l(yi, yˆ(t)i )+t∑i=1Ω(fi)=n∑i=1l(yi, yˆ(t−1)i + ft(xi))+t∑i=1Ω(fi)L(t) ≃n∑i=1[l(yi, yˆ(t−1)i )+ gift(xi)+12hif2t (xi)]+t∑i=1Ω(fi)L˜(t) =n∑i=1[gift(xi)+12hif2t (xi)]+Ω(ft)We define Ω as follows:where T is the number of trees, w2j is the prediction score for data corresponding to the j-th leaf from ft and γ and are two weight parameters.Prediction intervalsWe extend gradient boosting regression trees by the concept of quantile regression to characterize the confi-dence of the prediction. Suppose the model can predict the quantile given the quantile parameter α. To obtain the interval, we need to train the model twice to calculate the α quantile and the (1− α) quantile to get the boundary of the prediction interval. To make a prediction for binding affinity, we use the median of the interval.To perform quantile regression, we use the following loss function instead of the squared loss:where α is the quantile parameter, yi is the true binding affinity, and ui is the prediction.Within the framework of gradient boosting trees, the new loss function can be optimized with stochastic gra-dient descent. The gradient for each data sample isThe second order gradient is not applicable here, there-fore we will set it to 1.Since we care more about drug–target pairs with high binding affinities, the model weights these drug–target pairs more. To achieve that, we weight the prediction error by the true affinity, i.e. the larger the true affinity is the more weight the prediction error gets. The resulting loss function and the gradient are as follows:ExperimentsDataWe evaluated the performance of the proposed methods on three drug–target binding affinity datasets. We used the two large-scale biochemical selectivity assays for clin-ically relevant kinase inhibitors from the studies by [40, 41] that were already used to evaluate the performance of KronRLS in the original paper [4]. These two datasets gi = 2(yˆ− y)hi = 2Ω(ft)= γT +12T∑jw2jli = α(yi − ui)Iyi>ui + (1− α)(yi − ui)Iyi<uigi = αIyi>ui − (1− α)Iyi<uili = yi[α(yi − ui)Iyi>ui + (1− α)(yi − ui)Iyi<ui ]gi = yi[αIyi>ui − (1− α)Iyi<ui ]Page 7 of 14He et al. J Cheminform (2017) 9:24 will be referred to as the Davis and the Metz dataset. The Davis dataset is fully populated with binding affini-ties observed for all pairs of 68 drugs and 442 targets, measured by the Kd value (kinase dissociation constant). The Metz dataset consists of 1421 drugs and 156 targets, and for 42% of the drug–target pairs the binding affinity is given as the pKi value (log transformed kinase inhibi-tion constant). Kd values in the Davis dataset were trans-formed into logspace (pKd) as:The Davis and Metz datasets are suitable for the eval-uation of predictive models for drug–target interac-tion because data heterogeneity is not an issue. We can assume that the experimental settings for the measured drug–target pairs in each dataset were the same and the binding affinities are comparable. When working with experimental results that come from multiple sources the data might be heterogeneous: In one case the binding affinity might be measured by Ki, in another case by Kd and in a third case by IC50. Another source of data heter-ogeneity are different experimental settings. An approach to integrate observations from different sources, named KIBA, and a corresponding dataset are presented in [42]. In this work the authors integrated the experimental results from multiple databases into a bioactivity matrix of 52,498 compounds and 467 targets, including 246,088 observations. The binding affinities in this matrix are given as KIBA-values. We used this dataset to obtain a third evaluation dataset, which we call the KIBA dataset, by removing all drugs and targets with less than 10 obser-vations from the original dataset (downloaded from the supplementary materials of [42]), resulting in a dataset of 2116 drugs and 229 targets with a density of 24%.pKd = −log10(Kd1e9) For the evaluation of the model we also include a per-formance evaluation for the classification of drug–tar-get pairs into binding or non-binding, using the metrics AUC and AUPR. For these experiments, we used ver-sions of the datasets binarized by applying thresholds as done in [4]. For the Metz dataset we used the threshold of pKi ≥ 7.6 as suggested in [4] to assign a label of 1, i.e. binding. For the Davis dataset we used a threshold of pKd ≥ 7.0 which is a bit less stringent than the thresh-old suggested in [4]. In the KIBA dataset, the lower the KIBA-score, the higher the binding affinity, and [42] sug-gests a threshold of KIBA value ≤3.0 to binarize the data-set. In an additional preprocessing step, we transformed the KIBA dataset by taking the negative of each value and adding the minimum to all values in order to obtain a threshold where all values above the threshold are classi-fied as binding. The KIBA threshold of 3.0 in the untrans-formed dataset then becomes 12.1. Table 2 lists the sizes and densities of the datasets, and Fig. 2 illustrates the dis-tributions of the affinity values for the three datasets.In the original Davis dataset a large fraction of the val-ues is given as “>10,000Kd”, meaning no binding affinity was detected in the wet lab experiment. These values were transformed to 10,000Kd (5pKd) in the preprocessed dataset, which explains the high bar at value 5pKd.As drug–drug and target–target similarity matrices for the Davis and Metz dataset we used the precomputed Table 2 The statistics of the three datasetsDataset Number of drugs Number of targets Density (%)Davis 68 442 100Metz 1421 156 42.1KIBA 2116 229 24.4Fig. 2 Distribution of values in the three datasets (Davis, Metz and KIBA from left to right) and binarization thresholds (vertical red line)Page 8 of 14He et al. J Cheminform (2017) 9:24 matrices that are provided on the website of [4]. Here, the drug–drug similarity was computed based on the 2D chemical structure of the compounds, using the structure clustering server at PubChem. This tool clusters the com-pounds based on the structure similarity using the single linkage algorithm [44] and allows to download a similar-ity matrix containing the similarity for each drug–drug pair. The target–target similarity was computed based on the protein sequences, using the normalized Smith-Waterman score [3]. For the KIBA dataset we obtained the drug–drug similarity matrix through the compound clustering tool of PubChem as done by the authors of [4] for the Davis and Metz datasets. The given ChEMBL IDs of the compounds were first matched to their PubChem CIDs which were then used as input to the PubChem web interface.The web tool allows to download a similarity matrix for the compounds as described above (similarly as for the drug–drug similarity of the Metz and David datasets). For the KIBA dataset we downloaded the protein sequences from NCBI and computed the normalized Smith Water-man similarity for each pair by aligning the sequences using the Biostrings R package.ResultsThe baselines evaluated in our experiments are matrix factorization trained on continuous data (referred to as MF), and the KronRLS method trained on both continu-ous (referred to as Continuous KronRLS) or binarized data (referred to as Binary KronRLS). For MF, we use the implementation libmf as described in [43]. For KronRLS, we use the original code from the author of [4].We also compare the performance of the two models we proposed. The difference lies in the loss function, the first one employs the usual squared loss (SimBoost), the second one employs the quantile regression loss (Sim-BoostQuant). We use the library xgboost to train the model [38].We perform fivefold cross validation. To ensure that no target is used only for training or only for testing, we build the folds in a way such that every target has an observation in at least two folds. To test the variance of the performance scores, we repeat the cross validation ten times for each model on each dataset, and report the mean and standard deviation for each metric.The evaluation metrics used are Root Mean Squared Error (RMSE), Area Under the Curve (AUC), Area Under the Precision-Recall curve (AUPR) and Concordance Index (CI). The RMSE is a commonly used metric for the error in continuous prediction. The binary metrics AUC and AUPR are commonly used in the related work on drug–target interaction prediction, measuring the ranking error of the predictions. The AUPR is commonly used in these types of studies because it punishes more false positive predictions in highly unbalanced data sets [45]. The CI is a ranking metric for continuous values that was suggested by [4]. The CI over a set of paired data is the probability that the predictions for two randomly drawn drug–target pairs with different label values are predicted in the correct order.KronRLS determines the optimal regularization param-eter by an inner cross validation step, and the parameter which gives the best performance on the desired metric is selected to predict the test fold. The prediction of Kro-nRLS might therefore depend on the used metric. Spe-cifically, when the classification metrics AUC or AUPR are applied, KronRLS learns and predicts binary labels, meaning that the datasets are binarized according to the cutoff threshold before the training step.Our methods in contrast only predict continuous val-ues, and the binarization threshold is applied after the prediction step to calculate the AUC and AUPR metrics. We argue that, given two models A and B, where A learns to predict continuous values and model B learns to pre-dict binary values, and the performance of model A in terms of AUC and AUPR is as good as the performance of model B, model A is advantageous because it does not need to be retrained when the threshold for the dataset is changed. For a fair comparison we also list the per-formance of KronRLS in terms of AUC and AUPR when continuous values were predicted and the threshold was applied after the prediction step.DiscussionThe results with respect to the four performance metrics are presented in Table 3 (Davis dataset), Table 4 (Metz dataset) and Table 5 (KIBA dataset). For every metric, the value for the best-performing method is highlighted in bold font.We observe that SimBoost consistently outperforms all baselines on all datasets in terms of all performance metrics. Based on the standard deviation obtained from 10 repetitions, the improvement is significant (Table 4).In particular, for the Davis dataset, SimBoost reduces the RMSE of MF by 51% and improves the AUPR of Con-tinuous KronRLS by 12%. For the Metz dataset, SimBoost reduces the RMSE of MF by 45%, improves the AUPR of Binary KronRLS by 11% and improves the CI of MF by 8%. For the KIBA dataset, SimBoost reduces the RMSE of the MF by 47%, improves the AUPR of Binary KronRLS by 2% and improves the CI of MF by 7%.SimBoostQuant achieves the second best performance in terms of RMSE, AUC and CI. While this model is not as good as SimBoost in terms of prediction performance, Page 9 of 14He et al. J Cheminform (2017) 9:24 it has the advantage of quantifying the confidence of the predicted value.To provide further insight into the performance of Sim-Boost, Figs. 3, 4 and 5 plot the predictions of SimBoost against the actual values on the three datasets.Figure 6 illustrates the prediction intervals from Sim-BoostQuant for all drugs for two targets from the KIBA dataset.The highest dash-dot green line corresponds to the upper quantile, the lowest dashed red line corresponds to the lower quantile. The dotted blue line in the middle is the average of the quantiles, used as our prediction. The solid black line is the true value in the training dataset. The drugs are listed on the x-axis, sorted by the predicted value.The numbers of observations for drugs in the two plots from left to right are 37 and 1136, and the average inter-val width is 1.51 and 0.75. We observe that the width decreases as the number of observations increases, which is demonstrated in Fig. 7.This is because generally the more observations, the more information has the model. For a given target, the width of the prediction interval varies a lot for differ-ent compounds, therefore among the compounds with high predicted affinities we choose those with narrower intervals.Figure 8 plots the relative importance of the features, as computed by the xgboost package with feature names as introduced in the “Methods” section.We note that the most important features are the aver-age affinity of the drug and the target, as well as the latent factors obtained from matrix factorization. The average affinity of a drug/target captures their typical binding behavior, which is valuable to predict the binding affinity with a specific target/drug. While SimBoost substantially outperforms matrix factorization, arguably because of its ability to discover non-linear relationships, the latent factors learnt by MF do provide important features for SimBoost.Table 3 Results on the Davis data set, with the mean and standard deviation from 10 repetitionsRMSE AUC AUPR CIMF 0.509 ± 0.010 0.876 ± 0.004 0.499 ± 0.017 0.816 ± 0.004Continuous KronRLS 0.608 ± 0.002 0.942 ± 0.001 0.679 ± 0.003 0.860 ± 0.001Binary KronRLS – 0.931 ± 0.001 0.686 ± 0.006 –SimBoost 0.247 ± 0.003 0.956 ± 0.001 0.758 ± 0.005 0.884 ± 0.001SimBoostQuant 0.36 ± 0.001 0.942 ± 0.002 0.680 ± 0.002 0.871 ± 0.004Table 4 Results on the Metz data set, with the mean and standard deviation from 10 repetitionsRMSE AUC AUPR CIMF 0.303 ± 0.005 0.895 ± 0.003 0.358 ± 0.011 0.788 ± 0.001Continuous KronRLS 0.562 ± 0.001 0.943 ± 0.001 0.518 ± 0.003 0.789 ± 0.001Binary KronRLS – 0.932 ± 0.001 0.565 ± 0.004 –SimBoost 0.166 ± 0.001 0.958 ± 0.001 0.629 ± 0.003 0.851 ± 0.001SimBoostQuant 0.249 ± 0.002 0.942 ± 0.002 0.523 ± 0.004 0.813 ± 0.020Table 5 Results on the KIBA data set, with the mean and standard deviation from 10 repetitionsRMSE AUC AUPR CIMF 0.382 ± 0.003 0.831 ± 0.002 0.631 ± 0.004 0.792 ± 0.001Continuous KronRLS 0.620 ± 0.001 0.884 ± 0.001 0.735 ± 0.001 0.792 ± 0.001Binary KronRLS – 0.904 ± 0.001 0.7660 ± 0.001 –SimBoost 0.204 ± 0.001 0.907 ± 0.001 0.782 ± 0.001 0.847 ± 0.001SimBoostQuant 0.299 ± 0.001 0.875 ± 0.001 0.708 ± 0.002 0.796 ± 0.001Page 10 of 14He et al. J Cheminform (2017) 9:24 Fig. 3 Prediction from SimBoost against the real values on DavisFig. 4 Prediction from SimBoost against the real values on MetzPage 11 of 14He et al. J Cheminform (2017) 9:24 Fig. 5 Prediction from SimBoost against the real values on KIBAFig. 6 The prediction intervals of two targets from KIBAPage 12 of 14He et al. J Cheminform (2017) 9:24 Fig. 7 The relationship between the number of observations and the average width of the prediction intervals, in the KIBA datasetFig. 8 Relative feature importance in DavisPage 13 of 14He et al. J Cheminform (2017) 9:24 ConclusionsThe majority of the existing cheminformatics methods for predicting drug–target interactions perform binary classification into binding and non-binding. In this paper we proposed the novel read-across method SimBoost for the problem of predicting continuous, as opposed to binary, drug–target binding affinities. As discussed above, continuous values allow the distinction between true negatives and missing values, and provide more information about the actual strength of protein-com-pound binding. To the best of our knowledge, SimBoost is the first non-linear method for continuous drug–tar-get interaction prediction, and the first method that also computes a prediction interval as a measure of the confi-dence of the prediction.In our experiments we compared SimBoost with Kro-nRLS, the state-of-the-art method for this task, on the Davis, Metz, and KIBA datasets. To the best of our knowledge, this is the first study where the KIBA data-set was used for the evaluation of drug–target interac-tion predictions, and we believe that it should be used in future studies as a benchmark set, because of its hetero-geneous and well-balanced nature.In summary, our contributions are as follows: • We proposed the first non-linear read-across method SimBoost for continuous drug–target binding affinity prediction. SimBoost takes informative features from the drug and target similarities and from a matrix factorization model, and trains a gradient boosting tree model. • We proposed a version of SimBoost, called SimBoost-Quant, which, using the same features, predicts bind-ing affinities as well as prediction intervals as a meas-ure of the confidence of the prediction. • We performed extensive experiments on three data-sets evaluating four performance metrics. The results demonstrate that SimBoost and SimBoostQuant con-sistently outperform state-of-the-art methods.It should be noted that while SimBoost is a more accu-rate method, SimBoostQuant provides important infor-mation on the confidence of a prediction and, thus, explicitly address the Applicability Domain challenge. In our opinion, the choice between the two methods is essentially a trade-off between slightly more accurate ver-sus somewhat more informative predictions.Authors’ contributionsME, AC, TH and MH conceived the study. TH and MH developed the prediction methods. TH, MH and FB prepared the datasets and designed and performed the experiments. TH, MH, and ME wrote the manuscript, AC, and FB proofread it. All authors read and approved the final manuscript.Author details1 School of Computing Science, Simon Fraser University, 8888 University Drive, Burnaby, BC V5A 1S6, Canada. 2 Faculty of Medicine, Vancouver Prostate Center, University of British Columbia, Vancouver, BC V6H 3Z6, Canada. Competing interestsThe authors declare that they have no competing interests.Availability of data and materialsThe original datasets supporting the conclusions of this article are available in the following links:http://staff.cs.utu.fi/~aatapa/data/DrugTarget/known_drug-target_interac-tion_affinities_pKi__Metz_et_al.2011.txtdrug-target_interaction_affinities_Kd__Davis_et_al.2011.txthttp://pubs.acs.org/doi/suppl/10.1021/ci400709dci400709d_si_002.xlsxThe preprocessed data, code for experiments and detailed instructions are available in the following link: https://zenodo.org/record/164436.FundingThe authors kindly acknowledge the following funding sources: NSERC Cre-ate program “Computational Methods for the Analysis of the Diversity and Dynamics of Genomes” (433905-2013), AC acknowledges the NSERC Discovery grant.Publisher’s NoteSpringer Nature remains neutral with regard to jurisdictional claims in pub-lished maps and institutional affiliations.Received: 3 November 2016 Accepted: 30 March 2017References 1. Ding H, Takigawa I, Mamitsuka H, Zhu S (2014) Similarity-based machine learning methods for predicting drug–target interactions: a brief review. Brief Bioinform 15(5):734–747 2. Ball N, Cronin MT, Shen J, Blackburn K, Booth ED, Bouhifd M, Donley E, Egnash L, Hastings C, Juberg DR, Kleensang A (2015) Toward good read-across practice (GRAP) guidance. Altex 33(2):149–166 3. Yamanishi Y, Araki M, Gutteridge A, Honda W, Kanehisa M (2008) Predic-tion of drug–target interaction networks from the integration of chemical and genomic spaces. Bioinformatics 24(13):232–240 4. Pahikkala T, Airola A, Pietila S et al (2015) Toward more realistic drug-target interaction predictions. Brief Bioinform 16(2):325–337. doi:10.1093/bib/bbu010 5. Liu Y, Wu M, Miao C, Zhao P, Li XL (2016) Neighborhood regularized logistic matrix factorization for drug–target interaction prediction. PLoS Comput Biol 12(2):e1004760 6. Ezzat A, Zhao P, Wu M et al (2016) Drug-target interaction prediction with graph regularized matrix factorization. IEEE/ACM Trans Comput Biol Bioinf. doi:10.1109/tcbb.2016.2530062 7. Gonen M, Kaski S (2014) Kernelized Bayesian Matrix Factorization. IEEE Trans Pattern Anal Mach Intell 36(10):2047–2060. doi:10.1109/tpami.2014.2313125 8. Patlewicz G, Ball N, Becker RA, Booth ED, Cronin MT, Kroese D, Steup D, van Ravenzwaay B, Hartung T (2014) Food for thought: read-across approaches–misconceptions, promises and challenges ahead. Altern Anim Exp: ALTEX 31(4):387–396 9. Meinshausen N (2006) Quantile regression forests. J Mach Learn Res 7:983–999 10. Singh K, Shikha G (2014) Nano-QSAR modeling for predicting biological activity of diverse nanomaterials. RSC Adv 4(26):13215–13230 11. Svetnik V, Wang T, Tong C, Liaw A, Sheridan R, Song Q (2015) Boosting: an ensemble learning tool for compound classification and QSAR modeling. J Chem Inf Model 45(3):786–799Page 14 of 14He et al. J Cheminform (2017) 9:24 12. Morris GM et al (2009) AutoDock4 and AutoDockTools4: automated dock-ing with selective receptor flexibility. J Comput Chem 30(16):2785–2791 13. Cheng AC et al (2007) Structure-based maximal affinity model predicts small-molecule druggability. Nat Biotechnol 25(1):71–75 14. Rarey M et al (1996) A fast flexible docking method using an incremental construction algorithm. J Mol Biol 261(3):470–489 15. Campillos M et al (2008) Drug target identification using side-effect similarity. Science 321(5886):263–266 16. Kinnings SL et al (2009) Drug discovery using chemical systems biology: repositioning the safe medicine Comtan to treat multi-drug and exten-sively drug resistant tuberculosis. PLoS Comput Biol 5(7):e1000423 17. Li YY, An J, Jones SJM (2011) A computational approach to finding novel targets for existing drugs. PLoS Comput Biol 7(9):e1002139 18. Wang K et al (2013) Prediction of drug–target interactions for drug repo-sitioning only based on genomic expression similarity. PLoS Comput Biol 9(11):e1003315 19. Yabuuchi H et al (2011) Analysis of multiple compound–protein interac-tions reveals novel bioactive molecules. Mol Syst Biol 7(1):472 20. Nagamine N, Sakakibara Y (2007) Statistical prediction of protein–chemi-cal interactions based on chemical structure and mass spectrometry data. Bioinformatics 23(15):2004–2012 21. Nagamine N et al (2009) Integrating statistical predictions and experi-mental verifications for enhancing protein–chemical interaction predic-tions in virtual screening. PLoS Comput Biol 5(6):e1000397 22. Schürer SC, Muskal SM (2013) Kinome-wide activity modeling from diverse public high-quality data sets. J Chem Inf Model 53(1):27–38 23. Manallack DT et al (2002) Selecting screening candidates for kinase and G protein-coupled receptor targets using neural networks. J Chem Inf Comput Sci 42(5):1256–1262 24. Vina D et al (2009) Alignment-free prediction of a drug–target complex network based on parameters of drug connectivity and protein sequence of receptors. Mol Pharm 6(3):825–835 25. Cheng F et al (2012) Prediction of chemical–protein interactions: multitar-get-QSAR versus computational chemogenomic methods. Mol BioSyst 8(9):2373–2384 26. Merget B, Turk S, Eid S et al (2017) Profiling prediction of kinase inhibitors: toward the virtual assay. J Med Chem 60(1):474–485. doi:10.1021/acs.jmedchem.6b01611 27. Jacob L, Vert J-P (2008) Protein–ligand interaction prediction: an improved chemogenomics approach. Bioinformatics 24(19):2149–2156 28. Bleakley K, Yamanishi Y (2009) Supervised prediction of drug–target inter-actions using bipartite local models. Bioinformatics 25(18):2397–2403 29. Bleakley K, Biau G, Vert J-P (2007) Supervised reconstruction of biological networks with local models. Bioinformatics 23(13):i57–i65 30. Mordelet F, Vert J-P (2008) SIRENE: supervised inference of regulatory networks. Bioinformatics 24(16):i76–i82 31. Xia Z, Wu L-Y, Zhou X, Wong ST (2010) Semi-supervised drug-protein interaction prediction from heterogeneous biological spaces. BMC Syst Biol. 4(Suppl 2):S6. doi:10.1186/1752-0509-4-s2-s6 32. van Laarhoven T, Nabuurs SB, Marchiori E (2011) Gaussian interaction profile kernels for predicting drug–target interaction. Bioinformatics 27(21):3036–3043 33. Öztürk H, Ozkirimli E, Özgür A (2016) A comparative study of SMILES-based compound similarity functions for drug–target interaction predic-tion. BMC Bioinformatics 17(1):128 34. Koren Y, Bell R, Volinsky C (2009) Matrix factorization techniques for recommender systems. Computer 42(8):30–37 35. Newman M (2010) Networks: an introduction. Oxford University Press, Oxford 36. Page L et al (1999) The PageRank citation ranking: bringing order to the web. Stanford InfoLab, Stanford 37. Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Ann Stat 1:1189–1232 38. Chen T, He T (2015) Higgs boson discovery with boosted trees. In: Cowan et al (eds) JMLR: workshop and conference proceedings 2015, vol 42, pp 69–80 39. Chen T, Guestrin C (2016) Xgboost: a scalable tree boosting system. arXiv preprint arXiv:1603.02754 40. Metz JT, Johnson EF, Soni NB, Merta PJ, Kifle L, Hajduk PJ (2011) Navigat-ing the kinome. Nat Chem Biol 7(4):200–202 41. Davis MI, Hunt JP, Herrgard S, Ciceri P, Wodicka LM, Pallares G, Hocker M, Treiber DK, Zarrinkar PP (2011) Comprehensive analysis of kinase inhibitor selectivity. Nat Biotechnol 29(11):1046–1051 42. Tang J, Szwajda A, Shakyawar S, Xu T, Hintsanen P, Wennerberg K, Ait-tokallio T (2014) Making sense of large-scale kinase inhibitor bioactivity data sets: a comparative and integrative analysis. J Chem Inf Model 54(3):735–743 43. Chin WS, Yuan BW, Yang MY, Zhuang Y, Juan YC, Lin CJ (2016) LIBMF: a library for parallel matrix factorization in shared-memory systems. J Mach Learn Res 17(86):1–5 44. Olson CF (1995) Parallel algorithms for hierarchical clustering. Parallel Comput 21(8):1313–1325 45. Yang F, Xu J, Zeng J (2014) Drug–target interaction prediction by integrat-ing chemical, genomic, functional and pharmacological data. In: Pacific symposium on biocomputing. NIH Public Access
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Faculty Research and Publications /
- SimBoost: a read-across approach for predicting drug–target...
Open Collections
UBC Faculty Research and Publications
SimBoost: a read-across approach for predicting drug–target binding affinities using gradient boosting… He, Tong; Heidemeyer, Marten; Ban, Fuqiang; Cherkasov, Artem; Ester, Martin Apr 18, 2017
pdf
Page Metadata
Item Metadata
Title | SimBoost: a read-across approach for predicting drug–target binding affinities using gradient boosting machines |
Creator |
He, Tong Heidemeyer, Marten Ban, Fuqiang Cherkasov, Artem Ester, Martin |
Contributor | Vancouver Prostate Centre |
Publisher | Springer International Publishing |
Date Issued | 2017-04-18 |
Description | Computational prediction of the interaction between drugs and targets is a standing challenge in the field of drug discovery. A number of rather accurate predictions were reported for various binary drug–target benchmark datasets. However, a notable drawback of a binary representation of interaction data is that missing endpoints for non-interacting drug–target pairs are not differentiated from inactive cases, and that predicted levels of activity depend on pre-defined binarization thresholds. In this paper, we present a method called SimBoost that predicts continuous (non-binary) values of binding affinities of compounds and proteins and thus incorporates the whole interaction spectrum from true negative to true positive interactions. Additionally, we propose a version of the method called SimBoostQuant which computes a prediction interval in order to assess the confidence of the predicted affinity, thus defining the Applicability Domain metrics explicitly. We evaluate SimBoost and SimBoostQuant on two established drug–target interaction benchmark datasets and one new dataset that we propose to use as a benchmark for read-across cheminformatics applications. We demonstrate that our methods outperform the previously reported models across the studied datasets. |
Subject |
Read-across Gradient boosting Drug–target interaction Prediction interval Applicability Domain QSAR |
Genre |
Article |
Type |
Text |
Language | eng |
Date Available | 2018-06-11 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution 4.0 International (CC BY 4.0) |
DOI | 10.14288/1.0368605 |
URI | http://hdl.handle.net/2429/66275 |
Affiliation |
Medicine, Faculty of Other UBC Non UBC |
Citation | Journal of Cheminformatics. 2017 Apr 18;9(1):24 |
Publisher DOI | 10.1186/s13321-017-0209-z |
Peer Review Status | Reviewed |
Scholarly Level | Faculty |
Copyright Holder | The Author(s) |
Rights URI | http://creativecommons.org/licenses/by/4.0/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 52383-13321_2017_Article_209.pdf [ 2.35MB ]
- Metadata
- JSON: 52383-1.0368605.json
- JSON-LD: 52383-1.0368605-ld.json
- RDF/XML (Pretty): 52383-1.0368605-rdf.xml
- RDF/JSON: 52383-1.0368605-rdf.json
- Turtle: 52383-1.0368605-turtle.txt
- N-Triples: 52383-1.0368605-rdf-ntriples.txt
- Original Record: 52383-1.0368605-source.json
- Full Text
- 52383-1.0368605-fulltext.txt
- Citation
- 52383-1.0368605.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.52383.1-0368605/manifest