Yang et al. BMC Bioinformatics (2018) 19:96
https://doi.org/10.1186/s12859-018-2091-8

RESEARCH ARTICLE Open Access

Inferring RNA sequence preferences for poorly studied RNA-binding proteins based on co-evolution

Shu Yang1*, Junwen Wang2 and Raymond T. Ng1

Abstract

Background: Characterizing the binding preference of an RNA-binding protein (RBP) is essential for understanding the interaction between the RBP and its RNA targets, and for deciphering the mechanism of post-transcriptional regulation. Experimental methods have been used to generate protein-RNA binding data for a number of RBPs in vivo and in vitro. Using these binding data, several computational methods have been developed to detect the RNA sequence or structure preferences of the RBPs. However, the majority of RBPs have not yet been experimentally characterized and lack RNA binding data. For these poorly studied RBPs, binding preferences cannot be identified by most existing computational methods, because experimental binding data are a prerequisite for these methods.

Results: Here we propose a new method based on co-evolution to predict the sequence preferences of poorly studied RBPs, waiving the requirement for their binding data. First, we demonstrate the co-evolutionary relationship between RBPs and their RNA partners. We then present a K-nearest neighbors (KNN) based algorithm to infer the sequence preference of an RBP using only the preference information from its homologous RBPs. When benchmarked against several in vitro and in vivo datasets, our proposed method outperforms, on all the datasets, the existing alternative that uses the closest neighbor’s preference. Moreover, it shows comparable performance with two state-of-the-art methods that require the presence of the experimental binding data.
Finally, we demonstrate the use of this method to infer sequence preferences for novel proteins that have no binding preference information available.

Conclusion: For a poorly studied RBP, the current methods used to determine its binding preference need experimental data, which are expensive and time-consuming to obtain. Therefore, determining an RBP’s preference is not practical in many situations. This study provides an economical solution for inferring the sequence preference of such a protein based on co-evolution. The source code and related datasets are available at https://github.com/syang11/KNN.

Keywords: RBP binding preference, K-nearest neighbors, Co-evolution, Machine learning

*Correspondence: syang11@cs.ubc.ca
1Department of Computer Science, University of British Columbia, Vancouver, Canada
Full list of author information is available at the end of the article

© The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Background

Determining the binding preference of an RBP is central to investigating RNA-protein interactions. Such preference, also known as specificity, denotes the RBP’s preferential association with specific RNA sequence motifs (i.e. sequence preference) or structure motifs (i.e. structure preference) [1]. Typically, in order to characterize the preference of an RBP, experimental methods are designed to generate binding data consisting of enriched RNA sequences bound by a particular RBP, either in vivo, as in CLIP (crosslinking immunoprecipitation) based methods (HITS-CLIP, PAR-CLIP and iCLIP) [2, 3], or in vitro, as in RNAcompete assays [4, 5]. Computational methods are then used to predict a binding model pertaining to that RBP based on the binding data.

However, due to the limited availability of experimental data, only a small fraction of the RBPs from a few representative species have been well studied regarding their preferences up to now. Identifying the RNA targets bound by novel or poorly studied RBPs remains a challenge. Currently, most experimental methods employ microarrays [4] or next-generation sequencing [6] to assay the RNA sequence information corresponding to an RBP. Although there are methods such as icSHAPE [7] that can determine RNA structures, RNA structure data are not captured by most experimental methods, and structure is usually predicted from sequence data using algorithms such as RNAshapes [8] and RNAplfold [9]. Given the experimental data as input, a number of computational methods have been developed to build binding preference models. Those methods can be roughly classified into two categories: (1) methods focusing on sequence models, i.e. considering RNA sequence information alone for binding preference [10–12]; (2) methods focusing on sequence and structure models, i.e. considering both RNA sequence and structure information for binding preference [13–17]. Some representative methods are summarized in Table 1. For an RBP of interest, all the methods in Table 1 require the RBP’s experimental binding data as input to directly determine the preference. We call these methods “direct” methods to distinguish them from “inferred” methods, which predict the preference indirectly from other RBPs with known preferences. The latter category is the focus of this paper.

Table 1 Representative computational methods for RBP binding preference prediction
Method | Input data | Ref | Highlight
DeepBind | RNAcompete | [10] | Learning sequence preference as the convolution function in a deep convolutional neural network
MEMERIS | SELEX | [13] | Estimating sequence preference (PWM) with single-stranded structure context by maximum likelihood estimation
Li et al. | RIP-chip | [14] | Predicting sequence preference (consensus) with single-stranded structure context by iterative refinement
RNAcontext | RNAcompete | [15] | Learning a joint model with PWM for sequence preference and probability vector for structure preference
GraphProt | CLIP-seq | [16] | Learning sequence and structure preference using graph encoding and graph-kernel SVM
RCK | RNAcompete | [17] | Extending RNAcontext using position-dependent k-mer model for sequence and structure preference

The binding preference of a novel or poorly studied RBP that only has its amino acid sequence available cannot be predicted by any of the “direct” methods. To the best of our knowledge, only one study has suggested an “inferred” workaround for such a case [5]. As observed by Ray et al. in that study, RBPs that have identity > 70% in their RNA-binding domain sequences have similar target RNA sequence motifs. Hence, the authors assumed that the sequence preference (represented by a position weight matrix (PWM) [18]) of a poorly studied RBP would be the same as that of a well-studied RBP if more than 70% of the sequences within their RNA-binding domains are identical. Based on this assumption, Ray et al. inferred sequence preferences for poorly studied RBPs across 288 sequenced eukaryotes. These binding preferences were deposited into the cisBP-RNA database [19]. Nevertheless, this inference only provides a crude estimate, and cannot work for RBPs that do not have highly homologous RBPs.
In spite of its obvious limitations, this method implies a conserved correlation between RBP sequences and their RNA binding targets over the course of evolution.

In this paper we introduce a machine learning approach to predict the sequence preference for poorly studied RBPs. The proposed approach is an “inferred” method that utilizes co-evolution between the RBPs and their binding RNAs. The use of co-evolution has not yet been explored between the RBPs and their binding RNAs, although it has been widely studied in protein-protein interactions [20, 21] and DNA-protein interactions [22, 23]. In general, mutations in either the RBP or the RNA target may weaken their interactions, potentially leading to abnormality in organisms. In fact, a number of diseases have been previously reported to be linked to the mis-regulation or malfunction of specific RNA-protein interactions [1]. Thus, in order to maintain the important interactions in organisms during evolution, crucial mutations in one interacting partner might be rescued by compensatory changes in the other partner. This concept is known as co-evolution, also called correlated evolution or co-variation. Since there are not enough in vivo data available to test co-evolution in RNA-protein interactions, we first use an in vitro dataset [5] of more than 200 RBPs to show that a significant correlation is observed between RBPs and their binding preferences. Then, based on this correlation, we introduce a K-nearest neighbors algorithm to predict the sequence preference (represented by a PWM) for an RBP, using the PWMs of its homologous neighbors as input. We evaluate the algorithm through a set of tests on RBPs with known in vivo or in vitro binding data. We compare the KNN algorithm with (1) the alternative “inferred” approach in Ray et al.’s study [5] which used the closest neighbor’s preference (i.e.
1NN approach), (2) two state-of-the-art “direct” methods: DeepBind, which represents the methods focusing on sequence preference [10], and RCK, which represents the methods focusing on sequence-and-structure preference [17]. Our algorithm outperforms 1NN on all in vivo and in vitro datasets that have been tested, and even performs comparably on in vivo test sets in comparison to the “direct” methods DeepBind and RCK. In addition, we extend the RCK program to plug in our predicted PWMs as its sequence preference, in order to further incorporate structure preference. We show that the extended method performs comparably with DeepBind and RCK on in vitro test sets with a smaller model and far less training time. Finally, we demonstrate the ability of our algorithm to predict binding preference for poorly studied RBPs, and we predict binding preferences for 1000 RBPs which do not have experimental data available.

Methods

Datasets

in vitro dataset

The first dataset was derived from a previously published RNAcompete study conducted by Ray et al. [5]. This study published results of 244 in vitro RNAcompete experiments for 207 RBPs from 24 eukaryotes. For each experiment, the study measured the RBP binding intensity for approximately 240,000 RNA probe sequences. Position frequency matrices were derived using the top 10 probes for each experiment [5]. Several previous methods, including 1NN, DeepBind, and RCK/RNAcontext, were trained and tested on this dataset. We used the position frequency matrices in this dataset to form our training set, and the probes to form our in vitro testing set.

We performed several pre-processing steps on this dataset. We first filtered out proteins that contain more than one type of RNA-binding domain, as well as protein families with too few members.
We also removed the experiments with customized protein constructs to retain only the proteins with full-length (FL) or RNA-binding region (RBR, the core binding region containing all RNA-binding domains in a protein) constructs, because Ray et al. cloned RBPs in different types of constructs [5]. In addition, for each RNAcompete experiment, probes with intensities above the 99.95th percentile were considered outliers and were clamped to the value of the 99.95th percentile, as suggested in the DeepBind and RCK studies [10, 17]. These steps ensured that we focus on the evolution of one protein family at a time and measure at both the FL sequence and the RBR levels. As a result, 200 of the original 244 experiments remained after the pre-processing. They correspond to the two largest known RBP families, the RNA Recognition Motif (RRM) family (177 in total: 126 RBR and 51 FL) and the K-homology (KH) family (23 in total: 15 RBR and 8 FL). We call this dataset the InVitro dataset for convenience. It covers RBPs from 24 diverse eukaryotes including animal, fungi, plant, and protist groups. The top three species with the most entries are human (74), Drosophila (56), and C. elegans (10). The detailed composition is listed in Additional file 1. A summary of the InVitro dataset is shown in Table 2.

Table 2 Summary of datasets used in this study
Name | # | Source | Type | Species composition
InVitro | 200 | [5] | in vitro RNAcompete | 24 different eukaryotes
InVivoRay | 32 | [5] | in vivo CLIP and RIP | human
InVivoAURA | 9 | [24] | in vivo CLIP | human

in vivo dataset

In addition, as shown in Table 2, we used two in vivo datasets to test the performance of in vitro derived binding preferences. The first one was the overlap between our InVitro dataset and the in vivo dataset curated by Ray et al. [5] from the literature. It has 13 CLIP/RIP experiments corresponding to 14 RNAcompete proteins, which result in 32 RNAcompete-CLIP/RIP combinations. Each CLIP/RIP experiment here contains target RNA sequences with binary labels (i.e.
“bound” or “unbound”), and has balanced samples for each label [5]. We call this dataset the InVivoRay dataset. All the corresponding RBPs in the InVivoRay dataset are human, and all but one (from the KH family) belong to the RRM family. The detailed composition is listed in Additional file 1. The second in vivo dataset was the overlap between our InVitro dataset and the in vivo dataset derived by Cirillo et al. [24] from the AURA [25] database. The RNAs here are all long non-coding RNAs (lncRNAs). We obtained 6 overlapping combinations (out of 6 RNAcompete experiments and 2 CLIP experiments) with our InVitro dataset. Moreover, there are 3 additional CLIP experiments in this dataset that involve RBPs with no RNAcompete data, which provides a good case study to test the ability of our algorithm to infer binding preferences for poorly studied RBPs. We call this dataset the InVivoAURA dataset. As before, all the corresponding RBPs in the InVivoAURA dataset are human, and all but one (from the KH family) belong to the RRM family.

RBP binding preference model

Sequence preference

In this study, we used PWMs as our sequence preference representations. A PWM is a 4 (one row for each nucleotide) by k (one column for each position in a motif) matrix of base compositions (probabilities), which assumes position independence. Although there are more advanced representations of binding preference that make weaker assumptions and capture more spatial relations [10, 16, 17], the PWM has been the most commonly used representation, especially when integrating different models from various sources [19, 26]. We collected the position frequency matrices from the InVitro dataset and converted them to PWMs with identical length (7) [22] (more details in Additional file 2: Supplementary Note). Then, for an RBP x of interest, we infer its PWM from its homologous PWMs using the KNN algorithm introduced below, without looking at x’s binding data such as probe sequences or intensity values.

Here we present our KNN based algorithm for sequence preference prediction. Suppose we are interested in a poorly characterized RBP x which only has its amino acid sequence available, e.g. a novel protein that is newly discovered to be associated with a certain disease. If we can find some of x’s homologous RBPs (either orthologs or paralogs, denoted by set H) that have known PWMs and map x to these RBPs by sequence identity, then we can predict the PWM of x with a non-parametric method similar to K-nearest neighbors regression:

1 Compute a pairwise similarity w_i between RBP x and each RBP h_i in H, based on the sequence identity.
2 Sort h_i in descending order of w_i.
3 Find a K value which denotes the number of nearest neighbors.
4 For the K nearest RBPs h_1, ..., h_K with similarities w_1, ..., w_K and PWMs PWM_1, ..., PWM_K, predict x’s PWM with each cell (i, j) in PWM_x as a weighted average:

$$\mathrm{PWM}_x(i, j) = \frac{\sum_{p=1}^{K} w_p \, \mathrm{PWM}_p(i, j)}{\sum_{p=1}^{K} w_p} \qquad (1)$$

Intuitively, our KNN algorithm assumes the RBPs and their binding motifs co-evolved perfectly, and infers the probability in each cell of the new PWM as a weighted average with weights equal to the similarities of the protein sequences. The algorithm computes the sequence similarities using ClustalW [27]. Like the typical KNN algorithm, the proposed algorithm goes over different K values to find the optimal K (optK) for each RBP by cross-validation. In this case, the different K values indicate different evolutionary distances between RBPs. Note that the K (uppercase) here denotes the number of neighbors and has nothing to do with the k (lowercase) in k-mer. In addition, to be consistent with the previous RNAcompete papers [4, 5], we used an approach similar to theirs to assign a score to an RNA sequence using a PWM.
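The four steps and Eq. (1) above can be illustrated with a minimal sketch. This is not the authors' released implementation: the function name `knn_pwm`, the nested-list PWM layout and the toy similarity values are assumptions made for illustration, and the similarities are taken as given rather than computed with ClustalW.

```python
from typing import List, Sequence, Tuple

PWM = List[List[float]]  # 4 rows (A, C, G, U) by k columns


def knn_pwm(neighbors: Sequence[Tuple[float, PWM]], K: int) -> PWM:
    """Predict a PWM as the similarity-weighted average of the K nearest PWMs.

    neighbors: (similarity w_i, PWM_i) pairs for the homologs in H.
    """
    if K < 1 or len(neighbors) < K:
        raise ValueError("K must be between 1 and the number of neighbors")
    # Steps 2-3: sort by similarity (descending) and keep the K nearest.
    nearest = sorted(neighbors, key=lambda pair: pair[0], reverse=True)[:K]
    total_weight = sum(w for w, _ in nearest)
    if total_weight <= 0:
        raise ValueError("similarities must sum to a positive value")
    n_rows = len(nearest[0][1])
    n_cols = len(nearest[0][1][0])
    # Step 4 / Eq. (1): weighted average, cell by cell.
    return [
        [sum(w * pwm[i][j] for w, pwm in nearest) / total_weight
         for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Toy homologs with length-2 motifs (hypothetical values, not from the paper).
homologs = [
    (0.6, [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0], [0.0, 0.0]]),
    (0.9, [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]),
    (0.3, [[0.25, 0.25], [0.25, 0.25], [0.25, 0.25], [0.25, 0.25]]),
]
predicted = knn_pwm(homologs, K=2)
```

With K = 1 the function simply returns the closest neighbor's PWM, which corresponds to the 1NN baseline; because the weights are normalized, each column of the predicted matrix still sums to 1 whenever the input columns do.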
The predicted PWM with length k (k was fixed to 7 in our case) assigns a score to any k-mer RNA sequence by taking the product of the PWM entries corresponding to each base in the k-mer. For an RNA sequence s with length |s| > k, the proposed algorithm scans s using the PWM to compute an RBP-binding score y for the entire sequence:

$$y = \frac{1}{|s|} \sum_{t=0}^{|s|-k} f\left( \prod_{l=t+1}^{t+k} \mathrm{PWM}(\mathrm{index}(s_l),\, l - t) \right), \quad \text{where } f(a) = \begin{cases} \operatorname{arcsinh}(a), & a > 1 \\ 0, & a \le 1 \end{cases} \qquad (2)$$

Here index(s_l) returns the PWM’s row index for base s_l. The use of f(a) guarantees that only k-mers with high scores are retained. This y score is used as our prediction for the binding intensity of an RNA probe.

Sequence-and-structure preference

Since RNA structure is known to play a significant role in RNA-protein interactions [2, 28, 29] and more experimentally measured RNA structure data may be available in the future [7], we provide the flexibility of incorporating structure information with our predicted PWM. We chose to extend the recently published RCK program [17], which can infer both the sequence and the structure preferences using a k-mer based model. There are several reasons for choosing RCK: (1) it has a sequence-and-structure model with a clear interpretation of each part; (2) it is suitable for plugging in our PWM; (3) it was reported to have superior performance among comparable methods [17]. We modified RCK’s sequence model so that it can take our PWM as input and use the parameters derived from our PWM instead of learning them from sequence data. We then trained a joint model with the structure preference incorporated. Although our PWM was inferred without looking at the target RBP’s binding data, the rest of the model parameters were directly trained on the RBP’s RNAcompete probe data. Thus, this method is still a “direct” method. In addition, the RNA structure distribution was predicted computationally by a variant of RNAplfold [9, 15].
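As an illustration of the scanning score in Eq. (2), the following is a minimal sketch rather than the authors' code. The row order A, C, G, U and the helper names are assumptions. Note also that if the matrix holds plain probabilities, every k-mer product is at most 1 and f would always return 0, so the toy matrix below assumes entries on a scale where strong k-mers can exceed 1; the exact scaling used in the paper is described in its Supplementary Note and is not reproduced here.

```python
import math
from typing import List

BASES = "ACGU"  # assumed row order of the matrix


def kmer_product(pwm: List[List[float]], kmer: str) -> float:
    """Product of the matrix entries selected by each base of the k-mer."""
    score = 1.0
    for pos, base in enumerate(kmer):
        score *= pwm[BASES.index(base)][pos]
    return score


def f(a: float) -> float:
    """Threshold function of Eq. (2): keep only k-mer scores above 1."""
    return math.asinh(a) if a > 1 else 0.0


def binding_score(pwm: List[List[float]], seq: str) -> float:
    """Scan seq with the matrix and return the length-normalized score y."""
    k = len(pwm[0])
    seq = seq.upper().replace("T", "U")
    if len(seq) < k:
        raise ValueError("sequence is shorter than the motif")
    total = sum(f(kmer_product(pwm, seq[t:t + k]))
                for t in range(len(seq) - k + 1))
    return total / len(seq)


# Toy length-2 matrix on an assumed scale where entries can exceed 1.
toy_pwm = [[2.0, 2.0], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
score = binding_score(toy_pwm, "AAC")
```

In this toy case only the window "AA" passes the threshold (product 4.0), while "AC" has product exactly 1.0 and contributes nothing, so the score is arcsinh(4) divided by the sequence length 3.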
For simplicity, we call our modified RCK version KNN-RCK.

For each RBP x, KNN-RCK fits a model on x’s RNAcompete experiment data, which consist of a set of probes and their binding intensities to x. Here an RNAcompete probe with length |s| is encoded as a vector s of nucleotides and a vector p of structural probabilities. We left the other parts of the RCK model untouched and focused on the sequence preference part F_seq(·), which is a logistic function that estimates the probability of a given k-mer subsequence being bound by x:

$$F_{seq}(s_{t+1:t+k}, \Theta) = \left(1 + \exp\left(-b - \phi_{s_{t+1:t+k}}\right)\right)^{-1} \qquad (3)$$

where s_{t+1:t+k} is the k-mer subsequence starting at t + 1 on s, φ_{s_{t+1:t+k}} is the score parameter for this k-mer, and b is simply a bias term; b, φ ∈ Θ. For a given k, RCK assumes position dependence, and has a score parameter for each possible k-mer. Thus, φ has 4^k parameters. For instance, if k = 5, φ would be a vector of parameters for all 5-mers, such as φ_AAAAA = 0.03, φ_CAAAA = 1.20, φ_GAAAA = −2.11, etc. In KNN-RCK, we have a PWM with length k, which assumes position independence and thus has 4 × k parameters instead of 4^k. In order to convert the PWM to φ, we used the PWM to score each possible k-mer m by simply multiplying the relevant probabilities at each position:

$$\phi_m = \prod_{l=1}^{k} \mathrm{PWM}(\mathrm{index}(m_l),\, l) \qquad (4)$$

where index(m_l) returns the PWM’s row index for base m_l. When training KNN-RCK, we assigned these scores to φ at the initialization stage, and removed φ from parameter optimization. The remaining parameters were still optimized in the same way as in RCK.

Assessing co-evolution in RNA-protein interaction

To test whether the evolution of the RBPs and that of their binding sequence preferences are correlated, we used an approach similar to the one in our previous study for measuring co-evolution between transcription factors and their binding sites [22]. This approach was derived from the “mirror tree” method originally used in protein-protein co-evolution [30].
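To make the PWM-to-φ conversion of Eq. (4) in the KNN-RCK description concrete, the sketch below enumerates all 4^k k-mers and scores each one with the PWM. It is an illustration only: the dictionary layout, the A, C, G, U row order and the toy matrix are assumptions, and the way RCK stores φ internally is not shown.

```python
from itertools import product
from typing import Dict, List

BASES = "ACGU"  # assumed row order of the PWM


def pwm_to_phi(pwm: List[List[float]]) -> Dict[str, float]:
    """Score every possible k-mer m as the product of its PWM entries (Eq. 4)."""
    k = len(pwm[0])
    phi: Dict[str, float] = {}
    for kmer in product(BASES, repeat=k):
        score = 1.0
        for pos, base in enumerate(kmer):
            score *= pwm[BASES.index(base)][pos]
        phi["".join(kmer)] = score
    return phi


# Toy length-2 PWM (hypothetical values); each column sums to 1.
toy_pwm = [
    [0.7, 0.1],  # A
    [0.1, 0.6],  # C
    [0.1, 0.2],  # G
    [0.1, 0.1],  # U
]
phi = pwm_to_phi(toy_pwm)
```

The resulting table has 4^k entries but is fully determined by the 4 × k PWM values, which is why fixing φ this way shrinks the number of free sequence parameters. For k = 7 the table has 16,384 entries, which is still cheap to enumerate.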
In brief, to assess the correlation, we derived a pairwise sequence similarity matrix for the proteins and a pairwise similarity matrix for the PWMs, and then computed Pearson’s correlation coefficient (PCC) between these two matrices as the measure of co-evolution [30]. Each PWM represented a set of RNA targets for an RBP. Since this approach is essentially the same as in our previous study and is not the focus here, the details are described in Additional file 2: Supplementary Note.

Evaluating prediction performance

We evaluated our predicted binding preference through a series of tests on the in vitro and in vivo datasets. First, for the in vitro testing, we performed leave-one-out validation for all the proteins in our InVitro dataset. Each time, for an experiment with the target RBP x, we pretended not to have the binding data for x and trained a PWM with our KNN algorithm using only the homologous proteins’ PWMs. In the original study of Ray et al. [5], the probes in the InVitro dataset were split into two sets A and B, which have similar sizes and k-mer coverages. We trained our KNN on the homologous PWMs derived from set A, and selected the optimal K value using 2-fold cross-validation on the probes (and intensities) from A. Then we tested on the probes (and intensities) from B. Since the probe intensities are continuous, the performance was evaluated by the PCC between the predicted and the real intensities. In the DeepBind and the RCK papers, these two methods were also trained on set A and tested on B using PCC, except that they were directly trained on the target RBP’s probe data [10, 17]. We therefore used their published performance results. In addition, we trained our KNN-RCK algorithm in the same way as RCK was trained in its paper to incorporate the structure preference.

The more important evaluation is the in vivo testing. All the methods were trained on the complete InVitro dataset (set A+B) and then tested on each of the two in vivo datasets.
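The correlation measure described under "Assessing co-evolution in RNA-protein interaction" can be sketched as follows. This is a simplified illustration, not the procedure from the Supplementary Note: it assumes that both similarity matrices are symmetric and ordered by the same RBPs, and that only the off-diagonal upper-triangle entries are compared. The function names and toy values are hypothetical.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def upper_triangle(m: Matrix) -> List[float]:
    """Off-diagonal upper-triangle entries, read row by row."""
    n = len(m)
    return [m[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def mirror_tree_pcc(protein_sim: Matrix, pwm_sim: Matrix) -> float:
    """PCC between protein-protein and PWM-PWM pairwise similarities."""
    if len(protein_sim) != len(pwm_sim):
        raise ValueError("matrices must cover the same set of RBPs")
    return pearson(upper_triangle(protein_sim), upper_triangle(pwm_sim))


# Toy 3-RBP example (hypothetical values).
protein_sim = [[1.0, 0.9, 0.5], [0.9, 1.0, 0.3], [0.5, 0.3, 1.0]]
pwm_sim = [[1.0, 0.8, 0.4], [0.8, 1.0, 0.2], [0.4, 0.2, 1.0]]
pcc = mirror_tree_pcc(protein_sim, pwm_sim)
```

In the toy example the PWM similarities track the protein similarities exactly (each is 0.1 lower), so the PCC is 1 up to floating-point error. Real data would give intermediate values, and significance would still need to be assessed separately, for example with the tests described in the Results.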
Since RNA sequences in the InVivoRay and InVivoAURA datasets were labeled as bound and unbound, the performance was evaluated by the area under the receiver operating characteristic curve (AUC). For the InVivoRay dataset, 6 InVitro RBPs each corresponded to multiple in vivo test sets. The previous Ray et al. study selected the test set with the best performance for each RBP [5]. Here, we simply took the average performance of all the test sets for each case, and obtained 16 entries from the total 32. To be consistent with the RCK paper [17], we used 2-fold cross-validation to determine RCK’s hyper-parameter width (4–7) on the entire InVitro dataset, and then tested on the InVivoRay dataset with the optimal width. The DeepBind study followed the same evaluation procedure as ours to test on the InVivoRay dataset [10], so we again used the performance results in the DeepBind paper. For the InVivoAURA dataset, we tested DeepBind using its published pre-trained preference models, since its training took too long. We ran DeepBind with the ‘-average’ option turned on to be consistent with the DeepBind paper [10]. DeepBind and RCK could not handle the RBPs in InVivoAURA that did not overlap with the InVitro dataset (i.e. novel RBPs). For each such novel RBP, we instead used the preference model of the nearest neighbor available, i.e. the same idea as 1NN.

Results

Correlation between the RBPs and their RNA targets

First we tested the co-evolution in RNA-protein interactions. Since the RRM and the KH families are the two families composing our InVitro dataset, we focused on them to assess such correlation. As described earlier, the protein constructs in the InVitro dataset were either FL sequences or RBR fragments. Hence, we analyzed the RRM-FL, RRM-RBR, KH-FL, and KH-RBR cases separately.

As shown in Table 3, the values under the PCC column stand for the measured co-evolutions.
To assess the significance of the PCC value, we used a nonparametric rank test and a parametric test as suggested in Yang et al.’s study [22] (Additional file 2: Supplementary Note). In both tests, the KH-FL and the RRM-FL sets showed significant correlation with p-value < 0.05 and p-value < 0.01, respectively. In the nonparametric test, the KH-RBR and the RRM-RBR sets showed significant correlations with p-value < 0.05 (*). The parametric test is more stringent: the KH-RBR set had a p-value = 0.158 and the RRM-RBR set had a p-value = 0.057, which was close to the 0.05 significance level. The fact that the FL level displayed more significant correlations than the RBR level may indicate that, although the binding domain is the most relevant factor in RNA contact, the rest of the protein sequence might have a long-range effect on RNA recognition and binding, which would provide additional evolutionary information.

Table 3 Co-evolution between RBPs and their RNA targets
RBP family | Construct¹ | # of members | PCC
KH | FL | 8 | 0.760*
RRM | FL | 51 | 0.419**
KH | RBR | 15 | 0.367(*)
RRM | RBR | 126 | 0.174(*)
1: Protein construct; FL stands for full-length protein, and RBR stands for RNA-binding region. *: p-value < 0.05, **: p-value < 0.01 from both the parametric and nonparametric tests; (*): p-value < 0.05 from the nonparametric test

In addition, since each of the RRM-FL, the RRM-RBR, the KH-FL, and the KH-RBR sets contains proteins from multiple species, we also controlled for the effects of speciation and confirmed that the observed correlations were not due to speciation (Additional file 2: Supplementary Note and Figure S1).
In summary, we observed strong correlations between the RBPs and their PWMs in our InVitro dataset.

Performance of our “inferred” method for preference prediction

To assess the ability of our method to predict the binding preference for novel or poorly characterized RBPs, we first applied our KNN algorithm to RBPs with known binding data. We compared the performance of our method with 1NN, DeepBind and RCK on the InVitro and InVivoRay datasets.

Our method is more accurate than the alternative “inferred” method

To compare with the alternative “inferred” method 1NN, we evaluated the performance of our KNN with K = optK and K = 1. The 1NN case corresponds to the method used in Ray et al.’s study [5]. We first gauged the in vitro performance on the 200 experiments in the InVitro dataset. As described in the Methods section, we did this in a leave-one-out fashion. As shown in Fig. 1 (a and c) and Table 4, KNN outperformed 1NN on every experiment, with an average PCC of 0.257 as compared to 0.202 for 1NN. The p-value from a paired t-test between KNN’s PCCs and 1NN’s PCCs was around 10^-13. In addition, we compared the performance of our KNN predicted PWMs with the left-out original PWMs derived by Ray et al. [5]. As shown in Fig. 1b, our KNN predicted PWMs performed much better (p-value around 10^-10) even than the original left-out PWMs. This is encouraging because, for each protein x in the dataset, its original PWM was derived directly from the RNAcompete probes, while its KNN inferred PWM was derived indirectly, without the probe information, using only the homologous proteins’ original PWMs.

Moreover, since the in vitro models were trained and tested on the same type of RNAcompete data, we then investigated whether the PWMs trained on the RNAcompete data generalized well to the in vivo data, which is a more important task for the study of RNA-protein interactions. As shown in Table 4 and Fig.
2, when evaluated on the 32 in vivo entries in the InVivoRay dataset, the PWMs predicted by KNN achieved an average AUC of 0.818, compared to an AUC of 0.736 for 1NN. The corresponding p-value from a paired t-test was < 0.05. Thus, in general, we observed a strong improvement from using our KNN algorithm as opposed to 1NN. This also confirmed that the co-evolution detected from the in vitro data also exists in vivo.

Our method is comparable to the state-of-the-art “direct” methods

Next we compared the performance of our “inferred” method to the “direct” methods DeepBind and RCK. For the in vitro binding prediction, as shown in Table 4 and Fig. 1c, the performance of DeepBind (average PCC = 0.429) and RCK (average PCC = 0.484) was much better than that of our KNN (average PCC = 0.257). However, both “direct” methods, DeepBind and RCK, were trained to predict the RNAcompete probe intensity, and were directly optimized to minimize the difference between the predicted and the real probe intensities as the objective function. Our KNN was not trained to directly predict the intensity, which put it at an obvious disadvantage when intensity was used as the evaluation criterion. To make the comparison fairer, we evaluated our KNN-RCK, which was also trained to directly predict the RNAcompete probe intensity. As a result, we obtained an average PCC = 0.417, which was much closer to DeepBind (no statistically significant difference) and RCK (still significantly stronger), with much less training time. When trained on one RNAcompete experiment (using set A only) on the same machine, KNN-RCK took < 1 hr (53 min on average over a subset of 14 experiments), while RCK took 3–4 hrs (220 min on average), both with the hyper-parameter width = 7. The time was measured on a single Intel Xeon E5-2690 (2.90 GHz) CPU with 8 GB RAM. DeepBind is not comparable regarding time since it needs a GPU for training, which is much more computationally intensive. In our empirical test, DeepBind did not finish training in 24 hrs.
The time was based on a single NVIDIA Tesla M2070s GPU (5.5 GB memory) of a 12-CPU Intel Xeon E5694 (2.53 GHz) machine with 23 GB RAM. It is also worth noting that the sizes of our models (i.e. number of parameters) are much smaller than those of the models fitted by DeepBind and RCK (Table 4): our KNN simply has a PWM as its model, which contains only 4 × k (k was fixed to 7) parameters; DeepBind typically has thousands of parameters (depending on the settings of its many hyper-parameters); RCK has about 4^k sequence parameters (k is within 3–7, determined through cross-validation) and 4^k × c structure parameters (c is the number of structural contexts, 5 by default), plus a few regression terms; KNN-RCK is smaller than RCK because it uses PWM-computed scores as its sequence model, and has about 4^k × c + 4 × k parameters.

Fig. 1 Performance in predicting in vitro binding on the InVitro dataset. For each RBP, all methods were trained and tested on the InVitro dataset. Performance was measured by PCC of the predicted and real RNAcompete probe intensities. a Scatter plot shows our KNN (with optimal K) predicted PWMs perform better than or as well as 1NN predicted PWMs in all RBPs in terms of the PCCs of predicted and true probe intensities. p-value is calculated by paired t-test. b Scatter plot shows our KNN predicted PWMs also outperform the left-out original PWMs derived by Ray et al. [5]. c Box plot of PCCs for different methods including KNN, 1NN [5], DeepBind [10], RCK [17], and KNN-RCK. The vertical dashed line separates boxes for methods requiring only the target RBP’s homologous binding information for training to the left, and methods requiring the target RBP’s explicit binding data for training to the right. In each box, the dashed green line denotes the mean, and the brown line denotes the median

Moreover, we also compared the in vivo binding prediction. As shown in Table 4 and Fig.
2, KNN, RCK and DeepBind showed comparable performance. Our KNN method had the highest AUC (0.818) on average, which was significantly better than RCK (AUC 0.708, p-value around 10^-4), and also slightly better than DeepBind (AUC 0.791). This may reflect that complicated models like DeepBind and RCK have a higher variance in prediction and tend to overfit the training data compared with simple models like KNN. We also compared the performance of our KNN predicted PWMs with the original PWMs derived by Ray et al. [5]. On the InVivoRay dataset, the original PWMs achieved an average AUC = 0.785, which was again worse than KNN (0.818, p-value = 0.019). In addition, KNN-RCK (AUC 0.664) was significantly worse than KNN (AUC 0.818) in this test (p-value around 10^-6).

Table 4 Overview of different methods that were evaluated
Method          Training data   Model              Testing data            Performance
                                                   in vitro    in vivo     InVitro     InVivoRay   InVivoAURA
1NN [5]         PWM             PWM                p, i        t, l        0.202 (1)   0.736 (2)   0.682 (2)
DeepBind [10]   p, i            CNN (3)            p, i        t, l        0.429       0.791       0.671
RCK [17]        p, s, i         k-mer              p, s, i     t, s, l     0.484       0.708       0.539
KNN             PWMs            PWM                p, i        t, l        0.257       0.818       0.714
KNN-RCK         PWM, p, s, i    customized k-mer   p, s, i     t, s, l     0.417       0.664       -
(1): Pearson correlation averaged over all tested proteins. (2): AUC averaged over all tested proteins. (3): Convolutional neural network. In the second column, p: RNAcompete probe sequences. i: RNAcompete probe intensities. s: predicted structural distribution. t: CLIP/RIP binding transcript segment sequences. l: CLIP/RIP binary label for bound or unbound.

Fig. 2 Performance in predicting in vivo binding on the InVivoRay dataset. For each RBP, all methods were trained on the InVitro dataset and tested on the InVivoRay dataset. Performance was measured by AUC of the predicted and real (CLIP/RIP) binary labels. The figure shows the box plot of AUCs for different methods including KNN, DeepBind [10], 1NN [5], RCK [17], and KNN-RCK. The vertical dashed line separates methods requiring only the target RBP’s homologous binding information for training (left) from methods requiring the target RBP’s explicit binding data for training (right). In each box, the dashed green line denotes the mean, and the brown line denotes the median.

The sequence-and-structure preference models (RCK and KNN-RCK) may not perform as well as the sequence preference models (DeepBind and KNN) for two reasons: (1) the RNAcompete probes used as training data were designed to be short (30–41 nt) and to have weak secondary structures, whereas the RNA segments from CLIP/RIP experiments used as testing data were usually much longer (many > 1000 nt) and tended to form many more structures, which are also harder to predict computationally [5]; (2) the RNA sequences in InVivoRay were only transcript segments that did not include flanking regions, so the predicted structures might be inaccurate. These also reflect the limitation of RCK (and KNN-RCK), which requires not only the binding sequence data but also accurate structure annotation to make a decent prediction. In summary, the results here showed that although our KNN method requires only homologous proteins’ PWMs as input, its performance was comparable to the much more complicated state-of-the-art methods when tested on in vivo binding data.

In addition, we further utilized KNN-RCK’s binding preference model to assess the relative importance of the sequence or the structure feature alone for binding prediction. Note that although DeepBind represents the sequence-based methods and RCK represents the sequence-and-structure methods, we cannot simply compare the performance of DeepBind with RCK to assess the relative importance, since their models and training algorithms are very different. We therefore did the assessment under KNN-RCK’s unified framework to control for irrelevant effects. 
The results are presented and discussed in Additional file 2: Supplementary Note and Figure S2.

Case study: our method infers PWM for novel proteins
Here we used the InVivoAURA dataset as a case study to further demonstrate the ability of our KNN algorithm to predict the binding preference for novel or poorly studied RBPs. As introduced in the Methods section, this dataset contains 9 sets of lncRNA-protein interactions (on average > 1000 nt long, entire 3’UTR/5’UTR), 3 of which have no RNAcompete information.

As shown in Table 4 and Fig. 3a, overall KNN (average AUC = 0.714) performed the best among all four methods (1NN: 0.682, DeepBind: 0.671, RCK: 0.539) and was significantly better than 1NN (p-value = 0.021) and RCK (p-value = 0.035). To elaborate, we first looked at the two RBPs ELAVL1 and QKI, which have known RNAcompete binding data to train on (ELAVL1 corresponds to RNCMPT00032, RNCMPT00112, RNCMPT00117, RNCMPT00136, RNCMPT00274; QKI corresponds to RNCMPT00047). As shown in Fig. 3b, for ELAVL1, all four programs (KNN, 1NN, DeepBind and RCK) gave similar AUCs, with KNN slightly better than the rest (KNN still significantly better than 1NN with p-value = 0.014); for QKI, KNN also had the highest AUC (0.718), with DeepBind very close to it (0.709). Next, for the remaining three RBPs (NCL, TNRC6B, TNRC6C), there was no RNAcompete data available, so they served as the case for poorly studied proteins. Since all three RBPs can be mapped to the RRM family (FL) based on the protein sequence identity, we could use our KNN method as before to predict the PWMs for them. Here we predicted with a fixed K = 7 (average of opt-K values over all experiments from training) to find the proper homologous proteins’ PWMs in the InVitro dataset. For DeepBind and RCK, since they did not have the corresponding InVitro data to train on, we used the model of the nearest neighbor from the InVitro dataset for each of the three RBPs (NCL: RNCMPT00009, TNRC6B: RNCMPT00094, TNRC6C: RNCMPT00179). Our KNN method performed the best in all three cases (Fig. 3c). In particular, it outperformed DeepBind and RCK by a large margin (except for DeepBind in TNRC6C), which suggested the capability and necessity of our KNN method for poorly studied proteins.

Fig. 3 Performance in predicting in vivo binding on the InVivoAURA dataset. For each RBP, all methods were trained on the InVitro dataset and tested on the InVivoAURA dataset. Performance was measured by AUC of the predicted and real (CLIP) binary labels. a Box plot of AUCs for different methods including KNN, DeepBind [10], 1NN [5], and RCK [17]. The vertical dashed line separates methods requiring only the target RBP’s homologous binding information for training (left) from methods requiring the target RBP’s explicit binding data for training (right). In each box, the dashed green line denotes the mean, and the brown line denotes the median. b Bar plot of AUCs for RBPs (named by model IDs) with explicit binding data available for training. c Bar plot of AUCs for RBPs with no binding data but only homologous binding information available for training. Panels b and c break down the performance in a for each group of RBPs (well studied, poorly studied).

Finally, after demonstrating the capability of our KNN method, we inferred PWMs for 1000 poorly studied RBPs selected from the cisBP-RNA database [19]. These RBPs contain either a KH or an RRM RNA-binding domain and come from a diverse range of eukaryotes. They were categorized as “inferred” in the “motif evidence” menu in the cisBP-RNA database, and their binding preferences were previously inferred by the 1NN method [5]. We predicted the PWMs for these proteins by KNN and expected the new PWMs would be more accurate than the previous 1NN inferred ones. 
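The neighbor-based inference used for these poorly studied RBPs can be illustrated with a minimal sketch. It assumes that the K most similar homologs are selected by protein sequence identity and that their PWMs, all of the same width, are combined by a similarity-weighted average; this combination rule, the function name `knn_infer_pwm` and the toy homologs are illustrative assumptions, and the Methods section defines the actual procedure.

```python
from typing import Dict, List

PWM = List[List[float]]  # width x 4 matrix, columns ordered A, C, G, U


def knn_infer_pwm(similarity: Dict[str, float],
                  pwms: Dict[str, PWM],
                  k: int = 7) -> PWM:
    """Infer a PWM from the k most similar homologs (hypothetical helper).

    Assumption: neighbors are weighted by their sequence similarity to the
    target RBP, and all homolog PWMs share the same width.
    """
    # Rank homologs by similarity to the target and keep the top k.
    neighbors = sorted(similarity, key=similarity.get, reverse=True)[:k]
    total = sum(similarity[name] for name in neighbors)
    if total <= 0:
        raise ValueError("neighbor similarities must sum to a positive value")

    width = len(pwms[neighbors[0]])
    inferred = [[0.0] * 4 for _ in range(width)]
    for name in neighbors:
        weight = similarity[name] / total
        for pos in range(width):
            for base in range(4):
                inferred[pos][base] += weight * pwms[name][pos][base]
    return inferred


# Toy data (made up for illustration): three homologs with width-2 PWMs.
toy_pwms = {
    "homolog_a": [[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7]],
    "homolog_b": [[0.4, 0.2, 0.2, 0.2], [0.25, 0.25, 0.25, 0.25]],
    "homolog_c": [[0.1, 0.7, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1]],
}
toy_similarity = {"homolog_a": 0.8, "homolog_b": 0.4, "homolog_c": 0.1}

inferred = knn_infer_pwm(toy_similarity, toy_pwms, k=2)
nearest_only = knn_infer_pwm(toy_similarity, toy_pwms, k=1)  # reduces to 1NN
```

With k = 1 the sketch reduces to the 1NN baseline, which is why 1NN can be viewed as a special case of the KNN approach. Because each input PWM row sums to 1 and the weights are normalized, each row of the inferred PWM also sums to 1.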
The PWMs are available on our website.

Discussion
The main contribution of this study is to predict the binding preferences for poorly characterized RBPs by utilizing co-evolution. It would be ideal if we could directly determine an RBP’s preference from its experimental binding data. However, such data is currently missing for most proteins. So here we explore how to indirectly infer the preferences for poorly studied RBPs in the absence of their binding data. We conducted a co-evolutionary analysis on an in vitro RNAcompete dataset, which is the largest RNA-protein binding dataset to date and is known to correlate well with the in vivo data [4, 5]. Based on the existence of such co-evolution, we proposed a KNN algorithm to integrate the binding preferences of the homologs into the binding preference prediction. We then benchmarked its performance on the available in vivo and in vitro binding data, and compared it with several representative “direct” and “inferred” methods. The performance was especially good on the in vivo data. By taking an independent lncRNA dataset as a case study, we further demonstrated how to use the algorithm in practice for poorly studied RBPs that do not have binding data.

To predict the binding preference for a poorly studied RBP, our method requires the presence of a set of homologs with known preferences. Although the existing datasets, such as the InVitro RNAcompete dataset, provide good sources of homologous proteins with PWMs, the homologous data is still very limited for most RBPs. So currently, for a query RBP, our method uses the InVitro dataset as the source and combines information of both orthologs and paralogs from it to make preference predictions. However, the idea underlying our KNN method is that the homologous RBPs highly co-evolve with their binding motifs subject to the evolutionary selection. 
It was derived from the well-known “mirror tree” approach to measure protein-protein co-evolution [30], which uses orthologs only. We relaxed this requirement here due to the limited availability of data. If more ortholog data become available in the future, our method can be restricted to use orthologs only.

It is desirable to understand how the KNN method works in terms of the number of neighbors (i.e. homologs). Here we provide some intuition. As described in the Methods section, the optimal number of neighbors to use in the algorithm is determined by cross validation. The question is why some RBPs have small optK values (e.g. optK = 1) while others need much larger values (e.g. optK = 30). We make the general observation that the closer the neighbors are to the target RBP, the smaller the number of neighbors needed to make the prediction. To illustrate this observation, we use the RRM-FL set from in vitro testing as an example. In Fig. 4a, the x-axis shows the global sequence similarity between the target RBP and the nearest neighbor (1NN). In the RRM-FL set, there are 51 RBPs. We sorted the 1NN similarity values and put them into five bins. The y-axis shows the performance (PCC) of using 1NN for preference prediction. The right-hand-side bins, corresponding to more similar 1NN neighbors, show better performance in general. And when the 1NN similarity is low, the prediction performance using 1NN only is poor. The red dashed line connects the mean value of each bin in Fig. 4a. There is a positive correlation (0.30) between the 1NN similarity and the prediction performance (p-value < 0.05). To generalize from using 1NN only for prediction to the proposed algorithm of using optK for prediction, Fig. 4b shows a general anti-correlation (−0.17) between the optimal number of neighbors needed and the similarity to the nearest neighbor. While the x-axis in Fig. 4b is the same as that in Fig. 4a, the y-axis shows the optimal number of neighbors (optK). 
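The binned analysis behind Fig. 4 can be sketched as follows. This is a minimal illustration on made-up values, not the paper's data: it assumes equal-count bins over the sorted 1NN similarities (the text does not say whether the bins are equal-count or equal-width), and the helper names `pearson` and `bin_means` are hypothetical.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def bin_means(x: Sequence[float], y: Sequence[float], n_bins: int = 5) -> List[float]:
    """Sort (x, y) pairs by x, split them into n_bins equal-count bins,
    and return the mean of y in each bin (assumed binning scheme)."""
    pairs = sorted(zip(x, y))
    n = len(pairs)
    if n < n_bins:
        raise ValueError("need at least one point per bin")
    means = []
    for i in range(n_bins):
        lo = i * n // n_bins
        hi = (i + 1) * n // n_bins
        values = [v for _, v in pairs[lo:hi]]
        means.append(sum(values) / len(values))
    return means


# Made-up toy values: 1NN similarity and cross-validated optK for ten RBPs.
nn_similarity = [0.20, 0.25, 0.30, 0.35, 0.45, 0.50, 0.60, 0.70, 0.80, 0.90]
opt_k = [30, 12, 25, 9, 15, 7, 10, 3, 5, 1]

r = pearson(nn_similarity, opt_k)              # negative for this toy data
mean_opt_k = bin_means(nn_similarity, opt_k)   # one mean per similarity bin
```

In this toy example the per-bin mean of optK decreases from the lowest-similarity bin to the highest, which is the qualitative trend described for Fig. 4b.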
The right-hand-side bins generally require smaller numbers of neighbors for prediction. And when the 1NN similarity is low, a larger number of neighbors is needed.

Fig. 4 Analyses of the number of homologous RBPs and their sequence similarities to the target RBP for the KNN algorithm. The figure is based on the RRM-FL set from the InVitro dataset. a Box plot of the preference prediction performances for five different sequence similarity bins. The x-axis shows the similarity between the target RBP and the nearest neighbor (1NN). The y-axis shows the in vitro performance (PCCs) of using 1NN for preference prediction. The red dashed line connects the mean value of each bin. A significant correlation (0.3, p-value < 0.05) was observed between the PCC and the sequence similarity values. b Box plot of the number of neighbors needed for five different sequence similarity bins. The x-axis is the same as that in a. The y-axis denotes the optimal number of neighbors (optK) to use in the KNN algorithm. The optK value was determined through cross validation. A negative correlation (−0.17) was observed between the optK and the sequence similarity values.

Although an RNA molecule could fold onto itself to form functionally important structures, we focus on the sequence preference of RNA in this study. One main reason is that there are too few experimentally determined structure data available at this moment. Moreover, computationally predicted RNA structures tend to be less accurate for longer sequences [31]. Another reason is that many RBPs are sequence-specific and essentially bind to single-stranded RNAs [2], such as the RRM and KH families in this study. 
However, as we expect that more structure data will be available in the future, we provide the KNN-RCK algorithm to incorporate structure features as well. We can assess the importance of structure or sequence information alone for binding preference prediction using KNN-RCK.

In general, predicting RBPs’ binding preferences is challenging, especially for the RBPs with no available experimental binding data. The simple KNN method we introduced here exhibits considerable potential for this task, and can be further extended in several directions. Firstly, one limitation of the current version is that the widths of all the PWMs are fixed to be the same. It may be interesting to make the width of the PWMs a variable that can be tuned for different RBPs. Furthermore, when predicting preferences for novel RBPs with no binding data, the optK value for KNN is currently fixed to a single empirical value for all RBPs. It would be interesting to explore the co-evolutionary relationship further to see if we could customize the optK for each novel RBP. Finally, our KNN algorithm is ready to be used in other scenarios such as transcription factor binding preference detection. The binding preference model in our KNN does not have to be a PWM, and could be replaced by other models (like the k-mer model in RCK, or the convolution function in DeepBind) instead. In general, this study provides a flexible framework to investigate the dynamics of the nucleotide-protein interactions in the cell through evolution, and supplies a practical solution that is easy to use for the research community.

Conclusions
In this study we presented a novel method to predict the binding preference for poorly studied RBPs. First we examined the co-evolution in the RNA-protein interaction using data available in vitro. Then we described a KNN based RBP sequence preference prediction algorithm utilizing such correlation. We evaluated our predicted preferences on several datasets both in vitro and in vivo. 
Moreover, we explored how to use our KNN based method to infer sequence preferences for unknown or poorly studied RBPs. This study is the first to explicitly explore co-evolution to predict the RBP binding preference. It has the potential to reveal the existence of the complicated interaction codes between RNAs and proteins, and to study the vast majority of the RBPs that remain to be characterized.

Additional files
Additional file 1: Supplementary Data. The Excel document contains the Supplementary Data about the performances of different methods on the InVitro, InVivoRay and InVivoAURA datasets, as well as the composition of each dataset (including species and RNA-binding domain information). (XLSX 32 kb)
Additional file 2: Supplementary File. The PDF document contains the text of the Supplementary Note, and the Supplementary Figures S1 to S2. Figure S1 shows the PCCs of KH RBP, RRM PWM pairs for 1000 randomly shuffled sets. Figure S2 shows the comparison of the full sequence-and-structure, structure alone, and sequence alone models in KNN-RCK, in terms of their performances in predicting (A) in vitro binding on the InVitro dataset and (B) in vivo binding on the InVivoRay dataset. 
(PDF 217 kb)

Abbreviations
AUC: Area under the receiver operating characteristic curve; CLIP: Crosslinking immunoprecipitation; FL: full-length; KH: K-homology; KNN: K nearest neighbors; lncRNA: long non-coding RNA; PCC: Pearson correlation coefficient; PWM: Position weight matrix; RBP: RNA-binding protein; RBR: RNA-binding region; RIP: RNA immunoprecipitation; RRM: RNA recognition motif

Acknowledgements
The authors would like to thank Anne Condon, Daniel Lai, Xiaoxi Liu, Xiao Chan Wang, Feng Xu, and Alice Zhu for critical comments on the work.

Funding
This work was supported by Genome Canada, and Natural Sciences and Engineering Research Council (NSERC) of Canada.

Availability of data and materials
The datasets used to perform the analysis are publicly available on [5] http://hugheslab.ccbr.utoronto.ca/supplementary-data/RNAcompete_eukarya/, [19] http://cisbp-rna.ccbr.utoronto.ca (restrict search for RBPs by domain types “RRM” and “KH”), and [24] http://www.nature.com/nmeth/journal/v14/n1/extref/nmeth.4100-S2.xlsx. The source codes implementing the methods in this paper and the performance results are available at https://github.com/syang11/KNN.

Authors’ contributions
SY, JW, and RTN conceived of and designed the study. SY carried out analyses and wrote the program. SY and RTN drafted the manuscript. JW helped revise the manuscript. All authors read and approved the final manuscript.

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Competing interests
The authors declare that they have no competing interests.

Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author details
1 Department of Computer Science, University of British Columbia, Vancouver, Canada. 2 Department of Health Sciences Research, Mayo Clinic Arizona, Scottsdale, USA.

Received: 7 November 2017 Accepted: 28 February 2018

References
1. 
Jankowsky E, Harris ME. Specificity and nonspecificity in RNA-protein interactions. Nat Rev Mol Cell Biol. 2015;16(9):533–44. https://doi.org/10.1038/nrm4032.
2. Li X, Kazan H, Lipshitz HD, Morris QD. Finding the target sites of RNA-binding proteins. Wiley Interdiscip Rev RNA. 2014;5(1):111–30. https://doi.org/10.1002/wrna.1201.
3. Wang T, Xiao G, Chu Y, Zhang MQ, Corey DR, Xie Y. Design and bioinformatics analysis of genome-wide CLIP experiments. Nucleic Acids Res. 2015. https://doi.org/10.1093/nar/gkv439.
4. Ray D, Kazan H, Chan ET, Castillo LP, Chaudhry S, Talukder S, Blencowe BJ, Morris Q, Hughes TR. Rapid and systematic analysis of the RNA recognition specificities of RNA-binding proteins. Nat Biotechnol. 2010;27(7):667–70. https://doi.org/10.1038/nbt.1550.
5. Ray D, Kazan H, Cook KB, Weirauch MT, Najafabadi HS, Li X, Gueroussov S, Albu M, Zheng H, Yang A, Na H, Irimia M, Matzat LH, Dale RK, Smith SA, Yarosh CA, Kelly SM, Nabet B, Mecenas D, Li W, Laishram RS, Qiao M, Lipshitz HD, Piano F, Corbett AH, Carstens RP, Frey BJ, Anderson RA, Lynch KW, Penalva LOF, Lei EP, Fraser AG, Blencowe BJ, Morris QD, Hughes TR. A compendium of RNA-binding motifs for decoding gene regulation. Nature. 2013;499(7457):172–7. https://doi.org/10.1038/nature12311.
6. König J, Zarnack K, Luscombe NM, Ule J. Protein–RNA interactions: new genomic technologies and perspectives. Nat Rev Genet. 2012;13:77–83.
7. Spitale RC, Flynn RA, Zhang QC, Crisalli P, Lee B, Jung J-W, Kuchelmeister HY, Batista PJ, Torre EA, Kool ET, Chang HY. Structural imprints in vivo decode RNA regulatory mechanisms. Nature. 2015;519(7544):486–90.
8. Janssen S, Giegerich R. The RNA shapes studio. Bioinformatics. 2015;31(3):423. https://doi.org/10.1093/bioinformatics/btu649.
9. Bernhart SH, Hofacker IL, Stadler PF. Local RNA base pairing probabilities in large sequences. Bioinformatics. 2006;22(5):614–5. https://doi.org/10.1093/bioinformatics/btk014.
10. Alipanahi B, Delong A, Weirauch MT, Frey BJ. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol. 2015;33(8):831–8.
11. Foat BC, Houshmandi SS, Olivas WM, Bussemaker HJ. Profiling condition-specific, genome-wide regulation of mRNA stability in yeast. Proc Natl Acad Sci U S A. 2005;102(49):17675–80.
12. Pelossof R, Singh I, Yang JL, Weirauch MT, Hughes TR, Leslie CS. Affinity regression predicts the recognition code of nucleic acid–binding proteins. Nat Biotechnol. 2015;33:1242–9.
13. Hiller M, Pudimat R, Busch A, Backofen R. Using RNA secondary structures to guide sequence motif finding towards single-stranded regions. Nucleic Acids Res. 2006;34(17):117. https://doi.org/10.1093/nar/gkl544.
14. Li X, Quon G, Lipshitz HD, Morris Q. Predicting in vivo binding sites of RNA-binding proteins using mRNA secondary structure. RNA. 2010;16(6):1096–107. https://doi.org/10.1261/rna.2017210.
15. Kazan H, Ray D, Chan ET, Hughes TR, Morris Q. RNAcontext: a new method for learning the sequence and structure binding preferences of RNA-binding proteins. PLoS Comput Biol. 2010;6(7):1000832. https://doi.org/10.1371/journal.pcbi.1000832.
16. Maticzka D, Lange SJ, Costa F, Backofen R. GraphProt: modeling binding preferences of RNA-binding proteins. Genome Biol. 2014;15(1):17. https://doi.org/10.1186/gb-2014-15-1-r17.
17. Orenstein Y, Wang Y, Berger B. RCK: accurate and efficient inference of sequence- and structure-based protein–RNA binding models from RNAcompete data. Bioinformatics. 2016;32(12):351. https://doi.org/10.1093/bioinformatics/btw259.
18. Stormo GD. DNA binding sites: representation and discovery. Bioinformatics. 2000;16(1):16. https://doi.org/10.1093/bioinformatics/16.1.16.
19. CISBP-RNA Database: Catalog of Inferred Sequence Binding Preferences of RNA Binding Proteins. http://cisbp-rna.ccbr.utoronto.ca. Accessed 20 June 2017.
20. de Juan D, Pazos F, Valencia A. Emerging methods in protein co-evolution. Nat Rev Genet. 2013;14(4):249–61. https://doi.org/10.1038/nrg3414.
21. Weigt M, White RA, Szurmant H, Hoch JA, Hwa T. Identification of direct residue contacts in protein–protein interaction by message passing. Proc Natl Acad Sci. 2009;106(1):67–72. https://doi.org/10.1073/pnas.0805923106.
22. Yang S, Yalamanchili HK, Li X, Yao K-M, Sham PC, Zhang MQ, Wang J. Correlated evolution of transcription factors and their binding sites. Bioinformatics. 2011;27(21):2972–8. https://doi.org/10.1093/bioinformatics/btr503.
23. Mahony S, Auron PE, Benos PV. Inferring protein–DNA dependencies using motif alignments and mutual information. Bioinformatics. 2007;23(13):297. https://doi.org/10.1093/bioinformatics/btm215.
24. Cirillo D, Blanco M, Armaos A, Buness A, Avner P, Guttman M, Cerase A, Tartaglia GG. Quantitative predictions of protein interactions with long noncoding RNAs. Nat Methods. 2017;14(1):5–6.
25. Dassi E, Re A, Leo S, Tebaldi T, Pasini L, Peroni D, Quattrone A. AURA 2. Translation. 2014;2(1):27738. https://doi.org/10.4161/trla.27738.
26. Cook KB, Kazan H, Zuberi K, Morris Q, Hughes TR. RBPDB: a database of RNA-binding specificities. Nucleic Acids Res. 2011;39(suppl 1):301–8. https://doi.org/10.1093/nar/gkq1069.
27. Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD, Gibson TJ, Higgins DG. Clustal W and Clustal X version 2.0. Bioinformatics. 2007;23(21):2947–8. https://doi.org/10.1093/bioinformatics/btm404.
28. Re A, Joshi T, Kulberkyte E, Morris Q, Workman CT. RNA-Protein Interactions: An Overview, vol. 1097. New York: Humana Press; 2014. https://doi.org/10.1007/978-1-62703-709-9.
29. Dieterich C, Stadler PF. Computational biology of RNA interactions. Wiley Interdiscip Rev RNA. 2012;4:107–20.
30. Pazos F, Valencia A. Similarity of phylogenetic trees as indicator of protein–protein interaction. Protein Eng. 2001;14(9):609–14. https://doi.org/10.1093/protein/14.9.609.
31. Hofacker I. Energy-directed RNA structure prediction. In: Gorodkin J, Ruzzo WL, editors. RNA Sequence, Structure, and Function: Computational and Bioinformatic Methods, vol. 1097. New York: Humana Press; 2014. https://doi.org/10.1007/978-1-62703-709-9.