UBC Faculty Research and Publications

A computational pipeline for the development of multi-marker bio-signature panels and ensemble classifiers

Günther, Oliver P; Chen, Virginia; Freue, Gabriela C; Balshaw, Robert F; Tebbutt, Scott J; Hollander, Zsuzsanna; Takhar, Mandeep; McMaster, W R; McManus, Bruce M; Keown, Paul A; Ng, Raymond T

Dec 8, 2012


Full Text

A computational pipeline for the development of multi-marker bio-signature panels and ensemble classifiers

Günther et al. BMC Bioinformatics 2012, 13:326

Oliver P Günther, Virginia Chen, Gabriela Cohen Freue, Robert F Balshaw, Scott J Tebbutt, Zsuzsanna Hollander, Mandeep Takhar, W Robert McMaster, Bruce M McManus, Paul A Keown and Raymond T Ng*

Abstract

Background: Biomarker panels derived separately from genomic and proteomic data, and with a variety of computational methods, have demonstrated promising classification performance in various diseases. An open question is how to create effective proteo-genomic panels. The framework of ensemble classifiers has been applied successfully in various analytical domains to combine classifiers so that the performance of the ensemble exceeds the performance of the individual classifiers. Using blood-based diagnosis of acute renal allograft rejection as a case study, we address the following question in this paper: Can acute rejection classification performance be improved by combining individual genomic and proteomic classifiers in an ensemble?

Results: The first part of the paper presents a computational biomarker development pipeline for genomic and proteomic data. The pipeline begins with data acquisition (e.g., from bio-samples to microarray data) and proceeds through quality control, statistical analysis and mining of the data, and finally various forms of validation. The pipeline ensures that the various classifiers to be combined later in an ensemble are diverse and adequate for clinical use. Five mRNA genomic and five proteomic classifiers were developed independently using single time-point blood samples from 11 acute-rejection and 22 non-rejection renal transplant patients.
The second part of the paper examines five ensembles ranging in size from two to 10 individual classifiers. Performance of the ensembles is characterized by area under the curve (AUC), sensitivity, and specificity, as derived from the probability of acute rejection for the individual classifiers in the ensemble in combination with one of two aggregation methods: (1) Average Probability or (2) Vote Threshold. One ensemble demonstrated superior performance and was able to improve sensitivity and AUC beyond the best values observed for any of the individual classifiers in the ensemble, while staying within the range of observed specificity. The Vote Threshold aggregation method achieved improved sensitivity for all five ensembles, but typically at the cost of decreased specificity.

Conclusion: Proteo-genomic biomarker ensemble classifiers show promise in the diagnosis of acute renal allograft rejection and can improve classification performance beyond that of individual genomic or proteomic classifiers alone. Validation of our results in an international multicenter study is currently underway.

Keywords: Biomarkers, Computational, Pipeline, Genomics, Proteomics, Ensemble, Classification

* Correspondence: rng@cs.ubc.ca
1 NCE CECR Prevention of Organ Failure (PROOF) Centre of Excellence, Vancouver, BC V6Z 1Y6, Canada
8 Department of Computer Science, University of British Columbia, Vancouver, BC V6T 1Z2, Canada
Full list of author information is available at the end of the article

© 2012 Günther et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Background

With the advancement of whole-genome technologies and unbiased discovery approaches such as microarrays and mass spectrometry, molecular biomarker panel development has attracted much attention and investment in the past decade. Given that biomarker panels may be valuable for the prognosis, diagnosis or prediction of a medical condition, or for the efficacy and safety of a treatment option [1-3], many teams have embarked on biomarker panel development projects and programs with the aim of clinical utility and health care benefits.

The mission of the NCE CECR Centre of Excellence for Prevention of Organ Failure (PROOF Centre) is to develop biomarker panels for heart, lung and kidney conditions along the life cycle from risk to presence, progression and variable responses to clinical interventions including pharmacotherapies. Its flagship program is the Biomarker in Transplantation initiative, which began in 2004. One branch of the work focuses on renal allograft rejection, which is harnessed in this paper as an illustrative case study. Samples from this study are of one of two types, acute rejection (AR) or non-rejection (NR), representing a binary classification task. Acute renal allograft rejection occurs in approximately 10-20% of transplant patients in Canada within the first 12 weeks post-transplant. Acute rejection is a serious problem that leads to kidney failure and graft loss if untreated and recurrent. Early detection of acute rejection with a highly sensitive test followed by appropriate treatment is thus of paramount importance; similarly, the exclusion of acute rejection with a highly specific test followed by tailoring of immunosuppressive therapy will benefit many patients by reducing toxic side-effects. Acute rejection is currently diagnosed by tissue biopsy, an invasive procedure that requires subjective grading by pathologists to determine if and to what degree acute rejection is present in the tissue sample [4].
A promising alternative to tissue biopsy, one which we have pursued since 2004, is the use of blood-based biomarkers for diagnosing acute rejection. We reported the first such genomic biomarker panel in Transplantation [5] and a proteomic panel in Molecular & Cellular Proteomics [6]. With successful replication of our early results, we participated in a Voluntary Exploratory Data Submission to the US FDA. A multinational observational trial on refined biomarker panels is now in its late stages, with the goal of obtaining regulatory approval from the US FDA and Health Canada.

This paper will first present an established computational biomarker development pipeline for genomic and proteomic data. The pipeline begins with data acquisition (e.g., from bio-samples to microarray data) and proceeds through quality control, statistical analysis and mining of the data, and finally various forms of validation. Several groups, including ours, have explored blood-based genomic and proteomic classifiers of acute rejection in kidney and heart transplant recipients with promising results [5-11]. However, the potential of combining genomic- and proteomic-based classifiers in an effective manner remains largely unknown.

Second, we describe an ensemble approach for building proteo-genomic biomarker panels. An intuitive strategy for building such panels is to merge genomic and proteomic data and apply a single-platform analysis strategy to the merged data set [12,13]. Unfortunately, with this approach one encounters challenges related to scaling and normalization, especially with the large differences in the distribution of data values between the two platforms. In addition, due to differing signal strengths between genomic and proteomic data, data from one platform is likely to dominate the final classifier panel, masking what might be a potentially valuable contribution from the second data type.
Although potential solutions to these issues exist, such as the promising approach taken by the mixOmics tools that incorporate partial least squares and canonical correlation analysis [14], a different path is described in this paper: fully developed individual classifiers are combined in an ensemble [15-17], thus avoiding the aforementioned issues while allowing for an intuitive interpretation and straightforward implementation.

Methods

Biomarker development pipeline

The biomarker development process represents a series of sequential steps that can be described as a computational pipeline. Figure 1 shows the genomic biomarker development pipeline, with the initial data quality assessment, sample selection, and pre-processing steps on the left, and the main analysis components, such as the pre-filtering, uni- and multivariate ranking and filtering steps, in the center. The numbers on the right represent the number of features (e.g., probe sets in the genomic case) that correspond to each analysis step. The purpose of the pre-filtering, uni- and multivariate ranking, and filtering steps is to reduce the number of features to be used in the classification model, while selecting features relevant for the classification task. This final list of features represents the biomarker panel, which typically ranges in size from 1 to 100 features.

The analysis of proteomic data requires some proteomic-specific analytical steps that are beyond the scope of this article, including data assembly from untargeted lists of identified protein groups, imputation of missing values, and quality assessment of protein identification parameters [18]. Regardless, the main aims of the analyses undertaken at the different steps of the proteomics and genomics pipelines are essentially the same. Briefly, at the discovery stage, the proteomics computational pipeline utilizes a combination of appropriate univariate and multivariate statistical methodologies to identify a panel of candidate biomarker proteins. The quality of the identified list of markers is evaluated by looking at protein identification parameters and examining the existence of potential confounding factors. In previous studies based on iTRAQ-MALDI-TOF/TOF technology, the total number of identified protein groups was about 1500. However, due to the undersampling commonly seen in shotgun proteomics studies, only about 10% of these protein groups were consistently detected in all the samples involved in a particular analysis. Thus, the proteomic analysis data sets from this technology were smaller than the genomic one described in Figure 1.

Quality assessment

It is important to detect quality issues to prevent them from entering the biomarker development pipeline and negatively affecting analysis results. The quality of samples is therefore assessed as the first step. Only samples that raise no quality concerns are included in the analysis; otherwise, samples are reanalyzed using a different aliquot of the same sample. For Affymetrix Human Genome U133 Plus 2 GeneChip microarray experiments, quality is assessed through visual inspection of RLE, NUSE and weight plots produced with the affyPLM package. Other options include the MDQC package (developed at the PROOF Centre) and the arrayQualityMetrics package in R [19,20]. Quality control of the plasma depletion step and the acquired iTRAQ data has been described previously [6]; it examines the reproducibility of the sample handling procedures and the confidence in the identified protein identities to be analyzed, as well as their levels.

Sample selection

Analysis samples are selected by a domain expert working with a statistician to ensure that a statistically sound analysis can be performed on samples that are relevant to the study question.
Group sizes are reviewed to ensure a reasonable design in regards to balance, possible confounders (typical examples include gender, age, ethnicity), and the power of the study. The domain expert is responsible for choosing samples that represent the conditions of interest. For the two-group acute kidney rejection case study used as an example throughout this paper, a nephrologist confirmed the rejection status of individuals with acute rejection (AR) based on biopsy information, and selected control cases with clinical and demographic characteristics similar to those of the rejection cases. The time of blood collection relative to the start of rejection treatment in AR patients is an important factor [21], and was taken into account during sample selection. The presented case study is based on a prospective longitudinal design, which required a sample selection step as described in Figure 1. Depending on the specific experimental design, a sample selection step might not be needed in general.

Figure 1. Schematic representation of the biomarker development pipeline for genomic microarray data. The analysis starts with a pre-filtering step applied to the full pre-processed data set (54,613 probe sets from the Affymetrix Human Genome U133 Plus 2 GeneChip) at the top of the funnel, followed by uni- and multivariate ranking and filtering steps before arriving at a biomarker panel. The numbers on the right indicate the number of features (probe sets) at each step. The biomarker development pipeline for proteomic data looks similar, except that the data sets are typically smaller and proteomic-specific pre-processing steps need to be applied.

Pre-processing

Depending on the type of data, specific pre-processing steps are applied to prepare the raw data for subsequent statistical analysis. In the case of Affymetrix microarray experiments, the raw data represent expression values for probes on the array.
These values are provided in CEL files together with other information about the experiment. Pre-processing in this case includes background adjustment, normalization and summarization of probe information into probe sets that can be mapped to genes. This process transforms raw CEL files into a data matrix of probe set values for all analysis samples. We have used the Robust Multi-Array Average (RMA) procedure in Bioconductor, as implemented in the RMA and RefPlus packages, to perform these steps, but other methods can be substituted, for example GCRMA or Factor Analysis for Robust Microarray Summarization (FARMS) [22-25]. The normalization can use an expanded sample set to provide increased performance and stability of the pre-processing procedures, e.g., by including all available microarray samples at different time points for the selected patients in the RMA normalization procedure.

Pre-filtering

Not all features in a data set carry useful information. Probe sets with little variation and low expression, for example, are dominated by noise that can negatively affect the statistical analysis, depending on the methods used. The main goal of the pre-filtering step is therefore to remove features with little variation across analysis samples, independent of sample class, before applying univariate ranking and filtering methods to the remaining features. For this purpose, a quantile-based filter was applied in the kidney rejection case study: each feature was scored by its empirical central mass range (ECMR) as given in Eq. (1),

ECMR(x) = quantile(x, 1 - f1/2) - quantile(x, f1/2)    (1)

where f1 is the fraction of samples in the smaller class, e.g., f1 = min(N_AR/(N_AR + N_NR), N_NR/(N_AR + N_NR)) in the two-class classification problem of acute renal allograft rejection, and all features with values below the median ECMR were then removed. For the genomic data from the Affymetrix Human Genome U133 Plus 2 GeneChip, this approach removes half of the 54,613 probe sets.
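As an illustrative sketch (not part of the published pipeline, which runs in R/Bioconductor), the ECMR pre-filter of Eq. (1) can be written in a few lines of numpy; the matrix orientation (features in rows, samples in columns) and the toy data are assumptions of this example:

```python
import numpy as np

def ecmr(x, f1):
    """Empirical central mass range, Eq. (1):
    quantile(x, 1 - f1/2) - quantile(x, f1/2)."""
    return np.quantile(x, 1 - f1 / 2) - np.quantile(x, f1 / 2)

def ecmr_prefilter(X, n_class_a, n_class_b):
    """Remove features (rows of X) whose ECMR falls below the median ECMR.

    f1 is the fraction of samples in the smaller class, so variation in
    that class can enter the quantile range even for unbalanced designs.
    """
    f1 = min(n_class_a, n_class_b) / (n_class_a + n_class_b)
    scores = np.apply_along_axis(ecmr, 1, X, f1)
    keep = scores >= np.median(scores)
    return X[keep], keep

# Toy data: 100 features x 33 samples (e.g., 11 AR + 22 NR)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 33))
X_kept, keep = ecmr_prefilter(X, 11, 22)
```

As in the text, the median threshold removes half of the features; a stricter cut-off (e.g., the 75th percentile of the scores) would remove more.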
If a more stringent pre-filter is desired, one could, for example, remove the 75% of features with the lowest ECMR. The inter-quartile range (IQR) is a special case of the ECMR with f1 = 0.5, i.e., IQR and ECMR are the same for balanced class sizes. For unbalanced class sizes, the ECMR-based filter allows variation in the smaller class to enter the calculation of the quantile range. Other pre-filtering options include the application of an absolute count cut-off that requires at least k samples to have an expression above a fixed threshold, which would address concerns regarding the impact of dependencies between pre-filters and univariate filters and the ability to control type-I error rates [26]. The choice of threshold in any of these methods represents a trade-off: allowing more potential biomarkers to pass the filter also admits more noisy features, which increases the chance of identifying false biomarkers downstream.

Univariate ranking and filtering

Having a large number of features in a biomarker panel is typically not practical, as diagnostic or predictive tests in clinical applications are commonly based on a small number of relevant markers. In fact, many currently applied laboratory tests are based on single markers. In addition, some classification models pose statistical constraints on the number of features that they can incorporate; e.g., a Linear Discriminant Analysis (LDA) classification model has to be based on fewer features than the number of training samples. For these reasons, a univariate ranking and filtering step is applied to reduce the number of candidate features to be included in the classification model.

The univariate ranking step calculates a measure of class-differentiation ability for each individual feature that passed the pre-filtering stage. Moderated t-tests are commonly used for determining differentially expressed features when sample sizes are small.
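For illustration only, a plain (unmoderated) two-sample t-ranking combined with a fold-change filter can be sketched as follows; the analysis in this paper relies on moderated tests with FDR control (limma, SAM) rather than this simplified version, and the planted data are purely synthetic:

```python
import numpy as np

def t_rank_filter(X, y, fc_min=1.5, top_k=50):
    """Rank features by an ordinary Welch t-statistic and keep the top_k
    that also pass an absolute fold-change threshold.

    X: features x samples on the log2 scale; y: boolean, True = class 1.
    """
    a, b = X[:, y], X[:, ~y]
    diff = a.mean(axis=1) - b.mean(axis=1)              # log2 fold change
    se = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1]
                 + b.var(axis=1, ddof=1) / b.shape[1])  # Welch standard error
    t = diff / se
    passes_fc = np.abs(diff) >= np.log2(fc_min)
    order = np.argsort(-np.abs(t))                      # strongest first
    return [int(i) for i in order if passes_fc[i]][:top_k]

# Toy data with one planted differential feature (index 7)
rng = np.random.default_rng(0)
y = np.array([True] * 5 + [False] * 5)
X = rng.normal(size=(200, 10))
X[7, y] += 8.0
panel = t_rank_filter(X, y, top_k=10)
```

The moderated version would shrink the per-feature variance estimates toward a common value before forming t, which is what stabilizes the ranking for small sample sizes.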
Examples are the limma package in Bioconductor and the Significance Analysis of Microarrays (SAM) tool [27,28]. These tests return adjusted p-values or false discovery rates (FDR) that account for multiple hypothesis testing by applying permutation tests (SAM), Bonferroni, Benjamini-Hochberg, or other methods, which is generally recommended for -omics data [29,30]. The limma package includes an empirical Bayes method that moderates the standard errors of the estimated log-fold changes. This approach results in more stable inference and improved power, especially for experiments with small numbers of arrays [27].

Various combinations of FDR cut-offs and fold-change thresholds are applied to produce reduced lists of candidate features that serve as input for the subsequent multivariate ranking, filtering and supervised learning steps. In addition, lower and upper counts for the number of features are sometimes imposed to ensure a minimum and/or maximum number of features in the returned list.

Multivariate ranking and filtering

It might be desirable in some instances to filter a list of features that are relevant as a group without requiring all of them to be relevant individually. Multivariate ranking is performed by applying a multivariate scoring method that orders features by method-specific weights. Examples are support vector machines (SVM), where the squared weights from the SVM model are used, or Random Forest (RF), which provides a feature-importance measure. The multivariate filtering step simply applies a cut-off regarding the number of ranked features to include.

The steps described above are put together in the order shown in Figure 1 to develop a biomarker panel. The final product in terms of class prediction, e.g., acute rejection versus non-rejection, is a classification model based on a biomarker panel in combination with a supervised learner.
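As a toy illustration of such a supervised learner, one that trains on labelled samples and returns a probability-of-AR-style score, a nearest-centroid rule can serve as a minimal stand-in (it is not one of the learners used in this study, and the softmax scoring is an assumption of the sketch):

```python
import numpy as np

class NearestCentroid:
    """Minimal learner: fit from (features, response) pairs and return a
    class score in (0, 1) for a test case. A toy stand-in for the
    SVM/LDA/EN/RF learners used in the actual pipeline."""

    def fit(self, X, y):
        # X: samples x features; y: 0/1 labels (e.g., NR = 0, AR = 1)
        self.c0 = X[y == 0].mean(axis=0)
        self.c1 = X[y == 1].mean(axis=0)
        return self

    def predict_proba(self, x):
        # Softmax over negative centroid distances gives a score in (0, 1)
        d0 = np.linalg.norm(x - self.c0)
        d1 = np.linalg.norm(x - self.c1)
        e0, e1 = np.exp(-d0), np.exp(-d1)
        return e1 / (e0 + e1)  # score for class 1 (AR)

X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
model = NearestCentroid().fit(X, y)
p = model.predict_proba(np.array([1.0, 0.9]))  # close to the class-1 centroid
```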
The requirements for a supervised learner are that it must be able to (1) train its classification model on a training set of feature (input) and response (output) pairs, and (2) return a class probability or score for the different response types given a test case, i.e., a set of input features. Not all steps in the center portion of Figure 1 are performed every time. For example, the multivariate ranking and filtering step may be skipped, and the output from the univariate steps is then used to directly define the biomarker panel. It is also possible for a classification model to apply an additional feature selection step, e.g., Elastic Net [31].

For the binary classification task of separating acute rejection from non-rejection samples, four supervised learning methods were applied: Support Vector Machine (SVM) with linear kernel, Linear Discriminant Analysis (LDA), Elastic Net (EN), and Random Forest (RF) [31-34]. Where applicable, algorithm parameters were tuned for model selection. Additional methods such as PAM (Shrunken Centroids) [35] and sPLS-DA (mixOmics) [14] have been explored for other data sets at the PROOF Centre.

Model assessment and selection

The performance of classification models needs to be estimated for model assessment and selection. For this purpose, it is common practice to split a data set into three parts: (1) training, (2) validation and (3) testing, with suggested splits being 50%, 25% and 25%, respectively [34]. A set of candidate models is first trained on the training set. One of the candidate models is then selected by comparing performances on the validation data, and the performance of the selected model is finally assessed on the test data.
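A minimal sketch of the 50/25/25 partition (the sample count and seed here are arbitrary):

```python
import numpy as np

def split_indices(n, seed=0, fractions=(0.5, 0.25, 0.25)):
    """Randomly partition n sample indices into training, validation
    and test sets using the suggested 50/25/25 split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(fractions[0] * n)
    n_valid = int(fractions[1] * n)
    return (idx[:n_train],
            idx[n_train:n_train + n_valid],
            idx[n_train + n_valid:])

train, valid, test = split_indices(100)
```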
In many cases, however, particularly in the high-throughput genomic and proteomic arena, data sets suffer from low sample size, and cross-validation or bootstrap methods are typically applied to avoid excluding too many samples from the training set.

For the present case study, nested leave-one-out cross-validation (LOO-CV) was used in combination with minimum misclassification error for model selection and assessment. The outer loop was used for model selection, while the nested loops were used for model assessment by averaging performance over the k models tuned in the inner loops of the nested cross-validation procedure. Model parameters were tuned for Elastic Net (lambda) and LDA (number of features), while the default cost parameter was used for SVM and the default settings of the mtry and node-size parameters were used for Random Forest, since these parameters had little impact on the classification performance in the given data sets. In general, it is advisable to tune these parameters and study their effects on classification performance to decide whether tuning is necessary. Estimators based on LOO-CV are known to have low bias but large variance. An alternative to nested LOO-CV, especially for larger sample sizes, is based on averaging performances over multiple k-fold CV partitions.

In general, models with multiple parameters require multi-parameter optimization. This is not straightforward, especially when sample sizes are small and different areas of the multi-parameter plane show the same or similar performances. In these cases it is not clear which parameter combination should be chosen. One solution is to fix all but one parameter and select a model based on tuning that parameter.
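The nested LOO-CV scheme described above can be sketched with a toy k-nearest-neighbour learner, where the neighbour count k stands in for a tunable parameter such as Elastic Net's lambda; the learner, grid, and data are all illustrative assumptions, not the study's actual models:

```python
import numpy as np

def knn_prob(X_train, y_train, x, k):
    """Toy tunable learner: probability of class 1 is the fraction of the
    k nearest training samples (Euclidean distance) labelled 1."""
    d = np.linalg.norm(X_train - x, axis=1)
    return y_train[np.argsort(d)[:k]].mean()

def loo_error(X, y, k):
    """Inner loop: leave-one-out misclassification error for a fixed k."""
    wrong = 0
    for i in range(len(y)):
        tr = np.arange(len(y)) != i
        wrong += (knn_prob(X[tr], y[tr], X[i], k) >= 0.5) != y[i]
    return wrong / len(y)

def nested_loo_probs(X, y, k_grid=(1, 3, 5)):
    """Outer loop: each sample is held out once; k is tuned by inner LOO
    on the remaining samples before predicting the held-out one, so the
    returned probabilities are not biased by the tuning step."""
    probs = np.empty(len(y))
    for i in range(len(y)):
        tr = np.arange(len(y)) != i
        errs = [loo_error(X[tr], y[tr], k) for k in k_grid]
        best_k = k_grid[int(np.argmin(errs))]
        probs[i] = knn_prob(X[tr], y[tr], X[i], best_k)
    return probs

# Two well-separated toy classes of 10 samples each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (10, 2)),
               rng.normal(2.0, 0.3, (10, 2))])
y = np.array([0] * 10 + [1] * 10)
probs = nested_loo_probs(X, y)
error = np.mean((probs >= 0.5) != y)
```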
For example, Elastic Net has two parameters, alpha and lambda, where alpha is typically fixed to select a trade-off between lasso penalization and ridge regression, while lambda is varied to tune the model.

In addition to misclassification error, sensitivity, specificity and area under the ROC curve (AUC) were determined. Misclassification error, sensitivity and specificity depend on the probability cut-off used. For example, if a sample has a predicted probability of 0.4 of being an AR, it would be misclassified using a cut-off of 0.5 but correctly classified using a cut-off of 0.3. Misclassification error is the fraction of misclassified AR and NR samples. All reported misclassification errors, sensitivities and specificities are based on a 0.5 cut-off. The AUC is a quantitative measure that averages classification performance over all probability cut-offs, and as such does not depend on any particular cut-off value.

Ensemble classifiers

In an effort to integrate multiple classification models, separately developed genomic and proteomic classifiers were combined in an ensemble of classifiers, as shown in Figure 2. Ensemble classification methods have been applied in a variety of fields with promising results [15,33,34,36]. Ensembles often combine predictions from a large number of different individual classifiers to produce a final classification based on a specific aggregation method, e.g., average vote. The motivating idea behind ensembles is that the inclusion of a diverse set of classifiers ensures representation of various aspects of the underlying system, while a single classifier typically has limited focus.
For example, a genomic classifier might focus mainly on an immunological signature in whole blood, while a proteomic classifier might focus on an inflammation signature in plasma.

Proteo-genomic ensembles combine classifiers from genomics and proteomics in an effort to improve the performance and robustness of predictions. Each ensemble consists of a set of genomic and proteomic classifiers that are characterized by a biomarker panel, i.e., a list of probe sets or protein groups. All classifiers produce a probability of acute rejection (AR) when given an unknown sample. Predicted class probabilities from individual classifiers were aggregated using one of two methods: Average Probability (AP) or Vote Threshold (VT). The AP aggregation method averages the class probability for a specific sample over all individual classifiers in the respective ensemble. Ensemble AUC and other performance measures were then derived from these average probabilities. The VT aggregation method represents a modified majority-vote approach that can be applied to binary classification tasks with two classes G1 and G2. The predicted class from each classifier is interpreted as a vote for that class; if the number of votes for G1 reaches a fixed threshold, the predicted class is G1, otherwise it is declared G2.

Ensembling of classifiers is well studied in the literature [36,37]. In [36], the analysis of ensembling is extended to imbalanced and high-dimensional data (e.g., tens of thousands of probe sets). The analysis indicates that the more "independent" the individual classifiers are, the larger the expected performance gain of the ensemble.
This is particularly relevant to integrating molecular signals from whole-blood RNA and plasma proteins.

Prior to the case study described in this paper, blood samples were collected from renal allograft recipients in the Biomarkers in Transplantation initiative [5,6]. Whole-blood RNA samples were analyzed with Affymetrix Human Genome U133 Plus 2.0 arrays (genomics) and plasma samples were analyzed with iTRAQ-MALDI-TOF/TOF mass spectrometry (proteomics). The two data sources are derived from different compartments of peripheral blood and focus on two separate types of biological material, i.e., leukocyte cellular RNA and plasma proteins. Perhaps not surprisingly, the signals detected by genomic analysis differ from those detected by proteomic analysis, although both types of signals are consistent with the current understanding of the pathogenesis of acute rejection injury. In particular, differentially expressed genes represent three major biological processes related to immune signal transduction, cytoskeletal reorganization, and apoptosis [5], and differentially expressed proteins represent biological processes related to inflammation, complement activation, blood coagulation, and wound repair [6]. This diversity in biological signals is maintained in the individual genomic- and proteomic-based acute rejection classifiers, and is a desired property in ensemble classifiers. In general, ensemble classifiers demonstrate improved classification performance when the individual classifiers in the ensemble represent independent experts [17,38,39].

Although the current case study focuses on combining genomic with proteomic data, the ensemble framework is more general in nature and need not be restricted to these types of data. A second analysis was performed to show how gene expression classifiers could be combined with miRNA classifiers. This analysis was based on publicly available mRNA and miRNA data sets from a cancer study [40].
Using the computational pipeline, classifiers for the diagnosis of tumour versus normal samples were developed separately for the mRNA and miRNA data sets. A number of ensembles were defined, and performances for the AP and VT aggregation methods were estimated.

Figure 2. Schematic overview of ensemble classifiers. Ensemble classifiers represent a combination of genomic and proteomic classifiers. Individual classifier output is aggregated by either average probability or vote threshold (a modified version of majority vote).

Results

Genomic and proteomic classifiers were developed independently with the biomarker development pipeline using 32 samples from the same 32 patients (11 AR and 21 NR) collected at the same time point. All samples were used for classifier training, and thus no samples remained for classifier testing. As such, validation and calculation of the probability-of-AR were done with 32-fold (leave-one-out) cross-validation, wherein 32 models were created for each of the genomic and proteomic classifiers separately, each with one of the samples left out. The classifier then tested the left-out sample and a probability-of-AR was returned. When classifier development included a model-tuning step, nested cross-validation was applied to ensure an unbiased estimate of the probability-of-AR.

The 32 samples were used in previous publications that describe the development of the Genomics 1 and Proteomics 1 classifiers with a simplified pipeline [5,6]^a. Genomic data represent RNA-based gene expression profiles as measured by Affymetrix HG-U133 Plus 2 GeneChips and were pre-processed with RMA using an enlarged pool of 195 genomic samples that were available at different time points for the 32 patients, plus an additional 20 samples from healthy volunteers, taken from the same biomarker project as described in [5]. An ECMR-based pre-filter, shown in Eq. (1), was applied to the subset of 32 analysis samples and returned 27,306 probe sets for the analysis. Expression values were analyzed on the log-base-2 scale.

Proteomic data represent ratios between depleted plasma samples from transplant patients and healthy pooled controls as measured by iTRAQ-MALDI-TOF/TOF methodology and several post-processing steps, including ProteinPilot™ software v2.0 with the integrated Paragon™ Search and Pro Group™ Algorithms, searching against the International Protein Index (IPI HUMAN v3.39) database. A Protein Group Code Algorithm (PGCA; in-house) was used to link protein groups across different iTRAQ experiments by defining global protein group codes (PGC) from multiple runs [6]. There were a total of 1260 PGCs, each of which was detected in at least one sample. Of those, 147 PGCs passed a 75% minimum-detection-rule filter across the 32 analysis samples^b.

The number of features and the performance characteristics of the five genomic and five proteomic classifiers are summarized in Table 1^c. The performance of individual classifiers as measured by AUC was typically high, and specificity was higher than sensitivity for all classifiers. In addition to the published genomic classifier [5], four additional genomic classifiers based on SVM, RF and EN classification methods were developed [31-34]. The Genomics 2 (SVM) and 3 (RF) classifiers were based on the top 50 FDR-ranked probe sets, while the Genomics 4 and 5 classifiers were based on probe sets selected by Elastic Net from the probe sets with an FDR < 0.05 (with an additional constraint of at least 50 but at most 500 probe sets).

The development of the Proteomics 1 classifier was described previously [6]. Four additional proteomic classifiers were developed in a process similar to that used for the genomics analysis described above.
Classifiers Proteomics 2–5 in Table 1 are based on EN and SVM classification methods, with either a robust limma univariate filter (Proteomics 4–5) or no univariate filter (Proteomics 2–3), and a fold-change cutoff of FC≥1.15 in all cases. In addition, a 75% rule regarding missing values was implemented, i.e., a protein group was only included if it was detected in at least 75% of all samples. The missing values were imputed using k-nearest neighbours (knn) with k=3 across all training samples, independent of class label. Imputation of test samples was performed in each fold of the cross-validation by combining the imputed training data with the test data, then applying knn imputation.

Also shown in Table 1 is the definition of five ensembles representing different combinations of the 10 individual classifiers. Ensemble 1 represents a two-classifier ensemble based on the published genomic and proteomic biomarker panels; Ensembles 2 and 3 expand on Ensemble 1 by adding two genomic and one proteomic classifier (Ensemble 2), and one genomic and two proteomic classifiers (Ensemble 3). Ensemble 4 combines the largest genomic (Genomics 5) and proteomic (Proteomics 3) classifiers and Genomics 3. Ensemble 5 combines all 5 genomic and 5 proteomic classifiers.

The performance of ensemble classifiers was characterized by sensitivity, specificity and AUC. These measures were all derived from a probability-of-AR for the ensemble, which was calculated from the probability-of-AR values returned by the individual classifiers in the ensemble in combination with either the average probability (AP) or vote threshold (VT) aggregation method. For VT a threshold of one was used, i.e., a single AR call by any of the classifiers in the ensemble was enough to call the sample AR^d. A probability threshold of 0.5 was used in the calculation of sensitivity and specificity.
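The two aggregation rules can be written compactly. The AP rule is a plain mean of the individual probabilities; for VT, a reading consistent with the description of Figure 3 below is that the ensemble's probability-of-AR is the largest individual probability (more generally, the t-th largest for a vote threshold of t). The function names are illustrative, not from the paper:

```python
import numpy as np

def average_probability(P):
    """AP aggregation: P has shape (n_samples, n_classifiers)."""
    return P.mean(axis=1)

def vote_threshold(P, t=1, cutoff=0.5):
    """VT aggregation requiring at least t positive votes.

    With t=1 (as in the paper) a single AR call by any classifier makes
    the ensemble call AR.  As the ensemble probability we return the
    t-th largest individual probability, which crosses the cutoff
    exactly when at least t classifiers vote AR (an assumption
    consistent with the description of Figure 3).
    """
    order = np.sort(P, axis=1)                  # ascending per sample
    prob = order[:, -t]                         # t-th largest probability
    call = (P >= cutoff).sum(axis=1) >= t       # ensemble class call
    return prob, call
```

With t=1 the VT probability reduces to the per-sample maximum, which is why VT can only raise sensitivity relative to the individual classifiers.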
Results are summarized in Tables 2 and 3.

Ensemble 1 in combination with aggregation method AP achieves a sensitivity and specificity equaling that of the Genomics 1 classifier, while the AUC is improved slightly relative to the Proteomics 1 classifier. Figure 3 shows the estimated probabilities of acute rejection from the different classifiers for each of the 11 AR and 21 NR samples. For the 11 AR samples, all red and orange pairs fall on the same side of the 0.5-probability threshold line used to determine rejection status. This means that the Genomics 1 and Ensemble 1 classifiers not only display the same sensitivity, but they also misclassify the same samples. Also, for the 21 NR samples all black and grey pairs fall on the same side of the 0.5-probability threshold line, which explains the same specificity of Genomics 1 and Ensemble 1, again due to the same NR samples being misclassified. The figure also provides an explanation for the improved AUC of the ensemble as compared to that of Genomics 1 alone: the probability of the misclassified NR samples is reduced from 1.0 (grey points) to a smaller value (black points), in one case close to the 0.5-probability line. In other words, although the same two NR samples remain misclassified, the AUC of the ensemble is improved because AUC is calculated based on the order of probability-of-AR for all samples. Overall, Ensemble 1 in combination with aggregation method AP does not seem to improve classification performance much beyond that of the Genomics 1 classifier alone.

Figure 3 can also be used to interpret the results for Ensemble 1 when the VT aggregation method is used. In this case the red and black points in Figure 3 should be ignored, and the ensemble-produced probability of AR is instead given by the larger of the probability pairs represented by the orange and yellow (for AR), or grey and brown points (for NR).
Ensemble 1 has better sensitivity than either the genomic or proteomic classifier alone, and misclassifies only 2 AR samples. However, this improvement comes at the cost of decreased specificity, with 3 misclassified NR samples, as compared to the genomic (2 misclassified) or proteomic (1 misclassified) classifier alone.

For all 10 analyses (5 ensembles with 2 aggregation methods each), we find that sensitivity always meets or exceeds the maximum sensitivity of the individual classifiers in the corresponding ensemble, and exceeds the maximum value for all ensembles when the vote threshold aggregation method is used. A similar observation holds for Ensemble 4 when the AP aggregation method is used, where an increased sensitivity of 82% is observed. Specificity, on the other hand, is never better than the best specificity of the individual classifiers in an ensemble: it is always within the min/max range for the 5 ensembles when the AP aggregation method is used, but usually below the min/max range when the VT aggregation method is used. Ensemble 4 is again the exception here, achieving specificity equaling the minimum value of 81%.

Table 1 Overview of individual classifier performance and definition of ensembles

Classifier    Method  Features  Sensitivity  Specificity  AUC   E1  E2  E3  E4  E5
Genomics 1    LDA     24        0.73         0.90         0.73  X   X   X       X
Genomics 2    SVM     50        0.82         0.95         0.96      X           X
Genomics 3    RF      50        0.64         0.95         0.92          X   X   X
Genomics 4    EN      43        0.73         1.00         0.93      X           X
Genomics 5    EN      174       0.73         1.00         0.95              X   X
Proteomics 1  SVM     12        0.64         0.95         0.94  X   X   X       X
Proteomics 2  EN      10        0.64         0.81         0.90          X       X
Proteomics 3  SVM     33        0.55         0.81         0.83              X   X
Proteomics 4  EN      13        0.55         0.86         0.85      X           X
Proteomics 5  SVM     13        0.64         0.95         0.94          X       X

Shown is a list of 5 genomic and 5 proteomic classifiers, their individual classification performance and their inclusion into the 5 ensembles (E1–E5) that are explored in this paper. LDA stands for linear discriminant analysis; EN for Elastic Net (generalized linear model); SVM for Support Vector Machine; and RF for Random Forest. Sensitivity, specificity and area under the ROC (receiver operating characteristic) curve (AUC) for the individual classifiers were estimated using cross-validation.

Table 2 Summary of classification performance for the Average Probability aggregation method

            Sensitivity              Specificity              AUC
            Ens.  min   max   avg    Ens.  min   max   avg    Ens.  min   max   avg
Ensemble 1  0.73  0.64  0.73  0.68   0.90  0.90  0.95  0.93   0.95  0.73  0.94  0.84
Ensemble 2  0.82  0.55  0.82  0.69   0.95  0.86  1.00  0.93   0.98  0.73  0.96  0.88
Ensemble 3  0.73  0.64  0.73  0.65   0.95  0.81  0.95  0.91   0.97  0.73  0.94  0.88
Ensemble 4  0.82  0.55  0.73  0.64   0.90  0.81  1.00  0.92   0.97  0.83  0.95  0.90
Ensemble 5  0.82  0.55  0.82  0.66   0.95  0.81  1.00  0.92   0.98  0.73  0.96  0.89

Shown is classification performance, as measured by sensitivity, specificity and AUC, for the 5 ensembles defined in Table 1 when using the average probability aggregation method. "Ens." denotes the ensemble classifier; the minimum, maximum and average performances of the individual classifiers in the respective ensemble are included in the table for comparison.

When measuring classifier performance, it can be informative to look at performance in a threshold-independent manner. The area under the curve (AUC) of the ROC assesses performance in this way, summarizing a classifier's ability to separate two classes across the complete range of possible probability thresholds. Using this measure of performance, we find that the AUC of ensembles based on the AP aggregation method is always higher than the best (maximum) AUC of the individual classifiers in the corresponding ensemble, although the improvement is generally small, as can be seen in Table 2.
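The threshold-independence of AUC can be made concrete with the rank (Mann–Whitney) formulation, which depends only on how the probability-of-AR values are ordered:

```python
import numpy as np

def auc_from_probs(probs, labels):
    """AUC via the Mann-Whitney U statistic: the fraction of
    (positive, negative) pairs whose probabilities are ordered
    correctly, with ties counting half.

    Because only the ordering matters, shrinking a misclassified
    sample's probability can raise AUC without changing sensitivity
    or specificity at the 0.5 cutoff.
    """
    probs, labels = np.asarray(probs, float), np.asarray(labels)
    pos, neg = probs[labels == 1], probs[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()   # correctly ordered pairs
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

This is the mechanism behind the Ensemble 1 result above: the same two NR samples stay misclassified, but their probabilities move closer to the AR/NR boundary, improving the pairwise ordering and hence the AUC.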
The AUC when using the VT aggregation method is typically within the range for the individual classifiers, but for Ensemble 4, with an AUC of 0.952, slightly exceeds the best individual AUC of 0.948.

Comparing ensemble performance with the mean performance of individual classifiers in Tables 2 and 3 shows that sensitivity and AUC are always higher for the ensemble classifiers, while ensemble specificity is below mean specificity for all 5 ensembles with VT aggregation, and for 2 out of 5 ensembles with AP aggregation.

Table 3 Summary of classification performance for the Vote Threshold aggregation method

            Sensitivity              Specificity              AUC
            Ens.  min   max   avg    Ens.  min   max   avg    Ens.  min   max   avg
Ensemble 1  0.82  0.64  0.73  0.68   0.86  0.90  0.95  0.93   0.89  0.73  0.94  0.84
Ensemble 2  0.91  0.55  0.82  0.69   0.76  0.86  1.00  0.93   0.89  0.73  0.96  0.88
Ensemble 3  1.00  0.64  0.73  0.65   0.76  0.81  0.95  0.91   0.90  0.73  0.94  0.88
Ensemble 4  0.91  0.55  0.73  0.64   0.81  0.81  1.00  0.92   0.95  0.83  0.95  0.90
Ensemble 5  1.00  0.55  0.82  0.66   0.62  0.81  1.00  0.92   0.90  0.73  0.96  0.89

Shown is classification performance for the 5 ensembles defined in Table 1 when using the vote threshold aggregation method. As in Table 2, individual classifier performances are included for comparison.

Figure 3 Comparison of predicted probabilities of acute rejection. Estimated probability of acute rejection (AR) for each of the AR and NR samples as returned by the Genomics 1 and Proteomics 1 classifiers in Ensemble 1, and by the Ensemble 1 classifier, which represents a combination of Genomics 1 and Proteomics 1. Samples are grouped along the x-axis into 11 AR (left group) and 21 NR (right group). Each point represents a probability of acute rejection for a specific sample. Three color-coded probabilities are shown per sample: red and black points represent probabilities from Ensemble 1, orange and grey points from Genomics 1, and yellow and brown points from Proteomics 1.

In Figure 4, one of the two genomic classifiers (Genomics 5) in Ensemble 4 is compared with the proteomic classifier from the same ensemble, using posterior probabilities of acute rejection (AR). The plot demonstrates that for the majority of samples the two classifiers agree and assign the same class label (points that fall in yellow areas), although they do not produce the same probabilities (which would place points on the diagonal line); in some cases, the classifiers disagree on the class of a particular sample (points that fall in grey areas). For example, the proteomic classifier misclassifies the 4 AR samples in the bottom right quadrant, while the genomic classifier misclassifies the two AR samples in the top left quadrant. One AR sample in the bottom left (yellow) square is misclassified by both classifiers. It is possible to compare all pairs of classifiers in an ensemble using the scatter plot approach from Figure 4. An example of this is shown in Figure 5, which displays a matrix of scatter plots for all 10 possible pairs of individual classifiers from Ensemble 2.

In addition to the presented case study, the ensemble framework was also applied to a set of publicly available mRNA and miRNA data that contain samples from a variety of human cancers as well as samples from comparable normal tissue. We focused on six tissue types (colon, kidney, prostate, uterus, lung and breast) and used all tumour and normal samples for which both mRNA and miRNA data were available. This resulted in 57 samples (38 tumour and 19 normal samples).
The computational pipeline was applied, using 10× 19-fold cross-validation and a maximum-AUC (within one standard error) model selection criterion, to develop a set of 12 classifiers for each of the mRNA and miRNA data sets separately. Classifier characteristics and estimated performances are shown in Table 4, together with the definition of six ensembles that represent different combinations of mRNA and miRNA classifiers. Consistent with our results, previous work by Ramaswamy et al. on a super-set of the mRNA data was able to differentiate tumour from normal samples with an accuracy of 92% using SVM and cross-validation [41].

Performance of ensemble classifiers was then determined for the AP and VT aggregation methods. The vote threshold was set to one as before, i.e., a sample was classified as tumour if at least one of the classifiers in the ensemble classified it as such. Classification performance is summarized in Table 5 (AP) and Table 6 (VT). For both the AP and VT aggregation methods, all ensembles achieve a higher AUC than the best individual classifier in the respective ensemble. Ensembles D and F with the AP aggregation method show the best performances, both having a sensitivity of 100%, a specificity of 95% and an AUC of 0.9986, although Ensemble F is based on twice as many individual classifiers as Ensemble D. For both ensembles, only one normal sample is misclassified, as can be seen in Figure 6, which shows the probability of tumour for Ensemble D and for the six individual classifiers that are equally split between mRNA and miRNA classifiers (three each).

From Figure 6, it can be seen across the 57 samples that the three classifiers based on mRNA data show similar probabilities of tumour most of the time, as do the three classifiers based on miRNA data.
However, because the miRNA classifiers perform better where mRNA classifiers misclassify (for example in several of the prostate cancer samples), and the mRNA classifiers perform better where some of the miRNA classifiers misclassify (for example in several of the uterus cancer samples), the ensemble can overall benefit from the averaging of probabilities. This is evident from the fact that all ensemble probabilities for the cancer samples (red points) fall above the probability=0.5 dashed line, thus achieving the aforementioned sensitivity of 100%. A similar effect of probability grouping by platform is observed for the normal samples. For example, the mRNA classifiers show a probability of tumour>0.9 for the single misclassified normal sample, while all miRNA classifiers have a probability of less than 0.3 for the same sample.

Discussion

A biomarker development pipeline with applications in genomic, proteomic, and other -omic data was presented and applied to the clinical challenge of classifying acute renal allograft rejection in blood samples. Genomic- and proteomic-based classification models were developed and showed adequate classification performance for clinical use.

Figure 4 Classifier comparison within Ensemble 4. Scatter plot of the predicted posterior probabilities of AR from the Genomics 5 and Proteomics 3 classifiers in Ensemble 4. Red points represent 11 AR samples, while black points represent 21 NR samples. Points that fall into yellow areas were classified identically by the genomic and the proteomic classifiers, while points in the grey areas were classified differently. AR samples are classified correctly when the probability for the corresponding red point is at or above 0.5. NR samples are predicted correctly when the probability is below 0.5.

Individual genomic- and proteomic-based classifiers were then combined into ensemble classifiers.
Given the cited improvement in classification performance of ensemble classifiers in other fields [36,42-44], an important question underlying our analysis was the extent to which ensembles can improve classification performance for acute renal allograft rejection beyond that of individual genomic and proteomic classifiers alone. Our application area is characterized by small sample sizes and adequate classification performance of individual classifiers. In general, we found that classification performance improved by using ensembles, although improvements in some performance measures might be countered by a decrease in others. The number of classifiers in an ensemble did not seem to affect performance improvements.

Figure 5 Comparison of all classifier pairs in Ensemble 2. Shown is a matrix of scatter plots of the predicted probabilities of AR for all 10 pairs of classifiers in Ensemble 2 as defined in Table 1. Red and black points indicate AR and NR samples respectively; the interpretation of yellow and grey areas is the same as in Figure 4.
Table 4 Overview of individual classifier performance and definition of ensembles

Classifier          Method  Features  Accuracy  Sensitivity  Specificity  AUC     Ensembles A–F
mRNA-Classifier1    EN      182       0.9298    0.9737       0.8421       0.9737  X X X
mRNA-Classifier2    EN      73        0.9123    1.0000       0.7368       0.9709  X
mRNA-Classifier3    EN      36        0.8947    0.9737       0.7368       0.9501  X X
mRNA-Classifier4    LDA     2         0.9298    0.9211       0.9474       0.9640  X
mRNA-Classifier5    RF      500       0.8947    0.9737       0.7368       0.9418  X
mRNA-Classifier6    SVM     500       0.9298    0.9474       0.8947       0.9640  X
mRNA-Classifier7    EN      43        0.9123    0.9474       0.8421       0.9598  X
mRNA-Classifier8    EN      25        0.9298    0.9737       0.8421       0.9612  X
mRNA-Classifier9    EN      17        0.9298    0.9737       0.8421       0.9695  X
mRNA-Classifier10   LDA     2         0.9298    0.9211       0.9474       0.9640  X X
mRNA-Classifier11   RF      50        0.9298    0.9474       0.8947       0.9584  X X
mRNA-Classifier12   SVM     50        0.8947    0.9211       0.8421       0.9557  X X X
miRNA-Classifier1   EN      66        0.8947    0.9211       0.8421       0.9626  X X
miRNA-Classifier2   EN      21        0.9474    0.9737       0.8947       0.9709  X X
miRNA-Classifier3   EN      8         0.9649    0.9737       0.9474       0.9723  X X
miRNA-Classifier4   LDA     4         0.9298    0.9211       0.9474       0.9626  X X
miRNA-Classifier5   RF      152       0.8947    0.8947       0.8947       0.9765  X
miRNA-Classifier6   SVM     152       0.9123    0.9474       0.8421       0.9626  X
miRNA-Classifier7   EN      36        0.9298    0.9474       0.8947       0.9709  X
miRNA-Classifier8   EN      16        0.9298    0.9474       0.8947       0.9848  X
miRNA-Classifier9   EN      12        0.9474    0.9737       0.8947       0.9806  X
miRNA-Classifier10  LDA     4         0.9298    0.9211       0.9474       0.9626  X
miRNA-Classifier11  RF      50        0.9123    0.9211       0.8947       0.9778  X X
miRNA-Classifier12  SVM     50        0.8947    0.9211       0.8421       0.9612  X X X

Shown is a list of 12 mRNA and 12 miRNA classifiers, their individual classification performance, and their inclusion into the 6 ensembles (A–F) that are explored for classification of tumour vs normal samples.
Abbreviations are the same as in Table 1.

When diagnosing acute kidney rejection, it is arguably more important to avoid false negatives (rejection that is falsely classified as non-rejection) than false positives (non-rejection falsely classified as rejection), because delays in the treatment of acute rejection cause both short- and long-term harm to the patient [45,46]. This was the motivation behind the vote threshold aggregation method, which ensures that a single individual classifier vote for acute rejection results in an acute-rejection classification by the ensemble. The results in Table 3 demonstrate that the VT aggregation method achieved an increase in sensitivity across all ensembles, though at the intuitively expected cost of decreased specificity in 4 out of 5 ensembles. The impact of this approach is similar to lowering the probability-of-AR threshold for an individual classifier, but it benefits from the increased diversity that comes with an ensemble, which in our case includes genomic- and proteomic-based biological signals. The VT method is especially valuable in cases where one platform is able to detect a rejection signal in some patients while another platform is not, as is demonstrated, for example, in Figure 4.

One of the ensembles (Ensemble 1) represents a two-classifier ensemble combining our previously published genomic and proteomic classifiers [5,6]. Even though AUC improves slightly when using the AP aggregation method, the same samples are misclassified as with the genomic classifier of Ensemble 1. Sensitivity is improved beyond that of the genomic or proteomic classifier alone when the VT aggregation method is used, but specificity drops below the values for the individual classifiers. Ensemble 1 therefore does not seem to improve classification performance much beyond that of the Genomics 1 classifier alone.
Ensembles 2, 3 and 5 represent extensions of Ensemble 1, where further genomic and/or proteomic classifiers were added. For the AP aggregation method these three ensembles show a performance range similar to Ensemble 1, while for the VT aggregation method Ensembles 3 and 5 improve sensitivity to 100% but drop below the range of individual classifiers for specificity, while staying within range for AUC. Ensemble 5 has a specificity of 62%, which is the lowest specificity across all 5 ensembles and 10 individual classifiers. This is not surprising, since Ensemble 5 combines all 10 individual classifiers and a single AR classification by one of the 10 classifiers is enough to call the sample AR, thereby maximally increasing sensitivity and lowering specificity. In this case, and for ensembles with a larger number of individual classifiers, the VT method might perform better with a higher threshold, for example requiring an AR classification from at least two classifiers.

Table 5 Summary of classification performance for the Average Probability aggregation method

            Sensitivity                      Specificity                      AUC
            Ens.    min     max     avg      Ens.    min     max     avg      Ens.    min     max     avg
Ensemble A  1.0000  0.9211  0.9737  0.9474   0.8421  0.8421  0.8421  0.8421   0.9972  0.9626  0.9737  0.9681
Ensemble B  0.9737  0.9211  0.9211  0.9211   0.8421  0.8421  0.8421  0.8421   0.9931  0.9557  0.9612  0.9584
Ensemble C  1.0000  0.9211  0.9737  0.9539   0.8421  0.7368  0.8947  0.8421   0.9917  0.9501  0.9709  0.9602
Ensemble D  1.0000  0.9211  0.9737  0.9386   0.9474  0.8421  0.9474  0.9035   0.9986  0.9557  0.9778  0.9677
Ensemble E  1.0000  0.8947  1.0000  0.9518   0.8947  0.7368  0.9474  0.8553   0.9972  0.9418  0.9765  0.9643
Ensemble F  1.0000  0.9211  0.9737  0.9430   0.9474  0.8421  0.9474  0.8816   0.9986  0.9557  0.9848  0.9672

Shown is performance for tumour vs normal classification for the 6 ensembles defined in Table 4 using the average probability aggregation method. The minimum, maximum and average performances of the individual classifiers in the respective ensemble are included in the table for comparison.

Table 6 Summary of classification performance for the Vote Threshold aggregation method

            Sensitivity                      Specificity                      AUC
            Ens.    min     max     avg      Ens.    min     max     avg      Ens.    min     max     avg
Ensemble A  1.0000  0.9211  0.9737  0.9474   0.7368  0.8421  0.8421  0.8421   0.9875  0.9626  0.9737  0.9681
Ensemble B  1.0000  0.9211  0.9211  0.9211   0.6842  0.8421  0.8421  0.8421   0.9917  0.9557  0.9612  0.9584
Ensemble C  1.0000  0.9211  0.9737  0.9539   0.6842  0.7368  0.8947  0.8421   0.9861  0.9501  0.9709  0.9602
Ensemble D  1.0000  0.9211  0.9737  0.9386   0.7368  0.8421  0.9474  0.9035   0.9875  0.9557  0.9778  0.9677
Ensemble E  1.0000  0.8947  1.0000  0.9518   0.6316  0.7368  0.9474  0.8553   0.9903  0.9418  0.9765  0.9643
Ensemble F  1.0000  0.9211  0.9737  0.9430   0.6842  0.8421  0.9474  0.8816   0.9931  0.9557  0.9848  0.9672

Shown is performance for tumour vs normal classification for the 6 ensembles defined in Table 4 using the vote threshold aggregation method. As in Table 5, individual classifier performances are included for comparison.

The best-performing ensemble (Ensemble 4) excludes the published genomic and proteomic classifiers and instead combines the largest genomic, the largest proteomic and a 50-feature genomic classifier based on Random Forest. The results in Tables 2 and 3 favour Ensemble 4, which is the only one that improves sensitivity and AUC beyond that of the individual classifiers in the ensemble while staying within the range for specificity. The two genomic classifiers in Ensemble 4 are based on Elastic Net (174 features) and Random Forest (50 features, of which 49 are also included in the 174-feature Elastic Net classifier).
The proteomic classifier is based on SVM using 33 features that were selected by fold-change criteria. A contributing factor to the good performance of Ensemble 4 could therefore be the use of comparatively large classifier panels and a fold-change filter on the proteomic side.

Several parts of the biomarker development pipeline for individual classifiers were designed to reduce the selection of false positive biomarkers, including pre-filtering, multiple hypothesis testing correction, cross-validation to maximize use of the small number of available samples, and nested cross-validation to avoid bias when models are tuned [29,34,47,48]. Ensembles provide an additional layer of robustness for classification when aggregation methods that average over several classifiers, e.g. average probability or majority vote, are used. This robustness is achieved by reducing the impact of inaccurate classifiers based on false positive genes or proteins, by allowing more accurate classifiers in the ensemble to "out-vote" a small number of inaccurate classifiers. Related to the previous point is the fact that the kidney rejection data is "wide" data, which is defined as having more features than samples. In "wide" data problems it is not feasible to find the best classifier. Instead, one commonly finds many, and possibly quite different, classifiers that seem equally valid while displaying a range of classification performances. Ensembling therefore provides a robust approach to "wide" data classification problems.

An important question surrounding ensembling concerns the choice of individual classifiers that should be part of the ensemble. Theoretical analysis points to including classifiers that are as independent as possible [36]. One source of "independence" in the acute kidney rejection case study comes from the two data types, i.e. genomic versus proteomic.
Within genomic and proteomic data, classifiers are developed using different combinations of filtering and classification methods as shown in Figure 1, thus focusing on different aspects of the genomic and proteomic data respectively.

Figure 6 Comparison of predicted probabilities of tumour. Estimated probability of tumour for each of the tumour and normal samples as returned by all six classifiers in Ensemble D, and by the Ensemble D classifier itself. Samples are grouped along the x-axis into 38 tumour (left) and 19 normal (right). Seven color-coded probabilities are shown per sample. Red and black points represent probabilities from Ensemble D, orange and grey crosses from the three mRNA classifiers, and pink and blue stars from the three miRNA classifiers.

An additional source of "independence" that has not been explored in this study could be provided on a biological level. Bioinformatics tools, such as pathway analysis tools and ontology-based tools, can provide insights into how much individual biomarker panels differ biologically. Individual classifiers in an ensemble could then be selected to cover a wide range of biological pathways, thus providing a diverse biological cross-section. Pathway analysis is an area of active research in its own right that is currently in dynamic flux [49].
Hence, we have concentrated in our approach and discussion on computational aspects of ensemble classifiers.

In addition to selecting the individual classifiers to be combined in an ensemble, a weighting needs to be provided. We have used equal weights for individual classifiers in our analyses, as suggested by the term average probability. In general, each classifier can be weighted differently in classifier aggregation, such that more trustworthy classifiers receive a higher weight. It is important to note that any composition of an ensemble introduces a form of weighting. For example, an ensemble of 2 genomic and 5 proteomic classifiers, in which all classifiers have equal weights, would put a higher weight on proteomic-based classifiers as a group when compared to genomic-based classifiers. If one prefers to give equal weight to the genomic- and proteomic-based classifier groups, the two genomic-based classifiers should have a weight of 0.25 each (thus adding up to 0.5), while the five proteomic-based classifiers should have a weight of 0.1 each (also adding up to 0.5). The five ensembles in Table 1 followed an underlying balanced design in this regard, i.e., the difference in the number of genomic and proteomic classifiers in an ensemble is at most 1.

Figure 5 shows a matrix of scatter plots for all 10 possible pairs of individual classifiers from Ensemble 2, demonstrating the usefulness of this type of visualization in providing an overview of the diversity between the classifiers in an ensemble. The scatter plots between pairs of genomic classifiers (three plots in the upper left) and proteomic classifiers (one plot in the bottom right) show similar classification of samples, with most samples falling into the yellow areas. The remaining 6 scatter plots compare one genomic and one proteomic classifier each. Here, an increase in disagreement between the classifier pairs is observed, which is evident from more samples falling in the grey areas.
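The group-balancing arithmetic described above (two genomic classifiers at 0.25 each, five proteomic classifiers at 0.1 each) generalizes to any set of platform groups. This is an illustrative sketch, not code from the paper:

```python
import numpy as np

def group_balanced_weights(platforms):
    """Give each platform group equal total weight, split equally
    among the classifiers within the group.

    platforms: one platform label per classifier, e.g.
    ["genomic", "genomic", "proteomic", "proteomic", "proteomic"].
    """
    groups = sorted(set(platforms))
    per_group = 1.0 / len(groups)                 # equal weight per group
    counts = {g: platforms.count(g) for g in groups}
    return np.array([per_group / counts[p] for p in platforms])

def weighted_probability(P, weights):
    """Weighted-average aggregation of per-classifier probabilities.

    P: (n_samples, n_classifiers) probability matrix; weights sum to 1.
    """
    return P @ weights
```

With equal per-classifier weights, `weighted_probability` reduces to the AP rule; the group-balanced variant instead equalizes the influence of the genomic and proteomic signals regardless of how many classifiers each platform contributes.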
Disagreement to some extent is desired in ensemble classifiers, since they derive a benefit from the diversity of the underlying classifiers. In addition to comparing classifiers in an ensemble based on the number of features and individual performance characteristics as shown in Table 1, one can also use information from scatter plots as shown in Figures 4 and 5 to add or remove classifiers, in an effort to optimize diversity during the ensemble design process. It should be noted that the number of plots in a scatter plot matrix grows with the square of the number of individual classifiers, an effect that poses a practical limitation on this type of visualization.

Because the proteo-genomic ensemble approach assumes fully developed individual classifiers, test samples need to be classified by the genomic and proteomic classifiers before they can be aggregated. This requires the samples to be run on both platforms. In cases where a sample is only run on one platform, the ensemble classifier cannot be used. An alternative in this case is to fall back on a platform-specific classifier, which could itself be an ensemble (e.g., a genomic ensemble), although one would lose the advantage of using information from diverse sources for classification. The inclusion of data from other platforms within the presented ensemble framework, for example miRNA, metabolomic or clinical data sources, is easily possible as long as patient-matched measurements from the corresponding platforms are provided. The generality of the ensemble framework has been demonstrated by applying it off the shelf to an additional cancer data set based on two different types of genomic data (mRNA and miRNA).
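As a numerical complement to the scatter-plot matrices in Figures 4 and 5, pairwise disagreement can be tabulated directly; entry [i, j] below corresponds to the fraction of samples falling in the grey areas for the pair of classifiers (i, j). The function name is illustrative:

```python
import numpy as np

def disagreement_matrix(P, cutoff=0.5):
    """Fraction of samples on which each pair of classifiers assigns
    different class labels.

    P: (n_samples, n_classifiers) matrix of predicted probabilities.
    Returns a symmetric (n_classifiers, n_classifiers) matrix with
    zeros on the diagonal.
    """
    calls = P >= cutoff                      # hard class calls per classifier
    n = calls.shape[1]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            D[i, j] = np.mean(calls[:, i] != calls[:, j])
    return D
```

Unlike the scatter-plot matrix, whose number of panels grows with the square of the ensemble size, this matrix grows the same way but remains compact enough to scan for low-diversity (near-zero) pairs when deciding which classifiers to add or remove.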
The findings there show that ensemble classifiers can improve upon already well-performing individual mRNA and miRNA classifiers, thus supporting the notion that ensemble classifiers based on a diverse set of individual classifiers across different platforms have the ability to outperform any single classifier in the ensemble.

Conclusions

Proteo-genomic biomarker ensemble classifiers show promise in the diagnosis of acute renal allograft rejection and can improve classification performance beyond that of individual genomic or proteomic classifiers alone. The Vote Threshold aggregation method allows fine-tuning of sensitivity and specificity while incorporating diverse classification signals from individual classifiers. This is an important feature in application areas where sensitivity is more important than specificity. Validation of our renal allograft rejection results in an international multi-center study is currently underway.

Endnotes

^a The Genomics 1 classifier was developed based on 33 samples, which included one additional non-rejection sample that was available only on the genomic platform. This sample was not included in the development of the other genomic and proteomic classifiers.

^b Classifier Proteomics 1 in Table 1 is from a previous publication [6], which used a 67% minimum detection rule.

^c Performance estimates for classifier Genomics 1 in Table 1 were based on values for the 32 samples derived from
BMC Bioinformatics 2012, 13:326 Page 15 of 17 11-fold cross-validation of the 33 sample set asdescribed in a previous publication [5].dPerformance estimates for ensembles that includedthe Genomics 1 classifier used posterior probabilities forthe 32 samples in common.AbbreviationsAR: Acute Rejection; NR: Non-Rejection; LDA: Linear Discriminant Analysis;SVM: Support Vector Machine; EN: Elastic Net (Generalized Linear Model);RF: Random Forest; ROC: Receiver Operating Characteristics; AUC: Area underthe (ROC) Curve; PGCA: Protein Group Code Algorithm; AP: AverageProbability; VT: Vote Threshold; mRNA: Messenger RNA; miRNA: MicroRNA.Competing interestsThe authors declare that they have no competing interests.Authors’ contributionsOG carried out computational genomic- and ensemble analyses, participated inthe design, execution and analytical discussions of the work, and prepared themanuscript. VC carried out the computational proteomic analysis andcontributed to the design, execution and analytical discussions of the workpresented in this manuscript. GCF contributed to the design, development anddescription of the proteomics pipeline, analytical discussion of the work andreviewing the manuscript. RB contributed to conception, design and statisticaldiscussion of the computational pipeline and ensembles discussed in themanuscript. ST contributed to the design and participated in analytical andbiologically discussions of the work. ZH contributed to design and analyticaldiscussion of the computational pipeline, and data management support. MTparticipated in analytical discussion of the work and preparation of iTRAQproteomics data. RM participated in the conception and design of the workdiscussed in this manuscript. BM contributed to the conception, design,execution and analytical discussions of the work discussed in this manuscript.PK participated in the conception, design and execution of the work discussedin this manuscript. 
RN contributed to the conception, design, execution and analytical discussions of the work, and participated in preparing the manuscript. All authors read and approved the final manuscript.

Acknowledgements
The authors thank the members of the Genome Canada Biomarkers in Transplantation Team and the NCE CECR Prevention of Organ Failure (PROOF) Centre of Excellence, Genome Canada, Novartis Pharma, IBM, Genome British Columbia, Astellas, Eksigent, Vancouver Hospital Foundation, St. Paul's Hospital Foundation, University of British Columbia VP Research, UBC James Hogg Research Centre, and BC Transplant.

Author details
1 NCE CECR Prevention of Organ Failure (PROOF) Centre of Excellence, Vancouver, BC V6Z 1Y6, Canada. 2 Department of Statistics, University of British Columbia, Vancouver, BC V6T 1Z2, Canada. 3 Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, BC V6T 2B5, Canada. 4 Immunity and Infection Research Centre, Vancouver, BC V5Z 3J5, Canada. 5 Immunology Laboratory, Vancouver General Hospital, Vancouver, BC V5Z 1M9, Canada. 6 Department of Medicine, University of British Columbia, Vancouver, BC V5Z 1M9, Canada. 7 James Hogg Research Centre, St. Paul's Hospital, University of British Columbia, Vancouver, BC V6Z 1Y6, Canada. 8 Department of Computer Science, University of British Columbia, Vancouver, BC V6T 1Z2, Canada. 9 Department of Medical Genetics, University of British Columbia, Vancouver, BC V6T 1Z3, Canada. 10 Institute for HEART+LUNG Health, Vancouver, BC V6Z 1Y6, Canada. 11 Department of Medicine, Division of Respiratory Medicine, University of British Columbia, Vancouver, BC V5Z 1M9, Canada. 12 Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, BC V5A 1S6, Canada.

Received: 3 April 2012 Accepted: 4 December 2012 Published: 8 December 2012

References
1. Fassett RG, Venuthurupalli SK, Gobe GC, Coombes JS, Cooper MA, Hoy WE: Biomarkers in chronic kidney disease: a review. Kidney Int 2011, 80:806–821.
2. Vasan RS: Biomarkers of cardiovascular disease: molecular basis and practical considerations. Circulation 2006, 113:2335–2362.
3. Dash PK, Zhao J, Hergenroeder G, Moore AN: Biomarkers for the diagnosis, prognosis, and evaluation of treatment efficacy for traumatic brain injury. Neurotherapeutics 2010, 7:100–114.
4. Racusen LC, Solez K, Colvin RB, Bonsib SM, Castro MC, Cavallo T, Croker BP, Demetris AJ, Drachenberg CB, Fogo AB, Furness P, Gaber LW, Gibson IW, Glotz D, Goldberg JC, Grande J, Halloran PF, Hansen HE, Hartley B, Hayry PJ, Hill CM, Hoffman EO, Hunsicker LG, Lindblad AS, Yamaguchi Y: The Banff 97 working classification of renal allograft pathology. Kidney Int 1999, 55:713–723.
5. Günther OP, Balshaw RF, Scherer A, Hollander Z, Mui A, Triche TJ, Freue GC, Li G, Ng RT, Wilson-McManus J, McMaster WR, McManus BM, Keown PA: Functional genomic analysis of peripheral blood during early acute renal allograft rejection. Transplantation 2009, 88:942–951.
6. Freue GVC, Sasaki M, Meredith A, Günther OP, Bergman A, Takhar M, Mui A, Balshaw RF, Ng RT, Opushneva N, Hollander Z, Li G, Borchers CH, Wilson-McManus J, McManus BM, Keown PA, McMaster WR: Proteomic signatures in plasma during early acute renal allograft rejection. Mol Cell Proteomics 2010, 9:1954–1967.
7. Flechner SM, Kurian SM, Head SR, Sharp SM, Whisenant TC, Zhang J, Chismar JD, Horvath S, Mondala T, Gilmartin T, Cook DJ, Kay SA, Walker JR, Salomon DR: Kidney transplant rejection and tissue injury by gene profiling of biopsies and peripheral blood lymphocytes. Am J Transplant 2004, 4:1475–1489.
8. Kurian SM, Heilman R, Mondala TS, Nakorchevsky A, Hewel JA, Campbell D, Robison EH, Wang L, Lin W, Gaber L, Solez K, Shidban H, Mendez R, Schaffer RL, Fisher JS, Flechner SM, Head SR, Horvath S, Yates JR, Marsh CL, Salomon DR: Biomarkers for early and late stage chronic allograft nephropathy by proteogenomic profiling of peripheral blood. PLoS One 2009, 4:e6212.
9.
Perkins D, Verma M, Park KJ: Advances of genomic science and systems biology in renal transplantation: a review. Semin Immunopathol 2011, 33(2):211–218.
10. Lin D, Hollander Z, Ng RT, Imai C, Ignaszewski A, Balshaw R, Freue GC, Wilson-McManus JE, Qasimi P, Meredith A, Mui A, Triche T, McMaster R, Keown PA, McManus BM: Whole blood genomic biomarkers of acute cardiac allograft rejection. J Heart Lung Transplant 2009, 28:927–935.
11. Bernstein D, Williams GE, Eisen H, Mital S, Wohlgemuth JG, Klingler TM, Fang KC, Deng MC, Kobashigawa J: Gene expression profiling distinguishes a molecular signature for grade 1B mild acute cellular rejection in cardiac allograft recipients. J Heart Lung Transplant 2007, 26:1270–1280.
12. Bloom G, Yang IV, Boulware D, Kwong KY, Coppola D, Eschrich S, Quackenbush J, Yeatman TJ: Multi-platform, multi-site, microarray-based human tumor classification. Am J Pathol 2004, 164:9–16.
13. Li G, Zhang W, Zeng H, Chen L, Wang W, Liu J, Zhang Z, Cai Z: An integrative multi-platform analysis for discovering biomarkers of osteosarcoma. BMC Cancer 2009, 9:150.
14. Lê Cao K-A, Rossouw D, Robert-Granié C, Besse P: A sparse PLS for variable selection when integrating omics data. Stat Appl Genet Mol Biol 2008, 7(1): Article 35.
15. Kittler J, Hatef M, Duin RPW, Matas J: On combining classifiers. IEEE Trans Pattern Anal Mach Intell 1998, 20:226–239.
16. Rokach L: Ensemble-based classifiers. Artif Intell Rev 2010, 33:1–39.
17. Polikar R: Ensemble based systems in decision making. IEEE Circuits Syst Mag 2006, 6:21–45.
18. Cohen Freue GV, Bergman A, Meredith A, Lam K, Sasaki M, Smith D, Hollander Z, Opushneva N, Takhar M, Lin D, Wilson-McManus J, Balshaw RF, Ng RT, Keown PA, McManus B, Borchers CH, McMaster WR: Computational biomarker pipeline from discovery to clinical implementation: human plasma proteomic biomarkers for cardiac transplantation. PLoS Comput Biol, under review.
19.
Cohen Freue GV, Hollander Z, Shen E, Zamar RH, Balshaw R, Scherer A, McManus B, Keown P, McMaster WR, Ng RT: MDQC: a new quality assessment method for microarrays based on quality control reports. Bioinformatics 2007, 23:3162–3169.
20. Kauffmann A, Gentleman R, Huber W: arrayQualityMetrics—a bioconductor package for quality assessment of microarray data. Bioinformatics 2009, 25:415–416.
21. Günther OP, Lin D, Balshaw RF, Ng RT, Hollander Z, Wilson-McManus J, McMaster WR, McManus BM, Keown PA: Effects of sample timing and […] on gene expression in early acute renal allograft rejection. Transplantation 2011, 91:323–329.
22. Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP: Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res 2003, 31:e15.
23. Harbron C, Chang K-M, South MC: RefPlus: an R package extending the RMA algorithm. Bioinformatics 2007, 23:2493–2494.
24. Wu Z, Irizarry RA, Gentleman R, Martinez-Murillo F, Spencer F: A model-based background adjustment for oligonucleotide expression arrays. J Am Stat Assoc 2004, 99:909–917.
25. Hochreiter S, Clevert D-A, Obermayer K: A new summarization method for Affymetrix probe level data. Bioinformatics 2006, 22:943–949.
26. Bourgon R, Gentleman R, Huber W: Independent filtering increases detection power for high-throughput experiments. Proc Natl Acad Sci USA 2010, 107:9546–9551.
27. Smyth GK: Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 2004, 3: Article 3.
28. Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA 2001, 98:5116–5121.
29. Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Series B (Methodological) 1995, 57:289–300.
30. Dudoit S, Shaffer JP, Boldrick JC: Multiple hypothesis testing in microarray experiments. Stat Sci 2003, 18:71–103.
31. Ben-Hur A, Ong CS, Sonnenburg S, Schölkopf B, Rätsch G: Support vector machines and kernels for computational biology. PLoS Comput Biol 2008, 4:e1000173.
32. Breiman L: Random forests. Mach Learn 2001, 45:5–32.
33. Friedman J, Hastie T, Tibshirani R: Regularization paths for generalized linear models via coordinate descent. J Stat Softw 2010, 33:1–22.
34. Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd edition (corr. 3rd printing). New York: Springer; 2009.
35. Tibshirani R, Hastie T, Narasimhan B, Chu G: Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci USA 2002, 99:6567–6572.
36. Zhang Q, Hughes-Oliver JM, Ng RT: A model-based ensembling approach for developing QSARs. J Chem Inf Model 2009, 49:1857–1865.
37. Kuncheva LI, Whitaker CJ: Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Mach Learn 2003, 51:181–207.
38. Jahrer M, Töscher A, Legenstein R: Combining predictions for accurate recommender systems. Proc 16th ACM SIGKDD Int Conf Knowledge Discovery and Data Mining 2010, 693–702.
39. Netflix Prize: Home.
40. Lu J, Getz G, Miska EA, Alvarez-Saavedra E, Lamb J, Peck D, Sweet-Cordero A, Ebert BL, Mak RH, Ferrando AA, Downing JR, Jacks T, Horvitz HR, Golub TR: MicroRNA expression profiles classify human cancers. Nature 2005, 435:834–838.
41. Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang C-H, Angelo M, Ladd C, Reich M, Latulippe E, Mesirov JP, Poggio T, Gerald W, Loda M, Lander ES, Golub TR: Multiclass cancer diagnosis using tumor gene expression signatures. Proc Natl Acad Sci USA 2001, 98:15149–15154.
42. Luo S-T, Cheng B-W: Diagnosing breast masses in digital mammography using feature selection and ensemble methods. J Med Syst 2010, 98(26):15149–15154.
43. Oh S, Lee MS, Zhang B-T: Ensemble learning with active example selection for imbalanced biomedical data classification.
IEEE/ACM Trans Comput Biol Bioinform 2011, 8:316–325.
44. Afridi TH, Khan A, Lee YS: Mito-GSAAC: mitochondria prediction using genetic ensemble classifier and split amino acid composition. Amino Acids 2011, 42(4):1443–1454.
45. Peeters P, Van Laecke S, Vanholder R: Acute kidney injury in solid organ transplant recipients. Acta Clin Belg Suppl 2007:389–392.
46. de Fijter JW: Rejection and function and chronic allograft dysfunction. Kidney Int Suppl 2010, 78(S119):S38–S41.
47. Statnikov A, Tsamardinos I, Dosbayev Y, Aliferis CF: GEMS: a system for automated cancer diagnosis and biomarker discovery from microarray gene expression data. Int J Med Inform 2005, 74:491–503.
48. Lee S: Mistakes in validating the accuracy of a prediction classifier in high-dimensional but small-sample microarray data. Stat Methods Med Res 2008, 17:635–642.
49. Khatri P, Sirota M, Butte AJ: Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol 2012, 8(2).

doi:10.1186/1471-2105-13-326
Cite this article as: Günther et al.: A computational pipeline for the development of multi-marker bio-signature panels and ensemble classifiers. BMC Bioinformatics 2012, 13:326.