UBC Faculty Research and Publications

MINT: a multivariate integrative method to identify reproducible molecular signatures across independent… Rohart, Florian; Eslami, Aida; Matigian, Nicholas; Bougeard, Stéphanie; Lê Cao, Kim-Anh Feb 27, 2017


Rohart et al. BMC Bioinformatics (2017) 18:128
DOI 10.1186/s12859-017-1553-8

METHODOLOGY ARTICLE, Open Access

MINT: a multivariate integrative method to identify reproducible molecular signatures across independent experiments and platforms

Florian Rohart¹, Aida Eslami², Nicholas Matigian¹, Stéphanie Bougeard³ and Kim-Anh Lê Cao¹*

Abstract

Background: Molecular signatures identified from high-throughput transcriptomic studies often have poor reliability and fail to reproduce across studies. One solution is to combine independent studies into a single integrative analysis, additionally increasing sample size. However, the different protocols and technological platforms across transcriptomic studies produce unwanted systematic variation that strongly confounds the integrative analysis results. When studies aim to discriminate an outcome of interest, the common approach is a sequential two-step procedure; unwanted systematic variation removal techniques are applied prior to classification methods.

Results: To limit the risk of overfitting and over-optimistic results of a two-step procedure, we developed a novel multivariate integration method, MINT, that simultaneously accounts for unwanted systematic variation and identifies predictive gene signatures with greater reproducibility and accuracy. In two biological examples on the classification of three human cell types and four subtypes of breast cancer, we combined high-dimensional microarray and RNA-seq data sets and MINT identified highly reproducible and relevant gene signatures predictive of a given phenotype.
MINT led to superior classification and prediction accuracy compared to the existing sequential two-step procedures.

Conclusions: MINT is a powerful approach and the first of its kind to solve the integrative classification problem in a single step by combining multiple independent studies. MINT is computationally fast as part of the mixOmics R CRAN package, available at http://www.mixOmics.org/mixMINT/ and http://cran.r-project.org/web/packages/mixOmics/.

Keywords: Integration, Multivariate, Classification, Transcriptome analysis, Algorithm, Partial-least-square

*Correspondence: k.lecao@uq.edu.au
¹The University of Queensland Diamantina Institute, The University of Queensland, Translational Research Institute, 4102 Brisbane QLD, Australia
Full list of author information is available at the end of the article

Background

High-throughput technologies, based on microarray and RNA-sequencing, are now being used to identify biomarkers or gene signatures that distinguish disease subgroups, predict cell phenotypes or classify responses to therapeutic drugs. However, few of these findings are reproduced when assessed in subsequent studies and even fewer lead to clinical applications [1, 2]. The poor reproducibility of identified gene signatures is most likely a consequence of high-dimensional data, in which the number of genes or transcripts being analysed is very high (often several thousands) relative to a comparatively small sample size (< 20).

One way to increase sample size is to combine raw data from independent experiments in an integrative analysis. This would improve both the statistical power of the analysis and the reproducibility of the gene signatures that are identified [3]. However, integrating transcriptomic studies with the aim of classifying biological samples based on an outcome of interest (integrative classification) has a number of challenges.
Transcriptomic studies often differ from each other in a number of ways, such as in their experimental protocols or in the technological platform used. These differences can lead to so-called ‘batch-effects’, or systematic variation across studies, which is an important source of confounding [4]. Technological platform, in particular, has been shown to be an important confounder that affects the reproducibility of transcriptomic studies [5]. In the MicroArray Quality Control (MAQC) project, poor overlap of differentially expressed genes was observed across different microarray platforms (∼60%), with low concordance observed between microarray and RNA-seq technologies specifically [6]. Therefore, these confounding factors and sources of systematic variation must be accounted for, when combining independent studies, to enable genuine biological variation to be identified.

© The Author(s). 2017 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

The common approach to integrative classification is sequential. A first step consists of removing batch-effect by applying for instance ComBat [7], FAbatch [8], Batch Mean-Centering [9], LMM-EH-PS [10], RUV-2 [4] or YuGene [11]. A second step fits a statistical model to classify biological samples and predict the class membership of new samples. A range of classification methods also exists for these purposes, including machine learning approaches (e.g.
random forests [12, 13] or Support Vector Machine [14–16]) as well as multivariate linear approaches (Linear Discriminant Analysis LDA, Partial Least Square Discriminant Analysis PLSDA [17], or sparse PLSDA [18]).

The major pitfall of the sequential approach is a risk of over-optimistic results from overfitting of the training set. This leads to signatures that cannot be reproduced on test sets. Moreover, most proposed classification models have not been objectively validated on an external and independent test set. Thus, spurious conclusions can be generated when using these methods, leading to limited potential for translating results into reliable clinical tools [2]. For instance, most classification methods require the choice of a parameter (e.g. sparsity), which is usually optimised with cross-validation (data are divided into k subsets or ‘folds’ and each fold is used once as an internal test set). Unless the removal of batch-effects is performed independently on each fold, the folds are not independent and this leads to over-optimistic classification accuracy on the internal test sets. Hence, batch removal methods must be used with caution. For instance, ComBat cannot remove unwanted variation in an independent test set alone as it requires the test set to be normalised with the learning set in a transductive rather than inductive approach [19]. This is a clear example where over-fitting and over-optimistic results can be an issue, even when a test set is considered.

To address existing limitations of current data integration approaches and the poor reproducibility of results, we propose a novel Multivariate INTegrative method, MINT. MINT is the first approach of its kind that integrates independent data sets while simultaneously accounting for unwanted (study) variation, classifying samples and identifying key discriminant variables. MINT predicts the class of new samples from external studies, which enables a direct assessment of its performance.
It also provides insightful graphical outputs to improve interpretation and inspect each study during the integration process.

We validated MINT in a subset of the MAQC project, which was carefully designed to enable assessment of unwanted systematic variation. We then combined microarray and RNA-seq experiments to classify samples from three human cell types (human Fibroblasts (Fib), human Embryonic Stem Cells (hESC) and human induced Pluripotent Stem Cells (hiPSC)) and from four classes of breast cancer (subtype Basal, HER2, Luminal A and Luminal B). We use these datasets to demonstrate the reproducibility of gene signatures identified by MINT.

Methods

We use the following notations. Let $X$ denote a data matrix of size $N$ observations (rows) × $P$ variables (e.g. gene expression levels, in columns) and $Y$ a dummy matrix indicating each sample class membership of size $N$ observations (rows) × $K$ categories outcome (columns). We assume that the data are partitioned into $M$ groups corresponding to each independent study $m$: $\{(X^{(1)}, Y^{(1)}), \ldots, (X^{(M)}, Y^{(M)})\}$ so that $\sum_{m=1}^{M} n_m = N$, where $n_m$ is the number of samples in group $m$, see Additional file 1: Figure S1. Each variable from the data set $X^{(m)}$ and $Y^{(m)}$ is centered and has unit variance. We write $X$ and $Y$ the concatenation of all $X^{(m)}$ and $Y^{(m)}$, respectively. Note that if an internal known batch effect is present in a study, this study should be split according to that batch effect factor into several sub-studies considered as independent. For $n \in \mathbb{N}$, we denote for all $a \in \mathbb{R}^n$ its $\ell_1$ norm $\|a\|_1 = \sum_{j=1}^{n} |a_j|$ and its $\ell_2$ norm $\|a\|_2 = \left(\sum_{j=1}^{n} a_j^2\right)^{1/2}$, and $|a|_+$ the positive part of $a$. For any matrix $A$ we denote by $A^\top$ its transpose.

PLS-based classification methods to combine independent studies

PLS approaches have been extended to classify samples $Y$ from a data matrix $X$ by maximising a formula based on their covariance.
Specifically, latent components are built based on the original $X$ variables to summarise the information and reduce the dimension of the data while discriminating the $Y$ outcome. Samples are then projected into a smaller space spanned by the latent components. We first detail the classical PLS-DA approach and then describe mgPLS, a PLS-based model we previously developed to model a group (study) structure in $X$.

PLS-DA. Partial Least Squares Discriminant Analysis [17] is an extension of PLS for a classification framework where $Y$ is a dummy matrix indicating sample class membership. In our study, we applied PLS-DA as an integrative approach by naively concatenating all studies. Briefly, PLS-DA is an iterative method that constructs $H$ successive artificial (latent) components $t_h = X_h a_h$ and $u_h = Y_h b_h$ for $h = 1, \ldots, H$, where the $h$th component $t_h$ (respectively $u_h$) is a linear combination of the $X$ ($Y$) variables. $H$ denotes the dimension of the PLS-DA model. The weight coefficient vector $a_h$ ($b_h$) is the loading vector that indicates the importance of each variable to define the component. For each dimension $h = 1, \ldots, H$ PLS-DA seeks to maximize

$$\max_{\|a_h\|_2 = \|b_h\|_2 = 1} \operatorname{cov}(X_h a_h, Y_h b_h), \tag{1}$$

where $X_h$, $Y_h$ are residual matrices (obtained through a deflation step, as detailed in [18]). The PLS-DA algorithm is described in Additional file 1: Supplemental Material S1. The PLS-DA model assigns to each sample $i$ a pair of $H$ scores $(t_h^i, u_h^i)$ which effectively represents the projection of that sample into the $X$- or $Y$-space spanned by those PLS components. As $H \ll P$, the projection space is small, allowing for dimension reduction as well as insightful sample plot representation (e.g. graphical outputs in “Results” section).
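To make Eq. (1) concrete, the following minimal Python sketch computes the first pair of loading vectors for $h = 1$. It relies on a standard linear algebra fact rather than on the authors' implementation: for centered data, $\operatorname{cov}(Xa, Yb)$ is proportional to $a^\top X^\top Y b$, which under unit-norm constraints is maximised by the leading singular vectors of $X^\top Y$. The function name `first_plsda_component` and the toy data are illustrative assumptions, not part of mixOmics.

```python
import numpy as np

def first_plsda_component(X, Y):
    """Loadings (a, b) and scores (t, u) of the first PLS-DA component (h = 1)."""
    # Center and scale every column to unit variance, as in the notation above.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    Ys = (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)
    # The leading singular vectors of X'Y maximise a' X'Y b subject to
    # ||a||_2 = ||b||_2 = 1, i.e. the covariance criterion of Eq. (1).
    U, _, Vt = np.linalg.svd(Xs.T @ Ys, full_matrices=False)
    a, b = U[:, 0], Vt[0, :]
    return a, b, Xs @ a, Ys @ b

# Toy data (hypothetical): 30 samples, 8 variables, 3 balanced classes.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 10)
X_toy = rng.normal(size=(30, 8)) + labels[:, None]
Y_toy = np.eye(3)[labels]          # dummy matrix of class membership
a, b, t, u = first_plsda_component(X_toy, Y_toy)
```

Subsequent components would be obtained by repeating the same step on deflated matrices; that part is omitted here.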
While PLS-DA ignores the data group structure inherent to each independent study, it can give satisfactory results when the between groups variance is smaller than the within group variance or when combined with extensive data subsampling to account for systematic variation across platforms [21].

mgPLS. Multi-group PLS is an extension of the PLS framework we recently proposed to model grouped data [22, 23], which is relevant for our particular case where the groups represent independent studies. In mgPLS, the PLS-components of each group are constrained to be built based on the same loading vectors in $X$ and $Y$. These global loading vectors thus allow the samples from each group or study to be projected in the same common space spanned by the PLS-components. We extended the original unsupervised approach to a supervised approach by using a dummy matrix $Y$ as in PLS-DA to classify samples while modelling the group structure. For each dimension $h = 1, \ldots, H$ mgPLS-DA seeks to maximize

$$\max_{\|a_h\|_2 = \|b_h\|_2 = 1} \sum_{m=1}^{M} n_m \operatorname{cov}\left(X_h^{(m)} a_h, Y_h^{(m)} b_h\right), \tag{2}$$

where $a_h$ and $b_h$ are the global loadings vectors common to all groups, $t_h^{(m)} = X_h^{(m)} a_h$ and $u_h^{(m)} = Y_h^{(m)} b_h$ are the group-specific (partial) PLS-components, and $X_h^{(m)}$ and $Y_h^{(m)}$ are the residual (deflated) matrices. The global loadings vectors $(a_h, b_h)$ and global components $(t_h = X_h a_h, u_h = Y_h b_h)$ make it possible to assess overall classification accuracy, while the group-specific loadings and components provide powerful graphical outputs for each study that is integrated in the analysis. Global and group-specific components and loadings are represented in Additional file 1: Figure S2. The next development we describe below is to include internal variable selection in mgPLS-DA for large dimensional data sets.

MINT

Our novel multivariate integrative method MINT simultaneously integrates independent studies and selects the most discriminant variables to classify samples and predict the class of new samples.
MINT seeks a common projection space for all studies that is defined on a small subset of discriminative variables and that displays an analogous discrimination of the samples across studies. The identified variables share common information across all studies and therefore represent a reproducible signature that helps characterise biological systems. MINT further extends mgPLS-DA by including an $\ell_1$-penalisation on the global loading vector $a_h$ to perform variable selection. For each dimension $h = 1, \ldots, H$ the MINT algorithm seeks to maximize

$$\max_{\|a_h\|_2 = \|b_h\|_2 = 1} \sum_{m=1}^{M} n_m \operatorname{cov}\left(X_h^{(m)} a_h, Y_h^{(m)} b_h\right) + \lambda_h \|a_h\|_1, \tag{3}$$

where in addition to the notations from Eq. (2), $\lambda_h$ is a non negative parameter that controls the amount of shrinkage on the global loading vectors $a_h$ and thus the number of non zero weights. Similarly to Lasso [24] or sparse PLS-DA [18], the added $\ell_1$ penalisation in MINT improves interpretability of the PLS-components that are now defined only on a set of selected biomarkers from $X$ (with non zero weight) that are identified in the linear combination $X_h^{(m)} a_h$. The $\ell_1$ penalisation is effectively solved in the MINT algorithm using soft-thresholding (see pseudo Algorithm 1).

In addition to the integrative classification framework, MINT was extended to an integrative regression framework (multiple multivariate regression, Additional file 1: Supplemental Material S2).

Algorithm 1 MINT
1: We denote $\forall\, 1 \le m \le M$, $X_1^{(m)} = X^{(m)}$, $Y_1^{(m)} = Y^{(m)}$, $X_1 = X$ and $Y_1 = Y$, where $X$ and $Y$ are centered and scaled.
2: For $1 \le h \le H$, choose $\lambda_h$ and an initial value for $a_h$ with $\|a_h\|_2 = 1$,
3:   repeat
4:     $t_h^{(m)} \leftarrow X_h^{(m)} a_h$  (partial components)
5:     $t_h \leftarrow X_h a_h$  (global components)
6:     $b_h^{(m)} \leftarrow (Y_h^{(m)})^\top t_h^{(m)}$  (partial loadings)
7:     $b_h \leftarrow \left(\sum_{m=1}^{M} b_h^{(m)}\right) / \left\|\sum_{m=1}^{M} b_h^{(m)}\right\|_2$  (global loadings)
8:     $u_h^{(m)} \leftarrow Y_h^{(m)} b_h$  (partial components)
9:     $a_h^{(m)} \leftarrow (X_h^{(m)})^\top u_h^{(m)}$  (partial loadings)
10:    $a_h \leftarrow \left(\sum_{m=1}^{M} a_h^{(m)}\right) / \left\|\sum_{m=1}^{M} a_h^{(m)}\right\|_2$  (global loadings)
11:    $a_h \leftarrow \operatorname{sign}(a_h)\,(|a_h| - \lambda_h)_+$  (soft thresholding)
12:  until convergence of $a_h$ and $b_h$.
13:  $P \leftarrow I - t_h (t_h^\top t_h)^{-1} t_h^\top$, where $I$ = identity matrix of $\mathbb{R}^N$
14:  $X_{h+1} \leftarrow P X_h$ and $Y_{h+1} \leftarrow P Y_h$  (deflation)

Class prediction and parameters tuning with MINT

MINT centers and scales each study from the training set, so that each variable has mean 0 and variance 1, similarly to any PLS method. Therefore, a similar pre-processing needs to be applied on test sets. If a test sample belongs to a study that is part of the training set, then we apply the same scaling coefficients as from the training study. This is required so that MINT applied on a single study will provide the same results as PLS. If the test study is completely independent, then it is centered and scaled separately.

After scaling the test samples, the prediction framework of PLS is used to estimate the dummy matrix $Y_{test}$ of an independent test set $X_{test}$ [25], where each row in $Y_{test}$ sums to 1, and each column represents a class of the outcome. A class membership is assigned (predicted) to each test sample by using the maximal distance, as described in [18]. It consists in assigning the class with maximal positive value in $Y_{test}$.

The main parameter to tune in MINT is the penalty $\lambda_h$ for each PLS-component $h$; this tuning is usually performed using Cross-Validation (CV).
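As an illustration of Algorithm 1 and of the per-study scaling described above, here is a minimal NumPy sketch of a single MINT component followed by the deflation step. It follows the steps as printed and is not the mixOmics implementation; the function names, the SVD-based starting value for $a_h$, the convergence tolerance and the toy data are all assumptions made for this example. The penalty `lam` is kept below $1/\sqrt{P}$ so that at least one loading survives the soft-thresholding.

```python
import numpy as np

def scale_per_study(Z, study):
    """Center and scale each column to mean 0 and variance 1 within each study."""
    Zs = np.empty(Z.shape, dtype=float)
    for m in np.unique(study):
        rows = study == m
        Zs[rows] = (Z[rows] - Z[rows].mean(axis=0)) / Z[rows].std(axis=0, ddof=1)
    return Zs

def soft_threshold(a, lam):
    """Step 11: sign(a) * (|a| - lam)_+ applied element-wise."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def mint_component(X, Y, study, lam, max_iter=100, tol=1e-6):
    """Steps 3 to 12 of Algorithm 1 for one component (hypothetical helper)."""
    groups = [study == m for m in np.unique(study)]
    # Assumed starting value: leading left singular vector of X'Y (unit norm).
    a = np.linalg.svd(X.T @ Y, full_matrices=False)[0][:, 0]
    b = np.zeros(Y.shape[1])
    for _ in range(max_iter):
        a_old, b_old = a.copy(), b.copy()
        b_sum = sum(Y[g].T @ (X[g] @ a) for g in groups)   # steps 4 and 6
        b = b_sum / np.linalg.norm(b_sum)                  # step 7
        a_sum = sum(X[g].T @ (Y[g] @ b) for g in groups)   # steps 8 and 9
        a = a_sum / np.linalg.norm(a_sum)                  # step 10
        a = soft_threshold(a, lam)                         # step 11
        if np.linalg.norm(a - a_old) < tol and np.linalg.norm(b - b_old) < tol:
            break
    return a, b, X @ a                                     # global component t

def deflate(X, Y, t):
    """Steps 13 and 14: apply P = I - t (t't)^-1 t' without forming P."""
    coef = t / (t @ t)
    return X - np.outer(t, coef @ X), Y - np.outer(t, coef @ Y)

# Toy data (hypothetical): 2 studies, 3 classes, 20 variables, 3 of them informative.
rng = np.random.default_rng(0)
study = np.repeat([0, 1], 15)
labels = np.tile([0, 1, 2], 10)
X_raw = rng.normal(size=(30, 20)) + 5.0 * study[:, None]   # additive study effect
X_raw[:, :3] += 2.0 * labels[:, None]                      # class signal
X1 = scale_per_study(X_raw, study)
Y1 = scale_per_study(np.eye(3)[labels], study)
a1, b1, t1 = mint_component(X1, Y1, study, lam=0.15)
X2, Y2 = deflate(X1, Y1, t1)
```

The variables with non zero entries in `a1` play the role of the selected signature on the first component; a second component would be computed by calling `mint_component` on `X2` and `Y2`.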
In practice, the parameter $\lambda_h$ can equivalently be replaced by the number of variables to select on each component, which is our preferred user-friendly option. The assessment criterion in the CV can be based on the proportion of misclassified samples, proportion of false or true positives, or, as in our case, the balanced error rate (BER). BER is calculated as the averaged proportion of wrongly classified samples in each class and weights up small sample size classes. We consider BER to be a more objective performance measure than the overall misclassification error rate when dealing with unbalanced classes. MINT tuning is computationally efficient as it takes advantage of the group data structure in the integrative study. We used a “Leave-One-Group-Out Cross-Validation (LOGOCV)”, which consists in performing CV where each group or study $m = 1, \ldots, M$ is left out exactly once. LOGOCV realistically reflects the true case scenario where prediction is performed on independent external studies based on a reproducible signature identified on the training set. Finally, the total number of components $H$ in MINT is set to $K - 1$, $K$ = number of classes, similar to PLS-DA and $\ell_1$ penalised PLS-DA models [18].

Case studies

We demonstrate the ability of MINT to identify the true positive genes on the MAQC project, then highlight the strong properties of our method to combine independent data sets in order to identify reproducible and predictive gene signatures on two other biological studies.

The MicroArray Quality Control (MAQC) project. The extensive MAQC project focused on assessing the reproducibility of microarray technologies in a controlled environment [5]. Two reference RNA samples, Universal Human Reference (UHR) and Human Brain Reference (HBR), and two mixtures of the original samples were considered.
Technical replicates were obtained from three different array platforms (Illumina, AffyHuGene and AffyPrime) for each of the four biological samples A (100% UHR), B (100% HBR), C (75% UHR, 25% HBR) and D (25% UHR and 75% HBR). Data were downloaded from Gene Expression Omnibus (GEO), accession GSE56457. In this study, we focused on identifying biomarkers that discriminate A vs. B and C vs. D. The experimental design is referenced in Additional file 1: Table S1.

Stem cells. We integrated 15 transcriptomics microarray datasets to classify three types of human cells: human Fibroblasts (Fib), human Embryonic Stem Cells (hESC) and human induced Pluripotent Stem Cells (hiPSC). As there exists a biological hierarchy among these three cell types, two sub-classification problems are of interest in our analysis, which we will address simultaneously with MINT. On the one hand, differences between pluripotent (hiPSC and hESC) and non-pluripotent cells (Fib) are well-characterised and are expected to contribute to the main biological variation. Our first level of analysis will therefore benchmark MINT against the gold standard in the field. On the other hand, hiPSC are genetically reprogrammed to behave like hESC and both cell types are commonly assumed to be alike. However, differences have been reported in the literature [26–28], justifying the second and more challenging level of classification analysis between hiPSC and hESC. We used the cell type annotations of the 342 samples as provided by the authors of the 15 studies.

The stem cell dataset provides an excellent showcase study to benchmark MINT against existing statistical methods to solve a rather ambitious classification problem.

Each of the 15 studies was assigned to either a training or test set. Platforms uniquely represented were assigned to the training set and studies with only one sample in one class were assigned to the test set.
Remaining studies were randomly assigned to training or test set. Eventually, the training set included eight datasets (210 samples) derived on five commercial platforms and the independent test set included the remaining seven datasets (132 samples) derived on three platforms (Table 1).

Table 1 Stem cells experimental design

| Experiment | Platform | Fib | hESC | hiPSC |
|---|---|---|---|---|
| Bock | Affymetrix HT-HG-U133A | 6 | 20 | 12 |
| Briggs | Illumina HumanHT-12 V4 | 18 | 3 | 30 |
| Chung | Affymetrix HuGene-1.0-ST V1 | 3 | 8 | 10 |
| Ebert | Affymetrix HG-U133 Plus2 | 2 | 5 | 3 |
| Guenther | Affymetrix HG-U133 Plus2 | 2 | 17 | 20 |
| Maherali | Affymetrix HG-U133 Plus2 | 3 | 3 | 15 |
| Marchetto | Affymetrix HuGene-1.0-ST V1 | 6 | 3 | 12 |
| Takahashi | Agilent SurePrint G3 GE 8x60K | 3 | 3 | 3 |
| Total training set | 5 platforms | 43 | 62 | 105 |
| Andrade | Affymetrix HuGene-1.0-ST V1 | 3 | 6 | 15 |
| Hu | Affymetrix HG-U133 Plus2 | 1 | 5 | 12 |
| Kim | Affymetrix HG-U133 Plus2 | 1 | 1 | 3 |
| Loewer | Affymetrix HG-U133 Plus2 | 4 | 2 | 7 |
| Si-Tayeb | Affymetrix HG-U133 Plus2 | 3 | 6 | 6 |
| Vitale | Illumina HumanHT-12 V4 | 8 | 3 | 18 |
| Yu | Affymetrix HG-U133 Plus2 | 2 | 10 | 16 |
| Total test set | 3 platforms | 22 | 33 | 77 |

A total of 15 studies were analysed, including three human cell types, human Fibroblasts (Fib), human Embryonic Stem Cells (hESC) and human induced Pluripotent Stem Cells (hiPSC) across five different types of microarray platforms. Eight studies from five microarray platforms were considered as a training set [57–64] and seven independent studies from three of the five platforms were considered as a test set [65–71]

The pre-processed files were downloaded from the http://www.stemformatics.org collaborative platform [29]. Each dataset was background corrected, log2 transformed, YuGene normalized and mapped from probes ID to Ensembl ID as previously described in [11], resulting in 13,313 unique Ensembl gene identifiers. In the case where datasets contained multiple probes for the same Ensembl ID gene, the highest expressed probe was chosen as the representative of that gene in that dataset.
The choice of YuGene normalisation was motivated by the need to normalise each sample independently rather than as a part of a whole study (e.g. existing methods ComBat [7], quantile normalisation (RMA [30])), to effectively limit over-fitting during the CV evaluation process.

Breast cancer. We combined whole-genome gene-expression data from two cohorts from the Molecular Taxonomy of Breast Cancer International Consortium project (METABRIC, [31]) and from two cohorts from the Cancer Genome Atlas (TCGA, [32]) to classify the intrinsic subtypes Basal, HER2, Luminal A and Luminal B, as defined by the PAM50 signature [20]. The METABRIC cohorts data were made available upon request, and were processed by [31]. TCGA cohorts are gene-expression data from RNA-seq and microarray platforms. RNA-seq data were normalised using RNA-Seq by Expectation Maximisation (RSEM) and percentile-ranked gene-level transcription estimates. The microarray data were processed as described in [32].

The training set consisted of three cohorts (TCGA RNA-seq and both METABRIC microarray studies), including the expression levels of 15,803 genes on 2,814 samples; the test set included the TCGA microarray cohort with 254 samples (Table 2). Two analyses were conducted, which either included or discarded the PAM50 genes from the data. The first analysis aimed at recovering the PAM50 genes used to classify the samples.
The second analysis was performed on 15,755 genes and aimed at identifying an alternative signature to the PAM50.

Table 2 Experimental design of four breast cancer cohorts including 4 cancer subtypes: Basal, HER2, Luminal A (LumA) and Luminal B (LumB)

| Experiment | Platform | Basal | HER2 | LumA | LumB |
|---|---|---|---|---|---|
| METABRIC Discovery | Illumina HT-12 v3 | 118 | 87 | 466 | 268 |
| METABRIC Validation | Illumina HT-12 v3 | 213 | 153 | 255 | 224 |
| TCGA RNA-seq | Illumina HiSeq 2000 | 188 | 80 | 549 | 213 |
| Total training set | 2 platforms | 519 | 320 | 1270 | 705 |
| TCGA microarray | Agilent custom 244K | 57 | 31 | 99 | 67 |
| Total test set | 1 platform | 57 | 31 | 99 | 67 |

Performance comparison with sequential classification approaches

We compared MINT with sequential approaches that combine batch-effect removal approaches with classification methods. As a reference, classification methods were also used on their own on a naive concatenation of all studies. Batch-effect removal methods included Batch Mean-Centering (BMC, [9]), ComBat [7], linear models (LM) or linear mixed models (LMM), and classification methods included PLS-DA, sPLS-DA [18], mgPLS [22, 23] and Random forests (RF [12]). For LM and LMM, linear models were fitted on each gene and the residuals were extracted as a batch-corrected gene expression [33, 34]. The study effect was set as a fixed effect with LM or as a random effect with LMM. No sample outcome (e.g. cell-type) was included.

Prediction with ComBat normalised data were obtained as described in [19]. In this study, we did not include methods that require extra information, such as control genes with RUV-2 [4], or methods that are not widely available to the community, such as LMM-EH [10]. Classification methods were chosen so as to simultaneously discriminate all classes. With the exception of sPLS-DA, none of those methods perform internal variable selection. The multivariate methods PLS-DA, mgPLS and sPLS-DA were run on K − 1 components, sPLS-DA was tuned using 5-fold CV on each component.
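The two simplest batch-removal steps named above can be sketched in a few lines of Python. This is a simplified illustration, not the implementations cited as [9], [33] or [34]: batch mean-centering subtracts the per-study mean of every gene, and the LM variant below uses ordinary least squares with study indicators only and no outcome. In this simplified setting the residuals coincide with batch mean-centering; the cited implementations may differ in detail. The function names and toy data are hypothetical.

```python
import numpy as np

def batch_mean_center(X, study):
    """Subtract, for every gene (column), its mean within each study."""
    Xc = np.array(X, dtype=float)
    for m in np.unique(study):
        rows = study == m
        Xc[rows] -= Xc[rows].mean(axis=0)
    return Xc

def lm_study_residuals(X, study):
    """Residuals of a per-gene linear model with study as the only fixed effect."""
    # One indicator column per study; no intercept and no sample outcome.
    D = (study[:, None] == np.unique(study)[None, :]).astype(float)
    coef, *_ = np.linalg.lstsq(D, X, rcond=None)
    return X - D @ coef

# Toy data (hypothetical): 3 studies with different additive shifts, 5 genes.
rng = np.random.default_rng(0)
study_toy = np.repeat([0, 1, 2], [6, 4, 5])
X_batch = rng.normal(size=(15, 5)) + 3.0 * study_toy[:, None]
X_bmc = batch_mean_center(X_batch, study_toy)
X_lm = lm_study_residuals(X_batch, study_toy)
```

Both corrections use all samples of a study at once, which is why, as discussed in the Background, they need to be recomputed inside each cross-validation fold to avoid over-optimistic accuracy estimates.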
All classification methods were combined with batch-removal methods, with the exception of mgPLS, which already includes a study structure in the model.

MINT and PLS-DA-like approaches use a prediction threshold based on distances (see “Class prediction and parameters tuning with MINT” section) that optimally determines class membership of test samples, and as such do not require receiver operating characteristic (ROC) curves and area under the curve (AUC) performance measures. In addition, those measures are limited to binary classification and therefore do not apply to our stem cell and breast cancer multi-class studies. Instead we use the Balanced classification Error Rate to objectively evaluate the classification and prediction performance of the methods for unbalanced sample size classes (“MINT” section). Classification accuracies for each class were also reported.

Results

Validation of the MINT approach to identify signatures agnostic to batch effect

The MAQC project processed technical replicates of four well-characterised biological samples A, B, C and D across three platforms. Thus, we assumed that genes that are differentially expressed (DEG) in every single platform are true positive. We primarily focused on identifying biomarkers that discriminate C vs. D, and report the results of A vs. B in the Additional file 1: Supplemental Material S3, Figure S3. Differential expression analysis of C vs. D was conducted on each of the three microarray platforms using ANOVA, showing an overlap of 1385 DEG (FDR < $10^{-3}$ [35]), which we considered as true positive. This corresponded to 62.6% of all DEG for Illumina, 30.5% for AffyHuGene and 21.0% for AffyPrime (Additional file 1: Figure S4). We observed that conducting a differential analysis on the concatenated data from the three microarray platforms without accommodating for batch effects resulted in 691 DEG, of which only 56% (387) were true positive genes.
This implies that the remaining 44% (304) of these genes were false positive, and hence were not DE in at least one study. The high percentage of false positive was explained by a Principal Component Analysis (PCA) sample plot that showed samples clustering by platforms (Additional file 1: Figure S4), which confirmed that the major source of variation in the combined data was attributed to platforms rather than cell types.

MINT selected a single gene, BCAS1, to discriminate the two biological classes C and D. BCAS1 was a true positive gene, as part of the common DEG, and was ranked 1 for Illumina, 158 for AffyPrime and 1182 for AffyHuGene. Since the biological samples C and D are very different, the selection of one single gene by MINT was not surprising. To further investigate the performance of MINT, we expanded the number of genes selected by MINT, by decreasing its sparsity parameter (see Methods), and compared the overlap between this larger MINT signature and the true positive genes. We observed an overlap of 100% for a MINT signature of size 100, and an overlap of 89% for a signature of size 1385, which is the number of common DEG identified previously. The high percentage of true positive selected by MINT demonstrates its ability to identify a signature agnostic to batch effect.

Limitations of common meta-analysis and integrative approaches

A meta-analysis of eight stem cell studies, each including three cell types (Table 1, stem cell training set), highlighted a small overlap of DEG lists obtained from the analysis of each separate study (FDR < $10^{-5}$, ANOVA, Additional file 1: Table S2). Indeed, the Takahashi study with only 24 DEG limited the overlap between all eight studies to only 5 DEG. This represents a major limitation of merging pre-analysed gene lists as the concordance between DEG lists decreases when the number of studies increases.

One alternative to meta-analysis is to perform an integrative analysis by concatenating all eight studies.
Similarly to the MAQC analysis, we first observed that the major source of variation in the combined data was attributed to study rather than cell type (Fig. 1a). PLS-DA was applied to discriminate the samples according to their cell types, and it showed a strong study variation (Fig. 1b), despite being a supervised analysis. Compared to unsupervised PCA (Fig. 1a), the study effect was reduced for the fibroblast cells, but was still present for the similar cell types hESC and hiPSC. We reached similar conclusions when analysing the breast cancer data (Additional file 1: Supplemental Material S4, Figure S5).

Fig. 1 Stem cell study. a PCA on the concatenated data: a greater study variation than a cell type variation is observed. b PLSDA on the concatenated data clustered Fibroblasts only. c MINT sample plot shows that each cell type is well clustered, d MINT performance: BER and classification accuracy for each cell type and each study

MINT outperforms state-of-the-art methods

We compared the classification accuracy of MINT to sequential methods where batch removal methods were applied prior to classification methods. In both stem cell and breast cancer studies, MINT led to the best accuracy on the training set and the best reproducibility of the classification model on the test set (lowest Balanced Error Rate, BER, Fig. 2, Additional file 1: Figures S6 and S7). In addition, MINT consistently ranked first as the best performing method, followed by ComBat+sPLSDA with an average rank of 4.5 (Additional file 1: Figure S8).

On the stem cell data, we found that fibroblasts were the easiest to classify for all methods, including those that do not accommodate unwanted variation (PLS-DA, sPLS-DA and RF, Additional file 1: Figure S6). Classifying hiPSC vs.
hESC proved more challenging for all methods, leading to a substantially lower classification accuracy than fibroblasts.

The analysis of the breast cancer data (excluding PAM50 genes) showed that methods that do not accommodate unwanted variation were able to rightly classify most of the samples from the training set, but failed at classifying any of the four subtypes on the external test set. As a consequence, all samples were predicted as LumB with PLS-DA and sPLS-DA, or Basal with RF (Additional file 1: Figure S7). Thus, RF gave a satisfactory performance on the training set (BER = 18.5), but a poor performance on the test set (BER = 75).

Additionally, we observed that the biomarker selection process substantially improved classification accuracy. On the stem cell data, LM+sPLSDA and MINT outperformed their non sparse counterparts LM+PLSDA and mgPLS (Fig. 2, BER of 9.8 and 7.1 vs. 20.8 and 11.9, respectively).

Finally, MINT was largely superior in terms of computational efficiency. The training step on the stem cell data, which includes 210 samples and 13,313 genes, was run in 1 s, compared to 8 s with the second best performing method ComBat+sPLS-DA (2013 MacBook Pro 2.6 GHz, 16 GB memory). The popular method ComBat took 7.1 s to run, and sPLS-DA 0.9 s. The training step on the breast cancer data, which includes 2,814 samples and 15,755 genes, was run in 37 s for MINT and 71.5 s for ComBat (30.8 s) + sPLS-DA (40.6 s).

Study-specific outputs with MINT

One of the main challenges when combining independent studies is to assess the concordance between studies.

Fig. 2 Classification accuracy for both training and test set for the stem cells and breast cancer studies (excluding PAM50 genes).
The classification Balanced Error Rates (BER) are reported for all sixteen methods compared with MINT (in black)

During the integration procedure, MINT provides not only a performance assessment for each individual study, but also insightful study-specific graphical outputs that can serve as a quality control step to detect outlier studies. One particular example is the Takahashi study from the stem cell data, whose poor performance (Fig. 1d) was further confirmed on the study-specific outputs (Additional file 1: Figure S9). Of note, this study was the only one generated through Agilent technology and its sample size only accounted for 4.2% of the training set.

The sample plots from each individual breast cancer data set showed the strong ability of MINT to discriminate the breast cancer subtypes while integrating data sets generated from disparate transcriptomics platforms, microarrays and RNA-sequencing (Fig. 3a–c). Those data sets were all differently pre-processed, and yet MINT was able to model an overall agreement between all studies; MINT successfully built a space based on a handful of genes in which samples from each study are discriminated in a homogeneous manner.

MINT gene signature identified promising biomarkers

MINT is a multivariate approach that builds successive components to discriminate all categories (classes) indicated in an outcome variable. On the stem cell data, MINT selected 2 and 15 genes on the first two components respectively (Additional file 1: Table S3). The first component clearly segregated the non-pluripotent cells (fibroblasts) from the two pluripotent cell types (hiPSC and hESC) (Fig. 1c, d). Those pluripotent cells were subsequently separated on component two, with some expected overlap given the similarities between hiPSC and hESC. The two genes selected by MINT on component 1 were LIN28A and CAR, which were both found relevant in the literature.
Indeed, LIN28A was shown to be highly expressed in ESCs compared to Fibroblasts [36, 37] and CAR has been associated with pluripotency [38]. Finally, despite the high heterogeneity of hiPSC cells included in this study, MINT gave a high accuracy for hESC and hiPSC on independent test sets (93.9% and 77.9% respectively, Additional file 1: Figure S6), suggesting that the 15 genes selected by MINT on component 2 have a high potential to explain the differences between those cell types (Additional file 1: Table S3).

Fig. 3 MINT study-specific sample plots showing the projection of samples from a METABRIC Discovery, b METABRIC Validation and c TCGA-RNA-seq experiments, in the same subspace spanned by the first two MINT components. The same subspace is also used to plot the (d) overall (integrated) data. e Balanced Error Rate and classification accuracy for each study and breast cancer subtype from the MINT analysis

On the breast cancer study, we performed two analyses which either included or discarded the PAM50 genes that were used to define the four cancer subtypes Basal, HER2, Luminal A and Luminal B [20]. In the first analysis, we aimed to assess the ability of MINT to specifically identify the PAM50 key driver genes. MINT successfully recovered 37 of the 48 PAM50 genes present in the data (77%) on the first three components (7, 20 and 10 respectively). The overall signature included 30, 572 and 636 genes on each component (see Additional file 1: Table S4), i.e. 7.8% of the total number of genes in the data. The performance of MINT (BER of 17.8 on the training set and 11.6 on the test set) was superior to that of a PLS-DA on the PAM50 genes only (BER of 20.8 on the training set and a very high 75 on the test set).
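
The BER values quoted here are easier to read with the definition in hand. Assuming the usual definition of the Balanced Error Rate as the average of the per-class error rates (expressed in percent, as in the figures reported in this article), a classifier that assigns every sample to a single class in a four-class problem is wrong on three classes and right on one, which gives a BER of 75 whatever the class sizes. The Python sketch below illustrates this; the function name and the sample labels are hypothetical and are not taken from the study data.

```python
from collections import Counter


def balanced_error_rate(y_true, y_pred):
    """Average of the per-class error rates, in percent."""
    classes = sorted(set(y_true))
    totals = Counter(y_true)
    # Count misclassified samples, grouped by their true class.
    errors = Counter(t for t, p in zip(y_true, y_pred) if t != p)
    per_class = [errors[c] / totals[c] for c in classes]
    return 100.0 * sum(per_class) / len(classes)


# Hypothetical, unbalanced test set with four subtypes.
y_true = ["Basal"] * 10 + ["HER2"] * 5 + ["LumA"] * 40 + ["LumB"] * 25
# A degenerate classifier that predicts LumB for every sample.
y_all_lumb = ["LumB"] * len(y_true)

print(balanced_error_rate(y_true, y_all_lumb))  # 75.0
```

On the same hypothetical predictions, the plain error rate would be 55/80, about 69%, because it is dominated by the larger classes; the balanced measure weights each subtype equally.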
This comparison shows that the genes selected by MINT offer a complementary characterisation to the PAM50 genes.

In the second analysis, we aimed to provide an alternative signature to the PAM50 genes by omitting them from the analysis. MINT identified 11, 272 and 253 genes on the first three components respectively (Additional file 1: Table S5 and Figure S10). The genes selected on the first component gradually differentiated Basal, HER2 and Luminal A/B, while the second component genes further differentiated Luminal A from Luminal B (Fig. 3d). The classification performance was similar in each study (Fig. 3e), highlighting an excellent reproducibility of the biomarker signature across cohorts and platforms.

Among the 11 genes selected by MINT on the first component, GATA3 is a transcription factor that regulates luminal epithelial cell differentiation in the mammary glands [39, 40]; it was found to be implicated in luminal types of breast cancer [41] and was recently investigated for its prognostic significance [42]. The MYB protein plays an essential role in haematopoiesis and has been associated with carcinogenesis [43, 44]. Other genes present in our MINT gene signature include XBP1 [45], AGR3 [46], CCDC170 [47] and TFF3 [48], which were reported as being associated with breast cancer. The remaining genes have not been widely associated with breast cancer. For instance, TBC1D9 has been described as over-expressed in cancer patients [49, 50]. DNALI1 was first identified for its role in breast cancer in [51] but there was no report of further investigation. Although AFF3 was never associated with breast cancer, it was recently proposed to play a pivotal role in adrenocortical carcinoma [52]. It is worth noting that these 11 genes were all included in the 30 genes previously selected when the PAM50 genes were included, and are therefore valuable candidates to complement the PAM50 gene signature as well as to further characterise breast cancer subtypes.

Discussion

There is a growing need in the biological and computational community for tools that can integrate data from different microarray platforms with the aim of classifying samples (integrative classification). Although several efficient methods have been proposed to address the unwanted systematic variation when integrating data [4, 7, 9–11], these are usually applied as a pre-processing step before performing classification. Such a sequential approach may lead to overfitting and over-optimistic results due to the use of transductive modelling (such as prediction based on ComBat-normalised data [19]) and the use of a test set that is normalised or pre-processed with the training set. To address this crucial issue, we proposed a new Multivariate INTegrative method, MINT, that simultaneously corrects for batch effects, classifies samples and selects the most discriminant biomarkers across studies.

MINT seeks to identify a common projection space for all studies that is defined on a small subset of discriminative variables and that displays an analogous discrimination of the samples across studies. Therefore, MINT provides sample plots and classification performance specific to each study (Fig. 3). Among the compared methods, MINT was found to be the fastest and most accurate method to integrate and classify data from different microarray and RNA-seq platforms.

Integrative approaches such as MINT are essential when combining multiple studies of complex data to limit spurious conclusions from any downstream analysis. Current methods showed a high proportion of false positives (44% on MAQC data) and exhibited very poor prediction accuracy (PLS-DA, sPLS-DA and RF, Fig. 2). For instance, RF was ranked second only to MINT on the breast cancer learning set, but it was ranked as the worst method on the test set.
This reflects the absence of controlling for batch effects in these methods and supports the argument that assessing the presence of batch effects is a key preliminary step. Failure to do so, as shown in our study, can result in poor reproducibility of results in subsequent studies, and this would not be detected without an independent test set.

We assessed the ability of MINT to identify relevant gene signatures that are reproducible and platform-agnostic. MINT successfully integrated data from the MAQC project by selecting true positive genes that were also differentially expressed in each experiment. We also assessed MINT's capabilities in analysing stem cell and breast cancer data. In these studies, MINT displayed the highest classification accuracy in the training sets and the highest prediction accuracy in the testing sets, when compared to sixteen sequential procedures (Fig. 2). These results suggest that, in addition to being highly predictive, the discriminant variables identified by MINT are also of strong biological relevance.

In the stem cell data, MINT identified 2 genes, LIN28A and CAR, to discriminate non-pluripotent cells (fibroblasts) from pluripotent cells (hiPSC and hESC). Pluripotency is well-documented in the literature and OCT4 is currently the main known marker for undifferentiated cells [53–56]. However, MINT did not select OCT4 on the first component but instead identified two markers, LIN28A and CAR, that were ranked higher than OCT4 in the DEG list obtained on the concatenated data (see Additional file 1: Figures S11 and S12). While the results from MINT still supported OCT4 as a marker of pluripotency, our analysis suggests that LIN28A and CAR are stronger reproducible markers for distinguishing differentiated from undifferentiated cells, and could therefore be superior as substitutions or complements to OCT4.
Experimental validation would be required to further assess the potential of LIN28A or CAR as efficient markers.

Several important issues require consideration when dealing with the general task of integrating data. First and foremost, the classification of samples into outcome categories is crucial and needs to be well defined. This had to be addressed in our analyses of the stem cell and breast cancer studies, which were generated by multiple research groups on different microarray and RNA-seq platforms. For instance, the breast cancer subtype classification relied on the PAM50 intrinsic classifier proposed by [20], which we admit is still controversial in the literature [31]. Similarly, the biological definition of hiPSC differs across research groups [26, 28], which results in poor reproducibility among experiments and makes the integration of stem cell studies challenging [21]. The expertise and exhaustive screening required to homogeneously annotate samples hinders data integration, and because it is a process upstream to the statistical analysis, data integration approaches, including MINT, cannot address it.

A second issue in the general process of integrating datasets from different sources is data access and normalisation. As raw data are often not available, this results in integration of data sets that have each been normalised differently, as was the case with the breast cancer data in our study. Despite this limitation, MINT produced satisfactory results in that study. We were also able to overcome this issue in the stem cell data by using the Stemformatics resource [29], where we had direct access to homogeneously pre-processed data (background correction, log2- and YuGene-transformed [11]). In general, variation in the normalisation processes of different data sets produces unwanted variation between studies and we recommend this should be avoided if possible.
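
To make the effect of study-specific normalisation concrete, the following Python sketch centres and scales the values of one gene separately within each study, which is a simple location-and-scale adjustment. It is a minimal illustration under stated assumptions and not the MINT implementation: the function name, the study labels and the expression values are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def scale_within_study(values, studies):
    """Centre and scale one gene's values separately within each study.

    values  : expression values, one per sample
    studies : study label of each sample, in the same order as `values`
    """
    grouped = defaultdict(list)
    for value, study in zip(values, studies):
        grouped[study].append(value)

    # Per-study mean and sample standard deviation
    # (stdev needs at least two samples per study).
    stats = {s: (mean(v), stdev(v)) for s, v in grouped.items()}

    scaled = []
    for value, study in zip(values, studies):
        centre, spread = stats[study]
        # A gene that is constant within a study carries no signal there.
        scaled.append((value - centre) / spread if spread > 0 else 0.0)
    return scaled


# Hypothetical gene measured in two studies that were normalised differently.
values = [2.0, 4.0, 6.0, 10.0, 30.0, 50.0]
studies = ["A", "A", "A", "B", "B", "B"]
print(scale_within_study(values, studies))
# [-1.0, 0.0, 1.0, -1.0, 0.0, 1.0]
```

After this adjustment the two hypothetical studies sit on a common scale although their raw values differed by an order of magnitude. Note that any systematic offset between studies is removed in the process, whether its origin is technical or biological.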
A final important issue in data integration involves accounting for both between-study differences and platform effects. When samples cluster by study and the studies cluster by platform, the experimental platform, and not the study, is the biggest source of variation (e.g. 75% of the variance in the breast cancer data, Additional file 1: Figure S5). Indeed, there are inherent differences between commercial platforms that greatly magnify unwanted variability, as was discussed by [5] on the MAQC project. As platform information and study effects are nested, MINT and other data integration methods dismiss the platform information and focus on the study effect only. Indeed, each study is considered as included in a single platform. MINT successfully integrated microarray and RNA-seq data, which suggests that such an approach will likely be sufficient in most scenarios.

When applying MINT, additional considerations need to be taken into account. In order to reduce unwanted systematic variation, the method centers and scales each study as an initial step, similarly to BMC [9]. Therefore, only studies with a sample size > 3 can be included, either in a training or test set. In addition, all outcome categories need to be represented in each study. Indeed, neither MINT nor any classification method can perform satisfactorily in the extreme case where each study only contains a specific outcome category, as the outcome and the study effect cannot be distinguished in this specific case.

Conclusion

We introduced MINT, a novel Multivariate INTegrative method, that is the first approach to integrate independent transcriptomics studies from different microarray and RNA-seq platforms by simultaneously correcting for batch effects, classifying samples and identifying key discriminant variables. We first validated the ability of MINT to select true positive genes when integrating the MAQC data across different platforms.
Then, MINT was compared to sixteen sequential approaches and was shown to be the fastest and most accurate method to discriminate and predict three human cell types (human Fibroblasts, human Embryonic Stem Cells and human induced Pluripotent Stem Cells) and four subtypes of breast cancer (Basal, HER2, Luminal A and Luminal B). The gene signatures identified by MINT contained existing and novel biomarkers that were strong candidates for improved characterisation of the phenotype of interest. In conclusion, MINT enables reliable integration and analysis of independent genomic data sets, outperforms existing available sequential methods, and identifies reproducible genetic predictors across data sets. MINT is available through the mixMINT module in the mixOmics R-package.

Additional file

Additional file 1: Supplementary material. This pdf document contains supplementary methods and all supplementary Figures and Tables. Specifically, it provides the PLS-algorithm, the extension of MINT in a regression framework, the application to the MAQC data (A vs B), the meta-analysis of the breast cancer data, the classification accuracy of the tested methods on the stem cells and breast cancer data, and details on the signature genes identified by MINT on the stem cells and breast cancer data.
(PDF 4403 kb)

Abbreviations

BER: Balanced error rate; DEG: Differentially expressed gene; FDR: False discovery rate; Fib: Fibroblast; hESC: Human embryonic stem cells; hiPSC: Human induced pluripotent stem cells; LM: Linear model; LMM: Linear mixed model; MAQC: MicroArray quality control; MINT: Multivariate integration method; sPLS-DA: sparse partial least square discriminant analysis; RF: Random forest

Acknowledgments

The authors would like to thank Marie-Joe Brion, University of Queensland Diamantina Institute, for her careful proof-reading and suggestions.

Funding

This project was partly funded by the ARC Discovery grant project DP130100777 and the Australian Cancer Research Foundation for the Diamantina Individualised Oncology Care Centre at the University of Queensland Diamantina Institute (FR), and the National Health and Medical Research Council (NHMRC) Career Development fellowship APP1087415 (KALC). The funding bodies did not play a role in the design of the study and collection, analysis, and interpretation of data.

Availability of data and materials

The MicroArray Quality Control (MAQC) project data are available from the Gene Expression Omnibus (GEO) - GSE56457. The stem cell raw data are available from GEO and the pre-processed data are available from the Stemformatics platform (http://www.stemformatics.org). The breast cancer data were obtained from the Molecular Taxonomy of Breast Cancer International Consortium project (METABRIC, [31], upon request) and from the Cancer Genome Atlas (TCGA, [32]). The MINT R scripts and functions are publicly available in the mixOmics R package (https://cran.r-project.org/package=mixOmics), with tutorials on http://www.mixOmics.org/mixMINT.

Authors’ contributions

FR developed and implemented the MINT method and analysed the stem cell and breast cancer data, NM analysed the MAQC data, and KALC supervised all statistical analyses. AE and SB contributed to the early stage of the project to set up the analysis plan.
The manuscript was primarily written by FR with editorial advice from AE, NM, SB and KALC. All authors read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Consent for publication

Not applicable.

Ethics approval and consent to participate

Not applicable.

Author details

1 The University of Queensland Diamantina Institute, The University of Queensland, Translational Research Institute, 4102 Brisbane QLD, Australia. 2 Centre for Heart Lung Innovation, University of British Columbia, Vancouver BC V6Z 1Y6, Canada. 3 French agency for food, environmental and occupational health safety (Anses), Department of Epidemiology, 22440 Ploufragan, France.

Received: 23 September 2016 Accepted: 16 February 2017

References

1. Pihur V, Datta S, Datta S. Finding common genes in multiple cancer types through meta-analysis of microarray experiments: A rank aggregation approach. Genomics. 2008;92(6):400–3.
2. Kim S, Lin C-W, Tseng GC. Metaktsp: a meta-analytic top scoring pair method for robust cross-study validation of omics prediction analysis. Bioinformatics. 2016;32:1966–173.
3. Lazar C, Meganck S, Taminau J, Steenhoff D, Coletta A, Molter C, Weiss-Solís DY, Duque R, Bersini H, Nowé A. Batch effect removal methods for microarray gene expression data integration: a survey. Brief Bioinform. 2012;14(4):469–90.
4. Gagnon-Bartsch JA, Speed TP. Using control genes to correct for unwanted variation in microarray data. Biostatistics. 2012;13(3):539–52.
5. Shi L, Reid LH, Jones WD, Shippy R, Warrington JA, Baker SC, Collins PJ, De Longueville F, Kawasaki ES, Lee KY, et al. The microarray quality control (maqc) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat Biotechnol. 2006;24(9):1151–61.
6. Su Z, Labaj P, Li S, Thierry-Mieg J, et al.
A comprehensive assessment of rna-seq accuracy, reproducibility and information content by the sequencing quality control consortium. Nat Biotechnol. 2014;32(9):903–14.
7. Johnson W, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007;8(1):118–27.
8. Hornung R, Boulesteix AL, Causeur D. Combining location-and-scale batch effect adjustment with data cleaning by latent factor adjustment. BMC Bioinforma. 2016;17(1):1.
9. Sims AH, Smethurst GJ, Hey Y, Okoniewski MJ, Pepper SD, Howell A, Miller CJ, Clarke RB. The removal of multiplicative, systematic bias allows integration of breast cancer gene expression datasets–improving meta-analysis and prediction of prognosis. BMC Med Genomics. 2008;1(1):42.
10. Listgarten J, Kadie C, Schadt EE, Heckerman D. Correction for hidden confounders in the genetic analysis of gene expression. Proc Natl Acad Sci USA. 2010;107(38):16465–70.
11. Lê Cao KA, Rohart F, McHugh L, Korn O, Wells CA. YuGene: A simple approach to scale gene expression data derived from different platforms for integrated analyses. Genomics. 2014;103:239–51.
12. Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.
13. Dudoit S, Fridlyand J, Speed TP. Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc. 2002;97(457):77–87.
14. Guyon I, Weston J, Barnhill S, Vapnik V. Gene selection for cancer classification using support vector machines. Mach Learn. 2002;46(1-3):389–422.
15. Díaz-Uriarte R, De Andres SA. Gene selection and classification of microarray data using random forest. BMC Bioinforma. 2006;7(1):1.
16. Sowa JP, Atmaca Ö, Kahraman A, Schlattjan M, Lindner M, Sydor S, Scherbaum N, Lackner K, Gerken G, Heider D, et al. Non-invasive separation of alcoholic and non-alcoholic liver disease with predictive modeling. PloS ONE. 2014;9(7):101444.
17. Barker M, Rayens W. Partial least squares for discrimination. J Chemom. 2003;17(3):166–73.
18.
Lê Cao KA, Boitard S, Besse P. Sparse PLS discriminant analysis: biologically relevant feature selection and graphical displays for multiclass problems. BMC Bioinforma. 2011;12:253.
19. Hughey JJ, Butte AJ. Robust meta-analysis of gene expression using the elastic net. Nucleic Acids Res. 2015;43(12):79.
20. Parker JS, Mullins M, Cheang MC, Leung S, Voduc D, Vickery T, Davies S, Fauron C, He X, Hu Z, et al. Supervised risk predictor of breast cancer based on intrinsic subtypes. J Clin Oncol. 2009;27(8):1160–7.
21. Rohart F, Mason EA, Matigian N, Mosbergen R, Korn O, Chen T, Butcher S, Patel J, Atkinson K, Khosrotehrani K, Fisk NM, Lê Cao K, Wells CA. A molecular classification of human mesenchymal stromal cells. PeerJ. 2016;4:1845.
22. Eslami A, Qannari EM, Kohler A, Bougeard S. Multi-group PLS regression: application to epidemiology. In: New Perspectives in Partial Least Squares and Related Methods. New York: Springer; 2013. p. 243–55.
23. Eslami A, Qannari EM, Kohler A, Bougeard S. Algorithms for multi-group PLS. J Chemometrics. 2014;28(3):192–201.
24. Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Ser B Stat Methodol. 1996;58(1):267–88.
25. Tenenhaus M. La Régression PLS: Théorie et Pratique. Paris: Editions Technip; 1998.
26. Bilic J, Belmonte JCI. Concise review: Induced pluripotent stem cells versus embryonic stem cells: close enough or yet too far apart? Stem Cells. 2012;30(1):33–41.
27. Chin MH, Mason MJ, Xie W, Volinia S, Singer M, Peterson C, Ambartsumyan G, Aimiuwu O, Richter L, Zhang J, et al. Induced pluripotent stem cells and embryonic stem cells are distinguished by gene expression signatures. Cell Stem Cell. 2009;5(1):111–23.
28. Newman AM, Cooper JB. Lab-specific gene expression signatures in pluripotent stem cells. Cell Stem Cell. 2010;7(2):258–62.
29. Wells CA, Mosbergen R, Korn O, Choi J, Seidenman N, Matigian NA, Vitale AM, Shepherd J. Stemformatics: visualisation and sharing of stem cell gene expression. Stem Cell Res. 2013;10(3):387–95.
30.
Bolstad BM, Irizarry RA, Åstrand M, Speed TP. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003;19(2):185–93.
31. Curtis C, Shah SP, Chin SF, Turashvili G, Rueda OM, Dunning MJ, Speed D, Lynch AG, Samarajiwa S, Yuan Y, et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature. 2012;486(7403):346–52.
32. Cancer Genome Atlas Network and others. Comprehensive molecular portraits of human breast tumours. Nature. 2012;490(7418):61–70.
33. Whitcomb BW, Perkins NJ, Albert PS, Schisterman EF. Treatment of batch in the detection, calibration, and quantification of immunoassays in large-scale epidemiologic studies. Epidemiology (Cambridge). 2010;21(Suppl 4):44.
34. Rohart F, San Cristobal M, Laurent B. Selection of fixed effects in high dimensional linear mixed models using a multicycle ecm algorithm. Comput Stat Data Anal. 2014;80:209–22.
35. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B Stat Methodol. 1995;57(1):289–300.
36. Yu J, Vodyanik MA, Smuga-Otto K, Antosiewicz-Bourget J, Frane JL, Tian S, Nie J, Jonsdottir GA, Ruotti V, Stewart R, et al. Induced pluripotent stem cell lines derived from human somatic cells. Science. 2007;318(5858):1917–20.
37. Tsialikas J, Romer-Seibert J. LIN28: roles and regulation in development and beyond. Development. 2015;142(14):2397–404.
38. Krivega M, Geens M, Van de Velde H. CAR expression in human embryos and hESC illustrates its role in pluripotency and tight junctions. Reproduction. 2014;148(5):531–44.
39. Kouros-Mehr H, Slorach EM, Sternlicht MD, Werb Z. Gata-3 maintains the differentiation of the luminal cell fate in the mammary gland. Cell. 2006;127(5):1041–55.
40. Asselin-Labat ML, Sutherland KD, Barker H, Thomas R, Shackleton M, Forrest NC, Hartley L, Robb L, Grosveld FG, van der Wees J, et al.
Gata-3 is an essential regulator of mammary-gland morphogenesis and luminal-cell differentiation. Nat Cell Biol. 2007;9(2):201–9.
41. Jiang YZ, Yu KD, Zuo WJ, Peng WT, Shao ZM. Gata3 mutations define a unique subtype of luminal-like breast cancer with improved survival. Cancer. 2014;120(9):1329–37.
42. McCleskey BC, Penedo TL, Zhang K, Hameed O, Siegal GP, Wei S. Gata3 expression in advanced breast cancer: prognostic value and organ-specific relapse. Am J Clin Path. 2015;144(5):756–63.
43. Vargova K, Curik N, Burda P, Basova P, Kulvait V, Pospisil V, Savvulidi F, Kokavec J, Necas E, Berkova A, et al. Myb transcriptionally regulates the mir-155 host gene in chronic lymphocytic leukemia. Blood. 2011;117(14):3816–825.
44. Khan FH, Pandian V, Ramraj S, Aravindan S, Herman TS, Aravindan N. Reorganization of metastamirs in the evolution of metastatic aggressive neuroblastoma cells. BMC Genomics. 2015;16(1):1.
45. Chen X, Iliopoulos D, Zhang Q, Tang Q, Greenblatt MB, Hatziapostolou M, Lim E, Tam WL, Ni M, Chen Y, et al. Xbp1 promotes triple-negative breast cancer by controlling the HIF1α pathway. Nature. 2014;508(7494):103–7.
46. Garczyk S, von Stillfried S, Antonopoulos W, Hartmann A, Schrauder MG, Fasching PA, Anzeneder T, Tannapfel A, Ergönenc Y, Knüchel R, et al. Agr3 in breast cancer: Prognostic impact and suitable serum-based biomarker for early cancer detection. PloS ONE. 2015;10(4):0122106.
47. Yamamoto-Ibusuki M, Yamamoto Y, Fujiwara S, Sueta A, Yamamoto S, Hayashi M, Tomiguchi M, Takeshita T, Iwase H. C6orf97-esr1 breast cancer susceptibility locus: influence on progression and survival in breast cancer patients. Eur J Human Genet. 2015;23(7):949–56.
48. May FE, Westley BR. Tff3 is a valuable predictive biomarker of endocrine response in metastatic breast cancer. Endocr Relat Cancer. 2015;22(3):465–79.
49. Andres SA, Brock GN, Wittliff JL.
Interrogating differences in expression of targeted gene sets to predict breast cancer outcome. BMC Cancer. 2013;13(1):1.
50. Andres SA, Smolenkova IA, Wittliff JL. Gender-associated expression of tumor markers and a small gene set in breast carcinoma. Breast. 2014;23(3):226–33.
51. Parris TZ, Danielsson A, Nemes S, Kovács A, Delle U, Fallenius G, Möllerström E, Karlsson P, Helou K. Clinical implications of gene dosage and gene expression patterns in diploid breast carcinoma. Clin Cancer Res. 2010;16(15):3860–874.
52. Lefevre L, Omeiri H, Drougat L, Hantel C, Giraud M, Val P, Rodriguez S, Perlemoine K, Blugeon C, Beuschlein F, et al. Combined transcriptome studies identify aff3 as a mediator of the oncogenic effects of β-catenin in adrenocortical carcinoma. Oncogenesis. 2015;4(7):161.
53. Rosner MH, Vigano MA, Ozato K, Timmons PM, Poirie F, Rigby PW, Staudt LM. A POU-domain transcription factor in early stem cells and germ cells of the mammalian embryo. Nature. 1990;345(6277):686–92.
54. Schöler HR, Ruppert S, Suzuki N, Chowdhury K, Gruss P. New type of POU domain in germ line-specific protein Oct-4. Nature. 1990;344(6265):435–9.
55. Niwa H, Miyazaki J-i, Smith AG. Quantitative expression of Oct-3/4 defines differentiation, dedifferentiation or self-renewal of ES cells. Nat Genet. 2000;24(4):372–6.
56. Matin MM, Walsh JR, Gokhale PJ, Draper JS, Bahrami AR, Morton I, Moore HD, Andrews PW. Specific knockdown of Oct4 and β2-microglobulin expression by RNA interference in human embryonic stem cells and embryonic carcinoma cells. Stem Cells. 2004;22(5):659–68.
57. Bock C, Kiskinis E, Verstappen G, Gu H, Boulting G, Smith ZD, Ziller M, Croft GF, Amoroso MW, Oakley DH, et al. Reference Maps of human ES and iPS cell variation enable high-throughput characterization of pluripotent cell lines. Cell. 2011;144(3):439–52.
58. Briggs JA, Sun J, Shepherd J, Ovchinnikov DA, Chung TL, Nayler SP, Kao LP, Morrow CA, Thakar NY, Soo SY, et al.
Integration-free induced pluripotent stem cells model genetic and neural developmental features of down syndrome etiology. Stem Cells. 2013;31(3):467–78.
59. Chung HC, Lin RC, Logan GJ, Alexander IE, Sachdev PS, Sidhu KS. Human induced pluripotent stem cells derived under feeder-free conditions display unique cell cycle and DNA replication gene profiles. Stem Cells Dev. 2011;21(2):206–16.
60. Ebert AD, Yu J, Rose FF, Mattis VB, Lorson CL, Thomson JA, Svendsen CN. Induced pluripotent stem cells from a spinal muscular atrophy patient. Nature. 2009;457(7227):277–80.
61. Guenther MG, Frampton GM, Soldner F, Hockemeyer D, Mitalipova M, Jaenisch R, Young RA. Chromatin structure and gene expression programs of human embryonic and induced pluripotent stem cells. Cell Stem Cell. 2010;7(2):249–57.
62. Maherali N, Ahfeldt T, Rigamonti A, Utikal J, Cowan C, Hochedlinger K. A high-efficiency system for the generation and study of human induced pluripotent stem cells. Cell Stem Cell. 2008;3(3):340–5.
63. Marchetto MC, Carromeu C, Acab A, Yu D, Yeo GW, Mu Y, Chen G, Gage FH, Muotri AR. A model for neural development and treatment of Rett syndrome using human induced pluripotent stem cells. Cell. 2010;143(4):527–39.
64. Takahashi K, Tanabe K, Ohnuki M, Narita M, Sasaki A, Yamamoto M, Nakamura M, Sutou K, Osafune K, Yamanaka S. Induction of pluripotency in human somatic cells via a transient state resembling primitive streak-like mesendoderm. Nat Commun. 2014;5:3678.
65. Andrade LN, Nathanson JL, Yeo GW, Menck CFM, Muotri AR. Evidence for premature aging due to oxidative stress in iPSCs from Cockayne syndrome. Hum Mol Genet. 2012;21(17):3825–4.
66. Hu K, Yu J, Suknuntha K, Tian S, Montgomery K, Choi KD, Stewart R, Thomson JA, Slukvin II. Efficient generation of transgene-free induced pluripotent stem cells from normal and neoplastic bone marrow and cord blood mononuclear cells. Blood. 2011;117(14):109–19.
67. Kim D, Kim CH, Moon JI, Chung YG, Chang MY, Han BS, Ko S, Yang E, Cha KY, Lanza R, et al.
Generation of human induced pluripotent stem cells by direct delivery of reprogramming proteins. Cell Stem Cell. 2009;4(6):472.
68. Loewer S, Cabili MN, Guttman M, Loh YH, Thomas K, Park IH, Garber M, Curran M, Onder T, Agarwal S, et al. Large intergenic non-coding RNA-RoR modulates reprogramming of human induced pluripotent stem cells. Nat Genet. 2010;42(12):1113–7.
69. Si-Tayeb K, Noto FK, Nagaoka M, Li J, Battle MA, Duris C, North PE, Dalton S, Duncan SA. Highly efficient generation of human hepatocyte-like cells from induced pluripotent stem cells. Hepatology. 2010;51(1):297–305.
70. Vitale AM, Matigian NA, Ravishankar S, Bellette B, Wood SA, Wolvetang EJ, Mackay-Sim A. Variability in the generation of induced pluripotent stem cells: importance for disease modeling. Stem Cells Transl Med. 2012;1(9):641–50.
71. Yu J, Hu K, Smuga-Otto K, Tian S, Stewart R, Slukvin II, Thomson JA. Human induced pluripotent stem cells free of vector and transgene sequences. Science. 2009;324(5928):797–801.

