UBC Faculty Research and Publications

Enumerateblood – an R package to estimate the cellular composition of whole blood from Affymetrix Gene… Shannon, Casey P; Balshaw, Robert; Chen, Virginia; Hollander, Zsuzsanna; Toma, Mustafa; McManus, Bruce M; FitzGerald, J. M; Sin, Don D; Ng, Raymond T; Tebbutt, Scott J Jan 6, 2017

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata


52383-12864_2016_Article_3460.pdf [ 1.43MB ]
JSON: 52383-1.0361964.json
JSON-LD: 52383-1.0361964-ld.json
RDF/XML (Pretty): 52383-1.0361964-rdf.xml
RDF/JSON: 52383-1.0361964-rdf.json
Turtle: 52383-1.0361964-turtle.txt
N-Triples: 52383-1.0361964-rdf-ntriples.txt
Original Record: 52383-1.0361964-source.json
Full Text

Full Text

METHODOLOGY ARTICLE Open AccessEnumerateblood – an R package toestimate the cellular composition of wholeblood from Affymetrix Gene ST geneexpression profilesCasey P. Shannon1,7*, Robert Balshaw1,2, Virginia Chen1,7, Zsuzsanna Hollander1,7, Mustafa Toma3,Bruce M. McManus1,4,7,8, J. Mark FitzGerald6,8, Don D. Sin6,7,8, Raymond T. Ng1,5,7,8 and Scott J. Tebbutt1,6,7,8AbstractBackground: Measuring genome-wide changes in transcript abundance in circulating peripheral whole blood is auseful way to study disease pathobiology and may help elucidate the molecular mechanisms of disease, ordiscovery of useful disease biomarkers. The sensitivity and interpretability of analyses carried out in this complextissue, however, are significantly affected by its dynamic cellular heterogeneity. It is therefore desirable to quantifythis heterogeneity, either to account for it or to better model interactions that may be present between theabundance of certain transcripts, specific cell types and the indication under study. Accurate enumeration of themany component cell types that make up peripheral whole blood can further complicate the sample collectionprocess, however, and result in additional costs. Many approaches have been developed to infer the composition of asample from high-dimensional transcriptomic and, more recently, epigenetic data. These approaches rely on theavailability of isolated expression profiles for the cell types to be enumerated. These profiles are platform-specific,suitable datasets are rare, and generating them is expensive. No such dataset exists on the Affymetrix GeneST platform.Results: We present ‘Enumerateblood’, a freely-available and open source R package that exposes a multi-responseGaussian model capable of accurately predicting the composition of peripheral whole blood samples from AffymetrixGene ST expression profiles, outperforming other current methods when applied to Gene ST data.Conclusions: ‘Enumerateblood’ significantly improves our ability to study disease pathobiology from whole bloodgene expression assayed on the popular Affymetrix Gene ST platform by allowing a more complete study of thevarious components of this complex tissue without the need for additional data collection. Future use of the modelmay allow for novel insights to be generated from the ~400 Affymetrix Gene ST blood gene expression datasetscurrently available on the Gene Expression Omnibus (GEO) website.* Correspondence: casey.shannon@hli.ubc.ca1PROOF Centre of Excellence, Vancouver, BC, Canada7Centre for Heart Lung Innovation, University of British Columbia, Vancouver,BC, CanadaFull list of author information is available at the end of the article© The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, andreproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link tothe Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.Shannon et al. BMC Genomics  (2017) 18:43 DOI 10.1186/s12864-016-3460-1Key Points We introduce a model that accurately predicts thecellular composition of blood from Affymetrix GeneST gene expression profiles. This model outperforms existing methods whenapplied to Affymetrix Gene ST expression profilesfrom whole blood. The model is available on GitHub: https://github.com/cashoes/enumeratebloodBackgroundMeasuring genome-wide changes in transcript abun-dance in circulating peripheral whole blood cells is auseful way to study disease pathobiology [1]. By provid-ing a relatively comprehensive survey of the status of theimmune system, peripheral whole blood transcript abun-dances may help elucidate molecular mechanisms [2].The sensitivity and interpretability of analyses carriedout in this tissue, however, are significantly affected byits dynamic heterogeneity [3]. It is therefore desirable toquantify this heterogeneity, either to account for it or tomodel interactions that may be present between theabundance of certain transcripts, some cell types, andsome phenotypic indication.Accurate enumeration of the many component celltypes that make up peripheral whole blood can be costly,however, and further complicates the sample collectionprocess. Furthermore, the majority of publicly availableperipheral whole blood-derived gene expression profileson the Gene Expression Omnibus [4] do not include anycomposition information. Accurate quantification of thecellular composition of blood samples from gene expres-sion data without performing additional experiments isuseful, allowing for re-analysis of existing public data,for example.Many approaches have been developed to infer the cel-lular composition of a sample from high-dimensionaltranscriptomic [3, 5–9] and, more recently, DNA methyla-tion data [10, 11]. Briefly, if X,W, and H are matrices withentries Xij (observed expression for sample i, gene j), wik(for sample i, proportion of cell type k), and hkj (cell type-specific contribution to the observed expression for celltype k, gene j), then the problem can be stated: having ob-served X, we wish to estimate W, based on the assumedrelationship between expression and composition:Xij ¼XKk¼1wikhkj þ eijwhere eij represents the expression information forsample i, gene j that is not predictable by the cellcomposition.We further assume that, for each component cell typek, there exists a subset of features Xkij’ in X whose ob-served expression in sample i is proportional to the rela-tive abundance of cell type k in sample i. More formally:Xkij′∝wkiThese composition-discriminating features are termedmarker genes. For such genes, the elements of the H canbe derived from e.g. omics profiles obtained from cellsisolated from the tissue to be deconvolved (we refer tocollections of such profiles as “reference datasets”), andW estimated by regression [5–11]. In this treatment, His referred to as the basis matrix for deconvolution. Wehave previously used this approach to study cell-specificdifferential expression in the context of acute kidneyallograft rejection, using GSE28490 as an AffymetrixU133 plus 2.0 “reference dataset” and identifying thebasis matrix genes by using elastic net to derive a min-imal multinomial classification model for the profiledcell types [12]. Importantly, mapping such marker genesacross technology platforms is not always tractable. Notall genes can be readily mapped across gene expressionplatforms and the values derived from reference datasetsmay be specific to the platform on which the gene ex-pression was measured. This limits application of thesetechniques to platforms on which suitable referencedatasets exist. Unfortunately, generating such datasets iscostly and replicating suitable existing studies on newplatforms of limited scientific (and funding) interest.Reference datasets are correspondingly rare.More recently, so-called reference-free approacheshave been proposed to address this issue [13–15]. Theseapproaches still require the identification of suitablemarker genes for the cell types to be quantified, how-ever, and this selection is of paramount importance toachieve optimal performance. The general strategy formarker selection is to identify genes whose expression inone cell type differs from that observed in all other celltypes being considered [13], a process that itself relieson reference datasets. In fact, all approaches discussedthus far leverage one of a handful of publicly availablereference datasets to derive a basis matrix or identifysuitable marker genes [11, 16, 17]. No such suitable ref-erence dataset exists on the newer Affymetrix Gene STplatform.Here we apply a multi-task learning algorithm toconstruct a statistical model capable of predicting thecomposition of peripheral whole blood samples fromAffymetrix Gene ST expression profiles. We demon-strate its performance on both Gene 1.0 ST (GEO plat-form GPL6244) and Gene 1.1 ST (GPL11532) datasets,which represent the bulk of the Gene ST arrays on GEO.Gene ST data summarized using custom CDF files (e.g.Shannon et al. BMC Genomics  (2017) 18:43 Page 2 of 11GPL16977, GPL15648 or GPL19370), or summarized toexon, rather than transcript, level (GPL10739), could beprocessed from the raw CEL files to a suitable format,though we did not evaluate performance in this case.We also show that the genes that make up this modelcan directly serve as marker genes, suggesting that itmay be possible to identify marker gene sets for newtechnology platforms, or cell populations, using mixturegene expression profiles with corresponding cell com-position information rather than using the more conven-tional reference dataset strategy.The strategy we described in the current work couldbe readily applied to other tissues and/or platforms,which would allow for the development of tools toaccurately segment and quantify a variety of admixedtissues from their gene expression profiles, to accountfor cellular heterogeneity across indications or model in-teractions between gene expression, some cell types andthe indication under study. The described model isfreely-available and open source, outperforms othercurrent methods when applied to Gene ST data, andcould significantly improve our ability to study diseasepathobiology in blood by allowing a more completestudy of the various components of the immune com-partment of blood from whole blood gene expression.MethodsPatient cohorts usedWe used previously unpublished gene expression pro-files from two large clinical cohorts to train and validatethe new statistical model. The Rapid Transition Program(RTP) included prospectively enrolled patients withchronic obstructive pulmonary disease (COPD), present-ing either to St. Paul’s Hospital or Vancouver GeneralHospital (Vancouver, Canada). Subjects presenting to theemergency department or those visiting the COPD clinicwere approached for consent to participate in the study.Matched, peripheral blood derived, genome-wide tran-script abundance and DNA methylation profiles wereavailable for 172 blood samples from this cohort. TheDNA methylation profiles were used to obtain estimatesof the cellular composition of the blood samples, whilethe gene expression profiles were used to train a modelto predict these inferred cell proportions and estimateits performance using cross-validation. Complete bloodcounts, including leukocyte differentials (CBC/Diffs)were available for all blood samples and used as an inde-pendent measure of blood composition (excludinglymphocyte subtypes).The chronic heart failure (HF) program (CHFP) includedprospectively enrolled HF patients presenting to St. Paul’sHospital or Vancouver General Hospital (Vancouver,Canada). Subjects were approached during their visit tothe heart function, pre-transplant, or maintenance clinics,and those who consented were enrolled in the study. Ablood sample was collected at the time of enrollment.Genome-wide transcript abundance profiles and completeblood count, including leukocyte differential (CBC/Diffs)were available for 197 HF patients. This data was used toindependently validate the performance of the statisticalmodel.The model’s performance was further validated usinggene expression profiles obtained from a previously pub-lished asthma cohort (GSE40240) for which monocyte, Band T cell proportions were known [18, 19].Sample processingThe following describes sample processing for the RTPand CHFP cohorts. For details regarding sample process-ing for GSE40240 (asthma cohort) refer to Singh, et al.[18]. For all subjects, blood was collected in PAXgene(PreAnalytix, Switzerland) and EDTA tubes. The EDTAblood was spun down (200 x g for 10 min at roomtemperature) and the buffy coat aliquoted out. BothPAXgene blood and buffy coat samples were storedat −80 °C.Transcript abundanceTotal RNA was extracted from PAXgene blood on theQIAcube (Qiagen, Germany), using the PAXgene BloodmiRNA kit from PreAnalytix, according to manufac-turer’s instructions. Human Gene 1.1 (GPL6244; RTPand CHFP cohorts) ST array plates (Affymetrix, UnitedStates) were used to measure mRNA abundance. Thiswork was carried out at The Scripps Research InstituteDNA Array Core Facility (TSRI; La Jolla, CA). Theresulting CEL files were processed using the ‘oligo’ Rpackage [20].DNA methylationFor the RTP cohort samples only, DNA was extractedfrom buffy coat using Qiagen’s QIAamp DNA BloodMini kits. DNA was bisulfite-converted using the ZymoResearch EZ DNA methylation conversion kit, andInfinium HumanMethylation450 BeadChips (Illumina,United States) were used to measure methylation statusat >485,000 sites across the genome. This work was car-ried out at The Centre for Applied Genomics (TCAG;Toronto, Canada). The resulting IDAT files were proc-essed using the ‘minfi’ R package [21].Statistical analysisFollowing preprocessing with their respective packages(‘oligo’ or ‘minfi’), the normalized data were first batchcorrected using the ‘ComBat’ algorithm [22], as imple-mented in the ‘sva’ R package [23]. A schematic repre-sentation of the statistical analysis is shown in Fig. 1.Shannon et al. BMC Genomics  (2017) 18:43 Page 3 of 11Estimating cellular composition from DNA methylationprofilesNext, we inferred the cellular composition of the RTPcohort blood samples from their DNA methylation pro-files using the ‘estimateCellCounts’ function provided by‘minfi’. This function uses publicly available DNAmethylation profiles obtained from isolated leukocytesub-types to infer the relative abundance of granulo-cytes, monocytes, B, CD4+ T, CD8+ T and NK cells(details in Table 1) with very high accuracy [11, 21].These estimates of the cellular composition of our trainingsamples were used as a ‘silver standard’ to train the modelin the absence of gold standard (e.g. flow cytometry) datato provide us with a ground truth. In order to gainadditional confidence in these estimates, we comparedthem to those obtained from a hematology analyzer(CBC/Diffs) to assess accuracy and tested the 600 CpGsites used by ‘estimateCellCounts’ for associations withage, sex, or disease status in our training cohort (afteradjusting for cellular composition using the CBC/Diffs) todetermine whether these factors could be introducing anybias into the predictions.Model trainingWe then fit a multi-response Gaussian model using elasticnet regression via the ‘glmnet’ R package [24] on thegenome-wide transcript abundance data, using the DNAmethylation-derived cell proportions as response variables.The multi-response Gaussian model family is useful whenthere are a number of possibly correlated responses — a socalled “multi-task learning” problem — as is the case forthese cell proportions. Sparsity was an additional require-ment: using a minimal set of features to predict cell pro-portions is desirable because it reduces the risk of biasbeing introduced under varied experimental conditions.Conversely, redundancy in the information provided by thefeatures ensures robustness to such bias when it is present.We chose to use ‘glmnet’ because it is amenable to multi-task learning problems (using family = ‘mgaussian’) andTable 1 Description of predicted leukocytesCell name Abbreviation used DescriptionGranulocytes Gran CD15+ granulocytesMonocytes Mono CD14+ monocytesB cells Bcell CD19+ B-lymphocytesT cells (CD4+) CD4T CD3 + CD4+ T-lymphocytesT cells (CD8+) CD8T CD3 + CD8+ T-lymphocytesNK cells NK CD56+ Natural Killer (NK) cellsFig. 1 Schematic representation of the experiment. Cell proportions were estimated from DNA methylation profiles for 172 samples (1; COPD patients).These DNA methylation-derived cell proportions were used as a ‘silver’ standard in the absence of the ground truth. This ‘silver standard’ dataset was usedto train a multi-response Gaussian model using the gene expression data (2). Out-of-sample performance was evaluated using a repeated (20x) 10-foldcross-validation (3) and in two independent sets of samples, in different clinical indications: granulocyte, monocyte, and lymphocyte performance wasevaluated in a large set of samples (4; 197 samples, heart failure), while monocyte, B, and T cell prediction performance was evaluated in a smaller set(4; 28 samples, asthma)Shannon et al. BMC Genomics  (2017) 18:43 Page 4 of 11can provide a balance between sparse models and redun-dancy in the model features (by tuning the parameter α),but any machine learning method that meets these re-quirements could potentially have been substituted.Probesets with minimum log2 expression < 5.5 acrossall samples (22,251) were first excluded using the ‘ex-clude’ parameter. Next, we performed hyper-parametertuning by running the ‘cv.glmnet’ function for a numberof different values of α (α = 1, 0.9, 0.5, 0.1, 0), letting‘cv.glmnet’ construct models using a sequence of λ values(default behavior). Briefly, the elastic net mixing parameterα provides a way to tune between ridge (α = 0) and lasso(α = 1) penalized regression, and the complexity parameterλ a means of adjusting the degree of shrinkage beingapplied to the coefficients. The optimal value for eachparameter was that which minimized out-of-sample errorrate in cross-validation, using the mean-square errorcriterion (‘cv.glmnet’ parameter type.measure = ‘mse’).Estimating out-of-sample performanceOut-of-sample performance of our model was first eval-uated using 10-fold cross-validation, repeated 20 timesto eliminate any potential biases introduced by the parti-tioning of the data. We then validated the accuracy andcalibration of our model by comparing its predicted pro-portions to that obtained from CBC/Diffs data in theCHFP cohort. Unfortunately, a more complete enumer-ation of the lymphocyte compartment (e.g., by flowcytometry) was not available in this large cohort, so wecould not independently validate performance in thevarious lymphocyte sub-types. Instead, the sum of thepredicted B, CD4+ T, CD8+ T and NK cell proportionswas compared to total lymphocyte proportions from theCBC/Diffs. In addition, we accessed a small asthma co-hort from GEO (GSE40240) for which monocyte, B- andpan T-cell (CD3+) proportions were available (EpiontisqPCR cell quantification assay) [19]. This dataset wasused to independently validate the performance of themodel’s B- and T-cell predictions.In all cases, we report both model error (root-mean-square error) and correlation (Pearson’s product–momentcorrelation) to the actual cell proportions. The latter ismore pertinent, however, as accurate multivariate calibra-tion is not necessary for our intended use for these pre-dicted proportions, namely as proxy measures useful forstatistical work in the absence of more direct, clinicallyrelevant, measures.Performance compared to the method described inAbbas et al.We also compared the performance of our model to analternative approach for determining the composition ofblood samples from their gene expression profiles, de-scribed in Abbas et al. [5], in both the CHFP and asthmacohorts. The basis matrix from Abbas et al., derivedfrom the IRIS (Immune Response In Silico) referencedataset [16], was used to predict the cell proportions ofneutrophils, monocytes, B, CD4+ T, CD8+ T and NKcells. These predicted proportions were compared tothose obtained from CBC/Diffs (CHFP), or an EpiontisqPCR cell quantification assay (asthma), as above.Model features as marker genes for the reference-freeapproach described in Chikina et al.Finally, we evaluated whether our approach could beused to identify more suitable marker gene sets com-pared to using a reference dataset obtained on a differ-ent platform. The reference-free approach described byChikina et al. [13], does not require a basis matrix, rely-ing instead on a set of putative marker genes. These areused to guide the decomposition of the dataset’s covari-ance structure into separate variance components, usingsingular value decomposition. Marker genes for each celltype are summarized in this manner, a technique knownas eigengene summarization [25]. Given a good set ofmarker genes, these summarized values, termed surro-gate proportion variables (SPVs), should track with mix-ture proportions. Because SPVs are estimated directly inthe dataset, platform mapping issues should be mini-mized, but whether marker genes identified on one plat-form may perform well when applied to another has notbeen evaluated before.We used the reference-free approach described byChikina et al. (as implemented in the ‘CellCODE’ Rpackage) and marker genes derived either from the IRISreference dataset, as recommended by Chikina et al., orthe model features. Features were deemed marker genesfor specific cell types based on the absolute value oftheir coefficient weights across cell types in the model.We then compared the SPVs produced by ‘CellCODE’,using either marker gene sets, to the CBC/Diffs, asabove. Pearson’s product–moment correlation (r) wasused to summarize association between predictions.ResultsDNA methylation-derived predictions of the cellular com-position of the RTP cohort blood samples were accuratewhen compared to those obtained from CBC/Diffs (rootmean squared error [RMSE] = 0.01 – 0.08, Pearson’sr = 0.87 – 0.96; Additional file 1: Figure S1). The ob-served error rates for granulocytes and monocytes wereconsistent with those previously reported [10, 11]. Wecould not determine the accuracy of the B, CD4+ T, CD8+T and NK cell predictions directly. The sum of thesepredicted proportions were well correlated to thelymphocyte proportions obtained from the CBC/Diffs(Pearson’ r = 0.87), however. When controlling for cellu-lar composition using the CBC/Diffs, none of the 600Shannon et al. BMC Genomics  (2017) 18:43 Page 5 of 11CpG sites used by ‘estimateCellCounts’ were significantlyassociated with disease status, but 8 were significantly as-sociated with sex, and 70 with age, in our data (Additionalfile 2: Tables S1-S3).These predictions were used as the response variablesto train a multi-response Gaussian model in the RTPcohort gene expression data using elastic net regression.The optimal model hyper-parameterization (α = 0.1,λ = 0.8857) retained 491 features. Its fit to the data is vi-sualized in Fig. 2, against both the DNA methylationderived composition estimates (Fig. 2a), and CBC/Diffs(Fig. 2b). Model fit was good across all cell types, with theexception, perhaps, of CD8+ T cells. When consideringthe model fit to the CBC/Diffs data, we noted slight bias,with granulocyte proportions tending to be under-predicted and lymphocyte proportions over-predicted.To characterize the potential performance of thismodel on new data, we carried out a 20 × 10-fold cross-validation. We summarize the RMSE and Pearson’s robserved across 200 (20 × 10) left-out sets in Table 2and the data is visualized in Fig. 3. Estimated out-of-sample performance varied across cell types; RMSE waslowest for monocytes and highest for granulocytes. Errorrates compared favorably to other methods for inferringcellular composition of samples from gene expressiondata [5–7, 12]. Correlation between predicted and actualin the 200 left-out sets was highest for granulocytes(0.926), followed by monocytes (0.824), NK cells (0.812),CD4+ T cells (0.785), B cells (0.731), and CD8+ T cells(0.671).Next, we applied the model to gene expression profilesfrom the CHFP (Fig. 4a) and asthma cohorts (Fig. 4b) inorder to independently validate its performance.Performance in these cohorts is summarized in Table 2.The predicted proportions for granulocytes and mono-cytes were well correlated with those obtained fromCBC/Diffs across all 197 samples in the CHFP cohort(r = 0.89 and 0.74 respectively; Fig. 4a). While we couldFig. 2 Assessing model fit. Predicted proportions from the model are plotted against the DNA methylation-derived cell proportions for eachsample in the training data (a) or that obtained from CBC/Diffs (b). For (a), linear best-fit line to the data is plotted (blue line) with 95% point-wiseconfidence interval for fit (grey band) and compared with perfect agreement (red dashed line). For (b), predicted monocyte proportions arecompared directly to the CBC/diffs. The predicted granulocyte proportions are compared to the sum of neutrophil, eosinophil and basophilproportions from the CBC/diffs, while the sum of the predicted B, CD4+ T, CD8+ T and NK cell proportions is compared to the total lymphocyteproportions from the CBC/diffs. For each cell type, Pearson’s product–moment correlation (Pearson’s r) and the root mean squared error (RMSE)are reportedShannon et al. BMC Genomics  (2017) 18:43 Page 6 of 11not determine the accuracy of the B, CD4+ T, CD8+T and NK cell predicted proportions directly, the sum ofthese predicted proportions were well correlated to thelymphocyte proportions obtained from the CBC/Diffs(r = 0.90). Monocyte, B and pan T-cell proportions wereknown in the smaller asthma cohort (Epiontis qPCR cellquantification assay [19]). There, B-cell predicted propor-tions were highly correlated to those obtained from theEpiontis assay (Fig. 4b). Again, we could not determinethe accuracy of the CD4+ T and CD8+ T cell predictedproportions directly, but the sum of these predicted pro-portions was well correlated to the pan T cell propor-tions obtained from the Epiontis assay (r = 0.91). Wealso applied the basis matrix described by Abbas andothers in [5] to predict cell proportions in both the CHFPand asthma cohorts (Additional file 3: Figure S2) by map-ping the Affymetrix U133 identifiers to the correspondingGene 1.1 (CHFP) or 1.0 (asthma) ST identifiers usingthe Biomart service [26, 27]. Our model generally per-formed better, especially for monocytes, though this wasexpected given that the Abbas, et al. basis matrix was de-veloped on the Affymetrix U133 platform.More recently, reference-free approaches to quantifyingthe composition of mixed tissue samples from gene ex-pression [13] or DNA methylation [14] profiles have beenproposed. Such approaches may offer a solution to theplatform mapping issues we describe. Reference-free ap-proaches rely on the availability of marker genes for thecell types to be quantified. Many strategies have been de-scribed for identifying such marker genes [5, 12, 13, 28],but, so far, all have leveraged existing reference datasets:collections of gene expression profiles derived from cellsisolated from the mixed tissue to be quantified. In orderto determine whether marker gene selection exhibitsplatform-bias, we compared CellCODE SPVs derivedusing either marker genes identified from the IRISFig. 3 Cross-validation performance. Distribution of root mean square error (RMSE; (a)) and Pearson’s product–moment correlation (Pearson’s r; (b)) forout-of-sample predictions across repeated (20x) 10-fold cross-validations are visualized using boxplots. The mean and 95% CI are shown as a point andrange in the center of each boxplot and represent the expected out-of-sample performanceTable 2 Model performanceCell type 20x 10-fold cross-validation Independent test setRMSE (mean ± sd) Pearson’s r (mean ± sd) RMSE (n) Pearson’s r (n)Bcell 0.021 ± 0.007 0.755 ± 0.188 0.04 (28) 0.93 (28)CD4T 0.038 ± 0.01 0.813 ± 0.09 0.06 (28) 0.91 (28)CD8T 0.034 ± 0.006 0.683 ± 0.138Gran 0.054 ± 0.013 0.923 ± 0.046 0.06 (197) 0.89 (197)Mono 0.018 ± 0.003 0.842 ± 0.068 0.02 (197) 0.74 (197)NK 0.027 ± 0.006 0.816 ± 0.083 NA NAShannon et al. BMC Genomics  (2017) 18:43 Page 7 of 11reference dataset (U133 platform; mapped to Gene STplatform identifiers) or the features in our model. Markergenes derived from our model outperformed those identi-fied from the IRIS reference dataset and mapped to GeneST platform identifiers, when used with the CellCODE ap-proach (Fig. 5). Interestingly, the marker gene sets showedminimal overlap (granulocytes = 3/51, monocytes = 4/58,B cells = 0/55, CD4+ T cells = 0/11, CD8+ T cells = 1/15,NK cells = 6/22). We confirmed that the genes in ourmodel varied significantly across the included cell types byperforming ANOVA in the GSE28490 dataset, which in-cludes replicate profiles of the relevant cell types isolatedfrom blood. Most (461/491; 94%) mapped identifiers var-ied across cell types (adjusted p-value < 0.05).Fig. 5 Our model identifies better performing marker genes for use with reference-free approaches in Affymetrix Gene ST data. Surrogate proportionvariables obtained from CellCODE are plotted against the cell proportions obtained from CBC/Diffs in an independent dataset (CHFP cohort). The sumof the surrogate proportion variables obtained for B, CD4+ T, CD8+ T and NK cells is compared to the total lymphocyte proportions from the CBC/Diffs.Marker genes used by CellCODE were derived from the coefficients of the model (a) or using the recommended set of marker genes (b) derived fromthe IRIS reference dataset. For each cell type, Spearman’s rank correlation (ρ) is reportedFig. 4 Our model accurately predicts the cellular composition of blood samples and outperforms existing approaches in Affymetrix Gene ST data.Predicted cell proportions are plotted against the cell proportions obtained from CBC/diffs ((a); CHFP cohort) or a cell-type specific DNA methylationcell-typing assay ((b); Epiontis asthma cohort). In (a), the sum of the predicted B, CD4+ T, CD8+ T and NK cell proportions is compared to the totallymphocyte proportions from the CBC/diffs. The predicted granulocyte and monocyte proportions are directly compared. In (b), the sum of thepredicted CD4+ and CD8+ T cell proportions is compared to T cell proportion from the Epiontis assay. The predicted monocyte and B cell proportionsare directly compared. For each cell type, Pearson’s product–moment correlation (Pearson’s r) and the root mean squared error (RMSE) are reportedShannon et al. BMC Genomics  (2017) 18:43 Page 8 of 11Finally, we applied the model to predict the compos-ition of the RTP cohort blood samples from their geneexpression. This is a contrived example, as this informa-tion was already available to us, but it serves to illustratea possible application of the approach: to adjust for theconfounding effect of changes in cellular compositionwhen studying the effect of prednisone on whole bloodgene expression. As expected, we observed large differ-ences in the predicted proportions of the various celltypes between patients given prednisone or not (Fig. 6).Patients on prednisone had proportionally loweredmonocytes, B, CD4+ T, CD8+ T, and NK cells, and pro-portionally elevated granulocytes. This was consistentwith the CBC/Diffs, and 460/491 genes in our modelshowed no significant residual association between theirexpression and prednisone status (adj. p-value > 0.05),after adjusting for cellular composition using the CBC/Diffs, suggesting that the observed differences reflecttrue changes in the cellular composition of the samplesin response to prednisone, rather than changes in thegene expression of the underlying model features.DiscussionWe introduce ‘Enumerateblood’, a freely available andopen source R package that exposes a statistical model forpredicting the composition of blood samples fromAffymetrix Gene ST gene expression profiles. We demon-strate that this model has suitable performance across allincluded cell types in cross-validation, and validate its per-formance in two independent cohorts. The training andvalidation cohorts represent two major clinical indica-tions, COPD and HF, and include patients with variouscomorbidities, on various medications, some with strongeffects on blood gene expression (e.g., prednisone), sug-gesting that our model may generalize well and be broadlyapplicable. All training and validation samples were fromolder individuals, however, and it may be that this modelwill not generalize well to pediatric populations. A loss ofperformance in pediatric populations has been notedwhen using a similar approach with DNA methylationdata [29].We also show that platform-specific marker gene setscan be derived without the need for “reference datasets”:collections of gene expression profiles obtained from theisolated cells types we wish to enumerate. Using themodel features as marker genes in combination with thereference-free approach proposed by Chikina et al. re-sulted in better performance compared to using markergenes derived from isolated leukocyte gene expressionprofiles obtained on another microarray platform. Inter-estingly, the reference-free approach performed onlyslightly worse than our model, although with loss ofscale. This suggests that the coefficient weights of themodel may be estimated directly in the data, and thatthese marker genes may be context-independent surro-gates of cell proportions.More generally, the strategy we adopted to train ourmodel and identify suitable marker genes could be readilyapplied to other platforms, or tissues of interest. The onlyrequirements are accurate quantification of the cell typesof interest across a large cohort with matched omicsprofiling. For many popular platforms (e.g., RNA-seq), thisschema may be more cost effective than sorting andprofiling a number of replicates for all cells of interest,Fig. 6 Model predicted cell proportions highlight prednisone-dependent changes in peripheral blood composition. Treatment of acute exacerbations(AE) in COPD with prednisone results in important changes in the cellular composition of peripheral blood. The distributions of granulocyte, monocyte,B, CD4+ T, CD8+ T and NK cell proportions are visualized for patients from the Rapid Transition Program (RTP) cohort that were given prednisone ornot (p-value is for the unpaired Student’s t-test comparing the two groups in each case)Shannon et al. BMC Genomics  (2017) 18:43 Page 9 of 11particularly when we consider how costs would scale withadditional cell types to be quantified. Moreover, for lowabundance cell types (e.g. Tregs), obtaining a sufficientquantity to profile may not be feasible, depending onthe efficiency of available separation techniques, andamount of admixed tissue that can be collected inpractice. Single cell RNA-seq may change all this inthe near future, however.The lack of independent validation for some of thelymphocyte sub-types (CD4+ T, CD8+ T, and NK cells)is a limitation, though cross-validation performance wasgood across all cell types. We believe it is unlikely thatpoor performance in some or all lymphocyte sub-typeswould result in good performance when summed andcompared to CBC/Diffs. Model fit exhibits some degreeof shrinkage (flattening of the plot of predicted vs. ob-served away from the 45 degree line). This is expected,however, and related to the phenomenon of regressionto the mean. Performance in cross-validation was not-ably worse for CD8+ T cells. This could be because ofthe preponderance of zero values for this particular celltype. We also note that performance in monocytes dropssignificantly in the validation cohort. It is unclear whythis is, but one possibility is the difference in thedistribution of values in the validation cohort (meanmonocyte proportion in training: 0.073 vs. 0.090 in thevalidation; p = 1.39 × 10−7). We have observed poorperformance of various deconvolution approaches inquantifying monocytes in the past [12, 30]. It might bethat circulating monocyte diversity is poorly reflected inour current framework and we may be selecting poormarker genes for this cell type as a result. A similar ra-tionale could be applied to explain the poor CD8+ T cellperformance results in cross-validation. Certainly, it of-fers the opportunity for further exploration of the truecomplexity of these cell types in peripheral blood.ConclusionIn summary, our freely-available and open source Rpackage, ‘Enumerateblood’, exposes a statistical modelcapable of accurately inferring the composition of per-ipheral whole blood samples from Affymetrix Gene STexpression profiles. The strategy we adopted to derivethis model is readily applicable to other tissues and/orplatforms, which would allow for the development oftools to accurately segment and quantify a variety ofadmixed tissues from their gene expression profiles, toaccount for cellular heterogeneity across indications ormodel interactions between gene expression, some celltypes and the indication under study. The describedmodel outperforms other current methods when appliedto Gene ST data. By allowing a more complete study ofthe various components of the immune compartment ofblood from whole blood gene expression, this model willsignificantly improve our ability to study disease patho-biology in blood, and may generate novel insights fromexisting Affymetrix Gene ST blood gene expressiondatasets.Additional filesAdditional file 1: Figure S1. DNA methylation-derived compositionvs. CBC/Diffs. Predicted proportions were obtained by applying the‘estimateCellCounts’ function from the ‘minif R package to peripheral bloodderived DNA methylation profiles in the Rapid Transition Program (RTP)cohort and plotted against cell proportions obtained from CBC/Diffs. Thesum of the predicted B, CD4+ T, CD8+ T and NK cell proportions iscompared to the total lymphocyte proportions from the CBC/Diffs. For eachcell type, Spearman’s rank correlation (ρ) and the root mean squared error(RMSE) are reported. (PNG 47 kb)Additional file 2: Association of the 600 CpG sites used by minfi's‘estimateCellCounts’ function with disease status (Table S1), sex (TableS2), or age (Table S3), after adjusting for cellular composition using CBC/Diffs in the RTP cohort. (XLSX 172 kb)Additional file 3: Figure S2. Performance of Abbas et al. method inour validation datasets (Gene ST). Predicted cell proportions (using themethod from Abbas et al.) are plotted against the cell proportionsobtained from CBC/diffs (A; CHFP cohort) or a cell-type specific DNAmethylation cell-typing assay (B; Epiontis asthma cohort). In A, the sum ofthe predicted B, CD4+ T, CD8+ T and NK cell proportions is compared tothe total lymphocyte proportions from the CBC/diffs. The predictedgranulocyte and monocyte proportions are directly compared. In B, thesum of the predicted CD4+ and CD8+ T cell proportions is compared toT cell proportion from the Epiontis assay. The predicted monocyte and Bcell proportions are directly compared. For each cell type, Pearson’sproduct–moment correlation (Pearson’s r) and the root mean squarederror (RMSE) are reported. (PNG 83 kb)AcknowledgementsThe authors would like to thank the research participants whose tissuedonations made this work possible. Additional thanks go to the study nursesand clinical coordinators for their contributions to patient recruitment anddata collection. The authors would also like to acknowledge Dr. Karen Lamfor her insightful comments and discussion.FundingFunding was provided by Genome Canada, Genome British Columbia,Genome Quebec, the Canadian Institutes of Health Research, PROOF Centre,St. Paul’s Hospital Foundation, Providence Health Care, and the COPD ClinicalResearch Network. The funders had no input into the design of the study,collection, analysis, and interpretation of data, or in writing the manuscript.Availability of data and materialsThe datasets supporting the conclusions of this article are available on theGene Expression Omnibus (GEO): repositories GSE77344 (RTP cohortsamples), GSE77343 (CHFP samples), and GSE40240 (asthma cohort). Themodel is made available as a package for the R statistical programminglanguage, distributed under the GNU General Public License v3.0, and ishosted on GitHub: https://github.com/cashoes/enumerateblood.Authors’ contributionsVC and ZH carried out quality control and normalization of the microarraydata. CPS conceived of, and designed this study, performed the statisticalanalysis, and drafted the manuscript. RB, RTN and SJT participated in thedesign of the study and helped draft the manuscript. MT and BMM, and JMFand DDS conceived of, and designed the CHFP and RTP studies, respectively,whose samples were used for this analysis. All authors read and approvedthe final manuscript.Competing interestsThe authors declare that they have no competing interest.Shannon et al. BMC Genomics  (2017) 18:43 Page 10 of 11Consent for publicationNot applicable.Ethics approval and consent to participateThe Rapid Transition Program (RTP) and Chronic Heart Failure Program (CHFP)cohorts leveraged in the current work were collected as part of two distinctstudies. Both studies were approved by the University of British ColumbiaClinical Research Ethics Board and Providence Health Care Research EthicsBoard under Research Ethics Boards (REB) numbers H11-00786 (RTP) andH11-01897 (CHFP). The studies conform to the principles outlined in theDeclaration of Helsinki. All subjects gave informed, written consent. Patientinformation was anonymized and de-identified prior to the analysis.Author details1PROOF Centre of Excellence, Vancouver, BC, Canada. 2BC Centre for DiseaseControl, Vancouver, BC, Canada. 3Division of Cardiology, University of BritishColumbia, Vancouver, BC, Canada. 4Department of Pathology and LaboratoryMedicine, University of British Columbia, Vancouver, BC, Canada.5Department of Computer Science, University of British Columbia, Vancouver,BC, Canada. 6Department of Medicine, Division of Respiratory Medicine,University of British Columbia, Vancouver, BC, Canada. 7Centre for Heart LungInnovation, University of British Columbia, Vancouver, BC, Canada. 8Institutefor Heart and Lung Health, Vancouver, BC, Canada.Received: 17 February 2016 Accepted: 22 December 2016References1. Chaussabel D. Assessment of immune status using blood transcriptomicsand potential implications for global health. Semin Immunol. 2015;27:58–66.2. Li S, Rouphael N, Duraisingham S, Romero-Steiner S, Presnell S, Davis C, etal. Molecular signatures of antibody responses derived from a systemsbiology study of five human vaccines. Nat Immunol. 2013;15:195–204.3. Shen-Orr SS, Gaujoux R. Computational deconvolution: extracting celltype-specific information from heterogeneous samples. Curr Opin Immunol.2013;25:571–8.4. Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, et al.NCBI GEO: archive for functional genomics data sets—update. Nucleic AcidsRes. 2013;41:D991–5.5. Abbas AR, Wolslegel K, Seshasayee D, Modrusan Z, Clark HF. Deconvolutionof Blood Microarray Data Identifies Cellular Activation Patterns in SystemicLupus Erythematosus. Tan P, editor. PLoS ONE. 2009;4:e6098.6. Gaujoux R, Seoighe C. Semi-supervised Nonnegative Matrix Factorization forgene expression deconvolution: a case study. Infect Genetics Evol. 2011;12:913–21.7. Gong T, Hartmann N, Kohane IS, Brinkmann V, Staedtler F, Letzkus M, et al.Optimal deconvolution of transcriptional profiling data using quadraticprogramming with application to complex clinical blood samples.PLoS One. 2011;6:e27156.8. Lu P, Nakorchevskiy A, Marcotte EM. Expression deconvolution: areinterpretation of DNA microarray data reveals dynamic changes in cellpopulations. Proc Natl Acad Sci U S A. 2003;100:10370.9. Newman A, Liu C, Green M, et al. Robust enumeration of cell subsets fromtissue expression profiles. Nat Methods. 2015;12:453–457.10. Houseman EA, Accomando WP, Koestler DC, Christensen BC, Marsit CJ,Nelson HH, et al. DNA methylation arrays as surrogate measures of cellmixture distribution. BMC Bioinformatics. 2012;13:86.11. Jaffe AE, Irizarry RA. Accounting for cellular heterogeneity is critical inepigenome-wide association studies. Genome Biol. 2014;15:R31.12. Shannon CP, Balshaw R, Ng RT, Wilson-McManus JE, Keown P, McMaster R,et al. Two-Stage, In Silico Deconvolution of the Lymphocyte Compartmentof the Peripheral Whole Blood Transcriptome in the Context of AcuteKidney Allograft Rejection. PLoS ONE. 2014;9:e95224.13. Chikina M, Zaslavsky E, Sealfon SC. CellCODE: a robust latent variableapproach to differential expression analysis for heterogeneous cellpopulations. Bioinformatics. 2015;31:1584–91.14. Houseman EA, Molitor J, Marsit CJ. Reference-free cell mixture adjustmentsin analysis of DNA methylation data. Bioinformatics. 2014;30:1431–9.15. Houseman EA, Kile ML, Christiani DC, Ince TA, Kelsey KT, Marsit CJ.Reference-free deconvolution of DNA methylation data and mediation bycell composition effects. BMC Bioinformatics. 2016;17:259.16. Abbas AR, Baldwin D, Ma Y, Ouyang W, Gurney A, Martin F, et al. Immuneresponse in silico (IRIS): immune-specific genes identified from acompendium of microarray expression data. Genes Immun. 2005;6:319–31.17. Allantaz F, Cheng DT, Bergauer T, Ravindran P, Rossier MF, Ebeling M, et al.Expression Profiling of human immune cell subsets identifies miRNA-mRNAregulatory relationships correlated with cell type specific expression.PLoS One. 2012;7:e29979.18. Singh A, Yamamoto M, Kam SHY, Ruan J, Gauvreau GM, O’Byrne PM, et al.Gene-Metabolite Expression in Blood Can Discriminate Allergen-InducedIsolated Early from Dual Asthmatic Responses. Hsu Y-H, editor. PLoS ONE.2013;8:e67907.19. Singh A, Yamamoto M, Ruan J, Choi JY, Gauvreau GM, Olek S, et al. Th17/Treg ratio derived using DNA methylation analysis is associated with thelate phase asthmatic response. Allergy, Asthma Clin Immunol. 2014;10:32.20. Carvalho BS, Irizarry RA. A framework for oligonucleotide microarraypreprocessing. Bioinformatics. 2010;26:2363–7.21. Aryee MJ, Jaffe AE, Corrada-Bravo H, Ladd-Acosta C, Feinberg AP, Hansen KD,et al. Minfi: a flexible and comprehensive bioconductor package for theanalysis of infinium DNA methylation microarrays. Bioinformatics.2014;30:1363–9.22. Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarrayexpression data using empirical Bayes methods. Biostatistics. 2007;8:118–27.23. Leek JT, Johnson WE, Parker HS, Jaffe AE, Storey JD. The sva package forremoving batch effects and other unwanted variation in high-throughputexperiments. Bioinformatics. 2012;28:882–3.24. Zou H, Hastie T. Regularization and variable selection via the elastic net.J R Stat Soc Ser B Stat Methodol. 2005;67:301–20.25. Langfelder P, Horvath S. WGCNA: an R package for weighted correlationnetwork analysis. BMC Bioinformatics. 2008;9:559.26. Durinck S, Moreau Y, Kasprzyk A, Davis S, De Moor B, Brazma A, et al.BioMart and Bioconductor: a powerful link between biological databasesand microarray data analysis. Bioinforma Oxf Engl. 2005;21:3439–40.27. Durinck S, Spellman PT, Birney E, Huber W. Mapping Identifiers for theintegration of genomic datasets with the R/bioconductor package biomaRt.Nat Protoc. 2009;4:1184–91.28. Benita Y, Cao Z, Giallourakis C, Li C, Gardet A, Xavier RJ. Gene enrichmentprofiles reveal T-cell development, differentiation, and lineage-specifictranscription factors including ZBTB25 as a novel NF-AT repressor. Blood.2010;115:5376–84.29. Jones MJ, Islam SA, Edgar RD, Kobor MS. Adjusting for cell type composition inDNA methylation data using a regression-based approach. Totowa: HumanaPress; 2015. [cited 2015 Dec 22], Available from: http://link.springer.com/10.1007/7651_2015_262.30. Shannon CP, Hollander Z, Wilson-McManus J, Balshaw R, Ng R, McMaster R,et al. White Blood Cell Differentials Enrich Whole Blood Expression Data inthe Context of Acute Cardiac Allograft Rejection. Bioinforma. Biol. Insights.2012;49.•  We accept pre-submission inquiries •  Our selector tool helps you to find the most relevant journal•  We provide round the clock customer support •  Convenient online submission•  Thorough peer review•  Inclusion in PubMed and all major indexing services •  Maximum visibility for your researchSubmit your manuscript atwww.biomedcentral.com/submitSubmit your next manuscript to BioMed Central and we will help you at every step:Shannon et al. BMC Genomics  (2017) 18:43 Page 11 of 11


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items