Koestler et al. BMC Bioinformatics (2016) 17:120 DOI 10.1186/s12859-016-0943-7METHODOLOGY ARTICLE Open AccessImproving cell mixture deconvolution byidentifying optimal DNAmethylation libraries(IDOL)Devin C. Koestler1*, Meaghan J. Jones2, Joseph Usset1, Brock C. Christensen3,4,5, Rondi A. Butler6,Michael S. Kobor2, John K. Wiencke7 and Karl T. Kelsey6,8AbstractBackground: Confounding due to cellular heterogeneity represents one of the foremost challenges currently facingEpigenome-Wide Association Studies (EWAS). Statistical methods leveraging the tissue-specificity of DNA methylationfor deconvoluting the cellular mixture of heterogenous biospecimens offer a promising solution, however theperformance of such methods depends entirely on the library of methylation markers being used for deconvolution.Here, we introduce a novel algorithm for Identifying Optimal Libraries (IDOL) that dynamically scans a candidate setof cell-specific methylation markers to find libraries that optimize the accuracy of cell fraction estimates obtained fromcell mixture deconvolution.Results: Application of IDOL to training set consisting of samples with both whole-blood DNA methylation data(Illumina HumanMethylation450 BeadArray (HM450)) and flow cytometry measurements of cell composition revealedan optimized library comprised of 300 CpG sites. When compared existing libraries, the library identified by IDOLdemonstrated significantly better overall discrimination of the entire immune cell landscape (p = 0.038), and resultedin improved discrimination of 14 out of the 15 pairs of leukocyte subtypes. Estimates of cell composition across thesamples in the training set using the IDOL library were highly correlated with their respective flow cytometrymeasurements, with all cell-specific R2 > 0.99 and root mean square errors (RMSEs) ranging from [0.97% to 1.33%]across leukocyte subtypes. Independent validation of the optimized IDOL library using two additional HM450 datasets showed similarly strong prediction performance, with all cell-specific R2 > 0.90 and RMSE < 4.00%. In simulationstudies, adjustments for cell composition using the IDOL library resulted in uniformly lower false positive ratescompared to competing libraries, while also demonstrating an improved capacity to explain epigenome-widevariation in DNA methylation within two large publicly available HM450 data sets.Conclusions: Despite consisting of half as many CpGs compared to existing libraries for whole blood mixturedeconvolution, the optimized IDOL library identified herein resulted in outstanding prediction performance across allconsidered data sets and demonstrated potential to improve the operating characteristics of EWAS involvingadjustments for cell distribution. In addition to providing the EWAS community with an optimized library for wholeblood mixture deconvolution, our work establishes a systematic and generalizable framework for the assembly oflibraries that improve the accuracy of cell mixture deconvolution.Keywords: EWAS, DNA methylation, Cell mixture estimation, Cell heterogeneity*Correspondence: dkoestler@kumc.edu1Department of Biostatistics, University of Kansas Medical Center, 3901Rainbow Blvd., 66160 Kansas City, KS, USAFull list of author information is available at the end of the article© 2016 Koestler et al. Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, andreproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to theCreative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.Koestler et al. BMC Bioinformatics (2016) 17:120 Page 2 of 21BackgroundThe past decade has witnessed an exponential increasein epidemiologic studies of DNA methylation, driven inlarge part by increasing appreciation for its critical rolein the development and progression of human diseasestogether with the declining cost of high-throughput tech-nologies for interrogating the epigenome. Following thenamesake adopted for genome-wide, genetic associationstudies of disease phenotypes (GWAS), studies investigat-ing the role of DNA methylation in human diseases andexposures have been aptly dubbed epigenome-wide asso-ciation studies (EWAS) [1]. While GWAS and EWAS datashare many of the same analytical challenges, the tissuespecificity of DNAmethylation presents an added layer ofcomplexity in the analysis, and particularly in the interpre-tation of EWAS. Owing to the tissue specificity of DNAmethylation, it is now well established that comparisonsof methylation signatures assessed over heterogenous cellpopulations are susceptible to confounding and misinter-preted associations [2–5], issues that are believed by manyto be among the foremost challenges currently facingEWAS [6–9].Recent attempts aimed at minimizing the potential forconfounding in the analysis of DNA methylation datahave prompted some researchers to restrict methyla-tion assessment to purified cell populations [10, 11], forexample, CD4+ or CD14+ cells isolated from peripheralblood. Although such studies may be less prone to con-founding by leukocyte-lineage heterogeneity comparedto those involving whole blood (WB) DNA methyla-tion assessments, purification of cell populations carryingthese markers will not completely eliminate heterogene-ity attributable to lineage differences [3]. Other attemptsto address the potential for confounding in blood-basedDNA methylation data have involved adjusting statis-tical models with additional terms reflecting the cellcomposition of study samples using, for example, mea-surements from complete blood cell counts (CBC) orfluorescence-activated cell sorting (FACS) [5, 12]. How-ever, these measurements are not often collected as partof EWAS (Additional file 1: Table S1), the reasons forwhich commonly include: insufficient quantities of sub-strate for both DNA methylation assessment and mea-surements of cell composition, budgetary constraints, andthe inability of technologies - such as FACS - to accu-rately measure biospecimens stored over extended timeperiods. In addition, because EWAS typically representsubsidiary studies, whose associated parent study predatecurrent understanding of the impact of cellular hetero-geneity on DNA methylation analyses, direct measure-ments of cell composition were unlikely to have beenperformed when biospecimens were initially collected.These considerations, together with the emerging consen-sus concerning the need to account for cell compositionin the statistical analysis of DNA methylation data[6–9] have served to motivate the development of novelstatistical/bioinformatic methodologies for addressing thepotential confounding effects driven by cell heterogene-ity [13–15]. The first of such methodologies [13] and themost widely applied within the EWAS literature lever-ages the cell-specificity of DNA methylation as the basisfor estimating the cellular landscape of samples consistingof heterogeneous cell populations. This approach, com-monly referred to as cell mixture deconvolution (CMD),is grounded on the assumption that the methylation sig-nature for a given target sample (methylation profiledacross a diverse population of underlying cell types) canbe viewed as a weighted mixture of the unique methyla-tion signature of each of its constituent cell types, withweights reflecting the proportion of each cell type withinthe target biospecimen. Under certain constraints, fairlyroutine statistical procedures can be employed to esti-mate such weights, thereby providing investigators witha “prediction” of the cellular distribution for each targetsample to which it is applied. Much in the same way onewould adjust for cell composition if cell fractions weremeasured directly, estimates of cell composition obtainedusing CMD can be added as additional covariate termsto control for the potential confounding effects associatedwith cell heterogeneity [16–21].The first and most critical step of CMD and theimpetus for this research, involves assembling a libraryof cell-specific methylation biomarkers that collectivelyreflect the unique methylomic fingerprint of each celltype. In the case of leukocyte subtypes, we refer tosuch cell-specific methylation biomarkers as leukocyte-differentially methylated regions (L-DMRs) to conveytheir differential methylation status across leukocyte sub-types. Motivated by the critical role played by L-DMRlibraries and their relationship to the accuracy of cell com-position estimates [8, 22], here we develop and evaluatea novel, iterative algorithm for Identifying Optimal L-DMR Libraries (IDOL) that improves the accuracy andefficiency of cell composition estimates obtained by CMD.In what follows, we aim to address three key ques-tions: (i) does the optimal library identified from IDOLresult in improved estimates of cellular composition com-pared to existing libraries, (ii) if so, are there discernibledifferences between libraries that might offer an explana-tion for their prediction performance, and lastly (iii), whatimpact does the difference in prediction performancebetween libraries have on EWAS when estimates of cellmixture are desired. To address these important questionswe begin by applying IDOL to a training set consist-ing of samples with both whole-blood DNA methylationdata (assayed using the Illumina HumanMethylation450BeadArray (HM450)) and flow cytometry measurementsof cell composition in order to calibrate the selection ofKoestler et al. BMC Bioinformatics (2016) 17:120 Page 3 of 21an optimal L-DMR library. To illustrate the utility of theidentified IDOL library as resource for future EWAS, webenchmark its performance against existing libraries intwo independent HM450 data sets and conduct a thor-ough comparison of libraries to gain insight into theirobserved prediction accuracy. Finally, the impact of differ-ent libraries on the false positive rate and statistical powerof EWAS is evaluated using both simulation studies anda data application involving two large publicly availableHM450 data sets.ResultsThe essential nature of library assembly and its impacton the accuracy of cell composition estimates is high-lighted in Fig. 1. Figure 1a,b depict heat maps generatedfrom hierarchical clustering the K = 6 major leukocytecomponents of WB (i.e., CD4T cells, CD8T cells, natu-ral killer (NK) cells, B cells, monocytes, and granulocytes)based on their methylation signature across two differ-ent L-DMR libraries [13, 23]. The first of these librariesto appear in the literature (TopANOVA [13]) was assem-bled using the 600 CpGs with the largest F-statisticscomputed from a series of ANOVA models comparingCpG-specific patterns of methylation across leukocytes(Fig. 1a). The second library is the default library usedby the EstimateCellCounts function in the minfi Biocon-dutor package [23]. While also comprised of 600 CpGs,the EstimateCellCounts library is instead assembled usingthe top 100 CpGs that uniquely distinguish each cell typefrom the remaining K − 1 cell types (100 × K = 600Total CpGs). While both libraries adequately discriminatelymphoid-derived cells (CD4T, CD8T, NK, and B cells)frommyeloid-derived cells (monocytes and granulocytes),the EstimateCellCounts library exhibits far better dis-crimination of lineage-specific cell types, particularly, NK,CD4T, and CD8T lymphocytes (Fig. 1c). The net resultof its improved discrimination of lineage-specific sub-types is uniformly better prediction performance acrossthe entire immune cell landscape, the largest of such gainsbeing associated with NK, CD4T, and CD8T lymphocytes(Fig. 1d,e, Additional file 2: Figure S1).The principle reason for the difference in discrimi-nation power and prediction accuracy between librariesis due entirely to the criteria used for their assembly.While assembling libraries using ANOVA F-statisticsmight seem reasonable, it is inherently susceptible to theover-selection of CpGs that are capable of discriminat-ing certain subsets of leukocytes (i.e., lymphocytes ver-sus myeloid cell types), but provide poor discriminationof other subsets (i.e., lineage-specific subtypes). On theother hand, the EstimateCellCounts library is constructedby imposing an equal representation of CpGs for eachcell type (top 100 cell-specific L-DMRs). This strategyleads to better discrimination of lineage-specific cell typesand, as a result, improved estimation accuracy of thosecell types (Fig. 1d,e). Despite representing an obviousimprovement over TopANOVA, the prediction accuracyassociated with EstimateCellCounts demonstrates ampleroom for improvement and suggests that further refine-ments in the assembly of L-DMR libraries may provide thesolution.Motivated by the critical role played by L-DMR librarieson the accuracy of cell composition estimates, we focushere on the development and evaluation of a novel iter-ative algorithm (IDOL) for identifying L-DMR librariesthat improve the performance of CMD. A schematic dia-gram illustrating the various steps of IDOL is given inFig. 2a. IDOL first involves the construction of a can-didate set of L-DMRs consisting of CpG sites exhibitingdifferential DNA methylation across leukocyte subtypes.From this candidate set, subsets of L-DMRs are randomlyselected at each iteration, with each randomly selectedL-DMR being evaluated for its contribution to cell com-position prediction accuracy. The contribution of each L-DMR is then used to modify its probability of selection insubsequent rounds of IDOL, where selection probabilitiesare updated in a manner proportional to its contributionto prediction accuracy. This is similar in principle to theweight updating rule in supervised competitive learningnetworks and the update rules employed in Learning Vec-tor Quantization [24]. Specifically, L-DMRs found to con-tribute favorably to prediction performance are updatedto have greater chance of being selected in subsequentiterations, whereas L-DMRs that hinder or have no effecton prediction performance are updated to have a reducedchance of being selected (Fig. 2b,c). This dynamic pro-cess is repeated thousands of times, with L-DMR selectionprobabilities evolving at each iteration depending on howthey impact the accuracy of CMD estimates (Fig. 2d).By updating selection probabilities in this way, randomlyselected L-DMR subsets at each sequential IDOL itera-tion become enriched with L-DMRs that were previouslymarked as beneficial to prediction accuracy. As a result,the temporal evolution of IDOL witnesses the preferentialselection of L-DMR subsets that, as a whole, contributeto improved accuracy of CMD estimates (Fig. 2e,f). Upontermination, one is left with the subset consisting of the L-DMRs with the largest selection probabilities, henceforthreferred to as the optimal IDOL library.Training the selection of L-DMR libraries for cell mixturedeconvolutionTo calibrate the selection of optimal L-DMR libraries,we first applied IDOL to a training data set consistingn = 6 non-diseased adults with WB DNA methyla-tion and immune profiling data (Section ‘Adult wholeblood (WB) samples’). Flow cytometry estimated cellfractions of CD4T, CD8T, natural killer (NK), B cells,Koestler et al. BMC Bioinformatics (2016) 17:120 Page 4 of 21GranMonoBcellCD8TCD4T NKMonoGranBcellCD4TCD8T NK10 15 20 25 30 351015202530350 5 10 150510150 2 4 6 8 1002468100 5 10 15 20 25 3005101520253040 45 50 55 60 65 70404550556065702 4 6 8 10 1224681012CD4T CD8TBcell NK Mono Gran FACS mixture Fractions (%) Predicted Mixture Fractions (%)Gran Mono NK BcellCD8TCD4TMethylation -value a b c d e ANOVA-ranked L-DMRsEstimateCellCountsL- DMRs D4TCD8TBcellNKMonoGranCD4TCD8TBcellNK Mono CD8TBcellNK Mono Gran -12.0-9.6-7.2-4.8-2.40.02.44.87.29.612.0Difference in DSC (r,s) s r R2 RMSE Cell Type TopANOVA EstimateCellCounts TopANOVA EstimateCellCounts CD4T 0.99 0.89 3.63 2.65 CD8T 0.16 0.89 10.55 2.96 Bcell 0.73 0.80 1.87 0.98 NK 0.50 0.71 14.11 5.53 Mono 0.00 0.29 2.32 1.76 Gran 0.99 0.99 4.53 3.87 Average 0.56 0.76 6.17 2.96 Fig. 1 Impact of L-DMR library on the accuracy of cell composition estimation. a, b Hierarchical clustering heat maps of the mean methylationsignatures of isolated leukocyte subtypes [3] using (a) the top 600 ANOVA-ranked L-DMRs (TopANOVA library) and (b) the 600 L-DMRs that uniquelydistinguish each cell type from all other cell types (EstimateCellCounts default library). Column dendrograms are colored to reflect the cell-lineage ofleukocyte subtypes: lymphocytes (pink) and myeloid-derived cells (blue). c Image plot showing the difference in the dispersion separability criterion(DSC) between the EstimateCellCounts and TopANOVA libraries. For a given pair of leukocyte subtypes, larger values of DSC difference (shades ofblue) indicate better discrimination associated with the EstimateCellCounts library, whereas smaller values of DSC difference (shades of red) indicatebetter discrimination associated with the TopANOVA library. d Scatterplots of the CMD predicted and FACS cell fractions for the n = 6 AdultMixedsamples. Dashed lines indicate the line of unity, dotted lines represent the fitted regression lines based on cell predictions obtained using theTopANOVA library, and solid lines represent the fitted regression lines based on cell predictions obtained using the EstimateCellCounts library.e Cell-specific prediction performance for the AdultMixed samples based on the TopANOVA and EstimateCellCounts librariesmonocytes, and granulocytes across the training sam-ples are depicted in Fig. 3a. On average, granulocytesrepresented the most abundant cell type across thetraining samples (mean = 57.4%, sd = 11.5%), fol-lowed by CD4T (mean = 17.9%, sd = 5.7%), CD8T(mean = 9.7%, sd = 4.5%), monocytes (mean = 6.7%,sd = 1.2%), B cells (mean = 4.9%, sd = 1.9%), and NKcells (mean = 3.5%, sd = 1.4%), in descending order ofabundance.Since the objective of IDOL is to identify the best sub-set (or subsets) of L-DMRs from a larger candidate set ofputative L-DMRs, we first focused on constructing thiscandidate set by identifying CpG sites with differentialmethylation across leukocytes. Using the DNA methyla-tion profiles for isolated leukocyte subtypes reported in[3], we first fit a series of two-sample t-tests to compareCpG-specific DNA methylation patterns across the K =6 immune cell subtypes. Specifically, the CpG-specificmethylation signature of each of cell type was comparedto the K − 1 remaining cell types and the top 150 CpGswith largest and smallest t-statistics were combined into asingle candidate list consisting of 1,800 putative L-DMRs(Additional file 3: Table S2). Following construction of thecandidate set, we next applied IDOL to identify optimallibraries across a range of possible library sizes, from 100to 800 CpG loci in increments of one-hundred. AcrossKoestler et al. BMC Bioinformatics (2016) 17:120 Page 5 of 21Identify the U CpGs that uniquely characterize each of the P cell types Among the U candidate CpGs, J* < U are randomly selected for mixture deconvolution with probability , j = 1, ,UUsing the J* randomly selected CpGs, cell mixture distribution is predicted using mixture deconvolution(Houseman et al., 2012) Remove CpG q from the J*CpGs and apply mixture deconvolution using the remaining (J* - 1) CpGsCompute absolute/relative prediction performance when CpG q is withheld from the set of J* CpGsAssess contribution of CpG qto prediction performance by comparing prediction performance when CpG q is withheld, to the prediction performance when all J* CpGsare used for mixture prediction Update probability of selection for each of the J* CpGs based on how beneficial it was for prediction performance 0 1 2 4 5 6 Repeat steps 1-6 until convergence j( )Predicted cell distribution is compared to the observed cell distribution (FACS or CBC) and the absolute/relative prediction performance is computed 3 7 Modifying CpG selection probabilities for subsequent iterations CpG5 5.00 0.25 0.37 3 1.50 0.25 0.33 3 -2.60 0.25 0.14 1 -0.50 0.25 0.22 j r j j( )p j j( +1)/ 413 /127 /1219 /12Increasing Decreasing Unchanged Probability of selection a b c Iteration: 0 Iteration: 100 Iteration: 500 Beneficial for prediction performance probability of selection Not Beneficial for prediction performance probability of selection Probability of selection is modified based on how beneficial a given CpG was in improving prediction performance All CpGs have equal chance of being selected d Output 0 50 100 150 2000501001500 50 100 150 2000.50.60.70.80.9Iteration IterationDecreasing prediction error across iterations Increasingvariation in observed cell composition being explained f e Fig. 2 Conceptual illustration of the IDOL algorithm. a Schematic diagram showing each step of IDOL. b, c Illustration of the scheme for updatingthe selection probabilities of L-DMRs. d Conceptual depiction of the L-DMR selection probabilities as a function of the sequential progression ofIDOL. At iteration 0, L-DMRs have an equal probability of being selected for inclusion in the randomly assembled L-DMR subset. At each sequentialiteration of IDOL (i.e., moving from left to right), the selection probabilities for L-DMRs are updated in a manner proportion to their contribution toprediction performance; selection probabilities for L-DMRs that contribute favorably to prediction performance are increased (increasing shades ofgreen), whereas the selection probabilities for those that hinder prediction performance are decreased (increasing shades of red). Upon algorithmtermination, the J L-DMRs with the largest selection probabilities are taken to represent the optimal L-DMR library. e, f Plots showing mean RMSE(M¯) and coefficient of determination (R¯2) respectively, as a function of sequential progression of the the IDOL algorithmthe spectrum of library sizes considered, the average R2and root mean square error (RMSE) between flow cytom-etry measurements and predicted cell type proportionsobtained from the identified optimal libraries was verystable, ranging from 0.98 to 1.00 for R2 and from 2.41% to3.30% for RMSE (Additional file 4: Figure S2). As notedin Additional file 4: Figure S2, a subtle drop-off in pre-diction performance was observed libraries whose sizeexceeded 500 CpGs. Given the general preference for pre-diction models that use fewer features and because thelibrary consisting of 300 CpGs (Additional file 5: Table S3)performed favorably both with respect to its average R2and RMSE, this library was selected as the representativeIDOL library for all subsequent comparisons and analyses.Hierarchical clustering of leukocytes based on theirmean methylation signature across the 300 CpGs in theoptimal IDOL library is given in Fig. 3b and clearly showsbetter discrimination of lymphocyte subtypes comparedto the TopANOVA library (Fig. 1a). Using the IDOLlibrary for deconvoluting the cellular mixture of the train-ing set samples revealed a high degree of concordancebetween flow cytometry and predicted cell type propor-tions, with nearly perfect R2 values across all cell types andRMSEs ranging from as low as 0.97% for monocytes to1.37% for CD4T cells (Fig. 3c). Across the six leukocytes,the average R2 and RMSE between the predicted and flowcytometry cell type proportions were estimated at 0.99and 1.15%, respectively. When compared to the resultsKoestler et al. BMC Bioinformatics (2016) 17:120 Page 6 of 21020406080100WB3021WB3045WB3082WB3093WB3159WB3177Gran Mono NK BcellCD8TCD4TWhole Blood (WB) Samples Fraction of Leukocyte DNA (%)BcellMonoGranCD4T NKCD8T128 172 472 -0.60-0.48-0.36-0.24-0.120.000.120.240.360.480.60Difference in DSC (r,s) Optimized L-DMRs EstimateCellCountL-DMRs adbec!4TCD8TBcellNKMonoGranCD4TCD8TBcellNKMono**p < 0.10 p < 0.05 Permutation p-value -0.5 0.0 0.5 1.0 1.5050100150200Difference in DSC Frequency Permutation p = 0.038 f Optimized L-DMRs FACS Cell Mixture Fractions (%) Predicted Mixture Fractions (%) 0 2 4 6 8 1002468104 6 8 10 14 18468101214161812 1610 15 20 25 30101520253045 50 55 60 65 704550556065705 6 7 8 9 1056789101 2 3 4 5 6 7 812345678CD4T CD8T BcellNK Mono Gran R2 = 1.00 RMSE = 1.37R2 = 0.99 RMSE = 1.19R2 = 1.00 RMSE = 1.00 R2 = 0.99 RMSE = 0.97R2 = 0.99 RMSE = 1.04R2 = 0.99 RMSE = 1.33Methylation -value s r Fig. 3 Results obtained from applying IDOL to the training set. a Stacked bar plots showing the FACS measured fractions of granulocytes (Gran),monocytes (Mono), natural-killer cell (NK), B cells (Bcell), CD8T lymphocytes (CD8T), and CD4T lymphocytes (CD4T) across the 6 training samples.b Hierarchical clustering heat map of the mean methylation signature of leukocyte cell-types (columns) based on the 300 optimized L-DMRs (rows)identified by IDOL. The column dendrogram is colored to reflect the cell lineage of the leukocyte subtypes, where lymphocyte-derived subtypes arecolored pink and myeloid-derived cell types are colored blue. c Scatterplots of FACS measured cell fractions (x-axes) and predicted cell proportionsobtained using the optimized IDOL library (y-axes). Dotted lines indicate the line of unity and colored lines represent the regression line fit to theFACS measured cell fractions and predicted cell fractions. d Overlap between IDOL and EstimateCellCounts libraries. e Image plot showing thedifference in the dispersion separability criterion (DSC) between the IDOL and EstimateCellCounts libraries for discriminating specific pairs ofleukocyte subtypes. For a given pair of leukocytes, larger values of DSC difference (shades of blue) indicate better discrimination associated with theIDOL library, whereas smaller values of DSC difference (shades of red) indicate better discrimination associated with the EstimateCellCounts library.f Histogram showing the results of a permutation-based testing procedure for examining the difference in the overall DSC between the IDOL andEstimateCellCounts librariesobtained from the application of both the EstimateCell-Counts and TopANOVA libraries to training set (Fig. 1d,e,Additional file 2: Figure S1), the IDOL library resulted bet-ter prediction performance for all cell types except B cells,whose predictions from EstimateCellCounts exhibitedslightly lower RMSE (0.98% versus 1.04%). Upon fur-ther comparison, the greatest improvements in predictionperformance associated with the IDOL library occurredfor monocytes and among lymphocyte subtypes. Specifi-cally, the IDOL library resulted in monocyte predictionsthat explained approximately 70% more variation in theflow cytometry measurements of monocytes comparedto EstimateCellCounts (Figs. 1e and 3c). Similarly, pre-dictions of CD4T, CD8T, and NK cell type fractionsobtained from the IDOL library explained an average of17% more variation in the flow cytometry derived frac-tions of these cell types compared to EstimateCellCounts,and were associated with RMSEs that were on average3.3-fold lower.Of the 300 CpGs encompassing the IDOL library, 128(43%) were shared with 600 L-DMRs used by Estimate-CellCounts (Fig. 3d, Additional file 5: Table S3). Tounderstand how differences between these two librariesmight explain their observed prediction performance,we next compared libraries with respect to their abil-ity to discriminate specific pairs of leukocytes. For eachlibrary we computed the dispersion separability crite-rion (DSC), defined here as the ratio of the averageKoestler et al. BMC Bioinformatics (2016) 17:120 Page 7 of 21distance between cell-specific centroids and the overallmean to the average distance between samples of thesame cell type. As such, increasing DSC values indi-cate greater between-cell-type dispersion/discrimination.Using the leukocyte-specific methylation data reportedin [3] as the basis for estimation, we found that theIDOL library resulted in a significantly larger DSCcompared to the EstimateCellCounts library (permuta-tion p = 0.038) (Fig. 3f). Furthermore, a compari-son of the DSC values computed between each pairof leukocytes showed that the IDOL library resulted inlarger DSC values in 14 out of the 15 comparisons, ofwhich 4 were associated with p-values that borderedon statistical significance (p < 0.10) (Fig. 3e). Amongthe 4 comparisons with marginally statistically signifi-cant p-values, 3 involved specific pairs of lymphocytesubtypes.Independent validation of the optimal L-DMR setTo validate the IDOL library identified in the train-ing set, we next examined its performance for accu-rately deconvoluting the cellular composition of 12additional samples spread across two independent testsets: MethodA and MethodB sets. As described inSection ‘Cell mixture reconstruction experiment’, theMethodA and MethodB data sets were created by mix-ing purified leukocyte subtype DNA from CD4T, CD8T,NK, B cells, monocytes, and granulocytes in predeter-mined proportions (Fig. 4a). As such, the true cellu-lar mixture of the MethodA and MethodB samples areknown with a high degree of confidence, representingideal candidates in which to validate the optimal libraryidentified by IDOL in its application to the trainingset.As noted in Fig. 4a, whereas the MethodA samples arecharacterized by a roughly equivalent fraction of eachcell type (mean CD4T = 11.8%; CD8T = 20.8%; NK =15.0%; Bcell = 16.0%, monocyte = 19.2%, and granulo-cyte = 17.2%), the cellular composition of the MethodBsamples were reconstructed to resemble the immune celllandscape observed in healthy human adults [25] (meanCD4T = 13.2%; CD8T = 6.0%; NK = 3.0%; Bcell =2.7%, monocyte = 6.2%, and granulocyte = 69.0%). Sim-ilar to the results obtained in the training set, cell typepredictions in the testing sets using the IDOL library werehighly correlated with true mixture fractions (Fig. 4b).Specifically, the cell-specific R2 values computed acrossboth testing sets ranged from 0.94 (CD4T cells) to 1.00(both, granulocytes and B cells), with an R2 of 0.97 aver-aged across the six cell types. In addition, the cell-specificRMSEs computed across the testing sets showed that in 4out of the 6 cell types, predictions were, on average, within2.0% of their true reconstructedmixture proportions. Thetwo exceptions being NK cells (RMSE = 2.5%) and CD8Tcells (RMSE = 3.4%). A comparison of the cell-specific R2and RMSEs computed within the MethodA and MethodBdata sets separately revealed relatively minor differencesin prediction accuracy (Additional file 6: Table S4 andAdditional file 7: Figure S3 and Additional file 8: FigureS4). For the MethodA set, cell-specific R2 and RMSEranged between [0.86, 1.00] and [1.09%, 4.11%] withmean values of 0.96 and 2.14%, respectively. Similarly, inthe MethodB data set, cell-specific R2 and RMSE rangedbetween [0.82, 0.98] and [1.44%, 2.52%] with mean valuesof 0.91 and 1.68%. Furthermore, there appeared to be noassociation between the prediction performance of a givencell type and its true underlying fraction in the MethodAand MethodB reconstructed mixture samples (Additionalfile 7: Figure S3, Additional file 8: Figure S4 and Additionalfile 9: Figure S5).The prediction performance obtained using the IDOLlibrary compared favorably to the performance associatedwith EstimateCellCounts, the predictions of whichexplained, on average, 2% less variation in the under-lying reconstructed mixture fractions compared to theIDOL library (Additional file 6: Table S4 and Fig. 4c).The largest difference in performance was observedfor CD4T cells, whose IDOL associated predictionsexplained an estimated 12% more variation in thereconstructed mixture proportions of CD4T cells andwere associated with a 2-fold lower RMSE comparedto EstimateCellCounts (Additional file 6: Table S4and Fig. 4c).Implications of cell composition adjustment methodologyfor EWASIn the overwhelming majority of the studies using CMD,estimates of immune cell fractions are first obtainedfor each study sample, followed by their inclusion asadditional covariate terms in statistical models to con-trol for the potential confounding effects of cellularheterogeneity [26–28]. For this reason, metrics such asR2 and RMSE, while providing a useful starting pointfor comparing different L-DMR libraries, say little abouthow the prediction error associated with a given libraryrelates to its impact on the power and false discoveryrate (FDR) of EWAS. With this in mind, we conducteda series of analyses aimed at examining how adjust-ments for cellular mixture in the statistical modelingof DNA methylation data impact the ability to cor-rectly identify true negatives (FDR) and true positives(power).To understand the consequences of prediction errorin cell fraction estimates for EWAS, we first conducteda simulation study comparing the FDR when differentstrategies for cell composition adjustment were employed,namely, when cell fraction estimates were obtained usingthe IDOL and EstimateCellCounts libraries. For ourKoestler et al. BMC Bioinformatics (2016) 17:120 Page 8 of 21Fraction of Leukocyte DNA (%)0204060801001 2 3 4 5 6Method A1 2 3 4 5 6Method BReconstructed Mixtures (RM) CD4T CD8T BcellNK Mono Gran -5 0 5 Gran Mono NK BcellCD8T CD4T 0 20 40 60 80 1000.00.20.40.60.81.00 20 40 60 80 1000.00.20.40.60.81.00 20 40 60 80 1000.00.20.40.60.81.00 20 40 60 80 1000.20.10.00.10.20 20 40 60 80 1000.20.10.00.10.20 20 40 60 80 1000.20.10.00.10.2Dissimilarity in mixture composition between groups FDR Difference in FDR n1 = n2 = 50 n1 = n2 = 100 n1 = n2 = 250 n1 = n2 = 500 Sample size 50 5050 10050 25050 500100 100100 250100 500250 250250 500500 500Difference in FDR -.02 0 .02 .06 n1 n2No Adjustment EstimateCellCountsOptimized L-DMRs True Cell Composition Difference in FDR between EstimateCellCount and Optimized L-DMRs 0 20 40 60 80 1000.00.20.40.60.81.00 20 40 60 80 1000.20.10.00.10.2Reconstructed Mixture Fractions (%) Predicted Mixture Fractions (%) 6 8 10 12 14 16 18 20681012141618200 5 10 15 20 25 30 35051015202530350 5 10 15 20 2505101520250 5 10 15 20 2505101520255 10 15 20 2551015202510 20 30 40 50 60 7010203040506070R2 = 0.94 RMSE = 1.28R2 = 0.96 RMSE = 3.41R2 = 0.92 RMSE = 2.45 R2 = 0.98 RMSE = 1.60R2 = 1.00 RMSE = 1.52R2 = 1.00 RMSE = 1.44CD4T CD8T BcellNK Mono Gran Fig. 4 Results obtained from applying the optimal IDOL library to the testing sets. a Stacked bar plots showing the cell type fractions for eachtesting set sample. b Scatter plots of the true reconstructed mixture fractions (x-axes) and the predicted cell fractions obtained using the optimizedIDOL library (y-axes). Circles indicate Method A samples and squares indicate Method B samples. Dotted lines indicate the line of unity and coloredlines represent the regression line fit to the true reconstructed mixture fractions and predicted cell fractions. c Box plots showing the predicted cell(%) − observed cell (%) across leukocyte cell types, where blue boxes represent estimates obtained from the optimal IDOL library and red boxesrepresent estimates obtained from the EstimateCellCounts library. (d, top panel) Estimated false discovery rate (FDR) for a two-group comparison ofDNA methylation as a function of the dissimilarity in the cellular distribution between groups (x-axes). Colored lines represent different approachesfor cell composition adjustment. (d, bottom panel) Difference in the FDR between the EstimateCellCounts and IDOL libraries where points abovethe dotted line indicate that the EstimateCellCounts library resulted in more false positive results compared to the IDOL library. eMean difference inthe FDR for varying sample sizes when cell mixture was adjusted using cell fractions estimates from the EstimateCellCounts and IDOL libraries. Barsrepresent the 95% bootstrap confidence intervals for each point estimate. Points to the right of the dotted line indicate that the EstimateCellCountslibrary resulted in more false positive results compared to the IDOL libraryssimulations, we assumed simplistic study design that, typ-ical of many EWAS, focused on the identification ofdifferentially methylated CpG sites between two groups,i.e., case/control comparison. As described in Section‘Simulation study comparing false discovery rates (FDR)across different cell composition adjustment techniques’,for each sample, methylation beta-values were simulatedfor a total of 10,000 CpGs, assuming within-group samplesizes that ranged from small/moderate (i.e., n = {50, 100})to moderate/large (i.e., n = {250, 500}). Most importantly,while the underlying cellular composition was permittedto vary across groups, each cell type was assumed tohave an identical methylation signature between groups:no group effect. As such, tests of CpG-specific differ-ential methylation between groups with adjustments forcellular composition should not be rejected and thereforerepresent the basis for our estimates of FDR.As expected, the FDR was appropriately controlled at5% when adjustments for cell composition were carriedout using the “true simulated” cell distribution (Fig. 4d,black lines). On the other hand, a clear inflation in theFDR was observed when tests for differential methylationKoestler et al. BMC Bioinformatics (2016) 17:120 Page 9 of 21were unadjusted for cellular composition, the degree ofinflation depending heavily on the between-group dissim-ilarity in cellular distribution (Fig. 4d, green lines). Whilea subtle inflation in FDR was observed when cell typeadjustments were carried out using cell fraction estimatesobtained from the IDOL (blue lines) and EstimateCell-Counts (red lines) libraries, the IDOL library tended toresult in a reduced number of false positive results acrossthe spectrum of simulation conditions (Fig. 4d). Thisobservation is more clearly illustrated in Fig. 4e whichdepicts the average difference in FDR computed betweenEstimateCellCounts and the IDOL libraries across therange of assumed within group sample sizes. Comparedto EstimateCellCounts, the IDOL library resulted in, onaverage, 2%–5% fewer false discoveries when within-group sample sizes ranged from 50 to 500.To further understand the implications of cell type pre-diction methodology for EWAS, we made use of two ofthe largest publicly-available WB DNA methylation datasets [16, 29]. Our analysis of the Liu [16] and Hannum[29] data sets was aimed at addressing two different butrelated questions: (i) which cell prediction methodologyperformed better at explaining variation in DNAmethyla-tion within each data set and (ii) how does the additionalvariation being explained relate to the statistical power ofeach study. To address these questions we began by esti-mating the cellular distribution of the samples in each dataset using both the IDOL and EstimateCellCounts libraries(Fig. 5a,b).As noted in Fig. 5a,b, a high degree of correlation wasobserved in the cell fraction estimates obtained usingthe IDOL and EstimateCellCounts libraries, with cell-specific R2 ranging from [0.80, 0.99] and [0.84, 0.99] forthe Liu and Hannum data sets, respectively. In both datasets, the predicted fraction of monocytes exhibited thegreatest variation between the the IDOL and Estimate-CellCounts libraries, with the IDOL library resulting inslightly smaller estimates of monocyte fractions comparedto EstimateCellCounts, on average: (5.4% versus 7.8%)and (6.8% versus 8.7%) in the Liu and Hannum data sets,respectively. Conversely, estimates of CD4T cells obtainedfrom the IDOL library were, on average, slightly largercompared to those obtained from EstimateCellCounts;(12.9% versus 8.3%) and (13.8% versus 9.1%) in the Liuand Hannum data sets.Array-wide comparisons of the proportion of CpG-specific variation in DNA methylation explained by cellcomposition estimates revealed that the IDOL librarytended to explain more variation compared to Estimate-CellCounts (Fig. 5c,d). Specifically, cell estimates obtainedfrom the IDOL library explained more variation in DNAmethylation for 83.3% of the CpGs in the Liu data set and74.8% of the CpGs in the Hannum data set, both of whichrepresent significantly larger proportions than would beexpected by random chance (permutation p < 0.001 forboth). To understand how these findings relate to sta-tistical power of EWAS, we used the residual varianceestimates obtained from each methodology as the basisfor estimating the sample size required for detecting astatistically significant difference in DNA methylation at80% power (Section ‘Data application for exploring theimplications of cell composition adjustment in EWAS’).Figure 5e,f show the number of additional samples neededwhen cell type correction was carried out using esti-mates from EstimateCellCounts (purple) or no cell typecorrection (green), as a function of the desired differ-ence to be detected in the mean methylation beta-valuesbetween two groups. Using the residual variance esti-mates computed in the Hannum data set, there wereonly modest differences in the number of additionalsamples needed when cell type correction was basedon estimates from EstimateCellCounts, with virtually nodifference between the IDOL library and EstimateCell-Counts beyond effect sizes of 0.03 (on the beta-valuescale). However, using the residual variance estimatesobtained in the Liu data set showed that, for effectsizes ranging from 0.03–0.05 (on the beta-value scale),approximately 15 and 5 additional samples respec-tively would be needed if the analysis was adjustedfor cell composition using estimates obtained fromEstimateCellCounts.DiscussionIn this manuscript, we have described and extensivelyevaluated a novel, iterative algorithm for assembling L-DMR libraries. Our objective was to present a methodol-ogy that can identify libraries that improve the predictionperformance of CMD. Building off existing approaches[8, 13], IDOL involves the targeted curation of librarieswhose constituent L-DMRs are selected on the basis oftheir collective ability to optimize the accuracy and mini-mize the prediction error associated with cell compositionestimates obtained through CMD. The principal differ-ence between IDOL and the assembly of existing L-DMRlibraries is that IDOL makes use of a training data setconsisting of samples with both WB DNA methylationsignatures and immune profiling data as a means of cal-ibrating the selection of L-DMRs. This in turn results inlibraries that enhance the accuracy of CMD estimates and,as a consequence, improve the operating characteristicsof EWAS, i.e., decreased false positive rate and increasedstatistical power.In our application of IDOL to a training set, we assem-bled optimal L-DMR libraries across a range of possiblelibrary sizes (i.e., 100, 200, . . . , 800) in order to examinethe relationship between library size and the accuracy ofcell composition estimates. Although only modest differ-ences in prediction performance were observed betweenKoestler et al. BMC Bioinformatics (2016) 17:120 Page 10 of 21Fig. 5 Cell mixture deconvolution of the Liu and Hannum blood data sets using the IDOL and EstimateCellCounts libraries. a, b Scatter plots of thepredicted cell type fractions obtained using EstimateCellCounts library (x-axes) and the IDOL library (y-axes) for the Liu and the Hannum data sets,respectively. c, d Distribution of the difference in the R2 computed from the IDOL and EstimateCellCounts libraries for the (c) Liu and (d) Hannumdata sets. e, f Estimated number of additional samples needed (y-axis, left) and approximate additional cost (y-axis, right) as a function of the desireddifference in DNA methylation to be detected (x-axis) when correction for cell mixture was carried out using the EstimateCellCounts library. Varianceestimates were obtained from the (e) Liu and (d) Hannum data setsthe optimal libraries identified at each size considered,our results showed a trend toward diminishing predictionperformance for sizes exceeding 500 L-DMRs. Thoughcaution should be exercised when drawing conclusionson the basis of a single analysis, these results seem sug-gest that when it comes to assembling libraries for CMD,the quality of selected L-DMRs takes precedence overtheir quantity (i.e., library size). Despite being half thesize, the prediction performance observed for the finalIDOL library was on par with, and in many cases bet-ter than, EstimateCellCounts across both the training andindependent testing sets. We hypothesized that the bet-ter performance associated with the IDOL library was aresult of its ability to find libraries that better characterizethe unique methylomic fingerprint of leukocyte subtypes.To examine this hypothesis, we compared each library interms of how well it performed in discriminating eachpair of cell types by computing the DSC. The results ofthis analysis showed that the IDOL library better discrim-inated 14 out of the 15 pairs of leukocyte subtypes, withsignificantly improved discrimination strength across theentire immune cell landscape. This observation is note-worthy in that it may suggest a framework for gauging theprognostic potential of DMR libraries in the absence ofDNA methylation data sets with available immune profil-ing information, the “gold-standard” for assessing the pre-diction performance associated with different libraries.More importantly, our results serve to illustrate a key fac-tor underlying the accuracy of cell composition estimatesobtained via CMD, namely, that prediction accuracy isstrongly related to a library’s ability to provide a powerfuldiscrimination of the entire cellular landscape.Koestler et al. BMC Bioinformatics (2016) 17:120 Page 11 of 21While the library used by EstimateCellCounts is asignificant improvement over the TopANOVA approachfor library assembly, it imposes an equal number of cell-specific L-DMRs for all cell types. In principle, this wouldbe reasonable if cell types were mutually distinct fromone another, however this is not the case for white bloodcell types whose DNA methylation signatures are lineage-specific [3, 4, 13]. Because of the shared lineages of leuko-cyte subtypes, more or fewer L-DMRs might be neededfor certain cell types depending on strength of their sig-nal, within cell-type variability of those markers, and thelineage relationships between cell types. The data-drivenapproach for assembling libraries characteristic of IDOLindirectly addresses this issue by iteratively searching forthe subset of L-DMRs that optimize the accuracy of CMD,with no a priori constraints on the number of cell-specificL-DMRs used in assembly of libraries. As demonstratedhere, this approach resulted in a library that demon-strated highly accurate cell composition estimates in alldata sets considered in our examination. Although theEstimateCellCounts library showed similar performanceacross the testing sets, the results of our simulation studyand data applications showed that even modest improve-ments to the overall accuracy of cell fraction estimatesresults in non-negligible differences in the false posi-tive rate and statistical power for EWAS. In particular,our simulation studies showed that when differences inthe underlying cellular distribution between groups arelarge, tests of differential methylation adjusted for cellcomposition estimates obtained using the default Esti-mateCellCounts library can lead to an estimated 5% infla-tion in the false positive rate compared to adjustmentsmade using the IDOL library. On the scale of EWAS,which typically involve testing hundreds of thousands tomillions of CpGs, this amounts to thousands to tens-of-thousands of CpGs being incorrectly classified as differ-entially methylated. Moreover, in both data applicationscell fraction estimates obtained using the identified IDOLlibrary demonstrated an improved ability to explain varia-tion in whole-blood-derived DNAmethylation signatures.This lead to increased statistical power, and as a result,fewer samples needed when cell composition correctionwas carried out using the IDOL library. Although the Liuand Hannum data applications revealed relatively minordifferences in the number of samples needed betweenlibraries, the corresponding cost-differential for a singlestudy can be on the order of several thousand dollars con-sidering the current cost of the Illumina HumanMethyla-tion450 array (http://www.illumina.com/), a figure whosemagnitude becomes substantial when taken across theentire spectrum of past, present, and future EWAS involv-ing adjustments for cell composition via CMD.Notwithstanding the potential of IDOL to identify L-DMR libraries that enhance the accuracy of cell typepredictions obtained through CMD, this method is notwithout certain limitations. As IDOL does not includean evaluation of the prediction performance of all possi-ble combinations of L-DMR libraries (i.e.,(LJ)), there isno guarantee that IDOL will arrive at globally optimalsolutions. Because of the inherent computational bur-den that would be required to ensure global optimalityin this case, we opted for a more computationally par-simonious approach wherein libraries are identified bysequentially selecting L-DMR subsets preferentially com-prised of L-DMRs that were previously marked as bene-ficial to prediction accuracy in previous IDOL iterations.Our procedure resulted in a optimized library consist-ing of 300 L-DMRs, which compared favorably to existingL-DMR libraries and demonstrated excellent predictionperformance in two independent testing data sets. Thus,while global optimality cannot be guaranteed our resultsare encouraging and provide assurance of the capacity ofIDOL to identifying libraries that result in highly accurateestimates of cell composition.It also deserves mentioning the the ability of IDOL tofind libraries that better characterize the unique methy-lomic fingerprint of leukocyte subtypes comes at theexpense of moderate increases in computational timecompared to existing techniques for library assembly.Along these lines, the leave-one-out procedure employedin Step 4 of IDOL may unnecessarily contribute to slowerconvergence and thus increased computational demands.To this end, bootstrap resampling [30] as a substitute forour leave-one-out procedure may lead to faster conver-gence of IDOL and represents a potential opportunityfor future enhancements to this methodology. Finally,while the applications presented herein targeted theHM450 BeadArray, we note that IDOL is generalizable toother platforms (i.e., whole-genome bisulfite sequencing,Illumina HumanMethylationEPIC BeadArray, ect.) pro-vided that the reference methylomes for isolated leuko-cyte subtypes are available on those platforms. As interestin this area continues to grow, future studies should aim tocompare the L-DMR library identified here to those iden-tified from technologies with expanded coverage of themethylome.ConclusionsMotivated by the critical importance of accounting forcellular distribution when DNA methylation is assessedin heterogeneous tissue types [6–9, 20], along with thelogistical and economic considerations that often ren-der direct measurements of cell composition infeasible,our work fills a critical gap in the EWAS literature byreinforcing the importance of library assembly and its crit-ical role in CMD. Further, we provide the epigenomicsresearch community with a L-DMR library, optimizedto improve the accuracy of cell distribution estimates inKoestler et al. BMC Bioinformatics (2016) 17:120 Page 12 of 21blood-derived biospecimens from human adults. Impor-tantly, while motivated by the problem of deconvolutingthe cellular mixture of whole blood, this research pro-vides a framework for the systematic construction of DMRlibraries in general, and represents a viable approach forlibrary assembly for EWAS moving forward.MethodsIn what follows, we begin by describing the DNA methy-lation array data sets used in this research as well as thesteps implemented in their preprocessing and quality con-trol. We next provide an overview of cell mixture decon-volution and the IDOL algorithm. Finally, we describeour application of IDOL, metrics employed for assess-ing and comparing cell type prediction performance, andfinish by describing a data application for exploring theimplications of cell composition adjustment in EWAS.Cell mixture reconstruction experimentPurified granulocytes, monocytes, CD4T, CD8T, naturalkiller cells, and B cells from normal human subjects werepurchased from AllCells LLC (Emeryville, CA). As descri-bed (http://www.allcells.com/normal-peripheral-blood/)ethical approval, including all consents and protocols,have been approved by an independent review board.Both positive and negative selection for relevant cell sur-face proteins was conducted by AllCells using antibodiesconjugated to magnetic beads and protocols from Mil-tenyi Biotec, Inc. (Auburn, CA). DNA was extracted frompurified blood leukocyte subtypes using the DNeasy bloodand tissue kit (QIAGEN, Valencia CA) or the AllPrepDNA/RNA/Protein Mini Kit (QIAGEN) using previouslydescribed protocols [31]. DsDNA was quantified usinga Qubit 3.0 fluorometer (Life Technologies). Followingquantification, DNA extracted from purified leukocytesubtypes were mixed in predetermined proportions toreconstruct two distinct sets, consisting of n = 6 sampleseach. The first set of reconstructed samples used mixturesof purified leukocyte subtype DNA in relatively equiva-lent proportions across the leukocyte subtypes, hereafterreferred to as the MethodA samples. For the second setof six samples, the proportion of DNA for each leukocytesubtype were selected to resemble their relative fractionsin the peripheral blood of normal human adult subjects(MethodB samples). All DNA samples were bisulfite mod-ified using the Zymo EZ DNA Methylation kit (Irvine,CA) and epigenome-wide DNA methylation assessmentwas performed using the Illumina HumanMethylation450array platform.Adult whole blood (WB) samplesAn additional n = 6 whole blood (WB) samples fromdisease-free adult donors with available immune cellprofiling data from flow cytometry were purchased fromAllCells LLC. Inclusion and exclusion criteria for donorsas well as a statement describing the ethical approval ofsuch samples can be found on the AllCells LLC webpage(http://www.allcells.com/normal-peripheral-blood/). Wehereafter refer to this data set as the AdultMixed set. DNAextraction and bisulfite modification of the AdultMixedsamples followed an identical protocol to that describedabove, with epigenome-wide DNA methylation profil-ing performed using the Illumina HumanMethylation450array platform.Reference DNAmethylomes for isolated leukocytesubtypesTo identify L-DMRs and as the basis of all applications ofCMD, we used a publicly available data set (GEO Acces-sion ID: GSE35069) consisting of epigenome-wide DNAmethylation profiles for the same six leukocyte subtypesused in our reconstruction experiments. Further detailsconcerning the study participants, purification of bloodcell populations, and DNA extraction have been previ-ously described [3].Additional DNAmethylation data setsIn addition to the aforementioned data sets, we also madeuse of two of the largest publicly available blood-derivedDNA methylation data sets currently available on GeneExpression Omnibus (Accession numbers: GSE42861 andGSE40279). Collectively, these two data sets consist ofWB DNA methylation data on >1200 adult patientsand were used here for the purpose of understandingthe implications of cell mixture adjustment when mix-ture fractions were estimated using differing L-DMRlibraries. The first data set (Liu) consisted of blood-derived DNA methylation data on 689 human subjects,including n = 354 rheumatoid arthritis and n = 335non-diseased control patients [16]. The second data set(Hannum) included blood-derived DNAmethylation dataon 656 non-diseased adults, ranging in age from 19 to 101years old [29]. For both data sets, epigenome-wide DNAmethylation assessment was performed using the IlluminaHumanMethylation450 array platform.Quality control and preprocessing of the DNAmethylationdata setsFor each of the data sets used in this research (Table 1),background subtraction and normalization utilizingvarious internal controls present on the Methylation450BeadChip was conducted using the publicly available,minfi Bioconductor package (http://bioconductor.org).Every beta-value on the HumanMethylation450 arrayplatform is accompanied with a detection p-value, rep-resenting the confidence that the signal intensities forthat locus exceed the background defined by the nega-tive control probes. To ensure high-quality methylationKoestler et al. BMC Bioinformatics (2016) 17:120 Page 13 of 21Table 1 Summary of the data sets used in this researchBiospecimen Name Details N Training ortestingGEO ID DescriptionWhole Blood (WB) AdultMixed Unfractioned peripheralblood leukocytes (PBL)6 Training GSE77797 DNAm profiling of WBsamples collected from 6different healthy adultdonors.MethodA Reconstructed cell mixtures 6 Testing GSE77797 DNAm profiled in samplesconsisting of mixtures ofCD4T, CD8T, NK, B cells,Monocytes, and Granulocytes,mixed predeterminedproportionsMethodB Reconstructed cell mixtures 6 Testing GSE77797 DNAm profiled in samplesconsisting of mixtures ofCD4T, CD8T, NK, B cells,Monocytes, and Granulocytes,mixed predeterminedproportionsLiu Unfractioned peripheralblood leukocytes (PBL)689 Testing GSE42861 DNAm profiled in WBsamples collected fromn = 354 rheumatoidarthritis and n = 335 non-diseased control patients[16]Hannum Unfractioned peripheralblood leukocytes (PBL)656 Testing GSE40279 DNAm profiled in WBsamples collected from atotal of 656 adults rangingin age from 19–101 yearsold [29]Isolated cell types (Reference set) Granulocytes (Gran) Purified CD16+ cells 6 Both GSE35069 DNAm profiling in purifiedcell types [3]Monocytes (Mono) Purified CD14+ cells 6 Both GSE35069 DNAm profiling in purifiedcell types [3]B cells (Bcell) Purified CD19+ cells 6 Both GSE35069 DNAm profiling in purifiedcell types [3]Natural Killer (NK) Purified CD56+ cells 6 Both GSE35069 DNAm profiling in purifiedcell types [3]CD8T Purified CD3+CD8+ cells 6 Both GSE35069 DNAm profiling in purifiedcell types [3]CD4T Purified CD3+CD4+ cells 6 Both GSE35069 DNAm profiling in purifiedcell types [3]data, CpG loci having a sizable fraction (>25%) ofdetection p-values above a predetermined threshold(detection p > 10−5) were excluded from our analysis[20]. Also, we employed Subset quantile within array-normalization (SWAN) to adjust the beta-values of type2 design probes into a statistical distribution charac-teristic of type 1 probes [32]. Finally, the presence ofbatch-effects, or technical sources of variability inducedby plate and/or BeadChip, was assessed using principalcomponents analysis (PCA) [33, 34]. Specifically, PCAwas fit to the background subtracted and normalizedmethylation data and the top S principal components(S determined using a previously described approach[35]) were examined in terms of their association withplate and BeadChip. If plate and/or BeadChip wasfound to be significantly associated with any of thetop S principal components (p < 0.05), we appliedComBat [36, 37], an empirical Bayes batch-adjustmentmethodology that has become a standard pre-processingtechnique for array-based DNA methylation data[7, 20, 38].Cell mixture deconvolutionTo motivate the IDOL algorithm, we provide a briefdescription of CMD, referring interested readers toHouseman et al. (2012) for further details. Let Yi =[Yi1,Yi2, . . . ,YiJ ] represent the methylation beta-valuesacross J CpG loci for target sample i. Further assume thatfor target sample i, DNA methylation was assessed over aheterogeneous cell population, comprised of a mixture ofK underlying cell types whose proportions within samplei are given by: ωi =[ωi1,ωi2, . . . ,ωiK ].Koestler et al. BMC Bioinformatics (2016) 17:120 Page 14 of 21As first described in Houseman et al. (2012), the methy-lation signature of sample i is assumed to arise as aweighted mixture of the DNA methylation signature ofeach of the K underlying cell types:E[Yi]= ωiμ′, 0 ≤ ωik ≤ 1 andK∑k=1ωik ≤ 1 (1)where μ is a J × K matrix of mean methylation beta-values whose rows represent the same ordering of the JCpGs in Yi and whose columns represent the K distinctcell types. Thus, the (jk)th element of μ represents thepopulation mean beta-value for CpG j among cell typek. Following from Eq. (1), the objective of CMD involvesestimating the mixture weights ω˜i that minimize:argminωi ||Yi − ωiμ′||2 (2)subject to the aforementioned constraints on ωi. Becauseμ is unobserved in practice, it is substituted with itssample mean M, estimated from one of several possibleexisting reference methylation data sets [3, 13].The mainstay of CMD is that knowledge of the methy-lomic fingerprint of each cell type - represented by the col-umn space of M - can be used to estimate their fractionswithin a sample consisting of a heterogenous mixture ofthose same cell types, Yi. As such, the ability to accu-rately estimate the underlying mixture composition of agiven target sample depends entirely on the J CpGs (i.e.,L-DMR library) being used as the basis of CMD. Ideally,L-DMR libraries should consist of CpGs whose methyla-tion signature is maximally distinct across the K cell typesand whose within-cell-type variation is minimal. Hence,efforts to improve the accuracy of cell composition esti-mates obtained through CMD should focus on identifyingL-DMR libraries that satisfy the above criteria. To date,several strategies have been been proposed for assemblingL-DMR libraries.The first of such strategies involved assembling librariesusing the J J CpGs with the largest F-statistics com-puted from a series of ANOVA models fit to the DNAmethylation profiles of purified isolated leukocyte celltypes [13]. While reasonable in principle, using ANOVAF-statistics as the criteria for constructing libraries hasthe major limitation that libraries can become oversat-urated with CpGs that discriminate certain leukocytesubsets (i.e., lymphoid- versus myeloid-cell-types), butlack sufficient signal for distinguishing closely relatedcell types. Recent attempts to address the limitations ofthe “ANOVA-based” strategy have instead used the topL hyper- and hypomethylated CpGs for each cell type,selected from a rank ordering of CpGs by their t-statisticcomputed from two-sample t-test comparisons of themethylation signature of each cell type against all othercell types [8]. This procedure is implemented in the Bio-conductor package minfi:EstimateCellCounts [23], where,by default, the top 50 hyper- and hypomethylated CpGsfor each cell type (i.e., CD4T, CD8T, NK, B cell, monocyte,granulocyte) are used to assemble the L-DMR library. Byimposing an equal representation L-DMRs for each celltype, this strategy is much less prone to the oversaturationproblem characteristic of the “ANOVA-based” approach;the net effect being improved discrimination of closelyrelated cell types and as a result, more accurate estimatesof cell composition.Algorithm for the optimal selection of L-DMRsWhile the strategy for library assembly used by Esti-mateCellCounts is less susceptible to the types of issuesthat can arise when rank ordering CpGs using the F-statistic, it has several limitations that may curtail theaccuracy of cell fraction estimates. In particular, becauseCpGs are selected irrespective of any evaluation of theircontribution to the accuracy of cell fraction estimates,the EstimateCellCounts library may not necessarily coin-cide with the optimal set of CpGs for cell composi-tion prediction. In addition, EstimateCellCounts uses alibrary that is comprised of an equal number of cell-specific L-DMRs (i.e, top 50 hyper- and hypomethylatedcell-specific CpGs). While preventing scenarios wherelibraries are oversaturated with CpGs that only discrimi-nate certain subsets of leukocytes, the assumption of anequal number of cell-specific CpGs may not necessar-ily correspond with optimal prediction accuracy. Finally,although using top hyper- and hypomethylated CpGsacross each cell type for library assembly is an intuitiveand sensible approach, it is possible that there exists anon-overlapping set of L-DMRs that outperform the tophyper- and hypomethylated CpGs in terms of predictionaccuracy.To address the limitations of existing L-DMR libraries,we propose IDOL, an algorithm that iteratively searchesfor libraries that improve the accuracy and precision ofCMD. It is important to note that IDOL requires a train-ing data set for calibrating the selection of optimal DMRlibraries. For example, when focus is centered on identi-fying optimal DMR libraries for deconvoluting peripheralblood, training data sets should consist of samples withboth WB DNA methylation signatures and direct mea-surements of the underlying cell distribution of thosesamples; i.e., CBC, FACS, etc. In what follows, we pro-vide a detailed description of each step of the IDOLalgorithm.Step 0: Construction of the candidate L-DMR searchspacea. Similar to [8], a series of two-sample t-tests (orsimilar methodology) are fit to the J arrayed CpGsKoestler et al. BMC Bioinformatics (2016) 17:120 Page 15 of 21and used to compare the mean methylationbeta-values between each cell type against the meanbeta-values computed across all other cell types.b. Identify the L/2CpGs with the largest t-statistics andthe L/2 CpGs with the smallest t-statistics for eachof the K cell types, where L is a tuning parameterrepresenting the number of cell-specific L-DMRs.c. Construct a setQ, which consists of the Lcell-specific L-DMRs identified in (b). Thus,Q iscomprised of P = L × K putative L-DMRs, andrepresents the candidate search space for thesubsequent steps of IDOL. It should be noted thatthere are trade-offs in the selection of L. Whereaslarge values of L broaden the candidate space inwhich to search for optimal L-DMR libraries, thiscomes at the expense of increased computationalburden. Conversely, while small L results in lowercomputational costs, this comes with the riskmissing potentially predictive L-DMRs due to anarrower candidate search space. Since the IDOLalgorithm needs to be applied only when thereference methylomes for “new” cell types are addedto those that currently exist (i.e., CD4T, CD8T, NK,Bcell, Monocytes, and Granulocytes), or if onewishes to identify optimized L-DMR libraries basedon different technologies for interrogating themethylome (i.e., Illumina Human Methylation EPICBeadArray, whole genome bisulfite sequencing, etc.),we advise users to select L to be arbitrarily large toensure a broad enough candidate search space.d. In addition to pre-selecting L, the user also needs topre-select J P, representing the library size. It isimportant to note that special care should be givenin the selection of J, as the accuracy and precisionof cell proportion estimates are sensitive to itsspecification [22]. We provide specific suggestionsits selection at the end of this section.Step 1: Random assembly of L-DMR librariesa. At iteration , J CpGs are randomly selected fromQ with probability π()j , j = 1, 2, . . . ,P. At iteration0, every CpG among the P candidate L-DMRs has anequal chance of being selected, i.e., π(0)j = 1/P,∀j ∈ Q.b. LetQ() ⊂ Q represent the randomly assembledL-DMR library, comprised of the J randomlyselected CpGs at iteration .Step 2: Cell composition estimation using randomlyassembled librarya. Using the randomly assembled libraryQ(), applyCMD to a training set to obtain cell compositionestimates: ω˜i, where i = 1, . . . ,N1 and N1 representsthe number of training samples.b. The resulting set of predictions are given as˜ =[ ω˜1, ω˜2, . . . , ω˜N1 ], where 0 ≤ ω˜i ≤ 1 is a K × 1vector of the predicted cell proportions for trainingsample i. Further define ˜k =[ ω˜1k , ω˜2k , . . . , ω˜N1k] asthe predicted proportions for cell type k across theN1 training samples.Step 3: Assessing the accuracy of cell compositionestimates: Given the strengths and limitations of purelyrelative and absolute measures for assessing predictionperformance [39], we propose using both the R2 and rootmean square error (RMSE) as the basis for our assess-ments. Let = [ω1,ω2, . . . ,ωN1 ] represent the observedcell proportions for the N1 target samples obtained viaCBC, FACS, etc. The proportion of variation in theobserved fraction of cell-type k (k) explained by itspredicted fraction (˜k) is computed as:R2k = 1−1′N1(k − ̂k)′(k − ̂k)1N11′N1(k − ¯k)′(k − ¯k)1N1, 0 ≤ R2k ≤ 1where ¯k = ∑N1i=1 k/N1 is an estimate of the meanobserved fraction of cell-type k and ̂k represents thelinear predictor obtained from regressing k on ˜k . Inparticular,̂k = β̂k˜kwhere, β̂k = (˜k ′˜k)−1˜k ′k . Thus, R¯2 = 1K∑Kk=1 R2krepresents an estimate of the mean coefficient of deter-mination across the K cell types. Additionally, the RMSEfor cell type k = 1, 2, . . . ,K is computed here using thefollowing expression:RMSEk =√1′N1(k − ˜k)′(k − ˜k)1N1N1,0 ≤ RMSEk < ∞with M¯ = 1K∑Kk=1 RMSEk representing an estimate ofthe mean RMSE across the K cell types. Given the above,IDOL seeks to find L-DMR libraries whose cell-typepredictions minimize M¯ and maximize R¯2. As describedin further detail below, both M¯ and R¯2 are used for deter-mining the contribution of each CpG in Q() on overallprediction performance.Step 4: Leave-one out procedure: In order to assess theindividual contribution of each CpG in Q(), we imple-ment the following leave-one-out procedure:Koestler et al. BMC Bioinformatics (2016) 17:120 Page 16 of 21a. Each of the J∗ CpGs contained inQ() are iterativelyremoved to obtain the following setsQ()−j , whichinclude all CpGs inQ(), except for CpG j.b. Steps 2–3 are repeated for each reduced libraryQ()−jand used to obtain (M¯−j, R¯2−j); estimates of theoverall RMSE and coefficient determination whenCpG j is excluded from the L-DMR library.Conceptually, when R¯2−j is small relative to R¯2, thissuggests that withholding CpG j fromQ() resultedin predictions that, on average, accounted for asmaller proportion of variation in the observed cellfractions. Conversely, when R¯2−j > R¯2, withholdingCpG j fromQ() resulted in predictions thataccounted for a larger proportion of variation in theobserved cell proportions. A similar argument holdsfor the relationship between M¯−j and M¯.c. From (b), it is clear that in subsequent IDOLiterations we would want to preferentially keepCpGs whose M¯ − M¯−j < 0 and R¯2 − R¯2−j > 0. Thisobservation implies a framework for updating theselection probabilities of each CpG.Step 5: Updating selection probabilities:a. Since R2 and RMSE are measured on differentscales, we begin by normalizing both M¯−j and R¯2−j toobtain U−j and V−j, j = 1, . . . J∗ respectively:U−j = M¯−j − M¯sd (M¯−j) , V−j = R¯2−j − R¯2sd(R¯2−j)where −∞ < U−j < ∞ and −∞ < V−j < ∞.b. Noting that CpG j should be preferentially updatedto have a larger probability of selection when bothU−j and −V−j are large, we generate a compositemeasure by first converting (U−j,−V−j) from theCartesian coordinate system to the polar coordinatesystem:r−j =√δU2−j + (1 − δ)(−V−j)2θ−j = atan2(−(1 − δ)V−j, δU−j)where atan2 is a common variation of the arctangent function, r−j is the radial coordinate, θ−j isthe angular coordinate, and 0 ≤ δ ≤ 0 is a parameterthat controls the balance between relative andabsolute prediction performance. For example, whenδ = 1/2, a CpG’s influence on relative and absoluteprediction performance receives equal weight. Whenδ → 1 a CpG’s influence on absolute predictionperformance receives more weight and when δ → 0,a CpG’s influence on relative prediction performancereceives more weight. The increment for modifyingthe selection probability of CpG j is given as:p−j = r−jcos(θ−j − π/4), −∞ ≤ p−j ≤ ∞For the purpose of exposition, when δ = 1/2, CpGswith the largest increment in selection probability(i.e., large p−j) are those with large r−j and θ−j closeto π/4 radians (Fig. 2b,c). Conversely, CpGs with thelargest decrease in selection probability (i.e., smallp−j) are those with large r−j and θ−j close to 5π/4.When p−j ≈ 0, this implies that either r−j is small orθ−j is close to (3π/4, −π/4) radians and suggeststhat withholding CpG j fromQ() is neither helpfulnor detrimental to prediction performance. In thesesituations, the selection probability should remainunchanged.c. This brings us to the following procedure forupdating selection probabilities,π(+1)j =ρ(+1)j∑j∈Q ρ(+1)j, 0 ≤ π(+1)j ≤ 1 (3)where,ρ(+1)j ={π()j expit(p−j) + π()j /2 if j ∈ Q()π()j if j ∈ Q()(4)and expit is the inverse-logit function, i.e.,expit(x) = exp(x)/(1 + exp(x)). Thus, selectionprobabilities for each j ∈ Q() are modified based onhow beneficial/not beneficial each CpGs wasdetermined to be in the presence of the remainingJ − 1 CpGs. As noted from Eqs. (3 and 4), theprobability of selection is unchanged for CpGsj ∈ Q() as well as for CpGs where p−j ≈ 0.Step 6: Continue Iteration: Using the updated proba-bilities, π(+1)j , j = 1, . . . ,P, repeat steps 1-5. The finalsolution consists of the library comprised of the J CpGswith the largest selection probabilities (Fig. 2).As previously described, because the accuracy and pre-cision of cell proportion estimates are sensitive to thespecification of J, special treatment should be giventowards its selection. Although computationally demand-ing, our strategy for determining J involves fitting IDOLacross a range of possible values for J, (i.e., J ={50, 100, 200, . . .}) followed by a comparison of predictionperformance across each of the specified values. Undersuch a framework, we select the smallest value of J uponwhich the gains in prediction performance for increasingvalues of J is minimal, (i.e., within some predeterminedtolerance of the performance metrics).Koestler et al. BMC Bioinformatics (2016) 17:120 Page 17 of 21Application and assessment of IDOLTraining the L-DMR selection algorithmTo examine the robustness of IDOL, we employeda training and testing procedure and benchmarkedtheprediction performance of the library identified byIDOL against the widely used EstimateCellCounts func-tion in the minfi Bioconductor package. Specifically, wefirst applied IDOL to the AdultMixed samples (TrainingSet) to identify “optimal” L-DMR libraries for decon-voluting the cell distribution of whole blood. As pre-viously described, the AdultMixed samples consisted ofboth flow cytometric measurements and whole bloodDNA methylation data derived from the same set ofbiospecimens used for flow cytometry. To examine thesensitivity of prediction performance based on the num-ber of L-DMRs used for deconvoluting cellular mixture,we applied IDOL to the training samples assuming arange of possible values for J, specifically assuming J ={100, 200, 300, 400, 500, 600, 700, 800}. The final selectionof J and the representative IDOL library used in oursubsequent validation analyses was chosen to be thevalue J that resulted best prediction performance inthe training set. Finally, in training the IDOL algorithm,selection probabilities of putative L-DMRs were updatedassuming equal weights in terms of their contribu-tion to relative and absolute prediction performance,i.e., δ = 1.Following the application of IDOL to the training set,we next examined the overlap between the “optimal”IDOL library and the 600 L-DMRs currently used byEstimateCellCounts. In order to comprehend the natureof the difference between these libraries and how suchdifferences might influence their propensity for accuratecell fraction estimates, we computed the dispersion sepa-rability criterion (DSC). The DSC was initially developedas a metric for quantifying the extent of batch effectsin ’omic data sets, and is computed as the ratio of theaverage distance between batch centroids and the globalmean (Dbetween) and the average distance between sam-ples belonging to the same batch (Dwithin). Larger val-ues of DSC indicate greater dispersion between batchesthan within batches; i.e., samples within batches are morehomogeneous compared to samples in different batches.In the same way, the DSC can be used for quantifyingthe dispersion between and within specific leukocyte sub-types based on a given set of L-DMRs, substituting batchwith cell-type identity of a given sample. Using refer-ence DNA methylation data profiled across the six majorleukocyte components of whole blood [3], we computedthe overall DSC and the DSC between each pair of celltypes (i.e., CD4T vs CD8T, CD4T vs NK, etc.) using boththe “optimal” IDOL library and the EstimateCellCountslibrary. Equation 5 provides the DSC formula for pair-wise comparisons, where (r, s) denotes the two cell typesbeing compared, D(r, s)between represents the the average dis-tance between cell type centroids and the global mean,and D(r, s)within represents average distance between samplesof the same cell type.DSC(r, s) = D(r, s)betweenD(r, s)within, (r, s) ∈ {(1, 2), . . . ,(1,K), . . . , (K − 1,K)}(5)In order to assess which L-DMR library exhibited betterperformance at discriminating specific pairs of cell types(i.e., (r, s)), we computed the difference betweenDSCs cal-culated from the IDOL and EstimateCellCounts libraries(Eq. 6). (r, s) = DSC(r, s)IDOL − DSC(r, s)EstimateCellCounts (6)Based on Eq. 6, (r, s) = 0 signifies no differencebetween the IDOL and EstimateCellCounts libraries fordiscriminating cell types r and s, whereas large posi-tive or negative values of (r, s) signify improved dis-crimination associated with the IDOL library (former)or the EstimateCellCounts library (latter). To test thehypothesis that (r, s) = 0, we conducted a non-parametric, randomization-based test. Specifically, p-values were computed by comparing the observed DSCdifferences to the empirical null distribution, gener-ated through repeated random permutations of the data.Randomization-based p-values less than 0.05 were treatedas statistically significant.Independent validation of the optimal L-DMR setTo validate IDOL, we applied CMD to two independenttest sets (MethodA and MethodB sets) using the optimalIDOL library identified in the training set. Our choice touse the MethodA and MethodB samples as our testingsets was motivated by the fact that the samples in bothsets were obtained by mixing leukocyte subtype-specificDNA in known, predetermined proportions. Thus, fora given sample, the underlying leukocyte fractions areknown with high confidence and are likely less prone tothe measurement error associated cell sorting/countingtechniques. As such, the MethodA and MethodB setsrepresent ideal data sets for validating the prognostic per-formance of the optimal L-DMR library identified in thetraining set.To assess the performance of our cell type predic-tions, we estimated the proportion of variation of theknown, reconstructed mixture fractions explained by ourcell type predictions (i.e., R2) as well as the average devi-ation between the reconstructed mixture fractions andour predictions (i.e., RMSE). R2 and RMSE were com-puted for each cell type individually, across all testingsamples and within each testing set separately. The ratio-nale for latter was to examine the robustness of the IDOLKoestler et al. BMC Bioinformatics (2016) 17:120 Page 18 of 21library when the underlying cellular landscape differed(see Section ‘Cell mixture reconstruction experiment’for further details on the MethodA and MethodB recon-struction experiment). As an additional comparison andto benchmark the performance of the IDOL libraryfor accurately deconvoluting cellular mixture, we alsoapplied the minfi:EstimateCellCounts function (using itsdefault options). In a similar manner, cell-specific R2 andRMSE were computed based on the cell type predic-tions obtained from EstimateCellCounts, both within andacross the two MethodA and MethodB sets.Simulation study comparing false discovery rates (FDR)across different cell composition adjustment techniquesTo understand the consequences of prediction error incell fraction estimates for EWAS, we conducted a sim-ulation study to compare the false discovery rate (FDR)when different strategies for cell composition adjustmentwere employed. For our simulations, we assumed simplis-tic study design that, typical of many EWAS, focused onthe identification of differentially methylated CpG sitesbetween two groups, i.e, case/control comparison. Todetermine if the relationship between cell compositionadjustment method and FDR was sensitive to the studysample size (i.e., n = n1+n2), we conducted separate sim-ulations that ranged from small/moderately sized studies(i.e., n1, n2 = {50, 100}) to large studies (i.e., n1, n2 ={250, 500}). In addition to varying the sample sizes of eachgroup, we also examined the relationship between FDR asa function of the dissimilarity in the true, simulated celldistribution between the two groups.To motivate the design of our simulation study, weassumed that the methylation beta-value for CpG j amongtarget sample i, Yij, follows a beta-distribution with expec-tation and variance given by:ωiμ′j and(1−ωiμ′j)ωiμ′j1+φj , respec-tively. As previously, ωi is vector of length K representingthe true underlying cell fractions for sample i, μj is avector whose elements represent the population meanbeta-values for CpG j across the K cell types, and φj > 0 isthe unobserved dispersion parameter for CpG j. LettingXidenote the group membership for sample i, many EWASinvolve fitting regression models that have the followingform:Yij = α0j + α1jXi +K−1∑k=1γkjωik + ij,E[ ij]= 0 and V[ij]= σ 2j(7)where the term∑K−1k=1 γkjωik is introduced to control forcell composition differences across subjects and ij cap-tures the remaining variation in methylation after takinggroup status and cellular composition into account. In theabove regression model, interest is typically centered ontesting the hypothesis of no difference in DNA methy-lation levels between groups, i.e., α1j = 0. However, inpractice ωik is unknown and typically substituted with itsestimate ω˜ik , obtained for example by CMD [13]. Since ω˜ikis an estimate and therefore subject to uncertainty, testsof hypothesis and confidence intervals based on model 7can become unreliable and prone to inflated Type 1 and 2error rates.To examine how cell type prediction errors associatedwith the IDOL and EstimateCellCounts libraries impactthe FDR for testing α1j, we first estimated the uncertaintyof cell fraction predictions for each method by squaringthe RMSEs computed across the MethodA and MethodBtesting sets to obtain the mean squared prediction errors(MSPEs):τ̂ 2kl = MSPEkl = RMSE2kl =√√√√ 1NN∑i=1(ωik − ω˜ikl)2,k = 1, 2, . . .K(8)where l is an index representing the library used for CMD(i.e., l = {EstimateCellCounts, IDOL}) and N representsthe total sample size for the testing data (i.e., N = 12for the MethodA andMethodB sets). After obtaining esti-mates of precision, τ̂ 2kl, we implemented the followingseven steps in our simulation study:1. Randomly sample G = 10, 000 CpGs from theIllumina HumanMethylation450 array.2. Estimate the dispersion parameter within thecombined testing sets for each of the G randomlyselected CpGs, φ̂g , g = 1, 2, . . .G. In addition, usingthe reference leukocyte methylation data [3], estimatecell-specific mean methylation beta-values for each ofthe G CpGs,mkg , g = 1, 2, . . .G and k = 1, 2, . . .K .Parameter estimation was carried out using methodof moments estimation.3. Randomly generate the cell distribution for groups 1and 2.a. For group 1, simulate the cell distribution, ω(1),from a Dirichlet distribution with concentrationparameters, ν(1) =[ ν(1)1 , ν(1)2 , . . . ν(1)K ].b. For group 2, simulate the cell distribution, ω(2),from a Dirichlet distribution with concentrationparameters, ν(2) =[ ν(2)1 , ν(2)2 , . . . ν(2)K ].4. For both groups, simulate methylation beta-values foreach of the G CpGs from a beta-distribution.a. For each of the n1 samples in group 1, randomlysample beta-values Y (1)ig from a beta-distributionwith mean ω(1)m′g and variance(1−ω(1)m′g )ω(1)m′g1+φ̂g .Koestler et al. BMC Bioinformatics (2016) 17:120 Page 19 of 21b. For each of the n2 samples in group 2, randomlysample beta-values Y (2)ig from a beta-distributionwith mean ω(2)m′g and variance(1−ω(2)m′g )ω(2)m′g1+φ̂g .5. Randomly sample cell type predictions for eachsample (i.e., ω˜(1)il and ω˜(2)il ) based using thecell-specific uncertainty estimates (Eq. 8) associatedwith the EstimateCellCounts and Optimized L-DMRmethods.a. For each of the n1 samples in group 1, obtain ω˜(1)ilby randomly sampling from a multivariate normaldistribution with mean ω(1) and variance-covariance, (1)l = diag(̂τ 2kl), k = 1, 2, . . .K andl = (EstimateCellCounts or IDOL).b. For each of the n2 samples in group 2, obtain ω˜(2)ilby randomly sampling from a multivariate normaldistribution with mean ω(2) and variance-covariance, (2)l = diag(̂τ 2kl), k = 1, 2, . . .K andl = (EstimateCellCounts or IDOL).6. Fit model 7 to each of the G CpGs, adjusting for cellcomposition using the cell type predictions generatedin Step 4. Based on the model fit, test the hypothesis,H0 : α1g = 0, for g = 1, 2, . . . ,G.7. Calculate the FDR for each method assuming anominal p-value cutoff of 0.05 for declaring CpGs asstatistically significant.8. Repeat steps 1-7.Since the beta-values for groups 1 and 2 were simulatedassuming no group effect (i.e., assuming α1g = 0),the methylation profile for groups 1 and 2 differ onlywith respect to the dissimilarity in the cell compositionbetween groups, Dissimilarity := ||ω(1) − ω(2)||. Thus,rejections of the hypothesis H0 : α1g = 0 based on fittingmodel 7 to the simulated data signify Type 1 errors. As ameasure to ensure that the FDR was correctly controlledat 5% in models that controlled for the true, simulatedcell distributions, we also augmented our simulation studywith models that included adjustment for terms, ω(1) andω(2).Data application for exploring the implications of cellcomposition adjustment in EWASTo further understand the implications of cell typeprediction methodology for EWAS (particularly, thoseusing blood-derived DNA methylation data), we madeuse of two of the largest, publicly available, blood-derivedDNA methylation data sets [16, 29]. Our analysis of thesedata sets was aimed at addressing two different but relatedquestions: (i) which cell prediction methodology per-formed better at explaining variation in DNAmethylationwithin each data set and (ii) how do differences in thevariation being explained relate to the statistical powerof such studies. To address these questions, we beganby applying CMD [13] for estimating the immune cellcomposition of the samples in the Liu and Hannum datasets. CMDwas applied using both the EstimateCellCounts(default settings) and the optimal IDOL library, giving riseto two sets of cell type predictions for each of the twodata sets. For each data set, linear regression models werefit to the J CpG loci independently, modeling methyla-tion beta-values as the response against the predicted celldistribution. Based on the fitted regression models, weestimated the variation in methylation unaccounted forby our estimates of cell mixture (i.e., residual variance)as well as the proportion of variation in methylationexplained by cell mixture estimates: R2jl, j = 1, 2, . . . , J andl = {EstimateCellCounts, IDOL}). Using these estimates,the difference in R2 between models adjusted for cellmixture using the optimal IDOL library versus Estimate-CellCounts were computed for each of the J CpGs; i.e.,Dj = R2j,IDOL − R2j,EstimateCellCounts.To answer the first of our questions - which cell pre-diction methodology performs better at explaining varia-tion in DNA methylation? - we computed the proportionof CpG loci where the IDOL library resulted in morevariation in DNA methylation explained compared toEstimateCellCounts, i.e., 1J∑Jj=1 I(Dj > 0). To assesswhether the observed proportion was greater than wouldbe expected at random, we employed a non-parametricrandomized-based test with a p-value cutoff of 0.05 todetermine statistical significance.We next sought to compare the impact of differentL-DMR libraries on the statistical power of EWAS.Similar to our simulation study (Section ‘Simulationstudy comparing false discovery rates (FDR) acrossdifferent cell composition adjustment techniques’), weassumed a simple study design that was aimed atidentifying differences in the mean methylation levelsbetween two groups. Using the residual variance estimatesobtained above, we computed the sample size requiredfor identifying differences in the mean methylation levelsbetween groups that ranged from 0.01 to 0.05 on the beta-value scale. For our sample size estimates, we assumed atwo-sample t-test, 80% power, and Bonferroni correctedtype 1 error rate (i.e, α/400, 000) to account for issue ofmultiple testing encountered in EWAS. Within both theLiu and Hannum data sets, we randomly sampled theresidual variance estimates for 1000 CpG loci obtained foreach cell mixture correction methodology and computedthe sample size needed for detecting a difference inmean methylation based on the previously mentionedassumptions. For a given difference in mean methylation,the sample size estimates based on the 1000 randomlysampled residual variance estimates were summarized byKoestler et al. BMC Bioinformatics (2016) 17:120 Page 20 of 21computing the mean, which formed the basis for ourcomparisons.To highlight the economic implications of our findings,we also estimated the cost-differential for EWAS whencell mixture correction was carried out using the IDOLlibrary versus EstimateCellCounts based on our esti-mates of the required sample sizes for each methodol-ogy. Cost-differential estimates were obtained by usingthe current per-sample cost of the Illumina Human-Methylation450 array of approximately 300 US dollars(http://www.illumina.com/).Additional filesAdditional file 1: Table S1. Summary of cell distributionmeasurements/cell mixture adjustment, among random sample of EWASpublished in 2015. (XLSX 11 kb)Additional file 2: Figure S1. Bland-Altman plots constructed from cellfraction estimates of the n = 6 AdultMixed samples obtained using the(a) TopANOVA, (b) EstimateCellCounts, and (c) IDOL optimized L-DMRlibraries. (PDF 344 kb)Additional file 3: Table S2. CpG-specific annotations for the candidateL-DMR library used in the IDOL algorithm. (XLSX 241 kb)Additional file 4: Figure S2. Performance metrics for the optimal librariesidentified by applying IDOL to the training set. (a) Average R-squared(computed across cell types) in the training set based on the optimallibraries at each assumed library size. (a) Average RMSE (computed acrosscell types) in the training set based on the optimal libraries at eachassumed library size. (PDF 324 kb)Additional file 5: Table S3. CpG-specific annotations for optimal L-DMRlibrary identified by the IDOL algorithm. (XLSX 77 kb)Additional file 6: Table S4. Comparison of prediction performancebetween the optimal IDOL library and EstimateCellCounts within bothtesting sets. (XLSX 50 kb)Additional file 7: Figure S3. Bland-Altman plots constructed from cellfraction estimates of the n = 6 MethodA reconstruction samples obtainedusing the (a) TopANOVA, (b) EstimateCellCounts, and (c) IDOL optimizedL-DMR libraries. (PDF 306 kb)Additional file 8: Figure S4. Bland-Altman plots constructed from cellfraction estimates of the n = 6 MethodB reconstruction samples obtainedusing the (a) TopANOVA, (b) EstimateCellCounts, and (c) IDOL optimizedL-DMR libraries. (PDF 282 kb)Additional file 9: Figure S5. Cell type prediction performance (R2 (top)and RMSE (bottom)) as a function of each cell type’s mean observedmixture fraction in the MethodA reconstruction samples. (PDF 698 kb)AbbreviationsEWAS: epigenome-wide association studies; CMD: cell mixture deconvolution;L-DMR: leukocyte differentially methylated region; WB: whole blood; FDR: falsediscovery rate; DSC: dispersion separability criterion; RMSE: root mean squarederror; CBC: complete blood cell counts; FACS: fluorescence-activated cellsorting; HM450: Illumina HumanMethylation450 BeadArray.Competing interestsThe authors declare no competing interests.Authors’ contributionsDCK conceived the selection algorithm, performed the statistical analyses, anddrafted the manuscript. MJJ and JU assisted in refining the selection algorithm,helped direct statistical analyses, and contributed to the interpretation ofstudy results. RAB and BCC conducted the cell mixture reconstructionexperiments and prepared samples subsequent DNA methylation assessment.MSK contributed to the interpretation of study results and assisted inmanuscript preparation. JKW and KTK provided samples for the cell mixturereconstruction experiments and the whole blood samples on healthy adults.All authors read and approved the final manuscript.AcknowledgementsThis work was supported by the National Institute of Health (NIH) grants:(1KL2TR000119 to D.C.K), (R01CA52689 and P50CA097257 to J.K.W),(R01DE022772 and P20GM104416 to B.C.C), Kansas IDeA Network ofBiomedical Research Excellence (K-INBRE) Bioinformatics Core supported inpart by the National Institute of General Medical Science awardP20GM103418, Robert Newman Magnin endowment for neurooncology atthe University of California, San Francisco (to J.K.W), and the Child and FamilyResearch Institute (to M.J.J.). M.S.K is a Senior Fellow of the Canadian Instituteof Advanced Research and the Canada Research Chair in Social Epigenetics.Author details1Department of Biostatistics, University of Kansas Medical Center, 3901Rainbow Blvd., 66160 Kansas City, KS, USA. 2Centre for Molecular Medicine andTherapeutics, Child and Family Research Institute, Department of MedicalGenetics, The University of British Columbia, 950 West 28th Ave., V5Z 4H4Vancouver, BC, Canada. 3Department of Epidemiology, Geisel School ofMedicine, Dartmouth College, 1 Medical Center Dr., 03756 Lebanon, NH, USA.4Department of Pharmacology and Toxicology, Dartmouth College, 1 RopeFerry Rd., 03755 Hanover, NH, USA. 5Department of Community and FamilyMedicine, Geisel School of Medicine, Dartmouth College, 1 Medical Center Dr.,03756 Lebanon, NH, USA. 6Department of Pathology and Laboratory Medicine,Brown University, 70 Ship St., 02912 Providence, RI, USA. 7Department ofNeurological Surgery, University of California San Francisco, 505 ParnassusAve., 94143 San Francisco, CA, USA. 8Department of Epidemiology, BrownUniversity, 121 South Main St., 02912 Providence, RI, USA.Received: 23 September 2015 Accepted: 9 February 2016References1. Rakyan VK, Down TA, Balding DJ, Beck S. Epigenome-wide associationstudies for common human diseases. Nat Rev Genet. 2011;12(8):529–41.doi:10.1038/nrg3000.2. Adalsteinsson BT, Gudnason H, Aspelund T, Harris TB, Launer LJ,Eiriksdottir G, Smith AV, Gudnason V. Heterogeneity in white blood cellshas potential to confound dna methylation measurements. PLoS ONE.2012;7(10):46705. doi:10.1371/journal.pone.0046705.3. Reinius LE, Acevedo N, Joerink M, Pershagen G, Dahln SE, Greco D,Sderhll C, Scheynius A, Kere J. Differential dna methylation in purifiedhuman blood cells: implications for cell lineage and studies on diseasesusceptibility. PLoS ONE. 2012;7(7):41361.doi:10.1371/journal.pone.0041361.4. Koestler DC, Marsit CJ, Christensen BC, Accomando W, Langevin SM,Houseman EA, Nelson HH, Karagas MR, Wiencke JK, Kelsey KT.Peripheral blood immune cell methylation profiles are associated withnonhematopoietic cancers. Cancer Epidemiol Biomarkers Prev.2012;21(8):1293–302. doi:10.1158/1055-9965.EPI-12-0361.5. Lam LL, Emberly E, Fraser HB, Neumann SM, Chen E, Miller GE,Kobor MS. Factors underlying variable dna methylation in a humancommunity cohort. Proc Natl Acad Sci U S A. 2012;109 Suppl 2:17253–60.doi:10.1073/pnas.1121249109.6. Houseman EA, Kim S, Kelsey KT, Wiencke JK. Dna methylation in wholeblood: Uses and challenges. Curr Environ Health Rep. 2015;2(2):145–54.doi:10.1007/s40572-015-0050-3.7. Michels KB, Binder AM, Dedeurwaerder S, Epstein CB, Greally JM, Gut I,Houseman EA, Izzi B, Kelsey KT, Meissner A, Milosavljevic A,Siegmund KD, Bock C, Irizarry RA. Recommendations for the design andanalysis of epigenome-wide association studies. Nat Methods.2013;10(10):949–55. doi:10.1038/nmeth.2632.8. Jaffe AE, Irizarry RA. Accounting for cellular heterogeneity is critical inepigenome-wide association studies. Genome Biol. 2014;15(2):31.doi:10.1186/gb-2014-15-2-r31.9. Liang L, Cookson WOC. Grasping nettles: cellular heterogeneity andother confounders in epigenome-wide association studies. Hum MolGenet. 2014;23(R1):83–8. doi:10.1093/hmg/ddu284.Koestler et al. BMC Bioinformatics (2016) 17:120 Page 21 of 2110. Reynolds LM, Taylor JR, Ding J, Lohman K, Johnson C, Siscovick D,Burke G, Post W, Shea S, Jacobs DR Jr, Stunnenberg H, Kritchevsky SB,Hoeschele I, McCall CE, Herrington DM, Tracy RP, Liu Y. Age-relatedvariations in the methylome associated with gene expression in humanmonocytes and t cells. Nat Commun. 2014;5:5366.doi:10.1038/ncomms6366.11. Gunawardhana LP, Gibson PG, Simpson JL, Benton MC, Lea RA,Baines KJ. Characteristic dna methylation profiles in peripheral bloodmonocytes are associated with inflammatory phenotypes of asthma.Epigenetics. 2014;9(9):1302–16. doi:10.4161/epi.33066.12. Marioni RE, Shah S, McRae AF, Ritchie SJ, Muniz-Terrera G, Harris SE,Gibson J, Redmond P, Cox SR, Pattie A, Corley J, Taylor A, Murphy L,Starr JM, Horvath S, Visscher PM, Wray NR, Deary IJ. The epigenetic clockis correlated with physical and cognitive fitness in the lothian birth cohort1936. Int J Epidemiol. 2015. doi:10.1093/ije/dyu277.13. Houseman EA, Accomando WP, Koestler DC, Christensen BC, Marsit CJ,Nelson HH, Wiencke JK, Kelsey KT. Dna methylation arrays as surrogatemeasures of cell mixture distribution. BMC Bioinformatics. 2012;13:86.doi:10.1186/1471-2105-13-86.14. Zou J, Lippert C, Heckerman D, Aryee M, Listgarten J. Epigenome-wideassociation studies without the need for cell-type composition. NatMethods. 2014;11(3):309–11. doi:10.1038/nmeth.2815.15. Houseman EA, Molitor J, Marsit CJ. Reference-free cell mixtureadjustments in analysis of dna methylation data. Bioinformatics.2014;30(10):1431–9. doi:10.1093/bioinformatics/btu029.16. Liu Y, Aryee MJ, Padyukov L, Fallin MD, Hesselberg E, Runarsson A,Reinius L, Acevedo N, TaubM, Ronninger M, Shchetynsky K, Scheynius A,Kere J, Alfredsson L, Klareskog L, Ekstrm TJ, Feinberg AP.Epigenome-wide association data implicate dna methylation as anintermediary of genetic risk in rheumatoid arthritis. Nat Biotechnol.2013;31(2):142–7. doi:10.1038/nbt.2487.17. Koestler DC, Chalise P, Cicek MS, Cunningham JM, Armasu S, Larson MC,Chien J, Block M, Kalli KR, Sellers TA, Fridley BL, Goode EL. Integrativegenomic analysis identifies epigenetic marks that mediate genetic risk forepithelial ovarian cancer. BMC Med Genomics. 2014;7:8.doi:10.1186/1755-8794-7-8.18. Koestler DC, Avissar-Whiting M, Houseman EA, Karagas MR, Marsit CJ.Differential dna methylation in umbilical cord blood of infants exposed tolow levels of arsenic in utero. Environ Health Perspect. 2013;121(8):971–7.doi:10.1289/ehp.1205925.19. Liang L, Willis-Owen SAG, Laprise C, Wong KCC, Davies GA, Hudson TJ,Binia A, Hopkin JM, Yang IV, Grundberg E, Busche S, Hudson M,Rnnblom L, Pastinen TM, Schwartz DA, Lathrop GM, Moffatt MF,Cookson WOCM. An epigenome-wide association study of total serumimmunoglobulin e concentration. Nature. 2015;520(7549):670–4.doi:10.1038/nature14125.20. Wilhelm-Benartzi CS, Koestler DC, Karagas MR, Flanagan JM,Christensen BC, Kelsey KT, Marsit CJ, Houseman EA, Brown R. Review ofprocessing and analysis methods for dna methylation array data. Br JCancer. 2013;109(6):1394–402. doi:10.1038/bjc.2013.496.21. Jones MJ, Islam SA, Edgar RD, Kobor MS. Adjusting for cell typecomposition in dna methylation data using a regression-based approach.Methods Mol Biol. 2015. doi:10.1007/7651_2015_262.22. Koestler DC, Christensen B, Karagas MR, Marsit CJ, Langevin SM,Kelsey KT, Wiencke JK, Houseman EA. Blood-based profiles of dnamethylation predict the underlying distribution of cell types: a validationanalysis. Epigenetics. 2013;8(8):816–26. doi:10.4161/epi.25430.23. Aryee MJ, Jaffe AE, Corrada-Bravo H, Ladd-Acosta C, Feinberg AP,Hansen KD, Irizarry RA. Minfi: a flexible and comprehensive bioconductorpackage for the analysis of infinium dna methylation microarrays.Bioinformatics. 2014;30(10):1363–9. doi:10.1093/bioinformatics/btu049.24. Hertz J, Krogh A, Palmer G. Introduction to the Theory of Computation:Addison-Wesley; 1993.25. McClatchey KD, (ed). 2002. Clinical Laboratory Medicine. LippincottWilliams & Wilkins.26. Flanagan JM, Brook MN, Orr N, Tomczyk K, Coulson P, Fletcher O,Jones ME, Schoemaker MJ, Ashworth A, Swerdlow A, Brown R,Garcia-Closas M. Temporal stability and determinants of white blood celldna methylation in the breakthrough generations study. CancerEpidemiol Biomarkers Prev. 2015;24(1):221–9.doi:10.1158/1055-9965.EPI-14-0767.27. Marioni RE, Shah S, McRae AF, Chen BH, Colicino E, Harris SE, Gibson J,Henders AK, Redmond P, Cox SR, Pattie A, Corley J, Murphy L, Martin NG,Montgomery GW, Feinberg AP, Fallin MD, Multhaup ML, Jaffe AE,Joehanes R, Schwartz J, Just AC, Lunetta KL, Murabito JM, Starr JM,Horvath S, Baccarelli AA, Levy D, Visscher PM, Wray NR, Deary IJ. Dnamethylation age of blood predicts all-cause mortality in later life. GenomeBiol. 2015;16:25. doi:10.1186/s13059-015-0584-6.28. Demerath EW, GuanW, Grove ML, Aslibekyan S, MendelsonM, Zhou YH,Hedman K, Sandling JK, Li LA, Irvin MR, Zhi D, Deloukas P, Liang L, Liu C,Bressler J, Spector TD, North K, Li Y, Absher DM, Levy D, Arnett DK,Fornage M, Pankow JS, Boerwinkle E. Epigenome-wide association study(ewas) of bmi, bmi change and waist circumference in african americanadults identifies multiple replicated loci. Hum Mol Genet. 2015;24(15):4464–79. doi:10.1093/hmg/ddv161.29. Hannum G, Guinney J, Zhao L, Zhang L, Hughes G, Sadda S, Klotzle B,Bibikova M, Fan JB, Gao Y, Deconde R, Chen M, Rajapakse I, Friend S,Ideker T, Zhang K. Genome-wide methylation profiles reveal quantitativeviews of human aging rates. Mol Cell. 2013;49(2):359–67.doi:10.1016/j.molcel.2012.10.016.30. Efron B, Tibshirani R. An Introduction to the Bootstrap: Chapman &Hall/CRC; 1993.31. Accomando WP, Wiencke JK, Houseman EA, Nelson HH, Kelsey KT.Quantitative reconstruction of leukocyte subsets using dna methylation.Genome Biol. 2014;15(3):50. doi:10.1186/gb-2014-15-3-r50.32. Maksimovic J, Gordon L, Oshlack A. Swan: Subset-quantile within arraynormalization for illumina infinium humanmethylation450 beadchips.Genome Biol. 2012;13(6):44. doi:10.1186/gb-2012-13-6-r44.33. Harper KN, Peters BA, Gamble MV. Batch effects and pathway analysis:two potential perils in cancer studies involving dna methylation arrayanalysis. Cancer Epidemiol Biomarkers Prev. 2013;22(6):1052–60.doi:10.1158/1055-9965.EPI-13-0114.34. Teschendorff AE, Menon U, Gentry-Maharaj A, Ramus SJ, Gayther SA,Apostolidou S, Jones A, Lechner M, Beck S, Jacobs IJ, Widschwendter M.An epigenetic signature in peripheral blood predicts active ovariancancer. PLoS ONE. 2009;4(12):8274. doi:10.1371/journal.pone.0008274.35. Teschendorff AE, Zhuang J, Widschwendter M. Independent surrogatevariable analysis to deconvolve confounding factors in large-scalemicroarray profiling studies. Bioinformatics. 2011;27(11):1496–505.doi:10.1093/bioinformatics/btr171.36. Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarrayexpression data using empirical bayes methods. Biostatistics. 2007;8(1):118–27. doi:10.1093/biostatistics/kxj037.37. Leek JT, Johnson WE, Parker HS, Jaffe AE, Storey JD. The sva package forremoving batch effects and other unwanted variation inhigh-throughput experiments. Bioinformatics. 2012;28(6):882–3.doi:10.1093/bioinformatics/bts034.38. Bock C. Analysing and interpreting dna methylation data. Nat Rev Genet.2012;13(10):705–19. doi:10.1038/nrg3273.39. Steyerberg EW, Vickers AJ, Cook NR, Gerds T, Gonen M, Obuchowski N,Pencina MJ, Kattan MW. Assessing the performance of predictionmodels: a framework for traditional and novel measures. Epidemiology.2010;21(1):128–38. doi:10.1097/EDE.0b013e3181c30fb2.
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Faculty Research and Publications /
- Improving cell mixture deconvolution by identifying...
Open Collections
UBC Faculty Research and Publications
Improving cell mixture deconvolution by identifying optimal DNA methylation libraries (IDOL) Koestler, Devin C; Jones, Meaghan J; Usset, Joseph; Christensen, Brock C; Butler, Rondi A; Kobor, Michael S; Wiencke, John K; Kelsey, Karl T Mar 8, 2016
pdf
Page Metadata
Item Metadata
Title | Improving cell mixture deconvolution by identifying optimal DNA methylation libraries (IDOL) |
Creator |
Koestler, Devin C Jones, Meaghan J Usset, Joseph Christensen, Brock C Butler, Rondi A Kobor, Michael S Wiencke, John K Kelsey, Karl T |
Contributor | University of British Columbia. Centre for Molecular Medicine and Therapeutics Child and Family Research Institute |
Publisher | BioMed Central |
Date Issued | 2016-03-08 |
Description | Background: Confounding due to cellular heterogeneity represents one of the foremost challenges currently facing Epigenome-Wide Association Studies (EWAS). Statistical methods leveraging the tissue-specificity of DNA methylation for deconvoluting the cellular mixture of heterogenous biospecimens offer a promising solution, however the performance of such methods depends entirely on the library of methylation markers being used for deconvolution. Here, we introduce a novel algorithm for Identifying Optimal Libraries (IDOL) that dynamically scans a candidate set of cell-specific methylation markers to find libraries that optimize the accuracy of cell fraction estimates obtained from cell mixture deconvolution. Results Application of IDOL to training set consisting of samples with both whole-blood DNA methylation data (Illumina HumanMethylation450 BeadArray (HM450)) and flow cytometry measurements of cell composition revealed an optimized library comprised of 300 CpG sites. When compared existing libraries, the library identified by IDOL demonstrated significantly better overall discrimination of the entire immune cell landscape (p = 0.038), and resulted in improved discrimination of 14 out of the 15 pairs of leukocyte subtypes. Estimates of cell composition across the samples in the training set using the IDOL library were highly correlated with their respective flow cytometry measurements, with all cell-specific R 2>0.99 and root mean square errors (RMSEs) ranging from [0.97 % to 1.33 %] across leukocyte subtypes. Independent validation of the optimized IDOL library using two additional HM450 data sets showed similarly strong prediction performance, with all cell-specific R 2>0.90 and R M S E<4.00 %. In simulation studies, adjustments for cell composition using the IDOL library resulted in uniformly lower false positive rates compared to competing libraries, while also demonstrating an improved capacity to explain epigenome-wide variation in DNA methylation within two large publicly available HM450 data sets. Conclusions Despite consisting of half as many CpGs compared to existing libraries for whole blood mixture deconvolution, the optimized IDOL library identified herein resulted in outstanding prediction performance across all considered data sets and demonstrated potential to improve the operating characteristics of EWAS involving adjustments for cell distribution. In addition to providing the EWAS community with an optimized library for whole blood mixture deconvolution, our work establishes a systematic and generalizable framework for the assembly of libraries that improve the accuracy of cell mixture deconvolution. |
Subject |
EWAS DNA methylation Cell mixture estimation Cell heterogeneity |
Genre |
Article |
Type |
Text |
Language | eng |
Date Available | 2017-12-11 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution 4.0 International (CC BY 4.0) |
DOI | 10.14288/1.0361792 |
URI | http://hdl.handle.net/2429/63899 |
Affiliation |
Medicine, Faculty of Other UBC Non UBC Medical Genetics, Department of |
Citation | BMC Bioinformatics. 2016 Mar 08;17(1):120 |
Publisher DOI | 10.1186/s12859-016-0943-7 |
Peer Review Status | Reviewed |
Scholarly Level | Faculty |
Copyright Holder | Koestler et al. |
Rights URI | http://creativecommons.org/licenses/by/4.0/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 52383-12859_2016_Article_943.pdf [ 2.89MB ]
- Metadata
- JSON: 52383-1.0361792.json
- JSON-LD: 52383-1.0361792-ld.json
- RDF/XML (Pretty): 52383-1.0361792-rdf.xml
- RDF/JSON: 52383-1.0361792-rdf.json
- Turtle: 52383-1.0361792-turtle.txt
- N-Triples: 52383-1.0361792-rdf-ntriples.txt
- Original Record: 52383-1.0361792-source.json
- Full Text
- 52383-1.0361792-fulltext.txt
- Citation
- 52383-1.0361792.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.52383.1-0361792/manifest