Salehi et al. Genome Biology (2017) 18:44
DOI 10.1186/s13059-017-1169-3

METHOD Open Access

ddClone: joint statistical inference of clonal populations from single cell and bulk tumour sequencing data

Sohrab Salehi1, Adi Steif2,3, Andrew Roth2,3, Samuel Aparicio1,2, Alexandre Bouchard-Côté4† and Sohrab P. Shah1,2*†

Abstract
Next-generation sequencing (NGS) of bulk tumour tissue can identify constituent cell populations in cancers and measure their abundance. This requires computational deconvolution of allelic counts from somatic mutations, which may be incapable of fully resolving the underlying population structure. Single cell sequencing (SCS) is a more direct method, although its replacement of NGS is impeded by technical noise and sampling limitations. We propose ddClone, which analytically integrates NGS and SCS data, leveraging their complementary attributes through joint statistical inference. We show on real and simulated datasets that ddClone produces more accurate results than can be achieved by either method alone.

Keywords: Intra-tumour heterogeneity, Clonal evolution, Joint probabilistic model, Distance dependent, Chinese restaurant process, Single cell sequencing, Next-generation sequencing

Background
Human cancers develop through branched evolutionary processes [1] resulting in genetically diverse clonal cell populations. Every cancer cell likely harbours a distinct genome through accrual of individual mutations; however, evolutionary relationships between cells can be hierarchically encoded with phylogenetic trees. The major clades represent cell populations with a majority shared genotype. Mutations impacting phenotypic variation between clonal populations are thought to drive the clonal population dynamics of a cancer over temporal and microenvironmental dimensions.
Clonal dynamics in turn impact clinical trajectories, underpinning disease complications such as treatment resistance and metastasis.

Quantitative characterization of the number of clones, their genotypes, and their abundance is of central importance in the study of the evolutionary dynamics of cancer. Ideally, the identified clones would correspond with the branches of an underlying generative process modelled by a phylogenetic tree. In practice, because of limitations of current sequencing technologies, we are not able to directly observe clones of interest. Instead, indirect experimental methods are used: bulk targeted deep sequencing [2] and single cell sequencing [3]. In both bulk and single cell, we focus the discussion on nucleotide variant markers (single nucleotide variants, SNVs), which we assume have been identified in a preliminary analysis [4–7]. In both experimental platforms, technical challenges remain which prevent accurate inference of the desired quantities. We posited that joint statistical modelling of bulk and single cell sequencing data could improve inference of clonal composition and abundance.

*Correspondence: sshah@bccrc.ca
†Equal contributors
1 Bioinformatics Graduate Program, University of British Columbia, 570 West 7th Avenue, V5Z 4S6, Vancouver, BC, Canada
2 Department of Pathology and Laboratory Medicine, University of British Columbia, V6T 2B5, Vancouver, BC, Canada
Full list of author information is available at the end of the article

We begin the discussion with an overview of methods for bulk sequencing. Bulk methods can only provide a direct measure of sampled allele prevalences (the fraction of reads that harbour a mutation at a specific genomic locus) over DNA fragments sampled from a large, mixed pool of alleles extracted from the totality of cells present in the input tissues. Consequently, allele prevalence is a compound measure impacted by the unknown quantity of non-malignant cells and the unknown composition of the constituent malignant clones.
Leveraging many mutations measured from the same allelic pool, computational methods have been developed to estimate subclonal structure from allele prevalences. The PyClone model [2] takes into account several confounding factors, including statistical variation coming from the sampling of the reads; non-malignant cell fraction; mis-called bases and other technical artefacts; and most importantly, how copy number alterations resulting from segmental aneuploidies locally and/or globally deviate from diploidy. PyClone and other methods such as PhyloSub [8], Clomial [9], AncesTree [10], and SciClone [11] generally assume mutations with shared prevalence (either cellular prevalence or allele prevalence) are more likely to be co-occurring within the same cell, thus defining components of a clonal genotype. This assumption may be violated to varying degrees; mutations may be present at similar allele prevalence but distributed across clones [12].

© The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

A potential solution to this problem lies in single cell sequencing (SCS). SCS via whole genome shotgun or multiplex targeted design by PCR amplification theoretically yields direct ascertainment of genotypes whereby the data itself will encode whether sets of mutations are co-occurring in individual cells.
While the measurements of SCS are conceptually simpler, they come with a much higher level of technical noise [13–16]. Since the amount of measured DNA from each cell is minimal, missing one or both of the alleles (allelic drop-out (ADO) [15]) is common, resulting in sparse representation of underlying genotypes. While missing both alleles is relatively easy to detect, missing only one can seriously skew interpretation of heterozygous loci [17]. Moreover, by construction, SCS methods sample a dramatically smaller number of cells compared to bulk sequencing. As a consequence, when estimating cellular prevalences the sampling error will tend to be markedly higher (see Results section and also Additional file 1). A number of computational methods have been developed to work with SCS data that account for (some of) these limitations. The single cell genotyper (SCG) [16] uses a hierarchical Bayesian model to cluster single cells into clones and infer constituting genotypes and their prevalences, and it models various technical errors, including doublets. Using mutual SNV patterns in the single cells, OncoNEM [18] and BitPhylogeny [19] infer the evolutionary relationships between constituent clones, while SCITE [20] also reconstructs the order of mutations.

We propose to leverage the strengths of both sequencing methods for optimal computational inference of clonal genotypes and prevalences. We present a novel probabilistic model based on non-parametric Bayesian integration of bulk and single cell data. We demonstrate on synthetic and real datasets how simultaneous analysis results in improved inference of salient quantities of interest for biological inference of clonal dynamics in cancer.

Results and discussion
We developed a statistical framework, ddClone, leveraging data obtained from both single cell and bulk sequencing methods (Fig. 1).
The ddClone approach assumes single cell sequencing data will inform and improve clustering of allele fractions derived from bulk sequencing data in a joint statistical model. ddClone combines a Bayesian non-parametric prior informed by single cell data with a likelihood model based on bulk sequencing data to infer clonal population architecture through clustered mutations. Intuitively, the prior 'encourages' genomic loci with co-occurring mutations in single cells to cluster together. Using a cell-locus binary matrix from single cell sequencing, ddClone computes a distance matrix between mutations using the Jaccard distance with exponential decay. This matrix is then used as a prior for inference over mutation clusters and their prevalences from deeply sequenced bulk data in a distance-dependent Chinese restaurant process [21] framework. The output of the model is the most probable set of clonal genotypes present and the prevalence of each genotype in the population. Full mathematical and implementation details are provided in Methods and Additional file 1.

Benchmarking over simulated data
We benchmarked ddClone by simulating 10 ground truth synthetic datasets each with 10 cell genotypes and 48 genomic loci (Fig. 2). Joint bulk and single cell data were generated from a phylogenetic Dollo process (Additional file 1: Figure S1; Additional file 2).

We compared ddClone to three methods that operate on bulk data only: PyClone [2], PhyloWGS [22], and Clomial [9], and to two methods that leverage single cell data only: SCITE [20] and OncoNEM [18]. Two performance metrics were evaluated: clustering accuracy (by V-measure [23]) and accuracy of inferred cellular prevalences (the average over loci of the absolute differences between the inferred and true cellular prevalences).
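
As an illustration of the two metrics just named, the following is a minimal Python sketch. It is not taken from the ddClone code base: the function names and the pure-Python implementation are assumptions made for this example, and the V-measure is computed in its standard form as the harmonic mean of homogeneity and completeness (equal weighting is assumed).

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a cluster labelling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given) for two equal-length label lists."""
    n = len(target)
    joint = Counter(zip(given, target))
    marginal = Counter(given)
    return -sum((c / n) * math.log(c / marginal[g]) for (g, _), c in joint.items())


def v_measure(true_labels, pred_labels):
    """Harmonic mean of homogeneity and completeness (equal weighting assumed)."""
    h_true, h_pred = entropy(true_labels), entropy(pred_labels)
    homogeneity = 1.0 if h_true == 0 else 1.0 - conditional_entropy(true_labels, pred_labels) / h_true
    completeness = 1.0 if h_pred == 0 else 1.0 - conditional_entropy(pred_labels, true_labels) / h_pred
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


def mean_abs_prevalence_error(true_prev, inferred_prev):
    """Average over loci of |inferred - true| cellular prevalence."""
    return sum(abs(t - p) for t, p in zip(true_prev, inferred_prev)) / len(true_prev)
```

Here each list holds one entry per genomic locus: a cluster label for `v_measure` and a cellular prevalence for `mean_abs_prevalence_error`. The V-measure is invariant to relabelling of clusters, which is why it suits comparisons between methods that number their clusters differently.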
For the same bulk data, three sets of single cell data with different levels of noise were generated: (1) ideal data with no ADO or doublets; (2) data with moderate levels of sampling distortion, in the presence of 30% doublet cells and an ADO rate of 30%; and finally (3) data with higher levels of sampling distortion reflective of real data, with the same doublet and ADO rates as in (2). We designate these three regimes by λ = ∞, λ = 10, and λ = 1.12 respectively. ddClone was supplied with the above single cell data for encoding the prior over clustering. Single cell-only methods were given the exact same input as ddClone's prior.

Fig. 1 The workflow of ddClone. This figure shows the workflow of our method, ddClone. The ddClone approach is predicated on the notion that single cell sequencing data will inform and improve clustering of allele fractions derived from bulk sequencing data in a joint statistical model. ddClone combines a Bayesian non-parametric prior informed by single cell data with a likelihood model based on bulk sequencing data to infer clonal population architecture. Intuitively, the prior encourages genomic loci with co-occurring mutations in single cells to cluster together. Using a cell-locus binary matrix from single cell sequencing, ddClone computes a distance matrix between mutations using the Jaccard distance with exponential decay. This matrix is then used as a prior for inference over mutation clusters and their prevalences from deeply sequenced bulk data in a distance-dependent Chinese restaurant process framework. The output of the model is the most probable set of clonal genotypes present and the prevalence of each genotype in the population

Fig. 2 Simulated phylogenetic tree (panel a) and the resulting binarized cell genotype matrix (panel b). Transposed binarized simulated cell genotypes from Generalized Dollo process over a fixed phylogeny. The original cell genotype matrix CN is in copy number space. We binarize it by setting entries with non-zero variant allele copy number to one (coloured red) and setting entries with variant allele copy number of zero to zero (coloured blue). The clonal prevalence of each genotype is in parentheses

Fig. 3 Performance analysis in presence of sampling distortion. Effect of sampling distortion on V-measure index (panel a) and mean absolute error of cellular prevalences (panel b) across multiple values for the total number of single cells (specified on top of each panel). Each box plot represents 10 simulated datasets each with 10 genotypes and 48 genomic loci. The cells are sampled from a Dirichlet-multinomial distribution with sample size m ∈ {50, 100, 200, 500, 1000} and parameters equal to the true prevalence of each genotype scaled by the concentration coefficient λ. The larger the λ, the more closely the Dirichlet-multinomial distribution approximates the multinomial distribution. At higher values of λ the sampled cells better represent the true proportions of genotypes. Estimated values of λ for the real datasets are annotated on panel b. We note that OncoNEM did not converge when the number of cells exceeded 100 (boxes marked by a star). This result suggests that ddClone's clustering and cellular prevalence estimates are fairly robust to the presence of distorted single cell sampling

Under noise levels corresponding to real datasets (λ = 1.12, Fig. 3), ddClone(λ = 1.12) had a mean cellular prevalence estimation error of 0.09 ± 0.03, significantly outperforming both OncoNEM(λ = 1.12) (0.17 ± 0.03) and SCITE(λ = 1.12) (0.18 ± 0.05), while doing slightly better than the second best performing bulk data-only method, PyClone (0.10 ± 0.05).
ddClone(λ = 1.12) also had high clustering accuracy in this noise regime, with a mean V-measure of 0.77 ± 0.06 relative to 0.74 ± 0.06 for OncoNEM(λ = 1.12), 0.71 ± 0.08 for SCITE(λ = 1.12), and 0.71 ± 0.10 for PyClone. Clomial had a slightly higher mean V-measure than PyClone (0.78 ± 0.07), but it had a worse cellular prevalence estimation error (0.14 ± 0.04). PhyloWGS had a mean V-measure of 0.73 ± 0.03 and a mean cellular prevalence estimation error of 0.14 ± 0.04.

Under λ = 10, the moderate sampling distortion noise regime, ddClone(λ = 10) significantly outperformed both single cell data-only methods in terms of cellular prevalence estimation, achieving a mean error of 0.07 ± 0.02 versus 0.13 ± 0.03 for OncoNEM(λ = 10) and 0.18 ± 0.05 for SCITE(λ = 10). ddClone(λ = 10) did comparably well to OncoNEM(λ = 10) and SCITE(λ = 10) in terms of clustering accuracy, with a mean V-measure of 0.79 ± 0.09 against 0.81 ± 0.03 and 0.75 ± 0.05 respectively.

With perfect, noiseless single cell data (λ = ∞), OncoNEM(λ = ∞) outperformed SCITE(λ = ∞) and ddClone(λ = ∞) both in terms of cellular prevalence estimation, with an average error of 0.04 ± 0.01 against 0.06 ± 0.01 and 0.06 ± 0.01, and in terms of clustering accuracy, with a mean V-measure of 0.90 ± 0.03 versus 0.87 ± 0.09 and 0.86 ± 0.04 respectively.

These results suggest that in the presence of simultaneous doublets, ADO events, and assortment bias noise, ddClone compares favourably to other methods (Fig. 4). This is most relevant in the case of improved cellular prevalence estimates, as single cell platforms will likely stay unfit for this type of measurement in the near future due to under-sampling.

Fig. 4 Benchmarking results over simulated data. Performance results for ddClone, single cell-only, and bulk data methods on ten synthetic datasets. ddClone and single cell-only methods were provided with single cells, either (1) 50 cells, sampled from a multinomial distribution with true genotype prevalences as parameters (labelled ddClone(λ = ∞), OncoNEM(λ = ∞), and SCITE(λ = ∞)) in absence of doublet and ADO noise, or (2) 50 cells sampled from a Dirichlet-multinomial distribution with λ = 10, constituting moderate to small levels of sampling bias (labelled as ddClone(λ = 10), OncoNEM(λ = 10), and SCITE(λ = 10)), or (3) 50 cells sampled from a Dirichlet-multinomial distribution with λ = 1.12, constituting high levels of sampling bias (labelled as ddClone(λ = 1.12), OncoNEM(λ = 1.12), and SCITE(λ = 1.12)), where in the case of (2) and (3), 30% of cells are doublets and r_ADO = 30%. Panel a shows V-measure clustering performance. Panel b shows the average over loci of the absolute differences between the inferred and true cellular prevalences. This result shows that in the presence of reasonable levels of noise, ddClone performs comparably well in terms of both V-measure and the accuracy of inferred cellular prevalences

Sensitivity to presence of noise in single cell data
We next directly considered the impact of four types of noise likely to be present in single cell data: 'assortment bias', where the quantities of sampled cells are not representative of the underlying tumour; 'doublets' and 'allele drop-outs', which affect the quality of the signal at a single genomic locus; and 'genotype loss noise', where one or more cell genotypes are unavailable (e.g. due to under-sampling) for formulation of the prior.

Assortment bias
Here we compare our method to methods that exclusively accept as input single cell sequencing data: OncoNEM [18] and SCITE [20]. In contrast to ddClone, these methods accept cell-mutation data and not a derived genotype-mutation matrix. In order to accommodate this in our experiments, we simulated cells from the genotypes as described below. See Additional file 1 for parameter settings and the derivation of cellular prevalence estimates for these methods.
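
The sampling scheme described in the captions of Figs. 3 and 4 (m cells drawn from a Dirichlet-multinomial distribution whose parameters are the true genotype prevalences scaled by λ, reducing to a multinomial as λ → ∞) can be sketched as follows. This is a hedged illustration using only the Python standard library; the function name `sample_cell_counts` and its signature are hypothetical, and the simulation settings actually used are those given in Methods and Additional file 1.

```python
import math
import random


def sample_cell_counts(true_prevalences, m, lam, rng):
    """Draw genotype counts for m cells from a Dirichlet-multinomial.

    Dirichlet parameters are lam * prevalence; lam = inf gives a plain
    multinomial on the true prevalences. Prevalences must be positive.
    """
    if math.isinf(lam):
        probs = list(true_prevalences)
    else:
        # A Dirichlet draw is a set of independent Gamma(alpha_k, 1) draws, normalised.
        gammas = [rng.gammavariate(lam * p, 1.0) for p in true_prevalences]
        total = sum(gammas)
        probs = [g / total for g in gammas]
    # Multinomial step: assign each of the m cells to a genotype.
    draws = rng.choices(range(len(probs)), weights=probs, k=m)
    return [draws.count(k) for k in range(len(probs))]


rng = random.Random(0)
counts = sample_cell_counts([0.5, 0.3, 0.2], m=50, lam=1.12, rng=rng)
```

With small λ the normalised Gamma draws concentrate on a few genotypes, so the sampled counts can depart strongly from the true prevalences, which is the assortment bias examined in this section.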
We note that even though ddClone is not designed to work with cell-mutation matrices, in the following simulations we have used this type of data to remove the effects of genotype inference methods (e.g. [16]) on the results. We investigated the effects of sampling bias modelled using the parameter λ (see Methods section). For small values of λ, we expect the sampled cells not to be representative of the true tumour content, and vice versa. With increasing assortment bias, ddClone performs better than single cell-only methods (Fig. 3), most importantly in λ ranges (Methods section) approximating the real datasets. When the sampled cells are accurate representations of the underlying sample, single cell-only methods outperform ddClone as expected, since prevalence estimates map directly to cell counting, without requiring inference.

Doublets
Doublets are one source of noise in single cell sequencing experiments. They occur when two or more cells are trapped together in a single well during the sequencing procedure. As the genotype assigned to a doublet well will be a hybrid of the genotypes of the two or more cells that it contains, we assume that this results in a false positive error where the hybrid genotype will have more mutated genomic loci than the original trapped cells (Methods). We simulated an additional 500 datasets across multiple values of r_doublet, the percentage of doublet events, and multiple values of m, the number of sampled single cells, where m ∈ {50, 100, 200, 500, 1000} and r_doublet ∈ (0, 1]. ddClone's cellular prevalence estimates are in general robust to the presence of uncorrected doublet noise (Fig. 5). We reiterate that ddClone is not designed to work with cell-mutation matrices, and the best input to it is the genotype-mutation matrix, for example, as generated by the SCG model.
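
The doublet assumption stated in this section (a hybrid genotype that carries at least as many mutated loci as either trapped cell) can be illustrated with a small Python sketch. Merging two cells by a locus-wise logical OR, choosing the partner cell uniformly at random, and the function name `add_doublets` are all assumptions made for illustration; the simulation actually used is described in Methods.

```python
import random


def add_doublets(cells, r_doublet, rng):
    """Replace a fraction r_doublet of binary cell genotypes with hybrids.

    Each hybrid is the locus-wise OR of the cell and a randomly chosen
    partner (assumed merge rule), so it can only gain mutated loci.
    """
    originals = [list(c) for c in cells]
    noisy = [list(c) for c in cells]
    n_doublets = round(r_doublet * len(cells))
    for i in rng.sample(range(len(cells)), n_doublets):
        partner = rng.choice(originals)
        noisy[i] = [a | b for a, b in zip(originals[i], partner)]
    return noisy


cells = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
hybrids = add_doublets(cells, r_doublet=1.0, rng=random.Random(1))
```

Under this merge rule a doublet never loses a mutation, which matches the false positive character of the error described above.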
SCG is designed to correct for doublets, and we anticipate that using it would improve ddClone's performance.

Allele drop-outs
We next investigated the effect of increasing ADO (loci with ADO sit at the extremes of the allele count distribution; details in the Methods section) on ddClone accuracy. Progressively increasing the ADO rate results in degrading performance in both clustering and cellular prevalence estimates (Fig. 6). Unsurprisingly, the detrimental effect dampens as the number of sampled cells increases.

Clonal genotype loss
Clonal genotype loss is defined as a lack of inclusion of a population's genotype in the encoding of the prior. We under-sampled genotypes by systematically 'hiding' single cell genotypes from the prior. Unsurprisingly, progressively removing more cell genotypes (in increasing order of their prevalence) results in monotonically degrading performance (Fig. 7). However, when as few as approximately half of the genotypes are available to encode in the prior, ddClone still outperforms the naive methods in terms of cellular prevalence estimation (Figs. 4 and 7). This suggests a degree of robustness in the presence of under-sampling of clones, and that even partial prior information will improve prevalence estimation performance.

Fig. 5 Performance analysis in presence of doublets. Effect of presence of doublets on V-measure index (panel a) and mean absolute error of cellular prevalences (panel b) across multiple values for the total number of single cells (specified as m on top of each panel). Each box plot represents 10 simulated datasets each with 10 genotypes and 48 genomic loci. The cells are sampled from a multinomial distribution with a sample size of m and parameters equal to the true prevalence of each genotype. Progressively increasing the percentage of doublet cells results in a minor degradation of performance in cellular prevalence estimates. Overall, this result suggests that ddClone's cellular prevalence estimates are robust to the presence of uncorrected doublet noise

Fig. 6 Performance analysis in presence of allele drop-outs. Effect of presence of allele drop-outs (ADO) on V-measure index (panel a) and mean absolute error of cellular prevalences (panel b) across multiple values for the total number of single cells (specified as m on top of each panel). Each box plot represents 10 simulated datasets each with 10 genotypes and 48 genomic loci. The cells are sampled from a multinomial distribution with a sample size of m and parameters equal to the true prevalence of each genotype. As expected, progressively increasing the ADO rate results in degrading performance in both clustering and cellular prevalence estimates. The detrimental effect dampens as the number of sampled cells increases

Fig. 7 Performance analysis in presence of loss of multiple genotypes. Effect of removing genotypes on V-measure index (panel a) and mean absolute error of cellular prevalences (panel b). Unsurprisingly, progressively removing more cell genotypes (in increasing order of prevalence) results in monotonically degrading performance. However, when as few as approximately half of the genotypes are available to encode in the prior, ddClone still outperforms the naive methods in terms of cellular prevalence estimates

Benchmarking over triple-negative breast cancer patient-derived xenograft data
To test our method on a real dataset, we used a subset of samples from a triple-negative breast cancer (TNBC) xenograft study [24], where breast cancer tissues from 55 patients were transplanted into immuno-suppressed mice, resulting in 30 xenograft lines. Over 3 years, these lines were passaged up to 16 generations. Whole genome sequencing was performed over a subset of passages to identify point mutations at specific genomic positions. Deep targeted amplicon sequencing of between 100 and 300 SNV positions per sample was then used to establish the allelic prevalences of these mutations. We chose 210 cells from five timepoints that span two samples for single cell genotyping, and approximately 48 SNV positions were targeted for each timepoint, with some filtration due to poorly performing cells or loci [24]. A consensus phylogenetic tree over cells was inferred using MrBayes [25]. Figure 8 shows the inferred cell genotype matrix for each sample. In each timepoint, we only kept genomic loci that were shared between the bulk and single cell genotype data (Additional file 3).

Since the exact clustering configuration and cellular prevalences of the genomic loci in the real dataset are unknown, we used the multi-sample PyClone results over several timepoints as a benchmark (see Additional file 1 for details). PyClone in multi-sample mode borrows statistical strength across all timepoints to give generally more accurate estimates of clonal structure in individual timepoints. We ran our method along with competing methods on each timepoint independently (Additional file 4). By these criteria, ddClone showed better performance than the second best performing method in terms of V-measure (Wilcoxon rank sum test with p value < 0.05) and performs comparably well (SA494, timepoint T and SA501, timepoint X4) or better (all the other timepoints) than the second best performing method in terms of accuracy of inferred cellular prevalences (Fig. 9).

ddClone achieved a V-measure of 0.88 and 0.89 for sample SA494 at timepoints T and X4 and 0.82, 0.82, and 0.81 for sample SA501 at timepoints X1, X2, and X4 respectively.
The second best performing method, PyClone, achieved a V-measure of 0.56, 0.69, 0.70, 0.69, and 0.67 corresponding to sample SA494 at timepoints T and X4 and sample SA501 at timepoints X1, X2, and X4. Summarizing across samples, ddClone's clustering was best (mean V-measure = 0.85, SD = 0.04), followed by PyClone (mean V-measure = 0.66, SD = 0.06), Clomial (mean V-measure = 0.61, SD = 0.06), SCITE (mean V-measure = 0.60, SD = 0.08), OncoNEM (mean V-measure = 0.60, SD = 0.08), and finally PhyloWGS (mean V-measure = 0.53, SD = 0.05). Use of the mean cellular prevalence estimation error resulted in a very similar ranking: ddClone (mean = 0.04, SD = 0.01), PyClone (mean = 0.05, SD = 0.04), Clomial (mean = 0.07, SD = 0.01), PhyloWGS (mean = 0.08, SD = 0.02), OncoNEM (mean = 0.15, SD = 0.05), and finally SCITE (mean = 0.16, SD = 0.05).

Fig. 8 Genotypes curated for the triple-negative breast cancer data. Binary cell genotype matrices for sample SA494 over 28 genomic loci (left) and sample SA501 over 38 genomic loci (right). These are manually curated from a single cell genotype sequencing experiment [24]. Briefly, MrBayes was used to infer a consensus phylogenetic tree over the single nuclei. Then they were grouped into clades according to high probability branching splits. Finally, each clade was assigned a consensus genotype by taking the mode genotype of the clade at each genomic locus. Red indicates a mutated locus, while blue indicates a non-mutated locus

Fig. 9 Benchmarking results over TNBC dataset. Performance results for ddClone and existing methods over TNBC SA501 X1, X2, X4, and SA494 T, X4. Panel a shows clustering assignment performance. Panel b shows cellular prevalence approximation mean absolute error. Evaluated against multi-sample PyClone, ddClone outperforms the second best performing method (PyClone) in terms of V-measure (Wilcoxon rank sum test with p value < 0.05) and performs as well (SA494, timepoint T) or better (all the other timepoints) than the second best performing method in terms of accuracy of inferred cellular prevalences

Inference of genotypes from multiple spatial samples in ovarian cancer
We next evaluated performance on samples from a high-grade serous ovarian cancer (HGSOvCa) study [26] where 68 tumour samples from seven patients (5–13 samples per patient) including samples from the ovary and omentum were obtained during initial debulking surgery, except for one patient for whom samples from the first and second relapses were also available. Whole genome sequencing of 31 cryopreserved tissues and matched normal blood produced a panel of 3577 to 16,987 somatic genomic aberrations including SNVs and allele-specific absolute copy number variations (CNVs) per patient. To verify existence and allelic counts of these predicted SNVs, 37 formalin-fixed, paraffin-embedded specimens were used in targeted deep sequencing of 300 loci per patient with multiplex PCR amplicons. Single-nucleus sequencing of a total of 1680 cells from three patients was used to determine the co-occurrence of between 43 and 84 SNVs per sample. These data, in combination with the single cell genotyper (SCG) model [16], produced the cell genotype matrix for each sample. Similar to the xenograft TNBC case study, we only kept genomic loci that were shared between the bulk and single cell genotype data and evaluated the results analogously.

Measured against the multi-sample PyClone benchmark, ddClone outperforms all other methods in terms of clustering accuracy with a mean V-measure of 0.68 (SD = 0.12). The next best performing methods are SCITE (mean V-measure = 0.60, SD = 0.08), PyClone (mean V-measure = 0.56, SD = 0.10), OncoNEM (mean V-measure = 0.53, SD = 0.11), PhyloWGS (mean V-measure = 0.52, SD = 0.12), and finally Clomial (mean V-measure = 0.52, SD = 0.15).
We note that although Clomial seems to tie with PhyloWGS, it did not converge over 4 out of 13 samples (P3 - Adnx1, P3 - Om1, P3 - ROv1, and P3 - ROv2). Similarly, OncoNEM did not converge over 5 out of 13 samples (P2 - ROv2, P3 - Adnx1, P3 - Om1, P3 - ROv1, and P3 - ROv2). This ranking is very similar in terms of the cellular prevalence metric, where ddClone has the lowest cellular prevalence estimation error (mean = 0.07, SD = 0.03), followed by PyClone (mean = 0.10, SD = 0.07). OncoNEM ties SCITE with a mean cellular prevalence error equal to 0.19 (SD = 0.06 and SD = 0.08 respectively). Then comes PhyloWGS (mean = 0.27, SD = 0.11) and finally Clomial (mean = 0.27, SD = 0.14). These results suggest that using ddClone over single datasource-only methods may help avoid catastrophic estimation errors, best exemplified in the Omentum site 1 in Patient 9 (P9 - Om1), where ddClone has a cellular prevalence estimation error less than one-fifth that of the second best performing method, SCITE (Fig. 10).

Fig. 10 Benchmarking results over HGSOvCa dataset. Performance results for ddClone and existing methods over HGSOvCa data, from three patients: Patient 2 (P2) at sites Om1, Om2, ROv1, ROv2, Patient 3 (P3) at sites Adnx1, Om1, ROv1, ROv2, and Patient 9 (P9) at sites LOv1, LOv2, Om1, Om2, and ROv1. Panel a shows clustering assignment performance. Panel b shows cellular prevalence approximation mean absolute error. (Om1) Omentum sample 1, (Om2) Omentum sample 2, (ROv1) Right ovary sample 1, (ROv2) Right ovary sample 2, (LOv1) Left ovary sample 1, (LOv2) Left ovary sample 2, (Adnx1) Adnexa sample 1

Investigating mutation clusters in a patient with acute lymphoblastic leukemia
Here we analyse a dataset consisting of targeted sequencing of a panel of mutations (mostly SNVs) in 1479 single tumour cells from six patients with acute lymphoblastic leukemia (ALL) [12]. The genomic loci were assumed to be highly diploid.
To confirm mutations in the single cell samples, the authors performed resequencing of the bulk samples over an average of 46 loci (between 10 and 105) for each patient.

Figure 11 shows ddClone's analysis on one of the patients in this study (Patient 1). Four clones were reported in this dataset, one of which was labelled a doublet (Fig. 11, clone number 4) and was removed from subsequent analyses. The authors then extracted consensus genotypes for these clones (Fig. 11, panel a, bottom). ddClone finds six clusters. While single cell genotypes support a merger of clusters 4 and 2, ddClone splits them in two, placing locus chr19:40895668 in a separate cluster. This split is supported by the bulk data, where the variant allele frequency (VAF) of chr19:40895668 is about 1.5 times the mean VAF of cluster 4 (0.33 and 0.22 respectively). Conversely, loci chr17:1657484 and chr1:38226084 have similar bulk VAFs (0.21 and 0.21 respectively), but since they have different prior genotypes, ddClone assigns them to separate clusters (clusters 4 and 5 respectively). PyClone assigns these two mutations to one cluster. We find similar instances in other patients in this dataset (see Additional file 1).

Due to the lack of multiple samples from within a patient, we were unable to establish a benchmark with the method used for the other real datasets. Despite this, we confirm that ddClone's estimated cellular prevalences are highly correlated with the reported bulk VAFs (R2 = 0.85 across all patients), suggesting that ddClone does not introduce unreasonable structure in the results (Additional file 1).

ddClone avoids co-clustering of mutations from distinct clones with shared cellular prevalences
Methods that cluster mutations based only on cellular prevalences are prone to grouping together mutations that belong to separate unique clones, if such clones happen to exist at similar cellular prevalences. Co-occurrence patterns from single cell data can be used to distinguish such clones.
We define mutually exclusive mutations (MEMs) as a pair of mutations that never co-occur in clones inferred from single cell genotype analysis. MEMs correspond to pairs of mutations with a Jaccard distance of one (see Methods). PyClone, the second best performing method in terms of clustering, erroneously merges multiple MEMs in 8 out of 13 samples across three patients in the HGSOvCa data (Additional file 5). The numbers of pairs of MEMs erroneously merged by single-sample PyClone in each of the 8 samples are 13, 140, 259, 103, 169, 2, 14, and 1 respectively.

Fig. 11 Analysis results of an acute lymphoblastic leukemia (ALL) dataset [12]. Analysis results of a patient with ALL (Patient 1) [12]. Shown are the variant allele frequencies (VAFs) from the bulk data (panel a, top) along with the consensus genotypes estimated from the binary cell matrix (panel a, bottom). These two constitute the input to the ddClone model. We note that the binary cell matrix (panel b) is displayed here for comparison and is not an input to ddClone. This binary cell matrix was used in [12] to cluster the cells into clones (vertical bar at the right side of the figure) and to derive consensus genotypes (bottom part of panel a). ddClone clusters mutations into 6 groups (panel c, top) and estimates cellular prevalence (φ) for each (panel c, bottom). ddClone's estimated φ are highly correlated with the corrected bulk VAFs (R2 = 0.98, also see Additional file 1), suggesting that it does not introduce unreasonable structure in the data. Furthermore, when there is evidence in the bulk, it can override its prior and split clusters as necessary. For instance, even though locus chr19:40895668 has the same prior genotype as loci in cluster 4, its VAF in the bulk data is 1.5 times the mean of loci in cluster 4. This hints at a finer structure in cluster 4, and ddClone has automatically assigned chr19:40895668 to a separate cluster
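The MEM criterion above (a Jaccard distance of one between the genotype columns of two loci) can be illustrated with a minimal Python sketch. The function names, the toy matrix, and the convention for two all-zero columns are assumptions made for illustration only; they are not taken from the ddClone package.

```python
import numpy as np


def jaccard_distance(a, b):
    """Jaccard distance between two binary vectors (1 = mutated)."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        # Assumption: two loci mutated in no genotype are treated as identical.
        return 0.0
    return 1.0 - np.logical_and(a, b).sum() / union


def mutually_exclusive_pairs(genotypes):
    """Return pairs (i, j), i < j, of loci (columns) with Jaccard distance 1."""
    g = np.asarray(genotypes, dtype=bool)
    n_loci = g.shape[1]
    return [
        (i, j)
        for i in range(n_loci)
        for j in range(i + 1, n_loci)
        if jaccard_distance(g[:, i], g[:, j]) == 1.0
    ]


# Hypothetical toy input: rows are clone genotypes, columns are loci.
toy_genotypes = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
])
mem_pairs = mutually_exclusive_pairs(toy_genotypes)
```

In this toy matrix, loci 1 and 2 never occur in the same genotype, so they form the only MEM pair, while locus 0 co-occurs with both.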
Even multi-sample PyClone fails to cluster MEMs correctly in 9 out of 13 samples in the HGSOvCa data, although for markedly fewer mutations. The numbers of pairs of MEMs erroneously merged by multi-sample PyClone in each of the 9 samples are 5, 5, 5, 5, 2, 2, 2, 2, and 2 respectively. In contrast, ddClone only merged MEMs in 2 out of 13 samples (1 pair in the first sample and 2 pairs in the second sample) in the HGSOvCa data.

One pair of MEMs, 15:26990805 (SNV at chromosome 15, coordinate 26990805) and 5:38686543 (SNV at chromosome 5, coordinate 38686543) from Patient 3 in Omentum sample 1, had assigned cellular prevalences of 0.47 and 0.48 by PyClone, 0.43 and 0.46 by ddClone, and 0.41 and 0.41 by multi-sample PyClone respectively. PyClone and multi-sample PyClone both merged these MEMs; however, ddClone, while estimating a cellular prevalence in agreement with multi-sample PyClone (mean absolute difference of 0.03), separated them into different clusters. See Additional file 5 for a complete list of MEMs. In the TNBC xenograft data, PyClone erroneously merged 6 MEMs in 1 out of 5 samples. Neither multi-sample PyClone nor ddClone merged any MEMs. Another example is loci 17:1657484 and 1:38226084 in Patient 1 in the ALL dataset. They have similar bulk VAFs (both equal to 0.21) but different prior genotypes, and ddClone assigns them to separate clusters while PyClone co-clusters them.

Taken together, results on real data suggest a marked advantage of using ddClone as measured by clustering accuracy. We note that the gains on prevalence error were more modest. We suggest this underscores the importance of single cell data to resolve mutation clustering as a reflection of genotype, while bulk data likely provides an accurate representation of mutation prevalence.
Thus the ddClone approach can leverage the strengths of both measurement types and provide an overall improvement in the parameters of interest.

ddClone overrides its prior in presence of evidence in the bulk data

ddClone is provided with a prior genotype-mutation matrix. When this prior encodes identical genotypes for two genomic loci, ddClone is very likely to cluster the pair together. However, if there is evidence in the bulk data suggesting that the mutations do not belong to the same cluster, i.e. their bulk VAFs corrected for CNA are too dissimilar, we expect the model to override its prior and assign those genomic loci to separate clusters. We define prior overriding mutations (POMs) as a pair of mutations that have identical prior genotypes but are clustered separately by ddClone. The TNBC xenograft dataset had on average 41 (ranging from 32 to 61) POM pairs. For instance, in sample SA501, timepoint X1, 20:3209183 and 2:152063945 were a POM pair with a corrected bulk VAF ratio of 6. On average about 10 (from 0 to 27) POM pairs were in the HGSOvCa data, including genomic loci 9:35546540 and X:154158018 from Patient 2, Omentum site 2, with a corrected bulk VAF ratio of 1.56. In the ALL dataset, in Patient 1, loci chr19:40895668 and chr17:1657484 had identical prior genotypes, but a corrected bulk VAF ratio of 1.4, and ddClone put them into separate clusters. In this dataset, Patients 1 to 5 had 3, 4, 105, 320, and 1264 such pairs, with an average corrected bulk VAF ratio of 1.36 ± 0.13, 1.61 ± 0.25, 1.72 ± 0.61, 1.40 ± 0.39, and 1.69 ± 1.19 respectively. There were no such pairs in Patient 6.

Conclusions

The ddClone approach presented here exemplifies the combined statistical strength of orthogonally derived observations for inference of clonal populations from next-generation sequencing. Single cell sequencing methods are continually improving; however, they will likely always be limited by the effect of small DNA inputs and sparsely sampled cell populations.
Bulk methods, on the other hand, will require computational deconvolution approaches to disentangle the unobserved underlying clonal constituents used to generate a measurement of interest. Here we show that bulk and single cell measurements, when fused together with joint statistical inference, can overcome the limitations of both methods, leading to more accurate inference. Single cell sequencing experiments typically generate a bulk template as a control sample, and so statistical integration can be ubiquitously applied. In particular, we show how ddClone resolves clonally mutually exclusive mutations which would otherwise be co-clustered in bulk, which would lead to underestimating the number of clones present in a sample of interest. We note that samples analysed by ddClone from the ovarian cancer study were heavily intermixed, as reported in [26], representing a situation where multiple clones co-existed in different anatomic sites at relatively equal prevalence. This is similar to what might be observed in haematological malignancies, where relatively less anatomic isolation of clones is the default model for clonality and thus clones are likely to co-exist at equal prevalence [12]. Failure to resolve clones in these scenarios could lead to poor and spurious biological interpretation and underestimation of tumour complexity. Multiple samples where clonal prevalences vary would lead to more accurate inference, as demonstrated by [2]; however, we show in the single sample scenario that ddClone can overcome under-clustering of mutations that arises from multiple clones co-occurring at near equal prevalences.

While ddClone presents an advance in statistical integration, several limitations remain. As investigators continue to dissect longitudinal clonal dynamics through temporal sampling, extensions to leverage statistical signals across multiple samples will be necessary.
Furthermore, we expect the method will generalize well to different single cell platforms offering longer reads with phased mutations. However, considering more mutations will come at a computational cost that may not scale to whole genome dimensions. This may limit the utility of ddClone in the case of whole genome analysis. In addition, we showed with theoretical and simulated 'clean' single cell data that single cell-only methods outperform ddClone. This is expected and suggests that, as single cell methods become more accurate, the need for bulk observations to infer the prevalence of clones may diminish.

Analogously, there are some scenarios in which bulk data may be a biased representation of the underlying tumour, for instance, due to sampling from spatially separated regions of the tumour [26]. This suggests that investigators should take caution in matching samples from single cell and bulk data.

We emphasize that multi-sample PyClone does not constitute ground truth. For example, we observe some erroneous clustering of mutations based on VAFs in its results. Nevertheless, previous research demonstrates that using samples from multiple regions or timepoints improves the accuracy of clonal structure inference methods [8, 9, 27], since statistical strength can be borrowed across multiple measurements. In this context, we use multi-sample analysis as a convenient benchmark against which we quantitatively assess performance using single sample data. This may be suboptimal, and thus our study illuminates the need to create ground truth datasets either through extensive orthogonal measurement or through engineered admixtures of related cell populations in defined proportions.

We focused our work on point mutations in this report, but other clonal marks such as structural variations and epigenetic markers can be used to infer clonal composition and dynamics.
Extensions to the model for features with different statistical properties will be required to integrate non-point-mutation features of the genome. The use of the Jaccard index to summarize the prior genotypes in our model may be suboptimal, due to differing false positive and false negative noise levels, among other reasons. We implemented an augmented Jaccard index taking this asymmetry into account. While for the majority of datasets it has a marginal effect, it improves the performance of ddClone in one of the real datasets analysed here. Continued improvement of summary statistics, including for example phylogenetic models, to encode prior knowledge should lead to further increases in accuracy.

Finally, the model we have proposed is unidirectional, encoding single cell data as a Bayesian prior and bulk data with a likelihood model. Future improvements may be realized by implementing a bi-directional inference framework which iteratively improves predictions from bulk data informed by single cell data, and from single cell data informed by bulk data. These limitations represent open problems for future work stimulated by our contribution here. We anticipate that our work here lays a foundation upon which complementary bulk and single cell measurements in cancer can be statistically integrated to sharpen the investigator's view of clonal dynamics. We contend this is an important step towards ultimately realizing quantitative fitness properties leading to a deeper understanding of cancer progression and morbidity in patients.

Methods

Concepts and definitions

Given (1) variant allele counts and (2) copy number at each genomic locus, (3) tumour cellularity, and (4) single cell genotype data, our method infers (a) cellular prevalences and (b) cluster assignments for those genomic loci. We review these notations below.

Variant allele counts.
We assume that at each genomic locus $i$, a total of $d_i$ reads map to locus $i$, out of which $b_i$ reads harbour the variant allele.

Variant allelic prevalence. The expected fraction of reads, $\xi$, that harbour the variant allele. This quantity is not observed directly; rather, we observe, for each locus of interest, the number of variant reads divided by the total number of reads in all cells.

Copy number at each genomic locus. Copy number variations influence the allelic prevalence $\xi$. An example of this influence is shown in Fig. 12b, where

$$\xi = \frac{2 \times 5}{2 \times 1 + 3 \times 3 + 3 \times 5} = \frac{5}{13}.$$

Tumour cellularity, $t$. The fraction of cancer cells in the sample. Hence the fraction of normal cells would be $1 - t$. We assume that tumour cellularity is estimated independently from our model.

Cell genotype data. Let $M$ denote the number of cell genotypes in the tumour sample and $N$ be the number of genomic loci in our model. Cell genotype data is modelled as a binary matrix in $\{0, 1\}^{M \times N}$ with rows corresponding to cell genotypes and columns to genomic loci. Entry $(m, n)$ equals 1 if genotype $m$ is mutated at locus $n$. We assume in this work that cell genotype data are derived from single cell sequencing studies.

The desired outputs are cluster assignments of genomic loci and their cellular prevalences. Cellular prevalence $\phi_i$ for a particular genomic locus $i$ is defined as the fraction of cells in the sample that harbour a mutation at that genomic locus. For example, in Fig. 12b cellular prevalence for the depicted genomic locus is $5/9$. Thus $1 - \phi_i$, the fraction of cancer cells from the reference population, is $1 - 5/9 = 3/9$. We define the clonal prevalence of a genotype to be the fraction of cells in the tumour sample harbouring that genotype.

Notation

Let $X = \{x_1, x_2, \ldots, x_N\}$ be the set of the $N$ genomic loci of interest, indexed by $\{1, 2, \ldots, N\}$. We adopt the notation $j:i$ for $j \le i$, $j, i \in \mathbb{N}$, to denote $\{j, j + 1, j + 2, \ldots, i\}$, a subset of successive integers, so that the index set is $1:N$.

We define a clustering of $X$ as a partition $T$ of its index set $1:N$, that is, $T = \{T_1, T_2, \ldots, T_K\}$ such that $\bigsqcup_{k \in 1:K} T_k = 1:N$, where $K$ is the number of clusters, $\sqcup$ denotes the disjoint union operator, and each subset $T_k$ is called a cluster.

Fig. 12 Hypothesized sitting arrangement in ddCRP/subpopulation assumptions in the bulk data. a Induced table sitting T(C) by a particular customer connection configuration C. Bold arrows show customer connections and dotted arrows point to equivalent table sittings. Since customer 7 only has a self-loop, the corresponding table has only one customer. b Our assumption about clonal architecture in the tumour with respect to a particular genomic locus. In this example, normal subpopulation represents a collection of un-mutated diploid cells. Reference subpopulation comprises cells that have a copy number amplification event, but no single nucleotide mutations. Variant subpopulation is a collection of cells that have an SNV at the particular genomic locus

We define $x_A$ for $A \subset 1:N$ to be $\{x_i \mid i \in A\}$. For example, $x_{T_k}$ is the set of data points in cluster $T_k$ and $x_{i:j} = \{x_i, x_{i+1}, x_{i+2}, \ldots, x_j\}$.

Furthermore, let $T(\cdot) : \mathbb{N} \to \mathbb{N}$ map data point indices to their clusters, that is, $T(i) = k$ iff $i \in T_k$.

Partitions of a graph

Let $G(V, E)$ denote an undirected graph $G$ where $V$ is the set of vertices and $E$ is the set of edges, i.e. a set of unordered pairs $\{u, v\} \subset V$. The set of edges $E$ induces a partitioning on $V$, where each connected component of $G$ corresponds to a cluster. With a slight abuse of notation, let $T(E) = T(G(V, E))$ denote this partitioning and $T_E^k$ denote its $k$-th cluster.

A directed graph $G(V, E)$ consists of a set of vertices $V$ and a set of directed edges $E$ where each edge is an ordered pair of vertices. For a directed graph $G$, we define its underlying undirected graph $U(G)$ to be the graph obtained by replacing all directed edges in $G$ with undirected ones.
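As an illustration of how a set of edges induces a partitioning of the vertices, the following Python sketch computes the connected components of the underlying undirected graph with a small union-find structure. The function name and the 0-based vertex indices are assumptions for illustration (the text indexes from 1); edge direction is ignored, so the same routine applies to directed edges.

```python
def induced_partition(n_vertices, edges):
    """Clusters = connected components of the underlying undirected graph."""
    parent = list(range(n_vertices))

    def find(v):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_v] = root_u  # merge the two components

    clusters = {}
    for v in range(n_vertices):
        clusters.setdefault(find(v), []).append(v)
    # Order clusters by their smallest member for a stable output.
    return sorted(clusters.values(), key=lambda members: members[0])


# Hypothetical example: (5, 5) is a self-loop, so vertex 5 is a singleton.
example_partition = induced_partition(6, [(0, 1), (1, 2), (3, 4), (5, 5)])
```

Here vertices 0, 1, and 2 form one cluster, 3 and 4 a second, and the self-loop at 5 yields a singleton cluster.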
Let $T(E)$ be the partitioning induced by $U(G)$, the underlying undirected graph of $G$. Throughout this document the $G$ corresponding to $E$ is always apparent from the context, with $V$ always being the set of our data points. Let $T_E : \mathbb{N} \to \mathbb{N}$ map vertex indices to their clusters, that is, $T_E(i) = k$ iff $i \in T_E^k$.

Traditional CRP

ddCRP can be explained through an alternative representation of the Chinese restaurant process (CRP). We follow the notation in [21]. In the traditional CRP, customers enter a Chinese restaurant and opt to sit at a table where the probability of joining a table is proportional to the number of customers already sitting at that table. Customers may also choose to sit at a new table with probability proportional to $\alpha$, a model parameter. In the Chinese restaurant metaphor, customers represent the genomic loci and tables represent clusters [28].

Let $z_i$ denote the table assignment for customer $i$ and assume that customers $1:(i-1)$ have occupied tables $1:K$, and let $n_k$ be the number of customers sitting at table $k$. The customer sitting configuration induces a partitioning of customer indices. The CRP draws $z_i$ as in Eq. (1).

$$p(z_i = k \mid z_{1:(i-1)}, \alpha) \propto \begin{cases} n_k & \text{for } k \le K \\ \alpha & \text{for } k = K + 1 \end{cases} \qquad (1)$$

Alternative representation of traditional CRP

Traditional CRP can equivalently be viewed as customers joining other customers instead of joining other tables. Let $c_i$ denote the customer index with whom customer $i$ is sitting and $C = c_{1:N}$.

This defines a directed graph $G(V, E)$ with $V$ the set of customer indices and $E$ the set of ordered pairs $(i, c_i)$. As described above, this induces $T_E = T(C)$, a partitioning of customer indices. Each cluster corresponds to a table in the traditional representation. Figure 12a shows an example $C$ and its corresponding $T(C)$.

In a generalization of this model, the probability for a customer $i$ to connect to a customer $j$ is proportional to a function of the distance between them.
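The sequential seating rule of Eq. (1) can be sketched in a few lines of Python. This is a toy sampler for the traditional CRP only, not the ddClone sampler; the function name, the 0-based table labels, and the use of the standard library random module are assumptions for illustration.

```python
import random


def sample_crp(n_customers, alpha, rng=None):
    """Seat customers one by one following Eq. (1); requires alpha > 0."""
    rng = rng or random.Random()
    assignments = []  # assignments[i] is the table of customer i
    counts = []       # counts[k] is the number of customers at table k
    for _ in range(n_customers):
        # Existing table k has weight n_k; a new table has weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # open a new table
        else:
            counts[k] += 1     # join an existing table
        assignments.append(k)
    return assignments, counts


crp_assignments, crp_counts = sample_crp(50, alpha=1.0, rng=random.Random(0))
```

Larger values of alpha make new tables more likely, so the expected number of clusters grows with alpha.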
The distance matrix $D$ encodes our knowledge about the data points' dissimilarity from a secondary source. In this work, this distance matrix is computed from the cell genotypes derived from single cell genotyping experiments. The non-increasing decay function $f$ takes non-negative finite values. This is summarized in Eq. (2).

$$p(c_i = j \mid D, \alpha) \propto \begin{cases} f(d_{i,j}) & \text{for } i \ne j \\ \alpha & \text{for } i = j \end{cases} \qquad (2)$$

This defines the ddCRP model. We note that picking a constant decay function $f(x) = 1$ reduces ddCRP to traditional CRP, since in that case, Eq. (2) is identical to Eq. (1).

The ddClone model

We assign each genomic locus to a customer. Throughout this document, we use cell genotype data from single cell genotyping studies to compute the distance between genomic loci. We note that this is not a requirement of the model, and other sources could be used to define dissimilarity between genomic loci.

Distance matrix

We have used the Jaccard distance to form the distance matrix $D \in [0, 1]^{N \times N}$ between genomic loci. The Jaccard distance is computed as one minus the Jaccard index, that is:

$$\mathrm{JaccardDist}(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|} = 1 - \frac{\sum_{i=1}^{M} (A_i \times B_i)}{\sum_{i=1}^{M} (A_i + B_i)} \qquad (3)$$

where $A_{M \times 1}$ and $B_{M \times 1}$ are binary column vectors, each representing a genomic locus, and $+$ is read as the Boolean OR so that the denominator counts $|A \cup B|$. Intuitively, this assigns a higher distance to genomic loci that co-occur less often in the single cell genotypes and vice versa. We note that our use of the Jaccard index to compute distances between genomic loci is related to distance-based phylogenetic inference methods [29].

As the Jaccard index is agnostic to the different false negative (FN) and false positive (FP) noise rates inherent in the single cell data, we have proposed and investigated an augmented (modified) Jaccard distance (MJD). The results show that while MJD has a marginal effect on ddClone's performance over simulated data, it substantially improves performance over one of the real datasets. See Additional file 1 for the formulation and more details.

Let $\lambda = \{s, \alpha, a\}$ be the collection of hyperparameters in our model. For brevity, we first assume that these hyperparameters are fixed, and in Additional file 1 discuss their resampling scheme.

Bulk population assumptions

Similar to PyClone, we make the simplifying assumption that the clonal population in the bulk data, with respect to a specific mutation, comprises three subpopulations: the normal, the reference, and the variant subpopulations. Figure 12b illustrates this assumption. To avoid confusion with the cell genotype states coming from the single cell sequencing study, we refer to the assumed copy number of the subpopulations in the bulk data as locus genotypes. This data is usually not available directly from the bulk data and has to be inferred or accounted for in the inference procedure.

Locus genotype state priors

Let $\psi_i = (g_N^i, g_R^i, g_V^i) \in (\mathbb{N}_0 \times \mathbb{N}_0)^3$ represent the assumed locus genotype state at each genomic locus $i$ in the bulk data, where $\mathbb{N}_0 = \mathbb{N} \cup \{0\}$.

Let $g_N^i$ represent the normal locus genotype N, $g_R^i$ represent the reference locus genotype R, and $g_V^i$ represent the variant locus genotype V. Each $g_S^i$ is a pair of non-negative integers that denote the copy number for the locus genotype $S \in \{N, R, V\}$ at the genomic locus $i$. For example, $g_N^i = (2, 3)$ means that the normal locus genotype in the bulk tumour sample has two copies of the reference allele and three copies of the variant allele at genomic locus $i$. Here $(0, 0)$ denotes a homozygous deletion. For $g \in \mathcal{G} = \mathbb{N}_0 \times \mathbb{N}_0$, let $\zeta : \mathcal{G} \to \mathbb{N}_0$ be the total copy number of locus genotype $g$.
We define $\mu(g)$, the probability of sampling a variant allele from a subpopulation with locus genotype $g$, as follows, where $b(g)$ denotes the number of variant allele copies in $g$:

$$\mu(g) = \begin{cases} \epsilon & \text{for } b(g) = 0 \\ 1 - \epsilon & \text{for } b(g) = \zeta(g) \\ \dfrac{b(g)}{\zeta(g)} & \text{otherwise} \end{cases}$$

where $\epsilon$ is the sequencing error probability, the probability of observing a variant allele when sequencing a true reference allele.

To capture the effects of locus genotypes, cellular prevalence, and tumour cellularity, we define $\xi(\psi, \phi, t)$ as follows:

$$\xi(\psi, \phi, t) = \frac{(1 - t)\,\zeta(g_N)}{Z}\,\mu(g_N) + \frac{t(1 - \phi)\,\zeta(g_R)}{Z}\,\mu(g_R) + \frac{t\phi\,\zeta(g_V)}{Z}\,\mu(g_V)$$

where $Z = (1 - t)\,\zeta(g_N) + t(1 - \phi)\,\zeta(g_R) + t\phi\,\zeta(g_V)$ is the normalizing constant.

To compute the likelihood, we sum over possible values of $\psi_i$. Since the discrete space of values quickly becomes intractable, we only consider a limited number of locus genotypes. This is done by defining an informative prior $\pi_i$ over $\psi_i$ (more details are given in Additional file 1).

The likelihood function

Given the priors over locus genotypes, the emission likelihood for one locus is:

$$p(b_i \mid \phi_i, d_i, \pi_i, t) = \sum_{\psi_i \in \mathcal{G}^3} p(b_i \mid \phi_i, d_i, \psi_i, t)\, p(\psi_i \mid \pi_i) \qquad (4)$$

To address overdispersion, we have modelled the conditional distribution of variant allele counts $b_i$ with a beta-binomial distribution, characterized in terms of mean and precision as follows:

$$p(b \mid d, m, s) = \binom{d}{b} \frac{B(b + sm,\; d - b + s(1 - m))}{B(sm,\; s(1 - m))} \qquad (5)$$

where $B$ is the beta function. To reflect our assumptions over the sample sub-population structure, we set the mean value to a function of locus genotypes, cellular prevalence, and cellularity for each data point, that is, $m = \xi(\psi_n, \phi_n, t)$. To reduce the number of parameters, all loci share the same precision $s$.

Synthetic data simulation

Single cell instantiation

To simulate cells, we first sample observed prevalences $\Phi^{\mathrm{observed}} = \{\Phi_1^{\mathrm{observed}}, \Phi_2^{\mathrm{observed}}, \ldots, \Phi_M^{\mathrm{observed}}\}$ for each genotype from a Dirichlet distribution, $\Phi^{\mathrm{observed}} \sim \mathrm{Dir}(\lambda \Phi)$, where $\Phi = \{\Phi_1, \Phi_2, \ldots, \Phi_M\}$ are the true prevalences for genotypes 1 to $M$. We then simulate $m$ cells from a multinomial distribution with parameters $\Phi^{\mathrm{observed}}$, i.e. $(n_1, n_2, \ldots, n_M) \sim \mathrm{Mult}(\Phi^{\mathrm{observed}})$ where $n_i$ is the number of cells that have genotype $i$. This process is equivalent to sampling the cells from a Dirichlet-multinomial distribution, that is, $(n_1, n_2, \ldots, n_M) \sim \text{Dirichlet-multinomial}(\lambda \Phi)$. The larger $\lambda$ is, the closer the two vectors $\Phi^{\mathrm{observed}}$ and $\Phi$ are. In fact, as the value of $\lambda$ grows, the Dirichlet-multinomial distribution progressively better approximates the multinomial distribution.

We measure the discrepancy between the true and the observed prevalences by the number of absent genotypes in the samples of cells and by the average absolute difference between true and observed genotype prevalences,

$$e = \frac{1}{M} \sum_{i=1}^{M} \left| \Phi_i - \Phi_i^{\mathrm{observed}} \right|.$$

For $\lambda = 0.01$, on average only about 1 out of 10 genotypes is observed in the sampled cells and $e = 0.17$. In contrast, when $\lambda = 1000$, on average, more than 9 out of 10 genotypes are observed and observed prevalences closely resemble the true genotype prevalences ($e = 0.008$).

Modelling doublet noise

Assume $K$ cells $c_1, c_2, \ldots, c_K$ are trapped in a well $w_d$, where the genotype of each cell $c_i$ corresponds to a row in the binary genotype matrix defined in the Methods section. We define the reported genotype for $w_d$, denoted $W_d$, as the element-wise logical OR of the genotypes of its constituent cells. In this study we assume that, for a doublet, exactly two cells are trapped in a well simultaneously ($K = 2$).

For a fixed value of $r_{\mathrm{doublet}}$, we first sample $m$ cells as the original set. Second, we sample an extra $r_{\mathrm{doublet}} \times m$ cells to act as co-trapped cells. Finally, we randomly pick $r_{\mathrm{doublet}} \times m$ cells of the original set and combine each with one of the co-trapped cells by recording the logical OR of their respective genotypes. These constitute the doublets.
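The sampling and doublet steps described above can be sketched in Python with NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the argument order, and the toy genotype matrix and prevalences are hypothetical, and cells are drawn directly as rows of the genotype matrix.

```python
import numpy as np


def distorted_prevalences(true_prev, lam, rng):
    """Observed prevalences drawn from a Dirichlet centred on the true ones."""
    return rng.dirichlet(lam * np.asarray(true_prev, dtype=float))


def sample_cells(genotypes, prevalences, n_cells, rng):
    """Draw n_cells rows of the genotype matrix with the given probabilities."""
    idx = rng.choice(len(prevalences), size=n_cells, p=prevalences)
    return genotypes[idx]  # fancy indexing returns a copy


def simulate_doublets(genotypes, prevalences, n_cells, r_doublet, rng):
    """Return (original, noisy) cell matrices; noisy contains the doublets."""
    n_trapped = int(round(r_doublet * n_cells))
    original = sample_cells(genotypes, prevalences, n_cells, rng)
    trapped = sample_cells(genotypes, prevalences, n_trapped, rng)
    noisy = original.copy()
    # Pick distinct original cells and OR each with one co-trapped cell.
    hit = rng.choice(n_cells, size=n_trapped, replace=False)
    noisy[hit] = np.logical_or(noisy[hit], trapped).astype(noisy.dtype)
    return original, noisy


rng = np.random.default_rng(0)
toy_genotypes = np.array([[1, 0, 0], [1, 1, 0], [1, 0, 1]])  # hypothetical
toy_prev = np.array([0.5, 0.3, 0.2])                          # hypothetical
observed_prev = distorted_prevalences(toy_prev, 1000.0, rng)
original_cells, noisy_cells = simulate_doublets(
    toy_genotypes, observed_prev, 100, 0.2, rng
)
```

Because the logical OR can only switch entries from 0 to 1, doublets add mutations to a cell's reported genotype and never remove them.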
Algorithm 1 shows the pseudo code for simulating doublets.

Algorithm 1 Simulating doublet noise
procedure SIMULATEDOUBLETNOISE(m, rDoublet, genotypeMatrix)
    NtrappedCells ← round(rDoublet × m)
    originalCells ← sampleCells(genotypeMatrix, m)
    trappedCells ← sampleCells(genotypeMatrix, NtrappedCells)
    noisyCells ← originalCells
    for i in 1 : NtrappedCells do
        Randomly pick without replacement a cell ci from originalCells
        noisyCells[ci] ← noisyCells[ci] OR trappedCells[i]
    end for
    return noisyCells
end procedure

In Algorithm 1, sampleCells(genotypeMatrix, m) is a method that, given a genotype matrix, returns an array X of size m, where the i-th item X[i] is a row in the genotype matrix.

Modelling allele drop-out noise

To simulate the effect of allele drop-outs (ADOs), we first pick $m$ cells from a multinomial distribution with parameters equal to the true prevalence of each genotype, that is, $(n_1, n_2, \ldots, n_M) \sim \mathrm{Mult}(\Phi)$, where $n_i$ is the number of cells that have genotype $i$, $\sum_{i=1}^{M} n_i = m$, and $\Phi$ is the true prevalence of each genotype. This results in a binary cell-genotype matrix $G \in \{0, 1\}^{m \times N}$ with rows corresponding to sampled cells and columns corresponding to genomic loci, where $G_{i,j} = 1$ if cell $i$ is mutated at locus $j$. We assume that ADO affects a cell by turning a mutated locus into an unmutated one, causing a false negative error. When an unmutated locus is affected, it mimics a deletion and does not alter the genotype matrix. At a fixed ADO rate, $r_{\mathrm{ADO}}$, we randomly pick a fraction $r_{\mathrm{ADO}}$ of the mutated loci across all sampled cells and set their value to zero. This constitutes the modified binary matrix $G$ that we use as input to ddClone.
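The allele drop-out step just described can be sketched in Python with NumPy as follows. The sketch follows the prose description, in which a fraction r_ADO of the mutated entries across all sampled cells is set to zero; the function name and the toy input are assumptions for illustration and are not part of the ddClone package.

```python
import numpy as np


def simulate_ado(cells, r_ado, rng):
    """Zero out a fraction r_ado of the mutated entries of a binary matrix."""
    noisy = cells.copy()                # leave the input matrix untouched
    rows, cols = np.nonzero(noisy)      # coordinates of all mutated entries
    n_drop = int(round(r_ado * len(rows)))
    if n_drop == 0:
        return noisy
    # Choose distinct mutated entries and turn them into false negatives.
    pick = rng.choice(len(rows), size=n_drop, replace=False)
    noisy[rows[pick], cols[pick]] = 0
    return noisy


toy_cells = np.ones((10, 4), dtype=int)  # hypothetical: every locus mutated
ado_cells = simulate_ado(toy_cells, 0.25, np.random.default_rng(1))
```

With 40 mutated entries and a drop-out rate of 0.25, exactly 10 entries are set to zero, and no unmutated entry is ever switched on.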
Algorithm 2 shows the pseudo code for simulating allele drop-out noise.

Algorithm 2 Simulating allele drop-out noise
procedure SIMULATEADONOISE(m, rADO, genotypeMatrix)
    G ← sampleCells(genotypeMatrix, m)
    mutatedLoci ← {(i, j) : G[i, j] = 1}
    NdroppedAlleles ← round(rADO × |mutatedLoci|)
    droppedLoci ← randomly pick NdroppedAlleles loci from mutatedLoci
    for (i, j) in droppedLoci do
        G[i, j] ← 0
    end for
    return G
end procedure

Inference

We use a Gibbs sampler to draw samples from the posterior distribution of the model. We initialize the sampler such that all customers are in their own clusters. Let $c_{-i}$ be the customer connection configuration with customer $i$'s outgoing connection removed. Let $x_i = (b_i, d_i)$ denote the observed data, namely, variant and total allele counts. The full conditional distribution of $c_i$ is:

$$p(c_i \mid c_{-i}, x_{1:N}, \lambda) \propto p(c_i \mid \lambda)\, p(x_{1:N} \mid c_i, c_{-i}, \lambda) \qquad (6)$$

where $p(c_i \mid \lambda)$ is the same as Eq. (2) and $\lambda$ is the set of all hyperparameters. Let $x_{T_k}$ be the set of customers in cluster $T_k$ or, equivalently, the set of customers sitting at table $k$; then the likelihood term factors as:

$$p(x_{1:N} \mid c_{-i}, c_i = j, \lambda) = \prod_{T_k \in T(C)} p(x_{T_k} \mid \lambda) \qquad (7)$$

where $T(C)$ is the partitioning induced by the current customer connection configuration $C$. The term $p(x_{T_k} \mid \lambda)$ further expands as:

$$p(x_{T_k} \mid \lambda) = \int \left( \prod_{i \in T_k} p(x_i \mid \theta, \lambda) \right) p(\theta \mid \lambda)\, d\theta \qquad (8)$$

where the likelihood $p(x_i \mid \theta, \lambda) = p(b_i \mid \phi_i, d_i, \pi_i, t)$ is the same as Eq. (4).

Since our prior over cellular prevalences $\phi_i$ is non-conjugate to the likelihood, we resort to a cached version of the Griddy Gibbs method [30] to compute the above integral. At the end of each iteration (i.e. when all customers are reassigned), we sample $\phi_k$ for each cluster $k$ as follows:

$$\phi_k \sim p(\phi_k \mid x_{T_k}, \pi_{T_k}, t, \lambda) \propto p(\phi_{T_k} \mid \lambda)\, p(x_{T_k} \mid \phi_{T_k}, \lambda, \pi_{T_k}, t) \qquad (9)$$

where $p(\phi_{T_k} \mid \lambda)$ is the probability density function of a uniform distribution.

Approximating λ in real datasets

First, we computed, for simulated datasets with various values of $\lambda$, the concordance between bulk and single cell data as measured by the coefficient of determination (R2), that is, how well mutation cellular prevalences ($\phi$) estimated from the bulk data correspond to those estimated from the single cell data.

We then measured the observed concordance between mutation cellular prevalences as estimated from bulk data by multi-sample PyClone (for TNBC xenograft and HGSOvCa datasets) or corrected bulk VAFs (for the ALL dataset) and single cell data. Lastly, we determined the value of $\lambda$ at which the R2 value in the simulated dataset matched the R2 value of each real dataset. The estimated $\lambda$ values are 1.13 ± 0.31, 2.00 ± 0.21, and 2.24 ± 0.21 for the HGSOvCa, TNBC, and ALL datasets respectively. For the ALL dataset, in computing the coefficient of determination, we set aside the outlier Patient 5, which had an R2 = 0.08. We note that since single cell data in the real datasets are affected by sources of noise other than sampling distortion, including doublets and ADOs, the above procedure overestimates $\lambda$.

Clustering summarization

To cluster genomic loci we first compute the posterior similarity matrix and then maximize the PEAR index to compute a point estimate [31] as implemented in the R package mcclust [32]. We estimate the cellular prevalence for each genomic locus as the mean of the post-burn-in Markov chain Monte Carlo (MCMC) samples.

Computational complexity

Computing the distance matrix takes O(N2M) where N and M are the rows and columns of the input matrix to ddClone. In the intended use of ddClone, the input matrix would be the binary genotype matrix, in which case N is the number of genotypes and M is the number of genomic loci. Computing the clustering result takes O(M2).
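As an illustration of the clustering summarization step described above, the posterior similarity matrix can be computed from post-burn-in cluster labels as the fraction of samples in which two loci share a cluster. The Python sketch below is an assumption-laden illustration: the function name and the toy label samples are hypothetical, and the PEAR maximization performed with the R package mcclust is not reproduced here.

```python
import numpy as np


def posterior_similarity(label_samples):
    """Entry (i, j) is the fraction of samples with loci i and j co-clustered."""
    labels = np.asarray(label_samples)  # shape: (n_samples, n_loci)
    # Compare every pair of loci within each sample, then average over samples.
    same_cluster = labels[:, :, None] == labels[:, None, :]
    return same_cluster.mean(axis=0)


# Hypothetical cluster labels for 3 loci over 2 MCMC samples.
toy_labels = [
    [0, 0, 1],
    [0, 1, 1],
]
psm = posterior_similarity(toy_labels)
```

The pairwise comparison over all loci is quadratic in the number of loci for each sample, which matches the quadratic cost quoted for the clustering step.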
The complete analysis with 10,000 MCMC iterations on a machine with 40 cores of Intel Xeon 2.20 GHz CPU and 500 GB of RAM, for a dataset of 37 genomic loci, takes about 6 hours (365.9 ± 47.32 minutes) to finish (averaged over 4 samples from Patient 2 in the HGSOvCa dataset).

Additional files

Additional file 1: Supplementary information. A PDF file that contains supplementary figures, details of the mathematical derivation of the ddClone model, and the description and algorithms to simulate data from the Generalized Dollo model. (PDF 2200 kb)

Additional file 2: Simulated data inputs. An Excel file containing bulk allele counts and single cell genotypes generated from the Generalized Dollo model used as input to ddClone and existing methods in simulation studies. These data were used to assess the performance of all methods considered in this study and resulted in Fig. 3. (XLSX 84 kb)

Additional file 3: Real data inputs. An Excel file containing bulk data allele counts and single cell genotypes from both TNBC xenograft and ITH HGSOvCa datasets used as input to ddClone and existing methods. These data were used to assess the performance of all methods considered in this study and resulted in Figs. 9 and 10. (XLSX 67 kb)

Additional file 4: Real data benchmarks. An Excel file containing the multi-sample PyClone results for both TNBC xenograft and ITH HGSOvCa datasets used as benchmark to assess the performance of ddClone and existing methods. The performance metrics reported in Figs. 9 and 10 are measured against these results. (XLSX 32 kb)

Additional file 5: List of mutually exclusive mutations (MEMs). A zip archive of five txt files that list the MEMs (see the main text) found in TNBC xenograft and ITH HGSOvCa datasets as well as the MEMs merged by PyClone and multi-sample PyClone. (ZIP 7 kb)

Funding

We gratefully acknowledge long-term funding support from the BC Cancer Foundation.
This project was supported by a Discovery Frontiers project grant, “The Cancer Genome Collaboratory”, jointly sponsored by the Natural Sciences and Engineering Research Council (NSERC), Genome Canada (GC), the Canadian Institutes of Health Research (CIHR) and the Canada Foundation for Innovation (CFI) to SPS and ABC. The SPS and SA groups receive operating funds from the Canadian Cancer Society Research Institute, Terry Fox Research Institute, Genome Canada/Genome BC, Canadian Institutes of Health Research (CIHR) grant #245779, and a CIHR Foundation program to SPS. SPS is a Michael Smith Foundation for Health Research Scholar; SPS and SA hold Canada Research Chairs.

Availability of data and materials
Our model is implemented in the R [33] programming language and is freely available as an open source R package on GitHub https://github.com/sohrabsa/ddclone/ under GPLv2 licence. The source code is also deposited in a DOI-assigning repository at https://doi.org/10.5281/zenodo.220987. It is built upon the implementation of ddCRP in [21].

Authors’ contributions
SS and AR provided model development and implementation. SS, ABC, and SPS designed and executed the experiments. AS analysed the data, and SA oversaw the single cell data. SPS and ABC conceived and oversaw the project. All authors read and approved the final manuscript.

Competing interests
The authors declare that they have no competing interests.

Ethics approval and consent to participate
Not applicable.

Author details
1 Bioinformatics Graduate Program, University of British Columbia, 570 West 7th Avenue, V5Z 4S6, Vancouver, BC, Canada.
2 Department of Pathology and Laboratory Medicine, University of British Columbia, V6T 2B5, Vancouver, BC, Canada.
3 Department of Molecular Oncology, British Columbia Cancer Agency, 675 West 10th Avenue, V5Z 1L3, Vancouver, BC, Canada.
4 Department of Statistics, University of British Columbia, 2207 Main Mall, V6T 1Z4, Vancouver, BC, Canada.

Received: 8 November 2016 Accepted: 10 February 2017

References
1. Nowell PC. The clonal evolution of tumor cell populations. Science. 1976;194(4260):23–8.
2. Roth A, Khattra J, Yap D, Wan A, Laks E, Biele J, Ha G, Aparicio S, Bouchard-Côté A, Shah SP. PyClone: statistical inference of clonal population structure in cancer. Nat Meth. 2014;11(4):396–8.
3. Navin N, Kendall J, Troge J, Andrews P, Rodgers L, McIndoo J, Cook K, Stepansky A, Levy D, Esposito D, et al. Tumour evolution inferred by single-cell sequencing. Nature. 2011;472(7341):90–4.
4. Roth A, Ding J, Morin R, Crisan A, Ha G, Giuliany R, Bashashati A, Hirst M, Turashvili G, Oloumi A, et al. JointSNVMix: a probabilistic model for accurate detection of somatic mutations in normal/tumour paired next-generation sequencing data. Bioinformatics. 2012;28(7):907–13.
5. Saunders CT, Wong WS, Swamy S, Becq J, Murray LJ, Cheetham RK. Strelka: accurate somatic small-variant calling from sequenced tumor–normal sample pairs. Bioinformatics. 2012;28(14):1811–7.
6. Ding J, Bashashati A, Roth A, Oloumi A, Tse K, Zeng T, Haffari G, Hirst M, Marra MA, Condon A, et al. Feature-based classifiers for somatic mutation detection in tumour–normal paired sequencing data. Bioinformatics. 2012;28(2):167–75.
7. Cibulskis K, Lawrence MS, Carter SL, Sivachenko A, Jaffe D, Sougnez C, Gabriel S, Meyerson M, Lander ES, Getz G. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotechnol. 2013;31(3):213–9.
8. Jiao W, Vembu S, Deshwar A, Stein L, Morris Q. Inferring clonal evolution of tumors from single nucleotide somatic mutations. BMC Bioinforma. 2014;15(1):35. doi:10.1186/1471-2105-15-35.
9. Zare H, Wang J, Hu A, Weber K, Smith J, Nickerson D, Song C, Witten D, Blau CA, Noble WS. Inferring clonal composition from multiple sections of a breast cancer. PLoS Comput Biol. 2014;10(7):1003703.
10. El-Kebir M, Oesper L, Acheson-Field H, Raphael BJ. Reconstruction of clonal trees and tumor composition from multi-sample sequencing data. Bioinformatics. 2015;31(12):62–70.
11. Zare H, Wang J, Hu A, Weber K, Smith J, Nickerson D, Song C, Witten D, Blau CA, Noble WS. Inferring clonal composition from multiple sections of a breast cancer. PLoS Comput Biol. 2014;10(7):1003703.
12. Gawad C, Koh W, Quake SR. Proc Nat Acad Sci USA. 2014;111(50):17947–17952. doi:10.1073/pnas.1420822111.
13. Wang Y, Navin NE. Advances and applications of single-cell sequencing technologies. Mol Cell. 2015;58(4):598–609.
14. Navin NE. Cancer genomics: one cell at a time. Genome Biol. 2014;15:452.
15. de Bourcy CF, De Vlaminck I, Kanbar JN, Wang J, Gawad C, Quake SR. A quantitative comparison of single-cell whole genome amplification methods. PLoS One. 2014;9(8):105585.
16. Roth A, McPherson A, Laks E, Biele J, Yap D, Wan A, Smith MA, Nielsen CB, McAlpine JN, Aparicio S, Bouchard-Cote A, Shah SP. Clonal genotype and population structure inference from single-cell tumor sequencing. Nat Meth. 2016;13(7):573–6.
17. Gawad C, Koh W, Quake SR. Single-cell genome sequencing: current state of the science. Nat Rev Genet. 2016;17(3):175–88.
18. Ross EM, Markowetz F. OncoNEM: inferring tumor evolution from single-cell sequencing data. Genome Biol. 2016;17(1):1.
19. El-Kebir M, Satas G, Oesper L, Raphael BJ. Inferring the mutational history of a tumor using multi-state perfect phylogeny mixtures. Cell Syst. 2016;3(1):43–53.
20. Jahn K, Kuipers J, Beerenwinkel N. Tree inference for single-cell data. Genome Biol. 2016;17(1):1.
21. Blei DM, Frazier PI. Distance dependent Chinese restaurant processes. J Mach Learn Res. 2011;12:2461–88.
22. Deshwar AG, Vembu S, Yung CK, Jang GH, Stein L, Morris Q. PhyloWGS: Reconstructing subclonal composition and evolution from whole-genome sequencing of tumors. Genome Biol. 2015;16(1):35. doi:10.1186/s13059-015-0602-8.
23. Rosenberg A, Hirschberg J. V-measure: A conditional entropy-based external cluster evaluation measure. In: EMNLP-CoNLL. Prague: Association for Computational Linguistics; 2007. p. 410–20.
24. Eirew P, Steif A, Khattra J, Ha G, Yap D, Farahani H, Gelmon K, Chia S, Mar C, Wan A, et al. Dynamics of genomic clones in breast cancer patient xenografts at single-cell resolution. Nature. 2014;518:422–6.
25. Ronquist F, Teslenko M, van der Mark P, Ayres DL, Darling A, Höhna S, Larget B, Liu L, Suchard MA, Huelsenbeck JP. MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space. Syst Biol. 2012;61(3):539–42.
26. McPherson A, Roth A, Laks E, Masud T, Bashashati A, Zhang AW, Ha G, Biele J, Yap D, Wan A, Prentice LM, Khattra J, Smith MA, Nielsen CB, Mullaly SC, Kalloger S, Karnezis A, Shumansky K, Siu C, Rosner J, Chan HL, Ho J, Melnyk N, Senz J, Yang W, Moore R, Mungall AJ, Marra MA, Bouchard-Cote A, Gilks CB, Huntsman DG, McAlpine JN, Aparicio S, Shah SP. Divergent modes of clonal spread and intraperitoneal mixing in high-grade serous ovarian cancer. Nat Genet. 2016;48(7):758–67.
27. Schuh A, Becq J, Humphray S, Alexa A, Burns A, Clifford R, Feller SM, Grocock R, Henderson S, Khrebtukova I, et al. Monitoring chronic lymphocytic leukemia progression by whole genome sequencing reveals heterogeneous clonal evolution patterns. Blood. 2012;120(20):4191–6.
28. Sammut C, Webb GI. Encyclopedia of machine learning, 1st ed. New York: Springer; 2011.
29. Felsenstein J. Distance methods for inferring phylogenies: a justification. Evolution. 1984;38:16–24.
30. Ritter C, Tanner MA. Facilitating the Gibbs sampler: the Gibbs stopper and the griddy-Gibbs sampler. J Am Stat Assoc. 1992;87(419):861–8. doi:10.1080/01621459.1992.10475289. http://www.tandfonline.com/doi/pdf/10.1080/01621459.1992.10475289.
31. Fritsch A, Ickstadt K, et al. Improved criteria for clustering based on the posterior similarity matrix. Bayesian Anal. 2009;4(2):367–91.
32. Fraley C, Raftery AE. Model-based clustering, discriminant analysis and density estimation. J Am Stat Assoc. 2002;97:611–31.
33. R Development Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2008. ISBN 3-900051-07-0. http://www.R-project.org.
Item Metadata
Title | ddClone: joint statistical inference of clonal populations from single cell and bulk tumour sequencing data |
Creator | Salehi, Sohrab; Steif, Adi; Roth, Andrew; Aparicio, Samuel; Bouchard-Côté, Alexandre; Shah, Sohrab P. |
Contributor | University of British Columbia. Graduate Program in Bioinformatics |
Publisher | BioMed Central |
Date Issued | 2017-03-01 |
Description | Next-generation sequencing (NGS) of bulk tumour tissue can identify constituent cell populations in cancers and measure their abundance. This requires computational deconvolution of allelic counts from somatic mutations, which may be incapable of fully resolving the underlying population structure. Single cell sequencing (SCS) is a more direct method, although its replacement of NGS is impeded by technical noise and sampling limitations. We propose ddClone, which analytically integrates NGS and SCS data, leveraging their complementary attributes through joint statistical inference. We show on real and simulated datasets that ddClone produces more accurate results than can be achieved by either method alone. |
Subject | Intra-tumour heterogeneity; Clonal evolution; Joint probabilistic model; Distance dependent; Chinese restaurant process; Single cell sequencing; Next-generation sequencing |
Genre | Article |
Type | Text |
Language | eng |
Date Available | 2017-12-14 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution 4.0 International (CC BY 4.0) |
DOI | 10.14288/1.0362034 |
URI | http://hdl.handle.net/2429/64003 |
Affiliation | Medicine, Faculty of; Science, Faculty of; Other UBC; Non UBC; Pathology and Laboratory Medicine, Department of; Statistics, Department of |
Citation | Genome Biology. 2017 Mar 01;18(1):44 |
Publisher DOI | 10.1186/s13059-017-1169-3 |
Peer Review Status | Reviewed |
Scholarly Level | Faculty |
Copyright Holder | The Author(s) |
Rights URI | http://creativecommons.org/licenses/by/4.0/ |
Aggregated Source Repository | DSpace |