dd-PyClone: Improving Clonal Subpopulation Inferencefrom Single Cells and Bulk Sequencing DatabySohrab SalehiB.Sc. Computer Engineering, Amirkabir University of Technology, 2011A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFMaster of ScienceinTHE FACULTY OF GRADUATE AND POSTDOCTORALSTUDIES(Bioinformatics)The University of British Columbia(Vancouver)December 2015c© Sohrab Salehi, 2015AbstractImproving our understanding of intra-tumour heterogeneity in cancer has importantclinical implications, including an opportunity to understand mechanisms behindrelapses and drug resistance. Next generation bulk sequencing is a mature tech-nology that has been used to study subclonal tumour populations at an aggregatelevel. Inference of populations from bulk sequencing requires sophisticated com-putational deconvolution methods. An alternative is to identify populations directlywith single cell sequencing. However, single cell sequencing is a very error-proneprocess, and this impedes its ability to completely replace bulk sequencing for now.In this work we present dd-PyClone, a statistical model to combine single celland bulk sequencing data to study clonal subpopulation architecture and improveclustering assignment and cellular prevalence estimates of a set of genomic loci.We introduce a single nucleotide variant and copy number aberration awaregenotype simulation scheme based on a phylogenetic tree, termed the GeneralizedDollo model. This model is an improvement over previous genotype generatormodels in that it also accounts for the evolutionary process before a rare event(here the single nucleotide variant) occurs.We show that incorporating genomic loci co-occurrence patterns from singlecell sequencing studies in inferring clonal subpopulation structure from bulk se-quencing data is beneficial. Our method outperforms existing methods in simula-tion studies and performs comparably in real dataset benchmarking. We also showthat our method is fairly robust as to the choice of hyperparameters and performsreasonably in presence of noise. We hope that our method will further the under-standing of the evolutionary basis of cancer.iiPrefaceThe project idea was conceived by Prof. Shah. The application of a distance depen-dent Chinese restaurant process to solve this problem was my idea. The generalizedDollo model in Chapter 7 is Prof. Bouchard-Coˆte´’s idea. The real datasets used inChapter 7 was used with permission from authors.The implementation of ddCRP inference algorithm is based on that in [2]. Ihave extended it to support the non-conjugate case relevant to this work.There are no publications based on this work so far.iiiTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiiAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Cancer as an evolutionary phenomenon . . . . . . . . . . . . . . 11.2 Next generation sequencing . . . . . . . . . . . . . . . . . . . . . 21.2.1 Somatic single nucleotide variants . . . . . . . . . . . . . 21.2.2 Targeted deep sequencing . . . . . . . . . . . . . . . . . 31.2.3 Copy number aberrations . . . . . . . . . . . . . . . . . . 31.2.4 Measuring clonal parameters . . . . . . . . . . . . . . . . 41.3 Existing methods . . . . . . . . . . . . . . . . . . . . . . . . . . 41.3.1 Clomial . . . . . . . . . . . . . . . . . . . . . . . . . . . 41.3.2 PyClone . . . . . . . . . . . . . . . . . . . . . . . . . . . 51.3.3 PhyloSub and PhyloWGS . . . . . . . . . . . . . . . . . 61.3.4 SciClone . . . . . . . . . . . . . . . . . . . . . . . . . . 61.4 Single cell sequencing . . . . . . . . . . . . . . . . . . . . . . . 7iv1.5 Main hypothesis, combining bulk data and SCS data . . . . . . . 81.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102.1 Problem statement . . . . . . . . . . . . . . . . . . . . . . . . . 112.1.1 Main hypothesis . . . . . . . . . . . . . . . . . . . . . . 112.1.2 Concepts and definitions . . . . . . . . . . . . . . . . . . 112.1.3 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . 132.1.4 Simplified generative model . . . . . . . . . . . . . . . . 142.2 Distance dependent Chinese Restaurant Process . . . . . . . . . . 162.2.1 Traditional CRP . . . . . . . . . . . . . . . . . . . . . . 162.2.2 Alternative representation of Traditional CRP . . . . . . . 162.2.3 Generative process for ddCRP mixture modelling . . . . . 182.3 The dd-PyClone model . . . . . . . . . . . . . . . . . . . . . . . 192.3.1 Distance matrix . . . . . . . . . . . . . . . . . . . . . . . 222.3.2 Bulk population assumptions . . . . . . . . . . . . . . . . 222.3.3 Pseudo-genotype state priors . . . . . . . . . . . . . . . . 232.3.4 The Parental Copy Number (PCN) prior . . . . . . . . . . 242.3.5 The likelihood function . . . . . . . . . . . . . . . . . . . 252.4 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272.5 Resampling hyperparameters . . . . . . . . . . . . . . . . . . . . 282.6 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 292.7 Limitations and future extensions . . . . . . . . . . . . . . . . . . 292.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313.1 General information . . . . . . . . . . . . . . . . . . . . . . . . . 323.1.1 Clustering summarization . . . . . . . . . . . . . . . . . 323.1.2 Clustering evaluation . . . . . . . . . . . . . . . . . . . . 323.2 Simulated data . . . . . . . . . . . . . . . . . . . . . . . . . . . 323.2.1 Simulating genotypes . . . . . . . . . . . . . . . . . . . . 323.2.2 Simulating bulk data . . . . . . . . . . . . . . . . . . . . 393.2.3 Benchmarking against existing methods . . . . . . . . . . 40v3.3 Real dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443.3.1 Triple-negative breast cancer Xenograft data . . . . . . . 443.3.2 Establishing the ground truth . . . . . . . . . . . . . . . . 453.4 Parameter sensitivity . . . . . . . . . . . . . . . . . . . . . . . . 473.4.1 Sensitivity to value of a . . . . . . . . . . . . . . . . . . 493.4.2 Sensitivity to presence of noise . . . . . . . . . . . . . . . 503.4.3 Sensitivity to a and noise . . . . . . . . . . . . . . . . . . 543.4.4 Effect of non-exchangeability . . . . . . . . . . . . . . . 643.5 Computational aspects . . . . . . . . . . . . . . . . . . . . . . . 653.6 Convergence diagnostics . . . . . . . . . . . . . . . . . . . . . . 663.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 674 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 684.1 Significance and contribution . . . . . . . . . . . . . . . . . . . . 684.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 694.3 Potential applications . . . . . . . . . . . . . . . . . . . . . . . . 694.4 Future research . . . . . . . . . . . . . . . . . . . . . . . . . . . 704.5 Final word . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71A Supporting Materials . . . . . . . . . . . . . . . . . . . . . . . . . . 75A.1 Simulated genotypes from the GD model . . . . . . . . . . . . . 75A.2 Convergence analysis results . . . . . . . . . . . . . . . . . . . . 80viList of TablesTable 2.1 Notation reference for dd-PyClone . . . . . . . . . . . . . . . 21viiList of FiguresFigure 1.1 The workflow of dd-PyClone. . . . . . . . . . . . . . . . . . 8Figure 2.1 Phylogenetic tree and mutation co-occurrence relationship . . 12Figure 2.2 The ddCRP prior . . . . . . . . . . . . . . . . . . . . . . . . 17Figure 2.3 Probabilistic graphical model for dd-PyClone . . . . . . . . . 20Figure 2.4 Subpopulation assumptions in the bulk data . . . . . . . . . . 23Figure 2.5 Gibbs steps in the inference procedure . . . . . . . . . . . . . 28Figure 3.1 The synthetic data simulation workflow . . . . . . . . . . . . 33Figure 3.2 The Generalized Dollo model and its predecessors . . . . . . 34Figure 3.3 Rate matrices governing the GD model . . . . . . . . . . . . 36Figure 3.4 Rooting and trimming concepts in the GD model . . . . . . . 37Figure 3.5 Simulated phylogenetic tree and the resulting binarized geno-type matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . 40Figure 3.6 Benchmarking results over simulated data . . . . . . . . . . . 42Figure 3.7 Genotypes curated for the triple-negative breast cancer data . . 45Figure 3.8 Analysis results for multi-sample PyClone on TNBC dataset . 46Figure 3.9 Analysis results for multi-sample PyClone on TNBC datasetfor overlapping mutations . . . . . . . . . . . . . . . . . . . 47Figure 3.10 Benchmarking results over TNBC dataset . . . . . . . . . . . 48Figure 3.11 Sensitivity to decay function’s hyperparameter a . . . . . . . 49Figure 3.12 Performance analysis in presence of point noise . . . . . . . . 51Figure 3.13 Performance analysis in presence of loss of multiple genotypes 53Figure 3.14 Performance analysis in presence of loss of individual genotypes 54viiiFigure 3.15 Performance analysis in presence of point noise and varyinghyperparameter a . . . . . . . . . . . . . . . . . . . . . . . . 58Figure 3.16 Performance analysis in presence of individual genotype lossand varying hyperparameter a . . . . . . . . . . . . . . . . . 61Figure 3.17 Performance analysis in presence of multiple genotype lossand varying hyperparameter a . . . . . . . . . . . . . . . . . 64Figure 3.18 Effects of non-exchangeability . . . . . . . . . . . . . . . . . 65Figure 3.19 Convergence analysis . . . . . . . . . . . . . . . . . . . . . . 66Figure A.1 Simulated phylogenetic tree and the resulting binarized geno-type matrix with seed 0 . . . . . . . . . . . . . . . . . . . . . 75Figure A.2 Simulated phylogenetic tree and the resulting binarized geno-type matrix with seed 1 . . . . . . . . . . . . . . . . . . . . . 76Figure A.3 Simulated phylogenetic tree and the resulting binarized geno-type matrix with seed 2 . . . . . . . . . . . . . . . . . . . . . 76Figure A.4 Simulated phylogenetic tree and the resulting binarized geno-type matrix with seed 3 . . . . . . . . . . . . . . . . . . . . . 77Figure A.5 Simulated phylogenetic tree and the resulting binarized geno-type matrix with seed 4 . . . . . . . . . . . . . . . . . . . . . 77Figure A.6 Simulated phylogenetic tree and the resulting binarized geno-type matrix with seed 5 . . . . . . . . . . . . . . . . . . . . . 78Figure A.7 Simulated phylogenetic tree and the resulting binarized geno-type matrix with seed 6 . . . . . . . . . . . . . . . . . . . . . 78Figure A.8 Simulated phylogenetic tree and the resulting binarized geno-type matrix with seed 8 . . . . . . . . . . . . . . . . . . . . . 79Figure A.9 Simulated phylogenetic tree and the resulting binarized geno-type matrix with seed 9 . . . . . . . . . . . . . . . . . . . . . 79Figure A.10 Convergence analysis . . . . . . . . . . . . . . . . . . . . . . 80Figure A.11 Convergence analysis . . . . . . . . . . . . . . . . . . . . . . 81Figure A.12 Convergence analysis . . . . . . . . . . . . . . . . . . . . . . 82Figure A.13 Convergence analysis . . . . . . . . . . . . . . . . . . . . . . 83Figure A.14 Convergence analysis . . . . . . . . . . . . . . . . . . . . . . 84Figure A.15 Convergence analysis . . . . . . . . . . . . . . . . . . . . . . 85ixAcknowledgmentsI would like to thank my supervisors, Prof. Alexandre Bouchard-Coˆte´ and Prof.Sohrab P. Shah, for their unrelenting support during my studies.I would also like to thank my committee member Prof. Samuel Aparicio forhis contributions to the progress of my thesis, and Prof. Paul Pavlidis for chairingmy defence.A special thanks to Andrea Sollberger, Sharon Ruschkowsk, Jenny Cromarty,and Carolyn Lui for helping me manage my work.I would like to acknowledge the Canadian Institute of Health Research (CIHR)for funding my studies.xChapter 1IntroductionIn this chapter, we first introduce cancer as an evolutionary phenomenon. Next gen-eration bulk and single cell sequencing and their associated computational methodsare reviewed as the tools for quantifying the properties of this evolutionary system,including SNVs as genetic markers. We continue with a review of existing meth-ods that infer clonal subpopulation structure from bulk sequencing SNV data. Wethen state our hypothesis in this work as follows: Combining single cell and bulkdata in a unified statistical framework improves clustering and cellular prevalenceestimates. Our main contribution is a computational method to implement, test,and verify our hypothesis.1.1 Cancer as an evolutionary phenomenonCancer is an evolutionary phenomenon [23]. Cells accrue mutations in time, someof which confer a selective advantage on the bearer. This process leads to multiplecell subpopulations, each with unique genomic aberrations. This intra-tumour het-erogeneity results in clinical complications, including relapses and drug resistance[36].To find an optimal therapeutic intervention we first need to address a fundamen-tal question, ”How does a tumour population respond to a specific perturbation?”,where a perturbation is any treatment policy, including surgery and chemotherapy.The first step toward understanding this problem is to identify subpopulations1or subclones present in a tumour sample as well as their prevalences. Somatic ge-netic markers, particularly single nucleotide variants (SNVs) are used to identifysubclones [29]. Cells in the human body inherit their genomes from their par-ents. These inherited genomes harbour differences to other individuals that arecalled germline mutations. On the other hand, both normal and cancerous cells, alldescendants of the initial zygote cell, accrue novel mutations not present in theircommon ancestor. These are somatic mutations [33].Next generation sequencing has made measurement of these markers fast andcost-efficient [22]. Somatic mutations are essentially inferred by comparing DNAextracted from tumour tissue to the DNA from normal tissue from the same patient.Here we focus on next generation sequencing methods to measure SNVs.1.2 Next generation sequencingBriefly, first DNA molecules from millions of cells from a tissue sample are col-lected. This involves cell lysis through which we lose the assignment of genomesto cells, and consequently the mutation co-occurrence patterns. In a library prepa-ration step, DNA molecules are fragmented and then are ligated by adapter se-quences. This is often proceeded by an amplification phase where fragmentedDNA sequences (templates), are amplified in a process called polymerase chainreaction (PCR). The amplified templates are sequenced in a sequencer and thenanalyzed to obtain their nucleic acid sequence in what is called the base callingstep [15].Then, base-called reads are mapped to a reference genome. At this point, sev-eral post-processing steps may be required. For example, the identification andremoval of adapter sequences, reads of low quality, and contamination from for-eign DNA [16].1.2.1 Somatic single nucleotide variantsMethods such as [4, 7, 28, 30] are used to detect somatic SNVs. This is a challeng-ing task, since many variants are either germline mutations or some artifact of thesequencing process. This is further complicated by the fact that in some cancersgermline variations outnumber somatic variations by orders of magnitude. Se-2quencing artifacts are also prevalent and come from a number of different sources,including normal DNA contamination, strand bias, and low mapping quality.Somatic variant calling methods work by essentially modelling the allelic dis-tribution, taking into account various sources of error. They may involve prepro-cessing steps, or filters, such as removal of low quality reads and local realign-ments. Some post processing may also be applied to acquire higher confidencecandidate variants, for instance, filtering the variants against a database of normalpatients [4].1.2.2 Targeted deep sequencingA technique used to validate these mutation calls is targeted deep sequencing [31].To ensure that the called SNVs are not an artifact of sequencing error, limited sec-tions of the genome that contain selected SNVs are chosen and highly amplifiedand sequenced. This very high read coverage (number of reads covering each ge-nomic locus) increases confidence in the validity of the called SNVs. This datais then used to generate allele counts. That is, at each genomic locus, how manyreads map to the reference allele and how many reads map to the alternative allele.1.2.3 Copy number aberrationsAnother important genomic feature of cancer cells is their copy number state. Am-plification and Loss of Heterozygosity (LOH) events have been shown to play arole in tumour progression [32].NGS data can be used to identify Copy Number Aberration (CNA) events [24].Computational methods based on NGS data mostly work by dividing the genomeinto different segments (bins) and use mean change in read coverage in each seg-ment to assign an average copy number to the whole region. In tumour samplesthat contain multiple heterogeneous subpopulations, traditional CNA calling meth-ods may fail to assign an integer copy number to a region. This is due to eachsubpopulation having potentially a different copy number state at the same region.Methods such as [14] take tumour cellularity and different CN states for eachsubpopulation genotype into account, and they assign the most probable copy num-ber state to each region based on a Hidden Markov Model.31.2.4 Measuring clonal parametersIn NGS data we can only directly measure allele counts. Due to fragmentation andshort read lengths, patterns of co-occurrence (phasing) are lost and direct obser-vation of number of subpopulations, their genotypes, and prevalences (collectivelycalled subclonal structure) is not possible.To address this problem, computational methods have been developed to es-timate these quantities or their surrogates. Since in general there is no bijectionbetween allelic ratios and subclonal structures, it is often easier to estimate surro-gate quantities for subclonal structure parameters.Mutational cellular prevalence is a compound measure for genotype preva-lence. It is defined as the fraction of cells that harbour a mutation at a specificgenomic locus.To infer cellular prevalence, methods such as PyClone [29] correct allele countsfor CNVs and LOH events, and tumour cellularity to cluster genomic loci intogroups with similar cellular prevalences. One of the main hypothesis in these meth-ods is that there are no subclones with identical prevalences in the tumour, and thusa difference in cellular prevalence is due to belonging to different subclones. If thisassumption is violated in the data, such subclones would get merged into a singlegroup in the bulk deconvolution (over-clustering).These mutation clusters could be considered as surrogate measures for sub-clonal genotypes, where we would expect that two mutations in the same cluster toco-occur in a genotype. In section 1.3 we review in more detail methods that infersubclonal structure from next generation sequencing SNV data.1.3 Existing methodsHere we briefly review some of the existing methods that infer sub-clonal popula-tion composition from bulk next generation sequencing data.1.3.1 ClomialClomial [37] accepts reference and variant allele counts from multiple subsectionsof a tumour as input. Number of genotypes should be set a priori. From this, it es-timates both a genotype matrix, which indicates which genomic loci are mutated in4each genotype, and a genotype prevalence matrix that shows what fraction of cellsin a subsample belong to a particular genotype. The estimation problem is for-mulated as a variation of the Matrix Factorization problem where the allele countsmatrix is factored into genotypes and genotype frequencies matrices. Inference isdone using an Expectation Maximization algorithm.It infers tumour cellularity, the fraction of cancerous cells in the tumour sample,from the data. Tumour cellularity is a measure of normal DNA contamination inthe sample.On the other hand, it assumes a diploid genotype and does not correct allelecounts for copy number variations, nor does it take sequencing errors into accountin its Binomial likelihood model. It is limited to situations where the number ofclones are less than or equal to the number of samples (subsections of the tumour),since otherwise, the inference problem is under-constrained.A major problem with Clomial is that it needs the number of genotypes to beset a priori. To mitigate this, Clomial poses the choice of the number of genotypesC as a model selection problem and suggests running the method with differentvalues of C, selecting the one with the best Bayesian Information Criterion (BIC).Furthermore, to increase the chance of finding the global extremum, the authorssuggest multiple restarts from different starting positions.1.3.2 PyCloneThe PyClone model has inspired our current work. As input, it accepts a fixed setof genomic loci with their allele counts and copy number states, as well as tumourcellularity. It uses a Dirichlet process mixture model [34] to jointly infer cellu-lar prevalences and cluster assignments. PyClone assumes that at each genomiclocus, the tumour has 3 subpopulations, namely, normal, reference, and variantsubpopulations. Each subpopulation has a fixed copy number state with respect toa particular genomic locus.It accounts for copy number states by defining a prior probability distributionover possible genotypes in each subpopulation at each genomic locus, and sum-ming over these genotypes in the likelihood calculation. It uses a Beta-Binomiallikelihood (emission) distribution to account for overdispersion in the sequencing5data. PyClone’s inference is based on a Markov chain Monte Carlo (MCMC) sam-pling scheme.It also supports a multi-sample mode where it accepts as input sample specifictumour cellularity, copy number state, and allele count data. In multisample mode,PyClone will output one clustering result for mutations over all samples, and a persample cellular prevalence estimate.1.3.3 PhyloSub and PhyloWGSPhyloSub [17] tries to infer cellular prevalences and cluster assignments of SNVsas well as their phylogenetic history. Its inputs are allele counts of a fixed set ofSNVs, as well as associated copy number state estimates. It outputs clusters ofSNVs and their cellular prevalences, in addition to a set of most likely phyloge-netic trees. It achieves this by using a tree-structured stick breaking process [12].This is a Bayesian non-parametric prior over partitions that has a hierarchical treetopology, where data points can be assigned to any node on the tree.Heuristic rules are used to limit the space of possible phylogenetic trees. Theserules come from a perfect and persistence phylogeny assumption [29]. That is, amutation is only gained once over a tree, and when its gained, it cannot be lost orreverted.In PhyloSub, phylogenetic trees are directed acyclic graphs (DAG) where nodesare SNV clusters and edges connect parental clusters to their immediate child clus-ters. PhyloWGS [6] is a successor to PhyloSub. It improves on PhyloSub in tworespects. First, it handles CNAs differently by modelling them as a pseudo-SNVand thus inferring at what point on the phylogenetic tree they have occurred. Sec-ond, the authors claim that it is much faster than PhyloSub and therefore couldbe used to infer tumour subpopulation structure form whole genome sequencingdata. Both models use a Binomial likelihood model and a MCMC algorithm forinference.1.3.4 SciCloneSciClone’s [18] input consists in variant allele frequencies (VAF) as well as copynumber variations. It operates over genomic loci in copy neutral regions. The6maximum number of clusters is another input of this model.SciClone clusters genomic loci with similar VAFs. Similar to PyClone, it out-puts cluster assignments for genomic loci and frequencies for mutational groups.It models VAFs as a mixture of Beta distributions. Its inference framework isbased on the variational Bayesian mixture modelling. To approximate the posteriordistribution, it uses Gamma distributions with a Dirichlet distribution for mixtureparameters.It does not take into account copy number variations, nor does it correct fortumour cellularity. Variational Bayes methods will generally never converge to thetrue posterior (it could not even represent it).1.4 Single cell sequencingIn general, bulk methods are limited. The fact that association of genotypes to cellsis lost at cell lysis results in complications such as the phasing and over-clusteringissues mentioned earlier. A potential solution to these problems is sequencingthe genome of individual cells. Single cell sequencing (SCS) involves isolating acell, amplifying its genome followed by base calling, and mapping to a referencegenome [20].SCS is still in its infancy and suffers from a number of problems [20, 35].First, it is very error-prone. Sources of error include allele dropout and unevengenome coverage that may result from a biased amplification step. Second, it isprohibitively expensive. Third is the issue of undersampling.In SCS studies, single cells analyzed number in the lower hundreds [8, 19].The probability of observing a representative cell from a very rare clone (cellularprevalence of less than 0.01%) is only about 0.01. Concretely, considering theabove numbers and assuming a uniform sampling procedure and that the clonesare uniformly distributed throughout the tumour, the power of the SCS experimentto test the existence of a very rare clone would be:p(observing at least a single cell from a very rare clone|that clone exists in the tumour)=1−p(not observing any cells from the very rare clone|that clone exists)= 1−(1−0.0001)100 ≈ 0.01.71.5 Main hypothesis, combining bulk data and SCS dataOur main hypothesis is ”Combining single cell and bulk data in a unified statisti-cal framework improves clustering and cellular prevalence estimates.” Figure 1.1shows the main components and workflow of our method. This work has the po-tential to overcome the limitations of both bulk and single cell methods. This ismade more precise in chapter 2. We will experimentally validate our hypothesis inchapter 3 and demonstrate that if the majority of genotypes present in the tumourare observed, using them to guide clustering in bulk data will result in improvedestimates.MutationsAllele Counts (1000×) VariantReference xxMutation 6xxMutation 5xxxMutation 4xxMutation 2xxxxGenotype 1 xGenotype 5Genotype 2xMutation 1Genotype 4Mutation 3xGenotype 3Mutation ClustersCellular Prevalence0.51.00.0Jaccard Distance MatrixGibbs SamplingMPEARddCRPdd-PyCloneExponential Decay FunctionMutationsInteger Copy Number MajorMinor012345SampleTumour Cellularity0.00.51.0012345Figure 1.1: This figure shows the workflow of our method, dd-PyClone. In-puts are genotypes from single cell genotyping experiment, tumour cel-lularity, and allele count estimates. As output, we infer clusters of ge-nomic loci (mutation clusters) and their cellular prevalences. The detailsof the model, including definitions of ddCRP, distance matrix calcula-tion, clustering, and inference routines are given in Chapter 2.81.6 SummaryIn this chapter we introduced cancer as an evolutionary system. We discussedstrengths and limitations of next generation and single cell sequencing technologiesand associated computational methods in quantifying intra tumour heterogeneity.We stated combination of bulk NGS and SCS as our objective in this work.The rest of this document is organized as follows: Chapter 2 describes theassumptions of the model, its mathematical formulation, and the computationalmodel to solve it. Chapter 3 introduces a simulator to generate SNV and CNAaware genotypes and reports our simulation studies as well as experiments on areal primary triple negative breast cancer dataset. Finally, Chapter 4 discusses asummary of the work as well as potential future research directions.9Chapter 2MethodsIn this chapter we formulate our hypothesis and establish relevant notations. Wethen introduce our modelling procedure in the context of a simplified problem. Wethen describe our main modelling method, the ddCRP that is a generalization ofthe traditional CRP. Using ddCRP, we construct an informed prior over partitionsthat encourage co-occurring genomic loci to cluster together. We proceed with de-scribing our full model, dd-PyClone. We then introduce our inference procedure,a MCMC sampling scheme that uses Gibbs moves to update clustering and param-eter assignments. Finally, we discuss limitations of our model and possible futureextensions.102.1 Problem statementHere, we first establish relevant terms, notations, inputs, and outputs of our model.Then we formulate our problem in a mixture modelling framework.2.1.1 Main hypothesisWe reiterate our main hypothesis here: Combining single cell and bulk data in aunified statistical framework improves clustering and cellular prevalence estimates.2.1.2 Concepts and definitionsGiven (i) variant allele counts and (ii) copy number state at each genomic locus,(iii) tumour cellularity, and (iv) single cell genotype data, our method infers (i)cellular prevalences and (ii) cluster assignments for those genomic loci. We definethese terms below.InputsVariant allele counts is the first input to the model. We assume that at each ge-nomic locus i, a total of di reads map to the variant and normal alleles out of whichbi reads map to the variant allele. A related notion is the variant allelic preva-lence, ξ , that is the expected fraction of reads that harbour the variant allele. Thisis computed as the number of mutated alleles in the variant subpopulations dividedby the total number of alleles in all cells.The second input to the model is the copy number state at each genomic lo-cus. Note that copy number variations affect ξ . For instance in Figure 2.4, ξ =2×52×1+3×3+3×5 =513 .The third input to the model is the tumour cellularity t, as the fraction ofcancer cells in the sample. Hence the fraction of normal cells would be 1− t. Weassume it is estimated independently from our model.The last input to our model is the genotype data. Let M denote the numberof genotypes in the tumour sample and N be the number of genomic loci in ourmodel. Genotype data is modelled as a binary matrix ∆ ∈ {0,1}M×N with rowscorresponding to genotypes and columns to genomic loci. Each entry ∆m,n is equalto one if the genotype m is mutated at locus n. We assume in this work that geno-11type data is derived from single cell sequencing studies. Figure 2.1 illustrates ourassumption about the relation between the genomic loci co-occurrence patterns andthe underlying phylogenetic tree.G1G2G3 G4G1xxMut 3Mut 5Mut 7xMut 4Mut 1Mut 2G4xxxxMut 4Figure 2.1: A hypothetical phylogenetic tree with genotypes at leaves (top).The green and blue bars on the tree denote mutations that have hap-pened together. A subset of the corresponding mutation co-occurrencepatterns (bottom). Note that the bottom matrix shows a transposed ver-sion of the genotype matrix. While it always holds that if mutationsare gained at the same site on the phylogenetic tree, then they will co-appear in the genotype matrix (the top-to-bottom arrow), the opposite isnot always true (the bottom-to-top arrow). We are making the simpli-fying assumption that if mutations co-occur in a genotype matrix, thenthey have co-occurred in the underlying phylogenetic tree as well.12OutputsThe desired outputs are cluster assignments of genomic loci and their cellularprevalences. Cellular prevalence φi for a particular genomic locus i is defined asthe fraction of cells in the sample that harbour a mutation at that genomic locus.For example, in Figure 2.4 cellular prevalence for the depicted genomic locus is 58 .Thus 1−φi, the fraction of cancer cells from the reference population, is 1− 58 = 38 .We define the clonal prevalence of a genotype to be the fraction of cells in thetumour sample that match that genotype.2.1.3 NotationsLet X = {x1,x2, ...,xN} be the set of ourN genomic loci, indexed byϖ = {1,2, ...,N}.We adopt the notation j : i for j ≤ i, j, i ∈ N to denote { j, j+ 1, j+ 2, ..., i}, asubset of successive integers.We define a clustering of X as a partition T of its index set ϖ , that is T ={T1,T2, ...,TK} such that unionsqk∈1:KTk = ϖ where K is the number of partitions and unionsqdenotes the disjoint union operator and each subset Tk is called a cluster.We define xA for A ⊂ ϖ to be {xi|i ∈ A}. For example xTk is the set of datapoints in cluster Tk and xi: j = {xi,xi+1,xi+2, ...,x j}.Furthermore, let T (.) : N→ N map data point indices to their clusters, that isT (i) = k iff i ∈ Tk.Partitions in a graphLet G(V ,E) denote an undirected graph G where V is the set of vertices and E isthe set edges, i.e., a set of unordered pairs {u,v} ⊂ V .The set of edges E induces a partitioning on V , where each connected com-ponent of V corresponds to a cluster. With a slight abuse of notation, let T (E) =T (G(V ,E)) denote this partitioning and T kE denote its k-th cluster.A directed graph G (V ,E ) consists in a set of vertices V and a set of directededges E where each edge is an ordered pair of vertices.For a directed graph G , we define its underlying undirected graph U(G ) to bethe graph obtained by replacing all directed edges in G with undirected ones.Let T (E ) be the partitioning induced byU(G ), the underlying undirected graph13of G . Throughout this document the G corresponding to E is always apparent fromthe context, with V always being the set of our data points.Let TE : N→N map vertex indices to their clusters, that is TE (i) = k iff i ∈ T kE .2.1.4 Simplified generative modelFor exposition, we start with a simplified generative model in which we describethe relationship between inputs and the outputs of our method. Assume we havea heterogeneous tumour that contains subpopulations from two distinct haploidgenotypes, g1 and g2 with clonal prevalences of 30% and 70%, respectively. Forsimplicity we set the tumour cellularity in our sample to one (t = 1). Since thisimplies that the expected fraction of variant allele reads is equal to cellular preva-lence at each genomic locus (ξ = φ ), we will ignore ξ and directly use φ in thissubsection.The possible cellular prevalences for any genomic locus i in this tumour areφi = {φ 1i = 0.0,φ 2i = 0.3,φ 3i = 0.7,φ 4i = 1.0}. Since locus i is either not mutatedin any of the genotypes (hence φ1), only mutated in g1 (corresponding to φ2), onlymutated in g2 (therefore φ3), or mutated in both g1 and g2 (meaning φ4). Thesefour cases represent our possible clusters.To simulate the sequencing process for genomic locus i, we first pick its cluster.We use an auxiliary variable zi as follows: zi ∼ Categorical(w) where w = w1:4denotes the mixing weights, the proportion of clusters such that ∑4i=1wi = 1.The cellular prevalence for genomic locus i is now φzi . In the inference pro-cedure, zis and φzis constitute our desired outputs. Next we simulate the num-ber of variant alleles. Since according to our assumptions φzi also denotes theexpected proportion of variant reads in the sequencing experiment, we can re-late it to the variant read counts bi via a Binomial likelihood function as follows:bi ∼ Binom(di,φzi) where for now, we fix di, the total number of reads, to someappropriate constant value 1. In the inference procedure, we observe bi and di and1It could vary from about 10× in a whole genomic sequencing to about 10,000× in an ultra deepsequencing experiment [31].14they are the inputs to our model. Put together, we have:zi ∼ Categorical(w)bi ∼ Binom(di,φzi)(2.1)The two step process we described in equation 2.1 defines a mixture distribu-tion as follows:p(bi) =4∑j=1w jBinom(di,φ j) (2.2)Here we have four possible mixture components or clusters. Cluster assign-ments for each datapoint (a genomic locus in our model), are determined by theindicator variables zi-s that are sampled from a categorical distribution, our priorover partitions, since we assumed that (i) the possible values for the φi-s were finiteand (ii) known. Neither of these assumptions hold in general, that is, the φi-s couldbe any real-valued number in [0,1]. To address this issue in a principled way, weintroduce the Chinese Restaurant Process (CRP) in subsection 2.2.2.Furthermore, co-occurrence patterns in g1 and g2 could be used to construct aninformed prior over partitions of genomic loci. This can be done via a generaliza-tion of CRP, called ddCRP, that we introduce in subsection 2.2.1. Before describingour model, dd-PyClone, in section 2.3, we present an updated generative processfor the simplified example that we considered here in subsection 2.2.3.152.2 Distance dependent Chinese Restaurant ProcessThis section introduces the distance dependent Chinese Restaurant Process (dd-CRP), the method we use to incorporate single cell genotyping data to construct aninformed prior over partitions that encourages co-occurring genomic loci to clustertogether. We begin by presenting the traditional CRP, a special case of ddCRP thatis agnostic to the co-occurrence pattern of genomic loci.2.2.1 Traditional CRPddCRP can be explained through an alternative representation of the Chinese Restau-rant Process (CRP). We follow the notation in [2]. In the traditional CRP, customersenter a Chinese restaurant and opt to sit at a table where the probability of joininga table is proportional to the number of customers already sitting at that table.Customers may also choose to sit at a new table with probability proportionalto α , a model parameter. In the Chinese restaurant metaphor, customers rep-resent the genomic loci and tables represent clusters.Let zi denote the table assignment for customer i and assume that customers1 : i−1 have occupied tables 1 : K and let nk be the number of customers sitting attable k. Customer sitting configuration induces a partitioning of customer indices.CRP draws zi as in equation 2.3.p(zi = k|z1:(i−1),α) ∝{nk for k ≤ Kα for k = K+1(2.3)The CRP is an exchangeable process, that is, the order in which customersenter the restaurant does not affect the probability of a certain partition [34].2.2.2 Alternative representation of Traditional CRPTraditional CRP can equivalently be viewed as customers joining other customersinstead of joining other tables. Let ci denote the customer index with whom cus-tomer i is sitting and C = c1:N . This defines a directed graph G (V ,E ) with V theset of customer indices and E the set of ordered pairs (i,ci).As described in subsection 2.1.3, this induces TE = T (C) a partitioning of cus-tomer indices. Each cluster corresponds to a table in the traditional representation.16Figure 2.2 shows an example C and its corresponding T (C).3412657……CustomersTables1 3 4 5 62 7Figure 2.2: Induced table sitting T (C) by a particular customer connectionconfiguration C . Bold arrows show customer connections and dottedarrows point to equivalent table sittings. Customer 7 has a self loop andsince she is not connected to any other customers, the correspondingtable has only one customer.In a generalization of this model, the probability for a customer i to connectto a customer j is proportional to a function of the distance between them. Thedistance matrix D encodes our knowledge about the data point’s dissimilarity froma secondary source. In this work, this distance matrix is computed from the geno-types derived from single cell genotyping experiments (more details in subsection2.3.1). The non-increasing decay function f takes non-negative finite values. Thisis summarized in equation 2.4.p(ci = j|D,α) ∝{f (di, j) for i 6= jα for i= j(2.4)This defines the ddCRP model. We note that picking a constant decay functionf (x) = 1 reduces ddCRP to traditional CRP, since in that case, equation 2.4 isidentical to equation 2.3.Unlike traditional CRP, ddCRP is not an exchangeable process. This meansthat the order in which customers enter the restaurant changes the probability of17a particular partition. In our implementation, we randomly shuffle the order ofcustomers at each iteration of the sampling algorithm. To investigate the effects ofnon-exchangeability, we ran our method over synthetic and real datasets with andwithout random reordering of customers in chapter 3, subsection 3.4.4.We found that the method performs nearly identically with or without randomcustomer reordering. This may imply that our method is not very sensitive to theorder of customers.2.2.3 Generative process for ddCRP mixture modellingNow that we have seen how ddCRP can be used to construct an informed priorover partitions, we present the high level forward simulation algorithm for mixturemodelling in ddCRP for the simplified example we considered in subsection 2.1.4:1. For i ∈ [1,N], draw ci ∼ ddCRP(α, f ,D).From this, derive T (C), the corresponding table assignment.2. For i ∈ [1,K],draw φi ∼ G0.3. For i ∈ [1,N], draw bi ∼ Fi(φTC(i)).where α is a model parameter, f is a decay function, G0 is the base distributionfor the φi-s, Fi is the likelihood function relating expected number of reads to bi tocellular prevalence φi as in equation 2.1, and TC(i) is the index of the table at whichcustomer i is sitting.Formally, in our simplified model, for each genomic locus i ∈ [1,N], wewant to infer φi and TC(i), given the model observations bi and di. In section2.3 we report a complete set of expected model inputs and outputs.182.3 The dd-PyClone modelHere we introduce our model, dd-PyClone. Figure 2.3 summarizes dependencyand distributional assumptions in dd-PyClone’s model. Table 2.1 explains randomvariables used in this model.addCRP↵ H0↵↵,↵s↵s,sibiditA0 i ⇡iDi 2 {1 : N}(a) Probabilistic Graphical Model (PGM) of dd-PyClone19↵ ⇠ Gamma(a↵, b↵)H0 = Uniform([0, 1])A0 = Uniform([0, 1])a ⇠ A0D = {di,j}, di,j = JaccardDist(i, j), i, j 2 {1 : N}fa = exp(di,j/a)i|fa, D,H0,↵ ⇠ ddCRP(fa, D,H0,↵) i|⇡i ⇠ Categorical(⇡i)s|as, bs ⇠ Gamma(as, bs)bi|di, i,i, t, s ⇠ BetaBinomial(di, ⇠( i,i, t), s)⇠( ,, t) =(1 t)⇣(gN )Zµ(gN ) +t(1 )⇣(gR)Zµ(gR) +t⇣(gV )Zµ(gV )Z = (1 t)⇣(gN ) + t(1 )⇣(gR) + t⇣(gV )1(b) Distributional assumptions of dd-PyCloneFigure 2.3: The complete dd-PyClone model. In the graphical model, theshaded nodes are observed and the rest of the nodes are not observed.In the inference step, the unobserved nodes will be inferred via Gibbssampling. In particular, we are interested in inferring φi-s, the cellularprevalences for genomic loci and the induced clustering by the ddCRP.20Table 2.1: Notation reference for dd-PyClone’s probabilistic graphicalmodel.[Not tion reference for dd-PyClone]Variable Description ObservedA0 Prior distribution over decay function’s parameter a. Yesαα Shape hyperparameter over ddCRP distribution’s α parameter. Yesβα Rate hyperparameter over ddCRP distribution’s α parameter. Yesa Decay function’s parameter. Noα Model parameter for the ddCRP model. NoH0 Base measure for the ddCRP used to sample cellular preva-lences for genomic loci.Yesαs Shape hyperparameter for the Beta-Bionomial precision param-eter s.Yesβs Rate hyperparameter for the Beta-Bionomial precision parame-ter s.YesD Distance matrix over genomic loci. In this work, this is com-puted from single cell genotype analysis.YesddCRP The distance dependent Chinese restaurant process with decayfunction f , distance matrix D, base measure H0, and model pa-rameter α .Nos Precision parameter for the Beta-Binomial emission model. Noφi Cellular prevalence for the genomic locus i. Nodi Total number of reads that map to genome locus i. Yesbi Number of reads that map to variant allele at genomic locus i. Yesψi A vector (giN ,giR,giV ) denoting genotype state at genomic locusi.Nopii Prior over the genotype state for the genomic locus i. Yest Tumour cellularity. YesN Number of genomic loci. YesM Number of genotypes. Yes21We assign each genomic locus to a customer. Throughout this document, weuse genotype data from single cell genotyping studies to compute the distance be-tween genomic loci. We note that this is not a requirement of the model, and othersources could be used to define dissimilarity between genomic loci.2.3.1 Distance matrixWe have used the Jaccard distance to form the distance matrix D ∈ [0,1]N×N be-tween genomic loci. Jaccard distance is computed as 1− JaccardIndex that is:JaccardDist(A,B) = 1− |A∩B||A∪B||A∩B|=M∑i=1(Ai×Bi)|A∪B|=M∑i=1(Ai+Bi)(2.5)where AM×1 and BM×1 are binary column vectors, each representing a genomiclocus. Intuitively, this assigns a higher distance to genomic loci that co-occur lessoften in the single cell genotypes and vice versa. We note that our use of the Jaccardindex to compute distances between genomic loci is related to the distance-basedphylogenetic inference methods [9].Let λ = {s,α,a} be the collection of hyperparameters in our model. Forbrevity, we first assume that these hyperparameters are fixed, and later we discusstheir resampling scheme.2.3.2 Bulk population assumptionsSimilar to PyClone, we make the simplifying assumption that the clonal populationin the bulk data comprises three subpopulations, namely, the normal, the reference,and the variant subpopulations. Figure 2.4 illustrates this assumption. To avoidconfusion with the genotype states coming from the single cell sequencing study,we refer to the assumed copy number state of the subpopulations in the bulk data aspseudo-genotypes. This data is usually not available directly from the bulk data,and has to be inferred or accounted for in the inference procedure.22Tumour SampleChromosomeMutationVariant SubpopulationNormal Subpopulation Reference SubpopulationFigure 2.4: Our assumption about clonal architecture in the tumour, with re-spect to a particular genomic locus. In this example, normal subpop-ulation represents a collection of un-mutated diploid cells. Referencesubpopulation comprises cells that have a copy number amplificationevent, but no single nucleotide mutations. Variant subpopulation is acollection of cells that have a SNV at the particular genomic locus.2.3.3 Pseudo-genotype state priorsLet ψi = (giN ,giR,giV ) ∈ (N0×N0)3 represent the assumed pseudo-genotype stateat each genomic locus i in the bulk data where N0 = N∪{0}. Let giN represent thenormal pseudo-genotype N, giR represent the reference pseudo-genotype R, and giV23represent the variant pseudo-genotypeV . Each giS is a pair of non-negative integersthat denote the copy number state for the pseudo-genotype S ∈ {N,R,V} at thegenomic locus i. For example, giN = (2,3) means that the normal pseudo-genotypein the bulk tumour sample has two copies of the reference allele and three copiesof the variant allele at genomic locus i. Here (0,0) denotes a homozygous deletion.For g ∈ G = N0×N0, let ζ : G → N0 be the total copy number of pseudo-genotype g, We define µ(g), the probability of sampling a variant allele from asubpopulation with pseudo-genotype g as follows:µ(g) =ε for b(g) = 01− ε for b(g) = ζ (g)b(g)ζ (g) otherwise(2.6)where ε is the sequencing error probability, the probability of observing a vari-ant allele when sequencing a true reference allele.We define the function ξ (ψ,φ , t) to capture the effects of pseudo-genotypes,cellular prevalence, and tumour cellularity as follows:ξ (ψ,φ , t) =(1− t)ζ (gN)Zµ(gN)+t(1−φ)ζ (gR)Zµ(gR)+tφζ (gV )Zµ(gV ) (2.7)where Z=(1−t)ζ (gN)+t(1−φ)ζ (gR)+tφζ (gV ) is the normalizing constant.To compute the likelihood, we sum over possible values of ψi. Since the dis-crete space of Ψ values quickly becomes intractable, we only consider a limitednumber of pseudo-genotypes. This could be done via defining a prior pii over ψi.2.3.4 The Parental Copy Number (PCN) priorFollowing [29], when copy number variation data in form of major and minor copynumbers is available, we have implemented a number of methods to elicit priorsover pseudo-genotypes. We assume that copy number state at each genomic locusis reported as a pair of integers (ζ¯1, ζ¯2), where the major copy number ζ¯1, refers tothe maximum of the two said integers and the minor copy number ζ¯2 refers to theminimum of the pair and ζ¯ = ζ¯1 + ζ¯2 is the total copy number. Here we describethe Parental Copy Number (PCN) strategy that is used in our experiments.24Let P denote the set of pseudo-genotype states that PCN scheme describes.We assign equal weight to the pseudo-genotype states inP and zero weight to anyother pseudo-genotype state that is not a member ofP , that is a pseudo-genotypestate ψi ∈P has a weight equal to 1|P| . The pseudo-genotype states with non-zeroweights areP = {ψ1,ψ2,ψ3,ψ4} where• ψ1 = (gN = (2,0),gR = (2,0),gV = (ζ¯1, ζ¯2))• ψ2 = (gN = (2,0),gR = (2,0),gV = (ζ¯2, ζ¯1))• ψ3 = (gN = (2,0),gR = (2,0),gV = (ζ¯ −1,1))• ψ4 = (gN = (2,0),gR = (ζ¯ ,0),gV = (ζ¯ −1,1))These pseudo-genotype states adhere to the following conditions: gN = (2,0)so that the normal pseudo-genotype is diploid with respect to the reference alleles,and ζ (gV ) = ζ¯ and b(gV ) ∈ {1, ζ¯1, ζ¯2}. The number of variant alleles b(gV ) is atleast one, in other words we do not consider genomic loci that are not mutated.When b(gV ) ∈ {ζ¯1, ζ¯2}, we set gR = gN (as in ψ1,ψ2). For b(gV ) = 1, we considertwo scenarios: (i) either the point mutation event has happened before the copynumber event, in which case we set gR = gN (see ψ3), or (ii) the copy numberevent preceded the point mutation, where we choose gR such that ζ (gR) = ζ¯ andb(gR) = 0 (as in ψ4).We note that for some copy number configurations such as when ζ¯1 = ζ¯2 orζ¯2 = 0, some ψi values will be identical. For example, when total copy numberis equal to one, the possible pseudo-genotype states in the PCN scheme are P ={ψ1 = (gN = (2,0),gR = (2,0),gV = (0,1)),ψ2 = (gN = (2,0),gR = (1,0),gV =(0,1))}.2.3.5 The likelihood functionGiven the priors over pseudo-genotypes, the likelihood for each data point is:p(bi|φi,di,pii, t) = ∑ψi∈G 3p(bi|φi,di,ψi, t)p(ψi|pii) (2.8)25To address overdispersion, we have modelled the conditional distribution ofvariant allele counts bi with a Beta-Binomial distribution, characterized in terms ofmean and precision as follows:p(b|d,m,s) =(db)B(b+ sm,d−b+ s(1−m))B(sm,s(1−m)) (2.9)where B is the Beta function. To reflect our assumptions over the sample sub-population structure, we set the mean value to a function of pseudo-genotypes,cellular prevalence, and cellularity for each data point, that is m= ξ (ψn,φ n, t). Alldata points share the same precision s.262.4 InferenceThe main objective of this procedure is to infer the desired outputs of our model,namely for genomic locus i, induced cluster assignments by ci and cellular preva-lences φi. We use a Gibbs sampler to draw samples from the posterior distributionof the model. We initialize the sampler such that all customers are in their own clus-ters. Let c−i be the customer connection configuration with customer i’s outgoingconnection removed. Let xi= (bi,di) denote the observed data, namely, variant andtotal allele counts.The full conditional distribution of ci is:p(ci|c−i,x1:N ,λ ) ∝ p(ci|λ )p(x1:N |ci,c−i,λ ) (2.10)where p(ci|λ ) is the same as equation 2.4 and λ is the set of all hyperpa-rameters. Let xTk be the set of customers in cluster Tk or equivalently, the set ofcustomers sitting at table k, then the likelihood term factors in:p(x1:N |c−i,ci = j,λ ) = ∏Tk∈T (C)p(xTk |λ ) (2.11)where T (C) is the partitioning induced by current customer connection config-uration C. The term p(xTk |λ ) further expands as:p(xTk |λ ) =∫(∏i∈Tkp(xi|θ ,λ ))p(θ |λ )dθ (2.12)where the likelihood p(xi|θ ,λ ) = p(bi|φi,di,pii, t) is the same as equation 2.8.Since our prior over cellular prevalences φi is non-conjugate to the likelihood,we resolve to a cached version of Griddy Gibbs method [25] to compute the aboveintegral. At the end of each iteration (i.e., when all customers are reassigned), wesample φk, for each cluster k as follows:φk ∼ p(φk|xTk ,piTk , t,λ ) ∝ p(φTk |λ )p(xTk |φTk ,λ ,piTk , t) (2.13)where p(φTk |λ ) is the probability density function of a uniform distribution.This Gibbs sampler potentially displaces more customers at each step, and as27such might have better mixing properties compared to the traditional CRP Gibbssampler [2]. Figure 2.5 shows such a step in ddCRP.1 3 4 5 62 73126573 1 2 5 67445 671 3 4 5 62 71 3 4 5 62 71234Figure 2.5: Possible moves by the sampler. Left column shows customer con-nection and right column shows induced table configuration at each step.We want to remove the outgoing connection of customer two, i.e., c2 = 6(top row, the red arrow). When this connection is removed, the secondtable is split into two tables, with customers one and two sitting at onetable and customers five and six sitting at a new table (middle row).Customer three is picked as the new connection for customer two, i.e.,cnewi = 3, and this causes their respective tables to merge (bottom row,the green arrow).2.5 Resampling hyperparametersα and a are resampled using methods described in [2]. Briefly, we used the fol-lowing Gibbs move to update the value of hyperparameter α given the customerconnection configuration C:p(α|C) ∝ αK [N∏i=1(α+∑j 6=if (di j))]−1p(α) (2.14)whereK=∑Ni ci= i, i.e., the number of self-connections, and p(α)∼Gamma(α0,β0)is the Gamma prior over α with shape and rate parameters α0 and β0.28The decay function parameter a is updated using the following Gibbs move:p(a|C,α) ∝ [∏i:ci 6=if (di j,a)][N∏i=1(α+i−1∑j=1f (di j,a))]−1p(a|α) (2.15)where we assume a uniform prior on a independent of α .Since the decay function is exponential in our model, we use the Griddy-Gibbs[25] approach to sample approximately from equation 2.15.We use the method proposed in [21] for resampling s.Gamma distributed priors are characterized using shape α and rate β parame-ters. Equation 2.16 shows the corresponding distribution function:g(x;α,β ) =βα ,xα−1e−xβΓ(α)(2.16)where Γ(α) is the Gamma function.By default, hyperparameter resampling is enabled in our experiments in thiswork, unless otherwise specified. We note that to explore the model’s sensitiv-ity to the value of hyperparameters, in some of our experiments in chapter 3, wedisable hyperparameter resampling. We specify this in the description of thoseexperiments.2.6 ImplementationThis model is implemented in R programming language and is available upon re-quest. It is built upon the implementation of ddCRP in [2].2.7 Limitations and future extensionsWe note that our assumptions regarding the pseudo-genotypes in the bulk datamay not agree with the genotype matrix derived from the single cell sequencingexperiment. More specifically, in modelling the bulk data, we assume that thetumour consists of exactly 3 subpopulations (i.e., pseudo-genotypes). Equivalently,we assume the existence of three genotypes in the bulk data, where the first twoare un-mutated and the third is mutated across all genomic loci. If the single cell29sequencing study indicates the existence of a different number of genotypes, thenthe bulk and single cell sequencing assumptions about the genotypes do not match.We intend to explore a number of ways to mitigate this issue, including using thegenotypes observed in the single cell sequencing experiments to inform the prioron the pseudo-genotypes in the bulk data.The shortcomings of the single cell sequencing method, especially the grossundersampling problem may obstruct using genotypes inferred from this type ofdata in the likelihood of dd-PyClone. As implemented, our model only works witha single anatomical/spatial tumour sample. We aim to expand it to use multiplesamples in future implementations. Our method uses a binarized version of thegenotype matrix. It may be possible to use the original genotype matrix in thecopy number space to calculate distances between genomic loci and potentiallyimprove accuracy of our estimates.2.8 ConclusionIn this chapter we gave a precise mathematical formulation of the main objective ofthis work, that is, how to use a binarized genotype matrix from single cell sequenc-ing data to improve clustering and cellular prevalence estimation for genomic locifrom bulk sequencing data.We proposed a solution based on the distance dependent Chinese RestaurantProcess (ddCRP), an infinite clustering framework that enables us to encourage co-occurring genomic loci to cluster together. We then described an inference methodfor our model based on a MCMC sampling scheme.30Chapter 3ExperimentsIn this chapter we report our experimental results over synthetic and real datasets.We begin by describing methods pertaining to all of our experiments. To estimateclustering assignment and cellular prevalences from our MCMC samples we usemax PEAR index and the mean over all samples respectively.We then introduce the Generalized Dollo model, a strategy to generate geno-types via a stochastic process over a phylogenetic tree that simulates SNV and CNVevents. We use the genotypes generated by this model to simulate the bulk data.Generating realistic simulated datasets is necessary to test the performance of ourmethod since it is not yet possible to exactly quantify the CNV and SNV state inreal datasets and therefore we cannot accurately assess the accuracy of our modelusing real datasets. We proceed to compare our method against existing methods.To gauge performance of results we use V-Measure index and mean absolute error.Real data experiments come next. We introduce a dataset consisting in fivetimepoints of a Triple-Negative Breast Cancer (TNBC) xenograft experiment. Webenchmark our method against other existing methods over this dataset and re-port the results. Finally, we present parameter sensitivity and MCMC convergenceanalysis results.313.1 General information3.1.1 Clustering summarizationTo cluster genomic loci we first compute the posterior similarity matrix and thenmaximize the PEAR index to compute a point estimate [11] as implemented inthe R package mcclust provided by the authors in [10]. We estimate the cellularprevalence for each genomic locus as the mean of after burn-in MCMC samples.3.1.2 Clustering evaluationClustering performance is measured with respect to the ground truth data in tworespects; first V-measure index that evaluates clustering completeness and homo-geneity [27], and second, the accuracy of cellular prevalence estimates is reportedas the average absolute error.3.2 Simulated dataHere we introduce our simulation scheme. We first use the Generalized Dollo(GD) model to simulate genotypes. We then use these genotypes to simulate thebulk data. A binarized version of the genotypes is used to inform our prior inour method, while the bulk data constitutes the main input to our model and thecompeting methods. Figure 3.1 shows the high-level data simulation workflow.3.2.1 Simulating genotypesGeneralized Dollo modelWe used a variation of the Stochastic Dollo (SD) model, called Generalized DolloModel (GD) to simulate synthetic data accounting for both SNVs and CNVs. SDis a stochastic process that models evolution of binary features (in our case, pointmutations) along a phylogenetic tree. A feature could only be gained on one pointon the tree, and could be lost multiple times on different branches, but when lost,it cannot be regained [1].A limitation of SD is that it is restricted to binary features. For instance, it can32Genotype data in CN stateBinarized genotype dataGD modeldd-PyClonePyCloneSciCloneClomialPhyloSub/PhyloWGSBulk dataAll Inference MethodsFigure 3.1: High-level data simulation workflow. First, the GD model is usedto generate genotype data in copy number state (∆CN). Second, thegenotype data is converted into bulk data. This is given as input to all themethods tested in this work to be used to infer the clustering assignmentand cellular prevalences of genomic loci. Third, the genotype data inCN state is converted into a binary genotype matrix and supplied onlyto our method, dd-PyClone, whereby it is used to construct an informedprior over the partitions of genomic loci (the bold arrow).only model presence and absence of a mutation at a certain genomic locus.Multi State Stochastic Dollo (MSSD) model [1] relaxes this restriction by ex-panding the present feature state and allowing transition within this expanded statespace. For example, MSSD allows transition and transversion point mutations inaddition to deletion.MSSD can only model evolution after a SNV has happened. This is not acorrect assumption when modelling copy number variation events where we wouldlike to be able to account for copy number changes before a point mutation hashappened.33To resolve this problem, in addition to expanding the present feature, we alsoexpand the absence feature and allow transition within these new states. This is theGD model. Once the system gains the feature, that is, it transits into the presentfeatures state subspace, it can make transitions within this subspace, but cannot goback to the previous state. Figure 3.2 illustrates SD, MSSD, and GD side by sideon a specific phylogenetic tree for a particular genomic locus.SNVGD(1,0)(3,1)(0,2) (3,0)MSSD(2,0)(3,1)(0,2) (2,0)SD010 0(2,0) (2,0) 0Before SNV After SNV Fixed StateSNV SNV(2,0) (3,0) (1,0) (1,1) (1,2) (0,2) (2,1) (3,1) (2,0) (0) (1)StateStyleFigure 3.2: An instance of Generalized Dollo, Multi state stochastic Dolloand Stochastic Dollo models over a rooted phylogenetic tree for a singlegenomic locus side by side. We assume that a SNV has happened atthe red dot on the tree. Dashed lines represent the GD model’s runover the subtree before the SNV has happened. The thick solid linesrepresent the process after the SNV has happened. The thin solid linesrepresent a fixed state, i.e., the process can only handle a fixed statebefore the SNV gain event. The numbers and colours represent the stateof the process (CTMC) at that point. GD can model multiple states onbranches where SNV does not appear, while MSSD is forced to be ina fixed state in those positions. Hence the space of problems that GDmodels is a superset of that of SD.34GD model’s setupGD uses a Continuous-Time Markov Chain (CTMC) to simulate the evolution ofgenomic loci states along the paths of the phylogenetic tree. The state space ofthis CTMC consists of pairs (c1,c2) ∈ N0×N0 where N0 = N∪{0} and c1 andc2 represent reference copy and variant copy numbers respectively. Rate matrixQ1 controls the CTMC before the occurrence of the rare-event and rate matrix Q2controls the CTMC after the occurrence of the rare-event.We design Q1 and Q2 such that a complete deletion, i.e., transitioning to state(0,0) is not possible. Q1 only allows transition between states that have zero vari-ant copy number. This simulates the behaviour of the system before a SNV hap-pens. We assume once a mutation is lost, it cannot be recovered, and enforce thisassumption in Q2 by not allowing transition from states with zero variant copynumber zero to states with non-zero variant copy numbers.350 0 00 −1 10 1 −1(0,0)(1,0)(2,0)(0,0)(1,0)(2,0)0 0 0 0 0 0 0 0 00 −1 1 0 0 0 0 0 00 1 −1 0 0 0 0 0 00 0 0 −1 0 0 1 0 00 1 0 1 −4 1 0 1 00 0 1 0 1 −3 0 0 10 0 0 1 0 0 −1 0 00 0 0 0 1 0 1 −3 10 0 0 0 0 1 0 1 −2(0,0)(0,1)(0,2)(1,0)(1,1)(1,2)(2,0)(2,1)(2,2)(0,0)(0,1)(0,2)(1,0)(1,1)(1,2)(2,0)(2,1)(2,2)Figure 3.3: Rate matrices for CTMC used on τ−pSNV (left) and τpSNV (right).States represent are pairs representing reference and variant allele copynumbers. In this example, maximum allowed copy number for both ref-erence and variant alleles is 2. States to which transition is possible areannotated green. Note that in both rate matrices, first row and columnthat represent transitioning from and to the complete deletion state, areall zero. This means that it is not possible to reach complete deletion inour model.Figure 3.3 shows an example Qabove and Qbelow rate matrices. The state spaceof the Qabove is a subset of that of Qbelow since we do not allow transition to stateswhere a SNV has happened.Simulating from the GD modelTo simulate data for each genomic locus from the GD model on a phylogenetic tree,we randomly pick a point on the tree to designate where the SNV has happened.We call the subtree rooted at SNV point the below-subtree and the remaining partof the tree, the above-subtree. We simulate GD on the above-subtree to determinethe copy number state of the SNV point along with other point on this subtree. Thisaccounts for evolution of the genotypes before a SNV has happened. We continue36by simulating the GD on the below-subtree to determine the copy number state ofdecedents of the SNV point. This accounts for evolution of the genotypes after theSNV occurs.To continue we first establish some notation following [3]. We define a phy-logeny denoted by τ to be a continuous set of points and G (L ,E ) to be its topologywhereL are the leaves and E the edges. Let Hi(ν) denote the state of the CTMCfor genomic locus i at point ν . For an edge e ∈ E , let |e| denote its length.Let τ−x be the tree pruned at point x, that is, the tree with subtree rooted at x,removed. We write τx for the subtree of τ rooted at node x, and τ−x the subtreepruned at node x, that is, the subtree with points in τ − τx. Let ρ be a normalizedmeasure that assigns zero weights to absorbing states and equal weights to non-absorbing states. Then the state of the CTMC at the root Hi(Ω) is distributedaccording to a categorical distribution with parameter ρ .⌦x 2 ⌧⌦x 2 ⌧ x 2 ⌧G1G2G3 G4 G1 G3 G4G2G3⌧x ⌧x⌧Figure 3.4: A rooted tree topology τ with root node Ω and SNV event atpoint x (left). G1, G2, G3, and G4 represent genotypes. τ−x is the sub-tree pruned at x (middle). Subtree rooted at x is denoted by τx(right). Tosimulate from the Generalized Dollo model, for a specific genomic lo-cus i, we first pick the SNV position on the tree, then simulate a CTMCon the pruned tree, and simulate another CTMC on the subtree.Algorithm 1 shows the pseudo code to simulate from Generalized Dollo Model.As input we provide the SimulateGeneralizedDollo procedure with tree topology τwith M leaves and parameter µ , as well as rate matrices Qabove,Qbelow.37Algorithm 1 Simulating From Generalized Dollo Model1: procedure SIMULATEGENERALIZEDDOLLO(τ,µ,QABOVE,QBELOW)2: for i in 1 : N do3: Simulate SNV edge eSNV ∼ ν(τ,µ)4: Simulate SNV point pSNV ∼ Uniform[0, |eSNV|] on eSNV.5: Simulate state of CTMC at the root node, Hi(Ω)∼ Categorical(ui)6: aboveStates← sampleTreeCTMC(τ−pSNV ,Qabove)7: belowStates← sampleTreeCTMC(τpSNV ,Qbelow)8: allStates← allStates ∪ combine(aboveStates, belowStates)9: return allStateswhereX ∼ ν(τ,µ)⇔ p(X = x) = 1||τ||+1/µ ×{|x| if x 6=Ω1/µ otherwise(3.1)where sampleTreeCTMC(τ,Q) simulates along a phylogeny, a substitution CTMCand a substitution-deletion CTMC for rate matrices Q1 and Q2 respectively.Since ν has a point mass µ on Ω, there is a non-zero probability that a SNVhappens at the root and hence τ−pSNV = /0. In this case sampleTreeCTMC( /0,Q)returns /0. If a SNV does not happen at the root, then with probability one there aregenotypes in the sample that have no variant allele copy at that genomic locus.This will give us the copy number state of each genotype at each genomic locus.We summarize this into copy-number aware genotype matrix ∆CN ∈ (N×N)M×N .Each element of this matrix ∆CNm,n is a pair = (CNR,CNV ) where CNR and CNVrepresent reference and variant allele copy numbers respectively at genomic locusn for the m-th genotype. The binary genotype matrix ∆ comes from binarized ∆CN.An element of ∆ is equal to one if the second element of the corresponding elementin ∆CN is non-zero, and it is zero otherwise. Figure 3.5 shows the phylogeny and10 genotypes each with 48 genomic loci generated from the GD model.383.2.2 Simulating bulk dataWe use the generated genotypes ∆CN from the GD model to simulate the bulkdata. This bulk data serves as the input to the competing methods in this work.Our method additionally takes as input a binarized version of the genotype data toinform its prior over partitions of genomic loci.For each genomic locus i, the simulated bulk data consists in (i) variant andtotal allele counts (bi,di), (ii) major and minor copy numbers ζ¯ i1 and ζ¯i2, and (iii)tumour cellularity t. We set t = 1 for simulated experiments in this work.Let Φ = Φ1:M where Φm is the clonal prevalence for genotype m, that is, thefraction of cells in the tumour sample that have genotype m and M be the totalnumber of genotypes. Then φi, the clonal prevalence of genomic locus i in oursample would be Φ.∆[, i] and φ =Φ.∆. In this work, we set Φm = m∑Mj=1 j .To generate bulk data at genomic locus i, we first simulate di, the total numberof reads mapping to i from a Poisson distribution with parameter 10,000. We thenuse the CN state of the most prevalent genotype from ∆CN (here, it would be theM-th genotype) at locus i to set the major and minor CNs for the bulk. That is weset ζ¯ i1 = Maximum(∆CNM,i) and ζ¯ i2 = Minimum(∆CNM,i). To simulate the variant allelecounts bi we have to take into account the aggregate effect of all genotypes atlocus i in ∆CN. This means that the ψi-s will be slightly different from subsection2.3.3 in chapter 2, that is, instead of containing normal, reference, and variantsubpopulations ψi = (giN ,giR,giV ), it should contain normal, and all the genotypesfrom ∆CN. With a slight abuse of notation, we denote this by ψ∗i (∆CN) = ψ∗i =(giN ,gi1,gi2, ...,giM). We also have to modify the definition of ξ to work with thenew ψ∗ as follows:ξ ∗(ψ∗i ,Φi, t) =(1− t)ζ (gN)Z∗µ(gN)+ tM∑j=1Φ jζ (gim)Z∗µ(gim) (3.2)where Z∗ is the appropriate updated normalizing constant. Finally, we have bi ∼Beta-Binomial(di,ξ ∗(ψ∗i ,φi, t),s)where we set s= 1000. Algorithm 2 summarizesthe bulk data simulation procedure:39Algorithm 2 Simulate Bulk Data1: procedure SIMULATEBULKDATA(Φ,∆CN,s, t)2: for i in 1 : N do3: d← d∪ Simulate di ∼ Pois(10,000)4: ζ¯1← ζ¯1∪Maximum(∆CNM,i)5: ζ¯2← ζ¯2∪Minimum(∆CNM,i)6: b← b∪ Simulate bi ∼ Beta-Binomial(di,ξ ∗(ψ∗i ,Φi, t),s)7: return {d,b, ζ¯1, ζ¯2, t}0.04t 9t 4t 6t 5t 3t 7t 2t10t 1t 8Genomic LociGenotypes1234567891040 43 16 27 1 10 47 46 39 15 33 26 45 48 2 24 41 38 35 25 23 19 12 8 3 6 14 21 34 28 44 30 42 4 9 36 32 31 29 5 17 18 37 Generalized Dollo − 10 Genotypes X 48 Loci − Seed = 6Figure 3.5: Transposed binarized simulated genotypes ∆ (right) from Gener-alized Dollo process over a fixed phylogeny (left). The original geno-type matrix ∆CN is in copy number space. We binarize it by setting en-tries with non zero variant allele copy number to one (coloured red) andsetting entries with variant allele copy number of zero to zero (colouredblue).3.2.3 Benchmarking against existing methodsWe benchmarked our method against existing methods over synthetic data. Wesimulated 10 synthetic datasets each with 10 genotypes over 48 genomic loci fromthe GD model. Figure 3.5 shows the genotype heat map and phylogeny used in oneof the datasets. The rest of the figures are in the appendix A, section A.1. In theseexperiments, dd-PyClone was always supplied with the binarized genotype matrix.The other methods were given the bulk data simulated from the original genotypematrix in the copy number state. We refer the reader to Figure 3.1 for a schematic of40the simulated data generation workflow. Figure 3.6 shows the performance resultsof this experiment.Parameters for synthetic data generationWe used the following setup to generate synthetic data:t = 1d = 10 ,000s = 1000µ = 100number o f genotypes = 10number o f genomic l o c i = 48Max To ta l Copy Number = 441Synthetic − Dollo (10X48)V measure0.00.20.40.60.8dd−PyClonePyClonePhyloWGSClomialSciClone●●●●●●*Synthetic − Dollo (10X48)Phi error0.10.20.30.4dd−PyClonePyClonePhyloWGSClomialSciClone●●●●●*Figure 3.6: Performance results for dd-PyClone and existing methods oversynthetic data. Right panel shows clustering assignments performance.Left panel shows cellular prevalence mean absolute error. *We werenot able to run Clomial with more than one dataset, since in the rest ofthe datasets, the number of clusters are more than 10, and in this case,Clomial never converged.Parameter setting in method comparison experimentsWe ran PyClone version 0.12.3 for 10,000 iterations with a burn-in of 1000 andthinning of 1. Remaining parameters were set as follows:num i te rs :10 ,000base measure alpha=1base measure beta=1concen t ra t i on=1p r i o r shape = 1p r i o r ra te = 0.00142dens i t y = pyc lone be ta b inomia lbe ta b inomia l p rec i son value = 1000be t a b i nom ia l p r i o r shape = 1be t a b i nom ia l p r i o r ra te = 1be ta b inomia l p rec i son proposal p rec i s i on = 0.01tumour content = 1e r r o r r a t e = 0.001We used Clomial version 1.4.0 and provided it with the correct number ofclusters via its c. Remaining arguments were set as follows:maxIt = 20binomTryNum=10c = True Number of ClustersWe downloaded PhyloWGS with git tag smchet1-31-g57294e3 and usedit with the default settings for the following parameters:−−mcmc−samples = 2500−−mh−i t e r a t i o n s = 5000Since this version of PhyloWGS did not output clonal frequencies, we editedthe source code to extract these values. Furthermore, to simplify comparison withother methods, we provided PhyloWGS with an empty copy number file.We used SciClone version 1.0.7. We set the maximum number of clusters toits true value. The remaining parameters are as follows:minimumDepth = 2copyNumberMargins = 0.25maximumClusters = True number of clusters.433.3 Real dataset3.3.1 Triple-negative breast cancer Xenograft dataTo test our method over a real dataset, we used a subset of samples from a triple-negative breast cancer xenograft study [8] where breast cancer tissues from 55 pa-tients were transplanted into highly immunodeficient mice to generate 30 xenograftlines. Over 3 years, these lines were passaged up to 16 generations.Whole genome sequencing was performed over a subsample of this cohort toidentify candidate genomic positions. It was followed by deep targeted ampliconsequencing of between 100 to 300 SNV positions per sample. 210 cells from fivetimepoints that span two samples were chosen for single cell genotyping, and about48 SNV positions were targeted for each timepoint.The results were post-processed to remove all positions labeled as non-somatic.This was further summarized into constituent genotypes. A consensus phyloge-netic tree was inferred using MrBayes [26]. Cells were grouped into clades con-sisting of high probability branching splits. For each clade a consensus genotypewas derived by taking the most prevalent genotype at each genomic locus. Figure3.7 shows the inferred genotype matrix ∆ for each sample. In each timepoint, weonly kept genomic loci that were shared between the bulk and single cell genotypedata. Inferred genotypes from the triple-negative breast cancer xenograft singlecell genotyping study is shown in Figure 3.7.44GenotypesGenomic LociX:1175268958:483479917:365700305:1406040444:585489353:7251978020:4081769520:162017772:2504368517:7142487117:6752243117:6593522516:8259079116:2588795612:1071259651:23903001111:42258257:635288857:191848783:6253053122:489747112:463944212:1180909619:3109417312:10671539811:663611541:575226510:1232051881 2 Curated Genotypes for sample SA49410.80.60.40.20GenotypesGenomic LociX:138867418X:1073787328:4424048:1437149248:27453456:795097002:2427413011:11561528610:555824956:4341079222:380717132:372295082:2144866702:21059492119:5148560119:111992918:3551506614:238265601:16169329810:1059717976:349509253:14149723320:320918313:926895252:1520639452:1626887005:15664229821:4541377911:1077508982:459418427:800548587:637551602:2879175714:7466433814:10444515713:10117561212:10682445612:22856371 2 5 3 4 Curated Genotypes for sample SA50110.80.60.40.20Figure 3.7: Binary genotype matrices for sample SA494 over 29 genomicloci (left) and sample SA501 over 38 genomic loci (right). These aremanually curated from a single cell genotype sequencing experiment[8]. Briefly, MrBayes was used to infer a consensus phylogenetic treeover the single nuclei. Then they were grouped into clades accordingto high probability branching splits. Finally, each clade was assigned aconsensus genotype by taking the mode genotype of the clade at eachgenomic locus.3.3.2 Establishing the ground truthSince exact clustering configuration and cellular prevalences of the genomic lociin the real dataset is unknown, we used multi-sample PyClone’s result over 11timepoints from sample SA501 and 4 timepoints from sample SA494 as our goldstandard. PyClone in multi-sample mode borrows statistical strength across all45timepoints to give better estimates of subclonal structure in individual timepoints.The following timepoints were used for sample SA501:SA501T, SA501X1A, SA501X2A, SA501X2B, SA501X3A,SA501X3B, SA501X4A, SA501X4B, SA501X4C,SA501X4D, SA501X5A.The following timepoints were used for sample SA494:SA494X4, SA494X3, SA494X2, SA494TPyClone was run for 100,000 iterations with a burn-in period of 50,000 itera-tions. The rest of the settings were identical to synthetic simulation experiments asin listing 3.2.3. Cellular prevalence estimates are summarized in Figure 3.8. Thisresults in an overall clustering and timepoint-based prevalence estimates for eachgenomic locus.SA494Time pointCellular prevalence0.00.20.40.60.81.0T X4llllllllllllllllClusters (n)1 (16)2 (12)3 (5)4 (1)5 (49)6 (2)7 (1)8 (1)9 (1)SA501Time pointCellular prevalence0.00.20.40.60.81.0X1 X2 X3 X4 X5l llllllllllllllllllllllllllllll lllllll llllll lllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllClusters (n)1 (15)2 (6)3 (9)4 (1)5 (1)6 (1)7 (1)8 (14)9 (11)10 (58)11 (1)12 (1)13 (1)14 (7)15 (6)16 (3)17 (1)18 (2)19 (5)20 (2)21 (23)22 (5)23 (1)24 (1)25 (1)Figure 3.8: Clustering result for multi-sample PyClone over timepointsSA501 X1, X2, X4, and SA494 T, X446SA494Time pointCellular prevalence0.00.20.40.60.81.0T X4lllllllllClusters (n)1 (10)2 (7)4 (1)5 (16)6 (1)SA501Time pointCellular prevalence0.00.20.40.60.81.0X1 X2 X3 X4 X5l llllllllllllllllll ll l lllll llllllllllllClusters (n)1 (8)2 (4)3 (4)8 (5)10 (11)14 (3)15 (2)21 (7)22 (1)Figure 3.9: Clustering result for multi-sample PyClone over timepointsSA501 X1, X2, X4, and SA494 T, X4 for genomic loci thatoverlap with those sequenced in the single genotype analysis.We ran our method along with competing methods over timepoints SA501X1, X2, X4, and SA494 T, X4 for which we had matching single genotypesequencing data. Figure 3.10 shows the performance of each method against thegolden standard.3.4 Parameter sensitivityIn this section we report our simulation studies aimed at elaborating dd-PyClone’ssensitivity to the choice of hyperparameters and noise level. Since hyperparametera, the decay function parameter, is the distinguishing parameter of our model, wemostly focus on effects of its starting value on our model. To assess our model’srobustness to noise, we introduce two types of noise, namely, point error and geno-type loss. Finally, we examine our model under varying values of a and presenceof noise simultaneously.47TNBC XenograftV measure0.30.40.50.60.70.8dd−PyClonePyClonePhyloWGSClomialSciCloneSA494 − Tdd−PyClonePyClonePhyloWGSClomialSciCloneSA494 − X4SA501 − X10.30.40.50.60.70.8SA501 − X20.30.40.50.60.70.8SA501 − X3 SA501 − X40.30.40.50.60.70.8SA501 − X5Methodsdd−PyClonePyClonePhyloWGSClomialSciCloneTNBC XenograftPhi error0.10.20.30.4dd−PyClonePyClonePhyloWGSClomialSciCloneSA494 − Tdd−PyClonePyClonePhyloWGSClomialSciCloneSA494 − X4SA501 − X10.10.20.30.4SA501 − X20.10.20.30.4SA501 − X3 SA501 − X40.10.20.30.4SA501 − X5Methodsdd−PyClonePyClonePhyloWGSClomialSciCloneTNBC XenograftV measure0.30.40.50.60.70.8dd−PyClonePyClonePhyloWGSClomialSciCloneSA494 − Tdd−PyClonePyClonePhyloWGSClomialSciCloneSA494 − X4SA501 − X10.30.40.50.60.70.8SA501 − X20.30.40.50.60.70.8SA501 − X3 SA501 − X40.30.40.50.60.70.8SA501 − X5Methodsdd−PyClonePyClonePhyloWGSClomialSciCloneFigure 3.10: Performance results for dd-PyClone and existing methods overTNBC SA501 X1, X2, X4, and SA494 T, X4. Right panelshows clustering assignment performance. Left panel shows cellularprevalence approximation mean absolute error.483.4.1 Sensitivity to value of aFigure 3.11 shows the result of running our model with different starting valuesfor the hyperparameter a. In these experiments we disabled resampling of hyper-parameters a, α , and s, and fixed them at their starting value. We simulated 10datasets from the GD model with 5 genotypes over 48 genomic loci. We ran ourmodel 170 times for each dataset, with different initial values for hyperparameters,each time for 200 iterations. Each box plot shows the respective performance indexfor runs with an identical initial value of a and different values for s and α , eachfor 5 datasets.aV measure0.20.40.60.81.00.01 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1llllll ll l l llllllllllllllllllllllllaPhi error0.10.20.30.01 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1l lll l ll l ll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllFigure 3.11: Performance over 10 synthetic datasets. Hyperparameter a isfixed at the specific value for each inference run. This result suggeststhat performance declines with increasing values of hyperparameter a.493.4.2 Sensitivity to presence of noiseWe consider two types of noise. A point noise that affects the status of a singlegenomic locus for certain genotypes, and a genotype loss noise, where one or mul-tiple genotypes are completely lost. Let the original genotype matrix be ∆M×Nwhere M is the number of genotypes and N is the number of genomic loci. In ourexperiments, we provided the noisy genotype matrix to our model.Point noiseWhen a genotype is erroneously marked as mutated at a genomic locus, we saya point noise has occurred. In particular, we assume that a process independentlyoperates on each element of ∆ and flips its value with a probability p. This processcould be due to false positives in calling SNVs.Concretely, assume fp : {0,1}M×N × [0,1]→ {0,1}M×N be a stochastic mapparameterized by p corresponding to the point noise process. The filtered matrixwould be a random binary matrix RM×N with elements that follow the distributionin equation 3.3.Pr(Ri, j = k) = 1(∆i, j = 0)pk(1− p)(1−k)+1(∆i, j = 1)p(1−k)(1− p)k (3.3)In other words, we can view this process as first sampling a random binary matrixFM×N each element of which is sampled independently from a Bernoulli distribu-tion with parameter p, Fi, jiid∼ Bern(p). Then we combine this filtered matrix withthe original genotype matrix ∆ using an element-wise XOR operation to get filteredmatrix R.Figure 3.12 shows the result of providing our model with a noisy genotypematrix, under various probabilities of noise p. We simulated 10 datasets from theGD model with 10 genotypes and 48 genomic loci. We ran dd-PyClone for 500iterations and enabled resampling of hyperparameters a, α , and s. Each box plotshows the respective performance index when dd-PyClone was supplied with anoisy genotype matrix with a particular p.50Noise ProbabilityV measure0.00.20.40.60.81.00 0.1 0.2 0.3 0.4 0.5lll llllNoise ProbabilityPhi error0.060.080.100.120 0.1 0.2 0.3 0.4 0.5lllll lFigure 3.12: Effect of adding point noise on V measure index (left) and meanabsolute error of cellular prevalences (right). This result implies thatour method is sensitive to high levels of point noise.Genotype lossNow we turn our attention to effects of genotype loss. It may happen due to un-dersampling inherent in single cell genotyping (SCG) experiments. The number ofisolated cells that are to be sequenced in SCG experiments is order of magnitudesless than the number of cells in the donor tumour. This may result in undersamplingof some genotypes, with less prevalent genotypes being more prone to missing.Figure 3.13 illustrates the effects of progressively removing more genotypes.We simulated 10 datasets from the GD model with 10 genotypes over 48 genomicloci. For each dataset, we ran dd-PyClone 11 times each for 500 iterations. Each51time, we held out a number of genotypes from the original genotype matrix ∆ andprovided dd-PyClone with the resulting undersampled genotype matrix ∆′ . Thisresult implies that if more than half of high prevalence genotypes are observed,using them will improve clustering assignment and cellular prevalence estimatesover the genotype-naive methods.52# of lost genotypesV measure0.60.70.80.90 1 2 3 4 5 6 7 8 9 10lllllll llll# of lost genotypesPhi error0.050.100.150 1 2 3 4 5 6 7 8 9 10lll l lll llllFigure 3.13: Effect of removing genotypes on V measure index (left) andmean absolute error of cellular prevalences (right). Zero genotypeloss depicts the model performance under the original genotype ma-trix. Number of lost genotypes of 10 indicates the performance ofthe model with no genotypes supplied. In this case, the method es-sentially falls back to that of PyClone. We have included PyClone’sperformance over the same datasets with 500 MCMC iterations as areference (right most box plot). As expected, with loss of genotypes,the method progressively performs worse since it is losing informa-tion. The rise in performance when all genotypes are lost could beattributed to the fact that undersampled genotypes are misleading andinterfere with the signal in the bulk data.We also tested the performance of the model under removal of individual geno-types. Figure 3.14 illustrates the effects of removing individual genotypes. Theexperimental configuration is exactly the same as the above experiment, except53that we held out only one genotype from the original genotype matrix ∆ and pro-vided dd-PyClone with the resulting undersampled genotype matrix ∆′. This resultimplies that our method is robust regarding the removal of individual genotypes.Index of removed genotypeV measure0.800.850.900.950 1 2 3 4 5 6 7 8 9 10lllllllllllIndex of removed genotypePhi error0.040.060.080.100.120 1 2 3 4 5 6 7 8 9 10llll llll lllFigure 3.14: Effect of removing individual genotypes on V measure index(left) and mean absolute error of cellular prevalence (right). Zero geno-type loss depicts the model performance under the original genotypematrix. Horizontal axes shows which genotype was removed from thegenotype matrix supplied to the method. The model is robust to theloss of individual genotypes.3.4.3 Sensitivity to a and noiseHere we examine the effects of simultaneously varying a and introducing noise.In our first experiment, we added point noise. We simulated five datasets from the54GD model with 10 genotypes over 48 genomic loci. For each dataset, we ran dd-PyClone for 200 iterations 60 times. Each time we fixed the hyperparameters a,α and s to a different starting value and disabled hyperparameter resampling. Foreach dataset, we introduced point noise with specified probability p to the originalgenotype matrix, and input the filtered genotype matrix to our model. Results forthis experiment are shown in Figure 3.15. It implies that in presence of noise, themodel is more sensitive to higher values of decay function parameter a and as aincreases, model performance declines.550.00.10.20.30.40.5 0.20.40.60.81.00.550.600.650.700.750.800.85Noise ProbabilityaV measureseed = 00.00.10.20.30.40.5 0.20.40.60.81.00.550.600.650.700.750.800.85Noise ProbabilityaV measureseed = 10.00.10.20.30.40.5 0.20.40.60.81.00.550.600.650.700.750.800.85Noise ProbabilityaV measureseed = 20.00.10.20.30.40.5 0.20.40.60.81.00.550.600.650.700.750.800.85Noise ProbabilityaV measureseed = 30.00.10.20.30.40.5 0.20.40.60.81.00.550.600.650.700.750.800.85Noise ProbabilityaV measureseed = 40.550.600.650.700.750.800.85(a) V measure Index560.00.10.20.30.40.5 0.20.40.60.81.00.060.080.100.120.140.16Noise ProbabilityaPhi errorseed = 00.00.10.20.30.40.5 0.20.40.60.81.00.060.080.100.120.140.16Noise ProbabilityaPhi errorseed = 10.00.10.20.30.40.5 0.20.40.60.81.00.060.080.100.120.140.16Noise ProbabilityaPhi errorseed = 20.00.10.20.30.40.5 0.20.40.60.81.00.060.080.100.120.140.16Noise ProbabilityaPhi errorseed = 30.00.10.20.30.40.5 0.20.40.60.81.00.060.080.100.120.140.16Noise ProbabilityaPhi errorseed = 40.040.060.080.100.120.140.16(b) Cellular prevalence error57Figure 3.15: Effect of adding random point noise and varying decay parame-ter a on V measure index (a) and mean absolute error of cellular preva-lence estimates (b) for the five simulated datasets. Beta-Binomial pre-cision parameter s and hyperparameter α are fixed at 1000 and 1 re-spectively. We note that V measure index is more sensitive to changesin value of a than the level of point noise. Heat map colours representvalues in the vertical axis and are included to aid the eyes.We examined two genotype loss scenarios: one where only a single genotype islost, and one where progressively more genotypes are missed. Results for the firstscenario are in Figure 3.16. Five datasets identical to the point noise experimentwere generated. For each dataset, we held out the specified genotype and input theremaining as the genotype matrix to our model (i.e., a matrix with 9 genotypes inour experiments).58024680.00.20.40.60.81.00.550.600.650.700.750.800.85Index of removed genotypeaV measureseed = 0024680.00.20.40.60.81.00.550.600.650.700.750.800.85Index of removed genotypeaV measureseed = 1024680.00.20.40.60.81.00.550.600.650.700.750.800.85Index of removed genotypeaV measureseed = 2024680.00.20.40.60.81.00.550.600.650.700.750.800.85Index of removed genotypeaV measureseed = 3024680.00.20.40.60.81.00.550.600.650.700.750.800.85Index of removed genotypeaV measureseed = 40.550.600.650.700.750.800.85(a) V measure Index59024680.00.20.40.60.81.00.060.080.100.120.140.16Index of removed genotypeaPhi errorseed = 0024680.00.20.40.60.81.00.060.080.100.120.140.16Index of removed genotypeaPhi errorseed = 1024680.00.20.40.60.81.00.060.080.100.120.140.16Index of removed genotypeaPhi errorseed = 2024680.00.20.40.60.81.00.060.080.100.120.140.16Index of removed genotypeaPhi errorseed = 3024680.00.20.40.60.81.00.060.080.100.120.140.16Index of removed genotypeaPhi errorseed = 40.040.060.080.100.120.140.160.18(b) Cellular prevalence error60Figure 3.16: Effect of removing single genotypes and varying hyperparame-ter a on V measure index (a) and mean absolute error of cellular preva-lence estimates (b) for five simulated datasets. Genotypes are sortedin decreasing order of prevalence from right to left. Genotype 1 is theleast prevalent and genotype 9 is the most prevalent. Beta-Binomialprecision parameter s and hyperparameter α are fixed at 1000 and 1 re-spectively. We note that V measure index is more sensitive to changesin value of a than removal of single genotypes. Heat map colours rep-resent values in the vertical axis and are included to aid the eyes.In the second scenario, we progressively removed more genotypes. Figure 3.17depicts these results. Except for genotype loss, the rest of experiment setup wasidentical to the first scenario. This result implies that the model is more sensitiveto the value of the decay function parameter a than it is to genotype removal.61024680.00.20.40.60.81.00.550.600.650.700.750.800.85# of removed genotypesaV measureseed = 0024680.00.20.40.60.81.00.550.600.650.700.750.800.85# of removed genotypesaV measureseed = 1024680.00.20.40.60.81.00.550.600.650.700.750.800.85# of removed genotypesaV measureseed = 2024680.00.20.40.60.81.00.550.600.650.700.750.800.85# of removed genotypesaV measureseed = 3024680.00.20.40.60.81.00.550.600.650.700.750.800.85# of removed genotypesaV measureseed = 40.550.600.650.700.750.800.85(a) V measure Index62024680.00.20.40.60.81.00.060.080.100.120.140.16# of removed genotypesaPhi errorseed = 0024680.00.20.40.60.81.00.060.080.100.120.140.16# of removed genotypesaPhi errorseed = 1024680.00.20.40.60.81.00.060.080.100.120.140.16# of removed genotypesaPhi errorseed = 2024680.00.20.40.60.81.00.060.080.100.120.140.16# of removed genotypesaPhi errorseed = 3024680.00.20.40.60.81.00.060.080.100.120.140.16# of removed genotypesaPhi errorseed = 40.040.060.080.100.120.140.16(b) Cellular prevalence error63Figure 3.17: Effect of removing progressively more genotypes and varyingdecay parameter a on V measure index (a) and mean absolute errorof cellular prevalence estimates (b) for five simulated datasets. Beta-Binomial precision parameter s and hyperparameter α are fixed at1000 and 1 respectively. We note that V measure index is more sensi-tive to changes in value of a than removal of multiple low prevalencegenotypes. Heat map colours represent values in the vertical axis andare included to aid the eyes.3.4.4 Effect of non-exchangeabilityAs stated in the Methods chapter, ddCRP is not an exchangeable prior in general.Figure 3.18 shows the effect of random reordering of customers on our model. Itimplies that our model is not significantly sensitive to the order of customers.64Customer PermutationV measure0.70.80.9No PermutationPermutatedl llTNBC Xenograft0.70.80.9lllSynthetiticCustomer PermutationPhi error0.040.060.080.100.12No PermutationPermutatedllTNBC Xenograft0.040.060.080.100.12l lllSynthetiticFigure 3.18: Effect of random reordering of customers in each iteration ofthe sampler vs keeping the order of customers fixed on V measureindex (left) and mean absolute error of cellular prevalence (right). Thetop row depicts results over a synthetic dataset (all settings identicalto subsection 3.4.2 except that there is no added artificial noise). Thebottom row shows result over a real dataset.3.5 Computational aspectsComputing the Distance Matrix takes O(N2M). Computing the clustering resulttakes O(N2). The complete analysis with 100 MCMC runs on a personal laptopwith 2.4 GHz Intel Core 2 Duo and 8GB of RAM memory, for a dataset of 48genomic loci and 10 genotypes takes about 5 minutes.653.6 Convergence diagnosticsFollowing [29] to assess convergence of the MCMC chain for TNBC Xenograftsamples SA501 and SA494, we ran 3 chains for 10,000 iterations with randomseeds and visually inspected Posterior Similarity Matrices (PSM) to ensure simi-larity. Figure 3.19 shows the PSM for time point X4 in sample SA494. The restof the figures are in the appendix, section A.2. These experiments imply that thechains have converged.ClusterLocus 35Locus 34Locus 33Locus 32Locus 31Locus 30Locus 29Locus 28Locus 27Locus 26Locus 25Locus 24Locus 23Locus 22Locus 21Locus 20Locus 19Locus 18Locus 17Locus 16Locus 15Locus 14Locus 13Locus 12Locus 11Locus 10Locus 09Locus 08Locus 07Locus 06Locus 05Locus 04Locus 03Locus 02Locus 01ClusterCluster 00.20.40.60.81Pairwise posterior clustering probabilityFigure 3.19: 10,000 runs from 3 different seeds over the Xenograft sampleSA494 timepoint X4.663.7 ConclusionIn this chapter, we first introduced a SNV and CNV aware genotype simulationscheme based on a phylogenetic tree, termed the Generalized Dollo model, fromwhich we also simulated the bulk datasets.We have shown that our method outperforms existing methods on both clus-tering assignment and cellular prevalence estimates in simulated datasets from theGD model. Furthermore, we have demonstrated that our method performs com-parably well with existing methods in a benchmark over a real dataset. We havealso shown that our method is fairly robust to the choice of hyperparameters andperforms reasonably in presence of noise.67Chapter 4ConclusionUnderstanding tumour subpopulation structure is essential in understanding howtumours start, grow, and develop resistance to treatment [13]. Next generationsequencing and, more recently, single cell sequencing have been used to study thisintra-tumour heterogeneity.Our method sits at the intersection of bulk and single cell sequencing technolo-gies. It leverages genotype co-occurrence patterns extracted from SCS to improveclustering and cellular prevalence estimates in bulk sequencing data.4.1 Significance and contributionIn this work we introduced a novel method to incorporate single cell genotypingdata with bulk sequencing data in the study of subclonal architecture. We presenteda new genotype simulator, the Generalized Dollo model. It enables us to accountfor both copy number and single nucleotide variations while respecting the Dolloparsimony principle. Moreover, it models evolution before the occurrence of aSNV, resulting in a more realistic simulated dataset.We have shown that our method outperforms existing methods on both cluster-ing assignment and cellular prevalence estimates in simulated experiments. Fur-thermore, we have demonstrated that our method performs comparably well withexisting methods in a benchmark over real datasets. We have also shown that ourmethod is fairly robust to the choice of hyperparameters and performs reasonably68in presence of noise.Thus we have confirmed the hypothesis we posited in the introduction chapter,that is, that co-occurrence patterns from single genotyping assays, when enoughgenotypes have been captured, in conjunction with deep sequencing bulk data, mayimprove cellular prevalence estimates of genomic loci.4.2 LimitationsWe note that our assumptions regarding the clonal subpopulation in the bulk datamay not agree with the input genotype matrix ∆ from the single cell genotypingexperiment. More specifically, we assume in modelling of the bulk data that thereare only three possible genotypes per genomic locus, namely, normal, variant, andreference subpopulations. If there are fewer or greater than three genotypes presentin ∆, then the two assumptions are in conflict. We note that this is not an inherentproblem in the model and could be fixed in future implementations.The current inference algorithm uses a cache-based Griddy-Gibbs method todeal with non-conjugate distributions. This may potentially impair accuracy andimpose a high memory footprint.Our experiments indicate that our method is sensitive to undersampling ofgenotypes. That is, performance in both clustering assignment and cellular preva-lence estimation decline when some of the existing genotypes are not observed.In particular, we observed that if less than half of the genotypes are observed, ourmodel performs worse than some of the genotype-naive methods.4.3 Potential applicationsUntil single cell sequencing technology is mature enough, that is, sufficiently cost-efficient and more accurate, it could be used in conjunction with bulk sequencingdata to obtain improved estimates of tumour subclonal properties.One scenario is the longitudinal sampling of cancer patients and model systemsto more accurately profile evolutionary dynamics. This is a step on the route touncovering properties of clonal fitness that in turn are required for quantitative de-scriptions of phenotypic traits conferring selective advantages. We are undertakingsuch experiments in our labs so that computational methods to decipher the sub-69clonal structure of heterogeneous tumours, including ours, can be directly appliedin the coming years.We note that genotype matrix inferred from single cell genotyping studiesis the main intended source to compute distance between genomic loci in dd-PyClone. However, any genomic loci co-occurrence indicator data from parallelstudies could be used with our model. One example is co-occurring and anti-co-occurring mutations network [5].4.4 Future researchThere are a number of ways in which our model could be improved. First, insteadof a binarized genotype matrix, the original single cell genotype matrix in copynumber space could be used to compute distance between genomic loci. Second,we posit that it may be possible to use the bulk sequencing data to estimate missingor noisy values in the single cell genotype matrix. Third, these could be incorpo-rated into a phylogenetic reconstruction algorithm that jointly infers evolutionaryhistory, genotypes, and prevalences from bulk and single cell sequencing data.4.5 Final wordIn summary, we have introduced a novel way to combine single cell genotyping andbulk sequencing data to study clonal subpopulation architecture and have shownthat it outperforms existing methods in simulation studies and performs compara-bly in real dataset benchmarking. We hope that our method will help in under-standing the evolutionary basis of cancer.70Bibliography[1] A. V. Alekseyenko, C. J. Lee, and M. A. Suchard. Wagner and dollo: astochastic duet by composing two parsimonious solos. Systematic biology,57(5):772–784, 2008. → pages 32, 33[2] D. M. Blei and P. I. Frazier. Distance dependent chinese restaurantprocesses. The Journal of Machine Learning Research, 12:2461–2488,2011. → pages iii, 16, 28, 29[3] A. Bouchard-Coˆte´ and M. I. Jordan. Evolutionary inference via the Poissonindel process. Proceedings of the National Academy of Sciences,10.1073/pnas.1220450110, 2013. → pages 37[4] K. Cibulskis, M. S. Lawrence, S. L. Carter, A. Sivachenko, D. Jaffe,C. Sougnez, S. Gabriel, M. Meyerson, E. S. Lander, and G. Getz. Sensitivedetection of somatic point mutations in impure and heterogeneous cancersamples. Nature biotechnology, 31(3):213–219, 2013. → pages 2, 3[5] Q. Cui. A network of cancer genes with co-occurring and anti-co-occurringmutations. PLoS One, 5(10):e13180, 2010. → pages 70[6] A. G. Deshwar, S. Vembu, C. K. Yung, G. H. Jang, L. Stein, and Q. Morris.Phylowgs: Reconstructing subclonal composition and evolution fromwhole-genome sequencing of tumors. Genome Biology, 16(1):35, 2015.doi:10.1186/s13059-015-0602-8. URLhttp://www.ncbi.nlm.nih.gov/pmc/articles/PMC4359439/. → pages 6[7] J. Ding, A. Bashashati, A. Roth, A. Oloumi, K. Tse, T. Zeng, G. Haffari,M. Hirst, M. A. Marra, A. Condon, et al. Feature-based classifiers forsomatic mutation detection in tumour–normal paired sequencing data.Bioinformatics, 28(2):167–175, 2012. → pages 271[8] P. Eirew, A. Steif, J. Khattra, G. Ha, D. Yap, H. Farahani, K. Gelmon,S. Chia, C. Mar, A. Wan, et al. Dynamics of genomic clones in breast cancerpatient xenografts at single-cell resolution. Nature, 2014. → pages 7, 44, 45[9] J. Felsenstein. Distance methods for inferring phylogenies: a justification.Evolution, pages 16–24, 1984. → pages 22[10] C. Fraley and A. E. Raftery. Model-based clustering, discriminant analysisand density estimation. Journal of the American Statistical Association, 97:611–631, 2002. → pages 32[11] A. Fritsch, K. Ickstadt, et al. Improved criteria for clustering based on theposterior similarity matrix. Bayesian analysis, 4(2):367–391, 2009. →pages 32[12] Z. Ghahramani, M. I. Jordan, and R. P. Adams. Tree-structured stickbreaking for hierarchical data. In Advances in neural information processingsystems, pages 19–27, 2010. → pages 6[13] M. Greaves and C. C. Maley. Clonal evolution in cancer. Nature, 481(7381):306–313, 01 2012. URL http://dx.doi.org/10.1038/nature10762. → pages 68[14] G. Ha, A. Roth, J. Khattra, J. Ho, D. Yap, L. M. Prentice, N. Melnyk,A. McPherson, A. Bashashati, E. Laks, et al. Titan: inference of copynumber architectures in clonal cell populations from tumor whole-genomesequence data. Genome research, 24(11):1881–1893, 2014. → pages 3[15] D. G. Hert, C. P. Fredlake, and A. E. Barron. Advantages and limitations ofnext-generation sequencing technologies: A comparison of electrophoresisand non-electrophoresis methods. Electrophoresis, 29(23):4618–4626,2008. → pages 2[16] B. P. Hodkinson and E. A. Grice. Next-generation sequencing: a review oftechnologies and tools for wound microbiome research. Advances in woundcare, 4(1):50–58, 2015. → pages 2[17] W. Jiao, S. Vembu, A. Deshwar, L. Stein, and Q. Morris. Inferring clonalevolution of tumors from single nucleotide somatic mutations. BMCBioinformatics, 15(1):35, 2014. ISSN 1471-2105.doi:10.1186/1471-2105-15-35. URLhttp://www.biomedcentral.com/1471-2105/15/35. → pages 672[18] C. A. Miller, B. S. White, N. D. Dees, M. Griffith, J. S. Welch, O. L. Griffith,R. Vij, M. H. Tomasson, T. A. Graubert, M. J. Walter, et al. Sciclone:inferring clonal architecture and tracking the spatial and temporal patterns oftumor evolution. 2014. → pages 6[19] N. Navin, J. Kendall, J. Troge, P. Andrews, L. Rodgers, J. McIndoo,K. Cook, A. Stepansky, D. Levy, D. Esposito, et al. Tumour evolutioninferred by single-cell sequencing. Nature, 472(7341):90–94, 2011. →pages 7[20] N. E. Navin. Cancer genomics: one cell at a time. Genome Biol, 15:452,2014. → pages 7[21] R. M. Neal. Markov chain sampling methods for dirichlet process mixturemodels. Journal of Computational and Graphical Statistics, 9(2):249–265,2000. doi:10.1080/10618600.2000.10474879. URLhttp://amstat.tandfonline.com/doi/abs/10.1080/10618600.2000.10474879.→ pages 29[22] R. Nielsen, J. S. Paul, A. Albrechtsen, and Y. S. Song. Genotype and snpcalling from next-generation sequencing data. Nature Reviews Genetics, 12(6):443–451, 2011. → pages 2[23] P. C. Nowell. The clonal evolution of tumor cell populations. Science, 194(4260):23–28, 1976. → pages 1[24] L. Oesper, G. Satas, and B. J. Raphael. Quantifying tumor heterogeneity inwhole-genome and whole-exome sequencing data. Bioinformatics, 2014.doi:10.1093/bioinformatics/btu651. URL http://bioinformatics.oxfordjournals.org/content/early/2014/10/29/bioinformatics.btu651.abstract.→ pages 3[25] C. Ritter and M. A. Tanner. Facilitating the gibbs sampler: The gibbsstopper and the griddy-gibbs sampler. Journal of the American StatisticalAssociation, 87(419):861–868, 1992.doi:10.1080/01621459.1992.10475289. URLhttp://www.tandfonline.com/doi/abs/10.1080/01621459.1992.10475289. →pages 27, 29[26] F. Ronquist, M. Teslenko, P. van der Mark, D. L. Ayres, A. Darling,S. Ho¨hna, B. Larget, L. Liu, M. A. Suchard, and J. P. Huelsenbeck. Mrbayes3.2: efficient bayesian phylogenetic inference and model choice across alarge model space. Systematic biology, 61(3):539–542, 2012. → pages 4473[27] A. Rosenberg and J. Hirschberg. V-measure: A conditional entropy-basedexternal cluster evaluation measure. In EMNLP-CoNLL, volume 7, pages410–420, 2007. → pages 32[28] A. Roth, J. Ding, R. Morin, A. Crisan, G. Ha, R. Giuliany, A. Bashashati,M. Hirst, G. Turashvili, A. Oloumi, et al. Jointsnvmix: a probabilistic modelfor accurate detection of somatic mutations in normal/tumour pairednext-generation sequencing data. Bioinformatics, 28(7):907–913, 2012. →pages 2[29] A. Roth, J. Khattra, D. Yap, A. Wan, E. Laks, J. Biele, G. Ha, S. Aparicio,A. Bouchard-Cote, and S. P. Shah. Pyclone: statistical inference of clonalpopulation structure in cancer. Nat Meth, 11(4):396–398, 04 2014. URLhttp://dx.doi.org/10.1038/nmeth.2883. → pages 2, 4, 6, 24, 66[30] C. T. Saunders, W. S. Wong, S. Swamy, J. Becq, L. J. Murray, and R. K.Cheetham. Strelka: accurate somatic small-variant calling from sequencedtumor–normal sample pairs. Bioinformatics, 28(14):1811–1817, 2012. →pages 2[31] S. P. Shah, A. Roth, R. Goya, A. Oloumi, G. Ha, Y. Zhao, G. Turashvili,J. Ding, K. Tse, G. Haffari, et al. The clonal and mutational evolutionspectrum of primary triple-negative breast cancers. Nature, 486(7403):395–399, 2012. → pages 3, 14[32] A. Shlien and D. Malkin. Copy number variations and cancer. 2009. →pages 3[33] M. R. Stratton, P. J. Campbell, and P. A. Futreal. The cancer genome.Nature, 458(7239):719–724, 2009. → pages 2[34] Y. W. Teh. Dirichlet process. In Encyclopedia of machine learning, pages280–287. Springer, 2010. → pages 5, 16[35] Y. Wang and N. E. Navin. Advances and applications of single-cellsequencing technologies. Molecular cell, 58(4):598–609, 2015. → pages 7[36] T. A. Yap, M. Gerlinger, P. A. Futreal, L. Pusztai, and C. Swanton.Intratumor heterogeneity: seeing the wood for the trees. Sciencetranslational medicine, 4(127):127ps10–127ps10, 2012. → pages 1[37] H. Zare, J. Wang, A. Hu, K. Weber, J. Smith, D. Nickerson, C. Song,D. Witten, C. A. Blau, and W. S. Noble. Inferring clonal composition frommultiple sections of a breast cancer. 2014. → pages 474Appendix ASupporting MaterialsA.1 Simulated genotypes from the GD model0.03t 9t 3t 5t 6t 1t 8t10t 2t 4t 7Genomic LociGenotypes1234567891037 47 45 44 42 30 26 23 21 19 17 4 6 16 33 7 9 11 43 5 38 46 48 41 36 35 32 2 18 1 39 34 31 29 28 25 24 22 20 15 13 12 10 3 8 Generalized Dollo − 10 Genotypes X 48 Loci − Seed = 0Figure A.1: Transposed binarized simulated genotypes X (right) from Gen-eralized dollo process over a fixed phylogeny (left). The original geno-type matrix XCN is in copy number space. We binarize it by setting en-tries with non zero variant allele copy number to one (coloured red) andsetting entries with variant allele copy number of zero to zero (colouredblue).750.03t 8t 3t 5t 6t 9t 1t 2t10t 4t 7Genomic LociGenotypes1234567891031 24 44 39 33 27 25 17 19 10 4 5 43 46 21 30 3 6 40 23 1 18 2 12 15 28 47 41 38 42 13 37 36 35 32 29 26 20 14 11 7 9 45 8 34 Generalized Dollo − 10 Genotypes X 48 Loci − Seed = 1Figure A.2: Transposed binarized simulated genotypes X (right) from Gen-eralized dollo process over a fixed phylogeny (left). The original geno-type matrix XCN is in copy number space. We binarize it by setting en-tries with non zero variant allele copy number to one (coloured red) andsetting entries with variant allele copy number of zero to zero (colouredblue).0.03t 5t 9t 3t 8t 6t10t 4t 2t 1t 7Genomic LociGenotypes1234567891012 25 41 42 10 2 16 45 38 33 20 19 13 3 7 6 23 11 46 1 24 48 44 43 40 39 36 35 29 28 26 21 15 14 5 8 27 37 34 31 47 30 9 22 32 Generalized Dollo − 10 Genotypes X 48 Loci − Seed = 2Figure A.3: Transposed binarized simulated genotypes X (right) from Gen-eralized dollo process over a fixed phylogeny (left). The original geno-type matrix XCN is in copy number space. We binarize it by setting en-tries with non zero variant allele copy number to one (coloured red) andsetting entries with variant allele copy number of zero to zero (colouredblue).760.05t 3t 6t 1t 2t 7t 9t 4t 8t10t 5Genomic LociGenotypes1234567891028 39 47 46 45 44 42 41 40 34 25 15 13 11 12 16 20 18 19 32 4 48 36 14 35 31 33 27 37 10 43 22 29 38 30 24 23 21 17 8 7 5 1 2 3 26 Generalized Dollo − 10 Genotypes X 48 Loci − Seed = 3Figure A.4: Transposed binarized simulated genotypes X (right) from Gen-eralized dollo process over a fixed phylogeny (left). The original geno-type matrix XCN is in copy number space. We binarize it by setting en-tries with non zero variant allele copy number to one (coloured red) andsetting entries with variant allele copy number of zero to zero (colouredblue).0.04t 7t 2t 5t 6t 8t 3t10t 9t 1t 4Genomic LociGenotypes1234567891041 18 6 29 5 16 47 46 45 44 43 38 36 34 24 17 11 3 8 21 37 48 13 9 42 39 33 30 25 28 23 40 19 7 2 1 27 15 14 4 12 20 26 31 35 Generalized Dollo − 10 Genotypes X 48 Loci − Seed = 4Figure A.5: Transposed binarized simulated genotypes X (right) from Gen-eralized dollo process over a fixed phylogeny (left). The original geno-type matrix XCN is in copy number space. We binarize it by setting en-tries with non zero variant allele copy number to one (coloured red) andsetting entries with variant allele copy number of zero to zero (colouredblue).770.05t 6t 4t 7t10t 5t 9t 3t 2t 1t 8Genomic LociGenotypes1234567891020 47 34 23 13 10 4 5 24 39 46 33 40 6 31 41 44 42 38 30 9 22 26 48 28 21 17 16 14 12 8 7 1 3 43 27 32 25 19 36 2 37 15 18 Generalized Dollo − 10 Genotypes X 48 Loci − Seed = 5Figure A.6: Transposed binarized simulated genotypes X (right) from Gen-eralized dollo process over a fixed phylogeny (left). The original geno-type matrix XCN is in copy number space. We binarize it by setting en-tries with non zero variant allele copy number to one (coloured red) andsetting entries with variant allele copy number of zero to zero (colouredblue).0.04t 9t 4t 6t 5t 3t 7t 2t10t 1t 8Genomic LociGenotypes1234567891040 43 16 27 1 10 47 46 39 15 33 26 45 48 2 24 41 38 35 25 23 19 12 8 3 6 14 21 34 28 44 30 42 4 9 36 32 31 29 5 17 18 37 Generalized Dollo − 10 Genotypes X 48 Loci − Seed = 6Figure A.7: Transposed binarized simulated genotypes X (right) from Gen-eralized dollo process over a fixed phylogeny (left). The original geno-type matrix XCN is in copy number space. We binarize it by setting en-tries with non zero variant allele copy number to one (coloured red) andsetting entries with variant allele copy number of zero to zero (colouredblue).780.04t 5t 4t 7t 3t 6t10t 9t 2t 1t 8Genomic LociGenotypes1234567891048 24 19 7 15 44 28 10 1 40 38 35 45 46 32 22 39 34 26 17 8 3 5 31 20 13 2 33 47 25 42 27 9 41 37 16 18 11 43 36 30 29 21 12 4 6 Generalized Dollo − 10 Genotypes X 48 Loci − Seed = 8Figure A.8: Transposed binarized simulated genotypes X (right) from Gen-eralized dollo process over a fixed phylogeny (left). The original geno-type matrix XCN is in copy number space. We binarize it by setting en-tries with non zero variant allele copy number to one (coloured red) andsetting entries with variant allele copy number of zero to zero (colouredblue).0.04t 4t 7t 8t 5t 3t 9t 2t 1t 6t10Genomic LociGenotypes1234567891018 12 2 44 45 39 37 36 35 28 22 11 3 9 33 42 23 29 19 26 5 1 20 41 34 31 16 7 15 10 48 40 4 13 47 46 43 38 32 30 27 25 21 17 14 6 8 Generalized Dollo − 10 Genotypes X 48 Loci − Seed = 9Figure A.9: Transposed binarized simulated genotypes X (right) from Gen-eralized dollo process over a fixed phylogeny (left). The original geno-type matrix XCN is in copy number space. We binarize it by setting en-tries with non zero variant allele copy number to one (coloured red) andsetting entries with variant allele copy number of zero to zero (colouredblue).79A.2 Convergence analysis resultsThe following figures each show PSM from 3 chains over the Xenograft TNBCreal dataset.ClusterLocus 38Locus 37Locus 36Locus 35Locus 34Locus 33Locus 32Locus 31Locus 30Locus 29Locus 28Locus 27Locus 26Locus 25Locus 24Locus 23Locus 22Locus 21Locus 20Locus 19Locus 18Locus 17Locus 16Locus 15Locus 14Locus 13Locus 12Locus 11Locus 10Locus 09Locus 08Locus 07Locus 06Locus 05Locus 04Locus 03Locus 02Locus 01ClusterCluster 00.20.40.60.81Pairwise posterior clustering probabilityFigure A.10: 10,000 runs from 3 different seeds over the Xenograft sampleSA501 timepoint X1.80ClusterLocus 38Locus 37Locus 36Locus 35Locus 34Locus 33Locus 32Locus 31Locus 30Locus 29Locus 28Locus 27Locus 26Locus 25Locus 24Locus 23Locus 22Locus 21Locus 20Locus 19Locus 18Locus 17Locus 16Locus 15Locus 14Locus 13Locus 12Locus 11Locus 10Locus 09Locus 08Locus 07Locus 06Locus 05Locus 04Locus 03Locus 02Locus 01ClusterCluster 00.20.40.60.81Pairwise posterior clustering probabilityFigure A.11: 10,000 runs from 3 different seeds over the Xenograft sampleSA501 timepoint X2.81ClusterLocus 41Locus 40Locus 39Locus 38Locus 37Locus 36Locus 35Locus 34Locus 33Locus 32Locus 31Locus 30Locus 29Locus 28Locus 27Locus 26Locus 25Locus 24Locus 23Locus 22Locus 21Locus 20Locus 19Locus 18Locus 17Locus 16Locus 15Locus 14Locus 13Locus 12Locus 11Locus 10Locus 09Locus 08Locus 07Locus 06Locus 05Locus 04Locus 03Locus 02Locus 01ClusterCluster 00.20.40.60.81Pairwise posterior clustering probabilityFigure A.12: 10,000 runs from 3 different seeds over the Xenograft sampleSA501 timepoint X3.82ClusterLocus 45Locus 44Locus 43Locus 42Locus 41Locus 40Locus 39Locus 38Locus 37Locus 36Locus 35Locus 34Locus 33Locus 32Locus 31Locus 30Locus 29Locus 28Locus 27Locus 26Locus 25Locus 24Locus 23Locus 22Locus 21Locus 20Locus 19Locus 18Locus 17Locus 16Locus 15Locus 14Locus 13Locus 12Locus 11Locus 10Locus 09Locus 08Locus 07Locus 06Locus 05Locus 04Locus 03Locus 02Locus 01ClusterCluster 00.20.40.60.81Pairwise posterior clustering probabilityFigure A.13: 10,000 runs from 3 different seeds over the Xenograft sampleSA501 timepoint X4.83ClusterLocus 45Locus 44Locus 43Locus 42Locus 41Locus 40Locus 39Locus 38Locus 37Locus 36Locus 35Locus 34Locus 33Locus 32Locus 31Locus 30Locus 29Locus 28Locus 27Locus 26Locus 25Locus 24Locus 23Locus 22Locus 21Locus 20Locus 19Locus 18Locus 17Locus 16Locus 15Locus 14Locus 13Locus 12Locus 11Locus 10Locus 09Locus 08Locus 07Locus 06Locus 05Locus 04Locus 03Locus 02Locus 01ClusterCluster 00.20.40.60.81Pairwise posterior clustering probabilityFigure A.14: 10,000 runs from 3 different seeds over the Xenograft sampleSA501 timepoint X5.84ClusterLocus 28Locus 27Locus 26Locus 25Locus 24Locus 23Locus 22Locus 21Locus 20Locus 19Locus 18Locus 17Locus 16Locus 15Locus 14Locus 13Locus 12Locus 11Locus 10Locus 09Locus 08Locus 07Locus 06Locus 05Locus 04Locus 03Locus 02Locus 01ClusterCluster 00.20.40.60.81Pairwise posterior clustering probabilityFigure A.15: 10,000 runs from 3 different seeds over the Xenograft sampleSA494 timepoint T.85
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- dd-PyClone : improving clonal subpopulation inference...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
dd-PyClone : improving clonal subpopulation inference from single cells and bulk sequencing data Salehi, Sohrab 2015
pdf
Page Metadata
Item Metadata
Title | dd-PyClone : improving clonal subpopulation inference from single cells and bulk sequencing data |
Creator |
Salehi, Sohrab |
Publisher | University of British Columbia |
Date Issued | 2015 |
Description | Improving our understanding of intra-tumour heterogeneity in cancer has important clinical implications, including an opportunity to understand mechanisms behind relapses and drug resistance. Next generation bulk sequencing is a mature tech- nology that has been used to study subclonal tumour populations at an aggregate level. Inference of populations from bulk sequencing requires sophisticated com- putational deconvolution methods. An alternative is to identify populations directly with single cell sequencing. However, single cell sequencing is a very error-prone process, and this impedes its ability to completely replace bulk sequencing for now. In this work we present dd-PyClone, a statistical model to combine single cell and bulk sequencing data to study clonal subpopulation architecture and improve clustering assignment and cellular prevalence estimates of a set of genomic loci. We introduce a single nucleotide variant and copy number aberration aware genotype simulation scheme based on a phylogenetic tree, termed the Generalized Dollo model. This model is an improvement over previous genotype generator models in that it also accounts for the evolutionary process before a rare event (here the single nucleotide variant) occurs. We show that incorporating genomic loci co-occurrence patterns from single cell sequencing studies in inferring clonal subpopulation structure from bulk se- quencing data is beneficial. Our method outperforms existing methods in simula- tion studies and performs comparably in real dataset benchmarking. We also show that our method is fairly robust as to the choice of hyperparameters and performs reasonably in presence of noise. We hope that our method will further the under- standing of the evolutionary basis of cancer. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2016-01-05 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivs 2.5 Canada |
DOI | 10.14288/1.0223066 |
URI | http://hdl.handle.net/2429/56179 |
Degree |
Master of Science - MSc |
Program |
Bioinformatics |
Affiliation |
Science, Faculty of |
Degree Grantor | University of British Columbia |
GraduationDate | 2016-02 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/2.5/ca/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2016_february_salehi_sohrab.pdf [ 1.32MB ]
- Metadata
- JSON: 24-1.0223066.json
- JSON-LD: 24-1.0223066-ld.json
- RDF/XML (Pretty): 24-1.0223066-rdf.xml
- RDF/JSON: 24-1.0223066-rdf.json
- Turtle: 24-1.0223066-turtle.txt
- N-Triples: 24-1.0223066-rdf-ntriples.txt
- Original Record: 24-1.0223066-source.json
- Full Text
- 24-1.0223066-fulltext.txt
- Citation
- 24-1.0223066.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0223066/manifest