Probabilistic models for the identificationand interpretation of somatic singlenucleotide variants in cancer genomesbyAndrew Justin Latham RothB.Sc., The University of British Columbia, 2009A THESIS SUBMITTED IN PARTIAL FULFILLMENT OFTHE REQUIREMENTS FOR THE DEGREE OFDOCTOR OF PHILOSOPHYinThe Faculty of Graduate and Postdoctoral Studies(Bioinformtics)THE UNIVERSITY OF BRITISH COLUMBIA(Vancouver)December 2015c© Andrew Justin Latham Roth 2015AbstractSomatic single nucleotide variants (SNVs) are mutations resulting from the substitution of a single nucle-otide in the genome of cancer cells. Somatic SNVs are numerous in the genomes of most types of cancers.SNVs can contribute to the malignant phenotype of cancer cells, though many SNVs likely have negligibleselective value. Because many SNVs are selectively neutral, their presence in a measurable proportion ofcells is likely due to drift or genetic hitchhiking. This makes SNVs an appealing class of genomic aber-rations to use as markers of clonal populations and ultimately tumour evolution. Advances in sequencingtechnology, in particular the development of high throughput sequencing (HTS) technologies, have madeit possible to systematically profile SNVs in tumour genomes. We introduce three probabilistic models tosolve analytical problems raised by experimental designs that leverage HTS to study cancer biology.The first experimental design we address is paired sequencing of normal and tumour tissue samples toidentify somatic SNVs. We develop a probabilistic model to jointly analyse data from both samples, andreduce the number of false positive somatic SNV predictions.The second experimental design we address is the deep sequencing of SNVs to quantify the cellularprevalence of clones harbouring the SNVs. The key challenge we resolve is that allele abundance measuredby HTS is not equivalent to cellular prevalence due to the confounding issues of mutational genotype, normalcell contamination and technical noise. We develop a probabilistic model which solves these problems whilesimultaneously inferring the number of clonal populations in the tissue.The final experimental design we consider is single cell sequencing. Single cell sequencing providesa direct means to measure the genotypes of clonal populations. However, sequence data from a single cellis inherently noisy which confounds accurate measurement of genotypes. To overcome this problem wedevelop a model to aggregate cells by clonal population in order to pool statistical strength and reduceerror. The model jointly infers the assignment of cells to clonal populations, the genotype of the clonalpopulations, and the number of populations present.iiPrefaceChapter 2 is a modified version of material published in “Andrew Roth, Jiarui Ding, Ryan D. Morin, Ana-maria Crisan, Gavin Ha, Ryan Giuliany, Ali Bashashati, Martin Hirst, Gulisa Turashvili, Arusha Oloumi,Marco A. Marra, Sam Aparicio, and Sohrab P. Shah. JointSNVMix: a probabilistic model for accurate de-tection of somatic mutations in normal/tumour paired next-generation sequencing data. Bioinformatics, 28(7):907–913, 2012”. The JointSNVMix model was jointly conceived by myself and Dr. Shah. I imple-mented the JointSNVMix software and performed all computational analyses except sequence alignment. Ico-wrote the text with Dr. Shah. I am the original creator and copyright holder of all figures presented inthis chapter. Additional co-authors of Roth et al. [2012] provided assistance with generation of sequencingdata and alignment of the data.Chapter 3 is a modified version of material published in “Andrew Roth, Jaswinder Khattra, DamianYap, Adrian Wan, Emma Laks, Justina Biele, Gavin Ha, Samuel Aparicio, Alexandre Bouchard-Côté, andSohrab P Shah. PyClone: statistical inference of clonal population structure in cancer. Nature methods,11(4):396–398, 2014”. The PyClone model was jointly conceived by myself, Dr. Shah and Dr. Bouchard-Côté. I implemented the PyClone software and performed all computational analyses described in chapter 3.I co-wrote the text with Dr. Shah and Dr. Bouchard-Côté. I am the original creator and copyright holderof all figures presented in this chapter. Additional co-authors of Roth et al. [2014] provided assistance withgeneration of sequencing data and alignment of the data.Chapter 4 is unpublished material that is being prepared for submission to a peer reviewed journal. TheSCG model was jointly conceived by myself, Dr. Bouchard-Côté and Dr. Shah. I implemented the SCGsoftware and performed all computational analyses described in chapter 4. I co-wrote the text with Dr.Shah and Dr. Bouchard-Côté. I am the original creator and copyright holder of all figures presented in thischapter. The data presented in section 4.3.7 is from a manuscript in preparation. This manuscript representswork I have done in collaboration with Andrew McPherson.iiiTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ixAcknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiDedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Genomic aberrations in cancer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.2 Genome sequencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41.3 Probabilistic models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61.4 Research contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71.4.1 Identifying somatic nucleotide variants from paired tumour/normal digital sequen-cing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71.4.2 Inference of clonal population structure using deep digital sequencing . . . . . . . . 71.4.3 Identification of clonal genotypes from single cell digital sequencing . . . . . . . . 82 Identifying somatic nucleotide variants from paired tumour/normal digital sequencing data 92.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92.1.1 Next generation sequencing of tumour genomes . . . . . . . . . . . . . . . . . . . 9iv2.1.2 Methods for discovering SNVs and somatic mutations . . . . . . . . . . . . . . . . 102.2 Method: JointSNVMix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112.2.1 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112.2.2 Model description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132.2.3 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172.2.4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202.2.5 Alternative methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202.2.6 Data sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212.2.7 Performance metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232.3.1 Joint modelling shows increased ability to detect shared signals on simulated data . 232.3.2 Joint modelling increases enrichment of true somatic SNVs in high ranking predic-tions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262.4.1 Limitations and extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262.4.2 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272.4.3 Postscript . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273 Inference of clonal population structure using deep digital sequencing . . . . . . . . . . . . . 293.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293.2 Method: PyClone . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313.2.1 Model overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313.2.2 Model description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313.2.3 Extension to multiple samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363.2.4 Extension to model overdispersion . . . . . . . . . . . . . . . . . . . . . . . . . . 373.2.5 Eliciting mutational genotype priors . . . . . . . . . . . . . . . . . . . . . . . . . . 373.2.6 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393.2.7 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403.2.8 Alternative methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403.2.9 Data sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41v3.2.10 Performance metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433.2.11 Running PyClone and MCMC analysis . . . . . . . . . . . . . . . . . . . . . . . . 433.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443.3.1 Synthetic data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443.3.2 Normal mixtures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463.3.3 High grade serous ovarian cancer . . . . . . . . . . . . . . . . . . . . . . . . . . . 483.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 513.4.1 Incorporating external sources of copy number information . . . . . . . . . . . . . 513.4.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 523.4.3 Alternative applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533.4.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543.4.5 Postscript . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 544 Identification of clonal genotypes from single cell digital sequencing . . . . . . . . . . . . . . 564.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564.1.1 Sources of error in single cell data . . . . . . . . . . . . . . . . . . . . . . . . . . . 574.2 Methods: Single cell genotyper . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 604.2.1 Model overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 604.2.2 Model description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 624.2.3 Extension for position specific error rates . . . . . . . . . . . . . . . . . . . . . . . 634.2.4 Extension to multiple data types . . . . . . . . . . . . . . . . . . . . . . . . . . . . 644.2.5 Extension to multiple samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 644.2.6 Extension to model doublets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 654.2.7 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 674.2.8 Alternative methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 714.2.9 Simulating synthetic data sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 734.2.10 Performance metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 754.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 764.3.1 Allelic dropout causes a multi modal distribution of variant allele frequency . . . . 764.3.2 Modelling heterozygous genotype states improves performance for the SCG model . 77vi4.3.3 Phylogenetic constraints do not significantly improve performance . . . . . . . . . 804.3.4 Modelling doublets reduces the rate of false positive clonal population prediction . . 824.3.5 Automatic identification of doublets in real world data sets . . . . . . . . . . . . . . 844.3.6 Inference of clonal dynamic in xenograft passages . . . . . . . . . . . . . . . . . . 874.3.7 Identifying late driver amplifications in high grade serous ovarian cancer . . . . . . 914.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 934.4.1 Limitations and extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 934.4.2 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 955 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 965.1 Summary of contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 965.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 975.3 Concluding remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113A PyClone: Supplemental material . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113A.1 Copy number analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113A.1.1 Normalisation and feature extraction . . . . . . . . . . . . . . . . . . . . . . . . . 113A.1.2 ASCAT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113A.2 Normal mixture data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114A.3 High grade serous ovarian cancer data set . . . . . . . . . . . . . . . . . . . . . . . . . . . 115A.3.1 Single-cell genotyping of frozen high grade serous ovarian cancers . . . . . . . . . 115viiList of Tables2.1 The nine possible joint genotypes and their associated mappings onto biologically inter-pretable marginal genotypes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132.2 Parameters in the JointSNVMix1 and JointSNVMix2 models along with their default values 142.3 Results from synthetic data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244.1 Indices used in equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 604.2 Model variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 604.3 Example parameter settings for γ parameter when using SNV data . . . . . . . . . . . . . . 634.4 Definition of the binary operator ⊕ used to combine SNV genotypes of two cells . . . . . . 654.5 Definition of parameters used for simulation . . . . . . . . . . . . . . . . . . . . . . . . . . 735.1 Source code repositories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97viiiList of Figures2.1 Hypothetical example of the JointSNVMix analysis process . . . . . . . . . . . . . . . . . . 122.2 Probabilistic graphical model representing the JointSNVMix1 . . . . . . . . . . . . . . . . 152.3 Probabilistic graphical model representing the JointSNVMix2 . . . . . . . . . . . . . . . . 162.4 Concordance analysis of the 12 DLBCL data sets . . . . . . . . . . . . . . . . . . . . . . . 253.1 Clonal evolution model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293.2 Workflow for PyClone analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323.3 Probabilistic graphical representation of the PyClone model . . . . . . . . . . . . . . . . . . 333.4 The PyClone model population structure assumptions . . . . . . . . . . . . . . . . . . . . . 343.5 Synthetic data method comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443.6 Synthetic performance varying number of mutations . . . . . . . . . . . . . . . . . . . . . . 463.7 Comparison of clustering performance for the mixture of normal tissues data set . . . . . . . 473.8 Joint analysis of multiple samples from high grade serous ovarian cancer (HGSOC) patient 2 493.9 Joint analysis of multiple samples from high grade serous ovarian cancer (HGSOC) patient 1 504.1 Probabilistic graphical model representing the SCG model . . . . . . . . . . . . . . . . . . 614.2 Plot of normalised difference of lower bound from maximum value attained across restarts . 714.3 Probabilistic graphical model representing the Categorical mixture model . . . . . . . . . . 724.4 Example of one step from the clone phylogeny simulation procedure . . . . . . . . . . . . . 744.5 Distribution of allele frequencies from heterozygous positions in the diploid 184-hTert cellline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 784.6 Distribution of variant allele frequencies for putative heterozygous SNPs from six patientswith childhood acute lymphoblastic leukemia . . . . . . . . . . . . . . . . . . . . . . . . . 79ix4.7 Comparison of clustering performance of method using a binary representation of the data(CMM-2 and SCG-2-P) versus those using a three state representation (CMM-3 and SCG-3-P) and non-model based hierarchical clustering (HC) . . . . . . . . . . . . . . . . . . . . 814.8 Comparison of clustering performance of a standard clustering model (CMM-2) to a phylo-genetically informed model (BitPhylogeny) . . . . . . . . . . . . . . . . . . . . . . . . . . 834.9 Feature allocation performance of doublet aware (D-SCG-3-P) compared to doublet naive(SCG-3-P) models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 854.10 The maximum Hamming distance between a predicted clone and its nearest true clone plot-ted as function of the proportion of doublets . . . . . . . . . . . . . . . . . . . . . . . . . . 864.11 CMM-3 model applied to C-ALL patient 1 data set . . . . . . . . . . . . . . . . . . . . . . 884.12 SCG-3-P model applied to C-ALL patient 1 data set . . . . . . . . . . . . . . . . . . . . . . 894.13 D-SCG-3-P model applied to C-ALL patient 1 data set . . . . . . . . . . . . . . . . . . . . 904.14 Comparison of predicted clonal genotypes for CMM-3, D-SCG-3-P and SCG-3-P modelsapplied to C-ALL patient 1 data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 914.15 D-SCG-3-P model applied to xenograft SA501 . . . . . . . . . . . . . . . . . . . . . . . . 924.16 D-SCG-3-P model applied to HGSOC patient with ERBB2 amplification . . . . . . . . . . 94A.1 Predicted cellular prevalence and clustering estimates from HGSOC . . . . . . . . . . . . . 118A.2 Posterior similarity matrices for high grade serous ovarian cancer case 1 using ASCAT forcopy number prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119A.3 Posterior similarity matrices for high grade serous ovarian cancer case 1 using OncoSNP forcopy number prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120A.4 Posterior similarity matrices for high grade serous ovarian cancer case 1 using PICNIC forcopy number prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121A.5 Posterior similarity matrices for high grade serous ovarian cancer case 2 using ASCAT forcopy number prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122A.6 Posterior similarity matrices for high grade serous ovarian cancer case 2 using OncoSNP forcopy number prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123A.7 Posterior similarity matrices for high grade serous ovarian cancer case 2 using PICNIC forcopy number prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124xAcknowledgementsThis research was supported in part by the CIHR bioinformatics training program and a CIHR FrederickBanting and Charles Best Canada Graduate Scholarship.Thank you to my supervisor Dr. Sohrab Shah. His guidance both academically and professionally hasbeen invaluable. Thank you to Dr. Alexandre Bouchard-Côté for the many engaging discussions that shapedthis work. Thank you to Dr. Samuel Aparicio for his insight into the fundamental biological problems at thecore of this work. I am exceedingly lucky to have had a supervisory committee that has been so thoroughlyengaged in my research.Many thanks to the members of the Aparicio, Huntsman and Shah labs at the BC Cancer ResearchCentre. I would particularly like to thank Jaswinder Khattra, Emma Laks, Tehmina Masud, Adrian Wanand Damian Yap from the Aparicio for their work generating data sets, and helping me interpret the data.Special thanks to my lab mates Andrew McPherson, Gavin Ha, and Fong Chun Chan for many stimulatingdiscussions.xiDedicationTo my family, thank you for your constant encouragement and support. To my partner Ana-Maria, withoutyour patience, reassurance and inspiration I could not have done this.xiiChapter 1IntroductionCancer is a disease caused by cells which have acquired an abnormal phenotype allowing for uncontrolledgrowth, expansion and migration. Underlying this abnormal phenotype are genomic aberrations which havebeen acquired during the evolution of the cancer [Hanahan and Weinberg, 2000]. Historically, the simplestevents to identify in cancers were large scale aberrations such as whole chromosome gains and losses, lossof fragments of chromosomes and chromosome fusions. The advent of high throughput sequencing (HTS)has changed this, making it possible to observe the complete set of genomic aberrations in a tumour byperforming whole genome sequencing (WGS) [Bentley et al., 2008, Mardis et al., 2009]. HTS is nowa standard tool for cancer genome research, with studies leveraging the technology to: map mutationallandscapes and identify frequently mutated genes [Hudson et al., 2010, Weinstein et al., 2013]; track clonalpopulation dynamic across time points [Ding et al., 2012b, Eirew et al., 2014]; determine the distributionof clonal populations across spatially separated samples [Gerlinger et al., 2012, Bashashati et al., 2013];precisely determine clonal population structure at single cell resolution [Navin et al., 2011, Gawad et al.,2014, Wang et al., 2014].Among the first applications of HTS in cancer biology were studies that aimed to catalogue the com-plete set of genomic aberrations in various cancer types using WGS [Hudson et al., 2010, Weinstein et al.,2013]. These studies generated so called mutational landscapes, which provided a genomic explanation thatpartially explained the variable phenotypes of cancer cells observed by pathologists. This has driven thedevelopment of tools to identify genomic aberrations in cancer genomes using HTS data. While the conceptof a mutational landscape is a useful first approximation to characterise cancers, it ignores the reality thathuman cancers are heterogeneous. It has been understood for many years that human cancers are not formedby a single population of genomically identical cells. Rather, the ongoing accrual of aberrations leads to theformation of multiple sub-populations with variable phenotypes [Nowell, 1976]. Recently there has beenrenewed interest in understanding both the evolutionary history of tumours and the diversity of the clonal11.1. Genomic aberrations in cancerpopulations within tumours. The development of targeted therapeutic strategies provides one motivation forthis line of research. These treatments often work well initially, but patients relapse in the vast majorityof cases [Hochhaus et al., 2008, Misale et al., 2012, Prahallad et al., 2012]. It is believed that this occursbecause populations of cells which are resistant to the treatment survive to recreate the tumour [Aparicioand Caldas, 2013]. Increasingly HTS technologies are being employed to study clonal population structureand tumour evolution [Ding et al., 2010, 2012b, Nik-Zainal et al., 2012, Shah et al., 2012]. In particular,targeted ultra deep sequencing has emerged as a mature technology which can be used to accurately quantifythe prevalence of alleles in tumour samples. This information can in turn be used to quantify the prevalenceof clonal populations harbouring the variant alleles [Roth et al., 2014].While targeted deep sequencing can be used to determine the prevalence of clonal populations, it is quitedifficult to use that data to determine the associated genotype of each population. The main impediment isthat traditional methods simultaneously sequence DNA from a large admixture of cells. The isolation ofindividual cells for sequencing provides a direct route to the measurement of clonal genotypes. Recentadvances in automatic isolation of cells and improvements in molecular biology have allowed researchersto sequence targeted events in hundreds of individual cells [Navin et al., 2011, Wang et al., 2014, Baslanet al., 2015]. Despite the obvious benefits, specifically that each measured cell should, theoretically, definea clonal genotype, the data produced from single cell sequencing is imperfect. Thus there is a need forcomputational and statistical methods to leverage this data to infer clonal genotypes.1.1 Genomic aberrations in cancerCancer genomes can acquire a number of aberrations over their evolutionary history. The main class ofaberrations that will be analysed in this dissertation is single nucleotide variants (SNVs). There are severalother types of genomic aberrations that remodel cancer genomes. For the purposes of this dissertation wewill focus on how these aberrations relate to our ability to identify and interpret SNVs.Single nucleotide variants: Single nucleotide variants are simple point mutations which occur when asingle nucleotide is substituted in the genome. SNVs are the most common type of mutation in most cancergenomes with hundreds to hundreds of thousands of mutations per patient identified in recent large scalestudies [Kandoth et al., 2013].21.1. Genomic aberrations in cancerDespite being a relatively simple class of genomic aberrations, SNVs can have a dramatic effect ona cell’s phenotype, in some case being major oncogenic drivers. However, the vast majority of somaticSNVs are likely passenger mutations that achieve a high prevalence in the cancer cell population due toco-occurence with driver aberrations. While this impedes our ability to understand the functional role ofSNVs, it is useful for inferring clonal population structure. In particular this means that a number of pas-senger SNVs may be associated with a clonal expansion. This in turn provides more data points for us toquantify the prevalence of the associated clonal population. In addition, if the driver aberration for a clonalexpansion is not an SNV, we can still track the associated clonal population using the passenger SNVs fromthe expansion. Because of their abundance and variable selective value, SNVs are ideal markers for studyingthe evolution of cancers.Copy number alterations: Copy number alterations (CNAs), or segmental aneuploidies, are large scalegenomic aberrations common in cancer genomes [Beroukhim et al., 2010]. CNAs are defined as the altera-tion of the number of copies of a genomic region, that is, amplification or deletion of genomic regions. Theseregions will often encompass SNVs, thus consideration of CNAs can be important for both identificationand interpretation of SNVs. In chapter 2 we briefly consider how CNAs impact our ability to detect SNVsfrom WGS data. In chapter 3 we show how accounting for coincident CNAs is fundamental to accuratelyinferring the cellular prevalence of mutations using allelic abundance data from HTS.Rearrangements: Rearrangements are formed when non-adjacent regions of the genomes become con-nected in novel chromosomes [Yang et al., 2013]. We refer to the boundary between the two previouslyseparate regions as the rearrangement breakpoint. Rearrangements can cause CNAs if the resulting eventduplicates or deletes genetic material. Because breakpoints are point like events, we can leverage them toidentify the presence of CNAs when it is not possible to measure the actual number of copies of the asso-ciated region. This is useful as it provides a bridge between WGS, which can identify CNAs, and targeteddeep sequencing which is useful for identifying rare alleles and quantifying prevalence, but cannot measureCNAs. In chapter 4 we concurrently measure SNVs and breakpoints associated with CNAs to define clonalgenotypes across multiple features of the genomic landscape at single cell resolution.31.2. Genome sequencing1.2 Genome sequencingDevelopments in high throughput sequencing (HTS) have been a major factor motivating the work in thisdissertation. Genome sequencing was historically dominated by capillary, or Sanger, sequencing. Importantmilestones such as the sequencing of the human genome were accomplished with this technology [Landeret al., 2001, Consortium et al., 2004]. However, Sanger sequencing is too slow, labour intensive and expens-ive to use for routine sequencing of human genomes. Second generation, next generation, or high throughputsequencing devices such as the Illumina GAII significantly reduced the cost and time to sequence genomes[Bentley et al., 2008]. These technologies produce much shorter reads than Sanger sequencing, typically inthe range of 75-300bp. Because of the short length of the reads, genome assembly is not easily accomplisheddue to both high computational cost and difficulties related to repetitive regions in the genome. Instead mostapplications of HTS use a closely related reference genome as a template and align reads against this gen-ome [Li and Durbin, 2010]. Provided the reference genome is sufficiently similar to the genome beingsequenced, many types of genomic variants can be identified. In particular, SNVs can be readily identifiedwith these technologies.A key benefit of HTS technologies is that there is one-to-one correspondence between each read (barringtechnical artefacts) and a DNA fragment in the original tissue sample [Aparicio and Huntsman, 2010]. Thismeans that HTS data can be used to quantify allelic abundance via digital counts of reads. Specifically,we can count the number of reads containing each nucleotide at a given locus. It is typically assumed thatonly two alleles exist at any locus. These alleles are frequently labelled A and B. The reference allele,represented as A, corresponds to the nucleotide present in the reference genome at the locus. The variantallele, represented as B, corresponds to the most prevalent (by read count) non-reference allele. Though itis not difficult to work with the richer four-state representation of counts for each nucleotide, it is typicallyassumed only one variant allele exists at a locus, with additional read counts supporting other alleles beingdue to sequencing error. This assumption can be violated if mutation rates are high or selection is acting, forexample to disable a tumour suppressor gene in cancer. Counts for the A and B allele can be leveraged toidentify loci with variant alleles, and further be used to distinguish between homozygous and heterozygousvariants (chapters 2 and 4). In the context of cancer, the fact the counts are proportional to the populationprevalence of an allele can be used to infer the prevalence of the clonal populations harbouring a mutation(chapter 3).41.2. Genome sequencingSeveral sequencing strategies are used in conjunction with HTS devices to study cancer genomes. Belowwe review three strategies that are used to generate data for the models described in this dissertation.Whole genome sequencing As the name suggests, whole genome sequencing (WGS) attempts to se-quence the entire genome of the cells in a sample. This approach is unbiased, in that no regions of interestare selected a priori for study. This strategy is useful for performing SNV discovery, CNA analysis and re-arrangement detection. Typically cancer genomes are sequenced to a haploid coverage of between 30x-60x,that is, each region of the genome would have on average 30-60 reads covering them if a haploid genomewas sequenced [Shah et al., 2009b, Pleasance et al., 2010, Ding et al., 2010, Yachida et al., 2010, Dinget al., 2012b, Shah et al., 2012]. This is typically enough resolution to identify highly prevalent SNVs, inferCNAs and identify rearrangements. One issue that confounds this design is that the input DNA is derived bysampling a large pool of cells which are lysed, and the resultant admixed DNA pool sequenced. In tumoursamples this means that normal (non-malignant) cells will typically be sequenced along with the cancercells of interest. As a result, many reads are sampled from uninteresting normal cells and are not useful foridentification of somatic aberrations. In addition, heterogeneity of the cancer cell population also createsdifficulties. Rare clonal populations will tend to contribute fewer cells, and hence less DNA to the input.Thus aberrations unique to these populations may be missed during sequencing. Unfortunately this can onlybe addressed by sequencing to a higher haploid depth, or sequencing multiple tissue samples, both of whichincrease cost. Another consequence of the low depth of WGS is that allelic abundance measurements canbe inaccurate estimates of the true allelic population prevalence due to stochastic sampling.Targeted deep sequencing If the quantification of allelic abundance or identification of rare variants is apriority, targeted deep sequencing can be performed [Shah et al., 2009b, Ding et al., 2010, 2012b, Shah et al.,2012]. Like WGS this strategy begins from a large pool of cells which are lysed, then PCR amplificationof small regions of interest are performed to generate DNA for sequencing. This requires that the aberrantloci be known before designing PCR primers, thus it is usually used after WGS has been performed on arelated tissue sample to identify aberrations. The advantage of this approach is that a limited amount ofgenomic real estate needs to be sequenced. As a result very high depths of coverage can be obtained, onthe order of 1000x-1,000,000x. This strategy is useful for profiling point like mutations such as SNVs andrearrangement breakpoints.51.3. Probabilistic modelsSingle cell sequencing Both WGS and targeted deep sequencing begin by admixing the DNA of a largenumber of cells. While this is useful for measuring population-level details of a tissue sample, it poses achallenge for studying sub-populations. Isolation of individual cells for sequencing is a promising alternativewhich provides a much higher level of resolution. Historically, single cell sequencing has been difficultto perform due to difficulties in cell isolation and problems related to sequencing the small amount ofDNA in a cell. The former problem can be addressed by recent developments using flow sorting or micro-fluidics [Navin et al., 2011, Wang et al., 2014, Baslan et al., 2015]. The latter problem can be addressedby amplifying the DNA of the cells, either using whole genome amplification (WGA) or targeted PCRbased strategies. Despite improvements in molecular biology, amplification of limited quantities of DNAstill remains challenging [Zong et al., 2012b, Gawad et al., 2014]. In particular, stochastic amplification ofone allele can mislead measurement at heterozygous loci, while biased amplification in WGA can impedeidentification of CNAs [Ning et al., 2014].1.3 Probabilistic modelsGenome sequencing studies using HTS technologies generate a great deal of data. This data is typicallynoisy with interesting biological signals confounded by both irrelevant biological signals and technical arte-facts. Leveraging HTS to make useful biological observations thus requires systematically removing thesesources of noise. In particular, stochastic under-sampling and heterogeneous mixtures of cells confound bulksequencing. In addition to these challenges, we must also account for stochastic dropout of heterozygousalleles when sequencing single cells. While ad hoc methods based on deterministic filtering or naive ap-plication of standard statistical approaches can be applied to solve some of these problems, such approachestend to be difficult to interpret biologically. In addition, the lack of an underlying probabilistic model of theentire data generation process makes it difficult to quantify uncertainty in parameter estimates. Probabilisticmodelling provides an approach based on defined theoretical principles to deal with these issues. In thisdissertation a common theme is to construct a probabilistic model such that biologically interesting phe-nomena map to model variables, with the observed data modelled by a stochastic process with a distributionconditioned on these variables. Of particular use are latent variable models, where the latent variables alongwith the model parameters represent biological quantities of interest [Bishop, 1998]. With a probabilisticmodel defined we can then draw on standard statistical theory to infer model parameters from observed data.61.4. Research contributionA Bayesian paradigm is used, with model parameters being treated as random variables [Bernardo, 2011].This allows us to develop extensible hierarchical models, and infer posterior distributions for the modelparameters to quantify uncertainty.1.4 Research contributionIn this dissertation, we develop three probabilistic models to identify somatic SNVs, infer clonal populationstructures and precisely define clonal genotypes using HTS generated data from tumour tissue. These modelsand the associated problems they address are briefly outlined below, with a thorough treatment provided inthe research chapters.1.4.1 Identifying somatic nucleotide variants from paired tumour/normal digitalsequencing dataIdentification of somatic SNVs in tumour genomes is a necessary step in defining the mutational landscapesof cancers. Experimental designs for genome-wide ascertainment of somatic mutations now routinely per-form WGS of tumour DNA and matched constitutional DNA from the same individual. This allows investig-ators to control for germline polymorphisms and distinguish somatic mutations that are unique to malignantcells, thus reducing the burden of labour-intensive and expensive downstream experiments needed to verifyinitial predictions. In order to make full use of such paired data sets, computational tools for simultaneousanalysis of tumour–normal paired sequence data are required. In chapter 2 we present a family of probab-ilistic models to identify somatic SNVs from matched tumour/normal experiments. By sharing statisticalstrength between the data from the two tissue samples we are able to more accurately identify germlineSNVs and reduce the number of false positive somatic SNV predictions.1.4.2 Inference of clonal population structure using deep digital sequencingOnce SNVs have been identified they can be used as markers of clonal evolution. The digital measurementof allelic counts using HTS yields precise estimates of the allele prevalence in samples, which can potentiallybe used to track shifts in clonal population structure across time and space [Shah et al., 2009b, Ding et al.,2010, Shah et al., 2012, Ding et al., 2012b, Nik-Zainal et al., 2012]. As discussed in section 1.2 this approachrequires that we start with a large number of cells, or bulk population, which are lysed before sequencing.71.4. Research contributionThis removes information about the cell of origin for the observed reads, which in turns means we cannotdirectly predict the proportion of cancer cells harbouring a mutation (cellular prevalence) from the allelicabundance measurements. As the cell is the unit of evolution, the cellular prevalence of a mutation is therelevant quantity for tracking changes in clonal population structures. To address this problem we present astatistical method in chapter 3 which seeks to convert allelic abundance measurements to cellular prevalenceestimates while identifying groups of mutations which mark clonal populations. This method is able toanalyse multiple samples from the same cancer to track changes in clonal population structure across spaceand time.1.4.3 Identification of clonal genotypes from single cell digital sequencingSingle cell sequencing is emerging as a powerful method to study the clonal population structure of tu-mours. Advances in cell isolation coupled with improvements in the the efficiency of DNA amplificationfrom single cells have made high throughput profiling of single cell genomes a feasible assay for identifyinggenomic clones and studying their evolution. Current protocols for single cell sequencing generate imper-fect data with missing values, highly biased allelic counts at heterozygous loci, and false measurements ofgenotypes due to sequencing multiple cells. Because of these errors, measurements from a single cell donot accurately represent a clonal genotype. In chapter 4 we develop a probabilistic model which attempts tohandle the missing values and biased allelic abundance measurements to yield accurate information aboutclonal genotypes. The method exploits the fact that cancer cells are phylogenetically related in order toshare statistical strength by clustering cells by clonal genotypes. This method is able to: i) predict numberof clonal populations represented in the data; ii) predict the genotype of each clonal population; iii) integ-rate multiple data types; iv) identify noisy measurements resulting from sequencing two cells; v) predict theproportion of each clone in multiple samples with an estimate of uncertainty.8Chapter 2Identifying somatic nucleotide variantsfrom paired tumour/normal digitalsequencing data2.1 Introduction2.1.1 Next generation sequencing of tumour genomesHigh throughput sequencing (HTS) of tumour genomes is now a routine assay in cancer studies. Early stud-ies exploring the mutational landscapes of prostate [Berger et al., 2011], breast [Shah et al., 2009b, Dinget al., 2010], ovarian [Shah et al., 2009a, Wiegand et al., 2010], pancreatic [Campbell et al., 2010, Yachidaet al., 2010], lung [Pleasance et al., 2010], and haematological [Ley et al., 2008, Mardis et al., 2009] malig-nancies revealed new cancer genes, new insights into tumour evolution, comprehensive mutational profilesand exploration of genomic architectures. These studies have established HTS experiments as an extremelyeffective approach to study cancer genomes and perform genome-wide somatic mutation discovery. Thesuccess of these studies has spurred large scale systematic projects to catalogue the mutational landscapesof a number of tumour types [Hudson et al., 2010, Weinstein et al., 2013] by sequencing hundreds of gen-omes. As such there is a major need for cancer-focused methods for robust, comprehensive interpretation ofthis data.The computational challenges in applying HTS to cancer research are similar to mainstream HTS applic-ations such as the 1000 genomes project [Consortium et al., 2010]. One crucial difference is the importanceof distinguishing germline polymorphisms present in healthy tissue from somatically acquired mutationsin tumour cells. This problem can be addressed by an experimental design in which DNA sampled fromhealthy normal tissue and DNA from tumour tissue are sequenced from the same individual. Fully exploiting92.1. Introductionthis experimental design and the resulting correlated nature of the pair of data sets to identify somaticallyacquired SNVs is the focus of this chapter.2.1.2 Methods for discovering SNVs and somatic mutationsAlmost all methods which detect single nucleotide variants (SNVs) from HTS data use a representation ofdigital allelic counts to infer allelic abundance in the sample. For example, a heterozygous germline SNVshould be present in approximately 50% of all aligned reads at that locus. In the cancer setting, allelic countdata is used to distinguish SNVs which are unique to the tumour DNA (somatic mutations) from those SNVswhich are present in the matched normal DNA (germline polymorphisms).Screening the set of predicted SNVs from a tumour tissue sample against databases such as dbSNP[Sherry et al., 2001] provides one method to address this issue. The challenge with this approach is that thereare 3-15 million germline single nucleotide variants per individual. Early results from the 1000 genomesstudy indicate that 10-50% of these are novel events [Consortium et al., 2010]. This suggest that possiblymillions of SNVs in a single individual will be uncatalogued in polymorphism databases. These SNVs willbe falsely identified as somatic mutations if the primary strategy for distinguishing somatic and germlineevents is screening against public databases. In the future as SNV databases become more comprehensivethe fraction of novel SNVs found in an individual will decrease. However, even if databases were to capture99% of all germline SNVs present in an individual and that individual carried 5 million SNVs, 50,000 SNVswould remain uncatalogued. This number is of the same order as the number of somatic SNVs present intumours from many types of cancers [Kandoth et al., 2013]. Hence there is a danger that the somaticmutations signal in a data set could be overwhelmed by the signal from these germline events.A more robust approach to identifying somatic mutations is to sequence a paired sample of DNA fromnormal (healthy) tissue and DNA from tumour tissue from the same patient. The normal tissue can then actas a control against which SNVs detected in the tumour can be screened. When the project presented inthis chapter began, a number of methods for discovering SNVs in HTS data had been developed [Koboldtet al., 2009, Goya et al., 2010, DePristo et al., 2011, McKenna et al., 2010]. However, tools specificallytailored to somatic mutation discovery in normal/tumour pairs did not exist. At the time the main methodfor SNV identification from paired tumour/normal data required using standard SNV discovery tools on thenormal and tumour samples separately and then contrasting the results post-hoc using so-called subtractiveanalysis. Sequencing noise can cause variant alleles in both tumour and normal samples to be observed102.2. Method: JointSNVMixat frequencies that are less than expected, making variant identification difficult. Independent analysis ofsamples combined with post-hoc analysis results in premature thresholding of real signals and, in particular,results in loss of specificity when detecting somatic mutations. We propose that simultaneous analysis oftumour and normal data sets from the same individual will likely result in an increased ability to detectshared signals (arising from germline polymorphisms or technical noise). Moreover, we expect that realsomatic mutations that emit weak observed signals can be more readily detected if there is strong evidenceof a non-variant genotype in the normal sample. Therefore our hypothesis is that joint modelling of atumour-normal pair will result in increased specificity and sensitivity compared to independent analysis.To address this question, we developed a family of probabilistic models called JointSNVMix to jointlyanalyse tumour-normal pair sequence data for cancer studies and a suite of more standard comparison meth-ods based on independent analyses and frequentist statistical approaches. We show that by borrowing stat-istical strength between samples, the JointSNVMix models allows us to better capture shared signal andremove false positive predictions caused by miscalled germline events.2.2 Method: JointSNVMix2.2.1 Problem formulationGiven tumour-normal paired allelic counts obtained from HTS sequence data aligned to the human referencegenome, we focus on the problem of identifying the joint-genotype (see below) of the samples at everylocation in the data with coverage. We assume that each variant loci has only two possible alleles, A and B.The allele A indicates that the nucleotide at a position matches the reference genome and B indicates thatthe nucleotide is a mismatch. Using HTS data we can measure the presence of these alleles using binarycount data. We examine all reads at a given site i and count the number of reads with a nucleotide at theloci that match the reference genome. We designate this value ai. We also compute the number of readswith mis-matches for each non-reference nucleotide. We define bi to be maximum of these values and thenucleotide associated with this value as the variant, alternate or non-reference allele. This formalism canbe extended to tumour-normal paired samples (figure 2.1) by considering allelic counts from each sample.Care must be taken to ensure the same nucleotide is selected as the non-reference allele in both samples.For a diploid genome we consider all pairs of alleles which gives rise to the set, G = {AA,AB,BB},the set of diploid genotypes. Now given two diploid samples the set of possible joint-genotypes consists112.2. Method: JointSNVMixACTCCCGTCGGAACGAATGCCACGACTCCCGTCGGAACCAATGCC----CTCCCGTCGGAACCAATGCCACC---CCCGTCGGAACCAATGCCACG-----CGTCGGAACCAATGCCACG-----CATCGGAACCAATGCCACC------GTCGGAACCAATGCCACG--------------CAATGCCACC--------------------CACC122335566666660777778773122335666666667777778777ACTCCCGTCGGAACCAATGCCACC--TCCCGTCGGAACCAATGCCACC---CCCGTCGGAACCAATGCCACC------GTCGGCACCAATGCCACG--------CGGCACCAATGCCACG----------GCACCAATGCCACG---------------AATGCCACG-------------------CCACG112333445563660777788883112333445566666777788888(BB,BB) (AB,AB)(AA,AB)GermlineSomaticNormalTumourReferenceGenomeAA AB BBAAABBB0.01 0.95 0.000.00 0.04 0.000.00 0.00 0.00Allelic CountsAllelic CountsJoint Genotype ProbabilitiesFigure 2.1: Hypothetical example of the JointSNVMix analysis process. Reads are first aligned to thereference genome (green). Next the allelic counts, which are the number of matches and depth of readsat each position are tabulated. Allelic count information can then be used to identify germline (blue) andsomatic positions (red). At the bottom of the figure we show the hypothetical probabilities of the nine jointgenotypes based on the count data for the somatic position (AA, AB).122.2. Method: JointSNVMixgN \gT AA AB BBAA Wild-type Somatic SomaticAB LOH Germline LOHBB Error Error GermlineTable 2.1: The nine possible joint genotypes and their associated mappings onto biologically interpretablemarginal genotypes: Wildtype (no change: p(AA,AA)), Somatic (wildtype normal and variant tumour:p(AA,AB)+ p(AA,BB)), Germline (variant normal and tumour: p(AB,AB)+ p(BB,BB)), and loss of het-erozygosity (LOH - heterozygous normal and homozygous tumour: p(AB,AA)+ p(AB,BB)). We treat thejoint genotypes (BB,AB) and (BB,AA) as errors since this would imply that a homozygous variant mutatesback to the reference base, which is a possible, but unlikely event. It is more plausible that these cases aresimply errors due to alignment or base calling.of all combinations of diploid genotypes which is equivalent to the Cartesian product of G with itself, i.e.G ×G = {(gN ,gT ) : gN ,gT ∈ G }.We assume the joint genotype of a given position can be mapped onto the more biologically interpretableset of marginal genotypes according to table 2.1. This can be done by assigning the joint genotype to themost probable state, or marginalising together the joint genotype probabilities for a given state. As an ex-ample of marginalisation we compute p(Somatic)= p((AA,AB))+ p((AA,BB)), i.e. the sum of probabilitiesof a wildtype genotype in the normal data and a variant genotype in the tumour data.2.2.2 Model descriptionJointSNVMix1 and JointSNVMix2 are generative probabilistic models which describe the joint emission ofthe allelic count data observed at position i in the normal and tumour samples. The graphical models rep-resenting JointSNVMix1 and JointSNVMix2 are given in figure 2.2 and figure 2.3. A complete descriptionof the notation and model parameters is given in table 2.2.We introduce a random variable Gi as an indicator vector representing the joint genotype of the samples.More explicitly Gi = (Gi(AA,AA),Gi(AA,AB), . . . ,Gi(BB,BB)) where Gi(gN ,gT )= 1 if the joint genotype of position iis (gN ,gT ), and Gi(gN ,gT ) = 0 otherwise. We assume the count data from the two samples are conditionallyindependent given Gi. By conditioning the emission densities of both samples on Gi we can capture correl-ations between the observed count data in each sample, allowing statistical strength to be borrowed acrossthe samples. This is the key insight which differentiates this model from running an independent analysis ofeach sample and joining the inferred genotypes post-hoc.Given the joint genotype of the sample we model the normal and tumour sample as being conditionally132.2. Method: JointSNVMixParameter Description Valueδ Pseudo counts in Dirichletprior on pigN\gT AA AB BBAA 1e5 1e2 1e2AB 1e2 1e3 1e2BB 1e1 1e1 1e3pi Multinomial distribution overjoint genotypesEstimated by EM (M-step)Gi Genotype at position i Estimated by EM (E-step)aix Number of bases matchingthe reference genome atposition i in genomex ∈ {N,T}Observed (JointSNVMix1 only)aix: jx Indicator that base jx atposition i matches referencein genome x ∈ {N,T}Latent (JointSNVMix2 only)zix: jx Indicator that base jx atposition i is correctly alignedx ∈ {N,T}Latent (JointSNVMix2 only)dix Depth of coverage at positioni in genome x ∈ {N,T}Observedqix: jx Probability that base call iscorrect in genome x ∈ {N,T}Observed (JointSNVMix2 only)rix: jx Probability that alignment iscorrect in genome x ∈ {N,T}Observed (JointSNVMix2 only)µx:gx Parameter of Binomialdistribution for genotype gxin genome x ∈ {N,T}Estimated by EM (M-step)αx:gx α parameter in Beta priordistribution on µx:gxAA AB BBNormal 1000 500 2Tumour 1000 500 2βx:gx β parameter in Beta priordistribution on µx:gxAA AB BBNormal 2 500 1000Tumour 2 500 1000Table 2.2: Parameters in the JointSNVMix1 and JointSNVMix2 models along with their default values.Parameters in the model are learned using the EM algorithm, while hyper-parameters are fixed to the valuegiven.142.2. Method: JointSNVMixδpiGiaiNdiNaiTdiTµN µTαN βN αT βTi ∈ {1, . . . , I}pi|δ ∼ Dirichlet(pi|δ )Gi|pi ∼ Multinomial(Gi|pi)µN:gN |αN:gN ,βN:gN ∼ Beta(µN:gN |αN:gN ,βN:gN )µT :gT |αT :gT ,βT :gT ∼ Beta(µT :gT |αT :gT ,βT :gT )aiN |Gi(gN ,gT ) = 1,µN ,diN ∼ Binomial(aiN |diN ,µN:gN )aiT |Gi(gN ,gT ) = 1,µT ,diT ∼ Binomial(aiT |diT ,µT :gN )Figure 2.2: Probabilistic graphical model representing the JointSNVMix1. Shaded nodes represent observedvalues or fixed values, while the values of un-shaded nodes are learned using EM. Description of all randomvariables is given in table 2.2.152.2. Method: JointSNVMixδpiGiaiN :jNziN :jNqiN :jNriN :jNaiT :jT ziT :jTqiT :jT riT :jTµN µTαN βN αT βTi ∈ {1, . . . , I}jN ∈ {1, . . . , diN} jT ∈ {1, . . . , diT }pi|δ ∼ Dirichlet(pi|δ )Gi|pi ∼ Multinomial(Gi|pi)µN:gN |αN:gN ,βN:gN ∼ Beta(µN:gN |αN:gN ,βN:gN )µT :gT |αT :gT ,βT :gT ∼ Beta(µT :gT |αT :gT ,βT :gT )ziN: jN ∼ Bernoulli(ziN: jN |0.5)ziT : jT ∼ Bernoulli(ziT : jT |0.5)qiN: jN |aiN: jN ,ziN: jN ∼ f (qiN: jN |aiN: jN ,ziN: jN )qiT : jT |aiT : jT ,ziT : jT ∼ f (qiT : jT |aiT : jT ,ziT : jT )riN: jN |ziN: jN ∼ g(riN: jN |ziN: jN )riT : jT |ziT : jT ∼ g(riT : jT |ziT : jT )aiN |Gi(gN ,gT ) = 1,µN ,diN ∼ Binomial(aiN |diN ,µN:gN )aiT |Gi(gN ,gT ) = 1,µT ,diT ∼ Binomial(aiT |diT ,µT :gT )Figure 2.3: Probabilistic graphical model representing the JointSNVMix2. Shaded nodes represent ob-served values or fixed values, while the values of un-shaded nodes are learned using EM. We have definedf (q|a,z) = z[qa+(1− q)(1− a)]+ 0.5(1− z) and g(r|z) = zr+(1− z)(1− r). Description of all randomvariables is given in table 2.2.162.2. Method: JointSNVMixindependent. For JointSNVMix1 the conditional distribution for each sample is modelled as a three com-ponent mixture of Binomial densities, where the densities correspond to the genotypes AA, AB and BB.These conditional densities are the same as used by SNVMix1 model [Goya et al., 2010]. For JointSNV-Mix2 the conditional densities are the same as SNVMix2 which allows for the incorporation of base andmapping quality information [Goya et al., 2010].2.2.3 InferenceWe use the expectation maximisation (EM) algorithm [Bishop et al., 2006, McLachlan and Krishnan, 2007]to perform maximum a posteriori (MAP) estimation of the values of the model parameters and latent vari-ables. One could hand-set parameters of the model to intuitive values. However, we expect that fittingthe model will allow for sample-specific adjustments to inter-experimental technical variability and inter-sample variation from tumour-normal admixture (so called tumour cellularity) in the tumour samples. TheEM update equations for JointSNVMix1 and JointSNVMix2 are given in the following sections.Storing the posterior marginals generated in the E-step requires O(n) memory, thus for a large data settraining may exhaust available memory. To circumvent this problem, we sub-sample every kth position withcoverage exceeding a specified threshold in both the tumour and normal. For the results presented below wesub-sampled every 100th position with a coverage of at least 10 reads. Lower values of k and hence largersub-sample sizes will lead to improved parameter estimates at the cost of using more memory and CPUtime.JointSNVMix1The following steps constitute one pass through the EM algorithm. These steps should be repeated until thethe (unnormalised) log posterior no longer changes.1. E-Stepξ i(gN ,gT ) =pi(gN ,gT )×{µaiNN:gN (1−µN:gN )diN−aiN}×{µaiTT :gT (1−µT :gT )diT−aiT }∑gN∈G ∑gT∈G pi(gN ,gT )×{µaiNN:gN (1−µN:gN )diN−aiN}×{µaiTT :gT (1−µT :gT )diT−aiT }172.2. Method: JointSNVMix2. Compute the marginal responsibilitiesτ iN:gN = ∑gT∈Gξ i(gN ,gT )τ iT :gT = ∑gN∈Gξ i(gN ,gT )3. Compute expected sufficient statisticsn(gN ,gT ) =I∑i=1ξ i(gN ,gT )aN:gN =I∑i=1τ iN:gN aiNaT :gT =I∑i=1τ iT :gT aiTdN:gN =I∑i=1τ iN:gN diNdT :gT =I∑i=1τ iT :gT diT4. M-Steppi(gN ,gT ) =n(gN ,gT )+δ(gN ,gT )−1∑gN∈G ∑gT∈G {n(gN ,gT )+δ(gN ,gT )−1}µN:gN =aN:gN +αN:gN −1dN:gN +αN:gN +βN:gN −2µT :gT =aT :gT +αT :gT −1dT :gT +αT :gT +βT :gT −2Convergence can be monitored by checking when the unnormalised log posterior given below stopschanging.` =I∑i=1log(∑gN∈G∑gT∈Gpi(gN ,gT )Binomial(aiN |diN ,µN:gN )Binomial(aiT |diT ,µT :gT ))+∑gN∈G∑gT∈G(δ(gN ,gT )−1) logpi(gN ,gT )+∑gN∈G{(αN:gN −1) logµN:gN +(βN:gN −1) log(1−µN:gN )}+∑gT∈G{(αT :gT −1) logµT :gT +(βT :gT −1) log(1−µT :gT )}182.2. Method: JointSNVMixJointSNVMix2Definef (q,r,µ) = 0.5(1− r)+ r[(1−q)(1−µ)+qµ]g(q,r,µ,d) =d∏j=1f (q j,r j,µ)The following steps constitute one pass through the EM algorithm. These steps should be repeated until thethe (unnormalised) log posterior no longer changes.1. E-Step : Compute responsibilities and expected counts for each read.ξ i(gN ,gT ) =pi(gN ,gT )g(qiN ,riN ,µN:gN )g(qiT ,riT ,µT :gT )∑gN∈G ∑gT∈G pi(gN ,gT )g(qiN ,riN ,µN:gN )g(qiT ,riT ,µT :gT )a¯ix: jx =0.5(1− rix: jx)+ rix: jxqix: jxµx:gxf (qix: jx ,rix: jx ,µx:gx)ξ i(gN ,gT )b¯ix: jx =0.5(1− rix: jx)(1−µx:gx)+ rix: jx(1−qix: jx)(1−µx:gx)f (qix: jx ,rix: jx ,µx:gx)ξ i(gN ,gT )2. Compute expected sufficient statisticsn(gN ,gT ) =I∑i=1ξ i(gN ,gT )aN:gN =I∑i=1diN∑jN=1∑gT∈Ga¯iN: jNaT :gT =I∑i=1diT∑jT=1∑gN∈Ga¯iT : jTbN:gN =I∑i=1diN∑jN=1∑gT∈Gb¯iN: jNbT :gT =I∑i=1diT∑jT=1∑gN∈Gb¯iT : jTdN:gN = aN:gN +bN:gNdT :gT = aT :gT +bT :gT192.2. Method: JointSNVMix3. M-Steppi(gN ,gT ) =n(gN ,gT )+δ(gN ,gT )−1∑gN∈G ∑gT∈G {n(gN ,gT )+δ(gN ,gT )−1}µN:gN =aN:gN +αN:gN −1dN:gN +αN:gN +βN:gN −2µT :gT =aT :gT +αT :gT −1dT :gT +αT :gT +βT :gT −2Convergence can be monitored by checking when the unnormalised log posterior given below stopschanging.` =I∑i=1{∑gN∈G∑gT∈Gpi(gN ,gT )g(qiN ,riN ,µN:gN ,diN)g(qiT ,riT ,µT :gT ,diT )}∑gN∈G∑gT∈G(δ(gN ,gT )−1) logpi(gN ,gT )+∑gN∈G{(αN:gN −1) logµN:gN +(βN:gN −1) log(1−µN:gN )}+∑gT∈G{(αT :gT −1) logµT :gT +(βT :gT −1) log(1−µT :gT )}2.2.4 ImplementationThe JointSNVMix software package is implemented using the Python programming language with com-putationally intensive portions written in Cython. The input is aligned sequence data in base space storedin the SAM/BAM format [Li et al., 2009] from a tumour-normal pair. The program implements two maincommands, train and classify. A typical analysis consists of training to learn the model parameters. Thesecan then be used with the classify command. Doing both steps of the analysis using coding space positionsfrom a 30x genome takes about 6 hours on a four core Intel i7 processor running at 1.73GHz with 8GBmemory.2.2.5 Alternative methodsIn order to evaluate the effect of modelling the joint distribution of the tumour and normal data, we comparedthe JointSNVMix1 and JointSNVMix2 models to their independent analogues, SNVMix1 and SNVMix2.We re-implemented the SNVMix models in order to compare classifier performance without introducingvariation due to implementation. We also considered independent and joint methods for classification which202.2. Method: JointSNVMixuse Fisher’s exact test. We include these two methods to verify the performance difference we observe inthe SNVMix models are due to joint analysis.Independent Fisher This method uses a right tailed Fisher exact test in order to test the null hypothesisthat the number of variant bases observed is due to random error. If the null hypothesis is not rejectedat p-value 0.05, the site is assigned the genotype AA. Otherwise, a site is assigned a genotype AB if thefrequency of the B-allele is between 0.2 and 0.8, and a genotype of BB if this frequency is above 0.8Joint Fisher This method first applies the Fisher method to call the genotypes in normal and tumourseparately. Putative somatic sites are identified and a two tailed Fisher exact test is run to test the nullhypothesis that the normal and tumour count data do not differ significantly. If the null hypothesis is notrejected at p-value 0.05 the putative somatic site is reassigned the reference genotype (AA,AA).SNVMix1 We re-implemented the SNVMix1 model used in the published SNVMix software [Goya et al.,2010]. To assign joint genotype probabilities we take the genotype probabilities from the normal and tumoursamples and multiply them to obtain the joint genotype probabilities.SNVMix2 We re-implemented the SNVMix2 model used in the published SNVMix software [Goya et al.,2010]. Joint genotype probabilities are computed in the same was as the SNVMix1 model above.2.2.6 Data setsSynthetic We generated the synthetic data by sampling 106 sites from the JointSNVMix1 model in fig-ure 2.2. We used the following parameters:µN ,µT = (0.999,0.6,0.001)pi ∝normal\tumour AA AB BBAA 106 102 102AB 102 104 102BB 1 1 104212.2. Method: JointSNVMixpi is normalised so that the sum over all entries equals 1. The depths diN ,diT were sampled for a Poissondistribution with parameter λ = 10. We set parameters for the AB class in the vectors µN ,µT to 0.6 whichis slightly skewed towards the reference allele. We did this to ensure the parameters would be different thanthe hand set defaults used in the untrained version of JointSNVMix1 and SNVMix1. For the SNVMix1 andJointSNVMix1 models we used a threshold p(Somatic) >= 0.5 to classify a position as a somatic variant.We did not benchmark SNVMix2 and JointSNVMix2 because our simulation technique did not generatebase and mapping quality scores.Diffuse large B-cell lymphoma We analysed matched tumour/normal data from Morin et al. [2011](dbGAP study accession phs000235.v2.p1). Exome data for patients A and B was captured using Agi-lent SureSelect, and subsequently sequenced on the Illumina GA II platform and aligned with BWA. Forpatients C-M the data was generated by whole genome shotgun sequencing (WGSS) and was run on theIllumina Hiseq2000 platform and aligned with BWA as described in Morin et al. [2011].Ground truth data was predicted from the primary tumour genome and RNA-Seq data with the SNVMixsoftware package followed by manual curation to remove artefacts. Support on both strands was requiredand variants near gapped alignments disregarded. Two or more high quality bases matching SNV were re-quired. Finally, putative variants were visually inspected in IGV. Validation of the non-synonymous curatedpredictions by targeted Sanger sequencing of the normal and tumour samples was performed to establish thetrue somatic mutations. There were 312 unique positions validated as somatic mutations across all patientsin this study. Complete details are available in Morin et al. [2011].In the analysis presented below, only coding space positions were analysed. For SNVMix2 and JointS-NVMix2 no pre-processing was performed. For the other models we removed bases with base or mappingqualities less than 10.2.2.7 Performance metricsBecause exhaustive validation of both somatic and non-somatic events was not available for the data setsused, we measured the concordance of the somatic predictions with databases known to be enriched forsomatic or germline mutations. In theory, classifiers which predict more true somatic mutations shouldshow higher concordance with the database of somatic mutations and lower concordance with the germlinedatabase.222.3. ResultsTo generate a database of somatic mutations we took the complete set of 312 unique somatic mutationsvalidated across the cases and joined them with COSMIC v54 database [Forbes et al., 2010]. We used theSNP predictions from the 2010/11/23 release of the 1000 genomes projects [Consortium et al., 2010], withthe positions found in COSMIC v54 removed, as the database of likely germline variants.For the SNVMix and JointSNVMix methods, which assign probabilities to their predictions, we plotcurves based on the rank ordering of somatic predictions. We do this by computing concordance as theprobability threshold for classification is lowered. For the Fisher methods, which do not assign scores totheir predictions, we plot a point in space for the complete set of predictions.To estimate the recall rate of the JointSNVMix models, we computed how many of the validated muta-tions for a case were found by the models using a threshold of p(Somatic)>= 0.5. There were 307 validatedpositions across the 12 cases used for this analysis. This number differs from the 312 unique positions be-cause some mutations are found across multiples cases, and some cases from the original study were notincluded in our analysis.We note that copy number variation (CNV) due to segmental aneuploidies and abnormal karyotypescould affect predictions. To assess the effect of CNV on prediction accuracy, we repeated the recall ex-periment using positions found in regions of predicted CNVs. CNVs were predicted using HMMCopy[Lai et al., 2012], available from http://www.bioconductor.org/packages/release/bioc/html/HMMcopy.html. This tool requires whole genome data, so patients A and B had to be excludedfrom this analysis. When we excluded these cases there were 187 validated somatic positions, 44 (23.5%)of which were found in regions of CNV.2.3 Results2.3.1 Joint modelling shows increased ability to detect shared signals on simulated dataWe summarise the results from our synthetic experiment in table 2.3. The F-measure and Matthews correl-ation coefficient (MCC) for the trained JointSNVMix1 model is the highest among all models, reflecting agood trade-off between sensitivity and specificity. The trained SNVMix1 model had the most true positivesof all models, however the F-measure and MCC were lower than JointSNVMix1. This is due to the highnumber of false positives (823) associated with the method, the bulk of which are false positive germlineevents (743).232.3. ResultsCaller TP FP TN FN F-M MCC FP-G FP-WJointSNVMix1(Trained)140 13 999788 59 0.795 0.802 8 2JointSNVMix1 153 50 999751 46 0.761 0.761 42 0SNVMix1(Trained)190 823 998978 9 0.314 0.423 743 70Joint Fisher 159 1155 998646 40 0.21 0.311 1109 0SNVMix1 178 1653 998148 21 0.175 0.295 1632 0IndependentFisher159 2538 997263 40 0.11 0.217 2464 0Table 2.3: Results from synthetic data. We report the number of true positives (TP), false positives (FP),true negatives (TN), false negatives (FN), F-measure (F-M), Matthews correlation coefficients (MCC), falsepositives which are germline (FP-G), and false positives which are wildtype (FP-W). The best results foreach category are shown in bold.There is an bias associated with simulating from the JointSNVMix1 model to generate the synthetic data.We present this data to emphasise the relatively high false positive rate that can be expected from post-hocmethods which treat the data as being independently sampled. When comparing number of false positivesfor JointSNVMix1 and SNVMix1 we observed an 80 fold reduction with the joint modelling approach. Thistrend is supported by the Fisher models where see a 2 fold reduction in false positives when using the jointapproach.2.3.2 Joint modelling increases enrichment of true somatic SNVs in high rankingpredictionsIn figure 2.4 we show the aggregated concordance analysis results for the 12 DLBCL cases. The circles atthe start of the lines for the JointSNVMix and SNVMix models represents the set of predictions for whichp(Somatic) = 1. The JointSNVMix models show higher somatic concordance (left) in the top mutationspredicted, while showing lower germline concordance (right) then their SNVMix analogues. This suggeststhat by distinguishing and removing false positive germline events, JointSNVMix enriches the top rankedsomatic predictions for true somatic mutations.Comparing the Fisher methods, we see that the joint model significantly improves performance. Theindependent Fisher method produces an unrealistically large number of predictions, but the joint Fisherapproach seems much better. A major limitation of the Fisher methods is the lack of ranking information.As a result we can only compare these methods to JointSNVMix and SNVMix models at a single point. At242.3. ResultsFigure 2.4: Concordance analysis of the 12 DLBCL data sets. The Somatic column represents concordancewith the merged COSMIC and ground truth set. The Germline column represents concordance with the 1000Genomes positions with the cosmic positions removed. The horizontal axis shows the number of somaticpredictions made and the vertical axes shows the fraction of those predictions found to be in the respectiveset. Lines are drawn by computing concordance as the threshold for classification is lowered. Lines startaway from the left side because multiple positions may have p(Somatic) = 1. Circles at the start of linesindicate this positions, these points are also labelled with the number of somatic predictions (in 1000’s) andconcordance.this single point, the Fisher methods have similar somatic concordance, but higher germline concordance.This suggests that the sensitivity of the Fisher methods is similar to the SNVMix family of models, but thespecificity is lower.JointSNVMix2 has a recall rate of 0.935 (287/307) on the validated mutations, which is higher thanJointSNVMix1’s rate of 0.915 (281/307). JointSNVMix2 shows similar or higher somatic and germlineconcordance than JointSNVMix1 model. This suggests that JointSNVMix2 is erring on the side of recallversus specificity when compared to JointSNVMix1, though the performance is not dramatically different.One benefit of the JointSNVMix2 model as previously discussed in Goya et al. [2010], is that it frees theuser from setting arbitrary thresholds on base and mapping quality.The recall rates for both JointSNVMix1 and 2 was 0.864 (38/44) in regions of CNV, while in copy neutralregions recall rates were 0.916 (131/143) and 0.923 (132/143). This suggests that CNV slightly degradesthe performance of both methods, although neither method showed statistically worse performance (Fisherexact test, p=0.3787, p=0.2383 respectively.)In total, there were 2496 somatic variants called by JointSNVMix2 at the highest stringency (p=1)of which 559 were non-synonymous variants not present in 1000 genomes polymorphism database. As252.4. Discussiondiscussed earlier, Morin et al. [2011] used stringent criteria, manual curation and validation to establishthe 307 true mutations. We did not validate the non-intersecting predictions in this contribution, leaving thepossibility of false positive events due to technical artefacts. We suggest a robust solution to mitigate againsttechnical artefacts in the discussion section §2.4.2.4 DiscussionIn this chapter, we examined the problem of simultaneous analysis of tumour-normal pair HTS data for thepurpose of identifying somatic point mutations. We developed a probabilistic framework called JointSNV-Mix to allow us to benchmark a model that can borrow statistical strength between samples against standardindependent analysis. We showed that joint modelling of genotypes confers an increased specificity oversimpler or independent analysis (section 2.3.2). Interestingly, the frequentist statistics based method - jointFisher method that considers both data sets simultaneously shows an increased specificity over its independ-ent analogue, albeit considerably lower than what was achieved for JointSNVMix.2.4.1 Limitations and extensionsData pre-processing In our study, we assume that the input data is aligned correctly and focus specificallyon the problem of identifying somatic mutations from allelic counts extracted from perfectly aligned data.The scope of this study is thus restricted to examining the effects of model-based classifiers for identifyingsomatic mutations from count based data. The alignment and pre-processing steps leading to the generationof the count data are expected to impact the performance of the classifiers we considered. The simpleapproach we used of filtering (JointSNVMix1) or modelling (JointSNVMix2) mapping and base qualitiesis likely sub-optimal since they both do not consider technical artefacts such as strand bias. However, allclassifiers are presented with the same data, and all will likely benefit to the same degree from improvedpre-processing. As our software makes use of BAM files as inputs it is agnostic to any upstream processingin the generation of these files, and as such should be compatible with more sophisticated pre-processingstrategies. In our experience, post-processing JointSNVMix predictions with mutationSeq [Ding et al.,2012a] removes the majority of technical artefacts.262.4. DiscussionModel Extensions The JointSNVMix models make several simplifying assumptions that may negativelyaffect performance. In future work it would be interesting to extend the framework presented here to modeltumour-normal admixture, cellular prevalence, coincident copy number status and nucleotide identity. Wenote in section 2.3.2 that although approximately 25% of mutations fell in regions of CNV, sensitivity wasnot significantly affected when copy neutral mutations and mutations in regions of CNV were compared.The effect of substantially altered karyotypes (typically exhibited by epithelial malignancies) on the abilityto detect somatic mutations remains an open question. However, we expect that explicit modelling of the ge-nomic complexities of cancer such as CNVs and sub-clonal populations would lead to enhanced interpretab-ility of somatic mutation prediction and enhanced performance. In addition, the independent and identicallydistributed nature of the model easily allows for location-specific prior knowledge of germline polymorph-isms or somatic mutation hot spots to be incorporated into the joint genotype prior. As population-leveldatabases mature, this extension to the model should increase the performance of somatic mutation predic-tions.2.4.2 ConclusionsIn this paper we formulated the joint genotype problem for somatic mutation discovery from a tumour-normal pair of HTS data sets. We have developed a novel statistical model JointSNVMix which explicitlymodels the process that jointly generates a pair of samples. The joint modelling approach employed byJointSNVMix allows us to reduce the number of germline events falsely predicted to be somatic. Thisincreased specificity comes with little to no decrease in sensitivity.We have provided a complete software package for detecting somatic mutations in paired sequencedata. This package includes the JointSNVMix1/2 classifiers, as well as the other four methods presentedin the paper. The Python implementation of our software is available under an open source license fromhttps://github.com/aroth85/joint-snv-mix.2.4.3 PostscriptSince the publication of the method developed in this chapter [Roth et al., 2012] a number of tools whichsolve the same problem have been published [DePristo et al., 2011, Larson et al., 2012, Koboldt et al., 2012,Ding et al., 2012a, Cibulskis et al., 2013, Shiraishi et al., 2013, Kim et al., 2013]. Many of these methods272.4. Discussionhave focused on reducing false positive errors due to systematic sequencing errors. Most methods employextensive heuristic filtering to accomplish this task [Koboldt et al., 2012, Saunders et al., 2012, Cibulskiset al., 2013]. Other methods have used more rigorous approaches, training discriminative classifiers, whichcan make use of many more features than the allelic count data presented to the JointSNVMix model [Dinget al., 2012a]. These approaches have demonstrated significant reductions in false positive rates, with littleto no loss in sensitivity. One potential issue for these methods is generalisation error as they are, eitherimplicitly or explicitly, trained to work with specific technologies and potentially cancer sub-types.An interesting developments in the field has been the emergence of methods to leverage multiplysampled tumour data sets [Josephidou et al., 2015]. This is a natural extension to the problem of callingmutations in tumour/normal paired samples. The key benefit of this approach is that it will likely be moresensitive to variants present in rare clones common to multiple samples. Of course, the issue of systematicerrors will still afflict model based approaches similar to JointSNVMix.Perhaps the most important development in the field has been the organisation of competitions to sys-tematically compare somatic variant callers [Boutros et al., 2014]. This provides a set of ground truth datawhich can be used to compare existing and new methods. This was desperately needed, as the performanceassessment provided for different methods was inconsistent, creating difficulty in making informed choicesabout which method(s) to apply.28Chapter 3Inference of clonal population structureusing deep digital sequencing3.1 IntroductionHuman cancer progresses under Darwinian evolution where (epi)genetic variation alters molecular pheno-types in individual cells [Nowell, 1976]. Consequently, tumours at diagnosis often consist of multiple, gen-otypically distinct cell populations (figure 3.1) [Aparicio and Caldas, 2013]. These populations, referred toas clones, are related through a phylogeny and act as substrates for selection in tumour micro-environmentsor with therapeutic intervention [Greaves and Maley, 2012, Aparicio and Caldas, 2013]. The prevalenceof a particular clone measured over time or in anatomic space is a reflection of its growth and proliferativefitness. Thus, ascertaining the dynamic prevalence of clones can identify precise genetic determinants ofphenotypes such as acquisition of metastatic potential or chemotherapeutic resistance.In this chapter, we provide a probabilistic model for analysis of deeply sequenced (coverage ≥ 1000×)CellularPrevalenceMutation00.20.40.60.8a) b)1.0Figure 3.1: Clonal evolution model | (a) A hypothetical phylogenetic tree generated by clonal expansionvia the accumulation of mutations (stars). Unlike traditional phylogenetic trees internal nodes (clones) inthe tree may contribute to the observed data, not just the leaf nodes. (b) Hypothetical observed cellularprevalences for the mutations in tree. Mutations occurring higher up the tree always have a greater cellularprevalence than their descendants (the same statement need not be true about variant allelic prevalencebecause of the effect of genotype). Note that the green, blue and purple mutations occur at the same cellularprevalence because they always co-occur in the clones of the tree.293.1. Introductionmutations to identify and quantify clonal populations in tumours, which extends to modelling mutationsmeasured in multiple samples from the same patient. Our approach uses the measurement of allelic abund-ance to estimate the proportion of tumour cells harbouring a mutation (referred to herein as the cellularprevalence). Due to the cell lysis involved in the preparation of bulk samples for sequencing, we cannotdetermine the complete set of genomic aberrations defining a clonal population. However, assuming clonalpopulations follow a perfect phylogeny (no site can mutate more than once in the tree) we can identify andquantify the prevalence of clonal populations. This assumption implies that clusters of mutations occur-ring at the same point in the clonal phylogeny are present at shared cellular prevalences. Thus, clusters ofmutations can be used as markers of clonal populations.Despite progress in measuring allele prevalence with deep sequencing [Carter et al., 2012, Ding et al.,2012b, Govindan et al., 2012, Nik-Zainal et al., 2012, Shah et al., 2012], statistical approaches to cluster deepdigital sequencing of mutations into biologically relevant groupings remain underdeveloped, with poorlyunderstood analytical assumptions. The allelic abundance of a mutation is a compound measure of severalfactors: the proportion of contaminating normal cells, the proportion of tumour cells harbouring the muta-tion and the number of allelic copies of the mutation in each cell, plus uncharacterized sources of technicalnoise. Consequently, allelic abundance does not straightforwardly relate to cellular prevalence. This is par-ticularly exacerbated when only single sample allelic abundance measurements are taken. Multiple samplemeasurements (taken in time [Shah et al., 2009b, Ding et al., 2012b, Eirew et al., 2014] or space [Gerlingeret al., 2012, Bashashati et al., 2013, de Bruin et al., 2014, Hong et al., 2015]) have two distinct advantages:reduction of noise due to repeated measures and the potential to identify sets of mutations whose cellu-lar prevalences shift together. The latter is a route to precisely identifying clones whose prevalences arechanging under selective pressures.To systematically address these deficiencies, we developed the PyClone model: a novel, hierarchicalBayes statistical model (figure 3.3, figure 3.4). The inputs to the model are a set of deeply sequenced muta-tions from one or more samples derived from the same cancer, and a measure of allele-specific copy numberat each mutation locus in each sample (figure 3.2). The PyClone software outputs posterior densities formodel parameters including the cellular prevalence φ n for each mutation n in the input (figure A.1) and theclustering structure over the mutations (figure A.2 - figure A.7). The posterior densities of these quantit-ies are then post-processed to give interpretable point estimates of the cellular prevalences and mutationalclustering.303.2. Method: PyClone3.2 Method: PyClone3.2.1 Model overviewThe PyClone model framework overcomes the challenges outlined in the introduction through four novelmodelling advances. First, it uses Beta-Binomial emission densities which more effectively models data setswith more variance in allelic abundance measurements relative to a Binomial model. Second, flexible priorsover mutational genotypes are used, reflecting how allelic abundance measurements are deterministicallylinked to zygosity and coincident copy number variation events. Third, Bayesian non-parametric clusteringis used to simultaneously discover groupings of mutations and the number of groups. This obviates fixingthe number of groups a priori, and allows for cellular prevalence estimates to reflect uncertainty in thisparameter. Fourth, multiple samples from the same cancer may be analysed jointly to leverage the scenariowhere clonal populations are shared across samples.3.2.2 Model descriptionPyClone is a software package which provides tools for performing Dirichlet Process (DP) clustering ofmutations. It implements several standard clustering algorithms such as the infinite Binomial mixture model(IBMM) as well as several models which account for the genotype of a mutation. In the following descriptionwe will use PyClone to refer to the collection of genotype aware clustering models.The PyClone model is a hierarchical Bayes statistical model (figure 3.3). Input data consists of alleliccounts from a set of N deeply sequenced mutations for a given sample. Prior information is elicited fromcopy number estimates obtained from either genotyping arrays or whole genome sequencing. For mostavailable tools, these estimates will represent the average copy number of a locus if the copy number of thepopulation is heterogeneous at the locus. An optional estimate of tumour content, derived from computa-tional methods or pathologists estimates, may also be used. The model outputs a posterior density for eachmutation’s cellular prevalence and a matrix containing the probability any two mutations occur in the samecluster. The model assigns two mutations to the same cluster if they occur at the same cellular prevalence inthe sample(s). This is a necessary but not a sufficient condition for mutations to be present in the same clonalpopulation. To obtain a flat clustering of the mutations from the matrix of pairwise probabilities we constructa dendrogram and find the cut point that optimises the MPEAR criterion [Fritsch and Ickstadt, 2009] whichis discussed in section 3.2.11. Figure 3.2 shows a typical experimental workflow used to produce the allelic313.2. Method: PyCloneTumour tissue NormaltissueTumour library NormallibraryTumoursequenceNormalsequence1. Tumour banking/biorepository2. DNA extraction/Library construction3. Whole genome (exome) sequencing4. Sequence analysis5. Deep digital sequencing6. PyClone statistical modelingCopy number analysisSomatic mutation detectionAllelic countsCellular prevalence and mutational clusteringFigure 3.2: Workflow for PyClone analysis | The sample is first assayed using whole genome sequencing(WGS) or exome capture sequencing to identify putative mutations. Copy number information, which isused to inform the PyClone model priors, can be derived from either sequence or array data. Putative muta-tions are subjected to targeted deep sequencing using either custom capture array or targeted PCR amplific-ation. The input for the PyClone model is the allelic abundance measurements for the validated mutationsfrom the targeted deep sequencing experiment and the prior information elicited from the copy number pro-filing. Additionally an estimate of tumour content derived from analysis of the array data, sequencing data,or from pathologist estimates can be supplied.323.2. Method: PyCloneaα, bααHH0φnmbnmdnmtmψnm pinmsas, bsn ∈ {1, . . . , N}m ∈ {1, . . . ,M}Figure 3.3: Probabilistic graphical representation of the PyClone model | The model assumes the observedcount data for the nth mutation is dependent on the cellular prevalence of the mutation as well as the state ofthe normal, reference and variant populations. The cellular prevalence of mutation n across the M samples,φ n, is drawn from a Dirichlet Process (DP) prior to allow mutations to cluster and the number of clustersto be inferred. For brevity we show the multi-sample version of the PyClone model which generalises thesingle sample case (M = 1). We also show the model with either the Binomial or Beta Binomial emissiondensities. For all analyses conducted in this paper we set vague priors of aα = 1,bα = 10−3 for the DPconcentration parameter α and as = 1,bs = 10−4 for the Beta Binomial precision parameter s. The Gammadistributions are parametrised in terms of the shape, a, and rate, b, parameters.333.2. Method: PyCloneVariantPopulationNormal PopulationReference PopulationChromosomeMutationFigure 3.4: The PyClone model population structure assumptions | Simplified structure of a sample submit-ted for sequencing. Here we consider the sample with respect to a single mutation (stars). With respect tothis mutation we can separate the cells in the sample into three populations: the normal population con-sists of all normal cells (circular), the reference population consists of cancer cells (irregular) which do notcontain the mutation and the variant population consists of all cancer cells with the mutation. To simplifythe model we assume all the cells within each population share the same genotype. For example all cells inthe variant population in this case have the genotype AABB i.e. two copies of the reference allele, A, andtwo copies of the variant allele, B. Note that the fraction of cancer cells from the variant population is thecellular prevalence of the mutation which is 610 = 0.6 in this example. Due to the effect of heterogeneityand genotype the expected fraction of reads containing the variant allele (variant allelic frequency) in thisexample would be 6·4·242·2+4·3+6·4 = 0.3.count data and tumour content estimates which are inputs for a PyClone analysis. The same workflow alsoshows how copy number information is generated which is then used to elicit priors for possible mutationalgenotypes which are also required as input for a PyClone analysis. For more details on how copy numberinformation is used to elicit priors see section 3.2.5.The model divides the sample into three sub-populations with respect to mutation n ∈ {1, . . . ,N}: thenormal (non-malignant) population, the reference and the variant cancer cell populations (figure 3.4). Thereference population consists of all cancer cells which are wildtype for the nth mutation. The variant popu-lation consists of all cancer cells with at least one variant allele of the nth mutation. To simplify inferenceof the model parameters we assume that within each sub-population, the mutational genotype at site n isthe same for all cells in that sub-population. But importantly, we allow the mutational genotypes to varyacross populations. We introduce a collection of categorical random variables gnN,gnR,gnV, each taking val-ues in G = {−,A,B,AA,AB,BB,AAA,AAB, . . .}, denoting the genotype of the normal, reference and variantpopulations with respect to mutation n. For example, the genotype AAB refers to the genotype with two343.2. Method: PyClonereference alleles and one variant allele. The symbol − denotes the genotype with no alleles, in other wordsa homozygous deletion of the locus. The vector ψn = (gnN,gnR,gnV) ∈ G 3 represents the state for the nthmutation, while pin is a vector of prior probabilities over all possible states, ψn, of the nth mutation.The fraction of cancer cells (tumour content) is t, with fraction of normal cells 1− t. We fix t prior toinference, assuming estimates from orthogonal assays such as WGS, micro-arrays or histopathology. Wedefine the fraction of cancer cells from the variant population φ n, and correspondingly 1−φ n as the fractionof cancer cells from the reference population. With this formulation the cellular prevalence, the fractionof cancer cells harbouring a mutation, is given by φ n. The cellular prevalence is a fundamental quantityfor examining population dynamics across multiple samples as it is not affected by the tumour content ofa sample. As a result the cellular prevalence allows us to track changes in the population structure acrosssamples (with regards to SNVs) without the confounding effect of contaminating normal cells.For a genotype g ∈ G , c(g) : G 7→N returns the copy number of the genotype, for example c(AAB) = 3.We define b(g) : G 7→N, which returns the number of variant alleles in the genotype, for example b(AAB) =1. If b(g) 6= 0 and b(g) 6= c(g) we assume that the probability of sampling a variant allele from a cellwith genotype g is given by µ(g) = b(g)c(g) . In the case where b(g) = 0 we assume µ(g) = ε , where ε isthe probability of erroneously observing a B allele when the true allele sequenced was A. We make thismodification to allow for the effect of sequencing error. Similarly we define µ(g) = 1−ε when b(g) = c(g).The definition of µ(g) assumes the probability of a sequencing error is independent of the sequenced allele.Because of this assumption we do not account for sequencing errors for other genotypes since these errorsshould cancel on average and the expected fraction of B alleles should stay the same as the error free case.We assume that the sequenced reads are independently sampled from an infinite pool of DNA fragments.Thus the probability of sampling a read covering a given locus from a sub-population is proportional tothe prevalence of the sub-population and the copy number of the locus in cells in from that population.Therefore, the probability of sampling a read containing the variant allele covering a mutation with stateψ = (gN,gR,gV) and cellular prevalence φ is given by:ξ (ψ,φ , t) =(1− t)c(gN)Zµ(gN)+t(1−φ)c(gR)Zµ(gR)+tφc(gV)Zµ(gV)Z = (1− t)c(gN)+ t(1−φ)c(gR)+ tφc(gV)353.2. Method: PyCloneWe let bn denote the number of reads observed with the B allele, with dn total reads covering the locus,where the nth mutation has occurred. It is straightforward to show that bn follows a Binomial distributionwith parameters dn and ξ (ψn,φn, t). This assertion follows from the fact that the sum of n Bernoulli randomvariables with parameter p follows a Binomial distribution with parameters n, p [Ross, 2002].The posterior distribution of the prevalences φ = (φ 1, . . . ,φN) is then given byp(φ |bn,dn,pin, t) ∝ p(φ)N∏n=1p(bn|φ n,dn,pin, t)= p(φ)N∏n=1∑ψn∈G 3p(bn|φ n,dn,ψn, t)p(ψn|pin)= p(φ)N∏n=1∑ψn∈G 3Binomial(bn|dn,ξ (ψn,φ n, t))pinψn .In principle, the sum over ψn ∈ G 3 could be infinite. In practice we cannot enumerate an infinite set ofstates. Thus we must specify a finite set of states which will have non-zero prior probability which in turntruncates the sum.Mutations from the same clonal population should appear at the same cellular prevalence. To account forthis we specify a DP prior with base measure H0 ∼Uniform(0,1) for p(φ) which allows mutations to sharethe same cellular prevalence [Ferguson, 1973]. If we were to directly use a continuous distribution such asUniform(0,1) as a prior for p(φ), the cellular prevalence of all mutations would be different with probabilityone. The DP prior converts this continuous distribution into a discrete distribution with an infinite number ofpoint masses. Since the DP distribution is discrete, it gives a non-zero prior probability to mutations sharingthe same cellular prevalence. Though each mutation samples its own value of φ n from the DP, the fact thatφ n can be identical induces a clustering of the data.3.2.3 Extension to multiple samplesIncreasingly, common experimental designs acquire deep digital sequencing across spatial [Gerlinger et al.,2012, Bashashati et al., 2013] or temporal axes [Eirew et al., 2014], examining shifts in prevalence as amarker of selection. As these measurements are not independent (derivative clones are related phylogen-etically), we assume M samples from the same cancer can share statistical strength to improve clusteringperformance. We substitute the univariate base measure in H0 with a multivariate base measure; for concrete-363.2. Method: PyCloneness we use the uniform distribution over [0,1]M. The Dirichlet process then samples a discrete multivariatemeasure H over the clusters and each data point draws a vector of, φ n = (φ n1 . . .φnM) from this measure. Thelikelihood under this model is given byp(φ |bn,dn,pin, t) ∝ p(φ)M∏m=1N∏n=1∑ψnm∈G 3p(bnm|dnm,φ nm,ψnm, tm)p(ψnm|pinm)For each mutation n we assign different priors, pinm, for each sample, allowing for the genotypes ofthe reference and variant populations to change between samples; for example if the samples came from aregional samples in a tumour mass, primary tumour and distant metastasis, or pre- and post- chemotherapy.We also introduce the vector t = (t1, . . . , tM) which contains the tumour content of each sample. Usingthis approach the clustering of mutations is shared across all samples but the cellular prevalence of eachmutation is still free to vary in the M samples. Thus the final output of the model will be a single posteriorsimilarity matrix for all mutations and N ×M posterior densities (one per mutation per sample) for thecellular prevalences of each mutation.3.2.4 Extension to model overdispersionNext generation sequencing data is often overdispersed [Heinrich et al., 2012]. We implemented a version ofthe PyClone model which replaces the Binomial distribution with a Beta-Binomial distribution, parametrisedin terms of the mean and precision. The density is given byp(b|d,m,s) =(db)B(b+ sm,d−b+ s(1−m))B(sm,s(1−m))where B is the Beta function. We set m = ξ (ψn,φ n, t) and to reduce the number of parameters whichneed to be estimated we share the same value s across all data points, and when applicable all samples.3.2.5 Eliciting mutational genotype priorsThe genotype aware models implemented in the PyClone software package require that we specify a priorpinψ for the state of the sample at the nth mutation. The state is defined by the normal, reference and variant373.2. Method: PyClonegenotypes and is denoted by the state vector ψn = (gnN,gnR,gnV). A number of methods are available toprofile parental (allele specific) and total copy number from high density genotyping arrays [Greenmanet al., 2010, Loo et al., 2010, Yau et al., 2010], or from whole genome sequencing data [Boeva et al., 2012,Ha et al., 2012]. As segmental aneuploidies and loss of heterozygosity are accepted to be an essentialpart of the tumour genome landscape [Bignell et al., 2010, Curtis et al., 2012], it has become routine toassay the genome architecture in conjunction with mutational analysis. To explore the impact different priorassumptions have on performance, we consider a range of strategies for setting the prior probabilities overstates. We denote the total copy number by c, and the copy number of each homologous chromosome byc1,c2. In what follows we assume that correct copy number information is available for c, c1, and c2.We consider five strategies for eliciting prior distributions. For all priors discussed we assume thatgN = AA (in other words we assign prior probability zero to all vectors ψ i with gN 6= AA). We assignuniform probability over the support. In other words, priors only differ in which states are assigned non-zero probability. All states with non-zero probability receive equal weight.• AB prior: We assume that gR = AA and gV = AB. Intuitively this means each mutation is assumed tobe diploid and heterozygous.• BB prior: We assume that gR = AA and gV = BB. Intuitively this means each mutation is assumed tobe diploid and homozygous.• No Zygosity prior (NZ): We assume that gR = AA, c(gV) = c and b(gV) = 1. In other words thegenotype of the variant population has the predicted copy number with exactly one mutant allele.This is similar to the approach used in Nik-Zainal et al. [2012].• Total Copy Number prior (TCN): We assume that c(gV) = c and b(gV) ∈ {1, . . . ,c}. In other wordsthe genotype of the variant population has the predicted copy number and at least one variant allele.We assume, with equal probability, that gR is either AA or the genotype with c(gR) = c and b(gR) = 0.Intuitively this means the genotype of the variant population at the locus has the predicted total copynumber and we consider the possibility that any number of copies (> 0) of the locus contains themutant allele. We consider states where the reference population has the AA genotype or the genotypewith the predicted copy number and all A’s.• Parental Copy Number prior (PCN): We assume that c(gV) = c and b(gV)∈ {1,c1,c2}. In other words383.2. Method: PyClonethe genotype of the variant population has the genotype with the predicted copy number and onevariant allele, or as many variant alleles as one of the parental copy numbers. When b(gV) ∈ {c1,c2}we assume gR = gN, in other words the mutation occurs before copy number events. When b(gV) = 1we assume gR is the genotype with c(gR) = c and b(gR) = 0, in other words the mutation occurs afterthe copy number event. Intuitively this means each mutant locus has the predicted total copy number.We then consider if the mutation occurred before the copy number event, in which case the number ofcopies with the mutant allele should match one of the predicted parental copy numbers. Alternativelyif the mutation occurs after the copy number event we assume only a single copy of the locus containsthe mutant allele. This scheme assumes that a point mutation only occurs once. If more then one copyof the mutant allele is present in the variant population genotype, this occurred because the mutationpreceded any copy number changes and was subsequently amplified.3.2.6 InferenceDue to the presence of the DP prior, computing the exact posterior distribution is not tractable. We use anauxiliary variable sampling method (algorithm 8 from Neal [2000]) to perform Markov Chain Monte Carlo(MCMC) sampling from the posterior distribution. First, the sampler iterates over each mutation choosinga new value of φ n from p(φ). The mutations may either choose a value of φ n used by other mutations,effectively joining a cluster, or choose an unused value of φ n, starting a new cluster. After this step thevalues of each cluster are re-sampled using a Metropolis-Hastings step with the base measure H0 as theproposal distribution. The concentration parameter, α , in the DP is sampled using the method described inWest and Escobar [1993]. This method places a Gamma distribution on α which leads to simple a Gibbs re-sampling step. The Gamma distribution prior is parameterised in terms of the shape a and rate b parameters.The density for this prior is given byp(α|a,b) = baαa−1 exp(−bα)Γ(a)where Γ(x) is the gamma function. The mean and variance of this distribution are given byE(α) =abVar(α) =ab2393.2. Method: PyCloneWe typically use values of a = 1.0 and b = 10−3 so that the variance of the prior distribution on α is106, which is extremely vague.To initialise the sampler we assign all mutations to separate clusters. As a result the computationalcomplexity of the first pass of the sampler is O(N2), where N is the number of mutations. Subsequentiterations have computational complexity O(NK) where K is the number of active clusters.3.2.7 ImplementationThe code implementing all methods plus plotting and clustering is included in the PyClone software pack-age. PyClone is implemented in the Python programming language. All analyses were performed usingPyClone 0.12.4 and PyDP 0.2.1. PyDP is freely available under open source licensing. PyClone is freelyavailable for academic use at https://bitbucket.org/aroth85/pyclone.3.2.8 Alternative methodsTo compare genotype aware clustering to other clustering methods that ignore genotype, we implementedthree standard clustering models in the PyClone software package:• IGMM - Infinite Gaussian mixture model• IBMM - Infinite Binomial mixture model• IBBMM - Infinite Beta Binomial mixture modelWe interpreted the probability of success for the IBMM, and the mean parameter m for the IBBMM as thecellular prevalence of mutations. For the IGMM analysis we first computed the variant allelic frequency foreach mutation and then clustered these values, interpreting the mean of the clusters as the cellular prevalence.All MCMC analysis, clustering and cellular frequency inference was done as described earlier and in section3.2.11. All methods implemented in the PyClone software package, including the IGMM, IBMM andIBBMM use the PyDP package to perform inference. Thus any variation in performance should be due todiffering distributional assumptions, not inference methods or implementation.403.2. Method: PyClone3.2.9 Data setsSynthetic To generate synthetic data for figures 3.5 and 3.6, we sampled from the PyClone model witha Binomial emission letting di ∼ Poisson(10,000), t = 0.75, and using 8 clusters with cellular frequenciesdrawn from a Uniform(0,1) distribution. To assign genotypes to each mutation, we randomly sampled atotal copy number, c ∈ {1, . . . ,5}. We sampled another value c∗ uniformly from the set {0,1, ...,c} and setthe major copy number, c1, to max{c∗,c− c∗} and the minor copy number, c2, to c− c1. We randomlysample gR from the set {gN,g∗} where c(g∗) = c and b(g∗) = 0. If gR = gN then we assumed the mutationoccurred early so that gV had either c1 or c2 B alleles and total copy number c. If gR 6= gN we set gV to thegenotype with one variant allele and total copy number c. For figure 3.5 we generated 100 simulated datasets by sampling 100 mutations for each data set. For figure 3.6 we generated 400 data sets, 100 data setswith 50, 100, 500 and 1,000 mutations.Normal tissue mixtures Each mixture contained DNA in approximate proportions of 0.01, 0.05, 0.20,0.74. Specific single nucleotide variants were amplified using PCR, and then sequenced deeply on theIllumina MiSeq platform. This data set simulates a hierarchically related population with ground truth forquantitative benchmarking (figure 3.7 c). We selected positions with variants present in exactly one case,shared by specific subsets of cases, and shared by all cases.Variants positions in the four cases used were previously identified in Ng et al. [2009]. We compiled alist of positions with variants in: i) exactly one of the four cases used in the mixture; ii) shared by NA18507and NA19240; iii) shared by NA18507, NA19240 and NA12878; iv) shared by all four cases. For SNPspresent in multiple cases we only considered positions that had the same genotype in all the variant cases. Wemanually removed positions which appeared to have variant allelic frequency deviating significantly fromthe expected values. These outliers were either due to incorrect annotation of the genotype in one of the fourcases, or because sequencing appeared to work poorly at the target location. The position coordinates wereconverted from hg18 to hg19 coordinates using the UCSC liftover program.The position selected simulated an idealized cancer consisting of seven clonal cell populations whichevolved according to a bifurcating tree. SNPs in all four cases represent the root of the tree. SNPs presentin 2 or 3 cases represent interior nodes of the tree. SNPs unique to each sample represent leaf nodes in thetree.Because the data set was highly imbalanced for variants with the BB genotype we randomly down413.2. Method: PyClonesampled the BB positions to obtain 10 smaller data sets of 37 mutations with a 50:50 representation ofpositions with AB and BB genotypes for the nodes with SNVs in exactly one of the samples. There werenot enough mutations in the interior nodes to do this, so we used all mutations for these nodes.The predicted genotypes from Ng et al. [2009] were used to determine the homologous copy number forthe PCN prior as follows: for the positions predicted to have the AB genotype in the variant sample we setthe major and minor copy numbers to 1; for positions with the BB genotype in the variant sample we set themajor copy number to 2 and the minor copy number to 0.Benchmarking in this experiment was measured by the ability to correctly group mutations based onthe known, but held-out reference clustering (figure 3.7 c). Reference clustering was defined by the case(s)harbouring the variant genotype. To be explicit, we expected seven clusters to be identified: one for SNPspresent in all cases; one for SNPs present in NA18507, NA19240 and NA12878; one for SNPS presentin NA18507 and NA19240; and four clusters for SNPs unique to each case. Since we knew which SNPswere present in each case we could compute a ground truth clustering based on the above expectation. Thechallenge for the methods we considered was to correctly predict the co-occurrence of SNPs from the sameground truth clusters with different genotypes. We would expect methods which ignore genotype to fail atthis task.We did not attempt to benchmark cellular prevalence estimates for this data set since the cellular preval-ence values were only approximately correct due to experimental variability in the mixing of the tissues.Details of the informatics to generate this data set are presented in section §A.2.High grade serous ovarian cancer Multiple PyClone analyses were performed, one analysis per copynumber prediction method (see section §A.1). We specified the priors for cellularity estimation in PICNICbased on quantitative pathology estimates. Allelic count data was obtained from Bashashati et al. [2013].We obtained count data from additional mutations not validated as somatic from the authors. For each copynumber method we used the homologous copy number information to elicit priors for the PyClone modelusing the PCN strategy, and set the tumour content for the PyClone analysis to the value predicted by thecopy number method. MCMC analysis and post-processing of the trace was done as discussed in 3.2.11.423.2. Method: PyClone3.2.10 Performance metricsTo assess the clustering performance of the methods we computed the V-measure Rosenberg and Hirschberg[2007], calculated using the scikits-learn Python package 0.14.1. V-measure is a measure of cluster-ing accuracy between 0 and 1 where a V-measure score of 1 represents perfect clustering. Cellular preval-ence estimates were evaluated using the mean error over MCMC samples, where a mean error of 0 representsperfect cellular prevalence estimates. Explicitly, for each mutation the mean of the post-burnin trace of thecellular prevalence parameter was used as a point estimate. The absolute difference between this value andthe true value for each mutations in each data set was computed. For each of the data sets the mean valueof the absolute error across mutations was taken and used to generate the boxplots. Statistical analysis wasperformed with the aov and TukeyHSD functions in the R statistical computing package using RStudiov.0.96.331.3.2.11 Running PyClone and MCMC analysisFor the synthetic model comparison and normal mixing experiments all analyses involving the PyClonegenotype aware models, IGMM, IBMM and IBBMM models were run for 10,000 MCMC iterations dis-carding the first 1,000 samples as burnin and no thinning was done. For the synthetic investigation varyingthe number of mutations and HGSOC analyses we used 100,000 iterations of the sampler discarding thefirst 50,000 as burnin. To generate cellular frequency plots we fit Gaussian kernel density estimators to thepost-burnin MCMC trace using the scipy 0.12.0 Python library. To assess convergence for high gradeserous ovarian cancers we ran three MCMC chains from random starting positions. The posterior densitiesof the cellular prevalence estimates were then inspected to ensure they were visually similar (figures A.2 -A.7). For the synthetic data set and normal mixing experiment this was not done due to the large numberof runs. In general we have found that 10,000 - 100,000 iterations is sufficient for convergence in data setswith 100s of mutations.To cluster the data we formed the pairwise posterior similarity matrix (the matrix indicating how fre-quently any two mutations appeared in the same cluster in the post-burnin trace). We then hierarchicallyclustered the data using average linkage, and the resulting dendrogram was used as a guide to find a clus-tering which optimised the MPEAR criterion described in Fritsch and Ickstadt [2009] (the code for doingthis is built into PyClone). We note there are other approaches such as using the sample with the highest433.3. ResultsFigure 3.5: Synthetic data method comparison | (a) Clustering performance and (b) estimated cellular pre-valence accuracy for different methods applied to 100 synthetic data. (a) The accuracy of the inferredclusters is measured using the V-measure metric (y-axis). (b) The accuracy of inferred cellular prevalenceis measured by computing the difference between the mean posterior value inferred from MCMC samplingand true value. Whiskers indicate 1.5 the interquartile range, the notches indicate the median, and boxesrepresent the interquartile range.posterior probability to infer flat clusters using a DP, but Fritsch and Ickstadt [2009] have shown these tendto perform poorly in comparison to the method we use. Because this step requires the formation of theposterior pairwise similarity matrix the computational complexity scales as O(N2).3.3 Results3.3.1 Synthetic dataTo systematically assess the performance of different modelling strategies for mutational clustering and cel-lular prevalence inference, we generated 100 synthetic data sets of 100 mutations with randomly assignedcopy number and mutational genotypes, grouped into eight clusters in each run. We evaluated the perform-ance of the PyClone model on these data using five different strategies for specifying the state priors plustwo standard clustering models (IGMM, IBMM) outlined in section 3.2.8. Benchmarking was based onV-measure and mean error in estimating cellular prevalence (see section 3.2.10). Group distributions overaccuracy were assessed using ANOVA tests with pair-wise TukeyHSD tests for two-group comparisons.Adjusted p values with q< 0.05 was used as a criteria for statistical differences.443.3. ResultsClustering accuracy was highest using the PyClone model with the PCN method with a V-measure of0.78± 0.06, followed by the PyClone model with the TCN method (0.65± 0.07) (figure 3.5 a). The PCNand TCN methods, which account for mutational genotype, significantly outperformed all other methods(figure 3.5 a). Accounting for copy number but ignoring mutational genotype, the NZ method (0.56±0.06)was worse than the PCN and TCN methods, but performed significantly better than the AB and BB methodsthat assume diploid states for the variant population (0.49± 0.05 and 0.52± 0.04, q < 0.05, ANOVA andTukeyHSD). The IBMM (0.51± 0.05) performed similarly to the PyClone model with diploid genotypeprior methods and was significantly better than IGMM (0.20±0.11).Mean error on cellular prevalence estimates was also measured for each method. Similar to V-measurebenchmarks, the PyClone model with the PCN method was the most accurate (mean absolute error= 0.03±0.01), significantly lower than all other methods (figure 3.5 b). TCN (0.07± 0.02) and NZ (0.14± 0.03)were more accurate than the AB and BB methods (0.21± 0.03 and 0.20± 0.04). The PyClone modelusing all genotype prior methods were significantly more accurate than the IBMM and IGMM methods(0.25±0.05 and 0.26±0.05). The likely reason is that the PyClone model accounts for tumour content (setat 0.75 for all simulations). Of the two genotype prior methods which consider mutational genotype, PCNsignificantly outperformed TCN. Though not surprising since the PCN method provides more informativeprior information, this result suggests that given reliable parental copy number information, the PCN strategyfor setting genotype priors would improve inference.Taken together, these results systematically demonstrate the theoretical basis for estimating mutationalgenotype by incorporating copy number and parental allele information. This confers increased accuracy inboth clustering and cellular prevalence estimates. Furthermore these results validate the basis for avoidingthe use of Gaussian distributions when clustering deep digital sequencing data.To assess how performance varies with the number of mutations, we simulated data sets with 50, 100,500 and 1,000 mutations. 100 replicates were generated for each number of mutations using the sameprocedure as above. The data sets were analysed using the PyClone model with Bin-PCN model. We findthat clustering performance deteriorates (figure 3.6 a) as the number of mutations analysed increases. Incontrast the estimated cellular prevalence of the mutations improves as we increase the number of mutationsanalysed (figure 3.6 b). We believe that clustering performance deteriorates with more mutations becausethe problem of clustering larger data sets is intrinsically more difficult. The increasing accuracy of cellularprevalence estimates would suggest that mutations from the same cluster are being placed together. As we453.3. ResultsFigure 3.6: Synthetic performance varying number of mutations | (a) Clustering performance and (b) estim-ated cellular prevalence accuracy for the PyClone BeBin-PCN model as function of the number of mutations.(a) The accuracy of the inferred clusters is measured using the V-measure metric (y-axis). (b) The accuracyof inferred cellular prevalence is measured by computing the difference between the mean posterior valueinferred from MCMC sampling and true value. Whiskers indicate 1.5 the interquartile range, the notchesindicate the median, and boxes represent the interquartile range.increase the number of mutations, the mean number of predicted clusters increases from 8.64±2.02 with 50mutations to 28.85± 13.77 with 1,000 mutations. Taken together this suggest the PyClone model tends toover cluster as the number of mutations increases, but the cellular prevalence of each cluster will be accurate.This differs from the over clustering of genotype naive methods, which results in mutational clusters beingassigned inaccurate cellular prevalence estimates.3.3.2 Normal mixturesWe evaluated our approach on idealized data sets (figure 3.7), produced by physical mixtures of DNA ex-tracted from four 1000 genomes project samples [1000 Genomes Project Consortium, 2010, Harismendyet al., 2011] (section 3.2.9). We compared the PyClone model using combinations of emission densities andgenotype priors to two genotype naive methods: an infinite Binomial mixture model (IBMM) and an infiniteBeta-Binomial mixture model (IBBMM) (section 3.2.8). Empirical comparisons to ground truth showedthat PyClone model with Beta-Binomial emission densities (BeBin-PCN and BeBin-TCN) outperformedall other methods based on clustering accuracy by V-measure [Rosenberg and Hirschberg, 2007] (section3.2.10). Performance gains were consistent when analyzing each data set separately (figure 3.7 a) or all463.3. ResultsFigure 3.7: Comparison of clustering performance for the mixture of normal tissues data set | We comparethe infinite Binomial mixture model (IBMM); infinite Beta-Binomial mixture model (IBBMM); the PyClonemodel using binomial emission densities and total copy number (Bin-TCN) or parental copy number (Bin-PCN) prior; the PyClone model using Beta-Binomial emissions and the parental (BeBin-PCN) or total(BeBin-TCN) copy number prior. (a) Comparison of methods when analysing each mixture experimentseparately and (b) analysing all four mixtures jointly. (c) Expected cellular prevalence of each cluster acrossthe four mixture experiments. (d) Inferred cellular prevalences and clustering using the IBBMM model and(e) PyClone BeBin-PCN model to jointly analyse all four mixtures. Solid lines (d, e) indicate clusters forwhich SNVs are predominatly homozygous (BB) and dashed lines indicate clusters for which SNVs arepredominatly heterozygous (AB), in the event an equal number of both types of SNVs is present the clusteris drawn as a solid line. (f) Variant allelic frequency for mutations assigned to cluster 1 by PyClone BeBin-PCN model. Dashed lines represent heterozygous SNVs and solid lines represent homozygous SNVs. (a,b) Whiskers indicate 1.5 the interquartile range, red bars indicate the median, and boxes represent theinterquartile range. (d, e) Error bars indicate the mean standard deviation of MCMC cellular prevalencesestimates for mutations in a cluster. (d, e, f) The number of mutations n in each cluster is shown in thelegend in parentheses.473.3. Resultsfour samples jointly (figure 3.7 b). Accounting for mutational genotype and joint inference each conferredindependent performance gains with best results obtained when using PyClone BeBin-PCN. To illustrate theeffect of accounting for mutational genotype, we randomly selected one of the ten joint analysis runs con-trasting PyClone BeBin-PCN with IBBMM, a baseline analogous to clustering the raw allelic data withoutconsidering mutational genotype. IBBMM output 12 clusters, consistently assigning heterozygous and ho-mozygous SNPs from each ground truth cluster to separate clusters (figure 3.7 d). By contrast, PyCloneidentifies the 7 correct clusters (figure 3.7 e), placing heterozygous and homozygous SNPs from the sameclusters together (figure 3.7 f).3.3.3 High grade serous ovarian cancerWe compared results from multi-sample BeBin-PCN and IBBMM in the cancer setting, using recently pub-lished mutational profiles of multiple samples from a high grade serous ovarian cancer (HGSOC) [Bashashatiet al., 2013]. Four spatially separated samples were taken from a primary, untreated ovarian tumour (studycase 2). Mutations called in exomes were deeply sequenced (∼ 5000×), yielding 49 validated mutations.Copy number priors were obtained from high-density genotyping arrays (section §A.1).IBBMM inferred 9 clusters, whereas PyClone identified 6 clusters. Clusters 1, 2 and 6 identified byIBBMM show a similar pattern of variation across the four samples. PyClone places these 25 mutationstogether in cluster 1 (figure 3.8 a). We propose these mutations strictly co-occur, with the observed variationin allelic abundance due to differing mutational genotypes. Supporting this hypothesis, 75% (3/4) mutationsplaced in cluster 1 by IBBMM are predicted to be in regions of loss of heterozygosity, whereas 90% (18/20)mutations placed in cluster 2 by IBBMM are in regions predicted to be diploid heterozygous. The singlemutation placed in cluster 6 by IBBMM falls in a region predicted to have major copy 3 and minor copy 1.The sum of variant allelic frequencies for IBBMM clusters 1 and 2 exceed 1.0 which implies thereshould exist cells with mutations from both clusters, with another group of cells containing only mutationsfrom cluster 1. PyClone predicts mutations from IBBMM cluster 1 and cluster 2 strictly co-occur. To testthese competing hypotheses we performed single cell sequencing of tissue from sample B, obtaining reliablemeasurements for 25 nuclei. When interpreting the data we advise the reader that failure to detect a mutationin single cell data can occur even if the mutation is present due to biased PCR amplification of alleles. Formore details on these problems see chapter 4. To clarify the competing hypothesis generated by IBBMMand PyClone, we collapse the raw data to presence or absence of clusters predicted by IBBMM, where a483.3. ResultsFigure 3.8: Joint analysis of multiple samples from high grade serous ovarian cancer (HGSOC) patient 2 |The variant allelic frequency for each mutation color coded by predicted cluster using the (a) IBBMM and(c) PyClone with BeBin-PCN model to jointly analyse the four samples. The inferred cellular prevalencefor each cluster using the (b) IBBMM and (d) BeBin-PCN methods. As in figure 3.7 the cellular prevalenceof the cluster is the mean value of the cellular prevalence of mutations in the cluster. (e) Presence or absenceof variant allele at target loci in single cells from sample B. Loci with less than 40 reads covering themare coloured gray. Predicted clusters for each method are shown on the left, with white cells indicatingnon-somatic control positions. (f) Presence or absence of IBBMM clusters in single cells from sample B.Clusters were deemed present if any mutation in the cluster was present. (b, d) Error bars indicate themean standard deviation of MCMC cellular prevalences estimates for mutations in a cluster. The number ofmutations n in each cluster is shown in the legend in parentheses.493.3. ResultsFigure 3.9: Joint analysis of multiple samples from high grade serous ovarian cancer (HGSOC) patient 1| Joint analysis of multiple samples from high grade serous ovarian cancer (HGSOC) case 1. The variantallelic frequency for each mutation color coded by predicted cluster using the (a) IBBMM and (c) thePyClone model with BeBin-PCN model to jointly analyse the four samples. The inferred cellular prevalencefor each cluster using the (b) IBBMM and (d) BeBin-PCN methods. As in figure 3.7 the cellular prevalenceof the cluster is the mean value of the cellular prevalence of mutations in the cluster. Error bars indicate themean standard deviation of MCMC cellular prevalences estimates for mutations in a cluster. The number ofmutations n in each cluster is shown in the legend in parentheses.cluster is determined to be present in a cell if any mutation from the cluster is present (figure 3.8 f). We findno cells in which mutations from IBBMM cluster 1 are detected without mutations from IBBMM cluster 2(figure 3.8 c). This is parsimonious with IBBMM clusters 1 and 2 comprising a single cluster, as predictedby PyClone.Case 1 mutations clustered by IBBMM (figure 3.9 a, b) resulted in mutations with the highest allelicabundance (Cluster 4, n = 9 mutations) grouped together. These mutations showed a similar allelic abund-ance pattern to mutations in Cluster 5 (n= 31 mutations) across samples. We suggest these mutations belongto the same clone, with Cluster 4 mutations predominantly homozygous and Cluster 5 mutations predomin-antly heterozygous. Eight of nine mutations in IBBMM Cluster 4 fall into regions of heterozygous deletion503.4. Discussionin all four samples. By contrast, 28 of 31 mutations in Cluster 5 are in diploid heterozygous regions in allfour samples. Therefore, the difference in cellular prevalence estimates in IBBMM between Cluster 4 and 5can be explained by the copy number of the loci spanning the mutations impacting the mutational genotype.The PyClone model instead groups the mutations corresponding to IBBMM Cluster 4 and Cluster 5 intoone group of n= 44 mutations (figure 3.9 c, d - Cluster 1), with representation of both the heterozygous andhomozygous loci at cellular prevalences of near 1.0.PyClone cluster 1 in both case 1 (n = 44 mutations) and case 2 (n = 25 mutations) (figure 3.9 d, fig-ure 3.8 d) likely represent mutations comprising the ancestral clone in these tumours’ aetiology. In contrast,clusters with cellular prevalences lower than 1.0 indicate putative descendant clones. We emphasize thatIBBMM splits the ancestral clone into at least two clusters in both case 1 and case 2, with evidence from thecopy number analysis and attributing the split based on heterozygous or homozygous mutational genotype.PyClone modelling of mutational genotypes coupled with simultaneous inference across multiple samplestherefore provides a more robust approach to ascertaining clonal populations with dramatic implications forhow cellular prevalence estimates of mutations are interpreted in reconstruction of evolutionary histories.3.4 Discussion3.4.1 Incorporating external sources of copy number informationThe accuracy of genotype prior information will greatly influence the results of the PyClone analysis. Vagueprior information will limit the ability to accurately cluster mutations and infer cellular frequency since alarge space of equally likely explanations for the observed data must be considered. This problem is notspecific to our method, and represents a major challenge of the problem in general.A principled approach to limit the search space is to restrict the variant population to have genotypescompatible with copy number information predicted on an orthogonal platform. This approach still leavesopen the problems of determining which genotypes compatible with the predicted copy number are mostlikely, and what the copy number of the reference population is. In some cases, biological informationabout the mutation may help solve the first problem. For example, EZH2 mutations in lymphoid cancers areknown to be functional only if a wildtype mutation remains [Yap et al., 2011], thus we would weight theheterozygous mutations more heavily in this case. By contrast, TP53 mutations in high grade serous ovariancancers are nearly always homozygous [Ahmed et al., 2010]. As for the copy number of the reference513.4. Discussionpopulation, it is ultimately determined by whether the mutation event precedes or follows the copy numberevent in the region. In some cancers, large scale alterations of the copy number architecture are believed tobe early events with subsequent evolution occurring via the accumulation of point mutations [Carter et al.,2012]. In this case it would be more likely that the reference population has the same total copy numberas the variant population. As we show, homologous copy number information can also be of help if we arewilling to make some assumptions about the mutational process.One area of future research is how the predictions from multiple sources could be used to inform thePyClone model. We saw that different software tools will produce different copy number and tumour contentpredictions, which will influence the PyClone results. Given that the PyClone model has a great deal offlexibility in specifying priors, it should be possible to use multiple tools to inform the priors. A principledmethod to weigh the various predictions is required for this approach, ideally one that accounts for thestrengths and weaknesses of different prediction tools. A closely related issue is how we can use sub-clonalcopy number predictions to inform the PyClone model priors.3.4.2 LimitationsThe PyClone model does not cluster cells by mutational composition, which is the traditional view of clonalheterogeneity and tumour phylogenies. Instead it clusters mutations which appear at similar cellular fre-quencies. In simple cases the clustering derived may be correct, however if multiple sub-clones exist atsimilar cellular frequencies the model will falsely cluster the associated mutations together. This is onereason we have focused on targeted deep sequencing when applying PyClone, as the chance of making thiserror will decrease with higher sequencing coverage. Joint analysis of multiple samples can also help ad-dress this issue as clones appearing at similar cellular prevalences in one sample may appear at differentprevalences in another. Ultimately this will be resolved with the maturation of single-cell sequencing tech-niques that scale to allow for the reconstruction of individual cell genotypes. We address this problem inchapter 4.A key assumption of our model is that all cells within each populations have the same genotype. Thisassumption is likely false in some cases as cancer cells can undergo copy number alterations and LOH eventsbefore and after the acquisition of mutations [Navin et al., 2011]. The error induced by this assumption willdepend heavily on how variable the genotype of cells are at the locus of interest. In solid tumours the tissuesamples prepared for deep sequencing experiments are taken from a relatively small spatial area. In this case523.4. Discussionthe error from assuming the same genotype within populations may be relatively small. For liquid tumoursthis assumption might be much worse as cells are highly mobile. It is possible to relax this assumption butthe resulting model would lead to an intractable inference problem.We note that our assumptions are predicated on the notion of a perfect phylogeny whereby mutationsoccur exactly once. We recognize that well described phenomena such as recurrent mutations (occurringindependently along different branches of the tree) would violate this assumption. Accounting for suchpossibilities would ultimately lead to an intractable inference problem owing to unidentifiable explanationsfor the observed data.The posterior densities for the cellular frequencies can often exhibit a degree of uncertainty making in-terpretation difficult. Uncertainty can arise due to imprecise prior information on genotype of the mutationsand to the depth of sequencing. As prior information about genotype improves and depth of sequencingincreases we would expect multi-modality to become less prominent. Though uncertainty makes interpret-ation difficult, it is a realistic representation of the confounding factors in this problem. One of the majorcontributions of this work is highlighting that such uncertainty exists unless strong assumptions about thegenotype of the mutations are made.Another approach to reducing the uncertainty is to use multiple samples from patients separated in time(primary vs. relapse) or space through regional or anatomic sampling. We have shown that our model canaccommodate multi-sample data in a principled way by using hierarchical Bayesian modelling where stat-istical strength is borrowed across data sets, dramatically decreasing uncertainty while increasing accuracy.3.4.3 Alternative applicationsThe PyClone model is specifically designed for the problem of inferring the cellular prevalence of singlenucleotide mutations in deeply sequenced tumour samples. However, the model is quite generic in the sensethat it only assumes the sequenced sample is a heterogeneous mixture of cells which fall into three distinctsub-populations. Provided that data which accurately reflects the abundance of an alteration can be gen-erated, we could likely apply the PyClone model to infer the prevalence of indels, genomic rearrangementbreakpoints or methylation marks in malignant tissues. By varying the model parameters and genotype pri-ors it would also be relatively straightforward to apply the PyClone model to infer the prevalence of somaticalterations in non-malignant tissue.533.4. Discussion3.4.4 ConclusionsIn summary, we have introduced a probabilistic model for inference of clonal population structures in humancancers from deep digital sequencing of mutations, with supporting validation from single-cell sequencingexperiments. The advances we present have practical implications for inference of clonal populations andshow measurable reductions in spurious inference relative to current approaches. As the practice of meas-uring allelic abundance during the treatment cycle [Forshew et al., 2012, Dawson et al., 2013] or throughretrospective analysis of multiple samples increases [Gerlinger et al., 2012, Bashashati et al., 2013, Sottorivaet al., 2013, de Bruin et al., 2014, Hong et al., 2015], we suggest the PyClone model will contribute a robuststatistical inference approach for studying selection patterns underpinning disease progression in cancer.3.4.5 PostscriptSince the publication of the work in this chapter [Shah et al., 2012, Roth et al., 2014] there have been anumber of methods developed to infer clonal population structure from HTS data. Some of these methodsattempt to solve the exact problem described in this chapter [Miller et al., 2014]. Other methods havefocused on inferring the underlying clonal phylogeny, either using predicted cellular prevalences from toolssuch as PyClone [Malikic et al., 2015], or jointly performing clustering and phylogeny inference [Jiao et al.,2014, Deshwar et al., 2015]. In a similar vein, there is growing body of literature that proposes the use offeature allocation models in place of clustering models [Zare et al., 2014, Ji et al., 2015]. In these models,features represent clones, and the associated prevalence parameters are clonal prevalences.Inferring copy number variation from HTS data and using this information to infer clonal populationstructures is closely related to the problem in this chapter [Oesper et al., 2013, Ha et al., 2014]. One of themost interesting developments in the field have been methods to jointly infer clonal copy number profilesand SNV assignments to clones [Fischer et al., 2014]. This removes several of the strong assumptions usedin the PyClone model. In particular, the model no longer requires a priori copy number information whichis assumed accurate. More importantly, this relaxes the model assumption that the reference and varianttumour populations are homogeneous. The penalty for this is that model selection becomes a formidableproblem.There is currently a great deal of interest in inferring clonal population structures using single sample,low depth WGS data sets. This in largely motivated by the availability of a 100s of such samples from543.4. Discussionthe ICGC [Hudson et al., 2010] and TCGA [Weinstein et al., 2013] consortia and the desire to performpan-cancer clonal analysis. The PyClone model was primarily developed with targeted deep sequencing inmind, and as shown in this chapter, benefits greatly from multiple sample analysis. While there is nothing inprinciple that would prevent PyClone from working on such low resolution data sets, performance expect-ations should be suitably tapered. This is generally true for all the methods mentioned, as low depth singlesample clonal analysis is a difficult and largely unidentifiable problem.55Chapter 4Identification of clonal genotypes fromsingle cell digital sequencing4.1 IntroductionIn chapter 3 we presented a method for the inference of clonal population structure from deep sequence data.The input data for this model was assumed to be generated by simultaneously sequencing a large numberof lysed cells from a genomically heterogeneous population. This creates a difficult deconvolution problemthat must be solved to infer clonal prevalence and genotypes. Single cell sequencing, that is isolation ofindividual cells and sequencing of their genomes, provides a more direct assay to measure clonal genotypesand infer population structure. Early single cell cancer studies employed slow and labour intensive manualisolation of cells by mouth pipette coupled with multiple displacement amplification (MDA) to generate asufficient quantity DNA for whole exome sequencing [Hou et al., 2012, Xu et al., 2012]. While these studiesgenerated a great deal of scientific interest, the amount of labour involved and difficulties with MDA limitedthe utility of this study design. To alleviate the bottleneck in single cell isolation, researchers have usedflow-cytometry [Navin et al., 2011] or micro-fluidics [Gawad et al., 2014] to rapidly isolate 100s−1000s ofcells with little manual intervention. Protocols for efficient amplification of DNA from single cells have alsobeen refined [Zong et al., 2012a, Gawad et al., 2014]. These advances have made high throughput profilingof single cell genomes a feasible assay for studying cancer heterogeneity and evolution [Navin et al., 2011,Eirew et al., 2014, Gawad et al., 2014, Wang et al., 2014].Current protocols for single cell sequencing generate imperfect data with missing values, highly biasedallelic counts at heterozygous loci, and false measurements of genotypes due to sequencing multiple cells.Because of these errors, measurements from a single cell do not accurately represent a clonal genotype. Wesuggest that aggregating measurements from multiple cells from the same clonal population would improve564.1. Introductioninference of clonal genotypes. To do this we need to know the clonal population of origin for each cell,which is of course not possible without knowing the genotype composition of the population. Thus we havethe problem of jointly inferring clonal genotypes and assigning cells to clonal populations.We have developed a novel statistical model and a mean field variational inference method to addressthis problem. Using allelic measurements from a fixed set of loci across a population of single cells, themodel solves this joint inference problem, while also estimating the unknown number of clones. Our modelis able to handle missing data and allelic dropout in a principled way. In addition, we are able to identifymeasurement which are the result of sequencing multiple cells. Furthermore, the model is more flexiblethan previous approaches, allowing input data from multiple discrete data-types with an arbitrary number ofstates.4.1.1 Sources of error in single cell dataThough single cell sequencing offers a clear advantage over bulk sequencing for studying cancer heterogen-eity, it is not currently able to replace bulk sequencing due relatively high rates of error. There are severalsources of sequencing error in single cell experiments, but most stem from the fact a single cell has only asingle copy of the genome. This means that the genome (or subset of interest) in each cell must be amplifiedin some way before sequencing using the current generation of devices [Shapiro et al., 2013]. Amplific-ation of DNA for sequencing is a stochastic process, with biases associated with GC content and genomeaccessibility [Benjamini and Speed, 2012].There are two classes of genomic events that have been profiled by single cell sequencing to date, largescale copy number changes [Navin et al., 2011] and point-like mutations [Eirew et al., 2014, Gawad et al.,2014]. The latter class includes SNVs, small indels and rearrangement breakpoints. The type of errorscaused by DNA amplification for each class of event are quite different and require different methods ofcorrection. In this work we focus on the analysis of point-like events but briefly review the challenges forcopy number profiling to clarify the need for multiple strategies.For large scale copy number changes, whole genome sequencing of single cells can be performed [Navinet al., 2011, Wang et al., 2014, Baslan et al., 2015]. Due to cost and technical reasons, the depth of coveragetends to be relatively low compared to bulk whole genome sequencing. However, statistical strength canbe gained by pooling sequenced reads across large bins. In principle the number of reads in each bin aredirectly proportional to the number of copies of the DNA fragment associated with the bin. Corrections for574.1. Introductionbiases in amplification are required as more reads will be generated from regions which are preferentiallyamplified. Binning has successfully been employed to predict total copy number of single cells [Navin et al.,2011, Wang et al., 2014], though prediction of allele specific copy number remains a formidable challengedue to the low coverage and biased allelic amplification. This means that accurate measurement of variantallele frequency at heterozygous sites, which is required for allele specific copy number prediction, is notpossible.Targeted sequencing can be used to profile point-like events. This significantly reduces the amountof the genome that needs to be sequenced, allowing for a high depth of coverage to be achieved with areasonable cost. Because amplification of DNA is a stochastic process, even amplification of both allelesat heterozygous alleles is not guaranteed. For bulk sequencing experiments this is not typically a problemsince there is an excess of both alleles when amplification begins. For single cell sequencing the problem isfundamentally different, since we only have one copy of each DNA fragment in a cell [Shapiro et al., 2013].Bias in amplification towards one allele in the DNA amplification steps can result in a representation ofalleles so skewed that we fail to measure the other allele. This effect, known as allelic dropout, is currentlya major impediment to determining mutational genotypes at heterozygous loci in single cell experiments[Shapiro et al., 2013, Ning et al., 2014]. Even when both alleles are detected, it is rarely the case that theproportion of each allele is reflective of the expected value given the mutational genotype. We provide aquantitative analysis of this in section 4.3.1.Missing data is also a major issue for all current single cell methods. For targeted sequencing exper-iments this can occur because the PCR primers for a target fail to work. The reasons for this failure varybased on the exact protocol but can be related to issues such as multi-plexed PCR primers interacting, adhe-sion of DNA to the surface of micro-fluidic devices and limited accessibility of DNA. In the case of wholegenome sequencing assays, a whole genome amplification step is required. This step can fail to amplifyportions of the genome, leading to failure measure loci within these regions. Even when the PCR primersor whole genome amplification work relatively well, some cells still fail to produce data at all loci, possiblydue to degradation of DNA.An implicit assumption of single cell sequencing is that each measurement is made on a single cell.However, most high throughput methods for single cell isolation will make errors and isolate multiple cells.Cases where two cells are isolated and sequenced together have been observed to occur frequently in anumber of single cell experimental protocols [Eirew et al., 2014, Gawad et al., 2014]. These so called584.1. Introductiondoublet measurements can be problematic for downstream analysis, as they lead to the inference of clonalgenotypes which do not actually exists. This has implications for phylogenetic inference as the resultinggenotype will be a mixture of existing genotypes and may not be compatible with any perfect phylogeny.Counter intuitively this problem will get worse as more cells are sequenced and the number of doubletsincreases, creating a more robust signal for these rare pseudo clonal populations.The problem of inferring clonal genotypes from point-like measurements remains understudied, withno comprehensive method available to handle all the challenges outlined above. In this contribution wesystematically develop a probabilistic model to address this problem. Our model is applicable to inferringgenotypes defined by point-like events with a discrete number of observable states.594.2. Methods: Single cell genotyperIndex Range Descriptioni {1, . . . , I} Index of tissue sample.n {1, . . . ,Ni} Data point index. Depends on tissue sample.d {1, . . . ,D} Data type index.m {1, . . . ,Md} Event index. Depends on data type.k {1, . . . ,K} Clone (cluster) index.s Sd Genotype state index. Depends on data type.t Td Observed state index. Depends on data type.Table 4.1: Indices used in equations.Variable Range Description CommentsGkdm {1, . . . , |Sd |} State of clone k at event m for data type d.Xidnm {1, . . . , |Td |} Observed state of data point n from samplei at event m for data type d.εdst [0,1] Probability of observing state t given thatthe hidden state is s for data type d.∑Tdt=1 εdst = 1δ [0,1] Probability a data point is a doublet.piik [0,1] Proportion of cells from clone k in samplei.∑Kk=1piik = 1Yin {0,1} Variable indicating if data point n insample i is a doublet.Z1in {1, . . . ,K} Variable indicating which data point nfrom sample i belongs to if not a doublet.Z2in {(k1,k2)|k1,k2 ∈ {1, . . . ,K}} Variable indicating which clonescontribute cells to observation i in samplen if it is a doublet.γdst (0,∞) Dirichlet prior for εdst .κk (0,∞) Dirichlet prior piik.α,β (0,∞) Beta prior for δ .Table 4.2: Model variables.4.2 Methods: Single cell genotyper4.2.1 Model overviewWe derive the single cell genotyper (SCG) model in the following section. To ease the exposition we derivethe model in steps, starting with the simplest model and progressively extending the model. All of theextensions discussed can be combined into a single full featured model presented in figure 4.1. A full list ofmodel variables is provided in table 4.2 with a description of the associated indices in table 4.1.604.2. Methods: Single cell genotyperXdinmZ1inZ2inYinδα, βpiiκεdms γdsGdkmn ∈ {1, . . . , Ni}i ∈ {1, . . . , I}d ∈ {1, . . . , D}k ∈ {1, . . . ,K}m ∈ {1, . . . ,Md}δ |α,β ∼ Beta(δ |α,β )Yin|δ ∼ Bernoulli(Yin|δ )pi i|κ ∼ Dirichlet(pi i|κ)Z1in|pi i ∼ Categorical(Z1n |pi i)Z2in|pi i ∼ Categorical(Z2in|pi i⊗pi i)Gdkm ∼ Uniform(Gdkm|Sd)εdms|γds ∼ Dirichlet(εdms|γds)gdnm|G,Yin,Z1in = k,Z2in = (k1,k2) ={Gdkm Yin = 0Gdk1m⊕Gdk2m Yin = 1Xidnm|ε,gdnm ∼ Categorical(Xdinm|εdgdnm)Figure 4.1: Probabilistic graphical model representing the SCG model. Shaded nodes represent observedvalues or fixed values, while the values of un-shaded nodes are learned.614.2. Methods: Single cell genotyper4.2.2 Model descriptionAssume that we have M events measured for N data points. In general the N data point should correspond tosingle cells but we will use the term data point to allow for the cases were multiple cells may be measuredby accident. For now we will assume all events are of the same data type. The data can be represented bya matrix X with dimensions N rows and M columns. Let S be the set of possible genotype values for theevents, andT be the set of observable values for the events. For example, for SNVs we takeS = {A,AB,B}to indicate the presence of one or both alleles. We could take T = {?,A,AB,B} where the state ? indicatesa missing value. This would allow missing values to be encoded in an informative way. In practice we willtake S = T and treat missing values as missing completely at random [Ghahramani and Jordan, 1995].However, we will develop the more general formulation were S and T may differ for future flexibility.Given that the true genotype state of an event m is s∈S , the probability of observing a value t ∈T is givenby εst ∈ [0,1]. Intuitively, εss represents the probability we make a correct measurement of the genotype and∑t 6=s εst represents the probability of making an error in measurement.To share statistical strength among data points we assume that there is set of K N clonal populationswhich cells are sampled from. The definition of a clonal population is a group of cells which have identicalgenotypes. In practice we will only be able to distinguish clonal populations with genotypes that differ for atleast one of the M measured events. The assumption of clonal population structure naturally leads us to usea mixture model to describe the data [McLachlan and Peel, 2004]. Let Zn ∈ {1, . . . ,K} indicate the clonalpopulation which the cell measured by data point n belongs to. For now we will assume that each data pointis associated with a single cell, so that this is well defined. We define the genotype of clone k at event m tobe Gkm, where we assume a priori any possible value fromSd is equally likely. We assume that K is knownso that we have a finite mixture model. We will discuss how to infer K in section 4.2.7.The basic model can be formulated as followspi|κ ∼ Dirichlet(pi|κ)Zn|pi ∼ Categorical(Zn|pi)Gkm ∼ Uniform(Gkm|S )εs|γs ∼ Dirichlet(εs|γs)Xnm|ε,G,Zn = k ∼ Categorical(Xnm|εGkm)624.2. Methods: Single cell genotypers\t A AB BA 9 0.5 0.5AB 1 1 1B 0.5 0.5 9Table 4.3: Example parameter settings for γ parameter when using SNV data. The rows correspond tohidden (genotype) states and the columns correspond to observed states. Each row thus represents a settingof γs for s ∈ {A,AB,B}. Values are pseudo-counts in the Dirichlet distribution for εs.There are several model hyper-parameters that need to be set. The parameter κ controls the distribution overclone prevalences, or in mixture model terminology the component mix-weights. We take all componentsof the vector κ to be the same so that we have a symmetric Dirichlet distribution. We further set this valueto 1 for all analyses performed, but leave this as a configurable parameter in the software. The parameterγs specifies prior belief about the distribution of observed states given that the genotype states is s. Thisparameter ensures the components of the emission distribution are identifiable if set in an informative way.For example, we use the values in table 4.3 values for γ when applying the SCG model to analyse thesynthetic benchmark data .These values place a high prior probability that the observed state matches the true genotype state whenonly one allele is present in the genotype. When the genotype has both alleles, we assume that the observedstate is uniformly drawn from the observable states. This corresponds to prior knowledge that allelic dropoutonly affects heterozygous loci.4.2.3 Extension for position specific error ratesAs discussed in the previous section, εst represents the probability of observing state t given the genotypehas state s. In the simple model this parameter is shared across all event. We can relax this and allow forevent specific error rates, εmst , where we now have an additional index for events. For SNVs this maybebeneficial if the mutational genotypes of the events are variable. For example, the probability of allelicdropout for the mutational genotype AB is expected to be higher than for AABB since the latter genotype hastwice as many copies of each allele. With more copies of each allele the chance of failing to amplify one willdecrease. This has been noted in previous studies, which have selected cells in the G2/M cell cycle stage(where cells have duplicated the genome before division) to reduce dropout [Wang et al., 2014]. Similarly,634.2. Methods: Single cell genotyperthe chance of dropping an allele may increase if the allele relatively less common than the other allele inthe mutational genotype. We may expect a larger proportion of measurements of the B observed state forthe genotype ABBB than the A observed state. Though we do not pursue it in this work, we could also makethe prior on εmst position dependent, perhaps using information about coincident copy number variation toinform the prior.4.2.4 Extension to multiple data typesWe can extend the basic model to handle multiple data types measured concurrently on the same data points.In section 4.3.7 we show a data set with both SNV and rearrangement breakpoint data profiled concurrentlyin cells. Assume that we measure D separate data types, with Md events for data type d. Further define Sdand Td to be the genotype and observable states for data type d. We define the probability of observing statet ∈ Td given genotype state s ∈Sd as εdst with corresponding prior parameter γdst . Let the genotype of thekth clone at event m for data type d be Gdkm. The data is now represented by a three dimensional raggedarray, X , with dimensions (D,N,Md).With this notation the model can be extended as followspi|κ ∼ Dirichlet(pi|κ)Zn|pi ∼ Categorical(Zn|pi)Gdkm ∼ Uniform(Gdkm|Sd)εds|γds ∼ Dirichlet(εds|γds)Xdnm|ε,G,Zn = k ∼ Categorical(Xdnm|εdGdkm)Though the notation is cumbersome, the model is virtually identical to the basic model.4.2.5 Extension to multiple samplesIf data points are measured from multiple tissue samples it maybe beneficial to include this information inthe model [Gerlinger et al., 2012, Bashashati et al., 2013, Eirew et al., 2014, de Bruin et al., 2014, Hong et al.,2015]. Assume the input data comes from I related tissue samples and we sequence Ni cells from samplei. By related tissue samples, we mean samples that will contain clones from the same phylogeny. Some644.2. Methods: Single cell genotyper⊕ A AB BA A AB ABAB AB AB ABB AB AB BTable 4.4: Definition of the binary operator ⊕ used to combine SNV genotypes of two cells.examples would include: spatially separated tumour masses from a patient; temporally acquired biopsiesfrom a patient; cell line or xenograft samples from multiple passages. Let the proportion of cells from clonalpopulation k in sample i be given by piik. The modification to the basic model is then given bypi i|κ ∼ Dirichlet(pi i|κ)Zin|pi ∼ Categorical(Zin|pi)Gkm ∼ Uniform(Gkm|S )εs|γs ∼ Dirichlet(εs|γs)Xinm|ε,G,Zin = k ∼ Categorical(Xinm|εGkm)The posterior distribution for pi i gives a measure of the prevalence of the clones in each sample. Thisposterior will be corrected for doublets and uncertainty in the assignment of data points to clusters. Inaddition we can use the posterior distribution of pi i to quantify uncertainty in the prevalence estimates.4.2.6 Extension to model doubletsWe can extend the model to handle the case where some data points result from measuring two cells. Weassume that measuring two cells is a rare event, and measuring more than two cells is extremely rare. Thuswe will only focus on the extension to two cells, or doublets, with higher numbers of cells assumed to bemeasured sufficiently infrequently to be of negligible impact. The basic logic developed to handle two cellscould be extended to handle more cells, but the number of combinations of clusters grows exponentiallywith the maximum number of cells that could be measured by a data point.To model multiple cell measurements, we need to define the expected genotype state when two cells aremeasured together. To that end we need to define a binary operation, ⊕, that describes how we combinegenotype states. In table 4.4 we show how to defined⊕ for SNV data. With this definition, when we combinetwo cells with same genotype which is homozygous, the combined state is homozygous for that allele. All654.2. Methods: Single cell genotyperother combinations yield a heterozygous state. For presence/absence data such as a binary representation ofSNVs or rearrangement breakpoints, we can use a logical or to define ⊕.We introduce the variable Yn which is 0 when data point n is a single cell measurement and 1 whenit is a doublet measurement. We let the probability of sampling a doublet be given by δ . We redefinethe variable Zn ∈ {1, . . . ,K} from the previous sections as Z1n to indicate the clone of origin if data pointn is a single cell measurement. We introduce a new variable Z2n ∈ {(k1,k2)|k1,k2 ∈ {1, . . . ,K}} to indicatethe clone of origin for each of the two cells if data point n is a doublet measurement. Define pi ⊗ pi =(pi1 ·pi1,pi1 ·pi2, . . . ,piK ·piK)∈RK2 to be the vector of probabilities of all doublet combinations. The extendedmodel is then defined asδ |α,β ∼ Beta(δ |α,β )Yn|δ ∼ Bernoulli(Yn|δ )pi|κ ∼ Dirichlet(pi|κ)Z1n |pi ∼ Categorical(Z1n |pi)Z2n |pi ∼ Categorical(Z2n |pi⊗pi)Gkm ∼ Uniform(Gkm|S )εs|γs ∼ Dirichlet(εs|γs)gnm|G,Yn,Z1n = k,Z2n = (k1,k2) =Gkm Yn = 0Gk1m⊕Gk2m Yn = 1Xnm|ε,gnm ∼ Categorical(Xnm|εdgnm)In general we expect doublets to be rare, and we use the prior on δ to enforce this constraint. For allruns reported in this work we set α = 1 and β = 99.In general, modelling doublets should not effect performance if no doublets are present. The value ofδ inferred should be extremely close to 0, forcing all data points to be assigned a value of 0 for Yn, thusselecting the single cell explanation of the data. However, convergence issues during inference may causeworse performance. In addition, considerable computational complexity is added by considering doublets.The memory required to store cluster posteriors for the doublet naive model is O(NK), whereas the doubletmodel requires O(NK2). One theoretical case were modelling doublets maybe mis-leading is if mutations664.2. Methods: Single cell genotyperare lost in descendant clones. In this case the problem may become unidentifiable, and it is possible togenerate valid genotypes which may appear to be doublet measurements. This can be mitigated by includingadditional events which are gained in the clone for which the mutation loss occurs. We leave the option touse the doublet naive model as an option in the software. The user should select this model if they feel nodoublets are likely in their data or they have the identifiability issue described.4.2.7 InferenceWe use a mean field variational method to infer the model parameters [Bishop et al., 2006]. Variationalinference seeks to approximate the posterior distribution p(θ |X) by a simpler distribution q(θ). Specificallywe seek to minimise the Kullback–Leibler divergence between q(θ) and p(θ |X) given byDKL(q(θ)|p(θ |X)) =ˆq(θ) log(q(θ)p(θ |X))dθ=ˆq(θ) log(q(θ)p(X ,θ))dθ + log p(X)= DKL(q(θ)|p(X ,θ))+ log p(X)Since log p(X) does not depend on q, we can minimise DKL(q(θ)|p(θ |X)) with respect to q by minimisingDKL(q(θ)|p(X ,θ)). The latter term is simpler to work with as it does not depend on the normalisationconstant of the posterior, which is assumed intractable to compute. The standard mean field approximationassumes that q maybe written as a product of distributions which depend on only a single model variable.q(θ) = ∏iqi(θi)The assumption that each qi depends on a a single model variable can be relaxed so that each qi depends ona subset of variables, provided the subsets are disjoint.It is a standard result [Bishop et al., 2006] that the optimal value of qi(θi) can be obtained by computingqi(θi) ∝ exp(E∏ j 6=i q j(θ j) [log p(X ,θ)])To optimise the full distribution, q, we iteratively update each qi conditioned on the previous value of674.2. Methods: Single cell genotyper{q j} j 6=i. This procedure is guaranteed to decrease DKL(q(θ)|p(X ,θ)) which can be monitored to assessconvergence.For the SCG model we assume the following factorisation.q(G,Y ,Z1,Z2,pi,ε) = q(G)q(Y ,Z1,Z2)q(pi)q(ε)=[K∏k=1D∏d=1M∏m=1q(Gkdm)][I∏i=1Ni∏n=1q(Yin,Z1in,Z2in)]×[D∏d=1Sd∏s=1q(εds)][I∏i=1q(pii)]The first line follows from the mean field approximation, while the second line follows from conditionalindependence structure of the model. It should be noted that we do not break the dependencies betweenYin,Z1in,Z2in since the appear together in q(Yin,Z1in,Z2in).To express the updates for the optimisation procedure more concisely we introduce the "following vari-ational parametersξink = E[I(Yin = 0,Z1in = k)]ζink` = E[I(Yin = 1,Z2in = (k, `))]ηdkms = E[I(Gdkm = s)]xdinmt = I(Xdinm = t)µdmst = E[logεdmst ]νik = E[logpiik]The expectations are taken with respect to the variational distribution q, excluding contributions from termsinside the expectation.Conveniently qi is either a Dirichlet or Categorical distribution for all parameters. Thus the expectationsdefined in the previous equations can be computed analytically. Below we give the updated parametervalues for q when q is a Dirichlet distribution. When q is a Categorical distribution, we give the formulas to684.2. Methods: Single cell genotypercompute all values in the range of the associated random variable up to the normalisation constant.γ¯dmst = γdmst +I∑i=1Ni∑n=1K∑k=1ξinkηdkmsxdinmt +I∑i=1Ni∑n=1K∑k=1K∑`6=k∑u,v∈Sd :u⊕v=sζink`ηdkmuηd`mvxdinmt +I∑i=1Ni∑n=1K∑k=1∑u∈Sd :u⊕u=sζinkkηdkmuxdinmtq(εds) ∼ Dirichlet(γ¯ds)logq(Gdkm = s) ∝1|Sd | +I∑i=1Ni∑n=1∑t∈Tdξinkµdstxdinmt +2I∑i=1Ni∑n=1K∑`6=k∑w∈Sd∑u:s⊕u=wζink`ηd`muµdwtxdinmt +∑u∈Sd :s⊕s=uζinkkµdutxdinmtlogq(Yn = 0,Z1n = k) ∝ E[log(1−δ )]+νik +D∑d=1Md∑m=1∑s∈Sd∑t∈Tdηdkmsµdstxdinmtlogq(Yn = 1,Z2n = (k, `)) ∝ E[logδ ]+νik +νi`+D∑d=1Md∑m=1∑s∈Sd∑t∈Td∑u,v:u⊕v=sηdkmuηd`mvµdstxdinmtlogq(Yn = 1,Z2n = (k,k)) ∝ E[logδ ]+2νik +D∑d=1Md∑m=1∑s∈Sd∑t∈Td∑u:u⊕u=sηdkmuµdstxdinmtα¯ = α+I∑i=1Ni∑n=1K∑k=1K∑`=1ζink`β¯ = β+I∑i=1Ni∑n=1K∑k=1ξinkq(δ ) ∼ Beta(α¯, β¯ )694.2. Methods: Single cell genotyperκ¯ik = κk +Ni∑n=1ξink +Ni∑n=1K∑`=1ζink`+Ni∑n=1K∑`=1ζin`k= κk +Ni∑n=1ξink +2Ni∑n=1∑`6=kζink`+Ni∑n=1ζinkkq(pi i) ∼ Dirichlet(κ¯i)When understanding and interpreting the update equations it is useful to note that most decouple into fourparts:1. A prior term2. A term for single observations3. A term for doublets from different clonal populations4. A term for doublets from same clonal populationTo assess convergence we compute DKL(q(θ)|p(X ,θ)) after updating all model variables. If the valuedoes not change by a pre-determined threshold, we determine that optimisation has converged and returnthe current variational parameters. The inference procedure outlined above is only guaranteed to convergeto a local optima. To attempt to find the global optima we perform multiple random restarts, 100 for thesynthetic data sets and 1000 for the real data sets (figure 4.2). We seed the random restarts with valuesfrom 0 to J−1, where J is the number of restarts. This significantly increases the computational cost of themethod, but the restarts can be performed in parallel to reduce actual run time.Model selectionA key benefit of using variational Bayesian inference is that model selection is relatively simple [Corduneanuand Bishop, 2001]. Since we attempt to approximate the posterior density, we obtain many of the benefits offull Bayesian inference. In particular there is an intrinsic penalty for fitting overly complex models. In thecontext of mixture models, this means that the model will tend to use fewer than the K possible clusters. Inpractice we initialise K to a much larger value than we expect. We then rely on the intrinsic model selectionto pick the true number of clusters K∗K. To be precise, we set K∗ to the number of clusters which have atleast one data point assigned to them. Data points can be assigned to clusters by choosing the most probablecluster, which for singlets is maxk q(Yn = 0,Zin = k).704.2. Methods: Single cell genotyperFigure 4.2: Plot of normalised difference of lower bound from maximum value attained across restarts. Linesrepresent mean values and error bars represent one standard deviation from mean across the 90 syntheticdata sets.4.2.8 Alternative methodsHierarchical clustering We apply hierarchical clustering to the rows of the N×M matrix of variant al-lele frequencies. To construct the distance matrix we use the Euclidean (L2) metric. We use the averagelinkage method during the agglomerative clustering stage. In order to generate a flat clustering we usedynamicTreeCut [Langfelder et al., 2008] method to automatically cut the dendrogram generated by hier-archical clustering. Unlike the other model based approaches, hierarchical clustering provides no estimateof the clonal genotype associated with each cluster. Thus we exclude it from comparisons of genotypeprediction accuracy.We used the dist and hclust methods provided in the R (3.1.3) software package to compute dis-tance matrices and perform agglomerative clustering. We used the dynamicTreeCut (1.6.2) packagedownloaded from CRAN repository to perform flat clustering.Categorical mixture model The Bernoulli mixture model (BMM) is a well known approach for clusteringbinary valued data [Bishop et al., 2006]. It has previously been applied in the single cell field to clustercells and infer clonal genotypes [Gawad et al., 2014]. The previous work was restricted to using a binaryrepresentation of SNV events, corresponding to the presence or absence of the B allele. It is straightforward714.2. Methods: Single cell genotyperXnmZnpiκµkmγn ∈ {1, . . . , N}m ∈ {1, . . . ,M}k ∈ {1, . . . ,K}pi|κ ∼ Dirichlet(pi|κ)Zn|pi ∼ Categorical(Zn|pi)µkm|γ ∼ Dirichlet(µkm|γ)Xnm|µ,Zn = k ∼ Categorical(Xnm|µkm)Figure 4.3: Probabilistic graphical model representing the Categorical mixture model. Shaded nodes rep-resent observed values or fixed values, while the values of un-shaded nodes are learned.to extend the BMM to a Categorical mixture model (CMM) to handle more general observation data withmultiple discrete states. This is done by replacing the Bernoulli observation distribution with a Categoricaldistribution and the Beta prior distribution on cluster means with a Dirichlet distribution. A full descriptionof this model is given in figure 4.3. We can extend the CMM to handle multiple data types and samplespecific mixing proportions in the same way as the SCG model. There is no corresponding way to extendthe CMM to handle doublets as the cluster means are continuous valued, so that the formalism developedfor the SCG method cannot be applied. We perform inference for this model using the same mean fieldvariational method as for the SCG model, with the same model selection strategy as for the SCG model.This model is implemented in the SCG package along with the code for variational Bayes (VB) infer-ence. To attempt to find the global optima we perform multiple random restarts, 100 for the synthetic datasets and 1000 for the real data sets.BitPhylogeny BitPhylogeny attempts to cluster cells, infer clonal genotypes and infer the clone phylo-geny jointly [Yuan et al., 2015]. The model uses a tree structured stick breaking (TSSB) prior over par-titions of the data points [Ghahramani et al., 2010]. This prior allows for the number of clusters to be724.2. Methods: Single cell genotyperSymbol DefinitionM Number of events to simulate.N Number of data points to simulate.m Number of events converted from A to ABbetween parent and child clone.λ Average proportion of events converted from Ato AB between parent and child clones.ω Probability an event in the AB state is convertedto the B state.pi Vector of proportion of clones in cell population.µ Probability an event is set to be missing.Table 4.5: Definition of parameters used for simulation.inferred automatically while imposing a hierarchical structure among cluster parameters. For each event mand cluster k the authors the define the parameter θkm ∈ (−∞,∞) which evolves according to a continuoustime Markov process along the tree. The observed data is modelled as Bernoulli variable with parameterσ(θkm) = 11+exp(−θkm) . The emission model makes this approach very similar to the Bernoulli mixture modelapproach, but imposing a phylogenetic model on cluster centres (clonal genotypes). The original model isrestricted to modelling binary valued data. It should be possible to extend the model to handle multiple datatypes and sample specific mixing proportions, though we consider only the basic model in this work. Likethe CMM, the continuous nature of the clonal genotypes means that modelling doublets is not possible inthe same way as the SCG model.We use the software provided by the authors with minor modifications to reduce memory and diskspace usage https://bitbucket.org/aroth85/bitphylogeny. All runs where performed byrunning the MCMC chain for 50,000 iterations collecting samples every 5th iteration after a burnin of 30,000samples. These were the settings used by the software authors in Yuan et al. [2015].4.2.9 Simulating synthetic data setsIn order to generate realistic ground truth data set for benchmarking purposes we employed a hybrid simu-lation strategy. In the first step we simulate a clone phylogeny and clonal genotypes in-silico. In the secondstep we simulate allelic count data by sampling from the empirical distribution of variant allele frequencies(VAFs) for a set of known diploid heterozygous positions. Finally, we discretise the data and set a propor-tion of values as missing. We give a detailed description of the simulation procedure below with the relevant734.2. Methods: Single cell genotyperA,A,A,AA,A,A,ABAdd clone Copy parentgenotypeA,A,A,AA,A,A,ABA,A,A,ABA,A,A,AA,A,A,ABAB,A,AB,ABMutate A to ABA,A,A,AA,A,A,ABAB,A,B,ABA,A,A,AA,A,A,ABConvert AB to BFigure 4.4: Example of one step from the clone phylogeny simulation procedure.parameters defined in table 4.5. For brevity we use the term clone interchangeably with clonal population inthe following description.We provide a schematic of the following procedure in figure 4.4. We generate the clone phylogeny byfirst specifying the number of events, M, we wish to simulate. We initialise the root node in the phylogenywith the genotype that has the A state for all M events. We then generate new clones until all M eventshave been converted to the AB or B state in at least one clone. When we add a new clone, we choose aparent clone from the set of existing clones uniformly. We then copy the genotype of the parent clone to thenew clone. We next mutate a Poisson distributed number of sites, m, to the AB state, where the parameterfor the Poisson is λM. As a result each clone differs on average from its parent by a proportion λ of theevents. We enforce the infinite site assumptions so that only sites in the A state which have not been mutatedelsewhere on the tree can be mutated. In the corner case where the number of events to be mutated exceedsthe number of remaining events which have not been set to the AB or B state somewhere on the phylogeny,we set m to mutate the remaining events. We next set each event in the AB genotype state to the state B withprobability ω to simulate loss of heterozygosity events. Note, we do not allow back mutations from the ABor B to A genotype at any stage, so that this procedure is guaranteed to end provided λ > 0. At the end ofthe procedure we return the phylogeny, defined as the graph of parent child relationships. We also return theset of clonal genotypes.To accurately simulate allelic dropout we generated a real data set by sequencing a panel of SNVspredicted to be heterozygous in the 184-hTert cell line. This cell line is largely diploid with amplificationof chromosome 20 known to occur in later passages. We excluded positions on chromosome 20 so that allevents are expected to be in diploid regions. We sequenced a panel of 48 SNVs in 88 cells using a single-744.2. Methods: Single cell genotyperplex PCR amplification of each loci. We included bulk genomic DNA (gDNA) as a positive control for thePCR primers. We only included 32 events with a variant allelic frequency (VAF) between 0.4 and 0.6 inthe gDNA and that had coverage in more than 50% of cells. We took the distribution of VAFs at the targetsites as the empirical distribution for AB state. For the A state, we took the empirical distribution to be theaverage background VAF computed over 30 bases in either direction of the target heterozygous position,excluding the target position. We set the empirical distribution for the B state to be one minus the values inthe A state. We thus had three matrices of 88x32 allele frequencies, one for each state. We associated eachof the M synthetic events with one of the 32 real events for the next step.We randomly sampled the distribution, pi , of clonal prevalences from a symmetric Dirichlet distributionwith parameter 1. We then proceeded to sample count data for N data points. We randomly decided if a datapoint was a doublet with probability δ . If the data point was not a doublet, we sampled a clone of originfrom a Categorical distribution with parameter pi . If the data point was a doublet, we sampled two clonesin the same way, and then combined their genotypes using the binary operator discussed in section 4.2.6.For doublets, we used the combined genotype for the remainder of the procedure. We generated count databased on the genotype of the associated clone. Specifically, we looked up the state of the associated clonegenotype for the event. We then used the relevant column from the matrix of real VAFs for the given state.For each event we simulated a depth of coverage, d, from a Poisson distribution with mean parameter 1000.We then simulated the number of variant reads from a Binomial with the parameter given from the empiricaldistribution, and depth d. Finally, we applied the Binomial exact test to the data using the average variantallelic frequency of the state A empirical distribution as the null rate for each allele. The final step was torandomly assign events as missing with probability µ . If an event was set as missing, the depth of coverageand variant allele count was set to 0. In addition the p-values for the presence of both the A and B allele wasset to 1 for missing events. To discretise the data, we used a p-value threshold of 10−6 to determine if eachallele was present.4.2.10 Performance metricsWhen comparing the various methods we wish to quantify two aspects of their performance: i) how welleach method clusters cells from same clonal population together; ii) how accurately each method predictsthe genotype of the clonal populations.In the absence of doublets we can use the V-measure metric to assess clustering accuracy [Rosenberg754.3. Resultsand Hirschberg, 2007]. When we consider doublets the problem becomes more difficult as data pointsmay belong to multiple clusters. This changes the problem from a strict clustering problem, to a restrictedfeature allocation problem [Broderick et al., 2013]. We use the B-Cubed metric, extended to handle featureallocations, for comparing the performance of algorithms in the presence of doublets [Amigó et al., 2009].When we compute this metric for methods which strictly cluster data-points into disjoint sets, we convertthe predicted clustering into a feature allocation. This is possible since every clustering (partition) is a afeature allocation [Broderick et al., 2013].In the absence of doublets we measure the hamming distance between the predicted genotype of themost likely cluster for a cell and the true genotype of the cell. We compute the mean across all cells tosummarise a methods performance. We distinguish methods which predict only the binary genotype, that isthe presence or absence of the B allele, from those which attempt to predict the three state genotype {A, AB,B}. The predictions for any method which predicts the three state genotype can be converted to a binaryrepresentation by mapping the AB, B states to the B allele present state.Again considering doublets complicates this problem due to the fact data points can be assigned multipleclonal genotypes. To assess the accuracy of clonal genotype inference in the presence of doublets, we firstcompute all predicted clonal genotypes. Recall these correspond to the means of the clusters, where we onlyconsider clusters which are the most probable for at least one data-point. We then compute the minimumhamming distance between all predicted and true clonal genotypes. We use the maximum of this value asthe metric to compare the performance of models. The rationale for this metric is that we expect methodswhich ignore doublets to predict spurious clusters which are the combination of the genotype of two truepopulations. These clusters should tend to be dissimilar from any true cluster. By contrast, methods whichconsider doublets should tend to correctly assign data-points from doublets to two clusters, thus avoidingthe prediction of spurious genotypes.4.3 Results4.3.1 Allelic dropout causes a multi modal distribution of variant allele frequencyWe first sought to determine the expected distribution of variant allele frequencies at heterozygous SNVevents in single cell data using a controlled data set. In figure 4.5 we show the distribution of VAFs at knownheterozygous events for the cell line data set described in subsection 4.2.9. The expected allelic distribution764.3. Resultsin the absence of allelic dropout is uni-modal and centred at 0.5 (the expected VAF for a heterozygousdiploid mutation). The distribution we observe is quite different from this expectation, with two majorpeaks at 0 and 1. In addition we see a very broad distribution of VAFs from 0 to 1 outside these peaks.To confirm these results were not specific to the single cell sequencing protocol used or systematicerrors at our institution, we repeated a similar analysis using previously published data from six patientswith childhood acute lymphoblastic leukemias sequenced using MDA instead of single plex PCR [Gawadet al., 2014]. In this study the authors included a panel of SNPs known to be commonly heterozygous in thepopulation. Because we do not know if a given event is heterozygous for a patient, we only show events witha mean VAF between 0.4 and 0.6 across all cells from a patient. Though biased towards our expectationswithout allelic dropout, this analysis recapitulates the results from the cell line data with peaks at 0 and 1,and a broad distribution of VAFs (figure 4.6).These results suggest that using the observed count data from single cell sequencing experiments asinput may be sub-optimal. Standard clustering algorithms such as Binomial mixture models or Gaussianmixture models would identify at least two clusters at 0 and 1 in these data sets. We suggest that discretisingthe data into three classes based on whether we can detect the A, B or both alleles is a more robust approach.This allows us to use a very sensitive method for the detection of alleles, such as the binomial exact test withthe null hypothesis that the observed counts are due to sequencing error [Shah et al., 2009b]. In addition, byformulating the model in terms of a discrete state space it can be generalised to include other data types.4.3.2 Modelling heterozygous genotype states improves performance for the SCG modelThe first benchmark we performed using synthetic data was to determine if it was beneficial to model thegenotypes of SNVs using three states instead of the previously applied approaches which only consider twostates. When using three states we distinguish between the AB and B genotype states, whereas the two stateapproach just considers the presence or absence of the B allele. Because allelic dropout only affects the ABstate, we expect that we can gain additional power in modelling this state separately from the B state. Toinvestigate this hypothesis we compare the Categorical mixture model with two (CMM-2) and three states(CMM-3) as exemplars of models which use a continuous hidden state space for genotypes. Note that theCMM-2 model is the same as the BMM, but we use the current nomenclature to clarify the comparison. Asexemplars of models using a discrete state space for the genotypes, we apply the SCG model using threestates with a position specific error rate (SCG-3-P) and the same model restricted to binary data (SCG-2-P).774.3. Results0.0 0.2 0.4 0.6 0.8 1.0VAF0100200300400500600700CountFigure 4.5: Distribution of allele frequencies from heterozygous positions in the diploid 184-hTert cell line.Only positions with a mean variant allelic frequency between 0.4 and 0.6 in the genomic DNA controls areincluded.784.3. Results0.0 0.2 0.4 0.6 0.8 1.005101520253035Patient 1 (n=3)0.0 0.2 0.4 0.6 0.8 1.00510152025303540CountPatient 2 (n=2)0.0 0.2 0.4 0.6 0.8 1.0020406080100Patient 3 (n=3)0.0 0.2 0.4 0.6 0.8 1.0020406080100CountPatient 4 (n=8)0.0 0.2 0.4 0.6 0.8 1.0VAF020406080100Patient 5 (n=15)0.0 0.2 0.4 0.6 0.8 1.0VAF050100150200250300CountPatient 6 (n=5)Figure 4.6: Distribution of variant allele frequencies for putative heterozygous SNPs from six patients withchildhood acute lymphoblastic leukemia. Allele frequencies are plotted for SNP positions with a mean VAFbetween 0.4 and 0.6 across all cells sequenced. n indicates the number SNP positions included per patient.794.3. ResultsWe also include hierarchical clustering as an exemplar of non model based methods which cluster based onvariant allelic frequency.We summarise the results across a number of simulation regimes in figure 4.7. The CMM-2 uniformlyoutperforms the CMM-3 model across all regimes, suggesting that using a richer hidden state space isnot beneficial for models which use a continuous state space. In contrast, the three state SCG-3-P modeluniformly outperforms the two state SCG-2-P model. It is unclear why the discrete state space modelsbenefit from the richer state space, though it maybe due to the differing way the models are parameterised.For the CMM family of models, the number of parameters for modelling the error rate is |S | ×K ×Mwhereas the SCG family of models use |S | ×M parameters. The additional K×M parameters availableto the CMM-3 (compared to the CMM-2) model may be causing overfitting. In figure 4.7b we see theperformance difference between the CMM-2 and CMM-3 model decrease as we include a larger number ofdata points, supporting the idea that overfitting maybe occurring on smaller data sets.Overall the SCG-3-P method outperforms all other methods, with the ranking of other methods varyingacross simulation regimes.4.3.3 Phylogenetic constraints do not significantly improve performanceWe next sought to investigate whether models which include phylogenetic constraints improve perform-ance. To that end we compared the CMM-2 model with the BitPhylogeny model. The CMM-2 model issimilar to the BitPhylogeny model without the constraint that clonal genotypes follow a phylogenetic tree.Our simulation strategy closely approximated the assumptions of the BitPhylogeny model, so we expectedBitPhylogeny to outperform the CMM-2 model.Surprisingly the CMM-2 method uniformly outperforms the BitPhylogeny model (figure 4.8). Interest-ingly the BitPhylogeny model seems to maintain a constant performance as the proportion of missing dataincreases, whereas the CMM-2 performance degrades (figure 4.8a). This suggests that the phylogenetic con-straints maybe aiding the model to impute missing values, though performance does not exceed the CMM-2model even when 40% of the data is missing. The performance of the BitPhylogeny method increases as thenumber of data points increases, suggesting that software is working correctly (figure 4.8b). One reason thatBitPhylogeny maybe performing more poorly than the CMM-2 model is due to the complexity of inferencein the model. Due to the use of a non-parametric tree structured stick breaking prior, BitPhylogeny requiresthe use of MCMC sampling. While MCMC sampling is guaranteed to give an exact estimate of the posterior804.3. Results0.0 0.2 0.4Proportion of events missing0.00.20.40.60.81.0V-measureNumber of data points = 1000.0 0.2 0.4Proportion of events missingNumber of data points = 400ModelCMM-2CMM-3HCSCG-2-PSCG-3-P(a) Clustering performance as a function of the proportion of SNV events missing. We compare resultswith 100 (left) and 400 (right) data points. Data was simulated with 50 events and average divergence of10% of events between clones.100 200 400 800 1600Number of data points0.00.20.40.60.81.0V-measureAverage proportion of sites divergent between clones = 0.1100 200 400 800 1600Number of data pointsAverage proportion of sites divergent between clones = 0.2ModelCMM-2CMM-3HCSCG-2-PSCG-3-P(b) Clustering performance as a function of the number of data points. Data was simulated with 50 eventsand approximately 20% missing values.50 100 200 400Number of events0.00.20.40.60.81.0V-measureNumber of data points = 10050 100 200 400Number of eventsNumber of data points = 400ModelCMM-2CMM-3HCSCG-2-PSCG-3-P(c) Clustering performance as a function of the number SNV events. Data was simulated approxim-ately 20% missing values and average divergence of 10% of events between clones.Figure 4.7: Comparison of clustering performance of method using a binary representation of the data(CMM-2 and SCG-2-P) versus those using a three state representation (CMM-3 and SCG-3-P) and non-model based hierarchical clustering (HC). For each plot 10 trees were simulated and 3 data sets sampled foreach tree. For all data sets the proportion of sites which converted from AB to B in each generation was 0.2.814.3. Resultsdistribution asymptotically, convergence can be slow and difficult to diagnose. We use VB inference for theCMM-2 model which is significantly faster and convergence (to a local optima) is deterministic. Thus itis possible that with a significantly longer run of the MCMC chain, the performance of BitPhylogeny mayimprove. We do not pursue this however, as the run times are already several orders of magnitude longer forBitPhylogeny than any other methods we consider.These results suggest that though phylogenetically informed methods maybe beneficial, the additionalcomputational complexity may limit their potential. This is especially concerning given that data sets willlikely grow to 1,000s -10,000s of cells and 100s of events in the foreseeable future. Another concern whenapplying phylogenetic methods, is that the assumptions of the evolutionary model may not be appropriate.For example, BitPhylogeny assumes that no back mutations occur. In genomically unstable cancers, lossof SNVs due to coincident copy number changes can occur. Thus is maybe beneficial to consider simplermodels for clustering and clonal genotype inference, using the output of these models to perform phylogen-etic reconstruction. This approach would be more scalable and also conducive to considering a number ofevolutionary models readily available in existing software packages.4.3.4 Modelling doublets reduces the rate of false positive clonal population predictionA key assumption when applying clustering methods to single cell data to infer clonal genotypes is thateach data point is representative of a single cell. If this does not hold, clusters of data points from multiplecells will not be indicative of any true genotype. Though it is not known how frequently multiple cells aresequenced with current technologies, the presence of two cell data points (doublets) has occurred sufficientlyfrequently to induce spurious genotype clusters in some studies [Gawad et al., 2014]. To explore the effectdoublets have on clustering performance, and to determine whether our proposed model can detect andcorrect for them, we simulated sequence of data sets at varying proportions of doublets.In figure 4.9 we see show the family of B-cubed metrics comparing the doublet aware position specificSCG model (D-SCG-3-P) to the doublet naive equivalent (SCG-3-P). When no doublets are present, thedoublet naive method performs as well or better than the doublet aware method for all metrics. In principlethe doublet aware model should be able to converge to a solution where no doublets are assigned. It isunclear if its failure to do so is an intrinsic problem of the model or if the number of random restarts (100)is not sufficient to find a good solution. Given the significantly larger space of solutions that the doubletaware model must consider, it is conceivable the variational lower bound has a significantly larger number824.3. Results0.0 0.2 0.4Proportion of events missing0.00.20.40.60.81.0V-measureNumber of data points = 1000.0 0.2 0.4Proportion of events missingNumber of data points = 400ModelBitPhylogenyCMM-2(a) Clustering performance as a function of the proportion of SNV events missing. We compare results with100 (left) and 400 (right) data points. Data was simulated with 50 events and average divergence of 10% ofevents between clones.100 200 400 800 1600Number of data points0.00.20.40.60.81.0V-measureAverage proportion of sites divergent between clones = 0.1100 200 400 800 1600Number of data pointsAverage proportion of sites divergent between clones = 0.2ModelBitPhylogenyCMM-2(b) Clustering performance as a function of the number of data points. Data was simulated with 50 eventsand approximately 20% missing values.50 100 200 400Number of events0.00.20.40.60.81.0V-measureNumber of data points = 10050 100 200 400Number of eventsNumber of data points = 400ModelBitPhylogenyCMM-2(c) Clustering performance as a function of the number SNV events. Data was simulated approximately20% missing values and average divergence of 10% of events between clones.Figure 4.8: Comparison of clustering performance of a standard clustering model (CMM-2) to a phylogen-etically informed model (BitPhylogeny). For each plot 10 trees were simulated and 3 data sets sampled foreach tree. For all data sets the proportion of sites which converted from AB to B in each generation was 0.2.834.3. Resultsof local optima. As the proportion of doublets increases the precision of both methods remains constant,with the SCG-3-P method outperforming the D-SCG-3-P model. The precision is essentially a measure ofhow often data points that are predicted to co-occur, truly co-occur. The one subtlety is that we need tocount how many times features co-occur due to the feature allocation issue [Amigó et al., 2009]. The recallof the SCG-3-P method deteriorates more rapidly than the D-SCG-3-P method as the number of doubletsincreases. The recall is the converse of the precision. It measures how frequently data points that trulyco-occur are predicted to co-occur. Because of the feature allocation formulation, methods must be able toaccurately predict how many clusters are shared between data points. Thus the doublet naive method beginsto perform poorly as the proportion of doublets increases since it can only associate one cluster to each datapoint. The final plot shows the F-measure which is the geometric mean of the two previous metrics. Takentogether the results suggest that the doublet correction is working since performance stays nearly constantuntil 40% of cells are doublets. However, the benefits of using the doublet aware model do not appear to besignificant until the proportion of doublets is above 20%.The major problem that we expect when doublets occur is the inference of spurious clonal populations,with genotypes which are grossly different than any true population. These genotypes may confound down-stream analysis since they violate the perfect phylogeny assumption if they are combination of two existinggenotypes from divergent clades in tree. The previous analysis does not capture this. Instead we need a wayto compare the predicted clones to the true set of clones and quantify how distant the predicted genotypesare from truth. Unfortunately, there is no simple way to match the predicted clones to true clones. However,we can compute the distance for any predicted clone to the nearest true clone. If we are predicting spuriousclones, we expect the maximum of this value to be large, as no true clone matches the predicted genotype.In figure 4.10 we show the results of this analysis as we vary the proportion of doublets simulated. In theabsence of doublets, the doublet naive method again outperforms the doublet aware method. However, oncedoublets are present the performance of the doublet naive method quickly deteriorates. The doublet awaremethod is able to maintain constant performance until the proportion of doublets reaches 0.4. This suggeststhat the doublet aware method is avoiding inference of spurious clones.4.3.5 Automatic identification of doublets in real world data setsWe applied the SCG model to previously published data from a patient with childhood acute lymphoblasticleukemia [Gawad et al., 2014]. The authors of the study proposed using a Bernoulli mixture model to cluster844.3. Results0.0 0.05 0.1 0.2 0.4Proportion of doublets0.00.20.40.60.81.0PrecisionModelD-SCG-3-PSCG-3-P(a) B-cubed precision. The maximum value of 1 is attained for a pair of data points when the number of predicted sharedclusters is lower than or equal to the number of true shared clusters. Plotted value is the average across all pairs which arepredicted to share a least one cluster.0.0 0.05 0.1 0.2 0.4Proportion of doublets0.00.20.40.60.81.0RecallModelD-SCG-3-PSCG-3-P(b) B-cubed recall. The maximum value of 1 is attained for a pair of data points when the number of true shared clusters islower than or equal to the number of predicted clusters. Plotted value is the average across all pairs which truly share a leastone cluster.0.0 0.05 0.1 0.2 0.4Proportion of doublets0.00.20.40.60.81.0F-measureModelD-SCG-3-PSCG-3-P(c) B-cubed F-measure. The F-measure is geometric average of B-cubed precision and recall and reaches the maximum valuewhen both are 1.Figure 4.9: Feature allocation performance of doublet aware (D-SCG-3-P) compared to doublet naive (SCG-3-P) models. Data was simulated with 400 data points, 50 events, 10 trees from which 3 data sets wheregenerated, a mean divergence between clones of 0.1, the proportion of sites which converted from AB to Bin each generation was 0.2 and 20% of events were set to be missing.854.3. Results0.0 0.05 0.1 0.2 0.4Proportion of doublets0.00.10.20.30.40.5Maximum Hamming distance to nearest true cloneModelD-SCG-3-PSCG-3-PFigure 4.10: The maximum Hamming distance between a predicted clone and its nearest true clone plottedas function of the proportion of doublets.864.3. Resultsthe single cell data into clonal populations. In two of the patients in the study they observed clusters whichappeared to be a combination of other clusters with disjoints sets of mutations. The authors inferred the datapoints associated with these clusters were due to doublets, and manually removed them from downstreamanalysis. We applied the CMM-3, the SCG-3-P and the D-SCG-3-P models to data from one of these patients(patient 1). As the CMM-3 and SCG-3-P model are both doublet naive we expect them to infer the spuriousdoublet clusters, whereas the genotype aware D-SCG-3-P model should be identify the doublets and assignthem to their true clonal clusters.In figures 4.11, 4.12, 4.13 we show the raw results for each model. We also plot the predicted genotypesassociated with each cluster in figure 4.14. The CMM-3 model infers five clusters, with cluster 4 appearingto be a result of doublet cells from clusters 1 and 5. The SCG-3-P model infers six clusters, with cluster4 appearing to be the result of doublets from 3 (or 2) and 6 (or 5). The D-SCG-3-P model identifies threeclusters, none of which could be explained as doublet mixture of the other clusters.4.3.6 Inference of clonal dynamic in xenograft passagesRecall the the model parameter pi i represents the proportion of clones sample i. Our mean field variationalinference method infers an approximate posterior distribution q(pi i) for this parameter. We can use thisdistribution to estimate the mean prevalence of each clone across samples and assess our uncertainty in thisvalue. In contrast to the simple approach of counting the number of data points assigned to each clusteracross samples, the distribution q(pi i) will be calibrated for uncertainty in cluster assignment and doublets.We apply the D-SCG-3-P model to a previously published data set generated by sequencing single cellsfrom four sequentially acquired xenograft passages [Eirew et al., 2014].We show the results of this analysis in figure 4.15. The dominant clone in X1 (orange) is nearly ex-tinguished in passages X2 but re-emerges sub-dominantly in passage X5. A new clone (pink) emerges inpassage X2 and proceeds to become dominant in passage X4. Interestingly this clone then drops in preval-ence to co-dominate with a new clone (brown) in X5. Two prevalent clones in X1 and X2 (blue and green)appear to be extinguished in later passages. These results suggest a large amount of diversity is preservedbetween passages, with rare clones in one passage able to become dominant in the next passage. A keybenefit of this analysis is that our estimates of prevalences have an associated uncertainty. In passage X5this allows us to determine that the orange clone is more than one standard error more common than eitherthe blue, yellow or green clone. Such information will become increasingly important as we attempt to874.3. Results12345EventsData pointsFigure 4.11: CMM-3 model applied to C-ALL patient 1 data set. Predicted clusters for each data point areplotted in the colorbar on the right.884.3. Results123456EventsData pointsFigure 4.12: SCG-3-P model applied to C-ALL patient 1 data set. Predicted clusters for each data point areplotted in the colorbar on the right.894.3. Results123EventsData pointsFigure 4.13: D-SCG-3-P model applied to C-ALL patient 1 data set. Predicted clusters for each data pointare plotted in the colorbar on the right. Putative doublets event are marked in black to the right of the clustercolorbar.904.3. ResultsCMM-3 D-SCG-3-P SCG-3-P12345123123456Figure 4.14: Comparison of predicted clonal genotypes for CMM-3, D-SCG-3-P and SCG-3-P modelsapplied to C-ALL patient 1 data set.determine whether the fluctuations in prevalence we observe are due to drift or selection. We note that thisanalysis is completely automatic. The only heuristic we applied was to compute the posterior q(pi i) usingonly clones with at least one data point assigned.4.3.7 Identifying late driver amplifications in high grade serous ovarian cancerTo demonstrate the ability of our model to use multiple data types we applied the D-SCG-3-P model to apatient with high grade serous ovarian cancer (HGSOC). We had performed whole genome sequencing offive tumour samples taken from three tumour masses in the left ovary, right ovary and omentum from thepatient. We identified a high level amplification of the ERBB2 locus in four of the five samples. ERBB2is known to be an early driver event in Her2+ breast cancer, but has not been associated with HGSOC. Theabsence of the ERBB2 amplification in the LOv1 sample suggested that the amplification occurred afterthe initiation of oncogenesis. We designed PCR primers for a panel of clonaly informative SNVs based onPyClone [Roth et al., 2014] analysis. We also included several rearrangement breakpoints, including oneassociated with the ERBB2 amplification. We used the D-SCG-3-P model to jointly cluster the SNV andbreakpoint events. Our analysis identified 8 tumour clones and a normal population (cluster 2) of cells.Three of these clusters are small and the assigned data points noisy (clusters 1, 4, 5). We exclude these fromsubsequent discussion. Only a subset of the predicted clones had a genotype which contained an ERBB2breakpoint (figure 4.16 clusters 6, 7, 8, 9). This suggests that the amplification was acquired after theacquisition of the oncogenic initiating events of TP53 mutation and loss of heterozygosity of chromosome17. The putative primary site for this tumour was in the left ovary. We identified multiple populations of914.3. ResultsEventData pointA AB BX1 X2 X4 X5ClusterSampleDoublet(a) Clustering results.A AB BCluster(b) Inferred genotypes.X1 X2 X4 X5Sample0.00.20.40.60.81.0Prevalence(c) Inferred cellular prevalences. Mean valuesof q(piik) are plotted with error bars indicating√Var(piik).Figure 4.15: D-SCG-3-P model applied to xenograft SA501.924.4. Discussioncells with the amplification in one of the two samples (LOv2) taken from this site, while the other sample(LOv1) contained no cells with the amplification. In contrast all the metastatic sites contained clones withthe ERBB2 amplification. Taken together this suggests that the late acquisition of the ERBB2 amplificationmay have provided the clones with the ability to spread to metastatic sites. We also identify two populationswith genotypes that differ only by a pair of breakpoints (clusters 6 and 7). Without incorporating breakpointinformation we would not be able to resolve these clones.4.4 Discussion4.4.1 Limitations and extensionsThere are a number of avenues for improvement to the SCG model. We specify a symmetric Dirichlet priorfor pi i with each component of the prior distribution parameter being assigned a value of 1. This correspondsto a non-informative prior on clonal prevalences. If we were to specify values < 1, we would encouragesparse solutions with fewer clusters, while values > 1 would have the opposite effect. It would be desirableto learn this parameter during inference, rather than fixing it. This could be accomplished via numericaloptimisation of the KL divergence with respect to this parameter.The inference method we use is only guaranteed to find a local minima of the variational objectivefunction. This necessitates the need for multiple restarts, with no clear method to determine an appropriatenumber of restarts. We choose to use as many restarts as our computational budget would allow. Recentresearch has suggested that using a less naive factorisation of the variational distribution which breaks fewerdependencies may lead to improved convergence to the global optimum [Hoffman and Blei, 2014]. Thiswould be a worthwhile avenue of research that could potentially reduce the computational overhead of themethod, and lead to better performance.Finally, our method does not address the problem of inferring clonal phylogenies from single cell data.We performed a limited investigation of this comparing the CMM-2 to BitPhylogeny model. Our tent-ative conclusion is that incorporating phylogenetic constraints during clustering maybe of benefit, but theadditional computational burden currently leads to worse performance. The additional model complexity re-quired to incorporate phylogenetic constraints may also limit our ability to consider other useful extensionssuch as doublet modelling. In addition, phylogenetic modelling would only be useful if the evolutionarymodel is correct. For example, the BitPhylogeny model fails to account for the possibility a mutation maybe934.4. DiscussionEventData pointA AB B Absent PresentLOv1 LOv2 Om1 Om2 ROv1ClusterSampleDoubletERBB2 breakpointTP53chr17 SNPs123456789(a) Clustering results.A AB B Absent PresentClusterERBB2 breakpointTP53chr17 SNPs123456789(b) Inferred genotypes.Figure 4.16: D-SCG-3-P model applied to HGSOC patient with ERBB2 amplification.944.4. Discussionlost in descendant clones. We suggest that a more fruitful approach to the phylogenetic problem maybe toseparate the clustering and genotype inference task, from the inference of the phylogeny. The inferred clonalgenotypes can be readily used as inputs to classical phylogenetic algorithms.4.4.2 ConclusionsWe consider the problem of clustering single cell data points by clonal genotype. We present a novelmodel which provides several advances: the ability to assign interpretable discrete genotypes with morethan two states; the ability to jointly cluster multiple data types; the ability to identify doublet data pointsand resolve the clonal populations contributing to the doublet; the ability to model sample specific variationin clonal prevalences. We present a novel strategy to generate synthetic data sets which accurately representallelic dropout. Using data sets generated by this method, we are able to show that our model benefitsfrom considering a richer 3 state representation of SNV events as opposed to previously applied approacheswhich only consider the presence/absence of the variant allele. By comparing the closely related CMM-2and BitPhylogeny models, we show that inferring the clonal phylogeny during clustering does not improveour ability to accurately cluster data points or resolve clonal genotypes. Though limited, this comparisonhighlights that the additional computational complexity of joint tree inference and clustering may mitigatethe theoretical advantage of this approach. Using both synthetic and real data, we show that the doubletaware version of our SCG model can improve performance by removing spurious clonal populations. Weapply our method to a serially passaged xenograft data sets and show how it can be used to perform automaticinference of clonal prevalences across multiple samples. Finally, we show how our model can be appliedto events from multiple data types. This analysis allows us to better resolve the clonal population structure,while yielding easily interpretable clonal genotypes relating both data types.95Chapter 5Conclusion5.1 Summary of contributionsThis dissertation outlines a set of tools which can be used to systematically explore the clonal populationstructure of tumours. We begin with the problem of identifying the portfolio of somatic SNVs in a tumourfor use as markers of clonal populations. We next show how targeted deep sequencing of identified SNVscan be used to infer clonal population structure and track shifts in clonal prevalence across multiple samples.Finally, we use single cell sequencing data to infer precise clonal genotypes.The three major contributions of this thesis can be briefly summarised as follows.Robust identification of SNVs in tumour/normal genome sequencing data We developed a probabil-istic model to identify somatic SNVs in tumour/normal paired sequencing data sets. By sharing statisticalstrength between the data sets we showed a reduction in the proportion of false positive SNV predictionsattributable to mis-identified germline SNVs.Predictions of clonal population structure and cellular prevalence from targeted deep sequence dataWe developed a probabilistic model to simultaneously cluster SNVs into groups corresponding to clonalpopulations and infer the proportion of malignant cells harbouring these SNVs. Our method models muta-tional genotype, normal contamination and over-dispersion, which confound naive clustering of variantallelic frequencies.Prediction of clonal genotypes from targeted single cell sequencing data We developed a probabilisticmodel to infer clonal genotypes from noisy single cell sequencing data. Our method is able to: infer theunknown number of clonal populations; predict clonal genotypes for each population; identify spuriousmeasurements of multiple cells and infer the proportion of clones in multiple data sets. Though primarily965.2. Future workChapter Model URL2 JointSNVMix https://github.com/aroth85/joint-snv-mix3 PyClone https://bitbucket.org/aroth85/pyclone4 Single Cell Genotyper https://bitbucket.org/aroth85/scgTable 5.1: Source code repositories for software implementing the models described in this dissertation.developed to analyse SNVs, our model is flexible and can incorporate measurements from additional datatypes such as rearrangement breakpoints.5.2 Future workThis thesis has been restricted to identifying and interpreting somatic SNVs in cancer. Cancer genomescontain many other types of aberrations. An important area of future research will be to develop integ-rative methods which jointly profile these aberrations along with SNVs. We expect that methods whichjointly analyse multiple classes of aberrations will achieve better performance than independent identifica-tion. In particular, different classes of aberrations can provide complementary information that can be usedto resolve identifiability issues. For example, prediction of normal contamination and tumour ploidy are no-toriously difficult problems when analysing CNAs in tumours. Incorporating information about SNVs mayallow for the identification of clonally dominant mutations which can anchor these predictions. Conversely,information about coincident CNAs should provide additional power to detect SNVs present at low variantallele frequencies. In addition, the evolution of cancer genomes can involve all types of aberration. Thusconsidering multiple classes of aberration will provide a more accurate assessment of clonal genotypes andincreased power to stratify clonal populations.Clonal populations are related by descent. Inference of the phylogeny relating clones will be an im-portant problem once we can resolve clonal genotypes. This will allow us to understand the aetiology ofa tumour, identifying when in the evolutionary history driver aberrations are acquired. This in turn willprovide insight into mechanisms of metastasis and adaption to micro-environments. An interesting areaof research will be determining whether joint phylogenetic analysis and inference of clonal genotypes isbeneficial.Single cell sequencing is emerging as promising tool for studying clonal populations structure. In thefuture protocols will mature and the cost of sequencing will fall. With more data of higher quality the975.3. Concluding remarksnoise in measurements will decrease. However, current data is noisy, which means that measurements ofindividual cells cannot be treated a reliable representations of cell genotypes. A major theme of single cellsequencing research will be to develop methods which can group cells from the same clonal population toshare statistical strength and reduce the noise in measurement. Like bulk sequencing these methods willlikely benefit from joint consideration of multiple types of aberrations.While inference of clonal population structure and genotype is interesting in its own right, leveraging thisinformation to answer fundamental questions about the population genetics of tumours is the ultimate goal.There will be a number of open problems in the field related to estimating selection effects, determiningfitness landscapes and mapping patterns of migration. We foresee an opportunity for fruitful collaborationsbetween computational biologists, population geneticists, evolutionary biologists and mathematicians totackle these problems.5.3 Concluding remarksCancer is a complex disease that can only be understood by considering tumours as heterogeneous popula-tions of cells evolving according to the principles of Darwinian evolution. HTS technologies have allowedus to sequence the genomes of tumours and generate data with the potential to map the genomic diversity ofcancers in unprecedented detail. While our ability to generate data has rapidly advanced, methods to analysethe data and fully extract the rich body of information it contains continue to lag. In this dissertation wepresent a set of models which can be used to comprehensively catalogue and interpret the genetic diversityof tumours using SNVs. The methods presented in this thesis were among the first to be developed to solvethe problems they address. As discussed in sections 2.4.3 and 3.4.5, these methods have spurred the devel-opment of new fields of research and acted as benchmarks to which other methods have been compared.Furthermore, these methods have featured prominently in several major studies to identify somatic variants[Shah et al., 2012, Lee et al., 2012] and study clonal population structures [Shah et al., 2012, Bashashatiet al., 2013, Eirew et al., 2014, Cazier et al., 2014, Nordentoft et al., 2014, Hong et al., 2015], ultimatelyleading to real biological insights. We hope these tools will continue to be useful for cancer biologists in thedrive to understand the evolution of tumours.98Bibliography1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing.Nature, 467(7319):1061–73, October 2010. ISSN 1476-4687.Ahmed Ashour Ahmed, Dariush Etemadmoghadam, Jillian Temple, Andy G Lynch, Mohamed Riad,Raghwa Sharma, Colin Stewart, Sian Fereday, Carlos Caldas, Anna Defazio, David Bowtell, and James DBrenton. Driver mutations in TP53 are ubiquitous in high grade serous carcinoma of the ovary. J Pathol,221(1):49–56, May 2010. ISSN 1096-9896.Enrique Amigó, Julio Gonzalo, Javier Artiles, and Felisa Verdejo. A comparison of extrinsic clusteringevaluation metrics based on formal constraints. Information retrieval, 12(4):461–486, 2009.Samuel Aparicio and Carlos Caldas. The implications of clonal genome evolution for cancer medicine. NewEngland Journal of Medicine, 368(9):842–851, 2013.Samuel AJR Aparicio and David G Huntsman. Does massively parallel DNA resequencing signify the endof histopathology as we know it? The Journal of pathology, 220(2):307–315, 2010.Ali Bashashati, Gavin Ha, Alicia Tone, Jiarui Ding, Leah M Prentice, Andrew Roth, Jamie Rosner, KareyShumansky, Steve Kalloger, Janine Senz, et al. Distinct evolutionary trajectories of primary high-gradeserous ovarian cancers revealed through spatial mutational profiling. The Journal of pathology, 231(1):21–34, 2013.Timour Baslan, Jude Kendall, Brian Ward, Hilary Cox, Anthony Leotta, Linda Rodgers, Michael Riggs,Sean D’Italia, Guoli Sun, Mao Yong, et al. Optimizing sparse sequencing of single cells for highlymultiplex copy number profiling. Genome research, 25(5):714–724, 2015.Yuval Benjamini and Terence P Speed. Summarizing and correcting the GC content bias in high-throughputsequencing. Nucleic acids research, page gks001, 2012.99BibliographyDavid R Bentley, Shankar Balasubramanian, Harold P Swerdlow, Geoffrey P Smith, John Milton, Clive GBrown, Kevin P Hall, Dirk J Evers, Colin L Barnes, Helen R Bignell, et al. Accurate whole humangenome sequencing using reversible terminator chemistry. nature, 456(7218):53–59, 2008.Michael F Berger, Michael S Lawrence, Francesca Demichelis, Yotam Drier, Kristian Cibulskis, Andrey YSivachenko, Andrea Sboner, Raquel Esgueva, Dorothee Pflueger, Carrie Sougnez, et al. The genomiccomplexity of primary human prostate cancer. Nature, 470(7333):214–220, 2011.José M Bernardo. Bayesian statistics. In International Encyclopedia of Statistical Science, pages 107–133.Springer, 2011.Rameen Beroukhim, Craig H Mermel, Dale Porter, Guo Wei, Soumya Raychaudhuri, Jerry Donovan, JordiBarretina, Jesse S Boehm, Jennifer Dobson, Mitsuyoshi Urashima, et al. The landscape of somatic copy-number alteration across human cancers. Nature, 463(7283):899–905, 2010.Graham R. Bignell, Chris D. Greenman, Helen Davies, Adam P. Butler, Sarah Edkins, Jenny M. Andrews,Gemma Buck, Lina Chen, David Beare, Calli Latimer, Sara Widaa, Jonathon Hinton, Ciara Fahey, Beiy-uan Fu, Sajani Swamy, Gillian L. Dalgliesh, Bin T. Teh, Panos Deloukas, Fengtang Yang, Peter J. Camp-bell, P. Andrew Futreal, and Michael R. Stratton. Signatures of mutation and selection in the cancergenome. Nature, 463(7283):893–898, Feb 2010. ISSN 1476-4687. doi: 10.1038/nature08768.Christopher M Bishop. Latent variable models. In Learning in graphical models, pages 371–403. Springer,1998.Christopher M Bishop et al. Pattern recognition and machine learning, volume 4. springer New York, 2006.Valentina Boeva, Tatiana Popova, Kevin Bleakley, Pierre Chiche, Julie Cappo, Gudrun Schleiermacher,Isabelle Janoueix-Lerosey, Olivier Delattre, and Emmanuel Barillot. Control-FREEC: a tool for as-sessing copy number and allelic content using next-generation sequencing data. Bioinformatics (Ox-ford, England), 28(3):423–425, Feb 2012. ISSN 1367-4811. doi: 10.1093/bioinformatics. URLhttp://www.ncbi.nlm.nih.gov/pubmed/22155870.Paul C Boutros, Adam D Ewing, Kyle Ellrott, Thea C Norman, Kristen K Dang, Yin Hu, Michael R Kellen,Christine Suver, J Christopher Bare, Lincoln D Stein, et al. Global optimization of somatic variant100Bibliographyidentification in cancer genomes with a global community challenge. Nature genetics, 46(4):318–319,2014.Tamara Broderick, Jim Pitman, Michael I Jordan, et al. Feature allocations, probability functions, andpaintboxes. Bayesian Analysis, 8(4):801–836, 2013.Peter J Campbell, Shinichi Yachida, Laura J Mudie, Philip J Stephens, Erin D Pleasance, Lucy A Stebbings,Laura A Morsberger, Calli Latimer, Stuart McLaren, Meng-Lay Lin, et al. The patterns and dynamics ofgenomic instability in metastatic pancreatic cancer. Nature, 467(7319):1109–1113, 2010.Scott L Carter, Kristian Cibulskis, Elena Helman, Aaron McKenna, Hui Shen, Travis Zack, Peter W Laird,Robert C Onofrio, Wendy Winckler, Barbara A Weir, et al. Absolute quantification of somatic DNAalterations in human cancer. Nature biotechnology, 30(5):413–421, 2012.J-B Cazier, SR Rao, CM McLean, AK Walker, BJ Wright, EEM Jaeger, C Kartsonaki, L Marsden, C Yau,C Camps, et al. Whole-genome sequencing of bladder cancers reveals somatic CDKN1A mutations andclinicopathological associations with mutation burden. Nature communications, 5, 2014.Kristian Cibulskis, Michael S Lawrence, Scott L Carter, Andrey Sivachenko, David Jaffe, Carrie Sougnez,Stacey Gabriel, Matthew Meyerson, Eric S Lander, and Gad Getz. Sensitive detection of somatic pointmutations in impure and heterogeneous cancer samples. Nature biotechnology, 31(3):213–219, 2013.1000 Genomes Project Consortium et al. A map of human genome variation from population-scale sequen-cing. Nature, 467(7319):1061–1073, 2010.International Human Genome Sequencing Consortium et al. Finishing the euchromatic sequence of thehuman genome. Nature, 431(7011):931–945, 2004.Adrian Corduneanu and Christopher M Bishop. Variational Bayesian model selection for mixture distribu-tions. In Artificial intelligence and Statistics, volume 2001, pages 27–34. Morgan Kaufmann Waltham,MA, 2001.Christina Curtis, Sohrab P. Shah, Suet-Feung Chin, Gulisa Turashvili, Oscar M. Rueda, Mark J. Dunning,Doug Speed, Andy G. Lynch, Shamith Samarajiwa, Yinyin Yuan, Stefan Gräf, Gavin Ha, Gholam-reza Haffari, Ali Bashashati, Roslin Russell, Steven McKinney, METABRIC Group, Anita Langerød,101BibliographyAndrew Green, Elena Provenzano, Gordon Wishart, Sarah Pinder, Peter Watson, Florian Markowetz,Leigh Murphy, Ian Ellis, Arnie Purushotham, Anne-Lise Børresen-Dale, James D. Brenton, SimonTavaré, Carlos Caldas, and Samuel Aparicio. The genomic and transcriptomic architecture of 2,000breast tumours reveals novel subgroups. Nature, 486(7403):346–352, Jun 2012. ISSN 1476-4687. doi:10.1038/nature10983.Sarah-Jane Dawson, Dana W. Y. Tsui, Muhammed Murtaza, Heather Biggs, Oscar M. Rueda, Suet-FeungChin, Mark J. Dunning, Davina Gale, Tim Forshew, Betania Mahler-Araujo, Sabrina Rajan, Sean Hum-phray, Jennifer Becq, David Halsall, Matthew Wallis, David Bentley, Carlos Caldas, and Nitzan Rosen-feld. Analysis of circulating tumor DNA to monitor metastatic breast cancer. The New England journalof medicine, 368(13):1199–1209, Mar 2013. ISSN 1533-4406. doi: 10.1056/NEJMoa1213261.Elza C de Bruin, Nicholas McGranahan, Richard Mitter, Max Salm, David C Wedge, Lucy Yates, MariamJamal-Hanjani, Seema Shafi, Nirupa Murugaesu, Andrew J Rowan, et al. Spatial and temporal diversityin genomic instability processes defines lung cancer evolution. Science, 346(6206):251–256, 2014.Mark A DePristo, Eric Banks, Ryan Poplin, Kiran V Garimella, Jared R Maguire, Christopher Hartl, An-thony A Philippakis, Guillermo del Angel, Manuel A Rivas, Matt Hanna, et al. A framework for variationdiscovery and genotyping using next-generation DNA sequencing data. Nature genetics, 43(5):491–498,2011.Amit G Deshwar, Shankar Vembu, Christina K Yung, Gun H Jang, Lincoln Stein, and Quaid Morris.PhyloWGS: Reconstructing subclonal composition and evolution from whole-genome sequencing of tu-mors. Genome biology, 16(1):35, 2015.Jiarui Ding, Ali Bashashati, Andrew Roth, Arusha Oloumi, Kane Tse, Thomas Zeng, Gholamreza Haf-fari, Martin Hirst, Marco A Marra, Anne Condon, et al. Feature-based classifiers for somatic mutationdetection in tumour–normal paired sequencing data. Bioinformatics, 28(2):167–175, 2012a.Li Ding, Matthew J Ellis, Shunqiang Li, David E Larson, Ken Chen, John W Wallis, Christopher C Harris,Michael D McLellan, Robert S Fulton, Lucinda L Fulton, et al. Genome remodelling in a basal-like breastcancer metastasis and xenograft. Nature, 464(7291):999–1005, 2010.Li Ding, Timothy J Ley, David E Larson, Christopher A Miller, Daniel C Koboldt, John S Welch, Julie K102BibliographyRitchey, Margaret A Young, Tamara Lamprecht, Michael D McLellan, et al. Clonal evolution in relapsedacute myeloid leukaemia revealed by whole-genome sequencing. Nature, 481(7382):506–510, 2012b.Peter Eirew, Adi Steif, Jaswinder Khattra, Gavin Ha, Damian Yap, Hossein Farahani, Karen Gelmon,Stephen Chia, Colin Mar, Adrian Wan, et al. Dynamics of genomic clones in breast cancer patientxenografts at single-cell resolution. Nature, 2014.T. Ferguson. Bayesian analysis of some nonparametric problems. Annals of Statistics, 1(2):209–230, 1973.Andrej Fischer, Ignacio Vázquez-García, Christopher JR Illingworth, and Ville Mustonen. High-definitionreconstruction of clonal composition in cancer. Cell reports, 7(5):1740–1752, 2014.Simon A Forbes, Nidhi Bindal, Sally Bamford, Charlotte Cole, Chai Yin Kok, David Beare, Mingming Jia,Rebecca Shepherd, Kenric Leung, Andrew Menzies, et al. COSMIC: mining complete cancer genomesin the Catalogue of Somatic Mutations in Cancer. Nucleic acids research, page gkq929, 2010.Tim Forshew, Muhammed Murtaza, Christine Parkinson, Davina Gale, Dana W. Y. Tsui, Fiona Kaper, Sarah-Jane Dawson, Anna M. Piskorz, Mercedes Jimenez-Linan, David Bentley, James Hadfield, Andrew P.May, Carlos Caldas, James D. Brenton, and Nitzan Rosenfeld. Noninvasive identification and monitoringof cancer mutations by targeted deep sequencing of plasma DNA. Science translational medicine, 4(136):136ra68, May 2012. ISSN 1946-6242. doi: 10.1126/scitranslmed.3003726.A. Fritsch and K. Ickstadt. Improved criteria for clustering based on the posterior similarity matrix. Bayesiananalysis, 4(2):367–391, 2009.Charles Gawad, Winston Koh, and Stephen R Quake. Dissecting the clonal origins of childhood acutelymphoblastic leukemia by single-cell genomics. Proceedings of the National Academy of Sciences, 111(50):17947–17952, 2014.Marco Gerlinger, Andrew J Rowan, Stuart Horswell, James Larkin, David Endesfelder, Eva Gronroos,Pierre Martinez, Nicholas Matthews, Aengus Stewart, Patrick Tarpey, et al. Intratumor heterogeneity andbranched evolution revealed by multiregion sequencing. New England Journal of Medicine, 366(10):883–892, 2012.Zoubin Ghahramani and Michael I Jordan. Learning from incomplete data. 1995.103BibliographyZoubin Ghahramani, Michael I Jordan, and Ryan P Adams. Tree-structured stick breaking for hierarchicaldata. In Advances in Neural Information Processing Systems, pages 19–27, 2010.Ramaswamy Govindan, Li Ding, Malachi Griffith, Janakiraman Subramanian, Nathan D Dees, Krishna LKanchi, Christopher A Maher, Robert Fulton, Lucinda Fulton, John Wallis, et al. Genomic landscape ofnon-small cell lung cancer in smokers and never-smokers. Cell, 150(6):1121–1134, 2012.Rodrigo Goya, Mark GF Sun, Ryan D Morin, Gillian Leung, Gavin Ha, Kimberley C Wiegand, JanineSenz, Anamaria Crisan, Marco A Marra, Martin Hirst, et al. SNVMix: predicting single nucleotidevariants from next-generation sequencing of tumors. Bioinformatics, 26(6):730–736, 2010.Mel Greaves and Carlo C Maley. Clonal evolution in cancer. Nature, 481(7381):306–313, 2012.Chris D Greenman, Graham Bignell, Adam Butler, Sarah Edkins, Jon Hinton, Dave Beare, Sajani Swamy,Thomas Santarius, Lina Chen, Sara Widaa, P Andy Futreal, and Michael R Stratton. PICNIC: an al-gorithm to predict absolute allelic copy number variation with microarray cancer data. Biostatistics, 11(1):164–75, January 2010. ISSN 1468-4357. URL http://www.ncbi.nlm.nih.gov/pubmed/19837654.Gavin Ha, Andrew Roth, Daniel Lai, Ali Bashashati, Jiarui Ding, Rodrigo Goya, Ryan Giuliany, JamieRosner, Arusha Oloumi, Karey Shumansky, Suet-Feung Chin, Gulisa Turashvili, Martin Hirst, CarlosCaldas, Marco A Marra, Samuel Aparicio, and Sohrab P Shah. Integrative analysis of genome-wideloss of heterozygosity and monoallelic expression at nucleotide resolution reveals disrupted pathways intriple-negative breast cancer. Genome Res, 22(10):1995–2007, October 2012. ISSN 1549-5469.Gavin Ha, Andrew Roth, Jaswinder Khattra, Julie Ho, Damian Yap, Leah M Prentice, Nataliya Melnyk,Andrew McPherson, Ali Bashashati, Emma Laks, et al. TITAN: inference of copy number architecturesin clonal cell populations from tumor whole-genome sequence data. Genome research, 24(11):1881–1893, 2014.Douglas Hanahan and Robert A Weinberg. The hallmarks of cancer. cell, 100(1):57–70, 2000.Olivier Harismendy, Richard B Schwab, Lei Bao, Jeff Olson, Sophie Rozenzhak, Steve K Kotsopoulos,Stephanie Pond, Brian Crain, Mark S Chee, Karen Messer, Darren R Link, and Kelly A Frazer. Detection104Bibliographyof low prevalence somatic mutations in solid tumors with ultra-deep targeted sequencing. Genome Biol,12(12):R124, 2011. ISSN 1465-6914.Verena Heinrich, Jens Stange, Thorsten Dickhaus, Peter Imkeller, Ulrike KrÃÂŒger, Sebastian Bauer,Stefan Mundlos, Peter N Robinson, Jochen Hecht, and Peter M Krawitz. The allele distribution in next-generation sequencing data sets is accurately described as the result of a stochastic branching process.Nucleic Acids Res, 40(6):2426–31, March 2012. ISSN 1362-4962.A Hochhaus, M Baccarani, M Deininger, JF Apperley, JH Lipton, SL Goldberg, S Corm, NP Shah, F Cer-vantes, RT Silver, et al. Dasatinib induces durable cytogenetic responses in patients with chronic myelo-genous leukemia in chronic phase with resistance or intolerance to imatinib. Leukemia, 22(6):1200–1206,2008.Matthew D Hoffman and David M Blei. Structured stochastic variational inference. arXiv preprintarXiv:1404.4114, 2014.Matthew KH Hong, Geoff Macintyre, David C Wedge, Peter Van Loo, Keval Patel, Sebastian Lunke, Lud-mil B Alexandrov, Clare Sloggett, Marek Cmero, Francesco Marass, et al. Tracking the origins anddrivers of subclonal metastatic expansion in prostate cancer. Nature communications, 6, 2015.Yong Hou, Luting Song, Ping Zhu, Bo Zhang, Ye Tao, Xun Xu, Fuqiang Li, Kui Wu, Jie Liang, Di Shao,et al. Single-cell exome sequencing and monoclonal evolution of a JAK2-negative myeloproliferativeneoplasm. Cell, 148(5):873–885, 2012.Thomas J Hudson, Warwick Anderson, Axel Aretz, Anna D Barker, Cindy Bell, Rosa R Bernabé, MK Bhan,Fabien Calvo, Iiro Eerola, Daniela S Gerhard, et al. International network of cancer genome projects.Nature, 464(7291):993–998, 2010.Yuan Ji, Subhajit Sengupta, Juhee Lee, Peter Müller, and Kamalakar Gulukota. Estimating latent cell sub-populations with Bayesian feature allocation models. In Nonparametric Bayesian Inference in Biostatist-ics, pages 77–95. Springer, 2015.Wei Jiao, Shankar Vembu, Amit G Deshwar, Lincoln Stein, and Quaid Morris. Inferring clonal evolution oftumors from single nucleotide somatic mutations. BMC bioinformatics, 15(1):35, 2014.105BibliographyMalvina Josephidou, Andy G Lynch, and Simon Tavaré. multiSNV: a probabilistic approach for improvingdetection of somatic point mutations from multiple related tumour samples. Nucleic acids research, 43(9):e61–e61, 2015.C Kandoth, MD McLellan, F Vandin, K Ye, B Niu, C Lu, M Xie, Q Zhang, JF McMichael, MA Wycza-lkowski, et al. Mutational landscape and significance across 12 major cancer types. Nature, 502(7471):333–339, 2013.Sangwoo Kim, Kyowon Jeong, Kunal Bhutani, Jeong Ho Lee, Anand Patel, Eric Scott, Hojung Nam, HayanLee, Joseph G Gleeson, and Vineet Bafna. Virmid: accurate detection of somatic mutations with sampleimpurity inference. Genome Biol, 14:R90, 2013.Daniel C Koboldt, Ken Chen, Todd Wylie, David E Larson, Michael D McLellan, Elaine R Mardis,George M Weinstock, Richard K Wilson, and Li Ding. VarScan: variant detection in massively parallelsequencing of individual and pooled samples. Bioinformatics, 25(17):2283–2285, 2009.Daniel C Koboldt, Qunyuan Zhang, David E Larson, Dong Shen, Michael D McLellan, Ling Lin, Chris-topher A Miller, Elaine R Mardis, Li Ding, and Richard K Wilson. VarScan 2: somatic mutation and copynumber alteration discovery in cancer by exome sequencing. Genome research, 22(3):568–576, 2012.Daniel Lai, Gavin Ha, and Sohrab P Shah. HMMcopy: Copy number prediction with correction for GC andmappability bias for HTS data. Technical report, 2012.Eric S Lander, Lauren M Linton, Bruce Birren, Chad Nusbaum, Michael C Zody, Jennifer Baldwin, KeriDevon, Ken Dewar, Michael Doyle, William FitzHugh, et al. Initial sequencing and analysis of the humangenome. Nature, 409(6822):860–921, 2001.Peter Langfelder, Bin Zhang, and Steve Horvath. Defining clusters from a hierarchical cluster tree: theDynamic Tree Cut package for R. Bioinformatics, 24(5):719–720, 2008.David E Larson, Christopher C Harris, Ken Chen, Daniel C Koboldt, Travis E Abbott, David J Dooling,Timothy J Ley, Elaine R Mardis, Richard K Wilson, and Li Ding. SomaticSniper: identification ofsomatic point mutations in whole genome sequencing data. Bioinformatics, 28(3):311–317, 2012.106BibliographyJeong Ho Lee, Jennifer L Silhavy, Sangwoo Kim, Tracy Dixon-Salazar, Andrew Heiberg, Eric Scott, VineetBafna, Kiley J Hill, Adrienne Collazo, Vincent Funari, et al. De novo somatic mutations in componentsof the PI3K-AKT3-mTOR pathway cause hemimegalencephaly. Nature genetics, 44(8):941–945, 2012.Timothy J Ley, Elaine R Mardis, Li Ding, Bob Fulton, Michael D McLellan, Ken Chen, David Dooling,Brian H Dunford-Shore, Sean McGrath, Matthew Hickenbotham, et al. DNA sequencing of a cytogenet-ically normal acute myeloid leukaemia genome. Nature, 456(7218):66–72, 2008.Heng Li and Richard Durbin. Fast and accurate long-read alignment with Burrows-Wheeler trans-form. Bioinformatics (Oxford, England), 26(5):589–595, Mar 2010. ISSN 1367-4811. doi: 10.1093/bioinformatics/btp698. URL http://www.ncbi.nlm.nih.gov/pubmed/20080505.Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, GoncaloAbecasis, Richard Durbin, et al. The sequence alignment/map format and SAMtools. Bioinformatics,25(16):2078–2079, 2009.Peter Van Loo, Silje H. Nordgard, Ole Christian Lingjærde, Hege G. Russnes, Inga H. Rye, Wei Sun,Victor J. Weigman, Peter Marynen, Anders Zetterberg, Bjørn Naume, Charles M. Perou, Anne-LiseBørresen-Dale, and Vessela N. Kristensen. Allele-specific copy number analysis of tumors. Proceedingsof the National Academy of Sciences of the United States of America, 107(39):16910–16915, September2010. ISSN 00278424. URL http://www.jstor.org/stable/20779885.Salem Malikic, Andrew W McPherson, Nilgun Donmez, and Cenk S Sahinalp. Clonality inference inmultiple tumor samples using phylogeny. Bioinformatics, 31(9):1349–1356, 2015.Elaine R Mardis, Li Ding, David J Dooling, David E Larson, Michael D McLellan, Ken Chen, Daniel CKoboldt, Robert S Fulton, Kim D Delehaunty, Sean D McGrath, et al. Recurring mutations found bysequencing an acute myeloid leukemia genome. New England Journal of Medicine, 361(11):1058–1066,2009.Aaron McKenna, Matthew Hanna, Eric Banks, Andrey Sivachenko, Kristian Cibulskis, Andrew Kernytsky,Kiran Garimella, David Altshuler, Stacey Gabriel, Mark Daly, et al. The Genome Analysis Toolkit: aMapReduce framework for analyzing next-generation DNA sequencing data. Genome research, 20(9):1297–1303, 2010.107BibliographyGeoffrey McLachlan and Thriyambakam Krishnan. The EM algorithm and extensions, volume 382. JohnWiley & Sons, 2007.Geoffrey McLachlan and David Peel. Finite mixture models. John Wiley & Sons, 2004.Christopher A Miller, Brian S White, Nathan D Dees, Malachi Griffith, John S Welch, Obi L Griffith, RaviVij, Michael H Tomasson, Timothy A Graubert, Matthew J Walter, et al. SciClone: inferring clonalarchitecture and tracking the spatial and temporal patterns of tumor evolution. 2014.Sandra Misale, Rona Yaeger, Sebastijan Hobor, Elisa Scala, Manickam Janakiraman, David Liska,Emanuele Valtorta, Roberta Schiavo, Michela Buscarino, Giulia Siravegna, et al. Emergence of KRASmutations and acquired resistance to anti-EGFR therapy in colorectal cancer. Nature, 486(7404):532–536,2012.Ryan D Morin, Maria Mendez-Lago, Andrew J Mungall, Rodrigo Goya, Karen L Mungall, Richard DCorbett, Nathalie A Johnson, Tesa M Severson, Readman Chiu, Matthew Field, et al. Frequent mutationof histone-modifying genes in non-Hodgkin lymphoma. Nature, 476(7360):298–303, 2011.Nicholas Navin, Jude Kendall, Jennifer Troge, Peter Andrews, Linda Rodgers, Jeanne McIndoo, KerryCook, Asya Stepansky, Dan Levy, Diane Esposito, et al. Tumour evolution inferred by single-cell se-quencing. Nature, 472(7341):90–94, 2011.R.M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of computationaland graphical statistics, 9(2):249–265, 2000.Sarah B Ng, Emily H Turner, Peggy D Robertson, Steven D Flygare, Abigail W Bigham, Choli Lee, TristanShaffer, Michelle Wong, Arindam Bhattacharjee, Evan E Eichler, Michael Bamshad, Deborah A Nicker-son, and Jay Shendure. Targeted capture and massively parallel sequencing of 12 human exomes. Nature,461(7261):272–6, September 2009. ISSN 1476-4687.Serena Nik-Zainal, Peter Van Loo, David C Wedge, Ludmil B Alexandrov, Christopher D Greenman,King Wai Lau, Keiran Raine, David Jones, John Marshall, Manasa Ramakrishna, et al. The life his-tory of 21 breast cancers. Cell, 149(5):994–1007, 2012.Luwen Ning, Geng Liu, Guibo Li, Yong Hou, Yin Tong, and Jiankui He. Current challenges in the bioin-formatics of single cell genomics. Frontiers in oncology, 4, 2014.108BibliographyIver Nordentoft, Philippe Lamy, Karin Birkenkamp-Demtröder, Karey Shumansky, Søren Vang, HenrikHornshøj, Malene Juul, Palle Villesen, Jakob Hedegaard, Andrew Roth, et al. Mutational context anddiverse clonal development in early and late bladder cancer. Cell reports, 7(5):1649–1663, 2014.Peter C Nowell. The clonal evolution of tumor cell populations. Science, 194(4260):23–28, 1976.Layla Oesper, Ahmad Mahmoody, and Benjamin J Raphael. THetA: inferring intra-tumor heterogeneityfrom high-throughput DNA sequencing data. Genome Biol, 14(7):R80, 2013.Erin D Pleasance, Philip J Stephens, Sarah O’Meara, David J McBride, Alison Meynert, David Jones,Meng-Lay Lin, David Beare, King Wai Lau, Chris Greenman, et al. A small-cell lung cancer genomewith complex signatures of tobacco exposure. Nature, 463(7278):184–190, 2010.Anirudh Prahallad, Chong Sun, Sidong Huang, Federica Di Nicolantonio, Ramon Salazar, Davide Zecchin,Roderick L Beijersbergen, Alberto Bardelli, and René Bernards. Unresponsiveness of colon cancer toBRAF (V600E) inhibition through feedback activation of EGFR. Nature, 483(7388):100–103, 2012.A. Rosenberg and J. Hirschberg. V-measure: A conditional entropy-based external cluster evaluation meas-ure. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processingand Computational Natural Language Learning (EMNLP-CoNLL), volume 410, page 420, 2007.Sheldon M. Ross. Simulation. Elsevier, 3. ed. edition, 2002.Andrew Roth, Jiarui Ding, Ryan D. Morin, Anamaria Crisan, Gavin Ha, Ryan Giuliany, Ali Bashashati,Martin Hirst, Gulisa Turashvili, Arusha Oloumi, Marco A. Marra, Sam Aparicio, and Sohrab P. Shah.JointSNVMix: a probabilistic model for accurate detection of somatic mutations in normal/tumour pairednext-generation sequencing data. Bioinformatics, 28(7):907–913, 2012.Andrew Roth, Jaswinder Khattra, Damian Yap, Adrian Wan, Emma Laks, Justina Biele, Gavin Ha, SamuelAparicio, Alexandre Bouchard-Côté, and Sohrab P Shah. PyClone: statistical inference of clonal popula-tion structure in cancer. Nature methods, 11(4):396–398, 2014.Christopher T Saunders, Wendy SW Wong, Sajani Swamy, Jennifer Becq, Lisa J Murray, and R KeiraCheetham. Strelka: accurate somatic small-variant calling from sequenced tumor–normal sample pairs.Bioinformatics, 28(14):1811–1817, 2012.109BibliographySohrab P Shah, Martin Köbel, Janine Senz, Ryan D Morin, Blaise A Clarke, Kimberly C Wiegand, GillianLeung, Abdalnasser Zayed, Erika Mehl, Steve E Kalloger, et al. Mutation of FOXL2 in granulosa-celltumors of the ovary. New England Journal of Medicine, 360(26):2719–2729, 2009a.Sohrab P Shah, Ryan D Morin, Jaswinder Khattra, Leah Prentice, Trevor Pugh, Angela Burleigh, AllenDelaney, Karen Gelmon, Ryan Guliany, Janine Senz, et al. Mutational evolution in a lobular breasttumour profiled at single nucleotide resolution. Nature, 461(7265):809–813, 2009b.Sohrab P Shah, Andrew Roth, Rodrigo Goya, Arusha Oloumi, Gavin Ha, Yongjun Zhao, Gulisa Turashvili,Jiarui Ding, Kane Tse, Gholamreza Haffari, et al. The clonal and mutational evolution spectrum ofprimary triple-negative breast cancers. Nature, 486(7403):395–399, 2012.Ehud Shapiro, Tamir Biezuner, and Sten Linnarsson. Single-cell sequencing-based technologies will revo-lutionize whole-organism science. Nature Reviews Genetics, 14(9):618–630, 2013.Stephen T Sherry, M-H Ward, M Kholodov, J Baker, Lon Phan, Elizabeth M Smigielski, and Karl Sirotkin.dbSNP: the NCBI database of genetic variation. Nucleic acids research, 29(1):308–311, 2001.Yuichi Shiraishi, Yusuke Sato, Kenichi Chiba, Yusuke Okuno, Yasunobu Nagata, Kenichi Yoshida, NorioShiba, Yasuhide Hayashi, Haruki Kume, Yukio Homma, et al. An empirical Bayesian framework forsomatic mutation detection from cancer genome sequencing data. Nucleic acids research, 41(7):e89–e89,2013.Andrea Sottoriva, Inmaculada Spiteri, Sara G. M. Piccirillo, Anestis Touloumis, V. Peter Collins, John C.Marioni, Christina Curtis, Colin Watts, and Simon Tavaré. Intratumor heterogeneity in human glio-blastoma reflects cancer evolutionary dynamics. Proceedings of the National Academy of Sciences ofthe United States of America, 110(10):4009–4014, Mar 2013. ISSN 1091-6490. doi: 10.1073/pnas.1219747110.Andreas Untergasser, Ioana Cutcutache, Triinu Koressaar, Jian Ye, Brant C Faircloth, Maido Remm, andSteven G Rozen. Primer3—new capabilities and interfaces. Nucleic acids research, 40(15):e115–e115,2012.Yong Wang, Jill Waters, Marco L Leung, Anna Unruh, Whijae Roh, Xiuqing Shi, Ken Chen, Paul Scheet,110BibliographySelina Vattathil, Han Liang, et al. Clonal evolution in breast cancer revealed by single nucleus genomesequencing. Nature, 512(7513):155–160, 2014.John N Weinstein, Eric A Collisson, Gordon B Mills, Kenna R Mills Shaw, Brad A Ozenberger, KyleEllrott, Ilya Shmulevich, Chris Sander, Joshua M Stuart, Cancer Genome Atlas Research Network, et al.The cancer genome atlas pan-cancer analysis project. Nature genetics, 45(10):1113–1120, 2013.M. West and M.D. Escobar. Hierarchical priors and mixture models, with application in regression anddensity estimation. Institute of Statistics and Decision Sciences, Duke University, 1993.Kimberly C Wiegand, Sohrab P Shah, Osama M Al-Agha, Yongjun Zhao, Kane Tse, Thomas Zeng, Jan-ine Senz, Melissa K McConechy, Michael S Anglesio, Steve E Kalloger, et al. ARID1A mutations inendometriosis-associated ovarian carcinomas. New England Journal of Medicine, 363(16):1532–1543,2010.Xun Xu, Yong Hou, Xuyang Yin, Li Bao, Aifa Tang, Luting Song, Fuqiang Li, Shirley Tsang, Kui Wu,Hanjie Wu, et al. Single-cell exome sequencing reveals single-nucleotide mutation characteristics of akidney tumor. Cell, 148(5):886–895, 2012.Shinichi Yachida, Siân Jones, Ivana Bozic, Tibor Antal, Rebecca Leary, Baojin Fu, Mihoko Kamiyama,Ralph H Hruban, James R Eshleman, Martin A Nowak, et al. Distant metastasis occurs late during thegenetic evolution of pancreatic cancer. Nature, 467(7319):1114–1117, 2010.Lixing Yang, Lovelace J Luquette, Nils Gehlenborg, Ruibin Xi, Psalm S Haseley, Chih-Heng Hsieh, Cheng-sheng Zhang, Xiaojia Ren, Alexei Protopopov, Lynda Chin, et al. Diverse mechanisms of somatic struc-tural variations in human cancer genomes. Cell, 153(4):919–929, 2013.Damian B Yap, Justin Chu, Tobias Berg, Matthieu Schapira, S-W Grace Cheng, Annie Moradian, Ryan DMorin, Andrew J Mungall, Barbara Meissner, Merrill Boyle, Victor E Marquez, Marco A Marra, Randy DGascoyne, R Keith Humphries, Cheryl H Arrowsmith, Gregg B Morin, and Samuel A J R Aparicio. So-matic mutations at EZH2 Y641 act dominantly through a mechanism of selectively altered PRC2 catalyticactivity, to increase H3K27 trimethylation. Blood, 117(8):2451–9, February 2011. ISSN 1528-0020.Christopher Yau, Dmitri Mouradov, Robert N Jorissen, Stefano Colella, Ghazala Mirza, Graham Steers,Adrian Harris, Jiannis Ragoussis, Oliver Sieber, and Christopher C Holmes. A statistical approach for111detecting genomic aberrations in heterogeneous tumor samples from single nucleotide polymorphismgenotyping data. Genome Biol, 11(9):R92, 2010. ISSN 1465-6914.Ke Yuan, Thomas Sakoparnig, Florian Markowetz, and Niko Beerenwinkel. BitPhylogeny: a probabilisticframework for reconstructing intra-tumor phylogenies. Genome biology, 16(1):36, 2015.Habil Zare, Junfeng Wang, Alex Hu, Kris Weber, Josh Smith, Debbie Nickerson, ChaoZhong Song, DanielaWitten, C Anthony Blau, and William Stafford Noble. Inferring clonal composition from multiple sectionsof a breast cancer. 2014.Chenghang Zong, Sijia Lu, Alec R Chapman, and X Sunney Xie. Genome-wide detection of single-nucleotide and copy-number variations of a single human cell. Science, 338(6114):1622–1626, 2012a.Chenghang Zong, Sijia Lu, Alec R Chapman, and X Sunney Xie. Genome-wide detection of single-nucleotide and copy-number variations of a single human cell. Science, 338(6114):1622–1626, 2012b.112Appendix APyClone: Supplemental materialA.1 Copy number analysisIn the following section array refers to Affymetrix SNP6.0 arrays. No other array platform was used forcopy number analysis.A.1.1 Normalisation and feature extractionASCAT and OncoSNP both require that the input data be suitably normalised and that the B allele frac-tion (BAF) and log R ratio (LRR) be extracted from the raw CEL files. To do this we used a modi-fied version of the workflow described on the PennCNV-Affy website (www.openbioinformatics.org/penncnv/penncnv_tutorial_affy_gw6.html). To extract features in the hg19 coordin-ate system we mapped the supplied PennCNV .pfb and .gcmodel files using the annotations in the Gen-omeWideSNP_6,Full,na31,hg19,HB20110328.ugp file downloaded from the AROMA project (aroma-project.org). We used only steps 1.2 and 1.4 of the PennCNV workflow to perform normalisation and BAF/LRRextraction. The output of step 1.4 was passed to OncoSNP. For ASCAT we applied the PennCNV GCcorrection to the output of step 1.4 since ASCAT has no built in GC normalisation strategy.A.1.2 ASCATWe used ASCAT [Loo et al., 2010] version 2.1 for all analyses. We used the paired analysis mode to analysethe tumour and matched normal arrays jointly. The standard ASCAT workflow and default parametersdescribed in the software manual were used for all analyses.113A.2. Normal mixture data setPICNICThe latest version of PICNIC [Greenman et al., 2010] as of 06/10/13 was used for all analyses. For all ana-lyses PICNIC was run using only the tumour array. The parameter files supplied with PICNIC were mappedto hg19 coordinates using the GenomeWideSNP_6,Full,na31,hg19,HB20110328.ugp file downloaded fromthe AROMA project (aroma-project.org). Quantitative pathology estimates were used to inform theprior for normal contamination in the PICNIC inference procedure. The standard PICNIC workflow asdescribed in the software manual with default parameters was performed for all analyses. Since PICNICperforms normalisation and feature extraction as part of its analysis, raw CEL files were passed as inputs.OncoSNPWe used OncoSNP [Yau et al., 2010] version 1.3 for all analyses. We used the paired analysis mode toanalyse the matched normal and tumour arrays jointly. We used the stromal contamination and intra tumourheterogeneity modes. We also included the X chromosome in the analysis. With these modifications, theOncoSNP workflow and default parameters described in the software manual were used for all analyses.A.2 Normal mixture data setFor the idealised mixture data presented in figure 3.7, data from mixture experiments A (SRR385938), B(SRR385939), C (SRR385940) and D (SRR385941) from Harismendy et al. [2011] were downloaded fromthe NCBI short read archive. Each data set was generated by physically mixing DNA from four tissuesamples from the 1000 genomes project in different proportions. Thus mixtures were generated from thesource DNA material and not in-silico. The resulting mixtures were then subjected to targeted amplificationusing the UDT-Seq protocol and sequenced on the Illumina MiSeq platform.FASTQ files were extracted from the downloaded .sra files using the fastq-dump -split-files-clip command from the NCBI SRA-SDK version 2.3.33. Sequences were aligned to the targeted genomeusing mem command from the bwa 0.7.5a package. Count data was extracted from the BAM files using acustom Python script which filtered out positions with base or mapping qualities below 10. Because theprimers used in the UDT-Seq protocol were designed to target mutational hot spots in cancer and not theSNP positions we were analysing, some candidate positions lay near the start or end of the primers. Thesepositions tended to show significant strand bias which could translate to biased allelic abundance estimates.114A.3. High grade serous ovarian cancer data setTo address this, the count data was post-processed to remove positions showing significant strand bias (P <0.05) as determined using a Fisher exact test. No multiple test correction was done.Primer start and stop positions for the UDT-Seq protocol were obtained from Supplemental Table 2 ofHarismendy et al. [2011]. We downloaded hg19 from UCSC website and used primer positions to build atargeted reference alignment file which contained only the regions spanned by the primers.A.3 High grade serous ovarian cancer data setMultiple PyClone analyses were performed, one analysis per copy number prediction method (section §A.1).We specified the priors for cellularity estimation in PICNIC based on quantitative pathology estimates.Allelic count data was obtained from Bashashati et al. [2013]. We obtained count data from additionalmutations not validated as somatic from the authors. For each copy number method we used the homologouscopy number information to elicit priors for PyClone using the PCN strategy, and set the tumour contentfor the PyClone analysis to the value predicted by the copy number method. MCMC analysis and post-processing of the trace was done as discussed in the section 3.2.11.A.3.1 Single-cell genotyping of frozen high grade serous ovarian cancersNuclei preparation and sortingSingle cell nuclei were prepared using a sodium citrate lysis buffer containing Triton X-100 detergent. Solidtissue samples were first subjected to mechanical homogenization using a laboratory paddle-blender. Theresulting cell lysates were passed twice through a 70-micron filter to remove larger cell debris. Aliquotsof freshly prepared nuclei were visually inspected and enumerated using a dual counting chamber hemo-cytometer (Improved Neubauer, Hausser Scientific, PA) with Trypan blue stain. Single nuclei were flowsorted into individual wells of microtitre plates using propidium iodide staining and a FACSAria II sorter(BD Biosciences, San Jose, CA).Multiplex and singleplex PCRsSomatic coding SNVs catalogued and validated in bulk tissue genome sequencing experiments were pickedfor mutation-spanning PCR primers design using Primer3 [Untergasser et al., 2012]. Common sequences115A.3. High grade serous ovarian cancer data setwere appended to the 5’ ends of the gene-specific primers to enable downstream barcoded adaptor at-tachment using a PCR approach. Multiplex (24) PCRs were performed using an ABI7900HT machineand SYBR GreenER qPCR Supermix reagent (Life Technologies, Burlington, ON). The 24-plex reactionproducts from each nucleus were used as input template to perform 48 singleplex PCRs using 48.48 Ac-cess Array IFCs according to the manufacturer’s protocol (Fluidigm Corporation, San Francisco, CA). Flowsorting plate wells without nuclei and 10 ng gDNA aliquots were used for negative and positive controlreactions, respectively.Nuclei-specific amplicon barcoding and nucleotide sequencingPooled singleplex PCR products from each nucleus were assigned unique molecular barcodes and adaptedfor MiSeq flow-cell NGS sequencing chemistry using a PCR step. Barcoded amplicon libraries were pooledand purified by conventional preparative agarose gel electrophoresis. Library quality and quantitation wasperformed using a 2100 Bioanalyzer with DNA 1000 chips (Agilent Technologies, Santa Clara, CA) anda Qubit 2.0 Fluorometer (Life Technologies, Burlington, ON). Next-generation DNA sequencing was con-ducted using a MiSeq sequencer according to the manufacturer’s protocols (Illumina Inc., San Diego, CA).Bioinformatic analysisPaired end FASTQ files from the MiSeq sequencer were aligned to human genome build 37 downloadedfrom the NCBI using the mem command from the bwa [Li and Durbin, 2010] 0.7.5a package. Allelic countdata was extracted from the BAM files using a custom Python script which filtered out positions with baseor mapping qualities below 10. Any loci with fewer than 40 reads of coverage was deemed unusable andassigned an unknown status in plots. We removed any cell in which more than 80% of the loci were unusable.We also removed any loci which were unusable in more than 80% of cells.Stochastic biased amplification of alleles due to limiting quantities of DNA in single cells made it dif-ficult to detect presence or absence of the variant allele using allelic abundance measurements. To addressthis we applied a statistical test to determine the presence or absence of the variant allele. The null hypo-thesis for the test was that the variant allele was absent, thus we only observe reads with the variant alleledue to sequencing error. We computed the proportion of reads with a variant allele at all positions on theamplicon targeting a loci, excluding the target loci. For these positions we defined the variant allele as thenon reference allele with the most reads supporting it. We used the mean of these values as the estimated116A.3. High grade serous ovarian cancer data setsequencing error rate. This value was used to perform a one tailed Binomial exact test. For each cell wemultiple test corrected the p-values for all loci with coverage using the Benjamini-Hochberg procedure. Weused a false discovery rate of 0.001 to determine if a variant allele was present. These methods have beenapplied in previous work and additional details are reported therein Shah et al. [2009b, 2012].117A.3. High grade serous ovarian cancer data setA B C D0.00.20.40.60.81.0(Inferred) cellularprevalenceClustern1 ( = 45)n2 ( = 1)n3 ( = 1)n4 ( = 1)n5 ( = 21)n6 ( = 2)n7 ( = 1)A B C D0.00.20.40.60.81.0Clustern1 ( = 25)n2 ( = 9)n3 ( = 5)n4 ( = 7)n5 ( = 2)n6 ( = 1)A B C D0.00.20.40.60.81.0(Inferred) cellularprevalence Clustern1 ( = 48)n2 ( = 21)n3 ( = 2)n4 ( = 1)A B C D0.00.20.40.60.81.0Clustern1 ( = 23)n2 ( = 2)n3 ( = 9)n4 ( = 7)n5 ( = 5)n6 ( = 2)n7 ( = 1)A B C DSample0.00.20.40.60.81.0(Inferred) cellularprevalenceClustern1 ( = 44)n2 ( = 3)n3 ( = 1)n4 ( = 2)n5 ( = 21)n6 ( = 1)A B C DSample0.00.20.40.60.81.0a) b)c) d)e) f)Clustern1 ( = 25)n2 ( = 9)n3 ( = 5)n4 ( = 7)n5 ( = 2)n6 ( = 1)Figure A.1: Predicted cellular prevalence and clustering estimates from HGSOC : case 1 using (a) ASCAT,(c) OncoSNP, (e) PICNIC; case 2 using (b) ASCAT, (d) OncoSNP, (f) PICNIC to inform PyClone usingthe BeBin-PCN model. Error bars indicate the mean standard deviation of MCMC cellular prevalencesestimates for mutations in a cluster. n indicates the number of mutations assigned to a cluster.118A.3. High grade serous ovarian cancer data setCluster10:10416469810:105305510:9519123311:11003531711:12974469111:6707908412:11247986912:11383513412:1323278212:5666991012:5814089913:2227554914:9679585616:3136795316:894172518:4409817719:1753120319:5819947719:67607871:1122985691:1543189021:16709644322:248296402:2071755662:975271083:1528807263:502320764:889595325:1217855535:1217867615:1413367195:1482068635:1594375495:1781403506:262734676:434880877:473844327:75290529:1128988409:3121519:50691769:90584713X:107420125X:1876794016:845229231:22289576621:354692244:12624085614:217883348:9539041018:1048778518:3379567119:1406718819:3900310319:436802951:139337361:335637642:1357454632:1898726182:2310816094:1488342504:155292315:434464876:1169516866:264318117:1311511358:1011542048:67380546X:18665396X:18665397X:2341123912:64521435ClusterCluster0.00.10.20.30.40.50.60.70.80.91.0Pairwise posterior clustering probabilityFigure A.2: Posterior similarity matrices for high grade serous ovarian cancer case 1 using ASCAT for copynumber prediction. Three MCMC runs from three random starts are shown.119A.3. High grade serous ovarian cancer data setCluster10:10416469810:105305510:9519123311:11003531711:12974469111:6707908412:11247986912:11383513412:1323278212:5666991012:5814089913:2227554914:9679585616:3136795316:894172518:4409817719:1753120319:5819947719:67607871:1122985691:1543189021:16709644322:248296402:2071755662:975271083:1528807263:502320764:889595325:1217855535:1217867615:1413367195:1482068635:1594375495:1781403506:262734676:434880877:473844327:75290529:1128988409:3121519:50691769:90584713X:107420125X:1876794016:845229231:22289576621:354692244:12624085614:217883348:9539041018:1048778518:3379567119:1406718819:3900310319:436802951:139337361:335637642:1357454632:1898726182:2310816094:1488342504:155292315:434464876:1169516866:264318117:1311511358:1011542048:67380546X:18665396X:18665397X:2341123912:64521435ClusterCluster0.00.10.20.30.40.50.60.70.80.91.0Pairwise posterior clustering probabilityFigure A.3: Posterior similarity matrices for high grade serous ovarian cancer case 1 using OncoSNP forcopy number prediction. Three MCMC runs from three random starts are shown.120A.3. High grade serous ovarian cancer data setCluster10:10416469810:105305510:9519123311:11003531711:12974469111:6707908412:11247986912:11383513412:1323278212:5666991012:5814089913:2227554914:9679585616:3136795316:894172518:4409817719:1753120319:5819947719:67607871:1122985691:1543189021:16709644322:248296402:2071755662:975271083:1528807263:502320764:889595325:1217855535:1217867615:1413367195:1482068635:1594375495:1781403506:262734676:434880877:473844327:75290529:1128988409:3121519:50691769:90584713X:107420125X:1876794016:845229231:22289576621:354692244:12624085614:217883348:9539041018:1048778518:3379567119:1406718819:3900310319:436802951:139337361:335637642:1357454632:1898726182:2310816094:1488342504:155292315:434464876:1169516866:264318117:1311511358:1011542048:67380546X:18665396X:18665397X:2341123912:64521435ClusterCluster0.00.10.20.30.40.50.60.70.80.91.0Pairwise posterior clustering probabilityFigure A.4: Posterior similarity matrices for high grade serous ovarian cancer case 1 using PICNIC for copynumber prediction. Three MCMC runs from three random starts are shown.121A.3. High grade serous ovarian cancer data setCluster11:4392315811:4676081612:5036858413:2610422616:2967592917:766281017:944857619:1156929319:368845471:1106553971:186614181:22607520122:322056062:1726480702:315898493:378189953:503911114:1538313666:315414447:442803137:447970038:133905955X:105280429X:129272637X:14963903719:43617781:1860398513:1213052793:690929455:1790482587:985087339:32632795X:109695677X:15317198316:6168958416:691434082:1968915504:39293450X:15415912811:6635865612:11715822614:645563795:1495143218:95186221X:108631750X:6357932410:264630587:2885877222:39441075ClusterCluster0.00.10.20.30.40.50.60.70.80.91.0Pairwise posterior clustering probabilityFigure A.5: Posterior similarity matrices for high grade serous ovarian cancer case 2 using ASCAT for copynumber prediction. Three MCMC runs from three random starts are shown.122A.3. High grade serous ovarian cancer data setCluster11:4392315811:4676081612:5036858413:2610422616:2967592917:766281017:944857619:1156929319:368845471:1106553971:186614181:22607520122:322056062:1726480702:315898493:378189953:503911114:1538313666:315414447:442803137:447970038:133905955X:105280429X:129272637X:14963903719:43617781:1860398513:1213052793:690929455:1790482587:985087339:32632795X:109695677X:15317198316:6168958416:691434082:1968915504:39293450X:15415912811:6635865612:11715822614:645563795:1495143218:95186221X:108631750X:6357932410:264630587:2885877222:39441075ClusterCluster0.00.10.20.30.40.50.60.70.80.91.0Pairwise posterior clustering probabilityFigure A.6: Posterior similarity matrices for high grade serous ovarian cancer case 2 using OncoSNP forcopy number prediction. Three MCMC runs from three random starts are shown.123A.3. High grade serous ovarian cancer data setCluster11:4392315811:4676081612:5036858413:2610422616:2967592917:766281017:944857619:1156929319:368845471:1106553971:186614181:22607520122:322056062:1726480702:315898493:378189953:503911114:1538313666:315414447:442803137:447970038:133905955X:105280429X:129272637X:14963903719:43617781:1860398513:1213052793:690929455:1790482587:985087339:32632795X:109695677X:15317198316:6168958416:691434082:1968915504:39293450X:15415912811:6635865612:11715822614:645563795:1495143218:95186221X:108631750X:6357932410:264630587:2885877222:39441075ClusterCluster0.00.10.20.30.40.50.60.70.80.91.0Pairwise posterior clustering probabilityFigure A.7: Posterior similarity matrices for high grade serous ovarian cancer case 2 using PICNIC for copynumber prediction. Three MCMC runs from three random starts are shown.124
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Probabilistic models for the identification and interpretation...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Probabilistic models for the identification and interpretation of somatic single nucleotide variants… Roth, Andrew Justin Latham 2015
pdf
Page Metadata
Item Metadata
Title | Probabilistic models for the identification and interpretation of somatic single nucleotide variants in cancer genomes |
Creator |
Roth, Andrew Justin Latham |
Publisher | University of British Columbia |
Date Issued | 2015 |
Description | Somatic single nucleotide variants (SNVs) are mutations resulting from the substitution of a single nucleotide in the genome of cancer cells. Somatic SNVs are numerous in the genomes of most types of cancers. SNVs can contribute to the malignant phenotype of cancer cells, though many SNVs likely have negligible selective value. Because many SNVs are selectively neutral, their presence in a measurable proportion of cells is likely due to drift or genetic hitchhiking. This makes SNVs an appealing class of genomic aberrations to use as markers of clonal populations and ultimately tumour evolution. Advances in sequencing technology, in particular the development of high throughput sequencing (HTS) technologies, have made it possible to systematically profile SNVs in tumour genomes. We introduce three probabilistic models to solve analytical problems raised by experimental designs that leverage HTS to study cancer biology. The first experimental design we address is paired sequencing of normal and tumour tissue samples to identify somatic SNVs. We develop a probabilistic model to jointly analyse data from both samples, and reduce the number of false positive somatic SNV predictions. The second experimental design we address is the deep sequencing of SNVs to quantify the cellular prevalence of clones harbouring the SNVs. The key challenge we resolve is that allele abundance measured by HTS is not equivalent to cellular prevalence due to the confounding issues of mutational genotype, normal cell contamination and technical noise. We develop a probabilistic model which solves these problems while simultaneously inferring the number of clonal populations in the tissue. The final experimental design we consider is single cell sequencing. Single cell sequencing provides a direct means to measure the genotypes of clonal populations. However, sequence data from a single cell is inherently noisy which confounds accurate measurement of genotypes. To overcome this problem we develop a model to aggregate cells by clonal population in order to pool statistical strength and reduce error. The model jointly infers the assignment of cells to clonal populations, the genotype of the clonal populations, and the number of populations present. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2016-01-06 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial 2.5 Canada |
DOI | 10.14288/1.0223130 |
URI | http://hdl.handle.net/2429/56222 |
Degree |
Doctor of Philosophy - PhD |
Program |
Bioinformatics |
Affiliation |
Science, Faculty of |
Degree Grantor | University of British Columbia |
GraduationDate | 2016-02 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc/2.5/ca/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2016_february_andrew_roth.pdf [ 25.34MB ]
- Metadata
- JSON: 24-1.0223130.json
- JSON-LD: 24-1.0223130-ld.json
- RDF/XML (Pretty): 24-1.0223130-rdf.xml
- RDF/JSON: 24-1.0223130-rdf.json
- Turtle: 24-1.0223130-turtle.txt
- N-Triples: 24-1.0223130-rdf-ntriples.txt
- Original Record: 24-1.0223130-source.json
- Full Text
- 24-1.0223130-fulltext.txt
- Citation
- 24-1.0223130.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.24.1-0223130/manifest