UBC Theses and Dissertations

Probabilistic approaches for profiling copy number aberrations and loss of heterozygosity landscapes… Ha, Gavin 2014

Probabilistic Approaches for Profiling Copy Number Aberrations and Loss of Heterozygosity Landscapes in Cancer Genomes

by

Gavin Ha

B.Sc., The University of British Columbia, 2008

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
in
The Faculty of Graduate and Postdoctoral Studies
(Bioinformatics)

THE UNIVERSITY OF BRITISH COLUMBIA
(Vancouver)

July 2014

© Gavin Ha 2014

Abstract

Genomic aberrations such as copy number alterations (CNA) and loss of heterozygosity (LOH) are hallmarks of human malignancies. These genomic abnormalities can have a measurable effect on the structure and dosage of chromosomal regions. Tumour suppressors and oncogenes altered by CNAs often contribute to a tumourigenic phenotype of increased proliferation. CNA and LOH can accrue through the process of branched evolution, resulting in the emergence of divergent clones with distinct aberrations present at diagnosis. Therefore, measuring and modelling how CNA/LOH distribute in cell populations can elucidate the abundance of specific clones and, ultimately, enable the study of clonal evolution. CNA/LOH events in tumours can be profiled using SNP genotyping arrays and whole genome sequencing (WGS). However, to maximize biological interpretability from these data, accurate and statistically robust computational methods for inferring CNA/LOH are necessary.

I present three novel probabilistic approaches that apply hidden Markov models (HMM) to analyze CNA/LOH in tumour genomes. The first method is HMM-Dosage, which distinguishes somatic and germline copy number events. This tool was used to profile 2000 breast cancers, the largest study of this kind in the world. The second method is APOLLOH, which was one of the earliest methods developed to profile LOH in tumour WGS data. Its application to WGS of 23 triple negative breast cancers (TNBC) represents the first time that LOH and its effects on allelic expression were jointly analyzed from sequencing data.
The third method is TITAN, which simultaneously infers CNA/LOH and the clonal population dynamics from tumour WGS data. This method provides an analytical route to studying the degree of clonal evolution driven by CNA/LOH. I applied TITAN to a novel set of primary breast tumours and corresponding mouse xenografts, presenting the results of distinct modes of temporal clonal selection patterns.

In conclusion, this dissertation presents a suite of novel approaches and their application to real-world cancer datasets, contributing to significant discoveries in breast and ovarian cancers. Future applications of these approaches will further facilitate the elucidation of cancer evolution, the genetic basis of metastatic potential, and therapeutic response and resistance.

Preface

In Chapter 1, Section 1.3.1 was largely taken from “Ha and Shah (2013). Distinguishing Somatic and Germline Copy Number Events in Cancer Patient DNA Hybridized to Whole-Genome SNP Genotyping Arrays, volume 973 of Array Comparative Genomic Hybridization: Protocols and Applications, Methods in Molecular Biology, chapter 22. Springer Science and Business Media, LLC”. A small portion of Section 1.2 contains modified text from the submitted manuscript “Ha et al. (2014). TITAN: Inference of copy number architectures in clonal cell populations from tumour whole genome sequence data”.

The methodology of HMM-Dosage in Chapter 2 was published in Ha and Shah (2013), with mathematical details also found in the supplementary methods of Curtis et al. (2012). Section 2.4.1 contains results from a large collaboration published in “Curtis et al. (2012). The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature, 486(7403):346–352”. In Sections 2.4.1 through 2.4.1, I only describe the computational analyses, figures and results that I contributed in the collaboration. Section 2.4.2 was published in “Bashashati et al. (2013).
Distinct evolutionary trajectories of primary high-grade serous ovarian cancers revealed through spatial mutational profiling. Journal of Pathology, 231:21–34”. The text in Sections 2.4.2 and 2.4.2 was taken from the manuscript, which I co-wrote with Dr. Sohrab Shah, describing the computational analyses, figures and results that I contributed, with the exception of performing the experimental FISH validation.

Chapter 3 is a modified version of material published in Ha et al. (2012). I implemented the APOLLOH method (Section 3.2) and performed all computational analyses presented in Section 3.3, with the exception of sequence alignment. I co-wrote the text with Dr. Sohrab Shah and I am the original creator and copyright holder of all figures presented in this chapter. The genome and transcriptome sequencing and genotyping array data were provided by project leaders Drs. Samuel Aparicio and Sohrab Shah as part of a larger study in “Shah et al. (2012). The clonal and mutational evolution spectrum of primary triple-negative breast cancers. Nature, 486(7403):395–399”.

Chapter 4 is a modified version of the manuscript by Ha et al. (2014) currently under peer review at an academic journal. I implemented the TITAN method (Section 4.2) and performed all computational analyses presented in Section 4.3, with the exception of sequence alignment. In particular, I generated results for Sections 4.3.2 through 4.3.5. I co-wrote the text with Dr. Sohrab Shah and I am the original creator and copyright holder of all figures presented in this chapter; the exception is Figure 4.26, to which I made half of the contribution. The genome sequencing data for the ovarian carcinoma was provided by project leader Dr. Sohrab Shah. Section 4.4.2 describes my contributions to the analysis of a novel dataset of primary breast cancer derived xenografts as part of a larger project led by Dr. Samuel Aparicio. This work was included as part of the manuscript by “Eirew et al. (2014).
Population dynamics of genomic clones in breast cancer patient xenografts at single cell resolution.” that is currently under peer review. Fluorescence in-situ hybridization was performed by technicians in the lab of Dr. David Huntsman and the single-cell sequencing data was generated by members of Dr. Aparicio’s lab.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Glossary
1 Introduction
  1.1 Somatic landscape of aberrations in cancer
    1.1.1 Copy number alterations
    1.1.2 Loss of heterozygosity
  1.2 Tumour heterogeneity and clonal diversity
  1.3 Genomic assays for CNA/LOH analysis in tumours
    1.3.1 Detecting CNA in tumour DNA hybridized to SNP genotyping arrays
    1.3.2 Massively parallel sequencing technologies for detecting structural genomic alterations in tumours
    1.3.3 Predicting somatic CNA, LOH and allelic imbalance in whole genome sequencing of tumours
  1.4 Inference of CNA and LOH using hidden Markov models
  1.5 Research contribution
    1.5.1 Distinguishing germline and somatic copy number events in cancer SNP genotyping array data
    1.5.2 Inference of loss of heterozygosity in whole genome sequencing of tumours
    1.5.3 Inference of tumour clonality through analysis of copy number in whole genome sequencing data
2 Distinguishing somatic and germline copy number events in cancer
  2.1 Introduction
  2.2 Method: HMM-Dosage
    2.2.1 Analysis workflow
    2.2.2 Probabilistic framework
    2.2.3 Implementation
  2.3 Results
    2.3.1 Evaluation and benchmarking
  2.4 Application of HMM-Dosage to novel cancer datasets
    2.4.1 METABRIC: Profiling the genome architectures of 2000 breast cancers
    2.4.2 Intra-tumoural heterogeneity of HGS ovarian cancer
  2.5 Discussion
    2.5.1 Limitations and future work
3 Detecting genome-wide allelic imbalance and LOH in whole genome sequencing of cancer
  3.1 Introduction
  3.2 Method: APOLLOH
    3.2.1 APOLLOH workflow overview
    3.2.2 APOLLOH probabilistic framework description
    3.2.3 Implementation
  3.3 Results
    3.3.1 Initial comparison between WGS and genotyping arrays demonstrates the platforms are correlated
    3.3.2 Evaluation of APOLLOH indicates model features systematically improve performance
    3.3.3 Tumour-normal admixture simulation demonstrates performance maintained at 34% tumour content
    3.3.4 Genomic landscape of allelic imbalance reveals widespread LOH in TNBC
    3.3.5 Somatic inactivation of genes with germline stop codon mutations
    3.3.6 Analysis of LOH and somatic mutations reveals potential subclonality and temporal ordering
    3.3.7 Monoallelic gene expression events associated with genomic LOH reveal disrupted pathways in TNBC
  3.4 Discussion
    3.4.1 Limitations and future directions
4 Modelling the copy number architecture of clonal cell populations using whole genome sequencing of tumours
  4.1 Introduction
  4.2 Method: TITAN
    4.2.1 Representation of mixed populations in heterogeneous tumour WGS data
    4.2.2 Workflow
    4.2.3 Details of the TITAN probabilistic framework
    4.2.4 Implementation
  4.3 Results
    4.3.1 Simulated CNA spike-in experiment demonstrates accurate detection for varying event sizes
    4.3.2 Evaluation on simulated mixtures of tumour subpopulations confers improved sensitivity for low cellular prevalence events
    4.3.3 Accurate estimation of cellular prevalence
    4.3.4 FISH assays validate the presence of subclonal copy number changes
    4.3.5 Validation of TITAN predictions using single-cell sequencing
  4.4 Applications of TITAN
    4.4.1 Characterization of the subclonal CNA in triple negative breast cancers
    4.4.2 Distinct clonal evolution patterns in breast cancer xenografts
  4.5 Discussion
    4.5.1 Limitations
    4.5.2 Extensions to current TITAN model
    4.5.3 Future directions
5 Conclusions
  5.1 Concluding remarks
Bibliography
Appendices
A HMMcopy: Copy number analysis of WGS data
  A.1 HMMcopy workflow
    A.1.1 Determine genomic windows that have 1000 reads mapped
    A.1.2 Obtain copy number read counts for normal and tumour for each window
    A.1.3 GC content correction of normal and tumour read counts
    A.1.4 Correcting read counts for highly mappable sequences
    A.1.5 Normalizing copy number in tumours
    A.1.6 Segmentation and copy number prediction via HMM
B APOLLOH: Supplementary material
  B.1 Biospecimen collection and ethical consent
  B.2 Histopathological review
  B.3 Library construction and sequence data generation
  B.4 Application of APOLLOH to 23 triple negative breast cancers
  B.5 OncoSNP analysis of Affymetrix SNP6.0 analysis
  B.6 Analyses for comparing APOLLOH results and Affymetrix SNP6 data
    B.6.1 WGSS and SNP6 platform comparison
    B.6.2 Model evaluation using SNP6 predictions
  B.7 Comparison of transcriptome allelic ratios (TAR)
C TITAN: Supplementary material
  C.1 TITAN parameter update derivations
    C.1.1 Prior distribution (mixed weights) parameter for genotypes, piG
    C.1.2 Prior distribution (mixed weights) parameter for clonal clusters, piZG
    C.1.3 Clonal frequency parameter, s
    C.1.4 Copy number Gaussian variance parameter, σ²
    C.1.5 Tumour ploidy, φ
    C.1.6 Normal contamination, n
  C.2 Biospecimen collection of intra-tumoural ovarian carcinoma samples and FISH validation
  C.3 TNBC sample collection and sequencing
    C.3.1 Comparison of cellular prevalence with RNAseq
  C.4 Additional details for TITAN evaluation analyses
    C.4.1 Mixture simulation experiments using intra-tumour samples from an ovarian carcinoma
  C.5 Validation using targeted deep amplicon DNA sequencing of single-cell nuclei
    C.5.1 Selection of positions for validation of deletion events
    C.5.2 Single-cell sequencing of nuclei DNA for ovarian cancer sample DG1136g
    C.5.3 Analysis of single-cell sequencing data
  C.6 Supplementary Tables

List of Tables

1.1 Segmentation algorithms for genotyping arrays
1.2 List of research contributions pertaining to computational methods development
2.1 Performance of HMM-Dosage
2.2 ERBB2 χ² test for CNA/CNV and PAM50 subtype association example
3.1 Description of random variables and fixed quantities in APOLLOH
3.2 APOLLOH model state representations of genotypes and zygosity status
3.3 Performance of APOLLOH using whole-exome benchmark data
3.4 Inferred normal proportion and performance for tumour-normal mixture experiment
3.5 Normal contamination predicted by APOLLOH and OncoSNP, and transcriptome allelic ratios for LOH in TNBC
4.1 Tumour genotype states used by TITAN
4.2 Description of random variables and fixed quantities in TITAN
4.3 Sequencing coverage and cellularity for spatially related ovarian intra-tumoural and anatomical sites of DG1136
4.4 Simulation experiments using serial mixtures of spatially related ovarian intra-tumoural samples
4.5 TITAN results for simulation experiments using serial mixtures
4.6 Summary performance for serial mixture experiment
4.7 Simulation experiments using pairwise merging mixtures of spatially related ovarian intra-tumoural samples
4.8 Simulation experiments using triple merging mixtures of spatially related ovarian intra-tumoural samples
4.9 Sequencing coverage and cellularity for spatially related ovarian intra-tumoural and anatomical sites of DG1136
4.10 TITAN parameters for individual intra-tumour DG1136 samples
4.11 Validation of TITAN predictions using FISH
4.12 TITAN parameters for 23 TNBC samples
4.13 Proportion of genome altered in TNBC
4.14 Concordance of TITAN results between EXCAP and WGS for TNBC
C.1 Spike-In simulation experiment. Randomly sampled deletion (from chr16) and amplification (from chr8) data was inserted into chr1, 2, 9 and 18. The ‘Event ID’ indicates which admixture sample the data originated from: clonally dominant (tum100), 80% tumour-normal mixture (tum80-norm20), and 60% tumour-normal mixture (tum60-norm20). The length, median allelic ratio and log ratio for each segment is given.

List of Figures

1.1 Mechanisms giving rise to CNA and LOH
1.2 Schematic of cellular prevalence estimates from pooled DNA
1.3 Illustration of LOH prediction in sequencing data
2.1 Distinguishing germline and somatic CNA using HMM-Dosage
2.2 HMM-Dosage analysis workflow
2.3 Workflow to generate the masked reference sample
2.4 Probabilistic graphical model of HMM-Dosage
2.5 Illustration of the HMM-Dosage transition probabilities
2.6 Comparison between Student’s-t distributions for tumour and matched normal samples
2.7 Gene-centric CNA and CNV landscapes for METABRIC
2.8 METABRIC CNA/CNV landscapes by PAM50 subtypes
2.9 Gene-centric Chi-Square (χ²) analysis to determine subtype-specific CNA and CNV in the PAM50 intrinsic subtypes
2.10 Analysis of CNA cis-associated expression changes for PPP2R2A
2.11 Analysis of CNA cis-associated expression changes for ERBB2 and LSM1
2.12 Analysis of CNV cis-associated expression changes
2.13 Gene-centric Spearman correlation analysis to determine CNA cis-regulated expression
2.14 Intra-tumoural extreme CNA profiles of HGS ovarian cancer Cases 1-6
2.15 CNA and FISH comparison between left and right ovaries of Case 4
2.16 Heterogeneous NF1 homozygous deletion in Case 1
2.17 Heterogeneous NF1 homozygous deletions in Case 3
2.18 Evolutionary sequential compound CNA analysis in HGS ovarian tumours
2.19 Genome doubling in HGS ovarian cancer Case 3
3.1 Illustration of allelic ratios between tumour and normal genomic sequencing data
3.2 Workflow of the analysis for APOLLOH
3.3 Probabilistic graphical model of APOLLOH
3.4 Benchmarking of WGSS allelic ratios against SNP6 genotyping array B-allele frequencies
3.5 Tumour content influences agreement of WGSS and SNP6, and separation between prediction state clusters
3.6 Evaluation of APOLLOH using Affymetrix SNP6.0
3.7 Systematic comparison of loss of heterozygosity (LOH) predictions
3.8 Tumour-normal sampling admixture experiment
3.9 Examples of improvement when accounting for normal contamination
3.10 Genome-wide gene frequencies of APOLLOH predictions and monoallelic expression
3.11 Distribution of the proportion of genome and number of genes altered
3.12 Genome-wide gene frequency landscape of APOLLOH loss of heterozygosity (LOH) predictions for 23 TNBC samples
3.13 Examples of LOH events predicted within amplifications in chromosome 17
3.14 Germline stop codon and synonymous variants affected by LOH
3.15 Analysis of transcriptome RNAseq data
3.16 Number of WGSS and SNP6 probe positions with RNAseq coverage
3.17 Transcriptome allelic ratio distribution and SNVMix parameters used for determining MAE
3.18 Association between MAE gene frequencies and normal (stromal) contamination
3.19 Genome-wide gene frequency landscape of monoallelic expression
3.20 Pathway enrichment analysis of genes with MAE established by LOH
4.1 Detection of subclonal deletions in whole genome sequencing data of a triple negative breast cancer genome
4.2 Representation of the aggregate copy number signal from mixed populations
4.3 Analysis workflow for TITAN
4.4 Probabilistic graphical model of TITAN
4.5 Behaviour of model parameters when varying cellular prevalence
4.6 Spike-in simulation experiment sample setup
4.7 TITAN results for chr1 of Spike-In experiment
4.8 TITAN results for chr2 of Spike-in experiment
4.9 TITAN results for chr9 of Spike-in experiment
4.10 TITAN results for chr18 of Spike-in experiment
4.11 Illustration of intra-tumour samples in patient DG1136 and an example of a mixing simulation
4.12 Performance of TITAN in serial and merging simulations using real intra-tumoural HGS ovarian tumour samples
4.13 Performance of TITAN for serial simulation of intra-tumour HGS ovarian samples
4.14 Performance of TITAN for serial simulation of intratumour samples from an ovarian tumour evaluated at different event size ranges
4.15 Performance of TITAN for triplet merging simulation of intra-tumour HGS ovarian samples
4.16 Performance of TITAN for pairwise merging simulation of intra-tumour HGS ovarian samples
4.17 TITAN cellular prevalence estimates for serial and pairwise/triplet merging simulations
4.18 TITAN cellular prevalence estimates for simulations using Control-FreeC normal content
4.19 TITAN normal proportion estimates for serial and pairwise/triplet merging simulations
4.20 FISH images for CNA events in samples of patient DG1136
4.21 TITAN predictions selected for validation by single-cell sequencing
4.22 Single-cell validation of Set1 deletions in DG1136g
4.23 Single-cell validation of Set2 deletions in DG1136g
4.24 Comparison of TITAN results for EXCAP and WGS for TNBC sample, SA052
4.25 Comparison of TITAN cellular prevalence and RNAseq transcriptome allelic ratios (TAR)
4.26 Clonal analysis between CNA/LOH and SNV mutational classes in breast xenografts
4.27 Clean "sweep" copy neutrality in chromosome 5 of the xenograft in SA494
4.28 Chromothripsis of chromosome 21 in SA429
4.29 Breakage fusion bridge in chromosome 18 of SA429

Glossary

allelic ratio: Measurement data for inference of LOH in sequencing data. Allelic ratio is computed as the tumour reference read count (aligned base matching the reference genome) divided by the total read depth for a given position. Allelic ratio is sometimes referred to as the allele fraction, and is similar to the BAF for SNP genotyping arrays.

aneuploidy: Abnormal chromosome numbers deviating from 2 parental copies (23 pairs for 46 total chromosomes in humans) as a result of whole chromosome loss or gain.

B-allele frequency (BAF): Measurement data used for inference of LOH, and sometimes jointly used with log ratio data to predict copy number, in SNP genotyping arrays. BAF is computed as the proportion of the minor allele (B-allele) intensity relative to the total intensity for a given probe.

chromosomal instability (CIN): The increased rate of change in the number and structure of chromosomes in the cell.

chromothripsis: The catastrophic shattering of a chromosome followed by the cell’s attempt to repair it, leading to a rejoined chromosome with deletions and reoriented sequence fragments.

cis-regulated expression: The down- and up-regulation of the expression of a gene due to decrease and increase of DNA copy number of the same gene, respectively.

clonal cluster: A model feature of TITAN that describes a group of predicted events observed at the same cellular prevalence. Clonal clusters allow for modeling sets of aberrations that likely arose from the same punctuated clonal expansions.

clonal diversity: The diversity of genomic aberration profiles between co-existing clones of a tumour.

clonal evolution: The process of selection and expansion of clone(s) with advantageous phenotypic fitness. This process is analogous to Darwinian natural selection.
The interaction between clones is dependent on the rate of acquired mutations and the ability for clonal expansion. The selective advantage of particular clone(s) may be due to acquisition of specific driver genomic aberrations.

clonal genotype: Refers to the genotype of a position, locus, region, or the full genomic profile for all cells in a clone.

clone: Set of cells that are related by descent from a unitary origin (Aparicio and Caldas, 2013). In this dissertation, these cells are considered to be genetically identical, and the phenotype of clones is not considered.

copy neutral: Normal diploid copy number (2 copies) at a specific locus, region, gene, or chromosome.

copy number alteration (CNA): The change in copy number of a region in the tumour genome that differs relative to a normal genome due to deletion or gain. This event is considered a somatic event, which is acquired in cells of the patient, rather than inherited in the germline.

copy number variation (CNV): The change in copy number of a DNA region due to deletion or gain. In this dissertation, CNVs are considered inherited events from the germline.

expectation maximization (EM): Iterative algorithm used to learn or estimate unknown parameters of an HMM. The algorithm iterates between the expectation (E-step) and maximization (M-step) steps. See Section 1.4.

genome rearrangements: Class of genome aberrations that involves structural changes of the genome including deletions, duplications, translocations, and inversions. These events are sometimes referred to as structural variants (SV) and differ from CNA in that some SV events can lead to no change in copy number (balanced).

germline: Refers to gamete cells. Germline variants are events inherited by the offspring from the gametes of the parents. Often, this indicates that events are inherited and not cancer-specific. See also CNV.

heterozygous: At a specific locus, region, or gene, at least one copy of each parental allele is present.
In a normal diploid cell with 2 alleles, one originates from the paternal chromosome and the other from the maternal chromosome.

hidden Markov model (HMM): A statistical model that is a Markov process with latent (hidden) states. HMMs have a wide range of applications, such as temporal pattern recognition (e.g. speech). In bioinformatics, the application of HMMs is popular for gene finding and annotation, and for copy number analysis. See Section 1.4.

homologous recombination (HR): Cellular mechanism to repair double-stranded DNA breaks. A homologous sequence is used as the template during repair.

homozygous: The alleles of a specific locus, region, or gene are identical for all copies. For CNA events, homozygosity refers to the alleles originating from one parental chromosome (uniparental). See also LOH.

intra-tumour heterogeneity (ITH): Diversity in the genomic profiles between cell populations (clones) within a tumour.

log ratio: Measurement data used for inference of copy number. For SNP genotyping arrays, the ratio is computed as the tumour array intensity divided by the reference (normal) array intensity for a given probe. For sequencing, the ratio is computed as the tumour read depth divided by the matched normal read depth for a given locus (position or region). The ratio is then ln or log2 transformed.

loss of heterozygosity (LOH): The change of genotype from heterozygous in the normal genome to homozygous in the tumour for a specific locus, region, or gene. The remaining allele originates from one parental chromosome. Regions of LOH often occur due to deletion CNA events, but copy neutral LOH can also occur as a result of gene conversion, mitotic recombination, and mitotic nondisjunction or missegregation.

masked reference: Reference sample used for SNP array analysis that contains diploid normal copy number for all loci.
This facilitates the prediction of both germline CNV and somatic CNA events.

mono-allelic expression (MAE) mRNA gene expression exclusively from one parental allele.

PAM50 Gene expression classifier to predict the intrinsic breast cancer subtypes (luminal-A, luminal-B, HER2-enriched, basal-like, and normal-like). A set of 50 genes is used as features (Parker et al., 2009).

punctuated clonal expansion The theory that a clone evolved a small number of large-scale aberrations, followed by the proliferation of this clone into a large subset of cells in the tumour.

SNP genotyping array DNA hybridization platform for analyzing tumour genomes, including copy number. Each array may contain up to millions of probes that hybridize genomic DNA. For example, Affymetrix SNP6.0 contains 900k SNP and 900k CNV probes. Sample preparation and protocols vary by platform model and manufacturer.

somatic Refers to cells of the body. Somatic aberrations are events acquired through mutations and not passed through the germline (gametes). Often used to indicate aberrations as cancer-specific. See also CNA.

structural variation (SV) See genome rearrangements.

subclonal Refers to the presence of a genomic aberration in a subpopulation of tumour cells or tumour subclone.

whole exome sequencing (WES) Short read sequences generated for the exon (coding) regions of the genome.

whole genome sequencing (WGS) Short read sequences generated for the entire sequence of the genome. These reads collectively cover the whole length of the genome with several fold redundancy. For tumour samples discussed in this dissertation, data for 30-50X redundancy were generated.

Chapter 1

Introduction

Chromosomal abnormalities are frequently observed in malignant tumour cells and have important roles in the initiation and development of pathological characteristics, as first speculated by Theodor Boveri almost one century ago in 1914 (Boveri, 2008).
In 1960, Nowell and Hungerford observed an abnormally small chromosome, named the “Philadelphia chromosome”, that was recurrent in chronic myelogenous leukemia (CML) patients (Nowell and Hungerford, 1960; Nowell, 2007). This discovery provided the evidence supporting the hypotheses made decades earlier that a single somatic genetic aberration in a cell can consistently lead to a malignant phenotype with proliferative advantages.

Today, complex combinations of genomic aberrations and chromosomal instability are accepted as hallmarks of cancer (Lengauer et al., 1998; Stratton et al., 2009; Hanahan and Weinberg, 2011). These mutational abnormalities can have a measurable effect on the structure and number of chromosomes in the cell, leading to a tumourigenic phenotype of increased proliferation and fitness. Techniques in cytogenetics, such as fluorescence in situ hybridization (FISH), allow for observing aneuploidy (the deviation in the number of whole chromosomes) in individual cancer cells (Pinkel et al., 1986; Hassold and Hunt, 2001). However, cytogenetic assays are now regarded as low-resolution techniques that only enable the study of gross chromosomal abnormalities.

Since the completion of the assembly of the human genome sequence in 2003 (International Human Genome Sequencing Consortium, 2004), the course of cancer genomics research has been reshaped, driven by computational solutions and informatics efforts. Newer genomic assay platforms, such as DNA hybridization array technologies, have enabled higher resolution exploration of the genomes of cancers and other diseases than could be achieved using cytogenetics (Pinkel et al., 1998; Lindblad-Toh et al., 2000).
The study of DNA copy number alteration (CNA), a class of genomic aberrations that alters the number of copies of sub-chromosomal sized segments of sequences, has benefitted from these technological advances and the accompanying development of computational prediction algorithms (Lucito et al., 2003; Zhao et al., 2004).

Emerging studies surveying tumour genomes have catalogued CNA events as significant contributors in cancer, reporting events specifically affecting genes driving tumourigenesis (Forbes et al., 2011). Increased copies of oncogenes can lead to up-regulated expression of the encoded protein. For example, breast cancer patients with amplification of the ERBB2 gene, which leads to increased Her2 protein expression, are candidates for targeted treatment with trastuzumab (Romond et al., 2005). By contrast, bi-allelic abrogation of tumour suppressor genes, such as PTEN (Li et al., 1997), RB1 (Lee et al., 1988) and TP53 (Baker et al., 1990), through a combination of loss of heterozygosity (LOH) and single-nucleotide variant (SNV) mutations can impair their normal functions of regulating growth in cells and preventing uncontrolled proliferation. Identification and prioritization of CNA events targeting driver genes are crucial steps for understanding the underlying genetics responsible for tumour progression, and may nominate actionable targets for clinical applications. Furthermore, characterizing the genome-wide instability of chromosomes can give insights into the genetic signature that may be a reflection of impaired DNA repair mechanisms, genomic heterogeneity, and clonal evolution (Ciriello et al., 2013).

The advancement of sequencing technologies has enabled researchers to examine cancer genomes at unprecedented resolutions. Sequencing of whole genomes (WGS) and targeted regions, such as whole exome capture (WES or EXCAP) of coding sequences and transcriptomes (RNAseq), has become routine in cancer projects.
Sequencing has been employed in concerted efforts to study larger cohorts of specific cancer types such as breast (Shah et al., 2012; Curtis et al., 2012; Cancer Genome Atlas Research Network, 2012c; Stephens et al., 2012; Nik-Zainal et al., 2012b,a), ovarian (Cancer Genome Atlas Research Network, 2011a), colorectal (Cancer Genome Atlas Research Network, 2012b), lung (Cancer Genome Atlas Research Network, 2012a; Imielinski et al., 2012; Rausch et al., 2012), brain (Jones et al., 2012; Northcott et al., 2012; Brennan et al., 2013) cancers, renal carcinomas (Cancer Genome Atlas Research Network, 2013a), and leukemias (Ley et al., 2008; Cancer Genome Atlas Research Network, 2013b). More recently, pan-cancer studies involving sequencing data of multiple cancer types from thousands of patients have drawn insights into related patterns across mutational profiles (Beroukhim et al., 2010; Alexandrov et al., 2013; Ciriello et al., 2013; Zack et al., 2013). Future large-scale studies and personal genomics initiatives will continue to expand data generation, further increasing the demand for more robust algorithms and statistical modelling that address both the volume of these data and their poorly understood noise properties, so that relevant biological signals can be interpreted effectively.

Development of probabilistic models for inference of structural genomic aberrations has always been a major focus in cancer genomics research. Yet robust analytical methodologies for analyzing CNAs from primary tumour WGS data are still under-developed. This is partly due to technical challenges, such as sequencing artifacts and data quality that can influence discernment of signal from noise, and biological considerations, such as normal or stromal contamination in tumour samples.
Another challenge when analyzing individual tumour biopsies is the presence of multiple tumour cell populations (or clones), which can have unique genomic aberrations that define their clonal genotypes (Aparicio and Caldas, 2013). Recent studies have taken this focus to explore the clonal diversity and tumour heterogeneity arising from the presence of multiple tumour subpopulations in many cancer types (Landau et al., 2013; Wu et al., 2012; Anderson et al., 2011; Navin et al., 2011; Martinez et al., 2013; Nik-Zainal et al., 2012b; Ding et al., 2012a; Kreso et al., 2013; Sottoriva et al., 2013; Szerlip et al., 2012; Yachida et al., 2010; Castellarin et al., 2012). While models for inference of single nucleotide variants (SNVs) from heterogeneous mixtures of cellular populations in tumour biopsies are beginning to emerge (Gerstung et al., 2012; Carter et al., 2012; Nik-Zainal et al., 2012b; Shah et al., 2012; Roth et al., 2014), probabilistic approaches for predicting subclonal CNA from these data are still in their infancy.

This dissertation is organized into five chapters, highlighting the design and development of novel algorithms for the analysis of chromosomal aberrations in cancer genomes, and their application to novel tumour datasets. The remainder of this chapter presents relevant background for CNA and LOH, tumour heterogeneity, clonal evolution, and the current methodologies for inferring these features. This chapter concludes with an introduction to the research contributions in this dissertation, outlining the research questions, hypotheses, and computational solutions.
Chapters 2, 3, and 4 are the research chapters that systematically highlight the novel probabilistic developments with the application of hidden Markov models (HMMs) to address the current technologies and challenges. The structure of each chapter includes a more detailed description of the motivation, computational challenges, and problem statement; a summary of previous work and its limitations; a description of the proposed methodology, along with mathematical and implementation details; benchmarking and evaluation of the approach; and the application of the methodology to novel tumour datasets, reporting significant biological findings and discussions on the landmark advances in cancer genomics. Chapter 5 concludes with a discussion on the impact of this research and the future directions.

1.1 Somatic landscape of aberrations in cancer

1.1.1 Copy number alterations

Copy number alterations (CNA), which are loosely defined as ranging in size from approximately 1 kilobase (Kb) to entire chromosomes, are events such as deletions, insertions, and duplications that result in segmental (partial) chromosomal aberrations and aneuploidy. These events can have pathogenic implications if affected regions contain proto-oncogenes (e.g. ERBB2, EGFR) and tumour suppressors (e.g. PTEN, RB1) (Bignell et al., 2010; Beroukhim et al., 2010). The genomes of human cells are diploid (two copies); however, segmental and whole-chromosome level modifications of copy number can render regions of the genome with a dosage that deviates from the two original copies. Gains of copies, induced by duplications, can lead to amplification of specific DNA sequences. By contrast, loss of DNA due to hemizygous and homozygous deletions can result in one and no copies, respectively.

Two general classes of mechanisms are proposed as the predominant causes of copy number changes. Hastings et al. (2009) reviewed these mechanisms in detail.
The first is repair of DNA double-stranded breaks (DSB) via homologous recombination (HR) and non-homologous recombination, which are normal cellular mechanisms in eukaryotes. HR results in no structural change if the homologous sequence template used for repair is from the exact same chromosome position. By contrast, non-allelic HR (NAHR) uses homologous sequences from different positions, as is often the case for repetitive regions, leading to chromosomal structure change. Non-homologous repair mechanisms, which do not require large homologous template sequences, include non-homologous end joining (NHEJ), which can result in up to 4bp micro-homology insertions, and microhomology-mediated end joining (MMEJ), which leads to small 5-25bp insertions. NHEJ and MMEJ are reported as the largest contributors to observed genome rearrangements (Yang et al., 2013).

The second general mechanism that induces copy number changes in cancers is error during mitosis and DNA replication. Aneuploidy, which is defined by the abnormal number of whole chromosomes, is generally caused by chromosome missegregation during mitotic anaphase. Stalled or defective replication forks during DNA replication use a NAHR-type repair called break-induced replication (BIR), which can also lead to LOH, translocation, deletion and duplication. For non-homologous repair of defects in replication, fork stalling and template switching (FoSTeS) results in 4-15bp micro-homology insertions. Recently, replication stress induced by replication fork stalling was reported as being largely responsible for structural chromosomal instability in colorectal cancers (Burrell et al., 2013b). Following replication, a chromosome that has lost its telomere will produce two sister chromatids lacking telomeres that will then fuse, and subsequently break at a random chromosomal location during separation into daughter nuclei.
The chromosome in the new cell lacking a telomere will undergo the same breakage-fusion-bridge cycle (BFBC), leading to characteristic amplifications observed in cancers (Tanaka and Yao, 2009). Analyzing the copy dosage of CNAs, along with genome rearrangement events, provides a detailed glimpse of the chaos in the aberrated chromosomal architecture harboured by most tumour genomes. A recently reported phenomenon in tumours called chromothripsis describes the catastrophic shattering of a chromosome followed by the cell’s attempt to repair it, leading to a rejoined chromosome with deletions and reoriented sequence fragments (Stephens et al., 2011; Rausch et al., 2012; Korbel and Campbell, 2013). Another type of complex rearrangement, called chromoplexy, involves a few events leading to coordinated breaks across multiple chromosomes, and was reported as frequently observed in prostate cancers (Baca et al., 2013).

In cancer, mutations of genes involved in DNA repair and replication can lead to abnormal regulation of these processes, resulting in increased chromosomal instability and structural genomic rearrangements. Inactivating mutations of TP53, which codes for the hallmark tumour suppressor protein p53, dysregulate normal control of cell growth and apoptosis programming (Baker et al., 1990). Deleterious mutations in BRCA1 and BRCA2 lead to compromised DNA repair and cell cycle control (Venkitaraman, 2014). RAD51, which codes for a protein that interacts with Brca2 to facilitate the joining and crossing over of homologous strands, has been reported in knockout studies as leading to impaired HR and contributing to cancer susceptibility (Takata et al., 2001; Lord and Ashworth, 2012).
Similarly, mutations in XRCC4, a gene involved in NHEJ, are associated with susceptibility in several cancer types (Fu et al., 2003).

While the mechanisms leading to formation of CNAs are fundamental for understanding this genomic feature, the consequence of abnormal chromosome dosage itself can be the driving event under selection in cancer. The balance between accumulation of CNAs and the ability of the cell to tolerate changes in copy number (Torres et al., 2010) may give rise to cells with increased proliferative advantages, particularly when tumour suppressors and oncogenes are affected. As a result, CNAs make up an important mutational class of aberrations commonly observed in cancer genomes. The pathogenicity of a tumour and its potential for drug resistance and metastasis can be promoted through selective pressures on the consequent phenotype driven by somatic CNA events. Analysis of the regulation of gene expression as a consequence of CNA events provides yet another layer in interpreting the phenotypic effects of gene dosage on cellular functions in cancer. The CNA of a gene can affect the measurable expression of the same gene (cis-regulation); the extent of this effect is still understudied. Furthermore, somatic copy number profiles can aid in refining and discovering novel genomic subtype classifications, such as for breast cancers (Chin et al., 2007; Curtis et al., 2012).

Genomes of normal cells also harbour polymorphic regions with variable copy number called germline copy number variants (CNVs) (Conrad et al., 2010; Kidd et al., 2008; Redon et al., 2006; Sebat et al., 2004; Tuzun et al., 2005). CNVs are often prevalent near segmental duplications (Sharp et al., 2005) and contribute to global population-based variations and phenotypic differences in healthy individuals. Substantial effort has been made in sequencing large cohorts to comprehensively explore CNVs as part of the 1000 Genomes Project (1000 Genomes Project Consortium et al., 2010; Mills et al., 2011) and HapMap studies (International HapMap Consortium et al., 2007; International HapMap 3 Consortium et al., 2010). Results from the majority of these studies surveying CNVs have been archived in the Database of Genomic Variants (DGV, http://projects.tcag.ca/variation/) (Iafrate et al., 2004). In addition to inherited mutations of BRCA1 and BRCA2, numerous catalogued germline CNVs (Forbes et al., 2011) are associated with cancer susceptibility, such as for breast and ovarian cancers and neuroblastoma (Diskin et al., 2009; Shlien and Malkin, 2009; Fletcher and Houlston, 2010; Walsh et al., 2010). However, analogous to germline SNPs in somatic point mutation analysis, the majority of CNVs are present as a source of genetic variation whose role in tumour pathology is often undetermined and may confound interpretation of predictions when attempting to extract relevant driver somatic events. This problem is explored in more detail later in this dissertation.

1.1.2 Loss of heterozygosity

Segmental regions of loss of heterozygosity (LOH) are a common feature of tumour genomes. LOH is the change of genotype of a particular locus from a heterozygous state (e.g. containing 2 alleles) to a homozygous state in which only one parental allele remains. LOH can be observed in cancer cells, relative to normal tissue of the same individual, as a direct consequence of CNA events. The defining event that first leads to LOH is a hemizygous terminal or interstitial deletion in which one copy (i.e. allele) is deleted. Subsequent (secondary) events can duplicate the remaining copy, returning the region to two copies (copy neutral) of the same allele. Finally, an amplification of this remaining allele can result in an arbitrary number of copies (Figure 1.1).
Therefore, LOH can be observed under three scenarios (Figure 3.1 and 3.7): deletion LOH (DLOH) due to a CNA loss, copy-neutral LOH (NLOH), and amplified LOH (ALOH) due to CNA amplification. Additional mechanisms such as gene conversion, mitotic recombination, and mitotic nondisjunction or missegregation followed by random chromosome loss (Figure 1.1) also result in copy neutral LOH events in the cancer cell (Ogiwara et al., 2008).

[Figure 1.1 panel labels. Whole chromosome events: mitotic non-disjunction, chromosome missegregation, random loss. Partial chromosome events: gene conversion (via DSB repair), mitotic recombination (via HR), unbalanced and balanced translocation, terminal or interstitial deletion. Copy gains of remaining allele: duplication (copy neutral), amplification.]

Figure 1.1: Mechanisms giving rise to copy number alterations (CNA) and loss of heterozygosity (LOH). Adapted from Ogiwara et al. (2008).

In numerous malignancies, tumour suppressor genes such as PTEN, RB1, and TP53 often exhibit loss of function mutations coupled with LOH, thereby removing all wild-type alleles and rendering mutant alleles homozygous. LOH in cancer has been extensively studied and reported as contributing to the inactivation of tumour suppressor genes such as BRCA1 and BRCA2 in breast cancers (Welcsh and King, 2001). Recessive tumour suppressor genes that are completely inactivated need to follow Knudson’s two-hit hypothesis for loss-of-function, as was shown for RB1 (Knudson, 1971). Another example is p53 in colorectal cancer, which was observed as having deletions leading to LOH and presence of mutations in the other allele (Baker et al., 1990; Ahmed et al., 2010). In these cases, both events may be somatically acquired or the first hit may be inherited in the germline.
However, there are several tumour suppressor genes reported as requiring only a single somatic alteration event to disable one allele, leading to haploinsufficiency (Payne and Kemp, 2005). Thus, genome-wide LOH is an essential feature to consider in the landscape of alterations of cancer genomes, and has been considered prominently in the recent large-scale genomic studies of cancer subtypes previously mentioned. Studying the different classes of genetic alterations such as CNA and mutations in conjunction with LOH is essential to understanding the tumour suppressor genetics contributing to cancer progression.

LOH leads to mono-allelic expression (MAE) of the single remaining parental allele; however, MAE of genes can also occur as a result of allele silencing due to epigenetic factors. Ascertaining genome-wide allelic expression of genes associated with somatically induced LOH in the genome has not yet been undertaken in cancer. The impact of MAE is two-fold in understanding and prioritizing candidate genes: 1) tumour suppressor genes may be haploinsufficient when alleles are lost (Berger et al., 2011) and 2) oncogenes have activating functions when alleles are specifically amplified (Jirtle, 1999). Investigating MAE from the genomic-driven perspective via LOH can help to nominate genes whose expression of the remaining allele may have selective advantages for tumourigenesis and progression. LOH and MAE are explored for triple-negative breast cancers in Chapter 3.

1.2 Tumour heterogeneity and clonal diversity

Genetic heterogeneity in cancer is well established and frequently observed between cancer types, between patients, and within tumours. A source of cancer heterogeneity is the variation in genomic aberrations that are observable through examination of genetic profiles. Discerning the variations between cancer types and patients can provide insights into therapeutic resistance and drug treatment strategies.
Genomic studies have addressed tumour heterogeneity on four general levels. First, large cohort pan-cancer studies have compared across cancer types localized to different organs or tissue types. For example, studies have characterized and compared CNAs and/or SNVs in 3131 tumours from 26 cancer types (Beroukhim et al., 2010), 3299 tumours from 12 cancer types (Ciriello et al., 2013), 4934 tumours from 11 cancer types (Zack et al., 2013), and 7042 cancers across 30 cancer types (Alexandrov et al., 2013). The second level is the inter-patient tumour diversity for the same cancer type, including histological subtypes. Large-scale projects from The Cancer Genome Atlas (TCGA) and International Cancer Genome Consortium (ICGC) (International Cancer Genome Consortium et al., 2010) have characterized hundreds of genomes to study the somatic aberrations in patient cohorts with glioblastoma (Cancer Genome Atlas Research Network, 2008), ovarian cancer (Cancer Genome Atlas Research Network, 2011a), breast cancer (Cancer Genome Atlas Research Network, 2012c; Alexandrov et al., 2013), colorectal cancer (Cancer Genome Atlas Research Network, 2012b), and clear cell renal cell carcinoma (Cancer Genome Atlas Research Network, 2013a). Recently, we have also profiled the genomic and transcriptomic landscapes of 2000 breast cancers, reporting novel subgroups based on genetic heterogeneity across patient samples (Curtis et al., 2012). We discuss results from this study in Section 2.4.1.

The third level is the intra-patient temporal diversity between primary, metastatic, and recurrent tumours within an individual patient.
Studies have been undertaken to compare the genomic profiles between primary and metastatic tumours from breast (Shah et al., 2009a; Ding et al., 2010) and pancreatic (Yachida et al., 2010; Campbell et al., 2010) cancers, as well as for relapse tumours in high-grade serous ovarian carcinoma (Castellarin et al., 2012) and acute myeloid leukemia (Ding et al., 2012a). The fourth level is the intra-tumour heterogeneity within a single patient tumour. Enormous efforts have been put forth to address the first three types of heterogeneity; however, intra-tumoural genomic diversity is still an active area of exploration with the recent emergence of clonal evolution studies for cancers of the breast (Nik-Zainal et al., 2012b), kidney (Gerlinger et al., 2012, 2014), colorectum (Kreso et al., 2013), pancreas (Kreso et al., 2013) and ovary (Martinez et al., 2013; Castellarin et al., 2012), leukemia (Landau et al., 2013; Anderson et al., 2011; Ding et al., 2012a), medulloblastoma (Wu et al., 2012), and glioblastoma (Sottoriva et al., 2013; Szerlip et al., 2012). We have also contributed to this effort for high-grade serous ovarian cancers (Bashashati et al., 2013), which is discussed further in Section 2.4.2.

Tumours evolve through acquisition of genetic aberrations that give rise to new populations of cells with unique genotypes. Tumour clones are defined as cells that are “related by descent from a unitary origin” (Aparicio and Caldas, 2013). Clones may acquire mutations that define their genotypes and may lead to phenotypic differences. Evolutionary pressures acting on phenotypes that confer a selective advantage can enable the expansion of resilient clones that are capable of adapting and proliferating in the microenvironment (Hanahan and Weinberg, 2011; Greaves and Maley, 2012; Yates and Campbell, 2012).
Through branched evolution, the emergence of new clones co-existing with ancestral tumour populations is a reflection of clonal diversity within the tumour (Aparicio and Caldas, 2013). By contrast, a clone may out-compete its predecessors and sister clones through increased growth rates and clonal expansions. Genetic diversity in clones may be observed by measuring SNVs, CNAs, LOH and structural genome rearrangement mutational classes. Chromosomal instability, which leads to an increased propensity for structural alterations, can result in cells with diverse genotypes (and consequent phenotypes) on which clonal selection will act (Burrell et al., 2013a,b). Each subpopulation of cells will have a unique clonal genotype consisting of a set of aberrations. In this dissertation, the focus is on the observed genomic variation that leads to clonal diversity. However, it is also important to note that clonal diversity leading to divergent phenotypes may also be attributed to other factors such as expression and epigenetic regulation (Varley et al., 2009), protein stability (Marusyk et al., 2012), and drug-resistance related microenvironmental changes (Kreso et al., 2013).

The abundance of clones can be estimated from the somatic mutations in their clonal genotypes (Aparicio and Caldas, 2013). Estimating the shifting cellular prevalence (the proportion of cells containing the mutation) can reveal mutations that were favourable to clonal populations under the presence of treatment or tumour microenvironment-induced selective pressures. In clonal population structures, high cellular prevalence of genetic aberrations indicates that mutations were acquired at an early stage in the tumour’s evolutionary history (Greenman et al., 2012; Nik-Zainal et al., 2012b). By contrast, low cellular prevalence suggests that mutations were acquired later, and are therefore harboured by a minor or subclonal population of cells. Figure 1.2 illustrates the estimation of tumour cellular prevalence of aberrations. In addition to SNVs, genomic aberrations including CNA and LOH were recently reported to exhibit diverse intra-tumoural patterns in breast (Navin et al., 2011), ovarian (Bashashati et al., 2013), brain (Sottoriva et al., 2013), and renal cancers (Gerlinger et al., 2012, 2014). These studies analyzed multiple biopsies of individual tumours, revealing spatial genomic heterogeneity as a result of clonal evolution. Thus, CNA and LOH events contribute substantially to the genomic footprint of evolutionary progression in cancer.

[Figure 1.2 panel labels: heterogeneous tumour biopsy (normal cells and tumour cells); DNA extraction; pooled DNA tumour sample; sequencing and analysis; CNA/LOH events with cellular prevalence of 100%, 50%, 30% and 20%, the lower values marked as subclonal events.]

Figure 1.2: Schematic of cellular prevalence estimates from pooled DNA. During DNA extraction, the population structure is lost; however, the tumour cellular prevalence (defined as the proportion of tumour cells with the aberration) can be estimated.

An important consideration is that each tumour biopsy sample can also be heterogeneous. The biopsy may be composed of a bulk mixture of contaminating normal cells and potentially multiple tumour cell populations. Therefore, in addition to analyzing multiple biopsies within a tumour, deconvolution of the subclonal cell populations within each biopsy can provide a finer level of characterization of the tumour heterogeneity. Currently, computational algorithms to address intra-tumour heterogeneity from a single tumour biopsy sample are under-developed. A deeper understanding of the underlying intra-tumoural heterogeneity can help to elucidate the operating dynamics of clonal evolution and ultimately give better insights into diagnosis, metastatic potential, therapeutic resistance and drug treatment strategies.
In Chapter 4, we discuss the analytical challenges in the inference of CNA/LOH in clonal subpopulations and introduce a new method to investigate this research problem.

1.3 Genomic assays for analysis of copy number alterations and loss of heterozygosity in tumours

We have described genomic aberrations of CNA and LOH in the context of cancer and clonal diversity in the previous sections. Now, we introduce the technologies used to assay and measure tumour genomes and the computational methods used to analyze these data to profile CNA and LOH. In general, genotyping arrays and DNA sequencing are the two most popular technologies developed for obtaining measurements used in profiling CNA and LOH, and are presented in the subsections that follow.

1.3.1 Detecting CNA in tumour DNA hybridized to SNP genotyping arrays

High-density SNP genotyping arrays are effective high-throughput assays for analyzing genome-wide CNA and LOH in human cancers (Lin et al., 2004; Laframboise et al., 2007; Korn et al., 2008; Yau and Holmes, 2008; Cooper et al., 2008). This technology is well established and cost effective for large cohort cancer studies and for analysis of recurrent copy number events (Bignell et al., 2010; Cancer Genome Atlas Research Network, 2008, 2011a, 2013a). Affymetrix Genome-Wide Human SNP6.0 and Illumina Human1M BeadChip are popular platforms used for SNP genotyping and profiling CNA/LOH. A single array consisting of more than 1 million SNP and CNV probes measures hybridization intensity (via fluorescence) as a surrogate for the amount of DNA at each locus. For example, SNP6.0 arrays have approximately $9 \times 10^5$ SNP and $9 \times 10^5$ CNV probes, each measuring a unique 25bp sequence in the genome.

Each SNP probe on the array interrogates the major (A) and minor (B) alleles. Analysis of these data primarily uses the major ($\theta^A_t$) and minor ($\theta^B_t$) allelic intensities at probe $t$ to derive two quantitative metrics.
The first is the log ratio, which is used for performing total copy number prediction. The log ratio is computed as $\log\left(\frac{\theta^A_t + \theta^B_t}{\theta^R_t}\right)$, where $\theta^R_t$ is the matched normal or reference sample intensity for probe $t$. In the absence of matched normal tissue, a pooled reference sample can be used. Typically, the reference is generated by taking the median probe intensity across a pool of healthy samples such as HapMap (International HapMap Consortium et al., 2007; International HapMap 3 Consortium et al., 2010). While the reference sample can help reduce the population-level CNVs from the tumour data, matched normal samples are required for removing patient-specific CNVs. In general, copy number neutrality, gain and loss are inferred from log ratio values near, greater than, and less than zero, respectively. The second metric is the B-allele frequency (BAF; usually derived from minor allele identification in population studies), which is computed as $\frac{\theta^B_t}{\theta^A_t + \theta^B_t}$ from the SNP probes. BAF allows for analysis of genotypes and LOH events spanning genomically adjacent SNP probes along a chromosome. A somatic LOH event is defined as having a BAF equal to zero (major allele genotype; AA) or one (minor allele genotype; BB) in the tumour but a BAF of approximately 0.5 (AB genotype) in the normal.

Similar to gene expression microarrays, SNP array intensities also need to be normalized to eliminate noise generated by sample preparation, hybridization protocol, and probe-specific biases. Normalization tools are platform-specific and generally categorized by the array manufacturer. For Affymetrix, CRMAv2 (Bengtsson et al., 2008, 2009) and ACNE (Ortiz-Estevez et al., 2010) perform calibration for cross-hybridization of probes between the alleles, GC content, and restriction enzyme-induced fragment-length biases.
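As a rough illustration of the two array metrics defined above, the following minimal sketch computes the log ratio and BAF for a single probe from its allelic intensities. The function names, the base-2 logarithm, and the tolerance used to call gain or loss are assumptions made for this example only; the thresholds are hypothetical and are not taken from any of the tools cited in this section.

```python
import math


def log_ratio(theta_a, theta_b, theta_r, base=2.0):
    """Log ratio for one probe: log((theta_A + theta_B) / theta_R).

    theta_a, theta_b: tumour intensities for the A and B alleles.
    theta_r: matched normal (or pooled reference) intensity.
    The base is an illustrative choice (ln or log2 are both used).
    """
    return math.log((theta_a + theta_b) / theta_r, base)


def b_allele_frequency(theta_a, theta_b):
    """BAF for one probe: theta_B / (theta_A + theta_B)."""
    return theta_b / (theta_a + theta_b)


def naive_copy_call(lr, tol=0.1):
    """Hypothetical threshold rule: values near zero are neutral,
    values above are gains, values below are losses."""
    if lr > tol:
        return "gain"
    if lr < -tol:
        return "loss"
    return "neutral"


if __name__ == "__main__":
    # Made-up intensities for three probes: (theta_A, theta_B, theta_R).
    probes = [(1.0, 1.0, 2.0), (2.0, 2.0, 2.0), (1.0, 0.0, 2.0)]
    for a, b, r in probes:
        lr = log_ratio(a, b, r)
        print(naive_copy_call(lr), round(lr, 3), round(b_allele_frequency(a, b), 3))
```

In practice a per-probe threshold like this is far too noisy; the segmentation methods discussed in this section pool evidence across adjacent probes instead.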
For Illumina, crlmm (Ritchie et al., 2009; Scharpf et al., 2011) and tQN (Staaf et al., 2008b) are tools that normalize raw and BeadStudio-processed intensities (Peiffer et al., 2006), respectively.

There are some general limitations of using SNP genotyping arrays for inference of CNAs. First, the SNPs on an array are fixed to specific loci, and while the density of loci across the genome is certainly adequate for CNA analysis, the resolution cannot be increased. Conrad et al. (2010) used a custom approach to alleviate this restriction by combining multiple arrays to generate 42 million probe loci to measure copy number; however, these were not all SNP probes. Another limitation is the use of hybridization intensities as a surrogate measure of DNA abundance. SNP probes are subject to an upper limit of hybridization saturation, which can lead to underestimation of truly high-level amplifications.

Software | Type | Events | Key features | Reference
DNAcopy (CBS) | Change-point | CNA | Requires post-processing step for discrete CNA calling | Venkatraman and Olshen (2007)
PICNIC | HMM | CNA/LOH | Ploidy, normal contamination, allele-specific CNA | Greenman et al. (2010a)
GPHMM | HMM | CNA/LOH | Ploidy, GC content correction, normal contamination | Li et al. (2011)
ASCAT | Change-point | CNA/LOH | Ploidy, normal contamination | Van Loo et al. (2010)
OncoSNP | HMM | CNA/LOH | Ploidy, GC content correction, normal contamination, heterogeneity | Yau et al. (2010)
HMM-Dosage | HMM | CNA/CNV | Distinguish somatic CNA and germline CNV | Ha and Shah (2013)

Table 1.1: Segmentation algorithms for genotyping arrays.

There are two general classes of algorithms applied to infer CNAs as segmental events from SNP genotyping array data. The first class of algorithms involves the detection of change points among the loci, and the subsequent inference of copy number for each detected segment using a non-parametric solution.
One of the first algorithms to employ this approach is Circular Binary Segmentation (CBS) (Olshen et al., 2004; Venkatraman and Olshen, 2007), which detects significant change at breakpoints in array CGH and SNP genotyping array data. ASCAT (Van Loo et al., 2010) is another such method; it employs piece-wise constant functions to segment data at change points and estimates normal contamination and tumour ploidy.

A second class of algorithms employs hidden Markov models (HMM), which probabilistically predict CNA events by inferring a linear sequence of discrete states (Day et al., 2007; Wang et al., 2007; Shah et al., 2006; Greenman et al., 2010a) (Table 1.1). SNP arrays are also an attractive platform for detecting LOH due to their genotyping capabilities. Published methods that employ HMMs to perform simultaneous prediction of CNA and LOH use an increased state space of genotypes (e.g. AAA, AAB, AABB) to represent the number of copies of each allele at a given locus by modelling both log ratio and BAF data (Greenman et al., 2010a; Yau et al., 2010; Li et al., 2011). PICNIC (Greenman et al., 2010b), GPHMM (Li et al., 2011), and OncoSNP (Yau et al., 2010) are methods that simultaneously infer CNA and LOH while also estimating normal contamination and tumour ploidy (Table 1.1). These three methods estimate the proportion of normal contamination as part of the Gaussian or Student's-t parameters in the HMM emission. In Section 1.4, we provide further details on the inference of CNA using HMMs.

1.3.2 Massively parallel sequencing technologies for detecting structural genomic alterations in tumours

Whole genome (WGS), targeted exome capture (WES) and transcriptome (RNAseq) sequencing of patient tumour-derived DNA and RNA samples is a common approach for interrogating cancer genomes and transcriptomes to simultaneously determine structural and nucleotide-level aberrations that underpin malignancies (Mardis and Wilson, 2009; Stratton et al., 2009).
The nucleotide resolution of these platforms has enabled researchers to comprehensively analyze genomic aberrations in large-scale studies by interrogating human genomes (and transcriptomes). In recent studies, somatic point mutations, insertions and deletions, copy number changes, and rearrangements were profiled in sequencing data of tumours, demonstrating the utility of this technology in detecting genetic aberrations.

There are a number of commercial platforms, such as the Illumina Genome Analyzer II and HiSeq 2000/2500, Life/ABI SOLiD, and Roche/454 GS FLX, that have been used to analyze re-sequenced human genomes in a massively parallel manner, producing short reads of 35 to 100bp. There are two steps to most platforms: template preparation and sequencing/imaging. Briefly, Illumina GAII/HiSeq sequencers immobilize fragmented DNA on a solid-phase glass slide during template preparation, where clonal amplification of these fragments into 100 million clusters takes place. Sequencing is performed using cyclic reversible termination, which involves iterating between incorporation of a nucleotide and imaging of the identifying fluorescence marker. Life/ABI SOLiD technology uses amplification via emulsion PCR for cell-free template preparation before immobilization on a glass slide. Sequencing uses 4-colour sequencing by ligation, and interpretation of imaging is based on an overlapping di-nucleotide scheme that can be decoded using a colour-space reference. For a more detailed review of the technologies, refer to Metzker (2010).

Millions of sequenced reads (single and paired-end) are aligned to the reference human genome using algorithms specifically designed for aligning short read lengths in manageable runtimes.
Briefly, the popular tools are MAQ (Li et al., 2008), which uses a spaced seed-and-extend approach by hashing the sequenced reads; BWA (Li and Durbin, 2009), which uses an FM-index based on the Burrows-Wheeler Transform data structure to store the reference sequence; and Bowtie (Langmead et al., 2009), which uses a data structure similar to that of BWA and has the advantage that identical reads need only be aligned once. These three algorithms also account for base qualities and return mapping qualities for downstream confidence assessment. For a detailed review of NGS aligners, refer to Li and Homer (2010).

Aligned data can be further processed to provide read counts of nucleotides (A, T, C, G) at each position in the genome (WGS) or coding space (WES and RNA-seq). Typically, the read counts represent the digital number of aligned reads matching the reference genome. These data provide higher resolution and better coverage of the genome compared to the restricted probe design of SNP arrays. Similar to SNP arrays, sequencing data will require data normalization procedures to account for technology-specific artifacts (see Section 1.3.3). However, because sequencing data involve discrete read count values rather than the continuous (signal intensity) measurements in SNP arrays, there are few normalization approaches in the literature at the moment. An important paper that addresses this issue was published by Benjamini and Speed (2012).

1.3.3 Predicting somatic CNA, LOH and allelic imbalance in whole genome sequencing of tumours

Classes of approaches for detecting structural variation and copy number

Whole genome sequencing has opened up opportunities for detecting structural variation (SV) and CNAs of tumours at nucleotide resolution. Three general strategies and a total of nineteen different algorithms were developed and applied to HapMap individuals in the latest 1000 Genomes companion SV study (Mills et al., 2011).
The first general approach is read-pair analysis (Korbel et al., 2007, 2009), which uses the discordance of distances between mapped paired-end reads to determine outliers in the empirical distribution of insert sizes. Read pairs that mapped further apart, closer together, or in reversed orientation compared to the expected layout were classified as deletions, insertions (albeit with unknown sequence), and inversions, respectively; translocations were classified as inter- and intra-chromosomal rearrangements. The second general approach is split-read analysis, which identifies paired-end reads with one end confidently mapped and the other, unmapped end consisting of sub-sequences originating from different locations. Local alignment algorithms are then applied to the unmapped read to determine whether the identified breakpoints support a deletion or a short novel insertion sequence (Ye et al., 2009; McPherson et al., 2011). The third strategy is read-depth analysis, which involves counting aligned reads in fixed-size windows or bins dividing the genome, followed by application of a segmentation algorithm to determine copy number differences between the tumour and normal genomes. For each bin, short aligned reads are collectively treated as a measure of DNA abundance. In practice, using bins also helps to reduce the dimensionality of the problem. All three general approaches have been well established in the field for detection of breakpoints, rearrangements, and fine-scale copy number regions in cancer genomes.

SVs and CNAs in cancer are related mutational features that can be inferred from genome sequencing data. However, prediction methods in the literature analyze these mutational classes separately.
In general, prediction of SVs relies on a combination of read-pair and split-read approaches (Chen et al., 2009; McPherson et al., 2011) to precisely identify breakpoints of rearrangements, while analysis of CNAs solely uses read-depth for inferring the number of copies. Structural rearrangements and segmental CNAs generally have corroborating breakpoints and boundaries; however, there are scenarios in which SV events do not lead to observable changes in copy number, or in which copy number changes are missed by SV analyses. For example, balanced rearrangements result in no measurable net change of copies, and telomeric copy number deletions are not readily detected in SV analyses. SVs are a class of mutations that is critical for understanding the mechanisms involved in the somatic chromosomal architecture of tumour genomes (Yang et al., 2013). However, in this dissertation, the focus is on methodology that uses the analysis of read depth (third strategy) to predict CNA and LOH in tumour WGS data.

Analyzing read-depth to predict CNA in tumour-normal paired samples

Analysis of CNA using read-depth in tumour WGS data parallels the conventional approaches designed for SNP genotyping data (see Section 1.3.1). The first reported read-depth method to detect copy number in paired tumour-normal cell lines used CBS to segment read counts at each bin (Campbell et al., 2008b). Another approach, SegSeq (Chiang et al., 2009), relaxed the fixed bin sizing to help refine breakpoints by joining bins which had statistically different log ratios by Poisson modelling. BIC-seq (Xi et al., 2011a) is a non-parametric model that iteratively joins windows based on the Bayesian information criterion (BIC). Control-FreeC (Boeva et al., 2011) segments binned data using a Lasso-based change-point detection algorithm. These tools belong to a class of CNA prediction algorithms that use a two-step strategy: 1) segmentation of the data, followed by 2) classification of segments into discrete copy number states in a secondary step.
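
The read-depth input that such segmentation tools consume can be sketched as below: reads are counted in fixed-size bins, and a per-bin tumour/normal log ratio is formed. This is an illustrative sketch and not the procedure of any cited tool; the function names, the zero-based read start positions, the pseudocount, and the normalization by total read count are all assumptions.

```python
import math


def bin_counts(read_starts, chrom_length, bin_size):
    """Count aligned reads whose (zero-based) start falls in each fixed-size bin."""
    n_bins = (chrom_length + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for pos in read_starts:
        counts[pos // bin_size] += 1
    return counts


def depth_log_ratios(tumour_counts, normal_counts, pseudocount=1.0):
    """Per-bin log2 ratio of tumour to normal depth.

    Each count is divided by its library's total so that differences in
    sequencing depth between the two libraries cancel out. The pseudocount
    (an assumption) avoids taking the log of zero in empty bins.
    """
    tumour_total = sum(tumour_counts)
    normal_total = sum(normal_counts)
    ratios = []
    for t, n in zip(tumour_counts, normal_counts):
        tumour_frac = (t + pseudocount) / tumour_total
        normal_frac = (n + pseudocount) / normal_total
        ratios.append(math.log2(tumour_frac / normal_frac))
    return ratios


# Hypothetical read start positions on a 30 bp toy chromosome, 10 bp bins.
tumour = bin_counts([0, 1, 2, 5, 7, 12, 25], chrom_length=30, bin_size=10)
normal = bin_counts([0, 5, 11, 14, 21, 25], chrom_length=30, bin_size=10)
print(tumour, normal, depth_log_ratios(tumour, normal))
```

The resulting log ratios would then be passed to a segmentation step, after correcting for biases such as GC content and mappability.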
These approaches will generally use a non-parametric model for the state space (second step), and thus have the advantage of using an arbitrary number of copies for CNA status; however, the two disjoint steps may be less statistically robust. Recently, OncoSNP-seq (Yau, 2013), which is the re-purposing of OncoSNP (Yau et al., 2010) for WGS data, uses an HMM to segment and infer CNA status. Nevertheless, integrative approaches that segment the data and simultaneously predict CNA status are well represented in the literature for SNP genotyping arrays but are still lacking for WGS of tumours.

WGS data have systematic biases that can be attributed to sequence-specific properties during alignment of reads. GC content, which is the proportion or density of G and C nucleotides in a given sequence, has been shown to influence the amount of observed read depth for specific sequences, largely due to PCR amplification during library preparation (Benjamini and Speed, 2012). Another sequence-specific bias pertains to the uniqueness of sequences, known as mappability. Reads originating from highly repetitive DNA are prone to mis-alignments, generally leading to unreliable (i.e. increased or decreased) read depth at these regions. This dissertation does not explicitly focus on the preprocessing normalization of tumour read depth WGS data; however, in Appendix A, we provide a brief description of an in-house solution that we developed, called HMMcopy, to address these important biases.

Predicting LOH from allele read counts in tumour samples

SNV genotyping tools for sequencing data, such as MAQ (Li et al., 2008) and SNVMix (Goya et al., 2010), have been used to analyze tumour-normal genomes to detect LOH at individual loci in an independent and identically distributed (iid) (or site-independent) manner.
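
A site-independent LOH call of this kind can be sketched as follows. The sketch uses fixed allele-fraction thresholds purely for illustration; this is an assumption, since genotypers such as those cited rely on probabilistic models rather than hard cut-offs. The function names and threshold values are hypothetical.

```python
def allele_fraction(ref_count, alt_count):
    """Fraction of reads at a site that do not match the reference base."""
    depth = ref_count + alt_count
    if depth == 0:
        return None  # no coverage, so no call can be made
    return alt_count / depth


def is_heterozygous(ref_count, alt_count, low=0.3, high=0.7):
    """Call a site heterozygous when the allele fraction is near 0.5."""
    frac = allele_fraction(ref_count, alt_count)
    return frac is not None and low <= frac <= high


def is_homozygous(ref_count, alt_count, margin=0.1):
    """Call a site homozygous when nearly all reads support one allele."""
    frac = allele_fraction(ref_count, alt_count)
    return frac is not None and (frac <= margin or frac >= 1.0 - margin)


def call_loh(normal_counts, tumour_counts):
    """LOH at one site: heterozygous in the normal, homozygous in the tumour."""
    return is_heterozygous(*normal_counts) and is_homozygous(*tumour_counts)


# Hypothetical (reference, non-reference) read counts at two germline SNPs.
print(call_loh((10, 11), (20, 1)))  # tumour has lost the non-reference allele
print(call_loh((10, 11), (12, 9)))  # tumour retains both alleles
```

Because every site is judged on its own, a single noisy site can flip the call, which is the weakness that segment-level models are meant to address.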
For example, given a SNP locus that is heterozygous in the normal (germline) sample, LOH is observed if a homozygous prediction is made at the same locus in the tumour (Figure 1.3). Using this strategy, Zhao et al. (2010) predicted LOH in exome sequencing data by comparing allele frequencies in a breast cancer cell-line and the matching normal human lymphoblastoid cell-line. However, these iid tools have less statistical power because they do not account for the spatial nature of CNA and LOH events as regions of sequence spanning consecutive germline SNPs. Currently, approaches for the analysis of segmental regions of LOH in WGS tumour data are still lacking in the cancer genomics community.

Integrated analysis of CNA and LOH in WGS data

There are few computational methods available for simultaneously inferring both CNA and LOH from read depth and allelic ratios, respectively, in tumour WGS data using a unified probabilistic model. The tools that were previously described, such as SegSeq (Chiang et al., 2009) and BIC-seq (Xi et al., 2011b), only detect copy number changes. Control-FreeC (Boeva et al., 2012a) analyzes both CNA and LOH but segments the two data types independently before resolving them in a post-processing step. OncoSNP-seq (Yau, 2013) is the only aforementioned tool that simultaneously infers CNA and LOH. Nevertheless, robust probabilistic approaches to perform joint genome-wide detection of CNA and LOH in tumour genome sequencing data are still under-developed. To meet the demands of cancer sequencing studies, methods designed to jointly analyze CNA and LOH from WGS data are warranted.

1.4 Inference of CNA and LOH using hidden Markov models

CNA events have the defining property that the change in genomic structure contiguously spans a segment of sequence along a chromosome. Genomic assay platforms often interrogate the genome at fixed loci, such as SNPs (genotyping arrays) and binned windows (WGS data). This problem has motivated the application of segmentation algorithms for detecting regions of structural alterations by identifying events that span consecutive loci of interest. The data often contain technical artifacts and are affected by biological considerations, motivating the need for robust analytical solutions.

[Figure 1.3 panels: NORMAL and TUMOUR, each showing the reference sequence, aligned reads, and allelic counts (a, b), with an LOH prediction track.]
Figure 1.3: Illustration of LOH prediction in sequencing data. a represents the read counts that match the base from the reference genome; b represents the read counts not matching the base in the reference genome.

During segmentation of genomic data, the HMM handles noise and outliers in the data by taking into account spatial correlation, hence modelling events that are usually observed as spanning contiguous adjacent probes or genomic regions. Furthermore, HMM-based approaches have the flexibility to model a variable number of discrete copy number states.
This class of algorithms was originally designed for analyzing array-based data but has since been extended to genome sequencing data. The algorithms have evolved to also consider cancer-specific properties such as polyploidy, normal contamination and even tumour heterogeneity.

We now introduce the general details of a Bayesian formulation of the continuous HMM that forms the basic framework for the three methods presented in this dissertation. In each research chapter, we will present the novel extensions that enabled the algorithm to solve the unique problem. The formulations here will use standard generic notation; however, each chapter may define new symbols to represent specific biological nomenclature. The details presented next are largely a summary of Bishop (2007).

HMMs generally consist of three components: (1) a set of hidden or latent (unknown) random variables for each data point $t \in L$, where $L = \{t_i\}_{i=1}^{T}$ is the set of $T$ genomic loci of interest; (2) emission distributions to model the observed input data $y_{1:T}$ at each data point in $L$; and (3) a transition component to model the relationship between consecutive data points (loci).

Latent state space of biologically interpretable states

The latent (or hidden) component in the HMM is a set of random variables $Z_{1:T}$, one for each genomic locus. Each variable $Z_t$ can be assigned one of the discrete states $k \in K$, where $K$ is the set of $|K|$ biologically interpretable values such as the copy number (see Section 2.2.2) or genotype (see Section 3.2.2) status.
The initial distribution over states in $K$ at the first data point $t = 0$ of each chromosome is modeled using the multinomial distribution with parameter $\pi_Z$,

$$p(Z_0 \mid \pi_Z) = \mathrm{Mult}(Z_0 \mid \pi_Z) \qquad (1.1)$$

$\pi_Z$, also known as the component mixing weights, is a parameter of the HMM and is distributed according to a Dirichlet prior with hyperparameter $\delta_Z$ (Equation 1.2).

$$p(\pi_Z \mid \delta_Z) = \mathrm{Dir}(\pi_Z \mid \delta_Z) \qquad (1.2)$$

Emission component models input data at genomic loci

The input data consisting of measurements taken at genomic loci $L$ are modeled by a mixture of $|K|$ distributions, one for each state in $K$. The $k$th distribution is parameterized by one or more parameters, represented here as $\mu_k$. Therefore, at locus $t$, the emission model for $y_t$ is

$$p(y_t \mid \mu_k, Z_t = k) \qquad (1.3)$$

Under the Bayesian formulation, the parameter(s) $\mu_k$ are also assumed to be drawn from a prior distribution $p(\mu_k \mid m_k)$, where $m_k$ is a set of one or more hyperparameters. In Chapters 2, 3, and 4, the observed data are modeled using distributions specific to various data types. Moreover, the definition and parameterization of $\mu_k$ depend on the nature of the data (i.e. continuous or discrete) and unique cancer-related properties in the model.

Transition component for modelling the regional relationship between adjacent genomic loci

The segmental property of copy number events spanning genomic regions presents a problem suited for HMMs. The transition matrix encodes the correlation between measurements from adjacent genomic positions. For CNA inference, these probabilities are used to represent blocks of contiguous positions having similar copy number (self-transitions) as well as the transitions between segments of different copy number states. These probabilities are defined in the transition matrix $A \in \mathbb{R}^{|K| \times |K|}$.
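
As a minimal sketch of such a matrix, the following builds a stationary transition matrix in which every state shares the same self-transition probability and the remaining mass is split evenly among the other states. This parameterization and the function names are assumptions for illustration, not the form used in the later chapters. With self-transition probability $p$, the number of consecutive loci spent in a state is geometrically distributed with mean $1/(1-p)$, which is how a value of $p$ close to 1 favours long segments.

```python
def make_transition_matrix(n_states, self_prob):
    """Stationary |K| x |K| matrix with a shared self-transition probability.

    The remaining probability mass (1 - self_prob) is divided equally among
    the other states, so every row sums to one.
    """
    if n_states < 2:
        raise ValueError("need at least two states")
    off_diag = (1.0 - self_prob) / (n_states - 1)
    return [[self_prob if i == j else off_diag for j in range(n_states)]
            for i in range(n_states)]


def expected_run_length(self_prob):
    """Mean number of consecutive loci spent in a state (geometric mean)."""
    return 1.0 / (1.0 - self_prob)


# Hypothetical three-state model: loss, neutral, gain.
A = make_transition_matrix(3, 0.99)
for row in A:
    print(row, sum(row))
print(expected_run_length(0.99))
```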
The probability for transitions between adjacent positions $t-1$ and $t$ (within a chromosome) and states $i, j \in K$ is

$$p(Z_t = j \mid Z_{t-1} = i) = A(i, j) \qquad (1.4)$$

The rows of the matrix sum to 1 such that $\sum_j A(i, j) = 1, \; \forall i$.

Because the distances between genomic loci may not necessarily be uniform, a position-specific, or non-stationary, formulation of the transition model is warranted. Each locus uses a transition matrix $A_t \in \mathbb{R}^{|K| \times |K|}$, which encodes the additional distance-based probabilities (Colella et al., 2007). In Chapters 2, 3, and 4, novel extensions to the transition model make use of a position-specific transition to encode further prior knowledge.

Parameter estimation and inference

The complete log-likelihood for the data, given the parameters $\theta = \{\pi_{1:|K|}, \mu_{1:|K|}, A\}$ and the indicator $I(Z_t = k)$ that is 1 when $Z_t = k$ and 0 otherwise, is

$$\begin{aligned}
\log p(y, Z \mid \theta) \propto {} & \sum_{k} I(Z_0 = k) \log p(Z_0 = k \mid \pi_Z) && (1.5)\\
& + \sum_{t=1}^{T} \sum_{i}^{K} \sum_{j}^{K} I(Z_t = j, Z_{t-1} = i) \log p(Z_t = j \mid Z_{t-1} = i, A_t) && (1.6)\\
& + \sum_{t=1}^{T} \sum_{k}^{K} I(Z_t = k) \log p(y_t \mid \mu_k, Z_t = k) && (1.7)\\
& + \sum_{k}^{K} \log p(\mu_k \mid m_k) + \sum_{k}^{K} \log p(\pi_k \mid \delta_Z) && (1.8)
\end{aligned}$$

The complete log-likelihood includes the log of the initial state distribution (Equation 1.5), transition model (Equation 1.6), emission model (Equation 1.7), and priors (Equation 1.8). The objective is to find the set of parameters $\theta$ that maximizes the log-likelihood function $\log p(y, Z \mid \theta)$.
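
To make the terms of the complete log-likelihood concrete, the sketch below evaluates the initial-state, transition, and emission terms (Equations 1.5 to 1.7) for a known state path. Several simplifications are assumptions made for illustration: Gaussian emissions with a shared standard deviation, a stationary matrix A in place of the position-specific $A_t$, zero-based indexing with one observation per locus, and omission of the prior terms in Equation 1.8. All names and numbers are hypothetical.

```python
import math


def gaussian_logpdf(y, mean, sd):
    """Log density of a univariate Gaussian."""
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - (y - mean) ** 2 / (2.0 * sd * sd)


def complete_log_likelihood(y, z, pi, A, means, sd):
    """log p(y, Z | theta) for a known state path z, without the prior terms."""
    total = math.log(pi[z[0]])                            # initial state term
    for t in range(1, len(z)):
        total += math.log(A[z[t - 1]][z[t]])              # transition terms
    for t in range(len(y)):
        total += gaussian_logpdf(y[t], means[z[t]], sd)   # emission terms
    return total


# Hypothetical log ratios with states 0 = loss, 1 = neutral, 2 = gain.
y = [-1.0, -0.9, 0.0, 0.1]
pi = [1.0 / 3.0] * 3
A = [[0.90, 0.05, 0.05],
     [0.05, 0.90, 0.05],
     [0.05, 0.05, 0.90]]
means = [-1.0, 0.0, 1.0]

segmented = complete_log_likelihood(y, [0, 0, 1, 1], pi, A, means, sd=0.2)
all_neutral = complete_log_likelihood(y, [1, 1, 1, 1], pi, A, means, sd=0.2)
print(segmented > all_neutral)
```

In this toy example the path with one state change scores higher than the all-neutral path: the cost of a single transition is outweighed by the better fit of the emissions.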
A solution to learning the parameters is the use of the expectation maximization (EM) algorithm, which iteratively alternates between computing the expected value of the log-likelihood function

$$Q^{(n)} = \mathbb{E}_{Z \mid y_{1:T}, \theta^{(n-1)}}\left[\log p(y_{1:T}, Z \mid \theta^{(n)})\right]$$

with the previous setting of the parameters $\theta^{(n-1)}$ at EM iteration $(n-1)$, and then estimating the new parameters that maximize $Q^{(n)}$ at iteration $(n)$.

In the expectation step, the expectations of the random variables $Z_{1:T}$ are inferred and the posterior probabilities $p(Z_t \mid y_{1:T}, \theta) = \gamma_t$ (responsibilities) are computed for each locus $t$ given all the data $y_{1:T}$ and the current estimates of the parameters $\theta$. This computation follows from Bayes' rule,

$$p(Z_t \mid y_{1:T}, \theta) = \frac{p(y_{1:T} \mid Z_t, \theta)\, p(Z_t \mid \theta)}{p(y_{1:T} \mid \theta)} \qquad (1.9)$$

Computing the posterior probabilities $\gamma_{1:T}$ directly is generally difficult, but an efficient dynamic programming technique called the forwards-backwards algorithm is used (see Bishop (2007) for details). $\gamma_t(k)$ is computed from the probabilities obtained in the forward $f_t$ and backward $b_t$ propagations,

$$\gamma_t(k) = \frac{f_t(k)\, b_t(k)}{\sum_{j \in K} f_t(j)\, b_t(j)} \qquad (1.10)$$

at position $t$ and state $k \in K$. The forward probabilities $f_t(\cdot)$ are normalized at each position by a coefficient $w_t$ (scaling factor), and the backward probabilities are rescaled likewise, so that the values do not decrease to zero. For millions of genomic loci, even with 64-bit floating-point arithmetic, whose smallest positive normalized value is approximately $2.2 \times 10^{-308}$, compounding probabilities across the data points can still lead to machine underflow. Keeping track of $w_t$ at each position of the forward trace also gives the data log-likelihood,

$$\log p(y_{1:T} \mid \theta) = \sum_{t=1}^{T} \log w_t \qquad (1.11)$$

In the maximization step, the maximum a posteriori (MAP) estimate, or the mode of the expected complete likelihood $Q^{(n)}$, is used to update the parameters. That is, we maximize $Q^{(n)}$ with respect to $\theta$ for iteration $n$ by the following

$$\theta^{(n)} = \arg\max_{\theta} \sum_{Z} p(Z \mid y_{1:T}, \theta^{(n-1)}) \log p(y_{1:T}, Z \mid \theta^{(n)}) = \arg\max_{\theta} Q^{(n)} \qquad (1.12)$$
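
The scaled forwards-backwards recursions used in the expectation step can be sketched as follows. This is a generic textbook-style sketch in the spirit of Bishop (2007), not the implementation used in this dissertation. It assumes a stationary transition matrix and takes precomputed emission probabilities as input; the function and variable names are hypothetical.

```python
import math


def forward_backward(emission_probs, pi, A):
    """Scaled forwards-backwards recursions for a discrete-state HMM.

    emission_probs[t][k] holds p(y_t | Z_t = k). Returns the posterior
    marginals gamma[t][k] and the data log-likelihood.
    """
    T, K = len(emission_probs), len(pi)
    f = [[0.0] * K for _ in range(T)]
    w = [0.0] * T
    # Forward pass: normalize f_t at every locus and keep the normalizer w_t.
    for t in range(T):
        for k in range(K):
            if t == 0:
                prior = pi[k]
            else:
                prior = sum(f[t - 1][i] * A[i][k] for i in range(K))
            f[t][k] = prior * emission_probs[t][k]
        w[t] = sum(f[t])
        f[t] = [v / w[t] for v in f[t]]
    # Backward pass: rescale with the same w_t so values stay in range.
    b = [[1.0] * K for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(K):
            b[t][i] = sum(A[i][j] * emission_probs[t + 1][j] * b[t + 1][j]
                          for j in range(K)) / w[t + 1]
    # Posterior marginals: product of the two passes, normalized over states.
    gamma = []
    for t in range(T):
        unnorm = [f[t][k] * b[t][k] for k in range(K)]
        norm = sum(unnorm)
        gamma.append([u / norm for u in unnorm])
    # The log-likelihood is the sum of the log scaling factors.
    loglik = sum(math.log(v) for v in w)
    return gamma, loglik


# Hypothetical two-state example with three loci.
gamma, loglik = forward_backward(
    [[0.9, 0.1], [0.5, 0.5], [0.2, 0.7]],
    pi=[0.6, 0.4],
    A=[[0.7, 0.3], [0.2, 0.8]],
)
print(gamma, loglik)
```

Because only normalized quantities and log scaling factors are stored, the same code runs on thousands of loci whose unscaled joint probability would underflow.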
The EM convergence criterion is met when $F^{(n)} - F^{(n-1)} < \text{threshold}$, where $F$ is the sum of the data log-likelihood and the log priors,

$$F^{(n)} = \log p(y_{1:T} \mid \theta^{(n-1)}) + \log p(\theta^{(n)} \mid m, \delta_Z) \qquad (1.13)$$

The converged parameters $\hat{\theta}$ are taken from the last iteration once the EM stopping criterion is met. The Viterbi algorithm is used to infer the optimal hidden state path of genotypes, $Z_{1:T}$,

$$Z_{1:T} = \arg\max_{Z} \left\{ p(Z_{1:T} \mid y_{1:T}, \hat{\theta}) \right\} \qquad (1.14)$$

1.5 Research contribution

In this dissertation, we developed three models that employ HMMs to address the current challenges in CNA/LOH detection from cancer SNP genotyping array and WGS data. These are briefly outlined in the remaining sections and are presented as the technical research chapters.

1.5.1 Distinguishing germline and somatic copy number events in cancer SNP genotyping array data

Identifying somatic CNAs is often confounded by germline CNVs or polymorphisms. In order to prioritize candidate genes specifically observed in cancer cells, CNVs need to be properly identified and distinguished from somatic events. There is a lack of tools designed to simultaneously perform segmentation and prediction of both somatic and germline copy number events in a unified framework. In Chapter 2, we discuss the current limitations of distinguishing CNAs and CNVs as separate, discrete events in primary tumour SNP genotyping array data, and present a probabilistic approach called HMM-Dosage to address these issues. The algorithm is based on the hypothesis that CNAs and CNVs contribute different statistical signals in tumour genome data due to normal contamination. CNV events are found in all cells (including normal) while CNAs are harboured by only tumour cells.
HMM-Dosage, which is an extension of CNA-HMMer (Shah et al., 2006), models the different statistical patterns of CNVs and CNAs in a single, unified probabilistic framework by incorporating position-specific CNV prior knowledge. We provide an evaluation of the method for detecting germline CNVs compared with existing methods and published databases. Finally, we describe the published results from the application of HMM-Dosage to a cohort of 2000 breast cancers (Curtis et al., 2012) and 31 spatial and temporal intra-patient ovarian tumours (Bashashati et al., 2013).

1.5.2 Inference of loss of heterozygosity in whole genome sequencing of tumours

In Chapter 3, we describe a novel probabilistic model called APOLLOH, which addresses the current challenges in genome-wide detection of LOH in tumour WGS data. APOLLOH infers regions of allelic imbalance and LOH from paired tumour-normal data based on the hypothesis that accounting for genomic spatial correlation and tumour CNA profiles from tumour WGS data will provide more accurate predictions. APOLLOH uses a non-stationary HMM to analyze allelic read counts to infer regions with the following genotypes: retention or heterozygous (HET), homozygous (LOH), and allele-specific copy number amplification (ASCNA). The model considers CNA for accurate distinction of LOH from ASCNA. We evaluated APOLLOH by comparing to results obtained from orthogonal platforms and benchmarking on real, in silico admixture data. Finally, we analyzed the transcriptomes of 22 TNBC patients to study the contribution of LOH to mono-allelic expression (MAE). In the context of prioritizing candidate genes, the impact of MAE is two-fold: 1) tumour suppressors may be haploinsufficient when alleles are lost (Berger et al., 2011) or 2) oncogenes with activating mutations on the remaining allele may be selectively amplified (Jirtle, 1999).
This is the first study aimed at describing LOH and MAE by integrating whole genome and transcriptome sequencing datasets of this magnitude.

1.5.3 Inference of tumour clonality through analysis of copy number in whole genome sequencing data

In Chapter 4, we present a novel probabilistic model called TITAN, which simultaneously infers CNA and LOH segments and estimates their cellular prevalence from WGS data of a single tumour biopsy. Currently, there is no tool for jointly estimating cellular prevalence and inferring CNA and LOH in tumour WGS data. TITAN was designed under the hypothesis that observed signals in the WGS of the biopsy's bulk DNA result from a mixture of normal cells and multiple sub-populations of tumour cells. Furthermore, TITAN uses the assumption that co-occurring CNA/LOH events in the same cells emerged during punctuated clonal expansions (Navin et al., 2011; Greaves and Maley, 2012). This motivated a clustering paradigm for statistical inference, allowing for increased power to detect weaker signals in the data, and to distinguish sets of events at different cellular prevalences (see Section 1.2). We evaluated TITAN using in silico simulation of multiple subpopulations from intra-tumour samples of an ovarian carcinoma, accompanied by fluorescence in-situ hybridization (FISH) assay and single-cell sequencing validations. We also applied TITAN to 23 triple-negative breast cancers (TNBC) to study evolutionary patterns by analyzing the (sub)clonal copy number landscape. Finally, we applied TITAN to a set of primary breast tumours and the corresponding mouse xenografts to study the clonal evolution and selection patterns during engraftment.

Method | Description | Publications | Chapter
HMM-DOSAGE | Distinguishing somatic and germline CNA in SNP genotyping array data | Ha and Shah (2013); Curtis et al. (2012); Bashashati et al. (2013) | 2
APOLLOH | Inference of LOH in WGS data | Ha et al. (2012); Shah et al. (2012) | 3
TITAN | Inference of subclonal CNA and LOH in WGS data | Ha et al. (2014); Eirew et al. (2014) | 4

Table 1.2: List of research contributions pertaining to computational methods development. For each method, the first citation listed refers to the publication of the methodology and subsequent citations refer to publications containing results from the application of the software.

Chapter 2

Distinguishing somatic and germline copy number changes in SNP genotyping arrays interrogating cancer patient DNA

2.1 Introduction

Tumour genomes contain inherent germline polymorphic CNVs that may confound the analysis of somatic CNA events. Identification and separation of CNAs and CNVs are critical steps during prioritization of candidate oncogenes and tumour suppressors. CNVs have distinct statistical patterns compared to CNAs due to two factors. First, CNVs tend to be much shorter in length compared to CNAs, based on inspection of the publicly available catalogue of CNVs in the Database of Genomic Variants (DGV) (Iafrate et al., 2004) and comprehensive analysis of CNA and CNV in a cohort of 2000 breast cancer patients (Curtis et al., 2012). Second, normal contamination leads to diluted CNA event signals because CNAs are only found in the sub-population of the sample attributed to cancer cells. By contrast, CNV signals are stronger relative to CNAs because all cells in the sample contain these germline events (Figure 2.1).

The advantages of separating CNA and CNV predictions are two-fold. First, and most importantly, filtering non-pathological germline CNVs from somatic CNA results helps to prioritize regions that contain aberrant driver tumour suppressors or oncogenes. Second, we can perform downstream analysis of CNV-specific patterns to investigate their effects on gene expression and potential association with inherited cancer susceptibility.
As large-scale studies continue to gain prominence in the cancer genomics community, there will be a demand for a robust framework to systematically prioritize and separate relevant CNAs from CNVs.

Figure 2.1: Distinguishing germline and somatic CNA in SNP genotyping array data using HMM-DOSAGE. The output shows somatic copy number events in red and green (top track) and germline CNVs and matched normal (bottom track) data. CNVs in the tumour sample (black) are distinguished from CNA events.

There are several general strategies to separate CNAs and CNVs during analysis of cancer DNA hybridized to SNP genotyping arrays. These approaches include the use of available matched normal DNA data, database filtering, and statistical classification. If matched normal DNA is available, log ratios of genotyping array probes are computed using the normal as the reference prior to segmentation. This negates the signals for inherent germline variants and emphasizes somatic CNAs. However, normal DNA from patients is not always available, thereby requiring the use of a pooled reference (as described in Section 1.3.1). Another solution is to identify CNAs and CNVs after segmentation and copy number prediction. Copy number events that overlap records in a database containing germline polymorphic CNVs, such as DGV (Iafrate et al., 2004), can be manually subtracted (or filtered). DGV is a public database (http://dgv.tcag.ca/dgv/app/home) that contains a large collection of CNVs catalogued from multiple human genetic studies. The drawback to this approach is that there are uncertainties and incompatibilities in region sizes due to differing resolutions from the various technologies used in the studies. Furthermore, DGV regions collectively account for a large fraction of the genome, so filtering against them can potentially exclude a large percentage of the genome, including tumour-relevant CNAs.
This is particularly important for genes, such as BRCA1/2, that may have an overlapping germline CNV in DGV but can also be affected by somatic events.

A probabilistic approach using a regression tree classifier trained on DGV features, called PredictCNV (Ostrovnaya et al., 2010), classifies segments of copy number changes as CNA or CNV. However, the classifier was trained on lower resolution arrays (244k Agilent array CGH) in the TCGA glioblastoma data (Cancer Genome Atlas Research Network, 2008), and it is not clear if the classifier is suitable for other cancer types or for data from high-density arrays. Overall, both database filtering and probabilistic classifiers are applied independently following segmentation analysis and therefore do not contribute to improving the segmentation of these events.

We developed a novel approach, called HMM-Dosage, which uses a unified model to segment, predict, and distinguish CNA and CNV events in cancer DNA hybridized to SNP genotyping arrays in the absence of a matched normal. HMM-Dosage addresses the need for tools to probabilistically separate germline and somatic copy number events. The model benefits from the flexibility of HMMs to incorporate a variable, finite number of biologically interpretable states, and to allow the integration of multiple sources of prior information. The HMM has a state space that includes both CNVs and CNAs. It also employs a non-stationary transition matrix that enables the probabilistic incorporation of position-specific CNV prior knowledge. The output is the set of segment boundaries and the predicted copy number status for the tumour sample.

We assessed the performance of HMM-Dosage by evaluating CNV predictions from Affymetrix SNP6.0 array data for three tumour cohorts: 28 triple-negative breast cancers (TNBC) (Shah et al., 2012), 72 ovarian cancers, and 228 TCGA glioblastomas (Cancer Genome Atlas Research Network, 2008).
CNVs were predicted using a pooled reference, and the available ground truth matched normal samples were withheld from the analysis. Predicted CNVs were then validated by comparison to matched normals and by concordance with public databases. Next, we present results from two applications of HMM-Dosage. First, the method was used to analyze a cohort of 2000 breast cancers to profile the full landscape of CNAs and CNVs, and to jointly analyze matching expression data (Curtis et al., 2012). Second, HMM-Dosage was applied to the analysis of 31 spatial and temporal intra-patient ovarian tumours to study the intra-tumoural relationships (Bashashati et al., 2013). These are presented in Sections 2.4.1 and 2.4.2, both of which describe the results from the analysis of the novel datasets and also demonstrate the utility of HMM-Dosage.

2.2 Method: HMM-Dosage

HMM-Dosage analyzes the log ratios of tumour and reference probe intensities from genotyping arrays to infer CNA and CNV events. A central feature of HMM-Dosage is the prediction of CNAs and CNVs using separate emission distributions and HMM states. Furthermore, HMM-Dosage uses prior information to specify the location and frequencies of known germline CNV events from published studies.

2.2.1 Analysis workflow

The analysis workflow for analyzing tumour SNP genotyping data is shown in Figure 2.2. In general, normalization of the data is required to correct for technical artifacts and noise (Bengtsson et al., 2009). Next, a second normalization is performed by taking the log ratio $y_{1:T}$ between the probe intensities of tumour sample $i$ ($\theta^i_{1:T}$) and the reference sample ($\theta^R_{1:T}$) (as described in Section 1.3.1). This serves to remove systematic sequence-based bias present in the data of this platform. The log ratios $y_{1:T}$ are input into HMM-Dosage for segmentation and inference of CNAs and CNVs, generating the inferred sequence of copy number states $Z_{1:T}$. Here are the general steps in the analysis workflow:

1. 
Probe-level intensities for each cancer sample are normalized using software such as CRMAv2 (Bengtsson et al., 2009) or PennCNV (Wang et al., 2007). This results in a normalized intensity $\theta^i_t$ for each probe $t \in \{1, \ldots, T\}$ of sample $i$.

[Figure 2.2 panel labels: 1) Input normalized intensities; 2) Calculate raw copy number; 3) Apply HMM. HMM calls: Amplification, Deletion, CNV, Neutral.]

Figure 2.2: HMM-Dosage analysis workflow. Array intensities are normalized such that multiple samples are comparable. The raw copy number is computed as the log ratio between tumour and reference (or normal) intensities. CNV frequencies are generated using publicly available and/or matched normal samples and are used as prior information. HMM-Dosage outputs somatic CNA and germline CNV segments.

2. Generally, the log ratios $y_t = \log(\theta^i_t / \theta^R_t)$ are computed for sample $i$ by comparing against the intensity of reference $R$ at position $t$. This is sometimes referred to as the raw copy number because it is the input data to segmentation algorithms. Typically, a standard pooled reference, generated by taking the median intensity value across a HapMap cohort, is used. However, if users are interested in the inference and subsequent separation of the full enumeration of CNVs in the tumour sample, then the reference sample $R$ needs to facilitate detection of CNVs by avoiding the "subtraction" of germline events.
To do this, we generated a masked reference as follows (Figure 2.3):

   (a) Using SNP6.0 arrays from a publicly available population dataset of $J$ samples, such as HapMap270 (International HapMap Consortium et al., 2007) or HapMap3 (International HapMap 3 Consortium et al., 2010), generate a random reference $R^{Rand}$ by randomizing the probes (columns of the matrix) and then computing the median intensity value for each probe.
   (b) Compute the log ratio $h^j_t$ for each HapMap sample $j \in J$, using the random reference: $h^j_t = \log(\theta^j_t / R^{Rand}_t)$, $\forall t$.
   (c) $h^j_{1:T}$ is input into a standard HMM (that infers zero to 5 copies) (Shah et al., 2006) to identify positions $C^j$ with gains or losses for sample $j$.
   (d) Build an intensity matrix $M \in \mathbb{R}^{J \times T}$ where each element corresponds to $\theta^j_t$ for HapMap sample $j$ (row) and position $t$ (column).
   (e) For all $j$, mask out the gain/loss positions $C^j$ for row $j$ in $M$. Then, generate the masked reference $R^{masked} \in \mathbb{R}^{1 \times T}$ by computing the median across the samples (column) for each position $t$ in $M$.

   The masked reference allows for the inference of both CNAs and CNVs by serving as a quasi-baseline (copy neutral) vector of intensities for raw copy number calculations.

3. From matched normals, databases, and/or public datasets, genome-wide CNV frequencies $f^L_{1:T}$ (loss) and $f^G_{1:T}$ (gain) are generated for use as priors (see Section 2.2.2). Log ratios $y_{1:T}$ and CNV frequencies are input into HMM-Dosage for segmentation and CNA/CNV prediction. The output is the optimal sequence of inferred CNA/CNV states $Z_{1:T}$.

2.2.2 Probabilistic framework

Figure 2.4 shows the probabilistic graphical model for HMM-Dosage; mathematical details follow.

State space

HMM-Dosage has a state space of 11 copy number classes that can be assigned to the latent variable $Z_t$ at each probe $t \in \{1, \ldots, T\}$.
The full state space is $K = \{k_{CNA}, k_{CNVL}, k_{CNVG}, k_{NEUT}\}$, where

$$k_{CNA} = \{HOMD, HETD, GAIN, AMP, HLAMP\} \tag{2.1}$$
$$k_{CNVL} = \{CNVHOMD, CNVHETD\} \tag{2.2}$$
$$k_{CNVG} = \{CNVGAIN, CNVAMP, CNVHLAMP\} \tag{2.3}$$

$K$ includes somatic CNAs of homozygous ($HOMD$) and heterozygous ($HETD$) deletions, gain ($GAIN$), amplification ($AMP$), and high-level amplicon ($HLAMP$); germline CNV loss ($CNVHOMD$, $CNVHETD$) and CNV gain ($CNVGAIN$, $CNVAMP$, $CNVHLAMP$), at copy numbers analogous to those of $k_{CNA}$. One state ($k_{NEUT}$) is used for copy neutral ($NEUT$) diploid status of no copy number change in the tumour relative to the normal genome.

Figure 2.3: Workflow to generate the masked reference sample. The steps (in parentheses) correspond to the workflow described in the text.

These states were chosen as being biologically interpretable and representative of copy number dosage while maintaining a suitable number of states for the trade-off in run-time. The differences between $GAIN$, $AMP$, and $HLAMP$ are relative and not based on absolute integer copy number. For downstream interpretation, $GAIN$ is often associated with a gain of 1 copy, $AMP$ with a gain of 2-3 copies, and $HLAMP$ with a gain of 4 or more copies. The same interpretation also applies to the $k_{CNVG}$ states.

[Figure 2.4 annotations: $p(Z_t \mid \pi) = \mathrm{Multinom}(Z_t \mid \pi)$; $p(\pi \mid \delta^\pi) = \mathrm{Dir}(\pi \mid \delta^\pi)$; $p(y_t \mid Z_t = k) = \mathrm{St}(y_t \mid \mu_k, \lambda_k, \nu_k)$; $p(\mu_k \mid m_k, \eta_k) = \mathcal{N}(\mu_k \mid m_k, \eta_k)$; $p(\lambda_k \mid s_k, \gamma_k) = \mathrm{Gam}(\lambda_k \mid s_k, \gamma_k)$; $p(Z_t = j \mid Z_{t-1} = i) = A^t_{ij}$; $p(A^t \mid \delta^A + F^t) = \mathrm{Dir}(A^t \mid \delta^A + F^t)$]

Figure 2.4: Probabilistic graphical model of HMM-Dosage.

The latent random variable $Z_t \in K$ at each probe $t$ is modeled using the multinomial distribution with unknown mixture weights $\pi$ (Equation 2.4).
$\pi$ is modeled using a Dirichlet distribution (prior) with hyperparameter $\delta^\pi$ (Equation 2.5).

$$p(Z_t \mid \pi) = \mathrm{Multinomial}(Z_t \mid \pi) \tag{2.4}$$
$$p(\pi \mid \delta^\pi) = \mathrm{Dirichlet}(\pi \mid \delta^\pi) \tag{2.5}$$

Emission model

The emission distributions are composed of a mixture of $|K| = 11$ univariate Student's-t distributions (Yau et al., 2010) to model the input continuous log ratio data $y_{1:T}$ with parameters $\mu_k$ (mean), $\lambda_k$ (precision), and $\nu_k$ (degrees of freedom) conditional on state $k \in K$,

$$p(y_t \mid Z_t = k) = \mathrm{St}(y_t \mid \mu_k, \lambda_k, \nu_k) \tag{2.6}$$

The degrees of freedom $\nu_k$ is a user-defined value, while the location (mean) $\mu_k$ and scale (precision) $\lambda_k$ are unknown parameters. Figure 2.6 shows the probability densities for the 11-state emission mixture model using 5 CNV states ($k_{CNV}$); the CNV state distributions closely resemble the same distributions in the corresponding normal genome. The mean $\mu_k$ is modeled using a Gaussian prior distribution (Equation 2.7) and the precision $\lambda_k$ is modeled using a gamma prior distribution (Equation 2.8) (Li et al., 2011). Hyperparameters $m_k$, $\eta_k$, $s_k$, and $\gamma_k$ are fixed and user-specified.

$$p(\mu_k \mid m_k, \eta_k) = \mathcal{N}(\mu_k \mid m_k, \eta_k) \tag{2.7}$$
$$p(\lambda_k \mid s_k, \gamma_k) = \mathrm{Gamma}(\lambda_k \mid s_k, \gamma_k) \tag{2.8}$$

Calculating prior CNV frequencies from public results and databases

HMM-Dosage uses external information, such as matched normal SNP arrays and published CNV events from normal population studies (International HapMap Consortium et al., 2007; International HapMap 3 Consortium et al., 2010; Conrad et al., 2010; 1000 Genomes Project Consortium et al., 2010), to indicate the probability that a specific position is a CNV. Using matched normals allows the model to capture the patient-specific patterns of CNV in the normal and tumour. In large cancer studies where matched-normal samples may be incompletely paired or absent, publicly accessible normal genome samples can be used as an informative dataset for representing population-specific CNVs.
At each probe $t$, the frequencies of CNV loss ($f^L_t$) and CNV gain ($f^G_t$) are computed from the published CNV results across a set of individuals in the population study or database.

Employing a non-stationary transition model using CNV frequencies

The baseline probabilities for transitioning between copy number states in $K$ are given by the matrix $\delta$ (Equation 2.9). For self-transitions, $\delta$ uses probability $e$, which is set to a value close to 1 (e.g. 0.999). $e$ is user-defined and can be adjusted to correct for under- or over-segmentation of the data. The remaining residual probability mass is distributed evenly to the non-self transitions.

$$\delta_t(i,j) = \begin{cases} e & i = j,\ i, j \in K \\ \dfrac{1-e}{|K|-1} & \text{otherwise} \end{cases} \tag{2.9}$$

Prior CNV frequencies are explicitly used in the CNV transition matrix $F_t$ at probe $t$. The frequencies $f^L_t$ and $f^G_t$, which are in the range $[0,1]$, are used as the probabilities for transitioning into $k_{CNVL}$ or $k_{CNVG}$ states, respectively. The remaining probability mass, $1 - (2 f^L_t + 3 f^G_t)$, is equally distributed to the $k_{CNA}$ states. For positions where CNV prior frequencies are low, the transition probabilities into $k_{CNA}$ states will be given more weighting.

$$F_t(i,j) = \begin{cases} f^L_t & i \in K,\ j \in k_{CNVL} \\ f^G_t & i \in K,\ j \in k_{CNVG} \\ \dfrac{1 - (2 f^L_t + 3 f^G_t)}{|k_{CNA}|} & i \in K,\ j \in k_{CNA} \end{cases} \tag{2.10}$$

The combined transition matrix for HMM-Dosage is the position-specific matrix $A_t \in \mathbb{R}^{|K| \times |K|}$. Prior CNV information is incorporated into $A_t$ at each probe $t$ by adding $\delta_t$ and $F_t$ and normalizing each row.

$$A_t(i,j) = \frac{\delta_t(i,j) + F_t(i,j)}{\sum_{j'} \left[ \delta_t(i,j') + F_t(i,j') \right]} \tag{2.11}$$

Figure 2.5: Illustration of the HMM-Dosage transition probabilities. $\delta_t$ represents the baseline transition probabilities given by Equation 2.9. $F_t$ represents the CNV transition probabilities given by Equation 2.10.
Both matrices have dimensions $|K| \times |K|$; however, for brevity, $\{k_{CNA}, k_{CNVL}, k_{CNVG}\}$ and $k_{NEUT}$ (not shown) are used to represent the full set of states in the columns and rows.

An optional constraint can be used for probe positions that have unknown or zero CNV frequency by directly assigning zeros to $A_t(i,j)$, $\forall i \in K$, $\forall j \in \{k_{CNVL}, k_{CNVG}\}$, effectively disallowing transitions into $k_{CNV}$ states at probe $t$.

Learning model parameters using expectation maximization

We use the expectation maximization (EM) algorithm to train the Student's-t emission parameters $\theta = \{\mu_{1:11}, \lambda_{1:11}\}$. In the expectation step (E-step), the forwards-backwards algorithm is used to compute the posterior marginal probabilities, $\gamma^{(n)}_t = p(Z_t \mid y_{1:T}, \theta^{(n-1)})$, at EM iteration $n$ (Equation 2.12).

$$\gamma^{(n)}_t = \frac{p(y_{1:T} \mid Z_t, \theta^{(n-1)})\, p(Z_t \mid \theta^{(n-1)})}{p(y_{1:T} \mid \theta^{(n-1)})} \tag{2.12}$$

The expected complete log-likelihood $Q^{(n)}$ at EM iteration $n$ is

$$\begin{aligned}
Q^{(n)} = {} & \sum_{k=1}^{K} p(Z_0 = k \mid y_{1:T}, \theta^{(n-1)}) \log \mathrm{Mult}(Z_0 = k \mid \pi) \\
& + \sum_{t=1}^{T} \sum_{i=1}^{K} \sum_{j=1}^{K} p(Z_t = i, Z_{t-1} = j \mid y_{1:T}, \theta^{(n-1)}) \log A_t(i,j) \\
& + \sum_{t=1}^{T} \sum_{k=1}^{K} p(Z_t = k \mid y_{1:T}, \theta^{(n-1)}) \log \mathrm{St}(y_t \mid \mu_k, \lambda_k) \\
& + \sum_{k} \log \mathcal{N}(\mu_k \mid m_k, \eta_k) + \sum_{k} \log \mathrm{Gamma}(\lambda_k \mid s_k, \gamma_k) + \log \mathrm{Dirichlet}(\pi \mid \delta^\pi)
\end{aligned} \tag{2.13}$$

In the maximization step (M-step), the parameters are updated by taking the maximum a posteriori (MAP) estimate, $\theta^{(n)} = \arg\max_\theta Q^{(n)} = \{\mu^{(n)}_{1:11}, \lambda^{(n)}_{1:11}\}$. $A_t$ is fixed and not estimated. The update equations for $\mu$ and $\lambda$ are described in Archambeau (2005). EM is run until the convergence criterion $F^{(n)} - F^{(n-1)} < 10^{2}$ is satisfied. $F$ is the sum of the data log-likelihood and the log priors,

$$\begin{aligned}
F^{(n)} = {} & \log p(y_{1:T} \mid \theta^{(n)}) \\
& + \sum_{k} \log \mathcal{N}(\mu_k \mid m_k, \eta_k) + \sum_{k} \log \mathrm{Gamma}(\lambda_k \mid s_k, \gamma_k) + \log \mathrm{Dirichlet}(\pi \mid \delta^\pi)
\end{aligned} \tag{2.14}$$

Predicting optimal sequence of copy number for CNA and CNV events

The converged parameters $\hat{\theta}$ are taken from the last iteration of EM.
The Viterbi algorithm is used to infer the optimal hidden state path of copy number states,

$$Z_{1:T} = \arg\max_{Z_{1:T}} \left\{ p(Z_{1:T} \mid y_{1:T}, \hat{\theta}) \right\} \tag{2.15}$$

2.2.3 Implementation

HMM-Dosage is implemented in Matlab; the forward-backward and Viterbi algorithms are implemented in C. The time and memory complexities are $O(K^2 T)$ and $O(KT)$, respectively, where $K$ is the number of states and $T$ is the number of positions. The practical single-core run-time of the algorithm for an Affymetrix SNP6.0 array with 1.8 million SNP loci is approximately 20 seconds per EM iteration. A typical run of HMM-Dosage converges in 10 to 50 iterations. The memory usage of HMM-Dosage for SNP6.0 is approximately 3 GB. The source code and compiled executable (usable on Linux x64 architecture) can be downloaded at http://compbio.bccrc.ca/software/hmm-dosage/.

2.3 Results

2.3.1 Evaluation and benchmarking

We evaluated HMM-Dosage using Affymetrix SNP6.0 genotyping array data from three cancer cohorts: 28 triple-negative breast cancer (TNBC) (Shah et al., 2012), 228 glioblastoma multiforme (GBM) brain cancer (Cancer Genome Atlas Research Network, 2008), and 72 ovarian (OV) cancer samples. The TNBC samples, with corresponding matched normals, were part of a larger study involving 104 total TNBC samples (Shah et al., 2012). The 228 GBM samples were retrieved from the TCGA (Cancer Genome Atlas Research Network, 2008), along with matched normals. To demonstrate the performance of HMM-Dosage in predicting CNVs in the absence of matched normals, the analysis was performed by holding out the normal germline samples. Then, the matched normals for the TNBC and GBM datasets were used as ground truth to evaluate the prediction of germline CNVs. Performance metrics were computed for CNVs predicted in the tumour samples using CNVs identified in the normal samples as ground truth. For the OV cancer dataset, no matched normals were available.
This represented a typical scenario that motivates the need for approaches capable of detecting germline CNVs in tumour data in the absence of matched normals.

We compared the CNV prediction performance between HMM-Dosage and a classification-based software, PredictCNV (Ostrovnaya et al., 2010). For the PredictCNV analysis, the segments were first generated by CBS (Venkatraman and Olshen, 2007), where CNV gain and loss were called if the segment log ratio was one median absolute deviation (MAD) of the residuals above and below the array median, respectively (as described in Ostrovnaya et al., 2010). For the HMM-Dosage runs, the CNV prior frequencies were generated from 270 HapMap individuals (International HapMap Consortium et al., 2007) for all three cancer datasets (TNBC, GBM, OV).

                     TNBC                            GBM                             OV
Software             Concord.  F (HMM)  F (CBS)      Concord.  F (HMM)  F (CBS)      Concord.
HMM-Dosage           0.95      0.61     0.60         0.94      0.51     0.53         0.95
CBS+PredictCNV       0.55      0.44     0.50         0.54      0.46     0.71         0.58

Table 2.1: Performance of HMM-Dosage and PredictCNV. Performance is based on evaluation of germline CNV events in the tumours, using matched normals as the ground truth. The ground truth CNV segments were predicted using two algorithms, CNA-HMMer (HMM) and CBS. The F-measure columns, F (HMM) and F (CBS), indicate which ground truth set was used in the evaluation.

We computed two performance statistics for the evaluation. First, in the DGV database (data version 10, downloaded in November 2010), the set of ground truth CNV regions $C_{Truth}$ was selected as the CNVs that were found in at least two samples within a study. A true positive is defined as the overlap of ≥ 50% of probes for the HMM-Dosage or PredictCNV CNV segment with the boundaries of a record in $C_{Truth}$. Concordance was computed as the proportion of overlapping (true positive) CNV predictions out of all predictions, which is the positive predictive value (precision). For HMM-Dosage, the concordance was higher for all three datasets (Table 2.1).
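The concordance statistic defined above can be illustrated with a short sketch. It assumes, purely for illustration, that predicted segments and ground truth regions are given as inclusive (start, end) probe-index intervals on a single chromosome; the function names and the toy intervals are hypothetical and are not taken from HMM-Dosage or PredictCNV.

```python
def probe_overlap(segment, region):
    """Number of probe indices shared by two inclusive (start, end) intervals."""
    lo = max(segment[0], region[0])
    hi = min(segment[1], region[1])
    return max(0, hi - lo + 1)


def is_true_positive(segment, truth_regions, min_fraction=0.5):
    """True if at least min_fraction of the segment's probes lie in one truth record."""
    n_probes = segment[1] - segment[0] + 1
    return any(
        probe_overlap(segment, region) / n_probes >= min_fraction
        for region in truth_regions
    )


def concordance(predicted, truth_regions):
    """Proportion of predicted segments that are true positives (precision)."""
    if not predicted:
        return 0.0
    hits = sum(is_true_positive(seg, truth_regions) for seg in predicted)
    return hits / len(predicted)


# Toy data (hypothetical): three predicted CNV segments, two ground truth records.
predicted = [(10, 19), (40, 49), (100, 109)]
truth = [(12, 30), (47, 60)]
print(concordance(predicted, truth))
```

In this toy case only the first segment has at least half of its probes inside a ground truth record, so one of three predictions counts as a true positive.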
This performance statistic is important because false positive CNVs may in fact be CNAs.

For the TNBC and GBM datasets that had matched normals as ground truth, we used the F-measure for the second statistic. Two ground truth sets were generated by predicting CNVs from the normal samples using CNA-HMMer (Shah et al., 2006) and CBS (Venkatraman and Olshen, 2007). Precision and recall were calculated by analyzing probe-based percentage overlap between tumour and normal segments. The F-measure was computed as $(2 \cdot \mathrm{precision} \cdot \mathrm{recall}) / (\mathrm{precision} + \mathrm{recall})$ (Table 2.1). In the TNBC results, the F-measure for HMM-Dosage was higher regardless of the ground truth segmentation. By contrast, the performance for the GBM dataset was higher for PredictCNV when the CBS ground truth was used. This improved performance may be due to overfitting because PredictCNV was initially trained (by the authors of the software) on lower resolution arrays (244K Agilent aCGH) for the same GBM samples.

Examining the output, we observed that the converged Student's-t parameters ($\mu_{1:11}$ and $\lambda_{1:11}$) for CNA ($k_{CNA}$) and CNV ($k_{CNV}$) states were distinctly different (Figure 2.6). In particular, the parameters for the CNV states in the tumour were very similar to the parameters from the matched normal data. This confirmed our initial hypothesis that the germline and somatic events in tumours emit different statistical signals.

Figure 2.6: Comparison between Student's-t distributions for tumour and matched normal samples.

2.4 Application of HMM-Dosage to novel cancer datasets

In Section 2.4.1, HMM-Dosage was applied to analyze the CNA and CNV landscapes in 1992 breast cancer samples. The dichotomization of the somatic and germline events enabled the comparative analysis of their effects on gene expression.
In Section 2.4.2, application of HMM-Dosage identified key homozygous deletions involving tumour suppressor genes that were validated as events observed during the tumour evolution of ovarian cancers.

2.4.1 METABRIC: Profiling the genome architectures of 2000 breast cancers

The intrinsic subtypes of breast cancer (Sørlie et al., 2001), determined through molecular classifications, represent distinct diseases that have implications for response to treatment (Perou et al., 2000; Rouzier et al., 2005). This classification was developed with the introduction of gene expression profiling and has since been refined to use a set of 50 genes as differentiating features, called PAM50 (Parker et al., 2009). The five intrinsic subtypes are luminal-A, luminal-B, HER2-enriched, basal-like, and normal-like. More recently, copy number profiles have been an effective feature for identifying novel breast cancer subtypes (Chin et al., 2007; Bergamaschi et al., 2006; Soneson et al., 2010); however, classifiers that integrate both expression and genomic copy number remain elusive. We undertook the METABRIC study (Curtis et al., 2012), setting out to analyze the genomes of 1992 breast cancers. In this study, we hypothesized that the joint analysis of copy number and gene expression can reveal distinct mutational signatures, allowing the refinement and discovery of novel molecular breast cancer subtypes. The large number of patients in this cohort enabled better representation of rarer subgroups compared to previous smaller, under-powered studies. This cohort represents the largest breast cancer dataset to date.
The major research goals of this project were to 1) identify driver genes using an integrated analysis of copy number and gene expression, and 2) refine the molecular subgrouping of breast cancer using unique signatures derived from copy number and expression.

The genomes for these tumours were hybridized on the Affymetrix SNP6.0 platform. The dataset was divided into the discovery (997 samples) and validation (995 samples) cohorts. Matching RNA expression data for every tumour sample and an additional set of 482 normal samples were also available. Finally, clinical data for each patient, including age, grade, stage, estrogen-receptor (ER) and lymph-node (LN) status, and outcome information, were available. HMM-Dosage was applied to profile the somatic CNA and germline CNV landscapes, enabling the separate downstream analyses of these aberration classes. In this analysis, the masked reference sample, generated from the 482 normals, was used in the pre-processing normalization, and the prior CNV frequencies were generated from the combination of the 482 normals and 450 HapMap samples (Conrad et al., 2010).

Separating the somatic and germline copy number landscapes

We obtained the copy number profiles for the genomes of 1992 samples using HMM-Dosage. We then generated a gene-based CNA landscape, where, for each gene, the proportions of samples overlapping copy gains and losses are computed. The CNA landscape shows distinct hotspot patterns of gene clusters with high proportions of gains or losses; however, the germline CNVs can confound the analysis of the somatic landscape (Figure 2.7, Supplementary Table S7 in Curtis et al. (2012)). HMM-Dosage was able to separate CNVs, which are focal and at high proportions in the landscape, from CNAs while still correctly calling known somatic CNA events such as the amplification of ERBB2 and CCND1 (Figure 2.7).
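The gene-based landscape described above amounts to a per-gene tally across samples. A minimal sketch follows, assuming each sample's result has already been reduced to a per-gene call of "gain", "loss", or "neutral"; that reduction, the function name, and the toy calls are illustrative assumptions rather than part of the published pipeline.

```python
from collections import Counter


def gene_landscape(calls_by_sample):
    """Return {gene: (proportion_gain, proportion_loss)} across all samples.

    calls_by_sample maps a sample id to a dict of gene -> call, where each
    call is "gain", "loss", or "neutral" (a hypothetical encoding).
    """
    n_samples = len(calls_by_sample)
    gains, losses = Counter(), Counter()
    genes = set()
    for calls in calls_by_sample.values():
        for gene, call in calls.items():
            genes.add(gene)
            if call == "gain":
                gains[gene] += 1
            elif call == "loss":
                losses[gene] += 1
    return {
        gene: (gains[gene] / n_samples, losses[gene] / n_samples)
        for gene in sorted(genes)
    }


# Toy calls (hypothetical values, not METABRIC results).
toy_calls = {
    "s1": {"ERBB2": "gain", "CCND1": "gain", "PPP2R2A": "loss"},
    "s2": {"ERBB2": "gain", "CCND1": "neutral", "PPP2R2A": "loss"},
    "s3": {"ERBB2": "neutral", "CCND1": "neutral", "PPP2R2A": "neutral"},
    "s4": {"ERBB2": "gain", "CCND1": "gain", "PPP2R2A": "neutral"},
}
landscape = gene_landscape(toy_calls)
for gene, (p_gain, p_loss) in landscape.items():
    print(gene, p_gain, p_loss)
```

Running the same tally separately on CNA calls and on CNV calls would give the two separated landscapes, which is the comparison Figure 2.7 presents.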
Figure 2.7: Gene-centric somatic CNA and germline CNV landscapes for the METABRIC cohort predicted using HMM-Dosage. The overall landscape (top) is shown as separated into CNV (middle) and CNA (bottom). Gene frequencies of copy gain (red) and loss (blue) are shown.

Association of copy number and expression-based intrinsic subtype classification

Curtis et al. (2012) had determined the intrinsic subtypes for each of the 997 patients in the discovery cohort using the 50-gene expression-based classifier, PAM50 (Parker et al., 2009). Because the classification was based on expression data alone, we were interested to see if any associations existed between patterns of CNA/CNV and the PAM50 subtype classification. Qualitatively, we compared the patterns of gene-based frequency landscapes, defined as the proportion of gain and loss for each gene across patients within a subtype. We observed noticeable differences between subtypes for the CNA landscapes (Figure 2.8). However, the landscapes for CNV events appeared nearly identical.

[Figure 2.8 panels: Basal, Her2, LumA, LumB, Normal; x-axis: Chromosome; y-axis: Frequency (-1 to 1)]

Figure 2.8: METABRIC CNA/CNV landscapes by PAM50 subtypes. CNA (left) and CNV (right) landscapes for the METABRIC cohort separated into PAM50-predicted intrinsic subtypes. Normal-like is labeled as "Normal".

In order to investigate the differences between subtypes for the CNA and CNV landscapes in a more quantitative manner, we applied the χ2 test of independence for each gene. Table 2.2 illustrates an example of the contingency matrix for ERBB2.
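As a worked illustration of the per-gene test, the sketch below computes the χ2 statistic for the ERBB2 counts given in Table 2.2 (copy number status by Her2 versus non-Her2 subtype). It is a minimal standard-library sketch, not the code used in the study; the function name is hypothetical, and the closed-form p-value used here holds only for the 2 degrees of freedom of a 3 × 2 table.

```python
import math


def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Rows: Loss, Neutral, Gain; columns: Her2, Not Her2 (counts from Table 2.2).
erbb2_table = [[7, 134], [20, 587], [64, 160]]
statistic, dof = chi_square_independence(erbb2_table)

# With exactly 2 degrees of freedom the chi-square survival function is exp(-x / 2).
p_value = math.exp(-statistic / 2) if dof == 2 else None
print(f"chi2 = {statistic:.2f}, dof = {dof}, p = {p_value:.3g}")
```

The statistic comes out at roughly 127 on 2 degrees of freedom, driven mainly by the excess of copy gains among Her2 tumours. Because one test is run per gene, the resulting p-values would still need a multiple-testing adjustment before genes are declared subtype-specific.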
The χ2 analysis confirms that many genes altered by CNA events were indeed significant for being subtype-specific (Figure 2.9, Supplementary Table S39 in Curtis et al. (2012)). By contrast, few CNV-altered genes were significant by the χ2 analysis. This result reiterates that expression-based subtypes are associated with a unique CNA landscape (Chin et al., 2007).

ERBB2                Subtype
Copy Number          Her2      Not Her2
Loss                 7         134
Neutral              20        587
Gain                 64        160

Table 2.2: Example χ2 test for copy number and subtype association for ERBB2 in the METABRIC cohort.

[Figure 2.9 panels: Basal, Her2, LumA, LumB, Normal; x-axis: Chromosome; y-axis: -log p-value]

Figure 2.9: Gene-centric Chi-Square (χ2) analysis to determine subtype-specific CNA (left) and CNV (right) in the PAM50 intrinsic subtypes. Each data point represents a χ2 test for a gene and the association between predicted copy number and PAM50 subtypes. Normal-like is labeled as "Normal".

Expression of 40% of genes is cis-regulated by somatic copy number

Using gene expression data from the tumours, we investigated the effect of copy number on cis-regulation of expression (the influence of a gene's copy number on its observed expression). First, differential expression analysis was performed between samples grouped by copy number status of loss, neutral, and gain. Across the genome, 7334 (~39%) genes altered by CNAs showed significant differential expression using the Kruskal-Wallis test (3 copy number groups) and the Wilcoxon rank sum test (2 copy number groups) with Bonferroni adjusted p-value < 0.001 (Supplementary Table S37 in Curtis et al. (2012)).
Figures 2.10 and 2.11 illustrate examples of significantly differentially expressed genes: PPP2R2A, ERBB2, and LSM1.

[Figure 2.10 panels for PPP2R2A (probe ILMN_1788961). Left: gene expression by copy number group (CNA_loss N=385, nonCNA N=556, CNA_gain N=60). Right: gene expression logR versus CNA segment logR, Spearman rho = 0.700752.]

Figure 2.10: Analysis of CNA cis-associated expression changes for PPP2R2A. Differential expression (left) using the Kruskal-Wallis test and correlation of copy number and expression (right) using Spearman correlation are shown for PPP2R2A.
[Figure 2.11 panels for ERBB2 (probe ILMN_2352131) and LSM1 (probe ILMN_2218450). Left: gene expression by copy number group (ERBB2: CNA_loss N=127, nonCNA N=644, CNA_gain N=230; LSM1: CNA_loss N=159, nonCNA N=597, CNA_gain N=245). Right: gene expression logR versus CNA segment logR (ERBB2: Spearman rho = 0.652249; LSM1: Spearman rho = 0.792087).]

Figure 2.11: Analysis of CNA cis-associated expression changes for ERBB2 and LSM1. Differential expression (left) using the Kruskal-Wallis test and correlation of copy number and expression (right) using Spearman correlation are shown for ERBB2 and LSM1. The y-axis is the log of the normalized gene expression. Copy number is denoted by light green (homozygous deletion), dark green (hemizygous deletion), blue (copy neutral), dark red (gain), and bright red (higher level amplification).
[Figure 2.12 panels for GSTM1 (probe ILMN_1668134) and GSTT1 (probe ILMN_1730054). Left: gene expression by copy number group (GSTM1: CNV_loss N=132, nonCNV N=333, CNV_gain N=408; GSTT1: CNV_loss N=181, nonCNV N=448, CNV_gain N=195). Right: gene expression logR versus CNV segment logR (GSTM1: Spearman rho = 0.676746; GSTT1: Spearman rho = 0.683296).]

Figure 2.12: Analysis of CNV cis-associated expression changes. Differential expression (left) using the Kruskal-Wallis test and correlation of copy number and expression (right) using Spearman correlation are shown for GSTT1 and GSTM1.

Next, we determined the association between copy number and expression by applying a Spearman rank correlation test between the median log2 ratios of CNA segments overlapping the genes and the log2 expression values. In total, the expression of 1008 (~5%) genes was positively correlated with somatic CNA (Spearman's ρ ≥ 0.5, Bonferroni adjusted-p < 0.001, Figure 2.13, Figure 2.11, Supplementary Table S35 in Curtis et al. (2012)).
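For a single gene, the rank correlation described above can be sketched as follows. This is a minimal standard-library illustration with hypothetical function names and toy values; it computes only the coefficient ρ (ties receive average ranks) and leaves out the p-value and Bonferroni adjustment applied in the study.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy values for one gene across five samples (hypothetical, not study data).
segment_log_ratio = [-0.8, -0.1, 0.0, 0.4, 1.2]
expression_log_ratio = [-1.5, 0.1, -0.2, 0.3, 2.0]
rho = spearman(segment_log_ratio, expression_log_ratio)
print(round(rho, 3))
```

Under the criteria quoted in the text, a gene would be retained only if ρ ≥ 0.5 and its adjusted p-value fell below 0.001.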
By contrast, only 3 CNV genes, including the glutathione S-transferase genes GSTT1 and GSTM1, were differentially expressed and observed to be positively correlated by the same criteria (Figure 2.12). GSTT1 and GSTM1 have been shown to be associated with cancer susceptibility (Rebbeck, 1997). No genes altered by CNA or CNV were significantly negatively correlated in cis.

Figure 2.13: Gene-centric Spearman correlation analysis to determine CNA cis-regulation of gene expression changes across 997 patients.

2.4.2 Intra-tumoural heterogeneity revealed through study of intra-patient samples of high-grade serous ovarian cancer

High-grade serous (HGS) is the dominant histological subtype of ovarian cancers. These cancers are prone to relapse after chemotherapy, with low survival rates (Herzog, 2004). Previous work has indicated that studying the clonal diversity of HGS ovarian cancer can help to understand the cause of chemotherapeutic resistance (Cooke et al., 2010). To this end, we studied the tumour heterogeneity of 31 total intra-patient tumour samples from six HGS ovarian carcinomas. For each patient, biopsies were taken from within an individual tumour at proximal adjacent sites, between anatomical sites such as left and right ovaries, or temporal samples from relapse (Table 1 in Bashashati et al. (2013)). Patients were designated as Cases 1 through 6 and, within each patient, 2 to 5 biopsies are labeled using lower case alphabet letters. Through computational analyses, we quantitatively measured the degree of intra-tumoural heterogeneity and investigated the evolutionary relationships between the genomic architecture of intra-patient samples.

Degree of heterogeneity observed through extreme copy number events

We performed copy number and LOH analysis on 38 total samples from the six cases hybridized to Affymetrix SNP6.0 genotyping arrays.
HMM-Dosage was applied to identify extreme somatic CNAs (homozygous deletions and high-level amplifications), taking advantage of the removal of confounding CNVs. From the results, heterogeneous variation of genomic architectures was observed in samples within the same (intra-tumour) and between different (inter-tumour) patients (Supplementary Table S4 in Bashashati et al. (2013)). In particular, the intra-tumour CNA profiles between samples within Cases 1, 2, and 6 were highly similar, but we observed substantial intra-tumoural heterogeneity between samples within Cases 3-5 (Figure 3 in Bashashati et al. (2013)).

[Figure 2.14 image omitted: phylogenetic tree with branch lengths; tips are the Control and the samples of Case 1, Case 2, Case 3, Case 4 (left), Case 4 (right), Case 5 and Case 6.]

Figure 2.14: Intra-tumoural genomic architecture profiles of HGS ovarian cancer Cases 1-6. Samples within each Case (patient) are labeled using lower case alphabet letters. Phylogenetic tree of homozygous deletion and high-level amplification CNA profiles for Cases 1-6 depicting evolutionary branching patterns reflective of clonal relationships between samples.

Next, we examined the extreme CNAs within intra-patient samples. We generated a phylogenetic tree based on comparing the extreme copy number events between all samples. First, for each sample, each gene was assigned the log ratio of the predicted segment that overlapped the gene boundaries. Then, the Pearson correlation was computed for each pair of samples, across the set of genes; this was done for all patients, including a control.
The control was simulated to represent a sample having no CNA; this was done using a log ratio of 0 for all genes. The distance matrix was then constructed from these correlation coefficients, and subsequently the phylogenetic tree was built using the Neighbour-Joining method ('ape' R package (Paradis et al., 2004)) (Figure 2.14). The most striking example of copy number diversity was observed between samples from the left and right ovaries of Case 4, where samples in Case 4a-e and Case 4f-i formed distinct branches in the phylogeny and were at least as divergent from each other as from other patient samples. For example, Case 4 right ovary is more similar to Case 6 (distance 0.456) than to Case 4 left ovary (distance 0.508), according to the phylogenetic analysis. Across the samples in Case 4, 631 genes were altered by homozygous deletions and 852 genes were altered by high-level amplifications. None of these alterations was observed in all nine tumour samples (Supplementary Table S5 in Bashashati et al. (2013)).

[Figure 2.15 images omitted: chromosome 20 copy number profiles for Case4a-i and the normal sample (Case4n), with FISH images for the right and left ovary. Probes: BAC RP11-241P6 (chr20:43,503,512-43,655,407) and 20p (Vysis, 05J03-020). Legend: hemizygous deletion, homozygous deletion, copy neutral, gain, amplification.]

Figure 2.15: Copy number alteration (CNA) and FISH comparisons between right (a-e) and left (f-i) ovaries of Case 4 at chromosome 20. Right ovary (a-e) shows aneuploid gain; left ovary (f-i) shows amplification of Region-2.

Three segmental amplifications predicted in Case 4 were validated using fluorescence in-situ hybridization (FISH): a 150kb region (chr20:43,503,512-43,655,407) predicted to have a high-level amplification in the left ovary of Case 4 with only low-level amplification in the right ovary (Figure 2.15); a highly amplified region, chr12:25,863,200-26,026,351 (Supplementary Figure S10A in Bashashati et al. (2013)); and an amplified region, chr6:47,952,581-48,122,073 (Supplementary Figure S10B in Bashashati et al. (2013)).

Several genes in the Cancer Gene Census (Futreal et al., 2004) were heterogeneously altered by extreme CNA events in samples within a patient (Supplementary Table S5 in Bashashati et al. (2013)). Case 1a-c (but not 1d) exhibited homozygous deletion of the tumour suppressor NF1 in a 190kb region on chromosome 17 (26,496,299-26,686,045) (Figure 2.16). FISH assays probing this event revealed subpopulations of cells containing homozygous deletion of NF1 and all cells containing monosomy of chr17, confirming that homozygous deletion of NF1 was not in the ancestral clone. Moreover, the presence of cells with monosomy suggests that Case 1a,b and Case 1d may have shared a common subclone of cells with the chromosomal hemizygous deletion. The same scenario was also observed in Case 3, which harbored NF1 homozygous deletions in only two of the three samples (Figure 2.17). Observations of subclonal NF1 deletions suggest that important driver mutations may be acquired later in the evolution of the tumour.

[Figure 2.16 images omitted: copy number profiles and FISH images for Case1a and Case1b at chr17:26,527,833-26,569,372. Probes as labeled in the figure: Fosmid WI2-492A20 and Cep17 (#32-112017). Legend: hemizygous deletion, homozygous deletion, copy neutral, gain.]

Figure 2.16: Heterogeneous NF1 homozygous deletions in Case 1. FISH images included for Case 1a,b both show subclonal populations of cells with homozygous deletions (red arrows) and hemizygous deletions within NF1 (Fosmid WI2-49A20).

[Figure 2.17 images omitted: copy number profiles and FISH images for Case3a and Case3b at chr17:26,527,833-26,569,372. Probes as labeled in the figure: Fosmid WI2-492A20 and Cep17 (#32-112017). Legend: hemizygous deletion, homozygous deletion, copy neutral, gain.]

Figure 2.17: Heterogeneous NF1 homozygous deletions in Case 3. FISH assays show coexisting subclones of cells with homozygous and hemizygous deletions within NF1 (Fosmid WI2-49A20).

Evolving genome architecture gives insights into clonal divergence

In addition to high-level amplifications and homozygous deletions, we compared the overall genome architecture based on loss of heterozygosity (LOH) profiles inferred from allelic intensity ratios using OncoSNP (Yau et al., 2010) (version 1.1). All intra-patient and intra-tumour samples in the six cases harbored whole chromosome 17 LOH, supporting evidence that this is among the earliest aberrations in tumour progression. To investigate events that may further indicate continual evolution of the genomic architecture, we identified chromosomal aberrations, such as copy neutral LOH (NLOH) and amplified LOH (ALOH), that arose from at least two sequential (compound) genomic modifications.
The relative proportion of the genome altered by compound copy number events varied between samples within Case 3 and Case 5 (Figure 2.18A).

[Figure 2.18 plots omitted: phylogenetic tree of Cases 1-6 and the Control with branch lengths; per-sample bar chart of the proportion of the genome altered (Case1a through Case6b); pairwise plots of discrete copy number (0-5) for Case3a vs Case3b, Case3a vs Case3c and Case3b vs Case3c.]

Figure 2.18: Evolutionary sequential compound CNA analysis in HGS ovarian tumours. (A) Analysis of proportion of the genome that was altered by sequential compound events: copy neutral LOH (NLOH) and amplified LOH (ALOH) regions. These events indicate the occurrence of more than one copy number event in sequence (e.g. deletion followed by amplification of the remaining allele results in ALOH). (B) Pairwise comparison of copy number samples within Case 3. The number of genes with a specific predicted discrete copy number (CN) is represented by the size of the dot. Genes that also have the same zygosity (LOH or heterozygous) status between the two samples are coloured red, and grey otherwise. (C) Phylogenetic tree of compound events.

In Case 5, samples 5b and 5f exhibited a higher proportion of the genome harboring compound aberrations (mean 0.43) than the other samples, 5a, 5c and 5g (mean 0.23) (Figure 2.18A). This can be visualized as a phylogenetic tree, which was constructed by assigning each gene an integer score representing the weight of observing compound events: 2 for ALOH, 2 for NLOH, 2 for homozygous deletion, 1 for hemizygous deletion, 0 for diploid heterozygous, and 0 for allele-specific amplification. First, the Euclidean distance between sample x and sample y was computed as

$$d_{xy} = \sqrt{\sum_{g=1}^{G} (x_g - y_g)^2}$$

for all G genes, where $x_g$ and $y_g$ are integer scores. The distance was computed for all pairs of tumour samples in all patients, including the control (zeros for all genes). Neighbour-Joining was again applied using the Euclidean distances to construct a phylogenetic tree (Figure 2.18B).

For Case 3, samples b and c underwent whole genome duplication relative to sample a (Figure 2.18C). Segmental deletions in Case 3a were observed as NLOH events (doubling of the remaining undeleted chromosome) in Cases 3b-c, and diploid heterozygous regions in Case 3a were observed to be doubled to four balanced copies in Cases 3b-c (Figure 2.19). Overall, 17 chromosomes in Cases 3b and 3c showed evidence of doubling. The remaining chromosomes (4, 8, 11, 13 and 19) appeared to have undergone concurrent segmental aneuploid events in both Cases 3a and 3b, preceding or following genome doubling (Figure 2.19B,C), suggesting continual accrual of genomic aberrations after the clonal divergence of Cases 3b,c and Case 3a from the ancestral clone. These results suggest that evolutionary trajectories of copy number in multiple spatial intra-patient samples can provide insights into clonal divergence.
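The compound-event scoring and Euclidean distance described in this section can be sketched as follows. This is a minimal illustration under stated assumptions: the event labels (ALOH, NLOH, HOMD, HEMD, HET, ASCNA) are shorthand chosen here, the sample names and calls are hypothetical, and the neighbour-joining step (done with the 'ape' R package in the study) is not reproduced.

```python
import math

# Integer weights quoted in the text: 2 for ALOH, NLOH and homozygous deletion,
# 1 for hemizygous deletion, 0 for diploid heterozygous and
# allele-specific amplification. The keys are shorthand labels (assumption).
EVENT_SCORE = {
    "ALOH": 2,
    "NLOH": 2,
    "HOMD": 2,
    "HEMD": 1,
    "HET": 0,
    "ASCNA": 0,
}


def score_profile(events):
    """Map a per-gene list of event labels to integer scores."""
    return [EVENT_SCORE[event] for event in events]


def euclidean_distance(x, y):
    """d_xy = sqrt(sum over genes of (x_g - y_g)^2)."""
    if len(x) != len(y):
        raise ValueError("profiles must cover the same genes")
    return math.sqrt(sum((xg - yg) ** 2 for xg, yg in zip(x, y)))


def distance_matrix(profiles):
    """Pairwise distances between all named profiles, as a dict of pairs."""
    names = list(profiles)
    return {
        (a, b): euclidean_distance(profiles[a], profiles[b])
        for a in names
        for b in names
    }


# Hypothetical calls for four genes in two samples.
calls = {
    "sample_a": ["HET", "HEMD", "NLOH", "HET"],
    "sample_b": ["HET", "NLOH", "ALOH", "HEMD"],
}
profiles = {name: score_profile(events) for name, events in calls.items()}
profiles["control"] = [0] * 4  # control: zeros for all genes, as in the text

dist = distance_matrix(profiles)
```

The resulting matrix is the kind of input a neighbour-joining routine expects. Because ALOH, NLOH and homozygous deletion all score 2, the distance reflects how many sequential events a region has undergone rather than which specific event occurred.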
[Figure 2.19 diagrams omitted: copy number and zygosity tracks for Case3a and Case3b at chromosomes 7, 4 and 13. Track legend: NEUT (2 copies), GAIN (3 copies), AMP (4+ copies), HEMD (1 copy), HOMD, HET, LOH, NLOH. Step annotations: D = doubling, L = loss, A = amplification. Panel C lists competing explanations for 13q12.11 - 13q14.2. Subclonal segmental aneuploidy: Case3a (clone1) 1CN; Case3a (clone2) 2CN -> L -> 1CN; Case3b 2CN (clone2) -> D -> 4CN -> L -> 3CN. Concurrent segmental aneuploidy: Case3a 2CN -> L -> 1CN; Case3b 2CN -> D -> 4CN -> L -> 3CN.]

Figure 2.19: Genome doubling in HGS ovarian cancer Case 3. Comparison of Case 3a and Case 3b reveals genome doubling. (A) Chromosome 7 shows the final observable result of genome doubling, where 7p was 1 copy and 7q was 2 copies in the initial Step '0' of Case 3a but become 2 and 4 copies, respectively, in the Step '1' doubling of Case 3b. (B) Chromosome 4 in Case 3a initially started at 1 copy at Step '0' and doubles to 2 copies of the entire chromosome in Step '1' of Case 3b. An additional segmental aneuploid event occurs in 4p, amplifying it to 4 homozygous copies in Step '2'. (C) Comparison of chromosome 13 revealed two possible sequences of events, particularly at q12.11-q14.12.

2.5 Discussion

HMM-Dosage is a tool for identifying both somatic CNA and germline CNV events from SNP genotyping array data. The inclusion of prior CNV information in the transition probabilities informs the transitions into CNV states at specific loci.
Also, the use of a state space that includes five separate CNV states allows the model to capture the distinct statistical signals in the data emitted from germline events. In a simple evaluation, we showed the performance of HMM-Dosage in distinguishing CNAs and CNVs. Then, we demonstrated the utility of HMM-Dosage with its application to novel breast and ovarian cancer datasets.

Using HMM-Dosage, we profiled the landscape of CNVs and CNAs in 1992 breast cancers. We observed that germline CNVs were not significantly associated with expression, with the exception of two glutathione S-transferase genes, GSTT1 and GSTM1. These genes have been reported in numerous studies as being associated with cancer susceptibility (Rebbeck, 1997), including breast cancer (Dunning et al., 1999; Zheng et al., 2002). By contrast, 39% of somatic CNAs showed differential cis-regulated gene expression, which is higher than previously reported in breast cancer (Pollack et al., 2002). The integration of 997 copy number and expression profiles for discovering breast cancer subtypes provided a high-resolution, genome-wide approach not previously described.

To address the hypotheses of the METABRIC study, Curtis et al. (2012) set out to refine the current expression-based subtype classification strategy as well as to discover novel subgroups. The authors applied an unsupervised integrative clustering approach (Shen et al., 2009) that combined copy number and expression data for the top 1000 cis-regulated genes as features. Ten subgroups were determined after applying the Dunn's index internal cluster validation (Figure 4 in Curtis et al., 2012). Of these groups, two were novel and associated with distinct clinical outcomes: ER+ 11q13/14 amplicon group (n=45/997) and CNA-devoid group (n=164/997).
This provides a new view of breast cancers that are characterized by these novel signatures, and can help refine diagnosis and the evaluation of treatment response for these subtypes.

In six high-grade serous ovarian cancer patients, we investigated the tumour heterogeneity by applying HMM-Dosage and OncoSNP to profile the genomes of multiple intra-patient samples. The results revealed extensive heterogeneity between intra-tumour and anatomical sites within individual patients. FISH assays were used to validate, at the single-cell level, heterogeneous (subclonal) homozygous deletion events, such as that of NF1, which were found in a subset of intra-tumour samples. In a recent study, a fluorescence-based variant technique for looking at multiple samples was also used to determine tumour heterogeneity in glioblastoma (Sottoriva et al., 2013). Thus, FISH may still be a viable lower-throughput single-cell assay for confirming the presence of coexisting tumour subclones.

Furthermore, genome doubling (via endoreduplication), which was recently shown to be frequent across human cancers (Carter et al., 2012), was observed in subclones in our data. While it is not clear if segmental CNAs were harboured preceding or following doubling, this result may be further evidence that genome doubling is a mechanism that contributes to clonal diversity (Cooke et al., 2010). To our knowledge, this is the first report of genome doubling occurring in a subclonal context in ovarian cancer. Moreover, this study is the first to investigate HGS ovarian carcinomas through high-throughput genomics of multiple intra-patient and intra-tumour samples within individual patients.

2.5.1 Limitations and future work

HMM-Dosage is a tool for identifying both somatic CNA and germline CNV events. The inclusion of prior CNV information in the transition probabilities informs the allowable transitions into CNV states at specific loci.
Also, using a state space that includes five separate CNV states allows for parameter estimation specific to the CNV Student's-t emission distributions.

Other HMM-based tools in the literature, such as GPHMM (Li et al., 2011), generally employ Gaussian distributions because the log ratio data are normally distributed following intensity normalization (Bengtsson et al., 2008). We chose to use Student's-t because of the wider tails in the distribution, which allow for better tolerance to outliers compared to a Gaussian. Because HMMs are parametric and have an upper bound on the maximum number of copies, it may be appropriate to use a gamma emission distribution with positive skew (longer right tail) for the highest copy number state. This state can then capture data with signals at a much higher number of copies than is modeled.

To distinguish CNA and CNV events, HMM-Dosage implicitly assumes that CNV signals are present in tumour and contaminating normal cells. While this is important for identifying CNVs, prediction of CNA events may be less robust because normal contamination is not explicitly modeled. This is perhaps mitigated by the parameter estimation of CNA emission distributions separately from CNV states. HMM-Dosage also does not consider the effects of polyploidy. Previous studies have shown that including BAF in the analysis is useful for estimating normal contamination and polyploidy (Laframboise et al., 2007; Colella et al., 2007; Yau et al., 2010; Li et al., 2011). In Chapter 3 and Chapter 4, we address these limitations by explicitly modelling these biological artifacts as parameters in the model. Moreover, these two chapters also address the significant technology shift to next-generation sequencing.
Thus, the next research chapters build upon the work presented in this chapter by improving statistical models and pioneering tools for the emerging era of cancer genome sequencing.

Chapter 3

Detecting genome-wide allelic imbalance and LOH in whole genome sequencing of cancer

3.1 Introduction

Loss of heterozygosity (LOH) is a class of mutations associated with the loss of a parental allele in tumour genomes. LOH leads to homozygosity at a locus, region, or gene as a result of segmental deletion CNA events. LOH analysis is well established for high resolution SNP genotyping arrays (Lin et al., 2004; LaFramboise et al., 2005; Beroukhim et al., 2006; Dutt and Beroukhim, 2007; LaFramboise, 2009; Staaf et al., 2008a; Närvä et al., 2010); however, this platform is limited to interrogating fixed loci that use hybridization intensities as a surrogate measure of nucleotide abundance. Whole genome sequencing (WGS) enables the genome-wide inference of LOH at nucleotide resolution. Robust probabilistic approaches for analysis of LOH in this type of data were under-developed at the time when the work in this chapter was carried out.

There are four challenges to consider when analyzing LOH and allelic imbalance in WGS data. First, heterozygous SNPs in the germline DNA are non-uniformly distributed across the genome; therefore, genomic distance between adjacent SNPs needs to be considered in the analysis. Second, the input data representing the observed allelic counts in WGS data are discrete and thus are not well suited to the Gaussian or Student-t distributions that are often employed for the analogous problem in continuous array data. Third, the allelic count data from the tumour DNA will reflect the proportion of normal cells that are admixed with the tumour cells. Fourth, allelic skew due to allele-specific copy number amplifications (ASCNA) can often be erroneously interpreted as LOH.
ASCNA should still retain signal from the unamplified allele; however, the amplified allele can dominate the overall signal (LaFramboise et al., 2005; Dewal et al., 2011). Figure 3.1 illustrates the difference in allelic distribution between LOH (regions i, ii, iii) and ASCNA (region iv) events; the allelic ratios (proportion of reference read counts) for LOH events are distributed closer to 0 and 1, which is the signal for the homozygous genotype. Most of these challenges are analogously present in analysis of SNP genotyping arrays and some available solutions can be leveraged (LaFramboise et al., 2005; Bengtsson et al., 2010; Greenman et al., 2010a; Yau et al., 2010; Li et al., 2011); however, the specific application to WGS data requires new formulations and statistical modelling.

To address the challenges, we developed APOLLOH, which uses a non-stationary HMM to analyze allelic counts and distinguish LOH from allele-specific amplification (ASCNA) by accounting for copy number. To be compliant with WGS-specific data, the model uses a mixture of binomial distributions to model the digital allelic read counts and a two-component mixture to facilitate the modelling of the proportion of the observed signal expected to come from normal cells (Laframboise et al., 2007; Yau et al., 2010; Li et al., 2011). APOLLOH analyzes data from three inputs: (1) the set of genome-wide heterozygous SNP positions inferred from the normal genome, (2) the copy number profiles inferred from the tumour genome, and (3) the allelic counts of the tumour data for each heterozygous SNP position from (1). The output of APOLLOH is the set of segment boundaries, the predicted status of each segment (LOH, ASCNA, or heterozygous (HET)), and normal contamination estimates.

Figure 3.1: Illustration of empirical allelic ratios between tumour and normal genomic sequencing data from chromosome 20 of a triple negative breast cancer genome (SA225), and effects of copy number. (A) Allelic ratio data of heterozygous loci in the normal genome is centred around 0.5, which represents the balanced presence of two alleles. (B) At the same corresponding loci, allelic ratios in the tumour genome reveal four examples of somatically acquired segments of allelic imbalance in regions (i)-(iv). (C) The segmental copy number of the tumour helps give context to the allelic data: (i) copy neutral LOH (NLOH), AA/BB; (ii) deletion-induced LOH (DLOH), A/B; (iii) amplified LOH (ALOH), AAA/BBB; and (iv) allele-specific amplification (ASCNA), AAAB/ABBB. Allelic ratio value is defined as the reference read counts divided by total depth at a given position. A and B represent reference and non-reference alleles in the genotype, respectively.

We applied APOLLOH to 23 triple negative breast cancer (TNBC) patient samples for which the tumour and normal DNA were sequenced up to ∼30X coverage using WGS Illumina and SOLiD platforms (see Section 1.3.2). For all 23 samples, Affymetrix SNP 6.0 data, which is a standard, orthogonal technology commonly used to profile LOH in tumour genomes, was also acquired. These data served as a benchmark for systematic comparisons of accuracy of each of the novel aspects of the APOLLOH method against baseline methods. In silico mixing experiments were designed to determine the normal contamination levels that render tumour signals undetectable at both 30X and 60X (Section 3.3.3). Finally, the tumour transcriptomes (RNAseq) of 22 patients were analyzed to study the consequence of genomic allelic imbalance on allele-specific expression in the transcriptome, such as mono-allelic expression (MAE) (Section 3.3.7). Studying MAE serves as an orthogonal validation and provides an opportunity to examine the impact of LOH on transcription profiles.
Finally, we interpreted the combined observations of allelic distributions from somatic point mutations and overlapping LOH to suggest the presence of sub-clonal events and temporal ordering.

3.2 Method: APOLLOH

APOLLOH is a computational method for predicting regions of LOH and ASCNA from WGS data. The underlying model has three central features: 1) inference of segments using an HMM to account for spatial correlation; 2) input copy number data allows the model to be copy-number-aware; and 3) the model explicitly accounts for and estimates normal contamination. The method analyzes tumour read count data at heterozygous germline SNPs and copy number to generate the set of regions with predicted LOH, ASCNA, or HET status and the estimate of normal contamination in the tumour sample. Next, we describe the details of the analysis workflow.

[Figure 3.2 diagram omitted. Stages shown: extracting positions for analysis (normal WGSS BAM file; 1) obtain heterozygous positions, giving normal heterozygous positions P = {1...T}); preparing tumour input data (tumour WGSS BAM file; 2) obtain tumour allelic counts for positions in P; 3) obtain tumour discrete copy number segments); APOLLOH analysis (inputs: tumour allelic counts a_{1:T}, N_{1:T} and copy number segments; inferred genotype sequence G_{1:T}; decode genotype to zygosity; outputs: zygosity status ZS_{1:T} and normal proportion parameter s); transcriptome analysis (tumour RNAseq BAM file; obtain allelic counts for overlapping positions in P; predict homozygous genotypes; output: monoallelic expression). Tools named in the diagram: GATK UnifiedGenotyper, BAM parser, HMMcopy, SNVMix.]

Figure 3.2: Workflow of the analysis for APOLLOH. Three inputs are required: 1) Heterozygous positions found in the normal DNA predicted by genotyping tools such as SNVMix (Goya et al., 2010); these genomic positions are the sites of interest in the analysis; 2) Reference counts at these positions in the tumour DNA sequencing data are obtained by extracting alignment read counts using SAMtools (Li et al., 2009); 3) Copy number status for the tumour is predicted by HMM-Dosage. APOLLOH uses the inputs to infer the genotype and subsequently zygosity status is determined for each position of interest. Transcriptome RNAseq data was analyzed for expressed allelic imbalance, and mono-allelic expression (MAE) was determined as homozygous genotypes using SNVMix (Goya et al., 2010).

3.2.1 APOLLOH workflow overview

1. Identify the set of T positions of interest, $L = \{t_i\}_{i=1}^{T}$. These positions are heterozygous germline SNPs, which are obtained by applying SNP prediction tools, such as GATK (McKenna et al., 2010), to the matched normal genome. Approximately 1 to 2 million positions are identified genome-wide per patient (Supplementary Table S10 in Ha et al., 2012). Restricting the analysis to positions where both alleles are present in the matched normal sample reduces the dimensionality of the analysis to T loci and ensures that detected homozygosity will reflect somatic events (Figure 3.2). Figure 1.3 illustrates a hypothetical example of LOH at a germline SNP locus observed in aligned sequencing data.

2. From the tumour genome data, the read counts mapping to the reference base (A allele), read counts mapping to the non-reference base (B allele), and total depth at all positions in L are extracted and represented as $a_{1:T}$, $b_{1:T}$, and $N_{1:T}$, respectively (see Figure 1.3 for an illustrative example).

3. From the tumour sample, obtain copy number information in the form of segmental integer copy number status: homozygous deletion (no copies), hemizygous deletion (1 copy), neutral (2 copies), 1 copy gain (3 copies), 2 and 3 copy amplifications (4 and 5 copies). Copy number status $c_{1:T}$ is then assigned to all positions in L based on the copy number segment that each position overlaps. Copy number profiling of the tumour genome aligned reads can be done using any tool on any technology platform. For this study, we used an in-house HMM-based approach called HMMcopy (see Appendix A).

4. APOLLOH performs inference and segmentation given the input data $a_{1:T}$, $N_{1:T}$, $c_{1:T}$ from the tumour. Subsequently, the inferred genotypes $G_{1:T}$ are encoded into the corresponding zygosity status $ZS_{1:T}$ (Table 3.2; Section 3.2.2). LOH zygosity status is used as the biological descriptor for homozygous genotypes (i.e. A, B, AA, BB, AAA, BBB, etc.); HET status is used for heterozygous genotypes (i.e. AB, AAB, ABB, AABB, etc.), which include the approximately balanced presence of both alleles; and ASCNA status is used for genotypes in which one allele is present at higher copies (i.e. AAAB, ABBB, AAAAB, ABBBB).

3.2.2 APOLLOH probabilistic framework description

The description of APOLLOH is defined in Figure 3.3 and Table 3.1, and mathematical details are presented below.

Variable | Description | Source
$\pi$ | Initial state distribution | M-step in EM
$\delta_\pi$ | Prior counts; parameter of Dirichlet for $\pi$ | User-defined
$G_t$ | Latent variable for genotype at position t | E-step in EM
$a_t$ | Symmetric reference count at position t | Observed
$N_t$ | Total read depth at position t | Observed
$c_t$ | Copy number status at position t | Observed
$\mu_N$ | Normal reference allelic ratio genotype g | User-defined
$\mu_g$ | Tumour reference allelic ratio genotype g | M-step in EM
$\alpha_{\mu_g}$ | Hyperparameter of Beta prior on $\mu_g$ | User-defined
$\beta_{\mu_g}$ | Hyperparameter of Beta prior on $\mu_g$ | User-defined
$s$ | Global stromal contamination proportion parameter | M-step in EM
$\alpha_s$ | Hyperparameter of Beta prior on s | User-defined
$\beta_s$ | Hyperparameter of Beta prior on s | User-defined
$C_t$ | 18 × 18 copy number transition matrix at t; determines allowable transition based on $c_t$ | Fixed
$T_t$ | 18 × 18 genotype transition matrix at position t; genomic distance-dependent probabilities | Fixed
$A_t$ | 18 × 18 combined, copy-number restricted transition matrix, $C_t \times T_t$ at position t | Fixed
$L$ | Expected length of chromosomal regions altered in breast tumours | User-defined

Table 3.1: Description of random variables and fixed quantities in the APOLLOH framework depicted in Figure 3.3. $a_{1:T}$, $N_{1:T}$ and $c_{1:T}$ are observed input quantities. All hyperparameters are user-defined and used to help initialize model parameters. The position-specific HMM transition probabilities are fixed quantities. $\pi_{1:18}$ and $\mu_{1:18}$ are unknown variables estimated by expectation maximization (EM).

$$
\begin{aligned}
p(G_0 \mid \pi) &= \mathrm{Multinomial}(G_0 \mid \pi) \\
p(\pi \mid \delta_\pi) &= \mathrm{Dirichlet}(\pi \mid \delta_\pi) \\
p(a_t \mid G_t = g, c_t) &= \mathrm{Binomial}(a_t \mid \bar{\mu}_g, N_t) \\
p(\mu_g \mid \alpha_{\mu_g}, \beta_{\mu_g}) &= \mathrm{Beta}(\mu_g \mid \alpha_{\mu_g}, \beta_{\mu_g}) \\
p(s \mid \alpha_s, \beta_s) &= \mathrm{Beta}(s \mid \alpha_s, \beta_s) \\
\bar{\mu}_g &= s \cdot \mu_N + (1 - s) \cdot \mu_g \\
p(G_t = j \mid G_{t-1} = i) &= A_t(i, j) = T_t(i, j) \times C_t(i, j) \\
C_t(i, j) &= \begin{cases} 1 & i \in K_l,\ j \in K_k \\ 0 & \text{otherwise} \end{cases} \\
T_t(i, j) &= \begin{cases} \rho & i = j \text{ or } \mathrm{sameZS}(i, j) \\ \dfrac{1 - \rho}{|K_{c_t}|} & \text{otherwise} \end{cases}
\end{aligned}
$$

Figure 3.3: Probabilistic graphical model of APOLLOH. The latent variables $G_t \in K$ represent the genotype state at position t. $\pi$ is the initial state distribution of $G_0$ and is distributed with a Dirichlet prior. APOLLOH employs a position-specific transition model $A_t$. $C_t$ and $T_t$ are the copy number and genotype transition probabilities. The transition probabilities in $A_t$ are the element-wise product of $T_t$ and $C_t$. The emission models the reference counts $a_t$ and depth $N_t$ using a mixture of binomial distributions. The normal proportion is represented by s, which is Beta (prior) distributed. See Table 3.1 for definitions of variables. Grey nodes are known or observed quantities; white nodes are unknown quantities; arrows represent conditional dependence between random variables.

State (K) | State group | Total copy number (c) | Genotype (G) | Zygosity status (ZS)
1 | K2 | 1-2 | A/AA | LOH
2 | K2 | 1-2 | AB | HET
3 | K2 | 1-2 | B/BB | LOH
4 | K3 | 3 | AAA | LOH
5 | K3 | 3 | AAB | HET
6 | K3 | 3 | ABB | HET
7 | K3 | 3 | BBB | LOH
8 | K4 | 4 | AAAA | LOH
9 | K4 | 4 | AAAB | ASCNA
10 | K4 | 4 | AABB | HET
11 | K4 | 4 | ABBB | ASCNA
12 | K4 | 4 | BBBB | LOH
13 | K5 | 5 | AAAAA | LOH
14 | K5 | 5 | AAAAB | ASCNA
15 | K5 | 5 | AAABB | HET
16 | K5 | 5 | AABBB | HET
17 | K5 | 5 | ABBBB | ASCNA
18 | K5 | 5 | BBBBB | LOH

Table 3.2: APOLLOH model state representations of genotypes and zygosity status. $G_t$ is inferred to be one of 18 possible states from an expanded list of genotype states divided into groups of states $K_c$ based on increasing levels of copy number c. Post-assignment of zygosity status ZS represents the final interpretation to which each genotype state maps.

Hidden state space

The full state space consists of 18 genotype states, K, where at each position t, the state space is restricted to $K_{c_t}$ given copy number $c_t$ (Table 3.2). The initial state distribution, $\pi$, is assumed to follow a Dirichlet distribution with hyperparameter $\delta_\pi$ (Table 3.1). Positions that are homozygously deleted are assumed to have zero depth and are ignored if they managed to pass the initial depth filter. Positions overlapping a hemizygous deletion (i.e. 1 copy) are assumed to contain reads from only a or b but not both; therefore, these positions are re-labeled to 2 copies in order to guard against any unexpected signals involving both a and b reads due to erroneous alignments and signals from normal cell contamination. LOH regions are later post-processed into categories of deletion (DLOH), copy-neutral LOH (NLOH), and amplified LOH (ALOH) by referring back to the original input integer copy number status.

If we assume the alleles are equally likely to be observed (i.e. there is no skew towards one particular allele), then the genotypes can be treated symmetrically (e.g. AA and BB or AAB and ABB are treated the same) using the symmetric reference count, $\bar{a}_t = \max(a_t, b_t)$. APOLLOH is flexible to use $\bar{a}$ or $a$; however, in this study, we modeled the alleles separately and therefore will describe the asymmetric version of the model throughout.

Emission model

Each read at a given position $t \in L$ is modeled as a Bernoulli trial.
Given $N_t$ independent Bernoulli trials, the number of reads mapping to the reference base, $a_t$, is modeled using a binomial distribution. The observed likelihood is a mixture of 18 univariate binomial distributions modelling input data reference read counts $a_{1:T}$ and total depth $N_{1:T}$ with parameter $\bar{\mu}_g$ conditioned on genotype $g \in K$.

\[ p(a_t \mid G_t = g) = \mathrm{Binomial}(a_t \mid \bar{\mu}_g, N_t) \tag{3.1} \]

The parameter $\bar{\mu}_g$ consists of a two-component mixture representing the combined signals from tumour proportion $(1 - s)$ and normal cell proportion $s$ (Laframboise et al., 2007; Yau et al., 2010; Li et al., 2011),

\[ \bar{\mu}_g = s\mu_N + (1 - s)\mu_g \tag{3.2} \]

$\mu_g$ is the unobserved parameter representing the reference allelic ratio of tumour cells for genotype state $g \in K$, and $\mu_N$ represents the allelic ratio of normal cells and is set to the heterozygous signal of 0.5. The observed likelihood becomes

\[
\begin{aligned}
p(a_t \mid G_t = g, \mu_N) &= \binom{N_t}{a_t} \bar{\mu}_g^{\,a_t} (1 - \bar{\mu}_g)^{N_t - a_t} \\
&= \binom{N_t}{a_t} \bigl(s\mu_N + (1 - s)\mu_g\bigr)^{a_t} \bigl(1 - s\mu_N - (1 - s)\mu_g\bigr)^{N_t - a_t}
\end{aligned}
\tag{3.3}
\]

The normal proportion parameter $s$ is modeled using a Beta distribution with hyperparameters $\alpha_s$ and $\beta_s$ (Equation 3.4), and the reference allelic ratio parameter $\mu_g$ is also assumed to follow a Beta distribution with hyperparameters $\alpha_{\mu_g}$ and $\beta_{\mu_g}$ (Equation 3.5).

\[ p(s \mid \alpha_s, \beta_s) = \mathrm{Beta}(s \mid \alpha_s, \beta_s) \tag{3.4} \]

\[ p(\mu_g \mid \alpha_{\mu_g}, \beta_{\mu_g}) = \mathrm{Beta}(\mu_g \mid \alpha_{\mu_g}, \beta_{\mu_g}) \tag{3.5} \]

Position-specific transition model

The transition component of the framework uses a position-specific transition matrix $A_t \in \mathbb{R}^{18 \times 18}$ that specifies probabilities for distance-dependent (Colella et al., 2007) and copy-number-permitted transitions between genotypes at each position $t$. Transition probabilities are informed by input copy number $c_t$, such that only genotype states $K_{c_t}$ at position $t$ are allowable. A unique set of transition probabilities is used for each position $t$ to capture two key ideas, which are encoded in matrices $T_t$ and $C_t$, respectively. Thus, for transitions from genotype state $i$ to genotype state $j$ ($i, j \in K$) at positions $t-1$ and $t$, and given copy number $c_t$ at position $t$, the probability is formulated as

\[ p(G_t = j \mid G_{t-1} = i, c_t) = T_t(i,j) \times C_t(j) = A_t(i,j) \tag{3.6} \]

Rows of $A_t$ are normalized such that they sum to 1.

Genotype transitions

The genotype state transition is specified by a position-specific stochastic transition matrix, $T_t \in \mathbb{R}^{18 \times 18}$, for each position $t$. There are two primary reasons for modelling probabilities at each position (locus) independently.

1. The genomic distances between pairs of adjacent positions in $L$ are non-uniform; thus, a distance-dependent strategy (Colella et al., 2007) is used, as previously described in Section 1.4. The transition probabilities are generated by an exponential function modelling a priori knowledge for transitioning between genetic events (Equation 3.7). The genomic distance $d_t$ (in base-pairs) between positions $t$ and $t-1$ and the expected segment length $l$ are used to define a function $\rho_t$; $l$ is user-defined.

\[ \rho_t = 1 - \frac{1}{2}\left[1 - e^{-\frac{d_t}{2l}}\right] \tag{3.7} \]

2. The genotype transition matrix applies high probabilities ($\rho_t$) for self-transitions and transitions between genotypes of the same zygosity status. For example, genotype states AA, BBB, and AAAAA have LOH zygosity status, and therefore have high transition probabilities between them. More formally, a transition from genotype $G_{t-1} = i$ to a genotype $G_t = j$, such that $i$ and $j$ have the same zygosity status, will have probability $\rho_t$. The genotype transition probabilities $T_t$ are defined as

\[ T_t(i,j) = \begin{cases} \rho_t & i = j \text{ or } \mathrm{sameZS}(i,j) \\ \dfrac{1 - \rho_t}{|K_{c_t}| - 1} & \text{otherwise} \end{cases} \tag{3.8} \]

where $\mathrm{sameZS}(i,j)$ is a function that returns true if genotype states $i$ and $j$ have the same zygosity status.

Copy number transitions

The transitions between genotypes of different copy number are captured with a position-specific indicator function, $C_t$, which defines allowable genotype transitions from $G_{t-1} = i$ into $G_t = j$ such that $j$ can only be one of $K_{c_t}$ for the given $c_t$ at position $t$.

\[ C_t(j) = \begin{cases} 1 & j \in K_{c_t} \\ 0 & \text{otherwise} \end{cases} \tag{3.9} \]

If uncertainty in copy number is provided as probabilities $p(c_t)$ for $c_t \in \{1, \ldots, 5\}$ at each position, then the copy number transition matrix can be remodeled to incorporate this information. Effectively, rather than using binary values, $C_t$ becomes a soft weighting matrix (Equation 3.10) that introduces probability mass into all genotypes in $K$.

\[ C_t(j) = p(c_t), \quad j \in K_{c_t},\ \forall c_t \tag{3.10} \]

Learning and inference

The expectation maximization (EM) algorithm is used for estimating model parameters, $\theta = \{s, \mu_{1:18}, \pi_{1:18}\}$. In the expectation step, given the data $\mathcal{D} = \{a_{1:T}, N_{1:T}\}$ and the current settings of parameters $\theta^{(n-1)}$, the posterior marginal probabilities $\gamma_t^{(n)}$ at EM iteration $(n)$ are computed as

\[ p(G_t \mid \mathcal{D}, \theta^{(n-1)}) = \frac{p(\mathcal{D} \mid G_t, \theta^{(n-1)})\, p(G_t \mid \theta^{(n-1)})}{p(\mathcal{D} \mid \theta^{(n-1)})} \tag{3.11} \]

This calculation is done efficiently using the scaled version of the forwards-backwards algorithm (see Equation 1.10 and Bishop (2007)). The normalization constants $w_t$ from the forward propagation conveniently give the data log-likelihood,

\[ \log p(a_{1:T} \mid N_{1:T}, \theta^{(n-1)}) = \sum_{t=1}^{T} \log\{w_t\} \tag{3.12} \]

The objective $Q^{(n)}$ is computed as the sum of the expected complete-data log-likelihood and the log priors,

\[
\begin{aligned}
Q^{(n)} &= \mathbb{E}_{G \mid a, N, \theta^{(n-1)}}\left[\log p(G_{1:T}, a_{1:T} \mid \theta, N_{1:T})\right] \\
&= \sum_{k=1}^{K} p(G_0 = k \mid \mathcal{D}, \theta^{(n-1)}) \log \mathrm{Multinomial}(G_0 \mid \pi) \\
&\quad + \sum_{t=1}^{T} \sum_{i=1}^{K} \sum_{j=1}^{K} p(G_{t-1} = i, G_t = j \mid \mathcal{D}, \theta^{(n-1)}) \log A_t(i,j) \\
&\quad + \sum_{t=1}^{T} \sum_{k=1}^{K} p(G_t = k \mid \mathcal{D}, \theta^{(n-1)}) \log \mathrm{Binomial}(a_t \mid \bar{\mu}_k, N_t) \\
&\quad + \log \mathrm{Beta}(s \mid \alpha_s, \beta_s) + \sum_{k=1}^{K} \log \mathrm{Beta}(\mu_k \mid \alpha_{\mu_k}, \beta_{\mu_k}) + \log \mathrm{Dirichlet}(\pi \mid \delta^{\pi})
\end{aligned}
\tag{3.13}
\]

In the maximization step, the maximum a posteriori (MAP) estimate obtained from $Q^{(n)}$ is used to update the parameters $\theta^{(n)}$ at EM iteration $(n)$,

\[ \theta^{(n)} = \operatorname*{argmax}_{\theta} Q^{(n)} = \left\{ s^{(n)}, \mu^{(n)}_{1:18}, \pi^{(n)}_{1:18} \right\} \tag{3.14} \]

$A_{1:T}$ is fixed and not estimated.
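Because the transition matrix at each position is fully determined by the inter-position distance, the input copy number and the expected segment length, it can be built once before EM begins. The following is a minimal Python sketch of Equations 3.6 to 3.9, assuming the state ordering of Table 3.2 with 0-based indices. The names `ZS`, `K_BY_CN`, `rho` and `transition_matrix`, the uniform fallback for an all-zero row, and the segment length in the usage note are illustrative assumptions, not taken from the APOLLOH source code.

```python
import math

# Zygosity status of the 18 genotype states in Table 3.2 (0-based order).
ZS = ["LOH", "HET", "LOH",
      "LOH", "HET", "HET", "LOH",
      "LOH", "ASCNA", "HET", "ASCNA", "LOH",
      "LOH", "ASCNA", "HET", "HET", "ASCNA", "LOH"]

# Allowed states K_c per copy number; 1 copy is treated like 2 copies,
# following the re-labelling of hemizygous deletions described in the text.
K_BY_CN = {1: range(0, 3), 2: range(0, 3), 3: range(3, 7),
           4: range(7, 12), 5: range(12, 18)}


def rho(d_t, seg_len):
    """Equation 3.7: distance-dependent probability of keeping zygosity."""
    return 1.0 - 0.5 * (1.0 - math.exp(-d_t / (2.0 * seg_len)))


def transition_matrix(d_t, c_t, seg_len):
    """Equations 3.6, 3.8, 3.9: A_t(i, j) = T_t(i, j) * C_t(j), rows normalized."""
    allowed = set(K_BY_CN[c_t])
    r = rho(d_t, seg_len)
    off = (1.0 - r) / (len(allowed) - 1)
    A = []
    for i in range(18):
        row = []
        for j in range(18):
            t_ij = r if (i == j or ZS[i] == ZS[j]) else off  # Equation 3.8
            c_j = 1.0 if j in allowed else 0.0               # Equation 3.9
            row.append(t_ij * c_j)
        total = sum(row)
        if total > 0.0:
            A.append([v / total for v in row])
        else:
            # Assumption of this sketch: if d_t = 0 leaves a row empty,
            # spread the mass uniformly over the allowed states.
            A.append([1.0 / len(allowed) if j in allowed else 0.0
                      for j in range(18)])
    return A
```

For example, `transition_matrix(1000, 3, 2_000_000)` returns a matrix in which every row places all of its mass on states 4 to 7 of Table 3.2 (the three-copy genotypes), with transitions that preserve zygosity status weighted most heavily.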
The update equation for the initial state distribution $\pi$ is

\[ \pi^{(n)}_k = \frac{\gamma^{(n)}(G_0 = k) + \delta^{\pi}(k) - 1}{\sum_{k'=1}^{K} \left( \gamma^{(n)}(G_0 = k') + \delta^{\pi}(k') - 1 \right)} \tag{3.15} \]

The updates for the normal proportion and tumour allelic ratio parameters of the binomial observation model are derived by maximizing $Q^{(n)}$: taking partial derivatives with respect to $s$ and $\mu_k$ for a given genotype state $k \in K$, equating to zero, and solving for the parameters,

\[ \frac{\partial Q^{(n)}}{\partial \mu_k} = (1 - s)\left( \frac{\bar{a}_k}{s\mu_N + (1 - s)\mu_k} - \frac{\bar{b}_k}{1 - s\mu_N - (1 - s)\mu_k} \right) + \left( \frac{\alpha_{\mu_k} - 1}{\mu_k} - \frac{\beta_{\mu_k} - 1}{1 - \mu_k} \right) \tag{3.16} \]

\[ \frac{\partial Q^{(n)}}{\partial s} = \sum_{k=1}^{K} \left( (\mu_N - \mu_k)\left( \frac{\bar{a}_k}{s\mu_N + (1 - s)\mu_k} - \frac{\bar{b}_k}{1 - s\mu_N - (1 - s)\mu_k} \right) \right) + \left( \frac{\alpha_s - 1}{s} - \frac{\beta_s - 1}{1 - s} \right) \tag{3.17} \]

To simplify the computation, the summations over data points are collected as $\bar{a}_k = \sum_{t=1}^{T} \gamma^{(n)}(G_t = k)\, a_t$ and $\bar{b}_k = \sum_{t=1}^{T} \gamma^{(n)}(G_t = k)(N_t - a_t)$.

The EM convergence criterion is satisfied when $F^{(n)} - F^{(n-1)} < \text{threshold}$, where $F$ is the sum of the data log-likelihood and the log priors,

\[ F^{(n)} = \log p(\mathcal{D} \mid \theta^{(n)}) + \log \mathrm{Beta}(s \mid \alpha_s, \beta_s) + \sum_{k=1}^{K} \log \mathrm{Beta}(\mu_k \mid \alpha_{\mu_k}, \beta_{\mu_k}) + \log \mathrm{Dirichlet}(\pi \mid \delta^{\pi}) \tag{3.18} \]

The converged parameters $\hat{\theta}$ are taken from the final iteration of EM and used to infer the optimal hidden state path of genotypes using the Viterbi algorithm,

\[ G_{1:T} = \operatorname*{argmax}_{G} \left\{ p(G \mid \mathcal{D}, \hat{\theta}) \right\} \tag{3.19} \]

Finally, the zygosity state can be decoded, resulting in the final sequence of zygosity status, $ZS_{1:T}$ (Table 3.2).

3.2.3 Implementation

APOLLOH was implemented in Matlab; the forward-backward and Viterbi algorithms were implemented in C. The time and memory complexity is $O(K^2 T)$ and $O(KT)$, respectively, where $K$ is the number of states and $T$ is the number of positions. The practical run-time of the algorithm for about 1.5 million positions on a single core is approximately 0.5 minutes per EM iteration. A typical run of APOLLOH may require 20 to 200 iterations. The memory usage of APOLLOH for 1.5 million positions is approximately 3 GB. The source code and compiled executable (usable on Linux x64 architecture) can be downloaded at http://compbio.bccrc.ca/software/apolloh/.
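To make the quadratic cost in the number of states concrete, the scaled forward recursion behind Equation 3.12 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the Matlab/C implementation: the function names `binom_pmf` and `forward_loglik` are hypothetical, `mu_bar` holds the mixed allelic ratio of Equation 3.2 for each state, and `A[t][i][j]` holds the transition probability from state `i` at position `t-1` to state `j` at position `t` (`A[0]` is unused).

```python
import math


def binom_pmf(a, n, mu):
    """Binomial likelihood of a reference reads out of n (Equation 3.3)."""
    return math.comb(n, a) * mu ** a * (1.0 - mu) ** (n - a)


def forward_loglik(a, N, mu_bar, pi, A):
    """Scaled forward pass; returns the sum of log scaling constants w_t."""
    K = len(pi)
    loglik = 0.0
    alpha = None
    for t in range(len(a)):
        emit = [binom_pmf(a[t], N[t], mu_bar[k]) for k in range(K)]
        if t == 0:
            unnorm = [pi[k] * emit[k] for k in range(K)]
        else:
            # K * K multiplications per position, hence O(K^2 T) time overall.
            unnorm = [emit[j] * sum(alpha[i] * A[t][i][j] for i in range(K))
                      for j in range(K)]
        w = sum(unnorm)                    # scaling constant w_t
        alpha = [u / w for u in unnorm]    # rescaled forward message
        loglik += math.log(w)
    return loglik
```

This sketch keeps only the current forward message, so it needs memory proportional to the number of states; storing every message for the backward pass and for Viterbi back-pointers is what gives the memory requirement proportional to the number of states times the number of positions.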
3.3 Results

We applied APOLLOH to analyze the whole genomes of 23 triple negative breast cancer (TNBC) tumour-normal pairs. The genomes of 17 patients were sequenced using the Life/ABI SOLiD platform (∼26X average sequence coverage); 6 patients were sequenced using the Illumina HiSeq sequencing platform (∼29X average sequence coverage) (Supplementary Table S2 in Ha et al., 2012). Each genome was aligned to the reference genome using BioScope for SOLiD and BWA (Li and Durbin, 2009) for Illumina data. The transcriptomes for 22 of these tumours were sequenced using RNAseq on the Illumina GAii platform. The full analytical workflow for these datasets is presented in Figure 3.2. Additional descriptions of the analyses in this section are presented in Appendix B.

3.3.1 Initial comparison between WGS and genotyping arrays demonstrates the platforms are correlated

We compared the allelic ratios from APOLLOH (analyzed on the WGS data) with Affymetrix SNP6.0 B-allele frequencies (BAF) obtained from the same DNA extractions. We carried out an omnibus correlation analysis by comparing the allelic ratios of predicted APOLLOH segments and the median BAF across overlapping SNP6 probes. The two were significantly positively correlated (Spearman's rho = 0.72, p < 0.001, Figure 3.4, Appendix B), which demonstrated that WGS is comparable to the SNP6 platform for analyzing allelic imbalance in cancer.

Figure 3.4: Benchmarking of WGSS allelic ratios against SNP6 genotyping array B-allele frequencies (BAF) for 23 breast cancer samples. Each datapoint represents an APOLLOH segment whose median allelic ratio is plotted against median BAF across probes that overlap the segment (Appendix B). WGSS allelic ratios (max(refCount, nonRefCount) / depth) and BAF (max(A-intensity, B-intensity) / Total-intensity) were computed as symmetric values.
LOH allelic ratios and B-allele frequencies are distributed within a range, which is likely due to differing normal cell contamination proportions. The Spearman rank correlation coefficient was 0.72 (p < 0.001).

The correlation analysis was also carried out separately for each sample (see examples for SA232, SA224 and SA231 in Figure 3.6A). The correlation coefficients across the samples were also significantly associated with the APOLLOH-estimated normal contamination (Spearman's rho = −0.71, p < 0.001, Figure 3.5A), indicating that higher tumour content led to better platform agreement. Furthermore, the separation between predicted LOH, HET and ASCNA clusters (Figure 3.6A) was observed to vary over a dynamic range such that the distance between cluster centres was correlated with the proportion of normal content in the samples (Spearman's rho = −0.81, p < 0.001, Figure 3.5B).

[Figure 3.5 panels. (A) Spearman's rho against normal proportion; Spearman's rho = −0.7055, p = 0.0002. (B) Cluster distance (2D Euclidean) against normal proportion; Spearman's rho = −0.8063, p < 0.001.]

Figure 3.5: Tumour content influences agreement of WGSS and SNP6, and separation between prediction state clusters. (A) The correlation between APOLLOH segment allelic ratio and SNP6 BAF is plotted against normal proportion for the 23 samples. Examples of correlations for 3 samples are shown in Figure 3.6A. The APOLLOH-SNP6 correlation is negatively correlated with the estimated normal contamination (rho = −0.7055, p < 0.001). (B) For each of the 23 breast cancer samples, Euclidean distance was computed between cluster centroids (2-dimensional median) of APOLLOH predicted LOH and HET classes. Example clusters are shown in Figure 3.6A.
Spearman rank correlation (rho = −0.8063, p < 0.001) was computed between Euclidean distances and the estimated normal proportion parameter $s$ across the 23 samples.

3.3.2 Evaluation of APOLLOH indicates model features systematically improve performance

We examined the benefits of systematically modelling three key features (spatial correlation, copy number awareness, and normal cell contamination) by comparing modular variations of the APOLLOH model (Figure 3.7). Setting input copy number status to diploid for all positions reduced the framework to a standard HMM that modeled neither copy number nor normal contamination (APOLLOH-noCN). Setting stromal proportion $s$ to zero in APOLLOH reduced the model to an HMM that modeled copy number but did not account for normal contamination (APOLLOH-noS). SNVMix (Goya et al., 2010) genotypes were used as the baseline naive iid binomial mixture model that did not account for any of the three features.

Figure 3.6: Comparison and evaluation of APOLLOH results using data from Affymetrix SNP6.0 genotyping arrays as the benchmark. (A) Initial benchmarking by comparing WGSS derived allelic ratios and SNP6 B-allele frequencies. Three samples are shown with LOH clusters centred at locations reflecting APOLLOH normal contamination estimation. (B) For the 23 TNBC samples, precision, recall and F-measure metrics were computed for LOH predictions from each APOLLOH model variant and SNVMix using OncoSNP (Yau et al., 2010) predictions (from SNP6 data) as the ground truth.

Figure 3.7: Systematic comparison of loss of heterozygosity (LOH) predictions for chromosome 20 of a triple negative breast cancer genome (SA225). The OncoSNP software (Yau et al., 2010) was applied on an orthogonal platform, Affymetrix SNP6 arrays, and served as the ground truth dataset for evaluation. SNVMix (Goya et al., 2010) was used to predict homozygous (LOH) and heterozygous (HET) genotypes on the whole genome shotgun (WGS) data to represent the independent, identically distributed (iid) model. APOLLOH is the full model that models copy number (CN) and normal contamination (SP). APOLLOH-noCN is a model variant of APOLLOH that analyzes WGSS without modelling copy number or estimating the normal contamination parameter, but models spatial correlation (SC) to predict only LOH and HET in a reduced state space. APOLLOH-noS models copy number but not normal cell proportion, predicting additional marginal states of allele-specific copy number amplification (ASCNA) in an expanded state space. Copy number results were predicted by HMMcopy.

LOH predictions made by SNVMix, APOLLOH-noCN, APOLLOH-noS and APOLLOH on the 23 TNBC WGS samples were evaluated using predictions from SNP6 array data analyzed by OncoSNP (Yau et al., 2010) as ground truth. Precision, recall, and F-measure metrics were computed (Appendix B) for each model variant and tumour sample (Figure 3.6B, Supplementary Table S4A,B in Ha et al., 2012). SNVMix LOH predictions, determined by homozygous genotypes at each site independently using a global threshold on genotype probabilities, showed significantly lower sensitivity across all samples (median recall 0.09). APOLLOH-noCN had significantly higher recall (0.98, one-tailed Wilcoxon signed-rank test p < 0.001) and F-measure (0.83, p < 0.001) compared to SNVMix, establishing the benefit of modelling spatial correlation. APOLLOH-noS had significantly higher precision than APOLLOH-noCN (0.94 compared to 0.83, p < 0.001) due to the ability to distinguish LOH and ASCNA in amplified copy number regions, thereby reducing false positive LOH calls, as shown in the q-arm in Figure 3.7. The F-measure of APOLLOH-noS (0.92) was also significantly higher than that of APOLLOH-noCN (p < 0.01).
The full APOLLOH model, which explicitly models normal contamination, also had a high F-measure with a median of 0.91, which was not significantly different from APOLLOH-noS (two-tailed Wilcoxon signed-rank test, p = 0.11).

In order to assess the benefits of modelling copy number, the performance of distinguishing LOH and ASCNA was evaluated using 278,229 OncoSNP-predicted ASCNA positions as ground truth. APOLLOH-noCN correctly called only 6% (recall) as bi-allelic (presence of both alleles) and had a precision of 0.39. In contrast, APOLLOH demonstrated median recall of 0.73 and precision of 0.82 (Supplementary Table S4C in Ha et al., 2012), firmly establishing that explicit consideration of copy number is essential for distinguishing LOH and ASCNA.

For five cases, we also evaluated performance of APOLLOH on an additional benchmark dataset by applying the model to whole exome-capture (EXCAP) sequence data published previously (Shah et al., 2012). Using the EXCAP data as ground truth, the APOLLOH performance (predicted on WGS) had median precision, recall and F-measure of 0.85, 0.95 and 0.91, respectively (Table 3.3). The agreement of LOH in these cases with the orthogonal EXCAP data provides an additional source of validation and supports high confidence in the APOLLOH predictions.

Sample | Precision (EXCAP as truth) | Recall (EXCAP as truth) | F-Measure (EXCAP as truth) | Precision (SNP6 as truth) | Recall (SNP6 as truth) | F-Measure (SNP6 as truth)
SA029 | 0.9570 | 0.9758 | 0.9663 | 0.9828 | 0.9997 | 0.9912
SA030 | 0.7827 | 0.7937 | 0.7882 | 0.6814 | 0.9933 | 0.8083
SA052 | 0.8521 | 0.9691 | 0.9069 | 0.9597 | 0.8519 | 0.9026
SA065 | 0.9777 | 0.9549 | 0.9661 | 0.9868 | 0.9972 | 0.9920
SA073 | 0.7189 | 0.8975 | 0.7983 | 0.5503 | 0.8188 | 0.6582
Median | 0.8521 | 0.9549 | 0.9069 | 0.9597 | 0.9933 | 0.9026

Table 3.3: Performance of APOLLOH using whole-exome benchmark data. Performance for 5 cases was computed using Exon Capture (EXCAP) sequencing data, published in Shah et al., 2012, as ground truth data.
Calculations were performed as described in Appendix B, similar to the evaluation using SNP6 data as truth.

One caveat with using OncoSNP as the basis of the evaluation is the inclusion of germline LOH regions in the truth set. APOLLOH predictions will be devoid of data at these germline LOH loci because only informative heterozygous positions are included (see Figure 3.7 at 20q11.22-23). This suggests that the true recall (sensitivity) from the WGS data may in fact be higher than reported.

3.3.3 Tumour-normal admixture simulation demonstrates performance maintained at 34% tumour content

Figure 3.8: Tumour-normal sampling admixture experiment. Nine mixture proportions generated by sampling reads from the tumour and normal BAM files were analyzed (see Methods). (A) APOLLOH results are shown for chromosome 9 of mixture proportions of 0.09, 0.26, 0.43, 0.60 and 0.77 tumour reads sampled to 30X. 'Tumour100' are results from the original tumour sample. (B) The normal proportion parameter $s$ inferred by APOLLOH was significantly correlated (Spearman's rho = 0.92) with the mixture proportions of 0.1 to 1.0 (increments of 0.1) at 30X and 60X. (C) The F-measure performance of APOLLOH and APOLLOH-noS (not accounting for normal contamination) for 30X and 60X admixtures was evaluated using Affymetrix SNP6.0 data as ground truth.

We assessed the effectiveness of APOLLOH in predicting allelic imbalance and estimating normal proportion under varying proportions of tumour-normal content by using real data in a controlled in-silico experiment. Reads were sampled from a tumour sample (SA225) and its matched normal data to generate nine whole genome datasets for 30X and 60X at proportions of 0.9 to 0.1 normal content. The total number of reads was set to be the same as in the original normal sequence BAM file (~30.5X or 91Gb of aligned reads).
Based on the 13.8% original predicted normal contamination for this case, the expected normal proportions were determined using a 15% baseline: 0.915, 0.830, 0.745, 0.660, 0.575, 0.490, 0.405, 0.320, and 0.235. APOLLOH hyperparameter settings for the Beta prior distribution of the normal proportion parameter $s$ were assigned uniform settings, $\alpha_s = 5000$ and $\beta_s = 5000$. The copy number results from the original tumour sample were used for the APOLLOH analysis of all 9 mixtures in each of the 30X and 60X datasets. Figure 3.8A shows how increasing the sampled normal proportion affects the signal of observable allelic imbalance in the 30X data.

For 30X sampled coverage, APOLLOH accurately estimated the normal proportion parameter $s$ for each mixture ≤0.745 with statistically significant overall correlation (Spearman's rho = 0.92, p < 0.001, Figure 3.8B, Table 3.4). The F-measure (Figure 3.8C, Table 3.4) for each mixture using SNP6 for ground truth comparison (from the original tumour DNA) indicated that high performance (F-measure = 0.94) was achieved at normal content of 0.58 and was maintained even at 0.66 (F-measure = 0.75). At high levels of contamination, inspection of the data clearly shows that allelic imbalance cannot be detected, as the contribution of heterozygous ratios from normal cells dominates the overall signal (Figure 3.8A). At 60X coverage, the performance was consistent across all admixture levels, suggesting that sequencing genomes to such depths will likely lead to improved LOH prediction.

Mixture | Tumour (mixture) | Normal (mixture) | Tumour (adjusted) | Normal (adjusted) | Estimated Normal (30X) | F-Measure (30X) | Estimated Normal (60X) | F-Measure (60X)
Tum10Norm90 | 0.1 | 0.9 | 0.0850 | 0.9150 | 0.6096 | 0.0255 | 0.7098 | 0.7062
Tum20Norm80 | 0.2 | 0.8 | 0.1700 | 0.8300 | 0.6251 | 0.0858 | 0.7131 | 0.7158
Tum30Norm70 | 0.3 | 0.7 | 0.2550 | 0.7450 | 0.6908 | 0.3809 | 0.6675 | 0.7468
Tum40Norm60 | 0.4 | 0.6 | 0.3400 | 0.6600 | 0.6128 | 0.7518 | 0.5885 | 0.8003
Tum50Norm50 | 0.5 | 0.5 | 0.4250 | 0.5750 | 0.5353 | 0.9404 | 0.5136 | 0.8793
Tum60Norm40 | 0.6 | 0.4 | 0.5100 | 0.4900 | 0.4574 | 0.9731 | 0.4393 | 0.9162
Tum70Norm30 | 0.7 | 0.3 | 0.5950 | 0.4050 | 0.3818 | 0.9757 | 0.3643 | 0.9570
Tum80Norm20 | 0.8 | 0.2 | 0.6800 | 0.3200 | 0.3074 | 0.9758 | 0.2849 | 0.9639
Tum90Norm10 | 0.9 | 0.1 | 0.7650 | 0.2350 | 0.2389 | 0.9740 | 0.2121 | 0.9679
Tum100 | 1.00 | 0.00 | 0.8500 | 0.1500 | 0.1382 | 0.9741 | 0.1382 | 0.9741

Table 3.4: Inferred normal proportion and F-measure performance for the 30X and 60X tumour-normal mixture experiment. Mixture proportions are based on the sampling proportions extracted from the tumour and normal BAM files. The adjusted theoretical proportions, which factor in 15% normal cell content (85% cellularity), are the best approximation to the true tumour-normal mixture. These are computed as adjusted normal proportion = 1 − (tumour mixture × 0.85). A comparison of the F-measure between the APOLLOH-noS model (not accounting for normal contamination) and the full APOLLOH model is provided.

Comparison of the full APOLLOH model to APOLLOH-noS showed that modelling normal contamination modestly increased performance (Figure 3.8C). Therefore, the estimation of the binomial distribution parameter $\mu_g$, even without direct inference of $s$, allowed the model to adapt to the altered distributions induced by normal contamination. In addition, there were several anecdotal examples where accuracy was improved in the full model over APOLLOH-noS (Figure 3.9).
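As a small worked check, the adjusted proportions in Table 3.4 follow directly from the formula in its caption. The Python sketch below reproduces that column; the function name `adjusted_normal_proportion` and its `cellularity` parameter are illustrative names, with the 0.85 default taken from the 15% baseline stated in the text.

```python
def adjusted_normal_proportion(tumour_mix, cellularity=0.85):
    """Adjusted normal proportion = 1 - (tumour mixture * cellularity)."""
    return 1.0 - tumour_mix * cellularity


# Print the sampled tumour proportion next to the adjusted normal proportion.
for tenths in range(1, 11):
    mix = tenths / 10
    print(f"{mix:.1f} {adjusted_normal_proportion(mix):.3f}")
```

Under this formula a sampled tumour proportion of 0.4 corresponds to an adjusted normal proportion of 0.66, that is, the 34% tumour content referred to in the section title.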
The estimate of normal proportion from the full model has additional benefits, including informing case-specific stringency thresholds for somatic point mutation prediction and informing the depth of sequencing that would be needed to recover somatic point mutations. Taken together, these results establish the genome-wide estimation of normal contamination from APOLLOH as an effective indicator of normal cell admixture and provide a reasonable estimate of the upper bound of normal contamination at which tumour signal can still be extracted from the data at 30X and 60X coverage.

[Figure 3.9 panels, shown for chromosome 17 of SA030 and SA224. Tracks: OncoSNP (SNP6 data); APOLLOH-noS (excludes normal proportion parameter); APOLLOH (full model); HMMcopy (WGSS copy number segments). Axes: B allele frequency, allelic ratio, log ratio. Legend: NEUT, AMP, HEMD, HOMD; HET, ASCNA, LOH, NLOH.]

Figure 3.9: Examples of improved results when accounting for normal contamination in APOLLOH. OncoSNP (Yau et al., 2010) results were used as ground truth. APOLLOH-noS is the model that does not estimate the normal proportion parameter. APOLLOH (full model) models normal contamination, and HMMcopy was used to predict and segment copy number in tumour-normal samples (see Appendix A).

Figure 3.10: Genome-wide gene frequencies of APOLLOH predictions, copy number profiles from the current 23 cases and an external (METABRIC) dataset (Curtis et al., 2012), and monoallelic expression. Panels 1-2 show copy number profiles for cohorts of 118 basal-like subtype breast cancer patients from METABRIC, analyzed on Affymetrix SNP6.0 arrays, and the 23 TNBC patients. Deletion gene frequency profiles (blue, negated for display purposes) in both datasets show similar patterns to deletion LOH frequencies (Panel 3). Panel 4 shows the profile of genes affected by copy neutral LOH. Panel 5 shows the profile of overall LOH events including genes found within deletions, copy neutral regions, and amplifications.
Panel 6 is the frequency profile of genes that are observed with MAE as a consequence of genomic LOH events for 22 samples with available RNAseq data.

3.3.4 Genomic landscape of allelic imbalance reveals widespread LOH in TNBC

In order to infer LOH profiles in the TNBC genomes, we ascertained the copy number profiles from the WGS data (Figure 3.10). The resulting copy number landscape resembled that obtained from an external cohort of 118 basal-like (a subtype of TNBC) breast cancer samples profiled using SNP6 arrays (Curtis et al., 2012). Application of APOLLOH to the WGS data from the 23 TNBC tumour samples then yielded a total of 37,204 LOH, 19,798 HET, and 2568 ASCNA segments (Supplementary Table S6A in Ha et al., 2012). LOH events were further characterized into 9447 (25%) deletion LOH (DLOH), 17,875 (48%) copy-neutral LOH (NLOH) and 9882 (27%) amplified LOH (ALOH) segments. While the number of NLOH segments was higher than DLOH, the median length of an NLOH region was shorter (97kb compared to 145kb), and NLOH regions collectively covered, on average across the samples, less of the genome (16% compared to 23%). By contrast, HET regions were much larger, with a median of 409kb, and accounted for more than 49% of the genome on average, compared to 46% by LOH events (Supplementary Table S6B in Ha et al., 2012 and Figure 3.11A). The full list of APOLLOH predicted segments is in Supplementary Table S7 in Ha et al. (2012).

LOH genes were determined by assessing complete overlap within predicted LOH segments. On average for each case, 3404 (16%), 2406 (12%) and 1072 (5%) genes were observed within DLOH, NLOH and ALOH segments, respectively (Figure 3.11B and Supplementary Table S6C in Ha et al., 2012).
Deletion-induced LOH accounted for the majority of the landscape; however, copy neutral LOH contributed substantially, with notably higher gene frequencies within chromosomes 3p, 7, 8p, 10, 12, 14, 17 and 22 (Figure 3.10). Regions with the highest frequency of amplified LOH were 1q and 17q (Figure 3.12). The most frequent large-scale event observed in the landscape of zygosity (Figure 3.10) was the whole chromosome-level loss of heterozygosity of chromosome 17 in 18 cases (78%). The full list of gene-based LOH alteration frequencies is found in Supplementary Table S8 in Ha et al., 2012.

[Figure 3.11 panels. (A) Proportion of genome altered by LOH and ASCNA, per sample. (B) Number of genes altered by LOH and ASCNA, per sample.]

Figure 3.11: Distribution of the proportion of (A) genome and (B) number of genes altered by APOLLOH predicted LOH (green) and ASCNA regions (red). The proportion of the genome altered by LOH ranges from 13-67%, with a median of 49%. The number of genes altered by LOH ranges from 1694 to 10,446, with a median of 6941.

Notable genes that are frequently affected by LOH in chromosome 17 across the entire cohort include BRCA1 (88%), USP22 (96%), ITGB3 (92%), NLRP1 (92%), RPA1 (92%), NF1 (84%), HDAC5 (80%), YWHAE (88%), POLR2A (92%), KIF18B (84%), and TP53 (76%), all of which are predominantly altered by deletions. Another region that frequently undergoes LOH is 5q31-32, which includes CTNNA1, RAD50, RAD17, CDKL3, DDX46, and TCERG1. Chromosome 14 harbours a gene frequently affected by focal LOH, MTA1, which has a role in repressing estrogen receptor (Martin et al., 2001; Kumar et al., 2002).
The notable genes frequently found within NLOH regions were CNTNAP3B (40%), FANCD2 (32%), GPS2 (32%), MYO1C (28%), CAMKK1 (32%), CAMTA2 (40%), DHX33 (40%), and ITGB4 (16%).

Figure 3.12: Genome-wide gene frequency landscape of APOLLOH loss of heterozygosity (LOH) predictions for 23 TNBC samples. Events are categorized into homozygous deletion (HOMD), deletion LOH (DLOH), copy neutral LOH (NLOH), amplification LOH (ALOH), overall LOH (Total LOH), balanced copy number amplification (BCNA), allele-specific copy number amplification (ASCNA), and heterozygous or retention (HET).

For genes falling within amplified regions across the samples, the median proportion of LOH, ASCNA and balanced CNA (BCNA) was 57%, 28% and 10%, respectively (Supplementary Table S6D in Ha et al., 2012). Amplified and copy neutral LOH are consistent with the notion that segmental amplifications or duplications are the result of at least two copy number events in the evolutionary history of the tumour. Several examples, specifically on chromosome 17, contained regions whereby compound deletion-amplification events likely occurred in sequence (Figure 3.13).

[Figure 3.13 panels, shown for chromosome 17 of SA030 and SA224. Tracks: OncoSNP (SNP6 data; B allele frequency); APOLLOH (full model; allelic ratio); HMMcopy (WGSS copy number segments; log ratio). Legend: NEUT, AMP, HEMD, HOMD; HET, ASCNA, LOH, NLOH.]

Figure 3.13: Examples of LOH events predicted within amplifications in chromosome 17. Sample SA224 contains a large amplification spanning q21.32 to q35 where only one allele is observed. SA030 contains a more focal amplification in q22 undergoing LOH.

3.3.5 Somatic inactivation of genes with germline stop codon mutations

We investigated the effects of LOH on genes that harbour heterozygous germline stop codon variants. For germline truncating variants, normal heterozygous positions for each sample were used.
Somatic truncating mutations for this analysis came from two sources: 1) for the samples sequenced using the SOLiD platform, the published set of validated mutations (Shah et al., 2012) was used; 2) for the samples sequenced using the Illumina HiSeq platform, a set of mutations was predicted using JointSNVMix (Roth et al., 2012) and filtered by the classifier MutationSeq (Ding et al., 2012a). The positions for each sample were annotated using snpEff (Cingolani, 2012) (hg36.54) and positions with codon effect "STOP_LOST" (germline only) and "STOP_GAINED" were extracted.

[Figure 3.14 panels. Top: germline stop codons, number of codons per sample for remaining alleles A and S. Bottom: germline synonymous variants, number of codons per sample for remaining alleles R and N.]

Figure 3.14: Germline stop codon and synonymous variants affected by LOH. The number of genes with germline stop codon and synonymous variants that were affected by LOH is shown for each of the 23 tumour samples. Normal heterozygous positions determined by GATK (and used in the APOLLOH analysis) were annotated with codon effects using snpEff (Cingolani, 2012). For all variant loci that were annotated as having a stop codon and overlapped LOH regions, the remaining alleles corresponding to the amino acid or the stop codon were labeled as "A" and "S", respectively. For loci annotated as synonymous, the remaining alleles corresponding to the reference and non-reference were labeled as "R" and "N", respectively.
The remaining alleles following LOH were assigned as wild-type (WT) and mutant (MUT) if the tumour allelic ratio was > 0.5 and < 0.5, respectively; for the validated mutations, the ultra-deep amplicon sequencing allelic read counts were used. For non-synonymous mutations, positions with the codon effect “NON_SYNONYMOUS_CODING” were used.

We conservatively determined 1390 truncating variants that overlapped the normal heterozygous positions in our dataset. Across the 23 cases, LOH led to the loss of the amino acid coding allele in 291 positions, leaving only the stop codon allele that encodes a truncated protein (Figure 3.14, Supplementary Table S9A in Ha et al., 2012). By contrast, 582 events were observed to have the truncating variant lost due to LOH. Using 44,754 synonymous germline variants as a background distribution, of which 18,154 variant events (41%) were retained after LOH, the proportion of truncating variants retained (33%; 291 of 873) was significantly lower (χ² test, p < 0.001); that is, truncating variants were enriched for being lost after LOH. This suggests that selection on somatically driven LOH of a germline background of truncating polymorphisms may lead to removal of truncated genes. However, the 291 events still represent an intriguing upper bound on the possibility of partial or complete loss of function in the affected genes. The rate of occurrence (12.7 ± 6.4 per case) was of the same order of magnitude as the number of genes affected by non-synonymous coding mutations typically reported in epithelial cancer genomes (Pleasance et al., 2010a,b; Ding et al., 2008; Shah et al., 2009b, 2012), indicating that somatic inactivation by LOH of germline truncating protein variants likely contributes meaningfully to the mutational landscape of TNBC. Moreover, this analysis outlines a genome-wide substrate composed of germline and somatic genetics upon which selection may be acting.
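The enrichment test reported above can be re-derived from the counts given in the text. The following is a minimal sketch, assuming a 2x2 contingency layout (variant class by retained/lost after LOH) and a Pearson chi-square test without continuity correction; the exact test configuration used in the original analysis is not stated, and the function name `chi_square_2x2` is hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns the statistic and the p-value for 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    stat = 0.0
    for observed, row, col in ((a, row1, col1), (b, row1, col2),
                               (c, row2, col1), (d, row2, col2)):
        expected = row * col / n
        stat += (observed - expected) ** 2 / expected
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Counts quoted in the text (assumed layout: retained vs. lost after LOH).
truncating_retained, truncating_lost = 291, 582
synonymous_total, synonymous_retained = 44754, 18154
synonymous_lost = synonymous_total - synonymous_retained

stat, p_value = chi_square_2x2(truncating_retained, truncating_lost,
                               synonymous_retained, synonymous_lost)
frac_truncating = truncating_retained / (truncating_retained + truncating_lost)
frac_synonymous = synonymous_retained / synonymous_total

print(f"retained after LOH: truncating {frac_truncating:.0%}, "
      f"synonymous {frac_synonymous:.0%}")
print(f"chi-square = {stat:.2f}, p = {p_value:.2g}")
```

Under these assumptions the statistic exceeds the 1-degree-of-freedom critical value for p = 0.001 (10.83), in line with the significance level quoted in the text.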
Larger studies would be required to determine its implication in the pathogenesis of TNBC.

3.3.6 Analysis of LOH and somatic mutations reveals potential subclonality and temporal ordering

We next interpreted somatic point mutations in the context of the genomic architectures as inferred by APOLLOH. We investigated 680 missense and 55 truncating (nonsense) mutations (Supplementary Table S9B in Ha et al., 2012) using previously validated data (Shah et al., 2012) and JointSNVMix from the 23 cases used in this study. In 63 (9.3%) of the missense events, LOH rendered the mutation homozygous, which included mutations affecting TP53, PTEN, ERBB2 and PIK3CA. The mutation in PIK3CA was a canonical activating kinase domain mutation H1047R and was found in a region of ALOH, agreeing with previous findings (LaFramboise et al., 2005; Dewal et al., 2011) that the mutation was acquired early and was selectively amplified. In addition, mutations rendered homozygous due to LOH affected genes with roles in actin cytoskeleton and microtubule stabilization functions (KLHL1, ESPN, DIAPH1, CASC5), extracellular matrix (ECM) interactions (LAMA1), angiogenesis (BAI2) and cell division (CDC5, CDCA7). In the truncating events, 9 were homozygous for the stop codon (Supplementary Table S9C in Ha et al., 2012), leading to complete inactivation of genes such as RAD51C (involved in homologous repair), THSD4 (involved in ECM assembly), JAK1 (involved in the IFN-alpha/beta/gamma signal pathway) and CDK12 (a cyclin dependent kinase involved in splicing) (Supplementary Table S9B in Ha et al., 2012). For bi-allelic inactivation due to DLOH, temporal ordering of coincident mutation and the CNA deletion is challenging to ascertain.
However, for mutations rendered homozygous that overlap NLOH and ALOH, the parsimonious explanation for the combined observations is that the mutation events likely arose first and subsequent duplication or amplification of the remaining mutant allele followed. Thus, the resulting temporal ordering suggests these are candidate tumourigenic mutations that were selected for throughout the evolutionary history of the tumour.

In contrast, 247 total missense and nonsense mutations in regions of LOH have allelic ratios that were skewed toward the wild-type allele (Supplementary Table S9B in Ha et al., 2012). These are more difficult to interpret, since there are competing explanations: i) the events may be mutually exclusive, occurring independently in separate, individual cells; ii) in NLOH and ALOH regions, the mutation may have occurred subsequently to the LOH and amplification events, leading to the presence of the mutation in only a portion of the alleles. Whether subclonal or relatively late in the evolutionary process, these mutations were likely not early drivers of tumourigenesis. Ultimately, single cell resolution would be required to adequately confirm and interpret their significance.

3.3.7 Monoallelic gene expression events associated with genomic LOH reveal disrupted pathways in TNBC

We investigated the association between APOLLOH results and transcriptome allelic ratio (TAR) by analyzing 22 TNBC patients for which tumour RNAseq data were available. For LOH predicted segments, the corresponding TAR is expected to be monoallelic. In contrast, TAR for HET- and ASCNA-predicted segments may be observed as either balanced, skewed, or monoallelic depending on factors such as epigenetic modifications and mutations in regulatory elements (Pastinen and Hudson, 2004). Across the cohort, the median TAR values for LOH, ASCNA and HET were 0.83, 0.71 and 0.63, respectively (Figure 3.15A).
The median TAR overlapping APOLLOH-predicted LOH segments and the APOLLOH-estimated normal proportion parameter s were significantly negatively correlated (Spearman’s rho = −0.91, p < 0.001, Figure 3.15B, Table 3.5), supporting the observed overall deviation of the TAR distribution away from 1.0. Thus, the RNAseq data corroborated the prediction of normal proportion from APOLLOH in addition to supporting the accuracy of the LOH calls. By contrast, the correlation for TAR within LOH regions and normal contamination predicted by OncoSNP was not as strong, but still significant (Spearman’s rho = −0.85, Table 3.5).

[Figure 3.15 plot: (A) RNAseq allelic ratio distributions for HET, ASCNA and LOH; (B) Normal Proportion against RNAseq allelic ratio, annotated "Spearman’s rho = −0.9136, p<0.01"; (C) "LOH-induced monoallelic expression", Number of Genes per sample for DLOH, NLOH and ALOH; (D) "Monoallelic expression within HET, BCNA, ASCNA regions", Number of Genes per sample for HET, BCNA and ASCNA.]

Figure 3.15: Analysis of transcriptome RNAseq data. (A) The distributions of transcriptome RNAseq symmetric allelic ratios that fall within HET (grey), ASCNA (red) and LOH (green) predicted regions are significantly different (pairwise Wilcoxon one-tailed test, p < 0.01). (B) The median symmetric allelic ratio of RNAseq data within predicted LOH segments for each sample, represented as a point, strongly negatively correlated with the estimated normal proportion parameter s (first principal component line in red). (C) The numbers of MAE genes established by LOH events are categorized into deletion (DLOH), copy neutral (NLOH) and amplification (ALOH) and sorted by total LOH in descending order. (D) The numbers of genes with MAE that overlapped genomic HET, balanced CNA (BCNA) and ASCNA regions are shown in the same sorted order as in (C).

Sample   APOLLOH normal   Median RNAseq allelic   OncoSNP normal   Median RNAseq allelic
         proportion       ratio (APOLLOH LOH)     proportion       ratio (OncoSNP LOH)
SA028    0.14             0.9706                  0.20             0.9643
SA029    0.59             0.7778                  0.70             0.7692
SA030    0.38             0.8333                  0.50             0.8873
SA052    0.46             0.7143                  0.60             0.6906
SA065    0.18             0.9000                  0.30             0.8919
SA073    0.53             0.8000                  0.80             0.8333
SA219    0.50             0.8235                  0.60             0.8333
SA220    0.20             0.9286                  0.30             0.9231
SA221    0.55             0.8182                  0.90             0.7551
SA223    0.35             0.8333                  0.30             0.8333
SA224    0.21             0.9353                  0.40             0.9333
SA225    0.14             1.0000                  0.20             0.9945
SA227    0.30             0.8571                  0.50             0.9167
SA231    0.13             0.9714                  0.20             0.9615
SA232    0.55             0.8000                  0.70             0.8000
SA233    0.59             0.8000                  0.80             0.7273
SA235    0.49             0.8000                  0.80             0.7500
SA236    0.17             0.9333                  0.40             0.9286
SA237    0.39             0.8333                  0.50             0.8333
SA238    0.38             0.8484                  0.60             0.8889
SA239    0.56             0.8000                  0.70             0.8000
SA299    0.43             0.8095                  0.70             0.8000
SA300    0.31             NA                      0.70             NA

Table 3.5: Normal contamination estimates predicted by APOLLOH and OncoSNP and transcriptome allelic ratios for LOH predicted regions in 23 TNBC samples. Median transcriptome allelic ratios of positions overlapping all LOH regions, predicted by APOLLOH and OncoSNP, are shown for the 22 breast samples with corresponding RNAseq data.

The unbiased genome-wide coverage of WGS nominated more normal heterozygous loci in each of the 23 cases compared to the full scaffold of probes on the SNP6 platform (Supplementary Table S10 in Ha et al., 2012). Consequently, the number of overlapping RNAseq positions with available coverage was also ~2-fold higher for WGSS (mean 108,778 ± 31,832) compared to SNP6 (mean 48,224 ± 13,570) (Figure 3.16). Moreover, the high resolution offered by genome sequencing enabled APOLLOH to predict 2021 LOH segments smaller than 3kb, of which 1481 were not predicted by OncoSNP; these predictions were supported by similar RNAseq allelic ratios (medians of 0.83 and 0.80, respectively).
In fact, 1020 of 1481 segments had boundaries located completely between or outside of the Affymetrix SNP6 probe scaffold (Supplementary Table S11 in Ha et al., 2012). These results demonstrate that whole genome sequence data are more suitable for comprehensively analyzing LOH and allelic expression at resolutions that are not attainable by SNP6.

[Figure 3.16 plot: "RNAseq coverage at WGSS and SNP6 positions"; y-axis: Number of Positions; paired WGSS and SNP6 bars for each of the 22 samples with RNAseq data, plus a cohort-level WGSS versus SNP6 summary.]

Figure 3.16: Number of WGSS and SNP6 probe positions with RNAseq coverage. The number of normal heterozygous positions predicted by GATK (McKenna et al., 2010) from the WGSS data (red) with coverage in the RNAseq data is consistently higher across the cohort when compared to the number of SNP probes on the Affymetrix SNP6 platform.

Monoallelic expression (MAE) can arise as a result of genomic allelic loss via LOH events. In order to characterize the occurrence of this mechanism, we determined genes that exhibited MAE in the transcriptome established by co-occurring predicted LOH events (Supplementary Table S12 in Ha et al., 2012). MAE was determined using the genotypes inferred by SNVMix for all transcriptome positions that intersected loci used in the APOLLOH analysis. Parameters for SNVMix were set using the 2-component mixture, s · 0.5 + (1 − s) µ_g, where µ_aa = 1, µ_ab = 0.5, µ_bb = 0 and s is inferred by APOLLOH on the genomic data. These parameters were considered appropriate after comparing them to the distributions of transcriptome allelic ratios (TAR) (Figure 3.17).
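The 2-component mixture above can be written out directly. The following is a minimal sketch, not the SNVMix implementation: the function name `mixture_binomial_means` is hypothetical, and the example values of s are illustrative, chosen within the range of the normal proportions listed in Table 3.5.

```python
def mixture_binomial_means(s):
    """Expected reference-allele fraction per genotype after normal admixture.

    Implements s * 0.5 + (1 - s) * mu_g, where s is the normal proportion and
    mu_g is the pure-tumour value for genotype g (aa = 1, ab = 0.5, bb = 0).
    """
    if not 0.0 <= s <= 1.0:
        raise ValueError("normal proportion s must lie in [0, 1]")
    pure_tumour = {"aa": 1.0, "ab": 0.5, "bb": 0.0}
    return {g: s * 0.5 + (1.0 - s) * mu for g, mu in pure_tumour.items()}


# Illustrative normal proportions (low, intermediate and high contamination).
for s in (0.14, 0.38, 0.59):
    means = mixture_binomial_means(s)
    print(s, {g: round(mu, 3) for g, mu in means.items()})
```

As s increases, the homozygous means move toward 0.5 while the heterozygous mean stays at 0.5, which is why homozygous and heterozygous genotypes become harder to separate in samples with high normal content.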
A gene g was determined to have MAE status if the genotypes for all positions x_g ∈ L overlapping g had a marginal posterior probability of being homozygous (p_aa + p_bb) greater than that of being heterozygous (p_ab). An average of 3137 genes per case exhibited MAE, of which 2017 (64%) were observed to be coincident with LOH (Figure 3.15C). Deletion LOH gave rise to an average of 962 genes with MAE whereas copy neutral and amplified LOH events led to average MAE of 696 and 358 genes, respectively (Supplementary Table S6E in Ha et al., 2012). In contrast, there were far fewer instances of MAE of genes within HET, BCNA and ASCNA regions, averaging 993, 29 and 98 per case, respectively. Only 3 (14%) cases had more genes implicated within these regions than within regions of LOH (Figure 3.15D). This suggests that genomic LOH explained the majority of MAE in TNBC and established a lower bound on the proportion of MAE that can be directly attributed to LOH. As a result, it appears only a minor proportion of MAE could be attributed to other modifications of the genome such as epigenetic factors and mutation. Moreover, the abundance of MAE genes within HET, BCNA and ASCNA regions and the predicted normal proportion were significantly positively correlated (Figure 3.18), indicating that the MAE genes in these regions were likely inherited germline (epigenetic) events whose signals became more detectable as normal cell content increased.

[Figure 3.17 plot: one density panel per sample; x-axis: RNAseq Allelic Ratio; y-axis: Density; dotted lines labelled AA/BB and AB. Per-sample s values: SA028 0.144513, SA029 0.58681, SA030 0.379573, SA052 0.456241, SA065 0.184626, SA073 0.529546, SA219 0.499175, SA220 0.20347, SA221 0.546988, SA223 0.348076, SA224 0.211005, SA225 0.138208, SA227 0.299803, SA231 0.128067, SA232 0.546429, SA233 0.585597, SA235 0.490148, SA236 0.168956, SA237 0.392439, SA238 0.380569, SA239 0.558197, SA299 0.429066.]

Figure 3.17: Transcriptome allelic ratio distribution and SNVMix parameters used for determining MAE. The distribution is for the symmetric RNAseq allelic ratio, which is defined as max(refCount, nonRefCount)/depth. Binomial parameters are fixed using APOLLOH-inferred normal proportion parameters s such that µ_aa = s · 0.5 + (1 − s) · 1.0, µ_bb = 1 − µ_aa, and µ_ab = s · 0.5 + (1 − s) · 0.5. The dotted lines AA/BB and AB represent µ_aa/µ_bb and µ_ab for each case given s.

[Figure 3.18 plot: left panel "LOH-induced monoallelic expression" (Spearman rho = 0.09, p = 0.71); right panel "Monoallelic expression within HET, BCNA, ASCNA regions" (Spearman rho = 0.92, p = 0.00); x-axis: Estimated stromal proportion; y-axis: Number of Genes.]

Figure 3.18: Association between MAE gene frequencies and normal (stromal) contamination. The left plot represents MAE genes induced by genomic LOH. The right plot represents MAE genes that have balanced copy number amplification (BCNA), allele-specific copy number amplification (ASCNA), or heterozygous or retention (HET) genomic allelic imbalance states. The line of best fit is shown for each plot. The strong correlation for HET/BCNA/ASCNA MAE genes indicates that MAE may be due to germline epigenetic events that become easier to detect as normal cell content increases in the samples.

We next examined the genome-wide landscape of LOH-associated MAE. In general, the pattern of LOH-induced MAE closely mirrored the landscape of genomic LOH as shown in Figure 3.10. However, the absolute frequency of events was reduced, most likely due to lower expression of genes in deletion LOH regions, and our conservative approach for establishing MAE. The copy neutral frequencies also closely mirror the shape across the genome of the LOH-associated MAE profile.
Consistent with our observation from the genomic LOH landscape, the most frequent genes exhibiting LOH-associated MAE were found within chromosome 3p, 5q, 8p, 10p, 14, and 17 (Figure 3.10, Figure 3.19, Supplementary Table S8 in Ha et al., 2012).

Figure 3.19: Genome-wide gene frequency landscape of monoallelic expression (MAE) as a consequence of loss of heterozygosity (LOH). MAE landscapes are categorized into the corresponding genomic events of deletion LOH (DLOH), copy neutral LOH (NLOH), amplification LOH (ALOH), overall LOH (Total LOH), balanced copy number amplification (BCNA), allele-specific copy number amplification (ASCNA), and heterozygous or retention (HET).

To further refine the interpretation of the LOH-induced MAE genes, we performed a pathway analysis to examine biological functions that could be modulated by these genes. Using the Reactome FI (Wu et al., 2010) database, we projected the aberrated genes onto a network of interacting proteins and clustered this network into highly connected modules using Cytoscape v2.8.1 (Smoot et al., 2011). Genes that had LOH-MAE frequencies of 10 or greater were used in the analysis. Significant pathways (FDR < 0.05) in Modules 0-5 were analyzed using EnrichmentMap (Merico et al., 2010) to determine relationships between pathways within each module. For this analysis, we used gene sets, in GMT format, as described in Shah et al., 2012.

[Figure 3.20 plot: Enrichment Map networks for Modules 0 to 5. Node labels: EGFR1 (C); Nicotinic acetylcholine receptor signaling pathway (P); Syndecan-1-mediated signaling events (N); IGF1 pathway (N); Insulin pathway (N); Arrhythmogenic right ventricular cardiomyopathy (ARVC) (K); ARF6 signaling events (N); Signaling events activated by hepatocyte growth factor receptor (c-Met) (N); Integrin signalling pathway (P); Focal adhesion (K); ARF6 trafficking events (N); Regulation of actin cytoskeleton (K); Fatty acid metabolism (K); M phase (R); Cadherin signaling pathway (P); Wnt signaling pathway (P); Chromatin remodeling by hSWI/SNF ATP-dependent complexes (B); APC/C-mediated degradation of cell cycle proteins (R); Signaling by Wnt (R); Cell cycle checkpoints (R); M/G1 transition (R); Proteasome (K); Host interactions of HIV factors (R); Homologous recombination (K).]

Figure 3.20: Pathway enrichment analysis of genes with monoallelic expression (MAE) established by loss of heterozygosity (LOH) events. Gene networks were inferred using Reactome Functional Interaction software (Wu et al., 2010) within the Cytoscape (Smoot et al., 2011) plugin. LOH-induced MAE genes were used in the analysis and subsequently clustered into modules. At a false discovery rate (FDR) of 0.05, significantly enriched pathways included Modules 0 to 5. Shown are the Enrichment Map (Merico et al., 2010) networks generated for the significant pathways (Supplementary Table S10 in Ha et al., 2012), highlighting the interactions between pathways identified within each of the six modules.

A total of eleven modules were produced with seven having statistically significant enriched pathways (FDR < 0.05, Figure 3.20, Supplementary Table S10 in Ha et al., 2012). In particular, Module 0 contained pathways involving cell-shape/motility, focal adhesion and integrin signaling; Module 2 contained M-Phase genes; Module 3 contained homologous recombination (HR); Module 4 contained Wnt and cadherin signaling, and chromatin remodelling complexes.
Haploinsufficiency in HR genes is known to lead to chromosome fragmentation and genome instability (Date et al., 2006; Thacker and Zdzienicka, 2003), and Wnt, cell cycle and focal adhesion are all known from functional studies to modify tumour initiation and/or tumour progression and furthermore have been specifically associated with breast cancer pathogenesis. Intriguingly, genes in Module 1 nominated functionally enriched gene sets that are linked along a chain of related oncogenic pathways. Notably, integrin signaling, regulation of actin cytoskeleton, focal adhesion and Wnt signaling exhibit considerable cross talk with growth factor signaling (Turner, 2000) due to EGFR and PI3 kinase, both of which are known oncogenic drivers in breast cancer. Our results now implicate a genomic mutational mechanism for disrupting the normal function of these pathways, in the form of LOH-associated MAE, that has been under-appreciated in the literature. The identification of these core pathways in our analysis indicates that LOH-associated MAE contributes a measurable component of the somatic mutational landscape that includes CNAs, point mutations, insertions/deletions and epigenetic changes that collectively modulate biological function.

3.4 Discussion

We have described a probabilistic framework for predicting regions of LOH in genome sequencing data of cancers, and implemented the model as a non-stationary HMM called APOLLOH. The algorithm models discrete, digital allelic read counts, taking advantage of the base-pair resolution offered in sequencing data. The experimental workflow allows the analysis to be performed at an unprecedented number of possible germline heterozygous loci in the normal and, in contrast to genotyping arrays, is not restricted to fixed loci. We applied the algorithm to 23 TNBC genomes sequenced at ∼30X sequence coverage on two massively parallel sequencing platforms to profile the LOH landscapes.

The performance of the variants of the APOLLOH framework shows progressively improved results when features are incrementally accounted for: spatial correlation, copy number data inclusion, and normal contamination modelling (full model). The model benefits from being aware of copy number because, particularly in regions of amplifications, ASCNA can be distinguished from LOH, reducing false positive LOH predictions. The correct identification of LOH is important for interpreting results such as the integration of LOH with SNVs (Sections 3.3.5 and 3.3.6) or MAE (Section 3.3.7).

Accounting for normal cell contamination did not significantly improve accuracy in our benchmarking analysis; however, there were specific instances in which incorporation of the s parameter allowed APOLLOH to be more sensitive to LOH (Figure 3.9). Moreover, the full model has the advantage of providing the normal proportion estimate for each sample, which is useful not only for confirming the general validity of the LOH predictions but also aids in interpretation of other somatic alterations (e.g. point mutations) in the context of cellularity.

We also investigated the extent of LOH in affecting allele-specific expression by analyzing matching tumour transcriptome sequencing data. This study is the first integrated analysis describing the landscape of somatic LOH underpinning MAE in whole genome sequencing of primary tumours.

3.4.1 Limitations and future directions

APOLLOH requires copy number as input, treating pre-computed CNA results with certainty. As previously described in Section 1.3.3, a more robust solution is the simultaneous inference of both CNA and LOH. Furthermore, while APOLLOH explicitly accounts for normal cell contamination, it does not yet model intra-tumour heterogeneity due to the presence of multiple tumour subclones.
Subclonal allelic imbalance signals, amongst signals from admixed normal cells, are more difficult to detect, potentially leading to false negative predictions with the current model. In Chapter 4, we address these limitations through novel extensions that enable the simultaneous inference of subclonal CNA and LOH by accounting for multiple tumour subpopulations.

Chapter 4

Modelling the copy number architecture of clonal cell populations using whole genome sequencing of tumours

4.1 Introduction

Tumours are often genetically heterogeneous, both between and within patients. This also extends to the context of intra-tumour heterogeneity in which multiple tumour clones harbour divergent genomic aberrations. As was introduced in Section 1.2, distinct cell populations (clones) within a tumour can harbour divergent genotypes and associated phenotypes. We define a clone as a population of cells related by descent from a unitary origin and uniquely identified by the complement of genetic aberrations (SNV, CNA/LOH, rearrangements) comprising its clonal genotype. The clonal evolution theory implies that extant clones are related genetically through a phylogenetic tree (Nowell, 1976). The cellular prevalence, defined as the proportion of cells harbouring an aberration in the overall (bulk) tumour sample, can indicate the evolutionary timing of the aberration: high prevalence aberrations are acquired earlier than low prevalence aberrations. Thus, ancestral aberrations are found at the root of the tree, while descendent aberrations are situated towards the leaves. Ancestral aberrations are considered clonally dominant because these events are found in all tumour cells (homogeneity); descendent aberrations are considered subclonal because these events are found in a subpopulation of tumour cells (heterogeneity).
Cellular prevalence can be measured approximately through sequencing a bulk sample (Nik-Zainal et al., 2012b; Landau et al., 2013; Roth et al., 2014), or more precisely in independent analysis of single cells (Navin et al., 2011). Observing the cellular prevalence of genome aberrations can help to elucidate clonal evolution underpinning tumour progression, response to treatment, and adaptation to tumour micro-environments.

Whole genome sequencing (WGS) of a single biopsy is a standard for studying the tumour genomic landscape from an individual cancer patient. This sample will contain a heterogeneous bulk mixture of millions of cells which often includes normal or stromal contaminating cells, lymphocytic infiltration, and potentially distinct tumour clones (Aparicio and Caldas, 2013). Characterization of clonal populations from such datasets has primarily been focused on SNVs, which required deep targeted amplicon sequencing data (Campbell et al., 2008a; Ding et al., 2012a; Shah et al., 2012; Nik-Zainal et al., 2012b; Gerstung et al., 2012; Roth et al., 2014). Measuring cellular prevalence of CNA and LOH presents unique challenges since these events can span megabases of a chromosome, rendering targeted deep sequencing of alleles infeasible. A solution to address the absence of deep sequencing data is the use of genome-wide germline heterozygous SNP loci, which allows statistical strength to be gained by simultaneously analyzing thousands to millions of positions. CNA and LOH that are present in only a minor cell population (subclonal) will have diminished statistical signals that are susceptible to false negative detection. For example, in Figure 4.1, Deletions I and III have a weaker copy number signal (closer to baseline of zero) and less allelic ratio spread (away from heterozygosity; 0.5) compared to Deletion II.
These two deletion events (I and III) are clustered with tumour cellular prevalence of 0.51 (proportion of all tumour cells).

We developed a novel probabilistic tool called TITAN that simultaneously infers CNA and LOH segments from genome-wide read depth and allelic ratios. For each alteration, we assume the event is segregated into the underlying populations consisting of three different cell types: normal cells, tumour cells containing the event, and tumour cells without the event (Figure 4.2). The presence of normal cells (e.g. from the stroma) can confound the deconvolution of clonal populations, particularly if it constitutes a large fraction (>60%) of the sample. We hypothesized that co-occurring events will be observed at similar cellular prevalence, resulting from punctuated clonal expansions (Navin et al., 2011; Greaves and Maley, 2012). The concept of punctuated clonal expansion states that a clone evolved a small number of large-scale aberrations, followed by the proliferation of this clone into a large subset of cells in the tumour (Greaves and Maley, 2012). TITAN groups these events into a clonal cluster, which we define as a group of events observed at similar cellular prevalence due to clonal expansion(s). In the data, modelling clonal clusters facilitates increased sensitivity to detect weaker signals across multiple loci arising from the same clonal expansion (Figure 4.1). We integrated these features into a hidden Markov model (HMM).

Figure 4.1: Detection of subclonal deletions in whole genome sequencing data of a triple negative breast cancer genome. Copy number is represented as the log ratio of tumour and normal read depth. Discrete copy number status shown is predicted as either a hemizygous deletion (HEMD; green), copy neutral (NEUT; blue), or gain/amplification (AMP; red). Allelic ratios, which are required for analyzing loss of heterozygosity (LOH), are computed as the proportion of reference alleles. The LOH status shown is one of heterozygous (HET; grey), LOH (green), copy neutral LOH (NLOH; blue), or allele-specific gain/amplification (ASCNA; red). Subclonal deletions are observed to have a weaker log ratio signal that is closer to 0 and show less spreading in allelic ratios (Deletion I) compared to clonal deletions (Deletion II). This can be seen in the cellular prevalence estimates (proportion of sample), where ‘Deletion I’ is present in approximately half the proportion of tumour cells, which is indicated by the 2nd clonal cluster ‘Z2’ horizontal line. ‘Deletion I’ and ‘Deletion III’ are clustered into the same subclonal cluster because they share similar signals and cellular prevalence in the data. ‘Deletion II’ is present in all tumour cells, indicated by the 1st clonal cluster ‘Z1’ horizontal line. Tumour cellularity of 81% (normal contamination of 19%) is denoted with a black horizontal line.

Our approach is distinct from related methods in the literature. Methods such as APOLLOH and Control-FREEC (Boeva et al., 2012b) both model normal contamination from WGS of tumours, but they do not jointly infer CNA and LOH in a unified model, nor explicitly account for multiple tumour subpopulations. SNP genotyping array-based methods, such as OncoSNP (Yau et al., 2010) and ABSOLUTE (Carter et al., 2012), analyze CNA while accounting for intra-tumoural heterogeneity in cancer samples but cannot be directly applied to WGS data. MEDICC (Schwarz et al., 2014) attempts to quantify intra-tumour heterogeneity and reconstruct evolutionary phylogenies; however, it requires multiple related (intra-tumour) samples. Another recently developed approach, THetA (Oesper et al., 2013), was designed to predict subclonal CNA events from tumour sequencing data. However, THetA requires pre-segmented results, and has super-polynomial time and memory complexity when two or more tumour subpopulations are considered.
Thus, fewer than 15 segments are required for reasonable run-times, resulting in lower resolution CNA results. Moreover, THetA analyzes subclonal CNA in the absence of allelic ratios, which results in the omission of LOH and allelic imbalance. Finally, OncoSNP-seq (Yau, 2013) accounts for mixed populations in WGS data, but does not model distinct clonal populations in a clustering approach.

We present a rigorous evaluation of TITAN including: i) single-cell sequencing based validation of predictions on WGS data from a high grade serous ovarian tumour, ii) systematically engineered in-silico mixtures with WGS data from multiple intra-patient samples, and iii) artificially embedded CNA and LOH events of varying sizes. We performed a comparison of TITAN to four published methods to demonstrate that TITAN has higher sensitivity to detect subclonal events and accurately estimates cellular prevalence for these events. Application of TITAN to 23 triple negative breast cancers (TNBCs) shows that a substantial proportion of clonal diversity is captured in the CNA/LOH dimensions, including low cellular prevalence LOH predictions that are associated with mono-allelic expression inferred from RNA-seq data. Using single-cell sequencing experiments for validation of TITAN-predicted CNA/LOH events, we were able to confirm the key modelling assumptions that distinct cell populations can be identified from CNA/LOH inference on WGS data. Finally, we applied TITAN to a set of primary breast tumours and their corresponding mouse xenografts to study the clonal selection patterns. Together, these data show that the cellular prevalence profile of the copy number architecture from WGS provides an effective route to inferring clonal populations in patient tumour samples.

4.2 Method: TITAN

We developed TITAN, which is a probabilistic tool for predicting segmental CNA and LOH from WGS data.
The input to the model is the full set of germline heterozygous SNP loci identified from the normal sample, and the corresponding read depth and allele ratios at those SNP positions from the matching tumour. The output is a set of segmental CNA and LOH alterations, clonal cluster memberships, and estimates of cellular prevalence, normal contamination, and average tumour ploidy (Figure 4.3). The probabilistic graphical model is shown in Figure 4.4; definitions for states are described in Table 4.1; definitions for variables and parameters are described in Table 4.2; and additional mathematical details are in Appendix C.

4.2.1 Representation of mixed populations in heterogeneous tumour WGS data

In this section, we provide the assumptions of TITAN and introduce the concepts of cellular prevalence and clonal clusters more formally. The model is based on four main assumptions:

[Figure 4.2 diagram: normal population; tumour population with DEL; tumour population without DEL. $C^{Total}_{DEL} = n c_{norm} + (1-n) s_z c_{norm} + (1-n)(1-s_z) c_{DEL}$]

Figure 4.2: Representation of the total copy number signal from the mixed populations in a heterogeneous tumour sample. $C^{Total}_{DEL}$ is the total copy number signal that is the sum of the 3 components: normal population (white circles), tumour populations with the deletion (green decagons) and without the event (blue decagons). $n$ is the normal proportion; $s_z$ is the tumour proportion for the $z$th clonal cluster that does not contain the event; $c_{norm}$ and $c_{DEL}$ are normal (norm) and tumour integer copy numbers.
DEL denotes the deletion event in the example; it is used to indicate the tumour copy number ($c_{DEL}$) and total copy number ($C^{Total}_{DEL}$) at the deletion locus.

Assumption 1. At sufficient sequencing depth, the allelic ratio and tumour sequence coverage (depth) at approximately one to three million heterozygous germline SNP loci reflect the underlying somatic genotype of the tumour.

Assumption 2. Segmental regions of CNA and LOH span 10s to 1000s of contiguous SNP loci.

Assumption 3. The observed sequencing signal is the sum of the signals from heterogeneous cellular populations, including normal and tumour subpopulations.

Assumption 4. Sets of genetic aberrations are observed at similar cellular prevalence if these events arose from the same clone during punctuated expansion.

For Assumption 3, we assume the observed measurements were generated from a composite of three types of cell populations (Yau et al., 2010), which allows for modelling tumours that contain multiple tumour subpopulations (clones). Let $s$ be the proportion of tumour cells that are diploid heterozygous (and therefore normal) at the locus. Then, $(1-s)$ is the tumour cellular prevalence, or the proportion of the tumour cells containing the event. The relative proportions of the three cell populations are as follows: $n$, the proportion of the sample that are non-malignant cells; $(1-n)s$, the proportion of the sample that are tumour cells and have normal genotype; and $(1-n)(1-s)$, the sample cellular prevalence, or the proportion of the sample that are tumour cells and harbour the CNA or LOH event of interest (Figure 4.2).
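The three-population decomposition above can be made concrete with a small sketch. The function names and the example values of $n$ and $s$ are hypothetical and chosen only for illustration; the arithmetic follows the proportions $n$, $(1-n)s$ and $(1-n)(1-s)$ and the total copy number sum shown in Figure 4.2.

```python
def population_proportions(n, s):
    """Return the sample proportions of the three cell populations.

    n: proportion of the sample that is non-malignant (normal) cells.
    s: proportion of tumour cells with a normal genotype at the locus.
    """
    if not (0.0 <= n <= 1.0 and 0.0 <= s <= 1.0):
        raise ValueError("n and s must lie in [0, 1]")
    return {
        "normal": n,
        "tumour_without_event": (1.0 - n) * s,
        "tumour_with_event": (1.0 - n) * (1.0 - s),
    }


def total_copy_number(n, s, c_norm=2, c_event=1):
    """Total copy number signal at a locus, summed over the three populations."""
    p = population_proportions(n, s)
    return (p["normal"] * c_norm
            + p["tumour_without_event"] * c_norm
            + p["tumour_with_event"] * c_event)


# Hypothetical sample: 19% normal cells, a hemizygous deletion (one copy)
# present in half of the tumour cells.
props = population_proportions(n=0.19, s=0.5)
signal = total_copy_number(n=0.19, s=0.5, c_norm=2, c_event=1)
```

In this hypothetical case the three proportions sum to one, and the deletion lowers the expected total copy number below the diploid value of 2 only in proportion to the cells that carry it.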
We also assume that, at any aberrant locus, only one tumour genotype is harboured.

For Assumption 4, we assume that punctuated clonal expansions likely gave rise to multiple somatic events that will be observed at similar cellular prevalence; therefore, these events can be assigned to one of a finite number of clonal clusters. Because the events in a clonal cluster share a cellular prevalence that is specific to that cluster, we can redefine the parameter $s$. Let $Z$ be the set of clonal clusters. Then, $(1-s_z)$ is the tumour cellular prevalence at the locus of interest for clonal cluster $z \in Z$. The simultaneous inference and clustering of each data point to $z \in Z$ is the primary distinguishing feature over related work (Yau et al., 2010; Van Loo et al., 2010; Carter et al., 2012; Oesper et al., 2013; Yau, 2013). We further assume that there are only a finite number ($|Z|$) of clusters.

For Assumption 1 and Assumption 2, we assume segmental CNA and LOH events span many contiguous SNP positions. Let $G$ be the set of genotype states, each of which combines copy number and allelic imbalance (Table 4.1). To capture the shared signals between adjacent positions, TITAN was implemented as a two-factor hidden Markov model (HMM) where the hidden genotypes $G_{1:T}$ and the hidden clonal cluster memberships $Z_{1:T}$ make up the factorial Markov chain for $T$ heterozygous germline SNPs (Figure 4.4). The state space is dynamically determined as a function of the number of clonal clusters, resulting in $|G| \times |Z|$ state tuples $(g \in G, z \in Z)$ (Table 4.1).

4.2.2 Workflow

The analysis workflow of TITAN for tumour whole genome sequencing data is shown in Figure 4.3.

1. First, germline heterozygous SNP positions $L = \{t_i\}_{i=1}^{T}$ are identified from the normal genome using a genotyping tool, such as SAMtools mpileup (Li et al., 2009) or GATK (McKenna et al., 2010).
The analysis uses approximately one to three million loci genome-wide per patient and allows for identification of somatic allelic imbalance events relative to the normal genome (see Section 3.2 and Ha et al. (2012)).

2. From the tumour genome data, the read counts mapping to the reference base (A allele) and the total depth at all positions in $L$ are extracted and represented as $a_{1:T}$ and $N_{1:T}$, respectively (Figure 1.3).

3. The tumour copy number is normalized for GC content and mappability biases using only the normalization component of HMMcopy (Appendix A). Briefly, the genome is divided into bins of 1 kb and read count is represented as the number of reads overlapping each bin. GC content and mappability bias correction are performed on tumour and normal samples, separately. The corrected read counts for the overlapping 1 kb bin at each position of interest $t \in L$, $\bar{N}_t$ and $\bar{N}^N_t$, are used to compute the log ratio, $l_{1:T} = \log(\bar{N}_{1:T} / \bar{N}^N_{1:T})$.

4. TITAN analyzes the data $l_{1:T}$, $a_{1:T}$, $N_{1:T}$ to segment the data into regions of CNA/LOH and estimates normal contamination, tumour ploidy and cellular prevalences for $|Z|$ clonal clusters.

| State | Genotype ($g$) | Copy number ($c_{T,g}$) | Allelic ratio ($r_{T,g}$) | Status |
|---|---|---|---|---|
| -1 | NA | NA | | OUT |
| 0 | NA | 0 | 0.5 | HOMD |
| 1 | A | 1 | 1 | DLOH |
| 2 | B | 1 | ε | DLOH |
| 3 | AA | 2 | 1 | NLOH |
| 4 | AB | 2 | 1/2 | HET |
| 5 | BB | 2 | ε | NLOH |
| 6 | AAA | 3 | 1 | ALOH |
| 7 | AAB | 3 | 2/3 | GAIN |
| 8 | ABB | 3 | 1/3 | GAIN |
| 9 | BBB | 3 | ε | ALOH |
| 10 | AAAA | 4 | 1 | ALOH |
| 11 | AAAB | 4 | 3/4 | ASCNA |
| 12 | AABB | 4 | 0.5 | BCNA |
| 13 | ABBB | 4 | 1/4 | ASCNA |
| 14 | BBBB | 4 | ε | ALOH |
| 15 | AAAAA | 5 | 1 | ALOH |
| 16 | AAAAB | 5 | 4/5 | ASCNA |
| 17 | AAABB | 5 | 3/5 | UBCNA |
| 18 | AABBB | 5 | 2/5 | UBCNA |
| 19 | ABBBB | 5 | 1/5 | ASCNA |
| 20 | BBBBB | 5 | ε | ALOH |

Table 4.1: Tumour genotype states used by TITAN. Descriptions of states: deletion LOH (DLOH), copy neutral LOH (NLOH), diploid heterozygous (HET), amplified LOH (ALOH), gain/duplication of 1 allele (GAIN), allele-specific copy number amplification (ASCNA), balanced copy number amplification (BCNA), unbalanced copy number amplification (UBCNA). State 0 represents the outlier state. A represents the allele matching the reference genome and B represents the non-reference allele. Allelic ratio is defined as the proportion of reference alleles, $A/(A+B)$. ε is a small number ($1 \times 10^{-5}$).

[Figure 4.3 workflow diagram. Inputs: normal and tumour WGSS BAM files. Steps: 1) obtain heterozygous positions from the normal (genotype caller); 2) obtain tumour allelic counts $a$, $N$ at those positions (BAM parser); 3) normalize GC and mappability biases for tumour and normal depth using HMMcopy; 4) compute log ratios from the corrected reads, then generate copy number and LOH profiles in the TITAN analysis. Outputs: CNA and LOH genotypes $G_{1:T}$ and the parameters $n$, $s$, $\phi$.]

Figure 4.3: Analysis workflow for TITAN. Heterozygous SNPs in the normal DNA are the genomic positions of interest in the analysis. Reference counts $a$, read depth $N_T$, and the log ratio between tumour and normal read depths $l_t$ are inputs into TITAN. The outputs are the optimal sequence of CNA/LOH genotypes and clonal cluster memberships. Model parameters for normal contamination $n$, cellular prevalence $s_z$, and tumour ploidy $\phi$ are estimated.

5. For $i = 1$ to 5, run the TITAN analysis once for clonal cluster states $Z_i := \{1, ..., i\}$, where $|Z| = i$ is the number of clonal clusters.
That is, TITAN is run once for $Z_1 = \{1\}$, $|Z| = 1$, then independently run again for $Z_2 = \{1,2\}$, $|Z| = 2$, and again for $Z_3 = \{1,2,3\}$, $|Z| = 3$, etc.

4.2.3 Details of the TITAN probabilistic framework

Hidden states for joint genotype and clonal cluster

The model consists of 21 genotype states $G$ (Table 4.1) and a finite number of clonal cluster states $Z$. Each position $t \in L$ can be assigned a state tuple $(g,z)$, for $g \in G$, $z \in Z$. Thus, there are $K = 21 \times |Z|$ state tuples. The initial state distributions (mixing weights), $\pi_G$ and $\pi_Z$, are Dirichlet distributed with hyperparameters $\delta_G$ and $\delta_Z$, respectively.

[Figure 4.4 graphical model diagram: nodes $\pi_G$, $\pi_Z$, hidden chains $G_{0:T}$ and $Z_{0:T}$, observations $a_t$, $N_t$, $l_t$, and parameters $\omega_{g,z}$, $\mu_{g,z}$, $\phi$, $n$, $s_z$. The figure lists the model distributions:
$p(G_0|\pi_G) = \mathrm{Mult}(G_0|\pi_G)$; $p(\pi_G|\delta_G) = \mathrm{Dir}(\pi_G|\delta_G)$;
$p(Z_0|\pi_Z) = \mathrm{Mult}(Z_0|\pi_Z)$; $p(\pi_Z|\delta_Z) = \mathrm{Dir}(\pi_Z|\delta_Z)$;
$p(a_t|G_t = g, Z_t = z) = \mathrm{Bin}(a_t|\omega_{g,z}, N_t)$; $p(l_t|G_t = g, Z_t = z, \phi) = \mathcal{N}(l_t|\mu_{g,z}, \sigma^2_g)$;
$p(s_z|\alpha_z,\beta_z) = \mathrm{Beta}(s_z|\alpha_z,\beta_z)$; $p(n|\alpha_n,\beta_n) = \mathrm{Beta}(n|\alpha_n,\beta_n)$;
$p(\sigma^2_g|\alpha_g,\beta_g) = \mathrm{InvGam}(\sigma^2_g|\alpha_g,\beta_g)$; $p(\phi|\alpha_\phi,\beta_\phi) = \mathrm{InvGam}(\phi|\alpha_\phi,\beta_\phi)$;
$p(G_t = j|G_{t-1} = i) = A_t(i,j)$; $p(Z_t = n|Z_{t-1} = m) = T_t(m,n)$;
with $\mu_{g,z}$, $\omega_{g,z}$, $A_t$ and $T_t$ as given in Equations 4.1, 4.2, 4.8 and 4.9.]

Figure 4.4: Probabilistic graphical model of TITAN. Shaded nodes are known or observed quantities; open nodes are random variables of unknown quantities. Arrows represent conditional dependence between random variables. Full details and definitions are listed in Table 4.2. The hidden variables $G_{1:T}$ and $Z_{1:T}$ are latent states for the genotypes and clonal cluster memberships, respectively; $A_{1:T}$ and $T_{1:T}$ are the transition matrices for the factorial Markov chain. The observed data are the tumour reference allele read count $a_{1:T}$ and tumour depth $N_{1:T}$, which are modeled using a binomial emission density with parameter $\omega_{g,z}$, and the copy number log ratio $l_{1:T}$, which is modeled using a Gaussian emission density with mean $\mu_{g,z}$ and variance $\sigma^2_g$. The normal contamination $n$, cellular prevalence $s_z$ for clonal cluster $z$, and average tumour ploidy $\phi$ are unknown parameters.

| Variable | Description | Value |
|---|---|---|
| $\pi_Z$ | Initial state distribution for clonal clusters | Estimated by EM in M-step |
| $\delta_Z$ | Prior counts; parameter of Dirichlet for $\pi_Z$ | User-defined |
| $\pi_G$ | Initial state distribution for genotypes | Estimated by EM in M-step |
| $\delta_G$ | Prior counts; parameter of Dirichlet for $\pi_G$ | User-defined |
| $Z_t$ | Latent variable for clonal cluster at position $t$ | Estimated by EM in E-step |
| $G_t$ | Latent variable for genotype at position $t$ | Estimated by EM in E-step |
| $a_t$ | Reference count at position $t$ | Observed |
| $N_t$ | Total read depth at position $t$ | Observed |
| $l_t$ | Log ratio of tumour-normal depths at position $t$ | Observed |
| $r_N$ | Expected normal reference allelic ratio | 0.5 |
| $r_{T,g}$ | Expected tumour reference allelic ratio for genotype $g$ | User-defined |
| $c_N$ | Expected normal integer copy number | 2 |
| $c_{T,g}$ | Expected tumour integer copy number for genotype $g$ | User-defined |
| $s_z$ | Cellular prevalence parameter for clonal cluster $z$ | Estimated by EM in M-step |
| $n$ | Global normal proportion parameter | Estimated by EM in M-step |
| $(\sigma^2)_g$ | Variance parameter of Gaussian for genotype $g$ | Estimated by EM in M-step |
| $\phi$ | Tumour ploidy parameter | Estimated by EM in M-step |
| $\alpha_z$ | Hyperparameter of Beta prior (shape) on $s_z$ | Uniform setting |
| $\beta_z$ | Hyperparameter of Beta prior (scale) on $s_z$ | Uniform setting |
| $\alpha_g$ | Hyperparameter of Inverse Gamma prior (shape) on $(\sigma^2)_g$ | User-defined |
| $\beta_g$ | Hyperparameter of Inverse Gamma prior (scale) on $(\sigma^2)_g$ | User-defined |
| $\alpha_\phi$ | Hyperparameter of Inverse Gamma prior (shape) on $\phi$ | User-defined |
| $\beta_\phi$ | Hyperparameter of Inverse Gamma prior (scale) on $\phi$ | User-defined |
| $T_t$ | $\lvert Z\rvert \times \lvert Z\rvert$ clonal cluster transition matrix at position $t$ | Fixed using $\rho_Z$ |
| $A_t$ | $\lvert K\rvert \times \lvert K\rvert$ genotype transition matrix at position $t$ | Fixed using $\rho_G$ |

Table 4.2: Description of random variables and fixed quantities in the TITAN framework depicted in Figure 4.4 and described in Section 4.2. $a_{1:T}$, $N_{1:T}$ and $l_{1:T}$ are observed input quantities. All hyperparameters are user-defined. The position-specific HMM transition probabilities for genotypes $A_t$ and clonal clusters $T_t$ are fixed quantities. $s_z$, $n$, $(\sigma^2)_{1:21}$, $\pi_G$, $\pi_Z$ and $\phi$ are unknown variables estimated during expectation maximization (EM).

Joint emission model

Copy number data from the tumour genome is represented by the log ratio between the tumour and normal read depths $l_{1:T}$. We assume $l_{1:T}$ is Gaussian distributed, $l_{1:T} \sim \mathcal{N}(l_{1:T}|\mu_{g,z}, \sigma^2_g)$, with mean

$$\mu_{g,z} = \log\left(\frac{n c_N + (1-n) s_z c_N + (1-n)(1-s_z) c_{T,g}}{n c_N + (1-n)\phi}\right) \tag{4.1}$$

where $\phi$ is the genome-wide average tumour ploidy, $c_N$ is the normal copy number, and $c_{T,g}$ is the copy number of tumour state $g \in G$ (Table 4.1 and 4.2). Thus, $\mu_{g,z}$ is the parameter representing copy number resulting from the three cell populations, accounting for the overall ploidy of the genome (Figure 4.2). Prior normalization steps lead to diploid baselines; therefore, this formulation allows the model to account for and estimate tumour ploidy $\phi$ during inference (Van Loo et al., 2010).

We assume the reference allelic read counts from the tumour $a_{1:T}$ are Binomial distributed, $a_{1:T} \sim \mathrm{Bin}(a_{1:T}|N_{1:T}, \omega_{g,z})$, with parameter

$$\omega_{g,z} = \frac{n c_N r_N + (1-n) s_z r_N c_N + (1-n)(1-s_z) r_{T,g} c_{T,g}}{n c_N + (1-n) s_z c_N + (1-n)(1-s_z) c_{T,g}} \tag{4.2}$$

where $N_{1:T}$ is the sequencing depth at each position, and $r_N$ and $r_{T,g}$ are the expected reference allelic ratios for normal and tumour state $g \in G$ (Table 4.1 and 4.2). The reference allelic ratio is defined as the proportion of reads matching the reference genome.
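As a minimal sketch of how the emission parameters in Equations 4.1 and 4.2 respond to normal contamination and cellular prevalence, the two functions below evaluate $\mu_{g,z}$ and $\omega_{g,z}$ directly. The function names and argument order are hypothetical and are not taken from the TitanCNA package; the defaults $c_N = 2$ and $r_N = 0.5$ follow Table 4.2.

```python
import math


def mu(n, s_z, phi, c_t, c_n=2):
    """Mean of the Gaussian log ratio emission (Equation 4.1)."""
    numerator = n * c_n + (1 - n) * s_z * c_n + (1 - n) * (1 - s_z) * c_t
    denominator = n * c_n + (1 - n) * phi
    return math.log(numerator / denominator)


def omega(n, s_z, c_t, r_t, c_n=2, r_n=0.5):
    """Binomial reference allele proportion (Equation 4.2)."""
    numerator = (n * c_n * r_n
                 + (1 - n) * s_z * r_n * c_n
                 + (1 - n) * (1 - s_z) * r_t * c_t)
    denominator = n * c_n + (1 - n) * s_z * c_n + (1 - n) * (1 - s_z) * c_t
    return numerator / denominator


# Diploid heterozygous state (AB: c_t = 2, r_t = 0.5) in a diploid tumour:
# the expected log ratio is 0 and the expected allelic ratio is 0.5,
# whatever the values of n and s_z.
mu_het = mu(n=0.3, s_z=0.4, phi=2.0, c_t=2)
omega_het = omega(n=0.3, s_z=0.4, c_t=2, r_t=0.5)

# Deletion LOH of the B allele (genotype A: c_t = 1, r_t = 1) present in
# half of the tumour cells, with 30% normal contamination.
mu_dloh = mu(n=0.3, s_z=0.5, phi=2.0, c_t=1)
omega_dloh = omega(n=0.3, s_z=0.5, c_t=1, r_t=1.0)
```

For the subclonal deletion, both values move only part of the way toward the pure-tumour signal ($\log(1/2)$ and 1), which is the attenuation the model relies on to estimate cellular prevalence.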
Thus, $\omega_{g,z}$ is the proportion of reference alleles from all population types out of the total number of alleles (or copies) for a specific locus.

The parameters $\mu_{g,z}$ and $\omega_{g,z}$ are functions of $s_z$ (Figure 4.5), and therefore represent the signals from the three cell populations introduced in Section 4.2.1. This formulation enables TITAN to model events at subclonal cellular prevalence.

[Figure 4.5 plot: copy number log mean $\mu_{g,z}$ and allelic ratio parameter $\omega_{g,z}$ for the CNA/LOH genotypes HOMD, DLOH, NLOH, HET, ALOH, GAIN, ASCNA, BCNA and UBCNA at two cellular prevalence settings, $s_1 = 0.97$ and $s_2 = 0.51$, with $n = 0.16$ and $\phi = 1.66$.]

Figure 4.5: Behaviour of model parameters $\omega_{g,z}$ and $\mu_{g,z}$ when cellular prevalence varies. $s_1$ and $s_2$ are shown as the tumour proportion containing an event (i.e. transformed using $1-s_z$). $n$ is the normal proportion and $\phi$ is the average tumour ploidy. Each CNA/LOH genotype is shown (Table 4.1) with the associated integer copy number in parentheses.

A joint emission is used to model $l_{1:T}$, $a_{1:T}$, and $N_{1:T}$ in a multivariate approach and is defined as

$$p(a_t, N_t, l_t | Z_t = z_t, G_t = g_t, \theta) = \begin{cases} \mathrm{Binomial}(a_t|N_t, \omega_{g,z}) \times \mathcal{N}(l_t|\mu_{g,z}, \sigma^2_g) & g_t > 0 \\ \mathcal{U}(0, N_t) \times \mathcal{N}(l_t|0, \Sigma) & g_t = 0, z_t = 0 \end{cases} \tag{4.3}$$

where an optional outlier state with large variance $\Sigma$ is used for $(g = 0, z = 0)$ (Yau et al., 2010).

The parameters of the emission densities, $\omega_{g,z}$ and $\mu_{g,z}$, were defined above and illustrated as being influenced by cellular prevalence (Figure 4.2). These parameters are functions of the unknown parameters for global normal proportion $n$ and tumour ploidy $\phi$, clonal cluster-specific cellular prevalence $s_z$, and state-specific Gaussian variance $\sigma^2_g$. The prior distributions for these unknown parameters are the following:

$$p(s_z|\alpha_z,\beta_z) = \mathrm{Beta}(s_z|\alpha_z,\beta_z) \tag{4.4}$$
$$p(n|\alpha_n,\beta_n) = \mathrm{Beta}(n|\alpha_n,\beta_n) \tag{4.5}$$
$$p(\phi|\alpha_\phi,\beta_\phi) = \mathrm{InverseGamma}(\phi|\alpha_\phi,\beta_\phi) \tag{4.6}$$
$$p(\sigma^2_g|\alpha_g,\beta_g) = \mathrm{InverseGamma}(\sigma^2_g|\alpha_g,\beta_g) \tag{4.7}$$

Genotype and clonal cluster transition model

TITAN employs a non-stationary (heterogeneous) transition model in the HMM, which involves transitioning between both the CNA/LOH genotype and clonal cluster state spaces. Two transition probability matrices, $A_t \in \mathbb{R}^{21 \times 21}$ for the genotypes and $T_t \in \mathbb{R}^{|Z| \times |Z|}$ for the clonal clusters, are used to define the joint transition matrix $J_t \in \mathbb{R}^{|K| \times |K|}$, where $K$ is the set of $21 \times |Z|$ genotype-clonal cluster state tuples $(g,z)$, $\forall g \in G$ and $\forall z \in Z$.

$A_t$ is the genotype transition probability matrix at position $t$. Let $A_t(i,j)$ be the probability of transitioning from genotype state $i \in G$ at position $t-1$ to $j \in G$ at position $t$. The probability $\rho_G$, which accounts for the distance $d$ between $t$ and $t-1$, is defined as $\rho_G = 1 - \frac{1}{2}\left(1 - e^{-d/(2 L_G)}\right)$, where $L_G$ is a user-defined, expected length of CNA/LOH events (Colella et al., 2007). $\rho_G$ is used if the transition is between the same state ($i = j$), or between states that share both the same allelic zygosity status ($ZS(i) = ZS(j)$) and the same copy number ($c_{T,i} = c_{T,j}$),

$$A_t(i,j) = \begin{cases} \rho_G & i = j \text{ or } \left(ZS(i) = ZS(j) \text{ and } c_{T,i} = c_{T,j}\right) \\ \dfrac{1-\rho_G}{|K-1|} & \text{otherwise} \end{cases} \tag{4.8}$$

Each row of $A_t$ is then normalized such that $\sum_j A_t(i,j) = 1$, $\forall i$.

$T_t$ is the clonal cluster transition probability matrix at position $t$. Let $T_t(m,n)$ be the transition probability from clonal cluster $m \in Z$ at position $t-1$ to cluster $n \in Z$ at position $t$. Higher probabilities are used when transitioning to the same clonal cluster ($m = n$). This is represented using $\rho_Z = 1 - \frac{1}{2}\left(1 - e^{-d/(2 L_Z)}\right)$, where $L_Z$ is the user-defined, expected length of clonal cluster segments.

$$T_t(m,n) = \begin{cases} \rho_Z & m = n \\ 1-\rho_Z & \text{otherwise} \end{cases} \tag{4.9}$$

Learning and inference

The expectation maximization (EM) algorithm is used to estimate the model parameters

$$\theta = \{n, s_{1:|Z|}, \phi, (\sigma^2)_{1:21}, \pi_G, \pi_Z\} \tag{4.10}$$

given all the data $\mathcal{D} = \{l_{1:T}, a_{1:T}\}$. In the expectation step, the forwards-backwards algorithm is used to compute the joint-posterior marginal probabilities,

\begin{align}
p(G_t = g, Z_t = z|\mathcal{D},\theta) &= \gamma(G_t = g, Z_t = z) \tag{4.11} \\
&= \frac{p(\mathcal{D}|G_t = g, Z_t = z, \theta)\, p(G_t, Z_t|\theta)}{p(\mathcal{D}|\theta)} \notag \\
p(Z_t = z|\mathcal{D}, G_t, \theta) &= \gamma(Z_t = z) \tag{4.12} \\
&= \frac{p(Z_t = z|\theta) \sum_g \left[p(\mathcal{D}|Z_t = z, G_t = g, \theta)\, p(G_t = g|\theta)\right]}{p(\mathcal{D}|\theta)} \notag \\
p(G_t = g|\mathcal{D}, Z_t, \theta) &= \gamma(G_t = g) \tag{4.13} \\
&= \frac{p(G_t = g|\theta) \sum_z \left[p(\mathcal{D}|Z_t = z, G_t = g, \theta)\, p(Z_t = z|\theta)\right]}{p(\mathcal{D}|\theta)} \notag
\end{align}

The resulting expectation of the complete log-likelihood at EM iteration $n$ is

\begin{align}
Q^{(n)} &= \mathbb{E}_{G|\mathcal{D},\theta^{(n-1)}}\left[\log p(Z, \mathcal{D}|\theta)\right] \tag{4.14} \\
&= \sum_{g=1}^{G} p(G_0 = g|\mathcal{D},\theta^{(n-1)}) \log \mathrm{Multinomial}(G_0|\pi_G) \tag{4.15} \\
&\quad + \sum_{z=1}^{Z} p(Z_0 = z|\mathcal{D},\theta^{(n-1)}) \log \mathrm{Multinomial}(Z_0|\pi_Z) \notag \\
&\quad + \sum_{t=1}^{T} \left\{ \sum_{i=1}^{G} \sum_{j=1}^{G} p(G_t = j, G_{t-1} = i|\mathcal{D},\theta^{(n-1)}) \log A_t(i,j) \right\} \notag \\
&\quad + \sum_{t=1}^{T} \left\{ \sum_{m=1}^{Z} \sum_{n=1}^{Z} p(Z_t = n, Z_{t-1} = m|\mathcal{D},\theta^{(n-1)}) \log T_t(m,n) \right\} \notag \\
&\quad + \sum_{z=1}^{Z} \sum_{t=1}^{T} \left\{ \sum_{g=1}^{G} p(G_t = g, Z_t = z|\mathcal{D},\theta^{(n-1)}) \left\{ \log \mathrm{Binomial}(a_t|\omega_{g,z}, N_t) + \log \mathcal{N}(l_t|\mu_{g,z}, \sigma^2_g) \right\} \right\} \notag \\
&\quad + \sum_{z=1}^{Z} \log \mathrm{Beta}(s_z|\alpha_z,\beta_z) + \sum_{g=1}^{G} \log \mathrm{InvGamma}(\sigma^2_g|\alpha_g,\beta_g) \notag \\
&\quad + \log \mathrm{Beta}(n|\alpha_n,\beta_n) + \log \mathrm{InvGamma}(\phi|\alpha_\phi,\beta_\phi) \notag \\
&\quad + \log \mathrm{Dirichlet}(\pi_G|\delta_G) + \log \mathrm{Dirichlet}(\pi_Z|\delta_Z) \notag
\end{align}

In the maximization step, the set of unknown parameters $\theta$ is estimated using the maximum a posteriori (MAP) estimate of the complete log-likelihood $Q$.
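The position-specific transition terms that enter $Q$ through $\log A_t(i,j)$ and $\log T_t(m,n)$ are fixed by the distance between adjacent SNPs. The sketch below shows one way to compute the distance-dependent probability and the clonal cluster matrix of Equation 4.9. The function names, the example distance and the expected segment length are hypothetical, and the row normalization of $T_t$ is an assumption made here by analogy with $A_t$, since it is only stated explicitly for the genotype matrix.

```python
import math


def stay_probability(d, expected_length):
    """rho = 1 - 0.5 * (1 - exp(-d / (2 * L))) for inter-SNP distance d."""
    return 1.0 - 0.5 * (1.0 - math.exp(-d / (2.0 * expected_length)))


def cluster_transition_matrix(d, expected_length, num_clusters):
    """Build T_t following Equation 4.9 for one pair of adjacent positions."""
    rho_z = stay_probability(d, expected_length)
    rows = []
    for m in range(num_clusters):
        row = [rho_z if m == n else 1.0 - rho_z for n in range(num_clusters)]
        total = sum(row)
        # Assumption: rows are rescaled to sum to one, as done for A_t.
        rows.append([value / total for value in row])
    return rows


# Hypothetical values: SNPs 1 kb apart, expected segment length of 100 kb.
T_t = cluster_transition_matrix(d=1000, expected_length=1e5, num_clusters=3)
```

Adjacent SNPs ($d$ near 0) give a stay probability close to 1, while very distant SNPs give a value approaching 0.5, so long gaps weaken the coupling between neighbouring positions.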
For parameters $n$, $s_{1:|Z|}$, $\phi$ and $(\sigma^2)_{1:21}$, coordinate (directional) descent is used, up to a maximum number of iterations (1500) or until the convergence criterion is satisfied. The MAP estimate equations are detailed in Appendix C.1 for $\pi_G$ (Equation C.2), $\pi_Z$ (Equation C.4), $n$ (Equation C.18), $s_{1:|Z|}$ (Equation C.8), $\phi$ (Equation C.14), and $(\sigma^2)_{1:21}$ (Equation C.11).

The EM convergence criterion is satisfied when $F^{(n)} - F^{(n-1)} <$ threshold, where $F$ is the sum of the log-likelihood and the log priors,

\begin{align}
F^{(n)} &= \log\left(p(\mathcal{D}|\theta^{(n)})\right) \tag{4.16} \\
&\quad + \sum_{z=1}^{Z} \log \mathrm{Beta}(s_z|\alpha_z,\beta_z) + \sum_{g=1}^{G} \log \mathrm{InvGamma}(\sigma^2_g|\alpha_g,\beta_g) \notag \\
&\quad + \log \mathrm{Beta}(n|\alpha_n,\beta_n) + \log \mathrm{InvGamma}(\phi|\alpha_\phi,\beta_\phi) \notag \\
&\quad + \log \mathrm{Dirichlet}(\pi_G|\delta_G) + \log \mathrm{Dirichlet}(\pi_Z|\delta_Z) \notag
\end{align}

Mathematical details of the forwards-backwards and Viterbi algorithms are similar to APOLLOH and were previously described in Section 3.2.

Choosing the optimal number of clonal clusters

Recall that for each $i = 1$ to 5, TITAN is run once for the setting $Z_i := \{1, ..., i\}$ with $|Z_i| = i$. To determine the run with the optimal number of initialized clusters $|Z_i|$, we used an internal validation scoring approach called the S_Dbw validity index (Halkidi et al., 2002).

S_Dbw penalizes over-fitting due to an increasing number of clusters by minimizing within-cluster variances (scat) and maximizing density-based cluster separation (Dens),

$$\text{S\_Dbw}(|Z_i|) = 25 \cdot \mathrm{Dens}(|c_T| \cdot |Z_i|) + \mathrm{scat}(|c_T| \cdot |Z_i|)$$

where Dens and scat are defined in Halkidi et al. (2002) and $|c_T|$ is the number of copy levels. This was applied to our runs by defining the copy number log ratio $l_{1:T}$ as the internal data; the resulting joint states $(c_T, z)$, for $c_T \in \{0 \ldots 5\}$ and $z \in \{1 \ldots |Z_i|\}$, are the clusters in the internal validation. An S_Dbw index value is computed for each run of TITAN using a fixed number of clonal clusters $|Z_i|$, and the optimalIndex $= \arg\min_i \{\text{S\_Dbw}(|Z_i|)\}$ is chosen based on $|c_T| \cdot |Z_i|$ S_Dbw internal evaluation clusters. For instance, when TITAN is run with $Z_2 = \{1,2\}$, the number of clusters in the S_Dbw internal evaluation is $|c_T| \cdot 2 = 12$.

Algorithm 4.1 Selecting the TITAN run with optimal number of clonal clusters
    sbwMin = 0, sbwMinVal = ∞
    for i = 1 to n do
        Run TITAN with i number of clonal cluster states (|Z_i| = i)
        Compute S_Dbw for TITAN results with clonal cluster set Z_i
        if S_Dbw(i) < sbwMinVal then
            sbwMin = i, sbwMinVal = S_Dbw(i)
        end if
    end for
    Select TITAN run with clonal cluster states, Z_sbwMin

Implementation

TITAN is implemented in an R package, called TitanCNA, which is available through Bioconductor. The functionality implemented in R consists of the component for GC and mappability bias correction, which uses a wrapper for HMMcopy (see Appendix A), and the HMM component that performs segmentation and inference of subclonal copy number. The forwards-backwards and Viterbi algorithms in the HMM are implemented in C, and are interfaced as dynamic function objects within R. The time and memory complexity is $O(K^2 T)$ and $O(KT)$, respectively, where $K$ is the number of joint states $(g,z)$ and $T$ is the number of positions. Because TITAN models a range of clonal clusters using a joint state space of the clusters and genotypes, $K$ scales based on the specified number of clusters. Instructions on using the software can be accessed at http://compbio.bccrc.ca/software/titan/.

4.3 Results

We hypothesized that TITAN had two major advantages over existing methods: 1) increased sensitivity for detecting subclonal (low prevalence) CNA/LOH events, and 2) accurate estimation of cellular prevalence. In order to evaluate and experimentally validate these features, we used the genomes from a set of five synchronously resected pre-treatment high grade serous (HGS) ovarian cancer specimens (DG1136a,c,e,g,i) from the same patient (Figure 4.11a, Appendix C). We obtained Illumina HiSeq 2500 WGS 100bp paired-end data, sequenced at ∼30X, for each of the five tumour samples and the patient’s matched normal DNA.
There were ∼2.3 million high confidence heterozygous SNPs in the normal genome of DG1136. Across the five tumour samples, there were 2,816 CNA/LOH events. Using these data to generate benchmarking datasets, we performed a series of evaluations:

1. Simulated synthetic embedding of sampled clonal and subclonal CNA/LOH events in DG1136a (Section 4.3.1).

2. Systematic admixing of related but distinct CNA/LOH profiles from the tumour samples of DG1136 in known quantities to simulate mixed populations (Section 4.3.2 and 4.3.3).

3. Experimental validation of TITAN predictions using FISH (Section 4.3.4) and single-cell sequencing (Section 4.3.5).

4.3.1 Simulated CNA spike-in experiment demonstrates accurate detection for varying event sizes

Generating clonal spike-in data from DG1136a

We began by profiling the CNA landscape of DG1136a using two complementary methods: HMMcopy (Ha et al., 2012) and Control-FreeC (Boeva et al., 2012b). Both methods identified a large deletion (chr16:46464744-90173515) and an amplification (chr8:97045605-144155272) of interest (Figure 4.6). We randomly sampled log ratios and allele counts for four non-consecutive sets of 10, 100 and 1000 positions in each of these two regions. We also included one deletion and one amplification event spanning 10,000 SNPs. There was a total of 26 sampled loci sets.

We initialized the spike-in simulation sample to have the same log ratios and allelic ratios for all SNPs from DG1136a. The 26 sampled loci were then inserted into diploid heterozygous chromosomes (chr1, 2, 9 and 18) at consecutive SNP positions of this sample to simulate segmental CNA events. Median genomic sizes of these events, after insertion, were 6.9 kb, 82 kb, 1.2 Mb and 12.5 Mb (Appendix Table C.1).

Figure 4.6: Spike-in simulation experiment sample setup. HMMcopy (a) and APOLLOH (b) predictions of DG1136a used for the spike-in simulation experiment. The log ratio and allelic ratio data for chromosomes 8 (chr8:97045605-144155272) and 16 (chr16:46464744-90173515) were randomly sampled and inserted into whole diploid heterozygous chromosomes 1, 2, 9 and 18 as spike-in events of length 10, 100, 1000, and 10000 SNPs.

Generating subclonal spike-in events

To vary cellular prevalences, we generated spike-in events sampled from two simulated tumour-normal admixtures at 80% and 60% of the original DG1136a dataset, computationally admixed with its matched normal WGS data (Figure 4.7-4.10). These simulations were generated using the same approach described in Section 3.3.3. Briefly, this was the procedure for the 80% admixture. Let $N$ be the total number of reads in the sequencing data of the normal. Then, 80% of $N$ reads were randomly sampled from the tumour and 20% of $N$ reads were randomly sampled from the normal. These sampled reads were merged into a single data file for the admixture. Because the original tumour content of DG1136a was 0.65, the expected tumour contents are 0.52 and 0.39 for the 80% and 60% admixtures, respectively.

Figure 4.7: TITAN CNA (top) and cellular prevalence (middle) results for chromosome 1 of the spike-in simulation experiment using DG1136a. Spike-in events of length 10, 100, 1000, and 10000 SNPs were inserted. The vertical lines correspond to the known inserted (spiked-in) data; the number labels correspond to the list of events of the same ordering in Appendix Table C.1. The truth and TITAN-predicted cellular prevalence results for the spike-in events at chromosomes 1, 2, 9, and 18 are shown. TITAN cellular prevalence parameters were estimated on the entire genome including all original DG1136a events plus the spike-in events at the designated chromosomes.
For log ratio plots, hemizygous deletion (HEMD), copy neutral (NEUT), and copy amplification (AMP) results are shown. The cellular prevalence value indicates the proportion of tumour cells in the whole sample. The plot follows the same colour legend as the allelic ratio plot. Clonal clusters are shown as horizontal lines labeled with a ‘Z’; tumour content is denoted with the black horizontal line. Deletion LOH (DLOH), copy neutral LOH (NLOH), diploid heterozygous (HET), and allele-specific amplification (ASCNA) are shown with green, blue, dark red, and red, respectively.

Spike-in event loci sets of size 10, 100, 1000, and 10,000 SNPs were generated from these admixtures for the same two CNA regions (see above). These data were inserted into the same diploid chromosomes for the same spike-in simulation sample. This resulted in one simulated sample in which chr1, 2, 9 and 18 contained a total of 26 clonal and 52 subclonal spike-in events.

Figure 4.8: TITAN CNA (top) and cellular prevalence (middle) results for chromosome 2 of the spike-in simulation experiment using DG1136a.

Figure 4.9: TITAN CNA (top) and cellular prevalence (middle) results for chromosome 9 of the spike-in simulation experiment using DG1136a.

Application of TITAN and performance assessment

TITAN was run, for a range of one to five clonal clusters, on the entire spike-in simulation sample (containing 78 events), including the chromosomes without spike-in events. The run with four clonal clusters was selected as optimal based on the S_Dbw validity index (Halkidi et al., 2002).

Performance was computed by comparing the TITAN-predicted copy number status of the SNPs within each spike-in event against the known event type. The true positive rate (TPR) for copy number was computed as the proportion of positions with copy number status less than 2 for matching deletions or greater than 2 for matching gains.
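The per-event true positive rate defined above can be sketched as follows. The function name, the event-type labels and the example predictions are hypothetical; the only rule taken from the text is that a position matches a deletion when its predicted copy number is below 2 and matches a gain when it is above 2.

```python
def copy_number_tpr(predicted_copy_numbers, event_type):
    """Proportion of positions in one spike-in event whose predicted copy
    number agrees with the direction of the known event."""
    if not predicted_copy_numbers:
        raise ValueError("an event must span at least one position")
    if event_type == "deletion":
        matches = [c < 2 for c in predicted_copy_numbers]
    elif event_type == "gain":
        matches = [c > 2 for c in predicted_copy_numbers]
    else:
        raise ValueError("event_type must be 'deletion' or 'gain'")
    return sum(matches) / len(matches)


# Hypothetical 10-SNP deletion in which one position was called copy neutral.
tpr_deletion = copy_number_tpr([1, 1, 1, 2, 1, 1, 1, 1, 1, 1], "deletion")

# Hypothetical 4-SNP gain in which every position was called amplified.
tpr_gain = copy_number_tpr([3, 3, 4, 3], "gain")
```

Because the rate is computed per event, a short event is affected much more strongly by a single miscalled position than a long one.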
All 54 events with a size of 100 SNPs or larger were detected (TPR ≥ 0.9); however, only 11 of 24 events with a size of 10 SNPs were recalled. The global false positive rate (FPR) was 0.04, which was computed by considering all SNP positions in chr1, 2, 9 and 18 where no spike-in data was inserted.

Sample cellular prevalence estimates for two of the TITAN clonal clusters were 0.52 and 0.36, which are within range of the expected values of 0.52 and 0.39. The TPR for cellular prevalence estimates of spike-in events was computed using a matching criterion of ±0.05 of the expected prevalence. For deletions and amplifications respectively, the cellular prevalence for 24 (89%) and 10 (37%) events with 100 SNPs or larger was correctly predicted (TPR ≥ 0.9, Figure 4.7-4.10). Although the prevalence estimates of many amplifications did not match the expected values, these events were still predicted to be subclonal, in most instances at a lower prevalence. Overall, the spike-in experiments demonstrated that TITAN is accurate at detecting (sub)clonal events of varying sizes, but they also illustrate a potential limitation in the detection of very small (10 SNP) events and in the estimation of the true prevalence for amplifications.

Figure 4.10: TITAN CNA (top) and cellular prevalence (middle) results for chromosome 18 of the spike-in simulation experiment using DG1136a.

4.3.2 Evaluation on simulated mixtures of tumour subpopulations confers improved sensitivity for low cellular prevalence events

Subpopulation simulation using intra-tumour samples of DG1136

To assess the performance of TITAN using benchmarking datasets that are more representative of clonal mixtures, we designed systematic experiments that simulated genomes with multiple tumour subpopulations at known proportions. We mixed, in-silico, related intra-tumour ∼30X coverage WGS samples obtained from regional biopsies DG1136a,c,e,g,i of a high-grade serous (HGS) ovarian carcinoma (Figure 4.11a, Table 4.3).
Because the samples were from the same patient, the identical set of heterozygous germline SNPs was used in each mixture. Within each mixture, we defined clonal (or clonally dominant) events as CNA/LOH events present in all individual samples, and subclonal events as those present in only a subset of the samples (Figure 4.4, Table 4.4, 4.7, 4.8). The proportion of tumour contribution from each individual sample (Table 4.3) in the mixture was used to compute the expected cellular prevalence. Figures 4.11b,c illustrate a mixture scenario, which identifies true (sub)clonal events and their expected cellular prevalence.

The ground truth data are CNA and LOH results predicted by HMMcopy and APOLLOH, respectively, from the five individual samples. The truth set consists of CNA/LOH status at all germline heterozygous (HET) SNP positions included in the APOLLOH results for each of the five samples. The rationale for using all germline HET positions in the evaluation is that it represents a genome-wide assessment, such that larger events are given more weight because they span more loci. Furthermore, every evaluation examines the same set of positions, providing a more comparable performance metric across methods and alleviating the complexity in varying precision of boundaries between approaches.

[Figure 4.11 panels: (a) biopsy sites a, c, e, g, i; (b) Sample a (normal 20%, tumour 80%) and Sample b (normal 30%, tumour 70%) mixed at 0.5 each into Merged a+b (normal 25%, a 40%, b 35%); (c) copy number (log ratio) by chromosome position for Sample a, Sample b and Merged a+b, and cellular prevalence for Merged a+b (a+b 75%, a 40%, normal 25%).]

Figure 4.11: Illustration of intra-tumour samples in patient DG1136 and an example of a mixing simulation. (a) Patient DG1136 had biopsies taken from four sites in the primary tumour of the right ovary and one site from the left pelvic sidewall metastasis. (b) An illustration demonstrating the expected proportions in a simulation of two tumour subpopulations. The tumour contents of Sample a (80%) and Sample b (70%) determine the expected contribution to the cellular prevalence in the merged Sample a+b. Events found in all samples of the mixture represent simulated clonal events. For example, the (green) deletion is present in 75% of the merged sample (or 100% of tumour cells) given that the normal proportion is 25%. Events present in a subset of samples in the mixture simulate subclonal events, such as the (red) gain unique to Sample a, which is present in 40% of the merged sample or 53% of the tumour cells.

Precision, recall, and F-measure performance was computed for CNA/LOH segments in the simulated mixtures that overlapped ground truth loci. The performance was computed for deletions, gains, and LOH status, independently, and averaged together when evaluating for overall assessment (see Appendix C.4 for details). For evaluation of cellular prevalence, the proportion of tumour contribution from each sample was used as the expected cellular prevalence for each simulated subpopulation in the mixture (Table 4.3, see Section 4.3.3). For comparison, we applied APOLLOH/HMMcopy (A), Control-FreeC (CF) (Boeva et al., 2012a), and BIC-seq (B) (Xi et al., 2011a) to identify CNA/LOH segments in the simulated mixtures. Also, we compared cellular prevalence estimates between TITAN and THetA (Oesper et al., 2013). See Appendix C.4 for software usage details.

| Sample | Site | Sequence coverage (× haploid) | Pathologist estimate | APOLLOH cellularity estimate | Consensus cellularity |
|---|---|---|---|---|---|
| DG1136a | Right Ovary Site 1 | 33.9 | 60% | 67% | 64% |
| DG1136c | Right Ovary Site 2 | 34.7 | 25% | 44% | 35% |
| DG1136e | Right Ovary Site 3 | 31.7 | 70% | 63% | 67% |
| DG1136g | Right Ovary Site 4 | 29.5 | 65% | 46% | 56% |
| DG1136i | Left Pelvic Sidewall Site 1 | 35.28 | 50% | 48% | 49% |

Table 4.3: Sequencing coverage and cellularity for spatially related ovarian intra-tumoural and anatomical sites of DG1136.
The consensus cellularity is computedas the average between the pathologist and APOLLOH tumour estimates.Serial mixture experimentTwo intra-tumoural genomes, DG1136e and DG1136g, were serially mixed in sil-ico at increments of 10% (0.1e/0.9g, 0.2e/0.8g, ..., 0.9e/0.1g). This was done bysampling reads from the individual tumour sequence files, generating nine (∼30X)mixtures. Because DG1136e and DG1136g contain normal contamination of 67%and 56%, respectively, based on APOLLOH analysis and pathological review,1574.3. Resultsthe known proportions of the simulated tumour subpopulations were adjusted to0.07/0.50, 0.13/0.45, 0.20/0.39, 0.27/0.33, 0.33/0.28, 0.40/0.22, 0.47/0.17, 0.53/0.11,0.60/0.06 (Table 4.4, 4.5). These proportions constitute the relative tumour contri-bution from each of the two samples in the overall mixture and were used as theexpected sample cellular prevalence in the evaluation (Table 4.4).To formalize this, let the mixture proportion be pe for DG1136e and pg forDG1136g, and tumour content be te for DG1136e and tg for DG1136g. Then, theexpected sample cellular prevalence for events contributed uniquely from DG1136ein the simulated mixture is ssamplee = pe ∗ te; the sample cellular prevalence forevents unique to DG1136g is ssampleg = pg ∗ tg. The expected tumour cellular preva-lence are then computed as stumoure = ssamplee /(ssamplee + ssampleg)and stumourg = ssampleg /(ssamplee + ssampleg)for events in the mixture that are contributed uniquely from DG1136e and DG1136g,respectively. These values make up the ground truth expected cellular prevalencein Section 4.3.3.The performance of TITAN for predicting the presence and absence of (sub)clonalevents shows improvement over existing methods. TITAN’s median overall F-measure across the nine mixtures for predicting both clonally dominant and sub-clonal was 0.90; this was similar to APOLLOH (0.91) and Control-FREEC (0.88),but higher than BIC-seq (0.73) (Figure 4.13a, Table 4.6). 
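The expected-prevalence calculation formalized above for the serial mixtures can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `expected_prevalence` and its argument names are hypothetical and are not taken from the TITAN software. The inputs are the mixture proportions and tumour contents as defined in the text.

```python
def expected_prevalence(proportions, tumour_contents):
    """Return (sample-level, tumour-level) expected cellular prevalence.

    proportions     -- mixing fraction of each sample (p_e, p_g, ...)
    tumour_contents -- tumour fraction of each sample (t_e, t_g, ...)
    """
    # s^sample = p * t: fraction of all cells in the mixture that are
    # tumour cells contributed by that sample
    s_sample = [p * t for p, t in zip(proportions, tumour_contents)]
    # s^tumour = s^sample / sum(s^sample): the same quantity expressed
    # as a fraction of tumour cells only
    total = sum(s_sample)
    s_tumour = [s / total for s in s_sample]
    return s_sample, s_tumour


# Serial mixture 0.1e/0.9g with tumour content 67% (e) and 56% (g)
s_sample, s_tumour = expected_prevalence([0.1, 0.9], [0.67, 0.56])
print([round(s, 2) for s in s_sample])  # [0.07, 0.5]
```

With equal proportions and tumour contents of 80% and 70%, the same function gives 0.40 and 0.35 at the sample level and 0.53 for the first sample at the tumour level, which agrees with the scenario illustrated in Figure 4.11b.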
While the precision was comparable for all approaches (Figure 4.12a), TITAN had higher sensitivity (median 0.91 compared to 0.85 (A), 0.83 (CF), and 0.58 (B); Figure 4.12b).

TITAN's sensitivity gains could be primarily attributed to improved sensitivity to subclonal events (Figure 4.12c,d). Accordingly, we observed improved performance for runs with two or more clusters compared to runs with one cluster (Figure 4.13b,c). Over the range of two to five clusters, recall was similar for subclonal events, suggesting TITAN is relatively stable in its predictions when accounting for more than one tumour subpopulation. Despite this stability, we explored the utility of unbiased model selection to choose the optimal number of clusters. We note that three clusters fit the scenario where events may be clonally dominant (present in both samples) or subclonal, having one of two possible unique cellular prevalences contributed by the individual samples. Using the S_Dbw validity index (Halkidi et al., 2002) (see Section 4.2), three clusters were selected as the optimal number for the majority of the mixtures (Table 4.5). The subclonal predictions using the optimal cluster runs consistently outperformed the other methods (Figure 4.12c,d).

In addition, we evaluated the performance across ranges of event lengths of 10kb-100kb, 100kb-1Mb, 1Mb-10Mb, and >10Mb and observed similar performance gains for recall of subclonal events (Figure 4.14).

Figure 4.12: Performance of TITAN in serial and merging simulations using real intra-tumoural samples from a HGS ovarian tumour. Performance of the serial mixture experiment between TITAN (optimal clonal clusters), APOLLOH/HMMcopy (Ha et al., 2012), Control-FreeC (Boeva et al., 2012b), and BIC-seq (Xi et al., 2011a). Precision (a) and recall (b) are shown for subclonal and clonal events averaged across gains, deletions, and LOH. Mixture Proportion refers to the combinations 10% e and 90% g, 20% e and 80% g, etc. Recall performance for events found uniquely in Sample e (c) and Sample g (d) are shown. The Expected Cellular Prevalence shown was computed by adjusting the mixture proportion for tumour content of 67% and 56% for Sample e and g, respectively (see "% tumour of e" and "% tumour of g" in Table 4.4).

DG1136e | DG1136g | % tumour of e | % tumour of g | % normal of e | % normal of g
0.1 | 0.9 | 0.07 | 0.50 | 0.03 | 0.40
0.2 | 0.8 | 0.13 | 0.45 | 0.07 | 0.35
0.3 | 0.7 | 0.20 | 0.39 | 0.10 | 0.31
0.4 | 0.6 | 0.27 | 0.33 | 0.13 | 0.27
0.5 | 0.5 | 0.33 | 0.28 | 0.17 | 0.22
0.6 | 0.4 | 0.40 | 0.22 | 0.20 | 0.18
0.7 | 0.3 | 0.47 | 0.17 | 0.23 | 0.13
0.8 | 0.2 | 0.53 | 0.11 | 0.27 | 0.09
0.9 | 0.1 | 0.60 | 0.06 | 0.30 | 0.04

Table 4.4: Simulation experiments using serial spatially related ovarian intra-tumoural samples. Serial mixture experiment showing the tumour and normal cell contributions from DG1136e and DG1136g to each mixture. Proportion of each sample in the mixture was pre-defined at 10%-90%, 20%-80%, etc. The contribution of each sample to the mixture was computed by adjusting the pre-defined mixture proportion by tumour content of 67% and 56% for Sample e and g, respectively. TITAN results for number of clusters and normal and cellular prevalence estimates are also presented.

DG1136e | DG1136g | # clonal clusters | Normal estimate | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4
0.1 | 0.9 | 4 | 0.40 | 0.95 | 0.74 | 0.52 | 0.19
0.2 | 0.8 | 4 | 0.38 | 0.95 | 0.71 | 0.50 | 0.18
0.3 | 0.7 | 3 | 0.39 | 0.97 | 0.57 | 0.24 |
0.4 | 0.6 | 3 | 0.38 | 0.97 | 0.60 | 0.26 |
0.5 | 0.5 | 3 | 0.38 | 0.99 | 0.62 | 0.26 |
0.6 | 0.4 | 3 | 0.36 | 0.98 | 0.63 | 0.25 |
0.7 | 0.3 | 2 | 0.35 | 0.96 | 0.36 | |
0.8 | 0.2 | 3 | 0.34 | 0.98 | 0.61 | 0.21 |
0.9 | 0.1 | 3 | 0.32 | 0.97 | 0.62 | 0.20 |

Table 4.5: TITAN results for simulation experiments using serial mixtures of spatially related ovarian intra-tumoural samples. Serial mixture experiment showing the tumour and normal cell contributions from DG1136e and DG1136g to each mixture. Proportion of each sample in the mixture was pre-defined at 10%-90%, 20%-80%, etc. The contribution of each sample to the mixture was computed by adjusting the pre-defined mixture proportion by tumour content of 67% and 56% for Sample e and g, respectively. TITAN results for number of clusters and normal and cellular prevalence estimates are also presented.

Method | All Events F-Measure | All Events Precision | All Events Recall | DG1136e Events Recall | DG1136g Events Recall
TITAN | 0.90 | 0.90 | 0.91 | 0.75 | 0.68
APOLLOH | 0.91 | 0.99 | 0.85 | 0.50 | 0.14
Control-FreeC | 0.88 | 0.96 | 0.83 | 0.57 | 0.16
BIC-seq | 0.73 | 0.98 | 0.58 | 0.05 | 0.02

Table 4.6: Summary performance for serial mixture simulation experiment using related ovarian intra-tumoural samples. Performance values shown are averaged between CNA loss, CNA gain, and LOH for TITAN, APOLLOH, and Control-FreeC; BIC-seq results did not contain LOH. F-measure, precision, and recall values are shown as the median across the 9 serial mixtures.
Figure 4.13: Performance of TITAN for serial simulation of intra-tumour HGS ovarian tumour samples. a) F-measure performance across the mixture proportions comparing TITAN with Control-FreeC (Boeva et al., 2012b) and APOLLOH (Ha et al., 2012) (including HMMcopy). Events for deletions, gains and LOH are averaged. b-c) TITAN runs initialized with number of clusters ranging from 0 to 5 are shown. Recall performance for events found uniquely in Sample e (b) and in Sample g (c) represents events that are subclonal within the simulated mixture. Average recall across deletions, gains, and LOH is shown. Cluster 0 represents the TITAN run that does not consider multiple tumour subpopulations. Cluster 0 and cluster 1 results are nearly identical because 1 cluster converges to the clonally dominant cluster and thus only one tumour population and the normal cells exist. TITAN runs with two or more clusters outperform the other approaches. Ground truth events were identified in the individual samples of the mixture using APOLLOH/HMMcopy and expected prevalence values are shown in Table S4.3.

Figure 4.14: Performance of TITAN for serial simulation of intra-tumour samples from an ovarian tumour evaluated at different event size ranges. Sample DG1136e and DG1136g were mixed at known proportions (Table 4.3). Events were grouped into ranges of lengths 10kb-100kb, 100kb-1Mb, 1Mb-10Mb, and greater than 10Mb as predicted in the ground truth on the samples, individually. a) F-measure performance across the mixture proportions comparing TITAN with Control-FreeC (Boeva et al., 2012b), APOLLOH (Ha et al., 2012) (including HMMcopy), and BIC-seq (Xi et al., 2011a). Events for deletions, gains and LOH are averaged. b) Recall performance for TITAN subclonal prediction results shown for the expected cellular prevalence computed from the original tumour contribution of each sample in the mixture (Table 4.3). For each size range, performance is shown for subclonal events contributed only from DG1136e and events contributed only from DG1136g. Cellular prevalence is defined as the proportion of tumour cells harbouring the events.

Merging mixture experiment

Next, we used the five DG1136 samples from the same patient to generate 10 pairwise (61-69X coverage) and 10 triplet (95-103X coverage) merged combinations
of the five intra-tumour samples, each mixed at approximately equal proportions (Tables 4.7, 4.8). This was done using the SAMtools (Li et al., 2009) merge command, which simply combines all the reads from each individual sample into a single sequence file. The expected cellular prevalence for each mixture was computed based on tumour contributions from individual samples making up the mixture, while also adjusting for sequencing coverage.

The mixture proportions for sample a with coverage $c_a$ and tumour content $t_a$ and sample b with coverage $c_b$ and tumour content $t_b$ are $p_a = c_a / (c_a + c_b)$ and $p_b = c_b / (c_a + c_b)$. The expected sample cellular prevalence for the merged mixture is $s^{sample}_a = p_a \times t_a$ and $s^{sample}_b = p_b \times t_b$ for events observed uniquely in a and b, respectively. The expected tumour cellular prevalence is $s^{tumour}_a = s^{sample}_a / (s^{sample}_a + s^{sample}_b)$ and $s^{tumour}_b = s^{sample}_b / (s^{sample}_a + s^{sample}_b)$ for events in the mixture that are contributed uniquely from a and b, respectively. This can be extended for the case with three samples (triplet merge). These values make up the ground truth expected cellular prevalence in Section 4.3.3.

In the triplet merged samples (Figure 4.15), TITAN (0.85 median F-measure) performed comparably to APOLLOH (0.85) and Control-FreeC (0.85), and better than BIC-seq (0.60), for all (clonal and subclonal) amplifications. TITAN showed statistically significant improvement over the other algorithms for deletions (0.91 compared to 0.87 (A), 0.83 (CF), and 0.60 (B); two-sample Wilcoxon rank-sum test p < 0.001) and LOH events (0.96 compared to 0.94 (A) and 0.85 (CF), p < 0.001). Similar performance was observed in the pairwise merged samples, with comparable F-measure for all amplifications (median 0.87 compared to 0.87 (A), 0.85 (CF), and 0.70 (B)) and statistically significant improvement in F-measure for deletions (0.95 compared to 0.92 (A), 0.87 (CF), and 0.69 (B); p < 0.005) and LOH events (0.98 compared to 0.96 (A) and 0.89 (CF)) (Figure 4.16).

For subclonal events, TITAN was more sensitive than the other methods for both the pairwise and triplet merging simulations. Events unique to only one sample (subclonal in the mixture) in both simulations were predicted with significantly higher sensitivity (two-sample Wilcoxon rank-sum tests, p < 0.001) by TITAN (see "Subclonal 1" in Figure 4.15 and Figure 4.16). TITAN was also significantly more sensitive to deletion and LOH events that were present in two individual samples in the triplet merged mixtures (p < 0.001), and had comparable results for amplifications (see "Subclonal 2" in Figure 4.15). All methods accurately predicted clonally dominant events for each merged simulation. Therefore, while maintaining accuracy of clonal events, TITAN showed clear advantages in detection of the engineered subclonal events.

Merging of paired samples (first five columns) and TITAN estimated parameters (remaining columns):
Sample 1 | Sample 2 | % tumour of 1 | % tumour of 2 | Merged coverage | Normal estimate | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4
a | c | 33.62% | 23.77% | 68.62 | 0.44 | 0.97 | 0.68 | 0.47 |
a | e | 35.06% | 34.87% | 65.79 | 0.29 | 0.95 | 0.63 | 0.46 |
a | g | 36.35% | 27.46% | 63.45 | 0.37 | 0.98 | 0.72 | 0.47 | 0.20
a | i | 33.33% | 26.00% | 69.20 | 0.47 | 0.97 | 0.57 | |
c | e | 24.50% | 34.47% | 66.56 | 0.43 | 0.96 | 0.70 | 0.43 |
c | g | 25.39% | 27.13% | 64.23 | 0.50 | 0.97 | 0.41 | |
c | i | 23.31% | 25.71% | 69.98 | 0.59 | 0.97 | 0.54 | |
e | g | 37.37% | 28.38% | 61.39 | 0.36 | 0.98 | 0.44 | |
e | i | 34.17% | 26.80% | 67.14 | 0.45 | 0.96 | 0.70 | 0.47 |
g | i | 26.88% | 27.76% | 64.81 | 0.53 | 0.98 | 0.45 | |

Table 4.7: Simulation experiments using pairwise merging mixtures of spatially related ovarian intra-tumoural samples of DG1136a,c,e,g,i. Two samples were mixed at approximately equal proportions with differences attributed to difference in individual sample read coverage and normal contamination. TITAN results are also shown.

Mixture of individual samples (first seven columns) and TITAN estimated parameters (remaining columns):
Sample 1 | Sample 2 | Sample 3 | % tumour of 1 | % tumour of 2 | % tumour of 3 | Merged coverage | Normal estimate | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5
a | c | e | 22.96% | 16.23% | 22.83% | 100.49 | 0.38 | 0.95 | 0.66 | 0.47 | |
a | c | g | 23.50% | 16.62% | 17.75% | 98.15 | 0.42 | 0.95 | 0.73 | 0.51 | 0.19 |
a | c | i | 22.20% | 15.70% | 17.32% | 103.90 | 0.47 | 0.92 | 0.79 | 0.62 | 0.47 | 0.29
a | e | g | 24.20% | 24.07% | 18.28% | 95.32 | 0.33 | 0.96 | 0.77 | 0.51 | 0.39 | 0.15
a | e | i | 22.82% | 22.70% | 17.80% | 101.07 | 0.41 | 0.96 | 0.50 | | |
a | g | i | 23.36% | 17.65% | 18.22% | 98.73 | 0.45 | 0.97 | 0.78 | 0.45 | 0.22 |
c | e | g | 16.97% | 23.87% | 18.13% | 96.09 | 0.41 | 0.95 | 0.71 | 0.23 | |
c | e | i | 16.01% | 22.53% | 17.67% | 101.8 | 0.48 | 0.96 | 0.72 | 0.29 | |
c | g | i | 16.39% | 17.51% | 18.08% | 99.51 | 0.54 | 0.97 | 0.35 | | |
e | g | i | 23.73% | 18.02% | 18.61% | 96.67 | 0.42 | 0.95 | 0.76 | 0.50 | 0.23 |

Table 4.8: Simulation experiments using triplet merging mixtures of spatially related ovarian intra-tumoural samples of DG1136a,c,e,g,i. Three samples were mixed at approximately equal proportions with differences attributed to difference in individual sample read coverage and normal contamination. TITAN results are also shown.

Figure 4.15: Performance of TITAN (T), APOLLOH (A, including HMMcopy), Control-FreeC (CF), and BIC-seq (B) for triplet merging simulation of intra-tumour samples from a HGS ovarian tumour at approximately equal proportions. Events are divided into CNA loss, gains, and LOH. 'Subclonal 1' denotes events that are present uniquely in only one sample in the mixture and therefore considered subclonal in the simulation. Similarly, 'Subclonal 2' denotes events that are present in two out of three samples in a triplet merge simulation. Ground truth events were identified in the individual samples of the mixture using APOLLOH and expected prevalence values are shown in Table 4.3.

Figure 4.16: Performance of TITAN (T), APOLLOH (A, including HMMcopy), Control-FreeC (CF), and BIC-seq (B) for pairwise merging simulation of intra-tumour samples from a HGS ovarian tumour at approximately equal proportions. Events are divided into CNA loss, gains, and LOH. 'Subclonal 1' denotes events that are present uniquely in only one sample in the mixture and therefore considered subclonal in the simulation. Ground truth events were identified in the individual samples of the mixture using APOLLOH and expected prevalence values are shown in Table 4.3.

4.3.3 Accurate estimation of cellular prevalence

Next, we assessed the accuracy of predicted cellular prevalence estimates by comparing to the expected ground truth values for each simulated mixture. For each pairwise mixture, three clonal clusters were expected (ancestral and sample specific), while for triplet mixtures, seven clonal clusters were expected (all combinations of samples). The expected cellular prevalence was computed from the tumour contribution from each individual sample making up the simulated mixture, as described earlier (Tables 4.4, 4.7, 4.8). Cellular prevalence estimates predicted by TITAN were significantly correlated (Pearson's r ≥ 0.9, p < 0.001) with the expected tumour cellular prevalence across all samples in the serial (Figure 4.17a, Table 4.5) and merging simulations (Figure 4.17b,c; Tables 4.7, 4.8). We also calculated the root mean squared error (RMSE) between the predicted and expected cellular prevalences for the serial mixtures (RMSE=0.11), pairwise mixtures (RMSE=0.07) and triplet mixtures (RMSE=0.11) (Fig. 4.17a-c).

[Figure 4.17 panels: Predicted versus Expected Cellular Prevalence with y=x, best fit, and 95% CI, for the Serial Mixture, Pairwise Merge Mixture, and Triple Merge Mixture. Panel annotations: TITAN serial r=0.95, p=4.5e−14, RMSE=0.11; TITAN pairwise r=0.96, p < 2.2e−16, RMSE=0.068; TITAN triplet r=0.9, p < 2.2e−16, RMSE=0.11; THetA serial r=0.86, p=1.1e−08, RMSE=0.18; THetA pairwise r=0.9, p=8.7e−12, RMSE=0.12.]

Figure 4.17: Performance of TITAN cellular prevalence estimates for serial (30X) and pairwise (60X)/triplet (90X) merging simulations of intra-tumour samples from an ovarian tumour. Pearson correlation coefficients are shown for TITAN (a-c) and THetA (Oesper et al., 2013) (d-e) estimates where each data point represents an expected clonal cluster with a unique cellular prevalence. Ground truth events were identified in the individual samples of the mixture using APOLLOH and expected prevalence values are shown in Tables 4.4-4.8.

For comparison, we ran THetA (Oesper et al., 2013), which is not a strict methodological comparison to TITAN as its inputs are substantially different, but it is the only currently published method that produces a comparable output to TITAN's cellular prevalence quantity on segmental alterations. THetA's estimates also showed statistically significant correlation with expected values (Pearson's r > 0.86, p < 0.001, Fig. 4.17d,e); however, the RMSE was lower for TITAN (0.11) compared to THetA (0.18) for the serial mixtures, and similarly for the pairwise mixtures (0.07 compared to 0.12). For more than one tumour population, THetA cannot perform the analysis in polynomial time and memory in the number of input segments.
Therefore, we were only able to run THetA for up to two tumour populations under reasonable runtimes, and were unable to evaluate it on the triplet mixtures.

To ensure expected cellular prevalences were not skewed by erroneous tumour content predictions, we also used an orthogonal approach (Control-FreeC) to estimate the tumour content used to compute the expected cellular prevalence and observed similar correlations and RMSE results for TITAN and THetA (Figure 4.18, Table 4.9).

Sample | Site | Sequence coverage (× haploid) | Pathologist estimate | Control-FreeC cellularity estimate | Consensus cellularity
DG1136a | Right Ovary Site 1 | 33.9 | 60% | 65% | 62%
DG1136c | Right Ovary Site 2 | 34.7 | 25% | 63% | 44%
DG1136e | Right Ovary Site 3 | 31.7 | 70% | 70% | 70%
DG1136g | Right Ovary Site 4 | 29.5 | 65% | 75% | 70%
DG1136i | Left Pelvic Sidewall Site 1 | 35.28 | 50% | 60% | 55%

Table 4.9: Sequencing coverage and cellularity for spatially related ovarian intra-tumoural and anatomical sites of DG1136. The consensus cellularity is computed as the average between the pathologist and Control-FreeC tumour estimates.

[Figure 4.18 panels: Predicted versus Expected Cellular Prevalence with y=x, best fit, and 95% CI, for the Serial Mixture, Pairwise Merge Mixture, and Triple Merge Mixture; rows TITAN (a-c) and THetA (d-e). Panel annotations: r=0.96, p=3.2e−14, RMSE=0.1; r=0.85, p=1.7e−08, RMSE=0.18; r=0.88, p=9.1e−11, RMSE=0.13; r=0.97, p < 2.2e−16, RMSE=0.059; r=0.9, p < 2.2e−16, RMSE=0.11.]

Figure 4.18: Performance of TITAN cellular prevalence estimates for serial (30X) and pairwise (60X)/triplet (90X) merging simulations of intra-tumour samples from an ovarian tumour. Pearson correlation coefficients are shown for TITAN (a-c) and THetA (Oesper et al., 2013) (d-e) estimates where each data point represents an expected clonal cluster with a unique cellular prevalence. Ground truth events were identified in the individual samples of the mixture using Control-FreeC.

Global normal contamination impacts the ability of the model to reconcile the presence of subclonality. We therefore assessed the ability of the model to correctly estimate the global normal contamination. TITAN estimates showed significant positive correlation for the serial mixtures (Pearson's r = 0.96, p < 0.0001, RMSE=0.023), pairwise mixtures (r = 0.86, p = 0.0014, RMSE=0.047) and the triplet mixtures (r = 0.74, p = 0.014, RMSE=0.048) relative to the expected normal proportion (Figure 4.19a-c; Tables 4.4, 4.7, 4.8). TITAN's estimates were considerably more accurate than THetA for the serial (r = 0.93, p = 0.00022, RMSE=0.23) and pairwise mixtures (r = 0.51, p = 0.14, RMSE=0.3) (Figure 4.19d,e). Therefore, in addition to increased sensitivity for detecting subclonal events, TITAN showed accurate inference of cellular prevalence and normal proportion, which adds an interpretive layer to estimating the composition of both tumour and normal cells.
Figure 4.19: Performance of TITAN normal proportion estimates for serial (30X) and pairwise (60X)/triplet (90X) merging simulations of intra-tumour samples from an ovarian tumour. Pearson correlation coefficients are shown for TITAN (a-c) and THetA (Oesper et al., 2013) (d-e) estimates where each data point represents a sample in the mixture. Ground truth events were identified in the individual samples of the mixture using APOLLOH and expected prevalence values are shown in Table 4.3. Expected normal proportion was determined as the consensus of the pathologist and APOLLOH (Ha et al., 2012) estimates as shown in Table 4.4.

Sample | # of clonal clusters | Normal estimate | Cluster 1 | Cluster 2
DG1136a | 2 | 0.32 | 0.97 | 0.58
DG1136c | 2 | 0.54 | 0.93 | 0.61
DG1136e | 1 | 0.31 | 0.97 | NA
DG1136g | 2 | 0.41 | 0.94 | 0.51
DG1136i | 1 | 0.59 | 0.96 | NA

Table 4.10: TITAN parameter estimates (normal proportion, ploidy, and cellular prevalence for one and two clonal clusters) for individual intra-tumour DG1136 samples.

4.3.4 FISH assays validate the presence of subclonal copy number changes

We performed fluorescence in-situ hybridization (FISH) assays to experimentally validate the predicted CNAs in the individual intra-tumoural samples (Table 4.10) of the HGS ovarian cancer at single cell resolution. Seven regions were assayed and all confirmed our CNA predictions, which included two clonally dominant and five subclonal copy number events (Table 4.11).

For the five events (one gain and four deletions) predicted as subclonal, FISH also confirmed the presence of subclonal heterogeneity. The three deletions were undetected by other segmentation approaches. In DG1136g, a 47Mbp region deleted in chr1:70539053-117275764, which was estimated at cellular prevalence of 0.51, was observed to be present with cells not harbouring the event (Figure 4.20a).
In another example in DG1136g, a 12Mbp region of copy gain in chr2:18272917-30386174 was confirmed to be present but observed to be heterogeneous (Figure 4.20b). Two clonally dominant events were confirmed in chr3 of DG1136e (Figure 4.20c).

Figure 4.20: Fluorescence in situ hybridization (FISH) images for CNA events in samples of patient DG1136. a) 3-colour FISH probes for a subclonal deletion in chromosome 1 of DG1136g using fresh-frozen tissue. b) A subclonal copy gain and a subclonal deletion in chromosome 2 are shown in fresh-frozen tissue of DG1136g. Cells with the deletion (green arrow), copy gain (red arrow), and diploid (white arrow) are shown. Cells with both events co-occurring are shown with overlapping green and red arrows. c) Clonal deletion and copy gain event in chromosome 3 of fresh-frozen paraffin-embedded tissue of DG1136e.

Locus | FISH probe | Type | Clonality | TITAN cellular prevalence | Control-FreeC
chr1:69,851,036-70,025,173 | RP11-795A13 | DEL | S | 0.51 | N
chr2:28154550-28364468 | RP11-829F10 | GAIN | S | 0.51 | Y
chr2:59904520-60114863 | RP11-462D13 | DEL | S | 0.51 | N
chr3:21501606-21662405 | RP11-17C24 | DEL | C | 0.94 | N
chr3:168704881-168869852 | RP11-33A1 | GAIN | C | 0.94 | Y
chr7:145,530,552-145,724,648 | RP11-1005P9 | DEL | S | 0.51 | N
chr21:22060445-22231762 | RP11-49J9 | DEL | S | 0.51 | N

Table 4.11: Validation of TITAN predictions using fluorescence in-situ hybridization (FISH). 'Locus' is the region that the FISH BAC probe was used to validate the TITAN copy number prediction of 'Type' deletion (DEL) or gain (GAIN). For each FISH assay, the event was counted in approximately 40 cells. For comparison, Control-FreeC (Boeva et al., 2012b) is shown as present 'Y' or absent 'N' for detecting the event. 'Clonality' denotes if the FISH result was determined as clonal (C) or subclonal (S). Coordinates are from build GRCh37 (hg19).

A limitation of using FISH is that the clones observed may not necessarily be the same ones as those predicted from WGS.
Nevertheless, FISH assays are still informative for confirming the presence of predicted events, thereby addressing the sensitivity of TITAN.

4.3.5 Validation of TITAN predictions using single-cell sequencing

We further validated TITAN predictions from DG1136g using single-cell sequencing. Targeted positions were sequenced, using multiplex PCR reactions and Fluidigm access array technology, in isolated nuclei that were sorted from disaggregated frozen tissue blocks (Appendix C.5.2). Two sets of events, Set1 and Set2, each included one clonal LOH event, two subclonal deletions, and two heterozygous diploid regions (Figure 4.21, Appendix C.5.1). For each set, 42 single cells were sorted, followed by library construction and sequencing; statistical analysis was then carried out independently for the two sets (Appendix C.5.3).

Selection of loci and distinguishing tumour and normal nuclei

This experiment focused on LOH events because confirmation of homozygosity, or the absence of one allele, from single-cell sequencing is generally unambiguous. For statistical robustness we interrogated multiple SNPs within each prediction of LOH (10-11 SNPs) and heterozygous (2-3 SNPs) negative control regions. We also selected previously validated somatic point mutations (SNVs), including a homozygous SNV in TP53, from DG1136g (Appendix C.5.1). Because it is widely accepted that TP53 mutation is a tumour initiating event in HGS ovarian cancer (Ahmed et al., 2010; Cancer Genome Atlas Research Network, 2011b; Bashashati et al., 2013), this mutation was expected to be ancestral, and thus present in all tumour cells. TP53, along with the other SNVs, was used as a marker to distinguish tumour and contaminating normal nuclei in this experiment. This resulted in 14 tumour and 14 normal nuclei for Set1 (Figure 4.22a), and 9 tumour and 9 normal nuclei for Set2 (Figure 4.23a). The remaining nuclei contained insufficient read coverage (< 10 positions with ≥ 50 reads).
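The coverage criterion stated above (a nucleus is retained only if at least 10 targeted positions have 50 or more reads) can be sketched as a simple filter. This is an illustrative sketch only: the function name, the data layout, and the example depths are hypothetical and do not come from the analysis described in Appendix C.5.

```python
MIN_DEPTH = 50      # reads required for a position to count as covered
MIN_POSITIONS = 10  # covered positions required to retain a nucleus


def has_sufficient_coverage(depths, min_depth=MIN_DEPTH,
                            min_positions=MIN_POSITIONS):
    """Return True if enough targeted positions reach the depth threshold."""
    covered = sum(1 for d in depths if d >= min_depth)
    return covered >= min_positions


# Hypothetical read depths at the targeted positions of two nuclei
nuclei = {
    "nucleus_01": [120, 80, 55, 60, 75, 90, 51, 300, 64, 58, 12],
    "nucleus_02": [10, 0, 75, 3, 49, 50, 8, 0, 1, 2, 60],
}
kept = [name for name, depths in nuclei.items()
        if has_sufficient_coverage(depths)]
print(kept)  # ['nucleus_01']
```

Here the first nucleus has ten positions at or above 50 reads and is retained, while the second has only three and would be excluded.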
The details for the selection of positions for validation are described in Appendix C.5.1.

Determining the expected allelic drop-out rate and heterozygous allelic ratio

Allelic drop-out refers to the unequal amplification of one allele at a heterozygous position, which can be mistaken for the homozygous signal arising from loss of heterozygosity. Using all positions in every normal nucleus, we computed the expected drop-out rate as the proportion of positions (with sufficient coverage) at which either the reference or the variant allele was present, but not both (XOR). Drop-out rates (DOR) for Set1 and Set2 were 0.28 and 0.48, respectively. Furthermore, the expected allelic ratio for a heterozygous event was computed as the median across all heterozygous positions in every normal nucleus. The expected heterozygous allelic ratios (HAR) for Set1 and Set2 were 0.57 and 0.68, respectively. Next, we describe the use of DOR and HAR to determine LOH status for each event.

Figure 4.21: TITAN predictions selected for validation by single-cell sequencing of DNA from individual nuclei. Two clonally dominant LOH regions (C-DLOH-1 and C-NLOH-1) were selected from chr17. Four subclonal regions were selected from chr1 (SC-DLOH-1), chr2 (SC-DLOH-3), chr7 (SC-DLOH-4), and chr21 (SC-DLOH-5). A set of somatic mutations (SNVs) was also selected as controls to distinguish normal and tumour nuclei.

Statistical analysis of LOH events

We expected all tumour nuclei to be homozygous for the SNPs in the predicted clonal LOH events, while only a portion of the tumour nuclei would be homozygous for predicted subclonal events. We used two statistical tests to determine whether an event in a tumour nucleus is LOH by analyzing the set of positions within the event (Appendix C.5.3). 
The first test controls for allele drop-out using a one-tailed binomial test with parameter DOR; this determines if the ratio between the number of homozygous and heterozygous positions is statistically significantly greater than DOR (p < 0.05). The second test determines if the allelic ratio is statistically significant for LOH. A one-sample Wilcoxon signed rank test was used to examine if the allelic ratio distribution across the positions within an event was significantly different from HAR (p < 0.05). The minimum of the Benjamini & Hochberg adjusted p-values from the two tests was used.

We applied the two statistical tests to all events for both normal and tumour nuclei. We classified each nucleus as heterozygous or homozygous (or unknown if statistically inconclusive; see Appendix C.5.3). As expected, for each of the normal nuclei in both Set1 and Set2, all LOH events were classified as heterozygous, independently confirming the initial grouping of cell types using mutations. In addition, the four negative control heterozygous events HET1, HET3 (Figure 4.22), HET4, HET5 (Figure 4.23) were each classified as heterozygous in all tumour nuclei for which sufficient coverage was obtained.

Figure 4.22: Single-cell validation of Set1 (sub)clonal deletions in DG1136g using deep DNA sequencing of individual nuclei. (a) The 28 nuclei for Set1 were designated as tumour and normal cell type using the status of mutations. The mutant allele ratio (variant reads / depth) for mutations and symmetric allele ratio (max(reference reads, variant reads) / depth) for SNP positions are shown. (b) The LOH status for each heterozygous (HET) and LOH (C-DLOH, SC-DLOH) event is shown. Tumour and normal nuclei are denoted by parentheses. [Panel legends: cell type (normal, tumour); position status (homozygous, heterozygous, low coverage, absent); event status (LOH, HET, UNK, low coverage); x-axis: nucleus.]

Figure 4.23: Single-cell validation of Set2 (sub)clonal deletions in DG1136g using deep DNA sequencing of individual nuclei. (a) The 18 nuclei for Set2 were designated as tumour and normal cell type using the status of mutations. The mutant allele ratio (variant reads / depth) for mutations and symmetric allele ratio (max(reference reads, variant reads) / depth) for SNP positions are shown. (b) The LOH status for each heterozygous (HET) and LOH (C-DLOH, SC-DLOH) event is shown. Tumour and normal nuclei are denoted by parentheses. [Panel legends: cell type (normal, tumour); position status (homozygous, heterozygous, low coverage, absent); event status (LOH, HET, UNK, low coverage); x-axis: nucleus.]

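The drop-out rate and the first of the two tests described above can be sketched in a few lines of Python. This is a minimal illustration rather than the analysis code used here (see Appendix C.5.3 for the actual procedure): the function names and the toy counts are hypothetical, the number of trials is assumed to be the number of homozygous plus heterozygous positions in the event, and the Wilcoxon signed rank test and the Benjamini & Hochberg adjustment are omitted.

```python
from math import comb


def dropout_rate(calls):
    """Fraction of covered positions where exactly one allele is present (XOR).

    `calls` holds (ref_present, var_present) booleans for positions with
    sufficient coverage in normal nuclei.
    """
    return sum(ref != var for ref, var in calls) / len(calls)


def binomial_upper_tail(k, n, p):
    """One-tailed p-value P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


def excess_homozygosity_pvalue(n_hom, n_het, dor):
    """Test whether homozygous calls in one event exceed what drop-out explains.

    Assumption: trials = homozygous + heterozygous positions in the event.
    """
    return binomial_upper_tail(n_hom, n_hom + n_het, dor)


# Toy input (not thesis data): 9 of 10 SNPs in an event appear homozygous,
# evaluated against the Set1 drop-out rate of 0.28.
p_value = excess_homozygosity_pvalue(n_hom=9, n_het=1, dor=0.28)
is_loh_candidate = p_value < 0.05
```

With a drop-out rate as high as the 0.48 observed for Set2, the same count of homozygous positions gives a weaker p-value, which is why several SNPs were interrogated per event.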
By contrast, for the predicted clonal LOH events C-DLOH-1 (Figure 4.22) and C-NLOH-1 (Figure 4.23), all tumour nuclei were classified as homozygous, confirming that the LOH predictions were clonally dominant. For each of the predicted subclonal deletion events (SC-DLOH-1, 3, 4 and 5), the tumour nuclei were divided into two groups: LOH and HET status.

The proportions of tumour nuclei with LOH status in these events were 0.54 (7/13 for SC-DLOH-1), 0.71 (10/14 for SC-DLOH-3), 0.50 (4/8 for SC-DLOH-4), and 0.50 (4/8 for SC-DLOH-5), which were generally consistent with the TITAN cellular prevalence estimate of 0.51 (Figure 4.21, Table 4.10). Therefore, in two independently executed single nucleus sequencing experiments, the results allowed us to relate our predictions back to the key modelling assumptions of TITAN, confirming the presence of the three cell types (Figure 4.2): 1) a population of normal cells, 2) a population of tumour cells harbouring the CNA/LOH event, and 3) a population of cells without the CNA/LOH event.

Interestingly, in Set2 (Figure 4.23), a mutation in XRCC2, which encodes a RAD51-related protein involved in homologous DNA repair of double-stranded breaks, was found only in Nuclei 42, 10, 05 and 07, all of which had HET status for event SC-DLOH-4. Because the XRCC2 mutation was heterozygous in three of these nuclei, with the possibility of the fourth being homozygous due to drop-out, we speculate that mono-allelic disruption of the gene is advantageous whereas bi-allelic inactivation has a negative impact for the tumour cell. One scenario, assuming the heterozygous mutation was acquired earlier, is that the mutated allele was preferentially selected to be deleted (LOH status). 
Further experimental data involving more single-cell tumour nuclei and functional analyses will be required to investigate this finding.

4.4 Applications of TITAN

4.4.1 Characterization of the subclonal CNA in triple negative breast cancers

We analyzed a set of 23 triple-negative breast cancers (TNBC) with paired tumour-normal WGS data (Shah et al., 2012; Ha et al., 2012). We applied TITAN to predict regions of (sub)clonal CNA and LOH in the TNBC genomes, and profiled the patterns of evolution in the genome architecture. Five cases were clonally homogeneous (i.e. one clonal cluster) with cellular prevalence between 0.92-0.97, while the remaining 18 cases were more heterogeneous (between two and six clonal clusters, with cellular prevalence estimates ranging 0.17-0.98) (Table 4.12). In the 18 heterogeneous cases, the proportion of the genome altered by subclonal events ranged from 0.18 to 0.96, with 10 cases having a higher proportion of subclonal alterations than clonal events (Table 4.13). This emphasizes the importance of considering mixed populations in the analysis and suggests that failure to do so will lead to vastly different interpretations of the data and preclude inference of evolutionary patterns present in the data.

Sample   # of clonal clusters   Normal estimate   Cluster 1   Cluster 2   Cluster 3   Cluster 4   Cluster 5   Cluster 6
SA028    3                      0.14              0.95        0.74        0.49
SA029    1                      0.59              0.96
SA030    6                      0.41              0.98        0.69        0.53        0.47        0.35        0.23
SA052    4                      0.17              0.96        0.53        0.39        0.17
SA065    2                      0.20              0.96        0.65
SA073    2                      0.51              0.96        0.53
SA219    3                      0.49              0.99        0.68        0.43
SA220    1                      0.41              0.97
SA221    5                      0.11              0.98        0.56        0.45        0.37        0.29
SA223    4                      0.01              0.93        0.71        0.42        0.27
SA224    3                      0.28              0.93        0.67        0.45
SA225    1                      0.20              0.97
SA227    3                      0.33              0.96        0.72        0.44
SA231    2                      0.16              0.97        0.51
SA232    1                      0.63              0.94
SA233    1                      0.70              0.92
SA235    3                      0.47              0.98        0.70        0.34
SA236    3                      0.27              0.94        0.66        0.40
SA237    3                      0.48              0.96        0.74        0.27
SA238    5                      0.48              0.98        0.86        0.71        0.53        0.33
SA239    2                      0.52              0.96        0.58
SA299    5                      0.00              1.00        0.48        0.37        0.30        0.21
SA300    5                      0.38              0.93        0.73        0.55        0.40        0.26

Table 4.12: TITAN parameter estimates (normal proportion, ploidy, and cellular prevalence) for 23 TNBC samples.

Sample   Proportion of genome altered by clonal events   Proportion of genome altered by subclonal events
SA028    0.58                                            0.36
SA029    0.47                                            0.00
SA030    0.25                                            0.50
SA052    0.10                                            0.56
SA065    0.72                                            0.24
SA073    0.31                                            0.59
SA219    0.26                                            0.62
SA220    0.67                                            0.00
SA221    0.03                                            0.91
SA223    0.14                                            0.78
SA224    0.42                                            0.49
SA225    0.62                                            0.00
SA227    0.42                                            0.51
SA231    0.55                                            0.18
SA232    0.77                                            0.00
SA233    0.60                                            0.00
SA235    0.06                                            0.77
SA236    0.49                                            0.35
SA237    0.43                                            0.22
SA238    0.06                                            0.46
SA239    0.55                                            0.06
SA299    0.01                                            0.96
SA300    0.07                                            0.72

Table 4.13: Proportion of genome altered in TNBC cohort by clonally dominant or subclonal events based on TITAN results.

Application of TITAN to TNBC whole exome-capture sequencing data

Four TNBC samples (SA030, SA052, SA065, SA073) that we had previously analyzed (Shah et al., 2012; Ha et al., 2012) had both whole genome (WGS) and exome sequencing (WES) data. We applied TITAN to WES data to demonstrate that the method can also be used for targeted genomic regions. We modified the normalization procedure for GC-content bias correction: the loess curve-fitting correction method was applied only to positions overlapping exons in the hg18 genome reference build, which was downloaded in BED format from UCSC. All other TITAN parameter settings were initialized to default values as per usage for other WGS samples in Section 4.3.

Performance was computed by comparing the copy number concordance between TITAN results for WGS and WES at all overlapping positions. 
A ‘match’ was determined for a deletion or amplification at an overlapping position if both results (TITAN run on WGS and WES) were less than 2 or both greater than 2, respectively. Across the four samples, there were a total of 79,097 (∼19,700 per sample) overlapping SNP positions between the exome and WGS data, of which 55,846 (71%) were concordant for predicted copy number between the TITAN results for the two data types (Figure 4.24, Table 4.14). However, in the WGS data, TITAN appears to resolve the signal into a higher number of clonal clusters, possibly due to analysis of 100 times more SNP loci (Table 4.14).

Figure 4.24: Comparison of TITAN results for whole exome capture (EXCAP) sequencing and whole genome sequencing (WGS) of triple negative breast cancer sample SA052. For copy number plots, copy neutral, deletion, and amplification are represented by blue, green, and red, respectively. For log ratio plots, hemizygous deletion (HEMD), copy neutral (NEUT), and copy gain (GAIN) results are shown. For allelic ratio plots, LOH, copy neutral LOH (NLOH), diploid heterozygous (HET), and allele-specific amplification (ASCNA) are shown. The cellular prevalence value indicates the proportion of tumour cells in the whole sample. Clonal clusters are shown as horizontal lines labelled with a ‘Z’; tumour content is denoted with the black horizontal line.

Sample   # overlapping positions   DEL match   AMP match   HET match   TOTAL match   Concordance
SA030    20335                     7542        1800        4386        13728         0.67
SA052    20475                     6660        3306        5394        15360         0.75
SA065    16668                     560         8254        6958        15772         0.94
SA073    21619                     6372        2485        2129        10986         0.50

Table 4.14: Comparison of TITAN results for whole exome (EXCAP) and genome (WGS) sequencing data for TNBC. Concordance was computed based on overlapping germline heterozygous SNP positions between the EXCAP and WGS sample for the same patient sample. 
A match was called for a deletion (‘DEL match’), amplification (‘AMP match’), or copy neutral (‘HET match’) at an overlapping position if both results were less than 2, both greater than 2, or both equal to 2, respectively. ‘Concordance’ was computed as the proportion of overlapping positions that matched.

Transcriptome allelic imbalance corroborates cellular prevalence of subclonal LOH. For 22 of the TNBC cohort, we also analyzed the transcriptomes sequenced via RNAseq (Shah et al., 2012; Ha et al., 2012) to assess whether the subclonal LOH predictions influence allele-specific expression. Using the observed transcriptome allelic ratio (TAR; proportion of reference read counts), we compared TITAN cellular prevalence estimates to the expected allelic imbalance in expression. We considered only positions within deletion LOH segments. For a reference point, we estimated the expected baseline TAR as a function of the cellular prevalence, $(1 - s_z) + \frac{s_z}{2}$, where the first term corresponds to the tumour cells expressing only one allele (LOH) and the second term represents all other cells expressing both alleles (heterozygous) (see red lines in Figure 4.25).

Across the 22 TNBC cases and all clonal clusters, the TAR was significantly correlated with the predicted cellular prevalence (Pearson’s $r = 0.72$, $p = 2.2 \times 10^{-11}$, Figure 4.25a). TAR values were observed to be more imbalanced than could be explained by deletion LOH alone. This could be attributed to epigenetic silencing of one or both alleles in cells without LOH. When higher copy number events such as amplified LOH are considered, these can also result in more imbalance than expected due to stronger representation of the homozygous allele (if expression is generally positively correlated with copy number) (Figure 4.25b). These results indicate that a substantial proportion of mono-allelic expression is associated with coincident subclonal LOH prediction, providing evidence that the clonality of these events is impacting the transcriptional program in these tumours.

Figure 4.25: Comparison of TITAN cellular prevalence and RNAseq transcriptome allelic ratios (TAR). Sample prevalence (proportion within sample including normal contamination) for (a) deletion LOH segments only and (b) all LOH segments (including copy neutral LOH and amplified LOH) in all clonal clusters of all samples is shown (x-axis). The mean RNAseq allelic ratio, $\max\left(\frac{\mathit{ref}}{\mathit{depth}}, 1 - \frac{\mathit{ref}}{\mathit{depth}}\right)$, for transcriptomic positions overlapping LOH regions for each clonal cluster across all samples is shown (y-axis). The Pearson correlation coefficients in the comparisons were 0.72 and 0.71, respectively. The red line indicates the expected TAR for the given sample prevalence, assuming cells (tumour and normal) without the event are diploid heterozygous and both alleles are expressed equally; it is thus a function of the cellular prevalence, $(1 - s_z) + \frac{s_z}{2}$. RNAseq data was filtered based on depth threshold > 10, mapping quality > 30, and base quality > 5.

4.4.2 Distinct clonal evolution patterns in breast cancer xenografts

Human breast cancer tissue engraftment in mice provides a model for studying the biology in the originating tumour (DeRose et al., 2011); however, the genomic clonal architecture and evolution patterns during engraftment have not been well described. We aimed to comprehensively study the clonal evolution in six primary breast and corresponding mouse xenograft tumours. 
We applied TITAN and PyClone (Roth et al., 2014) to obtain cellular prevalence for CNA/LOH and SNV, respectively, and compared these estimates between tumour and xenograft (Figure 4.26).

For each case, a set of ancestral events was present at high estimated cellular prevalence in both the tumour and xenograft. For SA493 and SA499, we observed the presence of the same clones with subdominant prevalence, indicating potential polyclonal engraftment.

By contrast, SA494 cellular prevalence patterns from both CNA and SNV dimensions show mutually exclusive clonally dominant events. Closer inspection of chromosome 5 revealed that a seemingly dominant clone with a large LOH region in the tumour was not selected for during engraftment (Figure 4.27). This supports the hypothesis that a rare (undetectable) subclone with a normal genotype of chr5 in the tumour underwent a clean “sweep” during engraftment, becoming the dominant clone in the xenograft.

For SA500, subclonal SNVs with cellular prevalences varying from 5-50% were expanded to being clonally dominant in the xenograft. These observations suggest that subclones serially accumulated these SNVs in a linear evolution fashion, leading to the range of prevalences. Then, the engraftment selected for the final (latest) subclones containing all these SNVs. On the other hand, the majority of CNA events were ancestral, indicating they occurred early in tumour progression prior to accumulation of SNVs.

In case SA429, we also observed instances of complex rearrangements in the form of chromothripsis (Figure 4.28) and breakage fusion bridge (Figure 4.29) events based on analytical criteria suggested in Korbel and Campbell (2013). Tumour samples were analyzed for various post-engrafted mouse passages (A to E) at different time points (1 to 5) using SNP6 arrays. Data for Sample 2A (passage A and time point 2) was generated using WGS, and analyzed with TITAN and deStruct for copy number and rearrangements, respectively. Both chromothripsis and breakage fusion bridge events were observed across all passages and time points, indicating that the events were ancestral. However, we were not able to determine if the events were acquired or selected during engraftment because the originating primary tumour sample contained insufficient tumour content (<10%) for CNA analysis.

Figure 4.26: Clonal analysis between CNA/LOH and SNV mutational classes in breast xenografts. SNVs were predicted using PyClone. Gene-based counts are used for the CNA/LOH analysis. The axes are cellular prevalence. [Panels: SA493, SA494, SA495, SA499, SA500, SA501; axes: tumour and xenograft cellular prevalence; points are clonal groups, with the number of genes or SNVs per group in the legend.]

Figure 4.27: Clean “sweep” copy neutrality in chromosome 5 of the xenograft in SA494. Despite the high cellular prevalence of the copy neutral LOH (NLOH) and copy gain events in the tumour, chromosome 5 is almost completely copy neutral and heterozygous in the xenograft, suggesting the selection of a rare clone not harbouring the NLOH and gain events during engraftment. Copy number is shown as the log of the ratio between corrected tumour read depth and corrected normal read depth for WGS, where deletions (green), gains (red), and copy neutral (blue) events are below, above and at 0, respectively. The allelic ratio (proportion of reference read counts at each germline heterozygous SNP) indicates copy neutral LOH (blue), deletion LOH (green), and allele-specific amplification (red). The cellular prevalence refers to the proportion of tumour cells with the event; here, nearly chromosome-wide LOH is clonally dominant. The Circos plot shows the tumour in the outer-most track and xenograft in the middle track. Rearrangements, shown as arcs, were predicted by deStruct.

Figure 4.28: Chromothripsis of chromosome 21 in SA429 across all xenograft passages. 
The original identification of chromothripsis was made in the WGS sample (2A), where the joint analysis of copy number predicted by TITAN and rearrangements predicted by deStruct showed periodicity of alternating gains and losses. Copy number is shown using the log ratio, which for WGS is the log of the ratio between corrected tumour read depth and corrected normal read depth. For SNP6, the log ratio is computed from the ratio between tumour and normal intensities. [Panels: passages 1A, 1B, 2A, 2B, 2C, 2D, 2E, 3A, 4A, 5A; states: HOMD, HEMD, NEUT, GAIN, AMP.]

Figure 4.29: Breakage fusion bridge in chromosome 18 of SA429 across all xenograft passages. The breakage-fusion-bridge was identified in the WGS sample (2A) using the joint results of copy number predicted by TITAN and rearrangements predicted by deStruct. Copy number is shown using the log ratio, which for WGS is the log of the ratio between corrected tumour read depth and corrected normal read depth. For SNP6, the log ratio is computed from the ratio between tumour and normal intensities. [Panels: passages 1A, 1B, 2A, 2B, 2C, 2D, 2E, 3A, 4A, 5A; states: HOMD, HEMD, NEUT, GAIN, AMP.]

4.5 Discussion

TITAN is a novel algorithm that infers subclonal CNA/LOH segments from tumour whole genome sequencing data. Using a unified probabilistic framework, the model analyzes the composite signals in the bulk tumour sample by deconvolving the presence of multiple tumour cell populations and normal cell contamination. The advantages of TITAN are three-fold. First, the proper deconvolution of signals in the sequencing reads using the proposed sampling model allows for improved performance for predicting CNAs and LOH, including both clonally dominant and subclonal events (Figure 4.12c,d). Second, the algorithm is more sensitive to subclonal events, which generally have more diluted signals, as demonstrated by the serial and merging mixture experiments from a HGS ovarian cancer dataset (Figure 4.12e-j). 
Third, estimation of tumour cellular prevalence and normal proportion is a powerful feature which adds a descriptive layer to the interpretation, enabling the inference of evolutionary dynamics of the tumour’s genome from CNA and LOH perspectives (Figure 4.17, 4.19). Importantly, using sequencing of single nuclei, we were able to successfully confirm the presence of 6 (sub)clonal LOH events. In summary, results from experiments over a broad range of synthetic data, patient tumour data, and experimental validation have established that, through appropriate statistical modelling, the composition of cell populations in source tumour samples can be accurately identified by inference of CNA and LOH events in WGS data.

We demonstrated that single-cell analysis was a viable experimental validation for unambiguously confirming the presence of TITAN-predicted (sub)clonal LOH events at the resolution of individual cells. The quantification of cellular prevalence was still challenging due to the small numbers of nuclei, and inference of patterns of clonal evolution was limited by the analysis of only three LOH events in each cell. Scaling up the numbers of tumour nuclei and LOH events will enable the statistical analysis of mutual exclusion and co-occurrence of events, which is unattainable from single bulk tumour biopsies, and the construction of clonal evolutionary phylogenies (Potter et al., 2013).

Our results indicate that modelling WGS data as generated from different tumour populations with diverse somatic genomes will result in a more complete representation of a tumour’s CNA and LOH profiles. With the ability to address clonal diversity, we were able to assess the relative timing of events across classes of mutations (CNA/LOH and SNV). This is exemplified by the analysis of the TNBC genomes, which suggested that the somatic copy number architecture in tumours may be evolving substantially. 
In the breast xenograft dataset, we were able to suggest plausible clonal selection theories supported by joint analysis of the clonality in CNA/LOH and SNV.

In conclusion, the TITAN statistical framework represents a significant advance in the field of copy number and LOH analysis for tumour genome sequencing data. As the implications for clonal evolution of point mutations in clinical trajectories are well-documented, we suggest that TITAN will enable the execution of complementary studies to investigate the role of genome architecture in driving the evolutionary selection of clonal cell populations.

4.5.1 Limitations

The primary aim of TITAN is the improvement of sensitivity for detecting subclonal CNA/LOH events, and we have provided improved performance in benchmarking mixtures up to ∼90X sequencing coverage (Figure 4.15, 4.17c); however, this is limited to resolving events in major clones at detectable prevalence. The full enumeration of clonal cell populations, including minor clones, is a limitation of TITAN and remains a difficult problem in the analysis of a single tumour biopsy. Solutions to address this limitation will require integration of additional data, such as somatic SNVs (Carter et al., 2012), genomic rearrangement breakpoints, increased coverage of sequencing (Nik-Zainal et al., 2012b), multi-patient biopsies, and ultimately, single-cell analysis (Navin et al., 2011; Potter et al., 2013).

TITAN does not model more than one aberrated genotype at the same locus, but instead assumes that clones harbouring the subclonal event coexist with tumour population(s) that have a normal (diploid heterozygous) genotype. An example of this limitation is shown in Figure 4.1, in which the event at chr2q23.3-q24.1 is likely an aggregated signal from a tumour population with a hemizygous deletion and another with a copy neutral LOH (subsequent duplication). 
In particular, it is difficult to distinguish among coexisting clones that harbour amplifications of variable copies. In order to model multiple tumour genotypes, the mixture representation model (Figure 4.2) will need to be re-formulated. This is further discussed in Section 4.5.3.

TITAN estimates parameters using the EM algorithm; however, this approach may return locally optimal solutions, influenced by the initializations prior to inference. Informed initializations of the key parameters, normal proportion and tumour ploidy, using orthogonal sources such as histopathology may improve EM-converged parameter values and overall predictions. Also, we determined the optimal number of clonal clusters, which best represented the expected number of clonal clusters in the sample, by employing a modified internal clustering validation measure, S_Dbw. However, a more robust solution would be to integrate the model selection directly into the framework, such as representing the clonal cluster groupings using a phylogenetic tree to relate inferred clones to their ancestral lineages. While this is a complicated problem, we provide alternative solutions to this limitation in the following section.

4.5.2 Extensions to current TITAN model

There are several possible extensions to the existing TITAN model that can improve the accuracy, interpretation, and novelty of the method.

Model selection problem for number of clonal clusters

The optimal number of clonal clusters Z is selected in a post-processing step. The number of states K (21 × Z) in the factorial-HMM is a function of Z. Therefore, increasing the number of clonal clusters will consistently yield a better likelihood due to over-fitting. The challenge lies in selecting the optimal Z rather than the conventional problem of choosing K in an HMM. As a result, methods such as the Bayesian information criterion (BIC) appear to fail in our experience. 
We addressed this by employing a modified internal evaluation measure, S_Dbw, which allowed us to directly compare between settings of Z.

A more robust solution would be to integrate the model selection directly into the model. One such integrated solution is the use of factorized asymptotic Bayesian (FAB) inference methods, which apply regularization to eliminate clonal clusters that do not fit well to the data (Fujimaki and Hayashi, 2012). However, it is not yet clear how to adapt FAB to decompose states based only on Z rather than K. Another solution is to implement an infinite HMM (iHMM), which uses a hierarchical Dirichlet process (HDP) to select the expected number of hidden states (Beal et al., 2002). Finally, another solution is to represent the clonal cluster groupings using a phylogenetic tree, relating inferred clone clusters to their ancestral lineages. We can make use of pairs of events whose cellular prevalences sum to more than 1.0 to infer the minimum proportion of co-occurrence. This can help inform branching or linear evolution during the tree construction.

Predicting parental CNA/LOH using phased haplotypes

We can revise TITAN to analyze parent-specific haplotypes instead of allele-specific regions using properties of linkage disequilibrium (LD). This was applied in a recent study employing the Battenberg algorithm to resolve subclonal CNA using haplotype blocks (Nik-Zainal et al., 2012b). Currently in TITAN, alleles and genotypes at each germline SNP locus are defined as matching or not matching the reference genome. Instead, it would be preferable to represent alleles as originating from parental (maternal or paternal) chromosomes. Using software such as IMPUTE2 (Howie et al., 2009) and BEAGLE (Browning and Browning, 2007), informative (heterozygous) germline SNPs can be phased and haplotype blocks can be determined. 
This additional haplotype information, which inherently follows properties of genomic spatial associations as part of LD, can help to resolve symmetric adjacent genotype transitions (e.g. AA to BB) in the model. This can enable prediction of contiguous blocks of CNA or LOH events that occurred in a parent-specific chromosome, and fluctuations from expected haplotype signals may indicate the clonality of the events. Preliminary results were promising but merit further investigation to evaluate the advantages for the TITAN analysis. Moreover, because TITAN uses a non-stationary (position-specific) transition matrix, it will be convenient to define transition probabilities to reflect phasing of adjacent positions to the same haplotype blocks.

More robust representation using over-dispersed emission distributions

The TITAN framework can accommodate alternative underlying emission distributions to model the input data. The log ratio $l_{1:T}$ is a legacy data type carried over from established models in aCGH and genotyping array approaches. Therefore, it is more suitable to use the digital read depths $N_{1:T}$ at each heterozygous germline SNP directly from WGS data, similar to the reference allele read count $a_{1:T}$. This has the added advantage of eliminating the dependency on HMMcopy bin-based coverage as a preprocessing step. However, a new method for single position GC content and mappability correction will need to be applied (Benjamini and Speed, 2012). Another logical modification would be to use over-dispersed versions of the current distributions. For the read depths, the gamma-Poisson (also called negative binomial) with the mean $\bar{\mu}_{g,z}$ and precision $\tau^{GP}_k$ parameterization can be used to model $N_{1:T}$,

$$\mathrm{GP}(N_t \mid \mu_{g,z}, \tau^{GP}_k) = \frac{\Gamma(\mu_{g,z}^2 \tau^{GP}_k + N_t)}{\Gamma(\mu_{g,z}^2 \tau^{GP}_k)\,\Gamma(N_t + 1)} \left(\mu_{g,z} \tau^{GP}_k\right)^{\mu_{g,z}^2 \tau^{GP}_k} \left(\mu_{g,z} \tau^{GP}_k + 1\right)^{-\mu_{g,z}^2 \tau^{GP}_k - N_t} \qquad (4.17)$$

where $\mu_{g,z} = \frac{\alpha}{\beta}$, $\tau^{GP} = \frac{\beta^2}{\alpha}$, $\alpha = \mu_{g,z}^2 \tau^{GP}$ and $\beta = \mu_{g,z} \tau^{GP}$. 
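As a numerical illustration of Equation 4.17, the following minimal Python sketch evaluates the gamma-Poisson log-probability under the mean and precision parameterization. The function name and the toy values of the mean and precision are assumptions made for illustration only; this is not the TITAN implementation.

```python
from math import exp, lgamma, log


def gamma_poisson_logpmf(n, mu, tau):
    """Log-probability of read depth n under Equation 4.17.

    mu is the mean and tau the precision; alpha = mu^2 * tau and
    beta = mu * tau are the shape and rate of the underlying gamma.
    """
    alpha = mu * mu * tau
    beta = mu * tau
    return (
        lgamma(alpha + n) - lgamma(alpha) - lgamma(n + 1)
        + alpha * log(beta)
        - (alpha + n) * log(beta + 1.0)
    )


# Toy values (assumptions, not estimates from data): mean depth 30, precision 0.05.
mu, tau = 30.0, 0.05
pmf = [exp(gamma_poisson_logpmf(n, mu, tau)) for n in range(400)]
total = sum(pmf)
mean = sum(n * p for n, p in enumerate(pmf))
variance = sum((n - mean) ** 2 * p for n, p in enumerate(pmf))
```

Under this parameterization the variance is the mean plus the reciprocal of the precision, so a smaller precision gives heavier over-dispersion relative to a Poisson distribution with the same mean.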
The sampling model of the read depth based on the 3-component mixture is redefined with the haploidy coverage $1/\phi$ as a scaling factor,

$$\bar{\mu}_{g,z} = \frac{n c_N + (1-n) s_z c_N + (1-n)(1-s_z) c_{T,g}}{\phi} \tag{4.18}$$

For allele read count data, the beta-binomial, reparameterized in terms of the mean $\bar{\omega}_{g,z}$ (Equation 4.2) and precision $\tau^{BB}_k$, provides a more robust representation (Roth et al., 2014),

$$\mathrm{BB}(a_t \mid N_t, \omega_{g,z}, \tau^{BB}_k) = \binom{N_t}{a_t} \frac{B\left(a_t + \tau^{BB}_k \omega_{g,z},\; N_t - a_t + \tau^{BB}_k (1-\omega_{g,z})\right)}{B\left(\tau^{BB}_k \omega_{g,z},\; \tau^{BB}_k (1-\omega_{g,z})\right)} \tag{4.19}$$

The multivariate (joint) emission becomes

$$p(a_t, N_t \mid Z_t = z, G_t = g, \theta) = \mathrm{BB}(a_t \mid N_t, \omega_{g,z}, \tau^{BB}_k) \times \mathrm{GP}(N_t \mid \mu_{g,z}, \tau^{GP}_k) \tag{4.20}$$

Extensions to TITAN using these distributions and replacing log ratios with read depth were implemented. Preliminary runs using naive parameter settings produced plausible results; however, the performance, based on the evaluation described in Section 4.3.2, was inferior to that of the current TITAN emission distributions. Moreover, the runtime was significantly longer, largely due to the increased number of parameters and the more complex MAP estimation equations. Areas for further investigation include faster parameter inference and the validity of GC content bias correction for the new data type.

4.5.3 Future directions

Aside from the proposed technical extensions, further fundamental modelling considerations are required to better represent the dynamics of clonal selection. There are two areas in which future novel contributions can significantly impact the landscape of algorithm development for studying clonal evolution in tumours.

Deconvolving multiple subclonal tumour genotypes will require additional datatypes

Currently, TITAN makes the assumption that a clonally sub-dominant aberration coexists with tumour population(s) that have a normal (diploid heterozygous) genotype at the same locus.
However, clonal selection and subsequent expansions often result in coexisting, competing tumour populations with more than one aberrant genotype (Greaves and Maley, 2012). For TITAN to model such a tumour, the signal mixture representation will likely require more than the current three components (see Equations 4.1 and 4.2). Given only tumour read depth and allele counts, specific combinations of the signals will yield multiple solutions, making this problem unidentifiable. To address this limitation, additional data, such as somatic SNVs, genomic rearrangement breakpoints and increased sequencing coverage (Nik-Zainal et al., 2012b), may be required. Currently, there is no approach that can infer multiple coexisting subclonal aberrations at the same locus from WGS data of bulk tumour samples.

Analysis of multiple related intra-patient tumour samples

We had assumed that events being part of the same clonal expansion is a sufficient condition for observing similar cellular prevalence for these events. However, other scenarios can give rise to similar cellular prevalence between events. For example, two events at 0.4 cellular prevalence can be harbored in mutually exclusive clones, each making up 40% of the sample. Analysis of a single tumour sample cannot distinguish events co-occurring in the same clone from events in mutually exclusive clones.

A significant novel extension is the analysis of whole genome sequencing of multiple intra-patient tumour samples with the aim of identifying the full complement of aberrations for each clone. By extending the concept of cellular prevalence, we can infer the set of aberrations co-occurring in distinct clones. In particular, this could present the opportunity to spatially and temporally “track” the dynamics of clonal populations via their genomic aberration profiles.
This opens up the possibility, along with the analysis of point mutations, to reliably investigate the process of clonal evolution and ultimately understand the impact it can have on therapeutic response.

Single-cell sequencing for validation

Single-cell sequencing technologies have been promising in addressing intra-tumoural heterogeneity (Navin et al., 2011; Hou et al., 2012; Xu et al., 2012) but remain limited until they can be scaled up to sample hundreds or thousands of cells. Ultimately, the firm establishment of single-cell sequencing technologies will drive development of reliable solutions to uncover clonal selection and diversity. In Section 4.3.5, we successfully demonstrated the use of single-cell sequencing of individual tumour nuclei for validating copy number events. We claim that this is the first result to make use of single-cell sequencing for confirming subclonal CNA predictions, and that it will set the standard for such types of validation in future studies.

Chapter 5

Conclusions

In this dissertation, we have presented three novel computational methods addressing the current challenges in the analysis of copy number alterations (CNA) and loss of heterozygosity (LOH) from genomic data. HMM-Dosage, APOLLOH, and TITAN employed HMMs to analyze data generated from high-density genotyping arrays and WGS of tumour DNA. The presentation of these methods highlights the suitability of HMMs for profiling CNA/LOH. As was introduced in Section 1.4, the basic property of spatial correlation between adjacent positions makes HMMs favourable for analyzing CNA/LOH for various genomic data types. The methods presented here differ from previous work by taking advantage of the flexibility of emission distributions and the ability to incorporate prior knowledge. Each method was designed with specific extensions that aimed to address technical and tumour-related properties of the data, improving the interpretation of genomic aberration profiles.
We also included results from the applications of the methods to novel cancer datasets to demonstrate their utility for accurately profiling tumour genomes and to generate quantitative results crucial to interpreting the underlying genetics associated with breast and ovarian cancers.

In Chapter 2, we presented HMM-Dosage, a method that distinguishes germline (CNV) and somatic (CNA) copy number changes. By segregating CNA and CNV events in a robust unified framework, HMM-Dosage characterizes a set of somatic profiles that are relevant for downstream interpretation and identification of actionable driver targets. We applied HMM-Dosage to the analysis of two novel landmark datasets, 2000 breast cancers (METABRIC) and intra-patient ovarian carcinomas, yielding significant findings that have a major bearing on the characterization of these diseases. In the METABRIC study (Curtis et al., 2012), the dichotomization of CNVs and CNAs enabled the profiling of the aberration landscape from both perspectives, including the association of the two signature types with expression and subtype classification. In the intra-tumoural study (Bashashati et al., 2013), we were the first to describe the genomic heterogeneity through analysis of regional tumour biopsies for ovarian high-grade serous carcinoma. Along with a few recent studies in breast and renal cancers (Navin et al., 2010, 2011; Gerlinger et al., 2012, 2014), our work contributes to the series of pioneering studies addressing tumour evolution from multiple intratumour samples.

In Chapter 3, we presented APOLLOH, which was one of the first methods designed specifically for inference of LOH in WGS of tumour DNA.
APOLLOH achieved higher performance compared with competing methods because it explicitly modeled spatial correlation, copy number status and normal cell contamination. The latter feature, normal contamination, is a useful quantity that can be used to confirm estimates of tumour cellularity from clinical histopathology. We reported the landscape of allelic imbalance in 23 whole triple negative breast cancer genomes, surveying genes that were affected by LOH. The strongest signal resided in chromosome 17, which was observed to have nearly complete chromosomal-level LOH in 78% of the samples. Although the majority of LOH events in chromosome 17 were induced by deletions, nearly 20% of samples showed substantial copy-neutral LOH, which would otherwise have been overlooked if only copy number had been considered. This result is similar to those previously reported in another breast cancer cohort (Van Loo et al., 2010) and in a high-grade serous ovarian dataset (Cancer Genome Atlas Research Network, 2011b).

This was also the first study aimed at jointly analyzing genome (WGS) and transcriptome (RNA-seq) data in 23 tumours to determine LOH and its effects on allelic expression, particularly MAE, at nucleotide resolution. We provided an analysis of MAE that investigated the impact on gene expression as a result of LOH. The MAE of genes may have a biological impact on the progression and state of the tumour, as reported for brain tumours (Walker et al., 2012). Indeed, pathway analysis of the genes affected by LOH-associated MAE revealed core oncogenic pathways, and therefore implicates LOH with coincident MAE as an important mechanism of pathway abrogation that complements copy number, point mutation (Shah et al., 2012), and epigenetic analysis. Interestingly, the results show that the majority of MAE is associated with LOH regions, which implies that the majority of MAE in TNBC is explained by fixed genome aberrations rather than epigenetic regulation or mutations.
Full integration of all of these molecular views, including epigenetic methylation and chromatin modification data, is likely to reveal additional insights into tumour biology.

In Chapter 4, we presented TITAN, which is a significant advance in computational development for analysis of CNA and LOH in sequencing data of tumour genomes. TITAN is a novel probabilistic method that jointly analyzes both the tumour read depth and digital allele read counts for segmentation of subclonal CNA and LOH in WGS of heterogeneous tumours. By representing the observed signals as originating from multiple populations, TITAN quantifies the clonality from the perspective of chromosomal aberrations. While algorithms for inferring clonal diversity from the context of SNVs are emerging in cancer genomics studies (Landau et al., 2013; Wu et al., 2012; Martinez et al., 2013; Nik-Zainal et al., 2012b; Ding et al., 2012a; Kreso et al., 2013; Yachida et al., 2010; Castellarin et al., 2012), methods for analysis of the clonality of CNA/LOH are still lacking.

TITAN represents a timely contribution at a time when the focus of the cancer genome research community is on the clonal architecture and evolution of cancers. To fully understand the clonal diversity of tumours, multiple mutation classes, such as CNA/LOH and SNVs, need to be considered in parallel or simultaneously (Nik-Zainal et al., 2012b). To this end, we demonstrated the utility of TITAN by presenting results of the clonal diversity from the perspectives of CNA/LOH and SNVs for 23 TNBC samples and 8 breast patient primary-xenograft tumour comparisons. These analyses represented a novel approach to uncovering the underlying dynamics of clonal evolution and enabled conclusions to be drawn on the clonal selection patterns of these tumours.
The application of TITAN presented here lays the essential groundwork for quantitative assessment of clonal diversity.

A downstream clinical application of TITAN will be in identifying tumours with particular DNA repair defects that would make them susceptible to genotoxic drugs. For example, tumours with defective homologous recombination (HR) repair of double-stranded genomic breaks (Lord and Ashworth, 2012) are likely to carry an overall heterogeneous genomic footprint as a result of continual accrual of genomic structural changes. This can be discerned through profiling of subclonal cell populations using TITAN. Tumours exhibiting subclonal CNA and LOH events are likely to have ongoing homologous recombination defects and would be good candidates for treatment with platinum-based drugs or PARP inhibitors (Mukhopadhyay et al., 2012). Genome sequencing of tumours and the appropriate analytical solutions, coupled with reliable homologous recombination deficiency assays, will be required to test this hypothesis. As WGS of tumour genomes scales to cohorts of tens of thousands (International Cancer Genome Consortium et al., 2010), TITAN will be an invaluable and necessary computational tool that will enable researchers to investigate the role of structural genomic aberrations in clonal evolution.

5.1 Concluding remarks

Cancer is a complex and multi-faceted disease that is accepted as being largely driven by a genetic component, including structural genomic aberrations (Hanahan and Weinberg, 2011). The field of cancer genomics has entered an era of unprecedented measurement capacity, leading to accelerated understanding of genomic aberrations associated with the behaviour of malignancies. Development of advanced technologies has helped researchers achieve feats and milestones previously thought impossible, and has reinvigorated the ambition for renewed goals and long-term visions for cancer research.
The growth of genomics has paralleled the continual progress and discoveries in laboratory techniques, leading to a new challenge born from the vast abundance of data. Bioinformatics, developing alongside these fields, has become an integral and necessary component. Not unlike the maturation of cancer research and genomic assays, this dissertation also represents an evolution through the development of algorithmic concepts. The work presented here chronologically lays the foundation for each new intellectual development that addresses the current needs and challenges in cancer genomics research. The future impact of this work will soon be realized as genomics research reaches new boundaries, demanding new methods that will be built on ideas fostered in this dissertation.

From observing defective cell division in sea urchin models (Boveri, 2008) one century ago to visualizing abnormalities in single chromosomes (Nowell and Hungerford, 1960) in 1960, we are in a position, today, not only to identify genomic aberrations, but to investigate the clonal populations and the evolutionary dynamics of tumours. The fundamental principles for profiling subclonal cellular populations presented in this dissertation will cultivate new and exciting endeavours that can have far-reaching implications for clinical applications such as cancer diagnosis and therapeutic intervention.

Bibliography

1000 Genomes Project Consortium, Durbin, R. M., Abecasis, G. R., Altshuler, D. L., Auton, A., Brooks, L. D., Durbin, R. M., Gibbs, R. A., Hurles, M. E., and McVean, G. A. (2010). A map of human genome variation from population-scale sequencing. Nature, 467(7319):1061–73.

Ahmed, A. A., Etemadmoghadam, D., Temple, J., Lynch, A. G., Riad, M., Sharma, R., Stewart, C., Fereday, S., Caldas, C., Defazio, A., Bowtell, D., and Brenton, J. D. (2010). Driver mutations in tp53 are ubiquitous in high grade serous carcinoma of the ovary. J Pathol, 221(1):49–56.

Alexandrov, L.
B., Nik-Zainal, S., Wedge, D. C., Aparicio, S. A. J. R., Behjati, S., Biankin, A. V., Bignell, G. R., Bolli, N., Borg, A., Børresen-Dale, A.-L., Boyault, S., Burkhardt, B., Butler, A. P., Caldas, C., Davies, H. R., Desmedt, C., Eils, R., Eyfjörd, J. E., Foekens, J. A., Greaves, M., Hosoda, F., Hutter, B., Ilicic, T., Imbeaud, S., Imielinski, M., Jäger, N., Jones, D. T. W., Jones, D., Knappskog, S., Kool, M., Lakhani, S. R., López-Otín, C., Martin, S., Munshi, N. C., Nakamura, H., Northcott, P. A., Pajic, M., Papaemmanuil, E., Paradiso, A., Pearson, J. V., Puente, X. S., Raine, K., Ramakrishna, M., Richardson, A. L., Richter, J., Rosenstiel, P., Schlesner, M., Schumacher, T. N., Span, P. N., Teague, J. W., Totoki, Y., Tutt, A. N. J., Valdés-Mas, R., van Buuren, M. M., van ’t Veer, L., Vincent-Salomon, A., Waddell, N., Yates, L. R., Australian Pancreatic Cancer Genome Initiative, ICGC Breast Cancer Consortium, ICGC MMML-Seq Consortium, ICGC PedBrain, Zucman-Rossi, J., Futreal, P. A., McDermott, U., Lichter, P., Meyerson, M., Grimmond, S. M., Siebert, R., Campo, E., Shibata, T., Pfister, S. M., Campbell, P. J., and Stratton, M. R. (2013). Signatures of mutational processes in human cancer. Nature, 500(7463):415–21.

Anderson, K., Lutz, C., van Delft, F. W., Bateman, C. M., Guo, Y., Colman, S. M., Kempski, H., Moorman, A. V., Titley, I., Swansbury, J., Kearney, L., Enver, T., and Greaves, M. (2011). Genetic variegation of clonal architecture and propagating cells in leukaemia. Nature, 469(7330):356–61.

Aparicio, S. and Caldas, C. (2013). The implications of clonal genome evolution for cancer medicine. The New England journal of medicine, 368(9):842–851.

Archambeau, C. (2005). Probabilistic Models in Noisy Environments - And their Application to a Visual Prosthesis for the Blind. PhD thesis, Université catholique de Louvain.

Baca, S. C., Prandi, D., Lawrence, M. S., Mosquera, J. M., Romanel, A., Drier, Y., Park, K., Kitabayashi, N., MacDonald, T.
Y., Ghandi, M., Van Allen, E., Kryukov, G. V., Sboner, A., Theurillat, J.-P., Soong, T. D., Nickerson, E., Auclair, D., Tewari, A., Beltran, H., Onofrio, R. C., Boysen, G., Guiducci, C., Barbieri, C. E., Cibulskis, K., Sivachenko, A., Carter, S. L., Saksena, G., Voet, D., Ramos, A. H., Winckler, W., Cipicchio, M., Ardlie, K., Kantoff, P. W., Berger, M. F., Gabriel, S. B., Golub, T. R., Meyerson, M., Lander, E. S., Elemento, O., Getz, G., Demichelis, F., Rubin, M. A., and Garraway, L. A. (2013). Punctuated evolution of prostate cancer genomes. Cell, 153(3):666–677.

Baker, S. J., Preisinger, A. C., Jessup, J. M., Paraskeva, C., Markowitz, S., Willson, J. K., Hamilton, S., and Vogelstein, B. (1990). p53 gene mutations occur in combination with 17p allelic deletions as late events in colorectal tumorigenesis. Cancer Res, 50(23):7717–7722.

Barnett, D. W., Garrison, E. K., Quinlan, A. R., Strömberg, M. P., and Marth, G. T. (2011). Bamtools: a c++ api and toolkit for analyzing and managing bam files. Bioinformatics (Oxford, England), 27(12):1691–1692.

Bashashati, A., Ha, G., Tone, A., Ding, J., Prentice, L. M., Roth, A., Rosner, J., Shumansky, K., Kalloger, S., Senz, J., Yang, W., McConechy, M., Melnyk, N., Anglesio, M., Luk, M. T. Y., Tse, K., Zeng, T., Moore, R., Zhao, Y., Marra, M. A., Gilks, B., Yip, S., Huntsman, D. G., McAlpine, J. N., and Shah, S. P. (2013). Distinct evolutionary trajectories of primary high-grade serous ovarian cancers revealed through spatial mutational profiling. J Pathol, 231:21–34.

Beal, M. J., Ghahramani, Z., and Rasmussen, C. E. (2002). The infinite hidden markov model. In Machine Learning, pages 29–245. MIT Press.

Bengtsson, H., Irizarry, R., Carvalho, B., and Speed, T. P. (2008). Estimation and assessment of raw copy numbers at the single locus level. Bioinformatics, 24(6):759–67.

Bengtsson, H., Neuvial, P., and Speed, T. P. (2010). Tumorboost: normalization of allele-specific tumor copy numbers from a single pair of tumor-normal genotyping microarrays.
BMC Bioinformatics, 11:245–245.

Bengtsson, H., Wirapati, P., and Speed, T. P. (2009). A single-array preprocessing method for estimating full-resolution raw copy numbers from all affymetrix genotyping arrays including genomewidesnp 5 & 6. Bioinformatics, 25(17):2149–56.

Benjamini, Y. and Speed, T. P. (2012). Summarizing and correcting the gc content bias in high-throughput sequencing. Nucleic Acids Res, 40(10):e72.

Bergamaschi, A., Kim, Y. H., Wang, P., Sørlie, T., Hernandez-Boussard, T., Lonning, P. E., Tibshirani, R., Børresen-Dale, A.-L., and Pollack, J. R. (2006). Distinct patterns of dna copy number alteration are associated with different clinicopathological features and gene-expression subtypes of breast cancer. Genes, chromosomes & cancer, 45(11):1033–1040.

Berger, A. H., Knudson, A. G., and Pandolfi, P. P. (2011). A continuum model for tumour suppression. Nature, 476(7359):163–9.

Beroukhim, R., Lin, M., Park, Y., Hao, K., Zhao, X., Garraway, L. A., Fox, E. A., Hochberg, E. P., Mellinghoff, I. K., Hofer, M. D., Descazeaud, A., Rubin, M. A., Meyerson, M., Wong, W. H., Sellers, W. R., and Li, C. (2006). Inferring loss-of-heterozygosity from unpaired tumors using high-density oligonucleotide snp arrays. PLoS Comput Biol, 2(5):e41.

Beroukhim, R., Mermel, C. H., Porter, D., Wei, G., Raychaudhuri, S., Donovan, J., Barretina, J., Boehm, J. S., Dobson, J., Urashima, M., Mc Henry, K. T., Pinchback, R. M., Ligon, A. H., Cho, Y. J., Haery, L., Greulich, H., Reich, M., Winckler, W., Lawrence, M. S., Weir, B. A., Tanaka, K. E., Chiang, D. Y., Bass, A. J., Loo, A., Hoffman, C., Prensner, J., Liefeld, T., Gao, Q., Yecies, D., Signoretti, S., Maher, E., Kaye, F. J., Sasaki, H., Tepper, J. E., Fletcher, J. A., Tabernero, J., Baselga, J., Tsao, M. S., Demichelis, F., Rubin, M. A., Janne, P. A., Daly, M. J., Nucera, C., Levine, R. L., Ebert, B. L., Gabriel, S., Rustgi, A. K., Antonescu, C. R., Ladanyi, M., Letai, A., Garraway, L. A., Loda, M., Beer, D. G., True, L.
D., Okamoto, A., Pomeroy, S. L., Singer, S., Golub, T. R., Lander, E. S., Getz, G., Sellers, W. R., and Meyerson, M. (2010). The landscape of somatic copy-number alteration across human cancers. Nature, 463(7283):899–905.

Bignell, G. R., Greenman, C. D., Davies, H., Butler, A. P., Edkins, S., Andrews, J. M., Buck, G., Chen, L., Beare, D., Latimer, C., Widaa, S., Hinton, J., Fahey, C., Fu, B., Swamy, S., Dalgliesh, G. L., Teh, B. T., Deloukas, P., Yang, F., Campbell, P. J., Futreal, P. A., and Stratton, M. R. (2010). Signatures of mutation and selection in the cancer genome. Nature, 463(7283):893–898.

Bishop, C. M. (2007). Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 1st ed. 2006. corr. 2nd printing edition.

Boeva, V., Popova, T., Bleakley, K., Chiche, P., Cappo, J., Schleiermacher, G., Janoueix-Lerosey, I., Delattre, O., and Barillot, E. (2012a). Control-freec: a tool for assessing copy number and allelic content using next-generation sequencing data. Bioinformatics, 28(3):423–425.

Boeva, V., Popova, T., Bleakley, K., Chiche, P., Cappo, J., Schleiermacher, G., Janoueix-Lerosey, I., Delattre, O., and Barillot, E. (2012b). Control-freec: a tool for assessing copy number and allelic content using next-generation sequencing data. Bioinformatics (Oxford, England), 28(3):423–425.

Boeva, V., Zinovyev, A., Bleakley, K., Vert, J. P., Janoueix-Lerosey, I., Delattre, O., and Barillot, E. (2011). Control-free calling of copy number alterations in deep-sequencing data using gc-content normalization. Bioinformatics, 27(2):268–269.

Boveri, T. (2008). Concerning the origin of malignant tumours by theodor boveri. translated and annotated by henry harris. Journal of cell science, 121 Suppl 1:1–84.

Brennan, C. W., Verhaak, R. G. W., McKenna, A., Campos, B., Noushmehr, H., Salama, S. R., Zheng, S., Chakravarty, D., Sanborn, J. Z., Berman, S.
H., Beroukhim, R., Bernard, B., Wu, C.-J., Genovese, G., Shmulevich, I., Barnholtz-Sloan, J., Zou, L., Vegesna, R., Shukla, S. A., Ciriello, G., Yung, W. K., Zhang, W., Sougnez, C., Mikkelsen, T., Aldape, K., Bigner, D. D., Meir, E. G. V., Prados, M., Sloan, A., Black, K. L., Eschbacher, J., Finocchiaro, G., Friedman, W., Andrews, D. W., Guha, A., Iacocca, M., O’Neill, B. P., Foltz, G., Myers, J., Weisenberger, D. J., Penny, R., Kucherlapati, R., Perou, C. M., Hayes, D. N., Gibbs, R., Marra, M., Mills, G. B., Lander, E., Spellman, P., Wilson, R., Sander, C., Weinstein, J., Meyerson, M., Gabriel, S., Laird, P. W., Haussler, D., Getz, G., Chin, L., and Network, T. C. G. A. R. (2013). The somatic genomic landscape of glioblastoma. Cell, 155(2):462–477.

Browning, S. R. and Browning, B. L. (2007). Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am J Hum Genet, 81(5):1084–1097.

Burrell, R. A., McClelland, S. E., Endesfelder, D., Groth, P., Weller, M.-C., Shaikh, N., Domingo, E., Kanu, N., Dewhurst, S. M., Gronroos, E., Chew, S. K., Rowan, A. J., Schenk, A., Sheffer, M., Howell, M., Kschischo, M., Behrens, A., Helleday, T., Bartek, J., Tomlinson, I. P., and Swanton, C. (2013a). Replication stress links structural and numerical cancer chromosomal instability. Nature, 494(7438):492–6.

Burrell, R. A., McGranahan, N., Bartek, J., and Swanton, C. (2013b). The causes and consequences of genetic heterogeneity in cancer evolution. Nature, 501(7467):338–45.

Campbell, P. J., Pleasance, E. D., Stephens, P. J., Dicks, E., Rance, R., Goodhead, I., Follows, G. A., Green, A. R., Futreal, P. A., and Stratton, M. R. (2008a). Subclonal phylogenetic structures in cancer revealed by ultra-deep sequencing. Proc Natl Acad Sci U S A, 105(35):13081–13086.

Campbell, P. J., Stephens, P. J., Pleasance, E. D., O’Meara, S., Li, H., Santarius, T., Stebbings, L. A., Leroy, C., Edkins, S., Hardy, C., Teague, J.
W., Menzies, A., Goodhead, I., Turner, D. J., Clee, C. M., Quail, M. A., Cox, A., Brown, C., Durbin, R., Hurles, M. E., Edwards, P. A. W., Bignell, G. R., Stratton, M. R., and Futreal, P. A. (2008b). Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nat Genet, 40(6):722–729.

Campbell, P. J., Yachida, S., Mudie, L. J., Stephens, P. J., Pleasance, E. D., Stebbings, L. A., Morsberger, L. A., Latimer, C., McLaren, S., Lin, M.-L., McBride, D. J., Varela, I., Nik-Zainal, S. A., Leroy, C., Jia, M., Menzies, A., Butler, A. P., Teague, J. W., Griffin, C. A., Burton, J., Swerdlow, H., Quail, M. A., Stratton, M. R., Iacobuzio-Donahue, C., and Futreal, P. A. (2010). The patterns and dynamics of genomic instability in metastatic pancreatic cancer. Nature, 467(7319):1109–13.

Cancer Genome Atlas Research Network (2008). Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature, 455(7216):1061–8.

Cancer Genome Atlas Research Network (2011a). Integrated genomic analyses of ovarian carcinoma. Nature, 474(7353):609–15.

Cancer Genome Atlas Research Network (2011b). Integrated genomic analyses of ovarian carcinoma. Nature, 474(7353):609–615.

Cancer Genome Atlas Research Network (2012a). Comprehensive genomic characterization of squamous cell lung cancers. Nature, 489(7417):519–525.

Cancer Genome Atlas Research Network (2012b). Comprehensive molecular characterization of human colon and rectal cancer. Nature, 487(7407):330–337.

Cancer Genome Atlas Research Network (2012c). Comprehensive molecular portraits of human breast tumours. Nature, 490(7418):61–70.

Cancer Genome Atlas Research Network (2013a). Comprehensive molecular characterization of clear cell renal cell carcinoma. Nature, 499(7456):43–9.

Cancer Genome Atlas Research Network (2013b). Genomic and epigenomic landscapes of adult de novo acute myeloid leukemia. N Engl J Med, 368(22):2059–2074.

Carter, S.
L., Cibulskis, K., Helman, E., McKenna, A., Shen, H., Zack, T., Laird, P. W., Onofrio, R. C., Winckler, W., Weir, B. A., Beroukhim, R., Pellman, D., Levine, D. A., Lander, E. S., Meyerson, M., and Getz, G. (2012). Absolute quantification of somatic dna alterations in human cancer. Nat Biotechnol.

Castellarin, M., Milne, K., Zeng, T., Tse, K., Mayo, M., Zhao, Y., Webb, J. R., Watson, P. H., Nelson, B. H., and Holt, R. A. (2012). Clonal evolution of high-grade serous ovarian carcinoma from primary to recurrent disease. J Pathol.

Chen, K., Wallis, J. W., McLellan, M. D., Larson, D. E., Kalicki, J. M., Pohl, C. S., McGrath, S. D., Wendl, M. C., Zhang, Q., Locke, D. P., Shi, X., Fulton, R. S., Ley, T. J., Wilson, R. K., Ding, L., and Mardis, E. R. (2009). Breakdancer: an algorithm for high-resolution mapping of genomic structural variation. Nat Methods, 6(9):677–81.

Chiang, D. Y., Getz, G., Jaffe, D. B., O’Kelly, M. J. T., Zhao, X., Carter, S. L., Russ, C., Nusbaum, C., Meyerson, M., and Lander, E. S. (2009). High-resolution mapping of copy-number alterations with massively parallel sequencing. Nat Methods, 6(1):99–103.

Chin, S. F., Teschendorff, A. E., Marioni, J. C., Wang, Y., Barbosa-Morais, N. L., Thorne, N. P., Costa, J. L., Pinder, S. E., van de Wiel, M. A., Green, A. R., Ellis, I. O., Porter, P. L., Tavaré, S., Brenton, J. D., Ylstra, B., and Caldas, C. (2007). High-resolution acgh and expression profiling identifies a novel genomic subtype of er negative breast cancer. Genome Biol, 8(10):R215.

Cingolani, P. (2012). snpeff: Variant effect prediction. http://snpeff.sourceforge.net.

Ciriello, G., Miller, M. L., Aksoy, B. A., Senbabaoglu, Y., Schultz, N., and Sander, C. (2013). Emerging landscape of oncogenic signatures across human cancers. Nat Genet, 45(10):1127–1133.

Colella, S., Yau, C., Taylor, J. M., Mirza, G., Butler, H., Clouston, P., Bassett, A. S., Seller, A., Holmes, C. C., and Ragoussis, J. (2007).
Quantisnp: an objective bayes hidden-markov model to detect and accurately map copy number variation using snp genotyping data. Nucleic Acids Res, 35(6):2013–2025.

Conrad, D. F., Pinto, D., Redon, R., Feuk, L., Gokcumen, O., Zhang, Y., Aerts, J., Andrews, T. D., Barnes, C., Campbell, P., Fitzgerald, T., Hu, M., Ihm, C. H., Kristiansson, K., Macarthur, D. G., Macdonald, J. R., Onyiah, I., Pang, A. W. C., Robson, S., Stirrups, K., Valsesia, A., Walter, K., Wei, J., Wellcome Trust Case Control Consortium, Tyler-Smith, C., Carter, N. P., Lee, C., Scherer, S. W., and Hurles, M. E. (2010). Origins and functional impact of copy number variation in the human genome. Nature, 464(7289):704–12.

Cooke, S. L., Ng, C. K. Y., Melnyk, N., Garcia, M. J., Hardcastle, T., Temple, J., Langdon, S., Huntsman, D., and Brenton, J. D. (2010). Genomic analysis of genetic heterogeneity and evolution in high-grade serous ovarian carcinoma. Oncogene, 29(35):4905–4913.

Cooper, G. M., Zerr, T., Kidd, J. M., Eichler, E. E., and Nickerson, D. A. (2008). Systematic assessment of copy number variant detection via genome-wide snp genotyping. Nat Genet, 40(10):1199–203.

Curtis, C., Shah, S. P., Chin, S. F., Turashvili, G., Rueda, O. M., Dunning, M. J., Speed, D., Lynch, A. G., Samarajiwa, S., Yuan, Y., Gräf, S., Ha, G., Haffari, G., Bashashati, A., Russell, R., McKinney, S., METABRIC Group, Langerod, A., Green, A., Provenzano, E., Wishart, G., Pinder, S., Watson, P., Markowetz, F., Murphy, L., Ellis, I., Purushotham, A., Borresen-Dale, A. L., Brenton, J. D., Tavaré, S., Caldas, C., and Aparicio, S. (2012). The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature, 486(7403):346–352.

Date, O., Katsura, M., Ishida, M., Yoshihara, T., Kinomura, A., Sueda, T., and Miyagawa, K. (2006). Haploinsufficiency of rad51b causes centrosome fragmentation and aneuploidy in human cells. Cancer Res, 66(12):6018–6024.

Day, N., Hemmaplardh, A., Thurman, R. E., Stamatoyannopoulos, J.
A., and Noble, W. S. (2007). Unsupervised segmentation of continuous genomic data. Bioinformatics, 23(11):1424–1426.

DeRose, Y. S., Wang, G., Lin, Y.-C., Bernard, P. S., Buys, S. S., Ebbert, M. T. W., Factor, R., Matsen, C., Milash, B. A., Nelson, E., Neumayer, L., Randall, R. L., Stijleman, I. J., Welm, B. E., and Welm, A. L. (2011). Tumor grafts derived from women with breast cancer authentically reflect tumor pathology, growth, metastasis and disease outcomes. Nat Med, 17(11):1514–1520.

Dewal, N., Hu, Y., Freedman, M. L., Laframboise, T., and Pe’er, I. (2011). Calling amplified haplotypes in next generation tumor sequence data. Genome Res.

Ding, J., Bashashati, A., Roth, A., Oloumi, A., Tse, K., Zeng, T., Haffari, G., Hirst, M., Marra, M. A., Condon, A., Aparicio, S., and Shah, S. P. (2012a). Feature-based classifiers for somatic mutation detection in tumour-normal paired sequencing data. Bioinformatics, 28(2):167–175.

Ding, J., Bashashati, A., Roth, A., Oloumi, A., Tse, K., Zeng, T., Haffari, G., Hirst, M., Marra, M. A., Condon, A., Aparicio, S., and Shah, S. P. (2012b). Feature-based classifiers for somatic mutation detection in tumour-normal paired sequencing data. Bioinformatics, 28(2):167–175.

Ding, L., Ellis, M. J., Li, S., Larson, D. E., Chen, K., Wallis, J. W., Harris, C. C., McLellan, M. D., Fulton, R. S., Fulton, L. L., Abbott, R. M., Hoog, J., Dooling, D. J., Koboldt, D. C., Schmidt, H., Kalicki, J., Zhang, Q., Chen, L., Lin, L., Wendl, M. C., McMichael, J. F., Magrini, V. J., Cook, L., McGrath, S. D., Vickery, T. L., Appelbaum, E., Deschryver, K., Davies, S., Guintoli, T., Lin, L., Crowder, R., Tao, Y., Snider, J. E., Smith, S. M., Dukes, A. F., Sanderson, G. E., Pohl, C. S., Delehaunty, K. D., Fronick, C. C., Pape, K. A., Reed, J. S., Robinson, J. S., Hodges, J. S., Schierding, W., Dees, N. D., Shen, D., Locke, D. P., Wiechert, M. E., Eldred, J. M., Peck, J. B., Oberkfell, B. J., Lolofie, J. T., Du, F., Hawkins, A. E., O’Laughlin, M. D., Bernard, K.
E., Cunningham, M.,Elliott, G., Mason, M. D., Thompson, D. M., Ivanovich, J. L., Goodfellow, P. J.,Perou, C. M., Weinstock, G. M., Aft, R., Watson, M., Ley, T. J., Wilson, R. K.,and Mardis, E. R. (2010). Genome remodelling in a basal-like breast cancermetastasis and xenograft. Nature, 464(7291):999–1005.Ding, L., Getz, G., Wheeler, D. A., Mardis, E. R., McLellan, M. D., Cibulskis,K., Sougnez, C., Greulich, H., Muzny, D. M., Morgan, M. B., Fulton, L., Ful-ton, R. S., Zhang, Q., Wendl, M. C., Lawrence, M. S., Larson, D. E., Chen, K.,Dooling, D. J., Sabo, A., Hawes, A. C., Shen, H., Jhangiani, S. N., Lewis, L. R.,Hall, O., Zhu, Y., Mathew, T., Ren, Y., Yao, J., Scherer, S. E., Clerc, K., Met-calf, G. A., Ng, B., Milosavljevic, A., Gonzalez-Garay, M. L., Osborne, J. R.,Meyer, R., Shi, X., Tang, Y., Koboldt, D. C., Lin, L., Abbott, R., Miner, T. L.,Pohl, C., Fewell, G., Haipek, C., Schmidt, H., Dunford-Shore, B. H., Kraja, A.,Crosby, S. D., Sawyer, C. S., Vickery, T., Sander, S., Robinson, J., Winckler, W.,Baldwin, J., Chirieac, L. R., Dutt, A., Fennell, T., Hanna, M., Johnson, B. E.,Onofrio, R. C., Thomas, R. K., Tonon, G., Weir, B. A., Zhao, X., Ziaugra, L.,Zody, M. C., Giordano, T., Orringer, M. B., Roth, J. A., Spitz, M. R., Wistuba,I. I., Ozenberger, B., Good, P. J., Chang, A. C., Beer, D. G., Watson, M. A.,Ladanyi, M., Broderick, S., Yoshizawa, A., Travis, W. D., Pao, W., Province,M. A., Weinstock, G. M., Varmus, H. E., Gabriel, S. B., Lander, E. S., Gibbs,217BibliographyR. A., Meyerson, M., and Wilson, R. K. (2008). Somatic mutations affect keypathways in lung adenocarcinoma. Nature, 455(7216):1069–1075.Diskin, S. J., Hou, C., Glessner, J. T., Attiyeh, E. F., Laudenslager, M., Bosse,K., Cole, K., Mossé, Y. P., Wood, A., Lynch, J. E., Pecor, K., Diamond, M.,Winter, C., Wang, K., Kim, C., Geiger, E. A., McGrady, P. W., Blakemore, A.I. F., London, W. B., Shaikh, T. H., Bradfield, J., Grant, S. F. A., Li, H., Devoto,M., Rappaport, E. R., Hakonarson, H., and Maris, J. 
M. (2009). Copy numbervariation at 1q21.1 associated with neuroblastoma. Nature, 459(7249):987–991.Dunning, A. M., Healey, C. S., Pharoah, P. D., Teare, M. D., Ponder, B. A., andEaston, D. F. (1999). A systematic review of genetic polymorphisms and breastcancer risk. Cancer epidemiology, biomarkers &amp; prevention : a publicationof the American Association for Cancer Research, cosponsored by the AmericanSociety of Preventive Oncology, 8(10):843–854.Dutt, A. and Beroukhim, R. (2007). Single nucleotide polymorphism array analysisof cancer. Curr Opin Oncol, 19(1):43–49.Eirew, P., Steif, A., Khattra, J., Ha, G., Yap, D., Farahani, H., Gelmon, K., Chia,S., Wan, A., Shumansky, K., Rosner, J., McPherson, A., Nielsen, C., Roth, A.J. L., Lefebvre, C., Bashashati, A., Edwards, J., Oloumi, A., Osako, T., Bruna,A., Sandoval, J., Algara, T., Greenwood, W., Leung, K., Cheng, H., Xue, H.,Wang, Y., Lin, D., Mungall, A., Moore, R., Zhao, Y., Lorette, J., Nguyen, L.,Huntsman, D., Eaves, C., Hansen, C., Marra, M., Caldas, C., Shah, S. P., andAparicio, . S. (2014). Population dynamics of genomic clones in breast cancerpatient xenografts at single cell resolution. Under Review.Fletcher, O. and Houlston, R. S. (2010). Architecture of inherited susceptibility tocommon cancer. Nature reviews. Cancer, 10(5):353–361.Flicek, P., Aken, B. L., Ballester, B., Beal, K., Bragin, E., Brent, S., Chen, Y.,Clapham, P., Coates, G., Fairley, S., Fitzgerald, S., Fernandez-Banet, J., Gordon,L., Graf, S., Haider, S., Hammond, M., Howe, K., Jenkinson, A., Johnson, N.,Kahari, A., Keefe, D., Keenan, S., Kinsella, R., Kokocinski, F., Koscielny, G.,Kulesha, E., Lawson, D., Longden, I., Massingham, T., McLaren, W., Megy,K., Overduin, B., Pritchard, B., Rios, D., Ruffier, M., Schuster, M., Slater, G.,Smedley, D., Spudich, G., Tang, Y. A., Trevanion, S., Vilella, A., Vogel, J.,White, S., Wilder, S. P., Zadissa, A., Birney, E., Cunningham, F., Dunham, I.,Durbin, R., Fernandez-Suarez, X. 
M., Herrero, J., Hubbard, T. J. P., Parker, A.,Proctor, G., Smith, J., and Searle, S. M. J. (2010). Ensembl’s 10th year. Nucl.Acids Res., 38(suppl_1):D557–562.218BibliographyForbes, S. A., Bindal, N., Bamford, S., Cole, C., Kok, C. Y., Beare, D., Jia, M.,Shepherd, R., Leung, K., Menzies, A., Teague, J. W., Campbell, P. J., Stratton,M. R., and Futreal, P. A. (2011). Cosmic: mining complete cancer genomes inthe catalogue of somatic mutations in cancer. Nucleic Acids Res, 39(Databaseissue):D945–D950.Fu, Y.-P., Yu, J.-C., Cheng, T.-C., Lou, M. A., Hsu, G.-C., Wu, C.-Y., Chen, S.-T.,Wu, H.-S., Wu, P.-E., and Shen, C.-Y. (2003). Breast cancer risk associated withgenotypic polymorphism of the nonhomologous end-joining genes: a multigenicstudy on cancer susceptibility. Cancer Res, 63(10):2440–2446.Fujimaki, R. and Hayashi, K. (2012). Factorized asymptotic bayesian hiddenmarkov models.Futreal, P. A., Coin, L., Marshall, M., Down, T., Hubbard, T., Wooster, R., Rahman,N., and Stratton, M. R. (2004). A census of human cancer genes. Nat RevCancer, 4(3):177–183.Gerlinger, M., Horswell, S., Larkin, J., Rowan, A. J., Salm, M. P., Varela, I., Fisher,R., McGranahan, N., Matthews, N., Santos, C. R., Martinez, P., Phillimore, B.,Begum, S., Rabinowitz, A., Spencer-Dene, B., Gulati, S., Bates, P. A., Stamp,G., Pickering, L., Gore, M., Nicol, D. L., Hazell, S., Futreal, P. A., Stewart, A.,and Swanton, C. (2014). Genomic architecture and evolution of clear cell renalcell carcinomas defined by multiregion sequencing. Nat Genet, 46(3):225–233.Gerlinger, M., Rowan, A. J., Horswell, S., Larkin, J., Endesfelder, D., Gronroos,E., Martinez, P., Matthews, N., Stewart, A., Tarpey, P., Varela, I., Phillimore,B., Begum, S., McDonald, N. Q., Butler, A., Jones, D., Raine, K., Latimer, C.,Santos, C. R., Nohadani, M., Eklund, A. C., Spencer-Dene, B., Clark, G., Pick-ering, L., Stamp, G., Gore, M., Szallasi, Z., Downward, J., Futreal, P. A., andSwanton, C. (2012). 
Intratumor heterogeneity and branched evolution revealedby multiregion sequencing. N Engl J Med, 366(10):883–892.Gerstung, M., Beisel, C., Rechsteiner, M., Wild, P., Schraml, P., Moch, H., andBeerenwinkel, N. (2012). Reliable detection of subclonal single-nucleotide vari-ants in tumour cell populations. Nat Commun, 3:811.Goya, R., Sun, M. G. F., Morin, R. D., Leung, G., Ha, G., Wiegand, K. C., Senz,J., Crisan, A., Marra, M. A., Hirst, M., Huntsman, D., Murphy, K. P., Aparicio,S., and Shah, S. P. (2010). Snvmix: predicting single nucleotide variants fromnext-generation sequencing of tumors. Bioinformatics, 26(6):730–736.219BibliographyGreaves, M. and Maley, C. C. (2012). Clonal evolution in cancer. Nature,481(7381):306–313.Greenman, C. D., Bignell, G., Butler, A., Edkins, S., Hinton, J., Beare, D., Swamy,S., Santarius, T., Chen, L., Widaa, S., Futreal, P. A., and Stratton, M. R. (2010a).Picnic: an algorithm to predict absolute allelic copy number variation with mi-croarray cancer data. Biostatistics, 11(1):164–75.Greenman, C. D., Bignell, G., Butler, A., Edkins, S., Hinton, J., Beare, D., Swamy,S., Santarius, T., Chen, L., Widaa, S., Futreal, P. A., and Stratton, M. R. (2010b).Picnic: an algorithm to predict absolute allelic copy number variation with mi-croarray cancer data. Biostatistics, 11(1):164–75.Greenman, C. D., Pleasance, E. D., Newman, S., Yang, F., Fu, B., Nik-Zainal, S.,Jones, D., Lau, K. W., Carter, N., Edwards, P. A. W., Futreal, P. A., Stratton,M. R., and Campbell, P. J. (2012). Estimation of rearrangement phylogeny forcancer genomes. Genome Res, 22(2):346–61.Ha, G., Roth, A., Khattra, J., Ho, J., Yap, D., Prentice, L. M., Melnyk, N., McPher-son, A., Bashashati, Ali Laks, E., Biele, J., Ding, J., Le, A., Rosner, J., Shuman-sky, K., Marra, M. A., Huntsman, D. G., McAlpine, J. N., Aparicio, S. A. J. R.,and Shah, S. P. (2014). Titan: Inference of copy number architectures in clonalcell populations from tumour whole genome sequence data. 
Under Review.Ha, G., Roth, A., Lai, D., Bashashati, A., Ding, J., Goya, R., Giuliany, R., Rosner,J., Oloumi, A., Shumansky, K., Chin, S.-F., Turashvili, G., Hirst, M., Caldas,C., Marra, M. A., Aparicio, S., and Shah, S. P. (2012). Integrative analysis ofgenome-wide loss of heterozygosity and monoallelic expression at nucleotideresolution reveals disrupted pathways in triple-negative breast cancer. Genomeresearch, 22(10):1995–2007.Ha, G. and Shah, S. P. (2013). Distinguishing Somatic and Germline Copy NumberEvents in Cancer Patient DNA Hybridized to Whole-Genome SNP GenotypingArrays, volume 973 of Array Comparative Genomic Hybridization: Protocolsand Applications, Methods in Molecular Biology, chapter 22. Springer Scienceand Business Media, LLC.Halkidi, M., Batistakis, Y., and Vazirgiannis, M. (2002). Clustering validity check-ing methods: part ii. SIGMOD Rec., 31(3):19–27.Hanahan, D. and Weinberg, R. A. (2011). Hallmarks of cancer: the next generation.Cell, 144(5):646–674.220BibliographyHassold, T. and Hunt, P. (2001). To err (meiotically) is human: the genesis ofhuman aneuploidy. Nature reviews. Genetics, 2(4):280–291.Hastings, P. J., Lupski, J. R., Rosenberg, S. M., and Ira, G. (2009). Mechanisms ofchange in gene copy number. Nat Rev Genet, 10(8):551–564.Herzog, T. J. (2004). Recurrent ovarian cancer: how important is it to treat to dis-ease progression? Clinical cancer research : an official journal of the AmericanAssociation for Cancer Research, 10(22):7439–7449.Hou, Y., Song, L., Zhu, P., Zhang, B., Tao, Y., Xu, X., Li, F., Wu, K., Liang, J.,Shao, D., Wu, H., Ye, X., Ye, C., Wu, R., Jian, M., Chen, Y., Xie, W., Zhang, R.,Chen, L., Liu, X., Yao, X., Zheng, H., Yu, C., Li, Q., Gong, Z., Mao, M., Yang,X., Yang, L., Li, J., Wang, W., Lu, Z., Gu, N., Laurie, G., Bolund, L., Kris-tiansen, K., Wang, J., Yang, H., Li, Y., Zhang, X., and Wang, J. (2012). Single-cell exome sequencing and monoclonal evolution of a jak2-negative myelopro-liferative neoplasm. 
Cell, 148(5):873–885.Howie, B. N., Donnelly, P., and Marchini, J. (2009). A flexible and accurategenotype imputation method for the next generation of genome-wide associa-tion studies. PLoS Genet, 5(6):e1000529.Iafrate, A. J., Feuk, L., Rivera, M. N., Listewnik, M. L., Donahoe, P. K., Qi, Y.,Scherer, S. W., and Lee, C. (2004). Detection of large-scale variation in thehuman genome. Nat Genet, 36(9):949–951.Imielinski, M., Berger, A. H., Hammerman, P. S., Hernandez, B., Pugh, T. J.,Hodis, E., Cho, J., Suh, J., Capelletti, M., Sivachenko, A., Sougnez, C., Auclair,D., Lawrence, M. S., Stojanov, P., Cibulskis, K., Choi, K., de Waal, L., Sharifnia,T., Brooks, A., Greulich, H., Banerji, S., Zander, T., Seidel, D., Leenders, F.,Ansén, S., Ludwig, C., Engel-Riedel, W., Stoelben, E., Wolf, J., Goparju, C.,Thompson, K., Winckler, W., Kwiatkowski, D., Johnson, B. E., Jänne, P. A.,Miller, V. A., Pao, W., Travis, W. D., Pass, H. I., Gabriel, S. B., Lander, E. S.,Thomas, R. K., Garraway, L. A., Getz, G., and Meyerson, M. (2012). Mappingthe hallmarks of lung adenocarcinoma with massively parallel sequencing. Cell,150(6):1107–1120.International Cancer Genome Consortium, Hudson, T. J., Anderson, W., Artez,A., Barker, A. D., Bell, C., Bernabé, R. R., Bhan, M. K., Calvo, F., Eerola, I.,Gerhard, D. S., Guttmacher, A., Guyer, M., Hemsley, F. M., Jennings, J. L.,Kerr, D., Klatt, P., Kolar, P., Kusada, J., Lane, D. P., Laplace, F., Youyong, L.,Nettekoven, G., Ozenberger, B., Peterson, J., Rao, T. S., Remacle, J., Schafer,221BibliographyA. J., Shibata, T., Stratton, M. R., Vockley, J. G., Watanabe, K., Yang, H., Yuen,M. M. F., Knoppers, B. M., Bobrow, M., Cambon-Thomsen, A., Dressler, L. G.,Dyke, S. O. M., Joly, Y., Kato, K., Kennedy, K. L., Nicolás, P., Parker, M. J.,Rial-Sebbag, E., Romeo-Casabona, C. M., Shaw, K. M., Wallace, S., Wiesner,G. L., Zeps, N., Lichter, P., Biankin, A. V., Chabannon, C., Chin, L., Clément,B., de Alava, E., Degos, F., Ferguson, M. L., Geary, P., Hayes, D. 
N., Johns,A. L., Kasprzyk, A., Nakagawa, H., Penny, R., Piris, M. A., Sarin, R., Scarpa,A., van de Vijver, M., Futreal, P. A., Aburatani, H., Bayés, M., Botwell, D.D. L., Campbell, P. J., Estivill, X., Grimmond, S. M., Gut, I., Hirst, M., López-Otín, C., Majumder, P., Marra, M., McPherson, J. D., Ning, Z., Puente, X. S.,Ruan, Y., Stunnenberg, H. G., Swerdlow, H., Velculescu, V. E., Wilson, R. K.,Xue, H. H., Yang, L., Spellman, P. T., Bader, G. D., Boutros, P. C., Flicek, P.,Getz, G., Guigó, R., Guo, G., Haussler, D., Heath, S., Hubbard, T. J., Jiang,T., Jones, S. M., Li, Q., López-Bigas, N., Luo, R., Muthuswamy, L., Ouellette,B. F. F., Pearson, J. V., Quesada, V., Raphael, B. J., Sander, C., Speed, T. P.,Stein, L. D., Stuart, J. M., Teague, J. W., Totoki, Y., Tsunoda, T., Valencia, A.,Wheeler, D. A., Wu, H., Zhao, S., Zhou, G., Lathrop, M., Thomas, G., Yoshida,T., Axton, M., Gunter, C., Miller, L. J., Zhang, J., Haider, S. A., Wang, J., Yung,C. K., Cros, A., Cross, A., Liang, Y., Gnaneshan, S., Guberman, J., Hsu, J.,Chalmers, D. R. C., Hasel, K. W., Kaan, T. S. H., Lowrance, W. W., Masui,T., Rodriguez, L. L., Vergely, C., Bowtell, D. D. L., Cloonan, N., deFazio, A.,Eshleman, J. R., Etemadmoghadam, D., Gardiner, B. B., Gardiner, B. A., Kench,J. G., Sutherland, R. L., Tempero, M. A., Waddell, N. J., Wilson, P. J., Gallinger,S., Tsao, M.-S., Shaw, P. A., Petersen, G. M., Mukhopadhyay, D., DePinho,R. A., Thayer, S., Shazand, K., Beck, T., Sam, M., Timms, L., Ballin, V., Lu,Y., Ji, J., Zhang, X., Chen, F., Hu, X., Yang, Q., Tian, G., Zhang, L., Xing, X.,Li, X., Zhu, Z., Yu, Y., Yu, J., Tost, J., Brennan, P., Holcatova, I., Zaridze, D.,Brazma, A., Egevard, L., Prokhortchouk, E., Banks, R. E., Uhlén, M., Viksna,J., Ponten, F., Skryabin, K., Birney, E., Borg, A., Børresen-Dale, A.-L., Caldas,C., Foekens, J. A., Martin, S., Reis-Filho, J. S., Richardson, A. L., Sotiriou, C.,Thoms, G., van’t Veer, L., Birnbaum, D., Blanche, H., Boucher, P., Boyault, S.,Masson-Jacquemier, J. 
D., Pauporté, I., Pivot, X., Vincent-Salomon, A., Tabone,E., Theillet, C., Treilleux, I., Bioulac-Sage, P., Decaens, T., Franco, D., Gut, M.,Samuel, D., Zucman-Rossi, J., Eils, R., Brors, B., Korbel, J. O., Korshunov, A.,Landgraf, P., Lehrach, H., Pfister, S., Radlwimmer, B., Reifenberger, G., Taylor,M. D., von Kalle, C., Majumder, P. P., Pederzoli, P., Lawlor, R. A., Delledonne,M., Bardelli, A., Gress, T., Klimstra, D., Zamboni, G., Nakamura, Y., Miyano,S., Fujimoto, A., Campo, E., de Sanjosé, S., Montserrat, E., González-Díaz, M.,Jares, P., Himmelbauer, H., Himmelbaue, H., Bea, S., Aparicio, S., Easton, D. F.,Collins, F. S., Compton, C. C., Lander, E. S., Burke, W., Green, A. R., Hamilton,222BibliographyS. R., Kallioniemi, O. P., Ley, T. J., Liu, E. T., and Wainwright, B. J. (2010).International network of cancer genome projects. Nature, 464(7291):993–998.International HapMap 3 Consortium, Principal investigators, Altshuler, D. M.,Gibbs, R. A., Peltonen, L., Project coordination leaders, Altshuler, D. M., Gibbs,R. A., Peltonen, L., Dermitzakis, E., Manuscript writing group, Schaffner, S. F.,Yu, F., Peltonen, L., Dermitzakis, E., Bonnen, P. E., Altshuler, D. M., Gibbs,R. A., Genotyping and QC, de Bakker, Co-Leader, P. I. W., Deloukas, Co-Leader, P., Gabriel, S. B., Gwilliam, R., Hunt, S., Inouye, Co-Leader, M., Jia,X., Palotie, A., Parkin, Co-Leader, M., Whittaker, P., ENCODE 3 sequencingand SNP discovery, Yu, Leader, F., Chang, K., Hawes, A., Lewis, L. R., Ren,Y., Wheeler, D., Gibbs, R. A., Marie Muzny, D., Copy number variation typ-ing and analysis, Barnes, C., Darvishi, K., Hurles, Co-Leader, M., Korn, J. M.,Kristiansson, K., Lee, C., McCarroll, Co-Leader, S. A., Nemesh, J., Popula-tion analysis, Dermitzakis, E., Keinan, Leader, A., Montgomery, S. B., Pollack,S., Price, A. L., Soranzo, N., Low frequency variation analysis, Bonnen, P. E.,Gibbs, R. A., Gonzaga-Jauregui, C., Keinan, A., Price, A. 
L., Yu, Leader, F.,Linkage disequilibrium and haplotype sharing analysis, Anttila, V., Brodeur, W.,Daly, M. J., Leslie, S., McVean, G., Moutsianas, L., Nguyen, H., Schaffner,Leader, S. F., Zhang, Q., Imputation, Ghori, M. J. R., McGinnis, Co-Leader, R.,McLaren, W., Pollack, S., Price, Co-Leader, A. L., Schaffner, Co-Leader, S. F.,Takeuchi, F., Natural selection, Grossman, S. R., Shlyakhter, I., Hostetter, E. B.,Sabeti, Leader, P. C., Community engagement and sample collection groups,Adebamowo, C. A., Foster, M. W., Gordon, D. R., Licinio, J., Cristina Manca,M., Marshall, P. A., Matsuda, I., Ngare, D., Ota Wang, V., Reddy, D., Rotimi,C. N., Royal, C. D., Sharp, R. R., Zeng, C., Scientific management, Brooks,L. D., and McEwen, J. E. (2010). Integrating common and rare genetic variationin diverse human populations. Nature, 467(7311):52–58.International HapMap Consortium, Frazer, K. A., Ballinger, D. G., Cox, D. R.,Hinds, D. A., Stuve, L. L., Gibbs, R. A., Belmont, J. W., Boudreau, A., Hard-enbol, P., Leal, S. M., Pasternak, S., Wheeler, D. A., Willis, T. D., Yu, F., Yang,H., Zeng, C., Gao, Y., Hu, H., Hu, W., Li, C., Lin, W., Liu, S., Pan, H., Tang, X.,Wang, J., Wang, W., Yu, J., Zhang, B., Zhang, Q., Zhao, H., Zhao, H., Zhou, J.,Gabriel, S. B., Barry, R., Blumenstiel, B., Camargo, A., Defelice, M., Faggart,M., Goyette, M., Gupta, S., Moore, J., Nguyen, H., Onofrio, R. C., Parkin, M.,Roy, J., Stahl, E., Winchester, E., Ziaugra, L., Altshuler, D., Shen, Y., Yao, Z.,Huang, W., Chu, X., He, Y., Jin, L., Liu, Y., Shen, Y., Sun, W., Wang, H., Wang,Y., Wang, Y., Xiong, X., Xu, L., Waye, M. M. Y., Tsui, S. K. W., Xue, H., Wong,J. T.-F., Galver, L. M., Fan, J.-B., Gunderson, K., Murray, S. S., Oliphant, A. R.,223BibliographyChee, M. S., Montpetit, A., Chagnon, F., Ferretti, V., Leboeuf, M., Olivier, J.-F., Phillips, M. S., Roumy, S., Sallée, C., Verner, A., Hudson, T. J., Kwok,P.-Y., Cai, D., Koboldt, D. C., Miller, R. 
D., Pawlikowska, L., Taillon-Miller,P., Xiao, M., Tsui, L.-C., Mak, W., Song, Y. Q., Tam, P. K. H., Nakamura, Y.,Kawaguchi, T., Kitamoto, T., Morizono, T., Nagashima, A., Ohnishi, Y., Sekine,A., Tanaka, T., Tsunoda, T., Deloukas, P., Bird, C. P., Delgado, M., Dermitzakis,E. T., Gwilliam, R., Hunt, S., Morrison, J., Powell, D., Stranger, B. E., Whit-taker, P., Bentley, D. R., Daly, M. J., de Bakker, P. I. W., Barrett, J., Chretien,Y. R., Maller, J., McCarroll, S., Patterson, N., Pe’er, I., Price, A., Purcell, S.,Richter, D. J., Sabeti, P., Saxena, R., Schaffner, S. F., Sham, P. C., Varilly, P.,Altshuler, D., Stein, L. D., Krishnan, L., Smith, A. V., Tello-Ruiz, M. K., Tho-risson, G. A., Chakravarti, A., Chen, P. E., Cutler, D. J., Kashuk, C. S., Lin, S.,Abecasis, G. R., Guan, W., Li, Y., Munro, H. M., Qin, Z. S., Thomas, D. J.,McVean, G., Auton, A., Bottolo, L., Cardin, N., Eyheramendy, S., Freeman, C.,Marchini, J., Myers, S., Spencer, C., Stephens, M., Donnelly, P., Cardon, L. R.,Clarke, G., Evans, D. M., Morris, A. P., Weir, B. S., Tsunoda, T., Mullikin, J. C.,Sherry, S. T., Feolo, M., Skol, A., Zhang, H., Zeng, C., Zhao, H., Matsuda,I., Fukushima, Y., Macer, D. R., Suda, E., Rotimi, C. N., Adebamowo, C. A.,Ajayi, I., Aniagwu, T., Marshall, P. A., Nkwodimmah, C., Royal, C. D. M.,Leppert, M. F., Dixon, M., Peiffer, A., Qiu, R., Kent, A., Kato, K., Niikawa,N., Adewole, I. F., Knoppers, B. M., Foster, M. W., Clayton, E. W., Watkin, J.,Gibbs, R. A., Belmont, J. W., Muzny, D., Nazareth, L., Sodergren, E., Wein-stock, G. M., Wheeler, D. A., Yakub, I., Gabriel, S. B., Onofrio, R. C., Richter,D. J., Ziaugra, L., Birren, B. W., Daly, M. J., Altshuler, D., Wilson, R. K., Ful-ton, L. L., Rogers, J., Burton, J., Carter, N. P., Clee, C. M., Griffiths, M., Jones,M. C., McLay, K., Plumb, R. W., Ross, M. T., Sims, S. K., Willey, D. L., Chen,Z., Han, H., Kang, L., Godbout, M., Wallenburg, J. 
C., L’Archevêque, P., Belle-mare, G., Saeki, K., Wang, H., An, D., Fu, H., Li, Q., Wang, Z., Wang, R.,Holden, A. L., Brooks, L. D., McEwen, J. E., Guyer, M. S., Wang, V. O., Pe-terson, J. L., Shi, M., Spiegel, J., Sung, L. M., Zacharia, L. F., Collins, F. S.,Kennedy, K., Jamieson, R., and Stewart, J. (2007). A second generation humanhaplotype map of over 3.1 million snps. Nature, 449(7164):851–61.International Human Genome Sequencing Consortium (2004). Finishing the eu-chromatic sequence of the human genome. Nature, 431(7011):931–945.Jirtle, R. L. (1999). Genomic imprinting and cancer. Exp Cell Res, 248(1):18–24.Jones, D. T. W., Jäger, N., Kool, M., Zichner, T., Hutter, B., Sultan, M., Cho, Y.-J.,Pugh, T. J., Hovestadt, V., Stütz, A. M., Rausch, T., Warnatz, H.-J., Ryzhova,M., Bender, S., Sturm, D., Pleier, S., Cin, H., Pfaff, E., Sieber, L., Wittmann,224BibliographyA., Remke, M., Witt, H., Hutter, S., Tzaridis, T., Weischenfeldt, J., Raeder, B.,Avci, M., Amstislavskiy, V., Zapatka, M., Weber, U. D., Wang, Q., Lasitschka,B., Bartholomae, C. C., Schmidt, M., von Kalle, C., Ast, V., Lawerenz, C., Eils,J., Kabbe, R., Benes, V., van Sluis, P., Koster, J., Volckmann, R., Shih, D., Betts,M. J., Russell, R. B., Coco, S., Tonini, G. P., Schüller, U., Hans, V., Graf, N.,Kim, Y.-J., Monoranu, C., Roggendorf, W., Unterberg, A., Herold-Mende, C.,Milde, T., Kulozik, A. E., von Deimling, A., Witt, O., Maass, E., Rössler, J.,Ebinger, M., Schuhmann, M. U., Frühwald, M. C., Hasselblatt, M., Jabado, N.,Rutkowski, S., von Bueren, A. O., Williamson, D., Clifford, S. C., McCabe,M. G., Collins, V. P., Wolf, S., Wiemann, S., Lehrach, H., Brors, B., Scheurlen,W., Felsberg, J., Reifenberger, G., Northcott, P. A., Taylor, M. D., Meyerson,M., Pomeroy, S. L., Yaspo, M.-L., Korbel, J. O., Korshunov, A., Eils, R., Pfister,S. M., and Lichter, P. (2012). Dissecting the genomic complexity underlyingmedulloblastoma. Nature, 488(7409):100–105.Kidd, J. M., Cooper, G. M., Donahue, W. F., Hayden, H. 
S., Sampas, N., Graves,T., Hansen, N., Teague, B., Alkan, C., Antonacci, F., Haugen, E., Zerr, T., Ya-mada, N. A., Tsang, P., Newman, T. L., Tüzün, E., Cheng, Z., Ebling, H. M.,Tusneem, N., David, R., Gillett, W., Phelps, K. A., Weaver, M., Saranga, D.,Brand, A., Tao, W., Gustafson, E., McKernan, K., Chen, L., Malig, M., Smith,J. D., Korn, J. M., McCarroll, S. A., Altshuler, D. A., Peiffer, D. A., Dorschner,M., Stamatoyannopoulos, J., Schwartz, D., Nickerson, D. A., Mullikin, J. C.,Wilson, R. K., Bruhn, L., Olson, M. V., Kaul, R., Smith, D. R., and Eichler,E. E. (2008). Mapping and sequencing of structural variation from eight humangenomes. Nature, 453(7191):56–64.Knudson, A. G. (1971). Mutation and cancer: statistical study of retinoblastoma.Proc Natl Acad Sci U S A, 68(4):820–823.Korbel, J. O., Abyzov, A., Mu, X. J., Carriero, N., Cayting, P., Zhang, Z., Sny-der, M., and Gerstein, M. B. (2009). Pemer: a computational framework withsimulation-based error models for inferring genomic structural variants frommassive paired-end sequencing data. Genome Biol, 10(2):R23.Korbel, J. O. and Campbell, P. J. (2013). Criteria for inference of chromothripsisin cancer genomes. Cell, 152(6):1226–1236.Korbel, J. O., Urban, A. E., Affourtit, J. P., Godwin, B., Grubert, F., Simons, J. F.,Kim, P. M., Palejev, D., Carriero, N. J., Du, L., Taillon, B. E., Chen, Z., Tanzer,A., Saunders, A. C. E., Chi, J., Yang, F., Carter, N. P., Hurles, M. E., Weissman,S. M., Harkins, T. T., Gerstein, M. B., Egholm, M., and Snyder, M. (2007).225BibliographyPaired-end mapping reveals extensive structural variation in the human genome.Science, 318(5849):420–426.Korn, J. M., Kuruvilla, F. G., McCarroll, S. A., Wysoker, A., Nemesh, J., Cawley,S., Hubbell, E., Veitch, J., Collins, P. J., Darvishi, K., Lee, C., Nizzari, M. M.,Gabriel, S. B., Purcell, S., Daly, M. J., and Altshuler, D. (2008). Integratedgenotype calling and association analysis of snps, common copy number poly-morphisms and rare cnvs. 
Nat Genet, 40(10):1253–1260.Kreso, A., O’Brien, C. A., van Galen, P., Gan, O. I., Notta, F., Brown, A. M. K.,Ng, K., Ma, J., Wienholds, E., Dunant, C., Pollett, A., Gallinger, S., McPherson,J., Mullighan, C. G., Shibata, D., and Dick, J. E. (2013). Variable clonal repopu-lation dynamics influence chemotherapy response in colorectal cancer. Science,339(6119):543–548.Kumar, R., Wang, R. A., Mazumdar, A., Talukder, A. H., Mandal, M., Yang, Z.,Bagheri-Yarmand, R., Sahin, A., Hortobagyi, G., Adam, L., Barnes, C. J., andVadlamudi, R. K. (2002). A naturally occurring mta1 variant sequesters oestro-gen receptor-alpha in the cytoplasm. Nature, 418(6898):654–657.LaFramboise, T. (2009). Single nucleotide polymorphism arrays: a decade ofbiological, computational and technological advances. Nucleic Acids Res,37(13):4181–4193.Laframboise, T., Harrington, D., and Weir, B. A. (2007). Plasq: a generalizedlinear model-based procedure to determine allelic dosage in cancer cells fromsnp array data. Biostatistics, 8(2):323–336.LaFramboise, T., Weir, B. A., Zhao, X., Beroukhim, R., Li, C., Harrington, D.,Sellers, W. R., and Meyerson, M. (2005). Allele-specific amplification in cancerrevealed by snp array analysis. PLoS Comput Biol, 1(6):e65.Landau, D. A., Carter, S. L., Stojanov, P., McKenna, A., Stevenson, K., Lawrence,M. S., Sougnez, C., Stewart, C., Sivachenko, A., Wang, L., Wan, Y., Zhang,W., Shukla, S. A., Vartanov, A., Fernandes, S. M., Saksena, G., Cibulskis, K.,Tesar, B., Gabriel, S., Hacohen, N., Meyerson, M., Lander, E. S., Neuberg, D.,Brown, J. R., Getz, G., and Wu, C. J. (2013). Evolution and impact of subclonalmutations in chronic lymphocytic leukemia. Cell, 152(4):714–726.Langmead, B., Trapnell, C., Pop, M., and Salzberg, S. L. (2009). Ultrafastand memory-efficient alignment of short dna sequences to the human genome.Genome Biol, 10(3):R25.226BibliographyLee, E. Y., To, H., Shew, J. Y., Bookstein, R., Scully, P., and Lee, W. H. 
(1988).Inactivation of the retinoblastoma susceptibility gene in human breast cancers.Science (New York, N.Y.), 241(4862):218–221.Lengauer, C., Kinzler, K. W., and Vogelstein, B. (1998). Genetic instabilities inhuman cancers. Nature, 396(6712):643–649.Ley, T. J., Mardis, E. R., Ding, L., Fulton, B., McLellan, M. D., Chen, K., Dooling,D., Dunford-Shore, B. H., McGrath, S., Hickenbotham, M., Cook, L., Abbott,R., Larson, D. E., Koboldt, D. C., Pohl, C., Smith, S., Hawkins, A., Abbott, S.,Locke, D., Hillier, L. W., Miner, T., Fulton, L., Magrini, V., Wylie, T., Glass-cock, J., Conyers, J., Sander, N., Shi, X., Osborne, J. R., Minx, P., Gordon,D., Chinwalla, A., Zhao, Y., Ries, R. E., Payton, J. E., Westervelt, P., Tomasson,M. H., Watson, M., Baty, J., Ivanovich, J., Heath, S., Shannon, W. D., Nagarajan,R., Walter, M. J., Link, D. C., Graubert, T. A., DiPersio, J. F., and Wilson, R. K.(2008). Dna sequencing of a cytogenetically normal acute myeloid leukaemiagenome. Nature, 456(7218):66–72.Li, A., Liu, Z., Lezon-Geyda, K., Sarkar, S., Lannin, D., Schulz, V., Krop, I.,Winer, E., Harris, L., and Tuck, D. (2011). Gphmm: an integrated hiddenmarkov model for identification of copy number alteration and loss of heterozy-gosity in complex tumor samples using whole genome snp arrays. Nucleic AcidsRes, 39(12):4928–4941.Li, H. and Durbin, R. (2009). Fast and accurate short read alignment with burrows-wheeler transform. Bioinformatics, 25(14):1754–1760.Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G.,Abecasis, G., Durbin, R., and Subgroup, . G. P. D. P. (2009). The sequencealignment/map format and samtools. Bioinformatics, 25(16):2078–2079.Li, H. and Homer, N. (2010). A survey of sequence alignment algorithms for next-generation sequencing. Brief Bioinform, 11(5):473–483.Li, H., Ruan, J., and Durbin, R. (2008). Mapping short dna sequencing reads andcalling variants using mapping quality scores. 
Genome Res, 18(11):1851–1858.Li, J., Yen, C., Liaw, D., Podsypanina, K., Bose, S., Wang, S. I., Puc, J., Miliaresis,C., Rodgers, L., McCombie, R., Bigner, S. H., Giovanella, B. C., Ittmann, M.,Tycko, B., Hibshoosh, H., Wigler, M. H., and Parsons, R. (1997). Pten, a putativeprotein tyrosine phosphatase gene mutated in human brain, breast, and prostatecancer. Science (New York, N.Y.), 275(5308):1943–1947.227BibliographyLin, M., Wei, L.-J., Sellers, W. R., Lieberfarb, M., Wong, W. H., and Li, C.(2004). dchipsnp: significance curve and clustering of snp-array-based loss-of-heterozygosity data. Bioinformatics, 20(8):1233–1240.Lindblad-Toh, K., Tanenbaum, D. M., Daly, M. J., Winchester, E., Lui, W. O.,Villapakkam, A., Stanton, S. E., Larsson, C., Hudson, T. J., Johnson, B. E.,Lander, E. S., and Meyerson, M. (2000). Loss-of-heterozygosity analysis ofsmall-cell lung carcinomas using single-nucleotide polymorphism arrays. NatBiotechnol, 18(9):1001–1005.Lord, C. J. and Ashworth, A. (2012). The dna damage response and cancer therapy.Nature, 481(7381):287–294.Lucito, R., Healy, J., Alexander, J., Reiner, A., Esposito, D., Chi, M., Rodgers, L.,Brady, A., Sebat, J., Troge, J., West, J. A., Rostan, S., Nguyen, K. C. Q., Powers,S., Ye, K. Q., Olshen, A., Venkatraman, E., Norton, L., and Wigler, M. (2003).Representational oligonucleotide microarray analysis: a high-resolution methodto detect genome copy number variation. Genome research, 13(10):2291–2305.Mardis, E. R. and Wilson, R. K. (2009). Cancer genome sequencing: a review.Hum Mol Genet, 18(R2):R163–R168.Martin, M. D., Fischbach, K., Osborne, C. K., Mohsin, S. K., Allred, D. C., andO’Connell, P. (2001). Loss of heterozygosity events impeding breast cancermetastasis contain the mta1 gene. Cancer Res, 61(9):3578–3580.Martinez, P., Birkbak, N. J., Gerlinger, M., McGranahan, N., Burrell, R. A.,Rowan, A. J., Joshi, T., Fisher, R., Larkin, J., Szallasi, Z., and Swanton, C.(2013). 
Parallel evolution of tumour subclones mimics diversity between tu-mours. J Pathol, 230(4):356–364.Marusyk, A., Almendro, V., and Polyak, K. (2012). Intra-tumour heterogeneity: alooking glass for cancer? Nat Rev Cancer, 12(5):323–334.McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky,A., Garimella, K., Altshuler, D., Gabriel, S., Daly, M., and DePristo, M. A.(2010). The genome analysis toolkit: a mapreduce framework for analyzingnext-generation dna sequencing data. Genome Res, 20(9):1297–1303.McKernan, K. J., Peckham, H. E., Costa, G. L., McLaughlin, S. F., Fu, Y., Tsung,E. F., Clouser, C. R., Duncan, C., Ichikawa, J. K., Lee, C. C., Zhang, Z., Ranade,S. S., Dimalanta, E. T., Hyland, F. C., Sokolsky, T. D., Zhang, L., Sheridan,A., Fu, H., Hendrickson, C. L., Li, B., Kotler, L., Stuart, J. R., Malek, J. A.,228BibliographyManning, J. M., Antipova, A. A., Perez, D. S., Moore, M. P., Hayashibara, K. C.,Lyons, M. R., Beaudoin, R. E., Coleman, B. E., Laptewicz, M. W., Sannicandro,A. E., Rhodes, M. D., Gottimukkala, R. K., Yang, S., Bafna, V., Bashir, A.,MacBride, A., Alkan, C., Kidd, J. M., Eichler, E. E., Reese, M. G., De La Vega,F. M., and Blanchard, A. P. (2009). Sequence and structural variation in a humangenome uncovered by short-read, massively parallel ligation sequencing usingtwo-base encoding. Genome Research, 19(9):1527–1541.McPherson, A., Hormozdiari, F., Zayed, A., Giuliany, R., Ha, G., Sun, M. G. F.,Griffith, M., Heravi Moussavi, A., Senz, J., Melnyk, N., Pacheco, M., Marra,M. A., Hirst, M., Nielsen, T. O., Sahinalp, S. C., Huntsman, D., and Shah, S. P.(2011). defuse: an algorithm for gene fusion discovery in tumor rna-seq data.PLoS computational biology, 7(5):e1001138.Merico, D., Isserlin, R., Stueker, O., Emili, A., and Bader, G. D. (2010). Enrich-ment map: a network-based method for gene-set enrichment visualization andinterpretation. PLoS One, 5(11).Metzker, M. L. (2010). Sequencing technologies - the next generation. 
Nat RevGenet, 11(1):31–46.Mills, R. E., Walter, K., Stewart, C., Handsaker, R. E., Chen, K., Alkan, C., Aby-zov, A., Yoon, S. C., Ye, K., Cheetham, R. K., Chinwalla, A., Conrad, D. F., Fu,Y., Grubert, F., Hajirasouliha, I., Hormozdiari, F., Iakoucheva, L. M., Iqbal, Z.,Kang, S., Kidd, J. M., Konkel, M. K., Korn, J., Khurana, E., Kural, D., Lam,H. Y. K., Leng, J., Li, R., Li, Y., Lin, C.-Y., Luo, R., Mu, X. J., Nemesh, J.,Peckham, H. E., Rausch, T., Scally, A., Shi, X., Stromberg, M. P., Stütz, A. M.,Urban, A. E., Walker, J. A., Wu, J., Zhang, Y., Zhang, Z. D., Batzer, M. A.,Ding, L., Marth, G. T., McVean, G., Sebat, J., Snyder, M., Wang, J., Ye, K.,Eichler, E. E., Gerstein, M. B., Hurles, M. E., Lee, C., McCarroll, S. A., Korbel,J. O., and 1000 Genomes Project (2011). Mapping copy number variation bypopulation-scale genome sequencing. Nature, 470(7332):59–65.Morin, R. D., Mendez-Lago, M., Mungall, A. J., Goya, R., Mungall, K. L., Cor-bett, R. D., Johnson, N. A., Severson, T. M., Chiu, R., Field, M., Jackman, S.,Krzywinski, M., Scott, D. W., Trinh, D. L., Tamura-Wells, J., Li, S., Firme,M. R., Rogic, S., Griffith, M., Chan, S., Yakovenko, O., Meyer, I. M., Zhao,E. Y., Smailus, D., Moksa, M., Chittaranjan, S., Rimsza, L., Brooks-Wilson, A.,Spinelli, J. J., Ben-Neriah, S., Meissner, B., Woolcock, B., Boyle, M., McDon-ald, H., Tam, A., Zhao, Y., Delaney, A., Zeng, T., Tse, K., Butterfield, Y., Birol,I., Holt, R., Schein, J., Horsman, D. E., Moore, R., Jones, S. J. M., Connors,J. M., Hirst, M., Gascoyne, R. D., and Marra, M. A. (2011). Frequent mutation229Bibliographyof histone-modifying genes in non-hodgkin lymphoma. Nature, 476(7360):298–303.Mukhopadhyay, A., Plummer, E. R., Elattar, A., Soohoo, S., Uzir, B., Quinn, J. E.,McCluggage, W. G., Maxwell, P., Aneke, H., Curtin, N. J., and Edmondson, R. J.(2012). 
Clinicopathological features of homologous recombination-deficient epithelial ovarian cancers: sensitivity to parp inhibitors, platinum, and survival. Cancer research, 72(22):5675–5682.

Närvä, E., Autio, R., Rahkonen, N., Kong, L., Harrison, N., Kitsberg, D., Borghese, L., Itskovitz-Eldor, J., Rasool, O., Dvorak, P., Hovatta, O., Otonkoski, T., Tuuri, T., Cui, W., Brüstle, O., Baker, D., Maltby, E., Moore, H. D., Benvenisty, N., Andrews, P. W., Yli-Harja, O., and Lahesmaa, R. (2010). High-resolution dna analysis of human embryonic stem cell lines reveals culture-induced copy number changes and loss of heterozygosity. Nat Biotechnol, 28(4):371–7.

Navin, N., Kendall, J., Troge, J., Andrews, P., Rodgers, L., McIndoo, J., Cook, K., Stepansky, A., Levy, D., Esposito, D., Muthuswamy, L., Krasnitz, A., McCombie, W. R., Hicks, J., and Wigler, M. (2011). Tumour evolution inferred by single-cell sequencing. Nature, 472(7341):90–94.

Navin, N., Krasnitz, A., Rodgers, L., Cook, K., Meth, J., Kendall, J., Riggs, M., Eberling, Y., Troge, J., Grubor, V., Levy, D., Lundin, P., Månér, S., Zetterberg, A., Hicks, J., and Wigler, M. (2010). Inferring tumor progression from genomic heterogeneity. Genome Res, 20(1):68–80.

Nik-Zainal, S., Alexandrov, L. B., Wedge, D. C., Van Loo, P., Greenman, C. D., Raine, K., Jones, D., Hinton, J., Marshall, J., Stebbings, L. A., Menzies, A., Martin, S., Leung, K., Chen, L., Leroy, C., Ramakrishna, M., Rance, R., Lau, K. W., Mudie, L. J., Varela, I., McBride, D. J., Bignell, G. R., Cooke, S. L., Shlien, A., Gamble, J., Whitmore, I., Maddison, M., Tarpey, P. S., Davies, H. R., Papaemmanuil, E., Stephens, P. J., McLaren, S., Butler, A. P., Teague, J. W., Jönsson, G., Garber, J. E., Silver, D., Miron, P., Fatima, A., Boyault, S., Langerød, A., Tutt, A., Martens, J. W. M., Aparicio, S. A. J. R., Borg, A., Salomon, A. V., Thomas, G., Børresen-Dale, A.-L., Richardson, A. L., Neuberger, M. S., Futreal, P. A., Campbell, P. J., Stratton, M.
R., and the Breast Cancer Working Group of the International Cancer Genome Consortium (2012a). Mutational processes molding the genomes of 21 breast cancers. Cell, 149(5):979–993.

Nik-Zainal, S., Van Loo, P., Wedge, D. C., Alexandrov, L. B., Greenman, C. D., Lau, K. W., Raine, K., Jones, D., Marshall, J., Ramakrishna, M., Shlien, A., Cooke, S. L., Hinton, J., Menzies, A., Stebbings, L. A., Leroy, C., Jia, M., Rance, R., Mudie, L. J., Gamble, S. J., Stephens, P. J., McLaren, S., Tarpey, P. S., Papaemmanuil, E., Davies, H. R., Varela, I., McBride, D. J., Bignell, G. R., Leung, K., Butler, A. P., Teague, J. W., Martin, S., Jönsson, G., Mariani, O., Boyault, S., Miron, P., Fatima, A., Langerød, A., Aparicio, S. A. J. R., Tutt, A., Sieuwerts, A. M., Borg, Å., Thomas, G., Salomon, A. V., Richardson, A. L., Børresen-Dale, A.-L., Futreal, P. A., Stratton, M. R., Campbell, P. J., and Breast Cancer Working Group of the International Cancer Genome Consortium (2012b). The life history of 21 breast cancers. Cell, 149(5):994–1007.

Northcott, P. A., Shih, D. J. H., Peacock, J., Garzia, L., Morrissy, A. S., Zichner, T., Stütz, A. M., Korshunov, A., Reimand, J., Schumacher, S. E., Beroukhim, R., Ellison, D. W., Marshall, C. R., Lionel, A. C., Mack, S., Dubuc, A., Yao, Y., Ramaswamy, V., Luu, B., Rolider, A., Cavalli, F. M. G., Wang, X., Remke, M., Wu, X., Chiu, R. Y. B., Chu, A., Chuah, E., Corbett, R. D., Hoad, G. R., Jackman, S. D., Li, Y., Lo, A., Mungall, K. L., Nip, K. M., Qian, J. Q., Raymond, A. G. J., Thiessen, N. T., Varhol, R. J., Birol, I., Moore, R. A., Mungall, A. J., Holt, R., Kawauchi, D., Roussel, M. F., Kool, M., Jones, D. T. W., Witt, H., Fernandez-L, A., Kenney, A. M., Wechsler-Reya, R. J., Dirks, P., Aviv, T., Grajkowska, W. A., Perek-Polnik, M., Haberler, C. C., Delattre, O., Reynaud, S. S., Doz, F. F., Pernet-Fattet, S. S., Cho, B.-K., Kim, S.-K., Wang, K.-C., Scheurlen, W., Eberhart, C. G., Fèvre-Montange, M., Jouvet, A., Pollack, I. F., Fan, X., Muraszko, K.
M., Gillespie, G. Y., Di Rocco, C., Massimi, L., Michiels, E. M. C., Kloosterhof, N. K., French, P. J., Kros, J. M., Olson, J. M., Ellenbogen, R. G., Zitterbart, K., Kren, L., Thompson, R. C., Cooper, M. K., Lach, B., McLendon, R. E., Bigner, D. D., Fontebasso, A., Albrecht, S., Jabado, N., Lindsey, J. C., Bailey, S., Gupta, N., Weiss, W. A., Bognár, L., Klekner, A., Van Meter, T. E., Kumabe, T., Tominaga, T., Elbabaa, S. K., Leonard, J. R., Rubin, J. B., Liau, L. M., Van Meir, E. G., Fouladi, M., Nakamura, H., Cinalli, G., Garami, M., Hauser, P., Saad, A. G., Iolascon, A., Jung, S., Carlotti, C. G., Vibhakar, R., Ra, Y. S., Robinson, S., Zollo, M., Faria, C. C., Chan, J. A., Levy, M. L., Sorensen, P. H. B., Meyerson, M., Pomeroy, S. L., Cho, Y.-J., Bader, G. D., Tabori, U., Hawkins, C. E., Bouffet, E., Scherer, S. W., Rutka, J. T., Malkin, D., Clifford, S. C., Jones, S. J. M., Korbel, J. O., Pfister, S. M., Marra, M. A., and Taylor, M. D. (2012). Subgroup-specific structural variation across 1,000 medulloblastoma genomes. Nature, 488(7409):49–56.

Nowell, P. C. (1976). The clonal evolution of tumor cell populations. Science, 194(4260):23–8.

Nowell, P. C. (2007). Discovery of the philadelphia chromosome: a personal perspective. The Journal of clinical investigation, 117(8):2033–2035.

Nowell, P. C. and Hungerford, D. A. (1960). Chromosome studies on normal and leukemic human leukocytes. Journal of the National Cancer Institute, 25:85–109.

Oesper, L., Mahmoody, A., and Raphael, B. J. (2013). Theta: Inferring intra-tumor heterogeneity from high-throughput dna sequencing data. Genome biology, 14(7):R80.

Ogiwara, H., Kohno, T., Nakanishi, H., Nagayama, K., Sato, M., and Yokota, J. (2008). Unbalanced translocation, a major chromosome alteration causing loss of heterozygosity in human lung cancer. Oncogene, 27(35):4788–97.

Olshen, A. B., Venkatraman, E. S., Lucito, R., and Wigler, M. (2004). Circular binary segmentation for the analysis of array-based dna copy number data.
Biostatistics, 5(4):557–572.

Ortiz-Estevez, M., Bengtsson, H., and Rubio, A. (2010). Acne: a summarization method to estimate allele-specific copy numbers for affymetrix snp arrays. Bioinformatics, 26(15):1827–1833.

Ostrovnaya, I., Nanjangud, G., and Olshen, A. B. (2010). A classification model for distinguishing copy number variants from cancer-related alterations. BMC Bioinformatics, 11(1):297–297.

Paradis, E., Claude, J., and Strimmer, K. (2004). Ape: Analyses of phylogenetics and evolution in r language. Bioinformatics, 20(2):289–290.

Parker, J. S., Mullins, M., Cheang, M. C. U., Leung, S., Voduc, D., Vickery, T., Davies, S., Fauron, C., He, X., Hu, Z., Quackenbush, J. F., Stijleman, I. J., Palazzo, J., Marron, J. S., Nobel, A. B., Mardis, E., Nielsen, T. O., Ellis, M. J., Perou, C. M., and Bernard, P. S. (2009). Supervised risk predictor of breast cancer based on intrinsic subtypes. J Clin Oncol, 27(8):1160–7.

Pastinen, T. and Hudson, T. J. (2004). Cis-acting regulatory variation in the human genome. Science, 306(5696):647–650.

Payne, S. R. and Kemp, C. J. (2005). Tumor suppressor genetics. Carcinogenesis, 26(12):2031–2045.

Peiffer, D. A., Le, J. M., Steemers, F. J., Chang, W., Jenniges, T., Garcia, F., Haden, K., Li, J., Shaw, C. A., Belmont, J., Cheung, S. W., Shen, R. M., Barker, D. L., and Gunderson, K. L. (2006). High-resolution genomic profiling of chromosomal aberrations using infinium whole-genome genotyping. Genome Res, 16(9):1136–1148.

Perou, C. M., Sørlie, T., Eisen, M. B., van de Rijn, M., Jeffrey, S. S., Rees, C. A., Pollack, J. R., Ross, D. T., Johnsen, H., Akslen, L. A., Fluge, O., Pergamenschikov, A., Williams, C., Zhu, S. X., Lønning, P. E., Børresen-Dale, A. L., Brown, P. O., and Botstein, D. (2000). Molecular portraits of human breast tumours. Nature, 406(6797):747–752.

Pinkel, D., Segraves, R., Sudar, D., Clark, S., Poole, I., Kowbel, D., Collins, C., Kuo, W. L., Chen, C., Zhai, Y., Dairkee, S. H., Ljung, B. M., Gray, J. W., and Albertson, D. G.
(1998). High resolution analysis of dna copy number variation using comparative genomic hybridization to microarrays. Nature genetics, 20(2):207–211.

Pinkel, D., Straume, T., and Gray, J. W. (1986). Cytogenetic analysis using quantitative, high-sensitivity, fluorescence hybridization. Proceedings of the National Academy of Sciences of the United States of America, 83(9):2934–2938.

Pleasance, E. D., Cheetham, R. K., Stephens, P. J., McBride, D. J., Humphray, S. J., Greenman, C. D., Varela, I., Lin, M.-L., Ordóñez, G. R., Bignell, G. R., Ye, K., Alipaz, J., Bauer, M. J., Beare, D., Butler, A., Carter, R. J., Chen, L., Cox, A. J., Edkins, S., Kokko-Gonzales, P. I., Gormley, N. A., Grocock, R. J., Haudenschild, C. D., Hims, M. M., James, T., Jia, M., Kingsbury, Z., Leroy, C., Marshall, J., Menzies, A., Mudie, L. J., Ning, Z., Royce, T., Schulz-Trieglaff, O. B., Spiridou, A., Stebbings, L. A., Szajkowski, L., Teague, J., Williamson, D., Chin, L., Ross, M. T., Campbell, P. J., Bentley, D. R., Futreal, P. A., and Stratton, M. R. (2010a). A comprehensive catalogue of somatic mutations from a human cancer genome. Nature, 463(7278):191–6.

Pleasance, E. D., Stephens, P. J., O’Meara, S., McBride, D. J., Meynert, A., Jones, D., Lin, M.-L., Beare, D., Lau, K. W., Greenman, C., Varela, I., Nik-Zainal, S., Davies, H. R., Ordoñez, G. R., Mudie, L. J., Latimer, C., Edkins, S., Stebbings, L., Chen, L., Jia, M., Leroy, C., Marshall, J., Menzies, A., Butler, A., Teague, J. W., Mangion, J., Sun, Y. A., McLaughlin, S. F., Peckham, H. E., Tsung, E. F., Costa, G. L., Lee, C. C., Minna, J. D., Gazdar, A., Birney, E., Rhodes, M. D., McKernan, K. J., Stratton, M. R., Futreal, P. A., and Campbell, P. J. (2010b). A small-cell lung cancer genome with complex signatures of tobacco exposure. Nature, 463(7278):184–90.

Pollack, J. R., Sørlie, T., Perou, C. M., Rees, C. A., Jeffrey, S. S., Lonning, P. E., Tibshirani, R., Botstein, D., Børresen-Dale, A.-L., and Brown, P. O.
(2002). Microarray analysis reveals a major direct role of dna copy number alteration in the transcriptional program of human breast tumors. Proceedings of the National Academy of Sciences of the United States of America, 99(20):12963–12968.

Potter, N. E., Ermini, L., Papaemmanuil, E., Cazzaniga, G., Vijayaraghavan, G., Titley, I., Ford, A., Campbell, P., Kearney, L., and Greaves, M. (2013). Single-cell mutational profiling and clonal phylogeny in cancer. Genome Research, 23(12):2115–2125.

Pruitt, K. D., Tatusova, T., and Maglott, D. R. (2007). NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucl. Acids Res., 35(suppl_1):D61–65.

Rausch, T., Jones, D. T. W., Zapatka, M., Stütz, A. M., Zichner, T., Weischenfeldt, J., Jäger, N., Remke, M., Shih, D., Northcott, P. A., Pfaff, E., Tica, J., Wang, Q., Massimi, L., Witt, H., Bender, S., Pleier, S., Cin, H., Hawkins, C., Beck, C., von Deimling, A., Hans, V., Brors, B., Eils, R., Scheurlen, W., Blake, J., Benes, V., Kulozik, A. E., Witt, O., Martin, D., Zhang, C., Porat, R., Merino, D. M., Wasserman, J., Jabado, N., Fontebasso, A., Bullinger, L., Rücker, F. G., Döhner, K., Döhner, H., Koster, J., Molenaar, J. J., Versteeg, R., Kool, M., Tabori, U., Malkin, D., Korshunov, A., Taylor, M. D., Lichter, P., Pfister, S. M., and Korbel, J. O. (2012). Genome sequencing of pediatric medulloblastoma links catastrophic dna rearrangements with tp53 mutations. Cell, 148(1-2):59–71.

Rebbeck, T. R. (1997). Molecular epidemiology of the human glutathione s-transferase genotypes gstm1 and gstt1 in cancer susceptibility. Cancer epidemiology, biomarkers & prevention : a publication of the American Association for Cancer Research, cosponsored by the American Society of Preventive Oncology, 6(9):733–743.

Redon, R., Ishikawa, S., Fitch, K. R., Feuk, L., Perry, G. H., Andrews, T. D., Fiegler, H., Shapero, M. H., Carson, A. R., Chen, W., Cho, E. K., Dallaire, S., Freeman, J.
L., González, J. R., Gratacòs, M., Huang, J., Kalaitzopoulos, D., Komura, D., MacDonald, J. R., Marshall, C. R., Mei, R., Montgomery, L., Nishimura, K., Okamura, K., Shen, F., Somerville, M. J., Tchinda, J., Valsesia, A., Woodwark, C., Yang, F., Zhang, J., Zerjal, T., Zhang, J., Armengol, L., Conrad, D. F., Estivill, X., Tyler-Smith, C., Carter, N. P., Aburatani, H., Lee, C., Jones, K. W., Scherer, S. W., and Hurles, M. E. (2006). Global variation in copy number in the human genome. Nature, 444(7118):444–454.

Ritchie, M. E., Carvalho, B. S., Hetrick, K. N., Tavaré, S., and Irizarry, R. A. (2009). R/bioconductor software for illumina’s infinium whole-genome genotyping beadchips. Bioinformatics, 25(19):2621–2623.

Romond, E. H., Perez, E. A., Bryant, J., Suman, V. J., Geyer, C. E., Davidson, N. E., Tan-Chiu, E., Martino, S., Paik, S., Kaufman, P. A., Swain, S. M., Pisansky, T. M., Fehrenbacher, L., Kutteh, L. A., Vogel, V. G., Visscher, D. W., Yothers, G., Jenkins, R. B., Brown, A. M., Dakhil, S. R., Mamounas, E. P., Lingle, W. L., Klein, P. M., Ingle, J. N., and Wolmark, N. (2005). Trastuzumab plus adjuvant chemotherapy for operable her2-positive breast cancer. The New England journal of medicine, 353(16):1673–1684.

Roth, A., Ding, J., Morin, R., Crisan, A., Ha, G., Giuliany, R., Bashashati, A., Hirst, M., Turashvili, G., Oloumi, A., Marra, M. A., Aparicio, S., and Shah, S. P. (2012). Jointsnvmix: a probabilistic model for accurate detection of somatic mutations in normal/tumour paired next-generation sequencing data. Bioinformatics, 28(7):907–913.

Roth, A., Khattra, J., Yap, D., Wan, A., Laks, E., Biele, J., Ha, G., Aparicio, S., Bouchard-Côté, A., and Shah, S. P. (2014). Pyclone: statistical inference of clonal population structure in cancer. Nature methods, 11(4):396–398.

Rouzier, R., Perou, C. M., Symmans, W. F., Ibrahim, N., Cristofanilli, M., Anderson, K., Hess, K. R., Stec, J., Ayers, M., Wagner, P., Morandi, P., Fan, C., Rabiul, I., Ross, J. S., Hortobagyi, G.
N., and Pusztai, L. (2005). Breast cancer molecular subtypes respond differently to preoperative chemotherapy. Clin Cancer Res, 11(16):5678–5685.

Scharpf, R. B., Ruczinski, I., Carvalho, B., Doan, B., Chakravarti, A., and Irizarry, R. A. (2011). A multilevel model to address batch effects in copy number estimation using snp arrays. Biostatistics, 12(1):33–50.

Schwarz, R. F., Trinh, A., Sipos, B., Brenton, J. D., Goldman, N., and Markowetz, F. (2014). Phylogenetic quantification of intra-tumour heterogeneity. PLoS computational biology, 10(4):e1003535.

Sebat, J., Lakshmi, B., Troge, J., Alexander, J., Young, J., Lundin, P., Månér, S., Massa, H., Walker, M., Chi, M., Navin, N., Lucito, R., Healy, J., Hicks, J., Ye, K., Reiner, A., Gilliam, T. C., Trask, B., Patterson, N., Zetterberg, A., and Wigler, M. (2004). Large-scale copy number polymorphism in the human genome. Science, 305(5683):525–8.

Shah, S. P., Morin, R. D., Khattra, J., Prentice, L., Pugh, T., Burleigh, A., Delaney, A., Gelmon, K., Guliany, R., Senz, J., Steidl, C., Holt, R. A., Jones, S., Sun, M., Leung, G., Moore, R., Severson, T., Taylor, G. A., Teschendorff, A. E., Tse, K., Turashvili, G., Varhol, R., Warren, R. L., Watson, P., Zhao, Y., Caldas, C., Huntsman, D., Hirst, M., Marra, M. A., and Aparicio, S. (2009a). Mutational evolution in a lobular breast tumour profiled at single nucleotide resolution. Nature, 461(7265):809–813.

Shah, S. P., Morin, R. D., Khattra, J., Prentice, L., Pugh, T., Burleigh, A., Delaney, A., Gelmon, K., Guliany, R., Senz, J., Steidl, C., Holt, R. A., Jones, S., Sun, M., Leung, G., Moore, R., Severson, T., Taylor, G. A., Teschendorff, A. E., Tse, K., Turashvili, G., Varhol, R., Warren, R. L., Watson, P., Zhao, Y., Caldas, C., Huntsman, D., Hirst, M., Marra, M. A., and Aparicio, S. (2009b). Mutational evolution in a lobular breast tumour profiled at single nucleotide resolution. Nature, 461(7265):809–13.

Shah, S.
P., Roth, A., Goya, R., Oloumi, A., Ha, G., Zhao, Y., Turashvili, G., Ding, J., Tse, K., Haffari, G., Bashashati, A., Prentice, L. M., Khattra, J., Burleigh, A., Yap, D., Bernard, V., McPherson, A., Shumansky, K., Crisan, A., Giuliany, R., Heravi-Moussavi, A., Rosner, J., Lai, D., Birol, I., Varhol, R., Tam, A., Dhalla, N., Zeng, T., Ma, K., Chan, S. K., Griffith, M., Moradian, A., Cheng, S. W., Morin, G. B., Watson, P., Gelmon, K., Chia, S., Chin, S. F., Curtis, C., Rueda, O. M., Pharoah, P. D., Damaraju, S., Mackey, J., Hoon, K., Harkins, T., Tadigotla, V., Sigaroudinia, M., Gascard, P., Tlsty, T., Costello, J. F., Meyer, I. M., Eaves, C. J., Wasserman, W. W., Jones, S., Huntsman, D., Hirst, M., Caldas, C., Marra, M. A., and Aparicio, S. (2012). The clonal and mutational evolution spectrum of primary triple-negative breast cancers. Nature, 486(7403):395–399.

Shah, S. P., Xuan, X., DeLeeuw, R. J., Khojasteh, M., Lam, W. L., Ng, R., and Murphy, K. P. (2006). Integrating copy number polymorphisms into array cgh analysis using a robust hmm. Bioinformatics, 22(14):e431–9.

Sharp, A. J., Locke, D. P., McGrath, S. D., Cheng, Z., Bailey, J. A., Vallente, R. U., Pertz, L. M., Clark, R. A., Schwartz, S., Segraves, R., Oseroff, V. V., Albertson, D. G., Pinkel, D., and Eichler, E. E. (2005). Segmental duplications and copy-number variation in the human genome. Am J Hum Genet, 77(1):78–88.

Shen, R., Olshen, A. B., and Ladanyi, M. (2009). Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics, 25(22):2906–2912.

Shlien, A. and Malkin, D. (2009). Copy number variations and cancer. Genome medicine, 1(6):62.

Smoot, M. E., Ono, K., Ruscheinski, J., Wang, P. L., and Ideker, T. (2011). Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics, 27(3):431–432.

Soneson, C., Lilljebjörn, H., Fioretos, T., and Fontes, M. (2010).
Integrative analysis of gene expression and copy number alterations using canonical correlation analysis. BMC bioinformatics, 11:191.

Sørlie, T., Perou, C. M., Tibshirani, R., Aas, T., Geisler, S., Johnsen, H., Hastie, T., Eisen, M. B., van de Rijn, M., Jeffrey, S. S., Thorsen, T., Quist, H., Matese, J. C., Brown, P. O., Botstein, D., Lønning, P. E., and Børresen-Dale, A. L. (2001). Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proceedings of the National Academy of Sciences of the United States of America, 98(19):10869–10874.

Sottoriva, A., Spiteri, I., Piccirillo, S. G. M., Touloumis, A., Collins, V. P., Marioni, J. C., Curtis, C., Watts, C., and Tavaré, S. (2013). Intratumor heterogeneity in human glioblastoma reflects cancer evolutionary dynamics. Proc Natl Acad Sci U S A, 110(10):4009–14.

Staaf, J., Lindgren, D., Vallon-Christersson, J., Isaksson, A., Göransson, H., Juliusson, G., Rosenquist, R., Höglund, M., Borg, A., and Ringnér, M. (2008a). Segmentation-based detection of allelic imbalance and loss-of-heterozygosity in cancer cells using whole genome snp arrays. Genome Biol, 9(9).

Staaf, J., Vallon-Christersson, J., Lindgren, D., Juliusson, G., Rosenquist, R., Höglund, M., Borg, A., and Ringnér, M. (2008b). Normalization of illumina infinium whole-genome snp data improves copy number estimates and allelic intensity ratios. BMC Bioinformatics, 9:409–409.

Stephens, P. J., Greenman, C. D., Fu, B., Yang, F., Bignell, G. R., Mudie, L. J., Pleasance, E. D., Lau, K. W., Beare, D., Stebbings, L. A., McLaren, S., Lin, M. L., McBride, D. J., Varela, I., Nik-Zainal, S., Leroy, C., Jia, M., Menzies, A., Butler, A. P., Teague, J. W., Quail, M. A., Burton, J., Swerdlow, H., Carter, N. P., Morsberger, L. A., Iacobuzio-Donahue, C., Follows, G. A., Green, A. R., Flanagan, A. M., Stratton, M. R., Futreal, P. A., and Campbell, P. J.
(2011). Massive genomic rearrangement acquired in a single catastrophic event during cancer development. Cell, 144(1):27–40.

Stephens, P. J., Tarpey, P. S., Davies, H., Van Loo, P., Greenman, C., Wedge, D. C., Nik-Zainal, S., Martin, S., Varela, I., Bignell, G. R., Yates, L. R., Papaemmanuil, E., Beare, D., Butler, A., Cheverton, A., Gamble, J., Hinton, J., Jia, M., Jayakumar, A., Jones, D., Latimer, C., Lau, K. W., McLaren, S., McBride, D. J., Menzies, A., Mudie, L., Raine, K., Rad, R., Chapman, M. S., Teague, J., Easton, D., Langerød, A., Oslo Breast Cancer Consortium (OSBREAC), Lee, M. T. M., Shen, C.-Y., Tee, B. T. K., Huimin, B. W., Broeks, A., Vargas, A. C., Turashvili, G., Martens, J., Fatima, A., Miron, P., Chin, S.-F., Thomas, G., Boyault, S., Mariani, O., Lakhani, S. R., van de Vijver, M., van ’t Veer, L., Foekens, J., Desmedt, C., Sotiriou, C., Tutt, A., Caldas, C., Reis-Filho, J. S., Aparicio, S. A. J. R., Salomon, A. V., Børresen-Dale, A.-L., Richardson, A. L., Campbell, P. J., Futreal, P. A., and Stratton, M. R. (2012). The landscape of cancer genes and mutational processes in breast cancer. Nature, 486(7403):400–404.

Stratton, M. R., Campbell, P. J., and Futreal, P. A. (2009). The cancer genome. Nature, 458(7239):719–24.

Szerlip, N. J., Pedraza, A., Chakravarty, D., Azim, M., McGuire, J., Fang, Y., Ozawa, T., Holland, E. C., Huse, J. T., Jhanwar, S., Leversha, M. A., Mikkelsen, T., and Brennan, C. W. (2012). Intratumoral heterogeneity of receptor tyrosine kinases egfr and pdgfra amplification in glioblastoma defines subpopulations with distinct growth factor response. Proc Natl Acad Sci U S A, 109(8):3041–6.

Takata, M., Sasaki, M. S., Tachiiri, S., Fukushima, T., Sonoda, E., Schild, D., Thompson, L. H., and Takeda, S. (2001). Chromosome instability and defective recombinational repair in knockout mutants of the five rad51 paralogs. Mol Cell Biol, 21(8):2858–2866.

Tanaka, H. and Yao, M.-C. (2009).
Palindromic gene amplification–an evolutionarily conserved role for dna inverted repeats in the genome. Nat Rev Cancer, 9(3):216–224.

Thacker, J. and Zdzienicka, M. Z. (2003). The mammalian xrcc genes: their roles in dna repair and genetic stability. DNA Repair (Amst), 2(6):655–672.

Thierry-Mieg, D. and Thierry-Mieg, J. (2006). AceView: a comprehensive cDNA-supported gene and transcripts annotation. Genome Biology, 7(Suppl 1):S12.

Torres, E. M., Dephoure, N., Panneerselvam, A., Tucker, C. M., Whittaker, C. A., Gygi, S. P., Dunham, M. J., and Amon, A. (2010). Identification of aneuploidy-tolerating mutations. Cell, 143(1):71–83.

Turner, C. E. (2000). Paxillin and focal adhesion signalling. Nat Cell Biol, 2(12):231–236.

Tuzun, E., Sharp, A. J., Bailey, J. A., Kaul, R., Morrison, V. A., Pertz, L. M., Haugen, E., Hayden, H., Albertson, D., Pinkel, D., Olson, M. V., and Eichler, E. E. (2005). Fine-scale structural variation of the human genome. Nat Genet, 37(7):727–732.

Van Loo, P., Nordgard, S. H., Lingjærde, O. C., Russnes, H. G., Rye, I. H., Sun, W., Weigman, V. J., Marynen, P., Zetterberg, A., Naume, B., Perou, C. M., Borresen-Dale, A. L., and Kristensen, V. N. (2010). Allele-specific copy number analysis of tumors. Proc Natl Acad Sci U S A, 107(39):16910–16915.

Varley, K. E., Mutch, D. G., Edmonston, T. B., Goodfellow, P. J., and Mitra, R. D. (2009). Intra-tumor heterogeneity of mlh1 promoter methylation revealed by deep single molecule bisulfite sequencing. Nucleic Acids Res, 37(14):4603–4612.

Venkatraman, E. S. and Olshen, A. B. (2007). A faster circular binary segmentation algorithm for the analysis of array cgh data. Bioinformatics, 23(6):657–663.

Venkitaraman, A. R. (2014). Cancer suppression by the chromosome custodians, brca1 and brca2. Science (New York, N.Y.), 343(6178):1470–1475.

Walker, E. J., Zhang, C., Castelo-Branco, P., Hawkins, C., Wilson, W., Zhukova, N., Alon, N., Novokmet, A., Baskin, B., Ray, P., Knobbe, C., Dirks, P., Taylor, M.
D., Croul, S., Malkin, D., and Tabori, U. (2012). Monoallelic expression determines oncogenic progression and outcome in benign and malignant brain tumors. Cancer Res, 72(3):636–644.

Walsh, T., Lee, M. K., Casadei, S., Thornton, A. M., Stray, S. M., Pennil, C., Nord, A. S., Mandell, J. B., Swisher, E. M., and King, M.-C. (2010). Detection of inherited mutations for breast and ovarian cancer using genomic capture and massively parallel sequencing. Proceedings of the National Academy of Sciences of the United States of America, 107(28):12629–12633.

Wang, K., Li, M., Hadley, D., Liu, R., Glessner, J., Grant, S. F., Hakonarson, H., and Bucan, M. (2007). Penncnv: an integrated hidden markov model designed for high-resolution copy number variation detection in whole-genome snp genotyping data. Genome Res, 17(11):1665–1674.

Welcsh, P. L. and King, M. C. (2001). Brca1 and brca2 and the genetics of breast and ovarian cancer. Hum Mol Genet, 10(7):705–713.

Wiegand, K. C., Shah, S. P., Al-Agha, O. M., Zhao, Y., Tse, K., Zeng, T., Senz, J., McConechy, M. K., Anglesio, M. S., Kalloger, S. E., Yang, W., Heravi-Moussavi, A., Giuliany, R., Chow, C., Fee, J., Zayed, A., Prentice, L., Melnyk, N., Turashvili, G., Delaney, A. D., Madore, J., Yip, S., McPherson, A. W., Ha, G., Bell, L., Fereday, S., Tam, A., Galletta, L., Tonin, P. N., Provencher, D., Miller, D., Jones, S. J., Moore, R. A., Morin, G. B., Oloumi, A., Boyd, N., Aparicio, S. A., Shih, I. e. M., Mes-Masson, A. M., Bowtell, D. D., Hirst, M., Gilks, B., Marra, M. A., and Huntsman, D. G. (2010). Arid1a mutations in endometriosis-associated ovarian carcinomas. N Engl J Med, 363(16):1532–1543.

Wu, G., Feng, X., and Stein, L. (2010). A human functional protein interaction network and its application to cancer data analysis. Genome Biol, 11(5).

Wu, X., Northcott, P. A., Dubuc, A., Dupuy, A. J., Shih, D. J. H., Witt, H., Croul, S., Bouffet, E., Fults, D. W., Eberhart, C. G., Garzia, L., Meter, T.
V., Zagzag, D., Jabado, N., Schwartzentruber, J., Majewski, J., Scheetz, T. E., Pfister, S. M., Korshunov, A., Li, X.-N., Scherer, S. W., Cho, Y.-J., Akagi, K., MacDonald, T. J., Koster, J., McCabe, M. G., Sarver, A. L., Collins, V. P., Weiss, W. A., Largaespada, D. A., Collier, L. S., and Taylor, M. D. (2012). Clonal selection drives genetic divergence of metastatic medulloblastoma. Nature, 482(7386):529–533.

Xi, R., Hadjipanayis, A. G., Luquette, L. J., Kim, T. M., Lee, E., Zhang, J., Johnson, M. D., Muzny, D. M., Wheeler, D. A., Gibbs, R. A., Kucherlapati, R., and Park, P. J. (2011a). Copy number variation detection in whole-genome sequencing data using the bayesian information criterion. Proc Natl Acad Sci U S A, 108(46):1128–1136.

Xi, R., Hadjipanayis, A. G., Luquette, L. J., Kim, T.-M., Lee, E., Zhang, J., Johnson, M. D., Muzny, D. M., Wheeler, D. A., Gibbs, R. A., Kucherlapati, R., and Park, P. J. (2011b). Copy number variation detection in whole-genome sequencing data using the bayesian information criterion. Proc Natl Acad Sci U S A, 108(46):E1128–36.

Xu, X., Hou, Y., Yin, X., Bao, L., Tang, A., Song, L., Li, F., Tsang, S., Wu, K., Wu, H., He, W., Zeng, L., Xing, M., Wu, R., Jiang, H., Liu, X., Cao, D., Guo, G., Hu, X., Gui, Y., Li, Z., Xie, W., Sun, X., Shi, M., Cai, Z., Wang, B., Zhong, M., Li, J., Lu, Z., Gu, N., Zhang, X., Goodman, L., Bolund, L., Wang, J., Yang, H., Kristiansen, K., Dean, M., Li, Y., and Wang, J. (2012). Single-cell exome sequencing reveals single-nucleotide mutation characteristics of a kidney tumor. Cell, 148(5):886–895.

Yachida, S., Jones, S., Bozic, I., Antal, T., Leary, R., Fu, B., Kamiyama, M., Hruban, R. H., Eshleman, J. R., Nowak, M. A., Velculescu, V. E., Kinzler, K. W., Vogelstein, B., and Iacobuzio-Donahue, C. A. (2010). Distant metastasis occurs late during the genetic evolution of pancreatic cancer. Nature, 467(7319):1114–7.

Yang, L., Luquette, L. J., Gehlenborg, N., Xi, R., Haseley, P. S., Hsieh, C.
H., Zhang, C., Ren, X., Protopopov, A., Chin, L., Kucherlapati, R., Lee, C., and Park, P. J. (2013). Diverse mechanisms of somatic structural variations in human cancer genomes. Cell, 153(4):919–929.

Yates, L. R. and Campbell, P. J. (2012). Evolution of the cancer genome. Nat Rev Genet, 13(11):795–806.

Yau, C. (2013). Oncosnp-seq: a statistical approach for the identification of somatic copy number alterations from next-generation sequencing of cancer genomes. Bioinformatics (Oxford, England), 29(19):2482–2484.

Yau, C. and Holmes, C. C. (2008). Cnv discovery using snp genotyping arrays. Cytogenet Genome Res, 123(1-4):307–312.

Yau, C., Mouradov, D., Jorissen, R. N., Colella, S., Mirza, G., Steers, G., Harris, A., Ragoussis, J., Sieber, O., and Holmes, C. C. (2010). A statistical approach for detecting genomic aberrations in heterogeneous tumor samples from single nucleotide polymorphism genotyping data. Genome Biol, 11(9).

Ye, K., Schulz, M. H., Long, Q., Apweiler, R., and Ning, Z. (2009). Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics, 25(21):2865–2871.

Zack, T. I., Schumacher, S. E., Carter, S. L., Cherniack, A. D., Saksena, G., Tabak, B., Lawrence, M. S., Zhang, C.-Z., Wala, J., Mermel, C. H., Sougnez, C., Gabriel, S. B., Hernandez, B., Shen, H., Laird, P. W., Getz, G., Meyerson, M., and Beroukhim, R. (2013). Pan-cancer patterns of somatic copy number alteration. Nat Genet, 45(10):1134–1140.

Zhao, Q., Kirkness, E. F., Caballero, O. L., Galante, P. A., Parmigiani, R. B., Edsall, L., Kuan, S., Ye, Z., Levy, S., Vasconcelos, A. T., Ren, B., de Souza, S. J., Camargo, A. A., Simpson, A. J., and Strausberg, R. L. (2010). Systematic detection of putative tumor suppressor genes through the combined use of exome and transcriptome sequencing. Genome Biol, 11(11).

Zhao, X., Li, C., Paez, J. G., Chin, K., Jänne, P. A., Chen, T.-H., Girard, L., Minna, J., Christiani, D., Leo, C., Gray, J.
W., Sellers, W. R., and Meyerson, M. (2004). An integrated view of copy number and allelic alterations in the cancer genome using single nucleotide polymorphism arrays. Cancer research, 64(9):3060–3071.

Zheng, W., Wen, W.-Q., Gustafson, D. R., Gross, M., Cerhan, J. R., and Folsom, A. R. (2002). Gstm1 and gstt1 polymorphisms and postmenopausal breast cancer risk. Breast cancer research and treatment, 74(1):9–16.

Appendix A

HMMcopy: Copy number analysis of WGS data

The GC content correction and pre-processing pipeline prior to segmentation was co-developed with Daniel Lai and can be downloaded from http://compbio.bccrc.ca/software/hmmcopy/ and http://www.bioconductor.org/packages/2.12/bioc/html/HMMcopy.html.

A.1 HMMcopy workflow

A.1.1 Determine genomic windows that have 1000 reads mapped

The input data for the HMM are first preprocessed to determine fixed-size windows (bins) across the genome for each sample. This step reduces the problem to ~3 million loci. A set of chromosomal windows, $R$, is extracted such that each region $r \in R$ is 1000 base pairs in length, with start and end coordinates $r_{start}$ and $r_{end}$.

A.1.2 Obtain copy number read counts for normal and tumour for each window

Next, separately for the normal and tumour genomes of each patient, the total read depth for each window in $R$ is extracted using BAMtools (Barnett et al., 2011), resulting in vectors of read count data for the normal, $N_R = (n_1, \ldots, n_{|R|})$, and the tumour, $T_R = (t_1, \ldots, t_{|R|})$.

A.1.3 GC content correction of normal and tumour read counts

GC content bias correction was performed for the tumour and normal of each patient separately. A global loess fit is applied between GC content and read depth for windows in $R$. Because fitting 3 million data points is computationally restrictive, outlier windows with read depths in the upper or lower 1% quantile are excluded, and a random sample of 20,000 of the remaining windows is used to generate the loess curve.
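The selection of windows used for the loess fit can be illustrated with a minimal Python sketch. The function name `select_windows_for_loess`, its parameters, the inclusive handling of the quantile boundaries and the fixed random seed are all assumptions made for illustration; they are not taken from the HMMcopy implementation.

```python
import random
import statistics


def select_windows_for_loess(gc, depth, trim_pct=1, max_points=20000, seed=0):
    """Return the (gc, depth) pairs that would be used to fit the GC-bias curve.

    Hypothetical helper: windows whose read depth lies below the lower or
    above the upper `trim_pct` percentile are dropped, then at most
    `max_points` of the remaining windows are sampled at random.
    """
    cuts = statistics.quantiles(depth, n=100)        # 99 percentile cut points
    low, high = cuts[trim_pct - 1], cuts[-trim_pct]  # e.g. 1st and 99th percentile
    kept = [(g, d) for g, d in zip(gc, depth) if low <= d <= high]
    if len(kept) > max_points:
        kept = random.Random(seed).sample(kept, max_points)
    return kept


# Toy data (hypothetical): 1000 windows with depths 1..1000 and constant GC.
toy_depth = list(range(1, 1001))
toy_gc = [0.4] * 1000
trimmed = select_windows_for_loess(toy_gc, toy_depth)
subsample = select_windows_for_loess(toy_gc, toy_depth, max_points=100)
```

The loess curve itself would then be fitted to the returned pairs with any local-regression routine.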
Finally, the read depths of all windows, $N_R$ and $T_R$, are corrected by dividing the observed value by the loess fitted value (Equation A.1).

$$\text{corrected read depth} = \frac{\text{observed read depth}}{\text{loess fitted value}} \tag{A.1}$$

A.1.4 Correcting read counts for highly mappable sequences

Using the "ENCODE Duke Uniqueness of 35bp sequences" track from UCSC (http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg18&g=wgEncodeMapability), windows that fall within repetitive regions are filtered out, so that the remaining windows are highly mappable by aligners. Only windows with a mappability score of ≥ 0.9 were retained. This removed extremely amplified positions that would otherwise have been confounding outliers in downstream segmentation analysis.

A.1.5 Normalizing copy number in tumours

The GC-corrected normal $N_R$ and tumour $T_R$ counts are normalized independently to generate $\bar{N}_R = (\bar{n}_1, \ldots, \bar{n}_{|R|})$ and $\bar{T}_R = (\bar{t}_1, \ldots, \bar{t}_{|R|})$, respectively, where $\bar{n}_i = n_i / \sum_j n_j$ and $\bar{t}_i = t_i / \sum_j t_j$, $i \in \{1, \ldots, |R|\}$. To obtain the final tumour copy number observed at each locus $r \in R$, a further normalization step generates the log2 ratio between tumour and normal copy number, $T_N(i) = \log_2(\bar{t}_i / \bar{n}_i)$, $i \in \{1, \ldots, |R|\}$.

A.1.6 Segmentation and copy number prediction via HMM

The 6-state version of HMM-DOSAGE is used to segment the input data $T_N$. The default initialization and hyperparameters for the prior means of the Student's-t distribution are $\log_2([1, 1.4, 2, 2.7, 3, 4.5]/2)$.
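As an illustration of the normalization in A.1.5 and the prior means in A.1.6, a minimal Python sketch follows. The function names and the toy read depths are hypothetical and are not part of HMMcopy or HMM-DOSAGE; the sketch also assumes every window has a positive depth, so windows with zero depth would need separate handling.

```python
import math


def normalize(counts):
    """Divide each count by the total so the values sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


def log2_ratio(tumour, normal):
    """Per-window log2 ratio of normalized tumour to normalized normal depth."""
    t_bar, n_bar = normalize(tumour), normalize(normal)
    return [math.log2(t / n) for t, n in zip(t_bar, n_bar)]


def prior_means(levels=(1, 1.4, 2, 2.7, 3, 4.5)):
    """Prior means for the six HMM states, computed as log2(level / 2)."""
    return [math.log2(x / 2) for x in levels]


# Toy GC-corrected depths for four windows (hypothetical values).
normal = [100.0, 100.0, 100.0, 100.0]
tumour = [50.0, 100.0, 100.0, 150.0]
ratios = log2_ratio(tumour, normal)
means = prior_means()
```

In this toy case the first window has half the normalized tumour depth of the normal, giving a log2 ratio of -1, which coincides with the prior mean of the first state, log2(1/2).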
Because normalizing the tumour counts by the matched normal read counts naturally excludes germline CNVs, using the 11-state HMM-DOSAGE to distinguish CNA from CNV is not required.

Appendix B

APOLLOH: Supplementary material

B.1 Biospecimen collection and ethical consent

Tumour specimens were obtained from three tumour banks (BCCA Vancouver, breast tumour tissue repository; Alberta CBCF Breast Tumour Bank, Edmonton; and Cambridge UK, Addenbrooke's Hospital breast tumour bank), each with local REB/IRB approval for genomic studies of nucleic acids from breast cancer patients. This project was conducted under local BCCA REB/IRB projects H06-00289, H08-1230, H06-3199. The source of germline DNA was peripheral blood lymphocytes in all but 4 cases. In these 4 cases, histologically normal adjacent breast tissue was used. Initial case selection was based on clinical immunohistochemistry to define primary triple negative breast cancers obtained from surgical specimens, prior to the initiation of any chemotherapy or radiotherapy. Tumours typed as ER-, HER2- and PR- were initially selected for further review and re-validation of the IHC. Cases found to be ERBB2 amplified on copy number analysis, but IHC negative for ERBB2, were rejected. The complete sequence level genome landscapes of these tumours will be described elsewhere (Shah et al, submitted).

B.2 Histopathological review

Tissue sections were subject to expert histopathological review (GT) to assess the presence of invasive tumour, pre-malignant or benign changes, lymphocytic infiltration, necrosis and tumour cellularity. Tumour cellularity was scored visually in a semiquantitative fashion on sections taken from the cryosectioning runs used to isolate nucleic acids from each tumour.
Cellularity values were binned such that 'low cellularity' corresponds to samples with <40% malignant cells, 'moderate cellularity' corresponds to 40%-70% malignant cells, and samples with >70% malignant cells were considered to have 'high cellularity'. All but one sample classified as low cellularity were excluded from further analysis. The ER-, PR-, HER2- immunophenotype was reconfirmed on sections or TMA cores from the cases included for analysis, and CK5/6 and EGFR were additionally assessed by IHC. Subsequently, SNP6.0 copy number analysis was also used to confirm the absence of HER2 amplification in each case.

B.3 Library construction and sequence data generation

SOLiD whole genome shotgun libraries for 17 tumour/normal pairs were generated as previously described (McKernan et al., 2009) and aligned to the human reference genome (hg18, NCBI36) using BioScope. Illumina libraries were prepared as described in Morin et al. (2011) and aligned using BWA (Li and Durbin, 2009). Paired-end RNAseq libraries were generated as described in Wiegand et al. (2010). Sequence reads were aligned using a modified version of BWA (base version 0.5.5; Li and Durbin, 2009) to a reference consisting of the human genome reference (NCBI build 36, hg18) and a database of known exon-exon junctions obtained from different annotation databases (Ensembl (Flicek et al., 2010), RefSeq (Pruitt et al., 2007), AceView (Thierry-Mieg and Thierry-Mieg, 2006)). Sequences representing exon-exon junctions were designed to require at least a 4 base pair overlap for split-reads. Considering a read length of 50 base pairs, 46 base pairs on either side of the exon-exon junction were concatenated to represent each exon-exon junction.

B.4 Application of APOLLOH to 23 triple negative breast cancers.

The analysis workflow is described in Methods and Figure S3.2.
For extracting heterozygous positions from the normal genomes, default settings were used for GATK's UnifiedGenotyper (McKenna et al., 2010). Heterozygous genotypes were accepted based on "PASS" or "." in the reported UnifiedGenotyper 'Quality' field. In the tumours, pileups were generated using SAMtools (Li et al., 2009) and reads in the corresponding normal heterozygous positions were filtered by base quality of 10 and mapping quality of 20. Low-depth (<10 reads) and read-sink (>200 reads) positions were excluded.

APOLLOH parameters (described in Table S3.1) used in this analysis are given in a configuration file packaged with the software (http://compbio.bccrc.ca/software/apolloh/).

B.5 OncoSNP analysis of Affymetrix SNP6.0 analysis

Affymetrix SNP6.0 genotyping arrays were analyzed for the 23 breast cancers. We determined regions of loss of heterozygosity (LOH) using the OncoSNP software v1.0 Beta (Internal Release v2.19) (Yau et al., 2010). Due to the absence of a complete set of matched normals, the unpaired tumour analysis was used. OncoSNP was adapted for the Affymetrix SNP6 platform by using initial parameter settings obtained by training on 45 COSMIC (Bignell et al., 2010) breast cancer samples hybridized to SNP6 arrays. Log ratios and B-allele frequencies were obtained from PennCNV-Affy (Wang et al., 2007) normalization results. The OncoSNP hyperparameters used in the analysis were provided as a standard configuration file with the downloaded software. Other OncoSNP settings included using 15 iterations of expectation maximization, 30 sub-sampling, and the stromal setting activated.
Two to eight copy 'Somatic' LOH predictions made by OncoSNP were consolidated as simply LOH, while 'Mono-allelic amplification' states were relabeled as 'Allele-specific copy number amplifications' (ASCNA).

B.6 Analyses for comparing APOLLOH results and Affymetrix SNP6 data

B.6.1 WGSS and SNP6 platform comparison

For each predicted APOLLOH segment $x$ with boundaries $x_{start}$ and $x_{end}$ and segment median allelic ratio $a_x$, we computed the SNP6 median BAF $y_x$ for probes that overlapped $x$ as

\[ y_x = \operatorname{median}\left(\mathrm{BAF}(p)\right), \quad \{p : p_{start} \ge x_{start},\; p_{end} \le x_{end}\} \tag{B.1} \]

Spearman's rank correlation is computed on $y_{1:X}$ and $a_{1:X}$, where $X$ is the total number of APOLLOH predicted segments.

To measure association with the dynamic range of clusters as shown in Figure 3.6 and Figure S3.4, we computed the Euclidean distance between the class centroids of LOH and HET and calculated the Spearman rank correlation statistic with APOLLOH estimates of normal proportion $s$ (Figure S3.5).

B.6.2 Model evaluation using SNP6 predictions

SNP6 LOH results, predicted using the OncoSNP (Yau et al., 2010) software, were used as truth for evaluating APOLLOH model variants and SNVMix. OncoSNP was run using parameters and settings designed for the Affymetrix SNP6.0 platform. Predicted states were redefined into comparable classes: deletion, neutral and amplified LOH; allele-specific copy number amplification; and heterozygous. Positions that intersected between the loci used in APOLLOH for each sample and the probes of the SNP6 array were used for evaluation. Homozygous positions predicted by OncoSNP or HMMcopy (Supplemental Methods) were excluded from the evaluation. True positives (TP) are defined as positions that were predicted as LOH by both APOLLOH and OncoSNP; false positives (FP) are positions where APOLLOH predicted LOH but OncoSNP predicted HET or ASCNA; false negatives (FN) are positions that APOLLOH called HET/ASCNA but OncoSNP predicted as LOH.
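The counting scheme just described can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name and the string labels "LOH", "HET" and "ASCNA" are hypothetical, the two input lists are assumed to be aligned position by position with homozygous positions already removed, and the F-measure is taken to be the usual harmonic mean of precision and recall.

```python
def loh_metrics(apolloh_calls, oncosnp_calls):
    """Precision, recall and F-measure for LOH, treating OncoSNP as truth.

    Both arguments are equal-length lists of labels per position,
    each one of "LOH", "HET" or "ASCNA" (hypothetical encoding).
    """
    tp = fp = fn = 0
    for predicted, truth in zip(apolloh_calls, oncosnp_calls):
        if predicted == "LOH" and truth == "LOH":
            tp += 1          # both methods call LOH
        elif predicted == "LOH":
            fp += 1          # APOLLOH calls LOH, OncoSNP calls HET/ASCNA
        elif truth == "LOH":
            fn += 1          # APOLLOH calls HET/ASCNA, OncoSNP calls LOH
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


predicted = ["LOH", "LOH", "HET", "ASCNA", "LOH"]
truth = ["LOH", "HET", "LOH", "HET", "LOH"]
precision, recall, f_measure = loh_metrics(predicted, truth)
```

Positions where neither method calls LOH contribute to none of the three counts, so they do not affect any of the metrics.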
Precision (TP/(TP+FP)), recall (TP/(TP+FN)), and F-measure (Equation B.2) were calculated for each sample and APOLLOH model variant. For ASCNA performance, positives were ASCNA for APOLLOH (full model) and HET for APOLLOH-noCN.

\[ F\text{-measure} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{B.2} \]

Evaluation using exome data was computed in the same way as above. LOH predictions for the exome data were generated using the full APOLLOH model.

B.7 Comparison of transcriptome allelic ratios (TAR)

RNAseq pileups for each sample were generated using SAMtools (Li et al., 2009), and positions that intersected the loci used in the APOLLOH analysis (i.e. heterozygous positions in the normal genome) of each WGSS sample were extracted. Transcriptome allelic ratios were first converted to symmetric counts,

\[ AR_i = \frac{\max(a_i, b_i)}{a_i + b_i} \tag{B.3} \]

where $a_i$ is the reference count and $b_i$ is the non-reference count for each position $i$. Each RNAseq position is then classified with the corresponding zygosity based on the APOLLOH call at the same locus in the genome.

Appendix C

TITAN: Supplementary material

C.1 TITAN parameter update derivations

C.1.1 Prior distribution (mixed weights) parameter for genotypes, $\pi_G$

The complete data log-likelihood terms involving $\pi_G$ are

\[
Q(\pi_G) = \sum_{g=1}^{G} \gamma_G(G_0 = g) \left\{ \log\left( \frac{\left(\sum_{g=1}^{G} G_0(g)\right)!}{\prod_{g=1}^{G} G_0(g)!} \right) + \sum_{g=1}^{G} G_0(g) \log \pi_g \right\} + \log \frac{\Gamma\left(\sum_{g=1}^{G} \delta_\pi(g)\right)}{\prod_{g=1}^{G} \Gamma\left(\delta_\pi(g)\right)} + \sum_{g=1}^{G} \pi_g^{\delta_\pi(g) - 1} \tag{C.1}
\]

The standard Dirichlet-Multinomial conjugate update using the MAP estimate is given by

\[
\pi_G(g) = \frac{\sum_t \gamma_G(G_t = g) + \delta_G(g) - 1}{\sum_{g'=1}^{G} \left[ \sum_t \gamma_G(G_t = g') + \delta_G(g') - 1 \right]} \tag{C.2}
\]

C.1.2 Prior distribution (mixed weights) parameter for clonal clusters, $\pi_Z$

The complete data log-likelihood terms involving $\pi_Z$ are

\[
Q(\pi_Z) = \sum_{z=1}^{Z} \left\{ \gamma_Z(Z_0 = z) \left( \log\left( \frac{\left(\sum_{z=1}^{Z} Z(z)\right)!}{\prod_{z=1}^{Z} Z(z)!} \right) + \sum_{z=1}^{Z} Z(z) \log \pi_z \right) \right\} + \log\left( \frac{\Gamma\left(\sum_{z=1}^{Z} \delta_Z(z)\right)}{\prod_{z=1}^{Z} \Gamma\left(\delta_Z(z)\right)} \right) + \sum_{z=1}^{Z} \pi_z^{\delta_Z(z) - 1} \tag{C.3}
\]

The standard Dirichlet-Multinomial conjugate update using the MAP estimate is given
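by

\[
\pi_Z(z) = \frac{\sum_t \gamma_Z(Z_t = z) + \delta_Z(z) - 1}{\sum_{z'=1}^{Z} \left[ \sum_t \gamma_Z(Z_t = z') + \delta_Z(z') - 1 \right]} \tag{C.4}
\]

C.1.3 Clonal frequency parameter, $s$

The expected data-likelihood terms involving $s$ are the emission data likelihood and the Beta prior.

\[
\begin{aligned}
Q(s) = {} & \sum_{z=1}^{Z} \sum_{t=1}^{T} \sum_{g=1}^{G} \gamma(G_t = g, Z_t = z) \left\{ \log\binom{N_t}{a_t} + a_t \log \omega_{g,z} + (N_t - a_t) \log(1 - \omega_{g,z}) + \log\frac{1}{\sqrt{2\pi\sigma_g^2}} - \frac{(l_t - \mu_{g,z})^2}{2\sigma_g^2} \right\} \\
& + \sum_{z=1}^{Z} \left\{ \log\left( \frac{\Gamma(\alpha_z + \beta_z)}{\Gamma(\alpha_z)\Gamma(\beta_z)} \right) + (\alpha_z - 1) \log(s_z) + (\beta_z - 1) \log(1 - s_z) \right\}
\end{aligned}
\tag{C.5}
\]

Equation C.5 can be simplified to

\[
\begin{aligned}
\bar{Q}(s) = {} & \sum_{t=1}^{T} \sum_{z=1}^{Z} \sum_{g=1}^{G} \gamma(G_t = g, Z_t = z) \left\{ a_t \log(\omega_{g,z}) + (N_t - a_t) \log(1 - \omega_{g,z}) - \frac{(l_t - \mu_{g,z})^2}{2\sigma_g^2} \right\} \\
& + \sum_{z=1}^{Z} \left\{ (\alpha_z - 1) \log(s_z) + (\beta_z - 1) \log(1 - s_z) \right\} + \text{constant}
\end{aligned}
\tag{C.6}
\]

Taking the derivative with respect to $s$ gives

\[
\begin{aligned}
\frac{\partial \bar{Q}(s)}{\partial s} = {} & \sum_{t=1}^{T} \sum_{z=1}^{Z} \sum_{g=1}^{G} \gamma(G_t = g, Z_t = z) \left\{ a_t \frac{\partial \log(\omega_{g,z})}{\partial s} + (N_t - a_t) \frac{\partial \log(1 - \omega_{g,z})}{\partial s} + \frac{(l_t - \mu_{g,z})}{\sigma_g^2} \times \frac{\partial \mu_{g,z}}{\partial s} \right\} \\
& + \sum_{z=1}^{Z} \left\{ (\alpha_z - 1) \frac{\partial \log(s_z)}{\partial s_z} + (\beta_z - 1) \frac{\partial \log(1 - s_z)}{\partial s_z} \right\} = 0
\end{aligned}
\tag{C.7}
\]

The partial derivatives can be solved using the following,

\[
\begin{aligned}
\frac{\partial \omega_{g,z}}{\partial s_z} &= \frac{(1-n)\, c_N c_{T,g} (r_N - r_{T,g})}{\left[ n c_N + (1-n) s_z c_N + (1-n)(1-s_z) c_{T,g} \right]^2} \\
\frac{\partial \log(\omega_{g,z})}{\partial s_z} &= \frac{1}{\omega_{g,z}} \times \frac{\partial \omega_{g,z}}{\partial s_z} \\
\frac{\partial \log(1 - \omega_{g,z})}{\partial s_z} &= \frac{1}{1 - \omega_{g,z}} \times -\frac{\partial \omega_{g,z}}{\partial s_z} \\
\frac{\partial \mu_{g,z}}{\partial s_z} &= \frac{(1-n) c_N - (1-n) c_{T,g}}{n c_N + (1-n) s_z c_N + (1-n)(1-s_z) c_{T,g}} \\
\frac{\partial \log(s_z)}{\partial s_z} &= \frac{1}{s_z} \\
\frac{\partial \log(1 - s_z)}{\partial s_z} &= \frac{-1}{1 - s_z}
\end{aligned}
\]

Simplify by letting $\gamma(G_t = g, Z_t = z)$ be $\gamma_t(g,z)$ and computing the following,

\[
\begin{aligned}
\bar{a}_{g,z} &= \sum_{t=1}^{T} \gamma_t(g,z)\, a_t \\
\bar{b}_{g,z} &= \sum_{t=1}^{T} \gamma_t(g,z)\, (N_t - a_t) \\
\bar{c}_{g,z} &= \sum_{t=1}^{T} \gamma_t(g,z)\, l_t \\
\bar{e}_{g,z} &= \sum_{t=1}^{T} \gamma_t(g,z) \\
\bar{g}_{g,z} &= \sum_{t=1}^{T} \gamma_t(g,z)\, l_t^2
\end{aligned}
\]

Plugging these into Equation C.7, we get

\[
\begin{aligned}
\frac{\partial \bar{Q}(s)}{\partial s} = {} & \sum_{t=1}^{T} \sum_{z=1}^{Z} \sum_{g=1}^{G} \gamma(G_t = g, Z_t = z) \left\{ \frac{\partial \omega_{g,z}}{\partial s_z} \left( \frac{a_t}{\omega_{g,z}} - \frac{(N_t - a_t)}{1 - \omega_{g,z}} \right) + \frac{\partial \mu_{g,z}}{\partial s_z} \left( \frac{l_t - \mu_{g,z}}{\sigma_g^2} \right) \right\} + \sum_{z=1}^{Z} \left( \frac{\alpha_z - 1}{s_z} - \frac{\beta_z - 1}{1 - s_z} \right) \\
\frac{\partial \bar{Q}(s)}{\partial s_z} = {} & \sum_{g=1}^{G} \left\{ \frac{\partial \omega_{g,z}}{\partial s_z} \left( \frac{\bar{a}_{g,z}}{\omega_{g,z}} - \frac{\bar{b}_{g,z}}{1 - \omega_{g,z}} \right) + \frac{\partial \mu_{g,z}}{\partial s_z} \left( \frac{\bar{c}_{g,z} - \bar{e}_{g,z}\,\mu_{g,z}}{\sigma_g^2} \right) \right\} + \left( \frac{\alpha_z - 1}{s_z} - \frac{\beta_z - 1}{1 - s_z} \right) = 0
\end{aligned}
\tag{C.8}
\]

C.1.4 Copy number Gaussian variance parameter, $\sigma^2$

The expected data-likelihood terms involving $\sigma^2$ are the emission data likelihood and the inverse gamma prior.

\[
\begin{aligned}
Q(\sigma^2) = {} & \sum_{t=1}^{T} \sum_{z=1}^{Z} \sum_{g=1}^{G} \gamma(G_t = g, Z_t = z) \left\{ \log\binom{N_t}{a_t} + a_t \log \omega_{g,z} + (N_t - a_t) \log(1 - \omega_{g,z}) + \log\frac{1}{\sqrt{2\pi\sigma_g^2}} - \frac{(l_t - \mu_{g,z})^2}{2\sigma_g^2} \right\} \\
& + \sum_{g=1}^{G} \left\{ \log\left( \frac{\beta_g^{\alpha_g}}{\Gamma(\alpha_g)} \right) + (-\alpha_g - 1) \log(\sigma_g^2) - \frac{\beta_g}{\sigma_g^2} \right\}
\end{aligned}
\tag{C.9}
\]

Taking the derivative of Equation C.9 with respect to $\sigma^2$ gives

\[
\begin{aligned}
\frac{\partial \bar{Q}(\sigma^2)}{\partial \sigma^2} = {} & \sum_{t=1}^{T} \sum_{z=1}^{Z} \sum_{g=1}^{G} \gamma(G_t = g, Z_t = z) \left\{ -\frac{\partial \log\left( \sqrt{2\pi}\,(\sigma_g^2)^{1/2} \right)}{\partial \sigma_g^2} - \frac{\partial \left( \frac{(l_t - \mu_{g,z})^2}{2\sigma_g^2} \right)}{\partial \sigma_g^2} \right\} \\
& + \sum_{g=1}^{G} \left\{ (-\alpha_g - 1) \frac{\partial \log(\sigma_g^2)}{\partial \sigma_g^2} - \beta_g \frac{\partial \left( \frac{1}{\sigma_g^2} \right)}{\partial \sigma_g^2} \right\} = 0
\end{aligned}
\tag{C.10}
\]

The partial derivatives are the following:

\[
\begin{aligned}
\frac{\partial \log\left( \sqrt{2\pi}\,(\sigma_g^2)^{1/2} \right)}{\partial \sigma_g^2} &= \frac{1}{2\sigma_g^2} \\
\frac{\partial \left( \frac{(l_t - \mu_{g,z})^2}{2\sigma_g^2} \right)}{\partial \sigma_g^2} &= -\frac{(l_t - \mu_{g,z})^2}{2\sigma_g^4} \\
\frac{\partial \left( \frac{1}{\sigma_g^2} \right)}{\partial \sigma_g^2} &= -\frac{1}{\sigma_g^4}
\end{aligned}
\]

Plugging the partial derivatives into Equation C.10 gives,

\[
\begin{aligned}
\frac{\partial \bar{Q}(\sigma^2)}{\partial \sigma_g^2} &= \sum_{t=1}^{T} \sum_{z=1}^{Z} \gamma(G_t = g, Z_t = z) \left\{ -\frac{1}{2\sigma_g^2} + \frac{(l_t - \mu_{g,z})^2}{2\sigma_g^4} \right\} + \left( \frac{-\alpha_g - 1}{\sigma_g^2} + \frac{\beta_g}{\sigma_g^4} \right) = 0 \\
\frac{\partial \bar{Q}(\sigma^2)}{\partial \sigma_g^2} &= \sum_{z=1}^{Z} \left\{ -\frac{\bar{e}_{g,z}}{2\sigma_g^2} + \frac{\bar{g}_{g,z}}{2\sigma_g^4} - \frac{2\bar{c}_{g,z}\,\mu_{g,z}}{2\sigma_g^4} + \frac{\bar{e}_{g,z}(\mu_{g,z})^2}{2\sigma_g^4} \right\} - \frac{\alpha_g + 1}{\sigma_g^2} + \frac{\beta_g}{\sigma_g^4} = 0 \\
\sigma_g^2 &= \frac{-\sum_{z=1}^{Z} \left( \bar{g}_{g,z} - 2\bar{c}_{g,z}\,\mu_{g,z} + \bar{e}_{g,z}(\mu_{g,z})^2 \right) - 2\beta_g}{-\sum_{z=1}^{Z} \bar{e}_{g,z} - 2(\alpha_g + 1)}
\end{aligned}
\tag{C.11}
\]

C.1.5 Tumour ploidy, $\phi$

The expected data-likelihood terms involving $\phi$ are the emission data likelihood and the inverse gamma prior.

\[
\begin{aligned}
Q(\phi) = {} & \sum_{t=1}^{T} \sum_{z=1}^{Z} \sum_{g=1}^{G} \gamma(G_t = g, Z_t = z) \left\{ \log\binom{N_t}{a_t} + a_t \log \omega_{g,z} + (N_t - a_t) \log(1 - \omega_{g,z}) + \log\frac{1}{\sqrt{2\pi\sigma_g^2}} - \frac{(l_t - \mu_{g,z})^2}{2\sigma_g^2} \right\} \\
& + \log\left( \frac{\beta_\phi^{\alpha_\phi}}{\Gamma(\alpha_\phi)} \right) + (-\alpha_\phi - 1) \log(\phi) - \frac{\beta_\phi}{\phi}
\end{aligned}
\tag{C.12}
\]

Taking the derivative of Equation C.12 with respect to $\phi$,

\[
\frac{\partial \bar{Q}(\phi)}{\partial \phi} = \sum_{t=1}^{T} \sum_{z=1}^{Z} \sum_{g=1}^{G} \gamma(G_t = g, Z_t = z) \left\{ -\frac{1}{2\sigma_g^2} \times \frac{\partial \left( (l_t - \mu_{g,z})^2 \right)}{\partial \phi} \right\} + (-\alpha_\phi - 1) \frac{\partial \log(\phi)}{\partial \phi} - \beta_\phi \frac{\partial \left( \frac{1}{\phi} \right)}{\partial \phi} = 0 \tag{C.13}
\]

The partial derivatives are

\[
\begin{aligned}
\frac{\partial \mu_{g,z}}{\partial \phi} &= \frac{-(1-n)}{n c_N + (1-n)\phi} \\
\frac{\partial \left( (l_t - \mu_{g,z})^2 \right)}{\partial \phi} &= 2 (l_t - \mu_{g,z}) \times -\frac{\partial \mu_{g,z}}{\partial \phi}
\end{aligned}
\]

Plugging the partial derivatives into Equation C.13 gives,

\[
\begin{aligned}
\frac{\partial \bar{Q}(\phi)}{\partial \phi} &= \sum_{t=1}^{T} \sum_{z=1}^{Z} \sum_{g=1}^{G} \gamma(G_t = g, Z_t = z) \left\{ \frac{l_t - \mu_{g,z}}{\sigma_g^2} \times \frac{\partial \mu_{g,z}}{\partial \phi} \right\} + \left( \frac{-\alpha_\phi - 1}{\phi} + \frac{\beta_\phi}{\phi^2} \right) = 0 \\
\frac{\partial \bar{Q}(\phi)}{\partial \phi} &= \sum_{z=1}^{Z} \sum_{g=1}^{G} \left\{ \frac{\partial \mu_{g,z}}{\partial \phi} \left( \frac{\bar{c}_{g,z} - \bar{e}_{g,z}\,\mu_{g,z}}{\sigma_g^2} \right) \right\} - \frac{\alpha_\phi + 1}{\phi} + \frac{\beta_\phi}{\phi^2} = 0
\end{aligned}
\tag{C.14}
\]

Because we expect the average ploidy to be at least 1, we can choose the positive solution.

C.1.6 Normal contamination, $n$

The expected data-likelihood terms involving $n$ are the emission data likelihood and the Beta prior.

\[
\begin{aligned}
Q(n) = {} & \sum_{t=1}^{T} \sum_{z=1}^{Z} \sum_{g=1}^{G} \gamma(G_t = g, Z_t = z) \left\{ \log\binom{N_t}{a_t} + a_t \log \omega_{g,z} + (N_t - a_t) \log(1 - \omega_{g,z}) + \log\frac{1}{\sqrt{2\pi\sigma_g^2}} - \frac{(l_t - \mu_{g,z})^2}{2\sigma_g^2} \right\} \\
& + (\alpha_n - 1) \log(n) + (\beta_n - 1) \log(1 - n)
\end{aligned}
\tag{C.15}
\]

Equation C.15 can be simplified to

\[
\begin{aligned}
\bar{Q}(n) = {} & \sum_{t=1}^{T} \sum_{z=1}^{Z} \sum_{g=1}^{G} \gamma(G_t = g, Z_t = z) \left\{ a_t \log(\omega_{g,z}) + (N_t - a_t) \log(1 - \omega_{g,z}) - \frac{(l_t - \mu_{g,z})^2}{2\sigma_g^2} \right\} \\
& + (\alpha_n - 1) \log(n) + (\beta_n - 1) \log(1 - n) + \text{constant}
\end{aligned}
\tag{C.16}
\]

Taking the derivative of Equation C.16 with respect to $n$ gives

\[
\begin{aligned}
\frac{\partial \bar{Q}(n)}{\partial n} = {} & \sum_{t=1}^{T} \sum_{z=1}^{Z} \sum_{g=1}^{G} \gamma(G_t = g, Z_t = z) \left\{ a_t \frac{\partial \log(\omega_{g,z})}{\partial n} + (N_t - a_t) \frac{\partial \log(1 - \omega_{g,z})}{\partial n} + \frac{(l_t - \mu_{g,z})}{\sigma_g^2} \times \frac{\partial \mu_{g,z}}{\partial n} \right\} \\
& + (\alpha_n - 1) \frac{\partial \log(n)}{\partial n} + (\beta_n - 1) \frac{\partial \log(1 - n)}{\partial n} = 0
\end{aligned}
\tag{C.17}
\]

The partial derivatives can be solved using the following,

\[
\begin{aligned}
\frac{\partial \omega_{g,z}}{\partial n} &= \frac{(1 - s_z)\, c_N c_{T,g} (r_N - r_{T,g})}{\left[ n c_N + (1-n) s_z c_N + (1-n)(1-s_z) c_{T,g} \right]^2} \\
\frac{\partial \log(\omega_{g,z})}{\partial n} &= \frac{1}{\omega_{g,z}} \times \frac{\partial \omega_{g,z}}{\partial n} \\
\frac{\partial \log(1 - \omega_{g,z})}{\partial n} &= \frac{1}{1 - \omega_{g,z}} \times -\frac{\partial \omega_{g,z}}{\partial n} \\
\frac{\partial \mu_{g,z}}{\partial n} &= \frac{c_N - s_z c_N - (1 - s_z) c_{T,g}}{n c_N + (1-n) s_z c_N + (1-n)(1-s_z) c_{T,g}} - \frac{c_N - \phi}{n c_N + (1-n)\phi} \\
\frac{\partial \log(n)}{\partial n} &= \frac{1}{n} \\
\frac{\partial \log(1 - n)}{\partial n} &= \frac{-1}{1 - n}
\end{aligned}
\]

Simplify by using the previously defined $\bar{a}_{g,z}$, $\bar{b}_{g,z}$, $\bar{c}_{g,z}$, $\bar{e}_{g,z}$, $\bar{g}_{g,z}$ and plugging into Equation C.17,

\[
\begin{aligned}
\frac{\partial \bar{Q}(n)}{\partial n} = {} & \sum_{t=1}^{T} \sum_{g=1}^{G} \sum_{z=1}^{Z} \gamma(G_t = g, Z_t = z) \left\{ \frac{\partial \omega_{g,z}}{\partial n} \left( \frac{a_t}{\omega_{g,z}} - \frac{(N_t - a_t)}{1 - \omega_{g,z}} \right) + \frac{\partial \mu_{g,z}}{\partial n} \left( \frac{l_t - \mu_{g,z}}{\sigma_g^2} \right) \right\} + \left( \frac{\alpha_n - 1}{n} - \frac{\beta_n - 1}{1 - n} \right) \\
\frac{\partial \bar{Q}(n)}{\partial n} = {} & \sum_{g=1}^{G} \sum_{z=1}^{Z} \left\{ \frac{\partial \omega_{g,z}}{\partial n} \left( \frac{\bar{a}_{g,z}}{\omega_{g,z}} - \frac{\bar{b}_{g,z}}{1 - \omega_{g,z}} \right) + \frac{\partial \mu_{g,z}}{\partial n} \left( \frac{\bar{c}_{g,z} - \bar{e}_{g,z}\,\mu_{g,z}}{\sigma_g^2} \right) \right\} + \left( \frac{\alpha_n - 1}{n} - \frac{\beta_n - 1}{1 - n} \right) = 0
\end{aligned}
\tag{C.18}
\]

C.2 Biospecimen collection of intra-tumoural ovarian carcinoma samples and FISH validation

Ethical approval was obtained from the University of British Columbia (UBC) Ethics Board. Patient DG1136 was chosen as a high-grade serous carcinoma case in which more than one sample was surgically extracted, from four sites in the primary tumour on the right ovary and left pelvic sidewall, prior to treatment. The peripheral blood lymphocyte sample was collected prior to surgery. Tissue sections were subject to expert histopathological review (GT) to assess the presence of invasive tumour, pre-malignant or benign changes, lymphocytic infiltration, necrosis and tumour cellularity. Genomic DNA was extracted from fresh frozen tumour tissue and patient-matched peripheral blood lymphocytes as previously described (Bashashati et al., 2013).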
Constructed libraries were sequenced on the Illumina HiSeq 2000, according to Illumina protocols, generating 100bp paired-end reads. The amount of sequence generated ranged from 94 to 110 gigabases in total, for an estimated sequencing coverage between 29X and 35X (Table 4.3). The sequenced reads were aligned to the reference genome (build GRCh37, hg19) using BWA (Li and Durbin, 2009).

Locus-specific analysis was performed as previously described (Bashashati et al., 2013) using full 5 micron sections from representative FFPE blocks. Briefly, BACs were directly labeled with spectrum green or orange using a Nick Translation Kit (Abbott Molecular, Illinois, USA) and chromosomal locations were validated using normal metaphases from blood (results not shown). Specific BAC and control probe identifiers are listed in the corresponding figures. Nuclei were counterstained with 4',6-diamidino-2-phenylindole, and signals and patterns were identified on a Zeiss Axioplan epifluorescent microscope and scored manually in 80 nuclei using an oil immersion 100x objective. Images were captured using Metasystems software (MetaSystems Group Inc., Belmont, MA, USA) and an oil immersion 63x objective. To obtain prevalence estimates from FISH, approximately 40 cells were used to compute the proportion of the sample containing a particular event.

C.3 TNBC sample collection and sequencing

Details of ethical consent, biospecimen collection, histopathological review, library preparation, and sequencing are described in Shah et al. (2012). Mutations were originally identified using JointSNVMix (Roth et al., 2012) and MutationSeq (Ding et al., 2012b) for 16 genomes sequenced using ABi/SOLiD.
Targeted deep amplicon sequencing data was generated on a selection of these positions, which were determined to be somatic using the Binomial exact test (Shah et al., 2012).

C.3.1 Comparison of cellular prevalence with RNAseq

RNAseq data for TNBC was obtained from the study by Ha et al. (2012) and aligned as previously described (Shah et al., 2012) using BWA v0.5.5 to the human genome reference (NCBI build 36, hg18) and a database of known exon-exon junctions obtained from different annotation databases (Ensembl, RefSeq, AceView). Allele counts were extracted for all germline SNP positions using filters for base phred qualities (> 5), mapping qualities (> 30), and depth threshold (> 10).

C.4 Additional details for TITAN evaluation analyses

C.4.1 Mixture simulation experiments using intra-tumour samples from an ovarian carcinoma

Computing performance metrics

HMMcopy and APOLLOH (Ha et al., 2012) results from the individual samples were used as ground truth CNA and LOH events, respectively. Default parameters were used for APOLLOH/HMMcopy (http://compbio.bccrc.ca/software/). The truth set consists of CNA/LOH status at all germline heterozygous (HET) SNP positions included in the APOLLOH results for each of the five samples. The rationale for using all germline HET positions in the evaluation is that it represents a genome-wide assessment, such that larger events are given more weight because they span more loci. Spurious, potentially false, ground truth events that span fewer loci, and are thus perhaps less confident, are weighted less.
Furthermore, every evaluation examines the same set of positions, providing a more comparable performance metric across methods and alleviating the complexity of varying boundary precision between approaches.

Precision, recall, and F-measure performance for TITAN, APOLLOH, Control-FREEC (Boeva et al., 2012b), and BIC-seq (Xi et al., 2011a) analysis on the simulated mixture samples were computed based on the CNA/LOH status of predicted segments overlapping the ground truth SNP loci. Precision was computed as the proportion of SNP loci at which the predicted CNA/LOH status from the overlapping segment matched the ground truth status, out of all SNP loci predicted as CNA/LOH. Recall was computed as the proportion of SNP loci at which the predicted CNA/LOH status from the overlapping segment matched the ground truth status, out of all SNP loci with CNA/LOH in the truth set. The performance was computed for deletions, gains, and LOH status independently, and averaged together for the overall assessment. True deletion and amplification loci were determined if both the predicted and ground truth copy numbers were < 2 and > 2, respectively. True LOH loci were determined as presence or absence in predictions, matching the ground truth. Ground truth subclonal events that are a mixture of two different tumour genotypes (non-diploid-heterozygous) were excluded from the performance evaluation because these events were uncommon, could possibly lead to identifiability issues, and all tools were only capable of, or designed for, returning a single predicted genotype.
For evaluating size-based performance, ground truth events from each individual sample (not the simulated mixtures) were grouped into ranges of length 10kb-100kb, 100kb-1Mb, 1Mb-10Mb, and greater than 10Mb; precision, recall, and F-measure were computed similarly for each size group.

For evaluation of cellular prevalence, the proportion of tumour contribution from each individual sample making up the simulated mixture was used to compute the expected cellular prevalence (Supplementary Table 4.3b-d). Figure 3b illustrates a mixture scenario, and identifies true (sub)clonal events and expected cellular prevalence. For example, Sample A has 80% tumour content and Sample B has 70%. If these samples were mixed at equal proportions to generate Mixture X, then X will have a tumour subclone (population) of 40% contributed by Sample A (0.5 * 80%), another tumour subclone of 35% from Sample B (0.5 * 70%), and a normal population of 25% (0.5 * (20% + 30%)). Clonally dominant events in X are considered as those that are present in both Sample A and B, and will have an expected cellular prevalence of 40% + 35% = 75%. Subclonal events in X are those that are present in exactly one of the samples but not both. For example, if a GAIN event was only found in Sample A, then the expected cellular prevalence of this GAIN in X is 40%. Because the individual samples may have different sequence coverage, we have also adjusted for this.

The number of expected clonal clusters with unique cellular prevalence is taken as the number of non-empty combinations of the simulated tumour populations. For the serial and pairwise analysis, three possible clonal clusters exist; for the triplet simulation, up to seven clusters may be present. The correlation analysis used a sample size based on the expected number of clusters across all mixtures in each experiment (Fig. 4, Supplementary Fig.
4.18).

Usage details of other copy number prediction software

APOLLOH. Input data consisted of read counts at heterozygous germline SNP positions identified using SAMtools mpileup. HMMcopy was used to generate input copy number data; settings included width=1000 and quality=0 for readCounter during bin read count generation, and param$mu <- log(c(1, 1.4, 2, 2.7, 3, 4.5) / 2, 2) for the R function HMMsegment during segmentation. The default configuration file, apolloh_K18_params_Illumina_stromalRatio_Hyper10k_min10max200.mat, downloaded from http://compbio.bccrc.ca/software/apolloh/, was used for APOLLOH (Ha et al., 2012) to predict regions of LOH, allele-specific amplification (ASCNA), and heterozygosity (HET).

Control-FREEC. Control-FREEC (version 6.0) (Boeva et al., 2012b) was applied to the mixtures using the following parameters: ploidy=2, contaminationAdjustment=TRUE, sex=XX, uniqueMatch=TRUE, window=1000. Mappability was corrected using the input file 'out100m2_hg19.gem' (2 mismatches), and the SNPs used for B-allele frequency (LOH) analysis were provided in hg19_snp137.SingleDiNucl.1based.txt (downloaded from http://bioinfo-out.curie.fr/projects/freec/tutorial.html). The output file with the extension ".CNVs", containing the inferred segments, was used to evaluate the performance of Control-FREEC.

BIC-seq. BIC-seq (version 1.1.2) (Xi et al., 2011a) was used with a bin size of 1kb and λ set to 10, while default settings were used for all other parameters. The output file with extension ".bicseg", containing the resulting segments, was used for copy number performance evaluation.
Copy number loss and gain were determined as segments having "log2.copyRatio" < log2(0.75) with "log10.pvalue" < log10(0.0001), and "log2.copyRatio" > log2(1.25) with "log10.pvalue" < log10(0.0001), respectively.

THetA. We also compared the results to THetA (Oesper et al., 2013), which is a post-segmentation software for estimating cellular prevalence. Due to limitations in runtime and memory, we ran THetA for one normal and one tumour population (n = 2) using conservatively large BIC-seq segments (λ = 200), and subsequently filtered for regions larger than 5 Mbp. The default parameter settings, such as heuristics, were used. Then, we ran THetA for one normal and two tumour populations (n = 3) using the n = 2 results, filtering down to only the 15 largest non-diploid or full-chromosome sized segments, and changing any zero lower bound copy number heuristics to 1. All other parameters were set to default values.

C.5 Validation using targeted deep amplicon DNA sequencing of single-cell nuclei

C.5.1 Selection of positions for validation of deletion events

Deletions in single cells were interrogated at heterozygous germline SNP loci. To improve the likelihood of observing a true deletion, multiple loci overlapping a deletion were used; this also allowed for distinguishing signals from random allele drop-out during sequencing. For deletions and diploid regions, 10-11 and 2-3 positions were selected, respectively. These regions were HET-1 (chr1:56977819-68910999), HET-3 (chr2:82870237-86078478), C-DLOH-1 (chr17:17415217-21074153), SC-DLOH-1 (chr1:70539053-117275764) and SC-DLOH-3 (chr2:31374733-80861750) for Set1, and HET-4 (chr7:138768839-141135114), HET-5 (chr21:19359230-19674681), C-NLOH-1 (chr17:55290843-62185764), SC-DLOH-4 (chr7:143777995-153688808) and SC-DLOH-5 (chr21:22084693-25770230) for Set2, where 'C' represents clonally dominant and 'SC' represents subclonal.
Additional criteria for selecting these positions were as follows: 1) SNP positions overlapped Affymetrix SNP6.0 array loci; these positions were likely also found within the population-based studies used in the array design. 2) Positions were equally spaced across the deletion region. 3) The 500bp flanking regions to the left and right of chosen positions did not contain any germline variants (heterozygous or homozygous); this helps with primer design and leads to more optimal primer amplification.

Mutations were chosen from a list of previously validated SNVs via Ampli-Crazy primer design platform sequenced on a MiSeq. Clonally dominant mutations (TP53, CSMD1, ARID1B, RFC3) were selected to later help distinguish tumour and normal cells. In particular, TP53 was validated as a clonally dominant homozygous mutation (containing only the variant allele). Additional clonally dominant mutations were selected for Set1 (FGD5) and Set2 (GABRA5, GALNT16, LRRC36, SPTB). We also included mutations that were found within subclonal deletions for Set1 (ABCA4, DENND2C, SULT6B1) and Set2 (MUC3A, XRCC2) to investigate bi-allelic inactivation in tumour cells.

C.5.2 Single-cell sequencing of nuclei DNA for ovarian cancer sample DG1136g

Nuclei preparation and sorting

Single cell nuclei were prepared using a sodium citrate lysis buffer containing Triton X-100 detergent. Solid tissue samples were first subjected to mechanical homogenization using a laboratory paddle-blender. The resulting cell lysates were passed twice through a 70-micron filter to remove larger cell debris. Aliquots of freshly prepared nuclei were visually inspected and enumerated using a dual counting chamber hemocytometer (Improved Neubauer, Hausser Scientific, PA) with Trypan blue stain.
Single nuclei were flow sorted into individual wells of microtitre plates using propidium iodide staining and a FACSAria II sorter (BD Biosciences, San Jose, CA).

Genomic DNA (gDNA), which refers to the bulk tumour DNA and can contain stromal DNA, is a potential source of contamination in the nuclei buffer during preparation. Included in each set were control nuclei samples without DNA templates, called non-template control (NTC) cells. These samples were used as the background control because any signal present will be from gDNA contamination as well as various amplicon and primer artifacts.

Multiplex and singleplex PCRs. Somatic coding SNVs catalogued and validated in bulk tissue genome sequencing experiments were picked for mutation-spanning PCR primer design using Primer3. Common sequences were appended to the 5' ends of the gene-specific primers to enable downstream barcoded adaptor attachment using a PCR approach. Multiplex (24-plex) PCRs were performed using an ABI7900HT machine and SYBR GreenER qPCR Supermix reagent (Life Technologies, Burlington, ON). The 24-plex reaction products from each nucleus were used as input template to perform 48 singleplex PCRs using 48 by 48 Access Array IFCs according to the manufacturer's protocol (Fluidigm Corporation, San Francisco, CA). Flow sorting plate wells without nuclei and 10ng gDNA aliquots were used for negative and positive control reactions, respectively.

Nuclei-specific amplicon barcoding and nucleotide sequencing. Pooled singleplex PCR products from each nucleus were assigned unique molecular barcodes and adapted for MiSeq flow-cell NGS sequencing chemistry using a PCR step. Barcoded amplicon libraries were pooled and purified by conventional preparative agarose gel electrophoresis.
Library quality assessment and quantitation were performed using a 2100 Bioanalyzer with DNA 1000 chips (Agilent Technologies, Santa Clara, CA) and a Qubit 2.0 Fluorometer (Life Technologies, Burlington, ON). Next-generation DNA sequencing was conducted using a MiSeq sequencer according to the manufacturer's protocols (Illumina Inc., San Diego, CA).

C.5.3 Analysis of single-cell sequencing data

Initial analysis of sequenced reads

Paired-end FASTQ files from the MiSeq sequencer were aligned to human genome build 37, downloaded from the NCBI, using the mem command from the bwa 0.7.5a package. Allelic count data was extracted from the BAM files using a custom Python script which filtered out positions with base or mapping qualities below 10.

For each position, both mutation SNVs and SNPs, one-tailed binomial exact tests were independently applied to the reference and variant alleles in order to determine their presence or absence while accounting for sequencing errors and gDNA contamination. The error and contamination variant ratio was computed for each position as the mean variant allelic ratio (variant reads divided by depth) over the flanking bases of the amplicon at that position in the NTC samples. This parameter encapsulated both the sequencing bias of the amplicon and the presence of gDNA contamination. The one-tailed binomial exact test was used to estimate whether the variant allelic ratio of the position was greater than expected. Similarly, the test was applied to the reference allelic ratio (reference reads divided by depth) for the same position. For each of the reference and variant alleles, a present status was assigned when the test was statistically significant (Benjamini and Hochberg adjusted p-value < 0.05) and an absent status otherwise. Positions with a depth below 50 were considered low_coverage. Positions with low_coverage in ≥ 50% of all nuclei in a set were also removed.
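The presence/absence call described above can be sketched roughly as follows. This is not the custom script used in the analysis: the function names are hypothetical, the background ratio is assumed to lie strictly between 0 and 1, and the Benjamini-Hochberg adjustment is assumed to be applied across exactly the tested positions passed in, which may differ from the grouping actually used.

```python
import math


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space for large n."""
    log_p, log_q = math.log(p), math.log1p(-p)
    terms = []
    for i in range(k, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1)
                   - math.lgamma(n - i + 1) + i * log_p + (n - i) * log_q)
        terms.append(math.exp(log_pmf))
    return min(1.0, math.fsum(terms))


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for offset, i in enumerate(order):
        rank = m - offset  # 1-based rank of pvalues[i] in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def allele_status(observations, background, min_depth=50, alpha=0.05):
    """Label each (allele_reads, depth) pair present/absent/low_coverage."""
    tested = [j for j, (_, depth) in enumerate(observations)
              if depth >= min_depth]
    pvals = [binom_upper_tail(observations[j][0], observations[j][1],
                              background) for j in tested]
    status = ["low_coverage"] * len(observations)
    for j, adj in zip(tested, bh_adjust(pvals)):
        status[j] = "present" if adj < alpha else "absent"
    return status


calls = allele_status([(40, 100), (1, 100), (5, 20)], background=0.02)
```

In this toy call, 40 variant reads out of 100 are far above a 2% background, 1 read out of 100 is not, and the third position falls below the depth threshold and is never tested.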
Distinguishing tumour and normal nuclei

First, the nuclei were filtered for global low coverage: nuclei in which fewer than 10 positions had sufficient coverage (≥ 50 reads) were excluded from the analysis. Next, normal nuclei were determined conservatively based on an absent TP53 variant allele status, and an absent or low_coverage variant allele status for all other mutations. While SNP positions for the regions of interest should be heterozygous in normal cells, we do not use these in the criteria due to allelic drop-out. Each of the remaining nuclei was classified as tumour if it had a present TP53 variant allele status but an absent TP53 reference allele status; however, if TP53 was low_coverage, then at least one mutation with a present variant allele status sufficed for tumour designation. All remaining nuclei were classified as Unknown because the data was ambiguous for determining normal or tumour.

The 42 nuclei in Set1 were divided into 14 with global low coverage, 14 normal, and 14 tumour; the nuclei in Set2 were divided into 23 with global low coverage, 9 normal, 9 tumour, and 1 Unknown.

Calculating the expected allelic drop-out rate and heterozygous allelic ratio

Allelic drop-out refers to the preferential amplification of one allele at a heterozygous position, and this can be mistaken for the homozygous signal arising from loss of heterozygosity. As a result, approximately 10 positions were selected to assess the LOH status in individual nuclei for predicted deletion events (from the bulk WGS sample). The expected drop-out rate was computed as the proportion of (sufficient coverage) positions with present status for one of reference or variant but not both (XOR) out of all positions from all normal nuclei. Drop-out rates
Validation using targeted deep amplicon DNA sequencing of single-cell nuclei(DOR) for Set1 and Set2 were 0.28 and 0.48, respectively.The expected allelic ratio for a heterozygous position is subject to gDNA con-tamination that can deviate this value away from the theoretical 0.5 ratio. There-fore, to account for this artifact, the expected allelic ratio was computed as the me-dian across all (sufficient coverage) heterozygous positions, determined by havingboth reference and variant present status, from every normal nuclei. The expectedheterozygous allelic (HAR) ratio for Set1 and Set2 were 0.57 and 0.68, respec-tively.Two statistical tests to determine LOH event statusTo determine the LOH status of an event across all SNP positions within the event,two statistical tests were applied to each event. First, the event is assessed for be-ing a true LOH and not due to allelic drop-out. We used a one-tailed binomialtest in which the null hypothesis asserts that the ratio of homozygous:heterozygouspositions is not greater than the drop-out rate. The drop-out rate was used as the ex-pected ratio (probability of success); number of homozygous positions, determinedby present reference XOR variant status, is the number of successes; and the totalnumber of is the number of trials. The second analysis is a one-sample Wilcoxonsigned rank test that was used to examine whether the allelic ratio distributionacross the positions within the event was significantly different than the expectedHAR. In particular, a one-tailed Wilcoxon test was used to assess if the symmetricallelic ratio, SAR =(max(re f reads,variant reads)depth), distribution is greater than HAR.These two tests were applied to deletion and diploid heterozygous events for eachtype of test, separately. The p-values were adjusted using Benjamini & Hochbergcorrection across all events and all tumour or normal nuclei, separately.273C.6. 
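The drop-out rate, the expected heterozygous allelic ratio and the two event-level tests can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used in this work: all function names are hypothetical, the allelic ratios used for the HAR are taken as given by the caller, zero differences are dropped in the signed-rank test (the conventional treatment, assumed here), and the signed-rank p-value is computed from the exact sign-flip distribution, which is practical because each event covers only about 10 positions.

```python
from math import exp, lgamma, log
from statistics import median


def binom_upper_tail(k, n, p):
    """One-tailed exact binomial p-value, P(X >= k) for X ~ Binomial(n, p)."""
    if k <= 0:
        return 1.0
    p = min(max(p, 1e-12), 1 - 1e-12)  # assumption: clamp so both logs are defined
    return min(1.0, sum(
        exp(lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
            + i * log(p) + (n - i) * log(1 - p))
        for i in range(k, n + 1)))


def dropout_rate(calls):
    """calls: (ref_present, var_present) pairs from normal nuclei; XOR fraction."""
    return sum(ref != var for ref, var in calls) / len(calls)


def expected_har(het_ratios):
    """Median allelic ratio over heterozygous positions in normal nuclei."""
    return median(het_ratios)


def sar(ref_reads, var_reads, depth):
    """Symmetric allelic ratio: the larger allele count divided by depth."""
    return max(ref_reads, var_reads) / depth


def loh_vs_dropout_pvalue(n_homozygous, n_positions, dor):
    """Test 1: are there more homozygous positions than drop-out alone predicts?"""
    return binom_upper_tail(n_homozygous, n_positions, dor)


def wilcoxon_greater(values, mu):
    """Test 2: exact one-tailed signed-rank p-value for values tending above mu."""
    diffs = [v - mu for v in values if v != mu]  # assumption: drop zero differences
    n = len(diffs)
    if n == 0:
        return 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks2 = [0] * n  # average ranks, doubled so that tied ranks stay integers
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks2[order[k]] = i + j + 2  # twice the mean of ranks i+1 .. j+1
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks2, diffs) if d > 0)
    counts = {0: 1}  # null distribution: every rank is + or - with probability 1/2
    for r in ranks2:
        nxt = {}
        for s, c in counts.items():
            nxt[s] = nxt.get(s, 0) + c
            nxt[s + r] = nxt.get(s + r, 0) + c
        counts = nxt
    return sum(c for s, c in counts.items() if s >= w_plus) / 2 ** n
```

With the Set1 values reported above, an event in which all 10 positions are homozygous would be tested with loh_vs_dropout_pvalue(10, 10, 0.28), and its SAR values would be compared against the expected ratio with wilcoxon_greater(sar_values, 0.57). Both p-values would then still need the Benjamini and Hochberg adjustment.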
Because the second test did not account for drop-out, both tests were combined by taking the maximum adjusted p-value to generate the final p-value representing the event. This conservatively ensured that a statistically significant final p-value (< 0.05 for both Set1 and Set2) indicated an LOH event that was supported by a homozygous allelic ratio and not due to allelic drop-out. The event was designated as heterozygous (HET) if the final p-value was not statistically significant, and as unknown (UNK) if the final p-value was not statistically significant but the event did not contain at least one heterozygous position (present status for both reference and variant). The cellular prevalence for each event was then computed based on nuclei that had the event status of LOH or HET.

C.6 Supplementary Tables

Table C.1: Spike-in simulation experiment. Randomly sampled deletion (from chr16) and amplification (from chr8) data were inserted into chr1, 2, 9 and 18. The 'Event ID' indicates which admixture sample the data originated from: clonally dominant (tum100), 80% tumour-normal mixture (tum80-norm20), and 60% tumour-normal mixture (tum60-norm40). The length, median allelic ratio and log ratio for each segment are given.

Event ID              Chr      Start        End    Length  Number of SNPs  Allelic Ratio   LogR  Type
DG1136a_tum100          2  228309211  228320019     10809              10           0.70   0.78  AMP
DG1136a_tum100         18   76062577   76069480      6904              10           0.71   0.85  AMP
DG1136a_tum100          1   21551341   21574838     23498              10           0.69   0.72  AMP
DG1136a_tum100          2  170169436  170173164      3729              10           0.69   0.82  AMP
DG1136a_tum100          2  171424123  171530478    106356             100           0.70   0.79  AMP
DG1136a_tum100          1  225636789  225743300    106512             100           0.71   0.81  AMP
DG1136a_tum100          1   59772268   59849153     76886             100           0.72   0.77  AMP
DG1136a_tum100         18    9171177    9395621    224445             100           0.71   0.78  AMP
DG1136a_tum100          9   26376870   27417407   1040538            1000           0.70   0.79  AMP
DG1136a_tum100         18   11825395   12955120   1129726            1000           0.71   0.79  AMP
DG1136a_tum100          1  192738094  194385340   1647247            1000           0.70   0.79  AMP
DG1136a_tum100          2  202842117  205200643   2358527            1000           0.71   0.79  AMP
DG1136a_tum100          2  144547402  160516534  15969133           10000           0.70   0.79  AMP
DG1136a_tum100          9  110421778  110422922      1145              10           0.75  -0.51  DEL
DG1136a_tum100          9  132551331  132552994      1664              10           0.74  -0.46  DEL
DG1136a_tum100         18   61571199   61577041      5843              10           0.68  -0.42  DEL
DG1136a_tum100          1  177836854  177841976      5123              10           0.79  -0.52  DEL
DG1136a_tum100          1    7661765    7736097     74333             100           0.76  -0.51  DEL
DG1136a_tum100          2   23630297   23689945     59649             100           0.76  -0.51  DEL
DG1136a_tum100          1  210038318  210354131    315814             100           0.75  -0.53  DEL
DG1136a_tum100          2  228809625  228908051     98427             100           0.76  -0.51  DEL
DG1136a_tum100          2  164851354  166979889   2128536            1000           0.75  -0.52  DEL
DG1136a_tum100          9   74927609   76427979   1500371            1000           0.76  -0.51  DEL
DG1136a_tum100          2   60491758   61872624   1380867            1000           0.75  -0.52  DEL
DG1136a_tum100          9   72639118   73577447    938330            1000           0.76  -0.50  DEL
DG1136a_tum100          9   15018976   25043605  10024630           10000           0.75  -0.51  DEL
DG1136a_tum80-norm20    9   31488462   31489598      1137              10           0.60   0.48  AMP
DG1136a_tum80-norm20   18   32460023   32611888    151866              10           0.60   0.48  AMP
DG1136a_tum80-norm20    9   70162694   70173990     11297              10           0.58   0.48  AMP
DG1136a_tum80-norm20   18   59450622   59459110      8489              10           0.62   0.51  AMP
DG1136a_tum80-norm20    1   26449595   26528761     79167             100           0.60   0.49  AMP
DG1136a_tum80-norm20    2  141363453  141422030     58578             100           0.59   0.47  AMP
DG1136a_tum80-norm20   18   65367112   65428015     60904             100           0.61   0.47  AMP
DG1136a_tum80-norm20    2  181675610  181764318     88709             100           0.61   0.47  AMP
DG1136a_tum80-norm20    2   27013263   28718249   1704987            1000           0.60   0.47  AMP
DG1136a_tum80-norm20   18   67227276   68611117   1383842            1000           0.61   0.46  AMP
DG1136a_tum80-norm20    2  172543651  173498025    954375            1000           0.60   0.47  AMP
DG1136a_tum80-norm20   18   65537101   66580694   1043594            1000           0.61   0.47  AMP
DG1136a_tum80-norm20    2  211854174  224694654  12840481           10000           0.61   0.47  AMP
DG1136a_tum80-norm20    9    1594081    1596072      1992              10           0.57  -0.39  DEL
DG1136a_tum80-norm20    2  210633725  210642473      8749              10           0.60  -0.35  DEL
DG1136a_tum80-norm20    2  133097321  133098040       720              10           0.67  -0.39  DEL
DG1136a_tum80-norm20    1   79988932   79999331     10400              10           0.65  -0.39  DEL
DG1136a_tum80-norm20   18   39193771   39316813    123043             100           0.60  -0.35  DEL
DG1136a_tum80-norm20    9   98990674   99073357     82684             100           0.62  -0.33  DEL
DG1136a_tum80-norm20    9   44872439   44993571    121133             100           0.62  -0.33  DEL
DG1136a_tum80-norm20   18   71571911   71653657     81747             100           0.61  -0.33  DEL
DG1136a_tum80-norm20   18   57890756   59319282   1428527            1000           0.60  -0.35  DEL
DG1136a_tum80-norm20    1   73226328   73912799    686472            1000           0.60  -0.35  DEL
DG1136a_tum80-norm20   18   34879309   36037470   1158162            1000           0.61  -0.35  DEL
DG1136a_tum80-norm20   18   24656519   27006298   2349780            1000           0.61  -0.35  DEL
DG1136a_tum80-norm20    9   83489396   95699605  12210210           10000           0.61  -0.35  DEL
DG1136a_tum60-norm40    1  235462744  235467933      5190              10           0.61   0.36  AMP
DG1136a_tum60-norm40    2   84542478   84548690      6213              10           0.64   0.38  AMP
DG1136a_tum60-norm40    9  115400528  115483680     83153              10           0.53   0.33  AMP
DG1136a_tum60-norm40    9   14454393   14458887      4495              10           0.58   0.34  AMP
DG1136a_tum60-norm40    9   14582043   14664357     82315             100           0.57   0.35  AMP
DG1136a_tum60-norm40    1   53010798   53103781     92984             100           0.57   0.36  AMP
DG1136a_tum60-norm40    1  246301867  246438778    136912             100           0.56   0.34  AMP
DG1136a_tum60-norm40    2   64292301   64360433     68133             100           0.57   0.35  AMP
DG1136a_tum60-norm40   18   14559167   15390699    831533            1000           0.57   0.35  AMP
DG1136a_tum60-norm40    2    4121289    5008127    886839            1000           0.57   0.35  AMP
DG1136a_tum60-norm40    2   16951286   18038730   1087445            1000           0.57   0.35  AMP
DG1136a_tum60-norm40   18    6578141    7508561    930421            1000           0.57   0.35  AMP
DG1136a_tum60-norm40    2   71297982   82998737  11700756           10000           0.57   0.35  AMP
DG1136a_tum60-norm40   18   72226504   72233320      6817              10           0.62  -0.33  DEL
DG1136a_tum60-norm40   18   22740958   22777081     36124              10           0.56  -0.29  DEL
DG1136a_tum60-norm40   18   66834481   66853925     19445              10           0.58  -0.29  DEL
DG1136a_tum60-norm40    1  192205562  192213236      7675              10           0.55  -0.30  DEL
DG1136a_tum60-norm40    2  143377394  143420708     43315             100           0.57  -0.30  DEL
DG1136a_tum60-norm40    1   85938561   85991140     52580             100           0.58  -0.30  DEL
DG1136a_tum60-norm40    9    6927093    6941889     14797             100           0.57  -0.31  DEL
DG1136a_tum60-norm40    2  242125784  242221219     95436             100           0.59  -0.30  DEL
DG1136a_tum60-norm40    2  104816213  106152788   1336576            1000           0.58  -0.29  DEL
DG1136a_tum60-norm40    2  123606159  124915643   1309485            1000           0.58  -0.30  DEL
DG1136a_tum60-norm40   18    2776307    3675604    899298            1000           0.58  -0.30  DEL
DG1136a_tum60-norm40    1   44565239   47000996   2435758            1000           0.58  -0.30  DEL
DG1136a_tum60-norm40    1   28247278   43014649  14767372           10000           0.58  -0.30  DEL
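As a supplement to the single-cell analysis in Section C.5.3, the rule for combining the two adjusted p-values into an event status can be sketched in Python. The function names are hypothetical. Computing cellular prevalence as the fraction of LOH nuclei among nuclei called LOH or HET is an assumption, since the text states only that the calculation was based on those nuclei.

```python
def event_status(p_dropout_adj, p_ratio_adj, n_heterozygous, alpha=0.05):
    """Combine the two adjusted p-values conservatively and label the event."""
    final_p = max(p_dropout_adj, p_ratio_adj)  # both tests must be significant
    if final_p < alpha:
        return "LOH"
    # not significant: HET needs at least one heterozygous position as support
    return "HET" if n_heterozygous >= 1 else "UNK"


def cellular_prevalence(statuses):
    """Assumed definition: LOH nuclei divided by (LOH + HET) nuclei."""
    informative = [s for s in statuses if s in ("LOH", "HET")]
    if not informative:
        return None  # no nucleus gave an informative call for this event
    return sum(s == "LOH" for s in informative) / len(informative)
```

Taking the maximum means that an event is called LOH only when both the drop-out test and the allelic ratio test are significant after adjustment.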

