Open Collections

UBC Theses and Dissertations

Charting clonal heterogeneity in breast cancers : from bulk tumor genomes to single-cell genotypes Khattra, Jaswinder 2015

CHARTING CLONAL HETEROGENEITY IN BREAST CANCERS: FROM BULK TUMOR GENOMES TO SINGLE-CELL GENOTYPES

by

Jaswinder Khattra

M.Sc., The University of British Columbia, 2008

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Pathology and Laboratory Medicine)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

May 2015

© Jaswinder Khattra, 2015

Abstract

Traditional classification and treatment of human cancers have been limited by assumptions of tumor homogeneity and mutational stasis. Clinical metrics of malignant tumors have focused on descriptive and behavioral properties such as tissue of origin, cellular morphologic features and extent of spread. Missing has been an understanding of the dynamics of cellular subpopulations that underpin divergent functional properties in space and time. This dissertation is focused on the development and application of methods, including next generation DNA sequencing, computational modeling, and single-cell genotyping protocols, to elucidate breast tumor heterogeneity and clonal evolution at single nucleotide and single-cell resolution. First, I present advances in our knowledge of the mutational spectrum that may occur and evolve in an individual epithelial cancer, namely a lobular breast cancer metastasis and its matched primary tumor, separated by a nine-year interval. This seminal study demonstrated clonal evolution in a patient’s breast cancer and the successful application of targeted deep sequencing for determining digital allelic prevalences and clonal genotypes in bulk tumors. Second, I describe the diversity of genomic sequence and clonal heterogeneity in tumors of the triple-negative breast cancer subtype. The study uncovered wide clonal diversity in these primary tumors at first diagnosis.
Third, I demonstrate, by genotyping single tumor cells, that computational inferences of tumor clonal architecture can be made reliably from bulk tissue-derived data sets. This was performed using both somatic point mutations and loss of heterozygosity loci as clonal marks. Fourth, I applied single-cell analysis to study clonal evolution in breast tumor murine xenografts following engraftment and serial passaging. This research uncovered a range of outcomes in tumor clonal composition upon initial engraftment and serial passaging. The same clonal groups were found to arise independently in separate xenografts derived from the same primary tumor, suggesting selection of functionally significant genotypes.

Comprehensive capabilities in the measurement and analysis of clonal structure in cancers offer improved classification and combinatorial treatment of subpopulations in heterogeneous tumors, and better use of murine xenograft models. Functionally relevant subpopulations of tumor cells, irrespective of numerical abundance or spatiotemporal persistence, can thereby be targeted using clonally informative genomic profiles.

Preface

I coauthored five original research manuscripts resulting from the collaborative research activities described below, and their content is incorporated into this dissertation. These publications include one first co-authorship and two second authorships. Permission to reuse published content from Elsevier (for a chapter in the textbook Cancer Genomics), Nature Publishing Group (for three journal articles in Nature and one in Nature Methods), and the Cold Spring Harbor Laboratory Press (for a journal article in Genome Research) is granted by way of author rights to reproduce published materials in a thesis or dissertation. The first two manuscripts listed below (in Nature), describing the lobular and triple-negative breast cancers, are incorporated in chapter two of this dissertation.
The third (in Nature Methods) and fourth (in Genome Research) manuscripts, addressing two computational approaches for inferring tumor clonal structure and their validation by single-cell genotyping, resulted from work described in chapter three. The fifth manuscript (in Nature), documenting murine xenograft models of breast cancers at single-cell resolution, comprises research presented in chapter four.

Sohrab P Shah*, Ryan D Morin*, Jaswinder Khattra, Leah Prentice, Trevor Pugh, Angela Burleigh, Allen Delaney, Karen Gelmon, Ryan Guliany, Janine Senz, Christian Steidl, Robert A Holt, Steven Jones, Mark Sun, Gillian Leung, Richard Moore, Tesa Severson, Greg A Taylor, Andrew E Teschendorff, Kane Tse, Gulisa Turashvili, Richard Varhol, René L Warren, Peter Watson, Yongjun Zhao, Carlos Caldas, David Huntsman, Martin Hirst, Marco A Marra, Samuel Aparicio. Mutational evolution in a lobular breast tumour profiled at single nucleotide resolution. Nature 461, 809-813 (2009). *Equal contribution.

Shah SP, Roth A, Goya R, Oloumi A, Ha G, Zhao Y, Turashvili G, Ding J, Tse K, Haffari G, Bashashati A, Prentice LM, Khattra J, Burleigh A, Yap D, Bernard V, McPherson A, Shumansky K, Crisan A, Giuliany R, Heravi-Moussavi A, Rosner J, Lai D, Birol I, Varhol R, Tam A, Dhalla N, Zeng T, Ma K, Chan SK, Griffith M, Moradian A, Grace Cheng SW, Morin GB, Watson P, Gelmon K, Chia S, Chin SF, Curtis C, Rueda OM, Pharoah PD, Damaraju S, Mackey J, Hoon K, Harkins T, Tadigotla V, Sigaroudinia M, Gascard P, Tlsty T, Costello JF, Meyer IM, Eaves CJ, Wasserman WW, Jones S, Huntsman D, Hirst M, Caldas C, Marra MA, Aparicio S. The clonal and mutational evolution spectrum of primary triple-negative breast cancers. Nature 486, 395-399 (2012).

Roth A, Khattra J, Yap D, Wan A, Laks E, Biele J, Ha G, Aparicio S, Bouchard-Cote A, Shah SP. PyClone: statistical inference of clonal population structure in cancer. Nature Methods 11, 396–398 (2014).

Ha G, Roth A, Khattra J, Yap D, Melynk N, McPherson A, Prentice LM, Bashashati A, Laks E, Biele J, Ding J, Le A, Rosner J, Shumansky K, Marra MA, Gilks CB, Huntsman DG, McAlpine JN, Aparicio S, Shah SP. TITAN: Inference of copy number architectures in clonal cell populations from tumor whole genome sequence data. Genome Research (2014). doi:10.1101/gr.180281.114.

Eirew P*, Steif A*, Khattra J*, Ha G, Yap D, Farahani H, Gelmon K, Chia S, Mar C, Wan A, Laks E, Biele J, Shumansky K, Rosner J, McPherson A, Nielsen C, Roth AJL, Lefebvre C, Bashashati A, de Souza C, Siu C, Aniba R, Brimhall J, Oloumi A, Osako T, Bruna A, Sandoval J, Algara T, Greenwood W, Leung K, Cheng H, Xue H, Wang Y, Lin D, Mungall A, Moore R, Zhao Y, Lorette J, Nguyen L, Huntsman D, Eaves CJ, Hansen C, Marra MA, Caldas C, Shah SP, Aparicio S. Population dynamics of genomic clones in breast cancer patient xenografts at single cell resolution. Nature (2014). doi:10.1038/nature13952. *Equal contribution.

I contributed two sections of a book chapter titled Second-Generation Sequencing for Cancer Genome Analysis, written in collaboration with colleagues at the British Columbia Cancer Agency Genome Sciences Centre and published in the following textbook:

Chun, H.-J. E., Khattra, J., Krzywinski, M., Aparicio, S. A. & Marra, M. A. Cancer Genomics 13–30 (2014). doi:10.1016/B978-0-12-396967-5.00002-5.

My contribution to the tumor genome sequencing projects detailed in chapter two included processing patient tumor samples for next generation DNA and RNA sequencing and conducting validation experiments in the lobular breast cancer genome project. I also performed primary cell culture and xenografting experiments with the lobular breast tumor metastases. The design of these projects was conceived by Drs. Samuel Aparicio and Marco Marra. Drs. Sohrab Shah and Ryan Morin led the analysis and publication of the resulting data sets.
Large scale whole genome sequencing was conducted at the British Columbia Cancer Agency Genome Sciences Centre.

For the single-cell analysis projects described in chapter three, I designed and conducted experiments for developing microlitre-scale protocols and applied them to the study of cell lines and of breast and ovarian tumors. I also applied these methods to validate computational predictions of tumor clonal architecture made using bulk tissue-derived genomic data. The projects were conceived by Drs. Carl Hansen and Samuel Aparicio. Dr. Sohrab Shah led the analysis of single-cell data sets with Drs. Andrew Roth and Gavin Ha.

In the breast tumor xenografting project covered in chapter four, I contributed to the implementation of targeted amplicon DNA library construction and NGS workflows, plus single-cell experiments to reconstruct tumor xenograft clonal dynamics. The design and execution of the project were led by Drs. Peter Eirew and Samuel Aparicio. Single-cell data analyses were led by Dr. Sohrab Shah and Ms. Adi Steif. Large scale whole genome DNA sequencing was conducted at the British Columbia Cancer Agency Genome Sciences Centre.

Ethics approval from the BCCA Research Ethics Board and UBC Animal Care Committee for experiments conducted in this dissertation included the following certificates: H06-00289, H08-01230, H09-00026, H11-01887, A7-0524, A11-0137.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Glossary
Acknowledgements
Chapter 1: Introduction
  1.1 Breast tumor heterogeneity
  1.2 Clonal evolution in malignant tissues
  1.3 Shortcomings in the traditional classification of breast cancers
  1.4 Cancer taxonomy in the omics era
    1.4.1 From Sanger to second generation DNA sequencing of bulk tumors
    1.4.2 Key concepts in clonality as defined by genomic sequence features
    1.4.3 From cell population-averaged data to single-cell genotypes
  1.5 Thesis goals and scientific questions
Chapter 2: Genome analysis of breast tumors at single nucleotide resolution
  2.1 Cancer genomes are not static: Mutational evolution between a paired metastatic and primary lobular breast cancer
    2.1.1 Background – Lobular breast cancer
    2.1.2 Methods
      Tumor specimens and cytology
      Laboratory models of a lobular breast cancer metastasis
      Second generation sequencing and targeted validation of variants
      Analysis of tumor whole genome and transcriptome data sets
    2.1.3 Results
      Whole genome and transcriptome metrics of two LBC tumors
      A metastatic breast tumor harbors thirty-two somatic coding mutations
      Insights from comparing a breast tumor’s genome and transcriptome
      Primary and metastatic breast tumors differ in their mutational spectrum
      RNA editing recodes amino acid sequences in a metastatic breast tumor
  2.2 What is a cancer subtype? Mutational heterogeneity in primary TNBCs
    2.2.1 Background – Triple-negative breast cancers
    2.2.2 Methods
    2.2.3 Results
      Primary TNBCs exhibit a wide spectrum of somatic mutation
      One-third of somatic coding mutations are transcribed
      Allelic measurements at depth reveal a diversity of clonal prevalence
      Mutations in tumor suppressor genes are not always early events
Chapter 3: A leap in resolution: Single-cell genotypes define tumor clonal structure
  3.1 Synopsis
  3.2 Methods development
    3.2.1 Preparation of nuclei from bulk tissue samples
    3.2.2 Isolating individual nuclei for genotyping
    3.2.3 Improving genomic DNA accessibility in single-cell assays
    3.2.4 Primer design strategy influences PCR efficiency
    3.2.5 Considerations for single-cell PCR reaction chemistries
    3.2.6 Efficiency of PCR amplification from a single-cell genome
    3.2.7 Improvements in single-cell targeted amplicon library quality
    3.2.8 Sequencing low complexity single-cell targeted amplicon libraries
  3.3 Protocols for genotyping single cells by two-round targeted PCRs and NGS
    3.3.1 Preparation of nuclei suspensions
    3.3.2 Flow cytometric sorting of single nuclei
    3.3.3 PCR oligonucleotide primers design
    3.3.4 First round multiplex PCR of single nuclei
    3.3.5 Second round re-amplification by singleplex PCRs
    3.3.6 Nuclei-specific amplicon barcoding and addition of NGS adaptors
    3.3.7 Targeted amplicon library quality assessment and purification
    3.3.8 NGS of targeted amplicon library using Illumina MiSeq chemistry
  3.4 Single-cell SNV genotyping validates inferences of tumor clonal structure
    3.4.1 Background - PyClone
    3.4.2 Targeted DNA amplicon sequencing of somatic SNVs
    3.4.3 Statistical analysis of single-cell SNV data
  3.5 Single-cell LOH genotyping validates inferences of tumor clonal structure
    3.5.1 Background - TITAN
    3.5.2 Targeted DNA amplicon sequencing of LOH loci
    3.5.3 Statistical analysis of single-cell LOH data
      Calculating allelic drop-out and heterozygous allelic ratios
Chapter 4: Clonal evolution in breast tumor xenografts at single-cell resolution
  4.1 Background – Murine xenograft models for personalized oncology
  4.2 Methods
    4.2.1 Single-cell genotyping of murine xenografts
  4.3 Results
    4.3.1 Xenoengraftment can yield simple or complex clonal architectures
    4.3.2 Patient tumor-derived xenografts undergo changes in clonal composition
    4.3.3 Single-cell genotypes reconstruct xenograft clonal structure
    4.3.4 Serially passaged xenografts undergo clonal evolution
    4.3.5 Replicate xenografts recapitulate genomically defined clonal structures
Chapter 5: Discussion
  5.1 Conclusions and significance
  5.2 Limitations and challenges
    5.2.1 Analysis of single-cell genomes
      Allelic drop-out and noisy data
    5.2.2 Implementation of murine xenograft models
  5.3 Applications and future direction
    5.3.1 Tumor clonality and single-cell assays in the clinic
    5.3.2 Adequate spatial and temporal sampling of patient tumors
      Circulating tumor cells and cell-free nucleic acids
    5.3.3 Final words
References

List of Tables

Table 1. Whole genome and transcriptome coverage metrics of a metastatic LBC
Table 2. Somatic coding SNVs in a metastatic LBC
Table 3. Digital prevalence of somatic SNVs in a primary LBC tumor and metastases
Table 4. RNA edits confirmed by Sanger sequencing
Table 5. Second round analytical PCR pre-mix
Table 6. Second round analytical PCR reaction mix
Table 7. Primers solution mix for Fluidigm Access Array™ singleplexes
Table 8. PCR pre-mix for Fluidigm Access Array™ singleplexes
Table 9. Sample mix for Fluidigm Access Array™ singleplexes
Table 10. Molecular barcoding PCR pre-mix

List of Figures

Figure 1. Heterogeneity in a hypothetical tumor biopsy
Figure 2. NGS read counts versus Sanger sequence traces for a MYH8 locus SNV
Figure 3. Analysis of tumor evolution using different DNA sequencing technologies
Figure 4. General steps in NGS analysis of whole tumor genomes
Figure 5. LBC tumors genome and transcriptome sequencing and validation workflows
Figure 6. RNA editing events in COG3 and SRP9 transcripts
Figure 7. Somatic mutational spectrum across 65 TNBCs
Figure 8. Inferring clonal prevalence from TNBC bulk tumor-derived genomic data
Figure 9. A wide spectrum of inferred number of clonal clusters in TNBCs
Figure 10. Clonal prevalence of TNBC tumor-specific mutations
Figure 11. Integrating bulk tumor and single-cell assays with computational modeling
Figure 12. Overview of protocol steps for single-cell genotyping assay development
Figure 13. Nuclei preparations from two breast tumors
Figure 14. Flow cytometric gating of tumor nuclei from a primary TNBC
Figure 15. PCRs comparing amplification from multiplex versus singleplex primer designs
Figure 16. Optimizing PCRs for improved uniformity of multiplexed reactions
Figure 17. Amplicon libraries from incremental amounts of genomic DNA and nuclei
Figure 18. Electropherograms of singleplex PCR amplicons from one LBC nucleus
Figure 19. Titrating down barcoded adaptors to reduce spurious amplicons
Figure 20. Agarose gel versus paramagnetic beads cleanup for NGS amplicon libraries
Figure 21. Bulk tissue SNV data-derived clonal structure predictions in an ovarian tumor
Figure 22. Single-cell validation of SNV-predicted clones in an ovarian tumor
Figure 23. Bulk tissue SNV data-derived clonal structure predictions in a TNBC tumor
Figure 24. Single-cell validation of SNV-predicted clones in a TNBC tumor
Figure 25. Validation of LOH-predicted clonal structure across 28 ovarian tumor nuclei
Figure 26. Validation of LOH-predicted clonal structure across 18 ovarian tumor nuclei
Figure 27. Patient breast tumors xenografting timeline
Figure 28. Mutational spectrum of two xenograft cases using bulk tissue-derived data sets
Figure 29. Single-cell lineage genotypes in an ER-positive breast tumor and xenograft
Figure 30. Single-cell lineage genotypes in a TNBC tumor and serial xenografts
Figure 31. Clonal genotype evolution in a TNBC tumor and serial xenografts

List of Abbreviations

ADO  allelic drop-out
CNA  copy number aberration
CTC  circulating tumor cell
EGFR  epidermal growth factor receptor (HER1)
ER  estrogen receptor-alpha
ERBB2  v-erb-b2 avian erythroblastic leukemia viral oncogene homolog 2 (HER2)
FDR  false discovery rate
FFPE  formalin-fixed paraffin embedded
FISH  fluorescence in situ hybridization
HER2  human epidermal growth factor receptor 2 (ERBB2)
IFC  integrated fluidic circuit (a Fluidigm array)
INSR  insulin receptor
LBC  lobular breast cancer
LOH  loss of heterozygosity
MTP  microtitre plate, 96- or 384-well
NGS  next (second) generation DNA sequencing
NRG  NOD/RAG1-/-/IL2r-gamma-/-
NSG  NOD/SCID/IL2r-gamma-/-
PARP1  poly(ADP-ribose) polymerase 1
PBS  phosphate-buffered saline
PCR  polymerase chain reaction
PR  progesterone receptor
qPCR  quantitative polymerase chain reaction
ROC  receiver operating characteristic
SKY  spectral karyotyping
SNP  single nucleotide polymorphism
SNV  single nucleotide variant
TNBC  triple-negative breast cancer
WGA  whole genome amplification
MDA  multiple displacement amplification
WGSS  whole genome shotgun sequencing
WTSS  whole transcriptome shotgun sequencing

Glossary

Bulk tumor: tissue mass with millions of cells, several cell types and stroma
Mutation cellular prevalence: the proportion of tumor cells harboring a mutation, inferred computationally or measured using single-cell methods
Chromothripsis: a cluster of hundreds of rearranged chromosomal fragments
Clonal dynamics: the flux in clonal prevalence in a population of tumor cells over space and time
Clonal evolution: an iterative process of genetic diversification, clonal selection and clonal expansion or contraction in malignant microenvironments
Clonal genotype: a set of genomic features defining a cell’s membership in a clone
Clonal lineage: the hierarchical relationship between cells in a population over space and time
Clonal mark: a stable heritable cellular feature, such as an (epi)genetic, non-genetic, or exogenous (experimentally introduced) feature
Clonal prevalence: the fraction of a tumor cell population belonging to a specific clone
Clonal selection: the flux in clonal prevalences due to processes such as survival and proliferation
Clone: a group of cells related by descent from a common ancestral cell
Multiplex PCR: multiple targeted loci (48 in this dissertation) per PCR
Mutation cluster: a mutation grouping from spatiotemporal population measurements, inferred computationally and assumed to be present in the same clone
Singleplex PCR: one locus targeted per polymerase chain reaction
Subclone: a clone of minor prevalence
Variant allelic prevalence: the fraction of aligned sequence reads with the variant allele

Acknowledgements

I am enormously grateful to my supervisor Dr. Samuel Aparicio for an abundance of support and patience through my doctoral training program, especially surrounding a leave of absence with interrupted research projects. Thank you to Drs. Carl Hansen, David Huntsman and Haydn Pritchard for their guidance as PhD Committee Members. A big thank you to the administrative and laboratory staff in the BCCRC Department of Molecular Oncology for support with day-to-day operations and the execution of large collaborative research projects. I thank Drs. Angela Burleigh and Peter Eirew for training in cell culture and murine xenografting methods; Teresa Algara for assistance with rodent husbandry; Dr. Kaston Leung and members of Dr. Carl Hansen’s laboratory for extensive collaboration in single-cell handling and amplification chemistries; Dr. Damian Yap for timely assistance with flow cytometry, coordinating MiSeq platform workflows and enthusiastic discussions; Dr. Sohrab Shah, his bioinformatics staff and students for critical assistance with designing experiments from genome scale data sets and single-cell data analysis; Ms. Raewyn Billings for cell cultures and testing targeted PCR amplicon assays in cell line mixture experiments for simulating tumor heterogeneity as part of her master’s thesis work; Mr. Keith Mewis, Mr. Mani Hamidi and Ms. Carolina Novoa for their contributions in developing experimental techniques during research rotations as UBC GSAT master’s students; Mr. Joseph Lau, Mr. Karn Safaya, Ms. Emma Laks, and Ms. Justina Biele for assistance with conducting experiments and single-cell protocols development as undergraduate co-op students. Thank you to the BCCRC Terry Fox Laboratory Flow Cytometry Core for assistance with customized cell sorting and to the BCCA Centre for Translational and Applied Genomics for histology services and access to laboratory instrumentation critical to all experiments conducted for this dissertation. A special thank you to Ms. Aleya Abdulla and Dr. Haydn Pritchard, at the UBC Department of Pathology and Laboratory Medicine’s Graduate Student Support office, for their assistance during my doctoral program with matters both routine and non-routine. I am grateful for external funding support in the form of graduate scholarships from the Natural Sciences and Engineering Research Council of Canada and the Michael Smith Foundation for Health Research, plus a doctoral fellowship and tuition awards from the University of British Columbia. Finally, I thank my mother and brother’s family for all the years of emotional and financial support and assistance with child care, without which I could not have completed my doctoral program.

Chapter 1: Introduction

1.1 Breast tumor heterogeneity

Several normal cellular processes, when dysregulated, can give way to heterogeneous malignant breast tissue. The natural history of a human female breast is characterized by marked structural and morphological variation over time and space [1].
The embryonic mammary bud begins proliferating after birth, followed by pubertal changes resulting in the formation of epithelial ducts with terminal end buds in the fat pad displaying proliferative branching. Hormonal changes during pregnancy in reproductively active females result in further proliferation, with lobular alveoli transformed into milk-secreting structures. Post-pregnancy processes of involution again remodel the breast. Similarly, the menstrual cycle orchestrates regular proliferation and remodeling. At menopause, hormonal changes mark yet another major event in the physiology of the female breast. Thus, normal physiology of the human female breast involves a near lifetime of heterogeneity in cellular and tissue architecture. This natural heterogeneity derives from many cellular processes, including proliferation, angiogenesis, invasion, apoptosis, and stromal modification, several of which can converge to bear on carcinogenesis.

Heterogeneity in breast carcinogenesis may be described in several contexts. It may mean any morphological, metabolic, physicochemical, (epi)genetic, biomolecular or behavioral feature which differs spatially or temporally between tumors or across one tumor mass (Figure 1). In the context of clinical disease, this variation may refer to any of several clinical metrics: variable numbers of a hormone receptor in cells across sections of a single tumor, or between different tumors of the same clinical subtype; or differences in time to metastatic disease or relapse amongst patients diagnosed with the same subtype of breast cancer. The heterogeneity may be divided into three general categories. First, intertumoral heterogeneity describes differences occurring between breast cancers classified as belonging to the same subtype. Second, intrapatient intertumoral heterogeneity refers to differences between several tumors from the same patient.
Third, variation within a physically contiguous tumor mass from a patient is intratumoral heterogeneity2. A looser interpretation of the latter is warranted in blood or other bodily fluid samples harboring tumor cells, such as pleural effusions or cerebrospinal fluid taps where tumor content could be derived from multiple sites.

Figure 1. Heterogeneity in a hypothetical tumor biopsy. A variety of cell types with different molecular features are shown in a tumor mass comprised of two clones. Typically, a biopsy may capture only a subset of these cell types and a much smaller fraction of the bulk tumor mass than suggested in this illustration.

These various states of malignant cells present distinct challenges for treatment. For instance, chemotherapeutically ameliorated tumors may regrow from a few subclones of tumor cells with greater clonal fitness. Furthermore, these subclones with greater clonal fitness may occur in one patient’s breast tumor but not in another’s tumor, resulting in different clinical outcomes. Alternatively, a small population of otherwise therapy-sensitive tumor cells may escape treatment by virtue of a biophysical barrier to the delivery of an otherwise effective therapeutic agent, such as an abnormal extracellular matrix with increased resistance to drug penetration. These malignant escapees may then rebound to repopulate the former tumor microenvironment to its carrying capacity. Or, differences may occur in the hormone receptor status of a metastatic breast tumor versus its matched primary tumor, rendering a targeted therapeutic regimen appropriate in one instance but not both3,4. Intratumoral heterogeneity was observed nearly 100 years ago when Theodor Boveri, peering through a light microscope, noted variation in chromosomal features between tumor cells and predicted their relevance to malignancy, well before higher resolution cytogenetic approaches documented the diversity of structural and numerical chromosomal aberrations in tumors5.
Others, such as David Paul von Hansemann, described diversity in abnormal mitotic processes in tumor cells. Modern cytogenetic approaches now allow for more detailed characterization of intratumoral heterogeneity, including direct observations of large-scale structural aberrations and abnormal ploidy. With regard to the scientific history of recognizing differences between tumor masses, early descriptions of intertumor heterogeneity focused on gross anatomic features, including the size of tumors. For example, larger breast tumors were posited to carry a poorer prognosis and to reflect the age of a tumor6. The modern era of molecular heterogeneity and rationally targeted cancer therapy began with the scoring of hormone receptor status for endocrine therapy of estrogen receptor-positive breast cancers7.

1.2 Clonal evolution in malignant tissues

In 1976 Peter Nowell summarized the prevailing concept of clonal evolutionary processes operating in tumor microenvironments8. He speculated that malignant tissues are subject to Darwinian evolution, whereby each single malignant cell is ascribed a quasi-organismal status possessing unique and heritable genetic or epigenetic features that confer unequal fitness phenotypes. These fitness differences in turn influence the expansion or contraction of clonal subpopulations as environmental stresses (e.g. hypoxia, therapy, immune response) drive selection in space and time. In solid tumors, dividing cells may yield spatial clones or subgroups of mutationally related cells. Though regaining popularity today, this conceptual framework has been applied to interpret tumor heterogeneity for some time.

For instance, reports of ploidy-defined clonal architectures in leukemias and tumors of the breast date back several decades9 and have been described again recently at single-cell resolution in breast cancers10,11.
There is a growing recognition of the utility of incorporating aspects of tumor clonal structure as a relevant metric in clinical oncology12. For example, clonal selection can result in markedly different genetic profiles of metastatic disease when compared to matched primary tumors in aggressive childhood brain cancers13. Thus, considering the two stages of malignancy to be distinct clinical entities is warranted. However, this Darwinian framework of tumor evolution is challenged with issues similar to those inherent in studying evolutionary biology over geological time scales. For instance, cell lineage analysis is often restricted to a single time point and thus severely limited by the inability to sample intermediate stages. Computational modeling approaches of tumor clonal evolution include major assumptions such as unidirectional change. Examples of other challenges include sequence reversions that remain too cryptic to resolve and difficulty recognizing aberrant genomic features arising independently more than once.

On the topic of how clonal diversity in solid tumors arises or evolves in space and time, there is vigorous debate. Can any cell within a tumor divide and maintain the malignant mass, or is it a (numerically small) subset of tumor cells? Reaching back further to the beginnings of a tumor cell lineage tree, can any somatic cell be disrupted beyond some threshold of disciplined growth via an (incremental) accumulation of mutations, or is that phenomenon restricted to rare and less differentiated cells such as tissue stem cells? The former inferences are in line with Nowell’s clonal evolution framework, while the latter fall in the camp of cancer stem cells being the key players. These are, of course, two polarized ideas and cancers may include varying degrees of both processes occurring over space and time.
Though both schools of thought share the premise that the cancerous process originates in a single cell, there are different therapeutic implications for each scenario, in that halting the growth of cancer stem cell-driven tumors would require specifically targeting all such cells. Following important early experiments in leukemia14, an assortment of cell marker combinations is now in use for functionally marking putative cancer stem cells in solid tumors. The possibility of cancer stem cells operating in malignancies of the breast was advanced with a CD44+/CD24-/low immunophenotype satisfying the functional properties of such a cell type15. Similarly, although not as well developed as in leukemias, markers for such cells have also been reported in solid malignancies of the brain16 and colon17.

1.3 Shortcomings in the traditional classification of breast cancers

The Bloom-Richardson style histological grading of large series of breast cancer cases began in the 1950s18. The invasive ductal and lobular types represent the majority, about 90 %, of breast carcinomas. Since the 1950s, features such as cell and nuclear size, morphologies, IHC staining of antigens, and observational data such as tumor spread (in situ, invasive) have been the mainstay of clinical annotation of the morphologically diverse cancers of the breast. However, even established protocols such as standardized IHCs can be confounded by heterogeneous staining or intratumoral sectional differences19. This traditional classification of cancers falls short clinically when cases with apparently identical histopathologies progress differently or show different recurrence rates20.

Early work with in vitro laboratory models revealed similar phenomena of heterogeneity, even at a functional level, amongst cells derived from the same neoplasm.
For example, experiments using a single murine tumor-derived set of cell lines showed marked differences amongst sublines or metastases with respect to chemotherapeutic drug sensitivities21,22. These studies pointed to the challenge of developing accurate models for predicting drug response and the need for combinatorial therapies targeting functionally distinct subpopulations of cells. Other varying features observed in neoplasms included karyotypic heterogeneity across quadrants of individual breast tumors, pointing to the need for multiple biopsies of an individual tumor23.

Even with our currently evolving appreciation for the cellular complexity of malignant tissues, clinically complex tumor behaviors such as therapy-refractory subclones and relapse following targeted therapy continue to challenge oncologists. As elegantly documented in primary renal carcinomas24, prognosis derived from scoring of multiple biopsies from the same patient tumor can be discordant, pointing to serious shortcomings in clinical sampling and interpretation. The same problem may occur when there are functionally important differences between primary and metastatic disease in one patient25.

Despite mounting evidence of their clinical importance, the extent and nature of clonal heterogeneity have yet to be incorporated in tumor classification and treatment. More modern techniques for tumor classification using molecular analysis are now being introduced with the anticipation of delivering meaningful tumor groupings and rational targets for therapy. The dynamic spatial and temporal properties of tumor clonal architecture, as determined by genomic analysis, may better inform the choice and sequence of combinatorial treatments in managing heterogeneous malignancies.
1.4 Cancer taxonomy in the omics era

1.4.1 From Sanger to second generation DNA sequencing of bulk tumors

Second (or next) generation DNA sequencing methods have greatly improved resolution and sensitivity in detecting genomic heterogeneity26. Prior to this development, decades of cancer genomic studies using Sanger DNA sequencing27 on targeted genomic regions identified recurring genes (oncogenes and tumor suppressors) that are commonly mutated in cancer. Targeted amplification and Sanger sequencing projects catalogued increasing numbers of small insertions, deletions and somatic point mutations. Discerning subclonal sequence variation, however, remained a low resolution challenge with Sanger sequencing. At best, a high quality Sanger sequence electropherogram can detect a minor allele whose peak is about 5 % of the primary allele peak, when determined using sequence trace analysis algorithms in software such as Mutation Surveyor28. This translates to a detection sensitivity corresponding to about one in sixteen template molecules, or haplotypes. The contrast in resolution between traditional Sanger DNA sequencing and digital second generation sequencing technologies is exemplified in Figure 2. As shown for a primary tumor, a variant allele that was barely discernible as a secondary Sanger sequencing peak accounted for about 14 % of the 10,000 NGS reads targeting that SNV position. A Sanger sequencing-based approach to acquire comparable quantitative allelic measurements entails a cumbersome intermediate molecular cloning step, followed by sequencing a substantial number of cloned molecules.

Figure 2. NGS read counts versus Sanger sequence traces for a MYH8 locus SNV. Somatic SNV analysis at the MYH8 locus by amplicon resequencing three tissue samples from a lobular breast cancer patient.
As per the Sanger sequence traces on the right, all three samples harbor the reference G allele (black peak and arrow); the metastatic tumor also harbors the variant C allele (blue peak), which is barely discernible in the matched primary tumor (red arrow). In contrast, the NGS read counts shown on the left for the two tumor samples capture the allelic prevalences quantitatively.

Furthermore, decoding the wide expanse of a 3 billion base pair human genome, or better yet whole genome sequences of individual tumors and matched normal tissues, remained a Herculean task until the maturation of second or next generation DNA sequencing technologies26. Whilst NGS technologies matured (Figures 3 and 4), the utility of genomic approaches to classifying sporadic breast cancers was demonstrated using DNA microarray technology29,30. In one study, Perou et al characterized gene expression signatures from 42 breast cancer patients across about 8000 genes and also analyzed pre- and post-chemotherapy molecular signatures. In the other, Sorlie et al classified about 80 breast cancers to associate tumor characteristics with clinical outcomes. Survival analysis showed different outcomes for patients in each of the molecularly defined subgroups.

Figure 3. Analysis of tumor evolution using different DNA sequencing technologies. First (Sanger), second (NGS), and third (true single molecule) generation DNA sequencing technologies with a concurrent increase in profiling sensitivity of tumor stages. The current era is approaching the routine application of NGS to resolving single-cell genomes for analysis of tumor cell lineages and CTCs.

Starting about five years ago, research teams began reporting NGS-based studies uncovering whole genome sequence and structural heterogeneity in breast cancers at single nucleotide resolution.
In 2009 Shah et al reported a combined genome-transcriptome study of an estrogen receptor-positive lobular breast cancer which revealed the mutational changes that occurred over a 9-year interval between primary and metastatic disease31. The study is the subject of the second chapter in this dissertation. Ding et al expanded genome-wide analyses of matched primary and metastatic tumors from a patient with basal-like breast cancer, by the inclusion of an experimental xenograft derived from the primary tumor32. In addition to retaining genome mutation features of the primary tumor, the mutational evolution pattern of the xenograft was shown to match the metastatic tumor. A study by Stephens et al described the spectrum of somatic rearrangements in twenty-four breast cancer genomes33. They found mostly intrachromosomal rearrangements, with tandem duplications and non-homologous end-joining DNA repair being common features. In 2012, two large breast cancer genomics studies integrated data from several analysis platforms, applied across hundreds of tumors34,35. Curtis et al integrated transcriptome and copy number analysis in 2000 breast tumors. They reported that variants were associated with transcription levels in about 40 % of genes. New subgroups with distinct clinical outcomes were determined using paired genome-transcriptome profiles. In the other large study, the Cancer Genome Atlas Network investigated almost 500 primary breast tumors using five analysis platforms and reported novel protein expression-defined subgroups. Basal-like breast tumors were seen to share molecular features with high-grade serous ovarian tumors. Nik-Zainal et al analyzed 21 breast cancers and reported localized regions of hypermutation, or kataegis, distinct nucleotide mutational signatures of BRCA1/BRCA2 breast cancers, plus relationships between mutational prevalence and transcription36.
Besides confirming a role of the usual suspects in human cancers (common tumor suppressors and oncogenes), a sobering observation from these early whole genome studies was the low prevalence of any specific genomic lesion across different patient tumors; this highlights the challenge in developing targeted therapeutics based on molecular analysis. The formidable task of understanding the significance of this ever expanding catalogue of cancer mutations is now the overarching goal of several research consortia. A central challenge is to discern the many incidental mutations (passenger mutations) from functionally important lesions (driver mutations), a task complicated by the fact that the significance of mutations depends on context and may vary over time and space. As these second generation tumor genome sequencing projects gathered steam, there also developed a growing appreciation for mutational landscapes beyond point mutations or indels in somatic coding space, with studies unraveling larger scale genomic rearrangements and copy number states in cancer genomes. Any and all of these genomic features may serve as substrates for defining tumor clonal architectures.

Figure 4. General steps in NGS analysis of whole tumor genomes. Genomic DNA is prepared from a portion of tumor tissue and subjected to fragmentation and end repair. Sequencing adaptors are added to size-selected DNA fragments and the resulting products are purified. A precise molar quantity of the NGS library is loaded on to the NGS device flow-cell or reaction chamber. The sequence reads passing quality control are subjected to short read alignment against a human reference or a matched normal genome, a process aided by the generation of paired-end sequencing reads. Variants are then called and revalidated by targeted amplicon sequencing.

1.4.2 Key concepts in clonality as defined by genomic sequence features

A clone is defined as a group of cells with a common ancestor.
Though the earliest stages of tumorigenesis are not readily amenable to study, models of malignancy routinely posit a single founder cell. In this paradigm, by definition, tumors must be considered as clones, irrespective of debate on the cell type of origin being stem-like or not. Nevertheless, it is also plausible that some malignancies may not be clones but arise as a consequence of phenomena such as cell-to-cell fusions37.

Other concepts and terminology in tumor clonality analysis when viewed through the lens of next generation DNA sequencing include clonal mark, clonal genotype, allelic prevalence, cellular mutation prevalence, clonal prevalence and cell lineage38.

A clonal mark can be any stable cellular feature used to define clonal membership. In this dissertation, clonal marks include somatic coding SNVs and CNA/LOH events. A clonal genotype is a set of clonal marks that defines membership in a clone. In this dissertation, a clonal genotype refers to a subset of somatic coding SNVs enumerated in a bulk tumor genome and determined to be informative in defining tumor clonal structure. In the case of single-cell analysis, this would refer to an empirically determined set of clonal marks present on a per cell basis. However, since all cells may differ from each other in some respect, a pragmatic approach may be to choose clonal marks that group functionally related cells. Allelic prevalence is the proportion of NGS reads with a variant allele. In this dissertation, it refers to the proportion of sequence reads harboring a somatic coding SNV, with the remaining reads bearing the reference allele. Cellular mutation prevalence is the fraction of cells in a tumor mass harboring an allele or genotype. Measurements of cellular mutation prevalence are influenced by factors such as tumor cell content. Clonal prevalence is the proportion of cells in a tumor mass bearing the genotypes defining membership in a clone.
Finally, cell lineage is the familial relationship between cells or clones in a tumor mass inferred through genotypes. For instance, the simplest interpretation of the most prevalent genotype in a population of cells would be that it represents the most ancestral state measured. Determinations of cell lineage in this dissertation are based on the presence of somatic coding SNVs in single-cell genotypes.

1.4.3 From cell population-averaged data to single-cell genotypes

Whole genome NGS studies of tumors conducted over the last decade have been performed on pooled nucleic acids extracted from a large number, usually millions, of cells. Allelic prevalences measured in these experiments indicate that not all tumor cells harbor the same set of mutations. This fact has important implications for our understanding of how mutations are distributed, biological pathways disrupted, or phenotypes perturbed, on a per cell basis. Biomolecular measurements averaged across a population of cells do not generally reflect the state of individual cells. For instance, the stochastic nature of transcription in a cell renders quantitative polymerase chain reaction (PCR) and mRNA abundance data from bulk tissue a poor surrogate for per cell activity39. Or, the characteristics of the most abundant cell type may not capture relevant properties inherent to rare subclones40. Furthermore, single-cell genotyping across several loci in parallel is the only way to formally resolve clonal genotypes. Thus, there is a need for single-cell measurements.

Genomic analysis of individual cells can be used to infer cell lineage trees in tumorigenesis, or in normal development by tracking sequence variation at microsatellite loci41. However, reconstructing cell lineages of the millions of cells that compose a tumor is a formidable challenge. Indeed, the magnitude of one such endeavor, the Human Cell Lineage Initiative, rightly dwarfs that of its predecessor, the original human genome project.
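The prevalence terms defined in section 1.4.2, and the contrast between bulk and single-cell measurements described above, can be made concrete with a small worked sketch. The Python below is illustrative only and is not the analysis code used in this dissertation. The function names, the toy SNV labels (snv_a, snv_b, snv_c) and the toy cell counts are hypothetical. The conversion from allelic to cellular prevalence assumes a heterozygous SNV at a copy-neutral diploid locus in both tumor and normal cells, a simplification that real tumors frequently violate. The read counts echo the roughly 14 % of 10,000 reads described for Figure 2, and the 73 % tumor content is the value reported for the primary tumor in Chapter 2; combining them here is arithmetic for illustration, not a reported result.

```python
from collections import Counter
from math import sqrt


def allelic_prevalence(variant_reads, total_reads):
    """Proportion of NGS reads at a position that carry the variant allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return variant_reads / total_reads


def wilson_interval(variant_reads, total_reads, z=1.96):
    """Approximate 95 % Wilson score interval for the read proportion."""
    p = allelic_prevalence(variant_reads, total_reads)
    n = total_reads
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half


def cellular_prevalence(allelic_prev, tumor_content):
    """Estimated fraction of tumor cells carrying the SNV.

    ASSUMPTION (illustrative, not from the thesis): every cell is diploid at
    the locus and each mutant cell carries one variant copy out of two, so
    allelic_prev = tumor_content * cellular_prev / 2.
    """
    if not 0 < tumor_content <= 1:
        raise ValueError("tumor_content must be in (0, 1]")
    return min(1.0, 2 * allelic_prev / tumor_content)


def clonal_prevalences(cell_genotypes):
    """Fraction of genotyped single cells sharing each exact set of SNVs."""
    counts = Counter(frozenset(g) for g in cell_genotypes)
    total = len(cell_genotypes)
    return {genotype: n / total for genotype, n in counts.items()}


# Bulk measurement: about 14 % of 10,000 reads carry the variant allele.
ap = allelic_prevalence(1400, 10000)
low, high = wilson_interval(1400, 10000)
print(f"allelic prevalence {ap:.3f}, interval {low:.3f} to {high:.3f}")
print(f"cellular prevalence at 73 % tumor content: {cellular_prevalence(ap, 0.73):.3f}")

# Single-cell measurement: ten hypothetical cells, each a set of SNVs present.
toy_cells = (
    [{"snv_a"}] * 5
    + [{"snv_a", "snv_b"}] * 3
    + [{"snv_a", "snv_b", "snv_c"}] * 2
)
for genotype, fraction in clonal_prevalences(toy_cells).items():
    print(sorted(genotype), fraction)
```

In the toy data, snv_a is present in every cell and would be read as the most ancestral mark, while the bulk allelic prevalences alone could not show that snv_c occurs only in cells that also carry snv_b. That co-occurrence information is what single-cell genotyping adds.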
A principal challenge is that analyzing the biomolecular content of individual cells is, as yet, not routine practice. Aspects under continuing development include cell isolation and fractionation, scale and cost of analyses, sensitivity and speed, fidelity of sequence data, and controlling for contamination during experimental procedures. A typical diploid human cell contains only about seven picograms of genomic DNA and tens of picograms of RNA. Nevertheless, technological advances in customized42,43 and commercial44 microfluidic or lab-chip type devices45 are gradually expanding the scale and diversity of measurements in single cells. These engineering advances complement and enhance a variety of existing protocols of varying throughput, such as micromanipulation, laser capture microdissection and flow cytometry. Several single molecule and single-cell nucleic acid analytical approaches have matured considerably over the last decade. Digital PCR at picolitre scale can be performed in parallel across hundreds of molecules on several hundred cells46. Following early studies in murine development conducted at a modest scale47, single-cell transcriptomics or RNA-seq experiments have matured to a point where a multitude of amplification chemistries and device platforms are available48. RNA amplification chemistries for single-cell analysis started with PCR-based exponential amplification protocols49 and were followed by linear amplification approaches50. Scaled up applications of the latter approach include a study by Hashimshony et al51 to unravel single-cell transcriptomic profiles from early developmental stages in Caenorhabditis elegans. Conducting whole transcriptome studies at the scale of hundreds to thousands of cells across several experimental conditions has been demonstrated in murine bone marrow dendritic cells52.
Recent applications in human cancers include single-cell transcriptome analysis in several glioblastomas, where Patel et al uncovered a wide spectrum of stemness-associated transcriptional states of cells within individual tumors53.

In the arena of DNA amplification chemistries with a need for picogram level sensitivity, the polymerase chain reaction54 remains a mainstay and quickly found early application for targeted single-cell assays in preimplantation genetic diagnoses55. However, starting with a single diploid cell harboring two parental copies of DNA template, routinely querying entire individual genomes is still an unmet challenge. Early protocols to access whole single-cell genomes included the use of random-primed whole genome PCR and multiple displacement amplification chemistries. MDA is in widespread use today56 and involves random priming, followed by displacement extension by Phi29 polymerase to produce large amounts of fragments several kilobases long. The protocol has been scaled up to achieve thousands of reactions per experiment at nanolitre scale57. PCR-based WGA approaches have been applied along with array CGH protocols to analyze single-cell genomes for copy number changes58. A recent protocol added to single-cell DNA amplification chemistries is multiple annealing and looping-based amplification cycles, or MALBAC59. This approach combines several semi-linear preamplification cycles wherein fragments amplified early in the process are looped to keep them from dominating subsequent cycles, thereby achieving more uniform amplification.

The last four years have seen a rapidly growing collection of publications describing near genome scale interrogation of single cells from bulk tissues, including tumors of the breast. Navin et al evaluated clonality in twenty primary breast tumors in anatomic space by dissecting tumors into arbitrary sectors and evaluating DNA copy number10.
They observed multiple intermixed cell populations within individual tumors, distinct clonal subpopulations in polygenomic tumors, and inferred the occurrence of punctuated bursts of genomic alterations. Powell et al profiled single circulating tumor cells in breast cancers for transcriptional heterogeneity, distinguishing CTCs between primary and metastatic breast cancer patients and breast cancer cell lines60. Baslan et al explored copy number variation in single tumor cells and quantified genomic intervals in two breast cancers61. They reported the presence of distinct tumor subpopulations that arose from a common ancestral clone in one of the two cancers, plus highly similar primary and metastatic clonal structure in the other. Wang et al achieved high mean coverage in a single-cell whole genome and exome study of ER-positive and triple-negative breast cancers, reporting that aneuploid rearrangements occurred early in tumor evolution while point mutations accumulated gradually62. They incorporated technical sequence noise-correction methods for the purpose of estimating accurate per cell mutation rates and identifying rare mutations. Potter et al performed single-cell measurements to analyze SNVs, gene fusions, and copy number variants, charting cellular phylogenies in leukemia63. Another pertinent application at single-cell resolution is determining mutation rates in individual cancer genomes, for which the MALBAC approach was used in a colon cancer cell line64. The combination of single-cell and NGS technologies is also finding interesting applications outside of cancer biology, including the study of microbial metagenomes following ocean oil spills65, surveying the trillions of microbes that comprise the human microbiome66, and charting somatic mosaicism in neurobiology67,68. The latter studies have fueled debate about the possible role of aneuploidy in generating phenotypic diversity for normal physiological functioning in individual neurons69.
Furthermore, single-cell approaches are gradually being complemented with even higher resolution analyses such as allelic phasing in diploid cells70, with clinical significance in areas such as human leukocyte antigen haplotyping for bone marrow transplantation. Mutation hotspots have been catalogued in single sperm cells71 and mapping errors resolved in diploid murine genome assemblies by sequencing parental DNA template strands72.

1.5 Thesis goals and scientific questions

The first goal of my doctoral thesis research was to apply next generation DNA sequencing and bioinformatic methods to elucidate whole genome DNA sequence features and clonal architecture of individual breast tumors at single nucleotide resolution. The second was to develop single-cell genotyping methods and apply them to validate the performance of computational approaches for inferring tumor clonal architecture from bulk tumor tissue-derived genomic sequence data sets. The third was to perform targeted somatic variant analysis of xenografted breast tumors at single-cell resolution to reconstruct tumor clonal structures and assess clonal evolution over serial xenograft passages.

In parallel with the advancement of improved single-cell genotyping approaches for the analysis of tumor genomes, this work addressed several timely scientific questions related to understanding genomic heterogeneity in tumors. One, what is the genome-wide mutational content in an individual breast tumor assessed at single nucleotide resolution? Two, does the mutational spectrum in metastatic disease differ markedly from its matched primary breast tumor which arose years earlier? Three, what are the differences in genome-wide mutational complexity and clonal composition amongst tumors of the triple-negative subtype of breast cancers? Four, are bulk tumor tissue-derived whole genome sequence data sets useful for making accurate inferences of clonal composition using computational approaches, as validated by single-cell analysis?
And five, do important laboratory models such as breast tumor murine xenografts undergo clonal selection upon engraftment or when serially passaged, as assessed at single-cell resolution?

Chapter 2: Genome analysis of breast tumors at single nucleotide resolution

2.1 Cancer genomes are not static: Mutational evolution between a paired metastatic and primary lobular breast cancer

2.1.1 Background – Lobular breast cancer

Invasive breast cancer of the lobular type accounts for about 15 % of breast carcinomas73. It is distinguished from the more prevalent ductal subtype by frequent loss of the classical epithelial adhesion protein, E-cadherin (cadherin-1, CDH1 gene, 16q22.1). This is often a consequence of one allele harboring a somatic inactivating mutation and the other subjected to LOH. Mutations in this cell-to-cell adhesion glycoprotein are associated with cancers of the breast, ovary, stomach, colorectum and thyroid. This estrogen receptor-positive subtype is usually of low to intermediate histological grade and is responsive to therapeutics targeting the receptor, but relapsing cases present tumors that have evolved resistance to targeted therapies. Subsequent complications can include ER ligand-independent signaling and pathway cross-talk. Metastatic disease often involves patients presenting with fluid buildup in the abdominal or pleural spaces, which is removed via clinical taps. The composition of this fluid can range widely and include a number of different cell types74. Cellular constituents may include tumor, immune, or mesothelial cells. Cells may be adherent to body cavity linings or exist as oligocellular aggregates, a phenotype of unknown functional significance. Resident tumor cells represent an important metastatic subpopulation warranting investigation of clinically relevant properties such as tamoxifen resistance.
Research models for lobular breast cancer are scarce since maintaining primary cell cultures or establishing murine xenografts has generally been difficult, as experienced in early experiments for this dissertation using pleural effusion cells from an LBC metastasis for both primary cell culture and murine xenografts. Thus, understanding LBC biology via direct interrogation of matched primary and metastatic tumor genomes represented a critical opportunity. This study31 applied the next generation DNA sequencing technologies I introduced in the first chapter along with NGS validation experiments, with the following objectives. One, define the genome-wide somatic coding mutational content at nucleotide resolution in paired lobular breast cancer samples from the same patient, separated by a nine year interval from primary to metastatic disease. Two, assess the contribution of mutational processes such as RNA editing towards tumor protein variation. Three, define the extent of clonal mutational evolution in this patient’s cancer over a nine year interval.

2.1.2 Methods

Tumor specimens and cytology

Donated tumor tissues (patient ID VBA0038) were acquired from the BC Cancer Agency tumor bank and the study performed with ethics approval for genomic analyses as stated in the preface of this dissertation. Initial presentation with the primary tumor occurred in 1999. This tumor was diagnosed as an intermediate grade LBC, with a tumor cell content of 73 %. Treatment included surgery and radiotherapy, and then anti-estrogen tamoxifen and aromatase inhibitor therapies were administered. The primary tumor was confirmed by pathology to be ER-positive and negative for E-cadherin. The pleural effusion metastasis was collected in May 2008 and cryopreserved. At the same time peripheral blood was also collected to prepare genomic DNA from a buffy coat lymphocyte fraction for use as matched reference or germline genomic DNA.
This pleural tap was the third of four taps from the patient, all collected between January 2008 and April 2009. Total cell count of vials cryopreserved from this third tap, generated by concentration from the pleural fluid, was about 2.0 x 10^9 cells. This tap had an estimated tumor nuclei content of 89 % as determined by cell block analysis and it was processed for genomic DNA and mRNA preparation without cell fractionation. The metastatic tumor phenotype was determined to be ER-positive, PR-positive, EGFR-positive, HER2-negative, and negative for E-cadherin and cytokeratins 5/6. The mitotic index was low.

Laboratory models of a lobular breast cancer metastasis

Preliminary experiments were conducted to establish in vitro primary cell cultures of the LBC pleural metastasis using several extracellular matrix cell culture surfaces and to establish xenografts in the renal capsules of immunosuppressed NOD/SCID/IL2rγ-/- mice. However, primary cultures resulted in the growth of mostly fibroblast-like cells and cellular senescence set in after about six passages over as many weeks. Murine xenografts did not yield palpable growth over four months, or discernible tumors by necropsy. These outcomes were in agreement with general observations in the field that it is usually clinically aggressive tumors, such as the TNBCs, that yield high take rates in laboratory models [75], an observation also made in the study comprising chapter four of this dissertation. Further work in establishing these models was discontinued.

Second generation sequencing and targeted validation of variants

The tumor genome and transcriptome sequencing workflows and validation steps are summarized in Figure 5 below. Tumor genomic DNA was prepared using standard phenol-chloroform extraction and RNase treatment. DNA from the primary tumor FFPE block was prepared using a Qiagen DNeasy Blood and Tissue kit with the inclusion of de-paraffinisation steps using xylene at 65 °C (3 changes).
A total RNA preparation was subjected to DNase I treatment and polyA+ RNA purification using a MACS mRNA preparation protocol (Miltenyi Biotec, San Diego CA). A SuperScript cDNA synthesis kit (Invitrogen, Burlington, ON) was used to prepare double-stranded cDNA using random hexamer priming. The nucleic acid preparations were sonicated, followed by standard Illumina (formerly Solexa) paired-end library construction. Paired-end NGS was performed on the two genomic DNA libraries (primary tumor and metastasis) and one transcriptome library (metastasis) using Illumina Genome Analyzers. Analytical approaches for validation experiments included (i) targeted PCRs followed by Sanger sequencing, (ii) targeted PCRs followed by molecular cloning and Sanger sequencing of several clones per PCR, (iii) targeted PCRs followed by normalization of amplicon pools and Illumina NGS, and (iv) FISH. Sanger amplicon sequencing was conducted on both strands and from at least two separate PCR products.

Oligonucleotide primer pairs spanning candidate SNVs and other features of interest were designed using Primer3 software [76]. Standard desalted primers (IDT Coralville, IA) were checked for uniqueness in the human genome and spaced to yield small (100 – 200 bp) amplicons in order to accommodate the low molecular weight of the primary tumor genomic DNA extracted from archival FFPE blocks. Polymerase chain reactions were carried out in 50 uL volumes using a gradient Mastercycler (Eppendorf, Mississauga, ON) and Platinum™ Blue PCR SuperMix (Invitrogen), 200 nM of each oligonucleotide primer, and 10 ng of genomic DNA isolated from primary tumor blocks, cryopreserved pleural effusion cells, and buffy coat fraction of blood. Amplification products were subjected to 2 % agarose gel electrophoresis in 1x TAE running buffer and predicted size bands were excised using a non-UV Dark Reader transilluminator (Clare Chemical Research, Dolores, CO).
Purification of DNA from gel slices was performed with a MinElute Gel Extraction kit (Qiagen, Mississauga, ON) using a standard microcentrifuge-based protocol from the manufacturer. Purified PCR amplicons were cloned into the pCR4-TOPO vector using a TOPO™ TA Cloning Kit for Sequencing (Invitrogen) with reaction conditions optimized for transformation using electrocompetent E. coli. The resulting clones were each transformed into 50 uL of One Shot™ TOP10 electrocompetent cells (Invitrogen) using 0.1 cm cuvettes and a GenePulser™ machine (Bio-Rad, Hercules, CA). Electroporated cells were immediately resuspended in 250 uL of SOC medium and incubated at 37°C for 1 hour with shaking at 225 rpm. Serial dilutions of each transformation were spread on LB-kanamycin (50 ug/mL) agar plates and incubated overnight at 37°C. The resulting colonies (12 or more per PCR reaction) were picked for inoculating 2x YT media supplemented with 7.5 % glycerol and 50 ug/mL kanamycin. Following overnight incubation at 37°C, the plasmid glycerol stocks were used to inoculate larger volume cultures for plasmid preparation using a standard alkaline lysis procedure. Sanger DNA sequencing was performed with BigDye v3.1 dye terminator cycle sequencing chemistry (Applied Biosystems, Foster City, CA) run on Tetrad thermal-cyclers (Bio-Rad). Sanger sequencing reaction products were purified by ethanol precipitation and analyzed on model 3730xl capillary DNA sequencers (Applied Biosystems). DNA sequence data was collected and stored automatically using a custom DNA sequencing database, then processed by trimming of low quality bases and removal of sequence derived from vector DNA.

Allelic prevalence comparisons between the primary and metastatic LBC tumors were conducted by combining targeted PCRs and normalization of amplicon pools followed by Illumina NGS.
To control for amplification from the primary and metastatic tumor samples, randomly selected heterozygous germline alleles were amplified and sequenced as well. PCR amplicons were subjected to standard preparative agarose gel electrophoresis followed by purification of excised amplicon bands using QIAquick gel extraction spin columns (Qiagen) and quantitation with a NanoDrop spectrophotometer (Thermo Scientific, Wilmington, DE). Equimolar amounts of adaptor-ligated amplicons were pooled for library preparation and sequenced using an Illumina Genome Analyzer. RNA editing events were screened in a panel of 41 LBCs plus 7 control gDNA and cDNA samples. To address the possibility of a lack of proofreading in reverse transcriptases contributing to false conclusions of RNA editing in vivo, I performed separate validation experiments using different reverse transcriptases. These included an MMLV RT (SuperScript III, Invitrogen), described to have reduced RNase H activity and increased thermal stability, and AMV RT (Invitrogen) with intrinsic RNase H activity. For second strand cDNA synthesis, I used the same proofreading DNA polymerase, namely Phusion high-fidelity (NEB, Ipswich MA).

Figure 5. LBC tumor genome and transcriptome sequencing and validation workflows
Schematic showing experimental and analytical steps. Biological features are shown in blue and validation assays in yellow. Features from integrated analyses are shown in red.

Analysis of tumor whole genome and transcriptome data sets

Several sequence and structural features were extracted from the whole genome and transcriptome NGS data sets. These included somatic non-synonymous coding sequence variants, indels, rearrangements (translocations, inversions, fusions), copy number variants, RNA edits, fusion transcripts, unequal transcript allelic expression, alternative exon usage, and intron retention.
The genomic DNA short read sequences were mapped to the reference human genome (NCBI build 36.1, hg18) using the MAQ short read aligner [77]. The transcriptome reads were mapped to the human reference genome and to a database of known exon junctions in order to map reads that spanned splice sites [78]. Transcriptional coverage of a gene was based on the sum of exonic and junction read bases. SNVs were called based on a binomial mixture model, SNVmix [79]. Therein, each base is assigned a probability of its identity being either homozygous reference (aa), heterozygous non-reference (ab), or homozygous non-reference (bb). At a threshold of p=0.77 for ab or bb non-reference calls, a false discovery rate of 1 % was estimated. This estimate was derived from ROC performance using allele calls from Affymetrix SNP 6.0 tumor DNA hybridization data. Non-reference positions in the metastatic RNA-seq library were called with p=0.53, FDR = 0.01. These were then filtered against dbSNP [80] and available human genomes.

Read counts were used as a measure of copy number in the tumor genome data sets. Mappability windows with non-overlapping segments were determined as in Campbell et al [81].

Additional laboratory and bioinformatic methodological details were described in the supplementary information (pages 1 to 15) accompanying the manuscript [31] describing this study.

2.1.3 Results

Whole genome and transcriptome metrics of two LBC tumors

Over 43x haploid genome sequence coverage was generated for the metastatic LBC tumor genome and 161 million aligned reads for the metastatic transcriptome library (Table 1). The somatic coding SNVs found in the metastatic LBC are summarized in Table 2 below. A comparison with allelic prevalences in the matched primary tumor follows in Table 3.

Table 1. Whole genome and transcriptome coverage metrics of a metastatic LBC

Metric | WGSS | WTSS
Total number of reads | 2,922,713,774 | 182,532,650
Total nucleotides (Gb) | 140.991 | 7.108
Number of aligned reads | 2,502,465,226 | 160,919,484
Aligned nucleotides (Gb) | 120.718 | 6.266
Estimated error rate | 0.021 | 0.013
Estimated depth (non-gap regions) | 43.114 | N/A
Canonically aligned reads | 2,294,067,534 | 109,093,616
Percent reads aligned canonically | 78.49 | 67.79
Unaligned reads | 420,248,548 | 21,613,166
Mean read length (bp) | 48.24 | 38.94

Transcriptomic data sets accumulated over the last decade have shown patterns of transcription across mammalian genomes to be much more complex than the canonical mRNA processing described in older textbooks. This complexity includes the phenomenon of intron retention across the human transcriptome [82]. Intron-containing transcripts have also been catalogued from human breast tumors [83]. The whole transcriptome shotgun sequence library from this metastatic pleural effusion included millions of intronic reads.

A metastatic breast tumor harbors thirty-two somatic coding mutations

Non-synonymous coding SNVs in the metastatic tumor genome were enumerated based on predictions from whole genome and whole transcriptome data sets. Tumor and matched normal DNA amplicon samples were subjected to Sanger DNA sequencing for validation and for distinguishing somatic from germline variants. Of the 437 positions confirmed in the genome and/or transcriptome data sets, 405 were novel germline alleles and 32 were non-synonymous somatic coding SNVs. Of these 32 positions (Table 2), two were observed only in the metastatic transcriptome data.

The occurrence of these mutations was checked in existing cancer genome databases. None of the genes were in the list of candidate cancer genes published earlier [84]. Eleven genes, but with mutations at different positions, were found in the COSMIC database [85].
Nine somatic coding mutations were screened further in a panel of 192 breast cancers, comprising 80 ductal and 112 lobular cases from the METABRIC project tumor collection [34]. None of the same mutations were found in this panel but three cases had alternate variants in ERBB2. Of note, two LBC cases had alternate truncating variants in the HAUS3 gene (formerly C4ORF15), which had been reported as a member of the augmin protein complex involved in cell cycle processes of kinetochore attachment and centrosome morphogenesis [86,87]. HAUS3 harbored the only homozygous non-synonymous somatic coding mutation in the metastatic lobular breast tumor, and this mutation was also present in the primary tumor. This mutation may have contributed to centrosome dysfunction and genome instability early in this cancer’s etiology. Another locus with a somatic SNV of high allelic prevalence in the primary tumor was PALB2 (Partner and Localizer of BRCA2), wherein germline mutations have been described in familial breast cancers [88].

Candidate structural inversions and indels were all found to be present in the matched normal DNA, while gene fusions did not revalidate. A low level amplicon mapped to the INSR locus was called from copy number analysis and revalidated by FISH.

A key question in whole genome shotgun sequencing data sets is the relationship between sequence depth and retrieval of genomic sequence features such as SNVs. This is influenced by genome ploidy, copy number and tumor cell content. For the metastatic tumor genome, the rate of new allele discovery dropped off by 43x coverage. By randomly re-sampling data, the first 30 Gb of sequence yielded 18 somatic SNVs while the last 30 Gb captured another two somatic SNVs.

Table 2. Somatic coding SNVs in a metastatic LBC
Features of 32 somatic coding SNVs from genome and transcriptome NGS. Allele and amino acid changes are from reference to variant.
Gene | Description | Position hg18 | Source | Allele change | Amino acid change | Protein domain affected | Expression (sequenced bases per exonic base) | Allele expression bias (R, NR allele) | Copy number state
ABCB11 | Bile salt export pump (ATP-binding cassette sub-family B member 11) | 2:169497197 | WGSS | C>T | R>H | Transmembrane helix 3 | 0.3 | 1, 1 | Amplification (4)
HAUS3 | HAUS3 coiled-coil protein | 4:2203607 | WGSS, WTSS | C>T | V>M | Unknown | 14.1 | 4, 23 | Neutral (2)
CDC6 | Cell division control protein 6 homologue | 17:35701114 | WGSS, WTSS | G>A | E>K | N-terminal, unknown | 2.7 | 3, 3 | Amplification (4)
CHD3 | Chromodomain-helicase-DNA-binding protein 3 | 17:7751231 | WGSS | G>A | E>K | Unknown, C-terminal | 3.9 | 41, 11 | Neutral (2)
DLG4 | Disks large homologue 4 | 17:7052251 | WGSS | G>A | P>L | Unknown, N-terminal | 5.5 | 7, 1 | Neutral (2)
ERBB2 | Receptor tyrosine-protein kinase erb-b2 | 17:35133783 | WGSS, WTSS | C>G | I>M | Kinase domain | 67.1 | 62, 35 | Amplification (4)
FGA | Fibrinogen alpha chain | 4:155726802 | WGSS | C>T | W>stop | Fibrinogen a/b/c domain | 0.01 | NA | Gain (3)
GOLGA4 | Golgin subfamily A member | 3:37267947 | WGSS, WTSS | G>C | E>Q | Unknown, N-terminal | 111.8 | 37, 12 | Gain (3)
GSTCD | Glutathione S-transferase C-terminal domain-containing | 4:106982671 | WGSS, WTSS | G>C | E>Q | Unknown, C-terminal | 23.2 | 23, 8 | Neutral (2)
KIAA1468 | LisH domain and HEAT repeat-containing protein | 18:58076768 | WGSS, WTSS | G>C | R>T | ARM type fold | 36.1 | 23, 11 | Neutral (2)
KIF1C | Kinesin-like protein KIF1C | 17:4848025 | WGSS, WTSS | G>C | K>N | Kinesin motor domain | 28.5 | 16, 13 | Neutral (2)
KLHL4 | Kelch-like protein 4 | X:86659878 | WGSS | C>T | S>L | Unknown, N-terminal | 1.7 | 1, 0 | Neutral (2)
MYH8 | Myosin 8 (heavy chain 8) | 17:10248420 | WGSS | C>G | M>I | Actin-interacting protein domain | 0 | NA | Neutral (2)
PALB2 | Partner and localizer of BRCA2 | 16:23559936 | WGSS | T>G | E>A | N-terminal prefolding | 13.0 | NA | Amplification (4)
PKDREJ | Polycystic kidney disease and receptor for egg-jelly | 22:45035285 | WGSS | C>G | E>Q | Unknown | 0.1 | NA | Gain (3)
RASEF | RAS and EF-hand domain-containing protein | 9:84867250 | WTSS | G>A | S>L | EF-hand Ca2+-binding motif | 65.0 | 3, 2 | Gain (3)
RNASEH2A | Ribonuclease H2 subunit A (EC | 19:12785252 | WGSS, WTSS | G>A | R>H | Unknown, C-terminal | 5.3 | 2, 2 | Neutral (2)
RNF220 | RING finger protein C1orf164 | 1:44650831 | WGSS | G>A | D>N | Unknown, N-terminal | 16.1 | NA | Neutral (2)
SP1 | Transcription factor Sp1 | 12:52063157 | WGSS | G>C | E>Q | Glu-rich N-terminal domain | 57.3 | 40, 10 | Amplification (4)
USP28 | Ubiquitin carboxyl-terminal hydrolase 28 | 11:113185109 | WGSS, WTSS | C>T | D>N | Unknown | 12.5 | 3, 7 | Gain (3)
C11orf10 | UPF0197 transmembrane protein C11orf10 | 11:61313958 | WGSS | G>A | T>I | Transmembrane domain | 28.9 | 13, 3 | Amplification (4)
THRSP | Thyroid hormone-inducible hepatic protein | 11:77452594 | WGSS | C>T | R>C | Unknown | 0.3 | NA | Gain (3)
SCEL | Sciellin | 13:77076497 | WGSS | A>G | K>R | Unknown | 0.3 | 1, 0 | Gain (3)
SLC24A4 | Na+/K+/Ca2+-exchange protein 4 | 14:92018836 | WGSS | G>A | V>I | Transmembrane domain | 1.2 | 1, 0 | Amplification (4)
COL1A1 | Collagen alpha-1(I) chain | 17:45625043 | WGSS | C>T | G>D | Pro-rich domain | 80.0 | 24, 0 | Amplification (4)
KIAA1772 | GREB1-like protein | 18:17278222 | WGSS | A>G | D>G | Unknown | 2.8 | 4, 1 | Neutral (2)
CCDC117 | Coiled-coil domain-containing protein 117 | 22:27506951 | WGSS | G>C | K>N | Unknown | 12.9 | 2, 0 | Neutral (2)
RP1-32I10.10 | Novel protein | 22:43140252 | WGSS | G>C | E>Q | Unknown | 0 | NA | Gain (3)
MORC1 | MORC family CW-type zinc finger protein 1 | 3:110271286 | WGSS | G>A | A>V | Coiled-coil | 0.1 | NA | Gain (3)
SNX4 | Sorting nexin 4 | 3:126721688 | WGSS | C>T | D>N | Unknown, N-terminal | 43.4 | NA | Gain (3)
LEPREL1 | Prolyl 3-hydroxylase 2 precursor (EC | 3:191172415 | WGSS | T>C | E>G | Hydroxylase domain | 1.1 | NA | Gain (3)
WDR59 | WD repeat-containing protein 59 | 16:73500342 | WTSS | C>T | M>I | Unknown, C-terminal | 17.3 | 6, 5 | Neutral (2)

Insights from comparing a breast tumor’s genome and transcriptome

RNA-seq paired-end reads were used to determine skewed allelic expression, exon usage, and immune cell receptor rearrangements.
Non-canonical mappings of aligned read pairs yielded 2184 loci with alternative splicing events. In order to assess functional significance of these genome features, the loci with altered transcripts were checked for biological pathways membership in the Ingenuity™ pathway analysis database (Qiagen, Redwood City, CA). The classical estrogen receptor pathway was the most significant cluster of alternatively spliced genes. This included 22 of the 128 pathway gene members (Fisher’s exact test, p = 4.2 x 10^-5). Other top hits included inositol phosphate metabolism and clathrin mediated endocytosis genes. The molecular level consequences of these splicing events include in-frame truncations and altered spacing of protein functional domains. This observed alternative splicing in the ER signaling proteins may have implications for the regulation of pathway activity and is thus of interest for clinical intervention.

Relating novel somatic SNVs with heterozygous allelic expression bias across the metastatic transcriptome revealed 122 significantly biased allelic variants in coding sequence space, with 35 being non-synonymous (FDR <0.01). Three somatic mutations, in CHD3, SP1, and COL1A1, showed significant allelic bias, which was towards the reference allele (FDR <0.01).

Primary and metastatic breast tumors differ in their mutational spectrum

This LBC tumor pair offered an opportunity to study changes in mutational prevalence between primary and metastatic malignancies separated by a nine-year interval (Table 3). Of the 32 somatic non-synonymous coding mutations present in the metastatic tumor, 30 were successfully amplified in the primary. Of these 30, five mutations were prevalent in the primary tumor, while another six occurred at low prevalences. The remaining 19 mutations were not detected in the primary tumor.
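The distinction between mutations detected and not detected in the primary tumor can be illustrated with a one-sided binomial test on allelic read counts from deep amplicon sequencing. The following is a minimal sketch and not the pipeline used in this study: the function names, the background error rate of 0.002, and the significance cutoff of 0.01 are assumptions chosen for illustration, and the read counts in the usage note are rounded examples of the kind listed in Table 3.

```python
import math


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p), assuming 0 < p < 1."""
    if k <= 0:
        return 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    # Sum the upper tail directly in log space so deep read counts
    # (tens of thousands of reads) do not overflow.
    for i in range(k, n + 1):
        log_pmf = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * log_p + (n - i) * log_q
        )
        total += math.exp(log_pmf)
    return min(total, 1.0)


def is_detected(variant_reads: int, depth: int,
                error_rate: float = 0.002, alpha: float = 0.01) -> bool:
    """Call a variant present when its read count exceeds what the
    assumed (hypothetical) background error rate would explain."""
    return binomial_upper_tail(variant_reads, depth, error_rate) < alpha
```

Under these assumptions, a variant supported by about 1150 of 24572 reads is called detected, whereas one supported by 6 of 7273 reads is not, because that count is below what the assumed error rate alone would produce.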
Digital prevalences of 28 of the 30 somatic mutations found in the metastatic tumor genome were successfully measured in the primary tumor by targeted amplicon NGS and allelic read counting. The HAUS3 mutation prevalence value suggested a homozygous mutation state, the only such example in the set. Three mutations were found at prevalences suggesting heterozygous mutations, after considering tumor cellularity. Six mutations had intermediate allelic prevalence values, between 1 and 13 % (P < 0.01, binomial exact test). These measurements collectively demonstrated the presence of mutational heterogeneity in the primary tumor and acquisition of additional mutations over a nine-year interval. The contribution of radiotherapy to mutational changes is unknown.

Table 3. Digital prevalence of somatic SNVs in a primary LBC tumor and metastasis
Chr position = hg18 chromosome coordinates, Ref = reference allele, NR = non-reference allele, V = variant, S = somatic, G = germline.

Chr position | Locus | Ref | NR | Primary depth (reads) | Primary non ref ratio | Primary p-value | Primary status | Met depth (reads) | Met non ref ratio | V | Copy number state (metastasis)
4:2203607 | HAUS3 | C | T | 5700 | 0.5472 | 0.0000 | DOMINANT | 762 | 0.7874 | S | neutral(2)
16:23559936 | PALB2 | T | G | 115 | 0.4957 | 0.0000 | DOMINANT | 669 | 0.4350 | S | amplification(4)
2:169497197 | ABCB11 | C | T | 506 | 0.3261 | 0.0000 | DOMINANT | 959 | 0.3691 | S | amplification(4)
14:92018836 | SLC24A4 | G | A | 13347 | 0.2341 | 0.0000 | DOMINANT | 13670 | 0.3518 | S | amplification(4)
17:10248420 | MYH8 | C | G | 10657 | 0.1353 | 0.0000 | SUBDOMINANT | 1797 | 0.5932 | S | neutral(2)
3:110271286 | MORC1 | G | A | 24572 | 0.0468 | 0.0000 | SUBDOMINANT | 32273 | 0.4107 | S | gain(3)
17:4848025 | KIF1C | G | A | 8587 | 0.0107 | 0.0000 | SUBDOMINANT | 2272 | 0.3077 | S | neutral(2)
11:113185109 | USP28 | C | T | 6654 | 0.0095 | 0.0000 | SUBDOMINANT | 1387 | 0.4484 | S | gain(3)
18:58076768 | KIAA1468 | G | A | 719 | 0.0083 | 0.0020 | SUBDOMINANT | 1056 | 0.3059 | S | neutral(2)
19:12785252 | RNASEH2A | G | A | 6537 | 0.0029 | 0.0276 | SUBDOMINANT | 1497 | 0.2806 | S | neutral(2)
4:106982671 | GSTCD | G | T | 7273 | 0.0008 | 0.9885 | ABSENT | 2208 | 0.2174 | S | neutral(2)
17:35701114 | CDC6 | G | T | 4894 | 0.0008 | 0.9733 | ABSENT | 4208 | 0.3577 | S | amplification(4)
17:7751231 | CHD3 | G | A | 9665 | 0.0007 | 0.9981 | ABSENT | 1737 | 0.2671 | S | neutral(2)
4:155726802 | FGA | C | T | 5756 | 0.0007 | 0.9911 | ABSENT | 2287 | 0.2755 | S | gain(3)
17:7052251 | DLG4 | G | A | 4383 | 0.0007 | 0.9835 | ABSENT | 706 | 0.3272 | S | neutral(2)
3:37267947 | GOLGA4 | G | T | 13051 | 0.0006 | 0.9999 | ABSENT | 3262 | 0.2235 | S | gain(3)
9:84867250 | RASEF | G | T | 1690 | 0.0006 | 0.9500 | ABSENT | 796 | 0.3656 | S | gain(3)
17:35133783 | ERBB2 | C | A | 3736 | 0.0005 | 0.9899 | ABSENT | 1722 | 0.3612 | S | amplification(4)
X:86659878 | KLHL4 | C | T | 6561 | 0.0005 | 0.9993 | ABSENT | 977 | 0.3153 | S | neutral(2)
3:191172415 | LEPREL1 | T | C | 11963 | 0.0004 | 1.0000 | ABSENT | 8381 | 0.2148 | S | gain(3)
16:73500342 | WDR59 | C | T | 4846 | 0.0004 | 0.9982 | ABSENT | 1396 | 0.2629 | S | neutral(2)
1:44650831 | RNF220 | G | A | 8160 | 0.0004 | 0.9999 | ABSENT | 967 | 0.2203 | S | neutral(2)
22:45035285 | PKDREJ | C | T | 6674 | 0.0003 | 0.9999 | ABSENT | 1230 | 0.3366 | S | gain(3)
11:61313958 | C11ORF10 | G | A | 116705 | 0.0003 | 1.0000 | ABSENT | 14354 | 0.4651 | S | amplification(4)
12:52063157 | SP1 | G | T | 7732 | 0.0003 | 1.0000 | ABSENT | 2011 | 0.2193 | S | amplification(4)
11:77452594 | THRSP | C | T | 24219 | 0.0002 | 1.0000 | ABSENT | 40652 | 0.4750 | S | gain(3)
17:45625043 | COL1A1 | C | A | 26343 | 0.0001 | 1.0000 | ABSENT | 32259 | 0.2543 | S | amplification(4)
13:77076497 | SCEL | A | G | 49 | 0.0000 | 1.0000 | ABSENT | 187 | 0.5722 | S | gain(3)
19:9314428 | - | A | G | 176 | 0.5057 | 0.0000 | PRESENT | 321 | 0.4953 | G | neutral(2)
4:130144460 | - | A | T | 2020 | 0.2188 | 0.0000 | PRESENT | 2081 | 0.3099 | G | neutral(2)
8:27835012 | - | G | A | 13587 | 0.8602 | 0.0000 | PRESENT | 10781 | 0.6667 | G | deletion(1)
6:32908543 | - | C | T | 4718 | 0.7484 | 0.0000 | PRESENT | 16370 | 0.4897 | G | amplification(4)
20:43363061 | - | G | A | 5950 | 0.5249 | 0.0000 | PRESENT | 5540 | 0.5049 | G | amplification(4)
4:8672089 | - | G | A | 381 | 1.0000 | 0.0000 | PRESENT | 2850 | 0.8032 | G | gain(3)
16:1331138 | - | C | T | 677 | 0.4963 | 0.0000 | PRESENT | 554 | 0.6245 | G | amplification(5)

RNA editing recodes amino acid sequences in a metastatic breast tumor

Editing events downstream of canonical processing of heterogeneous nuclear RNA introns and UTRs can alter protein sequences. High confidence SNVmix predictions from the metastatic tumor transcriptome library were checked against the matched genome library. There were 526 non-synonymous coding sequence edits. A subset of 75 events across 13 coding sequence loci revalidated by Sanger sequencing in the tumor cDNA, but not genomic DNA (Table 4, below). A>I>G base change events were most recurrent and, interestingly, the RNA editing enzyme ADAR was among the top 5 % expressed genes and the only highly expressed editing gene. Two genes, COG3 and SRP9, harbored non-synonymous coding edits with altered protein sequences (Figure 6), an observation not possible by genomic DNA sequencing alone.

RNA editing events across these 13 loci were screened further in a panel of 41 lobular breast cancer tumors and were not found in the panel.

Figure 6. RNA editing events in COG3 and SRP9 transcripts
Non-synonymous coding edit positions validated by Sanger sequencing both DNA strands. Arrows mark edits.

Table 4. RNA edits confirmed by Sanger sequencing
The number of edits across 13 loci is shown. Multiple edits are entered as a range. Two edits are listed as dbSNP alleles (rs notation).

Gene | Number of edits | Non-synonymous | Position(s)
EIF2AK2 | 12 | 0 | chr2:37181093-37184781
AKAP9 | 25 | 0 | chr7:91559936-91560175
COG3 | 1 | 1 | chr13:44988372
ICK | 2 | 0 | chr6:52976046, chr6:52976047
EEF2K | 9 | 0 | chr16:22204361, chr16:22204737
FAM38A | 6 | 0 | chr16:87312679-chr16:87312746
PIP | 2 | 0 | chr1:218297899, chr1:218298107, chr1:218298761
SRP9 | 2 | 1 | chr1:224041204 (rs11555111), chr1:224041237
INPP5B | 1 | 0 | chr1:38099315
CWF19L1 | 2 | 0 | chr10:101982773 (rs3190709), chr10:101982783
POFUT1 | 1 | 0 | chr20:30268977
CTSB | 5 | 0 | chr8:11738692, chr8:11738704, chr8:11738714, chr8:11738732, chr8:11738867
MSR1 | 7 | 0 | chr8:16010106, chr8:16010115, chr8:16010088, chr8:16010073, chr8:16010070, chr8:16010035

Thus, integrating transcriptome data with genome sequencing enabled a more thorough assessment of tumor protein variation and its potential contribution to malignant phenotypes.

Collectively, the results from this initial study successfully demonstrated the application of evolving next generation sequencing and mutation validation methods on frozen and FFPE clinical tumor samples from a breast cancer patient. Deep sequencing of targeted amplicon panels and sequence variant calling algorithms enabled estimation of mutant allelic prevalences at unprecedented depth, establishing the experimental framework for the study of mutational heterogeneity and clonality in larger tumor sets, which is the subject of subsequent chapters in this dissertation.

Despite the expanding ability to generate affordable genome scale data sets from individual tumors at single nucleotide resolution, several limitations of this research methodology existed, some of which continue to challenge scientists to the present time. These pitfalls and challenges, such as difficulty capturing the mutational spectrum of rare and transient subpopulations of tumor cells or identifying spatiotemporally relevant mutations driving a malignant phenotype, are addressed later in the discussion chapter of this dissertation.

2.2 What is a cancer subtype? Mutational heterogeneity in primary TNBCs

2.2.1 Background – Triple-negative breast cancers

Unlike receptor-defined breast cancers such as the estrogen, progesterone, or HER2 overexpressing subtypes, the triple-negative group lacks these immunophenotypic features by definition. TNBC represents about 15 % of breast cancers [30]. It is often an aggressive malignancy in younger women, especially premenopausal African American patients [89]. TNBC is a poor prognostic factor for disease-free and overall survival and is aggressive in the metastatic setting. Metastatic spread is commonly to the liver, lungs and brain. Chemotherapy usually yields a good initial response but progression-free intervals can be short. TNBCs share features with the molecularly defined basal-like class and include several histological variants such as infiltrating ductal and medullary. Germline BRCA1 mutations are also associated with triple-negative and high grade tumors [90], making these DNA repair-defective tumors candidates for treatment with PARP1 inhibitors [91]. Other therapeutics in development include inhibition of the epidermal growth factor receptor and mTOR [92].

This study [93] applied the NGS wet lab and sequence data analysis approaches developed for the previously described lobular breast cancer genome study to a panel of over 100 triple-negative breast cancers. Objectives included elucidation of the mutational spectrum in a large number of primary tumor genomes across one breast cancer subtype and assessing clonal diversity at diagnosis using deep allelic prevalence measurements.

2.2.2 Methods

Clinically annotated and immunohistochemically defined primary TNBC tumor specimens were acquired from tumor banks with local research ethics board approval for genomic studies as stated in the preface of this dissertation. Peripheral blood lymphocytes from each patient served as matched normal or germline DNA.
Cases were defined semi-quantitatively to be of low (less than 40 %), moderate (40 to 70 %), or high (higher than 70 %) tumor cellularity from visually scored tumor sections. HER2 negativity was also confirmed by Affymetrix SNP 6.0 array data.

A total of 104 TNBCs and matched normal samples were analyzed on SNP 6.0 arrays, 80 cases by RNA-seq whole transcriptome analysis using Illumina GAII devices, and 65 cases by whole genome or exome capture sequencing [94] using Agilent’s Human All Exon SureSelect Target Enrichment System with Illumina GAII devices.

Tumor genome data sets were analyzed for somatic coding SNVs, indels, copy number variation, fusions, and transcriptional patterns using informatics approaches for matched tumor-normal samples, building upon the previously completed LBC tumor genomes project. Analyses included the use of the MAQ [77] short read aligner, JointSNVmix [95] to call somatic SNVs, and APOLLOH [96] to infer regions of LOH. Targeted deep amplicon sequencing was performed to validate non-synonymous variants profiled by MutationAssessor [97]. The impact of mutations and copy number in transcriptional space was assessed using DriverNet [98], associating genomic lesions with expression outliers in sets of gene pathways. Tumor clonal structure analysis was conducted using an early version of the PyClone [99] statistical model to infer clonal prevalences with mutational genotypes from deeply sequenced somatic coding SNVs. Laboratory and bioinformatic methodological details were described in the supplementary information (pages 128-140) accompanying the manuscript [93] for this study.

2.2.3 Results

Primary TNBCs exhibit a wide spectrum of somatic mutation

From the set of primary TNBC cases a total of 2414 somatic SNVs, 107 indels, and 43 non-coding splice site mutations were validated by targeted deep sequencing.
As evident in Figure 7, there is a wide distribution of somatic mutational load across tumors, independent of tumor cellularity or copy number changes. The copy number spectrum generally agreed with a larger METABRIC study34 of 2000 breast tumors and recurrent events were found in PARK2, EGFR, RB1, and PTEN, in the 3 to 5 % range. Additionally, intragenic deletions were documented in the PARK2 tumor suppressor. Structural rearrangements were not recurrent across the tumor set.    39 Figure 7. Somatic mutational spectrum across 65 TNBCs Prevalence of mutations in basal and other cases, annotated further by known driver genes. Lower panel shows mutations across copy number classes along with the number of copy number alterations and percent genome altered. HOMD = homozygous deletion, HETD = heterozygous deletion, NEUT = neutral copy state, GAIN = single copy gain, AMP = amplification, HLAMP = high level amplification. One-third of somatic coding mutations are transcribed The transcriptome data sets revealed that 36 % of somatic coding SNVs were transcribed. There were 43 splice junction mutations impacting splicing patterns, including both known and new genes implicated in carcinogenesis. Mutations in human gene regulatory regions revealed retinoblastoma-associated protein binding sites to be overrepresented (Fisher’s exact test, 32% versus expected 2.5%, P = 2x10-19) and several observed mutations disruptive to binding activity. The functional loss of this locus has been reported previously in breast cancers100. Allelic measurements at depth reveal a diversity of clonal prevalence Given the observed mutational heterogeneity in these cancers, deep allelic prevalence measurements were made to determine tumor clonal composition and complexity      40 using PyClone, taking into account allelic prevalence measurements, copy number state, LOH, and tumor cellularity (Figure 8).  Figure 8. 
Inferring clonal prevalence from TNBC bulk tumor-derived genomic data
PyClone Dirichlet process model integrating tumor features to estimate clonal prevalence of deeply sequenced genotypes. Schematic in the top panel depicts three genotypes comprised of four mutations and associated prevalence values. Lower panel shows clonal prevalence inferences for two TNBC cases, SA029 and SA228. Gene SNVs with matching prevalence distributions are colored identically. Each prevalence estimate is shown as the distribution of probabilities from the model.

Deep sequencing (20,000x median coverage) of validated SNVs, followed by grouping of mutations within each tumor, revealed a wide spectrum of clonal prevalences and genotypes (Figure 9).

Figure 9. A wide spectrum of inferred number of clonal clusters in TNBCs
Number of clonal clusters plotted across mutation abundance of non-synonymous and synonymous variants, over 54 TNBC cases, and across kernel density for basal and non-basal tumors. The panel on the left shows the relationship between mutational abundance and the number of inferred clonal clusters. The middle panel displays the distribution of the number of clonal clusters across the 54 cases. The panel on the right shows the density distribution of clonal clusters for basal and non-basal tumors.

Mutations in tumor suppressor genes are not always early events

Mutations in common tumor suppressor genes were often, but not always, observed at prevalences higher than other mutated loci, suggesting they were early mutational events or driver genes (Figure 10). For example, TP53 mutations were often present in the most prevalent clonal group. To some extent, the number of clonal prevalence groups increased with mutational load, and the basal TNBCs had more clonal prevalence modes than non-basals (Figure 9). Thus, as observed with the wide mutational spectrum of primary TNBCs, their clonal complexity at diagnosis is also diverse.
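The inputs named above (allelic prevalence, copy number state and tumor cellularity) all matter because the same allele fraction can correspond to very different fractions of tumor cells. The Python sketch below illustrates this with a simple point estimate. It is not the PyClone model, which infers full posterior distributions with a Dirichlet process mixture; the formula, the function name `cellular_prevalence` and the default values are assumptions made for illustration, with normal cells taken as diploid and all tumor cells assumed to share one copy number state at the locus.

```python
def cellular_prevalence(vaf, cellularity, tumor_copies=2, mutant_copies=1,
                        normal_copies=2):
    """Point estimate of the fraction of tumor cells carrying a mutation.

    Assumed model (illustrative only, not PyClone):
        vaf = cellularity * prevalence * mutant_copies
              / (cellularity * tumor_copies + (1 - cellularity) * normal_copies)
    """
    if not 0.0 <= vaf <= 1.0:
        raise ValueError("vaf must be between 0 and 1")
    if not 0.0 < cellularity <= 1.0:
        raise ValueError("cellularity must be in (0, 1]")
    if not 1 <= mutant_copies <= tumor_copies:
        raise ValueError("mutant_copies must be between 1 and tumor_copies")

    # Average number of copies of the locus per cell in the bulk sample.
    mean_copies = cellularity * tumor_copies + (1.0 - cellularity) * normal_copies
    prevalence = vaf * mean_copies / (cellularity * mutant_copies)
    # Sampling noise can push the estimate above 1; cap it at 100 % of cells.
    return min(prevalence, 1.0)


if __name__ == "__main__":
    # Same allele fraction, different cellularity and copy number state.
    print(cellular_prevalence(0.175, cellularity=0.7))
    print(cellular_prevalence(0.175, cellularity=0.4))
    print(cellular_prevalence(0.175, cellularity=0.7, tumor_copies=4))
```

Under these assumptions a heterozygous mutation in a diploid region of a sample with 70 % cellularity is present in all tumor cells when its allele fraction is 0.35, and in half of them when it is 0.175.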
At a pathways level, common driver genes such as TP53 again occurred at higher clonal prevalences than genes previously not as well associated with malignancy93. However, this was not the case for the cytoskeletal function associated genes, suggesting they were later events in tumor clonal evolution. With respect to mutational patterns, the most prevalent single gene mutations were TP53 lesions in 62 % of the basal TNBCs, as observed in other studies101. Other single gene mutations at well known loci included RB1, PTEN, PIK3CA, USH2A, and MYO3A, ranging from about 7 to 10 % of cases (Figure 12). The nesprin genes SYNE1 and SYNE2 had recurrent mutations in 6 of 65 or 9.2 % of cases. Other loci with hits in two to three cases included BRCA2, ERBB2, ERBB3, BRAF, and NRAS, with some of these genes harboring clinically actionable mutations.

Figure 10. Clonal prevalence of TNBC tumor-specific mutations
Mutations shaded by clonal prevalence (% tumour cells) across 47 TNBC cases for driver genes, integrin signaling and extracellular matrix-related proteins. A range of lower (yellow) to higher (red) prevalences, plotted in 10 % increments, is evident. TP53 is the most frequently mutated gene.

To capture relevant biological pathway aberrations, the Reactome protein interaction database102 was mined for functionally connected gene families, providing insights not captured by single gene mutation analysis. Overrepresented gene families (FDR <0.001) included TP53 pathways, signaling in PIK3, ERBB, WNT, ATM/RB, and integrin families. Mutated genes included matrix laminins and collagens, integrin receptors, actin cytoskeleton proteins, and kinesins. This enrichment for cytoskeletal aberrations was also evident in copy number space. Over two-thirds of cases harbored mutations in cytoskeletal function genes.
Integrating mutation and copy number data with transcriptome data to glean genomic lesions influencing strong perturbations in gene transcription may reveal driver mutations. To this end DriverNet98, an integrated bipartite algorithmic approach, was applied and analysis revealed significant (P <0.05) hits to common cancer related loci, including TP53, PIK3CA, EGFR, RB, ATM, and NRAS. However, about 12 % of cases did not have mutations in these common tumor suppressors and oncogenes, or the cytoskeletal function genes noted above. Thus, at diagnosis, primary TNBCs displayed a heterogeneous mix of genomic lesions.

In summary, despite the limitations and challenges of applying whole genome scale analyses to a large set of tumors, as addressed later in the discussion chapter, this study uncovered the most comprehensive mutational spectrum of TNBC genomes to date. Primary tumors of this breast cancer subtype recapitulate the heterogeneity seen in clinical behavior, in that there is extensive variation at the level of sequence mutations and clonal composition at diagnosis. Clonal heterogeneity at first diagnosis presents an opportunity to dissect these cancers further for developing a clinically meaningful reclassification.

Chapter 3: A leap in resolution: Single-cell genotypes define tumor clonal structure

As exemplified with lobular and triple-negative breast cancer subtypes in the previous chapter, whole genome sequencing of individual tumors successfully enumerated genomic sequence and structural aberrations at nucleotide resolution, validating common mutations found previously using targeted gene-centric approaches and uncovering additional genomic lesions. These expansive data sets, however, do not provide resolution of mutational content at the level of individual cells or elucidate tumor clonal structure.
This is only achievable by enumerating mutational genotypes on a per cell basis, which was the motivation for reanalyzing bulk tumors at single-cell resolution. This level of genome analysis can also provide empirical evidence for the performance of computational approaches that predict tumor clonal architectures from bulk tumor-derived DNA sequence data sets, as detailed later in this dissertation for the PyClone99 and TITAN103 computational models.

3.1 Synopsis

The experimental workflow used in single-cell genotyping studies conducted for this dissertation is outlined in the next paragraph and in Figure 11 below. Methods development and a detailed protocol are described in the next section. Sections 3 and 4 of this chapter detail two early applications of the single-cell genotyping workflow, validating statistical models for inferring tumor clonal architecture using bulk tumor-derived genomic sequence features.

Solid frozen tumor tissue was cryosectioned, subjected to mechanical homogenization, and cell nuclei were isolated using a hypotonic lysis buffer. Liquid pleural effusion samples were processed directly in lysis buffer. Aliquots of freshly prepared and filtered nuclei were visually inspected and enumerated using a hemocytometer. Single nuclei were flow cytometrically sorted into microtitre plates and then subjected to two rounds of PCR. First, microlitre-scale 48-plex PCRs were performed, targeting known somatic SNVs and regions of LOH, previously identified from matched bulk tumor genomic data sets. Each single-cell-derived multiplex PCR amplicon pool was then subjected to a second round of PCRs, as 48 singleplex reactions. The singleplex products were processed for next generation DNA sequencing by incorporation of barcoded NGS flow cell adaptors. Amplicon libraries were purified and quantified before paired-end second generation sequencing.
Per cell genotypes were called to determine the combinations in which mutations co-occur in individual cells, to assess the cellular mutation prevalence of genotypes and tumor clonal lineages, and to track spatial or temporal changes across related experimental samples.

Figure 11. Integrating bulk tumor and single-cell assays with computational modeling
WGSS data sets were mined for sequence features (SNVs, LOH, etc) to serve as clonal marks and computational modeling performed to infer cellular genotypes and prevalence, and predict tumor clonal structures. Bulk tumors were resampled for the clonal marks using a 4-step single-cell genotyping workflow to empirically reconstruct predictions made by the models and chart tumor cell lineages.

3.2 Methods development

This section highlights testing and optimization of some of the steps involved in the experimental workflow adopted for single-cell genotyping experiments. Figure 12 is an overview of the experimental parameters that were considered for optimization.

Figure 12. Overview of protocol steps for single-cell genotyping assay development
The first step involved efficient isolation of cells and nuclei with minimal loss or lysis prior to sorting. The second involved efficient sorting of individual nuclei and controls into reaction vessels. The third involved sensitive amplification of mutant loci using sequential multiplex and singleplex PCRs, followed by NGS.

3.2.1 Preparation of nuclei from bulk tissue samples

Nuclei from liquid biopsies such as pleural effusion metastases (Figure 13, left panel) were prepared by hypotonic lysis of cell membranes. Solid frozen tissues were cryosectioned or minced and then subjected to mechanical homogenization, followed by hypotonic lysis to release nuclei (Figure 13, right panel).
Assuming a mass of 1.0 gram per cm^3 of solid tissue and cuboidal cells of 20 micron sides, a solid tumor nuclei preparation typically yielded about 2 x 10^5 nuclei per 100 milligrams of tissue, a yield of 1.0 % of the estimated starting number of nuclei. Tissues harvested in the murine xenograft project were disrupted using a laboratory paddle blender as described in Chapter 4. Preliminary work to dissociate tissue samples using enzymatic approaches such as digestion with collagenase and hyaluronidase resulted in unpredictable under- or over-digestion and was abandoned.

Figure 13. Nuclei preparations from two breast tumors
A nuclei preparation from a metastatic LBC pleural effusion tumor sample stained with Trypan blue dye is shown on the left in this light microscopic image. The right panel shows a nuclei preparation from a solid primary TNBC tumor, with visible cell debris.

3.2.2 Isolating individual nuclei for genotyping

Early single-cell experiments were conducted using custom programmable microfluidic devices manufactured by multilayer soft lithography (Leung et al42). This platform provided photomicrographic validation of sorted nuclei and the capability to scale reaction volumes down to nanolitre scale, with associated advantages in reaction kinetics and reagent cost savings.

Subsequent experiments involved processing nuclei preparations by flow cytometry104 into 384-well microtitre plates, followed by microlitre scale multiplex PCR reactions. The general gating strategy employed for single nuclei preparations is shown in Figure 14 below. Besides a standard negative no-template-control reaction, in which no nucleus was sorted into the well, synthetic beads were spiked in for sorting and used as an additional negative control, to sample background genomic DNA contamination in the approximately 2 nL of buffer that is deposited with each nucleus.

Figure 14.
Flow cytometric gating of tumor nuclei from a primary TNBC
The first gate on forward and side physical scatter (top left) separated cell debris, followed by selection for singlets using side scatter area and width parameters (top right). The final gating step selected for objects staining for a nuclear dye and represented in putative cell ploidy peaks such as the two peaks in the lower panel.

3.2.3 Improving genomic DNA accessibility in single-cell assays

In addition to sorting nuclei directly into wells of microtitre plates, several buffers and reagent formulations were tested for this preliminary step, considering that conventional genomic DNA preparation and purification protocols are not feasible with single-cell templates and microlitre scale reaction vessels. Options included sorting nuclei into molecular biology grade water; phosphate buffered saline; 5 % w/v bovine serum albumin; 5 % v/v proteinase K; 5 % v/v heat-labile protease; and 200 mM NaOH + 50 mM dithiothreitol. However, when evaluated downstream, none of these reagents yielded better success with positive PCR amplifications than seen with directly sorting into wells and freeze-thawing. Factors contributing to this outcome may include increased reaction volumes reducing amplification efficiencies already compromised at the microlitre scale with single-cell templates. Also, proteases that are not completely inactivated by heat denaturation, such as proteinase K, may retain residual activity on DNA polymerases following nuclei treatment and compromise downstream enzymatic steps.

3.2.4 Primer design strategy influences PCR efficiency

Multiplex PCRs conducted using pooled singleplex-designed oligonucleotide primer pairs generally outperformed commercially sourced multiplex-designed primer pools (Figure 15).
The point of comparison was amplicon yield and locus dropout when performing second round singleplex PCRs, by using aliquots of first round multiplex PCR reaction products as input template.

Figure 15. PCRs comparing amplification from multiplex versus singleplex primer designs
A representative analytical agarose gel showing second round singleplex PCR reaction products, generated using first round multiplex PCR reaction products as input template. The top panel shows results from a multiplex PCR primer design strategy, while the lower panel is from a singleplex PCR primer design strategy. Each panel has two single nuclei, bulk genomic DNA, and No-Template-Control reactions, with 12 singleplex PCR reactions each. A white arrow marks the expected sized amplicons, about 200 bp in length.

3.2.5 Considerations for single-cell PCR reaction chemistries

First, the use of nuclei staining dyes during flow cytometric sorting of nuclei preparations was especially useful with preparations from solid tumors, which harbored cellular debris. However, the inhibitory influence of DNA intercalating dyes on downstream enzymatic amplification chemistries is a concern105. To minimize the influence of nuclear dye staining on amplification chemistry, low sub-saturating concentrations of 1-2 ug/mL propidium iodide were used in nuclei preparations typically comprising several hundred thousand nuclei per mL of buffer106. Second, given the high number of PCR cycles required for target amplification from the genome of a single cell, PCR carryover contamination was of concern. A protocol step to ameliorate this problem included performing the first round multiplex PCRs with dUTP and heat-labile uracil DNA glycosylase107 reagents. Any previously generated uracil-containing PCR amplicons contaminating a newly assembled multiplex PCR reaction are degraded by incubation with the glycosylase enzyme present in the newly assembled reaction mix.
The enzyme is then heat inactivated prior to commencing amplification of the new multiplex PCR reaction. Multiplex PCR amplification products were promptly processed for second round singleplex PCRs (performed with a reaction mix lacking the glycosylase enzyme), or stored frozen.

Third, preliminary experiments involved optimizing multiplex PCR oligonucleotide primer concentrations to account for differences in amplification efficiency between primer pairs targeting their respective loci. To this end, multicopy genes were used as amplification sensitivity controls for targeted PCR in single cells108 (Figure 16). However, incorporation of this optimization step into routine use was not practical when scaling the number of target positions to 48 loci, and due to the need for repeated re-optimization given the tumor-specific lists of somatic SNVs and LOH loci.

Figure 16. Optimizing PCRs for improved uniformity of multiplexed reactions
Analytical agarose gel shows amplification products from five SNV loci (A to E) from a metastatic LBC genomic DNA sample. Amplification across three multicopy genes is also shown, on the right. The pre-optimization reactions with variable amplification are shown in the top gel while primer concentration-optimized reactions are shown in the lower gel. Expected amplicons, approximately 200 bp in size, are boxed.

3.2.6 Efficiency of PCR amplification from a single-cell genome

Targeted amplicon libraries constructed from incremental amounts of template clearly point to the challenge of efficiently interrogating input template amounts approaching the level of single genomes. This is exemplified in Figure 17 for both incremental amounts of purified genomic DNA and incremental numbers of nuclei.

Figure 17.
Amplicon libraries from incremental amounts of genomic DNA and nuclei
The top row of Agilent 2100 Bioanalyzer DNA chip electropherograms each show a composite peak of up to 48 PCR amplicons generated from different amounts of starting genomic DNA template (10 nanograms, 1 nanogram, 100 picograms) from 184-hTERT cells, an epithelial cell line. The lower row shows the same amplicon libraries constructed using different numbers of starting nuclei (240, 10, 1). The black arrows mark the approximate peak sizes of the expected amplicon pool. Smaller sized products (to the left of each arrow) result from spurious amplifications that compromise library quality and are subsequently purified away. The x-axis is amplicon size and the y-axis is arbitrary fluorescent units.

To assess the efficiency of target amplicon retrieval with reducing amounts of input template down to one diploid cell genome, the yield of amplified molecules over a given number of amplification cycles is instructive. Amplicon DNA mass measurements following two rounds of PCR can be made, per target locus, per nucleus. This is exemplified below for an FGA locus (fibrinogen alpha chain, hg18_chr4:155726802) somatic coding SNV amplicon retrieved from one metastatic LBC tumor nucleus (Figure 18).

Figure 18. Electropherograms of singleplex PCR amplicons from one LBC nucleus
The Agilent 2100 Bioanalyzer DNA chip electropherograms show second round singleplex PCR products from five loci. The FGA locus amplicon metrics are highlighted in the inset on the lower right, with information on DNA concentration and molarity of the amplicon peak. The two smaller flanking peaks are sizing markers. The panel on the right is a genomic sequence alignment viewer snapshot, showing the FGA peak-derived NGS sequence reads. Both reference (C) and variant (T) allelic bases are present with an approximately equal proportion of sequence reads for each base.
The following calculations consider the yield of amplicon molecules and PCR efficiency following multiplex and singleplex PCRs at one locus (FGA) of one metastatic LBC tumor nucleus. This represents one of 48 loci targeted and is representative of a reaction which would typically be carried forward for amplicon library construction and sequencing. Assume the starting number of template molecules to be two, that is, a normal diploid state, with no allelic dropout. The first round multiplex PCR reaction comprised 30 cycles and 1 % of the reaction product was used for the second round singleplex PCR of 35 cycles. The specific yield of the singleplex reaction was 33.0 nanograms. For a 200 bp double-stranded DNA molecule, 1 picogram represents about 2.2 x 10^7 molecules. This yield then corresponds to 7.3 x 10^11 molecules. A 100 % efficient PCR reaction doubles the amplicon pool every cycle. Assume a 90 % efficiency of the singleplex reactions, that is, a 1.8-fold increase on average per cycle. The number of starting template molecules n that would yield 7.3 x 10^11 molecules at 90 % amplification efficiency can be calculated by n x 1.8^35 = 7.3 x 10^11, or n = 849 molecules, when 1 % of the multiplex PCR product is used as input template for the second round singleplex reactions. Thus the total multiplex reaction product for this locus was 84,900 molecules. For a size of 200 bp dsDNA, this corresponds to 1.9 x 10^-14 grams, or 19 femtograms (10^-15 grams) of specific DNA molecules yield in the multiplex PCR reaction. The overall efficiency e of single-cell PCR for this diploid locus can be estimated using 2 x e^30 = 84,900, or e = 1.43, that is, 43 % efficiency. This estimate for one of 48 loci, targeted simultaneously in one nucleus, is comparable to other reports. For example, targeting a single locus, the human beta-globin gene, in a diploid cell yielded a PCR cycle efficiency of 65 %109.
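The back-calculation above can be reproduced with a few lines of Python. This is a minimal sketch that takes the figures quoted in the text as given (33.0 ng yield, the stated conversion of 2.2 x 10^7 molecules per picogram, 1.8-fold per cycle over 35 singleplex cycles, 1 % carry-over, 30 multiplex cycles, two starting molecules); the function and variable names are illustrative and not part of the original analysis.

```python
def input_molecules(final_molecules, fold_per_cycle, cycles):
    """Number of template molecules needed to reach final_molecules."""
    return final_molecules / fold_per_cycle ** cycles


def fold_per_cycle(start_molecules, end_molecules, cycles):
    """Average per-cycle fold increase between start and end."""
    return (end_molecules / start_molecules) ** (1.0 / cycles)


# Figures quoted in the text above.
SINGLEPLEX_YIELD_PG = 33.0 * 1000      # 33.0 ng expressed in picograms
MOLECULES_PER_PG = 2.2e7               # conversion factor as stated in the text
SINGLEPLEX_FOLD = 1.8                  # assumed 90 % efficiency
SINGLEPLEX_CYCLES = 35
CARRY_OVER_FRACTION = 0.01             # 1 % of multiplex product used as template
MULTIPLEX_CYCLES = 30
STARTING_COPIES = 2                    # diploid locus, no allelic dropout

final_molecules = SINGLEPLEX_YIELD_PG * MOLECULES_PER_PG
n_input = input_molecules(final_molecules, SINGLEPLEX_FOLD, SINGLEPLEX_CYCLES)
multiplex_product = n_input / CARRY_OVER_FRACTION
efficiency = fold_per_cycle(STARTING_COPIES, multiplex_product, MULTIPLEX_CYCLES)

if __name__ == "__main__":
    print(f"singleplex product:  {final_molecules:.2e} molecules")
    print(f"singleplex input:    {n_input:.0f} molecules")
    print(f"multiplex product:   {multiplex_product:.0f} molecules")
    print(f"per-cycle fold (e):  {efficiency:.2f}")
```

Because the script does not round intermediate values, the singleplex input comes out a few molecules below the 849 quoted above, while the per-cycle fold increase still rounds to 1.43.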
3.2.7 Improvements in single-cell targeted amplicon library quality

When starting with a single cell genome as input template, spurious extraneous amplicons often result during library construction and compromise library quality with wasted NGS flow cell cluster space. Efforts to ameliorate this issue included reducing the concentration of Illumina flow cell barcoded adaptors when incorporating them following second round singleplex PCRs (Figure 19). However, the relative proportion of specific and spurious amplification products remained unchanged even when titrating down to the limits of product detection by agarose gel electrophoresis.

Figure 19. Titrating down barcoded adaptors to reduce spurious amplicons
Three different molar concentrations of flow cell adaptors (standard 1x reaction at 400 micromolar, 1/5x at 80 micromolar and 1/25x at 16 micromolar) were tested during their incorporation on to library singleplex PCR products. The desired size amplicons band is marked by the black arrow, at about 225 bp size. The red arrow marks undesirable spurious amplicons arising, for example, by adaptors cross-amplifying in the absence of adequate bona fide template.

An additional step for avoiding spurious amplification products being carried forward in NGS amplicon library construction is rigorous purification away of such products following addition of flow cell adaptors. To this end, comparisons were made using traditional preparative agarose gel electrophoresis and paramagnetic beads for purification, the latter being popular in high-throughput protocols using robotic pipetting. Preparative agarose gel electrophoresis purification performed more consistently than paramagnetic beads when used for single-cell amplicon library cleanup (Figure 20), leading to the adoption of a preparative agarose gel purification protocol.

Figure 20.
Agarose gel versus paramagnetic beads cleanup for NGS amplicon libraries
The two upper electropherograms of a single-cell-derived amplicon library prior to (left) and following paramagnetic bead purification (right) show the retention of a spurious amplicon peak (red arrow) in the bead-purified electropherogram, besides the specific sized amplicons peak (black arrows). By contrast, the preparative agarose gel-purified (lower right) amplicon library products shown in the electropherogram (lower left) yielded a clean amplicon library.

3.2.8 Sequencing low complexity single-cell targeted amplicon libraries

Unpredictable sample (nucleus), locus (PCR amplicon sequence reads), and allelic (one of two heterozygous alleles) dropouts are ongoing challenges in single-cell assays, resulting in variable amplicon library complexity irrespective of the number of loci originally targeted. Adequate base diversity, especially in early sequence cycles, is a critical requirement of Illumina imaging-dependent sequence calling algorithms110. The Illumina MiSeq NGS platform was used routinely for single-cell targeted amplicon library sequencing, with the addition of a PhiX DNA spike-in up to a 33 % molar amount to circumvent the failure of sequencing runs occurring due to a possible lack of base diversity in single-cell-derived targeted amplicon libraries.

3.3 Protocols for genotyping single cells by two-round targeted PCRs and NGS

Experiments were performed as appropriate for sterile technique and avoiding nucleic acid carryover contamination by using PCR workstations, HEPA-filtered cabinets, and strict adherence to segregating pre- and post-amplification workflows.

3.3.1 Preparation of nuclei suspensions

When starting with cryopreserved liquid pleural effusion tumor samples, approximately 1 mL of nuclease- and protease-free EZ Prep lysis buffer (Sigma-Aldrich, St. Louis MO) was added directly to no more than 1 million cells and incubated on ice for 5 minutes.
The nuclei preparation was subsequently filtered through a sterile 70-micron nylon Falcon™ cell strainer (Fisher Scientific, Pittsburgh PA), a step repeated immediately prior to flow sorting using a second filter. Filtered nuclei were transferred into a 5 mL BD Falcon™ polystyrene tube (Fisher Scientific) and a sub-saturating amount of propidium iodide (Invitrogen, Burlington ON) was added up to a final concentration of 2 micrograms per mL.

Solid frozen tissue samples were processed by cryosectioning 50-micron sections, followed by mechanical homogenization using a Polytron homogenizer (Kinematica, Bohemia NY) with disposable aggregate shafts. Approximately three 50-micron sections per 1 mL of lysis buffer were homogenized by two to four 30-second pulses, with incubation on ice between each pulse. The homogenized product was then filtered as described above. Murine xenograft solid tumor samples were mechanically disrupted using a Stomacher™ brand laboratory paddle-blender (Seward USA, Davie FL) and subsequently processed as described above for the other types of samples.

Aliquots of freshly prepared nuclei were visually inspected and enumerated using a Leica DM2500 light microscope (Leica Microsystems, Concord, Ontario) with a dual counting chamber hemocytometer (Improved Neubauer, Hausser Scientific, PA) and Trypan Blue stain solution (Invitrogen).

3.3.2 Flow cytometric sorting of single nuclei

Single nuclei were flow sorted into individual wells of 384-well microtitre plates using a FACSAria III sorter (BD Biosciences, San Jose, CA). Events were gated as described in Figure 14, using forward and side scatter to exclude cellular debris, followed by singlet discrimination, and then nuclear stain intensity to select events within and between putative cell cycle ploidy peaks.
Nuclei preparations were spiked with BD Calibrite™ FITC beads (BD Biosciences) and these were sorted as indicators of background contamination of nuclei preparation buffer by nucleic acid debris released during sample preparation. This no-template-bead control was in addition to a standard no-template-control, without any deposited nuclei, and to a 10 ng genomic DNA positive control, prepared from the nuclei preparation remaining after flow sorting by using a QIAamp DNA Mini Kit according to the manufacturer’s protocols (Qiagen, Toronto ON). The alignment of microtitre plate wells with the sorter’s 100-micron nozzle was checked periodically.

3.3.3 PCR oligonucleotide primers design

Somatic coding SNVs and LOH events validated in bulk tumor tissue genome sequencing experiments and determined to be clonally informative by statistical modeling, using the PyClone and TITAN programs, were selected for the design of mutation-spanning PCR primers using Primer3 software76. Amplicon lengths ranged from 100 to 200 bp. Oligonucleotides were ordered desalted at 100 uM synthesis scale (Integrated DNA Technologies, Coralville IA). Illumina common sequences for a PCR-based barcoding step were appended for synthesis to the 5-prime ends of the gene-specific primers. The appended 5-prime common sequences were as follows.

P5 side forward primer: 5’-TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG-[gene specific sequence].
P7 side reverse primer: 5’-GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG-[gene specific sequence].

3.3.4 First round multiplex PCR of single nuclei

Multiplex (48 loci) PCRs were set up by combining reagents in a template-free pre-PCR workspace. Reactions were performed with SYBR GreenER qPCR Supermix reagent (Life Technologies, Burlington, ON), by combining 2.5 uL of the 2x concentrated reagent with 1.0 uL of a primer mix containing each primer at 1.0 uM and 1.5 uL molecular biology grade water, per reaction. Typically 48 to 96 multiplex reactions were processed at a time to constitute an amplicon library.
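Since the first round reagents are specified per 5.0 uL reaction but assembled for batches of 48 to 96 reactions, the per-reaction volumes have to be scaled to a master mix. The Python sketch below shows one way to do this; the helper name `master_mix` and the 10 % pipetting overage are assumptions for illustration and are not part of the protocol as described.

```python
# Per-reaction volumes (uL) for the first round multiplex PCR, from the text.
PER_REACTION_UL = {
    "2x SYBR GreenER qPCR Supermix": 2.5,
    "primer mix (1.0 uM each)": 1.0,
    "molecular biology grade water": 1.5,
}


def master_mix(per_reaction_ul, n_reactions, overage=0.10):
    """Scale per-reaction volumes to a batch.

    overage is the extra fraction prepared to cover pipetting losses
    (an assumed value, not specified in the protocol).
    """
    if n_reactions < 1:
        raise ValueError("n_reactions must be at least 1")
    if overage < 0:
        raise ValueError("overage cannot be negative")
    scale = n_reactions * (1.0 + overage)
    return {reagent: round(volume * scale, 2)
            for reagent, volume in per_reaction_ul.items()}


if __name__ == "__main__":
    for batch in (48, 96):
        print(f"{batch} reactions:")
        for reagent, volume in master_mix(PER_REACTION_UL, batch).items():
            print(f"  {reagent}: {volume} uL")
```

Setting `overage=0.0` returns the exact theoretical volumes, which is convenient for checking that the per-reaction components add up to the intended 5.0 uL.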
Each set of 48 reactions typically comprised 42 single nuclei, two 10-nanogram genomic DNA controls, two no-template-controls, and two no-template-bead controls. Thermal cycling was performed using an ABI7900HT machine (Life Technologies) with the following parameters: 94°C (11:00 min), 30x [94°C (30 sec), 60°C (30 sec), 72°C (30 sec)], 72°C (5:00 min), 4°C hold. Multiplex PCR reaction products were promptly processed for subsequent enzymatic steps or stored frozen. In a post-PCR workspace, ExoSAP-IT (Affymetrix, Santa Clara CA) treatment of reaction products was performed to remove unused oligonucleotides and dNTPs. This was done by combining each of the 5.0 uL multiplex PCR reactions with 2.0 uL of ExoSAP-IT reagent and incubating as follows: 37°C for 15 min, 80°C for 15 min, 4°C hold.

3.3.5 Second round re-amplification by singleplex PCRs

Before performing reamplification with 48 singleplex PCRs across each of the 48 multiplex PCR reaction products, a subset of 6 to 12 analytical singleplex PCRs was performed to assess success with multiplex PCR of single nuclei. These PCRs were set up as follows per sample, using a pre-mix of FastStart High Fidelity PCR reagents (Roche, Indianapolis IN).

Table 5. Second round analytical PCR pre-mix
Reagent | Volume
10x FastStart High Fidelity Reaction buffer without MgCl2 | 2.0 uL
25 mM MgCl2 | 3.6 uL
DMSO | 1.0 uL
10 mM PCR Grade Nucleotide Mix | 0.4 uL
5 U/uL FastStart High Fidelity Enzyme Blend | 0.2 uL
PCR Certified Water | 7.8 uL
Total | 15.0 uL

The above PCR pre-mix was combined as follows per reaction:

Table 6. Second round analytical PCR reaction mix
Reagent | 1x reaction
PCR pre-mix | 15.0 uL
1 uM singleplex primer pair | 4.0 uL
Multiplex PCR product (~1 %) | 1.0 uL
Total | 20.0 uL

Thermal cycling was performed on a Veriti™ 96-well machine (Applied Biosystems) with the following parameters: 94°C (60 sec), 30x [94°C (30 sec), 58°C (30 sec), 72°C (30 sec)], 72°C (5:00 min), 4°C hold.
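The two cycling programs given in this section can be written down as structured data, which makes them easy to compare and to sanity-check. The sketch below is illustrative only: the class and constant names are hypothetical, and the computed duration counts programmed hold times alone, ignoring temperature ramping and the final 4°C hold.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Step:
    temp_c: float
    seconds: int


@dataclass(frozen=True)
class Program:
    initial: Step
    cycles: int
    cycle_steps: Tuple[Step, ...]
    final_extension: Step

    def hold_seconds(self) -> int:
        """Sum of programmed hold times (no ramping, no 4 C hold)."""
        per_cycle = sum(step.seconds for step in self.cycle_steps)
        return (self.initial.seconds
                + self.cycles * per_cycle
                + self.final_extension.seconds)


# First round multiplex PCR: 94 C 11:00, 30x [94/60/72 C, 30 s each], 72 C 5:00.
FIRST_ROUND_MULTIPLEX = Program(
    initial=Step(94, 11 * 60),
    cycles=30,
    cycle_steps=(Step(94, 30), Step(60, 30), Step(72, 30)),
    final_extension=Step(72, 5 * 60),
)

# Second round analytical singleplex PCR: 94 C 60 s, 30x [94/58/72 C], 72 C 5:00.
SECOND_ROUND_ANALYTICAL = Program(
    initial=Step(94, 60),
    cycles=30,
    cycle_steps=(Step(94, 30), Step(58, 30), Step(72, 30)),
    final_extension=Step(72, 5 * 60),
)

if __name__ == "__main__":
    for name, program in (("multiplex", FIRST_ROUND_MULTIPLEX),
                          ("analytical singleplex", SECOND_ROUND_ANALYTICAL)):
        print(f"{name}: {program.hold_seconds() / 60:.0f} min of hold time")
```

Written this way, the programs differ only in the initial denaturation time and the annealing temperature (60°C versus 58°C).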
A 10 uL aliquot of each test singleplex PCR-amplified product was run on a standard 2 % analytical agarose gel in 1x TAE buffer at 110 volts for 45 minutes. Gels were imaged using GelStar™ nucleic acid stain (Lonza, Rockland ME) and an AlphaImager™ (Alpha Innotech, Fisher Scientific, Ottawa ON) according to the manufacturer’s protocol.

Subsequent re-amplification of the complete set of 48 multiplex reactions by singleplex PCRs was performed using a Fluidigm 48.48 Access Array™ platform and reagents according to the manufacturer’s protocols (Fluidigm Corporation, San Francisco, CA) in 30 nanolitre reaction volumes. A 20x primers solution for each of the 48 primer pairs was prepared in a pre-PCR workspace as follows:

Table 7. Primers solution mix for Fluidigm Access Array™ singleplexes
Reagent | Volume
50 uM each forward & reverse primers mix | 6.0 uL
20x Access Array Loading Reagent | 2.5 uL
DNA Suspension Buffer | 41.5 uL
Total | 50.0 uL

The following 1.25x PCR pre-mix was prepared per sample in a pre-PCR workspace:

Table 8. PCR pre-mix for Fluidigm Access Array™ singleplexes
Reagent | Volume
10x FastStart High Fidelity Reaction buffer without MgCl2 | 0.5 uL
25 mM MgCl2 | 0.9 uL
DMSO | 0.25 uL
10 mM PCR Grade Nucleotide Mix | 0.1 uL
5 U/uL FastStart High Fidelity Enzyme Blend | 0.05 uL
20x Access Array Loading Reagent | 0.25 uL
PCR Certified Water | 1.95 uL
Total | 4.0 uL

The following Sample Mix solution was prepared for each of the 48 ExoSAP-IT treated multiplex PCR reaction products:

Table 9. Sample mix for Fluidigm Access Array™ singleplexes
Reagent | Volume
1.25x PCR pre-mix | 4.0 uL
Multiplex PCR product (14 %) | 1.0 uL
Total | 5.0 uL

A 48.48 Access Array IFC was primed using a pre-PCR Access Array IFC Controller AX, with injection of control line fluid into both Access Array IFC accumulators. A volume of 500 uL of 1x Access Array Harvest Reagent was added to the appropriate wells of the Access Array.
A volume of 500 uL of 1x Access Array Hydration Reagent was added to the appropriate well of the Access Array. The Access Array was placed in the pre-PCR Controller AX and the Prime (151x) script was run. A post-PCR Access Array IFC Controller AX was cleaned with DNAZap™ solution (Ambion, Life Technologies) and rinsed with water and isopropanol.

A volume of 4.0 uL of the Sample Mix solution was pipetted into each of the left side inlets of the Access Array, followed by 4.0 uL of 20x Primers Solution into each of the primer inlets on the right side of the Access Array. The Access Array was loaded in the Post-PCR Controller AX and the Load Mix (151x) script was run. The loaded Access Array was placed in a Fluidigm FC1 Thermal Cycler and cycled using the 35-cycle 48x48 Standard protocol.

Singleplex PCR products were harvested from the Access Array by pipetting 2.0 uL of 1x Access Array Harvest Reagent into each of the left side inlets. The remaining 1x Access Array Harvest Reagent was removed from the appropriate wells and replaced with a volume of 600 uL of fresh 1x Access Array Harvest Reagent. The Access Array was placed in the Post-PCR IFC Controller AX and the Harvest (151x) script was run. The 48 harvested singleplex PCR products per sample, pooled to yield approximately 10 uL each, were transferred from the sample inlets into a 96-well PCR plate for Illumina NGS barcoding.

3.3.6 Nuclei-specific amplicon barcoding and addition of NGS adaptors

The 48 pooled singleplex PCR products from each multiplex PCR reaction were assigned unique molecular barcodes and adapted for MiSeq flow-cell NGS chemistry using a PCR step comprising the following pre-mix:

Table 10.
Molecular barcoding PCR pre-mix
Reagent                                                      Volume
10x FastStart High Fidelity Reaction Buffer without MgCl2    2.0 uL
25 mM MgCl2                                                  3.6 uL
DMSO                                                         1.0 uL
10 mM PCR Grade Nucleotides Mix                              0.4 uL
5 U/uL FastStart High Fidelity Enzyme Blend                  0.2 uL
PCR Certified Water                                          7.8 uL
Total                                                        15.0 uL

A volume of 4.0 uL of each Illumina barcoding primer pair (2 uM each) was aliquoted in a pre-PCR workspace and then combined, in a post-PCR workspace, with 1.0 uL (10 %) of each harvested singleplex sample pool and the above PCR pre-mix. Thermal cycling was performed on these 20.0 uL reactions using an ABI Veriti™ cycler as follows: 95°C (10:00 min), 15x [95°C (15 sec), 60°C (30 sec), 72°C (60 sec)], 72°C (3:00 min), 4°C hold.

3.3.7 Targeted amplicon library quality assessment and purification

A 2100 Bioanalyzer instrument with DNA chips (Agilent Technologies, Santa Clara, CA) was used to assess the quality and yield of a subset of the barcoded samples according to the manufacturer’s protocol. The gel-dye mix was prepared by allowing the dye concentrate and gel matrix to first equilibrate to room temperature. A volume of 25.0 uL of the dye was added to a gel matrix vial, vortexed, spun down, and transferred to a spin filter. The mix was centrifuged at 2240 g for 15 minutes on a standard benchtop microcentrifuge. The filtered product was used to prime a DNA 1000 chip on a Bioanalyzer chip priming station. A volume of 5.0 uL of marker reagent was added to each of the sample wells and the ladder well. A volume of 1.0 uL of the DNA ladder was added to the appropriate well. A volume of 1.0 uL of sample was added to each of the sample wells. The loaded DNA chip was vortexed for 1 minute at 2400 rpm and then run promptly on a 2100 Bioanalyzer. The electropherograms were analyzed for the presence of amplicons of the desired size and for the absence or minimal occurrence of spurious amplicons of smaller sizes.
Barcoded products were pooled and purified by preparative agarose gel electrophoresis using a 2 % E-Gel system (Invitrogen) according to the manufacturer’s protocol. The gel was run and amplicons harvested continuously after products up to and including the approximately 175 bp-sized spurious amplicons had cleared the harvest well. The harvest with minimal or no spurious amplification products was carried forward for quantitation and dilution for DNA sequencing.

3.3.8 NGS of targeted amplicon library using Illumina MiSeq chemistry

The gel purified amplicon pools were quantified using a Qubit 2.0 Fluorometer and dsDNA High Sensitivity Assay (Life Technologies, Burlington ON) according to the manufacturer’s protocol. A dye:buffer mix of 1:200 was prepared along with DNA standards using a 1:20 standard:dye-buffer ratio. The samples to be analyzed were prepared in a sample:dye-buffer mix of 1.0 uL sample with 199.0 uL dye-buffer. The DNA standards were used to calibrate the instrument before measuring the concentration of each diluted sample. Libraries were diluted to a concentration of 4.0 nM. A Savant™ DNA SpeedVac™ (ThermoScientific, Waltham MA) was used to concentrate libraries that were too dilute.

Next-generation DNA sequencing was conducted using a MiSeq NGS sequencer according to the manufacturer’s protocols (Illumina, San Diego, CA). A manifest file and samplesheet were prepared and the sequencing cartridge thawed at room temperature. A maintenance wash of the MiSeq was performed before each run. DNA LoBind™ tubes (Eppendorf, Hauppauge NY) were used for amplicon library dilution. A fresh dilution of 0.2 N NaOH was prepared to denature the 4.0 nM DNA amplicon library using equal volumes of each. Incubation was performed for 5 minutes at room temperature. A volume of 990 µL pre-chilled HT1 buffer was added to 10 µL of denatured DNA. This yielded a 20.0 pM denatured library, which was kept on ice until performing the final dilution.
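The dilution arithmetic behind the 20.0 pM figure can be checked with a short calculation. The sketch below is illustrative only: it reproduces the two steps stated above (equal-volume NaOH denaturation, then 10 µL into 990 µL HT1) and adds a helper for the final dilution to a loading concentration. The 600 µL final volume and all function names are assumptions, not values taken from this protocol.

```python
def denature(library_nm):
    """Mixing equal volumes of library and 0.2 N NaOH halves the concentration."""
    return library_nm / 2.0


def dilute(conc, sample_ul, diluent_ul):
    """Concentration after adding diluent to a sample (same unit as the input)."""
    return conc * sample_ul / (sample_ul + diluent_ul)


def loading_volumes(stock_pm, target_pm, final_ul=600.0):
    """Volumes (uL) of denatured library and HT1 for a target loading concentration.

    final_ul=600 is an assumed final volume, used here only for illustration.
    """
    if not 0 < target_pm <= stock_pm:
        raise ValueError("target must be positive and not exceed the stock")
    library_ul = final_ul * target_pm / stock_pm
    return library_ul, final_ul - library_ul


denatured_nm = denature(4.0)                            # 4.0 nM library -> 2.0 nM
stock_pm = dilute(denatured_nm * 1000.0, 10.0, 990.0)   # 10 uL + 990 uL HT1 -> 20.0 pM

if __name__ == "__main__":
    print("Denatured stock (pM):", stock_pm)
    # Example target within the 6.0 to 15.0 pM loading range.
    print("Library, HT1 (uL):", loading_volumes(stock_pm, 10.0))
```

For a 10 pM target in the assumed 600 µL, this gives equal parts denatured library and HT1; other targets scale linearly.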
The denatured DNA was further diluted with pre-chilled HT1 to the desired concentration between 6.0 and 15.0 pM for achieving an optimal flow cell cluster density of 1000 - 1200 K/mm².

A spike-in of denatured PhiX control DNA was used to avoid base-calling issues with low complexity targeted amplicon libraries. The 10.0 nM PhiX DNA was diluted to the desired picomolar concentration to achieve a library:PhiX molar ratio of about 3:1. The standard manifest and samplesheets were uploaded using the MiSeq control software and a new flow cell was rinsed and dried. The library sample cartridge, flow cell, and running buffer were loaded and the MiSeq was run for paired-end sequencing of 150 to 250 bp read lengths. Standard sequence and library quality metrics output by the MiSeq Control and Reporter software were assessed before processing the data further using custom analysis routines detailed in the next two sections of this chapter and in the fourth chapter.

In summary, several stepwise refinements in the ability to measure somatic mutations across single-cell genomes at microlitre reaction scales were accomplished. This process included incremental improvements in the preparation and isolation of nuclei suspensions from a variety of tumor tissue types, evaluation of approaches for improving genomic DNA accessibility, optimal PCR primer design and reaction chemistries, and efficient NGS of low complexity libraries.

3.4 Single-cell SNV genotyping validates inferences of tumor clonal structure

An early objective of the aforementioned protocols, developed for genotyping mutations in parallel across several genomic loci in tumor nuclei, was to empirically test computational predictions of tumor clonal structure from bulk tumor-derived genomic data sets. This validation approach was first applied to tumor clonality predictions based on deeply sequenced somatic SNVs from individual tumors using the PyClone statistical model99.
3.4.1 Background - PyClone

Applying NGS approaches to targeted DNA amplicon analysis of tumor genomes enables quantitative measurements of the prevalence of genomic sequence features based on the proportion of read counts bearing that feature or variant38. This approach yields an allelic prevalence count from the original template DNA, which is typically a genomic DNA preparation from a population of thousands to millions of cells. The range of allelic prevalences routinely observed indicates that not all alleles are present in all the cells from which genomic DNA was prepared. Thus, there is ambiguity about the combination of variants or alleles present per cell. This ambiguity may be reduced by applying statistical models which estimate the fraction of tumor cells carrying a mutation, that is, the cellular mutation prevalence. Mutations occurring earlier in a tumor cell lineage tree may be predicted to have a higher cellular mutation prevalence than those occurring in descendant generations. This property, however, does not hold true with allelic prevalence because of the influence of factors such as copy number aberrations on genotypes.

PyClone is a Bayesian statistical model which infers the cellular mutation prevalence of clonal marks using an input of deeply sequenced mutations or variant alleles. These marks are first extracted from WGSS or exome sequence data sets. Model features are shown schematically in the top panel of Figure 10 (chapter 2) of this dissertation. Cell population structure assumptions in the PyClone model include three defined states: one, normal cells without the mutational marks; two, reference tumor cells without the mutational marks; and three, variant tumor cells harboring the mutational marks.
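To illustrate why allelic prevalence and cellular mutation prevalence can diverge, the following minimal sketch computes the expected fraction of variant reads from a mixture of the three populations just described. It is not PyClone's implementation: it assumes reads are sampled in proportion to each population's prevalence and copy number, ignores sequencing error, and uses hypothetical parameter names.

```python
def expected_variant_allelic_prevalence(
    cellular_prevalence,  # fraction of tumor cells carrying the mutation
    tumor_content,        # fraction of all cells that are tumor cells
    variant_copies,       # mutated copies per variant tumor cell
    cn_variant=2,         # total copies at the locus in variant tumor cells
    cn_reference=2,       # total copies in tumor cells lacking the mutation
    cn_normal=2,          # total copies in normal cells
):
    """Expected fraction of reads carrying the variant (no sequencing error)."""
    normal = (1.0 - tumor_content) * cn_normal
    reference = tumor_content * (1.0 - cellular_prevalence) * cn_reference
    variant = tumor_content * cellular_prevalence * cn_variant
    variant_alleles = tumor_content * cellular_prevalence * variant_copies
    return variant_alleles / (normal + reference + variant)


if __name__ == "__main__":
    # Same cellular prevalence (1.0), different copy number at the locus.
    diploid = expected_variant_allelic_prevalence(1.0, 1.0, variant_copies=1)
    amplified = expected_variant_allelic_prevalence(
        1.0, 1.0, variant_copies=1, cn_variant=4
    )
    print(diploid, amplified)
```

In this toy setting, a mutation present in every tumor cell yields an allelic prevalence of 0.5 at a diploid locus but 0.25 when one of four copies is mutated, which is why allelic prevalence alone does not order mutations by cellular prevalence.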
It is also assumed that cells within each of these three states share the same genotype, and that sequenced reads are sampled independently without bias from a library of molecules such that the probability of sampling a read for a given mutation is proportional to the prevalence of the subpopulation and copy number of the allele. Copy number constitutes a prior input, which can be measured from SNP genotyping arrays or WGSS data sets. Similarly, a prior input metric of tumor cell content can be estimated from sequence data sets or conventional pathology. PyClone outputs a posterior density for each mutational feature’s cellular mutation prevalence and a probability of any two mutations occurring in the same cluster. Mutations are assigned to the same cluster if they occur at the same cellular mutation prevalence. A comprehensive discussion of this model was provided in the supplementary text and figures (pages 1 to 28) accompanying the manuscript99 for this study. Despite all the complexities inherent to such clonal inference, sampling tumors in a spatially and temporally defined manner to generate several measurements can compensate for noise and adds statistical confidence to observed cellular mutation prevalences fluctuating in parallel.

3.4.2 Targeted DNA amplicon sequencing of somatic SNVs

Performance of PyClone model cluster predictions against alternate models was initially tested using fixed proportions of predefined DNA samples, empirically recapitulating a ground truth data set. The process involved selecting SNVs known to exist in each sample. The SNVs were amplified and subjected to NGS library construction and sequencing using the experimental approaches described in chapters 2 and 4. The results confirmed model clustering accuracy, placing homozygous and heterozygous SNVs from the same clusters together. Another performance test was conducted using spatially contiguous sections of a primary untreated high-grade serous ovarian tumor (Figure 21).
Differences in cluster outputs between PyClone and an alternate approach, IBBMM (infinite Beta Binomial mixture model), were then validated at the highest resolution by conducting single-cell genotyping of SNVs from one tumor section using the multiplex PCR approach described earlier in this chapter.

3.4.3 Statistical analysis of single-cell SNV data

FASTQ files generated from targeted amplicon paired-end sequencing on an Illumina MiSeq platform were aligned to the hg19 human reference genome using BWA121. Allelic prevalence data was mined and positions with base or mapping quality values below 10 were removed. In addition, nuclei samples with more than 80 % of unusable target loci were removed from further analysis. Likewise, target loci that were unusable in over 80 % of nuclei samples were removed.

To account for biased allelic amplification inherent to the analysis of heterozygous positions in single-cell genomes, statistical testing was performed to call variant alleles present or absent. The null hypothesis postulated the variant allelic base to be absent, or occurring due to sequencing error. The proportion of reads harboring a variant allele at positions across the amplicon, except the target SNV position, for each of the 48 SNV loci was computed. At the target SNV position, the variant allele was defined as the non-reference allele with the highest number of reads. The mean of these values was used to represent the sequencing error rate and to perform a one-tailed binomial exact test. The p-values for each cell were corrected for multiple testing across all targeted loci using the Benjamini-Hochberg procedure. A false discovery rate of 0.001 was used to assess whether a variant allele was present or not.

Figure 21.
Bulk tissue SNV data-derived clonal structure predictions in an ovarian tumor
Variant allelic and cellular mutation prevalences in a serous ovarian tumor measured across four physically contiguous tumor sections (A, B, C, and D) are shown using two models. Inferred clusters are numbered and color-coded, with the number of mutations n in each cluster shown in brackets. Cellular mutation prevalence of a cluster is the mean value of the mutations in that cluster.

Since the sum of the variant allelic prevalences for IBBMM clusters 1 and 2 exceeds 1.0, cluster 2 is likely to contain mutations from cluster 1. Thus some cells will harbor mutations from both clusters while others harbor only mutations from cluster 1. In contrast, PyClone groups these two IBBMM clusters into a single cluster. Single-cell genotyping of 25 nuclei from tumor section B confirmed that nuclei with IBBMM cluster 1 mutations also included some IBBMM cluster 2 mutations (Figure 22), arguing for combining these two clusters into one. A complete PyClone cluster 1 mutation set was not observed in a cell due to allelic drop-out at heterozygous loci, as also seen with germline positions. For tumor section B, PyClone also predicted low prevalence clusters (4 and 5), which showed higher prevalence in neighbouring sections A and C. None of the 25 nuclei sampled showed presence of these mutations, indicating absence or low prevalence of these mutations in tumor section B.

Figure 22. Single-cell validation of SNV-predicted clones in an ovarian tumor
The occurrence of variant alleles (rows, with hg19 chromosome coordinates) in single cells isolated from tumor section B is shown in the upper panel. Predicted clusters from PyClone (BeBin-PCN, left) and IBBMM (right) are shown in the two color-coded columns alongside the allele coordinates. White cells are non-somatic control positions. Grey cells were excluded due to low sequence read coverage.
A condensed interpretation of these results is summarized in the lower panel.

A second comparison of bulk tumor-derived inferences of clonality versus single-cell measurements was performed using a triple-negative breast tumor. Figure 23 shows the cellular mutation prevalence posterior distributions for primary TNBC case SA029, using IBBMM and two implementations of PyClone (Bin-PCN, BeBin-PCN). Major and minor copy numbers, when predicted, are shown.

Figure 23. Bulk tissue SNV data-derived clonal structure predictions in a TNBC tumor
Cellular mutation prevalence inferences made using IBBMM and two variations of PyClone. The input variant allelic prevalences across the 12 SNV target alleles (rows) are color-coded green.

To test for mutations predicted by PyClone (BeBin-PCN) to be co-occurring with RB1, single-cell targeted PCR analysis was performed and showed that the SNVs for SAGE and XPO were also present (Figure 24). This co-occurrence supported the PyClone (BeBin-PCN) clustering, but not inferences made from raw allelic prevalence, IBBMM, or Bin-PCN. These other approaches predicted some cells to harbor RB1 SNVs without SAGE or XPO SNVs. Thus, accounting for the mutational genotype leads to more accurate predictions of tumor cell population structure. Also, modeling the extra variability in the data with the beta binomial corrects for genotype better than the binomial, as seen with BeBin-PCN over Bin-PCN.

Figure 24. Single-cell validation of SNV-predicted clones in a TNBC tumor
Measured variant allelic prevalence values at 9 SNV target loci and 11 non-somatic control positions are shown. Predicted clusters for each of the three models are shown along the left columns. Non-somatic control positions are color-coded white. Low read coverage loci are shaded grey.
In conclusion, PyClone presents an opportunity to mine clinically meaningful mutational clusters of tumor cells in space and time, including upon treatment or wherever multiple samples can be collected. Other data types, such as indels and structural rearrangements in tumor tissues, can also be incorporated into the model to perform cellular mutation prevalence and clonality analyses.

3.5 Single-cell LOH genotyping validates inferences of tumor clonal structure

3.5.1 Background - TITAN

While making inferences of cellular mutation prevalences and tumor clonal architectures using somatic SNVs and indels has gained traction, the TITAN103 probabilistic model presents a computational statistical approach to interpreting tumor clonal dynamics utilizing larger scale genomic lesions, namely copy number alterations and LOH. Such somatic structural alterations have been well documented in breast tumors11,36. Recently, Landau et al. reported clinically informative fluxes of malignant clonal and subclonal populations defined by copy number in chronic lymphocytic leukemias subjected to chemotherapy111. Input for the TITAN model is matched normal DNA-derived heterozygous germline SNP loci, plus tumor read depth and allele ratios at the SNP loci. The model accounts for mixtures of cell populations and outputs segmental CNA/LOH events, cellular mutation prevalences, and clonal clusters. Tumor cellular prevalence in this framework is the fraction of tumor cells, while the sample cellular prevalence is the fraction of total cells, harboring a CNA or LOH event.
There are several assumptions: analysis of allelic ratio and sequence depth/coverage at one to three million heterozygous SNP loci adequately captures the underlying somatic genotype of the tumor; events span 10s to 1000s of contiguous SNPs; sequence signal is a measure of heterogeneous populations of cells including tumor and normal cells; and events measured at similar cellular prevalence co-occur in the same clonal cluster.

Three cell types or states are defined: normal cells, tumor cells harboring the event, and tumor cells not harboring the event. This approach is strengthened by neighbouring loci with segmental CNA/LOH events spanning several contiguous SNPs. The model differs from alternate models in jointly inferring CNA and LOH, accounting for multiple tumor subpopulations, and providing segmentation analysis.

TITAN was tested with idealized mixtures to simulate clonal complexity and determined to accurately inform population clonal structure in a panel of 23 breast tumor genomes. The definitive measure of performance was then conducted using single-cell genotyping and FISH analyses of an ovarian tumor. These analyses confirmed that predicted CNA and LOH events extracted from WGSS data can identify clonal populations using cellular mutation prevalence profiles. A comprehensive discussion of this model, laboratory methods and statistical analyses was provided in the supplementary material (pages 1 to 44) accompanying the published manuscript103.

3.5.2 Targeted DNA amplicon sequencing of LOH loci

Individual nuclei from a high grade serous ovarian tumor, case DG1136g, were analyzed for predicted copy number deletions by testing for homozygosity from the absence of one allele, rather than gains of allele copy number. Somatic point mutations differentiated tumor from non-tumor nuclei and heterozygous germline positions were used to assess amplification efficiency around loci of interest.
Two experiments were designed, with each querying individual nuclei for heterozygous diploid regions, somatic mutations, a clonal deletion, and two subclonal deletions. Targeted multiplex PCR amplicon libraries were prepared and sequenced per set of experimental events, from 42 nuclei each. Deletions in single nuclei were analyzed at heterozygous germline SNP positions, using multiple loci across a deleted region. This approach enabled distinguishing deletion signal from allelic drop-out. For each of the two experiments, about 10 positions were targeted for deletions and 3 for diploid regions. Furthermore, positions were selected based on overlap with Affymetrix SNP6.0 array loci, being likely to occur in population based studies used for array design. Positions were picked to uniformly span the deletion region, and regions of germline variation were excluded so that PCR primer performance would not be compromised by primers being located in such regions. To identify tumor from non-tumor cells, clonally dominant mutations were picked. For example, TP53 was validated to harbor a clonally dominant homozygous mutation in the bulk tumor-derived data. Mutations located within subclonal deletions for the two experiments were also included to test for biallelic inactivation in tumor cells.

3.5.3 Statistical analysis of single-cell LOH data

Targeted amplicon paired-end FASTQ files from Illumina MiSeq platform runs were aligned to the hg19 human reference genome using BWA121. Allelic prevalence counts were enumerated after removal of low quality reads. To account for sequencing errors and nucleic acid contamination, one-tailed binomial exact tests were applied to variant and reference allele positions (somatic SNVs and SNPs) to make present or absent calls. Error and contamination for each position was determined from the mean variant allelic ratio (variant reads over total number of reads) for bases flanking the base of interest in the no-template-control negative samples.
This approach encompassed noise from both sequencing chemistry errors and nucleic acid contamination. A one-tailed binomial exact test was applied to determine whether the variant allelic ratio of the base was higher than expected. The same determination was made of reference allele ratios for each base position of interest. Statistical significance (Benjamini and Hochberg adjusted p-value < 0.05) was used to call a present status. Positions with fewer than 50 sequence reads were excluded as low coverage and those with low coverage in half or more of the sampled nuclei were removed from further analysis. Nuclei with fewer than 10 (of 48) positions having a read coverage of at least 50 reads were removed from further analysis. Absence of the TP53 variant allele was first used to score normal nuclei, along with absent or low coverage variant allele status for other somatic mutations. Heterozygous SNP position calls were not used as the criteria for normal cell calling to avoid the influence of allelic dropout on making such calls. Next, the remaining nuclei were scored as tumor nuclei if the TP53 variant allele was present and the reference allele absent. In the case of low TP53 allele coverage, another mutation with present variant allele status was used to call a tumor cell. The remaining nuclei were left unclassified due to the uncertainty of calling normal versus tumor. This procedure yielded 14 normal nuclei and 14 tumor nuclei in the first experimental set, and 9 each normal and tumor nuclei in the second experiment.

Calculating allelic drop-out and heterozygous allelic ratios

To address the challenge of preferential amplification of one of two alleles at a heterozygous position, yielding a false homozygous call which may be mislabeled as LOH, 10 positions were picked to assess LOH status in individual nuclei for deletion events predicted from bulk tissue WGSS.
For positions with adequate sequence read coverage, the dropout rate was determined as the proportion of positions with a present status for either reference or variant allele but not both across all positions for all nuclei scored as normal. These empirically determined dropout rates were 28 % for the first experimental set and 48 % for the second.

The presence of nucleic acid contamination may skew the allelic ratio for heterozygous positions away from the theoretical expectation of 0.5. This was corrected by determining the expected allelic ratio to be the median across all heterozygous positions with sufficient coverage, having both reference and variant present status from every normal nucleus. The resulting expected heterozygous allelic ratio was 0.57 and 0.68 for experiments 1 and 2, respectively. Two tests were applied to assess the LOH status of an event using the SNP positions spanning the event. To assess whether an event was a true LOH event and not a result of allelic dropout, a one-tailed binomial test was used with the null hypothesis being that the ratio of homozygous to heterozygous positions is not greater than the dropout rate. The dropout rate was used as the expected ratio or probability of success. The number of homozygous positions, called by present reference or variant allele but not both alleles, is the number of successes. The total number of positions is the number of trials.

The second test is a one sample Wilcoxon signed rank test to assess whether the allelic ratios across the positions spanning an event were significantly different from the expected heterozygous allele ratio, or if the symmetric allele ratio distribution was higher than the heterozygous allele ratio, the symmetric allele ratio being defined as the following: [max(reference reads, variant reads)/depth].
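The first of the two tests, together with the dropout rate and symmetric allele ratio definitions, can be sketched with the standard library alone. This is an illustrative reading of the description above rather than the analysis code used in this work: the function names and the call representation are assumptions, the input calls are assumed to be already filtered for coverage, and the Wilcoxon signed rank test is not reproduced. A Benjamini-Hochberg helper is included because adjusted p-values are used for calling.

```python
from math import comb


def binomial_upper_tail(successes, trials, p):
    """One-tailed exact binomial p-value: P(X >= successes) under rate p."""
    return sum(
        comb(trials, k) * p**k * (1.0 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def dropout_rate(calls):
    """Fraction of (ref_present, var_present) calls with exactly one allele present.

    calls: coverage-filtered heterozygous positions from nuclei scored as normal.
    """
    single = sum(1 for ref, var in calls if ref != var)
    return single / len(calls)


def symmetric_allelic_ratio(ref_reads, var_reads):
    """max(reference reads, variant reads) / depth."""
    return max(ref_reads, var_reads) / (ref_reads + var_reads)


def loh_dropout_p_value(n_homozygous, n_positions, rate):
    """Test whether homozygous calls across an event exceed the dropout rate."""
    return binomial_upper_tail(n_homozygous, n_positions, rate)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical event: 9 of 10 SNP positions homozygous, dropout rate 28 %.
    print(loh_dropout_p_value(9, 10, 0.28))
    print(symmetric_allelic_ratio(30, 70))
```

With the 28 % dropout rate reported for the first experiment, a hypothetical event with 9 of 10 homozygous positions would give a very small p-value, whereas 3 of 10 would not be distinguishable from dropout.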
The tests were applied separately to deletion and diploid heterozygous events, and p-values adjusted using Benjamini and Hochberg correction across all events and nuclei separately. The lack of accounting for allelic dropout in the second test was addressed by combining the tests to take the maximum adjusted p-value representing each event. Under this conservative approach, a statistically significant p-value (< 0.05 for both experiments) indicated an LOH event with a homozygous allelic ratio rather than a result of allelic dropout. An event was called heterozygous if the p-value was not statistically significant, and unknown if the p-value was not statistically significant but the event did not harbor at least one heterozygous position (that is, a present call for both reference and variant bases). Finally, the cellular mutation prevalence for each event was determined from the nuclei which had an LOH or heterozygous call. Analysis results for experiment set 1 are shown in Figure 25 and those for set 2 follow in Figure 26.

Figure 25. Validation of LOH-predicted clonal structure across 28 ovarian tumor nuclei
Targeted amplicon NGS data of copy number events in 28 nuclei (experiment 1) are plotted. The mutant allelic ratio for mutations and symmetric allelic ratio for SNPs are shown, with low coverage positions shaded grey. TP53 mutation status is shown and nuclei were scored as tumor or normal based on somatic mutations. LOH status is shown, with events being scored as either heterozygous (HET), clonal deletion LOH (C-DLOH), or subclonal deletion LOH (SC-DLOH). Tumor nuclei harboring the LOH event are color-coded green while those that do not are blue. Normal nuclei are color-coded white. Unknown events in grey were inconclusive for HET or LOH status.

Figure 26. Validation of LOH-predicted clonal structure across 18 ovarian tumor nuclei
Targeted amplicon NGS data of copy number events in 18 nuclei (experiment 2) are plotted.
The mutant allelic ratio for mutations and symmetric allelic ratio for SNPs are shown, with low coverage positions shaded grey. TP53 mutation status is shown and nuclei were scored as tumor or normal based on somatic mutations. LOH status is shown, with events being scored as either heterozygous (HET), clonal neutral LOH (C-NLOH), or subclonal deletion LOH (SC-DLOH). Tumor nuclei harboring the LOH event are color-coded green while those that do not are blue. Normal nuclei are color-coded white. Unknown events in grey were inconclusive for HET or LOH status.

In summary, the enumeration of single nuclei genotypes empirically validated the TITAN modeling approach by identifying cellular subpopulations defined by copy number and LOH events in genomic sequence space. Validated events include clonal and subclonal LOH features. Increasing the numbers of events, nuclei analyzed, and tumor regions assayed will enable charting clonal phylogenies. Other potentially interesting applications of TITAN may be realized in breast and ovarian tumors where BRCA-driven defects in homologous recombination generate genomic CNA/LOH events, usable as clonal marks in tracking treatment efficacy. However, several key assumptions and limitations inherent to such computational modeling approaches for deciphering tumor clonal architectures, introduced in earlier sections, are addressed further in the discussion chapter of this dissertation.

Chapter 4: Clonal evolution in breast tumor xenografts at single-cell resolution

4.1 Background – Murine xenograft models for personalized oncology

High attrition rates in cancer drug development, drug withdrawals, and suspensions all point to shortcomings in preclinical models112. A thorough understanding of this reality is complicated by clinical trials often not being published113. Xenograft model systems are being pursued as a platform for better predictive testing of therapeutic options114.
These models may be well suited to recapitulate the pathology, cellular heterogeneity and genetic diversity of patient tumors75,115, features often lacking in other laboratory models of carcinogenesis. Early work in this area with translational implications included the study of chemotherapy targeting DNA topoisomerase-I in xenografts of human colon cancers116. However, as with all laboratory models, several factors influence the outcome of tumor transplantation protocols. These variables include the extent and manner of disaggregation of patient tumor tissue, the proportion of cells grafted, co-grafted materials, site of engraftment, and genetic background of the murine host. Approaches to making the host tissue microenvironment more receptive to human patient tumor tissue are also important. Options include the addition of human epithelial or stromal cells in the grafting process. Mice with humanized immune systems can inform immune response to tumors, a feature lacking in commonly used immunosuppressed murine strains.

Though still in early development, approaches combining NGS analysis of patient tumors with matched murine xenograft models117 are providing opportunities to uncover and test actionable targets in clinically relevant time frames118. These murine xenografts provide a functional test bed for predictions from tumor genome analyses. Investigators can discern relevant drug targets and biomarkers, treating the host mice with combinatorial regimens that may subsequently form the basis for rational therapeutic decisions in the clinic. Relapsing xenograft models may predict resistance pathways that could arise in the patient. Such possibilities can be tested with treatments in advance of similar relapse in the patient. Grafted tumors can be cryopreserved for reuse later as needs arise or analytical technologies improve. The model also allows for medical insights when a patient is unable or unwilling to participate in clinical trials.
Murine xenografts of patient breast tumors are thus an attractive model system for use in the era of personalized medicine. The objectives of this study119 included determining the extent to which the resulting grafts are relevant representations of the original patient tumors when viewed through the lens of tumor clonal architectures defined by genomic sequence features. The study further sought to assess, at single nucleotide and single-cell resolution, the changes occurring in tumor clonal dynamics following the establishment of patient tumor xenografts in immunocompromised mice and over the course of serially passaging successive grafts.

4.2 Methods

Patient tumor tissues were collected from surgical and diagnostic biopsy procedures with informed consent and ethics committee approval as documented in the preface of this dissertation. Portions of tissue specimens were processed for routine nucleic acid preparation and histological analysis using a Discovery XT platform (Ventana, Tucson AZ). Tissue material for xenografting was prepared by mechanical disruption using a laboratory paddle blender (Stomacher 80 Biomaster, Seward USA, Davie FL). Xenografts were processed similarly after reaching approximately 1 mL in volume. About 0.2 % of this material was serially transplanted. Both primary and metastatic human breast cancers of several molecular subtypes were processed for xenografting into immunodeficient mice (NOD/SCID/IL2r-gamma-/- and NOD/RAG1-/- IL2r-gamma-/-) in subcutaneous, mammary fat pad, and subrenal locations (Figure 27). Each patient tumor tissue specimen was transplanted in up to eight mice and serial grafts in up to four mice. Successfully established grafts (30 from 55 patients), taking a median time of 7 months to engraftment, were serially passaged for up to three years and 16 mouse generations.

Surgery was performed on 5- to 10-week-old female NSG and NRG mice.
Subrenal capsule transplant assays120 were performed on mice anesthetized with isoflurane and administered buprenorphine analgesic (Reckitt Benckiser Pharmaceuticals, Richmond VA). Tissue fragments or cell suspensions in collagen matrix were implanted surgically, and topical analgesic was administered following incision closure. ER-positive-tumor recipient mice were subcutaneously administered 17 beta-estradiol (Sigma Aldrich, Oakville ON) every two weeks. Subcutaneous transplants were performed similarly but under the skin in the flank area using tumor cell or organoid suspensions in matrigel (BD Biosciences, San Jose CA). Mammary fat pad transplants were performed in the #4 fat pads.

Figure 27. Patient breast tumors xenografting timeline. Transplant history of 15 cases, including the two pursued for single-cell genotyping (SA494, SA501). Patient tumors are annotated as Primary or Metastatic, along with immunohistochemical status for ER, HER2 or TN (Triple-Negative for ER, PR, and HER2). The three sites of transplantation are color-coded. Serial passages (black points) over three years were analyzed as annotated, by whole genome and/or targeted amplicon sequencing. Dashed red lines represent replicate grafts in a subrenal location.

Patient tumors (15), xenografts (17 of 88), and matched normal (15) tissues were subjected to Illumina HiSeq 2000 paired-end whole genome shotgun sequencing to a median depth of 45-fold haploid coverage and followed by targeted amplicon sequencing, following experimental and analytical approaches described earlier for the LBC and TNBC tumor genome projects. A range of 35x to 72x haploid genome sequence coverage was achieved, and reads were aligned to the hg19 human reference genome using BWA121. SNVs were called using MutationSeq122.
Murine cell and sequence contamination was resolved by sequencing the genome of an NSG mouse, aligning its reads against the human genome, and creating a mouse SNV mask from over 2 million unique SNVs identified in the mouse genome. Furthermore, all primer pairs for targeted SNV analysis were checked in silico against possible amplification in the human genome and in vitro by testing on murine genomic DNA. Somatic indels were identified using the Dindel123 and Pindel124 programs designed for short sequence read data. Structural variants with rearrangement breakpoints were predicted using nFuse125 and a derivative program deStruct. CNA and LOH analyses were performed using the TITAN103 package. Affymetrix SNP6.0 array analysis was undertaken to determine genome copy number. Normalization of SNP6.0 data was performed using PennCNV126 and the data was segmented to determine CNA/LOH using OncoSNP127.

WGSS-derived MutationSeq somatic coding SNVs were validated by targeted PCR amplicon resequencing of tumor-xenograft series on an Illumina MiSeq platform. Positions picked included those exclusive to the patient tumor, exclusive to the xenograft, and shared between the two samples. Amplicon data alignment and variant calling routines were as described for the LBC and TNBC projects. Inference of cellular mutation prevalence and mutation clustering was performed using PyClone. TITAN and OncoSNP analysis of the WGSS and SNP6.0 array data sets was used to acquire copy number, LOH, and non-tumor cell content. Joint PyClone runs were performed for temporal evolution analysis and independent runs for replicate series correlation analysis.

4.2.1 Single-cell genotyping of murine xenografts

Nuclei preparation, flow sorting, multiplex PCRs followed by singleplex PCRs, nuclei-specific amplicon barcoding and amplicon library sequencing were all conducted as described in the protocols in chapter three.
SNV amplicon libraries were prepared for the SA494 and SA501 series xenografts. Data was aligned using BWA121, with the minimum base quality and minimum mapping quality both set to 10. Allelic prevalence matrices were generated for each target SNV position using the number of reads of the predicted variant allele over the total number of reads, with a lower bound cutoff of 25 reads. Nuclei samples with more than one-third of the target positions falling below this lower bound were removed from further analysis. The no-template-control samples all had read signal below this lower bound. Heterozygous germline positions were used to estimate allelic dropout of one allele, defined as an allelic prevalence below 1 %. The dropout values ranged from 0.081 to 0.222.

Allelic prevalence matrices were binarized to call presence or absence of SNVs and then used as input for Bayesian phylogenetic inference using the program MrBayes128. Cut-off values for binarization were a ‘present’ call if the allele prevalence was > 0.10, an ‘absent’ call if < 0.01, and an ‘ambiguous’ call if > 0.01 but < 0.10. Normal nuclei were removed from the preliminary tree generated for SA494. Nuclei in the heatmaps were ordered according to the phylogeny branchings and SNV positions were aligned blinded to the PyClone clustering. High probability branchpoints (> 0.85) were used to define clades of more than two nuclei and consensus genotypes for each clade. PyClone-predicted cellular mutation prevalences were compared with single-cell genotyping-derived mutational groupings to ascertain the distinct clonal genotypes from bulk tissue data sets. The proportion of nuclei assigned to a consensus genotype for the PyClone clusters was assessed relative to the total number of nuclei analyzed. A Pearson correlation coefficient was computed for each sample to estimate concordance between the two analytical approaches. Values ranged from 0.682 to 0.996.
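The read-depth filter and binarization cut-offs described above can be illustrated with a minimal Python sketch. It is not the pipeline used in the study: the input layout (a mapping from nucleus name to per-position `(variant_reads, total_reads)` pairs) and all function names are hypothetical, and a prevalence falling exactly on 0.01 or 0.10, which the stated cut-offs leave unspecified, is treated here as ambiguous by assumption.

```python
from typing import Dict, List, Optional, Tuple

MIN_READS = 25        # lower-bound read cutoff stated in the text
PRESENT_ABOVE = 0.10  # allele prevalence > 0.10 gives a 'present' call
ABSENT_BELOW = 0.01   # allele prevalence < 0.01 gives an 'absent' call

# Hypothetical input layout: one (variant_reads, total_reads) pair per target position.
Counts = Tuple[int, int]


def call_position(variant_reads: int, total_reads: int) -> Optional[str]:
    """Return 'present', 'absent' or 'ambiguous'; None marks missing data."""
    if total_reads < MIN_READS:
        return None  # below the read cutoff, treated as missing
    prevalence = variant_reads / total_reads
    if prevalence > PRESENT_ABOVE:
        return "present"
    if prevalence < ABSENT_BELOW:
        return "absent"
    return "ambiguous"  # includes the exact boundary values (assumption)


def keep_nucleus(counts: List[Counts]) -> bool:
    """Keep a nucleus unless more than one-third of its positions are low-depth."""
    low_depth = sum(1 for _, total in counts if total < MIN_READS)
    return 3 * low_depth <= len(counts)


def binarize(matrix: Dict[str, List[Counts]]) -> Dict[str, List[Optional[str]]]:
    """Filter nuclei by depth, then call every remaining position."""
    return {
        nucleus: [call_position(v, t) for v, t in counts]
        for nucleus, counts in matrix.items()
        if keep_nucleus(counts)
    }


# Made-up counts for illustration only.
example = {
    "nucleus_1": [(30, 100), (0, 200), (5, 100), (3, 10)],
    "nucleus_2": [(1, 5), (2, 8), (40, 90), (0, 60)],
}
calls = binarize(example)
```

In this toy input, `nucleus_2` is dropped because half of its positions fall below the read cutoff, while `nucleus_1` is retained with one missing position. The resulting calls would still need to be encoded in whatever input format the phylogenetic inference program expects.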
Major mutation clusters showed high concordance, while minor clones (e.g. SA494 cluster 4, SA501 cluster 10) had lower concordance.

Comprehensive details of all laboratory methods and data analysis routines utilized in this study were provided in the supplementary information (pages 1 to 19) accompanying the published manuscript119.

4.3 Results

4.3.1 Xenoengraftment can yield simple or complex clonal architectures

Two of the xenograft series analyzed in detail at single-cell resolution, SA494 (metastatic ER-positive) and SA501 (primary, triple-negative), first yielded bulk tissue-derived clonal architectures suggesting markedly different clonal dynamics across somatic coding SNV and copy number space (Figure 28). Both included features which were shared between patient tumor and xenograft (on-diagonal) or unique to each tissue (off-diagonal), and shared or evolving over successive xenograft passages.

Figure 28. Mutational spectrum of two xenograft cases using bulk tissue-derived data sets. Xenograft (y-axis) versus patient tumor (x-axis) comparisons across SNVs and copy number aberrations for SA494 (upper panel) and SA501 (lower panel) cases. The density scatter plots on the left show high-confidence SNV allelic prevalences wherein neutrally selected SNVs occur along the diagonal. The plots in the middle show PyClone-inferred cellular mutation prevalences of SNVs from targeted deep sequencing. The SNVs are color-coded by clusters representing similar cellular mutation prevalences. The plots on the right show cellular mutation prevalences of CNA/LOH features inferred by TITAN. The height of a bar (z-axis) represents the number of genes in a cluster.

4.3.2 Patient tumor-derived xenografts undergo changes in clonal composition

Genome-wide somatic coding SNV allelic prevalences (100 to 300 per case) were compared between patient tumors and matched xenografts. Many SNVs (53 - 93 %) were common to the tumor-xenograft pairs, as evident in Figure 28.
However, there were also SNV features found to be rare in one sample but not the other: 0.2 to 19.4 % abundance in the patient tumor but rare in the matched xenograft, and 6.5 to 32.1 % abundance in the xenograft but rare in the matched patient tumor. There were also instances of SNVs being exclusive to one of the two matched samples. For example, the SA494 xenograft harbored 27.7 % of SNV mutations undetected in the original patient tumor, and 19.4 % observed in the patient tumor but not the xenograft. This suggests a possible expansion of rare subclones from the patient tumor.

PyClone clustering was applied to this data set, taking into consideration tumor cellularity, copy number and LOH. Deep sequenced SNV amplicons served as empirically measured clonal markers, transformed as variant allelic and cellular mutation prevalences in patient tumor versus matched xenograft tissues. The output cellular mutation prevalences were used to determine membership of co-varying clonal marks into clusters. As observed with the somatic SNV allelic prevalences, clonal cluster dynamics were observed to reflect several patterns. This included invariant clusters shared between the patient tumor and matched xenograft, clusters expanding or contracting in the murine host, and high prevalence clusters in the xenograft that were rare or absent in the patient tumor (e.g. Figure 28 upper centre panel, SA494 cluster 3). Furthermore, following initial expansion upon xenografting, polyclonal architecture distinct to the xenograft was observed in several cases, including SA494. By contrast several cases, including SA501, retained clonal groups pre- and post-engraftment (on-diagonal clusters).

Using sample copy number and LOH features as clonal marks, modeling was conducted using TITAN, again accounting for tumor cellularity. The model output cellular mutation prevalence rates of 31 to 99 % and inferred 1 to 5 mutational clusters.
Again, both shared and exclusive features were observed. For instance, SA494 had a rapid expansion of a minor subclone while SA501 showed polyclonal engraftment. Some complex structural changes, such as chromothripsis and breakage-fusion-bridge cycles, were conserved with an example of the latter found in SA494. However, there were substantial copy number differences between patient tumors and matched xenografts. These included a p53 deletion in an SA500 xenograft.

4.3.3 Single-cell genotypes reconstruct xenograft clonal structure

Cell population-derived inferences of mutation clusters and clonal genotypes were verified by single-cell genotyping. Two cases analyzed were SA494 (Figure 29), with strong clonal selection upon first engraftment, and SA501 (Figures 30 and 31), with polyclonal engraftment and evolving clonal dynamics upon serial passaging. The analyzed cell populations included some expected normal nuclei. Bayesian phylogenetic inference was applied, using the presence versus absence of SNV alleles and nuclei binned into clades, to reconstruct consensus genotypes. The resultant groupings validated the PyClone clustering results.

Figure 29. Single-cell lineage genotypes in an ER-positive breast tumor and xenograft. Targeted amplicon genotyping across 62 individual SA494 tumor and 58 (passage 4) xenograft nuclei by multiplex PCR of 40 SNV and 7 germline variants to determine variant allele ratios. The four mutation clusters inferred by PyClone using bulk tissue-derived genomic data are shown in the top left inset, for the patient tumor and xenograft passages 2, 3, and 4. A Bayesian phylogenetic tree drawn along the left of the figure groups the tumor and xenograft nuclei into distinct clades. The color-coded heatmap shows the variant allele prevalences as wildtype (blue), heterozygous (yellow), or homozygous (red) across the individual nuclei (vertical axis). White spaces are missing data.
The small topmost panel of the heatmap shows genomic DNA control samples from the tumor and xenograft, while the intermediate size middle panel represents normal cell nuclei present in the tumor sample. The two consensus genotypes shown across the bottom of the figure represent high-probability branch points in the phylogenetic tree, supporting the expansion of a minor clone in the patient tumor that dominated the xenograft, in addition to the presence of a few shared mutations.

For the SA494 series, the two dominant clades represent tumor and xenograft nuclei with a mostly separated set of SNV alleles. Cluster 1, the predicted ancestral clone, comprised 16 SNVs that were shared. Cluster 2 included nuclei belonging to the original patient tumor while cluster 3 was the minor clone that engrafted to dominate the initial graft. These results confirmed the shared ancestry of the patient tumor and xenograft nuclei, and the expansion of the minor (about 5 %) clone in the patient tumor, inferences made by PyClone from cell population-averaged data.

Similarly, clonal architectures of samples in the more complex SA501 series were reconstructed with single-cell genotype data (Figure 30). Single-cell analyses were performed on three (X1, X2, X4) xenograft passages, with observed expansions of minor clones over serial passages, and expansion followed by decline of other clones. Five clades resulted from phylogenetic inference, suggesting a stepwise acquisition of clonal marks from earlier to later clones (Figure 30). Of the five genotypes, the A and B sibling genotypes are defined by clusters 4 and 5 joining the ancestral clusters 1 and 8. Next, the addition of cluster 7 to genotype B yielded genotype C, while the addition of cluster 2 yielded genotype D. Genotype E derived from genotype D by the addition of cluster 3 mutations and loss of cluster 8.
The first two xenografts in the series, X1 and X2, yielded clones from genotypes A, B, C, and D. The latter genotype then expanded from X1. The X4 passage exclusively included genotype E. As with other measured prevalences, the only X4 nucleus seen to harbor the B genotype was in keeping with observations made from cell population-averaged data. Thus the dominant clade genotype E, descended from genotype B, confirmed that the genotype B lineage outcompeted genotype A during serial passaging.

Taken together, the single-cell analyses confirmed PyClone predictions from bulk tissue and yielded phylogenetic relationships between clades of cellular subpopulations from both clonally simple and complex xenograft cases.

Figure 30. Single-cell lineage genotypes in a TNBC tumor and serial xenografts. The insert on the lower left of the figure shows the bulk tissue-derived cellular mutation prevalence clusters, starting with the patient tumor and across five serially passaged xenografts. Ninety nuclei from SA501 patient tumor and xenografts (passages 1, 2, and 4) were analyzed for 45 somatic and 10 germline SNVs by targeted amplicon genotyping. The Bayesian phylogenetic tree depicts cascading clonal evolution across the sample series. The heatmap shows wildtype (blue), heterozygous (yellow), and homozygous (red) variant allele ratios across individual nuclei (vertical axis), ordered by the phylogenetic branching. White spaces are missing data. The variant positions (horizontal axis) are plotted according to the consensus genotypes resulting from high-probability phylogenetic branch points. The single-cell-derived clusters denoted by the horizontal bar across the bottom of the heatmap reconstruct the major patterns seen in the bulk tissue-derived PyClone groupings.

Figure 31.
Clonal genotype evolution in a TNBC tumor and serial xenografts. Phylogenetically inferred consensus genotypes yielded five groupings, denoted A, B, C, D, and E, with sequential expansion of subclones. The coloring scheme is based on the last PyClone mutation cluster acquired. The schematics representing the three xenograft masses are colored according to their constituent genotypes and the proportion of cells experimentally measured for each genotype.

4.3.4 Serially passaged xenografts undergo clonal evolution

Besides analyzing changes in clonal architecture upon engraftment in murine hosts, the process of serially passaging established xenografts was studied for clonal dynamics in 12 tumor series over three years. Generally, cases with strong selection upon initial engraftment, such as SA494, were found to have stable clonal architectures following continued passaging. More substantive changes occurred in cases with modest initial selection. For example, SA501 had modest initial selection but several clonal dynamic changes upon serial passaging, as evident with clusters 2, 3, and 8. The prevalence of some clusters remained unchanged over serial passages. None of these several modes of clonal evolution were defining of primary or metastatic disease, or any breast cancer subtype.

Other factors potentially influencing the clonal dynamics of murine xenograft models that were assessed included comparisons of site of tumor engraftment. Anecdotal observations from modest numbers of comparisons suggested better engraftment rates in mammary fat pad versus renal capsule sites for two cases (SA429, SA496). However, ongoing serial passaging did not show strong clonal selection.
4.3.5 Replicate xenografts recapitulate genomically defined clonal structures

Functional significance of the genetic marks or unknown co-segregating features, used to define cluster membership, may be inferred if the same clusters persist independently in replicate grafts from the same patient tumor. This was indeed observed, with 4 of 5 xenograft series showing similar clonal dynamics of clusters in multiple replicate transplantations. For example, in the SA501 series this was seen in both replicate mice at xenograft passage X3, and in all 4 replicate mice at xenograft passage X4. Furthermore, an expansion of mutations in cluster 3 occurred with a contraction of cluster 5, albeit at different times across xenograft passages 2, 3, and 4. This observation also applied to expansions amongst minor subclones, strongly suggesting functionally relevant shared mechanisms rather than random events. Other observations included clonal architectural differences when comparing different strains of murine hosts, namely NSG and NRG mice. However, both strains showed similar clonal dynamics within serial passages.

In conclusion, this study illuminated tumor clonal architecture and its evolution in a model of human breast cancers that is likely to become an enabling platform for predictive testing in personalized cancer genomics. The process of tumor engraftment was seen to result in modest to sweeping selection amongst clones present in the original patient tumor. The process of serial passaging yielded xenografts with ongoing changes in clonal architecture, especially with tumors exhibiting initial polyclonal engraftment. Inferences of tumor clonal architecture from bulk tissue-derived data sets were recapitulated in single-cell genotyping experiments and cellular phylogenetic relationships were reconstructed.
The same clonal clusters, as defined by genomic sequence features, arose independently in separate xenografts from individual patient tumors, suggesting functional significance of these or co-occurring genomic features. On the subject of clonal origins, this study showed that pre-existing subpopulations of cells can seed expanding clonal clusters, as opposed to newly arising somatic mutations. The latter, however, become an interesting possibility with therapeutic intervention if genotoxic therapies trigger increased mutational diversity. The practice of analyzing clonal marks at single-cell resolution can yield insights into tumor xenograft properties at a resolution not achieved using conventional histopathology, wherein observations of patient tumor-matched features are purported to indicate faithful recapitulation of the original tumor75. Thus tumor clonal heterogeneity, as defined by genomic clonal marks, is a tractable metric for ascertaining growth and fitness properties of subpopulations of tumor cells. Such utility of tracking tumor cell heterogeneity to assess functional significance has been demonstrated, for example, in acute lymphoblastic leukemias129. However, one cannot rule out the influence of stochastic factors and limiting dilution conditions of (rare) constituents such as tumor-initiating cells. Looking ahead, these approaches to resolving tumor clonal dynamics using an in vivo model system will be of utility in predictive testing of patient-matched chemotherapeutic regimens.

Chapter 5: Discussion

5.1 Conclusions and significance

The whole genome sequencing and analysis of a lobular breast cancer metastasis contrasted with its matched primary tumor from nine years earlier marked a milestone in the study of individual tumor genomes. This study catalogued changes in mutational heterogeneity at single base resolution in a breast cancer at two seminal stages in its clinical history.
The genomic sequence variants provide telltale signatures of the natural history of this malignancy as it evolved from primary to metastatic disease. A comparison of matched genome and transcriptome data sets in the metastasis revealed processes of RNA editing, indiscernible using either method alone. This analytical approach finds application in programs such as DriverNet98, for sifting through often large inventories of observed sequence mutations in tumors to highlight those of potential functional importance at the level of transcriptional networks. The years following this study have seen substantial growth in the application of affordable NGS technologies integrated with statistical computation to make routine the interrogation of individual tumor genomes and transcriptomes at single nucleotide resolution. These NGS technology and informatics approaches have contributed to laying the foundational framework for bringing personalized genomics into the practice of clinical oncology.

Next, from one cancer to many, the richness of genomic heterogeneity across tumor landscapes was uncovered through whole genome analyses of a large cohort of triple-negative breast cancers. A subtype defined by what it is not, the diverse mutational spectrum catalogued across these primary TNBCs points to shortcomings and challenges in defining a cancer subtype. Estimates of cellular mutation prevalence demonstrated that lesions in well known driver genes may not necessarily be early events in tumor development, highlighting the importance of clinical timing in therapeutic targeting of actionable mutations. This study underscored the need for reevaluating a mutationally heterogeneous subtype of breast cancer and developing a revised classification scheme with greater clinical relevance.

A cellular level approach to more meaningful classifications of heterogeneous cancers involves classifying tumors as viewed through the lens of population clonal architectures.
This was achieved at exquisite resolution by developing and applying methods in single-cell mutation and lineage analysis to reconstruct tumor clonal structures in both spatial and temporal dimensions. Experimental results validated the performance of computational approaches to infer dominant clonal structures using SNVs and LOH events, measured using conventional bulk tumor whole genome sequencing, as clonal marks. These observations and experimental resources have paved the way for routinely charting tumor clonal architecture as a useful representation of functional changes over the course of a cancer’s natural history.

Finally, developing more clinically meaningful laboratory models of carcinogenesis and therapeutic drug testing was achieved using murine xenograft models in conjunction with NGS and single-cell approaches to monitor clonal evolutionary dynamics. The final study in this dissertation highlighted several possible outcomes in the clonal composition of murine xenografts upon initial grafting, replicate grafting, and serial passaging. The resolution achieved using the described approaches surpasses the static image afforded by conventional histological comparisons of patient tumor and derivative xenografts. These findings and resources have set the stage for assessing the concordance of patient-matched xenografts as predictive models of drug testing with clinically administered therapeutic outcomes.

Other than the several studies cited earlier in this dissertation regarding the significance of genome scale and targeted deep NGS analysis of individual tumors for deciphering clonal structure and evolution, many additional recent studies on tumors from several other types of human cancers reinforce the growing importance of this approach in charting the evolution of malignancy and its potential clinical applications. Ding et al130 applied whole genome sequencing to track clonal evolution in eight relapsed acute myeloid leukemias.
They observed two clonal evolutionary patterns during AML relapse, with one pattern involving a founding clone in the primary tumor gaining mutations to evolve into the relapsed clone and the second pattern involving a subclone of the founder clone surviving initial therapy to gain additional mutations and expand at relapse. Importantly, they also documented DNA damage caused by cytotoxic chemotherapy which partly influenced clonal evolution in relapse. Schuh et al131 monitored progression in chronic lymphocytic leukemia using whole genome and targeted deep amplicon sequencing of tumor cells from three CLL patients going through repeated cycles of therapy and uncovered heterogeneous clonal evolution patterns. Each patient tumor yielded different clonal evolution trajectories with subpopulations declining or expanding over time and exhibiting rapid or gradual shifts in clonal composition. Keats et al132 studied clonality in 28 patients with multiple myeloma and observed that cytogenetically defined high-risk patients harbored genomes exhibiting more clonality changes over time compared to standard-risk patients. They validated their findings in a genetically engineered mouse model of myeloma to reproduce a dynamic pattern of clonal selection and tumor evolution. Ruiz et al133 explored 40 pancreatic and prostate adenocarcinomas using DNA content-based flow sorting coupled with array CGH and targeted resequencing. They identified genomic aberrations specific to either therapeutically resistant or responsive clones and observed divergent clonal populations within individual biopsies. Gerstung et al134 applied a comparative targeted deep sequencing approach on renal cell carcinomas for quantifying subclonal SNVs in mixed populations and detected variants with prevalences as low as 1/10,000 alleles.
Wu et al13 analyzed pediatric medulloblastomas and found clonal selection to drive genetic divergence in metastatic disease, with the latter showing high similarity in both mouse and human studies. Furthermore, only rare cells in the primary tumor appeared to harbor metastatic ability, a key insight for making effective therapeutic choices in this bicompartmental model of malignancy.

5.2 Limitations and challenges

Unlike disciplines such as microbiology or epidemiology, where a set of postulates or criteria can be applied to assess cause and effect to directly relate variables such as infectious agents to a disease state, determining cause and effect using high dimensional cancer genomics data sets is substantially more elusive. Barring prevalent sequence variants readily found in data sets derived from a cohort of tumors from a rare cancer or genetic condition, the deluge of (epi)genomic sequence and structural variants emerging from individual tumor genome studies does not readily lend itself to interpretation. Both functional testing using wet lab approaches and inferences using innovative computational algorithms are standard approaches to distill meaning across tumor genomes. In this regard, selection during tumor clonal evolution can be suggestive of fitness for persistent clonal genotypes.

A transformative development in measuring properties of individual tumors occurred with the transition from Sanger sequencing-based analog data sets to the digital output of next-generation sequencing chemistries. However, though NGS provides a quantitative measure of features such as allelic prevalence based on proportions of read counts at loci of interest, data sets can be skewed by artifacts such as the generation of PCR-amplified duplicates during library construction, or low quality sequences.
The basis for duplicates often resides in a lack of template diversity from sparse or low quality samples, making quantitative single-cell and FFPE tissue analyses all the more arduous. Such technical artifacts can inflate inferred clone size and consequently suggest differences where none exist. Though not yet routinely implemented, such biased measurements can be alleviated by incorporating single molecule tagging approaches135 to avoid making false inferences of clonal structure or mistaking sequence errors for rare variants.

Inferences of clonality and clonal lineage in tumor tissues, though informative with respect to the evolution of functionally relevant clinical events such as metastatic spread or relapse, suffer from practical limitations akin to phylogenetic lineages of species inferred over geological time scales. In the above regard, statistical modeling approaches such as PyClone and TITAN operate with assumptions of linearity of events. The models assume that sequence features are persistent and not lost, or do not revert, or mutate only once, through their lineage history. Additional limitations can include sensitivity to inaccurate prior input data, such as copy number values in PyClone. Inaccuracy in such metrics can limit the model’s ability to correctly cluster mutations and generate meaningful inferences of cellular mutation prevalence and clonality due to a large space of equally likely explanations of the observed data. PyClone is also limited in assuming identical genotypes within each cellular subpopulation, although sites of mutations can harbor copy number alterations before and/or after a mutation occurs. A more sophisticated modeling approach would incorporate sub-clonal copy number states as prior inputs.
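The sensitivity to copy number input noted above can be illustrated with a simple point estimate. The sketch below is not PyClone's Bayesian model; it uses the commonly applied relationship in which the expected variant allele fraction equals (tumor content × mutated copies × cellular prevalence) divided by the average copy number across tumor and normal cells, solved for cellular prevalence. The function and parameter names are hypothetical, and the calculation assumes that all tumor cells share one total copy number at the locus, which is exactly the kind of simplification discussed in the text.

```python
def cellular_prevalence(
    vaf: float,
    tumor_content: float,
    tumor_cn: int,
    mutated_copies: int = 1,
    normal_cn: int = 2,
) -> float:
    """Point estimate of the fraction of tumor cells carrying a mutation.

    Assumes a single total copy number (tumor_cn) in every tumor cell and
    a fixed number of mutated copies per mutated cell (illustrative only).
    """
    if not 0.0 < tumor_content <= 1.0:
        raise ValueError("tumor_content must be in (0, 1]")
    if not 1 <= mutated_copies <= tumor_cn:
        raise ValueError("mutated_copies must be between 1 and tumor_cn")
    # Average number of allele copies per cell across the mixed sample.
    mean_copies = tumor_content * tumor_cn + (1.0 - tumor_content) * normal_cn
    estimate = vaf * mean_copies / (tumor_content * mutated_copies)
    return min(estimate, 1.0)  # a prevalence cannot exceed 100 % of tumor cells


# Same observed allele fraction, two different copy number inputs (made-up values).
diploid = cellular_prevalence(vaf=0.2, tumor_content=0.8, tumor_cn=2)    # about 0.5
amplified = cellular_prevalence(vaf=0.2, tumor_content=0.8, tumor_cn=4)  # about 0.9
```

With these illustrative numbers, the same allele fraction is read as a mutation in roughly half of the tumor cells or in nearly all of them, depending only on the copy number supplied, which is why inaccurate copy number priors can shift mutations between inferred clusters.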
Further, since PyClone clusters mutations occurring at similar cellular mutation prevalences instead of clustering cells by mutational composition, multiple subclones which happen to have similar cellular mutation prevalences may be incorrectly clustered. Distinguishing clones with variable sampling efficiency or clonal mark amplification efficiency can also confound analyses in PyClone and TITAN. Resolving minor clones can be a challenge when tumors harbor abundant admixtures of non-tumor cells and due to the small numbers of cells currently sampled from single biopsies. However, most of these challenges with modeling bulk tissue-derived data sets can be circumvented with accurate and scalable single-cell analytical approaches.

5.2.1 Analysis of single-cell genomes

Pending the routine implementation of true single molecule, also known as third generation, DNA sequencing methods for whole genomes136, the challenges of nucleic acid analyses from single cells revolve around efficiently scaling sample handling, controlling cellular and biomolecular contamination, addressing sequence error and skew in amplification chemistries, minimizing allelic drop-out or data loss, and developing analytical frameworks tuned to the distinctive nature of single-cell data sets.

In the matter of scaling single-cell assays, microfluidic or droplet chemistry devices that routinely process hundreds to thousands of cells in parallel with reduced reagent costs will propel the field forward, especially in studies where numerically rare cells are sought without prior immunophenotypic fractionation. A continued reduction in the cost of DNA sequencing, combined with smart barcoding strategies, will also enable affordable larger scale studies. However, with increasing scale, contamination, both cellular and biomolecular, becomes a routine concern in single-cell experiments.
Since the elaborate engineering design and resource-intensive protocols common to forensic laboratories are not routinely implemented in academic settings, alternate strategies are imperative. For example, the inclusion of dUTP in dNTP mixtures, coupled with the use of heat-labile Uracil-DNA-Glycosylase, reduces the likelihood of PCR carryover contamination during the critical first step of single-cell multiplex PCR amplifications. Alternately, periodically cycling through several non-overlapping sets of molecular barcodes used in DNA library construction can help distinguish current experimental data outputs from contaminating past libraries.

With regard to the front-end amplification of single-cell genomes and transcriptomes for widening subsequent analytical choices, the use of exponential amplification chemistries can suffer from non-uniformity of coverage and increased opportunity for sequence error accumulation. Modeling and statistical correction are challenged by a lack of universally adopted controls for variability. Besides the use of WGA-derived approaches, other ways to circumvent the destructive sampling inherent to single-cell experimentation include a cell recycling approach137 as used in preimplantation genetic diagnosis protocols. A single cell fixed to a solid substrate is first subjected to a PCR assay and then a secondary assay such as FISH is performed. The latter technology is seeing an increasing number of data points and information content retrieved per cell with newer spatially resolved in situ tagging protocols138.

The many experimental challenges notwithstanding, distinguishing biological change from technical noise in single-cell data can be a subtle and difficult exercise. A biomolecular knowledgebase founded almost entirely on cell population-averaged measurements, coupled with stochastic subcellular processes, adds to this challenge.
This problem may be addressed in part by an approach, albeit a circular one, of sampling large numbers of single cells from a population to inform modeling approaches that impute missing data. Biological factors that can confound interpretation of single-cell genomic data sets include the influence of cell cycle stages139 and the presence of extra-chromosomal elements140. Though analytical frameworks for single-cell transcriptional data sets have steadily matured, noise and interpretive models for other genomic features are works in progress, especially for integrated analysis of multiple omics features.

Allelic drop-out and noisy data

The failure to amplify and detect one or both alleles at a heterozygous locus of interest, termed allelic drop-out (ADO), is a common problem in single-cell assays and dates back to the early days of implementing such assays in pre-implantation genetic diagnosis protocols55. Such false negative results, or false positives from errors accumulated over a large number of nucleic acid amplification steps, remain a critical challenge. Improving genomic DNA accessibility during cell lysis is a common first step tested for reducing ADO. Options include alkaline lysis and proteinase K digestion141, freeze-thaw, boiling, and heat denaturation steps during thermal cycling142. However, these options may prove fruitless if processes such as DNA strand breaks or damage by endogenous nucleases have already occurred. As noted earlier, the problem of missing data can be addressed indirectly by combining information from several cells in a clonal population to add statistical strength and impute ADO. An erroneous call of a heterozygous position as homozygous requires the same allele to be missed each time in a population of cells with known ground truth state, which can be inferred from deep sequencing amplicons at loci using genomic DNA prepared from bulk tissue.
The frequency at which a biallelic locus may be miscalled as monoallelic when sampling n cells in a population of haploid cells will be $2(0.5)^{n}$. The frequency at which a heterozygous locus may be misdiagnosed as homozygous in diploid cells from a cell population homogeneous at that locus will be $2(\mathrm{ADO} \times 0.5)^{n}$, where ADO denotes the allelic drop-out rate. At an arbitrarily assigned ADO rate of 50 % and measurements collected individually from 10 cells, the miscall frequency is $2(0.25)^{10} \approx 1.9 \times 10^{-6}$, so the presence of both alleles in such a population would be found more than 99.999 % of the time. This sensitivity of detection suffices for genotyping applications, when the aforementioned population level allele structure is known, inferred, or assumed.

While the aforementioned experimental challenge applies primarily to genomic loci, single-cell transcriptomic studies are confounded by changes in transcript levels that arise from front-end handling of tissue specimens with the disruption of cells from their physiological context. Furthermore, the reverse transcription step for first-strand cDNA synthesis is critical to maintaining representation and full-length coverage of transcripts. Antisense RNA-based linear amplification chemistries address some of the biases introduced by exponential cDNA amplification protocols. However, despite improving NGS read capacities, capturing the majority of each transcriptome cell-by-cell is still a work in progress. Several hundred thousand transcript molecules may be present per mammalian cell. Thousands of different transcript species are estimated to be transcribed in a typical cell, with the majority of transcripts expressed at about tens of molecules per cell. The low abundance mRNA species may include functionally relevant transcription factors and/or leaky transcripts of no known functional relevance. Current systematic estimates of whole transcriptome coverage in single cells are modest, ranging up to 40 %143.
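As a numerical check on the allelic drop-out estimate given earlier in this section, the sketch below evaluates the stated expression for the frequency of miscalling a heterozygous locus as homozygous across n independently assayed cells. The function name and argument names are hypothetical. Setting the ADO rate to 1 reproduces the haploid expression, which is an observation about the two formulas rather than a claim made in the text.

```python
def miscall_frequency(n_cells, ado_rate=1.0):
    """Frequency of missing the same allele in every one of n_cells cells.

    Implements 2 * (ado_rate * 0.5) ** n_cells as given in the text.
    With ado_rate = 1.0 this reduces to the haploid expression 2 * 0.5 ** n.
    """
    if n_cells < 1:
        raise ValueError("n_cells must be at least 1")
    if not 0 <= ado_rate <= 1:
        raise ValueError("ado_rate must be between 0 and 1")
    return 2 * (ado_rate * 0.5) ** n_cells


if __name__ == "__main__":
    # Example from the text: ADO rate of 50 %, 10 cells measured individually.
    miscall = miscall_frequency(n_cells=10, ado_rate=0.5)
    print(f"miscall frequency: {miscall:.3e}")
    print(f"both alleles detected: {100 * (1 - miscall):.4f} %")
```

Under these inputs the miscall frequency is on the order of 2 in a million, consistent with the detection rate quoted above. The calculation treats cells as independent and the two alleles as equally likely to drop out, which real amplification chemistries may violate.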
Efforts to establish standards and controls for meaningful comparisons of single-cell transcriptomic data address challenges with technical noise and differences in detection sensitivity between studies144. However, artifacts from inherent biological noise will continue to confound experimentation and interpretation, including variable mRNA decay rates and transcriptional noise.

Given the destructive sampling inherent to analyzing single cells, orthogonal validation is a challenge. The limited choices, using additional matched tissue, include FISH probes for both RNA and DNA assays to assess copy number. High error rates in noisy amplified sequence data may be addressed using single molecule barcoding approaches to tease apart randomly accumulating sequence errors from rare biological variants145,146. This is especially relevant when chasing down rare variants. Even whilst surmounting these many challenges of single-cell genomic and transcriptomic studies, the functional significance of variants and their abundance with respect to downstream proteomic function may still remain unclear. This uncertainty is likely to persist until there is improved sensitivity in technologies involving mass spectrometry and antibody derivatives, coupled with integrated omics analytical approaches.

5.2.2 Implementation of murine xenograft models

Murine xenograft models are physiologically more complex than other laboratory models of cancer, and their biology can be correspondingly complicated. Marusyk et al147 recently reported on non-cell autonomous processes driving tumor growth to generate clonal diversity. In their model system, a subpopulation of cells was observed to drive tumor growth via permissive microenvironmental changes but it was then outcompeted by other subclones. Such clonal interactions contribute to overall phenotypic properties of biological and clinical significance and add to cell-autonomous fitness properties.
This study points to the shortsightedness of ascribing tumor behavior to only dominant clones.

Limitations with murine xenograft models include both scientific and practical challenges. A broader clinical application is challenged by the need for fresh tumor tissue, engraftment failure, tissue heterogeneity, clonal architecture instability, and a requirement for substantial laboratory resources. Success of engraftment can vary from tumor to tumor. Natural differences exist between human and murine models of tumorigenesis, including growth kinetics, immune system involvement, and manner of metastatic spread. Stromal composition differences exist, which may be addressed in part by co-engraftment of relevant non-tumor cells and factors. Different genetic backgrounds of mouse strains may yield differing results, confounding interpretation. Admixtures of murine host cells with human grafts can yield molecular sequence data sets that require precise separation in silico. Given the highly immunosuppressed state of murine host systems, immune responses cannot be studied and immune modulatory agents cannot be tested. Drug metabolism and pharmacokinetics in murine models need to be extrapolated to human patients. As detailed in this dissertation, both patient tumors and xenografts may evolve over time and space, presenting moving therapeutic targets. In practice, the establishment, testing, and maintenance of patient xenograft models is resource-intensive, requires highly qualified personnel and regulatory compliance, and involves costs which are currently outside the scope of health insurance services. Patents for experimental drugs, acquiring such drugs in sufficient quantities for testing, and material transfer agreements can also hinder adoption of this model system for predictive testing of novel therapeutic combinations.
5.3 Applications and future direction

5.3.1 Tumor clonality and single-cell assays in the clinic

Clonal complexity-informed combination therapy administered early in the evolution of malignancy is a conceptually appealing strategy in the treatment of heterogeneous breast tumors. Over three decades have passed since the demonstration of rationally designed single-target therapy in estrogen receptor-positive breast cancers7 and a decade since the use of monoclonal antibody therapy against HER2 overexpressing breast cancers148. However, a roughly billion dollar price tag149, a decade or longer development timeline, patient cohorts too small to sufficiently power trials, and expansive regulatory frameworks are incompatible with the implementation of combinatorial treatment approaches targeting novel personalized actionable targets. Furthermore, the meaningful integration of omics data sets, first with each other150 and then with established clinical metrics151 to advance personalized cancer medicine, has yet to be realized. How the “N of 1” challenge unfolds and is implemented in clinical practice is a core issue in modern oncology, bringing to a head the integration of biomedical, pharmaceutical, regulatory, health economic, and ethical dimensions.

Nevertheless, the type of application possible with tumor clonality and single-cell analyses in the clinic is exemplified in a few recent studies. Melchor et al deciphered the composition of parental clones and phylogenetic patterns in multiple myeloma and derivative xenografts152, while Hughes et al validated clonal architectures in secondary acute myeloid leukemia and uncovered additional complexity at single-cell resolution which was obscured in bulk tissue analysis153. Charting clonal architectures and cellular phylogenies in matched normal and diseased states by single-cell measurements is beginning to provide investigators an important dimension to the inner workings of malignant processes.
Other clinical developments include minimally invasive monitoring of disease using CTCs, addressed later in this chapter. There is also interest in expanded preimplantation genetic diagnostics assays154 plus the analysis of circulating fetal cells and DNA in the maternal bloodstream155.

5.3.2 Adequate spatial and temporal sampling of patient tumors

The sparse clinical sampling of malignancies challenges the acquisition of meaningful data sets for disease detection, prognosis and monitoring treatment efficacy. By the time a tumor mass is detected and analyzed, many cell generations have elapsed such that critical early events can only be inferred. With regard to inadequate sampling, a case in point is the lobular breast cancer metastases presented earlier in this dissertation. First, only tap three, one of four metastatic pleural taps, was analyzed in detail. Second, though approximately $2 \times 10^{9}$ cells were collected from this pleural tap, less than one percent of these cells were processed for preparing genomic DNA for analysis. Third, only a fraction of the prepared genomic DNA was subjected to library preparation for whole genome sequence analysis. Fourth, even at a relatively high depth of coverage for WGSS at the time the study was conducted (43x haploid), less than a percent of the genomes present in the original population of cells were represented in the aligned reads at any locus of interest. Other confounding factors at play included the loss of cells that may not withstand steps in tumor processing. Also, in the case of such processed liquid biopsies, there is loss of spatial context such as oligocellular clusters which exist in pleural samples74. Bearing in mind a central concept in cellular models of cancer initiation or progression, that it can take only one rogue cell to precipitate malignancy, or initiate metastases, or develop therapeutic resistance, the sparsity in sampling is a sobering reminder of how inadequately tumors are analyzed.
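The sampling losses enumerated above for the pleural tap can be put in rough numerical terms. The sketch below computes an upper bound on the fraction of genome copies in the collected cells that could be represented by aligned reads at a single locus, using the cell count and coverage quoted in the text. The function name, the diploid default, and the assumption that every read at a locus derives from a distinct genome copy are illustrative simplifications, not measurements from the study.

```python
def locus_sampling_fraction(cells_collected, haploid_coverage, copies_per_cell=2):
    """Upper bound on the fraction of genome copies seen at one locus.

    Assumes each aligned read at the locus comes from a different genome copy
    (so duplicates only lower the true fraction) and that each cell carries
    `copies_per_cell` copies of the locus (2 is an assumed diploid default).
    """
    if cells_collected <= 0 or copies_per_cell <= 0:
        raise ValueError("cell and copy counts must be positive")
    genome_copies = cells_collected * copies_per_cell
    # Reads covering a locus are approximated by the mean depth of coverage.
    return min(1.0, haploid_coverage / genome_copies)


if __name__ == "__main__":
    # Values quoted in the text: about 2e9 cells collected, 43x coverage.
    fraction = locus_sampling_fraction(cells_collected=2e9, haploid_coverage=43)
    print(f"fraction of genome copies observed at a locus: {fraction:.2e}")
```

Even as an upper bound, the result is on the order of one genome copy in a hundred million, which is far below the one percent ceiling stated above and underlines how sparse the sampling is.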
A few approaches may help ameliorate some of the challenges with sampling patient tumors. Amongst current surgically guided options, fine-needle aspiration is likely to continue to serve as the most suitable option for repeated, minimally invasive biopsies of metastatic disease156. However, any surgical sampling of metastases to internal organs is difficult in several respects. Thus non-invasive imaging approaches for assessing tumor metastases warrant continued development157. A minimally invasive and possibly very informative sampling method that is currently gaining traction involves capturing circulating tumor cells and cell-free tumor DNA via peripheral blood draws.

Circulating tumor cells and cell-free nucleic acids

Adequately monitoring changes in tumor load, mutational content, or clonal composition may be realized in near real-time using cells or DNA shed from a tumor mass into the patient’s bloodstream. First described over a century ago by Ashworth158, CTCs are generally present at very low numerical abundance against a backdrop of billions of blood cells. Circulating DNA in the plasma of cancer patients was described several decades ago159 and is seeing renewed interest for monitoring treatment and residual disease in the clinic160. The last decade has seen practical advances in isolating and enumerating CTCs in the peripheral circulation of breast cancer patients, especially using immunomagnetic approaches161,162. In breast cancers, elevated CTC levels have been associated with worse prognosis in both metastatic and non-metastatic settings163,164. Amplifying the genome of isolated CTCs for comprehensive analysis is also under investigation165. Newer label-free approaches under development for the analysis of such rare tumor cells include the use of acoustic wave technology in cell separation166. However, heterogeneous immunophenotyping results have been observed in CTCs from individual patients.
If reflective of the biology of the underlying tumor, this could provide useful insights about tumor clonal dynamics. In comparing subtypes of lung cancer, Ni et al discovered that CTCs from different subtypes showed differences in copy number variants64. High dimensional mRNA profiling conducted on CTCs in breast cancers has also uncovered heterogeneity in transcriptional signatures60. This promising sampling approach is not without its challenges, in both sensitivity and specificity. For example, utilizing EpCAM expression as the basis for epithelial tumor cell immunomagnetic capture will fail to assay CTCs that do not express this marker167. Furthermore, tumors that are clinically dormant for years can continue to turn over and shed CTCs168. Half-lives of CTCs and cell-free DNAs and their clearance in the circulatory system also require continued investigation to advance their application in monitoring malignancy and emergence of treatment resistance.

5.3.3 Final words

Though the focus of this dissertation, and of cancer genomic studies in general, is on aberrations in diseased tissues, a large scale effort to illuminate variation in normal tissue physiology could prove informative. Such an endeavor would help develop working definitions of intermediate single-cell states already recognized at the level of malignant tissue masses, such as a precancerous state. Current models in single-cell analysis simply binarize each cell as tumor or non-tumor based on the presence or absence of a somatic mutation. The malignant phenotype can certainly be a more complicated affair, with the influence of (epi)genetic and non-genetic determinants in precise spatiotemporal contexts. Other experimental approaches to dissecting tumor heterogeneity one cell at a time are also in development. However, single-cell metabolome169 and epigenomic170,171 studies are still few and far between.
Early single-cell epigenomic studies include the analysis of methylation in murine embryos and stem cells172. Guo et al reported on the demethylation dynamics of parental genomes following fertilization in mice and observed that genic regions demethylated faster than intergenic regions in pronuclei. Single-cell proteomic protocols can currently analyze several dozen proteins in parallel173. Another approach to gleaning protein states at single-cell resolution is through the use of multiplexed proximity ligation assays, to simultaneously infer multiple protein complexes in situ174. A new dimension to single-cell genome studies is in chromosome conformation capture analysis, relating the three dimensional state of cellular genomes to their activity patterns, as demonstrated by Nagano et al in individual mouse splenic T cells175. Besides a few reports176,177, the tantalizing possibility of routinely integrating omics measurements from single cells is a technological milestone yet to be reached, requiring the efficient separation of both mRNA and genomic DNA fractions178. Other analytical approaches breaking traditional limits of resolution to dissect the inner workings of single cells include single-molecule or super-resolved fluorescence microscopy179, enabling nanometre scale sub-organellar observations without the harsh treatments of traditional methods involving electron microscopy.

Prime candidates for clonality analysis at the resolution of single cells are tumors of the mutationally diverse triple-negative subtype of breast cancer. To this end, a project has commenced with the objective of analyzing a large panel of primary TNBCs at a scale of hundreds to thousands of cells per tumor for somatic coding SNVs, LOH, and other aberrant genomic features previously enumerated from bulk tumor-derived whole genome data sets presented in this dissertation.
Tumors spanning a wide range of predicted clonal complexity have been selected for analysis. A revised molecular pathology-based classification that incorporates intratumoral clonal structure can improve our understanding of how different subclones contribute to clinically relevant phenotypes. Young women with this particularly aggressive malignancy can look forward to actionable diagnostic and therapeutic approaches that offer an improvement over the current standard of systemic chemotherapy.

In conclusion, introductory biology texts define a cell to be the unit of structure and function of living organisms, a cell’s genome to be the transmissible blueprint of life, and evolutionary selective processes to be determinants of species survival or extinction. These core theories in biology remarkably converge to bear on single-cell cancer genomics. Looking ahead, quantitative single-cell biology will steadily bring into focus the blur of cell population-averaged biomolecular data sets and guide scientists charting exquisite cell lineages in malignant tissues and complex organisms. However, in a context much broader than that discussed here, one fundamental reality of causation in malignancy is not captured in the proximate molecular and mechanistic insights afforded by these latest multidimensional technologies mated with computational prowess. This overriding reality resides in understanding human cancers in the context of evolutionary medicine180 and the many nature-nurture mismatches afflicting a physiologically misplaced species of hunter-gatherers.

References

1. Gusterson, B. A. & Stein, T. Human breast development. Semin. Cell Dev. Biol. 23, 567–573 (2012).
2. Swanton, C. Intratumor Heterogeneity: Evolution through Space and Time. Cancer Res. 72, 4875–4882 (2012).
3. Lindström, L. S. et al.
Clinically used breast cancer markers such as estrogen receptor, progesterone receptor, and human epidermal growth factor receptor 2 are unstable throughout tumor progression. J. Clin. Oncol. 30, 2601–8 (2012).
4. Spataro, V. et al. Sequential estrogen receptor determinations from primary breast cancer and at relapse: prognostic and therapeutic relevance. The International Breast Cancer Study Group (formerly Ludwig Group). Ann. Oncol. 3, 733–740 (1992).
5. Hansford, S. & Huntsman, D. G. Boveri at 100: Theodor Boveri and Genetic Predisposition To Cancer. J. Pathol. (2014). doi:10.1002/path.4414
6. Fisher, B., Slack, N. H. & Bross, I. D. Cancer of the breast: size of neoplasm and prognosis. Cancer 24, 1071–1080 (1969).
7. Fisher, B., Wickerham, D. L., Brown, A. & Redmond, C. K. Breast cancer estrogen and progesterone receptor values: their distribution, degree of concordance, and relation to number of positive axillary nodes. J. Clin. Oncol. 1, 349–358 (1983).
8. Nowell, P. C. The Clonal Evolution of Tumor Cell Populations. Science 194, 23–28 (1976).
9. Wolman, S. R. Cytogenetic heterogeneity: its role in tumor evolution. Cancer Genet. Cytogenet. 19, 129–140 (1986).
10. Navin, N. et al. Inferring tumor progression from genomic heterogeneity. Genome Res. 20, 68–80 (2010).
11. Navin, N. et al. Tumour evolution inferred by single-cell sequencing. Nature 472, 90–94 (2011).
12. Aparicio, S. & Caldas, C. The Implications of Clonal Genome Evolution for Cancer Medicine. N. Engl. J. Med. 368, 842–851 (2013).
13. Wu, X. et al. Clonal selection drives genetic divergence of metastatic medulloblastoma. Nature 482, 529–533 (2012).
14. Lapidot, T. et al. A cell initiating human acute myeloid leukaemia after transplantation into SCID mice. Nature 367, 645–648 (1994).
15. Al-Hajj, M., Wicha, M. S., Benito-Hernandez, A., Morrison, S. J. & Clarke, M. F. Prospective identification of tumorigenic breast cancer cells. Proc. Natl. Acad. Sci. U. S. A. 100, 3983–8 (2003).
16. Singh, S.
K. et al. Identification of human brain tumour initiating cells. Nature 432, 396–401 (2004).
17. Ricci-Vitiani, L. et al. Identification and expansion of human colon-cancer-initiating cells. Nature 445, 111–115 (2007).
18. Bloom, H. J. & Richardson, W. W. Histological grading and prognosis in breast cancer; a study of 1409 cases of which 359 have been followed for 15 years. Br. J. Cancer 11, 359–77 (1957).
19. Greer, L. T. et al. Does breast tumor heterogeneity necessitate further immunohistochemical staining on surgical specimens? J. Am. Coll. Surg. 216, 239–251 (2013).
20. Stenkvist, B. et al. Histopathological systems of breast cancer classification: reproducibility and clinical significance. J. Clin. Pathol. 36, 392–8 (1983).
21. Heppner, G. H., Dexter, D. L., DeNucci, T., Miller, F. R. & Calabresi, P. Heterogeneity in drug sensitivity among tumor cell subpopulations of a single mammary tumor. Cancer Res. 38, 3758–3763 (1978).
22. Tsuruo, T. & Fidler, I. J. Differences in drug sensitivity among tumor cells from parental tumors, selected variants, and spontaneous metastases. Cancer Res. 41, 3058–3064 (1981).
23. Teixeira, M. R., Pandis, N., Bardi, G., Andersen, J. A. & Heim, S. Karyotypic comparisons of multiple tumorous and macroscopically normal surrounding tissue samples from patients with breast cancer. Cancer Res. 56, 855–859 (1996).
24. Gerlinger, M. et al. Intratumor Heterogeneity and Branched Evolution Revealed by Multiregion Sequencing. N. Engl. J. Med. 366, 883–892 (2012).
25. Yachida, S. et al. Distant metastasis occurs late during the genetic evolution of pancreatic cancer. Nature 467, 1114–1117 (2010).
26. Chun, H.-J. E., Khattra, J., Krzywinski, M., Aparicio, S. A. & Marra, M. A. in Cancer Genomics (Dellaire, G., Berman, J. & Arceci, R.) 13–30 (Elsevier, 2014). doi:10.1016/B978-0-12-396967-5.00002-5
27. Sanger, F., Nicklen, S. & Coulson, A. R. DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. U. S. A.
74, 5463–5467 (1977).
28. Minton, J. A. L., Flanagan, S. E. & Ellard, S. Mutation surveyor: software for DNA sequence analysis. Methods Mol. Biol. 688, 143–53 (2011).
29. Perou, C. M. et al. Molecular portraits of human breast tumours. Nature 406, 747–52 (2000).
30. Sørlie, T. et al. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc. Natl. Acad. Sci. U. S. A. 98, 10869–74 (2001).
31. Shah, S. P. et al. Mutational evolution in a lobular breast tumour profiled at single nucleotide resolution. Nature 461, 809–813 (2009).
32. Ding, L. et al. Genome remodelling in a basal-like breast cancer metastasis and xenograft. Nature 464, 999–1005 (2010).
33. Stephens, P. J. et al. Complex landscapes of somatic rearrangement in human breast cancer genomes. Nature 462, 1005–1010 (2009).
34. Curtis, C. et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature 486, 346–52 (2012).
35. The Cancer Genome Atlas Network. Comprehensive molecular portraits of human breast tumours. Nature 490, 61–70 (2012).
36. Nik-Zainal, S. et al. Mutational processes molding the genomes of 21 breast cancers. Cell 149, 979–993 (2012).
37. Lu, X. & Kang, Y. Cell fusion as a hidden force in tumor progression. Cancer Res. 69, 8536–8539 (2009).
38. Aparicio, S. A. J. R. & Huntsman, D. G. Does massively parallel DNA resequencing signify the end of histopathology as we know it? J. Pathol. 220, 307–315 (2010).
39. Elowitz, M. B., Levine, A. J., Siggia, E. D. & Swain, P. S. Stochastic gene expression in a single cell. Science 297, 1183–1186 (2002).
40. Marusyk, A., Almendro, V. & Polyak, K. Intra-tumour heterogeneity: a looking glass for cancer? Nat. Rev. Cancer 12, 323–334 (2012).
41. Frumkin, D., Wasserstrom, A., Kaplan, S., Feige, U. & Shapiro, E. Genomic variability within an organism exposes its cell lineage tree. PLoS Comput. Biol. 1, e50 (2005).
42. Leung, K. et al.
A programmable droplet-based microfluidic device applied to multiparameter analysis of single microbes and microbial communities. Proc. Natl. Acad. Sci. 109, 7665–7670 (2012).
43. Lecault, V., White, A. K., Singhal, A. & Hansen, C. L. Microfluidic single cell analysis: From promise to practice. Curr. Opin. Chem. Biol. 16, 381–390 (2012).
44. Pieprzyk, M. & High, H. Fluidigm Dynamic Arrays provide a platform for single-cell gene expression analysis. Nat. Methods 6, (2009).
45. Chen, L., Manz, A. & Day, P. J. R. Total nucleic acid analysis integrated on microfluidic devices. Lab Chip 7, 1413–1423 (2007).
46. White, A. K., Heyries, K. A., Doolin, C., Vaninsberghe, M. & Hansen, C. L. High Throughput Microfluidic Single Cell Digital PCR. Anal. Chem. (2013). doi:10.1021/ac400896j
47. Tang, F. et al. mRNA-Seq whole-transcriptome analysis of a single cell. Nat. Methods 6, 377–82 (2009).
48. Wu, A. R. et al. Quantitative assessment of single-cell RNA-sequencing methods. Nat. Methods 11, 41–6 (2014).
49. Brady, G., Barbara, M. & Iscove, N. N. Representative in vitro cDNA amplification from individual hemopoietic cells and colonies. Methods Mol. Cell. Biol. 2, 17–25 (1990).
50. Eberwine, J. et al. Analysis of gene expression in single live neurons. Proc. Natl. Acad. Sci. U. S. A. 89, 3010–3014 (1992).
51. Hashimshony, T., Wagner, F., Sher, N. & Yanai, I. CEL-Seq: single-cell RNA-Seq by multiplexed linear amplification. Cell Rep. 2, 666–73 (2012).
52. Shalek, A. K. et al. Single-cell RNA-seq reveals dynamic paracrine control of cellular variation. Nature 509, 363–9 (2014).
53. Patel, A. P. et al. Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science 344, 1396–401 (2014).
54. Mullis, K. et al. Specific Enzymatic Amplification of DNA In Vitro: The Polymerase Chain Reaction. Cold Spring Harb. Symp. Quant. Biol. 51, 263–273 (1986).
55. Navidi, W. & Arnheim, N. Using PCR in preimplantation genetic disease diagnosis. Hum. Reprod.
6, 836–849 (1991).
56. Spits, C. et al. Whole-genome multiple displacement amplification from single cells. Nat. Protoc. 1, 1965–1970 (2006).
57. Gole, J. et al. Massively parallel polymerase cloning and genome sequencing of single cells using nanoliter microwells. Nat. Biotechnol. 31, 1126–32 (2013).
58. Geigl, J. B. & Speicher, M. R. Single-cell isolation from cell suspensions and whole genome amplification from single cells to provide templates for CGH analysis. Nat. Protoc. 2, 3173–3184 (2007).
59. Zong, C., Lu, S., Chapman, A. R. & Xie, X. S. Genome-wide detection of single-nucleotide and copy-number variations of a single human cell. Science 338, 1622–6 (2012).
60. Powell, A. A. et al. Single cell profiling of Circulating tumor cells: Transcriptional heterogeneity and diversity from breast cancer cell lines. PLoS One 7, (2012).
61. Baslan, T. et al. Genome-wide copy number analysis of single cells. Nat. Protoc. 7, 1024–1041 (2012).
62. Wang, Y. et al. Clonal evolution in breast cancer revealed by single nucleus genome sequencing. Nature (2014). doi:10.1038/nature13600
63. Potter, N. E. et al. Single-cell mutational profiling and clonal phylogeny in cancer. Genome Res. 23, 2115–25 (2013).
64. Ni, X. et al. Reproducible copy number variation patterns among single circulating tumor cells of lung cancer patients. Proc. Natl. Acad. Sci. U. S. A. 110, 21083–8 (2013).
65. Mason, O. U. et al. Metagenome, metatranscriptome and single-cell sequencing reveal microbial response to Deepwater Horizon oil spill. ISME J. 6, 1715–27 (2012).
66. Turnbaugh, P. J. et al. The human microbiome project. Nature 449, 804–10 (2007).
67. McConnell, M. J. et al. Mosaic copy number variation in human neurons. Science 342, 632–7 (2013).
68. Cai, X. et al. Single-Cell, Genome-wide Sequencing Identifies Clonal Somatic Copy-Number Variation in the Human Brain. Cell Rep. 8, 1280–1289 (2014).
69. Knouse, K. A., Wu, J., Whittaker, C. A. & Amon, A.
Single cell sequencing reveals low levels of aneuploidy across mammalian tissues. Proc. Natl. Acad. Sci. 111, 13409–13414 (2014).
70. Fan, H. C., Wang, J., Potanina, A. & Quake, S. R. Whole-genome molecular haplotyping of single cells. Nat. Biotechnol. 29, 51–57 (2011).
71. Wang, J., Fan, H. C., Behr, B. & Quake, S. R. Genome-wide single-cell analysis of recombination activity and de novo mutation rates in human sperm. Cell 150, 402–12 (2012).
72. Falconer, E. et al. DNA template strand sequencing of single-cells maps genomic rearrangements at high resolution. Nat. Methods (2012).
73. Vlug, E., Ercan, C., Van Der Wall, E., Van Diest, P. J. & Derksen, P. W. B. Lobular breast cancer: Pathology, biology, and options for clinical intervention. Arch. Immunol. Ther. Exp. (Warsz). 62, 7–21 (2014).
74. Cailleau, R., Mackay, B., Young, R. K. & Reeves, W. J. Tissue culture studies on pleural effusions from breast carcinoma patients. Cancer Res. 34, 801–9 (1974).
75. DeRose, Y. S. et al. Tumor grafts derived from women with breast cancer authentically reflect tumor pathology, growth, metastasis and disease outcomes. Nat. Med. 17, 1514–1520 (2011).
76. Rozen, S. & Skaletsky, H. Primer3 on the WWW for general users and for biologist programmers. Methods Mol. Biol. 132, 365–386 (2000).
77. Li, H., Ruan, J. & Durbin, R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18, 1851–1858 (2008).
78. Morin, R. D. et al. Profiling the HeLa S3 transcriptome using randomly primed cDNA and massively parallel short-read sequencing. Biotechniques 45, 81–94 (2008).
79. Goya, R. et al. SNVMix: Predicting single nucleotide variants from next-generation sequencing of tumors. Bioinformatics 26, 730–736 (2010).
80. Sherry, S. T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29, 308–311 (2001).
81. Campbell, P. J. et al.
Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nat. Genet. 40, 722–729 (2008).
82. Galante, P. A. F., Sakabe, N. J., Kirschbaum-Slager, N. & de Souza, S. J. Detection and evaluation of intron retention events in the human transcriptome. RNA 10, 757–765 (2004).
83. Bolodeoku, J., Yoshida, K., Sugino, T., Goodison, S. & Tarin, D. Accumulation of Immature Intron-containing CD44 Gene Transcripts in Breast Cancer Tissues. Mol. Diagn. 1, 175–181 (1996).
84. Wood, L. D. et al. The genomic landscapes of human breast and colorectal cancers. Science 318, 1108–13 (2007).
85. Forbes, S. A. et al. The catalogue of somatic mutations in cancer (COSMIC). Curr. Protoc. Hum. Genet. (2008).
86. Goshima, G., Mayer, M., Zhang, N., Stuurman, N. & Vale, R. D. Augmin: a protein complex required for centrosome-independent microtubule generation within the spindle. J. Cell Biol. 181, 421–9 (2008).
87. Uehara, R. et al. The augmin complex plays a critical role in spindle microtubule generation for mitotic progression and cytokinesis in human cells. Proc. Natl. Acad. Sci. U. S. A. 106, 6998–7003 (2009).
88. Erkko, H. et al. A recurrent mutation in PALB2 in Finnish cancer families. Nature 446, 316–319 (2007).
89. Carey, L. A. et al. Race, breast cancer subtypes, and survival in the Carolina Breast Cancer Study. JAMA 295, 2492–2502 (2006).
90. Lakhani, S. R. et al. The pathology of familial breast cancer: predictive value of immunohistochemical markers estrogen receptor, progesterone receptor, HER-2, and p53 in patients with mutations in BRCA1 and BRCA2. J. Clin. Oncol. 20, 2310–2318 (2002).
91. Rottenberg, S. et al. High sensitivity of BRCA1-deficient mammary tumors to the PARP inhibitor AZD2281 alone and in combination with platinum drugs. Proc. Natl. Acad. Sci. U. S. A. 105, 17079–17084 (2008).
92. Chacón, R. D. & Costanzo, M. V. Triple-negative breast cancer. Breast Cancer Res. 12 Suppl 2, S3 (2010).
93. Shah, S.
P. et al. The clonal and mutational evolution spectrum of primary triple-negative breast cancers. Nature (2012). doi:10.1038/nature10933
94. Gnirke, A. et al. Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nat. Biotechnol. 27, 182–189 (2009).
95. Roth, A. et al. JointSNVMix: a probabilistic model for accurate detection of somatic mutations in normal/tumour paired next-generation sequencing data. Bioinformatics 28, 907–13 (2012).
96. Ha, G. et al. Integrative analysis of genome-wide loss of heterozygosity and monoallelic expression at nucleotide resolution reveals disrupted pathways in triple-negative breast cancer. Genome Res. 22, 1995–2007 (2012).
97. Reva, B., Antipin, Y. & Sander, C. Predicting the functional impact of protein mutations: Application to cancer genomics. Nucleic Acids Res. 39, (2011).
98. Bashashati, A. et al. DriverNet: uncovering the impact of somatic driver mutations on transcriptional networks in cancer. Genome Biol. 13, R124 (2012).
99. Roth, A. et al. PyClone: statistical inference of clonal population structure in cancer. Nat. Methods 11, 396–8 (2014).
100. Herschkowitz, J. I., He, X., Fan, C. & Perou, C. M. The functional loss of the retinoblastoma tumour suppressor is a common event in basal-like and luminal B breast carcinomas. Breast Cancer Res. 10, R75 (2008).
101. Langerød, A. et al. TP53 mutation status and gene expression profiles are powerful prognostic markers of breast cancer. Breast Cancer Res. 9, R30 (2007).
102. Wu, G., Feng, X. & Stein, L. A human functional protein interaction network and its application to cancer data analysis. Genome Biol. 11, R53 (2010).
103. Ha, G. et al. TITAN: Inference of copy number architectures in clonal cell populations from tumor whole genome sequence data. Genome Res. (2014). doi:10.1101/gr.180281.114
104. Jaye, D. L., Bray, R. A., Gebel, H. M., Harris, W. A. C. & Waller, E. K.
Translational applications of flow cytometry in clinical practice. J. Immunol. 188, 4715–9 (2012).
105. Nath, K., Sarosy, J. W., Hahn, J. & Di Como, C. J. Effects of ethidium bromide and SYBR Green I on different polymerase chain reaction systems. J. Biochem. Biophys. Methods 42, 15–29 (2000).
106. Darzynkiewicz, Z., Halicka, H. D. & Zhao, H. Analysis of cellular DNA content by flow and laser scanning cytometry. Adv. Exp. Med. Biol. 675, 137–147 (2010).
107. Longo, M. C., Berninger, M. S. & Hartley, J. L. Use of uracil DNA glycosylase to control carry-over contamination in polymerase chain reactions. Gene 93, 125–128 (1990).
108. Sudmant, P. H. et al. Diversity of human copy number variation and multicopy genes. Science 330, 641–646 (2010).
109. Li, H. H. et al. Amplification and analysis of DNA sequences in single human sperm and diploid cells. Nature 335, 414–7 (1988).
110. Krueger, F., Andrews, S. R. & Osborne, C. S. Large scale loss of data in low-diversity Illumina sequencing libraries can be recovered by deferred cluster calling. PLoS One 6, (2011).
111. Landau, D. A. et al. Evolution and impact of subclonal mutations in chronic lymphocytic leukemia. Cell 152, 714–726 (2013).
112. Hutchinson, L. & Kirk, R. High drug attrition rates--where are we going wrong? Nat. Rev. Clin. Oncol. 8, 189–90 (2011).
113. Doshi, P., Dickersin, K., Healy, D., Vedula, S. S. & Jefferson, T. Restoring invisible and abandoned trials: a call for people to publish the findings. BMJ 346, f2865 (2013).
114. Tentler, J. J. et al. Patient-derived tumour xenografts as models for oncology drug development. Nat. Rev. Clin. Oncol. 9, 338–350 (2012).
115. Zhang, X. et al. A renewable tissue resource of phenotypically stable, biologically and ethnically diverse, patient-derived human breast cancer xenograft models. Cancer Res. 73, 4885–4897 (2013).
116. Giovanella, B. et al. DNA topoisomerase I--targeted chemotherapy of human colon cancer in xenografts. Science 246, 1046–1048 (1989).
117. Li, S. et al. Endocrine-Therapy-Resistant ESR1 Variants Revealed by Genomic Characterization of Breast-Cancer-Derived Xenografts. Cell Rep. 4, 1116–1130 (2013).
118. Garralda, E. et al. Integrated next-generation sequencing and avatar mouse models for personalized cancer treatment. Clin. Cancer Res. 20, 2476–84 (2014).
119. Eirew, P. et al. Dynamics of genomic clones in breast cancer patient xenografts at single-cell resolution. Nature (2014). doi:10.1038/nature13952
120. Eirew, P., Stingl, J. & Eaves, C. J. Quantitation of human mammary epithelial stem cells with in vivo regenerative properties using a subrenal capsule xenotransplantation assay. Nat. Protoc. 5, 1945–1956 (2010).
121. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
122. Ding, J. et al. Feature-based classifiers for somatic mutation detection in tumour-normal paired sequencing data. Bioinformatics 28, 167–175 (2012).
123. Albers, C. A. et al. Dindel: Accurate indel calls from short-read data. Genome Res. 21, 961–973 (2011).
124. Ye, K., Schulz, M. H., Long, Q., Apweiler, R. & Ning, Z. Pindel: A pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 25, 2865–2871 (2009).
125. McPherson, A. et al. NFuse: Discovery of complex genomic rearrangements in cancer using high-throughput sequencing. Genome Res. 22, 2250–2261 (2012).
126. Wang, K. et al. PennCNV: An integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res. 17, 1665–1674 (2007).
127. Yau, C. et al. A statistical approach for detecting genomic aberrations in heterogeneous tumor samples from single nucleotide polymorphism genotyping data. Genome Biol. 11, R92 (2010).
128. Huelsenbeck, J. P. & Ronquist, F. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics 17, 754–755 (2001).
129.
Notta, F. et al. Evolution of human BCR-ABL1 lymphoblastic leukaemia-initiating cells. Nature 469, 362–367 (2011).
130. Ding, L. et al. Clonal evolution in relapsed acute myeloid leukaemia revealed by whole-genome sequencing. Nature 481, 506–10 (2012).
131. Schuh, A. et al. Monitoring chronic lymphocytic leukemia progression by whole genome sequencing reveals heterogeneous clonal evolution patterns. Blood 120, 4191–6 (2012).
132. Keats, J. J. et al. Clonal competition with alternating dominance in multiple myeloma. Blood 120, 1067–76 (2012).
133. Ruiz, C. et al. Advancing a clinically relevant perspective of the clonal nature of cancer. Proc. Natl. Acad. Sci. U. S. A. 108, 12054–9 (2011).
134. Gerstung, M. et al. Reliable detection of subclonal single-nucleotide variants in tumour cell populations. Nat. Commun. 3, 811 (2012).
135. Smith, E. N. et al. Biased estimates of clonal evolution and subclonal heterogeneity can arise from PCR duplicates in deep sequencing experiments. Genome Biol. 15, 420 (2014).
136. Pushkarev, D., Neff, N. F. & Quake, S. R. Single-molecule sequencing of an individual human genome. Nat. Biotechnol. 27, 847–850 (2009).
137. Thornhill, A., Holding, C. & Monk, M. Recycling the single cell to detect specific chromosomes and to investigate specific gene sequences. Hum. Reprod. 9, 2150–2155 (1994).
138. Battich, N., Stoeger, T. & Pelkmans, L. Image-based transcriptomics in thousands of single human cells at single-molecule resolution. Nat. Methods 10, 1127–33 (2013).
139. Van Der Aa, N. et al. Genome-wide copy number profiling of single cells in S-phase reveals DNA-replication domains. Nucleic Acids Res. 41, (2013).
140. Vogt, N. et al. Molecular structure of double-minute chromosomes bearing amplified copies of the epidermal growth factor receptor gene in gliomas. Proc. Natl. Acad. Sci. U. S. A. 101, 11368–73 (2004).
141. Thornhill, A. R., McGrath, J. A., Eady, R. A. J., Braude, P. R. & Handyside, A. H.
A comparison of different lysis buffers to assess allele dropout from single cells for preimplantation genetic diagnosis. Prenat. Diagn. 21, 490–497 (2001).
142. Ray, P. F. & Handyside, A. H. Increasing the denaturation temperature during the first cycles of amplification reduces allele dropout from single cells for preimplantation genetic diagnosis. Mol. Hum. Reprod. 2, 213–218 (1996).
143. Islam, S. et al. Quantitative single-cell RNA-seq with unique molecular identifiers. Nat. Methods 11, 163–6 (2014).
144. Jiang, L. et al. Synthetic spike-in standards for RNA-seq experiments. Genome Res. 21, 1543–1551 (2011).
145. Kinde, I., Wu, J., Papadopoulos, N., Kinzler, K. W. & Vogelstein, B. Detection and quantification of rare mutations with massively parallel sequencing. Proc. Natl. Acad. Sci. U. S. A. 108, 9530–9535 (2011).
146. Schmitt, M. W. et al. Detection of ultra-rare mutations by next-generation sequencing. Proc. Natl. Acad. Sci. U. S. A. 109, 14508–13 (2012).
147. Marusyk, A. et al. Non-cell-autonomous driving of tumour growth supports sub-clonal heterogeneity. Nature (2014).
148. Slamon, D. J. et al. Use of chemotherapy plus a monoclonal antibody against HER2 for metastatic breast cancer that overexpresses HER2. N. Engl. J. Med. 344, 783–792 (2001).
149. Morgan, S., Grootendorst, P., Lexchin, J., Cunningham, C. & Greyson, D. The cost of drug development: a systematic review. Health Policy 100, 4–17 (2011).
150. Russnes, H. G., Lønning, P. E., Børresen-Dale, A.-L. & Lingjærde, O. C. The multitude of molecular analyses in cancer: the opening of Pandora’s box. Genome Biol. 15, 447 (2014).
151. Dancey, J. E., Bedard, P. L., Onetto, N. & Hudson, T. J. The genetic basis for cancer treatment decisions. Cell 148, 409–420 (2012).
152. Melchor, L. et al. Single-cell genetic analysis reveals the composition of initiating clones and phylogenetic patterns of branching and parallel evolution in myeloma. Leukemia 28, 1705–15 (2014).
153. Hughes, A. E. O. et al.
Clonal Architecture of Secondary Acute Myeloid Leukemia Defined by Single-Cell Sequencing. PLoS Genet. 10, (2014).
154. Van der Aa, N., Esteki, M. Z., Vermeesch, J. R. & Voet, T. Preimplantation genetic diagnosis guided by single-cell genomics. Genome Med. 5, 71 (2013).
155. Bischoff, F. Z. et al. Cell-free fetal DNA and intact fetal cells in maternal blood circulation: Implications for first and second trimester non-invasive prenatal diagnosis. Hum. Reprod. Update 8, 493–500 (2002).
156. Beca, F. & Schmitt, F. Growing indication for FNA to study and analyze tumor heterogeneity at metastatic sites. Cancer Cytopathol. 122, 504–11 (2014).
157. De Giorgi, U. et al. Circulating tumor cells and bone metastases as detected by FDG-PET/CT in patients with metastatic breast cancer. Ann. Oncol. 21, 33–39 (2010).
158. Ashworth, T. R. A case of cancer in which cells similar to those in the tumors were seen in the blood after death. Aust. Med. J. 14, 146–149 (1869).
159. Leon, S. A., Shapiro, B., Sklaroff, D. M. & Yaros, M. J. Free DNA in the serum of cancer patients and the effect of therapy. Cancer Res. 37, 646–650 (1977).
160. Siravegna, G. & Bardelli, A. Genotyping cell-free tumor DNA in the blood to detect residual disease and drug resistance. Genome Biol. 15, 449 (2014).
161. Riethdorf, S. et al. Detection of circulating tumor cells in peripheral blood of patients with metastatic breast cancer: a validation study of the CellSearch system. Clin. Cancer Res. 13, 920–928 (2007).
162. Andreopoulou, E. et al. Comparison of assay methods for detection of circulating tumor cells in metastatic breast cancer: AdnaGen AdnaTest BreastCancer Select/Detect™ versus Veridex CellSearch™ system. Int. J. Cancer 130, 1590–1597 (2012).
163. Lucci, A. et al. Circulating tumour cells in non-metastatic breast cancer: A prospective study. Lancet Oncol. 13, 688–695 (2012).
164. Bidard, F.-C. et al.
Assessment of circulating tumor cells and serum markers for progression-free survival prediction in metastatic breast cancer: a prospective observational study. Breast Cancer Res. 14, R29 (2012).
165. Swennenhuis, J. F., Reumers, J., Thys, K., Aerssens, J. & Terstappen, L. W. Efficiency of whole genome amplification of single circulating tumor cells enriched by CellSearch and sorted by FACS. Genome Med. 5, 106 (2013).
166. Ding, X. et al. Cell separation using tilted-angle standing surface acoustic waves. Proc. Natl. Acad. Sci. U. S. A. (2014). doi:10.1073/pnas.1413325111
167. Sieuwerts, A. M. et al. Anti-epithelial cell adhesion molecule antibodies and the detection of circulating normal-like breast tumor cells. J. Natl. Cancer Inst. 101, 61–66 (2009).
168. Meng, S. et al. Circulating tumor cells in patients with breast cancer dormancy. Clin. Cancer Res. 10, 8152–8162 (2004).
169. Rubakhin, S. S., Romanova, E. V., Nemes, P. & Sweedler, J. V. Profiling metabolites and peptides in single cells. Nat. Methods 8, S20–S29 (2011).
170. Smallwood, S. A. et al. Single-cell genome-wide bisulfite sequencing for assessing epigenetic heterogeneity. Nat. Methods 11, 817–20 (2014).
171. Lorthongpanich, C. et al. Single-cell DNA-methylation analysis reveals epigenetic chimerism in preimplantation embryos. Science 341, 1110–2 (2013).
172. Guo, H. et al. Single-cell methylome landscapes of mouse embryonic stem cells and early embryos analyzed using reduced representation bisulfite sequencing. Genome Res. 23, 2126–35 (2013).
173. Bendall, S. C. et al. Single-cell mass cytometry of differential immune and drug responses across a human hematopoietic continuum. Science 332, 687–96 (2011).
174. Leuchowius, K.-J. et al. Parallel visualization of multiple protein complexes in individual cells in tumor tissue. Mol. Cell. Proteomics 12, 1563–71 (2013).
175. Nagano, T. et al. Single-cell Hi-C reveals cell-to-cell variability in chromosome structure. Nature 502, 59–64 (2013).
176. Klein, C. A. et al. Combined transcriptome and genome analysis of single micrometastatic cells. Nat. Biotechnol. 20, 387–392 (2002).
177. Han, L. et al. Co-detection and sequencing of genes and transcripts from the same single cells facilitated by a microfluidics platform. Sci. Rep. 4, 6485 (2014).
178. Strotman, L. et al. Selective nucleic acid removal via exclusion (SNARE): capturing mRNA and DNA from a single sample. Anal. Chem. 85, 9764–70 (2013).
179. Betzig, E. et al. Imaging intracellular fluorescent proteins at nanometer resolution. Science 313, 1642–1645 (2006).
180. Greaves, M. Darwinian medicine: a case for cancer. Nat. Rev. Cancer 7, 213–221 (2007).

