"Medicine, Faculty of"@en . "Medical Genetics, Department of"@en . "DSpace"@en . "UBCV"@en . "Pugh, Trevor John"@en . "2009-11-09T19:46:57Z"@en . "2009"@en . "Doctor of Philosophy - PhD"@en . "University of British Columbia"@en . "Cells in the human body contain DNA genomes that encode instructions regulating their biology. Accumulation of somatic DNA sequence alterations such as point mutations and structural rearrangements can disrupt critical genes resulting in malignant cancer phenotypes. Identification of cancer \u00E2\u0080\u009Cdrivers\u00E2\u0080\u009D is a central goal of cancer genome analysis due to their causation of oncogenesis and potential as diagnostic and therapeutic targets. Analysis of normal polymorphisms can also impact the treatment of cancer by identifying individuals most likely to benefit from specific therapies. To uncover molecular correlates with treatment outcome, my graduate work has focused on applying DNA sequencing technology to clinical cancer patient samples. In an early example of medical oncogenomics, I evaluated mutations and amplifications of a single gene, EGFR, in patient tumour samples and investigated associations with response to an EGFR inhibitor, gefitinib. This study was challenged by limited nucleic acid quantities available from small or microdissected tissue biopsies. Therefore, I next characterized bias induced by a whole genome amplification technique and demonstrated genotype and copy number analysis using amplified material. To investigate the role that normal polymorphisms play in guiding cancer treatment, my third project sought to correlate DNA repair gene polymorphisms with the development of late side effects following radiation therapy for prostate cancer. Late side effects were associated with variants in three genes, uncovered by sequencing the exons of eight DNA repair genes in patients with varying degrees of radiosensitivity. 
Advancements in DNA sequencing technologies have enabled a move beyond candidate gene approaches towards gaining sequence and expression information from all expressed genes (i.e. the transcriptome). Utilizing second generation sequencing technology, my final project was a transcriptome analysis of lung tumours prior to treatment with the EGFR inhibitor, erlotinib. I uncovered gene expression profiles specific to clinical subgroups and, in one case, detected expression of the Epstein-Barr virus. The second phase of this project will validate putative somatic mutations identified by transcriptome sequencing and investigate viral involvement in other lung tumours. Genome sequence information is becoming readily extracted from clinical sources and there is great potential to use this information to effectively guide cancer treatment."@en . "https://circle.library.ubc.ca/rest/handle/2429/14710?expand=metadata"@en . "5186984 bytes"@en . "application/pdf"@en . " ANALYSIS OF PRIMARY HUMAN CANCERS: FROM SINGLE GENES TO WHOLE TRANSCRIPTOMES by TREVOR JOHN PUGH B.Sc. (Honours), The University of British Columbia, 2004 A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES (Medical Genetics) THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver) November 2009 © Trevor John Pugh, 2009 Abstract Cells in the human body contain DNA genomes that encode instructions regulating their biology. Accumulation of somatic DNA sequence alterations such as point mutations and structural rearrangements can disrupt critical genes resulting in malignant cancer phenotypes. Identification of cancer “drivers” is a central goal of cancer genome analysis due to their causation of oncogenesis and potential as diagnostic and therapeutic targets. 
Analysis of normal polymorphisms can also impact the treatment of cancer by identifying individuals most likely to benefit from specific therapies. To uncover molecular correlates with treatment outcome, my graduate work has focused on applying DNA sequencing technology to clinical cancer patient samples. In an early example of medical oncogenomics, I evaluated mutations and amplifications of a single gene, EGFR, in patient tumour samples and investigated associations with response to an EGFR inhibitor, gefitinib. This study was challenged by limited nucleic acid quantities available from small or microdissected tissue biopsies. Therefore, I next characterized bias induced by a whole genome amplification technique and demonstrated genotype and copy number analysis using amplified material. To investigate the role that normal polymorphisms play in guiding cancer treatment, my third project sought to correlate DNA repair gene polymorphisms with the development of late side effects following radiation therapy for prostate cancer. Late side effects were associated with variants in three genes, uncovered by sequencing the exons of eight DNA repair genes in patients with varying degrees of radiosensitivity. Advancements in DNA sequencing technologies have enabled a move beyond candidate gene approaches towards gaining sequence and expression information from all expressed genes (i.e. the transcriptome). Utilizing second generation sequencing technology, my final project was a transcriptome analysis of lung tumours prior to treatment with the EGFR inhibitor, erlotinib. I uncovered gene expression profiles specific to clinical subgroups and, in one case, detected expression of the Epstein-Barr virus. The second phase of this project will validate putative somatic mutations identified by transcriptome sequencing and investigate viral involvement in other lung tumours. 
Genome sequence information is becoming readily extracted from clinical sources and there is great potential to use this information to effectively guide cancer treatment.
Table of contents
Abstract
Table of contents
List of tables
List of figures
Acknowledgements
Co-authorship statement
Chapter 1. Introduction
1.1. Human phenotypes are controlled by cellular genomes
1.2. Variants in genome sequence and structure differentiate human phenotypes
1.3. Cancers arise from accumulation of abnormal somatic variants in critical genes
1.4. Activating mutations and amplifications of a lung cancer oncogene, EGFR, have been associated with response to tyrosine kinase inhibitors
1.5. Molecular studies of patient biopsy samples have been limited by suboptimal tissue quality and quantity
1.6. Sequencing of multiple genes in clinical sample sets is facilitated by high-throughput methods
1.7. Second generation sequencing technologies have enabled whole cancer genome sequencing
1.8. Thesis description
1.9. Figures
1.10. Bibliography
Chapter 2. Correlations of EGFR mutations and increases in EGFR and HER2 copy number to gefitinib response in a retrospective analysis of lung cancer patients
2.1. Introduction
2.2. Materials and methods
2.2.1. Patient population and assessment response
2.2.2. Laser microdissection and DNA extraction
2.2.3. PCR and sequencing of EGFR exons 18-24
2.2.4. Copy number analysis of EGFR and HER2
2.3. Results
2.3.1. Patient population
2.3.2. EGFR tyrosine-kinase domain mutations
2.3.3. EGFR tyrosine-kinase domain polymorphisms
2.3.4. EGFR and HER2 copy number analysis
2.4. Discussion
2.5. Conclusion
2.6. Figures
2.7. Tables
2.8. Bibliography
Chapter 3. Impact of whole genome amplification on analysis of copy number variants
3.1. Introduction
3.2. Materials and methods
3.2.1. Tissue material and DNA extraction
3.2.2. Whole genome amplification
3.2.3. Labelling and hybridization to the Affymetrix 500K array
3.2.4. Sample preparation for NimbleGen 385k CGH array
3.2.5. Genotype and copy number analysis
3.2.6. Sequence analysis of recurrent whole genome amplification-induced artifacts
3.3. Results
3.3.1. Array noise and copy number variation in samples pre- and post-WGA
3.3.2. Copy number variants induced by whole genome amplification
3.3.3. Use of amplified material for pair-wise copy number comparisons
3.3.4. Validation of WGA pair-wise comparisons for copy number detection
3.3.5. Genotype fidelity
3.4. Discussion
3.5. Figures
3.6. Tables
3.7. Bibliography
Chapter 4. Sequence variant discovery in DNA repair genes from radiosensitive and radiotolerant prostate brachytherapy patients
4.1. Introduction
4.2. Materials and methods
4.2.1. Patient selection and toxicity metrics
4.2.2. PCR amplification and sequencing of DNA repair genes
4.2.3. Statistical analyses
4.3. Results
4.3.1. DNA sequencing summary
4.3.2. ATM variants detected by previous studies of radiosensitivity
4.3.3. Using quantity of DNA repair gene variants to predict radiosensitivity
4.3.4. Using specific DNA repair gene variants to predict radiosensitivity
4.3.5. Relationship of DNA repair gene variants with residual gammaH2AX following irradiation
4.4. Discussion
4.5. Figures
4.6. Tables
4.7. Bibliography
Chapter 5. Transcriptome sequencing of treatment-naïve lung cancers from individuals likely to benefit from erlotinib treatment
5.1. Introduction
5.2. Methods
5.2.1. Biopsy collection and processing
5.2.2. DNA extraction and Sanger sequencing
5.2.3. RNA extraction, amplification and Illumina sequencing
5.2.4. RNA-seq data analysis
5.3. Results
5.3.1. Patient data and source tumour material
5.3.2. Summary of sequencing data and variant discovery in lung cancer biopsies
5.3.3. Addressing end-bias induced by amplification
5.3.4. Viral transcripts
5.3.5. Expression profiling
5.3.6. Fusion transcripts
5.3.7. Mutation detection
5.3.8. Validation of novel coding pSNVs
5.4. Discussion
5.5. Future directions
5.6. Figures
5.7. Tables
5.8. Bibliography
Chapter 6. Discussion
6.1. DNA sequencing efforts of increasing scale are becoming clinically applicable
6.2. Revolutions in DNA sequencing technology have enabled routine genome sequencing
6.3. Future treatments of cancer will be guided by genome information
6.4. Future directions
6.5. Figures
6.6. Bibliography
List of tables
Table 2-1 PCR primers for 7 exons of the EGFR tyrosine kinase domain
Table 2-2 Summary of all patient clinical data and molecular status
Table 2-3 EGFR exon 19 deletions/substitution
Table 2-4 EGFR point mutations
Table 2-5 EGFR and HER2 copy number alterations
Table 3-1 Regions of recurrent WGA over-amplification
Table 3-2 Regions of recurrent WGA under-amplification
Table 3-3 Distribution of log2 ratios from comparison of unamplified and amplified samples versus a common reference set of 48 individuals
Table 3-4 Apparent amplifications and deletions detected prior to amplification through comparison with a reference set of 48 individuals
Table 3-5 Distribution of log2 ratios from comparison of two experimental replicates of each sample
Table 3-6 Regions of recurrent WGA under-amplification within chromosome ends
Table 3-7 Apparent copy number differences identified by pair-wise comparisons of all possible combinations of unamplified and amplified samples
Table 3-8 Copy number variants detected by pair-wise comparisons of unamplified and amplified sample sets
Table 3-9 Copy number variants detected in MR families by pair-wise comparisons of unamplified and amplified sample sets (child versus father)
Table 4-1 Modified RTOG scoring system used to generate toxicity scores
Table 4-2 Patient-by-patient radiation dosimetry, gammaH2AX scores, DNA sequence variant counts, toxicity score breakdown, and other data
Table 4-3 PCR primer sequences used to amplify amplicons targeting candidate gene exons for sequencing
Table 4-4 Number of variant sites detected in each DNA repair gene.
Table 4-5 Coding variant genotypes observed in high and low toxicity prostate brachytherapy patients. (A = reference allele, B = non-reference allele)
Table 4-6 Variants associated with residual gamma H2AX levels following irradiation
Table 5-1 Tumour content and quantities of total RNA extracted from 30 lung tumour biopsies
Table 5-2 Complete viral genomes against which all unmapped transcriptome reads were mapped
Table 5-3 33 pSNVs validated by PCR and Sanger sequencing
Table 5-4 Genes with exons targeted for solution hybrid capture, containing mutations in ≥4 pre-treatment tumours (≥3 tumours for COSMIC genes)
Table 5-5 Genes containing at least 2 types of mutation exclusively in post-treatment tumours.
List of figures
Figure 1.1 Parallel pathways of tumourigenesis
Figure 1.2 Representations of a crystal structure of the EGFR kinase domain in complex with erlotinib
Figure 2.1 DNA of varying quality from formalin-fixed paraffin-embedded tissues.
Figure 2.2 Laser microdissection of mixed tumour and normal cell populations
Figure 2.3 EGFR variant detection summary
Figure 2.4 Examples of tumours with increased gene copy number detected by FISH
Figure 3.1 Experimental design
Figure 3.2 Boxplots comparing the spread of log2 ratios in unamplified and amplified samples
Figure 3.3 Apparent CNVs in unamplified and amplified samples
Figure 3.4 Copy number distribution and GC content of WGA-induced CNVs
Figure 3.5 Example of how a pair-wise comparison of amplified material can partially compensate for WGA-induced bias
Figure 4.1 Candidate genes encode proteins directly involved in the detection and repair of damaged DNA and triggering of cell cycle control signalling pathways
Figure 4.2 Toxicity scores, radiation dosimetry, count of DNA variants, and gammaH2AX rank expression from 41 prostate brachytherapy patients
Figure 5.1 Isolation of tumour cells from a complex pleural fluid mixture using flow cytometry
Figure 5.2 Summary of data generated, sequence mapped, and genes, variants, and fusions detected in each library
Figure 5.3 Distribution of RNA-seq reads mapped to exonic, intronic, and intergenic regions
Figure 5.4 Distribution of sequence coverage and putative SNVs detected across all expressed transcripts from 41 RNA-seq libraries
Figure 5.5 Comparison of sequence coverage distribution in libraries constructed from RNA amplified using a standard or modified in vitro transcription primer mix.
Figure 5.6 A) Circos visualization of RNA-seq reads from a lymphoepithelioma-like lung cancer aligned to an EBV genome. B) Confirmation of EBV tumour-specificity by in situ hybridization.
Figure 5.7 Supervised hierarchical clustering of gene expression profiles uncovers molecular and clinical subtypes of lung cancer
Figure 5.8 Attrition of putative SNVs to select variants for validation
Figure 5.9 Position of baits designed to A) validate putative point mutations detected by RNA-seq and B) discover additional mutations in exons from genes with putative point mutations in at least 3 tumours
Figure 6.1 Computed tomography (CT) images of lung metastases from an adenocarcinoma of the tongue in the months before and after administration of sunitinib, a drug selected to exploit somatic aberrations identified by cancer genome and transcriptome sequencing
Acknowledgements
This work could not have come together without the unwavering support, principled guidance, and intellectual enthusiasm of my mentor, Dr. Marco Marra. 
I have been privileged to learn from an exceptional leader, consummate gentleman, and true academic scholar. He has set a scientific and personal standard to which I aspire. The thoughtful vision and clinical perspective provided by our ardent collaborator, Dr. Janessa Laskin, have challenged me to constantly think and learn outside of the lab. Her earnest, practical approach to science and exceptional ability to unite basic and clinical research have shaped and reshaped my vision of clinical genomics. Thanks also to the additional members of my thesis advisory committee, Drs. Jan Friedman, Rob Holt, and Andre Marziali, for introducing me to the world of genomics, for providing healthy doses of academic and ‘real world’ perspective, and for accompanying me on this journey from start to finish. Thank you to Margaret and Simon Sutcliffe for coaxing me to think across boundaries institutional, translational, and international. Thank you to Lorena Barclay for your unmatched organizational skills and willingness to help at a moment’s notice. Thanks to Cindy Yang for being an incredible student who probably taught me more than I taught her. I would also like to thank the staff and scientists of the BC Cancer Agency Genome Sciences Centre (GSC). I could reproduce the staff directory listing every person who has helped me along the way. I grew up here, scientifically speaking, and thank you for the complete immersion in cutting-edge science. In particular, Duane Smailus, George Yang, Jeff Stott, Richard Moore, and Julius Halaschek-Wiener have been exceptional mentors beginning from my early days at the GSC. Thanks also to Allen Delaney and Irene Li for innumerable crash courses in bioinformatics and other tools of the trade. Thank you to Robyn Roscoe and Karen Novik for gracefully managing the twists and turns each project has taken. 
Thanks also to Robin Coope for sharing an infectious scientific enthusiasm and Yongjun Zhao for technical discussions of lab techniques. Much of what I have learned over the last five years has come from those people I see every day, the members of Marco Marra and Angela (Angie) Brooks-Wilson’s research groups. Thank you Malachi Griffith for conversations laboratory, informatic, and otherwise, Ryan Morin for an endless number of scripts and quips, Noushin Farnoud for being my go-to statistical consultant, Ian Bosdet for mentorship in my early days and for carrying on and improving my work, and Tesa Severson for holding us all together. Thank you to Angie for generously hosting my “short term” stay with her lab group and for modelling exceptional scientific leadership of an incredible team. Thank you to Johanna Schuetz for sharing a brain, Dan Fornika for thoughtful opinions on any topic, Steve Leach for the fierce friendly rivalry, and the rest of the 12 o’clock lunch group for the scintillating daily discussions. Thanks to Claire Hou for ongoing career advice. Thanks also to long-time friends outside of science for reminding me that there is life outside of the lab. Finally, a big thank you to the Pugh and Gastaldo families. To my Mom and Dad, thank you for imbuing in me a love of science at an early age and for always loving and supporting me in whatever I do. To brothers Kevin and Steven, thank you for always offering cheerful smiles, good times, and insightful discussions of the world at large. To my in-laws Silvano, Jacqueline, Claudia, Mirella, and Milva, thank you for embracing me as one of your own and sharing with me your incredible daughter and sister. Christina, your love and encouragement make each day a joy. Thank you for showing me what is truly important in life.
Co-authorship statement
The work presented in this thesis is the product of substantial collaboration. 
Each study is presented as an independent published or publishable unit and individual contributors to each chapter are listed here and at the beginning of each chapter:
Chapter 2 I participated in the study coordination, performed the DNA extraction, qualification, and sequence analysis, and generated drafts of the manuscript. Gwyn Bebb conceived of the study and participated in its design as well as treated patients and identified patients for study. Lorena Barclay, Margaret Sutcliffe, and John Fee reviewed patient samples and performed microdissection. Chris Salski and Doug Horsman carried out the FISH studies. Robert O’Connor served as the reference pathologist and reviewed all patient samples. Cheryl Ho, Nevin Murray, and Barbara Melosky treated patients and identified patients for this study. John English coordinated sample acquisition. Jeurgen Vielkind oversaw the microdissection process. Janessa Laskin and Marco Marra conceived of the study, participated in its design and coordination, and contributed to writing the manuscript.
Chapter 3 I participated in the study coordination, carried out the laboratory work, performed copy number and genotype analyses, and generated drafts of the manuscript. Allen Delaney, Stephane Flibotte, H. Irene Li, and Hong Qian assisted with copy number and genotype analysis. Noushin Farnoud and Malachi Griffith performed statistical sequence content analyses. Pedro Farinha and Randy Gascoyne provided and sectioned tissue samples. Marco Marra conceived of the study, participated in its design and coordination, and contributed to writing the manuscript.
Chapter 4 I participated in the study coordination, carried out a portion of the laboratory work, performed sequence and statistical analyses, and wrote the manuscript. Lorena Barclay and Cindy Yang performed a portion of the laboratory work. Karen Novik provided project management support. 
Allen Delaney, Martin Krzywinski, and Dallas Thomas wrote data handling programs and provided bioinformatic support. Alexander Agranovich, Mira Keyes, Michael McKenzie, and W. Jim Morris saw patients and provided clinical data. Peggy Olive provided gammaH2AX measurements for patients and aided in interpretation of the data. Mira Keyes, Marco Marra, and Richard Moore conceived of the study, participated in its design and coordination, and contributed to writing the manuscript.

Chapter 5: I participated in the study coordination, attended biopsies for collection of patient material, processed biopsy tissue for library construction, performed data analysis and interpretation, and wrote the manuscript. Janessa Laskin coordinated the study, treated patients, and participated in the data interpretation. Jennifer Asano, Lorena Barclay, Susanna Chan, and Cindy Yang performed laboratory work. Ian Bosdet, Obi Griffith, Ryan D. Morin, and Sorana Morrissey assisted with data analysis and interpretation. Diana Ionescu was the reference pathologist for the study. Margaret Sutcliffe assisted with study coordination and pathology review. Cheryl Ho, Christopher Lee, Barb Melosky, Nevin Murray, and Sophie Sun treated patients. Ciaran Keogh, Monty Martin, Kaushik Bhagat, and Helena Odwyer collected ultrasound- and CT-guided needle biopsy specimens. Stephen Lam and Annette McWilliams collected bronchoscopy specimens. The BC Cancer Agency Lab Accessioning group collected all blood samples. The BC Cancer Agency Genome Sciences Centre Sequencing group constructed and sequenced all transcriptome sequencing libraries. Marco Marra conceived of the project, coordinated the study, and contributed to writing the manuscript.

Chapter 1. Introduction

1.1. Human phenotypes are controlled by cellular genomes

A genome is a collection of genetic instructions that dictate cellular development, structure, and maintenance.
These instructions take the form of long polymers of double-stranded deoxyribonucleic acid (DNA) in which the order of nucleic acid couplets or base-pairs spells out distinct modular elements such as genes, regulatory elements, and structural motifs. Cells use these modules to create and control molecules necessary for life. These molecules are used to replicate individual cells, to interact with and modify cellular environments, and to form larger tissues. Complex cellular populations can themselves form connections and interdependencies, eventually giving rise to extraordinarily complicated organisms. In the case of the human body, cell populations form distinct organ systems nested within cellular structures connected by complex cell-based wiring and piping. The development and interaction of these myriad cells is dictated by the genome contained within each one. Determining the nucleic acid sequence of the human genome has long been thought to hold the key to understanding human phenotypes. Beginning in 1990, an international consortium of publicly-funded researchers undertook the sequencing of a pool of individual genomes with the goal of generating a draft human genome reference sequence. In 1998, a parallel project sequencing a smaller pool of five individuals was begun by a private company, Celera Genomics. Both of these groups published draft sequences in 2001 [1, 2], each representing a haploid consensus sequence of a set of human genomes. In its simplest form, this collection of DNA sequences enabled observation of GC- and repeat-content, CpG island distribution, and gene content across the human genome at a resolution of single nucleotides [1]. However, these drafts did not represent the exact sequence of any one human being but were instead an amalgamation of a group of individuals.
This mixture of genetic information led to the subsequent identification of 1.4-2.1 million variant base-pairs in the reference genome sequences [1, 2], and millions of additional sequence variants have since been identified [3-5]. While genome sequences are estimated to be 99.9% identical between any two human beings (excluding structural variants) [6, 7], even small changes in nucleic acid sequence can have a dramatic impact on cell behaviour and resulting organismal phenotypes. For example, many of the phenotypic differences observed between Europeans, Asians and Africans can be explained by differences in less than 0.01% of the genome [8]. As with each human, each genome is likely to be different and, with the reference sequence in hand, a number of large-scale projects set out to catalogue these differences in human populations.

1.2. Variants in genome sequence and structure differentiate human phenotypes

Two classes of human genetic variation have been uncovered: single nucleotide polymorphisms (SNPs) and structural variants. SNPs are traditionally defined as single base-pair changes with a minor allele frequency of at least 1% in a population [1]. Early efforts focused on cataloguing the nature and frequency of these polymorphisms in large human populations [3, 9], the results of which are recorded in the dbSNP database (http://www.ncbi.nlm.nih.gov/projects/SNP/) at the National Center for Biotechnology Information (NCBI). Even a single base-pair change can have major biological effects and specific SNPs have been linked to human diseases. Cystic fibrosis, haemophilia, and sickle cell anaemia can each arise due to missense variants in single genes, CFTR [10], F8/F9 [11, 12], and beta-globin/HBB [13], respectively. However, many diseases cannot be linked to single variants or even single genes. Most inherited or genetic diseases are more complex and likely arise from the combination of many variants spread across several genes [14].
In contrast to single base-pair variants, structural variants can involve thousands of base-pairs of DNA [15, 16]. These large-scale variants are subcategorized into segmental duplications or low-copy repeats, inversions, translocations, segmental uniparental disomy and copy number variants (CNVs) consisting of insertions, deletions, and duplications [17]. A CNV present at a frequency of at least 1% in a population is defined as a copy number polymorphism (CNP) [17]. As of August 2009, 8,410 CNV loci covering ~32% of the genome have been recorded in the online Database of Genomic Variants ([16], http://projects.tcag.ca/variation/) by The Centre for Applied Genomics (TCAG, Toronto, Canada). Many of the effects of CNVs appear to be dosage-related [18, 19]. For example, up to 16 copies of the amylase gene have been observed in individuals from populations with historically high starch diets, and increased gene copy number was correlated with greater protein concentrations in saliva [18]. Currently, genome-wide association studies (GWAS) seek to genotype hundreds of thousands of SNPs and CNPs in large, carefully phenotyped populations with the goal of associating common genetic variants with common diseases [20, 21]. Over the last three years, GWAS have been used effectively to associate hundreds of loci with over 80 diseases in thousands of individuals [21]. A disadvantage of this approach is that causative variants are detected indirectly, as many variants may be in linkage disequilibrium with the genotyped variant. As a result, several candidate genes can be identified by even high density genotyping studies [21]. In addition, these genes are often not suspected of being involved with disease, and the contribution of individual variants is low [21].
This has led to difficulty replicating the findings of even well-powered GWAS, although there have been reproducible associations with disease, most notably in type 2 diabetes [22] and Crohn’s disease [23]. To directly pinpoint exact functional variants that underlie disease, base-pair resolution of genomes from tens of thousands of affected and unaffected individuals may be necessary, a prohibitively slow and costly requirement using traditional DNA sequencing techniques. Recent advances in DNA sequencing technologies have reduced the time and financial cost required for whole genome sequencing by several orders of magnitude [24, 25]. In 2007, the first diploid genomes from single individuals were published. Both were from Caucasians of European descent, J. Craig Venter [26] and James Watson [27]. These were soon followed by the sequencing of individuals from two other ethnic groups, Yang Huanming of Han Chinese descent [28] and an anonymous individual from the Yoruban ethnic group of western Africa [29]. Since then, several individual genomes have been sequenced [30, 31], some of them privately through commercial entities such as Knome [32] and Complete Genomics [33]. Several large-scale projects are underway with the goal of sequencing the genomes of thousands of individuals (1000 Genomes Project [5], Personal Genome Project [34]). It has been estimated that all common variants (those with a population frequency of at least 1%) can be detected by sequencing only 350 individual genomes [35]. Of the 18.8 million SNPs mapped to the reference genome assembly and listed in dbSNP, 5.7 million (30%) are based at least in part on 105 genomes from 1000 Genomes Project submissions, and 1.9 million (10% of the total) are novel (submitted only by the 1000 Genomes Project) [36]. Therefore, in relatively short order, we may have knowledge of every common human SNP and a high proportion of rare variants with frequencies of 0.1 to 1% [35].
The next challenge will be to integrate these genomic data sets with carefully curated phenotypic data to extract biologically meaningful information. Well-annotated genome sequences from healthy individuals will serve as an excellent reference to diagnose and treat diseases such as cancer that commonly arise from genome aberrancies.

1.3. Cancers arise from accumulation of abnormal somatic variants in critical genes

In cancer, abnormalities of genome sequence or structure undermine normal expression and behaviour of molecules critical for maintaining cell and tissue homeostasis. These changes take forms similar to those of normal polymorphisms (single base-pair variants, copy number changes, inversions, etc.) as well as more complex structures such as chromosomal translocations. These changes are somatic as evidenced by their absence in normal cells from the same individual and each one initially affects a single cell. Rarely is a single mutation or structural change sufficient to result in malignancy [49], and a number of discrete genetic and biological changes are often necessary to develop cancer [38]. Mutations and structural alterations most often alter genes encoding proteins that regulate cell proliferation, differentiation, and programmed cell death [37]. Most cancers harbour genomic abnormalities that result in the acquisition of many if not all of the six classical “hallmarks of cancer” [38]: evasion of apoptosis, self-sufficiency in growth signals, insensitivity to antigrowth signals, limitless replicative potential, sustained angiogenesis, and tissue invasion and metastasis. Immune system evasion appears to be a “seventh hallmark” [95] as many cancers select for non-immunogenic cell variants and actively suppress immune response [95, 96].
These hallmarks are not necessarily acquired in a prescribed order; however, acquisition of one trait facilitates the acquisition of others (Figure 1.1, [38]). Mutations causally implicated in oncogenesis have been termed “driver” mutations and are distinctly different from “passenger” mutations without functional consequence [49]. Identifying driver mutations is a central goal of cancer genome analysis to further our understanding of how cancer hallmarks arise and to suggest potential targets for therapy [37, 49]. The presence of passenger mutations can confound this analysis as it can be difficult to differentiate these mutations from the driver mutations upon which cancers are dependent [49]. Sequencing candidate genes from hundreds of tumours has identified over 1800 genes commonly mutated in cancers [37] and subsequent functional validation has established many of these as true cancer drivers [49]. Somatic point mutations and small insertions and deletions (indels) have great potential to drive cancer and are attractive drug targets due to their impact on protein structure and function. The effect of sequence mutation on a cancer cell depends on the location and type of the mutation. Mutations that result in amino acid substitutions (missense or non-synonymous mutations) can lead to gain or loss of protein function by altering catalytic residues or disrupting protein structure. Multiple amino acids can be lost or changed by the introduction of a premature stop codon (nonsense mutations) or by indels that shift the protein’s codon reading frame (frame shift mutations). Even non-coding or synonymous point mutations can have an impact on protein expression due to differences in codon usage, nonsense-mediated decay of abnormal transcripts [50], distorted transcription factor binding sites, and, when mutations are located at exon splice sites, skewed exon usage [51].
Somatic structural alterations of the human genome can also drive cancer phenotypes. Accumulation of somatic copy number variants and structural rearrangements is often the result of increased genome instability and can arise due to complex mechanisms mediated by chromosome breakage and rejoining [39]. Loss of cell cycle control accelerates the acquisition of such features, and copy number alterations that support an oncogenic phenotype can rapidly become established in a population of tumour cells. Amplifications that drive cancer commonly involve oncogenes such as ERBB2, MYC, MYCN, MYCL1, EGFR, and AKT2 [39]. On the other hand, deletions often eliminate tumour suppressor genes such as PTEN, RB1, and TP53, thereby removing their regulatory effect on the cell [39]. Detection of specific somatic CNVs in cancer is used clinically to guide cancer treatment. In breast cancer, detection of HER2 amplification or overexpression is necessary for the prescription of the HER2 inhibitor trastuzumab (Herceptin, Genentech) [40] as this drug is ineffective in HER2-negative tumours and carries a risk of cardiac dysfunction and heart failure [41]. Binding of trastuzumab to HER2 primarily stimulates endocytosis of the receptor, thereby removing it from the cell surface and extinguishing receptor-initiated constitutive signalling [42]. Somatic genome rearrangements are characteristic of cancer subtypes, and translocation partner networks are characterized by a few recurrently fused genes, including MLL, BCL6, and ALK [43]. Due in part to the ease of detection by traditional cytogenetic analysis, the most commonly observed rearrangements in cancer are large-scale translocations where part of a chromosomal arm is exchanged (balanced translocation) or replaced (unbalanced translocation) with material from another chromosome [44].
As the resolution of genome technologies increases, smaller-scale events have become evident in a number of tumours including inversions, insertions, deletions, microtranslocations, and untemplated additions [45, 46]. Structural rearrangements can result in new gene constructs through the joining of catalytic and regulatory domains from different genes. As a result, the activity of one protein becomes receptive to the regulatory signals targeting another, and there is strong evidence that several gene rearrangements are early and important steps towards cancer development [43]. Specific translocations are associated with specific phenotypes [43], and while fusion proteins have historically been associated with hematological malignancies [44], transformative fusions have recently been described in solid tumours [43, 47]. Induction of fusion constructs in animal models gives rise to cancers similar to those observed in human patients, and silencing of fusion transcripts in vitro reduces cell proliferation and differentiation [43]. Targeting of fusion proteins for therapy has been highly effective in reducing tumour burden [43]. One of the first successful targeted therapies for leukemia, imatinib, was designed to inhibit the kinase domain of the BCR-ABL fusion protein and has revolutionized how this disease is treated [48]. Specific cancers have been linked to a relatively small set of cancer drivers, and there are striking examples of single base-pair positions recurrently mutated in hundreds of tumours of the same type [37]. My early thesis work sought to investigate one such example with clinical implications: two distinct mutations in the tyrosine kinase domain of the Epidermal Growth Factor Receptor (EGFR), a cell surface receptor mutated in 30% of non-small cell lung adenocarcinomas [52] and <2% of other cancers [37].

1.4.
Activating mutations and amplifications of a lung cancer oncogene, EGFR, have been associated with response to tyrosine kinase inhibitors

As early as 1980, specific histological features of lung cancers were identified, among them overexpression of EGFR (also known as HER1 or ERBB1), a cell surface receptor overexpressed in 40-80% of lung tumours [53] and implicated in control of cell growth and differentiation. EGFR is a large 170 kilodalton glycoprotein with three distinct domains: an extracellular ligand binding domain, a transmembrane domain, and an intracellular tyrosine kinase domain. Upon ligand binding, EGFR forms homo- and hetero-dimers with other receptors, often HER2 (also known as ERBB2). These multimeric complexes then autophosphorylate, leading to activation of intracellular signalling kinase cascades. This is followed by internalization of the receptor complex for recycling or destruction by the cell, thereby removing the signalling cascade stimulus. The complete EGFR signalling network is complex, and computational methods have been used to annotate its many interactions [54]. Associated pathways are involved in specific functions such as endocytosis, degradation, recycling of EGFR, small GTPase signalling, MAPK cascade, PIP signalling, cell cycle control, Ca2+ signalling, and G-Protein-Coupled-Receptor-mediated EGFR transactivation [54]. The development of therapies targeting EGFR was spurred by the discovery of EGFR overexpression in many late stage lung tumours with poor prognosis and the ability of EGFR overexpression to confer a malignant phenotype on cultured cells [53]. Health Canada and the United States Food and Drug Administration initially approved the use of two small molecules, gefitinib (Iressa from AstraZeneca) and erlotinib (Tarceva from Genentech/Roche), in second- and third-line treatment of lung cancer.
Both of these drugs are tyrosine kinase inhibitors (TKIs) that reversibly bind the ATP-binding pocket of the cytoplasmic EGFR tyrosine-kinase domain, thereby inhibiting autophosphorylation and stimulation of downstream signalling pathways, resulting in inhibition of proliferation, delayed cell cycle progression, and increased apoptosis [53]. Side-effects associated with these drugs are generally limited to skin rash and diarrhea [55], suggesting a degree of tumour-specificity unseen from treatment with conventional cytotoxic chemotherapies. In 2004, three studies found that somatic base-pair mutations in the ATP-binding pocket of the EGFR tyrosine-kinase domain correlated with dramatic reduction in tumour size as a result of treatment with gefitinib and erlotinib [56-58]. These mutations are particularly prevalent in adenocarcinomas from female non-smokers of Asian descent [52, 56-59], a particularly responsive subgroup observed in initial clinical trials of these drugs [60-62]. EGFR mutations cluster around the TKI binding site (Figure 1.2) and commonly involve an L858R amino acid substitution or in-frame deletions and substitutions of amino acids L747-T751. They are not often seen in primary tumours of other tissues (COSMIC, [37]). These mutations do not affect the stability or expression of EGFR [56], and it has been demonstrated in vitro that such mutations result in increased EGFR activity, longer activation times before receptor complex internalization, and increased sensitivity to gefitinib [56, 57]. The onset of drug resistance has been associated with the rise of a point mutation that results in an additional amino acid substitution, T790M [63], very near the site bound by gefitinib and erlotinib (Figure 1.2). This substitution increases the affinity of the kinase domain for its natural substrate, ATP, thereby reducing the inhibitory effect of TKIs [64].
The link between somatic DNA sequence mutations, altered protein function, and treatment outcome made EGFR mutation screening a potentially useful clinical tool and was considered an early harbinger of personalized medicine [65]. Other studies have questioned the strength of this correlation, however, instead finding amplification of EGFR to be a more accurate independent predictor of sensitivity to EGFR inhibitors [66-68]. A recent phase III trial has supported these observations, as patients with amplification of EGFR had significantly higher response rates to erlotinib than those without this characteristic (20% vs. 2%) [69]. Multivariate analysis revealed that only EGFR expression and increased copy number were associated with erlotinib response, and no statistically significant correlation between base-pair mutation and response was found [69]. More recently, increased HER2 copy number has been associated with response to gefitinib, and the outcome of patients positive for both EGFR and HER2 amplification was significantly better than those positive for amplification of just one of these factors [70]. Debate continues over which genetic features of lung cancer are clinically informative [71-74]. Chapter 2 documents my investigation of this issue by evaluating EGFR mutation, EGFR amplification, and HER2 amplification retrospectively in archival tumour samples from a local cohort of lung cancer patients treated with gefitinib [75].

1.5. Molecular studies of patient biopsy samples have been limited by suboptimal tissue quality and quantity

Studies of cancer cell lines have driven many fundamental discoveries in cancer research and continue to be a valuable resource for understanding cancer biology [38]. Derived initially from primary patient material, often dissected or purified tumour cells, cell lines are modified to allow them to grow in culture media independent of their original tissue microenvironment.
This capability facilitates the generation of billions of clonal daughter cells and a nearly unlimited resource for biological study. For this reason, cell-line-based studies are ideal for application of standardized assays such as high-throughput screening of therapeutic compounds, elucidating protein interactions, or systematic genetic manipulation such as gene knockdown. However, this high degree of homogeneity does not reflect actual human tumours, which are often highly heterogeneous and made up of subpopulations of cells that interact with one another and surrounding normal cells [38]. Cell lines can be replicated thousands of times over many years and, due in part to pre-existing cancer phenotypes, can acquire de novo mutations, structural rearrangements, or even gain or lose chromosomes. While individual cell lines may be clonal, parallel lines derived from a common population but maintained under differing culture conditions can have distinctly different genome alterations. For these reasons, cell lines are often an inadequate representation of cancers as they occur ‘in the wild’ [38]. Therefore, efforts to discover alterations of native cancer genomes must instead focus on primary sources of tumour material. These sources often take the form of diagnostic clinical biopsy samples or surgical tissue resections, which present a unique set of challenges to applying genome tools. My early experience studying primary lung cancer samples identified three major challenges in extracting molecular genetic information from clinical cancer specimens: 1) nucleic acid quality can be compromised by clinical tissue archival methods, 2) tumour content can be variable due to cellular heterogeneity within a tumour mass, and 3) often only small quantities of tissue are taken to minimize impact on the donor patient.
The first challenge is readily addressed by using tissue archival techniques that maintain nucleic acid integrity, such as flash freezing tissues immediately upon biopsy, or by adjusting molecular assays to compensate for degraded samples. For example, increasing the amount of DNA input to a PCR can often yield amplicons from degraded template. The second challenge can be overcome using well-developed cell purification techniques such as laser microdissection or flow cytometry. An example of metastatic tumour cells isolated by laser microdissection is shown in Chapter 2, Figure 2.2. The third problem of limited tissue quantities is not as easily addressed, as additional material often cannot be collected without risk to patient health and safety. To circumvent this problem, amplification methods have been developed to increase the amount of DNA available from a sample by several orders of magnitude. Whole genome amplification using Phi29 polymerase had been shown to have high sequence fidelity and genotype concordance before and after amplification [76-78]. However, the use of amplified material for copy number analysis had only been investigated using low resolution methods, and amplification biases had been characterized only descriptively, without statistical analysis [77, 78, 93, 94]. Current genome-wide copy number analyses make use of high-density oligonucleotide microarrays capable of querying hundreds of thousands of genome positions in a single assay (e.g. Affymetrix GeneChip Human Mapping arrays [79] and Nimblegen Whole Genome Tiling arrays [80]). However, these methods require significant quantities of genomic DNA, often not available from small samples. This limitation has restricted routine genome-wide copy number analysis of smaller cancer biopsy samples despite past success identifying novel oncogenes and tumour suppressors in larger cancer samples not requiring amplification [81-84].
To address this challenge, we sought to investigate the use of amplified DNA for genome-wide copy number analyses [85], the results of which are included in Chapter 3.

1.6. Sequencing of multiple genes in clinical sample sets is facilitated by high-throughput methods

When sample quantities are adequate, there is potential to investigate a large number of candidate genes or variants for association with disease. To uncover specific variants within disease-associated genes or to confirm variants detected using orthogonal methods, these candidates are often sequenced at base-pair resolution in multiple patient samples. Parallelizing these experiments to sequence hundreds of targets in even a handful of patient samples is technically demanding as each sample is subject to DNA preparation, PCR and sequencing reaction setup, and data analysis. Therefore, several research groups have established laboratory “pipelines” through which sets of clinical samples are standardized and subjected to a common set of protocols to generate high quality sequence data that can be compared between samples. A flexible high-throughput amplicon sequencing platform has recently been implemented at the BC Cancer Agency Genome Sciences Centre as a necessary tool to discover and validate sequence variants in a myriad of clinical samples. While helping develop this platform, I conducted a pilot study of germ-line DNA from prostate brachytherapy patients to uncover associations of DNA repair gene variants with the development of late side effects resulting from localized radiation therapy. The results of this study are included in Chapter 4, and to date represent the largest sequencing-based survey of DNA repair genes to uncover variants associated with radiosensitivity. This pipeline has since been used to validate putative somatic changes detected by second generation sequencing methods [86] including a subset of those described in Chapter 5.

1.7.
Second generation sequencing technologies have enabled whole cancer genome sequencing

Just as human genome sequencing is on the verge of becoming routine, so too is sequencing of cancer genomes. Whole cancer genome shotgun sequencing allows researchers to go beyond sequencing individual candidate genes and to comprehensively assess each base-pair position for somatic events as well as gain fine-scale copy number and structural information such as the base-pair position of a translocation breakpoint. Recently, the genome sequence of a single acute myeloid leukemia revealed this cancer to be essentially diploid and uncovered ten non-synonymous mutations in genes that would not have been candidates for resequencing based on current knowledge of cancer [87]. Similar results are being reported for a lobular breast cancer in which the genome sequence was complemented by sequencing RNA from the same sample [86]. This study and others [88-90] have illustrated the ability of transcriptome sequencing to provide quantitative gene expression and structural information including splicing isoform usage and detection of gene fusions resulting from genome rearrangements. As transcriptome data primarily aligns to annotated exons, this method is particularly well-suited to detecting expressed somatic mutations and RNA editing events not detectable in genome sequence. In the breast cancer study, approximately one third (11 of 32) of the coding somatic mutations were detectable in the transcriptome using one twentieth the number of sequencing reads [86]. Profiling cancer transcriptomes is an efficient use of sequencing capacity for coding mutation detection and provides additional transcript information not available from genome sequence data. This approach was used to profile 30 lung tumours for the study presented in Chapter 5.

1.8.
Thesis description

The objective of my graduate work has been to identify genomic variants in primary patient specimens and to evaluate sequence mutations, polymorphisms, or structural variants that may be related to treatment outcome. This thesis provides a systematic account of my research to uncover correlates of molecular information with clinical outcomes of cancer therapy, beginning with a study of single genes (Chapter 2) and ending with a transcriptome-wide survey (Chapter 5). Throughout my research, I was presented with several challenges inherent to applying genome technologies to patient material. Chapters 2 and 5 explore predictors of outcome to treatment of non-small cell lung cancer with tyrosine kinase inhibitors, first retrospectively by studying archival diagnostic tissues and then prospectively in fresh-frozen biopsy samples collected as part of a clinical trial. Chapters 3 and 4 of this thesis, while not directly studying lung cancer samples, represent projects designed to overcome problems central to studying cancer in patients. Chapter 3 illustrates a method for reliably deriving copy number and genotype information from small quantities of tissue and my findings are broadly applicable to the study of human disease including cancer. Chapter 4 explores the relationship of germline polymorphisms with side-effects induced by radiation therapy of prostate cancer. As my research has addressed several distinct aspects of cancer genomics, each chapter is written as an independent, in most cases peer-reviewed and published, unit preceded by a brief introduction relating the work back to the overall theme. In Chapter 2, I sought to investigate the relationship of gefitinib response with three putative molecular predictors, EGFR mutation, EGFR amplification, and HER2 amplification, in archival samples from a local cohort of lung cancer patients [75].
At the time that this study was conducted, it was unclear which of these somatic changes, if any, were accurate predictors of response. Even today, this debate continues [71-73], and new predictive biomarkers are urgently needed to guide lung cancer treatment. As a result of the findings of this early retrospective study, the prospective study presented in Chapter 5 was begun to find novel genomic features of lung cancer. This early experience also identified challenges inherent to working with clinical lung cancer specimens, which were subsequently addressed by the work presented in Chapters 3 and 4. Given the small quantities of tissue available from many cancer biopsies, it became apparent that expanding our investigations beyond single genes would require amplification of the limited quantities of DNA available from these samples. The work covered in Chapter 3 provides a statistical treatment of the biases induced by whole genome amplification and demonstrates that bona fide CNVs can be detected in amplified DNA. We later used this amplification method to increase the amount of DNA available from lung tumour biopsies collected prospectively for the study presented in Chapter 5. Due to a growing number of candidate mutations identified by next generation sequencing methods and an increasing number of clinical samples usable for sequence analysis, a need arose to sequence hundreds of amplicons from multiple patient samples. Therefore, I helped design and implement an amplicon sequencing pipeline for the high-throughput generation and analysis of sequence data from a wide range of clinical specimens. Chapter 4 documents the pilot project for this pipeline, in which I conducted a study of germline variants in DNA repair genes from prostate brachytherapy patients with varying degrees of radiation toxicity following treatment. 
This study identified variants of three DNA repair genes that may confer increased radiation sensitivity and, if validated in larger patient populations, may be used to identify patients likely to be intolerant of radiation therapy. The pipeline developed during this project has since been used to validate somatic variants detected in cancers using second generation sequencing methods [86], including candidate mutations identified by the study presented in Chapter 5.
In Chapter 5, I demonstrate the ability of transcriptome sequencing to simultaneously query the structure, sequence, and expression levels of transcripts expressed by a clinically selected set of lung cancers. Using this information, we sought to explain observed responses to the tyrosine kinase inhibitor erlotinib (Tarceva, Roche) in the context of integrated patterns of somatic sequence alterations, fusion transcripts, and gene expression. The results and techniques developed during the studies presented in Chapters 2-4 laid the foundation for this work. For example, the experience of working with poor quality material from archival samples in the retrospective study from Chapter 2 dictated that fresh frozen biopsies be prospectively collected for this study. As the lung tumour biopsies were collected for research as part of a clinical trial, tissue quantities were very limited and amplification strategies were used, including the method characterized in Chapter 3. Finally, a subset of putative mutations identified from this study was validated using the amplicon sequencing infrastructure implemented in Chapter 4. This chapter represents a modern medical oncogenomics project combining cutting-edge DNA sequencing technology, rigorous tumour review and purification, and standardized patient treatment, beginning from a treatment-naive state, as part of a drug trial. 
Chapter 6 provides a summary of the lessons learned from five years of cancer research and possible directions of the current lung cancer research program. I also discuss the evolution of cancer genomics in recent years, large scale cancer genomics projects that are underway now, and the potential future impact of genomics on the clinical management of cancer.
1.9. Figures
Figure 1.1 Parallel pathways of tumourigenesis. Reproduced with permission from [38]. Panel A depicts six hallmarks of cancer biology and provides examples of each as discussed by [38]. A seventh hallmark, immune system evasion, has been proposed by [95]. Panel B provides alternate sequences in which cancer hallmarks can be acquired, all of which eventually lead to a cancer phenotype. Single events may confer multiple capabilities and acquiring one ability can facilitate the acquisition of subsequent cancer hallmarks.
Figure 1.2 Representations of a crystal structure of the EGFR kinase domain in complex with erlotinib. Modified from structure published by [91] and freely accessible from the NCBI Molecular Modeling Database (MMDB ID: 20494, PDB ID: 1M17). The data used to generate this figure were downloaded from http://www.ncbi.nlm.nih.gov/Structure/mmdb/mmdbsrv.cgi?Dopt=s&uid=20494 and visualized using the Cn3D software package [92]. Both panels depict erlotinib as a purple stick-and-ball figure in complex with a space-fill (left) or ribbon (right) representation of the EGFR kinase domain. Three mutations commonly observed in lung cancer are marked in yellow. The LREA deletion (d.LREA) and L858R point mutation are commonly observed prior to treatment and have been correlated with increased sensitivity to TKIs. The T790M point mutation is often acquired as a result of treatment with TKIs and results in resistance to these drugs.
1.10. Bibliography
1. 
Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, Stange-Thomann N, Stojanovic N, Subramanian A, Wyman D, Rogers J, Sulston J, Ainscough R, Beck S, Bentley D, Burton J, Clee C, Carter N, Coulson A, Deadman R, Deloukas P, Dunham A, Dunham I, Durbin R, French L, Grafham D, Gregory S, Hubbard T, Humphray S, Hunt A, Jones M, Lloyd C, McMurray A, Matthews L, Mercer S, Milne S, Mullikin JC, Mungall A, Plumb R, Ross M, Shownkeen R, Sims S, Waterston RH, Wilson RK, Hillier LW, McPherson JD, Marra MA, Mardis ER, Fulton LA, Chinwalla AT, Pepin KH, Gish WR, Chissoe SL, Wendl MC, Delehaunty KD, Miner TL, Delehaunty A, Kramer JB, Cook LL, Fulton RS, Johnson DL, Minx PJ, Clifton SW, Hawkins T, Branscomb E, Predki P, Richardson P, Wenning S, Slezak T, Doggett N, Cheng JF, Olsen A, Lucas S, Elkin C, Uberbacher E, Frazier M, Gibbs RA, Muzny DM, Scherer SE, Bouck JB, Sodergren EJ, Worley KC, Rives CM, Gorrell JH, Metzker ML, Naylor SL, Kucherlapati RS, Nelson DL, Weinstock GM, Sakaki Y, Fujiyama A, Hattori M, Yada T, Toyoda A, Itoh T, Kawagoe C, Watanabe H, Totoki Y, Taylor T, Weissenbach J, Heilig R, Saurin W, Artiguenave F, Brottier P, Bruls T, Pelletier E, Robert C, Wincker P, Smith DR, Doucette-Stamm L, Rubenfield M, Weinstock K, Lee HM, Dubois J, Rosenthal A, Platzer M, Nyakatura G, Taudien S, Rump A, Yang H, Yu J, Wang J, Huang G, Gu J, Hood L, Rowen L, Madan A, Qin S, Davis RW, Federspiel NA, Abola AP, Proctor MJ, Myers RM, Schmutz J, Dickson M, Grimwood J, Cox DR, Olson MV, Kaul R, Raymond C, Shimizu N, Kawasaki K, Minoshima S, Evans GA, Athanasiou M, Schultz R, Roe BA, Chen F, Pan H, Ramser J, Lehrach H, Reinhardt R, McCombie WR, de la Bastide M, Dedhia N, Blocker H, Hornischer K, Nordsiek G, Agarwala R, Aravind L, 
Bailey JA, Bateman A, Batzoglou S, Birney E, Bork P, Brown DG, Burge CB, Cerutti L, Chen HC, Church D, Clamp M, Copley RR, Doerks T, Eddy SR, Eichler EE, Furey TS, Galagan J, Gilbert JG, Harmon C, Hayashizaki Y, Haussler D, Hermjakob H, Hokamp K, Jang W, Johnson LS, Jones TA, Kasif S, Kaspryzk A, Kennedy S, Kent WJ, Kitts P, Koonin EV, Korf I, Kulp D, Lancet D, Lowe TM, McLysaght A, Mikkelsen T, Moran JV, Mulder N, Pollara VJ, Ponting CP, Schuler G, Schultz J, Slater G, Smit AF, Stupka E, Szustakowski J, Thierry-Mieg D, Thierry-Mieg J, Wagner L, Wallis J, Wheeler R, Williams A, Wolf YI, Wolfe KH, Yang SP, Yeh RF, Collins F, Guyer MS, Peterson J, Felsenfeld A, Wetterstrand KA, Patrinos A, Morgan MJ, de Jong P, Catanese JJ, Osoegawa K, Shizuya H, Choi S, Chen YJ: Initial sequencing and analysis of the human genome. Nature 2001, 409(6822):860-921. 2. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA, Gocayne JD, Amanatides P, Ballew RM, Huson DH, Wortman JR, Zhang Q, Kodira CD, Zheng XH, Chen L, Skupski M, Subramanian G, Thomas PD, Zhang J, Gabor Miklos GL, Nelson C, Broder S, Clark AG, Nadeau J, McKusick VA, Zinder N, Levine AJ, Roberts RJ, Simon M, Slayman C, Hunkapiller M, Bolanos R, Delcher A, Dew I, Fasulo D, Flanigan M, Florea L, Halpern A, Hannenhalli S, Kravitz S, Levy S, Mobarry C, Reinert K, Remington K, Abu-Threideh J, Beasley E, Biddick K, Bonazzi V, Brandon R, Cargill M, Chandramouliswaran I, Charlab R, Chaturvedi K, Deng Z, Di Francesco V, Dunn P, Eilbeck K, Evangelista C, Gabrielian AE, Gan W, Ge W, Gong F, Gu Z, Guan P, Heiman TJ, Higgins ME, Ji RR, Ke Z, Ketchum KA, Lai Z, Lei Y, Li Z, Li J, Liang Y, Lin X, Lu F, Merkulov GV, Milshina N, Moore HM, Naik AK, Narayan VA, Neelam B, Nusskern D, Rusch DB, Salzberg S, Shao W, Shue B, Sun J, Wang Z, Wang A, Wang X, Wang J, Wei M, Wides R, Xiao C, Yan C, Yao A, Ye J, Zhan M, Zhang W, Zhang H, Zhao Q, Zheng L, Zhong F, Zhong W, Zhu S, Zhao S, Gilbert D, 
Baumhueter S, Spier G, Carter C, Cravchik A, Woodage T, Ali F, An H, Awe A, Baldwin D, Baden H, Barnstead M, Barrow I, Beeson K, Busam D, Carver A, Center A, Cheng ML, Curry L, Danaher S, Davenport L, Desilets R, Dietz S, Dodson K, Doup L, Ferriera S, Garg N, Gluecksmann A, Hart B, Haynes J, Haynes C, Heiner C, Hladun S, Hostin D, Houck J, Howland T, Ibegwam C, Johnson J, Kalush F, Kline L, Koduru S, Love A, Mann F, May D, McCawley S, McIntosh T, McMullen I, Moy M, Moy L, Murphy B, Nelson K, Pfannkoch C, Pratts E, Puri V, Qureshi H, Reardon M, Rodriguez R, Rogers YH, Romblad D, Ruhfel B, Scott R, Sitter C, Smallwood M, Stewart E, Strong R, Suh E, Thomas R, Tint NN, Tse S, Vech C, Wang G, Wetter J, Williams S, Williams M, Windsor S, Winn-Deen E, Wolfe K, Zaveri J, Zaveri K, Abril JF, Guigo R, Campbell MJ, Sjolander KV, Karlak B, Kejariwal A, Mi H, Lazareva B, Hatton T, Narechania A, Diemer K, Muruganujan A, Guo N, Sato S, Bafna V, Istrail S, Lippert R, Schwartz R, Walenz B, Yooseph S, Allen D, Basu A, Baxendale J, Blick L, Caminha M, Carnes-Stine J, Caulk P, Chiang YH, Coyne M, Dahlke C, Mays A, Dombroski M, Donnelly M, Ely D, Esparham S, Fosler C, Gire H, Glanowski S, Glasser K, Glodek A, Gorokhov M, Graham K, Gropman B, Harris M, Heil J, Henderson S, Hoover J, Jennings D, Jordan C, Jordan J, Kasha J, Kagan L, Kraft C, Levitsky A, Lewis M, Liu X, Lopez J, Ma D, Majoros W, McDaniel J, Murphy S, Newman M, Nguyen T, Nguyen N, Nodell M, Pan S, Peck J, Peterson M, Rowe W, Sanders R, Scott J, Simpson M, Smith T, Sprague A, Stockwell T, Turner R, Venter E, Wang M, Wen M, Wu D, Wu M, Xia A, Zandieh A, Zhu X: The sequence of the human genome. Science 2001, 291(5507):1304-1351. 3. 
Frazer KA, Ballinger DG, Cox DR, Hinds DA, Stuve LL, Gibbs RA, Belmont JW, Boudreau A, Hardenbol P, Leal SM, Pasternak S, Wheeler DA, Willis TD, Yu F, Yang H, Zeng C, Gao Y, Hu H, Hu W, Li C, Lin W, Liu S, Pan H, Tang X, Wang J, Wang W, Yu J, Zhang B, Zhang Q, Zhao H, Zhao H, Zhou J, Gabriel SB, Barry R, Blumenstiel B, Camargo A, Defelice M, Faggart M, Goyette M, Gupta S, Moore J, Nguyen H, Onofrio RC, Parkin M, Roy J, Stahl E, Winchester E, Ziaugra L, Altshuler D, Shen Y, Yao Z, Huang W, Chu X, He Y, Jin L, Liu Y, Shen Y, Sun W, Wang H, Wang Y, Wang Y, Xiong X, Xu L, Waye MM, Tsui SK, Xue H, Wong JT, Galver LM, Fan JB, Gunderson K, Murray SS, Oliphant AR, Chee MS, Montpetit A, Chagnon F, Ferretti V, Leboeuf M, Olivier JF, Phillips MS, Roumy S, Sallee C, Verner A, Hudson TJ, Kwok PY, Cai D, Koboldt DC, Miller RD, Pawlikowska L, Taillon-Miller P, Xiao M, Tsui LC, Mak W, Song YQ, Tam PK, Nakamura Y, Kawaguchi T, Kitamoto T, Morizono T, Nagashima A, Ohnishi Y, Sekine A, Tanaka T, Tsunoda T, Deloukas P, Bird CP, Delgado M, Dermitzakis ET, Gwilliam R, Hunt S, Morrison J, Powell D, Stranger BE, Whittaker P, Bentley DR, Daly MJ, de Bakker PI, Barrett J, Chretien YR, Maller J, McCarroll S, Patterson N, Pe'er I, Price A, Purcell S, Richter DJ, Sabeti P, Saxena R, Schaffner SF, Sham PC, Varilly P, Altshuler D, Stein LD, Krishnan L, Smith AV, Tello-Ruiz MK, Thorisson GA, Chakravarti A, Chen PE, Cutler DJ, Kashuk CS, Lin S, Abecasis GR, Guan W, Li Y, Munro HM, Qin ZS, Thomas DJ, McVean G, Auton A, Bottolo L, Cardin N, Eyheramendy S, Freeman C, Marchini J, Myers S, Spencer C, Stephens M, Donnelly P, Cardon LR, Clarke G, Evans DM, Morris AP, Weir BS, Tsunoda T, Mullikin JC, Sherry ST, Feolo M, Skol A, Zhang H, Zeng C, Zhao H, Matsuda I, Fukushima Y, Macer DR, Suda E, Rotimi CN, Adebamowo CA, Ajayi I, Aniagwu T, Marshall PA, Nkwodimmah C, Royal CD, Leppert MF, Dixon M, Peiffer A, Qiu R, Kent A, Kato K, Niikawa N, Adewole IF, Knoppers BM, Foster MW, Clayton EW, Watkin J, Gibbs 
RA, Belmont JW, Muzny D, Nazareth L, Sodergren E, Weinstock GM, Wheeler DA, Yakub I, Gabriel SB, Onofrio RC, Richter DJ, Ziaugra L, Birren BW, Daly MJ, Altshuler D, Wilson RK, Fulton LL, Rogers J, Burton J, Carter NP, Clee CM, Griffiths M, Jones MC, McLay K, Plumb RW, Ross MT, Sims SK, Willey DL, Chen Z, Han H, Kang L, Godbout M, Wallenburg JC, L'Archeveque P, Bellemare G, Saeki K, Wang H, An D, Fu H, Li Q, Wang Z, Wang R, Holden AL, Brooks LD, McEwen JE, Guyer MS, Wang VO, Peterson JL, Shi M, Spiegel J, Sung LM, Zacharia LF, Collins FS, Kennedy K, Jamieson R, Stewart J: A second generation human haplotype map of over 3.1 million SNPs. Nature 2007, 449(7164):851-861. 4. dbSNP Home Page [http://www.ncbi.nlm.nih.gov/projects/SNP/] 5. 1000 Genomes - Home [http://www.1000genomes.org] 6. Sachidanandam R, Weissman D, Schmidt SC, Kakol JM, Stein LD, Marth G, Sherry S, Mullikin JC, Mortimore BJ, Willey DL, Hunt SE, Cole CG, Coggill PC, Rice CM, Ning Z, Rogers J, Bentley DR, Kwok PY, Mardis ER, Yeh RT, Schultz B, Cook L, Davenport R, Dante M, Fulton L, Hillier L, Waterston RH, McPherson JD, Gilman B, Schaffner S, Van Etten WJ, Reich D, Higgins J, Daly MJ, Blumenstiel B, Baldwin J, Stange-Thomann N, Zody MC, Linton L, Lander ES, Altshuler D: A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 2001, 409(6822):928-933. 7. Schneider JA, Pungliya MS, Choi JY, Jiang R, Sun XJ, Salisbury BA, Stephens JC: DNA variability of human genes. Mech Ageing Dev 2003, 124(1):17-25. 8. Jorde LB, Wooding SP: Genetic variation, classification and 'race'. Nat Genet 2004, 36(11 Suppl):S28-33. 9. A haplotype map of the human genome. Nature 2005, 437(7063):1299-1320. 10. Zielenski J: Genotype and phenotype in cystic fibrosis. Respiration 2000, 67(2):117-133. 11. Peake I: The molecular basis of haemophilia A. Haemophilia 1998, 4(4):346-349. 12. Lillicrap D: The molecular basis of haemophilia B. Haemophilia 1998, 4(4):350-357. 13. 
Ashley-Koch A, Yang Q, Olney RS: Sickle hemoglobin (HbS) allele and sickle cell disease: a HuGE review. Am J Epidemiol 2000, 151(9):839-845. 14. Frazer KA, Murray SS, Schork NJ, Topol EJ: Human genetic variation and its contribution to complex traits. Nat Rev Genet 2009, 10(4):241-251. 15. Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Maner S, Massa H, Walker M, Chi M, Navin N, Lucito R, Healy J, Hicks J, Ye K, Reiner A, Gilliam TC, Trask B, Patterson N, Zetterberg A, Wigler M: Large-scale copy number polymorphism in the human genome. Science 2004, 305(5683):525-528. 16. Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW, Lee C: Detection of large-scale variation in the human genome. Nat Genet 2004, 36(9):949-951. 17. Feuk L, Carson AR, Scherer SW: Structural variation in the human genome. Nat Rev Genet 2006, 7(2):85-97. 18. Perry GH, Dominy NJ, Claw KG, Lee AS, Fiegler H, Redon R, Werner J, Villanea FA, Mountain JL, Misra R, Carter NP, Lee C, Stone AC: Diet and the evolution of human amylase gene copy number variation. Nat Genet 2007, 39(10):1256-1260. 19. Stranger BE, Forrest MS, Dunning M, Ingle CE, Beazley C, Thorne N, Redon R, Bird CP, de Grassi A, Lee C, Tyler-Smith C, Carter N, Scherer SW, Tavare S, Deloukas P, Hurles ME, Dermitzakis ET: Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science 2007, 315(5813):848-853. 20. A Catalog of Published Genome-Wide Association Studies [www.genome.gov/26525384] 21. Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, Manolio TA: Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci U S A 2009, 106(23):9362-9367. 22. 
Zeggini E, Weedon MN, Lindgren CM, Frayling TM, Elliott KS, Lango H, Timpson NJ, Perry JR, Rayner NW, Freathy RM, Barrett JC, Shields B, Morris AP, Ellard S, Groves CJ, Harries LW, Marchini JL, Owen KR, Knight B, Cardon LR, Walker M, Hitman GA, Morris AD, Doney AS, McCarthy MI, Hattersley AT: Replication of genome-wide association signals in UK samples reveals risk loci for type 2 diabetes. Science 2007, 316(5829):1336-1341. 23. Franke A, Balschun T, Karlsen TH, Hedderich J, May S, Lu T, Schuldt D, Nikolaus S, Rosenstiel P, Krawczak M, Schreiber S: Replication of signals from recent studies of Crohn's disease identifies previously unknown disease loci for ulcerative colitis. Nat Genet 2008, 40(6):713-715. 24. Morozova O, Marra MA: Applications of next-generation sequencing technologies in functional genomics. Genomics 2008, 92(5):255-264. 25. Holt RA, Jones SJ: The new paradigm of flow cell sequencing. Genome Res 2008, 18(6):839-846. 26. Levy S, Sutton G, Ng PC, Feuk L, Halpern AL, Walenz BP, Axelrod N, Huang J, Kirkness EF, Denisov G, Lin Y, MacDonald JR, Pang AW, Shago M, Stockwell TB, Tsiamouri A, Bafna V, Bansal V, Kravitz SA, Busam DA, Beeson KY, McIntosh TC, Remington KA, Abril JF, Gill J, Borman J, Rogers YH, Frazier ME, Scherer SW, Strausberg RL, Venter JC: The diploid genome sequence of an individual human. PLoS Biol 2007, 5(10):e254. 27. Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L, McGuire A, He W, Chen YJ, Makhijani V, Roth GT, Gomes X, Tartaro K, Niazi F, Turcotte CL, Irzyk GP, Lupski JR, Chinault C, Song XZ, Liu Y, Yuan Y, Nazareth L, Qin X, Muzny DM, Margulies M, Weinstock GM, Gibbs RA, Rothberg JM: The complete genome of an individual by massively parallel DNA sequencing. Nature 2008, 452(7189):872-876. 28. 
Wang J, Wang W, Li R, Li Y, Tian G, Goodman L, Fan W, Zhang J, Li J, Zhang J, Guo Y, Feng B, Li H, Lu Y, Fang X, Liang H, Du Z, Li D, Zhao Y, Hu Y, Yang Z, Zheng H, Hellmann I, Inouye M, Pool J, Yi X, Zhao J, Duan J, Zhou Y, Qin J, Ma L, Li G, Yang Z, Zhang G, Yang B, Yu C, Liang F, Li W, Li S, Li D, Ni P, Ruan J, Li Q, Zhu H, Liu D, Lu Z, Li N, Guo G, Zhang J, Ye J, Fang L, Hao Q, Chen Q, Liang Y, Su Y, San A, Ping C, Yang S, Chen F, Li L, Zhou K, Zheng H, Ren Y, Yang L, Gao Y, Yang G, Li Z, Feng X, Kristiansen K, Wong GK, Nielsen R, Durbin R, Bolund L, Zhang X, Li S, Yang H, Wang J: The diploid genome sequence of an Asian individual. Nature 2008, 456(7218):60-65. 29. Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR, Boutell JM, Bryant J, Carter RJ, Keira Cheetham R, Cox AJ, Ellis DJ, Flatbush MR, Gormley NA, Humphray SJ, Irving LJ, Karbelashvili MS, Kirk SM, Li H, Liu X, Maisinger KS, Murray LJ, Obradovic B, Ost T, Parkinson ML, Pratt MR, Rasolonjatovo IM, Reed MT, Rigatti R, Rodighiero C, Ross MT, Sabot A, Sankar SV, Scally A, Schroth GP, Smith ME, Smith VP, Spiridou A, Torrance PE, Tzonev SS, Vermaas EH, Walter K, Wu X, Zhang L, Alam MD, Anastasi C, Aniebo IC, Bailey DM, Bancarz IR, Banerjee S, Barbour SG, Baybayan PA, Benoit VA, Benson KF, Bevis C, Black PJ, Boodhun A, Brennan JS, Bridgham JA, Brown RC, Brown AA, Buermann DH, Bundu AA, Burrows JC, Carter NP, Castillo N, Chiara ECM, Chang S, Neil Cooley R, Crake NR, Dada OO, Diakoumakos KD, Dominguez-Fernandez B, Earnshaw DJ, Egbujor UC, Elmore DW, Etchin SS, Ewan MR, Fedurco M, Fraser LJ, Fuentes Fajardo KV, Scott Furey W, George D, Gietzen KJ, Goddard CP, Golda GS, Granieri PA, Green DE, Gustafson DL, Hansen NF, Harnish K, Haudenschild CD, Heyer NI, Hims MM, Ho JT, Horgan AM, Hoschler K, Hurwitz S, Ivanov DV, Johnson MQ, James T, Huw Jones TA, Kang GD, Kerelska TH, Kersey AD, Khrebtukova I, Kindwall AP, Kingsbury Z, Kokko-Gonzales PI, Kumar 
A, Laurent MA, Lawley CT, Lee SE, Lee X, Liao AK, Loch JA, Lok M, Luo S, Mammen RM, Martin JW, McCauley PG, McNitt P, Mehta P, Moon KW, Mullens JW, Newington T, Ning Z, Ling Ng B, Novo SM, O'Neill MJ, Osborne MA, Osnowski A, Ostadan O, Paraschos LL, Pickering L, Pike AC, Pike AC, Chris Pinkard D, Pliskin DP, Podhasky J, Quijano VJ, Raczy C, Rae VH, Rawlings SR, Chiva Rodriguez A, Roe PM, Rogers J, Rogert Bacigalupo MC, Romanov N, Romieu A, Roth RK, Rourke NJ, Ruediger ST, Rusman E, Sanches-Kuiper RM, Schenker MR, Seoane JM, Shaw RJ, Shiver MK, Short SW, Sizto NL, Sluis JP, Smith MA, Ernest Sohna Sohna J, Spence EJ, Stevens K, Sutton N, Szajkowski L, Tregidgo CL, Turcatti G, Vandevondele S, Verhovsky Y, Virk SM, Wakelin S, Walcott GC, Wang J, Worsley GJ, Yan J, Yau L, Zuerlein M, Rogers J, Mullikin JC, Hurles ME, McCooke NJ, West JS, Oaks FL, Lundberg PL, Klenerman D, Durbin R, Smith AJ: Accurate whole human genome sequencing using reversible terminator chemistry. Nature 2008, 456(7218):53-59. 30. McKernan KJ, Peckham HE, Costa G, McLaughlin S, Tsung E, Fu Y, Clouser C, Dunkan C, Ichikawa J, Lee C, Zhang Z, Sheridan A, Fu H, Ranade S, Dimilanta E, Sokolsky T, Zhang L, Hendrickson C, Li B, Kotler L, Stuart J, Malek J, Manning J, Antipova A, Perez D, Moore M, Hayashibara K, Lyons M, Beaudoin R, Coleman B, Laptewicz M, Sanicandro A, Rhodes M, De La Vega F, Gottimukkala RK, Hyland F, Reese M, Yang S, Bafna V, Bashir A, Macbride A, Aklan C, Kidd JM, Eichler EE, Blanchard AP: Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two base encoding. Genome Res 2009. 31. Ahn SM, Kim TH, Lee S, Kim D, Ghang H, Kim DS, Kim BC, Kim SY, Kim WY, Kim C, Park D, Lee YS, Kim S, Reja R, Jho S, Kim CG, Cha JY, Kim KH, Lee B, Bhak J, Kim SJ: The first Korean genome sequence and analysis: Full genome sequencing for a socio-ethnic group. Genome Res 2009. 32. Knome, Inc. | Know Thyself [http://www.knome.com/] 33. 
Complete Genomics [http://www.completegenomics.com/] 34. Personal Genome Project - Homepage [http://www.personalgenomes.org] 35. Ionita-Laza I, Lange C, N ML: Estimating the number of unseen variants in the human genome. Proc Natl Acad Sci U S A 2009, 106(13):5008-5013. 36. Genome-announce -- UCSC Genome Browser project announcements mailing list [https://lists.soe.ucsc.edu/mailman/listinfo/genome-announce] 37. Futreal PA, Coin L, Marshall M, Down T, Hubbard T, Wooster R, Rahman N, Stratton MR: A census of human cancer genes. Nat Rev Cancer 2004, 4(3):177-183. 38. Hanahan D, Weinberg RA: The hallmarks of cancer. Cell 2000, 100(1):57-70. 39. Bignell GR, Santarius T, Pole JC, Butler AP, Perry J, Pleasance E, Greenman C, Menzies A, Taylor S, Edkins S, Campbell P, Quail M, Plumb B, Matthews L, McLay K, Edwards PA, Rogers J, Wooster R, Futreal PA, Stratton MR: Architectures of somatic genomic rearrangement in human cancer amplicons at sequence-level resolution. Genome Res 2007, 17(9):1296-1303. 40. Herceptin (Trastuzumab) product insert [http://www.fda.gov/medwatch/SAFETY/2005/Herceptin_Promo_PDF_Feb_2005.pdf] 41. Seidman A, Hudis C, Pierri MK, Shak S, Paton V, Ashby M, Murphy M, Stewart SJ, Keefe D: Cardiac dysfunction in the trastuzumab clinical trials experience. J Clin Oncol 2002, 20(5):1215-1221. 42. Cho HS, Mason K, Ramyar KX, Stanley AM, Gabelli SB, Denney DW, Jr., Leahy DJ: Structure of the extracellular region of HER2 alone and in complex with the Herceptin Fab. Nature 2003, 421(6924):756-760. 43. Mitelman F, Johansson B, Mertens F: The impact of translocations and gene fusions on cancer causation. Nat Rev Cancer 2007, 7(4):233-245. 44. Mitelman F: Recurrent chromosome aberrations in cancer. Mutat Res 2000, 462(2-3):247-253. 45. 
Krzywinski M, Bosdet I, Mathewson C, Wye N, Brebner J, Chiu R, Corbett R, Field M, Lee D, Pugh T, Volik S, Siddiqui A, Jones S, Schein J, Collins C, Marra M: A BAC clone fingerprinting approach to the detection of human genome rearrangements. Genome Biol 2007, 8(10):R224. 46. Bignell GR, Huang J, Greshock J, Watt S, Butler A, West S, Grigorova M, Jones KW, Wei W, Stratton MR, Futreal PA, Weber B, Shapero MH, Wooster R: High-resolution analysis of DNA copy number using oligonucleotide microarrays. Genome Res 2004, 14(2):287-295. 47. Soda M, Choi YL, Enomoto M, Takada S, Yamashita Y, Ishikawa S, Fujiwara S, Watanabe H, Kurashina K, Hatanaka H, Bando M, Ohno S, Ishikawa Y, Aburatani H, Niki T, Sohara Y, Sugiyama Y, Mano H: Identification of the transforming EML4-ALK fusion gene in non-small-cell lung cancer. Nature 2007, 448(7153):561-566. 48. Druker BJ: Translation of the Philadelphia chromosome into therapy for CML. Blood 2008, 112(13):4808-4817. 49. Stratton MR, Campbell PJ, Futreal PA: The cancer genome. Nature 2009, 458(7239):719-724. 50. Holbrook JA, Neu-Yilik G, Hentze MW, Kulozik AE: Nonsense-mediated decay approaches the clinic. Nat Genet 2004, 36(8):801-808. 51. Cartegni L, Chew SL, Krainer AR: Listening to silence and understanding nonsense: exonic mutations that affect splicing. Nat Rev Genet 2002, 3(4):285-298. 52. Shigematsu H, Gazdar AF: Somatic mutations of epidermal growth factor receptor signaling pathway in lung cancers. Int J Cancer 2006, 118(2):257-262. 53. Laskin JJ, Sandler AB: Epidermal growth factor receptor: a promising target in solid tumours. Cancer Treat Rev 2004, 30(1):1-17. 54. Oda K MY, Funahashi A, Kitano H: A comprehensive pathway map of epidermal growth factor receptor signaling. Molecular Systems Biology, 2005, msb4100014:E1-E17. 55. 
Ranson M, Hammond LA, Ferry D, Kris M, Tullo A, Murray PI, Miller V, Averbuch S, Ochs J, Morris C, Feyereislova A, Swaisland H, Rowinsky EK: ZD1839, a selective oral epidermal growth factor receptor-tyrosine kinase inhibitor, is well tolerated and active in patients with solid, malignant tumors: results of a phase I trial. J Clin Oncol 2002, 20(9):2240-2250. 56. Lynch TJ BD, Sordella R, Gurubhagavatula S, Okimoto RA, Brannigan BW, Harris PL, Haserlat SM, Supko JG, Haluska FG, Louis DN, Christiani DC, Settleman J, Haber DA: Activating mutations in the epidermal growth factor receptor underlying responsiveness of non-small-cell lung cancer to gefitinib. N Engl J Med 2004, 350:2129-2139. 57. Paez JG JP, Lee JC, Tracy S, Greulich H, Gabriel S, Herman P, Kaye FJ, Lindeman N, Boggon TJ, Naoki K, Sasaki H, Fujii Y, Eck MJ, Sellers WR, Johnson BE, Meyerson M: EGFR mutations in lung cancer: correlation with clinical response to gefitinib therapy. Science 2004, 304:1497-1500. 58. Pao W MV, Zakowski M, Doherty J, Politi K, Sarkaria I, Singh B, Heelan R, Rusch V, Fulton L, Mardis E, Kupfer D, Wilson R, Kris M, Varmus H.: EGF receptor gene mutations are common in lung cancers from \"never smokers\" and are associated with sensitivity of tumors to gefitinib and erlotinib. Proc Natl Acad Sci USA 2004, 101(36):13306-13311. 59. Sun S, Schiller JH, Gazdar AF: Lung cancer in never smokers--a different disease. Nat Rev Cancer 2007, 7(10):778-790. 60. Fukuoka M YS, Giaccone G, Tamura T, Nakagawa K, Douillard JY, Nishiwaki Y, Vansteenkiste J, Kudoh S, Rischin D, Eek R, Horai T, Noda K, Takata I, Smit E, Averbuch S, Macleod A, Feyereislova A, Dong RP, Baselga J.: Multi-institutional randomized phase II trial of gefitinib for previously treated patients with advanced non-small-cell lung cancer (The IDEAL 1 Trial). J Clin Oncol 2003, 21(12):2237-2246. 61. 
Ho C MN, Laskin J, Melosky B, Anderson H, Bebb G.: Asian ethnicity and adenocarcinoma histology continues to predict response to gefitinib in patients treated for advanced non-small cell carcinoma of the lung in North America. Lung Cancer 2005, 49(2):225-231. 62. Kris MG NR, Herbst RS, Lynch TJ Jr, Prager D, Belani CP, Schiller JH, Kelly K, Spiridonidis H, Sandler A, Albain KS, Cella D, Wolf MK, Averbuch SD, Ochs JJ, Kay AC.: Efficacy of gefitinib, an inhibitor of the epidermal growth factor receptor tyrosine kinase, in symptomatic patients with non-small cell lung cancer: a randomized trial. JAMA 2003, 290(16):2149-2158. 63. Pao W MV, Politi KA, Riely GJ, Somwar R, Zakowski MF, Kris MG, Varmus H.: Acquired Resistance of Lung Adenocarcinomas to Gefitinib or Erlotinib Is Associated with a Second Mutation in the EGFR Kinase Domain. PLoS Med 2005, 2:e73. 64. Yun CH, Mengwasser KE, Toms AV, Woo MS, Greulich H, Wong KK, Meyerson M, Eck MJ: The T790M mutation in EGFR kinase causes drug resistance by increasing the affinity for ATP. Proc Natl Acad Sci U S A 2008, 105(6):2070-2075. 65. Green MR: Targeting targeted therapy. N Engl J Med 2004, 350(21):2191-2193. 66. Hirsch FR, Scagliotti GV, Langer CJ, Varella-Garcia M, Franklin WA: Epidermal growth factor family of receptors in preneoplasia and lung cancer: perspectives for targeted therapies. Lung Cancer 2003, 41 Suppl 1:S29-42. 67. Cappuzzo F HF, Rossi E, Bartolini S, Ceresoli GL, Bemis L, Haney J, Witta S, Danenberg K, Domenichini I, Ludovini V, Magrini E, Gregorc V, Doglioni C, Sidoni A, Tonato M, Franklin WA, Crino L, Bunn PA Jr, Varella-Garcia M: Epidermal growth factor receptor gene and protein and gefitinib sensitivity in non-small-cell lung cancer. J Natl Cancer Inst 2005, 97(9):643-655. 68. 
Hirsch FR, Varella-Garcia M, Bunn PA, Jr., Di Maria MV, Veve R, Bremmes RM, Baron AE, Zeng C, Franklin WA: Epidermal growth factor receptor in non-small-cell lung carcinomas: correlation between gene copy number and protein expression and impact on prognosis. J Clin Oncol 2003, 21(20):3798-3807. 69. Tsao MS SA, Cutz JC, Zhu CQ, Kamel-Reid S, Squire J, Lorimer I, Zhang T, Liu N, Daneshmand M, Marrano P, da Cunha Santos G, Lagarde A, Richardson F, Seymour L, Whitehead M, Ding K, Pater J, Shepherd FA: Erlotinib in lung cancer - molecular and clinical predictors of outcome. N Engl J Med 2005, 353(2):133-144. 70. Cappuzzo F V-GM, Shigematsu H, Domenichini I, Bartolini S, Ceresoli GL, Rossi E, Ludovini V, Gregorc V, Toschi L, Franklin WA, Crino L, Gazdar AF, Bunn PA Jr, Hirsch FR: Increased HER2 gene copy number is associated with response to gefitinib therapy in epidermal growth factor receptor-positive non-small-cell lung cancer patients. J Clin Oncol 2005, 23(22):5007-5018. 71. Shepherd FA: Molecular selection of patients for first-line treatment of advanced non-small-cell lung cancer with epidermal growth factor inhibitors: not quite ready for prime time. J Clin Oncol 2008, 26(15):2426-2427. 72. Takano T OY: Erlotinib in lung cancer. N Engl J Med 2005, 353(16):1739-1741. 73. Hirsch FR, Bunn PA, Jr.: EGFR testing in lung cancer is ready for prime time. Lancet Oncol 2009, 10(5):432-433. 74. Johnson BE, Janne PA: Selecting patients for epidermal growth factor receptor inhibitor treatment: A FISH story or a tale of mutations? J Clin Oncol 2005, 23(28):6813-6816. 75. Pugh TJ, Bebb G, Barclay L, Sutcliffe M, Fee J, Salski C, O'Connor R, Ho C, Murray N, Melosky B, English J, Vielkind J, Horsman D, Laskin JJ, Marra MA: Correlations of EGFR mutations and increases in EGFR and HER2 copy number to gefitinib response in a retrospective analysis of lung cancer patients. BMC Cancer 2007, 7:128. 76. Esteban JA, Salas M, Blanco L: Fidelity of phi 29 DNA polymerase. 
Comparison between protein-primed initiation and DNA polymerization. J Biol Chem 1993, 268(4):2719-2726. 77. Paez JG LM, Beroukhim R, Lee JC, Zhao X, Richter DJ, Gabriel S, Herman P, Sasaki H, Altshuler D, Li C, Meyerson M, Sellers WR.: Genome coverage and sequence fidelity of phi29 polymerase-based multiple strand displacement whole genome amplification. Nucleic Acids Res 2004, 32:e71. 78. Tzvetkov MV, Becker C, Kulle B, Nurnberg P, Brockmoller J, Wojnowski L: Genome- wide single-nucleotide polymorphism arrays demonstrate high fidelity of multiple displacement-based whole-genome amplification. Electrophoresis 2005, 26(3):710- 715. 79. Affymetrix webpage [http://www.affymetrix.com/] 80. Nimblegen webpage [http://www.nimblegen.com] 81. Weir B, Zhao X, Meyerson M: Somatic alterations in the human cancer genome. Cancer Cell 2004, 6(5):433-438. 82. Weir BA, Woo MS, Getz G, Perner S, Ding L, Beroukhim R, Lin WM, Province MA, Kraja A, Johnson LA, Shah K, Sato M, Thomas RK, Barletta JA, Borecki IB, Broderick S, Chang AC, Chiang DY, Chirieac LR, Cho J, Fujii Y, Gazdar AF, Giordano T, Greulich H, Hanna M, Johnson BE, Kris MG, Lash A, Lin L, Lindeman N, Mardis ER, McPherson JD, Minna JD, Morgan MB, Nadel M, Orringer MB, Osborne JR, Ozenberger B, Ramos AH, Robinson J, Roth JA, Rusch V, Sasaki H, Shepherd F, Sougnez C, Spitz MR, Tsao MS, Twomey D, Verhaak RG, Weinstock GM, Wheeler DA, Winckler W, Yoshizawa A, Yu S, Zakowski MF, Zhang Q, Beer DG, Wistuba, II, Watson MA, Garraway LA, Ladanyi M, Travis WD, Pao W, Rubin MA, Gabriel SB, Gibbs RA, Varmus HE, Wilson RK, Lander ES, Meyerson M: Characterizing the cancer genome in lung adenocarcinoma. Nature 2007, 450(7171):893-898. 27 83. Mullighan CG, Goorha S, Radtke I, Miller CB, Coustan-Smith E, Dalton JD, Girtman K, Mathew S, Ma J, Pounds SB, Su X, Pui CH, Relling MV, Evans WE, Shurtleff SA, Downing JR: Genome-wide analysis of genetic alterations in acute lymphoblastic leukaemia. Nature 2007, 446(7137):758-764. 84. 
Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 2008, 455(7216):1061-1068. 85. Pugh TJ, Delaney AD, Farnoud N, Flibotte S, Griffith M, Li HI, Qian H, Farinha P, Gascoyne RD, Marra MA: Impact of whole genome amplification on analysis of copy number variants. Nucleic Acids Res 2008, 36(13):e80. 86. Shah S, Morin R, Khattra J, Prentice L, Pugh TJ, Burleigh A, Delaney A, Gelmon K, Guliany R, Holt RA, Jones SJ, Sun M, Moore R, Teschendorff A, Tse K, Turashivili G, Varhol R, Warren R, Watson P, Zhao Y, Caldas C, Huntsman D, Hirst M, Marra M, Aparicio S: Mutational evolution of a lobular breast tumour, profiled by whole- transcriptome and whole-genome next generation sequencing. Submitted 2009. 87. Ley TJ, Mardis ER, Ding L, Fulton B, McLellan MD, Chen K, Dooling D, Dunford- Shore BH, McGrath S, Hickenbotham M, Cook L, Abbott R, Larson DE, Koboldt DC, Pohl C, Smith S, Hawkins A, Abbott S, Locke D, Hillier LW, Miner T, Fulton L, Magrini V, Wylie T, Glasscock J, Conyers J, Sander N, Shi X, Osborne JR, Minx P, Gordon D, Chinwalla A, Zhao Y, Ries RE, Payton JE, Westervelt P, Tomasson MH, Watson M, Baty J, Ivanovich J, Heath S, Shannon WD, Nagarajan R, Walter MJ, Link DC, Graubert TA, DiPersio JF, Wilson RK: DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature 2008, 456(7218):66-72. 88. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 2008, 5(7):621- 628. 89. Fullwood MJ, Wei CL, Liu ET, Ruan Y: Next-generation DNA sequencing of paired- end tags (PET) for transcriptome and genome analyses. Genome Res 2009, 19(4):521-532. 90. Morin R, Bainbridge M, Fejes A, Hirst M, Krzywinski M, Pugh T, McDonald H, Varhol R, Jones S, Marra M: Profiling the HeLa S3 transcriptome using randomly primed cDNA and massively parallel short-read sequencing. Biotechniques 2008, 45(1):81- 94. 91. 
Stamos J, Sliwkowski MX, Eigenbrot C: Structure of the epidermal growth factor receptor kinase domain alone and in complex with a 4-anilinoquinazoline inhibitor. J Biol Chem 2002, 277(48):46265-46272. 92. Cn3D Homepage [http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.shtml] 93. Arriola E, Lambros MB, Jones C, Dexter T, Mackay A, Tan DS, Tamber N, Fenwick K, Ashworth A, Dowsett M, Reis-Filho JS: Evaluation of Phi29-based whole-genome amplification for microarray-based comparative genomic hybridisation. Lab Invest 2007, 87(1):75-83. 94. Bredel M, Bredel C, Juric D, Kim Y, Vogel H, Harsh GR, Recht LD, Pollack JR, Sikic BI: Amplification of whole tumor genomes and gene-by-gene mapping of genomic aberrations from limited sources of fresh-frozen and paraffin-embedded DNA. J Mol Diagn 2005, 7(2):171-182. 95. Dunn GP, Old LJ, Schreiber RD. The three Es of cancer immunoediting. Annu Rev Immunol. 2004, 22:329-60. 96. Zitvogel L, Tesniere A, Kroemer G. Cancer despite immunosurveillance: immunoselection and immunosubversion. Nat Rev Immunol. 2006, 6(10):715-27. 28 Chapter 2. Correlations of EGFR mutations and increases in EGFR and HER2 copy number to gefitinib response in a retrospective analysis of lung cancer patients1 This chapter documents my first investigation of somatic mutations and gene amplifications that may predict the response of lung cancer to treatment with tyrosine kinase inhibitors. At the time this study was initiated, EGFR mutations had just been described in lung cancer and a strong correlation was observed with response to the EGFR-inhibitor gefitinib (Iressa, AstraZeneca). These initial studies were in small patient populations and subsequent studies suggested that amplification of EGFR and similar amplifications of the closely related HER2 gene were more accurate predictors of clinical outcome. 
In this chapter, I tested the ability of EGFR mutations and increases in EGFR and HER2 copy number to predict response to gefitinib in a local population of lung cancer patients. The finding that none of the features were diagnostic of response suggested that genes other than EGFR and HER2 may harbour abnormalities predictive of response. While this finding contradicted earlier reports at the time these data were published [1], similar results have since been reported, and a debate continues regarding the predictive value of these biomarkers [2].

The samples for this project were archival tissue blocks routinely used to assess cellular morphology for diagnosis; such blocks often yield degraded, chemically modified nucleic acids. For this study, I adapted methods to extract genomic information from these materials, including laser microdissection of cancer cells, assessment of DNA quality, PCR amplification of degraded DNA, and use of a published scoring system for interpreting the results of fluorescent in situ hybridization.

¹ A version of this chapter has been published. Pugh T.J., Bebb G., Barclay L., Sutcliffe M., Fee J., Salski C., O'Connor R., Ho C., Murray N., Melosky B., English J., Vielkind J., Horsman D., Laskin J.J., Marra M.A. BMC Cancer. 2007 Jul 13;7:128.

2.1. Introduction

Overall, lung cancer is the leading cause of cancer-related death in North America, with 85% of patients eventually succumbing to the disease [3]. The five-year survival rate for these cancers is low (16%) compared to other cancers [3] and there exists a major need for additional therapeutic strategies in its treatment. EGFR has been identified as a potential therapeutic target as protein over-expression is observed in 40-80% of late stage lung tumours and can confer a malignant phenotype in cultured cells [4].
Health Canada and the United States Food and Drug Administration initially approved the use of two EGFR-targeted molecules, gefitinib (“Iressa” from AstraZeneca) and erlotinib (“Tarceva” from Genentech/Roche) in the second- and third-line treatment of lung cancer. Both of these drugs were designed to reversibly bind the ATP-binding pocket of the EGFR tyrosine-kinase domain, thereby inhibiting autophosphorylation and stimulation of downstream signalling pathways, resulting in inhibition of proliferation, delayed cell cycle progression, and increased apoptosis. Despite being marketed as “EGFR tyrosine kinase inhibitors”, these drugs have affinity for 18-26 protein kinases in addition to EGFR [5, 6]. In international phase II trials, ~28% of Japanese patients responded to gefitinib versus ~10% of patients of European descent as assessed by symptom improvement and tumour shrinkage [7, 8]. These population-specific findings have suggested that responses to these drugs have genetic components, although regional environmental factors have not been discounted.

Two somatic mutations in the EGFR tyrosine-kinase domain have been correlated with reduced tumour size as a result of treatment with gefitinib [9-13]. These mutations were commonly found in patients fitting the responsive profile observed in initial and subsequent clinical studies [7, 8, 14], specifically female non-smokers of Asian descent. In a review of sixteen studies, EGFR mutations clustering around the tyrosine kinase domain ATP-binding pocket have been observed in 151 of 191 gefitinib responders (79.1%) and 11 of 19 erlotinib responders (57.9%) [15]. Confounding the model of mutation-mediated drug response is the finding that 40 of 191 gefitinib responders (20.9%) and 8 of 19 erlotinib responders (42.1%) lack EGFR mutations [15].
Conversely, EGFR mutations were seen in 40 of 355 gefitinib non-responders (11.3%) and 16 of 117 erlotinib non-responders (13.7%) [15]. These results suggested that somatic EGFR mutations are neither necessary nor sufficient for response to EGFR inhibitors. This suggestion is supported by the findings of a prospective trial of gefitinib in which 4 of 16 patients selected for tumours with EGFR mutations did not respond to gefitinib [16].

An increasing number of studies examining the tumours of patients treated with gefitinib and erlotinib have correlated increased EGFR gene copy number with response [13, 17, 18]. Data analysis from a recent phase III trial of erlotinib has supported these observations [19]. In this trial, the response rate among patient tumours with amplification of EGFR was significantly higher than those without this characteristic (20% vs. 2%) [19]. Multivariate analysis revealed that only EGFR expression and increased copy number were associated with erlotinib response, and no correlation between single-nucleotide mutations and response was found [19]. Increased HER2/Neu gene copy number has also been associated with response, particularly in the presence of increased EGFR copy number, EGFR overexpression or EGFR mutation [18]. Other studies have shown that tumours co-expressing HER2 and EGFR have a poor prognosis [20, 21], suggesting that there is a relationship between these genes that drives pathogenesis and which may be targeted by gefitinib.

Additional data are needed to explore the ability of these molecular features to predict response to EGFR-targeted tyrosine-kinase inhibitors [22, 23]. Recently our clinical collaborators at the BC Cancer Agency confirmed that Asian ethnicity predicts response to gefitinib in a Canadian setting in a population in which 38% of patients are of Asian descent [14].
To test whether gefitinib response could have been predicted by somatic EGFR mutation, EGFR amplification, or HER2 amplification, we retrospectively analyzed archival diagnostic samples from this cohort of patients.

2.2. Materials and methods

2.2.1. Patient population and assessment of response

Samples for molecular analysis were drawn from patients who received gefitinib through the Extended Access Program at the BC Cancer Agency as reported by Ho et al. [14] with ethics approval from the BC Cancer Agency Ethics Review Board. The criteria for enrolment in the program were the presence of histologically or cytologically confirmed locally advanced or metastatic NSCLC and having received prior standard systemic or radiation therapy or being ineligible for standard treatment. Patients received gefitinib following standard systemic or radiation therapy, and response was assessed radiographically according to the SWOG modification of the WHO criteria [24]. In brief, the response categories were defined as follows:
complete response (CR): a complete disappearance of disease;
partial response (PR): a decrease of >50% of the sum of the products of the maximal perpendicular dimensions of measurable lesions;
stable disease (SD): the absence of new lesions and of progression of current lesions;
progressive disease (PD): an increase of >50% of the sum of the products of the maximal perpendicular dimensions of measurable lesions, the development of new lesions, recurrence of lesions that had previously disappeared, or failure to return for evaluation because of symptomatic deterioration.

2.2.2. Laser microdissection and DNA extraction

To identify tumour cell populations for laser microdissection (LM) or manual scrape, malignant cells (cytology specimens) or tissues (paraffin embedded biopsies) were reviewed by a single reference pathologist.
Because the DNA extracted from formalin-fixed, paraffin-embedded tissue blocks is of variable quality, the DNA from these sources was characterized prior to microdissection. DNA was extracted from a full 8 micron section of each block using the “Laser-Microdissected Tissues” protocol of the QIAamp spin-column kit (QIAgen, Valencia, CA). The digestion volumes were increased five-fold and three final 30 µL elutions of TE (10:0.1) were performed. The DNA was quantified by PicoGreen assay (Invitrogen, Carlsbad, CA) and observed on a 2% agarose gel stained with ethidium bromide. For a block to qualify for LM, the presence of DNA fragments >2000 bp was required (Figure 2.1). Forty archival samples from 37 patients were suitable for LM and yielded enough DNA of sufficient quality for PCR and sequencing.

Laser microdissection of pathologist-identified cells was performed on serial sections of paraffin blocks using either an Arcturus PixCell infra-red laser-capture device or a Molecular Machines and Industries SLµCUT UV laser microdissection instrument. Dissected cells were isolated onto the adhesive caps of 1.0 mL microcentrifuge tubes (Arcturus) (Figure 2.2). Material from cytology slides was scraped with a razor blade directly into microcentrifuge tubes and DNA extracted as described above.

2.2.3. PCR and sequencing of EGFR exons 18-24

Exons 18-24, coding for the tyrosine kinase domain of EGFR, were amplified by PCR and sequenced. PCR primers were designed using human genome reference sequence acquired from the UCSC Genome Browser [25, 26] (hg17_refGene_NM_005228). Primers were designed to anneal within introns at least 40 bp away from exon splice sites using the Primer3 program [27]. Sequencing tags were added to all PCR primers for downstream sequencing and experimentally optimized for annealing temperature. The DNA sequence and annealing temperatures of all seven EGFR primer pairs are listed in Table 2-1.
PCR reactions were performed in 20 µL and consisted of: 2.0 µL 10X Pfx Amplification Buffer (Invitrogen), 0.4 µL 50 mM MgSO4 (Invitrogen), 0.4 µL 10 mM dNTPs (from 100 mM stock, Invitrogen), 1 µL each of 10 µM forward and reverse primers (Invitrogen), 2.0 µL 10X PCRx Enhancer (Invitrogen), 0.1 µL 2.5 U/µL Pfx Polymerase (Invitrogen) with 5-10 ng template and distilled water added up to the final volume. Reactions were cycled on an MJResearch Tetrad at 95°C for 5 minutes followed by 35 cycles of: 95°C for 30 seconds, annealing temperature for 15 seconds (Table 2-1), and 70°C for 2 minutes. PCR products were purified using the Ampure magnetic-bead-based PCR product purification system (Agencourt, Beverly, MA).

Sequencing of PCR products was performed with standard chemistries in use by the production sequencing team at the BC Cancer Agency Michael Smith Genome Sciences Centre. Briefly, “forward” and “reverse” 1/24X reactions contained 0.02 µL of 100 µM primer, 0.33 µL BigDye Ready Reaction Mix v3.1 (ABI), 0.4 µL 15X Big Dye Buffer (50% by volume Big Dye v3.1 Sequencing Buffer (ABI), 50% by volume Tris-EDTA), 0.02 µL distilled water, and 2 µL of purified PCR product. Reactions were cycled 50 times with annealing temperatures of 52°C for forward and 43°C for reverse sequencing primers (96°C for 10 seconds, annealing temperature for 5 seconds, 60°C for 3 minutes). All reactions were precipitated in a final concentration of 70% ethanol and 10 mM EDTA and spun at 2750 g for 30 minutes to pellet sequencing products. The pellet was washed with 30 µL of 70% ethanol and air dried before resuspension in 10 µL distilled water.
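The PCR recipe above scales linearly, so the final reagent concentrations and the volumes needed for a batch of reactions can be worked out with a few lines of arithmetic. The following is a minimal sketch only: the function names, the 1 µL template volume and the 10% pipetting overage are illustrative assumptions and are not taken from the protocol; the per-reaction volumes are those listed in the text.

```python
# Per-reaction volumes (uL) are taken from the recipe in the text.
# Function names, template volume and overage are hypothetical choices
# made for illustration only.
REACTION_VOLUME_UL = 20.0

REAGENTS_UL = {
    '10X Pfx Amplification Buffer': 2.0,
    '50 mM MgSO4': 0.4,
    '10 mM dNTPs': 0.4,
    '10 uM forward primer': 1.0,
    '10 uM reverse primer': 1.0,
    '10X PCRx Enhancer': 2.0,
    '2.5 U/uL Pfx polymerase': 0.1,
}


def final_concentration(stock, volume_ul, total_ul=REACTION_VOLUME_UL):
    '''Dilution rule C1 * V1 = C2 * V2, solved for C2 (same units as stock).'''
    return stock * volume_ul / total_ul


def master_mix(n_reactions, template_ul=1.0, overage=0.10):
    '''Return total volumes (uL) for n_reactions of one primer pair.

    Water fills each reaction up to 20 uL after reagents and template.
    Template is added per tube, so it is not part of the mix itself.
    '''
    water_ul = REACTION_VOLUME_UL - sum(REAGENTS_UL.values()) - template_ul
    if water_ul < 0:
        raise ValueError('reagents and template exceed the reaction volume')
    scale = n_reactions * (1.0 + overage)
    mix = {name: round(vol * scale, 2) for name, vol in REAGENTS_UL.items()}
    mix['distilled water'] = round(water_ul * scale, 2)
    return mix


if __name__ == '__main__':
    print('MgSO4 (mM):', final_concentration(50.0, 0.4))
    print('each primer (uM):', final_concentration(10.0, 1.0))
    for name, volume in master_mix(10).items():
        print(f'{name}: {volume} uL')
```

Under these assumptions the recipe corresponds to 1 mM MgSO4, 0.2 mM dNTPs and 0.5 µM of each primer in the final 20 µL reaction.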
Sequencing reaction products were analyzed on automated ABI 3730XL sequencers and traces analyzed using the Mutation Surveyor software package (SoftGenetics, State College, PA) and the Phred/Phrap/Consed suite [28, 29]. All sequences were compared against a reference human genome sequence (NCBI accession NM_005228.3) to identify mutations and polymorphisms. Observed known polymorphisms recorded in the Single Nucleotide Polymorphism database (dbSNP) [30, 31] were identified by appropriate ‘rs’ number. To further validate results, PCR and sequencing reactions were repeated for all samples in which an apparent mutation was observed. Correlations between clinical features and EGFR mutations were assessed using the two-sided Fisher’s exact test.

2.2.4. Copy number analysis of EGFR and HER2

To assess EGFR and HER2 copy number, fluorescent in-situ hybridization (FISH) was conducted by the BC Cancer Agency Pathology Department (authors CS and DH) using Pathvysion EGFR and HER-2 DNA Probe kits (Vysis, Downers Grove, IL). Formalin-fixed paraffin-embedded tissues were prepared in serial 6 µm sections on positively charged Colorfrost/Plus microscope slides (Fisher Scientific, Hampton, NH). One section was H&E stained and tumour populations were identified by a pathologist. Hybridization areas were marked with a diamond-tipped pencil on the back of each slide. Sections were incubated overnight at 56°C, dewaxed by exposure to xylene for 10 minutes, dehydrated in 100% ethanol for 5 minutes, and air-dried 2-4 minutes on a slide warmer set to 37-45°C. The slides were immersed in 0.2N HCl for 20 minutes, rinsed in distilled water for 10 minutes, and incubated in 1M NaSCN pre-treatment solution (Vysis) for 30 minutes at 80°C.
After rinsing with room temperature water for 3 minutes, sections were digested with pepsin (0.25 mg/mL in 0.01N HCl) for 15-18 minutes at 37°C, and rinsed with room temperature water for 5 minutes. Tissue morphology was assessed by phase contrast microscopy to ensure sufficient digestion of the collagen matrix. Slides were dehydrated with two 4-minute treatments of 100% ethanol and air-dried 2-4 minutes on a slide warmer set to 37-45°C. The EGFR/CEP7 or HER2/CEP17 probe mixture (2.5-3 µL) was applied to the hybridization area marked on the slide and covered with a glass coverslip. Edges were sealed with rubber cement. The slides were incubated at 73°C for 5 minutes then 37°C overnight to first co-denature the probe and chromosomal DNA and then allow hybridization. Rubber cemented coverslips were then removed and the slides were placed in a post-hybridization wash solution (2X SSC, 0.3% NP-40) at 72°C for 2 minutes. After rinsing the slides in 1X PBS, they were air-dried in the dark for 30-60 minutes. DAPI-1 counterstain (4 µL; Vysis) was applied to the hybridization area and a glass coverslip fixed in place.

FISH analysis was performed by counting the number of signals from each probe in forty tumour nuclei on one slide from each patient. Two approaches were used to interpret raw FISH probe counts and define gene amplification. In the first approach, the total number of EGFR or HER2 signals was divided by the total number of centromeric CEP7 or CEP17 signals and a gene/CEP ratio reported for the population of forty cells. Samples with a gene/CEP ratio ≥ 2 were defined as displaying gene amplification. The second approach applies published criteria [17] to raw FISH counts to classify patients into six strata according to the frequency of cells with specific gene copy numbers within the tumour population.
The six strata, as published by Cappuzzo et al. [17] and applied in our study, were:
1) disomy (≤ 2 copies in > 90% of cells);
2) low trisomy (≤ 2 copies in ≥ 40% of cells, 3 copies in 10%–40% of cells, ≥ 4 copies in < 10% of cells);
3) high trisomy (≤ 2 copies in ≥ 40% of cells, 3 copies in ≥ 40% of cells, ≥ 4 copies in < 10% of cells);
4) low polysomy (≥ 4 copies in 10%–40% of cells);
5) high polysomy (≥ 4 copies in ≥ 40% of cells); and
6) gene amplification (defined by presence of tight EGFR gene clusters and a ratio of EGFR gene to chromosome of ≥ 2 or ≥ 15 copies of EGFR per cell in ≥ 10% of analyzed cells).
The first approach is commonly used in practical clinical assessment of gene copy number and generally reflects the average copy number of the cell population examined. The second approach attempts to capture the degree to which gene amplification defines a cell population. This second method was published by one of the first studies to associate increased EGFR copy number with gefitinib response [17].

2.3. Results

2.3.1. Patient population

Our clinical colleagues at the BC Cancer Agency previously documented the clinical characteristics of a population of 61 patients treated with gefitinib at their clinic between April 2002 and May 2004 [14]. In that study, patients of Asian descent with adenocarcinoma displayed a preferential response to gefitinib. Diagnostic samples from 39 of these individuals were suitable for microdissection and yielded DNA of sufficient quality for PCR and sequencing and/or copy number analysis by FISH. Microdissected materials were used to avoid masking of cancer-specific features by contaminating normal material.
Figure 2.2 demonstrates the heterogeneous nature of a metastatic lung tumour and the ability of laser microdissection to separate tumour cells from surrounding normal tissue. The patient subset consisted of 23 females (59%), 17 patients of Asian descent (44%), 12 non-smokers (31%), 34 tumours of adenocarcinoma subtype (87%), and a distribution between partial response/stable disease/progressive disease of 6/14/17 (15%/33%/44%). Two patients lacked a response assessment. The clinical characteristics and molecular status of these patients are described in Table 2-2.

2.3.2. EGFR tyrosine-kinase domain mutations

We studied the DNA sequence of the EGFR tyrosine kinase domain in our patient samples as mutations in this domain were previously associated with increased gefitinib sensitivity [9-11]. In eight of thirty-eight tumours assessed we found ten non-unique mutations, five of which have been previously correlated with response (Figure 2.3). Four of these mutations were in-frame deletions or substitutions within exon 19, all of which impacted L747-A750 (Table 2-3) and retained the ATP-binding lysine moiety. All four patients with these exon 19 mutations were of Asian descent, and two of these patients were females responsive to gefitinib, of which one was a non-smoker and one had unknown smoking history. The two non-responders were non-smokers, one female and one male. We resequenced the normal tissue remaining after microdissection in two of these samples and found no mutations, consistent with previous reports that these mutations are somatic. The fifth mutation was a homozygous missense point mutation within exon 21, resulting in an L858R substitution (Table 2-4). This patient was a female non-smoker of Asian descent who did not respond to gefitinib. Three missense and two synonymous point mutations were detected in exon 20, four of which have been previously observed in other patients (Table 2-4).
One of these mutations was in a tumour from one of the drug responsive patients who also had an exon 19 deletion. The exon 20 T790M mutation previously documented to confer resistance to gefitinib [32] was not observed.

We were unable to validate the previously reported relationships between response and the presence of exon 19 mutations (p = 0.0889) or exon 21 mutations (we observed a single mutation in a non-responder). If patients exhibiting stable disease were counted among the responders (“disease control”), correlation between exon 19 deletions and response was not observed (p = 1.00). The presence of exon 19 mutations was correlated with Asian ethnicity (p = 0.0207) and non-smoking status (p = 0.0406) but not with female sex (p = 0.633) or adenocarcinoma histology (p = 1.00). When taken as a group, exon 20 mutations did not correlate with response (p = 0.0889), female sex (p = 0.633), non-smoking status (p = 1.00), Asian ethnicity (p = 1.00), adenocarcinoma subtype (p = 1.00), or disease control (p = 0.104).

2.3.3. EGFR tyrosine-kinase domain polymorphisms

We detected two previously documented single nucleotide polymorphisms (dbSNP rs10251977, rs17290643). Exon 20 harbours the synonymous G/A SNP rs10251977, while exon 23 contains the synonymous T/C SNP rs17290643. Neither of these variants results in an infrequently used codon (codons per thousand for each allele: rs10251977 34.2:12.3, rs17290643 13.1:18.9) [33]. There was no correlation between these alleles and gefitinib response in our patient population.

2.3.4. EGFR and HER2 copy number analysis

Gene copy number was assessed in our patient tumour samples as previous studies have shown a correlation between copy number increases in EGFR [13, 17, 19] or HER2 [18] and gefitinib response. Two techniques were used to interpret the FISH data for this analysis (Methods).
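The two interpretation techniques can be sketched in code to make the difference between them concrete. This is an illustrative sketch rather than the procedure actually used by the pathology laboratory: the function names and the per-nucleus input format are hypothetical, the presence of tight gene clusters is a visual call that is passed in as a flag, and the amplification rule is read here as (tight clusters and ratio ≥ 2) or (≥ 15 copies in ≥ 10% of cells), which is one possible reading of the published wording. Nuclei that fit none of the six strata as worded are reported as unclassified. The example counts are invented.

```python
# Illustrative sketch only; names, input format and the reading of the
# amplification rule are assumptions, not the study's actual code.

def gene_cep_ratio(nuclei):
    '''Approach 1: total gene signals divided by total centromere signals.

    nuclei is a list of (gene_signals, cep_signals) pairs, one per nucleus.
    '''
    gene_total = sum(gene for gene, _ in nuclei)
    cep_total = sum(cep for _, cep in nuclei)
    if cep_total == 0:
        raise ValueError('no centromere signals counted')
    return gene_total / cep_total


def amplified_by_ratio(nuclei, threshold=2.0):
    '''Approach 1 call: gene/CEP ratio at or above the threshold.'''
    return gene_cep_ratio(nuclei) >= threshold


def stratum(nuclei, tight_clusters=False):
    '''Approach 2: assign one of the six strata from per-cell gene counts.'''
    n = len(nuclei)
    if n == 0:
        raise ValueError('no nuclei counted')

    def at_least(count, percent):
        # Integer arithmetic avoids floating-point edge cases at 10% and 40%.
        return count * 100 >= percent * n

    le2 = sum(1 for gene, _ in nuclei if gene <= 2)
    eq3 = sum(1 for gene, _ in nuclei if gene == 3)
    ge4 = sum(1 for gene, _ in nuclei if gene >= 4)
    ge15 = sum(1 for gene, _ in nuclei if gene >= 15)

    # Most extreme categories are tested first.
    if (tight_clusters and gene_cep_ratio(nuclei) >= 2.0) or at_least(ge15, 10):
        return 'gene amplification'
    if at_least(ge4, 40):
        return 'high polysomy'
    if at_least(ge4, 10):
        return 'low polysomy'
    if le2 * 100 > 90 * n:
        return 'disomy'
    if at_least(le2, 40) and at_least(eq3, 40):
        return 'high trisomy'
    if at_least(le2, 40) and at_least(eq3, 10):
        return 'low trisomy'
    return 'unclassified'


def fish_positive(nuclei, tight_clusters=False):
    '''FISH+ is high polysomy or gene amplification.'''
    return stratum(nuclei, tight_clusters) in ('high polysomy', 'gene amplification')


if __name__ == '__main__':
    # Invented counts for forty nuclei: 12 cells with 9 gene copies,
    # 28 cells with 2 gene copies, 2 centromere signals in every cell.
    example = [(9, 2)] * 12 + [(2, 2)] * 28
    print(gene_cep_ratio(example), amplified_by_ratio(example))
    print(stratum(example), fish_positive(example))
```

In this invented example the gene/CEP ratio is 2.05, so the first approach calls amplification, while the second approach assigns low polysomy and therefore a FISH-negative status. A minority of cells with many copies can raise the population average above the ratio threshold without meeting the frequency-based criteria.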
Increases in EGFR copy number, defined as an EGFR/CEP7 ratio ≥ 2.0, were observed in ten of twenty-six tumours (Table 2-5). Of these ten, three also displayed increased HER2 copy number (HER2/CEP17 ratio ≥ 2.0). HER2 amplification in the absence of EGFR amplification was seen in three additional tumours. Examples of the varying degrees of amplification of these genes are shown in Figure 2.4. Increased EGFR copy number did not correlate with the presence of mutation in either EGFR exon 19 (p = 0.130) or exon 20 (p = 1.00); increased HER2 copy number (p = 0.644); sex (p = 0.457); Asian ethnicity (p = 0.688); smoking status (p = 0.380); adenocarcinoma histology (p = 0.538); or response to gefitinib (p = 1.00). When patients with stable disease were counted among the responders (“disease control”), no correlation with response was observed (p = 0.210). Likewise, increased HER2 copy number did not correlate with the presence of mutation of either EGFR exon 19 (p = 1.00) or exon 20 (p = 1.00); increased EGFR copy number (p = 0.644); sex (p = 0.160); Asian ethnicity (p = 0.645); smoking status (p = 0.351); adenocarcinoma histology (p = 1.00); gefitinib response (p = 1.00); or disease control (p = 0.114).

Tumours were also stratified by EGFR and HER2 copy number using the criteria proposed by Cappuzzo et al. [17] (Table 2-5). Seven tumours were identified as FISH+ for EGFR amplification, and four tumours were identified as FISH+ for HER2 amplification (high polysomy or gene amplification). Only one of these tumours was FISH+ for both EGFR and HER2, and this was the only sample to meet the EGFR “gene amplification” criteria as proposed by Cappuzzo et al. [17] (≥ 15 copies in ≥ 10% of cells). FISH+ status corresponded with an EGFR/CEP7 ratio ≥ 2.0 in seven of ten samples.
FISH+ status corresponded with a HER2/CEP17 ratio ≥ 2.0 in four of six samples. There was no correlation between EGFR FISH+ status and mutation of either EGFR exon 19 (p = 0.0543) or exon 20 (p = 0.283); female sex (p = 0.378); Asian ethnicity (p = 1.00); smoking status (p = 1.00); adenocarcinoma histology (p = 0.167); response to gefitinib (p = 0.552) or disease control (p = 0.653). Likewise, HER2 FISH+ did not correlate with the presence of mutation of either EGFR exon 19 (p = 1.00) or exon 20 (p = 0.544); increased EGFR copy number (p = 1.00); sex (p = 0.593); Asian ethnicity (p = 0.593); smoking status (p = 1.00); adenocarcinoma histology (p = 0.408); response to gefitinib (p = 0.437) or disease control (p = 0.239).

2.4. Discussion

In DNA sequencing studies using patient samples, contaminating normal tissue has the potential to mask tumour-specific features, particularly in cases of highly heterogeneous metastatic deposits. To examine somatic features specific to tumours, we employed laser microdissection to isolate cancer cells from surrounding normal tissue. The selectivity of this technique was demonstrated by the identification of EGFR exon 19 deletions in the tumour populations of two patient samples but not the surrounding normal tissue remaining after microdissection. As sequencing and cell isolation technologies continue to mature, there is potential to further dissect genetic heterogeneity within tumour populations, perhaps eventually to the resolution of single cells. Such efforts may uncover resistance alleles that pre-exist at low frequencies prior to treatment and then come to predominate in the tumour population as a consequence of selective pressure applied by therapy.

In the evolving area of biomarkers predictive of response to EGFR tyrosine kinase inhibitors, two hypotheses have arisen, each claiming a specific alteration of EGFR is predictive of response.
One hypothesis is that mutations within the EGFR tyrosine kinase domain targeted by these drugs are indicative of a capability to respond [9-11]. The second hypothesis is that the presence of increased gene copy number of EGFR or HER2 is a better predictor of response [17-19]. When investigating the relevance of these features to our own population of lung cancer patients treated with gefitinib, our study detected all of these features occurring both independently and coincidentally in microdissected tumour cells. Tumours from four of thirty-eight patients contained a form of the exon 19 L747-A750 deletion and one tumour harboured the exon 21 L858R point mutation. Two of the patients with exon 19 deletions were responsive to gefitinib and were also found to have increased EGFR copy number. In the remaining four responders, EGFR mutations or gene amplifications that others previously correlated with gefitinib response [9-11] were not observed. These data are consistent with the notion that tumours reliant on amplification of a mutant EGFR allele may be particularly susceptible to inhibition by gefitinib. However, responders without apparent gefitinib-sensitising EGFR alterations may have shown characteristics of response even without treatment or may have responded due to an interaction between gefitinib and a protein other than EGFR [34, 35]. To identify alternative genetic features mediating drug response, candidate genes influenced by receptor tyrosine kinase inhibitors need to be identified and studied in patients receiving these drugs.

In this study, we compared two methods of interpreting FISH data and defining increased gene copy number. One technique defined gene amplification as a gene/centromere (e.g. EGFR/CEP7) threshold ≥ 2.0, while the second technique defined “FISH+” status from the stratification of different gene/centromere ratios into varying degrees of polysomy [17].
While both of these methods identified seven tumours with EGFR amplification, the EGFR/CEP7 ratio ≥ 2 method identified an additional three tumours which were classified as \"Low Polysomy\" under the Cappuzzo criteria (Methods). While not originally designed for this purpose, we also applied Cappuzzo's criteria [17] to our HER2 FISH data. Again we saw an overlap of the samples identified by both methods as having increased HER2 copy number. However, as with EGFR, the HER2/CEP17 ratio method identified samples not captured by the stratification method but with ratios near the threshold of 2 for amplification. None of these patients responded to gefitinib. These results suggest a need for further refinement of criteria for defining amplification and may reflect the ability of FISH to define precise copy number. Our experience underscores the difficulty in capturing the heterogeneous nature of a tumour population with a single measurement. An understanding of the biological implications of EGFR gene amplification is needed to refine the predictive specificity of these tests.
2.5. Conclusion
Recently, several studies have correlated gefitinib response with either EGFR mutation [9-13] or increased EGFR copy number [13, 17-19], but the true predictive value of these features is still under debate [22, 23]. While we observed EGFR DNA sequence mutations and increases in EGFR and HER2 gene copy number in several of our specimens, we were unable to statistically correlate the presence of any of these molecular features with response. While these findings may reflect a lack of statistical power owing to our small sample size, our study differs from others in our use of a population with a large Asian component in a North American setting and enrichment of tumour cells using laser microdissection.
Even though EGFR status was not a single predictive factor of drug response in our small sample set, its assessment can increase the likelihood of selecting patients likely to respond to these drugs. To improve the sensitivity of screening for potential responders, additional features other than EGFR that mediate drug response need to be identified. Recently, activating point mutations in KRAS [36], amplification of the oncogene MET [37] and loss of tumour suppressor PTEN [38] have been found to confer resistance to EGFR inhibitors due to activation of downstream pathways independent of EGFR signalling. Therefore, permutations of regulators of EGFR signalling may lead to TKI resistance in the absence of EGFR resistance mutations [39]. In the absence of these features, selection of patients likely to respond to TKIs will continue to rely on clinical criteria including sex, histology, smoking status and ethnicity as indirect surrogates for molecular features.
This study [1] and others [40] have concluded that response to targeted small molecules cannot be explained in the context of mutation or amplification of a single gene but more likely as a spectrum of altered targets. Mutations in EGFR do not affect the binding affinity of gefitinib or erlotinib [6], suggesting that EGFR mutations are markers of a TKI-susceptible biological subtype or of generally good prognosis regardless of treatment [40], rather than of a high-affinity drug binding partner. Therefore, there may be mutations in other genes that confer a similar phenotype susceptible to TKIs. While the identification of features conferring drug response is of great utility, the characterization of non-responsive patients and drug resistance features will also contribute to an understanding of drug action. Managing or curing cancer will rely on the comprehensive detection of all somatic events within a tumour population and using this knowledge to rationally deliver targeted therapies.
2.6. 
Figures
Figure 2.1 DNA of varying quality from formalin-fixed paraffin-embedded tissues. DNA extracted from tissue blocks is often degraded and chemically modified to varying degrees due to differences in fixation method and time, storage conditions, and nature of the tissue. Diagnostic treatments such as fixation with Bouin’s solution (samples 9-11) or acid decalcification (sample 12) can result in severely degraded template unusable for PCR. Slightly (sample 1) or moderately (samples 2-8) degraded templates can be used for PCR, although additional input DNA may be necessary for robust PCR. To ensure that blocks with degraded DNA were not used in labour-intensive microdissection, DNA from whole sections was extracted and qualified on a 2% agarose gel prior to microdissection of additional sections. Blocks yielding highly degraded DNA were not used in this study.
Figure 2.2 Laser microdissection of mixed tumour and normal cell populations. Tumour cells were microdissected using a UV laser microdissection instrument (Methods) to isolate tumour cells from surrounding normal tissue. A) Uncut lymph node tissue with metastatic tumour populations outlined in yellow. Each tumour cluster contains roughly 100-200 cells. B) Normal stromal cells remaining after excision of tumour. C) Tumour cells isolated on adhesive cap.
Figure 2.3 EGFR variant detection summary. The seven exons coding for the tyrosine kinase domain of EGFR were sequenced in 37 tumours. Eight of these samples contained mutations: four with in-frame exon 19 deletions impacting L747-A750, four with a variety of exon 20 point mutations, and one with an exon 21 point mutation, L858R. Two previously documented synonymous polymorphisms were detected in this study, G2607A in exon 20 (rs10251977) and T2955C in exon 23 (rs17290643). Amino acid numbering is from the initial methionine residue of the EGFR protein isoform a (NCBI accession NP_005219).
The data from this study have since been recorded in the Catalogue of Somatic Mutations in Cancer [41].
Figure 2.4 Examples of tumours with increased gene copy number detected by FISH. Gene copy number was visualized by fluorescent in situ hybridization (FISH). Blue DAPI stain identifies the DNA present in each cell’s nucleus. Red Cy5-labelled probes hybridize to the gene region targeted by each assay (EGFR or HER2). Green Cy3-labelled probes target the centromere of the chromosome appropriate for the gene-specific assay (chromosome 7 for EGFR, chromosome 17 for HER2). The ratio reported is the number of red probes / green probes (genes/chromosome) based on an average of 40 cells. A) Tumour cells without increased EGFR copy number. B) Tumour cells with increased HER2 copy number. C) Tumour cells with “gene amplification” of EGFR.
2.7. Tables
Table 2-1 PCR primers for 7 exons of the EGFR tyrosine kinase domain
Exon | Annealing Temperature (°C) | Forward Primer Sequence | Reverse Primer Sequence | Product length including primer sequences (bp)
18 | 60 | gtgtcctggcacccaagc | ccccaccagaccatgaga | 340
19 | 60 | cagcatgtggcaccatctc | cagagcagctgccagacat | 273
20 | 60 | cattcatgcgtcttcacctg | catatccccatggcaaactc | 412
21 | 60 | agccataagtcctcgacgtg | acccagaatgtctggagagc | 372
22 | 56 | tccagagtgagttaactttttcca | ttgcatgtcagaggatataatgtaa | 277
23 | 60 | gaagcaaattgcccaagact | atttctccagggatgcaaag | 413
24 | 56 | gcaatgccatctttatcatttc | gctggcatgtgacagaacac | 281
PCR primers were designed at least 40 bp from EGFR exons coding for the tyrosine kinase domain. Sequencing tags were added to each primer to allow sequencing of the PCR products. All forward primer sequences were prefixed with a -21M13 sequencing tag, TGTAAAACGACGGCCAGT. All reverse primer sequences were prefixed with an M13R sequencing tag, CAGGAAACAGCTATGAC. -21M13 and M13R sequencing primers were then used in the corresponding sequencing reaction to generate sequences from both strands of the PCR products.
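The tagging scheme described in the notes to Table 2-1 can be illustrated with a short sketch that builds the full oligonucleotides for one exon. This is an illustration under stated assumptions rather than the primer design procedure used in this study: the helper names are hypothetical, the tag and primer sequences are copied from Table 2-1, and upper-casing the gene-specific portion is a presentational choice only.

```python
# Illustrative sketch only: helper names are hypothetical, not from the thesis.
M13_FORWARD_TAG = 'TGTAAAACGACGGCCAGT'  # -21M13 tag, from the Table 2-1 notes
M13_REVERSE_TAG = 'CAGGAAACAGCTATGAC'   # M13R tag, from the Table 2-1 notes

# Gene-specific primers for EGFR exon 19, copied from Table 2-1.
EXON19_FORWARD = 'cagcatgtggcaccatctc'
EXON19_REVERSE = 'cagagcagctgccagacat'

def tag_primer(tag, primer):
    '''Return the full oligo: sequencing tag followed by the gene-specific primer.'''
    return tag + primer.upper()

def tagged_pair(forward, reverse):
    '''Prefix the forward primer with -21M13 and the reverse primer with M13R.'''
    return tag_primer(M13_FORWARD_TAG, forward), tag_primer(M13_REVERSE_TAG, reverse)

fwd, rev = tagged_pair(EXON19_FORWARD, EXON19_REVERSE)
print(fwd)
print(rev)
```

With both tags in place, a single pair of sequencing primers can read every amplicon in the table from either strand, which is why the tags are attached to the 5' end of each gene-specific primer.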
Table 2-2 Summary of all patient clinical data and molecular status
# | Sex | Ethnicity | Smoker? | Histology | Source Tissue | Block Type¹ | Response² | EGFR Mutation | EGFR/CEP7 | HER2/CEP17 | EGFR Stratification³ | HER2 Stratification³
3 | F | Caucasian | Unknown | adeno. | Skin Nodule | Tissue Block | PD | None | | | |
6 | F | Caucasian | Y | adeno. | Lung | Tissue Block | PD | None | | | |
7 | F | Caucasian | Unknown | adeno. | Lymph Node | Tissue Block | PD | None | | | |
9 | F | Asian | N | adeno. | Cerebellum | Tissue Block | SD | Not Sequenced | 2.1 | 1.9 | High Poly. | Low Poly.
10 | M | Caucasian | Unknown | PD NSC | Lymph Node | Tissue Block | PD | None | 1.2 | 1.4 | High Trisomy | High Trisomy
11 | F | Asian | Unknown | adeno. | Lung | Tissue Block | PR | Exon 19 Del.*, Exon 20 V774L | 2.7 | 1.5 | High Poly. | High Trisomy
12 | M | Asian | N | adeno. | Lymph Node | Tissue Block | SD | None | 1.2 | 1.9 | Low Trisomy | Low Poly.
14 | F | Asian | N | adeno. | Pericardium | Tissue Block | PD | None | 1 | 1.2 | Disomy | Low Trisomy
15 | M | Asian | Y | adeno. | Lung | Tissue Block | PR | Not Sequenced | 1.1 | 1.2 | Low Trisomy | High Trisomy
20 | F | Caucasian | Unknown | adeno. | Lung | Tissue Block | PD | None | 1.1 | 1.7 | Low Trisomy | Low Poly.
21 | F | Caucasian | Y | adeno. | Lymph Node | Cytology Slide | PR | None | | | |
22 | M | Asian | N | adeno. | Lymph Node | Cytology Slide | PD | Exon 19 Del. | | | |
24 | F | Caucasian | N | adeno. | Lymph Node | Cytology Slide | SD | None | | | |
25 | F | Asian | N | adeno. | Lung | Cytology Slide | PD | Exon 19 Del. | | | |
26 | F | Caucasian | Y | adeno. | Lung | Cytology Slide | SD | None | | | |
27 | F | Caucasian | Y | SCC | Lung | Tissue Block | PD | None | 2.1 | 1.5 | High Poly. | Low Poly.
28 | M | Caucasian | Y | adeno. | Brain | Tissue Block | SD | Exon 20 G779S | 1.9 | 1.5 | Low Poly. | Low Poly.
30 | M | Asian | Y | adeno. | Brain | Tissue Block | SD | None | 1.3 | 1.2 | High Trisomy | High Trisomy
33 | M | Asian | Y | adeno. | Lung | Tissue Block | PD | None | 1.3 | 1.1 | High Trisomy | Low Trisomy
34 | M | Caucasian | Y | SCC | Lung | Tissue Block | SD | None | 17.3 | 2.6 | Gene Amp. | High Poly.
35 | F | Caucasian | Unknown | adeno. | Brain | Tissue Block | PD | Exon 20 V819V | 0.7 | 1.4 | Low Trisomy | Low Poly.
36 | M | Asian | Y | adeno. | Pleura | Tissue Block | SD | None | 2.0 | 2.0 | Low Poly. | Low Poly.
37 | M | Caucasian | Y | PD NSC | Skin Nodule | Cytology Slide | SD | None | | | |
39 | M | Caucasian | Unknown | adeno. | | Cell Block | SD | None | | | |
40 | M | Caucasian | Y | adeno. | Pleura | Cell Block | SD | None | 3.1 | 1.4 | High Poly. | Low Poly.
42 | M | Caucasian | Y | adeno. | Lymph Node | Tissue Block | Unknown | None | 2.1 | 2.3 | Low Poly. | High Poly.
43 | F | Caucasian | Y | adeno. | Lymph Node | Tissue Block | PD | None | 2.7 | 0.9 | High Poly. | Low Trisomy
44 | F | Asian | N | adeno. | Pleura | Tissue Block | PR | Exon 20 S768I, Exon 20 L815L | 1.9 | 2.9 | Low Poly. | High Poly.
47 | F | Asian | N | adeno. | Lung | Tissue Block | PD | Exon 21 L858R | 1.2 | 1.4 | Low Trisomy | Low Poly.
48 | F | Asian | N | adeno. | Lung | Tissue Block | PD | None | 1.3 | 0.8 | High Trisomy | Disomy
51 | F | Caucasian | Y | LCC | Lymph Node | Cytology Slide | PD | None | | | |
52 | F | Asian | N | adeno. | Lymph Node | Cytology Slide | PR | None | | | |
56 | M | Caucasian | Y | adeno. | Lung | Tissue Block | Unknown | None | 1.5 | 2.0 | High Trisomy | Low Poly.
57 | M | Caucasian | Y | adeno. | Lymph Node | Tissue Block | SD | None | 1.6 | 2.2 | Low Poly. | High Poly.
59 | M | Caucasian | Y | adeno. | Skin Nodule | Tissue Block | PD | None | 1.1 | 1.4 | Low Trisomy | High Trisomy
60 | F | Asian | N | adeno. | Lymph Node | Tissue Block | SD | None | 1.1 | 1.7 | Low Trisomy | Low Poly.
61 | F | Asian | Y | adeno. | Lung | Cytology Slide | PD | None | | | |
64 | F | Caucasian | Y | Pre Rx: adeno. | Lymph Node | Tissue Block | - | None | 1.0 | 1.2 | Low Trisomy | High Trisomy
64 | F | Caucasian | Y | Post Rx: adeno. | Pericardium | Tissue Block | SD | None | 2.2 | 1.3 | Low Poly. | Low Trisomy
66 | F | Asian | N | adeno. | Lung | Tissue Block | PR | Exon 19 Del.* | 2.9 | 1.2 | High Poly. | Low Trisomy
Table 2-3 EGFR exon 19 deletions/substitution
 | | | | | | I K E L R E A T S P K | a.a.
# | Sex | Ethnicity | Smoking Status | Source Tissue | Response¹ | TCAAGGAATTAAGAGAAGCAACATCTCCGAA | CDS
11 | F | Asian | Unknown | Lung | PR | TCAA---------------AACATCTCCGAA | Het²
22 | M | Asian | N | Lymph Node | PD | TCAAGGAA-----C-------CATCTCCGAA | Het
25 | F | Asian | N | Lung | PD | TCAAGGAA---------------TCTCCGAA | Del
66 | F | Asian | N | Lung | PR | TCAA---------------AACATCTCCGAA | Het²
Deletions of L747-A750 were detected in EGFR exon 19. All samples were classified as adenocarcinoma based on histology. Deleted bases are indicated by “-”.
In the case of patient #22, thirteen deleted bases were replaced by a single ‘C’, thereby retaining the reading frame. In all cases, the ATP-binding residue K745 was retained. In the case of patients #11 and #66, a synonymous codon change results from the deletion (AAG>AAA) and the K745 ATP-binding residue is unchanged.
¹ Response as measured radiographically and defined by SWOG modification of the WHO criteria [24]. PD = progressive disease, SD = stable disease, PR = partial response.
² No mutations detected in normal tissue remaining after microdissection.
Table 2-4 EGFR point mutations
# | Sex | Ethnicity | Smoking Status | Source Tissue | Response¹ | Exon | CDS Mutation | Amino Acid | Previously Documented
44 | F | Asian | N | Pleura | PR | 20 | G2549>TT | S768I | [42-45]
44 | F | Asian | N | Pleura | PR | 20 | C2691>CT | L815L | none
11 | F | Asian | Unknown | Lung | PR | 20 | G2566>TT² | V774L | V774M [45, 46]
28 | M | Caucasian | Y | Brain | SD | 20 | G2581>AG | G779S | G779F [46]
35 | F | Caucasian | Unknown | Brain | SD | 20 | G2703>GA | V819V | [47]
47 | F | Asian | N | Lung | SD | 21 | T2573>GG | L858R | [9-11, 17, 19, 45, 48, 49]
Point mutations detected in EGFR exons 20 and 21. All samples were classified as adenocarcinoma based on histology. Point mutations altering V774 and G779 have been previously documented to result in amino acid substitutions different from those found in this study.
¹ Response as measured radiographically and defined by SWOG modification of the WHO criteria [24]. PD = progressive disease, SD = stable disease, PR = partial response.
² No mutations detected in normal tissue remaining after microdissection.
Table 2-5 EGFR and HER2 copy number alterations
# | Sex | Ethnicity | Smoking Status | Histology | Source Tissue | Block Type¹ | Response² | EGFR Mutation | EGFR:CEP7 | HER2:CEP17 | EGFR Stratification³ | HER2 Stratification³
9 | F | Asian | N | adeno | Cerebellum | Tissue Block | SD | Not Sequenced | 2.1 | 1.9 | High Poly | Low Poly
11 | F | Asian | Unknown | adeno | Lung | Tissue Block | PR | Exon 19 Del⁴, Exon 20 V774L | 2.7 | 1.5 | High Poly | High Tri.
27 | F | Caucasian | Y | SCC | Lung | Tissue Block | PD | None | 2.1 | 1.5 | High Poly | Low Poly
34 | M | Caucasian | Y | SCC | Lung | Tissue Block | SD | None | 17.3 | 2.6 | Gene Amp. | High Poly
36 | M | Asian | Y | adeno | Pleura | Tissue Block | SD | None | 2.0 | 2.0 | Low Poly | Low Poly
40 | M | Caucasian | Y | adeno | Pleura | Cell Block | SD | None | 3.1 | 1.4 | High Poly | Low Poly
42 | M | Caucasian | Y | adeno | Lymph Node | Tissue Block | Unknown | None | 2.1 | 2.3 | Low Poly | High Poly
43 | F | Caucasian | Y | adeno | Lymph Node | Tissue Block | PD | None | 2.7 | 0.9 | High Poly | Low Tri.
44 | F | Asian | N | adeno | Pleura | Tissue Block | PR | Exon 20 S768I, Exon 20 L815L | 1.9 | 2.9 | Low Poly | High Poly
56 | M | Caucasian | Y | adeno | Lung | Tissue Block | Unknown | None | 1.5 | 2.0 | High Tri. | Low Poly
57 | M | Caucasian | Y | adeno | Lymph Node | Tissue Block | SD | None | 1.6 | 2.2 | Low Poly | High Poly
64 | F | Caucasian | Y | Pre Rx: adeno | Lymph Node | Tissue Block | - | None | 1.0 | 1.2 | Low Tri. | High Tri.
64 | F | Caucasian | Y | Post Rx: adeno | Pericardium | Tissue Block | SD | None | 2.2 | 1.3 | Low Poly | Low Tri.
66 | F | Asian | N | adeno | Lung | Tissue Block | PR | Exon 19 Del⁴ | 2.9 | 1.2 | High Poly | Low Tri.
Patient data provided for samples displaying increased EGFR or HER2 copy number (Probe:CEP ratio ≥ 2.0) or identified as FISH+ (High Polysomy or Gene Amplification).
¹ Source of patient material (Tissue Block = microdissected formalin-fixed paraffin-embedded tissue block; Cell Block = whole section or microdissected formalin-fixed paraffin-embedded cell block; Cytology = scraped cytology slide).
² Response as measured radiographically and defined by SWOG modification of the WHO criteria [24]. PD = progressive disease, SD = stable disease, PR = partial response.
³ Copy number stratification as proposed by Cappuzzo et al. [17].
(Disomy = ≤ 2 copies in > 90% of cells; Low Trisomy = ≤ 2 copies in ≥ 40% of cells, 3 copies in 10-40% of cells, ≥ 4 copies in < 10% of cells; High Trisomy = ≤ 2 copies in ≥ 40% of cells, 3 copies in ≥ 40% of cells, ≥ 4 copies in < 10% of cells; Low Polysomy = ≥ 4 copies in 10-40% of cells; High Polysomy = ≥ 4 copies in ≥ 40% of cells; Gene Amplification = ≥ 15 copies in ≥ 10% of cells)
⁴ No mutations detected in normal tissue remaining after microdissection.
2.8. Bibliography
1. Pugh TJ, Bebb G, Barclay L, Sutcliffe M, Fee J, Salski C, O'Connor R, Ho C, Murray N, Melosky B et al: Correlations of EGFR mutations and increases in EGFR and HER2 copy number to gefitinib response in a retrospective analysis of lung cancer patients. BMC Cancer 2007, 7:128.
2. Bonomi PD, Buckingham L, Coon J: Selecting patients for treatment with epidermal growth factor tyrosine kinase inhibitors. Clin Cancer Res 2007, 13(15 Pt 2):s4606-4612.
3. Damaraju S, Murray D, Dufour J, Carandang D, Myrehaug S, Fallone G, Field C, Greiner R, Hanson J, Cass CE et al: Association of DNA repair and steroid metabolism gene polymorphisms with clinical late toxicity in patients treated with conformal radiotherapy for prostate cancer. Clin Cancer Res 2006, 12(8):2545-2554.
4. Laskin JJ, Sandler AB: Epidermal growth factor receptor: a promising target in solid tumours. Cancer Treat Rev 2004, 30:1-17.
5. Brehmer D, Greff Z, Godl K, Blencke S, Kurtenbach A, Weber M, Muller S, Klebl B, Cotten M, Keri G et al: Cellular targets of gefitinib. Cancer Res 2005, 65(2):379-382.
6. Fabian MA, Biggs WH, 3rd, Treiber DK, Atteridge CE, Azimioara MD, Benedetti MG, Carter TA, Ciceri P, Edeen PT, Floyd M et al: A small molecule-kinase interaction map for clinical kinase inhibitors. Nat Biotechnol 2005, 23(3):329-336.
7. 
Fukuoka M, Yano S, Giaccone G, Tamura T, Nakagawa K, Douillard JY, Nishiwaki Y, Vansteenkiste J, Kudoh S, Rischin D et al: Multi-institutional randomized phase II trial of gefitinib for previously treated patients with advanced non-small-cell lung cancer (The IDEAL 1 Trial). J Clin Oncol 2003, 21(12):2237-2246.
8. Kris MG, Natale RB, Herbst RS, Lynch TJ, Prager D, Belani CP, Schiller JH, Kelly K, Spiridonidis H, Sandler A et al: Efficacy of gefitinib, an inhibitor of the epidermal growth factor receptor tyrosine kinase, in symptomatic patients with non-small cell lung cancer: a randomized trial. JAMA 2003, 290(16):2149-2158.
9. Paez JG, Janne PA, Lee JC, Tracy S, Greulich H, Gabriel S, Herman P, Kaye F, Lindeman N, Boggon TJ et al: EGFR mutations in lung cancer: correlation with clinical response to gefitinib therapy. Science 2004, 304:1497-1500.
10. Lynch TJ, Bell DW, Sordella R, Gurubhagavatula S, Okimoto RA, Brannigan BW, Harris PL, Haserlat SM, Supko JG, Haluska FG et al: Activating mutations in the epidermal growth factor receptor underlying responsiveness of non-small-cell lung cancer to gefitinib. N Engl J Med 2004, 350:2129-2139.
11. Pao W, Miller V, Zakowski M, Doherty J, Politi K, Sarkaria I, Singh B, Heelan R, Rusch V, Fulton L et al: EGF receptor gene mutations are common in lung cancers from \"never smokers\" and are associated with sensitivity of tumors to gefitinib and erlotinib. Proc Natl Acad Sci USA 2004, 101(36):13306-13311.
12. Taron M, Ichinose Y, Rosell R, Mok T, Massuti B, Zamora L, Mate JL, Manegold C, Ono M, Queralt C et al: Activating mutations in the tyrosine kinase domain of the epidermal growth factor receptor are associated with improved survival in gefitinib-treated chemorefractory lung adenocarcinomas. Clin Cancer Res 2005, 11(16):5878-5885.
13. 
Takano T, Ohe Y, Sakamoto H, Tsuta K, Matsuno Y, Tateishi U, Yamamoto S, Nokihara H, Yamamoto N, Sekine I et al: Epidermal growth factor receptor gene mutations and increased copy numbers predict gefitinib sensitivity in patients with recurrent non-small-cell lung cancer. J Clin Oncol 2005, 23(28):6829-6837.
14. Ho C, Murray N, Laskin J, Melosky B, Anderson H, Bebb G: Asian ethnicity and adenocarcinoma histology continues to predict response to gefitinib in patients treated for advanced non-small cell carcinoma of the lung in North America. Lung Cancer 2005, 49(2):225-231.
15. Giaccone G, Rodriguez JA: EGFR inhibitors: what have we learned from the treatment of lung cancer? Nat Clin Pract Oncol 2005, 2(11):554-561.
16. Inoue A, Suzuki T, Fukuhara T, Maemondo M, Kimura Y, Morikawa N, Watanabe H, Saijo Y, Nukiwa T: Prospective phase II study of gefitinib for chemotherapy-naive patients with advanced non-small-cell lung cancer with epidermal growth factor receptor gene mutations. J Clin Oncol 2006, 24(21):3340-3346.
17. Cappuzzo F, Hirsch FR, Rossi E, Bartolini S, Ceresoli GL, Bemis L, Haney J, Witta S, Danenberg K, Domenichini I et al: Epidermal growth factor receptor gene and protein and gefitinib sensitivity in non-small-cell lung cancer. J Natl Cancer Inst 2005, 97(9):643-655.
18. Cappuzzo F, Varella-Garcia M, Shigematsu H, Domenichini I, Bartolini S, Ceresoli G, Rossi E, Ludovini V, Gregorc V, Toschi L et al: Increased HER2 gene copy number is associated with response to gefitinib therapy in epidermal growth factor receptor-positive non-small-cell lung cancer patients. J Clin Oncol 2005, 23(22):5007-5018.
19. Tsao MS, Sakurada A, Cutz JC, Zhu CQ, Kamel-Reid S, Squire J, Lorimer I, Zhang T, Liu N, Daneshmand M et al: Erlotinib in lung cancer - molecular and clinical predictors of outcome. N Engl J Med 2005, 353(2):133-144.
20. 
Brabender J, Danenberg KD, Metzger R, Schneider PM, Park J, Salonga D, Holscher AH, Danenberg PV: Epidermal growth factor receptor and HER2-neu mRNA expression in non-small cell lung cancer is correlated with survival. Clin Cancer Res 2001, 7(7):1850-1855.
21. Onn A, Correa AM, Gilcrease M, Isobe T, Massarelli E, Bucana CD, O'Reilly MS, Hong WK, Fidler IJ, Putnam JB et al: Synchronous overexpression of epidermal growth factor receptor and HER2-neu protein is a predictor of poor outcome in patients with stage I non-small cell lung cancer. Clin Cancer Res 2004, 10(1 Pt 1):136-143.
22. Johnson BE, Janne PA: Selecting patients for epidermal growth factor receptor inhibitor treatment: A FISH story or a tale of mutations? J Clin Oncol 2005, 23(28):6813-6816.
23. Shepherd FA, Tsao MS: Unraveling the mystery of prognostic and predictive factors in epidermal growth factor receptor therapy. J Clin Oncol 2006, 24(7):1219-1220; author reply 1220-1211.
24. Green S, Weiss GR: Southwest Oncology Group standard response criteria, endpoint definitions and toxicity criteria. Invest New Drugs 1992, 10(4):239-253.
25. Kent W, Sugnet C, Furey T, Roskin K, Pringle T, Zahler A, Haussler D: The Human Genome Browser at UCSC. Genome Res 2002, 12(6):996-1006.
26. McDonald DM, Munn L, Jain RK: Vasculogenic mimicry: how convincing, how novel, and how significant? Am J Pathol 2000, 156(2):383-388.
27. Rozen S, Skaletsky H: Primer3 on the WWW for general users and for biologist programmers. In: Bioinformatics Methods and Protocols: Methods in Molecular Biology. Edited by Krawetz S MS. Totowa, NJ: Humana Press; 2000: 365-386.
28. Ewing B, Green P: Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res 1998, 8(3):186-194.
29. Gordon D, Abajian C, Green P: Consed: a graphical tool for sequence finishing. Genome Res 1998, 8(3):195-202.
30. 
Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K: dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 2001, 29(1):308-311.
31. Salgia R, Skarin AT: Molecular abnormalities in lung cancer. J Clin Oncol 1998, 16(3):1207-1217.
32. Pao W, Miller VA, Politi KA, Riely GJ, Somwar R, Zakowski MF, Kris MG, Varmus H: Acquired Resistance of Lung Adenocarcinomas to Gefitinib or Erlotinib Is Associated with a Second Mutation in the EGFR Kinase Domain. PLoS Med 2005, 2:e73.
33. Nakamura Y: Codon usage table (Homo sapiens). 2009.
34. Brehmer D, Greff Z, Godl K, Blencke S, Kurtenbach A, Weber M, Muller S, Klebl B, Cotten M, Keri G et al: Cellular targets of gefitinib. Cancer Res 2005, 65(2):379-382.
35. Fabian MA, Biggs WH, Treiber DK, Atteridge CE, Azimioara MD, Benedetti MG, Carter TA, Ciceri P, Edeen PT, Floyd M et al: A small molecule-kinase interaction map for clinical kinase inhibitors. Nat Biotechnol 2005, 23(3):329-336.
36. Pao W, Wang TY, Riely GJ, Miller VA, Pan Q, Ladanyi M, Zakowski MF, Heelan RT, Kris MG, Varmus HE: KRAS mutations and primary resistance of lung adenocarcinomas to gefitinib or erlotinib. PLoS Med 2005, 2(1):e17.
37. Engelman JA, Zejnullahu K, Mitsudomi T, Song Y, Hyland C, Park JO, Lindeman N, Gale CM, Zhao X, Christensen J et al: MET amplification leads to gefitinib resistance in lung cancer by activating ERBB3 signaling. Science 2007, 316(5827):1039-1043.
38. Sos ML, Koker M, Weir BA, Heynck S, Rabinovsky R, Zander T, Seeger JM, Weiss J, Fischer F, Frommolt P et al: PTEN loss contributes to erlotinib resistance in EGFR-mutant lung cancer by activation of Akt and EGFR. Cancer Res 2009, 69(8):3256-3261.
39. Janne PA: Challenges of detecting EGFR T790M in gefitinib/erlotinib-resistant tumours. Lung Cancer 2008, 60 Suppl 2:S3-9.
40. Shepherd FA: Molecular selection of patients for first-line treatment of advanced non-small-cell lung cancer with epidermal growth factor inhibitors: not quite ready for prime time. 
J Clin Oncol 2008, 26(15):2426-2427.
41. Futreal PA, Coin L, Marshall M, Down T, Hubbard T, Wooster R, Rahman N, Stratton MR: A census of human cancer genes. Nat Rev Cancer 2004, 4(3):177-183.
42. Eberhard DA, Johnson BE, Amler LC, Goddard AD, Heldens SL, Herbst RS, Ince WL, Jänne PA, Januario T, Johnson DH et al: Mutations in the epidermal growth factor receptor and in KRAS are predictive and prognostic indicators in patients with non-small-cell lung cancer treated with chemotherapy alone and in combination with erlotinib. J Clin Oncol 2005, 23(25):5900-5909.
43. Huang SF, Liu HP, Li LH, Ku YC, Fu YN, Tsai HY, Chen YT, Lin YF, Chang WC, Kuo HP et al: High frequency of epidermal growth factor receptor mutations with complex patterns in non-small cell lung cancers related to gefitinib responsiveness in Taiwan. Clin Cancer Res 2004, 10(24):8195-8203.
44. Kosaka T, Yatabe Y, Endoh H, Kuwano H, Takahashi T, Mitsudomi T: Mutations of the epidermal growth factor receptor gene in lung cancer: biological and clinical implications. Cancer Res 2004, 64(24):8919-8923.
45. Shigematsu H, Lin L, Takahashi T, Nomura M, Suzuki M, Wistuba II, Fong KM, Lee H, Toyooka S, Shimizu N et al: Clinical and biological features associated with epidermal growth factor receptor gene mutations in lung cancers. J Natl Cancer Inst 2005, 97(5):339-346.
46. Yang SH, Mechanic LE, Yang P, Landi MT, Bowman ED, Wampfler J, Meerzaman D, Hong KM, Mann F, Dracheva T et al: Mutations in the Tyrosine Kinase Domain of the Epidermal Growth Factor Receptor in Non-Small Cell Lung Cancer. Clin Cancer Res 2005, 11:2106-2110.
47. Su MC, Lien HC, Jeng YM: Absence of epidermal growth factor receptor exon 18-21 mutation in hepatocellular carcinoma. Cancer Lett 2005, 224(1):117-121.
48. 
Bell DW, Lynch TJ, Haserlat SM, Harris PL, Okimoto RA, Brannigan BW, Sgroi DC, Muir B, Riemenschneider MJ, Iacona RB et al: Epidermal growth factor receptor mutations and gene amplification in non-small-cell lung cancer: molecular analysis of the IDEAL/INTACT gefitinib trials. J Clin Oncol 2005, 23(31):8081-8092.
49. Marchetti A, Martella C, Felicioni L, Barassi F, Salvatore S, Chella A, Camplese PP, Iarussi T, Mucilli F, Mezzetti A et al: EGFR mutations in non-small-cell lung cancer: analysis of a large series of cases and development of a rapid and sensitive method for diagnostic screening with potential implications on pharmacologic treatment. J Clin Oncol 2005, 23:857-865.
Chapter 3. Impact of whole genome amplification on analysis of copy number variants²
Genome analyses of primary cancer samples are often limited by the amount of tumour tissue available for study. The work outlined in the previous chapter was limited to the analysis of seven target amplicons due, in part, to limited quantities of DNA available from clinical lung biopsy samples. This problem is not limited to the study of lung cancer and is an issue in many tumour settings, particularly those from rare or specially-treated cancers. Clinical biopsy materials are often very precious due to the difficulty in obtaining them and the rich clinical data with which they are associated. As the scope of genome analyses continues to grow, there is an increasing demand for large quantities of nucleic acids from these sources, particularly as tissues are being collected specifically for research more routinely. This section of the thesis characterizes a technique for amplification of DNA from limited quantities of clinical material and the use of amplified product for genome-wide copy number analysis. This technique makes use of Phi29 DNA polymerase primed using random hexamers to replicate more than a million genome equivalents from an input of only a thousand.
To characterize systematic bias induced by this technique and to evaluate the ability to use amplified material for SNP and copy number analysis, we performed a microarray-based analysis of pre- and post-amplification pairs. This study showed that whole genome amplification (WGA) induces hundreds of copy number variant artifacts that can obscure bona fide copy number variants. However, these artifacts are systematic and correlate with GC content and proximity to chromosome ends. Pair-wise comparison in which amplified samples are compared to amplified samples can correct for these biases and restores the ability to distinguish real copy number variants from false positives arising from technical artifacts. Genotype concordance before and after amplification was high (>98%) and the effects of WGA-induced bias were not a significant contributor to non-concordance. Armed with knowledge of the biases induced by this technique and a proven method to resolve real copy number variants from WGA material, we have since used WGA to amplify DNA from several clinical sources, including some lung tumour biopsies containing a few thousand tumour cells collected and sequenced for the work outlined in Chapter 5.
² A version of this chapter has been published. Pugh T.J., Delaney A.D., Farnoud N., Flibotte S., Griffith M., Li H.I., Qian H., Farinha P., Gascoyne R.D., Marra M.A. Nucleic Acids Res. 2008 Aug;36(13):e80.
3.1. Introduction
Initial analysis of the human genome identified single nucleotide polymorphisms (SNPs) as the primary source of genotypic and phenotypic variation among humans. However, subsequent studies identified larger-scale copy number variants that apparently impacted millions of nucleotides [1-6]. These larger-scale variants included polymorphic deletions and duplications that are present in >1% of the population and therefore meet the traditional definition of polymorphism [2].
As of August 2009, 8,410 copy number variant loci impacting over 911 Mbp of DNA sequence (~32% of the genome) had been identified and listed in the Database of Genomic Variants (http://projects.tcag.ca/variation/). Copy number variants are also features of several human diseases, including Alzheimer disease [7], Cri du chat syndrome [8], mental retardation [9], and cancer [10, 11]. For example, somatic gene amplification is a common mechanism of oncogene overexpression in lung cancer (EGFR) [12, 13] and breast cancer (HER2) [14, 15], resulting in upregulation of cell signalling pathways including the prosurvival PI3K/Akt and mitogenic MAPK pathways [16-18]. A database of pathogenic copy number variants has been created with the goal of linking specific variants to disease (Database of Chromosomal Imbalance and Phenotype in Humans using Ensembl Resources, https://decipher.sanger.ac.uk). As robust array-based methods for copy number detection continue to mature, increasing numbers of these variants are being identified [2].
Current whole-genome methods to detect copy number variants require relatively large input quantities of DNA that are difficult or impossible to obtain from rare cell populations such as cancer biopsies and microdissected tissues. To address this challenge, whole genome amplification (WGA) techniques were developed that increase the amount of DNA for analysis. For example, multiple-strand displacement amplification (MDA) using Phi29 DNA polymerase was used to generate microgram quantities of high molecular weight DNA (>30 kb) from nanograms of high quality input material [19, 20]. A recent report described a protocol for amplification of picogram quantities of DNA from single cells [21], further expanding the applications for this technique. The replication fidelity of WGA techniques has been investigated [22-27].
Estimates of base-pair incorporation errors resulting from Phi29-mediated amplification have ranged from 2.2x10^-5 [28] to 9.5x10^-6 [23], and the concordance of genotypes between unamplified and amplified samples was reported to be >99.8% [23, 26]. Recurrent WGA-induced copy number biases were observed in previous studies [22-27] and were associated with sequence repeats and proximity to chromosome ends [24-27], increased GC content [24, 27], and annotated copy number variants [24]. Many of these associations were explored descriptively without statistical analysis, and there was no consensus on the 92 recurrent regions of bias explicitly defined by three of these studies [23, 24, 27]. A recent study of 532 samples subjected to WGA and subsequent analysis using a relatively low-density Affymetrix 10k Mapping microarray [29] identified a median of 438 WGA-induced copy number artifacts in comparisons between amplified samples and an unamplified reference set [22]. While there is a consensus that at least partial compensation of systematic biases can be achieved through the use of an amplified reference [23-27], it is unknown to what degree such comparisons can capture real copy number variants detected using more sensitive, higher resolution platforms.

Recently, a high-throughput, massively parallel whole genome pyrosequencing technique was used to examine bias induced by three commercially available whole genome amplification protocols: MDA, primer-extension pre-amplification, and degenerate oligonucleotide-primed PCR [30]. In this comparison, which involved sequencing two bacterial genomes, Phi29 MDA-based approaches generated the most complete genome coverage (50-99%) and introduced the least bias compared to PCR-based techniques. DNA sequences generated from Phi29-amplified material were 2.9-3.8% lower in GC-content than those from the unamplified material, suggesting a relationship between amplification bias and GC-content.
However, over-amplification of certain sequences could not be explained by any of the previously mentioned sources of bias, suggesting a need to directly investigate the nature of regions prone to over- or under-amplification. Although the study was of high resolution, direct comparison of the results from this study with those using human samples is difficult due to differences in chromosome organization, size and composition.

In the present study, I investigated amplification bias resulting from whole genome amplification of DNA from fresh-frozen human tissues using the Affymetrix 500k Mapping microarray set. This set comprises two high-resolution microarrays that together contain probes to query over 500,000 SNPs from across the human genome, a dramatic increase in probe density from previous studies using a similar oligonucleotide array with 50-fold fewer probes (the Affymetrix 10k Mapping array) [22, 23, 26], lower resolution cDNA and bacterial artificial chromosome arrays [24, 25], or individual PCRs [31]. Copy number can be inferred from the resultant probe intensities [11] and, as only a single sample is applied to each microarray, multiple sample comparisons can be performed using normalized data. We quantified the effects of WGA on microarray signal and background noise, localized and statistically analysed genomic regions of WGA-induced bias, and directly compared the ability to resolve copy number variants in comparisons of unamplified and amplified material.

3.2. Materials and methods

3.2.1. Tissue material and DNA extraction

Normal lymph nodes from three individuals were fresh frozen in Optimal Cutting Temperature (OCT; Sakura Finetek, Torrance, CA) compound and stored at -80°C by the service pathology laboratory at the BC Cancer Agency. Genomic DNA was extracted from these sources using the Gentra PureGene DNA purification kit (Gentra Systems, Minneapolis, MN).
Prior to labelling and microarray hybridization, the genomic DNA was quantified using a NanoDrop spectrophotometer (NanoDrop Technologies, Wilmington, DE). Prior to whole genome amplification, the genomic DNA was diluted to ~1.5 ng/µL and quantified using a PicoGreen assay (Invitrogen, Carlsbad, CA). To ensure consistent DNA quality across all samples, the DNA was visualized on an agarose gel to confirm the presence of undegraded, predominantly high molecular weight (>10 kb) DNA.

3.2.2. Whole genome amplification

We used Qiagen’s Repli-g Mini whole genome amplification kit and protocol (Qiagen, Valencia, CA) to amplify 7 ng of PicoGreen-quantified DNA from fresh frozen samples to generate >10 µg of high molecular weight DNA. We performed the isothermal amplification reaction in 1.5 mL microcentrifuge tubes incubated in a 30°C water bath for 18 hours and inactivated the enzyme by incubating the tubes in a 65°C water bath for 3 minutes. The amplified products were purified and quantified as described in the previous section, and the amplification products were visualized on a 0.8% agarose gel stained with SYBR Green (Invitrogen, Carlsbad, CA).

3.2.3. Labelling and hybridization to the Affymetrix 500K array

DNA samples of 500 ng were processed following the instructions in the GeneChip Mapping 500K manual (Affymetrix, Santa Clara, CA). Briefly, 250 ng of DNA was digested using one of two restriction enzymes, Nsp I or Sty I, and ligated to Nsp I or Sty I adaptors. These adaptor-ligated fragments were amplified by PCR, the purified products quantified using a Bio-Tek PowerWave X spectrophotometer, and the concentration normalized to 2 µg/µL. The normalized products were then fragmented and labelled as described in the manual. Samples were hybridized to the GeneChip Human Mapping 250K Nsp or Sty array in an Affymetrix Hybridization Oven 640.
Washing and staining of the arrays were performed using an Affymetrix Fluidics Station 450. Images of the arrays were obtained using an Affymetrix GeneChip Scanner 3000.

3.2.4. Sample preparation for NimbleGen 385k CGH array

Samples of >2.5 µg of DNA were prepared following the instructions provided by NimbleGen Systems Inc. (Madison, WI). Briefly, purified samples were concentrated to 250 ng/µL and analysed for quality on an agarose gel. Samples were then shipped on ice to NimbleGen for subsequent labelling and hybridization to the 385k Human Whole-Genome CGH array.

3.2.5. Genotype and copy number analysis

Genotype calls were derived from GeneChip microarray images using the GTYPE v4.0 software program (Affymetrix, Santa Clara, CA). We detected copy number variants in individual samples using comparisons to a common reference data set and comparisons between pre- and post-amplification sample pairs (Figure 3.1). These comparisons were performed using a software pipeline that uses the Affymetrix Chromosome Copy Number Analysis Tool (CNAT) version 4.0 (Affymetrix, Santa Clara, CA) and an exhaustive t-score optimization algorithm.

To analyse sample pairs on the Affymetrix platform, we used CNAT to perform quantile normalization of probe intensities from the samples and calculated log2 intensity ratios for each probe set on the array. For unpaired analysis of individual samples against a common reference set, we used a set of average probe intensities from the reference set in place of the second sample. The reference set used for this purpose, referred to hereafter as the “Affy48 reference set”, was downloaded from the Affymetrix website (http://www.affymetrix.com/support/technical/sample_data/500k_data.affx) and consisted of 48 samples representing 5 HapMap CEPH trios, 5 HapMap Yoruban trios, 3 other non-HapMap trios, and 9 unrelated HapMap Asian samples.
To analyse sample pairs on the NimbleGen platform, we used qspline normalized data and log2 intensity ratios provided by NimbleGen for each probe on the array.

To identify significant deviations in the log2 ratio data from both platforms, the following t-score optimization algorithm was used. First, log2 ratios were sorted by genome coordinate, and moving windows representing a number of adjacent probes were subjected to a t-test against the rest of the data outside of the window on the same chromosome. This was done across the entire genome for all window sizes from 3 to 30 probe sets for the Affymetrix and NimbleGen data. To establish a comparison-specific false-positive threshold, the order of log2 ratios was then randomized, and moving window t-tests were recalculated. Two t-score thresholds, one for amplifications and one for deletions, were then defined at which no amplifications or deletions were identified in the randomized data. These thresholds were then applied to the t-scores derived from the original data, and regions with t-scores exceeding these thresholds were identified. To identify apparent variants impacting regions larger than our largest moving window size, t-scores were optimized for aberrations encompassing more than 27 probe sets using larger and larger windows until a local maximum t-score was found. As no CNVs met the false positive thresholds set for the NimbleGen data, a 50-probe window was used to detect statistically significant CNVs, and a comparison-specific false positive threshold was not applied.

3.2.6. Sequence analysis of recurrent whole genome amplification-induced artifacts

In the analysis of recurrent WGA-induced artifacts, several sets of genomic coordinates were defined based on the human genome reference sequence Build 36/hg18 (released March 2006) downloaded from the NCBI website (http://www.ncbi.nlm.nih.gov/).
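The moving-window t-score procedure of Section 3.2.5 can be illustrated with a minimal Python sketch. It is not the pipeline used in this study. It assumes Welch's unequal-variance t statistic (the text specifies only a t-test), handles a single chromosome, uses one random shuffle to set the thresholds, compares t-scores directly rather than p-values, and omits the window-growing step for larger aberrations. The function names welch_t, window_scores and call_windows are hypothetical.

```python
import math
import random
from statistics import mean, variance


def welch_t(inside, outside):
    # Welch t statistic: window values vs. the rest of the chromosome.
    se = math.sqrt(variance(inside) / len(inside) + variance(outside) / len(outside))
    return (mean(inside) - mean(outside)) / se if se > 0 else 0.0


def window_scores(ratios, min_w=3, max_w=30):
    # Yield (start, width, t) for every moving window of adjacent probes.
    n = len(ratios)
    # Leave at least two probes outside the window so its variance is defined.
    for w in range(min_w, min(max_w, n - 2) + 1):
        for start in range(n - w + 1):
            inside = ratios[start:start + w]
            outside = ratios[:start] + ratios[start + w:]
            yield start, w, welch_t(inside, outside)


def call_windows(ratios, seed=0):
    # Randomize probe order to build a comparison-specific null distribution.
    shuffled = ratios[:]
    random.Random(seed).shuffle(shuffled)
    null_t = [t for _, _, t in window_scores(shuffled)]
    # Thresholds at which nothing is called in the randomized data.
    gain_cut, loss_cut = max(null_t), min(null_t)
    return [(s, w, t) for s, w, t in window_scores(ratios)
            if t > gain_cut or t < loss_cut]
```

Given a list of log2 ratios sorted by genome coordinate, call_windows returns the windows whose t-scores exceed either threshold. Overlapping calls would still need to be merged into regions before interpretation.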
To define a set of regions that were consistently over- or under-amplified by the whole genome amplification technique, we analysed apparent variants arising from our comparison of matched pre- and post-WGA samples for overlapping genomic coordinates across all three comparisons and defined minimal overlapping regions (Table 3-1 and Table 3-2). These minimal overlapping regions were defined as the smallest region overlapped by a WGA-induced variant in all three comparisons. To define a subset of recurrently under-amplified chromosome ends, the first or last 2.5% of the reference genome sequence of any chromosome was recorded if it was impacted by a region consistently under-amplified by the WGA technique. To serve as reference sets representing the remainder of the human genome, random sets of coordinates were generated with equivalent size distributions for the regions consistently over- or under-amplified by the whole genome amplification technique and for the subset of recurrently biased regions affecting chromosome ends. In these reference sets, 10 random segments were generated with sizes corresponding to each entry in the list of regions affected by WGA-induced bias (i.e. 1,900 amplifications and 750 deletions).

The GC and repeat content of each entry in the above sets of coordinates were calculated in the following manner. For each set, the genomic sequence for each coordinate was downloaded from the Ensembl database (http://www.ensembl.org). To calculate the GC content of the sequence, the number of Gs and Cs in the sequence was counted and that number divided by the total length of the sequence. To calculate the repeat content of the sequence, the coordinates of the UCSC Genome Browser “Simple Repeats” track generated by Tandem Repeats Finder [32] were used to identify base pairs belonging to repeat sequences.
The number of these base pairs was then divided by the total length of the sequence to give the percentage of repeat sequence in the region. As most of the sets were not normally distributed in GC or repeat content as found by the Jarque-Bera test, the two-sample Kolmogorov-Smirnov test (KS test) was used to test whether these sets differed in their distribution of these two parameters.

3.3. Results

3.3.1. Array noise and copy number variation in samples pre- and post-WGA

To establish a baseline for array noise and copy number variant detection prior to amplification, three unamplified DNA samples were compared to the Affy48 reference set (Methods; Figure 3.1b), and candidate copy number variants were identified. This comparison versus the Affy48 set was then repeated using three corresponding amplified samples. As a measure of array noise, we quantified the distribution of log2 ratios resulting from these comparisons by calculating the mean, standard deviation (SD), and interquartile range (IQR) (Table 3-3, Figure 3.2). As expected due to normalization by CNAT4, the mean log2 ratios from both unamplified and amplified samples were very close to zero. The SDs and IQRs of log2 ratios from amplified samples were nearly twice those of the unamplified samples, suggesting an increase in array noise using WGA material.

To compare the copy number variants detected pre- and post-WGA, we counted apparent copy number variants with p-values more significant than each comparison’s false-positive detection limit (Table 3-3, Figure 3.3). The analysis of unamplified samples detected 13 candidate copy number variants, 11 of which overlapped the coordinates of genomic variants listed in the Database of Genomic Variants (http://projects.tcag.ca) [5] (Table 3-4).
In contrast, the analysis of the amplified samples identified 1,572 apparent copy number variants, an approximately 100-fold increase in the number of apparently significant amplifications and deletions versus the unamplified samples (Table 3-3). These artifactual CNVs are likely the result of WGA-induced biases.

To assess experimental variation prior to amplification, each unamplified and amplified sample was subjected to a pair-wise comparison against an experimental replicate of itself (Table 3-5). The lack of fluctuation in mean, SD, and IQR in the log2 ratios from unamplified replicates suggests a high degree of reproducibility of the array method used. Similarly, while still elevated relative to unamplified samples, there is no major fluctuation in these values between amplified replicates, further supporting the notion that the WGA method behaves consistently. However, in comparisons against the Affy48 reference set, the SDs and IQRs obtained from unamplified samples were substantially lower than those obtained from amplified samples. This indicates that amplified samples produce different signal intensity distributions than unamplified samples, suggesting that comparison of amplified to unamplified data sets is potentially problematic.

3.3.2. Copy number variants induced by whole genome amplification

To identify apparent copy number variants arising from non-uniform amplification bias in the WGA technique, data from paired pre- and post-WGA samples were directly compared to each other (Figure 3.1b). Our analysis identified apparent WGA-induced over- and under-amplifications in each of the three comparisons of amplified versus unamplified material. In sample 1, we detected 502 amplifications (p-value threshold of detection, p<1.68x10^-6) and 580 deletions (p<1.71x10^-8). In sample 2, we detected 467 amplifications (p<1.68x10^-6) and 202 deletions (p<1.64x10^-8). In sample 3, we detected 546 amplifications (p<1.68x10^-6) and 259 deletions (p<3.45x10^-8).
Our analysis also revealed a set of 265 recurrent apparent WGA-associated aberrations that were detected in all three comparisons. This set consisted of 190 over-amplifications (Table 3-1) and 75 under-amplifications (Table 3-2). Of these regions, 39 overlapped one of the 92 regions of bias (31 of 62 over-amplifications, 8 of 30 under-amplifications) identified by three previous studies [23, 24, 27]. Of the regions we identified, 110 overlapped genomic regions with known copy number variants [2] (64 over-amplifications, 46 under-amplifications) but there was no correlation between regions susceptible to WGA-associated bias and known copy number variants (p=1.00). In a set of 2,650 random genomic coordinates with the same size distribution as the WGA-induced artifacts, 36.26% overlapped a known copy number variant, a proportion near the 41.51% overlap observed with the set of WGA-induced biases.

The minimal overlapping regions (see Methods) of WGA-induced over-amplifications encompassed 13.6 Mbp of the reference human genome sequence and ranged from 2,207 bp to 357,399 bp, with a median size of 58,961 bp, and an IQR of 66,524 bp. These recurrently over-amplified sites were distributed throughout the genome and had a statistically significant increase in GC content relative to a set of 1,900 random genomic segments with identical size distribution (p=8.36x10^-40). These over-amplified sites were also enriched for tandem repeat sequences relative to the set of 1,900 random genomic segments (p=1.76x10^-6). These results are compatible with the notion that over-amplification by the WGA technique is related to the GC and repeat content of the underlying sequence.

The minimal overlapping regions of the recurrent WGA-induced under-amplifications encompassed 8.37 Mbp of the reference human genome sequence and ranged from 5,206 bp to 1.93 Mbp, with a median size of 75,698 bp, and an IQR of 64,619 bp.
These regions of under-amplification appeared to fall into two groups: those near chromosome ends and those distributed throughout the genome. Comparison of the 54 under-amplified sites distributed throughout the genome with a set of 540 random genomic segments with identical size distribution found no statistically significant difference in GC content (p=0.0796) or repeat sequences (p=0.1901). However, the under-amplifications were greatly depleted for GC-rich regions compared to the over-amplifications (p=1.93x10^-5), which supports the notion that WGA amplification efficiency is related to the GC content of the underlying sequence. A plot of GC content versus copy number shows a trend of increasing amplification magnitude (i.e. increasing copy number) with increasing GC content (Figure 3.4).

Of the 39 chromosome ends (see Methods) assayed by probe sets, 15 contained regions of under-amplification (Table 3-6). Only 3 chromosome ends contained over-amplifications, suggesting that under-representation of chromosome ends is a consistent result of whole genome amplification. The set of chromosome end under-amplifications impacted 2.547 Mbp of the reference human genome sequence, and the GC content was statistically greater than that of a set of 150 random genomic segments with identical size distribution (p=1.12x10^-6). However, there was no statistical difference in GC content between the under-amplified chromosome ends and the 25 appropriately amplified chromosome ends (p=0.8215). This suggests that amplification bias due to GC content does not play a role in under-amplification of specific subtelomeric regions.
Under-amplified chromosome ends were enriched for repetitive sequences (see Methods) relative to both a set of 150 random genomic segments with identical size distribution (p=1.52x10^-9) and the 25 assayed chromosome ends that were not under-amplified (p=0.0022), suggesting that increased repeat content of specific chromosome ends may result in their under-amplification.

To assess WGA-induced CNV artifacts using a second array platform, we compared pre- and post-amplification sample pairs in three comparative genome hybridization (CGH) experiments using the NimbleGen 385k array. The log2 ratios from these experiments were widely distributed (average SD = 0.378, average IQR = 0.457) and, while several thousand CNVs were detected, none was identified with a p-value passing the stringent false positive thresholds set by our algorithm due to the high level of noise in these data (p<3.51x10^-7 for over-amplifications, p<3.30x10^-11 for under-amplifications). Analysis of these data using a 50-probe moving window without filtering for false positives detected 2,116 WGA-induced CNVs (466 over-amplifications, 1,650 under-amplifications) of which 141 occurred in all three comparisons (29 over-amplifications, 112 under-amplifications). Despite their relatively large size (average = 1.06 Mb, median = 0.36 Mb, SD = 4.10 Mb), only 28 of these overlapped recurrent artifacts detected by the Affymetrix comparisons (17 of 190 over-amplifications, 11 of 75 under-amplifications). This amount of overlap is similar to that seen with a set of 2,116 random genomic coordinates with the same size distribution as the CNVs detected by the NimbleGen platform, of which 65 overlapped a WGA-induced CNV detected by the Affymetrix platform. These results suggest that these are artifacts resulting from the difficulty in distinguishing real CNVs from background noise when co-hybridizing amplified and unamplified samples, even when a large moving window of 50 probes is used.

3.3.3. Use of amplified material for pair-wise copy number comparisons

To assess the use of WGA material in pair-wise comparisons, each sample was compared to the other samples one-by-one, and relative differences in copy number in the three samples were assessed using: 1) unamplified samples vs. unamplified samples, 2) amplified samples vs. unamplified samples, and 3) amplified samples vs. amplified samples (Figure 3.1d). An example of the output from one such set of comparisons is illustrated in Figure 3.5.

The unamplified vs. unamplified comparisons identified 21 apparent differences in copy number between the three samples (Table 3-7 and Table 3-8). These pair-wise comparisons identified 5 of 13 apparent differences expected from the individual comparisons of samples to the Affy48 reference set. Twelve of these apparent differences, including the 5 differences expected from comparison with the Affy48 set, overlap variants listed in the Database of Genomic Variants (http://projects.tcag.ca).

The amplified vs. unamplified comparisons identified 3,207 apparent differences in copy number among the three samples (Table 3-7). Only seven of these apparent differences were detected by both unamplified/amplified and amplified/unamplified comparisons, suggesting that systematic WGA-induced variants and random WGA-reaction variability mask real events.

The amplified vs. amplified comparisons identified 275 apparent differences in copy number among the three samples (Table 3-7). These amplified vs. amplified comparisons identified 2 of the 12 apparent amplifications and 5 of the 9 apparent deletions seen in the unamplified comparisons (Table 3-8), suggesting that pair-wise comparisons of material where both samples have been subjected to WGA can partially compensate for reproducible WGA-induced bias (Figure 3.5).
The most significant deletion identified by each unamplified comparison was recapitulated as the most significant deletion identified by the corresponding amplified comparison (Table 3-8). This was also true of the most significant amplification in two of the three comparisons (Table 3-8). The list of variants detected at lower levels of significance than these top scoring events may still contain real CNVs, although it is difficult to isolate these from the remaining artifactual events resulting from random experimental variation without independent validation of each one.

3.3.4. Validation of WGA pair-wise comparisons for copy number detection

To determine the extent to which amplified pair-wise comparisons mask known, validated copy number variants, DNA from the blood of three father/child pairs with previously described CNVs [9] was subjected to WGA and copy number analysis using the 250k Nsp chip of the Affymetrix 500k set. The original analysis of unamplified DNA performed using the Affymetrix Mapping 100k SNP array set [9] identified a total of 32 CNVs within the three father/child pairs of which five (2 amplifications, 3 deletions) were validated by conventional cytogenetic analysis or FISH (Table 3-9).

The amplified child vs. amplified father comparisons identified 63 CNVs within the three pairs. Analysis of amplified family pair #8379 identified 41 copy number differences (13 amplifications p<3.48x10^-6, 28 deletions p<8.38x10^-8 in the child relative to the father), analysis of amplified family pair #1280 identified 6 copy number differences (2 relative amplifications p<2.14x10^-6, 4 relative deletions p<1.05x10^-8), and analysis of amplified family pair #3476 identified 16 copy number differences (6 relative amplifications p<2.07x10^-6, 10 relative deletions p<6.09x10^-9). These copy number differences were then ranked by p-value (most significant to least) and the coordinates compared to those of the validated aberrations. The amplified vs.
amplified comparisons identified four of the five CNVs (2 amplifications, 2 deletions) validated by FISH [9] and each received the lowest p-value for its comparison (Table 3-9). The single validated CNV that was not detected by the amplified comparisons may have been missed due to a difference in array coverage at this site. On the 250k Nsp array, this region was covered by 3 probe sets (10,683 bp/probe set) compared to 6 probe sets (5,341 bp/probe set) on the 100k array. This was also the smallest feature of the set of validated CNVs (0.03 Mb) and may reflect a decrease in detection sensitivity when using amplified comparisons.

Among the top-ranked variants (i.e. those with the most significant p-values), six variants were identified by the 250k WGA experiment that were not detected by the original experiments. Five of these are covered by 6 or fewer probe sets (5,743-93,452 bp/probe set, one with no probes) on the 100k array. In addition to the possibility of an increased false positive rate due to increased array noise, differences in each array’s probe coverage may explain why these regions were only detected by the experiment using amplified samples.

3.3.5. Genotype fidelity

To compare the fidelity of genotype calls derived from WGA product to those from corresponding unamplified samples, data from matched pairs of these sources were compared. Average genotype call rates (+/- 1 standard deviation) were 96.74+/-1.14% from the unamplified samples and 93.14+/-2.68% from the WGA samples, suggesting a modest degree of information loss following amplification. Of the SNPs that were unsuccessfully called in the amplified samples, only 2% were common to all three samples, and only one of these fell within a region of WGA-induced bias (an over-amplification). Genotype concordance was 98.57+/-0.53% between calls successfully made from both amplified and unamplified samples in each matched pair.
There was very little overlap in the coordinates of SNPs with non-concordant genotypes and regions of recurrent WGA-induced bias. Of the non-concordant calls, 58.77% were called heterozygotes in the unamplified sample and homozygotes in the amplified sample (i.e., AB called as AA or BB) and 0.2% of these were located in regions of WGA-induced over-amplification while none were in regions of WGA-induced under-amplification. A further 40.66% were called homozygotes in the unamplified sample and heterozygotes in the amplified sample (i.e., AA or BB called as AB), of which none were located in regions of WGA-induced bias, and 0.57% were incorrectly called homozygotes (i.e., AA called as BB or BB called as AA) of which none were located in regions of WGA-induced bias.

In total, 12 regions, each containing 3-7 SNPs, were identified as displaying loss of heterozygosity (LOH) across the three pre- and post-amplification comparisons. Three of the LOH regions showed allele-specific amplification (copy number of 3), while the remaining 9 did not (copy number of 2). These regions impacted a total of 58 SNPs, 0.01% of all of the SNPs assayed, and none overlapped a region recurrently over- or under-amplified by WGA. These results suggest that increased random array noise is likely a greater source of genotype non-concordance than systematic allele-specific amplification bias or polymerase error.

3.4. Discussion

The ability to discover copy number variants in unamplified human DNA using data generated by the Affymetrix Mapping SNP array platform has been previously demonstrated by our group and others [1-3, 9]. However, with small amounts of DNA from tumour biopsies, for example, amplification of the starting material prior to discovery of copy number variants is often necessary to generate enough material to conduct such analyses.
We aimed to assess the nature of biases that are introduced by this amplification and to determine their impact on copy number detection and whether pair-wise comparisons could compensate for these biases. For the first time, we have used a high resolution microarray platform to explicitly define regions susceptible to WGA-induced bias, statistically assessed the sequence features underlying these biases, and demonstrated an ability to correct for these biases and resolve real CNVs.

In this study, three unamplified DNA samples were used to establish a baseline for array noise and copy number variant detection. These were compared to the same DNA samples that were amplified in duplicate using a WGA technique. The apparent copy number variants we detected by comparing unamplified samples to the unamplified Affy48 reference set were likely real events, as the variants were relatively large, statistically significant, and 11 of the 13 copy number variants corresponded to previously documented genomic variants [5]. While our variant detection approach adjusts its threshold of significance based on the level of noise of each array, comparisons using amplified samples still identified hundreds of apparent CNVs not seen in the unamplified comparisons on the Affymetrix array platform. Since these comparisons were performed against an unamplified reference, it is likely that these artifactual apparent CNVs were the result of preferential amplification of regions of the genome and not due to an increased level of array noise.

The data from the NimbleGen platform appeared to have a high level of noise that affected our ability to detect WGA-induced CNVs when co-hybridizing unamplified and amplified samples. Our results suggest that amplified and unamplified samples cannot be directly compared to uncover WGA-induced artifacts using the NimbleGen CGH array.
However, this should not preclude the comparison of similarly amplified samples on this platform as we have shown using Affymetrix arrays that the biases are largely systematic and the noise is reduced substantially when comparing two amplified samples.

To explore the nature of this bias, we directly compared Affymetrix data from pre- and post-amplification sample pairs and observed a set of regions apparently over- or under-amplified in all three samples. These regions impacted a total of 21.97 Mb of sequence, consisted of 190 over-amplifications and 75 under-amplifications, and overlapped 39 of 92 regions of WGA-induced bias identified by other studies [23, 24, 27]. The low amount of overlap is perhaps due to differences in genome coverage by the arrays used in these studies, particularly as there was no previous consensus on any region being susceptible to WGA-induced bias. The results reported here are for DNA amplified using the Qiagen Repli-g Mini kit, and it is conceivable that DNA amplified using different protocols will exhibit different bias.

While the lack of a correlation between regions of WGA-induced bias and known CNVs is different from a previous observation [24], we have demonstrated that the degree of overlap of the amplification biases we identified with known CNVs is only slightly greater than would be expected by chance. The amount of overlap observed is likely due to the fact that documented CNVs are generally large, 165 kb on average, and, in total, impact ~27% of the genome.

The difference in size and size distribution of the over- and under-amplifications that we identified suggests focal over-amplification of specific sequences and broader under-representation of others.
We observed a direct relationship between amplification efficiency and GC content: over-amplified regions had a statistically significant increase in GC content relative to the under-amplified regions (apparent deletions) (p=1.93x10^-5), and the magnitude of over-amplification appeared to scale directly with GC richness (Figure 3.4). These results are consistent with the notion that WGA-induced over-amplification bias is related to the increased binding affinity of GC-rich hexamers relative to AT-rich hexamers and not to a shortage of hexamers corresponding to repetitive regions in the genome. There is also the possibility that, unlike many polymerases, Phi29 polymerase is more efficient in synthesizing GC-rich sequences, thereby resulting in over-amplification of these regions. These effects likely also contribute to under-amplification of GC-poor regions distributed throughout the genome but not to the loss of chromosome ends. The lack of a relationship between regions of WGA-induced bias and the presence of known copy number variants suggests that different mechanisms account for these phenomena.

The loss of chromosome ends appears to be a frequent result of the WGA procedure, as 15 of the 39 ends assayed were under-amplified in all samples compared to only three that were over-amplified. Relative to chromosome ends that were not affected by bias, the under-amplified ends were enriched for repetitive sequences (p=0.0022) but did not differ significantly in GC content (p=0.8215). These results suggest that the source of amplification bias at chromosome ends is different from the GC-content-derived biases affecting the rest of the genome. One possible explanation is the positional effect of having fewer overlapping amplification products at the ends of linear strands of DNA than in the middle. However, if this were the case, then all chromosome ends should be similarly under-amplified, which they are not.
Another possible explanation is that the limited quantities of hexamers corresponding to subtelomeric repeats result in fewer priming events in these regions. This may account for the loss of repetitive chromosome ends more frequently than less repetitive ends.

We found that samples subject to Phi29-based WGA can be used for accurate genotyping, albeit with some data loss. From the WGA samples, we consistently observed a decrease in the average number of genotype calls and a wider range of call rates compared to those from the unamplified samples. However, of the genotype calls that were made, over 98% were concordant between amplified and unamplified sample pairs. Of the fewer than 2% of calls that were non-concordant, 99.43% were discrepant heterozygotes (i.e., AB called as AA or BB, or AA or BB called as AB) rather than incorrectly called homozygotes, and almost none (<0.12%) were located in regions of WGA-induced bias. This discrepancy rate is very near that observed between unamplified replicates on the Affymetrix 500k array [33]. It is likely that the genotype call non-concordance reflects the genotyping accuracy of the array in the presence of increased noise due to WGA, rather than true genotype changes induced by WGA through allele-specific amplification or polymerase error.

Regardless of the source of the systematic biases induced by WGA, we have shown that pair-wise analysis of amplified samples is a viable strategy for CNV detection, provided that an appropriate threshold of significance is used to filter out the low-significance random artifacts induced by this technique. While the greater number of apparent copy number differences detected using amplified samples has the potential to mask real events, we observed that pair-wise comparisons of such samples can detect real differences between samples.
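The genotype call rate and concordance figures reported above can be computed from paired genotype calls along the following lines. This is a minimal sketch rather than the genotyping software used in this study; the dictionary layout, the 'NoCall' label, and the function name are hypothetical.

```python
def compare_genotypes(unamplified, amplified, no_call='NoCall'):
    # Both arguments map a SNP identifier to a call: 'AA', 'AB', 'BB' or no_call.
    shared = [snp for snp in unamplified if snp in amplified]
    # Concordance is only defined where both samples produced a call.
    called = [s for s in shared
              if unamplified[s] != no_call and amplified[s] != no_call]
    discordant = [s for s in called if unamplified[s] != amplified[s]]
    # A discordance involves a heterozygote if either of the two calls is 'AB'.
    het = [s for s in discordant if 'AB' in (unamplified[s], amplified[s])]

    def fraction(part, whole):
        return part / whole if whole else float('nan')

    def call_rate(calls):
        return fraction(sum(c != no_call for c in calls.values()), len(calls))

    return {
        'call_rate_unamplified': call_rate(unamplified),
        'call_rate_amplified': call_rate(amplified),
        'concordance': fraction(len(called) - len(discordant), len(called)),
        'heterozygote_share_of_discordant': fraction(len(het), len(discordant)),
    }
```

Restricting concordance to SNPs called in both samples keeps data loss (a lower call rate) separate from genotyping error, which mirrors the distinction drawn in the text.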
By comparing amplified samples to amplified samples, the number of artifactual copy number differences is reduced by an order of magnitude relative to comparisons of amplified versus unamplified samples, owing to the systematic nature of the bias induced by the technique. Conceivably, the use of a large, amplified reference set would be a practical alternative to pair-wise comparisons for larger batches of amplified samples requiring a universal reference.

Of the apparent copy number differences detected by the three pair-wise comparisons using unamplified material, all of the top deletions and two of the three top amplifications were identified as the most significant by the corresponding comparisons using amplified material. When this technique was applied to paired child/father samples with known, validated copy number differences [9], four of the five validated differences detected by the original study using unamplified DNA were the most significant in the same comparisons using amplified DNA. The only validated CNV that was missed using WGA material was probably due to a difference in coverage by the array platforms used. A similar difference in coverage partially explains the presence of six high confidence CNVs detected by the WGA experiments but not seen in the original study, as one of these has recently been observed in the unamplified material using a higher resolution platform. Therefore, when evaluating the results from amplified comparisons, CNVs with the top ranked significance are more likely to be real CNVs in the unamplified sample.

3.5. Figures

Figure 3.1 Experimental design
(A) In this study, we aimed to assess the impact of WGA on the detection of copy number variants, to explore copy number biases induced by this technique, and to assess the use of pair-wise analysis to address such biases.
To this end, DNA samples from three fresh frozen tissues were subjected to WGA and analyzed pre- and post-amplification on the Affymetrix Mapping 500k SNP array set. For each copy number analysis, different sets of microarray data were compared as shown in panels B-D. Log2 intensity ratios were calculated from the selected data comparisons using a software pipeline based on CNAT v4.0. These ratios were then screened by an “exhaustive search” algorithm, in which t-scores were calculated in 3 to 30 probe windows and statistically significant aberrations were identified above array-specific thresholds defined through permutation. To detect CNVs impacting more than 30 probes, aberrations found to contain more than 27 probes were subjected to a t-score optimization using progressively larger window sizes until a local maximum t-score was found. The resulting high confidence lists of CNVs were then compared as appropriate for each analysis.
(B) In this set of comparisons against a common reference set, we investigated the effect of WGA on array noise (i.e., the distribution of log2 ratios) and the ability to resolve copy number variants. To this end, each unamplified and amplified sample was independently compared against the Affy48 reference set, log2 ratios were calculated, and the detected copy number variants were compared.
(C) To assess the nature of bias induced by WGA, this data set directly compared matched pre- and post-WGA samples. Since matched samples were used, all copy number variants detected in this analysis are due to the amplification technique.
(D) This set of comparisons examined the ability of pair-wise analysis of amplified samples to reproduce copy number variants detected in unamplified samples. Three pair-wise comparisons were conducted using both unamplified and amplified material and the observed copy number variants were compared.
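The windowed t-score screen outlined in this caption can be sketched as follows. This is an illustrative approximation, not the pipeline used in this study: the exact t-score definition, the permutation scheme, and all function names are assumptions. Here each window mean is compared with the array-wide mean using the array-wide standard deviation, and the threshold is taken from the largest absolute t-score seen in shuffled data. The extension step for aberrations of more than 27 probes is not shown.

```python
import math
import random
import statistics


def window_t_scores(log2_ratios, min_w=3, max_w=30):
    # Yield (start, width, t) for every window of min_w..max_w consecutive probes.
    # Assumed t-score: (window mean - array mean) / (array sd / sqrt(width)).
    # The data must not be constant, otherwise sd is zero.
    mu = statistics.fmean(log2_ratios)
    sd = statistics.stdev(log2_ratios)
    prefix = [0.0]
    for x in log2_ratios:
        prefix.append(prefix[-1] + x)
    n = len(log2_ratios)
    for w in range(min_w, min(max_w, n) + 1):
        scale = sd / math.sqrt(w)
        for start in range(n - w + 1):
            mean_w = (prefix[start + w] - prefix[start]) / w
            yield start, w, (mean_w - mu) / scale


def permutation_threshold(log2_ratios, n_perm=100, quantile=0.95, seed=0, **kw):
    # Array-specific threshold: a quantile of the maximum |t| over shuffled probes.
    rng = random.Random(seed)
    shuffled = list(log2_ratios)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        maxima.append(max(abs(t) for _, _, t in window_t_scores(shuffled, **kw)))
    maxima.sort()
    return maxima[min(len(maxima) - 1, int(quantile * len(maxima)))]


def call_aberrations(log2_ratios, threshold, **kw):
    # Keep every window whose |t| exceeds the permutation-derived threshold.
    return [(s, w, t) for s, w, t in window_t_scores(log2_ratios, **kw)
            if abs(t) > threshold]
```

Because the threshold is derived from each array's own shuffled ratios, a noisier array (for example, one hybridized with amplified DNA) automatically receives a stricter cut-off, which is the behaviour the caption attributes to the array-specific thresholds.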

Figure 3.2 Boxplots comparing the spread of log2 ratios in unamplified and amplified samples
The log2 ratios resulting from comparison of each sample against the Affy48 reference set were plotted using a standard box and whisker plot displaying a five number summary: maximum value or Q3+(1.5 x IQR), Q3, mean, Q1, and minimum value or Q1-(1.5 x IQR). Outliers, defined as values that fall more than 1.5 x IQR above Q3 or below Q1, are displayed as individual data points. Due to normalization as part of the CNAT4 analysis pipeline, the mean log2 ratio from each sample is close to zero. However, the IQR, as well as the maximum and minimum values, were further from the mean in the amplified samples relative to the unamplified samples. The increased spread of data distribution is likely due to increased array noise and the detection of amplification biases induced by WGA.

Figure 3.3 Apparent CNVs in unamplified and amplified samples
The number of variants detected in unamplified and amplified samples from comparison against the Affy48 reference set was counted. The amplified samples appear to contain hundreds of copy number variants not seen in the unamplified samples, suggesting that WGA over- or under-represents specific regions of the genome.

Figure 3.4 Copy number distribution and GC content of WGA-induced CNVs
The number of variants and % GC content were plotted against copy number magnitude for all of the CNVs detected by comparisons of each pre- and post-WGA sample pair. There appears to be a direct relationship between the magnitude of over-amplification and increased GC content.

Figure 3.5 Example of how a pair-wise comparison of amplified material can partially compensate for WGA-induced bias
Shown is the output of three copy number analyses conducted using our CNV discovery software pipeline.
Copy number, calculated directly from log2 ratios of probe intensities, is plotted against genome location using a sliding window of averaged data points, in this case 60 probes. Regions of copy number increase or decrease (those with statistically significant p-values) are identified in green and all other regions are marked in red. In this example, a pair-wise comparison of two unamplified samples identified a gain of copy number (p<1.00x10^-16) in unamplified sample #1 relative to unamplified sample #2 at a locus documented to be copy number variable in the Database of Genomic Variants. Conducting the same comparison after WGA of sample #1 results in hundreds of confounding copy number variants from which the known copy number variant is indistinguishable. However, conducting this comparison after WGA of both samples restores the ability to detect this CNV. Artifactual variants still remain as a result of random variation in the WGA process; however, they do not reach the level of significance of the real event. Therefore, when interpreting results from comparisons of WGA samples, only the top-most hits are likely to be representative of the unamplified sample.

3.6. Tables

Table 3-1 Regions of recurrent WGA over-amplification

Genome Coordinates (Build 36/hg18/Mar 2006)  Size (Mbp)  % GC content  Mbp from nearest chromosome end
chr1:4289059-4343246  0.054  49.838  4.289
chr1:4471600-4481483  0.010  52.459  4.472
chr1:18235855-18239273  0.003  43.609  18.236
chr1:18436270-18439577  0.003  48.549  18.436
chr1:31292882-31443646  0.151  47.315  31.293
chr1:37463385-37475919  0.013  42.928  37.463
chr1:41064118-41206071  0.142  47.879  41.064
chr1:44791755-44930325  0.139  49.964  44.792
chr1:58584351-58645411  0.061  42.279  58.584
chr1:149246905-149597928  0.351  47.321  97.652
chr1:156259883-156311043  0.051  48.289  90.939
chr1:158049173-158126016  0.077  48.743  89.124
chr1:201867229-201924372  0.057  46.302  45.325
chr1:206392425-206428178  0.036  48.218  40.822
chr1:207779076-207858238  0.079  45.699  39.391
chr2:23092874-23191469  0.099  42.645  23.093
chr2:29833725-29967658  0.134  42.63  29.834
chr2:38509857-38556050  0.046  45.558  38.510
chr2:44086403-44160450  0.074  44.483  44.086
chr2:67878898-67954538  0.076  40.88  67.879
chr2:79084008-79097608  0.014  41.291  79.084
chr2:85989722-86036159  0.046  44.905  85.990
chr2:206333511-206353492  0.020  45.186  36.598
chr2:216823668-216877334  0.054  46.06  26.074
chr2:218376045-218417460  0.041  52.994  24.534
chr2:219657428-219717510  0.060  42.947  23.234
chr3:10602703-10670666  0.068  48.005  10.603
chr3:14573684-14647641  0.074  50.043  14.574
chr3:25304796-25364253  0.059  40.188  25.305
chr3:63462557-63541811  0.079  38.468  63.463
chr3:67380171-67422715  0.043  39.967  67.380
chr3:72035027-72087064  0.052  45.219  72.035
chr3:117955893-117958100  0.002  36.911  81.544
chr3:124147597-124238833  0.091  49.101  75.263
chr3:136150537-136154569  0.004  44.855  63.347
chr4:85502742-85547359  0.045  39.273  85.503
chr5:32503741-32608477  0.105  44.94  32.504
chr5:38045875-38059398  0.014  42.273  38.046
chr5:73611022-73613695  0.003  39.005  73.611
chr5:137957669-138108906  0.151  47.3  42.749
chr5:139060368-139257047  0.197  52.492  41.601
chr5:141093169-141181524  0.088  50.573  39.676
chr5:141773766-141849848  0.076  44.822  39.008
chr6:12973635-13018684  0.045  42.162  12.974
chr6:14857627-14948029  0.090  43.871  14.858
chr6:15073036-15207564  0.135  45.072  15.073
chr6:26485525-26524595  0.039  44.614  26.486
chr6:29669829-29753557  0.084  46.525  29.670
chr6:30880323-30977693  0.097  48.513  30.880
chr6:31370848-31380959  0.010  40.615  31.371
chr6:31683255-31829012  0.146  51.426  31.683
chr6:32174155-32249861  0.076  50.974  32.174
chr6:34062545-34211717  0.149  54.414  34.063
chr6:36830767-36865161  0.034  51.563  36.831
chr6:37651502-37710319  0.059  50.272  37.652
chr6:37741084-37875032  0.134  49.759  37.741
chr6:39361539-39374614  0.013  49.794  39.362
chr6:41500542-41617879  0.117  50.4  41.501
chr6:43063502-43360189  0.297  50.899  43.064
chr6:44172042-44301684  0.130  50.396  44.172
chr6:47761199-47804068  0.043  41.129  47.761
chr6:85782474-85821656  0.039  41.299  85.078
chr6:89190270-89273163  0.083  39.062  81.627
chr6:110376081-110425208  0.049  43.002  60.475
chr7:3058747-3073709  0.015  44.744  3.059
chr7:3197339-3301123  0.104  44.716  3.197
chr7:5835087-5867724  0.033  47.779  5.835
chr7:66777854-66801140  0.023  43.724  66.778
chr7:66906956-66969142  0.062  44.561  66.907
chr7:67775989-67872385  0.096  44.37  67.776
chr7:68179354-68228090  0.049  43.891  68.179
chr7:68322081-68428002  0.106  44.394  68.322
chr7:71337089-71367377  0.030  44.373  71.337
chr7:75049350-75225967  0.177  48.26  75.049
chr7:100488171-100592180  0.104  50.525  58.229
chr7:127584188-127675925  0.092  48.963  31.145
chr7:131655313-131766360  0.111  46.675  27.055
chr7:140731983-140813912  0.082  44.764  18.008
chr7:142275787-142371817  0.096  47.609  16.450
chr7:142704105-142779432  0.075  46.742  16.042
chr7:152750937-152798127  0.047  43.184  6.023
chr8:20350532-20387149  0.037  46.731  20.351
chr8:21420713-21468570  0.048  43.667  21.421
chr8:23668724-23752488  0.084  41.757  23.669
chr8:37354229-37472939  0.119  42.699  37.354
chr8:70967579-71130733  0.163  44.063  70.968
chr8:128415540-128483680  0.068  39.91  17.791
chr8:131827508-131890351  0.063  44.281  14.384
chr8:133104378-133175247  0.071  44.203  13.100
chr8:134000193-134029470  0.029  44.699  12.245
chr8:134733068-134748357  0.015  46.9  11.526
chr8:136259961-136277180  0.017  40.906  9.998
chr9:1730367-1756255  0.026  40.048  1.730
chr9:34400978-34556969  0.156  48.251  34.401
chr9:109183520-109250183  0.067  45.642  31.023
chr9:109337263-109457162  0.120  45.546  30.816
chr9:111552203-111572732  0.021  40.472  28.701
chr9:118130565-118179376  0.049  44.288  22.094
chr9:118523287-118573078  0.050  43.423  21.700
chr9:118699338-118773794  0.074  42.379  21.499
chr9:119032843-119048518  0.016  43.583  21.225
chr9:121500039-121529854  0.030  39.16  18.743
chr9:121697808-121802562  0.105  42.241  18.471
chr10:17037587-17055411  0.018  41.492  17.038
chr10:29045369-29067693  0.022  43.857  29.045
chr10:30830367-30889472  0.059  43.781  30.830
chr10:35213736-35222408  0.009  46.57  35.214
chr10:72358857-72391369  0.033  52.659  62.983
chr10:78771093-78832348  0.061  44.58  56.542
chr10:80117004-80173559  0.057  46.232  55.201
chr10:102944887-103070144  0.125  51.084  32.305
chr10:106870314-106893971  0.024  40.701  28.481
chr10:119306070-119335828  0.030  47.287  16.039
chr11:13052115-13062946  0.011  46.926  13.052
chr11:45426732-45493098  0.066  47.543  45.427
chr11:56668848-56711422  0.043  45.214  56.669
chr11:61917900-62120435  0.203  49.924  61.918
chr11:69515071-69537473  0.022  46.476  64.915
chr11:114155209-114231222  0.076  42.651  20.221
chr11:115741865-115777459  0.036  45.487  18.675
chr11:117461473-117478883  0.017  47.901  16.974
chr11:117533377-117562454  0.029  51.279  16.890
chr11:118796655-118914006  0.117  49.231  15.538
chr11:126136392-126161131  0.025  45.614  8.291
chr11:130902902-130982364  0.079  44.085  3.470
chr12:51596678-51709772  0.113  47.216  51.597
chr12:52387023-52412999  0.026  47.904  52.387
chr12:52681349-52755293  0.074  50.454  52.681
chr12:53228106-53329952  0.102  47.65  53.228
chr12:106372782-106389331  0.017  48.151  25.960
chr12:112474952-112562237  0.087  46.32  19.787
chr12:112573453-112641210  0.068  46.563  19.708
chr12:113616005-113721129  0.105  46.98  18.628
chr12:114338015-114416917  0.079  42.64  17.933
chr12:115325571-115386413  0.061  48.51  16.963
chr12:117606824-117703301  0.096  43.816  14.646
chr12:118053074-118108718  0.056  46.243  14.241
chr12:120038561-120080574  0.042  46.956  12.269
chr12:120399191-120643045  0.244  50.054  11.706
chr13:24677565-24700676  0.023  45.448  24.678
chr13:35171922-35227287  0.055  43.169  35.172
chr14:69704299-69714918  0.011  44.275  36.654
chr14:91892520-91908197  0.016  47.748  14.460
chr14:93980102-94033174  0.053  46.442  12.335
chr14:95141650-95185168  0.044  46.387  11.183
chr15:55978179-56036075  0.058  42.721  44.303
chr15:56550970-56610077  0.059  46.007  43.729
chr15:65169105-65182542  0.013  47.559  35.156
chr15:66485246-66523220  0.038  49.396  33.816
chr15:68217654-68337380  0.120  48.759  32.002
chr15:86586302-86669305  0.083  49.155  13.670
chr15:88453860-88601412  0.148  49.25  11.738
chr16:5750131-5773779  0.024  43.393  5.750
chr16:8660048-8713922  0.054  47.311  8.660
chr16:9207447-9226408  0.019  46.883  9.207
chr16:10244004-10265038  0.021  47.188  10.244
chr16:11194191-11272284  0.078  51.812  11.194
chr16:16018456-16102139  0.084  48.541  16.018
chr16:19961872-19991194  0.029  44.354  19.962
chr16:20132016-20173234  0.041  42.691  20.132
chr16:26692215-26716739  0.025  43.947  26.692
chr17:4592813-4825056  0.232  51.562  4.593
chr17:28710395-28760926  0.051  44.404  28.710
chr17:29816420-29832905  0.016  46.894  29.816
chr17:38983818-39099959  0.116  49.368  38.984
chr18:17857497-17977937  0.120  44.565  17.857
chr18:33126219-33191107  0.065  53.998  33.126
chr18:33331187-33357665  0.026  52.045  33.331
chr18:42299285-42320826  0.022  49.285  33.796
chr18:43392252-43496059  0.104  43.349  32.621
chr18:46571157-46644139  0.073  43.646  29.473
chr19:5299933-5446197  0.146  50.388  5.300
chr19:6607455-6641771  0.034  50.203  6.607
chr19:7117138-7343349  0.226  47.339  7.117
chr19:7605347-7713610  0.108  52.77  7.605
chr19:11088326-11445725  0.357  52.354  11.088
chr19:56105140-56130289  0.025  47.948  7.681
chr20:5435848-5639678  0.204  43.587  5.436
chr20:35680392-35749998  0.070  46.356  26.686
chr20:40784442-40792215  0.008  45.343  21.644
chr20:44231964-44376706  0.145  51.218  18.059
chr20:54720806-54774975  0.054  47.408  7.661
chr21:36470807-36502797  0.032  47.488  10.442
chr21:40273447-40312865  0.039  44.928  6.631
chr22:25796629-25833837  0.037  46.682  23.858
chr22:25917001-25958151  0.041  45.374  23.733
chr22:26250032-26287965  0.038  49.391  23.403
chr22:26417123-26435851  0.019  50.211  23.256
chr22:32410571-32471321  0.061  45.12  17.220
chr22:35337416-35439400  0.102  46.816  14.252

Table 3-2 Regions of recurrent WGA under-amplification

Genome Coordinates (Build 36/hg18/Mar 2006)  Size (Mbp)  % GC content  Mbp from nearest chromosome end
chr1:3058506-3129776  0.071  57.113  3.059
chr1:5857077-5871605  0.015  57.168  5.857
chr1:7718141-7831475  0.113  41.364  7.718
chr1:188208865-188276324  0.067  34.619  58.973
chr1:214179005-214234267  0.055  35.409  33.015
chr1:218251671-218276896  0.025  37.933  28.973
chr1:232592563-232677945  0.085  39.832  14.572
chr1:235352252-235413331  0.061  40.499  11.836
chr2:554079-613259  0.059  45.934  0.554
chr2:1841469-1968296  0.127  45.876  1.841
chr2:128743219-128877673  0.134  52.327  114.073
chr2:159098562-159174260  0.076  36.627  83.777
chr5:487981-738504  0.251  56.251  0.488
chr5:2187888-2267721  0.080  49.395  2.188
chr5:2836714-2884070  0.047  41.89  2.837
chr5:3160861-3195828  0.035  46.205  3.161
chr5:6776429-6806873  0.030  46.553  6.776
chr6:170198708-170308225  0.110  51.929  0.592
chr7:47777759-47884020  0.106  43.623  47.778
chr7:158582043-158739710  0.158  45.905  0.082
chr8:791584-850907  0.059  47.539  0.792
chr8:1816651-1946694  0.130  49.183  1.817
chr8:4027418-4039531  0.012  38.749  4.027
chr8:5771188-5797004  0.026  37.599  5.771
chr8:6316036-6371546  0.056  39.288  6.316
chr8:12634490-12708688  0.074  40.565  12.634
chr9:95031716-95156970  0.125  50.386  45.116
chr10:2593122-2624375  0.031  37.102  2.593
chr10:3877605-3946173  0.069  42.615  3.878
chr10:29732980-29815742  0.083  42.533  29.733
chr10:131402923-131473049  0.070  47.837  3.902
chr10:134327710-134332916  0.005  49.165  1.042
chr11:22799571-22884748  0.085  36.31  22.800
chr11:41726762-41805501  0.079  36.953  41.727
chr11:98814153-98916830  0.103  34.289  35.536
chr11:123252803-123316873  0.064  36.102  11.136
chr12:76314644-76411616  0.097  36.908  55.938
chr12:113021576-113043667  0.022  43.491  19.306
chr12:130611957-130673802  0.062  51.924  1.676
chr13:112193014-112294946  0.102  42.808  1.848
chr13:113053814-113215730  0.162  50.548  0.927
chr14:46453446-46536895  0.083  33.96  46.453
chr15:25756802-25785341  0.029  45.473  25.757
chr15:99580062-99745948  0.166  47.27  0.593
chr16:14143079-14310216  0.167  40.888  14.143
chr16:31443920-33371617  1.928  41.702  31.444
chr16:52606706-52680890  0.074  40.821  36.146
chr16:62091121-62180196  0.089  35.123  26.647
chr16:74485846-74586365  0.101  37.305  14.241
chr16:79337064-79399432  0.062  39.18  9.428
chr16:81193852-81216695  0.023  44.458  7.611
chr16:82334613-82343460  0.009  45.355  6.484
chr16:83305397-83391850  0.086  44.081  5.435
chr16:86246069-86327452  0.081  56.272  2.500
chr16:87408466-87706274  0.298  59.068  1.121
chr17:19959433-20070746  0.111  41.634  19.959
chr17:20801440-20845676  0.044  49.974  20.801
chr17:36095188-36165632  0.070  40.48  36.095
chr17:50010009-50101821  0.092  37.207  28.673
chr17:59661634-59698705  0.037  44.675  19.076
chr17:64611290-64747300  0.136  35.902  14.027
chr17:67185051-67306066  0.121  37.813  11.469
chr19:373238-892603  0.519  59.541  0.373
chr19:42148245-42239059  0.091  43.47  21.573
chr19:49185317-49257267  0.072  40.664  14.554
chr20:13008085-13018489  0.010  39.702  13.008
chr20:20272110-20371019  0.099  46.048  20.272
chr20:60967459-61027216  0.060  49.085  1.409
chr22:17359268-17386984  0.028  45.153  17.359
chr22:17761331-17872715  0.111  46.706  17.761
chr22:18119390-18255422  0.136  54.22  18.119
chr22:20832179-20871057  0.039  42.792  20.832
chr22:45726565-45769450  0.043  48.708  3.922
chr22:45898268-46013955  0.116  52.68  3.677
chr22:47504692-47516812  0.012  43.61  2.175

Table 3-3 Distribution of log2 ratios from comparison of unamplified and amplified samples versus a common reference set of 48 individuals

Sample compared vs. Affy48  Mean*  SD**  IQR***  Apparent amplifications (count)  Apparent amplifications (p<)  Apparent deletions (count)  Apparent deletions (p<)
Sample 1, unamplified  0.0002517  0.3079  0.3428  2  1.99x10^-8  3  1.65x10^-9
Sample 1, amplified  0.001971  0.3790  0.4793  322  9.76x10^-7  368  9.39x10^-9
Sample 2, unamplified  0.002710  0.2602  0.3152  2  3.70x10^-7  2  1.00x10^-16
Sample 2, amplified  -0.0001297  0.4188  0.5412  254  8.91x10^-7  157  8.33x10^-9
Sample 3, unamplified  0.003530  0.2584  0.3176  3  5.42x10^-10  1  1.00x10^-16
Sample 3, amplified  -0.0004284  0.4076  0.5178  295  7.45x10^-7  176  1.36x10^-8

* Mean value of log2 ratios resulting from each comparison. A site with equivalent copy number in both samples would return a log2 ratio of 0.
** Standard deviation of log2 ratios resulting from each comparison. These values are interpreted as a measure of data noise from each comparison.
*** Interquartile range of log2 ratios resulting from each comparison. These values are interpreted as a measure of data noise from each comparison.

Table 3-4 Apparent amplifications and deletions detected prior to amplification through comparison with a reference set of 48 individuals

Type  Sample compared vs. Affy48  Genome Coordinates of Variant (NCBI Build 36/hg18/Mar 2006)  Size (bp)  CN within variant  CN outside variant  SNP count  p-value  Variation Locus*
Amplification  Sample 1  chr7:48424572-48431182  6610  2.88184  2.04848  11  1.99x10^-8  -
Amplification  Sample 1  chr14:19381928-19492423  110495  2.93812  2.03610  28  4.85x10^-13  Locus 2636
Amplification  Sample 2  chr2:113809804-113849256  39452  2.28770  2.04023  12  3.70x10^-7  Locus 0397
Amplification  Sample 2  chr17:41569489-41709662  140173  3.07396  2.03694  41  2.31x10^-12  Locus 3029
Amplification  Sample 3  chr9:29695281-29706655  11374  2.19958  2.04042  4  <1.00x10^-16  -
Amplification  Sample 3  chr14:19309086-19459561  150475  2.65807  2.03481  25  5.42x10^-10  Locus 2639
Amplification  Sample 3  chr15:19163125-20077554  914429  2.66995  2.04165  72  <1.00x10^-16  Locus 2748
Deletion  Sample 1  chr7:142030227-142210594  180367  1.54593  2.04848  27  1.61x10^-10  Locus 1656
Deletion  Sample 1  chr14:21451264-22044096  592832  1.51299  2.03610  161  <1.00x10^-16  Loci 2644 and 2645
Deletion  Sample 1  chr22:33661041-33725126  64085  1.75349  2.06794  21  1.65x10^-9  Locus 3489
Deletion  Sample 2  chr2:50682535-50865587  183052  1.44974  2.04023  40  <1.00x10^-16  Locus 0329
Deletion  Sample 2  chr14:21792331-22040096  247765  1.38419  2.02893  60  <1.00x10^-16  Locus 2645
Deletion  Sample 3  chr14:21800768-21932862  132094  1.53811  2.03481  32  <1.00x10^-16  Locus 2645

* from the Database of Genomic Variants (http://projects.tcag.ca/variation/)

Table 3-5 Distribution of log2 ratios from comparison of two experimental replicates of each sample

Sample  Mean  SD  IQR
Sample 1, unamplified  0.005517  0.2579  0.3223
Sample 1, amplified  0.002538  0.2840  0.3544
Sample 2, unamplified  0.008175  0.2658  0.3299
Sample 2, amplified  0.0003263  0.3264  0.4153
Sample 3, unamplified  0.0064235  0.2585  0.3187
Sample 3, amplified  0.001687  0.2842  0.3517

Table 3-6 Regions of recurrent WGA under-amplification within chromosome ends

Chromosome end  Genome Coordinates (Build 36/hg18/Mar 2006)  Size (Mbp)  % GC content  Mbp from nearest chromosome end
p-terminal  chr1:3058506-3129776  0.071  57.113  3.059
p-terminal  chr1:5857077-5871605  0.015  57.168  5.857
p-terminal  chr2:554079-613259  0.059  45.934  0.554
p-terminal  chr2:1841469-1968296  0.127  45.876  1.841
p-terminal  chr5:487981-738504  0.251  56.251  0.488
p-terminal  chr5:2187888-2267721  0.080  49.395  2.188
p-terminal  chr5:2836714-2884070  0.047  41.89  2.837
p-terminal  chr5:3160861-3195828  0.035  46.205  3.161
p-terminal  chr8:791584-850907  0.059  47.539  0.792
p-terminal  chr8:1816651-1946694  0.130  49.183  1.817
p-terminal  chr10:2593122-2624375  0.031  37.102  2.593
p-terminal  chr19:373238-892603  0.519  59.541  0.373
q-terminal  chr6:170198708-170308225  0.110  51.929  0.592
q-terminal  chr7:158582043-158739710  0.158  45.905  0.082
q-terminal  chr10:134327710-134332916  0.005  49.165  1.042
q-terminal  chr12:130611957-130673802  0.062  51.924  1.676
q-terminal  chr13:112193014-112294946  0.102  42.808  1.848
q-terminal  chr13:113053814-113215730  0.162  50.548  0.927
q-terminal  chr15:99580062-99745948  0.166  47.27  0.593
q-terminal  chr16:87408466-87706274  0.298  59.068  1.121
q-terminal  chr20:60967459-61027216  0.060  49.085  1.409

Table 3-7 Apparent copy number differences identified by pair-wise comparisons of all possible combinations of unamplified and amplified samples

Samples compared  Apparent amplifications (count)  Apparent amplifications (p<)  Apparent deletions (count)  Apparent deletions (p<)  Total apparent CNVs  CNVs in common between matched comparisons(1)
Unamplified sample 1 vs. Unamplified sample 2  4  4.26x10^-7  3  1.40x10^-8  7  -
Unamplified sample 1 vs. Unamplified sample 3  4  3.88x10^-8  4  1.05x10^-13  8  -
Unamplified sample 2 vs. Unamplified sample 3  4  1.09x10^-10  2  3.44x10^-15  6  -
Amplified sample 1 vs. Unamplified sample 2  369  1.26x10^-6  367  7.77x10^-9  736  2
Unamplified sample 1 vs. Amplified sample 2  69  1.05x10^-6  358  7.04x10^-9  427  2
Amplified sample 1 vs. Unamplified sample 3  471  1.81x10^-6  498  1.28x10^-8  969  1
Unamplified sample 1 vs. Amplified sample 3  110  1.60x10^-6  536  1.53x10^-8  646  1
Amplified sample 2 vs. Unamplified sample 3  183  1.07x10^-6  49  5.64x10^-8  232  4
Unamplified sample 2 vs. Amplified sample 3  67  1.28x10^-6  130  3.31x10^-8  197  4
Amplified sample 1 vs. Amplified sample 2  21  2.03x10^-6  49  1.71x10^-8  70  -
Amplified sample 1 vs. Amplified sample 3  18  9.67x10^-7  82  2.69x10^-8  100  -
Amplified sample 2 vs. Amplified sample 3  44  1.82x10^-6  61  8.23x10^-8  105  -

(1) CNVs seen in both comparisons of a matched pair regardless of which sample was amplified, i.e., seen in amplified 1 vs. unamplified 2 as well as amplified 2 vs. unamplified 1. The value is shared by both rows of each matched pair.

Table 3-8 Copy number variants detected by pair-wise comparisons of unamplified and amplified sample sets

Sample comparison  Relative CN difference  Unamplified: coordinates (Build 36)  Unamplified: p<=  Unamplified: rank  Amplified: coordinates (Build 36)  Amplified: p<=  Amplified: rank  Variation Locus*
1 vs. 2  Increase  chr2:50775422-51014967  1.00x10^-16  1  chr2:50828689-50960764  1.15x10^-9  1 of 21  0329**
1 vs. 2  Increase  chr14:19272965-19489991  1.38x10^-10  2  -  -  -  2636
1 vs. 2  Increase  chr3:21942154-21975950  3.91x10^-7  3  -  -  -  -
1 vs. 2  Increase  chr16:22640088-22688093  4.26x10^-7  4  -  -  -  2893
1 vs. 2  Decrease  chr17:41569489-41708649  1.00x10^-16  1  chr17:41587072-41709662  1.00x10^-16  1 of 48  3029
1 vs. 2  Decrease  chr9:11936421-11997006  5.09x10^-11  2  -  -  -  1901
1 vs. 2  Decrease  chr10:95243220-95304377  1.40x10^-8  3  -  -  -  -
1 vs. 3  Increase  chr8:124654695-124656225  1.00x10^-16  1  -  -  -  -
1 vs. 3  Increase  chr13:43692360-43696382  3.99x10^-13  2  -  -  -  -
1 vs. 3  Increase  chr18:20691186-20697540  4.86x10^-13  3  -  -  -  -
1 vs. 3  Increase  chr14:19402695-19502641  3.88x10^-8  4  -  -  -  2636
1 vs. 3  Decrease  chr14:21715523-22040167  1.00x10^-16  1  chr14:21531617-22057862  1.00x10^-16  1 of 82  2644/5
1 vs. 3  Decrease  chr10:54588936-54590136  1.00x10^-16  1  -  -  -  -
1 vs. 3  Decrease  chr17:76310141-76321112  1.00x10^-16  1  -  -  -  -
1 vs. 3  Decrease  chr15:19876834-20005562  1.05x10^-13  4  chr15:19877365-20077554  2.11x10^-10  37 of 82  2748
2 vs. 3  Increase  chr17:41572099-41708649  1.00x10^-16  1  chr17:41522422-41647903  8.47x10^-13  1 of 44  3029
2 vs. 3  Increase  chr15:84684853-84693981  1.00x10^-16  1  -  -  -  2830
2 vs. 3  Increase  chr15:98087203-98095507  1.11x10^-11  3  -  -  -  2860
2 vs. 3  Increase  chr16:77105899-77109454  1.09x10^-10  4  -  -  -  -
2 vs. 3  Decrease  chr15:18711364-20079140  1.00x10^-16  1  chr15:19313868-20329239  1.00x10^-16  1 of 61  2748
2 vs. 3  Decrease  chr2:50870615-51020480  3.44x10^-15  2  chr2:50828689-51018056  1.00x10^-16  1 of 61  -

* from the Database of Genomic Variants (http://projects.tcag.ca/variation/)
** This CNV locus is overlapped only by the coordinates expected from comparison versus the Affy48 reference set.
95 Table 3-9 Copy number variants detected in MR families by pair-wise comparisons of unamplified and amplified sample sets (child versus father) Validated aberrations detected by pairwise comparison of unamplified samples [9] (100k array set) Detected by pairwise comparison of amplified samples (250k Nsp array) Variation Locus* Family ID [9] Relative CN Difference Coordinates (Build 36) Mbp Validation Cytoband Coordinates (Build 36) p=< Rank** chr10:259695-23144645 22.88 karyotyping 10p12.2-p15.3 chr10:1000464-24070263 1.00x10-16 1 of 13 many chr15:19208413-19943075 0.73 karyotyping 15q11.2 chr15:18850150-20335459 1.00x10-16 1 of 13 2748 8379 Increase - - - - chr14:21394980-21864733 1.00x10-16 1 of 13 many - - - - chr9:10069844-10104307 5.54x10-7 1 of 2 - Increase - - - - chr13:100974064-101034679 2.14x10-6 2 of 2 - 1280 Decrease chr4:22943293-23102259 0.16 FISH (BAC) 4p15.2 chr4:22828003-23025619 3.64x10-10 1 of 4 0794 - - - - chr5:64484426-64535538 1.00x10-16 1of 6 - Increase - - - - chr20:50794691-50801972 1.00x10-16 1 of 6 3405 chr1:83242288-83274337 0.03 FISH (fosmid) 1p31.1 - - - 0104 chr4:82282746-85558739 3.28 FISH (BAC) 4q21.23 chr4:82531241-92371701 1.00x10-16 1 of 10 many 3476 Decrease - - - - chr22:46869824-46963276 1.00x10-16 1 of 10 - * from the Database of Genomic Variants (http://projects.tcag.ca/variation/) ** Ranked by significance (p-value). Only variants with the lowest p-value scores are shown 96 3.7. Bibliography 1. McCarroll SA, Hadnott TN, Perry GH, Sabeti PC, Zody MC, Barrett JC, Dallaire S, Gabriel SB, Lee C, Daly MJ, Altshuler DM: Common deletion polymorphisms in the human genome. Nat Genet 2006, 38(1):86-92. 2. Feuk L, Carson AR, Scherer SW: Structural variation in the human genome. Nat Rev Genet 2006, 7(2):85-97. 3. Conrad DF, Andrews TD, Carter NP, Hurles ME, Pritchard JK: A high-resolution survey of deletion polymorphism in the human genome. Nat Genet 2006, 38(1):75- 81. 4. 
Sharp AJ, Locke DP, McGrath SD, Cheng Z, Bailey JA, Vallente RU, Pertz LM, Clark RA, Schwartz S, Segraves R, Oseroff VV, Albertson DG, Pinkel D, Eichler EE: Segmental duplications and copy-number variation in the human genome. Am J Hum Genet 2005, 77(1):78-88. 5. Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW, Lee C: Detection of large-scale variation in the human genome. Nat Genet 2004, 36(9):949- 951. 6. Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Maner S, Massa H, Walker M, Chi M, Navin N, Lucito R, Healy J, Hicks J, Ye K, Reiner A, Gilliam TC, Trask B, Patterson N, Zetterberg A, Wigler M: Large-scale copy number polymorphism in the human genome. Science 2004, 305(5683):525-528. 7. Rovelet-Lecrux A, Hannequin D, Raux G, Le Meur N, Laquerriere A, Vital A, Dumanchin C, Feuillette S, Brice A, Vercelletto M, Dubas F, Frebourg T, Campion D: APP locus duplication causes autosomal dominant early-onset Alzheimer disease with cerebral amyloid angiopathy. Nat Genet 2006, 38(1):24-26. 8. Zhang X, Snijders A, Segraves R, Zhang X, Niebuhr A, Albertson D, Yang H, Gray J, Niebuhr E, Bolund L, Pinkel D: High-resolution mapping of genotype-phenotype relationships in cri du chat syndrome using array comparative genomic hybridization. Am J Hum Genet 2005, 76(2):312-326. 9. Friedman JM, Baross A, Delaney AD, Ally A, Arbour L, Armstrong L, Asano J, Bailey DK, Barber S, Birch P, Brown-John M, Cao M, Chan S, Charest DL, Farnoud N, Fernandes N, Flibotte S, Go A, Gibson WT, Holt RA, Jones SJ, Kennedy GC, Krzywinski M, Langlois S, Li HI, McGillivray BC, Nayar T, Pugh TJ, Rajcan- Separovic E, Schein JE, Schnerch A, Siddiqui A, Van Allen MI, Wilson G, Yong SL, Zahir F, Eydoux P, Marra MA: Oligonucleotide microarray analysis of genomic imbalance in children with mental retardation. Am J Hum Genet 2006, 79(3):500- 513. 10. 
Tonon G, Wong KK, Maulik G, Brennan C, Feng B, Zhang Y, Khatry DB, Protopopov A, You MJ, Aguirre AJ, Martin ES, Yang Z, Ji H, Chin L, Depinho RA: High-resolution genomic profiles of human lung cancer. Proc Natl Acad Sci U S A 2005, 102(27):9625-9630. 11. Zhao X, Li C, Paez JG, Chin K, Janne PA, Chen TH, Girard L, Minna J, Christiani D, Leo C, Gray JW, Sellers WR, Meyerson M: An integrated view of copy number and allelic alterations in the cancer genome using single nucleotide polymorphism arrays. Cancer Res 2004, 64(9):3060-3071. 12. Cappuzzo F, Hirsch FR, Rossi E, Bartolini S, Ceresoli GL, Bemis L, Haney J, Witta S, Danenberg K, Domenichini I, Ludovini V, Magrini E, Gregorc V, Doglioni C, Sidoni A, Tonato M, Franklin WA, Crino L, Bunn PA, Jr., Varella-Garcia M: Epidermal growth factor receptor gene and protein and gefitinib sensitivity in non-small-cell lung cancer. J Natl Cancer Inst 2005, 97(9):643-655. 13. Hirsch FR, Varella-Garcia M, Bunn PA, Jr., Di Maria MV, Veve R, Bremmes RM, Baron AE, Zeng C, Franklin WA: Epidermal growth factor receptor in non-small-cell lung carcinomas: correlation between gene copy number and protein expression and impact on prognosis. J Clin Oncol 2003, 21(20):3798-3807. 14. Nahta R, Yu D, Hung MC, Hortobagyi GN, Esteva FJ: Mechanisms of disease: understanding resistance to HER2-targeted therapy in human breast cancer. Nat Clin Pract Oncol 2006, 3(5):269-280. 15. Cho HS, Mason K, Ramyar KX, Stanley AM, Gabelli SB, Denney DW, Jr., Leahy DJ: Structure of the extracellular region of HER2 alone and in complex with the Herceptin Fab. Nature 2003, 421(6924):756-760. 16. Menard S, Pupa SM, Campiglio M, Tagliabue E: Biologic and therapeutic role of HER2 in cancer. Oncogene 2003, 22(42):6570-6578. 17. Rubin I, Yarden Y: The basic biology of HER2. Ann Oncol 2001, 12 Suppl 1:S3-8. 18. Laskin JJ, Sandler AB: Epidermal growth factor receptor: a promising target in solid tumours. Cancer Treat Rev 2004, 30(1):1-17. 19.
Hughes S, Arneson N, Done S, Squire J: The use of whole genome amplification in the study of human disease. Prog Biophys Mol Biol 2005, 88(1):173-189. 20. Dean FB, Hosono S, Fang L, Wu X, Faruqi AF, Bray-Ward P, Sun Z, Zong Q, Du Y, Du J, Driscoll M, Song W, Kingsmore SF, Egholm M, Lasken RS: Comprehensive human genome amplification using multiple displacement amplification. Proc Natl Acad Sci U S A 2002, 99(8):5261-5266. 21. Spits C, Le Caignec C, De Rycke M, Van Haute L, Van Steirteghem A, Liebaers I, Sermon K: Whole-genome multiple displacement amplification from single cells. Nat Protoc 2006, 1(4):1965-1970. 22. Corneveaux JJ, Kruer MC, Hu-Lince D, Ramsey KE, Zismann VL, Stephan DA, Craig DW, Huentelman MJ: SNP-based chromosomal copy number ascertainment following multiple displacement whole-genome amplification. Biotechniques 2007, 42(1):77-83. 23. Paez JG LM, Beroukhim R, Lee JC, Zhao X, Richter DJ, Gabriel S, Herman P, Sasaki H, Altshuler D, Li C, Meyerson M, Sellers WR.: Genome coverage and sequence fidelity of phi29 polymerase-based multiple strand displacement whole genome amplification. Nucleic Acids Res 2004, 32:e71. 24. Arriola E, Lambros MB, Jones C, Dexter T, Mackay A, Tan DS, Tamber N, Fenwick K, Ashworth A, Dowsett M, Reis-Filho JS: Evaluation of Phi29-based whole-genome amplification for microarray-based comparative genomic hybridisation. Lab Invest 2007, 87(1):75-83. 25. Lage JM, Leamon JH, Pejovic T, Hamann S, Lacey M, Dillon D, Segraves R, Vossbrinck B, Gonzalez A, Pinkel D, Albertson DG, Costa J, Lizardi PM: Whole genome analysis of genetic alterations in small DNA samples using hyperbranched strand displacement amplification and array-CGH. Genome Res 2003, 13(2):294- 307. 26. Tzvetkov MV, Becker C, Kulle B, Nurnberg P, Brockmoller J, Wojnowski L: Genome- wide single-nucleotide polymorphism arrays demonstrate high fidelity of multiple displacement-based whole-genome amplification. Electrophoresis 2005, 26(3):710- 715. 27. 
Bredel M, Bredel C, Juric D, Kim Y, Vogel H, Harsh GR, Recht LD, Pollack JR, Sikic BI: Amplification of whole tumor genomes and gene-by-gene mapping of genomic aberrations from limited sources of fresh-frozen and paraffin-embedded DNA. J Mol Diagn 2005, 7(2):171-182. 28. Esteban JA, Salas M, Blanco L: Fidelity of phi 29 DNA polymerase. Comparison between protein-primed initiation and DNA polymerization. J Biol Chem 1993, 268(4):2719-2726. 29. Affymetrix webpage [http://www.affymetrix.com/] 30. Pinard R, de Winter A, Sarkis GJ, Gerstein MB, Tartaro KR, Plant RN, Egholm M, Rothberg JM, Leamon JH: Assessment of whole genome amplification-induced bias through high-throughput, massively parallel whole genome sequencing. BMC Genomics 2006, 7:216. 31. Hosono S FA, Dean FB, Du Y, Sun Z, Wu X, Du J, Kingsmore SF, Egholm M, Lasken RS.: Unbiased whole-genome amplification directly from clinical samples. Genome Res 2003, 13:954-964. 32. Benson G: Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res 1999, 27(2):573-580. 33. Iiizumi M, Liu W, Pai SK, Furuta E, Watabe K: Drug development against metastasis-related genes and their pathways: a rationale for cancer therapy. Biochim Biophys Acta 2008, 1786(2):87-104. Chapter 4. Sequence variant discovery in DNA repair genes from radiosensitive and radiotolerant prostate brachytherapy patients3 In the treatment of human cancer, an individual\u2019s genetic background, their complement of SNPs and CNPs, dictates not only their susceptibility to cancer but also the manner in which normal cells respond to therapy. Efficacy of several medications has been linked to SNPs in genes encoding drug-metabolizers, transporters, receptors, and targets [1, 2]. Stratifying patients based on genotype has the potential to classify diseases with known susceptibility, improve the dosing and targeting of drugs, and decrease the time and cost of conducting clinical trials [2].
However, the role of genetic polymorphisms is not restricted to genetic predictions of drug response. While maximizing treatment efficacy is the goal of any therapy, an equally important consideration is the minimization of treatment-induced side effects. This is particularly important in the case of radiation therapies in which the treatment can induce new primary cancers in radiosensitive patients. Many genetic studies of radiosensitivity in cancer patients have focused on variants of ATM, a gene central to DNA damage detection and repair, with mixed results. In an expanded search for variants associated with radiosensitivity, this chapter documents a search for germline genetic variants predictive of late side effects in prostate cancer patients treated with radiation brachytherapy. We also investigated associations of these variants with increased levels of DNA double-strand breaks marked by expression of gammaH2AX following irradiation. At the time of manuscript publication, this study was the largest sequencing-based survey of DNA repair genes to look for germline variants associated with radiosensitivity. 3 A version of this chapter has been published. Pugh, T.J., Keyes, M., Barclay, L., Delaney, A., Krzywinski, M., Novik, K., Thomas, D., Yang, C., Agranovich, A., McKenzie, M., Morris, W.J., Olive, P.L., Marra, M.A., Moore, R.A. Clin Cancer Res. 2009 Aug 1;15(15):5008-16. 4.1. Introduction Prostate brachytherapy (PB) is a standard treatment for early stage prostate cancer. Radioactive seeds are implanted into the prostate to deliver high doses of conformal radiation achieving excellent long term results [3-5]. A recent study of 1,006 consecutive PB patients at the BC Cancer Agency found biochemical freedom from recurrence rates of 95.6% at 5 years and 94.0% at 7 years [5].
Despite great pains taken to minimize exposure of tumour-adjacent cells to ionizing radiation, even patients with ideal radiation dosimetries can develop side effects months or even years after recovery from the early inflammatory response following radiation treatment. While early side effects are often temporary, late side effects are often irreversible or even progressive [6] and appear to be a symptom of long-term DNA damage to surrounding tissues. Symptoms include acute and late urinary toxicity, acute and late rectal toxicity and loss of sexual potency. Predictive clinical factors for severity of toxicity such as baseline urinary and sexual function prior to procedure and radiation dose have been investigated by our group [5, 7-12] and others [13-15], but overall there is no consensus that any of these factors are effective predictors of late side effects of PB. If not for the toxicity to normal tissues in some patients, PB would be an ideal treatment for prostate cancer due to its high long term efficacy, minimal invasiveness, and low impact on patient quality of life. The presence of intrinsic radiosensitivity may prove to be an important factor contributing to development of PB toxicity. Several observations support the hypothesis that an individual\u2019s radiosensitivity is mediated by genetic variants in specific genes. Radiosensitivity appears to be an inherited trait as cells from monozygotic twins have greater intrapair correlation of cell cycle delay and apoptosis following irradiation than dizygotic twins [16]. In addition, two studies of breast cancer patients have found that cells from first degree relatives of patients with high radiosensitivity are similarly sensitive to radiation [17, 18]. Ionizing radiation such as that used in PB has been well documented to cause DNA double-strand breaks that are repaired through a number of DNA repair mechanisms [19-21].
Defects in DNA repair genes ATM, LIG4, and MRE11 lead to developmental syndromes that include increased radiosensitivity [21, 22]. In vitro, cells with mutations in ATM have been shown to have increased radiosensitivity [23]. In the treatment of cancer, positive correlations have been made between variants in ATM and radiosensitivity [24-30]. Such variants have been uncovered by two studies of prostate cancer patients. Hall et al. [28] used DNA sequencing to find ATM mutations in 3 of 17 prostate cancer patients with late radiotherapy side effects. A study of 37 PB patients by Cesaretti et al. [26] used DHPLC to identify 21 variants in ATM in 16 patients and found a correlation between possession of sequence variants in this gene, particularly missense variants, and late side effects of PB. Several studies of this kind have been limited by small sample size, indirect or low resolution variant detection methods, and examination of only a single candidate gene [24]. Current evidence suggests that radiosensitivity is a complex genetic trait mediated by a number of genes, each of which may harbour low frequency variants which together modulate the radiosensitive phenotype [24, 30, 31]. Two studies have examined the role of single nucleotide polymorphisms (SNPs) from multiple genes in predicting radiosensitivity of prostate cancer patients treated with radiation. One study genotyped 49 SNPs from 24 genes in 83 patients and identified three genes, LIG4, ERCC2 and CYP2D6, containing SNPs associated with radiation toxicity [32]. The second study genotyped 450 SNPs from 118 genes in 197 patients and defined urinary toxicity \u2018risk genotypes\u2019 associated with SNPs in five genes, SART1, ID3, EPDR1, PAH, and XRCC6 [33]. These studies genotyped an average of 1.8 and 6.1 known SNPs from each gene, respectively, and would not have discovered novel variants in these genes that may directly mediate radiosensitivity.
To date, no comprehensive sequencing-based survey of multiple candidate DNA repair genes has been performed in a set of high and low toxicity PB patients to discover and genotype such variants. We set out to perform such a survey to 1) discover new variants and 2) investigate whether variants in genes responsible for detecting and repairing DNA damage contribute to PB toxicity. While the effects of radiation can take several forms and involve a number of mechanisms [6], late side effects can develop months or years after irradiation due to the presence of unrepaired, damaged DNA. Poor ability to repair these lesions may be due to reduced function of proteins responsible for detection and repair of DNA damage. As of March, 2009, over 175 DNA repair genes had been identified, but many play a supporting role in DNA damage repair as members of cell signalling pathways, catalytic subunits, cofactors, or proteins that interact with well characterized genes but without a known function of their own [34]. Therefore, we restricted our study to a set of well-characterized DNA repair genes that encode proteins that act directly at the site of double-strand breaks and are the primary machinery for the detection and repair of these lesions (Figure 4.1, [35]). We sequenced the coding and flanking intronic regions of eight DNA repair genes (ATM, BRCA1, ERCC2, H2AFX, LIG4, MDC1, MRE11A, and RAD50) in 41 prostate cancer patients treated with PB at the BC Cancer Agency. These genes were selected because each plays a role in the detection and repair of DNA damage from ionizing radiation (16-18) and functional alterations of any of these genes may result in reduced ability to repair double stranded DNA breaks caused by prostate brachytherapy. ATM kinase plays a central role as a sensor of DNA damage that activates signal transduction pathways to halt cell cycle progression until the DNA damage is repaired.
At the site of a double strand break (DSB), ATM phosphorylates several proteins including H2AFX, a histone variant, to recruit a nuclease complex for DNA repair [21]. MRE11A nuclease and RAD50 ATPase are part of this complex and enzymatically process the ends of DSBs for repair by homologous recombination [21, 36]. MDC1 mediates the recruitment of this complex by interacting with both H2AFX and MRE11A/RAD50 [36]. BRCA1 acts as a scaffold for replication and DNA repair proteins and forms a \u201CBRCA1-associated genome surveillance complex\u201D with ATM and the MRE11A/RAD50 complex [21, 37]. LIG4, also known as DNA ligase IV, plays a role in repairing DSBs by uniting broken ends through an alternative mechanism called non-homologous end-joining [21, 38]. Damage to individual DNA bases is addressed by a third mechanism, nucleotide excision repair, in which ERCC2 helicase, also known as XPD, is responsible for unwinding the damaged DNA helical structure so repair can take place [21]. 4.2. Materials and methods 4.2.1. Patient selection and toxicity metrics The Prostate Brachytherapy Program at the British Columbia Cancer Agency (BCCA) was established in 1997. As of March 2008, more than 2500 patients had undergone PB as part of this program. Eligible patients included those with low-risk disease (clinical stage \u2264 T2a, initial PSA (iPSA) \u2264 10.0 ng/ml and Gleason Score (GS) \u2264 6), and \u2018low tier\u2019 intermediate risk patients (stage \u2264 T2c and GS \u2264 6 with iPSA 10-15 ng/ml or GS = 7 with iPSA < 10 ng/ml). Our implant technique is described in detail elsewhere [5, 7, 9]. Prostate and rectal dosimetry is obtained using day 30 post-implant CT, using VariSeed software (Varian Medical Systems, Palo Alto, CA). Post-implant contouring was done by an implanting oncologist and tissue dosimetry recorded.
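The eligibility criteria above amount to a small rule set. The following Python sketch is only an illustration of those rules: the simplified ordering of clinical T stages in STAGE_ORDER and the function names are assumptions introduced here, not part of the program's protocol.

```python
# Illustrative sketch of the eligibility rules quoted in the text.
# STAGE_ORDER is an assumed, simplified ranking of clinical T stages.
STAGE_ORDER = ['T1a', 'T1b', 'T1c', 'T2a', 'T2b', 'T2c', 'T3a', 'T3b', 'T4']


def stage_at_most(stage, limit):
    # True when stage is no more advanced than limit in STAGE_ORDER.
    return STAGE_ORDER.index(stage) <= STAGE_ORDER.index(limit)


def pb_risk_group(stage, ipsa, gleason):
    # Low risk: stage <= T2a, iPSA <= 10.0 ng/ml and Gleason score <= 6.
    if stage_at_most(stage, 'T2a') and ipsa <= 10.0 and gleason <= 6:
        return 'low'
    # Low tier intermediate risk: stage <= T2c and either
    # GS <= 6 with iPSA 10-15 ng/ml, or GS = 7 with iPSA < 10 ng/ml.
    if stage_at_most(stage, 'T2c'):
        if gleason <= 6 and 10.0 <= ipsa <= 15.0:
            return 'low-tier intermediate'
        if gleason == 7 and ipsa < 10.0:
            return 'low-tier intermediate'
    # Anything else falls outside the criteria as written.
    return None


print(pb_risk_group('T1c', 6.5, 6))
print(pb_risk_group('T2b', 12.0, 6))
```

Returning None for patients outside both groups keeps the sketch faithful to the criteria as written rather than guessing how borderline cases were handled in practice.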
Patients were seen at 6 weeks after the procedure, every 6 months for 2-3 years, and then annually. Toxicity score components are assessed by a physician on each visit, including Radiation Therapy Oncology Group (RTOG) urinary and rectal toxicity scores [39], International Prostate Symptom Score (IPSS [40]) and patient-reported erectile function. To better reflect the specific brachytherapy toxicity profile, the genitourinary (GU) and gastrointestinal (GI) RTOG toxicity scale was modified at the inception of the program. Forty-one prostate brachytherapy patients living in the Vancouver lower mainland region with at least three years of follow-up (mean 6.6 years) were selected for study from the BCCA Prostate Brachytherapy database, agreed to participate and provided informed consent. Ethics approval of the study was granted by the BC Cancer Agency Research Ethics Board. Patients were chosen based on development or lack of development of late normal tissue toxicity following brachytherapy and, to minimize radiation dose as a source of experimental variability, their near ideal rectal and prostate post-implant dosimetry: prostate D90 < 175 Gy (dose covering 90% of the prostate less than 175 Gy), prostate V100 > 85% (more than 85% of the prostate volume receiving at least 100% of the prescribed dose), and rectal VR100 < 1.0 cm3 (volume of the rectum receiving 145 Gy is less than 1.0 cm3). Median prostate D90, prostate V100, and rectal VR100 for low and high toxicity patients were 154 Gy, 93%, 0.26 cm3 and 148 Gy, 92%, 0.34 cm3 respectively. While some patients received less than ideal dosimetry, neither the low nor the high toxicity group is enriched for these exceptions (p=1.00 for V100, VR100, and D90), nor is there a linear, exponential, or up to sixth order polynomial relationship between any of the dosimetry values and toxicity score (R2<0.4).
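The three dosimetry thresholds can be expressed as a single check. This is a minimal sketch under the thresholds quoted above; the constant and function names are illustrative assumptions, and the two calls use the group medians reported in the text.

```python
# Thresholds quoted in the text; all names here are illustrative only.
MAX_PROSTATE_D90_GY = 175.0
MIN_PROSTATE_V100_PCT = 85.0
MAX_RECTAL_VR100_CM3 = 1.0


def near_ideal_dosimetry(d90_gy, v100_pct, vr100_cm3):
    # True when all three post-implant dosimetry criteria are met.
    return (d90_gy < MAX_PROSTATE_D90_GY
            and v100_pct > MIN_PROSTATE_V100_PCT
            and vr100_cm3 < MAX_RECTAL_VR100_CM3)


# Median values reported for the low and high toxicity groups.
low_group_ok = near_ideal_dosimetry(154.0, 93.0, 0.26)
high_group_ok = near_ideal_dosimetry(148.0, 92.0, 0.34)
print(low_group_ok, high_group_ok)
```

Both group medians satisfy all three criteria, consistent with dose being held roughly constant between the groups.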
Usually, toxicity after radiation therapy is reported as a single organ or tissue toxicity score and the original RTOG toxicity scale is primarily used for external beam radiation therapy and not for brachytherapy. To adequately capture patients with multiple organ or tissue toxicities, we have created a somewhat arbitrary composite toxicity score listed in Table 4-1. Several peer reviewed articles have been published regarding the toxicities determined using the modified RTOG toxicity scores [10-12]. From our analysis of 1000 brachytherapy patients with a minimum of 3 years of follow-up, patients with multiple severe toxicities are relatively rare, comprising 2-10% of the entire population (unpublished data). For this study, an attempt was made to capture patients with the worst multiple toxicities, respecting the limitation of geographical availability, patient willingness to participate in the study and the requirement of near-ideal post-implant dosimetry. As acute toxicity is likely related to the PB procedure itself [8, 9, 12, 41], this study focuses on late toxicity only, defined as development of toxicity more than 1 year following the implant. The average follow-up time since implant was 78 \u00B1 14 months for patients with little or no evidence of late toxicity and 81.2 \u00B1 12 months for patients with late toxicity. Twenty patients with no late side effects of prostate brachytherapy were chosen based on a score of 0 or 1 for the criteria listed in Table 4-1. Twenty-one patients with documented late side effects to brachytherapy were selected with scores in at least two of the Table 4-1 categories and to have a total score of at least 2. Clinical and DNA sequence variant data used in our analysis for all patients are reported in Supplemental Table 1 available, due to its large size, in electronic format at www.clincancerres.aacrjournals.org.
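The group definitions above can be sketched as a small function. Several points are assumptions made only for illustration: the category names are hypothetical stand-ins for the Table 4-1 categories, the low toxicity rule is read as a total composite score of 0 or 1, and a patient who fits neither rule is returned as unassigned.

```python
def toxicity_group(category_scores):
    # category_scores maps a (hypothetical) Table 4-1 category name
    # to its integer toxicity score for one patient.
    total = sum(category_scores.values())
    affected = sum(1 for score in category_scores.values() if score > 0)
    if total <= 1:
        # Assumed reading of the low toxicity rule: total score of 0 or 1.
        return 'low'
    if affected >= 2 and total >= 2:
        # High toxicity: scores in at least two categories, total >= 2.
        return 'high'
    # For example, a single category scoring 2 or more fits neither rule.
    return None


print(toxicity_group({'urinary': 0, 'rectal': 1, 'erectile': 0}))
print(toxicity_group({'urinary': 2, 'rectal': 1, 'erectile': 0}))
```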
Table 4-2 contains a summary of the DNA sequence variant data, a breakdown of the toxicity scores, and additional clinical data including hormone usage, tumour stage, planning ultrasound target volume, Gleason score, and age at implant. 4.2.2. PCR amplification and sequencing of DNA repair genes Each patient provided a 24 mL blood sample from which genomic DNA was extracted using the Gentra Puregene Blood kit (Qiagen Inc, Mississauga, ON) and quantified using a NanoDrop spectrophotometer (NanoDrop Technologies, Wilmington, DE). PCR and sequencing of the target amplicons were carried out by the BC Cancer Agency Genome Sciences Centre sequencing group using previously published reaction chemistries [42]. PCR volumes were scaled down to 10 \u00B5L, 10 ng of genomic DNA was used for each reaction, and reactions were performed using a 60 \u00B0C annealing temperature. PCR primer sequences and the genome coordinates of each amplicon are available in Table 4-3. To facilitate sequencing with universal sequencing primers, each forward primer was ordered with the prefix sequence TGTAAAACGAGGCCAGT and each reverse primer was ordered with the prefix sequence CAGGAAACAGCTATGAC. Variants were detected using Mutation Surveyor v3.2 (SoftGenetics, State College, PA) and mutation reports summarized using custom scripts. Genome coordinates and dbSNP accession numbers reported here correspond to human genome build 36.1/hg18 (March 2006) and dbSNP build 128 (http://www.ncbi.nlm.nih.gov/). 4.2.3. Statistical analyses Statistical analyses were performed using custom Perl and shell scripts. A Perl module implementation of the two-tailed Fisher\u2019s exact test (http://search.cpan.org/dist/Text-NSP/lib/Text/NSP/Measures/2D/Fisher/twotailed.pm) was used to assess differences in allele distribution between low and high toxicity patients at each variant site.
To investigate the contribution of individual alleles to intrinsic radiosensitivity, four statistical tests were performed using different allele distributions between the high and low toxicity groups at each variant site (A = reference allele, B = non-reference allele): p1 tested for an association with homozygosity for the reference allele (AA vs. AB+BB), p2 tested for association with homozygosity for the non-reference allele (BB vs. AA+AB), p3 tested for association with the presence of either allele regardless of zygosity (p3 = A vs. B), and p4 tested for association with homozygotes only to remove possible intermediate toxicities due to heterozygosity for a risk or protective allele (p4 = AA vs. BB, heterozygotes discarded). Patients without sequence coverage of a particular variant were not included in the statistical tests of that site. We also used this approach to investigate correlations between these variants and residual gammaH2AX measured in blood cells from these patients following radiation [43]. We used the combined ranks of the monocyte and lymphocyte scores published by Olive et al. [43] to divide our patients into two groups: 19 \u201Clow gammaH2AX\u201D individuals (sum of ranks < 41) and 21 \u201Chigh gammaH2AX\u201D individuals (sum of ranks \u2265 41). A higher expression of residual gammaH2AX measured 24 hours after radiation can suggest decreased DNA repair capacity and, we hypothesize, an increased likelihood of late side effects. As this is a hypothesis-generating study seeking to identify subtle genetic relationships with radiosensitivity, we set our p-value threshold for significance at the conventional 0.05 with the understanding that false-positives are a real possibility given the number of tests performed. More stringent corrections for multiple testing are warranted for future validation studies in larger patient populations. 4.3. Results 4.3.1.
DNA sequencing summary We selected eight DNA repair genes in each of 41 individuals for sequencing. These eight genes contained 173 exons which were covered by 242 PCR amplicons. The amplicons targeted 115 kbp of genomic sequence of which 45.2 kbp were exonic (Table 4-4). Across 41 patients, 239 sites were shown to differ from the human genome reference sequence and 170 of these corresponded to known variants listed in dbSNP (http://www.ncbi.nlm.nih.gov/projects/SNP/). 60 of the variants, 14 of them novel, fell in a protein coding region (Table 4-5), and 43 of these resulted in an amino acid substitution (Table 4-4). 22 of the variants that effected an amino acid change were expected to be non-conservative, which we defined as a score less than 0 on the BLOSUM62 alignment score matrix [44, 45]. 36 variants were small (1 to 5 bp) insertions or deletions, none of which affected protein coding regions. 23 of these were known polymorphisms recorded in dbSNP, 14 of 25 deletions and 9 of 11 insertions. Variants within exon splicing sites were rare, and only a single variant fell within 2 bp of an annotated exon end, a novel intronic MRE11A variant located at chr11:93,866,690 in a single heterozygous low toxicity individual. The largest gene, ATM, contained the greatest number of variants, 53 in non-coding and 9 in coding regions. However, its density of 1.5 variants per kbp sequenced was below the average of 2.3 variants per kbp for the eight genes sequenced (median 2.2, range 0.4-4.2 variants/kbp), suggesting that this gene has slightly decreased variant density compared to the other genes in our study despite harbouring the greatest number of variants (Table 4-4). The BRCA1 gene contained the second largest number of variants, 47, and the greatest proportion of known variants recorded in dbSNP, 91%. This high percentage of known variants likely reflects the intense research interest in this gene due to its association with breast cancer risk.
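The non-conservative definition used above (a BLOSUM62 score below 0) can be illustrated with a short lookup. This sketch hard-codes only four standard BLOSUM62 entries, chosen to cover substitutions reported in this chapter, rather than loading the full matrix; the dictionary and function names are illustrative assumptions.

```python
# Four entries from the standard BLOSUM62 matrix, keyed by the
# alphabetically sorted residue pair. A real analysis would load
# the complete matrix; this subset is for illustration only.
BLOSUM62_SUBSET = {
    ('D', 'N'): 1,
    ('P', 'R'): -2,
    ('P', 'S'): -1,
    ('D', 'V'): -3,
}


def blosum62_score(ref_aa, alt_aa):
    # The matrix is symmetric, so the pair is sorted before lookup.
    return BLOSUM62_SUBSET[tuple(sorted((ref_aa, alt_aa)))]


def is_non_conservative(ref_aa, alt_aa):
    # Definition from the text: BLOSUM62 score less than 0.
    return blosum62_score(ref_aa, alt_aa) < 0


for ref_aa, alt_aa in [('D', 'N'), ('P', 'R'), ('S', 'P'), ('D', 'V')]:
    print(ref_aa, alt_aa, blosum62_score(ref_aa, alt_aa),
          is_non_conservative(ref_aa, alt_aa))
```

Under this definition an aspartate to asparagine change (score 1) counts as conservative, while proline to arginine (score -2) does not.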
MRE11A and RAD50 contained the lowest number of variants per kbp in coding regions, 0 and 0.3 respectively. This high level of sequence conservation likely reflects the fact that the products of these genes interact with a large number of proteins. In contrast, ERCC2, LIG4, and MDC1 had the greatest number of variants per kbp sequenced, 4.2, 3.6 and 3.7 respectively. While MDC1 had 23 coding variants, the greatest of any of the genes studied, 5 of these were only present in a single individual with no late side effects (Patient 33) and 4 were shared exclusively between this individual and a patient with high toxicity (Patient 43). H2AFX, the smallest gene, contained the fewest variants, 5, and the single coding variant, while novel, did not result in an amino acid change. 4.3.2. ATM variants detected by previous studies of radiosensitivity We detected 5 known coding variants observed in previous studies relating ATM to radiosensitivity: 1) 2119T>C, S707P (chr11:107,629,971, rs4986761) [25, 27]; 2) 5558A>T, D1853V (chr11:107,680,673, rs1801673) [25]; 3) 3161C>G, P1054R (chr11:107,648,666, rs1800057) [25, 26, 29]; 4) 4578C>T, P1526P (chr11:107,668,697, rs1800889) [25, 26, 29]; 5) 5557G>A, D1853N (chr11:107,680,672, rs1801516) [25, 26]. Coding sequence coordinates listed are relative to the ATM transcript record ENST00000278616 accessed through the Ensembl website (http://www.ensembl.org/). The first and second variants, resulting in amino acid changes S707P and D1853V respectively, were observed in a single low toxicity patient heterozygous at both sites. The third variant, resulting in the amino acid change P1054R, was previously found to double the risk of developing prostate cancer [46] and was observed in two heterozygous high toxicity patients in our population (Patients 9 and 17). The fourth variant, a synonymous change retaining P1526, was observed in two heterozygous patients, one with high toxicity and one with low toxicity.
The fifth variant, resulting in the amino acid change D1853N and previously suggested to mediate radiosensitivity in breast cancer patients [25], was observed in five heterozygous patients, three low toxicity and two high toxicity. None of these variants was statistically associated with high PB toxicity in our population (p>0.46). 4.3.3. Using quantity of DNA repair gene variants to predict radiosensitivity Previous studies have postulated that the number of variants in DNA repair genes can be used to distinguish radiotherapy patients with high toxicity from low toxicity [25-27, 33]. In our study, every patient had at least 1 variant in each of 5 DNA repair genes (ATM, ERCC2, H2AFX, MDC1, and RAD50) (Figure 4.2). Three genes had more variants on average in the high toxicity patients than in the low toxicity patients (BRCA1, H2AFX, and MDC1). However, there was no statistically significant enrichment to either side of the mean in high or low toxicity groups for any of the eight genes studied (p>0.10). This was also true for missense variants (p>0.09) and non-conservative variants (p>0.16). However, when all coding variants were taken as a group regardless of amino acid conservation score, we did observe an enrichment of such variants in the LIG4 gene in high toxicity patients (p=0.03 for LIG4, p>0.34 for all other genes). We tested all possible quantity thresholds (from 1 to 32 variants) of all four variant classes to distinguish low toxicity and high toxicity groups and found only one that met statistical significance. In our population, the high toxicity group was enriched for individuals harbouring at least one LIG4 coding variant. 4.3.4. Using specific DNA repair gene variants to predict radiosensitivity We hypothesized that specific DNA repair gene variants, not the number of such variants, would be associated with radiosensitivity. 
To assess genetic associations with radiation toxicity, the genotype and allele distribution between high and low toxicity groups 110 were analyzed at every variant site using four two-tailed Fisher\u00E2\u0080\u0099s exact tests (Methods). One coding synonymous variant in MDC1, 4178C>CG, A1657AA, located at chr6:30,779,968 returned a p-value less than 0.05 (p1 = 0.048, p2 = 1.00, p3 = 0.056, p4 = 1.00). All five patients with the minor allele (Patients 4, 18, 21, 26, 34) were heterozygous and had high radiation toxicity scores (6, 3, 4, 5, and 13). This variant has been previously recorded in dbSNP as rs28986317, and minor allele frequencies have been observed from 1-6% in four populations. All other variant sites returned p-values greater than 0.05, and none appeared to be associated with increased radiation toxicity at a statistically significant level in our patient population. 4.3.5. Relationship of DNA repair gene variants with residual gammaH2AX following irradiation Assessments of DNA repair ability, represented by the relative expression of gammaH2AX remaining 24 hours after exposure to 2 Gy, were taken for 40 of 41 of the patients [43]. While residual gammaH2AX following irradiation did not correlate with late side effects of PB [43], we did observe 15 intronic variants to be correlated with decreased residual gammaH2AX, i.e. increased DNA repair activity (p\u00E2\u0089\u00A40.049), and 1 coding, missense variant to be correlated with increased residual gammaH2AX, i.e. decreased DNA repair ability (p=0.042) (Table 4-5). 14 of the 15 low gammaH2AX variants were in BRCA1, and 13 of these were documented in dbSNP, suggesting a primary role for this protein in addressing double-strand breaks marked by gammaH2AX. The remaining intronic variant is a known variant in MDC1 (rs9405048), but none of these variants was correlated with fewer late side effects of PB (p\u00E2\u0089\u00A50.127). 
The single variant associated with increased gammaH2AX results in a conservative amino acid change in ERCC2 (D312N, BLOSUM62 score = 1) and is documented in dbSNP (rs1799793). 12 of the 17 patients harbouring the minor allele had high expression of residual gammaH2AX (sum of ranks \u2265 41), and 5 of these had toxicity scores higher than 2. Patient 10 was an exception as he harboured the minor allele and received a high toxicity score of 7 and yet had the lowest residual gammaH2AX. This patient did not harbour any of the 15 variants correlated with decreased gammaH2AX, suggesting that clearance of double-strand breaks is also mediated by genes outside of our candidate set. 4.4. Discussion To the best of our knowledge, this study represents the first direct sequencing study of multiple DNA repair genes in radiosensitive and radiotolerant prostate brachytherapy patients. This survey uncovered 239 variants distributed across eight DNA repair genes of which 69 were novel and had not been recorded in dbSNP. Of the 46 coding variants, 32 of which resulted in an amino acid change, 14 were novel (not in dbSNP). These results suggest that the genetic diversity of these genes is not fully captured in existing databases and that sequencing of genes in larger populations is necessary to uncover lower frequency variants that may define complex radiosensitive phenotypes. This survey identified five ATM variants analyzed in previous investigations of ATM and radiosensitivity [25-27, 29] but observed no statistically significant relationship with late side effects of prostate brachytherapy. Contrary to previous reports based on population sizes similar to our own [26, 27], the number of variants in ATM could not predict radiation toxicity.
However, a specific variant that doubles the risk of prostate cancer, P1054R [46], was seen exclusively in two high toxicity individuals, which suggests that aspects of DNA repair ability may underlie a predisposition for prostate cancer. While the frequency of this variant in the high toxicity group did not reach statistical significance (p=0.488), the fact that it was found exclusively in high toxicity patients warrants further investigation in larger populations as a potential predictor of radiosensitivity. Of 239 variants detected across eight DNA repair genes, only one variant, rs28986317 in MDC1, was statistically associated with late side effects of PB (p=0.048). The biological effect of this variant on protein structure is not immediately obvious as it does not result in an amino acid change. However, synonymous changes have been shown to affect protein translation through changes in codon usage, altered mRNA stability, disrupted miRNA binding, and exon skipping [47]. In the case of rs28986317, the minor allele results in the use of the least frequently used of the four codons that encode alanine in humans (codons per thousand for each allele, 27.7:7.4) [48]. As the presence of rare codons can decrease protein translation rates [47], minor allele carriers of this variant may express MDC1 at lower levels, thereby decreasing their ability to repair DNA damage from ionizing radiation. Despite representing less than 10% of the sequence targeted, coding variants in MDC1 represented over 33% of the synonymous variants, over 25% of the conservative variants, and over 50% of the non-conservative variants, including all of the novel non-conservative variants detected in our study. The large amount of per-kbp variation present in this gene and others may explain the wide range of toxicities observed following radiation treatment.
In the case of MDC1, a recruiter of DNA damage repair complexes, different variants may preferentially recruit complexes specific to each individual. Similarly, the high number of variants per kbp in ERCC2 and LIG4, proteins that directly carry out repair, may reflect a spectrum of activation efficiencies or enzymatic activities that manifest as a range of toxicity levels. Our study's small patient population limits the statistical power of our analysis to resolve cumulative genetic effects on radiosensitivity at a population level. Regardless, even in this small population, a large amount of genetic diversity was observed in the eight candidate genes sequenced. The positive results associating the MDC1 variant with increased radiosensitivity and the observation of the ATM variant P1054R exclusively in high toxicity patients need to be validated in larger populations, particularly as no correction for multiple testing was performed and false positives are therefore possible. The finding that no single LIG4 variant was statistically associated with radiosensitivity, despite the observation that high toxicity patients were more likely to carry LIG4 coding variants, suggests that this gene may contain a class of low frequency variants with similar effects on protein behaviour. As LIG4 functions as a ligase, variants resulting in codon or amino acid substitutions in this protein may decrease its ability to join broken ends of DNA or reduce the amount of enzyme available to perform DNA repair. Variants in DNA repair genes were both positively and negatively associated with residual gammaH2AX, which signifies the presence of unrepaired or misrepaired double-strand breaks following irradiation [43]. Non-coding variants, 14 in BRCA1 and 1 in MDC1, correlated with lower levels of gammaH2AX (p≤0.049) while a coding, non-synonymous variant in ERCC2 (rs1799793) was correlated with higher levels of gammaH2AX (p=0.042) (Table 4-5).
Of the 12 ERCC2 minor allele carriers with high levels of residual gammaH2AX, 5 (42%) also had high toxicity. While gammaH2AX levels did not correlate with development of late side effects of PB [43], variants in specific DNA repair genes, particularly BRCA1 and ERCC2, appear to mediate the clearance of double-strand breaks, which may be a factor in the eventual development of toxicity. This initial survey has identified a number of promising candidate variants that may be able to predict increased radiosensitivity in a larger population, and it illustrates the genetic diversity present in a number of DNA repair genes. While the hypothesis that DNA repair gene variants mediate radiosensitivity has not been disproven, it is likely that the effect of individual variants is small and that variants outside of this set of candidate genes also play a role in mediating radiosensitivity. Investigation of variants of additional genes in larger patient populations may lead to prognostic tests to identify radiosensitive cancer patients prior to treatment. Given such knowledge, the clinical course of these patients could be altered to consider their treatment with non-radiation therapies.

4.5. Figures

Figure 4.1 Candidate genes encode proteins directly involved in the detection and repair of damaged DNA and triggering of cell cycle control signalling pathways
Reproduced and modified from [35]. Cellular response to DNA damage is controlled by signalling pathways that halt cell cycle progression (cell cycle phases at bottom) until the damage is repaired. For this study, we focused on eight candidate genes (yellow) that encode proteins responsible for identifying and directly repairing DNA damage. These proteins are primary activators of cell signalling pathways that limit cell growth or replication in the presence of damaged DNA.
Figure 4.2 Toxicity scores, radiation dosimetry, count of DNA variants, and gammaH2AX rank expression from 41 prostate brachytherapy patients
The x-axes in all panels correspond to patient numbers ordered by toxicity score presented in the top panel. This panel indexes the subsequent panels and shows the toxicity score for each patient determined using the scoring system shown in Table 4-1. A dashed vertical line down the centre of the figure separates data from low toxicity patients (left, toxicity score ≤ 1) from high toxicity patients (right, toxicity score ≥ 2). The next three panels present similar post-implant radiation dosimetry for each patient. The thresholds for \"ideal\" dosimetry are shown as dashed red lines (D90 < 175 Gy, V100 > 85%, VR100 < 1 cm3). Panels 5-12 present the number of all variants (black, left axis) and coding variants (red, right axis) in each gene from each patient. Genes are ordered first by the magnitude of the left y-axis (i.e. the maximum count of total variants) and then by the magnitude of the right y-axis (i.e. the maximum count of coding variants). Note the enrichment of individuals with at least 1 LIG4 coding mutation in high versus low toxicity individuals (p=0.028). Counts that include specific variants associated with radiosensitivity, ATM rs1800057 (P1054R) and MDC1 rs28986317, are indicated by wire-frame data markers. The last panel displays sum of rank data from Olive et al. [43] and represents the combined rank of residual gammaH2AX scores from lymphocytes and monocytes from 40 of the 41 patients. The numbers used to generate these plots were drawn from Table 4-2. Lines between each point are for ease of comparison and do not represent continuous variables or an explicit relationship between data points.

4.6. Tables

Table 4-1 Modified RTOG scoring system used to generate toxicity scores
Category | Description | Score
IPSS(1) score | Failed to normalize to within 5 points of baseline after 24 months of follow up | add 1
Maximum RTOG(2) (GI: rectum; GU: urinary) | No symptoms | add 0
Maximum RTOG(2) | RTOG Grade 1 GI: Increased frequency or change in quality of bowel habits not requiring medication/rectal discomfort not requiring analgesic. | add 1
Maximum RTOG(2) | RTOG Grade 1 GU: Frequency, nocturia twice pre-treatment habits, dysuria, urgency not requiring medication. | add 1
Maximum RTOG(2) | RTOG Grade 2 GI: pain or irritation requiring medication, mucus or bloody discharge, haemorrhoids requiring analgesics alone, or change in bowel habits requiring medication. | add 3
Maximum RTOG(2) | RTOG Grade 2 GU: Frequency and nocturia less frequent than every hour, dysuria, and urgency, bladder spasm requiring medication. | add 3
Maximum RTOG(2) | RTOG Grade 3 GI: hospitalization for severe pain, bleeding from thrombosed haemorrhoids, severe mucus or bloody discharge, superficial ulceration, minor surgical procedure. | add 5
Maximum RTOG(2) | RTOG Grade 3 GU: frequency with urgency or nocturia hourly or more frequently, dysuria, pelvic pain or bladder spasm requiring frequent narcotics, gross hematuria, and obstruction requiring indwelling catheter or minor surgical procedure (trans-urethral resection or incision of the prostate, stricture dilatation). | add 5
Maximum RTOG(2) | RTOG Grade 4 GI: Ulceration, necrosis, major surgical procedure | add 7
Maximum RTOG(2) | RTOG Grade 4 GU: Ulceration, necrosis, major surgical procedure | add 7
Prolonged Urinary Retention | Catheterization required for more than 3 weeks | add 3
Sexual potency (for previously potent patients) | Partially potent after PB | add 2
Sexual potency (for previously potent patients) | Impotent after PB | add 3
Maximum score: 21
GI = gastrointestinal symptoms, GU = genitourinary symptoms
(1) IPSS = International Prostate Symptom Score [40], describes symptoms of prostate cancer.
(2) RTOG = Radiation Therapy Oncology Group late radiation morbidity scoring scale [39], describes radiation-induced side effects.
This is an in-house modification of the RTOG toxicity scale, intended to better reflect the toxicity profile of prostate brachytherapy rather than that of external beam radiation.

Table 4-2 Patient-by-patient radiation dosimetry, gammaH2AX scores, DNA sequence variant counts, toxicity score breakdown, and other data

Toxicity and Dosimetry data
Columns: Patient #; Toxicity Score; Prostate D90 (Gy); Prostate V100 (%); Rectal VR100 (cm3); GammaH2AX sum of ranks, monocytes and lymphocytes (from Olive et al.); Monocyte score; Lymphocyte score
2 0 176 97.0 0.17 49 0.70 0.49
3 0 165 95.8 0.79 9 0.48 0.16
5 0 166 94.4 0.00 11 0.40 0.24
12 0 151 93.2 0.84 26 0.74 0.21
13 0 130 84.1 0.01 65 0.84 0.58
14 0 143 89.8 0.46 35 0.76 0.27
15 0 146 91.3 0.16 27 0.60 0.29
16 0 146 90.6 0.94 44 0.90 0.27
20 0 183 99.0 0.26 61 1.01 0.41
22 0 151 93.2 0.34 60 1.26 0.34
25 0 155 93.5 0.20 22 0.75 0.13
27 0 142 89.0 0.00 29 0.72 0.27
28 0 166 96.0 1.20 69 1.01 0.52
29 0 154 93.1 1.54 63 1.31 0.36
33 0 157 93.7 1.50 69 0.92 0.56
39 0 154 96.1 0.20 26 0.56 0.29
44 0 152 92.3 0.70 39 0.55 0.42
1 1 165 96.8 0.00 39 0.62 0.4
24 1 131 85.7 0.00 61 1.14 0.37
31 1 128 85.2 1.46 38 1.00 0.21
6 3 172 96.6 0.33 25 0.53 0.6
18 3 146 90.9 0.05 11 0.50 0.18
43 3 147 91.0 0.31 18 0.51 0.21
11 4 141 89.0 0.81 57 1.04 0.37
17 4 130 84.4 1.48 61 0.75 0.46
21 4 148 91.4 0.81 50 0.73 0.47
32 4 164 95.4 0.90 51 0.74 0.46
35 4 160 95.2 3.20 61 0.76 0.64
38 4 142 88.5 0.35 40 0.75 0.31
8 5 182 97.9 0.03 11 0.37 0.25
26 5 170 97.9 0.26 65 1.21 0.41
37 5 148 91.7 0.06 47 0.85 0.32
4 6 128 86.9 0.23 9 0.43 0.19
7 7 176 96.9 0.17 48 0.77 0.37
10 7 152 93.0 0.98 6 0.46 0.07
19 7 183 97.9 0.89 14 0.40 0.26
9 9 113 78.3 0.03 58 0.84 0.43
23 10 136 88.2 0.07 49 0.86 0.33
30 10 167 97.0 0.10 - - -
36 11 150 92.5 1.20 43 0.66 0.41
34 13 136 87.6 3.80 74 1.21 0.55

Total variants and coding variants
Columns: Patient # | total variants in ATM, BRCA1, ERCC2, H2AFX, LIG4, MDC1, MRE11A, RAD50 | coding variants in ATM, BRCA1, ERCC2, H2AFX, LIG4, MDC1, MRE11A, RAD50
2 | 26 3 14 4 1 2 7 2 | 1 0 4 0 0 0 0 0
3 | 26 0 14 4 4 4 7 2 | 1 0 4 0 0 1 0 0
5 | 28 23 2 1 1 2 2 9 | 2 1 0 0 1 0 0 0
12 | 24 6 14 1 3 2 3 4 | 1 0 2 0 0 0 0 0
13 | 27 0 12 2 3 2 3 2 | 1 0 2 0 0 0 0 0
14 | 26 0 7 1 0 3 1 3 | 1 0 0 0 0 0 0 0
15 | 21 19 11 4 0 3 1 8 | 1 1 2 0 0 1 0 0
16 | 17 1 15 1 3 2 5 2 | 1 0 4 0 0 0 0 0
20 | 21 0 15 1 1 3 3 2 | 1 0 3 0 1 1 0 0
22 | 27 22 10 3 2 4 7 2 | 1 1 2 0 1 2 0 0
25 | 26 11 11 2 3 3 0 1 | 2 0 1 0 0 1 0 0
27 | 23 29 4 4 1 2 8 3 | 1 1 0 0 1 0 0 0
28 | 28 3 5 4 1 3 6 2 | 2 0 0 0 0 1 0 0
29 | 4 28 12 1 6 6 7 1 | 1 1 2 0 2 2 0 0
33 | 26 25 10 2 0 19 6 6 | 1 1 1 0 0 10 0 0
39 | 5 0 9 1 1 3 7 6 | 3 0 1 0 0 0 0 0
44 | 29 28 12 4 4 3 9 2 | 3 1 2 0 0 0 0 0
1 | 26 31 13 1 4 2 2 2 | 1 1 3 0 2 0 0 0
24 | 23 1 16 4 1 2 3 6 | 1 0 4 0 1 0 0 0
31 | 24 18 14 1 4 2 3 2 | 1 1 4 0 1 0 0 0
6 | 30 27 7 4 5 4 3 8 | 2 1 2 1 1 1 0 0
18 | 9 29 15 1 1 3 2 3 | 1 1 3 0 1 1 0 1
43 | 26 30 1 1 2 11 7 2 | 1 1 0 0 2 5 0 0
11 | 23 0 18 4 1 2 4 2 | 2 0 4 0 0 0 0 0
17 | 26 29 6 4 2 4 3 8 | 2 1 2 0 1 1 0 0
21 | 22 21 9 3 1 3 1 2 | 1 1 1 0 1 1 0 0
32 | 28 1 11 2 1 3 2 2 | 1 0 1 0 1 1 0 0
35 | 24 0 16 1 1 2 7 10 | 1 0 4 0 1 0 0 0
38 | 6 28 10 4 2 8 6 2 | 2 1 1 0 2 3 0 0
8 | 18 31 11 1 2 5 2 2 | 1 1 1 0 1 2 0 0
26 | 24 4 14 4 3 2 6 1 | 1 0 4 0 0 1 0 0
37 | 23 0 4 4 1 2 3 2 | 1 0 0 0 0 0 0 0
4 | 26 2 14 1 1 4 8 2 | 1 0 3 0 1 2 0 0
7 | 25 2 11 3 2 2 1 8 | 1 0 1 0 0 0 0 1
10 | 5 0 15 4 4 3 1 2 | 1 0 4 0 1 1 0 0
19 | 18 32 5 1 2 4 7 2 | 1 1 0 0 1 1 0 0
9 | 24 18 14 1 4 2 3 2 | 2 1 4 0 1 0 0 0
23 | 30 0 1 4 1 4 1 2 | 2 0 0 0 1 2 0 0
30 | 29 18 4 4 1 4 6 3 | 1 1 0 0 1 1 0 0
36 | 23 20 1 4 2 3 5 2 | 1 1 0 0 0 0 0 0
34 | 5 0 15 2 2 4 6 2 | 2 0 4 0 1 1 0 0

Non-conservative variants and missense variants
Columns: Patient # | non-conservative variants in ATM, BRCA1, ERCC2, H2AFX, LIG4, MDC1, MRE11A, RAD50 | missense variants in ATM, BRCA1, ERCC2, H2AFX, LIG4, MDC1, MRE11A, RAD50
2 | 0 0 0 0 0 0 0 0 | 1 1 2 0 0 0 0 0
3 | 0 0 0 0 0 0 0 0 | 1 0 2 0 0 1 0 0
5 | 0 1 0 0 0 0 0 0 | 2 4 0 0 0 0 0 0
12 | 0 0 0 0 0 0 0 0 | 1 1 2 0 0 0 0 0
13 | 0 0 0 0 0 0 0 0 | 1 0 2 0 0 0 0 0
14 | 0 0 0 0 0 0 0 0 | 1 0 0 0 0 0 0 0
15 | 0 1 0 0 0 0 0 0 | 1 5 1 0 0 1 0 0
16 | 0 0 0 0 0 0 0 0 | 1 0 4 0 0 0 0 0
20 | 0 0 0 0 0 1 0 0 | 1 0 2 0 0 1 0 0
22 | 0 1 0 0 0 1 0 0 | 1 5 1 0 0 2 0 0
25 | 0 0 0 0 0 0 0 0 | 2 2 1 0 0 1 0 0
27 | 0 1 0 0 1 0 0 0 | 1 4 0 0 1 0 0 0
28 | 0 0 0 0 0 0 0 0 | 2 2 0 0 0 1 0 0
29 | 0 1 0 0 2 1 0 0 | 1 4 1 0 2 2 0 0
33 | 0 1 0 0 0 5 0 0 | 1 3 0 0 0 7 0 0
39 | 1 0 0 0 0 0 0 0 | 2 0 0 0 0 0 0 0
44 | 2 1 0 0 0 0 0 0 | 3 4 1 0 0 0 0 0
1 | 0 1 0 0 2 0 0 0 | 1 5 1 0 2 0 0 0
24 | 0 0 0 0 0 0 0 0 | 1 0 3 0 0 0 0 0
31 | 0 1 0 0 0 0 0 0 | 1 2 2 0 0 0 0 0
6 | 0 1 0 0 1 0 0 0 | 2 4 1 0 1 1 0 0
18 | 0 1 0 0 0 0 0 0 | 1 5 1 0 0 0 0 0
43 | 0 1 0 0 1 3 0 0 | 1 6 0 0 1 4 0 0
11 | 0 0 0 0 0 0 0 0 | 2 0 3 0 0 0 0 0
17 | 1 1 0 0 1 0 0 0 | 2 5 1 0 1 1 0 0
21 | 0 1 0 0 0 0 0 0 | 1 4 0 0 0 0 0 0
32 | 0 0 0 0 1 0 0 0 | 1 1 0 0 1 1 0 0
35 | 0 0 0 0 1 0 0 0 | 1 0 4 0 1 0 0 0
38 | 0 1 0 0 1 2 0 0 | 2 4 0 0 1 3 0 0
8 | 0 1 0 0 0 0 0 0 | 1 5 0 0 0 1 0 0
26 | 0 0 0 0 0 0 0 0 | 1 2 2 0 0 0 0 0
37 | 0 0 0 0 0 0 0 0 | 1 0 0 0 0 0 0 0
4 | 0 0 0 0 1 1 0 0 | 1 1 2 0 1 1 0 0
7 | 0 0 0 0 0 0 0 0 | 1 1 0 0 0 0 0 1
10 | 0 0 0 0 0 1 0 0 | 1 0 4 0 0 1 0 0
19 | 0 1 0 0 0 0 0 0 | 1 6 0 0 0 1 0 0
9 | 1 1 0 0 0 0 0 0 | 2 3 2 0 0 0 0 0
23 | 0 1 0 0 0 0 0 0 | 2 3 0 0 1 1 0 0
30 | 0 1 0 0 0 0 0 0 | 1 3 0 0 0 1 0 0
36 | 0 0 0 0 0 0 0 0 | 1 0 0 0 0 0 0 0
34 | 0 0 0 0 1 0 0 0 | 1 0 4 0 1 0 0 0

Toxicity score breakdown
Columns: Patient #; IPSS Normalized Score (0=Yes, 1=No); Pre-implant IPSS; Max Late IPSS; RTOG Score; Max Late Urinary RTOG; Max Late Rectal RTOG; Acute Urinary Retention >3 weeks (1=Yes, 0=No); Potency Score; Potency at Implant; Post Implant Potency at subsequent follow-ups
2 0 6 4 0 0 0 0 0 Normal NNN
3 0 4 8 0 0 0 0 0 Impotent IIIIIIII
5 0 5 6 0 0 0 0 0 Impotent IIPIINNNNNIP
12 0 6 8 0 0 0 0 0 Normal NNNNNNNNNNN
13 0 2 5 0 0 0 0 0 Normal NNNNNNNN
14 0 16 21 0 0 0 0 0 Normal NINNNNNNNNNN
15 0 4 6 0 0 0 0 0 Normal INNNNN
16 0 2 4 0 0 0 0 0 Normal NNNNNNN
20 0 3 5 0 0 0 0 0 Partial NNNNNNNN
22 0 1 7 0 0 0 0 0 Normal IIIINNNNNNN
25 0 6 10 0 0 0 0 0 Partial NNNNNNIINN
27 0 11 6 0 0 0 0 0 Normal NNNNNNNN
28 0 7 9 0 0 0 0 0 Impotent NIIII
29 0 2 7 0 0 0 0 0 Normal IINNINN
33 0 6 12 0 0 0 0 0 Partial NNNNNNNN
39 0 7 12 0 0 0 0 0 Normal NNNNN
44 0 7 12 0 0 0 0 0 Normal NNNNNNNN
1 0 13 11 1 1 0 0 0 Partial INNNNINNNNN
24 0 3 5 1 0 1 0 0 Normal NNNNN
31 0 8 14 1 1 0 0 0 Partial IINNNNNNN
6 0 19 15 3 2 0 0 0 Normal INNNNNNI
18 0 7 16 3 0 2 0 0 Normal INNNNNN
43 1 1 15 2 1 1 0 0 Normal NNNNNN
11 1 4 24 3 2 0 0 0 Partial NNNNNNNNNN
17 0 6 24 4 2 1 0 0 Impotent IIIIIINII
21 0 6 13 4 2 1 0 0 Normal IPNPNNNNNNNN
32 1 2 17 3 2 0 0 0 Normal IINNNNN
35 0 5 6 2 1 1 0 2 Normal NNIIIIIII
38 1 3 22 3 2 0 0 0 Normal IINNNNNN
8 1 5 23 4 2 1 0 0 Normal NNNIINNN
26 1 4 22 2 1 1 0 2 Partial IIIIIIIIIIII
37 1 5 17 4 2 1 0 0 Normal NNNIINN
4 1 5 11 2 1 1 0 3 Normal IIIIII
7 1 4 16 3 2 0 0 3 Normal IIIIIIIIIIII
10 1 9 24 6 2 2 0 0 Normal NNNNNNNNNNNN
19 0 9 14 5 3 0 0 2 Partial IIIIIIIIIIII
9 0 9 14 6 1 3 0 3 Normal IIIIIIIIIIII
23 1 21 33 8 2 3 1 0 Impotent IIIIII
30 0 5 5 8 3 2 0 2 Partial IIIIIII
36 1 0 24 10 2 4 0 0 Normal NNN
34 0 12 20 10 3 3 0 3 Normal IINIIII

Additional clinical data
Columns: Patient #; Hormones; Tumour Stage; Planning Ultrasound Target Volume (PUTV); Gleason Score; Age at Implant; PSA at last follow up
2 Yes 1C 32.240 6 69 0.05
3 Yes 2A 33.539 6 76 0.04
5 Yes 2B 28.332 6 70 0.84
12 No 1C 21.774 5 49 0.02
13 Yes 2B 40.100 7 75 0.02
14 No 2A 32.081 6 71 8.80
15 Yes 1C 50.499 6 70 0.11
16 Yes 1C 29.928 6 51 5.70
20 Yes 2A 49.457 6 72 5.68
22 No 2A 37.022 6 62 0.02
25 No 1C 39.334 5 53 0.04
27 Yes 1C 37.095 6 56 0.02
28 Yes 2A 46.107 6 74 0.04
29 Yes 1C 30.791 5 60 0.10
33 Yes 1C 19.740 7 68 0.17
39 No 2A 39.075 6 63 0.05
44 Yes 1C 38.158 7 59 0.02
1 Yes 2B 35.187 6 62 0.05
24 Yes 2A 28.525 7 64 0.03
31 Yes 1C 22.864 5 66 0.02
6 Yes 1C 29.138 7 62 0.02
18 Yes 2A 19.196 7 51 0.02
43 Yes 1C 26.486 6 65 0.02
11 No 2A 38.417 6 66 0.03
17 Yes 2A 32.200 6 71 0.02
21 Yes 1C 27.600 4 69 0.03
32 No 1C 37.177 6 62 0.29
35 Yes 1C 51.781 6 72 0.03
38 Yes 1C 39.470 6 61 0.04
8 No 1C 38.442 6 61 0.02
26 Yes 1C 48.610 7 76 0.03
37 No 1C 40.777 6 64 0.02
4 Yes 1C 32.265 3 60 0.02
7 Yes 1C 33.642 6 71 0.02
10 Yes 1C 35.378 6 58 0.04
19 No 2A 47.332 6 68 0.18
9 Yes 2A 26.939 5 62 0.03
23 No 2A 33.177 6 75 0.01
30 Yes 1C 31.517 7 70 0.01
36 Yes 1C 35.400 6 57 0.43
34 Yes 2B 47.859 6 76 0.06

Table 4-3 PCR primer sequences used to amplify amplicons targeting candidate gene exons for sequencing
Columns: Gene symbol; Amplicon Coordinates (Build 36, hg18); Amplicon Size (bp); Forward Primer Sequence (preceded by TGTAAAACGAGGCCAGT); Reverse Primer Sequence (preceded by CAGGAAACAGCTATGAC)
ATM chr11:107598633-107599269 673 AGAGAAAGAAAGGCGCCGAAAT CAGGAAAGATGGAGTGAGGAGAGG
ATM chr11:107603376-107603752 413 TGCCTTTGACCAGAATGTGCC CAGGATCTCGAATCAGGCGCT
ATM chr11:107603731-107604317 623 AAGCGCCTGATTCGAGATCCT TTGCCACTCCTGTCCAGCAA
ATM chr11:107604897-107605533 696 GAAGATTAAGAGCTTTGCAGACCAGA TGAGTGCAGTGGTGTTTACAACGA
ATM chr11:107611462-107612041 616 GGGCCATAATTTGCCAATTTCTTC CACTGTCAACTCCTTGACGATGGA
ATM chr11:107619639-107620264 642 GTAATGTTTCTGCGACCTGGCTCT TCTCCCCCTTGAAAACTTCACGTA
ATM chr11:107620495-107621149 655 GAGGGAGAGCTAACAGAAGTGGTCTC CAGAATCTGCTACCACTGCTTCAAA
ATM chr11:107622743-107623339 633 TCAAGGATCTTGTCAGAAGAGGCA TCGAATCATTAGGGTTAGGGTCACCT
ATM chr11:107624617-107625197 581 TTGTCATGGCAATCACATATCCCT GGATGGTCTCAATCTCCTGACCTT
ATM chr11:107626456-107627005 586 TTCCTGCCAATTTAGGAAGTAGGACA AAGGTCTGCAGGCTGACCCAG
ATM chr11:107626771-107627347 613 CGATGCCTTACGGAAGTTGCAT GGTGCTGATATCCCATCACCCTACA
ATM chr11:107626771-107627347 613 CGATGCCTTACGGAAGTTGCAT GGTGCTGATATCCCATCACCCT
ATM chr11:107627451-107627964 550 TCGGACACCAGGTCTTATTCCTTC GAAGAATTGGAGGCACTTCTGTGCT
ATM chr11:107627694-107628152 495 TGCCAGGCACTGTCCTGATAGA CAGAAGCAATCAGGCATAAAGACACA
ATM chr11:107628448-107628999 588 GGAGAGCAGACCTCCGAATGG AAACAACCTCTTCCCTGGCTAACAG
ATM chr11:107629396-107629940 581 TGCTCTTTACTTCCTCTGCTTGGTGA TCCCAGAAGACAGCGATCCAG
ATM chr11:107629739-107630115 413 ACTTTCTTGAAGTGAACACCACCAA GAGCCCTTTACTGCCACTTTGC
ATM chr11:107631926-107632547 683 CTGCTTGGCCATCAGGAGATACTT AGCAAACCACCATGGCACTGTAT
ATM chr11:107633181-107633775 645 TCTGTCACTGGTATGATTTGCAAGAA ACACCAACCAGTGATCAATTCCCT
ATM chr11:107634624-107635232 665 TTCCTCCTTTTTGGTGTAAGTGGG TGCTAAGGGTGCTACTGAACAAGG
ATM chr11:107642881-107643523 680 CACCACCACACCCAGCTAATTTTT ATGCCTGGCCTGGTTTTATTTCTT
ATM chr11:107644298-107644874 613 TGGCTGTTGTGCCCTTCTCTT GGCAAGGTTCCAACTTCAAACACA
ATM chr11:107646917-107647546 631 CACTGCACCCGGCCTATGTTTAT CGGTTTAGAAAGCCAAGCCTTAAA
ATM chr11:107648447-107649065 691 TGGAAAACTTACTTGATTTCAGGCATC ACACACCTCACTCGAGTCAACCAC
ATM chr11:107655136-107655728 629 TGGCTAGTTTGAGTTCAGTGCTGTTC TGATTTGACCCATTGTGACCCA
ATM chr11:107656736-107657326 627 TGTCAGTGCTTGTGATTTAGCAAAGG GCCTTGTGAAATGCTCTTAAATGGA
ATM chr11:107658394-107659040 659 AACCAAATGTTGGAAACTCTTAGCA GTCAGATAGCTGGTTGTTGGCACA
ATM chr11:107659900-107660587 688 AGGGAGACAACACGACATAACCAA CACTAAAGTGTCACAAGATTCTGTTCTCA
ATM chr11:107660210-107660595 422 TTTGATGAGGTGAAGTCCATTGC CCAATCAGAGGGAGACAACACGA
ATM chr11:107663213-107663837 697 TTGGTTGGCAGAAAAATTACCAGG TGTGGGGAGACTATGGTAAAAGAGGA
ATM chr11:107664461-107664977 553 TGCCCTACTGTGCTGGGCATA GGCAAATGTTGCTTTAATCACATGC
ATM chr11:107664618-107665256 648 CTGAGGCTCATTTATTGAACTGCC TGACACTTTAGTGATATATTAGCTCAGGGAA
ATM chr11:107665244-107665844 601 GCGGACAGAGTGAGTCTTTGTCTC CACATATCACCATGCCCGACTAAT
ATM chr11:107665667-107666232 602 TGGCTTAGGAGGAGCTTGGGC TTTCCCAGGCAAGTAGCGCA
ATM chr11:107668363-107668981 688 TGGAAGCTTAGAGCTGCCTATTCTG ACGTTGCGAACTGCTATCCCTAGT
ATM chr11:107669092-107669688 656 TGAAAACACAGAAACTAAAAGCTGGG GGGAATTGCAGTTGCAGTGATTAG
ATM chr11:107670611-107671220 640 AGGTAGCAGTGAGCCTTGATAGCG TGGGTTTAGAGAAATCATCTGGCA
ATM chr11:107672937-107673608 687 TTGTGCTTCTGTTTGTGATTTTGC ACTCACATTCATTCCGCCAATACA
ATM chr11:107675388-107676039 653 CCACGCCTGGCTAATTTTGTATTT AGGGACCTTGTCTGGAATGTTCAC
ATM chr11:107677372-107677965 594 CCCTCCCCCAAAAATCAACTACTA TGCCTGACACATATTGAAAGCTCA
ATM chr11:107678599-107679179 651 GTTTTTGCCATACCACTCTGCCTC CAAGGCACCCCTTAGAACTCCTCT
ATM chr11:107680465-107681075 675 TTTAGCAGTATGTTGAGTTTATGGCAGA TTGTGGCAAACCTCCAAAAAGTT
ATM chr11:107683528-107684224 697 GACATGATCTGTCTTGTTCATGCTT TGGAAAAACACTTGCTCCTATCCC
ATM chr11:107685736-107686312 613 GGGTGCTTGTGTGCATTTGTATTAGC ACCCTTATTGAGACAATGCCAACAT
ATM chr11:107688058-107688691 650 CTGAATTGGATGGCATCTGCTCTA TGGCATAAACTCTGAGACAGGTGG
ATM chr11:107691503-107692135 633 TGGCTGTGTAAATATCCACCAACAT CTGACTTTCCTGTGTCTCCCTGAA
ATM chr11:107691658-107692264 665 GAGTTGGGAGTTACATATTGGTAATGATACA TTCTGAATCCAGTTTAATTTAGGACCAA
ATM chr11:107692911-107693471 597 CTGATGCTGGAGTGCATTAGCG TCAAATTTCTTACCTGACGGAAGTGC
ATM chr11:107695596-107696218 625 TTGTGAATTCCCCTTGTGCCTAGT CAAAACTTTCTTGAAGGGTCAGGG
ATM chr11:107696950-107697500 587 GCCACCTTCATGTTGAGTAGGATAAGG TGATCCACCCACTTCAGCCTTC
ATM chr11:107701047-107701669 655 AGAGCTCTGACCGCATAGCATTTT TACCCTTGCCCAAGGCTTAAAAGT
ATM chr11:107701802-107702377 649 GAGACAGACAGACAGACAGATAGGCA TCGACCACATGATGGACTGATAGAA
ATM chr11:107703466-107704051 622 TTTCCCACCCACCAAGGAAA TGTGGGTGGCTGGGCTAATG
ATM chr11:107704950-107705510 597 TCTTGAAGGCAGTAGAAGTTGCTGGA CCAGGTATGGCGTGCACCTG
ATM chr11:107705936-107706600 696 GGAGAGTCCCCTTTGTCCTTTGAT TTGCATCATTTACAGCTTGTCAGC
ATM chr11:107707142-107707720 579 AGATACTGCAGTGGGTAGAGCGTG TAGAGACAGGGTTTTGCTGTGTCG
ATM chr11:107707571-107708204 649 CCATTCCCTCTAAGAAATGGAAATACA ATCATTCCATTGTCTAGATTTGTGCAT
ATM chr11:107707895-107708483 625 AATTTCTGACTAAACCAGAGGTAGCCA CACTTTGCCAACTGCTTTGAGGA
ATM chr11:107708462-107708787 362 CCTCAAAGCAGTTGGCAAAGTGAA GCATCACAAAGTGCCTCAACACTTC
ATM chr11:107708582-107709209 628 TGTTTTTAAGTCCCAGGGCAGTTT GACGTCAACTTGCACTATTCAAGGA
ATM chr11:107709569-107710197 683 GGTTCCTCAGGTGGAATCTGGTCT GCATAAGCACACGGAAACTCTCCT
ATM chr11:107710731-107711373 697 AGTTAAAGTTACGAGCGTGAGCCA CTGTGTACTCAACTTGGATTGGGG
ATM chr11:107711517-107712123 633 CAGGTGTTTGCAGTATGCCTTCCT GGACTACAAGCACATGCCATCATC
ATM chr11:107718932-107719475 580 TGATGGGCAGGCTCTCAAACA TTTCACTCACACACTTTCATTCTGATG
ATM chr11:107721341-107721894 590 CCCAATGCTGTGATGCCACC CCTGCCAAACAACAAAGTGCTCA
ATM chr11:107722913-107723608 699 TGGGTCTCAACTTTAGCCACAATAA TTGTTTTGGTGGAATAGCCCTGAT
ATM chr11:107729507-107730099 629 TGGATTTGAGGTGGATCTCACAGA GGCACTGGAATACGATTCTAGCACTAA
ATM chr11:107730405-107731005 601 GGCTGAAGGCATTTCTAAACCAGT GGAGTTCAAGACCAGTCTGACCAA
ATM chr11:107740795-107741378 641 ACCCCTAACCATGGAGGAAGAATG AGCAAATTCACTTGTCCACCAACA
ATM chr11:107741079-107741675 633 CAGCAGAGGCCGGAAGATGA CCCAGCCCGAATGACCATTAT
ATM chr11:107741271-107741899 662 CCAGAGTTTCAACAAAGTAGCTGAACG TCACCTCCCAGGTTCAAGAGATTC
ATM chr11:107741618-107742216 635 TGGTCTTAAGGAACATCTCTGCTTTCA CAAGAGAATTCATGAACTAGAAGGCAA
ATM chr11:107742138-107742657 556 AAGCCCTTCTGTACTGTCCATGTATGT TTGCCTTGGCCTGGGAAATC
ATM chr11:107742240-107742894 620 CCCTATCCATTGGGCTTCTTCTTT GGGAGCAAAGAACCCAGGGC
ATM chr11:107742240-107742894 657 CCCTATCCATTGGGCTTCTTCTTT GATGCTGTAGCTGTCCTGGAACAA
ATM chr11:107742800-107743394 631 GCAAGCCCTGGGTTCTTTGC GGGACAGAGAAATGTTCCACTTCTACC
ATM chr11:107743240-107743895 658 TGACCGTAAGGATTTCCCCTTTCT ACCCATGAAGATTTCAGGGCTTTC
ATM chr11:107743765-107744370 633 TGGTGTATCTTTTTCTTACAAGCTGCC TGGGTCAGTGACTTAGCATACAACAA
ATM chr11:107744011-107744572 598 TGCATGGTATGCTATGAGGCTCC TGAGCAACTGACTGGCAAACCC
ATM chr11:107744451-107745113 677 GGAGAAATAAGTTGTCCAAGGCAAGA TTCTCCTTAGAACGAGTCCCATGC
BRCA1 chr17:38449676-38450260 621 TTTGGCAGCAACAGGAAATACAAA TCAAGAACCGGTTTCCAAAGACA
BRCA1 chr17:38450099-38450693 631 GCTTCCTTCCTGGTGGGATCTG GGGAGGAAATTCTGAGGCAGGT
BRCA1 chr17:38450672-38451219 584 ACCTGCCTCAGAATTTCCTCCC TGCAGCCAGCCACAGGTACA
BRCA1 chr17:38451018-38451610 593 TGCCTGTAATTCCAGCTACTCAGG TGGTGGTACGTGTCTGTAGTTCCA
BRCA1 chr17:38452863-38453470 654 GAACTTCTAGGCTCCCACCTTGAC CCAAGACTCCCTCATCCTCAAAAT
BRCA1 chr17:38454531-38454932 438 TTGGCACAGGTATGTGGGCA CATGGCATATCAGTGGCAAATTGA
BRCA1 chr17:38456336-38456841 542 CAGCAGCTCAACGCCATCTG TGGACATTGGACTGCTTGTCCC
BRCA1 chr17:38462347-38462755 445 GAACCCGAGACGGGAATCCA TGACGTGTCTGCTCCACTTCCA
BRCA1 chr17:38468728-38469145 454 GGCCTGCATAATTCTTGATGATCC GGAATCCATGTGCAGCAGGC
BRCA1 chr17:38469233-38469773 577 CCCAGCATCACCAGCTTATCTGA GCCTTGGCGTCTAGAAGATGGG
BRCA1 chr17:38472653-38473237 621 GCCTGGCCCACACTCCAAAT TGCTCGTGTACAAGTTTGCCAGA
BRCA1 chr17:38472753-38473302 586 GACTATCATCCATGCTATGCTCAACA ACTAGTATTCTGAGCTGTGTGCTAGA
BRCA1 chr17:38476099-38476698 636 TGCTGGTAAATTCACCCATGTGA TCAGCTCGTGTTGGCAACATA
BRCA1 chr17:38476548-38476930 419 GCTTCTCCCTGCTCACACTTTCTTC GCTACTTTGGATTTCCACCAACACTG
BRCA1 chr17:38479623-38480265 643 CCCCATGTTATATGTCAACCCTGA AAAGTCCTTCACACAGCTAGGACG
BRCA1 chr17:38479905-38480299 431 CGTCAAATCGTGTGGCCCAG CAGCTGGGAGATATGGTGCCTC
BRCA1 chr17:38481747-38482329 619 CCATCAGTTTCCAAGCTTGTTCAGG GCATCTGTCTGTTGCATTGCTTG
BRCA1 chr17:38487763-38488291 565 TGCCTTGGGTCCCTCTGACTG GGGCATTAATTGCATGAATGTGG
BRCA1 chr17:38496190-38496818 687 TTAGCAAATGGGTTTCGAAGGTTT TAAATTCCTTGCTTTGGGACACCT
BRCA1 chr17:38496693-38497337 645 CCGTTGCTACCGAGTGTCTGTCTA ATGAGATGTGCACCCACAGTGATA
BRCA1 chr17:38497064-38497511 484 TCACTCAGACCAACTCCCTGGC GGAGTCCTAGCCCTTTCACCCA
BRCA1 chr17:38497068-38497625 594 TCAGACCAACTCCCTGGCT CTGATGACCTGTTAGATGATGGTGAA
BRCA1 chr17:38497484-38498049 602 TGTGTATGGGTGAAAGGGCTAGGA TCACCTGAAAGAGAAATGGGAAATGA
BRCA1 chr17:38497841-38498430 626 GGCCCTCTGTTTCTACCTAGTTCTGC TGTGCAACATTCTCTGCCCACTC
BRCA1 chr17:38498167-38498726 596 ATTTGGAGTAATGAGTCCAGTTTCGTT TCTCGTTACTGGAAGTTAGCACTCT
BRCA1 chr17:38498648-38499242 631 TCAAATGCTGCACACTGACTCACA TGAGGAGGAAGTCTTCTACCAGGCA
BRCA1 chr17:38499078-38499482 441 GGTTTCTGCTGTGCCTGACTGG TGATAAATCAGGGAACTAACCAAACGG
BRCA1 chr17:38499285-38499872 624 CGAGTTCCATATTGCTTATACTGCTGC GGGAGTCTGAATCAAATGCCAAAG
BRCA1 chr17:38499660-38500239 616 GGAGGCTTGCCTTCTTCCGA CATGCCAGCTCATTACAGCATGA
BRCA1 chr17:38499977-38500495 555 TCTCTAGGATTCTCTGAGCATGGCA TGGTTGATTTCCACCTCCAAGG
BRCA1 chr17:38500257-38500851 631 GCTCCACATGCAAGTTTGAAACAGA TGCCTGGCCTGCCCTTTACT
BRCA1 chr17:38501285-38501853 605 TGGGTTGTAAAGGTCCCAAATGGT CTGCCTCCCAGGTTGAAGCC
BRCA1 chr17:38502623-38503015 429 CACCAAATCCCAAGTCGTGTGTT CGAAGCCCATGCCTTTAACCA
BRCA1 chr17:38505141-38505506 402 TCTTCAAGGTGGGAACTGCGTC TCCATGGTGTCAAGTTTCTCTTCAGG
BRCA1 chr17:38509319-38509889 607 CAATGCTCAATAAAGAGATGTTGCCA CATAGGGTTTCTCTTGGTTTCTTTGA
BRCA1 chr17:38510241-38510764 560 AAGGTGTGAGACCAGTGGGAGTAATTT TGCAATGCATTATATCTGCTGTGGAT
BRCA1 chr17:38511518-38512056 575 AAATTGGCCGGGCATGGTAG TCAACCAGAAGAAAGGGCCTTCA
BRCA1 chr17:38511741-38512186 482 CAGCCCTACTTTACATAAGTCTGCAA GCTCTTAAGGGCAGTTGTGAGATTA
BRCA1 chr17:38521024-38521673 678 GACAGAGCGAGACTTTGTCTCAAAA TTGTGTTGAAAAGGAGAGGAGTGG
BRCA1 chr17:38529340-38529971 632 TTGGAGAAAGCTAAGGCTACCACC TGACAGATGGGTATTCTTTGACGG
BRCA1 chr17:38530137-38530734 666 CTCTACTTCCCTCTTGCGCTTTCT CTTCCCTCGCGACCTACAAACT
BRCA1 chr17:38530564-38531237 674 TAGCGATTCTGACCTTCGTACAGC ATTTCCAAGGGAGACTTCAAGCAG
ERCC2 chr19:50546451-50547026 624 ACTAACGTCCAGTGAACTGCGCT GTCCTTCTCCGACTCCCTAGCTG
ERCC2 chr19:50547076-50547724 696 CTCACCCCAACTTCTCTCACCCT CCAGTTCCAGATTCGTGAGAATGA
ERCC2 chr19:50547530-50548205 677 CTGGGAAATGAACGGGAAACAG AAAGTGTCCGAGGGAATCGACTTT
ERCC2 chr19:50548034-50548716 688 GTGCCTAGGGACAGAGGGGAG ACAGTCAGCCCCTCCACCAAT
ERCC2 chr19:50549541-50550173 635 AGCAGCAGAGAAGCAAGGAACCTA CCCTTCTGCACTCATTTCATTGG
ERCC2 chr19:50550498-50551109 665 CCAGTGCACAATACACTGTGACCA TAACAGGGTTGCTGAGGGTTCATT
ERCC2 chr19:50552145-50552709 641 GGGAGATCAGGGAGGATACATTCC AGGGTGTGAATGCTCTGTGGGT
ERCC2 chr19:50552434-50553046 634 CAGGATCTTGGGGTAGATGTCCAG GTGGGCTCTCTACTTGGGATCCTT
ERCC2 chr19:50556357-50557008 677 GAATTTCTCTGGCCTCCTCCCTTA GTCGTGCTAGCAGGTGTGACAAGT
ERCC2 chr19:50558600-50559220 645 GAGATTCTCCAATCCAGCCAGGT CAGGATCAAAGAGACAGACGAGCA
ERCC2 chr19:50559079-50559771 696 CTCACCCTGCAGCACTTCGTC CATACTTCTGCCTGGCCTGTGTCT
ERCC2 chr19:50559588-50560214 639 CAATCTTGGGGTCCAGGAGGTAGT CACAGCCTCACAGCCTCCTATGT
ERCC2 chr19:50559816-50560412 633 CAGGGAGATGCAGACAGGCA CCCTGCATTAAGTTCCCACGC
ERCC2 chr19:50563472-50564059 625 AACCAGGCTTGCCAGAGACTCTAA ACTGCTCAAGAACTGTGCCAGAGA
ERCC2 chr19:50563547-50564095 585 GGAGGAAGTGTGCTTGCCTGG GGCAGGCATATCCGCTGGAG
ERCC2 chr19:50563763-50564257 531 GAGCCAGTCCCAGAAACGGC CCCAACATGCAGGGTCATGG
ERCC2 chr19:50564733-50565330 634 TGCAGGTTAAATATTTGGCACAGTAGC GCTCAACGTGGACGGGCTC
ERCC2 chr19:50565302-50565927 626 AGTAGACCAGGAGCCCGTCCA TGAGATCGAGTCTCTCGGCTCTTT
H2AFX chr11:118469715-118470325 632 CCTCCCACCCCTATTATCAGGAAA GTGCTTAGCCCAGGACTTTCAGAC
H2AFX chr11:118469862-118470452 627 CGTGGAAGGGTTAGCTGCAGAA GGAAGACTTGGCCTTCCGCTC
H2AFX chr11:118470365-118470972 608 GTGCTGCTGCCCAAGAAGAC GAGGGCCTCACTCACCTTCAG
H2AFX chr11:118470769-118471363 670 CTTGCCCCGCAGTCTGAAG TCTGTTCTAGTGTTTGAGCCGTCG
H2AFX chr11:118470788-118471465 678 CGGCTCAGCTCTTTCCATGAG TTGGAGAAAAGAGCCAATCAGGAG
LIG4 chr13:107657854-107658452 599 GGGTAGAATTGTTACAGCTGGACTTG TCCACGGTTTGAATAAAATTTCCA
LIG4 chr13:107658427-107659021 631 CAAGTCCAGCTGTAACAATTCTACCC GGGAAGATCATAGTCGTGTTGCAGA
LIG4 chr13:107658691-107659357 667 TTCTTTCTTGGCTTTGGGCTATTG TTGCCCGTGAATATGATTGCTATG
LIG4 chr13:107659241-107659863 673 CCATTTCTTCAGGAGTCTGCTCGT AGATGACAAGGAGTGGCATGAGTG
LIG4 chr13:107659744-107660318 628 TCTTGTGGTTCATCATCACCACCT CCTCTATCCATCTACAAGCCAGACA
LIG4 chr13:107659887-107660486 636 ACGCAAGGTGCAGCCAGTTT GGGCATGAGACTCTGAGAAAGAGG
LIG4 chr13:107660197-107660888 698 CCTTTACCCCAATATCCTCCAACA TCTGCATTTAAACCAATGCTAGCTG
LIG4 chr13:107660463-107661042 616 CCTCTTTCTCAGAGTCTCATGCCC GGATTTAAAGCTTGGTGTTAGTCAGCA
LIG4 chr13:107660834-107661402 605 CCTTCTCAATGTGCTCAATATCTGCAA TCCTCAGCTAGAAAGAGAGAGAATGGC
LIG4 chr13:107661249-107661844 632 TCCAGCATCTCCATGAGTTCCA AACTTCAAATTAGGGTTGGAGCAAA
LIG4 chr13:107664667-107665326 667 GAGCAGACAAAGACGCTAGAAGGG GTGGGGAGTCAAGTAGGGGAAGTG
LIG4 chr13:107665526-107666176 668 GCCACACACACCCCAAACC GAGGCTATCACTAGCCAGAGCACA
MDC1 chr6:30775839-30776419 631 CATGAGTGGCATCGAGCAATAACT CCTGATTTTGCCTTTGCTCTGTCT
MDC1 chr6:30778056-30778701 687 GGGATAGATGGAGCAAATGTAGCA AAATGCTAGGCAGCAGAGCTGATT
MDC1 chr6:30778580-30779183 669 AGGGTCGGTCACCACATATTCATC AGCCCCCAAAGTAAGAGACAAAGG
MDC1 chr6:30779082-30779646 632 GCCATCAGCACCCATCTCTACAAT GAATCCCTTACAGCCATTCCTGAG
MDC1 chr6:30779588-30780153 639 TGAATTGGTGTCTCAAGAAGCTGG GCCACTAGGTGCAGGACAAATAGG
MDC1 chr6:30780525-30781149 635 GAGACGTAGGCTCAGGGGTAACAG CCCACATATCAGGCTACTAGGGGA
MDC1 chr6:30780824-30781496 673 GCAGGACAAATAGGTCCTCTGTCA GGGCTGGAGGATCAATCTCTAAAA
MDC1 chr6:30780824-30781496 694 TCAGGGGCTATAGGGACAGTTGAT GATCCTCTCTTCTTCCGCATCAGT
MDC1 chr6:30781256-30781880 626 GTAGCCTGAGAGGTGGGTTCAGAG CCTCACTTGCTTCTGTTTCTCCCT
MDC1 chr6:30782861-30783502 677 CACACGTGGATGATGGTAAGGAAA GATACACAGAGAGGGGAGCCAGAG
MDC1 chr6:30783362-30783980 655 ACCTGACTGGCTCCCAGAAGGTA ACCAGGAGACCAACATCCAGAGAG
MDC1 chr6:30783519-30784152 639 TTGGGCACCTTCTCTTCTAACTCG TCCCTCTCCCTCTCTCTCTCTTCC
MDC1 chr6:30786907-30787506 642 GATCACTTGAGGTCGGGAGTTTGT TGTCTTATATTCCTCCCCGACCAG
MDC1 chr6:30787433-30788050 686 CTGATTCTCCAGAAAGCACTGGGT CAAACAGATGTGAAAGCAGTTGGG
MDC1 chr6:30787893-30788495 631 GCTGGAAGCTGGCTCTTTCTTACA CAGCGATACAGATGACGAGGAAGA
MDC1 chr6:30788436-30789038 674 TCTTTCAGATGTGCCAAAGTCAGC AGATGTGGAAGAAGGTCAGCAACC
MDC1 chr6:30789156-30789773 685 AGGGATACCCCAACTCAACTGTGA GAGACCTCCTAAGGTTTTGAGCCC
MDC1 chr6:30789418-30790048 649 CACAGAGGAAGATGTGGTCCTTGA TAATGCATATGGAGGCCTTGAGTG
MDC1 chr6:30790528-30791214 699 TCTCAAAGCACGGGAATTACAGGT GCTGAGGTGAGAGAATTGCTTGAA
MDC1 chr6:30792710-30793286 631 AACTCCGTCTTTATGACAGGCCAA AAGCGGTAGTGGGTTGTCCCTT
MRE11A chr11:093791680-93792263 620 TGAGGCAGCCACTAACCAAGTG CCCACTCTTCCTTTCTCCTTTGC
MRE11A chr11:93789924-93790518 595 TGATGGAATCCCTCTACAGGTCAA TTCAGTTCACCGCTAAGGAAAGTG
MRE11A chr11:93790206-93790525 356 TCCATAACAGGCTGAACCAAATGA TGAGCACTGATGGAATCCCTCTACA
MRE11A chr11:93790371-93790813 479 TGCGTCTAATGGCAGAAGCACC CATTCAGAGGAAGACATCTGTAGGGAA
MRE11A chr11:93790501-93791160 693 TGTAGAGGGATTCCATCAGTGCTC ACGCCTGGCTGATTTTTGTATTTT
MRE11A chr11:93790822-93791406 621 CCTGCTCTCCCAAAGCCTCC GGAGGCAGTGTCTGGATGATGC
MRE11A chr11:93791078-93791702 689 TCACAAAATGACCTGAATTAGCTGGA ACACTTGGTTAGTGGCTGCCTCAT
MRE11A chr11:93791531-93792212 699 TATTTCACAGGACAATGCCTTCCA TGCCTAGTTCTAGGAGGAAACGGG
MRE11A chr11:93792063-93792643 630 GAGAAATCCTGGCATTGACATTCC GCCAAAGTTGAGGGAAAGAGCTTA
MRE11A chr11:93792621-93793211 627 AAGCTCTTTCCCTCAACTTTGGCT AGATACATGCACACACAGCAGGATAAA
MRE11A chr11:93802483-93803058 646 TGGAGGCTGAGAGTAGGATGTGTG TGAGGAAATTGAAGCACAGAGAGG
MRE11A chr11:93808340-93808994 665 AGTCTAGGCGACAGAATGAGGCAC GAGCCTATGCAAGTCATTCACCAG
MRE11A chr11:93809865-93810243 415 GCTCCTTCCAGCTTTAATGTTCCA CAGTTGGGCATTGAGTTATGCG
MRE11A chr11:93818381-93818954 574 CTACCACGTCCAGCCTATTTCCTT TTGGACTCCATATCCTAGCCATCA
MRE11A chr11:93819823-93820439 647 CTAAGTGCATGTGGCCATTCAAAA AACTCTTGGGTGCAAGTGATCCTC
MRE11A chr11:93828956-93829542 623 GCAGCCATCCTAAGCCAACCC TGCAGCAAGGTGCACAAGAGTG
MRE11A chr11:93831998-93832625 694 TCCATGGGGAACAAAACACTTTAG GAGGAGTATTCATGTGTATGCCTTATCC
MRE11A chr11:93832241-93832603 399 TCGAGGGCATCAATATGACGTTC CAGGCCTCTACATTTCACGTGTCC
MRE11A chr11:93833579-93833975 433 TTTCCCTGCTGTGCAGCAACT CCATGTTGAGCAAGCTGGCAGT
MRE11A chr11:93836707-93837325 685 CACGTTGTGCACATGTACCCTAGA CAGGGGGAGATGTAATCATTCTGC
MRE11A chr11:93840495-93840993 535 TTGCCTCCGATGGTGATTGC GCTGAGGAAAGCCTTATTGAAACATGA
MRE11A chr11:93842925-93843486 598 CACACGATAGTCTCCCATTCCTCA TGACTCGGTGTTCATTTCTCTCCA
MRE11A chr11:93844148-93844803 673 TTGAATTTGCCTAATGAGCAGCAA GCTTTCTCTTTTCGGGTTTCCACT
MRE11A chr11:93848872-93849513 642 GCATGGTGGCTTATGCTTGTAATC AAGATTACAGGCACACACCACCAT
MRE11A chr11:93851104-93851690 623 TCCTGCTCTTTCACTTGCAGAATCA GCAGATGCACTTTGTGCCTTGG
MRE11A chr11:93852229-93852814 633 GGTGTTCACTCCTACTCCTGGCTT CCACAAGCACTGTTTATTATAGTTGGGG
MRE11A chr11:93858544-93859140 633 CCGAGTAGCTGGGTCAGTTCCAC TGGCCTGAATCAGAGACTTGGTG
MRE11A chr11:93863534-93864079 582 GCAGGCAAGGTAAGCACCTGA TTTGGGCCTGGGTTACATGA
MRE11A chr11:93865381-93865970 662 TGCAAGACTCCAATCTATAGCTGCAT AAGAGCACGGGAAAGGAAAATAGG
MRE11A chr11:93866249-93866869 626 AAAATAGCCAGGTTTGGGGAAGAG GACAGGGAATAACAACCCACCTGA
RAD50 chr5:131920445-131921084 656 AGCACCTAGCCCTCTGCTTCG GTAGCGACCTGTAAACTGAAGGCG
RAD50 chr5:131922511-131923103 629 TCCTACAGCCTGGAGTTAATGTGAGAA TGTCAACAACGGTTACTACTGGGTGC
RAD50 chr5:131939145-131939740 632 TCTCCTAATGATGCTGAATAAAGGAGG GGAGGAGGCTGTGTGTGCGT
RAD50 chr5:131942768-131943237 506 TGATGAAGCCATTTCTAACGGGA ATGCCCAATGGTTGCTGCTG
RAD50 chr5:131943218-131943758 577 CAGCAGCAACCATTGGGCAT GCTTGATTTAGCCAGTCCACGATG
RAD50 chr5:131950876-131951537 670 AATGACTTTTGTGGCAGGTGTTGA TGCTCATCAGTCCCTTGAAAAACC
RAD50 chr5:131951269-131951916 669 GAAAATGGAAAAGGTTTGTGGTGG ACCATACCTAGCTCCCTCCTGTCC
RAD50 chr5:131952046-131952625 580 TTGCTTCAATAAAGGTTTTTCTGCC ATGGCGAAAACCCGTCTCTACTAA
RAD50 chr5:131952285-131952840 592 ACAGCTGCAAGCAGATCGCC TGGGACCAATGTCAAATATGTGGTCTA
RAD50 chr5:131953044-131953607 632 TGATGTTCACACAATGATAAAATTGCC TCGGGTTGTAGAACCAAAGAGTCA
RAD50 chr5:131954414-131954989 612 TTGGTCAGGGACCACATCACA TGGTCAGCATCTCCATTTGGG
RAD50 chr5:131954563-131955072 546 TCTCACATTTCTTTCCTGTTTGACCC TCTATGACTTATGAGTGCAAGGTAGGC
RAD50 chr5:131955267-131955870 674 TCGACTTGGTACTCCACTCTTAAGGC TTGCTGGTATCACTGCTCAGAAGA
RAD50 chr5:131958269-131958857 660 
GTGAGTCAGTGTCCTAGGGGGAGA CCACAGGTGGTAGTTGTGTCCTCA RAD50 chr5:131958991-131959528 574 TGCTCTTTGGAAGCGAATATCGG AAGTGATATAAGACAGGGCATACCAGC RAD50 chr5:131966678-131967137 496 GGAGAAACCTGGCCAACACCA CAAATTCAAGCCACCATGGAACA RAD50 chr5:131967208-131967764 593 TCAGATACCCTCAGGGAGAAACTGC AAGAGCAAATATATGTGGACAGGAAGG RAD50 chr5:131968184-131968541 394 CCTGCCCTGTAAGCTTTCCCTG TGCTCCTCCAGTTGCTGACGA RAD50 chr5:131968209-131968557 385 GTGGCCAGCAGGGAACATCA GGATAATTCCACAGTCTGCTCCTCC RAD50 chr5:131968219-131968837 664 GGGAACATCAAGCTGATTTGAGAA AATGAGCATGTTTGCCTAGACAGC RAD50 chr5:131971975-131972579 668 CATTTACTGGCTGTTGTGACCCTG ATCGCAGACAAGTCCCTTTCTCAC RAD50 chr5:131972368-131972937 606 TTCATAGCACCACGTCGGACA TTTCTTTGTGTTTCTCGCATTCAC RAD50 chr5:131972555-131973105 587 GGTGAGAAAGGGACTTGTCTGCG CCACCACGCTTGGCCTCTTT RAD50 chr5:131979446-131979999 590 TCCTTCACACTGGCTTATTCTCCC TGTGCAGCAGGCTAGCAGATGT RAD50 chr5:131981443-131982091 668 GCTATTTAGCAACCTATGTGCGGC AGGATGAGGCAGGAGAATCATTTG RAD50 chr5:132000305-132000887 619 CAGGCCTTCCTGTGACCCGT TCCAGGGAGGTAATGCTGGC RAD50 chr5:132001420-132002040 647 GCCTGGAGGAAACTCTTAACAGGG GAGAATGCTTCAGGCCCTTCTTTT RAD50 chr5:132004019-132004675 696 TCAGCGTTGTTCTGAGCATTTTGT GAACCCCTCACAGTGACTCTCTCC RAD50 chr5:132005586-132006177 628 GGAGAAGAGACTCCTGCCTGGCT TGGAATGGGATGAAGAGCAGCA RAD50 chr5:132005782-132006374 629 CGCTCACAGCAGCGTAACTTCC CCACATGCAAGGAAGTAAATTCAGAGG RAD50 chr5:132005974-132006573 636 TGCCATAGAAATGTAGGTCCTCAGAAA GAAGGTGGTGGGTACTGACTTAGATGA RAD50 chr5:132006247-132006835 625 TGCAAATGCATGCTTCTTCTCAA GGGAGCAGGCCTTGACTCTG RAD50 chr5:132006718-132007308 640 CCTCTGCGTCTATCCTGTGTAGCA ACCACCCCCAGGATACTCTGTCTT RAD50 chr5:132007037-132007701 665 TGCTGCAACAACTAGCACTTTCAT CAGGGGTACAAATAAAATTGGGGA RAD50 chr5:132007405-132008002 634 TTTATCCCAAGAATGCAAGATTTCAGA GCCCAGGCAGTCTGGCTCAT 130 Table 4-4 Number of variant sites detected in each DNA repair gene. 
         kbp sequenced                                                            Amino acid change
Gene     Total  Coding  Total  per kbp  per coding kbp  In dbSNP  Novel  In-Dels  Conservative (novel)  Non-conservative (novel)  Synonymous (novel)
ATM      40.9   13.0    62     1.5      0.7             41        21     16       4 (2)                 4 (0)                     1 (0)
BRCA1    18.2   7.1     47     0.4      2.1             43        4      6        8 (0)                 4 (0)                     3 (0)
ERCC2    8.8    2.3     37     4.2      1.7             28        9      5        2 (0)                 0 (0)                     2 (0)
H2AFX    1.8    1.4     5      2.8      0.7             2         3      0        0 (0)                 0 (0)                     1 (1)
LIG4     5.3    4.0     19     3.6      1.5             10        9      6        1 (1)                 2 (0)                     3 (2)
MDC1     9.7    6.5     36     3.7      3.5             22        14     0        5 (2)                 12 (4)                    6 (2)
MRE11A   14.3   5.1     15     1.0      0               13        2      1        0 (0)                 0 (0)                     0 (0)
RAD50    15.8   5.8     18     1.1      0.3             11        7      2        1 (0)                 0 (0)                     1 (0)
Overall  114.8  45.2    239    2.3      1.1             170       69     36       21 (5)                22 (4)                    17 (5)

Table 4-5 Coding variant genotypes observed in high and low toxicity prostate brachytherapy patients. (A = reference allele, B = non-reference allele)

                                                                   High toxicity Low toxicity
Gene    Genome coordinate    Amino acid  Conservation  dbSNP       AA  AB  BB    AA  AB  BB
        (Build 36, hg18)     change      score*        accession
ATM     chr11:107,603,786    S49C        -1            rs1800054   21  0   0     19  1   0
ATM     chr11:107,626,842    L480F       0             novel       20  1   0     20  0   0
ATM     chr11:107,629,971    S707P       -1            rs4986761   21  0   0     17  1   0
ATM     chr11:107,648,666    P1054R      -2            rs1800057   19  2   0     20  0   0
ATM     chr11:107,668,697    P1526P      7             rs1800889   3   1   0     1   1   0
ATM     chr11:107,669,347    V1570A      0             novel       20  1   0     19  0   0
ATM     chr11:107,680,672    D1853N      1             rs1801516   11  2   0     9   3   0
ATM     chr11:107,680,673    D1853V      -3            rs1801673   11  0   0     9   1   0
ATM     chr11:107,688,377    N1983S      1             rs659243    0   0   21    0   0   20
BRCA1   chr17:38,476,501     M1652I      1             rs1799967   9   9   2     9   6   3
BRCA1   chr17:38,476,574     M1628T      -1            rs4986854   6   6   4     6   7   1
BRCA1   chr17:38,476,620     S1613G      0             rs1799966   11  7   3     10  5   1
BRCA1   chr17:38,487,996     S1436S      4             rs1060915   14  3   3     9   10  1
BRCA1   chr17:38,497,526     K1183R      2             rs16942     7   8   5     4   11  5
BRCA1   chr17:38,497,955     S1040N      1             rs4986852   20  1   0     19  0   0
BRCA1   chr17:38,497,961     E1038G      -2            rs16941     19  0   0     13  1   0
BRCA1   chr17:38,498,462     L871P       -3            rs799917    11  9   0     16  3   0
BRCA1   chr17:38,498,553     R841W       -3            rs1800709   19  1   0     16  0   0
BRCA1   chr17:38,498,763     L771L       4             rs16940     4   0   0     7   1   0
BRCA1   chr17:38,498,992     S694S       4             rs1799949   4   7   1     7   3   0
BRCA1   chr17:38,498,997     D693N       1             rs4986850   4   0   0     7   2   0
BRCA1   chr17:38,499,587     R496H       0             rs28897677  19  1   0     18  1   0
BRCA1   chr17:38,499,693     F461L       0             rs62625300  19  1   0     18  1   0
BRCA1   chr17:38,500,007     Q356R       1             rs1799950   15  1   0     18  1   0
ERCC2   chr19:50,546,759     K751Q       1             rs13181     15  5   0     18  0   0
ERCC2   chr19:50,547,364     D711D       6             rs1052555   18  1   0     17  0   0
ERCC2   chr19:50,559,099     D312N       1             rs1799793   18  0   0     17  1   0
ERCC2   chr19:50,560,149     R156R       5             rs238406    18  0   0     17  1   0
H2AFX   chr11:118,471,119    L66L        4             novel       13  1   0     13  0   0
LIG4    chr13:107,659,299    Q773Q       5             novel       13  0   0     13  1   0
LIG4    chr13:107,659,914    D568D       6             rs1805386   16  0   0     12  1   0
LIG4    chr13:107,660,726    Y298H       2             novel       16  0   0     12  1   0
LIG4    chr13:107,661,333    E95E        5             novel       16  0   0     12  1   0
LIG4    chr13:107,661,592    T9I         -2            rs1805388   9   0   0     8   1   0
LIG4    chr13:107,661,610    A3V         -2            rs1805389   9   0   0     8   1   0
MDC1    chr6:30,779,291      D1855E      2             rs28994874  9   1   0     8   0   0
MDC1    chr6:30,779,567      V1791E      -2            rs28994873  16  1   0     16  1   0
MDC1    chr6:30,779,705      P1745R      -2            rs28994871  14  1   0     14  0   0
MDC1    chr6:30,779,968      A1657A      4             rs28986317  14  1   0     14  0   0
MDC1    chr6:30,780,543      T1466A      -1            novel       14  1   0     14  0   0
MDC1    chr6:30,780,702      K1413E      1             novel       14  6   0     14  5   1
MDC1    chr6:30,780,736      G1401G      6             rs28994870  14  1   0     13  1   0
MDC1    chr6:30,780,992      M1316T      -1            rs61733213  14  1   0     13  0   0
MDC1    chr6:30,781,326      T1205A      -1            novel       14  1   0     13  0   0
MDC1    chr6:30,781,604      S1112F      -2            rs28987085  17  1   0     17  0   0
MDC1    chr6:30,781,641      P1100A      -1            rs28994869  19  1   0     20  0   0
MDC1    chr6:30,781,660      P1093P      7             rs28994868  21  0   0     19  1   0
MDC1    chr6:30,783,584      R917S       -1            rs28986467  20  1   0     20  0   0
MDC1    chr6:30,783,903      D811G       -1            novel       21  0   0     17  1   0
MDC1    chr6:30,783,969      Q789P       -1            novel       19  2   0     20  0   0
MDC1    chr6:30,784,025      T770T       4             rs28986466  3   1   0     1   1   0
MDC1    chr6:30,788,541      P386L       -3            rs28986465  20  1   0     19  0   0
MDC1    chr6:30,788,587      E371K       1             rs2075015   11  2   0     9   3   0
MDC1    chr6:30,788,777      S307S       4             novel       11  0   0     9   1   0
MDC1    chr6:30,788,895      R268K       2             rs9262152   0   0   21    0   0   20
MDC1    chr6:30,789,456      R179C       -3            rs28986464  9   9   2     9   6   3
MDC1    chr6:30,789,662      P138P       7             novel       6   6   4     6   7   1
MDC1    chr6:30,789,910      V56I        3             novel       11  7   3     10  5   1
RAD50   chr5:131,951,572     V315L       1             rs28903090  14  3   3     9   10  1
RAD50   chr5:132,005,895     I1293I      4             rs28903094  7   8   5     4   11  5

Table 4-6 Variants associated with residual gamma H2AX levels following irradiation

                                                                  High gamma H2AX  Low gamma H2AX
Gene    Genome coordinate        Amino acid change,   dbSNP       AA  AB  BB       AA  AB  BB      Possible effect on   p-values
        (Build 36, hg18)         conservation score*  accession                                    DNA repair activity  < 0.05
MDC1    chr6:30778271            intronic             rs9405048   16  3   0        8   9   2       Increase             p1=0.009 p3=0.006
BRCA1   chr17:38449934           intronic             rs12516     8   2   0        5   6   3       Increase             p1=0.047 p3=0.023
BRCA1   chr17:38453439           intronic             rs817630    14  5   2        5   10  3       Increase             p1=0.025
BRCA1   chr17:38469351           intronic             rs3092994   10  4   0        3   8   3       Increase             p1=0.021 p3=0.009 p4=0.036
BRCA1   chr17:38469731           intronic             rs8176257   10  2   0        3   8   1       Increase             p1=0.012 p3=0.017
BRCA1   chr17:38469732           intronic             rs8176256   10  1   0        3   4   0       Increase             p1=0.047
BRCA1   chr17:38472867           intronic             rs11654396  14  3   1        6   8   1       Increase             p1=0.038
BRCA1   chr17:38480127           intronic             rs8176212   12  5   2        5   10  3       Increase             p1=0.049
BRCA1   chr17:38480201           intronic             rs2236762   12  5   2        5   11  3       Increase             p1=0.049
BRCA1   chr17:38480270-38480274  intronic             novel       14  4   2        5   7   3       Increase             p1=0.044
BRCA1   chr17:38496716           intronic             rs799916    14  5   2        5   10  3       Increase             p1=0.025
BRCA1   chr17:38501690           intronic             rs8176147   11  3   0        3   9   2       Increase             p1=0.007 p3=0.007
BRCA1   chr17:38502890           intronic             rs8176144   13  3   0        6   8   3       Increase             p1=0.013 p3=0.004
BRCA1   chr17:38510660           intronic             rs799912    11  4   1        5   10  3       Increase             p1=0.037 p3=0.037
BRCA1   chr17:38530713           intronic             rs799905    13  5   2        5   11  3       Increase             p1=0.025
ERCC2   chr19:50559099           D312N, 1             rs1799793   8   9   3        14  4   1       Decrease             p3=0.042

*Conservation score of an amino acid substitution as determined from the BLOSUM62 alignment score matrix [44, 45]. We defined scores < 0 as non-conservative substitutions and all others as conservative substitutions.

4.7. Bibliography

1. Evans WE, Relling MV: Pharmacogenomics: translating functional genomics into rational therapeutics.
Science 1999, 286(5439):487-491.
2. Frazer KA, Murray SS, Schork NJ, Topol EJ: Human genetic variation and its contribution to complex traits. Nat Rev Genet 2009, 10(4):241-251.
3. Grimm PD, Blasko JC, Sylvester JE, Meier RM, Cavanagh W: 10-year biochemical (prostate-specific antigen) control of prostate cancer with (125)I brachytherapy. Int J Radiat Oncol Biol Phys 2001, 51(1):31-40.
4. Potters L, Morgenstern C, Calugaru E, Fearn P, Jassal A, Presser J, Mullen E: 12-year outcomes following permanent prostate brachytherapy in patients with clinically localized prostate cancer. J Urol 2005, 173(5):1562-1566.
5. Morris WJ, Keyes M, Palma D, Spadinger I, McKenzie MR, Agranovich A, Pickles T, Liu M, Kwan W, Wu J, Berthelet E, Pai H: Population-based Study of Biochemical and Survival Outcomes After Permanent (125)I Brachytherapy for Low- and Intermediate-risk Prostate Cancer. Urology 2009, 4:860-865.
6. Bentzen SM: Preventing or reducing late side effects of radiation therapy: radiobiology meets molecular pathology. Nat Rev Cancer 2006, 6(9):702-713.
7. Keyes M, Miller S, Moravan V, Pickles T, McKenzie M, Pai H, Liu M, Kwan W, Agranovich A, Spadinger I, Lapointe V, Halperin R, Morris WJ: Predictive Factors for Acute and Late Urinary Toxicity After Permanent Prostate Brachytherapy: Long-Term Outcome in 712 Consecutive Patients. Int J Radiat Oncol Biol Phys 2008, 73(4):1023-1032.
8. Bucci J, Morris WJ, Keyes M, Spadinger I, Sidhu S, Moravan V: Predictive factors of urinary retention following prostate brachytherapy. Int J Radiat Oncol Biol Phys 2002, 53(1):91-98.
9. Keyes M, Schellenberg D, Moravan V, McKenzie M, Agranovich A, Pickles T, Wu J, Liu M, Bucci J, Morris WJ: Decline in urinary retention incidence in 805 patients after prostate brachytherapy: the effect of learning curve? Int J Radiat Oncol Biol Phys 2006, 64(3):825-834.
10. 
Keyes M, Miller S, Moravan V, Pai H, Kwan W, Liu M, Morris J, Halperins R, Pickles T: Acute and Late Urinary Toxicity in 606 Prostate Brachytherapy Patients. Radiother Oncol 2006, 80(Supplement 1):S41-S42.
11. Keyes M, Moravan V, Liu M, Jankovic B, Morris WJ: Rectal Toxicity After I 125 Permanent. Radiother Oncol 2005, 76(Supplement 1):S5-S6.
12. Macdonald AG, Keyes M, Kruk A, Duncan G, Moravan V, Morris WJ: Predictive factors for erectile dysfunction in men with prostate cancer after brachytherapy: is dose to the penile bulb important? Int J Radiat Oncol Biol Phys 2005, 63(1):155-163.
13. Bottomley D, Ash D, Al-Qaisieh B, Carey B, Joseph J, St Clair S, Gould K: Side effects of permanent I125 prostate seed implants in 667 patients treated in Leeds. Radiother Oncol 2007, 82(1):46-49.
14. Lehrer S, Cesaretti J, Stone NN, Stock RG: Urinary symptom flare after brachytherapy for prostate cancer is associated with erectile dysfunction and more urinary symptoms before implantation. BJU Int 2006, 98(5):979-981.
15. Wust P, von Borczyskowski DW, Henkel T, Rosner C, Graf R, Tilly W, Budach V, Felix R, Kahmann F: Clinical and physical determinants for toxicity of 125-I seed prostate brachytherapy. Radiother Oncol 2004, 73(1):39-48.
16. Finnon P, Robertson N, Dziwura S, Raffy C, Zhang W, Ainsbury L, Kaprio J, Badie C, Bouffler S: Evidence for significant heritability of apoptotic and cell cycle responses to ionising radiation. Hum Genet 2008, 123(5):485-493.
17. Roberts SA, Spreadborough AR, Bulman B, Barber JB, Evans DG, Scott D: Heritability of cellular radiosensitivity: a marker of low-penetrance predisposition genes in breast cancer? Am J Hum Genet 1999, 65(3):784-794.
18. Burrill W, Barber JB, Roberts SA, Bulman B, Scott D: Heritability of chromosomal radiosensitivity in breast cancer patients: a pilot study with the lymphocyte micronucleus assay. Int J Radiat Biol 2000, 76(12):1617-1619.
19. 
Wood RD, Mitchell M, Lindahl T: Human DNA repair genes, 2005. Mutat Res 2005, 577(1-2):275-283.
20. Wood RD, Mitchell M, Sgouros J, Lindahl T: Human DNA repair genes. Science 2001, 291(5507):1284-1289.
21. De la Torre C, Pincheira J, Lopez-Saez JF: Human syndromes with genomic instability and multiprotein machines that repair DNA double-strand breaks. Histol Histopathol 2003, 18(1):225-243.
22. O'Driscoll M, Cerosaletti KM, Girard PM, Dai Y, Stumm M, Kysela B, Hirsch B, Gennery A, Palmer SE, Seidel J, Gatti RA, Varon R, Oettinger MA, Neitzel H, Jeggo PA, Concannon P: DNA ligase IV mutations identified in patients exhibiting developmental delay and immunodeficiency. Mol Cell 2001, 8(6):1175-1185.
23. Gutierrez-Enriquez S, Fernet M, Dork T, Bremer M, Lauge A, Stoppa-Lyonnet D, Moullan N, Angele S, Hall J: Functional consequences of ATM sequence variants for chromosomal radiosensitivity. Genes Chromosomes Cancer 2004, 40(2):109-119.
24. Andreassen CN, Alsner J, Overgaard J: Does variability in normal tissue reactions after radiotherapy have a genetic basis--where and how to look for it? Radiother Oncol 2002, 64(2):131-140.
25. Angele S, Romestaing P, Moullan N, Vuillaume M, Chapot B, Friesen M, Jongmans W, Cox DG, Pisani P, Gerard JP, Hall J: ATM haplotypes and cellular response to DNA damage: association with breast cancer risk and clinical radiosensitivity. Cancer Res 2003, 63(24):8717-8725.
26. Cesaretti JA, Stock RG, Lehrer S, Atencio DA, Bernstein JL, Stone NN, Wallenstein S, Green S, Loeb K, Kollmeier M, Smith M, Rosenstein BS: ATM sequence variants are predictive of adverse radiotherapy response among patients treated for prostate cancer. Int J Radiat Oncol Biol Phys 2005, 61(1):196-202.
27. Iannuzzi CM, Atencio DP, Green S, Stock RG, Rosenstein BS: ATM mutations in female breast cancer patients predict for an increase in radiation-induced late effects. Int J Radiat Oncol Biol Phys 2002, 52(3):606-613.
28. 
Hall EJ, Schiff PB, Hanks GE, Brenner DJ, Russo J, Chen J, Sawant SG, Pandita TK: A preliminary report: frequency of A-T heterozygotes among prostate cancer patients with severe late responses to radiation therapy. Cancer J Sci Am 1998, 4(6):385-389.
29. Andreassen CN, Overgaard J, Alsner J, Overgaard M, Herskind C, Cesaretti JA, Atencio DP, Green S, Formenti SC, Stock RG, Rosenstein BS: ATM sequence variants and risk of radiation-induced subcutaneous fibrosis after postmastectomy radiotherapy. Int J Radiat Oncol Biol Phys 2006, 64(3):776-783.
30. West CM, Elliott RM, Burnet NG: The genomics revolution and radiotherapy. Clin Oncol (R Coll Radiol) 2007, 19(6):470-480.
31. Andreassen CN, Alsner J, Overgaard M, Overgaard J: Prediction of normal tissue radiosensitivity from polymorphisms in candidate genes. Radiother Oncol 2003, 69(2):127-135.
32. Damaraju S, Murray D, Dufour J, Carandang D, Myrehaug S, Fallone G, Field C, Greiner R, Hanson J, Cass CE, Parliament M: Association of DNA repair and steroid metabolism gene polymorphisms with clinical late toxicity in patients treated with conformal radiotherapy for prostate cancer. Clin Cancer Res 2006, 12(8):2545-2554.
33. Suga T, Iwakawa M, Tsuji H, Ishikawa H, Oda E, Noda S, Otsuka Y, Ishikawa A, Ishikawa K, Shimazaki J, Mizoe JE, Tsujii H, Imai T: Influence of multiple genetic polymorphisms on genitourinary morbidity after carbon ion radiotherapy for prostate cancer. Int J Radiat Oncol Biol Phys 2008, 72(3):808-813.
34. Human DNA repair genes. Supplement to the review by Wood RD, Mitchell M, & Lindahl T published in Mutation Research, 2005. [http://www.cgal.icnet.uk/DNA_Repair_Genes.html]
35. Niida H, Nakanishi M: DNA damage checkpoints in mammals. Mutagenesis 2006, 21(1):3-9.
36. Spycher C, Miller ES, Townsend K, Pavic L, Morrice NA, Janscak P, Stewart GS, Stucki M: Constitutive phosphorylation of MDC1 physically links the MRE11-RAD50-NBS1 complex to damaged chromatin. J Cell Biol 2008, 181(2):227-240.
37. 
Wang Y, Cortez D, Yazdi P, Neff N, Elledge SJ, Qin J: BASC, a super complex of BRCA1-associated proteins involved in the recognition and repair of aberrant DNA structures. Genes Dev 2000, 14(8):927-939.
38. Robins P, Lindahl T: DNA ligase IV from HeLa cell nuclei. J Biol Chem 1996, 271(39):24257-24261.
39. Cox JD, Stetz J, Pajak TF: Toxicity criteria of the Radiation Therapy Oncology Group (RTOG) and the European Organization for Research and Treatment of Cancer (EORTC). Int J Radiat Oncol Biol Phys 1995, 31(5):1341-1346.
40. Barry MJ, Fowler FJ, Jr., O'Leary MP, Bruskewitz RC, Holtgrewe HL, Mebust WK, Cockett AT: The American Urological Association symptom index for benign prostatic hyperplasia. The Measurement Committee of the American Urological Association. J Urol 1992, 148(5):1549-1557; discussion 1564.
41. Keyes M, Miller S, Moravan V, Pickles T: Predictive factors for acute and late urinary toxicity after permanent prostate brachytherapy: long-term outcome in 712 consecutive patients. IJROBP 2008, in press.
42. Pugh TJ, Bebb G, Barclay L, Sutcliffe M, Fee J, Salski C, O'Connor R, Ho C, Murray N, Melosky B, English J, Vielkind J, Horsman D, Laskin JJ, Marra MA: Correlations of EGFR mutations and increases in EGFR and HER2 copy number to gefitinib response in a retrospective analysis of lung cancer patients. BMC Cancer 2007, 7:128.
43. Olive PL, Banath JP, Keyes M: Residual gammaH2AX after irradiation of human lymphocytes and monocytes in vitro and its relation to late effects after prostate brachytherapy. Radiother Oncol 2008, 86(3):336-346.
44. Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A 1992, 89(22):10915-10919.
45. Eddy SR: Where did the BLOSUM62 alignment score matrix come from? Nat Biotechnol 2004, 22(8):1035-1036.
46. Meyer A, Wilhelm B, Dork T, Bremer M, Baumann R, Karstens JH, Machtens S: ATM missense variant P1054R predisposes to prostate cancer. Radiother Oncol 2007, 83(3):283-288.
47. 
Parmley JL, Hurst LD: How do synonymous mutations affect fitness? Bioessays 2007, 29(6):515-519.
48. Nakamura Y: Codon usage table (Homo sapiens). 2009.

Chapter 5. Transcriptome sequencing of treatment-naïve lung cancers from individuals likely to benefit from erlotinib treatment⁴

Candidate gene sequencing has been a primary strategy to uncover sequence variants in genes associated with a distinct phenotype, as presented in Chapters 2 and 4. However, recent technological advancements have increased DNA sequencing capacity dramatically, and whole genome sequencing is on the verge of becoming routine. Currently, sequencing cancer genomes (i.e. DNA) is still relatively expensive due to the sheer size of the genome and the high depth of coverage (>20X, [56]) required to detect all sequence variants. On the other hand, sequencing transcriptomes (i.e. RNA) is an effective method to concentrate sequencing capacity on the 1-2% of the genome that encodes proteins, the source of nearly all cancer drivers identified to date [56]. Sequencing RNA not only uncovers sequence variants but also provides quantitative measurements of expression levels for every gene [26, 53]. Building on the results of Chapter 2, this work seeks to go beyond assessment of EGFR status to identify additional cancer drivers found in pre-treatment lung cancer. We therefore sequenced the transcriptomes of 30 lung tumour biopsies to identify candidate driver mutations and fusion transcripts as well as elucidate patterns of gene and viral expression. As very small tumour biopsies were collected as a condition of enrolment in a clinical trial, many of the techniques developed during the previous three studies were applied to generate sequence data from a myriad of complex clinical samples.

⁴ A version of this chapter will be submitted for publication. 
Pugh TJ, Laskin JJ, Bosdet I, Asano J, Barclay L, Chan S, Griffith OL, Morin RD, Morrissey S, Sutcliffe M, Yang C, Ho C, Lee C, Ionescu D, Melosky B, Murray NR, Sun S, Marra MA. Transcriptome profiling of treatment-naïve lung cancers from individuals enrolled in a trial of first-line erlotinib.

The work presented here represents the expression profiling and variant discovery phase of the project. Recent genome and transcriptome sequencing efforts using second generation technologies have uncovered a wealth of human variation not currently catalogued in public databases. However, lists of putative mutations identified by these methods are currently rife with technical artifacts. Therefore, the second phase of this project is the validation of all putative mutations using an orthogonal method based on targeted DNA sequencing. This chapter documents the validation of a small subset of putative somatic variants using traditional PCR and sequencing methods. However, as over 9,000 putative variants were identified from our set of 30 lung tumour transcriptome sequences, we are pursuing an alternate validation assay. This chapter provides a rationale for the design of this assay, a plan to apply it to multiple samples, and proposed analyses once the data from this second phase are in hand.

5.1. Introduction

Lung cancer continues to be the leading cause of cancer-related death worldwide. While the majority of cases can be attributed to smoking, 25% cannot (53% of lung cancers in women, 15% in men), and these cancers account for over 300,000 deaths annually [1]. Lung cancers arise due to an accumulation of genetic defects that transform normal bronchial epithelium into neoplastic tissue. It has been estimated that 10 to 20 mutations are acquired before a tumour is evident clinically [2], although this has yet to be directly measured.
55% of lung cancers are adenocarcinomas (53% in smokers, 62% in non-smokers) [1], and specific mutations in these tumours are associated with an individual's smoking history. For example, tumours from never-smokers are more likely to have one of two activating mutations of the epidermal growth factor receptor (EGFR) tyrosine kinase domain [3-5], either an L858R point mutation in exon 21 or a 12-15 bp deletion in exon 19 involving four amino acids, L747-A750 (LREA). In contrast, tumours from smokers more commonly harbour activating mutations of the KRAS gene affecting exon 2, resulting in substitutions of amino acids G12 or G13 [6]. Treatment with the tyrosine kinase inhibitors (TKIs) erlotinib (Tarceva) and gefitinib (Iressa) has been particularly effective in non-smokers, especially in patients with EGFR mutations [3-5], while tumours with KRAS mutations tend to be resistant [6]. However, EGFR mutations, while a positive prognostic indicator [7], are neither necessary nor sufficient for response to these drugs [8-10], and patients who initially respond invariably become resistant to treatment. Resistance is explained in some cases by the occurrence of a T790M resistance mutation in the EGFR kinase domain [11, 12]. This mutation increases the affinity of EGFR for kinase-activating ATP, thereby decreasing the effectiveness of kinase inhibitors [12]. Presented as an alternative to EGFR mutation screening, increases in EGFR or HER2 gene copy number have also been associated with TKI response [13, 14]. The diagnostic value of EGFR status is still under debate [15, 16], and assessment of additional genes is likely needed to adequately predict sensitivity to targeted therapies [8]. Comprehensive sequencing-based surveys of hundreds of genes in even a handful of tumour samples are technically and financially challenging using traditional Sanger sequencing methods.
In lung cancer, two large-scale surveys have uncovered thousands of mutations in hundreds of genes, of which 3-11% were recurrent across multiple tumours [17, 18]. Both studies concluded that lung adenocarcinomas are genetically heterogeneous, that a large number of mutations may be passenger mutations, and that a core set of non-synonymous driver mutations exists that drives the biology of these cancers. Davies et al. (2005) raised the concern that this set of driver mutations may be very large and spread across a large number of protein kinases, each requiring a different therapy. Ding et al. (2008), on the other hand, suggested that these mutations affect relatively few common pathways that could be targeted as a group if recurrent signatures were identified. Similar surveys of a diversity of cancers have reported comparable findings with similar interpretations [17, 19-21]. These studies have focused on sets of candidate genes and, while technically impressive, have examined only a fraction of the over 20,000 protein-coding genes in the human genome [49] and could not detect expression levels or abnormal gene constructs such as fusion genes. In addition, their retrospective nature has made them reliant on banked tumour samples and cell lines with varying histologies, tumour types, and clinical histories, which may account for the wide spectrum of mutations observed. While commercialized next-generation sequencing technologies have reduced the cost of DNA sequencing by three orders of magnitude, the cost of sequencing entire cancer genomes is still high (>$100,000) [22]. However, targeted sequencing of the 1-2% of the genome that encodes expressed exons can be achieved by sequencing RNA transcripts at a fraction of the cost of the entire genome [23].
This approach, RNA-seq, not only detects expressed sequence aberrations such as point mutations [24, 27, 28, 40, 50] and fusion genes [51, 52] but also provides quantitative gene expression and splicing information [24, 25, 26]. The wide dynamic range of gene expression results in greater sequence coverage of genes with higher expression [26]; mutation detection is therefore most effective in transcripts with moderate or high expression. In addition, mutations that result in a decrease or loss of expression, due to nonsense-mediated decay for example, may be missed [23]. Despite these caveats, RNA-seq is well suited to identifying cancer drivers, as most drivers identified to date are small mutations that alter amino acid sequence or large-scale structural changes resulting in gene amplification or fusion [23]. Expressed proteins containing somatic mutations are particularly attractive drug targets due to their tumour specificity. Recently, transcript sequencing has been used to uncover consistently occurring non-synonymous point mutations in granulosa-cell ovarian cancer [27] and follicular and diffuse large B-cell lymphomas [50]. Here we report the analysis of RNA-seq data from 30 lung adenocarcinomas collected prospectively as part of a clinical trial in a group of patients with increased likelihood of benefiting from treatment with an EGFR inhibitor.

5.2. Methods

5.2.1. Biopsy collection and processing

This study was carried out in conjunction with a phase II clinical trial of erlotinib as a first-line therapy for metastatic lung cancer at the BC Cancer Agency. Eligibility criteria for the clinical trial included: stage IIIB/IV non-small cell lung cancer, no prior chemotherapy, and at least two of the following four criteria: 1) women, 2) never-smokers, 3) southeast Asian racial origin, 4) adenocarcinoma and/or bronchoalveolar carcinoma.
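The eligibility rule stated above can be illustrated with a short sketch. It is not part of the trial protocol: the Patient record and its field names are hypothetical, and only the criteria listed in this section are modelled (the text says the criteria included these items, so the trial may have applied others).

```python
from dataclasses import dataclass


@dataclass
class Patient:
    # Hypothetical record; field names are illustrative only.
    stage: str                      # clinical stage, e.g. 'IIIB' or 'IV'
    is_nsclc: bool                  # non-small cell lung cancer
    prior_chemotherapy: bool
    is_woman: bool
    never_smoker: bool
    southeast_asian_origin: bool
    adeno_or_bac_histology: bool    # adenocarcinoma and/or bronchoalveolar carcinoma


def count_enrichment_criteria(p: Patient) -> int:
    # Number of the four listed clinical criteria that the patient meets.
    return sum([p.is_woman, p.never_smoker,
                p.southeast_asian_origin, p.adeno_or_bac_histology])


def is_eligible(p: Patient) -> bool:
    # Stage IIIB/IV NSCLC, no prior chemotherapy, and at least two of the four criteria.
    return (p.is_nsclc
            and p.stage in ('IIIB', 'IV')
            and not p.prior_chemotherapy
            and count_enrichment_criteria(p) >= 2)


example = Patient(stage='IV', is_nsclc=True, prior_chemotherapy=False,
                  is_woman=True, never_smoker=True,
                  southeast_asian_origin=False, adeno_or_bac_histology=False)
print(is_eligible(example))
```

Counting the four criteria separately from the hard requirements keeps the at-least-two rule explicit and makes it easy to report which patients met two, three, or all four criteria.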
Prior to treatment, 65 patients provided informed consent and agreed to provide a 24 mL blood sample and a fresh tumour biopsy. Biopsies during treatment and at disease progression were optional. Additional blood samples were taken after one month of treatment and upon disease progression. Clinical response to erlotinib was assessed radiographically using the same Southwest Oncology Group modification of the RECIST criteria presented in Chapter 2 [8].

Solid tissues obtained from core-needle biopsies, bronchoscopies, or surgical resections were embedded in optimum cutting temperature compound (OCT) immediately upon removal from the patient and fresh frozen in liquid nitrogen vapour. To perform an initial pathology review and assessment of tumour content, 8 µm sections were cut from each block using a -20°C microtome-cryostat and stained with haematoxylin and eosin (H&E). Once a block of sufficient tumour content was selected for nucleic acid extraction, sets of 30 sections were taken alternately for DNA and RNA. After each set of 30 sections, a single section was taken for pathology review to confirm consistent tumour content throughout the sample. The total number of sections taken for DNA and RNA extraction was dependent on the amount of tissue available. If necessary, tissues were sectioned onto membrane slides for laser microdissection as described in Chapter 2 [8]. In cases where microdissection was not necessary, sets of 30 tissue sections were transferred directly to a tube containing 400 µL of Gentra PureGene Cell Lysis Solution (QIAgen) for DNA extraction or 800 µL Trizol (Invitrogen) for RNA extraction.

Liquid samples obtained from fine-needle aspirates were processed in the operating suite immediately upon collection. A droplet of fluid was spotted on a positively charged glass slide (Fisher Scientific) and smeared for H&E staining and pathology review.
The remaining fluid was mixed with 200 µL of phosphate buffered saline (PBS) containing 5% DMSO and fresh frozen in liquid nitrogen vapour. Samples of sufficient tumour content were thawed and portions transferred to new tubes containing Gentra PureGene Cell Lysis Solution or Trizol for DNA and RNA extraction respectively.

Pleural fluids obtained by thoracentesis were collected in eight 50 mL tubes (BD Falcon) containing 2.07 mL 0.5 M EDTA to prevent clotting. Volumes in excess of 400 mL were collected in 800 mL or 1.5 L vacuum bottles and stored at -80°C without further processing. To pellet the cells, the 50 mL tubes were centrifuged at 2000 x g for 10 minutes and the supernatant removed and stored at -80°C. 3 mL of Gentra Red Blood Cell lysis solution was added to the cell pellets and mixed briefly by vortexing for 1 second. The remaining cells were pelleted by spinning at 2000 x g for 10 minutes. The supernatant was discarded and the cell pellets washed by resuspension in 3 mL of PBS, vortexing at low speed, and re-pelleting by spinning at 2000 x g for 10 minutes. The washed cell pellets were resuspended in 1000 µL fetal bovine serum (FBS) containing 5% DMSO. A droplet of the resuspended cell solution was smeared on a glass slide and H&E stained for pathology review. In all but one case, fewer than 5% of the cells collected by thoracentesis were tumour cells, with the vast majority being normal immune or mesothelial cells (Figure 5.1A).

To isolate tumour cells present in these samples, lung tumour cells were fluorescently labelled and isolated by flow cytometry. Cells were thawed and pelleted by spinning at 2000 x g for 10 minutes. The supernatant was discarded and the cells resuspended in 100 µL PBS and 10 µL Ber-Ep4 antibody conjugated to a fluorescein isothiocyanate (FITC) fluorophore (Dako, F0860). The mixture was incubated on ice in darkness for 30 minutes.
The labelled cells were pelleted and washed by spinning at 2000 x g for 1 minute and discarding the supernatant. Pellets were washed twice by resuspending in 1000 µL PBS/2% BSA, spinning at 2000 x g for 1 minute and discarding the supernatant. Washed cell pellets were then resuspended in 2000 µL FBS. To remove large cell clumps and debris, the cells were passed through a 35 µm Cell Strainer (BD Falcon, 352235) and collected in a 5 mL tube. To verify that tumour cells were labelled correctly, 5 µL of strained cells were smeared on a positively charged slide, covered by a glass coverslip, and visualized on a fluorescent microscope (Figure 5.1B). Tumour cells were isolated on the BD Falcon FACSDiVa flow cytometer using settings specific for collecting large FITC positive tumour cells and with increased sensitivity to detect and discard small reactive cells. Up to 1 million tumour cells as measured by the flow cytometer were collected in 1.5 mL tubes containing 800 µL of Gentra Cell Lysis buffer for DNA extraction or 800 µL Trizol for RNA extraction. To verify that tumour cells were being isolated with high specificity, ~25,000 cells were sorted directly onto a positively charged glass slide, dried, and H&E stained for pathology review (Figure 5.1C).

5.2.2. DNA extraction and Sanger sequencing

Tumour cells were transferred to Gentra Cell Lysis solution as described above. Volumes reported are for 400 µL of Cell Lysis solution and volumes were scaled up for larger volumes such as those resulting from flow sorting. To digest tissues, 2 µL of 20 mg/mL Proteinase K was added, mixed by vortexing, and incubated at 55°C for 3 hours.
Once tissue fragments were no longer visible, the OCT gel (when present) was collected at the bottom of each tube by spinning at 2000 x g, and the supernatant containing cell lysates transferred to a new tube containing 80 µL Gentra Protein Precipitation solution (QIAgen). The tubes were vortexed for 20 seconds to precipitate proteins and spun for 3 minutes at 20,000 x g. The supernatant was transferred to a new tube containing 400 µL isopropanol and 1 µL of 20 mg/mL glycogen and mixed by gently inverting 50 times. Precipitated DNA was pelleted by spinning the tubes for 3 minutes at 20,000 x g. Pellets were washed with 400 µL 70% ethanol and spun at 20,000 x g for 1 minute. The wash solution was discarded, the tube drained at an angle onto a clean Kimwipe, and the pellet air dried for no more than 10 minutes. Pellets were resuspended in 10 µL TE (10:0.1) and incubated at 65°C for 1 hour or at 4°C overnight to facilitate resuspension. DNA was quantified using a NanoDrop spectrophotometer and by PicoGreen fluorometry (Qubit, Invitrogen). For samples yielding less than 500 ng of genomic DNA, 10 ng was amplified using the RepliG Mini whole genome amplification kit (QIAgen) using the method documented in Chapter 3.

24 mL blood samples were taken before treatment, after 1 month (cycle) of treatment, and upon disease progression. DNA was extracted from each blood sample using the Gentra Puregene Blood kit (QIAgen) used in Chapter 4 and the plasma was stored at -80°C. Sanger sequencing reactions were performed as documented in Chapter 2 [8].

5.2.3. RNA extraction, amplification and Illumina sequencing

Tumour cells were transferred to Trizol as described above. Volumes reported are for 800 µL of Trizol solution and volumes were scaled up for larger volumes such as those resulting from flow sorting.
The liquid cell lysate was transferred to a pre-spun 2 mL PhaseLoc gel tube (Eppendorf) and incubated at room temperature for 5 minutes. 160 µL of chloroform (Sigma) was added and the tubes shaken to mix. Tubes were spun at 12,000 x g for 10 minutes at 4°C and the top aqueous phase transferred to a fresh 1.5 mL tube containing 400 µL isopropanol and 1 µL of 20 mg/mL glycogen. Samples were mixed by repeated inversion and incubated at room temperature for 10 minutes. RNA was pelleted by spinning tubes at 12,000 x g for 10 minutes. The supernatant was discarded and the pellets washed with 800 µL 75% ethanol followed by a spin at 8,000 x g for 5 minutes. The supernatant was discarded and the RNA pellet air dried for no more than 10 minutes. The pellet was then resuspended in 10 µL of RNase-free water and incubated at 60°C for 30 minutes to facilitate dissolution. RNA quantity and quality were assessed using an Agilent Bioanalyzer Nano chip.

200 ng of RNA (based on the Agilent quantitation) was diluted to 7 µL with DEPC water, and 1 µL of 20 U/µL RNase inhibitor (Applied Biosystems) was added. To digest contaminating genomic DNA, a mixture of 1 µL of 10X DNaseI buffer, 0.8 µL of RNase-free water, and 0.2 µL of DNaseI enzyme (Ambion) was added and incubated at room temperature for 15 minutes. The reaction was stopped by adding 1 µL of 25 mM EDTA. To purify the RNA, the reaction was transferred to a pre-spun 2 mL PhaseLoc tube containing 189 µL RNase-free water. 200 µL of a 25:24:1 phenol-chloroform-isoamyl alcohol mixture (Sigma) was added and shaken to mix. The tube was spun at 15,000 x g for 5 minutes and the ~200 µL aqueous layer transferred to a new 1.5 mL microcentrifuge tube.
RNA was precipitated by adding 30 µL 3 M sodium acetate, 1 µL 20 mg/mL glycogen, and 600 µL 100% ethanol and vortexing to mix. RNA was pelleted by spinning at 15,000 x g for 5 minutes and the supernatant removed. The pellet was washed with 1 mL of 70% ethanol and spun at 15,000 x g for 1 minute. The supernatant was removed and the pellet air dried for no more than 10 minutes. The pellet was suspended in 6 µL RNase-free water and mixed by pipetting. RNA quantity and quality were assessed using an Agilent Bioanalyzer Nano chip. 50 ng of DNaseI-treated RNA was used for amplification by in vitro transcription using the MessageAmpII kit (Ambion) following the manufacturer's instructions.

500 ng of amplified product was used for paired-end (PE) sequencing library construction by the BC Cancer Agency Genome Sciences Centre Sequencing Group using methods similar to a published protocol [24]. Double-stranded cDNA was synthesized from 500 ng amplified RNA using Superscript Double-Stranded cDNA Synthesis kit (Invitrogen) and random hexamer primers (Invitrogen) at a concentration of 5 µM. The cDNA was sonicated and the sample was run on an 8% polyacrylamide gel. A gel slice corresponding to DNA fragments of 180-220 bp was excised, and the DNA eluted overnight at 4°C in 300 µL of elution buffer (5:1, LoTE buffer (3 mM Tris-HCl, pH 7.5, 0.2 mM EDTA)-7.5 M ammonium acetate). DNA was purified using a Spin-X Filter Tube (Fisher Scientific) followed by ethanol precipitation. The ends of the DNA fragments were repaired and phosphorylated by treatment with T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase in a single reaction. 3' A-overhangs were added by Klenow fragment (3' to 5' exo minus) to facilitate subsequent ligation of Illumina PE adapters, which contain 5' T overhangs.
The adapter-ligated products were purified on QIAquick spin columns (QIAgen) and amplified by 10-15 cycles of PCR using Phusion DNA polymerase and Illumina's PE primer set (Illumina). PCR products were purified on QIAquick MinElute columns (QIAgen), and the DNA quality assessed and quantified using an Agilent Bioanalyzer DNA 1000 series II assay and Nanodrop 7500 spectrophotometer (Nanodrop). DNA was diluted to 10 nM and sequencing clusters were generated on the Illumina cluster station. DNA sequencing was performed using Illumina Genome Analyzers following the manufacturer's instructions.

5.2.4. RNA-seq data analysis

36-50 bp paired-end reads were generated using four lanes of an Illumina Genome Analyzer flowcell and aligned using Maq [28] to a human genome reference supplemented with known exon junction sequences for genes listed in release 52 of the Ensembl Homo sapiens core database (http://www.ensembl.org). Paired-read information was used to infer the presence of fusion transcripts when the two reads of a read pair aligned to different annotated transcripts. Unmapped reads were aligned to a set of complete virus genomes (Table 5-2). Single nucleotide variants (SNVs) were identified using SNVMix, a variant detection program designed to detect SNVs from next-generation sequencing data from tumours using three implementations of a binomial mixture model [58]. SNVs were compared to lists of known variants using custom Perl and shell scripts. Gene expression levels were quantified in reads per kilobase of exon model per million mapped reads (RPKM) [26]. Unsupervised and supervised clustering analysis of gene expression levels was performed using the TM4 analysis suite [29].

5.3. Results

5.3.1. Patient data and source tumour material

DNA was extracted from 53 matched tumour/blood pairs, whole genome amplified if necessary, and used for Sanger sequencing of all 28 exons of EGFR isoform a (ENSG00000146648) and exon 2 of KRAS (ENSG00000133703). Somatic EGFR mutations absent in corresponding DNA from blood were seen in 19 of 49 pre-treatment tumour samples (39%), a 1.5 fold enrichment over the ~26% of lung cancers with EGFR mutations listed in the Catalogue of Somatic Mutations in Cancer (COSMIC, http://www.sanger.ac.uk/genetics/CGP/cosmic/). These mutations consisted of nine exon 19 LREA deletions and ten exon 21 L858R point mutations. 23 of 48 patients with response data had a partial response (PR) to erlotinib, and 17 of these had either an EGFR LREA deletion or L858R point mutation, compared to 2 of the 25 non-responders (p=0.000003). KRAS exon 2 point mutations were present in three pre-treatment tumour samples (6%), 2 from non-responders (both G12V) and 1 without response data (G12C) due to death prior to completion of 1 cycle of erlotinib. Of the four post-treatment samples sequenced, two contained EGFR mutations observed prior to treatment (1 L858R, 1 LREA deletion) and both of these acquired additional EGFR exon 20 T790M point mutations associated with TKI-resistance. While the current model of EGFR and KRAS mutations mediating TKI response and resistance is supported by our data, this model is insufficient to fully explain the treatment outcomes observed in our population because 25% of partial responders lacked EGFR mutations, 8% of patients with progressive disease had EGFR mutations, and no canonical EGFR mutations were observed in patients with stable disease. 1 of 3 post-treatment biopsies taken upon disease progression lacked an EGFR resistance mutation, suggesting that erlotinib resistance may be explained by variants in other genes.
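The association between EGFR mutation status and partial response can be re-derived from the counts given above. The text does not state which test produced p=0.000003, so the following is only a sketch that assumes a one-sided Fisher exact test on the 2x2 table (17 of 23 responders and 2 of 25 non-responders carrying an LREA deletion or L858R mutation). The function name and table layout are illustrative.

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    '''Upper-tail Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Returns the probability of observing a or more counts in the top-left
    cell when row and column totals are fixed (hypergeometric distribution).
    '''
    row1, col1, total = a + b, a + c, a + b + c + d
    denom = comb(total, row1)
    upper = min(row1, col1)
    return sum(
        comb(col1, k) * comb(total - col1, row1 - k)
        for k in range(a, upper + 1)
    ) / denom

# Assumed layout: rows = responders / non-responders,
# columns = EGFR mutant / no canonical EGFR mutation.
p = fisher_one_sided(17, 6, 2, 23)
print(f'one-sided Fisher exact p = {p:.2e}')
```

Under these assumptions the result is on the order of 3 x 10^-6, which is in line with the reported value; a two-sided test or a chi-squared test would give a somewhat different figure.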
To uncover additional genetic features of these lung cancers, we performed transcriptome sequencing of 30 tumour samples from 28 patients (27 pre-treatment, 3 post-treatment): 21 females; 20 of Asian and 8 of Caucasian descent; 22 non-smokers; and 24 adeno- or bronchoalveolar carcinomas, 5 unclassified non-small-cell carcinomas, and 1 initially identified squamous cell carcinoma that was subsequently reclassified as a lymphoepithelioma-like carcinoma. The source materials consisted of 11 core biopsies (2 laser microdissected), 7 bronchoscopies, 7 thoracenteses (6 flow sorted), 3 fine needle aspirates, and 2 surgical resections. These tumours were determined by a pathologist to contain at least 40% tumour cells and the median tumour content for the set was 80%. The amount of total RNA available from 29 of 30 biopsies (Table 5-1) was below the 5-10 µg needed prior to polyA selection for our standard transcriptome sequencing protocol. Therefore, we amplified 50 ng of total RNA using in vitro transcription (MessageAmp II, Ambion) to generate 3-28 µg of polyA-selected aRNA. 500 ng of amplified product was processed using our standard RNA-seq library construction protocol, omitting the polyA selection step.

5.3.2. Summary of sequencing data and variant discovery in lung cancer biopsies

2.1 billion 36-50 bp paired-end sequencing reads were generated from 30 lung cancer biopsy samples (27 pre-treatment, 3 post-treatment) (Figure 5.2). 50.6 Gbp of sequence was successfully aligned to a human genome and exon junction reference (Methods), with 56% of mapped reads aligned to exons, 31% aligned to introns, and 13% aligned to intergenic regions (Figure 5.3). On average, 41±9 million reads representing 1.7±0.4 Gbp of sequence were mapped per sample. Of the 69.1 Mbp of annotated exonic sequence, an average of 29±4 Mbp was covered by at least 1 read and 13±2 Mbp was covered by at least 6 reads.

5.3.3. Addressing end-bias induced by amplification

Coverage of the transcriptome is determined primarily by the expression level of each transcript, as more abundant transcripts generate a proportionally greater number of reads. In the 29 libraries that used RNA amplified by in vitro transcription (IVT), we observed significant 3' bias in the distribution of reads across each transcript compared to 22 libraries constructed from unamplified RNA (Figure 5.4). In the lung cancer libraries constructed from IVT-amplified RNA, 73% of the sequence coverage falls within the last 50% of each transcript's annotated coordinates compared to 54% of the average coverage from 22 libraries constructed from unamplified samples. This skewed coverage biases our variant detection approach, as this region contained 71% of the putative SNVs (pSNVs) uncovered in the lung cancer samples compared to 64% of the pSNVs detected in unamplified samples using the same method (Figure 5.4). While transcripts from standard libraries constructed from sonicated cDNA typically have increased coverage of their 3' ends [53], the additional bias observed in IVT-amplified libraries is likely due to the polyA priming method employed for amplification and to the dependence of amplified fragment length on enzyme processivity.

We attempted to address this difference in coverage by modifying the primer mix used for RNA amplification. The standard first strand synthesis reaction prior to IVT utilizes an oligo(dT) oligonucleotide to prime synthesis from the ends of polyadenylated transcripts. To facilitate priming within the body of transcripts, we performed IVT reactions using a mixture of standard oligo(dT) primers and non-degenerate primers corresponding to known codon sequences (Full Spectrum MultiStart, System Biosciences, Mountain View, CA).
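The end-bias statistic quoted in this section (the share of coverage falling in the last 50% of a transcript) can be sketched as follows. This is a minimal illustration, assuming per-base read depth is already available for one transcript, ordered 5' to 3'. How per-transcript values were aggregated across a library is not stated in the text and is not modelled here, and the function name and toy profiles are hypothetical.

```python
def three_prime_fraction(depth, tail=0.5):
    '''Fraction of total read depth in the last portion of a transcript.

    depth: per-base coverage listed 5' to 3' along the transcript.
    tail:  fraction of the transcript length counted as the 3' portion.
    '''
    total = sum(depth)
    if total == 0:
        return 0.0
    start = int(len(depth) * (1 - tail))
    return sum(depth[start:]) / total

# Toy profiles (not thesis data): flat coverage vs. coverage rising toward 3'.
flat = [10] * 100
skewed = list(range(1, 101))
flat_fraction = three_prime_fraction(flat)
skewed_fraction = three_prime_fraction(skewed)
print(flat_fraction, round(skewed_fraction, 3))
```

A transcript with even coverage gives 0.5 by construction, so values well above 0.5 indicate the kind of 3' skew described for the IVT-amplified libraries.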
We anticipated that initiation of first strand synthesis at multiple points along each transcript would improve the representation of 5' positions distant from the polyA tail. However, this modification did not result in an even distribution of reads across each transcript and instead resulted in a profile nearly identical to that from the standard polyA-primed IVT libraries with, again, 73% of coverage falling in the last 50% of transcript coordinates (Figure 5.5). This may be due to preferential binding and enzyme recognition of the oligo(dT) primer, non-ideal hybridization conditions for the non-degenerate oligonucleotides, or cross-hybridization of codon-specific sequences reducing their potential for hybridization to the RNA template.

5.3.4. Viral transcripts

Infection with exogenous viruses has been linked to 15-25% of all cancers including some forms of lung cancer [1]. Detection of viral involvement in lung cancer may have significant impact on the clinical management of this disease. For example, vaccines against strains of human papilloma virus that cause cervical cancer have been shown to dramatically reduce the number of precancerous lesions observed in treated populations [54]. In an attempt to uncover viral transcripts in these lung cancers, reads that did not align to our transcriptome reference were aligned to 2.9 Mbp of reference sequences from 92 complete viral genomes (Table 5-2). The median number of reads corresponding to a viral genome was 1, suggesting very little viral transcription in these lung cancers. However, the pre-treatment library from patient 9 had 1,698 reads aligned to genes from two types of Epstein-Barr virus (EBV): Human herpesvirus 4 type 1 (761 reads, NC_007605.1) and Human herpesvirus 4 type 2 (937 reads, NC_009334.1) (Figure 5.6A). This observation suggested the presence of a virus in this patient sample.
To confirm this observation, we treated a section of biopsy material from this patient with a fluorescently-labelled oligonucleotide probe (INFORM EBER probe, Ventana). Confinement of EBV infection to tumour cells was confirmed by in situ hybridization (Figure 5.6B). This tumour was initially identified as the only squamous cell carcinoma in the set; however, this classification was revisited upon discovery of EBV involvement. Prompted by our observation, a second pathology review determined this tumour to be a lymphoepithelioma-like carcinoma, a rare lung cancer subtype associated with EBV infection particularly in Asians and non-smokers. No EGFR mutations were observed in this cancer, and the patient exhibited stable disease when treated with erlotinib.

Expression was localized to three discrete regions comprising 21% of the ~172 kbp EBV genome (Figure 5.6A) containing some of the key genes related to viral infection and tumourigenesis [30]. The region of most substantial coverage (265X maximum coverage, 155 kb-161 kb with diminishing expression from 137 kb-152 kb) encompassed the BamHIA region. Abundantly expressed RNAs from this region such as A73 and RPMS1 are often complexly spliced [55] and were originally identified in nasopharyngeal carcinoma (NPC) [30]. EBV BamHI-A transcripts have since been found to be expressed in all EBV-associated malignancies [55] as well as in peripheral blood of healthy individuals [30, 55]. The protein products for these RNAs and their functions are unknown [30, 55]. The second largest peak (in the 166 kb to 169 kb region, with additional expression from 0 to 6 kb) corresponds to LMP1/LMP2A. LMP1 is the main transforming protein of EBV; it functions as a classic oncogene in rat studies, induces cell-surface adhesion, and up-regulates anti-apoptosis genes such as BCL2. A smaller but distinct peak (96 kb-99 kb) corresponds to EBNA-1. This is a key gene responsible for maintenance of the viral genome.
It is expressed in all virus-infected cells, where it maintains the episomal EBV genome by sequence-specific binding to the viral OriP, and also acts as a transcription factor for itself and other key proteins such as LMP1. The expression pattern of these genes is very similar to that seen in NPC, in which EBV expression is restricted to EBNA1, LMP2A and BamHIA transcripts, with ~20% of tumours also expressing LMP1 [30].

5.3.5. Expression profiling

Counting the number of reads corresponding to an mRNA transcript has been shown to be a quantitative measure of gene expression over five orders of magnitude [26], and there is high correlation between expression values derived from microarrays and those derived from transcriptome sequencing [24, 25, 26]. Therefore, we sought to uncover clinical and molecular subgroups using patterns of gene expression derived from our set of lung tumour transcriptome sequences. To normalize for both transcript length and total number of reads in each library, we quantified transcript expression levels in reads per kilobase of exon model per million mapped reads (RPKM), as pioneered by Mortazavi et al. [26]. Unsupervised clustering of RPKM values from all 30 tumours could not identify distinct gene expression patterns within the set, nor could we differentiate pre- and post-treatment expression profiles. Supervised clustering, however, was able to identify a number of gene sets that could differentiate among known subgroups within the pre-treatment samples using the following classifiers: EGFR/KRAS mutational status, smoking history and response, and histology (Figure 5.7). Expression signatures unique to sex, ethnicity, and response (PR vs. SD/PD and PR/SD vs. PD) were not apparent (p>0.05). Expression levels of eight genes were able to differentiate EGFR mutants, KRAS mutants, and those with no mutations in either gene (Bonferroni adjusted p≤0.00124) (Figure 5.7A).
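The RPKM normalization used for these comparisons divides a read count by both the exon model length (in kilobases) and the library size (in millions of mapped reads). The following sketch shows the arithmetic only; the function name and the toy numbers are illustrative and are not taken from the thesis data.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    '''Reads per kilobase of exon model per million mapped reads.'''
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError('length and library size must be positive')
    # 1e9 = 1e3 (bp per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)

# Toy example: 820 reads on a 2,000 bp exon model
# in a library of 41 million mapped reads.
value = rpkm(820, 2000, 41_000_000)
print(round(value, 2))
```

Because both length and library size appear in the denominator, the same read count yields a lower RPKM for a longer gene or a deeper library, which is what makes values comparable across genes and samples.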
Low-density lipoprotein receptor LDLR (ENSG00000130164) was expressed at a high level in KRAS mutants, while an unnamed gene similar to heterogeneous nuclear ribonucleoprotein A1 (ENSG00000213847) was expressed at higher levels in EGFR mutants. KRAS mutants had consistently lower expression of six genes: pseudogene AC009945.4 (ENSG00000216737), 5S ribosomal RNA (ENSG00000200873), rRNA pseudogene AL365364.19-2 (ENSG00000210729), small nucleolar RNA SNORA75 (ENSG00000212620), beta-defensin 118 precursor DEFB118 (ENSG00000131068) and pseudogene RP11-257K9.3 (ENSG00000220467).

Two smokers with partial response were differentiated from three smokers with progressive disease on the basis of five genes (Bonferroni adjusted p≤0.0307) (Figure 5.7B): CCL19, PTGDS, AP003780.3, Z98749.11, and AC008660.5 (ENSG00000172724, ENSG00000107317, ENSG00000210016, ENSG00000100181, and ENSG00000203776). 12 additional genes were expressed at low levels in smokers who responded compared to other tumours in the set (p≤9.19 × 10^-9): small proline-rich protein 2E SPRR2E (ENSG00000203785), myosin-6 MYH6 (ENSG00000197616), Y RNA (ENSG00000199678), small nucleolar RNA SNORD113 (ENSG00000200367), U6 spliceosomal RNA (ENSG00000202089), pseudogene AC006479.2 (ENSG00000177590), retrotransposed gene Z70227.1 (ENSG00000185095), developmental pluripotency-associated protein DPPA2 (ENSG00000163530), uncharacterized protein C14orf177 (ENSG00000176605), pseudogene RP11-392A19.1 (ENSG00000220026), SNURF-like protein CXorf19 (ENSG00000173954), and pseudogene RP11-345I18.6 (ENSG00000220129).

Expression of CD70 (ENSG00000125726) was found to be increased in the EBV-associated lymphoepithelioma-like carcinoma (Figure 5.7C). Compared to the 26 pre-treatment NSCLCs/adenocarcinomas, this gene was expressed at a very high level in this tumour (16.2 RPKM vs. a median of 0.2 RPKM and a maximum of 4.6 RPKM in the other tumours, p=5.97 × 10^-14) and may be a potential marker of EBV-associated lung cancer.
To test this hypothesis, we plan to assess CD70 expression and the presence of EBV using immunohistochemistry and in situ hybridization respectively in a set of over 600 lung tumours assembled into a tissue microarray. This tissue microarray contains NSCLCs, squamous cell carcinomas, and large cell carcinomas, some of which may be misclassified lymphoepithelioma-like lung cancers. We anticipate a number of these cancers to contain EBV and that a large fraction of these will be strongly positive for CD70 expression. A positive correlation of CD70 with EBV expression may enable refinement of lung cancer diagnoses by helping differentiate otherwise morphologically similar cancers through screening of a simple cell surface marker.

5.3.6. Fusion transcripts

Fusion transcripts have been particularly effective therapeutic targets in the treatment of hematopoietic malignancies [43, 44]. Assessment of gene fusions in solid tumours is salient in light of the recently described EML4-ALK fusion in lung cancer [31], as other fusions may exist that could be targeted to treat this disease. To uncover known and novel fusions present in our set of lung tumours, we used paired-read information to uncover evidence of 142 putative fusion transcripts. None of the fused genes correspond to EML4 or ALK, the partners in a fusion transcript recently described in lung cancer [31]. Of the 200 genes implicated in these fusions, 3 are listed in the Mitelman Database of Chromosome Aberrations in Cancer (http://cgap.nci.nih.gov/Chromosomes/Mitelman), and 17 are listed in COSMIC, suggesting that most observed fusions in our patient tumour population are rare, novel events or possibly technical artifacts.

5.3.7. Mutation detection

From data derived from all 30 biopsies, we detected 432,043 putative single nucleotide variants (pSNVs), of which 344,956 (80%) correspond to 53,744 unique polymorphisms listed in dbSNP or have been detected in the genomes of J.
Craig Venter [32], James Watson [33], an anonymous Yoruban male [34], an anonymous Asian male [35], or individuals being sequenced as part of the 1000 Genomes project (http://www.1000genomes.org). Genotypes were concordant for 95.1% of known SNPs called in both libraries of each pre/post-treatment pair (Patient 19: 6,413 of 6,868; Patient 28: 7,005 of 7,236). The 87,087 novel pSNVs correspond to 53,517 unique genomic positions, of which 1,839 are likely artifacts resulting from mismapping near exon-exon junctions. 13,186 pSNVs impact a codon sequence in 7,090 genes, and 9,149 (69%) of these are predicted to be non-synonymous. 6,181 of the non-synonymous variants are predicted to induce a radical amino acid change, either by reversing residue polarity or dramatically altering residue size. 385 may alter protein length: 362 introduce a stop codon and 23 remove a stop codon. In addition, 9,038 insertions and 3,816 deletions were apparent, which correspond to 4,978 unique genome positions. 2,276 of these correspond to known polymorphisms, and, of the remaining 2,702 novel indels, 404 impact protein coding sequences.

Absent from this list were the EGFR LREA deletions detected by Sanger sequencing. This may be due to the difficulty in mapping relatively short sequences on either side of the 12-15 bp deletion to non-contiguous segments of the human genome reference sequence. However, a simple text-matching search for reads containing any of the possible deletion breakpoints did not uncover any reads that support a deletion. A second possibility is that the inability to detect these deletions is due to a lack of sequence coverage of this region resulting from low expression, compounded by coverage biased towards the 3' end of the transcript, particularly as exon 19 is 3 kb from the annotated end of the EGFR transcript.
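A text-matching search of the kind mentioned above can be sketched as follows. The actual search strings and scripts used are not given in the text, so everything here is an assumption: the junction for each candidate deletion is built by joining the reference bases flanking the removed segment, and reads are scanned for that junction on either strand. The reference, reads, coordinates, and function names are toy values, not EGFR sequence.

```python
COMPLEMENT = str.maketrans('ACGT', 'TGCA')

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def junction(reference, del_start, del_len, flank):
    '''Sequence spanning a deletion: flank bases on each side of the gap.

    del_start is the 0-based index of the first deleted base
    (an assumed convention).
    '''
    left = reference[max(0, del_start - flank):del_start]
    right = reference[del_start + del_len:del_start + del_len + flank]
    return left + right

def reads_supporting(reads, junctions):
    '''Return reads containing any junction sequence on either strand.'''
    probes = set(junctions) | {reverse_complement(j) for j in junctions}
    return [r for r in reads if any(p in r for p in probes)]

# Toy reference with three candidate 6 bp deletions at nearby start positions.
ref = 'ACGTTGCAAGGCTTAACCGGTTAGCATCGA'
juncs = [junction(ref, start, 6, 5) for start in (10, 11, 12)]
reads = [
    'TTGCAAGCCGGTTAG',                   # spans the first junction
    'ACGTTGCAAGGCTT',                    # wild-type, no junction
    reverse_complement('CAAGGCGGTTAG'),  # second junction, reverse strand
]
hits = reads_supporting(reads, juncs)
print(len(hits))
```

Exact matching of this kind misses reads with a sequencing error inside the junction, so an empty result is weaker evidence than it may appear when coverage is already low.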
This hypothesis is supported by the low sequence coverage of exon 19 in the five patients with EGFR LREA deletions (1.9X average coverage, 3.6X maximum, 0X minimum).

To look for possible low-frequency EGFR alleles present prior to treatment and not identified by our automated SNV caller, we performed a manual inspection of read alignments at the positions of known EGFR point mutations, T790M and L858R. In a single non-responder, Patient 57, both of these mutations were supported by a small proportion of reads (2 of 10 support L858R, 2 of 9 support T790M). Re-examination of the Sanger data found a similar ratio of allele frequencies for both of these mutations; however, neither was identified using automated methods. The biopsy from this patient contained >75% tumour cells, suggesting that the resistant allele was present at low frequency prior to treatment. This may explain the observed resistance to erlotinib despite the presence of a TKI-susceptible L858R mutation.

Neither of the post-treatment T790M point mutations detected by Sanger sequencing was detected by our automated SNV caller, likely due to a lack of sequence coverage at this position in both samples. The two reads from Patient 19 contained a mutant allele, while the three reads from Patient 28 provided no evidence of mutation. The corresponding pre-treatment libraries had greater coverage (8X for Patient 19, 15X for Patient 28) and none of these reads supported a T790M mutation existing pre-treatment at a detectable frequency. This difference in read coverage is readily explained by a lower expression of EGFR in the post-treatment samples relative to the corresponding pre-treatment samples (pre>post RPKM: Patient 19 7.9>5.9, Patient 28 11.1>5.4). Generation of additional sequence data for the post-treatment samples would improve mutation detection in genes with lower expression levels.

5.3.8. Validation of novel coding pSNVs

Recent experience validating pSNVs detected in cancer genomes using short-read sequences suggests that 96-97% of candidate somatic pSNVs may be either germline events (8-40%) or sequencing artifacts (58-84%) rather than somatic mutations acquired by the cancer [39, 40]. To assess whether pSNVs observed in the pre-treatment tumours were real and somatic, we sought to verify these variants in DNA from matched tumour and blood using orthogonal methods.

As an initial screen, we attempted to validate mutations that hit the same codon at multiple positions in multiple tumours using PCR and Sanger sequencing. We hypothesized that tumours select for substitution of these amino acids and that a number of DNA sequence mutations within a codon would be tolerated. There were 11 genes with such mutations, contributing 22 unique positions for validation (ANKRD12, ARIH1, CHD7, HLA-DPA1, MAN2A1, MT-ND1, NAE1, NUCB2, REST, SEC62, and SPARC). 2 additional genes had single positions with different nucleic acid substitutions (e.g. C>CT and C>CA), which were also included in the validation set (MANBA, COX2). Finally, positions within three genes were targeted for a variety of reasons: TRIP6 (chr7:100,308,290 T>G) contained the most frequently observed non-synonymous putative mutation (10 of 27 tumours); C4orf15 (chr4:2,212,348 A>G and chr4:2,212,389 T>C) somatic mutations were recently validated by our group in a lobular breast cancer [40]; and MET (6 positions) contained 6 candidate mutations within 18 bp (6/7 amino acids) in a single tumour, suggesting a high frequency of localized mutation. In total, 16 target intervals were identified, amplified by PCR, and Sanger sequenced from the tumour samples in which the mutations were detected by RNA-seq (Table 5-3). Matched blood samples were also sequenced as normal controls.
Of the 32 candidate mutations targeted, 27 were absent in both the tumour and blood DNA (81%, likely artifacts), while 5 were validated as true germline variants (15%, present in both blood and tumour DNA). A single somatic mutation in MT-ND1, NADH dehydrogenase, (chrM:4,161) was seen once in a single tumour. This mutation did not correspond to any of the positions targeted for validation; rather, it was captured by a PCR amplicon designed to validate mutations at chrM:4,262 and chrM:4,263. This somatic mutation was also detected by RNA-seq, confirming that our master list of pSNVs contains real somatic variants, albeit not at positions specifically targeted by this pilot set. The validation rate observed here is similar to that from previous studies: 81% false positives, 15% germline variants, and 3% true somatic mutations.
Recent developments in genome technology have made it possible to capture thousands of selected portions of the genome in a single assay for targeted resequencing [41-45]. Due to the low quantities of DNA required and relatively few targets for validation, we settled on a solution hybrid capture approach (Agilent SureSelect) which utilizes custom RNA probes (“baits”) to isolate corresponding genomic DNA fragments from 500 ng of a standard Illumina genome shotgun sequencing library [42]. All 9,149 novel, coding pSNVs identified from the 30 lung tumour libraries were targeted for bait design, 8,281 from pre-treatment samples and 868 from post-treatment samples (Figure 5.8). 744 of the pre-treatment pSNVs are located in regions with sequence similar to at least four other locations in the genome (i.e. >3 BLAT [57] alignments for the mutation position +/-50 bp).
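The three validation outcomes used above (absent from both tumour and blood, present in both, present in tumour only) can be written as a small decision rule. The following Python sketch is illustrative only: the function name and the boolean inputs are assumptions introduced here, not part of the analysis pipeline, and the tally simply replays the counts reported in the text for the 33 positions assessed (the 32 targeted candidates with a result plus the incidental MT-ND1 position).

```python
from collections import Counter


def classify_validation(in_tumour: bool, in_blood: bool) -> str:
    '''Label a candidate by where the variant allele is seen in the Sanger data.'''
    if in_tumour and not in_blood:
        return 'somatic'
    if in_tumour and in_blood:
        return 'germline'
    if not in_tumour and not in_blood:
        return 'artifact'
    # Blood-only calls are not a category reported in the text; kept for completeness.
    return 'blood_only'


# Counts taken from the text: 27 artifacts, 5 germline variants, 1 somatic mutation.
calls = [(False, False)] * 27 + [(True, True)] * 5 + [(True, False)]
tally = Counter(classify_validation(t, b) for t, b in calls)
total = sum(tally.values())

for label, count in tally.items():
    print(f'{label}: {count}/{total} ({100 * count / total:.1f}%)')
```

With 33 positions as the denominator, the proportions fall close to the 81%, 15% and 3% quoted above.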
Despite the increased likelihood of this subset containing mapping artifacts, we anticipate using validation data from these sites to further refine our SNV detection methods, particularly as this set of pSNVs passed our standard filters for detection of high quality variants. Two overlapping 120 bp candidate baits were designed for each pSNV, one with the variant at position 30+/-1 bp and one with the variant at position 90+/-1 bp (Figure 5.9A). Identical candidate baits resulting from neighbouring or adjacent pSNVs were discarded. 16,960 of the 18,272 submitted baits (92%) passed Agilent’s bait design criteria and were included in the assay. Baits targeting pSNVs with >3 BLAT hits were less likely to pass Agilent’s bait design criteria: 79% of these baits passed (1,180 of 1,486) compared to 94% of those with ≤3 BLAT hits (14,178 of 15,062 baits targeting pre-treatment pSNVs, 1,602 of 1,722 baits targeting post-treatment pSNVs).
As baits targeting pSNVs utilized only ~30% of the probes available in the assay, additional baits were designed to uncover additional SNVs by capturing exons of recurrently mutated genes. For this purpose, we submitted coordinates for 7,634 exons from 272 genes for “optimized” bait tiling by Agilent (Figure 5.9B). 212 genes (5,985 exons) contained pSNVs seen in at least four pre-treatment tumours (Table 5-4) and 18 genes (266 exons) contained at least two pSNVs in any post-treatment tumours (Table 5-5). We also targeted 42 genes (1,383 exons) listed in COSMIC with pSNVs in three pre-treatment tumours (Table 5-4). In total, 30,598 baits were designed: 24,675 targeting genes mutated in at least 4 pre-treatment tumours, 1,010 targeting genes mutated in at least 2 post-treatment tumours, and 4,913 targeting COSMIC genes mutated in 3 pre-treatment tumours. Baits were successfully designed for 98% of the target exons (7,456 of 7,634).
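The two-bait layout described above for each pSNV (Figure 5.9A) can be sketched as simple coordinate arithmetic. This Python example is a minimal illustration under stated assumptions: coordinates are treated as 1-based and inclusive, the +/-1 bp tolerance is ignored, and the function names are hypothetical rather than those of the scripts actually used for bait design.

```python
BAIT_LENGTH = 120
VARIANT_OFFSETS = (30, 90)  # assumed 1-based position of the variant within each bait


def design_baits(chrom, pos):
    '''Return two overlapping (chrom, start, end) baits around a variant position.'''
    baits = []
    for offset in VARIANT_OFFSETS:
        start = pos - (offset - 1)      # variant sits at position `offset` of the bait
        end = start + BAIT_LENGTH - 1   # inclusive end, so the bait spans 120 bp
        baits.append((chrom, start, end))
    return baits


def design_all(variants):
    '''Design baits for every variant, discarding identical baits from nearby variants.'''
    seen = set()
    unique = []
    for chrom, pos in variants:
        for bait in design_baits(chrom, pos):
            if bait not in seen:
                seen.add(bait)
                unique.append(bait)
    return unique


# The TRIP6 position from the text, used purely as an example input.
for bait in design_baits('chr7', 100308290):
    print(bait)
```

Under these assumptions, two variants 60 bp apart on the same chromosome produce one identical bait, which is the kind of duplicate that was discarded before submission.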
The final capture assay contains 47,558 baits (16,960 baits targeting pSNVs, 30,598 baits tiled across exons) targeting ~3.5 Mbp of unique sequence. The nature of the assay requires that all baits are applied to each sample and we therefore anticipate validating each pSNV and sequencing the exons of each recurrently mutated gene in every sample tested. As an initial test of the system, we constructed Illumina whole genome shotgun sequencing (WGSS) libraries from a matched tumour/blood pair from Patient 28. This patient was selected as his pre-treatment tumour harboured a validated EGFR exon 19 LREA deletion and the corresponding post-treatment sample contained an additional EGFR exon 20 T790M resistance mutation that may be present in the pre-treatment sample at a frequency below the detection threshold of Sanger sequencing. Solution capture of each WGSS library will be performed following the manufacturer’s instructions and each catch sequenced using one lane of an Illumina Genome Analyzer flow-cell.
5.4. Discussion
While lung cancer is a highly heterogeneous disease, the treatment-naïve tumours collected prospectively for this study are from a highly selected group of patients with narrowly defined histologies. Clinical selection of patients meeting two of four characteristics (female, non-smoker, Asian descent, adenocarcinoma/BAC) enriched for individuals with tumours harbouring erlotinib-sensitizing EGFR mutations (39% of patients sequenced) and selected against patients with erlotinib-resistant KRAS mutations (6%). This population had a partial response rate similar to that of the EGFR mutation rate (40%); however, EGFR mutational status was not an ideal predictor of response as 25% of responders lacked EGFR mutations and 8% of non-responders had EGFR mutations. This observation is similar to that of the study of gefitinib-treated patients presented in Chapter 2.
In that retrospective study, 10% (3/31) of non-responders had EGFR mutations, also supporting the notion that a small percentage of patients refractory to EGFR tyrosine kinase inhibitors have tumours with EGFR mutations. However, the frequency of EGFR mutations overall was far lower in the set of archival tumour samples studied in Chapter 2 compared to those collected prospectively for Chapter 5 (13 versus 39%). In addition, the proportion of responders with EGFR mutations was far lower in Chapter 2’s unselected patient population compared to the highly selected population studied in Chapter 5 (33 versus 74%). Both groups had similar proportions of histological lung subtypes (87 versus 89% NSCLC/adenocarcinoma) and differed primarily in the proportion of females (59 versus 80%), Asians (44 versus 78%), and non-smokers (31 versus 81%). It appears that these physical phenotypes are associated with a clinical subgroup with increased likelihood of acquiring specific somatic variants such as EGFR mutations. Selection for this subgroup likely accounts for the clear association of EGFR mutations with response in the Chapter 5 study. Conversely, the lack of selection in the Chapter 2 study may have masked this association due to the diversity of backgrounds in the patient population and a lack of statistical power derived from the small number of responders available for study. Overall, both studies identified patients lacking EGFR mutations who responded to EGFR inhibitors and patients with mutations who did not respond. Therefore, there continues to be a need to uncover additional biomarkers predictive of response and to uncover additional drivers of lung cancer.
A unique opportunity to uncover such variants was afforded by the recent development of second generation sequencing technologies. Transcriptome sequencing is a sensitive, quantitative method for assessing the expression, structure, and sequence content of transcribed genes.
Using this method, we have uncovered transcription of Epstein-Barr virus implicating a rare type of lung cancer, defined quantitative gene expression profiles that distinguish clinical subtypes, and identified 9,149 putative somatic single-base pair variants. Given the experience of recent cancer genome sequencing projects [39, 40], a large number of these putative variants are likely false positives or germline variants. Therefore, the next phase of this project is validation of these putative mutations in corresponding tumour and blood DNA (Future Directions). Targeted sequencing of candidate regions of the genome identified by transcriptome sequencing has recently been used to uncover common mutations of FOXL2 in granulosa-cell ovarian cancer [27] and EZH2 in diffuse large B-cell lymphoma [50]. While a recurrent base-pair mutation is not apparent from our transcriptome data, we anticipate that patterns of mutation may emerge from our set of validated somatic mutations.
A fundamental challenge in cancer genomics is the small quantities of material available from clinical tumour biopsy samples. The samples for this study were collected as a condition of enrolment in a clinical trial and therefore were collected using minimally invasive techniques that provide minimal quantities of tissue. The resulting low nucleic acid quantities were increased using commercial amplification techniques, in vitro transcription for RNA (MessageAmp II, Ambion) and Phi29-based whole genome amplification for DNA (Repli-g, QIAgen). These methods introduce amplification biases that are relatively well understood [46] and, as we demonstrate for amplified RNA, can be accurately quantified (Figure 5.4). The use of IVT-amplified RNA clearly biases the distribution of reads towards the 3’ ends of each transcript, decreasing our ability to uncover 5’ SNVs.
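One way to quantify the 3’ bias described above is to normalize each transcript to a common length and ask what share of its coverage falls in the 3’ half, in the spirit of Figure 5.4. The Python sketch below is an assumption-laden illustration, not the method used to produce that figure: it takes a hypothetical list of per-base read depths ordered from the 5’ end to the 3’ end of a single transcript, and the function names are invented for this example.

```python
def coverage_by_position(per_base_coverage, n_bins=100):
    '''Percentage of a transcript's total coverage in each length-normalized bin (5' to 3').'''
    total = sum(per_base_coverage)
    if total == 0:
        return [0.0] * n_bins
    length = len(per_base_coverage)
    bins = [0.0] * n_bins
    for i, depth in enumerate(per_base_coverage):
        # Map base index i onto a bin so that transcripts of any length share one scale.
        bins[min(i * n_bins // length, n_bins - 1)] += depth
    return [100.0 * b / total for b in bins]


def three_prime_share(per_base_coverage):
    '''Percentage of total coverage falling in the 3' half (normalized positions 51-100%).'''
    profile = coverage_by_position(per_base_coverage)
    return sum(profile[50:])


# A made-up transcript whose depth rises towards the 3' end, as expected after IVT.
biased = [1] * 100 + [3] * 100
print(three_prime_share(biased))
```

An unbiased profile would give a value near 50, so values well above 50 indicate the kind of 3’ skew that reduces sensitivity for 5’ SNVs.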
Our validation approach partially addresses this problem by including a SNV discovery component that targets all exons of genes with mutations identified in multiple tumours. There is a possibility of missing mutations found only in the 5’ ends of genes with low expression. This may be addressed by further sequencing of the transcriptome libraries to increase overall coverage or by selectively targeting 5’ exons of genes for sequencing in genomic DNA.
5.5. Future directions
With targeted sequencing data from a single tumour/blood pair in hand, we plan to apply our solution hybrid capture assay to the other samples collected for this study. A high level of target sequence coverage is anticipated from a single lane of an Illumina flowcell. If 0.5 Gbp of mappable sequence is generated (e.g. conservatively, 10 million 50 bp reads), and 72% of reads correspond exactly to baits [42], then we anticipate ~100X coverage of the 3.5 Mbp target space. The latest version of SNVMix can detect 94.96% of known variants with a 2% false discovery rate from a genome sequence with only 10X coverage [58]. Therefore, we plan to assay multiple samples in parallel through use of an indexing and pooling strategy. Briefly, WGSS libraries will be constructed from genomic or Phi29-amplified DNA from the remaining 52 tumour samples and their corresponding matched blood samples. These 104 libraries will each incorporate a different molecular “barcode” that will allow data to be assigned to specific samples based on the use of a unique sequence built into each sequencing read. Indexed libraries will then be combined in equimolar quantities for solution hybrid capture and the resulting catch sequenced using one lane of an Illumina flowcell.
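The coverage estimate above is a short piece of arithmetic that can be reproduced directly. The Python sketch below uses the figures quoted in the text (10 million 50 bp reads, 72% of reads on target, a 3.5 Mbp target space). The pooling calculation is an added assumption for illustration only: it treats the 10X figure quoted for SNVMix as a per-sample minimum and assumes reads are spread evenly across samples and targets, which will not hold exactly in practice.

```python
def expected_coverage(mappable_bp, on_target_fraction, target_bp):
    '''Mean fold-coverage of the target space from one lane of sequence.'''
    return mappable_bp * on_target_fraction / target_bp


def max_pool_size(lane_coverage, min_coverage_per_sample):
    '''Largest number of equally represented samples that still reach the minimum coverage.'''
    return int(lane_coverage // min_coverage_per_sample)


reads = 10_000_000
read_length_bp = 50
mappable_bp = reads * read_length_bp        # 0.5 Gbp of mappable sequence

coverage = expected_coverage(mappable_bp, 0.72, 3_500_000)
pool_size = max_pool_size(coverage, 10)     # 10X minimum is an illustrative assumption

print(f'Expected coverage per lane: ~{coverage:.0f}X')
print(f'Samples per lane at 10X each: {pool_size}')
```

Because the calculation uses mean coverage, the least-covered pSNV rather than the average would set the practical limit on pool size.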
As sequence coverage distribution is highly reproducible between subsequent catches using the same probe library [42], we will use the coverage profile from the initial test to determine the number of samples that could be pooled and still ensure robust genotyping of each pSNV. A primary metric to determine this number will be the actual coverage of each pSNV compared to the minimum coverage necessary to make each genotyping call at that position. As a simplified example, if the lowest coverage pSNV receives ten times more reads than are necessary to make a base-pair call, then up to ten samples could theoretically be sequenced in a single pool. This assumes that the coverage distribution is consistent between experiments, that the coverage will be evenly distributed across each sample in a pool, that the sequencing yield from each flow cell is consistent, and that there is no loss in capture or sequencing efficiency using pooled, indexed libraries. Once validated somatic mutations have been uncovered, we will perform statistical analyses to determine whether any variants 1) are associated with response or resistance to erlotinib, 2) commonly implicate specific molecular pathways, and 3) form integrated signatures of mutation and gene expression that correspond to potential subtypes of lung adenocarcinoma in our patient population.
5.6. Figures
Figure 5.1 Isolation of tumour cells from a complex pleural fluid mixture using flow cytometry
A) Prior to flow sorting, cells isolated from a pleural fluid sample stained with hematoxylin and eosin (H&E, 20X magnification). Clusters of darkly staining cells are the desired tumour cells surrounded by normal mesothelial and reactive immune cells. Estimated ~1% of cells are tumour cells. B) Prior to flow sorting, cells labelled with a Ber-EP4 antibody conjugated to a FITC fluorophore and visualized by fluorescent microscopy (20X magnification). Clusters of tumour cells are brightly labelled.
C) After flow sorting, H&E stain of cells recovered by flow cytometry on the basis of FITC intensity and size. Estimated ~80% of cells are cancer cells (20X magnification).
Figure 5.2 Summary of data generated, sequence mapped, and genes, variants, and fusions detected in each library
Top panel: Average read length (blue) and count of raw (black) and mapped (red) reads from 27 transcriptome sequencing libraries from treatment-naïve lung tumours. During the project, longer read lengths became available on the Illumina Genome Analyzer and resulted in longer average read lengths for later libraries. Second panel: Gbp of sequence aligned to a human genome reference (black) and resulting fraction of the ~65 Mbp annotated exonic sequence with 1X (red) and 6X (blue) coverage. Third panel: Number of genes with detectable expression (>1 read) and count of putative variants that correspond to known SNPs (black) and novel SNVs (red). Bottom panel: Number of candidate fusions supported by paired read information. A legend and average values for each plot are provided to the right.
Figure 5.3 Distribution of RNA-seq reads mapped to exonic, intronic, and intergenic regions
Percentage of mapped reads from each library that correspond to annotated exons (exonic), regions between exons of an annotated gene that may represent novel exons or genomic contamination (intronic), and regions between annotated genes that may represent novel genes, misannotated transcription start sites of known genes, or genomic contamination (intergenic).
Figure 5.4 Distribution of sequence coverage and putative SNVs detected across all expressed transcripts from 41 RNA-seq libraries
Percentage of total sequence coverage normalized by position within each expressed transcript from 29 RNA-seq libraries prepared from RNA amplified by in vitro transcription (IVT) for this lung cancer study (red) and from 22 libraries prepared in the “standard” manner from unamplified RNA for other studies of multiple myeloma, follicular and diffuse large B-cell lymphoma (black). Percentage of total putative SNVs distributed across each transcript normalized for length from 22 standard (blue) and 29 IVT (pink) libraries. IVT libraries have nearly 50% greater coverage of 3’ bases (positions 51-100%), resulting in a greater sensitivity for SNV detection in this region. There is a corresponding decrease in coverage and in the percentage of SNVs detected in 5’ bases (positions 1-50%).
Figure 5.5 Comparison of sequence coverage distribution in libraries constructed from RNA amplified using a standard or modified in vitro transcription primer mix
RNA from two samples, 1 and 2, was amplified by in vitro transcription using two different primer mixes, “Standard” and “Multistart”. Transcriptome libraries were sequenced, from which we plotted the distribution of sequence coverage across each transcript normalized for length. The standard primer mix contains an oligo(dT) primer that initiates amplification from the 3’ end of polyadenylated transcripts, which results in a biased coverage profile. To address the observed end-bias by initiating synthesis from within the body of protein-coding transcripts, we used a commercial primer mix (Multistart) containing non-degenerate oligonucleotides corresponding to known amino acid codon sequences.
This modification did not substantially alter the distribution of sequence coverage across expressed transcripts from either of the RNA samples tested.
Figure 5.6 A) Circos visualization of RNA-seq reads from a lymphoepithelioma-like lung cancer aligned to an EBV genome. B) Confirmation of EBV tumour-specificity by in situ hybridization.
A) Circos visualization [48] of RNA-seq reads aligned to the EBV genome. The outermost black line represents the complete EBV genome from 0 to ~172000 bp. The grey bars of the middle track represent an alignment of all ~100 EBV protein sequences to the genome and the height of the red bars in the inner track represents the sequence coverage of regions of the EBV genome (0-265X coverage). Expression is localized to three discrete regions commonly expressed in viral-associated cancer (red) and important for tumourigenesis and maintenance of the viral genome: BamHIA, LMP1/2A/2B, and EBNA1. B) In situ hybridization confirms the presence of EBV in tumour cells from a fine-needle aspirate of a metastatic nodule. Darkly stained EBV-positive tumour cells in the centre of the image are surrounded by EBV-negative normal reactive cells and fibrous material. 20X magnification.
Figure 5.7 Supervised hierarchical clustering of gene expression profiles uncovers molecular and clinical subtypes of lung cancer
Heatmap diagrams of gene expression values from 27 pre-treatment lung tumour RNA-seq libraries measured in reads per kilobase of exon model per million mapped reads (RPKM, [26]). Rows indicate RPKM values for each gene using the colour scale indicated at the top of each panel. Columns correspond to individual tumour samples, which are clustered into groups indicated by a hierarchical tree diagram above the heatmap. The order of samples differs from panel to panel. Above the hierarchical tree are notations for clinical information unique to each panel.
A) Eight genes are able to differentiate tumours based on mutation status of KRAS (K), EGFR (E), or wildtype (-) for these two genes (Bonferroni adjusted p≤0.00124). B) Combinations of response (PR=partial response, SD=stable disease, PD=progressive disease) and smoking status (NS=non-smokers, S=smokers) can be differentiated on the basis of the first five genes listed (Bonferroni adjusted p≤0.0307). The additional 12 genes listed are specifically down-regulated in smokers who respond to erlotinib (p≤9.19x10^-9). C) Expression of CD70 is much higher in the EBV-associated lymphoepithelioma-like carcinoma than the other adenocarcinomas in the set (16.2 RPKM vs. 0.2 median and max 4.6 RPKM from other tumours, p=5.97x10^-14).
Figure 5.8 Attrition of putative SNVs to select variants for validation
Categorical breakdown of all putative SNVs detected from transcriptome sequencing and selection of specific putative SNVs (pSNVs) for PCR and solution capture validation. Of the 432,043 pSNVs detected, 53,517 unique positions were not recorded in public SNP databases or seen in personal genome sequences published to date. After removal of likely mapping artifacts and non-coding pSNVs, 9,149 non-synonymous pSNVs remained as possible somatic mutations for validation in corresponding pre- and post-treatment tumour and blood DNA. 33 pSNVs were targeted for validation by PCR and sequencing, of which 1 was somatic (present only in tumour DNA), 5 were germline variants (present in both tumour and blood DNA), and 27 were false positives (absent in both tumour and blood DNA). An assay has been designed to validate all remaining non-synonymous pSNVs using a commercial targeted sequencing method (Agilent SureSelect). Two overlapping baits were designed targeting each non-synonymous pSNV (Figure 5.9) and filtered using Agilent’s bait design criteria.
To uncover additional somatic mutations, baits were also designed to tile across exons from genes with non-synonymous pSNVs in multiple tumours. In total, 47,558 baits have been designed and will be applied to each sample, 16,960 targeting pSNVs and 30,598 targeting genes with pSNVs in multiple tumours.
Figure 5.9 Position of baits designed to A) validate putative point mutations detected by RNA-seq and B) discover additional mutations in exons from genes with putative point mutations in at least 3 tumours
A) Two 120 bp baits (red) were designed to target each putative mutation (star) in genomic DNA (black). To maximize the number of 50 bp end-reads that contain the putative mutation, these baits correspond to fragments with the variant positioned within 30 bp of an end. B) Baits were tiled across exons (yellow) of genes with putative mutations in at least 3 tumours using the default settings provided by the manufacturer.
5.7.
Tables
Table 5-1 Tumour content and quantities of total RNA extracted from 30 lung tumour biopsies
Type and number of biopsies | Median ng | Range ng | Median % tumour | Range % tumour
Bronchoscopies (7) | 115 | 90 - 898 | 70 | 45-100
Core needle biopsies (11) | 440 | 100 - 5,396 | 75 | 40-100
Fine needle aspirates (3) | 520 | 180 - 772 | 100 | 90-100
Surgical removal (2) | 1,935 | 340 - 3,530 | 93 | 90-95
Pleural fluid (6 post-sort) | 981 | 160 - 4,992 | 80 | 80
Pleural fluid (1 unsorted) | 60,390 | | 75 |
Table 5-2 Complete viral genomes against which all unmapped transcriptome reads were mapped
NCBI Reference | NCBI Definition
NC_001460.1 | Human adenovirus A, complete genome
NC_004001.2 | Human adenovirus B, complete genome
NC_001405.1 | Human adenovirus C, complete genome
NC_002067.1 | Human adenovirus D, complete genome
NC_003266.2 | Human adenovirus E, complete genome
NC_001454.1 | Human adenovirus F, complete genome
NC_001943.1 | Human astrovirus, complete genome
NC_007455.1 | Human bocavirus, complete genome
NC_002645.1 | Human coronavirus 229E, complete genome
NC_006577.2 | Human coronavirus HKU1, complete genome
NC_005831.2 | Human coronavirus NL63, complete genome
NC_005147.1 | Human coronavirus OC43, complete genome
NC_009887.1 | Human enterovirus 100, complete genome
NC_001612.1 | Human enterovirus A, complete genome
NC_001472.1 | Human enterovirus B, complete genome
NC_001428.1 | Human enterovirus C, complete genome
NC_001430.1 | Human enterovirus D, complete genome
NC_004295.1 | Human erythrovirus V9, complete genome
NC_001736.1 | Human foamy virus, complete genome
NC_001806.1 | Human herpesvirus 1, complete genome
NC_001798.1 | Human herpesvirus 2, complete genome
NC_001348.1 | Human herpesvirus 3, complete genome
NC_007605.1 | Human herpesvirus 4, complete genome
NC_009334.1 | Human herpesvirus 4, complete genome
NC_001347.3 | Human herpesvirus 5 strain AD169, complete genome
NC_006273.2 | Human herpesvirus 5 strain Merlin, complete genome
NC_001664.1 | Human herpesvirus 6A, complete genome
NC_000898.1 | Human herpesvirus 6B, complete genome
NC_001716.2 | Human herpesvirus 7, complete genome
NC_003409.1 | Human herpesvirus 8 type M, complete genome
NC_009333.1 | Human herpesvirus 8, complete genome
NC_001802.1 | Human immunodeficiency virus 1, complete genome
NC_001722.1 | Human immunodeficiency virus 2, complete genome
NC_004148.2 | Human metapneumovirus, complete genome
NC_001356.1 | Human papillomavirus - 1, complete genome
NC_001352.1 | Human papillomavirus - 2, complete genome
NC_001531.1 | Human papillomavirus - 5, complete genome
NC_001526.1 | Human papillomavirus - 16, complete genome
NC_001357.1 | Human papillomavirus - 18, complete genome
NC_001676.1 | Human papillomavirus - 54, complete genome
NC_001694.1 | Human papillomavirus - 61, complete genome
NC_004761.1 | Human papillomavirus RTRX7, complete genome
NC_001457.1 | Human papillomavirus type 4, complete genome
NC_001355.1 | Human papillomavirus type 6b, complete genome
NC_001595.1 | Human papillomavirus type 7, complete genome
NC_001596.1 | Human papillomavirus type 9, complete genome
NC_001576.1 | Human papillomavirus type 10, complete genome
NC_001683.1 | Human papillomavirus type 24, complete genome
NC_001583.1 | Human papillomavirus type 26, complete genome
NC_001586.1 | Human papillomavirus type 32, complete genome
NC_001587.1 | Human papillomavirus type 34, complete genome
NC_001354.1 | Human papillomavirus type 41, complete genome
NC_001690.1 | Human papillomavirus type 48, complete genome
NC_001591.1 | Human papillomavirus type 49, complete genome
NC_001691.1 | Human papillomavirus type 50, complete genome
NC_001593.1 | Human papillomavirus type 53, complete genome
NC_001693.1 | Human papillomavirus type 60, complete genome
NC_001458.1 | Human papillomavirus type 63, complete genome
NC_002644.1 | Human papillomavirus type 71, complete genome
NC_010329.1 | Human papillomavirus type 88, complete genome
NC_004104.1 | Human papillomavirus type 90, complete genome
NC_004500.1 | Human papillomavirus type 92, complete genome
NC_005134.2 | Human papillomavirus type 96, complete genome
NC_008189.1 | Human papillomavirus type 101, complete genome
NC_008188.1 | Human papillomavirus type 103, complete genome
NC_009239.1 | Human papillomavirus type 107, complete genome
NC_003461.1 | Human parainfluenza virus 1 strain Washington/1964, complete genome
NC_003443.1 | Human parainfluenza virus 2, complete genome
NC_001796.2 | Human parainfluenza virus 3, complete genome
NC_001897.1 | Human parechovirus, genome
NC_007018.1 | Human parvovirus 4, complete genome
NC_007026.1 | Human picobirnavirus RNA segment 1, complete sequence
NC_007027.1 | Human picobirnavirus RNA segment 2, complete sequence
NC_001781.1 | Human respiratory syncytial virus, complete genome
NC_001617.1 | Human rhinovirus 89, complete genome
NC_001490.1 | Human rhinovirus B, complete genome
NC_009996.1 | Human rhinovirus C, complete genome
NC_007473.1 | Human rotavirus G3 segment 1, complete sequence
NC_007472.1 | Human rotavirus G3 segment 2, complete sequence
NC_007471.1 | Human rotavirus G3 segment 3, complete sequence
NC_007470.1 | Human rotavirus G3 segment 4, complete sequence
NC_007467.1 | Human rotavirus G3 segment 5, complete sequence
NC_007469.1 | Human rotavirus G3 segment 6, complete sequence
NC_007466.1 | Human rotavirus G3 segment 7, complete sequence
NC_007465.1 | Human rotavirus G3 segment 8, complete sequence
NC_007468.1 | Human rotavirus G3 segment 9, complete sequence
NC_007464.1 | Human rotavirus G3 segment 10, complete sequence
NC_007463.1 | Human rotavirus G3 segment 11, complete sequence
NC_001795.1 | Human spumaretrovirus, complete genome
NC_001436.1 | Human T-lymphotropic virus 1, complete genome
NC_001488.1 | Human T-lymphotropic virus 2, complete genome
NC_001870.1 | Simian-Human immunodeficiency virus, complete genome
Table 5-3 33 pSNVs validated by PCR and Sanger sequencing
Gene Tumour/blood pairs Positions targeted Codons targeted Distinct substitutions Tumours with position1 / position2 Target positions Artifacts positions Germline positions Somatic positions
ANKRD12 5 chr18:9246629 chr18:9246631 1 2 5 / 3 2 2
ARIH1 6 chr15:70554273 chr15:70554275 1 2 6 / 4 2 2
CHD7 3 chr8:61856379 chr8:61856381 1 3 3 / 1 2 2
HLA-DPA1 3 chr6:33145544 chr6:33145545 1 2 2 / 1 2 2
MAN2A1 5 chr5:109218837 chr5:109218839 1 2 4 / 1 2 2 (single deletion)
MT-ND1 9 chrM:4262 chrM:4263 chrM:4161 1 3 8 / 1 (1) 2 2 (1)
NAE1 4 chr16:65422257 chr16:65422258 1 2 3 / 1 2 2
NUCB2 4 chr11:1730953 chr11:1730955 1 3 3 / 1 2 1 1 (deletion)
REST 3 chr4:57491514 chr4:57491515 1 2 2 / 1 2 2
SEC62 3 chr3:171193531 chr3:171193533 1 2 2 / 1 2 2
SPARC 3 chr5:151027230 chr5:151027232 1 2 3 / 2 2 2
MANBA 3 chr4:103772429 1 2 2 / 1 1 1
MT-CO2 3 chrM:7594 1 2 2 / 1 1 1
TRIP6 10 chr7:100308290 1 1 10 1 1
C4orf15 2 chr4:2212348 chr4:2212389 2 2 1 / 1 2 2
MET 1 chr7:116127278 chr7:116127283 chr7:116127285 chr7:116127292 chr7:116127294 chr7:116127296 6 6 1 6 6
Total 33 27 5 (1)
Table 5-4 Genes with exons targeted for solution hybrid capture, containing mutations in ≥4 pre-treatment tumours (≥3 tumours for COSMIC genes)
Ensembl ID | # tumours for gene | Types of mutation | Gene Name | Recorded in COSMIC v41
ENSG00000198888 | 16 | 11 | MT-ND1 |
ENSG00000146733 | 16 | 4 | PSPH |
ENSG00000198763 | 14 | 23 | MT-ND2 |
ENSG00000159140 | 14 | 5 | SON |
ENSG00000087077 | 12 | 3 | TRIP6 |
ENSG00000114933 | 11 | 2 | |
ENSG00000127481 | 10 | 8 | UBR4 | X
ENSG00000136153 | 10 | 8 | LMO7 |
ENSG00000087365 | 10 | 4 | SF3B2 |
ENSG00000170759 | 10 | 2 | KIF5B |
ENSG00000139218 | 9 | 9 | SFRS2IP |
ENSG00000054654 | 9 | 8 | SYNE2 |
ENSG00000116539 | 9 | 8 | ASH1L |
ENSG00000115760 | 9 | 7 | BIRC6 | X
ENSG00000169599 | 9 | 3 | NFU1 |
ENSG00000213639 | 9 | 2 | PPP1CB | X
ENSG00000127914 | 8 | 9 | AKAP9 | X
ENSG00000198712 | 8 | 9 | MT-CO2 |
ENSG00000048649 | 8 | 6 | RSF1 |
ENSG00000126777 | 8 | 6 | KTN1 |
ENSG00000198938 | 8 | 6 | MT-CO3 |
ENSG00000079385 | 8 | 4 | CEACAM1 |
ENSG00000054356 | 8 | 2 | PTPRN | X
ENSG00000132334 | 8 | 2 | PTPRE |
ENSG00000172273 | 8 | 2 | |
ENSG00000168447 | 8 | 1 | SCNN1B |
ENSG00000196126 | 8 | 1 | HLA-DRB1 |
ENSG00000055609 | 7 | 9 | MLL3 |
ENSG00000127603 | 7 | 7 | MACF1 |
ENSG00000198886 | 7 | 7 | MT-ND4 |
ENSG00000105877 | 7 | 6 | DNAH11 |
ENSG00000109323 | 7 | 6 | MANBA |
ENSG00000170776 | 7 | 6 | AKAP13 | X
ENSG00000106397 | 7 | 5 | PLOD3 |
ENSG00000172799 | 7 | 4 | |
ENSG00000166233 | 7 | 3 | ARIH1 |
ENSG00000212989 | 7 | 2 | |
ENSG00000136045 | 7 | 1 | PWP1 |
ENSG00000169894 | 6 | 12 | MUC3A,MUC3B |
ENSG00000198727 | 6 | 8 | MT-CYB |
ENSG00000080345 | 6 | 7 | RIF1 | X
ENSG00000115310 | 6 | 7 | RTN4 |
ENSG00000151914 | 6 | 6 | DST |
ENSG00000173193 | 6 | 6 | PARP14 |
ENSG00000186566 | 6 | 6 | GPATCH8 |
ENSG00000011295 | 6 | 5 | TTC19 |
ENSG00000066933 | 6 | 5 | MYO9A |
ENSG00000101745 | 6 | 5 | ANKRD12 |
ENSG00000120913 | 6 | 5 | PDLIM2 |
ENSG00000135250 | 6 | 5 | SRPK2 | X
ENSG00000145901 | 6 | 5 | TNIP1 |
ENSG00000100889 | 6 | 4 | PCK2 |
ENSG00000107099 | 6 | 4 | DOCK8 |
ENSG00000112893 | 6 | 4 | MAN2A1 |
ENSG00000132549 | 6 | 4 | VPS13B | X
ENSG00000157765 | 6 | 4 | SLC34A2 |
ENSG00000181444 | 6 | 3 | ZNF467 |
ENSG00000188747 | 6 | 3 | NOXA1 |
ENSG00000204287 | 6 | 3 | HLA-DRA |
ENSG00000189079 | 6 | 2 | ARID2 |
ENSG00000168818 | 6 | 1 | STX18 |
ENSG00000135837 | 5 | 9 | CEP350 |
ENSG00000065526 | 5 | 7 | SPEN | X
ENSG00000182670 | 5 | 6 | TTC3 |
ENSG00000198804 | 5 | 6 | MT-CO1 |
ENSG00000049323 | 5 | 5 | LTBP1 | X
ENSG00000073614 | 5 | 5 | JARID1A |
ENSG00000075292 | 5 | 5 | ZNF638 |
ENSG00000118058 | 5 | 5 | MLL | X
ENSG00000119778 | 5 | 5 | ATAD2B |
ENSG00000125633 | 5 | 5 | CCDC93 |
ENSG00000138778 | 5 | 5 | CENPE | X
ENSG00000143379 | 5 | 5 | SETDB1 | X
ENSG00000147133 | 5 | 5 | TAF1 | X
ENSG00000148773 | 5 | 5 | MKI67 |
ENSG00000159882 | 5 | 5 | ZNF230 |
ENSG00000171490 | 5 | 5 | RSL1D1 |
ENSG00000174953 | 5 | 5 | DHX36 | X
ENSG00000198695 | 5 | 5 | MT-ND6 |
ENSG00000198744 | 5 | 5 | MT-ATP8 |
ENSG00000103496 | 5 | 4 | STX4 |
ENSG00000111642 | 5 | 4 | CHD4 |
ENSG00000138600 | 5 | 4 | |
ENSG00000139372 | 5 | 4 | TDG |
ENSG00000148516 | 5 | 4 | ZEB1 |
ENSG00000173575 | 5 | 4 | CHD2 |
ENSG00000176915 | 5 | 4 | ANKLE2 |
ENSG00000186153 | 5 | 4 | WWOX |
ENSG00000122257 | 5 | 3 | RBBP6 |
ENSG00000134285 | 5 | 3 | FKBP11 |
ENSG00000146648 | 5 | 3 | EGFR | X
ENSG00000163714 | 5 | 3 | |
ENSG00000168758 | 5 | 3 | SEMA4C |
ENSG00000042493 | 5 | 2 | CAPG |
ENSG00000157637 | 5 | 2 | SLC38A10 |
ENSG00000181982 | 5 | 2 | CCDC149 |
ENSG00000187775 | 5 | 2 | |
ENSG00000216490 | 5 | 2 | IFI30 |
ENSG00000107020 | 5 | 1 | C9orf46 |
ENSG00000120087 | 5 | 1 | HOXB7 |
ENSG00000127415 | 5 | 1 | IDUA |
ENSG00000135480 | 5 | 1 | KRT7 |
ENSG00000136104 | 5 | 1 | RNASEH2B |
ENSG00000157021 | 5 | 1 | FAM92A1,FAM92A2 |
ENSG00000163220 | 5 | 1 | S100A9 |
ENSG00000179912 | 5 | 1 | R3HDM2 |
ENSG00000198815 | 5 | 1 | FOXJ3 |
ENSG00000155657 | 4 | 8 | TTN | X
ENSG00000089737 | 4 | 6 | DDX24 | X
ENSG00000115816 | 4 | 6 | CEBPZ |
ENSG00000164190 | 4 | 6 | NIPBL | X
ENSG00000005483 | 4 | 5 | MLL5 |
ENSG00000049759 | 4 | 5 | NEDD4L |
ENSG00000060491 | 4 | 5 | OGFR |
ENSG00000071054 | 4 | 5 | MAP4K4 | X
ENSG00000073350 | 4 | 5 | LLGL2 |
ENSG00000088340 | 4 | 5 | FER1L4 |
ENSG00000119285 | 4 | 5 | HEATR1 |
ENSG00000144674 | 4 | 5 | GOLGA4 |
ENSG00000165732 | 4 | 5 | DDX21 | X
ENSG00000170004 | 4 | 5 | CHD3 |
ENSG00000173230 | 4 | 5 | GOLGB1 |
ENSG00000174197 | 4 | 5 | |
ENSG00000181555 | 4 | 5 | SETD2 |
ENSG00000198677 | 4 | 5 | TTC37 |
ENSG00000008952 | 4 | 4 | SEC62 |
ENSG00000009413 | 4 | 4 | REV3L | X
ENSG00000064313 | 4 | 4 | TAF2 | X
ENSG00000065243 | 4 | 4 | PKN2 | X
ENSG00000066279 | 4 | 4 | ASPM |
ENSG00000071994 | 4 | 4 | PDCD2 |
ENSG00000073331 | 4 | 4 | ALPK1 | X
ENSG00000084093 | 4 | 4 | REST | X
ENSG00000084676 | 4 | 4 | NCOA1 |
ENSG00000085721 | 4 | 4 | RRN3 |
ENSG00000088970 | 4 | 4 | |
ENSG00000091436 | 4 | 4 | |
ENSG00000095002 | 4 | 4 | MSH2 | X
ENSG00000099940 | 4 | 4 | SNAP29 |
ENSG00000101596 | 4 | 4 | SMCHD1 |
ENSG00000102189 | 4 | 4 | EEA1 |
ENSG00000102893 | 4 | 4 | PHKB |
ENSG00000105373 | 4 | 4 | GLTSCR2 |
ENSG00000114857 | 4 | 4 | NKTR |
ENSG00000116783 | 4 | 4 | TNNI3K,FPGT | X
ENSG00000118197 | 4 | 4 | DDX59 |
ENSG00000119397 | 4 | 4 | CEP110 | X
ENSG00000119487 | 4 | 4 | MAPKAP1 | X
ENSG00000124228 | 4 | 4 | DDX27 | X
ENSG00000127947 | 4 | 4 | PTPN12 | X
ENSG00000128845 | 4 | 4 | DYX1C1 |
ENSG00000131018 | 4 | 4 | C6orf98,SYNE1 |
ENSG00000132305 | 4 | 4 | IMMT |
ENSG00000136169 | 4 | 4 | SETDB2 |
ENSG00000139410 | 4 | 4 | SDSL |
ENSG00000140386 | 4 | 4 | SCAPER |
ENSG00000140497 | 4 | 4 | SCAMP2 |
ENSG00000143669 | 4 | 4 | LYST |
ENSG00000144028 | 4 | 4 | |
ENSG00000144645 | 4 | 4 | OSBPL10 |
ENSG00000146085 | 4 | 4 | MUT |
ENSG00000146918 | 4 | 4 | NCAPG2 | X
ENSG00000153575 | 4 | 4 | TUBGCP5 |
ENSG00000164347 | 4 | 4 | GFM2 |
ENSG00000165219 | 4 | 4 | GAPVD1 |
ENSG00000165632 | 4 | 4 | TAF3 |
ENSG00000166750 | 4 | 4 | SLFN5 |
ENSG00000171456 | 4 | 4 | ASXL1 |
ENSG00000173473 | 4 | 4 | SMARCC1 | X
ENSG00000173692 | 4 | 4 | PSMD1 |
ENSG00000175455 | 4 | 4 | CCDC14 |
ENSG00000185658 | 4 | 4 | BRWD1 |
ENSG00000187098 4 4 MITF ENSG00000187240 4 4 DYNC2H1 ENSG00000197102 4 4 DYNC1H1 ENSG00000197324 4 4 LRP10 ENSG00000198707 4 4 CEP290 ENSG00000198840 4 4 MT-ND3 ENSG00000198862 4 4 ENSG00000204764 4 4 RANBP17 ENSG00000000971 4 3 CFH X ENSG00000003056 4 3 M6PR ENSG00000047634 4 3 SCML1 X ENSG00000066651 4 3 TRMT11 ENSG00000070081 4 3 NUCB2 ENSG00000070814 4 3 TCOF1 ENSG00000092208 4 3 SIP1 ENSG00000101654 4 3 RNMT ENSG00000103335 4 3 FAM38A ENSG00000104361 4 3 ENSG00000107581 4 3 EIF3A ENSG00000110002 4 3 ENSG00000115267 4 3 IFIH1 ENSG00000117000 4 3 RLF ENSG00000117523 4 3 BAT2D1 ENSG00000131503 4 3 ANKHD1,EIF4EBP3 X ENSG00000132950 4 3 ZMYM5 ENSG00000133027 4 3 PEMT ENSG00000135749 4 3 PCNXL2 ENSG00000135968 4 3 GCC2 Ensembl ID # tumours for gene Types of mutation Gene Name Recorded in COSMIC v41 ENSG00000143776 4 3 CDC42BPA X ENSG00000150630 4 3 VEGFC ENSG00000157399 4 3 ARSE ENSG00000162994 4 3 C2orf63 ENSG00000164597 4 3 COG5 ENSG00000164828 4 3 UNC84A ENSG00000168970 4 3 JMJD7,PLA2G4B ENSG00000175221 4 3 MED16 ENSG00000179889 4 3 PDXDC1 ENSG00000188170 4 3 HBD ENSG00000204261 4 3 PSMB9 ENSG00000007384 4 2 RHBDF1 ENSG00000058272 4 2 PPP1R12A X ENSG00000113810 4 2 SMC4 ENSG00000116691 4 2 ENSG00000117335 4 2 CD46 ENSG00000119318 4 2 RAD23B ENSG00000143643 4 2 TTC13 ENSG00000159593 4 2 NAE1 ENSG00000161813 4 2 LARP4 ENSG00000168564 4 2 CDKN2AIP ENSG00000189042 4 2 ZNF567 ENSG00000197498 4 2 BXDC1 ENSG00000198589 4 2 LRBA ENSG00000205002 4 2 ENSG00000066455 4 1 GOLGA5 X ENSG00000103351 4 1 CLUAP1 ENSG00000110801 4 1 PSMD9 ENSG00000119396 4 1 RAB14 ENSG00000125686 4 1 MED1 X ENSG00000131400 4 1 NAPSA ENSG00000155957 4 1 TMBIM4 ENSG00000179820 4 1 MYADM ENSG00000197462 4 1 ENSG00000213088 4 1 DARC X ENSG00000105976 3 6 MET X ENSG00000110841 3 5 PPFIBP1 X ENSG00000038382 3 4 TRIO X ENSG00000039650 3 4 PNKP X ENSG00000070061 3 4 IKBKAP X ENSG00000119684 3 4 MLH3 X ENSG00000120868 3 4 APAF1 X 182 Ensembl ID # tumours for gene Types of mutation Gene Name Recorded in 
COSMIC v41 ENSG00000128829 3 4 EIF2AK4 X ENSG00000177200 3 4 CHD9 X ENSG00000198231 3 4 DDX42 X ENSG00000198399 3 4 ITSN2 X ENSG00000006468 3 3 ETV1 X ENSG00000010292 3 3 NCAPD2 X ENSG00000012983 3 3 MAP4K5 X ENSG00000066777 3 3 ARFGEF1 X ENSG00000070018 3 3 LRP6 X ENSG00000079739 3 3 PGM1 X ENSG00000081237 3 3 PTPRC X ENSG00000086758 3 3 HUWE1 X ENSG00000099956 3 3 SMARCB1 X ENSG00000100644 3 3 HIF1A X ENSG00000100815 3 3 TRIP11 X ENSG00000100888 3 3 CHD8 X ENSG00000110713 3 3 NUP98 X ENSG00000113163 3 3 COL4A3BP X ENSG00000117139 3 3 JARID1B X ENSG00000131626 3 3 PPFIA1 X ENSG00000132466 3 3 ANKRD17 X ENSG00000135090 3 3 TAOK3 X Ensembl ID # tumours for gene Types of mutation Gene Name Recorded in COSMIC v41 ENSG00000135541 3 3 AHI1 X ENSG00000137497 3 3 NUMA1 X ENSG00000138032 3 3 PPM1B X ENSG00000138764 3 3 CCNG2 X ENSG00000141510 3 3 TP53 X ENSG00000142949 3 3 PTPRF X ENSG00000155304 3 3 HSPA13 X ENSG00000156256 3 3 USP16 X ENSG00000157212 3 3 PAXIP1 X ENSG00000163029 3 3 SMC6 X ENSG00000173482 3 3 PTPRM X ENSG00000174243 3 3 DDX23 X ENSG00000175054 3 3 ATR X ENSG00000047936 3 2 ROS1 X ENSG00000103342 3 2 GSPT1 X ENSG00000180900 3 2 SCRIB X ENSG00000188042 3 2 ARL4C X ENSG00000054118 3 1 THRAP3 X ENSG00000060237 3 1 WNK1 X ENSG00000141562 3 1 NARF X ENSG00000147364 3 1 FBXO25 X 183 Table 5-5 Genes containing at least 2 types of mutation exclusively in post-treatment tumours. Ensembl ID # tumours with mutation in gene Types of mutation Gene Name COSMIC v41 ENSG00000121966 2 3 CXCR4 X ENSG00000161011 2 2 SQSTM1 ENSG00000138182 2 2 ENSG00000133114 2 2 KIAA1704 ENSG00000114867 2 2 EIF4G1 ENSG00000138002 1 3 IFT172 ENSG00000175197 1 2 DDIT3 X ENSG00000170476 1 2 ENSG00000169550 1 2 MUC15 ENSG00000166165 1 2 CKB ENSG00000137106 1 2 GRHPR ENSG00000135972 1 2 MRPS9 ENSG00000135912 1 2 TTLL4 ENSG00000123240 1 2 OPTN ENSG00000113580 1 2 NR3C1 ENSG00000108439 1 2 PNPO ENSG00000004779 1 2 NDUFAB1 184 5.8. Bibliography 1. 
Sun S, Schiller JH, Gazdar AF: Lung cancer in never smokers--a different disease. Nat Rev Cancer 2007, 7(10):778-790.
2. Salgia R, Skarin AT: Molecular abnormalities in lung cancer. J Clin Oncol 1998, 16(3):1207-1217.
3. Lynch TJ, Bell DW, Sordella R, Gurubhagavatula S, Okimoto RA, Brannigan BW, Harris PL, Haserlat SM, Supko JG, Haluska FG et al: Activating mutations in the epidermal growth factor receptor underlying responsiveness of non-small-cell lung cancer to gefitinib. N Engl J Med 2004, 350:2129-2139.
4. Paez JG, Janne PA, Lee JC, Tracy S, Greulich H, Gabriel S, Herman P, Kaye F, Lindeman N, Boggon TJ et al: EGFR mutations in lung cancer: correlation with clinical response to gefitinib therapy. Science 2004, 304:1497-1500.
5. Pao W, Miller V, Zakowski M, Doherty J, Politi K, Sarkaria I, Singh B, Heelan R, Rusch V, Fulton L et al: EGF receptor gene mutations are common in lung cancers from \"never smokers\" and are associated with sensitivity of tumors to gefitinib and erlotinib. Proc Natl Acad Sci U S A 2004, 101(36):13306-13311.
6. Pao W, Wang TY, Riely GJ, Miller VA, Pan Q, Ladanyi M, Zakowski MF, Heelan RT, Kris MG, Varmus HE: KRAS Mutations and Primary Resistance of Lung Adenocarcinomas to Gefitinib or Erlotinib. PLoS Med 2005, 2:e17.
7. Taron M, Ichinose Y, Rosell R, Mok T, Massuti B, Zamora L, Mate JL, Manegold C, Ono M, Queralt C et al: Activating mutations in the tyrosine kinase domain of the epidermal growth factor receptor are associated with improved survival in gefitinib-treated chemorefractory lung adenocarcinomas. Clin Cancer Res 2005, 11(16):5878-5885.
8. Pugh TJ, Bebb G, Barclay L, Sutcliffe M, Fee J, Salski C, O'Connor R, Ho C, Murray N, Melosky B et al: Correlations of EGFR mutations and increases in EGFR and HER2 copy number to gefitinib response in a retrospective analysis of lung cancer patients. BMC Cancer 2007, 7:128.
9. Giaccone G, Rodriguez JA: EGFR inhibitors: what have we learned from the treatment of lung cancer?
Nat Clin Pract Oncol 2005, 2(11):554-561.
10. Tsao MS, Sakurada A, Cutz JC, Zhu CQ, Kamel-Reid S, Squire J, Lorimer I, Zhang T, Liu N, Daneshmand M et al: Erlotinib in lung cancer - molecular and clinical predictors of outcome. N Engl J Med 2005, 353(2):133-144.
11. Pao W, Miller VA, Politi KA, Riely GJ, Somwar R, Zakowski MF, Kris MG, Varmus H: Acquired Resistance of Lung Adenocarcinomas to Gefitinib or Erlotinib Is Associated with a Second Mutation in the EGFR Kinase Domain. PLoS Med 2005, 2:e73.
12. Yun CH, Mengwasser KE, Toms AV, Woo MS, Greulich H, Wong KK, Meyerson M, Eck MJ: The T790M mutation in EGFR kinase causes drug resistance by increasing the affinity for ATP. Proc Natl Acad Sci U S A 2008.
13. Cappuzzo F, Gregorc V, Rossi E, Cancellieri A, Magrini E, Paties CT, Ceresoli G, Lombardo L, Bartolini S, Calandri C et al: Gefitinib in pretreated non-small-cell lung cancer (NSCLC): analysis of efficacy and correlation with HER2 and epidermal growth factor receptor expression in locally advanced or metastatic NSCLC. J Clin Oncol 2003, 21(14):2658-2663.
14. Takano T, Ohe Y, Sakamoto H, Tsuta K, Matsuno Y, Tateishi U, Yamamoto S, Nokihara H, Yamamoto N, Sekine I et al: Epidermal growth factor receptor gene mutations and increased copy numbers predict gefitinib sensitivity in patients with recurrent non-small-cell lung cancer. J Clin Oncol 2005, 23(28):6829-6837.
15. Johnson BE, Janne PA: Selecting patients for epidermal growth factor receptor inhibitor treatment: A FISH story or a tale of mutations? J Clin Oncol 2005, 23(28):6813-6816.
16. Shepherd FA, Tsao MS: Unraveling the mystery of prognostic and predictive factors in epidermal growth factor receptor therapy. J Clin Oncol 2006, 24(7):1219-1220; author reply 1220-1211.
17. Davies H, Hunter C, Smith R, Stephens P, Greenman C, Bignell G, Teague J, Butler A, Edkins S, Stevens C et al: Somatic mutations of the protein kinase gene family in human lung cancer. Cancer Res 2005, 65(17):7591-7595.
18.
Ding L, Getz G, Wheeler DA, Mardis ER, McLellan MD, Cibulskis K, Sougnez C, Greulich H, Muzny DM, Morgan MB et al: Somatic mutations affect key pathways in lung adenocarcinoma. Nature 2008, 455(7216):1069-1075.
19. Greenman C, Stephens P, Smith R, Dalgliesh GL, Hunter C, Bignell G, Davies H, Teague J, Butler A, Stevens C et al: Patterns of somatic mutation in human cancer genomes. Nature 2007, 446(7132):153-158.
20. Thomas RK, Baker AC, Debiasi RM, Winckler W, Laframboise T, Lin WM, Wang M, Feng W, Zander T, MacConaill L et al: High-throughput oncogene mutation profiling in human cancer. Nat Genet 2007, 39(3):347-351.
21. Sjoblom T, Jones S, Wood LD, Parsons DW, Lin J, Barber TD, Mandelker D, Leary RJ, Ptak J, Silliman N et al: The consensus coding sequences of human breast and colorectal cancers. Science 2006, 314(5797):268-274.
22. Mardis ER: Anticipating the 1,000 dollar genome. Genome Biol 2006, 7(7):112.
23. Shendure J, Stewart CJ: Cancer genomes on a shoestring budget. N Engl J Med 2009, 360(26):2781-2783.
24. Morin R, Bainbridge M, Fejes A, Hirst M, Krzywinski M, Pugh T, McDonald H, Varhol R, Jones S, Marra M: Profiling the HeLa S3 transcriptome using randomly primed cDNA and massively parallel short-read sequencing. BioTechniques 2008, 45(1):81-94.
25. Sultan M, Schulz MH, Richard H, Magen A, Klingenhoff A, Scherf M, Seifert M, Borodina T, Soldatov A, Parkhomchuk D et al: A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science 2008, 321(5891):956-960.
26. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 2008, 5(7):621-628.
27. Shah SP, Kobel M, Senz J, Morin RD, Clarke BA, Wiegand KC, Leung G, Zayed A, Mehl E, Kalloger SE et al: Mutation of FOXL2 in granulosa-cell tumors of the ovary. N Engl J Med 2009, 360(26):2719-2729.
28.
Li H, Ruan J, Durbin R: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 2008, 18(11):1851-1858.
29. Saeed AI, Sharov V, White J, Li J, Liang W, Bhagabati N, Braisted J, Klapa M, Currier T, Thiagarajan M et al: TM4: a free, open-source system for microarray data management and analysis. BioTechniques 2003, 34(2):374-378.
30. Young LS, Rickinson AB: Epstein-Barr virus: 40 years on. Nat Rev Cancer 2004, 4(10):757-768.
31. Soda M, Choi YL, Enomoto M, Takada S, Yamashita Y, Ishikawa S, Fujiwara S, Watanabe H, Kurashina K, Hatanaka H et al: Identification of the transforming EML4-ALK fusion gene in non-small-cell lung cancer. Nature 2007, 448(7153):561-566.
32. Levy S, Sutton G, Ng PC, Feuk L, Halpern AL, Walenz BP, Axelrod N, Huang J, Kirkness EF, Denisov G et al: The diploid genome sequence of an individual human. PLoS Biol 2007, 5(10):e254.
33. Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L, McGuire A, He W, Chen YJ, Makhijani V, Roth GT et al: The complete genome of an individual by massively parallel DNA sequencing. Nature 2008, 452(7189):872-876.
34. Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR et al: Accurate whole human genome sequencing using reversible terminator chemistry. Nature 2008, 456(7218):53-59.
35. Wang J, Wang W, Li R, Li Y, Tian G, Goodman L, Fan W, Zhang J, Li J, Zhang J et al: The diploid genome sequence of an Asian individual. Nature 2008, 456(7218):60-65.
36. Schmidt EE, Ichimura K, Goike HM, Moshref A, Liu L, Collins VP: Mutational profile of the PTEN gene in primary human astrocytic tumors and cultivated xenografts. J Neuropathol Exp Neurol 1999, 58(11):1170-1183.
37. Duerr EM, Rollbrocker B, Hayashi Y, Peters N, Meyer-Puttlitz B, Louis DN, Schramm J, Wiestler OD, Parsons R, Eng C et al: PTEN mutations in gliomas and glioneuronal tumors. Oncogene 1998, 16(17):2259-2264.
38.
Sos ML, Koker M, Weir BA, Heynck S, Rabinovsky R, Zander T, Seeger JM, Weiss J, Fischer F, Frommolt P et al: PTEN loss contributes to erlotinib resistance in EGFR-mutant lung cancer by activation of Akt and EGFR. Cancer Res 2009, 69(8):3256-3261.
39. Ley TJ, Mardis ER, Ding L, Fulton B, McLellan MD, Chen K, Dooling D, Dunford-Shore BH, McGrath S, Hickenbotham M et al: DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature 2008, 456(7218):66-72.
40. Shah S, Morin R, Khattra J, Prentice L, Pugh TJ, Burleigh A, Delaney A, Gelmon K, Guliany R, Holt RA et al: Mutational evolution of a lobular breast tumour, profiled by whole-transcriptome and whole-genome next generation sequencing. Submitted 2009.
41. Albert TJ, Molla MN, Muzny DM, Nazareth L, Wheeler D, Song X, Richmond TA, Middle CM, Rodesch MJ, Packard CJ et al: Direct selection of human genomic loci by microarray hybridization. Nat Methods 2007, 4(11):903-905.
42. Gnirke A, Melnikov A, Maguire J, Rogov P, LeProust EM, Brockman W, Fennell T, Giannoukos G, Fisher S, Russ C et al: Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nat Biotechnol 2009, 27(2):182-189.
43. Olson M: Enrichment of super-sized resequencing targets from the human genome. Nat Methods 2007, 4(11):891-892.
44. Okou DT, Steinberg KM, Middle C, Cutler DJ, Albert TJ, Zwick ME: Microarray-based genomic selection for high-throughput resequencing. Nat Methods 2007, 4(11):907-909.
45. Porreca GJ, Zhang K, Li JB, Xie B, Austin D, Vassallo SL, LeProust EM, Peck BJ, Emig CJ, Dahl F et al: Multiplex amplification of large sets of human exons. Nat Methods 2007, 4(11):931-936.
46. Pugh TJ, Delaney AD, Farnoud N, Flibotte S, Griffith M, Li HI, Farinha P, Gascoyne RD, Marra MA: Impact of Whole Genome Amplification on Analysis of Copy Number Variants. Nucleic Acids Res 2008, 36(13):e80.
47.
Jones SJ, Laskin JJ, Li Y, Griffith O, Bilenky M, Butterfield Y, Cezard T, Chuah E, Corbett R, Fejes A et al: Complete genomic characterization of an adenocarcinoma of the tongue provides rational therapeutic options. Submitted 2009.
48. Krzywinski MI, Schein JE, Birol I, Connors J, Gascoyne R, Horsman D, Jones SJ, Marra MA: Circos: An information aesthetic for comparative genomics. Genome Res 2009.
49. International Human Genome Sequencing Consortium: Finishing the euchromatic sequence of the human genome. Nature 2004, 431(7011):931-45.
50. Morin RD, Johnson NA, Severson TM, Mungall AJ, An J, Paul JE, Uyar B, Boyle M, Kuchenbauer F, Petriv OI, Humphries RK, Griffith OL, Shah S, Corbett R, Tam A, Varhol R, Zhao Y, Delaney A, Qian H, Birol I, Aparicio S, Schein J, Moore R, Holt R, Horsman DE, Connors JM, Jones S, Hirst M, Gascoyne RD, Marra MA: EZH2 (Y641) is Frequently Mutated in Follicular and Diffuse Large B-cell Lymphomas of Germinal Center Origin. Nature Genetics 2009, submitted.
51. Fullwood MJ, Wei CL, Liu ET, Ruan Y: Next-generation DNA sequencing of paired-end tags (PET) for transcriptome and genome analyses. Genome Res 2009, 19(4):521-532.
52. Campbell PJ, Stephens PJ, Pleasance ED, O'Meara S, Li H, Santarius T, Stebbings LA, Leroy C, Edkins S, Hardy C, Teague JW, Menzies A, Goodhead I, Turner DJ, Clee CM, Quail MA, Cox A, Brown C, Durbin R, Hurles ME, Edwards PA, Bignell GR, Stratton MR, Futreal PA: Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nat Genet 2008, 40(6):722-9.
53. Wang Z, Gerstein M, Snyder M: RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 2009, 10(1):57-63.
54.
Paavonen J, Naud P, Salmerón J, Wheeler CM, Chow SN, Apter D, Kitchener H, Castellsague X, Teixeira JC, Skinner SR, Hedrick J, Jaisamrarn U, Limson G, Garland S, Szarewski A, Romanowski B, Aoki FY, Schwarz TF, Poppe WA, Bosch FX, Jenkins D, Hardt K, Zahaf T, Descamps D, Struyf F, Lehtinen M, Dubin G; HPV PATRICIA Study Group, Greenacre M: Efficacy of human papillomavirus (HPV)-16/18 AS04-adjuvanted vaccine against cervical infection and precancer caused by oncogenic HPV types (PATRICIA): final analysis of a double-blind, randomised study in young women. Lancet 2009, 374(9686):301-14.
55. Chen H, Huang J, Wu FY, Liao G, Hutt-Fletcher L, Hayward SD: Regulation of Expression of the Epstein-Barr Virus BamHI-A Rightward Transcripts. J Virol 2005, 79(3):1724-33.
56. Stratton MR, Campbell PJ, Futreal PA: The cancer genome. Nature 2009, 458(7239):719-724.
57. Kent WJ: BLAT - the BLAST-like alignment tool. Genome Res 2002, 12(4):656-64.
58. Goya R, Sun MGF, Morin RD, Leung G, Ha G, Wiegand K, Senz J, Crisan A, Marra MA, Hirst M, Huntsman D, Murphy KP, Aparicio S, Shah SP: SNVMix: predicting single nucleotide variants from next generation sequencing of tumors. Submitted 2009.

Chapter 6. Discussion

6.1. DNA sequencing efforts of increasing scale are becoming clinically applicable

The rapid advance of genome technology has fundamentally altered how genome research is conducted. In five years of graduate training, I moved from sequencing a few exons of one candidate gene (Chapter 2) to sequencing all exons from a set of eight genes (Chapter 4) to sequencing transcripts of more than 20,000 expressed genes (Chapter 5). Interpretation of genome data has also evolved along with technical advancements.
Early debates in the lung cancer field about the use of EGFR biomarkers to predict TKI response have been tempered by investigation of larger sample sets and the discovery of additional alterations that confer resistance, such as activating KRAS mutations or deletion of PTEN. While scientists continue to debate whether EGFR testing is “ready for prime time” [1, 2], clinical labs have recently begun routine screening of lung cancers for EGFR kinase domain mutations [3]. While these efforts are a good first step towards characterizing cancers at the molecular level, novel mutations in lung cancer may have additional predictive and prognostic value. Identifying clinically relevant variants and facilitating their use in the clinic remains a major challenge of modern cancer genomics.

To fully deliver on the promise of genomics, fundamental discoveries within the cancer genome must be translated into actionable diagnostics or therapies that can positively impact patient care. I predict that routine profiling of cancer samples will be complementary to existing pathology reviews, as histological subtyping and assessment of tumour content are crucial for putting genomic information in a cellular context.

However, the adoption of genome tools to profile clinical cancer specimens faces a number of hurdles. Many laboratory methods are developed using a virtually unlimited supply of high quality nucleic acids derived from homogeneous cell lines or blood samples. Primary cancer biopsies, on the other hand, are often highly limited in tissue quantity, yield lower quality nucleic acids, and contain a large number of different cell types, of which only a fraction may be cancerous. In my thesis work, I have developed methods to address many of these issues and to facilitate the application of genome science to clinical specimens.

Low quality DNA and RNA were a common feature of the tumour specimens used in my research.
In Chapter 2, I present criteria for the qualification of DNA extracted from formalin-fixed paraffin-embedded (FFPE) tissues. Chapter 5 presents a protocol for review of fresh-frozen tissues, qualification of extracted RNA, and construction of sequencing libraries from amplified product.

Isolation of tumour cells from surrounding normal cells has been a recurring challenge in my graduate work. Chapter 2 documents the isolation of lung tumour cells from FFPE tissue sections using laser microdissection, a technique revisited in Chapter 5 for application to fresh-frozen samples. Also in Chapter 5, I developed a flow cytometry assay to harvest tumour cells from up to two litres of pleural effusion containing primarily normal reactive cells.

Characterizing bias induced by amplification of nucleic acids from limited starting quantities is a major focus of my cancer research. Chapter 4 presents a treatment of the biases induced by a DNA amplification method and demonstrates the detection of bona fide copy number variants using amplified material. The work presented in Chapter 5 relies on an RNA amplification method to detect gene mutations and expression levels in lung tumour biopsies and includes a characterization of sequence coverage bias induced by this technique.

6.2. Revolutions in DNA sequencing technology have enabled routine genome sequencing

DNA sequencing capacity has now reached a critical mass. The sequence of the genome, being finite, is becoming completely knowable as the genotype of nearly every base-pair can now be assessed at a reasonable cost. Full knowledge of an individual’s genome variation will soon be a starting point for analysis and not an end goal. DNA sequence data are becoming a universal commodity that can be applied to answer any number of questions in much the same way computer processing power has become generally applicable over the last 20 years.
The challenge will be how best to spend this commodity and how to interpret the resulting flood of information.

The study of cancer genomes has great potential not only to impact the treatment of this disease but also to further our understanding of cell biology. The root causes of each of the hallmarks of cancer are likely to be uncovered as somatic mutations are compared across cancers and pre-neoplastic lesions with distinct phenotypes. Separating somatic from germline mutations will soon be a solved problem, as every common variant will be known [4] and, for rare alleles, there will be the ultimate personal reference sequence for each tumour in the form of normal DNA from unaffected tissue or peripheral blood. As sequence capacity continues to grow into the realm of sequencing single cells, even low frequency variants will be readily detected. This will be important not only in sequencing populations of heterogeneous tumours but also in better understanding genetic mosaicism in normal tissues and the acquisition of somatic mutations in normal cells throughout our lifetime. The challenge clinically will be how to identify those mutations on which a tumour relies, which mutations result in druggable targets, and how to treat resistant clones that exist prior to, or arise during, treatment.

Cancer is a genetically heterogeneous disease. The Catalogue of Somatic Mutations in Cancer (COSMIC, [5]) lists over 750 genes in which somatic point mutations or indels have been detected in lung cancer alone. Fewer than ten of these genes are mutated in more than 10% of tumours, suggesting that lung cancers as a whole can arise due to mutations in a diversity of genes or can tolerate a large number of passenger mutations or both [5, 6]. Common activating mutations impact the oncogenes EGFR (26% of lung cancers) and KRAS (17%), while common inactivating mutations target the tumour suppressors p53 (60%), CDKN2A (15%), RB1 (14%), and STK11 (10%) [5].
In the analysis of lung tumour transcriptomes in Chapter 5, we observed distinct subsets of expression and a rare EBV-associated cancer despite careful clinical selection of patients (female, Asian, non-smokers) and tumour histology. Therefore, sequencing large numbers of tumours will likely be necessary to refine molecular subclasses of morphologically similar cancers.

The total number of possible cancer mutations may not be small, but it will be within the defined space of the finite human genome. Analysis of two cancer genomes sequenced thus far [7, 8] found that these tumours harbour relatively few (10-32) somatic coding mutations likely to drive cancer. Mutations were found in genes not previously associated with cancer but linked to known disease pathways [7, 8], suggesting that comprehensive, unbiased sequencing of tumours is necessary to uncover these mutations rather than a candidate gene approach. Large scale projects such as The Cancer Genome Atlas [9] are beginning to sequence hundreds of tumour genomes with the goal of integrating genome sequence information with clinical, histological, and gene expression information. As large sets of cancer genome data accrue, patterns of somatic mutation will continue to be refined, building on early observations from candidate gene sequencing studies [6, 10-12].

The oncogenic forces that drive some cancers are not restricted to changes of the native DNA sequence itself. Rather, changes in methylation patterns (epigenetics) [13], RNA splicing, regulation and editing [14], microRNA processing [15], viral integration [16], and the mitochondrial genome [17] have all been implicated in oncogenesis, and many of these features are potential targets of modern cancer therapies [13, 16-18]. Therefore, integration of complementary sequencing approaches such as chromatin-immunoprecipitation (ChIP-seq) and transcriptome sequencing (RNA-seq) will further serve to identify mechanisms of oncogenesis.
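
Two ideas from this section, removing germline variation by comparison with a matched normal sample and asking how many tumours carry a somatic change in the same gene, can be illustrated with a small sketch. The Python below is only an illustration under stated assumptions: the variant tuples, sample names, gene labels, the `known_polymorphisms` set, and all function names are hypothetical and are not taken from the analyses in this thesis.

```python
from collections import Counter

# A variant is represented as (chromosome, position, ref, alt, gene).
# All values below are invented for illustration only.

def putative_somatic(tumour_variants, normal_variants, known_polymorphisms=frozenset()):
    """Keep tumour variants seen neither in the matched normal nor in known polymorphisms."""
    germline = set(normal_variants) | set(known_polymorphisms)
    return {v for v in tumour_variants if v not in germline}

def gene_recurrence(somatic_by_tumour):
    """Count, per gene, the number of tumours with at least one putative somatic variant."""
    counts = Counter()
    for variants in somatic_by_tumour.values():
        counts.update({v[4] for v in variants})  # a gene counts once per tumour
    return counts

def recurrent_genes(somatic_by_tumour, min_fraction):
    """Genes altered in at least `min_fraction` of the tumours, sorted by name."""
    n_tumours = len(somatic_by_tumour)
    counts = gene_recurrence(somatic_by_tumour)
    return sorted(g for g, c in counts.items() if c / n_tumours >= min_fraction)

# Hypothetical toy data: two tumours, each with a matched normal sample.
tumour_1 = {("chr1", 100, "A", "G", "GENE_A"), ("chr2", 200, "C", "T", "GENE_B")}
normal_1 = {("chr2", 200, "C", "T", "GENE_B")}   # germline variant shared with tumour_1
tumour_2 = {("chr1", 150, "G", "T", "GENE_A"), ("chr3", 300, "T", "C", "GENE_C")}
normal_2 = set()
known_polymorphisms = {("chr3", 300, "T", "C", "GENE_C")}  # a catalogued common variant

somatic = {
    "tumour_1": putative_somatic(tumour_1, normal_1, known_polymorphisms),
    "tumour_2": putative_somatic(tumour_2, normal_2, known_polymorphisms),
}

print(gene_recurrence(somatic))
print(recurrent_genes(somatic, min_fraction=0.5))
```

In this toy setting only the hypothetical `GENE_A` survives both filters in both tumours. Real pipelines would also need to account for sequencing error, coverage, and tumour content, none of which this sketch attempts to model.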
As sequencing technologies are applied to an increasing diversity of cancers, I predict a gradual redefinition of how cancers are classified and treated. Not since the adoption of the microscope to differentiate cell morphology has cancer pathology undergone such a transformation. In addition to diagnosing a cancer as a “lung adenocarcinoma”, comprehensive molecular information will include “activating mutations of genes x, y, z”, “amplification of oncogene a, fusions of genes b and c” or “expression of virus p, subtypes q and r.” The art of genomic medicine will be interpreting these molecular signatures and formulating effective treatment strategies based on this information.

Many cancer therapies rely on targeting the hallmarks of cancer. Traditional chemotherapies prevent cells from rapidly dividing by cross-linking a cell’s DNA through alkylation or providing cytotoxic molecules for biosynthesis [19]. Radiation therapies induce DNA breaks and point mutations which become particularly deleterious as they accumulate in immortalized tumour cells, resulting in eventual biological collapse. Angiogenesis inhibitors exploit a growing tumour’s dependence on increased blood supply for continued progression [20]. However, cancer is an evolutionary process, and rapidly dividing tumour populations can adapt to external pressures that would otherwise result in the death of normal cells [21]. These treatments alone often do not cure cancer; rather, they destroy the susceptible percentage of cancer cells, leaving behind resistant survivors that can grow to form their own population.
This has been directly observed in cells that become resistant to EGFR tyrosine kinase inhibitors, in which long-term treatment drives the development of a specific resistance mutation, T790M, that reduces the effectiveness of the drug [22, 23]. In the case of a single patient from our study in Chapter 5, an initially undetected T790M resistance mutation present at low frequency prior to treatment may explain that patient’s lack of response to erlotinib. Therefore, to rationally guide effective therapies, even low frequency cancer genome aberrations need to be catalogued before treatment and tracked over time, in addition to detecting new mutations that arise.

To implement routine cancer genome diagnostics, there is a continuing need to minimize invasive collection of tumour tissue and for targeted methods to validate genome-wide findings. Large quantities of tissue will likely not become available to address these problems, so methods of cell isolation and nucleic acid amplification must continue to mature. Ideally, invasive biopsies will be rendered unnecessary with improvements in isolation of tumour cells or DNA from complex samples such as pleural fluid or blood. Amplification of DNA and RNA will likely become routine, and statistical characterization of bias induced by these methods will be necessary for their application.

In the near term, putative somatic variants detected by second generation sequencing methods will need to be validated by orthogonal methods. Even given samples of ideal quality and quantity, false positive rates are very high using these methods, and confirmation using traditional methods is still warranted. Current laboratory techniques such as FISH, Sanger sequencing, and single-variant genotyping will not be discarded; rather, they will continue to guide the refinement of current generation sequence analysis and the development of the next generation of DNA sequencers.

6.3.
Future treatments of cancer will be guided by genome information

We need to begin treating cancer as a genomic disease. The structure and content of oncogenes and tumour suppressors are keys not only to identifying known, treatable molecular features of cancer but also to identifying new candidate drug targets and mechanisms of resistance. Particularly when managing cancers long-term, the potential for acquisition of resistance mutations increases with time, and regular molecular profiling of sequence and structural rearrangements will increase understanding of cancer evolution in response to treatment. Rational selection of cancer therapy based on the predicted effects of observed genome alterations will become a major tool to improve patient survival.

So far, there is a single example of this being attempted with positive results, in which I was fortunate to play a small part [24]. In September 2008, a patient presented at the BC Cancer Agency with lung metastases from a rare adenocarcinoma of the tongue previously treated by surgery and radiation. Despite expression of EGFR, this cancer was resistant to erlotinib and very few routine therapeutic options remained. To help predict possible effective drugs, tumour samples were collected from which genome and transcriptome data were generated. A normal genome sequence was derived from DNA extracted from a peripheral blood sample. Analyses of these data uncovered a PTEN deletion, an indicator of erlotinib resistance, and RET amplification and overexpression, which together appeared to drive tumour progression. While 84 putative mutations were uncovered, none were present in RET or several other drug targets, suggesting that inhibition of these intact proteins may be particularly effective.
A list of seven drugs was presented to the patient’s oncologist and sunitinib (Sutent, Pfizer), a RET inhibitor used to treat kidney, thyroid, and gastrointestinal cancers, was selected as the therapy going forward. After 6 weeks of treatment, there was an approximately 20% decrease in tumour size, and no new nodules had appeared (Figure 6.1). This is a remarkable development because, without genome information, this therapy would not have been considered for this cancer. The patient enjoyed good quality of life as the tumour remained in remission for over 5 months. At this point, the tumour grew again and the therapy was changed to a cocktail of two drugs from the original list, sorafenib and sulindac. The tumour again went into remission, this time for four months. In July 2009, the tumour relapsed and has again begun to grow. To uncover how this tumour has again become resistant and to suggest a new therapeutic strategy, tumour tissue has again been acquired, and its genome and transcriptome are being sequenced for comparison with the pre-treatment sample. The goal is to again use genome information to understand what somatic aberrations are now driving this cancer and what can be done clinically to return this patient to good health. This is the future of medical onco-genomics.

6.4. Future directions

My immediate short-term goal is to carry out the final experiment outlined in Chapter 5 to validate sequence variants identified by RNA-seq. The results of this experiment have the potential not only to refine current variant detection methods but also to allow, for the first time, an observation of the spectrum of transcribed somatic mutations present in a set of highly homogeneous solid tumours. As clinical data continue to become available from the ongoing clinical trial, these data may yield fundamental insights into the relationship between somatic alteration and the clinical course of cancer.
While the population is not large enough to draw conclusive associations between specific variants and drug response, this initial examination of lung cancer transcriptomes may suggest pathway members commonly mutated in this cancer and potential biomarkers predictive of response that could be validated in larger sample sets. Even prior to the validation of somatic variants, expression profiling of these lung cancers has already identified avenues for investigation. The discovery of Epstein-Barr viral (EBV) transcription in a rare lymphoepithelioma-like carcinoma demonstrated the ability of transcriptome sequencing to refine a pathology diagnosis and suggests that other tumours of this type may be misclassified. Expression profiling of this tumour has also uncovered a possible biomarker of EBV involvement, the overexpression of CD70. Therefore, I plan to investigate the expression of EBV and CD70 in a collection of lung tumours to ascertain whether EBV infection is common in lung cancer and whether it is linked to CD70 upregulation. Expression of specific genes was found to correlate with clinical features including EGFR and KRAS mutation, smoking status and erlotinib response. To refine and validate these profiles, I plan to compare these data with existing and emerging gene expression data sets. Comparison with similar tumours should answer whether these profiles are a consistent feature of this highly selected group of tumours, and comparison with other types of lung cancer, squamous, for example, should illustrate whether expression profiling can be used to differentiate lung cancers with specific mutations or increased likelihood of TKI response. I believe that comparison and integration of cancer genome and transcriptome sequences will uncover common patterns of mutation, structural variation, gene expression, and viral transcription that can eventually be used to guide the treatment of cancer. 
Such patterns will not be evident without integrated genome-scale data from hundreds of tumours. However, even our current knowledge of cancer genes is sufficient to interpret the genome sequence of an individual tumour to make an effective therapeutic recommendation in at least one case [24]. Ongoing sequence analysis of routine cancer biopsies will not only benefit patients but will further refine our ability to tie genome information with clinical outcome.
6.5. Figures
Figure 6.1 Computed tomography (CT) images of lung metastases from an adenocarcinoma of the tongue in the months before and after administration of sunitinib, a drug selected to exploit somatic aberrations identified by cancer genome and transcriptome sequencing. Reproduced from [24]. A) October 1st, 2008, one month before sunitinib initiation. Tumour masses with diameters of 22 and 24 mm are identified by arrows (top and bottom respectively). B) October 29th, 2008, baseline before sunitinib initiation on Oct 30th, 2008. Tumour masses have grown by 25% on standard therapy. C) December 9th, 2008, 4 weeks on sunitinib, 2 weeks off drug. Tumour masses have decreased by approximately 20% and no new nodules were observed.
6.6. Bibliography
1. Shepherd FA: Molecular selection of patients for first-line treatment of advanced non-small-cell lung cancer with epidermal growth factor inhibitors: not quite ready for prime time. J Clin Oncol 2008, 26(15):2426-2427. 2. Hirsch FR, Bunn PA, Jr.: EGFR testing in lung cancer is ready for prime time. Lancet Oncol 2009, 10(5):432-433. 3. Smith S: MGH to use genetics to personalize cancer care. In: Boston Globe. Boston, MA; 2009. 4. Ionita-Laza I, Lange C, Laird NM: Estimating the number of unseen variants in the human genome. Proc Natl Acad Sci U S A 2009, 106(13):5008-5013. 5. Futreal PA, Coin L, Marshall M, Down T, Hubbard T, Wooster R, Rahman N, Stratton MR: A census of human cancer genes. Nat Rev Cancer 2004, 4(3):177-183. 6. 
Ding L, Getz G, Wheeler DA, Mardis ER, McLellan MD, Cibulskis K, Sougnez C, Greulich H, Muzny DM, Morgan MB, Fulton L, Fulton RS, Zhang Q, Wendl MC, Lawrence MS, Larson DE, Chen K, Dooling DJ, Sabo A, Hawes AC, Shen H, Jhangiani SN, Lewis LR, Hall O, Zhu Y, Mathew T, Ren Y, Yao J, Scherer SE, Clerc K, Metcalf GA, Ng B, Milosavljevic A, Gonzalez-Garay ML, Osborne JR, Meyer R, Shi X, Tang Y, Koboldt DC, Lin L, Abbott R, Miner TL, Pohl C, Fewell G, Haipek C, Schmidt H, Dunford-Shore BH, Kraja A, Crosby SD, Sawyer CS, Vickery T, Sander S, Robinson J, Winckler W, Baldwin J, Chirieac LR, Dutt A, Fennell T, Hanna M, Johnson BE, Onofrio RC, Thomas RK, Tonon G, Weir BA, Zhao X, Ziaugra L, Zody MC, Giordano T, Orringer MB, Roth JA, Spitz MR, Wistuba, II, Ozenberger B, Good PJ, Chang AC, Beer DG, Watson MA, Ladanyi M, Broderick S, Yoshizawa A, Travis WD, Pao W, Province MA, Weinstock GM, Varmus HE, Gabriel SB, Lander ES, Gibbs RA, Meyerson M, Wilson RK: Somatic mutations affect key pathways in lung adenocarcinoma. Nature 2008, 455(7216):1069-1075. 7. Ley TJ, Mardis ER, Ding L, Fulton B, McLellan MD, Chen K, Dooling D, Dunford-Shore BH, McGrath S, Hickenbotham M, Cook L, Abbott R, Larson DE, Koboldt DC, Pohl C, Smith S, Hawkins A, Abbott S, Locke D, Hillier LW, Miner T, Fulton L, Magrini V, Wylie T, Glasscock J, Conyers J, Sander N, Shi X, Osborne JR, Minx P, Gordon D, Chinwalla A, Zhao Y, Ries RE, Payton JE, Westervelt P, Tomasson MH, Watson M, Baty J, Ivanovich J, Heath S, Shannon WD, Nagarajan R, Walter MJ, Link DC, Graubert TA, DiPersio JF, Wilson RK: DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature 2008, 456(7218):66-72. 8. 
Shah S, Morin R, Khattra J, Prentice L, Pugh TJ, Burleigh A, Delaney A, Gelmon K, Guliany R, Holt RA, Jones SJ, Sun M, Moore R, Teschendorff A, Tse K, Turashivili G, Varhol R, Warren R, Watson P, Zhao Y, Caldas C, Huntsman D, Hirst M, Marra M, Aparicio S: Mutational evolution of a lobular breast tumour, profiled by whole-transcriptome and whole-genome next generation sequencing. Submitted 2009. 9. The Cancer Genome Atlas [http://cancergenome.nih.gov/] 10. Greenman C, Stephens P, Smith R, Dalgliesh GL, Hunter C, Bignell G, Davies H, Teague J, Butler A, Stevens C, Edkins S, O'Meara S, Vastrik I, Schmidt EE, Avis T, Barthorpe S, Bhamra G, Buck G, Choudhury B, Clements J, Cole J, Dicks E, Forbes S, Gray K, Halliday K, Harrison R, Hills K, Hinton J, Jenkinson A, Jones D, Menzies A, Mironenko T, Perry J, Raine K, Richardson D, Shepherd R, Small A, Tofts C, Varian J, Webb T, West S, Widaa S, Yates A, Cahill DP, Louis DN, Goldstraw P, Nicholson AG, Brasseur F, Looijenga L, Weber BL, Chiew YE, DeFazio A, Greaves MF, Green AR, Campbell P, Birney E, Easton DF, Chenevix-Trench G, Tan MH, Khoo SK, Teh BT, Yuen ST, Leung SY, Wooster R, Futreal PA, Stratton MR: Patterns of somatic mutation in human cancer genomes. Nature 2007, 446(7132):153-158. 11. 
Davies H, Hunter C, Smith R, Stephens P, Greenman C, Bignell G, Teague J, Butler A, Edkins S, Stevens C, Parker A, O'Meara S, Avis T, Barthorpe S, Brackenbury L, Buck G, Clements J, Cole J, Dicks E, Edwards K, Forbes S, Gorton M, Gray K, Halliday K, Harrison R, Hills K, Hinton J, Jones D, Kosmidou V, Laman R, Lugg R, Menzies A, Perry J, Petty R, Raine K, Shepherd R, Small A, Solomon H, Stephens Y, Tofts C, Varian J, Webb A, West S, Widaa S, Yates A, Brasseur F, Cooper CS, Flanagan AM, Green A, Knowles M, Leung SY, Looijenga LH, Malkowicz B, Pierotti MA, Teh BT, Yuen ST, Lakhani SR, Easton DF, Weber BL, Goldstraw P, Nicholson AG, Wooster R, Stratton MR, Futreal PA: Somatic mutations of the protein kinase gene family in human lung cancer. Cancer Res 2005, 65(17):7591-7595. 12. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 2008, 455(7216):1061-1068. 13. Egger G, Liang G, Aparicio A, Jones PA: Epigenetics in human disease and prospects for epigenetic therapy. Nature 2004, 429(6990):457-463. 14. Scholzova E, Malik R, Sevcik J, Kleibl Z: RNA regulation and cancer development. Cancer Lett 2007, 246(1-2):12-23. 15. Schmittgen TD: Regulation of microRNA processing in development, differentiation and cancer. J Cell Mol Med 2008, 12(5B):1811-1819. 16. Talbot SJ, Crawford DH: Viruses and tumours--an update. Eur J Cancer 2004, 40(13):1998-2005. 17. Chatterjee A, Mambo E, Sidransky D: Mitochondrial DNA mutations in human cancer. Oncogene 2006, 25(34):4663-4674. 18. Holbrook JA, Neu-Yilik G, Hentze MW, Kulozik AE: Nonsense-mediated decay approaches the clinic. Nat Genet 2004, 36(8):801-808. 19. Fischer DS, Knobf MT, Durivage HJ: The Cancer Chemotherapy Handbook, 4th edn. St. Louis, MO: Mosby-Year Book, Inc.; 1993. 20. Carmeliet P, Jain RK: Angiogenesis in cancer and other diseases. Nature 2000, 407(6801):249-257. 21. Stratton MR, Campbell PJ, Futreal PA: The cancer genome. Nature 2009, 458(7239):719-724. 22. 
Yun CH, Mengwasser KE, Toms AV, Woo MS, Greulich H, Wong KK, Meyerson M, Eck MJ: The T790M mutation in EGFR kinase causes drug resistance by increasing the affinity for ATP. Proc Natl Acad Sci U S A 2008, 105(6):2070-2075. 23. Janne PA: Challenges of detecting EGFR T790M in gefitinib/erlotinib-resistant tumours. Lung Cancer 2008, 60 Suppl 2:S3-9. 24. Jones SJ, Laskin JJ, Li Y, Griffith O, Bilenky M, Butterfield Y, Cezard T, Chuah E, Corbett R, Fejes A, Griffith M, Yee J, Martin MA, Mayo M, Melnyk N, Morin R, Pugh TJ, Severson T, Shah S, Tam A, Terry J, Thiessen N, Varhol R, Zeng T, Zhao Y, Moore R, Huntsman D, Briol I, Hirst M, Holt RA, Marra M: Complete genomic characterization of an adenocarcinoma of the tongue provides rational therapeutic options. Submitted 2009. "@en . "Thesis/Dissertation"@en . "2010-05"@en . "10.14288/1.0068007"@en . "eng"@en . "Medical Genetics"@en . "Vancouver : University of British Columbia Library"@en . "University of British Columbia"@en . "Attribution-NonCommercial-NoDerivatives 4.0 International"@en . "http://creativecommons.org/licenses/by-nc-nd/4.0/"@en . "Graduate"@en . "Analysis of primary human cancers : from single genes to whole transcriptomes"@en . "Text"@en . "http://hdl.handle.net/2429/14710"@en .