Open Collections

UBC Theses and Dissertations

Analysis of primary human cancers : from single genes to whole transcriptomes Pugh, Trevor John 2009

ANALYSIS OF PRIMARY HUMAN CANCERS: FROM SINGLE GENES TO WHOLE TRANSCRIPTOMES

by

TREVOR JOHN PUGH
B.Sc. (Honours), The University of British Columbia, 2004

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

in

THE FACULTY OF GRADUATE STUDIES
(Medical Genetics)

THE UNIVERSITY OF BRITISH COLUMBIA
(Vancouver)

November 2009

© Trevor John Pugh, 2009

Abstract

Cells in the human body contain DNA genomes that encode instructions regulating their biology. Accumulation of somatic DNA sequence alterations such as point mutations and structural rearrangements can disrupt critical genes resulting in malignant cancer phenotypes. Identification of cancer “drivers” is a central goal of cancer genome analysis due to their causation of oncogenesis and potential as diagnostic and therapeutic targets. Analysis of normal polymorphisms can also impact the treatment of cancer by identifying individuals most likely to benefit from specific therapies. To uncover molecular correlates with treatment outcome, my graduate work has focused on applying DNA sequencing technology to clinical cancer patient samples. In an early example of medical oncogenomics, I evaluated mutations and amplifications of a single gene, EGFR, in patient tumour samples and investigated associations with response to an EGFR inhibitor, gefitinib. This study was challenged by limited nucleic acid quantities available from small or microdissected tissue biopsies. Therefore, I next characterized bias induced by a whole genome amplification technique and demonstrated genotype and copy number analysis using amplified material. To investigate the role that normal polymorphisms play in guiding cancer treatment, my third project sought to correlate DNA repair gene polymorphisms with the development of late side effects following radiation therapy for prostate cancer.
Late side effects were associated with variants in three genes, uncovered by sequencing the exons of eight DNA repair genes in patients with varying degrees of radiosensitivity. Advancements in DNA sequencing technologies have enabled a move beyond candidate gene approaches towards gaining sequence and expression information from all expressed genes (i.e. the transcriptome). Utilizing second generation sequencing technology, my final project was a transcriptome analysis of lung tumours prior to treatment with the EGFR inhibitor, erlotinib. I uncovered gene expression profiles specific to clinical subgroups and, in one case, detected expression of the Epstein-Barr virus. The second phase of this project will validate putative somatic mutations identified by transcriptome sequencing and investigate viral involvement in other lung tumours. Genome sequence information is becoming readily extracted from clinical sources and there is great potential to use this information to effectively guide cancer treatment.

Table of contents

Abstract
Table of contents
List of tables
List of figures
Acknowledgements
Co-authorship statement
Chapter 1. Introduction
1.1. Human phenotypes are controlled by cellular genomes
1.2. Variants in genome sequence and structure differentiate human phenotypes
1.3. Cancers arise from accumulation of abnormal somatic variants in critical genes
1.4. Activating mutations and amplifications of a lung cancer oncogene, EGFR, have been associated with response to tyrosine kinase inhibitors
1.5. Molecular studies of patient biopsy samples have been limited by suboptimal tissue quality and quantity
1.6. Sequencing of multiple genes in clinical sample sets is facilitated by high-throughput methods
1.7. Second generation sequencing technologies have enabled whole cancer genome sequencing
1.8. Thesis description
1.9. Figures
1.10. Bibliography
Chapter 2. Correlations of EGFR mutations and increases in EGFR and HER2 copy number to gefitinib response in a retrospective analysis of lung cancer patients
2.1. Introduction
2.2. Materials and methods
2.2.1. Patient population and assessment response
2.2.2. Laser microdissection and DNA extraction
2.2.3. PCR and sequencing of EGFR exons 18-24
2.2.4. Copy number analysis of EGFR and HER2
2.3. Results
2.3.1. Patient population
2.3.2. EGFR tyrosine-kinase domain mutations
2.3.3. EGFR tyrosine-kinase domain polymorphisms
2.3.4. EGFR and HER2 copy number analysis
2.4. Discussion
2.5. Conclusion
2.6. Figures
2.7. Tables
2.8. Bibliography
Chapter 3. Impact of whole genome amplification on analysis of copy number variants
3.1. Introduction
3.2. Materials and methods
3.2.1. Tissue material and DNA extraction
3.2.2. Whole genome amplification
3.2.3. Labelling and hybridization to the Affymetrix 500K array
3.2.4. Sample preparation for NimbleGen 385k CGH array
3.2.5. Genotype and copy number analysis
3.2.6. Sequence analysis of recurrent whole genome amplification-induced artifacts
3.3. Results
3.3.1. Array noise and copy number variation in samples pre- and post-WGA
3.3.2. Copy number variants induced by whole genome amplification
3.3.3. Use of amplified material for pair-wise copy number comparisons
3.3.4. Validation of WGA pair-wise comparisons for copy number detection
3.3.5. Genotype fidelity
3.4. Discussion
3.5. Figures
3.6. Tables
3.7. Bibliography
Chapter 4. Sequence variant discovery in DNA repair genes from radiosensitive and radiotolerant prostate brachytherapy patients
4.1. Introduction
4.2. Materials and methods
4.2.1. Patient selection and toxicity metrics
4.2.2. PCR amplification and sequencing of DNA repair genes
4.2.3. Statistical analyses
4.3. Results
4.3.1. DNA sequencing summary
4.3.2. ATM variants detected by previous studies of radiosensitivity
4.3.3. Using quantity of DNA repair gene variants to predict radiosensitivity
4.3.4. Using specific DNA repair gene variants to predict radiosensitivity
4.3.5. Relationship of DNA repair gene variants with residual gammaH2AX following irradiation
4.4. Discussion
4.5. Figures
4.6. Tables
4.7. Bibliography
Chapter 5. Transcriptome sequencing of treatment-naïve lung cancers from individuals likely to benefit from erlotinib treatment
5.1. Introduction
5.2. Methods
5.2.1. Biopsy collection and processing
5.2.2. DNA extraction and Sanger sequencing
5.2.3. RNA extraction, amplification and Illumina sequencing
5.2.4. RNA-seq data analysis
5.3. Results
5.3.1. Patient data and source tumour material
5.3.2. Summary of sequencing data and variant discovery in lung cancer biopsies
5.3.3. Addressing end-bias induced by amplification
5.3.4. Viral transcripts
5.3.5. Expression profiling
5.3.6. Fusion transcripts
5.3.7. Mutation detection
5.3.8. Validation of novel coding pSNVs
5.4. Discussion
5.5. Future directions
5.6. Figures
5.7. Tables
5.8. Bibliography
Chapter 6. Discussion
6.1. DNA sequencing efforts of increasing scale are becoming clinically applicable
6.2. Revolutions in DNA sequencing technology have enabled routine genome sequencing
6.3. Future treatments of cancer will be guided by genome information
6.4. Future directions
6.5. Figures
6.6. Bibliography

List of tables

Table 2-1 PCR primers for 7 exons of the EGFR tyrosine kinase domain
Table 2-2 Summary of all patient clinical data and molecular status
Table 2-3 EGFR exon 19 deletions/substitution
Table 2-4 EGFR point mutations
Table 2-5 EGFR and HER2 copy number alterations
Table 3-1 Regions of recurrent WGA over-amplification
Table 3-2 Regions of recurrent WGA under-amplification
Table 3-3 Distribution of log2 ratios from comparison of unamplified and amplified samples versus a common reference set of 48 individuals
Table 3-4 Apparent amplifications and deletions detected prior to amplification through comparison with a reference set of 48 individuals
Table 3-5 Distribution of log2 ratios from comparison of two experimental replicates of each sample
Table 3-6 Regions of recurrent WGA under-amplification within chromosome ends
Table 3-7 Apparent copy number differences identified by pair-wise comparisons of all possible combinations of unamplified and amplified samples
Table 3-8 Copy number variants detected by pair-wise comparisons of unamplified and amplified sample sets
Table 3-9 Copy number variants detected in MR families by pair-wise comparisons of unamplified and amplified sample sets (child versus father)
Table 4-1 Modified RTOG scoring system used to generate toxicity scores
Table 4-2 Patient-by-patient radiation dosimetry, gammaH2AX scores, DNA sequence variant counts, toxicity score breakdown, and other data
Table 4-3 PCR primer sequences used to amplify amplicons targeting candidate gene exons for sequencing
Table 4-4 Number of variant sites detected in each DNA repair gene.
Table 4-5 Coding variant genotypes observed in high and low toxicity prostate brachytherapy patients. (A = reference allele, B = non-reference allele)
Table 4-6 Variants associated with residual gamma H2AX levels following irradiation
Table 5-1 Tumour content and quantities of total RNA extracted from 30 lung tumour biopsies
Table 5-2 Complete viral genomes against which all unmapped transcriptome reads were mapped
Table 5-3 33 pSNVs validated by PCR and Sanger sequencing
Table 5-4 Genes with exons targeted for solution hybrid capture, containing mutations in ≥4 pre-treatment tumours (≥3 tumours for COSMIC genes)
Table 5-5 Genes containing at least 2 types of mutation exclusively in post-treatment tumours.

List of figures

Figure 1.1 Parallel pathways of tumourigenesis
Figure 1.2 Representations of a crystal structure of the EGFR kinase domain in complex with erlotinib
Figure 2.1 DNA of varying quality from formalin-fixed paraffin-embedded tissues.
Figure 2.2 Laser microdissection of mixed tumour and normal cell populations
Figure 2.3 EGFR variant detection summary
Figure 2.4 Examples of tumours with increased gene copy number detected by FISH
Figure 3.1 Experimental design
Figure 3.2 Boxplots comparing the spread of log2 ratios in unamplified and amplified samples
Figure 3.3 Apparent CNVs in unamplified and amplified samples
Figure 3.4 Copy number distribution and GC content of WGA-induced CNVs
Figure 3.5 Example of how a pair-wise comparison of amplified material can partially compensate for WGA-induced bias
Figure 4.1 Candidate genes encode proteins directly involved in the detection and repair of damaged DNA and triggering of cell cycle control signalling pathways
Figure 4.2 Toxicity scores, radiation dosimetry, count of DNA variants, and gammaH2AX rank expression from 41 prostate brachytherapy patients
Figure 5.1 Isolation of tumour cells from a complex pleural fluid mixture using flow cytometry
Figure 5.2 Summary of data generated, sequence mapped, and genes, variants, and fusions detected in each library
Figure 5.3 Distribution of RNA-seq reads mapped to exonic, intronic, and intergenic regions
Figure 5.4 Distribution of sequence coverage and putative SNVs detected across all expressed transcripts from 41 RNA-seq libraries
Figure 5.5 Comparison of sequence coverage distribution in libraries constructed from RNA amplified using a standard or modified in vitro transcription primer mix.
Figure 5.6 A) Circos visualization of RNA-seq reads from a lymphoepithelioma-like lung cancer aligned to an EBV genome. B) Confirmation of EBV tumour-specificity by in situ hybridization.
Figure 5.7 Supervised hierarchical clustering of gene expression profiles uncovers molecular and clinical subtypes of lung cancer
Figure 5.8 Attrition of putative SNVs to select variants for validation
Figure 5.9 Position of baits designed to A) validate putative point mutations detected by RNA-seq and B) discover additional mutations in exons from genes with putative point mutations in at least 3 tumours
Figure 6.1 Computed tomography (CT) images of lung metastases from an adenocarcinoma of the tongue in the months before and after administration of sunitinib, a drug selected to exploit somatic aberrations identified by cancer genome and transcriptome sequencing

Acknowledgements

This work could not have come together without the unwavering support, principled guidance, and intellectual enthusiasm of my mentor, Dr. Marco Marra. I have been privileged to learn from an exceptional leader, consummate gentleman, and true academic scholar. He has set a scientific and personal standard to which I aspire. The thoughtful vision and clinical perspective provided by our ardent collaborator, Dr. Janessa Laskin, has challenged me to constantly think and learn outside of the lab. Her earnest, practical approach to science and exceptional ability to unite basic and clinical research has shaped and reshaped my vision of clinical genomics. Thanks also to the additional members of my thesis advisory committee, Drs. Jan Friedman, Rob Holt, and Andre Marziali for introducing me to the world of genomics, for providing healthy doses of academic and ‘real world’ perspective, and for accompanying me on this journey from start to finish.

Thank you to Margaret and Simon Sutcliffe for coaxing me to think across boundaries institutional, translational, and international. Thank you to Lorena Barclay for your unmatched organizational skills and willingness to help at a moment’s notice.
Thanks to Cindy Yang for being an incredible student who probably taught me more than I taught her.

I would also like to thank the staff and scientists of the BC Cancer Agency Genome Sciences Centre (GSC). I could reproduce the staff directory listing every person who has helped me along the way. I grew up here, scientifically speaking, and thank you for the complete immersion in cutting edge science. In particular, Duane Smailus, George Yang, Jeff Stott, Richard Moore, and Julius Halaschek-Wiener have been exceptional mentors beginning from my early days at the GSC. Thanks also to Allen Delaney and Irene Li for innumerable crash courses in bioinformatics and other tools of the trade. Thank you to Robyn Roscoe and Karen Novik for gracefully managing the twists and turns each project has taken. Thanks also to Robin Coope for sharing an infectious scientific enthusiasm and Yongjun Zhao for technical discussions of lab techniques.

Much of what I have learned over the last five years has come from those people I see every day, the members of Marco Marra and Angela (Angie) Brooks-Wilson’s research groups. Thank you Malachi Griffith for conversations laboratory, informatic, and otherwise, Ryan Morin for an endless number of scripts and quips, Noushin Farnoud for being my go-to statistical consultant, Ian Bosdet for mentorship in my early days and for carrying on and improving my work, and Tesa Severson for holding us all together. Thank you to Angie for generously hosting my “short term” stay with her lab group and for modelling exceptional scientific leadership of an incredible team. Thank you to Johanna Schuetz for sharing a brain, Dan Fornika for thoughtful opinions on any topic, Steve Leach for the fierce friendly rivalry, and the rest of the 12 o’clock lunch group for the scintillating daily discussions. Thanks to Claire Hou for ongoing career advice. Thanks also to long-time friends outside of science for reminding me that there is life outside of the lab.

Finally, a big thank you to the Pugh and Gastaldo families. To my Mom and Dad, thank you for imbuing in me a love of science at an early age and for always loving and supporting me in whatever I do. To brothers Kevin and Steven, thank you for always offering cheerful smiles, good times, and insightful discussions of the world at large. To my in-laws Silvano, Jacqueline, Claudia, Mirella, and Milva, thank you for embracing me as one of your own and sharing with me your incredible daughter and sister. Christina, your love and encouragement makes each day a joy. Thank you for showing me what is truly important in life.

Co-authorship statement

The work presented in this thesis is the product of substantial collaboration. Each study is presented as an independent published or publishable unit and individual contributors to each chapter are listed here and at the beginning of each chapter:

Chapter 2
I participated in the study coordination, performed the DNA extraction, qualification, and sequence analysis, and generated drafts of the manuscript. Gwyn Bebb conceived of the study and participated in its design as well as treated patients and identified patients for study. Lorena Barclay, Margaret Sutcliffe, and John Fee reviewed patient samples and performed microdissection. Chris Salski and Doug Horsman carried out the FISH studies. Robert O’Connor served as the reference pathologist and reviewed all patient samples. Cheryl Ho, Nevin Murray, and Barbara Melosky treated patients and identified patients for this study. John English coordinated sample acquisition. Jeurgen Vielkind oversaw the microdissection process. Janessa Laskin and Marco Marra conceived of the study, participated in its design and coordination, and contributed to writing the manuscript.

Chapter 3
I participated in the study coordination, carried out the laboratory work, performed copy number and genotype analyses, and generated drafts of the manuscript. Allen Delaney, Stephane Flibotte, H. Irene Li, and Hong Qian assisted with copy number and genotype analysis. Noushin Farnoud and Malachi Griffith performed statistical sequence content analyses. Pedro Farinha and Randy Gascoyne provided and sectioned tissue samples. Marco Marra conceived of the study, participated in its design and coordination, and contributed to writing the manuscript.

Chapter 4
I participated in the study coordination, carried out a portion of the laboratory work, performed sequence and statistical analyses, and wrote the manuscript. Lorena Barclay and Cindy Yang performed a portion of the laboratory work. Karen Novik provided project management support. Allen Delaney, Martin Kryzwinski, and Dallas Thomas wrote data handling programs and provided bioinformatic support. Alexander Agranovich, Mira Keyes, Michael McKenzie, and W. Jim Morris saw patients and provided clinical data. Peggy Olive provided gammaH2AX measurements for patients and aided in interpretation of the data. Mira Keyes, Marco Marra, and Richard Moore conceived of the study, participated in its design and coordination, and contributed to writing the manuscript.

Chapter 5
I participated in the study coordination, attended biopsies for collection of patient material, processed biopsy tissue for library construction, performed data analysis and interpretation, and wrote the manuscript. Janessa Laskin coordinated the study, treated patients, and participated in the data interpretation. Jennifer Asano, Lorena Barclay, Susanna Chan, and Cindy Yang performed laboratory work. Ian Bosdet, Obi Griffith, Ryan D. Morin, and Sorana Morrissey assisted with data analysis and interpretation. Diana Ionescu was the reference pathologist for the study. Margaret Sutcliffe assisted with study coordination and pathology review. Cheryl Ho, Christopher Lee, Barb Melosky, Nevin Murray, and Sophie Sun treated patients.
Ciaran Keogh, Monty Martin, Kaushik Bhagat, and Helena Odwyer collected ultrasound- and CT-guided needle biopsy specimens. Stephen Lam and Annette McWilliams collected bronchoscopy specimens. The BC Cancer Agency Lab Accessioning group collected all blood samples. The BC Cancer Agency Genome Sciences Centre Sequencing group constructed and sequenced all transcriptome sequencing libraries. Marco Marra conceived of the project, coordinated the study, and contributed to writing the manuscript.

Chapter 1. Introduction

1.1. Human phenotypes are controlled by cellular genomes

A genome is a collection of genetic instructions that dictate cellular development, structure, and maintenance. These instructions take the form of long polymers of double-stranded deoxyribonucleic acid (DNA) in which the order of nucleic acid couplets or base-pairs spells out distinct modular elements such as genes, regulatory elements, and structural motifs. Cells use these modules to create and control molecules necessary for life. These molecules are used to replicate individual cells, to interact with and modify cellular environments, and to form larger tissues. Complex cellular populations can themselves form connections and interdependencies, eventually giving rise to extraordinarily complicated organisms. In the case of the human body, cell populations form distinct organ systems nested within cellular structures connected by complex cell-based wiring and piping. The development and interaction of these myriad cells is dictated by the genome contained within each one.

Determining the nucleic acid sequence of the human genome has long been thought to hold the key to understanding human phenotypes. Beginning in 1990, an international consortium of publicly-funded researchers undertook the sequencing of a pool of individual genomes with the goal of generating a draft human genome reference sequence.
In 1998, a parallel project sequencing a smaller pool of five individuals was begun by a private company, Celera Genomics. Both of these groups published draft sequences in 2001 [1, 2], each representing a haploid consensus sequence of a set of human genomes. In its simplest form, this collection of DNA sequences enabled observation of GC- and repeat-content, CpG island distribution, and gene content across the human genome at a resolution of single nucleotides [1]. However, these drafts did not represent the exact sequence of any one human being but were instead an amalgamation of a group of individuals. This mixture of genetic information led to the subsequent identification of 1.4-2.1 million variant base-pairs in the reference genome sequences [1, 2], and millions of additional sequence variants have since been identified [3-5]. While genome sequences are estimated to be 99.9% identical between any two human beings (excluding structural variants) [6, 7], even small changes in nucleic acid sequence can have a dramatic impact on cell behaviour and resulting organismal phenotypes. For example, many of the phenotypic differences observed between Europeans, Asians and Africans can be explained by differences in less than 0.01% of the genome [8]. As with each human, each genome is likely to be different, and with the reference sequence in hand, a number of large scale projects set out to catalogue these differences in human populations.

1.2. Variants in genome sequence and structure differentiate human phenotypes

Two classes of human genetic variation have been uncovered: single nucleotide polymorphisms (SNPs) and structural variants. SNPs are traditionally defined as single base-pair changes with a minor allele frequency of at least 1% in a population [1].
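The frequency threshold in this definition can be illustrated with a minimal Python sketch that computes a minor allele frequency from diploid genotype calls and applies the 1% cutoff. The function names, the two-character genotype encoding, and the example cohort are hypothetical and chosen only for illustration; they are not taken from any of the cited studies.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """Return the frequency of the second most common allele at one site.

    `genotypes` is an iterable of two-character strings such as "AG",
    one per diploid individual (a hypothetical encoding for this sketch).
    """
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    if len(counts) < 2:
        return 0.0  # monomorphic site: no minor allele observed
    total = sum(counts.values())
    # most_common(2) lists alleles by descending count; index 1 is the minor allele
    return counts.most_common(2)[1][1] / total


def is_snp(genotypes, threshold=0.01):
    """Apply the traditional 'minor allele frequency of at least 1%' rule."""
    return minor_allele_frequency(genotypes) >= threshold


# Hypothetical cohort: 100 individuals (200 chromosomes), 3 of which carry G
cohort = ["AA"] * 97 + ["AG"] * 3
maf = minor_allele_frequency(cohort)
```

In this made-up cohort the G allele is carried on 3 of 200 chromosomes, so the site would meet the 1% threshold; a variant seen on a single chromosome among 200 individuals would not.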
Early efforts focused on cataloguing the nature and frequency of these polymorphisms in large human populations [3, 9], the results of which are recorded in the dbSNP database (http://www.ncbi.nlm.nih.gov/projects/SNP/) at the National Center for Biotechnology Information (NCBI). Even a single base-pair change can have major biological effects and specific SNPs have been linked to human diseases. Cystic fibrosis, haemophilia, and sickle cell anaemia can each arise due to missense variants in single genes, CFTR [10], F8/F9 [11, 12], and beta-globin/HBB [13], respectively. However, many diseases cannot be linked to single variants or even single genes. The majority of inherited or genetic diseases is more complex and likely arises from the combination of many variants spread across several genes [14]. In contrast to single base-pair variants, structural variants can involve thousands of base-pairs of DNA [15, 16]. These large scale variants are subcategorized into segmental duplications or low-copy repeats, inversions, translocations, segmental uniparental disomy and copy number variants (CNVs) consisting of insertions, deletions, and duplications [17]. A CNV present at a frequency of 1% in a population is defined as a copy number polymorphism (CNP) [17]. As of August 2009, 8,410 CNV loci covering ~32% of the genome have been recorded in the online Database of Genomic Variants ([16], http://projects.tcag.ca/variation/) by The Centre for Applied Genomics (TCAG, Toronto, Canada). Many of the effects of CNVs appear to be dosage-related [18, 19]. For example, up to 16 copies of the amylase gene have been observed in individuals from populations with historically high starch diets, and increased gene copy number was correlated with greater protein concentrations in saliva [18].
Currently, genome-wide association studies (GWAS) seek to genotype hundreds of thousands of SNPs and CNPs in large, carefully phenotyped populations with the goal of associating common genetic variants with common diseases [20, 21]. Over the last 3 years, GWAS have been used effectively to associate hundreds of loci with over 80 diseases in thousands of individuals [21]. A disadvantage of this approach is the indirect manner in which causative variants are detected, as many variants may be in linkage disequilibrium with the genotyped variant. As a result, several candidate genes can be identified by even high density genotyping studies [21]. In addition, these genes are often not suspected of being involved with disease, and the contribution of individual variants is low [21]. This has led to difficulty replicating the findings of even well-powered GWAS, although there have been reproducible associations with disease, most notably in type 2 diabetes [22] and Crohn’s disease [23]. To directly pinpoint exact functional variants that underlie disease, base-pair resolution of genomes from tens of thousands of affected and unaffected individuals may be necessary, a prohibitively slow and costly requirement using traditional DNA sequencing techniques. Recent advances in DNA sequencing technologies have reduced the time and financial cost required for whole genome sequencing by several orders of magnitude [24, 25]. In 2007, the first diploid genomes from single individuals were published. Both were from Caucasians of European descent, J. Craig Venter [26] and James Watson [27]. These were soon followed by the sequencing of individuals from two other ethnic groups, Yang Huanming of Han Chinese descent [28] and an anonymous individual from the Yoruban ethnic group of western Africa [29]. Since then, several individual genomes have been sequenced [30, 31], some of them privately through commercial entities such as Knome [32] and Complete Genomics [33].
Several large scale projects are underway with the goal of sequencing the genomes of thousands of individuals (1000 Genomes Project [5], Personal Genomes Project [34]). It has been estimated that all common variants (those with a population frequency of at least 1%) can be detected by sequencing only 350 individual genomes [35]. Of the 18.8 million SNPs mapped to the reference genome assembly and listed in dbSNP, 5.7 million (30%) are based at least in part on 105 genomes from 1000 Genomes Project submissions and 1.9 million (10%) of those are novel (submitted only by 1000 Genomes Project) [36]. Therefore, in relatively short order, we may have knowledge of every common human SNP and a high proportion of rare variants with frequencies of 0.1 to 1% [35]. The next challenge will be to integrate these genomic data sets with carefully curated phenotypic data to extract biologically meaningful information. Well-annotated genome sequences from healthy individuals will serve as an excellent reference to diagnose and treat diseases such as cancer that commonly arise from genome aberrancies.

1.3. Cancers arise from accumulation of abnormal somatic variants in critical genes

In cancer, abnormalities of genome sequence or structure undermine normal expression and behaviour of molecules critical for maintaining cell and tissue homeostasis. These changes take forms similar to those of normal polymorphisms (single base-pair variants, copy number changes, inversions, etc.) as well as more complex structures such as chromosomal translocations. These changes are somatic as evidenced by their absence in normal cells from the same individual and each one initially affects a single cell. Rarely is a single mutation or structural change sufficient to result in malignancy [49], and a number of discrete genetic and biological changes are often necessary to develop cancer [38].
Mutations and structural alterations most often alter genes encoding proteins that regulate cell proliferation, differentiation, and programmed cell death [37]. Most cancers harbour genomic abnormalities that result in the acquisition of many if not all of the six classical “hallmarks of cancer” [38]: evasion of apoptosis, self-sufficiency in growth signals, insensitivity to antigrowth signals, limitless replicative potential, sustained angiogenesis, and tissue invasion and metastasis. Immune system evasion appears to be a “seventh hallmark” [95] as many cancers select for non-immunogenic cell variants and actively suppress immune response [95-96]. These hallmarks are not necessarily acquired in a prescribed order; however, acquisition of one trait facilitates the acquisition of others (Figure 1.1, [38]). Mutations causally implicated in oncogenesis have been termed “driver” mutations and are distinctly different from “passenger” mutations without functional consequence [49]. Identifying driver mutations is a central goal of cancer genome analysis to further our understanding of how cancer hallmarks arise and to suggest potential targets for therapy [37, 49]. The presence of passenger mutations can confound this analysis as it can be difficult to differentiate these mutations from the driver mutations upon which cancers are dependent [49]. Sequencing candidate genes from hundreds of tumours has identified over 1800 genes commonly mutated in cancers [37] and subsequent functional validation has established many of these as true cancer drivers [49]. Somatic point mutations and small insertions and deletions (indels) have great potential to drive cancer and are attractive drug targets due to their impact on protein structure and function. The effect of sequence mutation on a cancer cell depends on the location and type of the mutation. 
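The dependence of a mutation's effect on its type can be sketched in a few lines of Python. The example below translates a reference and an altered codon with the standard genetic code and labels the change, and separately labels an indel by whether its length preserves the reading frame. The function names, the category labels, and the example codons are assumptions made for illustration only; this is not the variant annotation method used in this thesis.

```python
# Standard genetic code, with bases ordered T, C, A, G at each codon position
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_substitution(ref_codon, alt_codon):
    """Label a within-codon DNA substitution by its effect on the protein."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"   # premature stop codon introduced
    if ref_aa == "*":
        return "stop-loss"  # original stop codon removed
    return "missense"       # amino acid substitution


def classify_indel(length_change):
    """Label an indel by whether it preserves the codon reading frame."""
    return "in-frame" if length_change % 3 == 0 else "frameshift"


# Illustrative codon changes (hypothetical examples)
leu_to_arg = classify_substitution("CTG", "CGG")
trp_to_stop = classify_substitution("TGG", "TGA")
silent = classify_substitution("CTG", "CTA")
```

A lookup of this kind captures only the coding consequence of a change; effects on splicing, transcript stability, or transcription factor binding require information beyond the codon itself.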
Mutations that result in amino acid substitutions (missense or non-synonymous mutations) can lead to gain or loss of protein function by altering catalytic residues or disrupting protein structure. Multiple amino acids can be lost or changed by the introduction of a premature stop codon (nonsense mutations) or by indels that shift the protein’s codon reading frame (frameshift mutations). Even non-coding or synonymous point mutations can have an impact on protein expression due to differences in codon usage, nonsense-mediated decay of abnormal transcripts [50], distorted transcription factor binding sites, and, when mutations are located at exon splice sites, skewed exon usage [51]. Somatic structural alterations of the human genome can also drive cancer phenotypes. Accumulation of somatic copy number variants and structural rearrangements is often the result of increased genome instability and can arise due to complex mechanisms mediated by chromosome breakage and rejoining [39]. Loss of cell cycle control accelerates the acquisition of such features, and copy number alterations that support an oncogenic phenotype can rapidly become established in a population of tumour cells. Amplifications that drive cancer commonly involve oncogenes such as ERBB2, MYC, MYCN, MYCL1, EGFR, and AKT2 [39]. On the other hand, deletions often eliminate tumour suppressor genes such as PTEN, RB1, and TP53, thereby removing their regulatory effect on the cell [39]. Detection of specific somatic CNVs in cancer is used clinically to guide cancer treatment. In breast cancer, detection of HER2 amplification or overexpression is necessary for the prescription of the HER2 inhibitor trastuzumab (Herceptin, Genentech) [40] as this drug is ineffective in HER2 negative tumours and there is a risk of cardiac dysfunction and heart failure [41].
Binding of trastuzumab to HER2 primarily stimulates endocytosis of the receptor, thereby removing it from the cell surface and extinguishing receptor-initiated constitutive signalling [42]. Somatic genome rearrangements are characteristic of cancer subtypes, and translocation partner networks are characterized by a few recurrently fused genes, including MLL, BCL6, and ALK [43]. Due in part to the ease of detection by traditional cytogenetic analysis, the most commonly observed rearrangements in cancer are large-scale translocations where part of a chromosomal arm is exchanged (balanced translocation) or replaced (unbalanced translocation) with material from another chromosome [44]. As the resolution of genome technologies increases, smaller-scale events have become evident in a number of tumours including inversions, insertions, deletions, microtranslocations, and untemplated additions [45, 46]. Structural rearrangements can result in new gene constructs through the joining of catalytic and regulatory domains from different genes. As a result, the activity of one protein becomes receptive to the regulatory signals targeting another, and there is strong evidence that several gene rearrangements are early and important steps towards cancer development [43]. Specific translocations are associated with specific phenotypes [43], and while fusion proteins have historically been associated with hematological malignancies [44], transformative fusions have recently been described in solid tumours [43, 47]. Induction of fusion constructs in animal models gives rise to cancers similar to those observed in human patients, and silencing of fusion transcripts in vitro reduces cell proliferation and differentiation [43]. Targeting of fusion proteins for therapy has been highly effective in reducing tumour burden [43].
One of the first successful targeted therapies for leukemia, imatinib, was designed to inhibit the kinase domain of the BCR-ABL fusion protein and has revolutionized how this disease is treated [48]. Specific cancers have been linked to a relatively small set of cancer drivers, and there are striking examples of single base-pair positions recurrently mutated in hundreds of tumours of the same type [37]. My early thesis work sought to investigate one such example with clinical implications: two distinct mutations in the tyrosine kinase domain of the Epidermal Growth Factor Receptor (EGFR), a cell surface receptor mutated in 30% of non-small cell lung adenocarcinomas [52] and <2% of other cancers [37].

1.4. Activating mutations and amplifications of a lung cancer oncogene, EGFR, have been associated with response to tyrosine kinase inhibitors

As early as 1980, specific histological features of lung cancers were identified, among them overexpression of EGFR (also known as HER1 or ERBB1), a cell surface receptor overexpressed in 40-80% of lung tumours [53] and implicated in control of cell growth and differentiation. EGFR is a large 170-kilodalton glycoprotein with three distinct domains: an extracellular ligand binding domain, a transmembrane domain, and an intracellular tyrosine kinase domain. Upon ligand binding, EGFR forms homo- and hetero-dimers with other receptors, often HER2 (also known as ERBB2). These multimeric complexes then autophosphorylate, leading to activation of intracellular signalling kinase cascades. This is followed by internalization of the receptor complex for recycling or destruction by the cell, thereby removing the signalling cascade stimulus. The complete EGFR signalling network is complex, and computational methods have been used to annotate its many interactions [54].
Associated pathways are involved in specific functions such as endocytosis, degradation, recycling of EGFR, small GTPase signalling, MAPK cascade, PIP signalling, cell cycle control, Ca2+ signalling, and G-Protein-Coupled-Receptor-mediated EGFR transactivation [54]. The development of therapies targeting EGFR was spurred by the discovery of EGFR overexpression in many late stage lung tumours with poor prognosis and the ability of EGFR overexpression to confer a malignant phenotype on cultured cells [53]. Health Canada and the United States Food and Drug Administration initially approved the use of two small molecules, gefitinib (Iressa from AstraZeneca) and erlotinib (Tarceva from Genentech/Roche), in second- and third-line treatment of lung cancer. Both of these drugs are tyrosine kinase inhibitors (TKIs) that reversibly bind the ATP-binding pocket of the cytoplasmic EGFR tyrosine-kinase domain, thereby inhibiting autophosphorylation and stimulation of downstream signalling pathways, resulting in inhibition of proliferation, delayed cell cycle progression, and increased apoptosis [53]. Side-effects associated with these drugs are generally limited to skin rash and diarrhea [55], suggesting a degree of tumour-specificity unseen from treatment with conventional cytotoxic chemotherapies. In 2004, three studies found that somatic base-pair mutations in the ATP-binding pocket of the EGFR tyrosine-kinase domain correlated with dramatic reduction in tumour size as a result of treatment with gefitinib and erlotinib [56-58]. These mutations are particularly prevalent in adenocarcinomas from female non-smokers of Asian descent [52, 56-59], a particularly responsive subgroup observed in initial clinical trials of these drugs [60-62].
EGFR mutations cluster around the TKI binding site (Figure 1.2) and commonly involve an L858R amino acid substitution or in-frame deletions and substitutions of amino acids L747-T751. They are not often seen in primary tumours of other tissues (COSMIC, [37]). These mutations do not affect the stability or expression of EGFR [56], and it has been demonstrated in vitro that such mutations result in increased EGFR activity, longer activation times before receptor complex internalization, and increased sensitivity to gefitinib [56, 57]. The onset of drug resistance has been associated with the rise of a point mutation that results in an additional amino acid substitution, T790M [63], very near the site bound by gefitinib and erlotinib (Figure 1.2). This substitution increases the affinity of the kinase domain for its natural substrate, ATP, thereby reducing the inhibitory effect of TKIs [64]. The link between somatic DNA sequence mutations, altered protein function, and treatment outcome made EGFR mutation screening a potentially useful clinical tool and was considered an early harbinger of personalized medicine [65]. Other studies have questioned the strength of this correlation, however, instead finding amplification of EGFR to be a more accurate independent predictor of sensitivity to EGFR inhibitors [66-68]. A recent phase III trial has supported these observations, as patients with amplification of EGFR had significantly higher response rates to erlotinib than those without this characteristic (20% vs. 2%) [69]. Multivariate analysis revealed that only EGFR expression and increased copy number were associated with erlotinib response, and no statistically significant correlation between base-pair mutation and response was found [69].
More recently, increased HER2 copy number has been associated with response to gefitinib, and the outcome of patients positive for both EGFR and HER2 amplification was significantly better than those positive for amplification of just one of these factors [70]. Debate continues over which genetic features of lung cancer are clinically informative [71-74]. Chapter 2 documents my investigation of this issue by evaluating EGFR mutation, EGFR amplification, and HER2 amplification retrospectively in archival tumour samples from a local cohort of lung cancer patients treated with gefitinib [75].

1.5. Molecular studies of patient biopsy samples have been limited by suboptimal tissue quality and quantity

Studies of cancer cell lines have driven many fundamental discoveries in cancer research and continue to be a valuable resource for understanding cancer biology [38]. Derived initially from primary patient material, often dissected or purified tumour cells, cell lines are modified to allow them to grow in culture media independent of their original tissue microenvironment. This capability facilitates the generation of billions of clonal daughter cells and a nearly unlimited resource for biological study. For this reason, cell-line-based studies are ideal for application of standardized assays such as high-throughput screening of therapeutic compounds, elucidating protein interactions, or systematic genetic manipulation such as gene knockdown. However, this high degree of homogeneity does not reflect actual human tumours, which are often highly heterogeneous and made up of subpopulations of cells that interact with one another and surrounding normal cells [38]. Cell lines can be replicated thousands of times over many years and, due in part to pre-existing cancer phenotypes, can acquire de novo mutations, structural rearrangements, or even gain or lose chromosomes.
While individual cell lines may be clonal, parallel lines derived from a common population but maintained under differing culture conditions can have distinctly different genome alterations. For these reasons, cell lines are often an inadequate representation of cancers as they occur ‘in the wild’ [38]. Therefore, efforts to discover alterations of native cancer genomes must instead focus on primary sources of tumour material. These sources often take the form of diagnostic clinical biopsy samples or surgical tissue resections which present a unique set of challenges to applying genome tools. My early experience studying primary lung cancer samples identified three major challenges in extracting molecular genetic information from clinical cancer specimens: 1) nucleic acid quality can be compromised by clinical tissue archival methods, 2) tumour content can be variable due to cellular heterogeneity within a tumour mass, and 3) often only small quantities of tissue are taken to minimize impact on the donor patient. The first challenge is readily addressed by using tissue archival techniques that maintain nucleic acid integrity such as flash freezing tissues immediately upon biopsy or by adjusting molecular assays to compensate for degraded samples. For example, increasing the amount of DNA input to a PCR can often yield amplicons from degraded template. The second challenge can be overcome using well-developed cell purification techniques such as laser microdissection or flow cytometry. An example of metastatic tumour cells isolated by laser microdissection is shown in Chapter 2, Figure 2.2. The third problem of limited tissue quantities is not as easily addressed as additional material often cannot be collected without risk to patient health and safety. To circumvent this problem, amplification methods have been developed to increase the amount of DNA available from a sample by several orders of magnitude.
Whole genome amplification using Phi29 polymerase had been shown to have high sequence fidelity and genotype concordance before and after amplification [76-78]. However, the use of amplified material for copy number analysis had only been investigated using low resolution methods and amplification biases have been characterized descriptively without statistical analysis [77, 78, 93, 94]. Current genome-wide copy number analyses make use of high-density oligonucleotide microarrays capable of querying hundreds of thousands of genome positions in a single assay (e.g. Affymetrix GeneChip Human Mapping arrays [79] and Nimblegen Whole Genome Tiling arrays [80]). However, these methods require significant quantities of genomic DNA, often not available from small samples. This limitation has restricted routine genome-wide copy number analysis of smaller cancer biopsy samples despite past success identifying novel oncogenes and tumour suppressors in larger cancer samples not requiring amplification [81-84]. To address this challenge, we sought to investigate the use of amplified DNA for genome-wide copy number analyses [85], the results of which are included in Chapter 3.

1.6. Sequencing of multiple genes in clinical sample sets is facilitated by high-throughput methods

When sample quantities are adequate, there is potential to investigate a large number of candidate genes or variants for association with disease. To uncover specific variants within disease-associated genes or to confirm variants detected using orthogonal methods, these candidates are often sequenced at base-pair resolution in multiple patient samples. Parallelizing these experiments to sequence hundreds of targets in even a handful of patient samples is technically demanding as each sample is subject to DNA preparation, PCR and sequencing reaction setup, and data analysis.
Therefore, several research groups have established laboratory “pipelines” through which sets of clinical samples are standardized and subjected to a common set of protocols to generate high quality sequence data that can be compared between samples. A flexible high-throughput amplicon sequencing platform has recently been implemented at the BC Cancer Agency Genome Sciences Centre as a necessary tool to discover and validate sequence variants in a myriad of clinical samples. While helping develop this platform, I conducted a pilot study of germ-line DNA from prostate brachytherapy patients to uncover associations of DNA repair gene variants with the development of late side effects resulting from localized radiation therapy. The results of this study are included in Chapter 4, and to date represent the largest sequencing-based survey of DNA repair genes to uncover variants associated with radiosensitivity. This pipeline has since been used to validate putative somatic changes detected by second generation sequencing methods [86] including a subset of those described in Chapter 5.

1.7. Second generation sequencing technologies have enabled whole cancer genome sequencing

Just as human genome sequencing is on the verge of becoming routine, so too is sequencing of cancer genomes. Whole cancer genome shotgun sequencing allows researchers to go beyond sequencing individual candidate genes and to comprehensively assess each base-pair position for somatic events as well as gain fine-scale copy number and structural information such as the base-pair position of a translocation breakpoint. Recently, the genome sequence of a single acute myeloid leukemia revealed this cancer to be essentially diploid and uncovered ten non-synonymous mutations in genes that would not have been candidates for resequencing based on current knowledge of cancer [87].
Similar results are being reported for a lobular breast cancer in which the genome sequence was complemented by sequencing RNA from the same sample [86]. This study and others [88-90] have illustrated the ability of transcriptome sequencing to provide quantitative gene expression and structural information including splicing isoform usage and detection of gene fusions resulting from genome rearrangements. As transcriptome data primarily aligns to annotated exons, this method is particularly well-suited to detecting expressed somatic mutations and RNA editing events not detectable in genome sequence. In the breast cancer study, 1/3 (11 of 32) of the coding somatic mutations were detectable in the transcriptome using 1/20 the number of sequencing reads [86]. Profiling cancer transcriptomes is an efficient usage of sequencing capacity for coding mutation detection and provides additional transcript information not available from genome sequence data. This approach was used to profile 30 lung tumours for the study presented in Chapter 5.

1.8. Thesis description

The objective of my graduate work has been to identify genomic variants in primary patient specimens and to evaluate sequence mutations, polymorphisms, or structural variants that may be related to treatment outcome. This thesis provides a systematic account of my research to uncover correlates of molecular information with clinical outcomes of cancer therapy, beginning with a study of single genes (Chapter 2) and ending with a transcriptome-wide survey (Chapter 5). Throughout my research, I was presented with several challenges inherent to applying genome technologies to patient material. Chapters 2 and 5 explore predictors of outcome to treatment of non-small cell lung cancer with tyrosine kinase inhibitors, first retrospectively by studying archival diagnostic tissues and then prospectively in fresh-frozen biopsy samples collected as part of a clinical trial.
Chapters 3 and 4 of this thesis, while not directly studying lung cancer samples, represent projects designed to overcome problems central to studying cancer in patients. Chapter 3 illustrates a method for reliably deriving copy number and genotype information from small quantities of tissue and my findings are broadly applicable to the study of human disease including cancer. Chapter 4 explores the relationship of germline polymorphisms with side-effects induced by radiation therapy of prostate cancer. As my research has addressed several distinct aspects of cancer genomics, each chapter is written as an independent, in most cases peer-reviewed and published, unit preceded by a brief introduction relating the work back to the overall theme.

In Chapter 2, I sought to investigate the relationship of gefitinib response with three putative molecular predictors, EGFR mutation, EGFR amplification, and HER2 amplification, in archival samples from a local cohort of lung cancer patients [75]. At the time that this study was conducted, it was unclear which of these somatic changes, if any, were accurate predictors of response. Even today, this debate continues [71-73], and the identification of new predictive biomarkers is desperately needed to guide lung cancer treatment. As a result of the findings of this early retrospective study, the prospective study presented in Chapter 5 was begun to find novel genomic features of lung cancer. This early experience also identified challenges inherent to working with clinical lung cancer specimens which were subsequently addressed by the work presented in Chapters 3 and 4. From the small quantities of tissue available from many cancer biopsies, it became apparent that to expand our investigations beyond single genes required amplification of the limited quantities of DNA available from these samples.
The work covered in Chapter 3 provides a statistical treatment of amplification biases induced by this technique and demonstrates that bona fide CNVs can be detected in amplified DNA. We later used this amplification method to increase the amount of DNA available from lung tumour biopsies collected prospectively for the study presented in Chapter 5. Due to a growing number of candidate mutations identified by next generation sequencing methods and an increasing number of clinical samples usable for sequence analysis, a need arose to sequence hundreds of amplicons from multiple patient samples. Therefore, I helped design and implement an amplicon sequencing pipeline for the high-throughput generation and analysis of sequence data from a wide range of clinical specimens. Chapter 4 documents the pilot project for this pipeline in which I conducted a study of germline variants in DNA repair genes from prostate brachytherapy patients with varying degrees of radiation toxicity following treatment. This study identified variants of three DNA repair genes that may confer increased radiation sensitivity and, if validated in larger patient populations, may be used to identify patients likely intolerant of radiation therapy. The pipeline developed during this project has since been used to validate somatic variants in cancers detected using second generation sequencing methods [86], including candidate mutations identified by the study presented in Chapter 5.

In Chapter 5, I demonstrate the ability of transcriptome sequencing to simultaneously query the structure, sequence, and expression levels of transcripts expressed by a clinically selected set of lung cancers. Using this information, we sought to explain observed responses to the tyrosine kinase inhibitor erlotinib (Tarceva, Roche) in the context of integrated patterns of somatic sequence alterations, fusion transcripts, and gene expression.
The results and techniques developed during the studies presented in Chapters 2-4 laid the foundation for this work. For example, the experience of working with poor quality material from archival samples in the retrospective study from Chapter 2 dictated that fresh frozen biopsies be prospectively collected for this study. As the lung tumour biopsies were collected for research as part of a clinical trial, tissue quantities were very limited and amplification strategies were used, including the method characterized in Chapter 3. Finally, a subset of putative mutations identified from this study was validated using the amplicon sequencing infrastructure implemented in Chapter 4. This chapter represents a modern medical oncogenomics project combining cutting edge DNA sequencing technology, rigorous tumour review and purification, and standardized patient treatment beginning from treatment-naivety as part of a drug trial.

Chapter 6 provides a summary of the lessons learned from five years of cancer research and possible directions of the current lung cancer research program. I also discuss the evolution of cancer genomics in recent years, large scale cancer genomics projects that are underway now, and the potential future impact of genomics on the clinical management of cancer.

1.9. Figures

Figure 1.1 Parallel pathways of tumourigenesis

Reproduced with permission from [38]. Panel A depicts six hallmarks of cancer biology and provides examples of each as discussed by [38]. A seventh hallmark, immune system evasion, has been proposed by [95]. Panel B provides alternate sequences in which cancer hallmarks can be acquired, all of which eventually lead to a cancer phenotype. Single events may confer multiple capabilities and acquiring one ability can facilitate the acquisition of subsequent cancer hallmarks.

Figure 1.2 Representations of a crystal structure of the EGFR kinase domain in complex with erlotinib.
Modified from the structure published by [91] and freely accessible from the NCBI Molecular Modeling Database (MMDB ID: 20494, PDB ID: 1M17). The data used to generate this figure were downloaded from http://www.ncbi.nlm.nih.gov/Structure/mmdb/mmdbsrv.cgi?Dopt=s&uid=20494 and visualized using the Cn3D software package [92]. Both panels depict erlotinib as a purple ball-and-stick figure in complex with a space-fill (left) or ribbon (right) representation of the EGFR kinase domain. Three mutations commonly observed in lung cancer are marked in yellow. The LREA deletion (d.LREA) and L858R point mutation are commonly observed prior to treatment and have been correlated with increased sensitivity to tyrosine kinase inhibitors (TKIs). The T790M point mutation is often acquired during treatment with TKIs and confers resistance to these drugs.

1.10. Bibliography

1. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, Stange-Thomann N, Stojanovic N, Subramanian A, Wyman D, Rogers J, Sulston J, Ainscough R, Beck S, Bentley D, Burton J, Clee C, Carter N, Coulson A, Deadman R, Deloukas P, Dunham A, Dunham I, Durbin R, French L, Grafham D, Gregory S, Hubbard T, Humphray S, Hunt A, Jones M, Lloyd C, McMurray A, Matthews L, Mercer S, Milne S, Mullikin JC, Mungall A, Plumb R, Ross M, Shownkeen R, Sims S, Waterston RH, Wilson RK, Hillier LW, McPherson JD, Marra MA, Mardis ER, Fulton LA, Chinwalla AT, Pepin KH, Gish WR, Chissoe SL, Wendl MC, Delehaunty KD, Miner TL, Delehaunty A, Kramer JB, Cook LL, Fulton RS, Johnson DL, Minx PJ, Clifton SW, Hawkins T, Branscomb E, Predki P, Richardson P, Wenning S, Slezak T, Doggett N, Cheng JF, Olsen A, Lucas S, Elkin C, Uberbacher E, Frazier M, Gibbs RA, Muzny DM, Scherer SE, Bouck JB, Sodergren EJ,
Worley KC, Rives CM, Gorrell JH, Metzker ML, Naylor SL, Kucherlapati RS, Nelson DL, Weinstock GM, Sakaki Y, Fujiyama A, Hattori M, Yada T, Toyoda A, Itoh T, Kawagoe C, Watanabe H, Totoki Y, Taylor T, Weissenbach J, Heilig R, Saurin W, Artiguenave F, Brottier P, Bruls T, Pelletier E, Robert C, Wincker P, Smith DR, Doucette-Stamm L, Rubenfield M, Weinstock K, Lee HM, Dubois J, Rosenthal A, Platzer M, Nyakatura G, Taudien S, Rump A, Yang H, Yu J, Wang J, Huang G, Gu J, Hood L, Rowen L, Madan A, Qin S, Davis RW, Federspiel NA, Abola AP, Proctor MJ, Myers RM, Schmutz J, Dickson M, Grimwood J, Cox DR, Olson MV, Kaul R, Raymond C, Shimizu N, Kawasaki K, Minoshima S, Evans GA, Athanasiou M, Schultz R, Roe BA, Chen F, Pan H, Ramser J, Lehrach H, Reinhardt R, McCombie WR, de la Bastide M, Dedhia N, Blocker H, Hornischer K, Nordsiek G, Agarwala R, Aravind L, Bailey JA, Bateman A, Batzoglou S, Birney E, Bork P, Brown DG, Burge CB, Cerutti L, Chen HC, Church D, Clamp M, Copley RR, Doerks T, Eddy SR, Eichler EE, Furey TS, Galagan J, Gilbert JG, Harmon C, Hayashizaki Y, Haussler D, Hermjakob H, Hokamp K, Jang W, Johnson LS, Jones TA, Kasif S, Kaspryzk A, Kennedy S, Kent WJ, Kitts P, Koonin EV, Korf I, Kulp D, Lancet D, Lowe TM, McLysaght A, Mikkelsen T, Moran JV, Mulder N, Pollara VJ, Ponting CP, Schuler G, Schultz J, Slater G, Smit AF, Stupka E, Szustakowski J, Thierry-Mieg D, Thierry-Mieg J, Wagner L, Wallis J, Wheeler R, Williams A, Wolf YI, Wolfe KH, Yang SP, Yeh RF, Collins F, Guyer MS, Peterson J, Felsenfeld A, Wetterstrand KA, Patrinos A, Morgan MJ, de Jong P, Catanese JJ, Osoegawa K, Shizuya H, Choi S, Chen YJ: Initial sequencing and analysis of the human genome. Nature 2001, 409(6822):860-921. 
2. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA, Gocayne JD, Amanatides P, Ballew RM, Huson DH, Wortman JR, Zhang Q, Kodira CD, Zheng XH, Chen L, Skupski M, Subramanian G, Thomas PD, Zhang J, Gabor Miklos GL, Nelson C, Broder S, Clark AG, Nadeau J, McKusick VA, Zinder N, Levine AJ, Roberts RJ, Simon M, Slayman C, Hunkapiller M, Bolanos R, Delcher A, Dew I, Fasulo D, Flanigan M, Florea L, Halpern A, Hannenhalli S, Kravitz S, Levy S, Mobarry C, Reinert K, Remington K, Abu-Threideh J, Beasley E, Biddick K, Bonazzi V, Brandon R, Cargill M, Chandramouliswaran I, Charlab R, Chaturvedi K, Deng Z, Di Francesco V, Dunn P, Eilbeck K, Evangelista C, Gabrielian AE, Gan W, Ge W, Gong F, Gu Z, Guan P, Heiman TJ, Higgins ME, Ji RR, Ke Z, Ketchum KA, Lai Z, Lei Y, Li Z, Li J, Liang Y, Lin X, Lu F, Merkulov GV, Milshina N, Moore HM, Naik AK, Narayan VA, Neelam B, Nusskern D, Rusch DB, Salzberg S, Shao W, Shue B, Sun J, Wang Z, Wang A, Wang X, Wang J, Wei M, Wides R, Xiao C, Yan C, Yao A, Ye J, Zhan M, Zhang W, Zhang H, Zhao Q, Zheng L, Zhong F, Zhong W, Zhu S, Zhao S, Gilbert D, Baumhueter S, Spier G, Carter C, Cravchik A, Woodage T, Ali F, An H, Awe A, Baldwin D, Baden H, Barnstead M, Barrow I, Beeson K, Busam D, Carver A, Center A, Cheng ML, Curry L, Danaher S, Davenport L, Desilets R, Dietz S, Dodson K, Doup L, Ferriera S, Garg N, Gluecksmann A, Hart B, Haynes J, Haynes C, Heiner C, Hladun S, Hostin D, Houck J, Howland T, Ibegwam C, Johnson J, Kalush F, Kline L, Koduru S, Love A, Mann F, May D, McCawley S, McIntosh T, McMullen I, Moy M, Moy L, Murphy B, Nelson K, Pfannkoch C, Pratts E, Puri V, Qureshi H, Reardon M, Rodriguez R, Rogers YH, Romblad D, Ruhfel B, Scott R, Sitter C, Smallwood M, Stewart E, Strong R, Suh E, Thomas R, Tint NN, Tse S, Vech C, Wang G, Wetter J, Williams S, Williams M, Windsor S, Winn-Deen E, Wolfe K, Zaveri J, Zaveri K, Abril JF, Guigo R, Campbell MJ, Sjolander KV, Karlak B, Kejariwal A, Mi H,
Lazareva B, Hatton T, Narechania A, Diemer K, Muruganujan A, Guo N, Sato S, Bafna V, Istrail S, Lippert R, Schwartz R, Walenz B, Yooseph S, Allen D, Basu A, Baxendale J, Blick L, Caminha M, Carnes-Stine J, Caulk P, Chiang YH, Coyne M, Dahlke C, Mays A, Dombroski M, Donnelly M, Ely D, Esparham S, Fosler C, Gire H, Glanowski S, Glasser K, Glodek A, Gorokhov M, Graham K, Gropman B, Harris M, Heil J, Henderson S, Hoover J, Jennings D, Jordan C, Jordan J, Kasha J, Kagan L, Kraft C, Levitsky A, Lewis M, Liu X, Lopez J, Ma D, Majoros W, McDaniel J, Murphy S, Newman M, Nguyen T, Nguyen N, Nodell M, Pan S, Peck J, Peterson M, Rowe W, Sanders R, Scott J, Simpson M, Smith T, Sprague A, Stockwell T, Turner R, Venter E, Wang M, Wen M, Wu D, Wu M, Xia A, Zandieh A, Zhu X: The sequence of the human genome. Science 2001, 291(5507):1304-1351.
3. Frazer KA, Ballinger DG, Cox DR, Hinds DA, Stuve LL, Gibbs RA, Belmont JW, Boudreau A, Hardenbol P, Leal SM, Pasternak S, Wheeler DA, Willis TD, Yu F, Yang H, Zeng C, Gao Y, Hu H, Hu W, Li C, Lin W, Liu S, Pan H, Tang X, Wang J, Wang W, Yu J, Zhang B, Zhang Q, Zhao H, Zhao H, Zhou J, Gabriel SB, Barry R, Blumenstiel B, Camargo A, Defelice M, Faggart M, Goyette M, Gupta S, Moore J, Nguyen H, Onofrio RC, Parkin M, Roy J, Stahl E, Winchester E, Ziaugra L, Altshuler D, Shen Y, Yao Z, Huang W, Chu X, He Y, Jin L, Liu Y, Shen Y, Sun W, Wang H, Wang Y, Wang Y, Xiong X, Xu L, Waye MM, Tsui SK, Xue H, Wong JT, Galver LM, Fan JB, Gunderson K, Murray SS, Oliphant AR, Chee MS, Montpetit A, Chagnon F, Ferretti V, Leboeuf M, Olivier JF, Phillips MS, Roumy S, Sallee C, Verner A, Hudson TJ, Kwok PY, Cai D, Koboldt DC, Miller RD, Pawlikowska L, Taillon-Miller P, Xiao M, Tsui LC, Mak W, Song YQ, Tam PK, Nakamura Y, Kawaguchi T, Kitamoto T, Morizono T, Nagashima A, Ohnishi Y, Sekine A, Tanaka T, Tsunoda T, Deloukas P, Bird CP, Delgado M, Dermitzakis ET, Gwilliam R, Hunt S, Morrison J, Powell D, Stranger BE, Whittaker P, Bentley DR, Daly MJ, de Bakker PI, Barrett
J, Chretien YR, Maller J, McCarroll S, Patterson N, Pe'er I, Price A, Purcell S, Richter DJ, Sabeti P, Saxena R, Schaffner SF, Sham PC, Varilly P, Altshuler D, Stein LD, Krishnan L, Smith AV, Tello-Ruiz MK, Thorisson GA, Chakravarti A, Chen PE, Cutler DJ, Kashuk CS, Lin S, Abecasis GR, Guan W, Li Y, Munro HM, Qin ZS, Thomas DJ, McVean G, Auton A, Bottolo L, Cardin N, Eyheramendy S, Freeman C, Marchini J, Myers S, Spencer C, Stephens M, Donnelly P, Cardon LR, Clarke G, Evans DM, Morris AP, Weir BS, Tsunoda T, Mullikin JC, Sherry ST, Feolo M, Skol A, Zhang H, Zeng C, Zhao H, Matsuda I, Fukushima Y, Macer DR, Suda E, Rotimi CN, Adebamowo CA, Ajayi I, Aniagwu T, Marshall PA, Nkwodimmah C, Royal CD, Leppert MF, Dixon M, Peiffer A, Qiu R, Kent A, Kato K, Niikawa N, Adewole IF, Knoppers BM, Foster MW, Clayton EW, Watkin J, Gibbs RA, Belmont JW, Muzny D, Nazareth L, Sodergren E, Weinstock GM, Wheeler DA, Yakub I, Gabriel SB, Onofrio RC, Richter DJ, Ziaugra L, Birren BW, Daly MJ, Altshuler D, Wilson RK, Fulton LL, Rogers J, Burton J, Carter NP, Clee CM, Griffiths M, Jones MC, McLay K, Plumb RW, Ross MT, Sims SK, Willey DL, Chen Z, Han H, Kang L, Godbout M, Wallenburg JC, L'Archeveque P, Bellemare G, Saeki K, Wang H, An D, Fu H, Li Q, Wang Z, Wang R, Holden AL, Brooks LD, McEwen JE, Guyer MS, Wang VO, Peterson JL, Shi M, Spiegel J, Sung LM, Zacharia LF, Collins FS, Kennedy K, Jamieson R, Stewart J: A second generation human haplotype map of over 3.1 million SNPs. Nature 2007, 449(7164):851-861.
4. dbSNP Home Page [http://www.ncbi.nlm.nih.gov/projects/SNP/]
5. 1000 Genomes - Home [http://www.1000genomes.org]
6. Sachidanandam R, Weissman D, Schmidt SC, Kakol JM, Stein LD, Marth G, Sherry S, Mullikin JC, Mortimore BJ, Willey DL, Hunt SE, Cole CG, Coggill PC, Rice CM, Ning Z, Rogers J, Bentley DR, Kwok PY, Mardis ER, Yeh RT, Schultz B, Cook L, Davenport R, Dante M, Fulton L, Hillier L, Waterston RH, McPherson JD, Gilman B, Schaffner S, Van Etten WJ, Reich D, Higgins J, Daly MJ, Blumenstiel B, Baldwin J, Stange-Thomann N, Zody MC, Linton L, Lander ES, Altshuler D: A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 2001, 409(6822):928-933.
7. Schneider JA, Pungliya MS, Choi JY, Jiang R, Sun XJ, Salisbury BA, Stephens JC: DNA variability of human genes. Mech Ageing Dev 2003, 124(1):17-25.
8. Jorde LB, Wooding SP: Genetic variation, classification and 'race'. Nat Genet 2004, 36(11 Suppl):S28-33.
9. A haplotype map of the human genome. Nature 2005, 437(7063):1299-1320.
10. Zielenski J: Genotype and phenotype in cystic fibrosis. Respiration 2000, 67(2):117-133.
11. Peake I: The molecular basis of haemophilia A. Haemophilia 1998, 4(4):346-349.
12. Lillicrap D: The molecular basis of haemophilia B. Haemophilia 1998, 4(4):350-357.
13. Ashley-Koch A, Yang Q, Olney RS: Sickle hemoglobin (HbS) allele and sickle cell disease: a HuGE review. Am J Epidemiol 2000, 151(9):839-845.
14. Frazer KA, Murray SS, Schork NJ, Topol EJ: Human genetic variation and its contribution to complex traits. Nat Rev Genet 2009, 10(4):241-251.
15. Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Maner S, Massa H, Walker M, Chi M, Navin N, Lucito R, Healy J, Hicks J, Ye K, Reiner A, Gilliam TC, Trask B, Patterson N, Zetterberg A, Wigler M: Large-scale copy number polymorphism in the human genome. Science 2004, 305(5683):525-528.
16. Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW, Lee C: Detection of large-scale variation in the human genome.
Nat Genet 2004, 36(9):949-951.
17. Feuk L, Carson AR, Scherer SW: Structural variation in the human genome. Nat Rev Genet 2006, 7(2):85-97.
18. Perry GH, Dominy NJ, Claw KG, Lee AS, Fiegler H, Redon R, Werner J, Villanea FA, Mountain JL, Misra R, Carter NP, Lee C, Stone AC: Diet and the evolution of human amylase gene copy number variation. Nat Genet 2007, 39(10):1256-1260.
19. Stranger BE, Forrest MS, Dunning M, Ingle CE, Beazley C, Thorne N, Redon R, Bird CP, de Grassi A, Lee C, Tyler-Smith C, Carter N, Scherer SW, Tavare S, Deloukas P, Hurles ME, Dermitzakis ET: Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science 2007, 315(5813):848-853.
20. A Catalog of Published Genome-Wide Association Studies [www.genome.gov/26525384]
21. Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, Manolio TA: Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci U S A 2009, 106(23):9362-9367.
22. Zeggini E, Weedon MN, Lindgren CM, Frayling TM, Elliott KS, Lango H, Timpson NJ, Perry JR, Rayner NW, Freathy RM, Barrett JC, Shields B, Morris AP, Ellard S, Groves CJ, Harries LW, Marchini JL, Owen KR, Knight B, Cardon LR, Walker M, Hitman GA, Morris AD, Doney AS, McCarthy MI, Hattersley AT: Replication of genome-wide association signals in UK samples reveals risk loci for type 2 diabetes. Science 2007, 316(5829):1336-1341.
23. Franke A, Balschun T, Karlsen TH, Hedderich J, May S, Lu T, Schuldt D, Nikolaus S, Rosenstiel P, Krawczak M, Schreiber S: Replication of signals from recent studies of Crohn's disease identifies previously unknown disease loci for ulcerative colitis. Nat Genet 2008, 40(6):713-715.
24. Morozova O, Marra MA: Applications of next-generation sequencing technologies in functional genomics. Genomics 2008, 92(5):255-264.
25. Holt RA, Jones SJ: The new paradigm of flow cell sequencing. Genome Res 2008, 18(6):839-846.
26. Levy S, Sutton G, Ng PC, Feuk L, Halpern AL, Walenz BP, Axelrod N, Huang J, Kirkness EF, Denisov G, Lin Y, MacDonald JR, Pang AW, Shago M, Stockwell TB, Tsiamouri A, Bafna V, Bansal V, Kravitz SA, Busam DA, Beeson KY, McIntosh TC, Remington KA, Abril JF, Gill J, Borman J, Rogers YH, Frazier ME, Scherer SW, Strausberg RL, Venter JC: The diploid genome sequence of an individual human. PLoS Biol 2007, 5(10):e254.
27. Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L, McGuire A, He W, Chen YJ, Makhijani V, Roth GT, Gomes X, Tartaro K, Niazi F, Turcotte CL, Irzyk GP, Lupski JR, Chinault C, Song XZ, Liu Y, Yuan Y, Nazareth L, Qin X, Muzny DM, Margulies M, Weinstock GM, Gibbs RA, Rothberg JM: The complete genome of an individual by massively parallel DNA sequencing. Nature 2008, 452(7189):872-876.
28. Wang J, Wang W, Li R, Li Y, Tian G, Goodman L, Fan W, Zhang J, Li J, Zhang J, Guo Y, Feng B, Li H, Lu Y, Fang X, Liang H, Du Z, Li D, Zhao Y, Hu Y, Yang Z, Zheng H, Hellmann I, Inouye M, Pool J, Yi X, Zhao J, Duan J, Zhou Y, Qin J, Ma L, Li G, Yang Z, Zhang G, Yang B, Yu C, Liang F, Li W, Li S, Li D, Ni P, Ruan J, Li Q, Zhu H, Liu D, Lu Z, Li N, Guo G, Zhang J, Ye J, Fang L, Hao Q, Chen Q, Liang Y, Su Y, San A, Ping C, Yang S, Chen F, Li L, Zhou K, Zheng H, Ren Y, Yang L, Gao Y, Yang G, Li Z, Feng X, Kristiansen K, Wong GK, Nielsen R, Durbin R, Bolund L, Zhang X, Li S, Yang H, Wang J: The diploid genome sequence of an Asian individual. Nature 2008, 456(7218):60-65.
29. Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR, Boutell JM, Bryant J, Carter RJ, Keira Cheetham R, Cox AJ, Ellis DJ, Flatbush MR, Gormley NA, Humphray SJ, Irving LJ, Karbelashvili MS, Kirk SM, Li H, Liu X, Maisinger KS, Murray LJ, Obradovic B, Ost T, Parkinson ML, Pratt MR, Rasolonjatovo IM, Reed MT, Rigatti R, Rodighiero C, Ross MT, Sabot A, Sankar SV, Scally A, Schroth GP, Smith ME, Smith VP, Spiridou A, Torrance PE, Tzonev SS, Vermaas EH, Walter K, Wu X, Zhang L, Alam MD, Anastasi C, Aniebo IC, Bailey DM, Bancarz IR, Banerjee S, Barbour SG, Baybayan PA, Benoit VA, Benson KF, Bevis C, Black PJ, Boodhun A, Brennan JS, Bridgham JA, Brown RC, Brown AA, Buermann DH, Bundu AA, Burrows JC, Carter NP, Castillo N, Chiara ECM, Chang S, Neil Cooley R, Crake NR, Dada OO, Diakoumakos KD, Dominguez-Fernandez B, Earnshaw DJ, Egbujor UC, Elmore DW, Etchin SS, Ewan MR, Fedurco M, Fraser LJ, Fuentes Fajardo KV, Scott Furey W, George D, Gietzen KJ, Goddard CP, Golda GS, Granieri PA, Green DE, Gustafson DL, Hansen NF, Harnish K, Haudenschild CD, Heyer NI, Hims MM, Ho JT, Horgan AM, Hoschler K, Hurwitz S, Ivanov DV, Johnson MQ, James T, Huw Jones TA, Kang GD, Kerelska TH, Kersey AD, Khrebtukova I, Kindwall AP, Kingsbury Z, Kokko-Gonzales PI, Kumar A, Laurent MA, Lawley CT, Lee SE, Lee X, Liao AK, Loch JA, Lok M, Luo S, Mammen RM, Martin JW, McCauley PG, McNitt P, Mehta P, Moon KW, Mullens JW, Newington T, Ning Z, Ling Ng B, Novo SM, O'Neill MJ, Osborne MA, Osnowski A, Ostadan O, Paraschos LL, Pickering L, Pike AC, Pike AC, Chris Pinkard D, Pliskin DP, Podhasky J, Quijano VJ, Raczy C, Rae VH, Rawlings SR, Chiva Rodriguez A, Roe PM, Rogers J, Rogert Bacigalupo MC, Romanov N, Romieu A, Roth RK, Rourke NJ, Ruediger ST, Rusman E, Sanches-Kuiper RM, Schenker MR, Seoane JM, Shaw RJ, Shiver MK, Short SW, Sizto NL, Sluis JP, Smith MA, Ernest Sohna Sohna J, Spence EJ, Stevens K,
Sutton N, Szajkowski L, Tregidgo CL, Turcatti G, Vandevondele S, Verhovsky Y, Virk SM, Wakelin S, Walcott GC, Wang J, Worsley GJ, Yan J, Yau L, Zuerlein M, Rogers J, Mullikin JC, Hurles ME, McCooke NJ, West JS, Oaks FL, Lundberg PL, Klenerman D, Durbin R, Smith AJ: Accurate whole human genome sequencing using reversible terminator chemistry. Nature 2008, 456(7218):53-59.
30. McKernan KJ, Peckham HE, Costa G, McLaughlin S, Tsung E, Fu Y, Clouser C, Dunkan C, Ichikawa J, Lee C, Zhang Z, Sheridan A, Fu H, Ranade S, Dimilanta E, Sokolsky T, Zhang L, Hendrickson C, Li B, Kotler L, Stuart J, Malek J, Manning J, Antipova A, Perez D, Moore M, Hayashibara K, Lyons M, Beaudoin R, Coleman B, Laptewicz M, Sanicandro A, Rhodes M, De La Vega F, Gottimukkala RK, Hyland F, Reese M, Yang S, Bafna V, Bashir A, Macbride A, Aklan C, Kidd JM, Eichler EE, Blanchard AP: Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two base encoding. Genome Res 2009.
31. Ahn SM, Kim TH, Lee S, Kim D, Ghang H, Kim DS, Kim BC, Kim SY, Kim WY, Kim C, Park D, Lee YS, Kim S, Reja R, Jho S, Kim CG, Cha JY, Kim KH, Lee B, Bhak J, Kim SJ: The first Korean genome sequence and analysis: Full genome sequencing for a socio-ethnic group. Genome Res 2009.
32. Knome, Inc. | Know Thyself [http://www.knome.com/]
33. Complete Genomics [http://www.completegenomics.com/]
34. Personal Genome Project - Homepage [http://www.personalgenomes.org]
35. Ionita-Laza I, Lange C, N ML: Estimating the number of unseen variants in the human genome. Proc Natl Acad Sci U S A 2009, 106(13):5008-5013.
36. Genome-announce -- UCSC Genome Browser project announcements mailing list [https://lists.soe.ucsc.edu/mailman/listinfo/genome-announce]
37. Futreal PA, Coin L, Marshall M, Down T, Hubbard T, Wooster R, Rahman N, Stratton MR: A census of human cancer genes. Nat Rev Cancer 2004, 4(3):177-183.
38. Hanahan D, Weinberg RA: The hallmarks of cancer. Cell 2000, 100(1):57-70.
39. Bignell GR, Santarius T, Pole JC, Butler AP, Perry J, Pleasance E, Greenman C, Menzies A, Taylor S, Edkins S, Campbell P, Quail M, Plumb B, Matthews L, McLay K, Edwards PA, Rogers J, Wooster R, Futreal PA, Stratton MR: Architectures of somatic genomic rearrangement in human cancer amplicons at sequence-level resolution. Genome Res 2007, 17(9):1296-1303.
40. Herceptin (Trastuzumab) product insert [http://www.fda.gov/medwatch/SAFETY/2005/Herceptin_Promo_PDF_Feb_2005.pdf]
41. Seidman A, Hudis C, Pierri MK, Shak S, Paton V, Ashby M, Murphy M, Stewart SJ, Keefe D: Cardiac dysfunction in the trastuzumab clinical trials experience. J Clin Oncol 2002, 20(5):1215-1221.
42. Cho HS, Mason K, Ramyar KX, Stanley AM, Gabelli SB, Denney DW, Jr., Leahy DJ: Structure of the extracellular region of HER2 alone and in complex with the Herceptin Fab. Nature 2003, 421(6924):756-760.
43. Mitelman F, Johansson B, Mertens F: The impact of translocations and gene fusions on cancer causation. Nat Rev Cancer 2007, 7(4):233-245.
44. Mitelman F: Recurrent chromosome aberrations in cancer. Mutat Res 2000, 462(2-3):247-253.
45. Krzywinski M, Bosdet I, Mathewson C, Wye N, Brebner J, Chiu R, Corbett R, Field M, Lee D, Pugh T, Volik S, Siddiqui A, Jones S, Schein J, Collins C, Marra M: A BAC clone fingerprinting approach to the detection of human genome rearrangements. Genome Biol 2007, 8(10):R224.
46. Bignell GR, Huang J, Greshock J, Watt S, Butler A, West S, Grigorova M, Jones KW, Wei W, Stratton MR, Futreal PA, Weber B, Shapero MH, Wooster R: High-resolution analysis of DNA copy number using oligonucleotide microarrays. Genome Res 2004, 14(2):287-295.
47. Soda M, Choi YL, Enomoto M, Takada S, Yamashita Y, Ishikawa S, Fujiwara S, Watanabe H, Kurashina K, Hatanaka H, Bando M, Ohno S, Ishikawa Y, Aburatani H, Niki T, Sohara Y, Sugiyama Y, Mano H: Identification of the transforming EML4-ALK fusion gene in non-small-cell lung cancer.
Nature 2007, 448(7153):561-566.
48. Druker BJ: Translation of the Philadelphia chromosome into therapy for CML. Blood 2008, 112(13):4808-4817.
49. Stratton MR, Campbell PJ, Futreal PA: The cancer genome. Nature 2009, 458(7239):719-724.
50. Holbrook JA, Neu-Yilik G, Hentze MW, Kulozik AE: Nonsense-mediated decay approaches the clinic. Nat Genet 2004, 36(8):801-808.
51. Cartegni L, Chew SL, Krainer AR: Listening to silence and understanding nonsense: exonic mutations that affect splicing. Nat Rev Genet 2002, 3(4):285-298.
52. Shigematsu H, Gazdar AF: Somatic mutations of epidermal growth factor receptor signaling pathway in lung cancers. Int J Cancer 2006, 118(2):257-262.
53. Laskin JJ, Sandler AB: Epidermal growth factor receptor: a promising target in solid tumours. Cancer Treat Rev 2004, 30(1):1-17.
54. Oda K MY, Funahashi A, Kitano H: A comprehensive pathway map of epidermal growth factor receptor signaling. Molecular Systems Biology 2005, msb4100014:E1-E17.
55. Ranson M, Hammond LA, Ferry D, Kris M, Tullo A, Murray PI, Miller V, Averbuch S, Ochs J, Morris C, Feyereislova A, Swaisland H, Rowinsky EK: ZD1839, a selective oral epidermal growth factor receptor-tyrosine kinase inhibitor, is well tolerated and active in patients with solid, malignant tumors: results of a phase I trial. J Clin Oncol 2002, 20(9):2240-2250.
56. Lynch TJ BD, Sordella R, Gurubhagavatula S, Okimoto RA, Brannigan BW, Harris PL, Haserlat SM, Supko JG, Haluska FG, Louis DN, Christiani DC, Settleman J, Haber DA: Activating mutations in the epidermal growth factor receptor underlying responsiveness of non-small-cell lung cancer to gefitinib. N Engl J Med 2004, 350:2129-2139.
57. Paez JG JP, Lee JC, Tracy S, Greulich H, Gabriel S, Herman P, Kaye FJ, Lindeman N, Boggon TJ, Naoki K, Sasaki H, Fujii Y, Eck MJ, Sellers WR, Johnson BE, Meyerson M: EGFR mutations in lung cancer: correlation with clinical response to gefitinib therapy. Science 2004, 304:1497-1500.
58. Pao W MV, Zakowski M, Doherty J, Politi K, Sarkaria I, Singh B, Heelan R, Rusch V, Fulton L, Mardis E, Kupfer D, Wilson R, Kris M, Varmus H: EGF receptor gene mutations are common in lung cancers from "never smokers" and are associated with sensitivity of tumors to gefitinib and erlotinib. Proc Natl Acad Sci U S A 2004, 101(36):13306-13311.
59. Sun S, Schiller JH, Gazdar AF: Lung cancer in never smokers--a different disease. Nat Rev Cancer 2007, 7(10):778-790.
60. Fukuoka M YS, Giaccone G, Tamura T, Nakagawa K, Douillard JY, Nishiwaki Y, Vansteenkiste J, Kudoh S, Rischin D, Eek R, Horai T, Noda K, Takata I, Smit E, Averbuch S, Macleod A, Feyereislova A, Dong RP, Baselga J: Multi-institutional randomized phase II trial of gefitinib for previously treated patients with advanced non-small-cell lung cancer (The IDEAL 1 Trial). J Clin Oncol 2003, 21(12):2237-2246.
61. Ho C MN, Laskin J, Melosky B, Anderson H, Bebb G: Asian ethnicity and adenocarcinoma histology continues to predict response to gefitinib in patients treated for advanced non-small cell carcinoma of the lung in North America. Lung Cancer 2005, 49(2):225-231.
62. Kris MG NR, Herbst RS, Lynch TJ Jr, Prager D, Belani CP, Schiller JH, Kelly K, Spiridonidis H, Sandler A, Albain KS, Cella D, Wolf MK, Averbuch SD, Ochs JJ, Kay AC: Efficacy of gefitinib, an inhibitor of the epidermal growth factor receptor tyrosine kinase, in symptomatic patients with non-small cell lung cancer: a randomized trial. JAMA 2003, 290(16):2149-2158.
63. Pao W MV, Politi KA, Riely GJ, Somwar R, Zakowski MF, Kris MG, Varmus H: Acquired Resistance of Lung Adenocarcinomas to Gefitinib or Erlotinib Is Associated with a Second Mutation in the EGFR Kinase Domain. PLoS Med 2005, 2:e73.
64. Yun CH, Mengwasser KE, Toms AV, Woo MS, Greulich H, Wong KK, Meyerson M, Eck MJ: The T790M mutation in EGFR kinase causes drug resistance by increasing the affinity for ATP. Proc Natl Acad Sci U S A 2008, 105(6):2070-2075.
65. Green MR: Targeting targeted therapy.
N Engl J Med 2004, 350(21):2191-2193.
66. Hirsch FR, Scagliotti GV, Langer CJ, Varella-Garcia M, Franklin WA: Epidermal growth factor family of receptors in preneoplasia and lung cancer: perspectives for targeted therapies. Lung Cancer 2003, 41 Suppl 1:S29-42.
67. Cappuzzo F HF, Rossi E, Bartolini S, Ceresoli GL, Bemis L, Haney J, Witta S, Danenberg K, Domenichini I, Ludovini V, Magrini E, Gregorc V, Doglioni C, Sidoni A, Tonato M, Franklin WA, Crino L, Bunn PA Jr, Varella-Garcia M: Epidermal growth factor receptor gene and protein and gefitinib sensitivity in non-small-cell lung cancer. J Natl Cancer Inst 2005, 97(9):643-655.
68. Hirsch FR, Varella-Garcia M, Bunn PA, Jr., Di Maria MV, Veve R, Bremmes RM, Baron AE, Zeng C, Franklin WA: Epidermal growth factor receptor in non-small-cell lung carcinomas: correlation between gene copy number and protein expression and impact on prognosis. J Clin Oncol 2003, 21(20):3798-3807.
69. Tsao MS SA, Cutz JC, Zhu CQ, Kamel-Reid S, Squire J, Lorimer I, Zhang T, Liu N, Daneshmand M, Marrano P, da Cunha Santos G, Lagarde A, Richardson F, Seymour L, Whitehead M, Ding K, Pater J, Shepherd FA: Erlotinib in lung cancer - molecular and clinical predictors of outcome. N Engl J Med 2005, 353(2):133-144.
70. Cappuzzo F V-GM, Shigematsu H, Domenichini I, Bartolini S, Ceresoli GL, Rossi E, Ludovini V, Gregorc V, Toschi L, Franklin WA, Crino L, Gazdar AF, Bunn PA Jr, Hirsch FR: Increased HER2 gene copy number is associated with response to gefitinib therapy in epidermal growth factor receptor-positive non-small-cell lung cancer patients. J Clin Oncol 2005, 23(22):5007-5018.
71. Shepherd FA: Molecular selection of patients for first-line treatment of advanced non-small-cell lung cancer with epidermal growth factor inhibitors: not quite ready for prime time. J Clin Oncol 2008, 26(15):2426-2427.
72. Takano T OY: Erlotinib in lung cancer. N Engl J Med 2005, 353(16):1739-1741.
73. Hirsch FR, Bunn PA, Jr.: EGFR testing in lung cancer is ready for prime time. Lancet Oncol 2009, 10(5):432-433.
74. Johnson BE, Janne PA: Selecting patients for epidermal growth factor receptor inhibitor treatment: A FISH story or a tale of mutations? J Clin Oncol 2005, 23(28):6813-6816.
75. Pugh TJ, Bebb G, Barclay L, Sutcliffe M, Fee J, Salski C, O'Connor R, Ho C, Murray N, Melosky B, English J, Vielkind J, Horsman D, Laskin JJ, Marra MA: Correlations of EGFR mutations and increases in EGFR and HER2 copy number to gefitinib response in a retrospective analysis of lung cancer patients. BMC Cancer 2007, 7:128.
76. Esteban JA, Salas M, Blanco L: Fidelity of phi 29 DNA polymerase. Comparison between protein-primed initiation and DNA polymerization. J Biol Chem 1993, 268(4):2719-2726.
77. Paez JG LM, Beroukhim R, Lee JC, Zhao X, Richter DJ, Gabriel S, Herman P, Sasaki H, Altshuler D, Li C, Meyerson M, Sellers WR: Genome coverage and sequence fidelity of phi29 polymerase-based multiple strand displacement whole genome amplification. Nucleic Acids Res 2004, 32:e71.
78. Tzvetkov MV, Becker C, Kulle B, Nurnberg P, Brockmoller J, Wojnowski L: Genome-wide single-nucleotide polymorphism arrays demonstrate high fidelity of multiple displacement-based whole-genome amplification. Electrophoresis 2005, 26(3):710-715.
79. Affymetrix webpage [http://www.affymetrix.com/]
80. Nimblegen webpage [http://www.nimblegen.com]
81. Weir B, Zhao X, Meyerson M: Somatic alterations in the human cancer genome. Cancer Cell 2004, 6(5):433-438.
82. Weir BA, Woo MS, Getz G, Perner S, Ding L, Beroukhim R, Lin WM, Province MA, Kraja A, Johnson LA, Shah K, Sato M, Thomas RK, Barletta JA, Borecki IB, Broderick S, Chang AC, Chiang DY, Chirieac LR, Cho J, Fujii Y, Gazdar AF, Giordano T, Greulich H, Hanna M, Johnson BE, Kris MG, Lash A, Lin L, Lindeman N, Mardis ER, McPherson JD, Minna JD, Morgan MB, Nadel M, Orringer MB, Osborne JR, Ozenberger B, Ramos AH, Robinson J, Roth JA, Rusch V, Sasaki H, Shepherd F, Sougnez C, Spitz MR, Tsao MS, Twomey D, Verhaak RG, Weinstock GM, Wheeler DA, Winckler W, Yoshizawa A, Yu S, Zakowski MF, Zhang Q, Beer DG, Wistuba, II, Watson MA, Garraway LA, Ladanyi M, Travis WD, Pao W, Rubin MA, Gabriel SB, Gibbs RA, Varmus HE, Wilson RK, Lander ES, Meyerson M: Characterizing the cancer genome in lung adenocarcinoma. Nature 2007, 450(7171):893-898.
83. Mullighan CG, Goorha S, Radtke I, Miller CB, Coustan-Smith E, Dalton JD, Girtman K, Mathew S, Ma J, Pounds SB, Su X, Pui CH, Relling MV, Evans WE, Shurtleff SA, Downing JR: Genome-wide analysis of genetic alterations in acute lymphoblastic leukaemia. Nature 2007, 446(7137):758-764.
84. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 2008, 455(7216):1061-1068.
85. Pugh TJ, Delaney AD, Farnoud N, Flibotte S, Griffith M, Li HI, Qian H, Farinha P, Gascoyne RD, Marra MA: Impact of whole genome amplification on analysis of copy number variants. Nucleic Acids Res 2008, 36(13):e80.
86. Shah S, Morin R, Khattra J, Prentice L, Pugh TJ, Burleigh A, Delaney A, Gelmon K, Guliany R, Holt RA, Jones SJ, Sun M, Moore R, Teschendorff A, Tse K, Turashivili G, Varhol R, Warren R, Watson P, Zhao Y, Caldas C, Huntsman D, Hirst M, Marra M, Aparicio S: Mutational evolution of a lobular breast tumour, profiled by whole-transcriptome and whole-genome next generation sequencing. Submitted 2009.
87. Ley TJ, Mardis ER, Ding L, Fulton B, McLellan MD, Chen K, Dooling D, Dunford-Shore BH, McGrath S, Hickenbotham M, Cook L, Abbott R, Larson DE, Koboldt DC, Pohl C, Smith S, Hawkins A, Abbott S, Locke D, Hillier LW, Miner T, Fulton L, Magrini V, Wylie T, Glasscock J, Conyers J, Sander N, Shi X, Osborne JR, Minx P, Gordon D, Chinwalla A, Zhao Y, Ries RE, Payton JE, Westervelt P, Tomasson MH, Watson M, Baty J, Ivanovich J, Heath S, Shannon WD, Nagarajan R, Walter MJ, Link DC, Graubert TA, DiPersio JF, Wilson RK: DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature 2008, 456(7218):66-72.
88. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 2008, 5(7):621-628.
89. Fullwood MJ, Wei CL, Liu ET, Ruan Y: Next-generation DNA sequencing of paired-end tags (PET) for transcriptome and genome analyses. Genome Res 2009, 19(4):521-532.
90. Morin R, Bainbridge M, Fejes A, Hirst M, Krzywinski M, Pugh T, McDonald H, Varhol R, Jones S, Marra M: Profiling the HeLa S3 transcriptome using randomly primed cDNA and massively parallel short-read sequencing. Biotechniques 2008, 45(1):81-94.
91. Stamos J, Sliwkowski MX, Eigenbrot C: Structure of the epidermal growth factor receptor kinase domain alone and in complex with a 4-anilinoquinazoline inhibitor. J Biol Chem 2002, 277(48):46265-46272.
92. Cn3D Homepage [http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.shtml]
93. Arriola E, Lambros MB, Jones C, Dexter T, Mackay A, Tan DS, Tamber N, Fenwick K, Ashworth A, Dowsett M, Reis-Filho JS: Evaluation of Phi29-based whole-genome amplification for microarray-based comparative genomic hybridisation. Lab Invest 2007, 87(1):75-83.
94. Bredel M, Bredel C, Juric D, Kim Y, Vogel H, Harsh GR, Recht LD, Pollack JR, Sikic BI: Amplification of whole tumor genomes and gene-by-gene mapping of genomic aberrations from limited sources of fresh-frozen and paraffin-embedded DNA. J Mol Diagn 2005, 7(2):171-182.
Chapter 2. Correlations of EGFR mutations and increases in EGFR and HER2 copy number to gefitinib response in a retrospective analysis of lung cancer patients¹

This chapter documents my first investigation of somatic mutations and gene amplifications that may predict the response of lung cancer to treatment with tyrosine kinase inhibitors. At the time this study was initiated, EGFR mutations had just been described in lung cancer and a strong correlation had been observed with response to the EGFR inhibitor gefitinib (Iressa, AstraZeneca). These initial studies were in small patient populations, and subsequent studies suggested that amplification of EGFR and similar amplifications of the closely related HER2 gene were more accurate predictors of clinical outcome. In this chapter, I tested the ability of EGFR mutations and increases in EGFR and HER2 copy number to predict response to gefitinib in a local population of lung cancer patients. The finding that none of the features were diagnostic of response suggested that genes other than EGFR and HER2 may harbour abnormalities predictive of response. While this finding contradicted contemporary reports at the time these data were published [1], similar results have since been reported, and a debate continues regarding the predictive value of these biomarkers [2]. The samples for this project were archival tissue blocks routinely used to assess cellular morphology for diagnosis that often yield degraded, chemically modified nucleic acids.
For this study, I adapted methods to extract genomic information from these materials, including laser microdissection of cancer cells, assessment of DNA quality, PCR amplification of degraded DNA, and use of a published scoring system for interpreting the results of fluorescent in situ hybridization.

¹ A version of this chapter has been published. Pugh T.J., Bebb G., Barclay L., Sutcliffe M., Fee J., Salski C., O'Connor R., Ho C., Murray N., Melosky B., English J., Vielkind J., Horsman D., Laskin J.J., Marra M.A. BMC Cancer. 2007 Jul 13;7:128.

2.1. Introduction

Lung cancer overall is the leading cause of cancer-related death in North America, with 85% of patients eventually succumbing to the disease [3]. The five-year survival rate for these cancers is low (16%) compared to other cancers [3], and there exists a major need for additional therapeutic strategies in its treatment. EGFR has been identified as a potential therapeutic target as protein over-expression is observed in 40-80% of late stage lung tumours and can confer a malignant phenotype in cultured cells [4]. Health Canada and the United States Food and Drug Administration initially approved the use of two EGFR-targeted molecules, gefitinib (“Iressa” from AstraZeneca) and erlotinib (“Tarceva” from Genentech/Roche), in the second- and third-line treatment of lung cancer. Both of these drugs were designed to reversibly bind the ATP-binding pocket of the EGFR tyrosine-kinase domain, thereby inhibiting autophosphorylation and stimulation of downstream signalling pathways, resulting in inhibition of proliferation, delayed cell cycle progression, and increased apoptosis. Despite being marketed as “EGFR tyrosine kinase inhibitors”, these drugs have affinity for 18-26 protein kinases in addition to EGFR [5, 6]. In international phase II trials, ~28% of Japanese patients responded to gefitinib versus ~10% of patients of European descent, as assessed by symptom improvement and tumour shrinkage [7, 8].
These population-specific findings have suggested that responses to these drugs have genetic components, although regional environmental factors have not been discounted. Two somatic mutations in the EGFR tyrosine-kinase domain have been correlated with reduced tumour size as a result of treatment with gefitinib [9-13]. These mutations were commonly found in patients fitting the responsive profile observed in initial and subsequent clinical studies [7, 8, 14], specifically female non-smokers of Asian descent. In a review of sixteen studies, EGFR mutations clustering around the tyrosine kinase domain ATP-binding pocket were observed in 151 of 191 gefitinib responders (79.1%) and 11 of 19 erlotinib responders (57.9%) [15]. Confounding the model of mutation-mediated drug response is the finding that 40 of 191 gefitinib responders (20.9%) and 8 of 19 erlotinib responders (42.1%) lacked EGFR mutations [15]. Conversely, EGFR mutations were seen in 40 of 355 gefitinib non-responders (11.3%) and 16 of 117 erlotinib non-responders (13.7%) [15]. These results suggested that somatic EGFR mutations are neither necessary nor sufficient for response to EGFR inhibitors. This suggestion is supported by the findings of a prospective trial of gefitinib in which 4 of 16 patients selected for tumours with EGFR mutations did not respond to gefitinib [16].

An increasing number of studies examining the tumours of patients treated with gefitinib and erlotinib have correlated increased EGFR gene copy number with response [13, 17, 18]. Data analysis from a recent phase III trial of erlotinib has supported these observations [19]. In this trial, the response rate among patients whose tumours showed amplification of EGFR was significantly higher than among those without this characteristic (20% vs. 2%) [19].
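The pooled counts quoted above from the review of sixteen studies [15] can be re-derived with a few lines of arithmetic. The following is a minimal sketch and not part of the original analysis: the dictionary layout and the function names (`percent`, `mutation_summary`) are illustrative assumptions, while the counts themselves are those reported in the text.

```python
# Counts quoted in the text from the review of sixteen studies [15]:
# (patients with an EGFR mutation, all patients in the group).
REVIEW_COUNTS = {
    ("gefitinib", "responders"): (151, 191),
    ("erlotinib", "responders"): (11, 19),
    ("gefitinib", "non-responders"): (40, 355),
    ("erlotinib", "non-responders"): (16, 117),
}


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


def mutation_summary(counts):
    """Percentage of each group with and without an EGFR mutation."""
    summary = {}
    for group, (mutated, total) in counts.items():
        summary[group] = {
            "mutated_pct": percent(mutated, total),
            "wild_type_pct": percent(total - mutated, total),
        }
    return summary


if __name__ == "__main__":
    for (drug, group), row in mutation_summary(REVIEW_COUNTS).items():
        print(f"{drug:9s} {group:14s} mutated {row['mutated_pct']:5.1f}%  "
              f"wild type {row['wild_type_pct']:5.1f}%")
```

Responders without a mutation and non-responders with a mutation both occur at non-zero frequency in every group, which is the arithmetic behind the statement that EGFR mutations are neither necessary nor sufficient for response.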
Multivariate analysis in that trial revealed that only EGFR expression and increased copy number were associated with erlotinib response, and no correlation between single-nucleotide mutations and response was found [19]. Increased HER2/Neu gene copy number has also been associated with response, particularly in the presence of increased EGFR copy number, EGFR overexpression or EGFR mutation [18]. Other studies have shown that tumours co-expressing HER2 and EGFR have a poor prognosis [20, 21], suggesting that there is a relationship between these genes that drives pathogenesis and which may be targeted by gefitinib. Additional data are needed to explore the ability of these molecular features to predict response to EGFR-targeted tyrosine-kinase inhibitors [22, 23].

Recently, our clinical collaborators at the BC Cancer Agency confirmed that Asian ethnicity predicts response to gefitinib in a Canadian setting, in a population in which 38% of patients are of Asian descent [14]. To test whether gefitinib response could have been predicted by somatic EGFR mutation, EGFR amplification, or HER2 amplification, we retrospectively analyzed archival diagnostic samples from this cohort of patients.

2.2. Materials and methods

2.2.1. Patient population and assessment of response

Samples for molecular analysis were drawn from patients who received gefitinib through the Extended Access Program at the BC Cancer Agency as reported by Ho et al. [14], with ethics approval from the BC Cancer Agency Ethics Review Board. The criteria for enrolment in the program were the presence of histologically or cytologically confirmed locally advanced or metastatic NSCLC and having received prior standard systemic or radiation therapy or being ineligible for standard treatment. Patients received gefitinib following standard systemic or radiation therapy, and response was assessed radiographically according to the SWOG modification of the WHO criteria [24].
In brief, complete response (CR) was defined as a complete disappearance of disease; partial response (PR) as a decrease of >50% in the sum of the products of the maximal perpendicular dimensions of measurable lesions; stable disease (SD) as the absence of both new lesions and progression of current lesions; and progressive disease (PD) as an increase of >50% in the sum of the products of the maximal perpendicular dimensions of measurable lesions, the development of new lesions, recurrence of lesions that had previously disappeared, or failure to return for evaluation because of symptomatic deterioration.

2.2.2. Laser microdissection and DNA extraction

To identify tumour cell populations for laser microdissection (LM) or manual scrape, malignant cells (cytology specimens) or tissues (paraffin-embedded biopsies) were reviewed by a single reference pathologist. Because the DNA extracted from formalin-fixed, paraffin-embedded tissue blocks is of variable quality, the DNA from these sources was characterized prior to microdissection. DNA was extracted from a full 8 µm section of each block using the “Laser-Microdissected Tissues” protocol of the QIAamp spin-column kit (QIAgen, Valencia, CA). The digestion volumes were increased five-fold and three final 30 µL elutions of TE (10:0.1) were performed. The DNA was quantified by PicoGreen assay (Invitrogen, Carlsbad, CA) and observed on a 2% agarose gel stained with ethidium bromide. For a block to qualify for LM, the presence of DNA fragments >2000 bp was required (Figure 2.1). Forty archival samples from 37 patients were suitable for LM and yielded enough DNA of sufficient quality for PCR and sequencing. Laser microdissection of pathologist-identified cells was performed on serial sections of paraffin blocks using either an Arcturus PixCell infra-red laser-capture device or a Molecular Machines and Industries SLµCUT UV laser microdissection instrument.
Dissected cells were isolated onto the adhesive caps of 1.0 mL microcentrifuge tubes (Arcturus) (Figure 2.2). Material from cytology slides was scraped with a razor blade directly into microcentrifuge tubes and DNA was extracted as described above.

2.2.3. PCR and sequencing of EGFR exons 18-24

Exons 18-24, coding for the tyrosine kinase domain of EGFR, were amplified by PCR and sequenced. PCR primers were designed using human genome reference sequence acquired from the UCSC Genome Browser [25, 26] (hg17_refGene_NM_005228). Primers were designed to anneal within introns at least 40 bp away from exon splice sites using the Primer3 program [27]. Sequencing tags were added to all PCR primers for downstream sequencing, and the tagged primers were experimentally optimized for annealing temperature. The DNA sequences and annealing temperatures of all seven EGFR primer pairs are listed in Table 2-1. PCR reactions were performed in 20 µL and consisted of: 2.0 µL 10X Pfx Amplification Buffer (Invitrogen), 0.4 µL 50 mM MgSO4 (Invitrogen), 0.4 µL 10 mM dNTPs (from 100 mM stock, Invitrogen), 1 µL each of 10 µM forward and reverse primers (Invitrogen), 2.0 µL 10X PCRx Enhancer (Invitrogen), and 0.1 µL 2.5 U/µL Pfx Polymerase (Invitrogen), with 5-10 ng template and distilled water added up to the final volume. Reactions were cycled on an MJ Research Tetrad at 95ºC for 5 minutes followed by 35 cycles of: 95ºC for 30 seconds, annealing temperature for 15 seconds (Table 2-1), and 70ºC for 2 minutes. PCR products were purified using the Ampure magnetic-bead-based PCR product purification system (Agencourt, Beverly, MA). Sequencing of PCR products was performed with standard chemistries in use by the production sequencing team at the BC Cancer Agency Michael Smith Genome Sciences Centre.
Briefly, “forward” and “reverse” 1/24X reactions contained 0.02 µL of 100 µM primer, 0.33 µL BigDye Ready Reaction Mix v3.1 (ABI), 0.4 µL 15X Big Dye Buffer (50% by volume Big Dye v3.1 Sequencing Buffer (ABI), 50% by volume Tris-EDTA), 0.02 µL distilled water, and 2 µL of purified PCR product. Reactions were cycled 50 times (96ºC for 10 seconds, annealing temperature for 5 seconds, 60ºC for 3 minutes) with annealing temperatures of 52ºC for forward and 43ºC for reverse sequencing primers. All reactions were precipitated in a final concentration of 70% ethanol and 10 mM EDTA and spun at 2750g for 30 minutes to pellet sequencing products. The pellet was washed with 30 µL of 70% ethanol and air-dried before resuspension in 10 µL distilled water. Sequencing reaction products were analyzed on automated ABI 3730XL sequencers and traces were analyzed using the Mutation Surveyor software package (SoftGenetics, State College, PA) and the Phred/Phrap/Consed suite [28, 29]. All sequences were compared against a reference human genome sequence (NCBI accession NM_005228.3) to identify mutations and polymorphisms. Known polymorphisms recorded in the Single Nucleotide Polymorphism database (dbSNP) [30, 31] were identified by their ‘rs’ numbers. To further validate results, PCR and sequencing reactions were repeated for all samples in which an apparent mutation was observed. Correlations between clinical features and EGFR mutations were assessed using the two-sided Fisher’s exact test.

2.2.4. Copy number analysis of EGFR and HER2

To assess EGFR and HER2 copy number, fluorescent in situ hybridization (FISH) was conducted by the BC Cancer Agency Pathology Department (authors CS and DH) using PathVysion EGFR and HER-2 DNA Probe kits (Vysis, Downers Grove, IL). Formalin-fixed paraffin-embedded tissues were prepared in serial 6 µm sections on positively charged Colorfrost/Plus microscope slides (Fisher Scientific, Hampton, NH).
One section was H&E stained and tumour populations were identified by a pathologist. Hybridization areas were marked with a diamond-tipped pencil on the back of each slide. Sections were incubated overnight at 56ºC, dewaxed by exposure to xylene for 10 minutes, dehydrated in 100% ethanol for 5 minutes, and air-dried for 2-4 minutes on a slide warmer set to 37-45ºC. The slides were immersed in 0.2N HCl for 20 minutes, rinsed in distilled water for 10 minutes, and incubated in 1M NaSCN pre-treatment solution (Vysis) for 30 minutes at 80ºC. After rinsing with room temperature water for 3 minutes, sections were digested with pepsin (0.25 mg/mL in 0.01N HCl) for 15-18 minutes at 37ºC, and rinsed with room temperature water for 5 minutes. Tissue morphology was assessed by phase contrast microscopy to ensure sufficient digestion of the collagen matrix. Slides were dehydrated with two 4-minute treatments of 100% ethanol and air-dried for 2-4 minutes on a slide warmer set to 37-45ºC. The EGFR/CEP7 or HER2/CEP17 probe mixture (2.5-3 µL) was applied to the hybridization area marked on the slide and covered with a glass coverslip. Edges were sealed with rubber cement. The slides were incubated at 73ºC for 5 minutes and then at 37ºC overnight, first to co-denature the probe and chromosomal DNA and then to allow hybridization. The rubber-cemented coverslips were then removed and the slides were placed in a post-hybridization wash solution (2X SSC, 0.3% NP40) at 72ºC for 2 minutes. After rinsing in 1X PBS, the slides were air-dried in the dark for 30-60 minutes. DAPI-1 counterstain (4 µL; Vysis) was applied to the hybridization area and a glass coverslip was fixed in place. FISH analysis was performed by counting the number of signals from each probe in forty tumour nuclei on one slide from each patient. Two approaches were used to interpret raw FISH probe counts and define gene amplification.
In the first approach, the total number of EGFR or HER2 signals was divided by the total number of centromeric CEP7 or CEP17 signals and a gene/CEP ratio reported for the population of forty cells. Samples with a gene/CEP ratio ≥ 2 were defined as displaying gene amplification. The second approach applies published criteria [17] to raw FISH counts to classify patients into six strata according to the frequency of cells with specific gene copy numbers within the tumour population. The six strata, as published by Cappuzzo et al. [17] and applied in our study, were:
1) disomy (≤ 2 copies in > 90% of cells);
2) low trisomy (≤ 2 copies in ≥ 40% of cells, 3 copies in 10%-40% of cells, ≥ 4 copies in < 10% of cells);
3) high trisomy (≤ 2 copies in ≥ 40% of cells, 3 copies in ≥ 40% of cells, ≥ 4 copies in < 10% of cells);
4) low polysomy (≥ 4 copies in 10%-40% of cells);
5) high polysomy (≥ 4 copies in ≥ 40% of cells); and
6) gene amplification (defined by presence of tight EGFR gene clusters and a ratio of EGFR gene to chromosome of ≥ 2 or ≥ 15 copies of EGFR per cell in ≥ 10% of analyzed cells).
The first approach is commonly used in practical clinical assessment of gene copy number and generally reflects the average copy number of the cell population examined. The second approach attempts to capture the degree to which gene amplification defines a cell population. This second method was published by one of the first studies to associate increased EGFR copy number with gefitinib response [17].

2.3. Results

2.3.1. Patient population

Our clinical colleagues at the BC Cancer Agency previously documented the clinical characteristics of a population of 61 patients treated with gefitinib at their clinic between April 2002 and May 2004 [14]. In that study, patients of Asian descent with adenocarcinoma displayed a preferential response to gefitinib.
Diagnostic samples from 39 of these individuals were suitable for microdissection and yielded DNA of sufficient quality for PCR and sequencing and/or copy number analysis by FISH. Microdissected materials were used to avoid masking of cancer-specific features by contaminating normal material. Figure 2.2 demonstrates the heterogeneous nature of a metastatic lung tumour and the ability of laser microdissection to separate tumour cells from surrounding normal tissue. The patient subset consisted of 23 females (59%), 17 patients of Asian descent (44%), 12 non-smokers (31%), 34 tumours of adenocarcinoma subtype (87%), and a distribution between partial response/stable disease/progressive disease of 6/14/17 (15%/36%/44%). Two patients lacked a response assessment. The clinical characteristics and molecular status of these patients are described in Table 2-2.

2.3.2. EGFR tyrosine-kinase domain mutations

We studied the DNA sequence of the EGFR tyrosine kinase domain in our patient samples as this domain was previously associated with increased gefitinib sensitivity [9-11]. In eight of thirty-eight tumours assessed we found ten non-unique mutations, five of which have been previously correlated with response (Figure 2.3). Four of these mutations were in-frame deletions or substitutions within exon 19, all of which impacted L747-A750 (Table 2-3) and retained the ATP-binding lysine moiety. All four patients with these mutations were of Asian descent. Two of them were females responsive to gefitinib, of whom one was a non-smoker and one had unknown smoking history. The two non-responders were non-smokers, one female and one male. We resequenced the normal tissue remaining after microdissection in two of these samples and found no mutations, consistent with previous reports that these mutations are somatic. The fifth mutation was a homozygous missense point mutation within exon 21, resulting in an L858R substitution (Table 2-4).
This patient was a female non-smoker of Asian descent who did not respond to gefitinib. Three missense and two synonymous point mutations were detected in exon 20, four of which have been previously observed in other patients (Table 2-4). One of these mutations was in a tumour from one of the drug-responsive patients who also had an exon 19 deletion. The exon 20 T790M mutation previously documented to confer resistance to gefitinib [32] was not observed. We were unable to validate the previously reported relationships between response and the presence of exon 19 mutations (p = 0.0889) or exon 21 mutations (we observed a single mutation in a non-responder). When patients exhibiting stable disease were counted among the responders (“disease control”), a correlation between exon 19 deletions and response was still not observed (p = 1.00). The presence of exon 19 mutations was correlated with Asian ethnicity (p = 0.0207) and non-smoking status (p = 0.0406) but not with female sex (p = 0.633) or adenocarcinoma histology (p = 1.00). When taken as a group, exon 20 mutations showed no correlation with response (p = 0.0889), female sex (p = 0.633), non-smoking status (p = 1.00), Asian ethnicity (p = 1.00), adenocarcinoma subtype (p = 1.00), or disease control (p = 0.104).

2.3.3. EGFR tyrosine-kinase domain polymorphisms

We detected two previously documented single nucleotide polymorphisms (dbSNP rs10251977, rs17290643). Exon 20 harbours the synonymous G/A SNP rs10251977, while exon 23 contains the synonymous T/C SNP rs17290643. Neither of these variants results in an infrequently used codon (codons per thousand for each allele: rs10251977 34.2:12.3, rs17290643 13.1:18.9) [33]. There was no correlation between these alleles and gefitinib response in our patient population.

2.3.4. EGFR and HER2 copy number analysis

Gene copy number was assessed in our patient tumour samples as previous studies have shown a correlation between copy number increases in EGFR [13, 17, 19] or HER2 [18] and gefitinib response. Two techniques were used to interpret the FISH data for this analysis (Methods). Increases in EGFR copy number, defined as an EGFR/CEP7 ratio ≥ 2.0, were observed in ten of twenty-six tumours (Table 2-5). Of these ten, three also displayed increased HER2 copy number (HER2/CEP17 ratio ≥ 2.0). HER2 amplification in the absence of EGFR amplification was seen in three additional tumours. Examples of the varying degrees of amplification of these genes are shown in Figure 2.4. Increased EGFR copy number did not correlate with the presence of mutation in either EGFR exon 19 (p = 0.130) or exon 20 (p = 1.00); increased HER2 copy number (p = 0.644); sex (p = 0.457); Asian ethnicity (p = 0.688); smoking status (p = 0.380); adenocarcinoma histology (p = 0.538); or response to gefitinib (p = 1.00). When patients with stable disease were counted among the responders (“disease control”), no correlation with response was observed (p = 0.210). Likewise, increased HER2 copy number did not correlate with the presence of mutation in either EGFR exon 19 (p = 1.00) or exon 20 (p = 1.00); increased EGFR copy number (p = 0.644); sex (p = 0.160); Asian ethnicity (p = 0.645); smoking status (p = 0.351); adenocarcinoma histology (p = 1.00); gefitinib response (p = 1.00); or disease control (p = 0.114). Tumours were also stratified by EGFR and HER2 copy number using the criteria proposed by Cappuzzo et al. [17] (Table 2-5). Seven tumours were identified as FISH+ for EGFR amplification, and four tumours were identified as FISH+ for HER2 amplification (high polysomy or gene amplification).
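The two interpretations of raw FISH counts described in Methods (section 2.2.4) can be sketched in code. This is a minimal illustration under stated assumptions, not the scoring procedure used by the pathology laboratory: the function names are hypothetical, the order in which the strata are tested is our own choice (the published criteria are not written as a decision procedure), the "tight gene clusters" criterion is reduced to a flag supplied by the observer, and the amplification rule is read as "(clusters and ratio ≥ 2) or (≥ 15 copies in ≥ 10% of cells)". The example counts are invented.

```python
def gene_cep_ratio(gene_counts, cep_counts):
    """First approach: total gene signals divided by total centromere signals."""
    return sum(gene_counts) / sum(cep_counts)


def cappuzzo_stratum(gene_counts, cep_counts, tight_clusters=False):
    """Second approach: assign one of the six strata from per-cell gene counts."""
    n = len(gene_counts)

    def pct(predicate):
        # Percentage of scored nuclei whose gene copy number satisfies predicate.
        return 100.0 * sum(1 for c in gene_counts if predicate(c)) / n

    le2 = pct(lambda c: c <= 2)
    eq3 = pct(lambda c: c == 3)
    ge4 = pct(lambda c: c >= 4)
    ge15 = pct(lambda c: c >= 15)
    ratio = gene_cep_ratio(gene_counts, cep_counts)

    # Assumed precedence: most abnormal category first.
    if (tight_clusters and ratio >= 2) or ge15 >= 10:
        return "gene amplification"
    if ge4 >= 40:
        return "high polysomy"
    if ge4 >= 10:
        return "low polysomy"
    if le2 >= 40 and eq3 >= 40:
        return "high trisomy"
    if le2 >= 40 and eq3 >= 10:
        return "low trisomy"
    if le2 > 90:
        return "disomy"
    return "unclassified"  # counts that fit none of the listed strata


def is_fish_positive(stratum):
    """FISH+ is high polysomy or gene amplification."""
    return stratum in ("high polysomy", "gene amplification")


if __name__ == "__main__":
    # Invented example: 12 nuclei with nine gene copies, 28 with two; CEP = 2 in all.
    gene = [9] * 12 + [2] * 28
    cep = [2] * 40
    print(round(gene_cep_ratio(gene, cep), 2), cappuzzo_stratum(gene, cep))
```

With these invented counts the gene/CEP ratio is 2.05, which passes the ratio threshold, while the stratification returns low polysomy, which is not FISH+. A minority of highly amplified nuclei can therefore raise the population average without meeting the per-cell frequency criteria.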
Only one tumour was FISH+ for both EGFR and HER2, and this was the only sample to meet the EGFR “gene amplification” criteria as proposed by Cappuzzo et al. [17] (≥ 15 copies in ≥ 10% of cells). FISH+ status corresponded with an EGFR/CEP7 ratio ≥ 2.0 in seven of ten samples. FISH+ status corresponded with a HER2/CEP17 ratio ≥ 2.0 in four of six samples. There was no correlation between EGFR FISH+ status and mutation of either EGFR exon 19 (p = 0.0543) or exon 20 (p = 0.283); female sex (p = 0.378); Asian ethnicity (p = 1.00); smoking status (p = 1.00); adenocarcinoma histology (p = 0.167); response to gefitinib (p = 0.552); or disease control (p = 0.653). Likewise, HER2 FISH+ status did not correlate with the presence of mutation of either EGFR exon 19 (p = 1.00) or exon 20 (p = 0.544); increased EGFR copy number (p = 1.00); sex (p = 0.593); Asian ethnicity (p = 0.593); smoking status (p = 1.00); adenocarcinoma histology (p = 0.408); response to gefitinib (p = 0.437); or disease control (p = 0.239).

2.4. Discussion

In DNA sequencing studies using patient samples, contaminating normal tissue has the potential to mask tumour-specific features, particularly in cases of highly heterogeneous metastatic deposits. To examine somatic features specific to tumours, we employed laser microdissection to isolate cancer cells from surrounding normal tissue. The selectivity of this technique was demonstrated by the identification of EGFR exon 19 deletions in the tumour populations of two patient samples but not in the surrounding normal tissue remaining after microdissection. As sequencing and cell isolation technologies continue to mature, there is potential to further dissect genetic heterogeneity within tumour populations, perhaps eventually to the resolution of single cells.
Such efforts may uncover resistance alleles that pre-exist at low frequencies prior to treatment and then come to predominate in the tumour population as a consequence of selective pressure applied by therapy.

In the evolving area of biomarkers predictive of response to EGFR tyrosine kinase inhibitors, two hypotheses have arisen, each claiming that a specific alteration of EGFR is predictive of response. One hypothesis is that mutations within the EGFR tyrosine kinase domain targeted by these drugs are indicative of a capability to respond [9-11]. The second hypothesis is that the presence of increased gene copy number of EGFR or HER2 is a better predictor of response [17-19]. When investigating the relevance of these features to our own population of lung cancer patients treated with gefitinib, our study detected all of these features occurring both independently and coincidentally in microdissected tumour cells. Tumours from four of thirty-eight patients contained a form of the exon 19 L747-A750 deletion and one tumour harboured the exon 21 L858R point mutation. Two of the patients with exon 19 deletions were responsive to gefitinib and were also found to have increased EGFR copy number. In the remaining four responders, EGFR mutations or gene amplifications that others previously correlated with gefitinib response [9-11] were not observed. These data are consistent with the notion that tumours reliant on amplification of a mutant EGFR allele may be particularly susceptible to inhibition by gefitinib. However, responders without apparent gefitinib-sensitising EGFR alterations may have shown characteristics of response even without treatment or may have responded due to an interaction between gefitinib and a protein other than EGFR [34, 35]. To identify alternative genetic features mediating drug response, candidate genes influenced by receptor tyrosine kinase inhibitors need to be identified and studied in patients receiving these drugs.
In this study, we compared two methods of interpreting FISH data and defining increased gene copy number. One technique defined gene amplification as a gene/centromere (e.g. EGFR/CEP7) ratio ≥ 2.0, while the second technique defined “FISH+” status from the stratification of different gene/centromere ratios into varying degrees of polysomy [17]. While both of these methods identified seven tumours with EGFR amplification, the EGFR/CEP7 ratio ≥ 2 method identified an additional three tumours which were classified as “Low Polysomy” under the Cappuzzo criteria (Methods). Although these criteria were not originally designed for this purpose, we also applied them [17] to our HER2 FISH data. Again we saw an overlap of the samples identified by both methods as having increased HER2 copy number. However, as with EGFR, the HER2/CEP17 ratio method identified samples not captured by the stratification method but with ratios near the threshold of 2 for amplification. None of these patients responded to gefitinib. These results suggest a need for further refinement of criteria for defining amplification and may reflect limits on the ability of FISH to define precise copy number. Our experience underscores the difficulty in capturing the heterogeneous nature of a tumour population with a single measurement. An understanding of the biological implications of EGFR gene amplification is needed to refine the predictive specificity of these tests.

2.5. Conclusion

Recently, several studies have correlated gefitinib response with either EGFR mutation [9-13] or increased EGFR copy number [13, 17-19], but the true predictive value of these features is still under debate [22, 23]. While we observed EGFR DNA sequence mutations and increases in EGFR and HER2 gene copy number in several of our specimens, we were unable to statistically correlate the presence of any of these molecular features with response.
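The two-sided Fisher's exact test named in Methods (section 2.2.3) can be written out directly from the hypergeometric distribution. The following is a minimal sketch using only the Python standard library and is not the software used for the thesis analysis. The function name and the 2x2 counts in the example are assumptions chosen for illustration, since the chapter reports p-values rather than the underlying contingency tables.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def weight(x):
        # Number of ways to place x in the top-left cell with all margins fixed.
        return comb(row1, x) * comb(row2, col1 - x)

    observed = weight(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum every table that is no more probable than the observed one.
    extreme = sum(weight(x) for x in range(lowest, highest + 1)
                  if weight(x) <= observed)
    return extreme / comb(total, col1)


if __name__ == "__main__":
    # Illustrative counts (assumed): rows = mutation present / absent,
    # columns = responder / non-responder.
    print(round(fisher_exact_two_sided(2, 2, 3, 28), 4))  # 0.0889
```

In this assumed table, half of the four mutation-positive patients respond compared with roughly one in ten mutation-negative patients, yet the p-value stays above 0.05. With so few mutation-positive cases, even a large difference in response rate is difficult to separate from chance.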
While these findings may reflect a lack of statistical power owing to our small sample size, our study differs from others in our use of a population with a large Asian component in a North American setting and in the enrichment of tumour cells using laser microdissection. Even though EGFR status was not a single predictive factor of drug response in our small sample set, its assessment can increase the likelihood of selecting patients likely to respond to these drugs. To improve the sensitivity of screening for potential responders, additional features other than EGFR that mediate drug response need to be identified. Recently, activating point mutations in KRAS [36], amplification of the oncogene MET [37] and loss of the tumour suppressor PTEN [38] have been found to confer resistance to EGFR inhibitors due to activation of downstream pathways independent of EGFR signalling. Therefore, permutations of regulators of EGFR signalling may lead to TKI resistance in the absence of EGFR resistance mutations [39]. In the absence of these features, selection of patients likely to respond to TKIs will continue to be reliant on clinical criteria including sex, histology, smoking status and ethnicity as indirect surrogates for molecular features.

This study [1] and others [40] have concluded that response to targeted small molecules cannot be explained in the context of mutation or amplification of a single gene but more likely as a spectrum of altered targets. Mutations in EGFR do not affect the binding affinity of gefitinib or erlotinib [6], suggesting that EGFR mutations are markers of a TKI-susceptible biological subtype or of generally good prognosis regardless of treatment [40], rather than of a high-affinity drug binding partner. Therefore, there may be mutations in other genes that confer a similar phenotype susceptible to TKIs.
While the identification of features conferring drug response is of great utility, the characterization of non-responsive patients and drug resistance features will also contribute to an understanding of drug action. Managing or curing cancer will rely on the comprehensive detection of all somatic events within a tumour population and on using this knowledge to rationally deliver targeted therapies.

2.6. Figures

Figure 2.1 DNA of varying quality from formalin-fixed paraffin-embedded tissues

DNA extracted from tissue blocks is often degraded and chemically modified to varying degrees due to differences in fixation method and time, storage conditions, and nature of the tissue. Diagnostic treatments such as fixation with Bouin’s solution (samples 9-11) or acid decalcification (sample 12) can result in severely degraded template unusable for PCR. Slightly (sample 1) or moderately (samples 2-8) degraded templates can be used for PCR, although additional input DNA may be necessary for robust PCR. To ensure that blocks with degraded DNA were not used in labour-intensive microdissection, DNA from whole sections was extracted and qualified on a 2% agarose gel prior to microdissection of additional sections. Blocks yielding highly degraded DNA were not used in this study.

Figure 2.2 Laser microdissection of mixed tumour and normal cell populations

Tumour cells were microdissected using a UV laser microdissection instrument (Methods) to isolate tumour cells from surrounding normal tissue. A) Uncut lymph node tissue with metastatic tumour populations outlined in yellow. Each tumour cluster contains roughly 100-200 cells. B) Normal stromal cells remaining after excision of tumour. C) Tumour cells isolated on adhesive cap. Scale bar: 0.8 mm.

Figure 2.3 EGFR variant detection summary

The seven exons coding for the tyrosine kinase domain of EGFR were sequenced in 37 tumours.
Eight of these samples contained mutations, four with in-frame exon 19 deletions impacting L747-A750, four with a variety of exon 20 point mutations, and one with an exon 21 point mutation, L858R. Two previously documented synonymous polymorphisms were detected in this study, G2607A in exon 20 (rs10251977) and T2955C in exon 23 (rs17290643). Amino acid numbering is from the initial methionine residue of the EGFR protein isoform a (NCBI accession NP_005219). The data from this study have since been recorded in the Catalogue of Somatic Mutations in Cancer [41].

Figure 2.4 Examples of tumours with increased gene copy number detected by FISH

Gene copy number visualized by fluorescent in situ hybridization (FISH). Blue DAPI stain identifies the DNA present in each cell’s nucleus. Red Cy5-labelled probes hybridize to the gene region targeted by each assay (EGFR or HER2). Green Cy3-labelled probes target the centromere of the chromosome appropriate for the gene-specific assay (chromosome 7 for EGFR, chromosome 17 for HER2). The ratio reported is the number of red probes / green probes (genes/chromosome) based on an average of 40 cells. A) Tumour cells without increased EGFR copy number. B) Tumour cells with increased HER2 copy number. C) Tumour cells with “gene amplification” of EGFR.

2.7. Tables

Table 2-1 PCR primers for 7 exons of the EGFR tyrosine kinase domain

Exon  Annealing temperature (ºC)  Forward primer sequence   Reverse primer sequence    Product length including primer sequences (bp)
18    60                          gtgtcctggcacccaagc        ccccaccagaccatgaga         340
19    60                          cagcatgtggcaccatctc       cagagcagctgccagacat        273
20    60                          cattcatgcgtcttcacctg      catatccccatggcaaactc       412
21    60                          agccataagtcctcgacgtg      acccagaatgtctggagagc       372
22    56                          tccagagtgagttaactttttcca  ttgcatgtcagaggatataatgtaa  277
23    60                          gaagcaaattgcccaagact      atttctccagggatgcaaag       413
24    56                          gcaatgccatctttatcatttc    gctggcatgtgacagaacac       281

PCR primers were designed at least 40 bp from EGFR exons coding for the tyrosine kinase domain.
Sequencing tags were added to each primer to allow sequencing of the PCR products. All forward primer sequences were prefixed with a -21M13 sequencing tag, TGTAAAACGACGGCCAGT. All reverse primer sequences were prefixed with an M13R sequencing tag, CAGGAAACAGCTATGAC. The -21M13 and M13R sequencing primers were then used in the corresponding sequencing reactions to generate sequences from both strands of the PCR products.

Table 2-2 Summary of all patient clinical data and molecular status

#: 3 6 7 9 10 11 12 14 15 20 21 22 24 25 26 27 28 30 33 34 35 36 37 39 40 42 43 44 47 48 51 52 56 57 59 60 61 64 66

Sex: F F F F M F M F M F F M F F F F M M M M F M M M M M F F F F F F M M M F F F F

Ethnicity: Caucasian Caucasian Caucasian Asian Caucasian Asian Asian Asian Asian Caucasian Caucasian Asian Caucasian Asian Caucasian Caucasian Caucasian Asian Asian Caucasian Caucasian Asian Caucasian Caucasian Caucasian Caucasian Caucasian Asian Asian Asian Caucasian Asian Caucasian Caucasian Caucasian Asian Asian Caucasian Asian

Smoker?: Unknown Y Unknown N Unknown Unknown N N Y Unknown Y N N N Y Y Y Y Y Y Unknown Y Y Unknown Y Y Y N N N Y N Y Y Y N Y Y N

Histology: adeno. adeno. adeno. adeno. PD NSC adeno. adeno. adeno. adeno. adeno. adeno. adeno. adeno. adeno. adeno. SCC adeno. adeno. adeno. SCC adeno. adeno. PD NSC adeno. adeno. adeno. adeno. adeno. adeno. adeno. LCC adeno. adeno. adeno. adeno. adeno. adeno. Pre Rx: adeno. / Post Rx: adeno. adeno.
Source Tissue: Skin Nodule Lung Lymph Node Cerebellum Lymph Node Lung Lymph Node Pericardium Lung Lung Lymph Node Lymph Node Lymph Node Lung Lung Lung Brain Brain Lung Lung Brain Pleura Skin Nodule Pleura Lymph Node Lymph Node Pleura Lung Lung Lymph Node Lymph Node Lung Lymph Node Skin Nodule Lymph Node Lung Lymph Node Pericardium Lung

Block Type¹: Tissue Block Tissue Block Tissue Block Tissue Block Tissue Block Tissue Block Tissue Block Tissue Block Tissue Block Tissue Block Cytology Slide Cytology Slide Cytology Slide Cytology Slide Cytology Slide Tissue Block Tissue Block Tissue Block Tissue Block Tissue Block Tissue Block Tissue Block Cytology Slide Cell Block Cell Block Tissue Block Tissue Block Tissue Block Tissue Block Tissue Block Cytology Slide Cytology Slide Tissue Block Tissue Block Tissue Block Tissue Block Cytology Slide Tissue Block Tissue Block Tissue Block

Response²: PD PD PD SD PD PR SD PD PR PD PR PD SD PD SD PD SD SD PD SD PD SD SD SD SD Unknown PD PR PD PD PD PR Unknown SD PD SD PD SD PR

EGFR Mutation: None None None Not Sequenced None Exon 19 Del.*, Exon None 20 V774L None Not Sequenced None None Exon 19 Del. None Exon 19 Del. None None Exon 20 G779S None None None Exon 20 V819V None None None None None None Exon 20 S768I, Exon 20 21 L815L L858R None None None None None None None None None None Exon 19 Del.*

EGFR/CEP7 | HER2/CEP17 | EGFR Stratification³ | HER2 Stratification³

2.1 1.2 2.7 1.2 1 1.1 1.1 | 1.9 1.4 1.5 1.9 1.2 1.2 1.7 | High Poly. High Trisomy High Poly. Low Trisomy Disomy Low Trisomy Low Trisomy | Low Poly. High Trisomy High Trisomy Low Poly. Low Trisomy High Trisomy Low Poly.

2.1 1.9 1.3 1.3 17.3 0.7 2.0 | 1.5 1.5 1.2 1.1 2.6 1.4 2.0 | High Poly. Low Poly. High Trisomy High Trisomy Gene Amp. Low Trisomy Low Poly. | Low Poly. Low Poly. High Trisomy Low Trisomy High Poly. Low Poly. Low Poly.

3.1 2.1 2.7 1.9 1.2 1.3 | 1.4 2.3 0.9 2.9 1.4 0.8 | High Poly. Low Poly. High Poly. Low Poly. Low Trisomy High Trisomy | Low Poly. High Poly.
Low Trisomy High Poly. Low Poly. Disomy

1.5 1.6 1.1 1.1 | 2.0 2.2 1.4 1.7 | High Trisomy Low Poly. Low Trisomy Low Trisomy | Low Poly. High Poly. High Trisomy Low Poly.

1.0 2.2 2.9 | 1.2 1.3 1.2 | Low Trisomy Low Poly. High Poly. | High Trisomy Low Trisomy Low Trisomy

Table 2-3 EGFR exon 19 deletions/substitution

#     Sex  Ethnicity  Smoking Status  Source Tissue  Response¹  Sequence                           Type
a.a.                                                            I  K  E  L  R  E  A  T  S  P  K
CDS                                                             TCAAGGAATTAAGAGAAGCAACATCTCCGAA
11    F    Asian      Unknown         Lung           PR         TCAA---------------AACATCTCCGAA    Het²
22    M    Asian      N               Lymph Node     PD         TCAAGGAA-----C-------CATCTCCGAA    Het
25    F    Asian      N               Lung           PD         TCAAGGAA---------------TCTCCGAA    Del
66    F    Asian      N               Lung           PR         TCAA---------------AACATCTCCGAA    Het²

Deletions of L747-A750 were detected in EGFR exon 19. All samples were classified as adenocarcinoma based on histology. Deleted bases are indicated by “-”. In the case of patient #22, thirteen deleted bases were replaced by a single ‘C’, thereby retaining the reading frame. In all cases, the ATP-binding residue K745 was retained. In the case of patients #11 and #66, a synonymous codon change results from the deletion (AAG>AAA) and the K745 ATP-binding residue is unchanged.
¹ Response as measured radiographically and defined by SWOG modification of the WHO criteria [24]. PD = progressive disease, SD = stable disease, PR = partial response.
² No mutations detected in normal tissue remaining after microdissection.

Table 2-4 EGFR point mutations

#   Sex  Ethnicity  Smoking Status  Source Tissue  Response¹  Exon  CDS Mutation  Amino Acid  Previously Documented
44  F    Asian      N               Pleura         PR         20    G2549>TT      S768I       [42-45]
                                                              20    C2691>CT      L815L       none
11  F    Asian      Unknown         Lung           PR         20    G2566>TT²     V774L       V774M [45, 46]
28  M    Caucasian  Y               Brain          SD         20    G2581>AG      G779S       G779F [46]
35  F    Caucasian  Unknown         Brain          SD         20    G2703>GA      V819V       [47]
47  F    Asian      N               Lung           SD         21    T2573>GG      L858R       [9-11, 17, 19, 45, 48, 49]

Point mutations detected in EGFR exons 20 and 21.
All samples were classified as adenocarcinoma based on histology. Point mutations altering V774 and G779 have been previously documented to result in amino acid substitutions different from those found in this study.
¹ Response as measured radiographically and defined by SWOG modification of the WHO criteria [24]. PD = progressive disease, SD = stable disease, PR = partial response.
² No mutations detected in normal tissue remaining after microdissection.

Table 2-5 EGFR and HER2 copy number alterations

#   Sex  Ethnicity  Smoking Status  Histology       Source Tissue  Block Type¹   Response²  EGFR Mutation                 EGFR:CEP7  HER2:CEP17  EGFR Stratification³  HER2 Stratification³
9   F    Asian      N               adeno           Cerebellum     Tissue Block  SD         Not Sequenced                 2.1        1.9         High Poly             Low Poly
11  F    Asian      Unknown         adeno           Lung           Tissue Block  PR         Exon 19 Del⁴, Exon 20 V774L   2.7        1.5         High Poly             High Tri.
27  F    Caucasian  Y               SCC             Lung           Tissue Block  PD         None                          2.1        1.5         High Poly             Low Poly
34  M    Caucasian  Y               SCC             Lung           Tissue Block  SD         None                          17.3       2.6         Gene Amp.             High Poly
36  M    Asian      Y               adeno           Pleura         Tissue Block  SD         None                          2.0        2.0         Low Poly              Low Poly
40  M    Caucasian  Y               adeno           Pleura         Cell Block    SD         None                          3.1        1.4         High Poly             Low Poly
42  M    Caucasian  Y               adeno           Lymph Node     Tissue Block  Unknown    None                          2.1        2.3         Low Poly              High Poly
43  F    Caucasian  Y               adeno           Lymph Node     Tissue Block  PD         None                          2.7        0.9         High Poly             Low Tri.
44  F    Asian      N               adeno           Pleura         Tissue Block  PR         Exon 20 S768I, Exon 20 L815L  1.9        2.9         Low Poly              High Poly
56  M    Caucasian  Y               adeno           Lung           Tissue Block  Unknown    None                          1.5        2.0         High Tri.             Low Poly
57  M    Caucasian  Y               adeno           Lymph Node     Tissue Block  SD         None                          1.6        2.2         Low Poly              High Poly
64  F    Caucasian  Y               Pre Rx: adeno   Lymph Node     Tissue Block  SD         None                          1.0        1.2         Low Tri.              High Tri.
                                    Post Rx: adeno  Pericardium    Tissue Block             None                          2.2        1.3         Low Poly              Low Tri.
66  F    Asian      N               adeno           Lung           Tissue Block  PR         Exon 19 Del⁴                  2.9        1.2         High Poly             Low Tri.

Patient data provided for samples displaying increased EGFR or HER2 copy number (Probe:CEP ratio > 2.0) or identified as FISH+ (High Polysomy or Gene Amplification).
¹ Source of patient material (Tissue Block = microdissected formalin-fixed paraffin-embedded tissue block; Cell Block = whole section or microdissected formalin-fixed paraffin-embedded cell block; Cytology = scraped cytology slide).
² Response as measured radiographically and defined by SWOG modification of the WHO criteria [24]. PD = progressive disease, SD = stable disease, PR = partial response.
³ Copy number stratification as proposed by Cappuzzo et al [17]. Disomy = ≤ 2 copies in > 90% of cells; Low Trisomy = ≤ 2 copies in ≥ 40% of cells, 3 copies in 10-40% of cells, ≥ 4 copies in < 10% of cells; High Trisomy = ≤ 2 copies in ≥ 40% of cells, 3 copies in ≥ 40% of cells, ≥ 4 copies in < 10% of cells; Low Polysomy = ≥ 4 copies in 10-40% of cells; High Polysomy = ≥ 4 copies in ≥ 40% of cells; Gene Amplification = ≥ 15 copies in ≥ 10% of cells.
⁴ No mutations detected in normal tissue remaining after microdissection.

2.8. Bibliography

1. Pugh TJ, Bebb G, Barclay L, Sutcliffe M, Fee J, Salski C, O'Connor R, Ho C, Murray N, Melosky B et al: Correlations of EGFR mutations and increases in EGFR and HER2 copy number to gefitinib response in a retrospective analysis of lung cancer patients. BMC Cancer 2007, 7:128.
2. Bonomi PD, Buckingham L, Coon J: Selecting patients for treatment with epidermal growth factor tyrosine kinase inhibitors. Clin Cancer Res 2007, 13(15 Pt 2):s4606-4612.
3. Damaraju S, Murray D, Dufour J, Carandang D, Myrehaug S, Fallone G, Field C, Greiner R, Hanson J, Cass CE et al: Association of DNA repair and steroid metabolism gene polymorphisms with clinical late toxicity in patients treated with conformal radiotherapy for prostate cancer. Clin Cancer Res 2006, 12(8):2545-2554.
4. Laskin JJ, Sandler AB: Epidermal growth factor receptor: a promising target in solid tumours.
Cancer Treat Rev 2004, 30:1-17.
5. Brehmer D, Greff Z, Godl K, Blencke S, Kurtenbach A, Weber M, Muller S, Klebl B, Cotten M, Keri G et al: Cellular targets of gefitinib. Cancer Res 2005, 65(2):379-382.
6. Fabian MA, Biggs WH, 3rd, Treiber DK, Atteridge CE, Azimioara MD, Benedetti MG, Carter TA, Ciceri P, Edeen PT, Floyd M et al: A small molecule-kinase interaction map for clinical kinase inhibitors. Nat Biotechnol 2005, 23(3):329-336.
7. Fukuoka M, Yano S, Giaccone G, Tamura T, Nakagawa K, Douillard JY, Nishiwaki Y, Vansteenkiste J, Kudoh S, Rischin D et al: Multi-institutional randomized phase II trial of gefitinib for previously treated patients with advanced non-small-cell lung cancer (The IDEAL 1 Trial). J Clin Oncol 2003, 21(12):2237-2246.
8. Kris MG, Natale RB, Herbst RS, Lynch TJ, Prager D, Belani CP, Schiller JH, Kelly K, Spiridonidis H, Sandler A et al: Efficacy of gefitinib, an inhibitor of the epidermal growth factor receptor tyrosine kinase, in symptomatic patients with non-small cell lung cancer: a randomized trial. JAMA 2003, 290(16):2149-2158.
9. Paez JG, Janne PA, Lee JC, Tracy S, Greulich H, Gabriel S, Herman P, Kaye F, Lindeman N, Boggon TJ et al: EGFR mutations in lung cancer: correlation with clinical response to gefitinib therapy. Science 2004, 304:1497-1500.
10. Lynch TJ, Bell DW, Sordella R, Gurubhagavatula S, Okimoto RA, Brannigan BW, Harris PL, Haserlat SM, Supko JG, Haluska FG et al: Activating mutations in the epidermal growth factor receptor underlying responsiveness of non-small-cell lung cancer to gefitinib. N Engl J Med 2004, 350:2129-2139.
11. Pao W, Miller V, Zakowski M, Doherty J, Politi K, Sarkaria I, Singh B, Heelan R, Rusch V, Fulton L et al: EGF receptor gene mutations are common in lung cancers from "never smokers" and are associated with sensitivity of tumors to gefitinib and erlotinib. Proc Natl Acad Sci USA 2004, 101(36):13306-13311.
12. Taron M, Ichinose Y, Rosell R, Mok T, Massuti B, Zamora L, Mate JL, Manegold C, Ono M, Queralt C et al: Activating mutations in the tyrosine kinase domain of the epidermal growth factor receptor are associated with improved survival in gefitinib-treated chemorefractory lung adenocarcinomas. Clin Cancer Res 2005, 11(16):5878-5885.
13. Takano T, Ohe Y, Sakamoto H, Tsuta K, Matsuno Y, Tateishi U, Yamamoto S, Nokihara H, Yamamoto N, Sekine I et al: Epidermal growth factor receptor gene mutations and increased copy numbers predict gefitinib sensitivity in patients with recurrent non-small-cell lung cancer. J Clin Oncol 2005, 23(28):6829-6837.
14. Ho C, Murray N, Laskin J, Melosky B, Anderson H, Bebb G: Asian ethnicity and adenocarcinoma histology continues to predict response to gefitinib in patients treated for advanced non-small cell carcinoma of the lung in North America. Lung Cancer 2005, 49(2):225-231.
15. Giaccone G, Rodriguez JA: EGFR inhibitors: what have we learned from the treatment of lung cancer? Nat Clin Pract Oncol 2005, 2(11):554-561.
16. Inoue A, Suzuki T, Fukuhara T, Maemondo M, Kimura Y, Morikawa N, Watanabe H, Saijo Y, Nukiwa T: Prospective phase II study of gefitinib for chemotherapy-naive patients with advanced non-small-cell lung cancer with epidermal growth factor receptor gene mutations. J Clin Oncol 2006, 24(21):3340-3346.
17. Cappuzzo F, Hirsch FR, Rossi E, Bartolini S, Ceresoli GL, Bemis L, Haney J, Witta S, Danenberg K, Domenichini I et al: Epidermal growth factor receptor gene and protein and gefitinib sensitivity in non-small-cell lung cancer. J Natl Cancer Inst 2005, 97(9):643-655.
18. Cappuzzo F, Varella-Garcia M, Shigematsu H, Domenichini I, Bartolini S, Ceresoli G, Rossi E, Ludovini V, Gregorc V, Toschi L et al: Increased HER2 gene copy number is associated with response to gefitinib therapy in epidermal growth factor receptor-positive non-small-cell lung cancer patients. J Clin Oncol 2005, 23(22):5007-5018.
19. Tsao MS, Sakurada A, Cutz JC, Zhu CQ, Kamel-Reid S, Squire J, Lorimer I, Zhang T, Liu N, Daneshmand M et al: Erlotinib in lung cancer - molecular and clinical predictors of outcome. N Engl J Med 2005, 353(2):133-144.
20. Brabender J, Danenberg KD, Metzger R, Schneider PM, Park J, Salonga D, Holscher AH, Danenberg PV: Epidermal growth factor receptor and HER2-neu mRNA expression in non-small cell lung cancer Is correlated with survival. Clin Cancer Res 2001, 7(7):1850-1855.
21. Onn A, Correa AM, Gilcrease M, Isobe T, Massarelli E, Bucana CD, O'Reilly MS, Hong WK, Fidler IJ, Putnam JB et al: Synchronous overexpression of epidermal growth factor receptor and HER2-neu protein is a predictor of poor outcome in patients with stage I non-small cell lung cancer. Clin Cancer Res 2004, 10(1 Pt 1):136-143.
22. Johnson BE, Janne PA: Selecting patients for epidermal growth factor receptor inhibitor treatment: A FISH story or a tale of mutations? J Clin Oncol 2005, 23(28):6813-6816.
23. Shepherd FA, Tsao MS: Unraveling the mystery of prognostic and predictive factors in epidermal growth factor receptor therapy. J Clin Oncol 2006, 24(7):1219-1220; author reply 1220-1211.
24. Green S, Weiss GR: Southwest Oncology Group standard response criteria, endpoint definitions and toxicity criteria. Invest New Drugs 1992, 10(4):239-253.
25. Kent W, Sugnet C, Furey T, Roskin K, Pringle T, Zahler A, Haussler D: The Human Genome Browser at UCSC. Genome Res 2002, 12(6):996-1006.
26. McDonald DM, Munn L, Jain RK: Vasculogenic mimicry: how convincing, how novel, and how significant? Am J Pathol 2000, 156(2):383-388.
27. Rozen S, Skaletsky H: Primer3 on the WWW for general users and for biologist programmers. In: Bioinformatics Methods and Protocols: Methods in Molecular Biology. Edited by Krawetz S MS. Totowa, NJ: Humana Press; 2000: 365-386.
28. Ewing B, Green P: Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res 1998, 8(3):186-194.
29. Gordon D, Abajian C, Green P: Consed: a graphical tool for sequence finishing. Genome Res 1998, 8(3):195-202.
30. Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K: dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 2001, 29(1):308-311.
31. Salgia R, Skarin AT: Molecular abnormalities in lung cancer. J Clin Oncol 1998, 16(3):1207-1217.
32. Pao W, Miller VA, Politi KA, Riely GJ, Somwar R, Zakowski MF, Kris MG, Varmus H: Acquired Resistance of Lung Adenocarcinomas to Gefitinib or Erlotinib Is Associated with a Second Mutation in the EGFR Kinase Domain. PLoS Med 2005, 2:e73.
33. Nakamura Y: Codon usage table (Homo sapiens). 2009.
34. Brehmer D, Greff Z, Godl K, Blencke S, Kurtenbach A, Weber M, Muller S, Klebl B, Cotten M, Keri G et al: Cellular targets of gefitinib. Cancer Res 2005, 65(2):379-382.
35. Fabian MA, Biggs WH, Treiber DK, Atteridge CE, Azimioara MD, Benedetti MG, Carter TA, Ciceri P, Edeen PT, Floyd M et al: A small molecule-kinase interaction map for clinical kinase inhibitors. Nat Biotechnol 2005, 23(3):329-336.
36. Pao W, Wang TY, Riely GJ, Miller VA, Pan Q, Ladanyi M, Zakowski MF, Heelan RT, Kris MG, Varmus HE: KRAS mutations and primary resistance of lung adenocarcinomas to gefitinib or erlotinib. PLoS Med 2005, 2(1):e17.
37. Engelman JA, Zejnullahu K, Mitsudomi T, Song Y, Hyland C, Park JO, Lindeman N, Gale CM, Zhao X, Christensen J et al: MET amplification leads to gefitinib resistance in lung cancer by activating ERBB3 signaling. Science 2007, 316(5827):1039-1043.
38. Sos ML, Koker M, Weir BA, Heynck S, Rabinovsky R, Zander T, Seeger JM, Weiss J, Fischer F, Frommolt P et al: PTEN loss contributes to erlotinib resistance in EGFR-mutant lung cancer by activation of Akt and EGFR. Cancer Res 2009, 69(8):3256-3261.
39. Janne PA: Challenges of detecting EGFR T790M in gefitinib/erlotinib-resistant tumours. Lung Cancer 2008, 60 Suppl 2:S3-9.
40. Shepherd FA: Molecular selection of patients for first-line treatment of advanced non-small-cell lung cancer with epidermal growth factor inhibitors: not quite ready for prime time. J Clin Oncol 2008, 26(15):2426-2427.
41. Futreal PA, Coin L, Marshall M, Down T, Hubbard T, Wooster R, Rahman N, Stratton MR: A census of human cancer genes. Nat Rev Cancer 2004, 4(3):177-183.
42. Eberhard DA, Johnson BE, Amler LC, Goddard AD, Heldens SL, Herbst RS, Ince WL, Jänne PA, Januario T, Johnson DH et al: Mutations in the epidermal growth factor receptor and in KRAS are predictive and prognostic indicators in patients with non-small-cell lung cancer treated with chemotherapy alone and in combination with erlotinib. J Clin Oncol 2005, 23(25):5900-5909.
43. Huang SF, Liu HP, Li LH, Ku YC, Fu YN, Tsai HY, Chen YT, Lin YF, Chang WC, Kuo HP et al: High frequency of epidermal growth factor receptor mutations with complex patterns in non-small cell lung cancers related to gefitinib responsiveness in Taiwan. Clin Cancer Res 2004, 10(24):8195-8203.
44. Kosaka T, Yatabe Y, Endoh H, Kuwano H, Takahashi T, Mitsudomi T: Mutations of the epidermal growth factor receptor gene in lung cancer: biological and clinical implications. Cancer Res 2004, 64(24):8919-8923.
45. Shigematsu H, Lin L, Takahashi T, Nomura M, Suzuki M, Wistuba II, Fong KM, Lee H, Toyooka S, Shimizu N et al: Clinical and biological features associated with epidermal growth factor receptor gene mutations in lung cancers. J Natl Cancer Inst 2005, 97(5):339-346.
46. Yang SH, Mechanic LE, Yang P, Landi MT, Bowman ED, Wampfler J, Meerzaman D, Hong KM, Mann F, Dracheva T et al: Mutations in the Tyrosine Kinase Domain of the Epidermal Growth Factor Receptor in Non-Small Cell Lung Cancer. Clin Cancer Res 2005, 11:2106-2110.
47. Su MC, Lien HC, Jeng YM: Absence of epidermal growth factor receptor exon 18-21 mutation in hepatocellular carcinoma. Cancer Lett 2005, 224(1):117-121.
48. Bell DW, Lynch TJ, Haserlat SM, Harris PL, Okimoto RA, Brannigan BW, Sgroi DC, Muir B, Riemenschneider MJ, Iacona RB et al: Epidermal growth factor receptor mutations and gene amplification in non-small-cell lung cancer: molecular analysis of the IDEAL/INTACT gefitinib trials. J Clin Oncol 2005, 23(31):8081-8092.
49. Marchetti A, Martella C, Felicioni L, Barassi F, Salvatore S, Chella A, Camplese PP, Iarussi T, Mucilli F, Mezzetti A et al: EGFR mutations in non-small-cell lung cancer: analysis of a large series of cases and development of a rapid and sensitive method for diagnostic screening with potential implications on pharmacologic treatment. J Clin Oncol 2005, 23:857-865.

Chapter 3. Impact of whole genome amplification on analysis of copy number variants²

Genome analyses of primary cancer samples are often limited by the amount of tumour tissue available for study. The work outlined in the previous chapter was limited to the analysis of seven target amplicons due, in part, to limited quantities of DNA available from clinical lung biopsy samples. This problem is not limited to the study of lung cancer and is an issue in many tumour settings, particularly those from rare or specially treated cancers. Clinical biopsy materials are often very precious due to the difficulty in obtaining them and the rich clinical data with which they are associated. As the scope of genome analyses continues to grow, there is an increasing demand for large quantities of nucleic acids from these sources, particularly as tissues are being collected specifically for research more routinely. This section of the thesis characterizes a technique for amplification of DNA from limited quantities of clinical material and the use of amplified product for genome-wide copy number analysis. This technique makes use of Phi29 DNA polymerase primed using random hexamers to replicate more than a million genome equivalents from an input of only a thousand.
To characterize systematic bias induced by this technique and to evaluate the ability to use amplified material for SNP and copy number analysis, we performed a microarray-based analysis of pre- and post-amplification pairs. This study showed that whole genome amplification (WGA) induces hundreds of copy number variant artifacts that can obscure bona fide copy number variants. However, these artifacts are systematic and correlate with GC content and proximity to chromosome ends. Pair-wise comparison in which amplified samples are compared to amplified samples can correct for these biases and restore the ability to distinguish real copy number variants from false positives arising from technical artifacts.

² A version of this chapter has been published. Pugh T.J., Delaney A.D., Farnoud N., Flibotte S., Griffith M., Li H.I., Qian H., Farinha P., Gascoyne R.D., Marra M.A. Nucleic Acids Res. 2008 Aug;36(13):e80.

Genotype concordance before and after amplification was high (>98%) and the effects of WGA bias were not a significant contributor to non-concordance. Armed with knowledge of the biases induced by this technique and a proven method to resolve real copy number variants from WGA material, we have since used WGA to amplify DNA from several clinical sources, including some lung tumour biopsies containing a few thousand tumour cells collected and sequenced for the work outlined in Chapter 5.

3.1. Introduction

Initial analysis of the human genome identified single nucleotide polymorphisms (SNPs) as the primary source of genotypic and phenotypic variation among humans. However, subsequent studies identified larger-scale copy number variants that apparently impacted millions of nucleotides [1-6]. These larger-scale variants included polymorphic deletions and duplications that are present in >1% of the population and therefore meet the traditional definition of polymorphism [2].
As of August 2009, 8,410 copy number variant loci impacting over 911 Mbp of DNA sequence (~32% of the genome) were identified, and these are listed in the Database for Genomic Variants (http://projects.tcag.ca/variation/). Copy number variants are also features of several human diseases, including Alzheimer disease [7], Cri du chat syndrome [8], mental retardation [9], and cancer [10, 11]. For example, somatic gene amplification is a common mechanism of oncogene overexpression in lung cancer (EGFR) [12, 13] and breast cancer (HER2) [14, 15], resulting in upregulation of cell signalling pathways including the prosurvival PI3K/Akt and mitogenic MAPK pathways [16-18]. A database of pathogenic copy number variants has been created with the goal of linking specific variants to disease (Database of Chromosomal Imbalance and Phenotype in Humans using Ensembl Resources, https://decipher.sanger.ac.uk). As robust array-based methods for copy number detection continue to mature, increasing numbers of these variants are being identified [2].

Current whole-genome methods to detect copy number variants require relatively large input quantities of DNA that are difficult or impossible to obtain from rare cell populations such as cancer biopsies and microdissected tissues. To address this challenge, whole genome amplification (WGA) techniques were developed that increase the amount of DNA for analysis. For example, multiple-strand displacement amplification (MDA) using Phi29 DNA polymerase was used to generate microgram quantities of high molecular weight DNA (>30 kb) from nanograms of high quality input material [19, 20]. A recent report described a protocol for amplification of picogram quantities of DNA from single cells [21], further expanding the applications for this technique. The replication fidelity of WGA techniques has been investigated [22-27].
Estimates of base-pair incorporation errors resulting from Phi29-mediated amplification have ranged from 2.2 × 10⁻⁵ [28] to 9.5 × 10⁻⁶ [23], and the concordance of genotypes between unamplified and amplified samples was reported to be >99.8% [23, 26]. Recurrent WGA-induced copy number biases were observed in previous studies [22-27] and were associated with sequence repeats and proximity to chromosome ends [24-27], increased GC content [24, 27], and annotated copy number variants [24]. Many of these associations were explored descriptively without statistical analysis, and there was no consensus on the 92 recurrent regions of bias explicitly defined by three of these studies [23, 24, 27]. A recent study of 532 samples subjected to WGA and subsequent analysis using a relatively low-density Affymetrix 10k Mapping microarray [29] identified a median of 438 WGA-induced copy number artifacts in comparisons between amplified samples and an unamplified reference set [22]. While there is a consensus that at least partial compensation of systematic biases can be achieved through the use of an amplified reference [23-27], it is unknown to what degree such comparisons can capture real copy number variants detected using more sensitive, higher resolution platforms.

Recently, a high-throughput, massively parallel whole genome pyrosequencing technique was used to examine bias induced by three commercially available whole genome amplification protocols: MDA, primer-extension pre-amplification, and degenerate oligonucleotide-primed PCR [30]. In this comparison, which involved sequencing two bacterial genomes, Phi29 MDA-based approaches generated the most complete genome coverage (50-99%) and introduced the least bias compared to PCR-based techniques. DNA sequences generated from Phi29-amplified material were 2.9-3.8% lower in GC-content than those from the unamplified material, suggesting a relationship between amplification bias and GC-content.
However, over-amplification of certain sequences could not be explained by any of the previously mentioned sources of bias, suggesting a need to directly investigate the nature of regions prone to over- or under-amplification. Although the study was of high resolution, direct comparison of the results from this study with those using human samples is difficult due to differences in chromosome organization, size and composition.

In the present study, I investigated amplification bias resulting from whole genome amplification of DNA from fresh-frozen human tissues using the Affymetrix 500k Mapping microarray set. This set is comprised of two high-resolution microarrays that together contain probes to query over 500,000 SNPs from across the human genome, a dramatic increase in probe density from previous studies using a similar oligonucleotide array with 50X fewer probes (the Affymetrix 10k Mapping array) [22, 23, 26], lower resolution cDNA and bacterial artificial chromosome arrays [24, 25], or individual PCRs [31]. Copy number can be inferred from the resultant probe intensities [11] and, as only a single sample is applied to each microarray, multiple sample comparisons can be performed using normalized data. We quantified the effects of WGA on microarray signal and background noise, localized and statistically analysed genomic regions of WGA-induced bias, and directly compared the ability to resolve copy number variants in comparisons of unamplified and amplified material.

3.2. Materials and methods

3.2.1. Tissue material and DNA extraction

Normal lymph nodes from three individuals were fresh frozen in Optimal Cutting Temperature (OCT; Sakura Finetek, Torrance, CA) compound and stored at -80°C by the service pathology laboratory at the BC Cancer Agency. Genomic DNA was extracted from these sources using the Gentra PureGene DNA purification kit (Gentra Systems, Minneapolis, MN).
Prior to labelling and microarray hybridization, the genomic DNA was quantified using a NanoDrop spectrophotometer (NanoDrop Technologies, Wilmington, DE). Prior to whole genome amplification, the genomic DNA was diluted to ~1.5 ng/µL and quantified using a PicoGreen assay (Invitrogen, Carlsbad, CA). To ensure consistent DNA quality across all samples, the DNA was visualized on an agarose gel to confirm the presence of undegraded, predominantly high molecular weight (>10 kb) DNA.

3.2.2. Whole genome amplification

We used Qiagen’s Repli-g Mini whole genome amplification kit and protocol (Qiagen, Valencia, CA) to amplify 7 ng of PicoGreen-quantified DNA from fresh frozen samples to generate >10 µg of high molecular weight DNA. We performed the isothermal amplification reaction in 1.5 mL microcentrifuge tubes incubated in a 30°C water bath for 18 hours and inactivated the enzyme by incubating the tubes in a 65°C water bath for 3 minutes. The amplified products were purified and quantified as described in the previous section, and the amplification products were visualized on a 0.8% agarose gel stained with SYBR Green (Invitrogen, Carlsbad, CA).

3.2.3. Labelling and hybridization to the Affymetrix 500K array

500 ng samples of DNA were processed following the instructions in the GeneChip Mapping 500K manual (Affymetrix, Santa Clara, CA). Briefly, 250 ng of DNA was digested using one of two restriction enzymes, Nsp I or Sty I, and ligated to Nsp I or Sty I adaptors. These adaptor-ligated fragments were amplified by PCR, the purified products quantified using a Bio-Tek PowerWave X spectrophotometer, and the concentration normalized to 2 µg/µL. The normalized products were then fragmented and labelled as described in the manual. Samples were hybridized to the GeneChip Human Mapping 250K Nsp or Sty array in an Affymetrix Hybridization Oven 640. Washing and staining of the arrays were performed using an Affymetrix Fluidics Station 450.
Images of the arrays were obtained using an Affymetrix GeneChip Scanner 3000.

3.2.4. Sample preparation for NimbleGen 385k CGH array

Samples of >2.5 µg of DNA were prepared following the instructions provided by NimbleGen Systems Inc. (NimbleGen Systems Inc, Madison, Wisconsin). Briefly, purified samples were concentrated to 250 ng/µL and analysed for quality on an agarose gel. Samples were then shipped on ice to NimbleGen for subsequent labelling and hybridization to the 385k Human Whole-Genome CGH array.

3.2.5. Genotype and copy number analysis

Genotype calls were derived from GeneChip microarray images using the GTYPE v4.0 software program (Affymetrix, Santa Clara, CA). We detected copy number variants in individual samples using comparisons to a common reference data set and comparisons between pre- and post-amplification sample pairs (Figure 3.1). These were performed using a software pipeline (Figure 3.1) that utilizes the Affymetrix Chromosome Copy Number Analysis Tool (CNAT) version 4.0 (Affymetrix, Santa Clara, CA) and an exhaustive t-score optimization algorithm.

To analyse sample pairs on the Affymetrix platform, we used CNAT to perform quantile normalization of probe intensities from the samples and calculated log2 intensity ratios for each probe set on the array. For unpaired analysis of individual samples against a common reference set, we used a set of average probe intensities from the reference set in place of the second sample. The reference set used for this purpose, referred to hereafter as the “Affy48 reference set”, was downloaded from the Affymetrix website (http://www.affymetrix.com/support/technical/sample_data/500k_data.affx) and consisted of 48 samples representing 5 HapMap CEPH trios, 5 HapMap Yoruban trios, 3 other non-HapMap trios, and 9 unrelated HapMap Asian samples. To analyse sample pairs on the NimbleGen platform, we used qspline normalized data and log2 intensity ratios provided by NimbleGen for each probe on the array.
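The paired and unpaired comparisons described above can be illustrated with a minimal sketch. It assumes a textbook quantile normalization (each intensity replaced by the rank-wise mean across samples) and is not the CNAT implementation; the function names and the intensity values are hypothetical and chosen only for illustration.

```python
import math


def quantile_normalize(samples):
    """Quantile-normalize equal-length lists of probe intensities.

    Each value is replaced by the mean, across samples, of the values
    that share its rank. Ties are broken by input order (an assumption).
    """
    n_probes = len(samples[0])
    if any(len(s) != n_probes for s in samples):
        raise ValueError("all samples must cover the same probes")
    sorted_samples = [sorted(s) for s in samples]
    rank_means = [
        sum(s[rank] for s in sorted_samples) / len(samples)
        for rank in range(n_probes)
    ]
    normalized = []
    for sample in samples:
        order = sorted(range(n_probes), key=lambda i: sample[i])
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            out[probe_index] = rank_means[rank]
        normalized.append(out)
    return normalized


def log2_ratios(test, reference):
    """Per-probe log2(test / reference) for two intensity lists."""
    return [math.log2(t / r) for t, r in zip(test, reference)]


def reference_average(reference_samples):
    """Per-probe mean intensity across a reference set (unpaired analysis)."""
    return [sum(values) / len(values) for values in zip(*reference_samples)]


# Made-up intensities for a post- and pre-amplification pair.
post_wga = [100.0, 400.0, 200.0]
pre_wga = [50.0, 100.0, 200.0]
norm_post, norm_pre = quantile_normalize([post_wga, pre_wga])
ratios = log2_ratios(norm_post, norm_pre)
```

For an unpaired comparison, the output of `reference_average` would take the place of the second sample in `log2_ratios`, mirroring the use of averaged reference intensities described in the text.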
To identify significant deviations in the log2 ratio data from both platforms, the following t-score optimization algorithm was used. First, log2 ratios were sorted by genome coordinate, and moving windows representing a number of adjacent probes were subjected to a t-test against the rest of the data outside of the window on the same chromosome. This was done across the entire genome for all window sizes from 3 to 30 probe sets for the Affymetrix and NimbleGen data. To establish a comparison-specific false-positive threshold, the order of log2 ratios was then randomized, and moving window t-tests were recalculated. Two t-score thresholds, one for amplifications and one for deletions, were then defined at which no amplifications or deletions were identified in the randomized data. These thresholds were then applied to the t-scores derived from the original data, and regions with t-scores exceeding these thresholds were identified. To identify apparent variants impacting regions larger than our largest moving window size, t-scores were optimized for aberrations encompassing more than 27 probe sets using larger and larger windows until a local maximum t-score was found. As no CNVs met the false-positive thresholds set for the NimbleGen data, a 50-probe window was used to detect statistically significant CNVs, and a comparison-specific false-positive threshold was not applied.

3.2.6. Sequence analysis of recurrent whole genome amplification-induced artifacts

In the analysis of recurrent WGA-induced artifacts, several sets of genomic coordinates were defined based on the human genome reference sequence Build 36/hg18 (released March 2006) downloaded from the NCBI website (http://www.ncbi.nlm.nih.gov/).
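As an aside on the t-score optimization algorithm of Section 3.2.5, the moving-window scan and the permutation-derived thresholds can be sketched as follows. This is a simplified illustration for a single chromosome, not the pipeline used in this work. The choice of Welch's t-statistic, a single shuffle, and all function names are assumptions, and the extension to windows larger than 30 probes is omitted.

```python
import math
import random
from statistics import mean, variance


def welch_t(inside, outside):
    """Welch t-statistic of a window versus the rest of the chromosome."""
    se = math.sqrt(variance(inside) / len(inside)
                   + variance(outside) / len(outside))
    return (mean(inside) - mean(outside)) / se if se > 0 else 0.0


def window_scores(ratios, min_size=3, max_size=30):
    """Yield (start, size, t) for every moving window on one chromosome."""
    n = len(ratios)
    # Leave at least two probes outside the window so a variance exists.
    for size in range(min_size, min(max_size, n - 2) + 1):
        for start in range(n - size + 1):
            inside = ratios[start:start + size]
            outside = ratios[:start] + ratios[start + size:]
            yield start, size, welch_t(inside, outside)


def permutation_thresholds(ratios, min_size=3, max_size=30, seed=0):
    """Thresholds at which no window is called in order-randomized data."""
    shuffled = list(ratios)
    random.Random(seed).shuffle(shuffled)
    scores = [t for _, _, t in window_scores(shuffled, min_size, max_size)]
    return max(scores), min(scores)  # amplification, deletion


def call_windows(ratios, min_size=3, max_size=30, seed=0):
    """Windows whose t-score exceeds the permutation-derived thresholds."""
    amp_thr, del_thr = permutation_thresholds(ratios, min_size, max_size, seed)
    return [(s, k, t) for s, k, t in window_scores(ratios, min_size, max_size)
            if t > amp_thr or t < del_thr]
```

Here `ratios` is assumed to be the list of log2 ratios for one chromosome, already sorted by genome coordinate.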
To define a set of regions that were consistently over- or under-amplified by the whole genome amplification technique, we analysed apparent variants arising from our comparison of matched pre- and post-WGA samples for overlapping genomic coordinates across all three comparisons and defined minimal overlapping regions (Table 3-1 and Table 3-2). These minimal overlapping regions were defined as the smallest region overlapped by a WGA-induced variant in all three comparisons. To define a subset of recurrently under-amplified chromosome ends, the first or last 2.5% of the reference genome sequence of any chromosome was recorded if it was impacted by a region consistently under-amplified by the WGA technique. To serve as reference sets representing the remainder of the human genome, random sets of coordinates were generated with equivalent size distributions for the regions consistently over- or under-amplified by the whole genome amplification technique and for the subset of recurrently biased regions affecting chromosome ends. In these reference sets, 10 random segments were generated with sizes corresponding to each entry in the list of regions affected by WGA-induced bias (i.e., 1,900 amplifications and 750 deletions).

The GC and repeat content of each entry in the above sets of coordinates were calculated in the following manner. For each set, the genomic sequence for each coordinate was downloaded from the Ensembl database (http://www.ensembl.org). To calculate the GC content of the sequence, the number of Gs and Cs in the sequence was counted and that number divided by the total length of the sequence. To calculate the repeat content of the sequence, the coordinates of the UCSC Genome Browser “Simple Repeats” track generated by Tandem Repeats Finder [32] were used to identify base pairs belonging to repeat sequences. The number of these base pairs was then divided by the total length of the sequence to give the percentage of repeat sequence in the region.
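The two sequence measures described above reduce to simple counting, sketched below. The function names, the half-open coordinate convention, and the merging of overlapping repeat annotations so that no base pair is counted twice are assumptions made for illustration.

```python
def gc_fraction(seq):
    """Number of G and C bases divided by the total sequence length."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def repeat_fraction(region_start, region_end, repeats):
    """Fraction of [region_start, region_end) covered by repeat intervals.

    `repeats` is a list of (start, end) half-open intervals on the same
    chromosome; overlapping intervals are counted only once.
    """
    length = region_end - region_start
    # Keep only intervals touching the region, clipped to its bounds.
    clipped = sorted(
        (max(s, region_start), min(e, region_end))
        for s, e in repeats
        if s < region_end and e > region_start
    )
    covered, cursor = 0, region_start
    for s, e in clipped:
        s = max(s, cursor)  # skip bases already counted
        if e > s:
            covered += e - s
            cursor = e
    return covered / length
```

Multiplying either value by 100 gives the percentage reported for each region.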
As most of the sets were not normally distributed in GC or repeat content as found by the Jarque-Bera test, the two-sample Kolmogorov-Smirnov test (KS test) was used to test whether these sets differed in their distribution of these two parameters.

3.3. Results

3.3.1. Array noise and copy number variation in samples pre- and post-WGA

To establish a baseline for array noise and copy number variant detection prior to amplification, three unamplified DNA samples were compared to the Affy48 reference set (Methods; Figure 3.1b), and candidate copy number variants were identified. This comparison versus the Affy48 set was then repeated using three corresponding amplified samples. As a measure of array noise, we quantified the distribution of log2 ratios resulting from these comparisons by calculating the mean, standard deviation (SD), and interquartile range (IQR) (Table 3-3, Figure 3.2). As expected due to normalization by CNAT4, the mean log2 ratios from both unamplified and amplified samples were very close to zero. The SDs and IQRs of log2 ratios from amplified samples were nearly twice those of the unamplified samples, suggesting an increase in array noise using WGA material.

To compare the copy number variants detected pre- and post-WGA, we counted apparent copy number variants with p-values more significant than each comparison’s false-positive detection limit (Table 3-3, Figure 3.3). The analysis of unamplified samples detected 13 candidate copy number variants, 11 of which overlapped the coordinates of genomic variants listed in the Database of Genomic Variants (http://projects.tcag.ca) [5] (Table 3-4). In contrast, the analysis of the amplified samples identified 1,572 apparent copy number variants, an approximately 100-fold increase in the number of apparently significant amplifications and deletions versus the unamplified samples (Table 3-3). These artifactual CNVs are likely the result of WGA-induced biases.
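The noise summary used in this section (mean, SD, and IQR of the log2 ratios from one comparison) can be computed as in the following sketch. The quartile convention (`method="inclusive"`) and the function name are assumptions; other statistical packages may use a slightly different quartile definition.

```python
from statistics import mean, quantiles, stdev


def noise_summary(log2_ratios):
    """Mean, sample SD, and interquartile range of a list of log2 ratios."""
    q1, _median, q3 = quantiles(log2_ratios, n=4, method="inclusive")
    return {
        "mean": mean(log2_ratios),
        "sd": stdev(log2_ratios),
        "iqr": q3 - q1,
    }
```

A wider SD or IQR at a near-zero mean corresponds to the increased spread seen for the amplified samples.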
To assess experimental variation prior to amplification, each unamplified and amplified sample was subjected to a pair-wise comparison against an experimental replicate of itself (Table 3-5). The lack of fluctuation in mean, SD, and IQR in the log2 ratios from unamplified replicates suggests a high degree of reproducibility of the array method used. Similarly, while still elevated relative to unamplified samples, there is no major fluctuation in these values between amplified replicates, further supporting the notion that the WGA method behaves consistently. However, in the comparisons against the Affy48 reference set, the SDs and IQRs obtained from unamplified samples were substantially lower than those obtained from amplified samples. This indicates that amplified samples produce different signal intensity distributions than unamplified samples, suggesting that comparison of amplified to unamplified data sets is potentially problematic.

3.3.2. Copy number variants induced by whole genome amplification

To identify apparent copy number variants arising from non-uniform amplification bias in the WGA technique, data from paired pre- and post-WGA samples were directly compared to each other (Figure 3.1c). Our analysis identified apparent WGA-induced over- and under-amplifications in each of the three comparisons of amplified versus unamplified material. In sample 1, we detected 502 amplifications (p-value threshold of detection, p<1.68x10^-6) and 580 deletions (p<1.71x10^-8). In sample 2, we detected 467 amplifications (p<1.68x10^-6) and 202 deletions (p<1.64x10^-8). In sample 3, we detected 546 amplifications (p<1.68x10^-6) and 259 deletions (p<3.45x10^-8). Our analysis also revealed a set of 265 recurrent apparent WGA-associated aberrations that were detected in all three comparisons. This set consisted of 190 over-amplifications (Table 3-1) and 75 under-amplifications (Table 3-2).
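The recurrent regions reported here follow the minimal-overlap definition given in the Methods: the smallest region covered by a WGA-induced variant in all three comparisons. A minimal sketch of that intersection is shown below. It assumes half-open (start, end) intervals on a single chromosome, non-overlapping intervals within each comparison, and separate handling of over- and under-amplifications; the function names are hypothetical.

```python
def intersect(a, b):
    """Intersect two sorted lists of non-overlapping (start, end) intervals."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def minimal_overlapping_regions(*comparisons):
    """Regions covered by a variant in every comparison."""
    result = sorted(comparisons[0])
    for other in comparisons[1:]:
        result = intersect(result, sorted(other))
    return result
```

Applied to the variants of one type from the three pre- versus post-WGA comparisons, the returned intervals correspond to the entries of a table such as Table 3-1 or Table 3-2.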
Thirty-nine of these regions overlapped one of the 92 regions of bias (31 of 62 over-amplifications, 8 of 30 under-amplifications) identified by three previous studies [23, 24, 27]. 110 of the regions we identified overlapped genomic regions with known copy number variants [2] (64 over-amplifications, 46 under-amplifications) but there was no correlation between regions susceptible to WGA-associated bias and known copy number variants (p=1.00). In a set of 2,650 random genomic coordinates with the same size distribution as the WGA-induced artifacts, 36.26% overlapped a known copy number variant, a proportion near the 41.51% overlap observed with the set of WGA-induced biases.

The minimal overlapping regions (see Methods) of WGA-induced over-amplifications encompassed 13.6 Mbp of the reference human genome sequence and ranged from 2,207 bp to 357,399 bp, with a median size of 58,961 bp, and an IQR of 66,524 bp. These recurrently over-amplified sites were distributed throughout the genome and had a statistically significant increase in GC content relative to a set of 1,900 random genomic segments with identical size distribution (p=8.36x10^-40). These over-amplified sites were also enriched for tandem repeat sequences relative to the set of 1,900 random genomic segments (p=1.76x10^-6). These results are compatible with the notion that over-amplification by the WGA technique is related to the GC and repeat content of the underlying sequence.

The minimal overlapping regions of the recurrent WGA-induced under-amplifications encompassed 8.37 Mbp of the reference human genome sequence and ranged from 5,206 bp to 1.93 Mbp, with a median size of 75,698 bp, and an IQR of 64,619 bp. These regions of under-amplification appeared to fall into two groups: those near chromosome ends and those distributed throughout the genome.
Comparison of the 54 under-amplified sites distributed throughout the genome with a set of 540 random genomic segments with identical size distribution found no statistically significant difference in GC content (p=0.0796) or repeat sequences (p=0.1901). However, the under-amplifications were greatly depleted for GC-rich regions compared to the over-amplifications (p=1.93x10^-5), which supports the notion that WGA amplification efficiency is related to the GC content of the underlying sequence. A plot of GC content versus copy number shows a trend of increasing amplification magnitude (i.e., increasing copy number) with increasing GC content (Figure 3.4).

Of the 39 chromosome ends (see Methods) assayed by probe sets, 15 contained regions of under-amplification (Table 3-6). Only 3 chromosome ends contained over-amplifications, suggesting that under-representation of chromosome ends is a consistent result of whole genome amplification. The set of chromosome end under-amplifications impacted 2.547 Mbp of the reference human genome sequence, and the GC content was statistically greater than that of a set of 150 random genomic segments with identical size distribution (p=1.12x10^-6). However, there was no statistical difference in GC content between the under-amplified chromosome ends and the 25 appropriately amplified chromosome ends (p=0.8215). This suggests that amplification bias due to GC content does not play a role in under-amplification of specific subtelomeric regions. Under-amplified chromosome ends were enriched for repetitive sequences (see Methods) relative to both a set of 150 random genomic segments with identical size distribution (p=1.52x10^-9) and the 25 assayed chromosome ends that were not under-amplified (p=0.0022), suggesting that increased repeat content of specific chromosome ends may result in their under-amplification.
To assess WGA-induced CNV artifacts using a second array platform, we compared pre- and post-amplification sample pairs in three comparative genomic hybridization (CGH) experiments using the NimbleGen 385k array. The log2 ratios from these experiments were widely distributed (average SD = 0.378, average IQR = 0.457) and, while several thousand CNVs were detected, none was identified with a p-value passing the stringent false-positive thresholds set by our algorithm due to the high level of noise in these data (p<3.51x10^-7 for over-amplifications, p<3.30x10^-11 for under-amplifications). Analysis of these data using a 50-probe moving window without filtering for false positives detected 2,116 WGA-induced CNVs (466 over-amplifications, 1,650 under-amplifications) of which 141 occurred in all three comparisons (29 over-amplifications, 112 under-amplifications). Despite their relatively large size (average = 1.06 Mb, median = 0.36 Mb, SD = 4.10 Mb), only 28 of these overlapped recurrent artifacts detected by the Affymetrix comparisons (17 of 190 over-amplifications, 11 of 75 under-amplifications). This amount of overlap is similar to that seen with a set of 2,116 random genomic coordinates with the same size distribution as the CNVs detected by the NimbleGen platform, of which 65 overlapped a WGA-induced CNV detected by the Affymetrix platform. These results suggest that these are artifacts resulting from the difficulty in distinguishing real CNVs from background noise when co-hybridizing amplified and unamplified samples, even when a large moving window of 50 probes is used.

3.3.3. Use of amplified material for pair-wise copy number comparisons

To assess the use of WGA material in pair-wise comparisons, each sample was compared to the other samples one-by-one, and relative differences in copy number in the three samples assessed using: 1) unamplified samples vs. unamplified samples, 2) amplified samples vs. unamplified samples, and 3) amplified samples vs.
amplified samples (Figure 3.1d). An example of the output from one such set of comparisons is illustrated in Figure 3.5.

The unamplified vs. unamplified comparisons identified 21 apparent differences in copy number between the three samples (Table 3-7 and Table 3-8). These pair-wise comparisons identified 5 of 13 apparent differences expected from the individual comparisons of samples to the Affy48 reference set. Twelve of these apparent differences, including the 5 differences expected from comparison with the Affy48 set, overlap variants listed in the Database of Genomic Variants (http://projects.tcag.ca).

The amplified vs. unamplified comparisons identified 3,207 apparent differences in copy number among the three samples (Table 3-7). Only seven of these apparent differences were detected by both unamplified/amplified and amplified/unamplified comparisons, suggesting that systematic WGA-induced variants and random WGA-reaction variability mask real events.

The amplified vs. amplified comparisons identified 275 apparent differences in copy number among the three samples (Table 3-7). These amplified vs. amplified comparisons identified 2 of the 12 apparent amplifications and 5 of the 9 apparent deletions seen in the unamplified comparisons (Table 3-8), suggesting that pair-wise comparisons of material where both samples have been subjected to WGA can partially compensate for reproducible WGA-induced bias (Figure 3.5). The most significant deletion identified by each unamplified comparison was recapitulated as the most significant deletion identified by the corresponding amplified comparison (Table 3-8). This was also true of the most significant amplification in two of the three comparisons (Table 3-8).
The list of variants detected at lower levels of significance than these top-scoring events may still contain real CNVs, although it is difficult to isolate these from the remaining artifactual events resulting from random experimental variation without independent validation of each one.

3.3.4. Validation of WGA pair-wise comparisons for copy number detection

To determine the extent to which amplified pair-wise comparisons mask known, validated copy number variants, DNA from the blood of three father/child pairs with previously described CNVs [9] was subjected to WGA and copy number analysis using the 250k Nsp chip of the Affymetrix 500k set. The original analysis of unamplified DNA performed using the Affymetrix Mapping 100k SNP array set [9] identified a total of 32 CNVs within the three father/child pairs, of which five (2 amplifications, 3 deletions) were validated by conventional cytogenetic analysis or FISH (Table 3-9).

The amplified child vs. amplified father comparisons identified 63 CNVs within the three pairs. Analysis of amplified family pair #8379 identified 41 copy number differences (13 amplifications p<3.48x10^-6, 28 deletions p<8.38x10^-8 in the child relative to the father), analysis of amplified family pair #1280 identified 6 copy number differences (2 relative amplifications p<2.14x10^-6, 4 relative deletions p<1.05x10^-8), and analysis of amplified family pair #3476 identified 16 copy number differences (6 relative amplifications p<2.07x10^-6, 10 relative deletions p<6.09x10^-9). These copy number differences were then ranked by p-value (most significant to least) and the coordinates compared to those of the validated aberrations. The amplified vs. amplified comparisons identified four of the five CNVs (2 amplifications, 2 deletions) validated by FISH [9] and each received the lowest p-value for its comparison (Table 3-9).
The single validated CNV that was not detected by the amplified comparisons may have been missed due to a difference in array coverage at this site. On the 250k Nsp array, this region was covered by 3 probe sets (10,683 bp/probe set) compared to 6 probe sets (5,341 bp/probe set) on the 100k array. This was also the smallest feature of the set of validated CNVs (0.03 Mb) and may reflect a decrease in detection sensitivity when using amplified comparisons.

Among the top-ranked variants (i.e., those with the most significant p-values), six variants were identified by the 250k WGA experiment that were not detected by the original experiments. Five of these are covered by 6 or fewer probe sets (5,743-93,452 bp/probe set, one with no probes) on the 100k array. In addition to the possibility of an increased false-positive rate due to increased array noise, differences in each array’s probe coverage may explain why these regions were only detected by the experiment using amplified samples.

3.3.5. Genotype fidelity

To compare the fidelity of genotype calls derived from WGA product to those from corresponding unamplified samples, data from matched pairs of these sources were compared. Average genotype call rates (+/- 1 standard deviation) were 96.74+/-1.14% from the unamplified samples and 93.14+/-2.68% from the WGA samples, suggesting a modest degree of information loss following amplification. Of the SNPs which were unsuccessfully called in the amplified samples, only 2% were common to all three samples, and only one of these fell within a region of WGA-induced bias (an over-amplification). Genotype concordance was 98.57+/-0.53% between calls successfully made from both amplified and unamplified samples in each matched pair. There was very little overlap in the coordinates of SNPs with non-concordant genotypes and regions of recurrent WGA-induced bias.
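The call rate and concordance measures used in this section can be illustrated with a small sketch. It assumes each sample is represented as a mapping from SNP identifier to one of 'AA', 'AB', 'BB', or a no-call marker; the data layout, the `NO_CALL` value, and the function name are hypothetical and do not reflect the GTYPE output format.

```python
NO_CALL = "NoCall"


def compare_genotypes(unamplified, amplified):
    """Call rates, concordance, and discordance classes for a matched pair."""
    counts = {"concordant": 0, "het_to_hom": 0,
              "hom_to_het": 0, "hom_to_other_hom": 0}
    both_called = 0
    for snp, before in unamplified.items():
        after = amplified.get(snp, NO_CALL)
        if before == NO_CALL or after == NO_CALL:
            continue  # concordance only considers SNPs called in both
        both_called += 1
        if before == after:
            counts["concordant"] += 1
        elif before == "AB":
            counts["het_to_hom"] += 1        # AB called as AA or BB
        elif after == "AB":
            counts["hom_to_het"] += 1        # AA or BB called as AB
        else:
            counts["hom_to_other_hom"] += 1  # AA called as BB, or BB as AA

    def call_rate(calls):
        return sum(g != NO_CALL for g in calls.values()) / len(calls)

    return {
        "call_rate_unamplified": call_rate(unamplified),
        "call_rate_amplified": call_rate(amplified),
        "concordance": counts["concordant"] / both_called if both_called else 0.0,
        "counts": counts,
    }
```

Dividing each discordance count by the total number of non-concordant calls gives percentages of the kind reported for each class.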
Of the non-concordant calls, 58.77% were called heterozygotes in the unamplified sample and homozygotes in the amplified sample (i.e., AB called as AA or BB) and 0.2% of these were located in regions of WGA-induced over-amplification while none were in regions of WGA-induced under-amplification. 40.66% were called homozygotes in the unamplified sample and heterozygotes in the amplified sample (i.e., AA or BB called as AB), of which none were located in regions of WGA-induced bias, and 0.57% were incorrectly called homozygotes (i.e., AA called as BB or BB called as AA) of which none were located in regions of WGA-induced bias.

Twelve regions, each containing 3-7 SNPs, were identified as displaying loss of heterozygosity (LOH) in total from the three pre- and post-amplification comparisons. Three of the LOH regions showed allele-specific amplification (copy number of 3), while the remaining 9 did not (copy number of 2). These regions impacted a total of 58 SNPs, 0.01% of all of the SNPs assayed, and none overlapped a region recurrently over- or under-amplified by WGA. These results suggest that increased random array noise is likely a greater source of genotype non-concordance than systematic allele-specific amplification bias or polymerase error.

3.4. Discussion

The ability to discover copy number variants in unamplified human DNA using data generated by the Affymetrix Mapping SNP array platform has been previously demonstrated by our group and others [1-3, 9]. However, with small amounts of DNA from tumour biopsies, for example, amplification of the starting material prior to discovery of copy number variants is often necessary to generate enough material to conduct such analyses. We aimed to assess the nature of biases that are introduced by this amplification and to determine their impact on copy number detection and whether pair-wise comparisons could compensate for these biases.
For the first time, we have used a high resolution microarray platform to explicitly define regions susceptible to WGA-induced bias, statistically assessed the sequence features underlying these biases, and demonstrated an ability to correct for these biases and resolve real CNVs.

In this study, three unamplified DNA samples were used to establish a baseline for array noise and copy number variant detection. These were compared to the same DNA samples that were amplified in duplicate using a WGA technique. The apparent copy number variants we detected by comparing unamplified samples to the unamplified Affy48 reference set were likely real events, as the variants were relatively large, statistically significant, and 11 of the 13 copy number variants corresponded to previously documented genomic variants [5]. While our variant detection approach adjusts its threshold of significance based on the level of noise of each array, comparisons using amplified samples still identified hundreds of apparent CNVs not seen in the unamplified comparisons on the Affymetrix array platform. Since these comparisons were performed against an unamplified reference, it is likely that these artifactual apparent CNVs were the result of preferential amplification of regions of the genome and not due to an increased level of array noise.

The data from the NimbleGen platform appeared to have a high level of noise that affected our ability to detect WGA-induced CNVs when co-hybridizing unamplified and amplified samples. Our results suggest that amplified and unamplified samples cannot be directly compared to uncover WGA-induced artifacts using the NimbleGen CGH array. However, this should not preclude the comparison of similarly amplified samples on this platform as we have shown using Affymetrix arrays that the biases are largely systematic and the noise is reduced substantially when comparing two amplified samples.
To explore the nature of this bias, we directly compared Affymetrix data from pre- and post-amplification sample pairs and observed a set of regions apparently over- or under-amplified in all three samples. These regions impacted a total of 21.97 Mb of sequence, consisted of 190 over-amplifications and 75 under-amplifications, and overlapped 39 of 92 regions of WGA-induced bias identified by other studies [23, 24, 27]. The low amount of overlap is perhaps due to differences in genome coverage by the arrays used in these studies, particularly as there was no previous consensus on any region being susceptible to WGA-induced bias. The results reported here are for DNA amplified using the Qiagen Repli-g Mini kit, and it is conceivable that DNA amplified using different protocols will exhibit different bias. While the lack of a correlation between regions of WGA-induced bias and known CNVs is different from a previous observation [24], we have demonstrated that the degree of overlap of the amplification biases we identified with known CNVs is only slightly greater than would be expected by chance. The amount of overlap observed is likely due to the fact that documented CNVs are generally large, 165 kb on average, and, in total, impact ~27% of the genome.

The difference in size and size distribution of the over- and under-amplifications that we identified suggests focal over-amplification of specific sequences and broader under-representation of others. We observed a direct relationship between amplification efficiency and GC content, as over-amplified regions had a statistically significant increase in GC content relative to the deletions (p=1.93x10^-5) and the magnitude of over-amplification appeared to scale directly with GC richness (Figure 3.4).
These results are consistent with the notion that WGA-induced over-amplification bias is related to the increased binding affinity of GC-rich hexamers relative to AT-rich hexamers and not a shortage of hexamers corresponding to repetitive regions in the genome. There is also the possibility that, unlike many polymerases, Phi29 polymerase is more efficient in synthesizing GC-rich sequences, thereby resulting in over-amplification of these regions. These effects likely also contribute to under-amplification of GC-poor regions distributed throughout the genome but not to the loss of chromosome ends. The lack of a relationship between regions of WGA-induced bias and the presence of known copy number variants suggests that different mechanisms account for these phenomena.

The loss of chromosome ends appears to be a frequent result of the WGA procedure, as 15 of the 39 ends assayed were under-amplified in all samples compared to only three that were over-amplified. Relative to chromosome ends that were not affected by bias, the under-amplified ends were enriched for repetitive sequences (p=0.0022) but did not have a statistically significant difference in GC content (p=0.8215). These results suggest that the source of amplification bias at chromosome ends is different from GC-content-derived biases affecting the rest of the genome. One possible explanation is the positional effect of having fewer overlapping amplification products at the ends of linear strands of DNA than in the middle. However, if this were the case, then all chromosome ends should be similarly under-amplified, which they are not. Another possible explanation is that the limited quantities of hexamers corresponding to subtelomeric repeats result in fewer priming events in these regions. This may account for the loss of repetitive chromosome ends more frequently than less repetitive ends.

We found that samples subject to Phi29-based WGA can be used for accurate genotyping, albeit with some data loss.
From the WGA samples, we consistently observed a decrease in the average number of genotype calls and a wider range of call rates compared to those from the unamplified samples. However, of the genotype calls that were made, over 98% were concordant between amplified and unamplified sample pairs. Of the fewer than 2% of calls that were non-concordant, 99.43% were discrepant heterozygotes (i.e., AB called as AA or BB, AA or BB called as AB) rather than incorrectly called homozygotes, and almost none (<0.12%) were located in regions of WGA-induced bias. This discrepancy rate is very near that observed between unamplified replicates on the Affymetrix 500k array [33]. It is likely that the source of genotype call non-concordance is related to the genotyping accuracy of the array in the presence of increased noise due to WGA and not to true genotype changes induced by WGA through allele-specific amplification or polymerase error.

Regardless of the source of the systematic biases induced by WGA, we have shown that pair-wise analysis of amplified samples is a viable strategy for CNV detection, albeit with an appropriate threshold of significance to filter the number of low-significance random artifacts induced by this technique. While the greater number of apparent copy number differences detected using amplified samples has the potential to mask real events, we observed that pair-wise comparisons of such samples can detect real differences between samples. By comparing amplified samples to amplified samples, the number of artifactual copy number differences is reduced by an order of magnitude relative to comparisons of amplified versus unamplified samples due to the systematic nature of the bias induced by the technique. Conceivably, the use of a large, amplified reference set would be a practical alternative to pair-wise comparisons for larger batches of amplified samples requiring a universal reference.
Of the apparent copy number differences detected by the three pair-wise comparisons using unamplified material, all of the top deletions and two of the three top amplifications were identified as the most significant by the corresponding comparisons using amplified material. By applying this technique to paired child/father samples with known, validated copy number differences [9], four of the five validated differences detected by the original study using unamplified DNA were the most significant in the same comparisons using amplified DNA. The only validated CNV that was missed using WGA material was probably due to a difference in coverage by the array platforms used. A similar difference in coverage partially explains the presence of six high confidence CNVs detected by the WGA experiments not seen in the original study, as one of these has recently been observed in the unamplified material using a higher resolution platform. Therefore, when evaluating the results from amplified comparisons, CNVs with the top ranked significance are more likely to be real CNVs in the unamplified sample.

3.5. Figures

Figure 3.1 Experimental design

(A) In this study, we aimed to assess the impact of WGA on the detection of copy number variants, to explore copy number biases induced by this technique, and to assess the use of pair-wise analysis to address such biases. To this end, DNA samples from three fresh frozen tissues were subject to WGA and analyzed pre- and post-amplification on the Affymetrix Mapping 500k SNP array set. For each copy number analysis, different sets of microarray data were compared as shown in panels B-D. Log2 intensity ratios were calculated from the selected data comparisons using a software pipeline based on CNAT v4.0.
These ratios were then screened by an “exhaustive search” algorithm, in which t-scores were calculated in 3- to 30-probe windows and statistically significant aberrations identified above array-specific thresholds defined through permutation. To detect CNVs impacting more than 30 probes, aberrations found to contain more than 27 probes were subject to a t-score optimization using larger and larger window sizes until a local maximum t-score was found. The resulting high confidence list of CNVs was then compared as appropriate for each analysis. (B) In this set of comparisons against a common reference set, we investigated the effect of WGA on array noise (i.e., the distribution of log2 ratios) and the ability to resolve copy number variants. To this end, each unamplified and amplified sample was independently compared against the Affy48 reference set, log2 ratios calculated and detected copy number variants compared. (C) To assess the nature of bias induced by WGA, this data set directly compared matched pre- and post-WGA samples. Since matched samples were used, all copy number variants detected in this analysis are due to the amplification technique. (D) This set of comparisons examined the ability of pair-wise analysis of amplified samples to recapitulate copy number variants detected in unamplified samples. Three pair-wise comparisons were conducted using both unamplified and amplified material and the observed copy number variants were compared.

Figure 3.2 Boxplots comparing the spread of log2 ratios in unamplified and amplified samples

The log2 ratios resulting from comparison of each sample against the Affy48 reference set were plotted using a standard box and whisker plot displaying a five number summary: maximum value or Q3+(1.5 x IQR), Q3, mean, Q1, and minimum value or Q1-(1.5 x IQR). Outliers, defined as values that fall more than 1.5 x IQR above Q3 or below Q1, are displayed as individual data points.
Due to normalization as part of the CNAT4 analysis pipeline, the mean log2 ratio from each sample is close to zero. However, the IQR was larger, and the maximum and minimum values were further from the mean, in the amplified samples relative to the unamplified samples. The increased spread of the data distribution is likely due to increased array noise and the detection of amplification biases induced by WGA.

Figure 3.3 Apparent CNVs in unamplified and amplified samples

The number of variants detected in unamplified and amplified samples from comparison against the Affy48 reference set was counted. The amplified samples appear to contain hundreds of copy number variants not seen in the unamplified samples, suggesting that WGA over- or under-represents specific regions of the genome.

Figure 3.4 Copy number distribution and GC content of WGA-induced CNVs

The number of variants and % GC content were plotted against copy number magnitude for all of the CNVs detected by comparisons of each pre- and post-WGA sample pair. There appears to be a direct relationship between the magnitude of over-amplification and increased GC content.

Figure 3.5 Example of how a pair-wise comparison of amplified material can partially compensate for WGA-induced bias

Shown is the output of three copy number analyses conducted using our CNV discovery software pipeline. Copy number, calculated directly from log2 ratios of probe intensities, is plotted against genome location using a sliding window of averaged data points, in this case 60 probes. Regions of copy number increase or decrease, those with statistically significant p-values, are identified in green and all other regions are marked in red. In this example, a pairwise comparison of two unamplified samples identified a gain of copy number (p < 1.00 x 10^-16) in unamplified sample #1 relative to unamplified sample #2 at a locus documented to be copy number variable in the Database of Genomic Variants.
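The sliding-window copy number track described in this caption can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the pipeline's actual code: the function name `copy_number_track` is hypothetical, the conversion assumes a two-copy reference (copy number = 2 x 2^ratio), and the log2 ratios are averaged within each window before conversion, an ordering the caption does not specify.

```python
def copy_number_track(log2_ratios, window=60, reference_copies=2):
    """Moving-average the log2 ratios, then convert each window to copy number.

    Assumptions: the reference carries `reference_copies` copies, and probes
    are already sorted by genome position.
    """
    n = len(log2_ratios)
    if window < 1 or window > n:
        raise ValueError("window must be between 1 and the number of probes")
    track = []
    running = sum(log2_ratios[:window])
    for start in range(n - window + 1):
        if start > 0:
            # Slide by one probe: add the entering value, drop the leaving one.
            running += log2_ratios[start + window - 1] - log2_ratios[start - 1]
        track.append(reference_copies * 2 ** (running / window))
    return track


if __name__ == "__main__":
    # Hypothetical ratios: a single-copy gain (log2 ratio near 0.58) in a flat region.
    ratios = [0.0] * 5 + [0.58] * 5 + [0.0] * 5
    print([round(cn, 2) for cn in copy_number_track(ratios, window=3)])
```

Under these assumptions a window whose mean log2 ratio is zero maps to two copies, which is in line with the values just above 2 reported for regions outside the detected variants.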
Conducting the same comparison after WGA of sample #1 results in hundreds of confounding copy number variants from which the known copy number variant is indistinguishable. However, conducting this comparison after WGA of both samples restores the ability to detect this CNV. Artifactual variants still remain as a result of random variation in the WGA process, but they do not reach the level of significance of the real event. Therefore, when interpreting results from comparisons of WGA samples, only the top-most hits are likely to be representative of the unamplified sample.

3.6. Tables

Table 3-1 Regions of recurrent WGA over-amplification

Genome Coordinates (Build 36/hg18/Mar 2006) chr1:4289059-4343246 chr1:4471600-4481483 chr1:18235855-18239273 chr1:18436270-18439577 chr1:31292882-31443646 chr1:37463385-37475919 chr1:41064118-41206071 chr1:44791755-44930325 chr1:58584351-58645411 chr1:149246905-149597928 chr1:156259883-156311043 chr1:158049173-158126016 chr1:201867229-201924372 chr1:206392425-206428178 chr1:207779076-207858238 chr2:23092874-23191469 chr2:29833725-29967658 chr2:38509857-38556050 chr2:44086403-44160450 chr2:67878898-67954538 chr2:79084008-79097608 chr2:85989722-86036159 chr2:206333511-206353492 chr2:216823668-216877334 chr2:218376045-218417460 chr2:219657428-219717510 chr3:10602703-10670666 chr3:14573684-14647641 chr3:25304796-25364253 chr3:63462557-63541811 chr3:67380171-67422715 chr3:72035027-72087064 chr3:117955893-117958100 chr3:124147597-124238833 chr3:136150537-136154569 chr4:85502742-85547359 chr5:32503741-32608477 chr5:38045875-38059398 chr5:73611022-73613695 chr5:137957669-138108906 chr5:139060368-139257047 chr5:141093169-141181524 chr5:141773766-141849848 chr6:12973635-13018684 chr6:14857627-14948029  Size (Mbp) 0.054 0.010 0.003 0.003 0.151 0.013 0.142 0.139 0.061 0.351 0.051 0.077 0.057 0.036 0.079 0.099 0.134 0.046 0.074 0.076 0.014 0.046 0.020 0.054 0.041 0.060 0.068 0.074 0.059 0.079 0.043 0.052 0.002 0.091 0.004 0.045
0.105 0.014 0.003 0.151 0.197 0.088 0.076 0.045 0.090  % GC content 49.838 52.459 43.609 48.549 47.315 42.928 47.879 49.964 42.279 47.321 48.289 48.743 46.302 48.218 45.699 42.645 42.63 45.558 44.483 40.88 41.291 44.905 45.186 46.06 52.994 42.947 48.005 50.043 40.188 38.468 39.967 45.219 36.911 49.101 44.855 39.273 44.94 42.273 39.005 47.3 52.492 50.573 44.822 42.162 43.871  Mbp from nearest chromosome end 4.289 4.472 18.236 18.436 31.293 37.463 41.064 44.792 58.584 97.652 90.939 89.124 45.325 40.822 39.391 23.093 29.834 38.510 44.086 67.879 79.084 85.990 36.598 26.074 24.534 23.234 10.603 14.574 25.305 63.463 67.380 72.035 81.544 75.263 63.347 85.503 32.504 38.046 73.611 42.749 41.601 39.676 39.008 12.974 14.858  83  Genome Coordinates (Build 36/hg18/Mar 2006) chr6:15073036-15207564 chr6:26485525-26524595 chr6:29669829-29753557 chr6:30880323-30977693 chr6:31370848-31380959 chr6:31683255-31829012 chr6:32174155-32249861 chr6:34062545-34211717 chr6:36830767-36865161 chr6:37651502-37710319 chr6:37741084-37875032 chr6:39361539-39374614 chr6:41500542-41617879 chr6:43063502-43360189 chr6:44172042-44301684 chr6:47761199-47804068 chr6:85782474-85821656 chr6:89190270-89273163 chr6:110376081-110425208 chr7:3058747-3073709 chr7:3197339-3301123 chr7:5835087-5867724 chr7:66777854-66801140 chr7:66906956-66969142 chr7:67775989-67872385 chr7:68179354-68228090 chr7:68322081-68428002 chr7:71337089-71367377 chr7:75049350-75225967 chr7:100488171-100592180 chr7:127584188-127675925 chr7:131655313-131766360 chr7:140731983-140813912 chr7:142275787-142371817 chr7:142704105-142779432 chr7:152750937-152798127 chr8:20350532-20387149 chr8:21420713-21468570 chr8:23668724-23752488 chr8:37354229-37472939 chr8:70967579-71130733 chr8:128415540-128483680 chr8:131827508-131890351 chr8:133104378-133175247 chr8:134000193-134029470 chr8:134733068-134748357 chr8:136259961-136277180 chr9:1730367-1756255 chr9:34400978-34556969 chr9:109183520-109250183 chr9:109337263-109457162  Size (Mbp) 0.135 0.039 0.084 
0.097 0.010 0.146 0.076 0.149 0.034 0.059 0.134 0.013 0.117 0.297 0.130 0.043 0.039 0.083 0.049 0.015 0.104 0.033 0.023 0.062 0.096 0.049 0.106 0.030 0.177 0.104 0.092 0.111 0.082 0.096 0.075 0.047 0.037 0.048 0.084 0.119 0.163 0.068 0.063 0.071 0.029 0.015 0.017 0.026 0.156 0.067 0.120  % GC content 45.072 44.614 46.525 48.513 40.615 51.426 50.974 54.414 51.563 50.272 49.759 49.794 50.4 50.899 50.396 41.129 41.299 39.062 43.002 44.744 44.716 47.779 43.724 44.561 44.37 43.891 44.394 44.373 48.26 50.525 48.963 46.675 44.764 47.609 46.742 43.184 46.731 43.667 41.757 42.699 44.063 39.91 44.281 44.203 44.699 46.9 40.906 40.048 48.251 45.642 45.546  Mbp from nearest chromosome end 15.073 26.486 29.670 30.880 31.371 31.683 32.174 34.063 36.831 37.652 37.741 39.362 41.501 43.064 44.172 47.761 85.078 81.627 60.475 3.059 3.197 5.835 66.778 66.907 67.776 68.179 68.322 71.337 75.049 58.229 31.145 27.055 18.008 16.450 16.042 6.023 20.351 21.421 23.669 37.354 70.968 17.791 14.384 13.100 12.245 11.526 9.998 1.730 34.401 31.023 30.816  84  Genome Coordinates (Build 36/hg18/Mar 2006) chr9:111552203-111572732 chr9:118130565-118179376 chr9:118523287-118573078 chr9:118699338-118773794 chr9:119032843-119048518 chr9:121500039-121529854 chr9:121697808-121802562 chr10:17037587-17055411 chr10:29045369-29067693 chr10:30830367-30889472 chr10:35213736-35222408 chr10:72358857-72391369 chr10:78771093-78832348 chr10:80117004-80173559 chr10:102944887-103070144 chr10:106870314-106893971 chr10:119306070-119335828 chr11:13052115-13062946 chr11:45426732-45493098 chr11:56668848-56711422 chr11:61917900-62120435 chr11:69515071-69537473 chr11:114155209-114231222 chr11:115741865-115777459 chr11:117461473-117478883 chr11:117533377-117562454 chr11:118796655-118914006 chr11:126136392-126161131 chr11:130902902-130982364 chr12:51596678-51709772 chr12:52387023-52412999 chr12:52681349-52755293 chr12:53228106-53329952 chr12:106372782-106389331 chr12:112474952-112562237 chr12:112573453-112641210 
chr12:113616005-113721129 chr12:114338015-114416917 chr12:115325571-115386413 chr12:117606824-117703301 chr12:118053074-118108718 chr12:120038561-120080574 chr12:120399191-120643045 chr13:24677565-24700676 chr13:35171922-35227287 chr14:69704299-69714918 chr14:91892520-91908197 chr14:93980102-94033174 chr14:95141650-95185168 chr15:55978179-56036075 chr15:56550970-56610077  Size (Mbp) 0.021 0.049 0.050 0.074 0.016 0.030 0.105 0.018 0.022 0.059 0.009 0.033 0.061 0.057 0.125 0.024 0.030 0.011 0.066 0.043 0.203 0.022 0.076 0.036 0.017 0.029 0.117 0.025 0.079 0.113 0.026 0.074 0.102 0.017 0.087 0.068 0.105 0.079 0.061 0.096 0.056 0.042 0.244 0.023 0.055 0.011 0.016 0.053 0.044 0.058 0.059  % GC content 40.472 44.288 43.423 42.379 43.583 39.16 42.241 41.492 43.857 43.781 46.57 52.659 44.58 46.232 51.084 40.701 47.287 46.926 47.543 45.214 49.924 46.476 42.651 45.487 47.901 51.279 49.231 45.614 44.085 47.216 47.904 50.454 47.65 48.151 46.32 46.563 46.98 42.64 48.51 43.816 46.243 46.956 50.054 45.448 43.169 44.275 47.748 46.442 46.387 42.721 46.007  Mbp from nearest chromosome end 28.701 22.094 21.700 21.499 21.225 18.743 18.471 17.038 29.045 30.830 35.214 62.983 56.542 55.201 32.305 28.481 16.039 13.052 45.427 56.669 61.918 64.915 20.221 18.675 16.974 16.890 15.538 8.291 3.470 51.597 52.387 52.681 53.228 25.960 19.787 19.708 18.628 17.933 16.963 14.646 14.241 12.269 11.706 24.678 35.172 36.654 14.460 12.335 11.183 44.303 43.729  85  Genome Coordinates (Build 36/hg18/Mar 2006) chr15:65169105-65182542 chr15:66485246-66523220 chr15:68217654-68337380 chr15:86586302-86669305 chr15:88453860-88601412 chr16:5750131-5773779 chr16:8660048-8713922 chr16:9207447-9226408 chr16:10244004-10265038 chr16:11194191-11272284 chr16:16018456-16102139 chr16:19961872-19991194 chr16:20132016-20173234 chr16:26692215-26716739 chr17:4592813-4825056 chr17:28710395-28760926 chr17:29816420-29832905 chr17:38983818-39099959 chr18:17857497-17977937 chr18:33126219-33191107 chr18:33331187-33357665 
chr18:42299285-42320826 chr18:43392252-43496059 chr18:46571157-46644139 chr19:5299933-5446197 chr19:6607455-6641771 chr19:7117138-7343349 chr19:7605347-7713610 chr19:11088326-11445725 chr19:56105140-56130289 chr20:5435848-5639678 chr20:35680392-35749998 chr20:40784442-40792215 chr20:44231964-44376706 chr20:54720806-54774975 chr21:36470807-36502797 chr21:40273447-40312865 chr22:25796629-25833837 chr22:25917001-25958151 chr22:26250032-26287965 chr22:26417123-26435851 chr22:32410571-32471321 chr22:35337416-35439400  Size (Mbp) 0.013 0.038 0.120 0.083 0.148 0.024 0.054 0.019 0.021 0.078 0.084 0.029 0.041 0.025 0.232 0.051 0.016 0.116 0.120 0.065 0.026 0.022 0.104 0.073 0.146 0.034 0.226 0.108 0.357 0.025 0.204 0.070 0.008 0.145 0.054 0.032 0.039 0.037 0.041 0.038 0.019 0.061 0.102  % GC content 47.559 49.396 48.759 49.155 49.25 43.393 47.311 46.883 47.188 51.812 48.541 44.354 42.691 43.947 51.562 44.404 46.894 49.368 44.565 53.998 52.045 49.285 43.349 43.646 50.388 50.203 47.339 52.77 52.354 47.948 43.587 46.356 45.343 51.218 47.408 47.488 44.928 46.682 45.374 49.391 50.211 45.12 46.816  Mbp from nearest chromosome end 35.156 33.816 32.002 13.670 11.738 5.750 8.660 9.207 10.244 11.194 16.018 19.962 20.132 26.692 4.593 28.710 29.816 38.984 17.857 33.126 33.331 33.796 32.621 29.473 5.300 6.607 7.117 7.605 11.088 7.681 5.436 26.686 21.644 18.059 7.661 10.442 6.631 23.858 23.733 23.403 23.256 17.220 14.252  86  Table 3-2 Regions of recurrent WGA under-amplification Genome Coordinates (Build 36/hg18/Mar 2006) chr1:3058506-3129776 chr1:5857077-5871605 chr1:7718141-7831475 chr1:188208865-188276324 chr1:214179005-214234267 chr1:218251671-218276896 chr1:232592563-232677945 chr1:235352252-235413331 chr2:554079-613259 chr2:1841469-1968296 chr2:128743219-128877673 chr2:159098562-159174260 chr5:487981-738504 chr5:2187888-2267721 chr5:2836714-2884070 chr5:3160861-3195828 chr5:6776429-6806873 chr6:170198708-170308225 chr7:47777759-47884020 chr7:158582043-158739710 chr8:791584-850907 
chr8:1816651-1946694 chr8:4027418-4039531 chr8:5771188-5797004 chr8:6316036-6371546 chr8:12634490-12708688 chr9:95031716-95156970 chr10:2593122-2624375 chr10:3877605-3946173 chr10:29732980-29815742 chr10:131402923-131473049 chr10:134327710-134332916 chr11:22799571-22884748 chr11:41726762-41805501 chr11:98814153-98916830 chr11:123252803-123316873 chr12:76314644-76411616 chr12:113021576-113043667 chr12:130611957-130673802 chr13:112193014-112294946 chr13:113053814-113215730 chr14:46453446-46536895 chr15:25756802-25785341 chr15:99580062-99745948 chr16:14143079-14310216 chr16:31443920-33371617 chr16:52606706-52680890 chr16:62091121-62180196  Size (Mbp) 0.071 0.015 0.113 0.067 0.055 0.025 0.085 0.061 0.059 0.127 0.134 0.076 0.251 0.080 0.047 0.035 0.030 0.110 0.106 0.158 0.059 0.130 0.012 0.026 0.056 0.074 0.125 0.031 0.069 0.083 0.070 0.005 0.085 0.079 0.103 0.064 0.097 0.022 0.062 0.102 0.162 0.083 0.029 0.166 0.167 1.928 0.074 0.089  % GC content 57.113 57.168 41.364 34.619 35.409 37.933 39.832 40.499 45.934 45.876 52.327 36.627 56.251 49.395 41.89 46.205 46.553 51.929 43.623 45.905 47.539 49.183 38.749 37.599 39.288 40.565 50.386 37.102 42.615 42.533 47.837 49.165 36.31 36.953 34.289 36.102 36.908 43.491 51.924 42.808 50.548 33.96 45.473 47.27 40.888 41.702 40.821 35.123  Mbp from nearest chromosome end 3.059 5.857 7.718 58.973 33.015 28.973 14.572 11.836 0.554 1.841 114.073 83.777 0.488 2.188 2.837 3.161 6.776 0.592 47.778 0.082 0.792 1.817 4.027 5.771 6.316 12.634 45.116 2.593 3.878 29.733 3.902 1.042 22.800 41.727 35.536 11.136 55.938 19.306 1.676 1.848 0.927 46.453 25.757 0.593 14.143 31.444 36.146 26.647  87  Genome Coordinates (Build 36/hg18/Mar 2006) chr16:74485846-74586365 chr16:79337064-79399432 chr16:81193852-81216695 chr16:82334613-82343460 chr16:83305397-83391850 chr16:86246069-86327452 chr16:87408466-87706274 chr17:19959433-20070746 chr17:20801440-20845676 chr17:36095188-36165632 chr17:50010009-50101821 chr17:59661634-59698705 chr17:64611290-64747300 
chr17:67185051-67306066 chr19:373238-892603 chr19:42148245-42239059 chr19:49185317-49257267 chr20:13008085-13018489 chr20:20272110-20371019 chr20:60967459-61027216 chr22:17359268-17386984 chr22:17761331-17872715 chr22:18119390-18255422 chr22:20832179-20871057 chr22:45726565-45769450 chr22:45898268-46013955 chr22:47504692-47516812  Size (Mbp) 0.101 0.062 0.023 0.009 0.086 0.081 0.298 0.111 0.044 0.070 0.092 0.037 0.136 0.121 0.519 0.091 0.072 0.010 0.099 0.060 0.028 0.111 0.136 0.039 0.043 0.116 0.012  % GC content 37.305 39.18 44.458 45.355 44.081 56.272 59.068 41.634 49.974 40.48 37.207 44.675 35.902 37.813 59.541 43.47 40.664 39.702 46.048 49.085 45.153 46.706 54.22 42.792 48.708 52.68 43.61  Mbp from nearest chromosome end 14.241 9.428 7.611 6.484 5.435 2.500 1.121 19.959 20.801 36.095 28.673 19.076 14.027 11.469 0.373 21.573 14.554 13.008 20.272 1.409 17.359 17.761 18.119 20.832 3.922 3.677 2.175

Table 3-3 Distribution of log2 ratios from comparison of unamplified and amplified samples versus a common reference set of 48 individuals

Sample compared                                             Apparent Amplifications   Apparent Deletions
vs. Affy48               Mean*        SD**     IQR***   Count   p<            Count   p<
Sample 1 - Unamplified   0.0002517    0.3079   0.3428   2       1.99x10^-8    3       1.65x10^-9
Sample 1 - Amplified     0.001971     0.3790   0.4793   322     9.76x10^-7    368     9.39x10^-9
Sample 2 - Unamplified   0.002710     0.2602   0.3152   2       3.70x10^-7    2       1.00x10^-16
Sample 2 - Amplified     -0.0001297   0.4188   0.5412   254     8.91x10^-7    157     8.33x10^-9
Sample 3 - Unamplified   0.003530     0.2584   0.3176   3       5.42x10^-10   1       1.00x10^-16
Sample 3 - Amplified     -0.0004284   0.4076   0.5178   295     7.45x10^-7    176     1.36x10^-8

* Mean value of log2 ratios resulting from each comparison. A site with equivalent copy number in both samples would return a log2 ratio of 0.
** Standard deviation of log2 ratios resulting from each comparison. These values are interpreted as a measure of data noise from each comparison.
*** Interquartile range of log2 ratios resulting from each comparison.
These values are interpreted as a measure of data noise from each comparison.

Table 3-4 Apparent amplifications and deletions detected prior to amplification through comparison with a reference set of 48 individuals

Type            Sample compared   Genome Coordinates of Variant       Size (bp)   CN within   CN outside   SNP     p-value
                vs. Affy48        (NCBI Build 36/hg18/Mar 2006)                   variant     variant      count
Amplifications  Sample 1          chr7:48424572-48431182              6610        2.88184     2.04848      11      1.99x10^-8
Amplifications  Sample 1          chr14:19381928-19492423             110495      2.93812     2.03610      28      4.85x10^-13
Amplifications  Sample 2          chr2:113809804-113849256            39452       2.28770     2.04023      12      3.70x10^-7
Amplifications  Sample 2          chr17:41569489-41709662             140173      3.07396     2.03694      41      2.31x10^-12
Amplifications  Sample 3          chr9:29695281-29706655              11374       2.19958     2.04042      4       <1.00x10^-16
Amplifications  Sample 3          chr14:19309086-19459561             150475      2.65807     2.03481      25      5.42x10^-10
Amplifications  Sample 3          chr15:19163125-20077554             914429      2.66995     2.04165      72      <1.00x10^-16
Deletions       Sample 1          chr7:142030227-142210594            180367      1.54593     2.04848      27      1.61x10^-10
Deletions       Sample 1          chr14:21451264-22044096             592832      1.51299     2.03610      161     <1.00x10^-16
Deletions       Sample 1          chr22:33661041-33725126             64085       1.75349     2.06794      21      1.65x10^-9
Deletions       Sample 2          chr2:50682535-50865587              183052      1.44974     2.04023      40      <1.00x10^-16
Deletions       Sample 2          chr14:21792331-22040096             247765      1.38419     2.02893      60      <1.00x10^-16
Deletions       Sample 3          chr14:21800768-21932862             132094      1.53811     2.03481      32      <1.00x10^-16

Variation Locus*: Locus 2636, Locus 0397, Locus 3029, Locus 2639, Locus 2748, Locus 1656, Loci 2644 and 2645, Locus 3489, Locus 0329, Locus 2645, Locus 2645

* from the Database of Genomic Variants (http://projects.tcag.ca/variation/)

Table 3-5 Distribution of log2 ratios from comparison of two experimental replicates of each sample

Sample                   Mean        SD       IQR
Sample 1 - Unamplified   0.005517    0.2579   0.3223
Sample 1 - Amplified     0.002538    0.2840   0.3544
Sample 2 - Unamplified   0.008175    0.2658   0.3299
Sample 2 - Amplified     0.0003263   0.3264   0.4153
Sample 3 - Unamplified   0.0064235   0.2585   0.3187
Sample 3 - Amplified     0.001687    0.2842   0.3517

Table 3-6 Regions of recurrent WGA under-amplification within chromosome ends

Chromosome end   Genome Coordinates            Size (Mbp)   % GC content   Mbp from nearest
                 (Build 36/hg18/Mar 2006)                                  chromosome end
p-terminal end   chr1:3058506-3129776          0.071        57.113         3.059
p-terminal end   chr1:5857077-5871605          0.015        57.168         5.857
p-terminal end   chr2:554079-613259            0.059        45.934         0.554
p-terminal end   chr2:1841469-1968296          0.127        45.876         1.841
p-terminal end   chr5:487981-738504            0.251        56.251         0.488
p-terminal end   chr5:2187888-2267721          0.080        49.395         2.188
p-terminal end   chr5:2836714-2884070          0.047        41.89          2.837
p-terminal end   chr5:3160861-3195828          0.035        46.205         3.161
p-terminal end   chr8:791584-850907            0.059        47.539         0.792
p-terminal end   chr8:1816651-1946694          0.130        49.183         1.817
p-terminal end   chr10:2593122-2624375         0.031        37.102         2.593
p-terminal end   chr19:373238-892603           0.519        59.541         0.373
q-terminal end   chr6:170198708-170308225      0.110        51.929         0.592
q-terminal end   chr7:158582043-158739710      0.158        45.905         0.082
q-terminal end   chr10:134327710-134332916     0.005        49.165         1.042
q-terminal end   chr12:130611957-130673802     0.062        51.924         1.676
q-terminal end   chr13:112193014-112294946     0.102        42.808         1.848
q-terminal end   chr13:113053814-113215730     0.162        50.548         0.927
q-terminal end   chr15:99580062-99745948       0.166        47.27          0.593
q-terminal end   chr16:87408466-87706274       0.298        59.068         1.121
q-terminal end   chr20:60967459-61027216       0.060        49.085         1.409

Table 3-7 Apparent copy number differences identified by pair-wise comparisons of all possible combinations of unamplified and amplified samples

Samples Compared                                Apparent Amplifications   Apparent Deletions     Total Apparent   CNVs in common between
                                                Count   p<                Count   p<             CNVs             matched comparisons¹
Unamplified sample 1   Unamplified sample 2     4       4.26x10^-7        3       1.40x10^-8     7
Unamplified sample 1   Unamplified sample 3     4       3.88x10^-8        4       1.05x10^-13    8
Unamplified sample 2   Unamplified sample 3     4       1.09x10^-10       2       3.44x10^-15    6
Amplified sample 1     Unamplified sample 2     369     1.26x10^-6        367     7.77x10^-9     736              2
Unamplified sample 1   Amplified sample 2       69      1.05x10^-6        358     7.04x10^-9     427              2
Amplified sample 1     Unamplified sample 3     471     1.81x10^-6        498     1.28x10^-8     969              1
Unamplified sample 1   Amplified sample 3       110     1.60x10^-6        536     1.53x10^-8     646              1
Amplified sample 2     Unamplified sample 3     183     1.07x10^-6        49      5.64x10^-8     232              4
Unamplified sample 2   Amplified sample 3       67      1.28x10^-6        130     3.31x10^-8     197              4
Amplified sample 1     Amplified sample 2       21      2.03x10^-6        49      1.71x10^-8     70
Amplified sample 1     Amplified sample 3       18      9.67x10^-7        82      2.69x10^-8     100
Amplified sample 2     Amplified sample 3       44      1.82x10^-6        61      8.23x10^-8     105

¹ CNVs seen in both comparisons regardless of which sample was amplified, i.e. seen in amplified 1 vs unamplified 2 as well as amplified 2 vs unamplified 1.

Table 3-8 Copy number variants detected by pair-wise comparisons of unamplified and amplified sample sets

Sample       Relative CN   Detected by pairwise comparison of unamplified samples   Detected by pairwise comparison of amplified samples
Comparison   Difference    Coordinates (Build 36)       p<=            Rank         Coordinates (Build 36)      p<=            Rank
1 vs. 2      Increase      chr2:50775422-51014967       1.00x10^-16    1            chr2:50828689-50960764      1.15x10^-9     1 of 21
1 vs. 2      Increase      chr14:19272965-19489991      1.38x10^-10    2            -
1 vs. 2      Increase      chr3:21942154-21975950       3.91x10^-7     3            -
1 vs. 2      Increase      chr16:22640088-22688093      4.26x10^-7     4            -
1 vs. 2      Decrease      chr17:41569489-41708649      1.00x10^-16    1            chr17:41587072-41709662     1.00x10^-16    1 of 48
1 vs. 2      Decrease      chr9:11936421-11997006       5.09x10^-11    2            -
1 vs. 2      Decrease      chr10:95243220-95304377      1.40x10^-8     3            -
1 vs. 3      Increase      chr8:124654695-124656225     1.00x10^-16    1            -
1 vs. 3      Increase      chr13:43692360-43696382      3.99x10^-13    2            -
1 vs. 3      Increase      chr18:20691186-20697540      4.86x10^-13    3            -
1 vs. 3      Increase      chr14:19402695-19502641      3.88x10^-8     4            -
1 vs. 3      Decrease      chr14:21715523-22040167      1.00x10^-16    1            chr14:21531617-22057862     1.00x10^-16    1 of 82
1 vs. 3      Decrease      chr10:54588936-54590136      1.00x10^-16    1            -
1 vs. 3      Decrease      chr17:76310141-76321112      1.00x10^-16    1            -
1 vs. 3      Decrease      chr15:19876834-20005562      1.05x10^-13    4            chr15:19877365-20077554     2.11x10^-10    37 of 82
2 vs. 3      Increase      chr17:41572099-41708649      1.00x10^-16    1            chr17:41522422-41647903     8.47x10^-13    1 of 44
2 vs. 3      Increase      chr15:84684853-84693981      1.00x10^-16    1            -
2 vs. 3      Increase      chr15:98087203-98095507      1.11x10^-11    3            -
2 vs. 3      Increase      chr16:77105899-77109454      1.09x10^-10    4            -
2 vs. 3      Decrease      chr15:18711364-20079140      1.00x10^-16    1            chr15:19313868-20329239     1.00x10^-16    1 of 61
2 vs. 3      Decrease      chr2:50870615-51020480       3.44x10^-15    2            chr2:50828689-51018056      1.00x10^-16    1 of 61

* from the Database of Genomic Variants (http://projects.tcag.ca/variation/)
** This CNV locus is overlapped only by the coordinates expected from comparison versus the Affy48 reference
set.

Variation Locus* 0329** 2636 2893 3029 1901 2636 2644/5 2748 3029 2830 2860 2748 -

Table 3-9 Copy number variants detected in MR families by pair-wise comparisons of unamplified and amplified sample sets (child versus father)

Family ID [9]  Relative CN Difference  8379  Increase  1280  Increase Decrease  3476  Increase Decrease

Validated aberrations detected by pairwise comparison of unamplified samples [9] (100k array set)
Coordinates (Build 36)    Mbp     Validation       Cytoband
chr10:259695-23144645     22.88   karyotyping      10p12.2-p15.3
chr15:19208413-19943075   0.73    karyotyping      15q11.2
chr4:22943293-23102259    0.16    FISH (BAC)       4p15.2
chr1:83242288-83274337    0.03    FISH (fosmid)    1p31.1
chr4:82282746-85558739    3.28    FISH (BAC)       4q21.23
-

Detected by pairwise comparison of amplified samples (250k Nsp array) **
Coordinates (Build 36)      p<=            Rank      Variation Locus*
chr10:1000464-24070263      1.00x10^-16    1 of 13   many
chr15:18850150-20335459     1.00x10^-16    1 of 13   2748
chr14:21394980-21864733     1.00x10^-16    1 of 13   many
chr9:10069844-10104307      5.54x10^-7     1 of 2
chr13:100974064-101034679   2.14x10^-6     2 of 2
chr4:22828003-23025619      3.64x10^-10    1 of 4    0794
chr5:64484426-64535538      1.00x10^-16    1 of 6    3405
chr20:50794691-50801972     1.00x10^-16    1 of 6    0104
-                           -              -
chr4:82531241-92371701      1.00x10^-16    1 of 10   many
chr22:46869824-46963276     1.00x10^-16    1 of 10   -

* from the Database of Genomic Variants (http://projects.tcag.ca/variation/)
** Ranked by significance (p-value). Only variants with the lowest p-value scores are shown

3.7. Bibliography

1. McCarroll SA, Hadnott TN, Perry GH, Sabeti PC, Zody MC, Barrett JC, Dallaire S, Gabriel SB, Lee C, Daly MJ, Altshuler DM: Common deletion polymorphisms in the human genome. Nat Genet 2006, 38(1):86-92.
2. Feuk L, Carson AR, Scherer SW: Structural variation in the human genome. Nat Rev Genet 2006, 7(2):85-97.
3. Conrad DF, Andrews TD, Carter NP, Hurles ME, Pritchard JK: A high-resolution survey of deletion polymorphism in the human genome. Nat Genet 2006, 38(1):75-81.
4. Sharp AJ, Locke DP, McGrath SD, Cheng Z, Bailey JA, Vallente RU, Pertz LM, Clark RA, Schwartz S, Segraves R, Oseroff VV, Albertson DG, Pinkel D, Eichler EE: Segmental duplications and copy-number variation in the human genome. Am J Hum Genet 2005, 77(1):78-88.
5. Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW, Lee C: Detection of large-scale variation in the human genome. Nat Genet 2004, 36(9):949-951.
6. Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Maner S, Massa H, Walker M, Chi M, Navin N, Lucito R, Healy J, Hicks J, Ye K, Reiner A, Gilliam TC, Trask B, Patterson N, Zetterberg A, Wigler M: Large-scale copy number polymorphism in the human genome. Science 2004, 305(5683):525-528.
7. Rovelet-Lecrux A, Hannequin D, Raux G, Le Meur N, Laquerriere A, Vital A, Dumanchin C, Feuillette S, Brice A, Vercelletto M, Dubas F, Frebourg T, Campion D: APP locus duplication causes autosomal dominant early-onset Alzheimer disease with cerebral amyloid angiopathy. Nat Genet 2006, 38(1):24-26.
8. Zhang X, Snijders A, Segraves R, Zhang X, Niebuhr A, Albertson D, Yang H, Gray J, Niebuhr E, Bolund L, Pinkel D: High-resolution mapping of genotype-phenotype relationships in cri du chat syndrome using array comparative genomic hybridization. Am J Hum Genet 2005, 76(2):312-326.
9. Friedman JM, Baross A, Delaney AD, Ally A, Arbour L, Armstrong L, Asano J, Bailey DK, Barber S, Birch P, Brown-John M, Cao M, Chan S, Charest DL, Farnoud N, Fernandes N, Flibotte S, Go A, Gibson WT, Holt RA, Jones SJ, Kennedy GC, Krzywinski M, Langlois S, Li HI, McGillivray BC, Nayar T, Pugh TJ, Rajcan-Separovic E, Schein JE, Schnerch A, Siddiqui A, Van Allen MI, Wilson G, Yong SL, Zahir F, Eydoux P, Marra MA: Oligonucleotide microarray analysis of genomic imbalance in children with mental retardation. Am J Hum Genet 2006, 79(3):500-513.
10. Tonon G, Wong KK, Maulik G, Brennan C, Feng B, Zhang Y, Khatry DB, Protopopov A, You MJ, Aguirre AJ, Martin ES, Yang Z, Ji H, Chin L, Depinho RA: High-resolution genomic profiles of human lung cancer. Proc Natl Acad Sci U S A 2005, 102(27):9625-9630.
11. Zhao X, Li C, Paez JG, Chin K, Janne PA, Chen TH, Girard L, Minna J, Christiani D, Leo C, Gray JW, Sellers WR, Meyerson M: An integrated view of copy number and allelic alterations in the cancer genome using single nucleotide polymorphism arrays. Cancer Res 2004, 64(9):3060-3071.
12. Cappuzzo F, Hirsch FR, Rossi E, Bartolini S, Ceresoli GL, Bemis L, Haney J, Witta S, Danenberg K, Domenichini I, Ludovini V, Magrini E, Gregorc V, Doglioni C, Sidoni A, Tonato M, Franklin WA, Crino L, Bunn PA, Jr., Varella-Garcia M: Epidermal growth factor receptor gene and protein and gefitinib sensitivity in non-small-cell lung cancer. J Natl Cancer Inst 2005, 97(9):643-655.
13. Hirsch FR, Varella-Garcia M, Bunn PA, Jr., Di Maria MV, Veve R, Bremmes RM, Baron AE, Zeng C, Franklin WA: Epidermal growth factor receptor in non-small-cell lung carcinomas: correlation between gene copy number and protein expression and impact on prognosis. J Clin Oncol 2003, 21(20):3798-3807.
14. Nahta R, Yu D, Hung MC, Hortobagyi GN, Esteva FJ: Mechanisms of disease: understanding resistance to HER2-targeted therapy in human breast cancer. Nat Clin Pract Oncol 2006, 3(5):269-280.
15. Cho HS, Mason K, Ramyar KX, Stanley AM, Gabelli SB, Denney DW, Jr., Leahy DJ: Structure of the extracellular region of HER2 alone and in complex with the Herceptin Fab. Nature 2003, 421(6924):756-760.
16. Menard S, Pupa SM, Campiglio M, Tagliabue E: Biologic and therapeutic role of HER2 in cancer. Oncogene 2003, 22(42):6570-6578.
17. Rubin I, Yarden Y: The basic biology of HER2. Ann Oncol 2001, 12 Suppl 1:S3-8.
18. Laskin JJ, Sandler AB: Epidermal growth factor receptor: a promising target in solid tumours. Cancer Treat Rev 2004, 30(1):1-17.
19. Hughes S, Arneson N, Done S, Squire J: The use of whole genome amplification in the study of human disease. Prog Biophys Mol Biol 2005, 88(1):173-189.
20. Dean FB, Hosono S, Fang L, Wu X, Faruqi AF, Bray-Ward P, Sun Z, Zong Q, Du Y, Du J, Driscoll M, Song W, Kingsmore SF, Egholm M, Lasken RS: Comprehensive human genome amplification using multiple displacement amplification. Proc Natl Acad Sci U S A 2002, 99(8):5261-5266.
21. Spits C, Le Caignec C, De Rycke M, Van Haute L, Van Steirteghem A, Liebaers I, Sermon K: Whole-genome multiple displacement amplification from single cells. Nat Protoc 2006, 1(4):1965-1970.
22. Corneveaux JJ, Kruer MC, Hu-Lince D, Ramsey KE, Zismann VL, Stephan DA, Craig DW, Huentelman MJ: SNP-based chromosomal copy number ascertainment following multiple displacement whole-genome amplification. Biotechniques 2007, 42(1):77-83.
23. Paez JG LM, Beroukhim R, Lee JC, Zhao X, Richter DJ, Gabriel S, Herman P, Sasaki H, Altshuler D, Li C, Meyerson M, Sellers WR: Genome coverage and sequence fidelity of phi29 polymerase-based multiple strand displacement whole genome amplification. Nucleic Acids Res 2004, 32:e71.
24. Arriola E, Lambros MB, Jones C, Dexter T, Mackay A, Tan DS, Tamber N, Fenwick K, Ashworth A, Dowsett M, Reis-Filho JS: Evaluation of Phi29-based whole-genome amplification for microarray-based comparative genomic hybridisation. Lab Invest 2007, 87(1):75-83.
25. Lage JM, Leamon JH, Pejovic T, Hamann S, Lacey M, Dillon D, Segraves R, Vossbrinck B, Gonzalez A, Pinkel D, Albertson DG, Costa J, Lizardi PM: Whole genome analysis of genetic alterations in small DNA samples using hyperbranched strand displacement amplification and array-CGH. Genome Res 2003, 13(2):294-307.
26. Tzvetkov MV, Becker C, Kulle B, Nurnberg P, Brockmoller J, Wojnowski L: Genome-wide single-nucleotide polymorphism arrays demonstrate high fidelity of multiple displacement-based whole-genome amplification. Electrophoresis 2005, 26(3):710-715.
27. Bredel M, Bredel C, Juric D, Kim Y, Vogel H, Harsh GR, Recht LD, Pollack JR, Sikic BI: Amplification of whole tumor genomes and gene-by-gene mapping of genomic aberrations from limited sources of fresh-frozen and paraffin-embedded DNA. J Mol Diagn 2005, 7(2):171-182.
28. Esteban JA, Salas M, Blanco L: Fidelity of phi 29 DNA polymerase. Comparison between protein-primed initiation and DNA polymerization. J Biol Chem 1993, 268(4):2719-2726.
29. Affymetrix webpage [http://www.affymetrix.com/]
30. Pinard R, de Winter A, Sarkis GJ, Gerstein MB, Tartaro KR, Plant RN, Egholm M, Rothberg JM, Leamon JH: Assessment of whole genome amplification-induced bias through high-throughput, massively parallel whole genome sequencing. BMC Genomics 2006, 7:216.
31. Hosono S FA, Dean FB, Du Y, Sun Z, Wu X, Du J, Kingsmore SF, Egholm M, Lasken RS: Unbiased whole-genome amplification directly from clinical samples. Genome Res 2003, 13:954-964.
32. Benson G: Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res 1999, 27(2):573-580.
33. Iiizumi M, Liu W, Pai SK, Furuta E, Watabe K: Drug development against metastasis-related genes and their pathways: a rationale for cancer therapy. Biochim Biophys Acta 2008, 1786(2):87-104.

Chapter 4. Sequence variant discovery in DNA repair genes from radiosensitive and radiotolerant prostate brachytherapy patients³

In the treatment of human cancer, an individual’s genetic background, their complement of SNPs and CNPs, dictates not only their susceptibility to cancer but also the manner in which normal cells respond to therapy. Efficacy of several medications has been linked to SNPs in genes encoding drug metabolizers, transporters, receptors, and targets [1, 2]. Stratifying patients based on genotype has the potential to classify diseases with known susceptibility, improve the dosing and targeting of drugs, and decrease the time and cost of conducting clinical trials [2].
However, the role of genetic polymorphisms is not restricted to genetic predictions of drug response. While maximizing treatment efficacy is the goal of any therapy, an equally important consideration is the minimization of treatment-induced side effects. This is particularly important in the case of radiation therapies, in which the treatment can induce new primary cancers in radiosensitive patients. Many genetic studies of radiosensitivity in cancer patients have focused on variants of ATM, a gene central to DNA damage detection and repair, with mixed results. In an expanded search for variants associated with radiosensitivity, this chapter documents a search for germline genetic variants predictive of late side effects in prostate cancer patients treated with radiation brachytherapy. We also investigated associations of these variants with increased levels of DNA double-strand breaks marked by expression of gammaH2AX following irradiation. At the time of manuscript publication, this study was the largest sequencing-based survey of DNA repair genes to look for germline variants associated with radiosensitivity.

³ A version of this chapter has been published. Pugh, T.J., Keyes, M., Barclay, L., Delaney, A., Krzywinski, M., Novik, K., Thomas, D., Yang, C., Agranovich, A., McKenzie, M., Morris, W.J., Olive, P.L., Marra, M.A., Moore, R.A. Clin Cancer Res. 2009 Aug 1;15(15):5008-16.

4.1. Introduction

Prostate brachytherapy (PB) is a standard treatment for early stage prostate cancer. Radioactive seeds are implanted into the prostate to deliver high doses of conformal radiation, achieving excellent long-term results [3-5]. A recent study of 1,006 consecutive PB patients at the BC Cancer Agency found biochemical freedom from recurrence rates of 95.6% at 5 years and 94.0% at 7 years [5].
Despite great pains taken to minimize exposure of tumour-adjacent cells to ionizing radiation, even patients with ideal radiation dosimetries can develop side effects months or even years after recovery from the early inflammatory response following radiation treatment. While early side effects are often temporary, late side effects are often irreversible or even progressive [6] and appear to be a symptom of long-term DNA damage to surrounding tissues. Symptoms include acute and late urinary toxicity, acute and late rectal toxicity and loss of sexual potency. Predictive clinical factors for severity of toxicity such as baseline urinary and sexual function prior to procedure and radiation dose have been investigated by our group [5, 7-12] and others [13-15], but overall there is no consensus that any of these factors are effective predictors of late side effects of PB. If not for the toxicity to normal tissues in some patients, PB would be an ideal treatment for prostate cancer due to its high long term efficacy, minimal invasiveness, and low impact on patient quality of life.

The presence of intrinsic radiosensitivity may prove to be an important factor contributing to development of PB toxicity. Several observations support the hypothesis that an individual’s radiosensitivity is mediated by genetic variants in specific genes. Radiosensitivity appears to be an inherited trait as cells from monozygotic twins have greater intrapair correlation of cell cycle delay and apoptosis following irradiation than dizygotic twins [16]. In addition, two studies of breast cancer patients have found that cells from first degree relatives of patients with high radiosensitivity are similarly sensitive to radiation [17, 18]. Ionizing radiation such as that used in PB has been well documented to cause DNA double-strand breaks that are repaired through a number of DNA repair mechanisms [19-21].
Defects in DNA repair genes ATM, LIG4, and MRE11 lead to developmental syndromes that include increased radiosensitivity [21, 22]. In vitro, cells with mutations in ATM have been shown to have increased radiosensitivity [23]. In the treatment of cancer, positive correlations have been made between variants in ATM and radiosensitivity [24-30]. Such variants have been uncovered by two studies of prostate cancer patients. Hall et al. [28] used DNA sequencing to find ATM mutations in 3 of 17 prostate cancer patients with late radiotherapy side effects. A study of 37 PB patients by Cesaretti et al. [26] used DHPLC to identify 21 variants in ATM in 16 patients and found a correlation between possession of sequence variants in this gene, particularly missense variants, and late side effects of PB. Several studies of this kind have been limited by small sample size, indirect or low resolution variant detection methods, and examination of only a single candidate gene [24]. Current evidence suggests that radiosensitivity is a complex genetic trait mediated by a number of genes, each of which may harbour low frequency variants which together modulate the radiosensitive phenotype [24, 30, 31]. Two studies have examined the role of single nucleotide polymorphisms (SNPs) from multiple genes in predicting radiosensitivity of prostate cancer patients treated with radiation. One study genotyped 49 SNPs from 24 genes in 83 patients and identified three genes, LIG4, ERCC2 and CYP2D6, containing SNPs associated with radiation toxicity [32]. The second study genotyped 450 SNPs from 118 genes in 197 patients and defined urinary toxicity ‘risk genotypes’ associated with SNPs in five genes, SART1, ID3, EPDR1, PAH, and XRCC6 [33]. These studies genotyped an average of 1.8 and 6.1 known SNPs from each gene, respectively, and would not have discovered novel variants in these genes that may directly mediate radiosensitivity. 
To date, no comprehensive sequencing-based survey of multiple candidate DNA repair genes has been performed in a set of high and low toxicity PB patients to discover and genotype such variants. We set out to perform such a survey 1) to discover new variants and 2) to investigate whether variants in genes responsible for detecting and repairing DNA damage contribute to PB toxicity.

While the effects of radiation can take several forms and involve a number of mechanisms [6], late side effects can develop months or years after irradiation due to the presence of unrepaired, damaged DNA. Poor ability to repair these lesions may be due to reduced function of proteins responsible for detection and repair of DNA damage. As of March, 2009, over 175 DNA repair genes had been identified, but many play a supporting role in DNA damage repair as members of cell signalling pathways, catalytic subunits, cofactors, or proteins that interact with well characterized genes but without a known function of their own [34]. Therefore, we restricted our study to a set of well-characterized DNA repair genes that encode proteins that act directly at the site of double-strand breaks and are the primary machinery for the detection and repair of these lesions (Figure 4.1, [35]).

We sequenced the coding and flanking intronic regions of eight DNA repair genes (ATM, BRCA1, ERCC2, H2AFX, LIG4, MDC1, MRE11A, and RAD50) in 41 prostate cancer patients treated with PB at the BC Cancer Agency. These genes were selected because each plays a role in the detection and repair of DNA damage from ionizing radiation (16-18) and functional alterations of any of these genes may result in reduced ability to repair double-stranded DNA breaks caused by prostate brachytherapy. ATM kinase plays a central role as a sensor of DNA damage that activates signal transduction pathways to halt cell cycle progression until the DNA damage is repaired.
At the site of a double strand break (DSB), ATM phosphorylates several proteins including H2AFX, a histone variant, to recruit a nuclease complex for DNA repair [21]. MRE11A nuclease and RAD50 ATPase are part of this complex and enzymatically process the ends of DSBs for repair by homologous recombination [21, 36]. MDC1 mediates the recruitment of this complex by interacting with both H2AFX and MRE11A/RAD50 [36]. BRCA1 acts as a scaffold for replication and DNA repair proteins and forms a “BRCA1-associated genome surveillance complex” with ATM and the MRE11A/RAD50 complex [21, 37]. LIG4, also known as DNA ligase IV, plays a role in repairing DSBs by uniting broken ends through an alternative mechanism called non-homologous end-joining [21, 38]. Damage to individual DNA bases is addressed by a third mechanism, nucleotide excision repair, in which ERCC2 helicase, also known as XPD, is responsible for unwinding the damaged DNA helical structure so repair can take place [21].

4.2. Materials and methods
4.2.1. Patient selection and toxicity metrics
The Prostate Brachytherapy Program at the British Columbia Cancer Agency (BCCA) was established in 1997. As of March 2008, more than 2500 patients had undergone PB as part of this program. Eligible patients included those with low-risk disease (clinical stage ≤ T2a, initial PSA (iPSA) ≤ 10.0 ng/ml and Gleason Score (GS) ≤ 6), and ‘low tier’ intermediate risk patients (stage ≤ T2c and GS ≤ 6 with iPSA 10-15 ng/ml or GS = 7 with iPSA < 10 ng/ml). Our implant technique is described in detail elsewhere [5, 7, 9]. Prostate and rectal dosimetry is obtained using day 30 post-implant CT, using VariSeed software (Varian Medical Systems, Palo Alto, CA). Post-implant contouring was done by an implanting oncologist and tissue dosimetry recorded. Patients were seen at 6 weeks after the procedure, every 6 months for 2-3 years, and then annually.
Toxicity score components are assessed by a physician on each visit, including Radiation Therapy Oncology Group (RTOG) urinary and rectal toxicity scores [39], International Prostate Symptom Score (IPSS [40]) and patient-reported erectile function. To better reflect the specific brachytherapy toxicity profile, the genitourinary (GU) and gastrointestinal (GI) RTOG toxicity scale was modified at the inception of the program.

Forty-one prostate brachytherapy patients living in the Vancouver lower mainland region with at least three years of follow-up (mean 6.6 years) were selected for study from the BCCA Prostate Brachytherapy database; all agreed to participate and provided informed consent. Ethics approval of the study was granted by the BC Cancer Agency Research Ethics Board. Patients were chosen based on development or lack of development of late normal tissue toxicity following brachytherapy and, to minimize radiation dose as a source of experimental variability, their near-ideal rectal and prostate post-implant dosimetry: prostate D90 < 175 Gy (dose covering 90% of the prostate less than 175 Gy), prostate V100 > 85% (more than 85% of the prostate volume receiving 100% of the prescribed dose), and rectal VR100 < 1.0 cm3 (volume of the rectum receiving 145 Gy is less than 1.0 cm3). Median prostate D90, prostate V100, and rectal VR100 for low and high toxicity patients were 154 Gy, 93%, 0.26 cm3 and 148 Gy, 92%, 0.34 cm3 respectively. While some patients received less than ideal dosimetry, neither the low nor the high toxicity group is enriched for these exceptions (p=1.00 for V100, VR100, and D90), nor is there a linear, exponential, or up to sixth order polynomial relationship between any of the dosimetry values and toxicity score (R² < 0.4).

Usually, toxicity after radiation therapy is reported as a single organ or tissue toxicity score and the original RTOG toxicity scale is primarily used for external beam radiation therapy and not for brachytherapy.
To adequately capture patients with multiple organ or tissue toxicities, we have created a somewhat arbitrary composite toxicity score listed in Table 4-1. Several peer reviewed articles have been published regarding the toxicities determined using the modified RTOG toxicity scores [10-12]. From our analysis of 1000 brachytherapy patients with a minimum of 3 years of follow-up, patients with multiple severe toxicities are relatively rare, comprising 2-10% of the entire population (unpublished data). For this study, an attempt was made to capture patients with the worst multiple toxicities, respecting the limitation of geographical availability, patient willingness to participate in the study and the requirement of near-ideal post-implant dosimetry. As acute toxicity is likely related to the PB procedure itself [8, 9, 12, 41], this study focuses on late toxicity only, defined as development of toxicity more than 1 year following the implant. The average follow-up time since implant was 78 ± 14 months for patients with little or no evidence of late toxicity and 81.2 ± 12 months for patients with late toxicity.

Twenty patients with no late side effects of prostate brachytherapy were chosen based on a score of 0 or 1 for the criteria listed in Table 4-1. Twenty-one patients with documented late side effects of brachytherapy were selected; each had scores in at least two of the Table 4-1 categories and a total score of at least 2. Clinical and DNA sequence variant data used in our analysis for all patients are reported in Supplemental Table 1 available, due to its large size, in electronic format at www.clincancerres.aacrjournals.org. Table 4-2 contains a summary of the DNA sequence variant data, a breakdown of the toxicity scores, and additional clinical data including hormone usage, tumour stage, planning ultrasound target volume, Gleason score, and age at implant.

4.2.2. PCR amplification and sequencing of DNA repair genes
Each patient provided a 24 mL blood sample from which genomic DNA was extracted using the Gentra Puregene Blood kit (Qiagen Inc, Mississauga, ON) and quantified using a NanoDrop spectrophotometer (NanoDrop Technologies, Wilmington, DE). PCR and sequencing of the target amplicons were carried out by the BC Cancer Agency Genome Sciences Centre sequencing group using previously published reaction chemistries [42]. PCR volumes were scaled down to 10 µL, 10 ng of genomic DNA was used for each reaction, and reactions were performed using a 60 °C annealing temperature. PCR primer sequences and the genome coordinates of each amplicon are available in Table 4-3. To facilitate sequencing with universal sequencing primers, each forward primer was ordered with the prefix sequence TGTAAAACGAGGCCAGT and each reverse primer was ordered with the prefix sequence CAGGAAACAGCTATGAC. Variants were detected using Mutation Surveyor v3.2 (SoftGenetics, State College, PA) and mutation reports summarized using custom scripts. Genome coordinates and dbSNP accession numbers reported here correspond to human genome build 36.1/hg18 (March 2006) and dbSNP build 128 (http://www.ncbi.nlm.nih.gov/).

4.2.3. Statistical analyses
Statistical analyses were performed using custom Perl and shell scripts. A Perl module implementation of the two-tailed Fisher’s exact test (http://search.cpan.org/dist/TextNSP/lib/Text/NSP/Measures/2D/Fisher/twotailed.pm) was used to assess differences in allele distribution between low and high toxicity patients at each variant site. To investigate the contribution of individual alleles to intrinsic radiosensitivity, four statistical tests were performed using different allele distributions between the high and low toxicity groups at each variant site (A = reference allele, B = non-reference allele): p1 tested for an association with homozygosity for the reference allele (AA vs.
AB+BB), p2 tested for association with homozygosity for the non-reference allele (BB vs. AA+AB), p3 tested for association with the presence of either allele regardless of zygosity (p3 = A vs. B), and p4 tested for association with homozygotes only to remove possible intermediate toxicities due to heterozygosity for a risk or protective allele (p4 = AA vs. BB, heterozygotes discarded). Patients without sequence coverage of a particular variant were not included in the statistical tests of that site.

We also used this approach to investigate correlations between these variants and residual gammaH2AX measured in blood cells from these patients following radiation [43]. We used the combined ranks of the monocyte and lymphocyte scores published by Olive et al. [43] to divide our patients into two groups: 19 “low gammaH2AX” individuals (sum of ranks < 41) and 21 “high gammaH2AX” individuals (sum of ranks ≥ 41). A higher expression of residual gammaH2AX measured 24 hours after radiation can suggest decreased DNA repair capacity and, we hypothesize, an increased likelihood of late side effects. As this is a hypothesis-generating study seeking to identify subtle genetic relationships with radiosensitivity, we conventionally set our p-value threshold for significance at 0.05 with the understanding that false-positives are a real possibility given the number of tests performed. More stringent corrections for multiple testing are warranted for future validation studies in larger patient populations.

4.3. Results
4.3.1. DNA sequencing summary
We selected eight DNA repair genes in each of 41 individuals for sequencing. These eight genes contained 173 exons which were covered by 242 PCR amplicons. The amplicons targeted 115 kbp of genomic sequence of which 45.2 kbp were exonic (Table 4-4).
Across 41 patients, 239 sites were shown to differ from the human genome reference sequence and 170 of these corresponded to known variants listed in dbSNP (http://www.ncbi.nlm.nih.gov/projects/SNP/). 60 of the variants, 14 of them novel, fell in a protein coding region (Table 4-5), and 43 of these resulted in an amino acid substitution (Table 4-4). 22 of the variants that effected an amino acid change were expected to be nonconservative, which we defined as a score less than 0 on the BLOSUM62 alignment score matrix [44, 45]. 36 variants were small (1 to 5 bp) insertions or deletions, none of which affected protein coding regions. 23 of these were known polymorphisms recorded in dbSNP, 14 of 25 deletions and 9 of 11 insertions. Variants within exon splicing sites were rare, and only a single variant fell within 2 bp of an annotated exon end, a novel intronic MRE11A variant located at chr11:93,866,690 in a single heterozygous low toxicity individual.

The largest gene, ATM, contained the greatest number of variants, 53 in non-coding and 9 in coding regions. However, its 1.5 variants per kbp sequenced was below the average of 2.3 variants per kbp for the eight genes sequenced (median 2.2, range 0.4-4.2 variants/kbp), suggesting that this gene has slightly decreased variant density compared to the other genes in our study despite harbouring the greatest number of variants (Table 4-4). The BRCA1 gene contained the second largest number of variants, 47, and the greatest proportion of known variants recorded in dbSNP, 91%. This high percentage of known variants likely reflects the intense research interest in this gene due to its association with breast cancer risk. MRE11A and RAD50 contained the lowest number of variants per kbp in coding regions, 0 and 0.3 respectively. This high level of sequence conservation likely reflects the fact that the products of these genes interact with a large number of proteins.
In contrast, ERCC2, LIG4, and MDC1 had the greatest number of variants per kbp sequenced, 4.2, 3.6 and 3.7 respectively. While MDC1 had 23 coding variants, the greatest of any of the genes studied, 5 of these were only present in a single individual with no late side effects (Patient 33) and 4 were shared exclusively between this individual and a patient with high toxicity (Patient 43). H2AFX, the smallest gene, contained the fewest number of variants, 5, and the single coding variant, while novel, did not result in an amino acid change.

4.3.2. ATM variants detected by previous studies of radiosensitivity
We detected 5 known coding variants observed in previous studies relating ATM to radiosensitivity: 1) 2119T>C, S707P (chr11:107,629,971, rs4986761) [25, 27]; 2) 5558A>T, D1853V (chr11:107,680,673, rs1801673) [25]; 3) 3161C>G, P1054R (chr11:107,648,666, rs1800057) [25, 26, 29]; 4) 4578C>T, P1526P (chr11:107,668,697, rs1800889) [25, 26, 29]; 5) 5557G>A, D1853N (chr11:107,680,672, rs1801516) [25, 26]. Coding sequence coordinates listed are relative to the ATM transcript record ENST00000278616 accessed through the Ensembl website (http://www.ensembl.org/). The first and second variants, resulting in amino acid changes S707P and D1853V respectively, were observed in a single low toxicity patient heterozygous at both sites. The third variant, resulting in the amino acid change P1054R, was previously found to double the risk of developing prostate cancer [46] and was observed in two heterozygous high toxicity patients in our population (Patients 9 and 17). The fourth variant, a synonymous change retaining P1526, was observed in two heterozygous patients, one with high toxicity and one with low toxicity. The fifth variant, resulting in the amino acid change D1853N and previously suggested to mediate radiosensitivity in breast cancer patients [25], was observed in five heterozygous patients, three low toxicity and two high toxicity.
None of these variants was statistically associated with high PB toxicity in our population (p>0.46).

4.3.3. Using quantity of DNA repair gene variants to predict radiosensitivity
Previous studies have postulated that the number of variants in DNA repair genes can be used to distinguish radiotherapy patients with high toxicity from low toxicity [25-27, 33]. In our study, every patient had at least 1 variant in each of 5 DNA repair genes (ATM, ERCC2, H2AFX, MDC1, and RAD50) (Figure 4.2). Three genes had more variants on average in the high toxicity patients than in the low toxicity patients (BRCA1, H2AFX, and MDC1). However, there was no statistically significant enrichment to either side of the mean in high or low toxicity groups for any of the eight genes studied (p>0.10). This was also true for missense variants (p>0.09) and non-conservative variants (p>0.16). However, when all coding variants were taken as a group regardless of amino acid conservation score, we did observe an enrichment of such variants in the LIG4 gene in high toxicity patients (p=0.03 for LIG4, p>0.34 for all other genes). We tested all possible quantity thresholds (from 1 to 32 variants) of all four variant classes to distinguish low toxicity and high toxicity groups and found only one that met statistical significance. In our population, the high toxicity group was enriched for individuals harbouring at least one LIG4 coding variant.

4.3.4. Using specific DNA repair gene variants to predict radiosensitivity
We hypothesized that specific DNA repair gene variants, not the number of such variants, would be associated with radiosensitivity. To assess genetic associations with radiation toxicity, the genotype and allele distribution between high and low toxicity groups were analyzed at every variant site using four two-tailed Fisher’s exact tests (Methods).
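The four genotype and allele groupings described in Methods can be sketched in Python as follows. This is a minimal illustration and not the Perl module used in the study: the function names and the genotype-count dictionaries are hypothetical, and the two-tailed p-value is computed by summing the probabilities of all 2x2 tables that are no more likely than the observed one, which is one common definition and is assumed here. The example counts assume a variant carried by five heterozygous high toxicity patients, with the other 16 high toxicity and all 20 low toxicity patients homozygous for the reference allele and no patient lacking coverage.

```python
from math import comb


def fisher_two_tailed(a, b, c, d):
    """Two-tailed Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins whose probability does not exceed that of the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability of a table with x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_obs = prob(a)
    total = sum(prob(x) for x in range(lowest, highest + 1)
                if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def four_tests(high, low):
    """Return (p1, p2, p3, p4) for genotype counts in two toxicity groups.

    `high` and `low` are dicts with keys "AA", "AB", "BB"
    (A = reference allele, B = non-reference allele).
    """
    # p1: AA vs. AB+BB
    p1 = fisher_two_tailed(high["AA"], high["AB"] + high["BB"],
                           low["AA"], low["AB"] + low["BB"])
    # p2: BB vs. AA+AB
    p2 = fisher_two_tailed(high["BB"], high["AA"] + high["AB"],
                           low["BB"], low["AA"] + low["AB"])
    # p3: allele counts, A vs. B, regardless of zygosity
    p3 = fisher_two_tailed(2 * high["AA"] + high["AB"], high["AB"] + 2 * high["BB"],
                           2 * low["AA"] + low["AB"], low["AB"] + 2 * low["BB"])
    # p4: AA vs. BB, heterozygotes discarded
    p4 = fisher_two_tailed(high["AA"], high["BB"], low["AA"], low["BB"])
    return p1, p2, p3, p4


# Hypothetical example counts (see assumptions in the lead-in).
high_toxicity = {"AA": 16, "AB": 5, "BB": 0}
low_toxicity = {"AA": 20, "AB": 0, "BB": 0}
p1, p2, p3, p4 = four_tests(high_toxicity, low_toxicity)
print(f"p1={p1:.3f} p2={p2:.2f} p3={p3:.3f} p4={p4:.2f}")
# p1 is approximately 0.048 for these counts; p2 and p4 are 1.00 because
# no patient is homozygous for the non-reference allele.
```

Because a variant with no homozygous non-reference carriers always gives p2 = p4 = 1, only p1 and p3 are informative for rare alleles in a cohort of this size.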
One coding synonymous variant in MDC1, 4178C>CG, A1657AA, located at chr6:30,779,968 returned a p-value less than 0.05 (p1 = 0.048, p2 = 1.00, p3 = 0.056, p4 = 1.00). All five patients with the minor allele (Patients 4, 18, 21, 26, 34) were heterozygous and had high radiation toxicity scores (6, 3, 4, 5, and 13). This variant has been previously recorded in dbSNP as rs28986317, and minor allele frequencies have been observed from 1-6% in four populations. All other variant sites returned p-values greater than 0.05, and none appeared to be associated with increased radiation toxicity at a statistically significant level in our patient population.

4.3.5. Relationship of DNA repair gene variants with residual gammaH2AX following irradiation
Assessments of DNA repair ability, represented by the relative expression of gammaH2AX remaining 24 hours after exposure to 2 Gy, were taken for 40 of the 41 patients [43]. While residual gammaH2AX following irradiation did not correlate with late side effects of PB [43], we did observe 15 intronic variants to be correlated with decreased residual gammaH2AX, i.e. increased DNA repair activity (p≤0.049), and 1 coding, missense variant to be correlated with increased residual gammaH2AX, i.e. decreased DNA repair ability (p=0.042) (Table 4-5). 14 of the 15 low gammaH2AX variants were in BRCA1, and 13 of these were documented in dbSNP, suggesting a primary role for this protein in addressing double-strand breaks marked by gammaH2AX. The remaining intronic variant is a known variant in MDC1 (rs9405048), but none of these variants was correlated with fewer late side effects of PB (p≥0.127). The single variant associated with increased gammaH2AX results in a conservative amino acid change in ERCC2 (D312N, BLOSUM62 score = 1) and is documented in dbSNP (rs1799793). 12 of the 17 patients harbouring the minor allele had high expression of residual gammaH2AX (sum of ranks ≥ 41), and 5 of these had toxicity scores higher than 2.
Patient 10 was an exception as he harboured the minor allele and received a high toxicity score of 7 and yet had the lowest residual gammaH2AX. This patient did not harbour any of the 15 variants correlated with decreased gammaH2AX, suggesting that clearance of double-strand breaks is also mediated by genes outside of our candidate set.

4.4. Discussion
To the best of our knowledge, this study represents the first direct sequencing study of multiple DNA repair genes in radiosensitive and radiotolerant prostate brachytherapy patients. This survey uncovered 239 variants distributed across eight DNA repair genes of which 69 were novel and had not been recorded in dbSNP. Of the 46 coding variants, 32 of which resulted in an amino acid change, 14 were novel (not in dbSNP). These results suggest that the genetic diversity of these genes is not fully captured in existing databases and that sequencing of genes in larger populations is necessary to uncover lower frequency variants that may define complex radiosensitive phenotypes.

This survey identified five ATM variants analyzed in previous investigations of ATM and radiosensitivity [25-27, 29] but observed no statistically significant relationship with late side effects of prostate brachytherapy. Contrary to previous reports based on population sizes similar to our own [26, 27], the number of variants in ATM could not predict radiation toxicity. However, a specific variant that doubles the risk of prostate cancer, P1054R [46], was seen exclusively in two high toxicity individuals, which suggests that aspects of DNA repair ability may underlie a predisposition for prostate cancer. While the frequency of this variant in the high toxicity group did not reach statistical significance (p=0.488), the fact that this variant was found exclusively in high toxicity patients warrants further investigation in larger populations as a potential predictor of radiosensitivity.
Of 239 variants detected across eight DNA repair genes, only one variant, rs28986317 in MDC1, was statistically associated with late side effects of PB (p=0.048). The biological effect of this variant on protein structure is not immediately obvious as it does not result in an amino acid change. However, synonymous changes have been shown to affect protein translation due to changes in codon usage, altered mRNA stability, disrupted miRNA binding, and exon skipping [47]. In the case of rs28986317, the minor allele results in the use of the least frequently used of the four codons that encode alanine in humans (codons per thousand for each allele, 27.7:7.4) [48]. As the presence of rare codons can decrease protein translation rates [47], minor allele carriers of this variant may express MDC1 at lower levels, thereby decreasing their ability to repair DNA damage from ionizing radiation.

Despite representing less than 10% of the sequence targeted, coding variants in MDC1 represented over 33% of the synonymous variants, over 25% of the conservative variants, and over 50% of the non-conservative variants including all of the novel non-conservative variants detected in our study. The large amount of per-kbp variation present in this gene and others may explain the wide range of toxicities observed following radiation treatment. In the case of MDC1, a recruiter of DNA damage repair complexes, different variants may preferentially recruit complexes specific to each individual. Similarly, the high number of variants per kbp in ERCC2 and LIG4, proteins that directly carry out repair, may reflect a spectrum of activation efficiencies or enzymatic activities that manifest as a range of toxicity levels.

Our study’s small patient population limits the statistical power of our analysis to resolve cumulative genetic effects on radiosensitivity at a population level.
Regardless, even from this small population, a large amount of genetic diversity was observed in the eight candidate genes sequenced. The positive results associating the MDC1 variant with increased radiosensitivity and the observation of the ATM variant P1054R exclusively in high toxicity patients need to be validated in larger populations, particularly as no correction for multiple testing was performed and there is a possibility of false positives. The finding that no single LIG4 variant was statistically associated with radiosensitivity despite the observation that high toxicity patients were more likely to contain LIG4 coding variants suggests that this gene may contain a class of low frequency variants with similar effect on protein behaviour. As LIG4 functions as a ligase, variants resulting in codon or amino acid substitutions in this protein may decrease its ability to join broken ends of DNA or reduce the amount of enzyme available to perform DNA repair.

Variants in DNA repair genes were both positively and negatively associated with residual gammaH2AX that signifies the presence of unrepaired or misrepaired double-strand breaks following irradiation [43]. Non-coding variants, 14 in BRCA1 and 1 in MDC1, correlated with lower levels of gammaH2AX (p≤0.049) while a coding, non-synonymous variant in ERCC2 (rs1799793) was correlated with higher levels of gammaH2AX (p=0.042) (Table 4-5). Of the 12 ERCC2 minor allele carriers with high levels of residual gammaH2AX, 5 (42%) also had high toxicity. While gammaH2AX levels did not correlate with development of late side effects of PB [43], variants in specific DNA repair genes, particularly BRCA1 and ERCC2, appear to mediate the clearance of double-strand breaks, which may be a factor in the eventual development of toxicity.
This initial survey has identified a number of promising candidate variants that may show an ability to predict increased radiosensitivity in a larger population and serves to illustrate the genetic diversity present in a number of DNA repair genes. While the hypothesis that DNA repair gene variants mediate radiosensitivity has not been disproven, it is likely that the effect of individual variants is small and that variants outside of this set of candidate genes also play a role in mediating radiosensitivity. Investigation of variants of additional genes in larger patient populations may lead to prognostic tests to identify radiosensitive cancer patients prior to treatment. Given such knowledge, the clinical course of these patients could be altered to consider their treatment with non-radiation therapies.

4.5. Figures
Figure 4.1 Candidate genes encode proteins directly involved in the detection and repair of damaged DNA and triggering of cell cycle control signalling pathways

Reproduced and modified from [35]. Cellular response to DNA damage is controlled by signalling pathways that halt cell cycle progression (cell cycle phases at bottom) until the damage is repaired. For this study, we focused on eight candidate genes (yellow) that encode proteins responsible for identifying and directly repairing DNA damage. These proteins are primary activators of cell signalling pathways that limit cell growth or replication in the presence of damaged DNA.

Figure 4.2 Toxicity scores, radiation dosimetry, count of DNA variants, and gammaH2AX rank expression from 41 prostate brachytherapy patients

The x-axes in all panels correspond to patient numbers ordered by toxicity score presented in the top panel. This panel indexes the subsequent panels and shows the toxicity score for each patient determined using the scoring system shown in Table 4-1.
A dashed vertical line down the centre of the figure separates data from low toxicity patients (left, toxicity score ≤ 1) from high toxicity patients (right, toxicity score ≥ 2). The next three panels present similar postimplant radiation dosimetry for each patient. The thresholds for "ideal" dosimetry are shown as dashed red lines (D90 < 175 Gy, V100 > 85%, VR100 < 1 cm3). Panels 5-12 present the number of all variants (black, left axis) and coding variants (red, right axis) in each gene from each patient. Genes are ordered first by the magnitude of the left y-axis (i.e. the maximum count of total variants) and then by the magnitude of the right y-axis (i.e. the maximum count of coding variants). Note the enrichment of individuals with at least 1 LIG4 coding mutation in high versus low toxicity individuals (p=0.028). Counts that include specific variants associated with radiosensitivity, ATM rs1800057 (P1054R) and MDC1 rs28986317, are indicated by wireframe data markers. The last panel displays the sum of rank data from Olive et al. [43] and represents the combined rank of residual gamma H2AX scores from lymphocytes and monocytes from 40 of the 41 patients. The numbers used to generate these plots were drawn from Table 4-2. Lines between each point are for ease of comparison and do not represent continuous variables or an explicit relationship between data points.

4.6. Tables

Table 4-1 Modified RTOG scoring system used to generate toxicity scores

Category | Description | Score
IPSS¹ score | Failed to normalize to within 5 points of baseline after 24 months of follow up | add 1
Maximum RTOG² (GI: rectum, GU: urinary) | No symptoms | add 0
 | RTOG Grade 1, GI: Increased frequency or change in quality of bowel habits not requiring medication/rectal discomfort not requiring analgesic. | add 1
 | RTOG Grade 1, GU: Frequency, nocturia twice pre-treatment habits, dysuria, urgency not requiring medication. | add 1
 | RTOG Grade 2, GI: pain or irritation requiring medication, mucus or bloody discharge, haemorrhoids requiring analgesics alone, or change in bowel habits requiring medication. | add 3
 | RTOG Grade 2, GU: Frequency and nocturia less frequent than every hour, dysuria, and urgency, bladder spasm requiring medication. | add 3
 | RTOG Grade 3, GI: hospitalization for severe pain, bleeding from thrombosed haemorrhoids, severe mucus or bloody discharge, superficial ulceration, minor surgical procedure. | add 5
 | RTOG Grade 3, GU: frequency with urgency or nocturia hourly or more frequently, dysuria, pelvic pain or bladder spasm requiring frequent narcotics, gross hematuria, and obstruction requiring indwelling catheter or minor surgical procedure (trans-urethral resection or incision of the prostate, stricture dilatation). | add 5
 | RTOG Grade 4, GI: Ulceration, necrosis, major surgical procedure | add 7
 | RTOG Grade 4, GU: Ulceration, necrosis, major surgical procedure | add 7
Prolonged Urinary Retention | Catheterization required for more than 3 weeks | add 3
Sexual potency (for previously potent patients) | Partially potent after PB | add 2
 | Impotent after PB | add 3
Maximum score: | | 21

GI = gastrointestinal symptoms, GU = genitourinary symptoms
¹ IPSS = International Prostate Symptom Score [40], describes symptoms of prostate cancer.
² RTOG = Radiation Therapy Oncology Group late radiation morbidity scoring scale [39], describes radiation-induced side effects. This is an in-house modification of the RTOG toxicity scale, made to better reflect the toxicity profile specific to prostate brachytherapy rather than that of external beam radiation.
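The additive scoring in Table 4-1 and the low/high toxicity comparison quoted in the Figure 4.2 caption can be sketched in a few lines of Python. This is an illustrative sketch only, not the analysis code used for the thesis: the function names and argument layout are hypothetical, the carrier counts were tallied by hand from the LIG4 coding-variant column of Table 4-2, and the use of a two-sided Fisher's exact test is an assumption, since this section does not state which test produced p=0.028.

```python
from math import comb

# Points added per maximum late RTOG grade (Table 4-1); applied to GI and GU separately.
RTOG_POINTS = {0: 0, 1: 1, 2: 3, 3: 5, 4: 7}
# Points added for post-implant potency in previously potent patients (Table 4-1).
POTENCY_POINTS = {"normal": 0, "partial": 2, "impotent": 3}


def toxicity_score(ipss_not_normalized, max_gi_grade, max_gu_grade,
                   prolonged_retention, potency="normal"):
    """Sum the Table 4-1 components into a single score (0 to 21)."""
    score = 1 if ipss_not_normalized else 0
    score += RTOG_POINTS[max_gi_grade] + RTOG_POINTS[max_gu_grade]
    score += 3 if prolonged_retention else 0
    score += POTENCY_POINTS[potency]
    return score


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k in the top-left cell with fixed margins.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Add up every table that is no more likely than the observed one.
    return sum(p for p in (prob(k) for k in range(lowest, highest + 1))
               if p <= observed * (1 + 1e-9))


# Patient 34 as read from Table 4-2: IPSS normalized, rectal (GI) grade 3,
# urinary (GU) grade 3, no prolonged retention, impotent after implant.
print(toxicity_score(False, 3, 3, False, "impotent"))  # 13

# LIG4 coding-variant carriers tallied from Table 4-2:
#   high toxicity (score >= 2): 16 carriers, 5 non-carriers
#   low toxicity  (score <= 1):  8 carriers, 12 non-carriers
print(round(fisher_exact_two_sided(16, 5, 8, 12), 3))  # 0.028
```

Under these assumptions the sketch reproduces both the toxicity score listed for patient 34 and the p-value quoted in the caption. The exact test is written out with `math.comb` so that the example needs only the standard library.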
117  Table 4-2 Patient-by-patient radiation dosimetry, gammaH2AX scores, DNA sequence variant counts, toxicity score breakdown, and other data Toxicity and Dosimetry data Patient #  118  2 3 5 12 13 14 15 16 20 22 25 27 28 29 33 39 44 1 24 31 6 18 43 11 17 21 32 35 38 8 26 37 4 7 10 19 9 23 30 36 34  Toxicity Score  Prostate D90 (Gy)  Prostate V100 (%)  Rectal VR100 (cm3)  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 3 3 3 4 4 4 4 4 4 5 5 5 6 7 7 7 9 10 10 11 13  176 165 166 151 130 143 146 146 183 151 155 142 166 154 157 154 152 165 131 128 172 146 147 141 130 148 164 160 142 182 170 148 128 176 152 183 113 136 167 150 136  97.0 95.8 94.4 93.2 84.1 89.8 91.3 90.6 99.0 93.2 93.5 89.0 96.0 93.1 93.7 96.1 92.3 96.8 85.7 85.2 96.6 90.9 91.0 89.0 84.4 91.4 95.4 95.2 88.5 97.9 97.9 91.7 86.9 96.9 93.0 97.9 78.3 88.2 97.0 92.5 87.6  0.17 0.79 0.00 0.84 0.01 0.46 0.16 0.94 0.26 0.34 0.20 0.00 1.20 1.54 1.50 0.20 0.70 0.00 0.00 1.46 0.33 0.05 0.31 0.81 1.48 0.81 0.90 3.20 0.35 0.03 0.26 0.06 0.23 0.17 0.98 0.89 0.03 0.07 0.10 1.20 3.80  From Olive et al. 
GammaH2AX scores: sum of ranks, monocytes and lymphocytes 49 9 11 26 65 35 27 44 61 60 22 29 69 63 69 26 39 39 61 38 25 11 18 57 61 50 51 61 40 11 65 47 9 48 6 14 58 49 43 74  Monocyte score  Lymphocyte score  0.70 0.48 0.40 0.74 0.84 0.76 0.60 0.90 1.01 1.26 0.75 0.72 1.01 1.31 0.92 0.56 0.55 0.62 1.14 1.00 0.53 0.50 0.51 1.04 0.75 0.73 0.74 0.76 0.75 0.37 1.21 0.85 0.43 0.77 0.46 0.40 0.84 0.86 0.66 1.21  0.49 0.16 0.24 0.21 0.58 0.27 0.29 0.27 0.41 0.34 0.13 0.27 0.52 0.36 0.56 0.29 0.42 0.4 0.37 0.21 0.6 0.18 0.21 0.37 0.46 0.47 0.46 0.64 0.31 0.25 0.41 0.32 0.19 0.37 0.07 0.26 0.43 0.33 0.41 0.55  119  Patient # 2 3 5 12 13 14 15 16 20 22 25 27 28 29 33 39 44 1 24 31 6 18 43 11 17 21 32 35 38 8 26 37 4 7 10 19 9 23 30 36 34  Total variants ATM BRCA1 26 3 26 0 28 23 24 6 27 0 26 0 21 19 17 1 21 0 27 22 26 11 23 29 28 3 4 28 26 25 5 0 29 28 26 31 23 1 24 18 30 27 9 29 26 30 23 0 26 29 22 21 28 1 24 0 6 28 18 31 24 4 23 0 26 2 25 2 5 0 18 32 24 18 30 0 29 18 23 20 5 0  ERCC2 14 14 2 14 12 7 11 15 15 10 11 4 5 12 10 9 12 13 16 14 7 15 1 18 6 9 11 16 10 11 14 4 14 11 15 5 14 1 4 1 15  H2AFX 4 4 1 1 2 1 4 1 1 3 2 4 4 1 2 1 4 1 4 1 4 1 1 4 4 3 2 1 4 1 4 4 1 3 4 1 1 4 4 4 2  LIG4 1 4 1 3 3 0 0 3 1 2 3 1 1 6 0 1 4 4 1 4 5 1 2 1 2 1 1 1 2 2 3 1 1 2 4 2 4 1 1 2 2  MDC1 2 4 2 2 2 3 3 2 3 4 3 2 3 6 19 3 3 2 2 2 4 3 11 2 4 3 3 2 8 5 2 2 4 2 3 4 2 4 4 3 4  MRE11A 7 7 2 3 3 1 1 5 3 7 0 8 6 7 6 7 9 2 3 3 3 2 7 4 3 1 2 7 6 2 6 3 8 1 1 7 3 1 6 5 6  RAD50 2 2 9 4 2 3 8 2 2 2 1 3 2 1 6 6 2 2 6 2 8 3 2 2 8 2 2 10 2 2 1 2 2 8 2 2 2 2 3 2 2  Coding variants ATM BRCA1 1 0 1 0 2 1 1 0 1 0 1 0 1 1 1 0 1 0 1 1 2 0 1 1 2 0 1 1 1 1 3 0 3 1 1 1 1 0 1 1 2 1 1 1 1 1 2 0 2 1 1 1 1 0 1 0 2 1 1 1 1 0 1 0 1 0 1 0 1 0 1 1 2 1 2 0 1 1 1 1 2 0  ERCC2 4 4 0 2 2 0 2 4 3 2 1 0 0 2 1 1 2 3 4 4 2 3 0 4 2 1 1 4 1 1 4 0 3 1 4 0 4 0 0 0 4  H2AFX 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0  LIG4 0 0 1 0 0 0 0 0 1 1 0 1 0 2 0 0 0 2 1 1 1 1 2 0 1 1 1 1 2 1 0 0 1 0 1 1 1 1 
1 0 1  MDC1 0 1 0 0 0 0 1 0 1 2 1 0 1 2 10 0 0 0 0 0 1 1 5 0 1 1 1 0 3 2 1 0 2 0 1 1 0 2 1 0 1  MRE11A 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0  RAD50 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0  Patient # 2 3 5 12 13 14 15 16 20 22 25 27 28 29 33 39 44 1 24 31 6 18 43 11 17 21 32 35 38 8 26 37 4 7 10 19 9 23 30 36 34  Non-conservative variants ATM BRCA1 ERCC2 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 1 0 0 2 1 0 0 1 0 0 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 1 0 0 1 0 0 0 0 0 0 0  H2AFX 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0  LIG4 0 0 0 0 0 0 0 0 0 0 0 1 0 2 0 0 0 2 0 0 1 0 1 0 1 0 1 1 1 0 0 0 1 0 0 0 0 0 0 0 1  MDC1 0 0 0 0 0 0 0 0 1 1 0 0 0 1 5 0 0 0 0 0 0 0 3 0 0 0 0 0 2 0 0 0 1 0 1 0 0 0 0 0 0  MRE11A 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0  RAD50 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0  Missense variants ATM BRCA1 1 1 1 0 2 4 1 1 1 0 1 0 1 5 1 0 1 0 1 5 2 2 1 4 2 2 1 4 1 3 2 0 3 4 1 5 1 0 1 2 2 4 1 5 1 6 2 0 2 5 1 4 1 1 1 0 2 4 1 5 1 2 1 0 1 1 1 1 1 0 1 6 2 3 2 3 1 3 1 0 1 0  ERCC2 2 2 0 2 2 0 1 4 2 1 1 0 0 1 0 0 1 1 3 2 1 1 0 3 1 0 0 4 0 0 2 0 2 0 4 0 2 0 0 0 4  H2AFX 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0  LIG4 0 0 0 0 0 0 0 0 0 0 0 1 0 2 0 0 0 2 0 0 1 0 1 0 1 0 1 1 1 0 0 0 1 0 0 0 0 1 0 0 1  MDC1 0 1 0 0 0 0 1 0 1 2 1 0 1 2 7 0 0 0 0 0 1 0 4 0 1 0 1 0 3 1 0 0 1 0 1 1 0 1 1 0 0  MRE11A 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0  RAD50 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0  120  Toxicity score breakdown Patient # 2 3 5 12 13 14 15 16 20 22 25 27 28 29 33 39 44 1 24 31 6 18 43 11 17 21 32 35 38 8 26 37 4 7 10 
19 9 23 30 36 34  IPSS Normalized Score (0=Yes, 1=No)  Pre-implant IPSS  Max Late IPSS  RTOG Score  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 1 1 1 1 1 1 1 0 0 1 0 1 0  6 4 5 6 2 16 4 2 3 1 6 11 7 2 6 7 7 13 3 8 19 7 1 4 6 6 2 5 3 5 4 5 5 4 9 9 9 21 5 0 12  4 8 6 8 5 21 6 4 5 7 10 6 9 7 12 12 12 11 5 14 15 16 15 24 24 13 17 6 22 23 22 17 11 16 24 14 14 33 5 24 20  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 3 3 2 3 4 4 3 2 3 4 2 4 2 3 6 5 6 8 8 10 10  Max Late Urinary RTOG 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 2 0 1 2 2 2 2 1 2 2 1 2 1 2 2 3 1 2 3 2 3  Max Late Rectal RTOG 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 2 1 0 1 1 0 1 0 1 1 1 1 0 2 0 3 3 2 4 3  Acute Urinary Retention >3 weeks (1=Yes, 0=No) 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0  Potency Score  Potency at Implant  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 2 0 3 3 0 2 3 0 2 0 3  Normal Impotent Impotent Normal Normal Normal Normal Normal Partial Normal Partial Normal Impotent Normal Partial Normal Normal Partial Normal Partial Normal Normal Normal Partial Impotent Normal Normal Normal Normal Normal Partial Normal Normal Normal Normal Partial Normal Impotent Partial Normal Normal  Post Implant Potency at subsequent follow-ups NNN IIIIIIII IIPIINNNNNIP NNNNNNNNNNN NNNNNNNN NINNNNNNNNNN INNNNN NNNNNNN NNNNNNNN IIIINNNNNNN NNNNNNIINN NNNNNNNN NIIII IINNINN NNNNNNNN NNNNN NNNNNNNN INNNNINNNNN NNNNN IINNNNNNN INNNNNNI INNNNNN NNNNNN NNNNNNNNNN IIIIIINII IPNPNNNNNNNN IINNNNN NNIIIIIII IINNNNNN NNNIINNN IIIIIIIIIIII NNNIINN IIIIII IIIIIIIIIIII NNNNNNNNNNNN IIIIIIIIIIII IIIIIIIIIIII IIIIII IIIIIII NNN IINIIII  121  Patient # 2 3 5 12 13 14 15 16 20 22 25 27 28 29 33 39 44 1 24 31 6 18 43 11 17 21 32 35 38 8 26 37 4 7 10 19 9 23 30 36 34  Additional clinical data Hormones  Tumour Stage  Yes Yes Yes No Yes No Yes Yes Yes No No Yes Yes Yes Yes No Yes Yes Yes Yes Yes Yes Yes No Yes Yes No Yes Yes No Yes No Yes Yes Yes No Yes No Yes Yes Yes  1C 2A 
2B 1C 2B 2A 1C 1C 2A 2A 1C 1C 2A 1C 1C 2A 1C 2B 2A 1C 1C 2A 1C 2A 2A 1C 1C 1C 1C 1C 1C 1C 1C 1C 1C 2A 2A 2A 1C 1C 2B  Planning Ultrasound Target Volume (PUTV) 32.240 33.539 28.332 21.774 40.100 32.081 50.499 29.928 49.457 37.022 39.334 37.095 46.107 30.791 19.740 39.075 38.158 35.187 28.525 22.864 29.138 19.196 26.486 38.417 32.200 27.600 37.177 51.781 39.470 38.442 48.610 40.777 32.265 33.642 35.378 47.332 26.939 33.177 31.517 35.400 47.859  Gleason Score 6 6 6 5 7 6 6 6 6 6 5 6 6 5 7 6 7 6 7 5 7 7 6 6 6 4 6 6 6 6 7 6 3 6 6 6 5 6 7 6 6  Age at Implant 69 76 70 49 75 71 70 51 72 62 53 56 74 60 68 63 59 62 64 66 62 51 65 66 71 69 62 72 61 61 76 64 60 71 58 68 62 75 70 57 76  PSA at last follow up 0.05 0.04 0.84 0.02 0.02 8.80 0.11 5.70 5.68 0.02 0.04 0.02 0.04 0.10 0.17 0.05 0.02 0.05 0.03 0.02 0.02 0.02 0.02 0.03 0.02 0.03 0.29 0.03 0.04 0.02 0.03 0.02 0.02 0.02 0.04 0.18 0.03 0.01 0.01 0.43 0.06  122  Table 4-3 PCR primer sequences used to amplify amplicons targeting candidate gene exons for sequencing  Gene symbol  123  ATM ATM ATM ATM ATM ATM ATM ATM ATM ATM ATM ATM ATM ATM ATM ATM ATM ATM ATM ATM ATM ATM ATM ATM ATM ATM ATM ATM ATM ATM ATM ATM  Amplicon Coordinates (Build 36, hg18) chr11:107598633-107599269 chr11:107603376-107603752 chr11:107603731-107604317 chr11:107604897-107605533 chr11:107611462-107612041 chr11:107619639-107620264 chr11:107620495-107621149 chr11:107622743-107623339 chr11:107624617-107625197 chr11:107626456-107627005 chr11:107626771-107627347 chr11:107626771-107627347 chr11:107627451-107627964 chr11:107627694-107628152 chr11:107628448-107628999 chr11:107629396-107629940 chr11:107629739-107630115 chr11:107631926-107632547 chr11:107633181-107633775 chr11:107634624-107635232 chr11:107642881-107643523 chr11:107644298-107644874 chr11:107646917-107647546 chr11:107648447-107649065 chr11:107655136-107655728 chr11:107656736-107657326 chr11:107658394-107659040 chr11:107659900-107660587 chr11:107660210-107660595 chr11:107663213-107663837 
chr11:107664461-107664977 chr11:107664618-107665256  Amplicon Size (bp) 673 413 623 696 616 642 655 633 581 586 613 613 550 495 588 581 413 683 645 665 680 613 631 691 629 627 659 688 422 697 553 648  Forward Primer Sequence (preceded by TGTAAAACGAGGCCAGT) AGAGAAAGAAAGGCGCCGAAAT TGCCTTTGACCAGAATGTGCC AAGCGCCTGATTCGAGATCCT GAAGATTAAGAGCTTTGCAGACCAGA GGGCCATAATTTGCCAATTTCTTC GTAATGTTTCTGCGACCTGGCTCT GAGGGAGAGCTAACAGAAGTGGTCTC TCAAGGATCTTGTCAGAAGAGGCA TTGTCATGGCAATCACATATCCCT TTCCTGCCAATTTAGGAAGTAGGACA CGATGCCTTACGGAAGTTGCAT CGATGCCTTACGGAAGTTGCAT TCGGACACCAGGTCTTATTCCTTC TGCCAGGCACTGTCCTGATAGA GGAGAGCAGACCTCCGAATGG TGCTCTTTACTTCCTCTGCTTGGTGA ACTTTCTTGAAGTGAACACCACCAA CTGCTTGGCCATCAGGAGATACTT TCTGTCACTGGTATGATTTGCAAGAA TTCCTCCTTTTTGGTGTAAGTGGG CACCACCACACCCAGCTAATTTTT TGGCTGTTGTGCCCTTCTCTT CACTGCACCCGGCCTATGTTTAT TGGAAAACTTACTTGATTTCAGGCATC TGGCTAGTTTGAGTTCAGTGCTGTTC TGTCAGTGCTTGTGATTTAGCAAAGG AACCAAATGTTGGAAACTCTTAGCA AGGGAGACAACACGACATAACCAA TTTGATGAGGTGAAGTCCATTGC TTGGTTGGCAGAAAAATTACCAGG TGCCCTACTGTGCTGGGCATA CTGAGGCTCATTTATTGAACTGCC  Reverse Primer Sequence (preceded by CAGGAAACAGCTATGAC) CAGGAAAGATGGAGTGAGGAGAGG CAGGATCTCGAATCAGGCGCT TTGCCACTCCTGTCCAGCAA TGAGTGCAGTGGTGTTTACAACGA CACTGTCAACTCCTTGACGATGGA TCTCCCCCTTGAAAACTTCACGTA CAGAATCTGCTACCACTGCTTCAAA TCGAATCATTAGGGTTAGGGTCACCT GGATGGTCTCAATCTCCTGACCTT AAGGTCTGCAGGCTGACCCAG GGTGCTGATATCCCATCACCCTACA GGTGCTGATATCCCATCACCCT GAAGAATTGGAGGCACTTCTGTGCT CAGAAGCAATCAGGCATAAAGACACA AAACAACCTCTTCCCTGGCTAACAG TCCCAGAAGACAGCGATCCAG GAGCCCTTTACTGCCACTTTGC AGCAAACCACCATGGCACTGTAT ACACCAACCAGTGATCAATTCCCT TGCTAAGGGTGCTACTGAACAAGG ATGCCTGGCCTGGTTTTATTTCTT GGCAAGGTTCCAACTTCAAACACA CGGTTTAGAAAGCCAAGCCTTAAA ACACACCTCACTCGAGTCAACCAC TGATTTGACCCATTGTGACCCA GCCTTGTGAAATGCTCTTAAATGGA GTCAGATAGCTGGTTGTTGGCACA CACTAAAGTGTCACAAGATTCTGTTCTCA CCAATCAGAGGGAGACAACACGA TGTGGGGAGACTATGGTAAAAGAGGA GGCAAATGTTGCTTTAATCACATGC TGACACTTTAGTGATATATTAGCTCAGGGAA  Gene symbol  124  ATM ATM ATM ATM ATM ATM ATM ATM ATM ATM ATM ATM ATM ATM ATM ATM ATM 
ATM ATM ATM ATM ATM ATM ATM ATM ATM ATM ATM ATM ATM ATM ATM ATM ATM ATM  Amplicon Coordinates (Build 36, hg18) chr11:107665244-107665844 chr11:107665667-107666232 chr11:107668363-107668981 chr11:107669092-107669688 chr11:107670611-107671220 chr11:107672937-107673608 chr11:107675388-107676039 chr11:107677372-107677965 chr11:107678599-107679179 chr11:107680465-107681075 chr11:107683528-107684224 chr11:107685736-107686312 chr11:107688058-107688691 chr11:107691503-107692135 chr11:107691658-107692264 chr11:107692911-107693471 chr11:107695596-107696218 chr11:107696950-107697500 chr11:107701047-107701669 chr11:107701802-107702377 chr11:107703466-107704051 chr11:107704950-107705510 chr11:107705936-107706600 chr11:107707142-107707720 chr11:107707571-107708204 chr11:107707895-107708483 chr11:107708462-107708787 chr11:107708582-107709209 chr11:107709569-107710197 chr11:107710731-107711373 chr11:107711517-107712123 chr11:107718932-107719475 chr11:107721341-107721894 chr11:107722913-107723608 chr11:107729507-107730099  Amplicon Size (bp) 601 602 688 656 640 687 653 594 651 675 697 613 650 633 665 597 625 587 655 649 622 597 696 579 649 625 362 628 683 697 633 580 590 699 629  Forward Primer Sequence (preceded by TGTAAAACGAGGCCAGT) GCGGACAGAGTGAGTCTTTGTCTC TGGCTTAGGAGGAGCTTGGGC TGGAAGCTTAGAGCTGCCTATTCTG TGAAAACACAGAAACTAAAAGCTGGG AGGTAGCAGTGAGCCTTGATAGCG TTGTGCTTCTGTTTGTGATTTTGC CCACGCCTGGCTAATTTTGTATTT CCCTCCCCCAAAAATCAACTACTA GTTTTTGCCATACCACTCTGCCTC TTTAGCAGTATGTTGAGTTTATGGCAGA GACATGATCTGTCTTGTTCATGCTT GGGTGCTTGTGTGCATTTGTATTAGC CTGAATTGGATGGCATCTGCTCTA TGGCTGTGTAAATATCCACCAACAT GAGTTGGGAGTTACATATTGGTAATGATACA CTGATGCTGGAGTGCATTAGCG TTGTGAATTCCCCTTGTGCCTAGT GCCACCTTCATGTTGAGTAGGATAAGG AGAGCTCTGACCGCATAGCATTTT GAGACAGACAGACAGACAGATAGGCA TTTCCCACCCACCAAGGAAA TCTTGAAGGCAGTAGAAGTTGCTGGA GGAGAGTCCCCTTTGTCCTTTGAT AGATACTGCAGTGGGTAGAGCGTG CCATTCCCTCTAAGAAATGGAAATACA AATTTCTGACTAAACCAGAGGTAGCCA CCTCAAAGCAGTTGGCAAAGTGAA TGTTTTTAAGTCCCAGGGCAGTTT GGTTCCTCAGGTGGAATCTGGTCT 
AGTTAAAGTTACGAGCGTGAGCCA CAGGTGTTTGCAGTATGCCTTCCT TGATGGGCAGGCTCTCAAACA CCCAATGCTGTGATGCCACC TGGGTCTCAACTTTAGCCACAATAA TGGATTTGAGGTGGATCTCACAGA  Reverse Primer Sequence (preceded by CAGGAAACAGCTATGAC) CACATATCACCATGCCCGACTAAT TTTCCCAGGCAAGTAGCGCA ACGTTGCGAACTGCTATCCCTAGT GGGAATTGCAGTTGCAGTGATTAG TGGGTTTAGAGAAATCATCTGGCA ACTCACATTCATTCCGCCAATACA AGGGACCTTGTCTGGAATGTTCAC TGCCTGACACATATTGAAAGCTCA CAAGGCACCCCTTAGAACTCCTCT TTGTGGCAAACCTCCAAAAAGTT TGGAAAAACACTTGCTCCTATCCC ACCCTTATTGAGACAATGCCAACAT TGGCATAAACTCTGAGACAGGTGG CTGACTTTCCTGTGTCTCCCTGAA TTCTGAATCCAGTTTAATTTAGGACCAA TCAAATTTCTTACCTGACGGAAGTGC CAAAACTTTCTTGAAGGGTCAGGG TGATCCACCCACTTCAGCCTTC TACCCTTGCCCAAGGCTTAAAAGT TCGACCACATGATGGACTGATAGAA TGTGGGTGGCTGGGCTAATG CCAGGTATGGCGTGCACCTG TTGCATCATTTACAGCTTGTCAGC TAGAGACAGGGTTTTGCTGTGTCG ATCATTCCATTGTCTAGATTTGTGCAT CACTTTGCCAACTGCTTTGAGGA GCATCACAAAGTGCCTCAACACTTC GACGTCAACTTGCACTATTCAAGGA GCATAAGCACACGGAAACTCTCCT CTGTGTACTCAACTTGGATTGGGG GGACTACAAGCACATGCCATCATC TTTCACTCACACACTTTCATTCTGATG CCTGCCAAACAACAAAGTGCTCA TTGTTTTGGTGGAATAGCCCTGAT GGCACTGGAATACGATTCTAGCACTAA  Gene symbol  125  ATM ATM ATM ATM ATM ATM ATM ATM ATM ATM ATM ATM ATM BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1  Amplicon Coordinates (Build 36, hg18) chr11:107730405-107731005 chr11:107740795-107741378 chr11:107741079-107741675 chr11:107741271-107741899 chr11:107741618-107742216 chr11:107742138-107742657 chr11:107742240-107742894 chr11:107742240-107742894 chr11:107742800-107743394 chr11:107743240-107743895 chr11:107743765-107744370 chr11:107744011-107744572 chr11:107744451-107745113 chr17:38449676-38450260 chr17:38450099-38450693 chr17:38450672-38451219 chr17:38451018-38451610 chr17:38452863-38453470 chr17:38454531-38454932 chr17:38456336-38456841 chr17:38462347-38462755 chr17:38468728-38469145 chr17:38469233-38469773 chr17:38472653-38473237 chr17:38472753-38473302 chr17:38476099-38476698 chr17:38476548-38476930 
chr17:38479623-38480265 chr17:38479905-38480299 chr17:38481747-38482329 chr17:38487763-38488291 chr17:38496190-38496818 chr17:38496693-38497337 chr17:38497064-38497511 chr17:38497068-38497625  Amplicon Size (bp) 601 641 633 662 635 556 620 657 631 658 633 598 677 621 631 584 593 654 438 542 445 454 577 621 586 636 419 643 431 619 565 687 645 484 594  Forward Primer Sequence (preceded by TGTAAAACGAGGCCAGT) GGCTGAAGGCATTTCTAAACCAGT ACCCCTAACCATGGAGGAAGAATG CAGCAGAGGCCGGAAGATGA CCAGAGTTTCAACAAAGTAGCTGAACG TGGTCTTAAGGAACATCTCTGCTTTCA AAGCCCTTCTGTACTGTCCATGTATGT CCCTATCCATTGGGCTTCTTCTTT CCCTATCCATTGGGCTTCTTCTTT GCAAGCCCTGGGTTCTTTGC TGACCGTAAGGATTTCCCCTTTCT TGGTGTATCTTTTTCTTACAAGCTGCC TGCATGGTATGCTATGAGGCTCC GGAGAAATAAGTTGTCCAAGGCAAGA TTTGGCAGCAACAGGAAATACAAA GCTTCCTTCCTGGTGGGATCTG ACCTGCCTCAGAATTTCCTCCC TGCCTGTAATTCCAGCTACTCAGG GAACTTCTAGGCTCCCACCTTGAC TTGGCACAGGTATGTGGGCA CAGCAGCTCAACGCCATCTG GAACCCGAGACGGGAATCCA GGCCTGCATAATTCTTGATGATCC CCCAGCATCACCAGCTTATCTGA GCCTGGCCCACACTCCAAAT GACTATCATCCATGCTATGCTCAACA TGCTGGTAAATTCACCCATGTGA GCTTCTCCCTGCTCACACTTTCTTC CCCCATGTTATATGTCAACCCTGA CGTCAAATCGTGTGGCCCAG CCATCAGTTTCCAAGCTTGTTCAGG TGCCTTGGGTCCCTCTGACTG TTAGCAAATGGGTTTCGAAGGTTT CCGTTGCTACCGAGTGTCTGTCTA TCACTCAGACCAACTCCCTGGC TCAGACCAACTCCCTGGCT  Reverse Primer Sequence (preceded by CAGGAAACAGCTATGAC) GGAGTTCAAGACCAGTCTGACCAA AGCAAATTCACTTGTCCACCAACA CCCAGCCCGAATGACCATTAT TCACCTCCCAGGTTCAAGAGATTC CAAGAGAATTCATGAACTAGAAGGCAA TTGCCTTGGCCTGGGAAATC GGGAGCAAAGAACCCAGGGC GATGCTGTAGCTGTCCTGGAACAA GGGACAGAGAAATGTTCCACTTCTACC ACCCATGAAGATTTCAGGGCTTTC TGGGTCAGTGACTTAGCATACAACAA TGAGCAACTGACTGGCAAACCC TTCTCCTTAGAACGAGTCCCATGC TCAAGAACCGGTTTCCAAAGACA GGGAGGAAATTCTGAGGCAGGT TGCAGCCAGCCACAGGTACA TGGTGGTACGTGTCTGTAGTTCCA CCAAGACTCCCTCATCCTCAAAAT CATGGCATATCAGTGGCAAATTGA TGGACATTGGACTGCTTGTCCC TGACGTGTCTGCTCCACTTCCA GGAATCCATGTGCAGCAGGC GCCTTGGCGTCTAGAAGATGGG TGCTCGTGTACAAGTTTGCCAGA ACTAGTATTCTGAGCTGTGTGCTAGA TCAGCTCGTGTTGGCAACATA GCTACTTTGGATTTCCACCAACACTG AAAGTCCTTCACACAGCTAGGACG 
CAGCTGGGAGATATGGTGCCTC GCATCTGTCTGTTGCATTGCTTG GGGCATTAATTGCATGAATGTGG TAAATTCCTTGCTTTGGGACACCT ATGAGATGTGCACCCACAGTGATA GGAGTCCTAGCCCTTTCACCCA CTGATGACCTGTTAGATGATGGTGAA  Gene symbol  126  BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 ERCC2 ERCC2 ERCC2 ERCC2 ERCC2 ERCC2 ERCC2 ERCC2 ERCC2 ERCC2 ERCC2 ERCC2 ERCC2 ERCC2 ERCC2  Amplicon Coordinates (Build 36, hg18) chr17:38497484-38498049 chr17:38497841-38498430 chr17:38498167-38498726 chr17:38498648-38499242 chr17:38499078-38499482 chr17:38499285-38499872 chr17:38499660-38500239 chr17:38499977-38500495 chr17:38500257-38500851 chr17:38501285-38501853 chr17:38502623-38503015 chr17:38505141-38505506 chr17:38509319-38509889 chr17:38510241-38510764 chr17:38511518-38512056 chr17:38511741-38512186 chr17:38521024-38521673 chr17:38529340-38529971 chr17:38530137-38530734 chr17:38530564-38531237 chr19:50546451-50547026 chr19:50547076-50547724 chr19:50547530-50548205 chr19:50548034-50548716 chr19:50549541-50550173 chr19:50550498-50551109 chr19:50552145-50552709 chr19:50552434-50553046 chr19:50556357-50557008 chr19:50558600-50559220 chr19:50559079-50559771 chr19:50559588-50560214 chr19:50559816-50560412 chr19:50563472-50564059 chr19:50563547-50564095  Amplicon Size (bp) 602 626 596 631 441 624 616 555 631 605 429 402 607 560 575 482 678 632 666 674 624 696 677 688 635 665 641 634 677 645 696 639 633 625 585  Forward Primer Sequence (preceded by TGTAAAACGAGGCCAGT) TGTGTATGGGTGAAAGGGCTAGGA GGCCCTCTGTTTCTACCTAGTTCTGC ATTTGGAGTAATGAGTCCAGTTTCGTT TCAAATGCTGCACACTGACTCACA GGTTTCTGCTGTGCCTGACTGG CGAGTTCCATATTGCTTATACTGCTGC GGAGGCTTGCCTTCTTCCGA TCTCTAGGATTCTCTGAGCATGGCA GCTCCACATGCAAGTTTGAAACAGA TGGGTTGTAAAGGTCCCAAATGGT CACCAAATCCCAAGTCGTGTGTT TCTTCAAGGTGGGAACTGCGTC CAATGCTCAATAAAGAGATGTTGCCA AAGGTGTGAGACCAGTGGGAGTAATTT AAATTGGCCGGGCATGGTAG CAGCCCTACTTTACATAAGTCTGCAA GACAGAGCGAGACTTTGTCTCAAAA TTGGAGAAAGCTAAGGCTACCACC CTCTACTTCCCTCTTGCGCTTTCT 
TAGCGATTCTGACCTTCGTACAGC ACTAACGTCCAGTGAACTGCGCT CTCACCCCAACTTCTCTCACCCT CTGGGAAATGAACGGGAAACAG GTGCCTAGGGACAGAGGGGAG AGCAGCAGAGAAGCAAGGAACCTA CCAGTGCACAATACACTGTGACCA GGGAGATCAGGGAGGATACATTCC CAGGATCTTGGGGTAGATGTCCAG GAATTTCTCTGGCCTCCTCCCTTA GAGATTCTCCAATCCAGCCAGGT CTCACCCTGCAGCACTTCGTC CAATCTTGGGGTCCAGGAGGTAGT CAGGGAGATGCAGACAGGCA AACCAGGCTTGCCAGAGACTCTAA GGAGGAAGTGTGCTTGCCTGG  Reverse Primer Sequence (preceded by CAGGAAACAGCTATGAC) TCACCTGAAAGAGAAATGGGAAATGA TGTGCAACATTCTCTGCCCACTC TCTCGTTACTGGAAGTTAGCACTCT TGAGGAGGAAGTCTTCTACCAGGCA TGATAAATCAGGGAACTAACCAAACGG GGGAGTCTGAATCAAATGCCAAAG CATGCCAGCTCATTACAGCATGA TGGTTGATTTCCACCTCCAAGG TGCCTGGCCTGCCCTTTACT CTGCCTCCCAGGTTGAAGCC CGAAGCCCATGCCTTTAACCA TCCATGGTGTCAAGTTTCTCTTCAGG CATAGGGTTTCTCTTGGTTTCTTTGA TGCAATGCATTATATCTGCTGTGGAT TCAACCAGAAGAAAGGGCCTTCA GCTCTTAAGGGCAGTTGTGAGATTA TTGTGTTGAAAAGGAGAGGAGTGG TGACAGATGGGTATTCTTTGACGG CTTCCCTCGCGACCTACAAACT ATTTCCAAGGGAGACTTCAAGCAG GTCCTTCTCCGACTCCCTAGCTG CCAGTTCCAGATTCGTGAGAATGA AAAGTGTCCGAGGGAATCGACTTT ACAGTCAGCCCCTCCACCAAT CCCTTCTGCACTCATTTCATTGG TAACAGGGTTGCTGAGGGTTCATT AGGGTGTGAATGCTCTGTGGGT GTGGGCTCTCTACTTGGGATCCTT GTCGTGCTAGCAGGTGTGACAAGT CAGGATCAAAGAGACAGACGAGCA CATACTTCTGCCTGGCCTGTGTCT CACAGCCTCACAGCCTCCTATGT CCCTGCATTAAGTTCCCACGC ACTGCTCAAGAACTGTGCCAGAGA GGCAGGCATATCCGCTGGAG  Gene symbol  127  ERCC2 ERCC2 ERCC2 H2AFX H2AFX H2AFX H2AFX H2AFX LIG4 LIG4 LIG4 LIG4 LIG4 LIG4 LIG4 LIG4 LIG4 LIG4 LIG4 LIG4 MDC1 MDC1 MDC1 MDC1 MDC1 MDC1 MDC1 MDC1 MDC1 MDC1 MDC1 MDC1 MDC1 MDC1 MDC1  Amplicon Coordinates (Build 36, hg18) chr19:50563763-50564257 chr19:50564733-50565330 chr19:50565302-50565927 chr11:118469715-118470325 chr11:118469862-118470452 chr11:118470365-118470972 chr11:118470769-118471363 chr11:118470788-118471465 chr13:107657854-107658452 chr13:107658427-107659021 chr13:107658691-107659357 chr13:107659241-107659863 chr13:107659744-107660318 chr13:107659887-107660486 chr13:107660197-107660888 chr13:107660463-107661042 chr13:107660834-107661402 chr13:107661249-107661844 
chr13:107664667-107665326 chr13:107665526-107666176 chr6:30775839-30776419 chr6:30778056-30778701 chr6:30778580-30779183 chr6:30779082-30779646 chr6:30779588-30780153 chr6:30780525-30781149 chr6:30780824-30781496 chr6:30780824-30781496 chr6:30781256-30781880 chr6:30782861-30783502 chr6:30783362-30783980 chr6:30783519-30784152 chr6:30786907-30787506 chr6:30787433-30788050 chr6:30787893-30788495  Amplicon Size (bp) 531 634 626 632 627 608 670 678 599 631 667 673 628 636 698 616 605 632 667 668 631 687 669 632 639 635 673 694 626 677 655 639 642 686 631  Forward Primer Sequence (preceded by TGTAAAACGAGGCCAGT) GAGCCAGTCCCAGAAACGGC TGCAGGTTAAATATTTGGCACAGTAGC AGTAGACCAGGAGCCCGTCCA CCTCCCACCCCTATTATCAGGAAA CGTGGAAGGGTTAGCTGCAGAA GTGCTGCTGCCCAAGAAGAC CTTGCCCCGCAGTCTGAAG CGGCTCAGCTCTTTCCATGAG GGGTAGAATTGTTACAGCTGGACTTG CAAGTCCAGCTGTAACAATTCTACCC TTCTTTCTTGGCTTTGGGCTATTG CCATTTCTTCAGGAGTCTGCTCGT TCTTGTGGTTCATCATCACCACCT ACGCAAGGTGCAGCCAGTTT CCTTTACCCCAATATCCTCCAACA CCTCTTTCTCAGAGTCTCATGCCC CCTTCTCAATGTGCTCAATATCTGCAA TCCAGCATCTCCATGAGTTCCA GAGCAGACAAAGACGCTAGAAGGG GCCACACACACCCCAAACC CATGAGTGGCATCGAGCAATAACT GGGATAGATGGAGCAAATGTAGCA AGGGTCGGTCACCACATATTCATC GCCATCAGCACCCATCTCTACAAT TGAATTGGTGTCTCAAGAAGCTGG GAGACGTAGGCTCAGGGGTAACAG GCAGGACAAATAGGTCCTCTGTCA TCAGGGGCTATAGGGACAGTTGAT GTAGCCTGAGAGGTGGGTTCAGAG CACACGTGGATGATGGTAAGGAAA ACCTGACTGGCTCCCAGAAGGTA TTGGGCACCTTCTCTTCTAACTCG GATCACTTGAGGTCGGGAGTTTGT CTGATTCTCCAGAAAGCACTGGGT GCTGGAAGCTGGCTCTTTCTTACA  Reverse Primer Sequence (preceded by CAGGAAACAGCTATGAC) CCCAACATGCAGGGTCATGG GCTCAACGTGGACGGGCTC TGAGATCGAGTCTCTCGGCTCTTT GTGCTTAGCCCAGGACTTTCAGAC GGAAGACTTGGCCTTCCGCTC GAGGGCCTCACTCACCTTCAG TCTGTTCTAGTGTTTGAGCCGTCG TTGGAGAAAAGAGCCAATCAGGAG TCCACGGTTTGAATAAAATTTCCA GGGAAGATCATAGTCGTGTTGCAGA TTGCCCGTGAATATGATTGCTATG AGATGACAAGGAGTGGCATGAGTG CCTCTATCCATCTACAAGCCAGACA GGGCATGAGACTCTGAGAAAGAGG TCTGCATTTAAACCAATGCTAGCTG GGATTTAAAGCTTGGTGTTAGTCAGCA TCCTCAGCTAGAAAGAGAGAGAATGGC AACTTCAAATTAGGGTTGGAGCAAA GTGGGGAGTCAAGTAGGGGAAGTG 
GAGGCTATCACTAGCCAGAGCACA CCTGATTTTGCCTTTGCTCTGTCT AAATGCTAGGCAGCAGAGCTGATT AGCCCCCAAAGTAAGAGACAAAGG GAATCCCTTACAGCCATTCCTGAG GCCACTAGGTGCAGGACAAATAGG CCCACATATCAGGCTACTAGGGGA GGGCTGGAGGATCAATCTCTAAAA GATCCTCTCTTCTTCCGCATCAGT CCTCACTTGCTTCTGTTTCTCCCT GATACACAGAGAGGGGAGCCAGAG ACCAGGAGACCAACATCCAGAGAG TCCCTCTCCCTCTCTCTCTCTTCC TGTCTTATATTCCTCCCCGACCAG CAAACAGATGTGAAAGCAGTTGGG CAGCGATACAGATGACGAGGAAGA  Gene symbol  128  MDC1 MDC1 MDC1 MDC1 MDC1 MRE11A MRE11A MRE11A MRE11A MRE11A MRE11A MRE11A MRE11A MRE11A MRE11A MRE11A MRE11A MRE11A MRE11A MRE11A MRE11A MRE11A MRE11A MRE11A MRE11A MRE11A MRE11A MRE11A MRE11A MRE11A MRE11A MRE11A MRE11A MRE11A MRE11A  Amplicon Coordinates (Build 36, hg18) chr6:30788436-30789038 chr6:30789156-30789773 chr6:30789418-30790048 chr6:30790528-30791214 chr6:30792710-30793286 chr11:093791680-93792263 chr11:93789924-93790518 chr11:93790206-93790525 chr11:93790371-93790813 chr11:93790501-93791160 chr11:93790822-93791406 chr11:93791078-93791702 chr11:93791531-93792212 chr11:93792063-93792643 chr11:93792621-93793211 chr11:93802483-93803058 chr11:93808340-93808994 chr11:93809865-93810243 chr11:93818381-93818954 chr11:93819823-93820439 chr11:93828956-93829542 chr11:93831998-93832625 chr11:93832241-93832603 chr11:93833579-93833975 chr11:93836707-93837325 chr11:93840495-93840993 chr11:93842925-93843486 chr11:93844148-93844803 chr11:93848872-93849513 chr11:93851104-93851690 chr11:93852229-93852814 chr11:93858544-93859140 chr11:93863534-93864079 chr11:93865381-93865970 chr11:93866249-93866869  Amplicon Size (bp) 674 685 649 699 631 620 595 356 479 693 621 689 699 630 627 646 665 415 574 647 623 694 399 433 685 535 598 673 642 623 633 633 582 662 626  Forward Primer Sequence (preceded by TGTAAAACGAGGCCAGT) TCTTTCAGATGTGCCAAAGTCAGC AGGGATACCCCAACTCAACTGTGA CACAGAGGAAGATGTGGTCCTTGA TCTCAAAGCACGGGAATTACAGGT AACTCCGTCTTTATGACAGGCCAA TGAGGCAGCCACTAACCAAGTG TGATGGAATCCCTCTACAGGTCAA TCCATAACAGGCTGAACCAAATGA TGCGTCTAATGGCAGAAGCACC TGTAGAGGGATTCCATCAGTGCTC 
CCTGCTCTCCCAAAGCCTCC TCACAAAATGACCTGAATTAGCTGGA TATTTCACAGGACAATGCCTTCCA GAGAAATCCTGGCATTGACATTCC AAGCTCTTTCCCTCAACTTTGGCT TGGAGGCTGAGAGTAGGATGTGTG AGTCTAGGCGACAGAATGAGGCAC GCTCCTTCCAGCTTTAATGTTCCA CTACCACGTCCAGCCTATTTCCTT CTAAGTGCATGTGGCCATTCAAAA GCAGCCATCCTAAGCCAACCC TCCATGGGGAACAAAACACTTTAG TCGAGGGCATCAATATGACGTTC TTTCCCTGCTGTGCAGCAACT CACGTTGTGCACATGTACCCTAGA TTGCCTCCGATGGTGATTGC CACACGATAGTCTCCCATTCCTCA TTGAATTTGCCTAATGAGCAGCAA GCATGGTGGCTTATGCTTGTAATC TCCTGCTCTTTCACTTGCAGAATCA GGTGTTCACTCCTACTCCTGGCTT CCGAGTAGCTGGGTCAGTTCCAC GCAGGCAAGGTAAGCACCTGA TGCAAGACTCCAATCTATAGCTGCAT AAAATAGCCAGGTTTGGGGAAGAG  Reverse Primer Sequence (preceded by CAGGAAACAGCTATGAC) AGATGTGGAAGAAGGTCAGCAACC GAGACCTCCTAAGGTTTTGAGCCC TAATGCATATGGAGGCCTTGAGTG GCTGAGGTGAGAGAATTGCTTGAA AAGCGGTAGTGGGTTGTCCCTT CCCACTCTTCCTTTCTCCTTTGC TTCAGTTCACCGCTAAGGAAAGTG TGAGCACTGATGGAATCCCTCTACA CATTCAGAGGAAGACATCTGTAGGGAA ACGCCTGGCTGATTTTTGTATTTT GGAGGCAGTGTCTGGATGATGC ACACTTGGTTAGTGGCTGCCTCAT TGCCTAGTTCTAGGAGGAAACGGG GCCAAAGTTGAGGGAAAGAGCTTA AGATACATGCACACACAGCAGGATAAA TGAGGAAATTGAAGCACAGAGAGG GAGCCTATGCAAGTCATTCACCAG CAGTTGGGCATTGAGTTATGCG TTGGACTCCATATCCTAGCCATCA AACTCTTGGGTGCAAGTGATCCTC TGCAGCAAGGTGCACAAGAGTG GAGGAGTATTCATGTGTATGCCTTATCC CAGGCCTCTACATTTCACGTGTCC CCATGTTGAGCAAGCTGGCAGT CAGGGGGAGATGTAATCATTCTGC GCTGAGGAAAGCCTTATTGAAACATGA TGACTCGGTGTTCATTTCTCTCCA GCTTTCTCTTTTCGGGTTTCCACT AAGATTACAGGCACACACCACCAT GCAGATGCACTTTGTGCCTTGG CCACAAGCACTGTTTATTATAGTTGGGG TGGCCTGAATCAGAGACTTGGTG TTTGGGCCTGGGTTACATGA AAGAGCACGGGAAAGGAAAATAGG GACAGGGAATAACAACCCACCTGA  Gene symbol  129  RAD50 RAD50 RAD50 RAD50 RAD50 RAD50 RAD50 RAD50 RAD50 RAD50 RAD50 RAD50 RAD50 RAD50 RAD50 RAD50 RAD50 RAD50 RAD50 RAD50 RAD50 RAD50 RAD50 RAD50 RAD50 RAD50 RAD50 RAD50 RAD50 RAD50 RAD50 RAD50 RAD50 RAD50 RAD50  Amplicon Coordinates (Build 36, hg18) chr5:131920445-131921084 chr5:131922511-131923103 chr5:131939145-131939740 chr5:131942768-131943237 chr5:131943218-131943758 chr5:131950876-131951537 chr5:131951269-131951916 
chr5:131952046-131952625 chr5:131952285-131952840 chr5:131953044-131953607 chr5:131954414-131954989 chr5:131954563-131955072 chr5:131955267-131955870 chr5:131958269-131958857 chr5:131958991-131959528 chr5:131966678-131967137 chr5:131967208-131967764 chr5:131968184-131968541 chr5:131968209-131968557 chr5:131968219-131968837 chr5:131971975-131972579 chr5:131972368-131972937 chr5:131972555-131973105 chr5:131979446-131979999 chr5:131981443-131982091 chr5:132000305-132000887 chr5:132001420-132002040 chr5:132004019-132004675 chr5:132005586-132006177 chr5:132005782-132006374 chr5:132005974-132006573 chr5:132006247-132006835 chr5:132006718-132007308 chr5:132007037-132007701 chr5:132007405-132008002  Amplicon Size (bp) 656 629 632 506 577 670 669 580 592 632 612 546 674 660 574 496 593 394 385 664 668 606 587 590 668 619 647 696 628 629 636 625 640 665 634  Forward Primer Sequence (preceded by TGTAAAACGAGGCCAGT) AGCACCTAGCCCTCTGCTTCG TCCTACAGCCTGGAGTTAATGTGAGAA TCTCCTAATGATGCTGAATAAAGGAGG TGATGAAGCCATTTCTAACGGGA CAGCAGCAACCATTGGGCAT AATGACTTTTGTGGCAGGTGTTGA GAAAATGGAAAAGGTTTGTGGTGG TTGCTTCAATAAAGGTTTTTCTGCC ACAGCTGCAAGCAGATCGCC TGATGTTCACACAATGATAAAATTGCC TTGGTCAGGGACCACATCACA TCTCACATTTCTTTCCTGTTTGACCC TCGACTTGGTACTCCACTCTTAAGGC GTGAGTCAGTGTCCTAGGGGGAGA TGCTCTTTGGAAGCGAATATCGG GGAGAAACCTGGCCAACACCA TCAGATACCCTCAGGGAGAAACTGC CCTGCCCTGTAAGCTTTCCCTG GTGGCCAGCAGGGAACATCA GGGAACATCAAGCTGATTTGAGAA CATTTACTGGCTGTTGTGACCCTG TTCATAGCACCACGTCGGACA GGTGAGAAAGGGACTTGTCTGCG TCCTTCACACTGGCTTATTCTCCC GCTATTTAGCAACCTATGTGCGGC CAGGCCTTCCTGTGACCCGT GCCTGGAGGAAACTCTTAACAGGG TCAGCGTTGTTCTGAGCATTTTGT GGAGAAGAGACTCCTGCCTGGCT CGCTCACAGCAGCGTAACTTCC TGCCATAGAAATGTAGGTCCTCAGAAA TGCAAATGCATGCTTCTTCTCAA CCTCTGCGTCTATCCTGTGTAGCA TGCTGCAACAACTAGCACTTTCAT TTTATCCCAAGAATGCAAGATTTCAGA  Reverse Primer Sequence (preceded by CAGGAAACAGCTATGAC) GTAGCGACCTGTAAACTGAAGGCG TGTCAACAACGGTTACTACTGGGTGC GGAGGAGGCTGTGTGTGCGT ATGCCCAATGGTTGCTGCTG GCTTGATTTAGCCAGTCCACGATG TGCTCATCAGTCCCTTGAAAAACC 
ACCATACCTAGCTCCCTCCTGTCC ATGGCGAAAACCCGTCTCTACTAA TGGGACCAATGTCAAATATGTGGTCTA TCGGGTTGTAGAACCAAAGAGTCA TGGTCAGCATCTCCATTTGGG TCTATGACTTATGAGTGCAAGGTAGGC TTGCTGGTATCACTGCTCAGAAGA CCACAGGTGGTAGTTGTGTCCTCA AAGTGATATAAGACAGGGCATACCAGC CAAATTCAAGCCACCATGGAACA AAGAGCAAATATATGTGGACAGGAAGG TGCTCCTCCAGTTGCTGACGA GGATAATTCCACAGTCTGCTCCTCC AATGAGCATGTTTGCCTAGACAGC ATCGCAGACAAGTCCCTTTCTCAC TTTCTTTGTGTTTCTCGCATTCAC CCACCACGCTTGGCCTCTTT TGTGCAGCAGGCTAGCAGATGT AGGATGAGGCAGGAGAATCATTTG TCCAGGGAGGTAATGCTGGC GAGAATGCTTCAGGCCCTTCTTTT GAACCCCTCACAGTGACTCTCTCC TGGAATGGGATGAAGAGCAGCA CCACATGCAAGGAAGTAAATTCAGAGG GAAGGTGGTGGGTACTGACTTAGATGA GGGAGCAGGCCTTGACTCTG ACCACCCCCAGGATACTCTGTCTT CAGGGGTACAAATAAAATTGGGGA GCCCAGGCAGTCTGGCTCAT

Table 4-4 Number of variant sites detected in each DNA repair gene.

Gene | kbp sequenced (total) | kbp sequenced (coding) | Variant sites (total) | Variants per kbp | Variants per coding kbp | In dbSNP | Novel | In-Dels | Conservative amino acid change (novel) | Non-conservative amino acid change (novel) | Synonymous (novel)
ATM | 40.9 | 13.0 | 62 | 1.5 | 0.7 | 41 | 21 | 16 | 4 (2) | 4 (0) | 1 (0)
BRCA1 | 18.2 | 7.1 | 47 | 0.4 | 2.1 | 43 | 4 | 6 | 8 (0) | 4 (0) | 3 (0)
ERCC2 | 8.8 | 2.3 | 37 | 4.2 | 1.7 | 28 | 9 | 5 | 2 (0) | 0 (0) | 2 (0)
H2AFX | 1.8 | 1.4 | 5 | 2.8 | 0.7 | 2 | 3 | 0 | 0 (0) | 0 (0) | 1 (1)
LIG4 | 5.3 | 4.0 | 19 | 3.6 | 1.5 | 10 | 9 | 6 | 1 (1) | 2 (0) | 3 (2)
MDC1 | 9.7 | 6.5 | 36 | 3.7 | 3.5 | 22 | 14 | 0 | 5 (2) | 12 (4) | 6 (2)
MRE11A | 14.3 | 5.1 | 15 | 1.0 | 0 | 13 | 2 | 1 | 0 (0) | 0 (0) | 0 (0)
RAD50 | 15.8 | 5.8 | 18 | 1.1 | 0.3 | 11 | 7 | 2 | 1 (0) | 0 (0) | 1 (0)
Overall | 114.8 | 45.2 | 239 | 2.3 | 1.1 | 170 | 69 | 36 | 21 (5) | 22 (4) | 17 (5)

Table 4-5 Coding variant genotypes observed in high and low toxicity prostate brachytherapy patients.
(A = reference allele, B = non-reference allele)  Gene ATM ATM ATM ATM ATM ATM ATM ATM ATM BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 ERCC2 ERCC2 ERCC2 ERCC2 H2AFX LIG4 LIG4 LIG4 LIG4 LIG4 LIG4 MDC1 MDC1 MDC1 MDC1 MDC1 MDC1 MDC1 MDC1 MDC1 MDC1 MDC1 MDC1 MDC1 MDC1 MDC1 MDC1 MDC1 MDC1 MDC1 MDC1 MDC1 MDC1 MDC1 RAD50 RAD50  Genome coordinate (Build 36, hg18)  Amino acid change  Conservation Score*  dbSNP accession  chr11:107,603,786 chr11:107,626,842 chr11:107,629,971 chr11:107,648,666 chr11:107,668,697 chr11:107,669,347 chr11:107,680,672 chr11:107,680,673 chr11:107,688,377 chr17:38,476,501 chr17:38,476,574 chr17:38,476,620 chr17:38,487,996 chr17:38,497,526 chr17:38,497,955 chr17:38,497,961 chr17:38,498,462 chr17:38,498,553 chr17:38,498,763 chr17:38,498,992 chr17:38,498,997 chr17:38,499,587 chr17:38,499,693 chr17:38,500,007 chr19:50,546,759 chr19:50,547,364 chr19:50,559,099 chr19:50,560,149 chr11:118,471,119 chr13:107,659,299 chr13:107,659,914 chr13:107,660,726 chr13:107,661,333 chr13:107,661,592 chr13:107,661,610 chr6:30,779,291 chr6:30,779,567 chr6:30,779,705 chr6:30,779,968 chr6:30,780,543 chr6:30,780,702 chr6:30,780,736 chr6:30,780,992 chr6:30,781,326 chr6:30,781,604 chr6:30,781,641 chr6:30,781,660 chr6:30,783,584 chr6:30,783,903 chr6:30,783,969 chr6:30,784,025 chr6:30,788,541 chr6:30,788,587 chr6:30,788,777 chr6:30,788,895 chr6:30,789,456 chr6:30,789,662 chr6:30,789,910 chr5:131,951,572 chr5:132,005,895  S49C L480F S707P P1054R P1526P V1570A D1853N D1853V N1983S M1652I M1628T S1613G S1436S K1183R S1040N E1038G L871P R841W L771L S694S D693N R496H F461L Q356R K751Q D711D D312N R156R L66L Q773Q D568D Y298H E95E T9I A3V D1855E V1791E P1745R A1657A T1466A K1413E G1401G M1316T T1205A S1112F P1100A P1093P R917S D811G Q789P T770T P386L E371K S307S R268K R179C P138P V56I V315L I1293I  -1 0 -1 -2 7 0 1 -3 1 1 -1 0 4 2 1 -2 -3 -3 4 4 1 0 0 1 1 6 1 5 4 5 6 2 5 -2 -2 2 -2 -2 4 -1 1 6 -1 -1 -2 -1 7 -1 -1 -1 4 -3 1 4 2 -3 7 3 1 4  
rs1800054 novel rs4986761 rs1800057 rs1800889 novel rs1801516 rs1801673 rs659243 rs1799967 rs4986854 rs1799966 rs1060915 rs16942 rs4986852 rs16941 rs799917 rs1800709 rs16940 rs1799949 rs4986850 rs28897677 rs62625300 rs1799950 rs13181 rs1052555 rs1799793 rs238406 novel novel rs1805386 novel novel rs1805388 rs1805389 rs28994874 rs28994873 rs28994871 rs28986317 novel novel rs28994870 rs61733213 novel rs28987085 rs28994869 rs28994868 rs28986467 novel novel rs28986466 rs28986465 rs2075015 novel rs9262152 rs28986464 novel novel rs28903090 rs28903094  High toxicity AA 21 20 21 19 3 20 11 11 0 9 6 11 14 7 20 19 11 19 4 4 4 19 19 15 15 18 18 18 13 13 16 16 16 9 9 9 16 14 14 14 14 14 14 14 17 19 21 20 21 19 3 20 11 11 0 9 6 11 14 7  AB 0 1 0 2 1 1 2 0 0 9 6 7 3 8 1 0 9 1 0 7 0 1 1 1 5 1 0 0 1 0 0 0 0 0 0 1 1 1 1 1 6 1 1 1 1 1 0 1 0 2 1 1 2 0 0 9 6 7 3 8  BB 0 0 0 0 0 0 0 0 21 2 4 3 3 5 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 21 2 4 3 3 5  Low toxicity AA 19 20 17 20 1 19 9 9 0 9 6 10 9 4 19 13 16 16 7 7 7 18 18 18 18 17 17 17 13 13 12 12 12 8 8 8 16 14 14 14 14 13 13 13 17 20 19 20 17 20 1 19 9 9 0 9 6 10 9 4  AB 1 0 1 0 1 0 3 1 0 6 7 5 10 11 0 1 3 0 1 3 2 1 1 1 0 0 1 1 0 1 1 1 1 1 1 0 1 0 0 0 5 1 0 0 0 0 1 0 1 0 1 0 3 1 0 6 7 5 10 11  BB 0 0 0 0 0 0 0 0 20 3 1 1 1 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 20 3 1 1 1 5  131  Table 4-6 Variants associated with residual gamma H2AX levels following irradiation  Gene  Genome coordinate (Build 36, hg18)  Amino acid change, Conservation score*  dbSNP accession  High gamma H2AX AA AB BB  Low gamma H2AX AA AB BB  Possible effect on DNA repair activity  MDC1  chr6:30778271  intronic  rs9405048  16  3  0  8  9  2  Increase  BRCA1  chr17:38449934  intronic  rs12516  8  2  0  5  6  3  Increase  BRCA1  chr17:38453439  intronic  rs817630  14  5  2  5  10  3  Increase  BRCA1  chr17:38469351  intronic  rs3092994  10  4  0  3  8  3  Increase  BRCA1  
chr17:38469731  intronic  rs8176257  10  2  0  3  8  1  Increase
BRCA1  chr17:38469732  intronic  rs8176256  10  1  0  3  4  0  Increase
BRCA1  chr17:38472867  intronic  rs11654396  14  3  1  6  8  1  Increase
BRCA1  chr17:38480127  intronic  rs8176212  12  5  2  5  10  3  Increase
BRCA1  chr17:38480201  intronic  rs2236762  12  5  2  5  11  3  Increase
BRCA1  chr17:38480270-38480274  intronic  novel  14  4  2  5  7  3  Increase
BRCA1  chr17:38496716  intronic  rs799916  14  5  2  5  10  3  Increase
BRCA1  chr17:38501690  intronic  rs8176147  11  3  0  3  9  2  Increase
BRCA1  chr17:38502890  intronic  rs8176144  13  3  0  6  8  3  Increase
BRCA1  chr17:38510660  intronic  rs799912  11  4  1  5  10  3  Increase
BRCA1  chr17:38530713  intronic  rs799905  13  5  2  5  11  3  Increase
ERCC2  chr19:50559099  D312N, 1  rs1799793  8  9  3  14  4  1  Decrease

p-values < 0.05  p1=0.009 p3=0.006 p1=0.047 p3=0.023 p1=0.025 p1=0.021 p3=0.009 p4=0.036 p1=0.012 p3=0.017 p1=0.047 p1=0.038 p1=0.049 p1=0.049 p1=0.044 p1=0.025 p1=0.007 p3=0.007 p1=0.013 p3=0.004 p1=0.037 p3=0.037 p1=0.025 p3=0.042

*Conservation score of an amino acid substitution as determined from the BLOSUM62 alignment score matrix [44, 45]. We defined scores < 0 as non-conservative substitutions and all others as conservative substitutions.

4.7. Bibliography

1. Evans WE, Relling MV: Pharmacogenomics: translating functional genomics into rational therapeutics. Science 1999, 286(5439):487-491.
2. Frazer KA, Murray SS, Schork NJ, Topol EJ: Human genetic variation and its contribution to complex traits. Nat Rev Genet 2009, 10(4):241-251.
3. Grimm PD, Blasko JC, Sylvester JE, Meier RM, Cavanagh W: 10-year biochemical (prostate-specific antigen) control of prostate cancer with (125)I brachytherapy. Int J Radiat Oncol Biol Phys 2001, 51(1):31-40.
4. Potters L, Morgenstern C, Calugaru E, Fearn P, Jassal A, Presser J, Mullen E: 12-year outcomes following permanent prostate brachytherapy in patients with clinically localized prostate cancer. J Urol 2005, 173(5):1562-1566.
5. Morris WJ, Keyes M, Palma D, Spadinger I, McKenzie MR, Agranovich A, Pickles T, Liu M, Kwan W, Wu J, Berthelet E, Pai H: Population-based Study of Biochemical and Survival Outcomes After Permanent (125)I Brachytherapy for Low- and Intermediate-risk Prostate Cancer. Urology 2009, 4:860-865.
6. Bentzen SM: Preventing or reducing late side effects of radiation therapy: radiobiology meets molecular pathology. Nat Rev Cancer 2006, 6(9):702-713.
7. Keyes M, Miller S, Moravan V, Pickles T, McKenzie M, Pai H, Liu M, Kwan W, Agranovich A, Spadinger I, Lapointe V, Halperin R, Morris WJ: Predictive Factors for Acute and Late Urinary Toxicity After Permanent Prostate Brachytherapy: Long-Term Outcome in 712 Consecutive Patients. Int J Radiat Oncol Biol Phys 2008, 73(4):1023-1032.
8. Bucci J, Morris WJ, Keyes M, Spadinger I, Sidhu S, Moravan V: Predictive factors of urinary retention following prostate brachytherapy. Int J Radiat Oncol Biol Phys 2002, 53(1):91-98.
9. Keyes M, Schellenberg D, Moravan V, McKenzie M, Agranovich A, Pickles T, Wu J, Liu M, Bucci J, Morris WJ: Decline in urinary retention incidence in 805 patients after prostate brachytherapy: the effect of learning curve? Int J Radiat Oncol Biol Phys 2006, 64(3):825-834.
10. Keyes M, Miller S, Moravan V, Pai H, Kwan W, Liu M, Morris J, Halperins R, Pickles T: Acute and Late Urinary Toxicity in 606 Prostate Brachytherapy Patients. Radiotherapy and Oncology 2006, 80(Supplement 1):S41-S42.
11. Keyes M, Moravan V, Liu M, Jankovic B, Morris WJ: Rectal Toxicity After I 125 Permanent. Radiotherapy and Oncology 2005, 76(Supplement 1):S5-S6.
12. Macdonald AG, Keyes M, Kruk A, Duncan G, Moravan V, Morris WJ: Predictive factors for erectile dysfunction in men with prostate cancer after brachytherapy: is dose to the penile bulb important? Int J Radiat Oncol Biol Phys 2005, 63(1):155-163.
13. Bottomley D, Ash D, Al-Qaisieh B, Carey B, Joseph J, St Clair S, Gould K: Side effects of permanent I125 prostate seed implants in 667 patients treated in Leeds. Radiother Oncol 2007, 82(1):46-49.
14. Lehrer S, Cesaretti J, Stone NN, Stock RG: Urinary symptom flare after brachytherapy for prostate cancer is associated with erectile dysfunction and more urinary symptoms before implantation. BJU Int 2006, 98(5):979-981.
15. Wust P, von Borczyskowski DW, Henkel T, Rosner C, Graf R, Tilly W, Budach V, Felix R, Kahmann F: Clinical and physical determinants for toxicity of 125-I seed prostate brachytherapy. Radiother Oncol 2004, 73(1):39-48.
16. Finnon P, Robertson N, Dziwura S, Raffy C, Zhang W, Ainsbury L, Kaprio J, Badie C, Bouffler S: Evidence for significant heritability of apoptotic and cell cycle responses to ionising radiation. Hum Genet 2008, 123(5):485-493.
17. Roberts SA, Spreadborough AR, Bulman B, Barber JB, Evans DG, Scott D: Heritability of cellular radiosensitivity: a marker of low-penetrance predisposition genes in breast cancer? Am J Hum Genet 1999, 65(3):784-794.
18. Burrill W, Barber JB, Roberts SA, Bulman B, Scott D: Heritability of chromosomal radiosensitivity in breast cancer patients: a pilot study with the lymphocyte micronucleus assay. Int J Radiat Biol 2000, 76(12):1617-1619.
19. Wood RD, Mitchell M, Lindahl T: Human DNA repair genes, 2005. Mutat Res 2005, 577(1-2):275-283.
20. Wood RD, Mitchell M, Sgouros J, Lindahl T: Human DNA repair genes. Science 2001, 291(5507):1284-1289.
21. De la Torre C, Pincheira J, Lopez-Saez JF: Human syndromes with genomic instability and multiprotein machines that repair DNA double-strand breaks. Histol Histopathol 2003, 18(1):225-243.
22. O'Driscoll M, Cerosaletti KM, Girard PM, Dai Y, Stumm M, Kysela B, Hirsch B, Gennery A, Palmer SE, Seidel J, Gatti RA, Varon R, Oettinger MA, Neitzel H, Jeggo PA, Concannon P: DNA ligase IV mutations identified in patients exhibiting developmental delay and immunodeficiency. Mol Cell 2001, 8(6):1175-1185.
23. Gutierrez-Enriquez S, Fernet M, Dork T, Bremer M, Lauge A, Stoppa-Lyonnet D, Moullan N, Angele S, Hall J: Functional consequences of ATM sequence variants for chromosomal radiosensitivity. Genes Chromosomes Cancer 2004, 40(2):109-119.
24. Andreassen CN, Alsner J, Overgaard J: Does variability in normal tissue reactions after radiotherapy have a genetic basis--where and how to look for it? Radiother Oncol 2002, 64(2):131-140.
25. Angele S, Romestaing P, Moullan N, Vuillaume M, Chapot B, Friesen M, Jongmans W, Cox DG, Pisani P, Gerard JP, Hall J: ATM haplotypes and cellular response to DNA damage: association with breast cancer risk and clinical radiosensitivity. Cancer Res 2003, 63(24):8717-8725.
26. Cesaretti JA, Stock RG, Lehrer S, Atencio DA, Bernstein JL, Stone NN, Wallenstein S, Green S, Loeb K, Kollmeier M, Smith M, Rosenstein BS: ATM sequence variants are predictive of adverse radiotherapy response among patients treated for prostate cancer. Int J Radiat Oncol Biol Phys 2005, 61(1):196-202.
27. Iannuzzi CM, Atencio DP, Green S, Stock RG, Rosenstein BS: ATM mutations in female breast cancer patients predict for an increase in radiation-induced late effects. Int J Radiat Oncol Biol Phys 2002, 52(3):606-613.
28. Hall EJ, Schiff PB, Hanks GE, Brenner DJ, Russo J, Chen J, Sawant SG, Pandita TK: A preliminary report: frequency of A-T heterozygotes among prostate cancer patients with severe late responses to radiation therapy. Cancer J Sci Am 1998, 4(6):385-389.
29. Andreassen CN, Overgaard J, Alsner J, Overgaard M, Herskind C, Cesaretti JA, Atencio DP, Green S, Formenti SC, Stock RG, Rosenstein BS: ATM sequence variants and risk of radiation-induced subcutaneous fibrosis after postmastectomy radiotherapy. Int J Radiat Oncol Biol Phys 2006, 64(3):776-783.
30. West CM, Elliott RM, Burnet NG: The genomics revolution and radiotherapy. Clin Oncol (R Coll Radiol) 2007, 19(6):470-480.
31. Andreassen CN, Alsner J, Overgaard M, Overgaard J: Prediction of normal tissue radiosensitivity from polymorphisms in candidate genes. Radiother Oncol 2003, 69(2):127-135.
32. Damaraju S, Murray D, Dufour J, Carandang D, Myrehaug S, Fallone G, Field C, Greiner R, Hanson J, Cass CE, Parliament M: Association of DNA repair and steroid metabolism gene polymorphisms with clinical late toxicity in patients treated with conformal radiotherapy for prostate cancer. Clin Cancer Res 2006, 12(8):2545-2554.
33. Suga T, Iwakawa M, Tsuji H, Ishikawa H, Oda E, Noda S, Otsuka Y, Ishikawa A, Ishikawa K, Shimazaki J, Mizoe JE, Tsujii H, Imai T: Influence of multiple genetic polymorphisms on genitourinary morbidity after carbon ion radiotherapy for prostate cancer. Int J Radiat Oncol Biol Phys 2008, 72(3):808-813.
34. Human DNA repair genes. Supplement to the review by Wood RD, Mitchell M, & Lindahl T published in Mutation Research, 2005. [http://www.cgal.icnet.uk/DNA_Repair_Genes.html]
35. Niida H, Nakanishi M: DNA damage checkpoints in mammals. Mutagenesis 2006, 21(1):3-9.
36. Spycher C, Miller ES, Townsend K, Pavic L, Morrice NA, Janscak P, Stewart GS, Stucki M: Constitutive phosphorylation of MDC1 physically links the MRE11-RAD50-NBS1 complex to damaged chromatin. J Cell Biol 2008, 181(2):227-240.
37. Wang Y, Cortez D, Yazdi P, Neff N, Elledge SJ, Qin J: BASC, a super complex of BRCA1-associated proteins involved in the recognition and repair of aberrant DNA structures. Genes Dev 2000, 14(8):927-939.
38. Robins P, Lindahl T: DNA ligase IV from HeLa cell nuclei. J Biol Chem 1996, 271(39):24257-24261.
39. Cox JD, Stetz J, Pajak TF: Toxicity criteria of the Radiation Therapy Oncology Group (RTOG) and the European Organization for Research and Treatment of Cancer (EORTC). Int J Radiat Oncol Biol Phys 1995, 31(5):1341-1346.
40. Barry MJ, Fowler FJ, Jr., O'Leary MP, Bruskewitz RC, Holtgrewe HL, Mebust WK, Cockett AT: The American Urological Association symptom index for benign prostatic hyperplasia. The Measurement Committee of the American Urological Association. J Urol 1992, 148(5):1549-1557; discussion 1564.
41. Keyes M, Miller S, Moravan V, Pickles T: Predictive factors for acute and late urinary toxicity after permanent prostate brachytherapy: long-term outcome in 712 consecutive patients. IJROBP 2008, in press.
42. Pugh TJ, Bebb G, Barclay L, Sutcliffe M, Fee J, Salski C, O'Connor R, Ho C, Murray N, Melosky B, English J, Vielkind J, Horsman D, Laskin JJ, Marra MA: Correlations of EGFR mutations and increases in EGFR and HER2 copy number to gefitinib response in a retrospective analysis of lung cancer patients. BMC Cancer 2007, 7:128.
43. Olive PL, Banath JP, Keyes M: Residual gammaH2AX after irradiation of human lymphocytes and monocytes in vitro and its relation to late effects after prostate brachytherapy. Radiother Oncol 2008, 86(3):336-346.
44. Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A 1992, 89(22):10915-10919.
45. Eddy SR: Where did the BLOSUM62 alignment score matrix come from? Nat Biotechnol 2004, 22(8):1035-1036.
46. Meyer A, Wilhelm B, Dork T, Bremer M, Baumann R, Karstens JH, Machtens S: ATM missense variant P1054R predisposes to prostate cancer. Radiother Oncol 2007, 83(3):283-288.
47. Parmley JL, Hurst LD: How do synonymous mutations affect fitness? Bioessays 2007, 29(6):515-519.
48. Nakamura Y: Codon usage table (Homo sapiens). 2009.

Chapter 5.
Transcriptome sequencing of treatment-naïve lung cancers from individuals likely to benefit from erlotinib treatment⁴

Candidate gene sequencing has been a primary strategy for uncovering sequence variants in genes associated with a distinct phenotype, as presented in Chapters 2 and 4. However, recent technological advances have increased DNA sequencing capacity dramatically, and whole genome sequencing is on the verge of becoming routine. Sequencing cancer genomes (i.e. DNA) is still relatively expensive because of the sheer size of the genome and the high depth of coverage (>20X, [56]) currently required to detect all sequence variants. Sequencing transcriptomes (i.e. RNA), on the other hand, concentrates sequencing capacity on the 1-2% of the genome that encodes protein-coding genes, the source of nearly all cancer drivers identified to date [56]. Sequencing RNA not only uncovers sequence variants but also provides quantitative measurements of expression levels for every gene [26, 53]. Building on the results of Chapter 2, this work seeks to go beyond assessment of EGFR status to identify additional cancer drivers found in pre-treatment lung cancer. We therefore sequenced the transcriptomes of 30 lung tumour biopsies to identify candidate driver mutations and fusion transcripts and to elucidate patterns of gene and viral expression. As very small tumour biopsies were collected as a condition of enrolment in a clinical trial, many of the techniques developed during the previous three studies were applied to generate sequence data from a myriad of complex clinical samples.

⁴ A version of this chapter will be submitted for publication. Pugh TJ, Laskin JJ, Bosdet I, Asano J, Barclay L, Chan S, Griffith OL, Morin RD, Morrissey S, Sutcliffe M, Yang C, Ho C, Lee C, Ionescu D, Melosky B, Murray NR, Sun S, Marra MA. Transcriptome profiling of treatment-naïve lung cancers from individuals enrolled in a trial of first-line erlotinib.

The work presented here represents the expression profiling and variant discovery phase of the project. Recent genome and transcriptome sequencing efforts using second generation technologies have uncovered a wealth of human variation not currently catalogued in public databases. However, lists of putative mutations identified by these methods are currently rife with technical artifacts. Therefore, the second phase of this project is the validation of all putative mutations using an orthogonal method based on targeted DNA sequencing. This chapter documents the validation of a small subset of putative somatic variants using traditional PCR and sequencing methods. However, as over 9,000 putative variants were identified from our set of 30 lung tumour transcriptome sequences, we are pursuing an alternate validation assay. This chapter provides a rationale for the design of this assay, a plan to apply it to multiple samples, and proposed analyses once the data from this second phase are in hand.

5.1. Introduction

Lung cancer continues to be the leading cause of cancer-related death worldwide. While the majority of cases can be attributed to smoking, 25% cannot (53% of lung cancers in women, 15% in men), and these cancers account for over 300,000 deaths annually [1]. Lung cancers arise due to an accumulation of genetic defects that transform normal bronchial epithelium into neoplastic tissue. It has been estimated that 10 to 20 mutations are acquired before a tumour is evident clinically [2], although this has yet to be directly measured. Adenocarcinomas account for 55% of lung cancers (53% in smokers, 62% in non-smokers) [1], and specific mutations in these tumours are associated with an individual’s smoking history.
For example, tumours from never-smokers are more likely to have one of two activating mutations of the epidermal growth factor receptor (EGFR) tyrosine kinase domain [3-5]: either an L858R point mutation in exon 21 or a 12-15 bp deletion in exon 19 involving four amino acids, L747-A750 (LREA). In contrast, tumours from smokers more commonly harbour activating mutations of the KRAS gene affecting exon 2, resulting in substitutions of amino acids G12 or G13 [6]. Treatment with the tyrosine kinase inhibitors (TKIs) erlotinib (Tarceva) and gefitinib (Iressa) has been particularly effective in non-smokers, especially in patients with EGFR mutations [3-5], while tumours with KRAS mutations tend to be resistant [6]. However, EGFR mutations, while a positive prognostic indicator [7], are neither necessary nor sufficient for response to these drugs [8-10], and patients who initially respond invariably become resistant to treatment. Resistance is explained in some cases by the occurrence of a T790M resistance mutation in the EGFR kinase domain [11, 12]. This mutation increases the affinity of EGFR for kinase-activating ATP, thereby decreasing the effectiveness of kinase inhibitors [12]. Increases in EGFR or HER2 gene copy number, presented as an alternative to EGFR mutation screening, have also been associated with TKI response [13, 14]. The diagnostic value of EGFR status is still under debate [15, 16], and assessment of additional genes is likely needed to adequately predict sensitivity to targeted therapies [8].

Comprehensive sequencing-based surveys of hundreds of genes in even a handful of tumour samples are technically and financially challenging using traditional Sanger sequencing methods. In lung cancer, two large-scale surveys have uncovered thousands of mutations in hundreds of genes, of which 3-11% were recurrent across multiple tumours [17, 18].
Both studies concluded that lung adenocarcinomas are genetically heterogeneous, that a large number of mutations may be passenger mutations, and that a core set of non-synonymous driver mutations exists that drives the biology of these cancers. Davies et al. (2005) raised the concern that this set of driver mutations may be very large and spread across a large number of protein kinases, each requiring a different therapy. Ding et al. (2008), on the other hand, suggested that these mutations affect relatively few common pathways that could be targeted as a group if recurrent signatures were identified. Similar surveys of a diversity of cancers have reported comparable findings with similar interpretations [17, 19-21]. These studies have focused on sets of candidate genes and, while technically impressive, have examined only a fraction of the over 20,000 protein-coding genes in the human genome [49] and could not detect expression levels or abnormal gene constructs such as fusion genes. In addition, their retrospective nature has made them reliant on banked tumour samples and cell lines with varying histologies, tumour types, and clinical histories, which may account for the wide spectrum of mutations observed.

While commercialized next-generation sequencing technologies have reduced the cost of DNA sequencing by three orders of magnitude, the cost of sequencing entire cancer genomes is still high (>$100,000) [22]. However, targeted sequencing of the 1-2% of the genome that encodes expressed exons can be achieved by sequencing RNA transcripts at a fraction of the cost of the entire genome [23]. This approach, RNA-seq, is not only effective at detecting expressed sequence aberrations such as point mutations [24, 27, 28, 40, 50] and fusion genes [51, 52] but also provides quantitative gene expression and splicing information [24, 25, 26].
The wide dynamic range of gene expression results in greater sequence coverage of genes with higher expression [26]; mutation detection is therefore most effective in transcripts with moderate or high expression. In addition, mutations that result in a decrease or loss of expression, due to nonsense-mediated decay for example, may be missed [23]. Despite these caveats, RNA-seq is well suited to identifying cancer drivers, as most drivers identified to date are small mutations that alter amino acid sequence or large-scale structural changes resulting in gene amplification or fusion [23]. Expressed proteins containing somatic mutations are particularly attractive drug targets due to their tumour specificity. Recently, transcript sequencing has been used to uncover consistently occurring non-synonymous point mutations in granulosa-cell ovarian cancer [27] and follicular and diffuse large B-cell lymphomas [50]. Here we report the analysis of RNA-seq data from 30 lung adenocarcinomas collected prospectively as part of a clinical trial in a group of patients with increased likelihood of benefiting from treatment with an EGFR inhibitor.

5.2. Methods

5.2.1. Biopsy collection and processing

This study was carried out in conjunction with a phase II clinical trial of erlotinib as a first-line therapy for metastatic lung cancer at the BC Cancer Agency. Eligibility criteria for the clinical trial included stage IIIB/IV non-small cell lung cancer, no prior chemotherapy, and at least two of the following four criteria: 1) women, 2) never-smokers, 3) southeast Asian racial origin, 4) adenocarcinoma and/or bronchoalveolar carcinoma. Prior to treatment, 65 patients provided informed consent and agreed to a 24 mL blood sample and a fresh tumour biopsy. Biopsies during treatment and at disease progression were optional. Additional blood samples were taken after one month of treatment and upon disease progression.
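The eligibility rule described above (stage IIIB/IV non-small cell lung cancer, no prior chemotherapy, and at least two of the four listed criteria) can be expressed as a short check. The following is an illustrative sketch only: the `Patient` record, its field names, and the `is_eligible` function are hypothetical and are not taken from the trial protocol.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    # All field names are hypothetical, chosen for this illustration.
    stage: str                      # clinical stage, e.g. "IIIB" or "IV"
    non_small_cell: bool            # non-small cell lung cancer
    prior_chemotherapy: bool
    is_woman: bool
    never_smoker: bool
    southeast_asian_origin: bool
    adeno_or_bac_histology: bool    # adenocarcinoma and/or bronchoalveolar carcinoma


def is_eligible(p: Patient) -> bool:
    """Return True if the patient meets the criteria as stated in the text."""
    if not p.non_small_cell or p.stage not in {"IIIB", "IV"}:
        return False
    if p.prior_chemotherapy:
        return False
    # At least two of the four listed criteria must hold.
    criteria_met = sum([
        p.is_woman,
        p.never_smoker,
        p.southeast_asian_origin,
        p.adeno_or_bac_histology,
    ])
    return criteria_met >= 2


if __name__ == "__main__":
    example = Patient("IV", True, False, True, True, False, False)
    print(is_eligible(example))
```

Counting the four criteria with `sum` keeps the "at least two of four" rule in one place, so the threshold is easy to see and to change.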
Clinical response to erlotinib was assessed radiographically using the same Southwest Oncology Group modification of the RECIST criteria presented in Chapter 2 [8].

Solid tissues obtained from core-needle biopsies, bronchoscopies, or surgical resections were embedded in optimum cutting temperature compound (OCT) immediately upon removal from the patient and fresh frozen in liquid nitrogen vapour. To perform an initial pathology review and assessment of tumour content, 8 µm sections were cut from each block using a 20ºC microtome-cryostat and stained with haematoxylin and eosin (H&E). Once a block of sufficient tumour content was selected for nucleic acid extraction, sets of 30 sections were taken alternately for DNA and RNA. After each set of 30 sections, a single section was taken for pathology review to confirm consistent tumour content throughout the sample. The total number of sections taken for DNA and RNA extraction depended on the amount of tissue available. If necessary, tissues were sectioned onto membrane slides for laser microdissection as described in Chapter 2 [8]. In cases where microdissection was not necessary, sets of 30 tissue sections were transferred directly to a tube containing 400 µL of Gentra PureGene Cell Lysis Solution (QIAgen) for DNA extraction or 800 µL Trizol (Invitrogen) for RNA extraction.

Liquid samples obtained from fine-needle aspirates were processed in the operating suite immediately upon collection. A droplet of fluid was spotted on a positively charged glass slide (Fisher Scientific) and smeared for H&E staining and pathology review. The remaining fluid was combined with 5% DMSO in 200 µL phosphate buffered saline (PBS) and fresh frozen in liquid nitrogen vapour. Samples of sufficient tumour content were thawed and portions transferred to new tubes containing Gentra PureGene Cell Lysis Solution or Trizol for DNA and RNA extraction, respectively.
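The alternating sectioning scheme described above for tissue blocks (sets of 30 sections taken alternately for DNA and RNA, with a single pathology-review section after each set) can be sketched as a simple allocation routine. This is a hedged illustration: the function name `sectioning_plan` is hypothetical, and two details are assumptions not stated in the text, namely that the first set goes to DNA and that a final partial set is still used when fewer than 30 sections remain.

```python
def sectioning_plan(total_sections, set_size=30):
    """Allocate serial sections to DNA, RNA, and pathology review.

    Assumptions (not stated in the source): the first set is used for DNA,
    and a last, smaller set is still taken if the tissue runs out early.
    """
    plan = []
    targets = ("DNA", "RNA")
    remaining = total_sections
    set_index = 0
    while remaining > 0:
        # Take up to `set_size` sections for the current nucleic acid.
        n = min(set_size, remaining)
        plan.append((targets[set_index % 2], n))
        remaining -= n
        # One section for pathology review after each set, if tissue remains.
        if remaining > 0:
            plan.append(("pathology", 1))
            remaining -= 1
        set_index += 1
    return plan


if __name__ == "__main__":
    for purpose, count in sectioning_plan(62):
        print(purpose, count)
```

Returning the plan as a list of `(purpose, count)` pairs keeps the order of cutting explicit, which is the point of interleaving pathology sections between extraction sets.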
Pleural fluids obtained by thoracentesis were collected in eight 50 mL tubes (BD Falcon) containing 2.07 mL 0.5 M EDTA to prevent clotting. Volumes in excess of 400 mL were collected in 800 mL or 1.5 L vacuum bottles and stored at -80ºC without further processing. To pellet the cells, the 50 mL tubes were centrifuged at 2000 x g for 10 minutes and the supernatant removed and stored at -80ºC. Gentra Red Blood Cell lysis solution (3 mL) was added to the cell pellets and mixed briefly by vortexing for 1 second. The remaining cells were pelleted by spinning at 2000 x g for 10 minutes. The supernatant was discarded and the cell pellets were washed by resuspension in 3 mL of PBS, vortexed at low speed, and re-pelleted by spinning at 2000 x g for 10 minutes. The washed cell pellets were resuspended in 1000 µL fetal bovine serum (FBS) containing 5% DMSO. A droplet of the resuspended cell solution was smeared on a glass slide and H&E stained for pathology review. In all but one case, fewer than 5% of the cells collected by thoracentesis were tumour cells, with the vast majority being normal immune or mesothelial cells (Figure 5.1A).

To isolate tumour cells present in these samples, lung tumour cells were fluorescently labelled and isolated by flow cytometry. Cells were thawed and pelleted by spinning at 2000 x g for 10 minutes. The supernatant was discarded and the cells resuspended in 100 µL PBS and 10 µL Ber-Ep4 antibody conjugated to a fluorescein isothiocyanate (FITC) fluorophore (Dako, F0860). The mixture was incubated on ice in darkness for 30 minutes. The labelled cells were pelleted by spinning at 2000 x g for 1 minute and the supernatant was discarded. Pellets were washed twice by resuspending in 1000 µL PBS/2% BSA, spinning at 2000 x g for 1 minute, and discarding the supernatant. Washed cell pellets were then resuspended in 2000 µL FBS.
To remove large cell clumps and debris, the cells were passed through a 35 µm Cell Strainer (BD Falcon, 352235) and collected in a 5 mL tube. To verify that tumour cells were labelled correctly, 5 µL of strained cells were smeared on a positively charged slide, covered by a glass coverslip, and visualized on a fluorescence microscope (Figure 5.1B). Tumour cells were isolated on the BD Falcon FACSDiVa flow cytometer using settings specific for collecting large FITC-positive tumour cells and with increased sensitivity to detect and discard small reactive cells. Up to 1 million tumour cells, as measured by the flow cytometer, were collected in 1.5 mL tubes containing 800 µL of Gentra Cell Lysis buffer for DNA extraction or 800 µL Trizol for RNA extraction. To verify that tumour cells were being isolated with high specificity, ~25,000 cells were sorted directly onto a positively charged glass slide, dried, and H&E stained for pathology review (Figure 5.1C).

5.2.2. DNA extraction and Sanger sequencing

Tumour cells were transferred to Gentra Cell Lysis solution as described above. Volumes reported are for 400 µL of Cell Lysis solution and were scaled up for larger volumes such as those resulting from flow sorting. To digest tissues, 2 µL of 20 mg/mL Proteinase K was added, mixed by vortexing, and incubated at 55ºC for 3 hours. Once tissue fragments were no longer visible, the OCT gel (when present) was collected at the bottom of each tube by spinning at 2000 x g, and the supernatant containing cell lysates was transferred to a new tube containing 80 µL Gentra Protein Precipitation solution (QIAgen). The tubes were vortexed for 20 seconds to precipitate proteins and spun for 3 minutes at 20,000 x g. The supernatant was transferred to a new tube containing 400 µL isopropanol and 1 µL of 20 mg/mL glycogen and mixed by gently inverting 50 times. Precipitated DNA was pelleted by spinning the tubes for 3 minutes at 20,000 x g.
Pellets were washed with 400 µL 70% ethanol and spun at 20,000 x g for 1 minute. The wash solution was discarded, the tube drained at an angle onto a clean Kimwipe, and the pellet air dried for no more than 10 minutes. Pellets were resuspended in 10 µL TE (10:0.1) and incubated at 65ºC for 1 hour or at 4ºC overnight to facilitate resuspension. DNA was quantified using a NanoDrop spectrophotometer and by PicoGreen fluorometry (Qubit, Invitrogen). For samples yielding less than 500 ng of genomic DNA, 10 ng was amplified with the RepliG Mini whole genome amplification kit (QIAgen) using the method documented in Chapter 3. Blood samples (24 mL) were taken before treatment, after 1 month (cycle) of treatment, and upon disease progression. DNA was extracted from each blood sample using the Gentra Puregene Blood kit (QIAgen) used in Chapter 4, and the plasma was stored at -80ºC. Sanger sequencing reactions were performed as documented in Chapter 2 [8].

5.2.3. RNA extraction, amplification and Illumina sequencing

Tumour cells were transferred to Trizol as described above. Volumes reported are for 800 µL of Trizol solution and were scaled up for larger volumes such as those resulting from flow sorting. The liquid cell lysate was transferred to a pre-spun 2 mL PhaseLoc gel tube (Eppendorf) and incubated at room temperature for 5 minutes. Chloroform (160 µL; Sigma) was added and the tubes shaken to mix. Tubes were spun at 12,000 x g for 10 minutes at 4ºC and the top aqueous phase transferred to a fresh 1.5 mL tube containing 400 µL isopropanol (IPA) and 1 µL of 20 mg/mL glycogen. Samples were mixed by repeated inversion and incubated at room temperature for 10 minutes. RNA was pelleted by spinning tubes at 12,000 x g for 10 minutes. The supernatant was discarded and the pellets washed with 800 µL 75% ethanol followed by a spin at 8,000 x g for 5 minutes. The supernatant was discarded and the RNA pellet air dried for no more than 10 minutes.
The pellet was then resuspended in 10 µL of RNase-free water and incubated at 60ºC for 30 minutes to facilitate dissolution. RNA quantity and quality were assessed using an Agilent Bioanalyzer Nano chip. 200 ng of RNA (based on the Agilent quantitation) was diluted to 7 µL with DEPC water, and 1 µL of 20 U/µL RNase inhibitor (Applied Biosystems) was added. To digest contaminating genomic DNA, a mixture of 1 µL of 10X DNaseI buffer, 0.8 µL of RNase-free water, and 0.2 µL of DNaseI enzyme (Ambion) was added and incubated at room temperature for 15 minutes. The reaction was stopped by adding 1 µL of 25 mM EDTA. To purify the RNA, the reaction was transferred to a pre-spun 2 mL PhaseLoc tube containing 189 µL RNase-free water. 200 µL of a 25:24:1 phenol-chloroform-isoamyl alcohol mixture (Sigma) was added and shaken to mix. The tube was spun at 15,000 x g for 5 minutes and the ~200 µL aqueous layer transferred to a new 1.5 mL microcentrifuge tube. RNA was precipitated by adding 30 µL 3 M sodium acetate, 1 µL 20 mg/mL glycogen, and 600 µL 100% ethanol and vortexing to mix. RNA was pelleted by spinning at 15,000 x g for 5 minutes and the supernatant removed. The pellet was washed with 1 mL of 70% ethanol and spun at 15,000 x g for 1 minute. The supernatant was removed and the pellet air dried for no more than 10 minutes. The pellet was suspended in 6 µL RNase-free water and mixed by pipetting. RNA quantity and quality were assessed using an Agilent Bioanalyzer Nano chip. 50 ng of DNaseI-treated RNA was used for amplification by in vitro transcription using the MessageAmpII kit (Ambion) following the manufacturer’s instructions. 500 ng of amplified product was used for paired-end (PE) sequencing library construction by the BC Cancer Agency Genome Sciences Centre Sequencing Group using methods similar to a published protocol [24].
Double-stranded cDNA was synthesized from 500 ng amplified RNA using Superscript Double-Stranded cDNA Synthesis kit (Invitrogen) and random hexamer primers (Invitrogen) at a concentration of 5 µM. The cDNA was sonicated and the sample was run on an 8% polyacrylamide gel. A gel slice corresponding to DNA fragments of 180-220 bp was excised, and the DNA eluted overnight at 4°C in 300 µL of elution buffer (5:1, LoTE buffer (3 mM Tris-HCl, pH 7.5, 0.2 mM EDTA)-7.5 M ammonium acetate). DNA was purified using a Spin-X Filter Tube (Fisher Scientific) followed by ethanol precipitation. The ends of the DNA fragments were repaired and phosphorylated by treatment with T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase in a single reaction. 3’ A-overhangs were added by Klenow fragment (3’ to 5’ exo minus) to facilitate subsequent ligation of Illumina PE adapters which contain 5’ T overhangs. The adapter-ligated products were purified on QIAquick spin columns (QIAgen) and amplified by 10-15 cycles of PCR using Phusion DNA polymerase and Illumina’s PE primer set (Illumina). PCR products were purified on QIAquick MinElute columns (QIAgen), and the DNA quality assessed and quantified using an Agilent Bioanalyzer DNA 1000 series II assay and Nanodrop 7500 spectrophotometer (Nanodrop). DNA was diluted to 10 nM and sequencing clusters were generated on the Illumina cluster station. DNA sequencing was performed using Illumina Genome Analyzers following the manufacturer’s instructions.

5.2.4. RNA-seq data analysis
36-50 bp paired-end reads were generated using four lanes of an Illumina Genome Analyzer flowcell and aligned using Maq [28] to a human genome reference supplemented with known exon junction sequences for genes listed in release 52 of the Ensembl Homo sapiens core database (http://www.ensembl.org). Paired-read information was used to infer the presence of fusion transcripts when the two reads of a read pair aligned to different annotated transcripts.
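The paired-read rule for inferring fusion transcripts can be illustrated with a minimal sketch. It assumes each read has already been assigned a gene label by an upstream alignment and annotation step. The function name `candidate_fusions`, the tuple input format, and the `min_pairs` support threshold are hypothetical and are not taken from the thesis pipeline.

```python
from collections import Counter


def candidate_fusions(read_pairs, min_pairs=2):
    """Count read pairs whose two mates map to different genes.

    read_pairs: iterable of (gene_of_read1, gene_of_read2); None marks a
    mate that did not align to an annotated gene.
    min_pairs: hypothetical support threshold (an assumption, not from the source).
    """
    support = Counter()
    for gene1, gene2 in read_pairs:
        # Skip pairs with an unannotated mate or both mates in the same gene.
        if gene1 is None or gene2 is None or gene1 == gene2:
            continue
        # Sort the pair so A-B and B-A are counted as the same candidate.
        support[tuple(sorted((gene1, gene2)))] += 1
    return {pair: n for pair, n in support.items() if n >= min_pairs}


example_pairs = [
    ("EML4", "ALK"),
    ("ALK", "EML4"),
    ("EGFR", "EGFR"),
    ("KRAS", None),
    ("GENE_A", "GENE_B"),
]
fusions = candidate_fusions(example_pairs)
```

Requiring more than one supporting pair is one simple way to suppress chance mismappings; a real analysis would also need to account for paralogous and overlapping genes.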
Unmapped reads were aligned to a set of complete virus genomes (Table 5-2). Single nucleotide variants (SNVs) were identified using SNVMix, a variant detection program designed to detect SNVs from next-generation sequencing data from tumours using three implementations of a binomial mixture model [58]. SNVs were compared to lists of known variants using custom Perl and shell scripts. Gene expression levels were quantified in reads per kilobase of exon model per million mapped reads (RPKM) [26]. Unsupervised and supervised clustering analysis of gene expression levels was performed using the TM4 analysis suite [29].

5.3. Results
5.3.1. Patient data and source tumour material
DNA was extracted from 53 matched tumour/blood pairs, whole genome amplified if necessary, and used for Sanger sequencing of all 28 exons of EGFR isoform a (ENSG00000146648) and exon 2 of KRAS (ENSG00000133703). Somatic EGFR mutations absent in corresponding DNA from blood were seen in 19 of 49 pre-treatment tumour samples (39%), a 1.5-fold enrichment over the ~26% of lung cancers with EGFR mutations listed in the Catalogue of Somatic Mutations in Cancer (COSMIC, http://www.sanger.ac.uk/genetics/CGP/cosmic/). These mutations consisted of nine exon 19 LREA deletions and ten exon 21 L858R point mutations. 23 of 48 patients with response data had a partial response (PR) to erlotinib, and 17 of these had either an EGFR LREA deletion or L858R point mutation, compared to 2 of the 25 non-responders (p=0.000003). KRAS exon 2 point mutations were present in three pre-treatment tumour samples (6%), 2 from non-responders (both G12V) and 1 without response data (G12C) due to death prior to completion of 1 cycle of erlotinib. Of the four post-treatment samples sequenced, two contained EGFR mutations observed prior to treatment (1 L858R, 1 LREA deletion) and both of these acquired additional EGFR exon 20 T790M point mutations associated with TKI-resistance.
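The association between EGFR mutation and partial response (17 of 23 responders versus 2 of 25 non-responders) can be checked with a 2x2 contingency test. The text does not name the test used, so the sketch below assumes a two-sided Fisher exact test, implemented with the standard library only. The function name is illustrative.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def weight(x):
        # Unnormalized hypergeometric weight for x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x)

    lowest = max(0, col1 - row2)
    highest = min(col1, row1)
    observed = weight(a)
    extreme = sum(
        weight(x) for x in range(lowest, highest + 1) if weight(x) <= observed
    )
    return extreme / total_tables


# Rows: responders, non-responders. Columns: EGFR mutant, EGFR wild-type.
p_value = fisher_exact_two_sided(17, 6, 2, 23)
```

Under this assumption the p-value is on the order of 10^-6, the same order of magnitude as the value reported above.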
While the current model of EGFR and KRAS mutations mediating TKI response and resistance is supported by our data, this presumption is insufficient to fully explain the treatment outcomes observed in our population because 25% of partial responders lacked EGFR mutations, 8% of patients with progressive disease had EGFR mutations, and no canonical EGFR mutations were observed in patients with stable disease. 1 of 3 post-treatment biopsies taken upon disease progression lacked an EGFR resistance mutation suggesting that erlotinib resistance may be explained by variants in other genes. To uncover additional genetic features of these lung cancers, we performed transcriptome sequencing of 30 tumour samples from 28 patients (27 pre-treatment, 3 post-treatment): 21 females; 20 of Asian and 8 of Caucasian descent; 22 non-smokers; and 24 adeno- or bronchoalveolar carcinomas, 5 unclassified non-small-cell carcinomas, and 1 initially identified squamous cell carcinoma that was subsequently reclassified as a lymphoepithelioma-like carcinoma. The source materials consisted of 11 core biopsies (2 laser microdissected), 7 bronchoscopies, 7 thoracenteses (6 flow sorted), 3 fine needle aspirates, and 2 surgical resections. These tumours were determined by a pathologist to contain at least 40% tumour cells and the median tumour content for the set was 80%. The amount of total RNA available from 29 of 30 biopsies (Table 5-1) was below the 5-10 µg needed prior to polyA selection for our standard transcriptome sequencing protocol. Therefore, we amplified 50 ng of total RNA using in vitro transcription (MessageAmp II, Ambion) to generate 3-28 µg of polyA-selected aRNA. 500 ng of amplified product was processed using our standard RNA-seq library construction protocol, omitting the polyA selection step.

5.3.2. Summary of sequencing data and variant discovery in lung cancer biopsies
2.1 billion 36-50 bp paired-end sequencing reads were generated from 30 lung cancer biopsy samples (27 pre-treatment, 3 post-treatment) (Figure 5.2). 50.6 Gbp of sequence was successfully aligned to a human genome and exon junction reference (Methods), with 56% of mapped reads aligned to exons, 31% aligned to introns, and 13% aligned to intergenic regions (Figure 5.3). On average, 41±9 million reads representing 1.7±0.4 Gbp of sequence were mapped per sample. Of the 69.1 Mbp annotated exonic sequences, an average 29±4 Mbp were covered by at least 1 read and 13±2 Mbp were covered by at least 6 reads.

5.3.3. Addressing end-bias induced by amplification
Coverage of the transcriptome is determined primarily by the expression level of each transcript, as more abundant transcripts generate a proportionally greater number of reads. In the 29 libraries that used RNA amplified by in vitro transcription (IVT), we observed significant 3’ bias in the distribution of reads across each transcript compared to 22 libraries constructed from unamplified RNA (Figure 5.4). In the lung cancer libraries constructed from IVT-amplified RNA, 73% of the sequence coverage falls within the last 50% of each transcript’s annotated coordinates compared to 54% of the average coverage from 22 libraries constructed from unamplified samples. This skewed coverage biases our variant detection approach as this region contained 71% of the putative SNVs (pSNVs) uncovered in the lung cancer samples compared to 64% of the pSNVs detected in unamplified samples using the same method (Figure 5.4). While transcripts from standard libraries constructed from sonicated cDNA typically have increased coverage of their 3’ ends [53], the additional bias observed in IVT-amplified libraries is likely due to the polyA priming method employed for amplification, in which the length of amplified fragments is a function of enzyme processivity.
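The 3’ bias statistic quoted above (the share of sequence coverage falling in the last 50% of a transcript’s coordinates) can be expressed as a short calculation. This is a minimal sketch that assumes per-base read depth is already available for a transcript, listed in 5’ to 3’ order. The function name and input format are illustrative and do not describe the scripts used in this work.

```python
def three_prime_fraction(depth):
    """Fraction of total coverage in the 3' half of a transcript.

    depth: list of per-base read depths ordered 5' to 3'.
    Returns None when the transcript has no coverage at all.
    """
    total = sum(depth)
    if total == 0:
        return None
    midpoint = len(depth) // 2
    return sum(depth[midpoint:]) / total


# Toy transcript with three times more coverage in its 3' half.
toy_depth = [1, 1, 3, 3]
toy_fraction = three_prime_fraction(toy_depth)
```

A value near 0.5 indicates even coverage, while values approaching 1 indicate coverage concentrated at the 3’ end. How per-transcript values are combined into a library-level figure (for example, weighting by coverage) is left open here.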
We attempted to address this difference in coverage by modifying the primer mix used for RNA amplification. The standard first strand synthesis reaction prior to IVT utilizes an oligo(dT) oligonucleotide to prime synthesis from the ends of polyadenylated transcripts. To facilitate priming within the body of transcripts, we performed IVT reactions using a mixture of standard oligo(dT) primers and non-degenerate primers corresponding to known codon sequences (Full Spectrum MultiStart, System Biosciences, Mountain View, CA). We anticipated that initiation of first strand synthesis at multiple points along each transcript would improve the representation of 5’ positions distant from the polyA tail. However, this modification did not result in an even distribution of reads across each transcript and instead resulted in a profile nearly identical to that from the standard polyA-primed IVT libraries with again 73% of coverage falling in the last 50% of transcript coordinates (Figure 5.5). This may be due to preferential binding and enzyme recognition of the oligo(dT) primer, non-ideal hybridization conditions for the non-degenerate oligonucleotides, or cross-hybridization of codon-specific sequences reducing their potential for hybridization to the RNA template.

5.3.4. Viral transcripts
Infection with exogenous viruses has been linked to 15-25% of all cancers including some forms of lung cancer [1]. Detection of viral involvement in lung cancer may have significant impact on the clinical management of this disease. For example, vaccines against strains of human papilloma virus that cause cervical cancer have been shown to dramatically reduce the number of precancerous lesions observed in treated populations [54]. In an attempt to uncover viral transcripts in these lung cancers, reads that did not align to our transcriptome reference were aligned to 2.9 Mbp of reference sequences from 92 complete viral genomes (Table 5-2).
The median number of reads corresponding to a viral genome was 1, suggesting very little viral transcription in these lung cancers. However, the pre-treatment library from patient 9 had 1,698 reads aligned to genes from two types of Epstein-Barr virus (EBV): Human herpesvirus 4 type 1 (761 reads, NC_007605.1) and Human herpesvirus 4 type 2 (937 reads, NC_009334.1) (Figure 5.6A). This observation suggested the presence of a virus in this patient sample. To confirm this observation, we treated a section of biopsy material from this patient with a fluorescently-labelled oligonucleotide probe (INFORM EBER probe, Ventana). Confinement of EBV infection to tumour cells was confirmed by in situ hybridization (Figure 5.6B). This tumour was initially identified as the only squamous cell carcinoma in the set; however, this classification was revisited upon discovery of EBV involvement. Prompted by our observation, a second pathology review determined this tumour to be a lymphoepithelioma-like carcinoma, a rare lung cancer subtype associated with EBV infection particularly in Asians and non-smokers. No EGFR mutations were observed in this cancer, and the patient exhibited stable disease when treated with erlotinib. Expression was localized to three discrete regions comprising 21% of the ~172 kbp EBV genome (Figure 5.6A) containing some of the key genes related to viral infection and tumourigenesis [30]. The region of most substantial coverage (265X maximum coverage, 155 kb-161 kb with diminishing expression from 137 kb-152 kb) encompassed the BamHI-A region. Abundantly expressed RNAs from this region such as A73 and RPMS1 are often complexly spliced [55] and were originally identified in nasopharyngeal carcinoma (NPC) [30]. EBV BamHI-A transcripts have since been found to be expressed in all EBV-associated malignancies [55] as well as in peripheral blood of healthy individuals [30, 55]. The protein products for these RNAs and their functions are unknown [30, 55].
The second largest peak (in the 166 kb to 169 kb region, with additional expression from 0 to 6 kb) corresponds to LMP1/LMP2A. LMP1 is the main transforming protein of EBV, functioning as a classic oncogene in rat studies, that induces cell-surface adhesion and up-regulates anti-apoptosis genes such as BCL2. A smaller but distinct peak (96 kb-99 kb) corresponds to EBNA-1. This is a key gene responsible for maintenance of the viral genome. It is expressed in all virus-infected cells, where it maintains the episomal EBV genome by sequence-specific binding to the viral OriP, and also acts as a transcription factor for itself and other key proteins such as LMP1. The expression pattern of these genes is very similar to that seen in NPC, in which EBV expression is restricted to EBNA1, LMP2A and BamHI-A transcripts, with ~20% of tumours also expressing LMP1 [30].

5.3.5. Expression profiling
Counting the number of reads corresponding to an mRNA transcript has been shown to be a quantitative measure of gene expression over five orders of magnitude [26], and there is high correlation between expression values derived from microarrays and those derived from transcriptome sequencing [24, 25, 26]. Therefore, we sought to uncover clinical and molecular subgroups using patterns of gene expression derived from our set of lung tumour transcriptome sequences. To normalize for both transcript length and total number of reads in each library, we quantified transcript expression levels in reads per kilobase of exon model per million mapped reads (RPKM), as pioneered by Mortazavi et al. [26]. Unsupervised clustering of RPKM values from all 30 tumours could not identify distinct gene expression patterns within the set, nor could we differentiate pre- and post-treatment expression profiles.
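The RPKM normalization described above follows directly from its definition: read count divided by exon model length in kilobases and by library size in millions of mapped reads. The sketch below is a generic illustration of that formula. The function name and example numbers are illustrative and are not values from this data set.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon model per million mapped reads.

    read_count: reads assigned to the gene's exon model.
    exon_length_bp: summed exon length of the gene model, in base pairs.
    total_mapped_reads: number of mapped reads in the library.
    """
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("exon length and library size must be positive")
    # Dividing by (length / 1e3) and (reads / 1e6) is the same as scaling by 1e9.
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


# Illustrative only: 410 reads on a 2 kb gene model in a 41-million-read library.
example_rpkm = rpkm(410, 2000, 41_000_000)
```

Because both divisors are library- and gene-specific, RPKM values can be compared between genes of different lengths and between libraries of different depth, which is what makes the clustering comparisons across tumours possible.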
Supervised clustering, however, was able to identify a number of gene sets that could differentiate among known subgroups within the pre-treatment samples using the following classifiers: EGFR/KRAS mutational status, smoking history and response, and histology (Figure 5.7). Expression signatures unique to sex, ethnicity, and response (PR vs. SD/PD and PR/SD vs. PD) were not apparent (p>0.05). Expression levels of eight genes were able to differentiate EGFR mutants, KRAS mutants, and those with no mutations in either gene (Bonferroni adjusted p≤0.00124) (Figure 5.7A). Low-density lipoprotein receptor LDLR (ENSG00000130164) was expressed at a high level in KRAS mutants, while an unnamed gene similar to heterogeneous nuclear ribonucleoprotein A1 (ENSG00000213847) was expressed at higher levels in EGFR mutants. KRAS mutants had consistently lower expression of six genes: Pseudogene AC009945.4 (ENSG00000216737), 5S ribosomal RNA (ENSG00000200873), rRNA pseudogene AL365364.19-2 (ENSG00000210729), small nucleolar RNA SNORA75 (ENSG00000212620), beta-defensin 118 precursor DEFB118 (ENSG00000131068) and pseudogene RP11-257K9.3 (ENSG00000220467). Two smokers with partial response were differentiated from three smokers with progressive disease on the basis of five genes (Bonferroni adjusted p≤0.0307) (Figure 5.7B): CCL19, PTGDS, AP003780.3, Z98749.11, and AC008660.5 (ENSG00000172724, ENSG00000107317, ENSG00000210016, ENSG00000100181, and ENSG00000203776).
12 additional genes were expressed at low levels in smokers who responded compared to other tumours in the set (p≤9.19x10^-9): Small proline-rich protein 2E SPRR2E (ENSG00000203785), Myosin-6 MYH6 (ENSG00000197616), Y RNA (ENSG00000199678), small nucleolar RNA SNORD113 (ENSG00000200367), U6 spliceosomal RNA (ENSG00000202089), pseudogene AC006479.2 (ENSG00000177590), retrotransposed gene Z70227.1 (ENSG00000185095), Developmental pluripotency-associated protein DPPA2 (ENSG00000163530), uncharacterized protein C14orf177 (ENSG00000176605), pseudogene RP11-392A19.1 (ENSG00000220026), SNURF-like protein CXorf19 (ENSG00000173954), and pseudogene RP11-345I18.6 (ENSG00000220129). Expression of CD70 (ENSG00000125726) was found to be increased in the EBV-associated lymphoepithelioma-like carcinoma (Figure 5.7C). Compared to the 26 pre-treatment NSCLCs/adenocarcinomas, this gene was expressed at a very high level in this tumour (16.2 RPKM vs. 0.2 median and max 4.6 RPKM from other tumours, p=5.97x10^-14) and may be a potential marker of EBV-associated lung cancer. To test this hypothesis, we plan to assess CD70 expression and the presence of EBV using immunohistochemistry and in situ hybridization respectively in a set of over 600 lung tumours assembled into a tissue microarray. This tissue microarray contains NSCLCs, squamous cell carcinomas, and large cell carcinomas, some of which may be misclassified lymphoepithelioma-like lung cancers. We anticipate a number of these cancers to contain EBV and that a large fraction of these will be strongly positive for CD70 expression. A positive correlation of CD70 with EBV expression may enable refinement of lung cancer diagnoses by helping differentiate otherwise morphologically similar cancers through screening of a simple cell surface marker.

5.3.6. Fusion transcripts
Fusion transcripts have been particularly effective therapeutic targets in the treatment of hematopoietic malignancies [43, 44].
Assessment of gene fusions in solid tumours is salient in light of the recently described EML4-ALK fusion in lung cancer [31], as other fusions may exist that could be targeted to treat this disease. To uncover known and novel fusions present in our set of lung tumours, we used paired-read information to uncover evidence of 142 putative fusion transcripts. None of the fused genes correspond to EML4 or ALK, the partners in a fusion transcript recently described in lung cancer [31]. Of the 200 genes implicated in these fusions, 3 are listed in the Mitelman Database of Chromosome Aberrations in Cancer (http://cgap.nci.nih.gov/Chromosomes/Mitelman), and 17 are listed in COSMIC, suggesting that most observed fusions in our patient tumour population are rare, novel events or possibly technical artifacts.

5.3.7. Mutation detection
From data derived from all 30 biopsies, we detected 432,043 putative single nucleotide variants (pSNVs), of which 344,956 (80%) correspond to 53,744 unique polymorphisms listed in dbSNP or have been detected in the genomes of J. Craig Venter [32], James Watson [33], an anonymous Yoruban male [34], an anonymous Asian male [35], or individuals being sequenced as part of the 1000 Genomes project (http://www.1000genomes.org). Genotypes were concordant for 95.1% of known SNPs called in both libraries of each pre/post-treatment pair (Patient 19: 6,413 of 6,868; Patient 28: 7,005 of 7,236). The 87,087 novel pSNVs correspond to 53,517 unique genomic positions, of which 1,839 are likely artifacts resulting from mismapping near exon-exon junctions. 13,186 pSNVs impact a codon sequence in 7,090 genes, and 9,149 (69%) of these are predicted to be non-synonymous. 6,181 of the non-synonymous variants are predicted to induce a radical amino acid change, either by reversing residue polarity or dramatically altering residue size. 385 may alter protein length: 362 introduce a stop codon and 23 remove a stop codon.
In addition, 9,038 insertions and 3,816 deletions were apparent which correspond to 4,978 unique genome positions. 2,276 of these correspond to known polymorphisms, and, of the remaining 2,702 novel indels, 404 impact protein coding sequences. Absent from this list were the EGFR LREA deletions detected by Sanger sequencing. This may be due to the difficulty in mapping relatively short sequences on either side of the 12-15 bp deletion to noncontiguous segments of the human genome reference sequence. However, a simple text-matching search for reads containing any of the possible deletion breakpoints did not uncover any reads that support a deletion. A second possibility is that the inability to detect these deletions may be due to a lack of sequence coverage of this region due to low expression compounded by coverage biased towards the 3’ end of the transcript, particularly as exon 19 is 3 kb from the annotated end of the EGFR transcript. This hypothesis is supported by the low sequence coverage of exon 19 in the five patients with EGFR LREA deletions (1.9X average coverage, 3.6X maximum, 0X minimum). To look for possible low-frequency EGFR alleles present prior to treatment and not identified by our automated SNV caller, we performed a manual inspection of read alignments at the positions of known EGFR point mutations, T790M and L858R. In a single non-responder, Patient 57, both of these mutations were supported by a small proportion of reads (2 of 10 support L858R, 2 of 9 support T790M). Re-examination of the Sanger data found a similar ratio of allele frequencies for both of these mutations; however, neither was identified using automated methods. The biopsy from this patient contained >75% tumour cells, suggesting that the resistant allele was present at low frequency prior to treatment. This may explain the observed resistance to erlotinib despite the presence of a TKI-susceptible L858R mutation.
Neither of the post-treatment T790M point mutations detected by Sanger sequencing was detected by our automated SNV caller, likely due to a lack of sequence coverage at this position in both samples. The two reads from Patient 19 contained a mutant allele, while the three reads from Patient 28 provided no evidence of mutation. The corresponding pre-treatment libraries had greater coverage (8X for Patient 19, 15X for Patient 28) and none of these reads supported a T790M mutation existing pre-treatment at a detectable frequency. This difference in read coverage is readily explained by a lower expression of EGFR in the post-treatment samples relative to the corresponding pre-treatment samples (pre>post RPKM: Patient 19 7.9>5.9, Patient 28 11.1>5.4). Generation of additional sequence data for the post-treatment samples would improve mutation detection in genes with lower expression levels.

5.3.8. Validation of novel coding pSNVs
Recent experience validating pSNVs detected in cancer genomes using short-read sequences suggests that 96-97% of candidate somatic pSNVs may be either germline events (8-40%) or sequencing artifacts (58-84%) rather than somatic mutations acquired by the cancer [39, 40]. To assess whether pSNVs observed in the pre-treatment tumours were real and somatic, we sought to verify these variants in DNA from matched tumour and blood using orthogonal methods. As an initial screen, we attempted to validate mutations that hit the same codon at multiple positions in multiple tumours using PCR and Sanger sequencing. We hypothesized that tumours select for substitution of these amino acids and that a number of DNA sequence mutations within a codon would be tolerated. There were 11 genes with such mutations contributing 22 unique positions for validation (ANKRD12, ARIH1, CHD7, HLA-DPA1, MAN2A1, MT-ND1, NAE1, NUCB2, REST, SEC62, and SPARC). 2 additional genes had single positions with different nucleic acid substitutions (e.g.
C>CT and C>CA) which were also included in the validation set (MANBA, COX2). Finally, positions within three genes were targeted for a variety of reasons: TRIP6 (chr7:100,308,290 T>G) contained the most frequently observed non-synonymous putative mutation (10 of 27 tumours); C4orf15 (chr4:2,212,348 A>G and chr4:2,212,389 T>C) somatic mutations were recently validated by our group in a lobular breast cancer [40], and MET (6 positions) contained 6 candidate mutations within 18bp (6/7 amino acids) in a single tumour, suggesting a high frequency of localized mutation. In total, 16 target intervals were identified, amplified by PCR, and Sanger sequenced from the tumour samples in which the mutations were detected by RNA-seq (Table 5-3). Matched blood samples were also sequenced as normal controls. Of the 32 candidate mutations targeted, 27 were absent in both the tumour and blood DNA (81%, likely artifacts), while 5 were validated as true germline variants (15%, present in both blood and tumour DNA). A single somatic mutation in MT-ND1, NADH dehydrogenase, (chrM:4,161) was seen once in a single tumour. This mutation did not correspond to any of the positions targeted for validation; rather it was captured by a PCR amplicon designed to validate mutations at chrM:4,262 and chrM:4,263. This somatic mutation was also detected by RNA-seq, confirming that our master list of pSNVs contains real somatic variants, albeit not at positions specifically targeted by this pilot set. The validation rate observed here is similar to that from previous studies: 81% false positives, 15% germline variants, and 3% true somatic mutations. Recent developments in genome technology have made it possible to capture thousands of selected portions of the genome in a single assay for targeted resequencing [41-45]. 
Due to the low quantities of DNA required and relatively few targets for validation, we settled on a solution hybrid capture approach (Agilent SureSelect) which utilizes custom RNA probes (“baits”) to isolate corresponding genomic DNA fragments from 500 ng of a standard Illumina genome shotgun sequencing library [42]. All 9,149 novel, coding pSNVs identified from the 30 lung tumour libraries were targeted for bait design, 8,281 from pre-treatment samples and 868 from post-treatment samples (Figure 5.8). 744 of the pre-treatment pSNVs are located in regions with sequence similar to at least four other locations in the genome (i.e. >3 BLAT [57] alignments for the mutation position +/- 50 bp). Despite the increased likelihood of this subset containing mapping artifacts, we anticipate using validation data from these sites to further refine our SNV detection methods, particularly as this set of pSNVs passed our standard filters for detection of high quality variants. Two overlapping 120 bp candidate baits were designed for each pSNV, one with the variant at position 30 +/- 1 bp and one with the variant at position 90 +/- 1 bp (Figure 5.9A). Identical candidate baits resulting from neighbouring or adjacent pSNVs were discarded. 16,960 of the 18,272 submitted baits (93%) passed Agilent’s bait design criteria and were included in the assay. Baits targeting pSNVs with >3 BLAT hits were less likely to pass Agilent’s bait design criteria: 79% of these baits passed (1,180 of 1,486) compared to 94% of those with ≤3 BLAT hits (14,178 of 15,062 baits targeting pre-treatment pSNVs, 1,602 of 1,722 baits targeting post-treatment pSNVs). As baits targeting pSNVs utilized only ~30% of the probes available in the assay, additional baits were designed to uncover additional SNVs by capturing exons of recurrently mutated genes. For this purpose, we submitted coordinates for 7,634 exons from 272 genes for “optimized” bait tiling by Agilent (Figure 5.9B).
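The two-bait layout for each pSNV (a 120 bp bait with the variant at position 30 and a second with the variant at position 90) can be sketched as a coordinate calculation. This is a simplified illustration using 1-based inclusive coordinates. It ignores the +/- 1 bp tolerance, and the function names and tuple format are hypothetical rather than the design scripts actually used.

```python
def baits_for_variant(chrom, pos, bait_len=120, variant_offsets=(30, 90)):
    """Return (chrom, start, end) for each bait covering a variant.

    pos is the 1-based variant position; each offset is the 1-based
    position of the variant within its bait.
    """
    baits = []
    for offset in variant_offsets:
        start = pos - (offset - 1)
        end = start + bait_len - 1
        baits.append((chrom, start, end))
    return baits


def design_baits(variants):
    """Collect baits for many variants, discarding identical candidates."""
    seen = set()
    ordered = []
    for chrom, pos in variants:
        for bait in baits_for_variant(chrom, pos):
            if bait not in seen:
                seen.add(bait)
                ordered.append(bait)
    return ordered


# Hypothetical variant position chosen only for illustration.
example_baits = baits_for_variant("chr7", 1000)
```

With these offsets the two baits overlap by 60 bp centred on the variant, so each pSNV is covered by two independent probes.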
212 genes (5,985 exons) contained pSNVs seen in at least four pre-treatment tumours (Table 5-4) and 18 genes (266 exons) contained at least two pSNVs in any post-treatment tumours (Table 5-5). We also targeted 42 genes (1,383 exons) listed in COSMIC with pSNVs in three pre-treatment tumours (Table 5-4). In total, 30,598 baits were designed: 24,675 targeting genes mutated in at least 4 pre-treatment tumours, 1,010 targeting genes mutated in at least 2 post-treatment tumours, and 4,913 targeting COSMIC genes mutated in 3 pre-treatment tumours. Baits were successfully designed for 98% of the target exons (7,456 of 7,634). The final capture assay contains 47,558 baits (16,960 baits targeting pSNVs, 30,598 baits tiled across exons) targeting ~3.5 Mbp of unique sequence. The nature of the assay requires that all baits are applied to each sample and we therefore anticipate validating each pSNV and sequencing the exons of each recurrently mutated gene in every sample tested. As an initial test of the system, we constructed Illumina whole genome shotgun sequencing (WGSS) libraries from a matched tumour/blood pair from Patient 28. This patient was selected as his pre-treatment tumour harboured a validated EGFR exon 19 LREA deletion and the corresponding post-treatment sample contained an additional EGFR exon 20 T790M resistance mutation that may be present in the pre-treatment sample at a frequency below the detection threshold of Sanger sequencing. Solution capture of each WGSS library will be performed following the manufacturer’s instructions and each catch sequenced using one lane of an Illumina Genome Analyzer flow-cell.

5.4. Discussion
While lung cancer is a highly heterogeneous disease, the treatment-naïve tumours collected prospectively for this study are from a highly selected group of patients with narrowly defined histologies.
Clinical selection of patients meeting two of four characteristics (female, non-smoker, Asian descent, adenocarcinoma/BAC) enriched for individuals with tumours harbouring erlotinib-sensitizing EGFR mutations (39% of patients sequenced) and selected against patients with erlotinib-resistant KRAS mutations (6%). This population had a partial response rate similar to that of the EGFR mutation rate (40%); however, EGFR mutational status was not an ideal predictor of response as 25% of responders lacked EGFR mutations and 8% of non-responders had EGFR mutations. This observation is similar to that of the study of gefitinib-treated patients presented in Chapter 2. In that retrospective study, 10% (3/31) of non-responders had EGFR mutations, also supporting the notion that a small percentage of patients refractory to EGFR tyrosine kinase inhibitors have tumours with EGFR mutations. However, the frequency of EGFR mutations overall was far lower in the set of archival tumour samples studied in Chapter 2 compared to those collected prospectively for Chapter 5 (13 versus 39%). In addition, the proportion of responders with EGFR mutations was far lower in Chapter 2’s unselected patient population compared to the highly selected population studied in Chapter 5 (33 versus 74%). Both groups had similar proportions of histological lung subtypes (87 versus 89% NSCLC/adenocarcinoma) and differed primarily in the proportion of females (59 versus 80%), Asians (44 versus 78%), and non-smokers (31 versus 81%). It appears that these physical phenotypes are associated with a clinical subgroup with increased likelihood of acquiring specific somatic variants such as EGFR mutations. Selection for this subgroup likely accounts for the clear association of EGFR mutations with response in the Chapter 5 study.
Conversely, the lack of selection in the Chapter 2 study may have masked this association due to the diversity of backgrounds in the patient population and a lack of statistical power derived from the small number of responders available for study. Overall, both studies identified patients lacking EGFR mutations who responded to EGFR inhibitors and patients with mutations who did not respond. Therefore, there continues to be a need to uncover additional biomarkers predictive of response and to uncover additional drivers of lung cancer.

A unique opportunity to uncover such variants was afforded by the recent development of second generation sequencing technologies. Transcriptome sequencing is a sensitive, quantitative method for assessing the expression, structure, and sequence content of transcribed genes. Using this method, we have uncovered transcription of Epstein-Barr virus, implicating a rare type of lung cancer, defined quantitative gene expression profiles that distinguish clinical subtypes, and identified 9,149 putative somatic single-base pair variants. Given the experience of recent cancer genome sequencing projects [39, 40], a large number of these putative variants are likely false positives or germ-line variants. Therefore, the next phase of this project is validation of these putative mutations in corresponding tumour and blood DNA (Future Directions). Targeted sequencing of candidate regions of the genome identified by transcriptome sequencing has recently been used to uncover common mutations of FOXL2 in granulosa-cell ovarian cancer [27] and EZH2 in diffuse large B-cell lymphoma [50]. While a recurrent base-pair mutation is not apparent from our transcriptome data, we anticipate that patterns of mutation may emerge from our set of validated somatic mutations.

A fundamental challenge in cancer genomics is the small quantities of material available from clinical tumour biopsy samples.
The samples for this study were collected as a condition of enrolment in a clinical trial and therefore were collected using minimally invasive techniques that provide minimal quantities of tissue. The resulting low nucleic acid quantities were increased using commercial amplification techniques: in vitro transcription for RNA (MessageAmp II, Ambion) and Phi29-based whole genome amplification for DNA (Repli-g, QIAgen). These methods introduce amplification biases that are relatively well understood [46] and, as we demonstrate for amplified RNA, can be accurately quantified (Figure 5.4). The use of IVT-amplified RNA clearly biases the distribution of reads towards the 3’ ends of each transcript, decreasing our ability to uncover 5’ SNVs. Our validation approach partially addresses this problem by including a SNV discovery component that targets all exons of genes with mutations identified in multiple tumours. There is a possibility of missing mutations found only in the 5’ ends of genes with low expression. This may be addressed by further sequencing of the transcriptome libraries to increase overall coverage or by selectively targeting 5’ exons of genes for sequencing in genomic DNA.

5.5. Future directions

With targeted sequencing data from a single tumour/blood pair in hand, we plan to apply our solution hybrid capture assay to the other samples collected for this study. A high level of target sequence coverage is anticipated from a single lane of an Illumina flowcell. If 0.5 Gbp of mappable sequence is generated (e.g. conservatively, 10 million 50 bp reads) and 72% of reads correspond exactly to baits [42], then we anticipate ~100X coverage of the 3.5 Mbp target space. The latest version of SNVMix can detect 94.96% of known variants with a 2% false discovery rate from a genome sequence with only 10X coverage [58]. Therefore, we plan to assay multiple samples in parallel through use of an indexing and pooling strategy.
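The coverage estimate above can be reproduced with a short calculation. The sketch below is illustrative only: the function name is hypothetical, the inputs are the figures quoted in the text (10 million 50 bp reads, 72% on-target, a 3.5 Mbp target), and the 10X value is simply the coverage level cited for SNVMix, used here as an assumed working threshold. It computes a mean coverage, which says nothing about how evenly reads are spread across individual baits.

```python
def expected_mean_coverage(mappable_bp: float,
                           on_target_fraction: float,
                           target_bp: float) -> float:
    """Mean fold-coverage of the target, assuming on-target bases
    are spread evenly across the whole target space."""
    return mappable_bp * on_target_fraction / target_bp


# Figures quoted in the text.
reads = 10_000_000          # reads per lane (conservative estimate)
read_length_bp = 50         # bp per read
on_target = 0.72            # fraction of reads corresponding to baits [42]
target_bp = 3_500_000       # ~3.5 Mbp of unique target sequence

mappable_bp = reads * read_length_bp            # 0.5 Gbp
mean_cov = expected_mean_coverage(mappable_bp, on_target, target_bp)

# Assumed working threshold: the 10X coverage cited for SNVMix [58].
assumed_min_coverage = 10
headroom = mean_cov / assumed_min_coverage

print(f"Mappable sequence: {mappable_bp / 1e9:.1f} Gbp")
print(f"Expected mean coverage: {mean_cov:.1f}X")
print(f"Mean coverage relative to {assumed_min_coverage}X: {headroom:.1f}-fold")
```

Under these assumptions the mean coverage is roughly ten-fold above the 10X level, which is the headroom that motivates pooling. The number of samples that can actually share a lane depends on the least-covered position, not on the mean.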
Briefly, WGSS libraries will be constructed from genomic or Phi29-amplified DNA from the remaining 52 tumour samples and their corresponding matched blood samples. These 104 libraries will each incorporate a different molecular “barcode” that will allow data to be assigned to specific samples based on the use of a unique sequence built into each sequencing read. Indexed libraries will then be combined in equimolar quantities for solution hybrid capture and the resulting catch sequenced using one lane of an Illumina flowcell. As sequence coverage distribution is highly reproducible between subsequent catches using the same probe library [42], we will use the coverage profile from the initial test to determine the number of samples that could be pooled and still ensure robust genotyping of each pSNV. A primary metric to determine this number will be the actual coverage of each pSNV compared to the minimum coverage necessary to make each genotyping call at that position. As a simplified example, if the lowest coverage pSNV receives ten times more reads than are necessary to make a base-pair call, then up to ten samples could theoretically be sequenced in a single pool. This assumes that the coverage distribution is consistent between experiments, that the coverage will be evenly distributed across each sample in a pool, that the sequencing yield from each flow cell is consistent, and that there is no loss in capture or sequencing efficiency using pooled, indexed libraries.

Once validated somatic mutations have been uncovered, we will perform statistical analyses to determine whether any variants 1) are associated with response or resistance to erlotinib, 2) commonly implicate specific molecular pathways, and 3) form integrated signatures of mutation and gene expression that correspond to potential subtypes of lung adenocarcinoma in our patient population.

5.6.
Figures

Figure 5.1 Isolation of tumour cells from a complex pleural fluid mixture using flow cytometry

A) Prior to flow sorting, cells isolated from a pleural fluid sample stained with hematoxylin and eosin (H&E, 20X magnification). Clusters of darkly staining cells are the desired tumour cells surrounded by normal mesothelial and reactive immune cells. Estimated ~1% of cells are tumour cells. B) Prior to flow sorting, cells labelled with a Ber-EP4 antibody conjugated to a FITC fluorophore and visualized by fluorescent microscopy (20X magnification). Clusters of tumour cells are brightly labelled. C) After flow sorting, H&E stain of cells recovered by flow cytometry on the basis of FITC intensity and size. Estimated ~80% of cells are cancer cells (20X magnification).

Figure 5.2 Summary of data generated, sequence mapped, and genes, variants, and fusions detected in each library

Top panel: Average read length (blue) and count of raw (black) and mapped (red) reads from 27 transcriptome sequencing libraries from treatment-naïve lung tumours. During the project, longer read lengths became available on the Illumina Genome Analyzer and resulted in longer average read lengths for later libraries. Second panel: Gbp of sequence aligned to a human genome reference (black) and resulting fraction of the ~65 Mbp annotated exonic sequence with 1X (red) and 6X (blue) coverage. Third panel: Number of genes with detectable expression (>1 read) and count of putative variants that correspond to known SNPs (black) and novel SNVs (red). Bottom panel: Number of candidate fusions supported by paired read information. A legend and average values for each plot are provided to the right.
Figure 5.3 Distribution of RNA-seq reads mapped to exonic, intronic, and intergenic regions

Percentage of mapped reads from each library that correspond to annotated exons (exonic), regions between exons of an annotated gene that may represent novel exons or genomic contamination (intronic), and regions between annotated genes that may represent novel genes, misannotated transcription start sites of known genes, or genomic contamination (intergenic).

Figure 5.4 Distribution of sequence coverage and putative SNVs detected across all expressed transcripts from 41 RNA-seq libraries

Percentage of total sequence coverage normalized by position within each expressed transcript from 29 RNA-seq libraries prepared from RNA amplified by in vitro transcription (IVT) for this lung cancer study (red) and from 22 libraries prepared in the “standard” manner from unamplified RNA for other studies of multiple myeloma, follicular and diffuse large B-cell lymphoma (black). Percentage of total putative SNVs distributed across each transcript normalized for length from 22 standard (blue) and 29 IVT (pink) libraries. IVT libraries have nearly 50% greater coverage of 3’ bases (positions 51-100%) resulting in a greater sensitivity for SNV detection in this region. There is a corresponding decrease in coverage and in the percentage of SNVs detected in 5’ bases (positions 1-50%).

Figure 5.5 Comparison of sequence coverage distribution in libraries constructed from RNA amplified using a standard or modified in vitro transcription primer mix.

RNA from two samples, 1 and 2, was amplified by in vitro transcription using two different primer mixes, “Standard” and “Multistart”. Transcriptome libraries were sequenced, and from these we plotted the distribution of sequence coverage across each transcript normalized for length.
The standard primer mix contains an oligo(dT) primer that initiates amplification from the 3’ end of polyadenylated transcripts which results in a biased coverage profile. To address the observed end-bias by initiating synthesis from within the body of protein-coding transcripts, we used a commercial primer mix (Multistart) containing non-degenerate oligonucleotides corresponding to known amino acid codon sequences. This modification did not substantially alter the distribution of sequence coverage across expressed transcripts from either of the RNA samples tested.

Figure 5.6 A) Circos visualization of RNA-seq reads from a lymphoepithelioma-like lung cancer aligned to an EBV genome. B) Confirmation of EBV tumour-specificity by in situ hybridization.

A) Circos visualization [48] of RNA-seq reads aligned to the EBV genome. The outermost black line represents the complete EBV genome from 0 to ~172,000 bp. The grey bars of the middle track represent an alignment of all ~100 EBV protein sequences to the genome and the height of the red bars in the inner track represents the sequence coverage of regions of the EBV genome (0-265X coverage). Expression is localized to three discrete regions commonly expressed in viral-associated cancer (red) and important for tumourigenesis and maintenance of the viral genome: BamHIA, LMP1/2A/2B, and EBNA1. B) In situ hybridization confirms the presence of EBV in tumour cells from a fine-needle aspirate of a metastatic nodule. Darkly stained EBV-positive tumour cells in the centre of the image are surrounded by EBV-negative normal reactive cells and fibrous material. 20X magnification.

[Figure 5.6 image: panels A and B; image label "0.8 mm"]

Figure 5.7 Supervised hierarchical clustering of gene expression profiles uncovers molecular and clinical subtypes of lung cancer

Heatmap diagrams of gene expression values from 27 pre-treatment lung tumour RNA-seq libraries measured in reads per kilobase of exon model per million mapped reads (RPKM, [26]).
Rows indicate RPKM values for each gene using the colour scale indicated at the top of each panel. Columns correspond to individual tumour samples which are clustered into groups indicated by a hierarchical tree diagram above the heatmap. The order of samples differs from panel to panel. Above the hierarchical tree are notations for clinical information unique to each panel. A) Eight genes are able to differentiate tumours based on mutation status of KRAS (K), EGFR (E), or wildtype (-) for these two genes (Bonferroni adjusted p≤0.00124). B) Combinations of response (PR=partial response, SD=stable disease, PD=progressive disease) and smoking status (NS=non-smokers, S=smokers) can be differentiated on the basis of the first five genes listed (Bonferroni adjusted p≤0.0307). The additional 12 genes listed are specifically down-regulated in smokers who respond to erlotinib (p≤9.19x10^-9). C) Expression of CD70 is much higher in the EBV-associated lymphoepithelioma-like carcinoma than the other adenocarcinomas in the set (16.2 RPKM vs. 0.2 median and max 4.6 RPKM from other tumours, p=5.97x10^-14).

Figure 5.8 Attrition of putative SNVs to select variants for validation

Categorical breakdown of all putative SNVs detected from transcriptome sequencing and selection of specific putative SNVs (pSNVs) for PCR and solution capture validation. Of the 432,043 pSNVs detected, 53,517 unique positions were not recorded in public SNP databases or seen in personal genome sequences published to date. After removal of likely mapping artifacts and non-coding pSNVs, 9,149 non-synonymous pSNVs remained as possible somatic mutations for validation in corresponding pre- and post-treatment tumour and blood DNA. 33 pSNVs were targeted for validation by PCR and sequencing of which 1 was somatic (present only in tumour DNA), 5 were germline variants (present in both tumour and blood DNA), and 27 were false positives (absent in both tumour and blood DNA).
An assay has been designed to validate all remaining non-synonymous pSNVs using a commercial targeted sequencing method (Agilent SureSelect). Two overlapping baits were designed targeting each non-synonymous pSNV (Figure 5.9) and filtered using Agilent’s bait design criteria. To uncover additional somatic mutations, baits were also designed to tile across exons from genes with non-synonymous pSNVs in multiple tumours. In total, 47,558 baits have been designed and will be applied to each sample, 16,960 targeting pSNVs and 30,598 targeting genes with pSNVs in multiple tumours.

Figure 5.9 Position of baits designed to A) validate putative point mutations detected by RNA-seq and B) discover additional mutations in exons from genes with putative point mutations in at least 3 tumours

A) Two 120 bp baits (red) were designed to target each putative mutation (star) in genomic DNA (black). To maximize the number of 50 bp end-reads that contain the putative mutation, these baits correspond to fragments with the variant positioned within 30 bp of an end. B) Baits were tiled across exons (yellow) of genes with putative mutations in at least 3 tumours using the default settings provided by the manufacturer.

[Figure 5.9 diagram; labels: panel A, 30 bp, 90 bp, 120 bp; panel B, 120 bp]

5.7.
Tables

Table 5-1 Tumour content and quantities of total RNA extracted from 30 lung tumour biopsies

Type and number of biopsies   Median ng   Range ng      Median % tumour   Range % tumour
Bronchoscopies (7)            115         90 - 898      70                45-100
Core needle biopsies (11)     440         100 - 5,396   75                40-100
Fine needle aspirates (3)     520         180 - 772     100               90-100
Surgical removal (2)          1,935       340 - 3,530   93                90-95
Pleural fluid (6 post-sort)   981         160 - 4,992   80                80
Pleural fluid (1 unsorted)    60,390                    75

Table 5-2 Complete viral genomes against which all unmapped transcriptome reads were mapped

NCBI Reference   NCBI Definition
NC_001460.1      Human adenovirus A, complete genome
NC_004001.2      Human adenovirus B, complete genome
NC_001405.1      Human adenovirus C, complete genome
NC_002067.1      Human adenovirus D, complete genome
NC_003266.2      Human adenovirus E, complete genome
NC_001454.1      Human adenovirus F, complete genome
NC_001943.1      Human astrovirus, complete genome
NC_007455.1      Human bocavirus, complete genome
NC_002645.1      Human coronavirus 229E, complete genome
NC_006577.2      Human coronavirus HKU1, complete genome
NC_005831.2      Human coronavirus NL63, complete genome
NC_005147.1      Human coronavirus OC43, complete genome
NC_009887.1      Human enterovirus 100, complete genome
NC_001612.1      Human enterovirus A, complete genome
NC_001472.1      Human enterovirus B, complete genome
NC_001428.1      Human enterovirus C, complete genome
NC_001430.1      Human enterovirus D, complete genome
NC_004295.1      Human erythrovirus V9, complete genome
NC_001736.1      Human foamy virus, complete genome
NC_001806.1      Human herpesvirus 1, complete genome
NC_001798.1      Human herpesvirus 2, complete genome
NC_001348.1      Human herpesvirus 3, complete genome
NC_007605.1      Human herpesvirus 4, complete genome
NC_009334.1      Human herpesvirus 4, complete genome
NC_001347.3      Human herpesvirus 5 strain AD169, complete genome
NC_006273.2      Human herpesvirus 5 strain Merlin, complete genome
NC_001664.1      Human herpesvirus 6A, complete genome
NC_000898.1      Human herpesvirus 6B, complete genome
NC_001716.2      Human herpesvirus 7, complete genome
NC_003409.1      Human herpesvirus 8 type M, complete genome
NC_009333.1      Human herpesvirus 8, complete genome
NC_001802.1      Human immunodeficiency virus 1, complete genome
NC_001722.1      Human immunodeficiency virus 2, complete genome
NC_004148.2      Human metapneumovirus, complete genome
NC_001356.1      Human papillomavirus - 1, complete genome
NC_001352.1      Human papillomavirus - 2, complete genome
NC_001531.1      Human papillomavirus - 5, complete genome
NC_001526.1      Human papillomavirus - 16, complete genome
NC_001357.1      Human papillomavirus - 18, complete genome
NC_001676.1      Human papillomavirus - 54, complete genome
NC_001694.1      Human papillomavirus - 61, complete genome
NC_004761.1      Human papillomavirus RTRX7, complete genome
NC_001457.1      Human papillomavirus type 4, complete genome
NC_001355.1      Human papillomavirus type 6b, complete genome
NC_001595.1      Human papillomavirus type 7, complete genome
NC_001596.1      Human papillomavirus type 9, complete genome
NC_001576.1      Human papillomavirus type 10, complete genome
NC_001683.1      Human papillomavirus type 24, complete genome
NC_001583.1      Human papillomavirus type 26, complete genome
NC_001586.1      Human papillomavirus type 32, complete genome
NC_001587.1      Human papillomavirus type 34, complete genome
NC_001354.1      Human papillomavirus type 41, complete genome
NC_001690.1      Human papillomavirus type 48, complete genome
NC_001591.1      Human papillomavirus type 49, complete genome
NC_001691.1      Human papillomavirus type 50, complete genome
NC_001593.1      Human papillomavirus type 53, complete genome
NC_001693.1      Human papillomavirus type 60, complete genome
NC_001458.1      Human papillomavirus type 63, complete genome
NC_002644.1      Human papillomavirus type 71, complete genome
NC_010329.1      Human papillomavirus type 88, complete genome
NC_004104.1      Human papillomavirus type 90, complete genome
NC_004500.1      Human papillomavirus type 92, complete genome
NC_005134.2      Human papillomavirus type 96, complete genome
NC_008189.1      Human papillomavirus type 101, complete genome
NC_008188.1      Human papillomavirus type 103, complete genome
NC_009239.1      Human papillomavirus type 107, complete genome
NC_003461.1      Human parainfluenza virus 1 strain Washington/1964, complete genome
NC_003443.1      Human parainfluenza virus 2, complete genome
NC_001796.2      Human parainfluenza virus 3, complete genome
NC_001897.1      Human parechovirus, genome
NC_007018.1      Human parvovirus 4, complete genome
NC_007026.1      Human picobirnavirus RNA segment 1, complete sequence
NC_007027.1      Human picobirnavirus RNA segment 2, complete sequence
NC_001781.1      Human respiratory syncytial virus, complete genome
NC_001617.1      Human rhinovirus 89, complete genome
NC_001490.1      Human rhinovirus B, complete genome
NC_009996.1      Human rhinovirus C, complete genome
NC_007473.1      Human rotavirus G3 segment 1, complete sequence
NC_007472.1      Human rotavirus G3 segment 2, complete sequence
NC_007471.1      Human rotavirus G3 segment 3, complete sequence
NC_007470.1      Human rotavirus G3 segment 4, complete sequence
NC_007467.1      Human rotavirus G3 segment 5, complete sequence
NC_007469.1      Human rotavirus G3 segment 6, complete sequence
NC_007466.1      Human rotavirus G3 segment 7, complete sequence
NC_007465.1      Human rotavirus G3 segment 8, complete sequence
NC_007468.1      Human rotavirus G3 segment 9, complete sequence
NC_007464.1      Human rotavirus G3 segment 10, complete sequence
NC_007463.1      Human rotavirus G3 segment 11, complete sequence
NC_001795.1      Human spumaretrovirus, complete genome
NC_001436.1      Human T-lymphotropic virus 1, complete genome
NC_001488.1      Human T-lymphotropic virus 2, complete genome
NC_001870.1      Simian-Human immunodeficiency virus, complete genome

Table 5-3 33 pSNVs validated by PCR and Sanger sequencing

Gene ANKRD12  Tumour/ blood pairs 5  ARIH1  6  CHD7  3  HLA-DPA1  3  MAN2A1  5  MT-ND1  9  NAE1  4  NUCB2  4  REST  3  SEC62  3  SPARC  3  MANBA MT-CO2 TRIP6 C4orf15  3 3 10 2  MET  
1  Total  Positions targeted chr18:9246629 chr18:9246631 chr15:70554273 chr15:70554275 chr8:61856379 chr8:61856381 chr6:33145544 chr6:33145545 chr5:109218837 chr5:109218839 chrM:4262 chrM:4263 chrM:4161 chr16:65422257 chr16:65422258 chr11:1730953 chr11:1730955 chr4:57491514 chr4:57491515 chr3:171193531 chr3:171193533 chr5:151027230 chr5:151027232 chr4:103772429 chrM:7594 chr7:100308290 chr4:2212348 chr4:2212389 chr7:116127278 chr7:116127283 chr7:116127285 chr7:116127292 chr7:116127294 chr7:116127296  Codons targeted 1  Distinct substitutions 2  Tumours with position1 / position2 5/3  Target positions 2  Artifacts positions 2  1  2  6/4  2  2  1  3  3/1  2  2  1  2  2/1  2  2  1  2  4/1  2  1  3  8/1  2  2  1  2  (1) 3/1  2  2  1  3  3/1  2  1  1  2  2/1  2  2  1  2  2/1  2  1  2  3/2  2  2  1 1 1 2  2 2 1 2  2/1 2/1 10 1/1  1 1 1 2  1 1 1 2  6  6  1  6  6  33  27  Germline positions  Somatic positions  2 (single deletion)  (1)  1 (deletion)  2  5  (1)  178  Table 5-4 Genes with exons targeted for solution hybrid capture, containing mutations in ≥4 pre-treatment tumours (≥3 tumours for COSMIC genes)  Ensembl ID  179  ENSG00000198888 ENSG00000146733 ENSG00000198763 ENSG00000159140 ENSG00000087077 ENSG00000114933 ENSG00000127481 ENSG00000136153 ENSG00000087365 ENSG00000170759 ENSG00000139218 ENSG00000054654 ENSG00000116539 ENSG00000115760 ENSG00000169599 ENSG00000213639 ENSG00000127914 ENSG00000198712 ENSG00000048649 ENSG00000126777 ENSG00000198938 ENSG00000079385 ENSG00000054356 ENSG00000132334 ENSG00000172273 ENSG00000168447 ENSG00000196126 ENSG00000055609 ENSG00000127603 ENSG00000198886 ENSG00000105877 ENSG00000109323 ENSG00000170776 ENSG00000106397 ENSG00000172799 ENSG00000166233 ENSG00000212989  # tumours for gene 16 16 14 14 12 11 10 10 10 10 9 9 9 9 9 9 8 8 8 8 8 8 8 8 8 8 8 7 7 7 7 7 7 7 7 7 7  Types of mutation 11 4 23 5 3 2 8 8 4 2 9 8 8 7 3 2 9 9 6 6 6 4 2 2 2 1 1 9 7 7 6 6 6 5 4 3 2  Gene Name  Recorded in COSMIC v41  MT-ND1 PSPH MT-ND2 SON TRIP6 UBR4 
LMO7 SF3B2 KIF5B SFRS2IP SYNE2 ASH1L BIRC6 NFU1 PPP1CB AKAP9 MT-CO2 RSF1 KTN1 MT-CO3 CEACAM1 PTPRN PTPRE SCNN1B HLA-DRB1 MLL3 MACF1 MT-ND4 DNAH11 MANBA AKAP13 PLOD3 ARIH1  X  X X X  X  X  Ensembl ID ENSG00000136045 ENSG00000169894 ENSG00000198727 ENSG00000080345 ENSG00000115310 ENSG00000151914 ENSG00000173193 ENSG00000186566 ENSG00000011295 ENSG00000066933 ENSG00000101745 ENSG00000120913 ENSG00000135250 ENSG00000145901 ENSG00000100889 ENSG00000107099 ENSG00000112893 ENSG00000132549 ENSG00000157765 ENSG00000181444 ENSG00000188747 ENSG00000204287 ENSG00000189079 ENSG00000168818 ENSG00000135837 ENSG00000065526 ENSG00000182670 ENSG00000198804 ENSG00000049323 ENSG00000073614 ENSG00000075292 ENSG00000118058 ENSG00000119778 ENSG00000125633 ENSG00000138778 ENSG00000143379 ENSG00000147133  # tumours for gene 7 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 5 5 5 5 5 5 5 5 5 5 5 5 5  Types of mutation 1 12 8 7 7 6 6 6 5 5 5 5 5 5 4 4 4 4 4 3 3 3 2 1 9 7 6 6 5 5 5 5 5 5 5 5 5  Gene Name PWP1 MUC3A,MUC3B MT-CYB RIF1 RTN4 DST PARP14 GPATCH8 TTC19 MYO9A ANKRD12 PDLIM2 SRPK2 TNIP1 PCK2 DOCK8 MAN2A1 VPS13B SLC34A2 ZNF467 NOXA1 HLA-DRA ARID2 STX18 CEP350 SPEN TTC3 MT-CO1 LTBP1 JARID1A ZNF638 MLL ATAD2B CCDC93 CENPE SETDB1 TAF1  Recorded in COSMIC v41  X  X  X  X  X  X  X X X  Ensembl ID  180  ENSG00000148773 ENSG00000159882 ENSG00000171490 ENSG00000174953 ENSG00000198695 ENSG00000198744 ENSG00000103496 ENSG00000111642 ENSG00000138600 ENSG00000139372 ENSG00000148516 ENSG00000173575 ENSG00000176915 ENSG00000186153 ENSG00000122257 ENSG00000134285 ENSG00000146648 ENSG00000163714 ENSG00000168758 ENSG00000042493 ENSG00000157637 ENSG00000181982 ENSG00000187775 ENSG00000216490 ENSG00000107020 ENSG00000120087 ENSG00000127415 ENSG00000135480 ENSG00000136104 ENSG00000157021 ENSG00000163220 ENSG00000179912 ENSG00000198815 ENSG00000155657 ENSG00000089737 ENSG00000115816 ENSG00000164190 ENSG00000005483 ENSG00000049759 ENSG00000060491 ENSG00000071054 ENSG00000073350  # tumours for gene 5 5 5 5 5 
5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 4 4 4 4 4 4 4 4 4  Types of mutation 5 5 5 5 5 5 4 4 4 4 4 4 4 4 3 3 3 3 3 2 2 2 2 2 1 1 1 1 1 1 1 1 1 8 6 6 6 5 5 5 5 5  Gene Name MKI67 ZNF230 RSL1D1 DHX36 MT-ND6 MT-ATP8 STX4 CHD4 TDG ZEB1 CHD2 ANKLE2 WWOX RBBP6 FKBP11 EGFR  Recorded in COSMIC v41  X  X  SEMA4C CAPG SLC38A10 CCDC149 IFI30 C9orf46 HOXB7 IDUA KRT7 RNASEH2B FAM92A1,FAM92A2 S100A9 R3HDM2 FOXJ3 TTN DDX24 CEBPZ NIPBL MLL5 NEDD4L OGFR MAP4K4 LLGL2  X X X  X  Ensembl ID ENSG00000088340 ENSG00000119285 ENSG00000144674 ENSG00000165732 ENSG00000170004 ENSG00000173230 ENSG00000174197 ENSG00000181555 ENSG00000198677 ENSG00000008952 ENSG00000009413 ENSG00000064313 ENSG00000065243 ENSG00000066279 ENSG00000071994 ENSG00000073331 ENSG00000084093 ENSG00000084676 ENSG00000085721 ENSG00000088970 ENSG00000091436 ENSG00000095002 ENSG00000099940 ENSG00000101596 ENSG00000102189 ENSG00000102893 ENSG00000105373 ENSG00000114857 ENSG00000116783 ENSG00000118197 ENSG00000119397 ENSG00000119487 ENSG00000124228 ENSG00000127947 ENSG00000128845 ENSG00000131018 ENSG00000132305 ENSG00000136169 ENSG00000139410 ENSG00000140386 ENSG00000140497 ENSG00000143669  # tumours for gene 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4  Types of mutation 5 5 5 5 5 5 5 5 5 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4  Gene Name FER1L4 HEATR1 GOLGA4 DDX21 CHD3 GOLGB1 SETD2 TTC37 SEC62 REV3L TAF2 PKN2 ASPM PDCD2 ALPK1 REST NCOA1 RRN3  MSH2 SNAP29 SMCHD1 EEA1 PHKB GLTSCR2 NKTR TNNI3K,FPGT DDX59 CEP110 MAPKAP1 DDX27 PTPN12 DYX1C1 C6orf98,SYNE1 IMMT SETDB2 SDSL SCAPER SCAMP2 LYST  Recorded in COSMIC v41  X  X X X  X X  X  X X X X X  Ensembl ID  181  ENSG00000144028 ENSG00000144645 ENSG00000146085 ENSG00000146918 ENSG00000153575 ENSG00000164347 ENSG00000165219 ENSG00000165632 ENSG00000166750 ENSG00000171456 ENSG00000173473 ENSG00000173692 ENSG00000175455 ENSG00000185658 ENSG00000187098 ENSG00000187240 ENSG00000197102 ENSG00000197324 
ENSG00000198707 ENSG00000198840 ENSG00000198862 ENSG00000204764 ENSG00000000971 ENSG00000003056 ENSG00000047634 ENSG00000066651 ENSG00000070081 ENSG00000070814 ENSG00000092208 ENSG00000101654 ENSG00000103335 ENSG00000104361 ENSG00000107581 ENSG00000110002 ENSG00000115267 ENSG00000117000 ENSG00000117523 ENSG00000131503 ENSG00000132950 ENSG00000133027 ENSG00000135749 ENSG00000135968  # tumours for gene 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4  Types of mutation 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3  Gene Name OSBPL10 MUT NCAPG2 TUBGCP5 GFM2 GAPVD1 TAF3 SLFN5 ASXL1 SMARCC1 PSMD1 CCDC14 BRWD1 MITF DYNC2H1 DYNC1H1 LRP10 CEP290 MT-ND3 RANBP17 CFH M6PR SCML1 TRMT11 NUCB2 TCOF1 SIP1 RNMT FAM38A  Recorded in COSMIC v41  X  X  X X  EIF3A IFIH1 RLF BAT2D1 ANKHD1,EIF4EBP3 ZMYM5 PEMT PCNXL2 GCC2  X  Ensembl ID ENSG00000143776 ENSG00000150630 ENSG00000157399 ENSG00000162994 ENSG00000164597 ENSG00000164828 ENSG00000168970 ENSG00000175221 ENSG00000179889 ENSG00000188170 ENSG00000204261 ENSG00000007384 ENSG00000058272 ENSG00000113810 ENSG00000116691 ENSG00000117335 ENSG00000119318 ENSG00000143643 ENSG00000159593 ENSG00000161813 ENSG00000168564 ENSG00000189042 ENSG00000197498 ENSG00000198589 ENSG00000205002 ENSG00000066455 ENSG00000103351 ENSG00000110801 ENSG00000119396 ENSG00000125686 ENSG00000131400 ENSG00000155957 ENSG00000179820 ENSG00000197462 ENSG00000213088 ENSG00000105976 ENSG00000110841 ENSG00000038382 ENSG00000039650 ENSG00000070061 ENSG00000119684 ENSG00000120868  # tumours for gene 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3  Types of mutation 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 6 5 4 4 4 4 4  Gene Name CDC42BPA VEGFC ARSE C2orf63 COG5 UNC84A JMJD7,PLA2G4B MED16 PDXDC1 HBD PSMB9 RHBDF1 PPP1R12A SMC4  Recorded in COSMIC v41 X  X  CD46 RAD23B TTC13 NAE1 LARP4 CDKN2AIP ZNF567 BXDC1 LRBA GOLGA5 CLUAP1 PSMD9 RAB14 
MED1 NAPSA TMBIM4 MYADM DARC MET PPFIBP1 TRIO PNKP IKBKAP MLH3 APAF1  X  X  X X X X X X X X  Ensembl ID ENSG00000128829 ENSG00000177200 ENSG00000198231 ENSG00000198399 ENSG00000006468 ENSG00000010292 ENSG00000012983 ENSG00000066777 ENSG00000070018 ENSG00000079739 ENSG00000081237 ENSG00000086758 ENSG00000099956 ENSG00000100644 ENSG00000100815 ENSG00000100888 ENSG00000110713 ENSG00000113163 ENSG00000117139 ENSG00000131626 ENSG00000132466 ENSG00000135090  # tumours for gene 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3  Types of mutation 4 4 4 4 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3  Gene Name EIF2AK4 CHD9 DDX42 ITSN2 ETV1 NCAPD2 MAP4K5 ARFGEF1 LRP6 PGM1 PTPRC HUWE1 SMARCB1 HIF1A TRIP11 CHD8 NUP98 COL4A3BP JARID1B PPFIA1 ANKRD17 TAOK3  Recorded in COSMIC v41 X X X X X X X X X X X X X X X X X X X X X X  Ensembl ID ENSG00000135541 ENSG00000137497 ENSG00000138032 ENSG00000138764 ENSG00000141510 ENSG00000142949 ENSG00000155304 ENSG00000156256 ENSG00000157212 ENSG00000163029 ENSG00000173482 ENSG00000174243 ENSG00000175054 ENSG00000047936 ENSG00000103342 ENSG00000180900 ENSG00000188042 ENSG00000054118 ENSG00000060237 ENSG00000141562 ENSG00000147364  # tumours for gene 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3  Types of mutation 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 1 1 1 1  Gene Name AHI1 NUMA1 PPM1B CCNG2 TP53 PTPRF HSPA13 USP16 PAXIP1 SMC6 PTPRM DDX23 ATR ROS1 GSPT1 SCRIB ARL4C THRAP3 WNK1 NARF FBXO25  Recorded in COSMIC v41 X X X X X X X X X X X X X X X X X X X X X  182  Table 5-5 Genes containing at least 2 types of mutation exclusively in post-treatment tumours.  
Ensembl ID ENSG00000121966 ENSG00000161011 ENSG00000138182 ENSG00000133114 ENSG00000114867 ENSG00000138002 ENSG00000175197 ENSG00000170476 ENSG00000169550 ENSG00000166165 ENSG00000137106 ENSG00000135972 ENSG00000135912 ENSG00000123240 ENSG00000113580 ENSG00000108439 ENSG00000004779  # tumours with mutation in gene 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1  Types of mutation 3 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2  Gene Name CXCR4 SQSTM1 KIAA1704 EIF4G1 IFT172 DDIT3  COSMIC v41 X  X  MUC15 CKB GRHPR MRPS9 TTLL4 OPTN NR3C1 PNPO NDUFAB1

5.8. Bibliography

1. Sun S, Schiller JH, Gazdar AF: Lung cancer in never smokers--a different disease. Nat Rev Cancer 2007, 7(10):778-790.
2. Salgia R, Skarin AT: Molecular abnormalities in lung cancer. J Clin Oncol 1998, 16(3):1207-1217.
3. Lynch TJ, Bell DW, Sordella R, Gurubhagavatula S, Okimoto RA, Brannigan BW, Harris PL, Haserlat SM, Supko JG, Haluska FG et al: Activating mutations in the epidermal growth factor receptor underlying responsiveness of non-small-cell lung cancer to gefitinib. N Engl J Med 2004, 350:2129-2139.
4. Paez JG, Janne PA, Lee JC, Tracy S, Greulich H, Gabriel S, Herman P, Kaye F, Lindeman N, Boggon TJ et al: EGFR mutations in lung cancer: correlation with clinical response to gefitinib therapy. Science 2004, 304:1497-1500.
5. Pao W, Miller V, Zakowski M, Doherty J, Politi K, Sarkaria I, Singh B, Heelan R, Rusch V, Fulton L et al: EGF receptor gene mutations are common in lung cancers from "never smokers" and are associated with sensitivity of tumors to gefitinib and erlotinib. Proc Natl Acad Sci USA 2004, 101(36):13306-13311.
6. Pao W WT, Riely GJ, Miller VA, Pan Q, Ladanyi M, Zakowski MF, Heelan RT, Kris MG, Varmus HE.: KRAS Mutations and Primary Resistance of Lung Adenocarcinomas to Gefitinib or Erlotinib. PLoS Med 2005, 2:e17.
7.
Taron M, Ichinose Y, Rosell R, Mok T, Massuti B, Zamora L, Mate JL, Manegold C, Ono M, Queralt C et al: Activating mutations in the tyrosine kinase domain of the epidermal growth factor receptor are associated with improved survival in gefitinib-treated chemorefractory lung adenocarcinomas. Clin Cancer Res 2005, 11(16):5878-5885.
8. Pugh TJ, Bebb G, Barclay L, Sutcliffe M, Fee J, Salski C, O'Connor R, Ho C, Murray N, Melosky B et al: Correlations of EGFR mutations and increases in EGFR and HER2 copy number to gefitinib response in a retrospective analysis of lung cancer patients. BMC cancer 2007, 7:128.
9. Giaccone G, Rodriguez JA: EGFR inhibitors: what have we learned from the treatment of lung cancer? Nat Clin Pract Oncol 2005, 2(11):554-561.
10. Tsao MS, Sakurada A, Cutz JC, Zhu CQ, Kamel-Reid S, Squire J, Lorimer I, Zhang T, Liu N, Daneshmand M et al: Erlotinib in lung cancer - molecular and clinical predictors of outcome. N Engl J Med 2005, 353(2):133-144.
11. Pao W, Miller VA, Politi KA, Riely GJ, Somwar R, Zakowski MF, Kris MG, Varmus H: Acquired Resistance of Lung Adenocarcinomas to Gefitinib or Erlotinib Is Associated with a Second Mutation in the EGFR Kinase Domain. PLoS Med 2005, 2:e73.
12. Yun CH, Mengwasser KE, Toms AV, Woo MS, Greulich H, Wong KK, Meyerson M, Eck MJ: The T790M mutation in EGFR kinase causes drug resistance by increasing the affinity for ATP. Proc Natl Acad Sci U S A 2008.
13. Cappuzzo F, Gregorc V, Rossi E, Cancellieri A, Magrini E, Paties CT, Ceresoli G, Lombardo L, Bartolini S, Calandri C et al: Gefitinib in pretreated non-small-cell lung cancer (NSCLC): analysis of efficacy and correlation with HER2 and epidermal growth factor receptor expression in locally advanced or metastatic NSCLC. J Clin Oncol 2003, 21(14):2658-2663.
14. Takano T, Ohe Y, Sakamoto H, Tsuta K, Matsuno Y, Tateishi U, Yamamoto S, Nokihara H, Yamamoto N, Sekine I et al: Epidermal growth factor receptor gene mutations and increased copy numbers predict gefitinib sensitivity in patients with recurrent non-small-cell lung cancer. J Clin Oncol 2005, 23(28):6829-6837.
15. Johnson BE, Janne PA: Selecting patients for epidermal growth factor receptor inhibitor treatment: A FISH story or a tale of mutations? J Clin Oncol 2005, 23(28):6813-6816.
16. Shepherd FA, Tsao MS: Unraveling the mystery of prognostic and predictive factors in epidermal growth factor receptor therapy. J Clin Oncol 2006, 24(7):1219-1220; author reply 1220-1211.
17. Davies H, Hunter C, Smith R, Stephens P, Greenman C, Bignell G, Teague J, Butler A, Edkins S, Stevens C et al: Somatic mutations of the protein kinase gene family in human lung cancer. Cancer Res 2005, 65(17):7591-7595.
18. Ding L, Getz G, Wheeler DA, Mardis ER, McLellan MD, Cibulskis K, Sougnez C, Greulich H, Muzny DM, Morgan MB et al: Somatic mutations affect key pathways in lung adenocarcinoma. Nature 2008, 455(7216):1069-1075.
19. Greenman C, Stephens P, Smith R, Dalgliesh GL, Hunter C, Bignell G, Davies H, Teague J, Butler A, Stevens C et al: Patterns of somatic mutation in human cancer genomes. Nature 2007, 446(7132):153-158.
20. Thomas RK, Baker AC, Debiasi RM, Winckler W, Laframboise T, Lin WM, Wang M, Feng W, Zander T, MacConaill L et al: High-throughput oncogene mutation profiling in human cancer. Nat Genet 2007, 39(3):347-351.
21. Sjoblom T, Jones S, Wood LD, Parsons DW, Lin J, Barber TD, Mandelker D, Leary RJ, Ptak J, Silliman N et al: The consensus coding sequences of human breast and colorectal cancers. Science 2006, 314(5797):268-274.
22. Mardis ER: Anticipating the 1,000 dollar genome. Genome Biol 2006, 7(7):112.
23. Shendure J, Stewart CJ: Cancer genomes on a shoestring budget. N Engl J Med 2009, 360(26):2781-2783.
24. Morin R, Bainbridge M, Fejes A, Hirst M, Krzywinski M, Pugh T, McDonald H, Varhol R, Jones S, Marra M: Profiling the HeLa S3 transcriptome using randomly primed cDNA and massively parallel short-read sequencing. Biotechniques 2008, 45(1):81-94.
25. Sultan M, Schulz MH, Richard H, Magen A, Klingenhoff A, Scherf M, Seifert M, Borodina T, Soldatov A, Parkhomchuk D et al: A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science 2008, 321(5891):956-960.
26. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 2008, 5(7):621-628.
27. Shah SP, Kobel M, Senz J, Morin RD, Clarke BA, Wiegand KC, Leung G, Zayed A, Mehl E, Kalloger SE et al: Mutation of FOXL2 in granulosa-cell tumors of the ovary. N Engl J Med 2009, 360(26):2719-2729.
28. Li H, Ruan J, Durbin R: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome research 2008, 18(11):1851-1858.
29. Saeed AI, Sharov V, White J, Li J, Liang W, Bhagabati N, Braisted J, Klapa M, Currier T, Thiagarajan M et al: TM4: a free, open-source system for microarray data management and analysis. BioTechniques 2003, 34(2):374-378.
30. Young LS, Rickinson AB: Epstein-Barr virus: 40 years on. Nat Rev Cancer 2004, 4(10):757-768.
31. Soda M, Choi YL, Enomoto M, Takada S, Yamashita Y, Ishikawa S, Fujiwara S, Watanabe H, Kurashina K, Hatanaka H et al: Identification of the transforming EML4-ALK fusion gene in non-small-cell lung cancer. Nature 2007, 448(7153):561-566.
32. Levy S, Sutton G, Ng PC, Feuk L, Halpern AL, Walenz BP, Axelrod N, Huang J, Kirkness EF, Denisov G et al: The diploid genome sequence of an individual human. PLoS Biol 2007, 5(10):e254.
33. Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L, McGuire A, He W, Chen YJ, Makhijani V, Roth GT et al: The complete genome of an individual by massively parallel DNA sequencing. Nature 2008, 452(7189):872-876.
34.  35. 36.  37.  38.  39.  40.  41.  42.  43. 44.  45.  46.
Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR et al: Accurate whole human genome sequencing using reversible terminator chemistry. Nature 2008, 456(7218):53-59. Wang J, Wang W, Li R, Li Y, Tian G, Goodman L, Fan W, Zhang J, Li J, Zhang J et al: The diploid genome sequence of an Asian individual. Nature 2008, 456(7218):60-65. Schmidt EE, Ichimura K, Goike HM, Moshref A, Liu L, Collins VP: Mutational profile of the PTEN gene in primary human astrocytic tumors and cultivated xenografts. J Neuropathol Exp Neurol 1999, 58(11):1170-1183. Duerr EM, Rollbrocker B, Hayashi Y, Peters N, Meyer-Puttlitz B, Louis DN, Schramm J, Wiestler OD, Parsons R, Eng C et al: PTEN mutations in gliomas and glioneuronal tumors. Oncogene 1998, 16(17):2259-2264. Sos ML, Koker M, Weir BA, Heynck S, Rabinovsky R, Zander T, Seeger JM, Weiss J, Fischer F, Frommolt P et al: PTEN loss contributes to erlotinib resistance in EGFRmutant lung cancer by activation of Akt and EGFR. Cancer Res 2009, 69(8):32563261. Ley TJ, Mardis ER, Ding L, Fulton B, McLellan MD, Chen K, Dooling D, DunfordShore BH, McGrath S, Hickenbotham M et al: DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature 2008, 456(7218):66-72. Shah S, Morin R, Khattra J, Prentice L, Pugh TJ, Burleigh A, Delaney A, Gelmon K, Guliany R, Holt RA et al: Mutational evolution of a lobular breast tumour, profiled by whole-transcriptome and whole-genome next generation sequencing. Submitted 2009. Albert TJ, Molla MN, Muzny DM, Nazareth L, Wheeler D, Song X, Richmond TA, Middle CM, Rodesch MJ, Packard CJ et al: Direct selection of human genomic loci by microarray hybridization. Nat Methods 2007, 4(11):903-905. Gnirke A, Melnikov A, Maguire J, Rogov P, LeProust EM, Brockman W, Fennell T, Giannoukos G, Fisher S, Russ C et al: Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. 
Nat Biotechnol 2009, 27(2):182-189. Olson M: Enrichment of super-sized resequencing targets from the human genome. Nat Methods 2007, 4(11):891-892. Okou DT, Steinberg KM, Middle C, Cutler DJ, Albert TJ, Zwick ME: Microarraybased genomic selection for high-throughput resequencing. Nat Methods 2007, 4(11):907-909. Porreca GJ, Zhang K, Li JB, Xie B, Austin D, Vassallo SL, LeProust EM, Peck BJ, Emig CJ, Dahl F et al: Multiplex amplification of large sets of human exons. Nat Methods 2007, 4(11):931-936. Pugh TJ, Delaney AD, Farnoud N, Flibotte S, Griffith M, Li HI, Farinha P, Gascoyne RD, Marra MA: Impact of Whole Genome Amplification on Analysis of Copy Number Variants. Nucleic Acids Res 2008, 36(13):e80 186  47.  48.  49. 50.  51.  52.  53. 54.  55.  56. 57. 58.  Jones SJ, Laskin JJ, Li Y, Griffith O, Bilenky M, Butterfield Y, Cezard T, Chuah E, Corbett R, Fejes A et al: Complete genomic characterization of an adenocarcinoma of the tongue provides rational therapeutic options. Submitted 2009. Krzywinski MI, Schein JE, Birol I, Connors J, Gascoyne R, Horsman D, Jones SJ, Marra MA: Circos: An information aesthetic for comparative genomics. Genome Res 2009. International Human Genome Sequencing Consortium: Finishing the euchromatic sequence of the human genome. Nature 2004, 431(7011):931-45. Morin RD, Johnson NA , Severson TM, Mungall AJ, An J, Paul JE, Uyar B, Boyle M, Kuchenbauer F, Petriv OI, Humphries RK, Griffith OL, Shah S, Corbett R, Tam A, Varhol R, Zhao Y, Delaney A, Qian H, Birol I, Aparicio S, Schein J, Moore R, Holt R, Horsman DE, Connors JM, Jones S, Hirst M, Gascoyne RD, Marra MA. EZH2 (Y641) is Frequently Mutated in Follicular and Diffuse Large B-cell Lymphomas of Germinal Center Origin. Nature Genetics 2009, submitted. Fullwood MJ, Wei CL, Liu ET, Ruan Y: Next-generation DNA sequencing of pairedend tags (PET) for transcriptome and genome analyses. Genome Res 2009, 19(4):521-532. 
52. Campbell PJ, Stephens PJ, Pleasance ED, O'Meara S, Li H, Santarius T, Stebbings LA, Leroy C, Edkins S, Hardy C, Teague JW, Menzies A, Goodhead I, Turner DJ, Clee CM, Quail MA, Cox A, Brown C, Durbin R, Hurles ME, Edwards PA, Bignell GR, Stratton MR, Futreal PA: Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nat Genet 2008, 40(6):722-9.
53. Wang Z, Gerstein M, Snyder M: RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 2009, 10(1):57-63.
54. Paavonen J, Naud P, Salmerón J, Wheeler CM, Chow SN, Apter D, Kitchener H, Castellsague X, Teixeira JC, Skinner SR, Hedrick J, Jaisamrarn U, Limson G, Garland S, Szarewski A, Romanowski B, Aoki FY, Schwarz TF, Poppe WA, Bosch FX, Jenkins D, Hardt K, Zahaf T, Descamps D, Struyf F, Lehtinen M, Dubin G; HPV PATRICIA Study Group, Greenacre M: Efficacy of human papillomavirus (HPV)-16/18 AS04-adjuvanted vaccine against cervical infection and precancer caused by oncogenic HPV types (PATRICIA): final analysis of a double-blind, randomised study in young women. Lancet 2009, 374(9686):301-14.
55. Chen H, Huang J, Wu FY, Liao G, Hutt-Fletcher L, Hayward SD: Regulation of Expression of the Epstein-Barr Virus BamHI-A Rightward Transcripts. J Virol 2005, 79(3):1724-33.
56. Stratton MR, Campbell PJ, Futreal PA: The cancer genome. Nature 2009, 458(7239):719-724.
57. Kent WJ: BLAT - the BLAST-like alignment tool. Genome Res 2002, 12(4):656-64.
58. Goya R, Sun MGF, Morin RD, Leung G, Ha G, Wiegand K, Senz J, Crisan A, Marra MA, Hirst M, Huntsman D, Murphy KP, Aparicio S, Shah SP: SNVMix: predicting single nucleotide variants from next generation sequencing of tumors. Submitted 2009.

Chapter 6. Discussion

6.1. DNA sequencing efforts of increasing scale are becoming clinically applicable

The rapid advance of genome technology has fundamentally altered how genome research is conducted.
In five years of graduate training, I moved from sequencing a few exons of one candidate gene (Chapter 2) to sequencing all exons from a set of eight genes (Chapter 4) to sequencing transcripts of more than 20,000 expressed genes (Chapter 5). Interpretation of genome data has evolved along with these technical advancements. Early debates in the lung cancer field about the use of EGFR biomarkers to predict TKI response have been tempered by investigation of larger sample sets and the discovery of additional alterations that confer resistance, such as activating KRAS mutations or deletion of PTEN. While scientists continue to debate whether EGFR testing is “ready for prime time” [1, 2], clinical labs have recently begun routine screening of lung cancers for EGFR kinase domain mutations [3]. These efforts are a good first step towards characterizing cancers at the molecular level, but novel mutations in lung cancer may have additional predictive and prognostic value.

Identifying clinically relevant variants and facilitating their use in the clinic remains a major challenge of modern cancer genomics. To fully deliver on the promise of genomics, fundamental discoveries within the cancer genome must be translated into actionable diagnostics or therapies that can positively impact patient care. I predict that routine profiling of cancer samples will complement existing pathology reviews, as histological subtyping and assessment of tumour content are crucial for putting genomic information in a cellular context.

However, the adoption of genome tools to profile clinical cancer specimens faces a number of hurdles. Many laboratory methods are developed using a virtually unlimited supply of high quality nucleic acids derived from homogeneous cell lines or blood samples.
Primary cancer biopsies, on the other hand, are often highly limited in tissue quantity, yield lower quality nucleic acids, and contain a large number of different cell types, of which only a fraction may be cancerous. In my thesis work, I have developed methods to address many of these issues and to facilitate the application of genome science to clinical specimens.

Low quality DNA and RNA were a common feature of the tumour specimens used in my research. In Chapter 2, I present criteria for the qualification of DNA extracted from formalin-fixed paraffin-embedded (FFPE) tissues. Chapter 5 presents a protocol for review of fresh-frozen tissues, qualification of extracted RNA, and construction of sequencing libraries from amplified product.

Isolation of tumour cells from surrounding normal cells has been a recurring challenge in my graduate work. Chapter 2 documents the isolation of lung tumour cells from FFPE tissue sections using laser microdissection, a technique revisited in Chapter 5 for application to fresh-frozen samples. Also in Chapter 5, I developed a flow cytometry assay to harvest tumour cells from up to two litres of pleural effusion containing primarily normal reactive cells.

Characterizing bias induced by amplification of nucleic acids from limited starting quantities is a major focus of my cancer research. Chapter 4 presents a treatment of the biases induced by a DNA amplification method and demonstrates the detection of bona fide copy number variants using amplified material. The work presented in Chapter 5 relies on an RNA amplification method to detect gene mutations and expression levels in lung tumour biopsies and includes a characterization of sequence coverage bias induced by this technique.

6.2. Revolutions in DNA sequencing technology have enabled routine genome sequencing

DNA sequencing capacity has now reached a critical mass.
The sequence of the genome, being finite, is becoming completely knowable as the genotype of nearly every basepair can now be assessed at a reasonable cost. Full knowledge of an individual’s genome variation will soon be a starting point for analysis and not an end goal. DNA sequence data are becoming a universal commodity that can be applied to answer any number of questions in much the same way computer processing power has become generally applicable over the last 20 years. The challenge will be how best to spend this commodity and how to interpret the resulting flood of information.

The study of cancer genomes has great potential not only to impact the treatment of this disease but also to further our understanding of cell biology. The root causes of each of the hallmarks of cancer are likely to be uncovered as somatic mutations are compared across cancers and pre-neoplastic lesions with distinct phenotypes. Separating somatic from germline mutations will soon be a solved problem, as every common variant will be known [4] and, for rare alleles, there will be the ultimate personal reference sequence for each tumour in the form of normal DNA from unaffected tissue or peripheral blood. As sequence capacity continues to grow into the realm of sequencing single cells, even low frequency variants will be readily detected. This will be important not only in sequencing populations of heterogeneous tumours but also in better understanding genetic mosaicism in normal tissues and the acquisition of somatic mutations in normal cells throughout our lifetime. The challenge clinically will be how to identify those mutations on which a tumour relies, which mutations result in druggable targets, and how to treat resistant clones that exist prior to, or arise during, treatment.

Cancer is a genetically heterogeneous disease.
The Catalogue of Somatic Mutations in Cancer (COSMIC, [5]) lists over 750 genes in which somatic point mutations or indels have been detected in lung cancer alone. Fewer than ten of these genes are mutated in more than 10% of tumours, suggesting that lung cancers as a whole can arise due to mutations in a diversity of genes, can tolerate a large number of passenger mutations, or both [5, 6]. Common activating mutations impact oncogenes EGFR (26% of lung cancers), KRAS (17%), and CDKN2A (15%), while common inactivating mutations target tumour suppressors p53 (60%), RB1 (14%), and STK11 (10%) [5]. In the analysis of lung tumour transcriptomes in Chapter 5, we observed distinct subsets of expression and a rare EBV-associated cancer despite careful clinical selection of patients (female, Asian, non-smokers) and tumour histology. Therefore, sequencing large numbers of tumours will likely be necessary to refine molecular subclasses of morphologically similar cancers.

The total number of possible cancer mutations may not be small, but it will be within the defined space of the finite human genome. Analysis of two cancer genomes sequenced thus far [7, 8] found that these tumours harbour relatively few (10-32) somatic coding mutations likely to drive cancer. Mutations were found in genes not previously associated with cancer but linked to known disease pathways [7, 8], suggesting that comprehensive, unbiased sequencing of tumours, rather than a candidate gene approach, is necessary to uncover these mutations. Large scale projects such as The Cancer Genome Atlas [9] are beginning to sequence hundreds of tumour genomes with the goal of integrating genome sequence information with clinical, histological, and gene expression information. As large sets of cancer genome data accrue, patterns of somatic mutation will continue to be refined, building on early observations from candidate gene sequencing studies [6, 10-12].
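
The recurrence threshold discussed above (genes mutated in more than 10% of tumours) can be illustrated with a minimal Python sketch. The function name `recurrent_genes`, the toy cohort, and the placeholder gene labels are assumptions made purely for illustration; they are not data or code from this thesis, COSMIC, or The Cancer Genome Atlas.

```python
from collections import Counter


def recurrent_genes(tumour_mutations, min_fraction=0.10):
    """Return {gene: fraction of tumours mutated} for genes mutated in
    strictly more than `min_fraction` of the tumours in the cohort."""
    n_tumours = len(tumour_mutations)
    if n_tumours == 0:
        return {}
    counts = Counter()
    for genes in tumour_mutations.values():
        counts.update(set(genes))  # count each gene at most once per tumour
    return {
        gene: count / n_tumours
        for gene, count in counts.items()
        if count / n_tumours > min_fraction
    }


# Hypothetical toy cohort of ten tumours (illustrative only, not patient data).
toy_cohort = {
    "T01": {"EGFR", "TP53"},
    "T02": {"TP53"},
    "T03": {"KRAS", "TP53"},
    "T04": {"EGFR"},
    "T05": {"TP53", "GENE_A"},
    "T06": {"KRAS"},
    "T07": {"TP53"},
    "T08": {"EGFR", "GENE_B"},
    "T09": {"TP53"},
    "T10": set(),
}

recurrent = recurrent_genes(toy_cohort)
for gene in sorted(recurrent):
    print(f"{gene}: mutated in {recurrent[gene]:.0%} of tumours")
```

In this toy cohort, the two placeholder genes seen in only one of ten tumours fall at the 10% boundary and are excluded, which mirrors the long tail of rarely mutated genes described above. A real analysis would also need to account for gene length, background mutation rate, and sample size before calling a gene recurrent.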
The oncogenic forces that drive some cancers are not restricted to changes of the native DNA sequence itself. Rather, changes in methylation patterns (epigenetics) [13], RNA splicing, regulation and editing [14], microRNA processing [15], viral integration [16], and the mitochondrial genome [17] have all been implicated in oncogenesis, and many of these features are potential targets of modern cancer therapies [13, 16-18]. Therefore, integration of complementary sequencing approaches such as chromatin immunoprecipitation sequencing (ChIP-seq) and transcriptome sequencing (RNA-seq) will further serve to identify mechanisms of oncogenesis.

As sequencing technologies are applied to an increasing diversity of cancers, I predict a gradual redefinition of how cancers are classified and treated. Not since the adoption of the microscope to differentiate cell morphology has cancer pathology undergone such a transformation. In addition to diagnosing a cancer as a “lung adenocarcinoma”, comprehensive molecular information will include “activating mutations of genes x, y, z”, “amplification of oncogene a, fusions of genes b and c” or “expression of virus p, subtypes q and r.” The art of genomic medicine will be interpreting these molecular signatures and formulating effective treatment strategies based on this information.

Many cancer therapies rely on targeting the hallmarks of cancer. Traditional chemotherapies prevent cells from rapidly dividing by cross-linking a cell’s DNA through alkylation or providing cytotoxic molecules for biosynthesis [19]. Radiation therapies induce DNA breaks and point mutations which become particularly deleterious as they accumulate in immortalized tumour cells, resulting in eventual biological collapse. Angiogenesis inhibitors exploit a growing tumour’s dependence on increased blood supply for continued progression [20].
However, cancer is an evolutionary process, and rapidly dividing tumour populations can adapt to external pressures that would otherwise result in the death of normal cells [21]. These treatments alone often do not cure cancer; rather, they destroy the susceptible percentage of cancer cells, leaving behind resistant survivors that can grow to form their own population. This has been directly observed in cells that become resistant to EGFR tyrosine kinase inhibitors, in which long-term treatment drives the development of a specific resistance mutation, T790M, that reduces the effectiveness of the drug [22, 23]. In the case of a single patient from our study in Chapter 5, an initially undetected T790M resistance mutation was present at low frequency prior to treatment, which may explain that patient’s lack of response to erlotinib. Therefore, to rationally guide effective therapies, even low frequency cancer genome aberrations need to be catalogued before treatment and tracked over time, in addition to detecting new mutations that arise.

To implement routine cancer genome diagnostics, there is a continuing need to minimize invasive collection of tumour tissue and for targeted methods to validate genome-wide findings. Large quantities of tissue will likely not become available to address these problems, so methods of cell isolation and nucleic acid amplification must continue to mature. Ideally, invasive biopsies will be rendered unnecessary with improvements in isolation of tumour cells or DNA from complex samples such as pleural fluid or blood. Amplification of DNA and RNA will likely become routine, and statistical characterization of bias induced by these methods will be necessary for their application. In the near term, putative somatic variants detected by second generation sequencing methods will need to be validated by orthogonal methods.
Even given samples of ideal quality and quantity, false positive rates are very high using these methods, and confirmation using traditional methods is still warranted. Current laboratory techniques such as FISH, Sanger sequencing, and single-variant genotyping will not be discarded; rather, they will continue to guide the refinement of current-generation sequence analysis and the development of the next generation of DNA sequencers.

6.3. Future treatments of cancer will be guided by genome information

We need to begin treating cancer as a genomic disease. The structure and content of oncogenes and tumour suppressors are keys not only to identifying known, treatable molecular features of cancer but also to identifying new candidate drug targets and mechanisms of resistance. Particularly when managing cancers long-term, the potential for acquisition of resistance mutations increases with time, and regular molecular profiling of sequence and structural rearrangements will increase understanding of cancer evolution in response to treatment.

Rational selection of cancer therapy based on the predicted effects of observed genome alterations will become a major tool to improve patient survival. So far, there is a single example of this being attempted with positive results, in which I was fortunate to play a small part [24]. In September 2008, a patient presented at the BC Cancer Agency with lung metastases from a rare adenocarcinoma of the tongue previously treated by surgery and radiation. Despite expression of EGFR, this cancer was resistant to erlotinib and very few routine therapeutic options remained. To help predict possible effective drugs, tumour samples were collected from which genome and transcriptome data were generated. A normal genome sequence was derived from DNA extracted from a peripheral blood sample.
Analyses of these data uncovered a PTEN deletion, an indicator of erlotinib resistance, and RET amplification and overexpression, which together appeared to drive tumour progression. While 84 putative mutations were uncovered, none were present in RET or several other drug targets, suggesting that inhibition of these intact proteins may be particularly effective. A list of seven drugs was presented to the patient’s oncologist and sunitinib (Sutent, Pfizer), a RET inhibitor used to treat kidney, thyroid, and gastrointestinal cancers, was selected as the therapy going forward. After 6 weeks of treatment, there was an approximate 20% decrease in tumour size, and no new nodules had appeared (Figure 6.1). This is a remarkable development because, without genome information, this therapy would not have been considered for this cancer. The patient enjoyed good quality of life as the tumour remained in remission for over 5 months. At this point, the tumour grew again and the therapy was changed to a cocktail of two drugs from the original list, sorafenib and sulindac. The tumour again went into remission, this time for four months.

In July 2009, the tumour relapsed and has again begun to grow. To uncover how this tumour has again become resistant and to suggest a new therapeutic strategy, tumour tissue has again been acquired, and its genome and transcriptome are being sequenced for comparison with the pre-treatment sample. The goal is to again use genome information to understand what somatic aberrations are now driving this cancer and what can be done clinically to return this patient to good health. This is the future of medical onco-genomics.

6.4. Future directions

My immediate short-term goal is to carry out the final experiment outlined in Chapter 5 to validate sequence variants identified by RNA-seq. The results of this experiment have the potential not only to refine current variant detection methods but also to allow, for the first time, an observation of the spectrum of transcribed somatic mutations present in a set of highly homogeneous solid tumours. As clinical data continue to become available from the ongoing clinical trial, these data may yield fundamental insights into the relationship between somatic alteration and the clinical course of cancer. While the population is not large enough to draw conclusive associations between specific variants and drug response, this initial examination of lung cancer transcriptomes may suggest pathway members commonly mutated in this cancer and potential biomarkers predictive of response that could be validated in larger sample sets.

Even prior to the validation of somatic variants, expression profiling of these lung cancers has already identified avenues for investigation. The discovery of Epstein-Barr viral (EBV) transcription in a rare lymphoepithelioma-like carcinoma demonstrated the ability of transcriptome sequencing to refine a pathology diagnosis and suggests that other tumours of this type may be misclassified. Expression profiling of this tumour has also uncovered a possible biomarker of EBV involvement, the overexpression of CD70. Therefore, I plan to investigate the expression of EBV and CD70 in a collection of lung tumours to ascertain whether EBV infection is common in lung cancer and whether it is linked to CD70 upregulation.

Expression of specific genes was found to correlate with clinical features including EGFR and KRAS mutation, smoking status and erlotinib response. To refine and validate these profiles, I plan to compare these data with existing and emerging gene expression data sets.
Comparison with similar tumours should answer whether these profiles are a consistent feature of this highly selected group of tumours, and comparison with other types of lung cancer, squamous, for example, should illustrate whether expression profiling can be used to differentiate lung cancers with specific mutations or increased likelihood of TKI response.

I believe that comparison and integration of cancer genome and transcriptome sequences will uncover common patterns of mutation, structural variation, gene expression, and viral transcription that can eventually be used to guide the treatment of cancer. Such patterns will not be evident without integrated genome-scale data from hundreds of tumours. However, even our current knowledge of cancer genes is sufficient to interpret the genome sequence of an individual tumour to make an effective therapeutic recommendation in at least one case [24]. Ongoing sequence analysis of routine cancer biopsies will not only benefit patients but will further refine our ability to tie genome information with clinical outcome.

6.5. Figures

Figure 6.1 Computed tomography (CT) images of lung metastases from an adenocarcinoma of the tongue in the months before and after administration of sunitinib, a drug selected to exploit somatic aberrations identified by cancer genome and transcriptome sequencing. Reproduced from [24].
A) October 1st, 2008, one month before sunitinib initiation. Tumour masses with diameters of 22 and 24 mm are identified by arrows (top and bottom respectively).
B) October 29th, 2008, baseline before sunitinib initiation on Oct 30th, 2008. Tumour masses have grown by 25% on standard therapy.
C) December 9th, 2008, 4 weeks on sunitinib, 2 weeks off drug. Tumour masses have decreased by approximately 20% and no new nodules were observed.

6.6. Bibliography

1. Shepherd FA: Molecular selection of patients for first-line treatment of advanced non-small-cell lung cancer with epidermal growth factor inhibitors: not quite ready for prime time. J Clin Oncol 2008, 26(15):2426-2427.
2. Hirsch FR, Bunn PA, Jr.: EGFR testing in lung cancer is ready for prime time. Lancet Oncol 2009, 10(5):432-433.
3. Smith S: MGH to use genetics to personalize cancer care. In: Boston Globe. Boston, MA; 2009.
4. Ionita-Laza I, Lange C, Laird NM: Estimating the number of unseen variants in the human genome. Proc Natl Acad Sci U S A 2009, 106(13):5008-5013.
5. Futreal PA, Coin L, Marshall M, Down T, Hubbard T, Wooster R, Rahman N, Stratton MR: A census of human cancer genes. Nat Rev Cancer 2004, 4(3):177-183.
6. Ding L, Getz G, Wheeler DA, Mardis ER, McLellan MD, Cibulskis K, Sougnez C, Greulich H, Muzny DM, Morgan MB, Fulton L, Fulton RS, Zhang Q, Wendl MC, Lawrence MS, Larson DE, Chen K, Dooling DJ, Sabo A, Hawes AC, Shen H, Jhangiani SN, Lewis LR, Hall O, Zhu Y, Mathew T, Ren Y, Yao J, Scherer SE, Clerc K, Metcalf GA, Ng B, Milosavljevic A, Gonzalez-Garay ML, Osborne JR, Meyer R, Shi X, Tang Y, Koboldt DC, Lin L, Abbott R, Miner TL, Pohl C, Fewell G, Haipek C, Schmidt H, Dunford-Shore BH, Kraja A, Crosby SD, Sawyer CS, Vickery T, Sander S, Robinson J, Winckler W, Baldwin J, Chirieac LR, Dutt A, Fennell T, Hanna M, Johnson BE, Onofrio RC, Thomas RK, Tonon G, Weir BA, Zhao X, Ziaugra L, Zody MC, Giordano T, Orringer MB, Roth JA, Spitz MR, Wistuba II, Ozenberger B, Good PJ, Chang AC, Beer DG, Watson MA, Ladanyi M, Broderick S, Yoshizawa A, Travis WD, Pao W, Province MA, Weinstock GM, Varmus HE, Gabriel SB, Lander ES, Gibbs RA, Meyerson M, Wilson RK: Somatic mutations affect key pathways in lung adenocarcinoma. Nature 2008, 455(7216):1069-1075.
7. Ley TJ, Mardis ER, Ding L, Fulton B, McLellan MD, Chen K, Dooling D, Dunford-Shore BH, McGrath S, Hickenbotham M, Cook L, Abbott R, Larson DE, Koboldt DC, Pohl C, Smith S, Hawkins A, Abbott S, Locke D, Hillier LW, Miner T, Fulton L, Magrini V, Wylie T, Glasscock J, Conyers J, Sander N, Shi X, Osborne JR, Minx P, Gordon D, Chinwalla A, Zhao Y, Ries RE, Payton JE, Westervelt P, Tomasson MH, Watson M, Baty J, Ivanovich J, Heath S, Shannon WD, Nagarajan R, Walter MJ, Link DC, Graubert TA, DiPersio JF, Wilson RK: DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature 2008, 456(7218):66-72.
8. Shah S, Morin R, Khattra J, Prentice L, Pugh TJ, Burleigh A, Delaney A, Gelmon K, Guliany R, Holt RA, Jones SJ, Sun M, Moore R, Teschendorff A, Tse K, Turashivili G, Varhol R, Warren R, Watson P, Zhao Y, Caldas C, Huntsman D, Hirst M, Marra M, Aparicio S: Mutational evolution of a lobular breast tumour, profiled by whole-transcriptome and whole-genome next generation sequencing. Submitted 2009.
9. The Cancer Genome Atlas [http://cancergenome.nih.gov/]
10. Greenman C, Stephens P, Smith R, Dalgliesh GL, Hunter C, Bignell G, Davies H, Teague J, Butler A, Stevens C, Edkins S, O'Meara S, Vastrik I, Schmidt EE, Avis T, Barthorpe S, Bhamra G, Buck G, Choudhury B, Clements J, Cole J, Dicks E, Forbes S, Gray K, Halliday K, Harrison R, Hills K, Hinton J, Jenkinson A, Jones D, Menzies A, Mironenko T, Perry J, Raine K, Richardson D, Shepherd R, Small A, Tofts C, Varian J, Webb T, West S, Widaa S, Yates A, Cahill DP, Louis DN, Goldstraw P, Nicholson AG, Brasseur F, Looijenga L, Weber BL, Chiew YE, DeFazio A, Greaves MF, Green AR, Campbell P, Birney E, Easton DF, Chenevix-Trench G, Tan MH, Khoo SK, Teh BT, Yuen ST, Leung SY, Wooster R, Futreal PA, Stratton MR: Patterns of somatic mutation in human cancer genomes. Nature 2007, 446(7132):153-158.
11. Davies H, Hunter C, Smith R, Stephens P, Greenman C, Bignell G, Teague J, Butler A, Edkins S, Stevens C, Parker A, O'Meara S, Avis T, Barthorpe S, Brackenbury L, Buck G, Clements J, Cole J, Dicks E, Edwards K, Forbes S, Gorton M, Gray K, Halliday K, Harrison R, Hills K, Hinton J, Jones D, Kosmidou V, Laman R, Lugg R, Menzies A, Perry J, Petty R, Raine K, Shepherd R, Small A, Solomon H, Stephens Y, Tofts C, Varian J, Webb A, West S, Widaa S, Yates A, Brasseur F, Cooper CS, Flanagan AM, Green A, Knowles M, Leung SY, Looijenga LH, Malkowicz B, Pierotti MA, Teh BT, Yuen ST, Lakhani SR, Easton DF, Weber BL, Goldstraw P, Nicholson AG, Wooster R, Stratton MR, Futreal PA: Somatic mutations of the protein kinase gene family in human lung cancer. Cancer Res 2005, 65(17):7591-7595.
12. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 2008, 455(7216):1061-1068.
13. Egger G, Liang G, Aparicio A, Jones PA: Epigenetics in human disease and prospects for epigenetic therapy. Nature 2004, 429(6990):457-463.
14. Scholzova E, Malik R, Sevcik J, Kleibl Z: RNA regulation and cancer development. Cancer Lett 2007, 246(1-2):12-23.
15. Schmittgen TD: Regulation of microRNA processing in development, differentiation and cancer. J Cell Mol Med 2008, 12(5B):1811-1819.
16. Talbot SJ, Crawford DH: Viruses and tumours--an update. Eur J Cancer 2004, 40(13):1998-2005.
17. Chatterjee A, Mambo E, Sidransky D: Mitochondrial DNA mutations in human cancer. Oncogene 2006, 25(34):4663-4674.
18. Holbrook JA, Neu-Yilik G, Hentze MW, Kulozik AE: Nonsense-mediated decay approaches the clinic. Nat Genet 2004, 36(8):801-808.
19. Fischer DS, Knobf MT, Durivage HJ: The Cancer Chemotherapy Handbook, 4th edn. St. Louis, MO: Mosby-Year Book, Inc.; 1993.
20. Carmeliet P, Jain RK: Angiogenesis in cancer and other diseases. Nature 2000, 407(6801):249-257.
21. Stratton MR, Campbell PJ, Futreal PA: The cancer genome. Nature 2009, 458(7239):719-724.
22. Yun CH, Mengwasser KE, Toms AV, Woo MS, Greulich H, Wong KK, Meyerson M, Eck MJ: The T790M mutation in EGFR kinase causes drug resistance by increasing the affinity for ATP. Proc Natl Acad Sci U S A 2008, 105(6):2070-2075.
23. Janne PA: Challenges of detecting EGFR T790M in gefitinib/erlotinib-resistant tumours. Lung Cancer 2008, 60 Suppl 2:S3-9.
24. Jones SJ, Laskin JJ, Li Y, Griffith O, Bilenky M, Butterfield Y, Cezard T, Chuah E, Corbett R, Fejes A, Griffith M, Yee J, Martin MA, Mayo M, Melnyk N, Morin R, Pugh TJ, Severson T, Shah S, Tam A, Terry J, Thiessen N, Varhol R, Zeng T, Zhao Y, Moore R, Huntsman D, Birol I, Hirst M, Holt RA, Marra M: Complete genomic characterization of an adenocarcinoma of the tongue provides rational therapeutic options. Submitted 2009.
