UBC Theses and Dissertations

Genetic studies to discover common variants associated with epithelial ovarian cancer risk and variation.. Earp, Madalene A 2012

GENETIC STUDIES TO DISCOVER COMMON VARIANTS ASSOCIATED WITH EPITHELIAL OVARIAN CANCER RISK AND VARIATION IN AGE OF NATURAL MENOPAUSE

by Madalene A Earp

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in The Faculty of Graduate Studies (Medical Genetics)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

December 2012

© Madalene A Earp, 2012

Abstract

Background. Epithelial ovarian cancer (EOC) and age of natural menopause (ANM) are two complex traits impacting women’s health. ANM is also an important EOC risk factor. Insight into genetic factors influencing EOC and ANM could provide novel entry points for understanding EOC pathogenesis and the normal process of ovarian aging.

Methods. A two-stage genome-wide association study (GWAS) design using DNA pooling in Stage 1 was used to discover single nucleotide polymorphisms (SNPs) associated with histology-specific EOC risk and population-specific variation in ANM. SNP-trait associations discovered in Stage 1 of these GWAS were replicated in two different consortia: EOC associations in the Ovarian Cancer Association Consortium (OCAC), and ANM associations in the ReproGen Consortium.

Results. Eight subtype-specific SNP-EOC associations discovered in Stage 1 of the EOC GWAS were replicated (unadjusted P < 0.05) in the datasets of the OCAC: 4 in the mucinous subtype, 2 in the endometrioid and clear cell (ENCC) subtypes combined, and 2 in the low malignancy potential (LMP) serous subtype. These associations did not achieve genome-wide significance (P < 5x10^-8). However, several of the loci implicated by these SNPs harbour attractive candidate genes for ovarian cancer biology, including the mucinous locus harbouring RAD51B, the ENCC locus harbouring GRB10, and the LMP serous locus harbouring BPIL2/C22orf26.
The GWAS of ANM performed in Iranian women revealed one SNP-trait association exclusive to this population (rs10140275; unadjusted P = 4.0x10^-4) and one shared with the European population (rs10840211). In the replication of European GWAS findings in Iranian women, one SNP at the 20p12.3 locus was replicated (rs16991615, unadjusted P = 0.02). SNPs tagging the 19q13.42 locus narrowly missed replication in our dataset (rs1172822, unadjusted P = 0.08).

Conclusion. Eight SNP-EOC associations warrant further replication and/or fine-mapping in samples with reliable tumour information. These SNPs tag loci harbouring candidate genes involved in DNA repair, the PI3K/AKT pathway, and immune function, suggesting these pathways and processes may be important to the etiology of the rarer EOC subtypes. The generality of the European ANM GWAS findings is established in another human population (20p12.3). Further, one genetic variant influencing ANM in the Iranian population but not the European population is reported.

Preface

Chapters 2, 3 and 4 correspond to multi-author collaborations that have produced manuscripts, or are in the process of producing manuscripts, for publication in scientific journals. Chapter 2 has been published in near identical form as: Earp MA, Rahmani M, Chew K, and Brooks-Wilson A. Estimates of array and pool-construction variance for planning efficient DNA-pooling genome wide association studies. BMC Medical Genomics 2011 4:81. Chapter 4 is in submission in near identical form as: Maziar Rahmani*, Madalene A Earp*, Fahimeh Ramezani Tehrani, Mehran Ataee, Jackson Wu, Martin Treml, Ramona Nudischer, Sara Poormohammadvali-Behnami, John R.B. Perry, Joanne M Murabito, Fereidoun Azizi, Angela Brooks-Wilson. Novel and shared genetic factors for age of natural menopause in Iranian and European women (where * indicates equal contribution).
In Chapter 2, experimental design, analysis, and manuscript preparation were carried out by me, with the following exceptions. Figure 2.1 and the Java applet entitled PoolingPlanner were created by Kevin Chew. In Chapter 3, I was responsible for the conceptualization, design, implementation, and analysis of the research activities described. Nhu Le and Linda Cook coordinated the study that obtained the samples used in Chapter 3. Donna Kan and Barbara Jamieson collected and input the epidemiological data used in this study; Steven Leech and Rozmin Janoo-Gilani prepared the accompanying DNA samples. In Chapter 4, experiment conceptualization and design were the work of Maziar Rahmani. My contribution included all data analysis, interpretation of results, and presentation of results, including manuscript preparation (with editing performed by Maziar Rahmani). Mehran Ataee and Jackson Wu assisted in the preparation of the DNA pools used in Chapter 4. Fereidoun Azizi coordinated the original study that obtained the samples used in Chapter 4. Angela Brooks-Wilson advised on and provided editorial assistance with all chapters in this dissertation. The University of British Columbia - British Columbia Cancer Agency Research Ethics Board (UBC BCCA REB) approved the research in Chapter 3 (REB Number H07-02271) and Chapter 4 (REB Number H09-02731).

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
List of Symbols and Abbreviations
Acknowledgments
CHAPTER 1 Introduction
1.1 Thesis Overview
1.2 Genome-Wide Association Studies (GWAS)
1.2.1 Genetic and Statistical Principles of GWAS
1.2.1.1 Linkage Disequilibrium (LD)
1.2.1.2 Single Nucleotide Polymorphisms (SNPs)
1.2.1.3 Common-Variant Common-Disease (CVCD) Hypothesis
1.2.1.4 Minor Allele Frequency (MAF) and Effect Size
1.2.1.5 Consequences of Multiple Testing
1.2.1.6 Multi-Stage Genotyping Approaches
1.2.1.7 Replication
1.2.1.8 Case-Control Selection
1.2.2 Pool-based GWAS Approach
1.2.2.1 Limitations
1.2.2.2 Current Context
1.2.3 Challenges Encountered After GWAS
1.2.3.1 Fine-Mapping and Targeted Sequencing
1.2.3.2 Functional Annotation of Regulatory SNPs
1.2.3.3 Expression Quantitative Trait Loci (eQTL) Studies
1.2.3.4 Cell, Tissue, and Animal Models
1.2.4 Summary of GWAS Successes
1.3 Ovarian Cancer
1.3.1 Types of Ovarian Cancer
1.3.2 Origins of Epithelial Ovarian Cancer (EOC)
1.3.3 Subtypes of EOC
1.3.3.1 Serous
1.3.3.2 Mucinous
1.3.3.3 Endometrioid
1.3.3.4 Clear Cell
1.3.4 Pathogenesis of EOC
1.3.4.1 The Role of Ovulation
1.3.4.2 The Role of Inflammation
1.3.4.3 The Role of Hormones
1.3.4.4 The Role of Heritable Risk Factors
1.3.5 Rare, High Penetrance EOC Risk Alleles
1.3.6 Common, Low to Moderate Penetrance EOC Risk Alleles
1.3.6.1 Locus 9p22.2
1.3.6.2 Locus 8q24.21
1.3.6.3 Locus 2q31
1.3.6.4 Locus 3q25
1.3.6.5 Locus 17q21
1.3.6.6 Locus 19p13
1.3.7 Candidate Gene Association Studies of EOC
1.3.8 Conclusion to EOC Introduction
1.4 Ovarian Aging and Age at Natural Menopause (ANM)
1.4.1 Relevance to Health and Disease
1.4.2 ANM Genetic Risk Factors: GWAS in Europeans
1.4.3 ANM Genetic Risk Factors: Candidate Gene Studies
1.4.4 Population-Specific ANM Genetic Risk Factors
1.5 Conclusion
1.5.1 GWAS of Subtype-Specific EOC
1.5.2 GWAS of Population-Specific ANM
1.5.3 Planning DNA-Pooling GWAS
CHAPTER 2 Estimates of Array and Pool-Construction Variance for Planning Efficient DNA-Pooling Genome-Wide Association Scans
2.1 Background
2.2 Methods
2.2.1 Data Collection
2.2.2 Statistical Analysis
2.2.3 PoolingPlanner Theory
2.3 Results
2.3.1 Array Variance: Type A Comparisons
2.3.2 Pooling Variance: Type B and C Comparisons
2.3.3 PoolingPlanner Example
2.4 Discussion
2.5 Conclusion
CHAPTER 3 Discovery of Subtype-Specific Epithelial Ovarian Cancer Risk Alleles Using a Pool-Based Genome-Wide Association Scan
3.1 Background
3.2 Methods
3.2.1 Discovery Stage
3.2.1.1 Sample and Subtype Description
3.2.1.2 DNA Pooling Procedures
3.2.1.3 Quality Control and Normalization of Array Data
3.2.1.4 Analysis Approach
3.2.1.5 Pool-Based Test of Association
3.2.1.6 Filtering of Associated SNPs
3.2.1.7 SNP Selection for Replication
3.2.2 Replication Stage
3.2.2.1 Sample Description
3.2.2.2 Genotyping and Quality Control
3.2.2.3 Statistical Analysis
3.2.3 In Silico SNP Analysis
3.3 Results
3.3.1 Discovery Stage
3.3.1.1 Sample and Subtype Description
3.3.1.2 Array Quality Control and Normalization
3.3.1.3 Validation
3.3.2 Replication Stage
3.3.2.1 Mucinous Results
3.3.2.2 Endometrioid and Clear Cell Results
3.3.2.3 Low Malignancy Potential/Borderline Serous Results
3.3.2.4 High Grade Serous Results
3.3.3 In Silico SNP Analysis
3.3.3.1 Mucinous
3.3.3.2 Endometrioid and Clear Cell
3.3.3.3 Low Malignancy Potential/Borderline Serous
3.4 Discussion
3.4.1 Mucinous
3.4.2 Endometrioid and Clear Cell
3.4.3 Low Malignancy Potential/Borderline Serous
3.4.4 High Grade Serous
3.5 Study Limitations
3.6 Conclusion
CHAPTER 4 Novel and Shared Genetic Factors for Age of Natural Menopause in Iranian and European Women
4.1 Background
4.2 Methods
4.2.1 Iranian Sample Collection
4.2.2 Discovery Stage DNA Pooling Procedures
4.2.3 Quality Control and Normalization of Discovery Stage Array Data
4.2.4 Association Analysis in Discovery Stage Array Data
4.2.5 Filtering of Associated SNPs
4.2.6 SNP Selection for Replication Stage
4.2.7 Replication Stage Statistical Analysis
4.3 Results
4.3.1 Discovery Stage
4.3.1.1 DNA Pooling
4.3.1.2 Quality Control and Normalization of Pooling Data
4.3.1.3 Quality Control of Individual Genotyping Data
4.3.1.4 Validation of SNPs Selected for Replication
4.3.2 Replication
4.3.2.1 Pool-Based GWAS SNPs
4.3.2.2 Literature SNPs
4.4 Discussion
4.4.1 Pool-Based GWAS SNPs
4.4.2 Literature SNPs
4.5 Study Limitations
4.6 Conclusion: Novel and Shared Genetic Factors for ANM
CHAPTER 5 Discussion
5.1 Impression of the Pool-Based GWAS Approach
5.1.1 Replicate Array Considerations
5.1.2 Analysis Software
5.1.3 Importance of Validation
5.1.4 Importance of Replication
5.2 Relevance of Subtype-Specific EOC Risk Alleles
5.2.1 DNA Repair and High-Grade EOC
5.2.2 PI3K/RAS Signaling and Low Grade EOC
5.2.3 Subtype-Specificity of Common, Low Penetrance EOC Risk Alleles
5.3 Common Genes and Pathways in ANM Variation and EOC Risk
5.4 Future Studies of EOC and ANM Associated SNPs
REFERENCES
APPENDIX A. List of Genes by Gene Name
APPENDIX B. Appendices to Chapter 2
Appendix B.1. Pool and Array Summary.
Appendix B.2. Estimates of Variance Components, 1M-Single Array.
Appendix B.3. Estimates of Variance Components, 1M-Duo and 660-Quad Arrays.
Appendix B.4. Average MAF on Illumina Arrays.
Appendix B.5. Array Allocation, Effective Sample Size, and MDOR.
Appendix B.6. Power Curve Example.
APPENDIX C. Appendices to Chapter 3
Appendix C.1. SNP Exclusions.
APPENDIX D. Appendices to Chapter 4
Appendix D.1. Power Calculations.

List of Tables

Table 1.1. Features of Five Commonly Described EOC Subtypes.
Table 1.2. Common Low to Moderate Penetrance EOC Risk Alleles.
Table 2.1. Array Variance for Illumina Arrays.
Table 2.2. Impact of Replicate Arrays on Effective Sample Size (N*) and Minimum Detectable Odds Ratio (MDOR).
Table 3.1. Discovery Stage EOC Pools by ICD-O-3 Codes.
Table 3.2. Summary of SNPs Selected for Replication.
Table 3.3. Association and Rank Information for the Top 30 MUC EOC Loci.
Table 3.4. Association and Rank Information for the Top 30 ENCC EOC Loci.
Table 3.5. Association and Rank Information for the Top 30 LMP SER EOC Loci.
Table 3.6. Association and Rank Information for the Top 30 HG SER EOC Loci.
Table 3.7. Association and Rank Information for the Singleton SNPs.
Table 3.8. EOC Cases and Controls in OCAC by Subtype and Tumour Behaviour.
Table 3.9. Properties of Samples in the Discovery Stage Case-Control Pools.
Table 3.10. MUC EOC SNPs Replicated in OCAC, Stratified by Subtype.
Table 3.11. MUC EOC SNPs Replicated in OCAC, Stratified by Tumour Behaviour.
Table 3.12. ENCC SNPs Replicated in OCAC, Stratified by Subtype.
Table 3.13. ENCC SNPs Replicated in OCAC, Stratified by END and CC Subtypes.
Table 3.14. LMP-SER SNPs Replicated in OCAC, Stratified by Subtype.
Table 3.15. Summary of Nine Subtype-Specific SNPs Replicated in OCAC.
Table 3.16. Proxy SNPs With In Silico Predicted Functional Consequences.
Table 4.1. Summary of SNPs Selected for Individual Genotyping in Iranian Women.
Table 4.2. Association and Rank Information for 102 Iranian GWAS SNPs Chosen for Individual Genotyping.
Table 4.3. SNPs Associated With ANM in 782 Iranian Women.
Table 4.4. Replication of SNP-ANM Associations Reported by Previous GWAS in 782 Iranian Women.
Table 4.5. Replication of SNP-ANM Associations Reported by Candidate Gene in 782 Iranian Women.

List of Figures

Figure 1.1. Possible Cells of Origin in Ovarian Cancers.
Figure 1.2. Model of Formation of Inclusion Cyst.
Figure 2.1. Overview of Three Possible Pair-Wise Array Comparisons.
Figure 2.2. Box Plots of Array Variance for Three Illumina Array Types.
Figure 2.3. Box Plots of Array Variance for Illumina 1M-Duo Arrays.
Figure 2.4. Decomposition of Pooling Variance for Illumina 1M-Single Arrays.
Figure 2.5. Decomposition of Pooling Variance for Illumina 660-Quad Arrays.
Figure 2.6. Decomposition of Pooling Variance for Illumina 1M-Duo Arrays.
Figure 2.7. Example of PoolingPlanner.
Figure 2.8. Example Use of PoolingPlanner.
Figure 3.1. Summary of Ovarian Cancer GWAS Stages and Outcomes.
Figure 3.1. Breakdown of the Discovery Stage Case-Control Pools by 10 Year Age Bins.
Figure 3.2. Correlation in Allele Frequency Estimation Between Two Arrays.
Figure 3.3. Distribution of Standard Deviations of Allele Frequency Estimates.
Figure 3.4. Hierarchical Cluster Plot of Discovery Stage Illumina 660-Quad Array Data Before Normalization.
Figure 3.5. Hierarchical Cluster Plot of Discovery Stage Illumina 660-Quad Array Data After Normalization.
Figure 3.6. Correlation Between Pool-Based and IG-Based Allele Frequency Estimates.
Figure 3.7. Allele Frequency Difference for 188 SNPs Based on IG and Pooling Data, Stratified by Case Pool.
Figure 3.8. Allele Frequency Difference for 188 SNPs Based on IG and Pooling Data, Stratified by SNP Filtering Method.
Figure 4.1. Summary of ANM Study Objectives, Steps, and Outcomes.
Figure 4.2. Hierarchical Cluster Plot of 8 Illumina 1M-Duo Arrays.
Figure 4.3. Histogram of the Standard Deviations Associated with SNP Allele Frequencies.
Figure 4.4. Correlation Between Pool and IG-Based Allele Frequency Estimates.
Figure 4.5. Allele Frequency Difference for SNPs.

List of Symbols and Abbreviations

1M-Duo  Illumina Human1M Duo array
1M-Single  Illumina Human1M Single array
3MAD  3-Median Absolute Deviations
660-Quad  Illumina HumanHap660 Quad array
AF  Allele Frequency
AFD  Allele Frequency Difference
AFE  Allele Frequency Estimation
AIMs  Ancestry Informative Markers
ANM  Age of Natural Menopause
ATP  Adenosine-5'-Triphosphate
BCC  BC Controls
BCCA  British Columbia Cancer Agency
BCCR  British Columbia Cancer Registry
BMI  Body Mass Index
CC  Clear Cell
CI  Confidence Interval
CNV  Copy Number Variation
COGS  Collaborative Oncological Gene-environment Study
DF  Degrees of Freedom
DL  DerSimonian and Laird
DNA  Deoxyribonucleic Acid
DSBR  Double Strand Break Repair
EDTA  Ethylenediaminetetraacetic acid
ENCC  Endometrioid/Clear Cell Combined
END  Endometrioid
EOC  Epithelial Ovarian Cancer
ESPERR  Evolutionary Sequence Pattern Extraction Reduced Representation
ESS  Effective Sample Size
FA  Fanconi anemia
FIGO  International Federation of Gynecology and Obstetrics
FSH  Follicle stimulating hormone
GWA  Genome-Wide Association
GWAS  Genome-Wide Association Studies
GWS  Genome-Wide Significant
HAT  Histone Acetyltransferases
HBOCS  Hereditary Breast and Ovarian Cancer Syndrome
HG  High Grade
HGNC  HUGO Gene Nomenclature Committee
HR  Hazard Ratio
HWE  Hardy Weinberg Equilibrium
ICD-O-3  International Classification of Diseases for Oncology, 3rd Edition
IG  Individual Genotyping
IGFs  Insulin-like Growth Factors
IV  Inverse Variance
kb  Kilobase
LD  Linkage Disequilibrium
LH  Luteinizing Hormone
LMP  Low Malignancy Potential
LOD  Logarithm of the Odds
LOF  Loss of Function
LS  Lynch Syndrome
MAF  Minor Allele Frequency
Mb  Megabase
MDOR  Minimum Detectable Odds Ratio
MMR  Mismatch Repair
mtDNA  Mitochondrial DNA
MUC  Mucinous
NOS  Not Otherwise Specified
NS-SNP  Non-Synonymous SNP
OA  Ovarian Aging
OC  Oral Contraceptives
OCAC  Ovarian Cancer Association Consortium
OR  Odds Ratio
OSE  Ovarian Surface Epithelium
OVAL-BC  Ovarian Cancer in Alberta and British Columbia
PARP  Poly (ADP-ribose) Polymerase
PCA  Principal Components Analyses
PCOS  Polycystic Ovarian Syndrome
POF  Premature Ovarian Failure
QC  Quality Control
RSS  Relative Sample Size
SD  Standard Deviation
SE  Standard Error
SER  Serous
SMP  Screening Mammography Program
SNP  Single Nucleotide Polymorphism
SSBR  Single Strand Break Repair
SSE  Splice Site Enhancer
SSS  Splice Site Suppressor
TCA  Tricarboxylic Acid
TCGA  The Cancer Genome Atlas
TE  Tris-EDTA
TF  Transcription Factor
TFBS  Transcription Factor Binding Site
TLGS  Tehran Lipid and Glucose Study
TSG  Tumour Suppressor Genes
UTR  Untranslated Region
var(earray)  Array Variance
var(econstruction)  Construction Variance
var(epool)  Pool Variance
WHO  World Health Organization

Acknowledgments

I would like to thank OvCare for their generous financial support, and OvCare members for their thoughtful discussion on ovarian cancer biology, and the subtype-specific ovarian cancer GWAS described in Chapter 3, particularly David Huntsman, Blake Gilks, and Ken Swenerton. I would like to thank members of OVAL-BC for involving me in their study, particularly Nhu Le, Linda Cook, and Linda Kelemen. I would like to thank all members of OCAC for providing such a welcoming and collaborative research environment. I feel that OCAC is an exemplary model of how collaborative research can and should be done.

I would like to thank all of the members of the Brooks-Wilson Lab for providing a fun and supportive environment for doing my PhD. I thank the members of my thesis committee, Marco Marra, Rob Holt, and Nhu Le, for agreeing to take on yet another graduate student. I thank my supervisor, Angie Brooks-Wilson, for adopting an orphan graduate student.
CHAPTER 1 Introduction

1.1 Thesis Overview

The principal aim of this thesis was to identify genes influencing epithelial ovarian cancer (EOC) risk and variation in age of natural menopause (ANM). These are two complex human conditions that affect women’s health. The experimental approach used was a two-stage genome-wide association study (GWAS) using a DNA pooling strategy. Briefly, in stage 1, the discovery stage, DNA samples were physically pooled and assayed on genotyping arrays. SNP allele frequencies were estimated in the DNA pools and then used in tests of association. In stage 2, the replication stage, SNPs found to be associated in the discovery stage were tested for association in additional samples using individual genotyping. Chapter 2 presents a methods-based analysis of the DNA pooling strategy in the context of a GWAS. It examines the parameters essential to planning an optimal GWAS using DNA pooling. In particular, Chapter 2 addresses how to balance the loss of power inherent to pooling DNA with the cost savings afforded by using fewer arrays, and the ability to recover power by increasing sample size. Further, Chapter 2 discusses how to estimate the effect sizes one has power to detect. Chapter 3 is a subtype-specific GWAS of EOC risk. In the discovery stage of this GWAS, a unique sample resource provided by the Cancer Control Research unit at the British Columbia Cancer Agency (BCCA) was used: a population-based collection of ~1,500 women diagnosed with EOC between 2002 and 2008, and ~1,500 appropriately matched controls. In the replication stage of this GWAS, the combined sample resources of the Ovarian Cancer Association Consortium (OCAC), an international group dedicated to discovering genetic and environmental risk factors involved in EOC, were used. It is hoped that an improved understanding of the basic biology of EOC, particularly on a subtype-specific level, will lead to more effective treatment and/or prevention strategies for this cancer.
Chapter 4 is a two-stage GWAS of ANM in women of Iranian ancestry. ANM is an important risk factor for several diseases, including ovarian cancer1,2, endometrial cancer3, and breast cancer4. A large meta-analysis of ANM GWAS performed in European women was published in 2012, and revealed 17 loci associated with ANM5. Our study allows us to establish the generality of the European findings in another human population, as well as discover putative novel genetic variants influencing ANM in the Iranian population. Chapter 5 is an integrated discussion of the findings and lessons learned from the pool-based GWAS of EOC and ANM.

This introduction first presents the main experimental approach used in Chapters 3 and 4. It then reviews EOC, focusing on the subtype-specific risk factors and the proposed mechanisms of pathogenesis. ANM is then reviewed, focusing on the known heritable risk factors, and population-specific differences in this trait.

1.2 Genome-Wide Association Studies (GWAS)

One way to go about discovering genes influencing a trait is the hypothesis-driven candidate gene approach. To be successful, this approach requires prior biological understanding in order to come up with a list of genes to study. For example, candidate gene studies of EOC have focused on genes belonging to the steroid hormone (e.g. PGR, ESR1, CYP3A4, CYP19A1, SRD5A2), DNA repair (e.g. BRCA1, BRCA2, BRIP1, ERCC1, XRCC2, RAD51, MMR genes) and cell cycle control pathways (e.g. ABL1, CCND1, CDK genes)6, based on their biological relevance to the ovary and cancer. Unfortunately, the candidate gene approach has met with limited success in studies of EOC and ANM. Few, if any, robust genetic associations emerged6,7. An alternative to the candidate gene approach is the hypothesis-free genome-wide association study (GWAS) approach. Here, genetic markers spread across and tagging the entire genome are used in tests of association with the phenotype of interest.
The first successful GWAS8 was published in 2005; however, the seminal Wellcome Trust Case Control Consortium (WTCCC)9 paper published in Nature in 2007 is considered the start of this study design’s rise in popularity. The WTCCC study was the first well-designed and well-powered GWAS using cost-effective commercially available genotyping arrays. Since 2007, GWAS have identified over 4000 genetic variants associated with complex human traits, and have become a vital hypothesis-generating research tool10.

1.2.1 Genetic and Statistical Principles of GWAS

In principle, a genetic association study looks to find an allele that is observed more often than expected by chance in individuals with a trait of interest than in those without. Typically these studies compare genotypes (aa, Aa, AA) between a collection of cases and controls. In the past, association studies would test 10-1000 genetic variants, typically selected from candidate gene(s). More recently, hundreds of thousands of SNPs (>500,000) distributed throughout the genome are tested for association. Association studies using the latter approach are referred to as GWAS. The polymorphic sites most commonly genotyped in GWAS are SNPs. SNP variants may exert a direct functional effect, or they may exert an indirect functional effect if they are in linkage disequilibrium (LD) with a second polymorphism that exerts a functional effect11. LD is the association of alleles at two or more loci. LD is observed in this context because the two SNPs are in proximity on the same chromosome, sufficiently close that the probability of a crossover event (recombination) separating them in meiosis is small, and thus they are inherited together11. The trait-associated SNPs typically reported by GWAS are genetic ‘flags’ indicating regions of the genome where causal elements likely reside.

1.2.1.1 Linkage Disequilibrium (LD)

LD is an important concept in GWAS.
Although a simplification, the human genome is described as being inherited in discrete blocks of DNA in high LD12. Specific regions of the genome experience frequent crossover events and are called recombination ‘hot-spots’; these form the boundaries of LD blocks13,14. The r² statistic is used to measure the level of LD between two bi-allelic SNPs15. r² takes on values between 0 and 1. When r² = 1 (“perfect LD”), only two combinations of alleles are observed for two bi-allelic SNPs. When r² = 0 (“no LD”), all four combinations of alleles are observed at the frequencies expected based on their minor allele frequencies (MAF) and random assortment. Bi-allelic SNPs with an r² > 0.8 are referred to as being in high LD. LD between ancestral genomic variants allows one to genotype fewer than a million SNPs, yet make statements about association on a “genome-wide” level (>3 billion nucleotide positions). SNPs in high LD are transmitted together, and by identifying the allele at one site, alleles at other polymorphic sites can be unambiguously inferred. SNPs transmitted together are said to be part of a haplotype block. Human populations, defined geographically or by historical events, share LD block structure, along with the SNPs that have arisen within those blocks13,16. When performing a GWAS, it is important that the samples being compared are drawn from the same human population, to avoid detecting associations arising from population-specific differences in SNP allele frequency, rather than trait-associated differences. LD blocks allow for the simplification of the human genome when performing a GWAS, but they come with a caveat. Once a SNP in an LD block is associated with a trait of interest, determining where in the block causal variant(s) lie requires a different approach17. Some blocks in the human genome extend over 0.5 mega-bases (Mb), making this a challenging problem.
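As a rough illustration of the r² statistic described in this section, the sketch below computes r² for two bi-allelic SNPs from one haplotype frequency and the two allele frequencies. The function name and the example frequencies are hypothetical and are not taken from the thesis data.

```python
def ld_r_squared(p_ab: float, p_a: float, p_b: float) -> float:
    """Return r^2 between two bi-allelic SNPs.

    p_ab: frequency of the haplotype carrying allele A (SNP 1) and allele B (SNP 2)
    p_a:  frequency of allele A at SNP 1
    p_b:  frequency of allele B at SNP 2
    """
    # D is the deviation of the observed haplotype frequency from the
    # frequency expected if the two alleles assorted independently.
    d = p_ab - p_a * p_b
    denom = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    if denom == 0.0:
        raise ValueError("both SNPs must be polymorphic")
    return d * d / denom


# Hypothetical example 1: allele A always travels with allele B,
# so only two haplotypes exist and r^2 is (numerically) 1.
perfect = ld_r_squared(p_ab=0.3, p_a=0.3, p_b=0.3)

# Hypothetical example 2: haplotype frequency equals the product of the
# allele frequencies, so r^2 is (numerically) 0.
none = ld_r_squared(p_ab=0.09, p_a=0.3, p_b=0.3)

print(round(perfect, 6), round(none, 6))
```

In practice, haplotype frequencies are themselves estimated from genotype data, so a value computed this way is only as good as that estimate.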

1.2.1.2 Single Nucleotide Polymorphisms (SNPs)

SNPs are the mainstay genetic variant used in GWAS because they are abundant, distributed throughout the genome, and easy to quickly genotype in a cost-efficient manner. SNP chips (i.e. arrays) that simultaneously assay 1 million SNPs are commercially available, and have made the genotyping step in GWAS relatively straightforward. Illumina SNP arrays are used in Chapters 3 and 4. SNPs on Illumina arrays were chosen based on the patterns of LD observed in the European population, to allow each LD block to be tested for association18. Most GWAS use a two-stage or three-stage approach, whereby each successive stage of genotyping is performed on fewer SNPs but more samples (for those SNPs significant in the preceding phases). This approach saves on genotyping cost while minimally affecting power19.

1.2.1.3 Common-Variant Common-Disease (CVCD) Hypothesis

When GWAS were first initiated, the common-variant common-disease (CVCD) hypothesis was in favour. It stated that complex diseases are likely largely attributable to a moderate number (<50) of common variants (MAF>5%), each accounting for several percent (>2%) of the genetic variance for a trait20. Based on five years of GWAS, this model has been updated: complex diseases are likely largely attributable to a large number (>200) of genetic variants spanning the full allelic spectrum (MAF>1%), each variant accounting for a very small percentage (<0.02%) of the genetic variance for a trait (also called the infinitesimal model)21. With few exceptions, the loci detected by GWAS have been found to confer very small effects. For example, in 2009 the average SNP association reported by GWAS was estimated to have a modest effect size, a median per risk allele OR of 1.33 (Hindorff et al. 2009). Rarer alleles (MAF<2%) are now thought to play a more important role than previously thought, as are gene-by-gene and gene-by-environment interactions21.
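To make the notion of a per-allele odds ratio concrete, the following minimal sketch computes an allelic OR and an approximate 95% confidence interval (Woolf's logit method) from a 2x2 table of allele counts. The counts are invented for illustration (1,000 cases and 1,000 controls, risk allele frequencies of 0.25 and 0.20) and were chosen only so that the OR lands near the median value of 1.33 quoted above; the function name is likewise hypothetical.

```python
import math


def allelic_odds_ratio(case_risk, case_other, ctrl_risk, ctrl_other):
    """Allelic odds ratio with an approximate 95% CI (Woolf's method).

    Each argument is a count of alleles (two per person), not of people.
    """
    odds_ratio = (case_risk * ctrl_other) / (case_other * ctrl_risk)
    # Standard error of log(OR) from the four cell counts.
    se_log_or = math.sqrt(
        1 / case_risk + 1 / case_other + 1 / ctrl_risk + 1 / ctrl_other
    )
    lower = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
    upper = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts: 2,000 case alleles (500 risk) and
# 2,000 control alleles (400 risk).
or_, lo, hi = allelic_odds_ratio(500, 1500, 400, 1600)
print(f"OR = {or_:.2f} (95% CI: {lo:.2f}-{hi:.2f})")
```

Even with 1,000 cases and 1,000 controls, the interval around an OR of this size is wide, which is one way to see why small effects require large sample sizes.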

1.2.1.4 Minor Allele Frequency (MAF) and Effect Size

GWAS were not initially designed or powered to detect rarer variants (MAF<2%), or variants conferring small effects (OR<1.5); rather, they were designed and powered to detect common (MAF>5%) variants that exert at least a moderate effect (OR>1.5) (consistent with the CVCD hypothesis). This statement applies to the two GWAS presented in Chapter 3 and Chapter 4 of this thesis (each initiated in 2009). In recent years (2011 onwards), by dramatically increasing sample size (>8,000 cases and >8,000 controls) and improving statistical techniques, GWAS have successfully detected associations with common SNPs that exert very small effects (OR: 1.07-1.1)22,23, and associations with rarer (MAF: 0.2%-0.5%) SNPs that exert large effects (OR>4)24,25. However, these GWAS are very costly due to the number of samples that need to be genotyped to detect these variants.

1.2.1.5 Consequences of Multiple Testing

A complication of the GWAS approach is the enormous number of statistical tests that need to be performed to exhaustively query the human genome (see section 1.2.1.1 for a relevant discussion of LD). Multiple testing necessitates that a very stringent statistical threshold is imposed to avoid reporting false positive results. A P-value of ~10⁻⁷ in the GWAS setting corresponds to a P-value of ~0.05 in the traditional, classical epidemiological study assuming a single hypothesis is tested17. If multiple analyses are performed, for example, testing different genetic models, even this threshold is insufficient. To achieve P-values of much less than 10⁻⁷ for realistic SNP effect sizes (that is, ORs <1.3), very large sample sizes are needed.

1.2.1.6 Multi-Stage Genotyping Approaches

One approach to managing the GWAS sample size and cost burden is the multi-stage design. This is not a discovery/replication design, but a more cost-efficient approach for discovery given a fixed number of samples19,26.
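As a brief aside on the threshold quoted in section 1.2.1.5, the sketch below shows how a P-value of roughly 10⁻⁷ arises from a Bonferroni-style correction. It assumes about 500,000 independent tests, which is an illustrative figure only: the effective number of independent tests depends on LD and on the array used. The function name is hypothetical.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold that keeps the family-wise
    error rate at `alpha` across `n_tests` independent tests."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


n_tests = 500_000   # assumed number of independent tests (illustrative)
alpha = 0.05        # conventional single-test significance level

threshold = bonferroni_threshold(alpha, n_tests)

# Number of SNPs expected to pass P < 0.05 by chance alone if no SNP
# were truly associated and no correction were applied.
expected_false_hits = alpha * n_tests

print(f"per-test threshold: {threshold:.1e}")
print(f"expected chance hits at P < 0.05: {expected_false_hits:.0f}")
```

Testing several genetic models per SNP multiplies the number of tests, which is why the text notes that even this threshold can be insufficient.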
This approach involves successive stages of genotyping, performed on fewer and fewer SNPs in each stage (only the most associated SNPs from the previous stages). In Stage 1, genome-wide analyses are performed on samples using commercially available genotyping arrays, and “promising” SNPs19, typically the most associated 0.1% of the SNPs tested in Stage 1, are taken forward to Stage 2. These SNPs are not required to be genome-wide significant; thus, this is not a statistically stringent step. Carrying forward a large number of SNPs minimizes false negatives. In Stage 2, analyses are performed on a small subset of SNPs in more samples (typically 1-3 times more samples than were used in Stage 1) using custom genotyping. After Stage 2 (possibly Stage 3 or more), SNPs that achieve genome-wide significance can be reported as a positive result; hence, these latter steps are statistically stringent and remove false positive results. Cost savings of ~50% are possible with this design, and studies retain comparable power and false positive rates (assuming 50% of samples are used in Stage 1 and 0.1% of SNPs are forwarded to Stage 2) relative to genotyping all samples and SNPs at once27. Genome-wide significant associations discovered by a GWAS still need to be replicated by independent studies to exclude the possibility of false positive results arising from undetectable systematic bias.

1.2.1.7 Replication

Replication in an independent sample collection, particularly if that sample is from a different human population, represents the most reliable evidence of a genetic association14. For this reason, replication is viewed as a critical step in evaluating the association results of a GWAS. In GWAS, the term replication is reserved for tests involving the same allele (called ‘exact replication’) or the same haplotype (called ‘local replication’), given the same phenotype, genetic model, and population (broadly defined, e.g. European vs. Asian) as the original signal.
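Returning briefly to the cost argument made for the multi-stage design in section 1.2.1.6, the arithmetic behind the ~50% figure can be sketched as follows. The sketch counts genotypes only and assumes that every genotype costs the same in both stages; this is a simplifying assumption, since custom Stage 2 genotyping usually has a different per-genotype cost than array genotyping. The function name is hypothetical.

```python
def two_stage_genotype_fraction(frac_samples_stage1: float,
                                frac_snps_stage2: float) -> float:
    """Genotypes required by a two-stage design, expressed as a fraction
    of the genotypes required to type every SNP in every sample."""
    # Stage 1: a fraction of the samples, typed for all SNPs.
    stage1 = frac_samples_stage1 * 1.0
    # Stage 2: the remaining samples, typed only for the SNPs carried forward.
    stage2 = (1.0 - frac_samples_stage1) * frac_snps_stage2
    return stage1 + stage2


# Values quoted in the text: 50% of samples in Stage 1,
# 0.1% of SNPs carried forward to Stage 2.
fraction = two_stage_genotype_fraction(0.5, 0.001)
print(f"genotypes needed: {fraction:.2%} of the one-stage design")
print(f"approximate saving: {1.0 - fraction:.2%}")
```

Under these assumptions almost all of the genotyping effort sits in Stage 1, so the saving is driven mainly by the fraction of samples held back for Stage 2.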
In Chapter 4, we replicate genetic associations discovered in the European population in the Iranian population, and in doing so provide further evidence for true genetic association. Although replication seems straightforward, failure to replicate has been a recurring problem for GWAS17,28,29, and is typically attributed to between-study heterogeneity and bias, and/or insufficient samples in the replication phases to detect the true effect size of the associated SNPs (which are often overestimated in the initial phase, i.e. the winner’s curse)30-33.

1.2.1.8 Case-Control Selection

A critical step in conducting a GWAS is to obtain a sufficiently large and homogenous collection of cases and controls. An ideal case collection is a group of individuals who show the trait of interest for the same genetic reasons. Genetic heterogeneity, when individuals exhibit a trait due to different genetic factors, will hinder detection of SNP-trait associations, but can be difficult to assess beforehand. Very precise phenotype definitions, i.e. cases that are as homogenous as possible for the trait of interest, can be used to try to minimize its effects. This approach is used in Chapter 3 to discover novel loci associated with subtype-specific EOC, associations which were not detected when EOC was studied as a single disease. Another strategy to enrich for cases sharing heritable factors has been to focus on extreme cases, a strategy called ‘selective genotyping’34. This approach is used in Chapter 4 to discover novel loci associated with age at natural menopause in Iranian women. An ideal control collection is a group of individuals who do not differ genetically from cases for reasons other than the trait being studied. Cases and controls need to be of the same ancestral background, otherwise differences in allele frequency due to population history (population stratification) will be detected, rather than those due to case-control status.
When conducting a GWAS, ancestry informative markers (AIMs) can be used to identify and exclude individuals having substantially different genetic background prior to association analysis35. This is not an option in GWAS using a DNA-pooling strategy as samples are physically combined prior to genotyping. Matching cases to controls is therefore particularly important in pool-based GWAS. Alternatively, AIMs can be genotyped prior to performing a pool-based GWAS, and used to exclude samples as necessary36,37. Several years of GWAS have shown that the SNPs associated with common complex disease often confer small effects (per minor allele odds ratio; OR ≤ 1.3). To be able to detect these associations, i.e. to obtain a collection of cases and controls that is sufficiently large, it is now common for researchers to participate in large consortia that combine their samples to achieve greater power5,38-40. Consortia are used in Chapters 3 and 4 to replicate the associations detected by the discovery stage of the pool-based GWAS.

1.2.2 Pool-based GWAS Approach

Despite the relatively small per sample and per SNP cost of genotyping, GWAS are still very expensive due to the number of samples that need to be genotyped. In many cases, this cost puts GWAS beyond the means of research groups. DNA pooling is one strategy that has been adopted to reduce GWAS cost. To use this strategy, individual samples are physically pooled to create a single composite sample, and this pool is assayed on SNP arrays. Data from SNP arrays are used to estimate allele frequency in DNA pools, not determine genotypes. Pearson et al. 2007 empirically demonstrated that a GWAS using the DNA pooling strategy is capable of detecting associations discovered by conventional GWAS, but for a fraction of the cost41.
Subsequently, several studies used DNA-pooling in a GWAS context and reported novel genome-wide significant (GWS) associations (p<5x10⁻⁷)14 in a variety of complex human diseases and traits, including: follicular lymphoma, otosclerosis, multiple sclerosis, Alzheimer disease, melanoma, psoriasis, and skin colour36,42-46. Like conventional GWAS, most GWAS using a DNA pooling strategy are multi-stage. DNA pooling is only used in the discovery stage; subsequent replication stages use individual genotyping.

1.2.2.1 Limitations

Loss of power when pooling DNA occurs because SNP allele frequencies in pools are estimated, not directly calculated from individual genotypes. This introduces error into the calculation of any test of association and increases Type I (false positive) and Type II (false negative) errors41. Type I errors in any GWAS, pooling or otherwise, have the potential to generate extreme values for the association test statistic. These spurious results can dominate the ends of the test statistic distribution and obscure true association signals. Type II errors can cause true associations to be missed. To reduce the error in allele frequency estimation, replicate arrays are used to assay the same DNA pool (typically between 4 and 8 replicate arrays)41,47-49. Some groups have chosen to construct and analyze replicate pools as well; however, the variance in allele frequency estimation is largely due to array effects, not pool construction effects47,49. Some power is undeniably lost when DNAs are pooled; however, power can be recovered at no additional genotyping cost by increasing the number of unique samples in each DNA pool, assuming samples are available41. How much power is lost can be modeled prior to conducting an experiment, allowing one to plan accordingly, and anticipate the effect sizes that will be detectable. These issues are discussed in further detail in Chapter 2.
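The role of replicate arrays described above can be sketched with a simplified test statistic. The example averages replicate allele frequency estimates for a case pool and a control pool, then forms a z-statistic whose variance combines binomial sampling variance (2N chromosomes per pool) with the measurement variance of the replicate mean. This is a generic illustration under stated assumptions, not necessarily the statistic used in Chapter 2; the function name, pool sizes, and replicate values are all hypothetical.

```python
import math
from statistics import mean, variance


def pooled_allele_z(case_reps, ctrl_reps, n_cases, n_ctrls):
    """z-statistic for a case/control allele frequency difference
    estimated from replicate arrays of two DNA pools.

    case_reps, ctrl_reps: allele frequency estimates, one per replicate array
    n_cases, n_ctrls:     number of individuals in each pool
    """
    p_case = mean(case_reps)
    p_ctrl = mean(ctrl_reps)
    # Sampling variance: each pool of N diploid individuals holds 2N alleles.
    sampling_var = (p_case * (1 - p_case) / (2 * n_cases)
                    + p_ctrl * (1 - p_ctrl) / (2 * n_ctrls))
    # Measurement variance of the replicate mean shrinks as arrays are added.
    measurement_var = (variance(case_reps) / len(case_reps)
                       + variance(ctrl_reps) / len(ctrl_reps))
    z = (p_case - p_ctrl) / math.sqrt(sampling_var + measurement_var)
    return p_case, p_ctrl, z


# Hypothetical data: four replicate arrays per pool, 500 people per pool.
case_reps = [0.31, 0.33, 0.32, 0.30]
ctrl_reps = [0.26, 0.27, 0.25, 0.26]
p_case, p_ctrl, z = pooled_allele_z(case_reps, ctrl_reps, 500, 500)
print(f"case AF = {p_case:.3f}, control AF = {p_ctrl:.3f}, z = {z:.2f}")
```

Because the measurement term is divided by the number of replicate arrays while the sampling term is divided by the number of individuals, the sketch shows why adding arrays reduces estimation error and adding samples to a pool recovers power.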
Another limitation is that only allele-based tests of association are possible, and these are not as powerful as genotype-based tests. Finally, detecting, excluding, or adjusting for individuals whose ancestry differs from that of other samples is not possible with pooled DNA, nor is removing individuals showing cryptic relatedness.

1.2.2.2 Current Context

DNA pooling affords a dramatic cost reduction. For diseases or traits with unknown biology or genetic involvement, which cannot or will not realistically be pursued using conventional GWAS due to funding constraints, pooling represents a feasible way to test for risk variants. Some view a pool-based GWAS as a first-pass experiment to probe for variants with relatively large OR41. Of course, what is considered a ‘relatively large’ OR has changed dramatically in the past five years. Assuming that all of the requirements of a well-powered and well-designed conventional GWAS have been met (i.e. sufficiently large sample size and well-matched cases and controls), a pool-based GWAS can discover ‘typical’ genetic variants conferring risk in complex human traits (where typical variants have an OR≤1.3 and a MAF>5%). The problem is that many GWAS conducted in the past, pooling and conventional, were underpowered to detect the typical genetic variant’s effect size (see sections 1.2.1.3 and 1.2.1.4). Certainly, the GWAS presented in Chapters 3 and 4 are underpowered to detect the typical GWAS hit. These GWAS only probe for the so-called “low-hanging fruit”, in other words variants with ORs≥1.7. Although these are not common, and therefore not anticipated, they do exist. For example, a pool-based GWAS of susceptibility to non-Hodgkin lymphoma identified a genome-wide significant subtype-specific association in follicular lymphoma that has a per-allele OR= 0.59 (95% CI: 0.5-0.7, reciprocal OR= 1.69)36.
Insofar as this study was performed on a complex cancer phenotype with diverse subtypes in a relatively small Stage 1 (189 follicular lymphoma cases, 592 controls), this study is similar to the one performed on EOC in Chapter 3. Another relevant example is a pool-based GWAS of melanoma susceptibility, which found a new risk locus with a per allele OR= 1.75 (1.53, 2.01), and evidence for stronger association in early-onset cases50. These studies demonstrate that DNA pooling GWAS are capable of producing meaningful positive results.

1.2.3 Challenges Encountered After GWAS

In almost all cases, SNPs discovered by GWAS exert no direct functional effect. Approximately eighty percent of these SNPs lie in non-coding introns or intergenic regions51. These SNPs are hypothesized (and in a handful of instances shown) to be linked with causal variants. The non-coding nature of GWAS hits has led to the hypothesis that many causal variants will exert regulatory effects by altering gene transcript levels; for example, by altering transcription factor binding sites (TFBS), enhancer sites, and suppressor splice sites52. Unambiguously identifying this class of causal alleles will be difficult as we are only beginning to understand the functional consequences of genetic variants residing in regulatory regions. In particular, our understanding of distant regulatory mechanisms (defined as a regulatory site exceeding 1Mb from the affected site) is far from comprehensive. Therefore, it has recently been stated that the greatest challenge of the ‘post-GWAS’ era will be to understand the functional consequences of the trait-associated SNPs reported52. First steps in this process of understanding include fine-mapping and targeted sequencing to identify the causal variant(s). A systematic strategy for pursuing causal variants and affected genes was recently put forward52, and key points are presented here.
These steps are described to demonstrate what can be done to pursue trait-associated SNPs after a GWAS. In Chapter 3 and Chapter 4, we go so far as to confirm trait-associated SNPs through replication; as well, we test whether the association signals detected can be generalized across two different human populations (Chapter 4).

1.2.3.1 Fine-Mapping and Targeted Sequencing

Fine mapping is typically used to follow up loci implicated by GWAS. The goal of fine mapping is to genotype a dense set of tag SNPs (denser than the initial GWAS) and: 1) find the SNPs most associated with the trait, and 2) narrow down the segment of genome associated with the trait17. As SNP arrays were designed to capture LD structure, not functional variants, it comes as little surprise that the SNPs identified by GWAS are surrogate markers (not causal variants)52. Fine mapping is critical in ensuring that the right SNPs and/or regions are pursued in future sequencing and/or functional studies. Today, imputation techniques allow one to fill in missing genotype information for common variants (MAF>1%) at a locus without the need for further genotyping, using the 1000 Genomes Project resource53,54. Potentially important rare variants (MAF<1%) and structural variants may not be comprehensively assessed using indirect LD-based methods. These require targeted sequencing to be detected. The goal of targeted sequencing is to discover all potential causal SNP(s) in LD with the associated SNP(s) (assuming the associated SNPs are not causal, a typical assumption)52. Reduced cost and increased speed of next generation sequencing techniques have made it possible to sequence whole regions of interest for large numbers of samples (i.e. 100 or more) to identify potentially causal variants. The genomic region that is sequenced can be defined by local patterns of LD (for example, r²>0.2 boundaries), or by arbitrary physical limits (for example, 0.5Mb upstream and downstream).
In some instances, GWAS information from non-European populations, in particular those of African descent, can be used to reduce the size of the region that needs to be sequenced (assuming the SNP-trait association is also present in this population)55. This is because the genomes of African-American populations generally have smaller LD blocks (LD block structure is discussed in detail in section 1.2.1.1). Fine-mapping and sequencing efforts will often produce several highly associated and potentially causal variants. At this point, functional assays are needed to demonstrate tissue-specific and phenotype-specific effects. It may be that several causal variants reside at a given locus. For example, at least two functional SNPs at the 8q24 locus have been associated with prostate and colorectal cancer. One is situated in a FoxA1 binding site56, the other in a transcriptional enhancer57; both appear to exert trait-associated effects on MYC transcription.

1.2.3.2 Functional Annotation of Regulatory SNPs

Many causal SNPs are hypothesized to affect gene regulation; therefore, characterizing the regulatory landscape of a susceptibility region is an important first step in gaining a broad understanding of how risk alleles might function. The most abundant regulatory sequences are enhancers, but promoters, insulators, and silencers may also be affected by causal SNPs. Distal regulatory sequences are often cell-type specific58, and may explain why common susceptibility alleles are tissue- and disease-specific. Both histone modification and transcription factor occupied site assays (using chromatin immunoprecipitation sequencing, ChIP-Seq) can be used to identify alleles that alter the physical state of regulatory sites (where physical state can be correlated with the degree of regulatory activity at that site)52. Enhancer activity can be assayed using in vitro and in vivo reporter assays56.
The next challenge is to identify the genes affected by functional regulatory elements. There is no reason to presume that the genes affected by a causal variant will be in LD with it. In the most extreme example, they could be on another chromosome.

1.2.3.3 Expression Quantitative Trait Loci (eQTL) Studies

Perhaps the most straightforward approach to identifying genes affected by a putatively causal SNP is to test for a correlation between SNP genotypes and transcript abundance (usually for candidate genes at a locus, but potentially also genome-wide). Genetic variants that alter transcript levels are called expression quantitative trait loci (eQTL). Many studies have shown that transcript level can be influenced by inherited variation59-61. As with GWAS, large sample sizes are required to achieve sufficient power to detect eQTLs. Again, stringent critical values resulting from multiple testing are an issue here, as they were for GWAS (see section 1.2.1.5). However, because the average eQTL explains a greater proportion of the trait variance (i.e. has a larger effect size), the sample size required is typically not as large as for GWAS52. In the future, comprehensive biobanks of normal tissue will be needed to evaluate expression differences for the different alleles of a SNP. Like GWAS, eQTL studies are hampered by false negatives. These can occur because gene expression varies in time and space. Even if a transcript is associated with a causal SNP genotype, it does not guarantee that the gene is involved in the trait being studied. Trait-relevant functional assays are still needed to demonstrate that altered transcription levels of a gene confer a phenotypically relevant effect.

1.2.3.4 Cell, Tissue, and Animal Models

Once sufficient evidence in support of a gene is established, detailed functional studies are needed to characterize that gene’s role in the pathogenesis of the trait.
At this point, it should be noted that the GWAS ‘hit’ has come full circle, and become a hypothesis-driven regional candidate gene. The functional assays that can be brought to bear will depend on the gene and the relevant tissue. For example, for cancer related candidate genes, numerous mouse models have been developed. The human tumor xenograft model is routinely used, and may serve as a starting point in establishing gene-cancer causation. Here, human tumor cells are transplanted under the skin or proximal to the organ of origin in immune-compromised mice. Mice are then observed to establish how (or if) the gene affects malignant transformation, and/or invasion and metastasis.

1.2.4 Summary of GWAS Successes

Some have asserted that the results of GWAS are disappointing, primarily because they do not explain enough of the genetic variance underlying human phenotypes. This is a misrepresentation of the aims of GWAS, which were to map complex traits and discover clinically relevant genes10. In five years, GWAS have led to many new discoveries about genes and pathways in complex traits. In many cases, genes previously not suspected to have a role in disease etiology have been found to be important, emphasizing the value of the hypothesis-free approach. Salient examples are the unsuspected role of autophagy genes in inflammatory bowel disease and Crohn’s disease62,63, the IL-23R pathway in rheumatoid arthritis64, and factor H in age-related macular degeneration8. These genes and pathways would not have been tested under the candidate gene approach. The knowledge that these genes might be important simply did not exist. GWAS have also demonstrated previously unsuspected overlap in disease mechanisms for related (and seemingly unrelated) traits10,65. For example, several autoimmunity loci are associated with multiple autoimmune diseases, including type 1 diabetes and multiple sclerosis66.
Another example is the 8q24 gene desert found to harbor common variants associated with bladder, breast, colon, ovarian, and prostate cancers, suggesting a common etiology for these diverse cancers67. As stated by Hunter and Kraft: “there have been few, if any, similar bursts of discovery in the history of medical research”68. Future application of systematic approaches to determining causal SNPs and affected genes will enable biologically relevant understanding with clinical utility. GWAS of EOC and ANM are presented in Chapter 3 and Chapter 4, respectively. As stated previously, the candidate gene approach has met with limited success in these traits, presumably because our understanding of the underlying biology at play is incomplete. Although the GWAS approach undoubtedly has many challenges, its potential to reveal previously unsuspected genes and mechanisms makes it an ideal study design for advancing the understanding of the etiology of these traits. An introduction to EOC and ANM is given below to provide a context for the GWAS reported in Chapters 3 and 4.  1.3 Ovarian Cancer Ovarian cancer is the fifth most common cause of cancer death in women in the Western world, and the leading cause of death from gynecological malignancies. In 40 years there has been little improvement in the early detection and treatment of ovarian cancer, and the prognosis for this disease remains poor69. Survival is directly related to stage, with 5-year survival rates of 93% for those diagnosed with localized disease, but only 31% for those with distant disease70. Unfortunately, two-thirds of patients present with distant disease at the time of diagnosis70. A lack of practical screening methods and the absence of clear symptoms in the early stages of tumour progression have made ovarian cancers difficult to treat and study. 
The molecular origins and basic biology of ovarian cancer remain understudied, and a better understanding of both is needed to develop early detection techniques and novel therapeutics to improve ovarian cancer prognosis.  1.3.1 Types of Ovarian Cancer Ovarian cancer is not a single disease. There are upwards of 30 types and subtypes of ovarian cancer, each with its own appearance and biologic behavior. Ovarian cancers fall into three major categories, based on the cells from which they are presumed to have arisen (see Figure 1.1): 1) epithelial tumours arise from cells that line or cover the ovaries; 2) germ cell tumours originate from cells that are destined to form eggs within the ovaries; and 3) sex cord-stromal cell tumours arise in tissues of the follicle, most commonly granulosa or theca cells. In addition, some tumours adjacent to ovarian tissue may be considered ovarian in origin. The vast majority of ovarian cancers (80-90%) are epithelial tumours71.  1.3.2 Origins of Epithelial Ovarian Cancer (EOC) The origins and pathogenesis of EOC remain frustratingly elusive, in part because it is rare to find well-defined precursor lesions, and in part because EOCs tend to have a complex and heterogeneous histology that defies a simple biological explanation69. At the most basic level, it is not always clear what the tissue of origin is, although historically the ovarian surface epithelium (OSE) has been implicated (Figure 1.1). The OSE is derived from the coelomic epithelium, which is comprised of mostly flat or cuboidal cells72. Importantly, the OSE does not contain cells of Mullerian origin, but EOCs present as differentiated subtypes that recapitulate the histology of gynecologic tissues that are Mullerian in nature73,74. It is perplexing how EOC can arise from such a seemingly simple structure as the OSE. For the OSE to undergo transformation to EOC, sufficient genetic abnormalities to allow transformation to the Mullerian-type epithelium would be necessary. 
A competing hypothesis suggests that precursor EOC cells may actually originate from the nearby fallopian tube, whose native histology is Mullerian in nature73,75. After 60 years, the tissue of origin for EOC is still not entirely clear; it may depend on the histological subtype of EOC in question.  1.3.3 Subtypes of EOC There are five major histological subtypes of EOC, including: 1) high-grade or invasive serous (HG-SER), 2) low-malignancy potential/borderline serous (LMP-SER), 3) endometrioid (END), 4) mucinous (MUC), and 5) clear cell (CC). These EOC subtypes differ with respect to morphological appearance, presence of precursor lesions, molecular alterations, and clinical behavior76 (summarized in Table 1.1), and it is becoming apparent that they can be thought of as distinct disease entities77. For example, genetic analyses of somatic alterations in different EOC tumour subtypes have shown that each carries a distinct genetic signature. Likewise, the heritable risk factors for EOC that have already been reported often confer risk in one subtype, but not the others. These findings support the idea that EOC is a genetically heterogeneous group of cancers, and that common heritable risk alleles may also be subtype-specific.  1.3.3.1 Serous Serous EOC is by far the most common subtype, accounting for 30-70% of all EOC tumours, depending on the case collection in question78. Typically SER EOC presents in women aged 40-70 (average age 59.6 ± 11 years)79. This subtype can be further divided into the HG-SER and LMP-SER categories, based on the level of abnormalities observed in the nucleus, and the rate of division of cells80-82. Morphological features, molecular data, and clinical presentation suggest that HG-SER and LMP-SER arise via different genetic pathways82. HG-SER tumours grow quickly and exhibit many aberrations to the nucleus; for example, they show more than 3-fold variation in nuclear size81. 
A hallmark of this subtype is mutation to TP53, which is found in almost all HG-SER tumours (96%)83. Lower prevalence (8-9%) but statistically recurrent somatic mutations in BRCA1 and BRCA2 are also observed83, altogether suggesting the importance of DNA repair to pathogenesis. Additional low prevalence mutations (<5%) described in HG-SER tumours affect RB1, NF1, CSMD3, EGFR and MYC83-86; thus, alterations to cell cycle progression and proliferation/survival pathways also appear to play a role. In contrast to HG-SER, LMP-SER tumours are slow-growing, genomically stable, and appear to progress in a step-wise fashion in the context of constant proliferation caused by frequent (20-33%) mutations in KRAS, BRAF, and PTEN87-89. KRAS and BRAF are oncogenes involved in the epidermal growth factor receptor (EGFR) signaling pathway, which controls cell proliferation, differentiation and apoptosis. PTEN is a tumour suppressor gene that controls the cell cycle by negatively regulating Akt/PKB signaling. LMP-SER tumours present in women ~10 years younger than those with HG-SER. They tend to be diagnosed at an earlier stage of disease and have excellent prognosis when diagnosed at this stage81.  1.3.3.2 Mucinous Mucinous tumours account for 5-15% of EOC cases, and are typically diagnosed in women 30-60 years of age (average age 54.8 ± 14 years)79. Most MUC tumours are benign (~80%), 15% are of borderline malignancy, and only 5% are malignant82. Borderline malignant tumours have cells that proliferate more than benign cysts, but they do not invade surrounding tissue like malignant tumours. Their appearance may be endocervical-like, meaning they resemble the cells of the mucous membrane lining the canal of the cervix, or intestinal-like, meaning they resemble cells of the mucous membrane lining the gastrointestinal tract. MUC tumours tend to be very heterogeneous, with a mixture of benign, borderline, and malignant cells found within a single tumour. 
This suggests MUC EOC tumourigenesis occurs sequentially, progressing from benign cyst to invasive carcinoma82. Tumours of this subtype can be quite large, and as such the majority of cases are diagnosed in an earlier stage of disease, which lends itself to a good prognosis. Genetically, MUC tumours have predominantly been associated with somatic mutations in BRAF and KRAS71,90,91. Mutations in these genes are also frequently described in colorectal cancers, non-small cell lung cancers, and LMP-SER EOC. In some cases, what appears to be MUC EOC is actually metastatic cancer from a gastrointestinal (GI) primary site. Distinguishing primary MUC EOC from ovarian metastases of other mucinous cell-type cancers is still an outstanding challenge in the study of EOC92.  1.3.3.3 Endometrioid Endometrioid EOC accounts for between 10-20% of EOC cases93, and usually presents in women aged 50–59 years (average age 54.2 ± 13 years)79. Approximately 80% of these tumours are malignant; the remainder are of borderline malignancy. Tumours of this type typically present as a large pelvic mass, and cases are diagnosed in the earlier stages of disease. When END cases are matched with HG-SER cases by age, tumour grade, stage, and level of cytoreduction (the proportion of visible tumour removed during surgery), there is no survival difference94. Mutations frequently described in END tumours are activating mutations in PIK3CA95, silencing mutations of the tumour suppressor gene PTEN96,97, and mutations in CTNNB198 or Wnt signaling proteins causing deregulation of the Wnt/β-catenin/Tcf pathway. Very recently, mutations in ARID1A, an accessory subunit in the SWI/SNF chromatin-remodeling complex, were found in 30% of END tumours99. An important risk factor in END EOC is the presence of endometriosis100. Endometriosis occurs when cells from the lining of the uterus (endometrial cells) implant and grow outside the uterus. 
Implant sites can include the ovaries, bladder, and the peritoneum (the lining of the pelvic area). Between 15-20% of women with END EOC also have a history of endometriosis, with implants often closely associated with the ovarian tumour82,99.  1.3.3.4 Clear Cell Clear cell EOC accounts for between 10-20% of cases, and is typically diagnosed in women 40-80 years of age93 (average age 55.6 ± 13 years)79. Nearly all of these tumours are malignant, and a history of endometriosis significantly increases risk100. These tumours have a low mitotic rate and are genetically stable73. This subtype often presents as a pelvic mass in its early stages, and tends to be diagnosed early and have a better prognosis. The molecular events leading to the development of CC share similarities with END EOC. For example, CC EOC tumours are associated with mutations in PIK3CA95,101. As well, ARID1A is frequently (>45%) mutated in this subtype99. Hence, CC and END EOC may share a common mechanism of pathogenesis.  1.3.4 Pathogenesis of EOC Numerous reproductive, environmental, and genetic risk factors have been associated with EOC, and several hypotheses have been proposed to explain how these contribute to the pathogenesis of this disease.  1.3.4.1 The Role of Ovulation Fathalla first proposed the incessant ovulation theory in 1971; it holds that with every ovulation event there is a wound on the surface of the ovary (Figure 1.2), which must be repaired by post-ovulation mitosis and proliferation102. Concomitant with this activity is an increased chance of accumulating potentially carcinogenic mutations. Environmental factors associated with EOC risk lend support to this hypothesis. Increased parity, breast-feeding, use of oral contraceptives (OC), and early menopause have all been associated with reduced risk of EOC, and each reduces the number of ovulations a woman experiences71. 
Meanwhile, an early age of menarche and a late age of menopause have been associated with increased risk of EOC, and both increase the time span in which a woman can experience ovulations. A related hypothesis suggests that with every ovulation, there is the potential for OSE cells to invaginate and form an inclusion cyst in the ovarian stroma (Figure 1.2). This brings OSE cells into the hormonal environment of the stroma, potentially leading to uncontrolled growth103. The incessant ovulation theory is challenged by the observation that progesterone-only OCs, which do not inhibit ovulation, also reduce EOC risk104. In addition, women with polycystic ovarian syndrome (PCOS) are at an increased risk of EOC, despite experiencing fewer ovulations105. These findings have been reconciled by focusing on the inflammation associated with ovulation, rather than the absolute number of times ovulation occurs106,107. Chronic low-grade inflammation is a defining property of PCOS108, and progesterone-only OCs attenuate inflammation in response to ovulation.  1.3.4.2 The Role of Inflammation Inflammation is increasingly being recognized as a key player in the pathogenesis of cancers109,110, including EOC78. Many of the environmental factors associated with EOC risk are also associated with inflammation, including: exposure to talc or asbestos, the presence of endometriosis111, pelvic inflammatory disease112, and a history of mumps or viral infection113. In contrast, tubal ligations and hysterectomies, which are thought to reduce the exposure of the ovaries to inflammatory agents, reduce EOC risk111,114,115. The inflammatory response that accompanies each cycle of ovulation, and/or environmentally derived inflammation (endometriosis, talc etc.), may contribute to EOC pathogenesis.  1.3.4.3 The Role of Hormones Ovarian cancer incidence increases dramatically in women over the age of 45, and peaks between 10 and 20 years after menopause103. 
This coincides with major changes in a woman’s endocrine signaling. In particular, there are reduced levels of female hormones (estrogen and progesterone)116, and elevated levels of follicle stimulating hormone (FSH) and luteinizing hormone (LH). These observations led to the gonadotropin stimulation hypothesis, which postulates that the elevated levels of FSH and LH found in menopausal and postmenopausal women lead to the stimulation and proliferation of the OSE, with the subsequent accumulation of potentially carcinogenic DNA mutations103. Results supporting this hypothesis are inconsistent117. Exposure to gonadotropins has not been shown to induce a malignant phenotype in EOC cells74. However, in mouse models of implanted tumours, exposure to gonadotropins does promote tumour growth118, angiogenesis, vascular endothelial growth factor (VEGF) expression119, and adhesion120, processes which are required for the initiation and maintenance of tumours. Thus, gonadotropins appear to promote the progression of EOC rather than initiate it.  1.3.4.4 The Role of Heritable Risk Factors Family history of breast or ovarian cancer accounts for the largest proportion of EOC risk, and is attributed to several known heritable risk factors. These can be categorized as rare, highly penetrant mutations conferring large EOC risk, or as common, moderate to low penetrance risk variants conferring small to moderate risk. Rare, highly penetrant mutations explain most families with ≥ 2 cases of EOC, while moderate to low penetrance risk alleles are thought to account for a proportion of the risk for women with and without a family history of breast or ovarian cancer.  1.3.5 Rare, High Penetrance EOC Risk Alleles Two familial syndromes have been described for which EOC is a frequent outcome. These include: hereditary breast and ovarian cancer syndrome (HBOCS) and Lynch Syndrome (LS, also called hereditary non-polyposis colorectal cancer, HNPCC). 
Both syndromes are associated with mutations in DNA repair pathways. HBOCS is caused by mutations in at least five known genes, including BRCA1, BRCA2, RAD51C, RAD51D, and BRIP1121. BRCA1/2 mutations have been found in 80-90% of HBOCS cases, and 6-8% of non-familial EOC cases122. All five genes play key roles in the Fanconi anemia (FA)–BRCA pathway. This pathway is responsible for repairing double-strand breaks by homologous recombination (reviewed in 121). To develop cancer, BRCA1/2 mutation carriers must experience a loss of function (LOF) in their normally functioning BRCA1/2 allele, as well as a LOF in additional cancer-related genes. As early as 1994 it was noted that BRCA1 mutation carriers develop certain subtypes of EOC123. Since then, it has been well established that BRCA1/2 carriers typically present with HG-SER tumours, and occasionally with END or CC tumours. In contrast, BRCA1/2 mutations are rarely seen in MUC and LMP-SER tumours124-126. Even after accounting for mutations in these highly penetrant alleles, excess risk remains in HBOCS families. Twin studies suggest that this excess risk is attributable to heritable factors as opposed to environmental factors127. LS is a genetically heterogeneous condition caused by mutations in DNA mismatch repair (MMR) genes, including MLH1, MSH2, and MSH6122. These genes repair mismatches that arise during replication or due to radiation damage. Mutations in MMR genes predispose carriers to many types of cancer, but endometrioid cancer and EOC are often the first reported cancers in women from LS families128. Despite conferring a high risk of EOC on women in LS families, these mutations account for a relatively small proportion of all EOC cases122. Unlike HBOCS patients, LS patients tend to develop END or CC tumours that are moderately to well differentiated and of International Federation of Gynecology and Obstetrics (FIGO) Stage I or II at the time of diagnosis. 
They have a better prognosis than the HG-SER tumours seen in HBOCS, the latter being poorly differentiated and in more advanced stage at the time of diagnosis.  These two familial syndromes raise several key points regarding EOC. One, DNA repair pathways are important to pathogenesis, at least in the context of highly penetrant risk alleles. Two, mutations in specific genes and/or pathways favour the development of specific histological subtypes: mutations in double-strand break repair (DSBR) genes give rise to HG-SER tumours, whereas mutations in MMR genes give rise to non-SER tumours. Three, approximately 60% of the familial risk associated with HBOCS and LS remains unaccounted for129. The unexplained risk may be attributed, in part, to a combination of common low penetrance risk variants, and/or very rare mutations of moderate to high penetrance in a variety of genes121,128.  1.3.6 Common, Low to Moderate Penetrance EOC Risk Alleles Six low penetrance risk loci have been associated with EOC by GWAS performed in European women130. In these studies SNPs were tested for association with all EOC risk (all subtypes combined). Each locus is described in detail below (summarized in Table 1.2).  1.3.6.1 Locus 9p22.2 9p22.2 was the first locus associated with EOC susceptibility130. Based on logistic regression analyses testing for association between genotype and case-control status, the most significant SNP (rs3814113) at this locus had a per minor allele odds ratio of 0.82 (Ptrend = 5.1 × 10^-19). This association is stronger when HG-SER cases are analyzed alone (OR = 0.77, Ptrend = 4.1 × 10^-21), and is not significant in any other subtype-specific analyses. In part this could be attributed to inadequate power to detect the association; analyses in the non-serous subtypes were based on far fewer case samples. rs3814113 was also found to reduce EOC risk, but not breast cancer risk, in BRCA1/2 mutation carriers131. 
Hence, this association appears specific to ovarian cancer. rs3814113 is an intergenic SNP; however, 8 of the other associated SNPs at 9p22.2 are located within intron 2 of BNC2130. Based on in silico analyses none of these SNPs are predicted to alter enhancer-binding sites or splice sites130; thus, they do not have an obvious functional role. BNC2 encodes a highly conserved zinc-finger DNA-binding protein, which presumably regulates the transcription of many gene targets. At the moment there is little evidence for how BNC2 functions in cancer development. It is highly expressed in reproductive tissues and implicated in oocyte development132,133. Recently, seven of the 9p22.2 SNPs associated with EOC risk130 were also associated with sonographically detectable abnormalities in the ovaries of women without ovarian cancer, implicating this locus in ovarian development and/or biology134.  1.3.6.2 Locus 8q24.21 This association was detected by 3 SNPs at a GWS level (all in LD, r^2 ≥ 0.46). These SNPs decreased EOC risk, and were more strongly associated in HG-SER-only analyses. The most significant of the associations, rs10088218, had a per minor allele OR = 0.76 (P = 8.0 × 10^-15)40. rs10088218 also decreased EOC risk in BRCA1/2 mutation carriers131. rs10088218 is an intergenic SNP, and it is not yet clear how it might exert a functional effect. However, variants at 8q24.21 have been associated with multiple cancer types, including prostate, colorectal, breast and bladder cancer135-137. Intriguingly, the collective GWAS data on these cancers suggest that multiple independently associated SNPs exist, and each may exert unique functional effects (for example, tissue-specific effects). Most SNP-cancer associations at 8q24 are located 5′ of MYC; however, the SNPs associated with EOC lie ~700 kb downstream of MYC, in an apparent gene desert. 
The relative proximity of the associated locus to MYC, coupled with this gene’s long history in cancer biology, makes it a compelling candidate for driving tumourigenesis. A hypothesis consistent with the GWAS data is that these variants are in regulatory elements of MYC; however, functional correlation with MYC expression has been inconsistent137,138. In addition to these cancer-associated SNPs discovered by GWAS, the 8q24 region has been identified in a large-scale multi-cancer tumour study as the most frequently (14%) amplified region139.  1.3.6.3 Locus 2q31 One SNP, rs2072590, was associated with increasing EOC risk at the 2q31 locus (per minor allele OR = 1.16, P = 4.5 × 10^-14)40. rs2072590 is the first and only GWAS SNP significantly associated with EOC risk in a subtype other than HG-SER. It is associated with risk in MUC cases (OR = 1.30, P = 7.3 × 10^-7), and HG-SER cases (OR = 1.20, P = 3.8 × 10^-14). This result may be attributed to this SNP’s relatively large effect size in the mucinous subtype. rs2072590 is also associated with increasing EOC risk in BRCA2 mutation carriers (per minor allele HR = 1.24, Ptrend = 6.6 × 10^-4), but not BRCA1 mutation carriers131. rs2072590 is a non-coding SNP that lies ~5kb downstream of HOXD3 and ~10kb upstream of HOXD1. HOX genes are highly conserved transcription factors that play critical roles in morphogenesis. rs2072590 is in LD (r^2 = 0.96) with SNPs in the 3’ untranslated region (UTR) of HOXD3, and as such may have an indirect regulatory role. HOXD3 has previously been implicated in carcinogenesis140.  1.3.6.4 Locus 3q25 The association at 3q25 is reported as approaching GWS; however, there is good evidence for its association in HG-SER tumours40. This association was detected by one SNP (rs2665390), and was found to increase EOC risk (per minor allele OR = 1.19, P = 3.2 × 10^-7; HG-SER per minor allele OR = 1.24, P = 7.1 × 10^-8)141. 
This SNP is also associated with increased EOC risk in BRCA1/2 mutation carriers (BRCA2 per minor allele HR = 1.48, Ptrend = 1.8 × 10^-4; BRCA1 per minor allele HR = 1.25, Ptrend = 6.1 × 10^-4)131. Of the SNPs discovered by GWAS of EOC, and then tested for association in BRCA1 and BRCA2 mutation carriers131, rs2665390 confers the largest modifying effect. rs2665390 is intronic to TIPARP/PARP7, an ADP-ribose polymerase (PARP). PARPs are a family of proteins involved in DNA single-strand break repair (SSBR) and programmed cell death. This may be an example of synthetic lethality, defined in this context as a combination of mutations/variants in two or more genes that is cancer-causing, although none is cancer-causing when present individually. In this situation, existing BRCA1/2 dysfunction in mutation carriers may sensitize cells to further insult on DNA repair pathways, including SSBR (presumably partially overlapping and compensatory pathways)142. These two pathways, DSBR and SSBR, may cooperate to repair DNA damage and maintain genome integrity, and both may be relevant in the context of EOC tumourigenesis.  1.3.6.5 Locus 17q21 The association at 17q21 is also reported as approaching GWS40. This association was detected by one SNP (rs9303542), and increased EOC risk (per minor allele OR = 1.11, P = 1.4 × 10^-6; HG-SER per minor allele OR = 1.14, P = 1.4 × 10^-7)40. This SNP is also associated with increased EOC risk in BRCA2 mutation carriers, but not BRCA1 mutation carriers131. rs9303542 is intronic to SKAP1, a T-cell adaptor protein. In T-cells, constitutive expression of SKAP1 suppresses activation of RAS/RAF pathways, which have been implicated in the early stage development of EOC143. Nevertheless, it is not clear how changes to SKAP1 expression levels in T-cells might influence tumourigenesis in OSE cells.  
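As an illustrative aside, the per minor allele odds ratios quoted for these loci can be translated into genotype-specific odds ratios. The sketch below assumes a log-additive (multiplicative) model, under which each additional copy of the minor allele multiplies the odds by the same factor. That model, the function name genotype_odds_ratios, and the choice of the all-subtype ORs for rs2665390 and rs9303542 as inputs are assumptions made here for illustration; they are not a description of the analyses performed in the cited studies.

```python
# Sketch only: genotype-specific odds ratios implied by a per-allele OR,
# ASSUMING a log-additive (multiplicative) genetic model.
# The function name is hypothetical and introduced for illustration.

def genotype_odds_ratios(per_allele_or):
    """Map minor-allele count (0, 1, 2) to the OR relative to non-carriers."""
    if per_allele_or <= 0:
        raise ValueError("odds ratio must be positive")
    # Each extra copy of the minor allele multiplies the odds by the same factor.
    return {copies: per_allele_or ** copies for copies in (0, 1, 2)}


# Per minor allele ORs quoted in the text for all-subtype EOC risk.
reported = {"rs2665390 (3q25)": 1.19, "rs9303542 (17q21)": 1.11}

for snp, or_value in reported.items():
    rounded = {k: round(v, 2) for k, v in genotype_odds_ratios(or_value).items()}
    print(snp, rounded)
```

Under this assumption, a woman carrying two copies of the minor allele at rs2665390 would have roughly 1.19 × 1.19 ≈ 1.42 times the odds of EOC of a non-carrier, which illustrates how modest the effect of any single common variant is.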
1.3.6.6 Locus 19p13 Two SNPs at 19p13 were associated with EOC risk: rs8170 (OR = 1.23, P = 3.6 × 10^-6) and rs2363956 (OR = 1.10, P = 1.2 × 10^-7), and they are more significant in HG-SER-only analyses. There is some indication that these SNPs may also be associated with EOC survival after diagnosis144. Another SNP at this locus (rs67397200; r^2 = 0.6 with rs8170 and rs2363956) was recently associated with increased EOC and breast cancer risk in BRCA1/2 mutation carriers145. rs8170 is a coding SNP in BABAM1 (previously MERIT40), and depending on the transcript isoform in question, it is either a synonymous or nonsynonymous SNP (ns-SNP)144. BABAM1 is known to interact with BRCA1, regulating its retention at double strand break sites and facilitating DNA repair146,147. rs2363956 is a ns-SNP in ANKLE1. In an analysis of 216 HG-SER tumours83, the expression of BABAM1, but not ANKLE1, was elevated in EOC tumours relative to normal tissue. Given the function and expression of BABAM1, this gene is a compelling candidate for involvement in EOC pathogenesis.  1.3.7 Candidate Gene Association Studies of EOC Quaye et al. (2009) looked for association between invasive EOC risk and 340 SNPs in 84 genes in pathways such as DSBR, MMR, and cell cycle control148. They also analyzed several tumour suppressor genes and oncogenes nominated by an in vitro functional model of EOC. They found borderline evidence of association for several SNPs. Johnatty et al. (2010) subsequently used a similar approach to test for association between tumour-host interaction genes and EOC risk149. They genotyped variants in 173 genes (1536 SNPs) involved in stromal-epithelial interactions; however, none achieved significance in their replication data set. They also genotyped the TERT SNP rs7726159, and found it increased risk of HG-SER EOC (per minor allele adjusted OR = 1.14, P = 0.003). Recent efforts at fine mapping TERT and telomere maintenance genes support the association of TERT with EOC risk141,150. 
SNP-EOC survival associations for 227 (1416 SNPs) candidate genes from ovarian cancer–related pathways, including: angiogenesis, inflammation, detoxification, glycosylation, one-carbon transfer, apoptosis, cell cycle regulation, and cellular senescence were also investigated40. Angiogenesis was most strongly associated with survival time overall (P = 0.03); variation in inflammation pathways was borderline significant (all patients, P = 0.09). In general, inadequate understanding of EOC biology and/or histological differences has thwarted candidate gene studies.  1.3.8 Conclusion to EOC Introduction The common low-penetrance risk alleles associated with EOC risk reinforce two salient points initially raised by the high-penetrance risk alleles identified in HBOCS and LS families. First, frequent alteration to genes involved in DNA repair pathways, including DSBR, MMR, and SSBR pathways, suggests that aberrant DNA repair is important to EOC pathogenesis, in particular HG-SER pathogenesis. Second, the risk alleles discovered to date, both low and high penetrance, are subtype-specific and are primarily associated with HG-SER risk only. As of yet, a subtype-specific EOC GWAS has not been performed, and SNP associations driven by non-HG-SER cases have not been properly investigated. This is the focus of Chapter 3.  1.4 Ovarian Aging and Age at Natural Menopause (ANM) Ovarian aging is described as the natural process by which a woman reaches reproductive exhaustion. Menopause is an important event in this process, and a prevailing hypothesis states that it occurs when the follicle pool in the ovary is too low to maintain regular cycles151,152. Menopause is a dramatic event in a woman’s life history and demarks major changes in endocrine signaling, particularly a reduction in female hormone production (estrogen and progesterone) by the ovaries116. It is also a risk factor for many age-related diseases. 
For example, early menopause is associated with an increased risk of cardiovascular disease and osteoporosis153-155, and a decreased risk of ovarian cancer1, endometrial cancer3, and breast cancer4,156. Menopause age varies widely, between 40 and 60 years (average 50-51), and this continuous trait is influenced by genetic and environmental factors7,152,157-159. Family- and twin-based studies have estimated the heritable component of ANM between 42% and 87%160-164, and the prevailing view is that genetics play a very important role.  Menopause before 40 is described as premature ovarian failure (POF), and this condition is thought to have its own host of genetic risk factors. Although environmental factors such as smoking and oral contraceptive use influence ANM, collectively they explain relatively little of the variation in this trait160,165. Despite the relevance of ovarian aging to women’s health, and the evidence that heritable factors may explain a significant proportion of the variation in this trait, genetic effects are only beginning to be investigated and understood.  1.4.1 Relevance to Health and Disease Menopausal age is important as a retrospective marker for ovarian senescence and/or an exhausted follicular reserve151. Also important is that early natural menopause implies a reduced span of fertility166. Over the past few decades, postponement of childbearing has led to decreased family size and increased rates of age-related female sub-fertility. This is because, in addition to follicle loss, oocyte quality diminishes with increasing age, which is believed to be due to increased meiotic non-disjunction. Women choosing to postpone childbearing experience a natural, age-related decline in the ovarian follicle reserve, which influences fertility, and whose conclusion is the onset of menopause. Currently, we are unable to identify women at risk of early menopause. 
Further studies identifying or validating novel genes involved in the variation of menopausal age are likely to point to pathways and mechanisms that underlie ovarian ageing. This new information may lead to strategies to screen for at-risk individuals, and potentially suggest avenues to decelerate the rate of ovarian ageing. Although the data remain conflicting, ANM has been associated with EOC risk167,168. In a recent study of 350 EOC cases and 2331 controls, ovarian cancer risk was decreased by 2% per year that menopause age was advanced in time. A younger age at menopause indicates fewer ovulatory cycles and might decrease ovarian cancer risk simply for this reason (see section 1.3.4.1)1,102. For each year that total menstrual lifespan decreased (onset of menarche to onset of menopause) a 3% reduction in EOC risk was observed (HR = 0.97, 95% CI: 0.95-0.99)1. Analysis by histological subtype in this study was limited by the small number of cases in the different strata, and did not indicate clear differences in the ANM-EOC association. A previous study found that a 1-year increase in the number of ovulatory years was associated with a significant 8% increase in risk of HG-SER and END EOC, but only a 3% increase in risk of MUC EOC167. This analysis included 221,866 women and 721 cases, including 496 HG-SER, 139 END, and 86 MUC EOC cases from the US Nurses’ Health Study (1976–2006) and Nurses’ Health Study II (1989–2005)167. Broadly speaking, in the process of gaining an understanding of ANM, we may inadvertently shed light on EOC as well.  1.4.2 ANM Genetic Risk Factors: GWAS in Europeans In 2009 two large GWAS identified five loci associated with ANM on chromosomes 5, 6, 13, 19 and 20169,170. Both studies were performed on women of European descent. Encouragingly, two of the reported loci overlapped between studies (chromosomes 19 and 20). As well, there appear to be multiple independently associated variants at these loci, suggesting allelic heterogeneity. 
These loci explained <1.5% of the phenotypic variation of ANM. Comprehensive explorations of these loci may well discover additional variants and account for a larger proportion of the variation in ANM. Furthermore, additional loci of smaller effect are likely to be discovered if more samples are used171. It should be noted that in 2007 a much smaller (438 women) GWAS of ANM was published, also on women of European descent172. None of the results from this earlier study overlapped with those of the recent GWAS, likely due to power issues. Two genome-wide linkage studies have been published on this trait173,174; however, no regions of the genome achieved statistical significance (the highest LOD scores ranged from 2.1 to 3.1). Further, the regions suggestive of linkage did not overlap between studies. In 2012, a large consortium-based meta-analysis of 22 studies of ANM was performed, comprising 38,968 European samples5. This analysis replicated 4 of the loci (chromosomes 5, 6, 19 and 20) reported in 2009169,170, and revealed 13 new SNPs associated with ANM (P < 5 × 10−8). In total, 17 loci have now been associated with ANM in European women, and collectively these account for between 2.5% and 4.1% of the observed variation in ANM. At this point, there is little insight into what the causal changes are for these loci and/or what their functional consequences might be. Regional candidate genes located at these newly associated loci include genes implicated in DNA repair and immune function; meanwhile, pathway analyses using the full GWAS data set suggest exoDNase, NF-kappa B signaling and mitochondrial dysfunction as biological processes related to ANM5. These processes and pathways have also been implicated in general aging and longevity175,176, suggesting that the mechanisms of ovarian aging (germline aging) overlap with those of somatic aging.
Finally, inflammation/immune function and DNA repair/genome integrity have also been identified as important processes and pathways in EOC. Insights into shared genetic risk factors for ANM, aging/longevity, and EOC could provide novel entry points for understanding the mechanisms involved in these traits/diseases.

1.4.3 ANM Genetic Risk Factors: Candidate Gene Studies

ANM candidate gene studies have focused primarily on the estrogen pathway177,178 or vascular components, but few loci have consistently replicated152. As part of the recent GWAS meta-analysis5, 125 candidate genes were selected a priori based on a reported relationship with ovarian function, and analyzed separately for association with ANM. This analysis included 18,327 SNPs within 60 kb of the start/end of transcription of the genes of interest. After adjusting for multiple testing, SNPs near five of the candidate genes were significantly associated with ANM. Only one of these was not already identified as part of the main GWAS analysis. The candidate gene was DMC1, and the most associated SNP was rs763121 (P = 0.0009 when corrected for candidate gene SNP analyses). The SNP was associated with an earlier ANM (~0.18 years per copy of the minor allele). DMC1 encodes a protein that is essential for meiotic homologous recombination, and is regulated by NOBOX. Mutations in NOBOX have been documented in women with POF179,180. These findings imply some overlap in the pathways responsible for very early menopause in the form of POF, and normal variation in menopausal age. The convergence of results using the candidate gene and GWAS approaches implies that most of the relevant SNP-ANM associations that can be discovered using these approaches have now been found, with one caveat: they have been performed exclusively in women of European descent.

1.4.4 Population-Specific ANM Genetic Risk Factors

Epidemiological studies find that ANM varies by race/ethnicity157,181.
A US-based cohort study of 92,704 women from five racial/ethnic groups found that Latina women experienced menopause earlier, and Japanese women experienced menopause later, than non-Latina Whites157. Adjustment for environmental factors, including smoking, age of menarche, parity, and body-mass index (BMI), did not change this result. These differences may be explained by environmental factors not considered, genetic differences, or a combination thereof. Chapter 4 presents a GWAS of ANM in a non-European population, and addresses whether the loci associated with ANM are shared between populations, or are predominantly population-specific. In a recent study, those loci identified by the GWAS of ANM in European women169,170 were tested for association in Hispanic American women. Two loci were also associated in this population, 19q13.42 (BRSK1/TMEM150B) and 20p12.3 (MCM8)182. Hence, these genetic factors do not appear to be population-specific. The exact function of MCM8 is still unknown; however, other MCM family members function to restrict DNA replication to one round per cell cycle183. There is little functional data for hypothesizing the role of BRSK1/TMEM150B in menopause. Novel loci associated with ANM in the Hispanic population were not reported on.

1.5 Conclusion

1.5.1 GWAS of Subtype-Specific EOC

In Chapter 3, a GWAS of EOC risk by histology is presented. In planning and designing this study, several possible approaches were considered, including: 1) conducting an “all-ovarian” GWAS, 2) conducting a serous-only GWAS, 3) conducting a subtype-specific GWAS, or 4) conducting no GWAS. At the time (2008), we were aware of other groups working on a well-powered and well-designed “all-ovarian” GWAS, which included a well-powered serous-only analysis. Based on sample size, it was known that these groups would conduct a GWAS better powered than ours; therefore, the 1st and 2nd options were not pursued.
This left the 3rd and 4th options: a subtype-specific EOC GWAS, or no GWAS. Taking stock of the field, it became apparent that no one (to our knowledge, including OCAC) would initiate a GWAS of the rarer EOC subtypes within the next 5 years (if not much longer). A central reason for this is that such a study would invariably be underpowered given the number of cases accrued in the rarer subtypes and the average expected SNP effect size (i.e. OR<1.3, based on 5 years of GWAS). This is not to say that larger effect sizes (OR>1.7) do not exist; they do, but they are not typically expected. The high cost of a conventional GWAS makes it unacceptable to only pursue variants exerting moderate to large effects, variants which may very well not exist. If they do not exist, there is very little to show for a lot of money spent. In contrast, a pool-based GWAS typically costs roughly 1/50th as much as a conventional GWAS, or less (but does have a small reduction in power). At this cost point, the risk of failure (i.e. if moderate-to-large effect risk variants do not exist) is more acceptable, particularly if it holds some possibility of advancing our understanding of EOC (i.e. if moderate-to-large effect risk variants do exist), a cancer notoriously refractory to treatment and with very poor survival. Based on this, we chose to pursue the 3rd option: a subtype-specific GWAS using the cost-efficient DNA pooling approach. This study would be powered to detect moderate to large effects only, and was pursued in an attempt to advance biological understanding where it appeared very little progress would be made in the near future.

1.5.2 GWAS of Population-Specific ANM

In Chapter 4, a GWAS is performed on age of natural menopause in an Iranian population. When initiating this project, we were aware of several highly powered ANM GWAS in progress; however, without exception, they examined women of European descent.
Despite our relatively small sample size, we felt that a GWAS performed in another human population could make a novel contribution to this field by revealing genetic variants that are not easily detected (or do not exist) in the European population. Population-specific differences in the ability to detect trait-associated SNPs (and/or their existence in a population) are affected by LD structure, allele frequency differences, and environmental exposures. Differences can arise due to genetic drift, population-specific histories of genetic recombination, mutation, and differences in environmental exposures that interact with genetics for the trait of interest. In terms of human population genetics, Iranians are quite close to Europeans, meaning that for the most part, tag-SNPs chosen based on Europeans should effectively tag the genomes of Iranian individuals. As in the EOC GWAS (Chapter 3), our ANM GWAS was only powered to detect genetic variants explaining a moderate-to-large proportion of the heritable component of ANM (consistent with an OR>1.7). These may or may not exist. As argued for the study presented in Chapter 3, because we are able to conduct a relatively low-cost GWAS, the higher-risk experiment becomes more tolerable (financially). Furthermore, this study was used to establish the generality of the discoveries made in the European population, in itself an important and useful result.

1.5.3 Planning DNA-Pooling GWAS

Planning and conducting a DNA pooling GWAS should begin with asking and answering several key questions, including: “What results can I expect in terms of minimum detectable odds ratio? How will pooling attenuate my power? How many replicate genotyping arrays should I use on each DNA pool to balance cost-efficiency and loss of power? Should I construct multiple replicate DNA pools?” Unfortunately, when the experiments in Chapter 3 and Chapter 4 were planned, definitive answers to these questions were not available.
Data generated by the EOC and ANM GWAS (in combination with three other pooling GWAS) was used to address these questions empirically; this is the topic of Chapter 2.

CHAPTER 2 Estimates of Array and Pool-Construction Variance for Planning Efficient DNA-Pooling Genome-Wide Association Scans

Until recently, genome-wide association studies (GWAS) have been restricted to research groups with the budget necessary to genotype hundreds, if not thousands, of samples. Replacing individual genotyping with genotyping of DNA pools in the discovery stage (stage 1) of a GWAS has proven successful, and dramatically altered the financial feasibility of this approach. When conducting a pool-based GWAS, how well SNP allele frequency in a DNA pool is estimated will influence a study’s power to detect associations. In essence, as the variance in allele frequency estimation (AFE) increases, the power to detect associations decreases. Here we address how to control the variance of AFE, and thus how to conduct the best-powered pool-based GWAS. How array variance [var(e_array)] and pool-construction variance [var(e_construction)] contribute to the total variance of AFE was calculated by comparing data from arrays between and within DNA pools. This information is critical in deciding whether replicate arrays, or replicate pools, are most useful in reducing the variance of AFE. Analysis was based on 27 DNA pools, ranging in size from 74 to 446 individual samples, genotyped on 128 Illumina beadarrays. For three types of Illumina arrays (1M-Duo, 1M-Single, 660-Quad) our estimates of array variance were similar, 3 × 10^-4 to 4 × 10^-4. Based on data for 27 DNA pools, pool-construction variance was found to account for between 20% and 40% of pooling variance; array variance accounted for the remainder. Thus, relative to var(e_array), var(e_construction) is of less importance in reducing the variance in AFE.
A simple online tool is provided (PoolingPlanner, http://www.kchew.ca/PoolingPlanner/), which calculates the effective sample size (ESS) of a DNA pool given our estimates of array variance and pool-construction variance for a range of replicate array values. ESS is intended to be used in a power calculator to perform power calculations that are adjusted for pooling a priori. This allows an experimenter to quickly calculate the loss of power associated with a pooling experiment and to make an informed decision on whether a pool-based GWAS is worth pursuing.

2.1 Background

GWAS have been used to examine over 200 diseases and traits, and identified over 4000 SNPs associated with these traits, as listed in the Catalog of Published Genome-Wide Association Studies51. In many cases, GWAS have revealed previously unsuspected molecular mechanisms of disease, highlighting the value of this hypothesis-free approach (reviewed in 14). Unfortunately, GWAS are very costly due to the price of genotyping thousands of individual DNA samples on high-density SNP arrays. Consequently, GWAS have only been feasible for research groups with the necessary budget, studying well-funded diseases or traits. A simple strategy to drastically reduce cost is to replace individual genotyping in the discovery stage (stage 1) of a GWAS with genotyping of DNA pools. DNA pools yield estimated allele frequencies rather than observed genotypes; hence, this step has been called allelotyping41. Several studies have provided proof of principle for the pooling strategy, using it to re-discover known disease-variant associations of moderate to large effect size for a fraction of the cost of conventional GWAS41,184. To date, more than twenty pool-based GWAS have been published, many reporting genome-wide significant associations for diseases and traits such as follicular lymphoma, otosclerosis, multiple sclerosis, Alzheimer's disease, melanoma, psoriasis, and skin colour36,42-46,50.
Depending on the number of samples being pooled, the cost reduction in stage 1 can easily reach 100-fold. Consider: if a SNP array costs $250 and there are 2000 cases and 2000 controls to genotype, a million dollars is required for stage 1 individual genotyping alone. Conversely, the pool-based experiment using 12 replicate arrays on each of two pools (case and control) would cost $6000, or 0.6% of that amount. Simply put, a pooling GWAS is feasible for most grant budgets, and an individual-genotyping GWAS is not. The criticism of pool-based GWAS is that they have reduced power relative to conventional GWAS because of errors introduced by estimating allele frequency from DNA pools. While it is true that pool-based GWAS forfeit some power, these losses can be estimated, are often less than expected, and may not change the associations discovered. Although array costs will continue to drop and conventional GWAS will become more feasible, the potential savings associated with the pooling approach will scale in proportion, leaving more funds for subsequent replication, fine-mapping, and sequencing of associated genomic regions. For diseases or traits with unknown biology or genetic involvement, a pooling GWAS represents an economical way to test for associations with moderate odds ratios. In addition, work using DNA extracted from pooled whole blood suggests that a large time-savings (50-100 fold) may also be possible, presenting the possibility of an incredibly fast (<1 month) and economical experiment184. For a comprehensive introduction and review of DNA pooling readers are directed to Sham et al. 2002 and Pearson et al. 2007, and for a set of best practices for any GWAS to Pearson & Manolio (2008)185. In the process of estimating allele frequencies from DNA pools, error is introduced, and must be taken into consideration to plan an adequately powered experiment or to appropriately calculate association statistics48,186.
With respect to doing this, the most important consideration is the pooling variance47, the variance in the errors arising from estimating allele frequency from a DNA pool. Pooling variance is the sum of many sources of variation, including in particular, array variance and pool construction variance. Array variance can be attributed to those errors arising from estimating allele frequency from a DNA pool on an SNP array47,187. Pool construction variance can be attributed to those errors arising from the physical creation of a DNA pool. As pooling variance increases, the ability of a pool-based GWAS to detect odds ratios similar to those detectable by conventional GWAS decreases. By controlling the pooling variance (as close to zero as possible), the power of a pool-based GWAS is preserved (relative to a conventional GWAS). Here we assume the pooling variance is equal to the sum of array variance and pool-construction variance47, and calculate which component makes the greater contribution to the pooling variance. This is relevant to determining how best to design a pool-based GWAS, and how to allocate resources. For example, replicate arrays can be used to reduce array variance and/or pools can be constructed in replicate to control pool construction variance. We partition and estimate variance components using the approach described by MacGregor 47, which examines variation in allele frequency measurements between and within DNA pools. Briefly, within-pool variation is that observed between two arrays used to allelotype the same DNA pool (i.e. replicate arrays), and is an estimate of array variance. Between-pool variation is that observed between two arrays used to allelotype two different DNA pools, and is an estimate of pooling variance. Estimates of array variance and pooling variance are used to calculate pool construction variance by subtraction47. 
Using this approach in an analysis of two DNA pools allelotyped on twelve Affymetrix Genechip HindIII arrays (6 arrays per pool) it was found that approximately 87.5% of pooling variation could be attributed to the arrays, leaving 12.5% to pool-construction47. It was noted, however, that more data sets would be necessary to determine the variability in these estimates. Here we inspect 27 DNA pools allelotyped on a total of 128 Illumina arrays, including the Human1M Single (1M-Single), Human1M Duo (1M-Duo), and HumanHap660 Quad (660-Quad) arrays, allowing us to better address the question of what range of values array variance and pool-construction variance are likely to take. In addition, we perform our analysis on normalized array data and raw array data to examine how normalization affects pooling variance estimates. In the first part of this study we establish values for array variance and pool-construction variance. In the second part, we use these estimates to calculate the effective sample size (ESS) of a DNA pool (where ESS is the equivalent number of samples that would need to be individually genotyped to give a similar result)188. We also present a simple online tool, PoolingPlanner, which uses our empirical variance estimates as default values to calculate the ESS of a DNA pool given a range of replicate array values (available at http://www.kchew.ca/PoolingPlanner/). PoolingPlanner also accepts user-supplied values for variance estimates. ESS can then be used in one of the available power calculators, such as CaTS26, or Quanto189, to perform pool-adjusted power calculations41. PoolingPlanner is intended to help researchers quickly calculate the loss of power associated with a particular pooling experiment, a first step in making an informed decision on whether a pool-based GWAS is worth pursuing.

2.2 Methods

2.2.1 Data Collection

Analyses are based on 27 DNA pools ranging in size from 74 to 446 individual samples.
These were allelotyped on a collective total of 128 Illumina beadarrays: 24 1M-Single, 32 1M-Duo, and 72 660-Quad. The dataset comprises four batches of genotyping (details given in Appendix B.1), which correspond to four ongoing pool-based GWAS (Chapters 3 and 4 present two of these GWAS). Each of these studies was approved by the joint Clinical Research Ethics Board of the British Columbia Cancer Agency and the University of British Columbia. All subjects gave written informed consent. Genomic DNA was extracted from peripheral venous blood collected between 2001 and 2008 by different laboratories using different methods. DNA samples were diluted to 50-100 ng/µL and then quantified in duplicate by fluorometry using PicoGreenTM (Molecular Probes, Eugene, OR, US). Pools were constructed by combining 200 ng of each sample DNA by manual pipetting. Pools were assayed at the Centre for Applied Genomics at Sick Children’s Hospital in Toronto. SNP allele frequency in DNA pools was estimated using Illumina’s beadarrays, where on average each SNP is estimated by 16-18 “bead” observations per array (oligonucleotide probes are designed to assay a SNP and attached to beads, where individual beads are coated with one probe type and interrogate one site in the genome)18. The equation used in the calculation of each SNP allele frequency was:

$$\hat{p} = \frac{1}{n} \sum_{i=1}^{n} \frac{G_i}{G_i + R_i}$$

where G_i and R_i are the green and red fluorescence intensity for the ith bead assaying a given SNP. The two colours correspond to the two alleles of the SNP, and n is the number of beads assaying a given SNP, typically 16-18. Illumina beadarrays are manufactured such that there are multiple strips on each array18, and our preliminary analysis revealed that unique groups of SNPs are consistently on only a subset of strips.
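As an illustration of the per-SNP allele frequency equation above, the following minimal Python sketch averages G/(G + R) over the beads assaying one SNP. The function name and the intensity values are hypothetical and chosen only for illustration; the sketch also assumes that beads with unusable intensities (for example, both channels at zero) have already been removed.

```python
def allele_frequency(green, red):
    """Mean of G_i / (G_i + R_i) over the beads assaying a single SNP.

    Assumes failed beads (e.g. G_i + R_i == 0) were filtered out beforehand.
    """
    if len(green) != len(red) or not green:
        raise ValueError("need equal-length, non-empty intensity lists")
    ratios = [g / (g + r) for g, r in zip(green, red)]
    return sum(ratios) / len(ratios)


# Hypothetical intensities for four beads (illustrative values only).
green = [300.0, 100.0, 200.0, 400.0]
red = [100.0, 300.0, 200.0, 400.0]
p_hat = allele_frequency(green, red)
print(p_hat)
```

Averaging per-bead ratios, rather than taking the ratio of summed intensities, follows the equation as written; the two approaches generally give different values when bead intensities vary.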
From our previous experience, and that of others187, it was known that the average relative intensity of the red and green channels could differ dramatically between strips and between arrays. To prevent these manufacturing and/or assaying properties from biasing allele frequency estimation, a simple normalization was performed. Each array was normalized on a strip-by-strip basis by adjusting the red channel intensity, as described previously187, to give a mean strip-wide allele frequency estimate of 0.5. To examine the effect of this normalization on the variance terms estimated, the analyses presented in this paper are performed on both normalized and raw Illumina array data.

2.2.2 Statistical Analysis

Our purpose was to calculate empirical estimates of pooling variance and array variance, and then to estimate pool-construction variance by subtraction. Pooling variance and array variance are both estimated by calculating allele frequency differences across two paired (by SNP, for all SNPs on the array) arrays47. The two arrays used in the comparison will dictate whether an estimate of array or pooling variance is generated. For example, to calculate array variance, let the allele frequency estimate on array x used to allelotype DNA pool a be:

$$\tilde{p}_{ax} = \hat{p}_a + e_{array\_x}$$

where $\hat{p}_a$ is the true allele frequency for those samples in DNA pool a, and $e_{array\_x}$ is the error associated with estimating the allele frequency from a DNA pool48. Then, the variance of the allele frequency difference on two replicate arrays (x = 1, 2) is47:

$$\mathrm{var}(\tilde{p}_{a1} - \tilde{p}_{a2}) = \mathrm{var}(\hat{p}_a - \hat{p}_a + e_{array\_1} - e_{array\_2}) = \mathrm{var}(e_{array\_1} - e_{array\_2}) = 2\,\mathrm{var}(e_{array})$$

This yields an estimate of array variance:

$$\mathrm{var}(e_{array}) = \mathrm{var}(\tilde{p}_{a1} - \tilde{p}_{a2}) / 2$$

where $\mathrm{var}(\tilde{p}_{a1} - \tilde{p}_{a2})$ is calculated as the average of the squared allele frequency differences for all SNPs, i (i = 1...n), on arrays 1 and 2:

$$\mathrm{var}(\tilde{p}_{a1} - \tilde{p}_{a2}) = \frac{1}{n} \sum_{i=1}^{n} (\tilde{p}_{a1,i} - \tilde{p}_{a2,i})^2$$

$\mathrm{var}(e_{array})$ is assumed constant for all SNPs.
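The array variance estimate can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the analysis code used in this study: the function names and the four-SNP allele frequency vectors are hypothetical, and when more than two replicate arrays are supplied the sketch simply averages the estimates from all possible pairings.

```python
from itertools import combinations


def pair_difference_variance(p_x, p_y):
    """Average squared allele frequency difference across SNPs for two arrays."""
    if len(p_x) != len(p_y) or not p_x:
        raise ValueError("arrays must cover the same, non-empty set of SNPs")
    return sum((x - y) ** 2 for x, y in zip(p_x, p_y)) / len(p_x)


def array_variance(replicates):
    """Estimate var(e_array) from replicate arrays of one DNA pool.

    Each pairing contributes var(p_x - p_y) / 2; pairings are then averaged.
    """
    pairs = list(combinations(replicates, 2))
    if not pairs:
        raise ValueError("at least two replicate arrays are required")
    estimates = [pair_difference_variance(x, y) / 2 for x, y in pairs]
    return sum(estimates) / len(estimates)


# Hypothetical allele frequency estimates for four SNPs on two replicate arrays.
array_1 = [0.10, 0.50, 0.90, 0.30]
array_2 = [0.12, 0.48, 0.90, 0.34]
var_e_array = array_variance([array_1, array_2])
print(var_e_array)
```

With real data each vector would hold one estimate per SNP on the array, paired by SNP across arrays.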
If more than two replicate arrays are used to allelotype a given DNA pool, multiple array comparisons are possible, and the best estimate of $\mathrm{var}(e_{array})$ is the average of all possible pairings47. If arrays 1 and 2 interrogate two different DNA pools, an estimate of pooling variance can be obtained. When two DNA pools (a, b) are constructed from identical samples (i.e. replicate pool construction),

$$\mathrm{var}(\tilde{p}_{a1} - \tilde{p}_{b2}) = 2\,\mathrm{var}(e_{array}) + 2\,\mathrm{var}(e_{construction})$$

where $\mathrm{var}(e_{construction})$ is the variance in the pool construction errors, which are assumed to be constant for all SNPs. Thus, an estimate of pooling variance, $\mathrm{var}(e_{pooling-1})$, is47:

$$\mathrm{var}(e_{pooling-1}) = \mathrm{var}(\tilde{p}_{a1} - \tilde{p}_{b2}) / 2$$

where “pooling-1” is used to indicate that this estimate of pooling variance is based on the comparison of arrays that allelotype two replicate DNA pools. As before, if more than two replicate arrays are used to allelotype a given DNA pool, multiple array comparisons are possible, and the best estimate of $\mathrm{var}(e_{pooling-1})$ is the average of all possible pairings47. When DNA pools a and b are constructed from non-identical samples (e.g. a case and control pool), an alternative estimate of pooling variance is $\mathrm{var}(e_{pooling-2})$47,48:

$$\mathrm{var}(e_{pooling-2}) = \left[ \mathrm{var}(\tilde{p}_{a1} - \tilde{p}_{b2}) - \tilde{V}_{a1,b2} \right] / 2$$

Here $\mathrm{var}(\tilde{p}_{a1} - \tilde{p}_{b2})$ is calculated as the average of the squared allele frequency difference minus a random sampling variance term, $\tilde{V}_{a1,b2}$, for all SNPs, i (i = 1...n), on arrays 1 and 2:

$$\mathrm{var}(e_{pooling-2}) = \frac{1}{n} \sum_{i=1}^{n} \left[ (\tilde{p}_{a1,i} - \tilde{p}_{b2,i})^2 - \tilde{V}_{a1,b2,i} \right] / 2$$

$\tilde{V}_{a1,b2}$ is calculated using the sampling variance equation:

$$\tilde{V}_{a1,b2,i} = p_{a1,i}(1 - p_{a1,i}) / N_{a1} + p_{b2,i}(1 - p_{b2,i}) / N_{b2}$$

The sampling variance term refers to the additional component of variation present when comparing non-identical pools, where two random sample sets have been drawn from the same population.
Although cases and controls are being compared, for most SNPs on an array we expect no association between SNP and case-control status48. Figure 2.1 visually summarizes the three types of pair-wise array comparisons used, including the sources of error in each comparison. When comparing arrays used to allelotype the same DNA pool (henceforth referred to as ‘Type A’ comparisons), the variation observed can only arise due to the arrays, giving an estimate of array variance. When comparing arrays used to allelotype replicate DNA pools (henceforth referred to as ‘Type B’ comparisons), the variation observed is due to the arrays and pool-construction, giving a direct estimate of pooling variance. Pool-construction variance is then calculated by subtracting the array variance (Type A) from the pooling variance (Type B). If replicate DNA pools have not been constructed, as is the case for many of the pools in our data set, we are still able to estimate the pooling variance by comparing non-identical pools (henceforth referred to as ‘Type C’ comparisons) and accounting for the additional sampling variance term that arises in this case. Pool-construction variance is then calculated by subtracting Type A values from Type C values. A number of assumptions are made in this analysis. We assume that the array variance is comparable across the DNA pools in an experiment, and that the average array variance is the best estimate. For arrays with larger than average array variance, array variance will be underestimated. This situation could occur due to greater variation in PCR amplification steps and/or the measurement of allele frequency (detection of red and green fluorescence) on an array. It is known that SNPs with smaller minor allele frequencies are estimated with a greater error, i.e. var(e_array) is not constant for all SNPs. For SNPs with a small minor allele frequency, the average array variance will tend to underestimate the SNP’s array variance.
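The Type C comparison and the subtraction step can be sketched as follows. This is a hedged illustration rather than the code used for the analyses: function and parameter names are hypothetical, the array allele frequency estimates are used in the sampling variance term, `n_a` and `n_b` stand for the N terms exactly as they appear in the sampling variance equation, and clamping negative differences to zero mirrors the convention reported in the Results.

```python
def sampling_variance(p_a, p_b, n_a, n_b):
    """Per-SNP sampling variance term for two non-identical pools."""
    return p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b


def pooling_variance_type_c(p_a, p_b, n_a, n_b):
    """Type C estimate: mean over SNPs of [(p_a - p_b)^2 - V] / 2."""
    if len(p_a) != len(p_b) or not p_a:
        raise ValueError("arrays must cover the same, non-empty set of SNPs")
    terms = [((a - b) ** 2 - sampling_variance(a, b, n_a, n_b)) / 2
             for a, b in zip(p_a, p_b)]
    return sum(terms) / len(terms)


def construction_variance(var_pooling, var_array):
    """Pool-construction variance by subtraction, with negatives set to zero."""
    return max(0.0, var_pooling - var_array)


# Single hypothetical SNP: frequencies 0.5 and 0.6, N = 100 in each pool.
var_pool_c = pooling_variance_type_c([0.5], [0.6], n_a=100, n_b=100)

# Illustrative inputs of the same order of magnitude as the variances reported here.
var_construction = construction_variance(5.1e-4, 3.3e-4)
print(var_pool_c, var_construction)
```

In practice the Type C estimate is averaged over many SNPs and over all array pairings, so that chance departures of a single SNP from its expected sampling variance average out.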
We also assume that the pooling variance is constant across all SNPs, and that unequal amplification and/or hybridization of alleles (A or B) will have a negligible effect on results. Because our analysis is based upon contrasting array data from two DNA pools, the effects of unequal hybridization should largely cancel out48,187.

2.2.3 PoolingPlanner Theory

In choosing to conduct a pool-based GWAS, one accepts a loss in power relative to a conventional GWAS. This is because as the variance in allele frequency estimation (AFE) increases, the power to detect SNP associations decreases. How much power is lost can be expressed in terms of the effective sample size (N*) resulting from pooling N individuals188. PoolingPlanner (available at http://www.kchew.ca/PoolingPlanner/) uses an estimate of var(e_pooling) to calculate the N* of a DNA pool. N* and var(e_pooling) are related through two expressions for relative sample size (RSS) [defined in 188]:

$$RSS = \frac{N^*}{N}$$

$$RSS = \frac{V_s}{V_s + \mathrm{var}(e_{pooling})}$$

In the first, the RSS of a DNA pool is expressed as the ratio of effective sample size to the actual sample size (N). In the second, it is expressed as the fraction of the total variance, (V_s + var(e_pooling)), explained by the sampling variance, V_s. V_s is calculated as p(1-p)/2N, where p is the average minor allele frequency on the array, and N is the number of individuals contributing to the DNA pool. If DNA pools have been constructed in replicate we let var(e_pooling) = var(e_pooling-1); otherwise we let var(e_pooling) = var(e_pooling-2). The two equations for RSS can then be equated and solved for N*. It is worth noting that because our calculation of RSS relies on our empirical estimates of var(e_pooling), estimates which are based on contrasting allele frequencies in two DNA pools, the effects of unequal hybridization, which would typically thwart a direct comparison of a pooling-based and conventional genotyping experiment, cancel out48,187.
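Equating the two RSS expressions gives N* = N · V_s / (V_s + var(e_pooling)). The Python sketch below works through that calculation; it is an illustration under stated assumptions and not PoolingPlanner's actual code. The function name, the example pool size, and the minor allele frequency are hypothetical, and the optional `k` argument assumes that var(e_pooling) scales as 1/k with k replicate arrays.

```python
def effective_sample_size(n, maf, var_pooling, k=1):
    """Effective sample size N* of a pool of n individuals.

    V_s = maf * (1 - maf) / (2 * n); RSS = V_s / (V_s + var_pooling / k);
    N* = RSS * n. The 1/k scaling for k replicate arrays is an assumption.
    """
    if n <= 0 or k < 1 or not 0 < maf < 1 or var_pooling < 0:
        raise ValueError("invalid input")
    v_s = maf * (1 - maf) / (2 * n)
    rss = v_s / (v_s + var_pooling / k)
    return rss * n


# Hypothetical pool of 250 individuals with an average minor allele
# frequency of 0.5, so V_s = 5e-4, matched here by var(e_pooling) = 5e-4.
ess_single = effective_sample_size(250, 0.5, 5e-4)
ess_four_arrays = effective_sample_size(250, 0.5, 5e-4, k=4)
print(ess_single, ess_four_arrays)
```

The resulting N* can then be entered in place of N in a standard power calculator to obtain pool-adjusted power.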
Replicate arrays can be used to scale var(e_pooling) by a factor of 1/k, where k is the number of replicate arrays41. In making var(e_pooling) smaller, the RSS and N* become larger. Effective sample size can then be used with one of the available power calculators, for example CaTS26 or Quanto189, to perform pool-adjusted power calculations41. PoolingPlanner is intended to help first-time users plan a DNA pooling experiment, and our empirical estimates of array variance and pool-construction variance are supplied as the default setting for the program for this reason. Users with their own estimates of variances can provide these to the program as well.

2.3 Results

In our analyses, beads with negative intensity values in the red, green, or both channels were encountered. The number of negative beads varied by strip and typically affected 1-10% of beads, a pattern consistently seen across all arrays. This can occur due to local background intensity removal at the point of image processing190. These beads were removed from our variance calculations. Furthermore, beads with zero in both the red and green channels were considered failed beads and also dropped from our analysis. There were typically fewer than 100 of these per strip. Finally, SNPs having fewer than four bead observations were excluded. The rationale for this was that SNPs having fewer than four bead observations would have poorly estimated allele frequencies.

2.3.1 Array Variance: Type A Comparisons

Array variance was estimated by comparing replicate arrays (Type A comparisons in Figure 2.1) for three types of Illumina beadarrays: the 1M-Single, the 1M-Duo, and the 660-Quad. The results for normalized and raw data are given in Table 2.1, and box plots in Figure 2.2 provide a visual summary of the estimates. Normalization dramatically reduced the range of values observed for array variance for all array types. Most estimates of array variance, regardless of array type, fell between 2.5 × 10^-4 and 5.0 × 10^-4.
Normalization also reduced the mean array variance estimate ~2.5-fold for the 1M-Duo arrays, and ~8-fold for the 1M-Single and 660-Quad arrays. For the 1M-Single arrays, 12 DNA pools were allelotyped using 24 arrays (2 arrays per pool), yielding 12 estimates of array variance, the mean of which was 3.8 × 10^-4 (normalized) and 2.9 × 10^-3 (raw data) (Table 2.1). For the 1M-Duo array, 8 DNA pools were analyzed on 32 arrays (4 arrays per pool), yielding 48 estimates of var(e_array). Three of these estimates, each from pair-wise array comparisons involving the same array, were extreme outliers in both the normalized and raw dataset (see Figure 2.3). This array was determined faulty and removed from further analysis. For the remaining 45 estimates the mean var(e_array) was 3.2 × 10^-4 (normalized) and 9.0 × 10^-4 (raw data) (Table 2.1). Unlike the data for the 1M-Single arrays, the 1M-Duo array data spanned two batches of genotyping, carried out at two different times. To look for batch effects the 1M-Duo data was also analyzed stratified by batch. The mean array variance was significantly different between batches for normalized data but not raw data (based on non-overlapping confidence intervals constructed assuming a normal distribution). Batch 1 (18 estimates of var(e_array)) and batch 2 (27 estimates of var(e_array)) had mean estimates of array variance of 4.2 × 10^-4 and 2.6 × 10^-4, respectively. For the 660-Quad arrays, 7 pools were assayed using 72 arrays (6 or 12 arrays per pool), and mean array variance was 3.3 × 10^-4 for normalized data, and 2.7 × 10^-3 for raw data (Table 2.1).

2.3.2 Pooling Variance: Type B and C Comparisons

Pool-construction variance was estimated for 27 DNA pools; results are discussed in order by Illumina array type. Six pools were allelotyped on the 1M-Single array, and for each, pools were constructed in replicate and allelotyped by two arrays.
This allowed us to calculate and compare estimates of pooling variance and pool-construction variance derived from Type B and Type C comparisons. Figure 2.4 summarizes the var(epooling) and var(econstruction) estimates for those pools on the 1M-Single array. For normalized data var(epooling-1) ranged from 3.2x10-4 to 5.5x10-4 and averaged 4.0x10-4. In comparison var(epooling-2) ranged from 3.5x10-4 to 7.0x10-4 and averaged 4.8x10-4. Var(econstruction-1) ranged from 0 to 6.7x10-5 and had a mean of 2.9x10-5 (where negative values have been set to zero). Thus, for these pools var(econstruction-1) accounts for between 0 and 20%, or on average 7.5%, of the pooling variance when using Type B derived values (see Appendix B.2 for all values). Var(econstruction-2) ranged from 0 to 3.2x10-4 and averaged 1.0x10-4; thus, pool-construction variance accounted for between 0 and 46%, or on average 20%, of the pooling variance when using Type C derived values (Appendix B.2). There does not appear to be any correlation between pool size and pool-construction variance (see Figure 2.4). Using raw data, estimates of var(epooling-1) were approximately 8-fold higher than for the normalized data. Estimates of var(econstruction-1) tended to be higher as well, averaging ~20% of the pooling variance. Var(epooling-2) estimates followed the same pattern, with larger estimates of pooling variance and pool-construction variance (data not shown). Pools allelotyped on the 1M-Duo and 660-Quad arrays were not constructed twice; hence, for these we estimated pool-construction variance based on Type C comparisons only. Seven DNA pools were allelotyped on the 660-Quad array, two using six replicate arrays (396 estimates of var(epooling-2) each), and five using twelve replicate arrays (720 estimates of var(epooling-2) per pool). Figure 2.5 summarizes the var(epooling-2) and var(econstruction-2) estimates for these pools (normalized data).
Var(epooling-2) estimates ranged from 4.3x10-4 to 5.7x10-4, and averaged 5.1x10-4. Var(econstruction-2) estimates ranged from 1.0x10-4 (23%) to 2.4x10-4 (42%) and averaged 1.9x10-4 (35%). These estimates of pooling variance are very similar to those seen for pools on the 1M-Single array; however, the estimates of pool-construction variance are higher (see Appendix B.3 for all values). For the raw data var(epooling-2) estimates ranged from 2.6x10-3 to 2.9x10-3, and averaged 2.7x10-3. The matched var(econstruction-2) estimates ranged from 0 to 2.6x10-4 (9%) and averaged 1.9x10-4 (2%). 1M-Duo arrays were analyzed separately by batch, using batch-specific estimates of array variance for normalized data. The 1M-Duo batch 1 data contained three DNA pools, each allelotyped by four replicate arrays; therefore, each var(epooling-2) estimate is the average of 32 pair-wise array comparisons. Figure 2.6 summarizes var(epooling-2) and var(econstruction-2) estimates for these pools (normalized data). Var(epooling-2) was estimated at 5.6x10-4, 6.0x10-4, and 6.1x10-4. The matched var(econstruction-2) estimates were 1.5x10-4, 1.8x10-4, and 1.9x10-4, or 26%, 31%, and 32% of the pooling variance, for pools sized 122, 246, and 121 (see Appendix B.3 for values). These values reflect those seen for pools on the 660-Quad and 1M-Single arrays. In comparison, the 1M-Duo batch 2 data deviated dramatically. This batch contained 5 pools, each also allelotyped by four replicate arrays. For these, var(epooling-2) ranged from 1.8x10-3 to 3.7x10-3 and averaged 2.6x10-3, and var(econstruction-2) estimates ranged from 7.9x10-4 (43%) to 2.7x10-3 (72%) (see Appendix B.3). For these pools the estimates of pooling variance are nearly 2-3 fold higher than those of batch 1, but the array variance remained low at 2.4x10-4, leading to high estimates of pool-construction variance (see discussion). For raw data, batches 1 and 2 were analyzed combined using all possible array comparisons and var(earray) = 9.0x10-4.
Estimates of var(epooling-2) ranged from 2.2x10-3 to 5.4x10-3 and averaged 3.4x10-3. Var(econstruction-2) estimates averaged 51% of the calculated var(epooling-2).

2.3.3 PoolingPlanner Example

To demonstrate how to use PoolingPlanner we consider a hypothetical scenario (one that is nonetheless approximately consistent with what was encountered for each of the EOC subtypes in Chapter 3). A researcher has a collection of samples including 300 cases and 1000 controls and wants to conduct a pool-based GWAS (typical GWAS start with ~2000 cases and controls, but the conclusions drawn from the hypothetical scenario presented nonetheless apply). The researcher needs to decide how many arrays to use, and wants to construct power curves that take into consideration the power loss concomitant with this cost-efficient strategy. They plan on using Illumina’s 660-Quad array and normalizing their data. PoolingPlanner is used to calculate the effective sample size of each DNA pool using four input values: 1) var(earray), 2) var(econstruction), 3) pool size, and 4) allele frequency. Figure 2.7A shows the PoolingPlanner input panel for the case pool; Figure 2.7B shows the input panel for the control pool. PoolingPlanner will supply the var(earray) value calculated from our 660-Quad normalized data, 3.3x10-4 (see Table 2.2). Alternatively, the user may specify a custom value. In this example we assume var(econstruction) is 30% of the pooling variance, chosen to reflect values we observed. Var(econstruction) is entered into PoolingPlanner by specifying “Array:Construction Ratio= 7:3”, as seen in Figure 2.7A and 2.7B. An exact value for var(econstruction) can also be entered (30% of 3.3x10-4 would be 9.9x10-5). For allele frequency, by default PoolingPlanner uses HapMap CEU data (release 27) to set p to the average minor allele frequency (MAF) on the 1M-Single, 1M-Duo, or 660-Quad Illumina array.
For the 1M-Single and 1M-Duo arrays p = 0.21 (>95% of SNPs had available HapMap data), and for the 660-Quad array p = 0.29 (87% of SNPs had available HapMap data). Estimates of p based on our pooled array data were similar (see Appendix B.4). In this example the average MAF is set to 0.29, but the user can enter any value between 0 and 0.5. Once these values are entered the program calculates the relative and effective sample size of each DNA pool for a range of replicate array values, and provides a corresponding table of values as seen in Figure 2.7A and 2.7B. A plot of relative sample size versus number of replicate arrays is also automatically generated. For a DNA pool containing 300 individuals (blue line in Figure 2.7C), an RSS of 80% is achieved with 6 arrays (N* is 244) while an RSS of 90% requires 13 arrays (N* is 271). In contrast, for a pool of 1000 individuals (red line in Figure 2.7C), an RSS of 80% is achieved with 19 arrays (N* is 806). This plot makes it easy to see at what point additional replicate arrays begin to yield diminishing returns in terms of increasing the effective sample size of a DNA pool. To perform pooling-adjusted power calculations, a pool’s effective sample size, output by PoolingPlanner, is entered into a power calculator. We have used Quanto189 for this example. Assuming an unmatched case-control design testing for gene-only effects using a log-additive model, where the incidence of the case phenotype is 0.02% and the risk allele frequency (prisk) is 29% (and in complete linkage disequilibrium with a SNP on the array), the power curves corresponding to a pooling experiment where 3, 6, 12, or 24 Illumina 660-Quad replicate arrays are used per pool are given in Figure 2.8. The power curve for individual genotyping is also plotted for reference. Table 2.2 accompanies Figure 2.8 and gives the minimum detectable odds ratio (MDOR) at 80% power for each curve when prisk is 0.29, and for comparison, when prisk is 0.1.
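The relative and effective sample sizes quoted in this example can be approximated with a short calculation. The sketch below is not PoolingPlanner itself. It assumes that RSS = V / (V + var(epooling)/k), where V = p(1-p)/(2N) is the binomial sampling variance of the allele frequency, and that an array:construction ratio of 7:3 means var(epooling) = var(earray)/0.7. These relationships are inferred from the description in this chapter rather than taken from the program, and all function and parameter names are illustrative. Under these assumptions the sketch gives N* values that round to the 244, 271, and 806 quoted for this scenario.

```python
def sampling_variance(n_individuals, p):
    """Binomial sampling variance of an allele frequency from 2N chromosomes."""
    return p * (1.0 - p) / (2.0 * n_individuals)


def relative_sample_size(n_individuals, k_arrays, p=0.29,
                         var_array=3.3e-4, construction_share=0.30):
    """RSS of a pool under the assumed model RSS = V / (V + var_pooling / k).

    Assumption: var_pooling = var_array / (1 - construction_share), i.e. a
    7:3 array:construction ratio when construction_share is 0.30.
    """
    var_pooling = var_array / (1.0 - construction_share)
    v = sampling_variance(n_individuals, p)
    return v / (v + var_pooling / k_arrays)


def effective_sample_size(n_individuals, k_arrays, **kwargs):
    """N* = N x RSS."""
    return n_individuals * relative_sample_size(n_individuals, k_arrays, **kwargs)


def min_arrays_for_rss(n_individuals, target_rss, max_arrays=100, **kwargs):
    """Smallest number of replicate arrays reaching target_rss, or None."""
    for k in range(1, max_arrays + 1):
        if relative_sample_size(n_individuals, k, **kwargs) >= target_rss:
            return k
    return None


if __name__ == "__main__":
    for n, target in [(300, 0.80), (300, 0.90), (1000, 0.80)]:
        k = min_arrays_for_rss(n, target)
        print(n, target, k, round(effective_sample_size(n, k)))
```

Passing a different `var_array`, `construction_share`, or `p` lets the same calculation be repeated for other array types or allele frequencies.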
Assuming individual genotyping, the MDOR at 80% power would be 1.32 when prisk is 0.29. Using 24 arrays per pool this value rises incrementally to 1.33. Using 12, 6, or 3 arrays per pool, the MDORs further increase to 1.35, 1.38, and 1.44, respectively. Only when 3 arrays are used per pool does the MDOR dramatically differ between pooling and individual genotyping. Marginal improvements in MDOR should be considered in light of increasing experimental cost, and the percent cost of a pooling GWAS relative to a conventional GWAS is given in Table 2.2 to highlight this difference. If arrays cost $250, the ability to detect an odds ratio of 1.38 with 80% power would cost $3,000 (6 arrays per pool), while the ability to detect an odds ratio of 1.32 would cost $325,000 (individual genotyping). Pearson et al. (2007)41 demonstrated that for phenotypes suggestive of genetic variants conferring moderate-to-large effects, including Alzheimer disease, progressive supranuclear palsy, and sudden infant death syndrome, the difference in detectable odds ratios would not change the overall outcome of the study. In a pooling GWAS, as in conventional GWAS, we have less power to detect associations for rarer risk alleles; see the MDOR in Table 2.2 when prisk is 0.1. We note that as prisk gets smaller, the difference in the MDOR for a pooling versus individual genotyping experiment becomes more noticeable. For example, when 6 replicate arrays are used per pool and prisk is 0.29, the MDOR differs by 0.06 from individual genotyping, but this difference becomes 0.09 when prisk is 0.1. It is also worth noting in Table 2.2 that using the same number of replicate arrays on different sized DNA pools results in very different RSS values. This might prompt one to consider allotting a fixed number of arrays in such a way as to achieve the largest average RSS value (i.e. an unequal number of arrays per pool).
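The question of how to allot a fixed number of arrays between two pools of different size can be explored numerically. The sketch below assumes that each pool contributes its binomial sampling variance plus var(epooling)/k to the variance of the case-control allele frequency difference, with the same var(epooling) for both pools. The function names and the default var(epooling) of 3.3x10-4/0.7 are illustrative assumptions, not values taken from PoolingPlanner. The search simply evaluates every split of the array budget and keeps the one with the smallest variance of the difference.

```python
def difference_variance(n_case, n_ctrl, k_case, k_ctrl,
                        p=0.29, var_pooling=3.3e-4 / 0.7):
    """Variance of the case-control allele frequency difference (assumed model).

    Each pool contributes sampling variance p(1-p)/(2N) plus var_pooling / k.
    """
    sampling = p * (1.0 - p) / (2.0 * n_case) + p * (1.0 - p) / (2.0 * n_ctrl)
    pooling = var_pooling / k_case + var_pooling / k_ctrl
    return sampling + pooling


def best_split(total_arrays, n_case, n_ctrl, **kwargs):
    """Return the (k_case, k_ctrl) split of total_arrays with the smallest variance."""
    candidates = [(k, total_arrays - k) for k in range(1, total_arrays)]
    return min(
        candidates,
        key=lambda ks: difference_variance(n_case, n_ctrl, ks[0], ks[1], **kwargs),
    )


if __name__ == "__main__":
    # 24 arrays shared between a 300-case pool and a 1000-control pool.
    print(best_split(24, 300, 1000))
```

Under this model the sampling terms do not depend on how the arrays are split, so only the pooling terms determine the preferred allocation; for 24 arrays the search returns 12 arrays per pool.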
Contrary to what might be expected, the maximally powered pool-based experiment occurs when arrays are equally distributed amongst pools, regardless of differences in pool size and RSS, assuming the pool-construction variance is constant (see Appendix B.5, Appendix B.6). By conducting an analysis such as this a user can decide what power is forfeited by conducting a pool-based GWAS, and decide whether the approach makes practical sense in their situation.

2.4 Discussion

In the first part of this study we set out to establish a range of experimentally observed values for array variance on Illumina’s SNP-genotyping beadarrays. At the same time, we wanted to establish a range of values for pool-construction variance. In the second part, we used these estimates to calculate the effective sample size of a DNA pool given a range of replicate array values, and provided an online tool to allow readers to do the same. At the time of our analysis we were aware of only one report that estimated array variance (var(earray) = 1.1x10-4) for an Illumina HumanHap300 beadarray187. Illumina has since released higher density arrays (>1 million SNPs per array), and we wanted to determine if increased SNP density negatively impacted array variance. Overall, we found this was not the case. All of the Illumina array types examined here (660-Quad, 1M-Single, 1M-Duo) had very similar var(earray) estimates, centering around 3x10-4 for our normalized data, which is largely in keeping with the HumanHap300 result. We expect this result would extend to the HumanOmni1-Quad array, although it was not analyzed here. We found that the normalization procedure we used reduced the array variance 2- to 8-fold, and a newly reported normalization algorithm suggests that array variance may be reduced even further37. Reduced array variance should mean more precise estimates of allele frequency, which should further minimize the loss of power associated with using the DNA pooling strategy.
The Illumina arrays analyzed here yielded var(earray) estimates ~10-fold smaller than those of the Affymetrix HindIII 50K arrays (var(earray) = 1.26x10-3) analyzed by MacGregor47. A similar result was noted when Affymetrix arrays were compared to Illumina HumanHap300 arrays187. In part, this may be explained by differences in the manufacturing of the arrays. MacGregor et al. 2008 report that pooling errors appear to be highly related to the number of probes used to estimate SNP allele frequency. While 10 probe pairs are assigned to each SNP on the Affymetrix HindIII 50K arrays187, on average 16-18 beads are used on the Illumina arrays. Further, on Illumina arrays beads are randomly dispersed on a slide18, while on Affymetrix arrays probes are fixed in a given location, making the latter more susceptible to location-specific technical errors. As the array variance gets smaller (i.e. when using Illumina arrays), the pool-construction variance accounts for a greater proportion of the pooling variance. Our estimates of var(econstruction) spanned 27 DNA pools, ranging in size from 74 to 446 individual samples, allowing us to sample a range of possible pool-construction variances. First, in contrast to a previous report191, we did not observe a relationship between pool size and pool-construction variance. We did, however, observe batch effects. For the 1M-Duo arrays, which were processed in two batches on different dates, we observed very different estimates of pooling variance and pool-construction variance (see Figure 2.6). Most of our estimates of pool-construction variance were based on values from Type C comparisons, and for these var(econstruction) usually fell between 20 and 40% of the pooling variance. When calculations were based on the comparison of replicate DNA pools (Type B comparisons, 1M-Single arrays only) our estimates were smaller, on average 7.5% of the pooling variance. There are several possible reasons for this.
The adjustment for sampling variance may not fully account for the variance arising from sampling, leaving variance that is then attributed to pool construction in the Type C comparisons. As well, some estimates of pool-construction variance were negative, and these were set to zero, which would lead to overestimation of pool-construction variance. We conclude that relative to var(earray), var(econstruction) is of less importance; however, in our data pool construction accounted for more of the pooling variance than previously estimated47. MacGregor47 attributed 12.5% of the pooling variance to pool construction when using Affymetrix HindIII 50K arrays. On average we attributed 30% of pooling variance to pool construction when using Illumina arrays. This difference could be due to smaller array variance for Illumina arrays. Alternatively, our pool construction process could have had a higher error rate. With respect to the design of pool-based experiments when using Illumina arrays, our partitioning of the pooling variance still suggests47 that constructing fewer (large) pools while using more replicate arrays (i.e. targeting array variance) is the most effective way to reduce pooling variance and conduct the most efficient pool-based GWAS. Further, for an equivalent pool-based experiment using Affymetrix arrays in place of Illumina arrays, more array replicates will be needed (~10-fold more). As the proportion of array variance to pool-construction variance approaches 50:50, strategies to reduce pool-construction variance become more important. For one of our experiments, 1M-Duo Batch 2, we observed unusually high estimates of pool-construction variance and low estimates of array variance (see Figure 2.6). In this experiment, pool replicates were allelotyped on the same physical array (which holds two samples).
Subsequently, we noticed that the array variance for replicates on the same chip was much smaller than the variance for replicates on different chips. Overall, this led to the array variance being underestimated relative to the pooling variance, leaving more variance to be accounted for by pool construction. In addition, the between-chip variance for these arrays was much higher than observed in the 1M-Duo Batch 1 dataset, which led to large estimates of pooling and pool-construction variance overall. Ultimately, this was traced back to unusually high red channel intensity on some arrays, despite normalization, which biased allele frequency estimates array-wide. Clearly this will influence any downstream association analysis, so in this case, our analysis of variance served to flag a serious problem in the array data. It also highlighted the need to randomize DNA pool replicates among arrays that carry more than one sample, and to randomize by location on the array, particularly in the case of the 660-Quad and HumanOmni1-Quad arrays, which carry four samples. The differences between 1M-Duo Batch 1 and 2 data were significant for normalized data, but not raw data. On one hand, it may be that greater noise associated with the raw data prevented differences in array variance and pool-construction variance from being significant. On the other, it is possible that the normalization procedure itself exacerbated technical artifacts only present on some arrays, leading to the observed differences in normalized data. This can occur if technical artifacts violate the assumptions of the normalization192.

2.5 Conclusion

Empirical estimates of var(earray) and var(econstruction) for a range of DNA pool sizes are provided, as is PoolingPlanner, a simple program that helps translate these variances into their effect on sample size. This information can then be used in a power calculator to conduct power calculations adjusted for the use of the pooling strategy.
PoolingPlanner may be helpful in quickly assessing theoretical best and worst-case scenarios for a DNA pooling GWAS. With this information users can make informed decisions on how to carry out a pooling experiment to optimally balance cost with loss of power.

CHAPTER 3 Discovery of Subtype-Specific Epithelial Ovarian Cancer Risk Alleles Using a Pool-Based Genome-Wide Association Scan

Epithelial ovarian cancer (EOC) is the leading cause of death from gynecological malignancy in the developed world. Increasingly, it is being appreciated that EOC is a heterogeneous group of tumours whose heritable factors are subtype-specific. The genome-wide association study (GWAS) approach was recently used to detect common low to moderate penetrance alleles influencing risk in all EOC subtypes. Common variants specifically influencing risks in the rarer subtypes of EOC have yet to be investigated. To address this we performed a subtype-specific GWAS of EOC risk using a DNA pooling strategy. One hundred and ninety-eight SNP associations detected in the discovery stage of this GWAS were tested for replication by the Ovarian Cancer Association Consortium (OCAC). Nine SNPs (7 loci) replicated at a P-value <0.05 level of significance; they are not significant at the genome-wide level (P-value <10-7). Associated SNPs were as follows: 4 SNPs (4 loci) in the mucinous (MUC) EOC subtype, 3 SNPs (1 locus) in the endometrioid and clear cell (ENCC) EOC subtypes combined, and 2 SNPs (2 loci) in the low-malignancy potential/borderline serous (LMP-SER) EOC subtype. Several of these SNPs are in or near genes that arguably have a clear biological rationale for conferring subtype-specific EOC risk, including RAD51B in MUC EOC, GRB10 in ENCC EOCs, and BPIL2/C22orf26 in LMP-SER EOC.
Even within the OCAC as it currently exists, there are insufficient cases and well-matched controls to reliably detect risk alleles that have smaller effect sizes and/or allele frequencies in the MUC, ENCC, and LMP-SER subtypes; more samples are needed. These 9 SNPs warrant further replication and/or fine-mapping in additional datasets with reliable subtype and tumour behaviour information as they become available.

3.1 Background

Epithelial ovarian cancers (EOCs) are a heterogeneous group of tumours defined by their location on the surface of the ovary. Although the incidence of EOC is low (1-2%) relative to other gynecological malignancies, it is the most lethal of these cancers owing to its poor prognosis69. Recently, it was shown that EOC subtypes can be reproducibly diagnosed based on morphological and molecular genetic features76,79,193. Importantly, the different histological subtypes differ with respect to inherited risk factors, precursor lesions, pattern of spread, response to treatment, and patient outcome, compelling many to assert that these are different diseases69,193,194. Four EOC subtypes are commonly described: serous (SER), mucinous (MUC), endometrioid (END), and clear cell (CC) tumours194. Still, it is common to treat EOC as one disease, perhaps because the majority of EOC cases are of the serous subtype (~75%). A subtype-specific approach and understanding of EOC may lead to new and effective means of prevention, early detection, and disease management. We performed a subtype-specific genome-wide association study (GWAS) of EOC susceptibility to discover genetic risk factors unique to the rarer EOC subtypes. GWAS have successfully identified hundreds of moderate to low risk susceptibility alleles conferring risk in several types of cancer, and many have been found to be subtype-specific195. Heritability estimates for EOC suggest that similarly sized risk variants exist in this cancer type as well.
Specifically, individuals with a family history of EOC have a 2-fold increased risk of developing this disease, even after accounting for mutations in the known rare high-penetrance susceptibility genes (BRCA1, BRCA2, MLH1, and MSH2)122,128. Results from a large twin study suggest that most of this additional risk is due to genetic factors, not environmental factors127. To our knowledge, subtype-specific heritability estimates do not exist, making it difficult to determine if the histologies differ in this regard. The first GWAS of EOC susceptibility was published in 2009130; however, it did not consider histological subtype in its discovery stage. As high-grade SER (HG-SER) EOC is by far the most common subtype, this study would be expected to detect HG-SER driven associations, or associations common to all of the EOC subtypes. One locus, 9p22.2 (near the BNC2 gene), was reported to increase EOC risk130. When data was stratified by subtype, this association was only present in the SER subtype, where it had a smaller per allele odds ratio (OR) and more significant P-trend than in the analysis including all EOC subtypes combined. Song et al.130 concluded that disease heterogeneity likely reduced the power of the GWAS. Follow-up studies based on this GWAS were published in 201040,144 and five new susceptibility loci were reported: 2q31, 3q25, 8q24, 17q21, and 19p13.11. With the exception of the 2q31 locus, these associations were only significant in the HG-SER subtype. The 2q31 locus appears to confer risk in the SER and MUC histological subtypes40, which implies that at least one common heritable risk factor is shared between subtypes. The SNPs associated with EOC risk have had small effects (per allele OR less than 1.3), and collectively account for a small fraction of the unexplained familial risk in EOC.
Common variants specifically influencing risks in the rarer subtypes of EOC have yet to be investigated, and may be detected using a subtype-specific approach to GWAS. Using a population-based sample collected in British Columbia (BC) between 2002 and 2008, we performed a two-stage subtype-specific EOC GWAS, using a DNA pooling strategy in the discovery stage (stage 1). One hundred and ninety-eight SNPs detected in this stage were tested for replication in a large multi-study based replication stage (stage 2). SNPs were replicated by the Ovarian Cancer Association Consortium (OCAC) as part of the Collaborative Oncological Gene-environment Study (COGS). Eight subtype-specific SNPs (7 loci) were replicated in OCAC at an unadjusted P <0.05 level of significance. The steps and outcomes of this study are outlined in Figure 3.1. It is important to note that based on available sample sizes this study was only powered to detect trait-associated SNPs that exert a moderate-to-large effect, and was pursued in an attempt to advance biological understanding where it appeared very little progress would be made in the near future. Larger effect sizes (OR>1.7) do exist, but they are not the norm (see section 1.2.2.2 for cancer-specific examples of larger SNP effect sizes). A pool-based GWAS is typically in excess of 1/50th the cost of a conventional GWAS (for a small reduction in power). At this cost point, the risk of a null result is more acceptable (i.e. if no moderate-to-large effect risk variants exist), particularly (in our opinion) if the possibility of advancing our understanding of EOC exists. For further discussion of these issues see sections 1.1.1.3, 1.1.1.4, 1.2.2.2, and 1.5 of this thesis.

3.2 Methods

3.2.1 Discovery Stage

3.2.1.1 Sample and Subtype Description

Study participants were drawn from the Ovarian Cancer in Alberta and British Columbia (OVAL-BC) Study, which is the collaboration of the OVAL Study in Alberta, and the Women’s Health Study in BC.
The OVAL-BC study represents the largest population-based study of the causes of ovarian cancer in Canada. This study was approved by the joint Clinical Research Ethics Board of the British Columbia Cancer Agency and the University of British Columbia. All subjects gave written informed consent. In BC, EOC cases were identified through the BC Cancer Registry (BCCR), and included any woman residing in the province diagnosed with an ovarian (or related peritoneal or tubal) tumour. Patient names and referring physician names were obtained directly from pathology reports sent to the BCCR. Prior to contacting patients, physicians were approached and asked for consent for their patient to be invited to participate in the OVAL-BC study. If the physician responded in the affirmative, an introductory letter with study information sheets, a consent form, a self-administered questionnaire, and buccal swabs (in the early part of the study) or a blood sample lab requisition (in the later part of the study) was mailed to the patient. Sixty-seven percent of the patients who confirmed receiving a study package consented to participate in the OVAL-BC study. Participating EOC cases were assigned an ICD-O-3 code (International Classification of Diseases for Oncology, 3rd Edition) based on the pathology reports provided to the BCCR. A Health Record Technician or a Health Record Administrator assigned these codes. ICD-O-3 codes were used to decide which EOC cases to include in and exclude from the discovery stage of the pool-based GWAS (Table 3.1). Four separate DNA case pools were constructed, corresponding to the four subtype-specific analyses performed. The case pools were as follows: a MUC pool, an END and CC combined pool (abbreviated ENCC), a low-malignancy potential/borderline SER pool (LMP-SER), and a high-grade SER pool (HG-SER) (Table 3.1).
The END and CC subtypes each had relatively few samples in the OVAL-BC study; however, they share an association with endometriosis100 and it seemed possible that they might share genetic risk factors as well99 (Dr. David Huntsman, personal communication). These two subtypes were combined in one DNA pool, abbreviated ENCC, to search for shared genetic risk factors. In contrast, a small LMP-SER pool (74 individuals) was constructed and analyzed. This was motivated by the fact that SER tumours with LMP/borderline behaviour are under-studied and insufficient evidence exists to rebut the possibility of common variants conferring large risks. Cases were not excluded based on a family history of cancer. Furthermore, only individuals of European descent were included, as determined by self-reported grandparents’ ethnicity. OVAL-BC cases included in the discovery stage were diagnosed between 01-Jan-2002 and 30-Jun-2008; OVAL-BC cases diagnosed after 30-Jun-2008 were excluded due to the timing of DNA pool construction. 01-Jan-2002 marks the earliest date samples were obtained for the OVAL-BC study. Control individuals in the OVAL-BC study were ascertained in one of two ways, and are referred to as BC Controls (BCC) or Screening Mammography Program of BC (SMP) controls accordingly. Controls in the BCC group were obtained through the Provincial Ministry of Health Medical Services Plan Client Registry via random selection of BC resident women between the ages of 20-79. In 2005, a change in control ascertainment method became necessary after legislative changes at BC’s Ministry of Health prevented the disclosure of personal information for research purposes. Subsequently, SMP controls were randomly selected from women between the ages of 40-69 who attended a routine screening mammography program. This screening program is offered to all BC resident women aged 40 – 79, with screening occurring once every two years.
SMP controls consented to be contacted for cancer research when completing a “Background Information Survey” administered at the time of mammography. Of the control individuals confirmed to receive a study package, the consent rate was 40% for the BCC group, and 62% for the SMP group. OVAL-BC control individuals included in the discovery stage were of European descent only, as determined by self-reported grandparents’ ethnicity. Controls were not excluded based on a family history of cancer (any type). Controls included in DNA pools were selected to be frequency matched by age with cases (by 10 year age bins). To perform this frequency matching, case histology and control grouping (BCC and SMP) were ignored, i.e. cases and controls are frequency matched when ‘all EOC cases’ are compared to ‘all controls’. Two control DNA pools were constructed, one for the BCC group, the other for the SMP group. OVAL-BC controls included in the discovery stage of the pool-based GWAS were collected between 01-Jan-2002 and 30-Jun-2008.

3.2.1.2 DNA Pooling Procedures

Blood or saliva served as the source of genomic DNA for samples used in DNA pools. For 10% of individuals, genomic DNA was extracted from saliva samples using OraGene kits (DNA Genotek, PA, USA). For 90% of individuals, genomic DNA was extracted from peripheral venous blood using a modified salting out protocol196. Briefly, equal volumes of lysis buffer (Sucrose: 0.32 M, Tris-HCl: 10 mM, MgCl2: 5 mM, Triton X-100: 0.75%, pH = 7.6) and blood were added to 2 volumes of cold, sterile, distilled, and deionized water. Samples were mixed by inverting 20 times and incubating on ice for 5 minutes, then centrifuged (2100xg) for 15 minutes. The supernatant was removed, the pellet again washed using 1:3 parts lysis buffer:water, vortexed for 20 seconds at medium speed, and centrifuged for 15 minutes.
The pellet was then suspended in 5mL of proteinase K buffer (Tris-HCl: 20 mM, Na2EDTA: 4 mM, NaCl: 100 mM, pH = 7.4) and 0.5mL of 10% SDS and incubated for two hours at 55°C. Then 4.25mL of room-temperature 5 M NaCl solution was added, and the suspension was vortexed then centrifuged (2700xg) for 20 minutes at 4°C. The resulting pellet was washed with 10mL of cold isopropanol, and subsequently with 5mL of 70% ethanol. DNA was resuspended in 10:1 tris-EDTA (TE) buffer, pH 8.6. Samples were assigned a unique number code to ensure accurate and reliable sample processing and storage. DNA samples were adjusted to between 50-100ng/uL and then precisely quantified in duplicate by fluorometry using PicoGreenTM (Molecular Probes, Eugene, OR, US). DNA pools were constructed by combining 200ng (2-4 uL) of each DNA sample by manual pipetting. DNA pools were assayed using Illumina Human660W-Quad v1 (660-Quad) BeadChips, processed at the Centre for Applied Genomics (TCAG), Toronto, Canada. SNP allele frequencies in DNA pools were estimated from the 660-Quad array data. For each case pool 12 replicate arrays were used; for each control DNA pool 6 replicate arrays were used. Replicate arrays were used to reduce the error (standard deviation, SD) associated with the estimates of SNP allele frequency49. In our discovery stage analyses, the array data from both control pools (12 arrays in total) was combined to generate estimates of allele frequency that reflect all controls. Equation 1 was used in the calculation of SNP allele frequency41:

(1)  \hat{p}_{i} = \frac{1}{12} \sum_{j=1}^{12} \frac{1}{n} \sum_{k=1}^{n} \frac{G_{ijk}}{G_{ijk} + R_{ijk}}

where i is a given SNP, j is a given array, k is a given bead, and n is the number of beads assaying the ith SNP on the jth array. There are ~600,000 unique SNPs interrogated by the 660-Quad array. On average there are 16 beads designed to assay a given SNP; however, this number is not fixed. G_{ijk} and R_{ijk} represent the green and red fluorescence intensity for the kth bead on the jth array, assaying the ith SNP.
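As an illustration of Equation 1, the sketch below computes a pooled allele frequency estimate for a single SNP as the mean, over replicate arrays, of the per-bead ratio G/(G+R) averaged within each array. The function name and the input layout (one pair of green and red bead-intensity lists per array) are assumptions made for this example, and beads are assumed to have already passed quality control so that G+R is positive for every bead.

```python
def pooled_allele_frequency(arrays):
    """Estimate the allele frequency of one SNP from replicate arrays.

    arrays: list of (green, red) pairs, one pair per replicate array, where
    green and red are equal-length lists of bead intensities for this SNP.
    The number of beads may differ between arrays.
    """
    per_array_means = []
    for green, red in arrays:
        # Per-bead ratio of green signal to total signal.
        ratios = [g / (g + r) for g, r in zip(green, red)]
        per_array_means.append(sum(ratios) / len(ratios))
    # Average the per-array estimates across replicate arrays.
    return sum(per_array_means) / len(per_array_means)


if __name__ == "__main__":
    # Two hypothetical replicate arrays with 2 and 1 beads, respectively.
    example = [([100.0, 300.0], [100.0, 100.0]), ([200.0], [200.0])]
    print(pooled_allele_frequency(example))
```

In the toy input, the first array gives bead ratios of 0.5 and 0.75 and the second gives 0.5, so the two per-array means are averaged with equal weight regardless of how many beads each array contributed.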
The two colours correspond to the two alleles of the SNP. In effect, this equation gives the ratio of signal arising from one allele to the total signal. Values close to 1 and 0 indicate allele homozygosity for individuals in a pool. Intermediate values theoretically correlate directly with the relative abundance of the two alleles in a pool of individuals41,197.

3.2.1.3 Quality Control and Normalization of Array Data

Prior to estimating SNP allele frequency using the 660-Quad data, poorly performing beads were removed and the data were normalized. Beads with a negative or zero value in either fluorescence channel were removed from analyses. These bead values can occur due to local background intensity removal at the point of image processing190. Beads with a red or green fluorescence intensity value more than 3 median absolute deviations (3MAD) from the median bead value were removed. This operation was performed on a per-SNP and per-channel (red and green) basis. Bead removal was performed in a stepwise manner: 1) negative-value beads, 2) zero-value beads, and 3) 3MAD outlier-value beads. Based on previous experience it was known that the average relative intensity of the red and green channels could differ dramatically between arrays49,187. To prevent these manufacturing and/or assaying properties from biasing allele frequency estimation, a simple normalization was performed. Each array was normalized by adjusting the red channel intensity to give a mean array-wide allele frequency estimate of 0.5 (as done by MacGregor et al. 187). No changes were made to the green fluorescence values.

3.2.1.4 Analysis Approach

Four GWAS analyses were performed, corresponding to the subtype-specific EOC case pools constructed: 1) MUC, 2) ENCC, 3) LMP-SER, and 4) HG-SER (Table 3.1). Common control pools (BCC and SMP array data combined) were used in these analyses. Before analysis it was known that at most 200 SNPs could be selected for replication through OCAC.
These SNPs were generously donated by Linda Kelemen to replicate the most interesting loci detected by our pool-based GWAS. Although this opportunity allowed us to achieve a degree of replication not possible without OCAC, 200 SNPs is a very limited number. Most GWAS carry forward 0.1% (~10,000) of the SNPs tested in Stage 1 to Stage 2 to avoid false negatives arising from overly stringent P-value criteria19. False positives are then removed in Stage 2 by leveraging the additional power afforded by a larger sample size (typically ~75% more samples genotyped in Stage 2) (see section 1.2.1.6 for further discussion). Because only 200 SNPs could be selected for replication in OCAC, we had to be very stringent in the first round of SNP selection, and may not have selected true positive SNP-trait associations amongst the false positives that are expected to arise in the tails of the test statistic distribution. These limitations acknowledged, the steps performed to select 198 SNPs for replication are given below.

3.2.1.5 Pool-Based Test of Association

Allele-based tests of association were performed using a previously described, publicly available program, GenePool41 (version 0.9.1). Using this package, association statistics were calculated using the SINGLEMARKER algorithm. Briefly, the SINGLEMARKER test statistic divides the difference in allele frequency, calculated from allele frequency estimates in a case pool and a control pool, by the variance inherent to the pooling experiment (Equation 2)41,198. Allele frequencies in each EOC case pool (MUC, ENCC, LMP-SER, and HG-SER) were compared to allele frequencies in the control pool (based on BCC and SMP array data combined).

$$\mathrm{SINGLEMARKER}_{i} = \frac{\tilde{p}^{\,\mathrm{case}}_{A,i} - \tilde{p}^{\,\mathrm{control}}_{A,i}}{V_{s,i} + V_{p,i}} \tag{2}$$

$\tilde{p}^{\,\mathrm{case}}_{A,i}$ = estimated allele frequency for the A allele of the ith SNP in case pools.
$\tilde{p}^{\,\mathrm{control}}_{A,i}$ = estimated allele frequency for the A allele of the ith SNP in control pools.
$V_{s,i}$ = sampling variance for the ith SNP.
$V_{p,i}$ = pooling variance for the ith SNP.

SNPs that cannot be reliably assayed in DNA pools were excluded prior to calculating SINGLEMARKER test statistics (see Appendix C.1). We excluded SNPs with HapMap CEU (Utah residents with Northern and Western European ancestry from the CEPH collection, Release 27) minor allele frequency (MAF) less than 5%, as power has been shown to be substantially reduced for low MAF SNPs in pool-based experiments48,199. SNPs without HapMap data were not removed. Probes specifically designed to assay copy-number variant (CNV) non-polymorphic sites were removed as these do not have meaningful two-colour fluorescence intensity data. Finally, SNPs specifically designed to assess polymorphisms in mitochondrial DNA (mtDNA) were excluded due to concerns regarding homology (>98%) between nuclear and mtDNA sequence confounding results200. In total, 535,177 SNPs were tested for association using the SINGLEMARKER test.

3.2.1.6 Filtering of Associated SNPs

There is enrichment for false positives due to genotyping error among the most significant SNPs when performing a GWAS using individual genotyping data201; the same is true of a GWAS using pool-based data. That a SNP is identified as the single most associated SNP does not rule out an undetected assay problem as the cause. Given that at most 200 SNPs were to be selected for replication, false positive enrichment amongst the most associated SNPs was of particular concern. Additional filtering criteria were applied to remove SNPs most likely to be false positives. There is no perfect method to screen out SNPs that are false positives, and with regard to pool-based GWAS, there is no agreed upon series of filtering steps198. Two different series of filters were applied, depending on the properties of the SNP in question. In the first category were SNPs (defined as primary SNPs) with at least one other SNP (defined as a proxy SNP) on the 660-Quad array in high LD (r² > 0.8).
Linkage disequilibrium was estimated based on HapMap CEU individuals here and in the rest of this chapter. For primary SNPs, we filtered based on agreement between primary and proxy SNP SINGLEMARKER test statistic results. Highly associated SNPs in disagreement with their proxy SNP(s) were removed; the latter should yield identical or very similar results, assuming experimental error is not an issue. This was called the “cluster method” filtering approach. In the second category were SNPs without proxies on the 660-Quad array (defined as singleton SNPs). For singleton SNPs, additional exclusion criteria were applied to remove SNPs most likely to be false positives. This was called the “singleton” filtering approach. Cluster method filtering was performed on the most significant 3,300 SNPs (~0.5% of the array data) as ranked by the SINGLEMARKER test statistic P-value. The number of SNPs on which the cluster method filtering analysis was performed was relatively arbitrary: large enough to define more genomic regions of interest than could be pursued in replication given the 200 SNP allotment, but still restricted to a very small subset of highly associated SNPs. Using these SNPs, contiguous genomic regions were defined as those having one or more SNPs within 100 kb of each other. For example, 5 SNPs spaced at 90 kb intervals would be grouped into one “cluster” spanning ~360 kb. The genomic regions were prioritized by ranking by the smallest weighted average SINGLEMARKER rank. The weighting factor applied was 2/k, where k is the number of SNPs within a cluster. This weighting prioritizes genomic regions with many highly-ranked SNPs (top 0.5% of data), but biases us towards genomic regions with greater SNP density on the 660-Quad array.
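The region definition and prioritization described above can be sketched as follows. This is a minimal illustration and not the code used in the thesis: the function names and the tuple layout are hypothetical, and the score assumes that the 2/k factor multiplies the mean SINGLEMARKER rank of the SNPs in a cluster, which is one reading of the text.

```python
from typing import List, Tuple

Snp = Tuple[str, int, int]  # (chromosome, position in bp, SINGLEMARKER rank)

def cluster_snps(snps: List[Snp], max_gap: int = 100_000) -> List[List[Snp]]:
    """Group SNPs into contiguous regions.

    Consecutive SNPs on the same chromosome that lie within max_gap bp
    of each other are placed in the same cluster.
    """
    clusters: List[List[Snp]] = []
    for chrom, pos, rank in sorted(snps):
        last = clusters[-1] if clusters else None
        if last and last[-1][0] == chrom and pos - last[-1][1] <= max_gap:
            last.append((chrom, pos, rank))
        else:
            clusters.append([(chrom, pos, rank)])
    return clusters

def cluster_score(cluster: List[Snp]) -> float:
    """Mean rank scaled by 2/k (ASSUMED reading of the weighting); smaller is better."""
    k = len(cluster)
    mean_rank = sum(rank for _, _, rank in cluster) / k
    return (2 / k) * mean_rank

# Hypothetical input: three nearby SNPs on chromosome 7, plus two isolated SNPs.
example = [("7", 100_000, 5), ("7", 180_000, 12), ("7", 260_000, 40),
           ("7", 900_000, 3), ("13", 150_000, 8)]
regions = sorted(cluster_snps(example), key=cluster_score)
```

Under this reading, isolated SNPs can still score well, which is consistent with the text treating SNPs without a high-LD partner separately in the singleton analysis.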
For each genomic region considered for replication, we verified that at least 2 SNPs were in high LD (r² > 0.8, HapMap CEU data). If a set of clustered SNPs did not meet this criterion, they were moved to the singleton filtering analysis. The cluster method filtering approach attempts to prioritize SNPs most likely to be true positives based on primary/proxy agreement. It does not attempt to combine the information from all proxies of a primary SNP. Methods which combine all proxy information can be biased by one poorly genotyped SNP41,198, and we did not want to exclude SNPs on this basis. The following filtering criteria were used to exclude singleton SNPs in a step-wise manner: 1) SINGLEMARKER test statistic P-value greater than 0.05; 2) SNP allele frequency estimated from less than 50% of the expected array data in the case and control DNA pools; and 3) allele frequency difference between the control DNA pools (BCC versus SMP) greater than the difference between case and control pools (BCC and SMP combined). The rationale for each exclusion criterion is as follows. A nominal P-value of less than 0.05 is a commonly used probability cut-off (not considering multiple testing). A loss of 50% of array data is indicative of undetected assay complications or failure. A SNP effect that is more pronounced in a comparison of control pools is not likely to be associated with EOC risk, but rather a spurious association. Association testing and filtering were performed in a subtype-specific manner.

3.2.1.7 SNP Selection for Replication

SNP selection was distributed roughly equally among the MUC, ENCC, and LMP-SER analyses (Table 3.2). Fewer SNPs were chosen from the HG-SER analysis because the GWAS performed by Song et al. in 2009130 likely had far superior power to detect HG-SER associations based on the sample available (despite genetic heterogeneity among case samples). Primary SNPs were favoured for replication (i.e. those with proxy SNPs and filtered by the cluster method approach) as these are the SNPs for which we have the most confidence in the association result. SNPs were chosen to tag the top 30 genomic regions arising from the cluster method filtering for the MUC, ENCC, and LMP-SER analyses. For genomic regions with many SNPs present, where some SNPs were not in high LD (r² > 0.8) but in moderate LD (r² > 0.6), more SNPs were chosen to tag the region. The top 15 genomic regions from the HG-SER analysis were similarly tagged. In total, 161 SNPs were assigned in this way (Table 3.2). Thirteen top-ranked singleton SNPs from each of the MUC, LMP-SER, and ENCC GWAS were then chosen (as ranked by the SINGLEMARKER test statistic P-value and passing the singleton filtering steps) (Table 3.2). Details of the SNPs chosen based on the top 30 genomic regions from the LMP-SER, MUC, and ENCC analyses, and the top 15 genomic regions from the HG-SER analysis, are given in Tables 3.3-3.6, and in Table 3.7 for singleton SNPs.

3.2.2 Replication Stage

3.2.2.1 Sample Description

Replication was carried out in ~19,000 EOC cases (all subtypes combined) and ~25,000 controls through the OCAC, as part of COGS. COGS was established to replicate potential associations that have arisen through GWAS and other association studies of breast, ovarian and prostate cancer. In its entirety COGS genotyped ~200,000 SNPs in ~150,000 breast, ovarian and prostate cancer cases and controls. Only samples from OCAC studies were used in the replication stage analyses; each OCAC study has been previously described40. Approval from the OCAC study-specific human research ethics committees was obtained, and all participants provided written informed consent. Only women of European descent were used for replication. Women were assigned an ethnic group designation based on principal component analysis carried out on the full COGS dataset.
For SNPs chosen based on the HG-SER and LMP-SER discovery stage analyses, the choice of case samples to include in replication analyses was straightforward: those ovary cases (cases of tubal, peritoneal, or unknown origin were excluded) with a serous tumour histology, and either an invasive (for HG-SER) or borderline/LMP (for LMP-SER) tumour behaviour. Cases with unknown behaviour were excluded. For SNPs chosen based on the MUC and ENCC pools, the appropriate cases to include for a well-matched replication were more ambiguous. For the MUC EOC subtype, both invasive and LMP/borderline tumour behaviour samples were included in the discovery stage pool, although the majority of cases were of LMP/borderline type (70%) (Table 3.1). To address this, one main replication analysis was performed using invasive and LMP/borderline MUC samples combined. Subsequently, SNPs found to be significant in this main analysis (P-value < 0.05) were inspected in invasive-only and LMP/borderline-only analyses. In the discovery stage END and CC cases were combined in one pool (Table 3.1). Again, one main replication analysis was performed using invasive END and CC cases combined. Only invasive cases were included, which reflects the samples used in the discovery stage. Subsequently, SNPs found to be significant in this main analysis (P-value < 0.05) were inspected in END-only and CC-only analyses. With respect to cases and controls, individuals without principal component data or age at diagnosis (cases) or recruitment (controls) were excluded. A breakdown of the EOC cases and controls genotyped by the OCAC as part of COGS is given in Table 3.3.

3.2.2.2 Genotyping and Quality Control

Replication stage genotyping was performed on an Illumina custom Infinium genotyping array. The Ovarian Cancer Association Consortium (OCAC) established robust genotyping quality control (QC) guidelines to ensure accurate genotyping, particularly across multiple studies.
These QC guidelines have been previously described40. Only SNPs passing OCAC’s QC guidelines were included in the replication analysis.

3.2.2.3 Statistical Analysis

Each OCAC study set was analyzed separately; however, the tested allele for all analyses was set to be the minor allele as calculated in all European controls in the full OCAC dataset. Each histological subtype was analyzed separately, unless otherwise described. Logistic regression assuming an additive genetic model (0, 1, 2 copies of the minor allele) was performed, adjusting for five European principal components and age at diagnosis (5-year age bins: <40, 40-44, 45-49, 50-54, 55-59, 60-64, 65-69, ≥70). These factors were judged a priori to be potentially important confounders and were included in the model irrespective of their effect on the association result. Logistic regression analyses were implemented in PLINK (v1.07). Study-specific ORs and standard errors (SEs) from logistic regression analyses were used in fixed effects (inverse variance, IV, method) and random effects (DerSimonian and Laird, DL, method) meta-analyses to calculate summary ORs and confidence intervals (CIs). Meta-analysis was carried out using the rmeta library implemented in the R project for Statistical Computing. Fixed effects meta-analysis assumes that a single common effect underlies each study; random effects meta-analysis allows for a distribution of effects across different studies. Cochran's Q test was used to assess between-study heterogeneity, i.e. whether the assumption of a single common effect was tenable. Both IV and DL meta-analysis methods incorporate study variance into the estimation of a study's contribution to the combined effect estimate, and can be biased if data are sparse202. Studies with fewer than 30 subtype-specific EOC cases or 30 controls were excluded from the meta-analysis to reduce complications and/or bias introduced by sparse data.
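The two pooling schemes named above can be illustrated with a short sketch operating on study-specific log odds ratios and their standard errors. It is not the rmeta implementation used in the thesis; the function names are hypothetical, and the formulas are the standard inverse-variance estimator, Cochran's Q, and the DerSimonian and Laird moment estimator of between-study variance.

```python
import math
from typing import List, Tuple

def fixed_effect(log_ors: List[float], ses: List[float]) -> Tuple[float, float]:
    """Inverse-variance pooled log OR and its standard error."""
    weights = [1.0 / se ** 2 for se in ses]
    pooled = sum(w * b for w, b in zip(weights, log_ors)) / sum(weights)
    return pooled, math.sqrt(1.0 / sum(weights))

def cochran_q(log_ors: List[float], ses: List[float]) -> float:
    """Cochran's Q: weighted squared deviations from the fixed-effect estimate."""
    pooled, _ = fixed_effect(log_ors, ses)
    return sum((b - pooled) ** 2 / se ** 2 for b, se in zip(log_ors, ses))

def dersimonian_laird(log_ors: List[float], ses: List[float]) -> Tuple[float, float, float]:
    """Random-effects pooled log OR, its SE, and the between-study variance tau^2."""
    k = len(log_ors)
    weights = [1.0 / se ** 2 for se in ses]
    c = sum(weights) - sum(w ** 2 for w in weights) / sum(weights)
    q = cochran_q(log_ors, ses)
    # tau^2 is truncated at zero; a single study carries no heterogeneity information.
    tau2 = max(0.0, (q - (k - 1)) / c) if k > 1 and c > 0 else 0.0
    w_star = [1.0 / (se ** 2 + tau2) for se in ses]
    pooled = sum(w * b for w, b in zip(w_star, log_ors)) / sum(w_star)
    return pooled, math.sqrt(1.0 / sum(w_star)), tau2

def summary_or(pooled: float, se: float) -> Tuple[float, float, float]:
    """Convert a pooled log OR to an OR with a 95% confidence interval."""
    return math.exp(pooled), math.exp(pooled - 1.96 * se), math.exp(pooled + 1.96 * se)
```

When the studies disagree, tau² is positive and the random-effects interval widens relative to the fixed-effects interval, which is why an association can be significant under one model and not the other.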
The OVAL-BC study (abbreviated OVA in the OCAC) was excluded from the replication meta-analysis because most of the cases (and ~50% of controls) in OVA participated in the DNA pools used in the discovery stage, and the meta-analysis was intended to be an independent replication.

3.2.3 In Silico SNP Analysis

Replicated SNPs were assessed for potential functional effects using in silico analyses. The Ensembl v57 database (Genome Assembly: GRCh37, Feb 2009, dbSNP version: 130) was used as the source of SNP information, including SNP chromosome, nucleotide position, consequence (e.g. intronic, non-synonymous coding), and the HGNC symbol of genes tagged by SNPs. These were conveniently extracted using the web-based tool Varietas203. The web-based package FuncPred was used to predict functional characteristics of SNPs97. In particular, FuncPred was used to assess the potential for the alternate allele of a non-coding SNP to impact: transcriptional regulation by changing transcription factor binding site (TFBS) activity, translational regulation by changing miRNA binding site activity, and protein structure/function by changing exonic splice site enhancer (SSE) or exonic splice site silencer (SSS) activity. In addition, FuncPred outputs the ESPERR regulatory potential score204, and the Vertebrate Multiz Alignment & Conservation score (17 species). The ESPERR (evolutionary and sequence pattern extraction through reduced representations) regulatory score discriminates regulatory regions from neutral sites with excellent accuracy (approximately 94%)204; however, it does not perform the calculation in the context of a SNP's reference versus alternate allele. The conservation score is a measure of evolutionary conservation in 17 vertebrates, including mammalian, amphibian, bird, and fish species, based on a phylogenetic hidden Markov model (phastCons)88.
Conserved regions of the genome are typically thought to serve important biological functions, although those functions are often unknown. More often than not SNPs implicated by GWAS are in LD with causal SNPs, and do not exert a functional consequence themselves. To address this in part using available in silico data, SNPs in LD with the associated SNPs (r² > 0.6 based on the HapMap CEU population) were also evaluated using FuncPred.

3.3 Results

3.3.1 Discovery Stage

3.3.1.1 Sample and Subtype Description

Summary statistics for the samples contributing to each DNA pool are given in Table 3.8, including the pool size, the average age at diagnosis ± SD, and the minimum/maximum age of pool participants. Age was compared between pools using a Mann-Whitney Rank Sum Test, implemented in R version 2.7.0. Case pools were compared to the combined BCC and SMP controls to reflect the analyses performed in the discovery stage. Notably, the SMP control group is older than the BCC control group, a fact that can be attributed to only women over 40 being enrolled in the SMP program used for recruitment. A breakdown of the DNA pools by 10-year age bins underscores many of the previously reported differences in the age at diagnosis of the EOC histological subtypes (Figure 3.2)82,205,206. Patients with HG-SER tumours are older than patients with LMP-SER tumours. In the discovery samples most HG-SER cases were diagnosed after 50, whereas most LMP-SER cases were diagnosed before 50 (Figure 3.2). The MUC subtype had the most diagnoses before the age of 30 (as a % of MUC cases), and the percentage of MUC cases rose in each age bin until 60, whereupon it declined. The END and CC combined pool was approximately normally distributed about the 51-60 age bin, as were the BCC and SMP combined controls.

3.3.1.2 Array Quality Control and Normalization

The 660-Quad arrays used in the discovery stage of the pool-based GWAS delivered highly reproducible results on replicate arrays.
Figure 3.3 depicts the high correlation in the estimates of allele frequency for 5000 randomly selected SNPs on two replicate arrays before normalization (“raw” array data, r = 0.989) and after normalization (“normalized”, r = 0.996). Given data for 12 replicate arrays and 5000 randomly selected SNPs, Figure 3.4 depicts the distribution of the standard deviations associated with pool-based AF estimates before (“raw” array data) and after normalization (“normalized”). The mean SD value is reduced from 0.05 before normalization to 0.01 after normalization; thus, the normalization procedure dramatically increased the precision of the pool-based AF estimation. Hierarchical cluster plots of the sixty 660-Quad arrays used in the discovery stage analyses (12 arrays per MUC, ENCC, LMP-SER, and HG-SER case pool and 6 arrays per BCC and SMP control pool) show the global effect of normalization, and the well-defined clustering of replicate arrays (Figure 3.5 and Figure 3.6). To construct these plots, a SNP AF correlation matrix (comparing SNP AF on all arrays) was calculated using Pearson’s correlation coefficient. This was then converted to distances for plotting, implemented in R. Before normalization replicate arrays were distributed in several small groups on poorly distinguished branches (Figure 3.5). After normalization replicate arrays demonstrate tight clustering on well-defined and discrete branches (Figure 3.6). For the normalized data, only the HG-SER arrays do not all cluster together. Instead, replicates are distributed between two branches (6 arrays per branch). The replicate arrays from the BCC and SMP control pools branch together and intermix in this hierarchical cluster plot, suggesting that relative to the case pools, SNP AF estimates in these two control pools are globally (across the entire array) more similar.
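The correlation-to-distance step behind such cluster plots can be sketched as below. The text does not state which conversion was used, so the common choice of distance = 1 - r is an assumption here, and the array names and allele frequency values are invented for illustration. The resulting pairwise distances could then be passed to any hierarchical clustering routine.

```python
import math
from typing import Dict, List, Tuple

def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_distances(arrays: Dict[str, List[float]]) -> Dict[Tuple[str, str], float]:
    """Pairwise distances between arrays, using 1 - r (ASSUMED conversion)."""
    names = sorted(arrays)
    return {
        (a, b): 1.0 - pearson(arrays[a], arrays[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }

# Hypothetical AF estimates for four SNPs on three arrays.
af = {
    "case_rep1": [0.10, 0.50, 0.90, 0.30],
    "case_rep2": [0.12, 0.48, 0.91, 0.29],
    "other_pool": [0.40, 0.20, 0.60, 0.80],
}
distances = correlation_distances(af)
```

Replicate arrays of the same pool should sit at small distances from each other, which is the pattern described for the normalized data.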
Overall, based on these diagnostics (Figures 3.3-3.6) we were very satisfied with the performance of the 660-Quad arrays with respect to AF estimation in the discovery stage DNA pools.

3.3.1.3 Validation

Prior to testing for replication of our 198 SNPs (Table 3.3), we first confirmed that the SNP associations detected in the pool-based data were also present in individual genotyping (IG) data. This was done to confirm that undetected array and/or pool construction error was not the reason SNPs were selected for replication. This validation step was possible because OVAL-BC samples were included in the OCAC/COGS genotyping (abbreviated OVA in Table 3.4). In total, 915 (of 943) samples used in the discovery stage DNA pools were individually genotyped. Twenty-eight samples were omitted due to insufficient DNA and one sample failed to pass the QC steps performed by the genotyping centre. For the ovarian cancer pools, 93%, 100%, 91%, and 97% of MUC, ENCC, LMP-SER, and HG-SER samples, respectively, had IG data. Ninety-eight percent of control samples had IG data. With respect to SNPs, 188 passed the QC steps performed by the genotyping centre (Table 3.3). For each DNA pool the agreement between pool-based and IG-based SNP AF estimates was inspected (for the 188 SNPs with IG data) (Figure 3.7). For the case pools, the correlation between methods was 0.74, 0.78, 0.74 and 0.79 for the MUC, ENCC, LMP-SER, and HG-SER pools, respectively. For the control pool the correlation was 0.77. For pools with more samples individually genotyped (ENCC, HG-SER, controls), the agreement between methods was better. Overall, the correlation between SNP MAF estimation methods was good; however, our pool-based data tended to overestimate SNP MAF relative to IG data (Figure 3.7). This finding is consistent with previous reports207.
Pooling AF estimates are known to be most useful in estimating differences between pools, rather than absolute frequencies, on account of unequal amplification of alleles for some SNPs on genotyping arrays. Most effects of unequal amplification cancel out when allele frequency differences (AFDs) are used. To examine this, the distribution of AFDs between cases and controls was compared for the 188 SNPs based on pooling and IG data (Figure 3.8). These plots demonstrate that the AFDs calculated based on pooling data are overestimated relative to IG data. Previous pool-based GWAS also found pooling data to overestimate AFDs between pools207. Nevertheless, large AFDs exist between cases and controls for many SNPs based on the IG data. Depending on the subtype, the average SNP AFD based on IG data ranged from 0.05-0.08. SNPs chosen based on cluster method filtering did not differ from those chosen based on singleton filtering with respect to the tendency of pooling data to overestimate AFD (Figure 3.9). Overall, we found the pooling data successfully detected SNPs with AF differences between cases and controls. Even though AFD values were overestimated for the 188 SNPs chosen, the underlying true differences were still quite large. Ultimately, we are interested in SNPs that are associated with EOC based on IG data, not pooling data. Thus, using IG data for those samples used in the case pools and control pools, we carried out association analyses using a one degree of freedom (df) allelic chi-squared test, implemented in PLINK (v1.07). To make these calculations as comparable as possible to those performed in the discovery stage using the SINGLEMARKER test, no covariates were included and only the allele-based test was considered. Based on these tests of association, 89% of the SNPs selected for replication validated at a P < 0.05 level of significance (P-values for each SNP are given in Tables 3.3-3.7).
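For reference, a one-df allelic chi-squared test on a 2x2 table of allele counts can be sketched as follows. This is the textbook Pearson statistic without continuity correction, written out by hand rather than taken from PLINK; the function name and the example counts are hypothetical.

```python
import math
from typing import Tuple

def allelic_chi2(case_a: int, case_b: int, ctrl_a: int, ctrl_b: int) -> Tuple[float, float]:
    """Pearson chi-squared (1 df) and P-value for a 2x2 allele-count table.

    case_a / case_b: counts of the two alleles in cases (2 alleles per person);
    ctrl_a / ctrl_b: the same counts in controls.
    """
    n = case_a + case_b + ctrl_a + ctrl_b
    denom = (case_a + case_b) * (ctrl_a + ctrl_b) * (case_a + ctrl_a) * (case_b + ctrl_b)
    if denom == 0:
        raise ValueError("a row or column of the table is empty")
    chi2 = n * (case_a * ctrl_b - case_b * ctrl_a) ** 2 / denom
    # Upper tail of a chi-squared distribution with 1 df: P(Z^2 > chi2).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical counts: 100 cases (200 alleles) and 100 controls (200 alleles).
chi2, p = allelic_chi2(60, 140, 40, 160)
```

With these invented counts the allele frequency difference is 0.10 (0.30 versus 0.20), giving a chi-squared statistic of about 5.33.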
This level of validation, in terms of the percentage of SNPs validated, is consistent with, if not better than, previous pool-based GWAS36. This could be attributed to the fact that we used more replicate arrays to estimate SNP AF in DNA pools than most previous pool-based GWAS. Failure of some SNPs to validate could be due to a small percentage of samples used in DNA pools being absent from the validation, or it may indicate SNPs chosen due to array-based error. Based on the number of SNPs that did validate, 89%, we were satisfied with the ability of the discovery stage array data and SNP selection procedure to select associated SNPs.

3.3.2 Replication Stage

In the next stage of the analyses, subtype-specific replication of the SNPs chosen in the discovery stage was investigated in the OCAC participating studies. Study-specific ORs and SEs from logistic regression analyses were used in fixed effects and random effects meta-analysis, carried out using the rmeta library implemented in R. Studies with fewer than 30 subtype-specific cases or controls were excluded from the meta-analyses to avoid bias introduced by small study size. Unless otherwise indicated, the OVAL-BC study was excluded from the meta-analysis to provide independent replication. The P-values reported are not adjusted for multiple testing. Results are presented in order by EOC subtype: MUC, ENCC, LMP-SER, and HG-SER.

3.3.2.1 Mucinous Results

The MUC replication dataset included 15 OCAC studies consisting of 1,257 MUC cases (invasive and LMP/borderline tumour behaviour) and 17,190 controls. Of 56 SNPs chosen based on the MUC pool analyses and genotyped according to OCAC QC standards, 4 were significant (Pfixed<0.05) in the replication samples and agreed in direction of effect with the discovery samples (Table 3.10). Three of these SNPs were chosen based on the cluster method of filtering (rs11108890, rs17106154, rs970651); one was a singleton filtered SNP (rs933518).
The most associated SNP, rs11108890, had a minor allele (A) that increased MUC EOC risk: ORfixed=1.35, 95%CI: 1.10-1.66, Pfixed=0.003. There was significant between-site heterogeneity for rs11108890 (Phet=0.01), but not for rs933518, rs17106154, or rs970651. In a random effects meta-analysis rs11108890 is not significantly associated with MUC EOC risk: ORrandom=1.30, 95%CI: 0.96-1.78, Prandom=0.09. Both rs933518 and rs17106154 increased risk in the replication meta-analysis: rs933518, ORfixed=1.24, 95%CI: 1.06-1.44, Pfixed=0.007; rs17106154, ORfixed=1.20, 95%CI: 1.03-1.41, Pfixed=0.02 (Table 3.10). The remaining SNP significant in the replication analysis was rs970651: ORfixed=1.13, 95%CI: 1.00-1.26, Pfixed=0.045. rs970651 is in high LD with rs7981902 (r²=0.8 based on HapMap CEU), which was also chosen for replication. These SNPs are part of a cluster of 8 highly-ranked SNPs on Chromosome 13, two of which were chosen to tag the locus. rs7981902 results agreed with those of rs970651, but narrowly missed statistical significance: rs7981902, ORfixed=1.12, 95%CI: 0.99-1.28, Pfixed=0.055. Subsequently, we analyzed these 4 SNPs for association in the non-MUC EOC histologies, including END, CC, LMP-SER, and HG-SER (grey rows in Table 3.10). rs11108890 was significant in the END EOC replication dataset (ORfixed=1.23, 95%CI: 1.02-1.49, Pfixed=0.03), and in this subtype between-site heterogeneity was not significant (Phet=0.6). The remaining SNPs were not significant in the non-MUC EOC subtypes. To determine if invasive and LMP/borderline MUC EOC cases differed with respect to these 4 SNPs, tumour-behaviour-specific analyses were performed (Table 3.11). In the invasive discovery samples (used in DNA pool construction) rs11108890 significantly increased risk (ORfixed=4.23, 95%CI: 1.30-13.7, P=0.02), and in the invasive replication samples it narrowly missed statistical significance (ORfixed=1.32, 95%CI: 1.00-1.76, Pfixed=0.054) (Table 3.11).
In the LMP/borderline discovery samples rs11108890 significantly increased risk (ORfixed =2.69, 95%CI: 1.01-7.19, Pfixed=0.05), but it was not significant in replication samples (ORfixed =1.22, 95%CI: 0.87-7.19, Pfixed=0.24). Thus, the association of rs11108890 appears stronger in invasive MUC samples, but perhaps not absent from LMP/borderline MUC samples. Notably, heterogeneity amongst OCAC studies for the association with rs11108890 was not significant when invasive and LMP/borderline MUC cases were analyzed separately (Phet-invasive=0.25, Phet-LMP =0.23, Phet-combined =0.01). rs17106154 significantly increased risk in the invasive discovery samples (ORfixed =2.93, 95%CI:1.09-7.85, P=0.03); but not the invasive replication samples (ORfixed =1.16, 95%CI:0.92-1.46, P=0.20). rs17106154 did not significantly increase risk in the discovery or replication LMP/borderline MUC datasets. Based on the discovery samples, the association of rs17106154 appears stronger in invasive MUC tumours. rs933518 was not significant in the invasive discovery samples (ORfixed =1.52, 95%CI:0.58-3.96, P=0.39); but was significant in the invasive replication samples (ORfixed =1.27, CI:1.02-1.56, P=0.03). In contrast, in the LMP/borderline discovery samples it was significant (ORfixed=2.60, CI:1.39-4.87, P=0.0027); and in LMP-borderline replication dataset it narrowly missed statistical significance (ORfixed =1.27, CI:0.99-1.62, P=0.06). rs933518 may be associated with MUC EOC risk regardless of tumour behaviour. rs970651 significantly increased risk in the invasive discovery and replication samples (discovery ORfixed =2.39, 95%CI: 1.04-5.49, P=0.04; replication ORfixed =1.20, 95%CI: 1.02-1.41, P=0.02 ). In discovery LMP/borderline samples it was also very significant, but in replication LMP/borderline samples it was not significant (discovery ORfixed =3.04, 95%CI:1.70-5.45, P=0.00018; replication ORfixed =1.02, CI:0.84-1.22, P=0.85). 
The rs970651 association is clearly present in invasive MUC EOC, but it is unclear whether it is also present in LMP/borderline MUC EOC. Failure to replicate SNPs in the invasive-only or LMP/borderline-only replication samples could relate to power given the limited number of MUC cases in OCAC, particularly when this rarer EOC subtype is stratified by tumour behaviour. To summarize, of the 4 SNPs found to be significant when invasive and LMP/borderline MUC cases were analyzed together (Table 3.10), three show evidence of being associated in both invasive and LMP/borderline MUC cases (rs11108890, rs933518, rs970651). The association of rs17106154 appears restricted to invasive MUC samples. None of the SNPs reported to replicate in the MUC EOC meta-analysis were significant at the genome-wide level of significance (P-value < 10^-7), or assuming a Bonferroni correction for 56 SNPs (56 is the number of SNPs tested in this subtype, and gives an adjusted critical value of 8.93x10^-4). rs933518 is significant after Bonferroni correction when the OVA study site is included in the replication dataset (Table 3.10).

3.3.2.2 Endometrioid and Clear Cell Results

The ENCC replication dataset included 22 OCAC studies consisting of 2,594 ENCC (invasive endometrioid and clear cell) cases and 20,326 controls. Of 55 SNPs chosen based on the ENCC pool analyses and genotyped according to OCAC QC standards, two were significant (Pfixed<0.05) in the replication meta-analysis and agreed in the direction of effect with the discovery samples (Table 3.12). These SNPs, rs2190503 and rs6593140, are in complete LD, and were selected based on the cluster filtering method. They are part of a cluster of 13 highly-ranked SNPs on chromosome 7, three of which were chosen for replication to tag the locus. rs2190503 and rs6593140 increased risk in the ENCC meta-analysis at the P<0.05 level of significance: rs2190503, ORfixed=1.12, 95%CI: 1.03-1.23, Pfixed=0.01; and rs6593140, ORfixed=1.11, 95%CI: 1.02-1.22, Pfixed=0.02. The third SNP chosen to tag this locus, rs2329554, is in moderate LD (r^2=0.6) with these two SNPs. rs2329554 also increased risk in the ENCC replication meta-analysis, but missed statistical significance (ORfixed=1.07, 95%CI: 1.00-1.16, Pfixed=0.056). When these SNPs were subsequently inspected in the other EOC subtypes (grey rows in Table 3.12), all three SNPs were found to be associated with HG-SER risk, but not LMP-SER or MUC EOC risk. To determine if END and CC EOC cases differed with respect to the association of these SNPs, tumour behaviour-specific analyses were run (Table 3.13). None of the SNPs were significant in an END-only or CC-only replication meta-analysis. However, in the END-only replication all three SNPs approached significance: rs2190503, ORfixed=1.11, P=0.057; rs6593140, ORfixed=1.11, Pfixed=0.069; and rs2329554, ORfixed=1.09, Pfixed=0.057. In the CC EOC discovery samples, all three SNPs significantly increased risk; however, the association did not replicate. Failure to replicate these SNPs in END-only or CC-only samples could relate to power given the limited number of END and CC cases available for replication in OCAC, particularly in light of the smaller OR estimates for these SNPs based on the combined analysis. When the OVA study was included in the replication meta-analysis, all three SNPs achieved statistical significance in the END subtype (Table 3.13). Based on the available data, this locus is associated with increased cancer risk in the END and HG-SER subtypes, and may also increase risk in the CC EOC subtype. Worth noting is that when the END and CC subtypes were analyzed separately, study site heterogeneity became significant for SNPs rs2190503 and rs6593140 in both the END and CC analyses. It was not significant when END and CC cases were analyzed together. It is not clear why this might be the case.
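To make the fixed-effect summaries quoted above (ORfixed, Pfixed, and the heterogeneity test behind Phet) more concrete, the following is a minimal Python sketch. It assumes the standard inverse-variance method applied to per-study log odds ratios, which may differ in detail from the analysis actually run for this thesis. The function name and the two example studies are hypothetical. Cochran's Q is returned as the heterogeneity statistic; converting Q to a P-value requires a chi-square distribution with k-1 degrees of freedom and is left out here.

```python
import math


def fixed_effect_meta(log_ors, ses):
    """Inverse-variance fixed-effect pooling of per-study log odds ratios.

    log_ors: per-study log odds ratios (hypothetical inputs)
    ses:     matching standard errors of the log odds ratios
    """
    # Each study is weighted by the inverse of its variance.
    weights = [1.0 / se ** 2 for se in ses]
    total_w = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, log_ors)) / total_w
    pooled_se = math.sqrt(1.0 / total_w)

    # Two-sided P-value from the normal approximation.
    z = pooled / pooled_se
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))

    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, log_ors))

    return {
        "or": math.exp(pooled),
        "ci95": (math.exp(pooled - 1.96 * pooled_se),
                 math.exp(pooled + 1.96 * pooled_se)),
        "p": p_two_sided,
        "q": q,
        "df": len(log_ors) - 1,
    }


# Two made-up studies that agree exactly (OR = 1.2, SE = 0.1 each).
result = fixed_effect_meta([math.log(1.2), math.log(1.2)], [0.1, 0.1])
print(result)
```

When the studies agree exactly, as in this toy input, Q is essentially zero; a large Q relative to its degrees of freedom is what a significant Phet reflects.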
None of the SNPs reported to replicate in the ENCC EOC meta-analysis (Table 3.12) were significant at the genome-wide level (P-value < 10^-7), or assuming a Bonferroni correction for 55 SNPs (55 is the number of SNPs tested in this subtype, and gives an adjusted critical value of 9.1x10^-4).

3.3.2.3 Low Malignancy Potential/Borderline Serous Results

The LMP-SER replication dataset included 9 OCAC studies comprising 825 LMP-SER cases and 13,509 controls. Of 53 SNPs chosen based on the LMP-SER pool analyses and genotyped according to OCAC QC standards, two were significant (Pfixed<0.05) in the replication meta-analysis and agreed in the direction of effect with the analysis based on the discovery samples (Table 3.14). Both SNPs were chosen based on cluster method filtering. rs9609538 and rs2169310 (minor allele G for both SNPs) are estimated to decrease LMP-SER risk (rs9609538, ORfixed=0.87, 95%CI: 0.78-0.97, P=0.015; and rs2169310, OR=0.87, 95%CI: 0.76-0.98, P=0.026). Neither SNP was significantly associated with risk in any of the other EOC subtypes (grey rows in Table 3.14). Neither of these SNPs is significant at the genome-wide level (P-value < 10^-7), or assuming a Bonferroni correction for 53 SNPs (53 is the number of SNPs tested in this subtype, and gives an adjusted critical value of 9.4x10^-4).

3.3.2.4 High Grade Serous Results

The HG-SER replication dataset included 27 OCAC studies consisting of 6,881 HG-SER cases and 21,530 controls. Of the 24 SNPs chosen based on the HG-SER pool analyses and genotyped according to OCAC QC standards, none were significant (Pfixed<0.05). As no SNPs were significant in the HG-SER replication dataset, no SNPs were inspected in the other EOC subtypes. One HG-SER SNP, rs13194781, was significant when the replication dataset included OVA samples: ORfixed=1.08, 95%CI=1.01-1.16, Pfixed=0.031, Phet=0.07.
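The Bonferroni-adjusted critical values quoted in Sections 3.3.2.1 to 3.3.2.3 follow from dividing the nominal significance level by the number of SNPs tested in each subtype. The short Python sketch below reproduces that arithmetic, assuming a nominal alpha of 0.05 (implied by the quoted values); the function and variable names are illustrative only.

```python
def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test critical value after Bonferroni correction."""
    return alpha / n_tests


def passes_bonferroni(p_value, n_tests, alpha=0.05):
    """True if a P-value survives Bonferroni correction for n_tests tests."""
    return p_value < bonferroni_threshold(n_tests, alpha)


# Number of SNPs tested per subtype, as reported in the text.
snps_tested = {"MUC": 56, "ENCC": 55, "LMP-SER": 53}
thresholds = {subtype: bonferroni_threshold(n) for subtype, n in snps_tested.items()}

for subtype, threshold in thresholds.items():
    print(f"{subtype}: {threshold:.2e}")
```

For example, `passes_bonferroni(0.01, 55)` is false, which matches the statement that the rs2190503 association (Pfixed=0.01) does not survive correction in the ENCC analysis.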
3.3.3 In Silico SNP Analysis

Nine SNPs found to be significant (Pfixed<0.05) in our subtype-specific replication meta-analyses were analyzed for potential functional effects in silico (Table 3.15). Results are presented in order by subtype: MUC, ENCC, and LMP-SER.

3.3.3.1 Mucinous

rs11108890 and rs970651 are intergenic SNPs on chromosomes 12 and 13, respectively (Table 3.10). rs933518 is in an intron of a putative uncharacterized protein on chromosome 16. rs17106154 is in an intron/downstream region (depending on the transcript variant used) of RAD51B (alias RAD51L1) on chromosome 14. None of these four SNPs is predicted to alter TFBS activity, SSE/SSS activity, or miRNA binding site activity (based on FuncPred algorithms). The ESPERR regulation score for these SNPs is 0.04 or 0. Given that ESPERR regulation scores range from 0 to 1, zero being completely neutral with respect to regulatory potential, these SNPs are not likely to have a regulatory function based on the reference allele. Finally, the conservation score for these SNPs was 0. Conservation scores range from 0 to 1, zero being not conserved and 1 being completely conserved. These SNPs are not in conserved regions of the genome. rs11108890, rs933518, and rs17106154 had relatively low MAFs (4%-8%) in European OCAC controls (Table 3.10). Of 22 SNPs in LD with rs11108890 and 1 SNP in LD with rs933518, none are predicted to impact sites relevant to transcriptional or translational regulation, nor are they in conserved regions of the genome. Of two SNPs in LD (r^2>0.8) with rs17106154 (rs1274757 and rs1274758), both have alternate alleles predicted to alter TFBS activity (Table 3.11). The nucleotide position of these SNPs is not conserved, nor are they predicted to have strong regulatory potential (ESPERR) based on the reference allele. Of 28 SNPs in LD with rs970651, one SNP (rs10397, r^2=0.69) had an alternate allele predicted to alter SSE/SSS activity and miRNA binding site activity.
rs10397 is in the 3’-UTR/upstream region (depending on the transcript variant) of SUCLA2. The nucleotide position of rs10397 is highly conserved, and has a large regulatory potential score (ESPERR=0.17). To summarize, although the 4 non-coding SNPs associated with MUC EOC risk have no predicted regulatory function, rs17106154 and rs970651 are in LD with SNPs that may well have functional consequences. The two SNPs in LD with rs17106154 could alter the transcriptional regulation of RAD51B by changing TFBS activity. One SNP in LD with rs970651 could impact the structure/function of the SUCLA2 protein translated by changing SSE/SSS activity, and/or regulate the amount of protein translated by changing miRNA binding site activity.

3.3.3.2 Endometrioid and Clear Cell

rs2190503 and rs6593140 are in introns of GRB10 on chromosome 7 (Table 3.15). Depending on the transcriptional variant used, rs6593140 may alternatively be in an upstream or non-coding region of GRB10. rs2329554 is an intergenic SNP upstream of GRB10. These SNPs are not predicted to alter TFBS activity, SSE/SSS activity, or miRNA binding site activity (based on FuncPred algorithms); however, rs6593140 is in a highly conserved nucleotide position (Table 3.15), and has a large ESPERR regulatory potential score. rs2190503 and rs6593140 have a MAF of 13%; rs2329554 has a MAF of 22% in European OCAC controls. Of 58 SNPs in LD (r^2>0.6) with rs2190503/rs6593140 (which are themselves in perfect LD), two SNPs, rs6953182 and rs7791286, have alternate alleles predicted to alter TFBS activity based on FuncPred (Table 3.16). These SNPs are not in conserved sites, nor are they in a position predicted to have strong regulatory potential (ESPERR) based on the reference allele. Of 12 SNPs in perfect LD with rs2329554, 4 have alternate alleles predicted to alter TFBS activity.
Two of these have ESPERR regulation scores >0.1, indicating regulatory potential for the genomic position; however, none are in highly conserved regions of the genome. Again, although the three SNPs associated with ENCC EOC risk have no predicted regulatory function, they are in high LD with several SNPs that may well have functional consequences, particularly for the transcriptional regulation of GRB10.

3.3.3.3 Low Malignancy Potential/Borderline Serous

rs2169310 is an intergenic SNP on chromosome 16 and rs9609538 is an upstream or intronic SNP in C22orf28 (depending on the transcriptional variant used) (Table 3.15). rs9609538 is also 5 bp downstream of BPIL2. Both SNPs have a high MAF, 0.24 and 0.36, respectively, in European OCAC controls. rs9609538 is predicted to alter TFBS activity based on FuncPred (Table 3.10); it is also predicted to alter miRNA binding site activity. In addition, of 4 SNPs in LD (r^2>0.6) with rs9609538, one is predicted to alter TFBS activity and miRNA binding site activity (Table 3.11). rs9609538 may have a direct role in regulating transcription, presumably of C22orf28, by changing TFBS activity. Alternatively, it may change translational regulation of BPIL2 by altering miRNA binding site activity. The nucleotide position of these SNPs is not conserved, nor are the sites predicted to have strong regulatory potential (ESPERR) based on the reference allele. Finally, neither rs2169310, nor SNPs in LD with it, are predicted to have functional consequences based on in silico analyses.

3.4 Discussion

We set out to discover subtype-specific EOC risk alleles of moderate to large effect by performing a subtype-specific EOC GWAS. In particular, the focus was on discovering SNPs associated with risk in the rarer EOC subtypes, including the MUC, END, CC, and LMP-SER subtypes.
Nine SNP associations tagging 7 loci are reported here that we feel warrant further replication in additional datasets with reliable EOC histological subtype and tumour behaviour information. These SNPs are not significant at the genome-wide level, uncorrected P-values are reported, and SNP associations should be considered in this light. Even within the OCAC as it currently exists, there are insufficient non-SER EOC cases and well-matched controls to reliably detect risk SNPs that have smaller ORs and/or smaller MAF in the MUC, END, CC, and LMP-SER subtypes; more samples are needed. Nevertheless, of the SNP associations reported, several are intriguing in that they are in or near genes that arguably have a clear biological rationale for conferring EOC risk. As well, in silico analyses predict regulatory roles for these associated SNPs, or for SNPs in LD with them.

3.4.1 Mucinous

One salient association emerging from the MUC EOC analysis is that of a SNP (rs17106154) in a predicted intron of RAD51B. At least two SNPs in LD with rs17106154 have alternate alleles predicted to alter TFBS activity, which could change the transcriptional regulation of this gene. RAD51B is one of the five RAD51 paralogs involved in homologous recombination (HR) repair of DNA double-strand breaks (DSBs); it also interacts directly with p53. Haplo-insufficiency of RAD51B was shown to cause mild hypersensitivity to DNA damaging agents, impaired homologous recombination, and increased chromosome aberrations [95], providing a clear hypothesis for why variation at this locus might increase EOC risk. The association with rs17106154 was more apparent in invasive MUC samples than LMP/borderline MUC samples, a finding that is arguably consistent with the role RAD51B plays in maintaining genomic integrity, and the increased chromosomal rearrangements seen in invasive versus LMP/borderline type tumours.
Finally, RAD51B forms a stable heterodimer with RAD51C, a recently reported high penetrance EOC risk allele [96]. To our knowledge, this is the first time RAD51B has been implicated in EOC risk, particularly MUC-EOC risk; however, it has been associated with breast cancer risk [208]. It is not clear why RAD51B would increase risk in MUC EOC and not the other EOC subtypes, particularly the HG-SER subtype where RAD51C has already been implicated. rs970651 is in LD with SNPs in the 3’-UTR and introns of SUCLA2, which are predicted to have consequences for splice site enhancer/suppressor activity, as well as miRNA binding activity. These SNPs could alter the structure and function of the protein produced, and/or the amount of protein produced. SUCLA2 encodes the beta-subunit of the ADP-forming succinyl-CoA synthetase, a mitochondrial matrix enzyme essential for the TCA cycle. Mutations in this gene are reported in early onset defects in mitochondrial DNA (mtDNA) maintenance, or mtDNA depletion syndromes [209]. It is not clear how altered function and/or expression of SUCLA2 would confer MUC EOC risk. rs11108890 is intriguing in that although discovered and replicated in the MUC subtype, it is also associated in the END EOC subtype. Further, it is the most associated SNP in our MUC EOC meta-analysis, and the only one with significant between-study heterogeneity. This heterogeneity is not significant when cases are stratified by tumour behaviour; invasive MUC tumours are found to be more strongly associated. This may underscore the role detailed tumour pathology data can play in refining risk estimates for SNP associations. Alternatively, the low MAF of this SNP may be leading to unstable risk estimates. What biological effect this SNP might have, given that it lies in an intergenic region over 200 kb from any known gene and has no predicted regulatory function, is not clear.
3.4.2 Endometrioid and Clear Cell

The one association emerging from the combined END and CC EOC analysis is that of three SNPs in the upstream region (5’-UTR), or introns, of GRB10. Although these SNPs are not predicted to be directly functionally relevant, they are in LD with several SNPs that have alternate alleles predicted to alter TFBS activity and could change the transcriptional regulation of GRB10. Growth factor receptor-bound protein 10 (Grb10) is a widely expressed adaptor protein that functions downstream of activated insulin and growth factor receptors, and functions as a negative regulator of signaling. It was recently identified as a novel mammalian target of rapamycin complex 1 (mTORC1) substrate, one that functions in the feedback inhibition of the PI3K/AKT and RAS/MAPK pathways [210-212]. Loss of GRB10 function was found to result in the hyperactivation of the PI3K/AKT pathway in insulin-sensitive cells [212], and this biological role is consistent with those of genes already identified in END tumours. Somatic mutations frequently described in END EOCs are silencing mutations of PTEN, and activating mutations in the PI3K/AKT pathway [69]. Reduced Grb10 expression is frequently reported in various cancers, and negatively correlated with reduced PTEN expression, leading to the conclusion that these operate in a mutually exclusive fashion [212]; cell growth and survival advantages conferred by GRB10 alterations are redundant in the context of PTEN loss-of-function and vice versa. GRB10 may be a tumour suppressor gene that acts in parallel with PTEN to ensure proper levels of activation of the PI3K/AKT pathway. Somatic mutations to genes involved in regulating the PI3K/AKT pathway have been frequently described in END EOC and CC EOC, and rarely in HG-SER and LMP-SER. All three SNPs were found to also be associated with HG-SER risk, but not LMP-SER or MUC EOC risk.
These variants may result in subtle down-regulation of GRB10, and favour cancer susceptibility and tumour progression in a cell context-dependent fashion, a pattern already described for PTEN [213,214].

3.4.3 Low Malignancy Potential/Borderline Serous

Of the nine SNPs associated with subtype-specific EOC risk in our analyses, only one (rs9609538) is predicted to have a direct functional consequence. This SNP decreases LMP-SER risk, and is not associated with any other EOC subtype. This SNP is sandwiched between two genes, C22orf28 (~500 bp upstream) and BPIL2 (5 bp downstream), and is predicted to alter TFBS activity and miRNA binding site activity. As such, variation at rs9609538 could alter expression (transcriptionally and/or translationally) of C22orf28 or BPIL2 gene products. BPIL2 is an intriguing gene to explain the association at this locus. There are only 2 publications in PubMed on this gene (search term “BPIL2”). In the first, it was reported to be a rarely expressed lipid transfer/lipopolysaccharide binding protein, involved in recognizing the outer membrane of Gram-negative bacteria [197]. It is highly expressed in the inflammatory skin of psoriasis patients, and absent from normal adult skin [197], suggesting a role in inflammation and/or the immune response. In the second report, BPIL2 was found to be one of a handful of significantly differentially expressed genes in the avian oviduct, leading to the hypothesis that it serves an anti-microbial function in this context [197]. Conceivably, BPIL2 could serve a similar host defense function in the human female gynecological tract. C22orf28 (alias HSPC117) is an essential component of the human tRNA splicing pathway, functioning to ligate tRNA exons after intron splicing. Recently, this gene was also implicated in RNA processing during viral replication [215]. Specifically, it was identified as a host factor involved in enabling human hepatitis delta virus replication.
Interestingly, the biological functions of BPIL2 and C22orf28 both suggest a possible infection/inflammation-based role in the development of LMP-SER EOC. Viruses and bacteria as a cause of chronic inflammation, and consequently as initiators/promoters of cancers in the lower female genital tract, are already well described [216]. For example, infection with human papillomavirus and/or infection with Chlamydia trachomatis increases the risk of cervical cancer [217]. Although inflammation is thought to play a role in the development of OC, it is typically present in the context of incessant ovulation and/or endometriosis. This might imply that inflammation in response to infection plays a role in the development of LMP-SER EOC. Variation in the expression of C22orf28/BPIL2 might alter the response to infection, and over an individual’s lifetime reduce LMP-SER risk. Finally, rs2169310 is an intergenic SNP in a gene-sparse desert on chromosome 16 (no genes within 500 kb upstream or downstream). It is unclear how this SNP might impact LMP-SER risk.

3.4.4 High Grade Serous

No SNPs were significantly associated with HG-SER risk in the replication meta-analysis. This can be attributed to the discovery stage of this pool-based GWAS being underpowered to reliably detect risk alleles of smaller effect sizes (i.e. the sizes reported for HG-SER risk variants). Based on the findings of the first EOC GWAS, it was highly unlikely that additional HG-SER risk alleles conferring moderate to large risk would be found (i.e. they do not exist). In contrast, risk alleles conferring moderate to large effects in the non-HG-SER subtypes were yet to be investigated, and were not implausible.

3.5 Study Limitations

Three major limitations existed in this study of heritable risk factors conferring subtype-specific EOC risk. The first limitation was sample availability (discussed in detail in sections 1.5 and 3.1 of this thesis).
This study was only powered to detect SNPs conferring moderate-to-large effects, and these may not exist for this phenotype. Nevertheless, we pursued the possibility of larger effect variants in an attempt to advance biological understanding where it appeared little progress would be made in the foreseeable future. The cost savings afforded by the pooling GWAS approach alleviated some of the potential negative consequences of a null result. In the future, only through extensive multi-study collaboration will a sufficient number of cases of the rarer EOC subtypes be accrued to conduct a well-powered GWAS capable of detecting variants conferring small risk (OR<1.3). This is a reflection of the relatively low incidence of EOC, particularly the MUC, END and CC subtypes. To our knowledge, such a collaborative study of MUC, CC, END, or LMP-SER EOC risk is not currently underway, and is not likely to be completed in the next 5 years. The second limitation was the number of SNPs that could be chosen for replication. At most 200 SNPs could be selected for replication (a number that was fixed and assigned a priori) (see section 3.2.1.4), and these had to be distributed among four separate histological analyses. In the end, 198 SNPs were chosen, and as described in sections 1.2.1.6 and 3.2.1.4, this number is very small and the selection was therefore highly stringent. We may not have captured the true positive SNP-trait associations dispersed just below the false positives that are expected to arise and dominate the tails of the test statistic distribution. A future area of work may be to pursue replication of the top 0.1% of our GWAS results in a subset of the larger case-control studies making up the OCAC (a sufficient number to ensure a well-powered replication).
Further, for the trait-associated SNPs that were found to be associated in the replication (unadjusted P<0.05 level of significance), an insufficient number of SNPs (per locus) were selected to perform in silico imputation-based fine-mapping. For each locus of interest, additional genotyping and/or targeted sequencing will be needed to determine the most associated SNPs and/or narrow the region of association. A discussion of the fine-mapping and targeted sequencing approaches is given in section 1.2.3.1. Many would consider the DNA pooling design to be a limitation; however, it has been shown that this approach can be successfully applied in the GWAS context, including the cancer-GWAS context [36,41,50]. The pool-based GWAS worked well in our hands, a statement based on the fact that most of the SNP associations chosen for replication based on the pooled analyses validated in individual genotyping data (89%), and in some cases were highly significant (P < 10^-6, see Tables 3.3-3.7).

3.6 Conclusion

We set out to discover subtype-specific EOC risk alleles of moderate-to-large effect by performing a subtype-specific EOC GWAS. We report 8 subtype-specific SNP associations tagging 7 loci that are significant at an unadjusted P<0.05 level of significance in a replication analysis including between 10 and 22 independent case-control strata. Without exception, these SNPs were estimated to have a large effect size (ORs range from 1.75 to 4.0) in our discovery stage analyses; however, in replication stage analyses, these SNP effect sizes were dramatically attenuated and ranged from OR=1.11 to 1.35. This is consistent with the phenomenon of winner’s curse, which is inversely related to the power of a study [17]. For example, for 10% power, the typical inflation factor for an additive effect is upwards of 60% [17]. It is also consistent with the expected effect sizes for cancer-associated SNPs [218].
Given the current significance level of these loci, additional replication is needed to confirm or refute the 8 SNP-EOC associations reported. That said, several SNPs are intriguing in that they are in or near genes that arguably have a biological rationale for conferring subtype-specific EOC risk. These include the MUC-EOC risk loci near RAD51B, the ENCC risk loci near GRB10, and the LMP-SER risk loci near BPIL2/C22orf28. Ultimately, fine-mapping, targeted sequencing, and functional studies (particularly of regulatory mechanisms) will be needed to establish the putative causal variant(s) at these loci, their functional effect(s), and the genes affected (see sections 1.2.3.1-1.2.3.4 for further discussion).

CHAPTER 4 Novel and Shared Genetic Factors for Age of Natural Menopause in Iranian and European Women

Menopause marks the upper limit of the female reproductive period. Genetics plays an important role in age at natural menopause (ANM). To date, GWAS of ANM have been reported only in European women. This study has two objectives. First, we use a DNA pool-based GWAS to identify genetic factors associated with ANM in Iranian women; and second, we assess whether genetic factors identified in European ancestry women are also associated with ANM in Iranian women. A population-based sample of women from the Tehran Lipid and Glucose Study (TLGS) was used. For the first objective, two DNA pools were constructed and compared: “early” ANM (40-45 yrs, n=165), and “late” ANM (54-65 yrs, n=187). Each DNA pool was allelotyped on 4 Illumina Human1M-Duo arrays, and allelotype-based tests of association were used to rank SNPs. 68 SNPs were then successfully genotyped in individual DNA samples from 782 Iranian women. Ten of these SNPs were associated with ANM (unadjusted p < 0.05). Meta-analyzed ReproGen Consortium GWAS data were used to assess these 10 SNPs in European women in silico.
One, rs10840211 at 11p15.4 (near TMEM41B and IPO7), was associated with ANM (unadjusted p=0.0013) in ReproGen data. For the second objective, 10 ANM-associated SNPs identified in European ancestry populations (including 7 from published GWAS and 3 from candidate gene studies) were tested in Iranian women from the TLGS. One SNP, rs16991615 (in MCM8) at 20p12.3, was associated with ANM in Iranian women (adjusted p=0.02). In summary, we have identified a novel SNP-ANM association at 11p15.4, and have replicated a reported ANM association at 20p12.3; these two associations are seen in both Iranian and European women.

4.1 Background

Ovarian aging is described as the natural process by which a woman reaches reproductive exhaustion. Menopause is an important event in this process, and a prevailing hypothesis states that it occurs when the follicle pool in the ovary is too low to maintain regular cycles [151,152,219,220]. Menopause is a dramatic event in a woman’s life history and demarks major changes in endocrine signaling, particularly a reduction in female hormone production by the ovaries [116]. It is also a risk factor for many age-related diseases. For example, late menopause is associated with an increased risk of ovarian cancer [1], endometrial cancer [3], and breast cancer [4,156], and early menopause is associated with an increased risk of cardiovascular disease and osteoporosis [153-155]. Menopause age varies widely, between 40 and 60 years (average 50-51), and is influenced by genetic and environmental factors [7,152,157,158,221]. Family- and twin-based studies have estimated the heritable component of age at natural menopause (ANM) to be between 42% and 87% [160-164]; thus, the prevailing view is that genetics plays a very important role. Although environmental factors such as smoking and oral contraceptive use influence ANM, collectively they have been found to explain relatively little of the variation in this trait [160,165].
Two large GWAS of ANM were published in 2009, and collectively reported five loci harboring genome-wide significant SNP associations [169,170]. Both studies were performed on women of European descent. In the first study, 4 loci were associated with ANM, including: 20p12.3, 19q12.42, 5q35.2, and 6p24.2 [169]. In the second, SNP-ANM associations were reported at 20p12.3, 19q12.42, and 13q34 [170]. Encouragingly, two loci overlapped between these studies (20p12.3, 19q12.42). In 2012, a meta-analysis of 22 GWAS performed on 38,968 women of European descent with ANM information was performed. This effort confirmed four of the previously reported loci and identified 13 new loci associated with ANM (P-value < 5×10^−8). It was reported that genes at these loci are enriched for DNA repair and immune function, suggesting these pathways are important to ovarian aging. These pathways have also been implicated in ovarian cancer. Further, pathway analyses using the full meta-analysis GWAS data set identified exoDNase, NF-kappa B signaling and mitochondrial dysfunction as biological processes related to ANM. To date, ANM GWAS have been performed exclusively on women of European descent. However, evidence suggests ANM varies by race/ethnicity [157,181]. For example, a US-based cohort study of 92,704 women from five racial/ethnic groups found that Latina women experience menopause earlier, and Japanese women experience menopause later, than non-Latina Whites [157]. Adjustment for environmental factors, including smoking, age of menarche, parity, and BMI, did not change this result. These differences may be explained by environmental factors not taken into consideration, genetic factors, or a combination thereof. In this study, we investigate if genetic factors play a role in this difference by conducting a GWAS of ANM in non-European women, and comparing the results to those from GWAS of ANM in European women.
The Iranian population is studied, using samples drawn from the Tehran Lipid and Glucose Study (TLGS) [222,223]. We also investigate if variants previously associated with ANM in European women are also associated with ANM in Iranian women. Based on the Human Diversity Panel [224], women of Iranian descent are described as “Middle Eastern” and are expected to be more similar to Europeans than Latinas and Japanese; however, principal components analyses show that “Middle Eastern” peoples form a node distinguishable from that of “Europeans” [224]. This study had two objectives (summarized in Figure 4.1). The first was to identify novel variants explaining a moderate-to-large proportion of the genetic variation in ANM in Iranian women using a pool-based GWAS approach. The second was to determine if variants associated with ANM in European women are also associated with ANM in Iranian women.

4.2 Methods

4.2.1 Iranian Sample Collection

Study subjects were selected from women participating in the TLGS [222,223], an ongoing longitudinal study in Iran. Approval for this GWAS of ANM was received from the Research Institute for Endocrine Sciences Clinical Research Ethics Board and from the joint Clinical Research Ethics Board of the British Columbia Cancer Agency and the University of British Columbia. TLGS invited participation from ~15,000 individuals 3 years of age and older from a geographically-defined population within Tehran, Iran. Participants entered the study after providing written informed consent. As part of this study, participants were questioned at three-year intervals on the regularity of menstrual cycles, parity, contraceptive use, hormone replacement therapy use, and menopausal status [222]. The World Health Organization (WHO) definition of menopause status was used: the absence of spontaneous menstrual bleeding for more than 12 months, for which no other pathological cause can be determined.
If a TLGS participant experienced menopause prior to entering the study, the date of the last cycle was recorded, based on participant recall. Only women who experienced natural menopause, and whose four grandparents were of Iranian ancestry (participant reported), were considered for inclusion in our GWAS of ANM. A total of 2,710 TLGS participants met these criteria. Of these, 679 were excluded due to a history of hysterectomy, use of hormone-replacement therapy, or failing to recall when menopause occurred. In addition, 91 women who experienced premature ovarian failure (POF) were excluded. POF is defined as menopause before the age of 40, and is thought to have a unique set of genetic risk factors which could confound a GWAS of ANM [152]. After these exclusions, 1,940 women remained. Of these, 852 had sufficient DNA for inclusion in this study.

4.2.2 Discovery Stage DNA Pooling Procedures

Although ANM is a normally distributed continuous trait, for the discovery stage (stage 1) of the pool-based GWAS, it was necessary to quantize the phenotype because only pool allele frequencies, not individual genotypes, are available for tests of association. Study participants were placed in five groups, defined by ANM age percentiles calculated based on the 852 women included in the study. Women in the lower 20th percentile experienced menopause between 40-44 (group 1), women in the 20th-80th percentile experienced menopause between 46-53 (groups 2-4), and women in the upper 20th percentile experienced menopause between 54-65 (group 5). Women from the ends of the ANM distribution, corresponding to the upper and lower 20th percentiles of reported menopausal ages, were pooled in an “early ANM” pool (lower 20th percentile) and a “late ANM” pool (upper 20th percentile). These two pools were then used for comparison and tests of SNP-ANM association in the discovery stage of the GWAS.
By using the phenotypic ends of the distribution at this stage, we aim to enrich for differences in those genetic variants that influence ANM in our DNA pools, and thus perform the best-powered pool-based GWAS191,225. However, we assume these variants are relevant to the entire ANM distribution, not just early and late menopause, and replication (stage 2) is performed accordingly (using women from the entire distribution of menopausal ages). The 20th and 80th percentiles as pool cut-offs, although relatively arbitrary, were chosen to balance the desire to use the ends of the ANM distribution against the need to keep enough individuals in DNA pools to have power to detect moderate-to-large SNP-ANM associations (power calculations are given in Appendix D.1). DNA used in pool construction was extracted according to the established methods at the Research Institute for Endocrine Sciences226,227. DNA samples were quantified in duplicate by fluorometry using PicoGreen™ (Molecular Probes, Eugene, OR, US). Samples were diluted in TE (10 mM Tris, 0.1 mM EDTA) to a target concentration of 2.0 ng/µL, and quantified again in duplicate using PicoGreen™. DNA pools were constructed by combining 20 ng of each DNA sample by manual pipetting. Pools were concentrated approximately 10-fold by SpeedVac centrifugation, ethanol precipitated, and re-suspended in TE (10 mM Tris and 0.1 mM EDTA) to a target concentration of 100 ng/µL. The early and late ANM pools were then each assayed using 4 Illumina Human1M-Duo (1M-Duo) BeadChips at The Centre for Applied Genomics (TCAG), Toronto, Canada. Replicate arrays were used to reduce the error in allele frequency estimation47-49 (see Chapter 2 for discussion). Using the 1M-Duo array data, SNP allele frequency in the DNA pools was estimated. For a detailed description of how SNP allele frequency is estimated using array data see Chapter 3.2.1.2.
4.2.3 Quality Control and Normalization of Discovery Stage Array Data

Prior to estimating SNP allele frequency using the 1M-Duo array data, poorly performing beads were removed and data were normalized. No changes were made to the procedures described and implemented in the GWAS of ovarian cancer; hence, for a detailed description of these steps see Chapter 3.2.1.3.

4.2.4 Association Analysis in Discovery Stage Array Data

Association analysis was conducted using a previously described, publicly available program, GenePool41 (version 0.9.1). Source code was downloaded and modified to accept Illumina 1M-Duo array data. Using this package, association statistics comparing the early and late ANM pools were calculated using the SINGLEMARKER algorithm. Details of the association test performed by SINGLEMARKER are given in Chapter 3.2.1.5. SNPs that cannot be reliably assayed in DNA pools were excluded prior to tests of association. SNPs with a HapMap CEU minor allele frequency (MAF) < 10% were excluded; power has been shown to be substantially reduced for low MAF SNPs in pool-based experiments48,199. SNPs without HapMap data (Release 27) were removed to eliminate the possibility of selecting low MAF SNPs for replication. Probes specifically designed to assay copy-number variant (CNV) nonpolymorphic sites were removed, as these do not have meaningful two-colour fluorescence intensity data. SNPs specifically designed to assess polymorphisms in mitochondrial DNA (mtDNA) were excluded due to concerns regarding homology (>98%) between nuclear and mtDNA sequence confounding results200. Furthermore, after data collection but before association analysis, SNPs absent from one or more arrays (759 SNPs) were removed, as were those 5% of SNPs with the greatest variability in MAF estimation in both the early and late ANM pools. These SNPs are more likely to give inaccurate estimates of MAF, and could lead to SNPs being spuriously associated with ANM.
After these exclusions, 694,326 SNPs were tested for association with ANM using the SINGLEMARKER test.

4.2.5 Filtering of Associated SNPs

There is enrichment for false-positives due to genotyping error among the most significant SNPs when performing a GWAS using individual genotyping data201; the same is true of a GWAS using pool-based data. To address this, additional filtering criteria were applied to the association results. Many SNPs on the 1M-Duo array are in high LD (r² > 0.8) with other SNPs (proxies) on the array, and agreement between these high LD SNPs was used to reprioritize SNPs. Henceforth, unless otherwise indicated, estimates of LD are based on the HapMap CEU data. To reprioritize SNPs, all SNPs in high LD (‘proxy SNPs’) with a given SNP of interest (‘primary SNP’) were grouped, and the median SINGLEMARKER rank of these SNPs was calculated (called LD analysis). SNPs were then chosen based on this median rank. LD analysis was performed on the top 1000 SNPs associated with ANM in the discovery stage, as ranked by GenePool’s SINGLEMARKER P-value. The number of SNPs on which LD analysis was performed was relatively arbitrary: it was large enough to reprioritize more SNPs of interest than could be pursued in replication, while still restricting attention to a very small number of highly associated SNPs. Some SNPs on the 1M-Duo array do not have any proxies in high LD, and were thus excluded from the LD analysis. In the absence of a means to more effectively filter these SNPs, it is expected that this category is enriched for false positives. We did not want to completely ignore this category of SNPs, nor did we want to choose many SNPs that are more likely to be spurious. We opted to select very few of these SNPs for replication, and selection was based exclusively on the SINGLEMARKER P-value.
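The LD analysis described in Section 4.2.5 can be sketched as follows. This is an illustrative outline only, not the code used in this study. The function name and input structures are hypothetical, and it assumes that the primary SNP's own rank is included when the median is taken, which the text does not state explicitly.

```python
from statistics import median

def median_ld_rank(ranks, proxies):
    """Median SINGLEMARKER rank per primary SNP and its high-LD proxies.

    ranks:   dict mapping SNP id -> association rank (1 = most significant)
    proxies: dict mapping primary SNP id -> list of proxy SNP ids
             (assumed to be pre-filtered to r^2 > 0.8)
    Primary SNPs with no ranked proxy are skipped; in the text these are
    handled separately, using the P-value alone.
    """
    result = {}
    for primary, proxy_list in proxies.items():
        if primary not in ranks:
            continue
        ranked_proxies = [p for p in proxy_list if p in ranks]
        if not ranked_proxies:
            continue  # no proxies on the array: excluded from LD analysis
        group = [primary] + ranked_proxies
        result[primary] = median(ranks[snp] for snp in group)
    return result
```

A primary SNP whose proxies also rank highly keeps a small median rank, whereas an isolated high rank with poorly ranked proxies is pushed down the list, which is the behaviour the filtering step relies on.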
4.2.6 SNP Selection for Replication Stage

One hundred and twenty SNPs associated with ANM, either in the discovery stage of our pool-based GWAS of Iranian women or in association studies previously performed on European women, were selected for individual genotyping (IG) and replication in our study participants. The decision of how many SNPs to pursue by IG was largely dictated by practical and financial constraints. Financial limitations dictated that ~120 SNPs would be selected for replication, corresponding to individual genotyping on 3 multiplex Sequenom iPlex Gold® assays, where up to 40 SNPs can be individually genotyped in a single well. Genotyping was performed at the McGill University and Génome Québec Innovation Center. Beforehand, and based on a review of the current literature, it was decided that ~20 SNPs would be needed to replicate ANM associations reported in European women. Approximately 100 SNPs remained for selection based on the pool-based GWAS of ANM in Iranian women. SNP selection is summarized in Table 4.1. From the discovery stage of our pool-based GWAS, 102 SNPs associated with ANM were selected for replication. Details regarding these SNPs, including SINGLEMARKER P-value, rank, and median rank (based on LD analysis), are given in Table 4.2. Ninety-five SNPs were chosen based on smallest median SINGLEMARKER rank after LD analysis. These 95 SNPs tagged 95 unique loci. Seven SNPs without proxies (r² > 0.8) were chosen based on smallest SINGLEMARKER test statistic P-value (Table 4.1). We favored SNPs analyzed by LD analysis because not only are they highly ranked, they are also less likely to be technical errors, based on primary/proxy agreement in the association result. The ratio of SNPs chosen based on these two criteria was relatively arbitrary; initially a 10:1 ratio was aimed for. In the final selection, a median SINGLEMARKER rank of 2000 served as a cut-off for LD analysis SNPs, as did a P-value rank of 20 for SNPs without proxies.
Eighteen SNPs associated with ANM in previous association studies performed in European women were selected for replication. For each of the genome-wide significant loci reported in the 2009 GWAS169,170, two SNPs were selected for replication: 1) the reported SNP with the smallest P-value, and 2) a SNP in LD with this primary SNP. Two loci, 14q32 and 5q15, harboring SNP-ANM associations reported to approach genome-wide significance by Stolk et al. 2009, were highly ranked in our pool-based GWAS data and were targeted for replication. SNPs at these loci were selected as described for the other European GWAS SNPs. At the time of this study, the meta-analysis of ANM was not yet published; hence, SNPs reported there were not replicated in our Iranian samples. Four SNPs reported as being associated with ANM in candidate gene studies were also included for replication: two SNPs in the AMHR2 gene152, and one in each of the BMP15 gene152 and the CYP1B1 gene228. At the time of this study, these were the candidate gene SNPs most significantly associated with ANM152.

4.2.7 Replication Stage Statistical Analysis

Association between ANM and each SNP was investigated using linear regression, implemented in PLINK (v1.07). An additive genetic model was assumed (i.e. 0, 1, 2 copies of the minor allele), and age at menopause was the dependent variable. Replication was performed including and excluding samples used in the discovery stage; both results are presented and discussed.

4.3 Results

4.3.1 Discovery Stage

4.3.1.1 DNA Pooling

The early ANM pool was constructed from 165 women who experienced menopause between 40 and 44. The late ANM pool was constructed from 187 women who experienced ANM between 54 and 65 years of age. Although the late ANM pool had a large range, 70% of women in this pool reported menopause at age 54. Women experiencing menopause between 46 and 53, corresponding to the 20th-80th percentiles of menopausal ages, were set aside for replication (stage 2).
Women participating in TLGS who experienced menopause after pool construction were included in the replication (33 women met the early ANM pool definition, and 35 women met the late ANM pool definition).

4.3.1.2 Quality Control and Normalization of Pooling Data

A hierarchical cluster plot of the 8 arrays used to assay the early and late ANM pools (Figure 4.2) indicates that one array did not perform as expected with respect to MAF estimation (the plot was constructed as described in Chapter 3.3.1.2). This array was used to assay the late ANM pool, but does not cluster with the other late ANM replicates (see “Array 1- late pool” in Figure 4.2), or with any other arrays used in the experiment. To explore this further, MAF estimates for 5000 randomly selected SNPs were extracted from all arrays. Replicate array data were used to calculate the average MAF of each SNP in both pools, along with the standard deviations (SD) of the averages for these 5000 SNPs. The distribution of the SDs is plotted in Figure 4.3, and shows that large SDs (SD > 0.05) are associated with the MAF estimates in the late ANM pool (see panel 2 in Figure 4.3). In contrast, small SDs (SD < 0.03) are associated with MAF estimates in the early ANM pool (panel 1 in Figure 4.3). When “Array 1- late pool” in Figure 4.3 was removed from average MAF calculations (panel 3 in Figure 4.3), the distribution of SDs is as expected for 3 replicate arrays (average SD < 0.03). It was not clear how to correct SNP MAF estimation on Array 1 in the late ANM pool, what impact a correction would have on the accuracy of MAF estimates, or what effect it would have on the association analysis; therefore, this array was removed from further analysis. After removal of this array we were satisfied with the remaining arrays with respect to the MAF estimation agreement between replicate arrays.
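The replicate-array check described in Section 4.3.1.2 can be sketched as below. This is a simplified illustration rather than the analysis code used here: the function names are hypothetical, the per-SNP sample standard deviation across arrays is assumed to be the quantity of interest, and the leave-one-array-out loop is one possible way to flag an aberrant array, not necessarily how "Array 1- late pool" was identified.

```python
from statistics import mean, stdev

def per_snp_sd(maf_rows):
    """Sample SD of the MAF estimates across replicate arrays, per SNP.

    maf_rows: list of rows, one per SNP; each row holds that SNP's MAF
    estimate on every replicate array of a single pool.
    """
    return [stdev(row) for row in maf_rows]

def leave_one_out_mean_sd(maf_rows):
    """Mean per-SNP SD after dropping each replicate array in turn.

    A marked drop when one particular array is left out suggests that
    array disagrees with the other replicates.
    """
    n_arrays = len(maf_rows[0])
    results = []
    for drop in range(n_arrays):
        kept = [[v for j, v in enumerate(row) if j != drop] for row in maf_rows]
        results.append(mean(per_snp_sd(kept)))
    return results
```

With 4 replicate arrays per pool, dropping one leaves 3, which is the minimum needed for the SD comparison to remain informative.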
4.3.1.3 Quality Control of Individual Genotyping Data

Individual genotyping (IG) was performed on 852 Iranian women (see “Iranian sample collection” for details). Of these, 209 were used in the discovery stage; 643 were not. Samples not used in the discovery stage were primarily women who experienced menopause between 45 and 53 (corresponding to the 20th-80th percentile of menopausal ages). Sixty-eight women in this category were in the upper and lower 20th percentile of menopause ages; however, they were obtained after pool construction and were therefore only included in replication. One hundred and forty-three samples used in the discovery stage pools did not have sufficient DNA for individual genotyping. Of 120 individually genotyped SNPs, 10 failed in all samples and were removed. Fifty-five samples failed genotyping in all SNPs and were removed. Thirty-nine duplicate samples (4.6% of total samples) were genotyped for QC purposes. The concordance rate for these was 99% (one sample was discordant for one SNP; that genotype was excluded). Four SNPs with a call rate below 95% were excluded from analysis. No SNPs deviated from Hardy–Weinberg equilibrium (HWE, P-value ≤ 0.001) (rs1172822). Fifteen samples with a call rate below 90% were removed, followed by 21 SNPs chosen based on the discovery stage of our GWAS that did not have a MAF >10% based on IG data. These were removed because we would have had limited statistical power to detect SNP-ANM associations for such low MAF SNPs in our discovery stage; hence, they most likely represented false positives. In summary, 85 SNPs passed QC: 68 from our GWAS in Iranian women, and 17 from previous association studies performed in European women. Seven hundred and eighty-two Iranian samples were genotyped and passed QC.

4.3.1.4 Validation of SNPs Selected for Replication

Prior to testing for replication, SNP associations detected in the pool-based data were inspected in IG data.
These data were available for 102/167 (61%) samples in the early ANM pool, and 107/187 (57%) samples in the late ANM pool. One hundred and forty-one samples used in the discovery stage pools had insufficient DNA for IG. Based on these data it is impossible to truly validate the SNP-ANM associations detected in the pools; nevertheless, validation steps were performed to observe whether SNPs trended towards validation. Validation steps were performed on 89 SNPs that passed QC (not excluding SNPs with a MAF < 10%). Agreement between pool-based and IG-based SNP MAF estimates was inspected for the early and late ANM pools (top panels in Figure 4.3). The correlation between methods was 0.20 for the early ANM pool and 0.80 for the late ANM pool. The correlation between SNP MAF estimates was poor in the early ANM pool, where most MAFs were overestimated by 0.20 in pool-based data relative to IG data (see top left panel in Figure 4.3). Pool-based data tend to overestimate MAF (Chapter 3 and reference 207); however, this discrepancy exceeds what has previously been reported. Nine SNPs selected based on literature review were also assayed by the 1M-Duo arrays, and showed good correlation in the early (r² = 0.70) and late (r² = 0.60) ANM pooling data (bottom panels in Figure 4.3). This suggests that not all SNPs are poorly estimated in the early ANM pool, just many of those selected by our discovery stage analysis. SNP allele frequency difference (AFD) observed between the early and late ANM pools was compared for pool-based and IG-based data (Figure 4.4). These box-plots demonstrate that on average, pooling data overestimated SNP MAF by ~0.15 relative to IG data for SNPs chosen based on our GWAS in Iranian women. Over-estimation of SNP AFD in pooling data is much less pronounced for SNPs chosen based on literature review; however, it is still present (Figure 4.4).
Presumably the correlation in MAF estimates between pool-based and IG data would be much better if all of the samples included in the pools had been individually genotyped. However, the discrepancy in the behavior of the SNPs chosen based on this discovery stage analysis versus the literature is a cause for concern. It is possible that many of the SNPs selected for IG based on the discovery stage analysis were chosen because of poor SNP MAF estimation in the early ANM pool. Nevertheless, there are several SNPs with an AFD > 0.1 based on IG data. As a final validation step, SNP-ANM association was tested, using the available IG data for those samples used in the early and late ANM pools, with a one degree of freedom (df) allelic chi-squared test implemented in PLINK (v1.07) (comparable to the SINGLEMARKER test). Based on these tests, 16% of the SNPs selected for replication validated at a P < 0.05 level of significance. This level of validation, in terms of the percentage of SNPs validated, is poor, but not unexpected in light of the other validation results. Because of incomplete IG data, it is impossible to know whether SNPs fail to replicate because of missing samples or because of poor MAF estimation in the discovery stage pools. All SNPs were included in the replication stage for analysis.

4.3.2 Replication

This study had two objectives (summarized in Figure 4.1). The first was to identify novel variants associated with ANM in Iranian women using a GWAS. The second was to determine if genetic factors influencing ANM previously identified in European women are also associated in Iranian women.

4.3.2.1 Pool-Based GWAS SNPs

Two sets of association analyses were performed, and are presented in order below. The first analysis was performed excluding samples used in the discovery stage DNA pools, and represents an independent test of association.
However, samples in this analysis are not ideally matched to samples in the discovery stage, and this could affect whether SNP associations replicate. The second analysis was performed including samples used in the discovery stage pools, and therefore does not represent an independent test of association. However, this analysis includes samples representative of the entire ANM distribution, and is better powered to detect SNP-ANM associations influencing the entire ANM distribution. When discovery stage samples were excluded, one SNP (rs2207451, β= -0.49, N= 552, Plinear= 0.04) was found to be significantly associated with ANM in the replication stage. The effect of this SNP, decreasing ANM, agrees with that observed when discovery stage samples were analyzed alone (rs2207451, β= -0.63, N= 201, Plinear= 0.49); however, the SNP was not significant in this analysis. This SNP did not achieve statistical significance when discovery and replication stage samples were combined (rs2207451, β= -0.51, N= 753, Plinear= 0.08). rs2207451 was not significant at the genome-wide level (P < 10⁻⁷), nor after Bonferroni correction for 68 tests (cut-off P-value < 7.3 × 10⁻⁴). When discovery stage samples were included, 9/68 (13%) SNPs were found to be significantly associated with ANM (Plinear < 0.05) in the replication stage (Table 4.3). These 9 SNPs were not significant at the genome-wide level; however, one SNP (rs10140275, β= 1.09, N= 780, Plinear= 4.0 × 10⁻⁴) remained significant after Bonferroni adjustment for 68 tests (cut-off P-value < 7.3 × 10⁻⁴). The G allele of rs10140275 is associated with delaying ANM by 1.09 years (Table 4.3). SNP-ANM association for these 9 SNPs was investigated in European women by inspecting the meta-analysis data prepared by the ReproGen Consortium5. This meta-analysis included 38,968 women of European descent who experienced ANM between 40 and 60 years of age.
rs10140275 was not associated with ANM in European women (rs10140275, βmeta= -0.03, Pmeta= 0.41); however, rs10840211 (βmeta= -0.11, Pmeta= 0.0013) was. Further, the effect of the minor allele, decreasing ANM, agrees with that in our data (rs10840211, βlinear= -0.48, N= 765, Plinear= 0.04, Table 4.3). Eight SNPs in LD (r² > 0.5 in HapMap CEU) with rs10840211 give a similar result in the meta-analysis data. With the exception of rs10840211, the SNPs associated with ANM in our analyses of Iranian women were not associated with ANM in European women.

4.3.2.2 Literature SNPs

Fourteen SNPs tagging 7 loci associated with ANM in European women169,170 were chosen for cross-population replication in our Iranian samples (Table 4.4). Linear regression assuming an additive model was performed for each SNP using IG data for 782 Iranian women (ANM ranged from 40 to 65 years). rs16991615, tagging the 20p12.3 locus, was found to be associated with ANM in Iranian women (rs16991615, N= 782, βlinear= 1.13, Plinear= 0.02) (Table 4.5). This locus harbors the most significant SNP-ANM association reported to date in European women (rs16991615, βmeta= 1.07, Pmeta= 1.42 × 10⁻⁷³)5. The other SNP associated with ANM at this locus, rs236114, was not associated with ANM in Iranian women. These SNPs are in relatively low LD (r² = 0.36). Two other previously reported SNPs approached statistical significance for association with ANM in Iranian women, rs1172822 and rs2153157. rs1172822 (N= 782, βlinear= -0.39, Plinear= 0.08) tags the 19q13.42 locus; rs2153157 (N= 782, βlinear= 0.41, Plinear= 0.06) tags the 6p24.2 locus. The 19q13.42 locus harbors the second most significant SNP-ANM association reported in European women (rs11668344, βmeta= 0.95, Pmeta= 1.45 × 10⁻⁵⁹)5 (r² = 0.89 between rs1172822 and rs11668344). rs2384687 was also chosen to tag the 19q13.42 locus, and agrees (SNP effect) with rs1172822 (Table 4.4).
rs2153159 was also chosen to tag the 6p24.2 locus; however, the SNP effect does not agree with rs2153157. SNPs at other loci were not significantly associated with ANM in Iranian women. Four SNPs tagging 3 loci associated with ANM in candidate gene studies were also chosen for cross-population replication; however, none of these SNPs was significantly associated with ANM in Iranian women (Table 4.5).

4.4 Discussion

4.4.1 Pool-Based GWAS SNPs

Using independent samples in replication stage analysis, one SNP (rs2207451, β= -0.49, N= 552, Plinear= 0.04) was found to be significantly associated with ANM in Iranian women. However, this association was not significant in the discovery stage samples (rs2207451, N= 201, β= -0.63, Plinear= 0.49). This result is perplexing given this SNP was selected for replication based on analyses conducted on the discovery stage samples. That said, IG data were missing for 143 samples used in the discovery stage analysis, and it is possible that were these IG data available, this SNP would be associated in the discovery stage samples. Alternatively, this SNP's association in the replication stage only could be a spurious result arising from small sample size. Further replication in additional Iranian samples is needed to determine if rs2207451 is associated with ANM in this population. In general, replication performed excluding samples used in the discovery stage may have been limited by the fact that most were women from the middle of the ANM distribution, and are therefore more similar with respect to ANM than the discovery stage samples, or than women randomly sampled from a population. These were the only samples available for replication at the time of this study. rs2207451 is an intergenic SNP on chromosome 2, approximately 28 kb downstream of an uncharacterized gene.
This SNP is not predicted to alter transcription factor binding site (TFBS) activity, exon splice enhancer/suppressor site (ESE/ESS) activity, or miRNA binding site activity. Furthermore, reported SNPs in LD with rs2207451 (r² > 0.5, HapMap CEU, Release 27) are all intergenic, and not predicted to alter TFBS, ESE/ESS, or miRNA binding site activity. This SNP has no predicted direct or indirect functional effects, and it is unclear how it might influence ANM. When discovery stage samples were included, nine SNPs were found to be significantly associated with ANM (Plinear < 0.05) in the replication stage (Table 4.3), and one (rs10140275, N= 780, β= 1.09, Plinear= 4.0 × 10⁻⁴) remained significant after Bonferroni correction for 68 tests. Clearly this analysis does not represent an independent replication, and results should be regarded with caution. rs10140275 appears uniquely associated with ANM in Iranian women; this SNP was not associated with ANM in European women. Alternatively, it may represent a spurious association in our relatively small collection of Iranian samples. rs10140275 is an intergenic SNP on chromosome 14, approximately 43 kb upstream of LOC730118 and 80 kb downstream of DICER1. Neither this SNP nor SNPs in LD with it are predicted to alter TFBS, ESE/ESS, or miRNA binding site activity, making it difficult to derive a hypothesis for how direct or indirect functional effects might impact ANM. Intriguingly, however, DICER1, which processes precursor-microRNA (pre-miRNA) into mature miRNA, was recently described as being critical to multiple aspects of ovarian function229. Replication of rs10140275 in an independent collection of Iranian women is needed to confirm this population-specific SNP-ANM association. rs10840211 was associated with ANM in Iranian women and European women. This is an intergenic SNP ~20 kb upstream of TMEM41B and ~50 kb downstream of IPO7.
Again, rs10840211 and proxy SNPs (r² > 0.5) are not predicted to alter TFBS, ESE/ESS, or miRNA binding site activity. To our knowledge, there are no reports connecting TMEM41B or IPO7 to ovarian/oviduct function and/or dysfunction. Hence, this SNP has no predicted direct or indirect functional effects, and it is unclear how it might influence ANM.

4.4.2 Literature SNPs

The second objective of this study was to test for association of variants identified in European women in Iranian women. The association of rs16991615 (20p12.3) with ANM was replicated in Iranian women. The effect size of the SNP was slightly larger in our data (βIranian= 1.13, βEuropean= 1.07) (Table 4.4). SNPs rs1172822 and rs2153157, which tag the 19q13.42 and 6p24.2 loci, respectively, narrowly miss statistical significance for association with ANM in our data. The direction and magnitude of the SNP effect in the Iranian data is in agreement with that in the European data, suggesting that had we more samples, we would have replicated these SNPs as well (Table 4.4). Based on 782 Iranian samples, our estimated power to detect the association with rs16991615 (20p12.3) was 68% (see Appendix D.1 for power calculations). Our power to detect SNPs rs1172822 and rs2384687 (locus 19q13.42) was 62% and 58%, respectively. For all other variants, our estimated power of detection fell below 50%. These results confirm that European and Iranian women share heritable factors influencing ANM. Recently, Chen et al.182 reported that SNPs tagging the 19q13.42 and 20p12.3 loci were also associated with ANM in US Hispanic women. Altogether these studies imply these SNP-ANM associations are not population specific.

4.5 Study Limitations

In this study, ANM information was ascertained by self-reporting, which is subject to recall bias. Self-reporting when menopause occurs in the recent past has been shown to be accurate and reproducible in US women230.
When menopause occurs in the distant past, women who experience early menopause tend to overestimate ANM, and women who experience late ANM tend to underestimate ANM231. Seventy percent of the women participating in our study experienced menopause prior to entering TLGS. For these women the median ANM recall time was 8 years. Based on a reproducibility study conducted in Dutch women over a 7-9 year interval, we anticipate that ~70% of our participants recalled their ANM correctly to within one year. The regression tendency of self-reported ANM towards the mean, particularly when recall times are long (90 participants in our study experienced ANM at least 20 years in the past), may have resulted in our underestimating the effect of SNPs. However, for those SNPs previously associated with ANM that replicated (or approached significance) in our Iranian samples, we observed SNP effect sizes similar to those previously reported (Table 4.4), suggesting this issue affected both studies equally, if at all. The phenotypic ends of the ANM distribution were used in the discovery stage of the pool-based GWAS. By doing this we aimed to enrich for differences in those genetic variants that influence ANM in our DNA pools, and thus perform the best-powered pool-based GWAS191,225. However, we assume these variants are relevant to the entire ANM distribution, not just early and late menopause, and replication (stage 2) is performed using women from the entire distribution of menopausal ages. This assumption could be flawed. If this is the case, SNPs would not be expected to replicate. We did not have the samples necessary to test these two possibilities. Further replication in additional Iranian samples, representing the entire range of ANM, is needed for this. However, based on association results in European women, this is not expected to be the case. Due to financial limitations, we employed a DNA pooling approach for the discovery stage of our GWAS.
Previously, we found this approach worked well, as did others, and it was not considered a limitation. However, in this study we found replicate arrays did not always show good agreement in the estimation of SNP MAF, nor did the MAFs estimated by pooling versus IG methods correlate well (although this could be due to missing samples). Altogether this made it difficult to interpret results. In the future, it is recommended that additional replicate arrays be used in the discovery stage to reduce any bias introduced by one array (i.e. further lower SD). As well, if arrays need to be omitted, as was the case in this study, several replicates will remain for robust MAF estimation. Finally, careful consideration should be given to the importance of validation. Although previous pool-based GWAS have not always included a validation step232, if an experiment does not perform as expected, the ability to validate becomes increasingly important.

4.6 Conclusion: Novel and Shared Genetic Factors for ANM

Together with previous studies, we provide evidence that at least two loci (20p12.3 and 19q13.42) are associated with ANM in different ethnic groups, including Europeans, Hispanics, and Iranians. We anticipate that several other loci previously associated with ANM in Europeans will be replicated in other ethnicities, including Iranians, as sufficient samples are genotyped to detect the effects conferred by the SNPs in question. We note that these previously reported SNPs were not top-ranked by our pool-based GWAS data. This accurately reflects our study's limited power to detect SNPs with low MAF, and SNPs conferring small effects on ANM. We report one SNP-ANM association, rs10140275, that appears to confer large effects on ANM in the Iranian population only. The minor allele of rs10140275 was estimated to delay ANM by 1.09 years in Iranian women (Table 4.3). This is a large effect, on par with that estimated for SNPs at the 20p12.3 locus.
Should this SNP survive further replication in additional Iranian samples, it would suggest that there are also important population-specific SNP-ANM associations, and these may in part account for why average ANM varies by race/ethnicity157,181.

CHAPTER 5 Discussion

5.1 Impression of the Pool-Based GWAS Approach

Two pool-based GWAS were presented in this thesis, and led to mixed impressions regarding the utility of this approach. Based on the subtype-specific GWAS of EOC presented in Chapter 3, impressions of this approach were good. Most of the SNP associations (89%) chosen for replication based on pooling were validated using individual genotyping data. This means that we were able to estimate allele frequency with enough accuracy to detect true differences between DNA pools, and that these true differences were the most significant results in our data. In other words, false-positives did not predominate among the SNPs chosen for replication. However, we do not know if the SNPs chosen for replication represent the most significant subtype-specific SNP-EOC associations we were theoretically powered to detect. In other words, we cannot exclude the possibility that Type II error caused us to miss otherwise significant SNP associations. To assess this we would need IG data for every sample in the case-control pools, for every SNP on the 660-Quad array. This would defeat the cost-savings afforded by the pooling approach. What can be said is that many of the SNPs chosen for IG were highly significant (P-value < 10⁻⁶) in those samples in the discovery stage, and went on to replicate at an unadjusted P < 0.05 level of significance in the OCAC datasets. The impression of pool-based GWAS from Chapter 4 (GWAS of ANM) was much different, and generally poor. In this study, most of the SNP associations chosen for replication did not validate or replicate based on the available IG data.
Further, proper assessment of the discovery stage data was made difficult by incomplete IG data. Several recommendations to improve/guarantee the utility of a pool-based GWAS are made based on these experiences.

5.1.1 Replicate Array Considerations

In the pool-based GWAS of EOC, 12 replicate arrays were used to assay each DNA pool; in the GWAS of ANM, 4 replicate arrays were used per pool. By using 12 replicates we were able to estimate SNP allele frequency with greater accuracy and precision. We attribute the excellent validation rate observed in the EOC GWAS (89%) to this improved SNP allele frequency estimation (AFE). This validation rate is much higher than that observed in the ANM GWAS (which was <20%), and slightly higher than that reported in previous pool-based GWAS using 4-6 replicate arrays per pool [36,50]. The decision to use 12 replicate arrays per DNA pool was the outcome of modeling carried out using PoolingPlanner, presented in Chapter 2 [49]. Our objective was to preserve as much power as possible without making the experiment financially infeasible. At 12 replicate arrays per pool, we estimate that we achieved, on average, an effective sample size of 90% or better in each DNA pool, relative to a conventional GWAS. Financially speaking, we could have used more replicate arrays; however, PoolingPlanner demonstrated that additional arrays would have yielded diminishing returns in terms of increasing the effective sample size of the pools, and would not have notably increased our power to detect SNP-EOC associations. In contrast, using 4 replicate arrays per pool in the ANM GWAS, we estimate that we had, on average, an effective sample size of 80% or better in the DNA pools. This would have raised the theoretical MDOR of this experiment (i.e. reduced its power) relative to a conventional GWAS, particularly for low-MAF SNPs. In part, this prompted us to screen only SNPs with a MAF >10% for association with ANM. Financial limitations prevented us from using more replicate arrays.
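The diminishing return from adding replicate arrays can be illustrated with a minimal sketch. It assumes a simple additive error model in which the variance of a pooled allele frequency estimate is the binomial sampling variance plus a pool construction variance plus the array variance divided by the number of replicate arrays. This model and the function name `relative_sample_size` are assumptions made for illustration; this is not PoolingPlanner's computation, and it is not expected to reproduce the RSS values in Table 2.2. The default variances and the MAF of 0.29 are the values quoted in the caption of Table 2.2.

```python
def relative_sample_size(n_individuals, maf, n_arrays,
                         var_array=3.3e-4, var_construction=9.9e-5):
    """Fraction of the individually genotyped sample size retained by one pool.

    Assumed toy model (not PoolingPlanner's formula): errors are additive,
    array error is averaged over replicates, construction error is not.
    """
    # Binomial sampling variance of an allele frequency from 2N chromosomes.
    var_sampling = maf * (1.0 - maf) / (2.0 * n_individuals)
    # Pooling-specific error added on top of the sampling variance.
    var_pooling = var_construction + var_array / n_arrays
    return var_sampling / (var_sampling + var_pooling)


if __name__ == "__main__":
    # Hypothetical case pool of 300 individuals at MAF 0.29.
    for k in (3, 6, 12, 24):
        rss = relative_sample_size(300, 0.29, k)
        print(f"{k:>2} arrays: relative sample size = {rss:.2f}")
```

Under this model each doubling of the number of arrays recovers less sample size than the previous doubling, because only the array term shrinks while the construction term stays fixed.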
Replicate arrays in the GWAS of EOC were randomly distributed over five 660-Quad BeadChips. Replicate arrays in the GWAS of ANM were non-randomly distributed over four 1M-Duo arrays (replicates were on the same BeadChip). In Chapter 2 we observed at least one Illumina BeadChip with unusually high red channel intensity, which affected SNP AFE on all arrays on that BeadChip. This was true even after normalization. It was also observed that, in general, replicate arrays on the same BeadChip are more similar than replicate arrays on different BeadChips. This highlighted the need to randomize replicate arrays among BeadChips, and by physical location on BeadChips (where chips have multiple arrays). This helps ensure that any bias introduced by individual BeadChips and/or arrays is as randomly distributed as possible, and does not confound results. Presumably, not all array aberrations that arise are detected, in which case more array replicates will dampen the effect of one aberrant array, making the experiment more robust. Biases introduced by the lack of randomization of replicate arrays in the GWAS of ANM may in part explain why the pool-based AFE correlates poorly with the IG data. For future experiments, we recommend using PoolingPlanner to help select the number of arrays that theoretically yields an effective sample size of ~90%, and ensuring that replicates are randomly distributed among BeadChips.

5.1.2 Analysis Software

In the GWAS of EOC and ANM, we used the publicly available analysis package GenePool [41]. This program was specifically written to manage pool-based array data, and to perform tests of association that take into consideration the sources of error inherent to a pooling experiment. In our experience, GenePool was quite challenging to use. Bugs were frequently encountered, and documentation was inadequate. The quality control and normalization procedures that could be implemented were also very limited.
Finally, GenePool is no longer updated or supported. Based on these experiences, any future pool-based GWAS should consider using one of the R packages now available for analyzing pooled-GWAS data, or writing custom scripts.

5.1.3 Importance of Validation

In the GWAS of ANM (Chapter 4), we were unable to properly validate SNP associations because many of the samples included in the DNA pools were not individually genotyped (due to insufficient DNA). When our discovery stage array data did not perform as expected based on the QC procedures implemented, the inability to validate SNP-ANM associations made it very difficult to interpret results. Previous GWAS that do not validate SNP associations have been published [232], but they rely exclusively on positive results in replication to interpret their experiment. If this does not occur, and validation is not possible, it is impossible to distinguish between two possible conclusions: 1) no SNP-trait associations exist (of the effect size detectable by the experiment), or 2) errors in the discovery stage resulted in false-positive SNP-trait associations being selected for replication over true SNP-trait associations. Based on this experience, to ensure that a pool-based GWAS can be properly analyzed and interpreted, we recommend that validation be considered an essential analysis step. Thus, any samples included in DNA pools should be available for subsequent individual genotyping.

5.1.4 Importance of Replication

Both the GWAS of EOC and ANM were challenged and/or delayed by the lack of a well-defined and arranged replication plan prior to initiating the discovery stage. GWAS findings without adequate replication are not credible, particularly pool-based GWAS where individual genotypes are not observed and techniques to control for population stratification (for example, principal components analysis) and test for data quality (for example, checking SNPs for HWE) cannot be applied.
If suitable replication cannot be arranged prior to initiating a pool-based GWAS, either by splitting samples between a discovery stage and replication stage, or by collaboration, there is little point in conducting the experiment.

5.2 Relevance of Subtype-Specific EOC Risk Alleles

Two key points regarding EOC pathogenesis are raised by the high-penetrance risk alleles observed in HBOCS and LS families (reviewed in Chapter 1.3.5), and subsequently reinforced by the six common low-penetrance risk alleles recently reported by the GWAS of EOC (summarized in Table 1.2). First, frequent alteration of genes involved in DNA repair, including the DSBR, SSBR, and MMR pathways, suggests that aberrant DNA repair is important to EOC pathogenesis, particularly HG-SER pathogenesis. Second, genetic risk alleles are subtype specific, and associated almost exclusively with HG-SER risk. Several of the SNP-EOC associations reported in Chapter 3 are discussed with respect to these two points.

5.2.1 DNA Repair and High-Grade EOC

Very recently, EOCs were broadly classified into two distinct tumour types, Type I and II [69]. Type II tumours comprise primarily HG-SER cancers of the ovary, peritoneum, and fallopian tube. Type II tumours are clinically aggressive and are often widely metastatic when a woman first presents with cancer [69]. High-grade END tumours are also included in this group; recent expression analyses show that these are indistinguishable from HG-SER tumours [77,233,234]. Germline and somatic mutations in genes functioning in DSBR by HRR, including BRCA1/2, PALB2, RAD51, RAD50, BARD1, CHEK2, and BRIP1, are now documented hallmarks of Type II tumours [83,235]. In addition, SNP variants in BABAM1 and TIPARP/PARP7, respectively functioning in DSBR and SSBR, have been associated with HG-SER risk by GWAS of EOC [40,144]. While evidence for the involvement of DNA repair pathways is not entirely absent from the other EOC subtypes, it is not a characterizing property.
That said, results from Chapter 3 suggest that germline variation in normal DNA repair may play an important role in MUC tumourigenesis as well. This statement stems from the discovery of a SNP (rs17106154) in a predicted intron of RAD51B, which was associated with increased MUC EOC risk. This SNP was not significant in the other histological subtypes. RAD51B is one of five RAD51 paralogs, and haplo-insufficiency of RAD51B has been shown to cause mild hypersensitivity to DNA damaging agents, impaired homologous recombination, and increased chromosome aberrations [95]. Furthermore, this association was more apparent in women presenting with invasive MUC tumours than in those with LMP/borderline tumours. This result is consistent with that of the other aggressive tumours (Type II) in which DNA repair is known to be important to tumourigenesis, and may suggest that invasive MUC tumours belong in the Type II category. However, there is an alternative possible explanation for this finding. Many of the tumours classified as invasive MUC EOC in our discovery stage analyses, and in the OCAC replication dataset, may actually be misclassified high-grade SER and/or END tumours, or metastases to the ovary from the gastrointestinal tract [81], where the involvement of DNA repair pathways has already been well established [236]. The level of misclassification is unknown, because consistent pathological assessment of these tumours is an outstanding challenge in the study of EOC. None of the other subtype-specific SNP-EOC associations reported in Chapter 3 implicated genes involved in DNA repair.

5.2.2 PI3K/RAS Signaling and Low Grade EOC

Type I tumours include the low grade SER, CC, END, and MUC subtypes. These tumours are characteristically slow growing, confined to the ovary, and less sensitive to standard chemotherapy [69].
Somatic mutations in genes functioning in PI3K/RAS signaling, including PTEN, PIK3CA, AKT1, AKT2, NF1, KRAS, and BRAF, are becoming hallmarks of Type I tumours [74,87,88,95,101,237]. The one locus (tagged by 3 SNPs) emerging from our GWAS of END and CC EOC implicates another gene featuring prominently in PI3K/RAS signaling, GRB10. This gene was recently described as functioning in the feedback inhibition of both the PI3K/AKT and RAS/MAPK pathways [210-212]. To our knowledge, this is the first time GRB10 has been implicated in germline genetic risk of EOC; however, somatic mutations in this gene are frequent in different cancer types, including EOC (Catalogue Of Somatic Mutations In Cancer, http://www.sanger.ac.uk/genetics/CGP/cosmic/, Accessed Aug 30, 2012). Perplexingly, this locus was also associated with HG-SER risk in our GWAS data, but not with the other subtypes (Type I tumours). PI3K/RAS pathway mutations are documented in Type II tumours [83], but they are not a characterizing property. Power limitations in the rarer EOC subtypes may explain this result.

5.2.3 Subtype-Specificity of Common, Low Penetrance EOC Risk Alleles

Prior to our subtype-specific GWAS of EOC, five of the six low penetrance EOC risk alleles reported were associated with HG-SER risk only. One variant, 2q31, was associated with high-grade serous and mucinous EOC risk [40]. Of the 9 subtype-specific EOC-SNP associations reported in Chapter 3, four showed some evidence of being associated with more than one subtype. One of these associations was initially discovered in the MUC subtype, but was subsequently found to be associated with END EOC risk as well. All three variants (tagging the same locus) initially associated with ENCC EOC risk were also associated with HG-SER risk in our GWAS. Altogether, these results suggest that there is overlap between histological subtypes with respect to common, low penetrance risk alleles, despite their being described as distinct diseases.
There is also evidence for subtype-specific risk alleles. What overlap does exist does not appear to follow any clear pattern; in particular, it does not clearly follow the Type I and II categorizations recently put forward [69]. Two factors may be obscuring what may actually be clear patterns. The first is misclassification of tumours by histology and behaviour. Misclassification is likely present in many of the studies contributing to the GWAS, including our pool-based GWAS (Chapter 3), and the previously published EOC GWAS. Misclassification of HG-SER tumours as END and/or CC tumours is a current problem area in EOC pathology [79,81,238,239], as is the misclassification of metastases to the ovary from the gastrointestinal tract as MUC EOCs [240]. The second is the lack of power to reliably detect the associations reported given the limited number of samples available and the relatively small effect sizes observed. As more EOC cases with reproducible pathology are collected, alleles conferring smaller risks, and/or alleles with smaller MAFs, will be subject to reliable detection in the rarer EOC subtypes.

5.3 Common Genes and Pathways in ANM Variation and EOC Risk

An older age at menopause is a well-established risk factor for women’s cancers, including ovarian cancer, breast cancer, and endometrial cancer [241,242]. Previous GWAS of ANM have implicated two key pathways in the variation of ANM: 1) DNA repair, by SNPs in LD with EXO1, HELQ, UIMC1, FAM175A, FANCI, TLK1, POLG, and PRIM1, and 2) immune function, by SNPs in LD with IL11, NLRP11, and PRRC2A [5,169,170]. Intriguingly, low penetrance EOC risk alleles seem to implicate the same pathways. DNA repair is implicated by SNPs in LD with BABAM1, TIPARP/PARP7 [40] and RAD51B. Immune function is implicated by SNPs in LD with BPIL2 (Chapter 3). Although the pathways may overlap, the specific genes involved do not.
The overlap of these pathways in this trait (ANM) and disease (EOC) could be reconciled by the observation that the accumulation of somatic alterations results in aging [243], including ovarian aging, and an increased risk of cancer. Variations in both pathways have the ability to alter the rate at which somatic mutations accumulate.

In Chapter 4, we report one SNP-ANM association that is unique to Iranian women, and therefore either population-specific or a spurious association in our relatively small (by GWAS standards) dataset. rs10140275 lies in an intergenic region ~43 kb upstream of LOC730118, and ~80 kb downstream of DICER1. DICER1 functions in miRNA processing, and was recently described as being critical to multiple aspects of ovarian function [229]. No direct connection between ANM and DICER1 has been made to date; however, somatic DICER1 mutations have been described in ovarian cancers [244-248]. Specifically, women carrying germline DICER1 mutations are known to experience a pleiotropic tumour predisposition syndrome. A substantial number of the observed tumours are ovarian sex cord-stromal tumours [244-248]; however, EOC is also occasionally observed [248]. If the association of DICER1 with ANM variation is true (and this requires further replication), it suggests an intriguing trait-disease connection. Potentially, variation at SNP rs10140275, or a causal variant in LD with this SNP, contributes to subtle variation in DICER1 expression/function, and this in turn explains some of the variation in ANM in Iranian women. It is not clear why DICER1 would only be associated with ANM in this population. In the ovaries of mice, DICER1 expression levels have been linked to the size of the primary follicle pool, as well as the rate of follicle recruitment and degeneration.
Given that the size of the follicular reserve is a major determinant in the timing of menopause, a gene that may regulate the rate of follicular degeneration is an appealing candidate for explaining ANM variation. With respect to DICER1’s involvement in ovarian cancer, recurrent somatic missense mutations in the RNase IIIb domain have been observed, leading to the hypothesis that specific defects in miRNA processing yield an oncogenic miRNA profile [248]. This could be broadly summarized as follows: alterations to DICER1 expression yield variation in ANM, while alterations to DICER1 function lead to ovarian cancer.

5.4 Future Studies of EOC and ANM Associated SNPs

Our understanding of the way in which a risk variant initiates disease pathogenesis progresses from statistical association between genetic variation and trait/disease variation to functionality and causality. With this in mind, the trait-associated SNPs presented in Chapter 3 and Chapter 4 represent the first step in a long process. In the future, these SNPs will need to be further replicated in study populations similar to those used for discovery to increase the significance of the findings, and in diverse populations to narrow down the regions of association and establish if they are relevant in other populations. Fine mapping, targeted sequencing, and eQTL studies can then be used to identify causal SNPs with putative functional effects (methods described in 1.2.3.1-1.2.3.3). Ultimately, causality is established using tissue and animal models where genes can be selectively manipulated and made to demonstrate characteristics of the trait/disease being studied (approaches described briefly in section 1.2.3.4).

Table 1.1. Features of Five Commonly Described EOC Subtypes.
High-grade serous (HG-SER) Percent of EOCs¹ Average age of diagnosis²  Low-malignancy potential serous (LMP-SER)  30-70 n/a  59.6 ± 11  Endometrioid (END)  Mucinous (MUC)  Clear cell (CC)  10-20  5-15  10-20  54.2 ± 13  54.8 ± 14  55.6 ± 13  cystadenoma-borderline tumour-carcinoma sequence; metastasis from bowel Mutations in KRAS, BRAF  endometriosis  No  Yes  Precursor lesion  de novo; fallopian tube; inclusion cyst  cystadenoma-borderline tumour-carcinoma sequence  inclusion cysts; endometriosis  Molecular features  Somatic/germline mutations to TP53, BRCA1/2; other genes in HRR of DSB.  Mutations in KRAS, BRAF, PTEN  Mutations in PTEN, BRAF PI3K/AKT, CTNNB, TP53, BRCA1, ARID1A Yes  Association with endometriosis³  No  Unclear⁴  Mutations in PTEN, ARID1A, PIK3CA

Abbreviations: EOC, epithelial ovarian cancer; n/a, not available; HRR, homologous recombination repair; DSB, DNA double-strand break.
¹ Varies depending on the case collection.
² Estimates are based on one population-based case set [79] for consistency.
³ As reported by the Ovarian Cancer Association Consortium (OCAC) [100].
⁴ Invasive low-grade serous cases are associated with endometriosis. Borderline serous cases are not.

Table 1.2. Common Low to Moderate Penetrance EOC Risk Alleles.
Locus  SNPs  Top SNP  All EOC cases¹: OR, P-value  BRCA1 carriers²: HR, P-trend  BRCA2 carriers²: HR, P-trend  Gene  Gene Function, Suggested Relevance
9p22.2  12  rs3814113  0.82, 2.5×10⁻¹⁷  0.79, 4.4×10⁻⁶  0.80, 0.012  BNC2  TF; ovarian development and biology
8q24.21  3  rs10088218  0.76, 8.0×10⁻¹⁵  0.91, 0.13  0.72, 5.7×10⁻³  MYC  TF; cell proliferation, cell growth, apoptosis, chromatin structure
2q31  1  rs2072590  1.16, 4.5×10⁻¹⁴  1.08, 0.15  1.31, 8.5×10⁻⁴  HOXD3  TF; morphogenesis
3q25  1  rs2665390  1.19, 3.2×10⁻⁷  1.25, 6.1×10⁻³  1.48, 1.8×10⁻⁴  TIPARP  ADP-ribose polymerase; SSBR, programmed cell death
17q25  1  rs9303542  1.11, 1.4×10⁻⁶  1.06, 0.22  1.16, 0.075  SKAP1  T-cell adaptor protein; cell adhesion, integrin signaling
19p13³  2  rs2363969  1.1, 1.2×10⁻⁷  1.16, 3.8×10⁻⁴  1.30, 1.8×10⁻³  BABAM1  DSBR

Abbreviations: EOC, epithelial ovarian cancer; SNP, single nucleotide polymorphism; OR, odds ratio; HR, hazards ratio; TF, transcription factor; SSBR, single-strand break repair; DSBR, double-strand break repair.
¹ Per allele odds ratio for the most associated SNP reported as analyzed in all ovarian cancer cases (not stratified by subtype) in original publication.
² Per allele hazards ratio in all ovarian cancer cases (not stratified by subtype) as previously reported [141,145].
³ BRCA1/2 carrier HR estimates based on rs67397200; r² = 0.6 with rs2363969 [145].

Table 2.1. Array Variance for Illumina arrays

Array type  1M-Single  1M-Duo  660-Quad
Normalized data Var(e_array)  3.8×10⁻⁴  3.2×10⁻⁴  3.3×10⁻⁴
(Range)  (2.2×10⁻⁴ – 6.6×10⁻⁴)  (1.6×10⁻⁴ – 6.3×10⁻⁴)  (2.5×10⁻⁴ – 4.9×10⁻⁴)
Raw data Var(e_array)  2.9×10⁻³  9.0×10⁻⁴  2.7×10⁻³
(Range)  (3.0×10⁻⁴ – 9.2×10⁻³)  (1.7×10⁻⁴ – 4.3×10⁻³)  (2.0×10⁻³ – 3.0×10⁻³)
Number of pools  12  8  7
Number of comparisons, var(e_array)¹  12  45²  360
Number of arrays (arrays/pool)  24 (2/pool)  32 (4/pool)  72 (6 or 12/pool)

¹ Each paired array comparison is treated as an independent estimate of array variance, the average of which is reported in this table.
² One array, in all 3 comparisons in which it was involved, produced extreme outlier var(e_array) values and was removed from all analysis; hence, there are 45 instead of 48 var(e_array) for the 1M-Duo arrays.

Table 2.2. Impact of Replicate Arrays on Effective Sample Size (N*) and Minimum Detectable Odds Ratio (MDOR).

Arrays per pool  Case pool (RSS, N*)  Control pool (RSS, N*)  MDOR at 80% (p=0.29)  MDOR at 80% (p=0.10)
24  0.95, 284  0.84, 837  1.33  1.51
12  0.90, 269  0.72, 720  1.35  1.54
6  0.81, 244  0.56, 562  1.38  1.58
3  0.69, 206  0.39, 391  1.44  1.70
Individual genotyping  1, 300  1, 1000  1.32  1.49

This table compares the minimum detectable odds ratios (MDOR) at 80% power for a theoretical pooling experiment with 300 cases and 1000 controls, given a DNA-pooling strategy where 24, 12, 6, or 3 Illumina 660-Quad replicate arrays are used to allelotype each DNA pool (case and control). The equivalent individual genotyping experiment is given for reference. Relative sample size (RSS) and effective sample size (N*) are generated by PoolingPlanner assuming var(e_array) = 3.3×10⁻⁴, var(e_construction) = 9.9×10⁻⁵, and an average minor allele frequency of 0.29. MDORs at 80% power were calculated using Quanto [189] assuming an unmatched case-control design testing for gene-only effects using a log-additive model, where the incidence of the case phenotype is 0.02% and the risk allele frequency, p, is set to 0.29 or 0.10.

Table 3.1. Discovery Stage EOC Pools by ICD-O-3 Codes.

EOC case pool  ICD-O-3 code  ICD-O-3 code description  Count (% in pool)
Mucinous (MUC)  84721  Mucinous cystic tumour of borderline malignancy  59 (70)
Mucinous (MUC)  84803  Mucinous adenocarcinoma  16 (19)
Mucinous (MUC)  84703  Mucinous cystadenocarcinoma, NOS  8 (10)
Mucinous (MUC)  90151  Mucinous adenofibroma of borderline malignancy  1 (1)
Endometrioid and Clear Cell (ENCC)¹  83803  Endometrioid adenocarcinoma, NOS  72 (63)
Endometrioid and Clear Cell (ENCC)¹  83103  Clear cell adenocarcinoma, NOS  42 (37)
Low-malignancy serous (LM-SER)  84421  Serous cystadenoma, borderline malignancy  49 (66)
Low-malignancy serous (LM-SER)  84621  Serous papillary cystic tumour of borderline malignancy  25 (34)
High-grade serous (HG-SER)  84603  Papillary serous cystadenocarcinoma  167 (61)
High-grade serous (HG-SER)  84413  Serous cystadenocarcinoma, NOS  95 (35)
High-grade serous (HG-SER)  84613  Serous surface papillary carcinoma  10 (4)

Abbreviations: NOS, not otherwise specified.
¹ Endometrioid and clear cell EOC cases were combined in one case pool to detect shared susceptibility factors.

Table 3.2. Summary of SNPs Selected for Replication.

Histological subtype  Filter method  No. SNPs chosen  No. SNPs genotyped
Mucinous  Cluster  48  43
Mucinous  Singleton  13  13
Endometrioid and clear cell  Cluster  46  43
Endometrioid and clear cell  Singleton  13  12
Low-malignancy potential / borderline serous  Cluster  41  40
Low-malignancy potential / borderline serous  Singleton  13  13
High-grade serous  Cluster  24  24
High-grade serous  Singleton  0  0

Table 3.3. Association and Rank Information for the Top 30 MUC EOC Loci.
Cluster Rank  # of SNPs replicated / # SNPs in cluster  SNP¹  Location²  Rank by SINGLEMARKER³  Validation P-value⁴
1  1/2  rs2810589  Chr1: 202164743  47  2.83E-03
2  1/2  rs11143593  Chr9: 75241007  32  1.38E-03
3  1/2  rs11982376  Chr7: 130186932  33  1.63E-02
4  4/9  rs9869278  Chr3: 98146934  52  2.23E-03
4  4/9  rs9824190  Chr3: 98163910  144  7.96E-03
4  4/9  rs2213260  Chr3: 98229618  58  9.79E-04
4  4/9  rs9861668  Chr3: 98301260  169  n.a.
5  1/2  rs6825690  Chr4: 142516373  7  4.99E-02
6  2/2  rs1423463  Chr5: 66292182  48  1.16E-02
6  2/2  rs1946260  Chr5: 66310808  268  1.16E-02
7  1/3  rs2222870  Chr10: 23191070  110  2.58E-04
8  2/13  rs1368301  Chr5: 155911080  218  1.79E-04
8  2/13  rs157350  Chr5: 156072147  283  8.88E-05
9  1/2  rs10011007  Chr4: 131284140  171  5.61E-03
10  1/2  rs11108890  Chr12: 96137530  60  1.71E-03
11  2/12  rs395612  Chr9: 92889677  300  6.99E-04
11  2/12  rs1757648  Chr9: 93143111  222  3.02E-02
12  2/10  rs3092997  Chr6: 167474734  173  5.40E-02
12  2/10  rs9459893  Chr6: 167486676  44  1.73E-01
13  2/8  rs970651  Chr13: 47351705  203  9.40E-05
13  2/8  rs7981902  Chr13: 47368792  278  1.24E-03
14  2/12  rs1109019  Chr10: 131942924  116  6.29E-03
14  2/12  rs7069674  Chr10: 132141062  1  n.a.
15  3/7  rs1557996  Chr7: 7623829  485  1.58E-04
15  3/7  rs2301936  Chr7: 7645632  240  4.20E-04
15  3/7  rs2241075  Chr7: 7744835  319  1.57E-03
16  2/6  rs3739257  Chr8: 134577692  774  1.66E-01
16  2/6  rs3779925  Chr8: 134608468  1186  5.32E-02
17  1/6  rs9431182  Chr1: 217802686  506  5.07E-06
18  2/5  rs10841876  Chr12: 8590247  86  n.a.
18  2/5  rs10770855  Chr12: 8595788  72  6.54E-05
19  3/7  rs7527465  Chr1: 24556703  749  n.a.
19  3/7  rs10489441  Chr1: 24614699  597  1.25E-03
19  3/7  rs431454  Chr1: 24700416  66  9.33E-02
20  1/2  rs9787394  Chr1: 170417123  103  6.99E-03
21  1/2  rs6827689  Chr4: 139022571  151  2.93E-02
22  1/6  rs2793299  Chr10: 44598598  321  n.a.
23  2/9  rs10219339  Chr11: 75162498  1547  9.60E-03
23  2/9  rs3060  Chr11: 75189220  1995  9.60E-03
24  2/5  rs4935212  Chr10: 52442483  80  1.13E-02
24  2/5  rs7097013  Chr10: 52546057  198  5.92E-03
25  1/2  rs3735966  Chr8: 87748579  77  1.18E-01
26  1/5  rs1798066  Chr12: 69794395  30  5.91E-02
27  1/4  rs10494217  Chr1: 119270711  143  3.39E-04
28  2/4  rs10737498  Chr1: 162386545  159  1.60E-02
28  2/4  rs16825999  Chr1: 162422833  149  2.20E-02
29  1/2  rs17106154  Chr14: 68230927  162  8.35E-02
30  1/7  rs1672692  Chr11: 113450819  118  1.33E-02

¹ Dark green SNPs selected by other OCAC members. Light green SNPs in linkage disequilibrium (r² > 0.5) with SNPs selected by other OCAC members.
² March 2006 human reference sequence: NCBI Build 36/hg18.
³ Ranked in ascending order by the SINGLEMARKER test statistic P-value.
⁴ Using individual genotyping data for samples in the case pool and control pools, an association analysis using a one degree of freedom (df) allelic chi-squared test was performed. P-values from this test are reported. SNPs indicated as n.a. did not pass QC.

Table 3.4. Association and Rank Information for the Top 30 ENCC EOC Loci.
Cluster Rank  # of SNPs replicated / # SNPs in cluster  SNP¹  Location²  Rank by SINGLEMARKER³  Validation P-value⁴
1  5/14  rs6493239  Chr15: 27014674  281  9.04E-04
1  5/14  rs2336912  Chr15: 27029521  61  3.27E-04
1  5/14  rs7181300  Chr15: 27041180  52  1.38E-03
1  5/14  rs1873285  Chr15: 27043388  71  1.81E-04
1  5/14  rs7165740  Chr15: 27057792  24  9.42E-04
2  1/2  rs13261404  Chr8: 1846655  149  7.07E-04
3  1/2  rs1249003  Chr10: 29070697  87  n.a.
4  1/2  rs9438040  Chr1: 145824552  176  2.43E-02
5  1/5  rs2257062  Chr21: 31630172  45  3.03E-02
6  2/6  rs10093972  Chr8: 9488779  70  9.20E-03
6  2/6  rs4841215  Chr8: 9675718  909  1.24E-02
7  1/3  rs4611492  Chr17: 34869863  80  1.09E-02
8  2/8  rs2835006  Chr21: 35866273  30  7.46E-04
8  2/8  rs2835076  Chr21: 35965755  435  4.37E-04
9  1/2  rs11236180  Chr11: 74020576  109  8.48E-05
10  2/11  rs574869  Chr16: 19492807  259  2.48E-03
10  2/11  rs7191155  Chr16: 19707714  241  n.a.
11  1/3  rs17030742  Chr3: 33510415  53  1.14E-02
12  2/4  rs2281929  Chr20: 61892524  38  1.20E-03
12  2/4  rs6010669  Chr20: 61916132  322  3.91E-04
13  2/11  rs9845965  Chr3: 120531761  13  3.59E-03
13  2/11  rs7650774  Chr3: 120687740  363  1.22E-02
14  1/2  rs2796375  Chr1: 202137678  174  n.a.
15  1/4  rs2063645  Chr3: 221374  222  3.83E-02
16  1/2  rs1050975  Chr6: 353012  256  1.44E-01
17  3/8  rs2496450  Chr13: 38204464  352  1.00E-03
17  3/8  rs7318271  Chr13: 38226541  1  3.49E-04
17  3/8  rs1551026  Chr13: 38241477  69  6.15E-04
18  3/13  rs2190503  Chr7: 50710111  759  3.05E-03
18  3/13  rs6593140  Chr7: 50765627  804  1.92E-03
18  3/13  rs2329554  Chr7: 50842524  1221  7.82E-04
19  1/2  rs7724915  Chr5: 76177435  76  4.57E-04
20  1/11  rs9827620  Chr3: 30282986  867  4.48E-05
21  1/7  rs2072338  Chr16: 3996928  247  7.02E-03
22  3/12  rs4632257  Chr19: 33640281  1369  9.85E-03
22  3/12  rs4239555  Chr19: 33667767  37  4.46E-03
22  3/12  rs2024140  Chr19: 33696162  587  4.44E-03
23  1/4  rs10789171  Chr1: 65392468  325  1.16E-03
24  1/6  rs9725311  Chr1: 18400613  221  6.48E-05
25  1/5  rs917498  Chr17: 29887386  385  9.37E-04
26  1/4  rs6775462  Chr3: 76973111  213  3.54E-02
27  2/9  rs4712970  Chr6: 25878686  1689  1.07E-03
27  2/9  rs9358890  Chr6: 25887371  125  n.a.
28  1/5  rs4902165  Chr14: 62185200  326  9.70E-03
29  2/9  rs2045045  Chr11: 5832465  689  3.92E-04
29  2/9  rs4757986  Chr11: 5862649  707  2.95E-04
30  2/9  rs2455503  Chr11: 115038262  124  1.39E-03

¹ Light green SNPs in linkage disequilibrium (r² > 0.5) with SNPs selected by other OCAC members.
² March 2006 human reference sequence: NCBI Build 36/hg18.
³ Ranked in ascending order by the SINGLEMARKER test statistic P-value.
⁴ Using individual genotyping data for samples in the case pool and control pools, an association analysis using a one degree of freedom (df) allelic chi-squared test was performed. P-values from this test are reported. SNPs indicated as n.a. did not pass QC.

Table 3.5. Association and Rank Information for the Top 30 LMP SER EOC Loci.
Cluster Rank  # of SNPs replicated / # SNPs in cluster  SNP¹  Location²  Rank by SINGLEMARKER³  Validation P-value⁴
1  1/2  rs1466004  Chr13: 61984835  5  1.66E-02
2  1/2  rs487811  Chr11: 63559193  7  7.94E-03
3  1/3  rs2038574  Chr10: 17281711  47  8.83E-03
4  2/5  rs12742611  Chr1: 18347847  196  4.73E-03
4  2/5  rs9661646  Chr1: 18352745  97  4.10E-03
5  1/2  rs2169310  Chr16: 62520057  60  7.27E-05
6  2/7  rs10505083  Chr8: 106850389  52  9.65E-03
6  2/7  rs10505087  Chr8: 106877254  53  5.43E-03
7  3/12  rs8020475  Chr14: 64723761  230  3.26E-03
7  3/12  rs1679870  Chr14: 64741700  739  4.17E-03
7  3/12  rs11158590  Chr14: 64890893  101  2.62E-02
8  3/12  rs3218896  Chr2: 101998084  356  4.20E-02
8  3/12  rs3218920  Chr2: 102000437  35  3.85E-02
8  3/12  rs2072477  Chr2: 102003191  272  4.21E-02
9  2/5  rs4778052  Chr15: 91222248  397  1.19E-03
9  2/5  rs7166161  Chr15: 91283004  113  2.68E-02
10  1/7  rs10932085  Chr2: 205314900  1497  2.55E-02
11  1/2  rs12938108  Chr17: 53110312  28  1.14E-02
12  1/2  rs10464069  Chr5: 177723558  150  n.a.
13  1/4  rs12138021  Chr1: 174946760  104  5.07E-02
14  1/3  rs1011407  Chr2: 60519272  54  3.06E-03
15  1/2  rs12410385  Chr1: 144251749  139  1.11E-02
16  1/3  rs1800717  Chr1: 94234342  214  n.a.
17  2/10  rs1538055  Chr13: 85147651  183  6.00E-04
17  2/10  rs2039656  Chr13: 85167559  102  5.55E-04
18  2/7  rs743534  Chr10: 135199216  191  6.48E-04
18  2/7  rs12257368  Chr10: 135219995  241  4.13E-04
19  1/2  rs195413  Chr6: 37410053  244  1.98E-02
20  1/2  rs3805346  Chr4: 108821485  146  2.06E-01
21  2/5  rs6459919  Chr7: 158563951  65  2.22E-02
21  2/5  rs6954099  Chr7: 158612946  321  2.44E-02
22  1/2  rs2417048  Chr9: 128964473  249  1.11E-01
23  1/4  rs2794852  Chr1: 236101606  434  6.76E-02
24  1/4  rs9609538  Chr22: 31139832  285  5.73E-04
25  1/2  rs6078938  Chr20: 13082768  238  3.35E-02
26  1/5  rs10500400  Chr16: 17031650  96  4.91E-03
27  1/2  rs17050232  Chr3: 9338980  124  2.44E-02
28  2/7  rs2269793  Chr16: 19180409  548  5.90E-04
28  2/7  rs179236  Chr16: 19232332  330  1.95E-02
29  2/8  rs967808  Chr4: 14628553  245  5.85E-04
29  2/8  rs968365  Chr4: 14652837  186  5.85E-04
30  1/2  rs193139  Chr1: 109285498  50  2.74E-03

¹ Dark green SNPs selected by other OCAC members. Light green SNPs in linkage disequilibrium (r² > 0.5) with SNPs selected by other OCAC members.
² March 2006 human reference sequence: NCBI Build 36/hg18.
³ Ranked in ascending order by the SINGLEMARKER test statistic P-value.
⁴ Using individual genotyping data for samples in the case pool and control pools, an association analysis using a one degree of freedom (df) allelic chi-squared test was performed. P-values from this test are reported. SNPs indicated as n.a. did not pass QC.

Table 3.6. Association and Rank Information for the Top 30 HG SER EOC Loci.
Cluster  # of SNPs replicated /  SNP¹         Location²          Rank by         Validation
Rank     # SNPs in cluster                                       SINGLEMARKER³   P-value⁴
1        3/7                     rs1404403    Chr1: 199466839    2               3.84E-05
1        3/7                     rs4915499    Chr1: 199481199    22              1.80E-05
1        3/7                     rs10159291   Chr1: 199487214    70              3.43E-05
2        1/3                     rs11632341   Chr15: 21540364    93              9.00E-03
3        2/7                     rs16866828   Chr2: 8919360      8               1.20E-03
3        2/7                     rs13415968   Chr2: 8981642      13              1.70E-03
4        1/2                     rs2393599    Chr10: 61526867    55              2.49E-03
5        1/2                     rs6492551    Chr13: 90986949    174             1.82E-03
6        3/12                    rs13152390   Chr4: 72509881     33              7.67E-04
6        3/12                    rs13124079   Chr4: 72522866     41              4.35E-04
6        3/12                    rs6857491    Chr4: 72580387     284             9.24E-03
7        4/8                     rs12431401   Chr14: 51852891    622             2.33E-04
7        4/8                     rs708498     Chr14: 51862250    1780            1.92E-02
7        4/8                     rs17197      Chr14: 51864131    443             1.62E-04
7        4/8                     rs17831718   Chr14: 51869786    7               2.63E-04
8        2/2                     rs12595883   Chr16: 74141000    10              2.11E-02
8        2/2                     rs12598094   Chr16: 74166087    444             2.84E-02
9        2/6                     rs6825690    Chr4: 142516373    118             1.79E-02
9        2/6                     rs4956397    Chr4: 142663337    558             2.27E-01
10       1/3                     rs13194781   Chr6: 27923618     120             3.41E-03
11       1/7                     rs8013239    Chr14: 79873516    12              3.15E-05
12       1/4                     rs4585039    Chr2: 4331016      128             9.70E-04
13       1/3                     rs9650719    Chr9: 121218411    157             3.00E-04
14       1/4                     rs10503197   Chr8: 3012902      83              3.61E-04
15       1/5                     rs1192672    Chr10: 37215888    580             5.45E-02

¹ Light green SNPs are in linkage disequilibrium (r² > 0.5) with SNPs selected by other OCAC members.
² March 2006 human reference sequence: NCBI Build 36/hg18.
³ Ranked in ascending order by the SINGLEMARKER test statistic P-value.
⁴ Using individual genotyping data for samples in the case pool and control pools, an association analysis using a one degree of freedom (df) allelic chi-squared test was performed. P-values from this test are reported. SNPs indicated as n.a. did not pass QC.

Table 3.7. Association and Rank Information for the Singleton SNPs.
#    Subtype   SNP¹         Location²          SingleMarker   Validation
                                               P-value³       P-value⁴
1    LMP-SER   rs12078447   Chr1: 51272113     0.003          2.15E-03
2    LMP-SER   rs8096374    Chr18: 4751808     0.004          1.36E-02
3    LMP-SER   rs1536678    Chr13: 112143948   0.005          2.15E-03
4    LMP-SER   rs9383964    Chr6: 152486508    0.007          1.18E-01
5    LMP-SER   rs4359077    Chr1: 200395735    0.009          2.99E-03
6    LMP-SER   rs4395807    Chr7: 142457413    0.009          1.36E-02
7    LMP-SER   rs11207161   Chr1: 58293791     0.009          2.41E-03
8    LMP-SER   rs16925377   Chr11: 3633651     0.025          2.94E-03
9    LMP-SER   rs11868735   Chr17: 26820323    0.026          6.83E-02
10   LMP-SER   rs17114286   Chr14: 28326903    0.029          1.15E-03
11   LMP-SER   rs1487492    Chr1: 186428653    0.029          1.15E-02
12   LMP-SER   rs2845568    Chr11: 63592187    0.029          8.04E-02
13   LMP-SER   rs17239136   Chr4: 62328826     0.031          1.03E-03
1    ENCC      rs2125998    Chr15: 66418794    0.015          3.18E-02
2    ENCC      rs10922552   Chr1: 89273380     0.015          3.65E-03
3    ENCC      rs11187736   Chr10: 95702070    0.015          2.88E-03
4    ENCC      rs2404347    Chr12: 38719542    0.018          1.85E-03
5    ENCC      rs10850376   Chr12: 113675209   0.019          6.06E-03
6    ENCC      rs4781213    Chr16: 12368918    0.020          2.99E-03
7    ENCC      rs885386     Chr9: 16242410     0.020          2.29E-04
8    ENCC      rs4317185    Chr4: 151165833    0.022          7.23E-01
9    ENCC      rs10516503   Chr4: 104772736    0.027          5.66E-01
10   ENCC      rs12823020   Chr12: 113344884   0.034          2.07E-01
11   ENCC      rs1046632    Chr17: 34663151    0.038          3.84E-03
12   ENCC      rs5030755    Chr17: 1729702     0.047          n.a.
13   ENCC      rs1760981    Chr14: 64723648    0.047          3.19E-01
1    MUC       rs293777     Chr3: 9727333      0.001          1.79E-03
2    MUC       rs175700     Chr14: 75036729    0.005          3.18E-03
3    MUC       rs2304572    Chr2: 25206783     0.007          5.67E-03
4    MUC       rs6806321    Chr3: 118638930    0.008          5.72E-03
5    MUC       rs2073204    Chr14: 30428128    0.015          7.44E-02
6    MUC       rs12256012   Chr10: 31429915    0.017          2.12E-02
7    MUC       rs933518     Chr16: 53079622    0.022          2.17E-05
8    MUC       rs17185776   Chr1: 118904362    0.032          3.20E-03
9    MUC       rs2060214    Chr5: 137986213    0.034          2.09E-03
10   MUC       rs17028387   Chr4: 153156143    0.037          1.73E-02
11   MUC       rs1507864    Chr4: 48300385     0.042          1.66E-01
12   MUC       rs39088      Chr7: 29250928     0.042          4.81E-03
13   MUC       rs1557814    Chr17: 54924029    0.044          1.23E-01

¹ Dark green SNPs were selected by other OCAC members. Light green SNPs are in linkage disequilibrium (r² > 0.5) with SNPs selected by other OCAC members.
² March 2006 human reference sequence: NCBI Build 36/hg18.
³ Ranked in ascending order by the SINGLEMARKER test statistic P-value.
⁴ Using individual genotyping data for samples in the case pool and control pools, an association analysis using a one degree of freedom (df) allelic chi-squared test was performed. P-values from this test are reported. SNPs indicated as n.a. did not pass QC.

Table 3.8. EOC Cases and Controls in OCAC by Subtype and Tumour Behaviour.

       OCAC      Serous               Mucinous             Endometrioid         Clear cell
No.    abbrev.   All    Inv.   LMP    All    Inv.   LMP    All    Inv.   LMP    All    Inv.   LMP    Controls
1      AUS       539    539    0      38     38     0      112    112    0      52     52     0      978
2      BAV       60     56     4      9      8      1      13     13     0      6      6      0      143
3      BEL       168    168    0      23     23     0      21     21     0      23     23     0      1305
4      DAN       452    393    59     119    52     67     70     68     2      41     41     0      828
5      DOV       754    561    193    153    28     124    164    151    12     67     67     0      1487
6      GER       216    194    22     30     22     8      36     36     0      6      6      0      413
7      HAW       47     38     9      15     3      12     12     12     0      5      5      0      157
8      HJO       132    127    4      13     8      5      26     26     0      3      3      0      273
9      HMO       50     50     0      7      7      0      12     12     0      1      1      0      138
10     HOC       118    106    7      45     39     0      26     23     1      13     11     0      447
11     HPE       422    380    40     57     34     22     101    100    1      54     52     2      1466
12     LA2       642    526    116    115    51     64     96     96     0      43     43     0      1047
13     MAY       388    345    43     41     18     23     97     95     2      32     32     0      743
14     MCC       34     34     0      6      6      0      7      7      0      6      6      0      68
15     MDA       190    190    0      27     27     0      28     28     0      4      4      0      0
16     MSK       317    317    0      0      0      0      17     17     0      17     17     0      593
17     NCO       478    360    118    81     33     46     114    111    3      79     78     1      792
18     NEC       510    359    137    112    40     72     130    128    2      102    95     7      1009
19     NHS       72     62     10     17     7      10     14     14     0      6      6      0      425
20     NJO       99     99     0      7      6      0      27     27     0      20     20     0      181
21     NOR       132    129    3      18     15     3      26     26     0      11     11     0      371
22     NTH       117    116    1      34     33     1      64     64     0      20     20     0      323
23     OVA       377    299    78     95     26     69     102    100    2      57     57     0      748
24     POC       199    199    0      33     33     0      39     39     0      9      9      0      417
25     POL       106    106    0      17     17     0      37     37     0      10     10     0      223
26     SEA       603    569    33     193    144    49     228    226    1      145    144    0      6024
27     STA       160    160    0      19     17     2      41     34     7      21     20     1      349
28     TOR       339    339    0      39     39     0      132    132    0      34     34     0      0
29     UCI       251    165    86     73     19     54     48     48     0      23     23     0      367
30     UK2       618    611    7      145    130    15     224    224    0      102    102    0      1103
31     WOC       128    128    0      10     8      2      20     20     0      17     17     0      204
TOTAL            8718   7725   970    1591   931    649    2084   2047   33     1029   1015   11     22622

Abbreviations: All, all tumour behaviours (invasive, LMP/borderline, and unknowns); Inv., invasive tumour behaviour; LMP, low-malignancy potential/borderline tumour behaviour. Adding invasive and LMP cases does not equal "All" because of samples with unknown tumour behaviour: 23 in serous, 11 in mucinous, 4 in endometrioid, and 3 in clear cell. The table only includes ovary cases of European descent with individual genotyping data, 5 European principal component estimates, age at diagnosis data, and a serous, mucinous, endometrioid, or clear cell histology code. Only controls of European descent with individual genotyping data, 5 European principal component estimates, age of diagnosis data, and a null histology and tumour behaviour code are included in this table. These cases and controls reflect the samples used in the subtype-specific logistic regression analyses performed.

Table 3.9.
Properties of Samples in the Discovery Stage Case-Control Pools.

Pool                                                    Pool Size   Age at diagnosis ±    Minimum   Maximum
                                                                    standard deviation    Age       Age
EOC case pools
  Mucinous (MUC)                                        84          53 ± 13.2¹            23        78
  Endometrioid and clear cell (ENCC)³                   114         56 ± 10.5             25        79
  High-grade serous (HG-SER)                            272         62 ± 10.1¹            36        80
  Low-malignancy potential/borderline serous (LMP-SER)  75          53 ± 12.8¹            21        80
Control pools
  BCC & SMP combined controls                           398         57 ± 10.5             31        80
  BC controls (BCC)                                     176         54 ± 11.1²            31        76
  SMP controls (SMP)                                    222         59 ± 9.5              40        80

¹ Significantly different from BCC & SMP combined controls (P-value < 0.05).
² Significantly different from SMP controls (P-value < 0.05).
³ Endometrioid and clear cell EOC cases were combined in one case pool to detect shared risk factors.

Table 3.10. MUC EOC SNPs Replicated in OCAC, Stratified by Subtype.

SNP          Subtype, sample set¹     No.   Cases/Controls   OR (95% CI)²        P          Phet
rs11108890   Mucinous, Discovery      1     78/392           3.05 (1.32-7.06)*   9.1x10⁻³
             Mucinous, Replication    15    1257/17190       1.35 (1.10-1.66)*   3.5x10⁻³   9.1x10⁻³
             Mucinous, Rep/OVA        16    1352/17938       1.41 (1.16-1.71)*   5.4x10⁻³   8.0x10⁻³
             Other, END               16    1569/17315       1.23 (1.02-1.49)*   0.03       0.55
             Other, CC                10    705/15223        1.05 (0.77-1.43)    0.75       0.85
             Other, HG-SER            28    6881/21530       1.04 (0.93-1.16)    0.46       0.85
             Other, LMP-SER           9     825/13509        0.95 (0.72-1.26)    0.73       0.15
rs933518     Mucinous, Discovery      1     78/392           2.29 (1.32-3.99)*   3.4x10⁻³
             Mucinous, Replication    15    1257/17190       1.24 (1.06-1.44)*   7.2x10⁻³   0.22
             Mucinous, Rep/OVA        16    1352/17938       1.29 (1.11-1.49)*   7.3x10⁻⁴   0.16
             Other, END               16    1569/17315       0.98 (0.84-1.14)    0.82       0.13
             Other, CC                10    705/15223        0.90 (0.72-1.13)    0.37       0.96
             Other, LMP-SER           9     825/13509        1.04 (0.85-1.28)    0.68       0.22
             Other, HG-SER            28    6881/21530       1.03 (0.95-1.11)    0.45       0.80
rs17106154   Mucinous, Discovery      1     78/392           1.78 (0.95-3.37)    0.07
             Mucinous, Replication    15    1257/17190       1.20 (1.03-1.41)*   0.02       0.47
             Mucinous, Rep/OVA        16    1352/17938       1.24 (1.07-1.44)*   4.1x10⁻³   0.39
             Other, END               16    1569/17315       1.06 (0.91-1.22)    0.46       0.10
             Other, CC                10    705/15223        1.08 (0.88-1.33)    0.47       0.25
             Other, LMP-SER           9     825/13509        1.11 (0.92-1.35)    0.27       0.44
             Other, HG-SER            28    6881/21530       1.01 (0.94-1.10)    0.73       0.14
rs970651     Mucinous, Discovery      1     78/392           2.54 (1.54-4.19)*   2.6x10⁻⁴
             Mucinous, Replication    15    1257/17190       1.12 (1.00-1.26)*   0.045      0.61
             Mucinous, Rep/OVA        16    1352/17938       1.17 (1.04-1.30)*   6.3x10⁻³   0.31
             Other, END               16    1569/17315       0.95 (0.86-1.06)    0.392      0.65
             Other, CC                10    705/15223        1.13 (0.97-1.31)    0.104      0.70
             Other, LMP-SER           9     825/13509        1.08 (0.94-1.24)    0.287      0.83
             Other, HG-SER            28    6881/21530       1.04 (0.98-1.10)    0.229      0.57

SNPs are ordered by significance in the replication meta-analysis. Abbreviations used: No., number of studies; OR, odds ratio; 95% CI, 95% confidence interval; Rep/OVA, Replication including OVA samples; END, endometrioid; CC, clear cell; LMP-SER, low-malignancy potential/borderline serous; HG-SER, high-grade serous. Asterisks indicate that the 95% confidence interval does not contain 1.
¹ The "Discovery" sample set includes samples used in the DNA pools and subsequently genotyped by OCAC. The "Replication" sample set includes samples from OCAC studies with ≥30 mucinous cases (invasive and LMP/borderline tumour behaviour), and ≥30 controls, excluding the OVA study.
OVA samples are included in the "Rep/OVA" sample set, and include those samples used in the DNA pools in addition to cases and controls obtained after DNA pool construction. The "Other" histologies sample sets include samples from OCAC studies with ≥30 subtype-specific cases and ≥30 controls, excluding the OVA study.
² "Discovery" OR and P-value are from logistic regression assuming an additive genetic model, implemented in PLINK (v1.07). "Replication" OR and P-value are from fixed effects meta-analysis carried out using the rmeta library implemented in the R project for Statistical Computing. Individual studies contributing to the replication meta-analysis were analyzed using logistic regression assuming an additive genetic model, implemented in PLINK (v1.07). Phet is the P-value for Cochran's Q measure of between-study heterogeneity, generated in rmeta.

Table 3.11. MUC EOC SNPs Replicated in OCAC, Stratified by Tumour Behaviour.

SNP          Tumour behaviour, sample set¹            No.   Cases/Controls   OR (95% CI)²         P           Phet
rs11108890   Mucinous, invasive: Discovery            1     23/392           4.23 (1.30-13.69)*   0.02
             Mucinous, invasive: Replication          11    627/14180        1.32 (1.00-1.76)     0.05        0.25
             Mucinous, invasive: Rep/OVA              12    653/14928        1.43 (1.08-1.88)*    0.01        0.13
             Mucinous, LMP/borderline: Discovery      1     55/392           2.69 (1.01-7.19)*    0.05
             Mucinous, LMP/borderline: Replication    7     476/11300        1.22 (0.87-1.70)     0.24        0.23
             Mucinous, LMP/borderline: Rep/OVA        9     545/12048        1.29 (0.94-1.75)     0.11        0.27
rs933518     Mucinous, invasive: Discovery            1     23/392           1.52 (0.58-3.96)     0.39
             Mucinous, invasive: Replication          11    627/14180        1.27 (1.02-1.56)*    0.03        0.63
             Mucinous, invasive: Rep/OVA              12    653/14928        1.27 (1.03-1.56)*    0.02        0.71
             Mucinous, LMP/borderline: Discovery      1     55/392           2.60 (1.39-4.87)*    2.7x10⁻³
             Mucinous, LMP/borderline: Replication    7     476/11300        1.27 (0.99-1.62)     0.06        0.18
             Mucinous, LMP/borderline: Rep/OVA        9     545/12048        1.39 (1.11-1.74)*    4.0x10⁻³    0.11
rs17106154   Mucinous, invasive: Discovery            1     23/392           2.93 (1.09-7.85)*    0.03
             Mucinous, invasive: Replication          11    627/14180        1.16 (0.92-1.46)     0.20        0.22
             Mucinous, invasive: Rep/OVA              12    653/14928        1.23 (0.99-1.54)     0.07        0.13
             Mucinous, LMP/borderline: Discovery      1     55/392           1.54 (0.71-3.33)     0.27
             Mucinous, LMP/borderline: Replication    7     476/11300        1.19 (0.93-1.52)     0.16        0.44
             Mucinous, LMP/borderline: Rep/OVA        9     545/12048        1.24 (0.99-1.55)     0.07        0.49
rs970651     Mucinous, invasive: Discovery            1     23/392           2.39 (1.04-5.49)*    0.04
             Mucinous, invasive: Replication          11    627/14180        1.20 (1.02-1.41)*    0.02        0.91
             Mucinous, invasive: Rep/OVA              12    653/14928        1.21 (1.03-1.41)*    0.02        0.93
             Mucinous, LMP/borderline: Discovery      1     55/392           3.04 (1.70-5.45)*    1.83x10⁻⁴
             Mucinous, LMP/borderline: Replication    7     476/11300        1.02 (0.84-1.22)     0.85        0.71
             Mucinous, LMP/borderline: Rep/OVA        9     545/12048        1.13 (0.95-1.34)     0.17        0.10

SNPs are ordered according to Table 3.5. Abbreviations used: No., number of studies; OR, odds ratio; 95% CI, 95% confidence interval; Rep/OVA, Replication including OVA samples; LMP, low-malignancy potential. Asterisks indicate that the 95% confidence interval does not contain 1.
¹ The "Discovery" sample set includes samples used in the DNA pools and subsequently genotyped by OCAC, stratified by tumour behaviour. The "Replication" sample set includes samples from OCAC studies with ≥30 invasive or LMP/borderline mucinous cases, and ≥30 controls, excluding the OVA study. OVA samples are included in the "Rep/OVA" sample set, and include those samples used in the DNA pools in addition to cases and controls obtained after DNA pool construction.
² "Discovery" OR and P-value are from logistic regression assuming an additive genetic model, implemented in PLINK (v1.07). "Replication" OR and P-value are from fixed effects meta-analysis carried out using the rmeta library implemented in the R project for Statistical Computing. Individual studies contributing to the replication meta-analysis were analyzed using logistic regression assuming an additive genetic model, implemented in PLINK (v1.07). Phet is the P-value for Cochran's Q measure of between-study heterogeneity, generated in rmeta.

Table 3.12. ENCC SNPs Replicated in OCAC, Stratified by Subtype.

SNP         Subtype, sample set¹                              No.   Cases/Controls   OR (95% CI)²        P          Phet
rs2190503   Endometrioid & clear cell combined, Discovery     1     114/392          1.85 (1.21-2.85)*   4.9x10⁻³
            Endometrioid & clear cell combined, Replication   22    2594/20326       1.12 (1.03-1.23)*   0.01       0.06
            Endometrioid & clear cell combined, Rep/OVA       23    2751/21074       1.14 (1.04-1.24)*   3.9x10⁻³   0.06
            Other, MUC                                        15    1257/17190       0.93 (0.81-1.06)    0.29       0.24
            Other, LMP-SER                                    9     825/13509        0.95 (0.80-1.12)    0.53       0.14
            Other, HG-SER                                     28    6881/21530       1.08 (1.02-1.15)*   0.01       0.47
rs6593140   Endometrioid & clear cell combined, Discovery     1     114/392          2.03 (1.30-3.17)*   1.8x10⁻³
            Endometrioid & clear cell combined, Replication   22    2594/20326       1.11 (1.02-1.22)*   0.02       0.20
            Endometrioid & clear cell combined, Rep/OVA       23    2751/21074       1.13 (1.03-1.23)*   0.01       0.19
            Other, MUC                                        15    1257/17190       0.92 (0.80-1.05)    0.20       0.36
            Other, LMP-SER                                    9     825/13509        0.97 (0.82-1.14)    0.68       0.09
            Other, HG-SER                                     28    6881/21530       1.10 (1.03-1.17)*   4.2x10⁻³   0.53
rs2329554   Endometrioid & clear cell combined, Discovery     1     114/392          1.91 (1.33-2.74)*   4.4x10⁻⁴
            Endometrioid & clear cell combined, Replication   22    2594/20326       1.07 (1.00-1.15)    0.06       0.84
            Endometrioid & clear cell combined, Rep/OVA       23    2751/21074       1.09 (1.02-1.17)*   0.01       0.66
            Other, MUC                                        15    1257/17190       0.97 (0.87-1.07)    0.54       0.10
            Other, LMP-SER                                    9     825/13509        0.97 (0.85-1.10)    0.59       0.61
            Other, HG-SER                                     28    6881/21530       1.08 (1.02-1.13)*   3.3x10⁻³   0.35

SNPs are ordered by significance in the replication meta-analysis. Abbreviations used: No., number of studies; OR, odds ratio; 95% CI, 95% confidence interval; Rep/OVA, Replication including OVA samples; MUC, mucinous; LMP-SER, low-malignancy potential/borderline serous; HG-SER, high-grade serous. Asterisks indicate that the 95% confidence interval does not contain 1.
Endometrioid and clear cell cases are combined in the replication to reflect the discovery stage pool.
¹ The "Discovery" sample set includes samples used in the DNA pools and subsequently genotyped by OCAC. The "Replication" sample set includes samples from OCAC studies with ≥30 endometrioid/clear cell cases (combined, invasive only), and ≥30 controls, excluding the OVA study. OVA samples are included in the "Rep/OVA" sample set, and include those samples used in the DNA pools in addition to cases and controls obtained after DNA pool construction. The "Other" histologies sample sets include samples from OCAC studies with ≥30 subtype-specific cases and ≥30 controls, excluding the OVA study.
² "Discovery" OR and P-value are from logistic regression assuming an additive genetic model, implemented in PLINK (v1.07). "Replication" OR and P-value are from fixed effects meta-analysis carried out using the rmeta library implemented in the R project for Statistical Computing. Individual studies contributing to the replication meta-analysis were analyzed using logistic regression assuming an additive genetic model, implemented in PLINK (v1.07). Phet is the P-value for Cochran's Q measure of between-study heterogeneity, generated in rmeta.

Table 3.13. ENCC SNPs Replicated in OCAC, Stratified by END and CC Subtypes.

SNP         Subtype, sample set¹   No.   Cases/Controls   OR (95% CI)²        P           Phet
rs2190503   END, Discovery         1     68/392           1.47 (0.87-2.51)    0.152
            END, Replication       16    1569/17315       1.11 (1.00-1.25)    0.057       0.04
            END, Rep/OVA           17    1669/18063       1.12 (1.00-1.25)*   0.044       0.05
            CC, Discovery          1     41/392           2.31 (1.25-4.27)*   7.42x10⁻³
            CC, Replication        10    705/15223        1.12 (0.95-1.31)    0.175       0.03
            CC, Rep/OVA            11    762/15971        1.16 (1.00-1.35)    0.053       0.02
rs6593140   END, Discovery         1     68/392           1.66 (0.96-2.87)    0.068
            END, Replication       16    1569/17315       1.11 (0.99-1.24)    0.069       0.06
            END, Rep/OVA           17    1669/18063       1.12 (1.00-1.25)*   0.047       0.08
            CC, Discovery          1     41/392           2.45 (1.29-4.67)*   6.35x10⁻³
            CC, Replication        10    705/15223        1.10 (0.93-1.29)    0.267       0.08
            CC, Rep/OVA            11    762/15971        1.14 (0.98-1.33)    0.093       0.05
rs2329554   END, Discovery         1     68/392           1.82 (1.17-2.82)*   7.86x10⁻³
            END, Replication       16    1569/17315       1.09 (1.00-1.20)    0.057       0.58
            END, Rep/OVA           17    1669/18063       1.10 (1.01-1.21)*   0.027       0.58
            CC, Discovery          1     41/392           1.98 (1.15-3.41)*   0.014
            CC, Replication        10    705/15223        1.05 (0.92-1.20)    0.427       0.38
            CC, Rep/OVA            11    762/15971        1.10 (0.97-1.25)    0.126       0.15

SNPs are ordered according to Table 3.7. Abbreviations used: No., number of studies; OR, odds ratio; 95% CI, 95% confidence interval; Rep/OVA, Replication including OVA samples; END, endometrioid; CC, clear cell. Asterisks indicate that the 95% confidence interval does not contain 1.
¹ The "Discovery" sample set includes samples used in the DNA pools and subsequently genotyped by OCAC, stratified by tumour behaviour. The "Replication" sample set includes samples from OCAC studies with ≥30 endometrioid or clear cell cases (invasive only), and ≥30 controls, excluding the OVA study. OVA samples are included in the "Rep/OVA" sample set, and include those samples used in the DNA pools in addition to cases and controls obtained after DNA pool construction.
² "Discovery" OR and P-value are from logistic regression assuming an additive genetic model, implemented in PLINK (v1.07). "Replication" OR and P-value are from fixed effects meta-analysis carried out using the rmeta library implemented in the R project for Statistical Computing. Individual studies contributing to the replication meta-analysis were analyzed using logistic regression assuming an additive genetic model, implemented in PLINK (v1.07).
Phet is the P-value for Cochran's Q measure of between-study heterogeneity, generated in rmeta.

Table 3.14. LMP-SER SNPs Replicated in OCAC, Stratified by Subtype.

SNP         Subtype, sample set¹    No.   Cases/Controls   OR (95% CI)²        P          Phet
rs9609538   LMP-SER, Discovery      1     68/392           0.48 (0.30-0.77)*   2.3x10⁻³
            LMP-SER, Replication    9     825/13509        0.87 (0.78-0.97)*   0.01       0.86
            LMP-SER, Rep/OVA        10    903/14257        0.85 (0.76-0.95)*   3.0x10⁻³   0.71
            Other, MUC              15    1257/17190       0.94 (0.86-1.03)    0.17       0.56
            Other, END              16    1569/17315       1.03 (0.95-1.11)    0.50       0.27
            Other, HG-SER           28    6881/21530       0.98 (0.94-1.02)    0.39       0.08
            Other, CC               10    705/15223        1.05 (0.93-1.18)    0.42       0.44
rs2169310   LMP-SER, Discovery      1     68/392           0.26 (0.13-0.51)*   7.7x10⁻⁵
            LMP-SER, Replication    9     825/13509        0.87 (0.76-0.98)*   0.03       0.33
            LMP-SER, Rep/OVA        10    903/14257        0.83 (0.73-0.93)*   2.5x10⁻³   7.6x10⁻³
            Other, MUC              15    1257/17190       0.93 (0.84-1.04)    0.20       0.59
            Other, END              16    1569/17315       0.94 (0.86-1.03)    0.23       0.11
            Other, CC               10    705/15223        0.90 (0.79-1.03)    0.12       0.87
            Other, HG-SER           28    6881/21530       0.99 (0.95-1.04)    0.79       0.50

SNPs are ordered by significance in the replication meta-analysis. Abbreviations used: No., number of studies; OR, odds ratio; 95% CI, 95% confidence interval; Rep/OVA, Replication including OVA samples; LMP-SER, low-malignancy potential/borderline serous; MUC, mucinous; END, endometrioid; CC, clear cell; HG-SER, high-grade serous. Asterisks indicate that the 95% confidence interval does not contain 1.
¹ The "Discovery" sample set includes samples used in the DNA pools and subsequently genotyped by OCAC. The "Replication" sample set includes samples from OCAC studies with ≥30 LMP/borderline tumour behaviour serous cases, and ≥30 controls, excluding the OVA study. OVA samples are included in the "Rep/OVA" sample set, and include those samples used in the DNA pools in addition to cases and controls obtained after DNA pool construction. The "Other" histologies sample sets include samples from OCAC studies with ≥30 subtype-specific cases and ≥30 controls, excluding the OVA study.
² "Discovery" OR and P-value are from logistic regression assuming an additive genetic model, implemented in PLINK (v1.07). "Replication" OR and P-value are from fixed effects meta-analysis carried out using the rmeta library implemented in the R project for Statistical Computing. Individual studies contributing to the replication meta-analysis were analyzed using logistic regression assuming an additive genetic model, implemented in PLINK (v1.07). Phet is the P-value for Cochran's Q measure of between-study heterogeneity, generated in rmeta.

Table 3.15. Summary of Nine Subtype-Specific SNPs Replicated in OCAC.

Subtype   SNP          Location¹         Allele   MAF²   Consequence              HGNC symbol       TFBS / SSE,SSS / miRNA³   Reg. score   Cons. score
MUC       rs11108890   Chr12: 97613399   C/A      0.04   Inter.                                     --/--/--                  0.00         0.00
MUC       rs933518     Chr16: 54522121   C/T      0.07   Intron                   na                --/--/--                  0.04         0.00
MUC       rs17106154   Chr14: 69161174   T/C      0.07   aa                       RAD51B            --/--/--                  0.00         0.00
MUC       rs970651     Chr13: 48453704   G/A      0.16   Inter.                                     --/--/--                  0.00         0.00
ENCC      rs2190503    Chr7: 50742617    A/G      0.13   Intron                   GRB10             --/--/--                  0.00         0.00
ENCC      rs6593140    Chr7: 50798133    C/T      0.12   Intron/upstr.            GRB10             --/--/--                  0.20         0.98
ENCC      rs2329554    Chr7: 50875030    A/G      0.22   Inter.                                     --/--/--                  na           0.02
LMP-SER   rs2169310    Chr16: 63962556   A/G      0.24   Inter.                                     --/--/--                  0.00         0.06
LMP-SER   rs9609538    Chr22: 32809832   T/C      0.36   Upstr./Intron/Downstr.   C22orf28, BPIL2   Y/--/Y                    0.00         0.00

Within each subtype SNPs are ordered by significance in the replication meta-analysis. Abbreviations used: Chr, chromosome; MAF, minor allele frequency; HGNC, HUGO Gene Nomenclature Committee; TFBS, transcription factor binding site; SSE, splice site enhancer; SSS, splice site suppressor; Reg., regulation; Cons., conservation; na, not available; Y, yes; inter., intergenic; upstr., upstream; downstr., downstream.
¹ The Ensembl v57 database (Genome Assembly: GRCh37, Feb 2009, dbSNP version: 130) was used as the source of SNP information, including SNP chromosome, nucleotide position, consequence (e.g. intronic, non-synonymous coding), and the HGNC symbol of genes tagged by SNPs.
² As calculated in all European controls in OCAC participating studies.
³ The web-based package FuncPred was used to predict functional characteristics of SNPs¹⁹⁹. FuncPred was used to assess the potential for the alternate allele of a non-coding SNP to impact: transcriptional regulation by changing transcription factor binding site activity, translational regulation by changing miRNA binding site activity, and protein structure/function by changing exonic splice site enhancer or splice site silencer activity. FuncPred outputs the ESPERR regulatory potential score ("Reg. score"), and the Vertebrate Multiz Alignment & Conservation score ("Cons. score").

Table 3.16. Proxy SNPs With In Silico Predicted Functional Consequences.
Subtype   Associated SNP                     LD-SNP       LD¹                LD-SNP Location²   MAF¹   TFBS / ESS,SSS / miRNA³   Reg score   Cons score
MUC       rs17106154                         rs1274757    0.84               Chr14: 69160487    0.09   Y/--/--                   0.06        0.00
MUC       rs17106154                         rs1274758    0.82               Chr14: 69158957    0.10   Y/--/--                   0.00        0.00
MUC       rs970651                           rs10397      0.69               Chr13: 48516845    0.15   --/Y/Y                    0.17        0.96
ENCC      rs2329554                          rs11238326   1                  Chr7: 50853428     0.21   Y/--/--                   0.28        0.00
ENCC      rs2329554                          rs11238337   1                  Chr7: 50865049     0.21   Y/--/--                   0.01        0.00
ENCC      rs2329554                          rs11983539   1                  Chr7: 50863651     0.22   Y/--/--                   0.00        0.00
ENCC      rs2329554                          rs6963498    1                  Chr7: 50855509     0.21   Y/--/--                   0.12        0.00
ENCC      rs2329554, rs2190503, rs6593140    rs6593182    0.89, 0.62, 0.65   Chr7: 50855847     0.18   Y/--/--                   0.00        0.00
ENCC      rs2329554, rs2190503, rs6593140    rs7791286    0.69, 0.64, 0.67   Chr7: 50856792     0.17   Y/--/--                   0.00        0.00
LMP-SER   rs9609538                          rs9609537    1                  Chr22: 32807769    0.34   Y/--/Y                    0.00        0.00

Abbreviations used: Chr, chromosome; MAF, minor allele frequency; TFBS, transcription factor binding site; SSE, splice site enhancer; SSS, splice site suppressor; Reg., regulation; Cons., conservation; Y, yes.
¹ As calculated based on the HapMap CEU panel.
² The Ensembl v57 database (Genome Assembly: GRCh37, Feb 2009, dbSNP version: 130) was used as the source of SNP information, including SNP chromosome and nucleotide position.
³ The web-based package FuncPred was used to predict functional characteristics of SNPs¹⁹⁹. In particular, FuncPred was used to assess the potential for the alternate allele of a non-coding SNP to impact: transcriptional regulation by changing transcription factor binding site activity, translational regulation by changing miRNA binding site activity, and protein structure/function by changing exonic splice site enhancer or splice site silencer activity. In addition, FuncPred outputs the ESPERR regulatory potential score ("Reg. score"), and the Vertebrate Multiz Alignment & Conservation score ("Cons. score").

Table 4.1.
Summary of SNPs Selected for Individual Genotyping in Iranian Women.

Selection criteria                                            No. SNPs chosen   No. SNPs genotyped¹
Associated in Iranians
  Pool-based GWAS: median SM rank                             95                68
  Pool-based GWAS: SM P-value                                 7                 2
Associated in Europeans
  Reported by GWAS²                                           14                13
  Reported by candidate gene study³                           4                 4
Total                                                         120               85

Abbreviations: GWAS, genome-wide association study; SM, SINGLEMARKER; SNP, single nucleotide polymorphism.
¹ SNPs retained after quality control steps.
² He et al., 2009 and Stolk et al., 2009.
³ Voorhuis et al., 2011 and Hefler et al., 2005.

Table 4.2. Association and Rank Information for 102 Iranian GWAS SNPs Chosen for Individual Genotyping.

No.   SNP          Location           SM P-value   SM rank   No. SNPs in LD   Median SM rank   Selection method
1     rs3892396    Chr19: 6355507     1.7E-06      1         2                6                median SM rank
2     rs11166152   Chr1: 99229849     4.0E-06      2         3                93               median SM rank
3     rs12293066   Chr11: 79504699    5.2E-04      86        2                104              median SM rank
4     rs421548     Chr5: 9561979      1.1E-03      171       3                171              median SM rank
5     rs10840211   Chr11: 9314832     9.4E-04      150       2                210              median SM rank
6     rs7032810    Chr9: 84581904     6.7E-06      3         5                287              median SM rank
7     rs2002659    Chr9: 113779160    3.2E-04      47        2                290              median SM rank
8     rs6578841    Chr11: 7259800     1.6E-04      26        3                296              median SM rank
9     rs4849360    Chr2: 111532342    3.6E-04      53        3                303              median SM rank
10    rs17440439   Chr9: 131045656    6.8E-04      106       2                313              median SM rank
11    rs4724254    Chr7: 43897300     2.9E-04      44        4                314              median SM rank
12    rs1343604    Chr13: 52547814    1.7E-03      272       3                320              median SM rank
13    rs10518362   Chr1: 86256220     1.4E-03      220       3                338              median SM rank
14    rs10140275   Chr14: 94542457    6.1E-04      100       3                342              median SM rank
15    rs9526786    Chr13: 51107423    1.6E-03      241       9                367              median SM rank
16    rs1928868    Chr9: 12378022     1.2E-03      184       2                388              median SM rank
17    rs10803667   Chr2: 234550592    7.3E-04      115       2                391              median SM rank
18    rs10241074   Chr7: 107252168    1.1E-03      181       2                396              median SM rank
19    rs6762123    Chr3: 20478133     1.4E-03      224       4                417              median SM rank
20    rs12257453   Chr10: 23350775    2.7E-05      7         3                426              median SM rank
21    rs6597754    Chr10: 127938228   3.4E-04      50        7                449              median SM rank
22    rs10817056   Chr9: 112418610    9.0E-04      142       2                488              median SM rank
23    rs6000948    Chr22: 20786704    2.4E-03      394       2                525              median SM rank
24    rs7321052    Chr13: 104719210   2.2E-03      360       3                546              median SM rank
25    rs13402171   Chr2: 71597893     1.3E-03      208       2                570              median SM rank
26    rs7766409    Chr6: 167209834    3.1E-03      521       5                594              median SM rank
27    rs1630030    Chr11: 124806797   1.2E-03      190       2                616              median SM rank
28    rs1333200    Chr10: 2180873     1.7E-03      262       2                631              median SM rank
29    rs7498814    Chr16: 61611678    2.6E-03      430       2                640              median SM rank
30    rs593602     Chr6: 85930019     9.0E-04      141       6                641              median SM rank
31    rs12607480   Chr18: 41955585    3.6E-03      632       2                652              median SM rank
32    rs4663953    Chr2: 239305866    7.7E-04      122       3                698              median SM rank
33    rs10144724   Chr14: 36259993    2.0E-04      30        3                704              median SM rank
34    rs2207451    Chr22: 47717016    2.4E-03      393       2                727              median SM rank
35    rs1539943    Chr18: 69254778    3.7E-03      650       2                744              median SM rank
36    rs717482     Chr16: 84862498    3.8E-03      662       2                749              median SM rank
37    rs10276194   Chr7: 135417166    3.0E-03      496       3                767              median SM rank
38    rs300153     Chr2: 17849898     9.8E-04      155       3                769              median SM rank
39    rs2291031    Chr17: 70739768    1.0E-03      159       8                792              median SM rank
40    rs3903981    Chr1: 178881911    1.7E-03      255       4                800              median SM rank
41    rs17438534   Chr5: 30085779     1.7E-03      265       3                811              median SM rank
42    rs1949194    Chr17: 48378364    4.4E-03      791       3                818              median SM rank
43    rs17029976   Chr3: 33348207     8.3E-04      130       3                820              median SM rank
44    rs317817     Chr18: 53430317    2.2E-03      353       2                821              median SM rank
45    rs175644     Chr14: 74989084    5.1E-04      85        3                845              median SM rank
46    rs13325331   Chr3: 59619045     2.8E-03      467       2                853              median SM rank
47    rs1915973    Chr7: 68297816     2.8E-03      474       2                866              median SM rank
48    rs12988603   Chr2: 159743441    2.1E-03      346       4                866              median SM rank
49    rs10987671   Chr9: 129304835    4.2E-04      66        7                872              median SM rank
50    rs17133704   Chr11: 74516697    1.4E-03      216       8                892              median SM rank

76    rs1748936    Chr14: 95210993    4.3E-03      775       2                1369             median SM rank
77    rs10165492   Chr2: 108825880    7.4E-05      14        6                1386             median SM rank
78    rs414125     Chr12: 43588274    4.3E-03      766       5                1396             median SM rank
79    rs9531454    Chr13: 82919341    4.0E-03      703       2                1425             median SM rank
80    rs7330156    Chr13: 105385858   2.6E-03      425       3                1425             median SM rank
81    rs2319188    Chr3: 148889003    6.9E-04      107       2                1434             median SM rank
82    rs7118650    Chr11: 130514513   3.9E-04      60        2                1461             median SM rank
83    rs4413314    Chr3: 95780807     3.5E-03      613       3                1463             median SM rank
84    rs2220702    Chr4: 167550520    4.2E-04      65        4                1464             median SM rank
85    rs12967036   Chr18: 25829752    2.3E-03      386       4                1465             median SM rank
86    rs6006753    Chr22: 44225136    4.6E-03      815       3                1469             median SM rank
87    rs10769773   Chr11: 7260565     3.8E-03      664       3                1477             median SM rank
88    rs11730273   Chr4: 31369957     2.8E-03      461       3                1528             median SM rank
89    rs17757858   Chr12: 106534675   2.5E-03      419       2                1568             median SM rank
90    rs2911518    Chr13: 69322069    3.2E-03      548       2                1584             median SM rank
91    rs13239642   Chr7: 77713235     1.9E-03      294       2                1615             median SM rank
92    rs4603562    Chr16: 26516834    2.9E-03      490       2                1619             median SM rank
93    rs472978     Chr4: 63904578     5.1E-03      913       4                1680             median SM rank
94    rs11139772   Chr9: 84583222     2.9E-03      487       2                1688             median SM rank
95    rs6878087    Chr5: 276117       4.2E-04      67        10               1733             median SM rank
96    rs12623805   Chr2: 223597400    8.5E-06      4         0                n/a              SM P-value
97    rs12427748   Chr13: 51217941    1.9E-05      5         0                n/a              SM P-value
98    rs175700     Chr14: 75036729    3.9E-05      8         0                n/a              SM P-value
99    rs12880943   Chr14: 33279026    4.4E-05      9         0                n/a              SM P-value
100   rs11716985   Chr3: 103498009    5.0E-05      10        0                n/a              SM P-value
101   rs9315161    Chr13: 31687242    1.0E-04      16        0                n/a              SM P-value
102   rs1841432    Chr4: 115578981    1.1E-04      18        0                n/a              SM P-value

Abbreviations: n/a, not available; SM, SingleMarker.

Table 4.3. SNPs Associated With ANM in 782 Iranian Women.

SNP information                                      Discovery samples       Replication samples     Discovery and replication samples
SNP          Location¹          Alleles²   MAF²     N     β1      P-value    N     β1      P-value   N     β1      P-value
rs10140275   Chr14: 94542457    [A/G]      0.19     209   3.59    3.0x10⁻⁴   571   0.32    0.18      780   1.09    4.0x10⁻⁴
rs10144724   Chr14: 36259993    [T/C]      0.16     209   -2.32   0.01       572   -0.19   0.45      781   -0.82   0.01
rs7766409    Chr6: 167209834    [A/G]      0.27     209   -1.21   0.09       573   -0.31   0.10      782   -0.56   0.02
rs4413314    Chr3: 95780807     [C/T]      0.49     209   1.10    0.13       572   0.35    0.05      781   0.54    0.02
rs4304553    Chr1: 161686638    [C/A]      0.15     209   -1.85   0.05       573   -0.24   0.29      782   -0.64   0.03
rs6762123    Chr3: 20478133     [G/A]      0.13     208   -1.52   0.13       572   -0.42   0.11      780   -0.71   0.03
rs10840211   Chr11: 9314832     [G/A]      0.26     204   -1.33   0.07       561   -0.19   0.32      765   -0.48   0.04
rs4902619    Chr14: 68161442    [A/C]      0.12     209   2.81    0.02       571   0.14    0.58      780   0.66    0.04
rs4663953    Chr2: 239305866    [A/G]      0.19     209   -1.99   0.05       573   -0.16   0.55      782   -0.65   0.05

The table is ordered by smallest P-value in the combined discovery and replication dataset for the linear regression analyses performed (Y1 = β0 + β1SNP). Bolded P-values < 0.05. Abbreviations: N, the number of samples included in the analysis.
1 Position based on NCBI Build 36.1
2 Major allele/minor allele, based on HapMap CEU population, Release 27.

Table 4.4. Replication of SNP-ANM Associations Reported by Previous GWAS in 782 Iranian Women.

SNP information                          European sample results (GWAS)                          Iranian sample results1
Loci  Cytological location1  SNP         r2 (4)  MAF   SNP effect (β or ES)  P-value (study)     MAF   β1     P-value
1     20p12.3                rs16991615  0.36    0.06  1.07 (β)              1.2x10-21 (He)      0.05  1.13   0.02
                             rs236114            0.21  0.50 (ES)             9.7x10-11 (Stolk)   0.18  0.43   0.15
2     19q13.42               rs1172822   0.77    0.37  -0.49 (β)             1.8x10-19 (He)      0.38  -0.39  0.08
                             rs2384687           0.39  -0.47 (β)             2.4x10-18 (He)      0.39  -0.36  0.1
3     5q35.2                 rs365132    0.97    0.49  0.39 (β)              8.4x10-14 (He)      NA    NA     NA
                             rs7718874           0.49  0.39 (β)              1.3x10-13 (He)      0.48  -0.29  0.18
4     13q34                  rs7333181   1.00    0.12  0.52 (ES)             2.5x10-08 (Stolk)   0.13  -0.05  0.87
                             rs1756091           NR    NR                    NR                  0.14  -0.13  0.7
5     6p24.2                 rs2153157   0.87    0.49  0.29 (β)              5.1x10-08 (He)      0.49  0.41   0.06
                             rs2153159           NR    NR                    NR                  0.49  -0.21  0.33
6     14q32                  rs4906172   0.93    NR    -0.36 (ES)            2.5x10-6 (Stolk)    0.25  -0.05  0.84
                             rs6575888           NR    NR                    NR                  0.23  -0.07  0.8
7     5q15                   rs1426100   0.80    NR    0.34 (ES)             9.4x10-05 (Stolk)   0.16  0.1    0.73
                             rs1593464           NR    NR                    NR                  0.14  0.18   0.57

Each r2 value applies to the pair of SNPs listed for that locus. Loci are ranked by smallest observed SNP-ANM association P-value in previous GWAS in European women5,169,170. Abbreviations: GWAS, genome-wide association study; ES, effect size; NR, not reported; NA, not available; MAF, minor allele frequency.
1 Results for linear regression analysis performed in Iranian women.

Table 4.5. Replication of SNP-ANM Associations Reported by Candidate Gene Studies in 782 Iranian Women.
SNP Information     European sample results                                   Iranian sample results1
SNP         Gene1   Study                  Model tested  MAF   β      P-value  MAF   β1     P-value
rs2002555   AMHR2   Voorhuis et al., 2011  ADD           0.19  0.30   0.02     0.15  -0.53  0.09
rs11170547  AMHR2   Voorhuis et al., 2011  ADD           0.12  0.31   0.05     0.06  -0.19  0.69
rs6521896   BMP15   Voorhuis et al., 2011  ADD           0.13  0.41   0.01     0.17  0.13   0.67
rs1800440   CYP1B1  Hefler et al., 2005    DOM           NR    -0.80  0.01     0.25  -0.24  0.35

Abbreviations: MAF, minor allele frequency; AMHR2, anti-Mullerian hormone receptor, type II; BMP15, bone morphogenetic protein 15; CYP1B1, cytochrome P450, family 1, subfamily B, polypeptide 1; ADD, additive; DOM, dominant; NR, not reported.
1 As reported in previous studies indicated.
2 Results for linear regression analysis performed in Iranian women.

Figure 1.1. Possible Cells of Origin in Ovarian Cancers.

Figure 1.2. Model of Formation of Inclusion Cyst.

Figure 2.1. Overview of Three Possible Pair-Wise Array Comparisons. Step 1 depicts the construction of three DNA pools. The first two pools (orange and red) are constructed using the same DNA samples and are pool-construction replicates. The third pool (green) is constructed using different DNA samples. Step 2 indicates allelotyping on Illumina SNP arrays, where the two arrays allelotyping the orange pool are array replicates. Step 3 shows the three types of pair-wise SNP array comparisons that can be made, along with the sources of error that account for differences in allele frequency estimates on the paired arrays. For Type A comparisons, the arrays being compared were used to allelotype the exact same DNA pool; hence, the only source of variation is the array. For Type B comparisons, the arrays paired were used to allelotype independently constructed but identical pools; thus, variation may arise due to the array and the pool-construction process. For Type C comparisons, the arrays paired were used to allelotype completely independent DNA pools, and variation may be due to the array, pool-construction, or sampling (assuming both pools are independent samples from a single population).

Figure 2.2. Box Plots of Array Variance for Three Illumina Array Types. Box plots of var(earray(x,y)) for Illumina 1M-Duo, 1M-Single, and 660-Quad arrays for normalized and raw data. The 1M-Duo arrays were genotyped in two batches and are plotted stratified by batch (“1M-Duo-Batch 1”, “1M-Duo-Batch 2”), as well as by array type (“1M-Duo”). The number of var(earray) estimates for each array type is: 1M-Duo, n=45; 1M-Duo-Batch 1, n=18; 1M-Duo-Batch 2, n=27; 1M-Single, n=11; 660-Quad, n=360. Box plot whiskers are plotted at the lowest datum within 1.5 times the interquartile range of the lower quartile, and the highest datum within 1.5 times the interquartile range of the upper quartile.

Figure 2.3. Box Plots of Array Variance for Illumina 1M-Duo Arrays. Box plots of var(earray) estimates (n=48) for the 1M-Duo arrays (Batch 1 and 2 combined) highlighting the three extreme outlier estimates in both normalized and raw data, all attributable to one array. This array was determined to be faulty and removed from all analyses. Box plot whiskers are plotted at the lowest datum within 1.5 times the interquartile range of the lower quartile, and the highest datum within 1.5 times the interquartile range of the upper quartile.

Figure 2.4. Decomposition of Pooling Variance for Illumina 1M-Single Arrays. Stacked barplots showing the normalized pooling variance estimates, and the breakdown into array and pool-construction variance, for pools allelotyped on the Illumina 1M-Single array. Estimates derived from comparison of replicate pools are labeled “B”. Estimates derived from comparison of non-identical pools are labeled “C1” and “C2” (specifying replicate pool). The portion of pooling variance attributed to pool-construction is indicated by hatched bars, and array variance by black or grey bars. Pool size is shown above the barplots.

Figure 2.5. Decomposition of Pooling Variance for Illumina 660-Quad Arrays. Stacked barplots showing the normalized pooling variance estimates, and the breakdown into array and pool-construction variance, for pools allelotyped on the Illumina 660-Quad array. All estimates are derived from comparison of non-identical pools, Type C. The portion of pooling variance attributed to pool-construction is indicated by hatched bars; the portion of pooling variance attributed to the array is indicated by grey bars. Pool size is indicated above each stacked bar.

Figure 2.6. Decomposition of Pooling Variance for Illumina 1M-Duo Arrays. Stacked barplots showing the normalized pooling variance estimates, and the breakdown into array and pool-construction variance, for pools allelotyped on the Illumina 1M-Duo array. All estimates are derived from comparison of non-identical pools, Type C. The portion of pooling variance attributed to pool-construction is indicated by hatched bars; the portion of pooling variance attributed to the array is indicated by grey bars. Pool size is indicated above each stacked bar.

Figure 2.7. Example of PoolingPlanner. (A) Control input and output panel for the case pool. (B) Control input and output panel for the control pool. (C) Corresponding plot of relative sample size versus the number of replicate arrays used in allelotyping the case (blue line) and control pool (red line).

Figure 2.8. Example Use of PoolingPlanner. Power curves for a theoretical pooling experiment with 300 cases and 1000 controls where 24, 12, 6, or 3 Illumina 660-Quad replicate arrays are used to allelotype the DNA pools. The equivalent individual genotyping experiment is given for reference. Effective sample size assuming 24, 12, 6, or 3 arrays was calculated using PoolingPlanner (see Table 2.2) and these values entered into Quanto189 to obtain pool-adjusted estimates of power over a range of odds ratios. Calculations are based on an unmatched case-control design testing for gene-only effects using a log-additive model, where the incidence of the case phenotype is 0.02%, and the risk allele frequency (prisk) is 29% (and in complete linkage disequilibrium with a SNP on the array). A dashed line is drawn to indicate the 80% power threshold.

Figure 3.1. Summary of Ovarian Cancer GWAS Stages and Outcomes.

Figure 3.2. Breakdown of the Discovery Stage Case-Control Pools by 10 Year Age Bins.

Figure 3.3. Correlation in Allele Frequency Estimation Between Two Arrays. Correlation in the estimates of allele frequency for 5000 randomly selected SNPs on two replicate arrays before normalization (“raw”) and after normalization (“normalized”).

Figure 3.4. Distribution of Standard Deviations of Allele Frequency Estimates. Distribution of the standard deviations associated with pool-based SNP AF estimates before normalization (“raw”) and after normalization (“normalized”). Data is for 12 replicates and 5000 randomly selected SNPs. Axis label: standard deviation of SNP AF estimate.

Figure 3.5. Hierarchical Cluster Plot of Discovery Stage Illumina 660-Quad Array Data Before Normalization. Axis label: (1-Pearson’s correlation coefficient).

Figure 3.6. Hierarchical Cluster Plot of Discovery Stage Illumina 660-Quad Array Data After Normalization. Axis label: (1-Pearson’s correlation coefficient).

Figure 3.7. Correlation Between Pool-Based and IG-Based Allele Frequency Estimates.

Figure 3.8. Allele Frequency Difference for 188 SNPs Based on IG and Pooling Data, Stratified by Case Pool.

Figure 3.9. Allele Frequency Difference for 188 SNPs Based on IG and Pooling Data, Stratified by SNP Filtering Method.

Figure 4.1.
Summary of ANM Study Objectives, Steps, and Outcomes  161  Figure 4.2. Hierarchical Cluster Plot of 8 Illumina 1M-Duo Arrays. Arrays were used to assay the early and late ANM pools in the discovery stage of our GWAS. Normalized data is presented. 162  Figure 4.3: Histogram of the Standard Deviations Associated with SNP allele Frequencies. Pool array data for 5000 randomly selected SNPs (normalized data presented). In panel 3 the faulty late ANM Array 1 is removed.  163  Figure 4.4. Correlation Between Pool and IG-Based Allele Frequency Estimates. Data is for discovery stage SNPs (“Iranian GWAS SNPs)” and SNPs previously associated in European women (“literature SNPs”) in the early and late ANM DNA pools.  164  Figure 4.5. Allele Frequency Difference for SNPs. Iranian GWAS SNPs were chosen based on the discovery stage GWAS analysis. Literature SNPs were chosen based on being previously associated with ANM in European women. Data for the early and late ANM DNA pools are compared. Individual genotyping data and pool-based data are compared.  165  REFERENCES 1.  2. 3.  4.  5. 6.  7.  8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18.  Braem, M.G. et al. Reproductive and hormonal factors in association with ovarian cancer in the Netherlands cohort study. American Journal of Epidemiology 172, 1181-1189 (2010). Schorge, J.O. et al. SGO White Paper on ovarian cancer: etiology, screening and surveillance. Gynecologic oncology 119, 7-17 (2010). Dossus, L. et al. Reproductive risk factors and endometrial cancer: the European Prospective Investigation into Cancer and Nutrition. International journal of cancer.Journal international du cancer 127, 442-451 (2010). Titus-Ernstoff, L. et al. Menstrual factors in relation to breast cancer risk. Cancer epidemiology, biomarkers & prevention : a publication of the American Association for Cancer Research, cosponsored by the American Society of Preventive Oncology 7, 783-789 (1998). Stolk, L. et al. 
Meta-analyses identify 13 loci associated with age at menopause and highlight DNA repair and immune pathways. Nature genetics 44, 260-268 (2012). Bolton, K.L., Ganda, C., Berchuck, A., Pharaoh, P.D. & Gayther, S.A. Role of common genetic variants in ovarian cancer susceptibility and outcome: progress to date from the Ovarian Cancer Association Consortium (OCAC). J Intern Med 271, 366-78 (2012). Voorhuis, M., Onland-Moret, N.C., van der Schouw, Y.T., Fauser, B.C. & Broekmans, F.J. Human studies on genetics of the age at natural menopause: a systematic review. Human reproduction update 16, 364-377 (2010). Klein, R.J. et al. Complement factor H polymorphism in age-related macular degeneration. Science 308, 385-9 (2005). Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661-78 (2007). Visscher, P.M., Brown, M.A., McCarthy, M.I. & Yang, J. Five years of GWAS discovery. Am J Hum Genet 90, 7-24 (2012). Slatkin, M. Linkage disequilibrium--understanding the evolutionary past and mapping the medical future. Nature reviews.Genetics 9, 477-485 (2008). Wall, J.D. & Pritchard, J.K. Assessing the performance of the haplotype block model of linkage disequilibrium. American Journal of Human Genetics 73, 502-515 (2003). International HapMap, C. A haplotype map of the human genome. Nature 437, 12991320 (2005). McCarthy, M.I. et al. Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nature reviews.Genetics 9, 356-369 (2008). Hill, W.G. & Robertson, A. The effect of linkage on limits to artificial selection. Genetic Research, Genet. Res. 89, 311 (1966). Hinds, D.A. et al. Whole-genome patterns of common DNA variation in three human populations. Science (New York, N.Y.) 307, 1072-1079 (2005). Ioannidis, J.P., Thomas, G. & Daly, M.J. Validating, augmenting and refining genome-wide association signals. Nature reviews.Genetics 10, 318-329 (2009). Steemers, F.J. & Gunderson, K.L. 
Whole genome genotyping technologies on the BeadArray platform. Biotechnology journal 2, 41-49 (2007). 166  19. 20. 21. 22. 23. 24. 25.  26.  27. 28. 29. 30. 31. 32.  33.  34. 35.  36. 37.  38.  Skol, A.D., Scott, L.J., Abecasis, G.R. & Boehnke, M. Optimal designs for two-stage genome-wide association studies. Genetic epidemiology 31, 776-788 (2007). Reich, D.E. & Lander, E.S. On the allelic spectrum of human disease. Trends Genet 17, 502-10 (2001). Gibson, G. Rare and common variants: twenty arguments. Nat Rev Genet 13, 135-45 (2011). Dunlop, M.G. et al. Common variation near CDKN1A, POLD3 and SHROOM2 influences colorectal cancer risk. Nat Genet 44, 770-6 (2012). Lu, X. et al. Genome-wide association study in Han Chinese identifies four new susceptibility loci for coronary artery disease. Nat Genet 44, 890-4 (2012). Turnbull, C. et al. A genome-wide association study identifies susceptibility loci for Wilms tumor. Nat Genet 44, 681-4 (2012). Do, C.B. et al. Web-based genome-wide association study identifies two novel loci and a substantial genetic component for Parkinson's disease. PLoS Genet 7, e1002141 (2011). Skol, A.D., Scott, L.J., Abecasis, G.R. & Boehnke, M. Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nature genetics 38, 209-213 (2006). Thomas, D.C. et al. Methodological Issues in Multistage Genome-wide Association Studies. Stat Sci 24, 414-429 (2009). Chanock, S.J. & Thomas, G. The devil is in the DNA. Nature genetics 39, 283-284 (2007). Ioannidis, J.P. Non-replication and inconsistency in the genome-wide association setting. Human heredity 64, 203-213 (2007). Liu, Y.J., Papasian, C.J., Liu, J.F., Hamilton, J. & Deng, H.W. Is replication the gold standard for validating genome-wide association findings? PloS one 3, e4037 (2008). Kraft, P. Curses--winner's and otherwise--in genetic epidemiology. Epidemiology (Cambridge, Mass.) 19, 649-51; discussion 657-8 (2008). Zhong, H. & Prentice, R.L. 
Bias-reduced estimators and confidence intervals for odds ratios in genome-wide association studies. Biostatistics (Oxford, England) 9, 621-634 (2008). Zollner, S. & Pritchard, J.K. Overcoming the winner's curse: estimating penetrance parameters from case-control data. American Journal of Human Genetics 80, 605-615 (2007). Lander, E.S. & Botstein, D. Mapping mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121, 185-199 (1989). Kosoy, R. et al. Ancestry informative marker sets for determining continental origin and admixture proportions in common populations in America. Human mutation 30, 69-78 (2009). Skibola, C.F. et al. Genetic variants at 6p21.33 are associated with susceptibility to follicular lymphoma. Nature genetics 41, 873-875 (2009). Bostrom, M.A. et al. Candidate genes for non-diabetic ESRD in African Americans: a genome-wide association study using pooled DNA. Human genetics 128, 195-204 (2010). Cho, Y.S. et al. Meta-analysis of genome-wide association studies identifies eight new loci for type 2 diabetes in east Asians. Nature genetics 44, 67-72 (2011). 167  39. 40. 41.  42.  43.  44. 45. 46. 47.  48.  49.  50. 51.  52. 53. 54. 55. 56. 57. 58.  Soler Artigas, M. et al. Genome-wide association and large-scale follow up identifies 16 new loci influencing lung function. Nature genetics 43, 1082-1090 (2011). Goode, E.L. et al. A genome-wide association study identifies susceptibility loci for ovarian cancer at 2q31 and 8q24. Nature genetics 42, 874-879 (2010). Pearson, J.V. et al. Identification of the genetic basis for complex disorders by use of pooling-based genomewide single-nucleotide-polymorphism association studies. American Journal of Human Genetics 80, 126-139 (2007). Schrauwen, I. et al. A genome-wide analysis identifies genetic variants in the RELN gene associated with otosclerosis. American Journal of Human Genetics 84, 328-338 (2009). Comabella, M. et al. 
Identification of a novel risk locus for multiple sclerosis at 13q31.3 by a pooled genome-wide scan of 500,000 single nucleotide polymorphisms. PloS one 3, e3490 (2008). Abraham, R. et al. A genome-wide association study for late-onset Alzheimer's disease using DNA pooling. BMC medical genomics 1, 44 (2008). Capon, F. et al. Identification of ZNF313/RNF114 as a novel psoriasis susceptibility gene. Human molecular genetics 17, 1938-1945 (2008). Stokowski, R.P. et al. A genomewide association study of skin pigmentation in a South Asian population. American Journal of Human Genetics 81, 1119-1132 (2007). Macgregor, S. Most pooling variation in array-based DNA pooling is attributable to array error rather than pool construction error. European journal of human genetics : EJHG 15, 501-504 (2007). Macgregor, S., Visscher, P.M. & Montgomery, G. Analysis of pooled DNA samples on high density arrays without prior knowledge of differential hybridization rates. Nucleic acids research 34, e55 (2006). Earp, M.A., Rahmani, M., Chew, K. & Brooks-Wilson, A. Estimates of array and pool-construction variance for planning efficient DNA-pooling genome wide association studies. BMC medical genomics 4, 81 (2011). Brown, K.M. et al. Common sequence variants on 20q11.22 confer melanoma susceptibility. Nature genetics 40, 838-840 (2008). Hindorff, L.A. et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proceedings of the National Academy of Sciences of the United States of America 106, 9362-9367 (2009). Freedman, M.L. et al. Principles for the post-GWAS functional characterization of cancer risk loci. Nat Genet 43, 513-8 (2011). Marchini, J. & Howie, B. Genotype imputation for genome-wide association studies. Nat Rev Genet 11, 499-511 (2010). Via, M., Gignoux, C. & Burchard, E.G. The 1000 Genomes Project: new opportunities for research and social challenges. Genome Med 2, 3 (2010). Saccone, N.L. et al. 
In search of causal variants: refining disease association signals using cross-population contrasts. BMC Genet 9, 58 (2008). Jia, L. et al. Functional enhancers at the gene-poor 8q24 cancer-linked locus. PLoS Genet 5, e1000597 (2009). Pomerantz, M.M. et al. The 8q24 cancer risk variant rs6983267 shows long-range interaction with MYC in colorectal cancer. Nat Genet 41, 882-4 (2009). Heintzman, N.D. et al. Histone modifications at human enhancers reflect global cell168  59. 60. 61. 62.  63. 64. 65. 66.  67. 68. 69.  70. 71. 72.  73.  74.  75. 76.  77. 78.  type-specific gene expression. Nature 459, 108-12 (2009). Morley, M. et al. Genetic analysis of genome-wide variation in human gene expression. Nature 430, 743-7 (2004). Rockman, M.V. & Kruglyak, L. Genetics of global gene expression. Nat Rev Genet 7, 862-72 (2006). Cheung, V.G. & Spielman, R.S. Genetics of human gene expression: mapping DNA variants that influence gene expression. Nat Rev Genet 10, 595-604 (2009). Rioux, J.D. et al. Genome-wide association study identifies new susceptibility loci for Crohn disease and implicates autophagy in disease pathogenesis. Nat Genet 39, 596604 (2007). Van Limbergen, J., Wilson, D.C. & Satsangi, J. The genetics of Crohn's disease. Annu Rev Genomics Hum Genet 10, 89-116 (2009). Rong, C., Hu, W., Wu, F.R., Cao, X.J. & Chen, F.H. Interleukin-23 as a potential therapeutic target for rheumatoid arthritis. Mol Cell Biochem 361, 243-8 (2012). Stranger, B.E., Stahl, E.A. & Raj, T. Progress and promise of genome-wide association studies for human complex trait genetics. Genetics 187, 367-83 (2011). Maier, L.M. et al. IL2RA genetic heterogeneity in multiple sclerosis and type 1 diabetes susceptibility and soluble interleukin-2 receptor production. PLoS Genet 5, e1000322 (2009). Ghoussaini, M. et al. Multiple loci with different cancer specificities within the 8q24 gene desert. J Natl Cancer Inst 100, 962-6 (2008). Hunter, D.J. & Kraft, P. 
Drinking from the fire hose--statistical issues in genomewide association studies. N Engl J Med 357, 436-9 (2007). Kurman, R.J. & Shih, I. The origin and pathogenesis of epithelial ovarian cancer: a proposed unifying theory. The American Journal of Surgical Pathology 34, 433-443 (2010). Howlader, N. et al. SEER Cancer Statistics Review, 1975-2008. (2011). Mok, S.C. et al. Etiology and pathogenesis of epithelial ovarian cancer. Disease markers 23, 367-376 (2007). Auersperg, N., Wong, A.S., Choi, K.C., Kang, S.K. & Leung, P.C. Ovarian surface epithelium: biology, endocrinology, and pathology. Endocrine reviews 22, 255-288 (2001). Salvador, S. et al. The fallopian tube: primary site of most pelvic high-grade serous carcinomas. International journal of gynecological cancer : official journal of the International Gynecological Cancer Society 19, 58-64 (2009). Landen, C.N., Jr., Birrer, M.J. & Sood, A.K. Early events in the pathogenesis of epithelial ovarian cancer. Journal of clinical oncology : official journal of the American Society of Clinical Oncology 26, 995-1005 (2008). Crum, C.P. et al. The distal fallopian tube: a new model for pelvic serous carcinogenesis. Current opinion in obstetrics & gynecology 19, 3-9 (2007). Kalloger, S.E. et al. Calculator for ovarian carcinoma subtype prediction. Modern pathology : an official journal of the United States and Canadian Academy of Pathology, Inc (2010). Kobel, M. et al. Ovarian carcinoma subtypes are different diseases: implications for biomarker studies. PLoS medicine 5, e232 (2008). Shan, W. & Liu, J. Inflammation: a hidden path to breaking the spell of ovarian 169  79.  80. 81. 82. 83. 84.  85.  86.  87. 88. 89. 90. 91.  92. 93. 94.  95. 96. 97.  cancer. Cell cycle (Georgetown, Tex.) 8, 3107-3111 (2009). Gilks, C.B. et al. Tumor cell type can be reproducibly diagnosed and is of independent prognostic significance in patients with maximally debulked ovarian carcinoma. Human pathology 39, 1239-1251 (2008). Malpica, A. 
et al. Grading ovarian serous carcinoma using a two-tier system. The American Journal of Surgical Pathology 28, 496-504 (2004). Gilks, C.B. & Prat, J. Ovarian carcinoma pathology and genetics: recent advances. Human pathology 40, 1213-1223 (2009). Rosen, D.G. et al. Ovarian cancer: pathology, biology, and disease models. Frontiers in bioscience : a journal and virtual library 14, 2089-2102 (2009). Integrated genomic analyses of ovarian carcinoma. Nature 474, 609-15 (2011). Plisiecka-Halasa, J. et al. P21WAF1, P27KIP1, TP53 and C-MYC analysis in 204 ovarian carcinomas treated with platinum-based regimens. Annals of Oncology : Official Journal of the European Society for Medical Oncology / ESMO 14, 10781085 (2003). Crijns, A.P. et al. Molecular prognostic markers in ovarian cancer: toward patienttailored therapy. International journal of gynecological cancer : official journal of the International Gynecological Cancer Society 16 Suppl 1, 152-165 (2006). Yuan, Z.Q. et al. Frequent activation of AKT2 and induction of apoptosis by inhibition of phosphoinositide-3-OH kinase/Akt pathway in human ovarian cancer. Oncogene 19, 2324-2330 (2000). Nakayama, K. et al. Sequence mutations and amplification of PIK3CA and AKT2 genes in purified ovarian serous neoplasms. Cancer Biol Ther 5, 779-85 (2006). Singer, G. et al. Mutations in BRAF and KRAS characterize the development of lowgrade ovarian serous carcinoma. J Natl Cancer Inst 95, 484-6 (2003). Landen, C.N., Jr., Birrer, M.J. & Sood, A.K. Early events in the pathogenesis of epithelial ovarian cancer. J Clin Oncol 26, 995-1005 (2008). Mok, S.C. et al. Mutation of K-ras protooncogene in human ovarian epithelial tumors of borderline malignancy. Cancer research 53, 1489-1492 (1993). Enomoto, T., Weghorst, C.M., Inoue, M., Tanizawa, O. & Rice, J.M. K-ras activation occurs frequently in mucinous adenocarcinomas and rarely in other common epithelial tumors of the human ovary. The American journal of pathology 139, 777785 (1991). 
Okamoto, T. et al. Distinguishing primary from secondary mucinous ovarian tumors: an algorithm using the novel marker DPEP1. Mod Pathol 24, 267-76 (2011). Shan, W. & Liu, J. Epithelial ovarian cancer: focus on genetics and animal models. Cell cycle (Georgetown, Tex.) 8, 731-735 (2009). Zwart, J., Geisler, J.P. & Geisler, H.E. Five-year survival in patients with endometrioid carcinoma of the ovary versus those with serous carcinoma. European journal of gynaecological oncology 19, 225-228 (1998). Campbell, I.G. et al. Mutation of the PIK3CA gene in ovarian and breast cancer. Cancer Res 64, 7678-81 (2004). Kurman, R.J. & Shih Ie, M. Molecular pathogenesis and extraovarian origin of epithelial ovarian cancer--shifting the paradigm. Hum Pathol 42, 918-31 (2011). Willner, J. et al. Alternate molecular genetic pathways in ovarian carcinomas of common histological types. Hum Pathol 38, 607-13 (2007). 170  98. 99. 100.  101. 102. 103. 104.  105.  106.  107.  108.  109. 110. 111.  112.  113. 114.  115.  Palacios, J. & Gamallo, C. Mutations in the beta-catenin gene (CTNNB1) in endometrioid ovarian carcinomas. Cancer Res 58, 1344-7 (1998). Wiegand, K.C. et al. ARID1A mutations in endometriosis-associated ovarian carcinomas. The New England journal of medicine 363, 1532-1543 (2010). Pearce, C.L. et al. Association between endometriosis and risk of histological subtypes of ovarian cancer: a pooled analysis of case-control studies. The lancet oncology (2012). Kuo, K.T. et al. Frequent activating mutations of PIK3CA in ovarian clear cell carcinoma. Am J Pathol 174, 1597-601 (2009). Fathalla, M.F. Incessant ovulation--a factor in ovarian neoplasia? Lancet 2, 163 (1971). Cramer, D.W. & Welch, W.R. Determinants of ovarian cancer risk. II. Inferences regarding pathogenesis. Journal of the National Cancer Institute 71, 717-721 (1983). Risch, H.A. Hormonal etiology of epithelial ovarian cancer, with a hypothesis concerning the role of androgens and progesterone. 
Journal of the National Cancer Institute 90, 1774-1786 (1998). Schildkraut, J.M., Schwingl, P.J., Bastos, E., Evanoff, A. & Hughes, C. Epithelial ovarian cancer risk among women with polycystic ovary syndrome. Obstetrics and gynecology 88, 554-559 (1996). Nash, M.A., Ferrandina, G., Gordinier, M., Loercher, A. & Freedman, R.S. The role of cytokines in both the normal and malignant ovary. Endocrine-related cancer 6, 93107 (1999). Modugno, F. et al. Oral contraceptive use, reproductive history, and risk of epithelial ovarian cancer in women with and without endometriosis. American Journal of Obstetrics and Gynecology 191, 733-740 (2004). Escobar-Morreale, H.F., Luque-Ramirez, M. & Gonzalez, F. Circulating inflammatory markers in polycystic ovary syndrome: a systematic review and metaanalysis. Fertility and sterility 95, 1048-58.e1-2 (2011). Hanahan, D. & Weinberg, R.A. Hallmarks of cancer: the next generation. Cell 144, 646-674 (2011). Mantovani, A., Allavena, P., Sica, A. & Balkwill, F. Cancer-related inflammation. Nature 454, 436-444 (2008). Parazzini, F. et al. Pelvic inflammatory disease and risk of ovarian cancer. Cancer epidemiology, biomarkers & prevention : a publication of the American Association for Cancer Research, cosponsored by the American Society of Preventive Oncology 5, 667-669 (1996). Risch, H.A. & Howe, G.R. Pelvic inflammatory disease and the risk of epithelial ovarian cancer. Cancer epidemiology, biomarkers & prevention : a publication of the American Association for Cancer Research, cosponsored by the American Society of Preventive Oncology 4, 447-451 (1995). Ness, R.B. et al. Factors related to inflammation of the ovarian epithelium and risk of ovarian cancer. Epidemiology (Cambridge, Mass.) 11, 111-117 (2000). Maisey, K. et al. Expression of proinflammatory cytokines and receptors by human fallopian tubes in organ culture following challenge with Neisseria gonorrhoeae. Infection and immunity 71, 527-532 (2003). McGee, Z.A. et al. 
Gonococcal infection of human fallopian tube mucosa in organ 171  116.  117. 118.  119.  120.  121. 122.  123. 124.  125.  126.  127.  128. 129. 130. 131.  culture: relationship of mucosal tissue TNF-alpha concentration to sloughing of ciliated cells. Sexually transmitted diseases 26, 160-165 (1999). Berger, D. et al. Genetic variants of insulin receptor substrate-1 (IRS-1) in syndromes of severe insulin resistance. Functional analysis of Ala513Pro and Gly1158Glu IRS1. Diabetic medicine : a journal of the British Diabetic Association 19, 804-809 (2002). Brinton, L.A. et al. Ovarian cancer risk after the use of ovulation-stimulating drugs. Obstetrics and gynecology 103, 1194-1203 (2004). Schiffenbauer, Y.S. et al. Loss of ovarian function promotes angiogenesis in human ovarian carcinoma. Proceedings of the National Academy of Sciences of the United States of America 94, 13203-13208 (1997). Wang, T.H. et al. Human chorionic gonadotropin-induced ovarian hyperstimulation syndrome is associated with up-regulation of vascular endothelial growth factor. The Journal of clinical endocrinology and metabolism 87, 3300-3308 (2002). Schiffenbauer, Y.S. et al. Gonadotropin stimulation of MLS human epithelial ovarian carcinoma cells augments cell adhesion mediated by CD44 and by alpha(v)-integrin. Gynecologic oncology 84, 296-302 (2002). Pennington, K.P. & Swisher, E.M. Hereditary ovarian cancer: beyond the usual suspects. Gynecol Oncol 124, 347-53 (2012). Shulman, L.P. Hereditary breast and ovarian cancer (HBOC): clinical features and counseling for BRCA1 and BRCA2, Lynch syndrome, Cowden syndrome, and LiFraumeni syndrome. Obstetrics and gynecology clinics of North America 37, 109-33, Table of Contents (2010). Narod, S. et al. Histology of BRCA1-associated ovarian tumours. Lancet 343, 236 (1994). Boyd, J. et al. Clinicopathologic features of BRCA-linked and sporadic ovarian cancer. JAMA : the journal of the American Medical Association 283, 2260-2265 (2000). Risch, H.A. et al. 
Prevalence and penetrance of germline BRCA1 and BRCA2 mutations in a population series of 649 women with ovarian cancer. American Journal of Human Genetics 68, 700-710 (2001). Werness, B.A. et al. Histopathology, FIGO stage, and BRCA mutation status of ovarian cancers from the Gilda Radner Familial Ovarian Cancer Registry. International journal of gynecological pathology : official journal of the International Society of Gynecological Pathologists 23, 29-34 (2004). Lichtenstein, P. et al. Environmental and heritable factors in the causation of cancer-analyses of cohorts of twins from Sweden, Denmark, and Finland. The New England journal of medicine 343, 78-85 (2000). Gayther, S.A. & Pharoah, P.D. The inherited genetics of ovarian and endometrial cancer. Current opinion in genetics & development 20, 231-238 (2010). Antoniou, A.C. & Easton, D.F. Models of genetic susceptibility to breast cancer. Oncogene 25, 5898-5905 (2006). Song, H. et al. A genome-wide association study identifies a new ovarian cancer susceptibility locus on 9p22.2. Nature genetics 41, 996-1000 (2009). Ramus, S.J. et al. Genetic variation at 9p22.2 and ovarian cancer risk for BRCA1 and BRCA2 mutation carriers. Journal of the National Cancer Institute 103, 105-116 172  132.  133.  134.  135. 136. 137. 138.  139. 140.  141. 142. 143.  144. 145.  146. 147. 148.  (2011). Romano, R.A., Li, H., Tummala, R., Maul, R. & Sinha, S. Identification of Basonuclin2, a DNA-binding zinc-finger protein expressed in germ tissues and skin keratinocytes. Genomics 83, 821-833 (2004). Vanhoutteghem, A. & Djian, P. Basonuclins 1 and 2, whose genes share a common origin, are proteins with widely different properties and functions. Proceedings of the National Academy of Sciences of the United States of America 103, 12423-12428 (2006). Wentzensen, N. et al. Genetic variation on 9p22 is associated with abnormal ovarian ultrasound results in the Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial. 
PloS one 6, e21731 (2011).
135. Ghoussaini, M. et al. Multiple loci with different cancer specificities within the 8q24 gene desert. Journal of the National Cancer Institute 100, 962-966 (2008).
136. Pomerantz, M.M. et al. The 8q24 cancer risk variant rs6983267 shows long-range interaction with MYC in colorectal cancer. Nature genetics 41, 882-884 (2009).
137. Huppi, K., Pitt, J.J., Wahlberg, B.M. & Caplen, N.J. The 8q24 gene desert: an oasis of non-coding transcriptional activity. Front Genet 3, 69 (2012).
138. Grisanzio, C. et al. Genetic and functional analyses implicate the NUDT11, HNF1B, and SLC22A3 genes in prostate cancer pathogenesis. Proc Natl Acad Sci U S A 109, 11252-7 (2012).
139. Beroukhim, R. et al. The landscape of somatic copy-number alteration across human cancers. Nature 463, 899-905 (2010).
140. Okubo, Y. et al. Transduction of HOXD3-antisense into human melanoma cells results in decreased invasive and motile activities. Clinical & experimental metastasis 19, 503-511 (2002).
141. Beesley, J. et al. Functional polymorphisms in the TERT promoter are associated with risk of serous epithelial ovarian and breast cancers. PloS one 6, e24987 (2011).
142. Farmer, H. et al. Targeting the DNA repair defect in BRCA mutant cells as a therapeutic strategy. Nature 434, 917-921 (2005).
143. Kosco, K.A., Cerignoli, F., Williams, S., Abraham, R.T. & Mustelin, T. SKAP55 modulates T cell antigen receptor-induced activation of the Ras-Erk-AP1 pathway by binding RasGRP1. Mol Immunol 45, 510-22 (2008).
144. Bolton, K.L. et al. Common variants at 19p13 are associated with susceptibility to ovarian cancer. Nature genetics 42, 880-884 (2010).
145. Couch, F.J. et al. Common Variants at the 19p13.1 and ZNF365 Loci Are Associated with ER Subtypes of Breast Cancer and Ovarian Cancer Risk in BRCA1 and BRCA2 Mutation Carriers. Cancer epidemiology, biomarkers & prevention : a publication of the American Association for Cancer Research, cosponsored by the American Society of Preventive Oncology (2012).
146. Feng, L., Huang, J. & Chen, J.
MERIT40 facilitates BRCA1 localization and DNA damage repair. Genes & development 23, 719-728 (2009).
147. Shao, G. et al. MERIT40 controls BRCA1-Rap80 complex integrity and recruitment to DNA double-strand breaks. Genes & development 23, 740-754 (2009).
148. Quaye, L. et al. Association between common germline genetic variation in 94 candidate genes or regions and risks of invasive epithelial ovarian cancer. PloS one 4, e5983 (2009).
149. Johnatty, S.E. et al. Evaluation of candidate stromal epithelial cross-talk genes identifies association between risk of serous ovarian cancer and TERT, a cancer susceptibility "hot-spot". PLoS genetics 6, e1001016 (2010).
150. Terry, K.L. et al. Telomere length and genetic variation in telomere maintenance genes in relation to ovarian cancer risk. Cancer Epidemiol Biomarkers Prev 21, 504-12 (2012).
151. te Velde, E.R., Dorland, M. & Broekmans, F.J. Age at menopause as a marker of reproductive ageing. Maturitas 30, 119-125 (1998).
152. Voorhuis, M., Broekmans, F.J., Fauser, B.C., Onland-Moret, N.C. & van der Schouw, Y.T. Genes involved in initial follicle recruitment may be associated with age at menopause. The Journal of clinical endocrinology and metabolism 96, E473-9 (2011).
153. van der Schouw, Y.T., van der Graaf, Y., Steyerberg, E.W., Eijkemans, J.C. & Banga, J.D. Age at menopause as a risk factor for cardiovascular mortality. Lancet 347, 714-718 (1996).
154. Shuster, L.T., Rhodes, D.J., Gostout, B.S., Grossardt, B.R. & Rocca, W.A. Premature menopause or early menopause: long-term health consequences. Maturitas 65, 161-166 (2010).
155. Gallagher, J.C. Effect of early menopause on bone mineral density and fractures. Menopause (New York, N.Y.) 14, 567-571 (2007).
156. Shin, A., Song, Y.M., Yoo, K.Y. & Sung, J. Menstrual factors and cancer risk among Korean women. International journal of epidemiology (2011).
157. Henderson, K.D., Bernstein, L., Henderson, B., Kolonel, L.
& Pike, M.C. Predictors of the timing of natural menopause in the Multiethnic Cohort Study. American Journal of Epidemiology 167, 1287-1294 (2008).
158. Morabia, A. & Costanza, M.C. International variability in ages at menarche, first livebirth, and menopause. World Health Organization Collaborative Study of Neoplasia and Steroid Contraceptives. American Journal of Epidemiology 148, 1195-1205 (1998).
159. Treloar, S.A., Do, K.A. & Martin, N.G. Genetic influences on the age at menopause. Lancet 352, 1084-1085 (1998).
160. Morris, D.H., Jones, M.E., Schoemaker, M.J., Ashworth, A. & Swerdlow, A.J. Familial concordance for age at natural menopause: results from the Breakthrough Generations Study. Menopause (New York, N.Y.) (2011).
161. Murabito, J.M., Yang, Q., Fox, C., Wilson, P.W. & Cupples, L.A. Heritability of age at natural menopause in the Framingham Heart Study. The Journal of clinical endocrinology and metabolism 90, 3427-3430 (2005).
162. van Asselt, K.M. et al. Heritability of menopausal age in mothers and daughters. Fertility and sterility 82, 1348-1351 (2004).
163. de Bruin, J.P. et al. The role of genetic factors in age at natural menopause. Human reproduction (Oxford, England) 16, 2014-2018 (2001).
164. Snieder, H., MacGregor, A.J. & Spector, T.D. Genes control the cessation of a woman's reproductive life: a twin study of hysterectomy and age at menopause. The Journal of clinical endocrinology and metabolism 83, 1875-1880 (1998).
165. van Noord, P.A., Dubas, J.S., Dorland, M., Boersma, H. & te Velde, E. Age at natural menopause in a population-based screening cohort: the role of menarche, fecundity, and lifestyle factors. Fertility and sterility 68, 95-102 (1997).
166. te Velde, E.R. & Pearson, P.L. The variability of female reproductive ageing. Human reproduction update 8, 141-154 (2002).
167. Gates, M.A., Rosner, B.A., Hecht, J.L. & Tworoger, S.S.
Risk factors for epithelial ovarian cancer by histologic subtype. Am J Epidemiol 171, 45-53 (2010).
168. Titus-Ernstoff, L. et al. Menstrual and reproductive factors in relation to ovarian cancer risk. Br J Cancer 84, 714-21 (2001).
169. He, C. et al. Genome-wide association studies identify loci associated with age at menarche and age at natural menopause. Nature genetics 41, 724-728 (2009).
170. Stolk, L. et al. Loci at chromosomes 13, 19 and 20 influence age at natural menopause. Nature genetics 41, 645-647 (2009).
171. Park, J.H. et al. Estimation of effect size distribution from genome-wide association studies and implications for future discoveries. Nat Genet 42, 570-5 (2010).
172. Lunetta, K.L. et al. Genetic correlates of longevity and selected age-related phenotypes: a genome-wide association study in the Framingham Study. BMC medical genetics 8 Suppl 1, S13 (2007).
173. van Asselt, K.M. et al. Linkage analysis of extremely discordant and concordant sibling pairs identifies quantitative trait loci influencing variation in human menopausal age. American Journal of Human Genetics 74, 444-453 (2004).
174. Murabito, J.M., Yang, Q., Fox, C.S. & Cupples, L.A. Genome-wide linkage analysis to age at natural menopause in a community-based sample: the Framingham Heart Study. Fertility and sterility 84, 1674-1679 (2005).
175. Lee, C.K., Allison, D.B., Brand, J., Weindruch, R. & Prolla, T.A. Transcriptional profiles associated with aging and middle age-onset caloric restriction in mouse hearts. Proc Natl Acad Sci U S A 99, 14988-93 (2002).
176. Hamatani, T. et al. Age-associated alteration of gene expression patterns in mouse oocytes. Hum Mol Genet 13, 2263-78 (2004).
177. He, L.N. et al. Association study of the oestrogen signalling pathway genes in relation to age at natural menopause. Journal of genetics 86, 269-276 (2007).
178. Weel, A.E. et al. Estrogen receptor polymorphism predicts the onset of natural and surgical menopause. The Journal of clinical endocrinology and metabolism 84, 3146-3150 (1999).
179. Qin, Y. et al.
NOBOX homeobox mutation causes premature ovarian failure. American Journal of Human Genetics 81, 576-581 (2007).
180. Rajkovic, A., Pangas, S.A., Ballow, D., Suzumori, N. & Matzuk, M.M. NOBOX deficiency disrupts early folliculogenesis and oocyte-specific gene expression. Science (New York, N.Y.) 305, 1157-1159 (2004).
181. Gold, E.B. et al. Factors associated with age at natural menopause in a multiethnic sample of midlife women. American Journal of Epidemiology 153, 865-874 (2001).
182. Chen, C.T. et al. Replication of loci influencing ages at menarche and menopause in Hispanic women: the Women's Health Initiative SHARe Study. Human molecular genetics 21, 1419-1432 (2012).
183. Maiorano, D., Lutzmann, M. & Mechali, M. MCM proteins and DNA replication. Current opinion in cell biology 18, 130-136 (2006).
184. Craig, J.E. et al. Rapid inexpensive genome-wide association using pooled whole blood. Genome Res 19, 2075-80 (2009).
185. Pearson, T.A. & Manolio, T.A. How to interpret a genome-wide association study. JAMA 299, 1335-44 (2008).
186. Visscher, P.M. & Le Hellard, S. Simple method to analyze SNP-based association studies using DNA pools. Genet Epidemiol 24, 291-6 (2003).
187. Macgregor, S. et al. Highly cost-efficient genome-wide association studies using DNA pools and dense SNP arrays. Nucleic acids research 36, e35 (2008).
188. Barratt, B.J. et al. Identification of the sources of error in allele frequency estimations from pooled DNA indicates an optimal experimental design. Annals of Human Genetics 66, 393-405 (2002).
189. Gauderman, W.J. & Morrison, J.M. Quanto 1.1: a computer program for power and sample size calculations for genetic epidemiology studies (http://hydra.usc.edu/gxe). (2006).
190. Kuhn, K. et al. A novel, high-performance random array platform for quantitative gene expression profiling. Genome research 14, 2347-2356 (2004).
191. Jawaid, A. & Sham, P.
Impact and quantification of the sources of error in DNA pooling designs. Annals of Human Genetics 73, 118-124 (2009).
192. Leek, J.T. et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet 11, 733-9 (2010).
193. Kobel, M. et al. Diagnosis of ovarian carcinoma cell type is highly reproducible: a transcanadian study. The American Journal of Surgical Pathology 34, 984-993 (2010).
194. Gomez-Raposo, C., Mendiola, M., Barriuso, J., Hardisson, D. & Redondo, A. Molecular characterization of ovarian cancer by gene-expression profiling. Gynecologic oncology 118, 88-92 (2010).
195. Varghese, J.S. & Easton, D.F. Genome-wide association studies in common cancers-what have we learnt? Current opinion in genetics & development 20, 201-209 (2010).
196. Sambrook J, R.D. Molecular Cloning: A Laboratory Manual (Third Edition), (Cold Spring Harbor Laboratory Press, USA, 2000).
197. Gaggioli, A. et al. Computer-guided mental practice in neurorehabilitation. Stud Health Technol Inform 145, 195-208 (2009).
198. Homer, N. et al. Multimarker analysis and imputation of multiple platform pooling-based genome-wide association studies. Bioinformatics (Oxford, England) 24, 1896-1902 (2008).
199. Butcher, L.M. et al. SNPs, microarrays and pooled DNA: identification of four loci associated with mild mental impairment in a sample of 6000 children. Human molecular genetics 14, 1315-1325 (2005).
200. Ramos, A., Santos, C., Alvarez, L., Nogues, R. & Aluja, M.P. Human mitochondrial DNA complete amplification and sequencing: a new validated primer set that prevents nuclear DNA sequences of mitochondrial origin co-amplification. Electrophoresis 30, 1587-1593 (2009).
201. Hua, J. et al. SNiPer-HD: improved genotype calling accuracy by an expectation-maximization algorithm for high-density SNP arrays. Bioinformatics (Oxford, England) 23, 57-63 (2007).
202. Heaphy, C.M. et al. Prevalence of the alternative lengthening of telomeres telomere maintenance mechanism in human cancer subtypes.
Am J Pathol 179, 1608-15 (2011).
203. Guan, B. et al. Mutation and loss of expression of ARID1A in uterine low-grade endometrioid carcinoma. Am J Surg Pathol 35, 625-32 (2011).
204. Yamamura, S. et al. The activated transforming growth factor-beta signaling pathway in peritoneal metastases is a potential therapeutic target in ovarian cancer. Int J Cancer 130, 20-8 (2012).
205. de Paula, S. et al. Hemispheric brain injury and behavioral deficits induced by severe neonatal hypoxia-ischemia in rats are not attenuated by intravenous administration of human umbilical cord blood cells. Pediatr Res 65, 631-5 (2009).
206. Santoro, N. et al. Prevalence of pathogenetic MC4R mutations in Italian children with early onset obesity, tall stature and familial history of obesity. BMC Med Genet 10, 25 (2009).
207. Bosse, Y. et al. Identification of susceptibility genes for complex diseases using pooling-based genome-wide association scans. Hum Genet 125, 305-18 (2009).
208. Lu-Emerson, C. et al. Retrospective study of dasatinib for recurrent glioblastoma after bevacizumab failure. J Neurooncol 104, 287-91 (2011).
209. Min, J. et al. An oncogene-tumor suppressor cascade drives metastatic prostate cancer by coordinately activating Ras and nuclear factor-kappaB. Nat Med 16, 286-94 (2010).
210. Galeone, C. et al. Onion and garlic use and human cancer. Am J Clin Nutr 84, 1027-32 (2006).
211. Galeone, C. et al. Folate intake and squamous-cell carcinoma of the oesophagus in Italian and Swiss men. Ann Oncol 17, 521-5 (2006).
212. Pelucchi, C. et al. Dietary acrylamide and human cancer. Int J Cancer 118, 467-71 (2006).
213. Li, Y. et al. The R1441C mutation alters the folding properties of the ROC domain of LRRK2. Biochim Biophys Acta 1792, 1194-7 (2009).
214. Greggio, E. et al. The Parkinson's disease kinase LRRK2 autophosphorylates its GTPase domain at multiple sites. Biochem Biophys Res Commun 389, 449-54 (2009).
215. Pisano, C. et al.
A phase II study of capecitabine in the treatment of ovarian cancer resistant or refractory to platinum therapy: a multicentre Italian trial in ovarian cancer (MITO-6) trial. Cancer Chemother Pharmacol 64, 1021-7 (2009).
216. De Marzi, P. et al. Adjuvant treatment with concomitant radiotherapy and chemotherapy in high-risk endometrial cancer: a clinical experience. Gynecol Oncol 116, 408-12 (2010).
217. Blackinton, J. et al. Formation of a stabilized cysteine sulfinic acid is critical for the mitochondrial function of the parkinsonism protein DJ-1. J Biol Chem 284, 6476-85 (2009).
218. Easton, D.F. & Eeles, R.A. Genome-wide association studies in cancer. Hum Mol Genet 17, R109-15 (2008).
219. Faddy, M.J., Gosden, R.G., Gougeon, A., Richardson, S.J. & Nelson, J.F. Accelerated disappearance of ovarian follicles in mid-life: implications for forecasting menopause. Human reproduction (Oxford, England) 7, 1342-1346 (1992).
220. Richardson, S.J. The biological basis of the menopause. Baillière's clinical endocrinology and metabolism 7, 1 (1993).
221. Treloar, A.E. Menstrual cyclicity and the pre-menopause. Maturitas 3, 249-264 (1981).
222. Azizi, F. et al. Prevention of non-communicable disease in a population in nutrition transition: Tehran Lipid and Glucose Study phase II. Trials 10, 5 (2009).
223. Azizi, F. et al. Cardiovascular risk factors in an Iranian urban population: Tehran lipid and glucose study (phase 1). Sozial- und Praventivmedizin 47, 408-426 (2002).
224. Paschou, P., Lewis, J., Javed, A. & Drineas, P. Ancestry informative markers for fine-scale individual assignment to worldwide populations. Journal of medical genetics 47, 835-847 (2010).
225. Sham Pc, B.J.S.C.I.O.D.M.O.M. DNA Pooling: A tool for large-scale associations studies. Nature reviews. Molecular cell biology;Nature reviews.Genetics 3, 862-871 (2002).
226. Eshraghi, P., Hedayati, M., Daneshpour, M.S., Mirmiran, P. & Azizi, F.
Association of body mass index and Trp64Arg polymorphism of the beta3-adrenoreceptor gene and leptin level in Tehran Lipid and Glucose Study. British journal of biomedical science 64, 117-120 (2007).
227. Kahrizi, K. et al. Identification of SLC26A4 gene mutations in Iranian families with hereditary hearing impairment. European journal of pediatrics 168, 651-653 (2009).
228. Hefler, L.A. et al. Estrogen-metabolizing gene polymorphisms and age at natural menopause in Caucasian women. Human reproduction (Oxford, England) 20, 1422-1427 (2005).
229. Christenson, L.K. MicroRNA control of ovarian function. Animal reproduction / Colegio Brasileiro de Reproducao Animal 7, 129-133 (2010).
230. Colditz, G.A. et al. Reproducibility and validity of self-reported menopausal status in a prospective cohort study. Am J Epidemiol 126, 319-25 (1987).
231. Rodstrom, K., Bengtsson, C., Lissner, L. & Bjorkelund, C. Reproducibility of self-reported menopause age at the 24-year follow-up of a population study of women in Goteborg, Sweden. Menopause 12, 275-80 (2005).
232. Dunckley, T. et al. Whole-genome analysis of sporadic amyotrophic lateral sclerosis. The New England journal of medicine 357, 775-788 (2007).
233. Vang, R., Shih Ie, M. & Kurman, R.J. Ovarian low-grade and high-grade serous carcinoma: pathogenesis, clinicopathologic and molecular biologic features, and diagnostic problems. Adv Anat Pathol 16, 267-82 (2009).
234. Schwartz, D.R. et al. Novel candidate targets of beta-catenin/T-cell factor signaling identified by gene expression profiling of ovarian endometrioid adenocarcinomas. Cancer Res 63, 2913-22 (2003).
235. Walsh, T. et al. Mutations in 12 genes for inherited ovarian, fallopian tube, and peritoneal carcinoma identified by massively parallel sequencing. Proc Natl Acad Sci U S A 108, 18032-7 (2011).
236. Poulogiannis, G., Frayling, I.M. & Arends, M.J. DNA mismatch repair deficiency in sporadic colorectal cancer and Lynch syndrome. Histopathology 56, 167-79 (2010).
237. Gemignani, M.L. et al.
Role of KRAS and BRAF gene mutations in mucinous ovarian carcinoma. Gynecol Oncol 90, 378-81 (2003).
238. Han, G. et al. Mixed ovarian epithelial carcinomas with clear cell and serous components are variants of high-grade serous carcinoma: an interobserver correlative and immunohistochemical study of 32 cases. Am J Surg Pathol 32, 955-64 (2008).
239. Kobel, M. et al. A limited panel of immunomarkers can reliably distinguish between clear cell and high-grade serous carcinoma of the ovary. Am J Surg Pathol 33, 14-21 (2009).
240. Zaino, R.J. et al. Advanced stage mucinous adenocarcinoma of the ovary is both rare and highly lethal: a Gynecologic Oncology Group study. Cancer 117, 554-62 (2011).
241. Cramer, D.W. The epidemiology of endometrial and ovarian cancer. Hematol Oncol Clin North Am 26, 1-12 (2012).
242. Kelsey, J.L. & Horn-Ross, P.L. Breast cancer: magnitude of the problem and descriptive epidemiology. Epidemiol Rev 15, 7-16 (1993).
243. Johnson, F.B., Sinclair, D.A. & Guarente, L. Molecular biology of aging. Cell 96, 291-302 (1999).
244. Foulkes, W.D. et al. Extending the phenotypes associated with DICER1 mutations. Human mutation 32, 1381-1384 (2011).
245. Slade, I. et al. DICER1 syndrome: clarifying the diagnosis, clinical features and management implications of a pleiotropic tumour predisposition syndrome. Journal of medical genetics 48, 273-278 (2011).
246. Rio Frio, T. et al. DICER1 mutations in familial multinodular goiter with and without ovarian Sertoli-Leydig cell tumors. JAMA : the journal of the American Medical Association 305, 68-77 (2011).
247. Schultz, K.A. et al. Ovarian sex cord-stromal tumors, pleuropulmonary blastoma and DICER1 mutations: a report from the International Pleuropulmonary Blastoma Registry. Gynecologic oncology 122, 246-250 (2011).
248. Heravi-Moussavi, A. et al. Recurrent somatic DICER1 mutations in nonepithelial ovarian cancers. N Engl J Med 366, 234-42 (2012).
249. Miller, S.A., Dykes, D.D. & Polesky, H.F.
A simple salting out procedure for extracting DNA from human nucleated cells. Nucleic Acids Res 16, 1215 (1988).

APPENDIX A. List of Genes by Gene Name

HUGO symbol [1] | HUGO approved name [1] | Genomic location
AKT1 | v-akt murine thymoma viral oncogene homolog 1 | 14q32.32-q32.33
AKT2 | v-akt murine thymoma viral oncogene homolog 2 | 19q13.1-q13.2
AMHR2 | anti-Mullerian hormone receptor, type II | 12q13
ANKLE1 | ankyrin repeat and LEM domain containing 1 | 19p13.11
ARID1A | AT rich interactive domain 1A (SWI-like) | 1p36.1-p35
BABAM1 | BRISC and BRCA1 A complex member 1 | 19p13.11
BARD1 | BRCA1 associated RING domain 1 | 2q34-q35
BMP15 | bone morphogenetic protein 15 | Xp11.2
BMP7 | bone morphogenetic protein 7 | 20q13
BNC2 | basonuclin 2 | 9p22.2
BPIL2 | BPI fold containing family C | 22q12.3
BRAF | v-raf murine sarcoma viral oncogene homolog B1 | 7q34
BRCA1 | breast cancer 1, early onset | 17q21-q24
BRCA2 | breast cancer 2, early onset | 13q12-q13
BRIP | BRCA1 interacting protein C-terminal helicase 1 | 17q22.2
BRSK1 | BR serine/threonine kinase 1 | 19q13.4
CHEK2 | checkpoint kinase 2 | 22q12.1
CSMD3 | CUB and Sushi multiple domains 3 | 8q23.3
CTNNB1 | catenin (cadherin-associated protein), beta 1, 88kDa | 3p21
CYP1B1 | cytochrome P450, family 1, subfamily B, polypeptide 1 | 2p22.2
DICER1 | dicer 1, ribonuclease type III | 14q32.2
DMC1 | DMC1 dosage suppressor of mck1 homolog, meiosis-specific homologous recombination | 22q13.1
EGFR | epidermal growth factor receptor | 7p12
ESR1 | estrogen receptor 1 | 6q24-q27
EXO1 | exonuclease 1 | 1q42-q43
FAM175A | family with sequence similarity 175, member A | 4q21.23
FANCI | Fanconi anemia, complementation group I | 15q26.1
GRB10 | growth factor receptor-bound protein 10 | 7p12.2
HELQ | helicase, POLQ-like | 4q21.23
HOXD1 | homeobox D1 | 2q31.1
HOXD3 | homeobox D3 | 2q31.1
IPO7 | importin 7 | 11p15.3
KRAS | v-Ki-ras2 Kirsten rat sarcoma viral oncogene homolog | 12p12.1
MCM8 | minichromosome maintenance complex component 8 | 20p12.3
MERIT40 | BRISC and BRCA1 A complex member 1 | 19p13.11
MLH1 | mutL homolog 1, colon cancer, nonpolyposis type 2 | 3p22.3
MSH2 | mutS homolog 2, colon cancer, nonpolyposis type 1 | 2p21
MSH6 | mutS homolog 6 (E. coli) | 2p16
MYC | v-myc myelocytomatosis viral oncogene homolog | 8q24
NF1 | neurofibromin 1 | 17q11.2
NLRP11 | NLR family, pyrin domain containing 11 | 19q13.43
NOBOX | NOBOX oogenesis homeobox | 7q35
PALB2 | partner and localizer of BRCA2 | 16p12.1
PIK3CA | phosphatidylinositol-4,5-bisphosphate 3-kinase, catalytic subunit alpha | 3q26.3
POLG | polymerase (DNA directed), gamma | 15q24
PRIM1 | primase, DNA, polypeptide 1 (49kDa) | 12q13.3
PRRCA2 | proline-rich coiled-coil 2A | 6p21.3
PTEN | phosphatase and tensin homolog | 10q23
RAD51 | RAD50 homolog (S. cerevisiae) | 5q23-q31
RAD51B | RAD51 homolog B | 14q23-q24.2
RAD51C | RAD51 homolog C | 17q25.1
RAD51D | RAD51 homolog D | 17q11
RAF1 | v-raf-1 murine leukemia viral oncogene homolog 1 | 3p25
RB1 | retinoblastoma 1 | 13q14.2
SKAP1 | src kinase associated phosphoprotein 1 | 17q21.32
SUCLA2 | succinate-CoA ligase, ADP-forming, beta subunit | 13q12.2-q13.3
TERT | telomerase reverse transcriptase | 5p15.33
TIPARP / PARP7 | TCDD-inducible poly(ADP-ribose) polymerase | 3q25.31
TLK1 | tousled-like kinase 1 | 2q31.1
TMEM41B | transmembrane protein 41B | 11p15.3
TP53 | tumour protein p53 | 17p13.1
UIMC1 | ubiquitin interaction motif containing 1 | 5q35.2
[1] The HUGO Gene Nomenclature Committee (HGNC) unique gene symbols, http://www.genenames.org/ [Accessed 30 Aug 2012]

APPENDIX B. Appendices to Chapter 2

Appendix B.1. Pool and Array Summary. Details of the 27 DNA pools and 128 Illumina arrays used in the analysis of pooling variance.

Pool No. | Pool name | Samples in DNA pool | Arrays per pool | DNA extraction method | Genotyping array | Genotyping batch/date
1 | 1-1M-Single 1 | 127 | 2 | PAXgene Blood DNA Kit | 1M-Single | 1-Apr-08
2 | 1-1M-Single 2 | 127 | 2 | PAXgene Blood DNA Kit | 1M-Single | 1-Apr-08
3 | 2-1M-Single 1 | 129 | 2 | PAXgene Blood DNA Kit | 1M-Single | 1-Apr-08
4 | 2-1M-Single 2 | 129 | 2 | PAXgene Blood DNA Kit | 1M-Single | 1-Apr-08
5 | 3-1M-Single 1 | 253 | 2 | PAXgene Blood DNA Kit | 1M-Single | 1-Apr-08
6 | 3-1M-Single 2 | 253 | 2 | PAXgene Blood DNA Kit | 1M-Single | 1-Apr-08
7 | 4-1M-Single 1 | 256 | 2 | PAXgene Blood DNA Kit | 1M-Single | 1-Apr-08
8 | 4-1M-Single 2 | 256 | 2 | PAXgene Blood DNA Kit | 1M-Single | 1-Apr-08
9 | 5-1M-Single 1 | 404 | 2 | PAXgene Blood DNA Kit | 1M-Single | 1-Apr-08
10 | 5-1M-Single 2 | 404 | 2 | PAXgene Blood DNA Kit | 1M-Single | 1-Apr-08
11 | 6-1M-Single 1 | 446 | 2 | PAXgene Blood DNA Kit | 1M-Single | 1-Apr-08
12 | 6-1M-Single 2 | 446 | 2 | PAXgene Blood DNA Kit | 1M-Single | 1-Apr-08
14 | 1-1M-Duo | 121 | 4 | PAXgene Blood DNA Kit | 1M-Duo | 2-Oct-09
15 | 2-1M-Duo | 122 | 4 | PAXgene Blood DNA Kit | 1M-Duo | 2-Oct-09
19 | 3-1M-Duo | 246 | 4 | PAXgene Blood DNA Kit | 1M-Duo | 3-Oct-09
13 | 4-1M-Duo | 94 | 4 | Miller et al., 1988 (ref. 249) | 1M-Duo | 2-Dec-09
16 | 5-1M-Duo | 161 | 4 | Miller et al., 1988 (ref. 249) | 1M-Duo | 3-Dec-09
17 | 6-1M-Duo | 165 | 4 | Miller et al., 1988 (ref. 249) | 1M-Duo | 3-Dec-09
18 | 7-1M-Duo | 187 | 4 | Miller et al., 1988 (ref. 249) | 1M-Duo | 3-Dec-09
20 | 8-1M-Duo | 223 | 4 | Miller et al., 1988 (ref. 249) | 1M-Duo | 3-Dec-09
21 | 1-660-Quad | 75 | 12 | PureGene DNA Kit | 660-Quad | 4-Mar-09
22 | 2-660-Quad | 84 | 12 | PureGene DNA Kit | 660-Quad | 4-Mar-09
23 | 3-660-Quad | 114 | 12 | PureGene DNA Kit | 660-Quad | 4-Mar-09
24 | 4-660-Quad | 176 | 6 | PureGene DNA Kit | 660-Quad | 4-Mar-09
25 | 5-660-Quad | 222 | 6 | PureGene DNA Kit | 660-Quad | 4-Mar-09
26 | 6-660-Quad | 272 | 12 | PureGene DNA Kit | 660-Quad | 4-Mar-09
27 | 7-660-Quad | 303 | 12 | PureGene DNA Kit | 660-Quad | 4-Mar-09

Appendix B.2. Estimates of Variance Components, 1M-Single Array. Pooling variance, array variance, and pool-construction variance for DNA pools using normalized data.
Label [1] | var(e_pooling(1 or 2)) | var(e_array) [2] | var(e_construction(1 or 2)) [3] | % C/P [4]
1-1M-Single, Type B | 3.9E-04 | 3.3E-04 | 5.6E-05 | 14.5
2-1M-Single, Type B | 4.4E-04 | 4.1E-04 | 2.9E-05 | 6.6
3-1M-Single, Type B | 3.3E-04 | 2.6E-04 | 6.7E-05 | 20.6
4-1M-Single, Type B | 3.2E-04 | 3.4E-04 | 0 | 0
5-1M-Single, Type B | 5.5E-04 | 5.3E-04 | 1.9E-05 | 3.5
6-1M-Single, Type B | 3.5E-04 | 3.9E-04 | 0 | 0
AVERAGE (Type B) | | | 2.9E-05 | 7.5
1-1M-Single 1, Type C1 | 4.6E-04 | 3.8E-04 | 8.3E-05 | 18.5
1-1M-Single 2, Type C2 | 4.9E-04 | 3.8E-04 | 1.1E-04 | 23.7
2-1M-Single 1, Type C1 | 4.6E-04 | 3.8E-04 | 7.7E-05 | 17.5
2-1M-Single 2, Type C2 | 5.1E-04 | 3.8E-04 | 1.3E-04 | 25.8
3-1M-Single 1, Type C1 | 4.4E-04 | 3.8E-04 | 6.4E-05 | 15.1
3-1M-Single 2, Type C2 | 4.7E-04 | 3.8E-04 | 8.6E-05 | 19.1
4-1M-Single 1, Type C1 | 3.5E-04 | 3.8E-04 | 0 | 0
4-1M-Single 2, Type C2 | 4.2E-04 | 3.8E-04 | 4.4E-05 | 11.1
5-1M-Single 1, Type C1 | 4.7E-04 | 3.8E-04 | 9.3E-05 | 20.3
5-1M-Single 2, Type C2 | 5.1E-04 | 3.8E-04 | 1.3E-04 | 25.9
6-1M-Single 1, Type C1 | 4.5E-04 | 3.8E-04 | 7.1E-05 | 16.3
6-1M-Single 2, Type C2 | 7.0E-04 | 3.8E-04 | 3.2E-04 | 46.1
AVERAGE (Type C) | | | 1.0E-04 | 20.0
[1] Variance estimates derived from comparison of replicate pools are labeled Type B. Variance estimates derived from comparison of non-identical pools are labeled Type C1 or C2. There are 6 unique pools but 12 pools in total because pools were constructed in replicate. The subscripts in C1 and C2 indicate the pool replicate.
[2] Type B var(e_array) is calculated from those arrays used to allelotype the unique pool specified. Type C var(e_array) is the average over all arrays over all pools (as in Table 2.1).
[3] Pool-construction variance is calculated as var(e_construction(1 or 2)) = var(e_pooling(1 or 2)) - var(e_array), and negative values are set to zero.
[4] The percentage of pooling variance attributable to pool-construction variance.

Appendix B.3. Estimates of Variance Components, 1M-Duo and 660-Quad Arrays. Pooling variance, array variance, and pool-construction variance for DNA pools using normalized data.
Label | var(e_pooling-2) | var(e_array) | var(e_construction-2) [1] | % C/P [2]
1-660-Quad | 5.41E-04 | 3.30E-04 | 2.10E-04 | 38.9
2-660-Quad | 5.73E-04 | 3.30E-04 | 2.42E-04 | 42.3
3-660-Quad | 5.59E-04 | 3.30E-04 | 2.29E-04 | 40.9
4-660-Quad | 4.98E-04 | 3.30E-04 | 1.67E-04 | 33.6
5-660-Quad | 4.30E-04 | 3.30E-04 | 9.98E-05 | 23.2
6-660-Quad | 4.53E-04 | 3.30E-04 | 1.22E-04 | 27.0
7-660-Quad | 5.59E-04 | 3.30E-04 | 2.29E-04 | 40.9
1-1M-Duo, Batch 1 | 5.63E-04 | 4.15E-04 | 1.48E-04 | 26.3
2-1M-Duo, Batch 1 | 6.07E-04 | 4.15E-04 | 1.92E-04 | 31.7
3-1M-Duo, Batch 1 | 5.98E-04 | 4.15E-04 | 1.83E-04 | 30.6
4-1M-Duo, Batch 2 | 1.07E-03 | 2.64E-04 | 8.10E-04 | 75.4
5-1M-Duo, Batch 2 | 1.20E-03 | 2.64E-04 | 9.34E-04 | 78.0
6-1M-Duo, Batch 2 | 1.40E-03 | 2.64E-04 | 1.14E-03 | 81.2
7-1M-Duo, Batch 2 | 1.30E-03 | 2.64E-04 | 1.03E-03 | 79.6
8-1M-Duo, Batch 2 | 1.90E-03 | 2.64E-04 | 1.64E-03 | 86.1
[1] Pool-construction variance is calculated as var(e_construction-2) = var(e_pooling-2) - var(e_array).
[2] Indicates the percentage of pooling variance attributable to pool-construction variance.

Appendix B.4. Average MAF on Illumina Arrays. Comparison of average minor allele frequency on three Illumina arrays estimated using HapMap CEU data (release 27) and experimental pool-derived data.

Pool Name | Pool Size | Average MAF, DNA pool | Average MAF, HapMap CEU
5-1M-Single | 404 | 0.227 | 0.209
6-1M-Single | 446 | 0.225 | 0.209
1-660-Quad | 75 | 0.260 | 0.288
6-660-Quad | 272 | 0.262 | 0.288
3-1M-Duo | 246 | 0.249 | 0.208
5-1M-Duo | 161 | 0.253 | 0.208

Appendix B.5. Array Allocation, Effective Sample Size, and MDOR. Impact of unequal array allocation on effective sample size (N*) and minimum detectable odds ratio (MDOR) in a pooling genome-wide association study.
Arrays (Case : Control) | Case pool RSS | Case pool N* | Control pool RSS | Control pool N* | MDOR at 80% (p=0.29) | MDOR at 80% (p=0.10)
6 : 6 | 0.81 | 244 | 0.56 | 562 | 1.38 | 1.58
2 : 10 | 0.59 | 178 | 0.68 | 681 | 1.43 | 1.65
10 : 2 | 0.88 | 264 | 0.30 | 300 | 1.43 | 1.65
4 : 8 | 0.74 | 223 | 0.63 | 631 | 1.39 | 1.60
8 : 4 | 0.85 | 256 | 0.46 | 461 | 1.39 | 1.60

This table compares the minimum detectable odds ratio (MDOR) at 80% power for a theoretical pooling experiment with 300 cases and 1000 controls where 12 arrays are distributed differently between the case and control pools. Relative sample size (RSS) and effective sample size (N*) are generated by PoolingPlanner assuming an array variance, var(e_array), of 3.3x10^-4, a pool-construction variance, var(e_construction), of 9.9x10^-5, and an average minor allele frequency of 0.29. MDOR at 80% power was calculated using Quanto (ref. 189; Gauderman & Morrison, 2006), assuming an unmatched case-control design testing for gene-only effects using a log-additive model, where the incidence of the case phenotype was set to 0.02% and the risk allele frequency, p, was set to 0.29 or 0.10.

Appendix B.6. Power Curve Example. Power curves for a theoretical pooling experiment with 300 cases and 1000 controls where 12 arrays are distributed differently between the case and control pools.
Effective sample size given the different pooling designs was calculated using PoolingPlanner, and these values were entered into Quanto to obtain pool-adjusted estimates of power over a range of odds ratios. Calculations are based on an unmatched case-control design testing for gene-only effects using a log-additive model, where the incidence of the case phenotype is 0.02% and the risk allele frequency (p_risk) is 29% (with the risk allele in complete linkage disequilibrium with a SNP on the array).

APPENDIX C. Appendices to Chapter 3

Appendix C.1. SNP Exclusions. Summary of the Illumina 660-Quad BeadChip SNPs excluded from the genome-wide association analysis.

Filtering step | Number of SNPs excluded | % of total [1]
(1) HapMap MAF < 0.05 | 57,527 | 8.75
(2) Copy number variant non-polymorphic SNPs | 64,527 | 9.82
(3) Mitochondrial SNPs | 135 | 0.02
Total | 122,189 | 18.59
[1] Starting total: 657,366 SNPs.

APPENDIX D. Appendices to Chapter 4

Appendix D.1. Power Calculations. Method used to calculate power for SNPs associated with ANM in Iranian women.

Prior to performing our association studies in Iranian women, the power to discover novel ANM alleles and the power to replicate previous results were estimated using Quanto (ref. 189). For the pool-based GWAS, power was estimated assuming an effective sample size of 68% of the true sample size, to account for the pooling procedure (ref. 188). Using this criterion and testing a gene-only, unmatched case-control design under a log-additive mode of inheritance (case = early ANM pool, control = late ANM pool), the study was estimated to have 80% power at the P < 0.05 level to detect odds ratios of 2.21, 1.85, 1.73, and 1.69 for causal allelic variants with MAFs of 10%, 20%, 30%, and 40%, respectively. For these calculations the population risk was set to 0.18, which represents the proportion of Iranian women falling in the early ANM pool. This study was therefore capable of discovering risk alleles of moderate to large effect.
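The pooling adjustment described here can be illustrated with a short sketch. It is not the Quanto or PoolingPlanner calculation: it scales each pool by a relative sample size and then applies a textbook normal approximation for an allelic two-proportion test. The function names, the example pool sizes, and the approximation itself are assumptions introduced for illustration only.

```python
from math import sqrt
from statistics import NormalDist


def effective_sample_size(n_samples: float, relative_sample_size: float) -> float:
    """Scale a true sample size by the relative sample size (RSS) of a pooling design."""
    if not 0.0 < relative_sample_size <= 1.0:
        raise ValueError("relative_sample_size must be in (0, 1]")
    return n_samples * relative_sample_size


def allelic_power(n_cases: float, n_controls: float, maf: float,
                  odds_ratio: float, alpha: float = 0.05) -> float:
    """Approximate two-sided power of an allelic case-control test.

    Assumption: the control allele frequency equals `maf`, and the case
    allele frequency follows from a per-allele odds ratio.
    """
    p_control = maf
    p_case = odds_ratio * maf / (1.0 - maf + odds_ratio * maf)
    # Each individual contributes two alleles.
    se = sqrt(p_case * (1.0 - p_case) / (2.0 * n_cases)
              + p_control * (1.0 - p_control) / (2.0 * n_controls))
    norm = NormalDist()
    z_crit = norm.inv_cdf(1.0 - alpha / 2.0)
    shift = abs(p_case - p_control) / se
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)


if __name__ == "__main__":
    # Hypothetical pool sizes; the true sizes are not given in this appendix.
    n_early = effective_sample_size(150, 0.68)
    n_late = effective_sample_size(150, 0.68)
    for maf, oratio in [(0.10, 2.21), (0.20, 1.85), (0.30, 1.73), (0.40, 1.69)]:
        print(maf, oratio, round(allelic_power(n_early, n_late, maf, oratio), 2))
```

Because the sample sizes are placeholders and the approximation differs from Quanto's model, the printed values are not expected to reproduce the 80% figures quoted above.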
With respect to our power to detect associations reported in the 2009 ANM GWAS169,170, ANM was modeled as a continuous trait (log-additive mode of inheritance). The population mean (µ) and SD were set to the sample mean and SD (µ= 49.8, SD= 4.38). At the P < 0.05 level, the IG data (passing QC) were estimated to have 66% power to replicate the most significant association reported by the 2009 ANM GWAS, rs16991615 (MAF= 0.06, β= 1.07, p= 1.2x10^-21)169. For the 2nd and 3rd most significant associations, rs1172822 (MAF= 0.37, β= –0.49, p= 1.8x10^-19) and rs2384687 (MAF= 0.39, β= –0.47, p= 2.4x10^-18)169, we estimated 61% and 57% power, respectively, to detect these causal allelic variants. For all other variants, our power was below 50%. For both AMHR2 SNPs, power was below 20%. For the BMP15 SNP, power was predicted to be 25%. Finally, for the CYP1B1 SNP, assuming a dominant mode of inheritance, power was 68%.
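A rough sketch of how replication power for a continuous trait depends on MAF, effect size, and trait SD is given below. It uses a normal approximation to the regression test statistic under an additive model with Hardy-Weinberg genotype frequencies, which is an assumption of this sketch and not necessarily the method implemented in Quanto. The function name and the sample size of 800 are hypothetical; the MAF, β, and SD are the values quoted above for rs16991615.

```python
from math import sqrt
from statistics import NormalDist


def qtl_power(n, maf, beta, sd, alpha=0.05):
    """Approximate two-sided power to detect an additive SNP effect on a continuous trait."""
    nd = NormalDist()
    # Fraction of trait variance explained by the SNP under Hardy-Weinberg equilibrium.
    r2 = 2 * maf * (1 - maf) * beta ** 2 / sd ** 2
    # Expected value of the z-statistic for the SNP effect.
    ncp = sqrt(n * r2 / (1 - r2))
    z_crit = nd.inv_cdf(1 - alpha / 2)
    return nd.cdf(ncp - z_crit) + nd.cdf(-ncp - z_crit)


# n=800 is a hypothetical sample size; the thesis sample size is not restated here.
power_top_snp = qtl_power(n=800, maf=0.06, beta=1.07, sd=4.38)
```

Because the variance explained scales with MAF(1 - MAF)β², a low-frequency variant with a large effect (such as rs16991615) and a common variant with a smaller effect can yield similar power at the same sample size.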
