"Science, Faculty of"@en . "DSpace"@en . "UBCV"@en . "Srivastava, Siddhartha"@en . "2011-03-09T00:41:20Z"@en . "2007"@en . "Master of Science - MSc"@en . "University of British Columbia"@en . "New genotyping technologies are producing reliable results with far greater coverage and at dramatically lower cost than previously possible. Given the rapid new discovery of disease associated markers and the new technology for determining the nucleotide sequences of key positions in the DNA of an individual, it is now feasible to apply existing knowledge to generate personalized analyses of genetic risk for diverse diseases. DNA Genetic Risk Information Profile (D-GRIP) is a genotype analysis software system that determines an individual's genetic risk profile given a genotype. The prototype web tool can take, as input, up to a million observed genotypes from single nucleotide positions known to be polymorphic in a human population. The submitted genotype data are compared to a database of disease associated single nucleotide polymorphisms (SNPs) and an output is generated, reporting disease-associated variants for which the individual has a predicted modified risk. An evaluation of D-GRIP was performed through the direct surveying of potential users of such a system - users such as clinicians, genetic counselors and genetics researchers. Due to ethical issues related to providing a genetic risk profile, the prototype system is kept closed to the general public and reserved for research into the utility and requirements of such software. The major conclusions drawn direct attention towards the key limitations presently precluding the creation of personalized genetic risk assessment. 
The lack of a computationally exploitable resource for disease-associated genetic variants, the inherent statistical complexities involved in risk calculation for large-scale genotyping data and the limited understanding of interactions between genes, environment and complex diseases are all key factors that need to be overcome in order to create a practical genetic risk assessment tool."@en . "https://circle.library.ubc.ca/rest/handle/2429/32186?expand=metadata"@en . "D-GRIP: DNA Genetic Risk Information Profile
A genotype analysis system to predict a genetic risk profile for an individual
by Siddhartha Srivastava
B.Sc, Biological Science, The University of Calgary, 2005
B.Sc, Computer Science, The University of Calgary, 2005
A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in The Faculty of Graduate Studies (Bioinformatics)
The University Of British Columbia
October, 2007
\u00A9 Siddhartha Srivastava 2007

Abstract

New genotyping technologies are producing reliable results with far greater coverage and at dramatically lower cost than previously possible. Given the rapid new discovery of disease-associated markers and the new technology for determining the nucleotide sequences of key positions in the DNA of an individual, it is now feasible to apply existing knowledge to generate personalized analyses of genetic risk for diverse diseases. DNA Genetic Risk Information Profile (D-GRIP) is a genotype analysis software system that determines an individual's genetic risk profile given a genotype. The prototype web tool can take, as input, up to a million observed genotypes from single nucleotide positions known to be polymorphic in a human population. The submitted genotype data are compared to a database of disease-associated single nucleotide polymorphisms (SNPs) and an output is generated, reporting disease-associated variants for which the individual has a predicted modified risk.
An evaluation of D-GRIP was performed through the direct surveying of potential users of such a system, including clinicians, genetic counselors and genetics researchers. Due to ethical issues related to providing a genetic risk profile, the prototype system is kept closed to the general public and reserved for research into the utility and requirements of such software.

The major conclusions drawn direct attention towards the key limitations presently precluding the creation of personalized genetic risk assessment. The lack of a computationally exploitable resource for disease-associated genetic variants, the inherent statistical complexities involved in risk calculation for large-scale genotyping data and the limited understanding of interactions between genes, environment and complex diseases are all key factors that need to be overcome in order to create a practical genetic risk assessment tool.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgments
1 Introduction
1.1 Variations and Diseases
1.2 Discovery of new markers
1.3 Genotyping technologies
1.4 Bioinformatic Tools
1.4.1 Commercial Systems
1.4.2 Open Source Systems
1.5 Overview of project
Bibliography
2 D-GRIP: DNA Genetic Risk Information Profile
2.1 Introduction
2.2 Methods
2.2.1 D-GRIP Overview
2.2.2 Genotype-Phenotype Database
2.2.3 Disease Risk Model
2.2.4 Haplotype Data
2.2.5 Software Evaluation
2.3 Results
2.4 Discussion
2.4.1 Limitations
2.4.2 Ideal Software
2.4.3 Implications
2.4.4 Conclusions
Bibliography
3 Conclusions and Future Directions
3.1 Further Observations
3.2 Future Considerations
3.3 Conclusion
Bibliography
Appendices
A Feedback from Experts
A.1 Questions
A.2 Feedback
A.2.1 User Interface
A.2.2 D-GRIP Core
A.2.3 Potential Users
A.2.4 Implications
B D-GRIP User Manual
B.1 Introduction to D-GRIP
B.1.1 D-GRIP System
B.2 D-GRIP Features
B.2.1 Disclaimer
B.2.2 Input
B.2.3 Output
B.2.4 Help Tips

List of Tables

1.1 A summary of genotyping technologies currently available. The cost per genotype is an estimate of maximal multiplexing capability. Note that Illumina's Sentrix\u00AE numbers in the table are based on the Human1M BeadChip, which will be released in the second quarter of 2007.
2.1 A summary of the number of genes and number of polymorphisms for each of the diseases in the DNA-Disease database.

List of Figures

2.1 A schematic overview detailing the flow of information across the various components of D-GRIP is illustrated.
2.2 The opening page of D-GRIP is shown. The instructions on how to use D-GRIP and a disclaimer explicitly stating the assumptions and limitations inherent in D-GRIP are shown.
2.3 The first step in using D-GRIP is illustrated, where the user's demographic information such as gender, age and ethnicity is collected. The hypothetical example above shows a 47-year-old male of European ancestry. The inference option is turned on (checked).
2.4 The second step in using D-GRIP is illustrated here. The user has a choice of copying and pasting the genotype data or uploading it. For ease of use, various hypothetical sample genotype files were created to illustrate D-GRIP. The above example contains the 13 highly significant genotypes which are heterozygous for each disease in the DNA-Disease database. A description of the pre-loaded data is shown in the 'Comments' box.
2.5 The last step shows a tabular result for any single nucleotide polymorphisms (SNPs) found to be associated with a disease in the user's genotype data.
2.6 More details are shown for each SNP.
As an example, details for SNP rs7903146 from gene TCF7L2 for Diabetes Mellitus type II are shown.
2.7 Details of overall probability calculation, integrated analysis and inferred SNPs are shown for Diabetes Mellitus type 2 disease. The integrated analysis indicates which disease-associated SNPs are in high linkage disequilibrium (r2 > 0.8). For SNPs in high LD, only the SNP with strongest effect (highest odds ratio) is used in the overall calculated probability.
B.1 The entry into D-GRIP occurs with user authentication. A valid username and password is required to access D-GRIP.
B.2 A snapshot of D-GRIP's main page. The page describes instructions on how to use D-GRIP and outlines a disclaimer for the user to read.
B.3 The assumptions made by D-GRIP are listed as a disclaimer and shown here.
B.4 Demographic information and configuration options submitted to D-GRIP are shown here.
B.5 Form for submitting the genotype data is shown here. The user can either copy/paste the genotype data or upload a genotype file. A set of sample genotypes is provided and can be loaded into the copy/paste form by clicking on 'Get Sample'.
B.6 The Illumina and Affymetrix tab-delimited file formats for D-GRIP. The respective column names are shown at the top.
B.7 Genotype sample 1 is loaded into the copy/paste form by clicking on 'Get Sample'. A description of the sample genotype file is illustrated in the 'Comments' box.
B.8 D-GRIP risk profile sample output. The output illustrates 3 diseases: Alzheimer's, Diabetes type 2 and Parkinson's disease. The SNPs associated with each disease are shown. The background and overall calculated probability of developing the disease are also shown.
B.9 Details about one SNP from Diabetes type II disease.
B.10 Probability details for diabetes type 2 are shown here.
B.11 SNPs from Inference analysis for Diabetes type 2 are shown.
B.12 Details about the inferred SNPs are shown. The details include the user's genotype, HapMap data from which inference was performed and the relevant statistics for the disease-associated SNP.
B.13 An example of an ethnic background help tip is shown.
B.14 An example of an inference of genotypes help tip is shown.

Acknowledgments

I would like to acknowledge my academic supervisor, Dr. Wyeth W. Wasserman. Through his continuous support and guidance I have gained valuable insights in how to conduct and present research. I would also like to acknowledge my thesis committee: Francis Ouellette and Dr. Jan M. Friedman for their advice and support.

In addition, I thank Drs. Cornelius Boerkoel, Angela Brooks-Wilson, Lorne Clarke, Denise Daley, Anita Dircks, Bill Gibson, Jinko Graham, Millan Patel and Colin Ross for providing valuable feedback regarding the utility and shortcomings of D-GRIP. I would like to extend further acknowledgments to Drs. William Dana Flanders, Muin Khoury and Quanhe Yang for their feedback on the statistical model. I want to also thank Francis Ouellette and Dr. Artem Cherkasov for giving their advice and encouragement during my training in the CIHR/MSFHR Strategic Training Program in Bioinformatics.

I would like to thank the members of the Wasserman Laboratory: David Arenillas, Jochen Brumm, Warren Cheung, Alice Chou, Debra Fulton, Shannan J. Ho Sui, Andrew Kwon, Jonathan Lim, Stuart Lithwick, Dora Pak, Elodie Portales-Casamar, Magdalena Swanson, Amy Ticoll, Tony Wong and Dimas Yusuf, for creating a friendly and enjoyable research environment.

I greatly appreciate the financial support that was provided by the Canadian Institutes of Health Research (CIHR) and the Michael Smith Foundation for Health Research (MSFHR) Strategic Training Program in Bioinformatics.

Chapter 1
Introduction

This thesis describes the exploration of how bioinformatics can be applied in the field of genetics, specifically to the prediction of disease risk. The causes of human diseases range from simple Mendelian inheritance patterns to complex combinations of genetic and non-genetic (environmental) factors. With the availability of the entire human genome sequence and the common variation map (HapMap project), the understanding of genetic contributions to diseases is increasing rapidly. We are approaching a time where prediction of disease risk on a personalized level will become a reality.

1.1 Variations and Diseases

Variations in DNA sequences occur throughout the genome at a frequency of approximately 4-5 in 1000 bases (0.4-0.5%) on average between two unrelated individuals [3]. These differences or variations in sequences include both mutations and polymorphisms, which are distinguished by their frequency within a population. Mutations are by definition rarely observed in a population and, while they can cause disease, are not generally relevant to the prediction of disease risk in the general population. The simplest and most common form of polymorphism is called a Single Nucleotide Polymorphism (SNP). At a particular site on the human genomic sequence, a SNP is defined by the existence of a certain percentage of individuals with a nucleotide differing from the norm. For instance, in two copies of a chromosome at one site, one chromosome might have an A at that position (the 'A' allele) and the other might have a C (the 'C' allele). The minimum threshold percentage for classifying a position as being a SNP rather than a mutation is generally defined as 1% of tested chromosomes, although some reports use other values.
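To make the 1% threshold concrete, here is a minimal sketch in Python. The function name, the default threshold argument and the example counts are illustrative assumptions and are not part of any tool described in this thesis; the sketch simply labels a variable site from allele counts among the tested chromosomes.

```python
def classify_variant(minor_allele_count, chromosomes_tested, threshold=0.01):
    '''Label a variable site as a SNP or a mutation by allele frequency.

    Illustrative helper only: the 1% default follows the convention quoted
    above, and some reports use other cut-offs.
    '''
    if not chromosomes_tested > 0:
        raise ValueError('chromosomes_tested must be positive')
    if minor_allele_count > chromosomes_tested or 0 > minor_allele_count:
        raise ValueError('minor_allele_count out of range')
    frequency = minor_allele_count / chromosomes_tested
    # At or above the threshold the site counts as a polymorphism (SNP).
    label = 'SNP' if frequency >= threshold else 'mutation'
    return label, frequency


# Hypothetical counts: copies of the rarer allele among 400 chromosomes.
print(classify_variant(12, 400))   # frequency 0.03, above the 1% cut-off
print(classify_variant(1, 400))    # frequency 0.0025, below the cut-off
```

Passing a different value for threshold reproduces the alternative cut-offs that some reports use.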
In human populations, there are approximately 10 million SNPs that occur with greater than 1% frequency, and these 10 million sites constitute 90% of the variation in the population [3, 21]. In short, SNPs constitute a dramatic portion of the genetic variation between two individuals. A genotype is then defined as the combination of the two alleles at a particular locus for a given SNP. For instance, at a known polymorphic position with A and C forms, genotypes would be AA, AC or CC. SNPs occur throughout the genome (promoter, coding and intronic regions). Variations situated in protein coding regions are of two types: synonymous (not altering the encoded amino acid sequence) and non-synonymous (causing a change to the encoded amino acid sequence).

In the study of human genetics there have been a litany of examples of links between sequence variations (also referred to as markers) and specific traits or diseases [27]. Disorders where genetics plays an important role, the so-called genetic diseases, can be classified into single gene defects, chromosomal disorders or multifactorial disorders. Single gene disorders (or Mendelian disorders), such as Cystic Fibrosis, are usually rare, and identifying the causal genetic variant has helped understand the disease. Chromosomal disorders are caused by excess or deficiency of genes [8]. Most common diseases, such as diabetes or heart disease, are multifactorial, and it is generally accepted that these phenotypic effects are based on direct genetic effects, multiple gene-gene interactions and gene-environment interactions [27, 30]. Recently, through new technologies and genome-wide association surveys, there has been a strong effort towards finding disease susceptibility variations (especially SNPs) for complex disorders [13].

1.2 Discovery of new markers

Recently, there has been a surge in new discovery of disease susceptibility genes and variations.
Traditionally, in human genetics, a discovery involved identifying a gene for susceptibility to disease. That notion, however, comes from working on rare diseases in which single studies have reported strong statistical associations between a mutation in a gene and a disease [13]. In contrast, for common diseases, the oligogenic model is usually accepted. The model states that the genetic component of complex diseases is more likely to be the result of a few genes with moderate effect or a large number of genes with smaller effect [11]. With the development of large-scale genotyping technologies, it has now become feasible to perform genome-wide association studies [11, 13] to identify contributing loci by surveying a large set of known variable sites.

Several large-scale genome-wide association studies have been recently published, including studies of diabetes mellitus type II [26, 28, 31, 33], bipolar disorder [1], Alzheimer's disease [4], Crohn's (inflammatory bowel) disease [6, 22] and coronary artery disease [24]. Even this small sample of diseases, and the short timeframe in which these studies were published, indicates that a large number of markers are being discovered at a very rapid rate. A more detailed analysis of the recent advances in genome-wide association studies and a count of newly discovered markers for several common diseases can be found in [5].

1.3 Genotyping technologies

New genotyping technologies are driving the burst of genetic studies. For studies where a small number of SNPs are analyzed, Sequenom's MassARRAY\u00AE system, TaqMan\u00AE and Pyrosequencing\u2122 have been widely used. These methods provide flexibility in study design for investigators prepared to work on a small set of candidate genes.
For studies where thousands of SNPs need to be analyzed simultaneously (i.e., multiplexed) for each sample, platforms such as the Illumina BeadArray and the Affymetrix GeneChip\u00AE can be used. These systems have dramatically increased the throughput of genotyping and substantially reduced genotyping costs [23].

To illustrate the underlying technology, a brief description of the original Illumina BeadArray platform and the GoldenGate assay follows. The array-based technology comes in a 96 well plate format. Each well contains an optical fiber bundle holding an array of 50,000 randomly placed beads, each ~3 microns in diameter. There are 1520 bead types, each representing a different oligonucleotide sequence. This gives ~30 copies of each bead type, providing (on average) 30 replicate genotyping experiments for each SNP, and can screen up to 100,000 genotypes in one sample [10].

The GoldenGate\u00AE Assay is used with the BeadArray platform and has the advantage of allowing high multiplexing during amplification steps while minimizing reagent volumes and time. Genomic DNA is normalized and then chemically reacted to incorporate biotin to make activated DNA. Three oligonucleotides are designed for each SNP. Two are allele-specific oligonucleotides (ASO) and one is a locus-specific oligonucleotide (LSO). Each ASO has a 3' base complementary to one of the two SNP alleles. The LSO hybridizes downstream of the ASOs. Each of the three oligonucleotide sequences contains regions of genomic complementarity for polymerase chain reaction (PCR): P1 and P2 on the ASOs and P3 on the LSO. The LSO also contains a unique address sequence that targets a particular bead type on the well plate. After extension and ligation, activated genomic DNA is amplified by PCR using labeled primers P1 and P2. The primers P1 and P2 are labeled with Cy3 and Cy5 respectively.
The PCR products are then hybridized to the array matrix plate, where the Cy5 and Cy3 labeled materials bind in proportion to the relative abundance of the two alleles in the sample, such that a homozygote for either allele has only one color and a heterozygote has two. The labels are detected through their fluorescence signal and analyzed using software for genotype clustering and calling. Based on the color distribution of each allele, the genotype of the samples for the designated SNPs can be determined. For a more thorough and detailed description of the assay, refer to [19] and [32].

Both Illumina and Affymetrix systems have challenged the technological limit of genotyping analysis. For instance, Illumina's Sentrix\u00AE HumanHap650Y BeadChip and whole-genome Human1M BeadChip can respectively genotype over 650,000 tag SNPs and over one million genetic variations on a single array, whereas Affymetrix's GeneChip\u00AE Genome-wide human SNP array 5.0 can genotype approximately 500,000 SNPs in one sample. Both platforms can genotype fixed sets of SNPs as well as customized panels of SNPs. Illumina's SNP selection is based on the HapMap project while Affymetrix's SNP selection is based on the feasibility of SNPs to be genotyped. For both systems, the cost of genotyping is less than $0.01 per SNP. A summary of the various methods is shown in Table 1.1. A more detailed review of various genotyping technologies is available in [32] and [23].

Given the new technologies and the high throughput of genotypes at substantially lower costs, genotyping an individual has become increasingly feasible and has led to a shift from investigation of a few candidate polymorphisms at a time to comprehensive whole-genome studies [23].

1.4 Bioinformatic Tools

There are many different open source and commercial systems available that manage, organize and analyze large-scale genotype data and/or provide risk assessments for disease.
In order to determine whether any currently available systems integrate the analysis of many genotypes to provide personalized risk assessments for diseases, a survey of the risk prediction systems follows.

Table 1.1
Method | Assay design | Multiplexing capability | Throughput (no. of samples per assay) | Cost per genotype
TaqMan\u00AE | By manufacturer or investigator | No | Up to 10,000+ | >US$0.30
Pyrosequencing\u2122 | By investigator | 1 to 3 | Up to 4,000+ | >US$0.30
Sequenom's MassARRAY\u00AE | By investigator | 1 to 29 | Up to 3,000+ | US$0.05-0.10
Illumina's [...] | By manufacturer | 1,536 to [...] | Up to 96 | [...]

[...] 0.8 [2]). To extract the HapMap SNPs and linkage disequilibrium values, Ensembl (build 45) was used. Due to the complexity involved in defining and classifying populations, a simplification was made when incorporating the HapMap data: the populations from the HapMap project were generalized to match the populations found in the DNA-Disease database. The population categories from the DNA-Disease database were Caucasians, Asians, Africans and Other/Mixed. The corresponding matches from the HapMap project were European ancestry (CEPH) grouped as Caucasians, the Tokyo (JPT) and Han Chinese (CHB) ethnic groups represented as Asians and the Nigeria (YRI) ethnic group matched to Africans.

D-GRIP uses the HapMap data in two different ways during the generation of a disease risk profile. First, for the reported disease-associated SNPs, an integrated analysis is performed in which multiple disease-associated SNPs in high linkage disequilibrium (LD) are clustered together during the probability calculation. Rather than treating these high LD SNPs independently in the calculated overall disease probability, a simplification is made.
The SNP with the highest effect (highest odds ratio) is used to represent the other SNPs in high LD and thus only one SNP (with strongest effect) is used in the posterior probability calculation.

Second, an inferred analysis is reported with the observed genotypes in the final risk profile output. The inferred analysis reports SNPs that were present in the user's genotype but did not have a direct association to a disease. These inferred SNPs are in high LD with known disease-associated SNPs which are present in the DNA-Disease database. The HapMap Genome Browser (Release 21) [40] was used to extract the phased genotype data. Subsequently, Haploview (version 3.32) [3] software was used to calculate the haplotype blocks, using Haploview's default method for haplotype block calculation, in order to infer phase information. Since the inferred analysis is highly predictive in nature and untested, it is provided as an option for the user, which by default is turned off during analysis. Also, the inferred SNPs are not used in the overall posterior probability calculation.

2.2.5 Software Evaluation

After a working prototype was created, D-GRIP underwent a series of critical evaluations. The evaluation was structured as a survey where D-GRIP was demonstrated to experts and their feedback was recorded. A total of 21 scientists, clinicians or counselors were surveyed, including clinical geneticists, molecular geneticists, biostatisticians and genetic counseling students.

2.3 Results

A walkthrough of D-GRIP illustrates the user interface features as well as the underlying DNA-Disease database. Figure 2.2 shows the first page of D-GRIP after the user logs in. The opening page explains how to use D-GRIP and warns the user via a disclaimer which outlines the assumptions and limitations of D-GRIP.
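For reference, the integrated analysis described in Section 2.2.4 can be sketched as follows. This is a minimal illustration under stated assumptions rather than D-GRIP's actual code: the function name, the input layout (one dict of odds ratios and one dict of pairwise r2 values) and the single-linkage grouping of SNP pairs are hypothetical choices. Only the r2 > 0.8 cut-off and the rule of keeping the SNP with the highest odds ratio come from the text, and the example values are invented.

```python
def representative_snps(odds_ratios, r2_pairs, r2_cutoff=0.8):
    '''Group SNPs linked by r2 > r2_cutoff and keep one SNP per group.

    odds_ratios: dict mapping SNP id -> odds ratio
    r2_pairs:    dict mapping (snp_a, snp_b) -> pairwise r2
    Returns the sorted ids of the highest odds ratio SNP in each group.
    Grouping is single-linkage (an assumption of this sketch).
    '''
    parent = {snp: snp for snp in odds_ratios}

    def find(snp):
        # Follow parent links to the root of the group (union-find).
        while parent[snp] != snp:
            parent[snp] = parent[parent[snp]]
            snp = parent[snp]
        return snp

    # Merge the groups of every pair of SNPs in high LD.
    for (a, b), r2 in r2_pairs.items():
        if a in parent and b in parent and r2 > r2_cutoff:
            parent[find(a)] = find(b)

    # Within each group, remember the SNP with the strongest effect.
    best = {}
    for snp, odds in odds_ratios.items():
        root = find(snp)
        if root not in best or odds > odds_ratios[best[root]]:
            best[root] = snp
    return sorted(best.values())


# Invented example values, for illustration only.
odds = {'snp1': 1.27, 'snp2': 1.26, 'snp3': 1.19, 'snp4': 1.65}
ld = {('snp1', 'snp2'): 0.95, ('snp2', 'snp3'): 0.85, ('snp1', 'snp4'): 0.10}
print(representative_snps(odds, ld))   # ['snp1', 'snp4']
```

In this example snp1, snp2 and snp3 form one high-LD group represented by snp1, while snp4 is not in high LD with any other SNP and is kept on its own.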
Upon clicking the 'Use D-GRIP' link, the user is presented with a form to solicit demographic information and options regarding the analysis. In this example, suppose the user is a 47-year-old male of European ancestry and inference analysis is turned on (Figure 2.3).

[Figure 2.2 screenshot: D-GRIP opening page with a welcome message, 'Getting Started' instructions (click on the 'Use D-GRIP' link; upload a genotype file or copy/paste data into the form; click on Calculate Risk) and the disclaimer.]

[...]

[Figure 2.5 screenshot, legible content: Diabetes Mellitus type 2, background population probability 5%, overall calculated probability 7%; Parkinson disease, background population probability 2%, overall calculated probability 2.14%.]

Figure 2.5: The last step shows a tabular result for any single nucleotide polymorphisms (SNPs) found to be associated with a disease in the user's genotype data.

Figure 2.6: More details are shown for each SNP. As an example, details for SNP rs7903146 from gene TCF7L2 for Diabetes Mellitus type II are shown.

[Figure 2.7 screenshot, legible content: User details: age 47, gender male, ethnicity Europe. Background probability details: age of onset 45 and 60 yrs, background probability 5% and 15%. Diabetes Mellitus type 2: background population probability 5%, overall calculated probability 7%. Integrated Analysis: SNP used in probability calculation rs11037909; SNPs in high linkage disequilibrium rs11037909, rs1113132, rs3740878. Inference Analysis: SLC30A8 rs13266634 (T/C, Caucasian, 1.18) and LOC387761 rs7480010 (A/G, Caucasian, 1.14).]

Figure 2.7: Details of overall probability calculation, integrated analysis and inferred SNPs are shown for Diabetes Mellitus type 2 disease. The integrated analysis indicates which disease-associated SNPs are in high linkage disequilibrium (r2 > 0.8). For SNPs in high LD, only the SNP with strongest effect (highest odds ratio) is used in the overall calculated probability.

2.4 Discussion

2.4.1 Limitations

In the current state of information and implementation, D-GRIP has several limitations. A key limitation is the narrow scope of the DNA-Disease database. The scarcity reflects two key causes: lack of organization of genotype-phenotype data and the small number of confirmed markers for risk. Even though numerous studies report new DNA marker-disease associations, there is a shortage of databases that organize such information in a comprehensive and computationally accessible manner. Databases such as AlzGene [5] and PDGene [1] are rare examples of organized genotype-phenotype data which are continuously updated when new studies are published and are easy to use computationally.
More such genetics databases are required for other common diseases [33]. It should be noted that numerous databases provide information about genetics and disease, such as OMIM [15] and HGVbase [16], but the information is not sufficiently granular and/or formatted to incorporate into the risk calculation procedure of D-GRIP. The second problem, the scarcity of confirmed predictive markers, will soon be ameliorated as the rate of publication of such studies is accelerating.

Another limitation of D-GRIP resides in the statistical model, which raises several issues. When a posterior probability is calculated using the observed SNPs which are associated with a disease, each genetic test (SNP) is assumed to be acting independently. This is a very simplistic view and does not realistically capture the underlying disease process. In order to partially circumvent this limitation, we included haplotype information in the analysis through an integrated analysis: if observed SNPs in the output were found to be in linkage disequilibrium, only the SNP with the strongest effect is included in the posterior probability calculation. Again, this is a simplification, which is warranted since we could not find other existing suitable statistical models that incorporate haplotype data for risk prediction of disease with SNPs. Furthermore, the lack of consideration for gene-gene interactions and gene-environment interactions is another limitation. Even though the model allows for incorporation of interaction effects, for simplicity, D-GRIP does not utilize that feature.

A second issue with the statistical model is the lack of incorporation of age and gender during risk calculation. Even though we require the user to input such demographic information when calculating risk for a particular disease, this information is not utilized.
In order to use demographic information appropriately, we would require the age and gender distribution of the individuals in the case-control studies stored in the DNA-Disease database. Since such raw data are unavailable, a simplification was used: D-GRIP uses a different prior probability (background probability) for specific diseases (e.g. Alzheimer's disease) based on the age of the person. To alleviate this scarcity of raw data, efforts are currently under way at the NIH to archive and distribute more detailed information on upcoming genetic association studies. The database, dbGaP, is designed to house genetics studies dealing with genotype-phenotype interactions and to provide all study documentation as well as pre-computed analyses [30].

Currently, no family history or medical history is used for predicting risk of disease. Incorporation of family history has been shown to improve the predictive accuracy of risk models [11]. Thus, future versions of D-GRIP should incorporate family history in the risk model.

For any prediction-based software, rigorous validation of specificity, sensitivity and accuracy is required. Currently, no such validation has been performed, owing to the insufficient number of diseases in the DNA-Disease database and the unavailability of raw genotype data from individuals for testing. Instead, D-GRIP was evaluated through a survey in which it was demonstrated to various experts in genetics-related fields and their feedback was recorded. The conclusions from this form of evaluation are discussed in the next section.

2.4.2 Ideal Software

Based on the conclusions drawn from the prototype system and feedback from experts, the features and functionality of an idealized software system can be outlined.
The input features of such a system should include, as in the prototype, demographic information collected from the user and, in addition, an option for collecting family history of any diseases and relevant environmental exposures (e.g. cigarette smoking). The genotype parser should also be flexible and accommodate various file formats. Preferably, a widely accepted file format standard should be established for genotyping data released from platforms such as Illumina and Affymetrix. A standard file format would make the exchange of genotyping data across studies more efficient. Lastly, user information on non-SNP variants, such as insertions/deletions, copy number variations and large-scale structural variants, should also be accepted.

At the core of the software, the ideal DNA-Disease database will contain information for as many common diseases as possible. There are two ways to populate such a database. The first is to create a meta-analysis engine for each disease. When new studies are published for a disease, they can be added to the database and the meta-analysis re-performed over all the studies for that disease. This would require updating the database each time new disease-associated markers are found. In the second approach, genotype-phenotype data would be extracted from disease-specific databases such as AlzGene and PDGene; currently, however, disease-specific genetics databases in a suitable format are rare.

Based on recommendations from biostatisticians, an idealized software system's statistical approach would include a unique model for each disease (or a range of optional models). Since common diseases are varied and complex, it is crucial to have rigorously tested and validated statistical models. In addition, the statistical models will need to incorporate gene-gene interactions as well as covariates such as exposure to environmental or behavioral factors.
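As a rough illustration of the meta-analysis engine described above, the sketch below pools per-study odds ratios for one marker with fixed-effect inverse-variance weighting and is simply re-run whenever a study is added. The choice of a fixed-effect model is an assumption made here (a random-effects model would be an equally plausible choice), the function name `pooled_odds_ratio` is hypothetical, and the study values in the usage example are invented.

```python
import math

Z_95 = 1.959963984540054  # standard normal quantile for a two-sided 95% interval


def pooled_odds_ratio(studies):
    """Fixed-effect inverse-variance pooling of per-study odds ratios.

    studies: iterable of (odds_ratio, ci_low, ci_high) with 95% confidence limits.
    Returns (pooled_or, pooled_ci_low, pooled_ci_high).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for odds_ratio, ci_low, ci_high in studies:
        # Recover the standard error of log(OR) from the width of the interval.
        se = (math.log(ci_high) - math.log(ci_low)) / (2.0 * Z_95)
        weight = 1.0 / (se * se)
        weighted_sum += weight * math.log(odds_ratio)
        weight_total += weight
    if weight_total == 0.0:
        raise ValueError("at least one study is required")
    pooled_log = weighted_sum / weight_total
    pooled_se = math.sqrt(1.0 / weight_total)
    return (
        math.exp(pooled_log),
        math.exp(pooled_log - Z_95 * pooled_se),
        math.exp(pooled_log + Z_95 * pooled_se),
    )


if __name__ == "__main__":
    # Invented studies for a single marker: (OR, lower 95% limit, upper 95% limit).
    studies = [(1.20, 1.00, 1.44), (1.35, 1.05, 1.74)]
    print(pooled_odds_ratio(studies))
    # A newly published study is appended and the pooling is repeated.
    studies.append((1.10, 0.95, 1.27))
    print(pooled_odds_ratio(studies))
```

Re-pooling from the full list of studies each time keeps the stored estimate reproducible, at the cost of recomputation on every update.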
The user interface of an ideal system, both input and output, will have to be tailored to its audience. For example, the current disease risk profile report generated by D-GRIP is intended to be read by a trained user such as a genetic counselor. If one were to target family physicians, as suggested by one survey participant, it might be more suitable for the output to highlight links to information about prevention. Appropriate training will be required for any user of such a system, be it genetic counselors, family physicians or individual subjects. Lastly, the respondents strongly recommended that access to D-GRIP-like tools be restricted: the complicated interpretation of risk and the potential for undue stress on the recipient of the information combine to warrant limited user access for the near term. As a last comment, the consensus from the feedback was that such an ideal system could be accepted and used clinically in 5 to 10 years.

2.4.3 Implications

There are many societal, ethical and legal implications involved with using D-GRIP. The immediate issues are discussed here and potential directions are presented. One of the pressing questions deals with data protection. Genetic data should receive the same level of protection as sensitive medical data, that is, confidentiality and privacy. In addition, the individual's rights should be respected every time such a tool is used in a professional setting. Currently, D-GRIP protects the user's rights by not storing any user-specified information (demographic or genotype) and ensures confidentiality via anonymous submission of genetic data. However, in the long term it would be more appropriate for a continuous analysis engine to reassess the DNA each time a new genetic risk marker is deposited into the database.
Therefore, encryption and privacy features are required in such a tool.

Much research is still needed on how to present and explain genetic risk information to individuals [10]. Inappropriately explaining risks can lead to demoralization and unnecessarily increased anxiety, both of which can decrease an individual's ability to change risk-related behavior [28, 42]. Also, most people find probabilities and relative risk information difficult to comprehend, in part due to poor presentation of statistics [20]. Thus, it is recommended to use standard vocabulary, use a common denominator when explaining odds, provide both positive and negative perspectives, and use visual aids for probabilities [31].

Genetic testing for affected or at-risk individuals creates serious ethical dilemmas. Concerns such as discrimination by employers and insurers, and the fear of such discrimination, can deter individuals who could benefit from genetic testing. It also remains to be seen how third-party use of genetic information will affect the use of predictive tools such as D-GRIP. These issues will have to be discussed and addressed by governments, industries and the public in a transparent manner [22].

2.4.4 Conclusions

The creation of the D-GRIP system for genetic risk prediction was intended to identify the bioinformatics, statistical and scientific challenges that must be addressed to create predictive systems of clinical utility. The major bioinformatics limitation is the lack of available data on strongly predictive susceptibility alleles for complex diseases. This is in part due to the lack of organized and computationally exploitable disease databases for complex disorders. The major statistical limitation is the calculation of risk given large-scale genotype data (e.g. incorporating haplotype information into the analysis).
The major scientific limitation, despite the flurry of association studies, is our limited understanding of complex diseases and of how various genes interact with each other and the environment. Any proposed predictive model (be it for a single disease or a general model) will have to undergo rigorous testing and evaluation in order to ensure clinical utility.

When these limitations are overcome, useful and beneficial predictive software can be created and implemented. The key features include: incorporation of genotype data along with family history of disease, a continuously updated DNA-Disease database with a meta-analysis engine, validated disease-specific risk models, and user-oriented risk profile reporting. The software will be used in a guided setting, with potential users being genetic counselors and family physicians. Regardless of the user, appropriate training in using the software and interpreting the output will be a necessity. Lastly, implications such as privacy and confidentiality of genetic data, appropriate explanations of risk, discrimination against individuals by third parties, effects on public health policies, and public education are all important challenges to be addressed before implementation of such a predictive tool becomes a reality.

Bibliography

[1] S Bagade, NC Allen, R Tanzi, and L Bertram. The pdgene database, alzheimer research forum, available at: http://www.pdgene.org/, Accessed May 2007.
[2] Michael R Barnes. Navigating the hapmap. Brief Bioinform, 7(3):211-24, September 2006.
[3] J C Barrett, B Fry, J Maller, and M J Daly. Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics, 21(2):263-5, January 2005.
[4] A E Baum, N Akula, M Cabanero, I Cardona, W Corona, B Klemens, T G Schulze, S Cichon, M Rietschel, M M Nothen, A Georgi, J Schumacher, M Schwarz, R Abou Jamra, S Hofels, P Propping, J Satagopan, S D Detera-Wadleigh, J Hardy, and F J McMahon. A genome-wide association study implicates diacylglycerol kinase eta (dgkh) and several other genes in the etiology of bipolar disorder. Mol Psychiatry, May 2007.
[5] Lars Bertram, Matthew B McQueen, Kristina Mullin, Deborah Blacker, and Rudolph E Tanzi. Systematic meta-analysis of alzheimer disease genetic association studies: The alzgene database. Nature Genetics, 39:17-23, January 2007.
[6] International Hapmap Consortium. The international hapmap project. Nature, 426(6968):789-96, December 2003.
[7] Keith D Coon, Amanda J Myers, David W Craig, Jennifer A Webster, John V Pearson, Diane Hu Lince, Victoria L Zismann, Thomas G Beach, Doris Leung, Leslie Bryden, Rebecca F Halperin, Lauren Marlowe, Mona Kaleem, Douglas G Walker, Rivka Ravid, Christopher B Heward, Joseph Rogers, Andreas Papassotiropoulos, Eric M Reiman, John Hardy, and Dietrich A Stephan. A high-density whole-genome association study reveals that apoe is the major susceptibility gene for sporadic late-onset alzheimer's disease. J Clin Psychiatry, 68(4):613-8, April 2007.
[8] J R Fraser Cummings, Rachel Cooney, Saad Pathan, Carl A Anderson, Jeffrey C Barrett, John Beckly, Alessandra Geremia, Laura Hancock, Changcun Guo, Tariq Ahmad, Lon R Cardon, and Derek P Jewell. Confirmation of the role of ATG16L1 as a Crohn's disease susceptibility gene. Inflamm Bowel Dis, April 2007.
[9] Ofir Davidovich, Gad Kimmel, and Ron Shamir. Gevalt: an integrated software tool for genotype analysis. BMC Bioinformatics, 8:36, 2007.
[10] Adrian Edwards, Silvana Unigwe, Glyn Elwyn, and Kerenza Hood. Effects of communicating individual risks in screening programmes: Cochrane systematic review. BMJ, 327(7417):703-9, September 2003.
[11] David M Euhus, Kristin C Smith, Linda Robinson, Amy Stucky, Olufunmilayo I Olopade, Shelly Cummings, Judy E Garber, Anu Chittenden, Gordon B Mills, Paula Rieger, Laura Esserman, Beth Crawford, Kevin S Hughes, Connie A Roche, Patricia A Ganz, Joyce Seldon, Carol J Fabian, Jennifer Klemp, and Gail Tomlinson. Pretest prediction of brca1 or brca2 mutation by risk counselors and the computer model brcapro. J Natl Cancer Inst, 94(11):844-51, June 2002.
[12] J.B. Fan, A. Oliphant, R. Shen, B.G. Kermani, F. Garcia, K.L. Gunderson, M. Hansen, F. Steemers, S.L. Butler, P. Deloukas, L. Galver, S. Hunt, C. McBride, M. Bibikova, T. Rubano, J. Chen, E. Wickham, D. Doucet, W. Chang, D. Campbell, B. Zhang, S. Kruglyak, D. Bentley, J. Haas, P. Rigault, L. Zhou, J. Stuelpnagel, and M.S. Chee. Highly parallel snp genotyping. Cold Spring Harbor Symposia on Quantitative Biology, 68:69-78, 2003.
[13] Martin Farrall and Andrew P Morris. Gearing up for genome-wide gene-association studies. Hum Mol Genet, 14 Spec No. 2:R157-62, October 2005.
[14] Simon Fiddy, David Cattermole, Dong Xie, Xiao Yuan Duan, and Richard Mott. Igs: An integrated system for genetic analysis. BMC Bioinformatics, 7:210, 2006.
[15] McKusick-Nathans Institute for Genetic Medicine and National Center for Biotechnology Information. Online mendelian inheritance in man omim (tm), http://www.ncbi.nlm.nih.gov/omim/, July 2006.
[16] D Fredman, M Siegfried, Y P Yuan, P Bork, H Lehvaslaiho, and A J Brookes. Hgvbase: a human sequence variation database emphasizing data quality and a broad spectrum of data sources. Nucleic Acids Res, 30(1):387-91, January 2002.
[17] Nelson B Freimer and Chiara Sabatti. Human genetics: variants in common diseases. Nature, 445(7130):828-30, February 2007.
[18] Genelex. Genelex website, available at http://www.genelex.com/, May 2007.
[19] GeneTrack. Genetrack website, available at http://www.genetrack.bc.ca, July 2006.
[20] Gerd Gigerenzer and Adrian Edwards.
Simple tools for understanding risks: from innumeracy to insight. BMJ, 327(7417):741-4, September 2003.
[21] Alan E Guttmacher and Francis S Collins. Welcome to the genomic era. N Engl J Med, 349(10):996-8, September 2003.
[22] Wayne D Hall, Katherine I Morley, and Jayne C Lucke. The prediction of disease risk in genomic medicine. EMBO Rep, 5 Spec No:S22-6, October 2004.
[23] Lynn B Jorde and Stephen P Wooding. Genetic variation, classification and 'race'. Nat Genet, 36(11 Suppl):S28-33, November 2004.
[24] K M Kelly and K Sweet. In search of a familial cancer risk assessment tool. Clin Genet, 71(1):76-83, January 2007.
[25] Muin J Khoury, Julian Little, Marta Gwinn, and John Pa Ioannidis. On the synthesis and interpretation of consistent but weak gene-disease associations in the era of genome-wide association studies. Int J Epidemiol, 36(2):439-45, April 2007.
[26] Cecile Libioulle, Edouard Louis, Sarah Hansoul, Cynthia Sandor, Frederic Farnir, Denis Franchimont, Severine Vermeire, Olivier Dewit, Martine de Vos, Anna Dixon, Bruno Demarche, Ivo Gut, Simon Heath, Mario Foglio, Liming Liang, Debby Laukens, Myriam Mni, Diana Zelenika, Andre Van Gossum, Paul Rutgeerts, Jacques Belaiche, Mark Lathrop, and Michel Georges. Novel crohn disease locus identified by genome-wide association maps to a gene desert on 5p13.1 and modulates expression of ptger4. PLoS Genet, 3(4):e58, April 2007.
[27] Yen-Ling Low, Sara Wedren, and Jianjun Liu. High-throughput genomic technology in research and clinical management of breast cancer, evolving landscape of genetic epidemiological studies. Breast Cancer Res, 8(3):209, 2006.
[28] T M Marteau and R T Croyle. The new genetics, psychological responses to genetic testing. BMJ, 316(7132):693-6, February 1998.
[29] Ruth McPherson, Alexander Pertsemlidis, Nihan Kavaslar, Alexandre Stewart, Robert Roberts, David R Cox, David A Hinds, Len A Pennacchio, Anne Tybjaerg-Hansen, Aaron R Folsom, Eric Boerwinkle, Helen H Hobbs, and Jonathan C Cohen. A common allele on chromosome 9 associated with coronary heart disease. Science, May 2007.
[30] NCBI. dbgap: Database of genome wide association studies, url: http://www.ncbi.nlm.nih.gov/sites/entrez?db=gap, 2007.
[31] John Paling. Strategies to help patients understand risks. BMJ, 327(7417):745-8, September 2003.
[32] Lyle J Palmer and Lon R Cardon. Shaking the tree: mapping complex disease genes with linkage disequilibrium. Lancet, 366(9492):1223-34, October 2005.
[33] George P Patrinos and Anthony J Brookes. Dna, diseases and databases: disastrously deficient. Trends Genet, 21(6):333-8, June 2005.
[34] Richa Saxena, Benjamin F Voight, Valeriya Lyssenko, Noel P Burtt, Paul I W de Bakker, Hong Chen, Jeffrey J Roix, Sekar Kathiresan, Joel N Hirschhorn, Mark J Daly, Thomas E Hughes, Leif Groop, David Altshuler, Peter Almgren, Jose C Florez, Joanne Meyer, Kristin Ardlie, Kristina Bengtsson, Bo Isomaa, Guillaume Lettre, Ulf Lindblad, Helen N Lyon, Olle Melander, Christopher Newton-Cheh, Peter Nilsson, Marju Orho-Melander, Lennart Rastam, Elizabeth K Speliotes, Marja-Riitta Taskinen, Tiinamaija Tuomi, Candace Guiducci, Anna Berglund, Joyce Carlson, Lauren Gianniny, Rachel Hackett, Liselott Hall, Johan Holmkvist, Esa Laurila, Marketa Sjogren, Maria Sterner, Aarti Surti, Margareta Svensson, Malin Svensson, Ryan Tewhey, Brendan Blumenstiel, Melissa Parkin, Matthew Defelice, Rachel Barry, Wendy Brodeur, Jody Camarata, Nancy Chia, Mary Fava, John Gibbons, Bob Handsaker, Claire Healy, Kieu Nguyen, Casey Gates, Carrie Sougnez, Diane Gage, Marcia Nizzari, Stacey B Gabriel, Gung-Wei Chirn, Qicheng Ma, Hemang Parikh, Delwood Richardson, Darrell Ricke, and Shaun Purcell.
Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels. Science, April 2007.
[35] Laura J Scott, Karen L Mohlke, Lori L Bonnycastle, Cristen J Willer, Yun Li, William L Duren, Michael R Erdos, Heather M Stringham, Peter S Chines, Anne U Jackson, Ludmila Prokunina-Olsson, Chia-Jen Ding, Amy J Swift, Narisu Narisu, Tianle Hu, Randall Pruim, Rui Xiao, Xiao-Yi Li, Karen N Conneely, Nancy L Riebow, Andrew G Sprau, Maurine Tong, Peggy P White, Kurt N Hetrick, Michael W Barnhart, Craig W Bark, Janet L Goldstein, Lee Watkins, Fang Xiang, Jouko Saramies, Thomas A Buchanan, Richard M Watanabe, Timo T Valle, Leena Kinnunen, Goncalo R Abecasis, Elizabeth W Pugh, Kimberly F Doheny, Richard N Bergman, Jaakko Tuomilehto, Francis S Collins, and Michael Boehnke. A genome-wide association study of type 2 diabetes in finns detects multiple susceptibility variants. Science, April 2007.
[36] Nameeta Shah, Michael V Teplitsky, Simon Minovitsky, Len A Pennacchio, Philip Hugenholtz, Bernd Hamann, and Inna L Dubchak. Snp-vista: an interactive snp visualization tool. BMC Bioinformatics, 6:292, 2005.
[37] Robert Sladek, Ghislain Rocheleau, Johan Rung, Christian Dina, Lishuang Shen, David Serre, Philippe Boutin, Daniel Vincent, Alexandre Belisle, Samy Hadjadj, Beverley Balkau, Barbara Heude, Guillaume Charpentier, Thomas J. Hudson, Alexandre Montpetit, Alexey V. Pshezhetsky, Marc Prentki, Barry I. Posner, David J. Balding, David Meyre, Constantin Polychronakos, and Philippe Froguel. A genome-wide association study identifies novel risk loci for type 2 diabetes. Nature, 445:881-885, February 2007.
[38] E M Smigielski, K Sirotkin, M Ward, and S T Sherry. dbsnp: a database of single nucleotide polymorphisms. Nucleic Acids Res, 28(1):352-5, January 2000.
[39] Valgerdur Steinthorsdottir, Gudmar Thorleifsson, Inga Reynisdottir, Rafn Benediktsson, Thorbjorg Jonsdottir, G Bragi Walters, Unnur Styrkarsdottir, Solveig Gretarsdottir, Valur Emilsson, Shyamali Ghosh, Adam Baker, Steinunn Snorradottir, Hjordis Bjarnason, Maggie C Y Ng, Torben Hansen, Yu Bagger, Robert L Wilensky, Muredach P Reilly, Adebowale Adeyemo, Yuanxiu Chen, Jie Zhou, Vilmundur Gudnason, Guanjie Chen, Hanxia Huang, Kerrie Lashley, Ayo Doumatey, Wing-Yee So, Ronald C Y Ma, Gitte Andersen, Knut Borch-Johnsen, Torben Jorgensen, Jana V van Vliet-Ostaptchouk, Marten H Hofker, Cisca Wijmenga, Claus Christiansen, Daniel J Rader, Charles Rotimi, Mark Gurney, Juliana C N Chan, Oluf Pedersen, Gunnar Sigurdsson, Jeffrey R Gulcher, Unnur Thorsteinsdottir, Augustine Kong, and Kari Stefansson. A variant in CDKAL1 influences insulin response and risk of type 2 diabetes. Nat Genet, April 2007.
[40] Gudmundur A Thorisson, Albert V Smith, Lalitha Krishnan, and Lincoln D Stein. The international hapmap project web site. Genome Res, 15(11):1592-3, November 2005.
[41] Wenyi Wang, Sining Chen, Kieran A Brune, Ralph H Hruban, Giovanni Parmigiani, and Alison P Klein. Pancpro: risk assessment for individuals with a family history of pancreatic cancer. J Clin Oncol, 25(11):1417-22, April 2007.
[42] A J Wright, J Weinman, and T M Marteau. The impact of learning of a genetic predisposition to nicotine dependence: an analogue study. Tob Control, 12(2):227-30, June 2003.
[43] Nan Yang, Hongzhe Li, Lindsey A Criswell, Peter K Gregersen, Marta E Alarcon-Riquelme, Rick Kittles, Russell Shigeta, Gabriel Silva, Pragna I Patel, John W Belmont, and Michael F Seldin. Examination of ancestry and ethnic affiliation using highly informative diallelic dna markers: application to diverse and admixed populations and implications for clinical epidemiology and forensic medicine. Hum Genet, 118(3-4):382-92, December 2005.
[44] Quanhe Yang, Muin J.
Khoury, Lorenzo Botto, J.M. Friedman, and Dana Flanders. Improving the prediction of complex diseases by testing for multiple disease-susceptibility genes. American Journal of Human Genetics, 72:636-649, 2003.
[45] Lan-Juan Zhao, Miao-Xin Li, Yan-Fang Guo, Fu-Hua Xu, Jin-Long Li, and Hong-Wen Deng. Snpp: automating large-scale snp genotype data management. Bioinformatics, 21(2):266-8, January 2005.

Chapter 3. Conclusions and Future Directions

3.1 Further Observations

One of the most important observations made during D-GRIP development and testing was the lack of computationally efficient organization of existing and new discoveries in the genetics field [5, 10]. There has been an explosion of data from recent progress in disease genetics, and even though there are currently many types of mutation databases, progress towards the creation of new databases has been slow. The challenges involved are often technical in nature, such as gathering, exchanging, integrating and interpreting the disease-related information. However, the lack of targeted funding and the inherent bias towards making new discoveries rather than managing existing data are arguably among the main underlying problems [10].

In order to overcome the technical limitations of creating a comprehensive, computationally exploitable genotype-phenotype database, a few goals must be met. For easy computational access, complex phenotype data models that extensively utilize phenotype ontologies will be required. By using ontologies, a standard vocabulary can be established, which will help integrate various types of data and make analysis computationally easier. Initially, the DNA changes related to phenotypes can be represented in a structured and standardized way. Then, a basic framework for gathering, integrating, analyzing and updating the stored information will be required.
Given the enormous amounts of data being generated, a systematic and standardized way to manage phenotype data will be a necessity, which will require international cooperation and open access to anonymous data. Ultimately, an ideal genotype-phenotype database will provide a systems biology approach in which all information pertaining to the connection between genotypic differences and phenotypic consequences, such as that derived from the genome, transcriptome, proteome and metabolome, will be recorded.

The second important observation that resulted from my work on D-GRIP was the limited number of variants that are known to be associated with complex diseases. Even though individual genome-wide association studies (GWAs) are publishing results for many diseases [12, 11, 1, 2, 4, 9], most of the studies report only a few disease-associated variants [3, 8]. In addition, the reported effects of individual genetic variants associated with common diseases are small (risk ratios ≤ 2.0). It has been shown that the combined effects of a moderate number (fewer than 20) of common genetic variants (with risk ratios ≤ 2.0) could explain 50% of the burden of disease in a population [13]; nevertheless, there are numerous challenges with genome-wide association studies. These challenges include, for example, significance-chasing bias (including publication bias, selective analysis and reporting bias), population stratification (due to heterogeneous population mixtures), misclassification of exposures and outcomes, and inherent problems such as failure to detect gene-gene and gene-environment interactions, limited sample size and statistical power, and false positive associations. All these issues can make it difficult to find biologically meaningful genetic associations and thus slow progress in understanding complex diseases.
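To give a feel for how many weak variants could jointly account for a large share of disease burden, the following sketch combines per-variant population attributable fractions. It uses Levin's formula for a single variant and assumes the variants act independently with multiplicative effects; this is one standard way to reason about such a claim and is not necessarily the method used in [13]. The function names and the frequency and risk-ratio values are hypothetical.

```python
def attributable_fraction(risk_genotype_freq, risk_ratio):
    """Levin's population attributable fraction for a single variant."""
    excess = risk_genotype_freq * (risk_ratio - 1.0)
    return excess / (1.0 + excess)


def combined_attributable_fraction(variants):
    """Combine per-variant fractions.

    Assumption: variants are independent and their effects are multiplicative,
    so the fractions of disease NOT attributable to each variant multiply.
    variants: iterable of (risk_genotype_freq, risk_ratio) pairs.
    """
    remaining = 1.0
    for freq, risk_ratio in variants:
        remaining *= 1.0 - attributable_fraction(freq, risk_ratio)
    return 1.0 - remaining


if __name__ == "__main__":
    # Invented scenario: identical variants, 30% risk-genotype frequency, risk ratio 1.5.
    for count in (1, 5, 10, 20):
        share = combined_attributable_fraction([(0.3, 1.5)] * count)
        print(count, round(share, 3))
```

Under these invented inputs the combined fraction grows quickly with the number of variants, which is why individually weak associations can still matter at the population level.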
To alleviate these problems and infer true disease-associated variants from numerous GWAs, standards should be established for presenting and interpreting the accumulated evidence. Efforts by the Human Genome Epidemiology Network (HuGENet) are ongoing to develop systematic approaches for assessing the combined evidence for disease-associated variants. The approaches include criteria such as biological plausibility, experimental evidence, sound methods for conduct and analysis, and appropriate replication [8]. The opportunity to develop methods and standards for measuring, validating and interpreting genetic associations will be high in the next few years and will ultimately benefit individual and population health.

3.2 Future Considerations

The goal of shifting the current medical paradigm from a reactive to a preventative approach through personalized risk profiles appears within reach in the long term. The generation of genetic risk profiles is intended to improve disease prevention by prompting at-risk individuals to take specific preventative actions that usually involve environmental exposures, diet or other lifestyle changes. However, before genetic risk assessment tools can be used in a clinical setting, an evaluation of the clinical utility of such tools needs to be conducted [7].

The clinical utility of a test refers to the likelihood that a diagnostic test will lead to improved health outcomes [7]. For individuals with positive test results, the clinical utility depends on the availability, safety and effectiveness of therapeutic measures. The recommendation for ensuring clinical utility for any genetic test is to consider the clinical and social outcomes of the test. Clinical outcomes depend on effective changes in lifestyle following a positive test result. The social outcomes depend on the psychosocial, ethical, legal and social issues related to receiving a positive or negative result.
Both clinical and social outcomes are important because they both contribute to the net balance between the benefits and harms of genetic testing [6]. Thus, future evaluation of genomic profiles should clearly address the validity, clinical utility and social utility of the test.

Regardless of the intended audience for genetic risk profiling software, two crucial criteria must be met when providing a genetic profile test. First, because knowledge about the clinical implications of such tests is still limited, the benefits and limitations of the tests should be clearly explained. Such limitations should be explicitly addressed, and those who provide tests should disclose what is and is not known about the test. Second, the tests should be offered in a controlled environment in which individual test takers are counseled about the results and implications of the tests. With transparency in providing the genetic profile test and counseling for the individual test taker, informed decisions can be made by health professionals, patients and the general public.

Lastly, consensus needs to be achieved on when genomic profiling has reached an acceptable standard for use in a clinical setting. In the future, genomic profiling will likely become common, and thus the level of evidence that justifies its clinical use requires careful thought. It is recommended to develop an accepted process that incorporates defined procedures for evaluating evidence and reaching conclusions, with input from clinicians, health care payers and consumers.

3.3 Conclusion

Given the advent of new genotyping technologies and the rapid discovery of new disease-associated variants, experts have predicted that future medical care will become more personalized and geared towards disease prevention.
We created a prototype web tool, called DNA Genetic Risk Information Profile (D-GRIP), which predicts disease risk profiles based on an individual's genotype. The project outlined the current bioinformatics and scientific limitations involved in creating genetic risk assessment software and addressed the main issues involved in the creation, evaluation and utility of such a tool in a clinical setting. By overcoming the major limitations and addressing the important issues, viable and useful genetic risk profiling software is plausible in the future and will thus lead to a change in the way medicine is currently practiced.

Bibliography

[1] A E Baum, N Akula, M Cabanero, I Cardona, W Corona, B Klemens, T G Schulze, S Cichon, M Rietschel, M M Nothen, A Georgi, J Schumacher, M Schwarz, R Abou Jamra, S Hofels, P Propping, J Satagopan, S D Detera-Wadleigh, J Hardy, and F J McMahon. A genome-wide association study implicates diacylglycerol kinase eta (dgkh) and several other genes in the etiology of bipolar disorder. Mol Psychiatry, May 2007.
[2] Keith D Coon, Amanda J Myers, David W Craig, Jennifer A Webster, John V Pearson, Diane Hu Lince, Victoria L Zismann, Thomas G Beach, Doris Leung, Leslie Bryden, Rebecca F Halperin, Lauren Marlowe, Mona Kaleem, Douglas G Walker, Rivka Ravid, Christopher B Heward, Joseph Rogers, Andreas Papassotiropoulos, Eric M Reiman, John Hardy, and Dietrich A Stephan. A high-density whole-genome association study reveals that apoe is the major susceptibility gene for sporadic late-onset alzheimer's disease. J Clin Psychiatry, 68(4):613-8, April 2007.
[3] Jennifer Couzin and Jocelyn Kaiser. Genome-wide association, closing the net on common disease genes. Science, 316(5826):820-2, May 2007.
[4] J R Fraser Cummings, Rachel Cooney, Saad Pathan, Carl A Anderson, Jeffrey C Barrett, John Beckly, Alessandra Geremia, Laura Hancock, Changcun Guo, Tariq Ahmad, Lon R Cardon, and Derek P Jewell.
Confirmation of the role of ATG16L1 as a Crohn's disease susceptibility gene. Inflamm Bowel Dis, April 2007.
[5] Angela Frodsham and Julian Higgins. Online genetic databases informing human genome epidemiology. BMC Med Res Methodol, 7(1):31, July 2007.
[6] Scott D Grosse and Muin J Khoury. What is the clinical utility of genetic testing? Genet Med, 8(7):448-50, July 2006.
[7] Susanne B Haga, Muin J Khoury, and Wylie Burke. Genomic profiling to promote a healthy lifestyle: not ready for prime time. Nat Genet, 34(4):347-50, August 2003.
[8] Muin J Khoury, Julian Little, Marta Gwinn, and John Pa Ioannidis. On the synthesis and interpretation of consistent but weak gene-disease associations in the era of genome-wide association studies. Int J Epidemiol, 36(2):439-45, April 2007.
[9] Ruth McPherson, Alexander Pertsemlidis, Nihan Kavaslar, Alexandre Stewart, Robert Roberts, David R Cox, David A Hinds, Len A Pennacchio, Anne Tybjaerg-Hansen, Aaron R Folsom, Eric Boerwinkle, Helen H Hobbs, and Jonathan C Cohen. A common allele on chromosome 9 associated with coronary heart disease. Science, May 2007.
[10] George P Patrinos and Anthony J Brookes. Dna, diseases and databases: disastrously deficient. Trends Genet, 21(6):333-8, June 2005.
[11] Laura J Scott, Karen L Mohlke, Lori L Bonnycastle, Cristen J Willer, Yun Li, William L Duren, Michael R Erdos, Heather M Stringham, Peter S Chines, Anne U Jackson, Ludmila Prokunina-Olsson, Chia-Jen Ding, Amy J Swift, Narisu Narisu, Tianle Hu, Randall Pruim, Rui Xiao, Xiao-Yi Li, Karen N Conneely, Nancy L Riebow, Andrew G Sprau, Maurine Tong, Peggy P White, Kurt N Hetrick, Michael W Barnhart, Craig W Bark, Janet L Goldstein, Lee Watkins, Fang Xiang, Jouko Saramies, Thomas A Buchanan, Richard M Watanabe, Timo T Valle, Leena Kinnunen, Goncalo R Abecasis, Elizabeth W Pugh, Kimberly F Doheny, Richard N Bergman, Jaakko Tuomilehto, Francis S Collins, and Michael Boehnke.
A genome-wide association study of type 2 di-abetes in finns detects multiple susceptibility variants. Science, April 2007. [12] Robert Sladek, Ghislain Rocheleau, Johan Rung, Christian Dina, Lishuang Shen, David Serre, Philippe Boutin, Daniel Vincent, Alexandre Belisle, Samy Hadjadj, Beverley Balkau, Barbara Heude, Guillanume Charpentier, Thomas J. Hudson, Alexandre Montpetit, Alexey V. Pshezhetsky, Marc Prentki, Barry I. Posner, David J. Bald-ing, David Meyre, Constantin Polychronakos, and Philippe Froguel. A genome-wide association study identifies novel risk loci for type 2 diabetes. Nature, 445:881-885, February 2007. 58 Bibliography [13] Quanhe Yang, Muin J Khoury, Jm Friedman, Julian Little, and W Dana Flanders. How many genes underlie the occurrence of common complex diseases in the population? Int J Epidemiol, 34(5): 1129-37, October 2005. 59 Appendix A Feedback from Experts A . l Questions The set of questions asked to each type of expert (clinical geneticist, molec-ular geneticist, genetic counselors and biostatisticians) are listed below. 1. Any comments on the user-interface of D-GRIP? \u00E2\u0080\u00A2 The input page? \u00E2\u0080\u00A2 The output page? 2. Any comments or references to available risk models that predict risk based on genotype data? \u00E2\u0080\u00A2 How to include age specific risk prediction without raw data? 3. How should an ideal system handle various complex diseases? Treat each separately with disease-specific risk model? 4 . The system shows a very fatalistic view. Do you think we should include more positive news? 5. Who could be a potential user of D-GRIP? \u00E2\u0080\u00A2 Genetic Counselors? \u00E2\u0080\u00A2 Family Physicians? 60 Appendix A. Feedback from Experts \u00E2\u0080\u00A2 Insurance companies? \u00E2\u0080\u00A2 Lay public? \u00E2\u0080\u00A2 Yourself? 6. How many years down the line can you see this being used (respectively for each of the potential users from previous question)? 7. 
Do you think we should store people's genotype data? What about family doctors storing their patients' genotype data?
8. What are some of the implications you see from using such a system?
• Personal implications?
• Effect on patients?
• Societal implications?
9. In what journal can you see this type of paper being published?

A.2 Feedback

A summary of the feedback provided by several experts is detailed below. The experts consisted of two biostatisticians, two molecular geneticists, five clinical geneticists and 12 MSc genetic counseling students. The comments and recommendations are categorized by aspect of D-GRIP: user interface issues regarding input and output features, the core of D-GRIP dealing with the DNA-Disease database and risk prediction model, issues pertaining to the users, and any ethical, legal and social implications.

A.2.1 User Interface

Input and general usability
• Allow users the option to provide family history along with genotype data.
• Ethnicity classification is currently biased. Provide two options: first, user-specified ethnicity; second, ethnicity calculated from verified and reliable predetermined markers in the genotype data provided. The consensus was to calculate the ethnicity, but only when the calculation can be done reliably.
• When more data are available, allow input of copy number variation data.
• Provide a disclaimer that explicitly informs the user of all the limitations and assumptions of the software.
• Keep the interface simple and easy to use, as it is currently.

Risk Profile Report
• Tailor the final risk report towards the intended user. Currently, the view is more geared towards genetic researchers and counselors.
In contrast, for a family physician or a consumer, provide a 'Patient view' in which probabilities and risk are communicated visually and in which links to prevention and therapeutic options, as well as any relevant links for lifestyle and behavior changes, are provided.
• Provide the option of restricting analysis to specific diseases, for instance, diseases where prevention is an option versus those where no preventative options are currently available.

A.2.2 D-GRIP Core

Diseases, DNA-Disease database
• Implement a meta-analysis engine for each disease so that whenever new studies are published, the entire database is updated. In addition, create a notification system that informs users whenever such updates are performed.
• Store gene-gene, gene-environment and epigenetic information in the DNA-Disease database. Data on gender and age related to diseases are very important, especially for age-dependent diseases.
• When information regarding copy number variations related to diseases is available, store it in the DNA-Disease database.
• Also store intermediate phenotypes associated with markers, in addition to disease-associated markers.

Risk prediction issues
• Implement disease-specific risk models so that each disease is treated separately. Also, allow advanced users to choose multiple risk prediction models for each disease.
• When data are available, incorporate gene-gene and gene-environment effects into the respective disease risk models.
• Perform rigorous validation of each predictive model and prediction. Show the results of the tests performed, such as sensitivity, specificity and positive predictive values. Ensure that validation of the prediction models is performed with genotype data that are not part of the case-control population data in the DNA-Disease database.
Currently, such a volume of data for testing is not available, so this feature will have to wait for future versions. Also, provide links to studies supporting the risk prediction models for the respective diseases.

A.2.3 Potential Users
• Genetic counselors are a good initial user group for the software. During initial deployment of D-GRIP, user training will be required so that all limitations are understood and results are interpreted properly.
• Family physicians (or others in a primary care setting) are other potential users, but training for family physicians on how to use and interpret results from such a tool will be a necessity.
• Potentially, the general public could act as consumers of such software. But all implications will need to be addressed by health professionals, governments and industry before such software is released to the general public.
• Insurance companies could also be potential users, but the many social, legal and ethical implications will need to be addressed and a supporting framework will need to be implemented to handle third-party use of genetic data.
• As mentioned, the user interface of the software should be tailored towards the user.
• The consensus was that D-GRIP is currently ahead of its time, but similar software can be expected to be in use within the next 5-10 years. However, a better understanding of disease-associated variants and reliable predictions will be a necessity.
• Until proper standards and procedures are developed to handle all the ethical, legal and social implications, such software should always be used in a guided setting where the limitations are explained to the counseled individual and guidance is provided in understanding the results.

A.2.4 Implications
• As it is currently, there should be no user-identifiable storing of genotype data. User genotype data can be stored only when the family physician is the user and is storing the patient's genotype data. However, in the future, a proper framework will be required to handle genetic data management and to support privacy, confidentiality and anonymity.
• The level of care required in helping the general public interpret and understand the results is enormous, and this should be done appropriately.
• At the current rate, there will not be enough genetic counselors to support the future demand for counseling of individuals wanting a genetic risk profile.
• All necessary ethical, social and legal implications will need to be addressed by the providers of such a tool.

Appendix B D-GRIP User Manual

B.1 Introduction to D-GRIP

This user guide assumes you have access to D-GRIP, since D-GRIP is a closed and secure web tool. The guide explains the various features of D-GRIP and provides a brief walk-through. This guide is not intended to explain the results of D-GRIP or how to interpret them. The guide explains:
• The overall processes.
• Basic features that are available.

B.1.1 D-GRIP System

DNA Genetic Risk Information Profile (D-GRIP) is a genotype analysis system that predicts an individual's genetic risk profile based on the genotype. The system can take as input observed genotypes of up to one million positions of known single nucleotide polymorphisms (SNPs) in human populations.

The flow of information in D-GRIP begins with the input of user data. The user is asked to provide demographic information (ethnic background, age and gender) and a genotype file, which is parsed and temporarily stored.
Next, the system compares the genotyping results to an internal DNA-Disease risk database and, for each disease, calculates a risk score for developing the disease. Finally, a tabular output of potential diseases with the relevant disease risk for the individual is displayed.

[Login form: Username, Password, Submit]

Figure B.1: The entry into D-GRIP occurs with user authentication. A valid username and password are required to access D-GRIP.

B.2 D-GRIP Features

There are various features in D-GRIP, and a detailed description of each, with illustrations, is provided below. The page is laid out with a menu on the left and all the relevant content on the right. The menu contains navigation links to the Home page (Figure B.2), the Disclaimer page, the Use D-GRIP page and the Help page, and a link to log out of D-GRIP.

[Home page text: DNA Genetic Risk Information Profile. Welcome 'Test'. This web site provides a tool for predicting a genetic risk profile for a person by utilizing genotype information. Getting Started: Click on the 'Use D-GRIP' link. Fill in demographic information and click 'Next'. Upload a genotype file or copy/paste data into the form. Click on 'Calculate Risk'. Please log out when leaving D-GRIP. Note: tips are provided wherever the help icon appears; bring the cursor over it to see the tip. The page ends with the disclaimer reproduced in Figure B.3.]

Figure B.2: A snapshot of D-GRIP's main page. The page gives instructions on how to use D-GRIP and outlines a disclaimer for the user to read.

B.2.1 Disclaimer

The disclaimer explicitly outlines the assumptions made by D-GRIP (Figure B.3). The disclaimer is shown on the first page when the user accesses the site. A separate link is also provided to view the disclaimer.

Disclaimer
1. It is assumed the system is used in a guided setting.
2. All information provided by you ('the user') is assumed to be accurate. For instance, ethnic background provided by the user is assumed to be accurate to the best of the user's knowledge.
3. D-GRIP predicts risk of developing disease based on population information collected from literature.
4. The overall probability of developing a disease is calculated assuming all susceptible alleles/genes are acting independently within diseases and across diseases.
5. The system does not store any user-provided data (e.g. genotype and demographic data).

Figure B.3: The assumptions made by D-GRIP are listed as a disclaimer and shown here.

B.2.2 Input

The input page can be accessed by clicking on the 'Use D-GRIP' link in the menu on the left. Input to D-GRIP occurs in two steps. First, demographic information and configuration options are presented. Next, genotype data is requested from the user.

Demographic Information

Figure B.4 shows the first stage of the input. The mandatory information requested from the user is Gender, Age and Ethnic background. For the Age, the user enters the year of birth. For the Ethnic background, the user should select the most appropriate option based on the geographic ancestry of the user. The options presented are: Africa, Asia, Europe, Pacific, First Nations/Aboriginals and Mixed.
The configuration options currently consist of one checkbox, 'Inference of Genotypes'. Inference of genotypes uses the haplotype information from the HapMap Project website to infer disease-associated genotypes from the genotype data provided by the user. By default, the inference option is turned off (no tick in the checkbox). Once the demographic information form is filled in, the user proceeds to loading genotype data by clicking the 'Next' button.

[Form: Input user details. Demographic Information: Gender* (Male/Female), Year of Birth* (YYYY), Ethnic Background* (Europe selected). Configuration Options: Inference of Genotypes (click to turn on). 'Next' button. Mandatory fields marked *.]

Figure B.4: Demographic information and configuration options submitted to D-GRIP are shown here.

Genotype Data

Figure B.5 shows how the genotype data can be loaded into D-GRIP. There are two ways to load the genotype data. The copy/paste option allows the user to copy the genotype data and paste it into the text area provided. The mandatory fields for the copy/paste form are file format, file name and genotype data. After filling in the form, click on the 'Calculate Risk' button to generate the risk profile output.

For uploading a genotype file, the mandatory fields are the file format and the location where the file is stored. The user may use the 'Browse' button to find the genotype file on the hard drive. Note that the maximum allowed size for an uploaded genotype file is 10Mb. A file within this size limit can contain genotypes for more than 1 million SNPs. After filling in the form, click on the 'Upload File and Calculate Risk' button to generate the risk profile output.

Currently, D-GRIP accepts two file formats: Illumina Final format and Affymetrix Text Output. Examples of the respective genotype file formats are shown in Figure B.6.
The Illumina Final format can be obtained by generating a tab-delimited 'Final Report' using the Illumina platform's BeadStudio Genotyping Module software. The only fields necessary are SNP Name, Allele 1 and Allele 2. The Sample ID and GC Score are not necessary for D-GRIP.

The Affymetrix text output can be obtained by using the SNP Export feature in the Affymetrix GeneChip Genotyping Analysis Software and generating a tab-delimited output file. Again, the only fields necessary are the SNP identifier and the SNP genotype (two alleles).

In Figure B.5, next to the copy/paste form is a box with 'Pre-loaded' data. To illustrate D-GRIP, sample genotype files have been created and can be loaded using this 'Pre-loaded' data box. Simply select the particular sample and click on 'Get Sample'. A 'Comments' box appears describing the sample file, and the sample file appears in the copy/paste text area. Genotype Sample 1 is shown in Figure B.7.

[Form: Copy/Paste or Upload genotype information. Copy/Paste data: File format* (Illumina Final Format), File name*, Input genotype data*; 'Calculate Risk' button. Pre-loaded data: select test genotype data to load (Sample 1); 'Get Sample' button. Upload data: File format* (Illumina Final Format), Filename* with 'Browse...' button; 'Upload and Calculate Risk' button. Mandatory fields marked *.]

Figure B.5: Form for submitting the genotype data is shown here. The user can either copy/paste the genotype data or upload a genotype file. A set of sample genotypes is provided and can be loaded into the copy/paste form by clicking on 'Get Sample'.

(a) Illumina final format sample file
[Header]
BSGT Version: 2.1.10 30089; Processing Date: 5/2/2006 12:54 PM; Content: CS0006968-0PA; Num SNPs: 26; Total SNPs: 26; Num Samples: 1; Total Samples: 1
[Data]
Columns: SNP Name, Sample ID, Allele1 - Top, Allele2 - Top, GC Score
(Sample ID in every row: Europe - HD01-01 - Northern European HD01 - GM17001 - NA17001)
SNP Name    Allele1 - Top  Allele2 - Top  GC Score
rs2018621   A              G              0.63
rs4845378   G              G              0.54
rs1131706   T              T              0.6
rs2847173   G              G              0.54
rs12448760  A              G              0.65
rs10915884  G              G              0.89
rs1676885   A              A              0.59

(b) Affymetrix text output sample file
Columns: SNP, SAMPLE, GENOTYPE, SCORE
(SAMPLE in every row: Europe - HD01-01 - Northern European HD01 - GM17001 - NA17001)
SNP         GENOTYPE  SCORE
rs2018621   AG        0.6345
rs4845378   GG        0.5403
rs1131706   TT        0.6032
rs2847173   GG        0.5403
rs12448760  AG        0.6478
rs10915884  GG        0.8906
rs1676885   AA        0.5901

Figure B.6: The Illumina and Affymetrix tab-delimited file formats for D-GRIP. The respective column names are shown at the top.

[Screenshot: copy/paste form populated with Genotype Sample 1 in Illumina Final Format. The 'Comments' box describes Sample 1 as a Caucasian-population sample with selected SNPs from all diseases in the database; all genotypes are heterozygous for each disease except Parkinson disease, for which they are homozygous.]

Figure B.7: Genotype Sample 1 is loaded into the copy/paste form by clicking on 'Get Sample'. A description of the sample genotype file is given in the 'Comments' box.

B.2.3 Output

An example output of D-GRIP is shown in Figure B.8. The output of D-GRIP is a table that shows the user's SNPs that matched disease-associated SNPs. The table lists the disorder, the gene, the SNP and genotype associated with the disorder, the population in which the SNP occurs, the calculated odds ratio and a link to PubMed for literature articles supporting the association.

[Output table with columns Disorder, Gene, SNP, Genotype, Population, Odds ratio and PubMed ID, listing the matched SNPs for Alzheimer disease, Diabetes Mellitus type 2 and Parkinson disease. Each disease block ends with a row giving the background population probability and the overall calculated probability.]

Figure B.8: D-GRIP risk profile sample output. The output shows three diseases: Alzheimer's disease, Diabetes type 2 and Parkinson's disease. The respective associated SNPs for each disease are shown. The background and overall calculated probabilities of developing each disease are also shown.

The user can click on the gene name and the disorder name for external links to GenBank and OMIM, respectively. In addition, by clicking on each SNP row, more details about the SNP can be seen (Figure B.9).

[SNP detail panel for Diabetes Mellitus type 2, TCF7L2, rs7903146 (Caucasian). Genotypes: risk genotype and major genotype. Genotype frequencies in cases and controls. Statistics: odds ratio with 95% CI, log odds ratio with 95% CI, likelihood ratio with 95% CI, and probability of disease based on this SNP.]

Figure B.9: Details about one SNP from Diabetes type II disease.

More details about the probability calculation for each disease can be seen by clicking on the probability row (Figure B.10). If SNPs are found that are in high linkage disequilibrium (r² > 0.8), an integrated analysis is performed in which only one SNP from the set of high-LD SNPs is chosen for the overall calculated probability. This is illustrated on the right side of Figure B.10.

[Probability detail panel for Diabetes Mellitus type 2. User details: Age 47, Gender Male, Ethnicity Europe. Background probability details: age of onset 45 yrs, background probability 5%; age of onset 60 yrs, background probability 15%. Integrated analysis: SNP used in probability calculation rs11037909; SNPs in high linkage disequilibrium rs11037909, rs1113132, rs3740878.]

Figure B.10: Probability details for diabetes type 2 are shown here.

If the inference of genotypes configuration option was selected, the output will display SNPs from the inference analysis. An example of inferred SNPs and their corresponding details is illustrated in Figures B.11 and B.12.

[Output table for Diabetes Mellitus type 2 with an additional 'Inference Analysis' section listing the inferred SNPs: LOC387761, rs7480010, A/G, Caucasian, odds ratio 1.14, PubMed 17293876; SLC30A8, rs13266634, T/C, Caucasian, odds ratio 1.18, PubMed 17293876.]

Figure B.11: SNPs from inference analysis for Diabetes type 2 are shown.

[Inferred SNP detail panel for Diabetes Mellitus type 2, LOC387761, rs7480010 (A/G, Caucasian, PubMed 17293876). User's genotype: SNP rs4445619, genotype T/C. HapMap phase data for rs7480010 and rs4445619: allele 1 A and T, allele 2 G and C. HapMap SNP information: rs4445619, alleles T and C, genotype T/C, genotype frequency 0.309, chromosome 11, HapMap population CSHL-HAPMAP: HapMap-CEU. Disease-associated SNP details: risk genotype A/G, major genotype A/A; genotype frequencies A/G case 0.430, control 0.413; A/A case 0.449, control 0.492. Statistics: odds ratio (95% CI) 1.14 (1.02, 1.28); log odds ratio 0.13 ± 0.06, 95% CI (0.02, 0.25); likelihood ratio 1.07 ± 0.0017, 95% CI (0.99, 1.16).]

Figure B.12: Details about the inferred SNPs are shown. The details include the user's genotype, the HapMap data from which the inference was performed and the relevant statistics for the disease-associated SNP.

B.2.4 Help Tips

Help tips appear as a pop-up at the top right of the page. Whenever a blue question mark icon is displayed, the user can move the mouse over the question mark to see the relevant tip. This helps guide the user when using D-GRIP. Examples are shown below.

[Help tip for Ethnic Background* (Europe): Majority of data in database is based on Caucasian population. Thus, default is European ancestry.]

Figure B.13: An example of an ethnic background help tip is shown.

[Help tip for Inference of Genotypes: When the 'Inference of genotypes' option is turned on, any user genotypes that are in high linkage disequilibrium (r² > 0.8) with disease-associated SNPs are also reported in the generated risk profile. The reported inferred SNPs are not used in the overall probability calculation. Click the checkbox to turn on the Inference Analysis option.]

Figure B.14: An example of an inference of genotypes help tip is shown. "@en . "Thesis/Dissertation"@en . "10.14288/1.0101065"@en . "eng"@en . "Bioinformatics"@en . "Vancouver : University of British Columbia Library"@en . "University of British Columbia"@en . "For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use."@en . "Graduate"@en . 
"D-GRIP : DNA genetic risk information profile : A genotype analysis system to predict a genetic risk profile for an individual"@en . "Text"@en . "http://hdl.handle.net/2429/32186"@en .