UBC Theses and Dissertations

D-GRIP : DNA genetic risk information profile : A genotype analysis system to predict a genetic risk.. 2007

D-GRIP: DNA Genetic Risk Information Profile
A genotype analysis system to predict a genetic risk profile for an individual

by Siddhartha Srivastava

B.Sc, Biological Science, The University of Calgary, 2005
B.Sc, Computer Science, The University of Calgary, 2005

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in The Faculty of Graduate Studies (Bioinformatics)

The University Of British Columbia
October, 2007
© Siddhartha Srivastava 2007

Abstract

New genotyping technologies are producing reliable results with far greater coverage and at dramatically lower cost than previously possible. Given the rapid new discovery of disease associated markers and the new technology for determining the nucleotide sequences of key positions in the DNA of an individual, it is now feasible to apply existing knowledge to generate personalized analyses of genetic risk for diverse diseases. DNA Genetic Risk Information Profile (D-GRIP) is a genotype analysis software system that determines an individual's genetic risk profile given a genotype. The prototype web tool can take, as input, up to a million observed genotypes from single nucleotide positions known to be polymorphic in a human population. The submitted genotype data are compared to a database of disease associated single nucleotide polymorphisms (SNPs) and an output is generated, reporting disease-associated variants for which the individual has a predicted modified risk.

An evaluation of D-GRIP was performed through the direct surveying of potential users of such a system - users such as clinicians, genetic counselors and genetics researchers. Due to ethical issues related to providing a genetic risk profile, the prototype system is kept closed to the general public and reserved for research into the utility and requirements of such software.
The major conclusions drawn direct attention towards the key limitations presently precluding the creation of personalized genetic risk assessment. The lack of a computationally exploitable resource for disease associated genetic variants, the inherent statistical complexities involved with risk calculation for large-scale genotyping data and the limited understanding of interactions between genes, environment and complex diseases are all key factors that need to be overcome in order to create a practical genetic risk assessment tool.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgments
1 Introduction
  1.1 Variations and Diseases
  1.2 Discovery of new markers
  1.3 Genotyping technologies
  1.4 Bioinformatic Tools
    1.4.1 Commercial Systems
    1.4.2 Open Source Systems
  1.5 Overview of project
  Bibliography
2 D-GRIP: DNA Genetic Risk Information Profile
  2.1 Introduction
  2.2 Methods
    2.2.1 D-GRIP Overview
    2.2.2 Genotype-Phenotype Database
    2.2.3 Disease Risk Model
    2.2.4 Haplotype Data
    2.2.5 Software Evaluation
  2.3 Results
  2.4 Discussion
    2.4.1 Limitations
    2.4.2 Ideal Software
    2.4.3 Implications
    2.4.4 Conclusions
  Bibliography
3 Conclusions and Future Directions
  3.1 Further Observations
  3.2 Future Considerations
  3.3 Conclusion
  Bibliography
Appendices
A Feedback from Experts
  A.1 Questions
  A.2 Feedback
    A.2.1 User Interface
    A.2.2 D-GRIP Core
    A.2.3 Potential Users
    A.2.4 Implications
B D-GRIP User Manual
  B.1 Introduction to D-GRIP
    B.1.1 D-GRIP System
  B.2 D-GRIP Features
    B.2.1 Disclaimer
    B.2.2 Input
    B.2.3 Output
    B.2.4 Help Tips

List of Tables

1.1 A summary of genotyping technologies currently available. The cost per genotype is an estimate of maximal multiplexing capability.
A note, Illumina's Sentrix® numbers in the table are based on the Human1M BeadChip which will be released in the second quarter of 2007.

2.1 A summary of the number of genes and number of polymorphisms for each of the diseases in the DNA-Disease database.

List of Figures

2.1 A schematic overview detailing the flow of information across the various components of D-GRIP is illustrated.

2.2 The opening page of D-GRIP is shown. The instructions on how to use D-GRIP and a disclaimer explicitly stating the assumptions and limitations inherent in D-GRIP are shown.

2.3 The first step in using D-GRIP is illustrated, where the user's demographic information such as gender, age and ethnicity is collected. The hypothetical example above shows a male, 47 years old, of European ancestry. The inference option is turned on (checked).

2.4 The second step in using D-GRIP is illustrated here. The user has a choice of copying and pasting the genotype data or uploading it. For ease of use, various hypothetical sample genotype files were created to illustrate D-GRIP. The above example contains the 13 highly significant genotypes which are heterozygous for each disease in the DNA-Disease database. A description of the pre-loaded data is shown in the 'Comments' box.

2.5 The last step shows a tabular result for any single nucleotide polymorphisms (SNPs) found to be associated with a disease in the user's genotype data.

2.6 More details are shown for each SNP. As an example, details are shown for SNP rs7903146 in gene TCF7L2, associated with Diabetes Mellitus type II.

2.7 Details of overall probability calculation, integrated analysis and inferred SNPs are shown for Diabetes Mellitus type 2 disease. The integrated analysis indicates which disease-associated SNPs are in high linkage disequilibrium (r2 > 0.8). For SNPs in high LD, only the SNP with strongest effect (highest odds ratio) is used in the overall calculated probability.

B.1 The entry into D-GRIP occurs with user authentication. A valid username and password is required to access D-GRIP.

B.2 A snapshot of D-GRIP's main page. The page describes instructions on how to use D-GRIP and outlines a disclaimer for the user to read.

B.3 The assumptions made by D-GRIP are listed as a disclaimer and shown here.

B.4 Demographic information and configuration options submitted to D-GRIP are shown here.

B.5 Form for submitting the genotype data is shown here. The user can either copy/paste the genotype data or upload a genotype file. A set of sample genotypes is provided and can be loaded into the copy/paste form by clicking on 'Get Sample'.

B.6 The Illumina and Affymetrix tab-delimited file formats for D-GRIP. The respective column names are shown at the top.

B.7 Genotype sample 1 is loaded into the copy/paste form by clicking on 'Get Sample'. A description of the sample genotype file is shown in the 'Comments' box.

B.8 D-GRIP risk profile sample output. The output illustrates 3 diseases: Alzheimer's, Diabetes type 2 and Parkinson's disease. The SNPs associated with each disease are shown. The background and overall calculated probability of developing the disease are also shown.

B.9 Details about one SNP from Diabetes type II disease.

B.10 Probability details for diabetes type 2 are shown here.

B.11 SNPs from Inference analysis for Diabetes type 2 are shown.

B.12 Details about the inferred SNPs are shown. The details include the user's genotype, HapMap data from which inference was performed and the relevant statistics for the disease-associated SNP.

B.13 An example of an ethnic background help tip is shown.

B.14 An example of an inference of genotypes help tip is shown.

Acknowledgments

I would like to acknowledge my academic supervisor, Dr. Wyeth W. Wasserman.
Through his continuous support and guidance I have gained valuable insights in how to conduct and present research. I would also like to acknowledge my thesis committee: Francis Ouellette and Dr. Jan M. Friedman for their advice and support.

In addition, I thank Drs. Cornelius Boerkoel, Angela Brooks-Wilson, Lorne Clarke, Denise Daley, Anita Dircks, Bill Gibson, Jinko Graham, Millan Patel and Colin Ross for providing valuable feedback regarding utility and shortcomings of D-GRIP. I would like to extend further acknowledgments to Drs. William Dana Flanders, Muin Khoury and Quanhe Yang for their feedback on the statistical model. I want to also thank Francis Ouellette and Dr. Artem Cherkasov for giving their advice and encouragement during my training in the CIHR/MSFHR Strategic Training Program in Bioinformatics.

I would like to thank the members of the Wasserman Laboratory: David Arenillas, Jochen Brumm, Warren Cheung, Alice Chou, Debra Fulton, Shannan J. Ho Sui, Andrew Kwon, Jonathan Lim, Stuart Lithwick, Dora Pak, Elodie Portales-Casamar, Magdalena Swanson, Amy Ticoll, Tony Wong and Dimas Yusuf, for creating a friendly and enjoyable research environment.

I greatly appreciate the financial support that was provided by the Canadian Institutes of Health Research (CIHR) and the Michael Smith Foundation for Health Research (MSFHR) Strategic Training Program in Bioinformatics.

Chapter 1
Introduction

This thesis describes the exploration of how bioinformatics can be applied in the field of genetics, specifically to the prediction of disease risk. The causes of human diseases range from simple Mendelian inheritance patterns to complex combinations of genetic and non-genetic (environmental) factors. With the availability of the entire human genome sequence and the common variation map (HapMap project), the understanding of genetic contributions to diseases is increasing rapidly.
We are approaching a time when prediction of disease risk on a personalized level will become a reality.

1.1 Variations and Diseases

Variations in DNA sequences occur throughout the genome at a frequency of approximately 4-5 in 1000 bases (0.4-0.5%) on average between two unrelated individuals [3]. These differences or variations in sequences include both mutations and polymorphisms, which are distinguished by their frequency within a population. Mutations are by definition rarely observed in a population and, while they can cause disease, are not generally relevant to the prediction of disease risk in the general population. The simplest and most common form of polymorphism is called a Single Nucleotide Polymorphism (SNP). At a particular site on the human genomic sequence, a SNP is defined by the existence of a certain percentage of individuals with a nucleotide differing from the norm. For instance, in two copies of a chromosome at one site, one chromosome might have an A at that position (the 'A' allele) and the other might have a C (the 'C' allele). The minimum threshold percentage for classifying a position as being a SNP rather than a mutation is generally defined as 1% of tested chromosomes, although some reports use other values. In human populations, there are approximately 10 million SNPs that occur with greater than 1% frequency and these 10 million sites constitute 90% of the variation in the population [3, 21]. In short, SNPs constitute a large portion of the genetic variation between two individuals. A genotype is then defined as the combination of the two alleles at a particular locus for a given SNP. For instance, at a known polymorphic position with A and C forms, genotypes would be AA, AC or CC.
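The definitions above (alleles, genotypes and the 1% convention) can be made concrete with a minimal Python sketch. It is only an illustration under stated assumptions: the function names and the sample data are hypothetical, the site is assumed to be biallelic, and the 1% cut-off is the convention quoted in the text rather than a universal rule.

```python
from itertools import combinations_with_replacement

# Conventional cut-off quoted in the text; some reports use other values.
SNP_FREQUENCY_THRESHOLD = 0.01


def possible_genotypes(alleles):
    """List every unordered pair of alleles, e.g. A and C give AA, AC, CC."""
    return ["".join(pair) for pair in combinations_with_replacement(sorted(alleles), 2)]


def minor_allele_frequency(genotypes):
    """Frequency of the rarer allele among all observed chromosomes.

    Assumes a biallelic site; each genotype string contributes two chromosomes.
    """
    counts = {}
    for genotype in genotypes:
        for allele in genotype:
            counts[allele] = counts.get(allele, 0) + 1
    if len(counts) < 2:
        return 0.0  # only one allele observed
    return min(counts.values()) / sum(counts.values())


def classify_site(genotypes, threshold=SNP_FREQUENCY_THRESHOLD):
    """Label a site as monomorphic, a SNP, or a (rare) mutation."""
    maf = minor_allele_frequency(genotypes)
    if maf == 0.0:
        return "monomorphic"
    return "SNP" if maf >= threshold else "mutation"


if __name__ == "__main__":
    # Hypothetical sample of 100 individuals (200 chromosomes).
    sample = ["AA"] * 90 + ["AC"] * 9 + ["CC"]
    print(possible_genotypes("AC"))
    print(minor_allele_frequency(sample), classify_site(sample))
```

In the hypothetical sample, the C allele appears on 11 of 200 chromosomes (5.5%), which is above the 1% convention, so the site would be treated as a SNP.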
SNPs occur throughout the genome (promoter, coding and intronic regions). Variations situated in protein coding regions are of two types: synonymous (not altering the encoded amino acid sequence) and non-synonymous (causing a change to the encoded amino acid sequence).

In the study of human genetics there have been many examples of links between sequence variations (also referred to as markers) and specific traits or diseases [27]. Disorders where genetics plays an important role, the so-called genetic diseases, can be classified into single gene defects, chromosomal disorders or multifactorial disorders. Single gene disorders (or Mendelian disorders) such as Cystic Fibrosis are usually rare, and identifying the causal genetic variant has helped understand the disease. Chromosomal disorders are caused by excess or deficiency of genes [8]. Most common diseases, such as diabetes or heart disease, are multifactorial, and it is generally accepted that these phenotypic effects are based on direct genetic effects, multiple gene-gene interactions and gene-environment interactions [27, 30]. Recently, through new technologies and genome-wide association surveys, there has been a strong effort towards finding disease susceptibility variations (especially SNPs) for complex disorders [13].

1.2 Discovery of new markers

Recently, there has been a surge in the discovery of disease susceptibility genes and variations. Traditionally, in human genetics, a discovery involved identifying a gene for susceptibility to a disease. That notion, however, comes from working on rare diseases in which single studies have reported strong statistical associations between a mutation in a gene and a disease [13]. In contrast, for common diseases, the oligogenic model is usually accepted.
The model states that the genetic component of complex diseases is more likely to be the result of a few genes with moderate effect or a large number of genes with smaller effect [11]. With the development of large-scale genotyping technologies, it has now become feasible to perform genome-wide association studies [11, 13] to identify contributing loci by surveying a large set of known variable sites. Several large-scale genome-wide association studies have been recently published, including studies of diabetes mellitus type II [26, 28, 31, 33], bipolar disorder [1], Alzheimer's disease [4], Crohn's (inflammatory bowel) disease [6, 22] and coronary artery disease [24]. Even this small sample of diseases, published within a short timeframe, shows that a large number of markers are being discovered at a very rapid rate. A more detailed analysis of the recent advances in genome-wide association studies and a count of newly discovered markers for several common diseases can be found in [5].

1.3 Genotyping technologies

New genotyping technologies are driving the burst of genetic studies. For studies where a small number of SNPs are analyzed, Sequenom's MassARRAY® system, TaqMan® and Pyrosequencing™ have been widely used. These methods provide flexibility in study design for investigators prepared to work on a small set of candidate genes. For studies where thousands of SNPs need to be analyzed simultaneously (i.e., multiplexed) for each sample, platforms such as the Illumina BeadArray and the Affymetrix GeneChip® can be used. These systems have dramatically increased the throughput of genotyping and substantially reduced genotyping costs [23].

To illustrate the underlying technology, a brief description of the original Illumina BeadArray™ platform and the GoldenGate™ assay follows. The array-based technology comes in a 96 well plate format.
Each well contains an optical fiber bundle holding an array of 50,000 randomly placed beads, each ~3 microns in diameter. There are 1520 bead types, each representing a different oligonucleotide sequence. This gives ~30 copies of each bead type, providing (on average) 30 replicate genotyping experiments for each SNP, and the platform can screen up to 100,000 genotypes in one sample [10].

The GoldenGate® Assay is used with the BeadArray platform and has the advantage of allowing high multiplexing during amplification steps while minimizing reagent volumes and time. Genomic DNA is normalized and then chemically reacted to incorporate biotin to make activated DNA. Three oligonucleotides are designed for each SNP. Two are allele-specific oligonucleotides (ASO) and one is a locus-specific oligonucleotide (LSO). Each ASO has a 3' base complementary to one of the two SNP alleles. The LSO hybridizes downstream of the ASOs. Each of the three oligonucleotide sequences contains a region of genomic complementarity and a universal primer site for polymerase chain reaction (PCR): P1 and P2 on the ASOs and P3 on the LSO. The LSO also contains a unique address sequence that targets a particular bead type on the well plate. After extension and ligation, activated genomic DNA is amplified using PCR with primers P1 and P2, which are labeled with Cy3 and Cy5 respectively. The PCR products are then hybridized to the array matrix plate, where the Cy5 and Cy3 labeled materials bind in proportion to the relative abundance of the two alleles in the sample, such that a homozygote for either allele shows only one color and a heterozygote shows two. The labels are detected from the fluorescence signal and analyzed using software for genotype clustering and calling. Based on the color distribution of each allele, the genotype of the samples for the designated SNPs can be determined. For a more thorough and detailed description of the assay, refer to [19] and [32].
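The last step, turning two-color fluorescence into a genotype, can be sketched in a few lines of Python. This is a deliberately simplified illustration and not Illumina's calling algorithm: production software clusters signals across many samples, whereas the fixed cut-offs, the `min_signal` filter and the function names below are assumptions made for the example. The replicate handling reflects the roughly 30 beads per SNP mentioned above.

```python
from statistics import median


def call_genotype(cy3, cy5, allele_cy3="A", allele_cy5="C",
                  min_signal=100.0, homozygous_cutoff=0.8):
    """Call a genotype from one pair of Cy3/Cy5 intensities.

    Hypothetical rule: if one dye carries at least `homozygous_cutoff` of the
    total signal the sample is called homozygous for that allele, otherwise
    heterozygous. Returns None (no call) when the total signal is too weak.
    """
    total = cy3 + cy5
    if total < min_signal:
        return None
    cy3_fraction = cy3 / total
    if cy3_fraction >= homozygous_cutoff:
        return allele_cy3 * 2
    if cy3_fraction <= 1.0 - homozygous_cutoff:
        return allele_cy5 * 2
    return "".join(sorted(allele_cy3 + allele_cy5))


def call_from_replicates(bead_signals, **kwargs):
    """Summarize replicate beads by their median intensities, then call once."""
    cy3 = median(signal[0] for signal in bead_signals)
    cy5 = median(signal[1] for signal in bead_signals)
    return call_genotype(cy3, cy5, **kwargs)


if __name__ == "__main__":
    print(call_genotype(950.0, 50.0))    # mostly Cy3: one color
    print(call_genotype(500.0, 520.0))   # both colors present
    print(call_from_replicates([(900.0, 60.0), (1000.0, 40.0), (950.0, 50.0)]))
```

Taking the median over replicate beads is one simple way to damp outliers; it is a design choice of this sketch, not a description of the vendor software.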
Both Illumina and Affymetrix systems have pushed the technological limits of genotyping analysis. For instance, Illumina's Sentrix® HumanHap650Y BeadChip and whole-genome Human1M BeadChip can respectively genotype over 650,000 tag SNPs and over one million genetic variations on a single array, whereas Affymetrix's GeneChip® Genome-wide human SNP array 5.0 can genotype approximately 500,000 SNPs in one sample. Both platforms can genotype a fixed set of SNPs as well as customized panels of SNPs. Illumina's SNP selection is based on the HapMap project while Affymetrix's SNP selection is based on the feasibility of genotyping each SNP. For both systems, the cost of genotyping is less than $0.01 per SNP. A general recent summary of the various methods is shown in Table 1.1. A more detailed review of various genotyping technologies is available in [32] and [23].

Given the new technologies and the high throughput of genotypes at substantially lower costs, genotyping an individual has become increasingly feasible and has led to a shift from investigation of a few candidate polymorphisms at a time to comprehensive whole-genome studies [23].

1.4 Bioinformatic Tools

There are many different open source and commercial systems available that manage, organize and analyze large-scale genotype data and/or provide risk assessments for disease. In order to determine whether any currently available systems integrate the analysis of many genotypes to provide personalized risk assessments for diseases, a survey of the risk prediction systems follows.

Technology              Assay design                      Multiplexing capability   Throughput (no. of samples per assay)   Cost per genotype
TaqMan®                 By manufacturer or investigator   No                        Up to 10,000+                           >US$0.30
Pyrosequencing™         By investigator                   1 to 3                    Up to 4,000+                            >US$0.30
Sequenom's MassARRAY®   By investigator                   1 to 29                   Up to 3,000+                            US$0.05-0.10
Illumina's Sentrix®     By manufacturer                   1,536 to 1,000,000        Up to 96                                <US$0.01
Affymetrix's GeneChip®  By manufacturer                   10,000 to 500,000         Up to 96                                <US$0.01

Table 1.1: A summary of genotyping technologies currently available. The cost per genotype is an estimate of maximal multiplexing capability. A note, Illumina's Sentrix® numbers in the table are based on the Human1M BeadChip which will be released in the second quarter of 2007.

1.4.1 Commercial Systems

Genetics and genetic testing companies such as GeneSage [16], GeneTracks [17] and Genelex [14] provide, or attempted to provide, a variety of products and services. For instance, GeneSage, which now appears defunct, offered secure storage of genetic information for its users as well as access to genetic information and clinical information on genetic medicine for health professionals such as physicians and nurses. Also, risk assessments for specific diseases were provided through a team of in-house genetic counselors. An advantage of GeneSage was that risk assessments were provided by qualified genetic counselors, but the assessments were not based on genotype information.

GeneTracks, on the other hand, provides various forms of DNA testing such as Paternity, Twin or Sibship and Maternity. The strength of GeneTracks lies in its DNA testing capability while the disadvantage is the lack of genetic assessment. In addition, two facets of GeneTracks are the DNA Bank and DNA Ancestry project. The DNA Bank acts as a storage facility for the customer's genetic data while the DNA Ancestry project provides a way to trace an individual's ancestry based on 20-40 Y-chromosome DNA markers.
One advantage of such a service is the incorporation of genotype data in tracing ancestry, but the disadvantages are the lack of genetic risk assessment and the lack of flexibility: only males can be tested, since the test uses markers from the Y chromosome.

Lastly, Genelex provides a diverse range of services. For health professionals, genetic information, drug information, pharmacogenetic testing for specific drugs and nutrigenetic tests (dietary consultation) are provided. Also for clinicians, Genelex offers software called GeneMedRx, which provides drug-drug and drug-gene interaction risk prediction for cytochrome P450 metabolism and genetic testing [15]. For the general public, adverse drug reaction testing, nutritional testing (dietary consultation), ancestry DNA testing and predictive testing for four diseases are provided. All the testing services utilize genetic information from the customer and test a set of known genotypes, genes or set of phenotypes. One advantage of Genelex is the GeneMedRx software. It incorporates genetic testing with risk prediction to ensure drug efficacy and prevent adverse drug reactions. One disadvantage is that GeneMedRx only incorporates one genetic test with risk prediction (cytochrome P450 metabolism).

1.4.2 Open Source Systems

There are many open source systems that provide management and analysis of genotype data and a disease risk assessment. For brevity, only recently published tools will be discussed. The open source systems can be broken down into three categories: data management tools, visualization tools and risk assessment tools.

In the realm of data management, IGS, Integrated Genotype Analysis [12], stores, edits and analyzes genotype and phenotype data. IGS can handle large-scale genotype data, stores the data and metadata in various formats and can be used for genetic analysis (e.g. pedigree checks, Hardy-Weinberg tests, allele frequency tests, etc).
The system is freely available on-line and the underlying database structure can be easily re-created. IGS is useful for storing raw genotype as well as processed genotype data (simply the genotype and the sample). Another tool is SNPP, Single Nucleotide Polymorphism Processor [36]. SNPP's strength lies in handling massive amounts of raw SNP genotyping data, using a backend database framework for storage; it can also be used as a tool for data format conversion. Its disadvantage lies in the minimal analysis of the genotype data, since it only provides Mendelian inheritance checks for SNP data obtained from families.

For visualization tools, there are several programs which provide an integrated environment for visualization and analysis of genotype data. SNP-VISTA, an interactive SNP visualization tool [29], allows visualization of large-scale genotype data for disease related genes. The software maps SNPs to gene structure, classifies SNPs based on location, frequency and allele composition, clusters SNPs according to user criteria and includes protein evolutionary conservation visualization. The strength of SNP-VISTA is the graphical interface and visual representation of large scale data. SNPAnalyzer, a workbench for SNP analysis [35], performs data manipulation, statistical analysis on genotype data and visualization. Another recent tool is GEVALT, GEnotype Visualization and ALgorithmic Tool [7], which provides phasing and tag SNP selection algorithms, along with visualization of LD plots and haplotype data. All of the functionality is available in one integrated viewer. The advantage of GEVALT is in the integration of analysis tools and the visualization in one environment. There are other visualization tools that provide various features but are not mentioned here. All the visualization software provides analysis of genotype data but does not provide any disease risk assessments.
Risk assessment tools can be broken down into two categories, non-family-based and family-based. For non-family-based risk assessments, the tools are classified as expert systems or knowledge-based systems. An expert system is a computer system, based on artificial intelligence (AI) principles, which uses an organized body of knowledge, heuristics and inference to suggest solutions in a particular domain of expertise, for instance in medicine. Reviews of various expert systems and currently used systems are given in [25] and [18]. Therefore, for brevity, only one of the originally developed systems will be mentioned here. MYCIN [2] was developed to provide assistance to physicians in the diagnosis and treatment of meningitis and bacterial infections. MYCIN conducts a question and answer dialog in which it asks about suspected sites of infection, symptoms and results of other laboratory tests. Then, MYCIN recommends a course of antibiotics and can also provide the reasoning behind its answers. The advantage of an expert system is its diagnostic support capability to a physician. Potential disadvantages are the purely computational basis of prediction and the lack of genetic history in the diagnosis of diseases.

For family-based risk assessments, there are many tools available, of which the majority target cancers. A tool for prediction of breast cancer risk is BRCAPRO [9]. The BRCAPRO model incorporates information on all family members (affected and unaffected) for breast and ovarian cancer and then calculates the probability of carrying the BRCA gene mutation using Bayes theorem. BRCAPRO's strength is its accuracy in predicting BRCA gene mutations. BRCAPRO was validated by comparing it to genetic counselors, and it was found that BRCAPRO had similar sensitivity to, and higher specificity than, experienced genetic counselors in identifying BRCA mutation carriers.
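The Bayes theorem step can be illustrated with a two-hypothesis calculation in Python. This is a toy sketch, not the BRCAPRO model itself, which works through a full pedigree with Mendelian inheritance and age-dependent penetrance; the function name and all numbers below are hypothetical and chosen only to show the arithmetic.

```python
def posterior_carrier_probability(prior, p_history_given_carrier,
                                  p_history_given_noncarrier):
    """Bayes theorem for two hypotheses (carrier vs. non-carrier).

    posterior = P(H | carrier) * prior
                / (P(H | carrier) * prior + P(H | non-carrier) * (1 - prior))
    where H is the observed family history.
    """
    for value in (prior, p_history_given_carrier, p_history_given_noncarrier):
        if not 0.0 <= value <= 1.0:
            raise ValueError("probabilities must lie between 0 and 1")
    numerator = prior * p_history_given_carrier
    evidence = numerator + (1.0 - prior) * p_history_given_noncarrier
    if evidence == 0.0:
        raise ValueError("observed history has zero probability under both hypotheses")
    return numerator / evidence


if __name__ == "__main__":
    # Hypothetical inputs: 1% prior carrier probability; the observed family
    # history is ten times more likely for a carrier than for a non-carrier.
    print(posterior_carrier_probability(0.01, 0.5, 0.05))
```

With these made-up inputs the posterior is 0.005 / 0.0545, or roughly 9%, which shows how a strong family history can raise a small prior substantially without making carrier status certain.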
A similar system, PancPRO [34], has been created for identifying individuals at high risk of familial pancreatic cancer. The underlying framework of PancPRO is similar to BRCAPRO. Again, a validation of PancPRO indicated its accuracy in risk assessment. A recent review [20] examined a set of cancer risk assessment tools (CRATs) that were available on the Internet. The five tools discussed in the paper determined the risk of various types of cancers based on family history. One of the disadvantages of these tools is the focus on purely familial-based Mendelian model diseases and not on other more complex diseases such as Diabetes Mellitus or Alzheimer's.

1.5 Overview of project

Given the rapid new discoveries of disease-associated markers and the advent of new genotyping technology, a question arises: is it now possible to apply existing knowledge of genetic diseases to create disease risk profiles for individuals? This thesis project was motivated by such a question and was designed to ascertain the bioinformatic limitations that must be overcome to facilitate a genotyping-based analysis of disease risk. We created a web tool called D-GRIP, DNA Genetic Risk Information Profile, which is a genotype analysis system that determines an individual's genetic risk profile given a genotype as input. The on-line tool can take, as input, up to one million observed genotypes from known SNPs in human populations. The submitted genotype data are then compared to validated disease-associated SNPs (a DNA-Disease database), and D-GRIP outputs a list of diseases for which the individual has modified (up or down) risk. D-GRIP is intended to serve as an early prototype of a prognostic tool for use by genetic counselors. D-GRIP went through a testing phase where clinical geneticists, genetic counselors, genetic researchers and biostatisticians were consulted on the utility of D-GRIP and their feedback was recorded.
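The comparison of submitted genotypes against a DNA-Disease database can be pictured with a short Python sketch. It is not D-GRIP's implementation: the record layout, the function name, the placeholder entry `rs0000001` and the odds ratios are all hypothetical. Only the pairing of rs7903146 with TCF7L2 and type 2 diabetes is taken from the figure captions of this thesis; the risk allele and odds ratio attached to it here are illustrative values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiseaseSnp:
    """One hypothetical row of a DNA-Disease database."""
    rsid: str
    gene: str
    disease: str
    risk_allele: str
    odds_ratio: float  # illustrative per-allele effect, not a curated value


def find_risk_variants(genotypes, database):
    """Return database entries whose effect allele is present in the genotypes.

    `genotypes` maps an rsID to a two-letter genotype such as "CT".
    Each hit records how many copies of the allele are carried and whether
    the associated risk is modified up or down.
    """
    hits = []
    for entry in database:
        genotype = genotypes.get(entry.rsid)
        if genotype is None:
            continue  # SNP was not genotyped for this individual
        copies = genotype.count(entry.risk_allele)
        if copies == 0:
            continue
        direction = "increased" if entry.odds_ratio > 1.0 else "decreased"
        hits.append((entry.disease, entry.rsid, entry.gene, genotype, copies, direction))
    return hits


DATABASE = [
    DiseaseSnp("rs7903146", "TCF7L2", "Diabetes mellitus type 2", "T", 1.4),
    DiseaseSnp("rs0000001", "GENE_X", "Example disease", "G", 0.8),  # placeholder
]

if __name__ == "__main__":
    submitted = {"rs7903146": "CT", "rs0000001": "AA", "rs0000002": "GG"}
    for hit in find_risk_variants(submitted, DATABASE):
        print(hit)
```

A dictionary keyed by rsID keeps each lookup constant-time, which matters when an individual submits on the order of a million genotypes but the database holds comparatively few disease-associated SNPs.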
One major conclusion drawn from the project is that the level of current knowledge for disease-causing SNPs is limited. Only a few diseases had strong supporting evidence causally linking SNPs to the disease. Given this scarcity of data, substantial studies on disease-causing variations are needed, especially for complex diseases.

Bibliography

[1] A E Baum, N Akula, M Cabanero, I Cardona, W Corona, B Klemens, T G Schulze, S Cichon, M Rietschel, M M Nothen, A Georgi, J Schumacher, M Schwarz, R Abou Jamra, S Hofels, P Propping, J Satagopan, S D Detera-Wadleigh, J Hardy, and F J McMahon. A genome-wide association study implicates diacylglycerol kinase eta (DGKH) and several other genes in the etiology of bipolar disorder. Mol Psychiatry, May 2007.

[2] Bruce G. Buchanan and Edward H. Shortliffe. Rule-Based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project. AAAI Press, available at: http://www.aaaipress.org/Classic/Buchanan/buchanan.html, ebook edition, 1984.

[3] International HapMap Consortium. The International HapMap Project. Nature, 426(6968):789-96, December 2003.

[4] Keith D Coon, Amanda J Myers, David W Craig, Jennifer A Webster, John V Pearson, Diane Hu Lince, Victoria L Zismann, Thomas G Beach, Doris Leung, Leslie Bryden, Rebecca F Halperin, Lauren Marlowe, Mona Kaleem, Douglas G Walker, Rivka Ravid, Christopher B Heward, Joseph Rogers, Andreas Papassotiropoulos, Eric M Reiman, John Hardy, and Dietrich A Stephan. A high-density whole-genome association study reveals that APOE is the major susceptibility gene for sporadic late-onset Alzheimer's disease. J Clin Psychiatry, 68(4):613-8, April 2007.

[5] Jennifer Couzin and Jocelyn Kaiser. Genome-wide association, closing the net on common disease genes. Science, 316(5826):820-2, May 2007.
[6] J R Fraser Cummings, Rachel Cooney, Saad Pathan, Carl A Anderson, Jeffrey C Barrett, John Beckly, Alessandra Geremia, Laura Hancock, Changcun Guo, Tariq Ahmad, Lon R Cardon, and Derek P Jewell. 13 Bibliography Confirmation of the role of atgl611 as a Crohn's disease susceptibility gene. Inflamm Bowel Dis, April 2007. [7] Ofir Davidovich, Gad Kimmel, and Ron Shamir. Gevalt: an integrated software tool for genotype analysis. BMC Bioinformatics, 8:36, 2007. [8] A.D.A.M. Medical Encyclopedia. Genetics. [updated 2005 apr 20; cited 2007 may 2007] available at: http://www.nlm.nih.gov/medlineplus/ency/article/002048.htm, May 2007. [9] David M Euhus, Kristin C Smith, Linda Robinson, Amy Stucky, Olu- funmilayo I Olopade, Shelly Cummings, Judy E Garber, Anu Chit- tenden, Gordon B Mills, Paula Rieger, Laura Esserman, Beth Craw- ford, Kevin S Hughes, Connie A Roche, Patricia A Ganz, Joyce Seldon, Carol J Fabian, Jennifer Klemp, and Gail Tomlinson. Pretest predic- tion of brcal or brca2 mutation by risk counselors and the computer model brcapro. J Natl Cancer Inst, 94(11):844-51, June 2002. [10] J.B. Fan, A. Qliphant, R. Shen, B.G. Kermani, F. Garcia, K.L . Gun- derson, M . Hansen, F. Steemers, S.L. Butler, P. Deloukas, L. Galver, S. Hunt, C. McBride, M . Bibikova, T. Rubano, J. Chen, E. Wickham, D. Doucet, W. Chang, D. Campbell, B. Zhang, S. Kruglyak, D. Bently, J. Haas, P. Rigault, L. Zhou, J. Stuelpnagel, and M.S. Chee. Highly parallel snp genotyping. Cold Springs Harbor Symposia on Quantitative Biology, 68:69-78, 2003. [11] Martin Farrall and Andrew P Morris. Gearing up for genome-wide gene- association studies. Hum Mol Genet, 14 Spec No. 2:R157-62, October 2005. ' [12] Simon Fiddy, David Cattermole, Dong Xie, Xiao Yuan Duan, and Richard Mott. Igs: An integrated system for genetic analysis. BMC Bioinformatics, 7:210, 2006. [13] Nelson B Freimer and Chiara Sabatti. Human genetics: variants in common diseases. Nature, 445(7130):828-30, February 2007. [14] Genelex. 
Genelex website, available at http://www.genelex.com/, May 2007.
[15] Genelex. GeneMedRx: Drug-drug and drug-gene interaction software, available at: http://genemedrx.com/, May 2007.
[16] GeneSage. GeneSage website, available at http://www.genesage.com, July 2006.
[17] GeneTrack. GeneTrack website, available at http://www.genetrack.bc.ca, July 2006.
[18] Leigh S Goggin, Robert H Eikelboom, and Marcus D Atlas. Clinical decision support systems and computer-aided diagnosis in otology. Otolaryngol Head Neck Surg, 136(4 Suppl):S21-6, April 2007.
[19] Illumina. Illumina GoldenGate assay workflow, available at: http://www.illumina.com/downloads/goldengateassay.pdf, May 2007.
[20] K M Kelly and K Sweet. In search of a familial cancer risk assessment tool. Clin Genet, 71(1):76-83, January 2007.
[21] L Kruglyak and D A Nickerson. Variation is the spice of life. Nat Genet, 27(3):234-6, March 2001.
[22] Cecile Libioulle, Edouard Louis, Sarah Hansoul, Cynthia Sandor, Frederic Farnir, Denis Franchimont, Severine Vermeire, Olivier Dewit, Martine de Vos, Anna Dixon, Bruno Demarche, Ivo Gut, Simon Heath, Mario Foglio, Liming Liang, Debby Laukens, Myriam Mni, Diana Zelenika, Andre Van Gossum, Paul Rutgeerts, Jacques Belaiche, Mark Lathrop, and Michel Georges. Novel Crohn disease locus identified by genome-wide association maps to a gene desert on 5p13.1 and modulates expression of PTGER4. PLoS Genet, 3(4):e58, April 2007.
[23] Yen-Ling Low, Sara Wedren, and Jianjun Liu. High-throughput genomic technology in research and clinical management of breast cancer, evolving landscape of genetic epidemiological studies. Breast Cancer Res, 8(3):209, 2006.
[24] Ruth McPherson, Alexander Pertsemlidis, Nihan Kavaslar, Alexandre Stewart, Robert Roberts, David R Cox, David A Hinds, Len A Pennacchio, Anne Tybjaerg-Hansen, Aaron R Folsom, Eric Boerwinkle, Helen H Hobbs, and Jonathan C Cohen. A common allele on chromosome 9 associated with coronary heart disease.
Science, May 2007.
[25] K S Metaxiotis and J E Samouilidis. Expert systems in medicine: academic exercise or practical tool? J Med Eng Technol, 24(2):68-72, 2000.
[26] Richa Saxena, Benjamin F Voight, Valeriya Lyssenko, Noel P Burtt, Paul I W de Bakker, Hong Chen, Jeffrey J Roix, Sekar Kathiresan, Joel N Hirschhorn, Mark J Daly, Thomas E Hughes, Leif Groop, David Altshuler, Peter Almgren, Jose C Florez, Joanne Meyer, Kristin Ardlie, Kristina Bengtsson, Bo Isomaa, Guillaume Lettre, Ulf Lindblad, Helen N Lyon, Olle Melander, Christopher Newton-Cheh, Peter Nilsson, Marju Orho-Melander, Lennart Rastam, Elizabeth K Speliotes, Marja-Riitta Taskinen, Tiinamaija Tuomi, Candace Guiducci, Anna Berglund, Joyce Carlson, Lauren Gianniny, Rachel Hackett, Liselott Hall, Johan Holmkvist, Esa Laurila, Marketa Sjogren, Maria Sterner, Aarti Surti, Margareta Svensson, Malin Svensson, Ryan Tewhey, Brendan Blumenstiel, Melissa Parkin, Matthew Defelice, Rachel Barry, Wendy Brodeur, Jody Camarata, Nancy Chia, Mary Fava, John Gibbons, Bob Handsaker, Claire Healy, Kieu Nguyen, Casey Gates, Carrie Sougnez, Diane Gage, Marcia Nizzari, Stacey B Gabriel, Gung-Wei Chirn, Qicheng Ma, Hemang Parikh, Delwood Richardson, Darrell Ricke, and Shaun Purcell. Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels. Science, April 2007.
[27] NJ Schork, D Fallin, and D. Lanchbury. Single nucleotide polymorphism and the future of genetic epidemiology. Clinical Genetics, 58:250-264, 2000.
[28] Laura J Scott, Karen L Mohlke, Lori L Bonnycastle, Cristen J Willer, Yun Li, William L Duren, Michael R Erdos, Heather M Stringham, Peter S Chines, Anne U Jackson, Ludmila Prokunina-Olsson, Chia-Jen Ding, Amy J Swift, Narisu Narisu, Tianle Hu, Randall Pruim, Rui Xiao, Xiao-Yi Li, Karen N Conneely, Nancy L Riebow, Andrew G Sprau, Maurine Tong, Peggy P White, Kurt N Hetrick, Michael W Barnhart, Craig W Bark, Janet L Goldstein, Lee Watkins, Fang Xiang, Jouko Saramies, Thomas A Buchanan, Richard M Watanabe, Timo T Valle, Leena Kinnunen, Goncalo R Abecasis, Elizabeth W Pugh, Kimberly F Doheny, Richard N Bergman, Jaakko Tuomilehto, Francis S Collins, and Michael Boehnke. A genome-wide association study of type 2 diabetes in Finns detects multiple susceptibility variants. Science, April 2007.
[29] Nameeta Shah, Michael V Teplitsky, Simon Minovitsky, Len A Pennacchio, Philip Hugenholtz, Bernd Hamann, and Inna L Dubchak. SNP-VISTA: an interactive SNP visualization tool. BMC Bioinformatics, 6:292, 2005.
[30] Barkur S. Shastry. SNP alleles in human disease and evolution. American Journal of Human Genetics, 47:561-566, 2002.
[31] Robert Sladek, Ghislain Rocheleau, Johan Rung, Christian Dina, Lishuang Shen, David Serre, Philippe Boutin, Daniel Vincent, Alexandre Belisle, Samy Hadjadj, Beverley Balkau, Barbara Heude, Guillaume Charpentier, Thomas J. Hudson, Alexandre Montpetit, Alexey V. Pshezhetsky, Marc Prentki, Barry I. Posner, David J. Balding, David Meyre, Constantin Polychronakos, and Philippe Froguel. A genome-wide association study identifies novel risk loci for type 2 diabetes. Nature, 445:881-885, February 2007.
[32] Beatriz Sobrino, Maria Brion, and Angel Carracedo. SNPs in forensic genetics: a review on SNP typing methodologies. Forensic Sci Int, 154(2-3):181-94, November 2005.
[33] Valgerdur Steinthorsdottir, Gudmar Thorleifsson, Inga Reynisdottir, Rafn Benediktsson, Thorbjorg Jonsdottir, G Bragi Walters, Unnur Styrkarsdottir, Solveig Gretarsdottir, Valur Emilsson, Shyamali Ghosh, Adam Baker, Steinunn Snorradottir, Hjordis Bjarnason, Maggie C Y Ng, Torben Hansen, Yu Bagger, Robert L Wilensky, Muredach P Reilly, Adebowale Adeyemo, Yuanxiu Chen, Jie Zhou, Vilmundur Gudnason, Guanjie Chen, Hanxia Huang, Kerrie Lashley, Ayo Doumatey, Wing-Yee So, Ronald C Y Ma, Gitte Andersen, Knut Borch-Johnsen, Torben Jorgensen, Jana V van Vliet-Ostaptchouk, Marten H Hofker, Cisca Wijmenga, Claus Christiansen, Daniel J Rader, Charles Rotimi, Mark Gurney, Juliana C N Chan, Oluf Pedersen, Gunnar Sigurdsson, Jeffrey R Gulcher, Unnur Thorsteinsdottir, Augustine Kong, and Kari Stefansson. A variant in CDKAL1 influences insulin response and risk of type 2 diabetes. Nat Genet, April 2007.
[34] Wenyi Wang, Sining Chen, Kieran A Brune, Ralph H Hruban, Giovanni Parmigiani, and Alison P Klein. PancPRO: risk assessment for individuals with a family history of pancreatic cancer. J Clin Oncol, 25(11):1417-22, April 2007.
[35] Jinho Yoo, Bonghee Seo, and Yangseok Kim. SNPAnalyzer: a web-based integrated workbench for single-nucleotide polymorphism analysis. Nucleic Acids Res, 33(Web Server issue):W483-8, July 2005.
[36] Lan-Juan Zhao, Miao-Xin Li, Yan-Fang Guo, Fu-Hua Xu, Jin-Long Li, and Hong-Wen Deng. SNPP: automating large-scale SNP genotype data management. Bioinformatics, 21(2):266-8, January 2005.

Chapter 2

D-GRIP: DNA Genetic Risk Information Profile¹

2.1 Introduction

Genetics knowledge is being transformed through whole-genome association studies enabled by new high-throughput genotyping and re-sequencing technologies. In the past, genetics research focused on the identification of individual genes directly responsible for a disease or phenotype, based on Mendelian genetics.
Common diseases such as diabetes, heart disease, asthma and cancer are caused by a combination of genetic and environmental factors [17, 25]. For complex diseases, the genetic component may be provided by a few genes with moderate effects or a large number of genes with smaller effects [22]. Identifying genes that contribute to susceptibility but are not definitively causal has emerged as the focus of many large genetics studies. With the completion of the Human Genome Project [21], the uncovering of common genetic variants through the International Haplotype Map (HapMap) [6] has enabled susceptibility studies for common diseases [32]. Analyses of large sets of common genetic variants for association with specific diseases are called genome-wide association (GWA) studies. GWA studies typically utilize a set of single nucleotide polymorphisms (SNPs) from the HapMap project known to represent blocks of linked variations (so-called 'tag' SNPs), along with nonsynonymous SNPs and SNPs situated within evolutionarily conserved regions of the genome. A large number of GWA studies were recently published for diseases such as Diabetes type 2 [34, 35, 37, 39], bipolar disorder [4], Alzheimer's disease [7], Crohn's (inflammatory bowel) disease [8, 26] and coronary artery disease [29]. Given the rate of these new discoveries, there is much excitement in the scientific community about the potential to discover new links between genes and diseases - links which could pave the road for predictive genetic screening [13].

¹ A version of this chapter will be submitted for publication: Srivastava S and Wasserman W. 2007

Facilitating the GWA studies are several new high-throughput genotyping platform technologies, such as the Affymetrix GeneChip® and Illumina BeadArray™, which can simultaneously analyze thousands of variable positions (i.e. SNPs).
The advantages of such platforms lie in their high multiplexing capability, increased reliability and low genotyping cost per SNP. Both platforms allow genotyping of 500,000 SNPs per sample at a cost of less than $0.01 per SNP with greater than 95% accuracy [12, 27]. Due to such advancements, it is now economically feasible to perform large-scale whole-genome studies. It is even possible for an individual to obtain his or her own genotype information, covering many known common sequence variants, for an affordable price.

Suppose a geneticist was provided with the results of a large-scale genotyping experiment. It would be natural for that scientist to seek insights into the data from computational services. Several open source and commercial systems are currently available for application to genotype data. In the realm of open source systems, three categories exist: data management tools, visualization tools and risk assessment tools. Data management and visualization tools such as Integrated Genotype Analysis [14], SNPP [45], SNP-VISTA [36] and GEVALT [9] can process and store large amounts of raw genotype data and provide intuitive visualization of results from many samples. The majority of the risk prediction tools, such as BRCAPRO [11], PancPRO [41] and other Cancer Risk Assessment Tools (CRATs) [24], utilize family history to predict risk of disease with relatively high accuracy. However, no system allows an individual to explore genetic risk for many diseases given a single individual's genotype.

A few commercial systems handle genetic data and/or perform risk prediction. In addition to performing DNA tests, GeneTracks [19] provides software to trace family ancestry based on markers and offers secure storage of personal genetic information. Genelex [18] uses a small set of markers for predicting adverse drug reactions and Mendelian model diseases.
For the commercial risk prediction services, genome-scale genotype data are not utilized and risk predictions are, in general, very specific to a small set of Mendelian diseases.

Given the advent of new genotyping technologies and the flow of new discoveries of disease-associated variants, is it now possible to use existing knowledge of diseases to create disease risk profiles for individuals? This paper is concerned with exploring such a concept in order to identify key limitations which must be addressed in genetics, bioinformatics and statistics. In addition, the research has ethical and societal implications. We created a prototype web tool called D-GRIP, DNA Genetic Risk Information Profile, which attempts to generate an individual's genetic risk profile given a genotype as input. The prototype system accepts up to one million genotypes, compares the submitted data to a DNA-Disease database and then outputs a report for those diseases for which the individual has a predicted modified risk. In order to test the utility and ascertain the limitations of D-GRIP, a survey of potential users, such as genetic counselors, was performed and their feedback was recorded. The development and subsequent assessment of D-GRIP revealed several key weaknesses which must be addressed before wide use of a predictive system should be attempted.

2.2 Methods

In order to create a practical prototype risk assessment tool, several components are required. An intuitive and easy-to-use interface is essential. At the core of the software, two aspects are needed: a DNA-Disease database and a statistical model for risk assessment. Lastly, the software needs to be passed through a testing phase to assess both usability and predictive performance. For D-GRIP, a schematic overview is shown in Figure 2.1.
In subsequent sections, the various components of the software are detailed and, in the results, a walkthrough is performed to illustrate the features of D-GRIP.

[Figure 2.1 diagram. Recoverable box labels: Parse input data; Map variations to genes and HapMap data (Perl/BioPerl, Ensembl); DB of parsed genotype and demographic data (local MySQL); Create and query DNA-Disease DB; Generate integrated risk score (with confidence intervals); Output a list of diseases linked to variations; HTML output; Text output.]

Figure 2.1: A schematic overview detailing the flow of information across the various components of D-GRIP is illustrated.

2.2.1 D-GRIP Overview

The overall flow of information occurs in three steps. The first step involves entering demographic information and the user's genotype data. In the second step, the genotype data are compared to a genotype-phenotype database and a risk is calculated for the individual to develop each disease represented in the database. The last step is the reporting of any disease-associated variations found in the user's genotype and the relevant statistical measures.

In the first step, the user enters demographic information such as age, gender and ethnic background. Due to the complexities involved in classifying ethnicity [23], a generic geographical grouping was used as follows: European, Asian, African, Pacific, Mixed and First Nations/Aboriginals. It is also possible to infer ethnicity based on ancestry informative markers (AIMs) [43], especially for admixed individuals, but for simplicity D-GRIP uses user-identified ethnicity. The user is also required to input his or her genotype data, either by uploading the processed genotype file or by copying and pasting its contents. D-GRIP accepts two genotype file formats from widely used instruments (Illumina Final format and Affymetrix text output).
Each row of the genotype file contains a SNP identifier (the 'rs' number provided by dbSNP [38]) and the two alleles that make up the observed genotype. The software is capable of handling up to one million genotypes at a time.

The second step processes the genotype data, based on the defined ethnic background, and compares each of the user's genotypes to the entries in a genotype-phenotype database. For corresponding matches of SNP id and genotype, D-GRIP uses the specific SNP from the genotype-phenotype database in the statistical model to calculate the probability of developing a specific disease. The details of the database and statistical model are explained in subsequent sections.

The final step involves reporting all matching SNPs between the user's genotype data and the disease-associated SNPs. The analysis results are reported in a tabular format which includes, for each disease, the particular gene, the particular SNP (and genotype) associated with the disease, the population in which the association was observed and links to relevant studies supporting the association between the disease and genotype. In addition, for each SNP, the odds ratio and confidence intervals, the risk and major allele's homozygous genotypes, the case and control genotype frequencies and the set of SNPs found to be in high linkage disequilibrium based on the HapMap data are reported. Finally, an overall probability of developing a disease is shown, based on the statistical model used. As per the model, the overall probability is calculated over the whole set of observed disease-associated genotypes.

D-GRIP was implemented for browser-based access over a network. Since there are many social, ethical and legal implications associated with the use of such a risk assessment tool, access to D-GRIP is restricted.
D-GRIP is envisioned to be used in a guided setting, for example, in the presence of a genetic counselor. In addition, to respect privacy and confidentiality, user-submitted information (e.g. demographic and genotype data) is not stored in the system. Once a report is generated, all user data are removed from memory.

2.2.2 Genotype-Phenotype Database

Existing genotype-phenotype databases are not sufficient for large-scale disease risk prediction due to deficiencies in the organization and/or extent of genetic risk knowledge [33]. Currently, the majority of genetic disease databases use free text for disease information (rather than a more structured format) and thus are not suited for large-scale computational analyses. Due to this deficiency, we created a D-GRIP DNA-Disease database for the testing of the system. The DNA-Disease database contains information pertaining to a limited set of complex diseases. The information represented primarily includes validated markers (SNPs) either confirmed in multiple studies or emerging from studies performed with samples from large numbers of participants.

For each available disease, the DNA-Disease database contains associated and validated SNPs. For each SNP, the case and control allele and genotypic frequencies from different populations are recorded. We decided to model the information in the database on an existing system, AlzGene, which was developed for genetic markers predictive of Alzheimer's disease risk [5]. AlzGene was created to house the results of a meta-analysis for each polymorphism with known genotype data in at least three case-control studies. For each polymorphism, allele and genotypic frequencies on a per-population basis are provided in a well-organized structure.
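The matching step and the per-population records described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the D-GRIP implementation (the thesis states that Perl/BioPerl and MySQL were used). The record fields, the function names and the three-column whitespace-separated input layout are assumptions made for the example; the real Illumina Final and Affymetrix text formats carry additional columns. The TCF7L2 entry uses the SNP, genotype and population reported in the Results; the second record is entirely hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiseaseSnpRecord:
    """One hypothetical row of a DNA-Disease table (field names are assumptions)."""
    disease: str
    gene: str
    rs_id: str
    population: str
    risk_genotype: tuple  # alleles stored as a sorted pair, e.g. ("C", "T")


def normalize(allele1, allele2):
    """Return the genotype as a sorted allele pair so that C/T equals T/C."""
    return tuple(sorted((allele1.upper(), allele2.upper())))


def parse_genotype_rows(text):
    """Parse rows of the assumed form 'rsID allele1 allele2'; skip other rows."""
    genotypes = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3 or not fields[0].startswith("rs"):
            continue
        genotypes[fields[0]] = normalize(fields[1], fields[2])
    return genotypes


def match_records(genotypes, records, population):
    """Return records whose SNP id, genotype and population match the user."""
    return [
        record
        for record in records
        if record.population == population
        and genotypes.get(record.rs_id) == record.risk_genotype
    ]


RECORDS = [
    DiseaseSnpRecord("Diabetes Mellitus type 2", "TCF7L2", "rs7903146",
                     "Caucasian", normalize("C", "T")),
    # Hypothetical placeholder record, not taken from the thesis.
    DiseaseSnpRecord("Example disease", "GENE_X", "rs0000001",
                     "Caucasian", normalize("A", "G")),
]

user_genotypes = parse_genotype_rows("rs7903146 T C\nrs0000001 A A\nbad line\n")
hits = match_records(user_genotypes, RECORDS, "Caucasian")
```

Storing each genotype as a sorted allele pair is one simple way to make the comparison independent of the order in which an instrument reports the two alleles.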
In addition to Alzheimer's data, the D-GRIP DNA-Disease database contains information from a Parkinson's disease database (PDGene), created by the developers of AlzGene [1]. The data for Diabetes Type II were manually extracted from a recent large-scale genome-wide association study [37]. A summary of the contents of the database is presented in Table 2.1.

Disease                Number of Genes    Number of Polymorphisms
Alzheimer's Disease    38                 76
Parkinson Disease      8                  17
Diabetes Type II       5                  8

Table 2.1: A summary of the number of genes and number of polymorphisms for each of the diseases in the DNA-Disease database.

2.2.3 Disease Risk Model

The statistical model implemented in D-GRIP was defined by Yang et al. [44]. The original model includes two steps. First, a likelihood ratio is calculated using logistic regression; then a posterior probability of disease is estimated using the likelihood ratio. The likelihood ratio is defined as the ratio of the probability for an individual with a disease to have an observed genotype to the probability for an individual without the disease to have the genotype [44]. While full details can be obtained in the cited paper, a brief summary follows.

Likelihood Ratio

Consider an individual with a set of genetic tests, $G$, where $G$ is a vector of $n$ disease susceptibility genes or alleles $(g_1, g_2, \ldots, g_n)$. Let $g_i = 1$ for a positive genetic test result and $g_i = 0$ for a negative test result, so that the individual's set of test results is represented as a combination of 0s and 1s. Also, let $D$ represent the diseased (case) population and let $\bar{D}$ represent the non-diseased (control) population.
Then, the likelihood ratio for any observed value of $G$ can be defined as:

$$LR(G) = \frac{P(G \mid D)}{P(G \mid \bar{D})} \tag{2.1}$$

As stated, $G$ is a set of genetic tests, $G = (g_1, g_2, \ldots, g_n)$. The implemented model assumes that each genetic test acts independently; thus the joint probability of a given result is the product of the individual probabilities, $P(G \mid D) = P(g_1 \mid D)\,P(g_2 \mid D) \cdots P(g_n \mid D)$. This is also true for $P(G \mid \bar{D})$, and thus it follows that $LR(G) = LR(g_1)\,LR(g_2) \cdots LR(g_n)$. Thus, the likelihood ratio for a panel of independent tests is simply the product of the likelihood ratios of the individual test results. The assumption of independence will be discussed in a later section.

Since the DNA-Disease database contains case-control studies from various populations for each disease, a logistic model can be used to estimate the likelihood ratio. For a binary disease outcome ($D = 0, 1$) that follows a logistic model in the population, logistic regression can be used to calculate the likelihood ratio from case-control studies in a population, as follows:

$$\ln LR(G) = \ln\left(\frac{N_{CO}}{N_{CA}}\right) + \alpha_{CC} + \beta G^{T} \tag{2.2}$$

where $\alpha_{CC}$ and $\beta$ are the intercept term and the logistic regression coefficient of the odds of developing the disease, respectively. $N_{CA}$ is the number of case subjects in the study sample and $N_{CO}$ is the number of control subjects in the study sample. It is worth noting that the likelihood-ratio calculation assumes each gene is acting independently. Realistically, however, gene-gene interactions and gene-environment interactions should be included in the model. The likelihood-ratio equation (Equation 2.2) can be modified by including a vector of covariates as well as interaction effects of multiple binary tests (gene-gene or gene-environment interactions). However, for brevity, the modified equation is not shown here and, for prototype development, it is not used in D-GRIP.
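As a small illustration of the two relationships in this section, the following Python sketch multiplies per-test likelihood ratios under the independence assumption and evaluates the case-control form $\ln LR(G) = \ln(N_{CO}/N_{CA}) + \alpha_{CC} + \beta G^{T}$. The function names and the numeric inputs are assumptions chosen for the example; in practice the coefficients would come from a fitted logistic regression, which is not reproduced here.

```python
import math


def combined_likelihood_ratio(per_test_lrs):
    """Product of per-test likelihood ratios (valid only under independence)."""
    return math.prod(per_test_lrs)


def log_lr_case_control(n_cases, n_controls, alpha_cc, beta, g):
    """ln LR(G) = ln(N_CO / N_CA) + alpha_CC + beta . G for binary tests g."""
    if len(beta) != len(g):
        raise ValueError("beta and g must have the same length")
    linear_term = sum(b * x for b, x in zip(beta, g))
    return math.log(n_controls / n_cases) + alpha_cc + linear_term


# Illustrative inputs only: two independent tests with LRs of 1.27 and 2.0.
panel_lr = combined_likelihood_ratio([1.27, 2.0])

# Equal numbers of cases and controls, zero intercept, one positive test
# whose coefficient is ln(2): the likelihood ratio is then exactly 2.
single_lr = math.exp(log_lr_case_control(100, 100, 0.0, [math.log(2)], [1]))
```

Because the product form treats every test as independent, correlated SNPs would be counted more than once; the model's handling of SNPs in linkage disequilibrium addresses part of that concern.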
Posterior Probability

The statistical model uses a set of genetic tests to predict the probability that the multifactorial disease will develop in people with an allele-positive result, or $P(D \mid G)$. By using the pretest risk of disease, $P(D)$, i.e. the average risk of disease in the population, the posterior probability can be defined as:

$$P(D \mid G) = \frac{LR(G)\,P(D)}{[1 - P(D)] + LR(G)\,P(D)} \tag{2.3}$$

2.2.4 Haplotype Data

In addition to utilizing validated disease-associated variations, we incorporated the use of haplotype blocks in the statistical model. For each of the SNPs that are associated with disease in our DNA-Disease database, we extracted SNPs (within 1 Mb on either side) in the corresponding HapMap populations that were in high linkage disequilibrium (threshold of $r^2 > 0.8$ [2]). To extract the HapMap SNPs and linkage disequilibrium values, Ensembl (build 45) was used. Due to the complexity involved in defining and classifying populations, a simplification was made when incorporating the HapMap data: the populations from the HapMap project were generalized to match the populations found in the DNA-Disease database. The population categories from the DNA-Disease database were Caucasians, Asians, Africans and Other/Mixed. The corresponding matches from the HapMap project were European ancestry (CEPH) grouped as Caucasians, the Tokyo (JPT) and Han Chinese (CHB) ethnic groups represented as Asians and the Nigeria (YRI) ethnic group matched to Africans.

D-GRIP uses the HapMap data in two different ways during the generation of a disease risk profile. First, for the reported disease-associated SNPs, an integrated analysis is performed in which multiple disease-associated SNPs in high linkage disequilibrium (LD) are clustered together during the probability calculation. Rather than treating these high-LD SNPs independently in the calculated overall disease probability, a simplification is made.
The SNP with the highest effect (highest odds ratio) is used to represent the other SNPs in high LD and thus only one SNP (the one with the strongest effect) is used in the posterior probability calculation.

Second, an inferred analysis is reported with the observed genotypes in the final risk profile output. The inferred analysis reports SNPs that were present in the user's genotype but did not have a direct association to a disease. These inferred SNPs are in high LD with known disease-associated SNPs which are present in the DNA-Disease database. The HapMap Genome Browser (Release 21) [40] was used to extract the phased genotype data. Subsequently, Haploview (version 3.32) [3] was used to calculate the haplotype blocks, using its default block-calculation method, in order to infer phase information. Since the inferred analysis is highly predictive in nature and untested, it is provided as an option for the user and is turned off by default during analysis. Also, the inferred SNPs are not used in the overall posterior probability calculation.

2.2.5 Software Evaluation

After a working prototype was created, D-GRIP underwent a series of critical evaluations. The evaluation was structured as a survey in which D-GRIP was demonstrated to experts and their feedback was recorded. A total of 21 scientists, clinicians or counselors were surveyed, including clinical geneticists, molecular geneticists, biostatisticians and genetic counseling students.

2.3 Results

A walkthrough of D-GRIP illustrates the user interface features as well as the underlying DNA-Disease database. Figure 2.2 shows the first page of D-GRIP after the user logs in. The opening page explains how to use D-GRIP and warns the user via a disclaimer which outlines the assumptions and limitations of D-GRIP.
Upon clicking the 'Use D-GRIP' link, the user is presented with a form to solicit demographic information and options regarding the analysis. In this example, suppose the user is a 47-year-old male of European ancestry with inference analysis turned on (Figure 2.3).

[Figure 2.2 screenshot. Recoverable page text:
Welcome 'Test'. Menu: Home, Disclaimer, Use D-GRIP, Logout.
This web site provides a tool for predicting a genetic risk profile for a person by utilizing genotype information.
Getting Started:
1. Click on the 'Use D-GRIP' link.
2. Upload a genotype file or copy/paste data into the form.
3. Click on Calculate Risk.
Note: Tips are provided anywhere the tip icon appears. Bring cursor over to see tips.
Disclaimer:
1. It is assumed the system is used in a guided setting.
2. All information provided by you ('the user') is assumed to be accurate. For instance, ethnic background provided by the user is assumed to be accurate to the best of the user's knowledge.
3. D-GRIP predicts risk of developing disease based on population information collected from literature.
4. The overall probability of developing a disease is calculated assuming all susceptible alleles/genes are acting independently within diseases and across diseases.
5. The system does not store any user-provided data (e.g. genotype and demographic data).]

Figure 2.2: The opening page of D-GRIP is shown. The instructions on how to use D-GRIP and a disclaimer explicitly stating the assumptions and limitations inherent in D-GRIP are shown.

[Figure 2.3 screenshot. Recoverable form fields: Demographic Information (Gender, Year of Birth, Ethnic Background) and Configuration Options (Inference of Genotypes); mandatory fields are marked.]

Figure 2.3: The first step in using D-GRIP is illustrated, where the user's demographic information such as gender, age and ethnicity is collected. The hypothetical example above shows a male, 47 years old, of European ancestry. The inference option is turned on (checked).
After clicking 'next', the user is asked to submit genotype data. Two options are available: copying and pasting the data into the submission box or uploading a genotype data file. The file formats supported are Illumina's Final format and Affymetrix's Text Output format. In order to illustrate the underlying DNA-Disease database, sample genotype files were created. One such sample genotype file is shown in Figure 2.4, which shows 13 genotypes, all of which are heterozygous for particular SNPs from each of the three diseases. After loading the genotype data, the user clicks on 'Calculate Risk'.

[Figure 2.4 screenshot. Recoverable elements: a copy/paste form (file format, file name, genotype data box and a 'Calculate Risk' button), an upload form (file format, filename and an 'Upload and Calculate Risk' button) and a pre-loaded sample selector with a 'Comments' box describing Sample 1 as a Caucasian population with selected SNPs from all diseases in the database.]

Figure 2.4: The second step in using D-GRIP is illustrated here. The user has a choice of copying and pasting the genotype data or uploading it. For ease of use, various hypothetical sample genotype files were created to illustrate D-GRIP. The above example contains the 13 highly significant genotypes which are heterozygous for each disease in the DNA-Disease database. A description of the pre-loaded data is shown in the 'Comments' box.

The last step is the output page, which displays a disease risk profile report. As seen in Figure 2.5, for each disease, the associated gene, SNP, genotype and population are reported along with a list of scientific articles supporting the association between disease and genotype. In addition, for each disease, the background probability and the calculated probability are indicated. For example, based solely on genotype, a 47-year-old male of European ancestry is reported to have a slightly elevated risk (7.0%) of developing Diabetes type II, given that the background probability in the Caucasian population is 5%. In Figure 2.6, further details for each SNP are shown, such as risk and major genotype, genotype frequencies for case and control populations, odds and likelihood ratios and confidence intervals. In this example, SNP rs7903146 from gene TCF7L2 for Diabetes Mellitus type 2 is shown. Links to relevant resources such as GenBank, OMIM and dbSNP are available when clicking the gene, disease name and SNP identifier, respectively. Also, after clicking on the overall probability row, an integrated analysis is shown which combines disease-associated SNPs that are in linkage disequilibrium according to HapMap data (Figure 2.7). Lastly, inferred SNPs are shown separately under each disease. The overall probability calculation is done only once, using observed SNPs; the inferred SNPs are not included in the calculation due to their speculative nature.
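The integrated analysis and the overall probability described above can be illustrated with a short Python sketch. It is a hedged reconstruction under stated assumptions, not the D-GRIP code: the tuple layout, the function names, the LD cluster labels and the example SNPs `rsA`, `rsB` and `rsC` are hypothetical. The posterior formula is the standard Bayes form, P(D|G) = LR(G)P(D) / ([1 - P(D)] + LR(G)P(D)). The pairing of a likelihood ratio of 1.27 with a 5% background probability follows values that appear legible in the Figure 2.6 screenshot residue, where the per-SNP probability seems to read 6.27%.

```python
import math


def posterior_probability(likelihood_ratio, prior):
    """P(D|G) = LR * P(D) / ((1 - P(D)) + LR * P(D))."""
    return likelihood_ratio * prior / ((1.0 - prior) + likelihood_ratio * prior)


def ld_representatives(snps):
    """Keep, for each LD cluster, only the SNP with the highest odds ratio.

    Each SNP is a tuple (rs_id, ld_cluster, odds_ratio, likelihood_ratio);
    this layout is an assumption made for the example.
    """
    best = {}
    for snp in snps:
        cluster, odds_ratio = snp[1], snp[2]
        if cluster not in best or odds_ratio > best[cluster][2]:
            best[cluster] = snp
    return list(best.values())


def overall_probability(snps, prior):
    """Multiply the LRs of the cluster representatives, then apply the prior."""
    combined_lr = math.prod(snp[3] for snp in ld_representatives(snps))
    return posterior_probability(combined_lr, prior)


# Hypothetical observed disease-associated SNPs; rsA and rsB share an LD cluster.
observed = [
    ("rsA", "block1", 1.65, 1.27),
    ("rsB", "block1", 1.30, 1.10),
    ("rsC", "block2", 1.20, 1.05),
]

single_snp_percent = round(posterior_probability(1.27, 0.05) * 100, 2)
overall = overall_probability(observed, 0.05)
```

Working backwards through the same formula, moving from a 5% background probability to the reported 7.0% overall probability corresponds to a combined likelihood ratio of about 1.43.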
The evaluation of D-GRIP was performed by surveying experts in the field. The feedback was recorded and is presented in Appendix A.

[Figure 2.5 screenshot: disease risk profile table listing, for each matched SNP, the disease, gene, SNP identifier, genotype, population and supporting article identifier, followed by the background population probability and the overall calculated probability for each disease (e.g. Diabetes Mellitus type 2: 5% and 7%; Parkinson disease: 2% and 2.14%).]

Figure 2.5: The last step shows a tabular result for any single nucleotide polymorphisms (SNPs) found to be associated with a disease in the user's genotype data.

[Figure 2.6 screenshot: the same table with the detail panel for rs7903146 (TCF7L2) expanded. The panel lists risk genotype C/T, major genotype C/C, genotype frequencies for cases and controls, odds ratio 1.65 (95% CI 1.47 to 1.85), log odds ratio 0.5 ± 0.06 (0.38 to 0.61), likelihood ratio 1.27 (1.17 to 1.38) and probability of disease based on the SNP of 6.27%.]

Figure 2.6: More details are shown for each SNP. As an example, details for SNP rs7903146 from gene TCF7L2 for Diabetes Mellitus type II are shown.

[Figure 2.7 screenshot: Diabetes Mellitus type 2 results with the overall probability row expanded. User details: age 47, male, Europe. Background probability details: 5% for age of onset 45 years and 15% for age of onset 60 years. Integrated analysis: rs11037909 is used in the probability calculation; rs11037909, rs1113132 and rs3740878 are in high linkage disequilibrium. Inference analysis lists SLC30A8 rs13266634 (T/C) and LOC387761 rs7480010 (A/G). Background population probability 5%, overall calculated probability 7%.]

Figure 2.7: Details of the overall probability calculation, integrated analysis and inferred SNPs are shown for Diabetes Mellitus type 2. The integrated analysis indicates which disease-associated SNPs are in high linkage disequilibrium (r2 > 0.8). For SNPs in high LD, only the SNP with the strongest effect (highest odds ratio) is used in the overall calculated probability.

2.4 Discussion

2.4.1 Limitations

In the current state of information and implementation, D-GRIP has several limitations. A key limitation is the narrow scope of the DNA-Disease database. The scarcity reflects two key causes: lack of organization of genotype-phenotype data and the small number of confirmed markers for risk.
Even though numerous studies report new DNA marker-disease associations, there is a shortage of databases that organize such information in a comprehensive and computationally accessible manner. Databases such as AlzGene [5] and PDGene [1] are rare examples of organized genotype-phenotype data that are continuously updated as new studies are published and are easy to use computationally. More such genetics databases are required for other common diseases [33]. It should be noted that numerous databases provide information about genetics and disease, such as OMIM [15] and HGVbase [16], but the information is not sufficiently granular and/or formatted to incorporate into the risk calculation procedure of D-GRIP. The second problem, the scarcity of confirmed predictive markers, will soon be ameliorated as the rate of publication of such studies is accelerating.

Another limitation of D-GRIP resides in the statistical model, which raises several issues. When a posterior probability is calculated using the observed SNPs associated with a disease, each genetic test (SNP) is assumed to act independently. This is a very simplistic view and does not realistically capture the underlying disease process. To partially circumvent this limitation, we included haplotype information in the form of an integrated analysis: if observed SNPs in the output are found to be in linkage disequilibrium, only the SNP with the strongest effect is included in the posterior probability calculation. Again, this is a simplification, which is warranted since we could not find other existing statistical models suitable for incorporating haplotype data into SNP-based risk prediction. Furthermore, the lack of consideration for gene-gene interactions and gene-environment interactions is another limitation.
Even though the model allows for incorporation of interaction effects, for simplicity, D-GRIP does not utilize that feature.

A second issue with the statistical model is the lack of incorporation of age and gender during risk calculation. Even though we require the user to input such demographic information when calculating risk for a particular disease, this information is not utilized. In order to use demographic information appropriately, we require the age and gender distribution for each of the individuals in the case-control studies stored in the DNA-Disease database. Since such raw data are unavailable, a simplification was used: D-GRIP uses a different prior probability (background probability) for specific diseases (e.g. Alzheimer's disease) based on the age of the person. To alleviate this scarcity of raw data, efforts are currently under way at the NIH to archive and distribute more detailed information on upcoming genetic association studies. The database, dbGaP, is designed to house genetics studies dealing with genotype-phenotype interactions and to provide all study documentation as well as pre-computed analyses [30].

Currently, no family history or medical history is used for predicting risk of disease. Incorporation of family history has been shown to improve the predictive accuracy of risk models [11]. Thus future versions of D-GRIP should incorporate family history in the risk model.

For any prediction-based software, rigorous validation regarding specificity, sensitivity and accuracy is required. Currently, no such validation has been performed, owing to the insufficient number of diseases in the DNA-Disease database as well as the unavailability of raw genotype data from individuals for testing. D-GRIP was instead evaluated through a survey in which D-GRIP was demonstrated to various experts in genetics-related fields and feedback was recorded.
The conclusions from this form of evaluation are discussed in the next section.

2.4.2 Ideal Software

Based on the conclusions drawn from the prototype system and feedback from experts, the features and functionality of an idealized software system can be outlined. The input features of a system should include, as in the prototype, demographic information collected from the user and, in addition, an option for collecting family history of any diseases and relevant environmental exposures (e.g. cigarette smoking). Also, the genotype parser should be flexible and accommodate various file formats. Preferably, a widely accepted file format standard should be established for genotyping data released from platforms such as Illumina and Affymetrix. With a standard file format, exchange of genotyping data across studies will be more efficient. Lastly, user information on non-SNP variants, such as insertions/deletions, copy number variations and large-scale structural variants, should also be accepted.

At the core of the software, the ideal DNA-Disease database will contain information for as many common diseases as possible. There are two ways to populate such a database. The first is to create a meta-analysis engine for each disease. When new studies are published for a disease, they can be added to the database and the meta-analysis re-performed over all the studies for that disease. This would require continuous updating of the database each time new disease-associated markers are found. In the second approach, genotype-phenotype data would be extracted from disease-specific databases such as AlzGene and PDGene; currently, however, disease-specific genetics databases in a suitable format are rare.

Based on recommendations from biostatisticians, an idealized software's statistical approach would include a unique model for each disease (or a range of optional models).
Since common diseases are varied and complex, it is crucial to have rigorously tested and validated statistical models. In addition, the statistical models will need to incorporate gene-gene interactions as well as covariates such as exposure to environmental or behavioral factors.

The user interface of an ideal system, both input and output, will have to be tailored to the audience. For example, the current disease risk profile report generated by D-GRIP is intended to be read by a trained user such as a genetic counselor. If one were to target family physicians, as suggested by one survey participant, it might be more suitable for the output to highlight links to information about prevention. Appropriate training will be required for any user of such a system, be it genetic counselors, family physicians or individual subjects. Lastly, the respondents strongly recommended that access to D-GRIP-like tools be restricted: the complicated interpretation of risk and the potential for undue stress on the recipient of the information together warrant limited user access for the near term. As a last comment, the average estimate from the feedback for when such an ideal system could be accepted and used clinically was between 5 and 10 years.

2.4.3 Implications

There are many societal, ethical and legal implications involved with using D-GRIP. The immediate issues are discussed here and potential directions are presented.

One of the pressing questions deals with data protection. The same level of protection should be provided for genetic data as for sensitive medical data, that is, confidentiality and privacy. In addition, the individual's rights should be respected every time such a tool is used in a professional setting.
Currently, D-GRIP protects the user's rights by not storing any user-specified information (demographic and genotype) and ensures confidentiality via anonymous submission of genetic data. However, in the long term it would be more appropriate for a continuous analysis engine to reassess the DNA each time a new genetic risk marker is deposited into the database. Because this would require retaining the submitted genotype data, encryption and privacy features are required in such a tool.

Much research is still needed on how to present and explain genetic risk information to individuals [10]. Inappropriately explained risks can lead to demoralization and unnecessarily increased anxiety, both of which can decrease an individual's ability to change risk-related behavior [28, 42]. Also, most people find probabilities and relative risk information difficult to comprehend, in part due to poor presentation of statistics [20]. Thus, it is recommended to use standard vocabulary, use a common denominator when explaining odds, provide both positive and negative perspectives and use visual aids for probabilities [31].

Genetic testing for affected or at-risk individuals creates serious ethical dilemmas. Concerns such as discrimination by employers and insurers, and the fear of such discrimination, can deter individuals who could benefit from genetic testing. It also remains to be seen how actual and potential third-party use of genetic information will impact the use of predictive tools such as D-GRIP. These issues will have to be discussed and addressed by governments, industries and the public in a transparent manner [22].

2.4.4 Conclusions

The creation of the D-GRIP system for genetic risk prediction was intended to identify bioinformatics, statistical and scientific challenges that must be addressed to create predictive systems of clinical utility.
The major bioinformatic limitation is the lack of available data in terms of strongly predictive susceptibility alleles for complex diseases. This is in part due to the lack of organized and computationally exploitable disease databases for complex disorders. The major statistical limitation is the calculation of risk given large-scale genotype data (e.g. incorporating haplotype information into the analysis). The major scientific limitation, despite the flurry of association studies, is our limited understanding of complex diseases and of how various genes interact with each other and the environment. Any proposed predictive model (be it for a single disease or a general model) will have to undergo rigorous testing and evaluation in order to ensure clinical utility.

When these limitations are overcome, useful and beneficial predictive software can be created and implemented. The key features include: incorporation of genotype data along with family history of disease, a continuously updated DNA-Disease database with a meta-analysis engine, validated disease-specific risk models, and user-oriented risk profile reporting. The software will be used in a guided setting, with potential users being genetic counselors and family physicians. Regardless of the user, appropriate training in using the software and interpreting the output will be a necessity. Lastly, implications such as privacy and confidentiality of genetic data, appropriate explanations of risk, discrimination against individuals by third parties, effects on public health policies, and public education are all important challenges to be addressed before implementation of such a predictive tool becomes a reality.

Bibliography

[1] S Bagade, NC Allen, R Tanzi, and L Bertram. The pdgene database, alzheimer research forum, available at: http://www.pdgene.org/, Accessed May 2007.
[2] Michael R Barnes.
Navigating the hapmap. Brief Bioinform, 7(3):211-24, September 2006.
[3] J C Barrett, B Fry, J Maller, and M J Daly. Haploview: analysis and visualization of ld and haplotype maps. Bioinformatics, 21(2):263-5, January 2005.
[4] A E Baum, N Akula, M Cabanero, I Cardona, W Corona, B Klemens, T G Schulze, S Cichon, M Rietschel, M M Nothen, A Georgi, J Schumacher, M Schwarz, R Abou Jamra, S Hofels, P Propping, J Satagopan, S D Detera-Wadleigh, J Hardy, and F J McMahon. A genome-wide association study implicates diacylglycerol kinase eta (dgkh) and several other genes in the etiology of bipolar disorder. Mol Psychiatry, May 2007.
[5] Lars Bertram, Matthew B McQueen, Kristina Mullin, Deborah Blacker, and Rudolph E Tanzi. Systematic meta-analysis of alzheimer disease genetic association studies: The alzgene database. Nature Genetics, 39:17-23, January 2007.
[6] International Hapmap Consortium. The international hapmap project. Nature, 426(6968):789-96, December 2003.
[7] Keith D Coon, Amanda J Myers, David W Craig, Jennifer A Webster, John V Pearson, Diane Hu Lince, Victoria L Zismann, Thomas G Beach, Doris Leung, Leslie Bryden, Rebecca F Halperin, Lauren Marlowe, Mona Kaleem, Douglas G Walker, Rivka Ravid, Christopher B Heward, Joseph Rogers, Andreas Papassotiropoulos, Eric M Reiman, John Hardy, and Dietrich A Stephan. A high-density whole-genome association study reveals that apoe is the major susceptibility gene for sporadic late-onset alzheimer's disease. J Clin Psychiatry, 68(4):613-8, April 2007.
[8] J R Fraser Cummings, Rachel Cooney, Saad Pathan, Carl A Anderson, Jeffrey C Barrett, John Beckly, Alessandra Geremia, Laura Hancock, Changcun Guo, Tariq Ahmad, Lon R Cardon, and Derek P Jewell. Confirmation of the role of atg16l1 as a Crohn's disease susceptibility gene. Inflamm Bowel Dis, April 2007.
[9] Ofir Davidovich, Gad Kimmel, and Ron Shamir. Gevalt: an integrated software tool for genotype analysis. BMC Bioinformatics, 8:36, 2007.
[10] Adrian Edwards, Silvana Unigwe, Glyn Elwyn, and Kerenza Hood. Effects of communicating individual risks in screening programmes: Cochrane systematic review. BMJ, 327(7417):703-9, September 2003.
[11] David M Euhus, Kristin C Smith, Linda Robinson, Amy Stucky, Olufunmilayo I Olopade, Shelly Cummings, Judy E Garber, Anu Chittenden, Gordon B Mills, Paula Rieger, Laura Esserman, Beth Crawford, Kevin S Hughes, Connie A Roche, Patricia A Ganz, Joyce Seldon, Carol J Fabian, Jennifer Klemp, and Gail Tomlinson. Pretest prediction of brca1 or brca2 mutation by risk counselors and the computer model brcapro. J Natl Cancer Inst, 94(11):844-51, June 2002.
[12] J.B. Fan, A. Oliphant, R. Shen, B.G. Kermani, F. Garcia, K.L. Gunderson, M. Hansen, F. Steemers, S.L. Butler, P. Deloukas, L. Galver, S. Hunt, C. McBride, M. Bibikova, T. Rubano, J. Chen, E. Wickham, D. Doucet, W. Chang, D. Campbell, B. Zhang, S. Kruglyak, D. Bentley, J. Haas, P. Rigault, L. Zhou, J. Stuelpnagel, and M.S. Chee. Highly parallel snp genotyping. Cold Spring Harbor Symposia on Quantitative Biology, 68:69-78, 2003.
[13] Martin Farrall and Andrew P Morris. Gearing up for genome-wide gene-association studies. Hum Mol Genet, 14 Spec No. 2:R157-62, October 2005.
[14] Simon Fiddy, David Cattermole, Dong Xie, Xiao Yuan Duan, and Richard Mott. Igs: An integrated system for genetic analysis. BMC Bioinformatics, 7:210, 2006.
[15] McKusick-Nathans Institute for Genetic Medicine and National Center for Biotechnology Information. Online mendelian inheritance in man omim (tm), http://www.ncbi.nlm.nih.gov/omim/, July 2006.
[16] D Fredman, M Siegfried, Y P Yuan, P Bork, H Lehvaslaiho, and A J Brookes. Hgvbase: a human sequence variation database emphasizing data quality and a broad spectrum of data sources. Nucleic Acids Res, 30(1):387-91, January 2002.
[17] Nelson B Freimer and Chiara Sabatti. Human genetics: variants in common diseases. Nature, 445(7130):828-30, February 2007.
[18] Genelex. Genelex website, available at http://www.genelex.com/, May 2007.
[19] GeneTrack. Genetrack website, available at http://www.genetrack.bc.ca, July 2006.
[20] Gerd Gigerenzer and Adrian Edwards. Simple tools for understanding risks: from innumeracy to insight. BMJ, 327(7417):741-4, September 2003.
[21] Alan E Guttmacher and Francis S Collins. Welcome to the genomic era. N Engl J Med, 349(10):996-8, September 2003.
[22] Wayne D Hall, Katherine I Morley, and Jayne C Lucke. The prediction of disease risk in genomic medicine. EMBO Rep, 5 Spec No:S22-6, October 2004.
[23] Lynn B Jorde and Stephen P Wooding. Genetic variation, classification and 'race'. Nat Genet, 36(11 Suppl):S28-33, November 2004.
[24] K M Kelly and K Sweet. In search of a familial cancer risk assessment tool. Clin Genet, 71(1):76-83, January 2007.
[25] Muin J Khoury, Julian Little, Marta Gwinn, and John P A Ioannidis. On the synthesis and interpretation of consistent but weak gene-disease associations in the era of genome-wide association studies. Int J Epidemiol, 36(2):439-45, April 2007.
[26] Cecile Libioulle, Edouard Louis, Sarah Hansoul, Cynthia Sandor, Frederic Farnir, Denis Franchimont, Severine Vermeire, Olivier Dewit, Martine de Vos, Anna Dixon, Bruno Demarche, Ivo Gut, Simon Heath, Mario Foglio, Liming Liang, Debby Laukens, Myriam Mni, Diana Zelenika, Andre Van Gossum, Paul Rutgeerts, Jacques Belaiche, Mark Lathrop, and Michel Georges. Novel crohn disease locus identified by genome-wide association maps to a gene desert on 5p13.1 and modulates expression of ptger4. PLoS Genet, 3(4):e58, April 2007.
[27] Yen-Ling Low, Sara Wedren, and Jianjun Liu. High-throughput genomic technology in research and clinical management of breast cancer, evolving landscape of genetic epidemiological studies. Breast Cancer Res, 8(3):209, 2006.
[28] T M Marteau and R T Croyle. The new genetics, psychological responses to genetic testing. BMJ, 316(7132):693-6, February 1998.
[29] Ruth McPherson, Alexander Pertsemlidis, Nihan Kavaslar, Alexandre Stewart, Robert Roberts, David R Cox, David A Hinds, Len A Pennacchio, Anne Tybjaerg-Hansen, Aaron R Folsom, Eric Boerwinkle, Helen H Hobbs, and Jonathan C Cohen. A common allele on chromosome 9 associated with coronary heart disease. Science, May 2007.
[30] NCBI. dbgap: Database of genome wide association studies, url: http://www.ncbi.nlm.nih.gov/sites/entrez?db=gap, 2007.
[31] John Paling. Strategies to help patients understand risks. BMJ, 327(7417):745-8, September 2003.
[32] Lyle J Palmer and Lon R Cardon. Shaking the tree: mapping complex disease genes with linkage disequilibrium. Lancet, 366(9492):1223-34, October 2005.
[33] George P Patrinos and Anthony J Brookes. Dna, diseases and databases: disastrously deficient. Trends Genet, 21(6):333-8, June 2005.
[34] Richa Saxena, Benjamin F Voight, Valeriya Lyssenko, Noel P Burtt, Paul I W de Bakker, Hong Chen, Jeffrey J Roix, Sekar Kathiresan, Joel N Hirschhorn, Mark J Daly, Thomas E Hughes, Leif Groop, David Altshuler, Peter Almgren, Jose C Florez, Joanne Meyer, Kristin Ardlie, Kristina Bengtsson, Bo Isomaa, Guillaume Lettre, Ulf Lindblad, Helen N Lyon, Olle Melander, Christopher Newton-Cheh, Peter Nilsson, Marju Orho-Melander, Lennart Rastam, Elizabeth K Speliotes, Marja-Riitta Taskinen, Tiinamaija Tuomi, Candace Guiducci, Anna Berglund, Joyce Carlson, Lauren Gianniny, Rachel Hackett, Liselott Hall, Johan Holmkvist, Esa Laurila, Marketa Sjogren, Maria Sterner, Aarti Surti, Margareta Svensson, Malin Svensson, Ryan Tewhey, Brendan Blumenstiel, Melissa Parkin, Matthew Defelice, Rachel Barry, Wendy Brodeur, Jody Camarata, Nancy Chia, Mary Fava, John Gibbons, Bob Handsaker, Claire Healy, Kieu Nguyen, Casey Gates, Carrie Sougnez, Diane Gage, Marcia Nizzari, Stacey B Gabriel, Gung-Wei Chirn, Qicheng Ma, Hemang Parikh, Delwood Richardson, Darrell Ricke, and Shaun Purcell. Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels. Science, April 2007.
[35] Laura J Scott, Karen L Mohlke, Lori L Bonnycastle, Cristen J Willer, Yun Li, William L Duren, Michael R Erdos, Heather M Stringham, Peter S Chines, Anne U Jackson, Ludmila Prokunina-Olsson, Chia-Jen Ding, Amy J Swift, Narisu Narisu, Tianle Hu, Randall Pruim, Rui Xiao, Xiao-Yi Li, Karen N Conneely, Nancy L Riebow, Andrew G Sprau, Maurine Tong, Peggy P White, Kurt N Hetrick, Michael W Barnhart, Craig W Bark, Janet L Goldstein, Lee Watkins, Fang Xiang, Jouko Saramies, Thomas A Buchanan, Richard M Watanabe, Timo T Valle, Leena Kinnunen, Goncalo R Abecasis, Elizabeth W Pugh, Kimberly F Doheny, Richard N Bergman, Jaakko Tuomilehto, Francis S Collins, and Michael Boehnke. A genome-wide association study of type 2 diabetes in finns detects multiple susceptibility variants. Science, April 2007.
[36] Nameeta Shah, Michael V Teplitsky, Simon Minovitsky, Len A Pennacchio, Philip Hugenholtz, Bernd Hamann, and Inna L Dubchak. Snp-vista: an interactive snp visualization tool. BMC Bioinformatics, 6:292, 2005.
[37] Robert Sladek, Ghislain Rocheleau, Johan Rung, Christian Dina, Lishuang Shen, David Serre, Philippe Boutin, Daniel Vincent, Alexandre Belisle, Samy Hadjadj, Beverley Balkau, Barbara Heude, Guillaume Charpentier, Thomas J. Hudson, Alexandre Montpetit, Alexey V. Pshezhetsky, Marc Prentki, Barry I. Posner, David J. Balding, David Meyre, Constantin Polychronakos, and Philippe Froguel. A genome-wide association study identifies novel risk loci for type 2 diabetes. Nature, 445:881-885, February 2007.
[38] E M Smigielski, K Sirotkin, M Ward, and S T Sherry. dbsnp: a database of single nucleotide polymorphisms. Nucleic Acids Res, 28(1):352-5, January 2000.
[39] Valgerdur Steinthorsdottir, Gudmar Thorleifsson, Inga Reynisdottir, Rafn Benediktsson, Thorbjorg Jonsdottir, G Bragi Walters, Unnur Styrkarsdottir, Solveig Gretarsdottir, Valur Emilsson, Shyamali Ghosh, Adam Baker, Steinunn Snorradottir, Hjordis Bjarnason, Maggie C Y Ng, Torben Hansen, Yu Bagger, Robert L Wilensky, Muredach P Reilly, Adebowale Adeyemo, Yuanxiu Chen, Jie Zhou, Vilmundur Gudnason, Guanjie Chen, Hanxia Huang, Kerrie Lashley, Ayo Doumatey, Wing-Yee So, Ronald C Y Ma, Gitte Andersen, Knut Borch-Johnsen, Torben Jorgensen, Jana V van Vliet-Ostaptchouk, Marten H Hofker, Cisca Wijmenga, Claus Christiansen, Daniel J Rader, Charles Rotimi, Mark Gurney, Juliana C N Chan, Oluf Pedersen, Gunnar Sigurdsson, Jeffrey R Gulcher, Unnur Thorsteinsdottir, Augustine Kong, and Kari Stefansson. A variant in cdkal1 influences insulin response and risk of type 2 diabetes. Nat Genet, April 2007.
[40] Gudmundur A Thorisson, Albert V Smith, Lalitha Krishnan, and Lincoln D Stein. The international hapmap project web site. Genome Res, 15(11):1592-3, November 2005.
[41] Wenyi Wang, Sining Chen, Kieran A Brune, Ralph H Hruban, Giovanni Parmigiani, and Alison P Klein. Pancpro: risk assessment for individuals with a family history of pancreatic cancer. J Clin Oncol, 25(11):1417-22, April 2007.
[42] A J Wright, J Weinman, and T M Marteau. The impact of learning of a genetic predisposition to nicotine dependence: an analogue study. Tob Control, 12(2):227-30, June 2003.
[43] Nan Yang, Hongzhe Li, Lindsey A Criswell, Peter K Gregersen, Marta E Alarcon-Riquelme, Rick Kittles, Russell Shigeta, Gabriel Silva, Pragna I Patel, John W Belmont, and Michael F Seldin. Examination of ancestry and ethnic affiliation using highly informative diallelic dna markers: application to diverse and admixed populations and implications for clinical epidemiology and forensic medicine. Hum Genet, 118(3-4):382-92, December 2005.
[44] Quanhe Yang, Muin J.
Khoury, Lorenzo Botto, J.M. Friedman, and Dana Flanders. Improving the prediction of complex diseases by testing for multiple disease-susceptibility genes. American Journal of Human Genetics, 72:636-649, 2003.
[45] Lan-Juan Zhao, Miao-Xin Li, Yan-Fang Guo, Fu-Hua Xu, Jin-Long Li, and Hong-Wen Deng. Snpp: automating large-scale snp genotype data management. Bioinformatics, 21(2):266-8, January 2005.

Chapter 3

Conclusions and Future Directions

3.1 Further Observations

One of the most important observations noted during the development and testing of D-GRIP was the lack of computationally efficient organization of existing and new discoveries in the genetics field [5, 10]. There has been an explosion of data from recent progress in disease genetics, and even though many types of mutation databases currently exist, progress towards the creation of new databases has been slow. The challenges involved are often technical in nature, such as gathering, exchanging, integrating and interpreting the disease-related information. However, arguably the lack of targeted funding and the inherent bias towards making new discoveries rather than managing existing data are among the main underlying problems [10].

In order to overcome the technical limitations of creating a comprehensive, computationally exploitable genotype-phenotype database, a few goals must be met. For easy computational access, complex phenotype data models that extensively utilize phenotype ontologies will be required. By using ontologies, a standard vocabulary can be established for use of terms, which will help integrate various types of data and make analysis computationally easier. Initially, the DNA changes related to phenotypes can be represented in a structured and standardized way. Then, a basic framework for gathering, integrating, analyzing and updating the stored information will be required. Given the enormous amounts of data being generated, a systematic and standardized way to manage phenotype data will be a necessity, which will require international cooperation and open access to anonymous data. Ultimately, an ideal genotype-phenotype database will provide a systems biology approach in which all information pertaining to the connection between genotypic differences and phenotypic consequences, such as that derived from the genome, transcriptome, proteome and metabolome, will be recorded.

The second important observation that resulted from my work on D-GRIP was the limited number of variants that are known to be associated with complex diseases. Even though individual genome-wide association studies (GWAs) are publishing results for many diseases [12, 11, 1, 2, 4, 9], most of the studies report only a few disease-associated variants [3, 8]. In addition, the reported effects of individual genetic variants associated with common diseases are small (risk ratios ≤ 2.0). Although it has been shown that the combined effects of a moderate number (fewer than 20) of common genetic variants (with relative risks ≤ 2.0) could explain 50% of the burden of disease in a population [13], there are numerous challenges with genome-wide association studies. These challenges include, for example, significance-chasing bias (including publication bias, selective analysis and reporting bias), population stratification (due to heterogeneous population mixtures), misclassification of exposures and outcomes, and inherent problems that include failure to detect gene-gene and gene-environment interactions, limited sample size and statistical power, and false positive associations. All these issues can make it difficult to find biologically meaningful genetic associations and thus slow progress in understanding complex diseases.
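The claim that a moderate number of weak variants can together account for a large share of disease burden can be made concrete with a small calculation. The sketch below is an assumption-laden illustration and not necessarily the model used in [13]: it applies Levin's population attributable fraction to each variant and combines the fractions as if the variants were independent with multiplicative joint effects. The function names, the risk genotype frequency of 0.25 and the relative risk of 1.2 are hypothetical choices made for the example.

```python
import math


def attributable_fraction(risk_genotype_freq: float, relative_risk: float) -> float:
    """Levin's population attributable fraction for a single variant."""
    excess = risk_genotype_freq * (relative_risk - 1.0)
    return excess / (1.0 + excess)


def combined_attributable_fraction(fractions) -> float:
    """Combine per-variant fractions, assuming independent variants
    whose joint effect on risk is multiplicative."""
    remaining = math.prod(1.0 - f for f in fractions)
    return 1.0 - remaining


# Twenty hypothetical variants, each with a 25% risk genotype frequency
# and a relative risk of 1.2.
single = attributable_fraction(0.25, 1.2)
combined = combined_attributable_fraction([single] * 20)
print(f"per variant: {single:.3f}, combined: {combined:.2f}")
```

Under these assumptions each variant accounts for under 5% of cases, yet twenty of them together exceed one half, which is the qualitative point of the cited result.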
In order to alleviate these problems and infer true disease-associated variants from numerous GWAs, standards should be established for presenting and interpreting the accumulated evidence. The Human Genome Epidemiology Network (HuGENet) is developing systematic approaches for assessing the combined evidence for disease-associated variants. The approaches include criteria such as biological plausibility, experimental evidence, sound methods for conduct and analysis, and appropriate replication [8]. The opportunity to develop methods and standards for measuring, validating and interpreting genetic associations will be high in the next few years and will ultimately benefit individuals and population health.

3.2 Future Considerations

The goal of shifting the current medical paradigm from a reactive to a preventative approach through personalized risk profiles appears to be within reach in the long term. The generation of genetic risk profiles is intended to improve disease prevention by prompting at-risk individuals to take specific preventative actions, which usually involve environmental exposures, diet or other lifestyle changes. However, before genetic risk assessment tools can be used in a clinical setting, the clinical utility of such tools needs to be evaluated [7]. The clinical utility of a test refers to the likelihood that a diagnostic test will lead to improved health outcomes [7]. For individuals with positive test results, the clinical utility depends on the availability, safety and effectiveness of therapeutic measures. The recommendation for ensuring the clinical utility of any genetic test is to consider both the clinical and the social outcomes of the test. Clinical outcomes depend on effective changes in lifestyle following a positive test result. Social outcomes depend on the psychosocial, ethical, legal and social issues related to receiving a positive or negative result.
Both clinical and social outcomes are important because they both contribute to the net balance between the benefits and harms of genetic testing [6]. Thus, future evaluation of genomic profiles should clearly address the validity, the clinical utility and the social utility of the test.

Regardless of the intended audience for genetic risk profiling software, two criteria are crucial when providing a genetic profile test. First, because knowledge about the clinical implications of such tests is still limited, the benefits and limitations of the tests should be clearly explained, and those who provide the tests should disclose what is known and not known about them. Second, the tests should be offered in a controlled environment in which individual test takers are counseled about the results and implications of the tests. With transparency in providing the genetic profile test and counseling of the individual test taker, informed decisions can be made by health professionals, patients and the general public.

Lastly, consensus needs to be achieved on when genomic profiling has reached an acceptable standard for a clinical setting. In the future, genomic profiling will likely become common, and thus the level of evidence that justifies its clinical use requires careful thought. It is recommended that an accepted process be developed that incorporates defined procedures for evaluating evidence and reaching conclusions, with input from clinicians, health care payers and consumers.

3.3 Conclusion

Given the advent of new genotyping technologies and the rapid discovery of new disease-associated variants, experts have predicted that future medical care will become more personalized and geared towards disease prevention.
We created a prototype web tool called DNA Genetic Risk Information Profile (D-GRIP), which predicts disease risk profiles based on an individual's genotype. The project outlined the current bioinformatic and scientific limitations involved in creating genetic risk assessment software and addressed the main issues involved in the creation, evaluation and utility of such a tool in a clinical setting. If the major limitations are overcome and the important issues addressed, viable and useful genetic risk profiling software is plausible in the future and would thus lead to a change in the way medicine is currently practiced.

Bibliography

[1] A E Baum, N Akula, M Cabanero, I Cardona, W Corona, B Klemens, T G Schulze, S Cichon, M Rietschel, M M Nothen, A Georgi, J Schumacher, M Schwarz, R Abou Jamra, S Hofels, P Propping, J Satagopan, S D Detera-Wadleigh, J Hardy, and F J McMahon. A genome-wide association study implicates diacylglycerol kinase eta (DGKH) and several other genes in the etiology of bipolar disorder. Mol Psychiatry, May 2007.

[2] Keith D Coon, Amanda J Myers, David W Craig, Jennifer A Webster, John V Pearson, Diane Hu Lince, Victoria L Zismann, Thomas G Beach, Doris Leung, Leslie Bryden, Rebecca F Halperin, Lauren Marlowe, Mona Kaleem, Douglas G Walker, Rivka Ravid, Christopher B Heward, Joseph Rogers, Andreas Papassotiropoulos, Eric M Reiman, John Hardy, and Dietrich A Stephan. A high-density whole-genome association study reveals that APOE is the major susceptibility gene for sporadic late-onset Alzheimer's disease. J Clin Psychiatry, 68(4):613-8, April 2007.

[3] Jennifer Couzin and Jocelyn Kaiser. Genome-wide association, closing the net on common disease genes. Science, 316(5826):820-2, May 2007.

[4] J R Fraser Cummings, Rachel Cooney, Saad Pathan, Carl A Anderson, Jeffrey C Barrett, John Beckly, Alessandra Geremia, Laura Hancock, Changcun Guo, Tariq Ahmad, Lon R Cardon, and Derek P Jewell.
Confirmation of the role of ATG16L1 as a Crohn's disease susceptibility gene. Inflamm Bowel Dis, April 2007.

[5] Angela Frodsham and Julian Higgins. Online genetic databases informing human genome epidemiology. BMC Med Res Methodol, 7(1):31, July 2007.

[6] Scott D Grosse and Muin J Khoury. What is the clinical utility of genetic testing? Genet Med, 8(7):448-50, July 2006.

[7] Susanne B Haga, Muin J Khoury, and Wylie Burke. Genomic profiling to promote a healthy lifestyle: not ready for prime time. Nat Genet, 34(4):347-50, August 2003.

[8] Muin J Khoury, Julian Little, Marta Gwinn, and John P A Ioannidis. On the synthesis and interpretation of consistent but weak gene-disease associations in the era of genome-wide association studies. Int J Epidemiol, 36(2):439-45, April 2007.

[9] Ruth McPherson, Alexander Pertsemlidis, Nihan Kavaslar, Alexandre Stewart, Robert Roberts, David R Cox, David A Hinds, Len A Pennacchio, Anne Tybjaerg-Hansen, Aaron R Folsom, Eric Boerwinkle, Helen H Hobbs, and Jonathan C Cohen. A common allele on chromosome 9 associated with coronary heart disease. Science, May 2007.

[10] George P Patrinos and Anthony J Brookes. DNA, diseases and databases: disastrously deficient. Trends Genet, 21(6):333-8, June 2005.

[11] Laura J Scott, Karen L Mohlke, Lori L Bonnycastle, Cristen J Willer, Yun Li, William L Duren, Michael R Erdos, Heather M Stringham, Peter S Chines, Anne U Jackson, Ludmila Prokunina-Olsson, Chia-Jen Ding, Amy J Swift, Narisu Narisu, Tianle Hu, Randall Pruim, Rui Xiao, Xiao-Yi Li, Karen N Conneely, Nancy L Riebow, Andrew G Sprau, Maurine Tong, Peggy P White, Kurt N Hetrick, Michael W Barnhart, Craig W Bark, Janet L Goldstein, Lee Watkins, Fang Xiang, Jouko Saramies, Thomas A Buchanan, Richard M Watanabe, Timo T Valle, Leena Kinnunen, Goncalo R Abecasis, Elizabeth W Pugh, Kimberly F Doheny, Richard N Bergman, Jaakko Tuomilehto, Francis S Collins, and Michael Boehnke.
A genome-wide association study of type 2 diabetes in Finns detects multiple susceptibility variants. Science, April 2007.

[12] Robert Sladek, Ghislain Rocheleau, Johan Rung, Christian Dina, Lishuang Shen, David Serre, Philippe Boutin, Daniel Vincent, Alexandre Belisle, Samy Hadjadj, Beverley Balkau, Barbara Heude, Guillaume Charpentier, Thomas J. Hudson, Alexandre Montpetit, Alexey V. Pshezhetsky, Marc Prentki, Barry I. Posner, David J. Balding, David Meyre, Constantin Polychronakos, and Philippe Froguel. A genome-wide association study identifies novel risk loci for type 2 diabetes. Nature, 445:881-885, February 2007.

[13] Quanhe Yang, Muin J Khoury, J M Friedman, Julian Little, and W Dana Flanders. How many genes underlie the occurrence of common complex diseases in the population? Int J Epidemiol, 34(5):1129-37, October 2005.

Appendix A
Feedback from Experts

A.1 Questions

The set of questions asked of each type of expert (clinical geneticists, molecular geneticists, genetic counselors and biostatisticians) is listed below.

1. Any comments on the user interface of D-GRIP?
   • The input page?
   • The output page?
2. Any comments on, or references to, available risk models that predict risk based on genotype data?
   • How to include age-specific risk prediction without raw data?
3. How should an ideal system handle various complex diseases? Treat each separately with a disease-specific risk model?
4. The system shows a very fatalistic view. Do you think we should include more positive news?
5. Who could be a potential user of D-GRIP?
   • Genetic counselors?
   • Family physicians?
   • Insurance companies?
   • Lay public?
   • Yourself?
6. How many years down the line can you see this being used (for each of the potential users from the previous question, respectively)?
7. Do you think we should store people's genotype data? What about family doctors storing their patients' genotype data?
8.
What are some of the implications you see from using such a system?
   • Personal implications?
   • Effect on patients?
   • Societal implications?
9. In what journal can you see this type of paper being published?

A.2 Feedback

A summary of the feedback provided by several experts is given below. The experts consisted of two biostatisticians, two molecular geneticists, five clinical geneticists and 12 MSc genetic counseling students. The comments and recommendations are categorized by aspect of D-GRIP: user interface issues regarding input and output features, the core of D-GRIP (the DNA-Disease database and the risk prediction model), issues pertaining to the users, and ethical, legal and social implications.

A.2.1 User Interface

Input and general usability

• Allow users to provide family history along with genotype data.
• Ethnicity classification is currently biased. Provide two options: first, user-specified ethnicity, and second, ethnicity calculated from verified and reliable predetermined markers in the genotype data provided. The consensus was to calculate the ethnicity, but only when the calculation can be done reliably.
• When more data are available, allow input of copy number variation data.
• Provide a disclaimer that explicitly informs the user of all the limitations and assumptions of the software.
• Keep the interface simple and easy to use, as it currently is.

Risk Profile Report

• Tailor the final risk report to the intended user. Currently, the view is geared more towards genetic researchers and counselors. In contrast, for a family physician or a consumer, provide a 'Patient view' in which probabilities and risk are communicated visually and links to prevention and therapeutic options, as well as any relevant links on lifestyle and behavior changes, are provided.
• Provide the option of restricting analysis to specific diseases, for instance, diseases for which prevention is an option versus those for which no preventative options are currently available.

A.2.2 D-GRIP Core

Diseases, DNA-Disease database

• Implement a meta-analysis engine for each disease so that whenever new studies are published, the entire database is updated. In addition, create a notification system that informs users whenever such updates are performed.
• Store gene-gene, gene-environment and epigenetic information in the DNA-Disease database. Data on gender and age related to diseases are very important, especially for age-dependent diseases.
• When information on copy number variations related to diseases becomes available, store it in the DNA-Disease database.
• Store intermediate phenotypes associated with markers in addition to disease-associated markers.

Risk prediction issues

• Implement disease-specific risk models so that each disease is treated separately. Also, allow advanced users to choose among multiple risk prediction models for each disease.
• When data are available, incorporate gene-gene and gene-environment effects into the respective disease risk models.
• Perform rigorous validation of each predictive model and prediction. Show the results of the tests performed, such as sensitivity, specificity and positive predictive values. Ensure that the prediction models are validated with genotype data that are not part of the case-control population data in the DNA-Disease database. Currently, such a volume of test data is not available, so this feature is left for future versions. Also, provide links to the studies supporting the risk prediction models for the respective diseases.

A.2.3 Potential Users

• Genetic counselors are a good initial user for the software.
During the initial deployment of D-GRIP, user training will be required so that all limitations are understood and results are interpreted properly.
• Family physicians (or others in a primary care setting) are other potential users, but training for family physicians on how to use and interpret results from such a tool will be a necessity.
• Potentially, the general public could act as consumers of such software. However, all implications will need to be addressed by health professionals, governments and industry before such software is released to the general public.
• Insurance companies could also be potential users, but the many social, legal and ethical implications will need to be addressed and a supporting framework will need to be implemented to handle third-party use of genetic data.
• As mentioned, the user interface of the software should be tailored to the user.
• The consensus was that D-GRIP is currently ahead of its time, but that similar software could be in use in the next 5-10 years. However, better understanding of disease-associated variants and reliable predictions will be a necessity.
• Until proper standards and procedures are developed to handle all the ethical, legal and social implications, such software should always be used in a guided setting where the limitations are explained to the counseled individual and guidance is provided in understanding the results.

A.2.4 Implications

• As is currently the case, there should be no user-identifiable storage of genotype data. User genotype data can be stored only when the family physician is the user and is storing the patient's genotype data. However, in the future, a proper framework will be required to handle genetic data management and to support privacy, confidentiality and anonymity.
• The level of care required in helping the general public interpret and understand the results is enormous, and this should be done appropriately.
• At the current rate, there will not be enough genetic counselors to support the future demand for counseling of individuals wanting a genetic risk profile.
• All necessary ethical, social and legal implications will need to be addressed by the providers of such a tool.

Appendix B
D-GRIP User Manual

B.1 Introduction to D-GRIP

This user guide assumes you have access to D-GRIP, since D-GRIP is a closed and secure web tool. The guide explains the various features of D-GRIP and provides a brief walk-through. It is not intended to explain the results of D-GRIP or how to interpret them. The guide explains:

• The overall processes.
• Basic features that are available.

B.1.1 D-GRIP System

DNA Genetic Risk Information Profile (D-GRIP) is a genotype analysis system that predicts an individual's genetic risk profile based on the genotype. The system can take as input observed genotypes at up to one million positions of known single nucleotide polymorphisms (SNPs) in human populations.

The flow of information in D-GRIP begins with the input of user data. The user is asked to fill in demographic information (ethnic background, age and gender) and to provide a genotype file, which is parsed and temporarily stored. Next, the system compares the genotyping results to an internal DNA-Disease risk database and, for each disease, calculates a risk score for developing the disease. Finally, a tabular output of potential diseases with the relevant disease risk for the individual is displayed.

[Figure B.1 screenshot: login form with Username and Password fields and a Submit button.]

Figure B.1: The entry into D-GRIP occurs with user authentication. A valid username and password are required to access D-GRIP.
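The flow described in B.1.1 (parse the genotype file, compare it with the DNA-Disease database, report matches per disease) can be sketched roughly as follows. This is a minimal Python illustration and not D-GRIP's actual code: the three-column tab-delimited input, the DISEASE_SNPS dictionary and both function names are assumptions made for the example. The two SNPs, their genotypes and their odds ratios are the ones displayed in the sample output (Figure B.11).

```python
from collections import defaultdict

# Hypothetical stand-in for the DNA-Disease database:
# (SNP id, risk genotype) -> (disease, odds ratio).
DISEASE_SNPS = {
    ("rs7480010", "A/G"): ("Diabetes Mellitus type 2", 1.14),
    ("rs13266634", "T/C"): ("Diabetes Mellitus type 2", 1.18),
}


def parse_genotypes(text):
    """Parse lines of 'snp<TAB>allele1<TAB>allele2' into {snp: (a1, a2)}."""
    genotypes = {}
    for line in text.splitlines():
        fields = line.strip().split("\t")
        if len(fields) != 3 or not fields[0].startswith("rs"):
            continue  # skip headers, blank lines and malformed rows
        snp, a1, a2 = fields
        genotypes[snp] = (a1, a2)
    return genotypes


def match_disease_snps(genotypes):
    """Group the user's SNPs that carry a listed risk genotype by disease."""
    report = defaultdict(list)
    for (snp, risk_genotype), (disease, odds_ratio) in DISEASE_SNPS.items():
        if snp not in genotypes:
            continue
        # Compare alleles without regard to order (A/G equals G/A).
        if sorted(genotypes[snp]) == sorted(risk_genotype.split("/")):
            report[disease].append((snp, risk_genotype, odds_ratio))
    return dict(report)


sample = "SNP Name\tAllele1\tAllele2\nrs7480010\tG\tA\nrs13266634\tC\tC\n"
print(match_disease_snps(parse_genotypes(sample)))
```

In this sample only rs7480010 is reported, because the user's C/C genotype at rs13266634 does not match the listed T/C risk genotype. A dictionary lookup keeps the comparison linear in the number of database entries, which matters when the input holds up to a million genotypes.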
B.2 D-GRIP Features

D-GRIP has various features; a detailed description of each, with illustrations, is provided below. The page is laid out with a menu on the left and all the relevant content on the right. The menu contains navigation links to the Home page (Figure B.2), the Disclaimer page, the Use D-GRIP page and the Help page, and a link to log out of D-GRIP.

[Figure B.2 screenshot. Menu: Home, Disclaimer, Use D-GRIP, Help, Log out. Page title: DNA Genetic Risk Information Profile.

Welcome 'Test'. This web site provides a tool for predicting a genetic risk profile for a person by utilizing genotype information.

Getting Started: Click on the 'Use D-GRIP' link. Fill in demographic information and click 'next'. Upload a genotype file or copy/paste data into the form. Click on Calculate Risk. Please log out when leaving D-GRIP. Note: Tips are provided anywhere the tip icon appears. Bring cursor over to see tips.

Disclaimer
1. It is assumed the system is used in a guided setting.
2. All information provided by you ('the user') is assumed to be accurate. For instance, ethnic background provided by the user is assumed to be accurate to the best of the user's knowledge.
3. D-GRIP predicts risk of developing disease based on population information collected from literature.
4. The overall probability of developing a disease is calculated assuming all susceptible alleles/genes are acting independently within diseases and across diseases.
5. The system does not store any user-provided data (e.g. genotype and demographic data).]

Figure B.2: A snapshot of D-GRIP's main page. The page gives instructions on how to use D-GRIP and outlines a disclaimer for the user to read.

B.2.1 Disclaimer

The disclaimer explicitly outlines the assumptions made by D-GRIP (Figure B.3). The disclaimer is shown on the first page when the user accesses the site. A separate link is also provided to view the disclaimer.
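Disclaimer item 4 states that the overall probability is calculated assuming that all susceptibility alleles act independently. One common way to combine independent markers, and the one assumed in this sketch, is to convert the background probability to odds, multiply by each SNP's likelihood ratio, and convert back to a probability. This appendix does not spell out D-GRIP's exact formula, so the function below is illustrative only and its name is hypothetical. With the 5% background probability from Figure B.10 and the likelihood ratio of 1.27 from Figure B.9, it gives about 6.27%, which agrees with the per-SNP probability displayed in Figure B.9.

```python
import math


def posterior_probability(background, likelihood_ratios):
    """Update a background probability with independent likelihood ratios."""
    if not 0.0 < background < 1.0:
        raise ValueError("background must be strictly between 0 and 1")
    prior_odds = background / (1.0 - background)
    # Independence assumption: the likelihood ratios simply multiply.
    posterior_odds = prior_odds * math.prod(likelihood_ratios)
    return posterior_odds / (1.0 + posterior_odds)


# Background probability 5%, one SNP with likelihood ratio 1.27.
single = posterior_probability(0.05, [1.27])
print(f"{single:.2%}")  # 6.27%
```

Working in odds rather than probabilities keeps the result below 100% however many risk variants are combined, which a naive product of probabilities and ratios would not guarantee.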
[Figure B.3 screenshot: the Disclaimer page, listing the same five assumptions that appear on the main page (Figure B.2).]

Figure B.3: The assumptions made by D-GRIP are listed as a disclaimer and shown here.

B.2.2 Input

The input page can be accessed by clicking on the 'Use D-GRIP' link in the menu on the left. Input to D-GRIP occurs in two steps. First, demographic information and configuration options are presented. Next, genotype data is requested from the user.

Demographic Information

Figure B.4 shows the first stage of the input. The mandatory information requested from the user is Gender, Age and Ethnic background. For the Age, the user enters the year of birth. For the Ethnic background, the user should select the most appropriate option based on the user's geographic ancestry. The options presented are: Africa, Asia, Europe, Pacific, First Nations/Aboriginals and Mixed.

The configuration options currently consist of one checkbox for 'inference of genotypes'. The inference of genotypes uses haplotype information from the HapMap Project website to infer disease-associated genotypes from the genotype data provided by the user. By default, the inference option is turned off (no tick in the checkbox). After filling in the demographic information form, the user proceeds to loading genotype data by clicking the 'Next' button.
[Figure B.4 screenshot: 'Input user details' form. Demographic Information: Gender* (Male/Female), Year of Birth* (YYYY), Ethnic Background* (drop-down, Europe selected). Configuration Options: Inference of Genotypes (checkbox, click to turn on). Next button. Mandatory fields are marked *.]

Figure B.4: Demographic information and configuration options submitted to D-GRIP are shown here.

Genotype Data

Figure B.5 shows how the genotype data can be loaded into D-GRIP. There are two ways to load the genotype data.

The copy/paste option allows the user to copy the genotype data and paste it into the text area provided. The mandatory fields for the copy/paste form are file format, file name and genotype data. After filling in the form, click on the 'Calculate Risk' button to generate the risk profile output.

For uploading a genotype file, the mandatory fields are the file format and the location where the file is stored. The user may use the 'Browse' button to find the genotype file on the hard drive. Note that the maximum allowed size for an uploaded genotype file is 10Mb. A file of this size can contain genotypes for more than 1 million SNPs. After filling in the form, click on the 'Upload File and Calculate Risk' button to generate the risk profile output.

Currently, D-GRIP accepts two file formats: Illumina Final format and Affymetrix Text Output. Examples of the respective genotype file formats are shown in Figure B.6.

The Illumina Final format can be obtained by generating a tab-delimited 'Final Report' with the Illumina platform's BeadStudio Genotyping Module software. The only fields necessary are SNP Name, Allele 1 and Allele 2. The Sample ID and GC Score are not necessary for D-GRIP.

The Affymetrix text output can be obtained by using the SNP Export feature in the Affymetrix GeneChip Genotyping Analysis Software and generating a tab-delimited output file.
Again, the only fields necessary are the SNP identifier and the SNP genotype (two alleles).

In Figure B.5, next to the copy/paste form is a box with 'Pre-loaded' data. To illustrate D-GRIP, sample genotype files have been created and can be loaded using this 'Pre-loaded' data box. Simply select the particular sample and click on 'Get Sample'. A 'Comments' box appears describing the sample file, and the sample file appears in the copy/paste text area. Genotype Sample 1 is shown in Figure B.7.

[Figure B.5 screenshot: 'Copy/Paste or Upload genotype information' page. Copy/Paste data form: File format* (Illumina Final Format selected), File name*, Input genotype data*, Calculate Risk button. Pre-loaded data box: 'Select test genotype data to load' (Sample 1 selected), Get Sample button. Upload data form: File format* (Illumina Final Format selected), Filename* with Browse button, Upload and Calculate Risk button. Mandatory fields are marked *.]

Figure B.5: Form for submitting the genotype data is shown here. The user can either copy/paste the genotype data or upload a genotype file. A set of sample genotypes is provided and can be loaded into the copy/paste form by clicking on 'Get Sample'.

(a) Illumina final format sample file

[Header]
BSGT Version      2.1.10 30089
Processing Date   5/2/2006 12:54 PM
Content           CS0006968-OPA
Num SNPs          26
Total SNPs        26
Num Samples       1
Total Samples     1
[Data]
SNP Name     Sample ID                                                        Allele1 - Top   Allele2 - Top   GC Score
rs2018621    Europe - HD01-01 - Northern European HD01 - GM17001 - NA17001   A               G               0.63
rs4845378    Europe - HD01-01 - Northern European HD01 - GM17001 - NA17001   G               G               0.54
rs1131706    Europe - HD01-01 - Northern European HD01 - GM17001 - NA17001   T               T               0.6
rs2847173    Europe - HD01-01 - Northern European HD01 - GM17001 - NA17001   G               G               0.54
rs12448760   Europe - HD01-01 - Northern European HD01 - GM17001 - NA17001   A               G               0.65
rs10915884   Europe - HD01-01 - Northern European HD01 - GM17001 - NA17001   G               G               0.89
rs1676885    Europe - HD01-01 - Northern European HD01 - GM17001 - NA17001   A               A               0.59

(b) Affymetrix text output sample file

SNP          SAMPLE                                                           GENOTYPE   SCORE
rs2018621    Europe - HD01-01 - Northern European HD01 - GM17001 - NA17001   AG         0.6345
rs4845378    Europe - HD01-01 - Northern European HD01 - GM17001 - NA17001   GG         0.5403
rs1131706    Europe - HD01-01 - Northern European HD01 - GM17001 - NA17001   TT         0.6032
rs2847173    Europe - HD01-01 - Northern European HD01 - GM17001 - NA17001   GG         0.5403
rs12448760   Europe - HD01-01 - Northern European HD01 - GM17001 - NA17001   AG         0.6478
rs10915884   Europe - HD01-01 - Northern European HD01 - GM17001 - NA17001   GG         0.8906
rs1676885    Europe - HD01-01 - Northern European HD01 - GM17001 - NA17001   AA         0.5901

Figure B.6: The Illumina and Affymetrix tab-delimited file formats for D-GRIP. The respective column names are shown at the top.

[Figure B.7 screenshot: the copy/paste form with File format set to Illumina Final Format and the Sample 1 genotype data (including rs7903146, rs1111875, rs7923837 and rs3740878) loaded into the text area. The 'Comments' box describes Sample 1 as a Caucasian population sample with selected SNPs from all diseases in the database (diabetes, Alzheimer's and Parkinson's disease); all genotypes are heterozygous except those for Parkinson disease, which are homozygous. The remainder of the comment is not legible in the extracted text.]

Figure B.7: Genotype sample 1 is loaded into the copy/paste form by clicking on 'Get Sample'. A description of the sample genotype file is given in the 'Comments' box.

B.2.3 Output

An example output of D-GRIP is shown in Figure B.8. The output of D-GRIP is a table that shows the user's SNPs that matched disease-associated SNPs. The table lists the disorder, the gene, the SNP and genotype associated with the disorder, the population in which the SNP occurs, the calculated odds ratio and a link to PubMed for literature articles supporting the association.

[Figure B.8 screenshot: risk profile table with one row per matched SNP (columns: disorder, gene, SNP, genotype, population, odds ratio, PubMed ID), grouped under Alzheimer disease, Diabetes Mellitus type 2 and Parkinson disease. Each group ends with a 'background population probability' row and an 'overall calculated probability' row. Most cell values are not legible in the extracted text.]

Figure B.8: D-GRIP risk profile sample output. The output illustrates 3 diseases: Alzheimer's, Diabetes type 2 and Parkinson's disease. The respective associated SNPs for each disease are shown. The background and overall calculated probabilities of developing the disease are also shown.

The user can click on the gene name and the disorder name for external links to GenBank and OMIM respectively. In addition, by clicking on each SNP row, more details about the SNP can be seen (Figure B.9).

[Figure B.9 screenshot: detail view for Diabetes Mellitus type 2, TCF7L2, rs7903146, C/T, Caucasian, odds ratio 1.65, PubMed 17293876.

Genotypes
Risk genotype:    C/T
Major genotype:   C/C

Genotype Frequencies
            C/T     C/C
Case        0.486   0.351
Control     0.419   0.497

Statistics
Odds Ratio (95% CI):           1.65 (1.47, 1.85)
log Odds Ratio:                0.5 ± 0.06
log Odds Ratio 95% CI:         (0.38, 0.62)
Likelihood Ratio:              1.27 ± 0.0017
Likelihood ratio 95% CI:       (1.17, 1.38)
Probability of disease based
on this SNP:                   6.27%]

Figure B.9: Details about one SNP from Diabetes type II disease.

More details about the probability calculation for each disease can be seen by clicking on the probability row (Figure B.10). If SNPs are found that are in high linkage disequilibrium (r² > 0.8), an integrated analysis is performed in which only one SNP from the set of high-LD SNPs is chosen for inclusion in the overall calculated probability.
This is illustrated on the right side of Figure B.10.

[Figure B.10 screenshot. Diabetes Mellitus type 2: background population probability and overall calculated probability rows (7%).
User details: age 47; gender Male; ethnicity Europe.
Background probability details (age of onset in years, background probability): 45, 5%; 60, 15%.
Integrated analysis: SNP used in probability calculation rs11037909; SNPs in high linkage disequilibrium rs11037909, rs1113132, rs3740878.]

Figure B.10: Probability details for Diabetes type 2.

If the inference of genotypes configuration option was selected, the output also displays SNPs from the inference analysis. An example of inferred SNPs and their corresponding details is illustrated in Figures B.11 and B.12.

[Figure B.11 screenshot. Directly observed rows for Diabetes Mellitus type 2 (legible genes include TCF7L2, HHEX and EXT2; legible SNP ids include rs7903146, rs1111875 and rs3740878), followed by an Inference Analysis block:
Diabetes Mellitus type 2, LOC387761, rs7480010, A/G, Caucasian, 1.14, 17293876.
Diabetes Mellitus type 2, SLC30A8, rs13266634, T/C, Caucasian, 1.18, 17293876.
Final rows: background population probability and overall calculated probability (7%).]

Figure B.11: SNPs from the inference analysis for Diabetes type 2.

[Figure B.12 screenshot. Inference Analysis: Diabetes Mellitus type 2, LOC387761, rs7480010, A/G, Caucasian, 1.14, 17293876.
Inferred SNP details. User's genotype: SNP id rs4445619, genotype T/C.
HapMap phase data (rs7480010, rs4445619): allele 1 A, T; allele 2 G, C.
HapMap SNP information: SNP rs4445619; alleles T, C; genotype T/C; genotype frequency 0.309; chromosome 11; HapMap population CSHL-HAPMAP:HapMap-CEU.
Disease-associated SNP details. Genotypes: risk genotype A/G; major genotype A/A. Genotype frequencies: case A/G 0.430, A/A 0.449; control A/G 0.413, A/A 0.492.
Statistics: odds ratio (95% CI) 1.14 (1.02, 1.28); log odds ratio 0.13 ± 0.06; log odds ratio 95% CI (0.02, 0.25); likelihood ratio 1.07 ± 0.0017; likelihood ratio 95% CI (0.99, 1.16).]

Figure B.12: Details of an inferred SNP. The details include the user's genotype, the HapMap data from which the inference was performed and the relevant statistics for the disease-associated SNP.

B.2.4 Help Tips

Help tips appear as pop-ups at the top right of the page. Wherever a blue question mark icon is displayed, the user can hover the mouse over it to see the relevant tip. The tips are intended to guide the user through D-GRIP. Examples are shown below.

[Figure B.13 screenshot. Field: Ethnic Background*, value Europe. Tip text: "Majority of data in database is based on Caucasian population. Thus, default is European ancestry."]

Figure B.13: An example of an ethnic background help tip.

[Figure B.14 screenshot. Field: Inference of Genotypes (click to turn on). Tip text: "When 'Inference of genotypes' option is turned on, any user genotypes that are in high linkage disequilibrium (r² > 0.8) with disease associated SNPs are also reported in the generated risk profile. The reported inferred SNPs are not used in the overall probability calculation. Click checkbox to turn on Inference Analysis option."]

Figure B.14: An example of an inference of genotypes help tip.
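As a reading aid for the statistics panels in Figures B.9 and B.12, the following minimal Python sketch recomputes two of the displayed quantities from the values shown in the screenshots. It rests on assumptions rather than on the D-GRIP source code: the odds ratio is taken as the risk-to-major genotype ratio in cases divided by the same ratio in controls, and the per-SNP probability is taken as a background probability converted to odds, multiplied by the reported likelihood ratio, and converted back. The function names `odds_ratio` and `posterior_probability` are hypothetical, and the likelihood ratio is used as reported, not recomputed.

```python
import math


def odds_ratio(case_risk, case_major, control_risk, control_major):
    """Odds of the risk genotype versus the major genotype, cases over controls.

    Assumed definition for illustration; inputs are genotype frequencies.
    """
    return (case_risk / case_major) / (control_risk / control_major)


def posterior_probability(background, likelihood_ratio):
    """Scale the background odds by a likelihood ratio and return a probability.

    Assumed update rule for illustration; `background` is a probability in (0, 1).
    """
    prior_odds = background / (1.0 - background)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


# Genotype frequencies as displayed in Figure B.9 (rs7903146, C/T versus C/C).
or_b9 = odds_ratio(0.486, 0.351, 0.419, 0.497)

# Genotype frequencies as displayed in Figure B.12 (rs7480010, A/G versus A/A).
or_b12 = odds_ratio(0.430, 0.449, 0.413, 0.492)

# Background probability of 5% (age of onset 45 in Figure B.10) combined with
# the likelihood ratio of 1.27 reported in Figure B.9.
prob_b9 = posterior_probability(0.05, 1.27)

print(f"Figure B.9  odds ratio: {or_b9:.3f}, log odds ratio: {math.log(or_b9):.2f}")
print(f"Figure B.12 odds ratio: {or_b12:.3f}, log odds ratio: {math.log(or_b12):.2f}")
print(f"Probability from 5% background and LR 1.27: {prob_b9 * 100:.2f}%")
```

Under these assumptions the recomputed odds ratios come out near 1.64 and 1.14, close to the displayed 1.65 and 1.14 (the displayed frequencies are rounded to three decimals), and the update from a 5% background reproduces the 6.27% shown for the single SNP in Figure B.9. This agreement is consistent with the screenshots but does not by itself establish how D-GRIP derives its likelihood ratios or combines several SNPs.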
