UBC Theses and Dissertations
Harmonization of SNP identifiers
Tripathi, Chhavi
Although data generation has been, and will remain, crucial to scientific discovery, our ability to analyze data has not kept pace with it. It is therefore important to direct effort towards making sense of the data already produced. This thesis investigates the harmonization of single nucleotide polymorphism (SNP) identifiers, which means assigning a SNP the same identifier every time it occurs. Harmonized SNP identifiers would make genetic data from different datasets comparable, which in turn would allow existing datasets in public repositories to be repurposed. Genetic data helps associate genetic alterations with disease and health, and it is being generated at a rate that outpaces Moore’s law. To make generated data available to researchers worldwide, public repositories such as the UK Biobank, the European Genome-phenome Archive (EGA), and the database of Genotypes and Phenotypes (dbGaP) have been set up to host public data and disseminate it according to established protocols. The data in these repositories was collected at different time points, generated using different genotyping arrays, and submitted by researchers from all over the world, which leads to a large degree of heterogeneity. To make the most of the data, it must be harmonized. The greater the overlap between two datasets, the easier it is to harmonize them, so assessing the extent to which datasets can be harmonized requires computing the overlap between them. SNPs are the features of greatest interest in genetic datasets. Because a SNP may carry many kinds of identifiers, determining the number and identity of overlapping SNPs between datasets is challenging, and the complexity increases with the number of SNPs and datasets being compared. There is no tool available to perform on-the-fly harmonization of SNP identifiers.
The SNP Overlap Tool (SPOT) was designed to harmonize SNP identifiers using SNP chromosomal locations and subsequently calculate the overlap of SNPs between two datasets. It is a web-based tool written in Java.
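The idea of matching SNPs by chromosomal location rather than by name can be illustrated with a minimal Java sketch. It is not SPOT's source code: the class name `SnpOverlapSketch`, the methods `locationKey` and `overlap`, and all identifiers and positions are hypothetical. The sketch also assumes that both datasets report positions on the same reference genome assembly and that chromosome plus position is enough to identify a SNP; alleles and strand are ignored.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; this is not SPOT's source code.
class SnpOverlapSketch {

    // Builds a location key such as "1:1000". A leading "chr" prefix is dropped
    // and the chromosome name is upper-cased, so "chr1"/"1" and "x"/"X" match.
    static String locationKey(String chromosome, long position) {
        String chrom = chromosome.trim();
        if (chrom.regionMatches(true, 0, "chr", 0, 3)) {
            chrom = chrom.substring(3);
        }
        return chrom.toUpperCase(Locale.ROOT) + ":" + position;
    }

    // Each dataset maps a submitted identifier to its location key.
    // Returns the location keys that are present in both datasets.
    static Set<String> overlap(Map<String, String> first, Map<String, String> second) {
        Set<String> shared = new HashSet<>(first.values());
        shared.retainAll(new HashSet<>(second.values()));
        return shared;
    }

    public static void main(String[] args) {
        // Made-up identifiers and positions, for illustration only.
        Map<String, String> first = new HashMap<>();
        first.put("rs0000001", locationKey("chr1", 1000));
        first.put("rs0000002", locationKey("2", 2000));

        Map<String, String> second = new HashMap<>();
        second.put("ARRAY_PROBE_17", locationKey("1", 1000)); // same site, different name
        second.put("rs0000003", locationKey("X", 500));

        Set<String> shared = overlap(first, second);
        System.out.println(shared.size() + " shared SNP location(s): " + shared);
    }
}
```

Run as written, the example prints `1 shared SNP location(s): [1:1000]`: the two datasets name the site differently, yet the location key identifies it as the same SNP. Datasets mapped to different assemblies would first need their coordinates converted to a common one, which this sketch does not attempt.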
Attribution-NonCommercial-NoDerivatives 4.0 International