UBC Theses and Dissertations

K-mer-based data structures and pipelines for sequence mapping and analysis

Goktas, Talha Murathan

Abstract

The exponential growth of genomic data demands continued research on scalable bioinformatics algorithms. One paradigm for improving computational efficiency in bioinformatics is the use of k-mers. Here we present three works based on the k-mer paradigm that improve on existing methods and open new possibilities for major application domains in bioinformatics. LINKS 2.0 is an alignment-free scaffolding tool that achieves a 3-fold run-time and 5-fold memory improvement over the previous version (LINKS v1.8.7). In addition to enabling LINKS to process more data with lower computational requirements, this major update also produces higher-quality scaffolds. The main memory optimization in LINKS 2.0 was obtained by storing k-mers as 64-bit hash values instead of as ASCII character strings. The multi-index Bloom filter (miBF) is a novel associative probabilistic data structure designed for efficiently storing k-mers and spaced seeds. MiBF-mapper explored the utility of miBF in the long-read mapping domain and demonstrated competitive accuracy. This mapping approach is intended to serve as a reference for future work, especially for miBF-based methods. The work on miBF-based global ancestry inference (GAI) demonstrated the scalability of miBF by processing high-coverage data from 208 individuals, and it promises to improve on the accuracy of the state of the art by capturing short insertion and deletion (indel) markers as well as SNPs. We demonstrated high accuracy in continent-level inference and present a promising foundation for developing more accurate, loci-aware ancestry inference.
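
The memory optimization described for LINKS 2.0, storing each k-mer as a 64-bit hash value rather than as ASCII text, can be illustrated with a minimal sketch. The function names `canonical`, `kmer_hash64`, and `kmer_hashes` are hypothetical, and the hash function (BLAKE2b truncated to 8 bytes) is a stand-in chosen for illustration; the abstract does not state which hash function LINKS 2.0 uses, nor whether it canonicalizes k-mers in this way.

```python
import hashlib

# Translation table for complementing DNA bases (assumes uppercase A/C/G/T only).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_hash64(kmer: str) -> int:
    """Hash a canonical k-mer to an unsigned 64-bit integer.

    BLAKE2b with an 8-byte digest is an illustrative choice, not
    necessarily the hash used by LINKS 2.0.
    """
    data = canonical(kmer).encode("ascii")
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "little")


def kmer_hashes(seq: str, k: int) -> list[int]:
    """Hash every k-mer of length k in seq, in order of position."""
    return [kmer_hash64(seq[i:i + k]) for i in range(len(seq) - k + 1)]


if __name__ == "__main__":
    hashes = kmer_hashes("ACGTACGTAC", 4)
    print(len(hashes), "k-mers hashed to 64-bit values")
```

Under this scheme each k-mer is represented by 8 bytes regardless of k, whereas an ASCII representation needs at least k bytes, so the saving grows with k. The trade-off is that distinct k-mers can collide on the same hash value, and the original sequence cannot be recovered from the hash alone.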

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International