K-mer-based data structures and pipelines for sequence mapping and analysis

UBC Theses and Dissertations


K-mer-based data structures and pipelines for sequence mapping and analysis Goktas, Talha Murathan

Abstract

The exponential growth of genomic data demands continued research on scalable bioinformatics algorithms. One paradigm for improving computational efficiency in bioinformatics is the use of k-mers. Here we present three works based on the k-mer paradigm that improve on existing methods and open new possibilities for major application domains in bioinformatics. LINKS 2.0 is an alignment-free scaffolding tool that improves run time 3-fold and memory usage 5-fold over the previous version (LINKS v1.8.7). In addition to enabling LINKS to process more data with lower computational requirements, this major update produces higher-quality scaffolds. The major memory savings in LINKS 2.0 were obtained by storing k-mers as 64-bit hash values instead of as ASCII character strings. The multi-index Bloom filter (miBF) is a novel associative probabilistic data structure designed for efficiently storing k-mers and spaced seeds. MiBF-mapper explores the utility of miBF for long-read mapping and demonstrates competitive accuracy. This mapping approach can serve as a reference for future work, especially for miBF-based methods. The work on miBF-based global ancestry inference (GAI) demonstrates the scalability of miBF by processing high-coverage data from 208 individuals, and promises to improve on the accuracy of the state of the art by capturing short insertion and deletion (indel) markers as well as SNPs. We demonstrate high accuracy in continent-level inference and present a promising foundation for developing more accurate, loci-aware ancestry inference.
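The idea of storing k-mers as 64-bit hash values rather than ASCII strings can be illustrated with a minimal Python sketch. This is not the LINKS 2.0 implementation: the function names are hypothetical, BLAKE2b from the standard library is only a stand-in for whatever hash function the tool uses, and hashing the canonical (strand-independent) form of each k-mer is an assumption made for illustration.

```python
import hashlib
from array import array

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def hash64(kmer: str) -> int:
    """Map a k-mer to an unsigned 64-bit integer (BLAKE2b is a stand-in hash)."""
    digest = hashlib.blake2b(canonical(kmer).encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


def kmer_hashes(seq: str, k: int):
    """Yield one 64-bit hash per k-mer, skipping k-mers with non-ACGT characters."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            yield hash64(kmer)


def pack_hashes(seq: str, k: int) -> array:
    """Store the hashes in a compact array of unsigned 64-bit integers."""
    return array("Q", kmer_hashes(seq, k))
```

With this representation each stored k-mer occupies a fixed 8 bytes regardless of k, whereas an ASCII string needs k bytes plus per-object overhead. The trade-off is that the original sequence cannot be recovered from the hash, and distinct k-mers can in principle collide.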

Item Metadata

Title	K-mer-based data structures and pipelines for sequence mapping and analysis
Creator	Goktas, Talha Murathan
Supervisor	Birol, Inanc
Publisher	University of British Columbia
Date Issued	2023
Genre	Thesis/Dissertation
Type	Text
Language	eng
Date Available	2023-03-30
Provider	Vancouver : University of British Columbia Library
Rights	Attribution-NonCommercial-NoDerivatives 4.0 International
DOI	10.14288/1.0428836
URI	http://hdl.handle.net/2429/84117
Degree (Theses)	Master of Science - MSc
Program (Theses)	Bioinformatics
Affiliation	Science, Faculty of
Degree Grantor	University of British Columbia
Graduation Date	2023-05
Campus	UBCV
Scholarly Level	Graduate
Rights URI	http://creativecommons.org/licenses/by-nc-nd/4.0/
Aggregated Source Repository	DSpace
