Open Collections
UBC Theses and Dissertations
K-mer-based data structures and pipelines for sequence mapping and analysis
Goktas, Talha Murathan
Abstract
The exponential growth of genomic data demands continued research on scalable bioinformatics algorithms. One paradigm for improving computational efficiency in bioinformatics is the use of k-mers. Here we present three works based on the k-mer paradigm that improve on existing methods and open new possibilities for major application domains in bioinformatics. LINKS 2.0 is an alignment-free scaffolding tool that delivers a 3-fold run-time and 5-fold memory improvement over the previous release (LINKS v1.8.7). Besides enabling LINKS to process more data with lower computational requirements, this major update also outputs higher-quality scaffolds. The main memory optimization in LINKS 2.0 was obtained by storing k-mers as 64-bit hash values instead of ASCII character strings. The multi-index Bloom filter (miBF) is a novel associative probabilistic data structure designed for efficiently storing k-mers and spaced seeds. MiBF-mapper demonstrates the utility of miBF in the long-read mapping domain and achieves competitive accuracy. It can serve as a reference for future mapping work, especially for miBF-based methods. The work on miBF-based global ancestry inference (GAI) demonstrated the scalability of miBF by processing high-coverage data from 208 individuals, and it promises to improve on the accuracy of the state of the art by capturing short insertion and deletion (indel) markers as well as SNPs. We demonstrated high accuracy in continent-level inference and present a promising foundation for developing more accurate, loci-aware ancestry inference.
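The memory idea mentioned in the abstract, keeping a fixed-width 64-bit hash per k-mer rather than its ASCII string, can be illustrated with a minimal sketch. This is not the LINKS 2.0 implementation: the hash function (BLAKE2b truncated to 8 bytes) is a stand-in chosen only because it ships with Python, and the names `kmers`, `hash64`, `seq`, and `k` are hypothetical.

```python
import hashlib
from array import array


def kmers(seq, k):
    """Yield every substring of length k in seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash64(kmer):
    """Map a k-mer to an unsigned 64-bit integer.

    Assumption: BLAKE2b with an 8-byte digest is used as a stand-in hash;
    the abstract does not say which hash function LINKS 2.0 uses.
    """
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


seq = "ACGTACGTGGTCA"  # toy sequence, not real data
k = 5

as_text = set(kmers(seq, k))                           # k-mers as ASCII strings
as_hash = array("Q", sorted(hash64(km) for km in as_text))  # packed 64-bit values

# Payload per k-mer: k bytes as ASCII versus a fixed 8 bytes as a hash.
ascii_bytes = sum(len(km) for km in as_text)
hash_bytes = as_hash.itemsize * len(as_hash)
print(len(as_text), "distinct k-mers;", ascii_bytes, "ASCII bytes;", hash_bytes, "hash bytes")
```

With a small k such as 5 the hashed form is actually larger; the fixed 8-byte cost only pays off once k exceeds 8, and real string objects carry additional per-object overhead that this payload count ignores. The trade-off is that a hash cannot be decoded back into its sequence, and distinct k-mers can in principle collide.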
Item Metadata
Title |
K-mer-based data structures and pipelines for sequence mapping and analysis
|
Creator |
Goktas, Talha Murathan
|
Supervisor | |
Publisher |
University of British Columbia
|
Date Issued |
2023
|
Genre | |
Type | |
Language |
eng
|
Date Available |
2023-03-30
|
Provider |
Vancouver : University of British Columbia Library
|
Rights |
Attribution-NonCommercial-NoDerivatives 4.0 International
|
DOI |
10.14288/1.0428836
|
URI | |
Degree | |
Program | |
Affiliation | |
Degree Grantor |
University of British Columbia
|
Graduation Date |
2023-05
|
Campus | |
Scholarly Level |
Graduate
|
Rights URI | |
Aggregated Source Repository |
DSpace
|