Genome misassembly detection using Stash : a data structure based on stochastic tile hashing

UBC Theses and Dissertations

Genome misassembly detection using Stash : a data structure based on stochastic tile hashing Sarvar, Armaghan

Abstract

Analyzing large amounts of data produced by high-throughput sequencing technologies presents challenges in terms of memory and computational requirements. Therefore, it is crucial to develop data structures and computational methods that handle this information effectively. These challenges impact bioinformatics studies, including de novo genome assembly, which serves as the foundation of genomics. Issues like errors in reads or limitations due to heuristic decisions in assembly algorithms can lead to genome misassemblies and inaccurate genomic representations, compromising the quality of downstream analyses. Hence, de novo assemblies can benefit from misassembly detection and correction, to maximize the information provided by reads and produce an optimal assembly. Here, we present Stash, a novel hash-based data structure designed for storing and querying large repositories of sequencing data through a k-mer representation of the sequences. Stash uses a two-dimensional data structure based on hash values generated by sliding windows of spaced seed patterns over sequences to compress data. The key-value pairs stored in Stash are k-mers and sequence ID hashes, respectively. The stored hashed identifiers are then used to check if two queried k-mers are observed in the same set of sequences. This functionality provides utility for Stash across diverse domains of bioinformatics. For example, Stash can indicate whether two genomic regions are covered by the same set of reads by measuring the number of matches between them. This can be used to detect misassemblies within a genome assembly of interest. We demonstrate the effectiveness of Stash in detecting misassemblies in human genome assemblies generated by the Flye and Shasta algorithms, using PacBio HiFi reads from the human cell line NA24385. We observe that scaffolding Stash-cut assemblies reduces the number of misassemblies in the Flye and Shasta assemblies by 7.6% and 3.4%, respectively. Stash accomplishes this using 8 GB of memory and a total processing time of 117 plus 18 minutes. It can outperform alternative methods for detecting misassemblies in long-read data while preserving contiguity.
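
The storage and query idea described in the abstract can be illustrated with a small toy model. The sketch below is not the thesis implementation: the class name ToyStash, the helper functions, the seed patterns, the table dimensions and the use of BLAKE2 are all assumptions made for illustration, and the stochastic aspect of the tile hashing is not modelled. It only shows the general pattern of hashing spaced-seed windows to pick a table position, storing a small hash of the sequence ID there, and counting how many stored values agree between two queried windows.

```python
import hashlib


def _hash(text: str, salt: str) -> int:
    """Deterministic 64-bit hash of text under a salt (illustrative choice)."""
    digest = hashlib.blake2b(f"{salt}:{text}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def masked(window: str, seed: str) -> str:
    """Keep only the bases at the '1' positions of a spaced seed pattern."""
    return "".join(base for base, flag in zip(window, seed) if flag == "1")


class ToyStash:
    """Hypothetical, simplified stand-in for the structure in the abstract."""

    def __init__(self, seeds, rows=4096, tile_values=255):
        if len({len(seed) for seed in seeds}) != 1:
            raise ValueError("all spaced seeds must have the same length")
        self.seeds = list(seeds)
        self.k = len(self.seeds[0])
        self.rows = rows
        self.tile_values = tile_values
        # Two-dimensional table: one column per seed pattern, 0 means empty.
        self.table = [[0] * len(self.seeds) for _ in range(rows)]

    def _row(self, window: str, col: int) -> int:
        # The row is chosen by hashing the window as seen through one seed.
        return _hash(masked(window, self.seeds[col]), f"seed{col}") % self.rows

    def _tile(self, seq_id: str, col: int) -> int:
        # Small non-zero hash of the sequence ID, different for each column.
        return 1 + _hash(seq_id, f"id{col}") % self.tile_values

    def insert(self, seq_id: str, sequence: str) -> None:
        # Slide a window over the sequence and write one tile per seed.
        # Later insertions overwrite earlier tiles, so the table is lossy.
        for start in range(len(sequence) - self.k + 1):
            window = sequence[start:start + self.k]
            for col in range(len(self.seeds)):
                self.table[self._row(window, col)][col] = self._tile(seq_id, col)

    def tiles(self, window: str):
        return [self.table[self._row(window, col)][col]
                for col in range(len(self.seeds))]

    def matches(self, window_a: str, window_b: str) -> int:
        # Count columns where both windows hold the same non-empty tile.
        return sum(1 for a, b in zip(self.tiles(window_a), self.tiles(window_b))
                   if a == b and a != 0)
```

In this toy model, two windows taken from the same inserted read agree in every column, while windows that were never inserted mostly hit empty tiles. A misassembly check in this spirit would compare windows on either side of a candidate position in an assembly and treat a low match count as a hint that no read spans it; the threshold and window spacing would have to be chosen empirically and are not specified here.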

Item Metadata

Title	Genome misassembly detection using Stash : a data structure based on stochastic tile hashing
Creator	Sarvar, Armaghan
Supervisor	Birol, Inanc
Publisher	University of British Columbia
Date Issued	2023
Genre	Thesis/Dissertation
Type	Text
Language	eng
Date Available	2023-11-22
Provider	Vancouver : University of British Columbia Library
Rights	Attribution-NoDerivatives 4.0 International
DOI	10.14288/1.0437874
URI	http://hdl.handle.net/2429/86661
Degree	Master of Science - MSc
Program	Bioinformatics
Affiliation	Science, Faculty of
Degree Grantor	University of British Columbia
Graduation Date	2024-05
Campus	UBCV
Scholarly Level	Graduate
Rights URI	http://creativecommons.org/licenses/by-nd/4.0/
Aggregated Source Repository	DSpace
