Open Collections
UBC Theses and Dissertations
Genome misassembly detection using Stash : a data structure based on stochastic tile hashing
Sarvar, Armaghan
Abstract
Analyzing the large amounts of data produced by high-throughput sequencing technologies presents challenges in terms of memory and computational requirements. It is therefore crucial to develop data structures and computational methods that handle this information effectively. These challenges affect bioinformatics studies, including de novo genome assembly, which serves as the foundation of genomics. Errors in reads and limitations arising from heuristic decisions in assembly algorithms can lead to genome misassemblies and inaccurate genomic representations, compromising the quality of downstream analyses. De novo assemblies can therefore benefit from misassembly detection and correction, which maximize the information provided by the reads and produce an optimal assembly.
Here, we present Stash, a novel hash-based data structure designed for storing and querying large repositories of sequencing data through a k-mer representation of the sequence dataset. To compress data, Stash uses a two-dimensional data structure based on hash values generated by sliding windows of spaced seed patterns over the sequences. The keys stored in Stash are k-mers and the values are sequence ID hashes. The stored hashed identifiers are then used to check whether two queried k-mers are observed in the same set of sequences. This functionality makes Stash useful across diverse domains of bioinformatics. For example, Stash can indicate whether two genomic regions are covered by the same set of reads by measuring the number of matches between them, which can be used to detect misassemblies within a genome assembly of interest. We demonstrate the effectiveness of Stash in detecting misassemblies in human genome assemblies generated by the Flye and Shasta algorithms, using PacBio HiFi reads from the human cell line NA24385. We observe that scaffolding the Stash-cut assemblies reduces misassemblies by 7.6% and 3.4% in the Flye and Shasta assemblies, respectively. Stash accomplishes this using 8 GB of memory and a total processing time of 117 plus 18 minutes. It can outperform alternative methods for detecting misassemblies in long-read data while preserving contiguity.
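The query described in the abstract (do two k-mers appear in the same set of sequences?) can be illustrated with a deliberately simplified toy table. This is a sketch under stated assumptions, not the thesis implementation: the class name `ToyStash`, the three spaced seed patterns, the table size, the one-byte ID hash, and the overwrite-on-collision policy are all hypothetical choices made for illustration. Rows are selected by hashing a window masked with a spaced seed, each seed has its own column, and each cell holds a small hash of the ID of a sequence that contained the window.

```python
import hashlib

# Hypothetical spaced seeds ('1' = position used, '0' = position ignored).
SEEDS = ["1101011", "1011101", "1110111"]
K = len(SEEDS[0])   # window length equals the seed length
ROWS = 1 << 12      # hypothetical number of rows in the table
EMPTY = 0           # marker for a cell that was never written


def small_hash(text: str, modulus: int) -> int:
    """Deterministic hash of a string, reduced modulo `modulus`."""
    digest = hashlib.blake2b(text.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % modulus


def masked(window: str, seed: str) -> str:
    """Keep only the bases at positions where the seed has a '1'."""
    return "".join(base for base, bit in zip(window, seed) if bit == "1")


class ToyStash:
    """Toy two-dimensional table: one row per hashed window, one column per seed."""

    def __init__(self) -> None:
        self.table = [[EMPTY] * len(SEEDS) for _ in range(ROWS)]

    def _row(self, window: str, j: int) -> int:
        return small_hash(f"{j}:{masked(window, SEEDS[j])}", ROWS)

    def insert(self, seq_id: str, seq: str) -> None:
        """Slide a window over `seq` and store a hash of `seq_id` per seed."""
        for i in range(len(seq) - K + 1):
            window = seq[i:i + K]
            for j in range(len(SEEDS)):
                # Values 1..255, so a written cell never equals EMPTY.
                # A later sequence overwrites an earlier one (toy policy).
                self.table[self._row(window, j)][j] = 1 + small_hash(f"{j}:{seq_id}", 255)

    def lookup(self, window: str) -> list[int]:
        return [self.table[self._row(window, j)][j] for j in range(len(SEEDS))]

    def matches(self, w1: str, w2: str) -> int:
        """Count columns where both windows carry the same non-empty ID hash."""
        return sum(a == b != EMPTY for a, b in zip(self.lookup(w1), self.lookup(w2)))


if __name__ == "__main__":
    stash = ToyStash()
    stash.insert("read1", "ACGTACGTTGCATG")
    stash.insert("read2", "GGATCCATTAGCCA")
    # Windows from the same read tend to share ID hashes; windows from
    # different reads tend not to, barring hash collisions.
    print(stash.matches("ACGTACG", "TTGCATG"))
    print(stash.matches("ACGTACG", "GGATCCA"))
```

In this toy version a high match count between two windows suggests that the same sequences cover both, and a low count along an assembled contig would be the kind of signal the abstract associates with a possible misassembly. Because the stored values are short hashes, collisions can produce spurious matches, so the count is evidence rather than proof.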
Item Metadata
Title | Genome misassembly detection using Stash : a data structure based on stochastic tile hashing |
Creator | Sarvar, Armaghan |
Supervisor | |
Publisher | University of British Columbia |
Date Issued | 2023 |
Genre | |
Type | |
Language | eng |
Date Available | 2023-11-22 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NoDerivatives 4.0 International |
DOI | 10.14288/1.0437874 |
URI | |
Degree | |
Program | |
Affiliation | |
Degree Grantor | University of British Columbia |
Graduation Date | 2024-05 |
Campus | |
Scholarly Level | Graduate |
Rights URI | |
Aggregated Source Repository | DSpace |