UBC Theses and Dissertations
Genome misassembly detection using Stash: a data structure based on stochastic tile hashing
Sarvar, Armaghan
Analyzing large amounts of data produced by high-throughput sequencing technologies presents challenges in terms of memory and computational requirements. Therefore, it is crucial to develop data structures and computational methods that handle this information effectively. These challenges impact bioinformatics studies, including de novo genome assembly, which serves as the foundation of genomics. Issues like errors in reads or limitations due to heuristic decisions in assembly algorithms can lead to genome misassemblies and inaccurate genomic representations, compromising the quality of downstream analyses. Hence, de novo assemblies can benefit from misassembly detection and correction to maximize the information provided by reads and produce an optimal assembly. Here, we present Stash, a novel hash-based data structure designed for storing and querying large repositories of sequencing data based on a k-mer representation of a large sequence dataset. Stash uses a two-dimensional data structure based on hash values generated by sliding windows of spaced seed patterns over sequences to compress data. The keys stored in Stash are k-mers, and the values are sequence ID hashes. The stored hashed identifiers are then used to check whether two queried k-mers are observed in the same set of sequences. This functionality provides utility for Stash across diverse domains of bioinformatics. For example, Stash can inform whether two genomic regions are covered by the same set of reads by measuring the number of matches between them. This can be used to detect misassemblies within a genome assembly of interest. We demonstrate the effectiveness of Stash in detecting misassemblies in human genome assemblies generated by the Flye and Shasta algorithms, using PacBio HiFi reads from the human cell line NA24385. We observe that scaffolding Stash-cut assemblies reduces the number of misassemblies by 7.6% and 3.4% in the Flye and Shasta assemblies, respectively. Stash accomplishes this using 8 GB of memory and a total processing time of 117 plus 18 minutes. It can outperform alternative methods for detecting misassemblies in long-read data while preserving contiguity.
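The storage and query idea described in the abstract can be illustrated with a small toy model. The sketch below is not the Stash implementation: the class name `ToyStash`, the seed patterns, the table size, the one-byte ID hash, and the use of BLAKE2b are all assumptions chosen for illustration. It only shows the general pattern of hashing spaced-seed windows to table cells, storing a hashed sequence ID in each cell, and counting how many cells agree for two queried windows.

```python
import hashlib

# Hypothetical spaced-seed patterns ('1' = position used, '0' = ignored).
SEEDS = ["1101011", "1011101", "1110111"]
ROWS = 1 << 16  # hypothetical number of table rows


def _h(text: str, mod: int) -> int:
    """Deterministic hash of a string into the range [0, mod)."""
    digest = hashlib.blake2b(text.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % mod


def masked(window: str, seed: str) -> str:
    """Keep only the bases at the seed's '1' positions."""
    return "".join(base for base, bit in zip(window, seed) if bit == "1")


class ToyStash:
    """Toy 2D table: rows chosen by a window hash, one column per seed."""

    def __init__(self, rows: int = ROWS, seeds=SEEDS):
        self.rows = rows
        self.seeds = list(seeds)
        self.span = len(self.seeds[0])  # window length covered by a seed
        self.table = [[0] * len(self.seeds) for _ in range(rows)]  # 0 = empty

    def _slots(self, window: str):
        """Yield one (row, column) cell per spaced seed for this window."""
        for col, seed in enumerate(self.seeds):
            yield _h(f"{col}:{masked(window, seed)}", self.rows), col

    def insert(self, seq_id: str, sequence: str) -> None:
        """Slide a window over the sequence and store the hashed ID."""
        id_hash = 1 + _h(seq_id, 255)  # 1..255, so 0 stays "empty"
        for i in range(len(sequence) - self.span + 1):
            for row, col in self._slots(sequence[i:i + self.span]):
                self.table[row][col] = id_hash  # lossy: later writes win

    def matches(self, window_a: str, window_b: str) -> int:
        """Count columns where both windows hold the same non-empty ID hash."""
        count = 0
        for (ra, ca), (rb, cb) in zip(self._slots(window_a), self._slots(window_b)):
            a, b = self.table[ra][ca], self.table[rb][cb]
            if a != 0 and a == b:
                count += 1
        return count


stash = ToyStash()
read = "ACGTACGTTGCAAGCT"
stash.insert("read_1", read)
print(stash.matches(read[0:7], read[9:16]))
```

In this toy model, a high match count between two windows suggests that they were seen in the same sequence, while a low count between adjacent regions of an assembly would be the kind of signal a misassembly detector could look for. Because cells are overwritten and ID hashes are short, the answer is probabilistic, and any real threshold would need to be chosen empirically.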
Attribution-NoDerivatives 4.0 International