UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Improving sequence analysis with probabilistic data structures and algorithms Chu, Justin


In the biological sciences, sequence analysis refers to analytical investigations that use nucleic acid or protein sequences to elucidate biological insights from them, such as their function, species of origin, or evolutionary relationships. However, sequences are not very meaningful by themselves, and useful insights generally come from comparing them to other sequences. Indexing sequences using concepts borrowed from the computational sciences may help perform these comparisons. One such concept is a probabilistic data structure, the Bloom filter, which enables low memory indexing with high computational efficiency at the cost of false-positive queries by storing a signature of a sequence rather than the sequence itself. This thesis explores high-performance applications of this probabilistic data structure in sequence classification (BioBloom Tools) and targeted sequence assembly (Kollector) and shows how these implemented tools outperform state-of-the-art methods. To remedy some weaknesses of Bloom filters, such as the inability to index multiple targets, I have developed a novel probabilistic data structure called a multi-index Bloom filter (miBF), used to facilitate alignment-free classification of thousands of references. The data structure also synergizes with spaced seeds. Sequences are often broken up into subsequences when using a hashed-based algorithm, and spaced seeds are subsequences with wildcard positions to improve classification sensitivity and specificity. This novel data structure enables faster classification and higher sensitivity than sequence alignment-based methods and executes in an order of magnitude less time while using half the memory compared to other spaced seed-based approaches. This thesis features formulations of classification false-positive rates in relation to the indexed and queried sequences, and benchmarks the data structure on simulated data. In addition to my work on short read data, I explore and evaluate of methods for finding sequence overlaps in error-prone long read datasets.

Item Media

Item Citations and Data


Attribution-NonCommercial-ShareAlike 4.0 International