UBC Theses and Dissertations

Data mining techniques for de novo genome assembly and analysis

Afshinfard, Amirhossein

Abstract

While conventional physical maps were instrumental in constructing most of the reference genomes in use today, generating the maps was prohibitively expensive, and the technology was abandoned in favor of whole-genome shotgun sequencing (WGS). However, genome assemblies produced using WGS data and automated assembly pipelines often suffered from reduced quality. Despite recent advancements, these assemblies can still benefit from improvements in contiguity and correctness. This thesis revives the concept of physical maps, integrating them with modern sequencing techniques to address these challenges. It introduces Physlr, a novel tool for constructing next-generation physical maps using linked- and long-read sequencing data. Employing data mining techniques, Physlr achieves highly contiguous physical maps, offering applications in misassembly correction, structural variant detection, haplotype phasing, read selection, and scaffolding, the last two of which are showcased in this thesis. To demonstrate the utility of next-generation physical maps, I have added a utility to Physlr that scaffolds draft genome assemblies produced with any sequencing technology. Comprehensive computational evaluations illustrate how linked-read physical maps improve the contiguity (NG50 and NGA50) of draft assemblies made using various technologies, and how long-read physical maps can further scaffold genome assemblies produced by GoldRush, a state-of-the-art, lightweight, linear-time genome assembler. Additionally, this thesis introduces two other tools: Goldend and Philter. Goldend filters reads with high repeat content in their flanks to simplify downstream analysis, especially for genome scaffolding. Philter uses physical maps to select informative reads and filter out chimeric sequences in long-read datasets, thereby improving the accuracy of genomic analyses.
Evaluations conducted by running GoldRush on various datasets filtered using Goldend and Philter demonstrate improved quality across standard metrics such as NGA50, misassembly count (local and global), indels and mismatches, and duplication ratio. This thesis presents numerous data mining techniques that enable these advancements. A notable technique introduced within Physlr is a novel community detection algorithm. This algorithm is based on a divide-and-conquer approach, using biconnected components, 2-D cosine similarity, and an ensemble of current state-of-the-art community detection algorithms, none of which could individually scale to the problem size addressed in this study.
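The contiguity metric NG50 cited in the abstract can be illustrated with a short calculation. The sketch below is not taken from the thesis or from Physlr; the function name `ng50` and the toy contig lengths are assumptions made for illustration only. It follows the standard definition: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the estimated genome size. NGA50 applies the same calculation to aligned blocks, that is, contigs broken at misassemblies against a reference.

```python
def ng50(contig_lengths, genome_size):
    """Return NG50, or None if the contigs cover less than half the genome.

    Illustrative helper (hypothetical name, not part of any tool in the thesis).
    """
    target = genome_size / 2
    covered = 0
    # Walk contigs from longest to shortest, accumulating covered bases.
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered >= target:
            return length
    return None


# Toy data (made up): six contigs and an estimated genome size of 400 bases.
lengths = [80, 70, 50, 40, 30, 20]
print(ng50(lengths, genome_size=400))
```

Because NG50 is computed against the estimated genome size and not the total assembly length, an assembly that covers less than half of the genome has no NG50, which is why the sketch returns `None` in that case.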

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International