Open Collections
UBC Theses and Dissertations
Improving structural variant detection and interpretation using nanopore long reads : de novo diploid assembly, population frequency annotation, and clinical applications
Saju, Riya
Abstract
Structural variants (SVs) contribute substantially to genetic disorders but remain difficult to detect accurately and characterize fully due to their size, complexity, and enrichment in repetitive genomic regions. Short-read whole-genome sequencing often fails to identify or resolve these variants. Long-read sequencing, such as that offered by Oxford Nanopore Technologies (ONT), improves SV detection, but SV calls still suffer from high false-positive and false-negative rates. Moreover, existing population frequency databases, like gnomAD-SV, are based on short-read data, which limits the ability to filter out common SVs and technical artifacts in long-read datasets. Together, these challenges hinder reliable SV-based diagnosis in clinical settings.
Most genome sequencing workflows rely on read-based alignment to a reference genome, which may fail to resolve SVs longer than the read length and may introduce biases in repetitive regions. Long-read sequencing has advanced assembly methods for accurate SV characterization, but most Nanopore assemblers generate haploid representations, limiting their ability to profile heterozygous variation. To address this, I developed a Snakemake diploid de novo assembly pipeline for precise SV detection. Our benchmarking identified an optimal toolset combination: Flye-Medaka-LongStitch-HapDup-PAV, Hapdiff. The pipeline was evaluated on locally generated HG002 Nanopore data and benchmarked against the HG002 GIAB truth set. It produced diploid assemblies with an N50 of 49 Mb, a BUSCO completeness of 98.4%, and an SV benchmark F1 score of 0.840 (95% CI: 0.836–0.843). The assembly-based pipeline produced higher F1 scores than read-based alignment methods. Its F1 scores are comparable to those from T2T-HG002 assemblies, and its BUSCO completeness scores are similar to those of other high-quality reference assemblies such as T2T-CHM13 and T2T-HG002.
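For readers unfamiliar with the two headline metrics above, the following minimal Python sketch shows how an assembly N50 and a benchmark F1 score are conventionally computed. It is an illustration only, not the pipeline's code: the function names `n50` and `f1_score` are hypothetical, the inputs in the usage note are toy values rather than thesis data, and real SV benchmarking tools apply variant-matching rules before any counts like these are available.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50 of an assembly.

    N50 is the length L of the contig at which, walking from the longest
    contig downward, the running total first reaches at least half of the
    total assembly length.
    """
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths given")


def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Return F1, the harmonic mean of precision and recall.

    Assumes the counts come from comparing a call set against a truth set;
    how calls are matched to truth variants is outside this sketch.
    """
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)
```

As a toy usage example, `n50([2, 3, 4, 5, 6])` walks the lengths 6 and then 5, at which point the running total of 11 passes half of the total length of 20, so the result is 5. Likewise, `f1_score(8, 2, 2)` has precision and recall both equal to 0.8, so the F1 score is also 0.8.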
The pipeline was applied to long-read data from 18 RapidOmics 2.0 trios to understand its capabilities and limitations in a clinical context. STIX was used to annotate SV population frequencies in these data, annotating around 99% of variants from alignment-based calls and around 95% from assembly-based calls.
Future directions include testing our pipeline on more complex cases, increasing the read lengths or testing ultra-long ONT reads to improve assemblies, and evaluating the potential of the new Hifiasm (ONT) tool for producing de novo diploid assemblies more quickly and efficiently from simplex Nanopore reads.
Item Metadata
| Title | Improving structural variant detection and interpretation using nanopore long reads : de novo diploid assembly, population frequency annotation, and clinical applications |
| Creator | |
| Supervisor | |
| Publisher | University of British Columbia |
| Date Issued | 2026 |
| Genre | |
| Type | |
| Language | eng |
| Date Available | 2026-02-06 |
| Provider | Vancouver : University of British Columbia Library |
| Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
| DOI | 10.14288/1.0451466 |
| URI | |
| Degree (Theses) | |
| Program (Theses) | |
| Affiliation | |
| Degree Grantor | University of British Columbia |
| Graduation Date | 2026-05 |
| Campus | |
| Scholarly Level | Graduate |
| Rights URI | |
| Aggregated Source Repository | DSpace |