Open Collections
UBC Theses and Dissertations
Discovery of mid-range novel sequence insertions using long-read sequencing Safa, Armita
Abstract
Two decades after the initial sequencing and assembly of the human genome, the current reference assembly is still not sufficiently representative. Although most efforts to enrich our understanding of genomic variation have focused on single nucleotide polymorphisms, recent studies led by the Human Genome Structural Variation Consortium and the Genome in a Bottle Project aim to characterize structural variations. Still, there are genomic sequences missing from assemblies. These sequences, termed novel sequence insertions, need to be discovered to better characterize human genome diversity. Furthermore, insertions discovered to date have been shown to harbor coding genes and other functional elements. Studies have linked the existence of these insertions to the emergence of certain diseases. Current methods and tools for novel sequence insertion discovery suffer from shortcomings such as mapping ambiguity and assembly fragmentation, and lag especially in detecting long insertions. Compared with short-read sequencing, long-read sequencing has a higher per-base error rate but is less prone to mapping ambiguity caused by short repeats. Moreover, long reads can yield longer, more contiguous segments, resolving common assembly issues. Short reads, on the other hand, are almost error-free and can therefore provide better breakpoint prediction at base-pair resolution. Utilizing the complementary characteristics of both technologies, we introduced RinsLR, a novel algorithm that discovers mid-range (50 bp to 10 kbp) novel sequence insertions. RinsLR uses short reads to accurately identify potential novel sequence insertion breakpoints. It then uses long reads to rebuild and retrieve the inserted sequence. Using simulated experiments and the T2T-CHM13 genome, we evaluated RinsLR, compared it against other tools, and showed that RinsLR achieves consistently high precision and recall rates.
Item Metadata
Title | Discovery of mid-range novel sequence insertions using long-read sequencing
Creator | Safa, Armita
Publisher | University of British Columbia
Date Issued | 2022
Language | eng
Date Available | 2022-12-14
Provider | Vancouver : University of British Columbia Library
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International
DOI | 10.14288/1.0422617
Degree Grantor | University of British Columbia
Graduation Date | 2023-05
Scholarly Level | Graduate
Aggregated Source Repository | DSpace