UBC Theses and Dissertations

Discovery of mid-range novel sequence insertions using long-read sequencing

Safa, Armita

Abstract

Two decades after the initial sequencing and assembly of the human genome, the current reference assembly is still not sufficiently representative. Although most efforts to enrich our understanding of genomic variation have focused on single nucleotide polymorphisms, recent studies led by the Human Genome Structural Variation Consortium and the Genome in a Bottle Project aim to characterize structural variations. Still, there are genomic sequences missing from assemblies. These sequences, termed novel sequence insertions, need to be discovered to better characterize human genome diversity. Furthermore, insertions discovered to date have been shown to harbor coding genes and other functional elements, and studies have established a link between the presence of these insertions and the emergence of certain diseases. Current methods and tools for novel sequence insertion discovery suffer from shortcomings such as mapping ambiguity and assembly fragmentation, and lag especially in detecting long insertions. Long-read sequencing has a higher base-pair error rate than short-read sequencing, but is less prone to mapping ambiguity caused by short repeats. Moreover, long reads can be assembled into longer, more contiguous segments, resolving common assembly issues. Short reads, on the other hand, are almost error-free and could therefore provide better breakpoint prediction at base-pair resolution. Utilizing the complementary characteristics of both technologies, we introduce RinsLR, a novel algorithm that discovers mid-range (50 bp to 10 kbp) novel sequence insertions. RinsLR uses short reads to accurately identify potential novel sequence insertion breakpoints. It then uses long reads to rebuild and retrieve the inserted sequence. Using simulated experiments and the T2T-CHM13 genome, we evaluated RinsLR, compared it against other tools, and showed that RinsLR achieves consistently high precision and recall.

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International