UBC Theses and Dissertations

Scalable methods for improving genome assemblies
Nikolić, Vladimir


De novo genome assembly is a cornerstone of modern genomics studies. It is also a useful method for studying genomes with high variation, such as cancer genomes, because it is not biased by a reference. De novo short-read assemblers commonly use de Bruijn graphs, in which nodes are sequences of equal length k, also known as k-mers. Edges connect nodes that overlap by k - 1 bases, and nodes along unambiguous walks in the graph are then merged. The choice of k is influenced by several factors, and tuning it involves a trade-off between graph connectivity and sequence contiguity. Ideally, multiple k sizes should be used, so that lower values provide good connectivity in regions with low coverage and higher values increase contiguity in well-covered regions. However, this approach has only been explored with small genomes, without addressing the scalability issues that arise with larger ones. Here we present RResolver, a scalable algorithm that takes a short-read de Bruijn graph assembly built with a starting k as input and uses a k value closer to the read length to resolve repeats. RResolver builds a Bloom filter of the sequencing reads, uses it to evaluate the support for assembly graph paths at branching points, and removes paths with insufficient support. RResolver runs efficiently, taking on average 3% of the run time of a typical ABySS human assembly pipeline with 48 threads and 40 GB of memory. Compared to a baseline assembly, RResolver improves scaffold contiguity (NGA50) by up to 16% and reduces misassemblies by up to 7%. RResolver adds a missing component to scalable de Bruijn graph genome assembly. Because it improves the initial and fundamental graph traversal outcome, all downstream ABySS algorithms benefit greatly from working with a more accurate and less complex representation of the genome.
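
The idea of checking path support at a branching point with a Bloom filter of read k-mers can be illustrated with a minimal Python sketch. This is not RResolver's implementation: the `BloomFilter` class, the SHA-256-based hashing, the toy sequences, the function names, and the rule that every k-mer of a candidate path must be present are all assumptions made for illustration. The abstract states only that paths with insufficient support are removed.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter (illustrative only, not the structure RResolver uses)."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May return a false positive, never a false negative.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def kmers(seq, k):
    """Return every substring of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_read_filter(reads, k):
    """Insert all k-mers of all reads into a Bloom filter."""
    read_filter = BloomFilter()
    for read in reads:
        for kmer in kmers(read, k):
            read_filter.add(kmer)
    return read_filter


def resolve_repeat(incoming, repeat, outgoing, read_filter, k):
    """Keep the (incoming, outgoing) pairs whose path through the repeat is
    fully supported. Requiring every k-mer to be present is a simplifying
    assumption, not the published support threshold."""
    kept = []
    for in_name, in_seq in incoming.items():
        for out_name, out_seq in outgoing.items():
            candidate = in_seq + repeat + out_seq
            if all(kmer in read_filter for kmer in kmers(candidate, k)):
                kept.append((in_name, out_name))
    return kept


# Hypothetical toy data: a 7-base repeat that occurs twice in the genome,
# once between flanks A and C and once between flanks B and D.
incoming = {"A": "AAGGCT", "B": "CCTTGA"}
outgoing = {"C": "TGCCGT", "D": "ACGGAC"}
repeat = "GATTACA"
copies = [incoming["A"] + repeat + outgoing["C"],
          incoming["B"] + repeat + outgoing["D"]]

# Error-free reads of length 15 tiling both copies of the repeat region.
reads = [read for copy in copies for read in kmers(copy, 15)]

for k in (5, 11):
    read_filter = build_read_filter(reads, k)
    print(k, resolve_repeat(incoming, repeat, outgoing, read_filter, k))
```

With k = 5, no k-mer can span the 7-base repeat, so all four pairings of flanks look supported and the branching point stays ambiguous. With k = 11, closer to the read length of 15, k-mers reach into both flanks, and the false pairings (A with D, B with C) should lose support, barring a Bloom filter false positive, which is very unlikely at this filter size.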

Attribution-ShareAlike 4.0 International