UBC Theses and Dissertations
Parallel algorithms and software tools for high-throughput sequencing data Mohamadi, Hamid 2017



Full Text

Parallel Algorithms and Software Tools for High-Throughput Sequencing Data

by

Hamid Mohamadi
M.Sc., McMaster University, 2012

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in The Faculty of Graduate and Postdoctoral Studies (Bioinformatics)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

June 2017

© Hamid Mohamadi 2017

Abstract

With growing throughput and dropping cost of High-Throughput Sequencing (HTS) technologies, there is a continued need to develop faster and more cost-effective bioinformatics solutions. However, the algorithms and computational power required to efficiently analyze HTS data have lagged considerably. In health and life sciences research organizations, de novo assembly and sequence alignment have become two key steps in everyday research and analysis. The de novo assembly process is a fundamental step in analyzing previously uncharacterized organisms and is one of the most computationally demanding problems in bioinformatics. Sequence alignment is a fundamental operation in a broad spectrum of genomics projects. In genome resequencing projects, alignments are often used prior to variant calling. In transcriptome resequencing, they provide information on gene expression. They are even used in de novo sequencing projects to help contiguate assembled sequences. As such, designing efficient, scalable, and accurate solutions for the de novo assembly and sequence alignment problems would have a wide effect in the field.

In this thesis, I present a collection of novel algorithms and software tools for the analysis of high-throughput sequencing data using efficient data structures. I also utilize the latest advances in parallel and distributed computing to design and develop scalable and cost-effective algorithms on High-Performance Computing (HPC) infrastructures, especially for the de novo assembly and sequence alignment problems.
The algorithms and software solutions I develop are publicly available for free for academic use, to facilitate research at health and life sciences laboratories and other organizations worldwide.

Lay summary

Recent advances in DNA sequencing technologies have altered the scale and scope of health and life sciences. Using High-Throughput Sequencing (HTS) technologies, huge amounts of data are produced; however, the algorithms and computational power required to efficiently and accurately process and analyze HTS data have lagged considerably. Building bioinformatics capacity to address this growing problem is an area of active research. The goal of my thesis is to design and develop novel algorithms for large-scale biological sequence analysis, including de novo genome assembly and sequence alignment. I also utilize state-of-the-art parallel and distributed computing paradigms to build scalable and resource-efficient software solutions on high-performance computing infrastructures. The algorithms and software solutions I develop are publicly available for free for academic use, to facilitate research at health and life sciences laboratories and other organizations worldwide, and to advance knowledge about diseases and to improve human health through disease prevention and diagnosis.

Preface

Chapter 2 is focused on the biological sequence alignment problem, which involves identification of the possible coordinates of nucleotide or amino acid sequences along a target sequence, based on sequence similarity. Although there are some solutions for this problem, they have limitations in their accuracy, runtime, and memory usage, especially as the volume of input data increases drastically. In this chapter I address the challenges related to large-scale sequence alignment tasks. This work was presented at the AGBT 2014 conference as a poster and at the ISMB 2014 conference as a selected talk, and was published in the PLoS ONE journal in 2015 [79].
I designed and implemented the algorithm and the related software tool, conducted all the experiments, and wrote the manuscript.

Chapter 3 outlines a new hashing algorithm for bioinformatics applications. Hashing is an essential operation in many bioinformatics applications for indexing, querying, and similarity search. My proposed method was presented at the ISMB 2016 conference as a poster and at the EMBL Big Data in Biology and Health 2016 conference as a selected talk, and was published in the Bioinformatics journal in 2016 [77].

Chapter 4 describes a novel algorithm for cardinality estimation in genomics data, which is desirable for many bioinformatics applications and their downstream analysis pipelines to predict genome sizes, measure sequencing error rates, and tune runtime parameters for analysis tools. This work was presented at the EMBL Big Data in Biology and Health 2016 conference as a poster, and was published in the Bioinformatics journal in 2017 [78].

Table of contents

Abstract
Lay summary
Preface
Table of contents
List of tables
List of figures
List of programs
Acknowledgements
Dedication

1 Background, motivation, and research goals
  1.1 DNA sequencing
    1.1.1 Next-Generation sequencing
  1.2 Efficient data structures for NGS data
    1.2.1 Bloom filter
    1.2.2 Suffix tree and suffix array
    1.2.3 BWT and FM-index
    1.2.4 Contiguous seed
    1.2.5 Spaced seed
  1.3 Sequence alignment
    1.3.1 BWA, BWA-SW, BWA-MEM
    1.3.2 Bowtie, Bowtie2
    1.3.3 Novoalign
    1.3.4 GEM
    1.3.5 SOAP, SOAP2, SOAP3
  1.4 Sequence assembly
    1.4.1 ALLPATHS
    1.4.2 SGA
    1.4.3 ABySS
  1.5 Overview of research goals

2 DIDA: Distributed Indexing Dispatched Alignment
  2.1 Publication note
  2.2 Author summary
  2.3 Introduction
  2.4 Method
    2.4.1 Assembly graph
    2.4.2 Bloom filter
    2.4.3 DIDA
    2.4.4 Implementation
    2.4.5 Evaluated tools
  2.5 Results
    2.5.1 Performance on real data
  2.6 Discussion

3 ntHash: recursive nucleotide hashing
  3.1 Publication note
  3.2 Author summary
  3.3 Introduction
  3.4 Method
  3.5 Results
  3.6 Discussion

4 ntCard: a streaming algorithm for cardinality estimation in genomics data
  4.1 Publication note
  4.2 Author summary
  4.3 Introduction
  4.4 Method
    4.4.1 Background, notations, and definitions
    4.4.2 Estimating k-mer frequencies, fi
    4.4.3 Implementation details
  4.5 Results
    4.5.1 Experimental setup
    4.5.2 Accuracy
    4.5.3 Runtime and memory usage
  4.6 Discussion

5 Conclusion

Bibliography

List of tables

2.1 Dataset specification.
2.2 Alignment time/indexing memory for all aligners on C. elegans.
2.3 Alignment time/indexing memory for all aligners on human draft assembly.
2.4 Alignment time/indexing memory for all aligners on hg19.
2.5 Alignment time/indexing memory for all aligners on Picea glauca.
3.1 Bloom filter evaluation for cityhash on real data.
3.2 Bloom filter evaluation for murmur on real data.
3.3 Bloom filter evaluation for xxhash on real data.
3.4 Bloom filter evaluation for ntHash on real data.
3.5 Bloom filter evaluation for cityhash on simulated data.
3.6 Bloom filter evaluation for murmur on simulated data.
3.7 Bloom filter evaluation for xxhash on simulated data.
3.8 Bloom filter evaluation for ntHash on simulated data.
4.1 Dataset specification.
4.2 Accuracy of algorithms in estimating F0 and f1 for HG004.
4.3 Accuracy of algorithms in estimating F0 and f1 for NA19238.
4.4 Accuracy of algorithms in estimating F0 and f1 for PG29.

List of figures

1.1 Genome sequencing and assembly steps.
2.1 The workflow of DIDA with four partitions as an example.
2.2 Loading the Bloom filter using a sliding substring of length b.
2.3 Querying the Bloom filter using a substring of length b.
2.4 Choosing the jumping size s for querying the Bloom filter.
2.5 Extra dispatched reads vs. r and ϕ.
2.6 Scalability of different aligners using DIDA for the C. elegans data.
2.7 Runtime scalability behaviour with the number of nodes.
2.8 The scalability performance of ABySS-map+DIDA for C. elegans.
2.9 Scalability of different aligners using DIDA for human assembly.
2.10 Scalability of different aligners using DIDA for hg19.
2.11 Scalability of abyss-map aligner using DIDA for P. glauca.
3.1 Correlation coefficient plots of cityhash.
3.2 Correlation coefficient plots of murmur hash.
3.3 Correlation coefficient plots of xxhash.
3.4 Correlation coefficient plots of ntHash.
3.5 Correlation coefficient of hash methods for different samples.
3.6 Runtime for hashing a 250bp DNA sequence.
3.7 Runtime for hashing 1 billion k-mers.
3.8 Runtime for canonical hashing 1 billion k-mers.
3.9 Comparing multi-hashing runtime.
4.1 64-bit hash value generated by ntHash.
4.2 The workflow of ntCard algorithm.
4.3 k-mer frequency histograms for human genome HG004.
4.4 k-mer frequency histograms for human genome NA19238.
4.5 k-mer frequency histograms for white spruce genome PG29.
4.6 Runtime of DSK, ntCard, KmerGenie, KmerStream, and Khmer for HG004 dataset.
4.7 Runtime of DSK, ntCard, KmerGenie, KmerStream, and Khmer for NA19238 dataset.
4.8 Runtime of DSK, ntCard, KmerGenie, KmerStream, and Khmer for PG29 dataset.

List of programs

2.1 The DIDA algorithm
4.1 The ntCard algorithm

Acknowledgements

I would like to express my enduring gratitude to my supervisor, Inanc Birol. His continuous and unconditional help and support made my Ph.D. work and studies at UBC and the Genome Sciences Centre (GSC) at the BC Cancer Agency exciting and joyful. I learnt a lot from Inanc during my time at GSC. Discussions with him not only shaped my research directions and career path, but also taught me many useful things at a higher schema. I feel incredibly indebted to him. I would like to sincerely thank Dr. Steven J M Jones, Dr.
Anne Condon, and Dr. Alan Wagner, who graciously accepted to be on my Ph.D. supervisory committee and helped me with valuable discussions and comments.

I have also been very fortunate to work with Clay Breshears and his group during my visits to the Health and Life Sciences group at Intel Corporation in Hillsboro. Clay and his group introduced me to next-generation computer systems, and helped me utilize them in bioinformatics applications.

I would also like to thank all my lab-mates and friends in the Bioinformatics Technology Lab at GSC for their comments and discussion during my Ph.D. studies. I also express my profound affection and appreciation to Nazanin for her continuous encouragement and support during my years of education. Last, but not least, I heartily thank my family for the strong motivation that they gave me to follow my studies. Their support was invaluable to me.

Dedication

To my parents.

Chapter 1
Background, motivation, and research goals

Gathering large volumes of data by measuring physical processes is an essential feature of contemporary science. Experiments that gather massive amounts of data have recently become popular in computational biology, physics, chemistry, and finance. With recent advances in health and life sciences technologies, using high-throughput sequencing instruments, we can now produce terabytes of DNA sequencing data from previously unknown organisms, whole human populations, and thousands of human cancers, drastically increasing the computational burden of the related data analysis. Therefore, there is an essential demand for efficient, scalable, and cost-effective algorithms and software tools for analyzing the produced data.

The goal of my PhD research is to design and develop computationally efficient and scalable algorithms and software tools for accurately processing raw DNA sequence data produced by new sequencing technologies.
The algorithms and software tools that I develop utilize, at their core, succinct, compact, and compressed data structures. These data structures become more efficient in memory use as the redundancy in the data increases, while retaining the ability to perform efficient approximate matching. Using the algorithms I develop, I build a package to address two major problems in sequence analysis: sequence alignment and de novo assembly. The first problem that I address is biological sequence alignment, which involves identification of the possible coordinates of nucleotide or amino acid sequences along a target sequence, based on sequence similarity. Although there are some solutions for this problem, they have limitations in their accuracy, runtime, and memory usage, especially as the volume of input data increases drastically. The second problem that I tackle is the efficient reconstruction of a genome, the full complement of DNA within a cell, using only the output from the new sequencing instruments. Given the scale of these problems, often involving hundreds of gigabytes of data, computation time and memory efficiency are primary concerns. I optimize memory usage and execution time requirements for the de novo assembly problem.

In the remainder of this chapter, I introduce DNA sequencing technologies, give an overview of efficient data structures for next-generation sequencing data, and explain the key problems of sequence alignment and de novo assembly. I will also review existing methods for the sequence alignment and de novo assembly problems. At the end of this chapter, I will present an overview of my research goals.

1.1 DNA sequencing

DNA sequencing is the process of determining the order of nucleotides within a DNA molecule. With the limitations of current sequencing technologies, it is not possible to determine the whole sequence of a genome at once, but a small DNA fragment of up to a few thousand nucleotides can be sequenced.
The process of DNA sequencing starts by first shearing multiple copies of the target genome into very short DNA fragments. Afterward, the short DNA fragments are sequenced separately using DNA sequencing instruments. There are two categories of sequencing techniques: Sanger sequencing and Next-Generation Sequencing (NGS). The Sanger sequencing technique is the first method of DNA sequencing, developed by Frederick Sanger, and is based on the chain termination principle [92]. It works by incorporating chain-terminating dideoxynucleoside triphosphates (ddNTPs) during DNA replication using DNA polymerase. Before being replaced by Next-Generation Sequencing, Sanger sequencing was the most widely used approach for nearly a quarter-century. In Next-Generation Sequencing, the long DNA sequence is sheared into short DNA chunks and then DNA adapters are connected to the ends of the chunks to prepare a short DNA chunk library [33]. The library is then amplified before the sequencing process. Sanger sequencing is an expensive process with low throughput but relatively long reads, while NGS produces huge amounts of short reads faster and at lower cost by simultaneously sequencing many DNA fragments.

1.1.1 Next-Generation sequencing

The first NGS technology developed, named "pyrosequencing", was commercialized by Roche/454 Life Sciences, and it is the technology that was used to sequence the genome of James Watson [107]. Pyrosequencing, instead of terminating the chain amplification as in Sanger sequencing, is based on detecting pyrophosphates in the process of nucleotide incorporation [70]. In this approach, single-stranded DNA is captured by beads, and then loaded into picoliter reaction wells after amplification. After loading, and in a predefined order, fluorescently labeled nucleotides are included in the reaction.
Subsequently, a small pulse of light is issued when a labeled nucleotide is bound to the template DNA, and can be detected using a charge-coupled device (CCD) camera [33]. The reagents are then cleared after each reaction and before the next addition of nucleotides. Finally, the sequence of each template molecule is identified in real time by analyzing the captured images [107]. This sequencing technology generates much more data per run than Sanger sequencing, yielding up to 700 Mbases of sequencing data with maximum 1000 bp read lengths in a run of about one day [95].

The second NGS technology was developed at the University of Cambridge and commercialized by Solexa, which was later acquired by Illumina. In this technology, template DNA is attached to sequences fixed on a slide. A cluster of molecules is then created by amplifying the template DNA in place. The sequencing happens over a number of cycles. In each cycle, fluorescent-dye labeled nucleotides, which are reversibly terminated, are inserted into the reaction [95]. The process follows by taking an image and then chemically removing the dye and terminator to allow the reaction to proceed to the next base. After the run finishes, the captured images are processed, and the color of each cluster identifies which base was incorporated during each cycle [5]. This technology now generates over 1.5 Tbases per run in less than 72 hours, with maximum read lengths of 150 bp taken from both ends of DNA fragments. The technology is also able to generate longer reads (e.g., 250 bp) in other modes of operation.

Other NGS technologies are sequencing-by-ligation (SOLiD by Applied Biosystems) and semiconductor-based sequencing (Ion Torrent by Life Technologies).
Sequencing technologies continue to evolve with the emergence of single-molecule real-time sequencing approaches requiring no DNA amplification (PacBio, HeliScope, Oxford Nanopore), which provide very long reads, but with high error rates and low accuracy [15, 33].

With the improvements in NGS technologies, huge amounts of data have been generated, and this has empowered researchers in investigating human genome variation and sequencing thousands of human cancers to find disease-causing mutations and genes associated with diseases. On the other hand, new sequencing technologies have also imposed new computational challenges for the analysis of the enormous volumes of data being generated. Hence, we need efficient, scalable, and cost-effective algorithms and software tools to handle the data produced in routine problems such as sequence alignment and de novo assembly.

1.2 Efficient data structures for NGS data

In this section, I introduce some notations and definitions related to data structures for the analysis of next-generation sequencing data.

1.2.1 Bloom filter

A Bloom filter [9] is a probabilistic, space-efficient, and compact data structure that is designed to support membership queries on dynamic sets with an adjustable false positive parameter. It has been broadly employed in many computer science and engineering applications, leveraging its capability to compactly describe a set and readily detect items that do not belong, or may belong, to the set with a controlled false positive rate. In bioinformatics and computational biology, the Bloom filter data structure has been used in applications such as genome assembly, k-mer counting, and error correction [14, 16, 35, 53, 89, 90, 101, 102].

The answer to querying a Bloom filter is either no, which means the queried element is definitely not present in the data structure, or yes, which means the element may be present in the filter.
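As a concrete illustration of these query semantics, the following is a minimal Python sketch of a Bloom filter. This is not the implementation used in the thesis; the class name, the SHA-256-based double-hashing scheme, and the parameter choices are illustrative assumptions.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions over an m-bit array."""

    def __init__(self, m, k):
        self.m = m                          # number of bit positions
        self.k = k                          # number of hashes per item
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from two base hashes (double hashing);
        # real tools would use faster hashes such as ntHash instead.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def insert(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def query(self, item):
        # False means definitely absent; True means possibly present
        # (a false positive with the rate given by Equation 1.1 below).
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(item))
```

A "no" answer is always correct, because an inserted item would have set all k of its bit positions; a "yes" answer may be wrong when other items happen to cover the same positions.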
The uncertainty comes from the fact that different items can be mapped to the same entries in the Bloom filter, resulting in false positive hits with the following rate:

F = (1 − (1 − 1/m)^(kn))^k ≈ (1 − e^(−k/r))^k    (1.1)

where n is the number of items in the filter, m is the number of bit positions in the filter, k is the number of hashes per item, and r = m/n is the number of bits per element [9].

1.2.2 Suffix tree and suffix array

Suffix arrays and suffix trees [67, 68] are two widely used data structures for indexing, text searching, and information retrieval, and they have been widely used in bioinformatics and computational biology applications. There are two specific types of substrings s[i..j] of a given string s that are very important. The substring s[i..n] is a suffix of s for i ∈ [1..n+1]. The substring s[1..j] is a prefix of s for any j ∈ [0..n]. For a given string s, the suffix tree is a tree whose leaves represent all the suffixes of s, where each suffix of s is shown in the tree as a path starting from the root and ending at the corresponding leaf. For a given string s, the suffix array is an array of integers denoting the start positions of the lexicographically sorted suffixes of s [76]. The Longest Common Prefix, or LCP, of two strings is the longest string that is a prefix of both strings. The LCP is stored alongside the list of suffix indices, and represents the number of characters a particular suffix shares with the suffix exactly above it, starting at the beginning of both of them [76]. This value is useful for making string operations more efficient. For instance, the LCP may be used to speed up searching in the list of suffixes by avoiding comparison of characters that are already known to be the same.

1.2.3 BWT and FM-index

The Burrows-Wheeler Transform (BWT) of a given string T, denoted BT, is a reversible permutation of T, and it has been widely used in text compression and indexing.
When the BWT is coupled with some small auxiliary data structures, the two form a space-efficient and compact index of T [26]. By appending an occurrence array, Occ, and an aggregate count vector, C, Ferragina and Manzini [26] constructed the FM-index as a novel indexing structure. Let C(a) be the index in SAT, the suffix array of T, of the first suffix that starts with character a. If v is the number of letters that are alphabetically smaller than a in T, then C(a) = v + 1. Let Occ(a, i) denote the number of times the character a has been seen in BT[1, i]. To facilitate operations, the sentinel symbol, $, is also included in C and Occ, where $ is alphabetically smaller than all letters. Using C(a) and Occ(a, i), Ferragina and Manzini proposed an algorithm for searching a query Q in T. Let S denote a string with the suffix array interval [Rl, Ru]. Now, for aS the new suffix array interval can be computed from [Rl, Ru] using Occ and C as follows:

Rl(aS) = C(a) + Occ(a, Rl − 1)
Ru(aS) = C(a) + Occ(a, Ru) − 1    (1.2)

and Rl(aS) ≤ Ru(aS) if and only if aS is a substring of T. This means O(|Q|) time is required to check whether Q is a substring of T and to count the number of times Q occurs in T. This is called backward search.

1.2.4 Contiguous seed

A contiguous seed is a run of matches between sequence reads. Contiguous seed similarity search is based on the fact that a long substring shared between query and target sequences may originate from significant similarity [76]. A consecutive match of a few bases is called the seed, or hit. After identifying seed(s) between query and target sequences, the search continues by extending the seed regions up to a threshold measure of similarity to report the candidate positions. The most popular tool for similarity search, BLAST (Basic Local Alignment Search Tool) [3], is based on contiguous seed similarity search for finding local alignments.
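The suffix array of Section 1.2.2, the BWT, and the backward search of Equation 1.2 can be sketched in a few lines of Python. This is a naive, quadratic-time illustration using 0-based, half-open intervals rather than the 1-based notation above, not an efficient FM-index; the function names are my own.

```python
def suffix_array(t):
    # Start positions of the lexicographically sorted suffixes of t.
    return sorted(range(len(t)), key=lambda i: t[i:])

def lcp_array(t, sa):
    # LCP[i]: characters the suffix at sa[i] shares with the one above it.
    lcp = [0]
    for i in range(1, len(sa)):
        a, b = t[sa[i - 1]:], t[sa[i]:]
        n = 0
        while n < min(len(a), len(b)) and a[n] == b[n]:
            n += 1
        lcp.append(n)
    return lcp

def bwt(t):
    # Burrows-Wheeler Transform; t must end with the sentinel '$'.
    return "".join(t[i - 1] for i in suffix_array(t))

def backward_search(q, b):
    """Count occurrences of q in the text whose BWT is b (Equation 1.2)."""
    # C[a]: number of characters in the text strictly smaller than a.
    C, total = {}, 0
    for a in sorted(set(b)):
        C[a] = total
        total += b.count(a)
    occ = lambda a, i: b[:i].count(a)   # Occ(a, i): count of a in b[0:i]
    lo, hi = 0, len(b)                  # current suffix-array interval [lo, hi)
    for a in reversed(q):               # extend the match one character backward
        if a not in C:
            return 0
        lo = C[a] + occ(a, lo)
        hi = C[a] + occ(a, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

For example, the BWT of "banana$" is "annb$aa", and backward search for "ana" in it narrows the interval in |Q| = 3 steps to the two suffixes "ana$" and "anana$".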
The length of the shared substring between sequences, the seed, is w = 11 by default. The target and query sequences should have w consecutive identical characters, or w matches, to be aligned; this is shown as a seed of the form 11111111111, where a 1 represents a match. The key idea here comes from the fact that searching for exact matches is easier than looking for approximate matches.

1.2.5 Spaced seed

Spaced seed similarity search, proposed by Ma et al. in 2002 [65], looks for matches that are not consecutive and can include some "don't care" positions, resulting in a higher probability of catching alignments. The spaced seed proposed in [65] is 111*1**1*1**11*111, where a 1 denotes a match and a * corresponds to a "don't care" location. This means that, when looking for a possible alignment, only locations of 1's are examined, whereas locations with *'s are not checked. As opposed to the contiguous seed of BLAST, this class of seeds is referred to as spaced seeds. Spaced seeds are the current state of the art for approximate sequence matching, and have been increasingly used to improve the quality and sensitivity of searching in different applications, including alignment-free methods and metagenomic studies [29, 30, 39, 84]. Spaced seeds are more sensitive and have a higher chance of identifying actual alignments. The weight of a spaced seed, denoted by w, is defined as the number of 1's in the seed, whereas the total number of 1's and *'s is defined as the length and denoted by l. A seed's capability to identify regions that are similar between sequences is called the sensitivity of that seed [76]. There is a dynamic programming algorithm for calculating the sensitivity of a spaced seed [57]. Using a set of several spaced seeds for similarity search greatly improves the sensitivity.
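To make the seeding idea concrete, here is a small Python sketch that reports candidate hit positions for the spaced seed of [65] quoted above. It is an illustrative simplification under my own naming: it compares positions directly by sliding a window, whereas real tools hash the sampled '1' positions into an index table.

```python
SEED = "111*1**1*1**11*111"  # weight w = 11, length l = 18

def seed_hits(query, target, seed=SEED):
    """Positions p where query[0:l] matches target[p:p+l] at every '1'
    position of the seed; '*' positions are never compared."""
    care = [i for i, c in enumerate(seed) if c == "1"]
    return [p for p in range(len(target) - len(seed) + 1)
            if all(target[p + i] == query[i] for i in care)]
```

A mismatch confined to a * position still yields a hit, which is why a spaced seed catches more true alignments than a contiguous seed of the same weight.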
This set of several spaced seeds is referred to as multiple spaced seeds [76].

There are some limitations when using spaced seeds for approximate string matching. The first is designing and finding the optimal seed or multiple seeds. This is an NP-hard problem (there is an exponential number of candidate sets of spaced seeds to examine, and computing the sensitivity of each spaced seed set also needs exponential time), and it has been addressed in several previous works [23]. The second issue related to spaced seeds is the large space requirement of the spaced seed index table, which makes it infeasible to apply them to large-scale alignment tasks. Recently, Egidi and Manzini [23] have proposed a class of spaced seeds called Quadratic Residue seeds (QR-seeds), with sensitivity/specificity competitive with the best multiple spaced seeds. It is based on one spaced seed, and requires less memory compared to multiple spaced seeds.

1.3 Sequence alignment

Sequence alignment is one of the essential operations in computational biology and bioinformatics, and is affected adversely by the NGS data deluge. In genome resequencing projects, it is often used prior to variant calling. In transcriptome resequencing, it provides information on gene expression measurements. It is even used in de novo sequencing projects to help contiguate assembled sequences. As such, improving the performance of alignment algorithms would have a wide effect in the field.

Formally, the sequence alignment problem is defined as follows. As input we are given a set of target sequences T and a set of query sequences S, and the goal is to find one or more coordinates for each query sequence in the set of target sequences, based on the similarity between sequences. Given a distance function F and an error threshold e, we are looking for the set of positions in T where a query sequence has a distance ≤ e under the distance function F.
Common distance functions are the Hamming distance, corresponding to the number of mismatches, and the edit distance, corresponding to the number of mismatches, insertions, and deletions. The best output record for a given query s ∈ S is the position in the target sequence t ∈ T that is the most probable position that s originated from. The measures that can be used to define this likelihood are the minimum number of edit operations or the minimum sum of Phred quality scores at error positions in s. The Phred quality score Q is defined as a measure of quality related to the base-calling error probability P, and is equal to Q = −10 log10 P [24]. For instance, if the Phred score of a given base is 20, the probability that the given base is identified incorrectly is equal to 0.01.

Sequence alignment is broadly categorized into two classes: local alignment and global alignment. A local alignment, instead of considering the total sequence, takes into account regions of all possible lengths and detects similar regions between two sequences, whereas in global alignment all the bases in the two sequences are involved and taken into account in the alignment process. The most widely used algorithms for solving the global and local alignment problems are based on dynamic programming approaches [83, 100]. Creating the dynamic programming alignment matrix and performing a backtrack in it to detect the optimal alignment requires O(nm) time. The dynamic programming approaches are slow due to their quadratic time complexity, and take a long time especially in cases where we have very long target sequences to compare, such as large genomes [76].

Finding exact occurrences of queries in target sequences can take linear time; however, a time complexity linear in the length of the long sequence, or target, is too expensive, so a time complexity linear in the size of the short sequence, or query, is desired.
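The Phred conversion and the O(nm) dynamic programming recurrence described above can be sketched as follows. This is a minimal illustration of unit-cost edit distance under my own function names, not an optimized aligner.

```python
import math

def phred_score(p):
    """Phred quality Q = -10 * log10(P) for base-calling error probability P."""
    return -10 * math.log10(p)

def edit_distance(s, t):
    """Edit distance (mismatches, insertions, deletions) between s and t,
    filling the (n+1) x (m+1) dynamic programming matrix in O(nm) time."""
    n, m = len(s), len(t)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i          # delete all of s[:i]
    for j in range(m + 1):
        dp[0][j] = j          # insert all of t[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(dp[i - 1][j - 1] + (s[i - 1] != t[j - 1]),  # (mis)match
                           dp[i - 1][j] + 1,                           # deletion
                           dp[i][j - 1] + 1)                           # insertion
    return dp[n][m]
```

A backtrack from dp[n][m] toward dp[0][0] would recover the optimal global alignment itself, which is what makes the quadratic cost unavoidable for this formulation.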
To achieve this goal, an index such as a suffix array, suffix tree, FM-index, or hash table can be constructed on the large sequence. Building such an index requires O(n) time and O(n log n) memory, where n is the length of the target sequence. The indexing cost is usually amortized when the index is utilized multiple times, resulting in quick lookup of the indexed sequences. During the past decade, many alignment methods have been proposed based on suffix trees [47, 73], suffix arrays [1, 37], hash tables [3, 12, 34, 38, 56, 63, 64, 93, 99, 108], or FM-indices [48–50, 52, 54, 55, 59, 69]. Below is a summarized description of some of the most popular alignment methods in the field.

1.3.1 BWA, BWA-SW, BWA-MEM

BWA [55] is an FM-index based alignment method. For exact string matching, it uses the FM-index and backward search. For inexact matching, it utilizes a heuristic bounded traversal/backtracking approach. Using this recursive approach, BWA identifies the suffix array intervals of the target substrings matching the query sequence q with fewer than e gaps or mismatches. This procedure is performed by backward search to sample distinct subsequences from the target sequence, bounded by the array D(.), where D(i) is a lower bound on the number of differences between the prefix q[0..i] and the target sequence. Calculating better estimates for D narrows down the search space and results in a faster and more efficient process.

BWA-SW [54] constructs FM-indices for both the target and the query sequences. The method implicitly represents the target sequence as a prefix trie and represents the query string as a prefix directed acyclic word graph (DAWG), reconstructed from the prefix trie of the query string. The method then applies a dynamic programming approach between the DAWG and the trie, by traversing the query DAWG and the target prefix trie.
Instead of outputting all the significant local alignments, the algorithm only outputs largely non-overlapping mappings on the query sequence. BWA-SW heuristically finds and ignores seeds contained in a longer alignment, which results in less computing time spent extending unsuccessful seeds [54].

BWA-MEM [52] is a seed-and-extend based alignment method. It works by first seeding an alignment with super-maximal exact matches, called SMEMs, utilizing an algorithm that looks for the longest exact match spanning each position of a given query. Nevertheless, the actual alignment may sometimes not contain any SMEMs. To decrease misalignment resulting from absent seeds, BWA-MEM uses re-seeding. Suppose we have a SMEM of length l that occurs k times in the target sequence. If the length l of the SMEM is too long, longer than 28 bp by default, BWA-MEM re-seeds with the longest exact matches that span the middle nucleotide of the SMEM and appear at least k + 1 times in the target [52]. These kinds of seeds can be identified by demanding an occurrence threshold in the original SMEM method.

1.3.2 Bowtie, Bowtie2

Bowtie [50] is an FM-index based alignment algorithm. After building the index for the target sequences, it employs a modified Ferragina and Manzini [26] exact matching approach to determine the possible mapping coordinates. To handle inexact matching, Bowtie customizes this method with two extensions. The first is a quality-aware backtracking procedure allowing mismatches, and the second is "double indexing", to prevent excessive backtracking. Bowtie allows a limited number of mismatches and prefers mappings with a lower sum of quality values at all mismatched locations [50]. It launches a backtracking search to identify mappings that satisfy this mapping policy. The search proceeds similarly to the exact backward search, computing ranges for successively longer query suffixes.
If the algorithm encounters an empty range, it may select an already-matched query position, substitute a different base at that position, and restart the backward search from the position just after the changed nucleotide. The Bowtie algorithm considers only those changes that are compatible with the mapping policy and that yield a modified suffix occurring at least once in the target. To reduce excessive backtracking, the method utilizes "double indexing": two indices are constructed for the target sequences. The first includes the BWT of the forward-strand target set, called the forward index, and the second consists of the BWT of the reversed strand (not the reverse complement), which is referred to as the mirror index.

Bowtie2 [49], as opposed to Bowtie, handles gapped alignment by extending the FM-index based approach of Bowtie, dividing the algorithm into two steps: (i) an ungapped seed-finding step based on the FM-index, which exploits its speed and memory efficiency, and (ii) a gapped extension step based on dynamic programming, which takes advantage of the single-instruction multiple-data (SIMD) vector processing extensions of modern processors. To do so, the method extracts seeds from the query sequence and its reverse complement. After seed extraction, Bowtie2 utilizes the FM-index to identify all the ranges related to these seeds in an ungapped fashion, then prioritizes them and calculates their positions in the reference genome from the index. Finally, it uses a SIMD-accelerated version of the Smith-Waterman algorithm [25] to extend the seeds into full alignments.

1.3.3 Novoalign

Novoalign is a hash-based alignment algorithm. It provides some of the most sensitive results for the alignment problem by employing a dynamic programming approach. In the indexing stage, Novoalign constructs a hash table by partitioning the target sequences into overlapping k-mers.
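A k-mer hash index of this general kind can be sketched as follows (illustrative Python; the dict-of-lists layout and function names are assumptions for the sketch, not Novoalign's actual implementation):

```python
from collections import defaultdict

def build_kmer_index(target, k):
    """Map every overlapping k-mer of `target` to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(target) - k + 1):
        index[target[i:i + k]].append(i)
    return index

def candidate_positions(index, query, k):
    """Look up each k-mer of `query` in the index; each hit implies a candidate
    start position for the whole query, to be verified by full alignment."""
    hits = set()
    for j in range(len(query) - k + 1):
        for pos in index.get(query[j:j + k], []):
            hits.add(pos - j)  # implied alignment start of the whole query
    return sorted(h for h in hits if h >= 0)
```

Candidate positions returned by the lookup are then verified by the dynamic programming step described next, so that only a few promising target regions incur the expensive alignment.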
In the mapping phase, after constructing the index table, it exploits the Needleman-Wunsch dynamic programming approach [83] with affine gap penalties to identify the optimal global alignment.

1.3.4 GEM

GEM [69] is an FM-index based alignment method. It utilizes a filtration-based paradigm for approximate sequence matching. All relevant candidate matches are identified using an FM-index via appropriate pigeonhole-like rules, followed by a dynamic programming approach for refining alignments. Some regions may yield many candidate matches that need to be verified, resulting in inefficient alignment. GEM therefore picks a threshold t, which defines an upper bound on the number of candidates that should be examined for each filtering region. For a given query, GEM scans it backward from right to left with the FM-index, appending one base at a time to the current segment and calculating the number of candidates in the target that match the segment being formed exactly. Whenever the candidate count drops below the threshold, it starts a new segment. It should be mentioned that the number of candidates to be examined per filtering region is always guaranteed to be less than the threshold.

1.3.5 SOAP, SOAP2, SOAP3

SOAP [59] is a hash-based alignment method. For aligning a query against the target sequences, SOAP allows a fixed number of mismatches (at most 2) or a single continuous gap (1-3 bp) with no mismatches in the flanking regions. Its criteria for the best alignment are either the smallest number of mismatches or a shorter gap. SOAP loads the target sequences into memory and constructs the seed index tables for the target sequences. For each query, it then builds seeds and searches for candidate hits in the corresponding index table. Finally, it performs the alignment and reports the results. SOAP uses a look-up table to accelerate the alignment.
To reduce the memory requirements, both the query and the target sequences are converted into a 2-bit encoding for each base. A bit-wise EXCLUSIVE OR (XOR) comparison is performed on the query and target sequences to find the number of differing bases. Since gapped hits do not include mismatches, SOAP uses an enumeration approach that attempts to insert a consecutive gap or delete a region at each possible site in a query.

SOAP2 [60] utilizes a compact Burrows-Wheeler Transform (BWT) index constructed from the target sequences in main memory. By employing the BWT, SOAP2 improves alignment speed and significantly reduces memory requirements. SOAP2 constructs a hash table to speed up searching for the location of a query in the BWT target index. Because of the hash, detecting the actual location inside a block requires very few look-up operations. For alignments with both indels and mismatches, it partitions the query into segments. To allow a single mismatch, a query is partitioned into two regions; to handle two mismatches, it partitions a query into three regions to look for hits. This enumeration approach is used to find mutation positions on the queries.

SOAP3 [62] is a parallel implementation of SOAP2 adapted for the Graphics Processing Unit (GPU) to improve speed. It constructs a GPU variant of the compressed BWT index used by SOAP2. To adapt the compressed BWT index to the GPU, SOAP3 redesigns the BWT to decrease memory accesses while preserving the index efficiency. It also restricts patterns that introduce too many branches at runtime and postpones their execution to decrease the idle time of processors.

1.4 Sequence assembly

In bioinformatics, the sequence assembly process refers to joining small sequences, called reads, of a much larger and unknown DNA sequence in order to rebuild the original long sequence.
It is a required step in bioinformatics because current sequencing technologies cannot generate whole genomes (a huge sequence, e.g. about 3 billion characters long for human) from start to end, but rather generate small sequences (reads) between 100 and 5000 bases long (Fig. 1.1: a, b). The assembly algorithm then stitches the genome together from these short sequenced pieces of DNA (Fig. 1.1: c-e). Therefore, given a collection of reads, i.e. short subsequences of the genomic sequence over the alphabet {A, C, G, T}, the goal of genome assembly is to completely reconstruct the genome from which the reads are derived. There are several challenges in the assembly problem, such as sequencing errors, repeats and duplications. Also, large genomes require more computational power as well as memory (most algorithms require more than 300 GB of memory for mammalian genomes) to assemble billions of reads. Assembly algorithms can be divided into three major paradigms: Greedy, Overlap-Layout-Consensus (OLC), and de Bruijn graph [81].

Figure 1.1: Genome sequencing and assembly steps. (a) Genome; (b) clone, cut to fragments; (c) select fragments and sequence reads; (d) contigs; (e) scaffold.

In the greedy paradigm, the assembly algorithm always makes the best immediate local choice, e.g. the reads that have the best overlap are joined incrementally while they are in agreement with the assembly built so far. Here, the global relationship between the reads is not taken into account, and all choices made by the assembly method are inherently local. This paradigm also has challenges with the assembly of large and complicated genomes.

In the Overlap-Layout-Consensus paradigm, the assembler first finds potentially overlapping reads (overlap), then merges reads into contigs and contigs into scaffolds (layout), and finally derives the DNA sequence and corrects read errors (consensus).
Because the complexity of the overlap step is O(n²), this paradigm is not efficient for huge assembly problems such as human genome assembly.

In the de Bruijn graph paradigm, sequences are decomposed into fixed-length substrings of length k, called k-mers. Each k-mer is assigned to a vertex in the graph, and the edges between vertices indicate that the adjacent k-mers overlap by k − 1 characters. The assembly algorithm then tries to join vertices; assembly can thus be viewed as joining vertices of the graph wherever they are connected unambiguously. Before concatenating vertices, the graph must be cleaned of edges and vertices resulting from sequencing errors [98]. After creating all the k-mers from the input reads, the graph is constructed, and finally the potential long sequences (genomes) are obtained by finding Eulerian tours in the de Bruijn graph.

Several methods exist for the sequence assembly problem. Popular assemblers based on the OLC paradigm are Edena [36] and SGA [97], and for the de Bruijn graph paradigm we can mention ALLPATHS [11, 31, 66], ABySS [98], Euler [86], SOAPdenovo [58], and Velvet [109]. Following is a summarized description of some of these assembly methods.

1.4.1 ALLPATHS

ALLPATHS [11] is a unipath graph-based assembly method. A unipath is an unambiguous path in the graph, and a unipath graph is a graph whose edges are unipaths. Unipaths are found by traversing the nodes until a branching is found. One end of a unipath is arbitrarily named the left side and the other end the right side. For each unipath in the graph, ALLPATHS finds its neighbors on the left side and on the right side. If the distance between the left neighbors and the right neighbors is less than a threshold, the unipath is removed. The remaining unipaths in the graph are seed unipaths. ALLPATHS starts assembly from low copy count seed unipaths called ideal seeds. For each ideal seed unipath, ALLPATHS defines its neighborhood based on the distance on each side and constructs two sets of read clouds: the primary read cloud consists of the reads whose actual genomic locations are most likely near the seed unipath (reads on other ideal seed unipaths in the neighbourhood), and the secondary read cloud consists of all other short fragments in the neighbourhood [11]. The reads in the neighborhood of an ideal seed unipath define a local unipath. Local unipaths are joined iteratively using mate pairs in the primary cloud, which in the end yields a sequence graph representing the genome. Then, ALLPATHS resolves dead-ends, bubbles and loops to further simplify the graph.

1.4.2 SGA

The String Graph Assembler (SGA) constructs an assembly string graph to represent the reads. It first removes duplicate and contained reads from the input set of reads to build an overlap graph. Then transitive edges are removed, resulting in an overlap graph of irreducible edges, called a string graph [97]. SGA uses an FM-index to directly compute the set of irreducible edges for a given set of reads. Reads in SGA are corrected based on the k-mer frequencies and approximate overlaps between reads. SGA tries to correct the reads by minimizing the sum of edit distances of all overlapping reads in the string graph. After filtering and correction, most of the remaining reads do not contain errors. Each node in the string graph denotes a read. Most of the reads in the string graph have only two neighbours, one overlapping with a prefix and the other overlapping with a suffix of the read. The majority of the read sequences in the string graph are simply connected on a path without any branching. These reads can be unambiguously joined together to decrease the size of the graph [97]. SGA iteratively discovers irreducible edges in the graph and reduces the nodes that are simply connected. Each edge in the simplified string graph represents a contig.
In the last stage of assembly, paired-end or mate-pair information is employed to bridge contigs into scaffolds by a program called the scaffolder. The scaffolder constructs a contig linkage graph that represents relationships between contigs using mate pairs. The scaffolder then removes ambiguous edges from the contig linkage graph, where the contigs cannot be ordered consistently. Finally, terminal vertices (i.e., vertices having an edge in only one direction) are identified and all paths between pairs of vertices are determined. The path with the largest amount of sequence is retained as a scaffold [97].

1.4.3 ABySS

ABySS (Assembly By Short Sequences) is a parallel sequence assembly algorithm implemented using MPI and developed to assemble huge amounts of NGS data [98]. The ABySS algorithm has two stages. In the first stage, all k-mers are created from the reads and refined to remove read errors and build initial contigs. In the second stage, contigs are extended by resolving ambiguities using mate-pair information. Some genomes have a very large number of k-mers. For instance, the human genome has over 2 billion unique k-mers. If every k-mer is represented using, say, 50 bytes, 100 GB of RAM is required just to represent the k-mers. The solution of ABySS is distributed computing, which spreads the de Bruijn graph over a cluster of compute nodes as follows. Each input read of length l is divided into l − k + 1 successive k-mers by shifting a fixed-length window of length k over the input read sequence [98]. Then, the cluster compute node index of each k-mer is computed, based on the numerical value of the binary representation of its sequence, and the k-mer is sent to that node and stored in a hash table. After the k-mers are loaded into the de Bruijn graph, which is distributed across several nodes, the adjacency information for the k-mers is calculated.
To do so, a message is sent to the 8 possible neighbors of each k-mer, because a node in the graph may connect to up to 8 edges, one for each possible single-base extension from {A, C, G, T} in each of the two directions. If the neighbor is present, a k − 1 overlap with the originating k-mer exists, and the adjacency information is updated accordingly [98]. The de Bruijn graph may have false branches caused by noise and sequencing errors, or bubbles caused by repeated read errors. In the next step, the graph is cleaned of those vertices and edges related to noise and sequencing errors (trimming) as well as repeated read errors or haplotypic variants (bubble popping). After removing ambiguous edges, the remaining de Bruijn graph is analyzed for contig extension, and vertices linked by unambiguous edges are merged together, resulting in the initial contigs. After creating initial contigs, paired-end information is employed to resolve the ambiguities among contigs. Contigs that can be linked together are identified by paired-end reads. Then, the linked contigs are filtered to eliminate erroneous links resulting from misaligned or mispaired reads. Finally, contigs are considered to be linked if at least a predefined number of pairs join them together.

New releases of ABySS address challenges arising from the increase in read length in the third generation of sequencing technologies. In this method, a new spaced seed data structure and algorithm has been designed for paired de Bruijn graph genome assembly [6, 7]. The paired de Bruijn graph method is easily adaptable into any de Bruijn graph-based framework.
This modified approach is deemed a better fit than its predecessor for the following reasons: (i) it allows for sequence error correction not only in low coverage regions but across the entire genome, due to the usage of two k-mers instead of one; (ii) longer and unambiguous contigs can be reported from repeat-rich sequences; (iii) more unique k-mer pairs can be reported for larger separation distances using small k-mers (k′), which a single k-mer could only reproduce with a length of more than 2k′. There are fewer nodes in the same de Bruijn graph, due to the accommodation of two k-mers in each of them, and fewer tangled edges, due to the unique nodes. As a consequence, the memory and time requirements are much lower compared to a typical de Bruijn graph structure.

1.5 Overview of research goals

In Chapter 2, I introduce my technical notation and definitions, then describe a distributed and parallel framework for large-scale sequence alignment tasks. This addresses the memory requirement challenge of performing large-scale alignment tasks. The method that I develop is similar to the MapReduce framework developed by Google for large-scale data processing. It also employs the Bloom filter data structure at its core. As this is a very compact data structure, its peak memory footprint is much smaller than that of other data structures. This allows the reduction of the memory bottleneck of indexing large target sequences.

In Chapter 3, I describe a fast hashing algorithm for bioinformatics applications called ntHash. This algorithm is a recursive or rolling hash function, in which each new hash value is derived from the previous one with a few additional operations. I demonstrate the uniformity and runtime performance of the proposed method on real sequencing data as well as simulated random data.
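The rolling idea can be illustrated with a generic polynomial rolling hash over a 2-bit base encoding (an illustration only, with an arbitrary modulus; this is not ntHash's actual scheme): hashing the next k-mer window reuses the previous hash in O(1) instead of rehashing all k bases.

```python
BASE = 4
MOD = (1 << 61) - 1  # a large Mersenne prime modulus (illustrative choice)
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def hash_kmer(kmer):
    """Direct polynomial hash of a k-mer, O(k)."""
    h = 0
    for b in kmer:
        h = (h * BASE + CODE[b]) % MOD
    return h

def roll(h, out_base, in_base, k):
    """Derive the next window's hash from the previous one in O(1):
    remove the outgoing base's contribution, shift, add the incoming base."""
    h = (h - CODE[out_base] * pow(BASE, k - 1, MOD)) % MOD
    return (h * BASE + CODE[in_base]) % MOD
```

Sliding a window of size k over a read then costs O(1) per position after the first, which is what makes recursive hashing attractive for k-mer-heavy workloads.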
I also compare the performance of ntHash with state-of-the-art hashing algorithms in bioinformatics.

In Chapter 4, I propose a streaming algorithm for cardinality estimation in massive genomics data called ntCard. I address this problem by utilizing the ntHash algorithm to efficiently compute hash values for input stream sequences. I then sample the calculated hash values to build a reduced-representation multiplicity table describing the sample distribution, and derive a statistical model to reconstruct the population distribution from the sample distribution. I evaluate the performance of ntCard on whole genome shotgun sequencing experiments on the human genome and the white spruce genome datasets. I also compare the accuracy, runtime, and memory usage of ntCard to leading cardinality estimation algorithms.

Finally, in Chapter 5, I present concluding remarks.

Chapter 2

DIDA: Distributed Indexing Dispatched Alignment

2.1 Publication note

The work described in this chapter was previously published in the PLoS One journal in 2015 [79]. The resulting algorithm and software tool were used in two published papers, in the Plant Journal in 2015 [106] and in the Genome Biology and Evolution journal in 2016 [43]. The work described in this chapter is the main work of the author, under the supervision of his PhD supervisor, Inanc Birol. The author designed the algorithm, implemented the software tool, performed the experiments, and wrote the manuscript. The co-authors on this work helped in optimizing the software tool and performing the experiments.

2.2 Author summary

Sequence alignment is one of the most popular and widely-used applications in bioinformatics that is affected by the high-throughput sequencing data deluge. In the sequence alignment process, nucleotide or amino acid sequences are queried against targets to find regions of close similarity.
The alignment process becomes computationally challenging when queries are too many and/or targets are too large. This bottleneck is usually addressed by preprocessing techniques, where the queries and/or targets are indexed for easy access while searching for matches. When the target is static, such as an established reference genome, the cost of indexing is amortized by reusing the generated index. However, when the targets are non-static, such as contigs in the intermediate steps of a de novo assembly process, a new index must be computed for each run. To address such scalability problems, we present DIDA, a novel framework that distributes the indexing and alignment tasks into smaller subtasks over a cluster of compute nodes. It provides a workflow beyond the common practice of embarrassingly parallel implementations. DIDA is a cost-effective, scalable and modular framework for the sequence alignment problem in terms of memory usage and runtime. It can be employed in large-scale alignments to draft genomes and intermediate stages of de novo assembly runs. The DIDA source code, sample files and user manual are available online. The software is released under the British Columbia Cancer Agency (BCCA) license, and is free for academic use.

2.3 Introduction

Performing fast and accurate alignments of reads generated by modern sequencing technologies represents an active field of research. At its core, the sequence alignment problem is about identifying regions of close similarity between a query and a target. Most modern algorithms in this domain work by first constructing an index of the target and/or the query sequences. This index may be in the form of a suffix tree [47, 73], suffix array [1, 37], hash table [3, 12, 34, 38, 56, 65, 93, 99, 108], or full-text minute-space index (FM-index) [48–50, 55, 58, 61, 69].
Although this pre-processing step introduces an initial computational overhead, indexing helps narrow the list of possible alignment coordinates, speeding up the alignment task.

When the target sequence is static (e.g., a reference genome), the cost of index construction represents a one-time fixed cost. It is performed once as a pre-alignment operation, and the resulting index is used for many subsequent queries. Hence, it is often discounted in performance measurements of alignment tools. However, there are many applications where the reference is not static and/or the computational cost of indexing is not negligible. Such cases include resequencing data analysis of non-model species, where the target index has to be established, and intermediate stages of a de novo assembly process, where index construction needs to be performed several times.

One more complicating factor in these two domains is that the target sequence may not represent chromosome-level contiguity, requiring alignments to a fragmented target sequence. This may be a particular challenge for many alignment algorithms, which perform poorly near target boundaries, introducing "edge effects".

To address these challenges, we have designed and developed DIDA, for Distributed Indexing and Dispatched Alignment. DIDA works by first distributing the index construction over several computing nodes. It then dispatches the query sequences to the corresponding computing nodes for alignment. Finally, partial alignment results from the different computing nodes are gathered and combined for reporting.

We tested DIDA using four datasets: (1) the C. elegans genome, (2) a human draft genome, (3) the human reference genome, and (4) the P. glauca genome. Here, we report on the scalability, modularity and performance of the tool.

2.4 Method

The sequence alignment task is often suitable for parallel processing, and the widely practiced approach is to perform the task in an embarrassingly parallel manner.
When the target index is available, it is loaded on multiple processors, and subsets of the query sequences (usually raw reads from a sequencing experiment) are aligned in parallel to this common target.

In DIDA, we parallelize both the indexing and alignment operations using a five-step workflow (Fig. 2.1). For the description of the method, we consider a use case where the target is a draft genome assembly, with individual contigs and scaffolds related to each other through an assembly graph, although the protocol is general enough for a generic target not associated with a graph. Before describing the proposed framework, we provide some preliminary definitions and notation.

Figure 2.1: The workflow of DIDA with four partitions as an example. (a) First, we partition the targets into four parts using a heuristic balanced cut. (b) Next, we create an index for each partition. (c) The reads are then flowed through Bloom filters to dispatch the alignment task to the corresponding node(s). (d) Finally, the reads are aligned on all four partitions and the results are combined together to create the final output.

2.4.1 Assembly graph

Most modern assembly tools are graph-based algorithms [11, 58, 80, 97, 98, 109].
These algorithms model the assembly problem using a graph data structure (e.g., de Bruijn graph, overlap graph, string graph) consisting of a set of vertices (reads or k-mers) and edges (overlaps) representing the relationships between vertices. After building such a graph, the assembly problem is converted to a graph traversal problem, where a walk along the graph would reconstruct the source genome. In practice, assembly algorithms report unambiguous walks on the assembly graph, building contigs as opposed to full genomes or chromosomes. Especially for large genomes, the use of short reads from high-throughput sequencing platforms results in assemblies with a large number of contigs. For example, using 150 base pair (bp) reads, the white spruce genome is assembled into several million contigs [8]. Some of the ambiguity in the assembly graph can be mitigated by using paired-end reads or other linkage information. This requires alignment of queries to a typically highly fragmented draft genome. In DIDA, when available, we partition the assembly graph keeping tightly connected components on the same partition, as described below. For other use cases, where the target components (e.g., contigs or chromosomes) are not related through a graph, the partition optimization is done based on component lengths.

2.4.2 Bloom filter

As explained in the previous chapter, a Bloom filter [9] is a space-efficient and compact probabilistic data structure providing membership queries over dynamic sets with an allowable false positive rate. The Bloom filter has been widely used in many computing applications, which exploit its ability to succinctly represent a set and efficiently filter out items that do not belong to the set, with an adjustable error probability. In bioinformatics, the Bloom filter
In bioinformatics, the Bloom filterhas been recently utilized in applications such as k-mer counting, genomeassembly and contamination detection [14, 16, 74, 90, 102].An ordinary Bloom filter consists of a bit array B of m bits, which areinitially set to 0, and ϕ hash functions, h1, h2, . . . , hϕ, mapping keys from312.4. Methoda universe U to the bit array range {1, 2, . . . ,m}. In order to insert anelement x from a set S = {x1, x2, . . . , xn} to the filter, the bits at positionsh1(x), h2(x), . . . , hϕ(x) are set to 1. To query if an element q is in thefilter, all of the bits at positions h1(q), h2(q), . . . , hϕ(q) are examined. If atleast one bit is equal to 0, then q is definitely not in S. Otherwise, q likelybelongs to the set. The uncertainty stems from coincidental cases, whereall the corresponding bits, hi(q) i = 1, 2, . . . , ϕ, may be equal to one eventhough q is not in S. This is possible if other keys from S were mapped intothese positions. Such a chance occurrence is called a false positive hit, andits probability is called the false positive rate, F . The probability for a falsepositive hit depends on the selection of the parameters m and ϕ, the size ofthe bit array and the number of hash functions, respectively. After insertingn distinct elements at random to the bit array of size m, the probabilitythat a specific bit in the filter is 0 is (1− 1m)ϕn. Therefore, the false positiverate is:F = (1− (1− 1m)ϕn)ϕ ≈ (1− e−ϕ/r)ϕ (2.1)where r = m/n is the number of bits per element. Minimizing the equation(1) for a fixed ratio of r yields the optimal number of hash functions ofϕ = rln(2), in which case the false positive rate is (0.6185)r [10].2.4.3 DIDAOur proposed distributed and parallel indexing and alignment framework,DIDA, consists of five major steps to perform the indexing and alignment322.4. Methodtask: distribute, index, dispatch, align, and merge. The indexing and dis-patch steps are performed in parallel. 
Each step is explained in detail as follows.

Distribute. In this step, the set of target sequences is partitioned into several subsets. Depending on the nature of the target sequences (static as in reference genomes, or non-static as in a draft assembly; unrelated as in chromosomes, or related as contigs in an assembly graph), different partitioning strategies may apply to the initial target set. The key point in all cases is to keep the partitions as balanced as possible. The target partitioning problem is a variant of the bin-packing problem, and since the bin-packing problem is NP-hard, there is no known polynomial-time solution to the target partitioning problem either. However, there are efficient heuristics developed to solve the problem [44, 105]. Beyond the theoretical hardness of the target partitioning problem, obtaining well-balanced partitions in practice is also difficult when the target set contains a few long sequences.

In the case of a static and independent set of target sequences, the partitioning is performed using the best fit decreasing strategy, which is among the simplest greedy approximation algorithms for solving the bin-packing problem. Here, the bins correspond to computing nodes, and the items are the target sequences. This strategy operates by first sorting the target sequences to be partitioned in decreasing order by their lengths, and then distributing each target sequence into the best node in the list, which is the node with the minimum sufficient remaining space for the target sequence.

When the target sequences are related, such as contigs in a draft assembly, the partitioning starts by first identifying all connected target sequences using adjacency information in the assembly graph. This is performed by launching a depth-first search traversal of the undirected adjacency graph, and by finding all connected components.
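The balanced distribution described above can be sketched as follows. Since the computing nodes here have no hard capacity, placing each sequence on the currently least-loaded node is a capacity-free analogue of the best fit decreasing rule; this is an illustrative sketch, not DIDA's implementation, and the node count and sequence lengths in the usage below are made up.

```python
import heapq

def partition_targets(lengths, num_nodes):
    """Balanced partitioning of target sequences over computing nodes:
    sort by length in decreasing order, then place each sequence on the
    currently least-loaded node (tracked with a min-heap of node loads)."""
    heap = [(0, n) for n in range(num_nodes)]   # (current load, node id)
    heapq.heapify(heap)
    nodes = [[] for _ in range(num_nodes)]
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        load, n = heapq.heappop(heap)
        nodes[n].append(idx)                    # sequence idx goes to node n
        heapq.heappush(heap, (load + lengths[idx], n))
    return nodes
```

Sorting in decreasing order first matters: placing the long sequences early leaves the short ones to even out the node loads at the end.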
Then, the partitioning procedure continues by applying the best fit decreasing strategy to the identified connected components to distribute them over computing nodes.

Index. An exact pattern search can take linear time in the target length. But when the target is very long, it is desirable to have the search time linear in the query length and independent of the target length. To do so, an index such as a suffix tree, suffix array [67, 68, 87], hash table, or FM-index [26, 27] on the long sequence or text can be created. Constructing such an index takes O(n) time and O(n log n) space, where n is the size of the target [87]. This cost is often amortized when the index is used several times, providing very fast searching of the indexed target.

For the indexing step, and on each computing node, DIDA takes the subset of target sequences from the Distribute step, and constructs an index for each subset in parallel on all computing nodes. Then, the indices are stored on each computing node to be invoked later in the alignment step. Depending on the alignment algorithm, any indexing approach can be used in this step. The reduced target size (n/P in the best-case scenario, where P is the number of partitions) allows linear scaling of the indexing time and better than linear scaling in the index space. While both would have a positive impact on alignments against dynamic targets, the latter would also help cases where the target is too large to fit into the memory of a single computer.

Dispatch. To keep track of each subset of target sequences, and to identify which read may align to which partition of the target, a set of Bloom filters is created for all partitions of target sequences. The reads are then flowed through these Bloom filters, and dispatched to the corresponding node(s). To create a Bloom filter for each partition of target sequences, all target subsequences of length b (b-mers) in the partition are scanned.
Each scanned b-mer, x, is then inserted into the corresponding Bloom filter by setting the related bits in the bit array, i.e., B[hi(x)] = 1, i = 1, 2, ..., ϕ. After constructing the Bloom filters, all possible read b-mers are queried against the Bloom filters. If at least one hit is found for a read, the read is dispatched to the corresponding node. This procedure continues until all the reads are either dispatched or discarded. By choosing the b-mer length, b, small enough, we make sure that no read sequence will be missed in the Bloom filter query step. This is done by setting b less than or equal to the minimum seed or exact match length of the aligners, l, for candidate hits. The detailed procedure for choosing b, and for loading and querying the Bloom filters, is explained below.

Loading the Bloom filter. In this stage, all target sequences in the target set T = {T1, T2, ...} are scanned using a sliding substring of length b. For each target sequence Ti, all possible |Ti| − b + 1 b-mers are scanned and then inserted into the Bloom filter after specifying the corresponding bit vector positions by computing the ϕ hash values for each b-mer (Fig. 2.2).

Figure 2.2: Loading the Bloom filter using a sliding substring of length b.

Querying the Bloom filter. In this step, all reads in the read set R = {R1, R2, ...} are queried using the same b-mer length as in the loading stage. If at least one hit is found for a read, the read is dispatched to the corresponding node. We can use sliding b-mer windows (with a step size of one base, s = 1) or jumping b-mer windows (with a step size greater than one base) to interrogate each read Rj, as explained in Fig. 2.3.

Figure 2.3: Querying the Bloom filter using a substring of length b.

Suppose that the minimum seed or exact match length to report a candidate hit for an aligner is l. We choose b ≤ l and then load the Bloom filters as described in the loading stage. In the querying stage, the b-mers of each read sequence are interrogated against all Bloom filters from the different partitions. If b = l, the reads are scanned using a sliding window with a step size of one base, i.e., s = 1. By choosing b < l, the interrogation step will be faster, at the cost of more extra dispatched reads, and the step size is computed as follows to cover all possible seeds or exact matches of length l between the target sequence Ti and read sequence Rj. If the l-mer starting at position p is covered by at least one b-mer, then to cover the next l-mer starting at position p + 1 we should have s + b ≤ l + 1. Therefore, s ≤ l − b + 1 (Fig. 2.4).

Figure 2.4: Choosing the jumping step size s for querying the Bloom filter.

In the implementation, the values of r and ϕ can be set as input parameters, and we have used 8 bits for each b-mer, r = 8, as the default. Therefore, the optimal number of hash functions that minimizes the false positive rate of the Bloom filter is ϕ = r ln 2 ≈ 5, resulting in a false positive rate slightly larger than 2%. It should be mentioned that the false positive rate does not affect the final alignment result.
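The loading and jumping-window querying just described can be sketched end to end. This is an illustrative sketch, not DIDA's implementation: the positions derived from a blake2b digest via double hashing stand in for ϕ independent hash functions, and the sequences, b, and l in the usage below are made up.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a bit array of m bits and phi hash positions."""
    def __init__(self, m, phi):
        self.m, self.phi = m, phi
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, key):
        # Derive phi positions from two 64-bit halves of one digest
        # (double hashing); stands in for phi independent hash functions.
        d = hashlib.blake2b(key.encode(), digest_size=16).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.phi)]

    def insert(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def contains(self, key):
        # Never a false negative; rarely a false positive.
        return all(self.bits[p // 8] >> (p % 8) & 1
                   for p in self._positions(key))

def load_partition(targets, b, m=1 << 16, phi=5):
    """Load one partition's Bloom filter with every b-mer of its targets."""
    bf = BloomFilter(m, phi)
    for t in targets:
        for i in range(len(t) - b + 1):
            bf.insert(t[i:i + b])
    return bf

def dispatch(read, filters, b, l):
    """Return the partitions a read must be sent to. A jumping window of
    step s = l - b + 1 still covers every possible seed of length l,
    since any l-mer contains at least one sampled b-mer. Assumes the
    read is at least b bases long."""
    s = l - b + 1
    starts = list(range(0, len(read) - b + 1, s))
    if starts[-1] != len(read) - b:      # cover the read tail as well
        starts.append(len(read) - b)
    return {p for p, bf in enumerate(filters)
            if any(bf.contains(read[i:i + b]) for i in starts)}
```

With b = l the step collapses to the sliding window s = 1; a smaller b allows fewer queries per read at the cost of more false-positive dispatches, exactly the trade-off described above.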
It only imposes more workload on nodes by dispatching reads that do not necessarily belong to those nodes as a result of false positive Bloom filter hits. Fig. 2.5 shows an example of how different values of r and ϕ affect the number of extra dispatched reads over multiple nodes.

Figure 2.5: Extra dispatched reads vs. r and ϕ. (a) Percentage of extra dispatched reads vs. the number of hash functions, ϕ, for the fixed value of r = 16 on the C. elegans dataset. (b) Percentage of extra dispatched reads vs. the number of bits per item, r, for the fixed value of ϕ = 5 on the C. elegans dataset.

Align and Merge. After constructing indices for all sets of target sequences and dispatching the reads to the computing nodes, DIDA aligns the reads against the target sequences in parallel on each node. Note that DIDA itself does not offer an alignment algorithm; instead, it can use a variety of third-party alignment tools.

In the merging step, partial alignment results, usually stored as SAM or BAM (Sequence Alignment/Map format) [55] records from different computing nodes, are gathered and combined into the final SAM/BAM output. Depending on the aligner parameters for reporting the output, different merging approaches are applied. For example, when the aligner parameters are adjusted to obtain the best unique mapped query, the merger will take that information into account to pick the best quality mapped record for each query from the related records in all partitions. With the reporting parameters set to obtain multiple alignment records, the merger procedure searches for all, or up to a predefined number of, distinct alignment records in the partial alignment results from all partitions.
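The best-unique merge rule described above can be sketched as follows; the (query name, mapping quality, record) tuples stand in for full SAM records, which is an illustrative simplification rather than DIDA's merger.

```python
def merge_best(partial_results):
    """Merge partial alignment results from all partitions, keeping the
    highest-quality mapped record for each query (best-unique strategy)."""
    best = {}
    for records in partial_results:          # one record list per node
        for qname, mapq, rec in records:
            if qname not in best or mapq > best[qname][0]:
                best[qname] = (mapq, rec)
    return {q: rec for q, (mapq, rec) in best.items()}
```

For the multiple-alignment reporting mode, the same loop would instead accumulate up to a predefined number of distinct records per query rather than a single winner.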
The pseudo-code for DIDA is presented in Program 2.1.

2.4.4 Implementation

DIDA is written in C++ and parallelized using OpenMP for multi-threaded computing on a single computing node. For distributed computing, DIDA employs the Message Passing Interface (MPI) for inter-processor communication. As input, it takes the set of target sequences and the set of queries in FASTA or FASTQ format, and the default output alignment format is SAM.

Program 2.1 The DIDA algorithm
1: Input: Set T of target sequences, set Q of query sequences
2: procedure Distribute
3:   comp ← dfs(T)  ▷ identifying connected components
4:   comp_sorted ← qsort(comp)  ▷ sorting connected components
5:   P ← best-fit-decreasing(comp_sorted)  ▷ partitioning sorted connected components
6: procedure Index
7:   for all p ∈ P in parallel do
8:     index_p ← build-index(p)  ▷ constructing an index for each partition
9:     store-index(index_p)  ▷ storing the index for the alignment step
10: procedure Dispatch
11:   for all p ∈ P in parallel do  ▷ loading the Bloom filter for each partition
12:     for all t ∈ p do
13:       for all b-mer ∈ t do
14:         insert(b-mer, BF[p])
15:   for all q ∈ Q do  ▷ flowing queries through partitions
16:     for all p ∈ P in parallel do
17:       if contain(b-mer ∈ q, BF[p]) then
18:         dispatch(q, node_p)
19: procedure Align
20:   for all p ∈ P in parallel do  ▷ aligning queries against targets on all partitions
21:     receive(q, node_p)
22:     s ← align(q, index_p)
23:     send(s, node_merger)
24: procedure Merge
25:   while receive(s, node_i) in parallel do  ▷ merging results from all partitions
26:     insert(s, priority-queue)
27:     s ← priority-queue.pop()
28:     write(s, samFile)
29: Output: File samFile

2.4.5 Evaluated tools

To evaluate the performance of DIDA, four alignment tools have been used within the proposed framework: BWA-MEM [52], Bowtie2 [49], Novoalign, and ABySS-map [98].
The first three algorithms were described in detail in Section 1.3.

Similar to BWA and Bowtie2, ABySS-map, a utility within the ABySS genome assembly software [98], constructs an FM-index for the target sequences to perform exact matches. It is mainly used for alignment tasks in the intermediate stages of the ABySS assembly pipeline. In order to speed up the alignment operations, and hence the total assembly process, ABySS-map only performs exact matching and avoids backtracking for inexact matching.

All four tools are run with their default parameters, and the parameters related to resource usage are set to utilize the maximum capacity of each computing node. For example, all tools are run in multi-threaded mode with the maximum number of threads on each node. The performance of each alignment method is compared with itself within the DIDA framework.

Results were obtained on a high-performance computer cluster consisting of 500 computing nodes, each of which has 48 GB of RAM and dual Intel Xeon X5650 2.66 GHz CPUs with 12 cores. The operating system on each node is Linux CentOS 5.4. The cluster's network fabric and file system are InfiniBand 40 Gbps and IBM GPFS, respectively.

2.5 Results

2.5.1 Performance on real data

To evaluate the performance and scalability of DIDA on real data, we downloaded publicly available sequencing data on the C. elegans genome, a draft human genome assembly, the human reference genome, and the P. glauca (white spruce) genome from the following websites.

• C. elegans:
• Human genome (NA12878):
• Human genome reference (hg19, GRCh37):
• P. glauca (accession number: ALWZ0100000000 and PID: PRJNA83435): http:/

In order to assess the performance of DIDA for each aligner on non-static targets, we assembled the reads from each dataset (except the human reference genome) using ABySS 1.3.7, and used the assembly graph in intermediate stages to guide partitioning.
We have also evaluated the performance of DIDA for each aligner on the human genome reference (hg19, GRCh37) as a static target. The detailed information for each dataset is presented in Table 2.1.

Table 2.1: Dataset specification.

Data        #targets    target length (bp)  #reads         read length (bp)
C. elegans  152,841     106,775,302         89,350,844     8,935,084,400
Human       6,020,169   3,099,949,065       1,221,224,906  123,343,715,506
hg19        93          3,137,161,264       1,221,224,906  123,343,715,506
P. glauca   70,868,549  35,816,518,982      1,079,576,520  161,936,478,000

Fig. 2.6 shows the scalability of wall-clock time and indexing peak memory usage of all four aligners on the C. elegans dataset, in the standalone case (grey bars) or within the DIDA framework on two, four, eight, and 12 nodes, as indicated. For instance, for ABySS-map, the runtime of 2154 sec without DIDA is decreased to 893 sec using DIDA on four computing nodes. From Fig. 2.6 we can see the runtime scalability and modularity of the different aligners within the DIDA protocol. Notably, we have better scalability for the slower alignment tool, Novoalign. On memory usage, all aligners scale well within the DIDA framework. For example, the peak memory usage of ABySS-map indexing goes from 1100 MB without DIDA to 238 MB with DIDA on four computing nodes.

Figure 2.6: Scalability of different aligners using DIDA for C. elegans data. The y-axis indicates the runtime/memory scalability in the [0..1] interval for different alignment tools. The scalability of each tool is shown in the standalone case and within the DIDA framework on 2, 4, 8, and 12 nodes [79].

Table 2.2: Alignment time/indexing memory for all aligners on C. elegans.

time/mem (sec/MB)
node#  abyss-map  bwa      bowtie    novoalign
1      2154/1100  945/156  1700/274  19671/589
2      1261/522   667/80   1014/163  6305/263
4      893/238    574/65   737/99    5184/151
8      723/120    526/134  595/62    4788/69
12     700/81     547/89   601/46    4464/50

Tables 2.2–2.5 summarize all results in the form of alignment-time/indexing-memory.
Regarding the accuracy of the alignment results for all aligners within the DIDA framework on multiple nodes, we have compared them with the baseline results and found the accuracy to be the same, as expected. As mentioned in the previous section, by choosing the b-mer length small enough, we make sure that no potential alignment is missed in the dispatch step, and therefore, the accuracy of the final alignment results will not be degraded. For example, the number of aligned reads out of the total 89,350,844 reads in the C. elegans dataset using ABySS-map within DIDA on 2, 4, 8, and 12 nodes is 86,851,694, which is almost the same as in the baseline or standalone mode, with the same SAM/BAM quality scores.

Table 2.3: Alignment time/indexing memory for all aligners on the human draft assembly.

time/mem (min/MB)
         abyss-map  bwa       bowtie     novoalign
1-node   652/31000  407/4400  1174/6100  59125/9300
2-node   472/15000  254/2200  611/2900   35728/4100
4-node   343/8100   216/1100  493/1600   23311/3700
8-node   253/4100   191/559   371/977    17485/2100
12-node  210/2700   181/372   296/590    13141/1200

Table 2.4: Alignment time/indexing memory for all aligners on hg19.

time/mem (min/MB)
         abyss-map  bwa       bowtie    novoalign
1-node   444/33823  379/4709  996/5528  NA
2-node   323/16911  254/2354  512/3042  NA
4-node   232/8455   205/1177  352/1417  NA
8-node   173/4227   171/588   254/667   NA
12-node  160/3170   164/441   226/495   NA

Table 2.5: Alignment time/indexing memory for all aligners on Picea glauca.

time/mem (min/GB)
         abyss-map  bwa  bowtie  novoalign
1-node   NA         NA   NA      NA
2-node   1201/184   NA   NA      NA
4-node   827/81     NA   NA      NA
8-node   638/45     NA   NA      NA
12-node  574/31     NA   NA      NA

One point that should be addressed is that by increasing the computational power, i.e., the number of computing nodes, we may not necessarily obtain better runtime scalability, due to the overhead of the dispatch and merge steps. In general, for any parallel algorithm the runtime scalability should follow the behaviour shown in Fig. 2.7.

Figure 2.7: Runtime scalability behaviour with the number of nodes.

The optimal value for the number of nodes, p, may be different for each dataset and each algorithm. For example, for a small dataset such as C. elegans, with the BWA and Bowtie2 aligners, p is between 8 and 12 nodes. For ABySS-map and Novoalign we have p > 12. The scalability performance of DIDA for the C. elegans dataset and the ABySS-map alignment method for [2, 4, 8, 16, 32] nodes is presented in Fig. 2.8. For larger datasets such as the human genome, the value of p is greater than 12. In general, the larger the dataset or the slower the alignment method, the greater the value of p. For instance, the runtime of Bowtie2 within the DIDA framework on eight computing nodes is 595 sec compared to 601 sec on 12 computing nodes. This is because of the overhead of the dispatch and merge steps.

Figure 2.8: The scalability performance of ABySS-map+DIDA for the C. elegans dataset.

Another point that should be explained is the unexpected memory scalability for BWA from 4 to 8 nodes on the C. elegans dataset. Based on the size of the target set, bwa-index automatically chooses between the bwtsw and is (induced sorting) algorithms to generate the BWT in the index construction process. For short target sets (≤ 25 Mb), bwa-index uses the is algorithm, while for long target sets (> 25 Mb) it employs bwtsw. The memory usage of bwtsw is less than that of is for a given target set. When we divide the C. elegans dataset into 8 or more partitions, the size of
each subset becomes less than 25 Mb, and hence, bwa-index automatically invokes the is algorithm. On the other hand, for 4 partitions or fewer, bwa-index uses the bwtsw algorithm. Therefore, we see the memory scalability of BWA for C. elegans not behaving as expected.

Figure 2.9: Scalability of different aligners using DIDA for the human draft assembly [79].

Fig. 2.9 shows the results on the human draft assembly data. Compared to the smaller datasets, for the human genome we see better runtime and
memory usage scalability, illustrating that DIDA shows better performance on large data due to the amortization of the overhead of the distributed paradigm. That means the overhead of the dispatch and merge steps is compensated for in large-scale indexing and alignment applications. We have also evaluated the performance of DIDA on the human reference genome (hg19) as a static target set. Fig. 2.10 and Table 2.4 show the scalability of wall-clock time and indexing peak memory usage of the different aligners, except Novoalign (due to its long runtime). As expected, the scalabilities for runtime and memory are similar to the case of the non-static target set.

Figure 2.10: Scalability of different aligners using DIDA for hg19.

Fig. 2.11 and Table 2.5 show the result for ABySS-map on the P. glauca draft assembly. Due to resource restrictions, we could not obtain the result of the ABySS-map aligner without DIDA.

Figure 2.11: Scalability of the abyss-map aligner using DIDA for P. glauca.
The required memory for constructing the index for the spruce draft assembly is about 400 GB. However, using the DIDA framework we can divide the draft spruce assembly into a number of partitions, and perform the indexing and alignment operations in a distributed way. From Table 2.5, we can easily see the scalability in runtime and indexing peak memory usage of ABySS-map within DIDA.

2.6 Discussion

Indexing large target sequences and aligning large query sets are computationally challenging. In this chapter, we described DIDA, a novel, scalable, and cost-effective parallel workflow for indexing and alignment.

The performance of DIDA was measured and evaluated when coupled with the popular alignment methods BWA, Bowtie2, Novoalign, and ABySS-map on the C. elegans, human draft genome, human reference genome, and P. glauca datasets. Compared to their baseline performance, when run through the DIDA framework with 12 nodes, BWA, Bowtie2, Novoalign, and ABySS-map use less memory (91%, 90%, 87%, and 91%, respectively) and execute faster (55%, 74%, 77%, and 67%, respectively) for a draft human genome assembly.

DIDA is an enabling technology for large-scale alignment and assembly tasks, especially for labs that have limited compute resources. It enabled our lab at the Genome Sciences Centre at the BC Cancer Agency to assemble the white spruce (P. glauca) genome [8]. For this large-scale de novo assembly task, the required memory for index construction from the draft genome on a single node is about 400 GB of RAM, which requires the use of a special big-memory machine, and may be prohibitive for many labs. Using the DIDA framework, the indexing was performed in a distributed way on 12 low-memory compute nodes with a peak memory usage of 31 GB on any one node. Therefore, DIDA made this huge indexing, alignment, and assembly task feasible, scalable, and modular [43, 106]. DIDA has also enabled our lab to perform many experiments at a time without
requisitioning large memory machines.

Chapter 3

ntHash: recursive nucleotide hashing

3.1 Publication note

The work described in this chapter was previously published in the Bioinformatics journal in 2016 [77]. The resulting algorithm and software tool were used in three published journal papers, in BMC Medical Genomics in 2015 [104], Genome Research in 2017 [42], and Bioinformatics in 2017 [78]. The work described in this chapter is the main work of the author, under the supervision of his PhD supervisor, Inanc Birol. The author designed the algorithm, implemented the software tool, performed the experiments, and wrote the manuscript. The co-authors on this work helped in performing the experiments.

3.2 Author summary

Hashing has been widely used for indexing, querying, and rapid similarity search in many bioinformatics applications, including sequence alignment, genome and transcriptome assembly, k-mer counting, and error correction. Hence, expediting hashing operations would have a substantial impact in the field, making bioinformatics applications faster and more efficient. In this chapter, we present ntHash, a hashing algorithm tuned for processing DNA/RNA sequences.
It performs best when calculating hash values for adjacent k-mers in an input sequence, operating an order of magnitude faster than the best-performing alternatives in typical use cases. ntHash is available online and is free for academic use.

3.3 Introduction

Hashing is a common function across many informatics applications, and refers to mapping an input key value of arbitrary size to an allocated memory location of predetermined size. Among other uses, it is an enabling concept for rapid search operations, and forms the backbone of Internet search engines. In bioinformatics, it has many applications, including sequence alignment, genome and transcriptome assembly, RNA-seq expression quantification, k-mer counting, and error correction. Large-scale sequence analysis often relies on cataloguing or counting consecutive k-mers in DNA/RNA sequences for indexing, querying, and similarity searching. An efficient way of implementing such operations is through the use of hash-based data structures, such as hash tables or Bloom filters. Therefore, improving the performance of hashing algorithms would have a great impact on a wide range of bioinformatics tools. Here, we present ntHash, a fast function for recursively computing hash values for consecutive k-mers.

3.4 Method

We propose an algorithm to hash consecutive k-mers in a sequence r of length l > k, using a recursive function, f, where the hash value of the new k-mer is computed from the hash value of the previous k-mer:

H(k-mer_i) = f(H(k-mer_{i−1}), r[i+k−1], r[i−1])    (3.1)

Such a recursive function, also called a rolling hash function, offers huge improvements in performance when hashing consecutive k-mers. This has been previously described and investigated for n-gram hashing for string matching, text indexing, and information retrieval [17, 32, 45, 51].
In this work, we have customized the concept for hashing all k-mers of a DNA sequence, and implemented an adapted version of the cyclic polynomial hash function, ntHash, to efficiently calculate normal or canonical hash values for k-mers in DNA sequences. In hashing by cyclic polynomials, ntHash uses barrel shifts instead of multiplications to make the process faster. To compute hash values for all k-mers of a sequence r of length l, we first hash the initial k-mer, k-mer_0, as follows:

H(k-mer_0) = rol^{k−1} h(r[0]) ⊕ rol^{k−2} h(r[1]) ⊕ ... ⊕ h(r[k−1])    (3.2)

where rol is a cyclic binary left rotation, ⊕ is the bit-wise EXCLUSIVE OR (XOR) operator, and h(.) is a seed table, in which the letters of the DNA alphabet, Σ = {A, C, G, T}, are assigned different random 64-bit integers. The hash values for all subsequent k-mers, k-mer_1, ..., k-mer_{l−k}, are then computed recursively as follows:

H(k-mer_i) = rol^1 H(k-mer_{i−1}) ⊕ rol^k h(r[i−1]) ⊕ h(r[i+k−1])    (3.3)

We note that the time complexity of ntHash for sequence r is O(k + l), compared to the O(kl) complexity of regular hash functions. In some bioinformatics applications, one might be interested in computing the hash values of the forward and reverse-complement sequences of a k-mer. To do so, we add to the seed table integers that correspond to the complement bases, such that the table indices of base/complement-base pairs are separated by a fixed offset. Using this table, we can easily compute the hash value for the reverse-complement (as well as the canonical form) of a sequence efficiently, without actually reverse-complementing the input sequence, as follows:

H̄(k-mer_0) = h(r[0]+d) ⊕ rol^1 h(r[1]+d) ⊕ ... ⊕ rol^{k−1} h(r[k−1]+d)    (3.4)

The reverse-complement hash values for all subsequent k-mers, k-mer_1, ..., k-mer_{l−k}, are then computed recursively as follows:

H̄(k-mer_i) = ror^1 H̄(k-mer_{i−1}) ⊕ ror^1 h(r[i−1]+d) ⊕ rol^{k−1} h(r[i+k−1]+d)    (3.5)

where ror is a cyclic binary right rotation, and d is the offset of the complement base in the seed table h(.).
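The forward recurrence of equations (3.2) and (3.3) can be sketched as follows. This is an illustrative sketch, not ntHash itself: the 64-bit seed constants below are arbitrary stand-ins for the published seed table, and only the forward (non-canonical) hash is shown.

```python
MASK64 = (1 << 64) - 1

def rol(x, s):
    """64-bit cyclic left rotation (a barrel shift)."""
    s %= 64
    return ((x << s) | (x >> (64 - s))) & MASK64

# Arbitrary distinct 64-bit seeds per base; illustrative stand-ins for
# the fixed seed table h(.) described in the text.
SEED = {"A": 0x3C8BFBB395C60474, "C": 0x3193C18562A02B4C,
        "G": 0x20323ED082572324, "T": 0x295549F54BE24456}

def hash_first(kmer):
    """Equation (3.2): H(k-mer_0) = XOR over rol^(k-1-j) h(kmer[j])."""
    k = len(kmer)
    h = 0
    for j, base in enumerate(kmer):
        h ^= rol(SEED[base], k - 1 - j)
    return h

def hash_kmers(seq, k):
    """Yield hashes for all k-mers of seq via the rolling recurrence of
    equation (3.3): H_i = rol^1 H_{i-1} XOR rol^k h(seq[i-1]) XOR
    h(seq[i+k-1]). O(k + l) total work instead of O(kl)."""
    h = hash_first(seq[:k])
    yield h
    for i in range(1, len(seq) - k + 1):
        h = rol(h, 1) ^ rol(SEED[seq[i - 1]], k) ^ SEED[seq[i + k - 1]]
        yield h
```

Because rotation distributes over XOR, each rolled value equals the from-scratch hash of the same k-mer; rotating the previous hash left by one cancels the outgoing base's term and makes room for the incoming one.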
Additionally, ntHash provides a fast way to calculate multiple hash values for a given k-mer, without iterating the full process for each hash value. To do so, a single hash value is computed from a given k-mer, and then each extra hash value is computed by a few more multiplication, shift, and XOR operations on the initial hash value. This is very useful for certain bioinformatics applications, such as those that utilize the Bloom filter data structure.

Experimental results show a significant speed improvement over traditional methods, while maintaining a near-ideal hash value distribution. In the Results section, we have used sequencing data on the human individual NA19238 from the Illumina Platinum Genomes project, as well as simulated random DNA sequences:

• Real data of human individual NA19238 from the Illumina Platinum Genomes project
• Simulated random DNA sequences generated using seqgen in the ntHash package

3.5 Results

A good hash function should generate hash values that are uniformly distributed across the target domain, resulting in fewer collisions. To evaluate the uniformity of ntHash, one way is to use the correlation coefficients between the bits of 64-bit hash values. That is, if each bit xi, i = 1, ..., 64, in a 64-bit hash value vector X = {x1, x2, ..., x64} is an independent random variable, then there should be no correlation between them. To test this, we first generated sets of hash values on randomly generated DNA sequences. We next computed the correlation coefficient matrix for each sample set, and performed significance tests with Bonferroni correction for the hypothesis that the correlation coefficients are consistent with zero. Fig. 3.4 shows the correlation coefficients of two sample sets of size 100 (below diagonal) and 100,000 (above diagonal). The plot shows natural statistical fluctuations for smaller sample sets. The correlations dissipate rapidly for large sample sets (Figs. 3.1–3.5).
Comparing the computed correlation coefficients with a confidence interval around the theoretical zero correlation shows that, for all hash functions tested, the number of observations outside the 99.7% confidence interval is around 0.3%, in agreement with theoretical expectations.

Figure 3.1: Correlation coefficient plots of cityhash for one (right) and three (left) hashes on a small (100 data points, below diagonal) and a large sample set (100,000 data points, above diagonal).

We have also evaluated the uniformity of different hash methods by utilizing a Bloom filter data structure. We first load a Bloom filter with a number of unique k-mers, and then query the Bloom filter with another set of unique k-mers. The results show that the false positive rates of ntHash, as well as of the other hash functions tested, comply with the theoretical false positive rate for Bloom filters, indicating the uniform distribution of the hash values generated (Tables 3.1-3.8).

Figure 3.2: Correlation coefficient plots of murmur for one (right) and three (left) hashes on a small (100 data points, below diagonal) and a large sample set (100,000 data points, above diagonal).

Figure 3.3: Correlation coefficient plots of xxhash for one (right) and three (left) hashes on a small (100 data points, below diagonal) and a large sample set (100,000 data points, above diagonal).

Figure 3.4: Correlation coefficient plots of ntHash for one (right) and three (left) hashes on a small (100 data points, below diagonal) and a large sample set (100,000 data points, above diagonal).

We have compared the uniformity and runtime performance of the ntHash algorithm with the three most widely used hash methods in bioinformatics:

• cityhash
• murmur
• xxhash

In all experiments, we first load the Bloom filter with 100 long sequences of length 5,000,000 bp, allocating 8 bits/k-mer in the Bloom filter. Next, we query 4,000,000 real reads of length 250 bp with k-mers of sizes 50, 150, and 250.
The theoretical approximate false positive rates for h = 1, 3, and 5 are 11.75%, 3.06%, and 2.17%, respectively.

Table 3.1: Bloom filter evaluation for cityhash on real data.

k    Hash#  Set bit#       Query#       False hit#  FPR%
50   1      470,017,244    804,000,000  94,378,728  11.74
50   3      1,250,823,778  804,000,000  24,570,368  3.06
50   5      1,858,906,097  804,000,000  17,400,323  2.16
150  1      469,998,731    404,000,000  47,919,015  11.86
150  3      1,250,821,595  404,000,000  12,338,460  3.05
150  5      1,858,931,600  404,000,000  8,748,217   2.17
250  1      469,986,185    4,000,000    472,181     11.8
250  3      1,250,802,647  4,000,000    121,776     3.04
250  5      1,858,880,965  4,000,000    85,939      2.15

Table 3.2: Bloom filter evaluation for murmur on real data.

k    Hash#  Set bit#       Query#       False hit#  FPR%
50   1      470,007,281    804,000,000  94,295,129  11.73
50   3      1,250,852,012  804,000,000  24,560,434  3.05
50   5      1,858,955,242  804,000,000  17,381,581  2.16
150  1      470,002,596    404,000,000  47,411,326  11.74
150  3      1,250,839,667  404,000,000  12,338,781  3.05
150  5      1,858,923,413  404,000,000  8,750,161   2.17
250  1      469,997,041    4,000,000    469,325     11.73
250  3      1,250,795,983  4,000,000    122,051     3.05
250  5      1,858,914,749  4,000,000    86,649      2.17

Table 3.3: Bloom filter evaluation for xxhash on real data.

k    Hash#  Set bit#       Query#       False hit#  FPR%
50   1      470,016,553    804,000,000  94,292,232  11.73
50   3      1,250,840,753  804,000,000  24,518,667  3.05
50   5      1,858,942,822  804,000,000  17,394,611  2.16
150  1      469,995,918    404,000,000  47,411,924  11.74
150  3      1,250,830,011  404,000,000  12,340,660  3.05
150  5      1,858,930,859  404,000,000  8,743,269   2.16
250  1      469,986,739    4,000,000    469,547     11.74
250  3      1,250,798,629  4,000,000    122,788     3.07
250  5      1,858,874,277  4,000,000    86,494      2.16
Table 3.4: Bloom filter evaluation for ntHash on real data.

k    Hash#  Set bit#       Query#       False hit#  FPR%
50   1      469,999,521    804,000,000  94,251,857  11.72
50   3      1,250,842,928  804,000,000  24,535,428  3.05
50   5      1,858,926,469  804,000,000  17,420,006  2.17
150  1      469,998,307    404,000,000  47,418,185  11.74
150  3      1,250,807,514  404,000,000  12,333,989  3.05
150  5      1,858,901,461  404,000,000  8,743,479   2.16
250  1      469,989,680    4,000,000    469,293     11.74
250  3      1,250,786,386  4,000,000    122,542     3.06
250  5      1,858,855,900  4,000,000    86,775      2.17

Table 3.5: Bloom filter evaluation for cityhash on simulated data.

k    Hash#  Set bit#       Query#       False hit#  FPR%
50   1      470,004,731    804,000,000  94,467,037  11.75
50   3      1,250,839,836  804,000,000  24,581,360  3.06
50   5      1,858,937,046  804,000,000  17,433,637  2.17
150  1      470,001,865    404,000,000  47,475,319  11.75
150  3      1,250,810,005  404,000,000  12,352,225  3.06
150  5      1,858,911,301  404,000,000  8,757,714   2.17
250  1      469,989,444    4,000,000    469,599     11.74
250  3      1,250,773,973  4,000,000    122,939     3.07
250  5      1,858,866,870  4,000,000    87,331      2.18

Table 3.6: Bloom filter evaluation for murmur on simulated data.

k    Hash#  Set bit#       Query#       False hit#  FPR%
50   1      470,017,505    804,000,000  94,484,767  11.75
50   3      1,250,827,988  804,000,000  24,589,864  3.06
50   5      1,858,925,731  804,000,000  17,424,248  2.17
150  1      469,997,827    404,000,000  47,469,314  11.75
150  3      1,250,784,464  404,000,000  12,347,235  3.06
150  5      1,858,877,467  404,000,000  8,757,213   2.17
250  1      469,992,778    4,000,000    469,332     11.73
250  3      1,250,777,246  4,000,000    121,813     3.05
250  5      1,858,869,175  4,000,000    86,187      2.15
Table 3.7: Bloom filter evaluation for xxhash on simulated data.

k    Hash#  Set bit#       Query#       False hit#  FPR%
50   1      470,008,758    804,000,000  94,470,794  11.75
50   3      1,250,821,114  804,000,000  24,585,395  3.06
50   5      1,858,965,631  804,000,000  17,434,877  2.17
150  1      469,991,012    404,000,000  47,464,074  11.75
150  3      1,250,815,185  404,000,000  12,350,539  3.06
150  5      1,858,920,189  404,000,000  8,751,526   2.17
250  1      469,997,137    4,000,000    470,919     11.77
250  3      1,250,793,131  4,000,000    122,493     3.06
250  5      1,858,893,379  4,000,000    86,618      2.17

Table 3.8: Bloom filter evaluation for ntHash on simulated data.

k    Hash#  Set bit#       Query#       False hit#  FPR%
50   1      470,001,008    804,000,000  94,473,191  11.75
50   3      1,250,822,696  804,000,000  24,594,765  3.06
50   5      1,858,911,305  804,000,000  17,428,958  2.17
150  1      470,009,333    404,000,000  47,464,574  11.75
150  3      1,250,812,436  404,000,000  12,351,716  3.06
150  5      1,858,915,330  404,000,000  8,757,143   2.17
250  1      469,979,683    4,000,000    469,658     11.74
250  3      1,250,795,888  4,000,000    122,625     3.07
250  5      1,858,897,893  4,000,000    86,664      2.17

Usually, we compute a single hash value when we hash a given k-mer. However, some bioinformatics applications utilizing the Bloom filter data structure require computing multiple hash values for a given k-mer. Existing hash methods do this by repeating the whole hashing procedure on a given k-mer with different initial seeds, whereas ntHash provides a version that computes the multiple hash values efficiently. Given the number of required hash values, we first compute a hash value for the k-mer in the regular way using the NT64 function. We then perturb the computed 64-bit hash value with a few additional operations, without repeating the whole hashing procedure, to obtain the required number of hash values. The detailed procedure for the multi-hash version of ntHash is explained below. The uniformity results for the multi-hash version of ntHash and the other hash functions are presented in Figs. 3.1-3.4.

Fig. 3.6 shows the runtimes for hashing k-mers of different lengths in 5 million DNA sequences of length 250 bp. In the inset, we see ntHash outperforms the other algorithms when hashing more than two consecutive k-mers in a DNA sequence. The runtime of all hash methods for hashing one billion k-mers of lengths {50, 100, 150, 200, 250} is shown in Fig. 3.7. We have also calculated the runtime for canonical hashing of 1 billion k-mers, presented in Fig. 3.8. From these results, we see ntHash computes the regular and canonical hash values very quickly. Fig. 3.9 illustrates a typical use case of computing multiple hash values for 50-mers in DNA sequences of length 250 bp, and shows that ntHash is over 20× faster than the closest competitor.

3.6 Discussion

In this work, we introduced the ntHash algorithm for computing regular or canonical hash values for all possible k-mers in a DNA/RNA sequence. Being a recursive, or rolling, hash function, ntHash calculates the hash value of a new k-mer from the hash value of the previous k-mer. This boosts the performance of ntHash, with a significant speed improvement in comparison with conventional hash methods in bioinformatics, while maintaining a near-ideal distribution of the generated hash values. Moreover, by providing a fast way to compute multiple hash values for a given k-mer, ntHash was demonstrated to be very useful for applications utilizing the Bloom filter data structure. ntHash has been employed in a series of algorithms and software tools to expedite hashing operations. Examples include the current and new versions of the ABySS genome assembly software package [42, 98], the new version of BioBloomTools [16], Konnector [103, 104], ChopStitch [46], and our new algorithm for cardinality estimation, ntCard, which will be explained in the next chapter.
Figure 3.5: Correlation coefficients of the hash algorithms for different sample set sizes (boxplots for sample sizes 10, 100, 1,000, 10,000, 100,000, and 1,000,000). We see the correlation between hash value bits is near-ideal for ntHash as well as for the state-of-the-art hashing methods cityhash, murmur, and xxhash.

Figure 3.6: Runtime for hashing a 250 bp DNA sequence with k ranging from 50 to 250. ntHash outperforms all other hash methods when hashing more than two k-mers, i.e. k < 249.

Figure 3.7: CPU time for hashing 1 billion k-mers of lengths 50, 100, 150, 200, and 250. ntBase is the hash function based on the non-recursive equation of ntHash.

Figure 3.8: CPU time for canonical hashing of 1 billion k-mers of lengths 50, 100, 150, 200, and 250. ntBase is the hash function based on the non-recursive equation of ntHash.

Figure 3.9: Comparing the multi-hashing runtime of ntHash with the leading hash functions for one billion 50-mers. ntHash performs over 20× faster than the closest competitor, cityhash. Grey, orange, and blue bars refer to the calculation of one, three, and five hash values, respectively.

Chapter 4

ntCard: a streaming algorithm for cardinality estimation in genomics data

4.1 Publication note

The work described in this chapter was previously published in the Bioinformatics journal in 2017 [78]. The resulting algorithm and software tool from this work were used in one published journal paper in Genome Research in 2017 [42], and one submitted journal paper in Bioinformatics in 2017 [46]. The work described in this chapter is the main work of the author, under the supervision of his PhD supervisor, Inanc Birol. The author designed the algorithm, implemented the software tool, performed the experiments, and wrote the manuscript. The co-author on this work, Hamza Khan, helped in performing the experiments.

4.2 Author summary

Many bioinformatics algorithms are designed for the analysis of sequences of some uniform length, conventionally referred to as k-mers. These include de Bruijn graph assembly methods and sequence alignment tools. An efficient algorithm to enumerate the number of unique k-mers, or even better, to build a histogram of k-mer frequencies, would be desirable for these tools and their downstream analysis pipelines. Among other applications, estimated frequencies can be used to predict genome sizes, measure sequencing error rates, and tune runtime parameters for analysis tools. However, calculating a k-mer histogram from large volumes of sequencing data is a challenging task. In this chapter, we present ntCard, a streaming algorithm for estimating the frequencies of k-mers in genomics datasets. At its core, ntCard uses the ntHash algorithm to efficiently compute hash values for streamed sequences. It then samples the calculated hash values to build a reduced-representation multiplicity table describing the sample distribution. Finally, it uses a statistical model to reconstruct the population distribution from the sample distribution.
Our benchmarks demonstrate ntCard as a potentially enabling technology for large-scale genomics applications. ntCard is implemented in C++ and is released under the GPL-3 license. The software and source code are freely available online.

4.3 Introduction

Many bioinformatics applications rely on counting or cataloguing fixed-length substrings of DNA/RNA sequences, called k-mers, generated from reads coming out of high-throughput sequencing platforms. This is a very important step in de novo assembly [11, 61, 91, 98, 109], multiple sequence alignment [22], error correction [35, 72], repeat detection [96], SNP detection [82, 94], and RNA-seq quantification analysis [85]. The problem of counting k-mers has been well studied in the literature, including the Jellyfish [71], BFCounter [74], DSK [88], and KMC [21] algorithms. These tools need considerable computational resources, and can be improved in terms of the memory, disk space, and runtime required for processing and obtaining the histogram of k-mer frequencies in large sets of DNA/RNA sequences. During the past few years there have been many studies to improve the memory and time requirements of the k-mer counting problem. While a naïve approach would keep track of all possible k-mers in the input datasets, employing a succinct and compact data structure [18], or a disk-based workflow [21, 88], can reduce memory usage. Although the improved methods with efficient implementations have a considerable impact on memory and time usage, they require processing all possible k-mers base-by-base and storing them in memory or on disk. Therefore, the time and memory requirements of these efficient solutions grow linearly with the input data size, and they can take hours or days, using terabytes of memory, for large datasets.
In recent works by Chikhi and Medvedev [13] and Melsted and Halldórsson [75], the authors proposed methods to approximate the k-mer coverage histogram in large sets of DNA/RNA sequences, which are about an order of magnitude faster, and require only a small fraction of the memory, compared with previous k-mer counting algorithms [75]. However, these methods can still take a considerable amount of time to process terabytes of high-throughput sequencing data, or may not provide the full histogram of k-mer abundances.

In this article, we present an efficient streaming algorithm, ntCard, for estimating the k-mer coverage histogram of large high-throughput sequencing genomics data. The proposed method requires a fixed amount of memory, and runs in linear time with respect to the size of the input dataset. At its core, ntCard uses the ntHash algorithm [77] to efficiently compute hash values for streamed sequences. It samples the calculated hash values to build a reduced-representation multiplicity table describing the sample distribution, which it uses to statistically infer the population distribution. We compare the histograms estimated by ntCard with the exact k-mer counts of DSK [88], and illustrate that the ntCard estimations are approximations within guaranteed intervals. We also compare the accuracy, runtime, and memory usage of ntCard with the best available exact and approximate algorithms for k-mer count frequencies, such as DSK [88], KmerGenie [13], KmerStream [75], and Khmer [41].

4.4 Method

Let us first introduce the problem background and notation on streaming algorithms for identifying distinct elements. We will then derive a statistical model to estimate the k-mer frequencies, and outline the resulting model.

4.4.1 Background, notations, and definitions

Streaming algorithms are algorithms for processing data that are too large to be stored in available memory, but can be examined online, typically in a single pass.
There has been growing interest in streaming algorithms across a wide range of applications in different domains dealing with massive amounts of data. Examples include the analysis of network traffic, database transactions, sensor networks, and satellite data feeds [19, 20, 40].

Here, we propose a streaming algorithm to estimate the frequencies of k-mers in the massive data produced by high-throughput sequencing technologies. Let f_i denote the number of distinct k-mers that occur i times in a given sequencing dataset. The k-mer frequency histogram is then the list of f_i, i ≥ 1. The kth frequency moment F_k is defined as

    F_k = Σ_{i=1}^{∞} i^k f_i    (4.1)

The numbers F_k provide useful statistics on the input sequences. For example, F_0 denotes the number of distinct k-mers appearing in the stream sequences, F_1 is the total number of k-mers in the input datasets, F_2 is the Gini index of variation that can be used to show the diversity of k-mers, and F_∞ is related to the most frequent k-mer in the input reads.

There are streaming algorithms in the literature for estimating different kth frequency moments. The problem of estimating F_0, also known as distinct elements counting, has been addressed by the FM-Sketch [28] and K-Minimum Value [4] algorithms. An F_2 estimation algorithm was first proposed by Alon et al. [2], and F_∞ was investigated by Cormode and Muthukrishnan [20]. These algorithms can perform their estimations within a factor of (1 ± ε) with a set probability, using O(ε^{−2} log(D)) operations, where D is the number of distinct k-mers in the dataset [75].
After calculating the multiplicity table for the sampled k-mers, it uses this table to infer the population histogram through a statistical model.

Hashing

ntCard utilizes the ntHash algorithm [77] to efficiently compute the canonical hash values for all k-mers in DNA sequences. ntHash is a recursive, or rolling, hash function in which the hash value for the next k-mer in an input sequence of length l (l ≥ k) is derived from the hash value of the previous k-mer, as described in the previous chapter. We have shown earlier that ntHash has a significant speed improvement over conventional approaches, while maintaining a near-ideal hash value distribution [77].

Figure 4.1: 64-bit hash value generated by ntHash. The s left bits are used for sampling the k-mers in the input datasets, and the r right bits are used as resolution bits for building the reduced multiplicity table, with r + s < 64.

Sampling and building the multiplicity table

After computing the hash values for k-mers in DNA streams using ntHash, ntCard segments the 64-bit hash values into three parts, as shown in Figure 4.1. It uses the left s bits of the 64-bit hash value as its sampling criterion, picking k-mers for which these bits are zero, resulting in an average sampling rate of 1/2^s. Earlier, we demonstrated that ntHash bits are independently and uniformly distributed [77]. Consequently, on average 1/2 of the hash values start with a 0, 1/4 of them start with two zeros, and 1/2^s start with s zeros. Therefore, by selecting the hash values starting with s zeros, we build our sample with a cardinality of 1/2^s of the full set.

Also building on the statistical properties of the computed hash values, we use the right r bits, called the resolution bits, to build a k-mer multiplicity table for the sampled k-mers. To do so, we use an array of size 2^r to keep the observed k-mer counts. The resolution bits of each hash value serve as the index into the count array.
We note that each entry in the array is an approximate count of the sampled k-mers, since there may be multiple k-mers with the same r-bit pattern, resulting in count collisions.

Ideally, one would want a hash function that generates a unique hash value for every k-mer, say using an infinite number of bits. Also, if one had access to infinite memory to hold all these values, the ideal values for s and r would be zero and infinity, respectively. Since in practice we do not have access to such resources, we use 64-bit hash values, subsample our dataset by 1/2^s, and tabulate 2^r patterns (some with zero counts). To infer the population histogram from these measurements, we derived the following statistical model.

Let us denote the count array with 2^r entries by t^(r). If we were to extend our resolution to r + 1, we would obtain a new count array, t^(r+1), with 2^(r+1) entries, twice the size of the current array t^(r). There is a relation between the entries of the current array t^(r) and the new count array t^(r+1). By folding the first half of t^(r+1) with its second half, we can construct t^(r) using

    t_n^(r) = t_n^(r+1) + t_{2^r+n}^(r+1),  ∀n ∈ [0, ..., 2^r − 1]    (4.2)

where t_n^(r) denotes the count for entry n in the table t^(r).

Next, if we let p_i^(r) be the relative frequency of counts i ≥ 0 in table t^(r), with Σ_{i=0}^{∞} p_i^(r) = 1, we can make the following observations. An entry t_i^(r) = 0 is only possible if t_i^(r+1) = 0 and t_{2^r+i}^(r+1) = 0. Since there is no a priori reason why the first and second halves of t^(r+1) should have different count distributions, we can relate the frequencies of zero counts in the two tables through

    p_0^(r) = (p_0^(r+1))^2    (4.3)

Similarly, a count of one in t^(r) is only possible if an entry in the first half of t^(r+1) is one and the corresponding entry in the second half is zero, or vice versa,
which we can write mathematically as

    p_1^(r) = 2 p_0^(r+1) p_1^(r+1)    (4.4)

This can be generalized as

    p_i^(r) = Σ_{i'=0}^{i} p_{i'}^(r+1) p_{i−i'}^(r+1)    (4.5)

Note that equations (4.3)-(4.5) can be solved for p_i^(r+1) through the recursive formula

    p_i^(r+1) = (p_0^(r))^{1/2}                                                  for i = 0
              = p_1^(r) / (2 p_0^(r+1))                                          for i = 1
              = (1 / (2 p_0^(r+1))) (p_i^(r) − Σ_{i'=1}^{i−1} p_{i'}^(r+1) p_{i−i'}^(r+1))   for i > 1    (4.6)

Now, just like the extension from a resolution of r to r + 1, extension to r + x is also mathematically tractable. Ultimately, we are interested in relating the observed count frequencies p_i^(r) to the count frequencies p_i^(∞), and in calculating the k-mer multiplicity frequencies

    f̂_i = p_i^(∞) / (1 − p_0^(∞))    (4.7)

For example, for i = 1, this can be calculated as

    f̂_1 = lim_{x→∞} [ (p_1^(r)/2^x) (p_0^(r))^{(2^x−1)/2^x} ] / [ 1 − (p_0^(r))^{1/2^x} ] = −p_1^(r) / (p_0^(r) ln p_0^(r))    (4.8)

and for i = 2 as

    f̂_2 = [ −p_0^(r) p_2^(r) + (1/2)(p_1^(r))^2 ] / [ (p_0^(r))^2 ln p_0^(r) ]    (4.9)

In general, for f̂_i, i ≥ 1, we can write the following equation

    f̂_i = [1 / ((p_0^(r))^i ln p_0^(r))] Σ_{j=0}^{i−1} [ (−1)^{i+j} (p_0^(r))^j / (i−j) ]
          × ( Σ_{(l,u) ∈ Z^2 s.t. Σ_k u_k = i−j, Σ_k l_k u_k = i}  Π_{k=1}^{|u|} C(i − j − Σ_{k'=0}^{k−1} u_{k'}, u_k) (p_{l_k}^(r))^{u_k} )    (4.10)

where u_0 = 0, u_k ≠ u_{k'} for all k ≠ k', and |u| = argmax_k {u_k}.

This complex-looking formula can also be written in the following recursive form

    f̂_i = −p_i^(r) / (p_0^(r) ln p_0^(r)) − (1/i) Σ_{j=1}^{i−1} j p_{i−j}^(r) f̂_j / p_0^(r)    (4.11)

The two terms of this equation can be interpreted as follows. The first term corresponds to count frequencies i in table t^(r), assuming none of the entries collided with any non-zero entries through the folding rounds from lim_{x→∞}(r + x) down to r. The second term is a correction to the first term, accounting for all collisions of (i − j), 0 < j < i, and j, the result of which is a count frequency of i.

Now, we can derive an estimate for F_0 by an approach similar to the one we used for the relative frequencies:

    F_0 = lim_{x→∞} 2^s (1 − p_0^(r+x)) 2^{r+x}    (4.12)

This formula has three terms inside the limit, the first of which, 2^s, corrects for the subsampling we have performed.
The second term is the frequency of non-zero entries in table t^(r+x), and the third term is the normalizing factor that was used to convert occurrences of counts in this table to their frequencies, p_i^(r+x). Taking this limit then gives

    F_0 = −2^{s+r} ln p_0^(r)    (4.13)

Using equations (4.11) and (4.13), we can obtain the k-mer coverage frequencies as outlined in Algorithm 4.1, with a binomial proportion confidence interval. The workflow of the ntCard algorithm is also presented in Fig. 4.2.

Program 4.1 The ntCard algorithm
 1: function Update(k-mer)
 2:   for each read seq do
 3:     for each k-mer in seq do
 4:       h ← ntHash(k-mer)              ▷ Compute 64-bit h using ntHash
 5:       if h_{64:64−s+1} = 0^s then    ▷ Check the s left bits of h
 6:         i ← h_{r:1}                  ▷ r is the resolution parameter
 7:         t_i ← t_i + 1
 8: function Estimate
 9:   for i ← 1 to 2^r do
10:     p_{t[i]} ← p_{t[i]} + 1
11:   for i ← 1 to t_max do
12:     p_i ← p_i / 2^r
13:   F_0 ← −ln p_0 × 2^{s+r}            ▷ F_0 estimate
14:   for i ← 1 to t_max do
15:     f̂_i ← −p_i/(p_0 ln p_0) − (1/i) Σ_{j=1}^{i−1} j p_{i−j} f̂_j / p_0   ▷ Relative frequency estimates
16:   for i ← 1 to t_max do
17:     f_i ← f̂_i × F_0                  ▷ f_i estimates
18:   return f, F_0

Figure 4.2: The workflow of the ntCard algorithm for estimating the k-mer coverage frequencies and the total number of distinct k-mers in DNA sequence streams. The 64-bit ntHash value of each k-mer is split: the s left bits set the sampling rate, and the r right bits index the array of count frequencies p_i = t_i / 2^r, which the statistical model extends to relative frequency estimates f̂_i and the population histogram f_i = f̂_i F_0, with F_0 = −ln(p_0) × 2^{r+s}.

4.4.3 Implementation details

Selection of the resolution parameter, r, represents a tradeoff between accuracy and computational resources.
While r should not be too low, to avoid poor estimates of the frequency counts, it should not be too high either, to keep the peak memory usage feasible. In our experience, values r > 20 work well for accurate estimates, and memory usage peaks above 1 GB for r ≥ 28. We have set the default value to r = 27. We have also observed that estimations based only on t^(r), without applying the statistical model, have higher error rates due to count collisions, as expected.

If input reads or sequences contain ambiguous bases, or characters other than {A, C, G, T}, ntCard ignores them in the hashing stage. This is performed as a functionality of the ntHash algorithm. When ntHash encounters a non-ACGT character, it jumps over the ambiguous base and restarts the hashing procedure from the first valid k-mer containing only ACGT characters.

ntCard is implemented in C++ and parallelized for multi-threading on a single compute node with OpenMP. As input, it accepts sets of sequences in FASTA, FASTQ, SAM, or BAM formats. The input sequences can also be in compressed formats such as .gz and .bz. ntCard is distributed and released under the GNU General Public License (GPL-3). Documentation, software, and source code are all freely available online.

4.5 Results

4.5.1 Experimental setup

To evaluate the performance and accuracy of ntCard, we downloaded the following publicly available sequencing data.

• The Genome in a Bottle (GIAB) project [110] sequenced seven individuals using a large variety of sequencing technologies. We downloaded 2x250 bp paired-end Illumina whole genome shotgun sequencing data for the Ashkenazi mother (HG004).

• We downloaded a second H. sapiens dataset from the 1000 Genomes Project, for the individual NA19238 (SRA:ERR309932).

• To represent a larger problem, we used the white spruce (Picea glauca) genome sequencing data representing the genotype PG29 [106] (accession number: ALWZ0100000000 and PID: PRJNA83435).

The information on each dataset, including the number of sequences, read length, total number of bases, and total input size, is presented in Table 4.1. To evaluate the performance of ntCard, we compare it to KmerGenie, KmerStream, and Khmer in terms of accuracy of estimates, runtime, and memory usage. We also compare the accuracy of our results with DSK, which is an exact k-mer counting tool. Results were obtained on computing nodes with 48 GB of RAM and dual Intel Xeon X5650 2.66 GHz CPUs with 12 cores. The operating system on each node was Linux CentOS 5.4.

Table 4.1: Dataset specification.

Dataset   Read number    Read length  Total bases        Size
HG004     868,593,056    250 bp       217,148,264,000    480 GB
NA19238   913,959,800    250 bp       228,489,950,000    500 GB
PG29      6,858,517,737  250 bp       1,714,629,434,250  2.4 TB

All five tools are run with their default parameters, and the parameters related to resource usage are set to utilize the maximum capacity of each computing node. For example, all tools are run in multi-threaded mode with the maximum number of threads available on the computer.

4.5.2 Accuracy

In Tables 4.2-4.4, we see the results of DSK, ntCard, KmerGenie, KmerStream, and Khmer for the number of distinct k-mers, F0, as well as the number of singletons, f1, on the three datasets. We compared the accuracy of the estimated counts from ntCard, KmerGenie, KmerStream, and Khmer with the exact counts from DSK. We see that, for all k-mer lengths, ntCard computes F0 and f1 for all three datasets with error rates less than 0.7%. In comparison, the error rates of KmerGenie, KmerStream, and Khmer can be up to 17%, 9%, and 11%, respectively.
Note that the Khmer algorithm only estimates the total number of distinct k-mers, F0.

Compared to ntCard and KmerStream, the Khmer and KmerGenie estimates for the number of distinct k-mers, F0, have the highest error rates (>7%) on the PG29 data; though, for HG004 and NA19238, Khmer estimates F0 with lower error rates, and KmerGenie has very accurate estimates with error rates <1% for all k values. On all three datasets, KmerStream has more accurate estimates for longer k-mers, while its error rates increase rapidly for shorter k-mers. Although ntCard generally has the opposite trend, it also has the most stable performance on all three datasets. Except for k=128 bp on NA19238 and PG29, and k=96 bp on NA19238 and HG004, ntCard consistently displays the best accuracy both for F0 and f1, as indicated by the bold entries in Tables 4.2-4.4, and in all these cases it was a close second.

Table 4.2: Accuracy of algorithms in estimating F0 and f1 for HG004 reads. The DSK column reports the exact k-mer counts, and the columns for the other tools report percent errors.

k    Metric  DSK             ntCard  KmerGenie  KmerStream  Khmer
32   f1      13,319,957,567  0.01%   0.97%      7.04%       −
32   F0      16,539,753,749  0.02%   0.64%      5.12%       0.67%
64   f1      17,898,672,342  0.02%   0.35%      0.73%       −
64   F0      21,343,659,785  0.00%   0.22%      0.66%       0.15%
96   f1      18,827,062,018  0.36%   0.87%      0.00%       −
96   F0      22,313,944,415  0.24%   0.69%      0.05%       0.31%
128  f1      18,091,241,186  0.36%   0.76%      0.40%       −
128  F0      21,555,678,676  0.25%   0.62%      0.20%       0.30%

Table 4.3: Accuracy of algorithms in estimating F0 and f1 for NA19238 reads. The DSK column reports the exact k-mer counts, and the columns for the other tools report percent errors.

k    Metric  DSK             ntCard  KmerGenie  KmerStream  Khmer
32   f1      14,881,561,565  0.00%   0.53%      6.36%       −
32   F0      18,091,801,391  0.00%   0.40%      4.64%       1.82%
64   f1      19,074,667,480  0.02%   0.75%      0.68%       −
64   F0      22,527,419,136  0.01%   0.77%      0.65%       1.22%
96   f1      19,420,503,673  0.22%   0.66%      0.09%       −
96   F0      22,932,238,161  0.16%   0.66%      0.07%       0.46%
128  f1      17,902,027,438  0.21%   0.85%      0.19%       −
128  F0      21,421,517,759  0.13%   0.76%      0.03%       1.05%

Table 4.4: Accuracy of algorithms in estimating F0 and f1 for PG29 reads. The DSK column reports the exact k-mer counts, and the columns for the other tools report percent errors.

k    Metric  DSK             ntCard  KmerGenie  KmerStream  Khmer
32   f1      27,430,910,938  0.02%   15.33%     9.41%       −
32   F0      42,642,198,777  0.01%   11.02%     7.37%       8.86%
64   f1      44,344,130,469  0.04%   16.36%     2.61%       −
64   F0      67,800,291,613  0.02%   11.14%     1.73%       11.18%
96   f1      43,300,244,443  0.66%   17.51%     0.73%       −
96   F0      69,855,690,006  0.46%   11.13%     0.57%       9.36%
128  f1      32,089,613,024  0.40%   14.82%     0.06%       −
128  F0      58,195,246,941  0.30%   8.35%      0.27%       7.39%

We have also evaluated the accuracy of the full k-mer frequency histograms of ntCard on all three datasets with different k values. Since the KmerStream algorithm only computes estimates for F0 and f1, and Khmer only estimates F0, we could only compare the accuracy of the ntCard histogram with the estimated results of KmerGenie and the exact histogram from the DSK method. Figures 4.3-4.5 show the k-mer frequency histograms of DSK, ntCard, and KmerGenie for all three datasets with four k values, {32, 64, 96, 128}.
Since the results for f1 have already been presented in Tables 4.2-4.4, and because f2..f62 ≪ f1, the histograms in Figures 4.3-4.5 show the k-mer frequencies starting from f2. From Figures 4.3-4.5 and Tables 4.2-4.4, we can see that ntCard estimates the k-mer frequency histograms for all three datasets more accurately than KmerGenie.

4.5.3 Runtime and memory usage

We have calculated the memory usage of all benchmarked tools. DSK uses both main memory and disk space for counting k-mers, and therefore we obtained both values for it. We should also mention that DSK was executed on compute nodes equipped with solid-state drives (SSD); the SSDs, together with multi-threaded parallelism, greatly reduce the runtime of DSK. The memory usage of DSK on all three datasets was the same, at about 20 GB of RAM, while the disk space usage was 500 GB for the human genomes HG004 and NA19238, and 1 TB for the white spruce genome PG29.

Figure 4.3: k-mer frequency histograms for human genome HG004. We have used the DSK k-mer counting results as our ground truth in the evaluation (orange circle data points). The k-mer coverage frequency results, f2..f62, of ntCard and KmerGenie for different values of k = 32, 64, 96, 128 (the four columns from left to right) are shown with the symbols (+) and (), respectively.

The memory usage of KmerGenie to estimate the full k-mer frequency histograms for all datasets was about 200 MB of RAM. KmerStream uses 2-bit counters to estimate F0 and f1, resulting in a lower memory requirement.

Figure 4.4: k-mer frequency histograms for human genome NA19238. We have used the DSK k-mer counting results as our ground truth in the evaluation (orange circle data points). The k-mer coverage frequency results, f2..f62, of ntCard and KmerGenie for different values of k = 32, 64, 96, 128 (the four columns from left to right) are shown with the symbols (+) and (), respectively.

The memory usage of KmerStream on all three datasets was about 65 MB of RAM. The Khmer algorithm requires the lowest amount of memory among all the algorithms, but only estimates F0: it requires about 15 MB of RAM to estimate the total number of distinct k-mers in all three datasets. The memory requirement of ntCard for all three datasets was about 500 MB of RAM, although we note that it computes the full k-mer multiplicity histogram. We have also implemented a special runtime parameter to compute only the total number of distinct elements, F0, in which case ntCard requires about 2 MB of RAM.

Figure 4.5: k-mer frequency histograms for the white spruce genome PG29. We have used the DSK k-mer counting results as our ground truth in the evaluation (orange circle data points). The k-mer coverage frequency results, f2..f62, of ntCard and KmerGenie for different values of k = 32, 64, 96, 128 (the four columns from left to right) are shown with the symbols (+) and (), respectively.

Figures 4.6-4.8 show the runtime of all methods on the experimental datasets with k values from 32 to 128. The runtime of ntCard to obtain the full k-mer frequency histograms for the human genome datasets (HG004, NA19238) is about 6 mins. KmerStream takes about 100 mins to obtain F0 and f1 on the human genome datasets, while Khmer takes about 200 mins to estimate just the total number of distinct k-mers, F0. DSK and KmerGenie take up to 600 and 800 mins, respectively, to compute the k-mer coverage histograms for the human genome datasets. For the white spruce PG29 dataset, ntCard requires about 30 mins to estimate the k-mer frequency histograms, while KmerStream takes about 450 mins to obtain F0 and f1. Khmer takes longer, about 1200 mins, to estimate F0. DSK and KmerGenie can take up to 2700 and 3400 mins to compute the k-mer frequency histograms.

Figure 4.6: Runtime of DSK, ntCard, KmerGenie, KmerStream, and Khmer for the HG004 dataset.

Figure 4.7: Runtime of DSK, ntCard, KmerGenie, KmerStream, and Khmer for the NA19238 dataset.

Figure 4.8: Runtime of DSK, ntCard, KmerGenie, KmerStream, and Khmer for the PG29 dataset.

We should note that the ntCard, KmerGenie, and KmerStream algorithms have an option to accept multiple k values and compute multiple k-mer coverage histograms in a single run. This option reduces the amortized runtime per k value, but increases the memory usage. From the runtime results, we see that ntCard estimates the full k-mer coverage frequency histograms >15× faster than the closest competitor, KmerStream, which only computes F0 and f1. In our experiments and computing environment, approximately one third of the ntCard runtime is spent on reading the input datasets, and the rest on computing the k-mer coverage histograms. Therefore I/O efficiency, which is system and architecture dependent, has a considerable impact on the runtime performance of ntCard.

4.6 Discussion

With growing throughput and dropping cost of next-generation sequencing technologies, there is a continued need to develop faster and more effective bioinformatics tools to process and analyze the data associated with them. Developing algorithms and tools that analyze these huge amounts of data on the fly, preferably without storing intermediate files, would have many benefits in a broad spectrum of genomics projects such as de novo genome and transcriptome assembly, sequence alignment, repeat detection, error correction, and downstream analysis.

In this work, we introduced the ntCard streaming algorithm for estimating the k-mer coverage frequency histogram of high-throughput sequencing genomics data.
It employs the ntHash algorithm to hash all k-mers in DNA/RNA sequences efficiently, samples the k-mers in the dataset based on their hash values, and reconstructs the k-mer frequencies using a statistical model. Using an amount of memory comparable to similar tools, ntCard estimates the k-mer frequency histogram of massive genomics datasets severalfold faster than the state-of-the-art approaches.

Sample use cases of ntCard include tuning runtime parameters in de Bruijn graph assembly tasks, such as the optimal k value for the assembly, and setting parameters in applications utilizing the Bloom filter data structure. ntCard has been used in the new version of our genome assembly software package, ABySS 2.0 [42], to determine the values for the total memory size and the number of hash functions. It has also been utilized to set the Bloom filter sizes in BioBloom tools [16], a general-use, fast sequence categorization tool built on Bloom filters. Using ntCard, these tools can obtain the total number of distinct k-mers, F0, as well as the number of k-mers above a certain multiplicity threshold. The k-mer coverage histograms computed by ntCard can also be used as input to utilities like GenomeScope for estimating genome sizes, sequencing error rates, repeat content, and heterozygosity of genomes [13, 71, 75, 96].

We expect ntCard to provide utility in efficiently characterizing certain properties of large read sets, helping quality control pipelines and de novo sequencing projects.

Chapter 5

Conclusion

High-throughput sequencing technologies have profoundly altered the scale and scope of research in the health and life sciences. As sequencing throughputs continue growing and costs keep dropping, there will be a continued need for accurate and cost-effective algorithms and software tools for the analysis of the increasing volumes of DNA sequencing data.
In health and life sciences research organizations and clinical genomics laboratories, de novo assembly and sequence alignment are becoming two key steps in everyday research and analysis. Hence, designing scalable, accurate, and fast algorithms that improve the runtime, memory usage, and other computational requirements of de novo assembly pipelines would have a great impact on the field. During my PhD research, I designed, developed, and optimized efficient, scalable, and cost-effective algorithms and software tools for the analysis of high-throughput sequencing data, especially for the sequence alignment and de novo assembly problems, using state-of-the-art parallel and distributed computing paradigms on high-performance computing infrastructures. Depending on the level of communication required by a problem, I utilized different parallel computing paradigms, such as "embarrassingly parallel", "loosely coupled", or "tightly coupled", to design efficient and scalable methods.

I first introduced DIDA, a distributed and parallel indexing and alignment framework for large-scale sequence alignment tasks. Although there were existing solutions for this problem, they still had room for improvement in their runtime and memory usage. Moreover, the methods presented in widely popular alignment tools assume that the target sequence is static and usually represented as a reference genome. The first stage of read alignment within those tools involves "indexing" this reference sequence for faster alignment, a costly stage which is usually discounted in performance measurements. However, there are many applications where the reference is not static and/or the performance cost of indexing is not negligible. Such cases include resequencing work done on non-model species, and intermediate stages of a de novo assembly process.
Using DIDA, we were able to address the above challenges and perform large-scale alignment tasks by distributing them across several compute nodes, solving each subtask on a node separately, and finally gathering the partial results into the final output. DIDA was employed in the de novo assembly project of the white spruce genome at the Genome Sciences Centre at the BC Cancer Agency, and enabled us to perform several rounds of large-scale alignment jobs to finish the assembly process.

Later, in ntHash, I designed and developed a fast hashing algorithm for bioinformatics applications. Many applications in bioinformatics rely on cataloguing or counting DNA/RNA sequences for indexing, querying, and similarity search. These include sequence alignment, genome and transcriptome assembly, RNA-seq expression quantification, and error correction. An efficient way of performing such operations is through the use of hash-based data structures, such as hash tables or Bloom filters. Thus, improving the performance of hashing algorithms has a broad impact on a wide range of bioinformatics tools. ntHash achieved this goal by computing the hash values of consecutive k-mers in a given sequence using a recursive approach. It is an implementation of cyclic polynomial rolling hashing, adapted to the reduced alphabet of DNA sequences. It also efficiently handles the computation of reverse-complement, and consequently canonical, hash values. Further, ntHash provides a fast way of calculating multiple hash values for a k-mer without repeating the whole hashing procedure for each value, a very useful feature for bioinformatics applications that utilize the Bloom filter data structure. Our experiments demonstrated significant speed improvements over traditional methods, while maintaining a near-ideal distribution of the generated hash values.
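To make the recursive scheme concrete, the cyclic polynomial idea behind rolling hashing can be sketched in a few lines. This is an illustrative toy, not the published ntHash implementation: the seed constants below are arbitrary hypothetical values, and the real ntHash additionally handles reverse-complement hashing and a cheaper multi-hash derivation.

```python
# Toy sketch of cyclic polynomial (buzhash-style) rolling hashing over DNA,
# the recursive scheme ntHash builds on. The seed constants are arbitrary
# placeholders, not the published ntHash values.
MASK = (1 << 64) - 1
SEED = {  # one fixed 64-bit seed per base (hypothetical values)
    'A': 0x3C8BFBB395C60474, 'C': 0x3193C18562A02B4C,
    'G': 0x20323ED082572324, 'T': 0x295549F54BE24456,
}

def rol(x, s):
    """64-bit rotate left."""
    s %= 64
    return ((x << s) | (x >> (64 - s))) & MASK

def hash_kmer(kmer):
    """Direct hash: XOR of per-base seeds, each rotated by its offset."""
    k = len(kmer)
    h = 0
    for i, b in enumerate(kmer):
        h ^= rol(SEED[b], k - 1 - i)
    return h

def roll(h, k, base_out, base_in):
    """O(1) update when the k-mer window slides one base to the right."""
    return (rol(h, 1) ^ rol(SEED[base_out], k) ^ SEED[base_in]) & MASK

# Rolling over a sequence reproduces the direct hash at every position.
seq, k = "ACGTCAGTTA", 5
h = hash_kmer(seq[:k])
for i in range(1, len(seq) - k + 1):
    h = roll(h, k, seq[i - 1], seq[i + k - 1])
    assert h == hash_kmer(seq[i:i + k])
```

A Bloom filter that needs several hash values per k-mer can then derive them cheaply from this single rolled value (for example, with extra mixing multipliers), which is the kind of multi-hash shortcut described above.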
ntHash was employed in a series of software tools, including ABySS, BioBloomTools, ChopStitch, and ntCard, to improve the performance of their hashing operations.

Finally, I developed a streaming algorithm, called ntCard, for estimating the frequencies of k-mers in genomics data. Many bioinformatics algorithms are designed for the analysis of sequences of some uniform length, conventionally referred to as k-mers. These include sequence alignment tools and de Bruijn graph assembly software. An efficient algorithm to enumerate the number of unique k-mers, or even better, to build a histogram of k-mer frequencies, would be desirable for these tools and their downstream analysis pipelines. Among other applications, the estimated frequencies can be used to predict genome sizes, measure sequencing error rates, and tune runtime parameters for analysis tools. However, calculating a k-mer histogram from large volumes of sequencing data is a challenging task, and hence I designed ntCard to tackle these challenges. At its core, ntCard utilizes the ntHash algorithm to efficiently compute hash values for streamed sequences. It samples the calculated hash values to build a reduced-representation multiplicity table describing the sample distribution. Finally, it derives a statistical model to reconstruct the population distribution from the sample distribution. The experimental results demonstrated that ntCard estimates cardinalities 15× faster than the state-of-the-art algorithms, using less memory and with higher accuracy. This makes it a potentially enabling technology for large-scale genomics applications.
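The sampling step described above can be illustrated with a toy version: keep only the k-mers whose hash lands in a 1/2^s slice of the hash space, count those exactly, and scale the resulting multiplicity table back up. This sketch captures only the core sampling idea under simplifying assumptions: a stand-in hash function instead of ntHash, and plain scaling instead of ntCard's statistical reconstruction.

```python
# Toy sketch of hash-based sampling for k-mer histogram estimation, in the
# spirit of ntCard. The real method uses ntHash for hashing and a statistical
# model for reconstruction; here a stand-in hash and plain scaling are used.
from collections import Counter
import hashlib

def kmer_hash(kmer):
    """Stand-in 64-bit hash (ntCard itself uses the much faster ntHash)."""
    return int.from_bytes(
        hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")

def estimate_histogram(reads, k, s=3):
    """Estimate f_i (number of distinct k-mers seen i times) from the
    1/2**s slice of hash space whose top s bits are zero."""
    sample = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            h = kmer_hash(read[i:i + k])
            if h >> (64 - s) == 0:      # keep only the sampled k-mers
                sample[h] += 1          # count sampled k-mers exactly
    hist = Counter(sample.values())     # multiplicity table of the sample
    return {i: c * 2 ** s for i, c in hist.items()}  # scale back up

reads = ["ACGTACGTAC", "ACGTACGTAA"]
est = estimate_histogram(reads, k=4, s=0)  # s=0: no sampling, exact histogram
```

Because only a small, fixed fraction of k-mers is ever stored, the memory footprint stays tiny regardless of the input size, which mirrors the streaming behaviour described above.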
ntCard was employed in the new version of our genome assembly software package, ABySS 2.0, as well as in the BioBloomTools, KmerGenie, and ChopStitch algorithms [46].

High-throughput sequencing technologies have extended the frontiers of genomics and bioinformatics research, opening up new doors to investigation and analysis and offering a better understanding of broad areas of biology and medicine. Rapid developments and recent advances in DNA sequencing technologies are lowering costs and increasing speeds. With the emergence of third-generation sequencing platforms, this continuously developing technology is now being applied within clinical environments, where it has the potential to change the treatment and diagnosis outcomes of diseases such as cancer. At the same time, third-generation sequencing will change the algorithmic landscape. As the read lengths of third-generation sequencing platforms grow, the associated sequencing error rates increase as well. Therefore, there is a continued need for resource-efficient and accurate algorithms that can handle higher error rates and other complex events. It is essential to tune and adapt existing tools to the needs of new sequencing technologies. Inventing new methods that can handle third-generation sequencing data efficiently will reduce the computational requirements significantly and consequently speed up and improve the diagnosis and treatment outcomes of diseases.

Bibliography

[1] Mohamed Ibrahim Abouelhoda, Stefan Kurtz, and Enno Ohlebusch. The Enhanced Suffix Array and Its Applications to Genome Analysis, pages 449–463. Springer Berlin Heidelberg, Berlin, Heidelberg, 2002.

[2] Noga Alon, Yossi Matias, and Mario Szegedy. The Space Complexity of Approximating the Frequency Moments. Journal of Computer and System Sciences, 58(1):137–147, 1999.

[3] Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. Basic local alignment search tool.
Journal of Molecular Biology, 215(3):403–410, 1990.

[4] Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, D. Sivakumar, and Luca Trevisan. Counting Distinct Elements in a Data Stream. In Proceedings of the 6th International Workshop on Randomization and Approximation Techniques, RANDOM '02, pages 1–10, London, UK, 2002. Springer-Verlag.

[5] David R. Bentley, Shankar Balasubramanian, Harold P. Swerdlow, Geoffrey P. Smith, John Milton, Clive G. Brown, Kevin P. Hall, Dirk J.
Shlyakhter,Matthew K. Belmonte, Eric S. Lander, Chad Nusbaum, and David B.Jaffe. ALLPATHS: De novo assembly of whole-genome shotgun mi-croreads. Genome Research, 18(5):810–820, 2008.[12] Yangho Chen, Tade Souaiaia, and Ting Chen. PerM: efficient mappingof short sequencing reads with periodic full sensitive spaced seeds.Bioinformatics, 25(19):2514, 2009.[13] Rayan Chikhi and Paul Medvedev. Informed and automated k-mer sizeselection for genome assembly. Bioinformatics, 30(1):31–37, 2014.[14] Rayan Chikhi and Guillaume Rizk. Space-efficient and exact de Bruijngraph representation based on a Bloom filter. Algorithms for MolecularBiology, 8(1):22, 2013.[15] Justin Chu, Hamid Mohamadi, Rene L. Warren, Chen Yang, and InancBirol. Innovations and challenges in detecting long read overlaps: anevaluation of the state-of-the-art. Bioinformatics, 33(8):1261, 2017.[16] Justin Chu, Sara Sadeghi, Anthony Raymond, Shaun D. Jackman,Ka Ming Nip, Richard Mar, Hamid Mohamadi, Yaron S. Butterfield,A. Gordon Robertson, and Inan Birol. BioBloom tools: fast, accurateand memory-efficient host species sequence screening using bloom fil-ters. Bioinformatics, 30(23):3402, 2014.101Bibliography[17] Jonathan D. Cohen. Recursive Hashing Functions for n-Grams. ACMTrans. Inf. Syst., 15(3):291–320, July 1997.[18] Thomas C. Conway and Andrew J. Bromage. Succinct data structuresfor assembling large genomes. Bioinformatics, 27(4):479–486, 2011.[19] Graham Cormode and Minos Garofalakis. Sketching Streams Throughthe Net: Distributed Approximate Query Tracking. In Proceedings ofthe 31st International Conference on Very Large Data Bases, VLDB’05, pages 13–24. VLDB Endowment, 2005.[20] Graham Cormode and S. Muthukrishnan. An improved data streamsummary: the count-min sketch and its applications. Journal of Al-gorithms, 55(1):58 – 75, 2005.[21] Sebastian Deorowicz, Marek Kokot, Szymon Grabowski, and Ag-nieszka Debudaj-Grabysz. KMC 2: fast and resource-frugal k-mercounting. 
Bioinformatics, 31(10):1569–1576, 2015.[22] Robert C. Edgar. MUSCLE: multiple sequence alignment with highaccuracy and high throughput. Nucleic Acids Research, 32(5):1792–1797, 2004.[23] Lavinia Egidi and Giovanni Manzini. Better spaced seeds usingQuadratic Residues. Journal of Computer and System Sciences,79(7):1144 – 1155, 2013.[24] Brent Ewing, LaDeana Hillier, Michael C. Wendl, and Phil Green.102BibliographyBase-Calling of Automated Sequencer Traces UsingPhred.?I. Accu-racy?Assessment. Genome Research, 8(3):175–185, 1998.[25] Michael Farrar. Striped Smith-Waterman speeds database searches sixtimes over other SIMD implementations. Bioinformatics, 23(2):156,2006.[26] P. Ferragina and G. Manzini. Opportunistic Data Structures withApplications. In Proceedings of the 41st Annual Symposium on Foun-dations of Computer Science, FOCS ’00, pages 390–, Washington, DC,USA, 2000. IEEE Computer Society.[27] Paolo Ferragina, Travis Gagie, and Giovanni Manzini. LightweightData Indexing and Compression in External Memory. Algorithmica,63(3):707–730, 2012.[28] Philippe Flajolet and G. Nigel Martin. Probabilistic counting algo-rithms for data base applications. Journal of Computer and SystemSciences, 31(2):182 – 209, 1985.[29] Mahmoud Ghandi, Dongwon Lee, Morteza Mohammad-Noori, andMichael A. Beer. Enhanced Regulatory Sequence Prediction UsingGapped k-mer Features. PLOS Computational Biology, 10(7):1–15,07 2014.[30] Mahmoud Ghandi, Morteza Mohammad-Noori, and Michael A. Beer.Robust k-mer frequency estimation using gapped k-mers. Journal ofMathematical Biology, 69(2):469–500, 2014.103Bibliography[31] Sante Gnerre, Iain MacCallum, Dariusz Przybylski, Filipe J. Ribeiro,Joshua N. Burton, Bruce J. Walker, Ted Sharpe, Giles Hall, Ter-rance P. Shea, Sean Sykes, Aaron M. Berlin, Daniel Aird, MauraCostello, Riza Daza, Louise Williams, Robert Nicol, Andreas Gnirke,Chad Nusbaum, Eric S. Lander, and David B. Jaffe. 
High-quality draftassemblies of mammalian genomes from massively parallel sequencedata. Proceedings of the National Academy of Sciences, 108(4):1513–1518, 2011.[32] Gaston H. Gonnet and Ricardo A. Baeza-Yates. An analysis of theKarp-Rabin string matching algorithm. Information Processing Let-ters, 34(5):271 – 274, 1990.[33] Sara Goodwin, John D. McPherson, and W. Richard McCombie. Com-ing of age: ten years of next-generation sequencing technologies. NatRev Genet, 17(6):333–351, 06 2016.[34] Faraz Hach, Fereydoun Hormozdiari, Can Alkan, Farhad Hormozdiari,Inanc Birol, Evan E Eichler, and S Cenk Sahinalp. mrsFAST: a cache-oblivious algorithm for short-read mapping. Nat Meth, 7(8):576–577,08 2010.[35] Yun Heo, Xiao-Long Wu, Deming Chen, Jian Ma, and Wen-MeiHwu. BLESS: Bloom filter-based error correction solution for high-throughput sequencing reads. Bioinformatics, 30(10):1354, 2014.[36] David Hernandez, Patrice Franois, Laurent Farinelli, Magne sters, andJacques Schrenzel. De novo bacterial genome sequencing: Millions of104Bibliographyvery short reads assembled on a desktop computer. Genome Research,2008.[37] Steve Hoffmann, Christian Otto, Stefan Kurtz, Cynthia M. Sharma,Philipp Khaitovich, Jrg Vogel, Peter F. Stadler, and Jrg Hackermller.Fast Mapping of Short Sequences with Mismatches, Insertions andDeletions Using Index Structures. PLOS Computational Biology,5(9):1–10, 09 2009.[38] Nils Homer, Barry Merriman, and Stanley F Nelson. BFAST: AnAlignment Tool for Large Scale Genome Resequencing. PLOS ONE,4(11):1–12, 11 2009.[39] Lucian Ilie, Hamid Mohamadi, Geoffrey Brian Golding, and WilliamF Smyth. BOND: Basic OligoNucleotide Design. BMC Bioinformatics,14(1):69, 2013.[40] Piotr Indyk and David Woodruff. Optimal Approximations of theFrequency Moments of Data Streams. In Proceedings of the Thirty-seventh Annual ACM Symposium on Theory of Computing, STOC ’05,pages 202–208, New York, NY, USA, 2005. ACM.[41] Luiz Carlos Irber Junior and C. Titus Brown. 
High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proceedings of the National Academy of Sciences, 108(4):1513–1518, 2011.

[32] Gaston H. Gonnet and Ricardo A. Baeza-Yates. An analysis of the Karp-Rabin string matching algorithm. Information Processing Letters, 34(5):271–274, 1990.

[33] Sara Goodwin, John D. McPherson, and W. Richard McCombie. Coming of age: ten years of next-generation sequencing technologies. Nat Rev Genet, 17(6):333–351, 06 2016.

[34] Faraz Hach, Fereydoun Hormozdiari, Can Alkan, Farhad Hormozdiari, Inanc Birol, Evan E Eichler, and S Cenk Sahinalp. mrsFAST: a cache-oblivious algorithm for short-read mapping. Nat Meth, 7(8):576–577, 08 2010.

[35] Yun Heo, Xiao-Long Wu, Deming Chen, Jian Ma, and Wen-Mei Hwu. BLESS: Bloom filter-based error correction solution for high-throughput sequencing reads. Bioinformatics, 30(10):1354, 2014.

[36] David Hernandez, Patrice François, Laurent Farinelli, Magne Østerås, and Jacques Schrenzel. De novo bacterial genome sequencing: Millions of very short reads assembled on a desktop computer. Genome Research, 2008.

[37] Steve Hoffmann, Christian Otto, Stefan Kurtz, Cynthia M. Sharma, Philipp Khaitovich, Jörg Vogel, Peter F. Stadler, and Jörg Hackermüller. Fast Mapping of Short Sequences with Mismatches, Insertions and Deletions Using Index Structures. PLOS Computational Biology, 5(9):1–10, 09 2009.

[38] Nils Homer, Barry Merriman, and Stanley F Nelson. BFAST: An Alignment Tool for Large Scale Genome Resequencing. PLOS ONE, 4(11):1–12, 11 2009.

[39] Lucian Ilie, Hamid Mohamadi, Geoffrey Brian Golding, and William F Smyth. BOND: Basic OligoNucleotide Design. BMC Bioinformatics, 14(1):69, 2013.

[40] Piotr Indyk and David Woodruff. Optimal Approximations of the Frequency Moments of Data Streams. In Proceedings of the Thirty-seventh Annual ACM Symposium on Theory of Computing, STOC '05, pages 202–208, New York, NY, USA, 2005. ACM.

[41] Luiz Carlos Irber Junior and C. Titus Brown. Efficient cardinality estimation for k-mers in large DNA sequencing data sets. bioRxiv, 2016.

[42] Shaun D. Jackman, Benjamin P. Vandervalk, Hamid Mohamadi, Justin Chu, Sarah Yeo, S. Austin Hammond, Golnaz Jahesh, Hamza Khan, Lauren Coombe, Rene L. Warren, and Inanc Birol. ABySS 2.0: resource-efficient assembly of large genomes using a Bloom filter. Genome Research, 27(5):768–777, 2017.

[43] Shaun D. Jackman, Rene L. Warren, Ewan A. Gibb, Benjamin P. Vandervalk, Hamid Mohamadi, Justin Chu, Anthony Raymond, Stephen Pleasance, Robin Coope, Mark R. Wildung, Carol E. Ritland, Jean Bousquet, Steven J. M. Jones, Jörg Bohlmann, and Inanc Birol. Organellar Genomes of White Spruce (Picea glauca): Assembly and Annotation. Genome Biology and Evolution, 8(1):29–41, 2016.

[44] David S Johnson and Michael R Garey. A 71/60 theorem for bin packing. Journal of Complexity, 1(1):65–106, 1985.

[45] Richard M. Karp and Michael O. Rabin. Efficient Randomized Pattern-matching Algorithms. IBM J. Res. Dev., 31(2):249–260, March 1987.

[46] Hamza Khan, Hamid Mohamadi, Benjamin P. Vandervalk, Rene L. Warren, and Inanc Birol. ChopStitch: exon annotation and splice graph construction using transcriptome assembly and whole genome sequencing data. In revision, Bioinformatics, 2017.

[47] Stefan Kurtz, Adam Phillippy, Arthur L. Delcher, Michael Smoot, Martin Shumway, Corina Antonescu, and Steven L. Salzberg. Versatile and open software for comparing large genomes. Genome Biology, 5(2):R12, 2004.

[48] T. W. Lam, W. K. Sung, S. L. Tam, C. K. Wong, and S. M. Yiu. Compressed indexing and local alignment of DNA. Bioinformatics, 24(6):791, 2008.

[49] Ben Langmead and Steven L Salzberg. Fast gapped-read alignment with Bowtie 2. Nat Meth, 9(4):357–359, 04 2012.

[50] Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L. Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10(3):R25, 2009.

[51] Daniel Lemire and Owen Kaser.
Recursive n-gram hashing is pairwise independent, at best. Computer Speech and Language, 24(4):698–710, 2010.

[52] H. Li. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. ArXiv e-prints, March 2013.

[53] Heng Li. BFC: correcting Illumina sequencing errors. Bioinformatics, 31(17):2885, 2015.

[54] Heng Li and Richard Durbin. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics, 26(5):589, 2010.

[55] Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, and Richard Durbin. The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16):2078, 2009.

[56] Heng Li, Jue Ruan, and Richard Durbin. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research, 18(11):1851–1858, 2008.

[57] Ming Li, Bin Ma, Derek Kisman, and John Tromp. PatternHunter II: Highly sensitive and fast homology search. Journal of Bioinformatics and Computational Biology, 02(03):417–439, 2004.

[58] Ruiqiang Li, Wei Fan, Geng Tian, Hongmei Zhu, Lin He, Jing Cai, Quanfei Huang, Qingle Cai, Bo Li, Yinqi Bai, Zhihe Zhang, Yaping Zhang, Wen Wang, Jun Li, Fuwen Wei, Heng Li, Jun Wang, et al. The sequence and de novo assembly of the giant panda genome. Nature, 463(7279):311–317, 01 2010.

[59] Ruiqiang Li, Yingrui Li, Karsten Kristiansen, and Jun Wang. SOAP: short oligonucleotide alignment program. Bioinformatics, 24(5):713, 2008.

[60] Ruiqiang Li, Chang Yu, Yingrui Li, Tak-Wah Lam, Siu-Ming Yiu, Karsten Kristiansen, and Jun Wang. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics, 25(15):1966, 2009.

[61] Ruiqiang Li, Hongmei Zhu, Jue Ruan, Wubin Qian, Xiaodong Fang, Zhongbin Shi, Yingrui Li, Shengting Li, Gao Shan, Karsten Kristiansen, Songgang Li, Huanming Yang, Jian Wang, and Jun Wang. De novo assembly of human genomes with massively parallel short read sequencing.
Genome Research, 2009.

[62] Chi-Man Liu, Thomas Wong, Edward Wu, Ruibang Luo, Siu-Ming Yiu, Yingrui Li, Bingqiang Wang, Chang Yu, Xiaowen Chu, Kaiyong Zhao, Ruiqiang Li, and Tak-Wah Lam. SOAP3: ultra-fast GPU-based parallel alignment tool for short reads. Bioinformatics, 28(6):878, 2012.

[63] Yongchao Liu, Bernt Popp, and Bertil Schmidt. CUSHAW3: Sensitive and Accurate Base-Space and Color-Space Short-Read Alignment with Hybrid Seeding. PLOS ONE, 9(1):1–9, 01 2014.

[64] Yongchao Liu and Bertil Schmidt. Long read alignment based on maximal exact match seeds. Bioinformatics, 28(18):i318–i324, 09 2012.

[65] Bin Ma, John Tromp, and Ming Li. PatternHunter: faster and more sensitive homology search. Bioinformatics, 18(3):440, 2002.

[66] Iain MacCallum, Dariusz Przybylski, Sante Gnerre, Joshua Burton, Ilya Shlyakhter, Andreas Gnirke, Joel Malek, Kevin McKernan, Swati Ranade, Terrance P. Shea, Louise Williams, Sarah Young, Chad Nusbaum, and David B. Jaffe. ALLPATHS 2: small genomes assembled accurately and with high continuity from short paired reads. Genome Biology, 10(10):R103, 2009.

[67] Udi Manber and Gene Myers. Suffix Arrays: A New Method for On-line String Searches. In Proceedings of the First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '90, pages 319–327, Philadelphia, PA, USA, 1990. Society for Industrial and Applied Mathematics.

[68] Udi Manber and Gene Myers. Suffix Arrays: A New Method for On-Line String Searches. SIAM Journal on Computing, 22(5):935–948, 1993.

[69] Santiago Marco-Sola, Michael Sammeth, Roderic Guigo, and Paolo Ribeca. The GEM mapper: fast, accurate and versatile alignment by filtration. Nat Meth, 9(12):1185–1188, 12 2012.

[70] Marcel Margulies, Michael Egholm, William E. Altman, Said Attiya, Joel S. Bader, Lisa A. Bemben, Jan Berka, Michael S. Braverman, Yi-Ju Chen, Zhoutao Chen, Scott B. Dewell, Lei Du, et al. Genome sequencing in microfabricated high-density picolitre reactors.
Nature, 437(7057):376–380, 09 2005.

[71] Guillaume Marçais and Carl Kingsford. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics, 27(6):764–770, 2011.

[72] Paul Medvedev, Eric Scott, Boyko Kakaradov, and Pavel Pevzner. Error correction of high-throughput sequencing datasets with non-uniform coverage. Bioinformatics, 27(13):i137–i141, 2011.

[73] Colin Meek, Jignesh M. Patel, and Shruti Kasetty. OASIS: An Online and Accurate Technique for Local-alignment Searches on Biological Sequences. In Proceedings of the 29th International Conference on Very Large Data Bases - Volume 29, VLDB '03, pages 910–921. VLDB Endowment, 2003.

[74] Páll Melsted and Jonathan K. Pritchard. Efficient counting of k-mers in DNA sequences using a bloom filter. BMC Bioinformatics, 12(1):333, 2011.

[75] Páll Melsted and Bjarni V. Halldórsson. KmerStream: streaming algorithms for k-mer abundance estimation. Bioinformatics, 30(24):3541–3547, 2014.

[76] Hamid Mohamadi. Oligonucleotide Probe Design for Large Genomes using Multiple Spaced Seeds. Master's thesis, McMaster University, 2012.

[77] Hamid Mohamadi, Justin Chu, Benjamin P. Vandervalk, and Inanc Birol. ntHash: recursive nucleotide hashing. Bioinformatics, 32(22):3492–3494, 2016.

[78] Hamid Mohamadi, Hamza Khan, and Inanc Birol. ntCard: a streaming algorithm for cardinality estimation in genomics data. Bioinformatics, 33(9):1324, 2017.

[79] Hamid Mohamadi, Benjamin P Vandervalk, Anthony Raymond, Shaun D Jackman, Justin Chu, Clay P Breshears, and Inanc Birol. DIDA: Distributed Indexing Dispatched Alignment. PLOS ONE, 10(4):1–14, 04 2015.

[80] Eugene W. Myers. The fragment assembly string graph. Bioinformatics, 21(suppl 2):ii79, 2005.

[81] Niranjan Nagarajan and Mihai Pop. Sequence assembly demystified. Nat Rev Genet, 14(3):157–167, 03 2013.

[82] Maria Nattestad and Michael C. Schatz. Assemblytics: a web analytics tool for the detection of variants from an assembly.
Bioinformatics, 32(19):3021–3023, 2016.

[83] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[84] Laurent Noé and Donald E. K. Martin. A Coverage Criterion for Spaced Seeds and Its Applications to Support Vector Machine String Kernels and k-Mer Distances. Journal of Computational Biology, 21(12):947–963, 2014.

[85] Rob Patro, Stephen M Mount, and Carl Kingsford. Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nat Biotech, 32(5):462–464, 05 2014.

[86] Pavel A. Pevzner, Haixu Tang, and Michael S. Waterman. An Eulerian path approach to DNA fragment assembly. Proceedings of the National Academy of Sciences, 98(17):9748–9753, 2001.

[87] Simon J. Puglisi, W. F. Smyth, and Andrew H. Turpin. A Taxonomy of Suffix Array Construction Algorithms. ACM Comput. Surv., 39(2), July 2007.

[88] Guillaume Rizk, Dominique Lavenier, and Rayan Chikhi. DSK: k-mer counting with very low memory usage. Bioinformatics, 29(5):652–653, 2013.

[89] Rajat Shuvro Roy, Debashish Bhattacharya, and Alexander Schliep. Turtle: Identifying frequent k-mers with cache-efficient algorithms. Bioinformatics, 30(14):1950, 2014.

[90] Kamil Salikhov, Gustavo Sacomoto, and Gregory Kucherov. Using cascading Bloom filters to improve the memory usage for de Brujin graphs. Algorithms for Molecular Biology, 9(1):2, 2014.

[91] Steven L. Salzberg, Adam M. Phillippy, Aleksey Zimin, Daniela Puiu, Tanja Magoc, Sergey Koren, Todd J. Treangen, Michael C. Schatz, Arthur L. Delcher, Michael Roberts, Guillaume Marçais, Mihai Pop, and James A. Yorke. GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Research, 2011.

[92] F. Sanger, S. Nicklen, and A. R. Coulson. DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences, 74(12):5463–5467, 1977.

[93] Michael C.
Schatz. CloudBurst: highly sensitive read mapping withMapReduce. Bioinformatics, 25(11):1363, 2009.[94] Ariya Shajii, Deniz Yorukoglu, Yun William Yu, and Bonnie Berger.Fast genotyping of known SNPs through approximate k-mer matching.Bioinformatics, 32(17):i538–i544, 2016.[95] Jared T. Simpson. Efficient sequence assembly and variant calling us-113Bibliographying compressed data structures. PhD thesis, University of Cambridge,2013.[96] Jared T. Simpson. Exploring genome characteristics and sequencequality without a reference. Bioinformatics, 30(9):1228–1235, 2014.[97] Jared T. Simpson and Richard Durbin. Efficient de novo assembly oflarge genomes using compressed data structures. Genome Research,2011.[98] Jared T. Simpson, Kim Wong, Shaun D. Jackman, Jacqueline E.Schein, Steven J.M. Jones, and Inanc Birol. ABySS: A parallel assem-bler for short read sequence data. Genome Research, 19(6):1117–1123,2009.[99] Andrew D. Smith, Wen-Yu Chung, Emily Hodges, Jude Kendall, GregHannon, James Hicks, Zhenyu Xuan, and Michael Q. Zhang. Up-dates to the RMAP short-read mapping software. Bioinformatics,25(21):2841, 2009.[100] T.F. Smith and M.S. Waterman. Identification of common molecularsubsequences. Journal of Molecular Biology, 147(1):195 – 197, 1981.[101] Li Song, Liliana Florea, and Ben Langmead. Lighter: fastand memory-efficient sequencing error correction without counting.Genome Biology, 15(11):509, 2014.[102] Henrik Stranneheim, Max Kller, Tobias Allander, Bjrn Andersson,114BibliographyLars Arvestad, and Joakim Lundeberg. Classification of DNA se-quences using Bloom filters. Bioinformatics, 26(13):1595, 2010.[103] Benjamin P. Vandervalk, Shaun D. Jackman, Anthony Raymond,Hamid Mohamadi, Chen Yang, Dean A. Attali, Justin Chu, Rene L.Warren, and Inanc Birol. Konnector: Connecting paired-end readsusing a bloom filter de Bruijn graph. In 2014 IEEE InternationalConference on Bioinformatics and Biomedicine (BIBM), pages 51–58,Nov 2014.[104] Benjamin P. 
Vandervalk, Chen Yang, Zhuyi Xue, Karthika Raghavan,Justin Chu, Hamid Mohamadi, Shaun D. Jackman, Readman Chiu,Rene´ L. Warren, and Inanc¸ Birol. Konnector v2.0: pseudo-long readsfrom paired-end sequencing data. BMC Medical Genomics, 8(3):S1,2015.[105] Vijay V. Vazirani. Approximation Algorithms. Springer-Verlag NewYork, Inc., New York, NY, USA, 2001.[106] Ren L. Warren, Christopher I. Keeling, Macaire Man Saint Yuen, An-thony Raymond, Greg A. Taylor, Benjamin P. Vandervalk, HamidMohamadi, Daniel Paulino, and et al. Improved white spruce (Piceaglauca) genome assemblies and annotation of large gene families ofconifer terpenoid and phenolic defense metabolism. The Plant Jour-nal, 83(2):189–212, 2015.[107] David A. Wheeler, Maithreyan Srinivasan, Michael Egholm, YufengShen, Lei Chen, Amy McGuire, Wen He, Yi-Ju Chen, Vinod Makhi-115Bibliographyjani, G. Thomas Roth, Xavier Gomes, Karrie Tartaro, Faheem Niazi,Cynthia L. Turcotte, Gerard P. Irzyk, James R. Lupski, Craig Chin-ault, Xing-zhi Song, Yue Liu, Ye Yuan, Lynne Nazareth, Xiang Qin,Donna M. Muzny, Marcel Margulies, George M. Weinstock, Richard A.Gibbs, and Jonathan M. Rothberg. The complete genome of an indi-vidual by massively parallel DNA sequencing. Nature, 452(7189):872–876, 04 2008.[108] Thomas D. Wu and Serban Nacu. Fast and SNP-tolerant detec-tion of complex variants and splicing in short reads. Bioinformatics,26(7):873, 2010.[109] Daniel R. Zerbino and Ewan Birney. Velvet: Algorithms for denovo short read assembly using de Bruijn graphs. Genome Research,18(5):821–829, 2008.[110] Justin M. Zook, David Catoe, Jennifer McDaniel, Lindsay Vang, NoahSpies, Arend Sidow, Ziming Weng, Yuling Liu, Christopher E. Mason,Noah Alexander, and et al. Extensive sequencing of seven humangenomes to characterize benchmark reference materials. Sci. Data,3:160025, Jun 2016.116

