UBC Faculty Research and Publications

Tigmint: correcting assembly errors using linked reads from large molecules
Jackman, Shaun D.; Coombe, Lauren; Chu, Justin; Warren, Rene L.; Vandervalk, Benjamin P.; Yeo, Sarah; Xue, Zhuyi; Mohamadi, Hamid; Bohlmann, Joerg; Jones, Steven J.M.; Birol, Inanc
Oct 26, 2018

Full Text

Jackman et al. BMC Bioinformatics (2018) 19:393
https://doi.org/10.1186/s12859-018-2425-6

SOFTWARE Open Access

Tigmint: correcting assembly errors using linked reads from large molecules

Shaun D. Jackman1*, Lauren Coombe1, Justin Chu1, Rene L. Warren1, Benjamin P. Vandervalk1, Sarah Yeo1, Zhuyi Xue1, Hamid Mohamadi1, Joerg Bohlmann2, Steven J.M. Jones1 and Inanc Birol1

Abstract

Background: Genome sequencing yields the sequence of many short snippets of DNA (reads) from a genome. Genome assembly attempts to reconstruct the original genome from which these reads were derived. This task is difficult due to gaps and errors in the sequencing data, repetitive sequence in the underlying genome, and heterozygosity. As a result, assembly errors are common. In the absence of a reference genome, these misassemblies may be identified by comparing the sequencing data to the assembly and looking for discrepancies between the two. Once identified, these misassemblies may be corrected, improving the quality of the assembled sequence. Although tools exist to identify and correct misassemblies using Illumina paired-end and mate-pair sequencing, no such tool yet exists that makes use of the long distance information of the large molecules provided by linked reads, such as those offered by the 10x Genomics Chromium platform. We have developed the tool Tigmint to address this gap.

Results: To demonstrate the effectiveness of Tigmint, we applied it to assemblies of a human genome using short reads assembled with ABySS 2.0 and other assemblers. Tigmint reduced the number of misassemblies identified by QUAST in the ABySS assembly by 216 (27%). While scaffolding with ARCS alone more than doubled the scaffold NGA50 of the assembly from 3 to 8 Mbp, the combination of Tigmint and ARCS improved the scaffold NGA50 of the assembly over five-fold to 16.4 Mbp. This notable improvement in contiguity highlights the utility of assembly correction in refining assemblies.
We demonstrate the utility of Tigmint in correcting the assemblies of multiple tools, as well as in using Chromium reads to correct and scaffold assemblies of long single-molecule sequencing.

Conclusions: Scaffolding an assembly that has been corrected with Tigmint yields a final assembly that is both more correct and substantially more contiguous than an assembly that has not been corrected. Using single-molecule sequencing in combination with linked reads enables a genome sequence assembly that achieves both high sequence contiguity and high scaffold contiguity, a feat not currently achievable with either technology alone.

Keywords: Assembly correction, Genome scaffolding, Genome sequence assembly, 10x Genomics Chromium, Linked reads

*Correspondence: sjackman@bcgsc.ca
1 BC Cancer Genome Sciences Centre, Vancouver, BC V5Z 4S6, Canada
Full list of author information is available at the end of the article

Background

Assemblies of short read sequencing data are easily confounded by repetitive sequences larger than the fragment size of the sequencing library. When the size of a repeat exceeds the library fragment size, the contig comes to an end in the best case, or results in misassembled sequence in the worst case. Misassemblies not only complicate downstream analyses, but also limit the contiguity of the assembly. Each incorrectly assembled sequence prevents joining that chimeric sequence to its true neighbours during assembly scaffolding, illustrated in Fig. 1.

Long-read sequencing technologies have greatly improved assembly contiguity with their ability to span these repeats, but at a cost currently significantly higher than that of short-read sequencing technology. For population studies and when sequencing large genomes, such as conifer genomes and other economically important crop species, this cost may be prohibitive.
The 10x Genomics (Pleasanton, CA) Chromium technology generates linked reads from large DNA molecules at a cost comparable to standard short-read sequencing technologies. Whereas paired-end sequencing gives two reads from a small DNA fragment, linked reads yield roughly a hundred read pairs from molecules with a typical size of a hundred kilobases. Linked reads indicate which reads were derived from the same DNA molecule (or molecules, when they share the same barcode), and so should be in close proximity in the underlying genome. The technology has been used previously to phase diploid genomes using a reference [1], de novo assemble complex genomes in the gigabase scale [2], and further scaffold draft assemblies [3, 4].

Fig. 1 An assembly of a hypothetical genome with two linear chromosomes is assembled in three contigs. One of those contigs is misassembled. In its current misassembled state, this assembly cannot be completed by scaffolding alone. The misassembled contig must first be corrected by cutting the contig at the location of the misassembly. After correcting the misassembly, each chromosome may be assembled into a single scaffold

© The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

A number of software tools employ linked reads for various applications.
The Long Ranger tool maps reads to repetitive sequence, phases small variants, and identifies structural variants [5], while Supernova [2] assembles diploid genome sequences. Both tools are developed by the vendor. Among tools from academic labs, GROC-SVs [6], NAIBR [7], and Topsorter [8] identify structural variants, and ARCS [4], Architect [9], and fragScaff [10] scaffold genome assemblies using linked reads.

In de novo sequencing projects, it is challenging yet important to ensure the correctness of the resulting assemblies. Tools to correct misassemblies typically inspect the reads aligned back to the assembly to identify discrepancies. Pilon [11] inspects the alignments to identify variants and correct small-scale misassemblies. NxRepair [12] uses Illumina mate-pair sequencing to correct large-scale structural misassemblies. Misassemblies may also be corrected using optical mapping and chromosome conformation capture [13]. Linked reads offer an opportunity to use the long-range information provided by large molecules to identify misassemblies in a cost-effective way, yet no software tool currently exists to correct misassemblies using linked reads. Here we introduce a software tool, Tigmint, to identify misassemblies using this new and useful data type.

Tigmint first aligns linked reads to an assembly, and infers the extents of the large DNA molecules from these alignments. It then searches for atypical drops in physical molecule coverage, revealing the positions of possible misassemblies. It cuts the assembled sequences at these positions to improve assembly correctness. Linked reads may then be used again to scaffold the corrected assembly with ARCS [4] to identify contig ends sharing barcodes, and either ABySS-Scaffold (included with ABySS) or LINKS [14] to merge sequences of contigs into scaffolds.

Methods

Tigmint identifies misassembled regions of the assembly by inspecting the alignment of linked reads to the draft genome assembly.
The command tigmint-molecule groups linked reads with the same barcode into molecules. The command tigmint-cut identifies regions of the assembly that are not well supported by the linked reads, and cuts the contigs of the draft assembly at these positions. Tigmint may optionally scaffold the genome using ARCS [4]. A block diagram of the analysis pipeline is shown in Fig. 2.

A typical workflow of Tigmint is as follows. The user provides a draft assembly in FASTA format and the linked reads in FASTQ format. Tigmint first aligns the linked reads to the draft genome using BWA-MEM [15]. The alignments are filtered by alignment score and number of mismatches to remove poorly aligned reads with the default thresholds NM < 5 and AS ≥ 0.65·l, where l is the read length. Reads with the same barcode that map within a specified distance, 50 kbp by default, of the adjacent reads are grouped into a molecule. A BED (Browser Extensible Data) file [16] is constructed, where each record indicates the start and end of one molecule, and the number of reads that compose that molecule. Unusually small molecules, shorter than 2 kbp by default, are filtered out.

Physical molecule depth of coverage is the number of molecules that span a point. A molecule spans a point when one of its reads aligns to the left of that point and another of its reads (with the same barcode) aligns to the right of that point. Regions with poor physical molecule coverage indicate potentially problematic regions of the assembly. At a misassembly involving a repeat, molecules may start in left flanking unique sequence and end in the repeat, and molecules may start in the repeat and end in right flanking unique sequence. This seemingly uninterrupted molecule coverage may give the appearance that the region is well covered by molecules. Closer inspection may reveal that no molecules span the repeat entirely, from the left flanking sequence to the right flanking sequence. Tigmint checks that each region of a fixed size specified by the user, 1000 bp by default, is spanned by a minimum number of molecules, 20 by default.

Fig. 2 The block diagram of Tigmint. Input files are shown in parallelograms. Intermediate files are shown in rectangles. Output files are shown in ovals. File formats are shown in parentheses

Tigmint constructs an interval tree of the coordinates of the molecules using the Python package Intervaltree. The interval tree allows us to quickly identify and count the molecules that span a given region of the draft assembly. Regions that have a sufficient number of spanning molecules, 20 by default, are deemed well-covered, and regions that do not are deemed poorly-covered and reveal possible misassemblies. We inspect the molecule coverage of each contig with a sliding window of 1000 bp (by default) with a step size of 1 bp. Tigmint cuts the assembly after the last base of a well-covered window before a run of poorly-covered windows, and then cuts the assembly again before the first base of the first well-covered window following that run of poorly-covered windows, shown in Listing 1. The coordinates of these cut points are recorded in a BED file. The sequences of the draft assembly are split at these cut points, producing a corrected FASTA file.

Listing 1 A window of w bp spanned by at least n molecules is well covered. Use the interval tree molecules to identify regions that are not well covered by molecules. Return a set of positions (cut points) at which to split the contig.
Interval coordinates are zero-based and half open.

    determine_cutpoints = function(molecules, contig_length, n, w)
        cutpoints = []
        for i in [0, contig_length - w - 1)
            interval_0 = [i, i + w)
            interval_1 = [i + 1, i + w + 1)
            count_0 = |molecules.spanning(interval_0)|
            count_1 = |molecules.spanning(interval_1)|
            if count_0 >= n and count_1 < n
                cutpoints.insert(interval_0.end)
            else if count_0 < n and count_1 >= n
                cutpoints.insert(interval_1.start)
        return cutpoints

Tigmint will optionally run ARCS [4] to scaffold these corrected sequences and improve the contiguity of the assembly. Tigmint corrects misassemblies in the draft genome to improve the correctness of the assembly, but Tigmint itself cannot improve the contiguity of the assembly. ARCS merges contigs into scaffolds by identifying ends of contigs that share common barcodes. However, ARCS in itself would not be able to make the join if the correct mate of a contig end is buried deep within a misassembled contig. Tigmint corrects the misassembly, which exposes the end of the previously misassembled contig, so that ARCS is now able to make that merge. Tigmint and ARCS work together to improve both the correctness and contiguity of an assembly.

Tigmint will optionally compare the scaffolds to a reference genome, if one is provided, using QUAST [17] to compute contiguity (NGA50) and correctness (number of putative misassemblies) of the assemblies before Tigmint, after Tigmint, and after ARCS. Each misassembly identified by QUAST reveals a difference between the assembly and the reference, and may indicate a real misassembly or a structural variation between the reference and the sequenced genome. The NGA50 metric summarizes both assembly contiguity and correctness by computing the NG50 of the lengths of alignment blocks to a reference genome, correcting the contiguity metric by accounting for possible misassemblies.
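The relationship between NG50 and NGA50 described above can be illustrated with a minimal Python sketch. It assumes the usual definition of NG50 (the largest length L such that sequences of length at least L together cover at least half of the genome size) and treats NGA50 as the same computation applied to aligned block lengths. The function names and the toy numbers are hypothetical, and this is not QUAST's implementation, which also has to decide how alignments are broken into blocks.

```python
from typing import Iterable


def ng50(lengths: Iterable[int], genome_size: int) -> int:
    """Largest length L such that sequences of length >= L sum to at
    least half of genome_size; 0 if they never reach half."""
    total = 0
    for length in sorted(lengths, reverse=True):
        total += length
        if 2 * total >= genome_size:
            return length
    return 0


def nga50(block_lengths: Iterable[int], genome_size: int) -> int:
    """NG50 of the lengths of aligned blocks rather than whole scaffolds."""
    return ng50(block_lengths, genome_size)


# Hypothetical toy data: a 100 bp genome assembled into three scaffolds.
scaffold_lengths = [50, 30, 20]
# Assume the 50 bp scaffold is chimeric and aligns to the reference as two
# blocks of 30 bp and 20 bp; the other two scaffolds align end to end.
block_lengths = [30, 20, 30, 20]

print(ng50(scaffold_lengths, 100))  # 50
print(nga50(block_lengths, 100))    # 30
```

In this toy case the single chimeric scaffold lowers the metric from 50 to 30, which is the sense in which NGA50 corrects the contiguity metric for possible misassemblies.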
The NGA50 metric, however, also penalizes sequences at points of true variation between the sequenced and reference genomes. The true but unknown contiguity of the assembly, which accounts for misassemblies but not for structural variation, therefore lies somewhere between the lower bound of NGA50 and the upper bound of NG50.

Evaluation

We have evaluated the effectiveness of Tigmint on assemblies of both short and long read sequencing data, including assemblies of Illumina paired-end and mate-pair sequencing using ABySS and DISCOVARdenovo, a Supernova assembly of linked reads, a Falcon assembly of PacBio sequencing, a Canu assembly of Oxford Nanopore sequencing, and an ABySS assembly of simulated Illumina sequencing (see Table 1). All assemblies are of the Genome in a Bottle (GIAB) human sample HG004, except the Canu assembly of human sample NA12878. The sample HG004 was selected for the variety of data types available, including Illumina 2×250 paired-end and mate-pair sequencing, linked reads, and PacBio sequencing [18]. NA12878 was selected for the availability of an assembly of Oxford Nanopore sequencing [19] as well as the linked read sequencing needed by Tigmint.

Table 1 Genome assemblies of both short and long read sequencing were used to evaluate Tigmint

Sample    Sequencing Platform    Assembler
HG004     Illumina               ABySS
HG004     Illumina               DISCOVARdenovo
HG004     10x Chromium           Supernova
HG004     PacBio                 Falcon
NA12878   Oxford Nanopore        Canu

The GIAB sample HG004 is also known as NA24143. See “Availability of data and material” to access the sequencing data and assemblies

We downloaded the ABySS 2.0 [20] assembly of HG004 abyss-2.0/scaffolds.fa from NCBI, assembled from Illumina paired-end and mate-pair reads [18]. We downloaded the Illumina mate pair reads for this individual from NCBI. We trimmed adapters using NxTrim 0.4.0 [21] with parameters --norc --joinreads --preserve-mp and selected the reads identified as known mate pairs.
We ran NxRepair 0.13 [12] to correct the ABySS 2.0 assembly of HG004 using these trimmed mate-pair reads. A range of values of its z-score threshold parameter T was tested.

We downloaded the 10x Genomics Chromium reads for this same individual from NCBI, and we extracted barcodes from the reads using Long Ranger Basic. We ran Tigmint to correct the ABySS 2.0 assembly of HG004 using these Chromium reads with the parameters window = 2000 and span = 20. The choice of parameters is discussed in the results. Both the uncorrected and corrected assemblies were scaffolded using ARCS. These assemblies were compared to the chromosome sequences of the GRCh38 reference genome using QUAST [17]. Since ARCS version 1.0.0 that we used does not estimate gap sizes using linked reads, the QUAST parameter --scaffold-gap-max-size is set to 100 kbp.

We repeated this analysis using Tigmint, ARCS, and QUAST with five other assemblies. We downloaded the reads assembled with DISCOVARdenovo and scaffolded using BESST [22] from NCBI, and the same DISCOVARdenovo contigs scaffolded using ABySS-Scaffold. We assembled the linked reads with Supernova 2.0.0 [2], which used neither the 2×250 paired-end reads nor mate-pair reads.

We applied Tigmint and ARCS to two assemblies of single-molecule sequencing (SMS) reads. We downloaded PacBio reads assembled with Falcon from NCBI [23] and Oxford Nanopore reads assembled with Canu [19].

Most software used in these analyses was installed using Linuxbrew [24] with the command brew tap brewsci/bio; brew install abyss arcs bwa lrsim miller minimap2 nxtrim samtools seqtk. We used the development version of QUAST 5 revision 78806b2, which is capable of analyzing assemblies of large genomes using Minimap2 [25].

Results and discussion

Correcting the ABySS assembly of the human data set HG004 with Tigmint reduces the number of misassemblies identified by QUAST by 216, a reduction of 27%.
While the scaffold NG50 decreases slightly from 3.65 Mbp to 3.47 Mbp, the scaffold NGA50 remains unchanged; thus in this case, correcting the assembly with Tigmint improves the correctness of the assembly without substantially reducing its contiguity. However, scaffolding the uncorrected and corrected assemblies with ARCS yields markedly different results: a 2.5-fold increase in NGA50 from 3.1 Mbp to 7.9 Mbp without Tigmint versus a more than five-fold increase in NGA50 to 16.4 Mbp with Tigmint. Further, correcting the assembly and then scaffolding yields a final assembly that is both more correct and more contiguous than the original assembly, as shown in Fig. 3 and Table 2.

Correcting the DISCOVARdenovo + BESST assembly reduces the number of misassemblies by 75, a reduction of 13%. Using Tigmint to correct the assembly before scaffolding with ARCS yields an increase in NGA50 of 28% over using ARCS without Tigmint. Correcting the DISCOVARdenovo + ABySS-Scaffold assembly reduces the number of misassemblies by 35 (5%), after which scaffolding with ARCS improves the NGA50 to 23.7 Mbp, 2.6 times that of the original assembly and a 40% improvement over ARCS without Tigmint. The assembly with the fewest misassemblies is DISCOVARdenovo + BESST + Tigmint. The assembly with the largest NGA50 is DISCOVARdenovo + ABySS-Scaffold + Tigmint + ARCS. Finally, DISCOVARdenovo + BESST + Tigmint + ARCS strikes a good balance between good contiguity and few misassemblies.

Correcting the Supernova assembly of the HG004 linked reads with Tigmint reduces the number of misassemblies by 82, a reduction of 8%, and after scaffolding the corrected assembly with ARCS, we see a slight (<1%) decrease in both misassemblies and NGA50 compared to the original Supernova assembly.
Since the Supernova assembly is composed entirely of the linked reads, this result is concordant with our expectation of no substantial gains from using these same data to correct and scaffold the Supernova assembly.

We attempted to correct the ABySS assembly using NxRepair, which made no corrections for any value of its z-score threshold parameter T less than -2.7. Setting T = -2.4, NxRepair reduced the number of misassemblies from 790 to 611, a reduction of 179 or 23%, whereas Tigmint reduced misassemblies by 216 or 27%. NxRepair reduced the NGA50 by 34% from 3.09 Mbp to 2.04 Mbp, unlike Tigmint, which did not reduce the NGA50 of the assembly. Tigmint produced an assembly that is both more correct and more contiguous than NxRepair with T = -2.4. Smaller values of T corrected fewer errors than Tigmint, and larger values of T further decreased the contiguity of the assembly.

We similarly corrected the two DISCOVARdenovo assemblies using NxRepair with T = -2.4, shown in Fig. 4. The DISCOVARdenovo + BESST assembly corrected by Tigmint is both more correct and more contiguous than that corrected by NxRepair. The DISCOVARdenovo + ABySS-Scaffold assembly corrected by NxRepair has 16 (2.5%) fewer misassemblies than that corrected by Tigmint, but the NGA50 is reduced from 9.04 Mbp with Tigmint to 5.53 Mbp with NxRepair, a reduction of 39%.

The assemblies of SMS reads have contig NGA50s in the megabases. Tigmint and ARCS together more than double the scaffold NGA50 of the Canu assembly to nearly 11 Mbp and nearly triple the scaffold NGA50 of the Falcon assembly to 12 Mbp, and both assemblies have fewer misassemblies than their original assembly, shown in Fig. 5. Thus, using Tigmint and ARCS together improves both the contiguity and correctness over the original assemblies. This result demonstrates that by using long reads in combination with linked reads, one can achieve an assembly quality that is not currently possible with either technology alone.

Fig. 3 Assembly contiguity and correctness metrics of HG004 with and without correction using Tigmint prior to scaffolding with ARCS. The most contiguous and correct assemblies are found in the top-left. Supernova assembled linked reads only, whereas the others used paired end and mate pair reads

Table 2 The assembly contiguity (scaffold NG50 and NGA50) and correctness (number of misassemblies) metrics with and without correction using Tigmint prior to scaffolding with ARCS

Sample      Assembly                    NG50 (Mbp)   NGA50 (Mbp)   Misass.   Reduction
HG004       ABySS                        3.65         3.09          790      NA
            ABySS+Tigmint                3.47         3.09          574      216 (27.3%)
            ABySS+ARCS                   9.91         7.86          823      NA
            ABySS+Tigmint+ARCS          26.39        16.43          641      182 (22.1%)
HG004       DISCO+ABySS                 10.55         9.04          701      NA
            DISCO+ABySS+Tigmint         10.16         9.04          666      35 (5.0%)
            DISCO+ABySS+ARCS            29.20        17.05          829      NA
            DISCO+ABySS+Tigmint+ARCS    35.31        23.68          804      25 (3.0%)
HG004       DISCO+BESST                  7.01         6.14          568      NA
            DISCO+BESST+Tigmint          6.77         6.14          493      75 (13.2%)
            DISCO+BESST+ARCS            27.64        15.14          672      NA
            DISCO+BESST+Tigmint+ARCS    33.43        19.40          603      69 (10.3%)
HG004       Supernova                   38.48        12.65         1005      NA
            Supernova+Tigmint           17.72        11.43          923      82 (8.2%)
            Supernova+ARCS              39.63        13.24         1052      NA
            Supernova+Tigmint+ARCS      27.35        12.60          998      54 (5.1%)
HG004       Falcon                       4.56         4.21         3640      NA
            Falcon+Tigmint               4.45         4.21         3444      196 (5.4%)
            Falcon+ARCS                 18.14         9.71         3801      NA
            Falcon+Tigmint+ARCS         22.52        11.97         3574      227 (6.0%)
NA12878     Canu                         7.06         5.40         1688      NA
            Canu+Tigmint                 6.87         5.38         1600      88 (5.2%)
            Canu+ARCS                   19.70        10.12         1736      NA
            Canu+Tigmint+ARCS           22.01        10.85         1626      110 (6.3%)
Simulated   ABySS                        9.00         8.28          272      NA
            ABySS+Tigmint                8.61         8.28          217      55 (20.2%)
            ABySS+ARCS                  23.37        17.09          365      NA
            ABySS+Tigmint+ARCS          30.24        24.98          320      45 (12.3%)

ABySS and DISCOVARdenovo are assemblies of Illumina sequencing. Supernova is an assembly of linked read sequencing. Falcon is an assembly of PacBio sequencing. Canu is an assembly of Oxford Nanopore sequencing. Data simulated with LRSim is assembled with ABySS

Fig. 4 Assembly contiguity and correctness metrics of HG004 corrected with NxRepair, which uses mate pairs, and Tigmint, which uses linked reads. The most contiguous and correct assemblies are found in the top-left

The alignments of the ABySS assembly to the reference genome before and after Tigmint are visualized in Fig. 6 using JupiterPlot [26], which uses Circos [27]. A number of split alignments, likely misassemblies, are visible in the assembly before Tigmint, whereas after Tigmint no such split alignments are visible.

The default maximum distance permitted between linked reads in a molecule is 50 kbp, which is the value used by the Long Ranger and Lariat tools of 10x Genomics. In our tests, values between 20 kbp and 100 kbp do not substantially affect the results, and values smaller than 20 kbp begin to disconnect linked reads that should be found in a single molecule. The effect of varying the window and spanning molecules parameters of Tigmint on the assembly contiguity and correctness metrics is shown in Fig. 7. When varying the spanning molecules parameter, the window parameter is fixed at 2 kbp, and when varying the window parameter, the spanning molecules parameter is fixed at 20. The assembly metrics of the ABySS, DISCOVARdenovo + ABySS-Scaffold, and DISCOVARdenovo + BESST assemblies after correction with Tigmint are rather insensitive to the spanning molecules parameter for any value up to 50 and to the window parameter for any value up to 2 kbp. The parameter values of span = 20 and window = 2000 worked well for all of the tested assembly tools.

We simulated 434 million 2×250 paired-end and 350 million 2×125 mate-pair read pairs using wgsim of samtools, and we simulated 524 million 2×150 linked read pairs using LRSim [28], emulating the HG004 data set. We assembled these reads using ABySS 2.0.2, and applied Tigmint and ARCS as before.
The assembly metrics are shown in Table 2. We see similar performance to the real data: a 20% reduction in misassemblies after running Tigmint, and a three-fold increase in NGA50 after Tigmint and ARCS. Since no structural rearrangements are present in the simulated data, each misassembly identified by QUAST ought to be a true misassembly, allowing us to calculate precision and recall. For the parameters used with the real data, window = 2000 and span = 20, Tigmint makes 210 cuts in scaffolds at least 3 kbp (QUAST does not analyze shorter scaffolds), and corrects 55 misassemblies of the 272 identified by QUAST, yielding precision and recall of PPV = 55/210 = 0.26 and TPR = 55/272 = 0.20. Altering the window parameter to 1 kbp, Tigmint makes only 58 cuts, and yet it corrects 51 misassemblies, making its precision and recall PPV = 51/58 = 0.88 and TPR = 51/272 = 0.19, a marked improvement in precision with only a small decrease in recall. The scaffold NGA50 after ARCS is 24.7 Mbp, 1% less than with window = 2000. Since the final assembly metrics are similar, using a smaller value for the window size parameter may avoid unnecessary cuts. Small-scale misassemblies, such as collapsed repeats, and relocations and inversions smaller than a typical molecule, cannot be detected by Tigmint.

The primary steps of running Tigmint are mapping the reads to the assembly, determining the start and end coordinate of each molecule, and finally identifying the discrepant regions and correcting the assembly. Mapping the reads to the DISCOVAR + ABySS-Scaffold assembly with BWA-MEM and concurrently sorting by barcode using Samtools [29] in a pipe required 5.5 h (wall-clock) and 17.2 GB of RAM (RSS) using 48 threads on a 24-core hyper-threaded computer. Determining the start and end coordinates of each molecule required 3.25 h and 0.08 GB RAM using a single thread.
Finally, identifying the discrepant regions of the assembly, correcting the assembly, and creating a new FASTA file required 7 min and 3.3 GB RAM using 48 threads. The slowest step of mapping the reads to the assembly could be made faster by using light-weight mapping rather than full alignment, since Tigmint needs only the positions of the reads, not their alignments. NxRepair required 74.9 GB of RAM (RSS) and 5 h 19 min of wall-clock time using a single CPU core, since it is not parallelized.

Fig. 5 Assemblies of Oxford Nanopore sequencing of NA12878 with Canu and PacBio sequencing of HG004 with Falcon with and without correction using Tigmint prior to scaffolding with ARCS

Fig. 6 The alignments to the reference genome of the ABySS assembly of HG004 before and after Tigmint. The reference chromosomes are on the left in colour, the assembly scaffolds on the right in grey. No translocations are visible after Tigmint

When aligning an assembly of an individual’s genome to a reference genome of its species, we expect to see breakpoints where the assembled genome differs from the reference genome. These breakpoints are caused by both misassemblies and true differences between the individual and the reference. The median number of mobile-element insertions, for example, just one class of structural variant, is estimated to be 1218 per individual [30]. Misassemblies can be corrected by inspecting the alignments of the reads to the assembly and cutting the scaffolds at positions not supported by the reads. Reported misassemblies due to true structural variation will, however, remain. For this reason, even a perfectly corrected assembly is expected to have a number of differences when compared to the reference.

Conclusions

Tigmint uses linked reads to reduce the number of misassemblies in a genome sequence assembly.
Such a correction does not appreciably affect the contiguity of the assembly, while yielding an assembly that is more correct. Most scaffolding tools order and orient the sequences that they are given, but do not attempt to correct misassemblies. These misassemblies hold back the contiguity that can be achieved by scaffolding. Two sequences that should be connected together cannot be when one of those two sequences is connected incorrectly to a third sequence. By first correcting these misassemblies, the scaffolding tool can do a better job of connecting sequences, and we observe precisely this synergistic effect. Scaffolding an assembly that has been corrected with Tigmint yields a final assembly that is both more correct and substantially more contiguous than an assembly that has not been corrected.

Fig. 7 Effect of varying the window and span parameters on scaffold NGA50 and misassemblies of three assemblies of HG004 (panels a, b, c, d)

Linked read sequencing has two advantages over paired-end and mate-pair reads to identify and correct misassemblies. Firstly, the physical coverage of the large molecules of linked reads is more consistent and less prone to coverage dropouts than that of paired-end and mate-pair sequencing data. Since roughly a hundred read pairs are derived from each molecule, the mapping of the large molecule as a whole to the draft genome is less affected by the GC content and repetitiveness of any individual read. Secondly, paired-end and mate-pair reads are derived from molecules typically smaller than 1 kbp and 10 kbp respectively. Short reads align ambiguously to repetitive sequence that is larger than the DNA molecule size of the sequencing library.
The linked reads of 10x Genomics Chromium are derived from molecules of about 100 kbp, which are better able to uniquely align to repetitive sequence and resolve misassemblies around repeats.

Using single-molecule sequencing in combination with linked reads enables a genome sequence assembly that achieves both high sequence contiguity and high scaffold contiguity, a feat not currently achievable with either technology alone. Although paired-end and mate-pair sequencing is often used to polish a long-read assembly to improve its accuracy at the nucleotide level, it is not well suited to polish the repetitive sequence of the assembly, where the reads align ambiguously. Linked reads would resolve this mapping ambiguity and are uniquely suited to polishing an assembly of long reads, an opportunity for further research in the hybrid assembly of long and linked reads.

Availability and requirements

Project name: Tigmint
Project home page: https://github.com/bcgsc/tigmint
Operating system: Platform independent
Programming language: Python
License: GNU GPL v3.0

Abbreviations

BED: Browser extensible data; bp: Base pair; GIAB: Genome in a bottle; kbp: Kilobase pair; NCBI: National center for biotechnology information; RAM: Random access memory; RSS: Resident set size; SMS: Single-molecule sequencing; SRA: Sequence read archive

Funding

This work was funded by Genome Canada, Genome BC, Natural Sciences and Engineering Research Council of Canada (NSERC), National Institutes of Health (NIH). The funding agencies were not involved in the design of the study and collection, analysis, interpretation of data, nor writing the manuscript.

Availability of data and materials

The script to run the data analysis is available online at https://github.com/sjackman/tigmint-data.
Tigmint may be installed using PyPI, Bioconda [31], Homebrew, or Linuxbrew [24].
The datasets generated and/or analysed during the current study are available from NCBI.
HG004 Illumina mate-pair reads SRA SRR2832452-SRR283245 [18] http://bit.ly/hg004-6kb or http://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG004_NA24143_mother/NIST_Stanford_Illumina_6kb_matepair/fastqs/
HG004 10x Genomics Chromium linked reads [18] http://bit.ly/giab-hg004-chromium or http://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG004_NA24143_mother/10Xgenomics_ChromiumGenome/NA24143.fastqs/
HG004 ABySS 2.0 and Discovar de novo assemblies [20] http://bit.ly/giab-hg004 or https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/BCGSC_HG004_ABySS2.0_assemblies_12082016/
HG004 PacBio reads assembled with Falcon [18] http://bit.ly/giab-falcon or https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/MtSinai_PacBio_Assembly_falcon_03282016/
NA12878 Oxford Nanopore reads assembled with Canu [19] https://www.ncbi.nlm.nih.gov/assembly/GCA_900232925.1/
NA12878 10x Genomics Chromium linked reads https://support.10xgenomics.com/de-novo-assembly/datasets/2.0.0/wfu/

Authors’ contributions
SDJ drafted the manuscript. SDJ and IB revised the manuscript. SDJ designed and executed the data analysis. SDJ, LC, and ZX performed exploratory data analysis. SDJ, LC, and JC implemented Tigmint. SDJ, LC, JC, RLW, and SY implemented ARCS. SDJ, BPV, and HM implemented ABySS 2. JC implemented JupiterPlot and created the JupiterPlot figure. JB, SJMJ, and IB supervised the project and secured funding.
All authors provided critical feedback on the manuscript, and read and approved the final manuscript.

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Competing interests
The authors declare that they have no competing interests.

Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author details
1 BC Cancer Genome Sciences Centre, Vancouver, BC V5Z 4S6, Canada.
2 University of British Columbia, Michael Smith Laboratories, Vancouver, BC V6T 1Z4, Canada.

Received: 8 June 2018 Accepted: 9 October 2018

References
1. Zheng GXY, Lau BT, Schnall-Levin M, Jarosz M, Bell JM, Hindson CM, et al. Haplotyping germline and cancer genomes with high-throughput linked-read sequencing. Nat Biotechnol. 2016;34:303–11. https://doi.org/10.1038/nbt.3432.
2. Weisenfeld NI, Kumar V, Shah P, Church DM, Jaffe DB. Direct determination of diploid genome sequences. Genome Res. 2017;27:757–67. https://doi.org/10.1101/gr.214874.116.
3. Mostovoy Y, Levy-Sakin M, Lam J, Lam ET, Hastie AR, Marks P, et al. A hybrid approach for de novo human genome sequence assembly and phasing. Nat Methods. 2016;13:587–90. https://doi.org/10.1038/nmeth.3865.
4. Yeo S, Coombe L, Warren RL, Chu J, Birol I. ARCS: Scaffolding genome drafts with linked reads. Bioinformatics. 2017;34:725–31. https://doi.org/10.1093/bioinformatics/btx675.
5. 10x Genomics, Inc. Overview of Genome Software. https://support.10xgenomics.com/genome-exome/software/overview/welcome. Accessed 1 Jun 2018.
6. Spies N, Weng Z, Bishara A, McDaniel J, Catoe D, Zook JM, et al. Genome-wide reconstruction of complex structural variants using read clouds. Nat Methods. 2017;14:915–20. https://doi.org/10.1038/nmeth.4366.
7. Elyanow R, Wu H-T, Raphael BJ. Identifying structural variants using linked-read sequencing data; 2017. https://doi.org/10.1101/190454.
8. Fang H. Topsorter: Graphical assessment of structural variants using 10x genomics data. https://github.com/hanfang/Topsorter. Accessed 1 Jun 2018.
9. Kuleshov V, Snyder MP, Batzoglou S. Genome assembly from synthetic long read clouds. Bioinformatics. 2016;32:i216–24. https://doi.org/10.1093/bioinformatics/btw267.
10. Adey A, Kitzman JO, Burton JN, Daza R, Kumar A, Christiansen L, et al. In vitro, long-range sequence information for de novo genome assembly via transposase contiguity. Genome Res. 2014;24:2041–9. https://doi.org/10.1101/gr.178319.114.
11. Walker BJ, Abeel T, Shea T, Priest M, Abouelliel A, Sakthikumar S, et al. Pilon: An integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS ONE. 2014;9:e112963. https://doi.org/10.1371/journal.pone.0112963.
12. Murphy RR, O’Connell J, Cox AJ, Schulz-Trieglaff O. NxRepair: Error correction in de novo sequence assembly using nextera mate pairs. PeerJ. 2015;3:e996. https://doi.org/10.7717/peerj.996.
13. Jiao W-B, Accinelli GG, Hartwig B, Kiefer C, Baker D, Severing E, et al. Improving and correcting the contiguity of long-read genome assemblies of three plant species using optical mapping and chromosome conformation capture data. Genome Res. 2017;27:778–86. https://doi.org/10.1101/gr.213652.116.
14. Warren RL, Yang C, Vandervalk BP, Behsaz B, Lagman A, Jones SJM, et al. LINKS: Scalable, alignment-free scaffolding of draft genomes with long reads. GigaScience. 2015;4. https://doi.org/10.1186/s13742-015-0076-3.
15. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM; 2013. arXiv:1303.3997.
16. Quinlan AR, Hall IM. BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–2. https://doi.org/10.1093/bioinformatics/btq033.
17. Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: Quality assessment tool for genome assemblies. Bioinformatics. 2013;29:1072–5. https://doi.org/10.1093/bioinformatics/btt086.
18. Zook JM, Catoe D, McDaniel J, Vang L, Spies N, Sidow A, et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci Data. 2016;3:160025. https://doi.org/10.1038/sdata.2016.25.
19. Jain M, Koren S, Miga KH, Quick J, Rand AC, Sasani TA, et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat Biotechnol. 2018;36:338–45. https://doi.org/10.1038/nbt.4060.
20. Jackman SD, Vandervalk BP, Mohamadi H, Chu J, Yeo S, Hammond SA, et al. ABySS 2.0: Resource-efficient assembly of large genomes using a bloom filter. Genome Res. 2017;27:768–77. https://doi.org/10.1101/gr.214346.116.
21. O’Connell J, Schulz-Trieglaff O, Carlson E, Hims MM, Gormley NA, Cox AJ. NxTrim: Optimized trimming of illumina mate pair reads; 2014. https://doi.org/10.1101/007666.
22. Sahlin K, Chikhi R, Arvestad L. Assembly scaffolding with pe-contaminated mate-pair libraries. Bioinformatics. 2016;32:1925–32. https://doi.org/10.1093/bioinformatics/btw064.
23. Chin C-S, Peluso P, Sedlazeck FJ, Nattestad M, Concepcion GT, Clum A, et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat Methods. 2016;13:1050–4. https://doi.org/10.1038/nmeth.4035.
24. Jackman SD, Birol I. Linuxbrew and Homebrew for cross-platform package management [v1; not peer reviewed]. F1000Research. 2016;5(ISCB Comm J):1795 (poster). https://doi.org/10.7490/f1000research.1112681.1.
25. Li H. Minimap2: Versatile pairwise alignment for nucleotide sequences. arXiv:1708.01492. 2017.
26. Chu J. JupiterPlot: Circos assembly consistency plot. https://github.com/JustinChu/JupiterPlot. Accessed 1 Jun 2018.
27. Krzywinski M, Schein J, Birol I, Connors J, Gascoyne R, Horsman D, et al. Circos: An information aesthetic for comparative genomics. Genome Res. 2009;19:1639–45. https://doi.org/10.1101/gr.092759.109.
28. Luo R, Sedlazeck FJ, Darby CA, Kelly SM, Schatz MC. LRSim: A linked reads simulator generating insights for better genome partitioning; 2017. https://doi.org/10.1101/103549.
29. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The sequence alignment/map format and samtools. Bioinformatics. 2009;25:2078–9. https://doi.org/10.1093/bioinformatics/btp352.
30. Sudmant PH, Rausch T, Gardner EJ, Handsaker RE, Abyzov A, Huddleston J, et al. An integrated map of structural variation in 2504 human genomes. Nature. 2015;526:75–81. https://doi.org/10.1038/nature15394.
31. Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, et al. Bioconda: Sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018;15:475–6. https://doi.org/10.1038/s41592-018-0046-7.
