@prefix vivo: . @prefix edm: . @prefix ns0: . @prefix dcterms: . @prefix skos: . vivo:departmentOrSchool "Forestry, Faculty of"@en, "Wood Science, Department of"@en, "Non UBC"@en ; edm:dataProvider "DSpace"@en ; ns0:identifierCitation "Genome Biology. 2009 Sep 11;10(9):R94"@en ; dcterms:contributor "Michael Smith Laboratories"@en ; ns0:rightsCopyright "DiGuistini et al.."@en ; dcterms:creator "DiGuistini, Scott"@en, "Liao, Nancy Y."@en, "Platt, Darren"@en, "Robertson, Gordon"@en, "Seidel, Michael"@en, "Chan, Simon K."@en, "Docking, T. R."@en, "Birol, Inanc"@en, "Holt, Robert A."@en, "Hirst, Martin"@en, "Mardis, Elaine"@en, "Marra, Marco, 1966-"@en, "Hamelin, Richard C."@en, "Bohlmann, Jörg"@en, "Breuil, Colette"@en, "Jones, Steven J. M."@en ; dcterms:issued "2015-11-05T03:19:12"@en, "2009-09-11"@en ; dcterms:description "Sequencing-by-synthesis technologies can reduce the cost of generating de novo genome assemblies. We report a method for assembling draft genome sequences of eukaryotic organisms that integrates sequence information from different sources, and demonstrate its effectiveness by assembling an approximately 32.5 Mb draft genome sequence for the forest pathogen Grosmannia clavigera, an ascomycete fungus. We also developed a method for assessing draft assemblies using Illumina paired end read data and demonstrate how we are using it to guide future sequence finishing. 
Our results demonstrate that eukaryotic genome sequences can be accurately assembled by combining Illumina, 454 and Sanger sequence data."@en ; edm:aggregatedCHO "https://circle.library.ubc.ca/rest/handle/2429/55168?expand=metadata"@en ; skos:note "Open Access
2009 DiGuistini et al. Volume 10, Issue 9, Article R94
Method
De novo genome sequence assembly of a filamentous fungus using Sanger, 454 and Illumina sequence data
Scott DiGuistini¤*, Nancy Y Liao¤†, Darren Platt‡, Gordon Robertson†, Michael Seidel†, Simon K Chan†, T Roderick Docking†, Inanc Birol†, Robert A Holt†, Martin Hirst†, Elaine Mardis§, Marco A Marra†, Richard C Hamelin¶, Jörg Bohlmann¥, Colette Breuil* and Steven JM Jones†
Addresses: *Department of Wood Science, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada. †BC Cancer Agency Genome Sciences Centre, Vancouver, BC, V5Z 4E6, Canada. ‡Amyris Biotechnologies, Inc., Hollis Street, Emeryville, CA 94608, USA. §Washington University School of Medicine, Forest Park Ave, St Louis, MO 63108, USA. ¶Natural Resources Canada, rue du PEPS, Ste-Foy, Quebec, G1V 4C7, Canada. ¥Michael Smith Laboratories, University of British Columbia, Vancouver, BC, V6T 1Z3, Canada.
¤ These authors contributed equally to this work.
Correspondence: Steven JM Jones. Email: sjones@bcgsc.ca
© 2009 DiGuistini et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
De novo sequence assembly

A method for de novo assembly of a eukaryotic genome using Illumina, 454 and Sanger generated sequence data

Abstract
Sequencing-by-synthesis technologies can reduce the cost of generating de novo genome assemblies. We report a method for assembling draft genome sequences of eukaryotic organisms that integrates sequence information from different sources, and demonstrate its effectiveness by assembling an approximately 32.5 Mb draft genome sequence for the forest pathogen Grosmannia clavigera, an ascomycete fungus. We also developed a method for assessing draft assemblies using Illumina paired end read data and demonstrate how we are using it to guide future sequence finishing. Our results demonstrate that eukaryotic genome sequences can be accurately assembled by combining Illumina, 454 and Sanger sequence data.

Published: 11 September 2009
Genome Biology 2009, 10:R94 (doi:10.1186/gb-2009-10-9-r94)
Received: 5 June 2009
Accepted: 11 September 2009
The electronic version of this article is the complete one and can be found online at http://genomebiology.com/2009/10/9/R94

Background
The efficiency of de novo genome sequence assembly processes depends heavily on the length, fold-coverage and per-base accuracy of the sequence data. Despite substantial improvements in the quality, speed and cost of Sanger sequencing, generating a high quality draft de novo genome sequence for a eukaryotic genome remains expensive. New sequencing-by-synthesis systems from Roche (454), Illumina (Genome Analyzer) and ABI (SOLiD) offer greatly reduced per-base sequencing costs. While they are attractive for generating de novo sequence assemblies for eukaryotes, these technologies add several complicating factors: they generate short (typically 450 bp for 454; 50 to 100 bp for Illumina and SOLiD) reads that cannot resolve low complexity sequence regions or distributed repetitive elements; they have system-specific error models; and they can have higher base-calling error rates. To this point, then, de novo assemblies that use either 454 data alone, or that combine 454 with Sanger data in a 'hybrid' approach, have been reported only for prokaryote genomes, and no de novo assemblies that use Illumina reads, either alone or in combination with Sanger and 454 read data, have been reported for a eukaryotic genome.

In principle, it should be possible to generate a de novo genome sequence for a eukaryotic genome by combining sequence information from different technologies. However, the new sequencing technologies are evolving rapidly, and no comprehensive bioinformatic system has been developed for optimizing such an approach. Such a system should flexibly integrate read data from different sequencing platforms while addressing sequencing depth, read quality and error models. Read quality and error models raise two challenges. First, while it is desirable to identify a subset of high quality reads prior to genome assembly, and established read quality scoring methods exist for Sanger sequence data, there are no rigorous equivalents for 454 or Illumina reads [1]. Second, error models differ between different sequencing technologies.

A number of genome assemblers are currently available for combining Sanger and 454 read collections, as well as specialized short read assembly programs like ALLPATHS, SSAKE, Velvet and ABySS [2-5]. However, short reads require greater sequencing depth to ensure specificity in read overlaps, as shorter overlaps cause ambiguities in the assembly stage. This increased sequence depth prevents both applying the traditional overlap-layout-consensus method directly and extending Sanger/454 hybrid assemblers to use ultra-short reads.
Assemblers that are primarily intended for short reads can process deep coverage read data; however, because read length and software limitations restrict the unambiguous sequence regions that they can assemble and they currently lack the capacity for scaffolding contigs effectively, they are typically limited to ultra-short reads. When we assessed such assemblers, the above challenges - likely compounded by the high error rate in our earlier Illumina read collections - resulted in contigs that were either too short or too unreliable to support comparing homologous blocks of sequence between genomes.

The Forge genome assembler [6] was designed for assembling combinations of reads from Sanger and 'next-generation' sequencing technologies, and attempts to address the above challenges. Distributed memory hash tables and pruned overlap graphs allow its classical overlap-layout-consensus approach to handle large data sets with deep coverage. Simulation techniques embedded in the algorithm allow it to automatically adapt to varying read lengths and error characteristics to accommodate rapidly changing performance in next-generation sequencing platforms.

In the work described here, we developed a hybrid approach that uses Forge for generating de novo draft genome sequences, and applied the approach to a filamentous fungus, Grosmannia clavigera (Gc). To generate the draft sequence, we combined: conventional, 40-kb fosmid paired-end (PE) Sanger reads from an ABI 3730xl sequencer; single-end (SE) 454 reads from Roche GS20 and GS-FLX sequencers; and PE reads from an Illumina Genome Analyzer (GAii) sequencer. The current sequence assembly is approximately 32.5 Mb in length and has an N50 scaffold size of approximately 782 kb. The assembly as well as the raw read data are available from the National Center for Biotechnology Information (NCBI; see Materials and methods).
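The N50 values quoted throughout can be computed as in the following minimal sketch. It assumes the common definition of N50 (the length L such that sequences of length L or longer hold at least half of the assembled bases); the function name and the toy lengths are hypothetical and are not taken from the Gc assembly.

```python
# Sketch only: N50 under the common definition, not the authors' own script.

def n50(lengths):
    # Return the length L such that sequences of length >= L together
    # contain at least half of the total assembled bases.
    if not lengths:
        raise ValueError('need at least one sequence length')
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length

# Hypothetical scaffold lengths in bp (illustration only).
toy_scaffolds = [900, 500, 300, 200, 100]
print(n50(toy_scaffolds))  # prints 500
```

The same function applies to contig or scaffold lengths; only the input list changes.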
We describe how we prepared read data for assembly by filtering and trimming using an internally developed pipeline, which we make available [7]. We outline below our experience in assembling this eukaryotic genome using the Velvet and Forge assemblers. We also describe a bioinformatic approach for assessing the accuracy of such hybrid assemblies when no high quality reference sequence exists.

Results
Generating sequence data
We assembled a genome sequence for Gc using the pipeline described below and in Figure 1. We first constructed a fosmid library, from which we generated 18,424 Sanger PE sequences (approximately 0.3-fold genome sequence coverage). We then used sheared genomic DNA to generate seven read sets on Roche GS20 and GS-FLX sequencers, producing 3,045,953 reads with 100.0 and 224.5 bp average lengths, respectively (250 Mb of sequence data; approximately 7.7-fold genome sequence coverage). Finally, we supplemented these data sets with PE, 42-bp reads (82,655,316) for a single library of approximately 200-bp sheared genomic DNA fragments on an Illumina GAii (approximately 3.3 Gb of sequence data; approximately 100-fold genome sequence coverage).

Initial assembly analysis
Initially, Illumina PE read data required preassembly, as we were unable to complete a Forge (v.20090319) run using our entire read collection; we integrated these data by preassembling them with Velvet. We assembled the read data described above, alone or in combination, and devised a strategy for refining these assemblies. Using Velvet (v.6.04 and v.7.31), we assessed assemblies generated from Illumina PE read data and Illumina with Sanger PE read data (see Materials and methods: Assembling Illumina data); using Forge we assessed assemblies generated from 454 SE read data, 454 SE with Sanger PE read data and 454 SE and Sanger PE read data plus a Velvet-preassembled contig backbone.
We used a collection of 7,169 unique expressed sequence tag (EST) sequences to do an initial assessment of these assemblies. From the EST-to-genome alignments, we determined the number of complete alignments as well as the number of times an alignment was split between contigs in a resolvable ('partial') or unresolvable ('misassembly') manner (described in Materials and methods), and also identified small insertions or deletions (termed indels). The Velvet assembly generated from Illumina PE data alone yielded an N50 contig length of approximately 24.5 kb, and covered approximately 26.7 Mb of the 32.5 Mb manually finished genome sequence (Table 1). In contrast, a Forge assembly of the 454 read collection yielded an N50 contig length of approximately 5.8 kb and covered approximately 29.5 Mb of the complete genome sequence (Table 2). We checked the overlap between these assemblies, and found that 100% of the Velvet-Illumina assembly was contained within the Forge-454 assembly, while the 454 assembly contained an additional approximately 2.5 Mb of sequence that was not found in the Illumina assembly.

Figure 1. Assembly process overview. Overview of the process for producing de novo assemblies.

Comparing indels across assemblies indicated that the rate at which small (1 to 5 bp) insertions or deletions appeared in the assembled consensus sequence depended on the fraction of 454 data in the assembly (Figure 2).
When we inspected the frequency of each base that was inserted or deleted across all assemblies that used 454 read data, the pattern was consistently A>T>C>G, while Velvet assemblies of Illumina reads produced a C>A>T>G indel pattern, where A, C, G, and T represent indel frequencies for their corresponding bases. To assess whether these small insertions and deletions could disrupt the phasing of the assembled genome sequence (that is, the periodicity of nucleotide sequences within the assembly relative to cis factors), we examined the predicted protein collections from each of these assemblies. Average predicted protein sequences contained 401.1 versus 527.0 amino acids in assemblies that used only 454 or only Illumina data, respectively. Although this difference could be the result of an increased contig N50 length in the Illumina based assembly (Tables 1 and 2), we observed that, in the NCBI non-redundant database [8], the fraction of predicted protein sequences with at least one significantly similar sequence was 60% for the 454-only assembly but 70% for the Illumina-only assembly. This suggests that the shorter average protein lengths in assemblies with greater ratios of 454 reads were due to spurious peptide sequences and not contig end truncations.

Table 1. Velvet assemblies
ID | T42 | T38 | T36 | T36; QRL(Q10) = 28
Total contigs | 6,945 | 8,637 | 19,118 | 39,488
N50 contig | 24,566 (N/A) | 10,706 | 2,902 | 1,299
Total DNA (bp) | 26,721,397 | 26,466,756 | 25,854,719 | 24,812,690
EST analysis* | 6,585/29 | 6,204/24 | 4,657/11 | 2,923/9
*EST alignments are given as: Complete alignments/Misassemblies (see Materials and methods). Velvet assemblies were generated from Illumina GAii read data. Assembly T42 was generated from the untrimmed, no-call and shadow filtered Illumina PE reads. Assemblies T38 and T36 were generated by trimming the last 4 and 6 bp, respectively, from the T42 read set. Assembly T36, QRL(Q10) = 28 was generated with the T36 read set from which reads were removed if they failed the QRL(Q10) = 28 quality region length filtering (see Materials and methods).

Table 2. Forge assemblies
ID | 454 | Sanger-454 | Sanger-454-IlluminaPA | Sanger-454-IlluminaDA
Total scaffolds* | 7,860 | 4,805 | 2,307 | 1,443
N50 contig (scaffold) | 5,773 (N/A) | 7,440 (289,760) | 31,821 (557,565) | 164,278 (187,326)
Total DNA (bp)† | 29,484,877 | 34,841,371 | 39,238,044 | 29,522,629
Number of scaffolds with gaps‡ | 0 | 656 | 163 | 17
Augustus predictions | 10,555 | 10,230 | 8,912 | 8,476
EST analysis§ | 5,544/25 | 5,747/60 | 6,314/40 | 6,685/33
*Scaffolds included in this calculation contained two or more reads and were longer than 500 bp. †Total DNA was calculated excluding gaps and was performed on scaffolds that contained two or more reads and were longer than 500 bp. ‡Gaps included in this calculation were longer than 50 bp. §EST alignments are given as: Complete alignments/Misassemblies (see Materials and methods). Forge assemblies were generated using Illumina, 454 and Sanger read data. The '454' assembly was generated using only 454 SE read data. The 'Sanger-454' assembly was generated by combining the Sanger PE and 454 SE read collections. The 'Sanger-454-IlluminaPA' assembly was generated by combining the Sanger PE and 454 SE read collections with preassembled (PA) contigs generated from Illumina PE reads with Velvet. The 'Sanger-454-IlluminaDA' assembly was generated by combining the Sanger PE and 454 SE read collections with Illumina PE reads (DA = direct assembly).

Assemblies that used 454 read data achieved greater amounts of total assembled DNA, including relatively more sequence annotated with repetitive elements, despite shorter contig N50 values; the 454 assembly and the Sanger-454-Illumina assembly were annotated with approximately equal numbers of repetitive elements, while the Velvet assembly had approximately half as many annotations.
Because the 454 assemblies also had acceptably low EST-detectable misassembly rates, we concluded that a strategy that combined all three read types would be optimal. We assessed validating our assembly methodology using simulation, but found that the results did not accurately reflect the outcomes of working with real read data. This was likely due to the difficulty of accurately modelling read-specific sequence quality and errors (results not shown).

Optimizing Sanger/454 assemblies using 454 read filtering
Filtering 454 SE reads for no-calls, length and sequence complexity incrementally improved the overall quality of the de novo assembled Gc genome sequence relative to a manually finished sequence, which we will refer to as GCgb1 (see Materials and methods for a description). For 454 SE reads, no-call filtering removed 95,833 (3%) reads, and length filtering further removed 141 (0.009%) GS20 reads and 3,583 (0.2%) GS-FLX reads. Applying these filtering strategies reduced both the contig and scaffold N50s, suggesting that when a hybrid assembly includes relatively low 454 SE sequence coverage, filtering reads by no-calls and length may be overly aggressive. However, for our strategy of assembling Sanger PE and 454 SE read data around high-coverage Illumina read data, the two filtering steps were worthwhile; applied together, they improved the integration of the different sequence types and reduced the number of chimeric contig ends by 20% (see Supplementary section 1 in Additional data file 1).

Low complexity regions (that is, genome sequences with a simple repetitive composition) are expected features for a filamentous fungus. We found that reads containing such sequences were associated with misassemblies (data not shown). Using DUST [9] we filtered 522 of the Sanger reads and 3,889 of the 454 reads containing such repetitive composition. Filtering 454 and Sanger reads for low complexity sequences marginally affected contig and scaffold N50; however, it reduced the number of scaffolds containing gaps from 685 to 666, and decreased the number of irresolvable split EST alignments by 7. Given this, we removed reads containing low complexity sequence from the draft assemblies. We intend to resolve such regions in the finishing stage of the sequencing project, using tools and resources that are better suited for such genomic elements.

Figure 2. Consensus sequence quality. The proportion of 454 read data within the total read collection affected the number of small insertions and deletions (indels) based on analysis of 7,169 unique EST-to-genome alignments. The relative proportions of insertions (blue) and deletions (orange) in the assembly sequence are shown in the inset pie chart. Assemblies are described in Tables 1 and 2; those including 454 read data were assembled with Forge; the Illumina-only assembly was generated with Velvet. (Axis label: Assemblies; legend: Insertions, Deletions.)

Improving assemblies with Illumina PE reads by trimming and filtering
Given the promising initial assembly of the Illumina PE read data, we assessed trimming and filtering as a means to improve the Velvet assembly accuracy. Beginning with the 82.6 M, 42-bp PE reads, we discarded 1.1 M reads containing no-call bases and 1.9 M shadow reads (described in Materials and methods).
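The no-call filtering step can be illustrated with a short sketch. It is not the authors' pipeline [7]: it assumes that no-call bases are written as 'N' or '.', and that a read pair is discarded when either mate contains one. Both are assumptions made for illustration, as are the function names. Shadow-read filtering is not sketched because its definition is given only in Materials and methods.

```python
# Sketch only. Assumptions: no-calls appear as 'N' or '.', and a pair is
# dropped when either mate contains a no-call base.

NO_CALL_SYMBOLS = ('N', '.')

def has_no_call(sequence):
    # True if the read contains at least one no-call base.
    upper = sequence.upper()
    return any(symbol in upper for symbol in NO_CALL_SYMBOLS)

def filter_no_call_pairs(pairs):
    # Keep only pairs in which both mates are free of no-call bases.
    kept = []
    for mate_1, mate_2 in pairs:
        if has_no_call(mate_1) or has_no_call(mate_2):
            continue
        kept.append((mate_1, mate_2))
    return kept

# Hypothetical toy pairs (illustration only).
toy_pairs = [
    ('ACGTACGT', 'TTGACCAA'),
    ('ACGNACGT', 'TTGACCAA'),
    ('ACGTACGT', 'TT.ACCAA'),
]
kept_pairs = filter_no_call_pairs(toy_pairs)
print(len(toy_pairs) - len(kept_pairs), 'pairs removed')
```

Dropping the whole pair keeps the two read files synchronized, which paired-end assemblers generally expect; retaining the clean mate as a single-end read would be an alternative design.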
To optimize the Velvet assembly, we used alignments with our preliminary 454 and Sanger sequence assembly to determine trimming and quality read length (QRL; described in Materials and methods) filtering parameters for removing low quality bases from reads (Supplemental section 2 and Figure S4A in Additional data file 1).

As determined by EST alignments and alignments to GCgb1, trimming and filtering improved the accuracy while only marginally reducing the total length of DNA assembled; however, more aggressive read trimming and filtering substantially reduced the contig N50s in Velvet assemblies (Table 1). Trimming Illumina reads from 42 bp to 38 bp (T38) and then to 36 bp (T36) reduced the assembly N50 to 10.7 kb and 2.9 kb, respectively. For the T36 assembly, trimming reduced the total amount of assembled sequence and the number of complete EST-to-assembly alignments, while also reducing the number of EST-detectable assembly errors from 29 to 11 (Table 1). Trimming Illumina reads also reduced the effective level of coverage, which likely explains why the N50 and complete EST-to-genome alignments were reduced. Given this, we assessed whether the improvements in EST-detectable assembly errors could also have resulted from arbitrary read trimming and subsequent shortening of the assembled contig lengths. We tested this by removing 6 bp from the 5' end of each read. In the resulting assembly the N50 and complete EST-to-genome alignment counts were approximately half of the corresponding values for the T36 assembly, and the EST-detectable error rate was five times higher, validating the efficiency of our trimming algorithm.

Filtering low quality data (QRL(Q10) = 28) resulted in an assembly that, relative to the T36 assembly, had a smaller N50 (1,299 bp) but only a marginally lower number of EST-detectable assembly errors. We then tested whether filtering by randomly removing the same number of reads that had been removed by QRL filtering changed the resulting assembly.
We found that although random filtering did not substantially change N50, it tripled the number of EST-detectable errors and doubled the number of ESTs with no genome assembly alignment, validating the efficiency of our filtering algorithm.

Relative to GCgb1, we found that this trimmed and filtered Illumina read collection yielded the most accurate Velvet contigs and that these contigs had approximately 15% fewer chimeric contig ends. Using the approximately 51 M Illumina PE reads resulting from trimming and filtering (approximately 56.5× genome sequence coverage) and the Sanger and 454 data reported above, we attempted two assemblies using a revised version of Forge (v.20090526). We tested: incorporating the Illumina PE data following Velvet preassembly (Sanger-454-IlluminaPA); and incorporating the Illumina PE data directly (Sanger-454-IlluminaDA). EST-to-genome sequence alignments and Illumina PE read alignment cluster analysis showed that the Sanger-454-IlluminaDA genome sequence had a lower misassembly rate than the Sanger-454-IlluminaPA assembly (Table 2). However, alignment to GCgb1 suggested that the Sanger-454-IlluminaPA was a more accurate assembly with regard to long range continuity (Figure 3). The Sanger-454-IlluminaDA assembly had greater contig N50 whereas the Sanger-454-IlluminaPA assembly had greater scaffold N50 (Table 2).

Assessing the final assembly
Assembly Sanger-454-IlluminaPA had 6,314 complete EST alignments and 40 EST-detected assembly errors. The number of scaffolds containing gaps greater than 1 kb, 163, was substantially lower than the 656 in the best assembly achieved without the Illumina PE read data. We assessed the quality of this Forge hybrid assembly using the consistency of the Sanger PE read pairings and 200-bp Illumina PE reads. Adding the Illumina PE read data increased the fraction of consistently-paired Sanger PE reads from 64 to 81% for Sanger-454-IlluminaPA versus the best assembly without Illumina PE read data; for Illumina PE alignment data, the numbers of unpaired reads decreased by 37% and those paired on different scaffolds decreased by 21%, while the number of paired reads on the same scaffold with an appropriate fragment length increased by approximately 1.5 M. The assembly contained 46 scaffolds longer than 100 kb, which represented 88.5% of the total genome sequence. These scaffolds had a G+C content of 53.2%. The 10 largest scaffolds contained 48 gaps with a total length of approximately 181 kb (Figure S5 in Additional data file 1). The longest scaffold was approximately 3.67 Mb and the tenth longest scaffold was approximately 782 kb.

The 454 read coverage and Sanger PE read placements for assembly Sanger-454-IlluminaPA indicate that the distribution of read data was generally uniform across the top ten scaffolds (Figure S5). We noted 12 sequence regions with unexpectedly high read coverage. Preliminary analysis of these sequence regions indicates that, as expected, they were spanned by repetitive elements, primarily transposons. Large gene families with high levels of similarity were also problematic. However, there is no evidence that such genomic elements necessarily ended up in misassemblies; rather, they sometimes caused early contig growth termination by making the collapsed sequence data unavailable to other appropriate genomic regions. Misassemblies primarily occurred when the repeat span was large and fosmid collapses brought incorrect contigs into adjacency during scaffolding. However, these are easily identified and corrected during sequence finishing.

Assessing the final draft assembly using the 200-bp Illumina PE read set highlighted genomic regions with collapsed repetitive elements, low coverage, misassemblies, and adjacent scaffolds. The PE alignment data were plotted by coverage and are shown in Figure S5. Correctly paired read alignments had a mean outside distance of 193 bp and appeared to be evenly distributed across the scaffolds. However, approximately 1,500 anomalous PE read-alignment-clusters (that is, reads with overly stretched gap distances between pairs, unpaired reads or reads paired inappropriately on different scaffolds) highlight that automated rules can be applied to the current draft assembly, and we have implemented a semi-automated system in our finishing pipeline to leverage these data. In GCgb1, we have currently resolved > 90% of the anomalous clusters identified in Sanger-454-IlluminaPA. As expected, many (approximately 85%) of the ambiguities that arose during our analysis of PE read clusters occurred at scaffold edges (< 3 kb), suggesting that scaffold growth termination was accurate in this assembly; further, scaffold growth was constrained by read ambiguity rather than by low coverage.
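The three anomaly classes named above (stretched pairs, unpaired reads, and pairs placed on different scaffolds) can be sketched as a simple per-pair classifier. This is an illustration under stated assumptions, not the semi-automated system used in the finishing pipeline: each alignment is assumed to be a (scaffold, leftmost position) tuple or None when the mate did not align, read orientation is ignored, and the max_outside threshold of 400 bp is a hypothetical value.

```python
# Sketch only. Assumptions: an alignment is (scaffold_name, leftmost_position)
# or None; orientation is ignored; max_outside is a hypothetical cut-off.
from collections import Counter

def classify_pair(aln_1, aln_2, read_len=42, max_outside=400):
    if aln_1 is None or aln_2 is None:
        return 'unpaired'
    scaffold_1, start_1 = aln_1
    scaffold_2, start_2 = aln_2
    if scaffold_1 != scaffold_2:
        return 'different_scaffold'
    # Outside distance: from the leftmost base of one mate to the
    # rightmost base of the other.
    outside = max(start_1, start_2) + read_len - min(start_1, start_2)
    if outside > max_outside:
        return 'stretched'
    return 'consistent'

# Hypothetical alignments (illustration only).
toy_alignments = [
    (('scaffold_1', 100), ('scaffold_1', 251)),
    (('scaffold_1', 100), None),
    (('scaffold_1', 100), ('scaffold_2', 50)),
    (('scaffold_1', 100), ('scaffold_1', 5000)),
]
summary = Counter(classify_pair(a, b) for a, b in toy_alignments)
print(dict(summary))
```

Grouping anomalous pairs by scaffold and position would then give clusters comparable in spirit to those described in the text; that clustering step is left out of the sketch.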
Although greater sequencing depth could extend scaffold growth by allowing better resolution of read overlap alignments, some types of genomic elements will likely continue to cause ambiguity in read overlaps, leading to premature truncation of scaffold growth.

By counting complete gene models for core eukaryotic proteins reported by CEGMA [10], we estimated that we have generated gene models for greater than 94% of the full genome's hypothetical gene model collection. For the preliminary Sanger-454-IlluminaPA gene predictions, the average gene density was approximately 1 gene/3.5 kb, the average gene length was approximately 1.5 kb, the average transcript length was approximately 1.2 kb, and the average transcript G+C content was approximately 58%. Similar values have been reported for other ascomycetes from the order Sordariomycetes [11,12]. A detailed description and annotation of the Gc genome will be published separately (manuscript in preparation).

Analysis of Illumina and 454 read data
We used the manually finished GCgb1 assembly to assess the performance of the Illumina and 454 sequencing platforms (Figure 4). We quantified the efficiency of discovering new and useful sequence data, as well as the rate at which the new sequence data covered GCgb1. We performed this analysis on all possible read substrings with length 28 bp (termed k-mers) generated from the raw reads rather than on the raw reads themselves. Although the rate at which novel k-mers were discovered was approximately the same for both technologies at lower numbers of k-mers, when we split the analysis of novel k-mers into those that appeared at least twice versus once, a greater error rate was observable in the Illumina k-mer collection (Figure 4a). Because the 454 read lengths were longer, the unique k-mers generated from this read collection overlapped each other more than k-mers generated from the Illumina reads.
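The k-mer bookkeeping behind this comparison can be sketched as follows. This is a minimal illustration, not the authors' analysis code: it treats each k-mer as a plain string, does not merge reverse complements, holds everything in memory, and uses hypothetical function names and toy reads. The paper uses k = 28; the toy call uses k = 3 so that the result can be followed by hand.

```python
# Sketch only. Assumptions: k-mers are plain substrings (no reverse
# complement handling) and all counts fit in memory.
from collections import Counter

def kmers(read, k=28):
    # Yield every substring of length k; reads shorter than k yield nothing.
    for i in range(len(read) - k + 1):
        yield read[i:i + k]

def novel_kmer_curve(reads, k=28):
    # After each read, record (total k-mers sampled, distinct k-mers seen).
    seen = set()
    total = 0
    curve = []
    for read in reads:
        for kmer in kmers(read, k):
            total += 1
            seen.add(kmer)
        curve.append((total, len(seen)))
    return curve

def split_by_multiplicity(reads, k=28):
    # Count distinct k-mers seen exactly once versus at least twice.
    # Singletons are enriched for sequencing errors.
    counts = Counter(kmer for read in reads for kmer in kmers(read, k))
    once = sum(1 for n in counts.values() if n == 1)
    repeated = sum(1 for n in counts.values() if n >= 2)
    return once, repeated

# Hypothetical toy reads (illustration only).
toy_reads = ['ACGTA', 'CGTAC']
print(novel_kmer_curve(toy_reads, k=3))
print(split_by_multiplicity(toy_reads, k=3))
```

Plotting the second element of each curve point against the first gives a discovery curve of the kind summarized in Figure 4a.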
This greater overlap was inherent in the k-mer sampling process and likely explains the slower gain in 454 genome coverage (Figure 4b). Our data were insufficient for systematically assessing library saturation; however, it was apparent that the large number of reads generated for either library captured the entire genome sequence we assembled (Figure 4b). Based on EST-to-genome alignments, approximately 0.6% of the protein coding sequence was missing or ambiguous in GCgb1. This could suggest that a portion of the genome remains ambiguous to our assembly methodology or that read data are missing from our sequence set. Given the rapid development of wet lab methodologies, it will be interesting to see whether library saturation remains a challenge for de novo genome sequencing.

Discussion
We sought to rapidly generate a de novo genome assembly that supported high quality protein coding gene predictions, wet lab experiments, comparative genomics and sequence finishing for a eukaryotic organism. We used a hybrid approach for sequencing and assembly. We combined Sanger PE, 454 SE and Illumina PE sequence data, and developed an assembly strategy that was adaptable to evolving technologies, tools and methods. Using Forge we generated a draft genome sequence with a length of approximately 32.5 Mb, which had a contig N50 length of approximately 32 kb and a scaffold N50 length of approximately 782 kb. During this work, read lengths and read quality improved for 454 and Illumina platforms; as they changed, we evaluated different ways of processing Illumina sequence reads in order to integrate them into assemblies. We characterized the accuracy of the draft assemblies by aligning ESTs, Illumina PE reads and a manually finished sequence to them.

Figure 3. Comparison of Forge Sanger/454/Illumina assemblies against GCgb1. Alignments of scaffolds greater than 100 kb - (a) 'Sanger/454/IlluminaDA' (approximately 24 Mb on 80 scaffolds) and (b) 'Sanger/454/IlluminaPA' (approximately 28.7 Mb on 46 scaffolds) - on the y-axis against the manually finished genome sequence (GCgb1) on the x-axis.

We chose Forge as the assembler for three reasons. First, it can flexibly integrate different sequencing technologies by automatically adapting alignment parameters for particular read error models. This facilitates using it with evolving sequencing technologies and variable, technology-specific read or contig preprocessing. Second, it is capable of integrating PE information directly into the contig-building and merging processes, making it ideally suited for processing abundant short paired reads. Finally, because it can be run on computer processors running in parallel, it can be applied to the relatively large data sets generated by next-generation platforms. From our initial observations, Forge assemblies were promising as they integrated Illumina PE read data directly, and yielded accurate assemblies with good long range continuity.

Although Forge was designed to accommodate the 454 scoring system, the vendor-supplied quality scores do not indicate the probability that a base is called correctly. While this shortcoming can be addressed by transforming the scores into a Phred-like scale similar to that used for Sanger reads [13], we chose an empirical approach and rejected problematic data [1]. We found that by aggressively applying no-call and length filtering we could improve the overall quality of the assembly, as measured by alignments to the GCgb1 sequence, reduced gap sizes and fewer EST-detectable misassemblies.
Low complexity filtering was especially useful for the 454 SE read data because, without read pairing information to anchor ambiguous overlaps, accurate read placement appeared difficult to resolve. Although we substantially improved the assemblies using these methods, 454 base calling inaccuracies in the vicinity of homopolymer runs continued to cause phasing problems that affected gene predictions in the assembled consensus sequence. We found that adding Sanger PE reads, Velvet contigs and then Illumina PE reads directly into the assembly progressively improved the consensus sequence by reducing the frequency of these indels. We also found that aligning a collection of Illumina-based assemblies back to the final assembly in a post-processing step accurately identified and resolved these homopolymers.

Given the promising initial assembly of Illumina PE reads, we further assessed how to improve the accuracy of Velvet-assembled contigs. Profiles of read quality and substitution error rate relative to the Sanger/454 preliminary assembly suggested that trimming the 42-bp Illumina reads would improve the assembly accuracy. While trimming reads at position 36 resulted in a lower N50, EST and reference sequence alignments showed that this assembly contained fewer errors; further, these contigs yielded a more accurate Forge assembly than either those with reads trimmed at position 38 or untrimmed. Importantly, adding the Illumina data to Forge assemblies substantially reduced the number of scaffolds and contigs, suggesting that these relatively inexpensive reads contributed additional data and encouraged contig growth and merging.

Forge uses a statistical model of overlap derived from internal simulations to determine the probability that two reads reliably overlap. This probability is systematically lowered or reduced to zero in repetitive regions, forcing Forge to rely on alternative information such as reads with mate pairs anchored in a scaffold, polymorphisms within a repeat family, or the combination of a low probability overlap and read-pair data. An important advance made with Forge during the course of our work was the ability to scale beyond 50 M reads, which enabled the direct integration of Illumina PE read data in a single Forge assembly stage. The increased accuracy of EST-to-genome alignments, Illumina PE read alignments and the significant increase in contig N50 of the resulting assembly likely resulted from the large amount of pairing information introduced by these data. This suggests that when abundant PE information is available, read sequence length is not as important a limitation as anticipated. Currently, one challenge of this assembly method appears to be in balancing out the PE information in the low coverage Sanger data versus the high coverage Illumina data. Although more fosmid pairs were correctly assigned to the same scaffold in the Sanger-454-IlluminaPA assembly, a greater fraction of the fosmid read pairs had consistent pairing distances in the assembly generated from direct integration of the Illumina PE read data. We also detected fewer inconsistencies in the Sanger-454-IlluminaDA assembly using the Illumina PE alignment strategy. This could have resulted from working directly with the Illumina PE reads in the assembly stage versus working with read substrings (k-mers), which is typical in a short read assembler like Velvet. Working with read substrings is an abstraction that does not enforce read integrity onto the contig consensus sequence. For the Illumina PE library reported here, read pairing distances were not distributed normally around the mean, and left hand tailing increased at greater pairing distances (Figure S4B in Additional data file 1). Read pairs with zero gap distance were also noted and could cause occasional sequence deletions in Forge assemblies if not filtered out.

We also noted that although low quality reads did not improve the assembly of genome sequence and so should be filtered out, they remained valuable as PE alignments for assessing and finishing the draft genome sequence. We are assessing the use of additional Illumina PE sequence data to evaluate the quality of the draft genome assembly and to guide finishing. We identified high quality regions in the assembly by calculating the coverage of correctly paired Illumina PE reads, and used scaffold-spanning PE reads to identify possible ambiguities or misassemblies in the consensus sequence. For such assessments, Illumina PE data offer advantages over EST data: the large number of reads provides deeper coverage, and the sequence data include non-transcribed regions, which are typically more difficult to assemble. We were also able to use the PE data to map the boundaries of misassemblies and to link scaffold edges in the consensus sequence. Improved software tools for working with Illumina PE data will likely benefit both the assembly of draft genome sequences and the finishing of these drafts.

Figure 4. Assessing the discovery of unique read information between the Illumina and 454 platforms. (a) Raw reads were processed into overlapping 28-bp k-mers, and any k-mer that varied from all other k-mers by at least 1 bp was accepted as new sequence information. The analysis was done separately for unique k-mers and those that occurred at least twice (2× k-mers). (b) MAQ was then used to map these k-mers to the reference genome sequence and the rate at which new coverage was generated was plotted against the number of k-mers examined. [Plot residue: the series are unique and 2× k-mers for the Illumina 200-bp and 454 data; the axes are number of k-mers examined against distinct k-mers discovered and against percentage of total assembly covered by k-mers.]

In conclusion, we assembled a draft genome sequence for a fungal pathogen using Illumina, 454 and Sanger sequence data. We found that the highest quality assemblies resulted from integrating the read and contig collections in a single round of assembly, using software that could coherently manage the varying read and contig lengths as well as the different error models. Aggressively filtering this high coverage data was an effective strategy for incrementally improving the resulting draft assemblies. We anticipate that the iterative approach that we describe will facilitate using rapidly improving sequencing technologies to generate draft eukaryotic genome sequences.

Materials and methods

Library construction and sequencing

Gc spores from strain kw1407 [14] were spread onto cellophane overlaid on 1.5% agar containing 1% malt extract in 15-cm petri dishes. The fungal spores were incubated at 22°C in the dark for 8 days, and the mycelia were removed from the cellophane and pooled. DNA was extracted from mycelia following the method of Möller et al. [15] but without first lyophilizing the mycelia. For constructing a 40-kb fosmid library, fungal DNA was randomly sheared, then blunt-end repaired and size-selected by electrophoresis on a 1% agarose gel. Recovered DNA was ligated to the pEpiFOS-5 vector (Epicentre Biotechnologies, Madison, WI, USA), mixed with Lambda packaging extract and incubated with host Escherichia coli cells. Clones containing inserts were selected and paired-end-sequenced on an ABI 3730xl. For sequencing on the Roche GS20 or GS-FLX sequencers, DNA was prepared using the methods described by Margulies et al.
[16]. For preparing the approximately 200-bp library on the Illumina GAii sequencer, 5 μg of DNA was sonicated for 10 minutes, alternating 1 minute on and 1 minute off, using a Sonic Dismembrator 550 (Fisher Scientific, Ottawa, Canada). Sonicated DNA was then separated in an 8% PAGE. The library was constructed from the eluted 190- to 210-bp fraction of DNA using Illumina's genomic DNA kit, following their protocol (Illumina, San Diego, CA, USA). Four lanes in a single flow-cell were sequenced to 42 cycles using v.1 sequencing and cleavage reagents. Data were processed using Illumina's GA pipeline (v.0.3.0 beta3).

Filtering Sanger and 454 reads

For Sanger PE data, we removed reads that had less than 200 bp of continuous sequence with a minimum quality score of Phred 20; 14,522 reads with an average read length of approximately 600 bp remained. We discarded 454 reads that contained uncalled base positions (no-calls), then pooled reads into separate GS20 and GS-FLX sets. After assessing the two read length distributions, we discarded reads whose lengths were either less than 40 bp or longer than 200 bp, or less than 50 bp or longer than 350 bp from the GS20 and GS-FLX sets, respectively, as described by Huse et al. [1]. We then applied a low complexity filter to the 454 and Sanger reads using DUST with a 50% threshold [9]. Contamination filtering was performed against a database of bacterial genome sequences. From the initial GS20 read collection approximately 3% of reads were identified with 98% or greater similarity to the genome sequence of Anaerostipes caccae and were removed. Lastly, 454 reads were mapped against the Univec database [8] using BLAST to trim and filter library adaptor sequence; 3% of reads were removed and approximately 7.5 Mb of sequence were trimmed from the read collection with no significant difference in the pre- and post-trimming read length (163 bp).

Assembling Illumina data

Version 7.31 of Velvet is able to generate scaffolded contigs, which results in larger N50 values; however, we were unable to observe scaffolding resulting from our hybrid Sanger/Illumina read assembly. Further, comparing Illumina-only assemblies generated from previous and current Velvet versions to our reference sequence indicated that the contig merging increased the number of assembly errors (data not shown). Given our assembly strategy, the limitations of the Velvet v. 7.31 release indicated that we should continue using Velvet v. 6.04 for our current work.

Because eukaryotic genomes pose an increasing number of ambiguous sequence regions compared with prokaryotes, and because we had generated relatively deep sequence coverage for the 200-bp Illumina library, we used the highest available assembly k-mer parameter (hash length) of 31 for all Velvet assemblies reported here. We calculated expected coverage and the coverage cut-off parameters as described in the Velvet documentation.

We applied a simple paired-read analysis to identify chimeric pairs that we believed to be artifacts of library construction and sequencing. We have termed these 'shadow' reads. Briefly, we identified a shadow read pair when a read shares X identical starting bases with its mate, where we tested X equals 6, 8, 10, 12, 14, 16, 20 or 24. We discarded such read pairs with 6-bp or greater shared sequence.

We tested trimming and filtering on the Illumina reads used for assembly and developed a QRL metric using the calibrated Illumina Phred-like quality scores. We calculated a read's QRL as follows. Moving from the 3' towards the 5' end of a read, we used the highest probability score value for each base position to determine a quality score for that base. The maximum possible value for this score is 40. For each read, the QRL was the length between the first and last bases that were above a quality score threshold.

We assessed the Velvet assemblies using four metrics: N50, the scaffold (contig) length for which 50% of the assembled genome is in scaffolds (contigs) that are at least as long as N50; the assembly size, calculated by adding the total length of retained contigs or scaffolds; alignment of the assembly contigs against a manually finished reference sequence; and alignments of ESTs to the assemblies, using a set of 7,169 unique ESTs (each EST was selected as the member with the longest Phred 20 read length from multiple sequence alignments generated by clustering approximately 43 k EST reads) generated in ongoing and previous work [17]. We aligned ESTs using BLASTn with an E-value threshold of 1e-50, and differentiated complete alignments from resolvable and irresolvable partial alignments. Resolvable partial alignments were alignments occurring on a contig edge that could be merged with another partial alignment on a complementary contig edge. Irresolvable partial alignments were alignments in which the partial EST alignments were isolated in the interior sequence region of a contig such that it was not possible to join the complementary alignments. For identifying small insertions and deletions from the same BLAST report, a custom PERL script was used to parse the alignment data; insertions were identified as gaps in the EST query alignment, deletions were identified from gaps in the target side of the alignment. Several of these contig assemblies were then tested in Forge and further assessed using the methodology described below.

Velvet assemblies took approximately 3 hours on a server with two 2.2 GHz dual-core AMD Opteron 275 processors and 8 GB of RAM.
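The N50 metric used above to assess assemblies can be illustrated with a minimal sketch. It assumes the usual convention of sorting lengths from longest to shortest and reporting the length at which the running total first reaches half of the assembly size. The function name n50 and the example lengths are hypothetical and are not taken from the authors' scripts.

```python
# Illustrative sketch only; not the authors' code.
def n50(lengths):
    '''Return the N50 of a collection of contig or scaffold lengths.

    At least half of the total assembled length lies in sequences that are
    at least as long as the returned value.
    '''
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError('no lengths given')


# Made-up contig lengths in kb; total is 300, so half is 150.
contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(contigs))  # 70, because 80 + 70 reaches 150
```

Because the metric depends only on the length distribution, a higher N50 does not by itself indicate a more accurate assembly, which is why it is reported alongside the EST and reference alignments.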
Velvet assemblies handled by Forge were assigned base quality scores as uniform PHRED 20 at each position.

Forge hybrid assembler and genome assembly analysis

The Forge output is a consensus sequence with quality scores and a complete multiple sequence alignment for all reads, with locations in a tabular format that can be converted into the Consed 'ace' file format [18].

We assessed scaffold qualities using the 42-bp PE reads rejected by the filtering process described above in assembling the draft genome sequence. We aligned the reads to the draft assemblies using MAQ [19] in paired-end mode. We processed the output and identified PE relationships using custom PERL scripts and the Vancouver Short Read Analysis Package [20]. We separated the aligned reads into three subsets: PE reads that were correctly spaced and oriented and aligned on the same scaffold; PE reads that were aligned on separate scaffolds; and unpaired read alignments. We used clusters of read-alignment pairs to identify pairs that could be used to merge scaffolds and to identify low quality assembly regions. The first type had a read cluster located at a scaffold edge and a mate-pair cluster located on a complementary scaffold edge. In the second type the complementary cluster was located in the interior scaffold sequence region such that the complementary clusters could not be joined. Because PE read mates can be incorrectly paired in the Illumina flowcell image analysis pipeline, and base-calling errors or low-complexity sequences can result in read placement errors by MAQ, we required cluster sizes of at least 10 before using a cluster to mark a potential scaffold merge or to identify a low-quality region.

As described above, the EST collection was aligned to the Forge assemblies for quality control and alignments were generated against the manually finished genome sequence using nucmer within the MUMmer package with the seed cluster parameter (-c) set to 750. Read coverage, repeat data and quality data were then combined and visualized using Circos [21]. RepeatMasker [22] was used for preliminary filtering of repetitive elements against repbase (v.14) with the species parameter set to 'fungi/metazoa group' prior to gene prediction. Gene prediction was done using Augustus [23].

The Forge hybrid assemblies were generated using the following settings: a genome size estimate of 35 Mb and a hash table size of 80 M for assemblies generated from Sanger/454 read data only or those that included preassembled Illumina PE read data and 260 M for the assembly with direct integration of the Illumina PE read data. The Forge assemblies took 10 to 84 hours on a Linux server cluster using 40 nodes ranging from dual 2.0 GHz processors with 2 GB of RAM to quad-core 2.6 GHz processors with 16 GB of RAM.

Generating the GCgb1 genome sequence

We generated a reference genome sequence and used it for de novo assembly verification. Using the methodology described above, we added 10,000 additional Sanger fosmid PE reads and approximately 7.6 M, 50 bp Illumina PE reads (see Supplementary section 3 in Additional data file 1). After assembling these data with Forge and applying manual editing, primer walking and other standard finishing techniques, the largest and tenth largest contigs of the resulting genome sequence were 2.33 and 0.68 Mb long, respectively. The largest scaffold was approximately 2.9 Mb and the scaffold N50 was approximately 950 kb. Eighty five percent of the genome sequence was contained within the top 29 scaffolds.

Data access

Raw read data are available through NCBI genome project ID 39847: fosmid PE Sanger reads (see Additional data file 2 for a complete list of accessions); SE 454 reads [SRA:SRR023307] and [SRA:SRR023517] to [SRA:SRR023533]; 200 bp PE Illumina reads [SRA:SRR018008] to [SRA:SRR018011] and 700 bp PE Illumina reads [SRA:SRR018012]. Assemblies have also been deposited at NCBI: Sanger-454-IlluminaPA [DDBJ/EMBL/GenBank:ACXQ00000000]; Sanger-454-IlluminaDA [DDBJ/EMBL/GenBank:ACYC00000000].

Abbreviations

EST: expressed sequence tag; GA: Genome Analyzer; Gc: Grosmannia clavigera; indel: insertion or deletion; NCBI: National Center for Biotechnology Information; PE: paired-end; QRL: quality read length; SE: single-end.

Authors' contributions

CB, RCH, SJMJ, MAM and JB conceived of the project. SD, NYL, RAH and SJMJ designed the analysis. Sanger sequencing was carried out under the direction of RAH. 454 sequencing was carried out under the direction of EM. Illumina sequencing was carried out under the direction of MH. Forge was developed by DP. Assemblies were performed by SD, NYL, MS, SKC and DP. Data analysis was performed by SD, SKC, TRD and NYL under the direction of IB. The manuscript was prepared by SD, CB, JB and SJMJ with assistance from GR. All authors have read and approved the final version of the manuscript.

Additional data files

The following additional data are available with the online version of this paper: a PDF file including Supplementary sections 1 to 3, including supplementary Figures S1 to S5 (Additional data file 1); NCBI accession numbers for the trace data used in this study (Additional data file 2).

Additional data file 1: Supplementary sections 1 to 3, including supplementary Figures S1 to S5. Supplementary section 1: additional explanation for the 454 read filtering and alignment of the 454 pre- and post-filtered Forge assemblies relative to the manually finished GCgb1 sequence. Supplementary section 2: additional explanation and supporting figures for the filtering and trimming of Illumina PE read data.
Supplementary section 3: supporting Figure S5 detailing the coverage in the final assembly as well as preliminary repeat annotations and highlighting a small number of misassemblies identified with Illumina PE alignment clusters. Additional data file 2: NCBI accession numbers for the trace data used in this study.

Acknowledgements

The authors would like to thank the Functional Genomics Group of the BC Cancer Agency Genome Sciences Centre (BCGSC, Vancouver, Canada) for expert technical assistance, Richard Varhol and Anthony Fejes of the BCGSC for analysis, Dirk Evers (Illumina, Cambridge, UK) for contributing to the development of Forge and for technical support and the CeBiTec Center at Bielefeld University for access to computer resources. This work was funded by grants from the Natural Sciences and Engineering Research Council of Canada (grant to JB and CB), the British Columbia Ministry of Forests (grant to SJMJ, JB and CB), the Natural Resources Canada Genomics program (grant to RCH) and Genome BC and Genome Alberta (grant to JB, CB, RCH and SJMJ). Salary support for JB came in part from an NSERC Steacie award and the UBC Distinguished Scholars Program. SJMJ, RAH and MAM are Michael Smith Distinguished Scholars. JB, CB, RCH and SJMJ are principal investigators of the Tria project [24], which is supported by Genome BC and Genome Alberta.

References

1. Huse SM, Huber JA, Morrison HG, Sogin ML, Welch DM: Accuracy and quality of massively-parallel DNA pyrosequencing. Genome Biol 2007, 8:R143.
2. Butler J, MacCallum I, Kleber M, Shlyakhter IA, Belmonte MK, Lander ES, Nusbaum C, Jaffe DB: ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Res 2008, 18:810-820.
3. Warren R, Sutton G, Jones S, Holt R: Assembling millions of short DNA sequences using SSAKE. Bioinformatics 2007, 23:500-501.
4. Zerbino D, Birney E: Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 2008, 18:821-829.
5. Simpson J, Wong K, Jackman S, Schein J, Jones SJM, Birol I: ABySS: A parallel assembler for short read sequence data.
Genome Res 2009, 19:1117-1123.
6. Forge Genome Assembler [http://sourceforge.net/projects/forge/]
7. Pipeline Scripts [ftp://ftp.bcgsc.ca/supplementary/Grosmannia_clavigera/tools/]
8. NCBI [http://www.ncbi.nlm.nih.gov]
9. DUST [ftp://ftp.ncbi.nlm.nih.gov/pub/tatusov/dust/]
10. Parra G, Bradnam K, Korf I: CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics 2007, 23:1061-1067.
11. Galagan JE, Calvo SE, Borkovich KA, Selker EU, Read ND, Jaffe D, FitzHugh W, Ma LJ, Smirnov S, Purcell S, Rehman B, Elkins T, Engels R, Wang S, Nielsen CB, Butler J, Endrizzi M, Qui D, Ianakiev P, Bell-Pedersen D, Nelson MA, Werner-Washburne M, Selitrennikoff CP, Kinsey JA, Braun EL, Zelter A, Schulte U, Kothe GO, Jedd G, Mewes W, et al.: The genome sequence of the filamentous fungus Neurospora crassa. Nature 2003, 422:859-868.
12. Dean RA, Talbot NJ, Ebbole D, Farman ML, Mitchell TK, Orbach MJ, Thon M, Kulkarni R, Xu JR, Pan H, Read ND, Lee YH, Carbone I, Brown D, Oh YY, Donofrio N, Jeong JS, Soanes DM, Djonovic S, Kolomiets E, Rehmeyer C, Li W, Harding M, Kim S, Lebrun MH, Bohnert H, Coughlan S, Butler J, Calvo S, Ma LJ, et al.: The genome sequence of the rice blast fungus Magnaporthe grisea. Nature 2005, 434:980-986.
13. Brockman W, Alvarez P, Young S, Garber M, Giannoukos G, Lee WL, Russ C, Lander ES, Nusbaum C, Jaffe DB: Quality scores and SNP detection in sequencing-by-synthesis systems. Genome Res 2008, 18:763-770.
14. Lee S, Kim J, Breuil C: Pathogenicity of Leptographium longiclavatum associated with Dendroctonus ponderosae to Pinus contorta. Can J Forest Res 2006, 36:2864-2872.
15. Möller EM, Bahnweg G, Sandermann H, Geiger HH: A simple and efficient protocol for isolation of high molecular weight DNA from filamentous fungi, fruit bodies, and infected plant tissues. Nucleic Acids Res 1992, 20:6115-6116.
16. Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen YJ, Chen Z, Dewell SB, Du L, Fierro JM, Gomes XV, Godwin BC, He W, Helgesen S, Ho CH, Irzyk GP, Jando SC, Alenquer ML, Jarvie TP, Jirage KB, Kim JB, Knight JR, Lanza JR, Leamon JH, Lefkowitz SM, Lei M, Li J, et al.: Genome sequencing in microfabricated high-density picolitre reactors. Nature 2005, 437:376-378.
17. DiGuistini S, Ralph SG, Lim YW, Holt R, Jones S, Bohlmann J, Breuil C: Generation and annotation of lodgepole pine and oleoresin-induced expressed sequences from the blue-stain fungus Ophiostoma clavigerum, a Mountain Pine Beetle-associated pathogen. FEMS Microbiol Lett 2007, 267:151-158.
18. Gordon D, Abajian C, Green P: Consed: a graphical tool for sequence finishing. Genome Res 1998, 8:195-202.
19. Li H, Ruan J, Durbin R: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 2008, 18:1851-1858.
20. Fejes A, Robertson G, Bilenky M, Varhol R, Bainbridge M, Jones SJ: FindPeaks 3.1: a tool for identifying areas of enrichment from massively parallel short-read sequencing technology. Bioinformatics 2008, 24:1729-1730.
21. Krzywinski M, Schein J, Birol I, Connors J, Gascoyne R, Horsman D, Jones SJ, Marra MA: Circos: an information aesthetic for comparative genomics. Genome Res 2009, 19:1639-1645.
22. RepeatMasker [http://www.repeatmasker.org/]
23. Stanke M, Schöffmann O, Morgenstern B, Waack S: Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinformatics 2006, 7:62.
24. The Tria Project [http://www.thetriaproject.ca/index.php]"@en ; edm:hasType "Article"@en ; edm:isShownAt "10.14288/1.0167807"@en ; dcterms:language "eng"@en ; ns0:peerReviewStatus "Reviewed"@en ; edm:provider "Vancouver : University of British Columbia Library"@en ; dcterms:publisher "BioMed Central"@en ; ns0:publisherDOI "10.1186/gb-2009-10-9-r94"@en ; dcterms:rights "Attribution 4.0 International (CC BY 4.0)"@en ; ns0:rightsURI "http://creativecommons.org/licenses/by/4.0/"@en ; ns0:scholarLevel "Faculty"@en ; dcterms:title "De novo genome sequence assembly of a filamentous fungus using Sanger, 454 and Illumina sequence data"@en ; dcterms:type "Text"@en ; ns0:identifierURI "http://hdl.handle.net/2429/55168"@en .