UBC Faculty Research and Publications

aTRAM - automated target restricted assembly method: a fast method for assembling loci across divergent… Allen, Julie M; Huang, Daisie I; Cronk, Quentin C; Johnson, Kevin P Mar 25, 2015



Full Text

SOFTWARE Open Access

aTRAM - automated target restricted assembly method: a fast method for assembling loci across divergent taxa from next-generation sequencing data

Julie M Allen1*, Daisie I Huang2, Quentin C Cronk2 and Kevin P Johnson1

Allen et al. BMC Bioinformatics (2015) 16:98
DOI 10.1186/s12859-015-0515-2

* Correspondence: juliema@illinois.edu
1Illinois Natural History Survey, University of Illinois, Champaign, IL 61820, USA
Full list of author information is available at the end of the article

© 2015 Allen et al.; licensee BioMed Central. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Abstract

Background: Assembling genes from next-generation sequencing data is not only time consuming but computationally difficult, particularly for taxa without a closely related reference genome. Assembling even a draft genome using de novo approaches can take days, even on a powerful computer, and these assemblies typically require data from a variety of genomic libraries. Here we describe software that will alleviate these issues by rapidly assembling genes from distantly related taxa using a single library of paired-end reads: aTRAM, automated Target Restricted Assembly Method. The aTRAM pipeline uses a reference sequence, BLAST, and an iterative approach to target and locally assemble the genes of interest.

Results: Our results demonstrate that aTRAM rapidly assembles genes across distantly related taxa. In comparative tests with a closely related taxon, aTRAM assembled the same sequence as reference-based and de novo approaches, taking on average < 1 min per gene. As a test case with divergent sequences, we assembled >1,000 genes from six taxa ranging from 25–110 million years divergent from the reference taxon. The gene recovery was between 97–99% from each taxon.

Conclusions: aTRAM can quickly assemble genes across distantly related taxa, obviating the need for draft genome assembly of all taxa of interest. Because aTRAM uses a targeted approach, loci can be assembled in minutes depending on the size of the target. Our results suggest that this software will be useful in rapidly assembling genes for phylogenomic projects covering a wide taxonomic range, as well as other applications. The software is freely available at http://www.github.com/juliema/aTRAM.

Keywords: Massively parallel sequence data, Next-generation sequencing, Targeted gene assembly, Short-read archive, Phylogenomics, Phylogenetics

Background

Short read sequencing methods have rapidly increased the amount of genetic data that can be obtained in a cost effective manner [1]. The computational skills and time necessary to assemble genes from these short read datasets are also increasing quickly. To assemble genomic datasets, researchers must first create a genome assembly using either a de novo approach or, if a reference genome is available, a reference-based approach [2]. Complete de novo genomic assemblies typically require a variety of DNA sequencing libraries, and the assemblies are computationally intensive. Although reference-based assemblies can significantly reduce the computational time needed and can be performed from a single DNA sequencing library, such assemblies can be problematic or impossible for more divergent taxa [3]. For many studies, however (e.g. phylogenetics, gene family analysis), researchers may not need a complete genome assembly; rather, the analysis may only require homologous sequence data that covers all of the taxa of interest, and not genomic assemblies. As more and more genomic data become available, the time required for these assemblies will become an increasing challenge, particularly for projects with hundreds of taxa. Approaches that target specific loci or genes from short read datasets will likely reduce the time necessary to assemble genetic datasets.

A few target-based methods have been made available that are shown to work well for very closely related taxa [4,5], RAD-PE fragments [6] and meta-genomic datasets [7]. However, a method is still needed that can target and assemble genes across highly divergent taxonomic datasets. In this article we describe aTRAM, automated Target Restricted Assembly Method, a software package designed to rapidly assemble genes using a single paired-end genomic library from divergent taxa. The aTRAM software is inspired by the Target Restricted Assembly Method (TRAM) idea first outlined by Johnson et al. [8]. TRAM used a targeted approach with local BLAST [9] to assemble genes from short sequencing reads. The aTRAM software is TRAM completely redesigned and fully automated, including a number of optimizations to speed up gene assembly, as well as computational pipelines for multiple taxon datasets and downstream processing.

The aTRAM software distributes queries of the reads by using a MapReduce approach to parallelize indexing and searching of the short-read dataset. To assemble genes, aTRAM uses a query sequence, searches the short read databases for matches to the gene of interest, finds the matching mates, and uses a de novo assembler to assemble those reads. The aTRAM software then uses those contigs as the query sequence in the next iteration and repeats the process to completely assemble the locus of interest. We compare the results from aTRAM to those assembled using reference-based and completely de novo assemblies.
Finally, we demonstrate the ability of aTRAM to assemble genes from highly divergent taxa.

Implementation

The aTRAM package is downloadable from GitHub (www.github.com/juliema/aTRAM) and is written as a Perl package that links together widely cited programs in a novel way. These programs include BLAST [9]; two alternate de novo assemblers, Velvet [10] and Trinity [11]; and two multiple sequence aligners, MUSCLE [12] and MAFFT [13]. aTRAM was designed so that new assembly and alignment software can be added as it becomes available. aTRAM has two components. The first component constructs an aTRAM-formatted BLAST database from the original paired-end FASTQ or FASTA file and is performed once per sample. The second component is the search for a locus of interest: using an iterative approach, aTRAM queries all or a fraction of the constructed short read database for the locus of interest and performs a de novo assembly. The package also includes post-processing scripts for validation of the results and pipeline scripts that automate multiple gene alignments across a number of short-read datasets.

Database creation

The aTRAM software creates a database from a paired-end FASTA or FASTQ sequence file using a MapReduce strategy [14]. Because sequence names are unrelated to the genomic content of the reads, the MapReduce strategy can use a hashing function to distribute the reads across many partitions, or shards, which speeds up subsequent searches. The sizes of the shards are approximately equal, and each should contain a random sample of reads from the original run. Searches can then be done more efficiently across a smaller subset of the whole dataset. For each shard, a BLAST database is constructed, corresponding to one end of each paired-end read in the shard. The mate of each read is placed in an easily searchable file (Figure 1A). This sharding process allows aTRAM to be parallelized, because each shard can be searched independently in its own process.
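The sharding step described above can be illustrated with a short sketch. This is not aTRAM's Perl implementation: the function names, the choice of MD5, and the in-memory data layout are assumptions made for illustration, and the hash is assumed to be computed on the read name, as the text implies.

```python
import hashlib
from collections import defaultdict


def shard_index(read_name: str, num_shards: int) -> int:
    """Map a read name to a shard with a stable hash.

    MD5 is an illustrative choice; any stable, well-mixed hash would do.
    """
    digest = hashlib.md5(read_name.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def shard_pairs(pairs, num_shards):
    """Distribute (name, end1_seq, end2_seq) tuples across shards.

    Returns two dicts keyed by shard index:
      end1  -> list of (name, sequence) records for the first end, i.e. the
               records that would be formatted as that shard's BLAST database
      mates -> name-to-sequence lookup for the second end of each pair
    """
    end1 = defaultdict(list)
    mates = defaultdict(dict)
    for name, seq1, seq2 in pairs:
        idx = shard_index(name, num_shards)
        end1[idx].append((name, seq1))
        mates[idx][name] = seq2
    return end1, mates


if __name__ == "__main__":
    # Hypothetical toy input: 1,000 read pairs split into 8 shards.
    toy = [(f"read{i}", "ACGT", "TTGA") for i in range(1000)]
    first_ends, mate_lookup = shard_pairs(toy, 8)
    for idx in sorted(first_ends):
        print(idx, len(first_ends[idx]))
```

Because the shard is derived only from the read name, both ends of a pair always land in the same shard, and each shard can be indexed and searched without reference to the others.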
Furthermore, because each shard contains a random sample of the full short read dataset, any number of shards can be searched in an aTRAM run, allowing the user to vary the coverage depth used in the assembly and reducing computational time if genomic coverage is high.

Gene assembly

The aTRAM pipeline iteratively searches the formatted database to assemble the gene of interest. Each shard is searched independently and the results are combined for de novo assembly. The target sequence is provided as either a DNA or amino acid FASTA file. For the first iteration, aTRAM uses BLAST to search each shard for reads that are similar to the target sequence. The top hits and their mates are retrieved, combined across all shards, and used as input to a de novo assembler (Velvet or Trinity). The possibility exists for other de novo assemblers to be written into aTRAM as plugins in the future [15-19]. The resulting contigs are then compared to the original target sequence using BLAST, and the most similar ones are used as target sequences in the next iteration (Figure 1B). Because the subsequent iterations use target sequences that were assembled directly from the short read database, further iterations will involve short reads that are not just similar but identical to the contig being assembled. The program stops when the total number of iterations set by the user has been completed, or when the contigs from an iteration exactly match the contigs from the previous iteration. Alternatively, an autocomplete flag can be set to end the search if one of the contigs has sequence matching both the beginning and the end of the query sequence, suggesting that the contig includes the entire target. As mentioned, aTRAM can be adjusted to use a fraction of the available shards. The fraction should be calculated based on the expected coverage of the target locus in the sequencing run; for example, a short read dataset that contains 5x coverage of the nuclear genome may contain 200x coverage of the ribosomal DNA and 800x coverage of the chloroplast genome [20]. Although 20-50x coverage may be optimal for many genes, it has been suggested that this method can work with coverage as low as 5x [8].

Figure 1 Graphic of the aTRAM method. A) Formation of the aTRAM database; DNA is sequenced into a paired-end short read dataset (SRD). aTRAM splits the SRD into shards, creates a BLAST formatted database of the first pair and indexes the paired-end for the sequences in each shard. B) In iteration 0 a query sequence in either amino acid or DNA format is queried against the aTRAM formatted database using BLAST. The top-hits and their paired-ends are selected and assembled de novo. In the following iterations the contigs from the previous iteration are queried against the same database using BLAST, the top-hits and paired-ends selected and assembled de novo until the full locus is assembled.

Pipelines

The aTRAM package also includes ready-made pipelines for running aTRAM on multiple samples for many target sequences. AssemblyPipeline runs aTRAM on multiple target sequences for multiple samples; it is ideal for quickly producing a list of putatively orthologous genes from different species. AlignmentPipeline produces a set of aligned homologous sequences for a set of genes and a set of samples, allowing straightforward production of multiple gene alignments for gene tree analyses.

Performance

aTRAM compared to other methods

To compare the performance of aTRAM with genome-assembly-based methods and to verify that it gives similar results, we chose a dataset of 1,534 single copy orthologs from Pediculus schaeffi, the chimpanzee louse. These genes were first assembled by Johnson et al. [21] using a reference-based approach against the genome of the body louse, P. humanus. Their study used one lane of an Illumina HiSeq2000 run, which resulted in 36 GB of data and more than 100x coverage of the genome (NCBI; SAMN02438447). The authors used CLC Genomics Workbench (CLCbio) to map paired-end reads to the reference genome and verified orthology using a reciprocal best-BLAST test (http://dx.doi.org/10.5061/dryad.9fk1s). The same set of genes was assembled using aTRAM and a completely de novo approach to compare the three sequence retrieval methods.

An aTRAM database was created from the same P. schaeffi paired-end library from Johnson et al. [21], taking a total of 2.37 hours to format the 36 GB library. The 1,534 reference P. humanus proteins were used as the target sequences for aTRAM assembly. Because the expected coverage of the genome for the complete Illumina run was over 100x, aTRAM was run using only 25% of the available shards, providing an estimated genomic coverage of ~25x. The program was set to run for five iterations using the autocomplete option. This run was performed on the Institute for Genomic Biology's Biocluster at the University of Illinois, which uses two Intel Xeon E5530 2.4 GHz quad-core processors per node with 24 GB RAM per node. Both aTRAM steps were run on one node with four processors.

Finally, the same P. schaeffi paired-end library was used to create a completely de novo assembly. The raw reads were trimmed for nucleotide bias at the 5′ end and for low-quality bases at the 3′ end using the FASTX toolkit and error-corrected using Quake [22] with c = 2.83 for 19-mers. Paired-end reads were assembled in SOAPdenovo v1.05 [19] using K = 49, which is roughly half of the read length, and the optional GapCloser v1.10 algorithm with a minimum overlap of 31. Finally, the 1,534 genes were identified by creating a BLAST-formatted database of the de novo assembled contigs and using the P. humanus transcripts as targets for a BLAST search.
The top hits were selected as the de novo contigs.

The aTRAM contigs, top hits from the de novo assembly, and the reference-based assembly sequences were each aligned against the original Pediculus humanus reference DNA sequences using MAFFT [13] with the included post-processing PercentCoverage script. Uncorrected p-distances (proportion of sites with differing nucleotides, not corrected with a model of molecular evolution) were calculated using a custom Perl script used originally in Johnson et al. [21] (available on GitHub: juliema/publications/). Orthology was verified for the aTRAM and de novo contigs with the same reciprocal best-BLAST test that was previously used for the reference-based assemblies in Johnson et al. [21]. Because each method used BLAST for assembly, the resultant contigs were then reciprocally compared to the entire Pediculus humanus protein-coding genome; if the original query sequence was the top hit, the assembled gene was considered to be orthologous to the query gene. Finally, the outputs from aTRAM, the reference-based assembly, and the de novo assembly were aligned to each other and uncorrected p-distances were calculated to determine if the three methods produced the same sequence for each gene.

aTRAM and divergent taxa

Samples of six species of lice were sequenced on an Illumina sequencer, with two species combined per lane (NCBI: SAMN03360966 – SAMN03360971). Four species were sucking lice from the suborder Anoplura and are thought to range from 25–75 million years divergent from the reference sequence P. humanus [23]. The other two species were chewing lice from the suborder Ischnocera and are thought to be ~110 million years divergent from the reference species [24]. Johnson et al. [8] had previously identified a set of 1,107 single copy orthologous protein-coding genes across nine insect genomes, including lice, using OrthoDB [25]. The amino acid sequences from P. humanus for these 1,107 genes were used as query sequences in aTRAM for each of the six louse species. Each aTRAM contig was compared to the entire P. humanus protein-coding genome using the reciprocal best-BLAST test for orthology. The orthologous contigs were then aligned back to the P. humanus genome and uncorrected p-distances were calculated. To determine if a DNA query would also assemble genes across the divergent datasets, we ran 10 genes using the DNA from the reference P. humanus, and only those from the congener Pediculus schaeffi assembled. This suggests that aTRAM can be limited by the success of the initial BLAST search (Additional file 1), and that as taxa become more divergent, amino acid sequences are the better choice of target sequence.

Results and discussion

The aTRAM software rapidly assembles genes of interest from short paired-end sequencing reads, even across divergent taxa, by iteratively querying and assembling reads. The MapReduce strategy [14] used in aTRAM enables faster searching of large short read data files by splitting the short-read database into shards. Thus, the search is divided into many smaller parallelizable problems, speeding up computation time. This method also provides a means for further reducing computational time by allowing the user to search only a fraction of the short reads if genomic coverage is expected to be high.

Comparisons with reference and de novo assemblies

Using aTRAM, we quickly assembled the sequences of 1,534 putatively single copy genes from the Pediculus schaeffi short read dataset. A total of 90% of the genes completely assembled before the fifth iteration, and 75% of those finished at the first iteration, taking a mean of 55 seconds per gene. Assemblies of the other genes ranged from 3-7.5 minutes. Although 170 genes were not flagged as complete by the fifth iteration of aTRAM, searching among the best contigs of these genes verified that many had the complete gene but were not flagged in the autocomplete process. These genes had a mean of 96.97% of the gene assembled, with a median of 99.46%. Further investigation revealed three typical reasons the genes were not marked as complete: 1) some were missing one section of the gene, 2) some had high sequence divergences as compared to the reference, and 3) others had a small exon at one end of the gene. Because the original query sequence only included exons and aTRAM assemblies include introns, genes with a small exon at one or both ends are unlikely to have a high BLAST match of these small exons back to the original gene sequence. These results suggest that even though the gene may not be flagged as complete by the end of the iterations, the entire gene may still be assembled. Furthermore, in our experience, because the assembled contig grows with each iteration, adding more iterations allows the complete assembly of the locus of interest. This is particularly true for very large genes, where more iterations may be needed to completely assemble the gene.

Gene completeness

When compared to the P. humanus reference sequence, aTRAM assembled a greater fraction of the gene than either the reference-based or de novo approaches (Table 1; Additional file 2). One possible explanation is that, compared to the reference-based assembly, aTRAM is more likely to assemble sequences at intron-exon junctions or at the 5′ and 3′ ends of genes.

Genetic distance to the reference

The contigs returned from all three methods were similarly divergent when compared to their P. humanus orthologs, with the aTRAM and the de novo contigs having a few genes with higher distances. The mean p-distance to P. humanus was lowest for the reference-based contigs, most likely because more divergent regions failed to assemble (Figure 2). The aTRAM contigs had the next lowest p-distance, followed by the de novo contigs. All but four of the aTRAM contigs passed the reciprocal best-BLAST test of orthology, whereas 22 of the de novo contigs did not pass the test (Table 1). All of the reference-based contigs had previously passed the reciprocal best-BLAST test in Johnson et al. [21], and this resulted in the selection of the gene set used in our current comparisons.

Table 1 Results from assembling 1,534 protein coding genes using aTRAM, a reference-based and a de novo approach

| Method | Proportion (mean, range) | P-distance (mean, std dev) | Reciprocal best-BLAST (total 1,534) |
| --- | --- | --- | --- |
| aTRAM | 0.99 (1-0.20) | 0.093 (0.044) | 1,530 (99.7%) |
| Reference | 0.93 (1-0.19) | 0.077 (0.022) | N/A |
| de novo | 0.92 (1-0.16) | 0.095 (0.052) | 1,512 (98.92%) |

Figure 2 The y-axis is the ratio of the length of the contig assembled with aTRAM to the length of the contig assembled with the reference-based approach. Points under the 1 line are longer with the reference-based approach and those above the line are longer from aTRAM assemblies. The x-axis indicates the uncorrected p-distance comparing the aTRAM contigs to the reference DNA sequence. The graph illustrates that aTRAM assemblies tended to be longer and the longer genes tended to be the more divergent ones, suggesting that aTRAM can assemble more divergent sections than a reference-based approach.

Sequence similarity among methods

Finally, the sequences from all three methods were compared to each other to determine if they assembled the same sequence. The contigs from each method were identical in many cases; when they were not identical, aTRAM contigs tended to be more similar to the reference-based contigs (mean uncorrected p-distance = 0.011) than to the de novo contigs (mean = 0.022).
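The uncorrected p-distance used in these comparisons is the proportion of aligned sites at which two sequences differ. A minimal sketch of that calculation follows. It is not the authors' custom Perl script: the function name is hypothetical, and skipping sites that contain a gap or an ambiguous base is an assumption of this example rather than something the text specifies.

```python
def p_distance(seq_a: str, seq_b: str, skip_chars: str = "-N?") -> float:
    """Uncorrected p-distance between two aligned sequences.

    Sites where either sequence has a character in skip_chars (gap or
    ambiguity code) are excluded from the comparison; this handling is an
    assumption of the sketch.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in skip_chars or b in skip_chars:
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


if __name__ == "__main__":
    # One difference across eight comparable sites.
    print(p_distance("ACGTACGT", "ACGTACGA"))  # 0.125
```

No substitution model is applied, so multiple changes at the same site are counted at most once; that is what "uncorrected" refers to.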
The de novo contigs tended to be less similar to either of the other methods overall, suggesting that the de novo contigs were the least accurate of the three methods tested. This may be a function of the de novo assembly method, and other assemblers may perform better. Additionally, we aligned aTRAM contigs to previously Sanger-sequenced loci and found identical sequences for two of the three genes; the third gene differed at only two base pairs out of 241 bp and a single N in the Sanger sequence (Additional file 3). Taken together, these results suggest that the contigs assembled by aTRAM are of a similar (or higher) length and quality to those assembled using alternate methods, while taking a fraction of the time to assemble. The alignments from these methods have been made available from the Dryad Digital Repository http://dx.doi.org/10.5061/dryad.kh886.

Assembling genes from divergent taxa

Finally, we used aTRAM to assemble genes from taxa highly divergent from P. humanus. Specifically, we assembled 1,107 1:1 orthologous genes from lice ranging from 25–110 million years divergent from the reference sequence [23,24]. aTRAM assembled nearly all of the genes from each of the six divergent taxa, ranging from 97% to 99% recovery (Table 2; Dryad Digital Repository http://dx.doi.org/10.5061/dryad.kh886). It is possible that some of the genes that did not assemble were not present in the genomes of those taxa, having been lost over time. Between 2% and 6% of the assembled contigs did not pass the reciprocal best-BLAST test of orthology, leaving well over 1,000 genes for each species that did pass, suggesting these genes are orthologous to the reference gene and can be used for phylogenomic datasets. The mean p-distance from P. humanus for these genes ranged from 0.24–0.30. As expected, the more distantly related lice had higher p-distances from the reference sequences.

Table 2 Results from assembling 1,107 1:1 orthologous genes using aTRAM across different species of lice

| Suborder, species | Years divergent | Contigs | Reciprocal best-BLAST |
| --- | --- | --- | --- |
| Anoplura, Pedicinus badii | 25–30 (a) | 1091 (98.6%) | 1068 (96.5%) |
| Anoplura, Haematopinus eurysternus | 65–70 (a) | 1089 (98.4%) | 1048 (94.7%) |
| Anoplura, Linognathus spicatus | 65–70 (a) | 1082 (97.7%) | 1031 (93.1%) |
| Anoplura, Proechinopthirus fluctus | 75–80 (a) | 1090 (98.5%) | 1026 (92.7%) |
| Ischnocera, Brueelia antiqua | ~110 (b) | 1102 (99.5%) | 1060 (95.8%) |
| Ischnocera, Columbicola liva | ~110 (b) | 1074 (97.0%) | 1053 (95.1%) |

Years divergent from the reference taxon were estimated in millions of years from a) Light et al. [23] and b) Smith et al. [24]. Contigs are the number of the 1,107 queries that assembled contigs in aTRAM. The final column has the number of contigs that passed a reciprocal best-BLAST test against the entire Pediculus humanus protein coding genome.

Conclusions

Overall, these results suggest that aTRAM will likely prove useful for quickly assembling phylogenomic datasets across a wide taxonomic range. Furthermore, aTRAM was designed to be agnostic to the type of input data, and therefore future testing should include RNA-seq data as well as other types of markers such as UCEs.

Availability and requirements

Project name: aTRAM
Project home page: http://www.github.com/juliema/aTRAM
Operating system: Unix, Linux, OSX
Programming language: Perl
Other requirements: Free software including MUSCLE, MAFFT, BLAST, and Velvet or Trinity
License: BSD 3-clause open source license

Additional files

Additional file 1: DNA vs Protein query results.
Additional file 2: Table of individual gene results summarized in Table 2.
Additional file 3: Fasta files of 3 genes (CO1, EF1a and one unknown nuclear locus), from lice in the genus Degeeriella.
Fasta files include aTRAM contigs as well as Sanger sequences.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

JA and DH programmed the software. DH designed and engineered the software. JA and KPJ tested with example datasets. KPJ and QK provided critical insight into the functionality and testing of the program. JA wrote the manuscript. KPJ, DH and QK edited the manuscript. All authors read and approved the final manuscript.

Acknowledgements

We thank Kim Walden for help with the de novo assembly, Therese Catanach for providing Sanger sequences, and David Slater for help with the University of Illinois IGB Biocluster. We would like to thank the students in the Systematics Discussion Group (University of Illinois) for testing aTRAM. Finally, we thank two anonymous reviewers, and Shaun Jackman, for helpful comments and suggestions on the manuscript.

Funding

This work was supported by the National Science Foundation Grants [DEB-1050706, DEB-0612938, and DEB-1239788 to K.P.J] and by Genome Canada [Project 168BIO to QCC]; and Natural Sciences and Engineering Research Council of Canada [Discovery Grant 298148 to QCC].

Author details

1Illinois Natural History Survey, University of Illinois, Champaign, IL 61820, USA. 2Department of Botany and Beaty Biodiversity Centre, University of British Columbia, Vancouver, BC V6T 1Z4, Canada.

Received: 11 December 2014 Accepted: 24 February 2015

References

1. Do K, Qin ZS, Vannucci M. Advances in Statistical Bioinformatics: Models and Integrative Inference for High-Throughput Data. Camb Univ Press; 2010.
2. Metzker M. Sequencing technologies - the next generation. Nat Rev Genet. 2010;11:31–46.
3. Li C, Hofreiter M, Straube N, Corrigan S, Naylor GJP. Capturing protein-coding genes across highly divergent species. Biotechniques. 2013;54:321–6.
4. Warren RL, Holt RA. Targeted assembly of short sequence reads. PLoS One. 2011. doi:10.1371/journal.pone.0019816
5. Peterlongo P, Chikhi R. Mapsembler, targeted and micro assembly of large NGS datasets on a desktop computer. BMC Bioinformatics. 2012;13:48.
6. Etter PD, Preston JL, Bassham S, Cresko WA, Johnson EA. Local de novo assembly of RAD paired-end contigs using short sequencing reads. PLoS One. 2011;6:e18561.
7. Ruby JG, Bellare P, DeRisi JL. PRICE: software for the targeted assembly of components of (meta) genomic sequence data. G3 (Bethesda). 2013;3(5):865–80.
8. Johnson KP, Walden KK, Robertson HM. Next-generation phylogenomics using a target restricted assembly method. Mol Phylogenet Evol. 2013;66:417–22.
9. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215(3):403–10.
10. Zerbino DR. Using the Velvet de novo assembler for short-read sequencing technologies. Curr Protoc Bioinformatics. 2010. doi:10.1002/0471250953.bi1105s31
11. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, et al. Full-length transcriptome assembly from RNA-seq data without a reference genome. Nat Biotechnol. 2011;29:644–52.
12. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32:1792–7.
13. Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 2013;30:772–8.
14. Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters. Commun ACM – 50th Anniversary Issue. 2008;51:107–13.
15. Ariyaratne PN, Sung W-K. PE-Assembler: de novo assembler using short paired-end reads. Bioinformatics. 2010. doi:10.1093/bioinformatics/btq626
16. Hossain MS, Azimi N, Skiena S. Crystallizing short-read assemblies around seeds. BMC Bioinformatics. 2009;10 Suppl 1:S16.
17. Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJ, Birol I. ABySS: a parallel assembler for short read sequencing data. Genome Res. 2009;19:1117–23.
18. Rausch T, Koren S, Denisov G, Weese D. A consistency-based consensus algorithm for de novo and reference guided assembly of short reads. Bioinformatics. 2009;25:1118–24.
19. Li Y, Hu Y, Bolund L, Wang J. State of the art de novo assembly of human genomes from massively parallel sequencing data. Hum Genomics. 2010;4:271–7.
20. Kane NC, Sveinsson S, Dempewolf H, Yang JY, Zhang D, Engels JMM, et al. Ultra-barcoding in cacao (Theobroma spp.; Malvaceae) using whole chloroplast genomes and nuclear ribosomal DNA. Am J Bot. 2012;99:320–9.
21. Johnson KP, Allen JM, Olds BP, Mugisha L, Reed DL, Paige KN, et al. Rates of genomic divergence in humans, chimpanzees and their lice. Proc Biol Sci. 2014;281:1777.
22. Kelley DR, Schatz MC, Salzberg SL. Quake: quality-aware detection and correction of sequencing errors. Genome Biol. 2010;11:R116.
23. Light JE, Smith VS, Allen JM, Durden LA, Reed DL. Evolutionary history of mammalian sucking lice (Phthiraptera: Anoplura). BMC Evol Biol. 2010;10:292.
24. Smith VS, Ford T, Johnson KP, Johnson PCD, Yoshizawa K, Light JE. Multiple lineages of lice pass through the K-Pg boundary. Biol Lett. 2011. doi:10.1098/rsbl.2011.0105
25. Waterhouse RM, Zdobnov EM, Tegenfeldt F, Li J, Kriventseva EV. OrthoDB: the hierarchical catalog of eukaryotic orthologs. Nucleic Acids Res. 2011;39:D283–8.

