UBC Faculty Research and Publications

NU-IN: Nucleotide evolution and input module for the EvolSimulator genome simulation platform
Dlugosch, Katrina M; Barker, Michael S; Rieseberg, Loren H
Aug 2, 2010

Full Text

TECHNICAL NOTE    Open Access

NU-IN: Nucleotide evolution and input module for the EvolSimulator genome simulation platform

Katrina M Dlugosch1*, Michael S Barker2,3, Loren H Rieseberg1,3

Abstract

Background: There is increasing demand to test hypotheses that contrast the evolution of genes and gene families among genomes, using simulations that work across these levels of organization. The EvolSimulator program was developed recently to provide a highly flexible platform for forward simulations of amino acid evolution in multiple related lineages of haploid genomes, permitting copy number variation and lateral gene transfer. Synonymous nucleotide evolution is not currently supported, however, and would be highly advantageous for comparisons to full genome, transcriptome, and single nucleotide polymorphism (SNP) datasets. In addition, EvolSimulator creates new genomes for each simulation, and does not allow the input of user-specified sequences and gene family information, limiting the incorporation of further biological realism and/or user manipulations of the data.

Findings: We present modified C++ source code for the EvolSimulator platform, which we provide as the extension module NU-IN. With NU-IN, synonymous and non-synonymous nucleotide evolution is fully implemented, and the user has the ability to use real or previously-simulated sequence data to initiate a simulation of one or more lineages. Gene family membership can be optionally specified, as well as gene retention probabilities that model biased gene retention. We provide PERL scripts to assist the user in deriving this information from previous simulations. We demonstrate the features of NU-IN by simulating genome duplication (polyploidy) in the presence of ongoing copy number variation in an evolving lineage.
This example is initiated with real genomic data, and produces output that we analyse directly with existing bioinformatic pipelines.

Conclusions: The NU-IN extension module is a publicly available open source software (GNU GPLv3 license) extension to EvolSimulator. With the NU-IN module, users are now able to simulate both drift and selection at the nucleotide, amino acid, copy number, and gene family levels across sets of related genomes, for user-specified starting sequences and associated parameters. These features can be used to generate simulated genomic datasets under an extremely broad array of conditions, and with a high degree of biological realism.

Introduction

The current explosion of genomic sequence data is generating unprecedented insights into the structure and evolution of genomes. Among the most profound recent discoveries is the extent to which gene copy number variation and the gain and loss of lineage specific duplications are pervasive and ongoing features of evolution in many organisms [1]. Such paralogs are known to affect phenotypes directly and to play important roles in the evolution of gene functions and divergence among species, e.g. [2-4]. Accordingly, their retention or loss appears to be shaped by selection in some cases [5]. This means that comprehensive studies of genomes must incorporate evolution at the levels of gene and gene family loss and gain, as well as the traditional scales of nucleotide and amino acid mutation. Simulations that integrate across all of these levels to generate different evolutionary scenarios will provide powerful tools for testing hypotheses about how evolution works.

* Correspondence: katrina.dlugosch@gmail.com
1Department of Botany, University of British Columbia, Vancouver, BC V6T1Z4, Canada
Full list of author information is available at the end of the article

Dlugosch et al. BMC Research Notes 2010, 3:217
http://www.biomedcentral.com/1756-0500/3/217

© 2010 Dlugosch et al; licensee BioMed Central Ltd.
This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The simulation platform EvolSimulator [6] was recently developed to accommodate many of these needs. It generates the coding sequences of an ancestral haploid genome, uses this genome to initiate any number of lineages, and evolves the lineages forward under different selective regimes. Users can specify rates of nucleotide mutation, gene duplication and loss, as well as lateral gene transfer, if desired. EvolSimulator is currently one of the only programs available for performing such detailed genome-scale simulations, and it provides a particularly powerful platform for testing aspects of gene family evolution [7].

Currently, EvolSimulator (as of version 2.1.0) is designed to simulate only amino acid evolution. A nucleotide sequence exists for every genome, and mutations are evaluated at each position according to nucleotide substitution models set by the user, but only non-synonymous changes are retained and propagated through the simulation. A full implementation of (synonymous and non-synonymous) nucleotide evolution in the simulation would be highly advantageous for testing hypotheses at the nucleotide level, and for comparison to increasingly available genome, transcriptome, and single nucleotide polymorphism (SNP) datasets. Many analyses of sequence and genome evolution utilize patterns of both synonymous and non-synonymous nucleotide changes to infer relationships within and between genomes and distinguish the action of selection [8].
Simulations that include neutral and non-neutral nucleotide evolution are therefore particularly important for evolutionary hypothesis testing.

In addition, particular evolutionary scenarios will often require specific data types, different parameter sets at different time points, and/or user manipulations of the data. These objectives are most easily achieved by allowing users to input their own sequence and other genomic information as the starting material for a given simulation. For example, it is desirable to use data from real organisms in cases where the simulation will be compared directly to real data, or used to test analysis pipelines which are dependent on biological realism (such as protein motifs). It is also commonly helpful to restart a simulation, for instance to replicate runs from the same starting genome, to run multi-stage simulations where different parameter combinations are needed, or to manipulate genomes manually (such as whole genome duplication, see example below) within the simulation. EvolSimulator is not designed to accommodate these inputs at present, and instead generates a new genome as the ancestor for each simulation. We have developed NU-IN, an extension to the EvolSimulator platform to implement nucleotide evolution and offer user data input, along with the existing program features (see Additional File 1: NU-IN Download 1.0.2 for the source code and documentation for this software).

Software Description

NU-IN takes advantage of EvolSimulator's existing nucleotide-based mutation machinery and implements it fully. EvolSimulator generates nucleotide changes according to user specified rates of mutation and nucleotide bias. Previously, only mutations leading to an amino acid change that survived selection were recorded. With NU-IN, all synonymous (silent) nucleotide mutations as well as the non-synonymous mutations are reported.
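
To make the synonymous/non-synonymous distinction concrete, the following is a minimal Python sketch of how a single point change in an in-frame coding sequence can be classified. It is an illustration only and is not taken from the NU-IN or EvolSimulator C++ source: the function name `classify_substitution` is hypothetical, and the use of the standard genetic code is an assumption, since the text does not state which code table the simulator applies.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (assumed here), amino acids in TCAG codon order; '*' marks a stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_substitution(seq, pos, new_base):
    """Classify a point change at 0-based position pos of an in-frame, uppercase DNA sequence.

    Hypothetical helper for illustration; returns 'synonymous' or 'non-synonymous'.
    """
    if len(seq) % 3 != 0:
        raise ValueError("sequence must be in reading frame (length divisible by 3)")
    if seq[pos] == new_base:
        raise ValueError("new base is identical to the existing base")
    codon_start = pos - pos % 3          # first position of the affected codon
    offset = pos - codon_start           # position of the change within the codon
    old_codon = seq[codon_start:codon_start + 3]
    new_codon = old_codon[:offset] + new_base + old_codon[offset + 1:]
    # A change is synonymous when the encoded amino acid is unchanged.
    if CODON_TABLE[old_codon] == CODON_TABLE[new_codon]:
        return "synonymous"
    return "non-synonymous"


if __name__ == "__main__":
    gene = "ATGCTT"  # Met-Leu
    print(classify_substitution(gene, 5, "C"))  # CTT -> CTC, both Leu
    print(classify_substitution(gene, 3, "A"))  # CTT -> ATT, Leu -> Ile
```

In this sketch only the classification step is shown; whether a non-synonymous change is then kept or discarded by selection is a separate decision made by the simulator.
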
Synonymous changes are selectively neutral and remain in the genome unless they co-occur with an amino acid change that is eliminated by selection (as we imagine happens in nature).

In addition, NU-IN allows users to start simulations with their own data, including genes from real organisms and/or genomes already generated by EvolSimulator. Input genomes may be any FASTA-formatted nucleotide sequences in reading frame. NU-IN processes input data in several steps: First, the genome file is read, the genes are numbered sequentially, and each gene is assigned as the first member of its gene family. Every gene is then assigned a 'usefulness' value in all selective environments ('habitats' and 'niches') to be used in the program. Usefulness values are probabilities of gene retention, dictating how important it is to prevent a gene from being lost in an environment if the genome size happens to shrink (where parameters for genome size variation are set by the user). This represents selection at the level of gene loss. All of these steps mimic the same process used by EvolSimulator 2.1.0 when it creates a simulated starting genome.

NU-IN also creates the option for the user to provide their own gene family membership and gene usefulness information. When inputting real data, gene family membership information can be derived from additional analyses of gene relationships in the genome [9]. Gene usefulness could be tied to knowledge about historical patterns of gene retention (such as biased retention of transcription factors [10]). For inputs of genomes generated by previous runs of the simulation, gene family and usefulness information is embedded in the output files. The NU-IN download provides several PERL scripts to aid users in parsing sequence, family, and usefulness information from the simulation output.

Example Usage

We expect that users will find this to be an exceptionally practical tool with which they can test a very broad array of hypotheses about sequence and genome evolution.
For example, simulated datasets can be generated to evaluate the impact of mutation rate, divergence times, and/or selection on our ability to detect reticulate evolution (lateral gene transfer or hybridization), copy number variation, or genome duplication in samples of genomic data. To demonstrate one such approach, we simulated a whole genome duplication (polyploidy) in an evolving lineage with a constant background rate of gene gain and loss. We initiated our simulation with coding sequences from the spike moss Selaginella moellendorffii genome (10 Sep 2008 release [11]), which shows no evidence of recent or ancient polyploidy. We evolved this genome for 5.0 Ks (synonymous substitutions/site) to allow it to stabilize on a constant rate of gene turnover (see Additional File 2: Example Parameters for the parameters used), and produce an age distribution of simulated paralogs that is similar to the original data (Figure 1 inset). Using the output of this simulation, we doubled the genome manually and input this polyploid genome and its gene family and usefulness information back into the simulation to evolve for an additional 0.05 and 0.45 Ks.

For the original, simulated, and duplicated genomes, we used an existing bioinformatic pipeline to identify the Ks values for each duplication event in each gene family tree [12,13] (available as 'DupPipe' at EvoPipes.net [14]). We used a random sample of 10,000 genes for these analyses, similar to what is available in many publicly available EST datasets (e.g. [12]). Briefly, this pipeline identifies sets of similar sequences as gene family members, uses alignment with known plant proteins to place sequences in reading frame, generates a gene tree for the family, and calculates the synonymous site divergence that corresponds to each node (duplication event) in the tree.
Because our simulations were based on sequences from a real genome, we were able to use the existing analysis pipeline unaltered, and produce observations with the same error (e.g., in defining gene families and assigning reading frame) as occurs with real data.

Distributions of the Ks values for each duplication event are commonly used to evaluate evidence for ancient genome duplications, apparent as peaks of duplication at particular levels of divergence (a proxy for time) [15]. Detailed simulations of the sequences that yield these distributions have not been available previously, however, limiting tests of the factors generating and shaping these patterns. Histograms of duplication Ks values for our simulations (Figure 1) demonstrate that the program achieved realistic patterns of gene gain and loss, and that whole genome duplications are clearly visible as peaks that diminish with time since polyploidization. The reduction in peak prominence is due to ongoing paralog loss, and such simulations will be invaluable for generating expectations about how far back in time these events might remain observable by this method.

Conclusions

Genomic datasets offer tremendous potential to address broad evolutionary questions, but demand analytical tools that work at these same scales of biological organization. NU-IN expands the EvolSimulator platform to accommodate hypotheses involving synonymous and non-synonymous nucleotide evolution, and allows users to provide and manipulate input data as needed to address their unique needs. These features can be used to generate simulated genomic datasets under an extremely broad array of conditions affecting point mutations, copy number variation, lateral gene transfer, drift, and selection at multiple levels.
Our simulation of a genome duplication event demonstrates the ability of this platform to produce realistic genome-wide patterns of gene divergence and variation from these fundamental evolutionary processes.

Availability

• Project name: NU-IN
• Project home page: http://evopipes.net/nuin.html
• Operating system(s): Linux/Unix (gcc/g++ compiler)
• Programming language: C++, PERL
• License: GNU GPL

Figure 1 Simulated Ks Distributions of Duplicated Genes. Frequency distributions of synonymous site divergence (Ks) at nodes of gene family trees for loci in a sample of the S. moellendorffii genome (inset), a simulated genome based on those same loci (grey line), and a whole genome duplication of the simulated genes that has since evolved for 0.05 Ks (solid line) and 0.45 Ks (dashed line).

Additional material

Additional file 1: NUIN Download 1.0.2. An archive folder (gzip tarball) of documentation and source code files for NU-IN version 1.0.2.
Additional file 2: Example Parameters. A parameter text file used to run the NU-IN simulation program.

Acknowledgements

We thank Rob Beiko and Rob Charlebois for helpful discussions regarding the EvolSimulator program, and two reviewers for helpful comments on this manuscript. Funding was provided by the Natural Sciences and Engineering Research Council of Canada (No. 353026 to LHR).

Author details

1Department of Botany, University of British Columbia, Vancouver, BC V6T1Z4, Canada. 2The Biodiversity Research Centre, University of British Columbia, Vancouver, BC V6T1Z4, Canada. 3Department of Biology and Center for Genomics and Bioinformatics, Indiana University, Bloomington, IN 47405, USA.

Authors' contributions

KMD, MSB, and LHR conceived of the software development. KMD wrote the software and manuscript.
All authors read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Received: 24 April 2010 Accepted: 2 August 2010 Published: 2 August 2010

References

1. Lynch M: The origins of genome architecture. Sunderland: Sinauer Associates; 2007.
2. Bomblies K, Lempe J, Epple P, Warthmann N, Lanz C, Dangl JL, Weigel D: Autoimmune response as a mechanism for a Dobzhansky-Muller-type incompatibility syndrome in plants. PLOS Biol 2007, 5:1962-1972.
3. Chapman M, Leebens-Mack J, Burke J: Positive selection and expression divergence following gene duplication in the sunflower CYCLOIDEA gene family. Mol Biol Evol 2008, 25:1260-1273.
4. Conrad DF, Pinto D, Redon R, Feuk L, Gokcumen O, Zhang Y, Aerts J, Andrews TD, Barnes C, Campbell P, Fitzgerald T, Hu M, Ihm CH, Kristiansson K, MacArthur DG, MacDonald JR, Onyiah I, Pang AWC, Robson S, Stirrups K, Valsesia A, Walter K, Wei J, The Wellcome Trust Case Control Consortium, Tyler-Smith C, Carter NP, Lee C, Scherer SW, Hurles ME: Origins and functional impact of copy number variation in the human genome. Nature 2010, 464:704-712.
5. Hahn MW, Demuth JP, Han S-G: Accelerated rate of gene gain and loss in primates. Genetics 2007, 177:1949.
6. Beiko RG, Charlebois RL: A simulation test bed for hypotheses of genome evolution. Bioinformatics 2007, 23:825-831.
7. Beiko RG, Doolittle WF, Charlebois RL: The impact of reticulate evolution on genome phylogeny. Sys Biol 2008, 57:844-856.
8. Yang Z, Bielawski JP: Statistical methods for detecting molecular adaptation. Trends Ecol Evol 2000, 15:496-503.
9. PlantTribes. [http://fgp.bio.psu.edu/tribedb/index.pl].
10. Freeling M: Bias in plant gene content following different sorts of duplication: tandem, whole-genome, segmental, or by transposition. Ann Rev Plant Biol 2009, 60:433-453.
11. Selaginella moellendorffii v1.0. [http://genome.jgi-psf.org/Selmo1].
12. Barker MS, Kane NC, Matvienko M, Kozik A, Michelmore RW, Knapp SJ, Rieseberg LH: Multiple paleopolyploidizations during the evolution of the Compositae reveal parallel patterns of duplicate gene retention after millions of years. Mol Biol Evol 2008, 25:2445-2455.
13. Barker MS, Vogel H, Schranz ME: Paleopolyploidy in the Brassicales: analyses of the Cleome transcriptome elucidate the history of genome duplications in Arabidopsis and other Brassicales. Genome Biol Evol 2009, 1:391-399.
14. EvoPipes.net. [http://www.evopipes.net].
15. Blanc G, Wolfe KH: Widespread paleopolyploidy in model plant species inferred from age distributions of duplicate genes. Plant Cell 2004, 16:1667-1678.

doi:10.1186/1756-0500-3-217
Cite this article as: Dlugosch et al.: NU-IN: Nucleotide evolution and input module for the EvolSimulator genome simulation platform. BMC Research Notes 2010 3:217.

