UBC Faculty Research and Publications

Both mechanism and age of duplications contribute to biased gene retention patterns in plants Rody, Hugo V S; Baute, Gregory J; Rieseberg, Loren H; Oliveira, Luiz O Jan 6, 2017

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata


52383-12864_2016_Article_3423.pdf [ 1.31MB ]
JSON: 52383-1.0340693.json
JSON-LD: 52383-1.0340693-ld.json
RDF/XML (Pretty): 52383-1.0340693-rdf.xml
RDF/JSON: 52383-1.0340693-rdf.json
Turtle: 52383-1.0340693-turtle.txt
N-Triples: 52383-1.0340693-rdf-ntriples.txt
Original Record: 52383-1.0340693-source.json
Full Text

Full Text

RESEARCH ARTICLE Open AccessBoth mechanism and age of duplicationscontribute to biased gene retentionpatterns in plantsHugo V. S. Rody1†, Gregory J. Baute2†, Loren H. Rieseberg2 and Luiz O. Oliveira1*AbstractBackground: All extant seed plants are successful paleopolyploids, whose genomes carry duplicate genes thathave survived repeated episodes of diploidization. However, the survival of gene duplicates is biased with respectto gene function and mechanism of duplication. Transcription factors, in particular, are reported to be preferentiallyretained following whole-genome duplications (WGDs), but disproportionately lost when duplicated by tandemevents. An explanation for this pattern is provided by the Gene Balance Hypothesis (GBH), which posits thatduplicates of highly connected genes are retained following WGDs to maintain optimal stoichiometry among geneproducts; but such connected gene duplicates are disfavored following tandem duplications.Results: We used genomic data from 25 taxonomically diverse plant species to investigate the roles of duplicationmechanism, gene function, and age of duplication in the retention of duplicate genes. Enrichment analyses wereconducted to identify Gene Ontology (GO) functional categories that were overrepresented in either WGD ortandem duplications, or across ranges of divergence times. Tandem paralogs were much younger, on average, thanWGD paralogs and the most frequently overrepresented GO categories were not shared between tandem andWGD paralogs. Transcription factors were overrepresented among ancient paralogs regardless of mechanism oforigin or presence of a WGD. Also, in many cases, there was no bias toward transcription factor retention followingrecent WGDs.Conclusions: Both the fixation and the retention of duplicated genes in plant genomes are context-dependentevents. The strong bias toward ancient transcription factor duplicates can be reconciled with the GBH if selectionfor optimal stoichiometry among gene products is strongest following the earliest polyploidization events andbecomes increasingly relaxed as gene families expand.Keywords: Biased gene retention, Polyploidy, Transcription factors, Whole-genome duplicationBackgroundGene duplication has long been viewed as a key driver ofbiological complexity in Eukaryotes [1–4]. Duplicategenes mainly arise via small-scale tandem or segmentalduplication events or via large-scale whole genomeduplications (WGDs). The latter are especially commonin plants [5, 6]. Indeed, comparative genomic studies in-dicate that all extant seed and flowering plants haveexperienced one or more WGDs in their evolutionaryhistory [7–12].Following gene duplication (whether via tandem, seg-mental or WGD events), most duplicate copies becomepseudogenes (i.e. lose their function) or are lost entirelydue to deletions [13]. This is expected because of relaxedpurifying selection due to functional redundancy. Large-scale deletions are especially common following WGDs,as the neopolyploid returns back to its ancestral diploidcondition, a process referred to as diploidization. Never-theless, some gene duplicates are retained, and thesesurviving duplicates appear to contribute importantly tothe evolution of biological complexity and phenotypic* Correspondence: lorlando@ufv.br†Equal contributors1Department of Biochemistry and Molecular Biology, Universidade Federal deViçosa, Viçosa 36570-900, Minas Gerais, BrazilFull list of author information is available at the end of the article© The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, andreproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link tothe Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.Rody et al. BMC Genomics  (2017) 18:46 DOI 10.1186/s12864-016-3423-6novelty, in part because such genes are less constrainedevolutionarily than are single copy genes [14–16].Several models have been put forward to explain howduplicate genes avoid pseudogenization, as well as toaccount for why some duplicate genes are retained andothers are not [17]. These include (1) neofunctionaliza-tion, in which one of the duplicates (i.e. paralogs)acquires a new function; (2) subfunctionalization, inwhich ancestral function is partitioned among paralogs[1]; (3) relative dosage, in which duplicate genes areretained (or lost) to avoid dosage imbalances [18, 19];and (4) absolute dosage, in which the fixation of dupli-cate genes is due to selection favoring an increase ingene dosage [20] or metabolic flux [21].In this paper, we focus on the predictions of the rela-tive dosage model, also known as the Gene BalanceHypothesis (GBH) [18, 22], as this hypothesis has gar-nered the most support from real data [19, 23–26]. Ac-cording to the GBH, genes with a large number ofinteractions (i.e., “connected genes”) should be retaineddisproportionately following WGD events thereby main-taining optimal stoichiometry among their products;when a WGD event occurs, all genes are duplicated sim-ultaneously and so relative gene dosage should notchange. In small-scale duplications (e.g., tandem events),the increased dosage of a single, highly connected genecould result in decreased fitness, or even in lethality.Therefore, connected genes are expected to be differen-tially lost following small-scale duplications. Conversely,genes that work alone or have few interactions, such asthose involved in disease resistance, are more likely tobe retained following tandem duplications.Patterns of gene retention in Arabidopsis thaliana arelargely consistent with GBH predictions. For example,highly connected genes such as transcription factorshave been preferentially retained after WGDs in A. thali-ana, but disproportionately lost following small-scaleduplications [23, 24]. The bias towards regulatory geneschiefly derives from duplicates of intermediate age (circa50–70 mya), which are mainly WGD-associated [13].Similar findings have been reported for poplar [26] andrice [25]. In contrast, paleologs (paralogs arising fromWGD events) in the Compositae family are enriched forgenes annotated to structural components or cellularorganization gene ontology (GO) categories, while genesinvolved with transcription appear to be significantlyunder-represented [10]. In A. thaliana and Sorghum bi-color, both WGD and tandem mechanisms of duplica-tion are associated with paralogs involved in highmetabolic flux networks [21], an observation best pre-dicted by the absolute dosage model.In addition to mechanism of duplication, the fate ofparalogs may be influenced by genetic background, vari-ous environmental factors, epigenetic effects, geneticdrift, and the mechanism of gene dosage-compensation[15, 21, 27]. Another potential issue concerns the fasterturnover rates of tandem paralogs relative to thoseoriginating via WGDs [7, 14, 28, 29]. As a consequence,the sampling of tandem paralogs is biased towardsyoung gene duplicates whereas that of WGD paralogs isskewed towards old duplications. As far as we are aware,this bias has not previously been accounted for when in-ferring patterns of duplicate gene retention.Here we investigate the impact of duplication mechan-ism, gene function, and age of duplication in the reten-tion of duplicate genes. Our analyses consider bothWGD and tandem duplications, as these are the twomost frequently invoked mechanisms to explain howparalogous gene pairs are generated in plant genomes [3,23, 24, 30]. We have targeted 25 plant species with fullysequenced genomes that include the basal land plants,Physcomitrella and Selaginella, the basal flowering plantAmborella, and as well as 14 flowering plant orders. Thisdiverse array of taxa enables comparisons of taxa withhighly contrasting histories of polyploidy, including atleast one species with no known WGD in its evolution-ary history (Selaginella). This is critical, because it allowsus to control for potential biases caused by unequal du-plicate gene turnover rates. Our focus is on genes anno-tated as transcription factors, since differential retentionof duplicated transcription factors provides the mainsupport for the GBH. We specifically address the follow-ing questions: (1) Is the turnover rate of WGD paralogspersistently lower than that of tandem paralogs? (2)Which functional gene categories are consistently over-represented among WGD and/or tandem paralogs? (3)Does variation in duplicate gene retention depend sig-nificantly upon the age of WGD paralogs? and (4) Towhat extent do our results support for the Gene BalanceHypothesis?ResultsOrigin and turnover rate of paralogsFor each of the 25 study species, we calculated Ks timedivergence between pairs of paralogs and used asynteny-based approach to categorize members of allgene families as derived from WGD or tandem duplica-tions. Duplicate pairs whose origins were uncertainbased on available data were classified as “undefined”.Across the 25 target genomes, the majority of paralogsdetected had Ks ≤ 2 (Table 1) including 79% of paralogsin A. thaliana, 86% in Glycine max, and 92% in Malusdomestica. Paralogs with Ks > 2 were excluded from ouranalyses due to concerns that Ks saturation could impairreliable inferences [31]. Most species displayed clearprominent peaks in their Ks age histograms, which isillustrated by histograms for five species with contrastinghistories of polyploidy (Fig. 1). Histograms for theRody et al. BMC Genomics  (2017) 18:46 Page 2 of 10remaining 20 species are depicted elsewhere (Additionalfile 1: Figure S1). In the K-S goodness of fit test, all his-tograms for all species except Carica deviated signifi-cantly (P < 0.05) from the null model of constantduplicate gene birth and death (Additional file 1: TableS1). SiZer maps identified a significantly increasing gra-dient in the Ks age histograms of WGD-derived paralogsof most species, which provides support for polyploidsignals being well distinguished from backgroundduplications.The Ks age histograms of WGD-derived paralogs(Fig. 1a, depicted in black) were clearly distinct fromthose of the tandem-derived paralogs (Fig. 1b, depictedin gray). While tandem histograms exhibited a descend-ing slope (similar to a half-parabola) for most of the spe-cies, WGD-derived paralog histogams had peaks thatoverlapped with peaks from histograms of all paralogs(Fig. 1a, depicted in brown). SiZer maps also confirmedthe presence of peaks for WGD-derived paralogs histo-grams (Fig. 1d).Because of our focus on transcription factor paralogs,their Ks age histograms are shown (Fig. 1 and Additionalfile 1: Figure S1; depicted in yellow) along with the Kshistograms of WGD- and tandem-derived paralogs. TheSiZer maps (Fig. 1e) showed increasing gradients fortranscription factor paralogs that overlapped with theslopes of WGD-derived paralogs.Biased retention of paralogs after large- and small-scaleduplicationsTo assess the universality of the GBH across land plants,we identified the most strongly overrepresented GOTable 1 Distribution of paralogous gene pairs for 25 plant species targeted by this studySpecie Chr Initial PCG Duplicates Number of duplicates by duplication type Number of duplicates by Ks rangeWGD Tandem Undefined 0 < Ks≤0.50.5 <Ks≤ 11 < Ks≤1.51.5 <Ks≤ 2Ks >2Arabidopsis lyrata 16 32670 6378 3442 1816 1120 2251 2216 966 945 228Arabidopsis thaliana 10 33602 6194 2740 1232 2222 1657 2407 1183 947 222Amborella trichopoda 26 26460 3322 15 998 2309 1861 427 402 632 137Brachypodiumdistachyon10 26678 3573 1025 1768 780 981 1024 835 733 175Carica papaya 18 28072 1915 24 455 1436 603 210 402 700 126Citrullus lanatus 22 23438 2806 385 1015 1406 691 435 751 930 227Eucalyptus grandis 22 36449 11120 390 6424 4306 8106 925 1029 1060 240Fragaria vesca 14 34809 3974 1021 1606 1347 1500 979 684 811 184Glycine max 40 46509 15242 9721 2087 3434 11697 1961 790 794 185Helianthus annuus 34 44144 9925 57 995 8873 4651 2805 1551 918 108Lotus japonicus 12 26818 2682 184 627 1871 1159 774 415 334 51Malus domestica 34 63515 15551 2761 1308 11482 13084 1258 683 526 107Manihot esculenta 36 30800 7134 2530 703 3901 4915 837 716 666 110Medicago truncatula 16 57587 5098 1083 2419 1596 2902 1262 543 391 115Oryza sativa ssp.indica24 48788 8349 1957 2869 3523 3361 2169 1665 1154 317Oryza sativa ssp.japonica24 59430 5559 1482 2173 1904 1928 1584 1233 814 183Physcomitrella patens 54 36137 3769 306 202 3261 637 1848 883 401 99Populus trichocarpa 36 41521 9721 5609 1988 2124 7572 738 704 707 147Ricinus communis 20 31221 2558 155 614 1789 628 435 683 812 176Sorghum bicolor 20 34686 4267 1048 1698 1521 1468 1061 993 745 186Solanum lycopersicum 24 34432 7100 1234 2561 3305 3184 2287 872 757 209Selaginellamoellendorffii16–2722285 1885 351 608 926 1457 129 102 197 66Theobroma cacao 20 46269 3488 722 1553 1213 1199 601 822 866 201Vitis vinifera 38 26644 4536 528 1935 2073 1918 852 1042 724 128Zea mays 20 39597 6336 590 1396 4350 3792 1095 813 636 153Chr Number of Chromosomes, Initial PCG Initial number of Protein-coding gene sequencesRody et al. BMC Genomics  (2017) 18:46 Page 3 of 10functional categories in both predicted WGD- andtandem-derived paralogs in these 25 genomes. We foundthat WGD- and tandem-derived paralogs did not sharethe top 10 most frequently overrepresented GO categor-ies (Fig. 2a and c). While the most overrepresented cat-egories of WGD-derived paralogs fell undermacromolecular complexes (GO:0032991), internal tocell (GO:0005622), and cytoplasm (GO:0005737) func-tional GO categories; those of tandem-derived paralogsgrouped into programmed cell death (GO:0012501),defense response (GO:0006952), and apoptotic process(GO:0006915) GO categories.In six species, WGD-derived paralogs were notenriched for the overrepresented GO categories found inthe remaining plant species. Five of them—Cariaca,Ricinus, Populus, Selaginella, and Physcomitrella—havefew WGD-derived paralogs predicted by DAGchainer(Table 1), consistent with possible under-estimation ormisidentification of WGD-derived paralogs in thesespecies (see Discussion below). For another five taxa—A.thaliana, Medicago, both Oryza subspecies, and Popu-lus—WGD-derived transcription factor paralogs wereoverrepresented (Fig. 2b). Surprisingly, WGD-derivedtranscription factor paralogs were not significantly over-represented in Arabidopsis lyrata, which shares the sameWGD events as A. thaliana, although there was a trendin the expected direction.Unexpectedly, transcription factor activity(GO:0003700) WGD-derived paralogs were not sig-nificantly overrepresented in 20 plant species, ten ofwhich exhibit evidence of recent WGDs in their evo-lutionary history, with a significantly increasing gradi-ent in SiZer (Fig. 1 and Additional file 1: Figure S1)within Ks range < 1 (and consistent with previousreports—see below). Finally, results from our analysesof tandem duplications showed tandem-derived tran-scription factor paralogs were significantly underrep-resented across the 25 focal genomes.Fig. 1 Ks age distributions (a and b) and SiZer maps (c to e) of five plant species. a Brown bars, all paralogs (background); black bars, WGD-derived paralogs predicted by DAGchainer; yellow bars, paralogs annotated as transcription factor activity (GO:0003700). b Brown bars,background; gray bars, tandem-derived paralogs predicted by DAGchainer. SiZer maps for c All paralogs; d WGD-derived paralogs; e Transcriptionfactor paralogsRody et al. BMC Genomics  (2017) 18:46 Page 4 of 10Biased retention toward ancient transcription factorsWe analyzed the biased retention of transcription factorparalogs based on Ks time divergence as opposed tomechanism of duplication. This was accomplished bymapping known WGD events onto a phylogeny for the25 species targeted by this study (Fig. 3, Additional file1: Table S2).In general, transcription factor (GO:003700) para-logs tend to be overrepresented amongst ancient (Ks >1) duplication regardless of mechanism of duplication(Fig. 3). Eleven of the 25 focal species exhibited sig-nificant enrichment at Ks range > 1.5 (Fig. 3, penta-gons), but no such retention bias at lower Ks ranges(≤1.5). When we compared transcription factor para-log enrichment at Ks > 1.0 versus < 1.0, 17 speciesshowed significant enrichment for the older transcrip-tion factor paralogs (Fig. 3, stars). For four of these,A. thaliana, Medicago, and the two Oryza subspecies,the overrepresented transcription factors originatedfrom WGD events (Fig. 2b). However, for theremaining 13 species, the ancient paralogs are not ob-viously associated with a WGD event. Although A.thaliana, Oryza sativa ssp. indica, and Solanum ex-hibited significant signals of polyploidy in the Ksrange < 1 (Fig. 1; Additional file 1: Figure S1), theirtranscription factor paralogs were only significantlyoverrepresented in the Ks range > 1 (Fig. 3f ).In genomes of only four taxa (Carica, Malus, Manihot,and Populus) were recent transcription factor paralogsoverrepresented, and only for Populus were WGD-derived transcription factor paralogs significantly over-represented (Fig. 2b).In addition to analyzing the retention of transcriptionfactor paralogs, we submitted our data to enrichmentanalysis aiming to find additional GO categories thatcould have experienced biased retention patterns. Anumber of GO categories, including those involved intranscription, regulation, transport, and response tostimulus were frequently overrepresented among ancientparalogs (Ks > 1) and not exclusively associated toWGDs (Additional file 1: Figure S2). While three ofthese functional GO categories—cell periphery(GO:0071944), plasma membrane (GO:0005886), andresponse to abiotic stimulus (GO:0009628)—were over-represented among WGD-derived paralogs; two categor-ies—response to stimulus (GO:0050896) and catalyticactivity (GO:0003824)—were overrepresented amongtandem-derived paralogs.Fig. 2 Heat maps of GO categories across 25 plant species. a The 10 most frequent GO categories overrepresented among WGD-derivedparalogs. b Transcription factor activity category (GO:0003700) enrichment analysis for WGD-derived paralogs. c The 10 most frequent GOcategories overrepresented among tandem-derived paralogs. Color gradient represents the Corrected P value calculated by the ErmineJ software:brown colors, significant over-representation (P < 0.05); yellow colors, reduced or non-significant enrichment; and gray color, no enrichmentRody et al. BMC Genomics  (2017) 18:46 Page 5 of 10DiscussionTandem paralogs have faster turnover rateOur synteny-based approach identified pairs of WGD-derived genes similar to those that have been reportedin previous studies. In A. thaliana, for example, circa80% of the 2740 duplicate gene pairs we classified asWGD-derived are in common with the list ofpolyploidy-derived paralogs published by [23]. Differ-ences among studies may be due to new gene annota-tion tools that recently became available. In someinstances, the number of paralogs predicted as havingtheir origin in WGD events can be underestimateddue to widespread genomic changes (e.g., gene lossand/or chromosomal rearrangements) after polyploidi-zation events [19]. Such processes are particularlyproblematic for ancient polyploidization events, whichmay explain the low number of WGD paralogs we pre-dicted in the basal plants, Amborella and Physcomi-trella, as well as for Lotus, Carica, and Ricinus(Table 1). On the other hand, our approach indicatesthe presence of a small number WGD-derived para-logs in Selaginella, which is not thought to have aWGD in its evolutionary history (Table 1). This resultcould be evidence for Selaginella as ancient poly-ploidy. Alternatively, it suggests that Selaginella hashad an ancient large segmental duplication or somefraction of the identified WGD derived paralogs arefalse positives. However, we selected WGD pairs usinga syteny based approach, which is the most conserva-tive method presently available.Fig. 3 Phylogenetic distribution of transcription factor retention biases among 25 plant species. The phylogenetic tree was adapted from PLAZA3.0. Symbol code: Black circles on the tree branches, all known WGD events we also identified in this study; Open circles, suggested ancient WGDevents we did not examine; triangles, species with WGD-derived transcription factor paralogs significantly overrepresented; pentagons and stars,species with transcription factor paralogs significantly overrepresented in range 1.5 < Ks≤ 2 and range 1 < Ks≤ 2, respectivelyRody et al. BMC Genomics  (2017) 18:46 Page 6 of 10Tandem paralogs were similarly identified based onthe genomic coordinates of genes. In Eucalyptus, 32% ofits 36,449 protein-coding genes originated via tandemevents, which is the largest proportion of tandem-derived paralogs amongst the 25 plant species we inves-tigated. Physcomitrella exhibited the smallest proportion(~1%) of tandem-derived paralogs. These findings arevery similar to those previously reported for Eucalyptus[32] and Physcomitrella [33], respectively.We identified peaks in the Ks age histograms; basedon SiZer maps, these peaks likely result from WGDs(Fig. 1 and Additional file 1: Figure S1). Previous studieshave also identified these WGD events using data thatspan across several families [34], or from a given plantspecies [9, 10, 32]. In the Ks histogram of A. thaliana,for example, there were two prominent peaks (Fig. 1),which coincided with the α and β polyploid events re-ported by early investigations [34–36]. In our analysis,the tail of the most recent duplication masked the sec-ond peak; thus, a single, significantly increasing slopewas identified by SiZer. In A. lyrata, SiZer identified twosignificant peaks as expected given the recent history ofpolyploidy in Arabidopsis [36].Differences in the Ks age histograms from WGD- andtandem-derived paralogs indicates that the turnover rateof tandem paralogs is faster than that of WGD paralogs,as previously suggested by others [7, 14, 29, 33]. Thepattern we uncovered suggests lower turnover rates oftranscription factor paralogs than those observed fortandem paralogs. Furthermore, it appears that the originand biased retention of transcription factor paralogs arenot restricted to large-scale duplication events.Patterns of transcription factor retention following WGDsConsistent with the expectations of the GBH, WGD-and tandem-derived paralogs did not share the top 10most frequently overrepresented GO categories. Six spe-cies—Malus, Cariaca, Ricinus, Populus, Selaginella, andPhyscomitrella—were exceptions and did not share themost frequent GO categories, which is consistent withthe possible under-estimation or misidentification ofWGD-derived paralogs in these species. In Malus, forexample, the GO categories that were overrepresentedinclude: plasma membrane (GO:0005886), response toabiotic stimulus (GO:0009628), response to biotic stimu-lus (GO:0009607), and response to endogenous stimulus(GO:0009719). Analyses of an EST library of Malusdomestica also found that these categories were overrep-resented [37]. Consistent with the GBH, we did notfound tandem-derived transcription factor paralogs over-represented in any of 25 focal genomes.Other findings were inconsistent with the predictionsof the GBH. In plants, the genome of A. thaliana hasbeen frequently used to support dosage-constraints oftranscription factors [23, 24]. Unexpectedly, our findingsreveal transcription factor activity (GO:0003700) WGD-derived paralogs to be significantly overrepresented inonly five plant taxa—A. thaliana, Medicago, the twoOryza subspecies, and Populus. Ten of the 20 remainingstudy species exhibited evidence for recent WGDs.Other studies have also reported a downward bias in theretention of transcription factor paralogs followingWGD events. In Compositae paleologs, for example, ithas been observed that genes involved with structuralcomponents or cellular organization were significantlyoverrepresented; whereas transcription factors were sig-nificantly underrepresented [10]. These authors arguedthat patterns of intrinsic selection on different gene cat-egories may vary across higher taxonomic categories.The fate of paralogs originated by either WGD or small-scale events would depend on intrinsic properties, suchas gene function and the environment in which the newpolyploid was born [21].Age of duplications contribute to biased gene retentionRegardless the mechanism of duplication, we showedthat ancient paralogs of transcription factors were pref-erentially retained over paralogs of more recent origin.In agreement to our findings, a previous study in A.thaliana reported that genes involved in transcriptionalregulation showed greater retention after the later (β)genome duplication than after the youngest (α) duplica-tion [24]. Likewise, transcription factors not directly as-sociated with WGDs were overrepresented among genesof ancient origin in A. thaliana [13]. Again, our resultsindicate that out of 25 plant species with very differenthistories of polyploidy, such as A. thaliana which hastwo recent WGD events [35] and Vitis which has noknown recent WGD events [9], 17 share this pattern ofbiased retention of ancient transcription factor paralogs.Although transcription factor paralogs with recent originwere over-represented in four species (Carica, Malus,Manihot, and Populus), we could only clearly determinethat those of Populus were WGD-derived paralogs. Theover-representation of young (Ks < 0.5) transcription fac-tor paralogs in Carica is intriguing, given that no WGDevents likely took place in its recent evolutionary history[38] and that DAGchainer only predicted tandem-derived transcription factor paralogs for Carica withinthe Ks range ≤ 1.0 (Additional file 1: Table S2). Giventhat Carica lacks recent WGD events [38] and we didnot identified transcription factors paralogs originatedfrom WGD events within Ks < 1, the many transcriptionfactor paralogs of Carica appear to derive from small-scale duplications within its genome.Our findings differ from a recent study of core genefamilies in 37 angiosperm genomes [13], which reportedremarkable consistency in the rate at which genes returnRody et al. BMC Genomics  (2017) 18:46 Page 7 of 10to a single copy state, as well as in the gene families thatare retained as multi-copy. The findings were related todifferences in gene function and the authors concludedthat similar selection pressures within and between line-ages are largely responsible for the non-random patternsobserved, at least for core genes [13]. The apparent dif-ferences between the two studies derive partly from thefact that core gene families represent a fairly small frac-tion (13%) of all gene families and that single copy genesin were included their analysis, which drive many of thereported patterns. In contrast, we restricted our analysesto duplicate genes.ConclusionsOur analyses imply that both the fixation and retentionof duplicated genes are context-dependent events. Thus,while the mechanism of duplication is clearly important,so are the characteristics of the particular lineage inwhich the duplication arises, as well as timing of dupli-cation. Although our results show that many transcrip-tion factor paralogs do indeed derive from large-scaleduplication events, this is not conclusive evidence forthe GBH. Observations seemingly inconsistent with theGBH include, for example, the preferential retention oftranscription factor paralogs in taxa with no apparenthistory of polyploidy or following tandem duplicationsin Carica, as well as the absence of such retention biasesfollowing some recent WGDs (e.g. Glycine, Helianthus,and Zea). Nonetheless, the most important observationin this paper—the strong bias toward ancient transcrip-tion factor duplicates seen in most plant genomes—maybe interpreted in a manner consistent with the GBH.Possibly, all plant lineages are the product of multipleancient WGDs, the earliest of which are no longer de-tectable. Under the GBH, the duplicates from the firstpolyploidization would be most likely to be retained tomaintain optimal stoichiometry among gene products.The number of paralogs is expected to grow rapidly witheach polyploidization event. With so many paralogs,changes in the amount of the gene product might be tol-erated and a copy of the gene can be lost or diverge.This could lead to the pattern we see—biased retentiontoward ancient transcription factor paralogs—and alsomight account for the weaker signal we see among re-cent transcription factor paralogs. It even could account,in part, for the greater tolerance of recent tandem tran-scription paralogs seen in Carica.MethodsData collection and selection of paralogsFull genome annotations, protein-gene codes, DNA se-quences, gene families, and Gene Ontology (GO) anno-tations from the 25 focal species were retrieved fromPLAZA 2.5 and 3.0 Dicots [39], with the exception ofsunflower (Helianthus annuus), as detailed in Additionalfile 1: Table S3. Protein-gene code files with alternativetranscripts removed were used to identify paralogousgene pairs using BLASTp all-against-all, with an e-valuecutoff of e−20, with a minimum 50% identity, alignmentlength > 300 bp, number of mismatches < 550, and num-ber of gap opens < 30. Self hits were removed and onlyparalogous gene pairs with both copies belonging to thesame gene family were maintained for further analysis.For the selection of paralog pairs for H. annuus, CDS se-quences and BLASTn all-against-all were used based onHA412.v1.1 version of the genome (http://www.sunflo-wergenome.org/).Determining paralog duplication mechanismThe DAGchainer software package [40] was used to pre-dict the mechanism of by which paralogs originatedbased on their genomic coordinates. WGD-derived para-log pairs were predicted by running DAGchainer to findsyntenic/collinear regions among chromosomes, in thesame species, using default parameters and ignoring tan-dem duplication alignments (-s and -I options).Tandem-derived paralogs were predicted by using theaccessory segmental duplication tool, also made availableby DAGchainer, to find collinear sets of homologousgenes, with the ‘max intervening genes value’ set to 10.All the other paralog pairs, not predicted as WGD ortandem-derived, were marked as undefined (UD), asthese paralogs may have been originated by either large-or small-scale duplications.Age of duplication eventsWe calculated relative divergence times for each paralogpair in terms of synonymous substitutions per synonym-ous site (Ks). First, we aligned the nucleotide sequencesof gene pairs using TranslatorX [41], based on proteinalignments performed by MUSCLE v3.8.31 [42]. Diver-gence times (Ks) were calculated with the yn00 softwarefrom the PAML v4.1 package [43]. This method assumesthe F3x4 codon frequency model and accounts for tran-sition/transversion rate bias and codon usage bias, whichis an approximation of the maximum likelihood methodrecommended for pairwise comparisons in the manualof PAML. Because of issues associated with Ks satur-ation and stochasticity [31], only paralogs with Ks ≤ 2and Standard Error (SE) < 0.5 were used in furtheranalyses.Custom python scripts were used to parse the BLASTall-against-all output in order to identify the closestparalog gene pairs. First, self hits were removed. Then,paralogs were organized into a single gene list and thenused to select the corresponding paralog pair(s) for eachof these genes based on the following three rules: (I) if asingle gene was predicted as WGD-derived byRody et al. BMC Genomics  (2017) 18:46 Page 8 of 10DAGchainer, keep the duplicate pair with the lowest Ksvalue, while still allowing pairing with tandem-derivedor undefined genes; (II) if not predicted as WGD-derived, but predicted as tandem-derived, keep the genepair with the lowest Ks value; and (III) if the single genewas not predicted as WGD- or tandem-derived, keepthe undefined paralog pair with the lowest Ks value.GO annotation and over-representation analysisFunctional Gene ontology GO terms (categories) weredetermined for each gene and paralog pair and thenevaluated for enrichment by the ErmineJ v3.0.2 software[44]. All the three GO domains (Biological Process, Mo-lecular Function and Cellular Component) were in-cluded in the Over-Representation Analysis (ORA), witha minimum gene set size equal to 10 and the Best Scor-ing gene replicate treatment. Eight different groups ofparalogs were analyzed: WGD-derived, tandem-derived,and paralogs representing the following Ks ranges: (A) 0< Ks ≤ 0.5, (B) 0.5 < Ks ≤ 1, (C) 1 < Ks ≤ 1.5, (D) 1.5 < Ks ≤2, (E) 0 < Ks ≤ 1 and (F) 1 < Ks ≤ 2. The GO categorieswere considered overrepresented if Corrected P < 0.05,as calculated by the ErmineJ software.StatisticsK-S goodness of fit testThe Kolmogorov-Smirnov test [45] was used to evaluateif the age distribution (Ks) of all duplicates (background)deviated significantly (P < 0.05) from a simulated null hy-pothesis of constant duplicate gene birth and death.SiZer maps: identifying significant peaks in Ks histogramsSignificant peaks in the Ks histograms were found bySiZer [46] implemented on R software, with the follow-ing command line: SiZer.1 < − SiZer(x, y, h = c(.05,5), de-gree = 1, derv = 1). A SiZer map is a way of examiningwhen the p-th derivative of a scatterplot-smoother is sig-nificantly negative, possibly zero or significantly positiveacross a range of smoothing bandwidths. In a SiZermap, blue indicates a significantly increasing gradient,red is a significantly decreasing gradient, purple is anon-significant gradient and gray indicates that data aretoo sparse for reliable estimation.Additional fileAdditional file 1: Figure S1. Ks age distributions of 20 species andcorrespondent maps. In (A) brown bars represent all duplicates(background), black and yellow represent the WGD-derived duplicatespredicted by DAGchainer and transcription factors activity (GO:0003700)duplicates, respectively. (B) Brown bars represent the background andgray the tandem-derived duplicates. (C) SiZer maps of background. (D)SiZer maps of WGD-derived duplicates. (E) SiZer maps of transcriptionfactors duplicates. Figure S2. Gene Ontology (GO) categoriesoverrepresented in duplicates with (a) 1.5 < Ks ≤ 2, and (b) 1 < Ks ≤ 2 innine or more of the 25 plant species analyzed in this study. Table S1.Kolmogorov-Smirnov test for Ks age distributions of all duplicates of the25 plant species used in this study. Table S2. Number of duplicatesannotated as transcription factor (TF) activity (GO:0003700) for 25 plantspecies, displayed by duplication type (predicted by the DAGchainersoftware) and grouped by Ks equivalent ages ranges. Table S3. Detailingthe data source and abbreviation of the 25 plant species used in thisstudy. (PDF 2652 kb)AcknowledgementsNone.FundingThe financial support for this study was provided by the Fundação deAmparo à Pesquisa do Estado de Minas Gerais–FAPEMIG (PPM 00561–15) toLOO. LOO received a fellowship from Conselho Nacional deDesenvolvimento Científico e Tecnológico–CNPq (PQ305827/2015-4); HVSRreceived a fellowship from the Brazilian program for research incentives“Science without borders”. LHR was supported by grants from GenomeCanada and Genome BC. The funding agencies played no role in the designof the study, collection, analysis, interpretation of data or in writing themanuscript.Availability of data and materialsData generated in this study are available in as a single file at ftp://ftp.ufv.br/dbb/geneduplication/.Authors’ contributionsHVSR performed the data collection, the computational/statistcs analysis,interpretation of the data, and contributions to the drafting of the article.GJB made contributions to conception and design, interpretation of data,and article drafting. LOO and LHR participated in drafting/writing the articleor revising it critically for important intellectual content. LOO and LHRrevised and approved the final version to be submitted.Competing interestsThe authors declare that they have no competing interests.Consent for publicationNot applicable.Ethics approval and consent to participateNot applicable.Author details1Department of Biochemistry and Molecular Biology, Universidade Federal deViçosa, Viçosa 36570-900, Minas Gerais, Brazil. 2Department of Botany,University of British Columbia, Vancouver, BC V6T 1Z4, Canada.Received: 26 August 2016 Accepted: 14 December 2016References1. Ohno S. Evolution by Gene Duplication. New York: Springer; 1970.2. Wendel JF. Genome evolution in polyploids. Plant Mol Biol. 2000;42:225–49.3. Kondrashov FA, Rogozin IB, Wolf YI, Koonin EV. Selection in the evolution ofgene duplications. Genome Biol. 2002;3:research0008.I0008.9.4. Zhang J. Evolution by gene duplication: an update. Trends Ecol Evol. 2003;18:292–8.5. Wood TE, Takebayashi N, Barker MS, Mayrose I, Greenspoon PB, RiesebergLH. The frequency of polyploid speciation in vascular plants. Proc Natl AcadSci U S A. 2009;106:13875–9.6. Mayrose I, Zhan SH, Rothfels CJ, Magnuson-Ford K, Barker MS, Rieseberg LH,et al. Recently formed polyploid plants diversify at lower rates. Science(80- ). 2011;333:1257.7. Blanc G, Wolfe KH. Functional Divergence of Duplicated Genes Formed byPolyploidy during Arabidopsis Evolution. Plant Cell. 2004;16:1679–91.American Society of Plant Biologists.8. Shoemaker RC, Schlueter J, Doyle JJ. Paleopolyploidy and gene duplicationin soybean and other legumes. Curr Opin Plant Biol. 2006;9:104–9.Rody et al. BMC Genomics  (2017) 18:46 Page 9 of 109. Jaillon O, Aury J-M, Noel B, Policriti A, Clepet C, Casagrande A, et al. Thegrapevine genome sequence suggests ancestral hexaploidization in majorangiosperm phyla. Nature. 2007;449:463–7. Nature Publishing Group.10. Barker MSS, Kane NCC, Matvienko M, Kozik A, Michelmore RW, Knapp SJJ,et al. Multiple paleopolyploidizations during the evolution of theCompositae reveal parallel patterns of duplicate gene retention aftermillions of years. Mol Biol Evol. 2008;25:2445–55.11. Jiao Y, Wicket N, Ayyampalayam S, Chanderbali A. Ancestral polyploidy inseed plants and angiosperms. Nature. 2011;473:97–100.12. Li Z, Baniaga AE, Sessa EB, Scascitelli M, Graham SW, Rieseberg LH, et al.Early genome duplications in conifers and other seed plants. Sci Adv. 2015;1:e1501084.13. Li Z, Defoort J, Tasdighian S, Maere S, Van de Peer Y, De Smet R. GeneDuplicability of Core Genes Is Highly Consistent across All Angiosperms.Plant Cell. 2016;28:326–44.14. Lynch M, Conery JS. The Evolutionary Fate and Consequences of DuplicateGenes. Science (80- ). 2000;290:1151–5.15. Edger PP, Pires JC. Gene and genome duplications: the impact of dosage-sensitivity on the fate of nuclear genes. Chromosom Res. 2009;17:699–717.16. Edger PP, Heidel-Fischer HM, Bekaert M, Rota J, Glöckner G, Platts AE, et al.The butterfly plant arms-race escalated by gene and genome duplications.Proc Natl Acad Sci U S A. 2015;112:8362–6.17. Conant GC, Birchler JA, Pires JC. Dosage, duplication, and diploidization:clarifying the interplay of multiple models for duplicate gene evolution overtime. Curr Opin Plant Biol. 2014;19:91–8.18. Birchler J, Bhadra U, Bhadra MP, Auger DL. Dosage-dependent generegulation in multicellular eukaryotes: implications for dosagecompensation, aneuploid syndromes, and quantitative traits. Dev Biol. 2001;234:275–88.19. Freeling M. Bias in Plant Gene Content Following Different Sorts ofDuplication: Tandem, Whole-Genome, Segmental, or by Transposition. AnnuRev Plant Biol. 2009;60:433–53.20. Kondrashov FA, Koonin EV. A common framework for understanding theorigin of genetic dominance and evolutionary fates of gene duplications.Trends Genet. 2004;20:287–90.21. Hudson CM, Puckett EE, Bekaert M, Pires JC, Conant GC. Selection for highergene copy number after different types of plant gene duplications.Genome Biol Evol. 2011;3:1369–80.22. Veitia RA. Exploring the etiology of haploinsufficiency. Bioessays. 2002;24:175–84.23. Blanc G, Wolfe KH. Widespread Paleopolyploidy in Model Plant SpeciesInferred from Age Distributions of Duplicate Genes. Plant Cell. 2004;16:1667–78.24. Maere S, De Bodt S, Raes J, Casneuf T, Van Montagu M, Kuiper M, et al.Modeling gene and genome duplications in eukaryotes. Proc Natl Acad SciU S A. 2005;102:5454–9. National Academy of Sciences.25. Yu J, Wang J, Lin W, Li S, Li H, Zhou J, et al. The Genomes of Oryza sativa: AHistory of Duplications. PLoS Biol. 2005;3:e38.26. Rodgers-Melnick E, Mane SP, Dharmawardhana P, Slavov GT, Crasta OR,Strauss SH, et al. Contrasting patterns of evolution following whole genomeversus tandem duplication events in Populus. Genome Res. 2012;22:95–105.27. Crow KD, Wagner GP. What is the role of genome duplication in theevolution of complexity and diversity? Mol Biol Evol. 2006;23:887–92.28. Hanada K, Zou C, Lehti-Shiu MD, Shinozaki K, Shiu S-H. Importance ofLineage-Specific Expansion of Plant Tandem Duplicates in the AdaptiveResponse to Environmental Stimuli. Plant Physiol. 2008;148:993–1003.29. Wang Y. Locally duplicated ohnologs evolve faster than nonlocally duplicatedohnologs in Arabidopsis and rice. Genome Biol Evol. 2013;5:362–9.30. Thomas BC, Pedersen B, Freeling M. Following tetraploidy in an Arabidopsisancestor, genes were removed preferentially from one homeolog leavingclusters enriched in dose-sensitive genes. Genome Res. 2006;16:934–46.31. Vanneste K, Van De Peer Y, Maere S. Inference of genome duplications fromage distributions revisited. Mol Biol Evol. 2013;30:177–90.32. Myburg AA, Grattapaglia D, Tuskan GA, Hellsten U, Hayes RD, Grimwood J,et al. The genome of Eucalyptus grandis. Nature. 2014;510:356–62.33. Rensing SA, Lang D, Zimmer AD, Terry A, Salamov A, Shapiro H, et al. ThePhyscomitrella genome reveals evolutionary insights into the conquest ofland by plants. Science (80- ). 2008;319:64–9.34. Vanneste K, Baele G, Maere S, Van De Peer Y. Analysis of 41 plant genomessupports a wave of successful genome duplications in association with theCretaceous-Paleogene boundary. Genome Res. 2014;24:1334–47.35. Simillion C, Vandepoele K, Van Montagu MCE, Zabeau M, Van De Peer Y. Thehidden duplication past of Arabidopsis thaliana. PNAS. 2002;99:13627–32.36. Bowers JE, Chapman BA, Rong J, Paterson AH. Unrevealing angiospermgenome evolution by phylogenetic analysis of chromosomal duplicationevents. Nature. 2003;422:433–8.37. Sanzol J. Dating and functional characterization of duplicated genes in theapple (Malus domestica Borkh.) by analyzing EST data. BMC Plant Biol. 2010;10:87.38. Ming R, Hou S, Feng Y, Yu Q, Dionne-Laporte A, Saw JH, et al. The draftgenome of the transgenic tropical fruit tree papaya (Carica papayaLinnaeus). Nature. 2008;452:991–6.39. Proost S, Van Bel M, Vaneechoutte D, Van de Peer Y, Inzé D, Mueller-RoeberB, et al. PLAZA 3.0: an access point for plant comparative genomics. NucleicAcids Res. 2015;43:D974–81.40. Haas BJ, Delcher AL, Wortman JR, Salzberg SL. DAGchainer: a tool formining segmental genome duplications and synteny. Bioinformatics. 2004;20:3643–6.41. Abascal F, Zardoya R, Telford MJ. TranslatorX: multiple alignment ofnucleotide sequences guided by amino acid translations. Nucleic Acids Res.2010;38:W7–W13.42. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy andhigh throughput. Nucleic Acids Res. 2004;32:1792–7.43. Yang Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol BiolEvol. 2007;24:1586–91.44. Gillis J, Mistry M, Pavlidis P. Gene function analysis in complex data setsusing ErmineJ. Nat Protoc. 2010;5:1148–59.45. Cui L, Wall PK, Leebens-Mack JH, Lindsay BG, Soltis DE, Doyle JJ, et al.Widespread genome duplications throughout the history of floweringplants. Genome Res. 2006;16:738–49.46. Chaudhuri P, Marron JS. Scale space view of curve estimation. Ann Stat.2000;28:408–28.•  We accept pre-submission inquiries •  Our selector tool helps you to find the most relevant journal•  We provide round the clock customer support •  Convenient online submission•  Thorough peer review•  Inclusion in PubMed and all major indexing services •  Maximum visibility for your researchSubmit your manuscript atwww.biomedcentral.com/submitSubmit your next manuscript to BioMed Central and we will help you at every step:Rody et al. BMC Genomics  (2017) 18:46 Page 10 of 10


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items