UBC Theses and Dissertations

The first reference genome of sunflower (Helianthus annuus L.) : a domesticated compilospecies. Grassa, Christopher J. 2015

The First Reference Genome of Sunflower (Helianthus annuus L.)
A Domesticated Compilospecies

by

Christopher J. Grassa
B.S. in Zoology, University of Florida, 2009

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
Master of Science
in
THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Botany)

The University of British Columbia
(Vancouver)

April 2015

© Christopher J. Grassa, 2015

Abstract

I present the first reference genome for sunflower, Helianthus annuus. The reference is 3.6 billion base pairs long and is divided into seventeen lines of text representing the DNA of sunflower’s seventeen chromosomes. This reference was constructed via DNA sequencing and assembly of sunflower line HA412, physical mapping using a sequence-based barcoding approach, and genetic mapping based on low coverage DNA sequencing of a highly polymorphic mapping population. I also assembled and annotated a reference genome of sunflower’s mitochondrial genome. Sunflower and its wild relatives are a useful system for studying ecology and evolution. Helianthus annuus may be regarded as a natural compilospecies; adaptive introgressive hybridization with related species has facilitated the expansion of its range over a variety of soils and climates. In addition, the compatibility of sunflower with its extremophile wild relatives offers the opportunity to breed environmentally resilient sunflower cultivars that can cope with global climate change. The resource described in this thesis will be a useful tool for evolutionary biologists and crop breeders with interests pertaining to sunflower genetics.

Preface

The Sunflower Genome Project received $10 million in funding over a period of five years and was headed by five co-Principal Investigators (PIs) in three countries. It follows that the work was highly collaborative. Loren H. Rieseberg, Nolan C. Kane, John M. Burke, Patrick Vincourt, and Steve Knapp conceived the Sunflower Genome Project.
They were responsible for its high-level design under the supervision of Scientific Advisory Board members: Scott Jackson, Brad Barbazuk, Carl Douglas, Conrad Brunk, and Catherine Feuillet.

My intellectual contributions to the project involved understanding the high-level design of the PIs and the technical details of its individual components. My practical contributions mainly involved performing bioinformatics work to process DNA sequencing data into biologically meaningful information. The work described in this thesis would have been impossible in the absence of a team of people. I performed most of the bioinformatics work for several of the project’s major components and was responsible for assuring the quality of others. My most important contributions to the project, however, were: persistent involvement, developing a detailed understanding of how the components fit together, and integrating them to meet the project’s high-level design.

The plant material employed for Section 2.1 was prepared by Shunxue Tang. The sequencing described in Section 2.2 was carried out at Genome Quebec in Montreal, QC, Canada. I carried out the genotyping described in Section 2.2. I constructed the genetic map described in Section 2.3, which was hand-curated by John Edward Bowers. Bowers provided the verbal model for matching incomplete segregation patterns to a template map described in Section 2.3. I formalized the model in computer code.

The DNA sequencing libraries described in Section 3.1.1 were prepared at Genome B.C., Genome Quebec, the French National Institute for Agricultural Research (INRA), and the Beaty Biodiversity Research Centre’s NextGen Sequencing facility. I curated the data and was responsible for its quality control. I performed all of the work described in Section 3.1.2. Nolan C. Kane and I worked together to configure the genome assembly described in Section 3.1.3. Nolan C. Kane and Thuy Nguyen monitored the assembly’s computation.
After three months of processing, the assembly failed in its final stage of converting binary files to FASTA-formatted text, but Thuy Nguyen wrote custom computer code to recover from the failure. Sariel Hubner carried out some of the bioinformatics work described in Section 3.1.4. Navdeep Gill assigned Allpaths and Celera scaffolds to linkage groups using the physical map. I assigned the Allpaths and Celera scaffolds, as well as mate-pair reads, to linkage groups using information from genetic maps. The plant material and sequencing libraries employed in Section 3.2 were prepared by Dan Ebert. I performed most of the bioinformatics work described in this section, including configuring the SOAP assembly and writing the custom computer code used to align the assembly contigs to the restriction map. Nolan C. Kane filled gaps in the mitochondrial genome assembly using 454 reads. I hand-curated and annotated the mitochondrial genome assembly.

The physical map described in Section 4.2 was constructed by KeyGene, Inc. and hand-curated by Navdeep Gill. Thuy Nguyen wrote custom computer code to assign BACs to linkage groups and break chimeric contigs. Navdeep Gill then assembled physical maps for each linkage group independently. I designed the algorithm for aligning the genome assembly scaffolds to physical map contigs within the constraints of the genetic map, which was implemented in custom computer code written by Frances Raftis and me. I designed and implemented the algorithm described in Section 4.3. I was responsible for the quality control described in Section 4.4. Jerome Gouzy and Sebastien Carrere at INRA performed the genome annotation described in Section 4.4.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
1 Introduction
2 Ultra-high Density Genetic Map
  2.1 Plant Material and Construction of Mapping Population
  2.2 Sequencing and Genotyping
  2.3 Construction of Genetic Map
  2.4 Consensus with Other Genetic Maps
3 Genome Assembly
  3.1 Nuclear Genome
    3.1.1 Preparation and Sequencing of DNA Libraries
    3.1.2 Allpaths-LG Assembly
    3.1.3 Celera Assembly
    3.1.4 Merge of Allpaths-LG and Celera Assemblies
  3.2 Mitochondrial Genome
4 Pseudomolecules
  4.1 What is a Pseudomolecule?
  4.2 Combining Genetic and Physical Maps
  4.3 The Golden Path
  4.4 Masking
  4.5 Seventeen Pseudomolecules
5 Conclusion
Bibliography

List of Tables

Table 3.1 Illumina Reads: Fragment Sizes
Table 3.2 Roche 454 Reads: Fragment Sizes
Table 3.3 Sunflower Mitochondrial Genome Protein-Coding Features
Table 3.4 Sunflower Mitochondrial Genome RNA and Structural Features
Table 4.1 Example Alignment: Scaffold403 to LG8-Ctg66
Table 4.2 Final Pseudomolecule Statistics

List of Figures

Figure 1.1 H. annuus is interfertile with nine other species of sunflower. The fill color of each polygon indicates the number of species interfertile with H. annuus found in the area. North American occurrence records for the nine species were downloaded from The Global Biodiversity Information Facility (GBIF) (Lane 2003). Polygons are defined by a tessellation (Dirichlet 1850) around points generated from a model of sunflower seed packing (Vogel 1979).

Figure 2.1 Genetic map for H. annuus. Interior radii show the segregation of chromosome segments in ninety-three RILs. Black segments indicate RHA280 ancestry and white segments RHA801 ancestry, with transitions marking locations of chromosomal crossover. The genetic map is drawn along the outer two sets of radii. The ray length (yellow) is proportional to the sum of de novo base pairs assigned to 1 cM bins.

Figure 2.2 Illustration of RIL crossing design employed for making the RHA280 x RHA801 genetic map for H. annuus. (Courtesy of Kasia Stepien)

Figure 2.3 Box plot showing distribution of sequencing depth for 93 RILs. The sunflower’s genome size is estimated to be 3.6 Gbp. The RILs were sequenced to approximately 1x depth.

Figure 2.4 Comparison of synteny between Illumina Infinium SNP array and Whole Genome Shotgun sequence-based map of RHA280 x RHA801 RILs. Approximately 90% of hits are in 17 syntenic blocks. The roughly 10% of non-syntenic hits can be explained by picking the second best hit if the true homolog is fragmented into several contigs, or if the sequence is multi-copy.

Figure 3.1 Comparison of restriction fragment maps. Dark grey segments show an in silico digestion of sunflower mitochondrial genome NCBI Reference Sequence: NC_023337. Light grey segments show fragment lengths of the enzyme digestion reported by Siculella and Palmer 1988.

Figure 3.2 Gene and repeat map of sunflower mitochondrial genome NCBI Reference Sequence: NC_023337. Genic loci of Table 3.3 are indicated by black rectangles. Dark green ribbons attach to repeats. Light green rays show alignment depth of a WGS library to the reference mitochondrial genome. Note the coverage spikes, which indicate regions of high homology to the plastid genome. Drops in coverage near the large repeat boundary suggest that the cellular molarity of the mitochondrial genome’s master replication circle is lower than the alternative configuration of two sub-circles. Yellow rays show alignment depth of coverage of an RNAseq library.

Figure 4.1 Diagram showing integration of genetic map, de novo genome assembly, and physical map on Linkage Group 4. Genetic map bin positions are shown in dark grey. Scaffolds are shown in green, with the scaffold base pair position and physical map tag sequences in the neighboring columns. The red bar at the top of the diagram shows the physical map contig of the member tags with FPC units in the row below. Yellow bars indicate a minimum tiling path of BACs. Orange rectangles indicate alignment matches of scaffold tag positions with the corresponding position in the FPC physical map contig.

Figure 4.2 Scaffold positions plotted in RHA280 x RHA801 cM (x-axis) and HA412 physical map bp (y-axis). Each numbered cell contains a chromosome’s plot. Regions of extreme recombination suppression are shown where the slope of a line is close to infinity. I suspect these regions harbor centromeric loci. Conversely, regions of the chromosome that recombine frequently are shown where the slope of a line is close to zero.

Figure 4.3 Frequency distribution of length in base pairs for 100 fully sequenced and assembled BACs.

Acknowledgments

Loren H. Rieseberg gave me the opportunity to be a part of a whole that exceeds the sum of its parts.
Nolan C. Kane believed that I could grow as a scientist.
Rose L. Andrew demonstrated to me that the true reward of doing good science is truth.
Rob J. Kulathinal taught me the value of being a protégé.
Connor Morgan-Lang and Nadia Chadir taught me the value of being a mentor.
Michael C. Whitlock is responsible for much of my understanding of population genetics. He also provided the most comprehensive review of this thesis.
John E. Bowers taught me most of what I know about meiotic mapping.
Matt King, John M. Burke, Navdeep Gill, and S. Evan Staton taught me much of what I know about sunflower DNA.
Armando Geraldes and Sebastien Renaut taught me that a scientist is responsible for disseminating knowledge in addition to creating it.
Quentin C.B. Cronk taught me some concepts of plant genomics. More importantly, he encouraged my curiosity.
Austin Davis-Richardson demonstrated the power that lies in UNIX mastery to me.
Thuy Nguyen, Daisie Huang, and Frances Raftis demonstrated the value of elegant algorithm design to me.
Gregory J. Baute was always available to listen to my ideas.
Kathryn G. Turner, Brook T. Moyers, Gregory L. Owens, Kate Ostevick, and Kieren Samuk taught me much of what I know about ecology and evolution.
Genome Canada, Genome B.C., and the Sunflower Genome Consortium funded my work.
Joshua Chang Mell taught me to question my assumptions.
Jayne Elizabeth Knight brought me to Vancouver, B.C., Canada.
Rogers Thompson Brewer, Marjesca Brown, Rebecca Deist, Jonathan S. Griffiths, Nikta Fay, and Katie Elizabeth Berns encouraged me to keep trying when I felt like giving up.
Carl J. Douglas and Shawn D.
Mansfield influenced my cerebral model of nature at the cellular scale.
Dylan Orion Burge influenced my cerebral model of nature at the geologic scale.
My parents, Dreama Andersen, Bill Grassa, and Bruce Andersen, and grandparents, Sarah and Thomas Grassa, provided me with DNA and an environment for early development.
Cristina Maria Moya reminded me of the reasons I first fell in love with science and renewed my enthusiasm for a scholarly life. She also provided helpful comments for this thesis.

Chapter 1
Introduction

This thesis describes my part in producing a reference genome for sunflower. A genome is the entire DNA (deoxyribonucleic acid) belonging to a single organism. It contains the information needed to grow from a zygote to mature adult. A reference genome is a textual model of this information. Bare DNA is useless in the absence of the cellular machinery needed to transcribe and translate the information it encodes into proteins. Similarly, a reference genome is meaningless outside the context in which it will be used; it is a resource. Sunflowers are an important oilseed crop and also an important model for studying ecology and evolution. As such, I begin with introductions of the sunflower system and the domestication of the common sunflower before introducing the methods and resources I used to craft a resource for future research.

Darwin’s sketches of the phylogenetic relationship of species resemble the branching of a tree (Darwin 1859). As time progresses in the sketches, biodiversity increases via the division and differentiation of populations, eventually leading to speciation.
Edgar Anderson suspected that the topology of phylogenetic relationships could be more complex than this and closed his manuscript “Internal Factors Affecting Discontinuity between Species” with:

    I have taken asexual propagation, polyploidy series and physiological isolation as representatives of the internal factors which affect specific isolation and which whole genera or even families of plants may have in common. There must be many other such factors. May we not therefore logically expect that, even though species prove to be biological units, their relationships with each other and the relationships of individuals within species will vary from genus to genus and from family to family? (Anderson 1931)

This commenced his explorations of reticulate evolution (e.g. Anderson 1936, Anderson and Hubricht 1938). Reticulate evolution refers to a phylogenetic topology in which branches not only bifurcate, but also interweave and rejoin to form new branches. He first researched allopolyploid speciation, but would later develop the concept of (and write a book titled) Introgressive Hybridization (Anderson 1949). Introgressive hybridization is recombinant reticulate evolution, whereby some portion of the genome of one species is introduced to another’s via meiotic recombination of the two in a first generation hybrid, followed by backcrossing in later generations.

Around the time that he published these ideas, his student, Charlie Heiser, took an interest in hybridizing sunflowers (Heiser Jr 1947) and would go on to conduct a comprehensive inventory and key of Helianthus (Heiser et al. 1969), the sunflower genus. Morphometry and cytogenetics of the clade suggested widespread and ongoing hybridization resulting in polyploid speciation (Heiser and Smith 1954) and introgression (Heiser et al. 1962). In hindsight, it is clear that this collaboration gave birth to sunflower as a system in which to study reticulate evolution in the context of a variety of geographies (Renaut et al.
2013).

The genus includes approximately fifty species endemic to North America. The area covered by their combined ranges includes most of the geography bounded by the United States, the prairies of southern Canada, and northern and central Mexico, including Baja. Heiser split the genus into three sections: Annui (the annuals), Ciliares (western perennials), and Divaricati (eastern perennials). While several allopolyploid origins of perennial species have been documented, the majority of sunflower evolution and ecology research focuses on section Annui, comprised of approximately fourteen diploid species. The most ancestral node in the section dates to approximately two million years ago (Sambatti et al. 2012), splitting the Annuus group from the Petiolaris group. In addition to its namesake, H. annuus, the Annuus group includes: H. argophyllus, the silverleaf sunflower, endemic to Texas, H. bolanderi, a California sunflower colonizing serpentine soils, and H. winterii (Stebbins et al. 2013), a derived tree that has reverted to a perennial life history and grows dense woody stems. H. petiolaris, the prairie sunflower, is broadly sympatric with H. annuus. The Petiolaris group also includes H. debilis, split into several subspecies clustered in the southeast U.S., H. neglectus, and H. niveus, hypothesized to be the ancestral type (Beckstrom-Sternberg et al. 1991) of the annuals and divided into polyphyletic subspecies (Rieseberg et al. 1991). In general, some interspecific gene flow may be expected wherever annual sunflowers are sympatric (Yatabe et al. 2007, Kane et al. 2009, Scascitelli et al. 2010).

Helianthus annuus and H. petiolaris are the most broadly sympatric annual sunflower species. Mosaic hybrid zones (made up of first-generation (F1) crosses and various backcrossed generations) often form when these two species are in very close proximity (Rieseberg et al. 1998), and so it is perhaps unsurprising that they have some of the highest rates of gene flow documented in the clade.
Helianthus annuus and H. petiolaris are also notable as the progenitor species of three homoploid hybrid species: H. anomalus, H. deserticola, and H. paradoxus (Rieseberg 1991). Homoploid hybrid speciation is hybrid speciation without a change in chromosome number; the derived genomes, which stabilize after about 1,000 generations, are mosaic chimeras of the ancestral genomes (Buerkle and Rieseberg 2008). The ancestry of chromosomal tiles making up the mosaics matches the parental direction of quantitative traits segregating in synthetic interspecific crosses (Rieseberg et al. 2003), suggesting the possibility that they harbor multi-gene complexes coadapted to produce phenotypes matched to some small sub-niche of ecological space. The homoploid hybrid sunflowers inhabit extreme environments, for example, the salt marshes where H. paradoxus grows (Karrenberg et al. 2006). The transgressive phenotypes needed to survive in these extreme environments may be caused by positive epistatic interactions between the tiles or additive gene action.

Such hybridization is not only a historical process, but also occurs frequently in many places where interfertile species co-occur (e.g. Figure 1.1). In most cases, however, hybridization is rare and leads to only low levels of gene flow among species (Kane et al. 2009). Still, because of their extremely large effective population sizes, the widespread sunflower species such as H. annuus harbor genetic variation derived from introgression from even quite distant lineages.

Given all this hybridization and gene flow, a critical reader might ask if the sunflower species named above qualify as separate species at all. This is a legitimate question, but it is important to remember that speciation is, more often than not, a gradual process that occurs through time (Schluter 2009, Feder et al. 2012) (N.B.
a hybrid fern, albeit reproducing asexually, has recently been found to be the product of an intergeneric cross of lineages separated by 60 million years (Rothfels et al. 2015)). That sunflower species lie within various ranges of the speciation continuum is what makes them so useful for studying the process. Examples of the early stages of speciation may be found in sunflowers (e.g. dune populations of H. petiolaris diverging from those living on the nearby sandsheet (Andrew et al. 2013)), but interfertility between the named species is quite low; the fertility of interspecific crosses is usually less than 5% (Chandler et al. 1986). Much of this reproductive barrier may be attributed to chromosomal rearrangements (Burke et al. 2004). Sunflowers have some of the highest known rates of chromosomal evolution, a factor likely contributing to their rapid diversification (Barb et al. 2014).

The sunflower may be regarded as a natural compilospecies (Harlan and De Wet 1963); adaptive introgressive hybridization with related species has facilitated the expansion of its range over a variety of soils and climates.

Not only is sunflower an important model system for studying evolution and ecology, but it is also an important crop. The common sunflower (Helianthus annuus macrocarpus) was domesticated approximately 5,000 years ago in the area of what is now Tennessee (Blackman et al. 2011). Native Americans selected for increased head size and lack of shattering in their crop, and the farming practice spread via social transmission (Harter et al. 2004). They mostly used sunflowers as food, but the Hopi also developed a second line high in anthocyanin that is still used to produce dye (Heiser 1951).

Sunflowers were introduced to Europe in the 16th century by Spanish explorers returning from the new world. They first became popular there as a horticultural novelty that was easy to care for and grew larger than a child in a single year. By the 17th century, they had reached Russia.
There, sunflower became popular in part because the Russian Orthodox Church did not include it in the list of fats that could not be eaten during Lent. Russian breeders selected for larger seed size and higher oil content, establishing it as an oilseed crop. Germplasm resulting from their efforts returned to North America in the late 1800s and became the primary stock from which most modern elite lines are derived (Blamey et al. 1997).

Pedigrees kept by breeders indicate that synthetic introgression of alleles from wild relatives has been used to improve the germplasm of elite lines several times (Rieseberg and Seiler 1990). Modern genome scans confirm this, for example: reintroduction of the branching allele from H. annuus ssp. texanus, downy mildew resistance from H. argophyllus, and cytoplasm from H. petiolaris (Baute et al. 2015, Dussle et al. 2004, Horn et al. 1991). Investigation of the ecological niches (as modeled by bioclimatic variables) that sunflowers occupy suggests that current elite germplasm may be grown in conditions covering less than half the variance that their wild relatives occupy (M. Kantar, pers. comm.). The vast genetic diversity present in wild relatives of cultivated sunflower (Seiler 1992, Mandel et al. 2011, Hodgins et al. 2014) will continue to be an important resource in crop breeding, with ongoing efforts to breed lines resistant to drought, flood, salt, and parasites (Rauf 2008, Wan et al. 2013, Ahmed et al. 2013, Seiler and Jan 2014). These projects are helping to ensure that humans’ nutritional requirements will be met in the face of global climate change (McCouch et al. 2013, Dempewolf et al. 2014).

The domestic sunflower’s nuclear genome is estimated to contain approximately 3.6 billion base pairs (bp) (Baack et al. 2005) with a guanosine + cytosine (G+C) content of 40%.
Karyotype analyses report seventeen chromosome pairs. Generally, thirteen of these are categorized as meta- or submeta-centric and four as acrocentric with a total of three nucleolus-organizing regions (NORs) (Feng et al. 2013). The genome is highly redundant: approximately 80% of the DNA is retrotransposon sequence. Most of this derives from recent proliferation of the Ty-3/Gypsy type (Staton et al. 2012). Approximately half of the genome consists of a single Ty-3/Gypsy element less than 10 kbp in length with an average pairwise divergence of 1% between copies. Additional redundancy in the gene space has been attributed to a number of paleopolyploidy events (Barker et al. 2008). Line HA412HO (Miller et al. 2006) was chosen for sequencing because it is highly inbred.

Cooperation and communication between evolutionary biologists and plant breeders expedite the practical application of pure science. The goal of my work, described below, is to facilitate knowledge synthesis by providing a common axis for all sunflower researchers. To do so, I produced an ultra-high density genetic map, assembled a genome de novo from short reads, and integrated these with a physical map of the genome. All three information sources were necessary to complete this reference genome. The product of DNA sequencing is millions or billions of very short reads. I assembled these into hundreds of thousands of contiguous sequences. I scaffolded these into tens of thousands of sequences using a physical map and anchored them to chromosomes with a genetic map. This furthers the work of several collaborators (Kane et al. 2011). Here I mainly describe my contributions to generating a reference sequence for sunflower, but I also include brief summaries of work by others as needed for context.

Figure 1.1: H. annuus is interfertile with nine other species of sunflower. The fill color of each polygon indicates the number of species interfertile with H. annuus found in the area.
North American occurrence records for the nine species were downloaded from The Global Biodiversity Information Facility (GBIF) (Lane 2003). Polygons are defined by a tessellation (Dirichlet 1850) around points generated from a model of sunflower seed packing (Vogel 1979).

Chapter 2
Ultra-high Density Genetic Map

2.1 Plant Material and Construction of Mapping Population

The sunflower reference mapping population is derived from a cross between Helianthus annuus cultivars RHA280 and RHA801. RHA280 was first registered in 1974 (Fick et al. 1974) and is derived from the open-pollinated Sundak germplasm. A typical confectionery line, it produces large black seeds with white stripes containing relatively low oil concentration. It is also a fertility restorer of male-sterile cytoplasm, midseason maturing, and rust-resistant. RHA801 was first registered in 1981 (Roath et al. 1981) and is derived from a population of lines RHA271, RHA273, RHA274, R344, and R494 after selection for improved yield and three generations of selfing. RHA801 is a dominant fertility restorer and has moderate rust resistance. It is also resistant to Verticillium wilt and downy mildew. RHA801 is a high-oil cultivar with a single apical inflorescence.

Coancestry analysis based on pedigree indicates that the confectionery restorer lines and oilseed restorer lines are highly inbred within each group with strong separation between groups (Cheres and Knapp 1998). RHA280 and RHA801 adhere to this pattern. Principal component analysis (PCA) of simple sequence repeat (SSR) markers revealed RHA280 as very dissimilar to other elite lines, and especially distant from RHA801 (Yu et al. 2002). The cross is thus ideal for generating a highly polymorphic mapping population.

The mapping population began with hand emasculation of RHA280, followed by pollination with RHA801 to produce an F1 (Tang et al. 2002). F1 seeds were then grown to begin the generation of the recombinant inbred lines (RILs).
Each RIL lineage is of single-seed descent from this F1 (Figure 2.2). Self-pollination of the RILs was carried out for seven generations in summer and winter nurseries and/or greenhouses located in Corvallis, Oregon and Balcarce, Argentina between 1995 and 1998. The RIL population segregates for apical branching as well as several seed traits, including: hull pigment, seed oil concentration, overall seed weight, and seed dimensions (Tang et al. 2006).

2.2 Sequencing and Genotyping

Whole genome shotgun sequencing was carried out with 100 base pair paired-end Illumina reads at Genome Quebec in Montreal, Canada. One lane of Illumina sequence was generated for each parent. 172,086,364 read pairs were generated for RHA280, for a total of 34,417,272,800 sequenced bases. 160,718,566 read pairs were generated for RHA801, for a total of 32,143,713,200 sequenced bases. We sequenced a total of 96 RILs to low depth. Eight lanes were each multiplexed with twelve barcoded RILs. As coverage was a little lower than we expected for some samples, an additional lane was sequenced. The ninth lane of RIL sequencing included the samples with the lowest count for each barcode tag, except for Index 8. In all, the number of bases obtained for the RILs ranged from 1,859,549,200 to 6,971,326,000 with a mean of 3,692,104,758 and standard deviation of 881,568,716. Assuming a genome size of 3.6 Gbp, RHA280 and RHA801 were sequenced to a depth of coverage of approximately 9.6x and 8.9x, respectively, with the average depth of coverage obtained for the RILs approximately 1.0x (Figure 2.3).

I aligned parental reads to our draft reference assembly using the Burrows-Wheeler Aligner (BWA) (Li and Durbin 2009) and called genotypes using SAMtools mpileup (Li et al. 2009). I used fixed Single Nucleotide Polymorphisms (SNPs) with a genotype quality more than 20 and a mapping quality more than 30 on the Phred scale (Ewing and Green 1998) as candidate sites for calling genotype blocks in the RILs.
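
The base totals and depths of coverage reported in this section follow from simple arithmetic, and the Phred thresholds translate directly into error probabilities. The following is a minimal sketch for checking those figures; the function and variable names are illustrative assumptions and are not taken from the code used in the project.

```python
# Sanity check of the sequencing figures quoted in Section 2.2.
# All names below are hypothetical helpers, not the project's own code.

GENOME_SIZE_BP = 3_600_000_000  # genome size assumed in the text (3.6 Gbp)
READ_LENGTH_BP = 100            # paired-end read length reported in the text


def total_bases(read_pairs, read_length=READ_LENGTH_BP):
    """Bases sequenced: each read pair contributes two reads."""
    return read_pairs * 2 * read_length


def depth_of_coverage(bases, genome_size=GENOME_SIZE_BP):
    """Average depth of coverage: sequenced bases per genome base."""
    return bases / genome_size


def phred_to_error_probability(q):
    """Phred scale: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


parent_read_pairs = {"RHA280": 172_086_364, "RHA801": 160_718_566}
for line, pairs in parent_read_pairs.items():
    bases = total_bases(pairs)
    print(f"{line}: {bases:,} bases, {depth_of_coverage(bases):.1f}x")

ril_mean_bases = 3_692_104_758
print(f"RIL mean: {depth_of_coverage(ril_mean_bases):.1f}x")

# Thresholds used for candidate SNPs in the parents.
for q in (20, 30):
    print(f"Q{q}: error probability {phred_to_error_probability(q):g}")
```

Under these definitions, a genotype quality above 20 corresponds to an error probability below 1 in 100, and a mapping quality above 30 to an error probability below 1 in 1,000.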

The RIL reads were aligned to our draft reference assembly using BWA. I used SAMtools to convert the alignments to the pileup format. As the RILs were sequenced to very low coverage, I did not apply the strict quality cut-offs used for the parental reads to them. Instead, I called each candidate site as inherited from either or both parents based on the presence of fewer than three aligned reads and used a quality-control heuristic later in the process (Section 2.3). In all, I identified 2,726,257 SNPs on 273,422 contigs.

2.3 Construction of Genetic Map

In each individual, I then called genomic contigs as descended from one or the other parent based on the presence of at least nine genotype calls at candidate sites. As no quality filters were applied at the read level, I also required at least 90% of the genotype calls to indicate descent from the same parent. I used this cut-off to allow contigs containing small repetitive regions (potentially attracting reads from distant loci), distal recombination breakpoints (allele switching at the contig ends), or small regions of gene conversion (allele switching internal to the contig) to be mapped. I used contigs meeting these requirements in at least 75% of the RILs and with a minor allele frequency greater than 30% as map markers. I used MSTmap (Wu et al. 2008) to order the markers linearly. MSTmap groups markers based on the minimum sum of recombination events (Hamming 1950) between their segregation patterns and divides them into linkage groups if the sum is significantly different than observed across all markers. MSTmap then orders markers on each linkage group using a recursive minimum spanning tree algorithm. I calculated the map distance between adjacent pairs of markers that were ordered by MSTmap with Kosambi’s mapping function (Kosambi 1943) (Figure 2.1) (N.B. John E.
Bowers later pointed out to me that a mapping function is not needed for a saturated genetic map).

A template map of 1629 mapped bins was curated from the map generated by MSTmap. The initial template map was manually curated by collaborator John E. Bowers based on results of testing all SNPs and contigs to fill in gaps that may have been missed with the initial 4200 contigs. Each bin represented all loci that showed an identical segregation pattern for 93 RILs. Plants representing three RILs (RIL10, RIL46, and RIL255) appeared to be highly heterozygous and showed an excessive number of apparent recombinations. These plants were assumed to represent outcrosses and were not used in the map. The apparent number of recombinations on the three excluded lines ranged from four to ten times the number seen on the other 93 lines. I suspect a bee or some other insect may have contaminated these lines with non-self pollen. The template map contains 2531 recombination events and is 1361 centimorgans long.

A primary goal of constructing the genetic map was to anchor de novo assembled contigs to chromosomes. I compared all contigs containing segregating SNPs to the template map. Comparisons were made in forward and reverse order and the best match was stored for each direction. A contig was placed with an upper distance of the best forward match and a lower distance of the best reverse match if both were found on the same linkage group. This allowed me to anchor contigs to chromosomes even if they did not contain complete segregation patterns or if they contained some level of error in genotyping. A total of 243,048 contigs were placed to an accuracy of 5 centiMorgans (cM) (Figure 2.1).

2.4 Consensus with Other Genetic Maps

This ultra-high density genetic map was just the most recent of several constructed using the core mapping population of sunflower. The prior map of highest density (Bowers et al.
2012) used RHA280 x RHA801 markers genotyped with a 10,640 SNP Infinium array that we developed in collaboration with Advanta Seeds, Dow Agrosciences, Syngenta AG, and Pioneer Hi-Bred. The array’s probe sequences were matched by BLAST (Altschul et al. 1997) to the contigs in the sunflower assembly. The cM positions on the two maps were compared. The two maps agreed very well in terms of synteny and ordering even though they were constructed completely independently (Figure 2.4). The chromosomes from the sequence-based map were then named and oriented relative to the previous literature.

Figure 2.1: Genetic map for H. annuus. Interior radii show the segregation of chromosome segments in ninety-three RILs. Black segments indicate RHA280 ancestry and white segments RHA801 ancestry, with transitions marking locations of chromosomal crossover. The genetic map is drawn along the outer two sets of radii. The ray length (yellow) is proportional to the sum of de novo base pairs assigned to 1 cM bins.

Figure 2.2: Illustration of RIL crossing design employed for making the RHA280 x RHA801 genetic map for H. annuus. (Courtesy of Kasia Stepien)

Figure 2.3: Box plot showing distribution of sequencing depth for 93 RILs. The sunflower’s genome size is estimated to be 3.6 Gbp. The RILs were sequenced to approximately 1x depth.

Figure 2.4: Comparison of synteny between Illumina Infinium SNP array and Whole Genome Shotgun sequence-based map of RHA280 x RHA801 RILs. Approximately 90% of hits are in 17 syntenic blocks. The roughly 10% of non-syntenic hits can be explained by picking the second best hit if the true homolog is fragmented into several contigs, or if the sequence is multi-copy.

Chapter 3

Genome Assembly

3.1 Nuclear Genome

3.1.1 Preparation and Sequencing of DNA Libraries

Two sequencing technologies were dominant during the life of the project: Roche 454 (Margulies et al. 2005) and Illumina (Bentley 2006). Both of these produce reads of DNA fragments about 500bp in length.
Mate-pair libraries (Van Nieuwerburgh et al. 2011) were prepared in order to achieve paired reads separated by up to about 20Kbp. I will briefly describe the properties of each technology and method of library preparation, as the unique properties affect the choice of appropriate algorithms used to assemble them.

454 reads are generated as follows (Rothberg and Leamon 2008). Organismic DNA is fractionated. Oligonucleotide adapters are attached to denatured fragments. The fragments are diluted in an emulsion along with beads that the adapters bind to. The emulsion is prepared such that one fragment of organismic DNA and one bead lie within a drop of oil. A polymerase chain reaction (PCR) occurs within the drop of oil so that many single-stranded copies of the original DNA molecule exist within it, hybridized to the beads via the adapter sequence. The beads are then drawn into wells etched in a fiber optic plate.

A solution of nucleotides, polymerase, sulfurylase, and luciferase is added to the wells. Polymerization of the complementary strand adds a nucleotide to the strand and releases pyrophosphate. The pyrophosphate is converted to adenosine triphosphate (ATP) by the sulfurylase. ATP and luciferin are converted to oxyluciferin by the luciferase, emitting a photon. A digital camera captures the photon emissions from the wells. This process is repeated for each nucleotide, with washes in between, to make one cycle. The template sequence contained in each well may be inferred by determining which nucleotide addition in the cycle caused photon emissions from the well to be captured by the camera.

Illumina reads are generated via methods fundamentally similar to 454 sequencing (Metzker 2010). Two sequence primer templates are ligated to DNA fragments; each end receives a different primer template sequence with an adapter, either complementary or the same as that of the flow cell, at its extremity.
The molecules are placed directly on a flow cell and polymerized via bridge amplification. Oligonucleotide primers complementary to one of the templates are added to the flow cell, initiating polymerization. The different nucleotides are added to the flow cell in solution together. They are engineered such that a specific fluorescent label is bonded to the base. Additionally, an azide-containing blocking group (rather than a hydroxyl) is bonded at carbon 3 of the pentose, preventing further polymerization. A digital camera records the fluorescence as the flow cell is excited with a laser. The labels are cleaved and the blocking group is replaced with a hydroxyl, completing one cycle. This is repeated for one hundred cycles. The second primer oligonucleotides are then added to the flow cell, and the process is repeated.

While 454 and Illumina technologies are fundamentally similar, there are two differences with significant consequences. One is the blocking group on the pentose that is later replaced by a hydroxyl (termed a reversible terminator) (Bentley et al. 2008). While nucleotides are added individually in the 454 process, it is still possible for more than one to be added if the present region of the template is a homopolymer. The intensity of the luminescence is used to estimate the homopolymer length, but the estimation is not precise enough to determine the exact length of long homopolymers. The other major difference of the Illumina process is that both strands of DNA are sequenced, each from the opposite end’s primer. This gives paired-end reads.

Reconstructing the contiguous sequence of a genome from reads of this size is impossible if the genome contains repeated sequence longer than the longest read. Mate-pair libraries help overcome this limitation. Mate-pair library preparation begins by size-selecting DNA fragments ranging in length from about 2,000bp to about 20,000bp.
The fragments are then circularized (via biotinylation for Illumina libraries, or, for 454 libraries, a 42-44bp linker sequence). The circular molecules are then broken on either side of the join to give a small fragment of DNA containing sequence from the extremities of the original long fragment. These fragments are then sequenced via the aforementioned methods.

Fragmentation of the circular molecules does not always occur on either side of the link. 454 mate pairs are easily identified as the linker sequence is wholly contained within the read, flanked by each mate. Illumina mate pair libraries include reads representing proper mate pairs (if the link was located at a distance from either end that is greater than the read length), typical short-fragment paired-end reads (if the fragmentation did not include the link), and chimeric reads (if the link was located at a distance from either end that is less than the read length). As Illumina mate pair library preparation does not include a linker sequence, proper mate pairs must be isolated via the removal of improper pairs. This can be accomplished by aligning them to a draft assembly from which they were excluded.

We prepared five short-fragment libraries and seventeen mate-pair libraries, totaling approximately 60x depth of coverage, for sequencing on the Illumina platform. Seven short-fragment libraries and fourteen mate-pair libraries, totaling approximately 24x depth of coverage, were sequenced using 454 technology. Library statistics are summarized in Table 3.1 and Table 3.2.

3.1.2 Allpaths-LG Assembly

The de Bruijn graph has emerged as the most popular method for assembling high-volume, short read sequencing data with a low error rate (Illumina reads) (Zerbino and Birney 2008). Most assemblers based on the de Bruijn graph divide the assembly process into three general steps: error correction of the reads, contig construction using the graph, and scaffolding.

Error correction (Kelley et al.
2010) involves first tiling the reads into short (e.g. 25bp) words of length k, or k-mers, while keeping track of how often they appear (multiplicity). Whole genome shotgun data involves the random shearing and sequencing of organismic DNA. The frequency distribution of sampling multiplicities of k-mers that are unique in the genome is expected to be Gaussian (Galton 1894) and centered on the mean depth of coverage. In practice, repetitive or polymorphic k-mers affect the distribution, but this is not relevant for correcting errors. Errors in high quality reads are rare, introducing a frequency spike of low multiplicities in the distribution. In other words, we may expect to find many errors in billions of reads, but it is unlikely to find the same error many times. The frequency minimum between this spike and the Gaussian peak suggests a multiplicity cut-off between suspicious and trusted k-mers. A second tiling pass is made over the reads. Erroneous bases are identified as present in untrusted k-mers, flanked by trusted k-mers. If trusted k-mers with a low edit distance ( 1) from the untrusted k-mers can be found, the error can be corrected. Otherwise, the read may be discarded.

Contig construction (Compeau et al. 2011) involves tiling the corrected reads into k-mers. In this phase of the assembly process, it is important for the k-mers to be unique in the genome, and so a larger value of k is used. A directed graph is built as the reads are tiled. Each k-mer is an edge in the graph connecting the k-1 word at the start of the tile to the k-1 word ending the tile. The full graph is explored in parallel. Each exploration is constrained such that it is only allowed to visit an edge once. Under this constraint, if every node has the same number of edges entering it as leaving it (i.e. it is balanced), an exploration will end at the same node at which it began. The explorations are combined to form a path that visits each edge once.
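The tiling, multiplicity filter, and graph construction described so far can be sketched in a few lines of Python. This is a toy illustration rather than the approach of any particular assembler: the fixed `min_count` cut-off stands in for the frequency minimum between the error spike and the coverage peak, and all function names are hypothetical.

```python
from collections import Counter, defaultdict


def kmer_counts(reads, k):
    """Tile every read into k-mers and record each k-mer's multiplicity."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def trusted_kmers(counts, min_count):
    """Keep k-mers seen at least min_count times (assumed cut-off)."""
    return {kmer for kmer, n in counts.items() if n >= min_count}


def de_bruijn(kmers):
    """Each k-mer is an edge from its (k-1)-prefix to its (k-1)-suffix."""
    out_edges = defaultdict(list)
    in_degree = Counter()
    for kmer in sorted(kmers):
        prefix, suffix = kmer[:-1], kmer[1:]
        out_edges[prefix].append(suffix)
        in_degree[suffix] += 1
    return out_edges, in_degree


def unbalanced_nodes(out_edges, in_degree):
    """Nodes whose number of entering edges differs from leaving edges."""
    nodes = set(out_edges) | set(in_degree)
    return sorted(
        n for n in nodes
        if len(out_edges.get(n, [])) != in_degree.get(n, 0)
    )


# Two error-free reads and one read carrying a single substitution.
reads = ["ACGTT", "ACGTT", "ACGAT"]
counts = kmer_counts(reads, k=3)
trusted = trusted_kmers(counts, min_count=2)
out_edges, in_degree = de_bruijn(trusted)
ends = unbalanced_nodes(out_edges, in_degree)
```

In this toy case the k-mers introduced by the substitution occur only once and are dropped, and the only unbalanced nodes are the two ends of the linear sequence.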
The first nucleotide of each edge is added to a growing sequence as this path is traversed, giving the genome sequence. In practice, genomes may contain true repeats that are much longer than values of k suitable for use with short reads. Consequently, some nodes may be unbalanced. The number of edges entering the first node of a long repeat will be equal to the number of biological copies, but the number of exiting edges may be just one. The resulting final path is thus no longer linear, and must therefore be broken to give several contiguous sequences, or contigs, rather than one. Mate pairs are then used to rejoin the contigs where it is possible to do so unambiguously.

Access to a high performance computer and the sequencing of new sunflower DNA libraries recently provided me with the opportunity to use the Broad’s AllPaths-LG genome assembler (Gnerre et al. 2011). AllPaths-LG estimated the sunflower genome size to be 3.151 Gb with a GC content of 39.5% and 77% present as repetitive sequence. The final assembly size was 1.154 Gb in 99,439 scaffolds. The assembler incorporates strict sequence quality control including: cleaning reads of sequencing artifacts, trimming of low-quality sequence, base quality score normalization, and removal of low frequency k-mers. The fraction of reads used from each library ranged from 13.1% to 39.9%. This filtering brought the genome sequence coverage to 40.1x in fragment libraries and 10.2x in jumping libraries. The suggested coverage is 45x for both required library classes.

Increases in library coverage, insert size, and diversity are expected to improve the performance of AllPaths-LG. The library insert sizes obtained for the sunflower differed from the sequencing model proposed by the assembler’s authors. The inserts for the fragment libraries are recommended to be about 1.8 times the read length; those obtained ranged from 1.22 to 1.47 times the read length.
The recommended long jumping library insert size is 6,000 bp; the longest library obtained had a mean insert size of 4,447 bp.

The French National Institute for Agricultural Research (INRA) provided use of their GenoBigMem server to compute the assembly. The server has approximately 1 TB of available physical RAM and 32 CPUs. AllPaths-LG completed the assembly in 206.47 hours, with a peak memory usage of 913.51 GB and an effective parallelization factor of 15.36. The assembler estimated the memory required for each module. If the required memory exceeded the available memory, the module was divided into a number of passes. This suggests that the assembler could be used on a computer with lesser resources.

3.1.3 Celera Assembly

Although not part of my thesis work, the Roche 454 reads were assembled by former postdoc Nolan Kane using the Celera Genome Assembler (CABOG) (Miller et al. 2008). It is an overlap-layout-consensus assembler. The best assembly had an N50 of 25kb, and a total assembly length of 3.1 Gb.

3.1.4 Merge of Allpaths-LG and Celera Assemblies

Because many of the Allpaths scaffolds were not found in the Celera assembly (and vice versa), postdoc Sariel Hubner employed the computer program Minimus2 (Sommer et al. 2007) to merge the two assemblies. To reduce the complexity of the merger (and to minimize false merges), scaffolds assigned to each linkage group were merged independently. Next, he employed the computer program SSPACE (Boetzer et al. 2011) to increase scaffold lengths with new long mate pair Illumina libraries (20 and 40 kb) and bacterial artificial chromosome (BAC)-end sequences. Again, this was done for each linkage group independently to reduce the likelihood of generating chimeric scaffolds. The scaffolding resulted in a total of 155,000 scaffolds, which were used to generate pseudomolecules (Chapter 4).

3.2 Mitochondrial Genome

Leaf tissue from ten-day-old HA412 seedlings was enriched for mitochondria by centrifugation.
DNA was extracted from the enriched tissue, barcoded, and sequenced on 1/48th of an Illumina lane, producing 2,727,097 pairs of 101 bp reads. Reads were quality trimmed and cleaned of sequencing artifacts using Trimmomatic (Bolger et al. 2014).

Reads with exact matches of at least 50bp to the chloroplast genome and their mates were removed from the dataset. SOAPdenovo (Luo et al. 2012) was used to assemble the reads, producing an assembly 387,493 bp in length with an N50 of 562 bp and an N90 of 11,390 bp. Next, reads from the mate pair libraries prepared for the AllPaths-LG assembly with exact matches of at least 50 bp to this assembly without matches to the chloroplast genome were added to the scaffolding steps of a second SOAPdenovo assembly. This produced an assembly 466,799 bp in length with an N50 of 500 bp and an N90 of 46,247 bp. Some scaffolds in both assemblies are of nuclear origin and were identified based on coverage.

Scaffolds from the second assembly were digested in silico using the recognition sites of these restriction enzymes: PstI, SalI, KpnI, BglI, BstEII, and SacI. Each scaffold’s restriction enzyme cut site sequence was aligned to the sequence of cut sites of a previously published fragment map (Figures 1 and 6 of Siculella and Palmer 1988) using the Smith-Waterman algorithm (Smith and Waterman 1981). Alignments were confirmed by comparing the order and size of digested fragments in the region. Agreement of fragment sizes between the chemical and computational digests was very good: usually within 100 bp (Figure 3.1).

Gaps within and between scaffolds were filled using long (typically 500-1000bp) 454 reads. Reads with exact matches to sequence on both sides of a gap were found using the UNIX grep (Kernighan and Mashey 1979) command and aligned to the super scaffold by hand. The same procedure was used to close the 300,945bp master circle.
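The in silico digestion step can be illustrated with a small Python sketch. It is a simplified stand-in, not the procedure actually used here: it treats each scaffold as linear, records the start of each recognition site rather than the exact cleavage offset within it, and uses the standard recognition sequences of the six enzymes named above (N is any base). Function names are hypothetical.

```python
import re

# Standard recognition sequences; N matches any base.
RECOGNITION_SITES = {
    "PstI": "CTGCAG",
    "SalI": "GTCGAC",
    "KpnI": "GGTACC",
    "BglI": "GCCNNNNNGGC",
    "BstEII": "GGTNACC",
    "SacI": "GAGCTC",
}


def site_pattern(site):
    """Compile a site into a regex; the lookahead also finds overlapping sites."""
    return re.compile("(?=(" + site.replace("N", "[ACGT]") + "))")


def cut_positions(sequence, enzymes):
    """Sorted 0-based start positions of all recognition sites (simplification)."""
    sequence = sequence.upper()
    positions = set()
    for enzyme in enzymes:
        pattern = site_pattern(RECOGNITION_SITES[enzyme])
        positions.update(m.start() for m in pattern.finditer(sequence))
    return sorted(positions)


def fragment_lengths(sequence, enzymes):
    """Fragment lengths of a linear molecule cut at every site start."""
    bounds = [0] + cut_positions(sequence, enzymes) + [len(sequence)]
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]


toy = "AAAA" + "CTGCAG" + "TTTTT" + "GAGCTC" + "GG"
toy_cuts = cut_positions(toy, ["PstI", "SacI"])
toy_fragments = fragment_lengths(toy, ["PstI", "SacI"])
```

The ordered list of fragment lengths produced this way is the kind of sequence that could then be compared against a published fragment map.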
Raw reads were aligned to the reference and a few substitution and small indel errors were fixed by hand.

I annotated the mitochondrial genome by hand and using the software Mitofy (Alverson et al. 2010). I searched for open reading frames, using the National Center for Biotechnology Information’s (NCBI) online BLAST aligner to identify the genes based on homology, and used the software Mitofy to identify transfer and ribosomal ribonucleic acid (RNA) sequences. I identified repetitive regions by aligning the finished reference to itself using BLAST. To further verify that the assembly was correct, I aligned an independent Whole Genome Shotgun (WGS) library to it using BWA and inspected the alignments by eye. As expected, I found coverage spikes at regions with high homology to the plastid genome. I found drops in coverage (although never to a depth of less than thirty reads) near the boundaries of the large repeat copies. This supports the hypothesis that the sunflower’s mitochondrial genome is typically arranged as two equimolar subcircle chromosomes, each containing one of the two repeat copies found in the master replication circle (Siculella and Palmer 1988). I also aligned a sequenced RNA library to the reference. The alignment coordinates with the highest depth of coverage overlapped with gene coordinates. Some unannotated intergenic regions also attracted alignments at low coverage, which suggests the possibility that they may have some functional role.

The mitochondrial genome was submitted to GenBank and is publicly available as NCBI Reference Sequence: NC_023337. Protein-coding features are summarized in Table 3.3; transfer RNA (tRNA), ribosomal RNA (rRNA), and structural features are summarized in Table 3.4; all features are plotted in Figure 3.2.

Figure 3.1: Comparison of restriction fragment maps. Dark grey segments show an in silico digestion of sunflower mitochondrial genome NCBI Reference Sequence: NC_023337.
Light grey segments show fragment lengths of the enzyme digestion reported by Siculella and Palmer 1988.

Figure 3.2: Gene and repeat map of sunflower mitochondrial genome NCBI Reference Sequence: NC_023337. Genic loci of Table 3.3 are indicated by black rectangles. Dark green ribbons attach to repeats. Light green rays show alignment depth of a WGS library to the reference mitochondrial genome. Note the coverage spikes, which indicate regions of high homology to the plastid genome. Drops in coverage near the large repeat boundary suggest that the cellular molarity of the mitochondrial genome’s master replication circle is lower than the alternative configuration of two sub-circles. Yellow rays show alignment depth of coverage of an RNAseq library.

Table 3.1: Illumina Reads: Fragment Sizes

Type       Library                      Mean (bp)  Std. Dev. (bp)
PairedEnd  A1                           136        23
           A2                           139        28
           A5                           160        34
           200bp HA0001                 192        21
           500bp HA0002                 408        46
MatePair   2kbp HA0003 61Y EJAAXX 1     1510       321
           MP1                          2062       1744
           MP2.BD0T EHACXX 3            2451       295
           MP3.AC0C9VACXX 4             2500       272
           LBM11326 GFI-529 3kb LJD     2550       760
           MP4.BD0T EHACXX 5            3320       443
           HA412 GGCTAC 40kb LJD        3458       1910
           MP5                          3848       339
           5kbp HA0004 626E6AAXX 5      4418       846
           MP6.BD0T EHACXX 7            4653       468
           INX517*                      4394       321
           LBM11326 GFI-546 40kb LJD    5084       3642
           INX518*                      5286       2016
           LBM11325 GFI-530 8kb LJD     7114       1090
           LBM CAGATC 8kb.LJD           7132       1057
           LBM GATCAG 20kb.LJD          13887      5153
           LBM1481 GFI-531 20kb LJD     16863      4578

Table 3.2: Roche 454 Reads: Fragment Sizes

Type           Library                 Mean (bp)  Std. Dev. (bp)
ShortFragment  01V 17GRL2              368        133
               MPS004761454RL          373        124
               MPS006655454RL          383        120
               01V 17G454RL            384        136
               MPS004762454RL          394        125
               HA412Long               592        169
               MAY ha412long           648        205
MatePair       01V 17G454PE1           2890       485
               01V 17G454PE2           2929       515
               MPS008920454PE55kb      3259       1148
               MPS008921454PE6kb       3517       1412
               MPS006655454PE20Kb      4491       3390
               MPS004761454PE38kb      7272       1060
               MPS008922454PE8kb       7584       1360
               MPS004761454PE210kb     7897       1146
               MPS008923454PE10kb      10041      1587
               MPS004761454PE10kb      10441      1932
               MPS004761454PE15kb      11157      5060
               MPS009917454PE20kb      12507      4667
               MPS008924454PE20kb      12955      4944
               MPS009918454PE20kb      13463
5190

Table 3.3: Sunflower Mitochondrial Genome Protein-Coding Features.

Class  Start   End     Strand  Gene    Product
CDS    16027   15284   -       ccmC    cytochrome c biogenesis C
       28498   27923   -       atp4    ATPase subunit 4
       28950   28678   -       nad4L   NADH dehydrogenase subunit 4L
       36771   37250   +       atp8    ATPase subunit 8
       37820   38617   +       coxIII  cytochrome c oxidase subunit
       43497   42934   -       rpl5    ribosomal protein L5
       66603   67223   +       ccmB    cytochrome c biogenesis B
       68019   67531   -       rpl10   ribosomal protein L10
       106128  107822  +       coxI    cytochrome c oxidase subunit
       112934  111735  -       nad5    NADH dehydrogenase subunit 5
       114601  114341  -       atp9    ATPase subunit 9
       122115  123110  +       rps4    ribosomal protein S4
       149093  149443  +       rps13   ribosomal protein L13
       169722  168793  -       nad6    NADH dehydrogenase subunit 6
       188450  189643  +       cob     apocytochrome B
       201645  200761  -       ccmFc   cytochrome c biogenesis FC
       202665  201790  -       orf873  hypothetical protein
       204362  202830  -       atp1    ATPase subunit 1
       215079  213361  -       ccmFn   cytochrome c biogenesis FN
       228434  230110  +       ccmFn   cytochrome c biogenesis FN
       230001  230516  +       rpl16   ribosomal protein L16
       251892  249925  -       matR    maturase
       254008  254364  +       nad3    NADH dehydrogenase subunit 3
       254416  254793  +       rps12   ribosomal protein L12
       260202  260774  +       nad9    NADH dehydrogenase subunit 9
       269075  269980  +       atp6    ATPase subunit 6

Table 3.4: Sunflower Mitochondrial Genome RNA and Structural Features.

Class   Start   End     Strand  Gene   Product
tRNA    5785    5703    -       trnY   tRNA-Tyr
        6659    6588    -       trnN   tRNA-Asn
        8761    8691    -       trnC   tRNA-Cys
        51553   51626   +       trnD   tRNA-Asp
        75504   75585   +       trnM   tRNA-Met
        79517   79589   +       trnG   tRNA-Gly
        82782   82853   +       trnQ   tRNA-Gln
        87906   87833   -       trnH   tRNA-His
        89834   89905   +       trnE   tRNA-Glu
        64558   64486   -       trnK   tRNA-Lys
        170075  170001  -       trnP   tRNA-Pro
        170454  170381  -       trnF   tRNA-Phe
        170923  170836  -       trnS   tRNA-Ser
        261753  261826  +       trnW   tRNA-Trp
        300889  300817  -       trnK   tRNA-Lys
rRNA    128775  132510  +       rrn26  26S ribosomal RNA
        139908  140023  +       rrn5   5S ribosomal RNA
        140166  142111  +       rrn18  18S ribosomal RNA
repeat  51682   64614   n/a     n/a    large repeat copy 1
        288012  300945  n/a     n/a    large repeat copy 2

Chapter 4

Pseudomolecules

4.1 What is
a Pseudomolecule?

The utilities of a reference genome include: representing the entire genome of one representative individual from a species, providing a common axis for inter-study comparisons, and contextualizing loci. The traditional representation of genetic sequence is as text, with individual letters corresponding to individual bases. An ideal reference genome would include a contiguous sequence of letters for each chromosome of the organism. These sequences are often referred to as pseudomolecules. In practice, many factors affect how closely a set of reference pseudomolecules matches the ideal. These factors include: genome content and degree of repetition, sequencing read length, and other positional information, such as genetic and physical maps.

The genome of sunflower line HA412HO was sequenced with high volume, short read technologies and assembled with algorithms described in a previous section. Many genetic maps have been created for sunflower, including the ultra-high density genetic map described in a previous section. A single physical map, developed by postdoc Navdeep Gill using Keygene’s sequence-based BAC fingerprinting approach (Van 2011), was available during the span of this project. Our pseudomolecules are a synthetic amalgamation of these resources. We also applied quality control steps to remove technical artifacts of the sequencing and assembly process.

While we had a variety of high-quality resources available, our final pseudomolecules contain many gaps. The factor limiting the achievement of our goal of producing a highly contiguous reference is the sunflower genome’s biology; it is highly repetitive. Approximately 85% of the sunflower genome is high-copy sequence (Staton et al. 2012). Approximately 50% of the genome is a Ty3/gypsy LTR retrotransposon, about 10kbp in length, with an estimated 1% divergence between element copies.
We were unable to place many of the sunflower genome’s repeats within pseudomolecules and they are included in the reference artificially concatenated as the so-called sequence Q. We have, however, made progress towards quantifying and localizing some repetitive genomic features, namely the centromeres, telomeres, and ribosomal repeats.

4.2 Combining Genetic and Physical Maps

Our ultra-high density genetic map and Finger Printed BAC Contig (FPC) physical map were both useful for ordering genomic loci. They are, however, of maximum utility at different scales. We believe our genetic map to be nearly saturated; that is, we have accounted for all observable recombination events. The lengths of the de novo assembled scaffolds were usually shorter than the distance between any two consecutive and observable recombination events. The result is that several scaffolds may share the same genetic position, which may alternatively be referred to as a genetic bin. Within a bin, relative scaffold ordering is unknown. Recombination rate varies widely throughout the genome and genetic distance is not well correlated to physical distance at the chromosome scale.

The physical map is a collection of contigs constructed from fingerprinted BACs. The fingerprinting and contig construction are described briefly below. A library of BACs is constructed covering the genome to approximately 12x depth. The BACs are digested with a restriction enzyme. Digested fragments are barcoded with oligonucleotides such that the BAC they originated from may be determined later. They are sequenced using short Illumina reads that begin at the cut site. We refer to a set of reads originating from a BAC as physical map tags. This set of tags is the BAC’s fingerprint. Contigs are constructed by comparing fingerprints to each other. Fingerprints partially shared between BACs indicate overlap. Several tiled fingerprints form a contig.
Note that the tag order within a contig is inferred from the presence or absence of tags within adjacent tiled BACs; therefore, only a partial ordering may be inferred for some tag subsets.

Our physical map is useful at a more granular scale than the genetic map. It provides an estimate of the number of base pairs between scaffolds and their relative orientation. Physical map contigs, however, are not ordered or oriented relative to each other, nor are they anchored to a chromosome. Integrating the genetic and physical maps exploits the complementary information gained from each to overcome their individual limitations. The de novo assembly is used to do so.

First, the de novo genome assembly is searched for physical map tag sequences using BLAST (blastn -evalue 10000 -outfmt 7 -dust no -word_size 7 -perc_identity 96). For each tag, all hits with a bit score equal to the highest for the tag are retained. For each scaffold, a candidate set of matching physical map contigs is generated by searching for those sharing matching tags. The tag-to-scaffold bit scores are summed for all tags shared between a contig and a scaffold. The three highest scores are retained. In case of ties, all contigs with the three highest scores are retained as candidate matches.

I wrote a small piece of software to order and orient scaffolds using the physical map. A scaffold is matched to a contig using the alignment-scoring scheme described below; an example alignment is provided in Table 4.1. For a given scaffold-to-contig alignment, only tags shared between both are considered. That is, there is no mismatch penalty. First, tags are ordered according to their starting position in the scaffold. Note that some tags may share the same position in a physical map contig. The tag holding the lowest position in the scaffold receives a score of one and its contig position index is recorded. The contig position index of each subsequent tag is checked.
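A minimal Python sketch of this tag walk follows, using the per-step rules spelled out in the sentences immediately after it. It is an illustration under stated assumptions, not the software written for this work: the function names and the input layout (a list of (scaffold start, tag) pairs and, for each contig, a mapping from tag to contig position index) are hypothetical.

```python
def walk_score(contig_indices):
    """Score one ordered walk over the contig position indices of shared tags."""
    if not contig_indices:
        return 0
    score = 1  # the first tag receives a score of one
    for previous, current in zip(contig_indices, contig_indices[1:]):
        step = current - previous
        if step == 0:
            score += 1      # same contig position
        elif step == 1:
            score += 2      # next contig position
        elif step < 0:
            score -= 2      # order conflicts with the contig
        # step > 1: the score is left unchanged
    return score


def best_placement(scaffold_tags, contigs):
    """Return (score, contig name, orientation) of the best-scoring walk.

    scaffold_tags: list of (start position in scaffold, tag id)
    contigs: dict of contig name -> dict of tag id -> contig position index
    """
    ordered_tags = [tag for _, tag in sorted(scaffold_tags)]
    best = None
    for name, index_of in contigs.items():
        # Only tags shared by scaffold and contig are considered.
        shared = [index_of[tag] for tag in ordered_tags if tag in index_of]
        for orientation, indices in (("+", shared), ("-", shared[::-1])):
            score = walk_score(indices)
            if best is None or score > best[0]:
                best = (score, name, orientation)
    return best


example_tags = [(10, "t1"), (50, "t2"), (90, "t3")]
example_contigs = {
    "ctgA": {"t1": 3, "t2": 2, "t3": 1},
    "ctgB": {"t1": 7, "t3": 9},
}
placement = best_placement(example_tags, example_contigs)
```

In this toy example the scaffold's tags run in the opposite order to contig `ctgA`, so the reversed walk scores highest and the scaffold would be placed on `ctgA` in the reverse orientation.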
For tag_i to tag_(i+1): if there is no change in contig position index, the match score is increased by one; if the contig position index increases by one, the match score is increased by two; if the contig position index increases by more than one, the match score is not changed; if the contig position index decreases, the match score is decreased by two. The tag orders are then reversed and the process is repeated. After searching all candidate contigs in both directions, the highest score is chosen, giving both the final matching contig and the scaffold orientation within the contig (Figure 4.1).

4.3 The Golden Path

The FPC assembly software (Nelson and Soderlund 2009) provides distances measured in custom units. The mean BAC length for all physical map contigs was 21.23503 FPC units. We fully sequenced and assembled 100 BAC clones in order to estimate their physical size in base pairs (Figure 4.3). The mean assembly length of sequenced BACs was 149678.9 bp. The pseudomolecules were constructed using the conversion of 7049 base pairs per 1 FPC unit.

I determined the initial gap length between member scaffolds using the following method. I first summed the lengths of member scaffolds. This sum was subtracted from the estimated length of the physical map contig. If the difference was greater than twice the number of scaffolds, each member scaffold was padded with Ns on either side with the difference divided by twice the number of scaffolds. When the summed lengths of the de novo assembled scaffolds assigned to each super-scaffold were compared to the estimated lengths of the physical map contigs they were based on, the distribution of differences was centered above zero, but some differences were negative.
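The unit conversion and the initial padding rule can be checked with a short Python sketch. The names are hypothetical, integer division is an assumption about how the padding was rounded, and the single-N fallback is the rule stated in the sentence that follows.

```python
MEAN_BAC_LENGTH_FPC_UNITS = 21.23503
MEAN_BAC_LENGTH_BP = 149678.9

# Base pairs per FPC unit, rounded to the nearest whole base pair.
BP_PER_FPC_UNIT = round(MEAN_BAC_LENGTH_BP / MEAN_BAC_LENGTH_FPC_UNITS)


def contig_length_bp(fpc_units):
    """Estimated physical map contig length in base pairs."""
    return round(fpc_units * BP_PER_FPC_UNIT)


def flank_padding(contig_bp, scaffold_lengths):
    """Number of Ns placed on either side of each member scaffold."""
    n_scaffolds = len(scaffold_lengths)
    difference = contig_bp - sum(scaffold_lengths)
    if difference > 2 * n_scaffolds:
        # Spread the unexplained length over both flanks of every scaffold.
        return difference // (2 * n_scaffolds)
    # Scaffolds already fill (or overfill) the contig: a single N per side.
    return 1


roomy = flank_padding(100_000, [30_000, 20_000])
overfull = flank_padding(40_000, [30_000, 20_000])
```

With two scaffolds totaling 50,000 bp on a 100,000 bp contig, each flank receives 12,500 Ns; when the scaffolds exceed the contig estimate, each flank receives one N.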
If the difference was less than twice the number of scaffolds, each scaffold was padded with a single N on either side.

Introducing gaps between scaffolds such that the super-scaffold matches its corresponding physical map contig without compensating for cases in which the sum of scaffold lengths exceeded the physical map contig size would have the effect of inflating the length of a pseudomolecule above that physical map estimate. I also included scaffolds in the pseudomolecules if we could anchor them to a linkage group, even if they could not be assigned to a physical map contig. Thus, I adjusted gap length between de novo assembled scaffolds such that the total length of a pseudomolecule matched the sum of physical map contigs assigned to it.

Scaffold orders within a genetic bin are taken from the physical map. Scaffold lengths are summed for a genetic bin. The genetic distance of a bin is divided among member scaffolds proportional to their length to assign them pseudo-cM positions. The desired length of a chromosome’s pseudomolecule is obtained by summing the length of all physical map contigs assigned to it. The chromosome is initially divided into 1cM windows. The difference of lengths estimated via the physical map and the sum of scaffold lengths is taken for each window. If the difference is positive for all windows, it is divided among scaffolds in proportion to their pseudo-cM position. If the smallest proportion is less than two, window expansion continues. Otherwise, optimal window size has been found. All between-scaffold gaps are thus positive and the sum of all gap and scaffold lengths is equal to the sum of all physical map contig lengths. The pseudomolecule is then printed. Figure 4.2 shows the positions of de novo assembled scaffolds in physical and genetic space.

4.4 Masking

Preliminary analysis suggested that some non-biological duplicated sequence was present in the merged assembly.
When I looked at alignments of EST sequences to the genome, about a third of the alignments were duplicates covering the entire transcript at over 99% identity. These duplicates are not present in the Allpaths-LG assembly, nor are they present in the Celera assembly. I thus took measures to remove these technical artefacts. I applied the following methods to each linkage group separately.

A database of known repetitive elements was constructed by concatenating the SUNREP database (Natali et al. 2013), Repbase (Jurka et al. 2005) repeats classified as present in asterids, sunflower full-length LTR-RT families, transposable elements known to be active in sunflowers (Gill et al. 2014), ribosomal DNA (Bock et al. 2014), and cytoplasmic reference genomes (Chapter 2, Timme et al. 2007). This database was used to hard-mask (replacing ATCGs with Ns) the Allpaths and Celera subassemblies with RepeatMasker (Tarailo-Graovac and Chen 2009). Masked subassembly scaffolds were split into contigs at runs of Ns longer than nine. For each subassembly, contigs were aligned to themselves with BLAST (-dust no -perc_identity 99). The coordinates of non-self matches were hard-masked, retaining single-copy sequence. Masked contigs were again split into contigs at runs of Ns longer than nine. Single-copy sequence entries from each subassembly were concatenated into a single file and then clustered at 99% identity using cd-hit-est (Li and Godzik 2006) to remove redundant sequences. In order to ensure the resulting non-redundant sequences were single copy in both subassemblies, they were aligned to both subassemblies separately using BLAST. Query sequences with more than one match longer than 100 base pairs were masked from the non-redundant set, and again split into contigs at runs of Ns longer than nine.

In all, 465,802,996 bp assigned to a chromosome were identified as single-copy sequence at 99% identity. These sequences were used to identify technical artefacts in the merged assembly.
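The two operations repeated throughout the procedure above, hard-masking a set of coordinates and then splitting at runs of Ns longer than nine, can be illustrated with a short Python sketch. It is not the pipeline used for the thesis, which relied on RepeatMasker, BLAST, and BEDTools. The function names are hypothetical, and the 0-based, half-open interval convention and the upper-case input are assumptions made here.

```python
import re

# Ten or more consecutive Ns, i.e. a run "longer than nine".
LONG_N_RUN = re.compile(r"N{10,}")


def hard_mask(sequence, intervals):
    """Replace each 0-based, half-open (start, end) interval with Ns."""
    bases = list(sequence)
    for start, end in intervals:
        # Clamp the interval so out-of-range coordinates cannot grow the list.
        start, end = max(start, 0), min(end, len(bases))
        for i in range(start, end):
            bases[i] = "N"
    return "".join(bases)


def split_at_n_runs(sequence):
    """Split a hard-masked, upper-case sequence into contigs at long N runs."""
    # Empty pieces arise when the sequence starts or ends with a long run.
    return [piece for piece in LONG_N_RUN.split(sequence) if piece]


if __name__ == "__main__":
    masked = hard_mask("ACGT" * 6, [(4, 16)])  # masks 12 bases in the middle
    print(split_at_n_runs(masked))             # ['ACGT', 'ACGTACGT']
    # A run of exactly nine Ns is not long enough to split on.
    print(split_at_n_runs("ACGT" + "N" * 9 + "TTAA"))
```

Runs of nine or fewer Ns stay inside a contig, which matches the "longer than nine" threshold in the text.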
Single-copy sequences were aligned to the reference assembly using BLAST. Matches with greater than 99% identity, at least 100 bp in length, and spanning at least 90% of the query length were inspected for copy number. If a query sequence matched the subject sequence twice, the pair of subject matches was flagged as containing a technical artefact. The genetic position of the query sequence was compared to both matches in the pair. If the genetic position of one match of the pair differed from that listed in the subassembly, it was flagged for masking. If the genetic positions of the subject matches were the same, we chose one at random to be masked. The regions determined to be technical artefacts were masked using BEDTools (Quinlan and Hall 2010).

4.5 Seventeen Pseudomolecules

The final reference set of pseudomolecules is similar in length to the expected genome size (3.64 Gbp versus 3.6 Gbp expected) (Baack et al. 2005), with a super-scaffold N50 of 210 kbp. The genome includes greater than 98% of CEGMA (Core Eukaryotic Genes Mapping Approach) (Parra et al. 2007) genes, of which approximately 90% are full length, indicating that the gene space is well covered. The genome has been fully annotated by colleagues at INRA and includes approximately 39k strongly supported protein-coding gene models (excluding transposable elements). It is displayed in JBrowse (Skinner et al. 2009) and is accompanied by numerous tools for searching, mapping, and functional analyses (http://www.sunflowergenome.org). The number of protein-coding genes, length in centiMorgans, length in base pairs, and number of nucleotides assigned to each pseudomolecule are tabulated in Table 4.2.

Figure 4.1: Diagram showing integration of genetic map, de novo genome assembly, and physical map on Linkage Group 4. Genetic map bin positions are shown in dark grey. Scaffolds are shown in green, with the scaffold base pair position and physical map tag sequences in the neighboring columns.
The red bar at the top of the diagram shows the physical map contig of the member tags with FPC units in the row below. Yellow bars indicate a minimum tiling path of BACs. Orange rectangles indicate alignment matches of scaffold tag positions with the corresponding position in the FPC physical map contig.

Figure 4.2: Scaffold positions plotted in RHA280 x RHA801 cM (x-axis) and HA412 physical map bp (y-axis). Each numbered cell contains a chromosome’s plot. Regions of extreme recombination suppression are shown where the slope of a line is close to infinity. I suspect these regions harbor centromeric loci. Conversely, regions of the chromosome that recombine frequently are shown where the slope of a line is close to zero.

Figure 4.3: Frequency distribution of length in base pairs for 100 fully sequenced and assembled BACs (x-axis: assembly length in base pairs; y-axis: number of assemblies).

Table 4.1: Example Alignment: Scaffold403 to LG8-Ctg66

           FPC units:   100   102   115
direction  Σ score       Δ     Δ     Δ   start (bp)  tag sequence
forward       1          0     0    +1     17724     GAATTCCGAACACACTGATGTGATTA
forward       2          0     0    +1     17746     GAATTCGTTGTAAAACAGAGATATGATTTC
forward       3          0     0    +1     18170     GAATTCTAGAATATCCTTGAATACAACCAT
forward       4          0     0    +1     18974     GAATTCAAGGAAACACGAAATGAGTGGTTT
forward       5          0     0    +1     20734     GAATTCATTTTCATCAACATGCATCATCTT
forward       6          0     0    +1     20758     GAATTCAAGGTTGATTTTGAAGAAGAACTG
forward       4          0    -2     0     25585     GAATTCGAGCTAGCTCGGCTTGGCTCGATC
forward       2         -2     0     0     25609     GAATTCTAATCAAGCCGAGCTCGAGCCTCA
reverse       1         +1     0     0     25609     GAATTCTAATCAAGCCGAGCTCGAGCCTCA
reverse       3          0    +2     0     25585     GAATTCGAGCTAGCTCGGCTTGGCTCGATC
reverse       5          0     0    +2     20758     GAATTCAAGGTTGATTTTGAAGAAGAACTG
reverse       6          0     0    +1     20734     GAATTCATTTTCATCAACATGCATCATCTT
reverse       7          0     0    +1     18974     GAATTCAAGGAAACACGAAATGAGTGGTTT
reverse       8          0     0    +1     18170     GAATTCTAGAATATCCTTGAATACAACCAT
reverse       9          0     0    +1     17746     GAATTCGTTGTAAAACAGAGATATGATTTC
reverse      10          0     0    +1     17724     GAATTCCGAACACACTGATGTGATTA

Table 4.2: Final Pseudomolecule Statistics

Chromosome  n genes  length (cM)    length (bp)         n ATCG
1             2535      78.52       175,985,764      99,635,607
2             2050      83.61       209,013,747     116,957,742
3             2567      75.70       203,472,901     111,263,426
4             2486      95.99       216,026,857     114,464,986
5             2538      88.06       271,056,985     147,484,857
6             1654      59.68       100,519,666      57,620,576
7             1569      54.03       109,221,022      60,893,579
8             2240      68.46       192,129,815     105,634,875
9             3300      91.98       253,478,808     139,276,314
10            3233      87.89       327,788,049     183,694,265
11            2168      84.69       208,730,832     109,503,895
12            2591      70.22       208,068,730     114,409,345
13            2732      70.56       239,367,298     137,400,774
14            2613      76.30       230,295,834     119,919,823
15            2326      75.34       202,246,870     110,705,372
16            2350      99.13       226,777,971     115,811,864
17            2699     100.77       267,415,242     144,655,486
total       41,651    1360.94     3,641,596,391   1,989,332,786

Chapter 5

Conclusion

I have produced a set of pseudomolecules representing the seventeen chromosomes of sunflower. Additionally, I have closed the mitochondrial genome’s master circle. Together with the previously assembled plastid genome, nearly all the DNA of a sunflower can now be easily browsed as graphics online.

This project required integrating many sources of information to deliver a good that will aid knowledge synthesis. The physical map, genetic map, and de novo assemblies all provide useful information, but at different scales. At this point in time, leveraging all three was necessary to model the sunflower’s genome as stretched out strings of DNA.

The delivery of a reference genome is largely a technical achievement. The immediate question it answers (i.e. what is the linear sequence of DNA in a single, highly inbred, line?) is narrow. On its own, it allows us to address other biological questions, such as: How do repeat families cluster in space? How do recombination rates vary? What is their relationship to sequence features? Where are protein-coding genes located? What is the syntenic relationship of the sunflower’s paleologs?

While these are interesting questions, reference genomes are most powerful when used as an x-axis against which to plot measures pertinent to the study of macroevolution, population genetics, and functional morphology.
Perhaps a population geneticist will gain new insight into the mechanisms driving differentiation by viewing the Fixation Index (Fst) outliers between two populations on an axis shared with the tests for selection of a multi-species comparison, the transcript expression values of a gene expression experiment, or the Quantitative Trait Loci of a mapping cross.

The reference described here is currently being used to genotype nearly 500 accessions of sunflowers via the alignment of low-coverage short reads. Alignment to a common reference is facilitating the use of accurate Bayesian models of genotyping. The resulting matrix of genotypes will allow researchers to model reticulate evolution in the genus and help understand mechanisms of speciation in the face of high levels of gene flow. Additionally, companies involved in the sunflower genome consortium have favorably reviewed the reference genome as useful in elite breeding.

Recent breakthroughs in sequencing technology (e.g. PacBio (Eid et al. 2009) and Oxford Nanopore (Bayley 2015)) have resulted in read lengths measured in kilobases. New methods for long-range scaffolding using read libraries prepared from precipitated chromatin (Lieberman-Aiden et al. 2009) are being developed as well (Burton et al. 2013). The next generation of sunflower genomes assembled de novo will be based on these technologies. These will likely supplant the reference discussed here. However, many of the tools, methods, and resources we developed for the HA412 reference will be reused to produce new sets of pseudomolecules from the new de novo assemblies.

Bibliography

Ahmed, R., Yousaf, J., Nadeem, I., Saleem, M., and Ali, A. (2013). Response of sunflower (Helianthus annuus L.) hybrids to population of different insect pests and their bio-control agents. Journal of Agricultural Research, 51(1).

Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25(17):3389–3402.

Alverson, A. J., Wei, X., Rice, D. W., Stern, D. B., Barry, K., and Palmer, J. D. (2010). Insights into the evolution of mitochondrial genome size from complete sequences of Citrullus lanatus and Cucurbita pepo (Cucurbitaceae). Molecular Biology and Evolution, 27(6):1436–1448.

Anderson, E. (1931). Internal factors affecting discontinuity between species. American Naturalist, pages 144–148.

Anderson, E. (1936). The species problem in Iris. Annals of the Missouri Botanical Garden, pages 457–509.

Anderson, E. (1949). Introgressive hybridization. John Wiley and Sons, Inc., New York, Chapman and Hall, Ltd., London.

Anderson, E. and Hubricht, L. (1938). Hybridization in Tradescantia. III. The evidence for introgressive hybridization. American Journal of Botany, pages 396–402.

Andrew, R. L., Kane, N. C., Baute, G. J., Grassa, C. J., and Rieseberg, L. H. (2013). Recent nonhybrid origin of sunflower ecotypes in a novel habitat. Molecular Ecology, 22(3):799–813.

Baack, E. J., Whitney, K. D., and Rieseberg, L. H. (2005). Hybridization and genome size evolution: timing and magnitude of nuclear DNA content increases in Helianthus homoploid hybrid species. New Phytologist, 167(2):623–630.

Barb, J. G., Bowers, J. E., Renaut, S., Rey, J. I., Knapp, S. J., Rieseberg, L. H., and Burke, J. M. (2014). Chromosomal evolution and patterns of introgression in Helianthus. Genetics, 197(3):969–979.

Barker, M. S., Kane, N. C., Matvienko, M., Kozik, A., Michelmore, R. W., Knapp, S. J., and Rieseberg, L. H. (2008). Multiple paleopolyploidizations during the evolution of the Compositae reveal parallel patterns of duplicate gene retention after millions of years. Molecular Biology and Evolution, 25(11):2445–2455.

Baute, G. J., Kane, N. C., Grassa, C. J., Lai, Z., and Rieseberg, L. H. (2015). Genome scans reveal candidate domestication and improvement genes in cultivated sunflower, as well as post-domestication introgression with wild relatives. New Phytologist, 206(2):830–838.

Bayley, H. (2015). Nanopore sequencing: From imagination to reality. Clinical Chemistry, 61(1):25–31.

Beckstrom-Sternberg, S., Rieseberg, L. H., and Doan, K. (1991). Gene lineage analysis in populations of Helianthus niveus and H. petiolaris (Asteraceae). Plant Systematics and Evolution, 175(3-4):125–138.

Bentley, D. R. (2006). Whole-genome re-sequencing. Current Opinion in Genetics & Development, 16(6):545–552.

Bentley, D. R., Balasubramanian, S., Swerdlow, H. P., Smith, G. P., Milton, J., Brown, C. G., Hall, K. P., Evers, D. J., Barnes, C. L., Bignell, H. R., et al. (2008). Accurate whole human genome sequencing using reversible terminator chemistry. Nature, 456(7218):53–59.

Blackman, B. K., Scascitelli, M., Kane, N. C., Luton, H. H., Rasmussen, D. A., Bye, R. A., Lentz, D. L., and Rieseberg, L. H. (2011). Sunflower domestication alleles support single domestication center in eastern North America. Proceedings of the National Academy of Sciences, 108(34):14360–14365.

Blamey, F., Zollinger, R. K., and Schneiter, A. A. (1997). Sunflower production and culture. Sunflower Technology and Production, pages 595–670.

Bock, D. G., Kane, N. C., Ebert, D. P., and Rieseberg, L. H. (2014). Genome skimming reveals the origin of the Jerusalem Artichoke tuber crop species: neither from Jerusalem nor an artichoke. New Phytologist, 201(3):1021–1030.

Boetzer, M., Henkel, C. V., Jansen, H. J., Butler, D., and Pirovano, W. (2011). Scaffolding pre-assembled contigs using SSPACE. Bioinformatics, 27(4):578–579.

Bolger, A. M., Lohse, M., and Usadel, B. (2014). Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, page btu170.

Bowers, J. E., Bachlava, E., Brunick, R. L., Rieseberg, L. H., Knapp, S. J., and Burke, J. M. (2012). Development of a 10,000 locus genetic map of the sunflower genome based on multiple crosses. G3: Genes|Genomes|Genetics, 2(7):721–729.

Buerkle, C. A. and Rieseberg, L. H. (2008). The rate of genome stabilization in homoploid hybrid species. Evolution, 62(2):266–275.

Burke, J. M., Lai, Z., Salmaso, M., Nakazato, T., Tang, S., Heesacker, A., Knapp, S. J., and Rieseberg, L. H. (2004). Comparative mapping and rapid karyotypic evolution in the genus Helianthus. Genetics, 167(1):449–457.

Burton, J. N., Adey, A., Patwardhan, R. P., Qiu, R., Kitzman, J. O., and Shendure, J. (2013). Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nature Biotechnology, 31(12):1119–1125.

Chandler, J. M., Jan, C.-C., and Beard, B. H. (1986). Chromosomal differentiation among the annual Helianthus species. Systematic Botany, pages 354–371.

Cheres, M. T. and Knapp, S. J. (1998). Ancestral origins and genetic diversity of cultivated sunflower: coancestry analysis of public germplasm. Crop Science, 38(6):1476–1482.

Compeau, P. E., Pevzner, P. A., and Tesler, G. (2011). How to apply de Bruijn graphs to genome assembly. Nature Biotechnology, 29(11):987–991.

Darwin, C. (1859). On the origin of species by means of natural selection. London: Murray.

Dempewolf, H., Eastwood, R. J., Guarino, L., Khoury, C. K., Müller, J. V., and Toll, J. (2014). Adapting agriculture to climate change: A global initiative to collect, conserve, and use crop wild relatives. Agroecology and Sustainable Food Systems, 38(4):369–377.

Dirichlet, G. L. (1850). Über die reduction der positiven quadratischen Formen mit drei unbestimmten ganzen Zahlen. Journal für die Reine und Angewandte Mathematik, 40:209–227.

Dussle, C., Hahn, V., Knapp, S., and Bauer, E. (2004). Pl Arg from Helianthus argophyllus is unlinked to other known downy mildew resistance genes in sunflower. Theoretical and Applied Genetics, 109(5):1083–1086.

Eid, J., Fehr, A., Gray, J., Luong, K., Lyle, J., Otto, G., Peluso, P., Rank, D., Baybayan, P., Bettman, B., et al. (2009). Real-time DNA sequencing from single polymerase molecules. Science, 323(5910):133–138.

Ewing, B. and Green, P. (1998). Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Research, 8(3):186–194.

Feder, J. L., Egan, S. P., and Nosil, P. (2012). The genomics of speciation-with-gene-flow. Trends in Genetics, 28(7):342–350.

Feng, J., Liu, Z., Cai, X., and Jan, C.-C. (2013). Toward a molecular cytogenetic map for cultivated sunflower (Helianthus annuus L.) by landed BAC/BIBAC clones. G3: Genes|Genomes|Genetics, 3(1):31–40.

Fick, G., Zimmer, D., and Kinman, M. (1974). Registration of six sunflower parental lines (Reg. No. PL 1 to 6). Crop Science, 14(6):912–912.

Galton, F. (1894). Natural inheritance. Macmillan.

Gill, N., Buti, M., Kane, N., Bellec, A., Helmstetter, N., Berges, H., and Rieseberg, L. H. (2014). Sequence-based analysis of structural organization and composition of the cultivated sunflower (Helianthus annuus L.) genome. Biology, 3(2):295–319.

Gnerre, S., MacCallum, I., Przybylski, D., Ribeiro, F. J., Burton, J. N., Walker, B. J., Sharpe, T., Hall, G., Shea, T. P., Sykes, S., et al. (2011). High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proceedings of the National Academy of Sciences, 108(4):1513–1518.

Hamming, R. W. (1950). Error detecting and error correcting codes. Bell System Technical Journal, 29(2):147–160.

Harlan, J. R. and De Wet, J. (1963). The compilospecies concept. Evolution, pages 497–501.

Harter, A. V., Gardner, K. A., Falush, D., Lentz, D. L., Bye, R. A., and Rieseberg, L. H. (2004). Origin of extant domesticated sunflowers in eastern North America. Nature, 430(6996):201–205.

Heiser, C. B. (1951). The sunflower among the North American Indians. Proceedings of the American Philosophical Society, pages 432–448.

Heiser, C. B., Martin, W. C., and Smith, D. (1962). Species crosses in Helianthus: I. Diploid species. Brittonia, 14(2):137–147.

Heiser, C. B. and Smith, D. M. (1954). New chromosome numbers in Helianthus and related genera (Compositae). In Proceedings of the Indiana Academy of Science, volume 64, pages 250–253.

Heiser, C. B., Smith, D. M., Clevenger, S. B., and Martin, W. (1969). North American sunflowers (Helianthus), volume 22 of Memoirs of the Torrey Pines Botanical Club. Durham.

Heiser Jr, C. B. (1947). Hybridization between the sunflower species Helianthus annuus and H. petiolaris. Evolution, pages 249–262.

Hodgins, K. A., Lai, Z., Oliveira, L. O., Still, D. W., Scascitelli, M., Barker, M. S., Kane, N. C., Dempewolf, H., Kozik, A., Kesseli, R. V., et al. (2014). Genomics of Compositae crops: reference transcriptome assemblies and evidence of hybridization with wild relatives. Molecular Ecology Resources, 14(1):166–177.

Horn, R., Köhler, R. H., and Zetsche, K. (1991). A mitochondrial 16 kDa protein is associated with cytoplasmic male sterility in sunflower. Plant Molecular Biology, 17(1):29–36.

Jurka, J., Kapitonov, V. V., Pavlicek, A., Klonowski, P., Kohany, O., and Walichiewicz, J. (2005). Repbase Update, a database of eukaryotic repetitive elements. Cytogenetic and Genome Research, 110(1-4):462–467.

Kane, N., Gill, N., King, M., Bowers, J., Berges, H., Gouzy, J., Bachlava, E., Langlade, N., Lai, Z., Stewart, M., et al. (2011). Progress towards a reference genome for sunflower. Botany, 89(7):429–437.

Kane, N. C., King, M. G., Barker, M. S., Raduski, A., Karrenberg, S., Yatabe, Y., Knapp, S. J., and Rieseberg, L. H. (2009). Comparative genomic and population genetic analyses indicate highly porous genomes and high levels of gene flow between divergent Helianthus species. Evolution, 63(8):2061–2075.

Karrenberg, S., Edelist, C., Lexer, C., and Rieseberg, L. (2006). Response to salinity in the homoploid hybrid species Helianthus paradoxus and its progenitors H. annuus and H. petiolaris. New Phytologist, 170(3):615–629.

Kelley, D. R., Schatz, M. C., Salzberg, S. L., et al. (2010). Quake: quality-aware detection and correction of sequencing errors. Genome Biology, 11(11):R116.

Kernighan, B. W. and Mashey, J. R. (1979). The UNIX programming environment, volume 9. Wiley Online Library.

Kosambi, D. (1943). The estimation of map distances from recombination values. Annals of Eugenics, 12(1):172–175.

Lane, M. A. (2003). The global biodiversity information facility. Bulletin of the American Society for Information Science and Technology, 30(1):22–24.

Li, H. and Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14):1754–1760.

Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., et al. (2009). The sequence alignment/map format and SAMtools. Bioinformatics, 25(16):2078–2079.

Li, W. and Godzik, A. (2006). CD-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22(13):1658–1659.

Lieberman-Aiden, E., van Berkum, N. L., Williams, L., Imakaev, M., Ragoczy, T., Telling, A., Amit, I., Lajoie, B. R., Sabo, P. J., Dorschner, M. O., et al. (2009). Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science, 326(5950):289–293.

Luo, R., Liu, B., Xie, Y., Li, Z., Huang, W., Yuan, J., He, G., Chen, Y., Pan, Q., Liu, Y., et al. (2012). SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience, 1(1):18.

Mandel, J., Dechaine, J., Marek, L., and Burke, J. (2011). Genetic diversity and population structure in cultivated sunflower and a comparison to its wild progenitor, Helianthus annuus L. Theoretical and Applied Genetics, 123(5):693–704.

Margulies, M., Egholm, M., Altman, W. E., Attiya, S., Bader, J. S., Bemben, L. A., Berka, J., Braverman, M. S., Chen, Y.-J., Chen, Z., et al. (2005). Genome sequencing in microfabricated high-density picolitre reactors. Nature, 437(7057):376–380.

McCouch, S., Baute, G. J., Bradeen, J., Bramel, P., Bretting, P. K., Buckler, E., Burke, J. M., Charest, D., Cloutier, S., Cole, G., et al. (2013). Agriculture: feeding the future. Nature, 499(7456):23–24.

Metzker, M. L. (2010). Sequencing technologies - the next generation. Nature Reviews Genetics, 11(1):31–46.

Miller, J., Gulya, T., and Vick, B. (2006). Registration of three maintainer (HA 456, HA 457, and HA 412 HO) high-oleic oilseed sunflower germplasms. Crop Science, 46(6):2728–2728.

Miller, J. R., Delcher, A. L., Koren, S., Venter, E., Walenz, B. P., Brownley, A., Johnson, J., Li, K., Mobarry, C., and Sutton, G. (2008). Aggressive assembly of pyrosequencing reads with mates. Bioinformatics, 24(24):2818–2824.

Natali, L., Cossu, R. M., Barghini, E., Giordani, T., Buti, M., Mascagni, F., Morgante, M., Gill, N., Kane, N. C., Rieseberg, L., et al. (2013). The repetitive component of the sunflower genome as shown by different procedures for assembling next generation sequencing reads. BMC Genomics, 14(1):686.

Nelson, W. and Soderlund, C. (2009). Integrating sequence with FPC fingerprint maps. Nucleic Acids Research, 37(5):e36–e36.

Parra, G., Bradnam, K., and Korf, I. (2007). CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics, 23(9):1061–1067.

Quinlan, A. R. and Hall, I. M. (2010). BEDtools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 26(6):841–842.

Rauf, S. (2008). Breeding sunflower (Helianthus annuus L.) for drought tolerance. Communications in Biometry and Crop Science, 3(1):29–44.

Renaut, S., Grassa, C., Yeaman, S., Moyers, B., Lai, Z., Kane, N., Bowers, J., Burke, J., and Rieseberg, L. (2013). Genomic islands of divergence are not affected by geography of speciation in sunflowers. Nature Communications, 4:1827.

Rieseberg, L. H. (1991). Homoploid reticulate evolution in Helianthus (Asteraceae): evidence from ribosomal genes. American Journal of Botany, pages 1218–1237.

Rieseberg, L. H., Baird, S. J., and Desrochers, A. M. (1998). Patterns of mating in wild sunflower hybrid zones. Evolution, pages 713–726.

Rieseberg, L. H., Beckstrom-Sternberg, S. M., Liston, A., and Arias, D. M. (1991). Phylogenetic and systematic inferences from chloroplast DNA and isozyme variation in Helianthus sect. Helianthus (Asteraceae). Systematic Botany, pages 50–76.

Rieseberg, L. H., Raymond, O., Rosenthal, D. M., Lai, Z., Livingstone, K., Nakazato, T., Durphy, J. L., Schwarzbach, A. E., Donovan, L. A., and Lexer, C. (2003). Major ecological transitions in wild sunflowers facilitated by hybridization. Science, 301(5637):1211–1216.

Rieseberg, L. H. and Seiler, G. J. (1990). Molecular evidence and the origin and development of the domesticated sunflower (Helianthus annuus, Asteraceae). Economic Botany, 44(3):79–91.

Roath, W., Miller, J., and Gulya, T. (1981). Registration of RHA 801 sunflower germplasm (Reg. No. GP 5). Crop Science, 21(3):479.

Rothberg, J. M. and Leamon, J. H. (2008). The development and impact of 454 sequencing. Nature Biotechnology, 26(10):1117–1124.

Rothfels, C. J., Johnson, A. K., Hovenkamp, P. H., Swofford, D. L., Roskam, H. C., Fraser-Jenkins, C. R., Windham, M. D., and Pryer, K. M. (2015). Natural hybridization between genera that diverged from each other approximately 60 million years ago. The American Naturalist, 185(3):433–442.

Sambatti, J., Strasburg, J. L., Ortiz-Barrientos, D., Baack, E. J., and Rieseberg, L. H. (2012). Reconciling extremely strong barriers with high levels of gene exchange in annual sunflowers. Evolution, 66(5):1459–1473.

Scascitelli, M., Whitney, K., Randell, R., King, M., Buerkle, C., and Rieseberg, L. (2010). Genome scan of hybridizing sunflowers from Texas (Helianthus annuus and H. debilis) reveals asymmetric patterns of introgression and small islands of genomic differentiation. Molecular Ecology, 19(3):521–541.

Schluter, D. (2009). Evidence for ecological speciation and its alternative. Science, 323(5915):737–741.

Seiler, G. J. (1992). Utilization of wild sunflower species for the improvement of cultivated sunflower. Field Crops Research, 30(3):195–230.

Seiler, G. J. and Jan, C.-C. (2014). Wild sunflower species as a genetic resource for resistance to sunflower broomrape (Orobanche cumana Wallr.). Helia, 37(61):129–139.

Siculella, L. and Palmer, J. D. (1988). Physical and gene organization of mitochondrial DNA in fertile and male sterile sunflower. CMS-associated alterations in structure and transcription of the atpA gene. Nucleic Acids Research, 16(9):3787–3799.

Skinner, M. E., Uzilov, A. V., Stein, L. D., Mungall, C. J., and Holmes, I. H. (2009). JBrowse: a next-generation genome browser. Genome Research, 19(9):1630–1638.

Smith, T. F. and Waterman, M. S. (1981). Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197.

Sommer, D. D., Delcher, A. L., Salzberg, S. L., and Pop, M. (2007). Minimus: a fast, lightweight genome assembler. BMC Bioinformatics, 8(1):64.

Staton, S. E., Bakken, B. H., Blackman, B. K., Chapman, M. A., Kane, N. C., Tang, S., Ungerer, M. C., Knapp, S. J., Rieseberg, L. H., and Burke, J. M. (2012). The sunflower (Helianthus annuus L.) genome reflects a recent history of biased accumulation of transposable elements. The Plant Journal, 72(1):142–153.

Stebbins, J. C., Winchell, C. J., and Constable, J. V. (2013). Helianthus winteri (Asteraceae), a new perennial species from the southern Sierra Nevada foothills, California. Aliso: A Journal of Systematic and Evolutionary Botany, 31(1):19–24.

Tang, S., Leon, A., Bridges, W. C., and Knapp, S. J. (2006). Quantitative trait loci for genetically correlated seed traits are tightly linked to branching and pericarp pigment loci in sunflower. Crop Science, 46(2):721–734.

Tang, S., Yu, J.-K., Slabaugh, M., Shintani, D., and Knapp, S. (2002). Simple sequence repeat map of the sunflower genome. Theoretical and Applied Genetics, 105(8):1124–1136.

Tarailo-Graovac, M. and Chen, N. (2009). Using RepeatMasker to identify repetitive elements in genomic sequences. Current Protocols in Bioinformatics, pages 4–10.

Timme, R. E., Kuehl, J. V., Boore, J. L., and Jansen, R. K. (2007). A comparative analysis of the Lactuca and Helianthus (Asteraceae) plastid genomes: identification of divergent regions and categorization of shared repeats. American Journal of Botany, 94(3):302–312.

Van Nieuwerburgh, F., Thompson, R. C., Ledesma, J., Deforce, D., Gaasterland, T., Ordoukhanian, P., and Head, S. R. (2011). Illumina mate-paired DNA sequencing-library preparation using Cre-Lox recombination. Nucleic Acids Research, page gkr1000.

Vogel, H. (1979). A better way to construct the sunflower head. Mathematical Biosciences, 44(3):179–189.

Wan, S., Jiao, Y., Kang, Y., Jiang, S., Tan, J., Liu, W., and Meng, J. (2013). Growth and yield of oleic sunflower (Helianthus annuus L.) under drip irrigation in very strongly saline soils. Irrigation Science, 31(5):943–957.

Wu, Y., Bhat, P. R., Close, T. J., and Lonardi, S. (2008). Efficient and accurate construction of genetic linkage maps from the minimum spanning tree of a graph. PLoS Genetics, 4(10):e1000212.

Yatabe, Y., Kane, N. C., Scotti-Saintagne, C., and Rieseberg, L. H. (2007). Rampant gene exchange across a strong reproductive barrier between the annual sunflowers, Helianthus annuus and H. petiolaris. Genetics, 175(4):1883–1893.

Yu, J.-K., Mangor, J., Thompson, L., Edwards, K. J., Slabaugh, M. B., and Knapp, S. J. (2002). Allelic diversity of simple sequence repeats among elite inbred lines of cultivated sunflower. Genome, 45(4):652–660.

Zerbino, D. R. and Birney, E. (2008). Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Research, 18(5):821–829.

