Open Collections

UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Regulation of gene expression in invasive and non-invasive Compositae weeds Bell, Graeme Douglas Milton 2010

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata


24-ubc_2010_fall_bell_graeme.pdf [ 6.54MB ]
JSON: 24-1.0071029.json
JSON-LD: 24-1.0071029-ld.json
RDF/XML (Pretty): 24-1.0071029-rdf.xml
RDF/JSON: 24-1.0071029-rdf.json
Turtle: 24-1.0071029-turtle.txt
N-Triples: 24-1.0071029-rdf-ntriples.txt
Original Record: 24-1.0071029-source.json
Full Text

Full Text

REGULATION OF GENE EXPRESSION IN INVASIVE AND NON-INVASIVE CANADA THISTLE  by Graeme Douglas Milton Bell B.Sc., The University of British Columbia, 2007  A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS OF THE DEGREE OF MASTER OF SCIENCE in THE FACULTY OF GRADUATE STUDIES (Botany) THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)  June 2010  ©Graeme Douglas Milton Bell, 2010! !  !  !  !  !  ABSTRACT Gene expression divergence between populations has been linked to adaptive morphological evolution and is thought to be a factor in the invasive success of certain weedy plants. Understanding the genetic basis of these regulatory changes can identify genes that have been under selection during adaptation to a new environment or new species interactions. A highthroughput sequencing approach was used to study the regulatory basis (cis and/or trans) of gene expression differences between native and invasive populations of Cirsium arvense (Canada thistle) by exploring patterns of differential gene expression and sequence variation. Parent and hybrid allele-specific expression ratios were compared to infer the relative effects of cis- and trans-regulatory change. Genes differentially regulated in cis are considered candidate genes involved in adaptation or weediness because there is evidence for selection acting primarily on cis-regulatory variation. Illumina sequencing of cDNA libraries derived from parents and hybrid pools resulted in a total of 82,713,256 paired-end (2x100bp) reads and 83.4% of these were mapped to a reference C. arvense transcriptome of 88,374 unigene sequences. Expression analysis and variant (SNPs and Indel) calling was performed to score the nature of regulatory divergence for the first 900 contigs, representing ~1% of the total dataset. Of the 40 highconfidence cases, 7 showed cis-effects, 6 showed trans-effects, 9 had varying degrees of both cis and trans, and 18 showed non-intermediate hybrid effects. A set of contigs that had high similarity to 63 known or confirmed stress-related genes, previously identified in studies of sunflower and Canada thistle, was also assayed for allelic imbalance. Of these, 2 cases showed a cis-effect, 2 showed both cis- and trans-effects, and 2 revealed hybrid effects. Contig 23614, an auxin-response transcription factor, was differentially regulated due to cis-effects and has been previously confirmed as drought-stress gene in both sunflower and C. arvense. This research identifies changes in gene expression that are driven by differential selective pressures in native and invasive populations. It also advances our understanding of the nature of genetic changes that drive gene expression evolution.  !  ""!  !  !  !  TABLE OF CONTENTS Abstract .......................................................................................................................................ii Table of Contents...................................................................................................................... iii List of Tables ..............................................................................................................................iv List of Figures .............................................................................................................................v Acknowledgements 1 Introduction .............................................................................................................................1 1.1 Invasive Weeds – Introduction and Stress Response ........................................................1 1.2 Canada Thistle Biology .....................................................................................................3 1.3 Weed Genomics and Thesis Goals ....................................................................................4 1.4 The Transcriptional Regulators of Gene Expression, Cis and Trans ................................5 1.5 Estimating the Roles of Cis and Trans on Gene Expression Divergence .........................7 1.5.1 Evidence from Multi-Locus Mapping Studies ..........................................................7 1.5.2 Evidence from Studies of Allele Specific Expression...............................................8 1.5.3 Hybrid Effects Complicate Inference of Cis and Trans ..........................................10 1.6 What Conditions Favour Cis- vs Trans- Regulatory Evolution? ....................................12 1.6.1 Mutational Target Size ............................................................................................12 1.6.2 Different Constraints on Cis-regulatory and Coding Regions.................................13 1.7 Summary of Cis and Trans Regulatory Evolution ..........................................................15 2 Methods .................................................................................................................................16 2.1 Plant Materials .................................................................................................................16 2.2 Crosses.............................................................................................................................17 2.3 RNA Extractions and DNase-I Treatment.......................................................................19 2.4 cDNA Library Preparation and Illumina Sequencing .....................................................20 2.5 Reference Transcriptome – Raw Data, Filtering and Assembly .....................................20 2.6 Quality Confirmation of the Reference Transcriptome Assembly ..................................21 2.7 Handling of Raw Illumina Data and Quality Trimming..................................................21 2.8 Mapping Illumina Sequence Reads .................................................................................22 2.9 SNP Calling and Expression Analysis using SAMtools .................................................22 2.10 Identifying Contigs Corresponding to Known Stress-Genes ........................................23 2.11 Cis and Trans Effects ....................................................................................................24 3 Results.....................................................................................................................................26 3.1 Reference Transcriptome Assembly................................................................................26 3.2 Crossing Results .............................................................................................................27 3.3 Illumina Sequencing and Mapping Results .....................................................................28 3.4 Gene Expression Analysis ...............................................................................................30 3.5 Allele-Specific Expression of Known Stress Genes........................................................32 3.6 Analyzing Genome-Wide Allelic Expression .................................................................39 4 Discussion ...............................................................................................................................43 4.1 Considerations of the Illumina sequencing approach......................................................43 4.2 Estimating Cis and Trans Effects ....................................................................................44 4.3 Hybrid Effects..................................................................................................................47 4.4 Future Direction...............................................................................................................47 Bibliography..............................................................................................................................48 Appendix....................................................................................................................................54  !  """!  !  !  !  LIST OF TABLES Table 2.1 Number of plants grown from native and invasive populations.................................17 Table 3.1 Summary of 22 pairwise crosses between native and invasive populations ..............28 Table 3.2 Summary of Illumina sequence results.......................................................................29 Table 3.3 Summary of mapping results using MosaikAligner ...................................................29 Table 3.4 Cirsium contigs with homology to known stress genes .............................................33 Table 3.5 Stress genes for which parents are homozygous for different alleles.........................34 Table 3.6 Summary of SNP sites used to score hybrid allelic imbalance ............................ 40-41 Table A.1 Building Hybrid RNA Pool1 .....................................................................................54 Table A.2 Building Hybrid RNA Pool2 .....................................................................................55  !  "#!  !  !  !  LIST OF FIGURES Figure 1.1 Interactions between cis and trans factors regulate transcription ...............................6 Figure 1.2 Allele-specific expression reveals the effects of cis and trans....................................9 Figure 2.1 Workflow of SNP analysis using variant pileup output from SAMtools..................25 Figure 3.1 Cirsium transcriptome contig length distribution .....................................................26 Figure 3.2 Age distribution of duplicated genes in the Cirsium transcriptome..........................27 Figure 3.3 Correlation of relative allelic expression between hybrid pools 1 and 2 .................30 Figure 3.4 Correlation of relative allelic expression between hybrid pool 1 and parents ..........31 Figure 3.5 Correlation of relative allelic expression between hybrid pools and parents............31 Figure 3.6 Allele-specific expression in parents and hybrids for Contig23614 .........................36 Figure 3.7 Allele-specific expression in parents and hybrids for Contig14491 .........................37 Figure 3.8 Allele-specific expression in parents and hybrids for Contig5913 ...........................38 Figure 3.9 Allele-specific expression in parents and hybrids for Contig1451 ...........................39  !  #!  !  !  !  ACKNOWLEDGEMENTS I would like to thank my graduate supervisor, Dr. Keith Adams, and the other members of my committee, Drs. Loren Rieseberg, Jeannette Whitton, and Sally Otto, for the opportunity to participate in leading edge genomics research with one of the top evolutionary biology groups. Thanks also to my colleagues in the Adams lab for stimulating discussion and friendship over these past two years. I would especially like to recognized Drs. Katrina Dlugosch, Nolan Kane and Mike Barker, for the crash course in bioinformatics, as well as Drs. Alessia Guggisberg and Maya Mayrose for guidance generating plant materials and with wet lab protocols. This work is dedicated to my family and friends for their ongoing support and for always reminding me to see the forest through the trees.  !  #"!  !  !  !  1. INTRODUCTION 1.1 Invasive Weeds – Introduction and Stress Response Invasive agricultural weeds cost billions of dollars annually in terms of lost agricultural productivity and control efforts (Pimentel et al., 2005) and threaten local ecosystems by reducing species diversity (Donald, 1994). Understanding the genetic basis of traits that allow certain introduced species to out-compete indigenous flora is essential in guiding efforts to prevent their establishment and spread (Anderson, 2008; Stewart et al., 2009). There are many different characteristics that might explain the invasive success of a particular plant relative to others, and it is becoming clear that weeds succeed as invaders through a variety of different strategies (Chao et al., 2005). Weeds are thought to have evolved multiple traits that are important for colonization and proliferation in a range of environments, taking on a generalist (as opposed to specialist) strategy (reviewed in Chao et al., 2005). Such ‘weedy’ traits may include, but are not limited to, resistance to control measures, increased seed viability, early reproduction, long distance seed dispersal, rapid vegetative growth rates, prolific seed!set, and efficient nutrient uptake (Baker, 1974; Basu et al., 2004; reviewed in Chao et al., 2005). The success of Cirsium arvense (Canada thistle), for example, is largely attributed to its potential for clonal reproduction from root growth, but also to long-distance seed dispersal, long-term seed viability (>20 years buried in soil) and high germination rate (Donald, 1994). Despite the diverse strategies of invasive weeds, one common trend has emerged which is consistent with the trade-off hypothesis proposed by Grime (1977). The trade-off hypothesis predicts that invaders that have escaped stresses (biotic or abiotic) in their native ranges are free to allocate more energy towards overall competitive ability since investment in costly defense pathways is no longer necessary (Grime, 1977). Indeed, comparative studies of gene expression between native and invasive populations of invasive sunflowers have revealed a downregulation of many stress tolerance genes in weedy populations (Lai et al., 2008). The detection of regulatory divergence between native and invasive populations of common sunflower, Helianthus annuus, provided evidence for adaptive changes in gene expression in the evolution of Compositae weeds (Lai et al., 2008). Parallel changes in gene expression occur !  $!  ! ! ! between several wild and weedy population pairs, which strongly suggest that these loci underlie adaptively important traits. On average, about 5% of genes are differentially expressed between wild and weedy populations and these include genes involved in stress-responses (Lai et al., 2008). This observed downregulation of stress-tolerance genes in invasive populations supports the trade-off hypothesis. Further evidence for stress-tolerance genes being overrepresented as targets of selection comes from comparisons of sequence variability at microsatellite loci between native and invasive population of H. annuus. Kane and Rieseberg (2007; 2008) found that ~5% of genes in microsatellite flanking regions had reduced variation in weedy populations, which could be explained by linkage with locally advantageous alleles rising in frequency. In contrast, reduced variation across all loci can result from population bottlenecks, where small founding populations offer only limited genetic diversity to the gene pool. This signature of selection was found in several stress-tolerance genes putatively involved in adaptation to new environments. Some of these genes showed the same patterns of reduced variation in weedy populations between multiple native/invasive pairs, further supporting their importance in adaptive evolution (Kane and Rieseberg, 2008). Developing genomic resources for multiple model weeds is important in order to encompass the variety of mechanisms by which invasive weeds succeed. Several candidate model weeds have emerged - selected on the basis of economic impact, geographic distribution, and availability of existing genomic resources (Stewart et al., 2009). These include well-known species in the Asteraceae (Compositae) family, many of which are suitable candidates based on the above criteria: Common ragweed (Ambrosia artemisiifola), Common sunflower (Helianthus annuus), Yellow Star Thistle (Centaurea solstitialis), Diffuse Knapweed (Centaurea diffusa), and Canada Thistle (Cirsium arvense). Canada thistle is one ideal model weed because it is persistent, has a wide geographic range, and is hard to control (reviewed in Donald 1994). It is also among the most frequently listed plants on noxious weed lists across the United States and Canada (Skinner et al., 2000). Furthermore, microarray and EST data previously produced for domesticated Compositae species, sunflower and lettuce, have helped to inform the production of genomic resources for Canada thistle (Barker et al., 2008; Stewart et al., 2009).  !  %!  !  !  !  1.2 Canada Thistle Biology Canada thistle is a perennial and is notorious for its extensive root system. In terms of sexual reproductive morphology, Canada thistle is “imperfectly dioecious”. Male and female flower heads occur on separate individuals, yet hermaphroditic populations have been identified where a small percentage of pollen-producing plants have been known to set seed (Donald 1994). Flowers are primarily insect pollinated as evidenced by observation as well as the detection of compounds known for their insect-attracting properties (Moore 1975; Donald 1994). Natural flowering time occurs from July to September at northern latitudes, and slightly earlier (May – August) at southern latitudes. Overwinter conditions (low temperature and short photoperiod) have been shown to favour root growth, whereas a longer photoperiod and summer heat encourages shoot growth and reduces time-to-flowering (Moore 1975). Originally native to the Mediterranean region of southeastern Europe, C. arvense now has a near global distribution (Nuzzo 1997; Moore 1975) and is well adapted to a variety of environmental conditions worldwide. Canada thistle was most likely introduced to North America as a contaminant of cereal crop seeds in the late 16th or early 17th century (Donald, 1994). Once established, C. arvense is problematic on account of several factors that enable C. arvense to directly outcompete native species and take full advantage of disturbed, nutrient-rich environments such as agricultural fields. A deep-rooted perennial root system gives Canada thistle a competitive edge over other species (especially annuals) whose root systems might be limited to extracting water from more shallow depths (Donald 1994). Debris of Canada thistle is also reported to produce allelopathic chemicals that have negative effects on other plants (Donald 1994; Chao et al., 2005). Finally, rapid propagation and range expansion are achieved in Canada thistle through a variety of reproductive strategies. Adventitious root buds arise from thickened, propagative taproots to form new shoots, facilitating continuous local expansion (Donald 1994). The resulting ‘daughter’ shoots generally elongate immediately upon emergence from the soil and do not form rosettes, unlike true seedlings (Donald 1994). A potentially interesting finding with respect to the trade-off hypothesis is that the rate of range expansion caused by adventitious root growth is 4-5 times faster in introduced as compared to native ranges (reviewed in Donald, 1994).  !  &!  ! ! ! Control methods most commonly involve mechanical and chemical removal (Evans, 1984). Attempts at biological control have been problematic because many of the pathogenic microorganisms that attack Canada thistle also tend to attack crop plants (Donald, 1994). Although several potential native biological control agents have been identified, they have not been widely used out of concern for native thistle species. The extent to which C. arvense has infested agricultural crops and rangelands has been quantified in North American and the United Kingdom (Moore, 1975; Donald, 1994), where 8% of pastures were seriously infested (reviewed in Donald, 1994). The occurrence of dense patches of Canada thistle has also been correlated with the establishment of other weedy plants.  1.3 Weed Genomics and Thesis Goals Genomic-based approaches for weed science have been proposed for some time (Basu et al., 2004; Chao et al., 2005), but have only recently become cost-effective with the advent of second-generation sequencing technologies (Shendure and Ji, 2008; Stewart et al., 2009). By simultaneously revealing genome-wide patterns of gene expression and sequence variation, highthroughput sequencing of cDNA (RNA-Seq) is used here to investigate the genetic nature of gene expression divergence between native and invasive populations. By comparing allelespecific expression between parental and F1 hybrid generations, inferences are made as to the number and types of genes involved in adaptation of an invasive species. This project was part of a larger collaborative effort that used a multifaceted approach to characterize the types and numbers of genes involved in evolution of Compositae weeds. Other aspects of the larger project include mapping experiments that correlate segregating markers with ‘weedy’ phenotypes and microarray experiments to assay for expression differences. The goal of this study was to determine the genetic basis (cis- or trans-) of regulatory changes underlying gene expression divergence between native and invasive populations of C. arvense. A subset (1%) of genome-wide expression data was scored for allelic hybrid imbalance in order to determining whether cis- or trans-acting changes are responsible for gene expression divergence. Allele specific expression was also measured for 7 of 63 known stress-related genes. These results have implications in the context of natural selection, which is expected to drive gene expression evolution primarily through cis-acting changes because trans-acting changes are more likely to have widespread pleiotropic effects (reviewed below) (Prud’homme !  '!  ! ! ! et al., 2007; Whitehead and Crawford, 2006b). It is therefore important to characterize how changes in transcriptional regulation arise (in cis or trans) in order to better understand how natural selection and neutral forces influence gene expression divergence (Whitehead and Crawford, 2006a).  1.4 The Transcriptional Regulators of Gene Expression, Cis and Trans Regulation of gene expression is achieved at the level of transcription by interactions between cis- and trans- regulatory elements (Figure 1.1) (Tautz, 2000). Cis-regulatory elements are typically short stretches of DNA that act as sequence-specific binding sites for diffusible regulatory proteins; these can be further characterized as enhancers or silencers based on whether transcription or trans-acting factor binding results in activation or suppression of gene expression, respectively. Although cis-elements usually lie immediately upstream or downstream from the genes whose expression they govern, they have also been found in introns and even occasionally in exons (reviewed in Landry, 2007). Cis-regulatory modules have, in extreme cases, been found to govern expression of genes up to 100kb away due to chromatin looping that can bring otherwise distant loci close together (Stern and Orgogozo, 2008). Epigenetic factors can also affect the expression of nearby genes and so they must also be considered cis-acting. Cytosine methylation and histone modification are two epigenetic factors that have been shown to influence transcription (reviewed in Zhang and Borevitz, 2009; Tung et al., 2009). Trans-regulatory elements include transcription factors with direct DNA-binding activity as well as regulatory proteins further upstream in signaling cascades, including those that detect environmental cues (Tautz, 2000; Tirosh et al., 2009).  !  (!  !  !  !  Figure 1.1 Interactions between cis and trans factors regulate transcription. By binding to cis-regulatory elements and each other, trans-elements fine-tune levels of transcription by influencing the binding affinity of RNA polymerase to a gene’s promoter (A). Loss of a cis x trans interaction causes changes in transcriptional activity (B). Changes in cis- and/or trans-regulators are ultimately responsible for the new expression profile (C and D, respectively).  Cis-regulatory elements of many different types can function independently of one another or in any number of combinations to result in a wide range of achievable spatial and temporal patterns of gene expression (Tautz, 2000). This ‘modular’ nature facilitates a wide range of achievable expression patterns ranging from simple on/off regulation typical of prokaryotes, to more complex expression patterns observed in multicellular eukaryotes (Wray, 2007). The set of available transcription factors also varies widely between different environmental and cellular conditions, leading to even greater possible variety of gene expression profiles between tissue types or stages of development (Tautz, 2000). Differential regulation of gene expression is crucial for tissue differentiation within complex organisms and appears to be the major driver of morphological and physiological change between species (Carroll, 2005). !  )!  !  !  !  1.5 Estimating the Roles of Cis and Trans on Gene Expression Divergence  Gene expression variation has been documented within and between many species and has been linked to adaptive evolution and speciation (Hoekstra and Coyne, 2007; Wray, 2007; Stern and Orgogozo, 2008). However, the nature of the genetic changes underlying cases of regulatory divergence has only recently received attention. Several groups investigating gene expression evolution have gained insight into the extent to which cis- and trans- regulatory changes are responsible (Fay and Wittkopp, 2008). Most approaches use measurements of allele-specific expression to make inferences about the relative contributions of cis- and trans-regulatory change, although multi-locus mapping has also been used to this end. Similar trends emerging in multiple organisms suggest that similar evolutionary mechanisms underlie their divergence (Stern and Orgogozo, 2008). The effects of cis and trans can be parsed with transcription level data using departures from additivity in the hybrid (Lemos et al., 2008). gene expression changes due to cis-regulation show higher additivity whereas variation due to trans-regulation tends to deviate from additivity because of the nature of dominance interactions (Lemos et al., 2008).  1.5.1 Evidence from Multi-Locus Mapping Studies Quantitative trait locus (QTL) mapping allows genetic loci underlying a particular phenotypic trait to be identified by making associations between segregating genetic markers and a particular fitness-related trait (Storz, 2005). Expression quantitative trait loci (eQTL) mapping is a variation of this technique where the level of mRNA is the phenotypic variable, typically measured by microarray (Williams et al., 2007). Transcriptional activity itself can be considered a quantitative trait because it is under the control of several genetic loci (genes encoding transacting factors as well as their local cis-targets) (Gibson and Weir 2005; Gilad et al., 2006). eQTL studies were the first to globally examine cis- and trans- acting effects on gene expression evolution in Drosophila, yeast, mice, humans, and maize (reviewed in Whitehead and Crawford, 2006b). Williams et al., (2007) provides a good overview of the eQTL method and studies that pioneered its use for large-scale analysis of allele-specific expression.  Using this approach, cis-regulatory changes in FLOWERING LOCUS T between recombinant inbred Arabidopsis lines were shown to account for variation in flowering times (Schwartz et al., !  *!  ! ! ! 2009). This is another example of cis-regulatory evolution underlying adaptively significant regulatory divergence. However, between inbred lines of maize, regulatory divergence was driven predominantly by changes in trans (Swanson-Wagner et al., 2006). Overall there were fewer changes in cis, but these tended to have larger effects than those in trans (SwansonWagner et al., 2006).  1.5.2 Evidence from Studies of Allele Specific Expression Several different approaches have been used to study allele-specific expression at scales ranging from small gene sets to genome-wide assays. Allele-specific expression, inferred through transcript abundance and sequence variation, was first studied for a set of genes in Drosophila using a gene-by-gene pyrosequencing approach (Wittkopp et al., 2004). Other small-scale methods for detecting allele-specific expression have included qRT-PCR and use of single-base extension of a primer adjacent to known SNPs (Wittkopp et al., 2005; Zhuang and Adams, 2007).  Microarray-based studies, high-density SNP-tiling arrays (Graze et al., 2009), and second generation sequencing technologies have facilitated detection of allele-specific expression on a larger scale (Springer and Stupar, 2007). Most recently, genome-wide scale allele-specific expression has been studied by high-throughput sequencing of cDNA (RNA-Seq) in humans, Drosophila, and yeast (Serre et al., 2008; Wang et al., 2008; Tirosh et al., 2009). One major advantage of this approach over eQTL mapping is that it does not require knowledge of any sequences a priori, making it useful for studying the effects of cis- and trans- in non-model genetic organisms (Landry et al., 2007). Screens for differential allelic expression have been used to study gene expression evolution in humans, mice, Arabidopsis, Drosophila, poplar, maize, and yeast (Denver et al., 2005; Gruber and Long, 2008; Zhuang and Adams, 2007; Springer and Stupar, 2007; Wittkopp et al., 2004, 2008a, 2008b; Serre et al., 2008; Guo et al., 2008; Zhang and Borevitz, 2009; Chang et al., 2008; Sung et al., 2009).  By comparing allele-specific expression of F1 hybrids with their parents, the contributions of cis and trans regulatory change underlying each case of expression divergence can be identified (Wittkopp et al., 2005; Landry et al., 2007). When two different parental alleles are exposed to a !  +!  ! ! common set of trans-regulatory factors in the hybrid, any remaining differential allelic  !  expression must be due to cis-regulatory differences encoded near the allele itself (Figure 1.2). Gene expression divergence that cannot be assigned to cis-regulatory differences can be inferred to be due to trans-acting variation (Wittkopp et al., 2004). This approach identifies target genes controlled in cis and trans but does not identify the controlling trans loci, unlike eQTL studies.  Figure 1.2 Allele-specific expression reveals the effects of cis and trans. Mutations in a cisregulatory element (star) generally affect the expression of only nearby genes on the same chromosome, whereas trans-acting changes act on both target alleles in a diploid cell. In the hybrid, both parental alleles (distinguished by polymorphisms ‘G’, and ‘T’) are exposed to the same set of transcription factors and any differential allelic expression can be attributed to cis-acting variation.  In a survey of allele-specific expression between closely related Drosophila species, D. melanogaster and D. simulans, 27% of differentially expressed genes had significant cis- effects while only 16% had only trans- effects and just 4% of were influenced by both cis and trans (Graze et al., 2009). These genome-wide results agree with earlier studies of allele-specific expression in Drosophila (Wittkopp et al., 2004) and between species of yeast (Tirosh et al., 2009). A major role for cis-regulation has also gained support from interspecific comparisons in both poplar and maize (Zhuang and Adams, 2007; Springer and Stupar, 2007). Interestingly, most of the trans-acting change detected in these and other studies had only minor overall effects on gene expression divergence whereas the magnitude of expression divergence caused by cis effects tended to be greater (Wittkopp et al., 2004, 2008a; Tirosh et al., 2009; Birchler and Veitia, 2009). However, comparisons of allelic expression between another pair of Drosophila !  ,!  ! ! ! species, D. melanogaster and D. sechellia, revealed that trans-regulatory changes accounted for more expression divergence than cis-regulatory changes (McManus et al., 2010), an unexpected result that might be explained by incompatibilities arising upon hybridization of divergent regulatory networks (McManus et al., 2010). In contrast to the predominance of cis-divergence between species, studies looking at withinspecies variation generally found similar numbers of cis- and trans-acting changes (Wittkopp et al., 2008b; Wang et al., 2008). Wittkopp et al., (2008b) found that trans-variation accounted for most of the regulatory changes while cis-regulatory differences accounted for just 35%. Transacting effects also prevailed in a study between Drosophila lines that differed at only one chromosome, where only 14% of differentially expressed genes were found to have a cisregulatory basis (Wang et al., 2008). A recent microarray-based analysis of allele-specific expression between ecotypes of Arabidopsis thaliana found that 27% of the differentially expressed genes were caused by only cis-variation and 29% were caused by only trans variation (Zhang and Borevitz, 2009). Trans-variation has also been found to be more important than cisvariation in explaining gene expression differences between strains of yeast, although cisregulatory changes also contributed slight effects to the majority of cases (Sung et al., 2009).  Taken together, these studies show that cis-regulatory changes account for more of the divergent expression observed across larger evolutionary distances (Wittkopp et al., 2008a; Stern and Orgogozo, 2008). Confirming this trend, one study that included both within- and betweenpopulation comparisons in the same experiment found that cis-regulatory changes accounted for 65% of regulatory divergence between species but only 35% within species (Wittkopp et al., 2008b).  1.5.3 Hybrid Effects Complicate Inference of Cis and Trans Interpretation of the relative cis- and trans- contributions to gene expression variation can be complicated in cases where sufficiently divergent regulatory networks conflict in the hybrid (Comai et al., 2003; Landry et al., 2007). Hybrid effects can arise as a consequence of lineagespecific cis-regulatory changes being counteracted by compensatory trans-regulatory changes such that the overall level of gene expression remains unchanged (Landry et al., 2007). Upon !  $-!  ! ! ! hybridization, reuniting these modified regulatory components can lead to novel interactions between different trans factors that never took place in either parent alone.  Hybrid effects become apparent in allelic expression studies when the level of gene expression in hybrids is non-intermediate to (i.e. more extreme than) parental levels, but they may also remain undetected in the mid-parent range and introduce bias in the interpretation of the cis- and transeffects (Landry et al., 2005). A portion of the inferred trans-effects may include new interactions that only arise due to regulatory incompatibilities in the hybrid and do not actually contribute to regulatory divergence between parents (Zhang and Borevitz, 2009; Graze et al., 2009). These hybrid-specific incompatibilities may include novel protein-protein interactions (trans x trans) or chromatin remodeling and other epigenetic changes (cis x trans) (Zhang and Borevitz, 2009; Elena and Lenski, 2003). Using comparisons between Drosophila lines that differed by only one X chromosome, Wittkopp et al., (2008b) tested for epistatic interactions between cis- and trans-factors by looking at allele-specific expression in different genetic backgrounds. While they did not find statistical support for the occurrence of epistatic cis x trans interactions, trans x trans interactions were much more common (Wittkopp et al., 2008b). However, coevolution of cis- and trans-factors has been identified between strains of Drosophila (Landry et al., 2005) as well as in yeast (Tirosh et al., 2009). The evidence for significant trans- variation within populations prompted a search for a genetic explanation. When comparing gene expression patterns between homozygous lines as well as heterozygous lines, Lemos et al., (2008) found that most of the expression differences between homozygous lines disappeared in the heterozygotes. This led to the conclusion that otherwise deleterious trans-acting mutations may be hidden from selection when a second, dominant allele is present (Lemos et al., 2008). In contrast, new cis-regulatory mutations are not subject to dominance interactions (they cannot be ‘masked’ by another allele) and so, by their nature, are constitutively exposed to selection. This supports widespread evidence from studies of allele specific expression that gene expression levels in the hybrids of between-species comparisons generally behave additively (Graze et al., 2009). It also explains why cis effects are typically stronger between species than trans. Although most cases of regulatory incompatibilities have been documented in crosses between species (Landry et al., 2005), non-additivity has also been found in F1 hybrids from inbred Drosophila lines and, most recently, in hybrids of Arabidopsis ecotypes (Gibson et al., 2004; Zhang and Borevitz, 2009, respectively). In general, regulatory !  $$!  ! ! ! incompatibilities appear to be less common within than between species (Wittkopp et al., 2008b; Landry et al., 2007).  1.6 What Conditions Favour Cis- vs Trans- Regulatory Evolution?  Regulatory divergence can occur as a result of local regulatory changes that modulate the expression of that individual gene (cis), or as a consequence of changes in upstream regulatory proteins belonging to a common pathway (trans) (Carroll, 2005). But under which conditions do these different mechanisms contribute to regulatory divergence? Several studies have shown that the answer depends on the relative probabilities of mutations arising in cis- and trans- as well as their subsequent exposure to, and strength of, selection (Stern and Orgogozo, 2008; Gruber and Long, 2008). These factors result in different rates of cis- and trans-regulatory evolution.  1.6.1 Mutational Target Size The relative probability of a mutation arising in a cis- or trans- regulatory element is proportional to the amount of DNA under functional constraint (Cloonan and Grimmond, 2008; Stern and Orgogozo, 2008). Although non-protein coding sequences make up the majority of most eukaryotic genomes, only a small fraction is known to encode functional cis-regulatory elements (ENCODE, 2004; Andolfatto, 2005). In Drosophila, 19% of the genome encodes protein while only 3% has been shown to have cis-regulatory function, but is unclear how many cis-regulatory elements have yet to be characterized (Stern and Orgogozo, 2008). The average length of a protein-coding gene in humans is 3000bp, but the proportion of these site are under functional constraint is not usually well defined. In contrast, the average length of functionally constrained sequence in an individual cis-regulatory motif is just 7-15bp (Gruber and Long, 2008). In general, there are more opportunities for mutations to cause trans- rather than cisregulatory changes.  Mutational target sizes also differ among classes of genes. Genes with highly complex expression profiles, such as transcription factors, are typically regulated by a larger number of !  $%!  ! ! ! cis-regulatory elements (reviewed in Wittkopp 2005). Similarly, there are more opportunities for a trans-acting mutation to arise in genes under the control of many trans-regulators, which also applies to regulatory genes with complex expression patterns (Stern and Orgogozo, 2008). Based on the higher probabilities of both cis- and trans-mutations occurring, we should expect expression divergence more often in regulatory proteins than in other classes of genes.  1.6.2 Different Constraints on Cis-regulatory and Coding Regions Ultimately fitness costs determine the scope of cis- and trans- regulatory changes that are tolerated by organisms. Although there may be more opportunity for mutations to occur in protein-coding regions on account of the larger mutational target size, these tend to have higher fitness costs and are usually subject to stronger purifying selection (Stern and Orgogozo, 2008). While cis-regulatory changes tend to have subtle effects on the expression profiles of individual genes, mutations in regulatory proteins often affect the expression of multiple genes under their control (Wray, 2007). Network architecture also influences the degree of selective constraint. The number of possible trans-mutations affecting gene expression is related to the number of gene products involved in regulation of a particular gene, where more complex regulatory interactions means more opportunities for trans-acting mutations to arise (Inga et al., 2002; Wittkopp, 2005; Birchler and Veitia, 2009). The ‘gene balance hypothesis’ predicts that gene expression is controlled by a large number of genes (acting in trans) and that dosage-dependent interactions are crucial for many processes (Birchler and Veitia, 2001; Birchler and Veitia, 2009). Therefore, because regulatory proteins typically regulate the expression of multiple other genes, changes in their activity or abundance will often have deleterious pleiotropic effects that prevent the fixation of a trans-mutation.  Regulatory networks involved in different processes vary in their potential for pleiotropic effects (Wittkopp, 2005). Genes involved in essential processes, such as cell division, tend to belong to highly interconnected regulatory networks, while genes involved in stress response triggered by an external cue generally contain fewer regulatory proteins (Inga et al., 2002; Wittkopp, 2005). The position of a gene within a regulatory pathway also influences the extent of its pleiotropic constraint (Wittkopp, 2005; Stern and Orgogozo, 2008). Studies have found that differential !  $&!  ! ! ! expression of a highly pleiotropic transcription factor involved in metabolism was caused by cisvariation whereas expression divergence in more downstream target genes was driven more often by trans-acting changes (Prud’homme et al., 2006; Chang et al., 2008). Interestingly, Tirosh et al. (2009) found that trans-effects caused more changes in sensory molecules than in transcription factors that acted more downstream in a pathway. This might arise from changes in the activity or abundance of a terminal sensory protein that leads to differential interpretation of an environmental signal, resulting in a concerted effect on each gene in the regulatory pathway. In contrast, changes in transcriptional regulators that only affect a subset of genes are more likely to lead to deleterious gene dosage imbalances (Tirosh et al., 2009).  Some mechanisms of genome evolution are predicted to reduce the pleiotropic consequences associated with trans-regulatory change. In both gene duplication and alternative splicing, the ancestral function of a gene can be maintained while providing genetic novelty that can have important phenotypic consequences. The genetic redundancy that accompanies these events may provide unique opportunities for trans-acting variation to contribute to regulatory evolution by avoiding constraints that might otherwise be imposed by gene dosage requirements (Birchler and Veitia, 2009).  Cis-regulatory elements are more tolerant of accumulating mutations because they typically control a subset of the expression profile of only one gene (Tautz et al., 2000). As a consequence, changes in cis tend to have less pleiotropic effects (and hence lower fitness costs) than those in trans and are therefore more likely to favoured and contribute to adaptive expression divergence (Carroll, 2005; Jordan et al., 2004; Hoekstra and Coyne, 2007). However, cis-regulatory changes can also have pleiotropic consequences if the expression of neighbouring genes is affected. For example, cis-variation genes are more commonly found in highly polymorphic pericentromeric regions than in gene-dense regions where the misregulation of neighbouring genes is more likely to result (Zhang and Borevitz, 2009; Tung et al., 2009). Another correlation made with epigenetic patterns shows that genes with significant cisregulatory variation tend to have a high level of methylation across the promoter region, frequently associated with transcriptional repression, while genes controlled in trans did not share this pattern of methylation (Zhang and Borevitz, 2009).  !  $'!  ! ! ! On account of their different pleiotropic tendencies, discovering whether cis- or trans- regulatory change predominates can be used as a proxy to infer whether expression divergence has occurred as a consequence of selection (Stern and Orgogozo, 2008). When differential expression is caused predominantly by cis-regulatory changes, that gene is likely under differential selective pressures (i.e. locally adaptive) between the two parental environments. In contrast, if trans effects are found to play a larger role, differential expression is more likely to be a by-product of selection on an upstream regulatory gene on account of their more widespread pleiotropic effects (Wittkopp, 2005; Stern and Orgogozo, 2008). Gene expression divergence caused by transregulatory mutations are therefore less likely to have direct adaptive significance. While a component of gene expression evolution can be attributed to neutral processes, selectively driven expression divergence appears to act primarily through cis-, rather than trans-, regulatory changes (Fay and Wittkopp, 2008; Gilad et al., 2006; Whitehead and Crawford 2006a).  1.7 Summary of Cis and Trans Regulatory Evolution Cis- and trans-regulatory changes both contribute to expression divergence and a trend is emerging that predicts the type of change that is expected to prevail (Stern and Orgogozo, 2008). Although trans-acting mutations seem to arise more frequently than cis-acting mutations (due to aforementioned differences in mutational target size), cis-regulatory changes generally underlie most expression divergence between species because evolution favors fixation of cis-regulatory mutations with more subtle effects (Stern and Orgogozo, 2008). Trans-effects account for more of the variation in gene expression within Drosophila species (Wittkopp et al., 2004, 2008b; Stern and Orgogozo, 2008) and play a larger role in expression divergence for genes with reduced pleiotropic consequences, such as those at terminal positions within regulatory networks (Stern and Orgogozo, 2008). By examining the specific genetic causes of expression divergence, this research will expand our understanding of the molecular basis of adaptation and builds genomic resources that may contribute to the development of more effective methods of weed control. It also provides insight into the extent of sequence and expression divergence between native and invasive populations of Cirsium arvense.  !  $(!  !  !  !  2. METHODS 2.1 Plant Materials A total of 258 C. arvense were raised from seed in preparation for crosses. Dr. Alessia Guggisberg collected these seeds in the summer of 2008 from three native and three invasive populations from areas of Europe and North America, respectively (Table 2.1). Vouchers are available in the Rieseberg lab at the UBC Biodiversity Research Centre in Vancouver, BC. Prior to germination, seeds were examined for defects, scarified, placed in a petri dish on damp Whatman filter paper, and moistened with 1% PPM biocide solution to prevent fungal growth. The presence of fungus slowed germination in a few cases but was largely avoided. Petri dishes were incubated at 28oC (22oC when dark) in a germination chamber running at a16 hour photoperiod and were monitored daily for radicle emergence. Germination rates varied across populations, ranging from 30-90%. No significant trend was observed between native and invasive populations. After 1-2 cm of root growth, newly germinating seedlings were potted in soil , comprised of 75% regular peat moss and 25% perlite (pH of 5.5-6.5), and moved to an Adaptis growth chamber set for 23oC and a 16-hour photoperiod. Once the first two true leaves formed (typically after 7-10 days), seedlings were re-potted individually (one per 10cm x 10cm pot) and transferred to a flood table at the UBC Horticulture Greenhouse. Here plants from all populations grew side-by-side in a common garden under a 16-hour photoperiod, maintained by heat lamps, and reaching temperatures up to 28oC. Appropriate biocontrol agents were applied to prevent against damage caused by known pests. Elongation of the flower-bearing axis generally occurred after 7-8 weeks, at which point the unopened flower heads, or capitula, were enclosed in mesh (Organza 3" x 4") bags. Each plant was also covered with a perforated, large plastic bag to prevent winged insect pollinators from landing directly on the mesh bags. At this time, each flowering stalk was tied to a bamboo stake for support in an effort to prevent them from falling over. Each individual C. arvense was classified as male or female based on the presence or absence of pollen on the styles of newly opened flowers; pollen ends up on styles because it is swept out of the pollen tube during flowering. Subtle sex-specific differences in capitulum shape can help to predict if an individual is male or female; unopened female heads tend to be more slender than those of males. However, the most reliable way to determine sex is based on identification of functional (i.e. !  $)!  ! ! ! pollen-producing) stamen, since both male and female C. arvense have vestigial reproductive organs of the other sex. Population-specific observations indicate a strong female bias in some populations (280808) but a more equal distribution in others (King 129) (Table 2.1). This is consistent with observations by Lalonde and Roitberg (1994). However, there was no general trend observed between sex ratio and population origin (native vs invasive). The combined sex ratio of the three invasive populations, King129, AU-ON, and KN-ON was 3.38:1 female biased, similar to the combined sex ratio of their native partner populations, 310808, 17808 and 280808 at 3.35:1 (Table 2.1). An overall female bias of 3.36:1 was observed across all populations. None of the individuals characterized as males set seed despite reports of hermaphroditic individuals in some wild populations (Donald 1994).  Table 2.1 Number of plants grown from native and invasive populations. The number of male and female individuals are shown for each population.  2.2 Crosses Population pairs were selected and crossed on the basis of similarity of 19 climatic variables as well as seed availability so as to minimize the influence of environment-of-origin differences on population-specific differences in gene expression. Native populations were chosen that best represent the progenitors of the invading populations based on phylogeographic information as well as ecological similarity. Variables included annual means (e.g., temperature and precipitation), seasonal ranges, and extreme or limiting environmental factors, as recorded by weather stations between 1950 and 2000. Male and female plants from each of the six populations were used in three pairwise crossing schemes:  !  $*!  !  ! KN-ON (Ontario) with 280808-2 (Romania) AU-ON (Ontario) with 170808-1 (Italy) King 129 (Alberta) with 310808-1 (Hungary)  !  These population pairs were among the best matched, with euclidean distances 7.48E-01, 7.02E01, and 8.29E-01, respectively (A. Guggisberg, unpublished data). Male and female heads opening simultaneously in both partner populations of a pair were crossed pairwise, re-bagged, and labeled. For each cross, flower heads were manually pollinated using a small paintbrush or by direct contact between male and female heads. Stigmas were generally receptive for only 3-5 days, but plants were regularly pruned in order to stimulate continuous blooming. Crosses were continued throughout flowering until senescence. Reciprocal crosses could have been done to tease out the influence of parent of origin effects inheritance on gene expression, indicated by expression biases specific to one direction or the other. The decision not to use reciprocal crosses came down to a cost/benefit analysis. One group that tested for parent-of-origin effects in Maize found no cases of biased allelic expression due to parent-specific inheritance in seedlings (Springer and Stupar, 2007). Another group studying cis and trans-regulatory variation in Drosophila found an absence of allele-specific, parent-of-origin effects in multiple experiments (Wittkopp et al., 2006; Wittkopp et al., 2008). Additionally, because Canada thistle is dioecious, it would not be possible to use the same individual as both a mother and a father in a set of reciprocal crosses. Fruits fully matured roughly two weeks after pollination and were evident by the characteristic drying-out and release of the achenes from the capitulum. A fluffy pappus developed in fertilized heads that did not occur in 3 unfertilized, negative controls. This evidence for the successful prevention of unintended crosses ensures the accurate identity of individual parents used in the test of hybrid allelic expression. Filled, seed-bearing achenes were collected and cold treated for 10 days at 4oC to break dormancy in order to increase germination success (as in Lalonde and Roitberg, 1994). Seeds from multiple crosses were sown, in petri dishes on filter paper (as for parents), with the goal of achieving at least fifty hybrid seedlings from a single cross. Once roots formed, these F1s were transferred to a growth chamber where they remained at 23oC (16 hour photoperiod) until harvested.  !  $+!  ! ! ! The ability of C. arvense to reproduce clonally was exploited in order to collect tissue from similar developmental stages in the parents and hybrids, helping to control for stage-specific changes in gene expression. Parental genotypes were propagated clonally via buried root fragments (2-3cm long), collected from the original parents used in selected crosses. The resulting plantlets were placed in the growth chamber alongside their F1 hybrid offspring. Once 4-6 true leaves emerged, seedling leaf tissue was collected, labeled, and placed on liquid nitrogen to minimize RNA degradation. Tissue was collected from parental clones as well as over 60 hybrid individuals for each of three crosses, (A, T and L1, Table 3.1) and stored at 80oC. All tissue was collected during the same time of day in order to to control for photoperiod effects on differential gene expression.  2.3 RNA Extractions and DNase-I Treatment Total RNA was extracted from parental and hybrid leaf tissues from cross A. A phenol-based (TRIzol) protocol for RNA extraction was used, followed by treatment with RNase-free DNase I (Qiagen). One hundred mg of tissue was sufficient to produce the miniumum desired yield of RNA (10ug per extraction). Separate extractions were performed for each of 60 F1 hybrids and these were pooled to make two biological replicates (each of n=30). See Appendix for details on RNA concentration and quality for each hybrid added to pool1 and pool2 as well as overall pool statistics (Tables A.1 and A.2). Quantity and quality of RNA in samples have both been shown to have a big impact on the success of sequencing-based transcriptome profiling experiments (Marioni et al., 2008). A Nanodrop D-1000 spectrophotometer was used to calculate RNA concentrations and quality from optical density measurements for each extraction. The final RNA concentrations of the male, female, and hybrid pools 1 and 2 were 531.6 ng/uL, 539.5 ng/uL, 396.6 ng/uL, and 394.4 ng/uL, respectively. Appropriate 260/280 and 260/230 ratios (1.8-2 and > 1.8, respectively) indicated that RNA was free of proteins, phenol, and other contaminants (Tables A.1 and A.2). Gel electrophoresis further confirmed the absence of genomic contamination, with the presence of clear RNA bands corresponding to highly expressed ribosomal subunits, the integrity of RNA for each extraction.  !  $,!  !  !  !  2.4 cDNA Library Preparation and Illumina Sequencing DNase-I treated RNA from the four samples (each parent and two pooled hybrid replicates) were couriered overnight to the Indiana University Sequencing Centre. The cDNA library construction for Illumina sequencing was done directly by technicians at the sequencing centre according to the manufacturer’s instructions. Library preparation involved numerous steps to convert RNA into small fragments of DNA that can be sequenced by the Illumina platform. First, RNA is enriched for poly-(A) transcripts by removing those that fail to hybridize to oligodT beads. The mRNA is then fragmented and reverse transcribed into cDNA using random or targeted primers. Following adapter ligation, size-selection from a gel, and PCR enrichment, each of the cDNA libraries was sequenced separately in a flow cell lane on Illumina’s Genome Analyzer II. The latest chemistry allowed for paired-end sequencing, such that 100bp reads are generated from both ends of an mRNA transcript (2x100bp).  2.5 Reference Transcriptome – Raw Data, Filtering and Assembly Because de novo assembly of short Illumina reads is difficult at best, an alternative approach was used in which tags are mapped to a known reference sequence. A suitable reference transcriptome was assembled using sequence reads generated by Roche 454 high-throughput pyrosequencing of three different cDNA libraries from three different tissue types, flower, leaf, and root, from a single invasive genotype of Cirsium arvense. The leaf and flower datasets were available from previous work while collaborators with the USDA (Horvath et al., unpublished data) provided the root data. The total number of unfiltered raw reads in each library was: 609,458 (leaf), 565,129 (flower), and 636,795 (root). Flower and leaf libraries were normalized but the root data had highly uneven read distribution. Cleaning of the three cDNA libraries was achieved using SnoWhite version 1.1.2. SnoWhite is a pipeline of existing programs, Seqclean and TagDust (Chen et al., 2007; Lassmann et al., 2009), as well as custom scripts designed to filter 454-generated sequence data prior to assembly (Dlugosch & Rieseberg, in prep). Filtering steps implemented by SnoWhite included the trimming of uninformative poly-(A/T) tails and the removal of 10-12bp from the beginning of each read that correspond to adaptors used in the sequencing reactions. Additional steps included screening and removal of reads that contain primers or vector contamination that may !  %-!  ! ! ! have been introduced during cDNA normalization. The number of reads eliminated during cleaning steps for the leaf, flower, and root libraries was 5,740 (0.9%), 8,837 (1.6%), and 20,686 (3.2%), respectively. The latest version of SnoWhite can be downloaded at Datasets containing extremely highly expressed loci, such as those present in the non-normalized root dataset, slow the assembly process unnecessarily since our goal at this stage was only to generate a consensus reference transcriptome for C. arvense. Wcd, an open source tool for clustering ESTs, was used here to bin sequence reads by cluster size (Hazelhurst et al., 2008). Reads belonging to highly expressed loci were not included in subsequent assemblies of the root data. Reads from all three tissue-specific datasets were assembled individually as well as in combination using MIRA3 (Chevreux et al., 1999, 2004), a short read assembler designed for Sanger, 454, and Illumina sequence reads. The output from MIRA3 was then fed into a second assembler, CAP3 (Huang and Madan, 1999). CAP3 excels at handling longer reads and compensates for inadequacies of MIRA3 in assembling EST data, which tends to have unequal read distribution.  2.6 Quality Confirmation of the Reference Transcriptome Assembly The output assembly files from CAP3 were fed into DupPipe, a program that calculates the age distribution of gene duplication events from genome-wide EST data (Barker et al., 2008). This acted as a quality measure to check that close paralogs are not overassembled into a single contig, while ensuring that different alleles of the same gene are appropriately mapped together. Histograms were generated for assembly of each tissue-specific library as well as the combined datasets representing the C. arvense transcriptome.  2.7 Handling of Raw Illumina Data and Quality Trimming Raw Illumina reads were downloaded from the Indiana University sequencing centre by ftp transfer. 41,918,620 reads (20,959,310 pairs) were acquired from the male library, 41,240,328 !  %$!  ! ! ! (20,620,164 pairs) from the female library, and 41,316,322 (20, 658161 pairs) and 40,951,242 (20,475,621 pairs) from hybrid libraries 1 and 2, respectively (Table 3.2). Trimming reads to remove low-quality stretches is especially important for next generation sequencing data, where base-calling towards the 3’ read ends is increasingly prone to errors. For more recent Illumina fastq files, a quality score of "2" (B) indicates the beginning of a low quality stretch rather than the quality of an individual base. Because of the rapid advancement of sequencing technologies, many short read alignment programs do not recognize this signal. To this end, a custom python script was designed to recognize these signals and trim reads accordingly prior to assembly.  2.8 Mapping Illumina Sequence Reads Burrows Wheeler Aligner (BWA), a fast and accurate short read aligner (Li and Durbin, 2009), was used to map Illumina-generated sequence tags to the newly assembled reference C. arvense transcriptome. The short read algorithm of BWA version 0.5.7 was used and an optional parameter was implemented to trim the reads at the 3’ ends to remove low quality bases. Four alignment output files were provided in SAM (standard alignment/mapping) format. Common normalization procedures of RNA-Seq data include scaling gene counts by lane totals (Marioni et al., 2008; Mortazavi et al., 2008), and more recent quantile-based procedures (Bullard et al., 2010). However, normalization was not used in this study because it is generally not required when making comparisons between datasets with similar total number of reads (Bullard et al., 2010). This is largely attributed to negligible technical variation across lanes and deep sampling.  2.9 SNP Calling and Expression Analysis using SAMtools Information on relative expression levels and patterns of SNPs was acquired through the use of SAMtools, a set of programs to analyze and manipulate alignment files in SAM format (Li et al., 2009). Homozygous, parent-specific SNPs were identified using SAMtools pileup function and used to determine the origin of reads (maternal/paternal allele) in the hybrid pools. Custom perl scripts were written to pull out this data genome-wide. SNPs were identified as single base differences between the Illumina-generated reads in the hybrid dataset and the 454-generated reference transcriptome. Transcribed polymorphisms represent true genetic variation as opposed !  %%!  ! ! ! to technical (base-calling or mapping) errors, distinquished by their intermediate (as opposed to low) frequencies within populations. SAMtools pileup format facilitates computational parsing of genome-wide mapping results, where each line provides sufficient data to detect hybrid allelic imbalance for a given SNP site. SAMtools also has an internal contig viewer (tview) that was used to visualize alignment results (Figures 3.7-3.10). SAMtools pileup function outputs a tab-delimited file useful for calculating allelic frequencies computationally. In pileup format, the first column corresponds to a single base pair coordinate on the reference sequence. Other columns provide useful information such as the number of reads mapped, quality of consensus and SNP calling (phred scores), mapping quality (root mean squared), as well as the identity and quality of each base on each read mapping to that position. One version of the pileup output displays only high quality SNP/Indel sites; this was used to identify SNP sites that were useful to distinguish between parental alleles. The ‘default’ pileup format, in which information on every coordinate (not just SNPs/Indels) is given, was used to calculate the average number of reads mapping per contig. Custom perl scripts were designed to calculate the average expression per contig (i.e. number of reads mapped scaled by read length), as well as to pull out relative SNP frequencies in each of the four datasets (P1, P2, Hyb1, Hyb2). Only SNP sites that were supported by ! 6 informative reads in each of the four datasets were considered further; this was the minimum coverage shown to be sufficient for inferring allelic imbalances using high-throughput sequencing approaches (Fontanillas et al., 2010). Above this threshold, genes were classified into three groups based on their expression level: High (tag count ! 21), Medium (20 ! tag count > 6), and Low (6 ! tag count > 0).  2.10 Identifying Contigs Corresponding to Known Stress-Genes BlastStation-Local was used to perform sequence similarity searches using the NCBI BLAST software (Altschul et al., 1990). The sequences of 63 genes putatively involved in stress response were used as query sequences against the 454-generated reference sequence containing 88,374 contigs. The resulting hit (contig) that gave the smallest E value in a tBlastx comparison was inferred to be orthologous to the given queried stress gene. BlastStation-Local can be downloaded at:  !  %&!  !  !  !  2.11 Cis and Trans Effects Comparisons of allele-specific expression between parental and hybrid datasets, as measured by SAMtools variant pileup format, was used to infer the relative contributions of cis and trans underlying cases of gene expression divergence between parents. Two sets of genes were examined. The sequences of genes differentially regulated under stress conditions, identified in studies of H. annuus and C. arvense (Lai et al., 2006 and 2008; Kane and Rieseberg 2007; M. Mayrose, unpublished) were used as a tBlastx query against the assembled reference transcriptome. The cis/trans test was also applied to a set differentially of expressed genes from the genome-wide dataset. Gene functions were predicted for these genes based on sequence homology to annotated Arabidopsis genes in the TAIR9 database (E-value cutoff of e-10). For cases where parents were homozygous for different alleles, allelic imbalance was determined by comparing the expression of the two hybrid alleles (Invasive - Native). Hybrid allelic imbalance in the mid-parent range was inferred to be due to cis-effects. Trans-effects were inferred by comparing the expression of the two parental genes (Invasive – Native) and subtracting the difference between the corresponding hybrid alleles, such that an equal allele frequency (0.5 A1, 0.5 A2) in the hybrids means that only trans-acting changes have caused expression divergence between parents. Inference of cis and trans effects required at least a 1.25-fold expression difference between parents as per McManus et al. (2010). For each gene, the overall expression level in the hybrid alleles was compared with the expression in the parental genes in order to define cases of hybrid under- and over- expression.  !  %'!  !  !  !  Figure 2.1 Workflow of SNP analysis using variant pileup output from SAMtools.  !  %(!  !  !  !  3. RESULTS 3.1 Reference Transcriptome Assembly MIRA3 and CAP3 assembly of the three 454-generated EST libraries resulted in 88,374 contigs with an average length of 683 bp. The inclusion of reads from root, leaf, and flower libraries ensured the representation of a large number of genes present in the Cirsium genome, including those expressed in a tissue-specific manner. Reads from all three tissue-specific libraries were derived from the same invasive genotype. Contig length distribution for the transcriptome assembly is summarized in Figure 3.1.  Figure 3.1 Cirsium transcriptome contig length distribution. Histogram shows the size distribution of the contigs assembled by MIRA3 and CAP3 using reads available from 454sequencing of Cirsium ESTs. The x-axis bins contigs by length at 100bp intervals. The yaxis gives the total number of contigs falling into each bin. The DupPipe tool was used to verify the quality of the MIRA3- and CAP3-generated transcriptome assembly (Figure 3.2). DupPipe is a pipeline that summarizes the age of gene duplication events from expressed sequence libraries in order to infer gene family phylogenies (Barker et al., 2008). A file containing the sequences of all 88 374 assembled contigs was uploaded to the DupPipe tool online at (Barker et al., in prep). The histogram produced was used to validate the quality of the newly assembled Cirsium transcriptome (Figure 3.2). The strong front peak corresponds to recently duplicated genes in various stages of pseudogenization, and the smaller peak at Ks 0.6 corresponds to gene families born from a !  %)!  ! ! ! Compositae-specific polyploidy event roughly 33 million years ago (Barker et al., 2008). The resolution of both peaks in the Cirsium data indicates no problems with over- or under-assembly, verifying the quality of assemblies output by MIRA3 and CAP3. The absence of either of these peaks would have indicated that reads belonging to different members of the same gene family were mistakingly being mapped to a single locus.  Figure 3.2 Age distribution of duplicated genes in the Cirsium transcriptome. Plotted DupPipe output summarizes the history of gene duplication events from the MIRA3- and CAP3-generated transcriptome assembly.  3.2 Crossing Results Crosses A-U were conducted between flowering male and female plants from partner populations to produce as many seeds as possible (Table 3.1). Crosses producing over 50 seeds remained candidates for tests of allelic expression, since a large number of hybrids is required. Seed yield averaged 30 per capitulum. Sufficient F1 seeds were produced for 12 of the 22 crosses attempted (Table 3.1). However, of these, only crosses A, D, E, H, I, J, L1, and T also achieved clonal reproduction from adventitious root fragments for both parents (Table 3.1). Seeds from four of these crosses, A, H, L1 and T, were germinated to generate hybrid leaf tissue to potentially be used in tests of allelic expression. Although leaf tissue was harvested for all four crosses, cross A was chosen for use in this experiment, in part because of a high germination rate. Unused parental and hybrid tissue from crosses H, L1, and T remains stored at 80oC and available for use in future experiments. RNA was extracted from parents used in cross !  %*!  ! ! ! A (mother from Romania with father from Ontario) as well as from their 60 F1 hybrid offspring (see methods and appendix for details regarding RNA extractions as well as overall pool statistics; Tables A.1 and A.2).  Table 3.1 Summary of 22 pairwise crosses between native and invasive populations. Asterisks (*) indicate crosses for which tissue was collected from parents and hybrids. For each cross, maternal and paternal genotypes are given along with the number of seeds produced and results of parental clonal propagation trials.  3.3 Illumina Sequencing and Mapping Results Sequencing results are shown in Table 3.2. A total of 82,713,256 raw paired-end reads, or 165,426,521 single reads, as obtained from the Indiana University Sequencing Centre by ftp transfer. The relatively even number of reads in each sample removed the need to normalize by total number of reads per lane. Raw reads were trimmed for low quality stretches at the 3’ end using a custom python script. !  %+!  !  !  !  Table 3.2 Summary of Illumina sequence results. The number of reads remaining after filtering/trimming for quality using Mosaik sequence aligner is also shown. Short read aligner, BWA, was used to map paired-end Illumina reads from each of the four datasets to the reference Cirsium transcriptome. Mapping results from a similar alignment program, Mosaik, are summarized in Table 3.3. Across all four read libraries, an average of 83.4% reads mapped to the reference. Orphaned reads are where one read mapped but its paired read did not. A high number (>50%) of reads mapped equally well to multiple locations (one or both mates non-unique), suggesting that the dataset may contain highly expressed paralogs with similar sequences. SAMtools internal contig viewer, tview, helped to visualize the results of mapping Illumina reads to the reference transcriptome (Figures 3.7-3.10). SAMtools pileup function was used to calculate allele-specific expression data for the parental and hybrid libraries.  Table 3.3 Summary of mapping results using MosaikAligner. The number of short Illumina reads and read pairs aligned for each library are shown.  !  %,!  !  !  !  3.4 Gene Expression Analysis The relationship between the relative expression of alleles in the two hybrid pools was analyzed using tests of linear regression (Figures 3.3). The line-of-best fit is summarized in the R2 value, ranging from 0 (no correlation) to 1 (perfect correlation). An r2 value of 0.41 indicated that the data were fairly repeatable between hybrid datasets. Values range from 0 to 1, where 1 indicates perfect correlation of the two datasets. Therefore, if hybrid pool 1 has predominantly native allele expression, so should hybrid pool 2. Linear regression analysis was also used to investigate the relationship between the relative expression of the native allele in hybrid pool 1 and the parents (Figure 3.4). Averaging the allelic expression of two hybrid pools resulted in a better correlation with parents than either hybrid pool alone, as indicated by a higher R2 value (0.09 vs 0.11) (Figures 3.4 and 3.5, respectively).  Figure 3.3 Correlation of relative allelic expression between hybrid pools 1 and 2. Analysis of the relationship between the relative expression of both alleles in hybrid replicate pools 1 (x-axis) and 2 (y-axis).  !  &-!  !  !  !  Figure 3.4 Correlation of the relative allelic expression between hybrid pool 1 and parents. Analysis of the relationship between the relative expression of alleles in hybrid pool 1 (x-axis) and the parents (y-axis).  Figure 3.5 Correlation of the relative allelic expression between hybrid pools and parents. Analysis of the relationship between the relative expression of alleles in the combined hybrid pools (x-axis) and the parents (y-axis). !  &$!  !  !  !  3.5 Allele-Specific Expression of Known Stress Genes The nature of regulatory divergence (cis, trans, cis and trans, or hybrid effects) was inferred for 40 eligible case in the first ~1% of contigs in the Cirsium transcriptome (Table 3.6). Allelic ratios that fell outside the mid-parental range were characterized as hybrid effects. Hybrid effects were further classified based on whether the most abundant allele ‘switched’ between parents and hybrids, or whether the expression imbalance became more extreme in the same direction. The sequences of 63 candidate ‘weedy’ genes emerging from other studies in sunflower and Canada thistle were used as a tBlastx query against the Cirsium reference transcriptome assembly. Suspected weedy genes include those differentially expressed in response to stress conditions as well as those differentially expressed between natural populations (Lai et al., 2006 and 2008; Kane and Rieseberg 2007; M. Mayrose, unpublished data). This is based on the assumption of the trade-off hypothesis that stress-response genes are downregulated in invasive populations, allowing for increased investment in competitive traits. Table 3.4 lists contigs in the Cirsium transcriptome with homology to genes differentially expressed under stress conditions, with a particular focus on drought-stress genes. The sequences of confirmed stress genes in Cirsium correspond to genes 1-23 in Table 3.4; those differentially regulated in both sunflower and Canada thistle in response to drought conditions correspond to numbers 1, 5, 6, 7, 8, 9,12, 13, 14, 20, 21 (M. Mayrose, unpublished data). The remaining genes (24-63) were selected on the basis of differential expression between wild and weedy populations of sunflower or under stress conditions (Hewezi et al., 2006; Lai et al., 2008; Roche et al., 2009) (Table 3.4). The contigs that topped tBlastx hit results (i.e. had the lowest Evalue) were inferred to be the Cirsium orthologues to these putative stress-related genes.  !  &%!  !  !  !  Table 3.4 Cirsium contigs with homology to known stress genes. Summary of tBlastn results using the Genbank sequences of stress-regulated genes as a query against the Cirsium transcriptome. Contigs yielding the smallest E-values were inferred to be orthologous to the query sequence. Also shown are the coordinates of hybrid SNPs covered by !6 reads. E-value gives the probability due to chance that there is another alignment with greater similarity. !  &&!  ! ! ! For contigs corresponding to putative stress-related genes, high-coverage SNP sites were identified in the parental and hybrid datasets (Table 3.4). The classic test of allelic imbalance assumes that parents are homozygous for different alleles, such that only two alleles are segregating in the hybrids. Hence, a SNP that occurs in both the maternal and paternal datasets cannot be used to track allelic inheritance in the hybrids. Table 3.5 shows 11 contigs from the 63 stress-genes above that contain homozygous parent-specific SNPs. Comparing allelic expression ratios, scored as the relative abundance of reads containing each SNP, in the parental and hybrid datasets allowed for calculations of the relative contributions of cis- and trans- (Table 3.5). Although these preliminary assignments were not statistically based, they revealed some obvious cases of cis- and trans- regulatory variation between populations.  Table 3.5 Stress genes for which parents are homozygous for different alleles as revealed by unshared SNPs. Relative allelic expression was determined for parental and hybrid libraries. Coverage was scored as poor, sufficient, or high, based on the number or tags support each allele. Contigs achieving sufficient or high coverage were further scored as cis-, trans-, or cis- and trans-, based on degree of allelic imbalance in the hybrids. Hybrid allelic imbalance revealed cis-acting effects for contigs 1451 and 23614. Contig 1451 corresponds to confirmed stress-related gene in Canada thistle with high similarity to Oxidoreductase NAD-binding domain-containing protein known to be involved in energy metabolism in A. thaliana (Hewezi et al., 2006). Transcripts from female and male parents were homozygous for different nucleotides at position 480, T and G respectively. Contig 1451 was inferred to be under cis-regulatory control because allelic imbalance was maintained in the hybrid and was supported by deep tag coverage (Table 3.5, Figure 3.9). Contig 23614 has high homology to an auxin-response transcription factor and is confirmed to be downregulated by drought stress in both sunflower and Cirsium (Roche et al., 2009). This case was scored as cisbecause allelic imbalance is maintained in the hybrid (Table 3.5; Figure 3.6). No cases of expression divergence caused by trans-acting change alone were found. Cis- and trans-effects were both inferred for Contigs 9087 and 25002. Alleles for Contig 25002 were determined by !  &'!  ! ! ! the presence of an indel, as opposed to a SNP, at bp 89. All reads in the female parent contained a single base-pair insertion (+A) relative to the reference sequence and all male parent reads. Contig 9087 shares homology with lipoxygenase (Lox2) in Nicotiana attenuata. Epistatic effects, such as dominance interactions among trans-acting factors, might underly some of the inferred cases of hybrid effects. Contig 14491 was scored as hybrid effect since the most highly expressed allele switched between parental and hybrid datasets (Figure 3.7). Contig 14491 has high homology to a subunit of a multicatalytic endopeptidase protein known to be involved in early flowering. A hybrid effect was also inferred for Contig 5913 due to a switch of the favoured allele in the hybrids (Figure 3.8, Table 3.5). Contig 5913 has high homology to a pathogenesis-related protein in sunflower. Contig 629 has high similarity to HD-Zip transcription factor Athb-8 and is a confirmed drought stress-response gene in Cirsium (M. Mayrose, unpublished data). Despite low read coverage, this appears to be a case of hybrid effect leading to the female (native) allele being more highly expressed.  !  &(!  !  !  !  Figure 3.6 Allele-specific expression in parents and hybrids for Contig 23614. Screenshots from SAMtools tview follows a population-specific homozyogous SNP at bp 741 on Contig 23614 in the female, male, hybrid1, and hybrid2 datasets (A, B, C, and D, respectively). The reference sequence and coordinates are shown on top, with mapped Illumina reads stacked below. The frequency of the SNP in each dataset is given in Table 3.5.  !  &)!  !  !  !  Figure 3.7 Allele-specific expression in parents and hybrids for Contig 14491. Screenshots from SAMtools tview follows a population-specific homozyogous SNP at bp 1039 on Contig 14491 in the female, male, hybrid1, and hybrid2 datasets (A, B, C, and D, respectively). The reference sequence and coordinates are shown on top, with mapped Illumina reads stacked below. The frequency of the SNP in each dataset is given in Table 3.5.  !  &*!  !  !  !  Figure 3.8 Allele-specific expression in parents and hybrids for Contig 5913. Screenshots from SAMtools tview follows a population-specific homozyogous SNP at bp 721 on Contig 5913 in the female, male, hybrid1, and hybrid2 datasets (A, B, C, and D, respectively). The reference sequence and coordinates are shown on top, with mapped Illumina reads stacked below. The frequency of the SNP in each dataset is given in Table 3.5.  !  &+!  !  !  !  Figure 3.9 Allele-specific expression in parents and hybrids for Contig 1451. Screenshots from SAMtools tview follow a population-specific homozyogous SNP at bp 480 on Contig 1451 in the female, male, hybrid1, and hybrid2 datasets (A, B, C, and D, respectively). The reference sequence and coordinates are shown on top, with mapped Illumina reads stacked below. The frequency of the SNP in each dataset is given in Table 3.5.  3.6 Analyzing Genome-Wide Allelic Expression Another set of genes was scored as cis-/trans- based on differential expression at characteristic SNP sites between parents. A complete list of SNP sites for which parents were differentially homozygous was generated for all 88,374 contigs. This list was filtered for cases with at least 6x coverage in each parent as well as a minimum of 6x coverage for each SNP in both hybrid pools, which halved the number of contigs that remained eligible for cis/trans analysis. These results were further parsed by imposing a minimum expression difference between parental datasets of 1.25-fold as well as similar allele frequencies in hybrid pool 1 and hybrid pool 2.  !  &,!  !  !  !  Table 3.6 Summary of SNP sites used to score hybrid allelic imbalance. These data were filtered by high confidence (!6 tag coverage) homozygous parental SNP sites, expression divergence (!1.25), and agreement in relative allele frequency between hybrid replicate pools 1 and 2.  "#!  !  !  !  !  Table 3.6 (Continued) Summary of SNP sites used to score hybrid allelic imbalance. These data were filtered by high confidence (!6 tag coverage) homozygous parental SNP sites, expression divergence (!1.25), and agreement in relative allele frequency between hybrid replicate pools 1 and 2.  "$!  !  !  !  !  Notable cases included evidence for cis-regulatory divergence in Contigs 113, 198, 324, 336, 394, 739, and 905 (Table 3.6). Based on a return to equal allelic expression in the hybrids, trans-regulatory divergence was inferred for Contigs 129, 179, 733, 744, 833, and 853. Nine cases were scored as a combination of both cis and trans effects. Hybrid effects were common, comprising 18 (nearly half) of the cases assayed. Of these, 4 contigs had higher relative expression of the male parent allele (P1) in hybrids than in the parents; in 14 cases, the most abundantly expressed allele in the hybrids was the least abundantly expressed allele when comparing the parental expression. Of the 40 contigs in Table 3.6, seven cis-effects were scored (Contigs 113, 198, 324, 336, 394, 739, and 905). Both cis and trans effects were inferred for 9 contigs and hybrid effects made up nearly half (18/40) of the cases assayed (Table 3.6). When multiple SNP sites were available for a given contig, allelic imbalance was scored for each (Table 3.6). Variation in allelic expression along the length of a transcript might be due to alternative splicing or Illumina biases, where higher error rates are typical at the 3’ end. Generally, the conclusions drawn from different SNP sites supported the same mode of expression divergence, but when predictions disagree they can lead to ambiguous conclusions. For instance, Contig 359 has three suitable SNP sites, two that led to inference of a hybrid effect (318, 330) and one that supported scoring as trans (383). Parsimony helped to score these cases, where the outcome with the most support (hybrid effect) was assumed to be correct. Contigs 795 is another complex case. While most of the contigs scored rested on inferences made from just one or two SNP sites, hybrid effects for Contig 661 are supported by all twelve SNPs across the transcript (Table 3.7). In this case, a higher proportion of the reads correspond to the invasive allele in the hybrids relative to its expression in the parents. Indeed, typical results indicated that the male (invasive) allele is more often highly expressed in the hybrids relative to the female (native) allele. Regardless of relative allele frequencies, the overall expression of each contig was also considered. Comparing the total tag counts per contig between parent and hybrid datasets identified cases of hybrid over- or under-expression. Out of the 40 genes assayed, 20 were upregulated, 5 were downregulated, and 15 had unchanged overall expression between hybrids and parents (Table 3.6). !  "#!  !  !  !  4. DISCUSSION This project used high-throughput measurements of allele-specific expression to provide insight into the extent of cis and trans-regulatory variation, as well as hybrid effects, in native/invasive comparisons. Estimates of gene expression using RNA-Seq are especially powerful since the number of tags mapping to a given reference locus is proportional to the number of transcripts in the cell (Cloonan and Grimmond, 2008; Wang et al., 2008). Compared to previous studies of allelic imbalance that used model genetic organisms, this work had the added benefit of working within a known ecological context. By pairing recently diverged populations based on measures of habitat bioclimatic similarity, this approach controlled for geographic differences and any associated effects those might have on gene expression.  4.1 Considerations of the Illumina sequencing approach The power of high-throughput approaches in tests of allelic expression depends on the extent of sequence divergence between alleles, the relative transcript abundance, read length, and the number of reads mapping per gene (i.e. sequencing depth) (Fontanillas et al., 2010). The Illumina GAII sequencing platform was ideal for achieving the high depth coverage. Opting for paired-end read technology effectively increased read length and, in turn, the probability of spanning an SNP that might help discriminate between alleles. Steps during cDNA library preparation were also taken to manipulate sequence read distribution over the length of a transcript (Wang et al., 2008; Nagalakshmi et al., 2008). Oligo(dT)-primed cDNA library construction generated reads biased to 3' ends, achieving localized regions of deep coverage. SNPs have also been shown to be more frequent in 5' or 3' UTRs than in nonsynonymous, protein-coding sites (Andolfatto, 2005), so a 3' bias should be expected to increase the number of informative reads. Because the extent of coding sequence divergence between native and invasive populations of Canada thistle had not been previously characterized, there was some uncertainty as to the number of population-specific genetic variants (SNPs and Indels) to expect. This work showed that sufficient sequence divergence has occurred to assess the nature of regulatory divergence for at least a subset of genes between native and invasive populations. !  "$!  ! ! ! One challenge when aligning the Illumina reads was how to deal with reads that map equally well to multiple locations on the reference sequence. This can occur when copy number variation, related paralogs, or structural variations result in similar regions between transcripts. Thankfully, BWA takes this into account by assigning multi-mapping reads randomly to loci and indicating a mapping quality score of zero. SNPs supported by these low quality reads can be ignored in downstream analyses.  4.2 Estimating Cis and Trans Effects Despite initial intentions of scoring cis and trans even for genes with more than two hybrid alleles, this was not possible given the variable read depth across the transcript length and relatively low occurrence of SNPs that were suitable to distinguish between parental alleles. Only cases where parents were homozygous for different alleles were classified, fitting the requirements of the traditional cis/trans test for hybrid allelic imbalance (Wittkopp, 2004). The initial focus on drought-stress genes considered the extensive root system of Cirsium, suspected to be its primary mechanism of invasive success, allowing water to be extracted from deep sources (Donald, 1994). Of the 11 stress genes analyzed, 2 appeared to be differentially regulated due to cis-effects (Contigs 1451 and 23614). Contig 1451 showed high similarity to an oxidoreductase NAD-binding domain-containing protein and is known to be involved in energy metabolism in A. thaliana (Table 3.5). Contig 23614 had homology to an auxin-response transcription factor and has been shown to be differentially regulated under drought stress in both sunflower and Cirsium. Cases of hybrid over- and underexpression were also scored to assess trends in overall gene activity between generations. Both alleles of Contig 1451 were downregulated in the hybrid relative to the parents by over 50%, whereas Contig 23614 had similar overall expression between parents and hybrids. Among the drought-stress genes, Contig 5913 was one of the most interesting cases that showed a hybrid effect in this study. Based on these Illumina results, expression of Contig 5913 was downregulated in weedy populations when comparing between parents. Contig 5913 has high homology to a pathogenesis-related protein, QHN14D03, and was also consistently differentially expressed between wild and weedy populations of sunflower (Lai et al., 2008). Contig 5913 was among those differentially expressed in response to drought-stress in Canada thistle (M. Mayrose, unpublished data). !  ""!  ! ! ! Another set of genes, representing 1% of the genome-wide dataset was also assayed for allelic imbalance based on differential expression between parental datasets. In this analysis, 40 genes were differentially expressed, had sufficient read-depth coverage, and SNP patterns suitable for scoring hybrid allelic imbalance (Table 3.6). Finding new cases of cis/trans based on differential expression between populations facilitates the discovery of other candidate genes involved in adaptation. A 1.25-fold cut-off for differential expression was used, as in McManus et al. (2010). However, a higher threshold for calling differential expression between the parents could be applied to increase the confidence of scoring hybrid allelic imbalance, especially where read depth is low or moderate. In terms of scoring cis and trans, statistical support has been shown to increase significantly for cases where the hybrid allelic imbalance is greater than 2-fold (Fontanillas et al., 2010). The magnitude of expression differences between parental populations ranged from 1.25 to 6.05 (Table 3.4). This is in line with findings from sunflower, where native and invasive populations typically showed 1.08- to a 6.54-fold difference (Lai et al., 2008). Considering that only 1% of total contigs were included, we might be tempted to extrapolate these trends genome-wide such that we would expect to be able to characterize regulatory divergence for ~4000 genes across the whole dataset. The allelic ratios between the two hybrid pools (replicates) were expected to be highly similar, given the minimal lane effects typically introduced with Illumina sequencing (Marioni et al., 2008; Bullard et al., 2010). In other words, SNPs more highly expressed in hybrid pool 1 were expected to be expressed at similar frequencies in hybrid pool 2. However, there were several cases where hybrid pools 1 and 2 did not agree on the relative frequency of parental alleles. Such cases were removed from further analysis. Overall, however, a modest positive correlation was observed between the two datasets (R2 = 0.41, Figure 3.3). This could reflect a real trend of random allelic expression in the F1 hybrids for some genes, but this is unlikely due to the use of large sample sizes (n=30). Alternatively, biases may have been introduced during library preparation that could account for expression differences between hybrid pools. The list of 40 genes in Table 3.6 represents the filtered, high confidence cases from this fraction of the Cirsium transcriptome. It is by no means a comprehensive list because genes were removed in which i) alleles were supported by fewer than 6 tags, ii) parental expression differences was less than 1.25-fold, and iii) the hybrid pools showed dissimilar allele frequencies. A conservative approach is important for confidence in scoring the nature of !  "%!  ! ! ! regulatory divergence. One case of interest that would be missed by the filtering criteria is Contig 1260. The low mapping quality of certain reads in Contig 1260 indicated that the parents were, in fact, differentially homozygous. It is important to remove SNP sites supported only by reads with low mapping quality scores, as these are likely to have been misaligned, actually belonging to a close paralogue rather than a different allele of the same gene. Of the 40 contigs in Table 3.6, seven cis-effects were scored (Contigs 113, 198, 324, 336, 394, 739, and 905). Predicted functions of these cis-regulated genes include a chaperonin (Contig 113), SCL30a RNA binding protiein (Contig 198), COX10 (cytochrome c oxidase 10) (Contig 324), a glutathione transferase (Contig 336), a pectinesterase inhibitor (Contig 394), a thioesterase family protein (Contig 739), and a mitochondrial ATP synthase g subunit family protein (Contig 905). Both cis and trans effects were inferred for 9 contigs and hybrid effects made up nearly half (18/40) of the cases assayed (Table 3.6). Ambiguous scoring of a few contigs resulted from multiple SNPs that did not support the same conclusion (e.g. Contigs 359, 795). In most cases, conclusions drawn from different SNP sites supported the same mode of expression divergence. One possible explanation for SNPs supporting different frequencies of alleles across the length of a transcript is alternative splicing. The observation of similar numbers of genes being regulated in cis and trans (Table 3.6) is consistent with earlier studies that show that cis-regulatory variation is more often the target of selection across long time scales (i.e. between species) whereas changes in trans also tend to contribute to intraspecific differences (Wittkopp et al., 2004, 2008b; Stern and Orgogozo, 2008). Most of the cis-acting changes that were identified had a larger magnitude of effect on expression divergence than trans-acting changes (Table 3.6). This is consistent with widespread evidence that additive, cis-variation is more often the target of directional selection (Wittkopp et al., 2004, 2008a; Tirosh et al., 2009; Birchler and Veitia, 2009). Directional selection has been shown to act on much of the gene expression variation between species (cis), whereas (trans) changes contribute to expression divergence only within the limits tolerated by stabilizing selection (Lemos et al., 2005; Denver et al., 2005; Holloway et al., 2007; Whitehead and Crawford, 2006b; Fay and Wittkopp, 2008).  !  "&!  !  !  !  4.3 Hybrid Effects The identification of hybrid effects for 18 of 40 genes examined in Table 3.6 was unexpectedly high considering the low level of genetic divergence between the parental Canada thistle populations. Extensive hybrid effects are more typical of between-species comparisons (Zhuang and Adams, 2007), where the regulatory networks of genetically distinct parents result in misregulation in the hybrids (Landry et al., 2007). One intraspecific cis/trans study in maize found that ~50% of genes showed biased hybrid allelic expression (Springer and Stupar, 2007), but this study used highly inbred parental lines. It will be interesting to see if the trend continues as more of the genes are analyzed. For 14 of the 18 cases, the most abundant allele in the hybrids was the least abundant allele between the parents. Of these cases, the paternally inherited (invasive) allele was often more highly expressed in hybrids relative to the maternally inherited (native) allele. This is in contrast to evidence from interpecific Drosophila hybrids that found that maternally inherited alleles were more often highly expressed (Fontanillas et al., 2010). This switch between relative allelic abundance between parents and hybrids provides evidence for hybrid regulatory incompatibilities. Hybrid effects also included cases where the allelic imbalance is more extreme in the hybrids than between parents. In 4 of the 18 cases of hybrid effects identified in Table 3.6, the frequency of the male parent allele was higher (i.e. more extreme) in the hybrid pools than in the parents. It could be the case that trans factors from evolutionary divergent chromosome sets interact in a way that forces hybrids to be controlled by trans, when these factors do not actually contribute to the observed expression divergence between parental populations (Birchler and Veitia, 2009; Landry et al., 2007).!  4.4 Future Direction Genome-wide analysis of cis/trans regulation is being automated computationally for all remaining contigs using the methodology described here. Appropriate statistical tests should be applied to score the ambiguous cases of hybrid allelic imbalance. Common methods for inferring differential expression from RNA-Seq data include Fisher’s exact test and GLM-based likelihood ratio statistics (Bullard et al., 2010). A subset of cases scored as cis or trans should also be examined using qRT-PCR to confirm the Illumina results. !  "'!  !  !  !  BIBLIOGRAPHY Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403-410 Andolfatto, P. 2005. Adaptive evolution of non-coding DNA in Drosophila. Nature 437:11491153 Baker, H.G. 1974. The evolution of weeds. Annu. Rev. Ecol. Evol. S5:1-24 Barker, M. S., Kane N. C., Matvienko, M., Kozik, A., Michelmore, R.W., Knapp, S.J., and Rieseberg, L.H. 2008. Multiple paleopolyploidizations during the evolution of the Compositae reveal parallel patterns of duplicate gene retention after millions of years. J. Mol. Biol. Evol. 25: 2445-2455 Barker M.S., Dlugosch, K.M., Dinh, L., Challa, R.S., King, M.G. and Rieseberg, L.H. bioinformatic pipelines and forums for ecological and evolutionary genomics. In prep Basu, C., Halfuhill, M.D., Mueller, T.C. and Stewart, C.N. 2004. Weed genomics: new tools to understand weed biology. Trends Plant Sci. 9:391-398 Birchler, J. A. and Veitia, R. A. 2009. The gene balance hypothesis: implications for gene regulation, quantitative traits and evolution. New Phytol. 186:54-62 Birchler, J.A., Bhadra, U., Pal Bhadra, M. and Auger, D.L. 2001. Dosage dependent gene regulation in multicellular eukaryotes: Implications for dosage compensation, aneuploid syndromes and quantitative traits. Dev. Biol. 234:275-288 Bullard, J.H., Purdom, E.A., Hansen, K.D., and Dudoit, S. 2010. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinf. 11:94 Carroll, S. B. 2005. Evolution at two levels: on genes and form. PLoS Biol. 3:e245 Chang, Y.W., Lui, F.R, Yu, N., Sung, H.M., Yang, P., Wang, D., Huang, C.H., Shih, M.C., and Li, W.H. 2008. Roles of cis- and trans-changes in the regulatory evolution of genes in the gluconeogenic pathway in yeast. Mol. Biol. Evol. 25:1863-1875 Chao, W.S., Horvath, D.P. Anderson, J.V., and Foley, M.P. 2005. Potential model weeds to study genomics, ecology, and physiology in the 21st century. Weed Sci. 53:929-937 Chen, Y., Lin, C., Wang, C., Wu, H., and Hwang, P. 2007. An optimized procedure greatly improves EST vector contamination removal. BMC Genomics 8:416 Chevreux, B., Wetter, T. and Suhai, S. 1999. Genome sequence assembly using trace signals and additional sequence information. Comput. Sci. Biol. (GCB) 99:45-56 !  "(!  ! ! ! Chevreux, B., Pfisterer, T., Drescher, B., Driesel, A. J., Muller, W. E., Wetter, T. and Suhai, S. 2004. Using the miraEST assembler for reliable and automated mRNA transcript assembly and SNP detection in sequenced ESTs. Genome Res. 14:1147-1159 Cloonan, N., and Grimmond, S. M. 2008. Transcriptome content and dynamics at singlenucleotide resolution. Genome Biol. 9:234 Comai, L., Madlung, A., Josefsson, C., and Tyagi, A. 2003. Do the different parental ‘heteromes’ cause genomic shock in newly formed allopolyploids? Philos. Trans. R. Soc. Lond. Biol. Sci. 358:1149-1155 Denver, D.R., Morris, K., Streelman, J.T., Kim, S.K., Lynch, M.L., and Thomas, W.K. 2005. The transcriptional consequences of mutation and natural selection Caenorhabditis elegans. Nat Genet. 37:544-548 Donald, W. 1994. The biology of Canada thistle (Cirsium arvense). Rev. Weed Sci. 6:77–101 Elena, S.F. and Lenski, R.E. 2003. Evolution experiments with microorganisms: the dynamics and genetic basis of adaptation. Nat. Rev. Genet. 4:457-469 ENCODE (ENCyclopedia Of DNA Elements) Project. The ENCODE Project Consortium. 2004. Science 306:636. Evans, J.E. 1984. Canada thistle (Cirsium arvense): a literature review of management practices. Nat. Areas J. 4:11-21 Fay, J.C. and Wittkopp, P.J. 2008. Evaluating the role of natural selection in the evolution of gene regulation. Heredity 100:191-199 Fontanillas, P., Landry, C.R., Wittkopp, P.J., Russ, C., Gruber, J.D., Nusbaum, C., and Hartl, D.L. 2010. Key considerations for measuring allelic expression on a genomic scale using highthroughput sequencing. Mol. Ecol. 19: 212-227 Gibson, G. and Weir, B. 2005. The quantitative genetics of transcription. Trends Genet. 21:616-623 Gilad, Y., Oshlack, A. and Rifkin, S.A. 2006. Natural selection on gene expression. Trends Genet. 22:456-461 Guo, M., Yang, S., Rupe, M., Hu, B., Bickel, D.R. Arthur, L., and Smith, O. 2008. Genomewide allele-specific expression analysis using massively parallel signature sequencing (MPSS) reveals cis- and trans-effects on gene expression in maize hybrid meristem tissue. Plant Mol. Biol. 66:551-563 Graze, R.M., McIntyre, L.M., Main, B.J., Wayne, M.L., and Nuzhdin, S.V. 2009. Regulatory divergence in Drosophila melanogaster and D. simulans, a genomewide analysis of allelespecific expression. Genetics 183:547-561  !  ")!  ! ! ! Grime, J.P., 1977. Evidence for the existence of three primary strategies in plants and its relevance to ecological and evolutionary theory. Am. Nat. 111:1169-1194 Gruber, J.D. and Long, A.D. 2008. Cis-regulatory variation is typically polyallelic in Drosophila. Genetics 181:661-670 Hazelhurst, S., Hide, W., Liptak, Z., Nogueira, R, and Starfield, R. 2008. An overview of the wcd EST clustering tool. Bioinformatics 24:1542-1546 Hewezi, T., Leger, M., Kayal, W.E., and Gentzbittel, L. 2006. Transcriptional profiling of sunflower plants growing under low temperatures reveals an extensive downregulation of gene expression associated with chilling sensitivity. J. Exp. Bot. 57:3109-3122 Hoekstra, H.E. and Coyne, J.A. 2007. The locus of evolution: evo devo and the genetics of adaptation. Evolution 61:995–1016 Holloway, A.K. Lawniczak, M.K.N., Mezey, J.G., Begun, D.J., and Jones, C.D. 2007. Adaptive gene expression divergence inferred from population genomics. PLoS Genet. 3:20072013 Huang, X., and Madan, A. 1999. CAP3: A DNA sequence assembly program. Genome Res. 9:868-877 Inga, A., Storici, F., Darden, T.A., and Resnick, M.A. 2002. Differential transactivation by the p53 transcription factor is highly dependent on p53 level and promoter target sequence. Mol. Cell Biol. 22:8612-8625 Jordan, I.K., Marino-Ramirez, L., and Koonin, E.V. 2004. Evolutionary significance of gene expression divergence. Gene 345:119-126 Kane, N.C. and Rieseberg, L.H. 2007. Selective sweeps reveal candidate genes for adaptation to drought and salt tolerance in common sunflower, Helianthus annuus. Genetics 175:1823-1834 Kane, N.C. and Rieseberg, L.H. 2008. Genetics and evolution of weedy Helianthus annuus populations: adaptation of an agricultural weed. Mol. Ecol. 17:384-394. Lai, Z., Kane, N.C., Zou, Y., and Rieseberg, L.H. 2008. Natural variation in gene expression between wild and weedy populations of Helianthus annuus. Genetics 179: 1881-1890 Lai, Z., Gross, B.L., Zou, Y., Andrews, J. and Rieseberg, L.H. 2006. Microarray analysis reveals differential gene expression in hybrid sunflower species. Mol. Ecol. 15:1213-1227 Lalonde, R.G. and Roitberg, B.D. 1994. Mating system, life history, and reproduction in Canada thistle. Am. J. Bot. 81: 21-28 Landry, C.R., Hartl, D.L., and Ranz, J.M. 2007. Genome clashes in hybrids: insights from gene expression. Heredity 99:483-493  !  %*!  ! ! ! Landry, C.R., Wittkopp, P.J., Taubes, C.H., Ranz, J.M., Clark, A.G., and Hartl, D.L. 2005. Compensatory cis-trans evolution and the dysregulation of gene expression in interspecific hybrids of Drosophila. Genetics 171:1813-1822 Lassman, T., Hayashizaki, Y., and Daub, C.O. 2009. TagDust - a program to eliminate artifacts from next generation sequencing data. Bioinformatics 25:2839-2840 Lemos, B., Araripe, L.O., Fontanillas, P., and Hartl, D.L. 2008. Dominance and the evolutionary accumulation of cis- and trans- effects on gene expression. PNAS 105: 1447114476 Li, H. and Durbin, R. 2009. Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics 25:1754-1760 Li, H., Handsaker, B., Wysoker, A., Fennel, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., and 1000 Genome Project Data Processing Subgroup. 2009. The sequence alignment/map format and SAMtools. Bioinformatics 25:2078-2079 McManus, C.J., Coolon, J.D., O’Duff, M., Eipper-Mains, J., Graveley, B.R. and Wittkopp, P.J. 2010. Regulatory divergence in Drosophila revealed by mRNA-seq. Genome Res. 20:816-825 Marioni, J. C., Mason, C. E., Mane, S. M., Stephens, M., and Gilad, Y. 2008. RNA-Seq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 18:1509-1517 Moore, R.J. 1975. The biology of Canadian weeds. 13: Cirsium arvense (L.) Scop. Can. J. Plant Sci. 55:1033-1048 Nagalakshmi, U., Wang, Z., Waern, K., Shou, C., Raha, D., Gerstein, M., and Snyder, M. 2008. The transcriptional landscapte of the yeast genome defined by RNA sequencing. Science 320:1344-1349 Pimentel, D., Zuniga, R., and Morrison, D. 2005. Update on he environmental and economic costs associated with alien-invasive species in the United States. Ecol. Econ. 52:273-288 Prud’homme, B., Gompel, N., Rokas, A., Kassner, V.A., Williams, T.M., Yeh, S.D., True, J.R., and Carroll, S.B. 2006 Repeated morphological evolution through cis-regulatory changes in a pleiotropic gene. Nature 440:1050-1053 Prud’homme, B., Gompel, N., and Carroll, S. B. 2007. Emerging principles of regulatory evolution. PNAS 104:8605-8612 Roche, J., Hewezi, T., Bouniols, A., and Gentzbittel Laurent. 2009. Real-time PCR monitoring of signal transduction related genes involved in water stress tolerance mechanisms of sunflower. Plant Phys. Bioch. 47:139-145 Serre, D., Gurd, S., Ge, B., Sladek, R., Sinnett, D., Harmsen, E., Bibikova, M., Chudin, E., Barker, D. L., Dickinson, T., Fan, J., and Hudson, T. J. 2008. Differential allelic expression in !  %+!  ! ! ! the Human genome: A robust approach to identify genetic and epigenetic cis-acting mechanisms regulating gene expression. PLoS Genet. 4:e1000006 Shendure, J., and Ji, H. 2008. Next-generation DNA sequencing. Nat. Biotech. 26:1135-1145 Skinner, K. Smith, L., and Rice, P. 2000. Using noxious weed lists to prioritze targets for developing weed management strategies. Weed Sci. 48:640-644 Springer, N. M. and Stupar, R. M. 2007. Allele-specific expression patterns reveal biases and embryo-specific parent-of-origin effects in hybrid Maize. Plant Cell 19: 2391-2402 Stern, D. L. and Orgogozo, V. 2008. The loci of evolution: how predictable is genetic evolution? Evolution 62:2155–77 Stewart, N.C., Tranel, P.J., Horvath, D.P, Anderson, J.V., Rieseberg, L.H., Westwood, J.H., Mallory-Smith, C.A., Zapiola, M.L., and Dlugosch, K.M. 2009. Evolution of weediness and invasiveness: charting the course for weed genomics. Weed Sci. 57:451-462 Storz, J. F. 2005. Using genome scans of DNA polymorphism to infer adaptive population divergence. Mol. Ecol. 14:671-688 Sung, H.M., Wang, T.Y., Wang, D., Huang, Y.S., Wu, J.P., Tsai, H.K., Tzeng, J., Huang, C.J., Lee, Y.C., Yang, P., Hsu, J., Chang, T., Cho, C.Y., Weng, L.C., Lee, T.C., Chang, T.H., Li, W.H., and Shih, M.C. 2009. Roles of trans and cis variation in yeast intraspecific evolution of gene expression. Mol. Biol. Evol. 26:2533-2538 Schwartz, C., Balasubramanian, S., Warthmann, N., Michael, T.P., Lempe, J., Sureshkumar, S., Kobayashi, Y., Maloof, J.N., Borevitz, J.O., Chory, J., and Weigel, D. 2009. Cis-regulatory changes at FLOWERING LOCUS T mediate natural variation in flowering responses of Arabidopsis thaliana. Genetics 183:723-32 Swanson-Wagner, R.A., Jia, Y., DeCook, R., Borsuk, L.A., Nettleton, D., and Schnable, P.S. 2006. All possible modes of gene action are observed in a global comparison of gene expression in a maize F1 hybrid and its inbred parents. PNAS 103:6805-6810 Tautz, D. 2000. Evolution of transcriptional regulation. Curr. Opin. Genet. Dev. 10:575-579 Tirosh, I., Reikhav, S., Levy, A.A., and Barkai, N. 2009. A yeast hybrid provides insight into the evolution of gene expression regulation. Science. 324:659-662 Tung, J., Fedrigo, O., Haygood, R., Mukherjee, S. and Wray, G. A. 2009. Genomic features that predict allelic imbalance in humans suggest patterns of constraint on gene expression variation. Mol. Biol. Evol. 26:2047-2059 Wang, Z., Gerstein, M., and Snyder, M. 2008. RNA-Seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet. 10:57-63 Whitehead, A., and Crawford, D. L. 2006a. Neutral and adaptive variation in gene expression. Proc. Natl. Acad. Sci. USA 103:5425-5430 !  %#!  !  !  !  Whitehead, A., and Crawford, D. L. 2006b. Variation within and among species in gene expression: raw material for evolution. Mol. Ecol. 15:1197-1211 Williams, R.B.H., Chan, E.K.F., Cowley, M.J., and Little, P.F.R. 2007. The influence of genetic variation on gene expression. Genome Res. 17:1707-1716 Wittkopp, P. J., Haerum, B. K., and Clark, A. G. 2004. Evolutionary changes in cis and trans gene regulation. Nature 430:85-88 Wittkopp, P. J. 2005. Genomic sources of regulatory variation in cis and in trans. Cell. Mol. Life Sci. 62:1779-1783 Wittkopp, P. J. 2006. Evolution of cis-regulatory sequence and function in Diptera. Heredity 97:139-147 Wittkopp, P.J., Haerum, B.K. and Clark, A.G. 2008a. Regulatory changes underlying expression differences within and between Drosophila species. Nat. Genet. 40:346-50 Wittkopp, P. J., Haerum, B. K., and Clark, A. G. 2008b. Independent effects of cis- and transregulatory variation on gene expression in Drosophila melanogaster. Genetics 178:1831-1835 Wray, G. A. 2007. The evolutionary significance of cis-regulatory mutations. Nat. Rev. Genet. 8:206–216 Zhang, X., and Borevitz, J.O. 2009. Global analysis of allele-specific expression in Arabidopsis thaliana. Genetics 182:943-954 Zhuang, Y., and Adams, K. L. 2007. Extensive allelic variation in gene expression in Populus F1 hybrids. Genetics 177:1987-1996  !  %$!  !  !  !  APPENDIX  Table A.1 Building Hybrid RNA Pool1. RNA concentration and quality is shown for each hybrid added to pool (n=30). Overall pool statistics are indicated in bold.  !  %"!  !  !  !  Table A.2 Building Hybrid RNA Pool2. RNA concentration and quality is shown for each hybrid added to pool (n=30). Overall pool statistics are indicated in bold.  !  %%!  


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items