UBC Theses and Dissertations

Evolutionary conservation of long intergenic non-coding RNA genes in Arabidopsis Hammel, Alexander John 2013

Evolutionary Conservation of Long Intergenic Non-coding RNA Genes in Arabidopsis

by Alexander John Hammel

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in the Faculty of Graduate and Postdoctoral Studies (Botany)

The University of British Columbia (Vancouver)

August 2013

© Alexander John Hammel, 2013

Abstract

Long intergenic non-coding RNA (lincRNA) genes are a poorly studied class of transcripts, particularly in plants. Because of the low levels of expression, high tissue specificity, and rapid rate of evolution of lincRNA transcripts, the discovery and functional annotation of these molecules are a significant challenge. Here, I report the annotation of 201 new lincRNA transcripts in Arabidopsis thaliana discovered using the results of a single RNA-seq experiment of a normalized library. Using these sequences, along with the 6 480 lincRNA genes annotated by Liu et al. (2012), I performed a pairwise sequence alignment experiment with the genomes of 22 plant species in order to discover highly conserved sequences within lincRNA loci. Of the 6 681 lincRNA sequences examined, 3 374 have highly conserved sequences supported by multiple genomic alignments to other species. Six of these show evidence of an ongoing reduced rate of sequence evolution when single-nucleotide variant data from the recent evolutionary history of Arabidopsis thaliana are considered. The rate of retention of these conserved regions within the Brassicaceae suggests a much higher rate of sequence turnover in lincRNA genes compared with protein coding genes. Structural variant data from 80 different A. thaliana ecotypes suggest that lincRNA genes suffer deletions of the entire locus from the genome with appreciable frequency: 570 of the lincRNA loci examined are entirely missing from at least one A. thaliana strain.
These results suggest an intriguing mixture of rapid sequence evolution with short, highly-conserved islands in lincRNA genes.

Preface

This project includes collaborations with David Tack, a Ph.D. candidate in the Adams lab at UBC, and Jon Willinofsky, an undergraduate at UBC. The strategy for filtering Illumina RNA-seq reads associated with annotated genes (described in section 2.1) was developed in collaboration with David Tack and Jon Willinofsky. The software implementation of this strategy was developed by Mr. Willinofsky under the supervision of Mr. Tack and Dr. Adams. I was responsible for all further data analysis once the filtered data were obtained.

Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgements
1 Introduction
    1.1 Evolution of non-coding RNA
    1.2 Known functions of lincRNA genes
    1.3 Objectives
2 Methods
    2.1 Identification of lincRNA transcripts
    2.2 Identification of lincRNA conserved regions
    2.3 Conserved regions of lincRNA loci
    2.4 Recent gain and loss of lincRNA loci
3 Results
    3.1 Discovery of lincRNA loci through Illumina analysis of normalized libraries
    3.2 LincRNA sequence similarity in other plant genomes
    3.3 Conserved regions of Arabidopsis lincRNA loci
    3.4 Recent evolutionary dynamics of lincRNA genes
4 Discussion
    4.1 Identifying lincRNA genes
    4.2 Conserved regions in lincRNA loci
    4.3 Interpreting conserved blocks
    4.4 Evolutionary comparisons
    4.5 The origins of lincRNA genes
Conclusion
Tables
Figures
Bibliography

List of Tables

1 Plant genomes used in the identification of Arabidopsis thaliana lincRNA conserved blocks by pairwise alignment
2 Summary of the e-values of the BLAST pairwise alignments
3 Summary of the results of the two methods of annotating regions of high sequence similarity
4 The number of lincRNA loci and protein coding sequences with conserved blocks in the genomes of other species
5 Summary statistics of the conserved regions of lincRNA loci
6 Evidence for reduced rate of sequence evolution in conserved regions of lincRNA genes
7 Co-occurrence of conserved regions and whole-locus deletions in lincRNA loci

List of Figures

1 Differences in confidence statistics between novel and confirmed lincRNA loci
2 Alignment characteristics of lincRNA genes in all species
3 Alignment characteristics of lincRNA genes in the Brassicaceae
4 Phylogenetic positions of conserved blocks within the Brassicaceae
5 Partial alignment of a very broadly conserved lincRNA locus
6 Alignments of the lincRNA genes which have evidence of a reduced rate of sequence evolution within their conserved regions
7 Frequency of deletions of lincRNA loci which do not have a significant alignment to any other plant genome

Acknowledgements

I would like to thank Dr. Keith Adams for his support in this project. I would also like to acknowledge the rest of the members of the Adams lab, particularly Aude Darracq and David Tack, for their support and technical assistance. Drs. Quentin Cronk, Naomi Fast, and Loren Rieseberg made invaluable contributions of their time, ideas, and expertise. I feel it necessary to thank the open source software community as a whole, without whom this project would have been impossible.

Chapter 1
Introduction

It was once possible for biologists to think of RNA mostly as a transitional stage between DNA and protein. With modern sequencing technology, it is becoming clear that many transcripts do not undergo translation, but are functional as RNA. The eukaryotic genome produces an amazingly broad spectrum of transcript types with a great diversity of functions related to gene expression, including transcription, translation, and chromatin remodelling (Ponting et al., 2009).
Although recent advances in nucleic acid sequencing technology have revealed a great number of non-protein-coding RNA transcripts (ncRNAs), relatively few of these have been functionally characterized.

Detailed studies of the transcription of eukaryotic genomes using modern sequencing technology have consistently found that transcription is surprisingly ubiquitous (Kapranov et al., 2007). In humans, it has been estimated that almost every nucleotide in the genome is transcribed in at least one developmental stage or cell type (ENCODE Project Consortium, 2007), and a recent study suggests that ncRNA transcripts outnumber protein coding transcripts at least two-to-one in both diversity of species and total abundance (Managadze et al., 2013). Although a number of ncRNA transcripts have been functionally characterized, it is difficult to believe that all of these transcripts are functionally important. This suggests the need for a method to determine which non-coding transcripts may be functional, and which are non-functional "transcriptional noise".

Long intergenic non-coding RNAs are transcripts longer than 200 nucleotides which do not show experimental or evolutionary evidence for translation (such as codon substitution biases consistent with protein-coding sequences), and which do not overlap with known genic regions. This definition excludes other species of non-coding RNA, such as anti-sense, intronic, and bi-directional transcripts, as well as comparatively well-understood small RNA species such as miRNA, siRNA and snoRNA (Zhang et al., 2007; Molnar et al., 2011; Scott and Ono, 2011). In practice, a distinction is usually drawn between lincRNA and transcripts associated with transposable elements and other intergenic repeats, as these transcripts are thought to be functionally distinct from other non-coding RNAs (Zhang et al., 2010). Liu et al. (2012) refer to these RNA species as "repeat-containing transcriptional units" (RCTUs).
RNA genes which are involved in translation, including rRNA and tRNA genes, are also normally excluded from the category.

1.1 Evolution of non-coding RNA

Despite recent interest in the function of non-coding RNAs, there have been few studies of the degree to which lincRNA genes are conserved amongst species. In mice, a sequence comparison across 29 mammal species has shown that lincRNA loci are evolutionarily conserved at a rate higher than random intergenic non-coding sequence, but lower than protein-coding genes (Guttman et al., 2010).

Pang et al. (2006) found that long non-coding RNAs are poorly conserved between humans and mice compared to miRNA and snoRNA genes. This holds true for lincRNA genes that are known to be functional as long RNAs, including Xist (see below). The authors suggest that this may be because only short stretches of sequence may be essential for function, with the rest of the sequence functioning in secondary structure or as spacer.

Sequence conservation based analyses have been shown to be effective in identifying functional ncRNAs. For example, Willingham et al. (2005) were able to identify a ncRNA with a repressor function by screening a pool of mouse ncRNAs showing sequence similarity to human genomic sequences. NRON, the ncRNA gene which was discovered to repress the expression of the NFAT protein coding gene, is a relatively large transcript at 3.7 kb. It contains two regions of relatively high similarity between rodents and primates: one 289 base pair region with 90% identity between humans and mice, and one 400 base pair region with 89% identity. In creating a pool of candidates, the authors used a relatively low cutoff of 50% identity across 70% of the length of the locus (Numata et al., 2003; Willingham et al., 2005).

1.2 Known functions of lincRNA genes

Relatively few lincRNA genes have been functionally characterized in plants.
One such gene is INDUCED BY PHOSPHATE STARVATION 1 (IPS1), which is involved in the microRNA-mediated regulation of phosphorus nutrition (Kim and Sung, 2012). Overexpression of IPS1 results in a decrease in phosphate accumulation in the shoot, which suggests that it is involved in the mobilization of phosphates in conditions of Pi starvation (Franco-Zorrilla et al., 2007). The mechanism of action of IPS1 has been determined. Franco-Zorrilla et al. (2007) found that the transcript modifies the activity of the microRNA miRNA399 by competitive inhibition. The IPS1 transcript binds to miRNA399 but does not undergo cleavage. This decreases the proportion of miRNA transcripts which can bind to their protein-coding targets and trigger the transcript degradation pathway.

COLDAIR is another well-characterized plant lincRNA gene. It is required for repression of FLOWERING LOCUS C during vernalization. Knockdown mutants for COLDAIR have increased expression of FLC2 following cold treatment and do not display cold-triggered flowering behaviour (Heo and Sung, 2011). COLDAIR is known to act by physically binding the Polycomb Repressive Complex 2, resulting in the formation of repressive chromatin at the FLC locus (Heo and Sung, 2011). Remarkably, the mammalian lincRNA gene HOTAIR, after which COLDAIR is named, also binds the PRC2 complex, altering chromatin state and gene expression patterns (Gupta et al., 2010). The two loci are not apparently orthologous (Heo and Sung, 2011), which suggests that PRC complex binding is something of a recurring theme in lincRNA function. This theme is evident in the mammalian lincRNA gene Xist, which acts in X chromosome inactivation. Xist acts as a molecular bridge, binding both PRC2 and the YY1 protein, which itself binds to DNA motifs along the X chromosome (Jeon and Lee, 2011).
The result is that the PRC is brought into proximity with the X chromosome, causing the chromatin modification which inactivates the entire chromosome (Zhao et al., 2008; Jeon and Lee, 2011).

Although lincRNA genes are relatively poorly studied, those that have been functionally characterized tend to be involved in epigenetic gene regulation pathways such as chromatin modification and the miRNA pathway, rather than the "classical" transcription factor pathways or catalytic functions. The mechanism of action of the characterized functions tends to involve DNA-RNA and DNA-protein binding interactions. Because the known functions of lincRNA genes depend on binding motifs along the length of the transcript, I hypothesize that functional lincRNA genes may be characterized by conserved binding regions. Such conserved regions have been observed in IPS1 (Franco-Zorrilla et al., 2007; Hou et al., 2005; Liu et al., 1997; Burleigh and Harrison, 1997) in plants, and in many non-coding RNAs in animals (Pang et al., 2006). Guttman et al. (2010) found evidence in mice for short regions with a reduced rate of sequence evolution in many lincRNA loci, finding that the average conserved region covers approximately 22% of the locus, compared with a figure of 70% in protein-coding exons. My work is the first systematic attempt to identify conserved regions in plant lincRNA genes on a genome-wide scale.

1.3 Objectives

The goals of this study were to determine the patterns of evolution of lincRNA genes, to identify regions of high evolutionary conservation in these genes, to discover whether there is evidence for an ongoing reduced rate of sequence evolution in these conserved regions, and to explore the potential of RNA-seq with library normalization for the discovery of lincRNA transcripts. I adopted a sequence-similarity approach to identify putative homologs of Arabidopsis thaliana lincRNA genes in a wide variety of plant species.
Using the same approach to discover areas of sequence similarity in protein-coding genes allowed me to compare the patterns of sequence conservation. Using alignments across many species has allowed me to identify conserved regions of lincRNA genes which have experienced relatively little sequence evolution over the course of macroevolutionary time. By integrating data regarding these conserved regions with data regarding mutations which have arisen in the recent evolutionary history of Arabidopsis thaliana, I have been able to test the hypothesis that conserved regions are subject to decreased rates of sequence evolution on both a microevolutionary and macroevolutionary time scale.

Chapter 2
Methods

2.1 Identification of lincRNA transcripts

Novel lincRNA transcripts were identified using RNA-seq. The raw Illumina reads were obtained from a prior study by Marquez et al. (2012). This data set was chosen because the library was enriched for rare transcripts. Marquez et al. (2012) constructed their cDNA libraries from Arabidopsis thaliana (Col-0) flowers and seedlings, and pooled the cDNA from both tissue types. The cDNA pool was normalized using an Evrogen Trimmer-direct Kit, and sequenced with 75-base-pair paired-end reads on five lanes using the Illumina GA system. I mapped the Marquez et al. reads to the TAIR10 version of the Arabidopsis genome assembly (Lamesch et al., 2012) using Bowtie (Langmead et al., 2009).

Together with undergraduate Jon Willinofsky, I developed a strategy to filter out reads associated with annotated genes. Our strategy was to remove all of the read pairs which overlap with a genic region annotated in the TAIR10 genome (including 5' and 3' UTRs), then remove all of the read pairs which overlap with those read pairs, and so forth until only reads which unambiguously map to intergenic regions remain.
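As an illustration only, the iterative removal described above can be sketched in Python. This is not the software written for the thesis. The interval representation, the function names, and the treatment of a read pair as a single span from its leftmost to its rightmost mapped base are all assumptions made for the example; two spans are treated as overlapping when they share at least one base.

```python
from typing import List, Tuple

# Hypothetical representation: (chromosome, start, end), 1-based inclusive.
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """True when two intervals on the same chromosome share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def filter_intergenic(read_pairs: List[Interval],
                      genes: List[Interval]) -> List[Interval]:
    """Drop read pairs touching a gene, then read pairs touching any dropped
    read pair, repeating until a full pass removes nothing."""
    tainted = list(genes)        # anything overlapping these is removed
    remaining = list(read_pairs)
    changed = True
    while changed:
        changed = False
        kept = []
        for pair in remaining:
            if any(overlaps(pair, region) for region in tainted):
                tainted.append(pair)   # removed pairs taint their neighbours
                changed = True
            else:
                kept.append(pair)
        remaining = kept
    return remaining


genes = [("Chr1", 100, 200)]
reads = [
    ("Chr1", 250, 320),   # overlaps the next read pair, so removed on pass 2
    ("Chr1", 190, 260),   # overlaps the gene, so removed on pass 1
    ("Chr1", 400, 470),   # isolated intergenic read pair, kept
    ("Chr2", 150, 220),   # different chromosome, kept
]
print(filter_intergenic(reads, genes))
```

The quadratic comparison is chosen for readability; a sorted sweep or an interval tree would be the usual choice for a data set of real size.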
We considered reads to be overlapping if they shared at least one base pair of their mapped positions in common, and non-overlapping if there was no shared base pair. Overlap between reads and annotated regions was defined in the same way. We expect this strategy to remove not only reads associated with annotated protein coding genes, but also those associated with unannotated 5' and 3' UTR regions and natural antisense transcripts which overlap with annotated genes.

I then identified putative lincRNA loci using the Samtools pileup function (Li et al., 2009). Every RNA molecule identified by the pileup that is at least 200 base pairs long was treated as a putative lincRNA transcript. The genomic positions, consensus sequences, and average read alignment qualities for the putative lincRNA transcripts were calculated from the pileup.

In order to remove unannotated protein-coding loci, I used GenScan (Burge and Karlin, 1997) to predict open reading frames. I removed all loci which have an open reading frame greater than 100 base pairs. I also removed any loci that overlap with transposable elements annotated by TAIR10 (Lamesch et al., 2012). This is the same strategy that was employed by Liu et al. (2012) to filter their lincRNA annotations for protein-coding loci and transposable elements. Any loci which overlapped with repeat-containing transcriptional units annotated by Liu et al. (2012), and any loci which were covered to an average depth of less than five reads were also removed from further analysis.

In order to test the specificity of my lincRNA identification procedure, the remaining putative lincRNA transcripts were compared to the set of lincRNA annotations recently published by Liu et al. (2012). I divided my set of lincRNA loci into two categories: "overlapping loci" whose genomic positions overlap with a locus in the Liu et al. data set, and "novel loci" which are non-overlapping with Liu et al. loci.
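The length and coverage thresholds used in this section can be illustrated with a small Python sketch. It assumes, hypothetically, that the pileup has already been reduced to a mapping from genomic position to read depth for one chromosome. The function name and data layout are invented for the example, and the open reading frame, transposable element, and RCTU filters are left out.

```python
from typing import Dict, List, Tuple

MIN_LENGTH = 200        # minimum locus length in base pairs (from the text)
MIN_MEAN_DEPTH = 5.0    # minimum average fold coverage (from the text)


def call_loci(depth: Dict[int, int]) -> List[Tuple[int, int, float]]:
    """Return (start, end, mean_depth) for each contiguous covered run that
    passes both thresholds. Positions are 1-based and inclusive."""
    covered = sorted(pos for pos, d in depth.items() if d > 0)

    # Merge consecutive covered positions into runs.
    runs: List[List[int]] = []
    for pos in covered:
        if runs and pos == runs[-1][1] + 1:
            runs[-1][1] = pos
        else:
            runs.append([pos, pos])

    loci = []
    for start, end in runs:
        length = end - start + 1
        mean_depth = sum(depth[p] for p in range(start, end + 1)) / length
        if length >= MIN_LENGTH and mean_depth >= MIN_MEAN_DEPTH:
            loci.append((start, end, mean_depth))
    return loci


# Toy data: one passing locus, one too shallow, one too short.
depth = {p: 6 for p in range(1000, 1250)}
depth.update({p: 2 for p in range(2000, 2300)})
depth.update({p: 10 for p in range(3000, 3100)})
print(call_loci(depth))
```

In this sketch the two thresholds are applied together for brevity, whereas the text applies the coverage filter as a separate, later step.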
I examined the average fold coverage (the average number of reads covering any base along the length of the locus) and the average Phred mapping quality score (Ewing et al., 1998) at each locus in order to determine whether these measures are significantly different in novel and confirmed lincRNA loci.

Both the Liu et al. lincRNA loci and the novel lincRNA loci discovered by my analysis were included in downstream analyses.

2.2 Identification of lincRNA conserved regions

Conserved sequence elements of lincRNA loci were identified by alignment to the genomic sequences of selected plant species (see table 1). Species were chosen on the basis of the quality of the genome available, the depth of coverage of the genome, and to give a broad phylogenetic coverage of the plant kingdom. Carica papaya, for instance, was excluded as the genome is sequenced to an average depth of only 3x coverage (Ming et al., 2008).

Alignments were performed using the discontiguous MegaBLAST program (Zhang et al., 2000; Ma et al., 2002) using a seed optimized for non-coding sequences with a word size of 11 and a template length of 16. I developed two criteria by which to identify regions of high sequence similarity from MegaBLAST alignments. I call a sequence a "conserved block" when it can be aligned to the genome of another plant species with an e-value of less than 10^-30. To this set of conserved blocks, I added any alignments which were at least 85 % identical across at least 50 base pairs. These are referred to as "short highly conserved blocks" (SHCBs).

A perennial problem in studies of evolutionary conservation is the decision as to what constitutes a significant alignment. Because the degree of sequence similarity which is sufficient to infer homology is dependent on the organism being studied, the class of molecule, and the researchers' goals, the criteria for calling a "significant" alignment are necessarily somewhat ad hoc.
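The two criteria defined in this section can be written as a simple filter over alignment records. The Python sketch below is illustrative only: the `Hit` record and its field names are hypothetical stand-ins for whatever alignment output is parsed, and only the thresholds (e-value below 10^-30, or at least 85 % identity over at least 50 base pairs) come from the text.

```python
from dataclasses import dataclass
from typing import Optional

EVALUE_CUTOFF = 1e-30   # conserved block: e-value strictly below this
MIN_IDENTITY = 85.0     # SHCB: percent identity at least this
MIN_LENGTH = 50         # SHCB: alignment length (bp) at least this


@dataclass
class Hit:
    """Hypothetical record for one pairwise alignment."""
    locus_id: str
    species: str
    percent_identity: float
    alignment_length: int
    evalue: float


def classify_hit(hit: Hit) -> Optional[str]:
    """Label an alignment as a conserved block, an SHCB, or neither (None).

    The e-value criterion is checked first, so an SHCB is an alignment that
    passes the identity and length test without being a conserved block."""
    if hit.evalue < EVALUE_CUTOFF:
        return "conserved_block"
    if hit.percent_identity >= MIN_IDENTITY and hit.alignment_length >= MIN_LENGTH:
        return "SHCB"
    return None


hits = [
    Hit("locus_A", "A. lyrata", 92.0, 410, 1e-120),
    Hit("locus_B", "B. rapa", 88.5, 64, 3e-12),
    Hit("locus_C", "G. max", 71.0, 180, 2e-9),
]
for hit in hits:
    print(hit.locus_id, classify_hit(hit))
```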
In almost all cases, the false-positive and false-negative rates of a particular alignment criterion are completely unknown. In order to ameliorate this difficulty, I elected to use two different approaches to constructing an alignment cutoff criterion. The SHCB criterion requires a high level of sequence similarity over a relatively small proportion of the lincRNA locus. This criterion was chosen based on what is known about the evolutionary dynamics of known functional lincRNA genes in mammals: many functionally characterized lincRNA genes have been found to have a relatively small conserved region while the rest of the length of the locus is subject to a high rate of sequence evolution (Willingham et al., 2005; Pang et al., 2006; Ponting et al., 2009). The "conserved block" criterion uses a more general e-value cutoff. This is expected to allow alignments that do not follow the expected patterns of lincRNA evolution at the expense of failing to filter out a larger proportion of spurious alignments. For example, the e-value criterion potentially allows for conserved blocks which are smaller than expected by the low false-positive criterion, or for lincRNAs which have a modest rate of sequence conservation across the entire locus, as is not uncommon in protein-coding genes.

In order to use the presence or absence of a sequence alignment to draw conclusions about evolutionary rates, it is necessary to make comparisons among gene classes using the same alignment criteria.
In order to compare the rates of conservation of lincRNA loci to protein coding genes, I took a random sample of 2 000 protein coding sequences annotated by TAIR10 (Lamesch et al., 2012) and aligned them to the plant genomes described above using the same discontiguous MegaBLAST strategy described for lincRNA loci. The resulting alignments were then filtered using the same alignment criteria described above.

2.3 Conserved regions of lincRNA loci

Putative conserved regions in other plant species, for the purposes of this study, are defined as the longest sequence of base pairs of a lincRNA locus which overlaps with all of the alignments in other species at a particular conservation criterion. (Note that this is different from a conserved block, which is a region of high sequence similarity between a lincRNA locus and one particular genomic sequence in another species.) If I found no region which was overlapped by all of the alignments which met the cutoff, or if there was only one alignment to a particular locus, that locus is considered not to have a conserved region. Conserved regions were calculated separately using conserved blocks and SHCBs. These two data sets were combined for downstream analysis. If a single lincRNA locus had a conserved region annotated using both cutoff criteria, the overlap was used in downstream analysis. If the two conserved regions did not overlap, the locus was discarded.

In order to test the hypothesis that the putative conserved regions represent areas which are under stronger purifying selection than the rest of the locus, I used genomic single nucleotide variant (SNV) data from the Arabidopsis 80 Genomes project (Cao et al., 2011). Because a SNV present in two different Arabidopsis strains may represent a single mutation in their common ancestor, I considered only unique variants. A SNV of the same substitution base present at the same genomic location in more than one strain is only counted as a single unique variant in my experiment.
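The definitions in this section lend themselves to a short worked sketch in Python. It is an illustration under stated assumptions, not the analysis code used here. Alignments are given as 1-based inclusive spans on the lincRNA locus, and variant calls as hypothetical (strain, position, substituted base) tuples. The one-tailed binomial null model is also an assumption: each unique variant is taken to fall inside the conserved region with probability equal to the fraction of the locus that the region covers.

```python
from math import comb
from typing import Iterable, List, Optional, Set, Tuple

Span = Tuple[int, int]   # 1-based inclusive coordinates on the locus


def conserved_region(alignments: List[Span]) -> Optional[Span]:
    """Region shared by every alignment, or None if there are fewer than two
    alignments or no position is covered by all of them."""
    if len(alignments) < 2:
        return None
    start = max(s for s, _ in alignments)
    end = min(e for _, e in alignments)
    return (start, end) if start <= end else None


def unique_variants(calls: Iterable[Tuple[str, int, str]]) -> Set[Tuple[int, str]]:
    """Collapse per-strain calls so that the same substitution at the same
    position is counted once, however many strains carry it."""
    return {(pos, base) for _strain, pos, base in calls}


def binomial_lower_tail(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))


def depletion_p_value(locus_length: int, region: Span,
                      variants: Set[Tuple[int, str]]) -> float:
    """One-tailed test for fewer variants in the region than expected
    (assumed null: variants fall uniformly along the locus)."""
    inside = sum(1 for pos, _ in variants if region[0] <= pos <= region[1])
    p_inside = (region[1] - region[0] + 1) / locus_length
    return binomial_lower_tail(inside, len(variants), p_inside)


region = conserved_region([(10, 80), (40, 120), (30, 90)])
calls = [("s1", 5, "A"), ("s2", 5, "A"), ("s3", 9, "T"),
         ("s1", 95, "C"), ("s4", 110, "G")]
variants = unique_variants(calls)
print(region, len(variants), depletion_p_value(120, region, variants))
```

Correction for multiple testing across loci is not shown.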
I tested the hypothesis that putative conserved regions contain fewer unique variants than the rest of the length of the lincRNA locus using a one-tailed binomial test and calculated Q-values using false discovery rate correction (Benjamini and Hochberg, 1995).

In order to examine alignments of conserved regions, I used Clustal Omega (Sievers et al., 2011) to realign lincRNA sequences to genome regions aligned by MegaBLAST. The resulting alignments were visualized with MView (Brown et al., 1998).

2.4 Recent gain and loss of lincRNA loci

In order to study the evolutionary dynamics of lincRNA loci within the Arabidopsis thaliana species, I used the large deletion annotations from Cao et al. (2011). This data set lists relatively long (more than 10 nucleotide) stretches of genomic DNA which are present in the Arabidopsis thaliana Columbia-0 ecotype, but absent in at least one other strain, not counting deletions which are part of a more complex rearrangement. Cao et al. (2011) employed a very similar strategy to study the recent evolution of microRNA genes in Arabidopsis thaliana.

In order to determine whether these insertion/deletion events are more likely to represent recent insertions or recent deletions, I determined whether the locus had a significant alignment to any other plant species.
Loci which are absent in some Arabidopsis thaliana strains but which are similar to genomic sequences in other plants likely represent recent deletion events, while sequences which are present in Arabidopsis Col-0 but absent in other ecotypes and all other plant genomes likely represent recent insertion events. Finally, I determined whether recently deleted lincRNA loci are less likely to contain conserved regions (as defined in section 2.3).

Chapter 3
Results

3.1 Discovery of lincRNA loci through Illumina analysis of normalized libraries

Like other classes of non-coding RNA, lincRNAs are characterized by low levels of expression and, at least in animals, a high level of tissue specificity (Cabili et al., 2011; Young et al., 2012; Liu et al., 2012). Because clone library-based methods are somewhat unreliable for the discovery of low copy number transcripts, specialized techniques are required to identify lincRNA genes on a genome-wide scale. Previously, scientists have catalogued lincRNA genes using tiling arrays (Cawley et al., 2004; Matsui et al., 2008), RNA-seq (Guttman et al., 2010), chromatin signature (Guttman et al., 2009), and conserved predicted secondary RNA structure motifs (Hupalo and Kern, 2013). Although the identification of lincRNA genes in plants is in its infancy, both tiling array and RNA-seq based methods have proven effective (Liu et al., 2012). In addition to the set of lincRNA genes annotated by Liu et al. (2012) using a combination of tiling array and RNA-seq techniques, I used a normalized RNA-seq library (Marquez et al., 2012) in order to annotate lincRNA genes de novo. Because normalized libraries are enriched for rare transcripts, library normalization for RNA-seq is promising as a low-cost method for the discovery of novel lincRNA genes (although it makes evaluation of expression level impossible).
Library normalization combined with high throughput sequencing has previously been shown to be an effective strategy for the discovery of non-coding transcripts and other rare RNA species (Guffanti et al., 2009; Marquez et al., 2012).

The Marquez et al. (2012) data set consisted of 115 883 414 paired-end Illumina RNA-seq reads. Of these, 50 801 105 (43.84 %) mapped concordantly to the Arabidopsis thaliana genome using Bowtie2 (Langmead and Salzberg, 2012). After removing reads associated with annotated genes, there remained 268 936 intergenic reads (0.5 % of the total mapped reads). I assembled these reads into 1 220 lincRNA loci. The lincRNA loci which mapped to the mitochondrion were discarded, leaving 1 142. Of these, 133 overlapped with one of the 6 728 lincRNA loci annotated by Liu et al. (2012), 82 overlapped with TAIR10 transposable elements, 229 had predicted open reading frames longer than 100 base pairs, and 94 overlapped with RCTUs discovered by Liu et al. These were all removed from further analysis. I also discarded 828 loci which had an average fold coverage of less than five (meaning that each base pair of the locus was covered by fewer than five reads, on average), leaving 201 novel lincRNA loci.

I performed a number of tests to evaluate my confidence in the authenticity of the novel lincRNA loci. These are summarized in figure 1. There is no significant difference between the average Phred scores or fold coverage of the lincRNA loci which were discovered de novo by my analysis and those which were confirmed by Liu et al. (2012), but the newly discovered lincRNA genes were significantly shorter. The lengths of the lincRNA loci, both those annotated by Liu et al. (2012) and those discovered de novo by my analysis, vary greatly, from the minimum length of 200 base pairs up to more than 2 100.
The distribution in size is roughly exponential: longer lincRNA loci are relatively rare compared to loci around 200 nucleotides.

3.2 LincRNA sequence similarity in other plant genomes

I identified genomic loci in other plant genomes with a high level of sequence similarity to lincRNA genes in Arabidopsis thaliana using a pair-wise local alignment method. Alignments with an e-value of less than 10^-30 were considered conserved blocks. In order to identify additional, short regions of high sequence conservation, I examined any further alignments which were at least 85 % identical to a lincRNA gene across at least 50 base pairs. Those are referred to as short highly conserved blocks (SHCBs). These identification criteria were chosen after examination of the distributions of these statistics over the alignments (figures 2 and 3 and table 2) as well as visual examination of the alignments, with the goal of finding regions of high sequence similarity, erring on the side of specificity rather than sensitivity in order to minimize false positives.

Because preliminary analysis suggested that there are very few highly conserved blocks of primary sequence between Arabidopsis thaliana and its distant relatives, I chose to focus on finding conserved loci in the four Brassicaceae species with available genomes, aside from Arabidopsis thaliana: A. lyrata, Capsella rubella, Brassica rapa and Eutrema parvulum. Sequence alignments were performed using MegaBLAST. Within the Brassicaceae, I found 34 730 conserved blocks and an additional 8 279 SHCBs. As expected, the SHCBs were identical to their targets at a larger proportion of sites, but covered a slightly lower percentage of the lincRNA gene (table 3).
Together with the fact that SHCBs tend to be found in longer lincRNA loci than conserved blocks annotated using the e-value criterion, this suggests that the strategy of looking for regions of high similarity at a fixed minimum length is more effective than using the e-value of the alignment at annotating short regions of high conservation.

Roughly 90 % of the Arabidopsis thaliana lincRNA loci examined have a conserved block in Arabidopsis lyrata. This value falls to roughly 50 % in Capsella rubella, the next closest relative of A. thaliana included in this study, and then to roughly 30 % in both Brassica rapa and Eutrema parvulum (table 4).

The analysis was expanded to include the genomes of 18 other plant species (table 1) in order to identify regions of deep conservation and compare the rates of sequence evolution between lincRNA genes and protein coding genes. When all species are considered, there were a total of 458 628 pair-wise genomic alignments. Table 2 summarizes the e-values of these alignments, and the percent-identity and percent-coverage statistics are summarized in figure 2. The e-values of the alignments follow a roughly exponential distribution, with about half of the total alignments having a value greater than 0.005. Most alignments cover a relatively small proportion of the locus (10 to 30 % of the length of the lincRNA gene), and are identical at fewer than 50 % of the bases of the entire lincRNA locus.

In order to compare rates of evolution in lincRNA genes and protein-coding genes, I carried out an identical pairwise-alignment experiment using the Arabidopsis thaliana coding regions annotated by TAIR10 (Lamesch et al., 2012). I aligned the coding sequences to the same genomic data that were used in the alignment of lincRNA genes, and processed the alignments using the same two criteria. The phylogenetic positions of these alignments are summarized in figure 4.
Curiously, there were far more SHCBs than conserved blocks discovered using the e-value criterion in CDS regions (3 541 208 to 430 728), suggesting that long regions of high similarity are much rarer in lincRNA genes than in protein coding genes. The proportion of conserved blocks confined to the Brassicaceae was much lower in the CDS alignments (43 % of conserved blocks, 8.2 % of SHCBs), which suggests that lincRNA loci are subject to deletions and rapid sequence evolution much more frequently than protein-coding genes. In particular, 32 of 2000 protein coding genes had a significant conserved block in every genome analyzed, while only 4 lincRNA loci out of more than 6 600 had such a block.

The phylogenetic positions of conserved blocks in protein-coding genes and lincRNA genes are markedly different. Within the Brassicaceae, there are far more Arabidopsis thaliana lincRNA genes than protein coding genes with conserved blocks in A. lyrata and no other species, while protein coding genes are much more likely to be shared by all the Brassicaceae species examined (figure 4). A far larger proportion of lincRNA genes than protein coding genes lack conserved blocks outside of the Brassicaceae. This difference is particularly striking when comparing alignments of protein coding genes and lincRNA genes between Arabidopsis thaliana and the Fabidae [1]. Within this clade, at least 33 % of the protein coding genes sampled have a conserved block, while this is true of only 0.3 % of lincRNA genes at most (table 4).

3.3 Conserved regions of Arabidopsis lincRNA loci

Because lincRNA genes with known functions tend to have relatively short, evolutionarily conserved functional regions (Ponting et al., 2009), I identified putatively conserved regions in my lincRNA data set which are present in every genomic region to which the locus aligns.
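A region that is present in every alignment of a locus amounts to the intersection of the aligned intervals on the lincRNA sequence. The sketch below is illustrative only and assumes that each significant alignment has been reduced to a (start, end) pair in locus coordinates, which is a hypothetical representation. It follows the rules stated in this section: a locus with a single significant alignment, or with alignments that share no common overlap, has no conserved region.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # (start, end) on the lincRNA locus, end exclusive


def conserved_region(alignments: List[Interval]) -> Optional[Interval]:
    """Return the span shared by all alignments, or None if there is none."""
    if len(alignments) < 2:
        # A single significant alignment does not define a conserved region.
        return None
    start = max(s for s, _ in alignments)   # right-most alignment start
    end = min(e for _, e in alignments)     # left-most alignment end
    return (start, end) if start < end else None


# Three made-up alignments of one locus to three different genomes.
shared = conserved_region([(10, 200), (50, 260), (40, 180)])
```

In this made-up example the shared span is positions 50 to 180 of the locus.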
For the purposes of this study, I defined a putative conserved region as the region of a lincRNA locus which is present in all of the significant genomic alignments of that locus at a particular stringency. A locus was considered not to have a conserved region if not all of the significant alignments overlapped, or if there was only one significant alignment. In total, I discovered 3178 conserved regions using the low stringency alignments and 462 using high stringency (table 5). The majority of conserved regions are found in close relatives of Arabidopsis thaliana.

In general, the trend in lincRNA genes is toward short regions of conservation in the centre of the gene flanked by relatively long regions which are highly divergent in different organisms. Outside of conserved regions, there is often evidence of dramatic sequence evolution, possibly including large scale insertions/deletions and a high rate of single nucleotide variation. This usually results in a long, unalignable region of the lincRNA locus outside of the conserved region. An example of this pattern in a very deeply conserved locus is shown in figure 5. Because the rate of interspecific sequence variation is so high, the alignments of non-conserved regions are not of sufficient quality to make a rigorous estimate of the rate of sequence evolution among species.

In order to detect conserved regions with a reduced rate of sequence evolution, I compared the number of intraspecific variants from the Arabidopsis 80 genomes project (Cao et al., 2011) in the conserved regions to the number of variants in the surrounding lincRNA locus. The results are summarized in table 6. In total, I found 6 conserved regions which have experienced significantly fewer recent single-nucleotide mutations than the rest of the gene (P ≤ 0.05, false discovery rate = 0.10). The alignments of these six conserved regions are shown in figure 6.

[1] The Fabidae species included in the analysis are Glycine max, Phaseolus vulgaris, Cannabis sativa, Malus domestica, Populus trichocarpa, and Ricinus communis.

3.4 Recent evolutionary dynamics of lincRNA genes

I was able to use Arabidopsis thaliana ecotype resequencing data to examine the degree to which lincRNA loci are subject to large structural variation. Using annotation of structural variants among different Arabidopsis thaliana ecotypes from the 80 Genomes project (Cao et al., 2011), I examined the frequency with which entire lincRNA loci are deleted from the Arabidopsis thaliana genome relative to the Col-0 genome. Of the 6 681 lincRNA loci included in my analysis, I found 570 which were entirely missing in at least one Arabidopsis thaliana ecotype. Of these, 205 lacked any significant alignments to other plant genomes using either the high or low stringency criterion. Figure 7 summarizes the number of ecotypes in which a sequence that is unique to Arabidopsis thaliana is deleted. In the majority of cases, the locus is absent in only one or a few ecotypes, suggesting that these are cases of recent deletion of a locus which is present in most Arabidopsis thaliana individuals. In a small minority of cases, however, the locus is absent in virtually all Arabidopsis thaliana ecotypes except Col-0.

In the majority of cases (352/570), loci with annotated large insertion/deletion events within the Arabidopsis thaliana species lack a conserved region (see table 7). None of the loci with annotated insertion/deletion events are among the six which were found to have significantly fewer mutations in their conserved regions.

Chapter 4

Discussion

4.1 Identifying lincRNA genes

In total, my analysis of the Marquez et al. (2012) normalized library RNA-seq data reconfirmed 133 of the 6480 (2.0 %) lincRNA loci identified by Liu et al. (2012), and provided evidence for 201 novel lincRNA genes. Liu et al. (2012) had far more success with RNA-seq analysis, reconfirming more than 2700 loci out of the 6480 that were first identified with tiling array data.
However, this is not an entirely fair comparison, since the two data sets differed markedly in the tissues prepared: Marquez et al. (2012) used flowers and whole seedlings, whereas Liu et al. (2012) used flower, leaf, root and silique samples. In addition, the two RNA-seq data sets were created using different platforms: Marquez et al. (2012) used five lanes of 75 nucleotide paired end reads on the Illumina GA system, whereas Liu et al. (2012) had four lanes of 101 nucleotide single end reads on the Illumina HiSeq 2000 platform. Although these differences prevent us from making a rigorous estimate of the degree to which library normalization improves detection of lincRNA transcripts, the relatively large number of novel lincRNA genes discovered by analysis of normalized RNA-seq data suggests that the procedure may provide a valuable increase in sensitivity. The fact that 201 novel lincRNA species were discovered in a single RNA-seq experiment when a novel tissue type is included suggests the possibility that there are many more undiscovered lincRNA transcripts in Arabidopsis thaliana.

Although Liu et al. (2012) found evidence for many more lincRNA transcripts through the analysis of tiling array data sets than either they or I were able to confirm through RNA-seq, this does not necessarily indicate that tiling arrays are a more sensitive tool. The technique which Liu et al. describe relies on an enormous volume of data: more than 200 data sets were included in the analysis, including RNA libraries from 14 different Arabidopsis mutants, 18 stress conditions and 6 tissue types. If all of these libraries were submitted to an RNA-seq experiment rather than a tiling array, it is quite possible that many more lincRNA transcripts would have been discovered.
Indeed, if the recent findings in mammals are any guide, there may be many thousands of as-yet unannotated Arabidopsis lincRNA genes (Managadze et al., 2013).

4.2 Conserved regions in lincRNA loci

Overall, the general pattern in conserved lincRNAs is patches of higher conservation within a poorly conserved overall sequence. The average conserved block discovered using the e-value filtering criterion covers slightly more than half of the lincRNA locus (table 5). Expanding the filtering criteria to include any regions of at least 85 % identity across 50 base pairs or more adds a large number of conserved blocks which cover only 14 % of the locus on average (table 5). Figure 5 shows a good example of an island of high conservation within a lincRNA locus: the locus shown is 480 base pairs long, and has a conserved region of approximately 200 base pairs which is present in every plant genome included in this study. Across the rest of its length, however, I was unable to find any conserved blocks. The pattern of conserved regions within loci that are relatively poorly conserved overall is consistent with patterns of sequence evolution that have been found in functional lincRNA genes in mammals (Pang et al., 2006). Curiously, microRNA genes have also been observed to show a high rate of sequence evolution in the nucleotides flanking conserved hairpin structures (Berezikov et al., 2005).

That many lincRNA genes in my analysis lack a conserved region does not necessarily indicate a lack of functional importance, since many of these transcripts could have conserved functional regions too short to detect by primary sequence analysis alone, or could have conserved secondary structure motifs that function without conserved regions of primary structure. Detecting the evolutionary conservation of such structures will doubtless be extremely challenging.
It seems likely that, if any lincRNA genes are functional only because of extremely short conserved regions or secondary structure, it will not be possible to effectively study the degree to which these structures are preserved by natural selection until they are characterized experimentally in a functional biology setting.

Although lincRNA genes in general have a high rate of sequence evolution, there are many lincRNA transcripts with known function which have short, conserved regions (Pang et al., 2006), and the annotation of conserved regions has been shown to be an effective strategy for finding functional lincRNA transcripts (Willingham et al., 2005). Therefore, the detection of areas with a reduced rate of sequence evolution within lincRNA loci is a promising strategy for the discovery of transcripts of functional importance. Several of the conserved regions that were discovered through genomic alignments with other species show signs of a reduced rate of sequence evolution among different Arabidopsis thaliana lines. This is consistent with the hypothesis that these regions are of functional importance, possibly representing miRNA or protein binding sites (although, of course, this can only be conclusively demonstrated with functional studies). Although many conserved regions do not show reduced rates of sequence evolution compared to the rest of the locus, I cannot reject the hypothesis that these regions are of functional importance on this basis alone. In many cases, there is no variant data available at all for a particular lincRNA locus, making it impossible to draw a conclusion one way or the other regarding recent evolutionary conservation of the locus. It is possible that many more of the conserved regions in my data set are under purifying selection that is invisible due to lack of data.

Although a reduced rate of sequence evolution is not sufficient evidence to conclude that these lincRNA genes are functionally important, these results are very suggestive.
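As a worked illustration of the test summarized in table 6, the one-sided binomial probability can be computed directly. This is a sketch under an assumption that is not spelled out in the text: that under the null hypothesis each SNV falls inside the conserved region with probability equal to the fraction of the locus which the region covers. The function name is illustrative.

```python
from math import comb


def binomial_lower_tail(k: int, n: int, p: float) -> float:
    """Return P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# Locus At1NC030450 in table 6: 0 SNVs inside the conserved region,
# 25 outside, and the conserved region covers 29.4 % of the locus.
inside, outside, fraction = 0, 25, 0.294
p_value = binomial_lower_tail(inside, inside + outside, fraction)
```

For this locus the sketch gives a probability of about 1.7 × 10^-4, in line with the value listed in table 6. Correcting the resulting P values for multiple testing is a separate step that is not shown here.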
The experience of animal researchers suggests that examining evolutionarily conserved lincRNA genes is an effective strategy for discovering novel functionally important transcripts (Willingham et al., 2005). Similar studies of lincRNA conservation in animals suggest that there are thousands of non-coding transcripts whose functions have yet to be annotated (Guttman et al., 2009, 2010; Managadze et al., 2013). It would therefore be well worth exploring this set of evolutionarily conserved lincRNA genes in the context of a functional study.

Although evolutionary conservation of primary sequence has been shown to be an effective criterion for the discovery of functional lincRNA transcripts, there are other avenues which are worth exploring. Many lincRNA transcripts in mammals have substantial predicted secondary structure (Ponting and Belgard, 2010; Tsai et al., 2011), and enrichment in secondary structures is known to be correlated with both evolutionary conservation and specificity of expression (Marques and Ponting, 2009). However, it has not been demonstrated in vivo that disrupting the secondary structure of any lincRNA gene disrupts its function. Nonetheless, it is well worth mining the set of conserved plant lincRNA genes for conserved secondary structural motifs.

4.3 Interpreting conserved sequence between lincRNA transcripts and genomic regions

My strategy for identifying lincRNA homologs is based on pair-wise genomic alignments. It is not possible to conclude with confidence that an alignment between an Arabidopsis thaliana lincRNA gene and the genome of another species represents a lincRNA gene in that species. It is possible that the locus is not transcribed in the other species, or even that it is both transcribed and translated into a short peptide.
At present the transcriptome data available for non-Arabidopsis plant species are too thin to attempt a comprehensive genome-wide analysis of transcription and translation of lincRNA homologs in the rest of the plant kingdom on the scale that can be accomplished with a study of conserved genomic DNA elements. As deeper transcriptome data sets become available in a variety of plant species, it will be interesting to see to what degree lincRNA genes transition between non-transcribed intergenic space, non-coding transcribed region and protein-coding sequence.

As an alternative to using publicly available transcriptome data, it may be possible to use comparable, matched RNA-seq data sets from two related plant species and compare the rate of conservation of lincRNA genes discovered de novo. Although these experiments would necessarily involve a smaller number of species than my approach, this would provide direct evidence that the lincRNA loci in question are expressed in both species. This approach is also likely to identify far fewer lincRNA loci than a tiling-based approach, which, as discussed, is more sensitive but requires many more individual experiments and depth of data.

High sequence similarity between lincRNA genes and genomic regions may be caused by the origins of the lincRNA gene, rather than because the gene really is shared amongst plant species. For example, the lincRNA gene At3NC056191 identified by Liu et al. (2012) has an alignment in every genome included in my study, but a BLAST search of the sequence suggests that these conserved regions are highly similar to ribosomal RNA sequence. This suggests that the locus in question may be descended from an rRNA gene, or possibly a previously unannotated gene copy in the rRNA family. In other cases, the sequence similarity may be due to the inclusion of partial, unannotated repeats, or even the inclusion of conserved DNA elements (such as promoters) in the locus.
As with any other sequence alignment, researchers should be cautious about interpreting sequence similarity to be indicative of direct homology without independent confirmation.

4.4 Evolutionary comparisons of lincRNAs to protein coding genes and miRNAs

In contrast to protein coding genes, alignments of lincRNA genes are generally quite short. This is consistent with the hypothesis that lincRNA genes have only short stretches of primary sequence which are required for function, while the rest of the locus is relatively unconstrained in terms of evolution. This is not the case for protein coding genes, in which point mutations along much of the length can cause a disruptive frame shift mutation, dramatically altering the function. However, a similar pattern can be seen to a lesser extent in protein coding genes, in which the rate of sequence evolution is relatively slow in regions where the three dimensional structure of the protein is required for function and relatively rapid in "intrinsically disordered" regions with no consistent tertiary structure (Brown et al., 2002).

My analysis shows that, compared with protein coding loci, lincRNA loci are generally less broadly conserved. This is consistent with the hypothesis that, in addition to a high rate of primary sequence evolution, lincRNA genes have a very rapid rate of emergence and decline within lineages (Hyashizaki, 2004; Ponting et al., 2009). Compared with protein coding sequences, lincRNA genes are apparently lost very frequently, as is evident in the relatively large number of deletions of lincRNA loci in different Arabidopsis thaliana ecotypes (figure 7).
This raises the question of how lincRNA genes maintain their diversity in Arabidopsis thaliana despite a relatively high rate of loss.

Small RNA transcriptome sequencing studies of microRNA genes have shown that conservation is highly variable: some families are highly conserved throughout the plant kingdom while others are absent from the databases outside of Arabidopsis thaliana (Zhang et al., 2006). Although my alignments do not include secondary structure predictions, I find no evidence of a similar core group of highly conserved lincRNA genes. However, lincRNA genes apparently share a tendency with microRNA genes for rapid evolution, and frequent loss within different Arabidopsis thaliana ecotypes (Cao et al., 2011). Cao et al. (2011) found that microRNA genes which are deleted in at least one Arabidopsis thaliana ecotype are either not conserved in other plant species, or are members of large gene families. In microRNA genes, loss within A. thaliana is correlated with the presence of multiple-copy families. If lincRNA genes are also found in large families, that could partly explain their apparent tendency toward frequent deletion. It is also possible that lincRNA genes are frequently deleted due to redundancy in function with unrelated genes (which may or may not be lincRNAs), or that they have nonessential or nonexistent functions.

4.5 The origins of lincRNA genes

My analysis of structural variants in different Arabidopsis thaliana ecotypes suggests that there are a small number of lincRNA loci which are absent in the majority of strains aside from Col-0, and which do not appear to be highly conserved in any other plant species (figure 7). This suggests that these loci may have originated very recently as a result of large-scale structural mutations.
Although it is extremely challenging to predict lincRNA genes which arose from such sequence rearrangements, there is at least one known case of a lincRNA gene which arose from a chromosomal rearrangement bringing together two previously untranscribed genomic regions (Ponting et al., 2009).

In mammals, lincRNAs do not apparently form large families by comparison to protein coding genes, which has led to speculation that, while protein coding genes typically arise by duplication and divergence, lincRNAs and other non-coding genes may arise from intergenic space (Ponting et al., 2009). The extent to which lincRNA genes form families in plants is unclear. Research is underway in the Adams lab to determine the extent to which lincRNA genes are conserved after whole-genome and other duplication events. If indeed lincRNA genes are frequently duplicated, this could help to explain the apparently great diversity of lincRNA genes in Arabidopsis despite the frequency with which they undergo deletions. On the other hand, if lincRNA genes are not frequently duplicated and retained, they must commonly originate from other classes of genes or intergenic DNA.

There is already evidence for the origins of some lincRNA genes in coding sequences. The Xist gene in mammals, for example, has its origins in the pseudogenization of a protein-coding gene (Duret et al., 2006; Elisaphenko et al., 2008). It is also known that protein-coding genes can arise de novo from intergenic sequences (Carvunis et al., 2012), a process in which lincRNA genes may play a transitional role.
Detailed studies of the origins of specific lincRNA genes are needed to address the issue of how these transcripts maintain their diversity in the face of frequent loss.

Conclusion

Studies in mammals suggest that assembling a comprehensive catalogue of the lincRNAs in a transcriptome requires a tremendous depth of sequencing coverage across many experiments due to the low expression levels and high tissue specificity of lincRNA transcripts. My results suggest that the situation is no different in Arabidopsis: a single lane of Illumina analysis has added 201 transcripts to the catalogue of known lincRNAs. There is every reason to expect that deeper coverage of the Arabidopsis non-coding transcriptome, aided by library normalization, will uncover many more lincRNA transcripts.

Although lincRNA loci clearly have a much higher rate of sequence evolution and turnover than protein coding genes, many have stretches of highly conserved nucleotides, and a few show signs of an ongoing reduced rate of sequence evolution. These patterns of evolution are consistent with what has been found in functional lincRNA genes in animals. On the other hand, the relatively high proportion of lincRNA loci which have experienced deletion in the recent evolutionary history of Arabidopsis thaliana suggests that many of these transcripts are non-functional, or have redundant functions. As more lincRNA genes are functionally characterized in plants, it will become clear what proportion of lincRNA transcripts have a function. However, the high rate of deletion of lincRNA loci among Arabidopsis thaliana ecotypes suggests that many such loci are not under strong purifying selection.

The high rate of turnover of lincRNA loci in plant genomes suggests that these transcripts may play a role in providing variation in the non-coding transcriptome which provides natural selection with raw material for the evolution of new functions.
If lincRNA and other ncRNA transcripts arise frequently from intergenic space, transcripts which, by chance, have secondary structure or binding properties with beneficial functional consequences could be preserved by natural selection, resulting in de novo gene birth. New lincRNA loci may be the result of the evolution of new promoter elements in intergenic space by random drift. MicroRNA genes have been found to have originated in this way in Drosophila (Nozawa et al., 2010). Detailed studies of the origins of lincRNA transcripts from intergenic space are difficult (Ponting et al., 2009), but will be required in order to determine the evolutionary roles of lincRNA genes.

Although the importance of lincRNA genes as a class is still unclear, we have tantalizing hints that these transcripts may be of evolutionary and functional importance. Studies of the evolutionary dynamics and degree of sequence conservation of lincRNA genes, such as this one, are the first step in determining what role they play in the function of organisms and the evolution of new genes.

Tables

Species                      Reference
Arabidopsis lyrata           Hu et al. (2011)
Capsella rubella             Slotte et al. (2013)
Brassica rapa                Wang et al. (2011)
Eutrema parvulum             Dassanayake et al. (2011)
Citrus clementina            International Citrus Genome Consortium (2011)
Gossypium raimondii          Wang et al. (2012)
Eucalyptus grandis           Eucalyptus grandis Genome Project (2010)
Glycine max                  Schmutz et al. (2010)
Phaseolus vulgaris           DOE-JGI and USDA-NIFA (2013)
Malus domestica              Velasco et al. (2010)
Populus trichocarpa          Tuskan et al. (2006)
Ricinus communis             Chan et al. (2010)
Cannabis sativa              van Bakel et al. (2011)
Vitis vinifera               Jaillon et al. (2007)
Mimulus guttatus             Mimulus Genome Project and DOE-JGI (2013)
Solanum lycopersicum         Tomato Genome Consortium (2012)
Aquilegia coerulea           DOE-JGI (2013)
Brachypodium distachyon      International Brachypodium Initiative (2010)
Oryza sativa                 Goff et al. (2002)
Zea mays                     Schnable et al. (2009)
Selaginella moellendorffii   Banks et al. (2011)
Physcomitrella patens        Rensing et al. (2008)
Chlamydomonas reinhardtii    Merchant et al. (2007)

Table 1: Plant genomes used in the identification of Arabidopsis thaliana lincRNA conserved blocks by pairwise alignment.

e value            Frequency   Cumulative Frequency   Relative Frequency
(0,1e-100]         3125        3125                   0.01
(1e-100,1e-50]     18385       21510                  0.04
(1e-50,1e-30]      27024       48534                  0.06
(1e-30,1e-20]      22006       70540                  0.05
(1e-20,1e-15]      18549       89089                  0.04
(1e-15,1e-05]      85748       174837                 0.19
(1e-05,0.0001]     19794       194631                 0.04
(0.0001,0.001]     25307       219938                 0.06
(0.001,0.01]       30525       250463                 0.07
(0.01,0.1]         57089       307552                 0.12
(0.1,1]            67078       374630                 0.15
(1,10]             83708       458338                 0.18

Table 2: Summary of the e values of the BLAST pairwise alignments. The breaks are exclusive at the lower limit, and inclusive at the upper limit. The total number of alignments at each e-value level includes both lincRNAs identified by Liu et al. (2012) and by my analysis of the Marquez et al. (2012) data (see text).

                Conserved Blocks    SHCBs
n               50955               29653
query length    356.42 ± 310.15     379.31 ± 271.38
% coverage      0.55 ± 0.31         0.14 ± 0.10
% identity      0.82 ± 0.062        0.87 ± 0.024

Table 3: Summary of the results of the two methods of annotating regions of high sequence similarity. "Conserved blocks" are regions of a lincRNA locus with a MegaBLAST pairwise alignment to another genome with an e-value less than 10^-30. "SHCBs" (short highly conserved blocks) are regions which were not annotated as conserved blocks at the first step, but which have a MegaBLAST alignment of at least 85 % identity across at least 50 base pairs. n is the total number of conserved blocks annotated at each step. "Query length" is the average length of the lincRNA locus. "% coverage" is the average fraction of the lincRNA locus which is aligned. "% identity" is the average fraction of bases which are identical in the lincRNA and genomic sequence.
All measures of error are one standard deviation.

                             lincRNA            CDS
Species                      Blocks   SHCBs     Blocks   SHCBs
Arabidopsis lyrata           5320     1033      1897     994
Capsella rubella             2547     491       1766     862
Brassica rapa                1655     427       1649     946
Eutrema parvulum             1945     301       1657     792
Citrus clementina            21       41        818      353
Gossypium raimondii          25       45        801      440
Eucalyptus grandis           20       38        689      341
Glycine max                  21       45        730      410
Phaseolus vulgaris           18       28        675      327
Malus domestica              25       51        790      387
Populus trichocarpa          15       41        811      398
Ricinus communis             23       43        776      341
Vitis vinifera               21       38        758      378
Mimulus guttatus             18       40        622      296
Solanum lycopersicum         35       47        672      331
Aquilegia coerulea           19       44        648      301
Brachypodium distachyon      10       21        383      189
Oryza sativa                 14       23        383      224
Zea mays                     16       27        380      283
Selaginella moellendorffii   7        16        151      108
Physcomitrella patens        4        22        178      137
Chlamydomonas reinhardtii    1        3         14       53

Table 4: The number of Arabidopsis thaliana lincRNA loci and protein coding sequences with conserved blocks in the genomes of other species. "Blocks" indicates the number of genes with conserved blocks found using the e-value criterion described in the text, while "SHCBs" (short highly conserved blocks) indicates the number of genes which were added to the data set when any alignment with an identity of 85 % over at least 50 base pairs was also included. "lincRNA" indicates alignments to one of the 6681 lincRNA genes included in this study, while the "protein coding" (CDS) alignments were made using the coding sequences of a random sample of 2000 of the protein coding genes annotated by TAIR10.

                  Conserved Blocks   SHCBs
Length            185.73 ± 91.70     121.98 ± 84.83
% Locus Length    58.8 ± 2.1         37.6 ± 2.6
% Identity        81.5 ± 6.1         88.3 ± 2.5

Table 5: Summary statistics of the conserved regions of lincRNA loci. "% Locus Length" is the length of the conserved region divided by the length of the locus, and "% identity" is the fraction of the bases of the conserved region which are identical in all alignments. All values are in the format "mean ± standard deviation". "Conserved Blocks" indicates the conserved regions which were found using conserved blocks defined solely by the e-value criteria described in the text, while "SHCBs" indicates the additional conserved regions which were annotated using short, highly conserved blocks (see text).

Locus ID       SNVs (inside/outside)   % length   P              Q
At5NC004520    11/21                   73.2       5.00 × 10^-6   6.78 × 10^-3
At1NC064140    1/15                    61.1       7.23 × 10^-6   6.78 × 10^-3
At5NC061480    9/75                    31.2       9.07 × 10^-6   6.78 × 10^-2
At1NC027691    11/8                    91.3       1.05 × 10^-4   6.25 × 10^-2
At1NC030450    0/25                    29.4       1.68 × 10^-4   8.12 × 10^-2
At3NC014370    0/24                    30.0       1.90 × 10^-4   8.12 × 10^-2

Table 6: Evidence for reduced rate of sequence evolution in conserved regions of lincRNA genes. SNVs are the number of distinct single nucleotide variants annotated by Cao et al. (2011) inside the conserved region and along the rest of the locus respectively. "% length" is the proportion of the lincRNA locus covered by the conserved region. P is the result of a binomial test with the null hypothesis that SNVs are equally likely or more likely to occur within the conserved region than in the non-conserved portions of the locus. Q values were obtained using Benjamini and Hochberg false discovery rate multiple test correction. Only loci with significantly fewer SNVs within their conserved regions at α = 0.05 and FDR = 0.1 are shown.

                     Conserved region
                     Present   Absent
Deletion   Present   218       352
           Absent    3156      3045

Table 7: Co-occurrence of conserved regions and whole-locus deletions in lincRNA loci. "Deletion" refers to a deletion spanning the entire locus in at least one Arabidopsis thaliana ecotype as annotated by Cao et al. (2011).
Conserved regions are defined in the text. Loci with conserved regions are significantly less likely to have whole-locus deletions (P < 10^-5, Fisher's exact test).

Figures

[Figure 1: three violin plots comparing Overlapping and Novel loci; panels: Phred score, Fold coverage, Locus length (bp)]

                 Overlapping   Novel
N                31            209
Phred Score      33.73         33.39
Fold Coverage    14.33         13.85
Length           550.20        405.07 ***

Figure 1: Violin plots of the differences in confidence statistics between putative lincRNA genes which are new in my analysis and those which were discovered independently by Liu et al. (2012). The white circle indicates the median, while the black rectangle spans the first through third quartiles. The thin curves represent the density estimator. "N" is the number of alignments in each category. "Phred score" is the average Phred score of the reads supporting the alignment (Ewing et al., 1998). "Fold coverage" is the average number of reads which cover the locus at any base. "Length" is the length of the locus in base pairs. The table gives the average values in each case. *** indicates a significant difference at P < 0.0001 (Wilcoxon rank sum test).

Figure 2: Scatter plot of the characteristics of the alignments to lincRNA genes in all species. "Fraction aligned" is the proportion of the length of the lincRNA gene which can be aligned to a plant genome by MegaBLAST. "Fraction identical" is the proportion of the alignment which is a perfect match to the target genome. The points have been binned into cells for ease of reading. Darker cells indicate a larger number of points.

[Figure 3: four scatter plots; panels: Arabidopsis lyrata, Capsella rubella, Brassica rapa, Eutrema parvulum]

Figure 3: Scatter plots of the characteristics of the alignments to lincRNA genes in Brassicaceae species.
The x-axis of each plot is the proportion of the length of the lincRNA gene which can be aligned to the indicated plant genome by MegaBLAST. The y-axis is the proportion of the alignment which is a perfect match to the target genome. The points have been binned into cells for ease of reading. A black cell indicates more than 1 000 hits, while the lightest grey shading indicates a single hit.

[Figure 4: two cladograms with tips Arabidopsis thaliana, Arabidopsis lyrata, Capsella rubella, Brassica rapa and Eutrema parvulum. (a) lincRNA genes, internal node labels: 1431 (21%), 793 (12%), 2387 (36%). (b) protein coding genes, internal node labels: 1630 (82%), 77 (4%), 108 (5%).]

Figure 4: Phylogenetic positions of conserved blocks of lincRNA genes and protein coding sequences within the Brassicaceae. Internal node labels indicate the number of loci with a conserved block in all of the members of that clade, but none of the other species in the phylogeny. The lincRNA genes are the 6681 Arabidopsis thaliana lincRNA loci annotated by Liu et al. (2012) and myself (see text). The protein coding genes are 2000 randomly selected coding sequences from the TAIR10 Arabidopsis annotations (Lamesch et al., 2012).

Figure 5: Partial alignment of a very broadly conserved lincRNA locus. The entire alignment is shown for the Arabidopsis thaliana locus, while only the aligned portion of the locus is shown in other species. Alignments were performed using Clustal Omega (Sievers et al., 2011) and visualized using MView (Brown et al., 1998). Highlighting indicates identity to the reference Arabidopsis thaliana lincRNA sequence. When there were multiple alignments in a single species, the alignment with the fewest gaps is shown.

Figure 6: Alignments of the Arabidopsis thaliana lincRNA genes which have evidence of a reduced rate of sequence evolution within their conserved regions.
The entire alignment is shown for the Arabidopsis thaliana locus, while only the aligned portion of the locus is shown in other species. A gap in the alignment outside of the conserved region therefore does not necessarily indicate that the lincRNA gene is shorter in that species, only that the sequence is so different as to be unalignable. Alignments were performed using Clustal Omega (Sievers et al., 2011) and visualized using MView (Brown et al., 1998). Highlighting indicates identity to the reference Arabidopsis thaliana lincRNA sequence. When there were multiple alignments in a single species, the alignment with the fewest gaps is shown.

[Figure 7 image: histogram; x-axis: Number of Strains (0 to 80); y-axis: Frequency (0 to 300)]

Figure 7: Frequency of deletions of Arabidopsis thaliana lincRNA loci which do not have a significant alignment to any other plant genome. Grey bars represent loci which do not have a low stringency alignment, while white bars represent loci which are unaligned using the high stringency criterion. A deletion of the locus is defined as a deletion event predicted by Cao et al. (2011) which includes the entire lincRNA locus in at least one of the 80 strains examined.

Bibliography

Banks, J. A., Nishiyama, T., Hasebe, M., Bowman, J. L., Gribskov, M., DePamphilis, C., Albert, V. A., Aono, N., Aoyama, T., Ambrose, B. A., Ashton, N. W., Axtell, M. J., Barker, E., Barker, M. S., Bennetzen, J. L., Bonawitz, N. D., Chapple, C., Cheng, C., Correa, L. G. G., Dacre, M., DeBarry, J., Dreyer, I., Elias, M., Engstrom, E. M., Estelle, M., Feng, L., Finet, C., Floyd, S. K., Frommer, W. B., Fujita, T., Gramzow, L., Gutensohn, M., Harholt, J., Hattori, M., Heyl, A., Hirai, T., Hiwatashi, Y., Ishikawa, M., Iwata, M., Karol, K. G., Koehler, B., Kolukisaoglu, U., Kubo, M., Kurata, T., Lalonde, S., Li, K., Li, Y., Litt, A., Lyons, E., Manning, G., Maruyama, T., Michael, T. P., Mikami, K., Miyazaki, S., Morinaga, S.-i., Murata, T., Mueller-Roeber, B., Nelson, D. R., Obara, M., Oguri, Y., Olmstead, R.
G., Onodera, N., Petersen, B. L., Pils, B., Prigge, M., Rensing, S. A., Riaño Pachón, D. M., Roberts, A. W., Sato, Y., Scheller, H. V., Schulz, B., Schulz, C., Shakirov, E. V., Shibagaki, N., Shinohara, N., Shippen, D. E., Sørensen, I., Sotooka, R., Sugimoto, N., Sugita, M., Sumikawa, N., Tanurdzic, M., Theissen, G., Ulvskov, P., Wakazuki, S., Weng, J.-K., Willats, W. W. G. T., Wipf, D., Wolf, P. G., Yang, L., Zimmer, A. D., Zhu, Q., Mitros, T., Hellsten, U., Loqué, D., Otillar, R., Salamov, A., Schmutz, J., Shapiro, H., Lindquist, E., Lucas, S., Rokhsar, D., and Grigoriev, I. V. (2011). The Selaginella genome identifies genetic changes associated with the evolution of vascular plants. Science, 332:960–963.

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 57:289–300.

Berezikov, E., Guryev, V., van de Belt, J., Wienholds, E., Plasterk, R. H. A., and Cuppen, E. (2005). Phylogenetic shadowing and computational identification of human microRNA genes. Cell, 120:21–24.

Brown, C. J., Takayama, S., Campen, A. M., Vise, P., Marshall, T. W., Oldfield, C. J., Williams, C. J., and Dunker, A. K. (2002). Evolutionary rate heterogeneity in proteins with long disordered regions. Journal of Molecular Evolution, 55:104–110.

Brown, N. P., Leroy, C., and Sander, C. (1998). MView: a web-compatible database search or multiple alignment viewer. Bioinformatics, 14:380–381.

Burge, C. and Karlin, S. (1997). Prediction of complete gene structures in human genomic DNA. Journal of Molecular Biology, 268:78–94.

Burleigh, S. H. and Harrison, M. J. (1997). A novel gene whose expression in Medicago truncatula roots is suppressed in response to colonization by vesicular-arbuscular mycorrhizal (VAM) fungi and to phosphate nutrition. Plant Molecular Biology, 34:199–208.

Cabili, M.
N., Trapnell, C., Goff, L., Koziol, M., Tazon-Vega, B., Regev, A., and Rinn, J. L. (2011). Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes & Development, 25:1915–1927.

Cao, J., Schneeberger, K., Ossowski, S., Günther, T., Bender, S., Fitz, J., Koenig, D., Lanz, C., Stegle, O., Lippert, C., Wang, X., Ott, F., Müller, J., Alonso-Blanco, C., Borgwardt, K., Schmid, K. J., and Weigel, D. (2011). Whole-genome sequencing of multiple Arabidopsis thaliana populations. Nature Genetics, 43:956–963.

Carvunis, A.-R., Rolland, T., Wapinski, I., Calderwood, M. A., Yildirim, M. A., Simonis, N., Charloteaux, B., Hidalgo, C. A., Barbette, J., Santhanam, B., Brar, G. A., Weissman, J. S., Regev, A., Thierry-Mieg, N., Cusick, M. E., and Vidal, M. (2012). Proto-genes and de novo gene birth. Nature, 487:370–374.

Cawley, S., Bekiranov, S., Ng, H. H., Kapranov, P., Sekinger, E. A., Kampa, D., Piccolboni, A., Sementchenko, V., Cheng, J., Williams, A. J., Wheeler, R., Wong, B., Drenkow, J., Yamanaka, M., Patel, S., Brubaker, S., Tammana, H., Helt, G., Struhl, K., and Gingeras, T. R. (2004). Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell, 116:499–509.

Chan, A. P., Crabtree, J., Zhao, Q., Lorenzi, H., Orvis, J., Puiu, D., Melake-Berhan, A., Jones, K. M., Redman, J., Chen, G., Cahoon, E. B., Gedil, M., Stanke, M., Haas, B. J., Wortman, J. R., Fraser-Liggett, C. M., Ravel, J., and Rabinowicz, P. D. (2010). Draft genome sequence of the oilseed species Ricinus communis. Nature Biotechnology, 28:951–6.

Dassanayake, M., Oh, D.-H., Haas, J. S., Hernandez, A., Hong, H., Ali, S., Yun, D.-J., Bressan, R. A., Zhu, J.-K., Bohnert, H. J., and Cheeseman, J. M. (2011). The genome of the extremophile crucifer Thellungiella parvula. Nature Genetics, 43:913–918.

DOE-JGI (2013). Aquilegia coerulea v1.0.
Retrieved August 2, 2013, from http://

DOE-JGI and USDA-NIFA (2013). Phaseolus vulgaris v0.9. Retrieved August 2, 2013, from http://

Duret, L., Chureau, C., Samain, S., Weissenbach, J., and Avner, P. (2006). The Xist RNA gene evolved in eutherians by pseudogenization of a protein-coding gene. Science, 312:1653–1655.

Elisaphenko, E. A., Kolesnikov, N. N., Shevchenko, A. I., Rogozin, I. B., Nesterova, T. B., Brockdorff, N., and Zakian, S. M. (2008). A dual origin of the Xist gene from a protein-coding gene and a set of transposable elements. PLoS ONE, 3:11.

ENCODE Project Consortium (2007). Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature, 447:799–816.

Eucalyptus grandis Genome Project (2010). Retrieved August 2, 2013, from

Ewing, B., Hillier, L., Wendl, M. C., and Green, P. (1998). Base-calling of automated sequencer traces using Phred. I. Accuracy assessment. Genome Research, 8:175–185.

Franco-Zorrilla, J. M., Valli, A., Todesco, M., Mateos, I., Puga, M. I., Rubio-Somoza, I., Leyva, A., Weigel, D., García, J. A., and Paz-Ares, J. (2007). Target mimicry provides a new mechanism for regulation of microRNA activity. Nature Genetics, 39:1033–1037.

Goff, S. A., Ricke, D., Lan, T.-H., Presting, G., Wang, R., Dunn, M., Glazebrook, J., Sessions, A., Oeller, P., Varma, H., Hadley, D., Hutchison, D., Martin, C., Katagiri, F., Lange, B. M., Moughamer, T., Xia, Y., Budworth, P., Zhong, J., Miguel, T., Paszkowski, U., Zhang, S., Colbert, M., Sun, W.-l., Chen, L., Cooper, B., Park, S., Wood, T. C., Mao, L., Quail, P., Wing, R., Dean, R., Yu, Y., Zharkikh, A., Shen, R., Sahasrabudhe, S., Thomas, A., Cannings, R., Gutin, A., Pruss, D., Reid, J., Tavtigian, S., Mitchell, J., Eldredge, G., Scholl, T., Miller, R. M., Bhatnagar, S., Adey, N., Rubano, T., Tusneem, N., Robinson, R., Feldhaus, J., Macalma, T., Oliphant, A., and Briggs, S. (2002). A draft sequence of the rice genome (Oryza sativa L. ssp.
japonica). Science, 296:92–100.

Guffanti, A., Iacono, M., Pelucchi, P., Kim, N., Soldà, G., Croft, L. J., Taft, R. J., Rizzi, E., Askarian-Amiri, M., Bonnal, R. J., Callari, M., Mignone, F., Pesole, G., Bertalot, G., Bernardi, L. R., Albertini, A., Lee, C., Mattick, J. S., Zucchi, I., and De Bellis, G. (2009). A transcriptional sketch of a primary human breast cancer by 454 deep sequencing. BMC Genomics, 10:163.

Gupta, R. A., Shah, N., Wang, K. C., Kim, J., Horlings, H. M., Wong, D. J., Tsai, M.-C., Hung, T., Argani, P., Rinn, J. L., Wang, Y., Brzoska, P., Kong, B., Li, R., West, R. B., van de Vijver, M. J., Sukumar, S., and Chang, H. Y. (2010). Long non-coding RNA HOTAIR reprograms chromatin state to promote cancer metastasis. Nature, 464:1071–1076.

Guttman, M., Amit, I., Garber, M., French, C., Lin, M. F., Feldser, D., Huarte, M., Zuk, O., Carey, B. W., Cassady, J. P., Cabili, M. N., Jaenisch, R., Mikkelsen, T. S., Jacks, T., Hacohen, N., Bernstein, B. E., Kellis, M., Regev, A., Rinn, J. L., and Lander, E. S. (2009). Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature, 458:223–227.

Guttman, M., Garber, M., Levin, J. Z., Donaghey, J., Robinson, J., Adiconis, X., Fan, L., Koziol, M. J., Gnirke, A., Nusbaum, C., Rinn, J. L., Lander, E. S., and Regev, A. (2010). Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nature Biotechnology, 28:503–510.

Heo, J. B. and Sung, S. (2011). Vernalization-mediated epigenetic silencing by a long intronic noncoding RNA. Science, 331:76–79.

Hou, X. L., Wu, P., Jiao, F. C., Jia, Q. J., Chen, H. M., Yu, J., Song, X. W., and Yi, K. K. (2005). Regulation of the expression of OsIPS1 and OsIPS2 in rice via systemic and local Pi signalling and hormones. Plant, Cell & Environment, 28:356–364.

Hu, T. T., Pattyn, P., Bakker, E. G., Cao, J., Cheng, J.-F., Clark, R. M., Fahlgren, N., Fawcett, J.
A., Grimwood, J., Gundlach, H., Haberer, G., Hollister, J. D., Ossowski, S., Ottilar, R. P., Salamov, A. A., Schneeberger, K., Spannagl, M., Wang, X., Yang, L., Nasrallah, M. E., Bergelson, J., Carrington, J. C., Gaut, B. S., Schmutz, J., Mayer, K. F. X., Van de Peer, Y., Grigoriev, I. V., Nordborg, M., Weigel, D., and Guo, Y.-L. (2011). The Arabidopsis lyrata genome sequence and the basis of rapid genome size change. Nature Genetics, 43:476–481.

Hupalo, D. and Kern, A. D. (2013). Conservation and functional element discovery in 20 angiosperm plant genomes. Molecular Biology and Evolution, 30:1729–1744.

Hayashizaki, Y. (2004). Mouse transcriptome: Neutral evolution of non-coding complementary DNAs (reply). Nature, 431:757.

International Brachypodium Initiative (2010). Genome sequencing and analysis of the model grass Brachypodium distachyon. Nature, 463:763–768.

International Citrus Genome Consortium (2011). Haploid Clementine genome. Retrieved August 2, 2013, from

Jaillon, O., Aury, J.-M., Noel, B., Policriti, A., Clepet, C., Casagrande, A., Choisne, N., Aubourg, S., Vitulo, N., Jubin, C., Vezzi, A., Legeai, F., Hugueney, P., Dasilva, C., Horner, D., Mica, E., Jublot, D., Poulain, J., Bruyère, C., Billault, A., Segurens, B., Gouyvenoux, M., Ugarte, E., Cattonaro, F., Anthouard, V., Vico, V., Del Fabbro, C., Alaux, M., Di Gaspero, G., Dumas, V., Felice, N., Paillard, S., Juman, I., Moroldo, M., Scalabrin, S., Canaguier, A., Le Clainche, I., Malacrida, G., Durand, E., Pesole, G., Laucou, V., Chatelet, P., Merdinoglu, D., Delledonne, M., Pezzotti, M., Lecharny, A., Scarpelli, C., Artiguenave, F., Pè, M. E., Valle, G., Morgante, M., Caboche, M., Adam-Blondon, A.-F., Weissenbach, J., Quétier, F., and Wincker, P. (2007). The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature, 449:463–7.

Jeon, Y. and Lee, J. T. (2011). YY1 tethers Xist RNA to the inactive X nucleation center. Cell, 146:119–133.

Kapranov, P., Cheng, J., Dike, S., Nix, D.
A., Duttagupta, R., Willingham, A. T., Stadler, P. F., Hertel, J., Hackermüller, J., Hofacker, I. L., Bell, I., Cheung, E., Drenkow, J., Dumais, E., Patel, S., Helt, G., Ganesh, M., Ghosh, S., Piccolboni, A., Sementchenko, V., Tammana, H., and Gingeras, T. R. (2007). RNA maps reveal new RNA classes and a possible function for pervasive transcription. Science, 316:1484–1488.

Kim, E.-D. and Sung, S. (2012). Long noncoding RNA: unveiling hidden layer of gene regulatory networks. Trends in Plant Science, 17:16–21.

Lamesch, P., Berardini, T. Z., Li, D., Swarbreck, D., Wilks, C., Sasidharan, R., Muller, R., Dreher, K., Alexander, D. L., Garcia-Hernandez, M., Karthikeyan, A. S., Lee, C. H., Nelson, W. D., Ploetz, L., Singh, S., Wensel, A., and Huala, E. (2012). The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools. Nucleic Acids Research, 40:D1202–D1210.

Langmead, B. and Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nature Methods, 9:357–359.

Langmead, B., Trapnell, C., Pop, M., and Salzberg, S. L. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10:R25.

Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., and Durbin, R. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25:2078–2079.

Liu, C., Muchhal, U. S., and Raghothama, K. (1997). Differential expression of TPS11, a phosphate starvation-induced gene in tomato. Plant Molecular Biology, 33:867–874.

Liu, J., Jung, C., Xu, J., Wang, H., Deng, S., Bernad, L., Arenas-Huertero, C., and Chua, N.-H. (2012). Genome-wide analysis uncovers regulation of long intergenic noncoding RNAs in Arabidopsis. The Plant Cell, 24:4333–4345.

Ma, B., Tromp, J., and Li, M. (2002). PatternHunter: faster and more sensitive homology search. Bioinformatics, 18:440–445.

Managadze, D., Lobkovsky, A. E., Wolf, Y. I., Shabalina, S. A., Rogozin, I. B., and Koonin, E. V. (2013).
The vast, conserved mammalian lincRNome. PLoS Computational Biology, 9:e1002917.

Marques, A. C. and Ponting, C. P. (2009). Catalogues of mammalian long noncoding RNAs: modest conservation and incompleteness. Genome Biology, 10:R124.

Marquez, Y., Brown, J. W. S., Simpson, C., Barta, A., and Kalyna, M. (2012). Transcriptome survey reveals increased complexity of the alternative splicing landscape in Arabidopsis. Genome Research, 22:1184–1195.

Matsui, A., Ishida, J., Morosawa, T., Mochizuki, Y., Kaminuma, E., Endo, T. A., Okamoto, M., Nambara, E., Nakajima, M., Kawashima, M., Satou, M., Kim, J.-M., Kobayashi, N., Toyoda, T., Shinozaki, K., and Seki, M. (2008). Arabidopsis transcriptome analysis under drought, cold, high-salinity and ABA treatment conditions using a tiling array. Plant & Cell Physiology, 49:1135–1149.

Merchant, S. S., Prochnik, S. E., Vallon, O., Harris, E. H., Karpowicz, S. J., Witman, G. B., Terry, A., Salamov, A., Fritz-Laylin, L. K., Maréchal-Drouard, L., Marshall, W. F., Qu, L.-H., Nelson, D. R., Sanderfoot, A. A., Spalding, M. H., Kapitonov, V. V., Ren, Q., Ferris, P., Lindquist, E., Shapiro, H., Lucas, S. M., Grimwood, J., Schmutz, J., Cardol, P., Cerutti, H., Chanfreau, G., Chen, C.-L., Cognat, V., Croft, M. T., Dent, R., Dutcher, S., Fernández, E., Fukuzawa, H., González-Ballester, D., González-Halphen, D., Hallmann, A., Hanikenne, M., Hippler, M., Inwood, W., Jabbari, K., Kalanon, M., Kuras, R., Lefebvre, P. A., Lemaire, S. D., Lobanov, A. V., Lohr, M., Manuell, A., Meier, I., Mets, L., Mittag, M., Mittelmeier, T., Moroney, J. V., Moseley, J., Napoli, C., Nedelcu, A. M., Niyogi, K., Novoselov, S. V., Paulsen, I. T., Pazour, G., Purton, S., Ral, J.-P., Riaño Pachón, D. M., Riekhof, W., Rymarquis, L., Schroda, M., Stern, D., Umen, J., Willows, R., Wilson, N., Zimmer, S. L., Allmer, J., Balk, J., Bisova, K., Chen, C.-J., Elias, M., Gendler, K., Hauser, C., Lamb, M. R., Ledford, H., Long, J. C., Minagawa, J., Page, M.
D., Pan, J., Pootakham, W., Roje, S., Rose, A., Stahlberg, E., Terauchi, A. M., Yang, P., Ball, S., Bowler, C., Dieckmann, C. L., Gladyshev, V. N., Green, P., Jorgensen, R., Mayfield, S., Mueller-Roeber, B., Rajamani, S., Sayre, R. T., Brokstein, P., Dubchak, I., Goodstein, D., Hornick, L., Huang, Y. W., Jhaveri, J., Luo, Y., Martínez, D., Ngau, W. C. A., Otillar, B., Poliakov, A., Porter, A., Szajkowski, L., Werner, G., Zhou, K., Grigoriev, I. V., Rokhsar, D. S., and Grossman, A. R. (2007). The Chlamydomonas genome reveals the evolution of key animal and plant functions. Science, 318:245–250.

Mimulus Genome Project and DOE-JGI (2013). Retrieved August 2, 2013, from

Ming, R., Hou, S., Feng, Y., Yu, Q., Laporte, A., Saw, J. H., Senin, P., Wang, W., Ly, B. V., Lewis, K. L. T., Salzberg, S. L., Feng, L., Jones, M. R., Skelton, R. L., Murray, J. E., Chen, C., Qian, W., Shen, J., Du, P., Eustice, M., Tong, E., Tang, H., Lyons, E., Paull, R. E., Michael, T. P., Wall, K., Rice, D. W., Albert, H., Li, Zhu, Y. J., Schatz, M., Nagarajan, N., Acob, R. A., Guan, P., Blas, A., Wai, C. M., Ackerman, C. M., Ren, Y., Liu, C., Wang, J., Wang, J., Kuk, Shakirov, E. V., Haas, B., Thimmapuram, J., Nelson, D., Wang, X., Bowers, J. E., Gschwend, A. R., Delcher, A. L., Singh, R., Suzuki, J. Y., Tripathi, S., Neupane, K., Wei, H., Irikura, B., Paidi, M., Jiang, N., Zhang, W., Presting, G., Windsor, A., Perez, R., Torres, M. J., Feltus, F. A., Porter, B., Li, Y., Burroughs, A. M., Cheng, Liu, L., Christopher, D. A., Mount, S. M., Moore, P. H., Sugimura, T., Jiang, J., Schuler, M. A., Friedman, V., Olds, T., Shippen, D. E., dePamphilis, C. W., Palmer, J. D., Freeling, M., Paterson, A. H., Gonsalves, D., Wang, L., and Alam, M. (2008). The draft genome of the transgenic tropical fruit tree papaya (Carica papaya Linnaeus). Nature, 452:991–996.

Molnar, A., Melnyk, C., and Baulcombe, D. C. (2011). Silencing signals in plants: a long journey for small RNAs. Genome Biology, 12:215.

Nozawa, M., Miura, S., and Nei, M. (2010).
Origins and evolution of microRNA genes in Drosophila species. Genome Biology and Evolution, 2:180–189.

Numata, K., Kanai, A., Saito, R., Kondo, S., Adachi, J., Wilming, L. G., Hume, D. A., Hayashizaki, Y., and Tomita, M. (2003). Identification of putative noncoding RNAs among the RIKEN mouse full-length cDNA collection. Genome Research, 13:1301–1306.

Pang, K. C., Frith, M. C., and Mattick, J. S. (2006). Rapid evolution of noncoding RNAs: lack of conservation does not mean lack of function. Trends in Genetics, 22:1–5.

Ponting, C. P. and Belgard, T. G. (2010). Transcribed dark matter: meaning or myth? Human Molecular Genetics, 19:R162–R168.

Ponting, C. P., Oliver, P. L., and Reik, W. (2009). Evolution and functions of long noncoding RNAs. Cell, 136:629–641.

Rensing, S. A., Lang, D., Zimmer, A. D., Terry, A., Salamov, A., Shapiro, H., Nishiyama, T., Perroud, P.-F., Lindquist, E. A., Kamisugi, Y., Tanahashi, T., Sakakibara, K., Fujita, T., Oishi, K., Shin-I, T., Kuroki, Y., Toyoda, A., Suzuki, Y., Hashimoto, S.-I., Yamaguchi, K., Sugano, S., Kohara, Y., Fujiyama, A., Anterola, A., Aoki, S., Ashton, N., Barbazuk, W. B., Barker, E., Bennetzen, J. L., Blankenship, R., Cho, S. H., Dutcher, S. K., Estelle, M., Fawcett, J. A., Gundlach, H., Hanada, K., Heyl, A., Hicks, K. A., Hughes, J., Lohr, M., Mayer, K., Melkozernov, A., Murata, T., Nelson, D. R., Pils, B., Prigge, M., Reiss, B., Renner, T., Rombauts, S., Rushton, P. J., Sanderfoot, A., Schween, G., Shiu, S.-H., Stueber, K., Theodoulou, F. L., Tu, H., Van de Peer, Y., Verrier, P. J., Waters, E., Wood, A., Yang, L., Cove, D., Cuming, A. C., Hasebe, M., Lucas, S., Mishler, B. D., Reski, R., Grigoriev, I. V., Quatrano, R. S., and Boore, J. L. (2008). The Physcomitrella genome reveals evolutionary insights into the conquest of land by plants. Science, 319:64–69.

Schmutz, J., Cannon, S. B., Schlueter, J., Ma, J., Mitros, T., Nelson, W., Hyten, D. L., Song, Q., Thelen, J. J., Cheng, J., Xu, D., Hellsten, U., May, G.
D., Yu, Y., Sakurai, T., Umezawa, T., Bhattacharyya, M. K., Sandhu, D., Valliyodan, B., Lindquist, E., Peto, M., Grant, D., Shu, S., Goodstein, D., Barry, K., Futrell-Griggs, M., Abernathy, B., Du, J., Tian, Z., Zhu, L., Gill, N., Joshi, T., Libault, M., Sethuraman, A., Zhang, X.-C., Shinozaki, K., Nguyen, H. T., Wing, R. A., Cregan, P., Specht, J., Grimwood, J., Rokhsar, D., Stacey, G., Shoemaker, R. C., and Jackson, S. A. (2010). Genome sequence of the palaeopolyploid soybean. Nature, 463:178–183.

Schnable, P. S., Ware, D., Fulton, R. S., Stein, J. C., Wei, F., Pasternak, S., Liang, C., Zhang, J., Fulton, L., Graves, T. A., Minx, P., Reily, A. D., Courtney, L., Kruchowski, S. S., Tomlinson, C., Strong, C., Delehaunty, K., Fronick, C., Courtney, B., Rock, S. M., Belter, E., Du, F., Kim, K., Abbott, R. M., Cotton, M., Levy, A., Marchetto, P., Ochoa, K., Jackson, S. M., Gillam, B., Chen, W., Yan, L., Higginbotham, J., Cardenas, M., Waligorski, J., Applebaum, E., Phelps, L., Falcone, J., Kanchi, K., Thane, T., Scimone, A., Thane, N., Henke, J., Wang, T., Ruppert, J., Shah, N., Rotter, K., Hodges, J., Ingenthron, E., Cordes, M., Kohlberg, S., Sgro, J., Delgado, B., Mead, K., Chinwalla, A., Leonard, S., Crouse, K., Collura, K., Kudrna, D., Currie, J., He, R., Angelova, A., Rajasekar, S., Mueller, T., Lomeli, R., Scara, G., Ko, A., Delaney, K., Wissotski, M., Lopez, G., Campos, D., Braidotti, M., Ashley, E., Golser, W., Kim, H., Lee, S., Lin, J., Dujmic, Z., Kim, W., Talag, J., Zuccolo, A., Fan, C., Sebastian, A., Kramer, M., Spiegel, L., Nascimento, L., Zutavern, T., Miller, B., Ambroise, C., Muller, S., Spooner, W., Narechania, A., Ren, L., Wei, S., Kumari, S., Faga, B., Levy, M. J., McMahan, L., Van Buren, P., Vaughn, M. W., Ying, K., Yeh, C.-T., Emrich, S. J., Jia, Y., Kalyanaraman, A., Hsia, A.-P., Barbazuk, W. B., Baucom, R. S., Brutnell, T. P., Carpita, N. C., Chaparro, C., Chia, J.-M., Deragon, J.-M., Estill, J. C., Fu, Y., Jeddeloh, J.
A., Han, Y., Lee, H., Li, P., Lisch, D. R., Liu, S., Liu, Z., Nagel, D. H., McCann, M. C., SanMiguel, P., Myers, A. M., Nettleton, D., Nguyen, J., Penning, B. W., Ponnala, L., Schneider, K. L., Schwartz, D. C., Sharma, A., Soderlund, C., Springer, N. M., Sun, Q., Wang, H., Waterman, M., Westerman, R., Wolfgruber, T. K., Yang, L., Yu, Y., Zhang, L., Zhou, S., Zhu, Q., Bennetzen, J. L., Dawe, R. K., Jiang, J., Jiang, N., Presting, G. G., Wessler, S. R., Aluru, S., Martienssen, R. A., Clifton, S. W., McCombie, W. R., Wing, R. A., and Wilson, R. K. (2009). The B73 maize genome: complexity, diversity, and dynamics. Science, 326:1112–1115.

Scott, M. S. and Ono, M. (2011). From snoRNA to miRNA: Dual function regulatory non-coding RNAs. Biochimie, 93:1987–1992.

Sievers, F., Wilm, A., Dineen, D., Gibson, T. J., Karplus, K., Li, W., Lopez, R., McWilliam, H., Remmert, M., Söding, J., Thompson, J. D., and Higgins, D. G. (2011). Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Molecular Systems Biology, 7:539.

Slotte, T., Hazzouri, K. M., Agren, J. A., Koenig, D., Maumus, F., Guo, Y.-L., Steige, K., Platts, A. E., Escobar, J. S., Newman, L. K., Wang, W., Mandáková, T., Vello, E., Smith, L. M., Henz, S. R., Steffen, J., Takuno, S., Brandvain, Y., Coop, G., Andolfatto, P., Hu, T. T., Blanchette, M., Clark, R. M., Quesneville, H., Nordborg, M., Gaut, B. S., Lysak, M. A., Jenkins, J., Grimwood, J., Chapman, J., Prochnik, S., Shu, S., Rokhsar, D., Schmutz, J., Weigel, D., and Wright, S. I. (2013). The Capsella rubella genome and the genomic consequences of rapid mating system evolution. Nature Genetics, 45:831–835.

Tomato Genome Consortium (2012). The tomato genome sequence provides insights into fleshy fruit evolution. Nature, 485:635–41.

Tsai, M.-C., Spitale, R. C., and Chang, H. Y. (2011). Long intergenic noncoding RNAs: new links in cancer progression. Cancer Research, 71:3–7.

Tuskan, G.
A., Difazio, S., Jansson, S., Bohlmann, J., Grigoriev, I., Hellsten, U., Putnam, N., Ralph, S., Rombauts, S., Salamov, A., Schein, J., Sterck, L., Aerts, A., Bhalerao, R. R., Bhalerao, R. P., Blaudez, D., Boerjan, W., Brun, A., Brunner, A., Busov, V., Campbell, M., Carlson, J., Chalot, M., Chapman, J., Chen, G. L., Cooper, D., Coutinho, P. M., Couturier, J., Covert, S., Cronk, Q., Cunningham, R., Davis, J., Degroeve, S., Déjardin, A., Depamphilis, C., Detter, J., Dirks, B., Dubchak, I., Duplessis, S., Ehlting, J., Ellis, B., Gendler, K., Goodstein, D., Gribskov, M., Grimwood, J., Groover, A., Gunter, L., Hamberger, B., Heinze, B., Helariutta, Y., Henrissat, B., Holligan, D., Holt, R., Huang, W., Islam-Faridi, N., Jones, S., Jones-Rhoades, M., Jorgensen, R., Joshi, C., Kangasjärvi, J., Karlsson, J., Kelleher, C., Kirkpatrick, R., Kirst, M., Kohler, A., Kalluri, U., Larimer, F., Leebens-Mack, J., Leplé, J.-C., Locascio, P., Lou, Y., Lucas, S., Martin, F., Montanini, B., Napoli, C., Nelson, D. R., Nelson, C., Nieminen, K., Nilsson, O., Pereda, V., Peter, G., Philippe, R., Pilate, G., Poliakov, A., Razumovskaya, J., Richardson, P., Rinaldi, C., Ritland, K., Rouzé, P., Ryaboy, D., Schmutz, J., Schrader, J., Segerman, B., Shin, H., Siddiqui, A., Sterky, F., Terry, A., Tsai, C. J., Uberbacher, E., Unneberg, P., Vahala, J., Wall, K., Wessler, S., Yang, G., Yin, T., Douglas, C., Marra, M., Sandberg, G., Van de Peer, Y., and Rokhsar, D. (2006). The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science, 313:1596–1604.

van Bakel, H., Stout, J. M., Cote, A. G., Tallon, C. M., Sharpe, A. G., Hughes, T. R., and Page, J. E. (2011). The draft genome and transcriptome of Cannabis sativa. Genome Biology, 12:R102.

Velasco, R., Zharkikh, A., Affourtit, J., Dhingra, A., Cestaro, A., Kalyanaraman, A., Fontana, P., Bhatnagar, S.
K., Troggio, M., Pruss, D., Salvi, S., Pindo, M., Baldi, P., Castelletti, S., Cavaiuolo, M., Coppola, G., Costa, F., Cova, V., Dal Ri, A., Goremykin, V., Komjanc, M., Longhi, S., Magnago, P., Malacarne, G., Malnoy, M., Micheletti, D., Moretto, M., Perazzolli, M., Si-Ammour, A., Vezzulli, S., Zini, E., Eldredge, G., Fitzgerald, L. M., Gutin, N., Lanchbury, J., Macalma, T., Mitchell, J. T., Reid, J., Wardell, B., Kodira, C., Chen, Z., Desany, B., Niazi, F., Palmer, M., Koepke, T., Jiwan, D., Schaeffer, S., Krishnan, V., Wu, C., Chu, V. T., King, S. T., Vick, J., Tao, Q., Mraz, A., Stormo, A., Stormo, K., Bogden, R., Ederle, D., Stella, A., Vecchietti, A., Kater, M. M., Masiero, S., Lasserre, P., Lespinasse, Y., Allan, A. C., Bus, V., Chagné, D., Crowhurst, R. N., Gleave, A. P., Lavezzo, E., Fawcett, J. A., Proost, S., Rouzé, P., Sterck, L., Toppo, S., Lazzari, B., Hellens, R. P., Durel, C.-E., Gutin, A., Bumgarner, R. E., Gardiner, S. E., Skolnick, M., Egholm, M., Van de Peer, Y., Salamini, F., and Viola, R. (2010). The genome of the domesticated apple (Malus x domestica Borkh.). Nature Genetics, 42:833–839.

Wang, K., Wang, Z., Li, F., Ye, W., Wang, J., Song, G., Yue, Z., Cong, L., Shang, H., Zhu, S., Zou, C., Li, Q., Yuan, Y., Lu, C., Wei, H., Gou, C., Zheng, Z., Yin, Y., Zhang, X., Liu, K., Wang, B., Song, C., Shi, N., Kohel, R. J., Percy, R. G., Yu, J. Z., Zhu, Y.-X., Wang, J., and Yu, S. (2012). The draft genome of a diploid cotton Gossypium raimondii. Nature Genetics, 44:1098–1103.

Wang, X., Wang, H., Wang, J., Sun, R., Wu, J., Liu, S., Bai, Y., Mun, J.-H., Bancroft, I., Cheng, F., Huang, S., Li, X., Hua, W., Wang, J., Wang, X., Freeling, M., Pires, J. C., Paterson, A. H., Chalhoub, B., Wang, B., Hayward, A., Sharpe, A. G., Park, B.-S., Weisshaar, B., Liu, B., Li, B., Liu, B., Tong, C., Song, C., Duran, C., Peng, C., Geng, C., Koh, C., Lin, C., Edwards, D., Mu, D., Shen, D., Soumpourou, E., Li, F., Fraser, F., Conant, G., Lassalle, G., King, G.
J., Bonnema, G., Tang, H., Wang, H., Belcram, H., Zhou, H., Hirakawa, H., Abe, H., Guo, H., Wang, H., Jin, H., Parkin, I. A. P., Batley, J., Kim, J.-S., Just, J., Li, J., Xu, J., Deng, J., Kim, J. A., Li, J., Yu, J., Meng, J., Wang, J., Min, J., Poulain, J., Hatakeyama, K., Wu, K., Wang, L., Fang, L., Trick, M., Links, M. G., Zhao, M., Jin, M., Ramchiary, N., Drou, N., Berkman, P. J., Cai, Q., Huang, Q., Li, R., Tabata, S., Cheng, S., Zhang, S., Zhang, S., Huang, S., Sato, S., Sun, S., Kwon, S.-J., Choi, S.-R., Lee, T.-H., Fan, W., Zhao, X., Tan, X., Xu, X., Wang, Y., Qiu, Y., Yin, Y., Li, Y., Du, Y., Liao, Y., Lim, Y., Narusaka, Y., Wang, Y., Wang, Z., Li, Z., Wang, Z., Xiong, Z., and Zhang, Z. (2011). The genome of the mesopolyploid crop species Brassica rapa. Nature Genetics, 43:1035–1039.

Willingham, A. T., Orth, A. P., Batalov, S., Peters, E. C., Wen, B. G., Aza-Blanc, P., Hogenesch, J. B., and Schultz, P. G. (2005). A strategy for probing the function of noncoding RNAs finds a repressor of NFAT. Science, 309:1570–1573.

Young, R. S., Marques, A. C., Tibbit, C., Haerty, W., Bassett, A. R., Liu, J.-L., and Ponting, C. P. (2012). Identification and properties of 1,119 candidate lincRNA loci in the Drosophila melanogaster genome. Genome Biology and Evolution, 4:427–442.

Zhang, B., Pan, X., Cannon, C. H., Cobb, G. P., and Anderson, T. A. (2006). Conservation and divergence of plant microRNA genes. The Plant Journal, 46:243–259.

Zhang, B., Wang, Q., and Pan, X. (2007). MicroRNAs and their regulatory roles in animals and plants. Journal of Cellular Physiology, 210:279–289.

Zhang, Y., Liu, J., Jia, C., Li, T., Wu, R., Wang, J., Chen, Y., Zou, X., Chen, R., Wang, X.-J., and Zhu, D. (2010). Systematic identification and evolutionary features of rhesus monkey small nucleolar RNAs. BMC Genomics, 11:61.

Zhang, Z., Schwartz, S., Wagner, L., and Miller, W. (2000). A greedy algorithm for aligning DNA sequences. Journal of Computational Biology, 7:203–214.

Zhao, J., Sun, B.
K., Erwin, J. A., Song, J.-J., and Lee, J. T. (2008). Polycomb proteins targeted by a short repeat RNA to the mouse X chromosome. Science, 322:750–756.

