UBC Faculty Research and Publications

Transcriptome-wide comparison of sequence variation in divergent ecotypes of kokanee salmon Lemay, Matthew A; Donnelly, David J; Russello, Michael A May 7, 2013

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata


52383-12864_2012_Article_4998.pdf [ 290.35kB ]
JSON: 52383-1.0223940.json
JSON-LD: 52383-1.0223940-ld.json
RDF/XML (Pretty): 52383-1.0223940-rdf.xml
RDF/JSON: 52383-1.0223940-rdf.json
Turtle: 52383-1.0223940-turtle.txt
N-Triples: 52383-1.0223940-rdf-ntriples.txt
Original Record: 52383-1.0223940-source.json
Full Text

Full Text

RESEARCH ARTICLE Open AccessTranscriptome-wide comparison of sequencevariation in divergent ecotypes of kokaneesalmonMatthew A Lemay*, David J Donnelly and Michael A RusselloAbstractBackground: High throughput next-generation sequencing technology has enabled the collection of genome-widesequence data and revolutionized single nucleotide polymorphism (SNP) discovery in a broad range of species. Whenanalyzed within a population genomics framework, SNP-based genotypic data may be used to investigate questionsof evolutionary, ecological, and conservation significance in natural populations of non-model organisms. Kokaneesalmon are recently diverged freshwater populations of sockeye salmon (Oncorhynchus nerka) that exhibit reproductiveecotypes (stream-spawning and shore-spawning) in lakes throughout western North America and northeast Asia.Current conservation and management strategies may treat these ecotypes as discrete stocks, however their recentdivergence and low levels of gene flow make in-season genetic stock identification a challenge. The development ofgenome-wide SNP markers is an essential step towards fine-scale stock identification, and may enable a directinvestigation of the genetic basis of ecotype divergence.Results: We used pooled cDNA samples from both ecotypes of kokanee to generate 750 million base pairs oftranscriptome sequence data. These raw data were assembled into 11,074 high coverage contigs from which weidentified 32,699 novel single nucleotide polymorphisms. A subset of these putative SNPs was validated usinghigh-resolution melt analysis and Sanger resequencing to genotype independent samples of kokanee andanadromous sockeye salmon. We also identified a number of contigs that were composed entirely of reads from asingle ecotype, which may indicate regions of differential gene expression between the two reproductive ecotypes.In addition, we found some evidence for greater pathogen load among the kokanee sampled in stream-spawninghabitats, suggesting a possible evolutionary advantage to shore-spawning that warrants further study.Conclusions: This study provides novel genomic resources to support population genetic and genomic studies ofboth kokanee and anadromous sockeye salmon, and has the potential to produce markers capable of fine-scale stockassessment. While this RNAseq approach was successful at identifying a large number of new SNP loci, we found thatthe frequency of alleles present in the pooled transcriptome data was not an accurate predictor of population allelefrequencies.Keywords: Adaptation, High resolution melt analysis, Next-generation sequencing, Oncorhynchus nerka,Single nucleotide polymorphisms* Correspondence: matt.lemay@ubc.caDepartment of Biology, University of British Columbia, Okanagan Campus,3333 University Way, Kelowna BC, V1V 1V7, Canada© 2013 Lemay et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the CreativeCommons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, andreproduction in any medium, provided the original work is properly cited.Lemay et al. BMC Genomics 2013, 14:308http://www.biomedcentral.com/1471-2164/14/308BackgroundBiology is currently undergoing a genomics revolution [1].With the increasing ubiquity of next-generation sequencingtechnology, it is now possible to rapidly generate millionsof base pairs of DNA sequence data for a fraction ofthe per-base cost required for chain-termination Sangersequencing. The result is an exponential increase in theabundance of genomic resources available for studyingnon-model organisms [2]. Single nucleotide polymorphisms(SNPs) have emerged as a marker of choice for population-level studies in the genomics era [3-5]. Due to their broadgenomic distribution, direct association with functionalsignificance, and ease of genotyping, SNPs represent animprovement over conventional markers such as amplifiedlength polymorphisms (AFLPs) and expressed sequencetag (EST)-linked microsatellites. The efficiency of SNPdiscovery and validation may be further increased throughtargeting of the transcriptome (RNAseq; [6]) or restriction-site associated DNA tags (RAD; [7,8]).The emerging field of population genomics has evolvedin tandem with these novel technologies, incorporatinggenome-wide sequence data into the study of systems thathistorically would have been limited to a small number ofneutral markers [9]. Central to a population genomicapproach is the identification of statistical outlier loci thatexhibit locus-specific patterns that are highly divergentfrom the rest of the genome, representing candidateregions under selection [10-12]. The combined signalfrom high density neutral and putatively adaptive SNPsthroughout the genome offers great potential for investi-gating evolutionary and ecological questions in naturalpopulations. For example, several recent studies havefound that the use of outlier loci can reveal fine-scalepopulation structure beyond what was previously inferredfrom conventional neutral markers (e.g. Atlantic herring[13], Atlantic cod [14], Atlantic salmon [15]). Moreover,population genomic approaches that incorporate statisticaloutlier loci offer great potential for delineating conserva-tion units [16,17] and informing fisheries management[18].Kokanee salmon are recently diverged land-locked popu-lations of sockeye salmon (Oncorhynchus nerka) that existin lakes throughout western North America and northeastAsia. Two reproductive ecotypes of kokanee have beendescribed based on spawning behaviour and location.Stream-spawning kokanee migrate into tributaries in earlyautumn, display pronounced secondary sex characteristics,site defence, and redd formation. Conversely, shore-spawning populations (also known as beach-spawners)breed much later in the autumn, lack secondary sex charac-teristics and site defence, and mate directly on the shorelinerather than migrating into tributaries [19,20]. In manylakes, the two ecotypes occur in sympatry. Althoughkokanee ecotypes display different reproductive traits andexperience varying incubation environments, outside of thespawning season the ecotypes are panmictic within lakesand morphologically indistinguishable.Throughout their range, kokanee constitute a heavilymanaged recreational fishery of great socioeconomic andecological importance [21]. The ability to distinguishstream- and shore-spawning kokanee is an important goalfor fisheries managers, as ecotypes may be differentiallyimpacted by alternative conservation and managementregimes with regard to water use, the protection ofspawning habitat, and recreational harvest regulations.For example, conservation strategies for kokanee havefocused primarily on the preservation of stream-spawninghabitat by the formation of spawning channels. However,management efforts for shore-spawning populations arein their infancy and are hindered by a lack of data on thesize of shore-spawning populations within lakes. Whilevisual surveys of kokanee during the spawning season arequite accurate in streams, it is considerably more difficultto obtain visual measurement of shore-spawner abun-dance, especially in lakes where shore populations spawnat greater depth or at night. Kokanee management wouldbe significantly enhanced by the identification of molecu-lar markers capable of assigning individuals to the correctecotype, however the identification of sufficiently fine-scale markers has been hindered by the recent divergenceand correspondingly weak genetic structure of the twoecotypes [18].Previous studies using neutral loci have found evidencefor low levels of genetic differentiation based on mitochon-drial DNA haplotype and microsatellite loci, yet individualassignment probabilities to ecotype were generally low[22-24]. More recently, work using a panel of EST-linkedmicrosatellite loci has identified a set of markers with ahigh association of genetic diversity with reproductiveecotype, suggesting that there may indeed be a genetic basisfor the evolution of kokanee ecotypes [18]. However thesemarkers suffer from poor coverage across the genome andprovide little information on the specific genes involved inthe divergence of the two ecotypes.Here, we evaluated an RNAseq approach for preferen-tially identifying SNPs of putatively adaptive significancewithin and among kokanee ecotypes. We used the Roche454 GS FLX Titanium platform to generate transcriptome-wide sequence data for pooled cDNA libraries of shore-and stream-spawning kokanee. These data were used forSNP discovery and subsequent validation in natural popula-tions, enabling explicit investigations of ecotype-specificpatterns of genetic variation.Results and discussionSequencing and assemblyWe generated ~750 million base pairs of sequence datacorresponding to 1.3 × 106 and 1.4 × 106 reads for the shoreLemay et al. BMC Genomics 2013, 14:308 Page 2 of 11http://www.biomedcentral.com/1471-2164/14/308and stream-spawning cDNA libraries, respectively, with anaverage length of 271 bases per read (Table 1). A de novoassembly was carried out with the trimmed reads from bothecotype libraries in order to generate contigs to use asreference sequences; this assembly incorporated 87% of thetranscriptome reads to produce 123,547 contigs (Table 2).We then mapped the raw reads back to these referencecontigs separately for each ecotype and generated a refineddata set consisting only of contigs that had a minimumaverage coverage of 5× for each ecotype and a minimumlength of 200 bases. The resulting data set (hereafterreferred to as the high coverage data set) was retained fordownstream analyses (Table 2; Figure 1; Additional file 1and Additional file 2).Transcriptome analyses and annotationA genome duplication event in the early evolution ofsalmonids and subsequent diploidization has resulted in agenome in which isoloci and paralogous gene copiesare common [25]. The presence of paralogous sequencevariants (PSVs) among contigs would impede our abilityto identify sequence variants between the two ecotypes.To address this problem, we used conservatively highsimilarity parameters during the de novo assembly inorder to minimize the incorporation of PSVs within eachcontig [26,27]. In doing so, two separate but highly similarcontigs may be created: one corresponding to functionalcoding sequence, the other to a PSV. To test for these re-dundancies in the data, a de novo assembly with a slightlyreduced sequence-similarity value (0.90) was carried outon the high coverage contigs in order to identify anycontig sequences that either partially or totally overlapped.Of the 13,593 contigs initially retained in the high cover-age data set, 1,864 (13.7%) aligned with one other contig,and 643 (4.7%) aligned with two or more other contigs.These contigs were discarded from the high coverage dataset. The remaining 11,085 contigs (81.5%) were uniqueand did not show similarity with any other contig.The strict similarity value used to create the referencecontigs may also explain the relatively short size of thecontigs retained in the high coverage data set (averagelength = 595 bases). While the short contig size may limitsome down-stream analyses, the minimization of PSVs inthe data was of higher priority when optimizing theassembly parameters.A BLAST search of all contigs in the high coverage dataset (n = 11,085) produced 4,410 positive matches to se-quences in the NCBI database (minimum e-value cut-off =10-6; average e-value = 1.9×10-9). Of those contigs withpositive BLAST hits, 60% matched published salmonidsequences (contigs matching Oncorhynchus sp. = 628; Salmosp. = 1993; all other salmonids = 12). In addition, elevencontigs were a positive match to the infectious hema-topoietic necrosis virus (IHNV), a lethal virus that is enzo-otic in western North America and can have detrimentalimpacts on salmonid aquaculture [28]. These 11 contigswere subsequently removed from the high coverage data setfor all further analyses. Of the high coverage contigs thathad positive matches from the Blastx search, 2,285 weresubsequently annotated with one or more gene ontology(GO) terms (Figure 2, Additional file 3).Mitochondrial genomeThere was coverage across all genes in the sockeyesalmon mitochondrial genome [GENBANK: EF055889],representing 0.3% of the trimmed transcriptome reads(n = 6,824).SNP detectionWithin the high coverage data set, 8,339 contigs containedSNPs that fell within our detection parameters (minimumcoverage 8×, minimum variant frequency 10%, minimumreads per allele = 2, minimum central quality 20). Fromthese contigs, we identified 32,699 putative SNPs that maybe used for population genetic analyses of kokanee(Additional file 4). Although there has been much focuson marker development in anadromous sockeye salmon[29-32], to our knowledge, this is the first study that hasidentified SNPs specifically for reproductive ecotypes ofkokanee.Table 1 Summary of next-generation sequence dataobtained for each ecotype of Okanagan Lake kokaneeShore-spawner library Stream-spawner libraryNo. of bases 371,876,524 373,169,057No. of reads 1,343,483 1,406,375Mean read length 276.8 bases 265.3 basesTable 2 Summary of the contigs present in each kokaneedata setTotalcontigsHigh coveragedata set1Shore-uniquecontigs2Stream-uniquecontigs2No. ofcontigs123,547 11,074 277 557Meancoverage7.5 37.0 6.8 8.4Meanlength463.7 594.8 374.2 404.2Mean no.of reads14.2 77.9 8.5 12.71 In the high coverage data set each contig has a minimum length of 200bases and a minimum of 5× coverage for each ecotype. In addition, this dataset has duplicate contigs and contigs that map to pathogen DNAsequences removed.2 Contigs composed of reads from a single ecotype. Minimum length = 200bases; minimum coverage = 5×. Contigs that map to pathogen DNAsequences have been removed.Lemay et al. BMC Genomics 2013, 14:308 Page 3 of 11http://www.biomedcentral.com/1471-2164/14/308Given that low levels of molecular divergence amongecotypes of Okanagan Lake kokanee have been observed inprevious studies, we expected the majority of loci to displaysimilar allele frequencies in both ecotypes. Of the putativeSNPs identified in this study, 7,931 were polymorphic inboth ecotypes, 12,835 were polymorphic in the streamecotype but fixed in shore, and 11,933 were polymorphic inthe shore ecotype but fixed in stream. There were no SNPswithin our detection parameters that were fixed for alter-nate alleles in the two ecotypes.In this study, the high frequency of loci that appear to befixed in one ecotype may be artificially inflated as a resultof the small sample size (only eight individuals per ecotype)in the pooled transcriptome libraries and/or due to theSNP detection parameters, which required a minimum ofeight reads at a given site and at least two reads for eachallele. If one ecotype had coverage below these cut-offvalues, then it would erroneously appear to be fixed at thatsite, even if there was some variation present. This couldresult if one ecotype under-expressed a given gene,preventing it from being detected at sufficiently high levelsin the transcriptome data. These potential biases reflect thetrade-off between avoiding false SNPs resulting from se-quence error, while attempting to account for all possiblevariation in the data.The difference in the frequency of the major allelebetween ecotypes ranged from <1-88% (mean = 19%).Ninety-five percent of all SNPs were contained within adivergence value of 44% or less, suggesting that the overalllevel of nucleotide divergence is low. The remaining 5% ofloci (n = 1,493) represented the most divergent SNPs, withvalues between 45–88%. These highly divergent SNPs mayrepresent promising targets for ecotype discrimination,potentially associated with genes underlying ecologicaldiversification.SNP validationPrimer pairs were designed for 36 loci such that theyamplified a 60–200 base pair fragment containing a singleSNP (Additional file 5). Of these loci, 18 exhibited success-ful PCR amplification, were free of introns, and producedsufficiently clear high resolution melt (HRM) signal toattempt the subsequent genotyping validation. These lociwere then used to genotype 32 stream-spawners, 32 shore-Figure 1 Characterization of the contigs present in the high coverage data set. Histograms represent (A) average coverage of each contig(mean = 37.0), (B) number of reads (mean = 77.9), (C) contig lengths (mean = 594.8 bases), and (D) the number of SNPs for each of the highcoverage contigs.Lemay et al. BMC Genomics 2013, 14:308 Page 4 of 11http://www.biomedcentral.com/1471-2164/14/308spawners, and 24 anadromous sockeye salmon using HRManalysis. From the panel of 18 SNPs for which broad-scaleHRM genotyping and Sanger validation was attempted,nine loci produced consistently scorable HRM clusters andwere retained for downstream analyses.Sanger sequence data for the remaining loci confirmedthat the expected SNP site was indeed polymorphic,however the resulting HRM clusters were not sufficientlydiscrete to enable accurate genotype assignment. For theseloci, Sanger sequencing identified either multiple meltcurves with the same genotype or multiple genotypeswithin the same melt curve. Some of the inconsistencieswith the HRM data may be explained by the reducedaccuracy of HRM compared with other conventionalgenotyping methods such as TaqManW assays [33]. Martinoet al. [33] found that rare genotypes and low minor allelefrequencies decreased the accuracy of HRM analysis. Bothof these factors are likely present in our data given that wepreferentially chose divergent loci that were fixed for asingle allele in one population. The presence of additionalpolymorphic sites within the amplicon [26] and the use ofloci containing Class 3 (C/G) or Class 4 (A/T) SNPs [34]may also have been factors resulting in weakly differentiatedclusters. Given that Sanger sequencing confirmed thepresence of the expected polymorphism, we conclude thatthe failed assays were not due to errors with the initial SNPdetection but rather reflect the limitations of the HRMassays at those loci.viral reproductionbiological adhesiongrowthlocomotionreproductioncell proliferationmulti-organism processimmune system processdeathsignalingdevelopmental processresponse to stimulusmulticellular organismal processlocalizationbiological regulationmetabolic processcellular processsynapsecell junctionextracellular regionmembrane-enclosed lumenmacromolecular complexmembraneorganellecellantioxidant activityelectron carrier activityreceptor activityenzyme regulator activitymolecular transducer activitystructural molecule activitytransporter activitycatalytic activitybindingBiological process Cellular component Molecular function Percent of sequences Figure 2 Functional annotation of the high coverage contigs. The frequency (%) of each observed gene ontology (GO) term is given for thethree GO domains (biological process, cellular component, and molecular function).Lemay et al. BMC Genomics 2013, 14:308 Page 5 of 11http://www.biomedcentral.com/1471-2164/14/308There was no evidence of linkage disequilibrium amongany of the nine loci for which genotypic data were obtained.One locus (One74958) showed a significant deviation fromHardy-Weinberg equilibrium (HWE) in all three popula-tions tested (stream-spawning kokanee, shore-spawningkokanee and anadromous sockeye salmon). This may indi-cate the presence of a PSV at this locus [26]. One additionallocus showed a significant deviation from HWE among theshore-spawning samples only (Table 3). Among the anadro-mous sockeye salmon samples used in this study, all SNPsshowed similar patterns of allele frequencies to kokanee.These results indicate that SNP loci developed from thekokanee transcriptome data could also be useful for geno-typing populations of anadromous sockeye salmon.The frequency data obtained from HRM genotypingsuggests that while the transcriptome data set provides avaluable tool for identifying novel SNPs, it is limited in itsability to infer population allele frequencies (Table 3). Smallsample size in the pooled transcriptome libraries likelycontributed to the observed inconsistencies between HRMgenotypes and the allele frequencies predicted from thetranscriptome data. We initially hypothesized that the mostdivergent SNPs in the transcriptome sample could indicatehigh-confidence targets for genetically discriminatingecotypes. Based on these results, however, we feel that therelative frequency of alleles present in the pooled transcrip-tome reads is not an appropriate surrogate for large-scaleSNP genotyping followed by statistical outlier tests. The useof greater sample sizes within the pooled cDNA librariesmay improve this outcome. Likewise, emerging protocolsthat utilize combinatorial labelling methods and RAD tagsmay provide more efficient and cost-effective alternativesfor simultaneously discovering SNPs in non-model organ-isms and genotyping population-level samplings [7,8].Ecotype-unique dataWe also generated additional data sets with contigs thatcontained only reads from a single ecotype (Table 2,Additional file 6). The presence/absence of ecotype-specificcontigs may represent candidates for differences in geneexpression (rather than sequence variation) betweenecotypes. Interestingly, we observed twice as many ecotype-unique contigs from the stream-spawning DNA library,which may indicate a number of genes that are notexpressed by the shore ecotype during spawning. However,this discrepancy in ecotype-specific contigs could partiallybe accounted for by genes with very low levels of expres-sion that were observed in one library by chance alone. Astudy specifically designed to examine differences in levelsof gene expression may be useful at identifying divergencebetween the two ecotypes.BLAST searches (Blastn, NCBI, max e-value = 10-05) ofthe ecotype-unique contigs produced 150 positive matchesin the shore-unique data set (mean e-value = 1.1×10-07) and393 positive hits for contigs that are unique to the streamecotype (mean e-value = 2.1×10-06). Interestingly, 12stream-unique contigs (2%, including the highest coveragecontig) were a match to Saprolegnia ferax [GENBANK:AY534144], a pathogenic water mould that is associatedwith pre-spawning mortality in salmonids [35]. Con-tigs matching this pathogen were absent among theshore-unique contigs. Similarly, 92 stream-unique contigs(14%) were a match to Flavobacterium psychrophilum[GENBANK: AM398681], which is a bacterial infectionassociated with high levels of salmonid mortality [36].Again, there were no matches to this pathogen among theshore-unique contigs. A similar pattern was also observedfor contigs matching Gyrodactylus salmonis [GENBANK:JN230351], a pathogenic flatworm [37].Table 3 Genetic diversity estimates from loci that were successfully genotyped using High Resolution Melt Analysis(HRMA)LocusHE/HO Major allele1 frequency from HRMA (frequency predicted from transcriptome data)Stream Shore Sockeye Stream Shore SockeyeOne74958 0.13 / 0.00 * 0.32 / 0.00 * 0.19 / 0.05 * 0.93 (1.00) 0.80 (0.48) 0.90One81166 0.29 / 0.29 0.32 / 0.40 0.41 / 0.48 0.18 (1.00) 0.20 (0.46) 0.28One81284+ 0.47 / 0.57 0.43 / 0.61 0.47 / 0.57 0.74 (1.00) 0.70 (0.54) 0.63One81385 0.20 / 0.22 0.38 / 0.36 0.19 / 0.22 0.79 (0.52) 0.75 (1.00) 0.90One113434 0.32 / 0.40 0.32 / 0.40 0.00 / 0.00 0.80 (1.00) 0.80 (0.77) 1.00One73115 0.43 / 0.52 0.44 / 0.47 0.16 / 0.17 0.78 (0.42) 0.67 (0.79) 0.91One74836 0.49 / 0.37 0.50 / 0.53 0.43 / 0.46 0.58 (0.33) 0.58 (0.65) 0.32One24190 0.22 / 0.25 0.35 / 0.45 0.44 / 0.48 0.88 (0.31) 0.77 (0.85) 0.67One73476 0.50 / 0.61 0.50 / 0.25 * 0.50 / 0.57 0.57 (0.35) 0.56 (0.70) 0.52*Denotes significant deviation from HWE following sequential Bonferroni correction.1 The identity of the major allele is defined as the allele with the highest frequency in the transcriptome data.+The contig from which this locus was created had some overlap with one other contig (34452). Both contigs were subsequently removed from the highcoverage data set. As the overlap did not impact the SNP site, this locus has been retained.Lemay et al. BMC Genomics 2013, 14:308 Page 6 of 11http://www.biomedcentral.com/1471-2164/14/308To test the possibility that there were reads matchingthese pathogens that had not been assembled into contigs,we mapped the raw transcriptome reads from each libraryto the NCBI reference sequence from each of these patho-gens. This assembly supported the prevalence of pathogensequences among the stream ecotype transcriptome data(Table 4). The exception was IHNV, which had beenidentified in the high coverage data set and was expectedto be present in both ecotypes.Each of these pathogens is associated with reducedfitness in salmonids, not only by killing adults before theyare able to spawn [35], but also by persisting in thespawning location and infecting emerging juveniles in thenext season [36]. A reduction in pathogen sequencespresent among kokanee collected in shore-spawning habi-tats suggests the possibility that shore-spawning behaviourmay have evolved, in part, as a way to reduce the pathogenload. While highly speculative, this hypothesis is consist-ent with other studies (Frazer and Russello, in review),which detected outlier loci associated with immuneresponse in kokanee. Further, the fact that only internalorgans and subcutaneous muscle tissue were sampledincreases the probability that the observed pathogensequences were indeed present within the fish tissue,rather then being environmental contaminants that couldoccur if external fin-clips or operculum punches had beencollected. While the present study was not explicitlydesigned to address this question and may be influencedby stochastic factors, our preliminary results warrantfuture research to quantitatively compare the pathogenload among kokanee within each of the two spawninghabitats.All contigs that matched pathogen sequences where re-moved from the ecotype-unique data sets prior to down-stream analyses. Blast2Go was then used to assign GOannotations to Blastx matches (e-value threshold = 10-6)among the remaining contigs that were unique to eachecotype. This analysis found sequence similarity matchesfor 160 and 76 contigs from the stream and shore datasets, respectively. From these contigs with positive BLASTmatches, 64 stream contigs and 34 shore contigs weresubsequently annotated with at least one GO term(Figure 3).ConclusionsIn this study, we harnessed next-generation sequencingtechnology in order to compare transcriptome-widepatterns of sequence variation among divergent ecotypesof kokanee salmon. We identified 32,699 putative SNPsthat could be used for population genetic and genomicstudies of both kokanee and anadromous sockeye salmon.We further detected contigs that were unique to eachecotype, which may be indicative of differential geneexpression, as well as preliminary evidence for variablepathogen load associated with spawning location.MethodsSample collection, RNA extraction, and next-generationsequencingKokanee were collected during spawning from fourstream-spawning sites and three shore-spawning sites inOkanagan Lake, British Columbia (Additional file 7).Animal collections and tissue sampling complied withUniversity of British Columbia animal care protocol#A11-0127 and British Columbia Ministry of Environmentcollection permit #PE10-66394. A total of eight individualsper ecotype were immediately sacrificed upon collectionand dissected in the field. For each of these individuals,five different tissues (heart, liver, muscle, gonad, and olfac-tory bulb) were harvested and immersed in separate 5mlscrew-cap vials containing 2.5ml of RNALATERW (LifeTechnologies) solution. Samples were held at 4°C for24 hrs and then stored at −20°C until needed. RNA wasextracted from each tissue type using the RNEASYUNIVERSAL MINIKIT (Qiagen) following the manufac-turer’s protocol.Two normalized cDNA libraries (Evrogen, Russia) wereconstructed using pooled RNA from all shore-spawners (5tissues × 8 individuals) and all stream-spawners (5 tissues ×8 individuals). The two resulting cDNA libraries were eachsubject to a full run of 454 GS FLX Titanium sequencing atthe Genome Quebec core facility. Pooling of multiple indi-viduals in each sample was used to provide a preliminaryTable 4 Number of next-generation sequencing readsthat aligned to reference sequences from four salmonidpathogensPathogen Number of reads aligned[GenBank accession] Stream-spawnerlibraryShore-spawnerlibraryFlavobacterium psychrophilum7,393 219Complete genome[AM398681]Gyrodactylus salmonis17 0Partial ITS1, complete 5.8S rRNAgene, partial ITS2[JN230351]Saprolegnia ferax393 1Mitochondrion, complete genome[AY534144]Infectious hematopoietic necrosisvirus1285 327Glycoprotein (G) and non-virionprotein (NV) genes[IHNGNVJ]1 IHNV was identified in the high coverage data set containing reads fromboth ecotypes and was not expected to show ecotype specificity.Lemay et al. BMC Genomics 2013, 14:308 Page 7 of 11http://www.biomedcentral.com/1471-2164/14/308indication of genetic variation within and among ecotypes;the combination of five tissue types for each individual wasused to maximize the diversity of expressed genes presentin each library. RNA samples were normalized to increasethe likelihood that rare transcripts were detected in thesequence data. The raw read data from each library wasdeposited at the NCBI Sequence Read Archive (project #SRP021088) with the following accessions: Streamreads SRR827512, SRR827513; shore reads SRR827572,SRR827573.AssemblyThe CLC GENOMICS WORKBENCH (CLC Bio) v.4.8was used for initial read trimming. Low quality (qualitylimit 0.05) and very short reads (<100 bases), terminalnucleotides (five from each end), and 454 sequencingadapters and primers were removed from the data setprior to assembly. A de novo assembly was then carriedout using CLC GENOMICS WORKBENCH v.4.8 (similar-ity = 0.96, length fraction 0.5) in order to generatereference contigs for subsequent analyses. A conserva-tively high similarity value was used in our assembly inorder to minimize the incorporation of PSVs within thesereference contigs [26,27].To facilitate a comparison of sequence variation betweenthe two ecotypes, the consensus sequence from each contigcreated during the de novo assembly was used as a refer-ence to map the stream and shore reads separately (CLClocomotionbiological adhesiongrowthimmune system processmulti-organism processreproductionrhythmic processcell proliferationdeathresponse to stimulussignalingdevelopmental processlocalizationmulticellular organismal processmetabolic processbiological regulationcellular processcell junctionextracellular matrixmembrane-enclosed lumenextracellular regionmacromolecular complexmembraneorganellecellelectron carrier activityenzyme regulator activitymolecular transducer activityreceptor activitystructural molecule activitytransporter activitycatalytic activitybindingBiological process Cellular component Molecular function Percent of sequences stream shore Figure 3 Functional annotation of contigs that were unique to each ecotype. The frequency (%) of each observed gene ontology (GO)term is presented for both ecotypes.Lemay et al. BMC Genomics 2013, 14:308 Page 8 of 11http://www.biomedcentral.com/1471-2164/14/308GENOMICS WORKBENCH v.5.5; similarity = 0.96, lengthfraction 0.5). We retained only those contigs with aminimum length of 200 bases and an average coveragegreater than 5× for each ecotype (hereafter referred to asthe high coverage data set).We also generated data sets containing those contigsthat were composed of reads from only a single eco-type (same parameters as above, minimum coverage =5×; minimum length = 200 bases). These two ‘ecotype-unique’ data sets may suggest target genes for subse-quent studies examining differences in gene expressionamong ecotypes.Mitochondrial genomeWhile there is not yet a fully assembled and annotatedsalmon nuclear genome, an annotated mitochondrialgenome sequence is available for most salmonid species.To assess the prevalence of mitochondrial genes in thetranscriptome data, we used the mitochondrial genome forsockeye salmon [GENBANK: EF055889] as a reference tomap the kokanee transcriptome reads (CLC GENOMICSWORKBENCH v.4.8, similarity = 0.90).Transcriptome analysis and annotationThe high coverage data set was subject to sequence simi-larity searches using Blast2Go v.2 [38,39]. For this analysis,a Blastx search was performed using the NCBI nr database(e-value threshold = 10-6, HSP length cut-off = 33). Thetop five hits for each contig were retained. Following theBlastx search, GO analysis was carried out in order toobtain hierarchical structure information with respect tothe three GO domains (molecular function, biologicalprocess, and cellular component). This analysis wascarried out using Blast2Go v. 2 (e-value threshold = 10-6,HSP length cut-off = 20, GO weight = 5).Sequence similarity searches were also carried out for allcontigs in each of the ecotype unique data sets. BothBlastx (same parameters as above) and Blastn (maxi-mum e-value threshold = 10-5) searches were performed.Blast2Go v.2 was again used to assign GO annotations tocontigs (same parameters as above).A de novo assembly (CLC GENOMICS WORKBENCHv5.5; similarity = 0.90, length fraction = 0.5) was carriedout using all contigs in the high coverage data set in orderto identify contigs with overlapping portions. Redundan-cies among the contigs may indicate the presence of PSVsor be indicative of alternative splicing within the transcrip-tome data. Contigs that overlapped with one or moreother contigs were removed from the high coveragedata set.SNP discoveryThe working data set of high coverage contigs was screenedfor SNPs using the CLC GENOMICS WORKBENCHv. 5.5 (minimum coverage 8×, minimum variant frequency10%, minimum reads per allele = 2, minimum centralquality 20). By carrying out mapping and SNP detectionseparately for the two ecotypes and then combining theresulting SNP tables, it was possible to determine thenumber and percent of reads that matched the referencecontig sequence at each SNP site. The absence of apolyorphism (fixation) in one ecotype was evident as either100% of reads matching the reference (in this case no SNPwould be observed in the table for this ecotype) or 100% ofreads possessing a nucleotide other than that of thereference sequence. For example, a fixed difference for agiven location would be scored when one ecotype was100% different from the reference contig sequence, whilethe other ecotype displayed no polymorphism at that site(i.e. 100% match to the reference). Using this approach wewere able to characterize all SNPs as being either: (a)polymorphic in both ecotypes; (b) fixed in one ecotype,polymorphic in the other; or (c) fixed for different allelesin each ecotype. Polymorphic sites with more than twoalleles and sites with read ambiguity were removed fromthe data. Variation in the form of indels was not examinedin this study.A divergence value (based on the index implemented inJuekens et al. [40]) was calculated for each SNP, defined asthe absolute value of the difference in the frequency of themajor allele among ecotypes. For example, a SNP that wasfixed for a different allele in each ecotype would have adivergence value = 1.0; a SNP where the major-allele wasfixed in shore and had a frequency of 60% in stream wouldhave a divergence value = 0.40. The approach was used toputatively identify the most divergent polymorphic siteswithin the transcriptome.SNP validationA subset of 36 SNPs was used to genotype an independentsample of kokanee in order to validate this ascertainmentprocedure. Validation of candidate SNPs was carried outfollowing a pipeline similar to that implemented by Seebet al. [26]. Briefly, primers were designed using PRIMER3[41] such that they would amplify a ~60-200 bp fragmentthat encompassed a single SNP. Loci that produced a singleclean amplicon with no detectable introns were usedto genotype 88 individuals using HRM analysis (see below).Each PCR reaction contained 1.25 μl of 10× buffer,1.25 μl of 2 mM dNTP mix, 0.5 μl of 10mM forward andreverse primer, 0.5 units of Kapa Taq polymerase (KapaBiosystems), 20–100 ng of DNA template, and ultra purewater for a total reaction volume of 12.5 μl. For eachreaction, a touchdown PCR procedure was implementedusing a Veriti thermal cycler (Applied Biosystems). Theprogram had an initial denaturation at 94°C for 2 minutes,followed by 10 cycles at 94°C for 30 seconds, 60°C for30 seconds, and 72°C for 30 seconds, with the annealingLemay et al. BMC Genomics 2013, 14:308 Page 9 of 11http://www.biomedcentral.com/1471-2164/14/308temperature decreasing by 1.0°C per cycle. This wasfollowed by 25 cycles at 94°C for 30 seconds, 50°C for30 seconds, and 72°C for 30 seconds. The final cycle hadan extension of 72°C for 2 minutes and was then held at4°C. PCR products were run on a 1.5% agarose-gel inorder to obtain a preliminary assessment of the qualityand size of the amplicon. Loci that showed evidence forthe presence of introns (larger PCR products thenexpected), or had multiple bands were not retained forsubsequent analyses.Loci that produced a single clean PCR product of theanticipated size were then evaluated by HRM analysisusing DNA samples from 32 stream-spawners and32 shore-spawners from Okanagan Lake. For detailedsampling methods for the kokanee DNA samples seeRussello et al. [18]. In addition, 24 anadromous sockeyesalmon DNA samples were included in this analysis.These samples were collected by the Okanagan NationAlliance from Osoyoos Lake, British Columbia, which isthe closest extant population of sockeye salmon from thesame drainage system as Okanagan Lake. Fin clips wereremoved from mature adults during the spawning sea-son in September 2011. DNA was extracted using aNucleoSpin Tissue Kit (Macherey Nagel) following themanufacturers suggested protocol for 96 well plates.Each HRM reaction contained 7.2 μl of Precision MeltSupermix (Bio-Rad), 0.4 μl of each primer, 20–100 ng ofDNA template, and ultra pure water for a total reactionvolume of 20 μl. HRM analyses were run in 96 well plateson a Bio-Rad CFX96 Touch™ real time PCR detectionsystem. A two-step touchdown PCR protocol was usedstarting with an initial denaturation step at 95°C for2 minutes, followed by 9 cycles of 95°C for 10 seconds,60°C for 30 seconds, with the annealing temperaturedecreasing by 1°C per cycle. This was followed by43 cycles of 95°C for 10 seconds and 50°C for 30 seconds.The final PCR cycle consisted of 95°C for 30 secondsfollowed by 55°C for 1 minute. A plate read was obtainedat the end of every PCR cycle. The melt curve data wereobtained starting at 70°C and increasing by 0.2°C every10 seconds to a maximum of 95°C. During the meltingstage a plate read was obtained at every 0.2°C increment.Melt curve data were analyzed using the Bio-Rad CFXproprietary software.SNPs that demonstrated multiple clusters of HRMcurves (i.e. polymorphic loci) were subjected to Sanger re-sequencing on an ABI 3130 Genetic Analyzer to confirmtheir genotype. Given that each HRM cluster shouldrepresent a single SNP genotype, 1–2 individuals fromeach cluster were sequenced in order to determine thegenotype of each cluster.For each locus that was successfully genotyped, wecalculated expected and observed heterozygosity, andtested for deviations from Hardy-Weinberg and linkageequilibrium. These analyses were carried out usingGenePop v.4 [42] followed by a sequential Bonferronicorrection for multiple tests [43]. In addition, the allelefrequencies observed from HRM genotyping of each locuswere compared with the frequency expected from thepooled transcriptome read data. For consistency, themajor-allele was defined as the allele with the highestfrequency observed in the transcriptome data.Additional filesAdditional file 1: High coverage contig sequences. In FASTA formatcontaining the sequence of all high coverage contigs used for SNP detection(minimum length = 200 bases, minimum coverage = 5× for each ecotype).Additional file 2: Characterization of high coverage contigs. Listingthe length, coverage, number of reads, and top Blastx hit (NCBI) for each ofthe high coverage contigs.Additional file 3: GO terms for each high coverage contig. Listingthe contig name and GO annotation information for each of the three GOdomains: Biological process (BP), molecular function (MF), cellularcomponent (CC).Additional file 4: SNP information. Characterizing the 32,699 SNPsidentified in the high coverage data set.Additional file 5: SNP primers. Containing the primer sequences andlocus information for all loci for which HRM validation was attempted.Additional file 6: Ecotype unique contig sequences. In FASTA formatcontaining the sequence of all contigs that were composed entirely ofreads from a single ecotype (minimum length = 200 bases, minimumcoverage = 5×).Additional file 7: Sample collection. Containing the sampling locationsof all kokanee used to generate the two pooled transcriptome libraries.Competing interestsThe authors declare that they have no competing interests.Authors’ contributionsMR and ML designed the study. ML collected tissue samples, carried out lab work,analyzed data and drafted the manuscript. DD carried out lab work and assistedwith data analysis. MR obtained the funding, assisted with analyses, and helpeddraft the manuscript. All authors read and approved the final manuscript.AcknowledgementsWe thank Paul Askey (BC Ministry of Forests, Lands, and Natural ResourceOperations) and Jason Webster (Chara Consulting) for field expertise andsampling advice. We also thank Richard Bussanich (Okanagan NationAlliance) for providing Osoyoos Lake sockeye salmon samples. Jim Seeb(University of Washington) offered helpful feedback regarding the SNPdiscovery and validation pipeline. Mark Rheault and Deanna Gibson providedsome of the necessary lab equipment and the Génome Québec InnovationCentre provided technical assistance. This work was funded by Genome BCStrategic Opportunities Fund grant # 130 (MR) and NSERC Discovery Grant #341711–07 (MR). ML was partially supported by an NSERC PostgraduateScholarship.Received: 19 October 2012 Accepted: 1 May 2013Published: 7 May 2013References1. Wolfe KH, Li WH: Molecular evolution meets the genomics revolution.Nat Genet 2003, 33:255–265.2. Allendorf FW, Hohenlohe PA, Luikart G: Genomics and the future ofconservation genetics. Nat Rev Genet 2010, 11:697–709.3. Brumfield RT, Beerli P, Nickerson DA, Edwards SV: The utility of singlenucleotide polymorphisms in inferences of population history. Trends EcolEvol 2003, 18:249–256.Lemay et al. BMC Genomics 2013, 14:308 Page 10 of 11http://www.biomedcentral.com/1471-2164/14/3084. Garvin MR, Saitoh K, Gharrett AJ: Application of single nucleotidepolymorphisms to non-model species: a technical review. Mol Ecol Resour2010, 10:915–934.5. Helyar SJ, Hemmer-Hansen J, Bekkevold D, Taylor MI, Ogden R, Limborg MT,Cariani A, Maes GE, Diopere E, Carvalho GR, Nielsen EE: Application of SNPsfor population genetics of nonmodel organisms: new opportunities andchallenges. Mol Ecol Resour 2011, 11:123–136.6. Wang Z, Gerstein M, Snyder M: RNA-Seq: a revolutionary tool fortranscriptomics. Nat Rev Genet 2009, 10:57–63.7. Baird NA, Etter PD, Atwood TS, Currey MC, Shiver AL, Lewis ZA, Selker EU,Cresko WA, Johnson EA: Rapid SNP discovery and genetic mapping usingsequenced RAD markers. PLoS One 2008, 3:e3376.8. Peterson BK, Weber JN, Kay EH, Fisher HS, Hoekstra HE: Double digestRADseq: an inexpensive method for De novo SNP discovery andgenotyping in model and Non-model species. PLoS One 2012, 7:e37135.9. Luikart G, England PR, Tallmon D, Jordan S, Taberlet P: The power andpromise of population genomics: from genotyping to genome typing.Nat Rev Genet 2003, 4:981–994.10. Narum SR, Hess JE: Comparison of F-ST outlier tests for SNP loci underselection. Mol Ecol Resour 2011, 11:184–194.11. Cavalli-Sforza LL: Population structure and human evolution. Proceedingsof the Royal Society Series B-Biological Sciences 1966, 164:362–379.12. Lewontin RC, Krakauer J: Distribution of gene frequency as a test of thetheory of the selective neutrality of polymorphisms. Genetics 1973,74:175–195.13. André C, Larsson LC, Laikre L, Bekkevold D, Brigham J, Carvalho GR,Dahlgren TG, Hutchinson WF, Mariani S, Mudde K, Ruzzante DE, Ryman N:Detecting population structure in a high gene-flow species, Atlanticherring (Clupea harengus): direct, simultaneous evaluation of neutral vsputatively selected loci. Heredity 2011, 106:270–280.14. Westgaard JI, Fevolden SE: Atlantic cod (Gadus morhua L.) in inner andouter coastal zones of northern Norway display divergent geneticsignature at non-neutral loci. Fish Res 2007, 85:306–315.15. Freamo H, O’Reilly P, Berg PR, Lien S, Boulding EG: Outlier SNPs show moregenetic structure between two Bay of Fundy metapopulations ofAtlantic salmon than do neutral SNPs. Mol Ecol Resour 2011, 11:254–267.16. Bonin A, Nicole F, Pompanon F, Miaud C, Taberlet P: Population adaptiveindex: a new method to help measure intraspecific genetic diversity andprioritize populations for conservation. Conserv Biol 2007, 21:697–708.17. Funk CW, McKay JK, Hohenlohe PA, Allendorf FW: Harnessing genomics fordelineating conservation units. Trends Ecol Evol 2012, 27:489–496.18. Russello MA, Kirk SL, Frazer K, Askey P: Detection of outlier loci and theirutility for fisheries management. Evol Appl 2012, 5:39–52.19. Dill P: A study of shore-spawning kokanee salmon (Oncorhynchus nerka)at Bertram Creek Park, Okanagan Lake, BC 1992–1996. Penticton, BC:Report prepared for the BC Ministry of Environment; 1996.20. Shepherd BG: A case history: The kokanee stocks of Okanagan Lake.Proceedings of a Conference on the Biology and Management ofSpecies and Habitats at Risk: 15-19 February 1999. In Edited by DarlingLM. B.C: Ministry of Environment, Lands and Parks, Victoria, BC andUniversity College of the Cariboo, Kamloops, BC; 2000:609–616.21. Thompson LC: Abundance and production of zooplankton and kokaneesalmon (Oncorhynchus nerka) in Kootenay Lake, British Columbia duringartificial fertilization, PhD thesis. Vancouver, Canada: University of BritishColumbia, Department of Zoology; 1999.22. Taylor EB, Kuiper A, Troffe PM, Hoysak D, Pollard S: Variation indevelopmental biology and microsatellite DNA in reproductive ecotypesof kokanee, Oncorhynchus nerka: implications for declining populationsin a large British Columbia lake. Conserv Genet 2000, 1:231–249.23. Taylor EB, Harvey S, Pollard S, Volpe J: Postglacial genetic differentiation ofreproductive ecotypes of kokanee Oncorhynchus nerka in OkanaganLake. British Columbia. Mol Ecol 1997, 6:503.24. Lemay MA, Russello MA: Neutral loci reveal structure by geography, notecotype, in Kootenay Lake kokanee. N Am J Fish Manage 2012, 32:282–291.25. Allendorf FW, Thorgaard G: Polyploidy and the evolution of salmonidfishes. In The Evolutionary Genetics of Fishes. Edited by Turner BJ. New York:Plenum Press; 1984:1–53.26. Seeb JE, Pascal CE, Grau ED, Seeb LW, Templin WD, Harkins T, Roberts SB:Transcriptome sequencing and high-resolution melt analysis advancesingle nucleotide polymorphism discovery in duplicated salmonids.Mol Ecol Resour 2011, 11:335–348.27. Renaut S, Nolte AW, Bernatchez L: Mining transcriptome sequencestowards identifying adaptive single nucleotide polymorphisms in lakewhitefish species pairs (Coregonus spp. Salmonidae). Mol Ecol 2010,19:115–131.28. Nichol ST, Rowe JE, Winton JR: Molecular epizootiology and evolution ofthe glycoprotein and non-virion protein genes of the infectioushematopoietic necrosis virus, a fish rhabdovirus. Virus Res 1995,38:159–173.29. Campbell NR, Narum SR: Development of 54 novel single-nucleotidepolymorphism (SNP) assays for sockeye and coho salmon andassessment of available SNPs to differentiate stocks within the ColumbiaRiver. Mol Ecol Resour 2011, 11:20–30.30. Elfstrom CM, Smith CT, Seeb JE: Thirty-two single nucleotidepolymorphism markers for high-throughput genotyping of sockeyesalmon. Mol Ecol Notes 2006, 6:1255.31. Smith CT, Elfstrom CM, Seeb LW, Seeb JE: Use of sequence data fromrainbow trout and Atlantic salmon for SNP detection in Pacific salmon.Mol Ecol 2005, 14:4193.32. Habicht C, Seeb LW, Myers KW, Farley EV, Seeb JE: Summer-fall distributionof stocks of immature sockeye salmon in the Bering Sea as revealed bysingle-nucleotide polymorphisms. Trans Am Fish Soc 2010, 139:1171.33. Martino A, Mancuso T, Rossi AM: Application of high-resolution melting tolarge-scale, high-throughput SNP genotyping: a comparison with theTaqMan(R) method. J Biomol Screen 2010, 15:623–629.34. Liew M, Pryor R, Palais R, Meadows C, Erali M, Lyon E, Wittwer C:Genotyping of single-nucleotide polymorphisms by high-resolutionmelting of small amplicons. Clin Chem 2004, 50:1156–1164.35. Van West P: Saprolegnia parasitica, an oomycete pathogen with a fishyappetite: new challenges for an old problem. Mycologist 2006, 20:99–104.36. Duchaud E, Boussaha M, Loux V, Bernardet JF, Michel C, Kerouault B,Mondot S, Nicolas P, Bossy R, Caron C, Bessieres P, Gibrat JF, Claverol S,Dumetz F, Le Henaff M, Benmansour A: Complete genome sequence ofthe fish pathogen Flavobacterium psychrophilum. Nat Biotechnol 2007,25:763–769.37. Rubio-Godoy M, Paladini G, Freeman MA, Garcia-Vasquez A, Shinn AP:Morphological and molecular characterisation of Gyrodactylus salmonis(Platyhelminthes, Monogenea) isolates collected in Mexico from rainbowtrout (Oncorhynchus mykiss Walbaum). Vet Parasitol 2012,186:289–300.38. Conesa A, Gotz S, Garcia-Gomez JM, Terol J, Talon M, Robles M: Blast2GO: auniversal tool for annotation, visualization and analysis in functionalgenomics research. Bioinformatics 2005, 21:3674–3676.39. Gotz S, Garcia-Gomez JM, Terol J, Williams TD, Nagaraj SH, Nueda MJ,Robles M, Talon M, Dopazo J, Conesa A: High-throughput functionalannotation and data mining with the Blast2GO suite. Nucleic Acids Res2008, 36:3420–3435.40. Jeukens J, Renaut S, St-Cyr J, Nolte AW, Bernatchez L: The transcriptomicsof sympatric dwarf and normal lake whitefish (Coregonus clupeaformisspp., Salmonidae) divergence as revealed by next-generationsequencing. Mol Ecol 2010, 19:5389–5403.41. Rozen S, Skaletsky HJ: Primer3 on the WWW for general users and forbiologist programmers. In Bioinformatics Methods and Protocols: Methodsfor Molecular Biolog. Edited by Krawetz S, Misener S. Totowa, NJ: HumanaPress; 2000:365–386.42. Raymond M, Rousset F: GENEPOP (version-1.2) – population geneticssoftware for exact tests and ecumenicism. J Hered 1995, 86:248–249.43. Rice WR: Analyzing tables of statistical tests. Evolution 1989, 43:223–225.doi:10.1186/1471-2164-14-308Cite this article as: Lemay et al.: Transcriptome-wide comparison ofsequence variation in divergent ecotypes of kokanee salmon. BMCGenomics 2013 14:308.Lemay et al. BMC Genomics 2013, 14:308 Page 11 of 11http://www.biomedcentral.com/1471-2164/14/308


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items