UBC Faculty Research and Publications

A sequence-based approach to identify reference genes for gene expression analysis Chari, Raj; Lonergan, Kim M; Pikor, Larissa A; Coe, Bradley P; Zhu, Chang Q; Chan, Timothy H; MacAulay, Calum E; Tsao, Ming-Sound; Lam, Stephen; Ng, Raymond T; Lam, Wan L Aug 3, 2010

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata


52383-12920_2010_Article_167.pdf [ 584.45kB ]
JSON: 52383-1.0215986.json
JSON-LD: 52383-1.0215986-ld.json
RDF/XML (Pretty): 52383-1.0215986-rdf.xml
RDF/JSON: 52383-1.0215986-rdf.json
Turtle: 52383-1.0215986-turtle.txt
N-Triples: 52383-1.0215986-rdf-ntriples.txt
Original Record: 52383-1.0215986-source.json
Full Text

Full Text

RESEARCH ARTICLE Open AccessA sequence-based approach to identify referencegenes for gene expression analysisRaj Chari1*, Kim M Lonergan1, Larissa A Pikor1, Bradley P Coe1, Chang Qi Zhu2, Timothy HW Chan1,3,Calum E MacAulay1, Ming-Sound Tsao2, Stephen Lam1, Raymond T Ng1,3†, Wan L Lam1†AbstractBackground: An important consideration when analyzing both microarray and quantitative PCR expression data isthe selection of appropriate genes as endogenous controls or reference genes. This step is especially critical whenidentifying genes differentially expressed between datasets. Moreover, reference genes suitable in one context (e.g.lung cancer) may not be suitable in another (e.g. breast cancer). Currently, the main approach to identify referencegenes involves the mining of expression microarray data for highly expressed and relatively constant transcriptsacross a sample set. A caveat here is the requirement for transcript normalization prior to analysis, andmeasurements obtained are relative, not absolute. Alternatively, as sequencing-based technologies provide digitalquantitative output, absolute quantification ensues, and reference gene identification becomes more accurate.Methods: Serial analysis of gene expression (SAGE) profiles of non-malignant and malignant lung samples werecompared using a permutation test to identify the most stably expressed genes across all samples. Subsequently,the specificity of the reference genes was evaluated across multiple tissue types, their constancy of expression wasassessed using quantitative RT-PCR (qPCR), and their impact on differential expression analysis of microarray datawas evaluated.Results: We show that (i) conventional references genes such as ACTB and GAPDH are highly variable betweencancerous and non-cancerous samples, (ii) reference genes identified for lung cancer do not perform well for othercancer types (breast and brain), (iii) reference genes identified through SAGE show low variability using qPCR in adifferent cohort of samples, and (iv) normalization of a lung cancer gene expression microarray dataset with orwithout our reference genes, yields different results for differential gene expression and subsequent analyses.Specifically, key established pathways in lung cancer exhibit higher statistical significance using a datasetnormalized with our reference genes relative to normalization without using our reference genes.Conclusions: Our analyses found NDUFA1, RPL19, RAB5C, and RPS18 to occupy the top ranking positions among 15suitable reference genes optimal for normalization of lung tissue expression data. Significantly, the approach usedin this study can be applied to data generated using new generation sequencing platforms for the identification ofreference genes optimal within diverse contexts.BackgroundGene expression profiling, including quantitative RT-PCR (qPCR) and microarray experimentation, is invalu-able for the molecular analysis of biological systems.The interpretation of results from such experiments (i.e., the determination of differential expression for aparticular gene among datasets) is strongly influencedby the selection of reference genes for normalizationacross datasets [1]. Specifically, gene expression is nor-malized within a given dataset by calculating the tran-script abundance of the gene of interest relative to agene that is constantly expressed across independentdatasets (termed a “housekeeping” or a “reference”gene), and differential expression between two datasetsor samples is determined by calculating the ratio of thenormalized expression levels for the gene of interestbetween the two datasets. Typically, housekeeping genes* Correspondence: rchari@bccrc.ca† Contributed equally1Department of Integrative Oncology, British Columbia Cancer AgencyResearch Centre, Vancouver, BC, CanadaFull list of author information is available at the end of the articleChari et al. BMC Medical Genomics 2010, 3:32http://www.biomedcentral.com/1755-8794/3/32© 2010 Chari et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative CommonsAttribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction inany medium, provided the original work is properly cited.satisfy the following criteria: they are highly expressed inthe cell, the variability in expression between samples isminimal, and the genes’ expression is not influenced bythe experimental conditions tested [2]. Hence, problemsarise when housekeeping genes are selected that do notmeet these criteria, as fluctuations in these genes mayerroneously influence the data interpretation.Historically, beta actin (ACTB), glyceraldehyde-3-phos-phate dehydrogenase (GAPDH), and 18 S rRNA havebeen routinely used as reference genes for qPCR andmicroarray data normalization. However, a number ofstudies have shown that expression of these genes variesconsiderably depending on the specific tissue type anddisease state of the tissue [3-16]. Attempts to achievemore reliable normalization include the spiking of syn-thetic poly-A RNAs for the analysis of cDNA arrays andnorthern blots, and the combined use of an oligo-(dT)nprimer with an 18 S specific primer for qPCR analysis[17,18]. In addition, re-mining of large microarray data-sets for the identification of novel, highly stable genes,as well as use of a combination of reference genesinstead of a single gene for normalization, are some ofthe other approaches taken to address this problem[11,13,19].Recently, efforts have been made to identify more sui-table reference genes for microarray and qPCR studiesof lung cancer. Specifically, candidate reference geneshave been identified from the mining of microarraygene expression data to identify the least variable genes,followed by validation of expression using qPCR[11,20,21]. However, as microarray data do not provideabsolute abundance values for transcripts, selection ofreference genes from this type of data is inherently pro-blematic. To circumvent this handicap in the utilizationof microarray data, we turn to the use of large-scaleexpression profiling permitted by serial analysis of geneexpression (SAGE) experimentation for the identifica-tion of novel reference genes optimal for the study oflung cancer. This approach, which we have termed nor-malization of expression by permutation of SAGE(NEPS), takes advantage of the fact that SAGE is a tran-scriptome profiling technique that identifies the absoluteabundance levels of transcripts by direct enumeration ofsequence tag counts, thus allowing the direct compari-son of expression levels across multiple profiles withoutthe need for reference or housekeeping genes [22].NEPS adopts a permutation test approach designedfor analyzing relatively small sample sizes, such as thosetypically encountered with SAGE. Unlike the conven-tional T-test, the permutation test is non-parametric[23]. The null hypothesis states that the mean geneexpression levels in two groups of SAGE libraries beingcompared (in this case normal and cancer), are thesame. For this analysis, samples from both the normaland the cancer groups are pooled, followed by randomsampling to create a simulated Group 1 and a simulatedGroup 2. For each gene, the difference in expressionbetween these two simulated groups was measured.This exercise was repeated 10,000 times, thus generat-ing a simulated mean μ and a simulated standard devia-tion s. The permutation score (PS) of a given gene isdefined by PS O= −| | , where O is the true differencebetween the average expression levels in the two groups.Hence, for a given gene, the closer the permutationscore is to zero, the more it satisfies the constancyrequirement.To demonstrate the utility of NEPS for selecting genesthat satisfy the constancy requirement, we analyzed 24bronchial epithelial lung SAGE libraries, 2 lung parench-yma libraries, and 11 lung squamous cell carcinomalibraries. From this analysis, NEPS selected 15 genes,which we hereafter refer to as the lung-NEPS referencegenes (Table 1). We further demonstrate that (1) whilethese genes perform well as reference genes for lung,they are not satisfactory for normalization of expressiondata from other tissues, suggesting that reference genesare tissue-specific, and (2) in lung cancer datasets, dif-ferential gene expression determination and subsequentpathway analyses are improved after normalization usingthe lung-NEPS reference genes.MethodsSAGE library construction26 normal and 11 lung cancer SAGE libraries were con-structed and used in the analysis [24]. The constructionof the 26 normal libraries has been previously described[25,26]. 24 of these libraries were generated from exfo-liated bronchial cells obtained from bronchial brushes,and two libraries from normal lung parenchyma (Addi-tional file 1). Conversely, the 11 cancer libraries weregenerated from biopsied specimens with six librariesrepresenting lung squamous cell carcinoma and fivelibraries representing carcinoma in situ. This data canbe found at the GEO database with the following seriesaccession numbers: GSE3707, GSE5473, and GSE7898.All samples were acquired under approval by the Uni-versity of British Columbia - British Columbia CancerAgency Research Ethics Board (UBC-BCCA-REB) andall subjects provided written consent.SAGE data from public domainPublicly available SAGE data were also used in this ana-lysis, representing both brain and breast cancer. Specifi-cally, six normal and 12 breast cancer librariesChari et al. BMC Medical Genomics 2010, 3:32http://www.biomedcentral.com/1755-8794/3/32Page 2 of 11(Additional file 2) and 7 normal and 19 brain cancerlibraries were used (Additional file 3). The libraries wereobtained from the cancer genome anatomy project(CGAP) database http://cgap.nci.nih.gov[27,28].Permutation testGiven that SAGE libraries are expensive to generate, thenumber of libraries in a given study is typically small(i.e., in 10’s, rather than in 100’s). The permutation testis a non-parametric test, which does not assume anyunderlying distribution. The number of samplesrequired for the test to achieve sufficient statisticalpower is relatively low compared to other statisticaltests (e.g., t-test and c2-square test). Furthermore, eachadditional sample increases the power of the test expo-nentially. The permutation test is a test of the meansbetween two different distributions. Without loss of gen-erality, let us assume that one distribution is for thegene expression level of a particular gene in normal tis-sues (i.e. subscript n), and that the other distribution isfor cancerous tissues (i.e. subscript c). Genes areselected using the following hypotheses:Null HypothesisAlternative HypothesisHHc na c n::  − =− ≠00If there is little difference between the two means, itwould make no difference if we mix the cancerous sam-ples with the normal samples. But, if the null hypothesisis rejected, it indicates that the gene expression levels ofnormal and cancer samples are sufficiently different (thealternative hypothesis). In the following, we show ourspecific implementation of the test. Let n and c be thenumber of normal tissue samples and the number ofcancerous tissue samples respectively.A. For each gene, select all the gene-specific normal-ized tag counts from the normal libraries and all thegene-specific normalized tag counts from the cancerlibraries.B. Randomly select n counts to create a simulatednormal set, and calculate the simulated normal meanμsn.C. Similarly, select the remaining c counts form thesimulated cancerous set. Calculate the simulated cancermean μsc.D. Consider the random variable v = μsc - μsn, calledthe simulated difference.E. Repeat the steps A to D above m times. Let μ ands denote the mean and the standard deviation of v.F. Now separate the libraries back into their true iden-tity: normal or cancerous. Calculate the true observeddifference O = μrc - μrn, where μrc denotes the truemean count of the cancerous libraries, and μrn denotesthe true mean of the normal libraries.G. Calculate the Permutation Score PS wherePS O= −| | .H. Repeat all the above steps for each gene. Sort thepermutation score in descending order.The permutation score is one way to measure howlikely the actual observed difference occurs by chance. Itis based on standardization, i.e., subtracting the meanand then divided by the standard deviation. The moreTable 1 Lung NEPS GenesGene Symbol Gene Name Average Raw Tag Count1 Permutation ScorePPP1CB protein phosphatase 1, catalytic subunit, beta isoform 29 0.003B2M beta-2-microglobulin 829 0.011CSTB cystatin B (stefin B) 52 0.036RPL4 ribosomal protein L4 46 0.045SLFN13 schlafen family member 13 31 0.045CAPZB capping protein (actin filament) muscle Z-line, beta 77 0.050ATP5J ATP synthase, H+ transporting, mitochondrial F0 complex, subunit F6 38 0.059RAB5C RAB5C, member RAS oncogene family 44 0.064NDUFA1 NADH dehydrogenase (ubiquinone) 1 alpha subcomplex, 1, 7.5 kDa 89 0.077RPL19 ribosomal protein L19 69 0.082HMGB1 high-mobility group box 1 39 0.087CD55 CD55 molecule, decay accelerating factor for complement (Cromer blood group) 27 0.100RPS18 ribosomal protein S18 112 0.123HSPA1A heat shock 70 kDa protein 1A 40 0.133EIF4A2 eukaryotic translation initiation factor 4A, isoform 2 89 0.1451Across all normal and cancer SAGE librariesChari et al. BMC Medical Genomics 2010, 3:32http://www.biomedcentral.com/1755-8794/3/32Page 3 of 11the true observed difference is from the average(expressed as multiples of the standard deviation), theless likely that the true observed difference is a coinci-dence. That is to say, the larger the permutation score,the more significant is the observed difference betweencancerous and normal samples.On the other hand, for the sake of evaluating the con-stancy requirement, the ideal reference gene would havea permutation score equal to 0. This means that there isno difference in the distributions of expression levelsbetween cancerous and normal samples. For the resultsreported here, we used m = 10,000 permutations.Data pre-processingRaw tag counts for each SAGE library were normalizedto tags per million (TPM) to facilitate adequate compar-ison among libraries. Tag-to-gene mapping was per-formed using the February 5th, 2007 version ofSAGEGenie [27]. In cases where multiple SAGE tagsmapped to the same gene, the tags were collapsed tocapture all potential transcript variants, and a cumula-tive tag count was utilized for analysis.Statistical criteria for reference gene selectionThe permutation test outlined above was used to iden-tify genes which were statistically similar when compar-ing the libraries from normal tissue (bronchialepithelium and lung parenchyma) and cancerous tissueof the lung. Three main criteria were used for referencegene selection: permutation score (described above) ≤0.15; at least two SAGE tags observed in each library;and an overall average count of ≥ 25 across all samples.For the analysis in brain and breast tissue, the first twocriteria were maintained, but due to the lower sequen-cing depth, an average count of ≥ 10 across all sampleswas used instead.Quantitative RT-PCR validation in clinical lung cancerspecimensOne microgram of total RNA from 15 lung tumor andmatched non-malignant parenchyma samples were con-verted to cDNA using the High-Capacity cDNA archivekit (Applied Biosystems Inc., Foster City CA). One hun-dred nanograms of cDNA were utilized for qPCR usingthe TaqMan Gene Expression Assay (Applied Biosys-tems Inc). All fifteen lung NEPS genes and six addi-tional reference genes were assayed. All TaqMan probeswere pre-optimized by Applied Biosystems. Primer IDsfor all genes are provided in Additional file 4. The 30samples were assayed in triplicate in parallel along withnegative (no cDNA template) controls using the 7500Fast Real-Time PCR System. Appropriate cDNAdilutions were used such that the exponential phase ofthe amplification curves were within the 40 PCR cyclesrecommended by the manufacturer (i.e. ranging from16-36 cycles for the 20 genes and 1-13 cycles for18SRNA). Cycle thresholds were determined fromamplification curves using 7500 Fast System software.For the analysis of qPCR data, three different methodswere used. Within each method, all genes were rankedfrom best to worst. Subsequently, for each gene, acumulative ranking across all three methods was deter-mined by summing its rank from each individualmethod. Two previously published methods, geNorm[14] and NormFinder [29], and the variance of cyclethreshold difference (dCt) across all 15 tumor/matchednon-malignant sample pairs were the approaches usedto determine constancy.Analysis of publicly available microarray datasetsLung NEPS genes were used to re-normalize two pub-licly available microarray datasets. Microarray data wereobtained from GEO at NCBI under accession numbersGSE10072 [30] and GSE12428 [31].For the Affymetrix data (GSE10072), Raw CEL fileswere processed through Affymetrix’s Microarray ArraySuite (MAS) 5.0 algorithm in the “affy” package in Bio-conductor [32,33]. Briefly, MAS 5.0 is a three step pro-cess which involves a global background signalcorrection, correction of the probe value for cross-hybri-dization and spurious signals using mismatch probeswhich are off by one base, and finally, scale normaliza-tion of each experiment to a fixed median intensity tofacilitate inter-experimental comparison http://media.affymetrix.com/support/technical/whitepapers/sadd_whi-tepaper.pdf. Probes were filtered on MAS 5.0 calls, andthose having a “P” or “M” call in at least 50% of sampleswere retained. This resulted in a dataset of 11440probes. Of the 15 lung NEPS reference genes, 12 wererepresented on the array platform. Of those 12 genes,probes which had a “P” call in 100% of the sampleswere used for the calculation of the scaling factor withonly one probe/gene allowed. If two probes met thesecriteria for one gene, the probe with the highest meanexpression was chosen. After employing these criteria,eight probes were used (Additional file 5), which repre-sented genes PPP1CB, B2M, RPL4, CAPZB, ATP5J,RAB5C, NDUFA1, and HSPA1A.For the Agilent microarray data (GSE12428), all lungNEPS genes were represented on this microarray plat-form. Data was processed as described previously [31]. Inthe cases where lung NEPS genes were represented withmultiple probes, the probe with the maximum averageintensity across the dataset was used. A list of the probesChari et al. BMC Medical Genomics 2010, 3:32http://www.biomedcentral.com/1755-8794/3/32Page 4 of 11used is given in Additional file 6. Since each sample hadat least two replicate experiments, the average acrossreplicate experiments was used for each probe.To determine the scaling factor, for each sample, lin-ear regression analysis was performed comparing thevalues for the reference gene (x) versus the averagevalues for the reference genes across the sample set (y).The slope of the line based on least-squares fitting wasthen multiplied to each value in the experiment.Next, Significance Analysis of Microarrays (SAM) wasperformed to determine differentially expressed genesbetween non-malignant and malignant samples for bothmicroarray datasets using the “samr” package in R [34].Unpaired analysis was performed using the normal sam-ples versus tumor samples and the delta parameter set to0.4. Probes which had a Q-value% ≤ 5 were considered sig-nificant. For the Affymetrix dataset, results were comparedbetween the dataset normalized with MAS 5.0 alone andMAS 5.0 + NEPS scaling and for the Agilent dataset, thecomparison was done between median normalizationalone and NEPS scaling followed by mediannormalization.Results and DiscussionIdentification of reference genes for gene expressionanalysis in lung cancerFrom our NEPS analysis [with an imposed permutationscore (PS) threshold ≤ 0.15, and an average expressionof ≥ 25 raw tag counts across all samples], 15 geneswere identified as the most consistently expressedacross normal and cancerous lung tissue (Table 1).Here we identified beta-2-microglobulin (B2M), com-ponents of the large ribosomal subunit such as riboso-mal protein L19 (RPL19) and ribosomal protein L4(RPL4), components of the small ribosomal subunitsuch as ribosomal protein S18 (RPS18), and electrontransport chain constituents such as NADH dehydro-genase (ubiquinone) 1 alpha subcomplex 1 (NDUFA1),to rate highly in our permutation analysis, thereby sug-gesting their potential as reliable reference genes. B2M has previously been utilized as a reference gene[10,15], providing validity to the approach used here.The 18 S and 28 S rRNAs have previously served asreference genes [4,7,11,12,16], and here we show thatthe ribosomal protein genes can also provide thisservice.Performance of standard and previously reportedreference genesA previous study reported a meta-analysis of microar-ray data designed to identify novel reference genes forthe study of non-small cell lung cancer (NSCLC) [11].Using a small panel of tumor/normal specimens, theauthors demonstrated that genes commonly used forreference in qPCR experimentation were sub-optimal,and identified novel, more consistently expressedgenes to be superior as reference genes. Additionally,studies of asthmatic airways have also shown that tra-ditional reference genes such as ACTB and GAPDHperform poorly in this regard [6,7,12,14,15]. We findsimilar results using our NEPS-based approach. Asdescribed above, genes such as B2 M (permutationscore = 0.011) and RPL19 (0.082) were shown to havevery low permutation scores denoting stable expres-sion between normal and cancer, whereas ACTB(2.69) and GAPDH (6.48) performed very poorly withsignificant differential expression. Other known refer-ence genes such as hypoxanthine phosphoribosyltrans-ferase 1 (HPRT1) (0.114) and TATA box bindingprotein (TBP) (0.468), while exhibiting low permuta-tion scores, were not as highly expressed with averageraw tag counts across all samples of 1.54 and 1.65,respectively. In contrast, genes such as peptidylprolylisomerase A (PPIA, aka cyclophilin A) (6.11), transfer-rin receptor (TFRC, p90, CD71) (4.70), and phospho-glycerate kinase 1 (PGK1) (5.04) identified in amicroarray meta-analysis study (see above) [11], per-formed poorly in our study, as revealed by the rela-tively high permutation scores. Although theseparticular genes did not perform as well as the refer-ence genes identified from our permutation analysis,other genes identified by Saviozzi et al., such as signaltransducer and activator of transcription 1 (STAT1)(0.21), esterase D/formylglutathione hydrolase (ESD)(0.18), Yes-associated protein 1 (YAP1) (0.65) and poly-merase (RNA) II (DNA directed) polypeptide A(POLR2A) (0.88) did perform satisfactorily in ourstudy, as evidenced by permutation scores ≤ 1. In addi-tion to these genes, a second set of genes identifiedusing a cross-tissue and cross-platform analysis werealso assessed [20] and similarly, while some genes suchas C-terminal binding protein 1 (CTBP1), cullin 1(CUL1), DIM1 dimethyladenosine transferase 1-like(DIMT1L), tripartite motif-containing 27 (TRIM27) andubiquilin 1 (UBQLN1) performed reasonably wellbased on our metric, others such as poly(A) polymerasealpha (PAPOLA) and ADP-ribosylation factor-like 8B(ARL8B) did not (Figure 1, Additional file 7).Demonstrating tissue specificity of reference genesTo further our investigations regarding reference genesoptimal for cancer cell biology, we expanded our ana-lysis to include publicly available SAGE libraries repre-senting normal and cancer tissue from both brain andChari et al. BMC Medical Genomics 2010, 3:32http://www.biomedcentral.com/1755-8794/3/32Page 5 of 11breast. The results of this analysis clearly demonstratethat the reference genes identified in the lung datasetare distinct from those found in either breast (Table 2)or brain (Table 3). This data strongly suggests thatreference genes should be selected in a tissue specificmanner. For example, GAPDH, which performedpoorly as a reference gene for lung gene expressionanalysis (see above), was in fact one of the best refer-ence genes identified from the analysis of the braindataset (Table 3). Moreover, not only was there nooverlap among the reference gene lists determined foreach of the three different tissue types (i.e., lung-NEPS,breast-NEPS, brain-NEPS), but when examiningreference genes specific to one tissue type (i.e. lung-NEPS) in the other two tissue types (i.e. breast orbrain), the permutation scores for these genes weresignificantly higher and more variable (Figure 2). Theseresults are consistent with other studies demonstratingthe need for tissue and context-specific selection ofreference genes [3,5,8,10,14].Quantitative RT-PCR validation of identified referencegenes in lung cancer samplesUsing a secondary set of 15 tumor and matched non-malignant samples, qPCR was used to validate consis-tency of expression for all lung-NEPS genes.Permutation ScoreAverage Raw Tag Count01 2 3 4 5 6 71002003004005006007008009000B2MGAPDHPPIAARL8BPGK1RPL30TFRCPAPOLAFigure 1 Enhanced performance of the lung-NEPS genes (red; see Table 1) relative to previously reported and standard referencegenes traditionally used in qPCR and microarray normalization (blue). The x-axis represents the permutation score of a defined gene andthe y-axis represents the average raw (non-normalized) tag count for the same gene. Data used in the graph are given in Additional file 7. LungNEPS genes are stable and highly expressed as compared to the traditionally used genes. B2 M appears to perform the best with respect to highaverage tag count and low permutation score. Notably, the gene that performs the poorest is GAPDH.Chari et al. BMC Medical Genomics 2010, 3:32http://www.biomedcentral.com/1755-8794/3/32Page 6 of 11Additionally, we performed qPCR for previously identi-fied, commonly used housekeeping genes ACTB,GAPDH, HPRT1, and TBP. In addition, two genes outof 13 identified (CUL1 and TRIM27) as suitable refer-ence genes from a previously published study [20], wereselected here based on high NEPS performance (seeabove), for qPCR analysis.Of the NEPS genes analyzed, NDUFA1, RPL19,RAB5C, member RAS oncogene family (RAB5C), andRPS18 performed the best based on the cumulativeranking metric (Table 4). Conversely, the standard refer-ence genes ACTB, GAPDH, and HPRT1 did not performas well. These results confirm a high constancy ofexpression for a subset of the lung-NEPS genes using analternative method in a secondary set of samples.Effect of reference genes on differential gene expressionanalysisUsing a publicly available microarray dataset(GSE10072, [30]), differential expression analysis wasperformed using SAM [34]. Results from SAM werecompared using the dataset normalized by MAS 5.0alone, versus the same dataset normalized by MAS 5.0with scale normalization using the lung-NEPSTable 2 Breast NEPS GenesGeneSymbolGene Name Average Raw TagCountPermutationScoreEIF5A eukaryotic translation initiation factor 5A 22 0.003EIF3S2 eukaryotic translation initiation factor 3, subunit 2 beta, 36 kDa 12 0.037RPS8 ribosomal protein S8 122 0.046TSPAN9 tetraspanin 9 122 0.051UBB ubiquitin B 39 0.057RPL28 ribosomal protein L28 78 0.064FTL ferritin, light polypeptide 16 0.066YWHAQ tyrosine 3-monooxygenase/tryptophan 5-monooxygenase activation protein, thetapolypeptide19 0.074TMEM49 transmembrane protein 49 13 0.083FAM39B family with sequence similarity 39, member B 11 0.091NINJ1 ninjurin 1 13 0.097RPL30 ribosomal protein L30 108 0.108PDE6B phosphodiesterase 6B, cGMP-specific, rod, beta 10 0.115TUBA3 tubulin, alpha 1a 50 0.117MYL9 myosin, light chain 9, regulatory 15 0.120MYH9 myosin, heavy chain 9, non-muscle 21 0.128NPM1 nucleophosmin (nucleolar phosphoprotein B23, numatrin) 48 0.130HLA-A major histocompatibility complex, class I, A 45 0.131RPS2 ribosomal protein S2 63 0.138Table 3 Brain NEPS GenesGene Symbol Gene Name Average Raw Tag Count Permutation ScoreNUCKS1 nuclear casein kinase and cyclin-dependent kinase substrate 1 14 0.024CDAN1 congenital dyserythropoietic anemia, type I 35 0.030PABPCP2 poly(A) binding protein, cytoplasmic, pseudogene 2 15 0.033GTF2I general transcription factor II, i 22 0.036ZFAND5 zinc finger, AN1-type domain 5 20 0.060GAPDH glyceraldehyde-3-phosphate dehydrogenase 163 0.068NCL nucleolin 13 0.083FIS1 fission 1 (mitochondrial outer membrane) homolog (S. cerevisiae) 10 0.094GRIN2C glutamate receptor, ionotropic, N-methyl D-aspartate 2C 81 0.132RPS27A ribosomal protein S27a 63 0.142COX4I1 cytochrome c oxidase subunit IV isoform 1 17 0.148CXXC5 CXXC finger 5 13 0.149Chari et al. BMC Medical Genomics 2010, 3:32http://www.biomedcentral.com/1755-8794/3/32Page 7 of 11reference genes represented on the microarray. Weobserved differences in the total number of differen-tially regulated genes, depending on the normalizationprotocol used. When MAS 5.0 + NEPS normalizationwas used, 5502 genes were identified as up-regulatedin cancer, whereas 4798 up-regulated genes were iden-tified using MAS 5.0 alone. With respect to down-regulated genes, 2543 were identified using MAS 5.0 +NEPS, whereas 3325 were identified using MAS 5.0alone (Figure 3A). According to the Canonical Path-way Analysis [Ingenuity Pathway Analysis (IPA)], weobserve slight differences in both the number and thesignificance of identified pathways between the twosets of differentially normalized microarray data (Addi-tional file 8). For example, while both datasets identifypathways such as mitochondrial dysfunction and pro-tein ubiquitination, analysis of the dataset normalizedby MAS 5.0 + NEPS identifies pathways known to beimportant in lung cancer, such as Neuregulin andJAK/Stat [35], at a higher significance relative to ana-lysis of the same dataset normalized by MAS 5.0alone (Figure 3B). Similarly, when evaluated using anadditional publicly available lung cancer microarraydataset [31], we observe slight differences between thevarious pathways identified from analysis of differen-tially expressed genes derived from a NEPS-normal-ized dataset versus a dataset not normalized using thelung-NEPS genes (Additional file 9, Additional file10). These results demonstrate that the choice ofreference genes used for data normalization can influ-ence the conclusions derived from gene expressionstudies.ConclusionsIn this study we present a methodology based uponpermutation test analysis of SAGE data, to identifyreference genes that more stringently satisfy the con-stancy requirements crucial for accurate normalizationbetween samples utilized in gene expression experi-ments. Specifically, we have identified reference genesmore effective for normalization than the traditionaland previously reported housekeeping genes for lung,breast, and brain cancer gene expression profiling.Furthermore, we strongly emphasize that referencegenes utilized for expression profiling should beselected in a tissue specific manner. Given that thismethodology utilizes sequence-based data, its utilitywill increase as data generated from new next-genera-tion sequencing platforms accumulate. The usage ofmore appropriate reference genes will have an impacton the interpretation of existing microarray data aswell as expression data generated in future studies, andpotentially will shed new insight into the molecularbiology of cancer.L u n g B ra in B re a st00 .511 .522 .533 .544 .5PermutationScoreL u n g  H o u se k e e p in g  G e n e sB re a st L u n g B ra in01234567Permutation ScoreB re a s t H o u se k e e p in g  G e n e sB ra in L u n g B re a st0123456Permutation ScoreB ra in  H o u se k e e p in g  G e n e sABCFigure 2 Tissue-specificity of reference genes. Comparison ofthe permutation scores for reference genes generated in one tissuetype with permutation scores for the same genes in the other twotissue types. (A) Performance of lung-NEPS genes in breast andbrain tissues, (B) Performance of breast-NEPS genes in lung andbrain tissues, and (C) Performance of brain-NEPS genes in lung andbreast tissues.Chari et al. BMC Medical Genomics 2010, 3:32http://www.biomedcentral.com/1755-8794/3/32Page 8 of 11Table 4 Quantitative RT-PCR analysis of lung NEPS genes and select previously identified genesGene Symbol* Cumulative Rank dCt Variance Rank NormFinder Stability Value Rank geNormM valueRankNDUFA1 11 2.011 4 0.059 2 1.141 5RPL19 14 2.252 6 0.071 4 1.140 4RAB5C 18 2.928 10 0.058 1 1.150 7RPS18 20 0.011 1 0.064 3 1.461 16TBP 24 3.846 16 0.076 7 1.097 1RPL4 27 1.523 3 0.099 12 1.253 12ATP5J 28 2.150 5 0.090 9 1.342 14HMGB1 29 2.648 8 0.093 10 1.210 11TRIM27 29 3.752 15 0.073 5 1.158 9EIF4A2 31 3.229 12 0.106 16 1.131 3CAPZB 33 4.362 18 0.100 13 1.105 2PPP1CB 33 3.453 13 0.104 14 1.143 6CUL1 34 5.164 20 0.075 6 1.156 8ACTB 37 2.731 9 0.099 11 1.638 17B2M 38 1.517 2 0.154 21 1.369 15HPRT1 39 3.611 14 0.105 15 1.173 10CSTB 41 3.026 11 0.108 17 1.259 13CD55 44 2.460 7 0.131 18 1.811 19HSPA1A 47 7.330 21 0.077 8 1.653 18GAPDH 58 4.044 17 0.145 20 2.093 21SLFN13 58 4.803 19 0.132 19 1.858 20*Genes identified in this study are boldedFigure 3 SAM and pathway analysis of a dataset normalized with and without lung NEPS genes. (A) Number of probes identified asdifferentially over and underexpressed between cancer and normal using SAM on the dataset with and without NEPS normalization. Venn diagramillustrates the overlap in the genes identified as well as those which are different between the two analyses. (B) Canonical pathway analysis usingIngenuity Pathway Analysis. Dark blue bars represent the results from the dataset normalized with MAS 5.0 + NEPS and light blue bars represent theresults from normalization using MAS 5.0 alone. The pathways which are the most significant are the most significant in both analyses. Note that keypathways such as Neuregulin signaling and JAK/Stat are identified with higher significance when normalized using the lung NEPS genes. Suchdifferences illustrate the impact of reference gene selection and normalization on differential gene expression analysis.Chari et al. BMC Medical Genomics 2010, 3:32http://www.biomedcentral.com/1755-8794/3/32Page 9 of 11Additional materialAdditional file 1: Summary of lung SAGE libraries. Raw tag countsummary for lung SAGE libraries used in analysis.Additional file 2: Summary of breast SAGE libraries. Raw tag countsummary for breast SAGE libraries used in analysis.Additional file 3: Summary of brain SAGE libraries. Raw tag countsummary for brain SAGE libraries used in analysis.Additional file 4: Taqman probe IDs of genes assessed byquantitative RT-PCR. Applied Biosystems Taqman probe IDs for genesassessed by qRT-PCR.Additional file 5: Probes used for NEPS genes represented on theAffymetrix U133A microarray. Probes for NEPS genes represented onthe Affymetrix U133A microarray.Additional file 6: Probes used for NEPS genes represented on theAgilent microarray. Probes used for NEPS genes represented on theAgilent microarray.Additional file 7: Permutation scores of previously identified andNEPS identified reference genes. Data used in scatter plot shown inFigure 1.Additional file 8: Ingenuity Pathway Analysis using genes from theanalyses of a NEPS-normalized and unnormalized dataset by Landiet al. Ingenuity Pathway Analysis using genes from the analyses of aNEPS-normalized and unnormalized dataset by Landi et al.Additional file 9: SAM and pathway analysis of an Agilent lungcancer microarray dataset normalized with and without lung NEPSgenes. SAM and pathway analysis of a dataset normalized with andwithout lung NEPS genes. (A) Number of probes identified asdifferentially over and underexpressed between cancer and normal usingSAM on the dataset with and without NEPS normalization. Venn diagramillustrates the overlap in the genes identified as well as those which aredifferent between the two analyses. (B) Canonical pathway analysis usingIngenuity Pathway Analysis. Dark blue bars represent the results from thedataset normalized with NEPS and median normalization and light bluebars represent the results from using median normalization alone. Whilesimilar pathways are statistically significant, each pathway is slightlydifferent in the degree of statistical significance. Such differencesillustrate the impact of reference gene selection and normalization ondifferential gene expression analysis.Additional file 10: Ingenuity Pathway Analysis using genes from theanalyses of a NEPS-normalized and unnormalized dataset byBoelens et al. Ingenuity Pathway Analysis using genes from the analysesof a NEPS-normalized and unnormalized dataset by Boelens et al.AcknowledgementsWe thank Drs. William W. Lockwood and Ian M. Wilson for useful discussionand editing. This work was supported by funds from Canadian Institutes ofHealth Research. RC is supported by scholarships from the Michael SmithFoundation for Health Research and Canadian Institutes of Health Research.Author details1Department of Integrative Oncology, British Columbia Cancer AgencyResearch Centre, Vancouver, BC, Canada. 2Ontario Cancer Institute/PrincessMargaret Hospital, Toronto, ON, Canada. 3Department of Computer Science,University of British Columbia, Vancouver, BC, Canada.Authors’ contributionsRC analyzed the SAGE and quantitative PCR data to identify the key findings,and wrote the manuscript. KML led the construction of all lung SAGElibraries, contributed to interpretation of data and manuscript editing. LAPperformed the quantitative PCR experiments and data interpretation. BPCand CEM provided insight into the statistical analysis as well as manuscriptediting. CQZ contributed to the experimental work. THWC contributed tothe data analysis. SL isolated the clinical samples, contributed tointerpretation of results. CEM, MST, SL, RTN and WLL are the principalinvestigators on the project. They contributed to the overall design and datainterpretation. All authors have read and approved the final manuscript.Competing interestsThe authors declare that they have no competing interests.Received: 17 February 2010 Accepted: 3 August 2010Published: 3 August 2010References1. Quackenbush J: Microarray data normalization and transformation. NatGenet 2002, 32(Suppl):496-501.2. Huggett J, Dheda K, Bustin S, Zumla A: Real-time RT-PCR normalisation;strategies and considerations. Genes Immun 2005, 6:279-284.3. Barber RD, Harmer DW, Coleman RA, Clark BJ: GAPDH as a housekeepinggene: analysis of GAPDH mRNA expression in a panel of 72 humantissues. Physiol Genomics 2005, 21:389-395.4. Bas A, Forsberg G, Hammarstrom S, Hammarstrom ML: Utility of thehousekeeping genes 18 S rRNA, beta-actin and glyceraldehyde-3-phosphate-dehydrogenase for normalization in real-time quantitativereverse transcriptase-polymerase chain reaction analysis of geneexpression in human T lymphocytes. Scand J Immunol 2004, 59:566-573.5. de Kok JB, Roelofs RW, Giesendorf BA, Pennings JL, Waas ET, Feuth T,Swinkels DW, Span PN: Normalization of gene expression measurementsin tumor tissues: comparison of 13 endogenous control genes. Lab Invest2005, 85:154-159.6. Glare EM, Divjak M, Bailey MJ, Walters EH: beta-Actin and GAPDHhousekeeping gene expression in asthmatic airways is variable and notsuitable for normalising mRNA levels. Thorax 2002, 57:765-770.7. Goidin D, Mamessier A, Staquet MJ, Schmitt D, Berthier-Vergnes O:Ribosomal 18 S RNA prevails over glyceraldehyde-3-phosphatedehydrogenase and beta-actin genes as internal standard forquantitative comparison of mRNA levels in invasive and noninvasivehuman melanoma cell subpopulations. Anal Biochem 2001, 295:17-21.8. Khimani AH, Mhashilkar AM, Mikulskis A, O’Malley M, Liao J, Golenko EE,Mayer P, Chada S, Killian JB, Lott ST: Housekeeping genes in cancer:normalization of array data. Biotechniques 2005, 38:739-745.9. Lee S, Jo M, Lee J, Koh SS, Kim S: Identification of novel universalhousekeeping genes by statistical analysis of microarray data. J BiochemMol Biol 2007, 40:226-231.10. Rubie C, Kempf K, Hans J, Su T, Tilton B, Georg T, Brittner B, Ludwig B,Schilling M: Housekeeping gene variability in normal and cancerouscolorectal, pancreatic, esophageal, gastric and hepatic tissues. Mol CellProbes 2005, 19:101-109.11. Saviozzi S, Cordero F, Lo Iacono M, Novello S, Scagliotti GV, Calogero RA:Selection of suitable reference genes for accurate normalization of geneexpression profile studies in non-small cell lung cancer. BMC Cancer2006, 6:200.12. Steele BK, Meyers C, Ozbun MA: Variable expression of some“housekeeping” genes during human keratinocyte differentiation. AnalBiochem 2002, 307:341-347.13. Szabo A, Perou CM, Karaca M, Perreard L, Quackenbush JF, Bernard PS:Statistical modeling for selecting housekeeper genes. Genome Biol 2004,5:R59.14. Vandesompele J, De Preter K, Pattyn F, Poppe B, Van Roy N, De Paepe A,Speleman F: Accurate normalization of real-time quantitative RT-PCRdata by geometric averaging of multiple internal control genes. GenomeBiol 2002, 3:RESEARCH0034.15. Zhang X, Ding L, Sandford AJ: Selection of reference genes for geneexpression studies in human neutrophils by real-time PCR. BMC Mol Biol2005, 6:4.16. Zhong H, Simons JW: Direct comparison of GAPDH, beta-actin,cyclophilin, and 28 S rRNA as internal standards for quantifying RNAlevels under hypoxia. Biochem Biophys Res Commun 1999, 259:523-526.17. Eickhoff B, Korn B, Schick M, Poustka A, van der Bosch J: Normalization ofarray hybridization experiments in differential gene expression analysis.Nucleic Acids Res 1999, 27:e33.18. Zhu LJ, Altmann SW: mRNA and 18S-RNA coapplication-reversetranscription for quantitative gene expression analysis. Anal Biochem2005, 345:102-109.Chari et al. BMC Medical Genomics 2010, 3:32http://www.biomedcentral.com/1755-8794/3/32Page 10 of 1119. Jin P, Zhao Y, Ngalame Y, Panelli MC, Nagorsen D, Monsurro V, Smith K,Hu N, Su H, Taylor PR, et al: Selection and validation of endogenousreference genes using a high throughput approach. BMC Genomics 2004,5:55.20. Kwon MJ, Oh E, Lee S, Roh MR, Kim SE, Lee Y, Choi YL, In YH, Park T,Koh SS, Shin YK: Identification of novel reference genes usingmultiplatform expression data and their validation for quantitative geneexpression analysis. PLoS One 2009, 4:e6162.21. de Jonge HJ, Fehrmann RS, de Bont ES, Hofstra RM, Gerbens F, Kamps WA,de Vries EG, van der Zee AG, te Meerman GJ, ter Elst A: Evidence basedselection of housekeeping genes. PLoS One 2007, 2:e898.22. Velculescu VE, Zhang L, Vogelstein B, Kinzler KW: Serial analysis of geneexpression. Science 1995, 270:484-487.23. Good P: Permutation Tests: A Practical Guide to Resampling Methods forTesting Hypotheses Springer-Verlag New York, Inc, 2 2000.24. Lonergan KM, Chari R, Coe BP, Wilson IM, Tsao MS, Ng RT, Macaulay C,Lam S, Lam WL: Transcriptome profiles of carcinoma-in-situ and invasivenon-small cell lung cancer as revealed by SAGE. PLoS One 2010, 5:e9162.25. Lonergan KM, Chari R, Deleeuw RJ, Shadeo A, Chi B, Tsao MS, Jones S,Marra M, Ling V, Ng R, et al: Identification of novel lung genes inbronchial epithelium by serial analysis of gene expression. Am J RespirCell Mol Biol 2006, 35:651-661.26. Chari R, Lonergan KM, Ng RT, MacAulay C, Lam WL, Lam S: Effect of activesmoking on the human bronchial epithelium transcriptome. BMCGenomics 2007, 8:297.27. Boon K, Osorio EC, Greenhut SF, Schaefer CF, Shoemaker J, Polyak K,Morin PJ, Buetow KH, Strausberg RL, De Souza SJ, Riggins GJ: An anatomyof normal and malignant gene expression. Proc Natl Acad Sci USA 2002,99:11287-11292.28. Riggins GJ, Strausberg RL: Genome and genetic resources from theCancer Genome Anatomy Project. Hum Mol Genet 2001, 10:663-667.29. Andersen CL, Jensen JL, Orntoft TF: Normalization of real-timequantitative reverse transcription-PCR data: a model-based varianceestimation approach to identify genes suited for normalization, appliedto bladder and colon cancer data sets. Cancer Res 2004, 64:5245-5250.30. Landi MT, Dracheva T, Rotunno M, Figueroa JD, Liu H, Dasgupta A,Mann FE, Fukuoka J, Hames M, Bergen AW, et al: Gene expressionsignature of cigarette smoking and its role in lung adenocarcinomadevelopment and survival. PLoS One 2008, 3:e1651.31. Boelens MC, van den Berg A, Fehrmann RS, Geerlings M, de Jong WK, teMeerman GJ, Sietsma H, Timens W, Postma DS, Groen HJ: Currentsmoking-specific gene expression signature in normal bronchialepithelium is enhanced in squamous cell lung cancer. J Pathol 2009,218:182-191.32. Gautier L, Cope L, Bolstad BM, Irizarry RA: affy–analysis of AffymetrixGeneChip data at the probe level. Bioinformatics 2004, 20:307-315.33. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B,Gautier L, Ge Y, Gentry J, et al: Bioconductor: open software developmentfor computational biology and bioinformatics. Genome Biol 2004, 5:R80.34. Tusher VG, Tibshirani R, Chu G: Significance analysis of microarraysapplied to the ionizing radiation response. Proc Natl Acad Sci USA 2001,98:5116-5121.35. Liu J, Kern JA: Neuregulin-1 activates the JAK-STAT pathway andregulates lung epithelial cell proliferation. Am J Respir Cell Mol Biol 2002,27:306-313.Pre-publication historyThe pre-publication history for this paper can be accessed here:http://www.biomedcentral.com/1755-8794/3/32/prepubdoi:10.1186/1755-8794-3-32Cite this article as: Chari et al.: A sequence-based approach to identifyreference genes for gene expression analysis. BMC Medical Genomics2010 3:32.Submit your next manuscript to BioMed Centraland take full advantage of: • Convenient online submission• Thorough peer review• No space constraints or color figure charges• Immediate publication on acceptance• Inclusion in PubMed, CAS, Scopus and Google Scholar• Research which is freely available for redistributionSubmit your manuscript at www.biomedcentral.com/submitChari et al. BMC Medical Genomics 2010, 3:32http://www.biomedcentral.com/1755-8794/3/32Page 11 of 11


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items