BioMed Central
BMC Bioinformatics
Open Access

Methodology article

A stepwise framework for the normalization of array CGH data

Mehrnoush Khojasteh*1,2, Wan L Lam1, Rabab K Ward2 and Calum MacAulay1

Address: 1British Columbia Cancer Research Centre, Vancouver, BC, Canada and 2Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, BC, Canada

Email: Mehrnoush Khojasteh* - mkhojast@bccrc.ca; Wan L Lam - wanlam@bccrc.ca; Rabab K Ward - rababw@ece.ubc.ca; Calum MacAulay - cmacaula@bccrc.ca
* Corresponding author

Abstract

Background: In two-channel competitive genomic hybridization microarray experiments, the ratio of the two fluorescent signal intensities at each spot on the microarray is commonly used to infer the relative amounts of test and reference sample DNA. This ratio may be influenced by systematic measurement effects from non-biological sources that can introduce biases in the estimated ratios. These biases should be removed before drawing conclusions about the relative levels of DNA. The performance of existing gene expression microarray normalization strategies has not been evaluated for removing systematic biases encountered in array-based comparative genomic hybridization (CGH), which aims to detect single copy gains and losses, typically in samples with heterogeneous cell populations that produce only slight shifts in signal ratios. The purpose of this work is to establish a framework for correcting the systematic sources of variation in high density CGH array images, while maintaining the true biological variations.

Results: After an investigation of the systematic variations in the data from two array CGH platforms, SMRT (Sub Mega base Resolution Tiling) BAC arrays and cDNA arrays of Pollack et al., we have developed a stepwise normalization framework integrating novel and existing normalization methods in order to reduce intensity, spatial, plate and background biases. We used stringent measures to quantify the performance of this stepwise normalization using data derived from 5 sets of experiments representing self-self hybridizations, replicated experiments, detection of single copy changes, array CGH experiments which mimic cell population heterogeneity, and array CGH experiments simulating different levels of gene amplifications and deletions. Our results demonstrate that the three-step normalization procedure provides significant improvement in the sensitivity of detection of single copy changes compared to conventional single step normalization approaches in both SMRT BAC array and cDNA array platforms.

Conclusion: The proposed stepwise normalization framework preserves the minute copy number changes while removing the observed systematic biases.

Published: 18 November 2005
BMC Bioinformatics 2005, 6:274 doi:10.1186/1471-2105-6-274
Received: 20 June 2005
Accepted: 18 November 2005
This article is available from: http://www.biomedcentral.com/1471-2105/6/274
© 2005 Khojasteh et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

Microarray-based Comparative Genomic Hybridization (array CGH) is used to detect aberrations in segmental copy numbers at chromosomal loci represented by DNA clones with known genomic locations [1]. CGH microarrays typically contain tens of thousands of spotted DNA sequences such as those derived from bacterial artificial chromosomes (BACs). Sample DNA from a test and a reference genome are labelled with different fluorescent dyes (usually Cyanine-3 and Cyanine-5 dyes) and then hybridized to the genomic microarray.
The fluorescent signal intensity of each spot on the microarray serves as a relative measure of the amount of sample DNA bound to the DNA sequence of that spot. The ratio between the Cyanine-3 and the Cyanine-5 intensity of each spot reflects the relative quantities of the test and reference DNA samples.

The ratio of the two fluorescent signals at each spot is commonly used to detect copy number alteration. However, the ratios of the fluorescent signals are usually influenced by systematic effects from non-biological sources that can introduce biases in the estimates of these ratios. Such biases should be removed in order to draw conclusions on copy number status. The process of correcting for the systematic effects is often referred to as normalization.

Array CGH technology generally has more stringent performance requirements than gene expression microarray analysis. It must detect single DNA copy number changes in abnormal cells, typically within tumor samples. Detection sensitivity is complicated by the heterogeneous nature of tumor tissue, with varying degrees of contamination from non-cancer cells. Because material availability is limited, performing replicate experiments is not always possible or desired.

In the context of developing a normalization protocol for array CGH, knowing the copy number status of DNA segments provides true values for calibration. Because the normal state of human cells is diploid, the same copy number is expected in different normal samples. In contrast, gene expression level varies continuously for each gene, and the expression level of the same gene is not expected to be identical in two different samples.

While a 2 fold change in signal may not represent a significant alteration in gene expression microarray analyses [3], for CGH arrays a single copy gain compared to normal diploid DNA will result in a ratio of only 3:2. A single copy loss would reduce the signal ratio to 1:2. Considering the contamination of tumor (abnormal) cells with non-cancer (normal) cells, the shift of the observed ratio away from 1:1 may be even smaller. So the challenge in normalization is to preserve the true copy number change signals while removing the systematic variations.

The purpose of this work is to correct for the systematic sources of variation while maintaining true biological variations as small as a single copy number change in a sample of a heterogeneous cell population.

After an investigation of the systematic variations in the data from array CGH experiments, we tested existing normalization methods commonly used for gene expression data in order to deduce a stepwise normalization framework tailored to handling high density array CGH data. Here we demonstrate the efficacy of the stepwise normalization scheme through several quantitative characteristics of the data from several functional types of array CGH.

Results and discussion

Materials

Data from five sets of experiments were used in the development of the normalization strategy (Table 1). The first four datasets were generated from array CGH experiments performed using the SMRT (Sub Mega base Resolution Tiling) arrays. These arrays are tiling resolution BAC arrays with complete coverage of the human genome using 32,433 fingerprint-verified individually amplified BAC clones [4]. The experimental procedures for array CGH and generating spot images have been described previously [4]. The entire set of 32,433 solutions was spotted in triplicate onto two slides by a 4 × 12 pin arrayer. For the purpose of this study, only the data from the first of the two arrays were used.

Table 1: Data description. In this table, the array data of this study are summarized.

Array | Reference DNA | Sample DNA | Data type for normalization performance evaluation | Evaluation method
MM-1 to MM-4 | Male genomic | Male genomic | Self-self hybridizations | S.d. for each array
H526-1 to H526-8 | H526 cell line | Male genomic | Replicate H526 cell line experiments | 1. Correlation coefficient for each pair of arrays; 2. ICC; 3. S.d. for each spot
MF-1 and MF-2 | Female genomic | Male genomic | Single copy change | T-test
T1 to T5 | Female genomic | Male/Female mixture (see Additional file 1) | Single copy loss with normal cell contamination | T-test
T6 to T10 | Male genomic | Male/Female mixture (see Additional file 1) | Single copy gain with normal cell contamination | T-test
X1 to X5 | Female DNA | Cell lines containing varying numbers of X chromosomes (see Additional file 2) | Varying levels of gene amplification and deletion for each of the X-chromosomal genes | T-test

The fifth dataset is a public dataset downloaded from the Stanford Microarray Database http://smd.stanford.edu. This dataset was generated from array CGH experiments performed using human cDNA microarrays [12].

The first dataset (self-self hybridization data) was derived from hybridization of the same DNA sample, i.e., normal male genomic DNA was used for both test and reference materials but labelled with different dyes. The four microarrays used in this CGH experiment are referred to as MM-1 to MM-4 in the following text.

The second dataset (hybridization data from replicate experiments) was derived from comparison of a tumor cell DNA sample with well characterized chromosomal aberrations (lung cancer cell line H526) [4] against normal male DNA. The 8 arrays used in this experiment are denoted H526-1 through H526-8.

The third dataset (hybridization data from male and female DNA mimicking single copy deletion) was derived from comparison of normal male DNA versus normal female DNA, using arrays named MF-1 and MF-2.

The fourth dataset (hybridization data from samples mimicking heterogeneous cell populations) was derived from a series of array CGH experiments in which the samples to be compared were mixtures of male and female DNA, which alter X chromosome dosage and so mimic tumor samples with varying levels of normal cell contamination. Precise proportions of DNA were mixed to simulate increasing levels of heterogeneity as previously described [5]. Arrays T1 through T5 compared male DNA against female DNA, generating a 1:2 ratio for X chromosome sites that mimics a single copy deletion. Contamination from normal cells was then simulated by spiking varying amounts of female DNA into the male DNA sample. Arrays T6 through T10 compared a 50/50 mixture of male and female DNA against a male DNA reference, generating a 3:2 ratio for X chromosome sites that mimics single copy amplifications. Contamination from normal cells was simulated by spiking varying amounts of female DNA into the male/female DNA mixture.

Figure 1: A smoothed M-XY plot illustrating spatial bias. The plot displays the log2 ratios according to the corresponding spot location on the microarray; the plot is smoothed with a moving median filter.

The fifth dataset was derived from hybridization of genomic DNAs from cell lines containing varying numbers of X chromosomes to simulate varying levels of gene amplification and deletion for each of the X-chromosomal genes present in the cDNA array [12].
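As a worked illustration of the ratios these designs are meant to produce, the expected log2 ratios at X-chromosome loci can be written out explicitly. This is a sketch under idealized assumptions (no bias, no noise, no added contamination); the copy numbers are taken from the dataset descriptions above and from the 1X to 5X versus 2X comparisons labelled for arrays X1 to X5 in Figure 6. Autosomal loci are expected at log2(1) = 0 in every design.

```latex
\begin{align*}
\text{self-self (MM-1 to MM-4):}\quad & \log_2\tfrac{1}{1} = 0 \\
\text{male vs. female (MF, T1):}\quad & \log_2\tfrac{1}{2} = -1 \\
\text{50/50 male/female mixture vs. male (T6):}\quad & \log_2\tfrac{1.5}{1} \approx 0.585 \\
\text{$n$ X chromosomes vs. 2 (X1 to X5):}\quad & \log_2\tfrac{n}{2} \approx -1,\ 0,\ 0.585,\ 1,\ 1.322
  \quad\text{for } n = 1,\dots,5
\end{align*}
```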
The five experiments comprising the fifth data set are denoted X1 through X5.

Systematic variations

After a thorough investigation of the systematic variations in the data from our array CGH experiments, four kinds of bias were identified. Below we explain each bias type.

Intensity bias

This bias is evident in the frequently used M-A plots, which are plots of the log ratio M = log2(Ir/Ig) = log2(Ir) - log2(Ig) against the mean of the log intensities A = 1/2(log2(Ir) + log2(Ig)), where Ir and Ig are the intensities of the cyanine-5 and cyanine-3 channels respectively. In our data, this bias predominantly appears as curvature in the low intensity end of the M-A plot.

Spatial bias

Plotting the log ratios according to the corresponding spot location on the microarray is another way to reveal spatially variable bias. We refer to this plot as the M-XY plot. The spatially smoothed M-XY plot reveals the general trend of log ratios against their locations on the array (Fig. 1). For genomic loci distributed randomly across an array, this plot should be a flat plane.

Spatial heterogeneity was thought to be caused by the different print tips used in printing the targets on the arrays [6]. However, our data show that the spatial heterogeneity is not caused by print tip effects, because the spatial patterns are not organized in a block wise fashion (as they would be if the bias were introduced by specific print tips). In fact, the patterns appear as a continuous function across the entire array.

Plate bias

This is a spatial pattern that can be seen in the data after the spatial gradient has been removed by the spatial normalization step described below. This pattern is repeated in all subgrids in the M-XY plot and corresponds to the plate groups (groups of spots on the microarray that are all printed from the same microplate).

Plate bias is evident when box-plots of log2 ratios from each plate group are compared. These box plots show a systematic difference among the log2 ratios of the different plate groups. The median log2 ratio of each plate group is expected to be near zero, i.e. positive and negative deviations should cancel out in each plate group, unless the copy numbers of the clones in a plate biologically differ between the test and the control samples. We do not believe this is the case in our experiments.

This bias arises because clones that are produced in different microplates may have experienced slightly different physical conditions during the polymerase chain reaction (PCR) or in subsequent purification steps [7]. This variation in the efficiency of spot solution synthesis appears to affect different plate groups, resulting in a plate level bias.

Background bias

The measured intensity for each microarray spot contains a contribution from the background fluorescence within the spot. This introduces a bias in the ratios of the spots' intensities. In the M-A plot this bias appears as deviation from zero in the log2 ratios of the lower intensity spots.

Methods of bias removal

In order to remove these types of biases, we evaluated the following stepwise normalization procedure:

1. The spatial trend is estimated by computing, for each spot on the array, the median of log2 ratios for the spots within a spatial neighbourhood window of size 11 rows by 11 columns centred on that spot. The spatial bias that is estimated for each spot in this way is then subtracted from the log2 ratio of that spot. This step is referred to as "Spatial" normalization.

2. The plate bias is removed by calculating the median of the log2 ratios for all spots in the same plate group and subtracting it from the log2 ratios for all those spots. This step is referred to as "Plate" normalization.

3. The intensity bias is estimated using robust LOWESS curve fitting [8]. The bias is assumed to be multiplicative on the ratio scale, and therefore additive on the log scale, so the estimated bias is subtracted from the log ratios. This step is denoted as "Intensity LOWESS" normalization.

4. To remove the background bias, one of the following two approaches is usually taken: either the estimate of the background intensity is subtracted from the estimated foreground intensity of each spot before taking the ratios, or it is not subtracted. In the latter case, the introduced bias is dealt with by treating it as intensity dependent bias. We evaluated both of these approaches in our experiments (see below).

Below we show that the above stepwise procedure is effective in removing the mentioned types of systematic variations. We demonstrate the efficacy of our procedure by comparing several quantitative characteristics of data normalized by our proposed strategy to those of non-normalized data and data normalized by other techniques listed in Table 2.

Table 2: Summary of normalization methods. Each of the normalization methods in this table will be denoted by its number throughout the text. For full description of methods refer to "Methods of bias removal" section in Results and Discussion and the "Normalization methods" section in Methods.

Method no. | Normalization method | Description
Background subtracted
1 | No normalization | Raw ratios
Global method
2 | Global median Ratio | Ratios scaled by their median
Intensity dependent methods
3 | Intensity LOWESS, 10% span | Global Intensity LOWESS, span = 10%
4 | Intensity LOWESS, 25% span | Global Intensity LOWESS, span = 25%
5 | Intensity LOWESS, 40% span | Global Intensity LOWESS, span = 40%
Spatial methods
6 | Print tip mean Ratio | Ratios of each print-tip group scaled by the mean ratio of that group
7 | Spatial | Median of log2 ratios for the spots within a spatial neighbourhood window of size 11 rows by 11 columns centred on that spot
8 | Spatial + Median Plate Ratio | Method 7 followed by plate normalization
Combined intensity dependent and spatial methods
9 | Print Tip Intensity LOWESS, span = 40% | LOWESS performed on the ratios from each print-tip group
10 | Intensity LOWESS + Spatial | Stepwise Methods 4 and 7
Three step normalization
11 | Intensity LOWESS + Spatial + Median Plate Ratio | Stepwise Methods 4 and 8
12 | Spatial + Median Plate Ratio + Intensity LOWESS | Stepwise Methods 8 and 4
Background not subtracted
13 | No normalization | See Method 1, but without background subtraction
Global method
14 | Global median Ratio | See Method 2, but without background subtraction
Intensity dependent methods
15 | Global Intensity LOWESS, span = 10% | See Method 3, but without background subtraction
Spatial methods
16 | Print tip Mean Ratio | See Method 6, but without background subtraction
17 | Spatial | See Method 7, but without background subtraction
Combined intensity dependent and spatial methods
18 | Intensity LOWESS + Spatial | See Method 10, but without background subtraction
Three step normalization
19 | Intensity LOWESS + Spatial + Median Plate Ratio | See Method 11, but without background subtraction

Normalization of self-self array CGH data

The self-self experiments (arrays MM-1 through MM-4) were used to study the effect of normalization on removing the bias from the data and increasing the accuracy of the measurements. The 19 methods of normalization listed in Table 2 were evaluated on the data obtained from these arrays.

Since the same male genomic DNA serves as both sample and reference DNA, the copy numbers detected in both the Cyanine-3 and Cyanine-5 channels are expected to be the same at all loci, resulting in a theoretical value of zero for the log2 ratio of intensities at all spots on the array. The effects of normalization on removing the bias were examined by calculating the standard deviation (s.d.) of the log2 ratios for each array in the experiment, for each of the 19 methods listed in Table 2. All 19 standard deviations were then scaled against the standard deviation of the raw ratios before normalization (i.e. against the s.d. value from the first method of Table 2). For each normalization method, the scaled s.d. values were then averaged across the four arrays. Figure 2 shows these average standard deviations.

The three different window sizes of 10%, 25% and 40% of the data points used for LOWESS intensity normalization (methods 3-5 in Table 2) did not have a significant effect on the effectiveness of normalization.

Among the 12 normalization methods that are performed on the ratios of background subtracted intensities, the stepwise strategy (method 12) results in the lowest s.d. for all four arrays. Also, among the 7 normalization methods that are performed on the ratios of non-background subtracted intensities, the stepwise strategy (method 19) results in the smallest s.d.

When the proposed three-step normalization is performed on the ratios of non-background subtracted intensities, it yields better performance, in terms of reducing the s.d. of log2 ratios, than when it is applied to the ratios of background-subtracted intensities.

To further explore the effect of the background intensities, the standard deviations were recalculated for these four arrays with the lowest intensity spots removed from each data set. The difference between the s.d. of the ratios after normalization for the case of background subtracted and the case of non-background subtracted intensities became smaller on the reduced datasets. As an example, the new s.d. values when the 10% lowest intensity spots are removed are plotted in Fig. 2. This suggests that subtracting background increases the variability of the ratios of lower intensity spots, while the variability of higher intensity spots is not affected much by subtracting or not subtracting the background.

Normalization of hybridization data from replicate experiments

In order to see how normalization affects the consistency of the data from replicate experiments, 8 replicate experiments were performed. H526-1 through H526-8 represent independent array CGH experiments using the same source of sample DNA (isolated from the well studied lung cancer cell line H526).

The standard deviations of the log2 ratios of the same spot across the 8 replicate arrays were calculated and averaged across all the spots for each normalization method. The results are shown in Fig. 3A. The standard deviation measure attains its smallest value after method 12 or 19 is performed on the data. When the three-step normalization is performed on the ratios of non-background subtracted data (method 19), its performance is slightly better than when it is performed on ratios of background subtracted intensities (method 12).

The Pearson's Correlation Coefficient [9] was calculated for the data from each pair of the replicate arrays, with 28 possible pairings. The average of the 28 correlation coefficients for each single method was then calculated (Fig. 3B).

The Intraclass Correlation Coefficient (ICC) [9] was calculated for the set of data obtained from the 8 replicate arrays normalized using each of the methods described above. The results are also summarized in Fig. 3B. The ICC and Pearson correlation coefficient show similar results across the methods. Both ICC and correlation coefficient attain their highest values after the three-step normalization method. This applies to both the ratios of non-background subtracted intensities and the ratios of background subtracted intensities.
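The three replicate-consistency measures used in this section can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the text does not say which form of the ICC was used, so the one-way random-effects form, ICC(1,1), is an assumption here, and the function names and toy values are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_pairwise_pearson(arrays):
    """Average Pearson r over all pairs of arrays (28 pairs for 8 arrays)."""
    return mean(pearson(a, b) for a, b in combinations(arrays, 2))


def mean_spot_sd(arrays):
    """Average, over spots, of the s.d. of each spot across the arrays."""
    return mean(stdev(spot) for spot in zip(*arrays))


def icc_oneway(arrays):
    """One-way random-effects ICC(1,1) (assumed form): spots are the
    targets, replicate arrays are the repeated measurements."""
    k = len(arrays)
    spots = list(zip(*arrays))
    n = len(spots)
    spot_means = [mean(s) for s in spots]
    grand = mean(spot_means)
    ms_between = k * sum((sm - grand) ** 2 for sm in spot_means) / (n - 1)
    ms_within = sum(
        (v - sm) ** 2 for s, sm in zip(spots, spot_means) for v in s
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical toy data: 3 replicate arrays of log2 ratios, 4 spots each.
replicates = [
    [0.00, 0.55, -0.95, 0.10],
    [0.05, 0.60, -1.00, 0.05],
    [-0.05, 0.65, -1.05, 0.15],
]
print(mean_spot_sd(replicates), mean_pairwise_pearson(replicates), icc_oneway(replicates))
```

Each input row is one array's log2 ratios in a fixed spot order, so the same functions apply unchanged to the output of any of the normalization methods.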
ICC and correlation coefficient are slightly higher when background subtraction is not performed on the spot intensity measures.

Figure 2: Normalization of self-self hybridization data. Relative standard deviation (s.d.) of log2 ratios averaged across arrays MM-1 through MM-4 using all data points is shown in blue. The repeated analysis of relative s.d. after removal of the weakest 10% of spots is shown in red. The numbers on the horizontal axis refer to the methods used for normalization listed in Table 2. [Plot: horizontal axis, method number (1 to 19); vertical axis, relative s.d. averaged across 4 arrays (0% to 120%); series: complete data set, weakest 10% of spots removed.]

Figure 3: Normalization of hybridization data from replicate experiments. 8 replicate array CGH experiments were done comparing sample DNA from H526 cell line and the reference normal male genomic DNA. A. Average of the standard deviations of log2 ratios for the same spot across 8 replicate arrays. B. ICC and average correlation coefficient of replicate arrays. Horizontal axis represents the method number listed in Table 2. [Plot A: horizontal axis, method number (1 to 19); vertical axis, average of s.d. values across 8 replicate slides (0 to 0.8). Plot B: horizontal axis, method number (1 to 19); vertical axis, -20% to 80%; series: ICC, average Pearson correlation coefficient.]

Normalization of hybridization data from male and female DNA

To evaluate the effect of normalization on improving detection of single copy loss, two array CGH experiments were conducted comparing male (XY) genomic DNA against female (XX) genomic DNA. The copy numbers of autosomal loci (clones on chromosomes 1 through 22) are equal, while the X loci exhibit a 1:2 ratio, simulating a single copy loss.

The normalization methods described above were applied to the data obtained from these two experiments. To determine which method results in the best separation of clones with normal copy number from those with a single copy loss, a two-sample two-tailed T-test was performed on the data from each array normalized by each method. The T-test evaluates the difference between the means of two groups of log ratios. The first group consists of log ratios for clones from chromosomes 1 through 22 and the second group consists of log ratios for clones from chromosome X. The value of the T statistic is shown in Fig. 4 for both arrays and for each normalization method. A larger value for the T-statistic indicates better separation between the means of the two samples.

For the data from array MF-1, the largest T-statistic was obtained after our three-step normalization procedure was performed on the ratios of background subtracted intensities. For this array, the normalization methods performed on the ratios of the non-background subtracted intensities were not as effective.

For the MF-2 array data, the normalization methods do not significantly change the value of the T-statistic. The three-step normalization performed on the ratio of non-background subtracted intensities slightly increases the T-statistic. In fact, the correlation coefficient of the log ratios and the estimated intensity bias, and the correlation coefficient of the log ratios and the estimated spatial bias, were both quite low for this array compared to the other arrays (below 15%). Also, the background intensities for this array were quite low compared to the other arrays.
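The separation measure used here can be sketched as follows. The text specifies a two-sample two-tailed T-test but not the variance assumption, so the pooled-variance (Student) form below is an assumption; only the T statistic is computed, since that is the quantity reported in Fig. 4. The function names, chromosome labels and toy values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

# Chromosome labels "1" .. "22" are treated as autosomal (assumed labelling).
AUTOSOMES = {str(n) for n in range(1, 23)}


def t_statistic(group_a, group_b):
    """Pooled-variance two-sample t statistic for mean(a) - mean(b)."""
    na, nb = len(group_a), len(group_b)
    pooled = (
        (na - 1) * variance(group_a) + (nb - 1) * variance(group_b)
    ) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled * (1 / na + 1 / nb))


def autosome_vs_x(log_ratios, chromosomes):
    """T statistic separating autosomal clones from chromosome X clones."""
    autosomal = [m for m, c in zip(log_ratios, chromosomes) if c in AUTOSOMES]
    x_linked = [m for m, c in zip(log_ratios, chromosomes) if c == "X"]
    return t_statistic(autosomal, x_linked)


# Hypothetical toy array: five autosomal clones near 0, three X clones near -1.
log_ratios = [0.1, -0.1, 0.0, 0.05, -0.05, -0.9, -1.1, -1.0]
chromosomes = ["1", "2", "3", "7", "22", "X", "X", "X"]
print(autosome_vs_x(log_ratios, chromosomes))
```

Running the same function on the log ratios produced by each normalization method gives one T value per method and array, which is the comparison plotted in Fig. 4.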
The lack of significant change in the T-statistic values after normalization therefore suggests that the data from this particular array (MF-2) did not have significant bias.

Figure 4: Normalization of hybridization data from male and female DNA. For each of arrays MF-1 and MF-2, a T-test was performed on the two groups of log ratios, i.e. log ratios for the autosomal clones and those for the X chromosome clones. Values of the T-statistic after each normalization method are shown. Horizontal axis represents the method number listed in Table 2. [Plot: horizontal axis, method number (1 to 19); vertical axis, T-statistic values (0 to 120); series: MF-1, MF-2.]

Normalization of hybridization data from samples mimicking heterogeneous cell populations and single copy alterations

Array CGH is often used to detect genetic alterations in tumor cells. However, tumours generally consist of heterogeneous cell populations including a variety of infiltrating non-cancerous cells. Contamination from normal cells may affect the ability to detect copy number aberrations. In the case of a single copy gain, contamination from diploid normal cells dampens the expected 3:2 signal ratio produced by the single copy gained sequences in the tumour cells, due to the averaging effect in the mixed cell population. In the case of a single copy loss, normal cell contamination increases the average copy number, deviating from the expected 1:2 ratio. In a previous study, this effect on detection sensitivity was evaluated by mixing male (XY) and female (XX) DNA in precise proportions to mimic 0%, 15%, 30%, 50% and 75% normal cell contamination affecting the dosage of the X chromosome [5].

In this study, we wish to determine how our three-step normalization method affects the estimated log2 ratios for the clones with single copy number changes and increasing levels of heterogeneity. The stepwise normalization method was applied to the data from the titration series (arrays T1-T10) that simulated different contamination levels for both single copy gains and losses (Fig. 5).

We compared the data obtained after performing the three-step normalization procedure to data obtained after performing global median normalization on both the ratios of background subtracted intensities and the ratios of non-background subtracted intensities. For each array, a T-test was performed on the two groups of log ratios, i.e. log ratios for the autosomal clones and those for the X chromosome clones. T-values are shown in Fig. 5.

The T-statistic values are higher after normalization in all cases, which assures us that the separation of the two groups is increased and the low-level copy number changes are preserved and even magnified. Comparing the T-statistic values for data with no normalization to the normalized results shows that normalization increases the sensitivity of detection of the single copy number changes, as measured by the T-statistic, by up to a factor of 5. However, the T-statistic values are considerably lower for the ratios of non-background subtracted intensities as compared to the ratios of background subtracted intensities.

Functional normalization increases the separation between the distributions of the clones with normal and abnormal copy numbers, and this facilitates the analysis of heterogeneous samples.
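The dampening effect of normal cell contamination described in this section can be made concrete with a small calculation. The sketch below assumes a simple linear mixing model against a diploid reference, in which a fraction of the cells carries the normal copy number; this model and the function name are illustrative assumptions, and the actual DNA proportions used for arrays T1-T10 are those given in Additional file 1.

```python
from math import log2


def expected_log2_ratio(tumor_copies, contamination, normal_copies=2):
    """Expected log2 ratio against a normal reference when a fraction
    `contamination` of the sample consists of normal cells
    (linear mixing model, assumed)."""
    if not 0.0 <= contamination <= 1.0:
        raise ValueError("contamination must lie between 0 and 1")
    mixed = (1 - contamination) * tumor_copies + contamination * normal_copies
    return log2(mixed / normal_copies)


# Contamination levels mimicked in the titration series.
for level in (0.0, 0.15, 0.30, 0.50, 0.75):
    gain = expected_log2_ratio(3, level)  # single copy gain, 3:2 when pure
    loss = expected_log2_ratio(1, level)  # single copy loss, 1:2 when pure
    print(f"{level:4.0%}  gain {gain:+.3f}  loss {loss:+.3f}")
```

Under this model both shifts shrink toward zero as contamination rises, which is why residual systematic bias of even a few tenths of a log2 unit can mask a single copy change in a heterogeneous sample.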
For example, after normalization, the T-statistic for array T9, which simulates a single copy amplification with 50% contamination, becomes quite close to the T-statistic of array T6, which simulates a case with no contamination.

Figure 5. Normalization of hybridization data from samples mimicking heterogeneous cell populations and single copy alterations. Array CGH data were generated for samples mimicking single copy loss (deletion) or single copy gain (amplification) with contamination of increasing proportion of reference DNA, indicated as percentage on the horizontal axis. The experimental procedure for the array CGH experiments was previously described [5]. Global median normalization (method 1), stepwise normalization (method 12), global median normalization with background subtraction (method 13), and 3 step normalization with background subtraction (method 19) were applied. T-statistic values computed before and after normalization for arrays T1-T10 are summarized.
[Bar chart. Vertical axis: T-statistic. Horizontal axis: contamination (0%, 15%, 30%, 50%, 75%), shown separately for amplification and deletion. Series: global median normalization and three step normalization, each with and without background subtraction.]

Normalization of hybridization data from cDNA arrays simulating varying levels of gene amplification and deletion for X-chromosomal genes on the array
To evaluate the performance of the stepwise normalization strategy on hybridization data from cDNA arrays, we used public data from hybridization of genomic DNAs from cell lines containing varying numbers of X chromosomes that simulate varying levels of gene amplification and deletion for each of the X-chromosomal genes present on the array (arrays X1 to X5).

We compared the data obtained after performing the three-step normalization procedure to data obtained after performing global median normalization on both the ratios of background subtracted intensities and the ratios of non-background subtracted intensities. For each array, a T-test was performed on the two groups of log ratios, i.e. log ratios for the autosomal clones and those for the X chromosome clones. The T-statistic values are shown in Fig. 6.

The T-statistic values are higher after normalization in all cases. The increase in the T-statistic values may be interpreted as an increase in the separation of the distributions of the log2 ratios of the two groups of normal and altered genes.

Figure 6. Normalization of hybridization data from cDNA arrays. Array CGH data were generated for samples simulating varying levels of gene amplification and deletion for X-chromosomal genes on the array. Global median normalization (method 1), stepwise normalization (method 12), global median normalization with background subtraction (method 13), and 3 step normalization with background subtraction (method 19) were applied. T-statistic values computed before and after normalization for arrays X1-X5 are summarized.
[Bar chart. Vertical axis: T-statistic. Horizontal axis: array, X5 (5X vs. 2X), X4 (4X vs. 2X), X3 (3X vs. 2X), X2 (2X vs. 2X), X1 (1X vs. 2X). Series: global median normalization and three step normalization, each with and without background subtraction.]

Other considerations
Visual comparison of the genomic profiles
The use of the genomic location of the clones allows us to compare profiles before and after normalization and to use the visual correlation between observed and expected profiles as a measure of success. (This is not possible when analyzing gene expression array data.)

In Figures 7A and 7B, chromosome plots of the data from two of the replicate H526 arrays, generated by SeeGH software [10], are shown. Chromosome plots show the log2 of ratios for each of the target DNA clones, as a function of the location of the clone in the chromosome. Figure 7A shows the chromosome plots for chromosome 1 of arrays H526-1 and H526-5. Figure 7B shows the chromosome plots for chromosome 2 of arrays H526-1 and H526-5. For each array and each chromosome the log2 ratios are shown after global median normalization and after the three-step normalization. The variability of log2 ratios in array H526-5 is much higher than that of array H526-1. For the H526 genome, the regions of copy number changes are known [4]. As the figures show, for data from array H526-5 (low quality data), normalization reduced the unwanted variations. Consequently, after normalization the altered regions are clearer. An important point to note for data from array H526-1 (high quality data), where the variation of the log2 ratios is quite low even before normalization, is that normalization did not remove the true biological variation present in the sample.

Figure 7. Chromosome plots before and after normalization. Plot of log2 signal ratios for clones (from chromosome 1 in A and chromosome 2 in B) versus their location across the chromosome. The profiles from left to right are: H526-1 data with global median normalization (method 1), H526-1 data with stepwise normalization (method 12), H526-5 data with global median normalization (method 13), H526-5 data with stepwise normalization (method 19).
Each dot on the SeeGH plot represents a BAC clone. A shift in signal ratio to the left of the center line indicates a copy number reduction, while a shift to the right indicates a gain. The blue arrow points to a high level segmental amplification. The arrow in part B points to the micro-amplification.

Background subtraction
Whether or not to subtract background intensities has been an open question in microarray data analysis. Some groups choose to use the raw intensities while others use the background subtracted intensities. Through our experiments we observed that not subtracting the background results in slightly less variability and more repeatability of the ratios. However, knowing the true ratios in the array CGH experiments enabled us to examine how subtracting and not subtracting the background intensities affect the ability to detect copy number changes. We observed that for the array CGH data from SMRT arrays, the ability to detect copy number changes is degraded when using the ratio of non-background subtracted intensities compared to using the ratio of background subtracted intensities. However, for the array CGH data from cDNA arrays, the ability to detect copy number changes is improved when using the ratio of non-background subtracted intensities. We believe that this inconsistency between the results is caused by the different methods of background estimation used in the two cases and by the differences in the average level of background intensities of the arrays.
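
To make the two kinds of ratios compared in this subsection concrete, the following minimal Python sketch computes log2 ratios with and without background subtraction and then applies a global median normalization. It is an illustration under stated assumptions, not the implementation used in the study: the function names, the toy intensities, and the choice to mark spots with non-positive background subtracted intensity as unusable (None) are all hypothetical.

```python
import math
from statistics import median


def log2_ratios(fg_test, fg_ref, bg_test=None, bg_ref=None):
    """Per-spot log2(test / reference).

    If both background lists are given, they are subtracted from the
    foreground intensities first. Spots whose intensity is not positive
    after subtraction are returned as None (an assumption of this sketch).
    """
    subtract = bg_test is not None and bg_ref is not None
    ratios = []
    for i, (t, r) in enumerate(zip(fg_test, fg_ref)):
        if subtract:
            t -= bg_test[i]
            r -= bg_ref[i]
        ratios.append(math.log2(t / r) if t > 0 and r > 0 else None)
    return ratios


def global_median_normalize(ratios):
    """Shift the usable log2 ratios so that their median becomes zero."""
    usable = [x for x in ratios if x is not None]
    m = median(usable)
    return [None if x is None else x - m for x in ratios]


# Toy spot: true signal 1000 vs. 500 (a 2:1 ratio) on a background of 200.
raw = log2_ratios([1200.0], [700.0])
sub = log2_ratios([1200.0], [700.0], [200.0], [200.0])
print(raw[0], sub[0])  # the raw ratio is compressed towards 0; the subtracted one is 1.0
```

In this toy case an additive background common to both channels pulls the uncorrected ratio towards zero on the log2 scale, which is one way a real copy number change can become harder to detect. Whether subtraction helps on real data also depends on how well the background itself is estimated.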
The data from SMRT arrays along with the image analysis methods used suggest that background subtraction improves normalization and should be performed for these data.

Conclusion
We evaluated the performance and effectiveness of an integration of novel and existing bias removal methods mainly used for gene expression arrays, considering the stringent performance requirements of array CGH experiments and using the characteristics of array CGH data that provide the true biological values for calibration.

A normalization scheme is expected to remove the systematic variations in the data and leave the true biological variations unchanged. In evaluating the performance of the normalization methods, both these issues should be considered. Our method is shown to preserve even the low-level copy number changes while reducing the systematic biases. To the best of our knowledge this is the first study to examine the effectiveness of various normalization methods by taking advantage of the knowledge of the underlying truth, the known copy number status in genomic array CGH data, as opposed to using variable gene expression changes in normalizing expression microarray data. Our stepwise normalization framework estimates the intensity dependent, spatial and plate bias using regression-based techniques and removes the estimated biases from the raw log2 ratios. These biases were observed in two different array CGH platforms, the SMRT BAC arrays [4] and cDNA arrays [12]. Our results demonstrate that multi-step normalization outperforms conventional single step methods in reducing systematic biases in array CGH spot data from both BAC and cDNA platforms (such as those representing self-self hybridization, replicated experiments, single copy detection, and data mimicking tissue heterogeneity) and suggest that multiple systematic variations need to be addressed in the normalization of genomic array CGH data.

In this study we focused on within-array normalization and did not consider performing between-array normalization. This was based on the fact that, because of tissue heterogeneity, there is usually some degree of contamination of the tumor cells by normal cells in array CGH experiment samples. As a result, it is not known how much change in the fluorescent ratios a single copy change produces [5]. Because of this, it seems that the safest way to deal with the issue of unequal scales of data from different arrays would be to find the regions of gains or losses in DNA copy number according to data from one array CGH experiment and assign different levels of change to those different regions. These levels may then be compared across arrays.

Methods
Microarray image analysis
SMRT arrays
Hybridized arrays were imaged using a charge-coupled device based imaging system and analyzed using the SoftWorx Tracker spot analysis software (ArrayWorx eAuto, API, Issaquah, WA). The mean pixel intensity was used for the spot foreground intensities and the median pixel intensity was used for the spot background intensity. Background calculation was achieved using the "Cell method" in the SoftWorx Tracker program. In this method, a square of 125% of the size of the spot spacing is drawn and centred on the centroid of the spot's contour. All pixels within the square which are not located within the two-pixel margin of the spot's contour are treated as background pixels for that spot.

cDNA arrays
The image analysis methods are described in [12]. For computing the fluorescence ratios, the mean pixel intensity was used for the spot foreground intensities and the median pixel intensity was used for the spot background intensity.

Normalization methods
The normalization methods that were used in this study are listed in Table 2. Among these methods, global intensity LOWESS, Median Plate Ratio and Spatial normalization have been described in the text above. For comparison purposes, print tip LOWESS intensity normalization [6] is also implemented, performing LOWESS curve fitting on log ratios from each subgrid of the microarray.

LOWESS (Locally Weighted Scatter plot Smoothing) is a curve-fitting technique based on local regression [8]. Each smoothed value is determined by its neighbouring data points defined within the span. A regression weight function is defined for the data points contained within the span. In addition to the regression weight function, a robust weight function may be used, which makes the process resistant to outliers. In this study, we used a robust LOWESS with a first degree polynomial for regression.

Evaluation methods
Pearson's correlation coefficient
In the analysis of replicated experiments (H526-1 through H526-8), the Pearson correlation coefficient is calculated for all 28 possible pairings of the 8 replicate arrays. For each pairwise comparison, the log2 ratios for spots from one array form the first group, and the log2 ratios for the corresponding spots from the other array form the second group.

Intra-class correlation coefficient (ICC)
This is an ANOVA-based type of correlation. It measures the relative homogeneity within groups compared to their total variation. Suppose that we have k groups of measurements and each group consists of n replicate measurements. X_{i,j}, i = 1,..,k and j = 1,..,n, represents the j-th measurement in the i-th group. The coefficient r_ICC is calculated from the between-group and within-group mean squares of these measurements.

The maximum positive value of the intra-class correlation coefficient is 1.0, but its maximum negative value is (-1/(n-1)). The intra-class correlation coefficient is large and positive when there is no variation within the groups, but the group means differ. The intra-class correlation coefficient is large and negative when the group means are the same but there is great variation within groups.
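
The limiting cases just described can be checked numerically. The Python sketch below computes a one-way ANOVA intra-class correlation for k groups of n replicates. It is a minimal illustration under stated assumptions, not the exact computation of the study: it uses the textbook mean squares (between-group sum of squares weighted by n and divided by k - 1, within-group sum of squares divided by nk - k), whose normalization constants may differ from the definitions used here, and the function name and toy data are hypothetical.

```python
def intraclass_correlation(groups):
    """One-way ANOVA intra-class correlation for k groups of n replicates.

    groups: list of k lists, each holding the n replicate measurements
    of one group (for replicate arrays: one list per clone).
    """
    k = len(groups)
    n = len(groups[0]) if k else 0
    if k < 2 or n < 2 or any(len(g) != n for g in groups):
        raise ValueError("need at least 2 groups, each with the same n >= 2 replicates")

    grand_mean = sum(sum(g) for g in groups) / (n * k)
    group_means = [sum(g) / n for g in groups]

    # Between-group sum of squares, weighted by n (textbook form; an assumption).
    ss_between = n * sum((m - grand_mean) ** 2 for m in group_means)
    # Within-group sum of squares: replicates around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n * k - k)

    denom = ms_between + (n - 1) * ms_within
    if denom == 0:
        raise ValueError("all measurements are identical; the coefficient is undefined")
    return (ms_between - ms_within) / denom


# No variation within groups, different group means: upper limit.
print(intraclass_correlation([[1.0, 1.0], [3.0, 3.0]]))  # 1.0
# Identical group means, variation only within groups: lower limit -1/(n-1).
print(intraclass_correlation([[1.0, 3.0], [1.0, 3.0]]))  # -1.0
```

With n = 2 replicates the lower limit -1/(n-1) equals -1, which the second call reproduces. With the 8 replicate arrays analysed here the lower limit would be -1/7.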
A negative intra-class correlation occurs when between-group variation is less than within-group variation [9]. ICC was shown to be useful for the assessment of technical and biological variations in microarray experiments [11].

In evaluating the normalized data from the replicate arrays (H526-1 through H526-8), n is the total number of replicate arrays, which is 8, k is the total number of clones on each array, and X_{i,j} represents the estimated log2 ratio for the i-th clone on the j-th array.

T-test
A two-sample two-tailed T-test was used to determine whether two samples x and y (with different numbers of observations), each drawn from a normal distribution, could have the same mean when the standard deviations are unknown but assumed equal.

In the analysis performed on the third, fourth, and fifth datasets (MF-1 to MF-2, T1 to T10, and X1 to X5), for each array dataset, log2 ratios for autosomal clones represent the first sample and log2 ratios for clones from chromosome X represent the second sample.

Authors' contributions
MK developed and implemented the methods, participated in the design of the study, and drafted the manuscript. WL provided expertise on the array CGH platform. RW participated in the coordination of the study. CM participated in the design, development and coordination of the study.
All authors read and approved the manuscript.

Definitions for the intra-class correlation coefficient (ICC) described under Evaluation methods:

\[ \bar{X} = \frac{\sum_{i=1}^{k} \sum_{j=1}^{n} X_{i,j}}{nk} \]
\[ \bar{X}_i = \frac{\sum_{j=1}^{n} X_{i,j}}{n} \]
\[ RSS = \sum_{i=1}^{k} \left( \bar{X}_i - \bar{X} \right)^2 \]
\[ TSS = \sum_{i=1}^{k} \sum_{j=1}^{n} \left( X_{i,j} - \bar{X} \right)^2 \]
\[ SSE = TSS - RSS \]
\[ MS_{\text{between groups}} = \frac{RSS}{k} \]
\[ MS_{\text{within groups}} = \frac{SSE}{nk - k} \]
\[ r_{ICC} = \frac{MS_{\text{between groups}} - MS_{\text{within groups}}}{MS_{\text{between groups}} + (n - 1) \cdot MS_{\text{within groups}}} \]

Additional material

Acknowledgements
The authors wish to thank Bradley Coe for his useful discussions, Spencer Watson for providing array CGH data and Ron J DeLeeuw for careful proofreading of the manuscript. This work was supported by funds from the Canadian Institute of Health Research, Genome British Columbia/Genome Canada and the National Institute of Dental and Craniofacial Research Grant R01 DE015965.

References
1. Snijders AM, Nowak N, Segraves R, Blackwood S, Brown N, Conroy J, Hamilton G, Hindle AK, Huey B, Kimura K, Law S, Myambo K, Palmer J, Ylstra B, Yue JP, Gray JW, Jain AN, Pinkel D, Albertson DG: Assembly of microarrays for genome-wide measurement of DNA copy number. Nat Genet 2001, 29(3):263-4.
2. Schena M, Shalon D, Davis RW, Brown PO: Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 1995, 270:467-470.
3. Draghici S: Statistical intelligence: effective analysis of high-density microarray data. Drug Discov Today 2002, 7(11):S55-63.
4. Ishkanian AS, Malloff CA, Watson SK, DeLeeuw RJ, Chi B, Coe BP, Snijders A, Albertson DG, Pinkel D, Marra MA, Ling V, MacAulay C, Lam WL: A tiling resolution DNA microarray with complete coverage of the human genome. Nat Genet 2004, 36(3):299-303.
5. Garnis C, Coe BP, Lam SL, MacAulay C, Lam WL: High-resolution array CGH increases heterogeneity tolerance in the analysis of clinical samples. Genomics 2005, 85(6):790-3.
6. Smyth GK, Speed T: Normalization of cDNA microarray data. Methods 2003, 31(4):265-73.
7. Watson SK, deLeeuw RJ, Ishkanian AS, Malloff CA, Lam WL: Methods for high throughput validation of amplified fragment pools of BAC DNA for constructing high resolution CGH arrays. BMC Genomics 2004, 5(1):6.
8. Cleveland WS: Robust Locally Weighted Regression and Smoothing Scatter plots. Journal of the American Statistical Association 1979, 74:829-836.
9. StatSoft, Inc: Electronic Statistics Textbook. 2004 [http://www.statsoft.com/textbook/stathome.html]. Tulsa, OK: StatSoft.
10. Chi B, DeLeeuw RJ, Coe BP, MacAulay C, Lam WL: SeeGH – a software tool for visualization of whole genome array comparative genomic hybridization data. BMC Bioinformatics 2004, 5(1):13.
11. Pellis L, Franssen-van Hal NL, Burema J, Keijer J: The intraclass correlation coefficient applied for evaluation of data correction, labeling methods, and rectal biopsy sampling in DNA microarray experiments. Physiol Genomics 2003, 16(1):99-106.
12. Pollack JR, Sorlie T, Perou CM, Rees CA, Jeffrey SS, Lonning PE, Tibshirani R, Botstein D, Borresen-Dale AL, Brown PO: Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proc Natl Acad Sci USA 2002, 99(20):12963-8.

Additional File 1
Supplemental table 1. A description of array CGH experiments that simulate varying degrees of normal diploid cell contamination in a population of cancer cells carrying a single copy alteration. A more detailed description can be found in [5].
[http://www.biomedcentral.com/content/supplementary/1471-2105-6-274-S1.doc]

Additional File 2
Supplemental table 2. A description of array CGH experiments involving hybridization of genomic DNAs from cell lines containing varying numbers of X chromosomes that simulate varying levels of gene amplification and deletion for each of the X-chromosomal genes present on the cDNA array. A more detailed description can be found at http://smd.stanford.edu.
[http://www.biomedcentral.com/content/supplementary/1471-2105-6-274-S2.doc]