UBC Faculty Research and Publications

Experimental analysis of oligonucleotide microarray design criteria to detect deletions by comparative… Flibotte, Stephane; Moerman, Donald G Oct 21, 2008

Full Text

BioMed Central
BMC Genomics
Open Access

Methodology article

Experimental analysis of oligonucleotide microarray design criteria to detect deletions by comparative genomic hybridization

Stephane Flibotte*1 and Donald G Moerman2,3

Address: 1Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency, Vancouver, B.C, V5Z 4S6, Canada, 2Department of Zoology, University of British Columbia, Vancouver, B.C, V6T 1Z4, Canada and 3Michael Smith Laboratories, University of British Columbia, Vancouver, B.C, V6T 1Z4, Canada

Email: Stephane Flibotte* - sflibotte@bcgsc.ca; Donald G Moerman - moerman@zoology.ubc.ca

* Corresponding author

Abstract

Background: Microarray comparative genomic hybridization (CGH) is currently one of the most powerful techniques to measure DNA copy number in large genomes. In humans, microarray CGH is widely used to assess copy number variants in healthy individuals and copy number aberrations associated with various diseases, syndromes and disease susceptibility. In model organisms such as Caenorhabditis elegans (C. elegans) the technique has been applied to detect mutations, primarily deletions, in strains of interest. Although various constraints on oligonucleotide properties have been suggested to minimize non-specific hybridization and improve the data quality, there have been few experimental validations for CGH experiments. For genomic regions where strict design filters would limit the coverage it would also be useful to quantify the expected loss in data quality associated with relaxed design criteria.

Results: We have quantified the effects of filtering various oligonucleotide properties by measuring the resolving power for detecting deletions in the human and C. elegans genomes using NimbleGen microarrays. Approximately twice as many oligonucleotides are typically required to be affected by a deletion in human DNA samples in order to achieve the same statistical confidence as one would observe for a deletion in C. elegans.
Surprisingly, the ability to detect deletions strongly depends on the oligonucleotide 15-mer count, which is defined as the sum of the genomic frequency of all the constituent 15-mers within the oligonucleotide. A similarity level above 80% to non-target sequences over the length of the probe produces significant cross-hybridization. We recommend the use of a fairly large melting temperature window of up to 10°C, the elimination of repeat sequences, the elimination of homopolymers longer than 5 nucleotides, and a threshold of -1 kcal/mol on the oligonucleotide self-folding energy. We observed very little difference in data quality when varying the oligonucleotide length between 50 and 70, and even when using an isothermal design strategy.

Conclusion: We have determined experimentally the effects of varying several key oligonucleotide microarray design criteria for detection of deletions in C. elegans and humans with NimbleGen's CGH technology. Our oligonucleotide design recommendations should be applicable for CGH analysis in most species.

Published: 21 October 2008
BMC Genomics 2008, 9:497 doi:10.1186/1471-2164-9-497
Received: 21 July 2008
Accepted: 21 October 2008
This article is available from: http://www.biomedcentral.com/1471-2164/9/497
© 2008 Flibotte and Moerman; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background
In human health research microarray comparative genomic hybridization (CGH) has become a powerful technique to investigate DNA copy number variants (CNVs) in healthy subjects [1,2] and genomic aberrations associated with various diseases and syndromes [3,4].
Furthermore, CGH is now frequently used to analyze the genome of strains of interest in various model organisms [5,6]. On some oligonucleotide microarray platforms individual researchers can design their own specialized microarrays for very specific experiments. The only crucial requirement before starting to design an array is access to a sequenced reference genome for the species under investigation. The first task facing a biologist trying to design a CGH microarray is to define criteria to eliminate oligonucleotides with particular properties that are expected to reduce the data quality. Some design criteria have been suggested and used for several years with little or no large-scale experimental validation [7,8]. Large-scale studies of the effects of various oligonucleotide properties on microarray data quality are just starting to be published [9,10] but few of them are designed to investigate the two-colour scheme typically used in CGH experiments. Most of these studies are concerned with the human genome but it would be useful to know if some design criteria could be relaxed for smaller and less complex genomes and, in general, what kind of penalty one has to pay in terms of data quality when relaxing constraints on specific oligonucleotide properties.

In our research we are particularly interested in using oligonucleotide microarray CGH to detect induced deletions in the C. elegans genome [5,11]. We designed our own microarray chips but our criteria for oligonucleotide selection were arbitrary and relied on empirical observation (that is, the data quality was adequate for the task [5]) rather than on experimental testing of various oligonucleotide features. Optimal design criteria are expected to depend on the hybridization conditions and possibly on the complexity of the genome under investigation.
In the current publication we report our findings on the effects of varying the oligonucleotide design criteria and how these alterations affect our ability to detect deletions in both the C. elegans and human genomes. Considering the differences in size and complexity of these two genomes, the design properties we recommend here should be applicable to many organisms with a sequenced genome provided that the hybridization conditions are not drastically different from those used in our experiments.

Results and discussion

Effects of various oligonucleotide properties on resolving power
The concept of resolving power we use here was introduced in a software evaluation study [12]. It is a useful tool to detect and quantify small variations in overall data quality when changes are made to the oligonucleotide selection process or the data analysis procedure, or even when comparing different array platforms. Briefly, using the experimental distributions of the data points in the so-called normal regions and in the regions with copy number aberrations it is possible to estimate the expected p-value associated with the detection of a typical aberrant DNA segment covered by a given number of probes. In the current work, the resolving power in C. elegans has been evaluated with the help of two strains with large heterozygous deletions previously found in CGH experiments [5]. As a human sample, a pool of male DNA has been compared by CGH to a pool of female DNA so that probes targeting the X chromosome could be associated with a one-copy loss in the male sample. Details of the microarray design for both the human and C. elegans experiments can be found in the Methods section. Briefly, for both normal and deleted regions we had probes manufactured with lengths of 50, 60, and 70 nucleotides. We also used a so-called isothermal design where the oligonucleotide length is varied in an attempt to obtain an approximately constant melting temperature. The only significant constraints applied on the oligonucleotides at the design stage were the exclusion of known repeats and, for the human chip, the elimination of segments with known CNVs and single nucleotide polymorphisms (SNPs). Microarrays with shorter oligonucleotides, for example 25-mers in the case of the Affymetrix platform [13], can also be used to infer CNVs [12,14] but their optimization [15,16] is associated with different issues than longer oligonucleotide arrays and therefore they will not be considered in the current study.

Figure 1 shows resolving power curves for detection of one-copy deletions with 50-mer oligonucleotides for both C. elegans and human DNA with and without the application of standard constraints on the oligonucleotide properties. Those standard constraints are summarized in Table 1 and more details regarding their calculations can be found in the Methods section. The resolving power curves are clearly linear when plotted in logarithmic scale and therefore the data quality can be summarized by the slope, steeper slopes being better. For example, to achieve a p-value better than 1 × 10^-5 a typical heterozygous deletion in C. elegans would need to be covered by about 15 unfiltered oligonucleotides, while about 3 fewer probes would be required to achieve the same p-value if the standard filters are applied to the microarray design. The human data is noticeably noisier and one would require about twice as many probes as in C. elegans to detect a deletion at a given p-value level. However, the improvement in the resolving power slope when applying the same standard filters is slightly better in the human example.
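As a rough worked example of how a slope summarizes data quality, the sketch below converts a resolving power slope into the number of probes needed to reach a target p-value, and compares filtered with unfiltered slopes. It assumes the regression line passes through the origin, that is log10(p) ≈ slope × n, which is a simplification not stated in the text. The slopes are the values printed in the legend of Figure 1, and the function name is hypothetical.

```python
import math

# Regression slopes of log10(p) versus probe count, from the legend of Figure 1.
SLOPES = {
    ("C. elegans", "no filters"): -0.3373,
    ("C. elegans", "all filters"): -0.4230,
    ("human", "no filters"): -0.1786,
    ("human", "all filters"): -0.2393,
}

def probes_needed(slope, target_p=1e-5):
    """Smallest probe count n such that slope * n <= log10(target_p).

    Assumes a zero intercept, which is a hypothetical simplification.
    """
    if slope >= 0:
        raise ValueError("slope must be negative")
    return math.ceil(math.log10(target_p) / slope)

for (species, design), slope in SLOPES.items():
    print(f"{species:10s} {design:11s} n = {probes_needed(slope)}")

# Gain from filtering, expressed as the ratio of filtered to unfiltered slopes.
for species in ("C. elegans", "human"):
    ratio = SLOPES[(species, "all filters")] / SLOPES[(species, "no filters")]
    print(f"{species:10s} slope ratio = {ratio:.2f}")
```

Under the zero-intercept assumption the C. elegans values come out at 15 probes without filters and 12 with filters, in line with the "about 15" and "about 3 fewer" quoted above.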
More precisely, the ratio of resolving power slope between the filtered and unfiltered situations is 1.34 for human compared to 1.25 for C. elegans. However, those filters represent more restrictive design constraints for human DNA, with only 24% of the 50-mer oligonucleotides on the array being accepted compared to 35% in the C. elegans case. It should however be noted that in the typical microarray designs we have used in previous biological experiments we did not apply a hard ceiling at the median value for the 15-mer count but simply used the 15-mer count to guide the final oligonucleotide selection for oligonucleotides passing all the other filters [5].

Each filter on oligonucleotide properties can be turned on and off independently before calculating the resolving power, except of course for the elimination of repeat sequences, which has already been applied to all the oligonucleotides present on the arrays. The effect of repeat sequences cannot be studied in the current work but this is not a significant limitation. It is true that in some cases it is possible to find sequences that are fairly unique within repeat regions, which might be of interest for some experiments, especially in mammalian genomes [17]. However, at least in C. elegans we noticed that the CGH log2 ratio signal tends to be somewhat unreliable, with more non-zero bias, near repeats. This type of alteration in signal is also often observed just outside deletions [5]. The individual effect of each of our oligonucleotide filters on resolving power is shown in Figure 2 for both human and C. elegans one-copy deletions detected using 50-mer oligonucleotides. The effect of each filter on the resolving power is virtually identical when using oligonucleotides of length 50, 60, 70, or with an isothermal design (data not shown). As previously mentioned, the slope in resolving power is roughly twice as steep for C.
elegans than for human data, and this is also true when individual filters are applied. For the most part, each filter produces a similar gain in resolving power for both human and C. elegans data except perhaps for the elimination of non-unique 20-mers, which is more effective in human than in C. elegans. However, the elimination of non-unique 20-mers is obviously a more restrictive design constraint in human than in C. elegans as it eliminates 59% of all the oligonucleotides on the array compared to only 20% in the C. elegans case (data not shown).

Table 1: Standard oligonucleotide filters used in this work.

Oligonucleotide property                  Condition for elimination
Repeat sequences                          Presence
Constituent 20-mers                       Non-uniqueness
Homopolymers                              Length > 5
Melting temperature                       More than 5°C from median
Self-folding energy                       < -1 kcal/mol
Similarity with other genomic location    > 70%
15-mer count                              > median

Figure 1. Resolving power curves for detection of one-copy deletions with 50-mer oligonucleotides. The open circles show the results from resolving power calculations for one-copy deletions; in other words, the logarithm of the expected p-value for a deletion is shown as a function of the number of probes affected by the copy-number aberration. The solid (dashed) lines are linear regressions of the resolving power calculations in C. elegans (human), with their slope being provided in the legends. Red data points and lines correspond to calculations using any oligonucleotide on the arrays without further filtering while the green lines and data points correspond to resolving power calculations after selecting the oligonucleotides with our standard filters as described in the Methods section.
[Plot: x-axis, number of 50-mer probes in 1-copy deletion; y-axis, log10(p). Legend: C. elegans, no filters, slope = -0.3373; C. elegans, all filters, slope = -0.4230; human, no filters, slope = -0.1786; human, all filters, slope = -0.2393.]

It is of course possible to modify the parameters of some of our standard filters and measure the effect on the resolving power. Figure 3 shows the effects of changing the constraints on the self-folding energy, on the length of the longest homopolymer, on the 15-mer count and on the melting temperature both for human and C. elegans data obtained with 50-mer probes. Once again, the trends are very similar for human and C. elegans and for all the oligonucleotide lengths present on the arrays. As can be seen in panel A, our standard use of a self-energy threshold of -1 kcal/mol seems optimal while panel B suggests that our standard ceiling of 5 for the longest homopolymer is certainly acceptable. In both human and C. elegans examples the vast majority of homopolymers are polyA and polyT tracts with far fewer polyC and polyG tracts, so our results are basically measurements of the effect of having polyA and polyT present in a 50-mer probe. In contrast to what has been implied in a previous publication [9], the presence of polyA and polyT reduces the performance of our probes when attempting to detect deletions. However, since polyA/polyT tracts are much more frequent than polyC/polyG tracts, selecting probes with longer homopolymers tends to reduce the average GC content within the probes and therefore the average melting temperature. In order to disentangle the melting temperature and homopolymer effects we have performed another resolving power calculation, this time at a fixed melting temperature, and our conclusion is still valid: the presence of long homopolymers tends to deteriorate the performance of a probe. As can be seen in panel C of Figure 3, filtering the probes according to their 15-mer count is a very efficient way to directly control their performance. For example, when calculating the ratio of the resolving power slopes for probes belonging to the bottom and top 10% in 15-mer count one obtains 2.3 for C. elegans and 2.7 for human. Finally, panel D demonstrates that our standard range of 10 degrees in melting temperature is an adequate filter and that reducing the width of that window only marginally improves the data quality for detecting deletions. This is in contradiction with what has been previously reported for general copy-number measurements in human samples [9]. As demonstrated in previous work [10], oligonucleotides with higher melting temperature tend to produce higher overall fluorescence intensities. However, the use of a two-colour CGH scheme coupled with our data analysis procedure appears to eliminate the need for a very uniform melting temperature design. The formula we used to calculate the melting temperature is identical to that used in Reference [9] with only a small difference in one of the parameters, which cannot affect our conclusion for oligonucleotides of fixed length. We have repeated the resolving power analysis with a melting temperature calculation based on a nearest neighbour approach [18-20] and once again we see only a marginal improvement in data quality when reducing the width of the window in melting temperature (data not shown).

Figure 2. Individual effect of standard oligonucleotide filters on resolving power. The negative (or absolute value) of the slope in the resolving power curve is shown for individual constraints on oligonucleotides when detecting one-copy deletions with 50-mer probes. For each filter, the green bars correspond to resolving power calculations performed exclusively with oligonucleotides accepted by the filter while for the red bars only the oligonucleotides rejected by the filter were used in the calculations. Solid colours are associated with C. elegans and hashed areas are associated with the human data set. The filters from left to right are: elimination of non-unique 20-mers, elimination of homopolymers longer than 5 nucleotides, the selection of a 10°C range in melting temperature, the elimination of oligonucleotides with self-folding energy smaller than -1 kcal/mol, the elimination of oligonucleotides mapping to more than one genomic region with more than 70% similarity, the elimination of oligonucleotides with 15-mer count above the median value, and finally, the simultaneous application of all those filters. More details on those standard filters can be found in the Methods section.
[Plot: x-axis, filter (20m, homopol, Tm, energy, homolog, 15m, all); y-axis, slope in resolving power. Legend: elegans accepted, elegans rejected, human accepted, human rejected.]

As previously mentioned and illustrated in Figure 4 panel A, the trends we observed for the resolving power are very similar for oligonucleotides of length 50, 60, 70 and our isothermal design. A very small gain in performance is observed for longer probes but, as can be deduced from panel B of Figure 4, the majority of that gain is probably due to the fact that longer probes tend to have higher melting temperatures.

Sequence similarity
In order to quantify the best design practices with regard to minimizing potential non-specific hybridization we have introduced a series of perturbations on a pre-selected number of oligonucleotides; see the Methods section for details. Two different kinds of sequence similarity have to be considered: either the presence of a stretch of perfect identity of a given length within the oligonucleotide or a given similarity level over the whole oligonucleotide.

The red boxplot in Figure 5 shows the difference in fluorescence intensity one is expected to observe between a 50-mer oligonucleotide mapping perfectly to the C.
elegans genome and a random 50-mer oligonucleotide with the same GC content. This is the basis for comparison, and oligonucleotides with sequence identity associated with a smaller difference in intensity present some level of cross-hybridization. As can be seen from the green boxplots in Figure 5, a stretch of perfect identity of length of about 22 and above in the middle of the oligonucleotide will produce some level of cross-hybridization, and of course the longer the perfectly matched sequence is, the worse the effect will be on the performance of the oligonucleotide. The elimination of non-unique 20-mers in our standard filters therefore seems a little too conservative. However, as can be seen in Figure 6, the position of the stretch of perfect identity within the oligonucleotide is important. Presumably due to steric effects, a stretch of perfect identity close to the slide will produce fewer cross-hybridization problems than a perfect stretch of identical length located at the other end of the oligonucleotide. For example, in C. elegans a perfect match of length 30 in the middle of the oligonucleotide will introduce similar cross-hybridization noise as a perfect match of length 23 close to the slide or length 36 at the end away from the slide. In fact, a perfect match of length 20 at the end away from the slide will produce a measurable fluorescence intensity above background so our standard elimination of non-unique 20-mers is justifiable in these instances. Similar positional effects are manifest in our human data set, except that the overall amplitude of the intensity difference between original and perturbed oligonucleotides is smaller. This is because the human data is noisier and spans a smaller dynamical range. This effect is compatible with the asymmetry previously reported for experiments performed with a one-colour hybridization scheme on NimbleGen microarrays [10]. Furthermore, such asymmetry could explain the difference in performance sometimes observed [21] between oligonucleotides designed following the plus and minus strand templates at a given genomic location.

Figure 3. Effects of varying some oligonucleotide constraints on resolving power. The negative (or absolute value) of the slope in the resolving power curve is shown for individual constraints on oligonucleotides when detecting one-copy deletions with 50-mer probes. The green (red) bars correspond to C. elegans (human) data. The individual oligonucleotide constraints that have been varied consist of (A) the self-folding energy, (B) the length of the longest homopolymer, (C) the 15-mer count, and (D) the melting temperature.
[Plot: y-axis in all panels, slope in resolving power. Panel A x-axis, self-folding energy e in kcal/mol (e > 1; 0 < e < 1; -1 < e < 0; -2 < e < -1; e < -2). Panel B x-axis, longest homopolymer (<=4, 5, 6, 7, >=8). Panel C x-axis, 15-mer count quantile (0-10, 10-25, 25-50, 50-75, 75-90, 90-100). Panel D x-axis, Tm (any, 65-75, 66-74, 67-73, 68-72, 69-71, 70).]

Figure 4. Effects of varying the oligonucleotide length. The negative (or absolute value) of the slope in the resolving power curve is shown as a function of oligonucleotide length for C. elegans (solid colour bars) and human (hashed bars). (A) Effects of varying the oligonucleotide length between 50 and 70 before (red bars) and after (green bars) the use of our standard filters; see Methods for details. For the so-called isothermal design the length of each oligonucleotide was allowed to vary between 50 and 70 in an attempt to minimize the width of the melting temperature distribution. (B) Effects of varying the oligonucleotide length between 50 and 70 for a fixed melting temperature of 72°C without applying additional constraints on oligonucleotides.
[Plot: x-axis, oligonucleotide length (50 mer, 60 mer, 70 mer, isothermal); y-axis, slope in resolving power. Panel A legend: elegans filtered, elegans no filters, human filtered, human no filters. Panel B legend: elegans Tm = 72, human Tm = 72.]

Figure 5. Stretch of perfect identity in the middle of 50-mer oligonucleotides in C. elegans. Boxplots of the difference in fluorescence intensity in log2 scale between the original and perturbed 50-mer oligonucleotides. For the green boxplots, the perturbation consisted in randomizing the left and right sides of the original oligonucleotide while keeping a stretch intact in the middle. The red boxplot is associated with a randomization over the full length of the oligonucleotide. In all the cases, the perturbed oligonucleotide has the same GC content as the original oligonucleotide in an attempt to keep the melting temperature constant.
[Plot: x-axis, number of consecutive perfect matches in the middle of 50 mer; y-axis, intensity difference (log2).]

As expected, for C. elegans the presence of a perfect stretch of identity of a given length will produce a higher level of cross-hybridization for shorter oligonucleotides. For example, as can be seen in Figure 7, a perfect stretch of length 30 in the middle of a 60-mer oligonucleotide will produce the same intensity perturbation as a perfect stretch of length 27 in the middle of a 50-mer oligonucleotide, or a perfect stretch of length 33 in the middle of a 70-mer oligonucleotide. Figure 7 also shows that such a trend is not quite as obvious in the human case; in particular, very little difference is seen between the curves for 60- and 70-mer oligonucleotides. It should be noted that the recommendation [7] of eliminating non-unique 15-mers within oligonucleotides of length 50 is too conservative with our hybridization conditions. This is fortunate because basically no oligonucleotides would pass such a constraint for CGH in large genomes. However, as already mentioned, minimizing the 15-mer count of oligonucleotides is recommended to improve the resolving power.

Figure 8 shows that the introduction of about 10 or more mismatches within a 50-mer oligonucleotide is enough to bring the fluorescence intensity down to the background level in C. elegans. One can see in Figure 9 that the corresponding limits for 60-mer and 70-mer oligonucleotides are about 12 and 14 mismatches, respectively. In other words, an oligonucleotide mapping to a second location in the genome with an overall degree of similarity above about 80% will produce a measurable amount of non-specific hybridization, which is in agreement with what has been reported in previous work for 50-mer oligonucleotides [7].
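The mismatch limits quoted above translate into an overall similarity by simple arithmetic. The sketch below does that conversion and also shows an ungapped percent-identity check between a probe and an off-target sequence of equal length. The function names and the ungapped, equal-length comparison are assumptions made for illustration and are not a description of the authors' similarity search.

```python
def similarity_after_mismatches(length, mismatches):
    """Fraction of matching positions in an oligo carrying the given mismatches."""
    return (length - mismatches) / length

def percent_identity(probe, target):
    """Ungapped identity between two equal-length sequences (illustrative only)."""
    if len(probe) != len(target):
        raise ValueError("sequences must have equal length")
    matches = sum(a == b for a, b in zip(probe.upper(), target.upper()))
    return 100.0 * matches / len(probe)

# Mismatch counts at which the intensity reaches background (Figures 8 and 9).
for length, mismatches in [(50, 10), (60, 12), (70, 14)]:
    sim = similarity_after_mismatches(length, mismatches)
    print(f"{length}-mer with {mismatches} mismatches: {sim:.0%} similarity")
```

All three probe lengths give the same 80% figure, which is why a single similarity threshold can be stated independently of probe length in this range.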
When comparing with the example used above for a stretch of perfect identity of length 30 in the middle of a 50-mer oligonucleotide, one can see that the same level of cross-hybridization would be obtained for an oligonucleotide with an overall similarity around 88% over the length of the oligonucleotide. Once again, the smaller dynamical range covered by the human data (see dashed lines in Figure 9) makes a precise interpretation of the results more difficult. However, even in the human case, it is clear that for a fixed number of mismatches the shorter oligonucleotides will present more perturbation on the fluorescence intensity.

Figure 6. Effect of the position of a stretch of perfect identity within 50-mer oligonucleotides. LOESS regression of the difference in fluorescence intensity (in log2 scale) between the original and perturbed 50-mer oligonucleotides as a function of the length of the stretch of perfect identity. Solid (dashed) lines correspond to C. elegans (human) data. The perfect stretch of identity is either on the left (5') side (green lines), right (3') side (blue lines) or middle (red lines) of the 50-mer oligonucleotide. With NimbleGen's manufacturing process the oligonucleotides are synthesized from 3' to 5' and therefore the left side is protruding and freely floating in the solution while the right side is closer to the slide.
[Plot: x-axis, number of consecutive perfect matches within 50 mer; y-axis, intensity difference (log2). Legend, for both human and elegans: left/protruding, middle, right/tethered.]

Figure 7. Stretch of perfect identity in the middle of oligonucleotides of various lengths. LOESS regression of the difference in fluorescence intensity (in log2 scale) between the original and perturbed oligonucleotides as a function of the length of the stretch of perfect identity. Solid (dashed) lines correspond to C. elegans (human) data. Results are shown for oligonucleotides of length 50 (green lines), 60 (red lines) and 70 (blue lines).
[Plot: x-axis, number of consecutive perfect matches in the middle; y-axis, intensity difference (log2). Legend, for both human and elegans: 50 mer, 60 mer, 70 mer.]

Oligonucleotide design recommendations
Several constraints can be applied on oligonucleotide properties to improve the data quality but one should keep in mind that our design recommendations summarized in this section could easily be relaxed to improve coverage in specific genomic regions and still yield useful information from the resulting data. In the current work we have only studied one-copy deletions but we expect that oligonucleotide design criteria improving the resolving power for detection of deletions should also improve the resolving power for detecting copy number gains. However, we cannot infer that our design recommendations are necessarily optimal for experiments attempting to measure precise copy numbers where large copy numbers are expected.

We observed very little difference in data quality when varying the oligonucleotide length between 50, 60 and 70, and even when using a so-called isothermal design where the length of each oligonucleotide varies between 50 and 70 in an attempt to minimize the overall width of the melting temperature distribution. Oligonucleotides of fixed length are simpler to work with at the design stage and it is easier to avoid unwanted genomic regions with shorter oligonucleotides.
Even if only for convenience we will continue to use 50-mer oligonucleotides in our own research projects and suggest that oligonucleotides of this length will suffice for most other projects.

Our results demonstrate that filtering potential oligonucleotide probes according to their 15-mer count is probably the most effective way to control probe quality. In the current work repeat sequences have been eliminated right from the start and we therefore do not have a direct measurement of their effect on data quality. However, considering the effect of the 15-mer count above it is safe to assume that most types of repeats should be excluded from most CGH array designs. The quality of the data can be improved by considering the oligonucleotides' self-folding tendency and the presence of homopolymers; a self-energy threshold of -1 kcal/mol seems optimal and the elimination of oligonucleotides with homopolymers longer than 5 is probably adequate in most situations. Only a small gain in data quality is achieved by restricting the oligonucleotides' melting temperature, and using a relatively wide window of 10°C centered on the median value seems an acceptable compromise between data quality and coverage.

Two types of sequence similarity with multiple genomic regions have been investigated: the presence of a perfect identity over a fraction of the probe and the similarity over the whole length of the probe. Our elimination of non-unique 20-mers within the genome when designing oligonucleotides is conservative and is really only justified when the 20-mer is located at the end away from the slide in a 50-mer probe. A stretch of perfect identity of length 22 in the middle of a 50-mer probe will produce measurable cross-hybridization. The same is true for a similarity level above about 80% over the full length of a probe. While the constraints on oligonucleotide design described here are good starting points, the optimal constraints to be used to eliminate cross-hybridization from both types of sequence similarity will depend on the genome under investigation and the desired coverage in a given region.

Figure 8. Introduction of random mismatches in 50-mer oligonucleotides in C. elegans. Boxplots of the difference in fluorescence intensity in log2 scale between the original and perturbed 50-mer oligonucleotides. For the green boxplots, the perturbation consisted in the introduction of mismatches at random locations. The red boxplot is associated with a randomization over the full length of the oligonucleotide. In all the cases, the perturbed oligonucleotide has the same GC content as the original oligonucleotide in an attempt to keep the melting temperature constant.
[Plot: x-axis, number of mismatches in 50 mer; y-axis, intensity difference (log2).]

Conclusion
We have analyzed CGH experiments performed with NimbleGen's microarray platform in order to assess the relationships between various oligonucleotide properties and the quality of the data as measured by the ability to detect deletions in the human and C. elegans genomes. For the most part our microarray design recommendations summarized in the previous section are very similar for both species and they could probably be used without modifications for most other species with a sequenced reference genome. As expected, the larger and more complex human genome is more difficult to study with CGH and a deletion must affect approximately twice as many oligonucleotides to be detected with the same statistical confidence as in C. elegans.
All our results were obtained withthe NimbleGen platform with their standard hybridiza-tion protocol and of course our conclusions might not bevalid for other microarray platforms or when using differ-ent hybridization conditions.MethodsDNADNA from two C. elegans strains harbouring deletionswere used as samples in the current study; strainsVC10019 (gk487/mIn1) and VC10020 (gk488/mIn1) [5]carry 0.8 Mb and 0.5 Mb heterozygous deletions on chro-mosome II, respectively. For both C. elegans hybridiza-tions, DNA extracted from the wild-type N2 strain hasbeen used as the reference DNA. For the human DNAexperiment, the sample was a commercial pool of DNAfrom 6 male anonymous individuals and the referencewas a similar DNA pool from 6 female donors both sup-plied by Promega Corporation. Details of the nematodeculture, DNA preparation and labelling can be found in aprevious publication [5].Oligonucleotide microarray designBoth the human and C. elegans microarrays used in thecurrent study comprised approximately 380 × 103 oligo-nucleotides tiling the positive strand. In each case the totalnumber of oligonucleotides was divided in five approxi-mately equal parts: 1) oligonucleotides of length 50, 2)oligonucleotides of length 60, 3) oligonucleotides oflength 70, 4) oligonucleotides of variable length between50 and 70 selected to minimize the overall spread in melt-ing temperature, and finally 5) perturbed oligonucle-otides where mismatches have been intentionallyintroduced in oligonucleotides from the first three catego-ries above. The last category has been subdivided in threetypes of perturbation 1) a complete random shuffling, 2)the introduction of mismatches at random locations or 3)a random shuffling of the nucleotides at one or both endsin order to produce oligonucleotides with perfect matcheson the left, right or middle. Each perturbed oligonucle-otide originated from a perfect match oligonucleotidepassing our standard constraints, see below. 
Furthermore,the GC content of the perturbed oligonucleotide wasidentical to the GC content of the original oligonucleotidein an attempt to maintain the same melting temperature.Each perturbed oligonucleotide can therefore be associ-ated with a specific perfect match oligonucleotide presenton the array and the difference in fluorescence intensitybetween the pair should be a reflection of the perturba-tion applied. The number of oligonucleotides in each cat-egory is provided in Table 2 for the C. elegans and humanarrays.Introduction of random mismatches in oligonucleotides of vari us lengthsFigure 9Introduction of random mismatches in oligonucle-otides of various lengths. LOESS regression of the differ-ence in fluorescence intensity (in log2 scale) between the original and perturbed oligonucleotides as a function of the number of mismatches introduced in the oligonucleotide. Solid (dashed) lines correspond to C. elegans (human) data. Results are shown for oligonucleotides of length 50 (green lines), 60 (red lines) and 70 (blue lines).5 10 15 20−3.0−2.5−2.0−1.5−1.0−0.50.0Number of mismatchesIntensity difference (log2)elegans50 mer60 mer70 merhuman50 mer60 mer70 merPage 10 of 12(page number not for citation purposes)deletion typically needs to affect approximately twice asmany probes to achieve the same level of statistical confi-For the C. elegans array design, the oligonucleotidesselected correspond to an approximately uniform tiling ofBMC Genomics 2008, 9:497 http://www.biomedcentral.com/1471-2164/9/497the deletions in gk487 and gk488 plus 0.5 Mb of flankingregions on each side. The repeats annotated in Wormbasedata freeze version WS170 have been eliminated fromconsideration but no other constraints have been appliedon the oligonucleotides except that they had to be synthe-sized in fewer than 180 cycles with NimbleGen's microar-ray manufacturing process [22]. 
Approximately 22% and 16% of the oligonucleotides cover the deletions in gk487 and gk488, respectively.

A similar strategy has been applied to select the oligonucleotides for the human array. In this case the probes were approximately uniformly distributed over the whole genome but with increased density for chromosome X, resulting in approximately 37% of the oligonucleotides covering that chromosome. Once again, repeats were eliminated, as were the regions with known SNPs (in dbSNP) [23], CNVs and other genomic variants (in the Database of Genomic Variants) [24,25].

Hybridization and data processing
The hybridization, image analysis, extraction of fluorescence intensities and their ratios (log2ratio), together with their subsequent normalization, have been described in detail in a previous publication [5]. Briefly, a two-colour CGH scheme has been used, and the hybridization and image analysis have been performed as a commercial service by Roche NimbleGen Inc. No background has been subtracted before calculating the log2ratio values, and the normalization followed a LOESS regression. The data discussed in this publication have been deposited in NCBI's Gene Expression Omnibus [26] and are accessible through GEO Series accession number GSE12208 [27].

Resolving power
The concept of resolving power, as used in the current work, has been described in a previous publication [12]. The inputs to the resolving power calculations are the mean and standard deviation of the log2ratio data points in the normal and aberrant regions. Consequently, the calculations assume that the distribution of log2ratio is Gaussian for both types of region, and no attempt is made to account for possible autocorrelation between data points mapping to nearby genomic locations.
Armed with the mean and standard deviation of both distributions and knowing the total number of data points on the array, one can evaluate the expected p-value for copy number aberrations affecting a given number of probes with the same mathematical formulation that is used to calculate a t-test. In the current study we are only interested in calculating the resolving power for one-copy deletions. The logarithm of the p-value coming out of a resolving power calculation is linear with the number of probes affected by the copy-number aberration, and therefore a resolving power curve can be summarized by its slope, which is easily calculated with a linear regression.

Oligonucleotide properties and standard filters
We use the term standard filters to refer to microarray design constraints on oligonucleotides similar to those that gave us acceptable results in previous CGH studies [5]. In summary they correspond to:
1) the elimination of repeat sequences,
2) the elimination of 20-mers occurring more than once in the genome,
3) the elimination of homopolymers longer than 5 nucleotides,
4) the selection of oligonucleotides with a melting temperature Tm within ±5°C of the median melting temperature, where Tm has simply been calculated as a function of percent GC content and oligonucleotide length L by Tm = 64.9 + 0.41 × GC - 500/L,
5) the elimination of oligonucleotides with a self-folding energy smaller than -1 kcal/mol according to a hybrid-ss-min calculation [28],
6) the elimination of oligonucleotides mapping to more than one location in the genome with a similarity level above 70% over the whole oligonucleotide according to a MegaBLAST [29] search, and
7) the elimination of oligonucleotides with a 15-mer count above the median, where the 15-mer count is defined as the sum of the genomic frequency of all the constituent 15-mers within the oligonucleotide.
With the exception of the first constraint, all the filters can be modified at the analysis stage before calculating the resolving power. The repeats were already eliminated when designing the arrays, so that constraint cannot be modified during the analysis and applies to all the results presented in the current work.

Table 2: Number of oligonucleotides for each category represented on the arrays.

                                                 Perturbed oligonucleotides
Species     Oligonucleotide  Perfect   Random     Random      Perfect
            length           match     shuffling  mismatches  stretch
C. elegans  50               76305     3815       9854        7618
            60               76463     3814       16140       12468
            70               75223     2232       10714       6431
            Isothermal       75439     3586       0           0
Human       50               75769     3171       10268       8294
            60               75784     3166       14448       11658
            70               75805     1841       13344       8280
            Isothermal       75530     2812       0           0

Authors' contributions
SF designed the study and the microarrays, performed the data analysis and drafted the manuscript. DGM helped to draft the manuscript. Both authors read and approved the final manuscript.

Acknowledgements
We wish to thank Rick Zapf for growing the worms and preparing the DNA for the C. elegans experiments. This work was supported by grants from Genome Canada, Genome British Columbia and the Michael Smith Research Foundation.

References
1. Freeman JL, Perry GH, Feuk L, Redon R, McCarroll SA, Altshuler DM, Aburatani H, Jones KW, Tyler-Smith C, Hurles ME, Carter NP, Scherer SW, Lee C: Copy number variation: new insights in genome diversity. Genome Res 2006, 16:949-961.
2. Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W, Cho EK, Dallaire S, Freeman JL, González JR, Gratacòs M, Huang J, Kalaitzopoulos D, Komura D, MacDonald JR, Marshall CR, Mei R, Montgomery L, Nishimura K, Okamura K, Shen F, Somerville MJ, Tchinda J, Valsesia A, Woodwark C, Yang F, Zhang J, Zerjal T, Zhang J, Armengol L, Conrad DF, Estivill X, Tyler-Smith C, Carter NP, Aburatani H, Lee C, Jones KW, Scherer SW, Hurles ME: Global variation in copy number in the human genome. Nature 2006, 444:444-454.
3. Sebat J, Lakshmi B, Malhotra D, Troge J, Lese-Martin C, Walsh T, Yamrom B, Yoon S, Krasnitz A, Kendall J, Leotta A, Pai D, Zhang R, Lee YH, Hicks J, Spence SJ, Lee AT, Puura K, Lehtimäki T, Ledbetter D, Gregersen PK, Bregman J, Sutcliffe JS, Jobanputra V, Chung W, Warburton D, King MC, Skuse D, Geschwind DH, Gilliam TC, Ye K, Wigler M: Strong association of de novo copy number mutations with autism. Science 2007, 316:445-449.
4. Walsh T, McClellan JM, McCarthy SE, Addington AM, Pierce SB, Cooper GM, Nord AS, Kusenda M, Malhotra D, Bhandari A, Stray SM, Rippey CF, Roccanova P, Makarov V, Lakshmi B, Findling RL, Sikich L, Stromberg T, Merriman B, Gogtay N, Butler P, Eckstrand K, Noory L, Gochman P, Long R, Chen Z, Davis S, Baker C, Eichler EE, Meltzer PS, Nelson SF, Singleton AB, Lee MK, Rapoport JL, King MC, Sebat J: Rare structural variants disrupt multiple genes in neurodevelopmental pathways in schizophrenia. Science 2008, 320:539-543.
5. Maydan JS, Flibotte S, Edgley ML, Lau J, Selzer RR, Richmond TA, Pofahl NJ, Thomas JH, Moerman DG: Efficient high-resolution deletion discovery in Caenorhabditis elegans by array comparative genomic hybridization. Genome Res 2007, 17:337-347.
6. Egan CM, Sridhar S, Wigler M, Hall IM: Recurrent DNA copy number in the laboratory mouse. Nat Genet 2007, 39:1384-1389.
7. Kane MD, Jatkoe TA, Stumpf CR, Lu J, Thomas JD, Madore SJ: Assessment of the sensitivity and specificity of oligonucleotide (50 mer) microarrays. Nucleic Acids Res 2000, 28:4552-4557.
8. Relógio A, Schwager C, Richter A, Ansorge W, Valcárcel J: Optimization of oligonucleotide-based DNA microarrays. Nucleic Acids Res 2002, 30:e51.
9. Sharp AJ, Itsara A, Cheng Z, Alkan C, Schwartz S, Eichler EE: Optimal design of oligonucleotide microarrays for measurement of DNA copy-number. Hum Mol Genet 2007, 16:2770-2779.
10. Wei H, Kuan PF, Tian S, Yang C, Nie J, Sengupta S, Ruotti V, Jonsdottir GA, Keles S, Thomson JA, Stewart R: A study of the relationships between oligonucleotide properties and hybridization signal intensities from NimbleGen microarray datasets. Nucleic Acids Res 2008, 36:2926-2938.
11. Moerman DG, Barstead RJ: Towards a mutation in every gene in Caenorhabditis elegans. Brief Funct Genomic Proteomic 2008, 7:195-204.
12. Baross A, Delaney AD, Li HI, Nayar T, Flibotte S, Qian H, Chan SY, Asano J, Ally A, Cao M, Birch P, Brown-John M, Fernandes N, Go A, Kennedy G, Langlois S, Eydoux P, Friedman JM, Marra MA: Assessment of algorithms for high throughput detection of genomic copy number variation in oligonucleotide microarray data. BMC Bioinformatics 2007, 8:368.
13. Kennedy GC, Matsuzaki H, Dong S, Liu WM, Huang J, Liu G, Su X, Cao M, Chen W, Zhang J, Liu W, Yang G, Di X, Ryder T, He Z, Surti U, Phillips MS, Boyce-Jacino MT, Fodor SP, Jones KW: Large-scale genotyping of complex DNA. Nat Biotechnol 2003, 21:1233-1237.
14. Bignell GR, Huang J, Greshock J, Watt S, Butler A, West S, Grigorova M, Jones KW, Wei W, Stratton MR, Futreal PA, Weber B, Shapero MH, Wooster R: High-resolution analysis of DNA copy number using oligonucleotide microarrays. Genome Res 2004, 14:287-295.
15. Mei R, Hubbell E, Bekiranov S, Mittmann M, Christians FC, Shen MM, Lu G, Fang J, Liu WM, Ryder T, Kaplan P, Kulp D, Webster TA: Probe selection for high-density oligonucleotide arrays. Proc Natl Acad Sci USA 2003, 100:11237-11242.
16. Zhang L, Wu C, Carta R, Zhao H: Free energy of DNA duplex formation on short oligonucleotide microarrays. Nucleic Acids Res 2007, 35:e18.
17. Gräf S, Nielsen FG, Kurtz S, Huynen MA, Birney E, Stunnenberg H, Flicek P: Optimized design and assessment of whole genome tiling arrays. Bioinformatics 2007, 23:i195-204.
18. Breslauer KJ, Frank R, Blöcker H, Marky LA: Predicting DNA duplex stability from the base sequence. Proc Natl Acad Sci USA 1986, 83:3746-3750.
19. Sugimoto N, Nakano S, Yoneyama M, Honda K: Improved thermodynamic parameters and helix initiation factor to predict stability of DNA duplexes. Nucleic Acids Res 1996, 24:4501-4505.
20. Griffith M, Tang MJ, Griffith OL, Morin RD, Chan SY, Asano JK, Zeng T, Flibotte S, Ally A, Baross A, Hirst M, Jones SJ, Morin GB, Tai IT, Marra MA: ALEXA: a microarray design platform for alternative expression analysis. Nat Methods 2008, 5:118.
21. Baldocchi RA, Glynne RJ, Chin K, Kowbel D, Collins C, Mack DH, Gray JW: Design considerations for array CGH to oligonucleotide arrays. Cytometry A 2005, 67:129-136.
22. Singh-Gasson S, Green RD, Yue Y, Nelson C, Blattner F, Sussman MR, Cerrina F: Maskless fabrication of light-directed oligonucleotide microarrays using a digital micromirror array. Nat Biotechnol 1999, 17:974-978.
23. Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K: dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 2001, 29:308-311.
24. Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW, Lee C: Detection of large-scale variation in the human genome. Nat Genet 2004, 36(9):949-951.
25. Zhang J, Feuk L, Duggan GE, Khaja R, Scherer SW: Development of bioinformatics resources for display and analysis of copy number and other structural variants in the human genome. Cytogenet Genome Res 2006, 115:205-214.
26. Edgar R, Domrachev M, Lash AE: Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res 2002, 30:207-210.
27. NCBI Gene Expression Omnibus, Series GSE12208 [http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE12208]
28. Markham NR: Hybrid: A software system for nucleic acid folding, hybridizing and melting predictions. In Masters thesis Rensselaer Polytechnic Institute, Troy, NY; 2003.
29. Zhang Z, Schwartz S, Wagner L, Miller W: A greedy algorithm for aligning DNA sequences. J Comput Biol 2000, 7:203-214.
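As an illustration of standard filters 3 and 4 from the section "Oligonucleotide properties and standard filters", the melting temperature formula and the homopolymer limit can be sketched in a few lines of Python. The function names and the `passes_filters` helper are hypothetical and introduced only for this example; the thresholds (a half-window of 5°C around the median Tm and a maximum homopolymer length of 5) are the ones quoted in the text.

```python
def gc_percent(seq: str) -> float:
    """Percent GC content of a non-empty oligonucleotide sequence."""
    seq = seq.upper()
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def melting_temperature(seq: str) -> float:
    """Tm = 64.9 + 0.41 * GC - 500 / L, with GC in percent and L the length."""
    return 64.9 + 0.41 * gc_percent(seq) - 500.0 / len(seq)


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of one repeated nucleotide (non-empty input)."""
    longest = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest


def passes_filters(seq: str, median_tm: float,
                   half_window: float = 5.0, max_homopolymer: int = 5) -> bool:
    """Hypothetical helper combining the Tm window and the homopolymer limit."""
    in_window = abs(melting_temperature(seq) - median_tm) <= half_window
    return in_window and longest_homopolymer(seq) <= max_homopolymer


# A 50-mer with 50% GC: 64.9 + 0.41 * 50 - 500 / 50 = 75.4
example = "ACGT" * 12 + "AC"
print(round(melting_temperature(example), 1))
```

In practice `median_tm` would be taken over all candidate oligonucleotides for the array, for instance with `statistics.median`.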
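The 15-mer count in standard filter 7 is defined as the sum of the genomic frequencies of all constituent 15-mers of an oligonucleotide. A minimal sketch follows, assuming the genome fits in memory as a single string and counting the forward strand only; the text does not say whether both strands were counted, so that choice is an assumption, as are the function names.

```python
from collections import Counter


def kmer_frequencies(genome: str, k: int = 15) -> Counter:
    """Count every overlapping k-mer on the forward strand of the genome."""
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


def kmer_count(oligo: str, freqs: Counter, k: int = 15) -> int:
    """Sum the genomic frequency of each constituent k-mer of the oligo."""
    return sum(freqs[oligo[i:i + k]] for i in range(len(oligo) - k + 1))


# Toy example with k = 3 so the arithmetic is easy to follow by hand.
toy_genome = "ACGTACGTAC"
toy_freqs = kmer_frequencies(toy_genome, k=3)
print(kmer_count("ACGTA", toy_freqs, k=3))  # ACG (2) + CGT (2) + GTA (2) = 6
```

Applying the filter then amounts to discarding oligonucleotides whose count exceeds the median count over all candidates. For a genome of realistic size a dedicated k-mer counting tool would be more practical than an in-memory `Counter`.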
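The resolving power calculation described in the Methods can be approximated as follows. This is only a sketch under stated assumptions: it uses a two-sample statistic with a normal approximation to the tail probability rather than the exact formulation of reference [12], and the means and standard deviations passed in at the bottom are made-up illustrative numbers, not values from this study. Under this approximation the logarithm of the p-value is close to, but not exactly, linear in the number of probes.

```python
import math


def expected_log10_p(n_probes: int, mean_normal: float, sd_normal: float,
                     mean_aberrant: float, sd_aberrant: float,
                     n_total: int) -> float:
    """log10 of the expected two-sided p-value for an aberration of n_probes.

    Assumption: Gaussian log2ratio in both regions, normal approximation
    to the two-sample test statistic.
    """
    n_normal = n_total - n_probes
    std_err = math.sqrt(sd_aberrant ** 2 / n_probes + sd_normal ** 2 / n_normal)
    z = abs(mean_normal - mean_aberrant) / std_err
    return math.log10(math.erfc(z / math.sqrt(2.0)))


def regression_slope(xs, ys) -> float:
    """Ordinary least-squares slope of ys against xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def resolving_power_slope(sd: float, n_total: int = 380_000) -> float:
    """Slope of log10(p) versus number of probes for illustrative inputs."""
    probes = list(range(2, 11))
    logs = [expected_log10_p(n, 0.0, sd, -0.5, sd, n_total) for n in probes]
    return regression_slope(probes, logs)


print(resolving_power_slope(sd=0.2))  # cleaner data: steeper (more negative)
print(resolving_power_slope(sd=0.4))  # noisier data: shallower slope
```

A steeper negative slope means that fewer probes are needed to reach a given level of confidence, which is how a single number can summarize a resolving power curve.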
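The perturbed oligonucleotides described under "Oligonucleotide microarray design" keep the GC content of the original probe. One possible way to generate such perturbations is sketched below; it is not necessarily the procedure used for the arrays. Mismatches are introduced by swapping A with T and C with G, and flanks are shuffled while a chosen stretch is left untouched. Both operations preserve GC content by construction. Note that a shuffled flank can match the original at some positions by chance, so the realized perfect stretch may be longer than requested.

```python
import random

# Assumption for this sketch: a mismatch swaps a base within its GC class.
SWAP = {"A": "T", "T": "A", "C": "G", "G": "C"}


def introduce_mismatches(oligo: str, n_mismatches: int,
                         rng: random.Random) -> str:
    """Return a copy of oligo with exactly n_mismatches substituted positions."""
    bases = list(oligo)
    for pos in rng.sample(range(len(bases)), n_mismatches):
        bases[pos] = SWAP[bases[pos]]
    return "".join(bases)


def shuffle_flanks(oligo: str, start: int, stop: int,
                   rng: random.Random) -> str:
    """Keep oligo[start:stop] intact and shuffle the remaining bases."""
    flank = list(oligo[:start] + oligo[stop:])
    rng.shuffle(flank)
    return "".join(flank[:start]) + oligo[start:stop] + "".join(flank[start:])


def gc_count(seq: str) -> int:
    """Number of G and C bases in the sequence."""
    return sum(base in "GC" for base in seq)


rng = random.Random(0)  # fixed seed so the example is reproducible
original = "ACGTTGCAAGCTTAGGCTAACGTTAGCCATGGATCCGATTACGCTAGGCA"  # 50-mer
mismatched = introduce_mismatches(original, 5, rng)
middle_kept = shuffle_flanks(original, 14, 36, rng)  # 22-base middle stretch
print(gc_count(original), gc_count(mismatched), gc_count(middle_kept))
```

Setting `start=0` or `stop=len(oligo)` in `shuffle_flanks` gives a perfect stretch on the left or right side instead of the middle.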
