@prefix vivo: . @prefix edm: . @prefix ns0: . @prefix dcterms: . @prefix skos: . vivo:departmentOrSchool "Medical Genetics, Department of"@en, "Medicine, Faculty of"@en, "Pathology and Laboratory Medicine, Department of"@en, "Other UBC"@en, "Non UBC"@en ; edm:dataProvider "DSpace"@en ; ns0:identifierCitation "BMC Bioinformatics. 2007 Oct 02;8(1):368"@en ; ns0:rightsCopyright "Baross et al."@en ; dcterms:creator "Baross, Ágnes"@en, "Delaney, Allen D."@en, "Li, H. I."@en, "Nayar, Tarun"@en, "Flibotte, Stephane"@en, "Qian, Hong"@en, "Chan, Susanna Y."@en, "Asano, Jennifer"@en, "Ally, Adrian"@en, "Cao, Manqiu"@en, "Birch, Patricia"@en, "Brown-John, Mabel"@en, "Fernandes, Nicole"@en, "Go, Anne"@en, "Kennedy, Giulia"@en, "Langlois, Sylvie"@en, "Eydoux, Patrice"@en, "Friedman, J. M. (Jan Marshall), 1947-"@en, "Marra, Marco, 1966-"@en ; dcterms:issued "2016-01-21T23:07:40Z"@*, "2007-10-02"@en ; dcterms:description """Background: Genomic deletions and duplications are important in the pathogenesis of diseases, such as cancer and mental retardation, and have recently been shown to occur frequently in unaffected individuals as polymorphisms. Affymetrix GeneChip whole genome sampling analysis (WGSA) combined with 100 K single nucleotide polymorphism (SNP) genotyping arrays is one of several microarray-based approaches that are now being used to detect such structural genomic changes. The popularity of this technology and its associated open source data format have resulted in the development of an increasing number of software packages for the analysis of copy number changes using these SNP arrays. Results: We evaluated four publicly available software packages for high throughput copy number analysis using synthetic and empirical 100 K SNP array data sets, the latter obtained from 107 mental retardation (MR) patients and their unaffected parents and siblings. 
We evaluated the software with regards to overall suitability for high-throughput 100 K SNP array data analysis, as well as effectiveness of normalization, scaling with various reference sets and feature extraction, as well as true and false positive rates of genomic copy number variant (CNV) detection. Conclusion: We observed considerable variation among the numbers and types of candidate CNVs detected by different analysis approaches, and found that multiple programs were needed to find all real aberrations in our test set. The frequency of false positive deletions was substantial, but could be greatly reduced by using the SNP genotype information to confirm loss of heterozygosity."""@en ; edm:aggregatedCHO "https://circle.library.ubc.ca/rest/handle/2429/56669?expand=metadata"@en ; skos:note "BioMed Central
BMC Bioinformatics
Open Access
Research article

Assessment of algorithms for high throughput detection of genomic copy number variation in oligonucleotide microarray data

Ágnes Baross1,5, Allen D Delaney1, H Irene Li1, Tarun Nayar1, Stephane Flibotte1, Hong Qian1, Susanna Y Chan1, Jennifer Asano1, Adrian Ally1, Manqiu Cao2, Patricia Birch3, Mabel Brown-John1, Nicole Fernandes3, Anne Go1, Giulia Kennedy2, Sylvie Langlois3, Patrice Eydoux4, JM Friedman3 and Marco A Marra*1,3

Address: 1Genome Sciences Centre, BC Cancer Agency, British Columbia Cancer Agency, Suite 100, 570 West 7th Avenue, Vancouver, BC, V5Z 4S6, Canada, 2Affymetrix Inc., 3420 Central Expressway, Santa Clara, CA 95051, USA, 3Dept. of Medical Genetics, University of British Columbia, Children's & Women's Hospital, Box 153, 4500 Oak Street, Vancouver, BC, V6H 3N1, Canada, 4Dept. of Pathology and Laboratory Medicine, BC Children's Hospital, 4480 Oak Street, Vancouver, BC, V6H 3N1, Canada and 5Genome British Columbia, 500-555 West 8th Avenue, Vancouver, BC, V5Z 1C6, Canada

Email: Ágnes Baross - abaross@genomebc.ca; Allen D Delaney - adelaney@bcgsc.ca; H Irene Li - ili@bcgsc.ca; Tarun Nayar - tnayar@bcgsc.ca; Stephane Flibotte - sflibotte@bcgsc.ca; Hong Qian - hqian@bcgsc.ca; Susanna Y Chan - schan@bcgsc.ca; Jennifer Asano - jasano@bcgsc.ca; Adrian Ally - aally@bcgsc.ca; Manqiu Cao - manqiu.cao@intel.com; Patricia Birch - birch@interchange.ubc.ca; Mabel Brown-John - mbjohn@bcgsc.ca; Nicole Fernandes - nfernandes@cw.bc.ca; Anne Go - ago@bcgsc.ca; Giulia Kennedy - Giulia_Kennedy@affymetrix.com; Sylvie Langlois - slanglois@cw.bc.ca; Patrice Eydoux - peydoux@cw.bc.ca; JM Friedman - frid@interchange.ubc.ca; Marco A Marra* - mmarra@bcgsc.ca
* Corresponding author

Abstract
Background: Genomic deletions and duplications are important in the pathogenesis of diseases, such as cancer and mental retardation, and have recently been shown to occur frequently in unaffected individuals as polymorphisms. Affymetrix GeneChip whole genome sampling analysis (WGSA) combined with 100 K single nucleotide polymorphism (SNP) genotyping arrays is one of several microarray-based approaches that are now being used to detect such structural genomic changes. The popularity of this technology and its associated open source data format have resulted in the development of an increasing number of software packages for the analysis of copy number changes using these SNP arrays.
Results: We evaluated four publicly available software packages for high throughput copy number analysis using synthetic and empirical 100 K SNP array data sets, the latter obtained from 107 mental retardation (MR) patients and their unaffected parents and siblings. 
We evaluated the software with regards to overall suitability for high-throughput 100 K SNP array data analysis, as well as effectiveness of normalization, scaling with various reference sets and feature extraction, as well as true and false positive rates of genomic copy number variant (CNV) detection.
Conclusion: We observed considerable variation among the numbers and types of candidate CNVs detected by different analysis approaches, and found that multiple programs were needed to find all real aberrations in our test set. The frequency of false positive deletions was substantial, but could be greatly reduced by using the SNP genotype information to confirm loss of heterozygosity.

Published: 2 October 2007
BMC Bioinformatics 2007, 8:368 doi:10.1186/1471-2105-8-368
Received: 8 December 2006
Accepted: 2 October 2007
This article is available from: http://www.biomedcentral.com/1471-2105/8/368
© 2007 Baross et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

BMC Bioinformatics 2007, 8:368 http://www.biomedcentral.com/1471-2105/8/368

Background
Chromosomal abnormalities frequently contribute to human disorders including cancer [1-3] and mental retardation (MR) [4-6], and characterization of these DNA alterations is important for both diagnosis and understanding of disease mechanisms. A surprising recent finding has been the extent to which genomic copy number variants (CNVs) also exist in the normal population [7-13]. 
Such variation may represent an important class of mutations that predispose to disease.

Conventional cytogenetic studies such as karyotyping are routinely used to detect genomic deletions and duplications involving more than 5–10 Mb, but detection of submicroscopic aberrations requires higher resolution approaches. Oligonucleotide microarray technologies offer high resolution, scalable methods for whole genome screening and can detect previously unidentified CNVs [6,14-17]. Among these approaches, the Affymetrix GeneChip® Mapping Assay [18,19] is increasingly used for detecting CNVs in human DNA. This method involves a whole genome sampling analysis (WGSA) combined with high-density SNP genotyping oligonucleotide arrays. The first such arrays contained 1,494 SNPs, and the subsequent 10 K arrays consisted of 11,555 SNPs [14]. Further development resulted in the 100 K array set with probes for 116,204 SNPs [16], and now the 500 K array set containing 500,568 SNPs [18] is available. All these arrays can be used to estimate copy number changes from probe intensities, determine SNP genotypes by allele-specific hybridization, confirm loss of heterozygosity, detect uniparental disomy, identify non-paternity and determine haplotypes and parental origin of CNVs.

A number of software packages are available for analysis of oligonucleotide arrays [14,20-23]. Three software packages, listed in Table 1, are currently in common use for copy number analysis of Affymetrix 100 K SNP WGSA data: Copy Number Analyser for GeneChip® arrays (CNAG) [22,24], DNA-Chip Analyzer (dChip) [23,25] and Affymetrix GeneChip® Chromosome Copy Number Analysis Tool (CNAT) [14,18]. All of these software packages perform normalization, scaling and feature extraction of signal intensities, and enable detection of copy number alterations, but each package uses a different algorithm for these functions. 
Briefly, CNAG normalizes and scales the test sample against a \"best-fit\" user-defined reference set and corrects the signal intensity ratios for the differences in PCR product length and GC content. After feature extraction a Hidden Markov Model (HMM) algorithm is applied to infer copy numbers along each chromosome [22]. dChip normalizes and scales data within and between chips using a procedure established for Affymetrix GeneChip® arrays [23], and then compares the test sample to a user-defined reference set of samples to estimate copy numbers in the test sample. This output is then used by an HMM to infer copy numbers [23]. CNAT compares a test sample to a reference set of 106 samples provided by Affymetrix [18] or to a user-defined reference set to estimate the copy number of each SNP locus, and then applies a Kernel Smoothing algorithm to identify the regions of copy number alteration [14]. The relative performance of these methods in performing high throughput oligonucleotide array normalization, scaling and feature extraction and their performance in the sensitivity or specificity of CNV detection have not previously been reported, nor have the effects of different reference sets on CNV discovery. Accordingly, in this study we compared the performance of CNAG, dChip and CNAT software (Table 1) using synthetic data and an empirical data set that contains CNVs validated predominantly by fluorescent in situ hybridization (FISH). We report assessment of the normalization, scaling and feature extraction algorithms of these packages, as well as assessment of the approaches used for identification of CNVs and their boundaries. In addition, we tested the impact of reference set size and composition on CNV detection with each software package. Finally, we estimated the true and false positive detection rates of these various approaches for the identification of genomic gains or losses.

Results and discussion
The purpose of this study was to compare the performance of various software packages and the effect of different reference sets on identification of CNVs in Affymetrix 100 K SNP array data. We performed the evaluations described here using a synthetic data set and an empirical data set that we generated from 331 individuals (Additional file 1). The sample set was derived from 107 patients with mental retardation (MR) and their unaffected mothers and fathers, as well as 10 unaffected siblings of the patients. Several of the individuals studied have CNVs that were validated using independent methods [6].

We performed 100 K SNP WGSA experiments using 662 arrays of which 331 were Xba 50 K chips and 331 were Hind 50 K chips (Additional file 1). From individual oligonucleotide probe intensities, we determined the SNP genotypes (Figure 1; Methods) and performed initial copy number analysis using each of the software packages listed in Table 1. Of the software packages we analyzed, only those developed for Affymetrix GeneChip Mapping 100 K arrays are capable of normalization, scaling and feature extraction of Affymetrix data (Table 1). Hence, we used CNAG, dChip or CNAT to perform this procedure on our array data.

CNAG and dChip use HMM-based algorithms to detect regions of genomic gains and losses and estimate their boundaries (Table 1). CNAT provides plots of copy number and associated p-values along each chromosome, but does not report CNVs or their boundaries. 
For the estimation of CNVs and their breakpoints, we evaluated the utility of CNAG, dChip and GLAD [26], the latter developed originally for array-CGH data analysis (Table 1).

Table 1: List of copy number analysis software packages evaluated
Developed for Affymetrix GeneChip Mapping 100 K Array Data Analysis
Software | Name | Normalization, scaling and feature extraction (a) | Smoothing | Estimation | Reference
CNAG 1.1 | Copy Number Analyser for GeneChip | yes | yes | yes | [22]
dChip (Nov 17, 2005) | DNA-Chip Analyzer | yes | yes | yes | [23]
CNAT 3.0 | Chromosome Copy Number Analysis Tool | yes | yes | no | [14]
Developed for Array CGH Data Analysis
Software | Name | Normalization, scaling and feature extraction | Smoothing | Estimation | Reference
GLAD (R) | Gain and Loss Analysis of DNA | no | yes | yes | [26]
(a) Capability to perform normalization, scaling and feature extraction on Affymetrix GeneChip® Mapping 100 K array data.

Detection of candidate copy number variants from synthetic data
As an initial assessment of the software packages, we constructed a synthetic data set in which we purposely introduced artificial CNVs, and then measured CNV detection performance, including true and false positive detection rates, of the software approaches.

Our data set contained 30 artificial normalized array results produced from a normal individual's genome and subsequent comparison to a reference set of 50 individuals (Methods). Normalization was performed using CNAG, dChip and CNAT for these synthetic array results (10 by each software). We then introduced 100 simulated CNVs into each of the 30 synthetic samples with probe set widths ranging from 5 to 23 and copy numbers ranging from 0.3 to 3.0 (Methods). Detection of CNVs in these normalized data was then performed using dChip and GLAD (Methods). CNV detection could not be performed using CNAG, because this software does not accept intermediate stage normalized data as input.

The total numbers of putative CNVs detected from the synthetic arrays, and assessments of false positive and false negative rates are shown in Table 2. None of the software detected all true CNVs as the true positive rates fell between 0.23 (CNAT-GLAD, Hind data, Table 2) and 0.58 (CNAG-GLAD, Xba data, Table 2). All of the approaches had false discovery rates ranging between 0 (CNAG-GLAD, Hind data, Table 2) and 0.44 (dChip, Xba data, Table 2). We observed generally superior performance in the detection of deletions compared to the detection of duplications. We found that dChip analysis of the synthetic data resulted in the identification of the largest number of putative CNVs but yielded fairly low true positive rates (0.32 and 0.26 for Xba and Hind data, respectively) and the highest false discovery rates (0.44 and 0.42 for Xba and Hind data, respectively) (Table 2). In this analysis the CNAG-GLAD approach showed the best overall true positive rates (0.58 and 0.42 for Xba and Hind data, respectively) and lowest false discovery rates (0.009 and 0 for Xba and Hind data, respectively).

Detection of candidate copy number variants from empirical data
To assess the performance of the software approaches on empirical data, we next analyzed 662 Affymetrix SNP arrays, employing five approaches that in total used four software packages (Figure 1, Table 3). To detect regions of genomic gains and losses after normalization, scaling and feature extraction by CNAG, dChip or CNAT, we applied the HMM algorithms of CNAG and dChip, as well as the adaptive weights smoothing (AWS) algorithm of GLAD (Figure 1, Table 3). Due to the difference in normal X chromosome copy numbers between males and females, detection of X chromosome CNVs requires more complex approaches than autosomal CNVs, and not all of the software packages tested here were able to score genomic copy number along the sex chromosomes. Hence, we focused on copy number assessment of autosomal regions. To identify a candidate CNV, we arbitrarily imposed a requirement for at least four adjacent SNPs that demonstrated a similar apparent gain or loss of copy number.

To determine the effect of reference set size and composition on CNV detection, we used four reference sets in our analyses (Figure 1, Table 3). The algorithmic differences between CNAT, dChip and CNAG impose different requirements with regards to the size range of the reference set used. However, by default, all of the software packages we used assume that the reference set has a mean copy number of 2.0 at all autosomal locations. A large reference set would usually satisfy this assumption because, in such a set, rare polymorphic CNVs will have negligible effects. A large reference set also provides the advantage of reducing noise arising from the comparison. However, common polymorphic CNVs in the reference sets could still affect the results.

Pair wise comparisons of one sample to another can only be performed with CNAG [22]. 
This may be useful in the case of parent-offspring \"trio\" analyses, such as reported in [6]. Direct comparison of array data derived from a child to data derived from the parents is the most straightforward means of distinguishing de novo mutations from inherited CNVs, and the boundaries of inherited aberrations should usually be the same in the parent and child. Thus, we tested CNAG, as well as CNAG normalization, scaling and feature extraction combined with GLAD CNV detection (CNAG-GLAD) using three pair-wise comparisons within each trio – child to father, child to mother, and father to mother (Figure 1, Table 3, Methods). We refer to this analysis within trios as a \"reference set of 2\". dChip and CNAT require the use of larger reference sets: the minimum sizes required are 10 for dChip [23] and 50 for CNAT [18]. So that one consistent reference set could be used to compare the performance of all three software packages, we chose a reference set of 50 unaffected mothers of children with MR (Figure 1, Table 3, Methods). These 50 mothers chosen were those with the fewest candidate CNVs identified by dChip, compared to the other mothers in our data set (using a reference set of all 214 parents). With dChip and CNAT, it is possible to increase the size of the reference set further, so we tested whether this would be advantageous. For this purpose, we assembled a reference set of 214 unaffected parents of children with MR (Figure 1, Table 3, Methods). For CNAT, there is a default reference set of 106 individuals provided by Affymetrix, which was also evaluated (Figure 1, Table 3, Methods).

Figure 1. Overview of the data analysis process. A) Methods appear in blue, and data in yellow. B) The reference sets used for each analysis method are as follows. '2': within each MR trio (child, mother and father), three comparisons were done – child to father as reference, child to mother as reference, and father to mother as reference. '50': each sample was compared to a reference set of 50 unaffected mothers of children with MR. These 50 mothers selected for this reference set had the lowest numbers of CNVs detected by dChip compared to other mothers. '214': each sample was compared to a reference set that included all 214 unaffected parents (107 mothers and 107 fathers) of the children with MR. '106': a default reference set of 106 individuals provided by Affymetrix for copy number analysis with CNAT [18].
[Figure 1, panel A (flow chart): Mapping 100K SNP Chip experiment; GeneChip Operating Software (GCOS) raw intensity data; GDAS genotype analysis giving SNP genotype calls; CNAG, dChip or CNAT normalization and comparison to reference, giving SNP copy number values and log2 ratios; CNAG HMM, dChip HMM or GLAD AWS; t-test; putative CNVs with start and end SNP positions; candidate duplications and candidate deletions; genotype check of deletions, giving false positive deletions or candidate deletions confirmed by genotypes. Panel B (reference sets marked per method): CNAG 2 marked; CNAG-GLAD 2 marked; dChip 2 marked; dChip-GLAD 2 marked; CNAT-GLAD 3 marked.]

Table 2: Candidate copy number variants from synthetic data
Total from Xba data
Method (a) | # CNVs | # Duplications | % | # Deletions | % | # True CNVs | # Unique True CNVs (b) | True Positive Rate (power) (c) | # False Positive CNVs | False Discovery Rate
CNAG-GLAD | 334 | 20 | 6 | 314 | 94 | 331 | 58 | 0.58 | 3 | 0.009
dChip | 381 | 166 | 44 | 215 | 56 | 213 | 32 | 0.32 | 168 | 0.44
dChip-GLAD | 70 | 0 | 0 | 70 | 100 | 70 | 31 | 0.31 | 0 | 0
CNAT-GLAD | 111 | 10 | 9 | 101 | 90 | 101 | 36 | 0.36 | 10 | 0.09
Total from Hind data
Method | # CNVs | # Duplications | % | # Deletions | % | # True CNVs | # Unique True CNVs | True Positive Rate (Power) | # False Positive CNVs | False Discovery Rate
CNAG-GLAD | 70 | 11 | 16 | 59 | 84 | 70 | 42 | 0.42 | 0 | 0
dChip | 269 | 91 | 34 | 178 | 66 | 155 | 26 | 0.26 | 114 | 0.42
dChip-GLAD | 101 | 5 | 5 | 96 | 95 | 94 | 33 | 0.33 | 7 | 0.07
CNAT-GLAD | 49 | 0 | 0 | 49 | 100 | 48 | 23 | 0.23 | 1 | 0.02
(a) Where two software packages are listed, the first one was used for normalization and the second for CNV detection.
(b) CNVs with different chromosomal locations and breakpoints.
(c) The number of true (synthetic) CNVs per array is 100.

The lists of candidate CNVs and their boundaries identified in the 331 samples by CNAG using reference sets of 2 and 50 individuals are shown in Additional files 2 and 3, respectively. The CNVs detected with the CNAG-GLAD approach are listed in Additional file 4 (reference set of 2) and Additional file 5 (reference set of 50). Putative CNVs found with dChip are shown in Additional files 6 and 7 (reference set of 50) and Additional file 8 (reference set of 214). CNVs detected by GLAD from the feature extracted data by dChip (dChip-GLAD) are shown in Additional file 9 (reference set of 50) and Additional file 10 (reference set of 214). Candidate CNVs identified using GLAD from SNP copy number log2-ratios calculated by CNAT (CNAT-GLAD) are listed in Additional file 11 (reference set of 50), Additional file 12 (Affymetrix default reference set of 106) and Additional file 13 (reference set of 214).

Table 3 summarizes the numbers of candidate genomic deletions and duplications identified using each of these combinations of methods and reference sets on the 331 individuals studied. The data are presented for the Xba and Hind arrays separately, so a CNV that was identified in a particular sample by both array types is listed under both (Table 3). There is great variability in the numbers and types of CNVs detected from the same sample set with different analysis approaches. The fewest candidate CNVs were detected using CNAG-GLAD and CNAG with the reference set of 50 – 340 from Xba and 324 from Hind data, respectively (Table 3). The most putative CNVs were identified by dChip with the reference set of 50 – 31,354 from Xba and 21,124 from Hind data (Table 3).

The types of candidate CNVs detected also varied greatly among the 11 approaches. Duplications accounted for between 2% and 89% of all CNVs (Table 3). The lowest proportion of duplications, and thus the highest proportion of deletions, was identified by dChip-GLAD. The three highest proportions of duplications and lowest proportions of deletions were detected by dChip with the reference sets of 50 and 214, and by CNAG with the reference set of 50 (Table 3).

For three Hind and two Xba arrays, each from a different sample, candidate CNVs were not detected by any approach. Data from 97 Hind and 90 Xba chips predicted 30 or more putative CNVs by at least one method. However, none of the arrays had 30 or more aberrations detected by all of the 11 approaches.

Table 3: Candidate copy number variants from empirical data
All from Xba data
Method | Reference Set | # CNVs | # Duplications | % (a) | # Deletions | % (a) | # False Positive Deletions (b) | % (c)
1-CNAG | 2 | 3,210 | 1,755 | 55 | 1,455 | 45 | 970 | 67
2-CNAG | 50 | 924 | 820 | 89 | 104 | 11 | 35 | 34
3-CNAG-GLAD | 2 | 1,850 | 996 | 54 | 854 | 46 | 343 | 40
4-CNAG-GLAD | 50 | 340 | 69 | 20 | 271 | 80 | 62 | 23
5-dChip | 50 | 31,354 | 19,093 | 61 | 12,261 | 39 | 3,830 | 31
6-dChip | 214 | 5,443 | 4,076 | 75 | 1,367 | 25 | 452 | 33
7-dChip-GLAD | 50 | 1,292 | 66 | 5 | 1,226 | 95 | 456 | 37
8-dChip-GLAD | 214 | 1,207 | 30 | 2 | 1,177 | 98 | 402 | 34
9-CNAT-GLAD | 50 | 701 | 253 | 36 | 448 | 64 | 214 | 48
10-CNAT-GLAD | 106 | 454 | 232 | 51 | 222 | 49 | 98 | 44
11-CNAT-GLAD | 214 | 866 | 240 | 28 | 626 | 72 | 363 | 58
p <= 0.05 and copy number < 1.25 or > 2.75 from Xba data
Method | Reference Set | # CNVs | # Duplications | % | # Deletions | % | # False Positive Deletions | %
1-CNAG | 2 | 444 | 361 | 81 | 83 | 19 | 21 | 25
2-CNAG | 50 | 235 | 211 | 90 | 24 | 10 | 3 | 13
3-CNAG-GLAD | 2 | 416 | 332 | 80 | 84 | 20 | 27 | 32
4-CNAG-GLAD | 50 | 133 | 48 | 36 | 85 | 64 | 17 | 20
5-dChip | 50 | 17,034 | 4,846 | 28 | 12,188 | 72 | 3,804 | 31
6-dChip | 214 | 2,273 | 907 | 40 | 1,366 | 60 | 452 | 33
7-dChip-GLAD | 50 | 1,042 | 27 | 3 | 1,015 | 97 | 313 | 31
8-dChip-GLAD | 214 | 1,027 | 22 | 2 | 1,005 | 98 | 283 | 28
9-CNAT-GLAD | 50 | 426 | 87 | 20 | 339 | 80 | 115 | 34
10-CNAT-GLAD | 106 | 272 | 88 | 32 | 184 | 68 | 61 | 33
11-CNAT-GLAD | 214 | 540 | 117 | 22 | 423 | 78 | 172 | 41
All from Hind data
Method | Reference Set | # CNVs | # Duplications | % | # Deletions | % | # False Positive Deletions | %
1-CNAG | 2 | 2,127 | 1,161 | 55 | 966 | 45 | 638 | 66
2-CNAG | 50 | 324 | 202 | 62 | 122 | 38 | 41 | 34
3-CNAG-GLAD | 2 | 1,299 | 697 | 54 | 602 | 46 | 206 | 34
4-CNAG-GLAD | 50 | 366 | 20 | 5 | 346 | 95 | 87 | 25
5-dChip | 50 | 21,124 | 17,843 | 84 | 3,281 | 16 | 1,402 | 43
6-dChip | 214 | 5,792 | 4,603 | 79 | 1,189 | 21 | 469 | 39
7-dChip-GLAD | 50 | 790 | 42 | 5 | 748 | 95 | 253 | 34
8-dChip-GLAD | 214 | 806 | 41 | 5 | 765 | 95 | 274 | 36
9-CNAT-GLAD | 50 | 650 | 108 | 17 | 542 | 83 | 210 | 39
10-CNAT-GLAD | 106 | 360 | 90 | 25 | 270 | 75 | 108 | 40
11-CNAT-GLAD | 214 | 462 | 56 | 12 | 406 | 88 | 161 | 40
p <= 0.05 and copy number < 1.25 or > 2.75 from Hind data
Method | Reference Set | # CNVs | # Duplications | % | # Deletions | % | # False Positive Deletions | %
1-CNAG | 2 | 377 | 300 | 80 | 77 | 20 | 26 | 34
2-CNAG | 50 | 52 | 12 | 23 | 40 | 77 | 15 | 38
3-CNAG-GLAD | 2 | 383 | 287 | 75 | 96 | 25 | 29 | 30
4-CNAG-GLAD | 50 | 140 | 9 | 6 | 131 | 94 | 38 | 29
5-dChip | 50 | 6,488 | 3,230 | 50 | 3,258 | 50 | 1,392 | 43
6-dChip | 214 | 1,822 | 637 | 35 | 1,185 | 65 | 468 | 39
7-dChip-GLAD | 50 | 744 | 23 | 3 | 721 | 97 | 238 | 33
8-dChip-GLAD | 214 | 748 | 18 | 2 | 730 | 98 | 245 | 34
9-CNAT-GLAD | 50 | 547 | 52 | 10 | 495 | 90 | 182 | 37
10-CNAT-GLAD | 106 | 311 | 67 | 22 | 244 | 78 | 89 | 36
11-CNAT-GLAD | 214 | 426 | 48 | 11 | 378 | 89 | 150 | 40
(a) Percentage of all candidate CNVs.
(b) Numbers of false positive deletions were estimated using SNP genotype data (Methods).
(c) This is the percentage of false positives among the deletions (# false positive deletions/# deletions * 100).

False positive rate
The ultimate approach to determining the false positive rate of each copy number analysis method would be to attempt validation of each candidate CNV using an independent method. A subset of putative CNVs was confirmed using FISH (Table 4) [6], but it was not feasible to do this for all of the many thousands of candidate CNVs detected in this study (Table 3).

Table 4: Detection of validated CNVs
Sample ID | CNV | Chr | Length (kb) | CNAG Ref2 (a) | CNAG Ref50 | CNAG-GLAD Ref2 | CNAG-GLAD Ref50 | dChip Ref50 | dChip Ref214 | dChip-GLAD Ref50 | dChip-GLAD Ref214 | CNAT-GLAD Ref50 | CNAT-GLAD Ref106 | CNAT-GLAD Ref214 | # Methods detected | % Methods detected
3476c | del (b) | 4 | 10,655 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 11 | 100
1895c | del | 13 | 4,887 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 11 | 100
4818c | del | 12 | 3,204 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 11 | 100
9143c | del | 11 | 3,175 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 11 | 100
8326c | del | 14 | 1,923 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 11 | 100
6235c | del | 10 | 1,737 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 9 | 82
6545c | del | 7 | 785 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 10 | 91
7807c | del | 22 | 731 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 11 | 100
4357c | del | 6 | 595 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 11 | 100
4357m | del | 6 | 595 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 11 | 100
4357c | del | 6 | 353 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 11 | 100
4357m | del | 6 | 353 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 11 | 100
5003c | del | 2 | 294 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 9 | 82
7551c | del | 2 | 220 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 11 | 100
7551m | del | 2 | 220 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 10 | 91
1280c | del | 4 | 192 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 9 | 82
1280m | del | 4 | 192 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 9 | 82
0674c | del | 2 | 147 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 9 | 82
0674f | del | 2 | 147 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 8 | 73
5566c | del | 14 | 130 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 4 | 36
6789c | del | 14 | 68 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 5 | 45
6789m | del | 14 | 68 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4 | 36
3476c | del | 1 | 66 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 8 | 73
3476m | del | 1 | 66 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 6 | 55
6607c | del | 20 | 57 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 9
6607m | del | 20 | 57 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 9
8785c | del | 18 | 43 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 7 | 64
8785f | del | 18 | 43 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 3 | 27
9299f | del | 9 | 38 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 6 | 55
9299c | del | 9 | 38 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 5 | 45
8379c | dup | 10 | 23,842 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 11 | 100
4794c | dup | 16 | 3,356 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 5 | 45
8379c | dup | 15 | 1,481 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 10 | 91
3595c | dup | 15 | 781 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 7 | 64
3595m | dup | 15 | 781 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 18
3923c | dup | 11 | 494 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 7 | 64
3923m | dup | 11 | 494 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 18
6168c | dup | 17 | 324 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 4 | 36
Number of validated CNVs detected | | | | 29 | 24 | 33 | 28 | 27 | 21 | 26 | 26 | 29 | 17 | 32 | |
% of validated CNVs detected | | | | 76 | 63 | 87 | 74 | 71 | 55 | 68 | 68 | 76 | 45 | 84 | |
(a) Detection of CNV from at least one type of array data (Xba or Hind or both). 1 means detected, 0 means not detected.
(b) The method of validation for each CNV is shown in Additional file 15.

To obtain an estimate of false positive rates among a larger number of candidate deletions, we used the SNP genotype data, assuming that deletions (with copy number of 1 or 0) should not contain heterozygous genotype calls (Figure 1, Methods). The average proportion of false positive deletions identified by SNP heterozygosity was 40%, ranging from 23% to 67% in the Xba data, and between 25% and 66% in the Hind data (Table 3). In both array types, the CNAG-GLAD combination with the reference set of 50 exhibited the lowest false positive deletion rate, and CNAG with a reference set of size 2 produced the highest. We note that these false positive rates are likely underestimates, especially for short CNVs, because stretches of homozygous SNPs could also occur by chance in regions with normal copy number.

The software packages tested here apply different algorithms and statistics for CNV detection. We examined the distribution of SNP copy numbers and found that they showed the characteristics of a Gaussian distribution. Thus, to assess and compare the significance of CNVs detected by these different approaches, we performed a t-test as follows. First, we calculated the log2-ratios of test sample copy numbers versus reference copy numbers for each SNP. Next we calculated the mean and standard deviation (SD) of these log2-ratios within each candidate CNV, and also for the rest of the same chromosome excluding the region affected by the CNV. We then compared these values using a t-test, and obtained the corresponding p-values (Additional files 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13). We then filtered the candidate CNVs using an uncorrected p <= 0.05 cutoff along with arbitrary copy number value thresholds of <1.25 for deletions and >2.75 for duplications. The candidate CNVs that passed these thresholds are summarized in Table 3. As expected, application of these cutoff values resulted in fewer CNVs and also reduced the false positive deletion rates in most instances. However, the false positive deletion call rates remained substantial, averaging 32%. 
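As a rough illustration of the filtering step described above, the following Python sketch computes per-SNP log2-ratios, compares the SNPs inside a candidate CNV with the rest of the chromosome, and applies the p-value and copy number thresholds. The function names are hypothetical. The Welch form of the t statistic and the normal approximation to the p-value are assumptions made to keep the sketch free of dependencies (the text does not specify the t-test variant, and the approximation is crude for very short CNVs), so this should not be read as the authors' code.

```python
import math
from statistics import NormalDist, mean, stdev


def log2_ratios(test_cn, ref_cn):
    '''Per-SNP log2 ratio of test copy number to reference copy number.'''
    return [math.log2(t / r) for t, r in zip(test_cn, ref_cn)]


def welch_t(inside, outside):
    '''Welch t statistic: SNPs inside a candidate CNV vs. the rest of the chromosome.'''
    standard_error = math.sqrt(
        stdev(inside) ** 2 / len(inside) + stdev(outside) ** 2 / len(outside)
    )
    return (mean(inside) - mean(outside)) / standard_error


def approx_p_value(t_stat):
    '''Two-sided p-value from a normal approximation (an assumption of this sketch).'''
    return 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))


def passes_filter(kind, mean_copy_number, p_value, alpha=0.05):
    '''Thresholds quoted in the text: p <= 0.05 and copy number < 1.25 or > 2.75.'''
    if p_value > alpha:
        return False
    if kind == 'del':
        return mean_copy_number < 1.25
    return mean_copy_number > 2.75


# Hypothetical log2-ratios for four SNPs in a candidate deletion and five outside it.
inside = [-1.0, -0.9, -1.1, -1.0]
outside = [0.1, -0.1, 0.05, -0.05, 0.0]
p = approx_p_value(welch_t(inside, outside))
print(passes_filter('del', 1.0, p))
```

A library routine based on the t distribution would be the natural replacement for `approx_p_value` when one is available.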
Lower p-value thresholds further reduced the numbers of candidate CNVs, as expected (not shown). However, substantial rates of false positive deletions still remained even with a p <= 0.00001 cutoff, with an average false positive deletion rate of 28%.

To assess how detection and false positive rates were affected by the number of SNPs included in candidate CNVs, we counted the number of candidates that included at least 4, 11, 21, 41 or 101 SNPs (Figure 2). We also calculated false positive deletion rates on the basis of SNP heterozygosity at each level and applied p-value and copy number thresholds as described above (Figure 2). We note that the rate of false positive deletion calls in the smallest size class (4–10 SNPs) may be unrealistically low in our analysis, because homozygosity is more likely to occur by chance over a few adjacent SNPs than over many. However, there was a high number of CNVs that passed our p-value (p <= 0.05) and copy number (<1.25 or >2.75) thresholds and that were predicted by <= 10 SNPs, indicating that many of the putative CNVs in this small size range may be real. Interestingly, the false positive call rate was often relatively high and the percentage of CNVs that passed the p-value and copy number thresholds was often relatively low in the largest CNV size class (>= 101 SNPs) compared to the other categories (Figure 2). The majority of false positive CNVs in this size range exhibited copy numbers that averaged only a little more or less than 2.0, but the change may have appeared significant because of the large number of SNPs involved.

Candidate CNVs predicted by multiple methods
Putative CNV regions identified by at least two software/reference set combinations from Xba or Hind data or both from the same sample are presented in Additional file 14. These regions were predicted by the software platforms without applying additional filters based on p-value or copy number thresholds. Because distinct approaches and the two different chips in the 100 K set often detect slightly different boundaries for a particular CNV, we defined mutually predicted CNVs in this analysis as those sharing at least 50% of the base pairs within a genomic segment defined by the SNPs included in the deletion or duplication.

Mutually predicted candidate CNVs consisting of fewer than 50 consecutive SNPs are listed in Additional file 14A. In this size range, a total of 8,649 putative CNVs consisting of 5,418 duplications (63%) and 3,231 deletions (37%) were detected in our sample set of 331 individuals using two or more methods. 7,497 (86%) of these putative 8,649 CNVs (<50 SNPs) were detected by 2 distinct software/reference set combinations, 919 (11%) by 3 or 4, and 233 (3%) by 5 or more approaches. 1,512 of the candidate deletions predicted by more than one approach (47% of 3,231) were considered to be false positive calls on the basis of SNP heterozygosity (Methods, Figure 1).

Mutually predicted putative CNVs of 50 or more consecutive SNPs are listed in Additional file 14B. A total of 1,084 such candidate CNVs were identified by at least two methods, including 926 duplications (85%) and 158 deletions (15%). Of these larger CNVs, 963 (89%) were identified by 2 distinct software/reference set combinations, 106 (10%) by 3 or 4, and 15 (1%) via 5 or more approaches. 154 (97% of 158) deletions predicted by more than one approach were considered to be false positive calls on the basis of SNP heterozygosity (Methods). We validated 3 of the remaining candidate deletions using FISH (Additional file 14, Table 4); the fourth one was not tested.

Rate of detection of confirmed CNVs
To determine the detection rate of real CNVs by each of the software/reference set combinations, we assessed 38 CNVs (30 deletions and 8 duplications) that were confirmed by independent experimental approaches (Table 4, Additional file 15) [6]. Some of these were inherited CNVs that had been demonstrated in both the child and a parent of one MR family. Other confirmed CNVs occurred de novo in a child with MR and were shown not to be present in either parent. SNP genotypes were used to confirm paternity in all cases.

Figure 2. Size distribution of candidate CNVs detected. The five plots show numbers of candidate copy number gains and losses identified using Xba and Hind arrays, arranged according to the numbers of SNPs within the aberrations: A) all CNVs (>= 4 SNPs); B) CNVs >= 11 SNPs; C) CNVs >= 21 SNPs; D) CNVs >= 41 SNPs and E) CNVs >= 101 SNPs. The y-axis value of each horizontal line represents the total number of CNVs detected by a given method: 1 – CNAG Ref2; 2 – CNAG Ref50; 3 – CNAG-GLAD Ref2; 4 – CNAG-GLAD Ref50; 5 – dChip Ref50; 6 – dChip Ref214; 7 – dChip-GLAD Ref50; 8 – dChip-GLAD Ref214; 9 – CNAT-GLAD Ref50; 10 – CNAT-GLAD Ref106; 11 – CNAT-GLAD Ref214 (the reference sets are described in Figure 1 and in the Methods). The left and right side of each panel correspond to the fraction of deletions and duplications, respectively. The orange bars within the black lines show the fraction of CNVs that passed the following confidence thresholds: p <= 0.05 (t-test) and copy number < 1.25 for deletions (left); or p <= 0.05 (t-test) and copy number > 2.75 for duplications (right). The fractions of false positive deletion calls, calculated based on SNP heterozygosity, are indicated by the red vertical bars on the left side of each panel. For example, the y-axis value of the top line (5) in plot 'A' indicates the total number of candidate CNVs (52,478) including at least 4 consecutive SNPs identified by dChip Ref50 (from Xba and Hind data). 30% of the 52,478 putative CNVs were deletions (left) and 70% were duplications (right). 99% of the deletions (orange fraction of the line, left) and 22% of the duplications (orange fraction of the line, right) passed our p-value <= 0.05 and copy number (<1.25 or >2.75) thresholds described above. 34% of the candidate deletions were considered to be false positives, indicated by the red bar (left).

The confirmed deletions in this set all had a copy number of 1 in the involved genomic region, and the confirmed duplications all had a copy number of 3. Deletions of ~200 kb or larger were identified by most or all (between 9 and 11) of the 11 software/reference set combinations used. As expected, detection rates were lower for smaller deletions (Table 4). Successful detection of duplications had lower rates overall (Table 4). Surprisingly, a rather large 3.3 Mb duplication was detected by only 5 of the 11 software/reference set combinations (Table 4, Additional file 15). Within this genomic segment, the average distance between SNPs was ~280 kb (Additional file 15), which is substantially greater than the 23.6 kb average distance between SNPs across the whole genome in the 100 K array set [16].

No single method identified all 38 of the confirmed CNVs. CNAG-GLAD with the reference set of 2 and CNAT-GLAD with the reference set of 214 had the highest rates for detection, identifying 33 and 32 of the 38 confirmed CNVs, respectively (Table 4).
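The 50% overlap rule used earlier to define mutually predicted CNVs could be sketched as follows. This is only an illustration: the text does not say whether the 50% is measured against one segment or both, so the sketch assumes a reciprocal rule (each call must share at least half of its base pairs with the other), and the dictionary keys (chrom, kind, start, end) and function names are hypothetical.

```python
def overlap_fraction(seg_a, seg_b):
    '''Fraction of seg_a covered by seg_b; segments are (start, end) in bp,
    with end exclusive and end > start.'''
    (a_start, a_end), (b_start, b_end) = seg_a, seg_b
    shared = max(0, min(a_end, b_end) - max(a_start, b_start))
    return shared / (a_end - a_start)


def mutually_predicted(call_a, call_b, min_fraction=0.5):
    '''True if two calls are the same kind of CNV on the same chromosome and
    each shares at least min_fraction of its length with the other
    (reciprocal overlap is an assumption of this sketch).'''
    if call_a['chrom'] != call_b['chrom'] or call_a['kind'] != call_b['kind']:
        return False
    seg_a = (call_a['start'], call_a['end'])
    seg_b = (call_b['start'], call_b['end'])
    return (overlap_fraction(seg_a, seg_b) >= min_fraction
            and overlap_fraction(seg_b, seg_a) >= min_fraction)
```

Under this reading, two deletion calls spanning 1,000–2,000 bp and 1,400–2,200 bp on the same chromosome share 600 bp, which is 60% of the first and 75% of the second, so they would count as one mutually predicted CNV.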
Two large deletions were divided by dChip and the CNAG-GLAD combined approach into multiple smaller deletions (2–4 each), instead of the single CNVs predicted by alternate approaches (Additional file 15).

Candidate CNVs per individual
To estimate the average number of CNVs per genome in our sample set, we chose a combination of three copy number analysis approaches that resulted in optimal true positive detection rate: CNAG-GLAD with the reference set of 2, dChip with the reference set of 50 and CNAT-GLAD with the reference set of 214. These three methods together detected all of the 38 confirmed CNVs in this study (Table 4, Additional file 15). We generated a list of candidate CNVs that were identified by at least one of these three approaches, from at least one array (Xba or Hind). To reduce the false positive detection rate, we eliminated all of the putative aberrations that did not meet the following criteria: p <= 0.05 (t-test); and copy number <= 1.25 (for deletions) or >= 2.75 (duplications). Deletions considered false positive based on SNP heterozygosity were also eliminated. We then calculated the average numbers of remaining candidate CNVs per individual in the 107 children with MR, and in the 224 unaffected parents and siblings of the affected children.

In the 224 unaffected individuals we found an average of 39 candidate CNVs per genome, consisting of 20 deletions and 19 duplications. The average size of the deletions was 157 kb (in the range between 190 bp and 5.5 Mb), and the average size of the duplications was 244 kb (ranging between 115 bp and 16.7 Mb). In the affected children, the average number of candidate CNVs was 45, including 26 deletions and 19 duplications. The average size of the deletions was 191 kb (ranging between 220 bp and 11.3 Mb), and the average size of the duplications was 208 kb (ranging between 220 bp and 23.8 Mb).

Theoretical resolving power
The ability to estimate genomic gains and losses and to define their boundaries is dependent on the normalization, scaling and feature extraction of the raw intensity data. More effective normalization and feature extraction yield higher signal-to-noise ratios, which enable superior detection of regions with altered copy numbers. Using SNP copy number data from the 30 validated deletions and 8 confirmed duplications listed in Table 4, we calculated theoretical resolving powers of the normalization, scaling and feature extraction algorithms used by CNAG, dChip and CNAT with the various reference sets described above (Figure 1B, Methods). We defined the resolving power as the average size of the smallest single copy deletion or duplication that could be detected at a given confidence level. Mean test versus reference SNP copy number log2-ratios were calculated from the data following feature extraction, and they showed the characteristics of a Gaussian distribution. The Welch t-test was then computed to compare mean SNP copy number ratios within a given CNV against the rest of the chromosome (Methods). For this calculation, we assumed that SNPs were uniformly distributed throughout the genome. We then estimated the p-values that would be obtained for hemizygous deletions (copy number 1) and single copy duplications (copy number 3) containing increasing numbers of adjacent SNPs using the means and standard deviations obtained from 30 confirmed deletions and 8 confirmed duplications (Methods). In genomic regions where the SNP density is higher or lower than average, corresponding p-values would be lower or higher than those presented in Figure 3, but variation in SNP density would affect the p-values across all of the methods similarly. Therefore, even though absolute p-values change with SNP density, the relative p-values presented here provide a valid comparison.

The resolving power calculated from the weighted average of the 30 validated deletions (Table 4) is shown in Figure 3A. The resolving power calculated from the weighted average of the 8 confirmed duplications (Table 4) is shown in Figure 3B. We observed that the Affymetrix Mapping 50 K XbaI and HindIII assays had similar resolving powers, so we combined the Xba and Hind data for these analyses.

dChip normalization, scaling and feature extraction provided the highest resolving power for the deletions, with negligible difference between the reference sets of 50 and 214 (Figure 3A). This result indicates that for any given p-value cutoff, on average one would expect to be able to detect the smallest one-copy deletions with dChip feature extraction and our reference set of 50 or 214. Most other methods showed only slightly decreased resolving power. Reference set selection had little effect on resolving power in most cases, although CNAT with the Affymetrix default reference set of 106 ranked the lowest (Figure 3A).

Reference set selection had a greater effect on the resolving power for duplications, and the reference sets chosen from our own data set resulted in higher resolving power than the Affymetrix default set of 106 individuals (Figure 3B). We note that our estimation of the resolving power was probably less accurate for duplications than deletions, since we had a smaller number of confirmed aberrations available for our analysis. Nevertheless, the resolving power was clearly better for deletions than for duplications, such that deletions of a given size could be detected with higher confidence than duplications of the same size.

Figure 3. Theoretical resolving power of CNAG, dChip and CNAT with reference sets of 2, 50, 106 and 214 (see Methods and Figure 1 legend). The resolving power was defined as the average size of the smallest one-copy deletion or duplication that could be detected with a given method at a given confidence level. The theoretical p-value (in log10 scale) is shown as a function of the deletion (A) or duplication (B) size detected from Affymetrix GeneChip 100 K Xba and Hind data. For a given p-value, e.g. 10^-5, the theoretical minimum size of detectable deletion or duplication is shown for each method. For a deletion or duplication of a given size, e.g. 400,000 bp, the theoretical p-values are shown for each method. [Plot, panels A and B: x-axis, deletion/duplication size (bp), 0 to 1,200,000; y-axis, log10(p), 0 to −15; curves for CNAT Ref106, CNAT Ref214, CNAT Ref50, CNAG Ref2, CNAG Ref50, dChip Ref214 and dChip Ref50.]

Conclusion
We found that CNAG, dChip, CNAT and GLAD were suitable for high-throughput processing of Affymetrix 100 K SNP array data for copy number analysis. Various reference sets selected from data produced by our research team resulted in superior feature extraction, higher signal-to-noise ratios, and higher rates of detection of confirmed CNVs than the external default reference set provided by Affymetrix. This difference may be due to experimental variation between different laboratories, to differences in the frequencies of SNP genotypes and copy number polymorphisms (CNPs) in ethnically diverse populations, or to other unidentified factors. Therefore, we recommend using a reference set that was processed in the same laboratory and, ideally, drawn from samples with a similar ethnic composition to the sample set.

We found considerable variation in the numbers of putative CNVs detected by various software/reference set combinations, and more CNVs were called by dChip than by any other software tested. Rates of false positive deletion calls identified by SNP heterozygosity were substantial with all of the approaches tested, and the false positive call rates did not correlate with the total number of candidate CNVs identified by a given approach. The highest rate of false positive candidate deletion calls was produced by CNAG using a reference set of 2 (within trios), but this is likely, at least in part, due to the very small size of the reference set combined with noisy data. In such a small reference set, the average copy number may be quite different from 2.0 in certain regions of the genome. For example, similar results are expected using pair-wise test versus reference comparisons for the very different cases of a paternally inherited deletion (copy numbers 1, 1, and 2 in the child, father, and mother, respectively) and a duplication in the mother that was not inherited (copy numbers 2, 2, and 3, respectively) (Additional file 16). In such cases, we accepted all possible CNVs as candidates for optimal sensitivity (Methods, Additional file 16), but we expected that a subset of these CNVs would be false positives.

Within a large reference set, the average copy number of all loci is more likely to be close to 2.0, which improves the confidence of CNV detection in a given sample. However, frequent polymorphisms in a large reference set may skew the results. For example, a deletion affecting a single genomic region occurring in 10% of the population could decrease the mean of copy number in that region to 1.9 in a large random reference set, while a deletion with 50% frequency may push the base line to 1.5, resulting in a false positive duplication call in a test sample that lacks the aberration or a false negative deletion call in a test sample that has the deletion. Data from a chromosomal region rich in polymorphic sites might seem noisy and might not yield distinguishable CNVs even though they are frequent.
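The arithmetic behind the 1.9 and 1.5 baselines can be made explicit with a short sketch. It rests on assumptions that are not stated in the text: every carrier in the reference set has exactly one copy at the locus, everyone else has two, and the analysis software interprets the reference mean as two copies. The function names are hypothetical.

```python
def reference_baseline(carrier_fraction, carrier_cn=1.0, normal_cn=2.0):
    '''Mean copy number of a reference set in which a given fraction of
    individuals carries a one-copy deletion (simplifying assumption).'''
    return carrier_fraction * carrier_cn + (1.0 - carrier_fraction) * normal_cn


def apparent_copy_number(true_cn, baseline, assumed_reference_cn=2.0):
    '''Copy number reported for a test sample when the reference mean is
    treated as if it were exactly two copies.'''
    return assumed_reference_cn * true_cn / baseline


baseline_10 = reference_baseline(0.10)            # about 1.9
baseline_50 = reference_baseline(0.50)            # 1.5
normal_sample = apparent_copy_number(2.0, baseline_50)   # about 2.67
deleted_sample = apparent_copy_number(1.0, baseline_50)  # about 1.33
```

Under these assumptions, a sample with two true copies appears to have gained copies relative to the skewed baseline, while a sample that carries the deletion appears closer to normal than it really is, which is the direction of the errors described above.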
Further understanding of CNVs and their frequencies in the general population will help resolve this issue and increase specificity of CNV detection in these regions.

The performance of software packages in the detection of single copy deletions was better than that of single copy duplications. This may be due to the fact that deletions produce a 2-fold change in copy number (from 2 to 1), while duplications produce only a 1.5-fold change (from 2 to 3).

The rate of detection of confirmed CNVs (Table 4) was independent of the total numbers of CNVs reported (Table 3) by a particular software/reference set combination. As expected, larger CNVs were detected more consistently than smaller ones, and the denser the SNP coverage within a given chromosomal segment, the smaller the CNVs (in bp) that could be detected with high confidence.

We found that CNAG-GLAD using pair-wise comparisons within trios (reference set of 2) detected the largest proportion (87%) of the 38 validated aberrations, closely followed by CNAT-GLAD with the reference set of 214 (84%) (Table 4). Unfortunately, these two approaches also had the highest false positive deletion rates, of 66% and 51%, respectively (Table 3). No single method detected all of the confirmed CNVs, and each method missed a different subset of variants. Thus, none of the software/reference set combinations we tested appears to be sufficient to detect all true CNVs, and several approaches may need to be used together for efficient and reliable copy number analysis of GeneChip SNP data. For example, to maximize the detection rate of true positive CNVs we recommend using the combination of CNAG-GLAD with pair-wise comparisons of test and reference samples, and the use of dChip and CNAT-GLAD with large reference sets (>50). This combination of approaches successfully detected all of the validated CNVs in our study.
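One way to look for such a combination is to search for the smallest sets of methods whose pooled calls cover every validated CNV. The sketch below illustrates that idea only; it is not the procedure used in this study, the function names are hypothetical, and the detection table in the usage example is invented toy data rather than the contents of Table 4.

```python
from itertools import combinations


def detected_by_combination(detections, methods):
    '''Set of validated CNV ids found by at least one of the given methods.'''
    found = set()
    for method in methods:
        found |= detections[method]
    return found


def smallest_complete_combinations(detections, all_cnvs):
    '''Return every smallest combination of methods that together detect all
    CNVs in all_cnvs, or an empty list if no combination does.'''
    names = sorted(detections)
    for size in range(1, len(names) + 1):
        hits = [combo for combo in combinations(names, size)
                if detected_by_combination(detections, combo) >= all_cnvs]
        if hits:
            return hits
    return []


# Invented example: four hypothetical methods and five validated CNVs.
toy_detections = {
    'method_a': {1, 2, 3},
    'method_b': {3, 4},
    'method_c': {4, 5},
    'method_d': {1, 5},
}
toy_result = smallest_complete_combinations(toy_detections, {1, 2, 3, 4, 5})
```

The exhaustive search is practical here because only 11 software/reference set combinations were compared; it would need a heuristic for much larger numbers of methods.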
To reduce false positive rates, we recommend SNP genotype analysis of putative deletions (see Methods) and setting thresholds for statistical significance and copy number values of putative CNVs.

We used the default parameter settings for copy number analysis in each of the software packages evaluated in this study, as would most users. We did not attempt a thorough parameter optimization due to the large size of the data set and the number of other variables under which the software packages were tested. It should be noted though that changing parameters for some of these software packages may result in different numbers of putative CNVs detected, and the optimal parameters for detection of specific CNVs may also depend on the noise level of each chip and the location and size of the CNVs. Among the packages that we evaluated, there are no applicable parameter settings to CNAT [14,18], and we used CNAT only for normalization and feature extraction, not for CNV detection. dChip automatically determines its optimal HMM parameters for each chip from the raw data [23], thus these parameters are not user accessible. GLAD has a few adjustable parameters for AWS, such as the lambda value for the number of breakpoints and a clustering parameter lambda [26]. We have examined the sensitivity of CNV detection to AWS parameter change from the default on a small subset of samples, and found that the results did not change, even for the detection of our smallest validated aberrations. CNAG's default HMM parameters have been optimized by the software developers to detect full copy number changes (e.g. to 1 or 3) in a mostly diploid sample [22,24]. These parameters are user adjustable, and adjustment is specifically recommended for non-diploid chromosomal regions or detection of mosaic CNVs (with an average copy number change of less than 1.0) (CNAG users' manual [24]). In some cases one may wish to change these parameters to detect as many true CNVs as possible, even though this would also likely produce much higher false positive call rates. In other circumstances, it may be more important to minimize the false positive call rate, even if this means that some real CNVs will be missed. Our study used exclusively samples which are predominantly diploid, and so the default parameters appear the most appropriate.

In addition to the results presented from the empirical data set of 662 arrays, we have also tested the software on a smaller synthetic data set with a higher number of simulated CNVs. Although the numbers of candidate CNVs detected were not directly comparable to those found from empirical data due to the differences between the approaches, the following conclusions were consistent between the empirical and simulation data. Software performance in the detection of deletions was generally better than that of duplications. dChip identified the most putative CNVs from both the empirical and synthetic data. However, it did not have the best true positive CNV detection rate and had a significant false positive rate in both cases. On the synthetic data the CNAG-GLAD approach yielded the best true positive CNV detection rate.

The availability of both SNP genotypes and genomic copy number information from the same data is a particularly useful feature of Affymetrix GeneChip Mapping arrays. The copy number analysis algorithms evaluated in this paper all have substantial false positive candidate CNV call rates; however, many putative deletions can be confirmed and a large proportion of false positives can be eliminated without performing further experiments using the genotype information.
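The genotype check referred to here, which the Methods define as rejecting a candidate deletion when more than 10% of its SNPs are heterozygous, could be sketched as follows. The genotype encoding ('AA', 'AB', 'BB', with anything else treated as non-heterozygous) and the function name are assumptions made for illustration; the 10% cutoff and the use of the total SNP count as the denominator follow the Methods.

```python
def is_false_positive_deletion(genotypes, max_het_fraction=0.10):
    '''Return True if a candidate deletion should be rejected because more
    than max_het_fraction of its SNPs were called heterozygous.

    genotypes: one call per SNP inside the candidate deletion, for example
    'AA', 'AB' or 'BB' (this encoding is an assumption of the sketch).
    '''
    if not genotypes:
        raise ValueError('candidate deletion contains no SNPs')
    het_count = sum(1 for call in genotypes if call == 'AB')
    return het_count / len(genotypes) > max_het_fraction


# One heterozygous call in ten SNPs is tolerated; two in ten are not.
accepted = not is_false_positive_deletion(['AA'] * 9 + ['AB'])
rejected = is_false_positive_deletion(['AA'] * 8 + ['AB'] * 2)
```

A true one-copy deletion leaves a single allele at each SNP, so heterozygous calls inside it should arise only from genotyping errors or imprecise breakpoints, which is why a small non-zero tolerance is kept.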
None of the copy number analysis programs tested here take genotype information into account for identifying candidate deletions, and this would be a useful feature for future implementation. Allelic imbalance in heterozygous genotypes could also be used in calling duplications, as it is in the recently described CARAT algorithm [27].

Another recommendation for improvement of the software packages we tested would be to assign statistical significance to each CNV call and then use this information to rank the candidate CNVs. None of the software packages we tested accurately describes the relative quality of their CNV predictions. An independent statistical test, such as the t-test we employed, is necessary to provide confidence in the CNVs identified by various methods. Furthermore, it would also be useful to rank candidate CNVs by the deviation of copy number from 2.0. A researcher could then decide approximately how many false positive calls to tolerate to achieve the desired rate of true CNV detection by setting the corresponding p-value and copy number thresholds appropriately.

Using a combination of approaches described above to optimize true positive detection rates and minimize false positive rates, we estimated the average numbers of CNVs per genome in our sample set. The average of 39 candidate CNVs in unaffected individuals (20 deletions and 19 duplications) may be an overestimate, because a subset of these may still be false positives. These numbers, however, are in a range similar to those estimated by others in the general population using different technologies and analytical approaches, as reviewed in [12]. Many of the CNVs we found in our sample set of 224 unaffected individuals probably represent normal polymorphisms, and future studies will characterize these candidate variants in more detail.

In summary, hybridization data obtained from 100 K SNP WGSA arrays can be used to identify single-copy constitutional CNVs smaller than 200 kb.
We found that detecting all real CNVs from such data requires multiple computational approaches. The SNP genotype information that is available for all samples analyzed with these arrays is useful for recognizing many false positive calls and should be used to improve the specificity of CNV detection. Further improvement in the specificity of recognizing true CNVs may be achieved without loss of sensitivity by better use of the information provided by each of the individual 25-mer oligonucleotide probes associated with each SNP on the GeneChip arrays, by taking advantage of the increased resolution of 500 K GeneChip arrays, and by further improving the array design to provide more uniform coverage of the genome.

Methods
Affymetrix GeneChip® 100 K Mapping Array data
For this analysis, we used a data set generated in a previous study [6] of families with children with mental retardation (MR). The study group was composed of 107 children and both of their unaffected parents, plus 10 unaffected siblings of the affected children from 10 of the families. DNA was isolated from 331 whole blood samples as described [6]. Hybridization to Affymetrix GeneChip® 100 K Mapping arrays was performed according to the manufacturer's recommendations (Affymetrix GeneChip® Mapping 100 K Assay Manual; [18]) as previously described [6].

Reference sets for copy number analysis
The reference sets described below were used for copy number analysis of Affymetrix GeneChip® Mapping 100 K array data:
• '2': three pair-wise comparisons were performed within each MR trio (child to father as reference, child to mother as reference, and father to mother as reference). Deletions and duplications in family members were called as described in Additional file 16A.
• '50': each sample was compared to a reference set of 50 unaffected mothers of children with MR. These 50 mothers selected for this reference set had the smallest numbers of candidate CNVs identified by dChip, compared to the other mothers in our data set (using a reference set of all 214 parents).
• '214': each sample was compared to a reference set comprised of all 214 normal parents (107 mothers and 107 fathers) of the children with MR.
• '106': a default reference set from 106 individuals provided by Affymetrix for copy number analysis with CNAT [18].

Synthetic 100 K array data
We generated 30 artificial data sets by randomly shuffling normalized 100 K SNP array data from a normal sample with variability close to the median using the reference set of '50' individuals. Input to the shuffling was normalized copy number data from CNAG, dChip and CNAT, and 10 synthetic samples were produced for each of these software packages. We then introduced 100 simulated CNVs into each synthetic sample with SNP widths ranging from 5 to 23 and copy numbers ranging from 0.3 to 3.0. CNV detection on these normalized data was then performed by dChip and GLAD (this was not possible for the CNAG HMM, since this software does not accept intermediate stage normalized data).

Copy number analysis with CNAG and CNAG/GLAD
Detection of copy number variants was performed using the Copy Number Analyser for GeneChip® arrays (CNAG) Version 1.1 software [22], using the default parameters. Each sample was compared to a reference set of 2 or 50 individuals. Regions of copy number gains or losses were detected using the Hidden Markov Model (HMM) output of CNAG.
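For the reference set of 2, the three pair-wise comparisons within a trio can be illustrated with idealized copy numbers. The sketch below is not the rule set of Additional file 16A, which is not reproduced here; it only computes the expected log2-ratios for the two scenarios discussed in the Conclusion, and the function name and dictionary keys are hypothetical.

```python
import math


def trio_log2_ratios(child_cn, father_cn, mother_cn):
    '''Expected log2-ratios for the three pair-wise comparisons within a
    trio, given idealized (noise-free) copy numbers at one locus.'''
    return {
        'child_vs_father': math.log2(child_cn / father_cn),
        'child_vs_mother': math.log2(child_cn / mother_cn),
        'father_vs_mother': math.log2(father_cn / mother_cn),
    }


# Paternally inherited deletion: child 1, father 1, mother 2.
inherited_deletion = trio_log2_ratios(1, 1, 2)
# Maternal duplication that was not inherited: child 2, father 2, mother 3.
maternal_duplication = trio_log2_ratios(2, 2, 3)
```

Both scenarios give a ratio of zero for child versus father and negative ratios of equal size for the two comparisons against the mother (−1 in the first case, about −0.58 in the second), so with noisy data the sign pattern alone does not distinguish them.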
Deletions and duplications in individuals were identified using the rules described in Additional file 16A (for the reference set of 2) and B (for the reference set of 50).

In addition to the CNAG HMM implementation, we also identified copy number changes using the Gain and Loss Analysis of DNA (GLAD) R package [26] with default settings. Sample versus reference SNP copy number log2-ratios calculated by CNAG were used as the input for the CNAG/GLAD analysis.

Copy number analysis with dChip and dChip/GLAD
Genomic gains and losses were assessed against the reference sets of 50 or 214 individuals using the DNA-Chip Analyzer (dChip) Version Release (Nov 17 2005) software [23] with the default parameters. Regions of copy number gain or loss in each comparison were detected using the Hidden Markov Model (HMM) output of dChip.

We also detected copy number changes using the GLAD R package [26] with the default settings instead of the dChip HMM. SNP copy number log2-ratios of sample versus reference calculated by dChip served as the input for the dChip/GLAD analysis. Deletions and duplications in individuals were identified following the rules described in Additional file 16B.

Copy number analysis with CNAT and GLAD
SNP copy numbers were determined using the Affymetrix GeneChip® Chromosome Copy Number Analysis Tool (CNAT) Version 3.0 [14,18] using the default parameters and the reference sets of 50, 214 or 106 individuals. We used the GLAD R package (Hupe et al. 2004) to identify CNV boundaries from SNP copy number log2-ratios of sample versus reference sample set calculated by CNAT. Deletions and duplications in individuals were identified using the rules described in Additional file 16B.

Genotype analysis of deletions
SNP genotype calls were generated from probe signal intensity data using the GeneChip® DNA Analysis Software Version 3.0 (GDAS) [18], with a confidence score threshold of 0.05 for genotype accuracy. We counted the number of heterozygous SNPs within putative deletions identified by each copy number analysis method described above. If the rate of heterozygous SNPs was more than 10% of the total SNP count within a candidate deletion, the deletion was considered to be a false positive call. If no more than 10% of SNPs were heterozygous, the deletion was accepted. One reason for allowing up to 10% heterozygous SNP call rate (instead of 0%) within deletions was the occasional occurrence of errors in genotype calls. Furthermore, although the presence of a deletion may correctly be identified in a chromosomal segment by a software package, the breakpoints may not be accurately defined, resulting in the inclusion of heterozygous SNPs from the normal region(s) flanking the deletion boundaries. The percentage of heterozygous SNPs was below 10% in all of our validated deletions. All candidate deletions with more than 10% SNP heterozygosity that we attempted to validate turned out to be false positives.

Validation of putative copy number variants (CNVs)
Validation of most putative CNVs was carried out by fluorescent in situ hybridization (FISH) on interphase and metaphase chromosome spreads prepared according to standard cytogenetic protocols, as described [6]. Bacterial artificial chromosome (BAC) or fosmid inserts were used as probes, selected via the University of California at Santa Cruz (UCSC) genome browser [28,29], May 2004 human genome assembly. A subset of CNVs were confirmed using standard karyotyping, and one inherited deletion was validated using quantitative PCR as previously described [30].

Theoretical resolving power for detecting hemizygous deletions and duplications
We defined the resolving power as the average size of the smallest one-copy deletion or duplication that could be detected at a given confidence level. The confidence level of finding hemizygous deletions (1 copy) or duplications (3 copies) containing n SNPs from feature extracted copy number data by various methods was estimated using the Welch t-test as follows. SNP copy number log2-ratios of sample versus reference were calculated from probe intensity values using CNAT, CNAG or dChip. The means and standard deviations of these ratios were calculated within each validated CNV, and in the rest of the same chromosome outside the CNV that we used as the control region. Assuming that the mean and standard deviation of any 2 or more SNPs chosen from within the CNV or control region would be equal (in keeping with a Gaussian distribution), we compared the average log2-ratios between n SNPs from the CNV with (c-n) SNPs from the control region, where 'c' represents the total number of SNPs for that chromosome on the array. Using the Welch t-test, a p-value was calculated. The resolving power for deletions and duplications was calculated from combined Xba and Hind data using the average values of confirmed deletions and duplications, respectively, and then by extrapolating to calculate p-values required to detect a wide range of potential CNV sizes.

Abbreviations
Array-CGH, array comparative genomic hybridization; AWS, adaptive weights smoothing; CNP, copy number polymorphism; CNV, copy number variant; FISH, fluorescent in situ hybridization; HMM, Hidden Markov Model; MR, mental retardation; SD, standard deviation; SNP, single nucleotide polymorphism; WGSA, whole genome sampling analysis

Competing interests
The author(s) declares that there are no competing interests.

Authors' contributions
AB was involved in the study design, generation of 100 K SNP array data, 100 K SNP array data analysis, and wrote and edited the manuscript. ADD participated in the study design, 100 K SNP array data analysis and manuscript editing. HIL, TN, SF and HQ performed 100 K SNP array data analysis and manuscript editing.
SYC participated in generation of the 100 K SNP array data and manuscript editing. JA, AA, MC, MB-J, AG and GK generated 100 K SNP array data. PB, NF, and SL were involved in collecting patient samples. PE carried out validation of putative CNVs. JMF was involved in patient sample collection and manuscript editing. MAM participated in the study design, supervised the study and edited the manuscript. All authors read and approved the final manuscript.

Additional files
Raw array data are publicly available within the NCBI Gene Expression Omnibus (GEO) database [31] under accession number GSE 7226. The raw data can also be downloaded from ftp://mr@ftp2.bcgsc.ca/ using the login: mr and password: omn1w0rld.

Additional material
Additional file 1: List of samples and oligonucleotide arrays. List of 331 samples and 662 arrays (Xba 50 K and Hind 50 K) with corresponding array quality measures. [http://www.biomedcentral.com/content/supplementary/1471-2105-8-368-S1.xls]
Additional file 2: Candidate CNVs by CNAG Ref2. List of candidate copy number variants identified by CNAG using reference set '2'. [http://www.biomedcentral.com/content/supplementary/1471-2105-8-368-S2.xls]
Additional file 3: Candidate CNVs by CNAG Ref50. List of candidate copy number variants identified by CNAG using reference set '50'. [http://www.biomedcentral.com/content/supplementary/1471-2105-8-368-S3.xls]
Additional file 4: Candidate CNVs by CNAG-GLAD Ref2. List of candidate copy number variants identified by CNAG and GLAD using reference set '2'. [http://www.biomedcentral.com/content/supplementary/1471-2105-8-368-S4.xls]
Additional file 5: Candidate CNVs by CNAG-GLAD Ref50. List of candidate copy number variants identified by CNAG and GLAD using reference set '50'. [http://www.biomedcentral.com/content/supplementary/1471-2105-8-368-S5.xls]
Additional file 6: Candidate CNVs by dChip Ref50.
List of candidate copy number variants identified by dChip using reference set '50'. [http://www.biomedcentral.com/content/supplementary/1471-2105-8-368-S6.xls]
Additional file 7: Candidate CNVs by dChip Ref50 (continued). List of candidate copy number variants identified by dChip using reference set '50' (continued). [http://www.biomedcentral.com/content/supplementary/1471-2105-8-368-S7.xls]
Additional file 8: Candidate CNVs by dChip Ref214. List of candidate copy number variants identified by dChip using reference set '214'. [http://www.biomedcentral.com/content/supplementary/1471-2105-8-368-S8.xls]
Additional file 9: Candidate CNVs by dChip-GLAD Ref50. List of candidate copy number variants identified by dChip and GLAD using reference set '50'. [http://www.biomedcentral.com/content/supplementary/1471-2105-8-368-S9.xls]
Additional file 10: Candidate CNVs by dChip-GLAD Ref214. List of candidate copy number variants identified by dChip and GLAD using reference set '214'. [http://www.biomedcentral.com/content/supplementary/1471-2105-8-368-S10.xls]
Additional file 11: Candidate CNVs by CNAT-GLAD Ref50. List of candidate copy number variants identified by CNAT and GLAD using reference set '50'. [http://www.biomedcentral.com/content/supplementary/1471-2105-8-368-S11.xls]
Additional file 12: Candidate CNVs by CNAT-GLAD Ref106. List of candidate copy number variants identified by CNAT and GLAD using reference set '106'. [http://www.biomedcentral.com/content/supplementary/1471-2105-8-368-S12.xls]
Additional file 13: Candidate CNVs by CNAT-GLAD Ref214. List of candidate copy number variants identified by CNAT and GLAD using reference set '214'. [http://www.biomedcentral.com/content/supplementary/1471-2105-8-368-S13.xls]
Additional file 14: Candidate CNVs predicted by multiple methods. Putative CNV regions identified by at least two software/reference set combinations from the same sample. [http://www.biomedcentral.com/content/supplementary/1471-2105-8-368-S14.xls]
Additional file 15: Detection of confirmed CNVs. List of 38 CNVs that were confirmed by independent experimental approaches, as well as their detection by various software/reference set combinations. [http://www.biomedcentral.com/content/supplementary/1471-2105-8-368-S15.xls]
Additional file 16: Rules for candidate CNV detection. The rules used for the detection of putative copy number variants with various reference sets. [http://www.biomedcentral.com/content/supplementary/1471-2105-8-368-S16.xls]

Acknowledgements
We thank Martin Krzywinski and the Mapping Group, Sequencing Group, Bioinformatics Group and the Project Management Group of the British Columbia Cancer Agency Genome Sciences Centre for their assistance in this project. We are grateful to Sharoni Jacobs for critically evaluating this manuscript and for helpful suggestions. This study was funded by Genome Canada, Genome British Columbia and the Canada Foundation for Innovation, with additional support from Affymetrix Inc. Research at the British Columbia Cancer Agency is also supported by the British Columbia Cancer Foundation. Marco A. Marra is a Michael Smith Foundation for Health Research Scholar. We are extremely grateful to the families that donated samples used in this study.

References
1. Kops GJ, Weaver BA, Cleveland DW: On the road to cancer: aneuploidy and the mitotic checkpoint. Nat Rev Cancer 2005, 5(10):773-785.
2. Fukasawa K: Centrosome amplification, chromosome instability and cancer development. Cancer Lett 2005, 230(1):6-19.
3. Duesberg P, Li R, Fabarius A, Hehlmann R: The chromosomal basis of cancer. Cell Oncol 2005, 27(5-6):293-318.
4. Leonard H, Wen X: The epidemiology of mental retardation: challenges and opportunities in the new millennium. Ment Retard Dev Disabil Res Rev 2002, 8(3):117-134.
5. van Karnebeek CD, Jansweijer MC, Leenders AG, Offringa M, Hennekam RC: Diagnostic investigations in individuals with mental retardation: a systematic literature review of their usefulness. Eur J Hum Genet 2005, 13(1):6-25.
6. Friedman JM, Baross A, Delaney AD, Ally A, Arbour L, Asano J, Bailey DK, Barber S, Birch P, Brown-John M, Cao M, Chan S, Charest DL, Farnoud N, Fernandes N, Flibotte S, Go A, Gibson WT, Holt RA, Jones SJ, Kennedy GC, Krzywinski M, Langlois S, Li HI, McGillivray BC, Nayar T, Pugh TJ, Rajcan-Separovic E, Schein JE, Schnerch A, Siddiqui A, Van Allen MI, Wilson G, Yong SL, Zahir F, Eydoux P, Marra MA: Oligonucleotide microarray analysis of genomic imbalance in children with mental retardation. Am J Hum Genet 2006, 79(3):500-513.
7. Tuzun E, Sharp AJ, Bailey JA, Kaul R, Morrison VA, Pertz LM, Haugen E, Hayden H, Albertson D, Pinkel D, Olson MV, Eichler EE: Fine-scale structural variation of the human genome. Nat Genet 2005, 37(7):727-732.
8. Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Maner S, Massa H, Walker M, Chi M, Navin N, Lucito R, Healy J, Hicks J, Ye K, Reiner A, Gilliam TC, Trask B, Patterson N, Zetterberg A, Wigler M: Large-scale copy number polymorphism in the human genome. Science 2004, 305(5683):525-528.
9. McCarroll SA, Hadnott TN, Perry GH, Sabeti PC, Zody MC, Barrett JC, Dallaire S, Gabriel SB, Lee C, Daly MJ, Altshuler DM: Common deletion polymorphisms in the human genome. Nat Genet 2006, 38(1):86-92.
10. Hinds DA, Kloek AP, Jen M, Chen X, Frazer KA: Common deletions and SNPs are in linkage disequilibrium in the human genome. Nat Genet 2006, 38(1):82-85.
11. Conrad DF, Andrews TD, Carter NP, Hurles ME, Pritchard JK: A high-resolution survey of deletion polymorphism in the human genome. Nat Genet 2006, 38(1):75-81.
12. Feuk L, Carson AR, Scherer SW: Structural variation in the human genome. Nat Rev Genet 2006, 7(2):85-97.
13. Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W, Cho EK, Dallaire S, Freeman JL, Gonzalez JR, Gratacos M, Huang J, Kalaitzopoulos D, Komura D, MacDonald JR, Marshall CR, Mei R, Montgomery L, Nishimura K, Okamura K, Shen F, Somerville MJ, Tchinda J, Valsesia A, Woodwark C, Yang F, Zhang J, Zerjal T, Zhang J, Armengol L, Conrad DF, Estivill X, Tyler-Smith C, Carter NP, Aburatani H, Lee C, Jones KW, Scherer SW, Hurles ME: Global variation in copy number in the human genome. Nature 2006, 444(7118):444-454.
14. Huang J, Wei W, Zhang J, Liu G, Bignell GR, Stratton MR, Futreal PA, Wooster R, Jones KW, Shapero MH: Whole genome DNA copy number changes identified by high density oligonucleotide arrays. Hum Genomics 2004, 1(4):287-299.
15. Lucito R, Healy J, Alexander J, Reiner A, Esposito D, Chi M, Rodgers L, Brady A, Sebat J, Troge J, West JA, Rostan S, Nguyen KC, Powers S, Ye KQ, Olshen A, Venkatraman E, Norton L, Wigler M: Representational oligonucleotide microarray analysis: a high-resolution method to detect genome copy number variation. Genome Res 2003, 13(10):2291-2305.
16. Slater HR, Bailey DK, Ren H, Cao M, Bell K, Nasioulas S, Henke R, Choo KH, Kennedy GC: High-Resolution Identification of Chromosomal Abnormalities Using Oligonucleotide Arrays Containing 116,204 SNPs. Am J Hum Genet 2005, 77(5):709-726.
17. Bignell GR, Huang J, Greshock J, Watt S, Butler A, West S, Grigorova M, Jones KW, Wei W, Stratton MR, Futreal PA, Weber B, Shapero MH, Wooster R: High-resolution analysis of DNA copy number using oligonucleotide microarrays. Genome Res 2004, 14(2):287-295.
18. Affymetrix Inc., Santa Clara, CA: [http://www.affymetrix.com/].
19. Kennedy GC, Matsuzaki H, Dong S, Liu WM, Huang J, Liu G, Su X, Cao M, Chen W, Zhang J, Liu W, Yang G, Di X, Ryder T, He Z, Surti U, Phillips MS, Boyce-Jacino MT, Fodor SP, Jones KW: Large-scale genotyping of complex DNA. Nat Biotechnol 2003, 21(10):1233-1237.
20. Ishikawa S, Komura D, Tsuji S, Nishimura K, Yamamoto S, Panda B, Huang J, Fukayama M, Jones KW, Aburatani H: Allelic dosage analysis with genotyping microarrays. Biochem Biophys Res Commun 2005, 333(4):1309-1314.
21. LaFramboise T, Weir BA, Zhao X, Beroukhim R, Li C, Harrington D, Sellers WR, Meyerson M: Allele-specific amplification in cancer revealed by SNP array analysis. PLoS Comput Biol 2005, 1(6):e65.
22. Nannya Y, Sanada M, Nakazaki K, Hosoya N, Wang L, Hangaishi A, Kurokawa M, Chiba S, Bailey DK, Kennedy GC, Ogawa S: A robust algorithm for copy number detection using high-density oligonucleotide single nucleotide polymorphism genotyping arrays. Cancer Res 2005, 65(14):6071-6079.
23. Zhao X, Li C, Paez JG, Chin K, Janne PA, Chen TH, Girard L, Minna J, Christiani D, Leo C, Gray JW, Sellers WR, Meyerson M: An integrated view of copy number and allelic alterations in the cancer genome using single nucleotide polymorphism arrays. Cancer Res 2004, 64(9):3060-3071.
24. CNAG: [http://www.genome.umin.jp/].
25. dChip: [http://biosun1.harvard.edu/complab/dchip/].
26. Hupe P, Stransky N, Thiery JP, Radvanyi F, Barillot E: Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics 2004, 20(18):3413-3422.
27. Huang J, Wei W, Chen J, Zhang J, Liu G, Di X, Mei R, Ishikawa S, Aburatani H, Jones KW, Shapero MH: CARAT: a novel method for allelic detection of DNA copy number changes using high density oligonucleotide arrays. BMC Bioinformatics 2006, 7:83.
28. UCSC Genome Browser: [http://genome.ucsc.edu/].
29. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D: The human genome browser at UCSC. Genome Res 2002, 12(6):996-1006.
30. Wilson GM, Flibotte S, Chopra V, Melnyk BL, Honer WG, Holt RA: DNA copy-number analysis in bipolar disorder and schizophrenia reveals aberrations in genes involved in glutamate signaling. Hum Mol Genet 2006, 15(5):743-749.
31.
NCBI Gene Expression Omnibus: [http://www.ncbi.nlm.nih.gov/geo/]."@en ; edm:hasType "Article"@en ; edm:isShownAt "10.14288/1.0223627"@en ; dcterms:language "eng"@en ; ns0:peerReviewStatus "Reviewed"@en ; edm:provider "Vancouver : University of British Columbia Library"@en ; dcterms:publisher "BioMed Central"@en ; ns0:publisherDOI "10.1186/1471-2105-8-368"@en ; dcterms:rights "Attribution 4.0 International (CC BY 4.0)"@en ; ns0:rightsURI "http://creativecommons.org/licenses/by/4.0/"@en ; ns0:scholarLevel "Faculty"@en ; dcterms:title "Assessment of algorithms for high throughput detection of genomic copy number variation in oligonucleotide microarray data"@en ; dcterms:type "Text"@en ; ns0:identifierURI "http://hdl.handle.net/2429/56669"@en .