UBC Faculty Research and Publications

Robust SNP genotyping by multiplex PCR and arrayed primer extension
Podder, Mohua; Ruan, Jian; Tripp, Ben W; Chu, Zane E; Tebbutt, Scott J
Jan 31, 2008


Full Text

BMC Medical Genomics (BioMed Central)
Research article, Open Access

Robust SNP genotyping by multiplex PCR and arrayed primer extension

Mohua Podder (1,2), Jian Ruan (1), Ben W Tripp (1), Zane E Chu (1,4) and Scott J Tebbutt* (1,3)

Address: (1) The James Hogg iCAPTURE Centre for Cardiovascular and Pulmonary Research, St. Paul's Hospital, University of British Columbia, Vancouver, BC, V6Z 1Y6, Canada; (2) Department of Statistics, University of British Columbia, Vancouver, BC, V6T 1Z2, Canada; (3) Department of Medicine, Division of Respiratory Medicine, University of British Columbia, Vancouver, BC, V6Z 1Y6, Canada; and Current Address: (4) Division of Engineering Science, University of Toronto, Toronto, ON, M5S 2E4, Canada

Email: Mohua Podder - mpodder@mrl.ubc.ca; Jian Ruan - jruan@mrl.ubc.ca; Ben W Tripp - btripp@mrl.ubc.ca; Zane E Chu - zane.chu@utoronto.ca; Scott J Tebbutt* - stebbutt@mrl.ubc.ca
* Corresponding author

Abstract

Background: Arrayed primer extension (APEX) is a microarray-based rapid minisequencing methodology that may have utility in 'personalized medicine' applications that involve genetic diagnostics of single nucleotide polymorphisms (SNPs). However, to date there have been few reports that objectively evaluate the assay completion rate, call rate and accuracy of APEX. We have further developed robust assay design, chemistry and analysis methodologies, and have sought to determine how effective APEX is in comparison to leading 'gold-standard' genotyping platforms. Our methods have been tested against industry-leading technologies in two blinded experiments based on Coriell DNA samples and SNP genotype data from the International HapMap Project.

Results: In the first experiment, we genotyped 50 SNPs across the entire 270 HapMap Coriell DNA sample set. For each Coriell sample, DNA template was amplified in a total of 7 multiplex PCRs prior to genotyping.
We obtained good results for 41 of the SNPs, with 99.8% genotype concordance with HapMap data, at an automated call rate of 94.9% (not including the 9 failed SNPs). In the second experiment, involving modifications to the initial DNA amplification so that a single 50-plex PCR could be achieved, genotyping of the same 50 SNPs across each of 49 randomly chosen Coriell DNA samples allowed extremely robust 50-plex genotyping from as little as 5 ng of DNA, with 100% assay completion rate, 100% call rate and >99.9% accuracy.

Conclusion: We have shown our methods to be effective for robust multiplex SNP genotyping using APEX, with 100% call rate and >99.9% accuracy. We believe that such methodology may be useful in future point-of-care clinical diagnostic applications where accuracy and call rate are both paramount.

Published: 31 January 2008
BMC Medical Genomics 2008, 1:5 doi:10.1186/1755-8794-1-5
Received: 1 September 2007
Accepted: 31 January 2008
This article is available from: http://www.biomedcentral.com/1755-8794/1/5
© 2008 Podder et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

If 'personalized medicine', using genomic knowledge, is to become a reality, then the ability to determine the most appropriate clinical intervention for a patient will require the genotyping of several tens to hundreds of single nucleotide polymorphisms (SNPs) across many genes and their regulatory sequences for that individual patient [1,2], rapidly and at the point-of-care.
Of many genotyping methods, those based on microarrays offer the greatest potential for economic, patient-specific application [3-7], due to their ability to simultaneously interrogate multiple SNPs. Arrayed primer extension (APEX [8,9]) is a minisequencing microarray assay based on a two-dimensional array of oligonucleotide probes that are immobilized, via their 5' ends, on a glass surface. The probes (25-mers) are designed so that they are complementary to the gene up to, but not including, the base where the SNP exists. The Sanger-based sequencing chemistry of APEX allows genotyping of hundreds of SNPs, with the array chemistry taking only fifteen to twenty minutes to complete. APEX achieves this clinically relevant speed because it uses the catalytic ability of a DNA polymerase to carry out a single nucleotide base extension (SBE) at the 3' end of the arrayed probes, specific to the SNP sites of interest in amplified patient DNA that is temporarily hybridized to these probes. The dideoxynucleotide (ddNTP) 'terminator' bases are labelled with tags containing distinct fluorescent chromophores, specific for each of the four bases of DNA (A, C, G, T). Hence, the fluorescent 'colour' at each of the probe sites (array spots) will give SNP-specific genotypic information. As a discovery research tool, APEX has been used to detect β-thalassemia [10], p53 [11], and BRCA1 mutations [12]. Importantly, APEX has also been shown to be efficient at simultaneously genotyping SNP markers that are widely dispersed across the human genome [13,14]; such capability is essential for future 'individualized' genomic diagnostic analysis across multiple genes and pathways that are relevant to disease.
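The probe geometry described above (25-mer probes that stop one base short of the SNP and are extended by a single labelled ddNTP) can be illustrated with a minimal Python sketch. The function names, the toy sequence and the coordinates are hypothetical and are not taken from the authors' probe-design pipeline; the sketch assumes one probe per DNA strand and ignores the allele-specific (ASO) probes.

```python
# Illustrative sketch only: function names, the toy sequence and the
# coordinates are hypothetical, not the authors' probe-design pipeline.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (5'->3')."""
    return seq.translate(COMPLEMENT)[::-1]


def apex_probes(sense, snp_index, probe_len=25):
    """Derive sense- and antisense-strand probes for one SNP.

    sense     : sense-strand sequence (5'->3') surrounding the SNP
    snp_index : 0-based position of the SNP base within `sense`
    Each probe ends one base short of the SNP, so the polymerase
    extends its 3' end with the base at the SNP position.
    """
    if snp_index < probe_len or snp_index + probe_len >= len(sense):
        raise ValueError("not enough flanking sequence for both probes")
    # Sense probe: the probe_len bases immediately 5' of the SNP.
    sense_probe = sense[snp_index - probe_len:snp_index]
    # Antisense probe: reverse complement of the bases immediately 3' of the SNP.
    three_prime_flank = sense[snp_index + 1:snp_index + 1 + probe_len]
    antisense_probe = reverse_complement(three_prime_flank)
    return sense_probe, antisense_probe


def expected_signals(genotype):
    """Bases expected to be added to the sense and antisense probes."""
    sense_bases = set(genotype)
    antisense_bases = {base.translate(COMPLEMENT) for base in sense_bases}
    return sense_bases, antisense_bases


left_flank = "ACGT" * 6 + "A"             # 25 bases 5' of the SNP
right_flank = "TTTTTGGGGG" * 2 + "TTTTT"  # 25 bases 3' of the SNP
sense_probe, antisense_probe = apex_probes(left_flank + "A" + right_flank, 25)
print(sense_probe, antisense_probe, expected_signals("AG"))
```

In this sketch a heterozygous A/G sample would be expected to give A and G signals on the sense probe and T and C signals on the antisense probe, which is the kind of two-strand redundancy the assay relies on.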
In a recent quality assessment survey of SNP genotyping laboratories [15], in which up to 18 SNPs were genotyped across 47 DNA samples, APEX performed well against other methods, and the authors concluded that a "conservative approach for calling the genotypes should be used to achieve a high accuracy at the cost of a lower genotyping success rate." Whilst such a conservative approach may be applicable for research studies, it may not be appropriate for clinical diagnostics, in which life-saving medical decisions might require extremely accurate genotyping across all SNPs of interest.

Given the potential utility of APEX for rapid clinical diagnostics, we have developed robust assay design, chemistry and analysis methodologies, and have sought to determine just how effective APEX is in comparison to leading 'gold-standard' genotyping platforms, including Perlegen and Illumina. Our objective was to achieve 100% assay completion rate, call rate and genotyping accuracy rate, for multiple SNPs across multiple samples. Previous studies from our laboratory have reported APEX genotyping accuracies ranging from 98% to 99.8% [14,16-18], though the call rates in these studies have always been significantly lower than 100%, and usually do not include a proportion of the originally selected SNPs that fail the assay. Similarly, other laboratories that use APEX and equivalent technology have reported genotyping accuracies ranging from 98% to >99%, with call rates varying from 84.4% to 96.8% [10,11,13,15,19-21].

Results and Discussion

We selected 50 SNPs from the HapMap database that had been previously genotyped and analyzed as part of the third quality control exercise on Illumina and Perlegen platforms, arguably the most accurate and best validated high-throughput methodologies for SNP genotyping to date. The randomly selected SNPs were located across multiple chromosomes and are listed in Additional file 1 online, along with details of the APEX probe sequences and PCR primer sequences. The genotyping arrays that are currently being developed and tested in our laboratory incorporate multiple redundant measures consisting of sense and antisense DNA-strand APEX probes plus allele-specific oligonucleotide (ASO) APEX probes for a total of six different probes per SNP [14], with each replicated five times on the array grid, which allows for more robust statistical averaging. Optimal PCR primer pairs were designed for each of the 50 SNP loci [Additional file 1] and seven multiplex PCR groups were set up that, together, would amplify all 50 loci [Additional file 2]. We obtained a set of 287 DNA samples from McGill University and Génome Québec Innovation Centre (one of the HapMap Project's genotyping centers). This set comprised 270 DNA samples from the Coriell Institute for Medical Research [22] plus hidden duplicates and negative controls, all of which our laboratory was blinded to. PCR [Fig. 1a and Fig. 1b] and APEX assays were performed on each of the samples, plus a 10% repeat set which was randomly selected by us to allow internal quality control and an initial assessment of genotyping concordance.

Microarray image data were imported into SNP Chart [23] and analyzed using previously described image analysis algorithms [24,25]. Genotypes were called using two previously published methods: 1. MACGT software [17], which is a multi-dimensional clustering tool; 2. simple linear discriminant analysis (LDA) using dynamic variable selection [18], which is a classification algorithm. Results are shown in Table 1 and Additional file 3 online. Briefly, a training set was established using SNP Chart, followed by auto-calling in MACGT. Nine SNPs did not pass quality control due to assay failure or inconsistent PCR amplification. For all remaining SNPs that were auto-called by MACGT, any genotypes that had a 'fit' score of less than 0.001 (approximately 9%) were checked by manual scoring in SNP Chart and either validated, or changed to a different genotype or to a non-call (NN). The final results using MACGT showed highly accurate genotyping (99.94% concordance with HapMap) with good call rates (90% auto-called plus 9% manual scoring). Importantly, of the 1,013 genotypes called manually, the accuracy was 99.87%, even in cases where the array spot signal intensities were up to an order of magnitude lower than for higher quality genotype data, and only slightly higher than background signals [Additional file 4 and Additional file 5]. Using the same training set, we then analyzed the data set with simple linear discriminant analysis (LDA) using dynamic variable selection [18]. Results [Table 1 and Additional file 3] also showed accurate genotyping (99.91% HapMap concordance), and with higher automated call rates (94.91%, using a confidence score threshold of 0.75). We also calculated the homozygous and heterozygous performance for the set of 270 HapMap samples with the previously selected 41 SNPs out of 50 SNPs [see Table 1]. For a threshold of 0.75, we were able to call 6,883 cases out of 7,214 homozygous cases (95.41% call rate) with 6,880 correct calls (99.96% HapMap concordance). With the same threshold, out of 3,873 heterozygous cases, we were able to call 3,640 cases (93.98% call rate) with 3,634 correct calls (99.84% HapMap concordance). Therefore, in common with other genotyping platforms, our methodology has a slight bias that favours the calling of homozygous genotypes.

Table 1: Results summary for 287 HapMap samples and 41 SNPs

Method | Call rate | Concordance with HapMap
MACGT (0.001 cut-off) + manual calls | 98.90% (9% manual calls) | 99.94%
LDA (0.75 threshold), total cases | 94.91% (10,523 cases vs. 11,087) | 99.91% (10,514 vs. 10,523)
LDA (0.75 threshold), homozygous cases | 95.41% (6,883 vs. 7,214) | 99.96% (6,880 vs. 6,883)
LDA (0.75 threshold), heterozygous cases | 93.98% (3,640 vs. 3,873) | 99.84% (3,634 vs. 3,640)

Figure 1. Multiplexing PCR and subsequent amplicon fragmentation results, prior to APEX reaction on HapMap Chip. (a) Standard multiplex PCR from a single Coriell DNA sample using optimally-designed primers [Additional files 1 & 2] within seven unique multiplex groups (lanes 1–7; lane M shows 100 bp DNA ladder markers), showing wide range of amplicon sizes across the 50 SNP loci. (b) Purification, concentration and fragmentation of standard PCR amplicons. Lane 1 represents an aliquot of concentrated mixture of all seven multiplex products shown in Fig. 1a. Lane 2 shows the fragmentation result, generating single-stranded nucleic acid of 30–100 base length. (c) Multiplex PCR amplification of all 50 SNP loci in a single reaction tube using new PCR primer set [Additional file 6], showing 50-plex PCR products (individual SNP loci amplicons are unresolvable by agarose gel electrophoresis) from two Coriell DNA samples (lanes 1 & 2), plus a negative PCR control (lane 3). (d) Fragmentation of 50-plex PCR amplicons from aliquots of lane 1 & lane 2 samples shown in Fig. 1c.

These results, although promising and at least as accurate as any previously reported for APEX-based methodologies, did not deliver on our objective of 100% call rate and 100% accuracy, and several of the 50 SNPs failed quality control. However, two important lessons were learnt from the study: 1. our on-chip assay chemistry is extremely robust and specific, allowing accurate genotype calls (at least by manual inspection of the array spot data within SNP Chart) even at very low sensitivities (i.e., when the sequence-specific spot intensities are only slightly higher than background signals); 2. non-calls (NNs) generally resulted from sporadic PCR failure for certain amplicons, especially those of a length greater than 650–700 base pairs (bp). Taken together, our results suggested that even if specific SNPs give high NN rates across multiple samples, the genotypes for the remaining samples for these SNPs (for which APEX assay data can be obtained) are still very accurate, despite low signal to noise. We believe that this is due to the redundancy in the genotyping probe design: two classical APEX probes (one probe per DNA strand), plus four allele-specific (ASO) APEX probes (two probes per strand), each replicated five times, for each SNP site. When this redundant data is displayed in a SNP Chart, it is relatively straightforward to interpret the genotype manually [Additional file 4 and Additional file 5]. From these conclusions we reasoned that the PCR design itself needed to be addressed, so that sporadic failures (despite good primer design algorithms) could be consistently minimized or even eliminated.

For SNP genotyping, only the immediate sequence around the SNP site is of interest. Therefore, keeping the PCR amplicon size to a minimum ensures short extension times and minimal use of reagents. However, sequence-context issues, especially in multiplex PCR, necessitate the design of unique primers that have balanced annealing temperatures. This requirement can result in individual amplicon sizes in a multiplex mix ranging from 100 to >700 bp [14]. Large amplicons are optimal neither for fast PCR nor for the subsequent APEX assay, which requires amplicons to be fragmented to ~50–100 base lengths [Fig. 1b]. In addition, the degree of multiplexing is usually limited to between four and ten amplicons per individual multiplex PCR: e.g., for our original HapMap chip, the 50 SNP loci are amplified in a total of seven separate multiplex reactions [Fig. 1a and Additional file 2]. We initially tested multiplex PCR using all original PCR amplicon primer pairs in a single reaction. As expected, several experimental attempts all failed to amplify even a modest proportion of the 50 amplicons (typically, fewer than 20 amplicons would be successful; data not shown). Thus, our new objectives were to increase the degree of multiplexing and shorten the amplicon lengths to less than 200 bp, so that all 50 SNP loci could be simultaneously and robustly amplified in a single reaction vessel. New PCR primers were designed for the 50 HapMap SNP loci, with amplicon sizes restricted to between 100 and 200 bp [Additional file 6]. Because of this limitation, we were not able to optimally design the primers based on a balanced melting temperature (Tm). To try to compensate for this potential problem, each new PCR primer had a common linker sequence designed at its 5' end (5' TACGACTCACTTAGGGAG 3' for each of the left hand PCR primers; 5' CGATGTAGGTGACACTAG 3' for each of the right hand PCR primers). These linkers have two properties: a balanced and reasonably high GC content to increase the melting temperature of the primer, and a unique sequence not found in the human DNA template [26]. After the first few cycles of PCR, the linker sequence becomes incorporated into the amplicon sequence and is amplified along with the template sequence. This approach helps reduce primer-dimer formation during the PCR [27]. Because the primers have balanced GC content, primer annealing in later cycles of PCR should become much more sensitive and robust [28]. We randomly selected 50 of the HapMap Coriell DNA samples from our initial study, for 50-plex PCR using the pool of linker-modified primers. Specific PCR cycling conditions were adopted from a previously published study by Wang et al. [28]. We also attempted 50-plex PCR using the redesigned PCR primers, but without the common 5' linker sequences: we managed to amplify only a modest number of the 50 SNPs, and this multiplex PCR was not robust and we could never amplify all 50 SNPs (data not shown).

PCR [Fig. 1c and Fig. 1d] and APEX assays [Fig. 2] were performed on each of the samples, including negative controls. Microarray image data were imported into SNP Chart and analyzed as described previously. Genotype calling was performed using three independent methods: 1. manual calling in SNP Chart; 2. auto-calling with MACGT; and 3. auto-calling by LDA using dynamic variable selection. Genotypes were compared to HapMap data for concordance. One SNP (rs7693776) was monomorphic (TT) across all samples genotyped. Results are presented in Table 2 and Additional files 7, 8, 9 and 10. Manual genotype calling, although time-consuming and vulnerable to user-subjectivity issues [14,23], is nevertheless an accurate and validated way to interpret APEX data, especially at low spot intensity levels (see above). In addition, manual calling does not require the use of a training set. Of the 49 Coriell DNA samples (one sample out of the random set was a blinded negative control sample) assayed across 50 SNPs, manual calls were made for all possible 2,450 genotypes (100% assay completion and 100% call rate). Of these, 2,448 were concordant with HapMap data (99.92%).
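The call-rate and concordance percentages quoted in the text and in Table 1 can be reproduced from the reported counts. The short Python sketch below assumes the definitions implied by those counts (call rate = genotypes called / genotypes attempted; concordance = calls matching HapMap / genotypes called); the function and variable names are illustrative only.

```python
# Worked arithmetic for the reported counts. The definitions below are
# assumed from how the counts are presented (e.g. "10,523 cases vs. 11,087").

def call_rate(called, attempted):
    """Percentage of attempted genotypes that received a call."""
    return 100.0 * called / attempted


def concordance(correct, called):
    """Percentage of called genotypes that agree with HapMap."""
    return 100.0 * correct / called


# (correct, called, attempted) for LDA at a 0.75 threshold, from Table 1.
table1_lda = {
    "total": (10_514, 10_523, 11_087),
    "homozygous": (6_880, 6_883, 7_214),
    "heterozygous": (3_634, 3_640, 3_873),
}

for label, (correct, called, attempted) in table1_lda.items():
    print(
        f"{label}: call rate {call_rate(called, attempted):.2f}%, "
        f"concordance {concordance(correct, called):.2f}%"
    )

# Manual calling in the 50-plex experiment: 2,448 of 2,450 concordant.
print(f"50-plex manual: concordance {concordance(2_448, 2_450):.2f}%")
```

Comparing the homozygous and heterozygous rows in this way makes the slight bias towards calling homozygous genotypes explicit: the homozygous call rate is higher than the heterozygous call rate at the same threshold.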
The two discrepant genotypes were for two different samples each at different SNP loci. Interestingly, the SNP Charts for these two genotypes showed high quality data, and the same samples/genotypes had previously been concordant with HapMap in the initial data set [Additional file 3, and discussed further below].

Table 2: Results summary for 49 HapMap samples and 50 SNPs

Method | Call rate | Concordance with HapMap
Manual calling only | 100% | 99.92% (note 1)
MACGT (no cut-off) | 100% | 99.84% (note 2)
LDA (0 threshold), total cases | 100% (1,941 cases vs. 1,941) | 99.89% (note 3) (1,939 vs. 1,941)
LDA (0 threshold), homozygous cases | 100% (1,289 vs. 1,289) | 100% (note 4) (1,289 vs. 1,289)
LDA (0 threshold), heterozygous cases | 100% (652 vs. 652) | 99.7% (note 5) (650 vs. 652)
MACGT (0.001 cut-off) | 94.04% | 99.94% (note 6)
LDA (0.65 threshold), total cases | 99.18% (1,925 vs. 1,941) | 99.90% (note 3) (1,923 vs. 1,925)
LDA (0.65 threshold), homozygous cases | 98.91% (note 7) (1,275 vs. 1,289) | 100% (1,275 vs. 1,275)
LDA (0.65 threshold), heterozygous cases | 99.7% (650 vs. 652) | 99.7% (648 vs. 650)

Note 1: Two discrepancies amongst 2,450 genotype cases.
Note 2: Three discrepancies amongst 1,926 genotype cases (524 cases used in training set).
Note 3: Two discrepancies amongst 1,941 genotype cases (509 cases used in training set).
Note 4: No discrepancy amongst 1,289 cases (327 cases used in training set).
Note 5: Two discrepancies amongst 652 cases (182 cases used in training set).
Note 6: One discrepancy amongst 1,926 genotype cases (524 cases used in training set).
Note 7: Eleven predictions (all TT and correct) with confidence score less than 0.65 for a single SNP (rs1891403).

Figure 2. HapMap Chip four colour microarray images showing successful de-multiplexing of 50-plex PCR from two Coriell DNA samples (a, b), plus a negative control sample (c), prior to image analysis and automated genotyping. The spots on the negative control image represent positive control probes [8, 14].

Auto-calling was independently undertaken. Initially, MACGT cluster plots and quality control using SNP Chart were used to allow manual selection of a limited training set of samples from the data set [17]. Using this training set, MACGT auto-calling of the test set with a 0.001 fit threshold resulted in a call rate of 94.04% and a concordance rate of 99.94%. When the fit threshold was relaxed to achieve a 100% call rate, three genotypes were discordant with HapMap data. Two of these genotypes (both with high fit values, i.e. good confidence scores) were the same as the two that had been identified as part of the manual calling data. The third discrepancy had a relatively poor fit confidence score. LDA with dynamic variable selection, using a slightly reduced sized training set, yielded identical genotyping results to manual calling, at a 100% call rate across all 50 SNPs (16 NNs at a 0.65 confidence score threshold). Again, the two discrepant genotypes, both of which were incorrectly called as homozygous, had high confidence scores, consistent with high quality APEX assay data. Separate analysis of homozygous and heterozygous cases showed that for a 0.0 threshold, homozygous cases (1,289 in total) achieved a call rate of 100% with 100% HapMap concordance, whereas heterozygous cases (652 in total) achieved a call rate of 100% with 99.7% HapMap concordance (two heterozygous errors with high confidence scores). Surprisingly, with a 0.65 threshold, among 16 non-calls 14 were homozygous with 11 cases (all TT genotypes) from a single SNP rs1891403, which gives a homozygous call rate of 98.9% and a heterozygous call rate of 99.7%.
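The effect of a confidence-score threshold on call rate and concordance, as discussed above, can be sketched in a few lines of Python. The data below are invented purely for illustration (they are not taken from the study), and the rule "score below threshold becomes a non-call (NN)" is an assumption based on how the thresholds are described in the text.

```python
# Hypothetical illustration of thresholding genotype calls by confidence score.
# All genotypes and scores below are made up; they are not study data.

def apply_threshold(calls, threshold):
    """Replace calls whose confidence score is below `threshold` with 'NN'."""
    return [genotype if score >= threshold else "NN" for genotype, score in calls]


def summarize(calls, truth, threshold):
    """Return (call rate %, concordance %) at the given threshold."""
    called = apply_threshold(calls, threshold)
    kept = [(c, t) for c, t in zip(called, truth) if c != "NN"]
    rate = 100.0 * len(kept) / len(calls)
    if not kept:
        return rate, float("nan")
    agree = sum(1 for c, t in kept if c == t)
    return rate, 100.0 * agree / len(kept)


# (called genotype, confidence score) for five toy samples at one SNP.
calls = [("AA", 0.99), ("AG", 0.97), ("GG", 0.62), ("GG", 0.91), ("AA", 0.40)]
truth = ["AA", "AG", "GG", "AG", "AG"]  # reference genotypes

for threshold in (0.0, 0.65):
    rate, conc = summarize(calls, truth, threshold)
    print(f"threshold {threshold}: call rate {rate:.1f}%, concordance {conc:.1f}%")
```

The toy data are chosen to mirror two observations from the text: a wrong call with a high confidence score (the fourth sample) survives any reasonable threshold, while a correct call with a low score (the third sample) is turned into a non-call.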
Interestingly, the LDA-called genotype that had the lowest score (but nevertheless was still called correctly) was the same genotype as the third MACGT-called discordant genotype [see above and Additional file 7]. Subsequent inspection of the SNP Chart for this genotype (heterozygous CT) showed that the ASO-APEX probe intensity signals for the C allele were somewhat lower than the T allele signals. Again, this same sample/genotype had previously been concordant with HapMap in the initial data set, using the original PCR primer pairs. (See below for further discussion of this genotype and the other two discrepant genotypes.)

In summary, we have shown that a combination of multiplex PCR, redundant and robust APEX design and assay, and statistically-robust auto-calling (simple LDA using dynamic variable selection) can achieve 100% completion and call rate with >99.9% accuracy, for multiple SNPs and multiple samples. We believe that this is a significant improvement over other published APEX methodologies. The strength of our methodology is not based on the quality of a single measurement but on the redundancy obtained from measuring the allele intensities by using multiple chemistries. To take advantage of this inherent robustness of the assay we use robust statistical methods that automatically select the most reliable measurements for each SNP to make the genotype call, sample by sample [18]. Redundancy in genotyping arrays is associated with higher costs per SNP, concomitant with lower numbers of SNPs able to be interrogated in a given area of the microarray. For research studies, a trade-off may need to be taken into consideration, given the ever-increasing need to genotype as many SNPs as possible, at minimal cost per SNP, and a recent article by Smemo and Borevitz [29] cogently argues for a reduction in the approximately 40-fold probe redundancy currently featured on Affymetrix GeneChips, which only use hybridization for allelic signal generation. For clinical diagnostics however, we believe that genotyping accuracy, call rate and completion rate are paramount.

To further determine the effect of probe redundancy in our APEX methodology, we used LDA to reanalyze both data sets (original and 50-plex) but using non-redundant and partially-redundant probe-specific data [Additional files 8, 9 and 10]. Fig. 3 and Additional file 11 show simple four-panel scatter plots of the probe data for the 50-plex experiment. In particular, Fig. 3 represents the four separate scatter plots for the SNP rs12466929 corresponding to the four different probe chemistries: ASO.LEFT, ASO.RIGHT, APEX.LEFT and APEX.RIGHT. For each scatter plot, the three possible genotype clusters (previously known from the HapMap data set) are presented with three different colours: blue for allele 1 homozygous; magenta for allele 2 homozygous; and green for allele 1 and allele 2 heterozygous. For the SNP rs12466929, allele 1 is A and allele 2 is G, and the scatter plots are representative of the entire set of 50 HapMap SNPs. The four scatter plots indicate that three out of the four probe chemistries work perfectly well and produce well separable (informative) clusters corresponding to the three genotype classes (AA, AG and GG), whereas one probe chemistry, namely APEX.LEFT, fails to work properly and gives overlapping clusters for AG and GG genotype classes [plot (3) in Fig. 3]. Nevertheless, this probe chemistry gives a well separable cluster for the AA genotype class. This phenomenon conveys the point of considering each probe chemistry separately during the building of the genotype classification model, and in the next stage of the genotype calling algorithm, combining the four genotype models with proper weights adjusted dynamically with the quality of each of the four classifiers (four probe chemistries) specific to each SNP and sample. If all four probes failed to produce informative clusters, then our LDA-based genotype calling algorithm would flag that SNP as a failed SNP, which clearly is not the case for the SNP rs12466929. This is how the redundancy amongst our APEX based genotyping platform is captured through the proposed LDA-based genotype calling algorithm with dynamic variable selection. Viewing the four-panel scatter plots, we would also like to emphasize the point that for most of the SNPs the homozygous clusters show some significant signal intensities corresponding to the other allele, due to spectral overlap within the APEX fluorescent ddNTP chemistry, thus inducing background to the homozygous clusters. Particularly for this reason, we do not often see a homozygous cluster close to either of the X- or Y-axes. Here, the aim is to compare the allele 1 and allele 2 signal intensities for the three possible genotype classes, and then assign a test sample to the appropriate class based on the prior knowledge of the available training set. We would also like to mention that the initial signal intensities corresponding to each allele for all four probe chemistries are converted into the log-scale in order to reduce the variability between several microarray slides.

Figure 3. Simple scatter plots for SNP rs12466929 (A/G) from 50-plex data set (this SNP is representative of the entire set of 50 HapMap SNPs). For each plot the x-axis represents signal values for X allele (A for this SNP) and the y-axis represents signal values for Y allele (G for this SNP). All values are in log scale. Magenta, green, blue and black coloured symbols denote the classes YY (GG), YX (AG), XX (AA) and NN (negative control samples), respectively. Plot (1) combines the two ASO-APEX Left probes (one for each allele); plot (2) combines the two ASO-APEX Right probes (one for each allele); plot (3) is for the APEX Left probe; plot (4) is for the APEX Right probe. All the classifiers except APEX Left (plot 3) give well separated genotype clusters for this SNP. Dynamic variable selection is able to automatically weight these LDA classifiers in such a way that the homozygous AA cluster in plot (3) (blue) is able to contribute to the final call for such genotypes, even though AG (green) and GG (magenta) genotype clusters overlap somewhat for this Left APEX probe. Additional file 11 shows four-panel scatter plots for all 50 SNPs from the 50-plex data set.
[Figure 3 panels, snp.id 12466929 (A/G), X = A, Y = G: (1) log(ASO.XL) vs. log(ASO.YL); (2) log(ASO.XR) vs. log(ASO.YR); (3) log(APEX.XL) vs. log(APEX.YL); (4) log(APEX.XR) vs. log(APEX.YR).]

Performance analyses for the different data sets are described below, addressing the redundant probe chemistry. The extreme left hand column of each table indicates the combination of four classifiers used to build the LDA model [Additional files 8, 9 and 10]. For example, in the first row, all four classifiers were used to give the final genotype call, and in the fourth row, only the left classifiers were used. In the last four rows, only one classifier was used at a time to give independent genotype calls using the simple LDA model (with no dynamic variable selection). For the complete set of 287 HapMap samples and the set of 41 SNPs, the training data had in total 807 genotype cases (among which 519 genotypes were from HapMap Coriell samples and 288 genotypes were from other Coriell samples) and the test data had in total 11,248 genotype cases (among which 163 had no validated genotypes from HapMap for comparison).

For the set of 270 HapMap DNA samples, applying a 0.65 threshold improved the concordance rate (0.31% misclassification rate) with a reduced call rate of 97.30%. We further checked the performance of the same data set applying a stringent threshold of 0.75, which gave 99.91% concordance (0.06% misclassification rate) for a reduced call rate of 94.91%. By applying different threshold levels, we can control the call rates and, given the validated genotype set, we can also check the performance level by calculating the misclassification rates. The underlying supposition is that, with reduced call rate, accuracy should increase successively until it reaches its maximum limit. For the improved 50-plex PCR chemistry, we were able to achieve a high concordance rate (99.89% using all four classifiers) with 100% call rate [Table 2].
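The idea of combining the four probe-chemistry classifiers with quality-dependent weights, described above for rs12466929, can be sketched as follows. This is a hedged illustration only: the weighted-average rule, the weight values and the posterior probabilities are assumptions made for the example, and the published algorithm [18] may compute and apply its weights differently.

```python
# Illustrative sketch of weighting four per-chemistry classifiers.
# ASSUMPTIONS: the weighted-average rule, the weights and the posterior
# probabilities are invented for this example; see ref. [18] for the
# authors' actual LDA with dynamic variable selection.

GENOTYPES = ("XX", "XY", "YY")


def combine(posteriors, weights):
    """Combine per-classifier genotype posteriors into one call.

    posteriors : dict of classifier name -> (P(XX), P(XY), P(YY))
    weights    : dict of classifier name -> non-negative quality weight
    Returns (genotype, confidence); ('NN', 0.0) if no classifier is usable.
    """
    total = sum(weights[name] for name in posteriors)
    if total == 0:
        return "NN", 0.0  # no informative classifier: flag as failed
    combined = [
        sum(weights[name] * probs[i] for name, probs in posteriors.items()) / total
        for i in range(len(GENOTYPES))
    ]
    best = max(range(len(GENOTYPES)), key=combined.__getitem__)
    return GENOTYPES[best], combined[best]


# Toy heterozygous sample: APEX.LEFT cannot separate XY from YY, so it is
# given a low weight and the other three chemistries dominate the call.
posteriors = {
    "ASO.LEFT": (0.02, 0.95, 0.03),
    "ASO.RIGHT": (0.01, 0.97, 0.02),
    "APEX.LEFT": (0.02, 0.45, 0.53),
    "APEX.RIGHT": (0.03, 0.94, 0.03),
}
weights = {"ASO.LEFT": 1.0, "ASO.RIGHT": 1.0, "APEX.LEFT": 0.2, "APEX.RIGHT": 1.0}

genotype, confidence = combine(posteriors, weights)
print(genotype, round(confidence, 3))
```

Under these assumptions, a poorly separating chemistry still contributes where it is informative but cannot override the other three, and a SNP for which every weight is zero is returned as a failed call rather than forced into a genotype class.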
If we apply a 0.65 threshold to the set of 50-plex PCR HapMap samples, the automated call rate is reduced to 99.18%, leaving only 16 non-calls (below the threshold value) to be verified manually using SNP Chart (all of which were correct).

Therefore, we have determined that reliance on any single probe type alone [i.e., APEX Left probe; APEX Right probe; 2 × ASO-APEX Left probes (one for each allele); 2 × ASO-APEX Right probes (one for each allele)] resulted neither in as high an accuracy of genotyping nor in as high a call rate, compared to the dynamic use of multiple probes.

We were interested in further study of the two discrepant genotype cases, since both had previously been concordant with HapMap in the 7-reaction-multiplex PCR data set, and both showed high quality, unambiguous SNP Charts in the 50-plex PCR data set. A third genotype case (concordant with HapMap by manual calling and simple LDA, but with a low quality score of 0.4876) was also discrepant when called by MACGT. We re-amplified these three individual SNP loci from their respective Coriell DNA samples, using the original PCR primers [Additional file 1], and sequenced each amplicon from both ends. The two discrepant genotypes were: 1. DNA sample 192 (NA18502) at SNP rs3776720 – 50-plex genotype GG/HapMap & 7-reaction-multiplex genotype GA; 2. DNA sample 101 (NA18621) at SNP rs12472674 – 50-plex genotype CC/HapMap & 7-reaction-multiplex genotype CT. The third genotype case was DNA sample 228 (NA19210) at SNP rs4739199 – 50-plex genotype (MACGT) TT/HapMap, 7-reaction-multiplex, and 50-plex (manual call & LDA) genotype CT.

As expected, we identified additional polymorphic sites that coincided with the positions delimited by the PCR primer sequences used for the 50-plex reaction. One of the sites was identified as an existing SNP (rs6871885). To our knowledge, the other two sites represent genetic variants not previously reported. For each of these cases, it appears that the sequence variation within the PCR primer site has caused allelic drop-out, resulting in homozygous genotype calls for the two discrepant cases, and a poor quality heterozygous genotype call for the third case (partial allelic drop-out). Specifically, for discrepant genotype case 1 (Coriell NA18502 at SNP rs3776720), we found a neighbouring SNP (T/A) which is located at the 3' end of the anti-sense PCR primer site (5' CGA TGT AGG TGA CAC TAG TAT TGC AGG CAG ACG TGA 3' – [Additional file 6]) – this polymorphic site (30 bp downstream of rs3776720) is reported in dbSNP as rs6871885, with the A base (sense strand) being described as a rare allele (0.083) in sub-Saharan African populations only (Coriell NA18502 is indeed a sub-Saharan African, Yoruba, and is heterozygous for this SNP).

For discrepant genotype case 2 (Coriell NA18621 at SNP rs12472674), we found a sequence variant (G/A) 52 bp downstream of SNP rs12472674, located within the anti-sense PCR primer site (5' CGA TGT AGG TGA CAC TAG CTC AAT ATG TTA CCA CAA 3' – [Additional file 6]) – this variant (heterozygous in Coriell NA18621 – Asian, Han Chinese) has not been previously reported in dbSNP and may represent a novel polymorphism. For the low quality genotype case 3 (Coriell NA19210 at SNP rs4739199), we found a sequence variant (G/A) 45 bp downstream of SNP rs4739199, located within the anti-sense PCR primer site (5' CGA TGT AGG TGA CAC TAG TCC ACT TCA TTA GGT GAA 3' – [Additional file 6]) – this variant (heterozygous in Coriell NA19210 – sub-Saharan African, Yoruba) has also not been previously reported in dbSNP and may represent a novel polymorphism.

Whilst more stringent due-diligence at the 50-plex PCR primer design stage would have alerted us to one of these SNPs (rs6871885), the evidence that we have identified two hitherto unreported SNPs provides a cautionary tale [30]. Elimination of such 'sporadic' genotyping errors due to novel or unaccounted-for SNPs, as well as due to structural variation in the genome (e.g., copy number variants – CNVs) [31], will need to be addressed in future clinical diagnostic genotyping technologies, and possibly even in research discovery studies where any sporadic errors due to hidden SNPs will not cause significant departure from Hardy-Weinberg equilibrium [15]. In preliminary studies we have been able to correct all three discrepancies previously described, using a redundant 50-plex PCR assay that includes two primer pairs for each SNP locus (data not shown).

Finally, due to the low amount (5 ng) of genomic DNA required for the 50-plex PCR (compared to 25 ng for each of the 7-reaction-multiplex PCRs), we have attempted APEX genotyping using our improved methodology on DNA derived from plasma samples. A pilot project was performed on five plasma samples (stored for up to ten years). Comparing the plasma-derived genotyping data with data obtained from high quality genomic DNA for the same five individuals, the call rate was >99% (100% for high quality DNA) and the concordance was >99%, which opens up the possibility of robust and accurate genotyping of clinical plasma samples without any need for prior whole genome amplification.

Conclusion
We report significant improvements to arrayed primer extension (APEX) genotyping methodology that may show utility in future point-of-care genetic diagnostic applications. Our methods have been validated against industry-leading technologies in a blinded experiment based on Coriell DNA samples and SNP genotype data from the International HapMap Project.
Modifications to PCR amplification design have allowed robust 50-plex genotyping from as little as 5 ng of DNA, with 100% call rate and >99.9% accuracy.

Methods
DNA Samples and Validated Genotypes
A set of 287 DNA samples was obtained from McGill University and Génome Québec Innovation Centre (one of the HapMap Project's genotyping centers). This set comprised 270 DNA samples from the Coriell Institute for Medical Research [22] plus hidden duplicates and negative controls, all of which our laboratory was blinded to. We were given access to the validated HapMap genotyping data for these samples only after we had finished the main genotyping experiment (287 samples/50 SNPs), and after we had sent a file of our genotyping results to McGill University.

HapMap APEX Chip – Probe Design and Printing
Six oligonucleotide probes (25-mers) for each SNP were designed using Biodata algorithms (Biodata Ltd., Tartu, Estonia [32]) [Additional file 1]: two classical APEX probes (one probe per DNA strand), plus four allele-specific (ASO) APEX probes (two probes per strand) which include the actual SNP site at the 3' end of the probe. Allele-specific single base extension of these ASO-APEX probes during the reaction is contingent on the presence of the actual complementary base at the SNP site in the sample template DNA [6,10]. Probes were synthesized at a 25 nmol scale and aliquotted into 96-well plates by Integrated DNA Technologies (Coralville, IA, USA). We diluted each probe at 200 pmol/µL as stock concentration in pure water (resistivity of 18.2 MΩ-cm and total organic content of less than five parts per billion) using a Biomek FX robot (Beckman Coulter, Fullerton, CA, USA).

Arrays were generously printed for us at the Microarray Facility of The Prostate Centre at Vancouver General Hospital [33] (University of British Columbia, Vancouver, BC, Canada). Briefly, the APEX and ASO-APEX probe oligonucleotides (50 pmol/µL in 150 mM sodium phosphate printing buffer, pH 8.5) were printed to specific grid positions on CodeLink™ Activated Microarray Slides (Amersham Biosciences/GE Healthcare, Piscataway, NJ, USA) following the manufacturer's recommended protocols. The 5' end of each oligonucleotide probe was amino-modified during synthesis, allowing its covalent attachment to the slide's pre-applied surface chemistry. Each grid consisted of five spot replicates of each of the six probes per SNP, as well as multiple buffer-only spots and positive control normalization spots. The latter comprised an oligonucleotide probe based on a plant-specific gene sequence that will extend by a single N base due to the presence of an exogenous complementary template oligonucleotide in the APEX reaction mixture (Npg1) [14]. Each Npg1 positive control probe was spotted 40 times onto the grid, at regular physical intervals. Each one of the six probes for each SNP was printed at a reasonably wide distance apart from any other probe for the same SNP within the grid (as were their replicate spots). This enabled a useful degree of robustness in the system, especially helpful in cases of high local background and hybridization problems [14]. Each spot was approximately 110 µm in diameter. Three replicated grids were printed on each slide, enabling three samples to be genotyped per slide. Following the printing of the arrays, the slides were incubated overnight at room temperature at 75% relative humidity (saturated NaCl chamber) to drive the covalent coupling reaction between the probes' 5' amino group and the CodeLink™ slide chemistry to completion. Blocking of the arrays was in 50 mM ethanolamine, 0.1 M Tris, pH 9.0, 0.1% SDS, at 50°C for 20 min, according to the manufacturer's protocol.

PCR Amplification and Fragmentation
For the first experiment, PCR primers were designed to amplify the regions across the 50 SNPs, based on a melting temperature (Tm) of 62°C ± 3°C (at 20 mM monovalent salt concentration in PCR buffer [Additional file 1]). All primers were computationally tested against the human genome and found to amplify a single product (Biodata Ltd., Tartu, Estonia [32]). Multiplex PCR amplifications were performed on the Coriell genomic DNA samples (plus several negative PCR control samples that contained no genomic DNA). Each multiplex PCR group had a unique combination of the primer pairs among the 7 reactions [Additional file 2]. Each PCR was performed in a total volume of 15 µL, containing 1.5 µL 10× PCR buffer [Tris-Cl, (NH4)2SO4, 15 mM MgCl2, pH 8.7], 1.5 mM MgCl2, 200 µM dNTPs without dTTP, 160 µM dTTP, 40 µM dUTP, 0.75 U HotStar Taq DNA polymerase (5 U/µL; Qiagen, Valencia, CA, USA), 1 µL 10 µM primer mixtures (each primer), and 25 ng genomic DNA. Incorporation of the dUTP allowed for the amplified DNA to be enzymatically sheared by uracil N-glycosylase (UNG, InterScience, Troy, NY, USA) to produce a DNA size of approximately 50–100 bases, optimal for hybridization to the oligonucleotides on the microarray (see below). Genomic DNA and PCR master mixture were transferred into ABI 384-well reaction plates (Applied Biosystems, Foster City, CA, USA) using a Biomek FX robot (Beckman Coulter, USA). PCR reactions were performed in a GeneAmp PCR System 9700 ThermoCycler (Applied Biosystems, USA).
PCRs were initiated by a 15 min polymerase activation step at 95°C and completed by a final 10 min extension step at 72°C. The PCR cycles were as follows: 35 cycles of 30 s denaturation at 95°C, 30 s annealing at 58°C, and 50 s extension at 72°C.

For the second experiment, in order to increase the efficiency of PCR, we designed 50 × 5' linker PCR primer pairs [Additional file 6] based on a Tm of 65°C ± 7°C and performed 50-plex PCR in one single reaction per sample. Each new PCR primer had a common linker sequence designed at its 5' end (5' TACGACTCACTTAGGGAG 3' for each of the left hand PCR primers/5' CGATGTAGGTGACACTAG 3' for each of the right hand PCR primers). The 3' ends of the primers were chosen to have non-complementary bases with respect to each other (i.e., all primers ended with one or two A bases), in order to reduce the probability of primer interactions and primer-dimer formation. All primers were computationally tested against the human genome and found to amplify a single product. The new amplicon sequences were located within the amplicon sequences from the original primer pairs. The multiplex PCR was carried out in a 25 µL reaction containing 20 nM (final) of each primer plus 20 nM of left and right linker-only primers (left linker: 5' TACGACTCACTTAGGGAG 3'/right linker: 5' CGATGTAGGTGACACTAG 3'), 200 µM dNTPs without dTTP, 160 µM dTTP, 40 µM dUTP, 6 units of HotStar Taq DNA polymerase (5 U/µL; Qiagen, USA), 1.5 mM MgCl2 in 1× PCR reaction buffer [100 mM Tris-HCl, 50 mM KCl, 100 µg/mL Gelatin, pH 8.3] with 5 ng of genomic DNA. PCR was performed using an MJR PTC 200 ThermoCycler (MJ Research, Waltham, MA, USA). PCR was initiated by a 15 min polymerase activation step at 95°C and completed by a final 3 min extension step at 72°C.
The reaction procedure consisted of 40 cycles of denaturation at 95°C for 40 s, primer annealing at 55°C for 2 min and one ramping-up step from 55°C to 70°C for 2.5 min (0.1°C/s) [28].

Aliquots of PCR products were visualized with Gel Red fluorescent nucleic acid dye (Biotium, Hayward, CA, USA) staining under ultraviolet (UV) illumination on a 2% agarose gel, following electrophoresis in 0.5× Tris-borate EDTA (TBE) buffer. The 7 subgroup multiplex PCR products were pooled for each individual Coriell sample and precipitated by adding 2.5 volumes of ice-cold 100% ethanol and 0.25 volumes of 10 M ammonium acetate solution. After precipitation at -20°C overnight, the mixture was centrifuged at 20,800 × g at 4°C for 20 min. The supernatant was carefully removed, and the DNA pellet was washed with 400 µL of ice-cold 70% ethanol. The DNA pellet was then dissolved in 15 µL pure water. 10 µL of this DNA (or 10 µL of unpurified 50-plex PCR products; amplified to a concentration of approximately 300–400 ng/µL) were then fragmented by 1 U uracil-N-glycosylase (UNG; Inter Science Inc., Troy, NY, USA) and unincorporated dNTPs were simultaneously inactivated by digestion with 1 U shrimp alkaline phosphatase (SAP; Amersham Biosciences/GE Healthcare, USA) for 15 min at 37°C, in a 20 µL reaction mixture containing 2 µL 10× digestion buffer [0.5 M Tris-HCl, 0.2 M (NH4)2SO4, pH 9.0], followed by enzyme inactivation for 10 min at 95°C.

Microarray-based Minisequencing: Arrayed Primer Extension (APEX)
The APEX reaction was performed in a total volume of 40 µL by the addition of 17 µL fragmented DNA template, 1 µL of 2 pmol/µL Npg1-positive control template oligonucleotide, 1.25 µM of each fluorescently labeled dideoxynucleotide triphosphate (Texas Red-ddATP, Cy3-ddCTP, Cy5-ddGTP, R110-ddUTP; Perkin Elmer Life Sciences, Boston, MA, USA), 5 U Thermo Sequenase™ DNA polymerase (Amersham Biosciences/GE Healthcare, USA) diluted in its dilution buffer, 2× Thermo Sequenase reaction buffer [10×, 260 mM Tris-HCl, 65 mM MgCl2, pH 9.5]. The reaction mixture was applied to the grid of APEX and ASO-APEX probes previously printed on the CodeLink slide that had been washed two times in 95°C pure water and placed on a Thermo Hybaid HyPro20 incubation plate (Thermo Electron, Waltham, MA, USA) set at 58°C. The reaction mixture was covered with a small piece of Parafilm™, and the APEX reaction allowed to proceed at 58°C with agitation (setting 1) for 20 min. Following the incubation period, slides were washed with 95°C water to remove the template DNA, enzyme, and excess ddNTPs. Further washing in 0.3% Alconox (Alconox Inc., White Plains, NY, USA) and 95°C pure water ensured low background on the array images.

DNA Sequencing
As described in the main paper, we directly sequenced three SNP loci in three independent samples: 1. sample 192 (NA18502) at SNP rs3776720; 2. sample 101 (NA18621) at SNP rs12472674; 3. sample 228 (NA19210) at SNP rs4739199. We performed three single-plex PCR reactions using primer pairs from the first experimental design and methods [Additional file 1] to obtain the DNA fragments including the SNP sites on these three Coriell DNA samples. PCR primer pairs used were: 1. rs3776720 sense 5' GGC CAA GGA AAA GAA ATG AAT CTG CT 3', anti-sense 5' AAC TTT AGT GCA GGA TTT GCC ATC CA 3' – PCR amplicon size of 389 bp; 2. rs12472674 sense 5' TAA AAT CCA ATC AGG CCA ACT GTT CA 3', anti-sense 5' TCA ATG CCA TTA TAT GTG CCA GCC A 3' – PCR amplicon size of 388 bp; 3. rs4739199 sense 5' TCC AGC CAG CAA AAG ATC CTC AAA 3', anti-sense 5' TCA AGC ACA TGT TAC CAG TTT CCC AA 3' – PCR amplicon size of 587 bp.
PCR products were purified using a QIAquick PCR Purification Kit (Qiagen, Valencia, CA, USA) according to the manufacturer's instructions. DNA sequencing reactions were performed by the Nucleic Acid Protein Service Unit [34] at the University of British Columbia (Vancouver, BC, Canada). For each amplicon, sense and anti-sense PCR primers were used as sequencing primers.

Microarray Imaging and Spot Intensity Calculation
Slide microarrays were imaged using an arrayWoRxe Auto Biochip Reader (Applied Precision, LLC, Issaquah, WA, USA), fitted with the following filter sets: 1. A488 – Ex. 480/15× – Em. 530/40 (R110 dye); 2. Cy3 (narrowband) – Ex. 546/11 – Em. HQ570/10 m (Cy3); 3. Texas Red – Ex. 602/13 – Em. 631/23 (Texas Red); 4. Cy5 – Ex. 635/20 – Em. 685/40 (Cy5) (Chroma Technology, Rockingham, VT, USA). Exposure times for each dye were set up to give approximately 60–70% pixel saturation for selected Npg1 positive control probe spots. Resolution of the imager was set to 10 µm. Four 16-bit TIFF files for each array were obtained (one from each channel) and these were imported into SNP Chart [35], a data management and visualization tool for array-based genotyping by primer extension from multiple probes [23]. This software generates visual patterns of spot intensity values, from multiple channels across a multiple probe set specific for a given SNP, allowing easy calling of the genotype. All the images were gridded in SNP Chart by manually selecting four predefined spots that, combined with knowledge of the layout of the grid, allow SNP Chart to locate every spot [23]. Spot segmentation and background subtraction were based on hybrid segmentation algorithms previously published by our laboratory [24,25].
Spot intensity values were normalized by setting the 40 Npg1 positive control spots, widely distributed across each array grid, to an average value of 20,000 units per channel, with the exported normalized intensity value calculated as (scale factor × median signal) [16].

Genotyping – Manual Calling
Manual genotype calling within SNP Chart was carried out as previously described [14,16,23].

Genotyping – Automated Calling Using MACGT
The training set for MACGT (multi-dimensional automated clustering genotyping tool) [17] was selected by manually inspecting SNP Charts for each of the SNPs across some of the 287 samples. For the 50 SNPs, up to ten high-quality charts were chosen as 'prototypes' [23] for each genotype. All prototype data were exported from SNP Chart into a format readable by MACGT. MACGT was run on just the training data, and the clusters for each SNP were manually inspected to ensure there were no errors in the training set. Genotyping was performed by MACGT using the parameters NORMALIZE_GROUP_OF_4 = 1, GROUP_OF_4_MEAN_CUTOFF = 10, PATCH_GROUPS_OF_4 = 1, DROP_NNS = 1 [17]. A 'fit' statistical cut-off of 0.001 was used to identify poor quality genotypes as non-calls (NNs) [17]. Any SNP or sample with a high rate of NNs was subject to further inspection. We identified nine SNPs on which the PCR assay performed poorly and which MACGT could not confidently score, although manual inspection of SNP Charts did show that the assays were somewhat successful, albeit non-reproducibly. The final training set for the 41 SNPs was made up of 519 genotypes [Additional file 3]. All NNs were inspected within SNP Chart and manually called if possible. The final genotypes from MACGT and from those manually called were combined, and compared to the validated genotypes from HapMap using a Microsoft Excel macro [Additional file 3].

Genotyping – Automated Calling Using Simple LDA with Dynamic Variable Selection
Detailed descriptions of the algorithms used in simple linear discriminant analysis (LDA) with dynamic variable selection have previously been published by our laboratory [18]. A brief descriptive example follows, using the data structure for SNP rs12466929 and DNA sample 101 (Coriell NA18621 – genotype AA [Additional file 12]).

Ideally, for variable construction, each genotype call could be based on just one of the four sets of probes: (1) APEX_LEFT; (2) APEX_RIGHT; (3) ASO_1LEFT and ASO_2LEFT; and (4) ASO_1RIGHT and ASO_2RIGHT [Additional file 12]. Considering the underlying chemistry, we have developed four sets of classifiers, named: APEX.L, APEX.R, ASO.L and ASO.R. Each of these classifiers consists of a pair of explanatory variables, generically denoted by X and Y, corresponding to two candidate alleles in the SNP position [Additional file 13]. In Additional file 12, for example, X and Y correspond to the A and G alleles, respectively. Since there are five realizations (replicates) for each of the two entries in each classifier, we summarized the information for each allele by taking a robust average: the median of the relevant signals from five spots, for each of the classifiers. From the example data in Additional file 12, the values of the variables for the classifier APEX.L are
APEX.XL = median (1394, 1148, 597, 1106, 1504) = 1148, and
APEX.YL = median (29, 27, 43, 27, 32) = 29, and so on, as summarized in Additional file 13.
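The median summarization in the worked example above can be reproduced in a few lines. This is only an illustrative sketch: the replicate intensities are the ones quoted in the text for the APEX.L classifier, while the dictionary layout and the helper name `summarize` are assumptions made for the example.

```python
from statistics import median

# Five replicate spot intensities per allele for the APEX.L classifier,
# as quoted in the text (X = A allele, Y = G allele)
apex_left_replicates = {
    "X": [1394, 1148, 597, 1106, 1504],
    "Y": [29, 27, 43, 27, 32],
}


def summarize(replicates):
    """Collapse each allele's replicate signals to their median (a robust average)."""
    return {allele: median(values) for allele, values in replicates.items()}


apex_l = summarize(apex_left_replicates)
print(apex_l)  # {'X': 1148, 'Y': 29}
```

The median is unaffected by the single low replicate (597) in the X channel, which is the kind of robustness to outlying spots that motivates its use here.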
In our subsequent analyses, we have considered different combinations of the above mentioned classifiers.

Our automated genotype calling algorithm is based on simple linear discriminant analysis (LDA), using dynamic variable selection as special criteria for various classifiers related to multiple probes. LDA is a supervised learning technique which requires a valid training set in order to build the classification (genotyping) model for each SNP. For the complete set of 287 HapMap samples, our dynamic variable LDA-based genotype calling algorithm used the same training set as used by MACGT above (i.e., 519 genotypes across the 41 SNPs [Additional file 3]) and predicted the genotypes for the remainder of the samples.

For LDA analysis of the 50-plex PCR chemistry, performed on a subset of 50 HapMap samples which were chosen randomly out of the original 287 samples, we selected prototypes to build a new training set using MACGT clusters, verifying the chosen cases with SNP Chart. We considered two different training sets, one with a small number of prototypes (at most 3 to 4 prototypes in each class) and the other with a minimal number of prototypes (at most 2 prototypes in each class) for each SNP. The two different training sets yielded different performances for the respective test data sets.

For automated genotype calling, we started our analysis by fitting the simple LDA-based genotype model using each classifier separately, and then comparing the predicted genotypes with the validated genotypes. Subsequently, we applied our dynamic-variable LDA-based genotyping model on different combinations of the four classifiers. When combining four classifiers together, for each SNP we apply LDA to each pair of variables in Additional file 13. For generic alleles X and Y, the possible classes are XX, XY, YY and NN (NN class corresponding to negative controls: generally low signal intensities for all channels throughout all probes).
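The way the four per-classifier LDA outputs are merged, which the following paragraph describes as an entropy-based weighting of posterior probabilities, can be sketched as below. This is a hedged illustration rather than the authors' implementation: it assumes the score for each classifier takes the form E = log(4) − H(P), where H is the Shannon entropy of that classifier's posterior over the four classes, so that confident classifiers receive larger weights. The function names and the toy posteriors are hypothetical.

```python
import math

CLASSES = ("XX", "XY", "YY", "NN")


def entropy_score(posterior):
    """Assumed score E = -[log(1/4) + (-sum p log p)] = log(4) - H(p)."""
    entropy = -sum(p * math.log(p) for p in posterior.values() if p > 0)
    return math.log(len(CLASSES)) - entropy


def combine(per_classifier):
    """Weight each classifier's posteriors by its normalized entropy score."""
    scores = {name: entropy_score(post) for name, post in per_classifier.items()}
    total = sum(scores.values())
    if total > 0:
        weights = {name: s / total for name, s in scores.items()}
    else:
        # All classifiers completely uninformative: fall back to equal weights
        weights = {name: 1 / len(scores) for name in scores}
    final = {
        c: sum(weights[n] * per_classifier[n][c] for n in per_classifier)
        for c in CLASSES
    }
    call = max(final, key=final.get)
    return call, final[call], weights


# Hypothetical posteriors for one SNP in one sample (each row sums to 1)
toy_posteriors = {
    "ASO.L": {"XX": 0.97, "XY": 0.01, "YY": 0.01, "NN": 0.01},
    "ASO.R": {"XX": 0.90, "XY": 0.08, "YY": 0.01, "NN": 0.01},
    "APEX.L": {"XX": 0.40, "XY": 0.40, "YY": 0.10, "NN": 0.10},
    "APEX.R": {"XX": 0.85, "XY": 0.10, "YY": 0.03, "NN": 0.02},
}

call, confidence, weights = combine(toy_posteriors)
print(call, round(confidence, 3))
```

The returned `confidence` plays the role of the final weighted probability; a call whose confidence falls below a chosen threshold would be treated as a non-call and left for manual inspection.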
Bayesian posterior probabilities for the possible classes from each of the four possible classifiers are given in Table 3. The posterior probabilities for the four classifiers are combined using an entropy-based weighting scheme. For example, for the ASO.L classifier, define

\[ E_{\mathrm{ASO.L}} = -\left[ \log(1/4) + \left( -\sum_{i \in C} P_{i}^{(\mathrm{ASO.L})} \log\left(P_{i}^{(\mathrm{ASO.L})}\right) \right) \right] \]

Analogous quantities are computed for the other classifiers: E_ASO.R, E_APEX.L and E_APEX.R. Proper weights are obtained by normalizing them, e.g.,

\[ W_{\mathrm{ASO.L}} = \frac{E_{\mathrm{ASO.L}}}{E_{\mathrm{ASO.L}} + E_{\mathrm{ASO.R}} + E_{\mathrm{APEX.L}} + E_{\mathrm{APEX.R}}} \]

The weights are applied to the posterior probabilities of the respective class to give the final class posterior probabilities. For example, the final posterior probability for the XX class is

\[ P_{XX} = W_{\mathrm{ASO.L}} P_{XX}^{(\mathrm{ASO.L})} + W_{\mathrm{ASO.R}} P_{XX}^{(\mathrm{ASO.R})} + W_{\mathrm{APEX.L}} P_{XX}^{(\mathrm{APEX.L})} + W_{\mathrm{APEX.R}} P_{XX}^{(\mathrm{APEX.R})} \]

After obtaining P_XY, P_YY and P_NN in a similar manner, the final genotype call is obtained with highest weighted probability. In the last stage, call rate can be adjusted by applying varying thresholds to the 'final weighted probability' (confidence score), and the concordance with the validated genotype set will vary accordingly. The calls were checked for concordance with the validated genotypes from HapMap.

Table 3: Bayesian posterior probabilities for the possible classes from each of the four possible classifiers
LDA/Classes | XX | XY | YY | NN
ASO.L | Pxx (ASO.L) | Pxy (ASO.L) | Pyy (ASO.L) | PNN (ASO.L)
ASO.R | Pxx (ASO.R) | Pxy (ASO.R) | Pyy (ASO.R) | PNN (ASO.R)
APEX.L | Pxx (APEX.L) | Pxy (APEX.L) | Pyy (APEX.L) | PNN (APEX.L)
APEX.R | Pxx (APEX.R) | Pxy (APEX.R) | Pyy (APEX.R) | PNN (APEX.R)

Additional file 7 contains all 50-plex HapMap genotyping data, for both MACGT and LDA.

Competing interests
The author(s) declare that they have no competing interests.

Authors' contributions
M.P. performed the linear discriminant analyses (LDA) using dynamic variable selection. J.R. performed the wet-lab experiments described in this study, and assisted in the design of the initial multiplex PCR. B.W.T. performed the image analysis and MACGT auto-calling and analysis steps, and assisted S.J.T. in the manual genotype calling. Z.E.C. helped design the 50-plex PCR primers, and undertook initial experimental evaluation of these primers. M.P., J.R., B.W.T. and S.J.T. discussed the results and contributed to the preparation of this manuscript. S.J.T. designed and supervised the experiments and analyses, and wrote the paper.

Additional material
Additional file 1
List of SNPs, probes and PCR primers. Table that details the rs numbers of the 50 SNPs investigated, as well as the APEX and allele-specific APEX probe sequences, and the PCR primer sequences for the initial experiment.
[http://www.biomedcentral.com/content/supplementary/1755-8794-1-5-S1.pdf]
Additional file 2
PCR multiplex groups. Table that details the 7 groups of multiplex PCRs.
[http://www.biomedcentral.com/content/supplementary/1755-8794-1-5-S2.pdf]
Additional file 3
Genotyping results from first experiment. Table that lists the complete genotyping results for 287 HapMap samples and 41 SNPs.
Includes LDA call and MACGT call (with quality scores), as well as original HapMap call.
[http://www.biomedcentral.com/content/supplementary/1755-8794-1-5-S3.pdf]
Additional file 4
SNP Charts showing high quality genotypes (auto-called correctly) and lower quality genotypes (auto-called 'NN', but manually-called correctly). Chart interpretation is given below. Illustrative examples for each genotype case (TT, TC, CC) are shown for the SNP rs1433375. SNP Charts on the left hand side (samples 104, 148 and 67) represent auto-called genotypes, whilst those on the right hand side (samples 125, 128 and 126) represent manually-called genotypes. The y-axes (signal intensity) of each individual genotype class have been set to identical scale values for both auto- and manually-called samples, and clearly show that genotypes can be correctly called (at least by manual inspection of the data) from samples having signal intensities up to an order of magnitude lower than usual. Each chart shows four channel fluorescent intensity data (A, C, G and T) from thirty rs1433375-specific array spots (five replicate spots for six different probes – arranged along x-axis). Starting at the left hand side of each chart, the first five spots ('LEFT T/C') refer to the left-hand APEX probe that will give either a single T (blue) signal (for homozygous TT genotypes), or a C (green) signal (for homozygous CC genotypes), or a mixture of T and C (heterozygous CT). The next five spots ('RIGHT A/G') refer to the right-hand APEX probe that interrogates the complementary DNA strand nucleotide to that of the left-hand APEX probe, hence gives a single A (yellow) signal (for TT), a single G (red) signal (for CC), or a mixed A and G signal (for TC). The remaining spots represent allele-specific APEX probes in which a base-specific fluorescence signifies the presence of the allele.
'_1' probes correspond to the first allele (T in the case of rs1433375), and '_2' probes correspond to the second allele (C). The redundancy and consistency of the data across different probes give high confidence in the assigned genotypes.
[http://www.biomedcentral.com/content/supplementary/1755-8794-1-5-S4.pdf]
Additional file 5
Re-scaled SNP Charts for rs1433375. This figure is a repeat of Additional file 4, except that the y-axes of the SNP Charts on the right hand side (manually-called samples) have been adjusted to show as much of the spot intensity data as possible. Note the relative increases in the background signals, as compared to the chart data on the left hand side (auto-called samples).
[http://www.biomedcentral.com/content/supplementary/1755-8794-1-5-S5.pdf]
Additional file 6
List of PCR primer sequences for 50-plex PCR experiment. PCR primer sequences that were designed for the 50 HapMap SNP loci, with amplicon sizes restricted to between 100 and 200 bp, and with common 5' linkers.
[http://www.biomedcentral.com/content/supplementary/1755-8794-1-5-S6.pdf]
Additional file 7
Genotyping results from second experiment (50-plex PCR). Table that lists the complete genotyping results for 49 HapMap samples and 50 SNPs. Includes LDA call and MACGT call (with quality scores), as well as original HapMap call.
[http://www.biomedcentral.com/content/supplementary/1755-8794-1-5-S7.pdf]
Additional file 8
Performance analyses for the different data sets, addressing the redundant probe chemistry. To further determine the effect of probe redundancy in our APEX methodology, we used LDA to reanalyze both data sets (original and 50-plex) but using non-redundant and partially-redundant probe-specific data.
Three tables are shown (8, 9 and 10).
[http://www.biomedcentral.com/content/supplementary/1755-8794-1-5-S8.doc]
Additional file 9
Performance analyses for the different data sets, addressing the redundant probe chemistry. To further determine the effect of probe redundancy in our APEX methodology, we used LDA to reanalyze both data sets (original and 50-plex) but using non-redundant and partially-redundant probe-specific data. Three tables are shown (8, 9 and 10).
[http://www.biomedcentral.com/content/supplementary/1755-8794-1-5-S9.doc]
Additional file 10
Performance analyses for the different data sets, addressing the redundant probe chemistry. To further determine the effect of probe redundancy in our APEX methodology, we used LDA to reanalyze both data sets (original and 50-plex) but using non-redundant and partially-redundant probe-specific data. Three tables are shown (8, 9 and 10).
[http://www.biomedcentral.com/content/supplementary/1755-8794-1-5-S10.doc]

Acknowledgements
We thank Tom Hudson, Martin Leboeuf, Alexandre Montpetit and Stephanie Roumy (McGill University and Genome Quebec Innovation Centre) for advice and access to HapMap genotype data and Coriell DNA samples. We are grateful to Colleen Nelson, Jamie Rosner, Bruce Dangerfield and Jonathan Ma at the Microarray Facility of The Prostate Centre at Vancouver General Hospital for microarray printing services used for this study [33], and we acknowledge BioData Ltd., Estonia [32] for access to APEX probe design software. We thank Teresa Feuchuk and Jo-Lynn Mervyn for administrative assistance, and Ruben Zamar, Will Welch, Rafeef Abugharbieh, Peter Paré and Bruce McManus for continued support. In addition, we would like to thank the reviewers of this manuscript for their helpful comments.
This research was supported by funding from AllerGen NCE, the National Sanitarium Association (Canada), the British Columbia Lung Association, the Canadian Institutes of Health Research, and the Michael Smith Foundation for Health Research.

References
1. Yang Q, Khoury MJ, Botto L, Friedman JM, Flanders WD: Improving the prediction of complex diseases by testing for multiple disease-susceptibility genes. Am J Hum Genet 2003, 72(3):636-649.
2. Janssens AC, Pardo MC, Steyerberg EW, van Duijn CM: Revisiting the clinical validity of multiplex genetic testing in complex diseases. Am J Hum Genet 2004, 74(3):585-8; author reply 588-9.
3. [...] polymorphism genotyping. Proc Natl Acad Sci U S A 2000, 97(22):12164-12169.
4. Kennedy GC, Matsuzaki H, Dong S, Liu WM, Huang J, Liu G, Su X, Cao M, Chen W, Zhang J, Liu W, Yang G, Di X, Ryder T, He Z, Surti U, Phillips MS, Boyce-Jacino MT, Fodor SP, Jones KW: Large-scale genotyping of complex DNA. Nat Biotechnol 2003, 21(10):1233-1237.
5. Oliphant A, Barker DL, Stuelpnagel JR, Chee MS: BeadArray technology: enabling an accurate, cost-effective approach to high-throughput genotyping. Biotechniques 2002, Suppl:56-8, 60-1.
6. Pastinen T, Raitio M, Lindroos K, Tainola P, Peltonen L, Syvanen AC: A system for specific, high-throughput genotyping by allele-specific primer extension on microarrays. Genome Res 2000, 10(7):1031-1042.
7. Steemers FJ, Gunderson KL: Whole genome genotyping technologies on the BeadArray platform. Biotechnol J 2007, 2(1):41-49.
8. Kurg A, Tonisson N, Georgiou I, Shumaker J, Tollett J, Metspalu A: Arrayed primer extension: solid-phase four-color DNA resequencing and mutation detection technology. Genet Test 2000, 4(1):1-7.
9. Shumaker JM, Metspalu A, Caskey CT: Mutation detection by solid phase primer extension. Hum Mutat 1996, 7(4):346-354.
10. Gemignani F, Perra C, Landi S, Canzian F, Kurg A, Tonisson N, Galanello R, Cao A, Metspalu A, Romeo G: Reliable detection of beta-thalassemia and G6PD mutations by a DNA microarray. Clin Chem 2002, 48(11):2051-2054.
11. Tonisson N, Zernant J, Kurg A, Pavel H, Slavin G, Roomere H, Meiel A, Hainaut P, Metspalu A: Evaluating the arrayed primer extension resequencing assay of TP53 tumor suppressor gene. Proc Natl Acad Sci U S A 2002, 99(8):5503-5508.
12. Tonisson N, Kurg A, Kaasik K, Lohmussaar E, Metspalu A: Unravelling genetic data by arrayed primer extension. Clin Chem Lab Med 2000, 38(2):165-170.
13. Dawson E, Abecasis GR, Bumpstead S, Chen Y, Hunt S, Beare DM, Pabial J, Dibling T, Tinsley E, Kirby S, Carter D, Papaspyridonos M, Livingstone S, Ganske R, Lohmussaar E, Zernant J, Tonisson N, Remm M, Magi R, Puurand T, Vilo J, Kurg A, Rice K, Deloukas P, Mott R, Metspalu A, Bentley DR, Cardon LR, Dunham I: A first-generation linkage disequilibrium map of human chromosome 22. Nature 2002, 418(6897):544-548.
14. Tebbutt SJ, He JQ, Burkett KM, Ruan J, Opushnyev IV, Tripp BW, Zeznik JA, Abara CO, Nelson CC, Walley KR: Microarray genotyping resource to determine population stratification in genetic association studies of complex disease. Biotechniques 2004, 37(6):977-985.
15. Lahermo P, Liljedahl U, Alnaes G, Axelsson T, Brookes AJ, Ellonen P, Groop PH, Hallden C, Holmberg D, Holmberg K, Keinanen M, Kepp K, Kere J, Kiviluoma P, Kristensen V, Lindgren C, Odeberg J, Osterman P, Parkkonen M, Saarela J, Sterner M, Stromqvist L, Talas U, Wessman M, Palotie A, Syvanen AC: A quality assessment survey of SNP genotyping laboratories. Hum Mutat 2006, 27(7):711-714.
16. Tebbutt SJ, Mercer GD, Do R, Tripp BW, Wong AW, Ruan J: Deoxynucleotides can replace dideoxynucleotides in minisequencing by arrayed primer extension. Biotechniques 2006, 40(3):331-338.
17. Walley DC, Tripp BW, Song YC, Walley KR, Tebbutt SJ: MACGT: multi-dimensional automated clustering genotyping tool for analysis of microarray-based mini-sequencing data. Bioinformatics 2006, 22:1147-1149.
18. Podder M, Welch WJ, Zamar RH, Tebbutt SJ: Dynamic variable selection in SNP genotype autocalling from APEX microarray data. BMC Bioinformatics 2006, 7:521.
19. Cremers FP, Kimberling WJ, Kulm M, de Brouwer AP, van Wijk E, te Brinke H, Cremers CW, Hoefsloot LH, Banfi S, Simonelli F, Fleischhauer JC, Berger W, Kelley PM, Haralambous E, Bitner-Glindzicz M, Webster AR, Saihan Z, De Baere E, Leroy BP, Silvestri G, McKay GJ, Koenekoop RK, Millan JM, Rosenberg T, Joensuu T, Sankila EM, Weil D, Weston MD, Wissinger B, Kremer H: Development of a genotyping microarray for Usher syndrome. J Med Genet 2007, 44(2):153-160.
20. Zernant J, Kulm M, Dharmaraj S, den Hollander AI, Perrault I, Preising

Additional file 11
Simple scatter plots for all 50 SNPs from 50-plex data set. For each plot the x-axis represents signal values for X allele and the y-axis represents signal values for Y allele. All values are in log scale. Magenta, green, blue and black coloured symbols denote the classes YY, YX, XX and NN (negative control samples), respectively. Plot (1) combines the two ASO-APEX Left probes (one for each allele); plot (2) combines the two ASO-APEX Right probes (one for each allele); plot (3) is for the APEX Left probe; plot (4) is for the APEX Right probe. The plots for SNPs rs3776720, rs12472674 and rs4739199 include labeled data-points for the individual Coriell samples that gave rise to discrepancies in genotype calling.
[http://www.biomedcentral.com/content/supplementary/1755-8794-1-5-S11.pdf]
Additional file 12
Data structure for SNP rs12466929 & DNA sample HapMap 101 (Coriell NA18621 – AA). Illustrative table of microarray four-channel intensity data from 30 spots corresponding to one SNP (rs12466929) and one DNA sample.
[http://www.biomedcentral.com/content/supplementary/1755-8794-1-5-S12.pdf]
Additional file 13
List of explanatory variables listed by appropriate classifiers.
Each of these classifiers consists of a pair of explanatory variables, generically denoted by X and Y, corresponding to two candidate alleles in the SNP position. Values are based on the data shown in Additional file 12.Click here for file[http://www.biomedcentral.com/content/supplementary/1755-8794-1-5-S13.pdf]Page 14 of 15(page number not for citation purposes)3. Hirschhorn JN, Sklar P, Lindblad-Toh K, Lim YM, Ruiz-Gutierrez M,Bolk S, Langhorst B, Schaffner S, Winchester E, Lander ES: SBE-TAGS: an array-based method for efficient single-nucleotideMN, Lorenz B, Kaplan J, Cremers FP, Maumenee I, Koenekoop RK,Allikmets R: Genotyping microarray (disease chip) for LeberPublish with BioMed Central   and  every scientist can read your work free of charge"BioMed Central will be the most significant development for disseminating the results of biomedical research in our lifetime."Sir Paul Nurse, Cancer Research UKYour research papers will be:available free of charge to the entire biomedical communitypeer reviewed and published immediately upon acceptancecited in PubMed and archived on PubMed Central BMC Medical Genomics 2008, 1:5 http://www.biomedcentral.com/1755-8794/1/5congenital amaurosis: detection of modifier alleles.  InvestOphthalmol Vis Sci 2005, 46(9):3052-3059.21. Jaakson K, Zernant J, Kulm M, Hutchinson A, Tonisson N, Glavac D,Ravnik-Glavac M, Hawlina M, Meltzer MR, Caruso RC, Testa F,Maugeri A, Hoyng CB, Gouras P, Simonelli F, Lewis RA, Lupski JR,Cremers FP, Allikmets R: Genotyping microarray (gene chip)for the ABCR (ABCA4) gene.  Hum Mutat 2003, 22(5):395-403.22. Coriell Institute for Medical Research   [http://coriell.org/]23. Tebbutt SJ, Opushnyev IV, Tripp BW, Kassamali AM, Alexander WL,Andersen MI: SNP Chart: an integrated platform for visualiza-tion and interpretation of microarray genotyping data.  Bioin-formatics 2005, 21(1):124-127.24. 
Abbaspour M, Abu-Gharbieh R, Podder M, Tebbutt SJ: Fully-auto-mated analysis of multi-resolution four-channel microarraygenotyping data.  2006, 6144:61443M 1-8.25. Abbaspour M, Abu-Gharbieh R, Podder M, Tripp BW, Tebbutt SJ:Hybrid spot segmentation in four-channel microarray geno-typing image data: Vancouver, Canada.   ; 2006:M21.3. 26. Wang DG, Fan JB, Siao CJ, Berno A, Young P, Sapolsky R, GhandourG, Perkins N, Winchester E, Spencer J, Kruglyak L, Stein L, Hsie L,Topaloglou T, Hubbell E, Robinson E, Mittmann M, Morris MS, ShenN, Kilburn D, Rioux J, Nusbaum C, Rozen S, Hudson TJ, Lander ES,et al.: Large-scale identification, mapping, and genotyping ofsingle-nucleotide polymorphisms in the human genome.  Sci-ence 1998, 280(5366):1077-1082.27. Brownie J, Shawcross S, Theaker J, Whitcombe D, Ferrie R, NewtonC, Little S: The elimination of primer-dimer accumulation inPCR.  Nucleic Acids Res 1997, 25(16):3235-3241.28. Wang HY, Luo M, Tereshchenko IV, Frikker DM, Cui X, Li JY, Hu G,Chu Y, Azaro MA, Lin Y, Shen L, Yang Q, Kambouris ME, Gao R, ShihW, Li H: A genotyping system capable of simultaneously ana-lyzing >1000 single nucleotide polymorphisms in a haploidgenome.  Genome Res 2005, 15(2):276-283.29. Smemo S, Borevitz JO: Redundancy in genotyping arrays.  PLoSONE 2007, 2:e287.30. Quinlan AR, Marth GT: Primer-site SNPs mask mutations.  NatMethods 2007, 4(3):192.31. Feuk L, Carson AR, Scherer SW: Structural variation in thehuman genome.  Nat Rev Genet 2006, 7(2):85-97.32. Biodata Ltd., Tartu, Estonia   [http://www.biodata.ee/]33. Microarray Facility of The Prostate Centre - at VancouverGeneral Hospital   [http://www.microarray.prostatecentre.com/]34. Nucleic Acid Protein Service Unit at the University of BritishColumbia   [http://www.michaelsmith.ubc.ca/services/NAPS/]35. 
SNP Chart - Single Nucleotide Polymorphism GenotypingSoftware   [http://www.snpchart.ca]Pre-publication historyThe pre-publication history for this paper can be accessedhere:http://www.biomedcentral.com/1755-8794/1/5/prepubyours — you keep the copyrightSubmit your manuscript here:http://www.biomedcentral.com/info/publishing_adv.aspBioMedcentralPage 15 of 15(page number not for citation purposes)
