UBC Faculty Research and Publications

Dynamic variable selection in SNP genotype autocalling from APEX microarray data Podder, Mohua; Welch, William J; Zamar, Ruben H; Tebbutt, Scott J Nov 30, 2006

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata

Download

Media
52383-12859_2006_Article_1260.pdf [ 821.38kB ]
Metadata
JSON: 52383-1.0224112.json
JSON-LD: 52383-1.0224112-ld.json
RDF/XML (Pretty): 52383-1.0224112-rdf.xml
RDF/JSON: 52383-1.0224112-rdf.json
Turtle: 52383-1.0224112-turtle.txt
N-Triples: 52383-1.0224112-rdf-ntriples.txt
Original Record: 52383-1.0224112-source.json
Full Text
52383-1.0224112-fulltext.txt
Citation
52383-1.0224112.ris

Full Text

ralssBioMed CentBMC BioinformaticsOpen AcceResearch articleDynamic variable selection in SNP genotype autocalling from APEX microarray dataMohua Podder1,2, William J Welch1, Ruben H Zamar1 and Scott J Tebbutt*2Address: 1Department of Statistics, University of British Columbia, Vancouver, BC, Canada and 2James Hogg iCAPTURE Centre for Cardiovascular and Pulmonary Research, St. Paul's Hospital, Vancouver, BC, CanadaEmail: Mohua Podder - mpodder@mrl.ubc.ca; William J Welch - will@stat.ubc.ca; Ruben H Zamar - ruben@stat.ubc.ca; Scott J Tebbutt* - stebbutt@mrl.ubc.ca* Corresponding author    AbstractBackground: Single nucleotide polymorphisms (SNPs) are DNA sequence variations, occurringwhen a single nucleotide – adenine (A), thymine (T), cytosine (C) or guanine (G) – is altered.Arguably, SNPs account for more than 90% of human genetic variation. Our laboratory hasdeveloped a highly redundant SNP genotyping assay consisting of multiple probes with signals frommultiple channels for a single SNP, based on arrayed primer extension (APEX). This mini-sequencing method is a powerful combination of a highly parallel microarray with distinctiveSanger-based dideoxy terminator sequencing chemistry. Using this microarray platform, ourcurrent genotype calling system (known as SNP Chart) is capable of calling single SNP genotypesby manual inspection of the APEX data, which is time-consuming and exposed to user subjectivitybias.Results: Using a set of 32 Coriell DNA samples plus three negative PCR controls as a training dataset, we have developed a fully-automated genotyping algorithm based on simple linear discriminantanalysis (LDA) using dynamic variable selection. The algorithm combines separate analyses basedon the multiple probe sets to give a final posterior probability for each candidate genotype. Wehave tested our algorithm on a completely independent data set of 270 DNA samples, withvalidated genotypes, from patients admitted to the intensive care unit (ICU) of St. Paul's Hospital(plus one negative PCR control sample). Our method achieves a concordance rate of 98.9% witha 99.6% call rate for a set of 96 SNPs. By adjusting the threshold value for the final posteriorprobability of the called genotype, the call rate reduces to 94.9% with a higher concordance rateof 99.6%. We also reversed the two independent data sets in their training and testing roles,achieving a concordance rate up to 99.8%.Conclusion: The strength of this APEX chemistry-based platform is its unique redundancy havingmultiple probes for a single SNP. Our model-based genotype calling algorithm captures theredundancy in the system considering all the underlying probe features of a particular SNP,automatically down-weighting any 'bad data' corresponding to image artifacts on the microarrayslide or failure of a specific chemistry. In this regard, our method is able to automatically select theprobes which work well and reduce the effect of other so-called bad performing probes in aPublished: 30 November 2006BMC Bioinformatics 2006, 7:521 doi:10.1186/1471-2105-7-521Received: 06 July 2006Accepted: 30 November 2006This article is available from: http://www.biomedcentral.com/1471-2105/7/521© 2006 Podder et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.Page 1 of 11(page number not for citation purposes)sample-specific manner, for any number of SNPs.BMC Bioinformatics 2006, 7:521 http://www.biomedcentral.com/1471-2105/7/521BackgroundGenotyping SNPsDetermination of the alleles at a specific single nucleotidepolymorphism (SNP) site is called genotyping. An opti-mal genotyping technology should be capable of genotyp-ing any number of SNPs for a large number of individualssatisfying the following criteria: 1. easy and quick devel-opment of an assay from the sequence information; 2.over-all low cost; 3. the data analysis must be simple,transparent, fully-automated and robustly give accurategenotype-calls for all kinds of samples; and 4. the studydesign must be flexible and scalable in all respects (e.g.,number of SNPs investigated). Automated genotype call-ing is an essential part of such a system. A number ofmedium to high-throughput genotyping methods havebeen developed. Among these various techniques, Taq-Man [1] was designed optimally to give genotypes of largenumbers of individuals for one SNP at a time. But from aclinically relevant, personalized medicine point of view,we require a system which can genotype multiple SNPssimultaneously for any single patient sample.Such a system can be achieved through a device known asa genotyping microarray. Through this mechanism, onecan display thousands of specific oligonucleotide probes,precisely located on a small glass slide. These array-basedtechnologies offer both economic and patient specificapplications allowing the genotyping of multiple SNPssimultaneously. There are a number of microarray geno-typing protocols, including Affymetrix GeneChips(R) [2]and Illumina's BeadArray™ system [3]. For the widely usedAffymetrix GeneChip system, a system based on the dis-criminatory power of nucleic acid hybridization to gener-ate the genotyping signals, sophisticated autocallingalgorithms have been developed [4]. Over the last five tosix years Affymetrix has developed and tested a series ofalgorithms using their platform. The Affymetrix Gene-Chip is suitable for very large scale genotyping, e.g.,10,000 or more SNPs at a time, but is expensive formedium to small scale genotyping (e.g., 100 to 200SNPs). The Illumina BeadArray genotyping platform pro-vides a powerful combination of high-throughput andaccuracy with low cost per SNP analysis. Based on theGoldenGate™ genotyping assay, Illumina designed a gen-otype calling algorithm using a Bayesian model, takingthe ratio of two single colored intensity signals corre-sponding to two possible SNP alleles, to give the genotypefor a single SNP [5]. The automatic calling of genotypes isperformed by proprietary software, GenCall, which isbased on a custom-designed clustering algorithm [5]. Toour knowledge, exact details of the algorithm are notavailable in the public domain.technology of single base extension which produces mul-tiple signals from multiple probes [APEX and allele-spe-cific APEX (ASO) probes for both DNA strands]corresponding to a single SNP [6]. To our knowledge,APEX is the only chemistry in which the on-chip assay canbe performed in 20 minutes, making APEX potentiallysuitable for rapid genetic diagnostics in clinical settings:the Affymetrix assay takes several hours for hybridizationon the chip, and Illumina's assays also takes longer com-pared to APEX.Commercial software called Genorama [7] can detect allthe four colors of fluorescence emitted from the dyes usedin an APEX experiment, and then automatically call thebase(s) corresponding to a specific probe spot. The prob-lem with this system is that the underlying scoring algo-rithm treats all probes equally and thus requiresconsiderable inspection of the original array data to pro-duce the final genotype call [8,9]. Using the Genoramabase-calling data for both APEX and AS-APEX probes,Gemignani et al. [10] developed a simple matrix-scorebased algorithm and made the calls corresponding to themost likely genotype, but with considerable manualinspection.Current Genotype Calling System: SNP ChartSNP Chart is a Java based visualization tool, developed byour research group [11]. In this integrated platform, spotintensity data from different and/or replicate probes (ran-domly scattered across the microarray slide) that interro-gate the same SNP are imported, together with a multi-channel TIFF image of the original array experiment. Thissystem is capable of calling any SNP genotype with thehelp of individual manual data inspection. The mainproblem with this genotype-calling system is that it istime-consuming and exposed to user subjectivity bias.Examples of SNP Charts are shown in Figure 1.Data CompositionWe built our genotyping model based on the training setof 32 Coriell DNA samples [12] and 3 negative PCR con-trols [6,11]. Each sample comes from a single microarrayexperiment, conducted on a small glass slide, and con-tains information on all the SNPs under study. Our labo-ratory has developed a robust microarray platform foreach sample patient, generating multiple signals forapproximately one hundred SNPs using two kinds ofprobes, namely, classical APEX probes and allele specificAPEX (ASO) probes [6]. There are six probes in total foreach biallelic SNP and each probe has two replicateswhich make twelve different spots for a single SNP on themicroarray slide. All these spots are randomly scatteredacross the microarray slide but with known coordinates.Page 2 of 11(page number not for citation purposes)Compared to these systems, our laboratory has developeda robust and redundant chemistry platform using theMultiple sets of probes of these types along with their rep-licates make this genotyping platform unique and redun-BMC Bioinformatics 2006, 7:521 http://www.biomedcentral.com/1471-2105/7/521dant. Each spot in the microarray slide produces signalsfrom four different channels, corresponding to A, C, Gand T. In our current genotyping method, we only consid-ered the expected foreground signals and will consider allthe background, non-expected signals for further develop-An example of a data source for a single Coriell sampleand a single SNP is given in Table 1. For the SNPrs1106577, the two possible alleles are C or T. Each row ofthis table represents a single spot. The first column is thespot ID; the second column is the probe name; the thirdExamples of SNP Chart ApplicationFigure 1Examples of SNP Chart Application. Examples of SNP Charts for the SNP rs1106577 to illustrate the data structure (e.g., Table 1). Template DNA from three Coriell samples with three possible genotypes (CC, CT and TT) and one negative control (NN) are shown in four different charts. Each chart shows four-channel fluorescent intensity data (A, C, G, and T) on the ver-tical axes, from 12 rs1106577-specific array spots (duplicate spots for six different probes). On the horizontal axes, 12 probe-names corresponding to 12 spots are given sequentially. 1st and 2nd spots from the left ("LEFT C/T") refer to the left-hand APEX probe that will give either a single C (green) signal (for homozygous CC genotypes) or a T (blue) signal (for homozygous TT genotypes) or a mixture of C and T (heterozygous CT). 3rd and 4th spots from the left ("RIGHT G/A") refer to the right-hand APEX probe that interrogates the DNA strand nucleotide complementary to that of the left-hand APEX probe, thus giv-ing a single G (red) signal (for CC), a single A (yellow) signal (for TT), or a mixed G and A signal (for CT). From left, spots 5 to 12, inclusive, represent allele-specific APEX probes in which a base-specific fluorescence signifies the presence of the allele. Among them, spots 5 to 8 refer to the "_1" probes corresponding to the first allele (C in the case of rs1106577) and spots 9 to 12 refer to the "_2" probes corresponding to the second allele (T). The redundancy and consistency of the data across differ-ent probes give high confidence in the assigned genotypes.CT CCTT NNPage 3 of 11(page number not for citation purposes)ment of genotyping model (see below). column is the expected allele ID for the appropriate spot;BMC Bioinformatics 2006, 7:521 http://www.biomedcentral.com/1471-2105/7/521and the last four columns are the signal intensity valuesfor the four channels corresponding to each spot.In the second column of Table 1, "APEX_LEFT" refers tothe left-hand APEX probe on the sense strand, and"APEX_RIGHT" refers to the right-hand APEX probe onthe anti-sense strand that interrogates the DNA strandnucleotide complementary to that of the left-hand APEXprobe. For all the APEX probes, the fluorescent signalscome from the base position of the SNP allele. In contrast,for all the ASO probes, fluorescent signals come from thebase adjacent (3') to the actual SNP site [6]. For the SNPrs1106577 considered in Table 1, the base 3' adjacent tothe SNP allele is always T on the sense strand. The leftprobes, ASO_1LEFT and ASO_2LEFT, are designed to sig-nal at this adjacent base, T, if the SNP site has the firstallele (C here) and/or the second allele (T here), respec-tively. Similarly, SNP rs1106577 has G in the adjacentposition 3' to the SNP side on the anti-sense strand. Theright probes, ASO_1RIGHT and ASO_2RIGHT, signal atthis adjacent base, G, again for C and/or T at the SNP site,respectively. It is merely the presence or absence of the sig-nal that indicates the SNP allele. According to the probestructure, the signals corresponding to the expected allelesare highlighted.The data represented in Table 1 come from the DNA sam-ple Coriell NA17102 and here the true genotype is CC (seethe top-right CC-chart in Figure 1). According to the APEXchemistry, for the genotype CC the dominating signalsfrom spots 1 and 2 should be C among the two expectedchannels C and T. Similarly the dominating signals fromspots 3 and 4 should be G (complementary to C in theleft-strand) among the two expected channels G and A.the genotype is CC, all the expected signals correspondingto allele 1 (C) should dominate over the other channels,i.e., expected foreground (expected channel correspond-ing to all allele 2 probes) and background signals [6].Note that for spots 11 and 12, the expected signal (G), cor-responding to the presence of the T allele (which is absentin this particular case), is comparable to the backgroundsignals. In Table 1, all the signals which are not high-lighted in bold are considered as background signals,often due to the spectral overlap between dyes, and/or ageneral background.In fact, this is a very good source of data, as all the signalscorresponding to allele 2 (T in this case) are comparableto the level of background signals. Now suppose the truegenotype is TT, then we should expect dominating signalsonly from the expected channels corresponding to allallele 2 probes. For a heterozygous CT genotype, weshould expect dominating signals from all the expectedchannels corresponding to both allele 1 probes and allele2 probes. These features of our redundant and robust plat-form can also be represented through our data visualiza-tion tool: SNP Chart [11]. In Figure 1, four SNP Chartscorresponding to three different genotypes (CT, CC andTT) and a negative control (NN) are shown for the sameSNP (rs1106577).In our study we use the 32 Coriell samples plus three neg-ative PCR controls for model building. These 35 sampleswill be called the Coriell training set. To test the perform-ance of the calling algorithm we also have a completelyindependent set of 270 SIRS (systematic inflammatoryresponse syndrome) DNA samples from the ICU of St.Paul's hospital, plus one test negative control sample. ThisTable 1: Data structure for SNP rs1106577 and DNA sample Coriell NA17102 (CC) (CC-chart in Figure 1)Spot ID Probe ID Expected allele ID A C G TSpot 1 APEX_LEFT C and/or T 732 17003 258 667Spot 2 APEX_LEFT C and/or T 965 28290 348 1046Spot 3 APEX_RIGHT G and/or A 190 85 1198 233Spot 4 APEX_RIGHT G and/or A 353 104 2923 269Spot 5 ASO_1LEFT T 109 5284 80 45700Spot 6 ASO_1LEFT T 107 5456 83 45713Spot 7 ASO_2LEFT T 90 88 20 182Spot 8 ASO_2LEFT T 76 106 22 222Spot 9 ASO_1RIGHT G 288 182 2346 992Spot 10 ASO_1RIGHT G 369 209 3908 1098Spot 11 ASO_2RIGHT G 138 68 166 187Spot 12 ASO_2RIGHT G 151 68 212 193Page 4 of 11(page number not for citation purposes)Rows 5–12 represent the ASO probes in which a base-spe-cific fluorescence signifies the presence of the allele. Sinceset of 271 samples will be called the SIRS test data. Notethat the SIRS data are not used in model building andBMC Bioinformatics 2006, 7:521 http://www.biomedcentral.com/1471-2105/7/521come from a separate study, so they provide a very rigor-ous test. For the training data, there are 123 SNPs on themicroarray slide, but only 96 were usable: (1) 15 SNPshad PCR chemistry failure and (2) 12 SNPs had one of thethree possible genotypes missing among the training set.Formation of ClassifiersIdeally, the genotype call could be solely based on justone of four sets of probes: (1) APEX_LEFT, (2)APEX_RIGHT, (3) ASO_1LEFT and ASO_2LEFT, and (4)ASO_1RIGHT and ASO_2RIGHT (see Table 1). Accord-ingly, we have developed four sets of classifiers, namedAPEX.L, APEX.R, ASO.L and ASO.R, based on the respec-tive probe sets. Each classifier is based on two explanatoryvariables, generically denoted by X and Y, measuring thesignal intensities for the two candidate alleles in the SNPposition. In Table 1, for example, X and Y corresponds tothe C and T alleles, respectively.Between them the four classifiers have four pairs of suchexplanatory variables: (APEX.XL, APEX.YL); (APEX.XR,APEX.YR); (ASO.XL, ASO.YL) and (ASO.XR, ASO.YR).They are derived from the signal intensities in rows 1–2,3–4, 5–8, and 9–12, respectively, in data structures exem-plified in Table 1. All these variables take the sum of therelevant signals. From the example data in Table 1, the val-ues of the variables for the classifier APEX.L are APEX.XL= 17, 003 + 28, 290 = 45, 293 and APEX.YL = 667 + 1, 046= 1, 713, and so on, as summarized in Table 2.Our main objective is to automatically select from thesefour sets of variables those pairs which give "good" signalsfor genotype calling. Moreover, the variables and hencethe classifier(s) used will be chosen dynamically, i.e., fora specific SNP and sample. In this paper we use Fisher's[13] linear discriminant analysis (LDA) to build the clas-sifiers, but the method of dynamic variable selectionwould apply to any linear or nonlinear classifier.Figure 2 and Figure 3 illustrate how dynamic variableselection exploits the redundancy in the chemistry. Thefigures are based on the 32 Coriell samples plus three neg-ative PCR controls, where the true genotypes are known.We plot the X and Y signals for each of the four probe sets.Ideally, any pair of variables would form well separatedclusters for the three possible genotypes, XX, XY and YY(plotted with different colors and symbols). There is afourth cluster corresponding to the negative controls(NN). Any reasonable classifier based on these variablesshould make correct calls under ideal conditions. Figure 2shows an ideal SNP, where all four probe sets producegood separation of the three genotypes and the negativecontrols.Conversely, problems with the samples or the chemistrymay lead to overlap in the four clusters, making callingdifficult. In Figure 3 for SNP rs1003399, for example,sample 11 is a GG genotype which falls in the CG clusterfor the left APEX probe set. Fortunately, the other threeprobe sets correctly place sample 11 in the GG cluster. Sothree out of the four probe sets work, and classifiers basedon them would make the correct call for sample 11. Forsample 20 (NA07341), however, the left APEX probe setworks the best, placing the GG sample clearly in the GGdata cluster. Thus, different probe sets may be effective fordifferent samples, even for the same SNP. Our algorithmattempts to identify effective probe sets automatically,sample by sample, and it is in this sense that it choosesvariables dynamically.Results and DiscussionDynamic-variable LDA Based Genotyping ModelFor each SNP we build four separate LDA classificationmodels; the models are based on the pairs of explanatoryvariables in Table 2 corresponding to the four probe sets.For this stage the training data are the 32 Coriell samplesand the three negative PCR controls described under DataComposition. As test data to evaluate the calling perform-ance we use the 271 SIRS test samples also describedunder Data Composition. Within each SNP, sample bysample the four classifiers are combined using the weight-ing algorithm described later in the Methods Section, togive one call for the particular test sample. The calls arechecked for concordance with the validated genotypes inthe SIRS data, leading to the results in the first row ofTable 3. In 0.4% of samples, the called genotype is NN(non-call), hence the call rate of less than 100% in thetable. As detailed under Methods, by changing the thresh-old for calling, a modest reduction in the call rate to94.9% yields a 99.6% concordance rate.Table 2: Values of the explanatory variables for SNP rs1106577 and DNA sample Coriell NA17102Classifier Variables used by classifier ValuesAPEX.L APEX.XL APEX.YL 45,293 1,713APEX.R APEX.XR APEX.YR 4,121 543ASO.L ASO.XL ASO.YL 91,413 404Page 5 of 11(page number not for citation purposes)ASO.R ASO.XR ASO.YR 6,254 378BMC Bioinformatics 2006, 7:521 http://www.biomedcentral.com/1471-2105/7/521We also reverse the roles of the training and test data sets,leading to the second row of Table 3. The results arestronger in terms of the number of SNPs called, call rateand concordance rate, because in this second analysis amuch larger set of data is used for training the models.Row 3 of Table 3 reports the results from applying themethod of cross validation (CV) [14] to the Coriell dataset. Here, each sample is removed in turn from the data,and its genotype is predicted based on retraining the fourclassifiers using only the remaining data. The results are sim-ilar to those in row 1. For the SIRS data, row 4 reportsSimple LDA Based Genotyping ModelFor comparison, for each SNP we use the training data tobuild a single LDA classification model using all eight var-iables available in Table 2. For each SNP, simple LDAapplied to the training data assigns weights to the eightvariables and these weights are constant for every test sam-ple. Thus, this more standard modeling approach doesnot allocate weights dynamically. The same commentapplies to MACGT from our research group [15], whichalso requires greater levels of manual inspection of theAPEX data.Example of a well-behaved SNP: rs1932819Figure 2Example of a well-behaved SNP: rs1932819. All the classifiers give three well separated clusters for the SNP rs1932819 [red, green, blue and black colored symbols respectively denote the classes YY, XY, XX and NN].2 4 6 8 102468(1)log(ASO.XL)(X= C )log(ASO.YL)(Y= T )●●●●4 6 8 10 12246810(2)log(ASO.XR)(X= C )log(ASO.YR)(Y= T )●●●●2 4 6 8 10 12246810(3)log(APEX.XL)(X= C )log(APEX.YL)(Y= T )●●●2 4 6 8 10 12246810(4)log(APEX.XR)(X= C )log(APEX.YR)(Y= T )●●●●Page 6 of 11(page number not for citation purposes)analogous cross validation performance estimates, andthere is very close agreement with row 2.The results from simple linear discriminant analysis aregiven in Table 4. In row 1 the concordance rate for theBMC Bioinformatics 2006, 7:521 http://www.biomedcentral.com/1471-2105/7/521SIRS test set is 97.3%, which might be considered good forother applications but for clinical purposes a muchsmaller concordance error is desirable. Modifying the call-ing threshold makes negligible difference to the concord-ance rate. Reversing the training and test data shows aneven worse outcome: (1) again changing the thresholdTable 3: Results from Dynamic-variable LDAHigh call rate Lower call rateTraining set Test set No. of SNPs Call rate Concordance rate Call rate Concordance rateCoriell SIRS 96 99.6% 98.9% 94.9% 99.6%SIRS Coriell 102 99.9% 99.3% 95.6% 99.8%Coriell CV 96 100.0% 98.7% 94.2% 99.2%Example of critical SNP: rs1003399Figure 3Example of critical SNP: rs1003399. Sample 11 is correctly classified by both ASO probes and APEX.R probe but wrongly classified by APEX.L probe for the SNP: rs1003399, whereas for sample 20, APEX.L probe works the best [red, green, blue and black colored symbols respectively denote the classes YY, XY, XX and NN].0 2 4 6 8 1002468(1)log(ASO.XL)(X= C )log(ASO.YL)(Y= G )●●●●●●●●●●●●●11200 2 4 6 80246810(2)log(ASO.XR)(X= C )log(ASO.YR)(Y= G )● ●●●●●●●● ●●●● ●11202 4 6 8 1002468(3)log(APEX.XL)(X= C )log(APEX.YL)(Y= G )● ●●●●●●●● ●●●●11202 4 6 8246810(4)log(APEX.XR)(X= C )log(APEX.YR)(Y= G )● ●●●●● ●●●●●●●1120Page 7 of 11(page number not for citation purposes)SIRS CV 102 99.9% 99.3% 96.0% 99.8%BMC Bioinformatics 2006, 7:521 http://www.biomedcentral.com/1471-2105/7/521value does not control the call rate and (2) the concord-ance rate deteriorates drammatically. Therefore the per-formance is not competitive against dynamic-variableLDA.As shown in rows 3 and 4 of Table 4, the performance ofsimple linear discriminant analysis is better when meas-ured by cross validation, particularly when predicting theSIRS data. It seems that the method is not robust to usingsamples from different sources for training and testing.DiscussionWe also tried classifiers based on different sets of varia-bles. For example, we built an ASO classifier using the var-iables ASO.XL, ASO.YL, ASO.XR and ASO.YR and anAPEX classifier using the variables APEX.XL, APEX.YL,APEX.XR and APEX.YR. The calls from the two classifierswere then combined using the dynamic variable method-ology. Little improvement in concordance rate was foundrelative to eight-variable simple LDA. Similar results wereobtained when combining left and right classifiers, basedon the left variables (ASO.XL, ASO.YL, APEX.XL andAPEX.YL) and the right variables (ASO.XR, ASO.YR,APEX.XR and APEX.YR), respectively.ConclusionWe have developed a robust automated genotype callingmethod based on an ASO and APEX microarray platform.Multiple, qualitatively different probes provide redundan-cies in the event that a probe does not provide a reliablesignal. The dynamic-variable calling algorithm respectsthese redundancies, building up an overall call from clas-sifiers based on subsets of variables, with more weightgiven to seemingly more reliable classifiers. The weightschange from one test sample to another; it is in this sensethat the method is dynamic. Standard methods of variableselection (also called feature extraction) as described by,for example, Hand, Mannila, and Smyth [14] or Hastie,Tibshirani, and Friedman [16], would select or filter thevariables and use the same set of reduced variables forevery call. Such a strategy would be appropriate if the sameprobe sets are reliable from sample to sample.For a call rate of approximately 95%, we were able toachieve a concordance rate of 99.6% in a large, independ-ent test set of validated genotypes. The probe data forthose samples/SNPs that are not automatically calledwould be manually inspected within SNP Chart; unlike100% manual inspection, this does not impose an unrea-sonable time burden. The method of combining classifi-ers is not specific to linear discriminant analysis; otherstatistical classifiers could be used. Similarly, the methodcould be applied to other microarray platforms with com-plex redundancies.MethodsLDALinear discriminant analysis (LDA), due to Fisher [13], isone of the oldest methods of discrimination betweenclasses or classification. It is described in virtually everytext book that includes classification (e.g., Hastie, Tib-shirani, and Friedman) [16].LDA is applied to each SNP separately. It is assumed thatthe variables (probe signals) used to classify have a multi-variate normal distribution, with a within-class covari-ance matrix that is common to all classes (the genotypesand a negative control class) but within-class mean vec-tors that vary from one class to another. These quantitiesare estimated from the Coriell training data. For any testsample, the values of the same variables lead to posteriorprobabilities for the various classes. The genotype called isthe class with the highest posterior probability. Themethod also requires the prior probabilities of belongingto the various classes. We assume priors based onobserved frequencies in the training data. This basic LDAmethodology is common to all the strategies we use.Simple LDAIn Simple LDA we train a single LDA genotyping modelusing the logarithms of all eight variables described inTable 2. Among the validated genotypes of the 32 Coriellsamples, there are some cases where the exact genotype isunknown, denoted by NN (non-call). The three negativecontrols added to the Coriell data are also treated as NNas well. Thus, for each SNP there may be up to four classespresent in the training data, corresponding to the threeTable 4: Results from Simple LDAHigh call rate Lower call rateTraining set Test set No. of SNPs Call rate Concordance rate Call rate Concordance rateCoriell SIRS 96 99.4% 97.3% 98.1% 97.3%SIRS Coriell 102 99.5% 93.0% 99.5% 93.0%Coriell CV 96 99.8% 98.4% 99.7% 98.5%Page 8 of 11(page number not for citation purposes)SIRS CV 102 99.4% 99.5% 98.9% 99.6%BMC Bioinformatics 2006, 7:521 http://www.biomedcentral.com/1471-2105/7/521candidate genotypes and NN. Thus, LDA may call NN.The call rate is the proportion of calls that are not NN.Dynamic-variable LDAFor each SNP we apply LDA to each pair of variables inTable 5. For example, the classifier ASO.L is based on theleft ASO variables, log(ASO.XL) and log(ASO.YL). Forgeneric alleles X and Y, the classes are XX, XY, YY, and NN(if all are present). Table 6 sets out the notation for theBayesian posterior probabilities for the possible classesfrom each of the four possible classifiers. For example, is the posterior probability for the XX genotypefrom the classifier, ASO.L. The posterior probabilities forthe four classifiers are combined using an entropy weight-ing scheme. Entropy is a measure of uncertainty or disper-sion of a random variable. Denote the four posteriorprobabilities from any classifier (*) in any row of Table 6by , where c indexes one of the classes (genotypes) inthe setC = {XX, XY, YY, NN}.Then the entropy for this probability distribution overclasses is defined to beEntropy or uncertainty is maximized when all the 's areequal and minimized (taking the value 0) when one of the's is 1 and the others are zero.Entropy is computed for each of the four classifiers inTable 6. We will be giving more weight to a classifier withless entropy (uncertainty). Thus, we define for the ASO.Lclassifier in row 1 of the table, for example,which is a quantity which is large if ASO.L's entropy issmall compared to the maximum possible entropy. Anal-ogous quantities are computed for EASO.R, EAPEX.L, andEAPEX.R in Table 6. The weights for the four classifiers areobtained by normalizing them so that they sum to 1, i.e.,with analogous computations for WASO.R, WAPEX.L, andWAPEX.R. Note that the probabilities in Table 6 and hencethe weights will vary from one test sample to another.The weights for the four classifiers are applied to the pos-terior probabilities for each class (column) in Table 6 toobtain the final class posterior probabilities. For example,the final probability for XX isPXXASO L( . )Pc∗−∗∈∗∑ P Pcc Cclog( ).Pc∗Pc∗E P Pcc CcASO LASO L ASO L.( . ) ( . )log( ) [ log( )],= − − −∈∑14WEE E E EASO.LASO.LASO.L ASO.R APEX.L APEX.R=+ + +,Table 6: Posterior probabilities from four LDA classifiersClassifier/Class XX XY YY NNASO.LASO.RAPEX.LAPEX.RPXXASO L( . ) PXYASO L( . ) PYYASO L( . ) PNNASO L( . )PXXASO R( . ) PXYASO R( . ) PYYASO R( . ) PNNASO R( . )PXXAPEX L( . ) PXYAPEX L( . ) PYYAPEX L( . ) PNNAPEX L( . )APEX R( . ) APEX R( . ) APEX R( . ) APEX R( . )Table 5: Applying LDA using four sets of classifiersClassifier VariablesASO.L log(ASO.XL), log(ASO.YL)ASO.R log(ASO.XR), log(ASO.YR)APEX.L log(APEX.XL), log(APEX.YL)APEX.R log(APEX.XR), log(APEX.YR)Page 9 of 11(page number not for citation purposes)PXX PXY PYY PNNBMC Bioinformatics 2006, 7:521 http://www.biomedcentral.com/1471-2105/7/521with similar calculations for XY, YY, and NN. A sample isassigned to the class with maximum weighted probability.To increase the concordance with the validated test sam-ples (at the expense of reducing the call rate), a call ismade if and only if the maximum probability across theclasses exceeds a threshold. For instance, the results in thelast two columns of the first row of Table 3 are obtainedby requiring the maximum probability to be at least 0.6for a call.Example corresponding to SNP rs1003399 and sample Coriell NA17111Figure 3 relates to SNP rs1003399 and the point labeled11 is Coriell sample NA17111. To check how dynamic-variable LDA works for a sample with complex redun-dancy, we predict the genotype of that sample based onthe remaining 31 Coriell samples plus three negative PCRcontrols. Underlying calculations for both dynamic-varia-ble LDA and simple LDA are shown here.The posterior probabilities from dynamic-variable LDAcorresponding to Table 6 but specific to this example aregiven in Table 7. The final posterior probabilities fromDynamic-variable LDA and Simple LDA are given in Table8. So from Table 8, it is clear that the sample CoriellNA17111 (with validated genotype GG) is correctly clas-sified only by dynamic-variable LDA with confidencemeasure .75, but simple LDA fails to do so. Moreover sim-ple LDA wrongly classifies the sample as CG with highconfidence score (posterior probability 1.000).Competing interestsThe author(s) declare that they have no competing inter-ests.Authors' contributionsMP designed and developed the algorithms, performedthe statistical analysis of the data sets and drafted themanuscript; all the authors contributed to the writing ofthe final version. WJW and RHZ supervised in developingthe algorithms. SJT designed the APEX microarray chipand provided the data sets. All authors read and approvedthe final manuscript.AcknowledgementsWe thank Jian Ruan and Ben Tripp for technical assistance, and Keith Walley and Jim Russell for access to ICU patient DNA samples. We would also like to thank the anonymous reviewers for their helpful comments. This project is funded by the National Sanitarium Association, AllerGen NCE, MITACS, the British Columbia Lung Association, and NSERC.References1. Livak KJ, Flood SJ, Marmaro J, Giusti W, Deetz K: Oligonucleotideswith fluorescent dyes at opposite ends provide a quenchedprobe system useful for detecting PCR product and nucleicacid hybridization.  PCR Methods Appl 1995, 4:357-62.2. Kennedy GC, Matsuzaki H, Dong S, Liu WM, Huang J, Liu G, Su X,Cao M, Chen W, Zhang J, et al.: Large-scale genotyping of com-plex DNA.  Nat Biotechnol 2003, 21:1233-7.3. Oliphant A, Barker DL, Stuelpnagel JR, Chee MS: BeadArray tech-nology: enabling an accurate, cost-effective approach tohigh-throughput genotyping.  Biotechniques 2002:56-8. 60–14. Xiaojun Di, Matsuzaki H, Webster TA, Hubbell E, Liu G, Dong S, Bar-tell D, Huang J, Chiles R, et al.: Dynamic model based algorithmsfor screening and genotyping over 100K SNPs on oligonucle-otide microarrays.  Bioinformatics 2005, 21(9):1958-1963.5. Shen R, Fan JB, Campbell D, Chang W, Chen J, Doucet D, Yeakley J,Bibikova M, Garcia EW, McBride C, Steemers F, Garcia F, KermaniBG, Gunderson K, Oliphant A: High-throughput SNP genotyp-ing on universal bead arrays.  Mutation Research 2005, 573:70-82.6. Tebbutt SJ, He JQ, Burkett KM, Ruan J, Opushnyev IV, Tripp BW,Zeznik JA, Abara CO, Nelson CC, Walley KR: Microarray geno-typing resource to determine population stratification inP W P W P W P WXX ASO.L XXASO.LASO.R XXASO.RAPEX.L XXAPEX.L= + + +( ) ( ) ( ) APEX.R XXAPEX.RP( ),Table 8: Resultant posterior probabilities from two methodsClasses/Methods Dynamic-variable LDA Simple LDACC <0.001 <0.001CG 0.253 1.000GG 0.746 <0.001NN <0.001 <0.001Table 7: Posterior probabilities from Table 6 for SNP rs1003399 and target sample Coriell NA17111Classifier/Class CC CG GG NNASO.L <0.001 0.001 0.999 <0.001ASO.R <0.001 0.003 0.997 <0.001APEX.L <0.001 1.000 <0.001 <0.001APEX.R <0.001 0.005 0.995 <0.001Page 10 of 11(page number not for citation purposes)genetic association studies of complex disease.  Biotechniques2004, 37:977-85.Publish with BioMed Central   and  every scientist can read your work free of charge"BioMed Central will be the most significant development for disseminating the results of biomedical research in our lifetime."Sir Paul Nurse, Cancer Research UKYour research papers will be:available free of charge to the entire biomedical communitypeer reviewed and published immediately upon acceptancecited in PubMed and archived on PubMed Central BMC Bioinformatics 2006, 7:521 http://www.biomedcentral.com/1471-2105/7/5217. Website title   [http://www.asperbio.com]8. Kaminski S, Ahman A, Rusc A, Wojcik E, Malewski T: MilkProtChip– a microarray of SNPs in candidate genes associated withmilk protein biosynthesis-development and validation.  J ApplGenet 2005, 46(1):45-58.9. Kurg A, Tonisson N, Georgiou I, Shumaker J, Tollett J, Metspalu A:Arrayed primer extension: solid-phase four-color DNA rese-quencing and mutation detection technology.  Genet Test 2000,4:1-7.10. Gemignani F, Perra C, Landi S, Canzian F, Kurg A, Tonisson N,Galanello R, Cao A, Metspalu A, Romeo G: Reliable detection ofbeta-thalassemia and G6PD mutations by a DNA microar-ray.  Clin Chem 2002, 48:2051-4.11. Tebbutt SJ, Opushnyev IV, Tripp BW, Kassamali AM, Alexander WL,Andersen MI: SNP Chart: an integrated platform for visualiza-tion and interpretation of microarray genotyping data.  Bioin-formatics 2005, 21:124-7.12. Website title   [http://coriell.umdnj.edu/]13. Fisher RA: The Use of Multiple Measures in Taxonomic Prob-lems.  Ann of Eugenics 1936, 7:179-188.14. Hand D, Mannila H, Smyth P: Principles of Data Mining.  The MITPress, Cambridge, MA; 2001:362-363.  359–36015. Walley DC, Tripp BW, Song YC, Walley KR, Tebbutt SJ: MACGT:multidimensional automated clustering genotyping tool foranalysis of microarray based mini-sequencing data.  Bioinfor-matics 2006, 22:1147-9.16. Hastie T, Tibshirani R, Friedman J: The Elements of StatisticalLearning.  Springer, New York; 2001:126-127.  84–95yours — you keep the copyrightSubmit your manuscript here:http://www.biomedcentral.com/info/publishing_adv.aspBioMedcentralPage 11 of 11(page number not for citation purposes)

Cite

Citation Scheme:

        

Citations by CSL (citeproc-js)

Usage Statistics

Share

Embed

Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                        
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            src="{[{embed.src}]}"
                            data-item="{[{embed.item}]}"
                            data-collection="{[{embed.collection}]}"
                            data-metadata="{[{embed.showMetadata}]}"
                            data-width="{[{embed.width}]}"
                            async >
                            </script>
                            </div>
                        
                    
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.52383.1-0224112/manifest

Comment

Related Items