UBC Faculty Research and Publications

A comparison of five methods for selecting tagging single-nucleotide polymorphisms
Burkett, Kelly M; Ghadessi, Mercedeh; McNeney, Brad; Graham, Jinko; Daley, Denise
Dec 30, 2005

Full Text


BioMed Central
BMC Genetics
Open Access
Proceedings

A comparison of five methods for selecting tagging single-nucleotide polymorphisms

Kelly M Burkett1, Mercedeh Ghadessi2, Brad McNeney2, Jinko Graham2 and Denise Daley*1,3

Address: 1The James Hogg-iCAPTURE Centre for Cardiovascular and Pulmonary Research, University of British Columbia, St. Paul's Hospital, Vancouver, BC V6Z 1Y6, Canada, 2Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, BC V5A 1S6, Canada and 3Department of Epidemiology and Biostatistics, Case Western Reserve University, 44106, Cleveland, OH, USA

Email: Kelly M Burkett - kburkett@sfu.ca; Mercedeh Ghadessi - mghadess@stat.sfu.ca; Brad McNeney - mcneney@sfu.ca; Jinko Graham - jgraham@stat.sfu.ca; Denise Daley* - ddaley@mrl.ubc.ca
* Corresponding author

from Genetic Analysis Workshop 14: Microsatellite and single-nucleotide polymorphism
Noordwijkerhout, The Netherlands, 7-10 September 2004

Published: 30 December 2005
BMC Genetics 2005, 6(Suppl 1):S71 doi:10.1186/1471-2156-6-S1-S71

Supplement: Genetic Analysis Workshop 14: Microsatellite and single-nucleotide polymorphism (Proceedings). Editors: Joan E Bailey-Wilson, Laura Almasy, Mariza de Andrade, Julia Bailey, Heike Bickeböller, Heather J Cordell, E Warwick Daw, Lynn Goldin, Ellen L Goode, Courtney Gray-McGuire, Wayne Hening, Gail Jarvik, Brion S Maher, Nancy Mendell, Andrew D Paterson, John Rice, Glen Satten, Brian Suarez, Veronica Vieland, Marsha Wilcox, Heping Zhang, Andreas Ziegler and Jean W MacCluer

Abstract
Our goal was to compare methods for tagging single-nucleotide polymorphisms (tagSNPs) with respect to the power to detect disease association under differing haplotype-disease association models. We were also interested in the effect that SNP selection samples, consisting of either cases, controls, or a mixture, would have on power. We investigated five previously described algorithms for choosing tagSNPs: two that picked SNPs based on haplotype structure (Chapman-haplotypic and Stram), two that picked SNPs based on pair-wise allelic association (Chapman-allelic and Cousin), and one control method that chose equally spaced SNPs (Zhai). In two disease-associated regions from the Genetic Analysis Workshop 14 simulated data, we tested the association between tagSNP genotype and disease over the tagSNP sets chosen by each method for each sampling scheme. This was repeated for 100 replicates to estimate power. The two allelic methods chose essentially all SNPs in the region and had nearly optimal power. The two haplotypic methods chose about half as many SNPs. The haplotypic methods had poor performance compared to the allelic methods in both regions. We expected an improvement in power when the selection sample contained cases; however, there was only moderate variation in power between the sampling approaches for each method. Finally, when compared to the haplotypic methods, the reference method performed as well or worse in the region with ancestral disease haplotype structure.

Background
Case-control designs are increasingly used in candidate gene association studies to detect common disease alleles. Traditionally, this design requires an a priori hypothesis of the genes to be tested for association. A key concept underlying the design of any disease-marker association study is linkage disequilibrium (LD), or the nonrandom assortment of alleles. LD can be used to identify single-nucleotide polymorphisms (SNPs) that efficiently represent other SNPs in a given region; these SNPs have been called tagging SNPs (tagSNPs). The goal is to select tagSNPs in order to reduce genotyping costs without losing the ability to detect disease associations. Many methods have been developed for selecting tagSNPs, using criteria such as haplotype diversity and pairwise LD. Obtaining samples from sources such as the HapMap http://www.hapmap.org for SNP discovery and LD or haplotype characterization can save both time and genotyping costs, but may compromise power. If a disease allele is rare it may be optimal to sample a population of cases to select tagSNPs, rather than a sample consisting only of healthy individuals.

We assessed the performance of five methods: Stram et al. (implemented in TAGSNPS), Chapman et al. (haplotypic and allelic, implemented in HTSNP2), Cousin et al. [1-3], and the recently proposed approach of Zhai et al. [4] as a control method. We will simply refer to these as the Stram, Chapman-haplotypic, Chapman-allelic, Cousin, or Zhai methods, respectively. TagSNPs were chosen from an initial sample of cases-only, controls-only, and a combined case/control sample in two regions with known disease association. We estimated the power of the tagSNP sets to detect association over 100 simulated case-control studies and compared the number of tagSNPs selected. Although tagSNP methods have been assessed and compared, little information is available on how well the methods compare under different haplotype-disease association models, and the effect that sampling population has on tagSNP selection.

Methods
Performance of the tagSNP selection methods was determined by comparing the results of case-control association studies. Using the Genetic Analysis Workshop 14 simulated dataset and answers, we selected 2 candidate regions for analysis: D2 and D4. We chose these regions because they were known to contain a disease locus and were simulated to have differing haplotype-disease association structure. Region D2 was simulated with the disease allele inserted into structurally similar haplotypes, mimicking the case of a mutation arising on an ancestral haplotype.
Region D4 was simulated with the disease allele inserted into haplotypes of similar frequency so that the disease mutation was not tied to haplotype structure. In practice one would select SNPs flanking the region of interest, and so we included 5 SNPs on both sides of our regions, except for region D2, which is at the right end of the chromosome. The microsatellite locus D09S0348 in region D4 was removed. We considered 17 SNPs in the D2 region and 22 SNPs in the D4 region.

We assessed the performance of 4 tagSNP methods that we classify as allelic or haplotypic and a fifth method that we use as a control (Zhai). In allelic (Cousin and Chapman-allelic) or single-SNP approaches, a SNP is a tagSNP if it is a good surrogate for other SNPs based on some pair-wise measure such as LD or power to detect an association. In haplotypic approaches (Stram and Chapman-haplotypic), the set of tagSNPs captures the information on the haplotype structure in the region.

Stram's method [1], motivated by the common-disease, common-haplotype hypothesis, seeks to identify tagSNP haplotypes that predict common haplotypes by maximizing the minimum coefficient of determination for common haplotypes, R_h^2. The minimum R_h^2 is maximized over all possible tagSNP subsets of a given size. Chapman's implementations (allelic and haplotypic) [2] assume a single causal locus in the region, whose alleles may be predicted by haplotypes of tagSNPs (haplotypic), or tagSNP alleles (allelic). The association between tagSNP alleles or haplotypes and the causal locus is measured through the coefficient of determination, R^2, under the assumption that predicting the true causal locus is no more difficult than predicting any of the SNPs in the region. Cousin's method [3] selects tagSNPs that maximize the power of detecting association with an unobserved disease locus in LD with SNPs in the set.
The power of a set is found by averaging over defined disease model penetrances and over each SNP in the candidate region, assuming each such SNP has an equal chance of being the susceptibility locus. Finally, Zhai's method [4] selects k tagSNPs as equally spaced throughout the candidate region as possible. This is achieved by selecting tagSNPs that minimize the variance of pair-wise SNP distances, as measured on the linkage map. The description of the method does not include criteria for choosing k; therefore, we use it as a control method to verify that the other tagSNP methods actually offer improvements over this more intuitive approach.

For Stram's method, we set the minimum haplotype frequency cut-off to 0.04. Chapman's method was run using a minor allele frequency cut-off of 0. Both Stram and Chapman use an R^2 parameter that measures the coefficient of determination for the underlying model and in both cases we set this parameter to 0.80. We implemented Cousin's method as described in the paper since no software was available. For these 4 methods, subset size was increased until the corresponding thresholds of R_h^2, R^2, and maximal power were attained. We used threshold values given in the original papers. Our implementation of Zhai's method utilized the number of tagSNPs selected by both the Chapman-haplotypic and Stram method as the value of k, and selected from all SNPs. The best set of tagSNPs was chosen from among 10^6 randomly generated candidate sets.

For tagSNP selection, we randomly selected 24 cases, 24 controls and an equal mixture of 24 cases and controls from the entire population. After tagSNP selection, we performed a case/control association study using 100 cases and 100 controls. Initially, 50 samples were used for tagSNP selection and 500 cases and 500 controls were chosen for the association study.
However, we found that the association was too strong to allow meaningful differentiation of the methods, so sample size was lowered. Cases and controls in the association study were randomly selected from the Karangar datasets and included individuals from the tagSNP selection step. Single-locus p-values were obtained from chi-square tests of allelic association. The most significant (i.e., minimum) Bonferroni-corrected p-value within a candidate region and the number of tagSNPs selected were recorded for each method. We repeated this experiment with 100 random samples and estimated power with the proportion of replicates having the Bonferroni-corrected p-value less than 0.05. We determined that differences greater than 10% are greater than simulation error and therefore considered these noteworthy (calculations not shown). Because the allelic test assumes Hardy-Weinberg equilibrium (HWE), we tested for HWE in all SNPs across replicates and found no evidence for deviation at the 5% level after correcting for multiple tests in both regions (results not shown).

Results and Discussion
Although there was consistency over the 100 replicates in the number of tagSNPs chosen by a given method, there were considerable differences across methods in the number of tagSNPs selected (see Table 1). Cousin and Chapman-allelic select nearly all SNPs in both candidate regions as tagSNPs. Since these methods are dependent on the presence of pair-wise LD, we looked at allelic correlations (r^2) in both regions in our first two replicates and found unexpectedly low levels of pair-wise LD. In contrast, the haplotypic approaches of Stram and Chapman selected half as many tagSNPs as the non-haplotypic approaches in both regions. On average, Stram chose one more SNP than Chapman-haplotypic.
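
As a rough illustration of the association-testing procedure described in Methods (single-locus allelic chi-square tests, Bonferroni correction of the minimum p-value in a region, and power taken as the proportion of replicates below 0.05), the following Python sketch may help. It is not the authors' code: the function names and the 0/1/2 genotype coding are assumptions made for this example, and the Bonferroni factor is assumed to be the number of tagSNPs tested in the region, which the text does not state explicitly.

```python
import math


def allele_counts(genotypes):
    """Return (minor, major) allele counts from genotypes coded as 0/1/2
    copies of the minor allele (coding is an assumption of this sketch)."""
    minor = sum(genotypes)
    return minor, 2 * len(genotypes) - minor


def allelic_chi2_pvalue(case_counts, control_counts):
    """1-df chi-square test of allelic association on a 2x2 allele table.

    case_counts and control_counts are (minor, major) allele counts.
    """
    a, b = case_counts
    c, d = control_counts
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        return 1.0  # monomorphic SNP or empty group: no evidence of association
    chi2 = n * (a * d - b * c) ** 2 / margins
    # Upper tail of a chi-square with 1 df equals erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


def min_bonferroni_pvalue(pvalues):
    """Minimum p-value in a region, multiplied by the number of SNPs tested
    (assumed Bonferroni factor) and capped at 1."""
    return min(1.0, min(pvalues) * len(pvalues))


def estimate_power(min_corrected_pvalues, alpha=0.05):
    """Proportion of replicates whose corrected minimum p-value is below alpha."""
    hits = sum(1 for p in min_corrected_pvalues if p < alpha)
    return hits / len(min_corrected_pvalues)
```

In use, one would call `allelic_chi2_pvalue` once per tagSNP in a replicate, pass the resulting list to `min_bonferroni_pvalue`, and finally pass the 100 per-replicate values to `estimate_power`.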

In comparing the SNP sets selected by Stram and Chapman-haplotypic, we found that the average proportion of SNPs in common, relative to the number of all SNPs chosen by both methods, was approximately 30% (results not shown). Cousin and Chapman-allelic choose almost all SNPs, and on average 94% of SNPs were shared in common (results not shown).

Table 1: Summary of p-values, estimated power and size of tagSNP sets over 100 replicates

D2 Region
Method                        Sample    Median p-value^a [1st, 3rd quartile]  Estimated power^b (SE)  Mean # tagSNPs (SD)
Cousin                        cases     0.034 [0.0002, 0.0173]                0.88 (0.033)            16.5 (0.6)
                              controls  0.021 [0.0002, 0.0173]                0.88 (0.033)            16.6 (0.5)
                              mixture   0.033 [0.0002, 0.0176]                0.87 (0.034)            16.5 (0.6)
Chapman-allelic               cases     0.002 [0.0002, 0.0176]                0.88 (0.033)            16.9 (0.3)
                              controls  0.002 [0.0002, 0.0176]                0.88 (0.033)            16.8 (0.4)
                              mixture   0.002 [0.0002, 0.0176]                0.88 (0.033)            16.9 (0.4)
Stram                         cases     0.013 [0.0032, 0.0505]                0.73 (0.044)            7.8 (0.6)
                              controls  0.003 [9 × 10^-5, 0.0348]             0.76 (0.043)            7.9 (0.7)
                              mixture   0.008 [4 × 10^-4, 0.0769]             0.70 (0.046)            7.7 (0.6)
Zhai^c (Stram)                cases     0.039 [0.0126, 0.146]                 0.59 (0.049)
                              controls  0.039 [0.0088, 0.1642]                0.59 (0.049)
                              mixture   0.037 [0.0112, 0.1538]                0.58 (0.049)
Chapman-haplotypic            cases     0.022 [0.0052, 0.1553]                0.64 (0.048)            6.9 (0.4)
                              controls  0.011 [0.0006, 0.1126]                0.68 (0.047)            6.8 (0.4)
                              mixture   0.016 [0.0014, 0.0970]                0.66 (0.047)            6.9 (0.5)
Zhai^d (Chapman-haplotypic)   cases     0.041 [0.0116, 0.1805]                0.61 (0.049)
                              controls  0.022 [0.0060, 0.1126]                0.67 (0.047)
                              mixture   0.037 [0.0086, 0.1620]                0.62 (0.049)

D4 Region
Method                        Sample    Median p-value^a [1st, 3rd quartile]  Estimated power^b (SE)  Mean # tagSNPs (SD)
Cousin                        cases     0.050 [0.0131, 0.1043]                0.49 (0.051)            20.2 (0.9)
                              controls  0.041 [0.0134, 0.1113]                0.53 (0.050)            20.7 (0.8)
                              mixture   0.040 [0.0125, 0.1124]                0.55 (0.050)            20.7 (0.7)
Chapman-allelic               cases     0.050 [0.0129, 0.2307]                0.50 (0.050)            20.0 (1.0)
                              controls  0.047 [0.0136, 0.1792]                0.50 (0.050)            20.8 (0.8)
                              mixture   0.041 [0.0135, 0.1261]                0.52 (0.050)            20.5 (0.9)
Stram                         cases     0.042 [0.0101, 0.2754]                0.53 (0.050)            8.1 (1.0)
                              controls  0.069 [0.0066, 0.2745]                0.44 (0.050)            8.0 (1.1)
                              mixture   0.075 [0.0133, 0.2368]                0.46 (0.050)            7.9 (1.0)
Zhai^c (Stram)                cases     0.099 [0.0141, 0.2564]                0.44 (0.050)
                              controls  0.080 [0.0132, 0.2562]                0.46 (0.050)
                              mixture   0.059 [0.0124, 0.2565]                0.50 (0.050)
Chapman-haplotypic            cases     0.150 [0.0143, 0.5386]                0.37 (0.048)            7.6 (0.6)
                              controls  0.105 [0.0197, 0.3639]                0.37 (0.048)            7.5 (0.6)
                              mixture   0.115 [0.0152, 0.4258]                0.41 (0.049)            7.5 (0.6)
Zhai^d (Chapman-haplotypic)   cases     0.048 [0.0103, 0.2322]                0.50 (0.050)
                              controls  0.055 [0.0108, 0.2196]                0.49 (0.050)
                              mixture   0.059 [0.0107, 0.2192]                0.48 (0.050)

a The median of the Bonferroni-corrected p-values from 100 replicates.
b The proportion of replicates with Bonferroni-corrected p-value < 0.05.
c The number of tagSNPs chosen for Zhai is set to be the same as Stram for each sample within replicate.
d The number of tagSNPs chosen for Zhai is set to be the same as Chapman-haplotypic for each sample within replicate.

Estimated power across all methods was higher in the D2 region than in the D4 region, likely reflecting the underlying disease models used in the data simulation. The estimated powers of Cousin and Chapman-allelic were essentially equal in D2 and D4, and were generally higher than those of the haplotypic methods. Since these methods chose nearly all the SNPs in the region, they basically give the underlying power to detect association. The haplotypic method of Stram had approximately 10% lower estimated power in the D2 region than the allelic methods. The estimated power of Chapman-haplotypic in the D2 region was consistently lower than that of Stram across tagSNP sample sets, but was within the 10% simulation error range. In D4, Stram had estimated power within 10% of the allelic methods. On the other hand, Chapman-haplotypic had greater than 10% differences in estimated power relative to the allelic methods. However, Chapman-haplotypic was within 10% of Stram, except in the cases sample, where there was a 16% reduction in estimated power relative to Stram.
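
The standard errors reported alongside the power estimates in Table 1 appear to be consistent, to within about 0.001, with the usual binomial standard error of a proportion over 100 replicates. This is an inference from the tabulated values rather than something the text states, so the short Python sketch below should be read under that assumption. The helper names are hypothetical, and `is_noteworthy` simply encodes the 10% rule of thumb described in Methods.

```python
import math


def binomial_se(power, n_replicates=100):
    """Binomial standard error of an estimated proportion.

    Assumes power was estimated as a simple proportion of n_replicates
    independent replicates (an assumption of this sketch).
    """
    return math.sqrt(power * (1.0 - power) / n_replicates)


def is_noteworthy(power_a, power_b, threshold=0.10):
    """Apply the 10% rule of thumb: differences larger than the threshold
    are treated as exceeding simulation error."""
    return abs(power_a - power_b) > threshold


# Stram versus the allelic methods in D2, cases sample (values from Table 1).
se_stram = binomial_se(0.73)               # about 0.044
difference_flag = is_noteworthy(0.88, 0.73)  # True: 15 percentage points
```

For example, `binomial_se(0.73)` is about 0.044, matching the Stram cases entry for D2, and the 16-point gap between Stram (0.53) and Chapman-haplotypic (0.37) in the D4 cases sample is flagged by `is_noteworthy`.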
Generally, power was estimated to be higher for the allelic methods than for the haplotypic methods, indicating that even if there is sufficient haplotypic structure to reduce the tagSNP set size, this may result in a loss of power to detect association.

By choosing equidistant SNPs, Zhai's method is similar to the SNP selection approach one might use in practice. Zhai et al. [4] concluded that choosing equally spaced SNPs performed as well as the HapBlock method [5] that selects tagSNPs based on haplotype blocks. For each replicate, we chose the subset sizes for Zhai to match the number of SNPs chosen by both Chapman-haplotypic and Stram. Because Cousin and Chapman-allelic chose almost all SNPs in each region, the comparison to Zhai's method would not be meaningful and would be expected to have the same power. In the D2 region, Zhai had at least 12% less power than Stram for SNP subsets of equal size. However, in the D4 region differences between Zhai and Stram were within 10%. This could suggest that choosing SNPs to tag common haplotypes offers increased power if in the candidate region similar haplotypes carry the disease locus. Alternatively the poor performance of Zhai in D2 may be because the hidden disease locus was located at the very end of the D2 region. Our implementation of Zhai's method cannot select the last SNP in a region, and because we were unable to pad D2 with extra SNPs at the disease-locus end, a potentially important disease-associated SNP could be missed. We re-implemented Zhai to force the inclusion of the last SNP in the region into the tagSNP set, but this did not improve estimated power (results not shown). In contrast, Zhai performed about as well (in D2) or better (in D4) than Chapman-haplotypic.

We had hypothesized power would increase when the tagSNP selection sample contained cases only, because cases would be more likely to carry disease haplotypes. However, the power for the control samples was often greater than or equal to that of the cases. With only moderate variations under 7% in estimated power between the different tagSNP sampling approaches within each method, the variation is within simulation error and we cannot conclude that the initial tagSNP sample altered power.

Conclusion
Our motivation for this study was to compare different methods and sample populations for tagSNP selection with respect to the power to detect disease association. We found that there were no significant differences in estimated power between the 3 selection samples. However, we do note that in regions of low pair-wise LD, reducing the number of SNPs genotyped appears to reduce the power to detect an association, as seen by the generally poorer performance of the smaller tagSNP sets from the haplotypic approaches. Larger samples would have to be recruited in order to offset this lower power. Although we did not determine which thresholds were optimal, for haplotypic methods the suggested thresholds of 0.8 for R^2-values may yield tagSNP sets underpowered to detect association. Those using these approaches should consider larger R^2 thresholds. Finally, we did not replicate the findings of Zhai et al. [4] that tagSNP subsets were no better than equally spaced SNP subsets. In the D2 region, we found that the Stram method had better estimated power than the Zhai method.

There are a few points that limit generalization of these results that we did not address because of time and computational limitations. For example, we could have compared power across methods after forcing the methods to select equal numbers of tagSNPs. Without equal numbers of SNPs, it is unclear whether any differences in estimated power are due simply to the size of the tagSNP set rather than the methods examined. However, for Stram in D2 there was a clear improvement over tagSNP sets of the same size with equally spaced SNPs. Hence, in some situations tagSNP methods can capture more information than a reasonable SNP subset size. Additionally, our study used simulated data. While these data were based on real data from chromosome 6, the methods used to simulate the disease alleles may not reflect what actually occurs in nature. The regions we examined contained low levels of pair-wise LD, and in practice one may not actually use a tagSNP selection strategy in such regions because of the potential to miss a true disease locus.

Abbreviations
HWE: Hardy-Weinberg equilibrium
LD: Linkage disequilibrium
SNP: Single-nucleotide polymorphism
tagSNPs: Tagging SNPs

Authors' contributions
Design of study and research question: KMB, DD, MG, JG, BM. Implementation of new methods: KMB, JG, BM. Running methods: KMB, DD, MG. Writing manuscript: KMB, DD, MG. Editing and proofreading manuscript: KMB, DD, MG, JG, BM. All authors read and approved the final manuscript.

Acknowledgements
This work was supported in part by the Canadian Institutes of Health Research (CIHR) Grants NPG-64869 and ATF-66667 (JG and BM), the Mathematics of Information Technology and Complex Systems, Canadian National Centres of Excellence (JG and MG), Michael Smith Foundation for Health Research Scholar Award (JG), Genome Quebec and Genome Canada, CIHR Interdisciplinary Health Research Team (IHRT) grant (KMB, MG and DD), CIHR IMPACT and IG/IPPH Postdoctoral Fellowships (DD).

References
1. Stram DO, Haiman CA, Hirschhorn JN, Altshuler D, Kolonel LN, Henderson BE, Pike MC: Choosing haplotype-tagging SNPs based on unphased genotype data using a preliminary sample of unrelated subjects with an example from the Multiethnic Cohort Study. Hum Hered 2003, 55:27-36.
2. Chapman JM, Cooper JD, Todd JA, Clayton DG: Detecting disease associations due to linkage disequilibrium using haplotype tags: a class of tests and the determinants of statistical power. Hum Hered 2003, 56:18-31.
3. Cousin E, Genin E, Mace S, Ricard S, Chansac C, del Zompo M, Deleuze JF: Association studies in candidate genes: strategies to select SNPs to be tested. Hum Hered 2003, 56:151-159.
4. Zhai W, Todd MJ, Nielsen R: Is haplotype block identification useful for association mapping studies? Genet Epidemiol 2004, 27:80-83.
5. Zhang K, Deng M, Chen T, Waterman MS, Sun F: A dynamic programming algorithm for haplotype block partitioning. Proc Natl Acad Sci USA 2002, 99:7335-7339.

