UBC Faculty Research and Publications

Dichotomization: 2 × 2 (×2 × 2 × 2...) categories: infinite possibilities
Heavner, Karyn K; Phillips, Carl V; Burstyn, Igor; Hare, Warren
Jun 23, 2010

Heavner et al. BMC Medical Research Methodology 2010, 10:59
http://www.biomedcentral.com/1471-2288/10/59

Open Access
Research article

Dichotomization: 2 × 2 (×2 × 2 × 2...) categories: infinite possibilities

Karyn K Heavner*1,2, Carl V Phillips2, Igor Burstyn3,4 and Warren Hare5

Abstract

Background: Consumers of epidemiology may prefer to have one measure of risk arising from analysis of a 2-by-2 table. However, reporting a single measure of association, such as one odds ratio (OR) and 95% confidence interval, from a continuous exposure variable that was dichotomized withholds much potentially useful information. Results of this type of analysis are often reported for one such dichotomization, as if no other cutoffs were investigated or even possible.

Methods: This analysis demonstrates the effect of using different theory and data driven cutoffs on the relationship between body mass index and high cholesterol using National Health and Nutrition Examination Survey data. The recommended analytic approach, presentation of a graph of ORs for a range of cutoffs, is the focus of most of the results and discussion.

Results: These cutoff variations resulted in ORs between 1.1 and 1.9. This allows investigators to select a result that either strongly supports or provides negligible support for an association; a choice that is invisible to readers. The OR curve presents readers with more information about the exposure-disease relationship than a single OR and 95% confidence interval.

Conclusion: As well as offering results for additional cutoffs that may be of interest to readers, the OR curve provides an indication of whether the study focuses on a reasonable representation of the data or outlier results.
It offers more information about trends in the association as the cutoff changes and the implications of random fluctuations than a single OR and 95% confidence interval.

* Correspondence: karynkh@aol.com
1 School of Public Health, University of Alberta, Edmonton, Alberta, T6G 2L9, Canada
Full list of author information is available at the end of the article
© 2010 Heavner et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

By convention, results of epidemiological analyses are often represented by a single odds ratio (OR) for a dichotomous exposure variable and sometimes with ORs for each of the confounders. An exposure variable is frequently created by dichotomizing a continuous variable such that values below a certain level, known as the cutoff, are classified as low risk and compared to those at high risk (with a value greater than or equal to the cutoff). Results are often reported for one such dichotomization, as if no other cutoffs were considered or even possible.

There are several problems with representing the potentially complex relationship between an initially continuous exposure and an outcome variable by a single OR and 95% confidence interval (CI). First, this approach makes the assumption that the researcher and current and future readers are interested in the same cutoffs, which is often not the case. Second, the use of different cutoffs (as described below) across studies that aim to test the same hypothesis can make it very difficult to compare results. Third, in addition to being a source of publication bias, the effect of the cutoff on the results of epidemiological studies is an overlooked area of model misspecification, even though it is known that dichotomization of a mismeasured continuous variable induces non-differential exposure misclassification [1]. Finally, when the effect estimate influences the choice of cutoff, the snapshot of the data that is typically reported may be misleading.

The effects of publication bias are well known and have been extensively documented. However, the potential consequences of researchers' choices of which results to report from a particular study are underappreciated. The preferential reporting of particular study results, or publication bias in-situ (PBIS) [2], may skew the readers' understanding of the exposure-outcome relationship and lead to inappropriate public health policies. For example, a reader trying to summarize the literature must not only deal with changing variable definitions, but also cannot know whether a series of similar associations in the literature represents studies with fundamentally similar results or merely attempts to find the "same" result by analyzing the data differently. Unbeknownst to readers, a researcher having the choice of several cutoffs for creating dichotomous exposure variables may be tempted to report only the single cutoff that best illustrates the desired outcome (e.g., the cutoff that results in an OR of 1.25 instead of 1.15) as if no other OR was obtained from the analysis, if there is no overriding reason to reject the selected dichotomization as unreasonable [3]. In contrast to typical publication bias, people interested in a particular topic have to speculate about unpublished effect estimates or other results within a given study instead of searching for whole studies that have not been published. A desired effect estimate may be obtained as a result of chance, model selection and/or varying the eligibility criteria and definitions of the outcome, exposure and covariates.
In most cases, consumers of the medical and public health literature do not know if the published cutoffs were based on an a priori hypothesis about a change in the outcome at that cutoff or chosen to obtain an effect estimate that better conforms to the study hypothesis.

We focus on the consequences of selecting different cutoffs for dichotomizing a continuous exposure variable on epidemiological results and reintroduce a simple method that does not force an investigator to choose any particular dichotomization in presenting results. Specifically, we: 1) describe different strategies that are commonly used to select cutoffs for continuous exposure variables; 2) illustrate the potential effects of changing the cutoff on ORs and standard error; and 3) (re)propose a method for summarizing the results of analyses across plausible cutoffs. Our example is meant to be illustrative of a simple point, and not to investigate etiology or even a specific population. The survey sampling design variables (cluster, strata and weight) were not used, and thus the associations should not be interpreted as representing the actual U.S. population-average relationship between body mass index (BMI) and high cholesterol. Also, no attempt was made to analyze causal pathways or correct for confounders, so though we use the conventional terminology of "exposure" and "outcome," we do not mean to suggest that the association is causal.

The recommended analytical approach is not a new concept. In 1991, Wartenberg and Northridge proposed that the OR be plotted as a function of the cutoff as part of an approach to investigate the exposure distribution and its relationship to the outcome [4]. Much of their approach focused on the use of quantile-quantile (Q-Q) plots, later renamed probability-probability (P-P) plots [5,6]. They cautioned that using the OR curve to select a cutoff is a biased representation of the data and violates the assumptions needed for frequentist hypothesis testing [3,4]. Unfortunately, despite this cautionary note, the authors did provide a recommendation for choosing a single cutoff using their method ("... if one must choose a dichotomous cutpoint, or if the data seem consistent with a dichotomous exposure classification, then to be conservative in a public health sense, one should choose the largest odds ratio that is consistent with the observed data and that also provides a relatively stable estimate. Therefore, we recommend choosing the largest odds ratio value within the middle 80-90 percent of the data.") [4]. Later researchers did indeed use the P-P plots or OR curve to select a cutoff (e.g., [7,8]).

References to the 1991 article are tracked in Additional file 1. The method was useful and the P-P plots (but rarely the OR curve) were used periodically until 1994, when the method was criticized as promoting data driven cutoff selection, namely selection of the cutoff that yields the maximum OR [9]. Then the method was virtually abandoned, which is why it was not referenced in earlier versions of this work. There were only a few researchers who used Wartenberg and Northridge's method (or if others did so, they did not cite the original article). Most of the references to the 1991 article related to dichotomization in general or cutpoint bias (a type of PBIS in which researchers "Choose a few reasonable values and report results for the one that is most consistent with the investigator's a priori hypothesis.") [3]. We propose the resurrection of the OR curve (but not the more cumbersome and less flexible P-P plots), not as a method to select a cutoff but as a strategy to maximize the amount of information conveyed to readers, and thus the utility of epidemiological publications. This is particularly important at a time when synthesis of knowledge across studies is essential to formulating policy decisions. This is a timely reintroduction into the epidemiological literature, as improved computational tools make the method realistic for all researchers from the novice to the veteran, and readily searchable online media and post-publication peer review (e.g., including numerous blogs and epiereview.com) make it possible to monitor the use of this method so that it is not abused again.

Methods

Dataset

To provide an easily-understood and replicable example, we illustrate our point using the relationship between BMI and serum cholesterol in a widely-used public database. Four waves of data (1999-2000, 2001-2002, 2003-2004, and 2005-2006) from the National Health and Nutrition Examination Survey (NHANES) were used for all examples presented here [10-13]. The sampling frame, enrolment methodology and response rates for this cross-sectional survey are described elsewhere in detail [10-13]. All analyses were conducted using SAS version 9.1 (SAS Institute, Cary, North Carolina). (The SAS code is available upon request.)

The analysis was limited to the 19,340 NHANES participants who were at least 18 years old and who had data on two continuous variables: BMI (exposure) and total serum cholesterol (outcome). The mean age of the sample was 45.9 (standard deviation = 20.1, median 44) and 48.0% were male. BMI and cholesterol were chosen because they are commonly used in epidemiological analyses, standardized (theory driven) cutoffs exist, and there is significant variation in the cutoffs used for BMI in the literature (sometimes between articles reporting analyses of the same dataset [14] http://www.tobaccoharmreduction.org/papers/heavner-phillips-heffernan-rodu-jun08.pdf).
There have been several studies investigating different BMI cutoffs and there seems to be some recognition that there is not a universally appropriate cutoff [15-20].

BMI was measured in kg/m² and reported in increments of 0.01. To focus on the effect of different exposure dichotomizations, the cutoff for total serum cholesterol was held constant at ≥200 mg/dl (= case) in this analysis. This corresponds to the cutoff for "desirable" cholesterol level recommended by the CDC [21-24]. Logistic regression (PROC LOGISTIC) was used throughout the manuscript to calculate the OR and 95% CI for each cutoff.

The effect of dichotomization on the OR

Theory driven cutoffs

Theory driven cutoffs may be chosen based on 1) a known dose-response relationship (i.e., the threshold where biological effects are known or postulated to start); 2) a literature review of cutoffs used in previous studies; or 3) cutoffs proposed by experts, governmental agencies or non-governmental organizations. The BMI categories that are currently recommended by the CDC and World Health Organization are: underweight (<18.5), normal (18.5-24.9), overweight (25.0-29.9) and obese (≥30.0) [25,26]. These categories are a simplification, because there is evidence that the effect of BMI varies by gender, age and race/ethnicity [15,27,28]. Addressing the debate about the usefulness of these cutoffs is beyond the scope of this study, other than to note that there are many researchers who believe that other cutoffs are useful and might want to see results for them. In addition to the commonly used cutoffs of 25 and 30, those identified in a review by Kuczmarski and Flegal [15] were included as theory driven cutoffs in the present analysis.

Data driven dichotomizations

Data driven dichotomizations that are independent of the magnitude of the exposure-outcome association

In addition to the theory driven cutoffs, numerous data driven cutoffs are possible for BMI and other continuous variables. These cutoffs may be based on univariate statistics or on correlations in the data. Cutoffs may be chosen based on the univariate distribution of the exposure variable. The xth percentile is a common cutoff [29] but any univariate statistic may be used. For this analysis, the mean, median and 75th percentile were chosen, as these are commonly reported statistics in the literature. A second method is to select the cutoff based on the distribution of the exposure among the cases, typically to obtain equal numbers of cases in the exposed and unexposed groups to ensure equal precision in each exposure group (e.g. [30,31]). For this, the median BMI among the cases was chosen as the cutoff. Third, selecting a cutoff based on a desired level of precision was accomplished by calculating the OR and standard error for all cutoffs between the 25th and 75th percentiles of BMI in increments of 0.01. The OR and cutoff corresponding to the minimum standard error were selected. Fourth, a common method to investigate the effect of selecting various cutoffs for a predictive model is to conduct a sensitivity analysis and generate a receiver operating characteristic (ROC) curve (sensitivity versus 1-specificity). The area under the curve was calculated for every cutoff between the 25th and 75th percentiles of BMI in increments of 0.01 and graphed against the cutoff. For this method, the cutoff that maximized the area under this curve was selected as the "best" cutoff. The Youden J statistic (sensitivity + specificity - 1) was also calculated for each cutoff [32].

Association-driven dichotomization

A particularly problematic form of data-driven dichotomization is selecting a cutoff based on the size of the desired effect estimate. This may mean maximizing or minimizing the OR, or selecting the OR that is closest to 1.0 if the desired result is to demonstrate that there is no association between the exposure and outcome or at what level of exposure this is true. Clearly, this can introduce large biases into research and the literature. To illustrate this case, ORs were calculated for every cutoff between the 25th and 75th percentiles of BMI in increments of 0.01. The cutoff with the target OR was then selected.

Recommended analytical approach - distribution of ORs for a plausible range of dichotomizations

In the absence of a strong a priori hypothesis, one possible analytical approach is to present cutoff/OR pairs for many cutoffs as proposed by Wartenberg and Northridge [4]. This allows both the researchers and readers to look for possible thresholds of effect and the OR corresponding to the cutoffs that they are interested in. Reporting the full range of possible cutoffs can be done with a graph of a dense set of discrete cutoffs and the corresponding ORs. This was done for the full range of cutoffs (i.e., all cutoffs from 13.37 (minimum plus 0.01) to 130.20 (maximum minus 0.01) in increments of 0.01). The ORs and 95% CIs were plotted against this full range of cutoffs. Then the graph was limited to cutoffs between the 25th and 75th percentiles of BMI in increments of 0.01 to illustrate the curve for a narrower range of cutoffs that includes both the overweight and obese cutoffs recommended by the CDC.

Results

NHANES sample

Nearly two-thirds (65.9%) of the sample had a BMI of at least 25 (overweight or obese according to the current guidelines [25,26]) and 46% had a total serum cholesterol level ≥200 mg/dl. The distributions of both BMI and cholesterol were skewed to the right. (The distributions of BMI and total serum cholesterol are illustrated in Additional file 2.)

The effect of cutoff selection on the OR

Theory driven cutoffs

The ORs obtained using two currently recommended cutoffs and six previously recommended sets of theory driven cutoffs are presented in Table 1.
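
Each OR reported for a given cutoff comes down to one 2-by-2 table. As a minimal sketch (not the authors' SAS code, which is available only on request), the Python below dichotomizes an exposure at a cutoff and computes the OR with a Wald 95% CI. It assumes an unadjusted model with a single binary predictor, for which the logistic regression estimate coincides with the cross-product ratio of the table. The function name, the `inclusive` flag (which switches between a "≥ cutoff" and a "> cutoff" definition of exposure) and the toy values are illustrative assumptions, not NHANES data.

```python
import math

def odds_ratio_at_cutoff(exposure, is_case, cutoff, inclusive=True):
    """Return (OR, lower, upper) for exposure dichotomized at `cutoff`.

    inclusive=True  compares values >= cutoff with values < cutoff;
    inclusive=False compares values >  cutoff with values <= cutoff.
    Returns None when any cell of the 2-by-2 table is empty.
    """
    # a: exposed cases, b: exposed non-cases,
    # c: unexposed cases, d: unexposed non-cases
    a = b = c = d = 0
    for x, case in zip(exposure, is_case):
        exposed = x >= cutoff if inclusive else x > cutoff
        if exposed and case:
            a += 1
        elif exposed:
            b += 1
        elif case:
            c += 1
        else:
            d += 1
    if min(a, b, c, d) == 0:
        return None
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of log(OR)
    z = 1.96  # normal quantile for a 95% interval
    return math.exp(log_or), math.exp(log_or - z * se), math.exp(log_or + z * se)

# Hypothetical toy data: BMI values and high-cholesterol flags (1 = case).
bmi = [20.0, 22.0, 24.0, 26.0, 28.0, 30.0, 32.0, 34.0]
high_chol = [0, 0, 1, 0, 1, 1, 0, 1]

print(odds_ratio_at_cutoff(bmi, high_chol, 26.0, inclusive=True))
print(odds_ratio_at_cutoff(bmi, high_chol, 26.0, inclusive=False))
```

In the toy data one observation sits exactly at 26.0, so the two calls classify it differently and return different estimates. This is why the comparison convention needs to be stated alongside the cutoff itself.
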
Current analyses of these data are likely to use a cutoff of 25 or 30, resulting in ORs of 1.7 and 1.2, respectively. If the contemporary recommendations for BMI cutoffs were used for analyses conducted between 1980 and the present, it would be nearly impossible to make useful comparisons, conduct meta-analyses or observe temporal trends since ORs obtained from different cutoffs would have been reported. The results might be described with similar prose, but they would, in fact, be measures of different exposures.

Table 1: Theory driven dichotomization

Source | Cutoff(s) | OR (95% CI)
Current WHO/CDC recommendations
Overweight¹ | 25 | 1.7 (1.6, 1.8)
Obese¹ | 30 | 1.2 (1.2, 1.3)
Historical cutoffs
1980 Dietary Guidelines² | Males: 26, Females: 25 | 1.7 (1.6, 1.8)
1984 Health United States¹ | Males: 28, Females: 35 | 1.2 (1.1, 1.2)
1985 NIH Consensus Development Panel, 1985 Health United States, Healthy People 2000¹ | Males: 27.8, Females: 27.3 | 1.5 (1.4, 1.5)
1985 Dietary Guidelines, 1995 Dietary Guidelines² | 25 | 1.7 (1.6, 1.8)
1989 Committee on Diet and Health²,³ | Age specific (years): 19-24: 24, 25-34: 25, 35-44: 26, 45-54: 27, 55-65: 28, >65: 29 | 1.2 (1.1, 1.3)
1990 Dietary Guidelines¹,³ | Age specific (years): 19-34: 25, ≥35: 27 | 1.3 (1.2, 1.4)

¹ Comparing those with a value ≥ the cutoff to those with a value < the cutoff.
² Comparing those with a value > the cutoff to those with a value ≤ the cutoff.
³ Excluded 1046 18 year olds.
ORs measured the association between BMI and high cholesterol (≥200 mg/dl) and were obtained from logistic regression.
(Historical cutoffs based on a review by Kuczmarski RJ, Flegal KM. Criteria for definition of overweight in transition: background and recommendations for the United States. Am J Clin Nutr. 2000;72(5):1074-81.)

Data driven dichotomizations

Data driven dichotomizations that are independent of the magnitude of the exposure-outcome association

A variety of data driven dichotomizations and corresponding ORs are presented in Table 2. The results obtained using the mean and median as the cutoff are similar to those obtained when the cutoff was chosen on the basis of having equal numbers of exposed and unexposed cases and minimizing the standard error. However, different conclusions about the relationship between having 'high' BMI and having high cholesterol may be reached if the 75th percentile is chosen as the cutoff as this OR is approximately 1, as opposed to the median BMI (OR = 1.5).

Figures 1 and 2 illustrate the ROC curve and graph of the area under the curve by cutoff for BMI cutoffs between the 25th and 75th percentiles. There is little difference in the area under the curve for cutoffs between 24 and 26 (with corresponding ORs varying from 1.8 to 1.6). The area under the curve decreases for cutoffs greater than 26. The area under the curve is greatest when the cutoff is 25.55, which would likely be the cutoff chosen based on this sensitivity analysis. (25.55 is also the cutoff that maximizes the Youden J statistic.) The cutoffs selected based on these four methods will likely be different for each dataset, making the results of studies that use these four dichotomization strategies difficult to compare.

Association-driven dichotomization

Presentation of only one OR/cutoff pair using one of the preceding methods throws away the majority of OR/cutoff pairs which are potentially informative, but does not necessarily generate a biased measure of association. But the cutoff can also be chosen based on the association. In this example, cutoffs selected between the 25th and 75th percentiles of BMI to maximize or minimize the OR resulted in ORs of approximately 1.9 or 1.1, respectively, which could have different implications if presented in isolation. It is apparent that the data support a wide range of ORs with approximately equal precision, enabling the investigator to select an exposure definition to obtain a simplistic interpretation of the data in such a way that it either strongly supports an association with high cholesterol or provides negligible support for the association. PBIS would result if researchers investigate such a range of cutoffs but only report one cutoff/OR pair based on their preferred association.

Table 2: Data-driven dichotomization¹

Method | Cutoff | OR (95% CI)
Data-driven dichotomization that is not based on the exposure/outcome association
Determined by the distribution of exposure variable
Mean BMI | 28.16 | 1.4 (1.3, 1.5)
Median BMI | 27.13 | 1.5 (1.4, 1.6)
75th percentile for BMI | 31.40 | 1.1 (1.1, 1.2)
Determined by the distribution of outcome variable
Equal numbers of exposed and unexposed cases | 27.84 | 1.4 (1.4, 1.5)
Effect of a desired precision²
Minimizing the standard error | 27.19 | 1.5 (1.4, 1.6)
Maximizing the area under the curve² | 25.55³ | 1.7 (1.6, 1.8)
Association-driven dichotomization
Effect of a desired size²
Maximizing the OR | 23.79 | 1.9 (1.8, 2.0)
Minimizing the OR | 31.38 | 1.1 (1.1, 1.2)
OR closest to 1.0 | 31.38 | 1.1 (1.1, 1.2)

¹ Comparing those with a value ≥ the cutoff to those with a value < the cutoff. ORs measuring the association between BMI and high cholesterol (≥200 mg/dl) were obtained from logistic regression.
² Varying the BMI cutoff between the 25th (23.75) and 75th (31.40) percentiles in increments of 0.01.
³ 25.55 is also the cutoff with the maximum Youden's J statistic.

Recommended analytical approach

The theory driven methods and data driven methods that are not based on the desired association illustrate that there are numerous legitimate choices for dichotomization of a continuous variable. The choices motivated by the resulting association illustrate that it is easy to bias results with arguably legitimate methods using defensible cutoffs (i.e., cutoffs that would be legitimate if they were chosen for a reason other than data mining) [4].
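
How much room a range of defensible cutoffs leaves can be seen even in data with no association at all. The following sketch is an assumption-laden illustration on simulated values, not a reanalysis of NHANES. Exposure and outcome are generated independently, every cutoff between the (approximate) 25th and 75th percentiles is tried in steps of 0.01, and the smallest and largest ORs are reported. The distribution parameters, sample size, seed and function names are arbitrary choices made for the example.

```python
import random

def odds_ratio(exposure, is_case, cutoff):
    """Cross-product OR comparing exposure >= cutoff with exposure < cutoff."""
    a = b = c = d = 0
    for x, y in zip(exposure, is_case):
        if x >= cutoff:
            if y:
                a += 1  # exposed case
            else:
                b += 1  # exposed non-case
        elif y:
            c += 1      # unexposed case
        else:
            d += 1      # unexposed non-case
    # Assumes no empty cell, which holds in the middle of a large sample.
    return (a * d) / (b * c)

def extreme_ors(exposure, is_case, step=0.01):
    """Try every cutoff between the 25th and 75th percentiles and return
    the (OR, cutoff) pairs with the smallest and the largest OR."""
    ranked = sorted(exposure)
    lo = ranked[len(ranked) // 4]        # approximate 25th percentile
    hi = ranked[(3 * len(ranked)) // 4]  # approximate 75th percentile
    n_steps = int((hi - lo) / step)
    pairs = []
    for i in range(n_steps + 1):
        cutoff = round(lo + i * step, 2)
        pairs.append((odds_ratio(exposure, is_case, cutoff), cutoff))
    return min(pairs), max(pairs)

# Simulated data with NO true association: outcome is independent of exposure.
rng = random.Random(2010)
exposure = [round(rng.gauss(28.0, 6.0), 2) for _ in range(2000)]
is_case = [rng.random() < 0.46 for _ in range(2000)]

smallest, largest = extreme_ors(exposure, is_case)
print("smallest OR %.2f at cutoff %.2f" % smallest)
print("largest  OR %.2f at cutoff %.2f" % largest)
```

Because nothing links exposure and outcome in the simulation, any spread between the two printed ORs is produced by cutoff choice and sampling noise alone. Reporting only one end of that range, as if no other cutoff had been examined, would be an instance of PBIS.
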
A partial solution would be to report results for many candidate cutoffs, as in Tables 1 and 2. Wartenberg and Northridge's proposed alternative, which is more complete and no more difficult to implement, is illustrated in Figure 3. This graph gives the reader the OR corresponding to the cutoff that he/she is interested in and does not make the assumption that the researcher and all readers are interested in the same cutoff. It may also give readers information about possible thresholds and the stability of the OR across a range of cutoffs and may facilitate future meta-analyses by providing the ORs for a variety of cutoffs that may have been used in other studies. The approach does not make any assumptions about the shape of the association between a continuous exposure and dichotomous outcome. The potential implications of the shape of the OR curve are worthy of further investigation but are beyond the scope of the present analysis.

Figure 1 Sensitivity analysis of the relationship between BMI and high cholesterol in the NHANES sample, 1999-2006: Receiver operating characteristic (ROC) curve.

Figure 2 Sensitivity analysis of the relationship between BMI and high cholesterol in the NHANES sample, 1999-2006: Area under the curve for each BMI cutoff.

The value of this graph is immediately obvious even in this simple example. The "OR curve" is very unstable for cutoffs less than 20 or greater than about 55, where variation in the ORs between two consecutive cutoffs and the standard error were the greatest. In the tails, shifting a few individuals between the exposed and unexposed groups as the cutoff varies slightly results in a dramatic change in the OR. Reporting only results based on the tails, or similarly, drawing conclusions from a single study based on cutoffs in the tails would be inappropriate due to the instability in the ORs. However, those cutoffs might be of interest for some purposes, so reporting the full range avoids both over-emphasizing an unstable result and merely throwing away potentially useful information. The 95% CIs were each calculated as if they were based on an a priori hypothesis (no adjustments were made for multiple comparisons) so these should only be used by researchers investigating a specific hypothesis about the exposure at a discrete cutoff.

A representation of the OR curve that excludes the unstable tails is presented in Figure 4. This figure limits the ORs to those obtained from cutoffs between the 25th and 75th percentiles of BMI. The OR decreases from 1.9 to 1.1 as the cutoff increases within this range, consistent with the ROC curve. In other words, as the cutoff increases, the reference (lower BMI) group contains more people who have high cholesterol. Ideally the entire OR curve would be presented but researchers must weigh the additional information gained against the imprecise estimates and potentially more complex regression analysis (e.g., need for additional iterations or non-logistic models) in the tails of the exposure cutoff distribution.

Discussion

If it is determined that dichotomization is the optimal or desirable variable conceptualization, there are alternatives to presenting the OR derived from a single theory or data driven cutoff. Wartenberg and Northridge's method is easily achievable using modern software, and provides a more comprehensive picture of the exposure-outcome relationship in a dataset.
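
As an indication of how little is needed to produce such a curve with current tools, here is a rough Python sketch (an assumption-based illustration, not the authors' SAS implementation). It computes the unadjusted OR and a Wald 95% CI at each cutoff on a grid, skipping cutoffs that leave an empty cell in the 2-by-2 table. The function name `or_curve`, the grid arguments and the toy data are hypothetical.

```python
import bisect
import math

def or_curve(exposure, is_case, lo, hi, step=0.01):
    """Return a list of (cutoff, OR, lower, upper) for cutoffs lo, lo+step, ..., hi.

    Exposed means exposure >= cutoff. Cutoffs that produce an empty cell
    are skipped because the OR is undefined there.
    """
    # Sort cases and non-cases once so each cutoff needs only a binary search.
    case_x = sorted(x for x, y in zip(exposure, is_case) if y)
    ctrl_x = sorted(x for x, y in zip(exposure, is_case) if not y)
    n_steps = int(round((hi - lo) / step))
    curve = []
    for i in range(n_steps + 1):
        cutoff = round(lo + i * step, 6)
        c = bisect.bisect_left(case_x, cutoff)  # cases below the cutoff
        d = bisect.bisect_left(ctrl_x, cutoff)  # non-cases below the cutoff
        a = len(case_x) - c                     # cases at or above the cutoff
        b = len(ctrl_x) - d                     # non-cases at or above the cutoff
        if min(a, b, c, d) == 0:
            continue
        log_or = math.log((a * d) / (b * c))
        se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
        curve.append((cutoff,
                      math.exp(log_or),
                      math.exp(log_or - 1.96 * se),
                      math.exp(log_or + 1.96 * se)))
    return curve

# Hypothetical toy data: BMI values and high-cholesterol flags (1 = case).
bmi = [20.0, 22.0, 24.0, 26.0, 28.0, 30.0, 32.0, 34.0]
high_chol = [0, 0, 1, 0, 1, 1, 0, 1]

for cutoff, or_, lower, upper in or_curve(bmi, high_chol, 22.0, 32.0, step=1.0):
    print("cutoff %.2f  OR %.2f  (%.2f, %.2f)" % (cutoff, or_, lower, upper))
```

The returned tuples can be passed to any plotting tool, with the cutoff on the horizontal axis and the OR and its interval on the vertical axis. Sorting once and using binary search keeps the sweep cheap even for a grid as fine as 0.01 over a large sample.
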
The OR curve offers more information about both trends in the association as the cutoff changes and the implications of random fluctuations. The present illustration demonstrates this for the relatively simple case of changing an exposure variable cutoff as well as the feasibility of presenting a more complete representation of the exposure-outcome relationship.

It has been argued that cutoffs should be chosen based on an a priori hypothesis or the variable should be analyzed as a continuous variable [9]. However, in practice it is not unusual for researchers to analyze data using multiple cutoffs and report the result that conforms to their hypothesis [3]. Presenting the OR curve is a more transparent method and does not preclude researchers from focusing the discussion on a specific cutoff.

Figure 3 Odds ratio curves for the relationship between high cholesterol and different BMI cutoffs in the NHANES sample, 1999-2006: The effect of changing the BMI cutoff on the OR.

A comprehensive comparison of the advantages and disadvantages of dichotomization is beyond the scope of this paper and has been debated in detail elsewhere. Dichotomization has been proposed as a method to guard against misspecification of the disease model, but even in this application it does not appear to be a universally advantageous solution [33,34]. One of the benefits of dichotomization is that the methods and results are often more understandable and inherently useful to the average researcher and consumer of epidemiology, including clinicians, than is the reporting of continuous functions (including splines (e.g. [35]), polynomial regression, and other methods used to fit a curve to an outcome as a function of a continuous independent variable). Dichotomization can usefully summarize the effect of an exposure, particularly when supplemented with other analytic strategies [34].
We propose the presentation of OR curves as one such analytic strategy.

This method is fairly simple and flexible. It can be used for either categorical or continuous outcomes, unlike the P-P plots. Similar, albeit more complex, methods may also be used if it is important or desirable to use a polychotomous exposure variable or convert continuous covariates or outcomes to categorical variables. It would be more difficult to comprehensively report the interaction of changing cutoffs for several variables, though a third axis or multiple OR curves may be presented. It is important to realize that while this strategy solves some problems, it does not address other problems from categorizing continuous variables, such as non-differential misclassification in the imputed categorical variables due to measurement error in the continuous variable [1]. This is an analysis of the available data, not the subjects' true BMI and cholesterol values, so measurement error was not taken into account.

A common exposure variable, like BMI, which generates high interest and numerous plausible cutoffs, emphasizes how much useful information is lost when a single cutoff is presented. For less-studied variables and idiosyncratic studies, reporting all possible cutoffs is of greater importance and might be able to reduce bias. Running multiple models and presenting only one exposure cutoff and the corresponding OR, as if no other models were considered, may introduce bias to reported study results [3]. The desire to obtain a parsimonious result that reduces the representation of complex social and biological relationships to a single number may have resulted in the preferential publication of ORs on one side of the OR curve.

The dataset used in this work is an unusually large sample for an epidemiological study, so the variation is almost entirely due to true differences in the association when different cutoffs are used, rather than random error. Cutoffs between the 25th and 75th percentile were focused on to illustrate the variation in ORs that is possible from a relatively narrow range of cutoffs in the middle of the range of exposure values. In the middle of the OR curve, there were only slight changes in the OR between cutoffs. The OR curve for a small sample is likely more similar to the tails of the presented OR curve, where moving a few subjects from one category to another strongly affects the association and the same OR may be obtained from disparate cutoffs. In that case, the potential for bias is greater as the associations are less stable and random variation may produce an outlier OR for a narrow range of cutoffs.

Figure 4 Odds ratio curves for the relationship between high cholesterol and different BMI cutoffs in the NHANES sample, 1999-2006: The effect of changing the BMI cutoff between the 25th and 75th percentile of BMI on the OR.

Conclusion

The influence of cutoff selection is invisible to the reader and may be a mystery to the researcher if cutoff selection is not discussed and only the results obtained from a single cutoff are presented. Presenting only one preferred cutoff/OR pair when multiple such pairs were investigated results in a large amount of potentially useful knowledge being discarded and exchanges unbiased random error for non-random error. Frequentist test statistics are typically reported, but they are no longer meaningful since the error is no longer random [36]. This is rarely acknowledged by researchers and few readers are aware of the problem or the resulting pattern of bias. The post-publication solution would be to make data available for reanalysis so that other researchers can run models using their own choices of cutoffs and other modeling decisions.
Until such transparency of data and results is the standard of practice, reporting the full range of possible ORs offers a partial solution and may help reverse the prevalent myth and belief that there is a "right answer," or "correct OR," that can be derived from a single study.

Additional material

Additional file 1 Appendix 1. Studies that referenced the 1991 Wartenberg and Northridge study.
Additional file 2 Appendix 2. BMI and total serum cholesterol in the NHANES sample, 1999-2006 (n = 19,340).

Abbreviations

CI: confidence interval; NHANES: National Health and Nutrition Examination Survey; OR: odds ratio; PBIS: publication bias in situ.

Competing interests

At the time that this research was conducted, Drs. Heavner and Phillips were funded by an unrestricted grant to the University of Alberta, School of Public Health, from U.S. Smokeless Tobacco Company. Dr. Burstyn is funded by a Population Health Investigator salary award from the Alberta Heritage Foundation for Medical Research. This research was investigator initiated and the funders played no role in it. Dr. Phillips has built much of his career on questions of how epidemiologic research results are misleading as currently reported, and in particular has openly argued that the biggest such problem is publication bias in situ (PBIS), though there is limited agreement on this point. Thus, he has the incentive to produce analyses that demonstrate the potential implications of PBIS.

Authors' contributions

The study was initiated by CVP and all the authors contributed to generating the methods. KH conducted the data analysis. All authors contributed to writing the manuscript and have read and approved the final manuscript.

Author Details

1 School of Public Health, University of Alberta, Edmonton, Alberta, T6G 2L9, Canada
2 TobaccoHarmReduction.org, Saint Paul, MN, 55104, USA
3 Department of Environmental and Occupational Health, Drexel University School of Public Health, Philadelphia, Pennsylvania, 19102, USA
4 Department of Medicine, Faculty of Medicine and Dentistry, University of Alberta, Edmonton, Alberta, T6G 1K4, Canada
5 Department of Math, Statistics, and Physics, University of British Columbia, Okanagan, Kelowna, British Columbia, V1V 1V7, Canada

Received: 24 November 2009 Accepted: 23 June 2010 Published: 23 June 2010

This article is available from: http://www.biomedcentral.com/1471-2288/10/59

© 2010 Heavner et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

References

1. Gustafson P: Measurement Error and Misclassification in Statistics and Epidemiology: Impacts and Bayesian Adjustments. New York City: Chapman & Hall CRC; 2004.
2. Phillips CV: Publication bias in situ. BMC Med Res Methodol 2004, 4:20.
3. Wartenberg D, Northridge M: Wartenberg and Northridge reply. Am J Epidemiol 1994, 139:443-444.
4. Wartenberg D, Northridge M: Defining exposure in case-control studies: a new approach. Am J Epidemiol 1991, 133:1058-1071.
5. Zeger SL: Re: "Defining exposure in case-control studies: a new approach". Am J Epidemiol 1992, 136:1294.
6. Wartenberg D, Northridge M: The authors reply. Am J Epidemiol 1992, 136:1294.
7. Arfken CL, Lach HW, Birge SJ, Miller JP: The prevalence and correlates of fear of falling in elderly persons living in the community. Am J Public Health 1994, 84:565-570.
8. Schulgen G, Lausen B, Olsen JH, Schumacher M: Outcome-oriented cutpoints in analysis of quantitative exposures. Am J Epidemiol 1994, 140:172-184.
9. Altman DG: Problems in dichotomizing continuous variables. Am J Epidemiol 1994, 139:442-445.
10. CDC (NCHS): National Health and Nutrition Examination Survey Data (1999-2000). Hyattsville, MD, U.S. Department of Health and Human Services, Centers for Disease Control and Prevention; 1999. 10-2-2008
11. CDC (NCHS): National Health and Nutrition Examination Survey Data (2001-2002). Hyattsville, MD, U.S. Department of Health and Human Services, Centers for Disease Control and Prevention; 2001. 10-2-2008
12. CDC (NCHS): National Health and Nutrition Examination Survey Data (2003-2004). Hyattsville, MD, U.S. Department of Health and Human Services, Centers for Disease Control and Prevention; 2003. 10-2-2008
13. CDC (NCHS): National Health and Nutrition Examination Survey Data (2005-2006). Hyattsville, MD, U.S. Department of Health and Human Services, Centers for Disease Control and Prevention; 2005. 10-2-2008
14. Heavner K, Heffernan C, Phillips CV, Rodu B: Methodologic and Ethical Failures in Epidemiologic Research, as Illustrated by Research Relating to Tobacco Harm Reduction (THR). Am J Epidemiol 2008, 167(Suppl):S115.
15. Kuczmarski RJ, Flegal KM: Criteria for definition of overweight in transition: background and recommendations for the United States. Am J Clin Nutr 2000, 72:1074-1081.
16. Pelletier D: Theoretical considerations related to cutoff points. Food Nutr Bull 2006, 27:S224-S236.
17. Sanchez-Castillo CP, Velazquez-Monroy O, Berber A, Lara-Esqueda A, Tapia-Conyer R, James WP: Anthropometric cutoff points for predicting chronic diseases in the Mexican National Health Survey 2000. Obes Res 2003, 11:442-451.
18. Nguyen TT, Adair LS, He K, Popkin BM: Optimal cutoff values for overweight: using body mass index to predict incidence of hypertension in 18- to 65-year-old Chinese adults. J Nutr 2008, 138:1377-1382.
19. Lin WY, Lee LT, Chen CY, Lo H, Hsia HH, Liu IL, et al.: Optimal cut-off values for obesity: using simple anthropometric indices to predict cardiovascular risk factors in Taiwan. Int J Obes Relat Metab Disord 2002, 26:1232-1238.
20. Klotsche J, Ferger D, Pieper L, Rehm J, Wittchen HU: A novel nonparametric approach for estimating cut-offs in continuous risk indicators with application to diabetes epidemiology. BMC Med Res Methodol 2009, 9:63.
21. CDC (NCHS): National Health and Nutrition Examination Laboratory Procedure Manual: Lab 18 Biochemistry Profile (1999-2000). Hyattsville, MD, U.S. Department of Health and Human Services, Centers for Disease Control and Prevention; 1999. 12-12-2008
22. CDC (NCHS): National Health and Nutrition Examination Laboratory Procedure Manual: Lab 18 Biochemistry Profile (2001-2002). Hyattsville, MD, U.S. Department of Health and Human Services, Centers for Disease Control and Prevention; 2001. 12-12-2008
23. CDC (NCHS): National Health and Nutrition Examination Laboratory Procedure Manual: Lab 13 Total Cholesterol, HDL-Cholesterol, Triglycerides, and LDL-Cholesterol (2003-2004). Hyattsville, MD, U.S. Department of Health and Human Services, Centers for Disease Control and Prevention; 2003. 12-12-2008
24. USDHHS: Third Report of the National Cholesterol Education Program (NCEP) Expert Panel on Detection, Evaluation, and Treatment of High Blood Cholesterol in Adults (Adult Treatment Panel III): Final Report. 2002. NIH Publication No. 02-5215
25. WHO: Physical status: the use and interpretation of anthropometry. 1995. Report of a WHO Expert Committee. 854. Technical Report Series. 10-16-2008
26. CDC: Defining Overweight and Obesity. [http://www.cdc.gov/nccdphp/dnpa/obesity/defining.htm]. 6-20-2008. 12-17-2008
27. Calle EE, Thun MJ, Petrelli JM, Rodriguez C, Heath CW: Body-mass index and mortality in a prospective cohort of U.S. adults. N Engl J Med 1999, 341:1097-1105.
28. WHO: WHO Consultation on Obesity. 1999. Obesity: Preventing and managing the global epidemic: report of a WHO consultation. 894. Technical Report Series
29. Rothman KJ, Greenland S: Modern Epidemiology. 2nd edition. New York: Lippincott Williams & Wilkins; 1998.
30. Burstyn I, Boffetta P, Kauppinen T, Heikkila P, Svane O, Partanen T, et al.: Performance of different exposure assessment approaches in a study of bitumen fume exposure and lung cancer mortality. Am J Ind Med 2003, 43:40-48.
31. de Vocht F, Kromhout H, Ferro G, Boffetta P, Burstyn I: Bayesian Modeling Of Lung Cancer Risk And Bitumen Fume Exposure Adjusted For Unmeasured Confounding By Smoking. Occup Environ Med 2009, 66:502-508.
32. Youden WJ: Index for rating diagnostic tests. Cancer 1950, 3:32-35.
33. Cangul MZ, Chretien YR, Gutman R, Rubin DB: Testing treatment effects in unconfounded studies under model misspecification: Logistic regression, discretization, and their combination. Stat Med 2009, 28:2531-2551.
34. Maldonado G, Greenland S: Factoring vs linear modeling in rate estimation: a simulation study of relative accuracy. Epidemiology 1998, 9:432-435.
35. Pischon T, Boeing H, Hoffmann K, Bergmann M, Schulze MB, Overvad K, et al.: General and abdominal adiposity and risk of death in Europe. N Engl J Med 2008, 359:2105-2120.
36. Moye LA: Statistical Reasoning in Medicine: The Intuitive P value Primer. New York, NY: Springer-Verlag; 2000.
Pre-publication history

The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1471-2288/10/59/prepub

doi: 10.1186/1471-2288-10-59
Cite this article as: Heavner et al., Dichotomization: 2 × 2 (×2 × 2 × 2...) categories: infinite possibilities. BMC Medical Research Methodology 2010, 10:59

