DECISION RULES BASED ON HYPOTHESIS TESTS AND EFFECT SIZES FOR LOGISTIC REGRESSION DIFFERENTIAL ITEM FUNCTIONING

by

Adam Gesicki

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF ARTS in The Faculty of Graduate and Postdoctoral Studies (Measurement, Evaluation, and Research Methodology)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

September 2015

© Adam Gesicki, 2015

Abstract

Logistic Regression (LR) has been a technique used for the detection of items exhibiting differential item functioning (DIF). When it was introduced in 1990, the LR was conceptualized as strictly a test of statistical significance. This led to the over-identification of items as DIF, even though these items generally did not exhibit practically (psychometrically) significant differences. The use of blended decision rules – where effect sizes are used in addition to statistical significance in the decision-making process – was proposed to address this issue. Previous work in the literature attempted to align a decision rule grounded in the Mantel-Haenszel (M-H) technique to LR. However, applying the same methodology to a different data set, this work is unable to replicate the previously recommended cut-offs. It is possible that cut-off values may be dataset specific, which also opens the question of whether universal cut-off values for effect sizes for DIF are a realistic expectation.

Preface

This dissertation is the original independent work by the author, Adam Gesicki.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Introduction
Background Literature
  The Mantel-Haenszel (M-H) Technique
  The Logistic Regression (LR) Technique
  Advantages of the LR Method Over the M-H
  DIF: Defined by Statistical Significance, Effect Size, or Both?
  Development of Recommended Cut-Offs for the $R^2$
  Applications of Decision Rules in Simulation Studies
  Statement of the Research Questions
Methods
  Data Source
  Question 1: Translating M-H Cut-Offs to LR
  Question 2: Comparability of Results
Results
  Results for Question 1: Translating M-H Cut-Offs to LR
  Results for Question 2: Comparability of Results
Concluding Remarks
  Implications for Practice
  Other Directions for Research
References

List of Tables

Table 1. Sample size under various conditions for the four forms under consideration in this work.
Table 2. A table of exact agreement (off-diagonal) and proportions (diagonal) of various decision rules applied to all types of DIF under consideration.
Table 3. A table of exact agreement (off-diagonal) and proportions (diagonal) of various decision rules applied to gender DIF data only.
Table 4. Classification contingency matrix for DIF items using decision rules from Holland & Thayer (1988) for M-H vs. those calculated in this work, all numbers as proportions of total item mix for all data.
Table 5. Summary of differences in cut-offs on the $R^2$ metric between various sources used to-date in the literature.

List of Figures

Figure 1. A contingency matrix for use in cases of null-hypothesis statistical testing.
Figure 2. A contingency matrix for use in cases of a blended decision rule.
Figure 3. Density plot of summed score by gender, in both listening forms (left) and reading forms (right).
Figure 4. Density plot of summed score by sociocultural background, in both listening forms (left) and reading forms (right).
Figure 5. SIBTEST $\hat{\beta}$ on the x-axis with LR-based $R^2$ between models 1 and 2 on the y-axis.
Figure 6. Rescaling of Figure 5 to match the presentation of Figure 1 from Jodoin and Gierl (2001, p. 336).
Figure 7. Relationship between deltaMH and the logOdds of Group in model 2, placed on the same metric scale, with superimposed linear regressions. Points representing gender DIF are clustered towards the bottom-right of the graph and are represented with circles, with sociocultural data clustered to the top-left of gender and represented with triangles.
Figure 8. A graph demonstrating the median results of the increasing window around the target logOdds of 0.42553. The dotted horizontal line demonstrates the model-based estimate, discussed later in text.
Figure 9. The relationship between the absolute value of the logOdds of group in model 2 as opposed to $R^2$ between models 1 and 2. The loess curves are plotted using solid lines; matching modelled relationships (power) in dotted lines.
Figure 10. A graph comparing the distribution of $R^2$ values between Models 2 and 3. The top graph shows the distribution of $R^2$ for items classified equal / lower using decision rules suggested here, when compared to those suggested by Holland & Thayer. The bottom graph shows the distribution of $R^2$ for items classified higher (containing more DIF) using the new decision rule as opposed to their Holland & Thayer classification.

List of Abbreviations

DIF       Differential Item Functioning
LR        Logistic Regression
M-H       Mantel-Haenszel
SIBTEST   Simultaneous Item Bias Test

Introduction

Differential item functioning, or DIF, is an important consideration for the test developer, as highlighted by the Standards for Educational and Psychological Testing (AERA/APA/NCME, 2014; referenced as the Standards for the remainder of this work). Camilli (1993) clearly defined DIF, stating:

    An item is said to “function differently” for two or more groups if the probability of a correct answer to a test item is associated with group membership for examinees of comparable ability. Statistical indices of DIF are designed to identify such test items. If the degree of DIF is determined to be practically significant for an item and the DIF can be attributed plausibly to a feature of the item that is irrelevant to the test construct, the presence of this item on the test biases the ability estimates of some individuals. This compound condition, when satisfied, indicates item bias. (pp. 397-8).
What is most apparent from this definition is that DIF is simply a statistical index – an indicator – that signals that an item may be biased. Only follow-up by content experts can determine whether or not the statistical flag is the result of a biased item, and what features of the item may be responsible for such a difference between test-taker groups. As well, it is important to highlight from the definition that any statistic used for DIF detection must only compare test takers of similar ability. If test taker abilities are not taken into account in the calculation, one cannot isolate whether differences in an item’s performance are due to features of the item or to a difference in the underlying ability of test takers. Lastly, DIF, as the name suggests, specifically targets analysis at the item level. Further extensions of DIF to the testlet or test form level exist, but are not directly discussed here.

In discussing definitions surrounding DIF, it is also important to differentiate between item bias and item impact. Impact is present when a detected “difference in group performance” was “caused by valid skill group differences” (Ackerman, 1992, p. 6); in other words, where there “were true differences between the groups in the underlying ability of interest being measured by the item” (Zumbo, 2007, p. 224). Naturally, the use of item content panelists, together with the triangulation of other sources of data, would help test developers narrow down which of the two conditions is present in a particular item. This judgement may, of course, not always be clear-cut. This process attempts to strengthen the validity of the testing process by identifying items whose inclusion may ultimately lead to inappropriate inferences from a test-taker’s score. However, even discussions regarding bias and impact require a statistical technique that can support the process undertaken by content experts.
The variety of techniques used for the detection of DIF is neatly summarized in Millsap and Everson (1993). Other notable reviews include Zumbo’s (2007) three generations of DIF analyses, which reviewed where the field had been and provided a research agenda for the future. This work specifically focuses on one of these methods, the logistic regression (LR) DIF technique. Introduced by Swaminathan and Rogers (1990), the LR DIF technique was presented as an alternative to the detection of DIF using the Mantel-Haenszel (M-H) test (Holland & Thayer, 1988).

Regardless of the statistical technique under discussion, all techniques require a decision rule to determine whether or not an item exhibits signs of DIF. As originally presented by Swaminathan and Rogers (1990), the decision rule for LR revolved solely around a test of statistical significance. In this case, a statistically significant test result was to be interpreted as an item that had differential functioning. Eventually, the notion of a “blended” decision rule was introduced; that is, a rule where both the statistical significance of a test and its effect size are considered as evidence in deciding whether an item is differentially functioning, mirroring a strategy used by M-H DIF since its inception (Zumbo, 2008; Holland & Thayer, 1988).

Unlike the M-H, which was introduced with interpretable guidelines, the LR DIF technique does not have a consistently agreed-upon blended decision rule to determine whether an item should be classified as “DIF” vs. “non-DIF”. Despite this limitation, the LR DIF technique offers a number of qualities that M-H DIF does not, such as: the detection of non-uniform DIF (explained later), the capacity to treat the ability range as a continuous variable (as opposed to the discretization required under M-H DIF), and increased modelling flexibility.
These benefits suggest that it would be worthwhile to consider the role of blended decision rules within LR DIF, and strive for harmonization of some of the suggestions made to-date in the literature. As a result, the purpose of this work is to:

1. summarize existing approaches in the development of LR DIF blended decision rules;
2. attempt to replicate methodology suggested in the literature (Jodoin & Gierl, 2001) and recommend a different approach for a potential transfer of M-H DIF cut-offs into a scale used by LR DIF; and,
3. show the comparability between various decision rules.

The work will begin with a review of the background literature of the M-H and LR DIF techniques, as well as highlight previous recommendations made in the literature under the LR DIF framework. Next, another method will be described which will enable the existing and well-established M-H DIF blended decision rule to be transformed to a comparable rule within the LR DIF framework. The results section will provide preliminary results of the new cut-offs, especially highlighting the comparability of the different decision rules discussed in the background literature using data derived from an actual testing programme. For the purposes of brevity throughout the remainder of the paper, “M-H DIF” and “LR DIF” will simply be referred to as the “M-H” and “LR” methods, respectively; their use as techniques for DIF detection is to be implied throughout the remainder of this work.

Background Literature

The Mantel-Haenszel (M-H) Technique

The Mantel-Haenszel (M-H) technique has been used extensively in testing, especially due to its long history within the field. M-H began with the application of a statistical technique presented by Mantel and Haenszel (1959) to the testing realm by Holland and Thayer in 1988.
The M-H categorizes the ability range into a number of levels ($k$) and, within each, considers the number of members of each of the test taker groupings R and F who have answered the question correctly ($N_{R1k}$, $N_{F1k}$) and incorrectly ($N_{R0k}$, $N_{F0k}$).[1] The estimate of the conditional odds ratio is calculated as:

$$\hat{\alpha}_{MH} = \frac{\sum_k N_{R1k} N_{F0k} / N_k}{\sum_k N_{R0k} N_{F1k} / N_k},$$

where $N_k$ is the total number of test takers at level $k$.

    [1] The symbols R (“reference” group) and F (“focal” group) are preserved in this section to respect the notation used in existing research, and could easily have been replaced with G1 (“group 1”) and G2 (“group 2”). As a philosophical note, the use of the terms “reference group” and “focal group” will be avoided in this paper. Traditionally, this terminology is associated with the “first generation” of DIF researchers, who were most interested in dominant and minority groups (respectively; Zumbo, 2007, p. 224). An acknowledgement is made that there existed many historical and legal reasons for such terminology to be used within the realm of DIF; however, this paper focuses on the use of terminology that does not intend to value any one group of test-takers over another.

The conditional odds ratio estimated by the M-H technique is usually subsequently transformed to a “delta” metric, denoted as $\Delta_{MH}$. This has the purpose of bringing the estimate of the conditional odds ratio onto the same scale used by ETS test developers to describe item difficulty, making it more interpretable for internal use (a metric with a normal distribution, with a mean of 13 and standard deviation of four; Holland & Thayer, 1985; Zwick, 2004). This is done through the following relationship:

$$\Delta_{MH} = -2.35 \ln(\hat{\alpha}_{MH}).$$

This transformation also had the additional side-effect of linearizing the odds ratio (i.e., the difference between 0 and 0.5 represents the same difference as 0 to -0.5), as well as centering it around zero (i.e., no difference between the two groups being examined would be represented with a value of zero as opposed to one).
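The odds ratio and delta transformation described above can be illustrated with a minimal Python sketch. The function names (`mh_odds_ratio`, `mh_delta`) and the counts in `strata` are hypothetical and chosen only for illustration; the negative sign on 2.35 follows the usual Holland and Thayer convention and is an assumption here.

```python
import math

def mh_odds_ratio(strata):
    """Mantel-Haenszel estimate of the conditional (common) odds ratio.

    Each stratum is one ability level k, given as a tuple
    (n_r1, n_r0, n_f1, n_f0): group R correct / incorrect and
    group F correct / incorrect.
    """
    numerator = 0.0
    denominator = 0.0
    for n_r1, n_r0, n_f1, n_f0 in strata:
        n_k = n_r1 + n_r0 + n_f1 + n_f0  # all test takers at level k
        if n_k == 0:
            continue  # an empty ability level contributes nothing
        numerator += n_r1 * n_f0 / n_k
        denominator += n_r0 * n_f1 / n_k
    return numerator / denominator

def mh_delta(alpha_hat):
    """Transform the odds ratio to the delta metric.

    Assumption: the sign convention is -2.35 * ln(alpha).
    """
    return -2.35 * math.log(alpha_hat)

# Hypothetical counts for three ability levels (not from the thesis data).
strata = [(20, 30, 10, 40), (30, 20, 20, 30), (40, 10, 30, 20)]
alpha_hat = mh_odds_ratio(strata)
delta_mh = mh_delta(alpha_hat)
print(alpha_hat, delta_mh)
```

An odds ratio of one maps to a delta of zero, which is the "centering around zero" property noted above.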
The statistical significance of the M-H is calculated with the help of the M-H chi-square test statistic:

$$\chi^2_{MH} = \frac{\left( \left| \sum_k N_{R1k} - \sum_k E(N_{R1k}) \right| - 0.5 \right)^2}{\sum_k \mathrm{Var}(N_{R1k})},$$

where

$$E(N_{R1k}) = \frac{n_{Rk}\, m_{1k}}{N_k} \quad \text{and} \quad \mathrm{Var}(N_{R1k}) = \frac{n_{Rk}\, n_{Fk}\, m_{1k}\, m_{0k}}{N_k^2 (N_k - 1)}.$$

The subscripts indicate which test takers are counted at ability level $k$: those who belong to a specific group ($n$; “R” vs. “F”), and those who got the item in question correct or incorrect ($m$; “1” vs. “0”, respectively). The chi-square of the above test is compared against the critical value of the chi-square distribution, with one degree of freedom and an alpha level of 0.05.

The interpretation of the M-H test is generally done using the ETS-recommended decision rules, which categorize each item as A, B, or C. These three categories represent increasing levels of detected DIF: A-level represents items to be considered free of DIF, B-level classifies items with some DIF that are less favoured than A-level items, and C-level is used to describe items that should only be retained in cases where it is “essential to meet test specifications” (Zwick & Ercikan, 1989, p. 59). More practically, items exhibiting C-level DIF are targeted for removal if there is no strong underlying argument for the item’s continued inclusion, whereas A-level items are interpreted to have either no, or minimal, levels of DIF (Zwick & Ercikan, 1989). The ETS decision rule behind the M-H is summarized as:

To meet condition...   the statistical test must be...   $|\Delta_{MH}|$ is...
A                      non-significant                    or less than 1
B                      statistically significant          and between 1 and 1.5
C                      statistically significant          and greater than 1.5

The Logistic Regression (LR) Technique

The LR method for the detection of DIF was first discussed by Swaminathan and Rogers (1990) as a potential alternative to the detection of DIF with M-H (see also Rogers & Swaminathan, 1993). Unlike the M-H test, the LR method does not have established guidelines for its interpretation beyond the tests of statistical significance.
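For contrast with the LR, the M-H chi-square statistic and the ETS A/B/C rule summarized above are compact enough to sketch in a few lines of Python. This is an illustrative sketch only: the function names and the counts in `strata` are hypothetical, and the handling of values exactly at 1 and 1.5 is an assumption, since the summary table does not state which side the boundaries fall on.

```python
import math

def mh_chi_square(strata):
    """Continuity-corrected M-H chi-square with one degree of freedom.

    Each stratum is (n_r1, n_r0, n_f1, n_f0) for one ability level k.
    """
    observed = expected = variance = 0.0
    for n_r1, n_r0, n_f1, n_f0 in strata:
        n_rk, n_fk = n_r1 + n_r0, n_f1 + n_f0   # group totals at level k
        m_1k, m_0k = n_r1 + n_f1, n_r0 + n_f0   # correct / incorrect totals
        n_k = n_rk + n_fk
        if n_k < 2:
            continue  # variance term is undefined for such a level
        observed += n_r1
        expected += n_rk * m_1k / n_k
        variance += n_rk * n_fk * m_1k * m_0k / (n_k ** 2 * (n_k - 1))
    return (abs(observed - expected) - 0.5) ** 2 / variance

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))

def ets_category(delta_mh, p_value, alpha=0.05):
    """ETS A/B/C classification (boundary handling is an assumption)."""
    size = abs(delta_mh)
    if p_value >= alpha or size < 1.0:
        return "A"
    if size <= 1.5:
        return "B"
    return "C"

# Hypothetical counts for three ability levels (not from the thesis data).
strata = [(20, 30, 10, 40), (30, 20, 20, 30), (40, 10, 30, 20)]
chi_sq = mh_chi_square(strata)
p_value = chi2_sf_1df(chi_sq)
print(chi_sq, p_value, ets_category(-2.15, p_value))
```

The point of the sketch is that the M-H rule needs only two inputs per item, a p-value and an effect size, which is the kind of guideline the LR lacks.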
As a result, the LR method and its considerations are discussed in detail here. The model proposed was the following binary logistic regression with logit link function:

$$\ln\left(\frac{P}{1-P}\right) = \beta_0 + \beta_1 \theta + \beta_2 g + \beta_3 \theta g,$$

with $P$ representing the probability of successfully answering the item under investigation for DIF, $\theta$ representing an individual’s ability, $\beta_0$ representing the regression intercept, $\beta_1$ the regression weight for the ability’s effect (shown by $\theta$) in log-odds, $\beta_2$ being the regression weight for group membership ($g$) in log-odds, and $\beta_3$ being the regression weight of the interaction effect between ability ($\theta$) and group membership ($g$) in log-odds.

The $\beta_2$ and $\beta_3$ terms are often termed the uniform and non-uniform (respectively) components of DIF of an item. Uniform DIF is used to define a difference between the two test taker groups that is equidistant throughout the ability range, represented mathematically by the group membership term alone. Non-uniform DIF represents the case where the difference in success on an item between two groups varies throughout the ability range, represented mathematically with an interaction term.

The ability term, $\theta$, can be estimated in a number of ways. The most common method is to use the summed total score on the measurement instrument (cf. Swaminathan & Rogers, 1990; Zumbo, 1999; Cuevas & Cervantes, 2012). It is relevant to note that, despite this common usage of regression for DIF, the assumption that the predictor variables are without error is, in fact, violated (Shear & Zumbo, 2013). More specifically, when the summed score is used as an estimator of ability, it is done so under a classical test theory framework, which suggests that the observed score is composed of both the true score and an uncorrelated random error (Shear & Zumbo, 2013). Reference to this warning is made here to serve as a reminder and acknowledgement; however, considering the current and active use of LR for the study of DIF, this work will continue in line with the usage found in the literature.
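To make the distinction between uniform and non-uniform DIF concrete, the following Python sketch evaluates the model above for two hypothetical sets of regression weights. The coefficient values and function names are invented for illustration and are not estimates from any data set. Under the model, the log-odds gap between groups at a given ability is $\beta_2 + \beta_3 \theta$, so it is constant when $\beta_3 = 0$ and varies with ability otherwise.

```python
import math

def p_correct(theta, g, b0, b1, b2, b3):
    """Probability of a correct response under the LR DIF model."""
    logit = b0 + b1 * theta + b2 * g + b3 * theta * g
    return 1.0 / (1.0 + math.exp(-logit))

def log_odds(p):
    """Log-odds (logit) of a probability."""
    return math.log(p / (1.0 - p))

def group_gap(theta, b0, b1, b2, b3):
    """Log-odds difference between g = 1 and g = 0 at ability theta."""
    return (log_odds(p_correct(theta, 1, b0, b1, b2, b3))
            - log_odds(p_correct(theta, 0, b0, b1, b2, b3)))

# Hypothetical regression weights, for illustration only.
uniform = dict(b0=-1.0, b1=0.8, b2=0.5, b3=0.0)      # group effect only
non_uniform = dict(b0=-1.0, b1=0.8, b2=0.5, b3=0.3)  # adds an interaction

for theta in (-1.0, 0.0, 2.0):
    print(theta,
          round(group_gap(theta, **uniform), 3),
          round(group_gap(theta, **non_uniform), 3))
```

In the uniform case the printed gap is the same at every ability value; in the non-uniform case it grows with ability, which is the pattern the interaction term is meant to capture.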
The original Swaminathan and Rogers publications (1990, 1993) suggested the use of a two degree-of-freedom $G^2$ test on the regression model presented above. A statistically significant value, at an alpha of 0.05, was proposed as the indicator for DIF. The initial 1990 publication did include a distributional study which verified that the LR test statistic for the two degree-of-freedom test was indeed distributed according to the expected chi-square ($\chi^2$) distribution.

Zumbo (1999) further advanced the LR method by describing the original regression method detailed above as a sequence of nested models, compared in two consecutive steps. The three resulting models are:

  model 1: a model that only includes ability as predictor;
  model 2: a model that includes ability and group membership; and
  model 3: the final model with ability, group, and the interaction term between the two variables.

The proposed advantage of consecutive one degree-of-freedom chi-square tests is that they allow a separate focus on uniform DIF (i.e., the comparison between model 1 and model 2) and on non-uniform DIF (i.e., comparing model 2 to 3). This, in turn, provides the advantage that one does not lose a degree of freedom by applying the two degree-of-freedom test to an item that only contains uniform DIF (Swaminathan & Rogers, 1990).

Advantages of the LR Method Over the M-H

Upon introduction of the LR method, early adopters often compared it to the dominant strategy of the time, the M-H, highlighting the advantages it provided. The largest difference noted by Swaminathan and Rogers (1990) was that the LR technique detects the presence of non-uniform DIF, as reflected by $\beta_3$, which is not expressly built into the Mantel-Haenszel statistic (Rogers & Swaminathan, 1993). This effect is not to be minimized as, in some cases, as much as 16% of items containing DIF have been suggested to contain non-uniform differences (Maller, 2001).
Moreover, the LR method does not require the arbitrary categorization of a continuous criterion (ability) variable, which is necessary in the creation of the contingency tables required in the M-H (Zumbo, 1999). The LR-based models can be expanded to be more complex, including multiple matching variables (such as multiple estimates of ability; French & Maller, 2007). Furthermore, conceptualizing DIF from an LR perspective allows a host of techniques from the logistic regression framework to be used, such as multinomial logistic regression (French & Maller, 2007) or logistic discriminant function analysis (Gómez-Benito, Hidalgo, & Zumbo, 2013). Lastly, LR is a statistical method that is freely available in most of today’s statistical computer programmes, making it a tool that can be easily implemented (Zumbo, 1999; French & Maller, 2007).

The original work also began the statistical comparison between the already existing M-H technique and the newly proposed LR technique. Rogers and Swaminathan’s (1993) study especially established that, in terms of power, LR did reasonably well when compared to the M-H for uniform DIF, and was better able to detect some non-uniform DIF (as it was expressly designed to do). As well, the early studies noted differences that were expected: power was higher when the size of the DIF was greater, a longer test length increased the power of detection, and increased sample sizes increased power as well (Swaminathan & Rogers, 1990; Rogers & Swaminathan, 1993). They went on to also identify that false-positive rates for LR were variable, ranging from 1% to 6%, as opposed to the 1% rate at which the M-H “consistently” performed.

Therefore, if the LR technique is to be a tool used for DIF detection – and especially as an alternative to M-H for the detection of DIF – the already established M-H blended decision rule must be transformed onto a metric that can be used by LR.
Naturally, understanding the nature of comparability between different tests is practically important. Misclassifying items (either items that are differentially functioning but do not flag statistically, or vice-versa) can have serious implications. From an operational perspective, these could include the costly removal of items from an item bank; from a research perspective, the development of a different understanding of an underlying construct as items are removed incorrectly. High Type I error may be responsible for “seemingly innocuous items” which are statistically flagged for DIF (Roussos & Stout, 1996, p. 215). Finally, the understanding of bias and the nature of DIF underlying items would be impacted, interfering with the call for deeper understanding of DIF made by Zumbo (2007). If the LR model is assumed to be a reasonable representation of DIF, then it remains to consider what decision rule(s) are to be used to convert the model’s results into decisions of practical significance.

DIF: Defined by Statistical Significance, Effect Size, or Both?

Effect sizes have been central to the DIF landscape since very early in the DIF literature. As discussed above, the M-H’s “A”, “B”, or “C” levels are based on the effect size measure of $\Delta_{MH}$. This three-level terminology continues to be mirrored with other techniques; for example, A/B/C groupings also appear in Zumbo and Thomas (1997) and Jodoin and Gierl (2001). Often, in the literature, the drive to add a blended decision rule to the LR is fueled by a concern that chi-square tests of statistical significance are sensitive to large sample sizes, causing a situation where a large enough sample size may magnify a minute difference between two test taker groups (Gómez-Benito et al., 2013). This concern is echoed in Jodoin and Gierl (2001), where it is hoped that the inclusion of effect sizes may lower “Type I error rates” (p. 
329); more specifically, “the identification of an item as displaying DIF when, in fact, it does not” (p. 330).

However, in some eyes, a trivial difference that is magnified with a large sample size is, in fact, considered to be an overpowered test statistic, rather than an issue of strict Type I error (Ellis, 2010; Lin, Lucas, & Shmueli, 2013). In the context of DIF studies to-date, the literature has used a number of definitions of “Type I” and “Type II”; terms that have traditionally been reserved for null hypothesis significance testing. Further research in this area is encouraged to distinguish clearly among the cells of a contingency matrix, and to differentiate the terms used when the decision rule is based solely on hypothesis testing from those used with a blended decision rule.

                                          “Gold Standard” / Truth
Test Outcome (Statistical Significance)   condition true        condition negative
DIF                                       Correct Conclusion    Type I Error
no DIF                                    Type II Error         Correct Conclusion

Figure 1. A contingency matrix for use in cases of null-hypothesis statistical testing.

When used in line with classical statistical testing (as described by Neyman-Pearson; Lehmann, 1993), the test under consideration is the test of statistical significance of the logistic regression, as conceptualized by Swaminathan and Rogers (1990), based on the sampled data. In this case, the matching gold standard needs to resemble a statistical one: one where the null hypothesis of no (exact) difference between the two groups under consideration holds in the population. Therefore, when discussions occur of sensitivity of the test to large sample sizes, the concern is really one of an over-powered test statistic (Lenth, 2001).

                                "Gold Standard" / Truth
Test Outcome (Blended Rule)     condition true         condition negative
DIF                             True Positive (TP)     False Positive (FP)
no DIF                          False Negative (FN)    True Negative (TN)

Figure 2. A contingency matrix for use in cases of a blended decision rule.
In the other case, when the gold standard is modified to represent DIF that is of practical (or psychometric) significance rather than an exact difference, the test used in this consideration should be clarified to mean a blended decision rule. In this case, the difference in the gold standard is reflected through the inclusion of an appropriate statistical measure: the effect size. To reflect the change in the nature of the test, it is recommended that the terms be changed to those commonly used in statistical learning, and that the two sets of terms not be viewed as completely synonymous (James, Witten, Hastie, & Tibshirani, 2013; see Figure 2 above). Given the concerns noted in Jodoin and Gierl (2001), the chi-square test raises the issue of having too many false positives (FP). The FP term is a better description of the actual concern when the chi-square test is discussed in the context of DIF.

There exists a third option to consider for the gold standard, which has not been studied in the literature to-date: the agreement of a test’s results with those of a DIF panel of experts, who confirm the test’s indication that a DIF item requires removal from a testing instrument. This would still retain the language used in Figure 2, because a blended decision rule would still be in use.

Ultimately, as stated by Potenza and Dorans (1995): “To be used effectively, a DIF detection technique needs an interpretable measure of the amount of DIF” (p. 33). The same also appears in the Standards, which remind readers that analyses must detect “meaningful DIF” (AERA/APA/NCME, 2014, p. 89). This is in line with recommendations within the broader research literature to use effect sizes in enriching interpretations beyond statistical significance (Wilkinson & Task Force on Statistical Inference, 1999).
With clarifications on the role of the blended decision rule in a contingency matrix in place, the following section will highlight attempts in the development of effect size cut-offs for the LR.

Development of Recommended Cut-Offs for the $R^2$

Although the M-H test has only one conceptualization of effect size in use, the LR method has multiple ways of portraying effect size. At the time of the original Swaminathan and Rogers studies (1990, 1993), the use of effect sizes was not discussed. In essence, the decision rule only consisted of a test of statistical significance. The notion of the use of effect sizes in LR was first introduced by Zumbo and Thomas (1997), who recommended that the effect size cut-offs on the $R^2$ scale should be 0.13 and 0.26, akin to the transition between A- and B-level, as well as B- to C-level DIF (respectively). These were informed by the effect size guidelines put forth by Cohen (1992) and represented a best-guess attempt at the creation of interpretable benchmarks for DIF without the presence of previously established guidelines for LR.

In an attempt to link the interpretation of the M-H and the LR techniques, Jodoin and Gierl (2001) converted the classification benchmarks established for the Simultaneous Item Bias Test (SIBTEST) method onto the $R^2$ effect size of LR, given that the SIBTEST cut-offs had originally been based on the M-H effect size measures (see Roussos & Stout, 1996). To facilitate these conversions, Jodoin and Gierl (2001) calculated the relevant measures using data from:

  - a 50-item grade six social studies achievement test, with the two groups defined by English-French translations of the same test, each group containing 2,220 test takers,
  - a 50-item grade six mathematics achievement test, with groupings as above, and
  - a 70-item grade 12 social studies high school exit exam, with gender DIF under consideration, but no number of test takers in each group specified.
Based on this data, Jodoin and Gierl (2001) noted that the relationships between the SIBTEST cut-offs and the LR R² were similar regardless of the test content or group split considered. This resulted in final recommendations of an R² under 0.035 for A, greater than or equal to 0.035 but under 0.070 for B, and greater than or equal to 0.070 for C. Again, following the tradition of the M-H, Jodoin and Gierl (2001) suggest that statistical significance must be met for B- and C-level DIF exclusively.

As can be seen by direct comparison, the cut-offs presented by Jodoin and Gierl (2001) represent quite a distinct reduction from those initially suggested by Zumbo and Thomas in 1997. The newly recommended cut-offs, as expected, greatly influenced the number of items detected as containing moderate DIF, which rose from 6.8% to 68.2% of DIF items (Jodoin & Gierl, 2001). From a power standpoint, this change strongly weighed in favour of the smaller effect size requirements, making the cut-offs suggested by Jodoin and Gierl the ones most often used in recent published literature (cf. Hidalgo & López-Pina, 2004).

Jodoin and Gierl’s (2001) R² cut-offs were originally designed to be applied at each sequential regression model, testing uniform and non-uniform DIF separately using the same R² cut-offs. The authors stated that, because their recommendations were derived from M-H (albeit via SIB), the cut-offs are only relevant to the transition between models one and two, which is aimed at testing for uniform DIF (the only type of DIF the M-H is designed to detect). The authors suggested that, with no levels established by the psychometric community for non-uniform DIF, the same cut-offs should be applied to non-uniform DIF as to uniform DIF.
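The blended rule described above can be sketched in a few lines of Python. This is only an illustration of the classification logic as summarized in this section, not code from Jodoin and Gierl (2001) or from this study; the function name, argument names, and default alpha of 0.05 are assumptions made for the example.

```python
def classify_dif(delta_r2, p_value, alpha=0.05, cut_b=0.035, cut_c=0.070):
    """Blended A/B/C classification (hypothetical helper).

    delta_r2 : change in R-squared between two nested LR models
    p_value  : p-value of the matching chi-square test
    An item is B or C only if the test is significant AND the
    effect size reaches the corresponding cut-off; otherwise A.
    """
    if p_value >= alpha or delta_r2 < cut_b:
        return "A"
    if delta_r2 < cut_c:
        return "B"
    return "C"


# Illustrative values only (not taken from any dataset).
examples = [(0.020, 0.001), (0.050, 0.001), (0.080, 0.001), (0.080, 0.200)]
labels = [classify_dif(r2, p) for r2, p in examples]
```

Passing different values for `cut_b` and `cut_c` lets the same function apply any other pair of cut-offs on the same scale.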
Interestingly, the authors wrote earlier in the same study that “SIB was chosen as the basis of comparison because it is able to detect both uniform and non[-]uniform DIF, and it has superior statistical characteristics compared to MH in the detection of both uniform and non[-]uniform DIF” (p. 335; emphasis added). This is in stark contrast to the assertion that the cut-offs must be used separately for uniform as well as non-uniform DIF. Given that the M-H metric does report on conditional odds, it is also strange that the study authors do not directly compare the M-H odds with the odds reported by LR. This feature is explicitly stated in one of Holland and Thayer’s (1986) works: “The parameters, α and ln(α), are also called ‘partial association’ parameters because they are analogous to the partial correlations used with continuous data” (p. 9).

Lastly, it is worth discussing the unique contributions of Cuevas and Cervantes (2012), which slightly departed from the pattern of previous research in this field. This is primarily attributed to the study’s intent to analyze DIF in the environment of a Colombian testing programme. Unlike the studies detailed earlier, which generally reported sample sizes under 1,000 test-takers per group, Cuevas and Cervantes (2012) considered a simulation population where the primary group had either 7,500, 8,500, 26,000, or 33,000 test-takers, with many additional ratios for the second group of test-takers. Previous research explicitly stated the use of the Nagelkerke (1991) R² for LR as well (see Zumbo, 1999; Jodoin & Gierl, 2001), whereas Cuevas and Cervantes (2012) selected the McFadden R² for their use (although they did not attempt to apply any of the existing cut-offs to their study).

Although the methodology for the Cuevas and Cervantes (2012) study departed from existing trends in the literature, there are a number of approaches discussed within the article that are useful to draw upon.
Of note is the use of receiver operating characteristic (ROC) curves to determine optimality under each condition. This method generates a graph with the false positive rate on the x-axis and the true positive rate on the y-axis. Using this method allows the identification of the point on the curve closest to the top left of the graph, that point representing the ideal condition of no false positives and 100% true positives. Through the use of this method, it is interesting to see that the R² cut-offs suggested by the authors differ depending on sample size. This is unusual, considering that the goal of exploring effect sizes in the first place was to prevent sample size from influencing the results (see discussion earlier). The deeper implication of this reasoning is that sample size would change the maximum permissible amount of DIF in a particular item, something that is neither operationally defensible nor theoretically justifiable. This is also a strategy that is not supported by the current use of DIF results (such as the M-H).

What is also striking about the cut-offs suggested in Cuevas and Cervantes (2012) is the large discrepancy when the results are compared to those put forward by Jodoin and Gierl (2001), let alone Zumbo and Thomas (1997). The suggested cut-offs for uniform DIF, in the small DIF condition (assuming a typographical error at the top of Table 4 on p. 56), range from 0.000082 to 0.000363 (dependent on sample size); these numbers are very different in magnitude from the 0.035 for B-level DIF in Jodoin and Gierl (2001). A suggestion is made in Cuevas and Cervantes (2012) that the Jodoin and Gierl (2001) cut-offs may be too conservative at identifying DIF, joining similar proposals put forward by Oliveri, Ercikan, and Zumbo (2014) and French and Maller (2007).
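The ROC-based selection of a cut-off described above can be illustrated with a short Python sketch. It assumes simulated items for which the true DIF status is known; the data values and function names below are hypothetical and are not drawn from Cuevas and Cervantes (2012).

```python
import math


def roc_points(effect_sizes, has_dif):
    """Return (cut, false positive rate, true positive rate) for each
    distinct effect size used as a cut-off (flag if effect >= cut).
    Assumes at least one DIF item and one non-DIF item."""
    positives = sum(has_dif)
    negatives = len(has_dif) - positives
    points = []
    for cut in sorted(set(effect_sizes)):
        flagged = [e >= cut for e in effect_sizes]
        tp = sum(f and d for f, d in zip(flagged, has_dif))
        fp = sum(f and not d for f, d in zip(flagged, has_dif))
        points.append((cut, fp / negatives, tp / positives))
    return points


def closest_to_top_left(points):
    """Pick the point with the smallest Euclidean distance to (0, 1)."""
    return min(points, key=lambda p: math.hypot(p[1], 1.0 - p[2]))


# Hypothetical simulated items: effect size and known DIF status.
effect_sizes = [0.001, 0.002, 0.003, 0.010, 0.012, 0.020]
has_dif = [False, False, False, True, True, True]
best_cut, best_fpr, best_tpr = closest_to_top_left(roc_points(effect_sizes, has_dif))
```

Because the chosen cut-off is whichever observed effect size lands nearest the ideal corner, it depends on the distribution of effect sizes in the simulated condition, which is how a sample-size dependence of the kind discussed above can arise.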
Applications of Decision Rules in Simulation Studies

Some of the decision rules have been tested in various simulation-based contexts, the results of which were primarily concerned with the Type I error and power rates. Jodoin and Gierl (2001) reported that Type I error had decreased when the effect size was included in the decision rule. The starkest note with regard to Type I error when using a blended decision rule, however, comes from French and Maller’s (2007) study, which concluded that with a blended decision rule, Type I errors decreased to “essentially zero” (p. 380). This result was particularly noticeable when unequal ability distributions were induced within the simulation. French and Maller (2007) noted that power was overly conservative, especially when the effect size was included in the decision rule, which echoed a similar statement made by Jodoin and Gierl (2001).

It is interesting to note that similar overall conclusions are drawn by both research teams, despite a difference in the decision rules that each study used. French and Maller’s (2007) results are explicitly based only on a one-degree-of-freedom chi-square test using an alpha of 0.01, not considering the possibility of the two-degree-of-freedom chi-square for the statistical significance criterion. On the other hand, Jodoin and Gierl (2001) considered both the one- and two-degree-of-freedom tests, but used an alpha of 0.05 under both conditions.

Hidalgo and López-Pina (2004) also performed a simulation study comparing the statistical power of the M-H and the LR methods, with the LR methods also including the effect size. A large benefit of their approach is the inclusion of their Tables 2 and 3 (pp. 912-913), which report the means and standard deviations of the effect sizes (specifically, R² between the combined models).
However, a large difference between the Hidalgo and López-Pina (2004) study, as compared to French and Maller (2007) and Jodoin and Gierl (2001), is where each one had applied the R² cut-offs. In contrast to Jodoin and Gierl (2001), Hidalgo and López-Pina (2004) compared the difference in R² between the first and the last LR models (i.e., the model including the interaction effect to represent non-uniform DIF) against the cut-offs suggested by Jodoin and Gierl (2001). The same approach was also used by Oliveri, Ercikan, and Zumbo (2014). The use of one global cut-off value was originally suggested by Zumbo (1999; albeit with the much more liberal cut-offs). That being said, Zumbo suggests that retaining the two sequential R² values would still be beneficial, as the two values could be compared against each other to determine if the item exhibits primarily uniform or non-uniform DIF. This approach may be more beneficial as it does not require the creation of separate R² cut-offs for both types of DIF, as is suggested in Jodoin and Gierl (2001). These results indicate that not only has a variety of cut-offs been suggested in the literature, but there are also differences in how they have been applied.

Statement of the Research Questions

As can be seen from this discussion, it is not only a question of what the cut-offs should be, but also a matter of clearly defining where the cut-offs are applied in the modelling process. It is important, in proposing the following research agenda, that clearly articulated reasoning is used to outline each sequential step, allowing for further replication of the study method, with the hope of harmonizing the recommendations made to date throughout the literature.

The first research goal to be considered here is a replication of an issue discussed in Jodoin and Gierl (2001): the conversion of the M-H cut-offs onto a scale used by the LR (R²).
As highlighted above, the recommendation presented in Jodoin and Gierl (2001) was developed with an intermediary conversion through SIBTEST and was based on datasets available to those researchers. An attempt will be made to replicate the existing methodology described in the literature with the use of a different dataset. Then, a more direct conversion with another dataset will demonstrate if the cut-offs are interchangeable regardless of the method by which they were calculated or which dataset was used.

Given new cut-offs that may result from the first research question, the comparability of their results with the Jodoin and Gierl (2001) cut-offs, as well as with the original M-H cut-offs, will be examined. This will serve as a verification step to demonstrate the degree to which the three decision rules are interchangeable with each other. Since both of the LR decision rules were derived originally from the M-H, it is expected that they will be highly comparable.

Secondly, a broader question will be asked with regard to effect sizes: does their inclusion lead to different decisions at the item level, over and above the statistical test alone? In this case, how consistent are the decisions regarding DIF when the following decision rules are compared with each other:

1. LR: statistical significance – using G² statistical significance of the two-df test only, with an alpha of 0.05 (based on Jodoin & Gierl, 2001)
2. LR: statistical significance – using G² statistical significance of either of the one-df tests only, with an alpha of 0.05
3. LR: effect size – using Nagelkerke’s (1991) values to obtain R², applying cut-offs suggested by Jodoin & Gierl (2001) to the separate model changes (models 1 to 2, models 2 to 3)
4. LR: effect size – same as (3), with the exception of the application of the cut-off to the R² from model 1 through 3
5. LR: effect size – using R² based on Nagelkerke (1991) values, with any proposed cut-offs that may be developed from research question 1
6. LR: combined rule – rule (1) with rule (4)
7. LR: combined rule – rule (1) with rule (5)
8. M-H: combined rule – Holland and Thayer (1988) classifications

Methods

In order to answer the questions above, the data source to be used for the study and the methods of the analysis will be detailed, in turn, below.

Data Source

The data source used for this analysis is drawn from an operational test used to measure functional English language use. The listening and reading components are each composed of 38 multiple-choice, single-response, dichotomously scored items. The majority of these items are grouped into testlets sharing a similar stimulus (e.g., a reading passage, or a recorded conversation). Despite the presence of testlets, the unidimensionality of all forms of this test has been examined and documented elsewhere. The choice of operational data for this study, rather than simulated data, will allow the techniques to be tested in an environment which includes the natural variability that is expected and observed in an operational setting.

In addition to operational item response data, the following demographic variables were used in an attempt to detect DIF: (a) gender groups of the test taker, split into two groups, “male” and “female” (test takers reporting other self-identified gender identities are excluded from the dataset); and (b) citizenship groups of the test taker, split into two groups: one group including those from a country primarily speaking English in a sociocultural context judged to be highly similar to that found in Canada (i.e., test takers who hold an initial citizenship of one of: Canada, United States, Australia, New Zealand, United Kingdom, or Ireland), and the other group consisting of those reporting other citizenships.
The analyses to follow will be symbolically coded with one (1) representing males and socioculturally similar groupings, and zero (0) representing females and test takers from socioculturally dissimilar countries. Given the data splits explained above, the following table summarizes the number of test takers within each grouping, as well as the total size of the dataset:

Table 1. Sample size under various conditions for the four forms under consideration in this work.

Form | N | males | females | similar context | dissimilar context
Listening A | 2666 | 1674 (62.8%) | 992 (37.2%) | 254 (9.5%) | 2412 (90.5%)
Listening B | 2563 | 1597 (62.3%) | 966 (37.7%) | 236 (9.2%) | 2327 (90.8%)
Reading A | 2630 | 1859 (70.7%) | 771 (29.3%) | 363 (13.8%) | 2267 (86.2%)
Reading B | 2691 | 1866 (69.3%) | 825 (30.7%) | 397 (14.8%) | 2294 (85.2%)

The purpose of the recommended splits is to demonstrate the comparability of the statistical techniques in two circumstances, especially in a circumstance that is more balanced than the other. Each of the previously mentioned studies related to LR has considered unbalanced group sizes. Since this is a common factor in those studies, it is also considered in this research context.

Although other research has usually used the simple total score as an estimate of ability, this work will use a corrected item-total: a total summed from the items taken by the test taker without taking into account the individual item being considered in a particular DIF regression. This prevents the total score from being conflated with the particular item should it, in fact, contain DIF.

Question 1: Translating M-H Cut-Offs to LR

First and foremost, the conversion of the M-H metric onto one used by LR was performed. This was accomplished using two methods. At first, an attempt to replicate Jodoin and Gierl’s (2001) results was performed.
Second, since M-H has a metric that resembles a conditional odds ratio, this metric was transformed to the conditional odds ratio that was calculated from the coefficient estimated for regression model 2 (the model which contains only ability and group membership as the predictors). For each of the two data splits (gender and context similarity), the relationship between the M-H alpha scale, the conditional odds ratio, and the calculated R² for the transition from model 1 to 2 was investigated graphically. The goal of graphical relationship modelling was to investigate whether or not the various datasets could be merged into one translating model (cf. Jodoin & Gierl, 2001). From the fitted model(s), the M-H delta cut-offs were transformed onto the R² metric.

The most notable drawback of this approach is that it only explicitly uses the regression coefficient for group membership from model 2, which, in fact, only captures the condition of uniform DIF. However, one could make an argument that the measure of DIF should not depend on the type of DIF. With this reasoning, if a group difference is considered large enough by uniform standards when using a strategy that is only designed to detect uniform differences in items, then differences that are just as large with partially (or exclusively) non-uniform DIF should also be considered equally suspect. As a result, the recommendation provided in this study will be that the cut-off, even though it is developed exclusively from a source describing uniform DIF, can (and should) be applied to the difference in R² from models 1 to 3. In other words, the degree of DIF, as measured by effect size (a practical difference), should not be dependent on the type of DIF (a theoretical description of the type of DIF).
With this strategy, the recommendation made by Zumbo (1999) can be used: the ratio of the two different R² changes can be compared to identify whether or not the item exhibits primarily uniform or non-uniform DIF. For example, if 80% of the item’s R² difference occurs in the transition between models 2 and 3, one can reasonably conclude that the item exhibits primarily non-uniform DIF. From a strictly pragmatic orientation, the concern surrounds the amount of DIF (the effect size), as opposed to how that DIF is manifested (i.e., uniform vs. non-uniform).

Question 2: Comparability of Results

Next, the various techniques will be run on the datasets described earlier. These techniques have been programmed into a computer script written in R (R Core Development Team, 2014). This automated process will output the raw results of the various statistical tests outlined earlier, as well as the interpretations suggested by the various decision rules outlined earlier.

To address the comparability between the different measures, a consistency matrix was developed. This matrix will show percent agreement on the off-diagonal, and the proportion of items classified at each level on the diagonal. Since the decision rules result in either two or three categories of DIF classification (either the “yes”/“no” of statistical significance, or the A-, B-, and C-levels of the combined decision rule), B- and C-level items will be consolidated into one category for the purposes of calculating agreement, representing a presence/absence of DIF.

Results

Results for Question 1: Translating M-H Cut-Offs to LR

Distributions of test taker ability. At first, the raw ability distributions are compared between the two types of data groupings.
From the following graphs, it is possible to see that the gender grouping yields groups of test takers whose ability distributions overlap heavily (see Figure 3), compared with the highly non-overlapping distributions of test takers split by sociocultural context (see Figure 4). This has implications for DIF (Zumbo, 1999), as matching becomes a statistical creation. As a result, although the sociocultural difference may be a DIF result faced in actual data (as in this case), the majority of calculations in this section will be based on the gender-based DIF results. Results in Figure 3 and Figure 4 present variables coded as one (1) in solid lines, and zero (0) in dashed lines.

Figure 3. Density plot of summed score by gender, in both listening forms (left) and reading forms (right).

Figure 4. Density plot of summed score by sociocultural background, in both listening forms (left) and reading forms (right).

Method 1: Attempting Replication of Jodoin & Gierl (2001). At first, an attempt was made to replicate Jodoin and Gierl’s (2001) methodology in the determination of cut-offs. SIBTEST was run in DIF-Pack version 1.7 (William Stout Institute for Measurement, 2005). In keeping with the method in Roussos and Stout (1996), as cited by Jodoin and Gierl (2001), each item was considered for DIF against all other items on the test. As well, a bi-directional test of DIF was used (no directionality of differential functioning was assumed in any of the datasets). With the intention of replication, only the gender-based DIF was considered for this sub-study.

The results are presented in Figure 5, below. Two solid vertical lines were placed on the graph to represent the two locations of the SIBTEST cut-offs (0.059 and 0.088 on the $\hat{\beta}$ metric). The dashed line superimposed on the data is a cubic regression, resembling the functional form suggested by Jodoin and Gierl (2001). To contrast, a solid line on the dataset shows the loess-based representation of the data.

Figure 5. SIBTEST $\hat{\beta}$ on the x-axis with LR-based R² between models 1 and 2 on the y-axis.

Based on the loess curve, the cubic function may not necessarily resemble the data the best in the lower range (especially around SIBTEST $\hat{\beta}$ < 0.025). The functional form is a better representation of the loess mapping in the range where the SIBTEST-based cut-offs exist; however, there is a notable lack of data in that same range. For consistency in method with the original article, the cubic functional form best fit to this data is calculated as:

$$\hat{R}^2 = 14.93691\,\hat{\beta}^3 - 0.79805\,\hat{\beta}^2 + 0.05732\,\hat{\beta} + 0.00018,$$

leading to suggested cut-offs of 0.00385 and 0.00922 for B-level and C-level DIF.

These results diverge strongly from the recommendations provided in Jodoin and Gierl (2001) of 0.035 and 0.070. To emphasize the divergence of results, Figure 5 from above is included again, rescaled to match the ranges of Figure 1 from Jodoin and Gierl (2001, p. 336). Upon comparison of the figures, the following two differences are apparent: (a) the range of R² in the original work is much wider than that seen with this data set, with R² values as high as ~0.65 reported compared to a maximum of 0.0095 shown here; and (b) the shape of the cubic regression model is visibly different. The large outliers in the items analyzed in Jodoin and Gierl (2001) may also have highly influenced the nature of the cubic regression model because of their high leverage values.

Figure 6. Rescaling of Figure 5 to match the presentation of Figure 1 from Jodoin and Gierl (2001, p. 336).

Furthermore, the linear relationship between SIBTEST $\hat{\beta}$ and M-H’s $\hat{\Delta}$ was verified visually (Roussos & Stout, 1996). However, the correlation of 0.44 obtained with this data differed distinctly from a previously reported correlation of 0.98 (Dorans, 1991), which was used as the basis of the transformation between the M-H and SIBTEST DIF detection techniques (Roussos & Stout, 1996).
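As a numerical check on the cubic fit reported above, the short Python sketch below evaluates the polynomial at the two SIBTEST cut-offs (0.059 and 0.088). The coefficient magnitudes come from the text; the signs on the lower-order terms (negative quadratic, positive linear and constant) are an assumption, chosen because that is the combination that reproduces the reported cut-offs of 0.00385 and 0.00922. The function name is illustrative only.

```python
def cubic_r2(beta):
    """Cubic mapping from the SIBTEST effect size to the LR R-squared change.

    Signs of the quadratic, linear, and constant terms are ASSUMED
    (see lead-in); only the magnitudes are taken from the text.
    """
    return 14.93691 * beta**3 - 0.79805 * beta**2 + 0.05732 * beta + 0.00018


cut_b = cubic_r2(0.059)  # B-level cut-off on the R-squared scale
cut_c = cubic_r2(0.088)  # C-level cut-off on the R-squared scale
```

Under these sign assumptions, both values agree with the reported cut-offs to the precision given in the text.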
Because the results using SIBTEST are divergent in both (a) M-H to SIBTEST, and (b) SIBTEST to LR, a direct transformation of M-H to LR may be a simpler and more defensible strategy.

Method 2: Directly Linking M-H and LR. As a result of these large differences, and in an attempt to simplify the transition between M-H and LR cut-offs, another strategy was considered in mirroring these two DIF detection techniques. This technique, as discussed earlier, directly maps the conditional odds ratios between both techniques, then considers their relationship with the change in R².

Checking the expected relationship between delta (transformed logOdds in M-H) and logOdds from LR. Firstly, the conditional odds of group membership from both the M-H and the LR methods are compared to demonstrate that the two are indeed interchangeable. For ease of interpretation, the following graph shows the linearized versions of both. The logOdds was multiplied by 2.35 to match the scaling of the delta metric, and is shown below with linear regressions for each DIF type.

Figure 7. Relationship between deltaMH and the logOdds of Group in model 2, placed on the same metric scale, with superimposed linear regressions. Points representing gender DIF are clustered towards the bottom-right of the graph and are represented with circles, with sociocultural data clustered to the top-left of gender and represented with triangles.

The relationship between the two variables for gender, estimated by linear regression, has a slope of 0.97649 and an intercept of 0.04252 in magnitude. For the socioculturally similar groupings, the estimated slope is 0.93770 and the intercept is 0.85750 in magnitude. The graph and data both reveal that the relationship (slope) between the two variables, in both regressions, tends towards 1; however, only the gender-based DIF reveals the expected identity relationship with an intercept of approximately 0 (Holland & Thayer, 1988).
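For readers less familiar with the two metrics being compared, the following Python sketch shows how the M-H common odds ratio pools 2×2 tables across matched score levels, and how the factor of 2.35 links the log-odds and delta scales. It is a minimal illustration, not the script used in this study: the function names, the table layout, and the example counts are hypothetical, and the sign of delta depends on which group is treated as the reference.

```python
import math


def mh_common_odds_ratio(tables):
    """Mantel-Haenszel common odds ratio across matched score levels.

    tables: iterable of (a, b, c, d) counts per score level, where
      a = reference group correct, b = reference group incorrect,
      c = focal group correct,     d = focal group incorrect.
    """
    num = 0.0
    den = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n == 0:
            continue  # skip empty score levels
        num += a * d / n
        den += b * c / n
    return num / den


def mh_delta(alpha_mh):
    """Rescale the common odds ratio onto the delta metric."""
    return -2.35 * math.log(alpha_mh)


def delta_to_log_odds(delta):
    """Magnitude of the log-odds that corresponds to a given delta."""
    return abs(delta) / 2.35


# Hypothetical counts for two score levels.
example_alpha = mh_common_odds_ratio([(40, 10, 30, 20), (30, 20, 20, 30)])
example_delta = mh_delta(example_alpha)
```

With this scaling, delta values of 1 and 1.5 correspond to log-odds magnitudes of 1/2.35 and 1.5/2.35, respectively.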
Because the socioculturally similar grouping does not match the expected relationship between the two variables, the gender-based models will be used in the remainder of this work to construct a new recommended cut on the R² scale. The unexpected performance of the data from sociocultural background DIF is likely related to the non-overlapping ability distributions between the two groupings. With the interchangeability of the two scales verified for the gender-based divisions, an M-H delta of 1 would, after dividing by 2.35 (the scaling factor), equal a logOdds of approximately 0.42553, and an M-H delta of 1.5 would equal a logOdds of 0.63830.

Converting logOdds to R². It is necessary to convert the logOdds of group membership in model 2 to the R² metric, in order to enable DIF determinations to occur in the broader and more encompassing model 3 (the model which includes non-uniform DIF). R² remains the most convenient effect size measure for LR, as the difference in R² between models 1 and 2 is nested within the difference in R² between model 1 and model 3, which accounts for non-uniform DIF. Since both the (conditional) odds ratio and R² are measures of effect size, it is appropriate to relate the two variables; this relationship is visualized below. Given the conclusion from above that the gender DIF more accurately respects the relationship expected between the conditional odds ratio of M-H and the conditional odds ratio of LR, only its results are used for the purposes of conversion.

Two methods, an empirically derived approach and a model-based approach, are considered next. To better reflect the actual data, an empirical strategy was first considered. In this strategy, increasing ranges around the target logOdds (0.42553 and 0.63830) are built; within each range, the median is calculated. In the model-based strategy, an estimated mathematical model was selected that best resembled the loess line suggested by the data.

Results of the empirical strategy.
The empirical strategy was initially preferred over a model-based strategy, as it did not impose a functional form on two variables whose relationship has not clearly been linked. The results of the empirical strategy are shown below, in Figure 8, for the target logOdds of 0.42553. In this strategy, sequential steps in increments of 0.0005 above and below the target logOdds were taken for a duration of 100 steps (i.e., a total range of 0.42553 ± 0.05). At each step, the median was calculated, and is presented on the y-axis. The figure only begins to display data when the step contained more than two data points (which occurred starting at step 39). With the exception of a small dip between steps 78 and 81, the median is stable at 0.00686 until the end, where it is calculated on the basis of five data points. The dotted line represents the model-based result for the same cut point (detailed later), showcasing the similarity of results between the two strategies.

Figure 8. A graph demonstrating the median results of the increasing window around the target logOdds of 0.42553. The dotted horizontal line demonstrates the model-based estimate, discussed later in text.

An attempt was made to compute the median R² at the cut-point of 0.63830; however, there was no data available for gender-based DIF at this level. In order to be able to calculate an equivalent in this range, a model-based imputation, although not preferred, is a required alternative.

Results of the model-based strategy. The functional form that best resembled the data is a power model. For the purposes of modelling, the graph below also shows a loess curve (solid line), and a modelled power relationship (dotted line). An acknowledgement is made that the functional form selected for the model is strictly based on best fit to the data, and not on any proofs relating the two variables.
As well, because data is not available in the range of the second cut-off, the model is only a best-informed guess of what the values could have been had there been data in that range.

Figure 9. The relationship between the absolute value of the logOdds of group in model 2 as opposed to R² between models 1 and 2. The loess curves are plotted using solid lines; matching modelled relationships (power) in dotted lines.

In this case, the relationship between the logOdds and R² between models 1 and 2, for gender-based DIF, is estimated by the formula:

$$\hat{R}^2 = 0.03665 \cdot \mathrm{logOdds}^{1.96970}.$$

Given this estimated relationship, a cut-off of 1 on the M-H delta solves to an $\hat{R}^2$ of 0.00681, and 1.5 solves to an $\hat{R}^2$ of 0.01514. The model-estimated cut-off for the first cut point was graphed as a horizontal line in Figure 8. This shows concordance between the mean-based regression technique and the data-based median. Since the empirical technique does not provide results for the second cut, it is suggested that both cut-offs be model-based to preserve consistency, and because of the relatively close agreement of the first cut with the empirical method.

Results for Question 2: Comparability of Results

Having suggested different cut-off values with the use of a different methodology, the cross-comparability of the various methods is considered in turn. At first, a general summary of the comparability of the various methods is presented in the following two tables. Table 2 demonstrates the application of the cut-offs for all data points (both types of DIF under consideration in this study). Table 3 demonstrates the same for the gender-only DIF situation (i.e., the context within which equal ability distributions are seen). In both tables, the diagonals show the results of classification for that particular decision rule; the off-diagonals show the percentage of exact agreement between two decision rules.
In cases where a decision rule consisting solely of a significance test is compared against a decision rule generating A/B/C categories, B- and C-level items were collapsed into one category of items which exhibit DIF (a similar strategy was used in Jodoin & Gierl, 2001).

Table 2. A table of exact agreement (off-diagonal) and proportions (diagonal) of various decision rules applied to all types of DIF under consideration.

Decision rules (rows and columns, in the same order):
Statistical Significance Only: (1) Chi-Square 2df, alpha = 0.05; (2) Chi-Square 1df, either model, alpha = 0.05
Effect Size Only: (3) Change in R² in Models 1 to 2 or 2 to 3 (Jodoin & Gierl, 2001); (4) Change in R² in Models 1 to 3 (Jodoin & Gierl, 2001); (5) Change in R² in Models 1 to 3 (new cuts)
Blended Decision Rules: (6) Change in R² in Models 1 to 3 (Jodoin & Gierl, 2001) with Chi-Square 2df < 0.05; (7) Change in R² in Models 1 to 3 (new cuts) with Chi-Square 2df < 0.05; (8) (Holland & Thayer, 1988)

ALL DATA
Rule | (1) | (2) | (3) | (4) | (5) | (6) | (7) | (8)
(1) | Non-Significant = 51%, Significant = 49% | .93 | .51 | .53 | .73 | .53 | .73 | .65
(2) | .93 | TRUE = 46%, FALSE = 54% | .46 | .48 | .68 | .48 | .68 | .61
(3) | .51 | .46 | A = 100% | .98 | .78 | .98 | .78 | .80
(4) | .53 | .48 | .98 | A = 98%, B = 2% | .78 | 1.00 | .78 | .79
(5) | .73 | .68 | .78 | .78 | A = 78%, B = 13%, C = 9% | .78 | 1.00 | .78
(6) | .53 | .48 | .98 | 1.00 | .78 | A = 98%, B = 2% | .78 | .79
(7) | .73 | .68 | .78 | .78 | 1.00 | .78 | A = 78%, B = 13%, C = 9% | .78
(8) | .65 | .61 | .80 | .79 | .78 | .79 | .78 | A = 80%, B = 6%, C = 15%

Table 3. A table of exact agreement (off-diagonal) and proportions (diagonal) of various decision rules applied to gender DIF data only. Decision rules are numbered as in Table 2.

GENDER DIF ONLY
Rule | (1) | (2) | (3) | (4) | (5) | (6) | (7) | (8)
(1) | Non-Significant = 69%, Significant = 31% | .90 | .69 | .69 | .75 | .69 | .75 | .72
(2) | .90 | TRUE = 62%, FALSE = 38% | .62 | .62 | .68 | .62 | .68 | .64
(3) | .69 | .62 | A = 100% | 1.00 | .94 | 1.00 | .94 | .97
(4) | .69 | .62 | 1.00 | A = 100% | .94 | 1.00 | .94 | .97
(5) | .75 | .68 | .94 | .94 | A = 94%, B = 6% | .94 | 1.00 | .97
(6) | .69 | .62 | 1.00 | 1.00 | .94 | A = 100% | .94 | .97
(7) | .75 | .68 | .94 | .94 | 1.00 | .94 | A = 94%, B = 6% | .97
(8) | .72 | .64 | .97 | .97 | .97 | .97 | .97 | A = 97%, B = 3%

The tables above reveal that very few items are classified as exhibiting DIF when the cut-offs recommended by Jodoin and Gierl (2001) are applied. The addition of effect size clearly reduces the number of items identified as DIF, as highlighted in the literature. However, in the data used for this study, the blending of statistical significance and effect sizes yields nothing additional when compared to effect sizes used alone. That noted, the retention of statistical significance is required to keep with the tenets of hypothesis testing.
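The exact-agreement entries in the off-diagonals above can be illustrated with a small Python sketch. It assumes each decision rule yields one decision per item, either a Boolean (significance only) or an A/B/C label, and collapses B and C into a single DIF category whenever one of the two rules being compared is a significance test. The function names and the five example items are hypothetical and are not the study's data.

```python
def flags_dif(decision):
    """Collapse a decision to DIF / no DIF (B and C both count as DIF)."""
    if isinstance(decision, bool):
        return decision
    return decision in ("B", "C")


def exact_agreement(rule_x, rule_y):
    """Proportion of items given the same decision by two rules.

    If either rule is a pure significance test (Boolean decisions),
    A/B/C labels are collapsed first; otherwise labels must match exactly.
    """
    pairs = list(zip(rule_x, rule_y))
    binary = any(isinstance(v, bool) for pair in pairs for v in pair)
    if binary:
        same = sum(flags_dif(x) == flags_dif(y) for x, y in pairs)
    else:
        same = sum(x == y for x, y in pairs)
    return same / len(pairs)


# Hypothetical decisions for five items under three rules.
significance = [True, False, True, False, True]
new_cuts = ["B", "A", "C", "A", "A"]
mantel_haenszel = ["C", "A", "C", "B", "A"]

agree_sig_new = exact_agreement(significance, new_cuts)
agree_new_mh = exact_agreement(new_cuts, mantel_haenszel)
```

Applying `exact_agreement` to every pair of rules fills the off-diagonal cells of such a matrix; the diagonal cells are simply the proportion of items in each category under a single rule.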
Although exact agreement is very similar between M-H (Holland & Thayer, 1988), Jodoin and Gierl (2001), and the cut-offs presented here, the distribution of A-/B-/C-level items under M-H is much closer to the distribution produced by the cut-offs presented here than to the one produced by the cut-offs of Jodoin and Gierl (2001). Despite the similarity in proportions, it is worthwhile to look closer at the distribution of items as identified by the blended decision rule proposed here against the decision rules as applied to the M-H technique (see Table 4).

Table 4. Classification contingency matrix for DIF items using decision rules from Holland & Thayer (1988) for M-H vs. those calculated in this work, all numbers as proportions of total item mix for all data.

This table demonstrates that 78% of items (i.e., the sum of the diagonal) were classified in the same manner by both techniques. The items in the categories of greatest disagreement (i.e., items identified “A” with one technique and “C” with the other) occurred within items tested for sociodemographic DIF, which comes with the challenge of an unequal ability distribution demonstrated earlier. Ten percent of items were classified higher using the decision rule recommended in this work than using the decision rules applied with the M-H technique. The higher classifications using decision rules from this work may be related to the presence of non-uniform DIF. Figure 10, below, shows that items classified higher (bottom of graph) do tend to have higher R^2 values between Models 2 and 3. Naturally, this is to be expected, as the M-H is not intended to test for non-uniform DIF. No obvious relationship was observed to explain why items may be classified lower using the decision rule proposed here when compared to M-H.

Figure 10. A graph comparing the distribution of R^2 values between Models 2 and 3.
The top graph shows the distribution of R^2 for items classified equal or lower using the decision rules suggested here, when compared to those suggested by Holland & Thayer. The bottom graph shows the distribution of R^2 for items classified higher (containing more DIF) using the new decision rule as opposed to their Holland & Thayer classification.

Concluding Remarks

In effect, despite the known ability to use the LR technique for DIF detection over the last 15 years, there has yet to be a single decision rule used consistently within the literature. The intention of this research was to identify a comparable cut-off in the LR test from the M-H, and to perform a comparison between the various DIF flagging techniques presented in the literature so far. The work here yielded suggested cut-offs on the R^2 metric that are inconsistent with findings in the current literature, summarized in the table below (Table 5). Throughout the work, substantial differences were found when unequal ability distributions were considered, prompting the need for further evaluation of the LR technique under these specific conditions. There are indications that LR is more sensitive than M-H to non-uniform DIF, as defined by the change in R^2 between LR Models 2 and 3, supporting the idea that LR can capture non-uniform DIF.

Table 5. Summary of differences in cut-offs on the R^2 metric between various sources used to date in the literature.

Source                          Applied to change between        A-level          B-level                      C-level
Jodoin and Gierl (2001)         Models 1 to 2 or Models 2 to 3   R^2 < 0.035      0.035 < R^2 < 0.070          0.070 < R^2
Cuevas and Cervantes (2012)^a   Models 1 to 2                    R^2 < 0.000363   0.000363 < R^2 < 0.001012    0.001012 < R^2
Cuevas and Cervantes (2012)^a   Models 2 to 3                    R^2 < 0.00062    0.00062 < R^2 < 0.002708     0.002708 < R^2
Newly calculated cut-offs       Models 1 to 3                    R^2 < 0.00681    0.00681 < R^2 < 0.01514      0.01514 < R^2

Note. The number of digits after the decimal point is presented as reported in the original articles in order to preserve the authenticity of the recommended cut-offs.
^a The values from Cuevas and Cervantes (2012) are on the metric of McFadden’s R^2 and, therefore, are not directly comparable with the remaining values in the table. As well, due to the multitude of cut-offs presented in the original work, the largest cut-offs in each series were selected to present in this summary table. The selected cut-offs also most closely resemble the gender condition presented here, in terms of sample size and proportion between the two groups.

On reviewing the table, it is apparent that the cut-off values on the R^2 metric are notably minuscule. This is especially the case when the cut-off values are compared with the range R^2 values are able to take (0 to 1), as well as the effect size guidelines suggested by Cohen (1992). This may lead a researcher who is unaccustomed to these values as applied in the context of DIF to conclude that such results are inconsequential. Prentice and Miller (1992) specifically discuss how numerically small effect sizes may still reflect an important change worth measuring, such as the case where effects are measured when the dependent variable is “unlikely to yield to influence from the independent variable” (p. 162). The broad groupings that are customary within the DIF literature, such as “male” or “female”, combined with a large number of test-takers within each grouping, would reasonably point to heterogeneous rather than homogeneous groups. As a result, even a tiny difference in item score between two gender groups (which consist of a wide variety of individuals of different ages, socioeconomic backgrounds, and educational experiences, to name a few) may demonstrate that gender can operate “even in domains [one] would think were immune to its effects” (Prentice & Miller, 1992, p. 162).
Therefore, such small changes in R^2 are still worth monitoring for bias.

Implications for Practice

As can be seen in this work, a new cut-off is suggested that is notably dissimilar from that suggested by Jodoin and Gierl (2001). This may be due to differences in the technique by which the results were calculated, or differences in the underlying dataset. Although the results used in this study come from two different constructs (listening and reading), both scales originate from the same assessment instrument and the same population. Consequently, the cut-offs that are calculated as a result of this research may actually be highly tied to the dataset used, including the nature of the construct or changes in the distribution of the item parameters.

It is clear that effect sizes within DIF are not only necessary, but have been used for a number of years. Nonetheless, the selection of effect sizes for screening for DIF must be done with purpose. Psychometricians and other researchers must always remember that the effect sizes for DIF, like all statistical effect sizes, must always be interpreted in the context in which they are used. The idea of establishing one universal effect size that can be used without contextualization is a tall order; Cohen (1988) specifically warned against universal effect sizes. Despite this, DIF techniques seem to assume the universality of their cut-offs. In some cases, such as an initial exploratory study of a construct, a higher level of DIF may be permissible than traditionally used. In other cases, psychometricians may be required to use a set of more constrained cut-offs, such as for items on a high-stakes limited-number-of-attempts assessment product. These categories based on research questions may still be too broad: Do different constructs require different cut-offs? Should different test taker groupings be allowed to have different thresholds for the identification of DIF?
All in all, effect sizes selected must be defensible, respecting test taker groupings that may be affected by an item exhibiting DIF. With the goal of increasing the defensibility of a testing programme’s DIF cut-offs, it is worth considering that one could draw upon procedures traditionally used in standard setting for cut score development (cf. Cizek, 2011). If this DIF cut-off setting procedure is carried out with enough varied panelists who have test-, construct-, and test use-specific knowledge, the defensibility of the resulting cut-offs should be higher than that of universally established cut-offs. Testing programmes should then consider the collection of validity evidence to support their DIF cut-offs, much like cut scores require validity argumentation.

As a result, it is recommended that psychometric practitioners evaluate potential cut-offs for DIF on a testing programme-specific basis. The difference between the results presented here and those presented earlier in the literature suggests that different DIF cut-offs may occur as an interaction of test items, test domain, and testing population. Should any of the three criteria change, the cut-off used to evaluate DIF items may also require adjusting. The use of a single universal cut is, therefore, not recommended.

Other Directions for Research

An acknowledgement is made that the proposed methods are only a stepping stone for future research in this domain. First and foremost, this study clearly cannot be considered a power, Type I error, or Type II error study of the various statistical techniques. Without a simulation study, one cannot determine the actual detection rate of DIF, as one has not artificially inserted performance differences into the data. It would be relevant to follow up this study with simulation studies to further investigate the ways that various DIF decision rules perform, including the proposed cut-offs presented in this work.
As well, because these data are taken from an operational test, the number of items identified as containing DIF was limited. With limited numbers of items identified as DIF, cross-comparability between the various measures is also limited. Repeating this experiment with purposefully simulated data, with specifically inserted DIF, may increase the ability to draw conclusions about the cross-comparability of measures.

References

Ackerman, T. A. (1992). A didactic explanation of item bias, item impact, and item validity from a multidimensional perspective. Journal of Educational Measurement, 29(1), 67–91.
American Educational Research Association [AERA], American Psychological Association [APA], National Council on Measurement in Education [NCME]. (2014). Standards for educational and psychological testing. Washington, DC.
Camilli, G. (1993). The case against item bias techniques based on internal criteria: Do item bias procedures obscure test fairness issues? In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 397–413). Hillsdale, NJ: Lawrence Erlbaum Associates.
Cizek, G. J. (Ed.). (2011). Setting performance standards: Foundations, methods, and innovations (2nd ed.). Routledge.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45(12), 1304–1312.
Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155.
Cuevas, M., & Cervantes, V. H. (2012). Differential item functioning detection with logistic regression. Mathématiques et Sciences Humaines. Mathematics and Social Sciences, 2012(3), 45–59.
Dorans, N. J. (1991, November). Implications of choice of metric for DIF effect size on decisions about DIF. Paper presented at the International Symposium on Modern Theories in Measurement: Problems & Issues, Montebello, Québec, Canada.
Ellis, P. D. (2010). The essential guide to effect sizes: Statistical power, meta-analysis, and the interpretation of research results. Cambridge, UK: Cambridge University Press.
French, B. F., & Maller, S. J. (2007). Iterative purification and effect size use with logistic regression for differential item functioning detection. Educational and Psychological Measurement, 67(3), 373–393. doi:10.1177/0013164406294781
Gómez-Benito, J., Hidalgo, M. D., & Zumbo, B. D. (2013). Effectiveness of combining statistical tests and effect sizes when using logistic discriminant function regression to detect differential item functioning for polytomous items. Educational and Psychological Measurement, 73(5), 875–897. doi:10.1177/0013164413492419
Hidalgo, M. D., & López-Pina, J. A. (2004). Differential item functioning detection and effect size: A comparison between logistic regression and Mantel-Haenszel procedures. Educational and Psychological Measurement, 64(6), 903–915. doi:10.1177/0013164403261769
Holland, P. W., & Thayer, D. T. (1985). Research Report RR-85-43: An alternative definition of the ETS delta scale of item difficulty. Princeton, NJ: Educational Testing Service.
Holland, P. W., & Thayer, D. T. (1988). Differential item functioning and the Mantel-Haenszel procedure. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 129–145). Hillsdale, NJ: Erlbaum.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning: With Applications in R. New York: Springer.
Jodoin, M. G., & Gierl, M. J. (2001). Evaluating Type I error and power rates using an effect size measure with the logistic regression procedure for DIF detection. Applied Measurement in Education, 14(4), 329–349.
Lehmann, E. L. (1993). The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two? Journal of the American Statistical Association, 88, 1242–1249. doi:10.2307/2291263
Lenth, R. V. (2001). Some practical guidelines for effective sample size determination. American Statistician, 55(3), 187–193.
Lin, M., Lucas, H. C. Jr., & Shmueli, G. (2013). Too big to fail: Large samples and the p-value problem. Information Systems Research, 24(4), 906–917.
Maller, S. J. (2001). Differential item functioning in the WISC-III: Item parameters for boys and girls in the national standardization sample. Educational and Psychological Measurement, 61(5), 793–817. doi:10.1177/00131640121971527
Mantel, N., & Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22, 719–748.
Millsap, R. E., & Everson, H. T. (1993). Methodology review: Statistical approaches for assessing measurement bias. Applied Psychological Measurement, 17(4), 297–334. doi:10.1177/014662169301700401
Nagelkerke, N. (1991). A note on a general definition of the coefficient of determination. Biometrika, 78, 691–692.
Oliveri, M. E., Ercikan, K., & Zumbo, B. D. (2014). Effects of population heterogeneity on accuracy of DIF detection. Applied Measurement in Education, 27(4), 286–300. doi:10.1080/08957347.2014.944305
Potenza, M. T., & Dorans, N. J. (1995). DIF assessment for polytomously scored items: A framework for classification and evaluation. Applied Psychological Measurement, 19(1), 23–37.
Prentice, D. A., & Miller, D. T. (1992). When small effects are impressive. Psychological Bulletin, 112(1), 160.
R Core Development Team. (2014). R: A language and environment for statistical computing. Available from http://www.R-project.org
Rogers, H. J., & Swaminathan, H. (1993). A comparison of logistic regression and Mantel-Haenszel procedures for detecting differential item functioning. Applied Psychological Measurement, 17(2), 105–116. doi:10.1177/014662169301700201
Roussos, L. A., & Stout, W. F. (1996). Simulation studies of the effects of small sample size and studied item parameters on SIBTEST and Mantel-Haenszel Type I error performance. Journal of Educational Measurement, 33(2), 215–230.
Shear, B. R., & Zumbo, B. D. (2013). False positives in multiple regression: Unanticipated consequences of measurement error in the predictor variables. Educational and Psychological Measurement, 73, 733–756.
Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27(4), 361–370.
Wilkinson, L., & Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54(8), 594–604.
William Stout Institute for Measurement. (2005). Dimensionality-Based DIF/DBF Package: SIBTEST, Poly-SIBTEST, Crossing SIBTEST [Computer software]. Retrieved from http://psychometrictools.measuredprogress.org/dif1
Zumbo, B. D. (1999). A Handbook on the Theory and Methods of Differential Item Functioning (DIF): Logistic Regression Modeling as a Unitary Framework for Binary and Likert-Type (Ordinal) Item Scores. Ottawa, ON: Directorate of Human Resources Research and Evaluation, Department of National Defense.
Zumbo, B. D. (2007). Three generations of DIF analyses: Considering where it has been, where it is now, and where it is going. Language Assessment Quarterly, 4(2), 223–233.
Zumbo, B. D. (2008). Statistical Methods for Investigating Item Bias in Self-Report Measures [The University of Florence Lectures on Differential Item Functioning]. Università degli Studi di Firenze, Florence, Italy.
Zumbo, B. D., & Thomas, D. R. (1997). A measure of effect size for a model-based approach for studying DIF. Working Paper of the Edgeworth Laboratory for Quantitative Behavioral Science, University of Northern British Columbia: Prince George, B.C.
Zwick, R., & Ercikan, K. (1989). Analysis of differential item functioning in the NAEP history assessment. Journal of Educational Measurement, 26, 55–66.
Item Metadata
Title | Decision rules based on hypothesis tests and effect sizes for logistic regression differential item functioning |
Creator |
Gesicki, Adam |
Publisher | University of British Columbia |
Date Issued | 2015 |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2015-10-24 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivs 2.5 Canada |
DOI | 10.14288/1.0165814 |
URI | http://hdl.handle.net/2429/54871 |
Degree |
Master of Arts - MA |
Program |
Measurement, Evaluation and Research Methodology |
Affiliation |
Education, Faculty of Educational and Counselling Psychology, and Special Education (ECPS), Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2015-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/2.5/ca/ |
AggregatedSourceRepository | DSpace |