Nonparametric Item Response Modeling for Identifying Differential Item Functioning in the Moderate-to-Small-Scale Testing Context

By Petronilla Murlita Witarsa
M.A. (Ed.), University of Victoria (1995)
Dokter (M.D.), University of Andalas (1982)

A Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in the Department of Educational and Counselling Psychology and Special Education
Program: Measurement, Evaluation, and Research Methodology

We accept this dissertation as conforming to the required standard.

© Petronilla Murlita Witarsa, 2003
University of British Columbia
July 2003

All rights reserved. This dissertation may not be reproduced in whole or in part, by photocopy or other means, without the written permission of the author.

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

ABSTRACT

Differential item functioning (DIF) can occur across age, gender, ethnic, and/or linguistic groups of examinee populations. Therefore, whenever more than one group of examinees is involved in a test, the possibility of DIF exists. It is important to detect items with DIF using accurate and powerful statistical methods. While finding a proper DIF method is essential, until now most of the available methods have been dominated by applications to large-scale testing contexts. Since the early 1990s, Ramsay has developed a nonparametric item response methodology and computer software, TestGraf (Ramsay, 2000). The nonparametric item response theory (IRT) method requires fewer examinees and items than other item response theory methods and was also designed to detect DIF. However, the Type I error rate of nonparametric IRT for DIF detection had not been investigated. The present study investigated the Type I error rate of the nonparametric IRT DIF detection method when applied to the moderate-to-small-scale testing context, wherein there were 500 or fewer examinees in a group. In addition, the Mantel-Haenszel (MH) DIF detection method was included. A three-parameter logistic item response model was used to generate data for the two population groups. Each population corresponded to a test of 40 items. Item statistics for the first 34 non-DIF items were randomly chosen from the mathematics test of the 1999 TIMSS (Third International Mathematics and Science Study) for grade eight, whereas item statistics for the last six studied items were adopted from the DIF items used in the study of Muñiz, Hambleton, and Xing (2001). These six items were the focus of this study. The MH test maintained its Type I error rate at the nominal level. The investigation of the nonparametric IRT methodology resulted in: (a) inflated error rates for both a formal and an informal test of DIF, and (b) the discovery of an error in the widely available nonparametric IRT software, TestGraf.
As a result, new cut-off indices for the nonparametric IRT DIF test were determined for use in the moderate-to-small-scale testing context.

TABLE OF CONTENTS

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgments
CHAPTER I: INTRODUCTION
  Setting the Stage for the Dissertation
  Motivation for the Dissertation
    Testing context
    Systematic literature based survey
    DIF detection methods
  Problem Statement
CHAPTER II: LITERATURE REVIEW
  Methods for DIF Detection for the Moderate-to-Small-Scale Testing Context
    Mantel-Haenszel (MH) method
    TestGraf DIF: Nonparametric regression to assess DIF
  Findings from Simulation Studies
    MH test
    TestGraf
  Research Questions
CHAPTER III: METHODOLOGY
  Study Design
    Distribution of the latent variable and sample sizes
    Statistical characteristics of the studied test items
    Computer simulation design and dependent variables
  Procedure
  Data Analysis of Simulation Results
  Version of TestGraf Used in the Simulation
CHAPTER IV: RESULTS
  Mantel-Haenszel
  TestGraf
    Beta of TestGraf
    Type I error rate of TestGraf
  Cut-Off Indices
CHAPTER V: DISCUSSION
  Summary of the Findings
REFERENCES
APPENDIX A
APPENDIX B
  B-1. Methodology
    Description of DIF detection procedure
    Variables in the study
    Data generation
    Procedure
    Compute DIF statistics
    Data analysis of simulation results
  B-2. Results and Discussion
    Summary
  B-3. Conclusion

LIST OF TABLES

Table 1: Summary of the systematic literature based survey by category of research
Table 2: Item statistics for the 40 items
Table 3: Type I error rate of Mantel-Haenszel at nominal α = .05
Table 4: Type I error rate of Mantel-Haenszel at nominal α = .01
Table 5: Sampling distribution of beta of Item-35
Table 6: Sampling distribution of beta of Item-36
Table 7: Sampling distribution of beta of Item-37
Table 8: Sampling distribution of beta of Item-38
Table 9: Sampling distribution of beta of Item-39
Table 10: Sampling distribution of beta of Item-40
Table 11: Type I error rate of TestGraf at nominal α = .05
Table 12: Type I error rate of TestGraf at nominal α = .01
Table 13: Type I error of TestGraf DIF detection using the Roussos-Stout criterion across sample size combinations and item parameters
Table 14: Cut-off indices for β in identifying TestGraf DIF across sample size combinations and three significance levels irrespective of the item characteristics
Table 15: Cut-off indices for β in identifying TestGraf DIF across sample size combinations and three significance levels for an item that has a low discrimination level (a = 0.50) and low difficulty level (b = -1.00)
Table 16: Cut-off indices for β in identifying TestGraf DIF across sample size combinations and three significance levels for an item that has a high discrimination level (a = 1.00) and low difficulty level (b = -1.00)
Table 17: Cut-off indices for β in identifying TestGraf DIF across sample size combinations and three significance levels for an item that has a low discrimination level (a = 0.50) and medium difficulty level (b = 0.00)
Table 18: Cut-off indices for β in identifying TestGraf DIF across sample size combinations and three significance levels for an item that has a high discrimination level (a = 1.00) and medium difficulty level (b = 0.00)
Table 19: Cut-off indices for β in identifying TestGraf DIF across sample size combinations and three significance levels for an item that has a low discrimination level (a = 0.50) and high difficulty level (b = 1.00)
Table 20: Cut-off indices for β in identifying TestGraf DIF across sample size combinations and three significance levels for an item that has a high discrimination level (a = 1.00) and high difficulty level (b = 1.00)
Table A1: Details of the systematic literature based survey by category of research
Table B1: Item statistics for the 40 items
Table B2: Item statistics for the six DIF items
Table B3: Type I error rate of TestGraf version pre-December 2002 at nominal α = .05
Table B4: Probability of rejecting the hypothesis for TestGraf version pre-December 2002 DIF detection in SMALL-DIF at α = .05
Table B5: Probability of rejecting the hypothesis for TestGraf version pre-December 2002 DIF detection in SMALL-DIF at α = .01
Table B6: Probability of rejecting the hypothesis for TestGraf version pre-December 2002 DIF detection in MEDIUM-DIF at α = .05
Table B7: Probability of rejecting the hypothesis for TestGraf version pre-December 2002 DIF detection in MEDIUM-DIF at α = .01
Table B8: Probability of rejecting the hypothesis for TestGraf version pre-December 2002 DIF detection in LARGE-DIF at α = .05
Table B9: Probability of rejecting the hypothesis for TestGraf version pre-December 2002 DIF detection in LARGE-DIF at α = .01
Table B10: Standard deviations and standard errors of TestGraf version pre-December 2002 DIF detection

LIST OF FIGURES

Figure 1: Item characteristic curve of the reference group with N = 100
Figure 2: Item characteristic curve of the focal group with N = 50
Figure 3: Item characteristic curves for reference and focal groups with Ns = 100/50. Curve 1 is for the reference group, and curve 2 is for the focal group.

ACKNOWLEDGEMENTS

I wish to express my sincere appreciation and thanks:
... to my supervisor, Dr. Bruno D. Zumbo, who has always been there whenever I needed his assistance, thoughts, and suggestions.
... to my committee, Dr. Seong-Soo Lee and Dr. Anita Hubley; to my university examiners, Dr. Kimberly Schonert-Reichl and Dr. Lee Gunderson; to my external examiner, Dr. John O. Anderson; and to the Chair of my final examination, Dr. Graham Johnson, as well.
... to the former Department of Educational Psychology and Special Education Faculty and Staff, and to the Department of Educational and Counseling Psychology, and Special Education: the Head and Graduate Student Advisors, Faculty, and Administrative Staff, for the Research and Teaching Assistantship positions given to me, as well as for their kindness and assistance during my study.
... to the Universitas Terbuka in Jakarta, Indonesia, for allowing me to pursue and complete my study at the University of British Columbia.
... and to everyone who has been so kind and friendly to me during my stay in Vancouver, British Columbia, including some friends in Victoria.

CHAPTER I
INTRODUCTION

This chapter begins with the introduction of the dissertation problem and provides some general motivation and context for the dissertation. The second chapter deals with the literature review.
Setting the Stage for the Dissertation

Hattie, Jaeger, and Bond (1999, p. 394) described educational test development, and the educational test enterprise more generally, as a cyclic process that involves the following tasks:
• Conceptual Models of Measurement: This involves alternative measurement models, classical test theory, item response theory, and cognitive processing based models.
• Test and Item Development: This involves selection of item formats, selection of scoring models, frameworks for test organization, test specifications, and test assembly.
• Test Administration: This involves classical group administration of tests, administering performance tests, accommodations in the standardized testing situation, and computer adaptive testing.
• Test Use: This includes using tests for school accountability purposes, ensuring the dependability of scoring, reporting test scores, setting standards for test performances, and linking scores from different tests.
• Test Evaluation: This includes estimation of the reliability of tests, generalizability of scores, reliability and performance assessment, estimation of the standard error of measurement, estimating decision consistency, validity, dimensionality, and adverse impact and test bias.
• The cycle continues at the top of this list.

The above cycle is useful because it integrates the various activities and testing issues of measurement research and practice. It applies to all kinds of testing contexts, from classroom testing to pilot studies and large-scale programs. The Test Evaluation task in the cycle of educational testing can be guided by the Standards for Educational and Psychological Testing (AERA/APA/NCME, 1999). In particular, Standard 7.3 states:

When credible research reports that differential item functioning exists across age, gender, racial/ethnic, cultural, disability, and/or linguistic groups in the population of test takers in the content domain measured by the test, test developers should conduct appropriate studies when feasible. Such research should seek to detect and eliminate aspects of test design, content, and format that might bias test scores for particular groups (p. 81).

The Standards go on to state that differential item functioning (DIF) may happen when one or more groups of examinees with, on average, equal ability have different probabilities of getting a particular item correct. They further explain that such a circumstance can be considered subject to DIF when the DIF results can be replicated, that is, when the DIF results are found in another sample. The sample-to-sample variability is taken into account in the standard error of a statistic.

Motivation for the Dissertation

The motivation for this dissertation rests on the Test Evaluation part of Hattie et al.'s (1999) cycle, which is the task where item bias and validity are considered. Moreover, the precise focus of the present study was to investigate methods that statistically flag DIF items. The statistical method involves testing a hypothesis of no DIF; therefore, the Type I error rate becomes very important, because a statistical test must maintain its Type I error rate to be a valid test of its hypothesis. If the statistical null hypothesis of no DIF were not rejected in a DIF analysis, a measurement analyst would conclude that the particular item contains no bias.
If it is rejected, it can be concluded that DIF exists, and further evaluation will be needed to see whether this DIF is attributable to item bias or to item impact reflecting relevant factors. What this means, then, is that a Type I error is quite important because a test user may be overlooking a potentially biasing item (or items), and hence the test may be functioning differently and inappropriate decisions may be made on the basis of the test score. In the context of high-stakes testing this type of error may be of great concern because of the matter of test fairness.

DIF is a technical term describing a situation where there is a difference in item response performance between two or more groups of examinees who possess the equal underlying overall ability level required to respond correctly to the relevant item. It is worth noting that DIF only tells us that an item functions differently for particular groups of overall equal ability level. Having found an item with DIF does not necessarily indicate bias, although the DIF term itself was derived from item bias research. Another source of DIF can be what is termed item impact. Therefore, DIF is a necessary but not a sufficient condition for item bias to exist.

Studies of item bias were first introduced to educational measurement in the 1960s, when it was discovered that a significant discrepancy in test performance between Black and Hispanic students and White students existed on tests of cognitive ability (Angoff, 1982). The studies were designed to develop methods for exploring cultural differences. The more specific goal of those studies, however, was to identify any test items that were biased against minority students and hence to remove the biased items from the tests. As many more studies were reported, it became clear that the use of the term "bias" carried two different meanings, social and statistical (Angoff, 1993). This resulted in the expression 'differential item functioning' (DIF) for the statistical effect. DIF can be detected using statistical procedures or "item discrepancy methods" (Angoff, 1982), which tell us whether or not a particular item functions differently in different groups that are equivalent on the latent variable being measured. Therefore, a situation in which DIF occurs should be differentiated from situations of either item impact or item bias.

Item impact occurs when examinees of different groups possess true differences in the underlying overall ability being measured by the item, which results in differing probabilities of answering the relevant item correctly. In this notion, the item measures what it is purported to measure, but the differences in getting the item correct lie in true differences in the underlying ability needed to answer the item correctly. In other words, item impact reflects differences in the actual knowledge and skill on the construct being measured for all groups responding to the item. In contrast, if the different probabilities of getting an item correct are due to some characteristic of the test item itself, or of the testing situation, that is irrelevant to the purpose of the test, then there is item bias. Item bias can happen for any identified group of examinees; however, the most commonly investigated groups are formed based on gender, ethnic, cultural, language, social, and/or economic differences. For item bias to happen, DIF is required but not sufficient.
For example, put simply, item bias can occur, among several causes, when an item is worded in such a way that one or more groups have a lower probability of getting the item correct than the other group(s). As previously mentioned, item bias exists only when groups that possess the equal underlying overall ability required to respond correctly to the relevant item nevertheless fail to get the item correct because of some characteristic of the item itself, or of the testing situation, that is irrelevant to the purpose of the test. It should be noted that only when a particular item persistently receives incorrect responses from one particular group over another, and the groups are equivalent on the measured latent variable, could bias be considered to have occurred.

As noted above, there is a link between item bias, item impact, and DIF. Such a connection is basically methodological, in terms of the need for a statistical method for identifying item(s) with DIF. Once DIF is found, further analysis or review will be necessary to determine whether the DIF is attributable to item bias or is, in fact, attributable to item impact. A judgmental analysis or content review may be needed to determine the nature of the link.

There are two forms of DIF, uniform and non-uniform. Uniform DIF occurs when the difference between groups in the probability of getting a correct response is consistent across all levels of ability. That is, there is no interaction effect between group membership and ability level in getting an item correct. Non-uniform DIF is found when there is an interaction effect between the ability needed to get an item right and group membership. In non-uniform DIF, the difference between groups in the probability of responding correctly to the item is not constant across ability levels. In the framework of item response modeling, uniform DIF occurs when the item characteristic curves (ICCs) of the groups studied are separated but parallel (i.e., not crossing), whereas non-uniform DIF implies that the ICCs are not parallel. An ICC is a plot of the probability of an item being answered correctly against ability.

Testing context. At this point it is useful to distinguish between large-scale and smaller-scale testing, because DIF has evolved out of large-scale testing's concern for litigation and its capacity to conduct formal analyses. The type of large-scale testing I am referring to is the type practiced by testing companies, such as ETS or ACT, who are concerned about litigation from individuals who pay to take their tests. As stated by Hattie et al. (1999), the cycle of tasks described earlier applies irrespective of the type of testing context - be it a high-volume testing program wherein there is a testing session once or more times a year (e.g., provincial testing conducted by the British Columbia Ministry of Education, the Test of English as a Foreign Language/TOEFL, or the Graduate Record Examinations/GRE) or a smaller-scale testing context. The high-volume testing context is often referred to as "large-scale" testing, in contrast to what I will call "moderate-to-small-scale" testing. Moderate-to-small-scale testing involves those situations wherein one has a test developed for a specific testing or research purpose taken by 500 or fewer examinees on a single occasion. The tests used in moderate-to-small-scale testing contexts are often, but not always, much shorter than the ones produced in large-scale testing situations.
Interestingly, as Hattie et al. point out, the educational measurement journals, such as the Journal of Educational Measurement, and conference presentations and workshops (such as NCME), reflect a greater focus on large-scale testing than on moderate-to-small-scale testing. Two points are noteworthy here. First, the presence of many items and many examinees is somewhat intertwined. In the move toward using item response theory (IRT) in educational measurement, there is a need for many items so that one can accurately estimate the latent score (theta), or ability level, for examinees. Likewise, an increase in the number of items, and IRT-based analysis, has led to the need for more examinees. In the end, large-scale testing emerged. Second, large-scale testing has evolved out of educational policy and accountability. That is, there is a keen interest on the part of policy makers and administrators in large-scale testing of students at various grades for the purposes of student accountability and educational program evaluation.

Systematic literature based survey. As prominent as IRT has become, as Hattie et al. have observed, it is expected that not all measurement activity would involve IRT, nor would it involve large numbers of items and examinees. What appears largely undocumented in the measurement literature are 'statistics' about the typical numbers of examinees and numbers of items used in a variety of educational testing contexts. To fill this void and help set the context for this dissertation, a systematic literature based survey was conducted.

For this survey, four widely read educational research journals were selected. Two journals were devoted to educational measurement and two others to educational psychology. They include the Journal of Educational Measurement, Applied Measurement in Education, Journal of Educational Psychology, and British Journal of Educational Psychology. To get a sense of the current state, the search was limited to issues from the years 1998 to 2002. Each research paper that reported the examinee/simulee/subject sample sizes, with or without reporting the number of items used, was recorded. Articles were divided into four groups: achievement testing, simulation studies, psychological testing, and survey research. For the purposes of recording, some criteria were first established to achieve consistency across the articles. The criteria and detailed data are provided in Appendix A, and the summary of the information recorded is in Table 1.

In summary, Table 1 shows that the 38 studies of achievement testing involved sample sizes ranging from 50 to nearly 88,000 examinees, with numbers of items from 20 to 150 per test. There were 27 simulation studies, reporting sample sizes ranging from 25 to 20,000 simulees with 20 or more items. Studies in educational psychology used from as few as 18 subjects up to 1,070, and from 10 to 56 items per questionnaire. Sample sizes in the survey category ranged from 3 to 2,000 respondents, with numbers of items from 9 to 80 per survey. In conclusion, the survey suggests that:
1. Achievement testing research uses larger sample sizes and larger numbers of items than psychological and survey research.
2. A majority of simulation research is similar, both in terms of sample size and number of items, to achievement testing (and not to typical psychological research).
3. The medians of the psychological and survey research reflect the use of sample sizes of the moderate-to-small-scale context as defined by the present study.

Likewise, we will later see that studies investigating the statistical properties of DIF detection methods have focussed on large-scale testing. This focus on large-scale testing indicates that there is no widely studied DIF detection method for moderate-to-small-scale testing contexts, where there are fewer examinees and/or items than in the large-scale context.

Table 1
Summary of the systematic literature based survey by category of research

              Achievement Tests      Simulation Studies     Psychological Tests    Survey Studies
              Examinees   Items      Simulees    Items      Subjects    Items      Subjects   Items
M             4,657.0     54.1       2,558.9     51         321.5       29.7       310.6      43.2
M (adjusted)  1,808.2*               1,962.6**
Mdn           1,284       40         1,000       37.5       214.5       27         193.5      40
Min           50          20         25          20         18          10         3          9
Max           87,785      150        20,000      153        1,070       56         2,000      80
25th %        371.5       31         250         30         148         19.5       33.5       26.3
75th %        2,106.3     62         1,600       55         444.8       37.5       361.5      58.3
N             38          17         27          21         16          6          36         10

Note. * denotes the mean without three studies that involved Ns of 50 and over 25,000. ** denotes the mean without two studies that involved Ns of 25 and 20,000.

DIF detection methods. To recapitulate what we have discussed so far, the presence of DIF may confound the interpretations that one makes from test scores. However, the development of DIF methodology has been associated with three phenomena: (a) simulation studies involving thousands of examinees, (b) tests with many items, and (c) the widespread use of item response theory (IRT). These three phenomena are interconnected through the use of many items, many examinees, and IRT. Motivated by the fact that some testing contexts still exist wherein one has fewer items and examinees than found in the contexts where IRT is used, the purpose of this dissertation was to explore the operating characteristics of a psychometric DIF detection method useful in the moderate-to-small-scale context. Therefore, this section will briefly explore some possible pitfalls associated with DIF and with statistical methods to detect DIF, as well as the uses of IRT. Then, the section will describe the persistent issues relating to using moderate-to-small numbers of examinees for the same purposes of assessment and testing, and conclude by identifying the research problem that this dissertation addresses.

With the attention focused on the role of DIF in test development and adaptation comes the concomitant need for statistical methods to detect (i.e., flag) DIF items that are accurate and useful in a variety of testing contexts. Methods for detecting items with DIF are normally used in the process of developing new psychological and/or educational measures, adapting existing measures, and/or validating test score inferences. DIF detection methods only permit us to evaluate whether an item functions similarly for various examinee groups, or favours one or more groups over the others, after conditioning on the underlying ability level required to get the item correct (Zumbo & Hubley, 2003). The majority of DIF methods have been developed with large sample sizes and large numbers of items in mind. This context of large testing organizations is mentioned because large-scale testing has been the place where a great deal of contemporary measurement theory has evolved.
This is particularly true in the almost meteoric rise of item response theory in educational measurement. One can easily see the prominence of IRT by scanning the National Council on Measurement in Education (NCME) conference program over the last 15 years, wherein one sees many workshops, symposia, and conference papers on the topic of IRT. For a list of reasons for this increased visibility of IRT, one can see Hambleton, Swaminathan, and Rogers's (1991) as well as Crocker and Algina's (1986) discussions of topics such as invariance and sample-free estimates.

There are various methods in the literature for detecting DIF items. Among those, the Mantel-Haenszel (MH) DIF detection method has been investigated as a useful method for both large-scale and moderate-to-small-scale testing contexts. Several DIF studies addressing substantially small-scale testing have been conducted using the MH method. Some of those studies used sample sizes as small as 50 respondents per group in either the reference or the focal group (e.g., Muñiz, Hambleton, & Xing, 2001). Muñiz et al. investigated the DIF operating characteristics of the Mantel-Haenszel method using combinations of sample sizes ranging from 50 to 500 with a test of 40 items. Parshall and Miller (1995) analysed DIF with the MH method using sample sizes from 25 to 200 for each of the focal groups studied, each of which was combined with a reference group of 500. Although details will be provided later in the literature review, the upshot is that the MH method was not as successful as expected in detecting DIF items in moderate-to-small-scale studies.

Because no method was found to work consistently in the small-scale testing context, there was a need for a new method. The DIF detection method found in the software TestGraf may be a viable one for moderate-to-small-scale testing, because TestGraf was developed with moderate-to-small-scale testing in mind. TestGraf is a relatively new software-based method for evaluating items with a nonparametric item response modeling approach (Ramsay, 2000). TestGraf's DIF detection method was designed to detect not only the presence of DIF, but also the magnitude of the DIF effect. For DIF analysis to be practically useful, it is important to have a statistical method that can accurately detect DIF as well as measure its magnitude. To date, TestGraf's DIF statistical test has not been investigated. In fact, the standard error of TestGraf's DIF statistic is reported, but it has not yet been used to construct a hypothesis test of DIF, similar to the hypothesis test found with the MH.

Although large-scale assessment and testing continues unabated, the need for moderate-to-small-scale testing has always persisted. Therefore, the present study was intended to highlight this new method of assessing DIF for moderate-to-small-scale testing (to contrast it with large-scale testing), wherein one has far fewer examinees, in combination with far fewer items or questions, than in the above-defined large-scale testing context. The present study builds on the pioneering work of Muñiz et al. (2001) by exploring smaller sample sizes and studying the TestGraf DIF detection method. The findings of this study will make a significant contribution to the field of educational and psychological measurement and evaluation, and primarily to the development of research methods for detecting items with DIF in the context of moderate-to-small-scale testing.
Problem Statement

During the last decades, many assessment practitioners have faced problems in assessing differential item functioning (DIF) when dealing with small numbers of examinees and/or small numbers of test items. Ramsay (2000) described a nonparametric item response modeling DIF detection method that requires far fewer examinees and items than do other DIF detection methods. This DIF detection method also provides a graphical means of evaluating multiple-choice test and questionnaire data (Zumbo & Hubley, 2003). However, its Type I error rate in detecting DIF has been unknown. The DIF detection method from TestGraf was new and had not been investigated in terms of its operating characteristics. This study was designed to specifically investigate the Type I error rate of the TestGraf DIF method in the context of moderate-to-small-scale testing as defined above. As well, this study investigated the cut-off criteria for significant DIF provided by Roussos and Stout (1996). The formal statistical test is new, whereas the latter method (i.e., the cut-off approach) has been known in the literature for some time.

CHAPTER II
LITERATURE REVIEW

In this chapter, the most frequently used method of DIF detection, the Mantel-Haenszel (MH) test, will be reviewed and the main research questions will be identified. Then, a description of the nonparametric item response modeling approach, which was the primary method of the study, is presented. Each of the methods will be accompanied by an example to demonstrate any critical differences in the methods to be implemented. Along the way, the Roussos-Stout cut-off criteria for significant DIF will be introduced. One should recall that, for the purposes of the present study, large-scale testing is defined as a testing context involving more than 500 examinees per group, and moderate-to-small-scale testing involves 500 or fewer examinees per group. It should be noted, however, that even though the above definitions do not involve the number of items, large-scale testing typically involves a large number of items.

Methods for DIF Detection for the Moderate-to-Small-Scale Testing Context

The contingency table (MH) method will be reviewed below, followed by the nonparametric TestGraf item response modeling method. However, before delving into the details of the methodologies, a few overall remarks on the essential differences between MH and TestGraf (IRT) are appropriate. Zumbo and Hubley (2003) provide three frameworks for dealing with DIF: (1) modeling item responses via contingency tables and/or regression models, (2) item response theory (IRT), and (3) multidimensional models. The first framework includes two DIF detection methods:
Second, M H focuses on the observed differences in response proportions to each item whereas IRT focuses on the differences between the item response functions - the function that traces the relation between the latent variable score and the likely, or expected, item response. A key feature o f DIF is the "matching" to study the group differences. That is, the definition o f DLF implies that groups o f individuals, who are matched on the key variable being measured, should perform equally wel l on each item. A further distinction, however, between the M H and LRT (including TestGraf) DLF detection methods is that M H matches on the actual observed total scale score - a discretized total scale score - whereas the IRT methods integrate out the matching score. As Zumbo and Hubley (2003) write: In its essence, the LRT approach is focused on determining the area between the curves (or, equivalently, comparing the LRT parameters) o f the two groups. It is noteworthy that, unlike the contingency table ... methods, the LRT approach does not match the groups by conditioning on the total score. That is, the question o f "matching" only comes up i f one computes the difference function between the groups conditionally (as in M H . . . ) . Comparing the LRT parameter estimates or ICCs [item characteristic curves] is an unconditional analysis because it implicit ly assumes that the ability distribution has been 'integrated out'. The mathematical expression 'integrated out' is commonly used in some DLF literature and is used in the sense that one computes the area between the ICCs across the distribution o f the continuum of variation, theta. (p. 506-507). 16 Mantel-Haenszel (MH) method. Developed by Mantel and Haenszel (1959; in Camill i & Shepard, 1994) for medical research this method was first applied to DLF research by Holland and Thayer (1988). The M H DLF detection method is considered to be one o f the most widely used contingency table procedures. It uses the total test score for matching examinees o f the reference and focal groups and compares the two groups in terms o f their proportion o f success on an item. M H discretizes the total scores into a number o f category score bins. The M H method treats the DLF detection matter as one involving, in essence, a three-way contingency table (Zumbo & Hubley, 2003). The three-way contingency table consists o f the correctness or incorrectness o f a response, the group membership, and the total score category or score bin. The M H procedure requires relatively fewer examinees than other DLF methods - wi th the exception o f TestGraf. It is easy to understand and a relatively inexpensive in terms o f computing time. The method yields a significance test distributed as a chi-square statistic wi th one degree o f freedom for the null DLF hypothesis and an estimate o f DLF effect size as the M H common odds ratio estimator. The null hypothesis chi-square statistic indicates that the likelihood o f a correct response to a given item is equal for examinees o f the reference and focal groups at equal ability level. I f the chi-square is statistically significant, the item is considered to perform differentially for the compared groups. I f the common odds ratio estimate is greater than one, it suggests that the item benefits the reference group and i f it is less than one, the focal group. 
Over many replications of a study, the proportion of statistically significant chi-square statistics gives the Type I error rate, which is expected to be near the nominal significance level alpha (α). With respect to this nominal level, Bradley (1978) stated a liberal criterion of robustness: a test conforms to Bradley's liberal criterion at α = .05 if the Type I error rate is between 0.025 and 0.075.

Let me demonstrate the MH with a hypothetical example. The data set was simulated to be statistically non-significant (i.e., an investigation of Type I error rates) with a sample size combination of 100 and 50 for the reference and focal groups, respectively. The item parameters of this item for the two groups were a = 1.00, b = -1.00, and c = 0.17, where a refers to the item discrimination, b to the item difficulty, and c to the pseudo-guessing parameter in IRT. As expected, it produced a non-significant p-value suggesting there was no DIF, χ²(1) = 0.014, p = .905.

TestGraf DIF: Nonparametric regression to assess DIF. The present study used the computer program TestGraf (Ramsay, 2000; December 20th, 2002 version). The TestGraf software was developed by Ramsay and was designed to aid the development, evaluation, and use of multiple-choice examinations, psychological scales, questionnaires, and similar types of data. As its manual describes, TestGraf requires minimal data, characterized by (1) a set of examinees or respondents, and (2) a set of choice situations, such as items or questions on an examination or a questionnaire. As Ramsay states, although it can be used with large-scale testing data, TestGraf was developed with the intention of aiding instructors and teachers, typically at the college or university level, with their test analysis.

TestGraf implements a nonparametric regression method. Nonparametric regression is the term used for a wide range of methods that directly estimate a functional relationship between an independent variable X and a dependent variable Y. TestGraf applies what is considered the simplest and most convenient computational procedure, namely normal kernel, or Gaussian kernel, smoothing. This normal kernel is simply the standard normal density function, K_N(z) = (2π)^{-1/2} exp(-z²/2) (Fox, 2000). The kernel smoothing is used to estimate an item characteristic curve (ICC) and an option characteristic curve for each item. The option characteristic curve is the ICC for each option of an item; the ICC is, in essence, the option characteristic curve for the correct option. The ICC displays the relationship between the probability that examinees choose the correct response and the proficiency variable, or expected score. Therefore, each ICC provides information about the probabilities of getting a correct response over the given range of proficiency levels or expected scores. The expected score refers to the expected, or average, number of correct items that an examinee at a particular proficiency level will achieve. Hence, the ICC provides information over the range of the proficiency variable or the expected scores obtained. TestGraf uses kernel smoothing to estimate the probability of choosing option m of item i, P_im. Ramsay (2000) described the proficiency value θ as the independent variable; the dependent variable is the probability that an examinee a chooses option m for item i, which is the actual choice.
The actual choices can be summarized numerically by defining an indicator variable y_ima. In the binary situation, this is set to one if an examinee picks the correct option and to zero if not. Assigning a proficiency value θ_a to each examinee and then smoothing the relationship between the binary 0-1 values and the examinee proficiencies gives an estimate of the probability function P_im(θ). The estimation follows four sequential steps (Ramsay, 2000):

1. Estimate the rank of each examinee, r_a, by ranking their total scores on the exam.
2. Replace these ranks with the quantiles of the standard normal distribution. These quantiles are used as the proficiency values θ_a, a = 1, ..., N, and are calculated by dividing the area under the standard normal density function into N + 1 equal areas of size 1/(N + 1), where N denotes the total number of examinees.
3. Arrange the examinee response patterns (X_a1, ..., X_an) by the estimated proficiency rankings. That is, the ath response pattern in the given test, (X_(a)1, ..., X_(a)n), is that of the examinee at the level θ_a.
4. For the mth option of the ith item, estimate P_im by smoothing the relationship between the binary item-option indicator vector y_ima of length N and the proficiency vector θ_1, ..., θ_N using the equation

   \hat{P}_{im}(\theta_q) = \sum_{a=1}^{N} w_{aq} \, y_{ima},

   where y_ima is the indicator variable described earlier and w_aq is the weight applied at a particular display point. The weights are computed from

   w_{aq} = \frac{K[(\theta_a - \theta_q)/h]}{\sum_{b=1}^{N} K[(\theta_b - \theta_q)/h]},

   where the bandwidth value h used by TestGraf is set equal to 1.1 N^{-1/5}.

In short, the TestGraf methodology is a nonparametric regression of the item response onto the latent score. If there is a very large number of examinees, tens or hundreds of thousands, estimation of the ICC at a display value θ_q is done by averaging the indicator values y_ima for the values of θ_a falling within the limits (θ_{q-1} + θ_q)/2 and (θ_q + θ_{q+1})/2, that is, between the centres of the adjacent intervals [θ_{q-1}, θ_q] and [θ_q, θ_{q+1}]. Values falling below the centre of the first interval are averaged into the smallest display value, and those above the centre of the last interval into the largest. These Q averages are denoted \bar{P}_{imq}. The area under the standard normal curve between two interval centres is computed as Q_q. The averages are then smoothed by the equation

   \hat{P}_{im}(\theta_q) = \sum_{r} w_{rq} \, \bar{P}_{imr},

so that the summation is over the much smaller set of display values (set to 51 by default in TestGraf) rather than over the potentially enormous number of examinee indices. The weight for interval r at θ_q is computed by the equation

   w_{rq} = \frac{Q_r \, K[(\theta_r - \theta_q)/h]}{\sum_{s} Q_s \, K[(\theta_s - \theta_q)/h]},

where r indexes intervals with values close to q, Q_r is the area under the curve between the two interval centres, θ_r is the centre of the rth interval, and s indexes the intervals being summed over. As can be seen, the weight values depend neither on the item nor on the option; hence a matrix of order Q containing their values can be computed initially and then used for all curves.
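The four estimation steps can be illustrated with a short sketch. This is not TestGraf itself, only a minimal re-implementation of the ranking, normal-quantile scoring, and Gaussian-kernel smoothing described above, applied directly at the examinee-level θ values (the interval-binning shortcut for very large samples, and the curves for incorrect options, are omitted). The function name and the choice of display points are assumptions for illustration.

```python
import numpy as np
from scipy.stats import norm, rankdata

def kernel_smooth_icc(item_correct, total_scores, n_display=51):
    """Nonparametric (Gaussian-kernel) estimate of an item characteristic curve.

    item_correct : 0/1 indicator y_ia for the correct option of one item.
    total_scores : total test scores used to rank the examinees.
    Returns the display points theta_q and the smoothed P_i(theta_q).
    """
    y = np.asarray(item_correct, dtype=float)
    N = len(y)
    # Steps 1-2: rank examinees and replace ranks with standard normal quantiles.
    ranks = rankdata(total_scores, method="ordinal")        # r_a = 1..N
    theta = norm.ppf(ranks / (N + 1.0))                     # proficiency values theta_a
    # Display values at which the curve is evaluated (TestGraf default is 51).
    theta_q = norm.ppf(np.arange(1, n_display + 1) / (n_display + 1.0))
    # Step 4: Gaussian-kernel weights w_aq with bandwidth h = 1.1 * N^(-1/5).
    h = 1.1 * N ** (-0.2)
    z = (theta[:, None] - theta_q[None, :]) / h             # N x Q matrix of (theta_a - theta_q)/h
    k = norm.pdf(z)                                         # K[(theta_a - theta_q)/h]
    w = k / k.sum(axis=0, keepdims=True)                    # normalize over examinees a
    icc = y @ w                                             # P_i(theta_q) = sum_a w_aq * y_ia
    return theta_q, icc
```

Step 3 of the procedure is implicit here, since the full response and proficiency vectors are kept in examinee order; the bandwidth follows the 1.1 N^{-1/5} value quoted above.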
The vertical dashed lines across the entire plot indicating various quantiles o f the standard normal distribution. The third vertical dashed line, from the right, indicates the median while 50% o f the scores lie between the second 21 and fourth lines, and 5% beyond the first and f i f th vertical dashed lines. Going from left to right, the vertical dashed lines indicate the 5 t h , 25 t h , 50 t h , 75 t h , and 95 t h percentiles, respectively. For multiple-choice items or binary case, as mentioned earlier, the correct response has a weight o f one and each o f the other options weighs zero. The ICC displayed in Figure 1 and in Figure 2 is simply the ICC for a correct response. The vertical solid lines designate estimated 95% confidence limits for the value o f the curve at specific expected score values. These are referred to as point wise confidence limits. The name was given for distinguishing them from confidence limits for the entire curve. As can be seen, these point wise confidence limits are broader for lower scores, because only few examinees fell into this range o f expected scores. On the contrary, the point wise confidence limits are denser for higher scores indicating that those examinees o f higher expected scores had more probability o f getting correct answer than those o f lower expected scores. Item Score 1.0 0.8 0.6 0.4 0.2 0.0 Figure 1. Item characteristic curve o f the reference group wi th A^= 100 Figure 2. Item characteristic curve o f the focal group wi th N = 50 23 TestGraf DIF detection method. As previously mentioned, TestGraf has also been developed to display the differences among two or more groups o f examinees or respondents in terms o f how they responded to items or questions. This indicated the capacity o f TestGraf to assess DIF by showing whether there were systematic ways in which, for instance, females and males that possessed similar underlying ability level responded differently to a particular item or question, or whether different ethnic or language groups responded to particular items or questions in different ways. Because the TestGraf framework is a latent variable approach, it must be noted o f the importance o f a common metric which is often overlooked in the literature. I f the ability distributions o f the compared groups are not on the same metric, any DIF results are difficult to interpret. When comparing two or more groups, in practice one o f the groups is identified as the reference group and the remainder as focal group(s). The reference group is the standard o f comparison, and considered the one being advantaged over the remaining group(s), while the focal group is being disadvantaged on an item or a test being studied. TestGraf assumed that the first file named during an analysis is the reference group. For the purposes o f this study, however, neither group was identified as being disadvantaged nor advantaged. Instead, the main interest was only comparing two groups in assessing the operating characteristics o f the nonparametric item response modeling method, i.e. its Type I error as previously mentioned. Like other IRT methods (Zumbo & Hubley, 2003), TestGraf measures and displays DIF in the form o f a designated area between the item characteristic curves. This area is denoted as beta, symbolized as P, which measures the weighted expected score discrepancy between the reference group curve and the focal group curve for examinees wi th the same ability on a particular item. 
This DIF summary index β is computed in TestGraf as follows. Let the proportion of the reference group having display variable value θ_q be indicated by p_Rq, and let P_imR(θ) and P_imF(θ) stand for the option characteristic curve values for the reference and focal groups, respectively. Then β is

   \beta_{im} = \sum_{q} p_{Rq} \left[ P_{imF}(\theta_q) - P_{imR}(\theta_q) \right].

In a situation where the focal group is found to be disadvantaged on average, this index β is negative. As Ramsay (personal communication, December 19, 2002) stated, the variance of β, denoted Var[β_im], is computed by the equation

   \mathrm{Var}[\beta_{im}] = \sum_{q} p_{Rq}^{2} \left\{ \mathrm{Var}[P_{imR}(\theta_q)] + \mathrm{Var}[P_{imF}(\theta_q)] \right\},

where the notation is the same as above, with R and F denoting the reference and focal groups. The standard error of β is the square root of that variance,

   SE_{\beta} = \sqrt{\mathrm{Var}[\beta_{im}]}.

To illustrate how the TestGraf software can identify DIF across two groups of examinees, Figure 3 is given below. It brings together the ICCs displayed in Figures 1 and 2. As expected, given that the data were simulated with no DIF, it produced a negligible amount of DIF, with β = 0.055 and SE = 0.033. Curve 1 indicates the scores obtained by the reference group, and curve 2 those of the focal group. As previously described, the X-axis shows the expected scores gained by examinees, and the Y-axis the item score, or the probability of getting the item correct given the examinee's expected score. The Y-axis ranges from zero to one; the X-axis covers the range from the minimum to the maximum expected scores obtained by the examinees of the two comparison groups overall.

Figure 3. Item characteristic curves (Item 36) for the reference and focal groups with Ns = 100 and 50. Curve 1 is for the reference group, and curve 2 is for the focal group.

The β and its corresponding standard error, which are provided by the TestGraf program, can be used to form a test statistic by dividing β by its standard error,

   Z = \frac{\beta}{SE_{\beta}} \sim N(0, 1).

According to Ramsay (personal communication, January 28, 2002), the distribution of this statistic should be a standard normal Z-distribution. The Z-value of the item illustrated in Figure 3 is Z = 0.055/0.033 = 1.667, which is not significant at nominal α = 0.05.

Testing for β DIF with TestGraf. Given the above-described test statistic for β, there are two approaches to testing the no-DIF hypothesis for β using the TestGraf DIF detection method:
1. Using the guideline proposed by Roussos and Stout (1996) for the SIBTEST procedure in identifying DIF items. They suggested the following cut-off indices: (a) negligible DIF if |β| < .059, (b) moderate DIF if .059 ≤ |β| < .088, and (c) large DIF if |β| ≥ .088. These criteria were recently investigated by Gotzmann (2002). Applying the Roussos-Stout criterion, the example data display negligible DIF.
2. Applying the statistical test described above, by dividing β by its standard error and referring to a standard normal distribution. TestGraf produces both β and the standard error of β. This is a new method that will be investigated for the first time in this dissertation. As Ramsay states:

   The program produces a composite index no matter whether one or two groups are involved. This index is always signed, and is given in equation (9) on page 65 of the latest version of the manual. It also produces a DIF index that is signed for each reference group (males in your case), and this is equation (8) on the previous page. Right, the standard errors can be used to give a confidence interval for the signed DIF value. (Ramsay, personal communication, January 28, 2002)

Of course, any decision rule for a hypothesis is a statistical test of a sort. It should be noted, however, that of the two approaches, only the second is a conventional statistical hypothesis test as commonly used in statistics - i.e., a test statistic that is referred to a particular sampling distribution to see whether the result falls in the rejection region. Roussos and Stout's and Gotzmann's approach, although a test of DIF, serves rather as a heuristic device for helping interpret the magnitude of β and for aiding analysts in deciding whether DIF is present. What is sought in the present study is to develop the heuristic device toward a decision rule for DIF and hence a statistical test of DIF.

Findings from Simulation Studies

This section presents relevant simulation studies of the statistical properties of the MH and TestGraf methods. Because the main interest of this dissertation was the context of moderate-to-small-scale testing, the literature review was limited to studies that investigated 500 or fewer examinees per group. In addition, only DIF studies that investigated the Type I error rate of DIF detection were reviewed. Given the hypothesis of no DIF, the Type I error rate in detecting DIF refers to declaring an item as DIF when it is not a DIF item (the null hypothesis); this is also called a false positive (Rogers & Swaminathan, 1993).

MH test. In their study, Muñiz, Hambleton, and Xing (2001) used various sample size combinations of 50/50, 100/50, 200/50, 100/100, and 500/500 for the reference and focal groups, respectively. Type I error rates were obtained based on the 34 non-DIF items in their simulation study. The findings showed Type I error rates associated with the Mantel-Haenszel method under the sample size combinations of 50/50, 100/50, 200/50, and 100/100 ranging from 0% to 6%; in contrast, with the sample size combination of 500/500, Type I error rates ranged from 6% to 16%. The nominal Type I error rate (alpha) was set at 0.05. Rogers and Swaminathan (1993) also applied the MH method to detecting items with DIF. They used two sample size levels, 250 per group and 500 per group. Three test lengths were applied, 40, 60, and 80 items, each of which contained 10% uniform DIF items and 10% non-uniform DIF items; the remaining 80% non-DIF items were used to obtain the false positive, or Type I error, rates. Their findings showed that the Type I error rates of the Mantel-Haenszel method were consistently around 1% under all test length conditions.

TestGraf. From the above findings, it is evident that the formal hypothesis test of TestGraf had not been investigated in terms of its operating characteristics in detecting DIF items. Although it involves 500 and more examinees per group, the Gotzmann (2002) study is reviewed because it is the only simulation study of the Roussos-Stout decision heuristic approach to DIF detection. Gotzmann conducted a simulation study that included an investigation of the Type I error rate of the TestGraf DIF detection method. The study used eight sample size combinations for the reference and focal groups: 500/500, 750/1000, 1000/1000, 750/1500, 1000/1500, 1500/1500, 1000/2000, and 2000/2000.
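To make the two approaches concrete, the sketch below computes β, its standard error, the corresponding Z statistic, and the Roussos-Stout classification from already-estimated reference and focal curves (for example, output like that of the kernel smoother sketched earlier). It follows the formulas as reconstructed above and is an illustration only, not TestGraf's own code; the function name and inputs are assumptions.

```python
import numpy as np
from scipy.stats import norm

def beta_dif_test(p_ref, p_foc, var_ref, var_foc, prop_ref, alpha=0.05):
    """TestGraf-style beta DIF index, formal Z test, and Roussos-Stout label.

    p_ref, p_foc     : ICC values for the reference and focal groups at the
                       display points theta_q.
    var_ref, var_foc : pointwise variances of those curve estimates.
    prop_ref         : proportion of the reference group at each display point (p_Rq).
    """
    p_ref, p_foc = np.asarray(p_ref, float), np.asarray(p_foc, float)
    prop_ref = np.asarray(prop_ref, float)
    # beta = sum_q p_Rq [P_F(theta_q) - P_R(theta_q)]; negative if the focal
    # group is disadvantaged on average.
    beta = np.sum(prop_ref * (p_foc - p_ref))
    # Var[beta] = sum_q p_Rq^2 {Var[P_R(theta_q)] + Var[P_F(theta_q)]}
    se_beta = np.sqrt(np.sum(prop_ref ** 2 * (np.asarray(var_ref) + np.asarray(var_foc))))
    z = beta / se_beta                       # formal test: Z ~ N(0, 1) under no DIF
    reject = abs(z) > norm.ppf(1 - alpha / 2)
    # Roussos-Stout heuristic cut-offs on |beta|
    if abs(beta) < 0.059:
        label = "negligible DIF"
    elif abs(beta) < 0.088:
        label = "moderate DIF"
    else:
        label = "large DIF"
    return beta, se_beta, z, reject, label
```

With the Figure 3 values (β = 0.055, SE = 0.033), this sketch reproduces Z ≈ 1.67, non-significant at α = .05, and a "negligible DIF" label under the cut-offs.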
The findings showed that the Type I error rates at the nominal .05 level, with the exception of the 500/500 sample size condition, were considerably below .05. It is worth noting that Gotzmann applied, as previously mentioned, the guideline proposed by Roussos and Stout (1996).

Research Questions

It is clear from the literature review that:
• The MH test is, at best, inconsistent in terms of protecting its Type I error rate, sometimes being quite liberal in moderate-to-small-scale contexts.
• More research needs to be done to investigate the Type I error rate of (a) TestGraf's formal hypothesis test approach, and (b) TestGraf's heuristic approach to detecting DIF, with particular emphasis on moderate-to-small-scale contexts, wherein the heuristic approach has not yet been investigated.

Hence, the primary purpose of this study was to investigate the operating characteristics of the nonparametric item response modeling method (both the formal test and the heuristic rule) in the moderate-to-small-scale testing context. That is, do the statistical tests have a Type I error rate that is equal to or less than the nominal alpha set for the test statistics? The study of Type I error rate is necessary before one can consider turning to investigating the probability of detecting DIF when DIF is actually present in the data - i.e., statistical power.

In addition, the Type I error rate of the MH DIF detection method was investigated for two reasons: first, to extend the measurement literature (i.e., the knowledge base) about the performance of the MH with smaller sample sizes, and at the same time to allow a comparison of the Type I error rate of the new TestGraf method with the more widely documented MH method; second, given that the MH has been previously investigated, it was used as a validity check of the simulation data and the study design. A word of caution should be stated in comparing MH to TestGraf. Zumbo and Hubley (2003) explained that an item response modeling method such as TestGraf integrates out the latent variable, while the MH statistically conditions on (i.e., covaries with) the empirical categorization of the total score. This distinction between hypotheses may be magnified in the case of moderate-to-small-scale testing because the conditioning bins for the MH will typically be very wide (sometimes called thick matching).

The present study answered five specific research questions in the context of educational testing and sample size. They were as follows:
I. What is the Type I error rate of the Mantel-Haenszel statistical test?
II. Is the shape of the sampling distribution of the beta TestGraf DIF statistic normal, as assumed by the formal test of DIF from TestGraf?
III. Is the standard deviation of the sampling distribution of the beta TestGraf DIF statistic the same as the standard error computed in TestGraf?
IV. What is the Type I error rate of the formal statistical test of TestGraf?
V. What is the Type I error rate of the statistical test of TestGraf using the cut-off index of |β| < .059 proposed by Roussos and Stout (1996) for the SIBTEST procedure, when sample size combinations are 500/500 and smaller? If the Roussos-Stout cut-offs result in an inflated error rate, new cut-offs will be determined from the simulation data.

CHAPTER III
METHODOLOGY

The methodology and study design used in this simulation study were similar to those used by Muñiz, Hambleton, and Xing (2001).
Although the current study adopted the sample size combinations used by Muñiz et al., three additional sample size combinations, described later, were included.

Study Design

To investigate the Type I error rate of the DIF detection methods, the simulation involves generating two populations of item responses. As is standard practice in the measurement literature, in each case the item responses were generated using the three-parameter item response theory (IRT) model. The three item parameters are (a) item discrimination, (b) item difficulty, and (c) the lower asymptote (sometimes referred to as the guessing or pseudo-guessing parameter). In short, for each population, the simulation process requires generating a normal distribution with a specified mean and variance for the latent variable and then using the three-parameter IRT model to compute a likely response to each item. In essence, the IRT model becomes a model of item responding for each item.

In simulation studies, as reviewed in the previous chapter, one may manipulate any of the input values for the item parameters or the mean and variance of the latent distribution. If the item parameters are different across the two populations, the simulation study depicts a DIF situation. If, on the other hand, the mean of the latent distribution in one population group is different from the other, the simulation study depicts a test impact situation. Once the populations are constructed, one samples randomly (with replacement) from each population - a type of bootstrapping - to generate samples of simulee responses. The DIF tests were conducted on each sample, and a tally was recorded of the statistical decision for each test for each sample. Because the item parameters and latent distribution characteristics were the same for the two populations, the tally is the Type I error rate.

Distribution of the latent variable and sample sizes. In the present study, the two population groups had equal ability distributions: normal distributions with M = 0 and SD = 1.0. The sample sizes of the reference and focal groups were varied in the study design. Building on the design used in Muñiz et al. (2001), the sample size combinations for the reference and focal groups were 500/500, 200/100, 200/50, 100/100, 100/50, 50/50, 50/25, and 25/25 examinees, respectively. Five of these combinations were the same as those used by Muñiz et al. The additional combinations - 200/100, 50/25, and 25/25 - were included to provide an intermediary between 500/500 and 200/50 and to include smaller sample size combinations. In addition, as Muñiz et al. suggested, these sample size combinations reflect the range of sample sizes seen in practice in what I refer to as moderate-to-small-scale testing.

Statistical characteristics of the studied test items. The item characteristics for the 40-item simulated test involved two parts. First, item statistics for the first 34 non-studied items were randomly chosen from the mathematics test of the 1999 TIMSS (Third International Mathematics and Science Study) for grade eight, whereas item statistics for the last six studied items were adopted from the DIF items used in the study of Muñiz et al. (2001). These six items were the focus of this study. The study design included two item discrimination levels, low and high, each of which was crossed with three item difficulty levels: low, medium, and high.
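For reference, the three-parameter logistic (3PL) model that governs the probability of a correct response has the familiar form shown below. The symbols a, b, and c correspond to the columns of Table 2; the scaling constant D is shown here in its customary value of 1.7, although whether the generating program used D = 1.7 or D = 1 is an assumption on my part rather than something reported by the software documentation.

$$
P_i(\theta) \;=\; c_i \;+\; \frac{1 - c_i}{1 + \exp\!\left[-D\,a_i\,(\theta - b_i)\right]},
\qquad D = 1.7,
$$

where $\theta$ is the latent ability, $a_i$ is the discrimination, $b_i$ the difficulty, and $c_i$ the lower asymptote of item $i$.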
Item statistics for the 40 items are presented in Table 2. The last six items are the items for which DIF was investigated - i.e., the studied DIF items. The a refers to the item discrimination parameter, b to the item difficulty parameter, and c to the pseudo-guessing parameter. Both groups in the DIF analysis had the same population item characteristics; therefore, this was a study of the Type I error rate of the DIF detection methods. The current study used the three-parameter logistic item response modeling program MERTGEN (Luecht, 1996) to generate data for the populations of the simulation. Two population groups were generated, each with 500,000 population simulees. One hundred data sets were randomly sampled from each population and analysed for each sample size combination, making it possible to obtain the Type I error rate for any combination from 100 replication pairs. In this way, the data sets were reflective of actual test data.

Table 2
Item Statistics for the 40 Items

Item#   a      b      c        Item#   a      b      c
1       1.59   0.10   .19      21      1.23  -0.43   .10
2       0.60  -0.98   .20      22      0.73   1.13   .27
3       0.75  -0.42   .06      23      0.54  -1.91   .23
4       1.08   0.43   .24      24      0.71  -0.43   .31
5       0.81   0.34   .32      25      0.66  -0.67   .16
6       0.66  -0.57   .38      26      1.14   0.59   .18
7       0.81  -0.18   .20      27      1.12   0.29   .26
8       0.43  -0.36   .30      28      0.96  -0.26   .23
9       0.94   0.45   .34      29      0.95   0.13   .15
10      1.40   0.15   .07      30      1.38   0.66   .16
11      0.98  -0.20   .18      31      1.38   1.11   .16
12      1.28  -0.12   .23      32      0.42  -0.02   .20
13      1.18   0.18   .23      33      1.04  -0.01   .30
14      0.98  -0.63   .30      34      0.73   0.10   .13
15      0.94  -0.14   .17      35      0.50  -1.00   .17
16      1.39   0.94   .43      36      1.00  -1.00   .17
17      0.78   0.25   .16      37      0.50   0.00   .17
18      0.55  -0.82   .20      38      1.00   0.00   .17
19      0.88   0.09   .27      39      0.50   1.00   .17
20      1.10   0.14   .40      40      1.00   1.00   .17

Computer simulation design and dependent variables. The computer simulation study was an 8 x 3 x 2 completely crossed design: sample size combinations, by item difficulty levels, by item discrimination levels. The dependent variables recorded for each replication of the simulation were:
1. The beta (β) and the standard error of β for each item produced by TestGraf. From this information I was able to compute the DIF hypothesis test statistic (via a confidence interval) for each item for each replication. In addition, the βs allowed me to apply the heuristic test of DIF based on the Roussos and Stout (1996) criterion.
2. The MH test, performed for each item and each replication.

Procedure

As an overview, the study was carried out following a modified version of the procedure originally used in the study of Muñiz et al. (2001), as follows:
1. First, a test of 40 items was set up with the item parameters of the first 34 items drawn from the TIMSS data and those of the last six items adopted from the study of Muñiz et al. These item parameters are provided in Table 2.
2. Two populations of simulees were created with no DIF and no item impact - i.e., equal item parameters and equal ability means across populations.
3. One hundred replications from the reference and focal population groups were created for the eight sample size combinations: 500/500, 200/100, 200/50, 100/100, 100/50, 50/50, 50/25, and 25/25 examinees for the reference and focal groups, respectively (steps 2 and 3 are illustrated in the sketch following this list).
4. TestGraf was used to compute the nonparametric IRT procedure, and SPSS was used to compute the MH test.
5. Step 4 was repeated for the 100 replications of each of the 48 (8 x 3 x 2) conditions in the study. The Type I error rate and the other statistics were computed over the 100 replications.
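To make the procedure concrete, a small computational sketch is given below. It is not the code used in this study (as noted in the next paragraph, the sampling and computation were in fact carried out manually with TestGraf and SPSS); it simply illustrates what steps 2 and 3 amount to, under the assumptions that θ ~ N(0, 1), the 3PL uses D = 1.7, and, for brevity, only the six studied items (35-40 of Table 2) are generated.

```python
import numpy as np

rng = np.random.default_rng(2003)

def three_pl(theta, a, b, c, D=1.7):
    """3PL probability of a correct response (D = 1.7 is an assumption)."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta[:, None] - b)))

def generate_population(a, b, c, n_simulees=500_000):
    """Step 2: simulate dichotomous responses for one no-DIF population."""
    theta = rng.standard_normal(n_simulees)           # ability ~ N(0, 1)
    p = three_pl(theta, a, b, c)                      # response probabilities
    return (rng.random(p.shape) < p).astype(np.int8)  # Bernoulli draws

# Illustrative parameters: the six studied items (Table 2, items 35-40).
a = np.array([0.50, 1.00, 0.50, 1.00, 0.50, 1.00])
b = np.array([-1.00, -1.00, 0.00, 0.00, 1.00, 1.00])
c = np.full(6, 0.17)

reference = generate_population(a, b, c)
focal = generate_population(a, b, c)   # identical parameters: no DIF, no impact

# Step 3: draw one replication (with replacement), e.g., the 100/50 condition.
ref_sample = reference[rng.integers(0, len(reference), size=100)]
focal_sample = focal[rng.integers(0, len(focal), size=50)]
```

In the actual study all 40 items of Table 2 were generated, since the full test is needed for the total-score matching used by the MH procedure and for the TestGraf analysis.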
Because of statistical software limitations, I conducted all of the above steps manually; that is, batch computer code could not be written to handle the sampling and computation. The computing time for the simulation studies, including that reported in Appendix B, was approximately 8 months at 5 days a week and approximately six to eight hours a day.

Data Analysis of Simulation Results

For each of the six studied items and each of the sample size combinations, the Type I error rate of DIF detection per item was obtained by dividing the number of times the null hypothesis was rejected by 100, the number of replications. The mean and standard deviation of the β values over the 100 replications of each sample size combination were computed, and the distribution of β for each condition was examined for normality. The following analyses were carried out to answer the research questions:
1. Tabulated the Type I error rate of the Mantel-Haenszel statistical test.
2. Conducted the Kolmogorov-Smirnov test of normality on the TestGraf β values. Because it was done for each item separately, the Type I error rate of the K-S test was set at .01.
3. Computed the empirical standard error of β and compared it to the TestGraf standard error.
4. Tabulated the Type I error rate of the formal statistical test of the TestGraf DIF detection method.
5. Computed the Type I error rate of the TestGraf DIF detection method using the Roussos-Stout criterion of |β| < .059.
6. Calculated the 90th, 95th, and 99th percentiles of the β values of each item across sample size combinations for the nominal alphas of .10, .05, and .01, respectively.
(A small computational sketch of analyses 2, 3, and 6 is given at the end of this chapter.)

Version of TestGraf Used in the Simulation

The December 2002 version of TestGraf was used in this simulation. Appendix B reports on a study that used the pre-December 2002 version of TestGraf. The study in Appendix B resulted in a new version of TestGraf because it was found, via my simulation in Appendix B, that TestGraf was incorrectly computing the standard error of β. Readers interested in the details of that simulation should see Appendix B.
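As a concrete illustration of analyses 2, 3, and 6 above, the following sketch shows how, for one item in one sample size cell, the 100 replicated β estimates and TestGraf standard errors could be summarized into the quantities reported in Chapter IV. This is my illustration, not the analysis code actually used (the computations were done with SPSS and by hand); the function and array names are hypothetical, the K-S test shown standardizes with the sample mean and SD as an approximation to the SPSS one-sample K-S Z, and the use of absolute β values for the percentile cut-offs is an assumption.

```python
import numpy as np
from scipy import stats

def summarize_cell(betas, testgraf_ses, alpha_ks=0.006):
    """Summaries for one item x sample-size cell over 100 replications."""
    betas = np.asarray(betas, dtype=float)
    ses = np.asarray(testgraf_ses, dtype=float)

    # Analysis 2: K-S test of normality of the beta sampling distribution.
    z = (betas - betas.mean()) / betas.std(ddof=1)
    ks_stat, ks_p = stats.kstest(z, "norm")

    # Analysis 3: empirical SE of beta vs. mean TestGraf SE.
    empirical_se = betas.std(ddof=1)
    mean_testgraf_se = ses.mean()
    ratio = mean_testgraf_se / empirical_se        # column [7] of Tables 5-10

    # Analysis 6: percentile-based cut-off indices (absolute beta assumed).
    cutoffs = np.percentile(np.abs(betas), [90, 95, 99])

    return {
        "KS_Z_p": ks_p, "normal_at_.006": ks_p > alpha_ks,
        "M_beta": betas.mean(), "SD_beta": empirical_se,
        "M_TestGraf_SE": mean_testgraf_se, "ratio_6_over_5": ratio,
        "cutoffs_90_95_99": cutoffs,
    }
```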
CHAPTER IV
RESULTS

To answer each of the research questions, several analyses were conducted on the simulation data. The independent variables in the study design were the sample size combinations of the reference and focal groups and the item discrimination and item difficulty parameters. There were eight sample size combinations generated for the purpose of the study: 500/500, 200/100, 200/50, 100/100, 100/50, 50/50, 50/25, and 25/25. Item discrimination (a-parameter) consisted of two levels, low (a = 0.50) and high (a = 1.00). Item difficulty (b-parameter) consisted of three levels, low (b = -1.00), medium (b = 0.00), and high (b = 1.00).

As a reminder to the reader, and to help structure the reporting of the results, the Type I error rate of the Mantel-Haenszel DIF detection method was investigated to (i) extend the measurement literature (i.e., the knowledge base) about the performance of the MH with smaller sample sizes and, at the same time, allow a comparison of the Type I error rate of the TestGraf method with the more widely documented MH method, and concomitantly (ii) provide supporting empirical evidence of the validity of the simulated data and study design.

The simulation outcomes of the TestGraf DIF detection method were examined to investigate: (i) the purported normality of the sampling distribution of the TestGraf βs, using the one-sample Kolmogorov-Smirnov test on the TestGraf β for the six studied items and all sample size combinations, (ii) the mean of β over the 100 replications (i.e., the mean and standard deviation of the sampling distribution of β) for each item at each sample size combination, (iii) whether the standard deviation of β equals the standard error of β produced by TestGraf - i.e., whether the standard error produced by TestGraf is the expected standard error, (iv) the Type I error rate of the formal hypothesis test of DIF based on the TestGraf method, (v) the Type I error rate of the Roussos-Stout (1996) criterion of |β| < .059, and (vi) finally, for the conditions in which this Type I error rate is inflated, new cut-offs obtained by studying the percentiles of the β values of each item across sample size combinations at 90%, 95%, and 99% for α of .10, .05, and .01, respectively.

All results with regard to the Type I error rates of the MH, the sampling distribution and statistics of the TestGraf betas, the Type I error rates of TestGraf, and the cut-off indices are presented in tables, along with explanations referring to the relevant tables. To determine whether an observed Type I error rate is within an acceptable range of the nominal alpha, the confidence interval strategy of Zumbo and Coulombe (1997) was used. This strategy treats the empirical Type I error rate like a proportion computed from a particular sample size (i.e., the number of replications in the simulation design). A confidence interval was computed by the modified Wald method presented in Agresti and Coull (1998) to investigate whether one has coverage of the nominal alpha. That is, if the confidence interval contains the nominal alpha, then the test is considered to be operating appropriately at that nominal alpha. In the following tables that contain Type I error rates, a Type I error in bold font denotes an inflated Type I error. Note that if either of the confidence bounds equals the nominal alpha, the test is considered to be operating appropriately.

A methodological question arises because statistical significance (and confidence intervals) is used in analyzing the simulation results. In short, the study design involves 48 cells (i.e., an 8 x 3 x 2 design), and a confidence interval (or a hypothesis test, for research question 2) is computed for each of these cells. This could result in inflation of the Type I error rates of the methods used to examine the simulation results. As is typical of any research study, to address this matter one needs to balance the inflation in Type I error rate against the potential loss of statistical power if one is conservative in correcting for this inflation. That is, if one corrects for every hypothesis test for each research question, the per-comparison Type I error rate would be very small, as would be the statistical power. The very liberal option is to ignore the concern about Type I error rate and simply conduct each analysis at an alpha of .05. To strike a balance, the following strategy was applied: (a) each research question was considered individually, as was each table in the results, and (b) for each table of results the analysis error rate was set at .05/(the number of rows in the table). Therefore, given that each of the studied tables has eight rows, when a statistical test of the results is reported, the alpha for the analysis of the simulation outcomes is .006. Accordingly, the Kolmogorov-Smirnov tests used a p-value threshold of .006 for significance, and the confidence intervals were computed at 99% rather than 99.4% because of software limitations that allowed only whole numbers for the percentage of coverage of the interval.
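Because this interval plays a gatekeeping role in all of the Type I error tables that follow, a small sketch of the computation is given below. The function is mine and only illustrates the Agresti and Coull (1998) "add z-squared" adjustment to the Wald interval; it is not the software that was actually used.

```python
import math

def agresti_coull_ci(rejections, n_reps=100, conf=0.99):
    """Modified Wald (Agresti-Coull) interval for an empirical Type I error rate."""
    z = {0.90: 1.645, 0.95: 1.960, 0.99: 2.576}[conf]   # two-sided critical value
    n_tilde = n_reps + z**2
    p_tilde = (rejections + z**2 / 2) / n_tilde
    half = z * math.sqrt(p_tilde * (1 - p_tilde) / n_tilde)
    return max(0.0, p_tilde - half), min(1.0, p_tilde + half)

# Example: 11 rejections in 100 replications at a nominal alpha of .05.
lo, hi = agresti_coull_ci(11)
inflated = lo > 0.05   # flagged (bold in the tables) only when the whole
print(lo, hi, inflated)  # interval lies above the nominal alpha
```

With 11 rejections out of 100 replications, this yields an interval of approximately .049 to .219, which corresponds to the borderline case for Item-40 in the 100/50 condition noted in the next section.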
Mantel-Haenszel

Research Question 1: What is the Type I error rate of the Mantel-Haenszel statistical test?

The results for the Type I error rate of the MH at alphas of .05 and .01 are listed in Tables 3 and 4, respectively. The first column of these tables lists the sample size combinations, while the remaining columns are the six studied items. As an example of how to interpret these tables, the first row of results in Table 3 is for the sample size combination of 500/500; given that none of the values are in bold font, all of the items have a Type I error rate within an acceptable range of the nominal alpha. Applying the confidence interval strategy to the Type I error rates in Tables 3 and 4, one can see that there are no inflated Type I errors at the nominal alpha of .05 or .01. In correspondence with Rogers and Swaminathan (1993), it was found that, as expected, the MH operates appropriately for the sample size of 500/500, hence adding empirical evidence as a design check on the simulation methodology used herein. Although the simulation methodology used in this dissertation is standard, I believe that it is always important to have an empirical crosscheck built into a simulation study. Table 3 contains one result that rests on the borderline of an admissible Type I error rate: for the sample size combination of 100/50, Item-40, the confidence interval is .049 to .219.

Table 3
Type I error rate of Mantel-Haenszel at nominal α = .05

          Item Discrimination Level (a):  Low                      High
          Item Difficulty Level (b):      Low     Medium  High     Low     Medium  High
N1/N2                                     Item35  Item37  Item39   Item36  Item38  Item40
500/500   [row not legible in the source]
200/100                                   .02     .06     .02      .03     .04     .07
200/50                                    .05     .04     .05      .03     .06     .01
100/100                                   0       .03     .01      .02     .03     .05
100/50                                    .02     .07     .05      .06     .02     .11
50/50                                     .02     .03     0        .03     .08     .01
50/25                                     .02     .05     0        .02     .05     .01
25/25                                     .03     .02     0        .02     .03     .06

Table 4
Type I error rate of Mantel-Haenszel at nominal α = .01

          Item Discrimination Level (a):  Low                      High
          Item Difficulty Level (b):      Low     Medium  High     Low     Medium  High
N1/N2                                     Item35  Item37  Item39   Item36  Item38  Item40
500/500                                   0       .01     .01      .02     .01     0
200/100                                   0       .02     0        .01     .03     .03
200/50                                    .01     .01     .01      0       .02     0
100/100                                   0       .01     0        0       .01     0
100/50                                    0       .01     .01      0       .01     .02
50/50                                     .01     .01     0        .01     .02     0
50/25                                     .02     .03     0        0       .03     .01
25/25                                     .01     0       0        .01     .01     .01
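The MH statistics themselves were computed in SPSS. Purely to make the statistic (and the "thick matching" on total score discussed in Chapter II) concrete, a hand-rolled version for a single item might look like the following sketch. The function, the quantile-based choice of four wide score bins, and the variable names are mine and are illustrative only; they are not the settings used in this study.

```python
import numpy as np
from scipy.stats import chi2

def mantel_haenszel_chi2(item_ref, item_foc, total_ref, total_foc, n_bins=4):
    """MH chi-square (with continuity correction) for one dichotomous item,
    matching on total score with a few wide bins (thick matching)."""
    edges = np.quantile(np.concatenate([total_ref, total_foc]),
                        np.linspace(0, 1, n_bins + 1))
    bins_ref = np.digitize(total_ref, edges[1:-1])
    bins_foc = np.digitize(total_foc, edges[1:-1])

    diff, var_sum = 0.0, 0.0
    for k in range(n_bins):
        r = item_ref[bins_ref == k]
        f = item_foc[bins_foc == k]
        A, B = r.sum(), len(r) - r.sum()          # reference: correct / incorrect
        C, D = f.sum(), len(f) - f.sum()          # focal: correct / incorrect
        T = A + B + C + D
        if T < 2:
            continue
        diff += A - (A + B) * (A + C) / T         # observed minus expected count
        var_sum += (A + B) * (C + D) * (A + C) * (B + D) / (T**2 * (T - 1))
    if var_sum == 0:
        return float("nan"), float("nan")
    stat = (abs(diff) - 0.5) ** 2 / var_sum       # continuity-corrected MH chi-square
    return stat, chi2.sf(stat, df=1)              # df = 1 p-value
```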
TestGraf

Beta of TestGraf. Research Question 2: Is the shape of the sampling distribution of the beta TestGraf DIF statistic normal, as assumed by the formal test of DIF from TestGraf?

The results of the one-sample Kolmogorov-Smirnov tests are in Tables 5 to 10. The first column of these tables refers to the various sample size combinations. The present research question uses only Columns [2] and [3] (from the left). None of the sample size combinations and items showed statistical significance when tested for normality of the beta sampling distribution; the reader should recall that, as described above, the significance level for each of these K-S Z tests is .006.

Table 5
Sampling distribution of beta of Item-35

N1/N2 [1]   K-S Z [2]   Sig. [3]   M beta [4]   SD beta [5]   M TestGraf SE [6]   [6]/[5] [7]
500/500     0.617       .841       -0.00037     0.02723       0.01089             .39993
200/100     0.582       .887       -0.00051     0.05401       0.02277             .42159
200/50      1.241       .092        0.00921     0.08276       0.03071             .37107
100/100     1.030       .239        0.00454     0.06020       0.02460             .40864
100/50      0.440       .990        0.00580     0.07900       0.03232             .40911
50/50       0.692       .724       -0.00060     0.08771       0.03547             .40440
50/25       0.707       .700        0.00962     0.13437       0.05292             .39384
25/25       0.581       .889        0.01526     0.13225       0.05623             .42518
Note: N = sample size. Column [2] is the Kolmogorov-Smirnov Z; [3] is the asymptotic two-tailed significance of the K-S Z; [4] is the mean (M) of beta over the 100 replications; [5] is the standard deviation (SD) of beta over the 100 replications (the empirical standard error, SE); [6] is the mean of the TestGraf SE over the 100 replications; [7] is the comparison ratio [6]/[5].

Table 6
Sampling distribution of beta of Item-36

N1/N2 [1]   K-S Z [2]   Sig. [3]   M beta [4]   SD beta [5]   M TestGraf SE [6]   [6]/[5] [7]
500/500     0.515       .953        0.00188     0.02968       0.01058             .35647
200/100     0.676       .750       -0.00603     0.06101       0.02260             .37043
200/50      0.596       .870        0.01422     0.07548       0.03043             .40315
100/100     0.386       .998        0.00485     0.05951       0.02434             .40901
100/50      0.713       .689        0.01351     0.08055       0.03223             .40012
50/50       0.555       .918        0.00111     0.09474       0.03549             .37460
50/25       1.198       .113       -0.00696     0.11153       0.05249             .47064
25/25       0.654       .786       -0.01110     0.12575       0.05574             .44326
Note: Columns as in Table 5.

Table 7
Sampling distribution of beta of Item-37

N1/N2 [1]   K-S Z [2]   Sig. [3]   M beta [4]   SD beta [5]   M TestGraf SE [6]   [6]/[5] [7]
500/500     0.712       .691       -0.00348     0.02629       0.01079             .41042
200/100     0.533       .939       -0.00048     0.06237       0.02319             .37181
200/50      0.639       .808        0.00926     0.08198       0.03133             .38217
100/100     0.539       .934        0.00793     0.06046       0.02495             .41267
100/50      0.504       .961        0.01588     0.08446       0.03300             .39072
50/50       0.567       .905       -0.01508     0.09703       0.03621             .37318
50/25       0.732       .657       -0.00497     0.14431       0.05487             .38022
25/25       0.446       .989       -0.00402     0.13522       0.05733             .42398
Note: Columns as in Table 5.

Table 8
Sampling distribution of beta of Item-38

N1/N2 [1]   K-S Z [2]   Sig. [3]   M beta [4]   SD beta [5]   M TestGraf SE [6]   [6]/[5] [7]
500/500     0.672       .758       -0.00104     0.02588       0.01009             .38988
200/100     0.631       .821       -0.00007     0.05884       0.02195             .37305
200/50      0.710       .695       -0.00097     0.07964       0.03005             .37732
100/100     0.433       .992       -0.00146     0.06377       0.02390             .37478
100/50      0.612       .848       -0.00675     0.08089       0.03181             .39325
50/50       1.373       .046       -0.00904     0.09651       0.03501             .36276
50/25       0.580       .890       -0.00118     0.10154       0.05316             .52354
25/25       0.878       .424        0.01751     0.12067       0.05629             .46648
Note: Columns as in Table 5.
Table 9
Sampling distribution of beta of Item-39

N1/N2 [1]   K-S Z [2]   Sig. [3]   M beta [4]   SD beta [5]   M TestGraf SE [6]   [6]/[5] [7]
500/500     0.738       .648        0.00148     0.01864       0.00895             .48015
200/100     0.681       .743       -0.00681     0.04102       0.01849             .45076
200/50      0.498       .965       -0.00309     0.05953       0.02372             .39845
100/100     0.690       .727       -0.00421     0.04629       0.02054             .44372
100/50      1.026       .243       -0.00399     0.06256       0.02617             .41832
50/50       0.443       .990       -0.00844     0.06174       0.02999             .48575
50/25       0.605       .858       -0.01281     0.08551       0.04335             .50696
25/25       0.976       .296       -0.01372     0.08439       0.04759             .56393
Note: Columns as in Table 5.

Table 10
Sampling distribution of beta of Item-40

N1/N2 [1]   K-S Z [2]   Sig. [3]   M beta [4]   SD beta [5]   M TestGraf SE [6]   [6]/[5] [7]
500/500     1.052       .219        0.00179     0.02121       0.00896             .42244
200/100     0.691       .727       -0.00779     0.04973       0.01889             .37985
200/50      1.151       .141       -0.00514     0.05693       0.02447             .42983
100/100     0.751       .625       -0.00038     0.05135       0.02104             .40974
100/50      0.796       .551       -0.00542     0.07530       0.02659             .35312
50/50       0.678       .748        0.01129     0.07078       0.03033             .42851
50/25       0.763       .605       -0.00325     0.08820       0.04501             .51032
25/25       1.019       .251       -0.00387     0.11184       0.04821             .43106
Note: Columns as in Table 5.

Research Question 3: Is the standard deviation of the sampling distribution of the beta TestGraf DIF statistic the same as the standard error computed in TestGraf?

First, it is important to note that in Tables 5 to 10 the fourth column from the left lists the mean (M) of the βs over the 100 replications for the various items and sample size combinations. A confidence interval for M over the 100 replications was computed for each item and each sample size using the same confidence level as in the K-S tests reported above (i.e., a 99.4% confidence interval). In all cases, the confidence interval around the mean β contained zero, indicating that the statistic is an unbiased estimate of the population β even at small sample sizes - i.e., the mean of the sampling distribution is the population value of zero because the item parameters are the same for the two groups. This is an important point that needed to be established before examining the standard errors.

Columns 5 and 6 from the left in Tables 5 to 10 are the standard deviation of β over the 100 replications (i.e., an empirical estimate of the standard error of β) and the average of the standard error of β produced by TestGraf over the 100 replications, i.e., (1/100) Σ_{q=1}^{100} SE(β̂_q). One can see that the empirical standard errors are larger than the standard errors produced by TestGraf for every item and every sample size combination. The difference between these two statistics appears to be affected by the sample size combination; for the sample size of 500/500 in Table 10, the standard error produced by TestGraf ([6] = .00896) is less than half of what it should be (i.e., the empirical standard error in Column 5 of each table, which is [5] = .02121 in Table 10). This comparison ratio is listed in Column [7], the last column on the right.
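Before turning to the Type I error results, the following sketch illustrates how the two decision rules examined in research questions IV and V translate into empirical rejection rates over the 100 replications of a no-DIF item. The variable names are hypothetical, and the use of 1.96 as the critical value for a nominal alpha of .05 is my assumption about the confidence-interval form of the formal test.

```python
import numpy as np

def empirical_type_i_error(betas, testgraf_ses, z_crit=1.96, rs_cutoff=0.059):
    """Empirical Type I error rates over replications of a no-DIF item for:
    (a) the formal test, flagging DIF when |beta| exceeds z_crit * SE(beta), and
    (b) the Roussos-Stout heuristic, flagging DIF when |beta| >= .059."""
    betas = np.asarray(betas, dtype=float)
    ses = np.asarray(testgraf_ses, dtype=float)
    formal_flags = np.abs(betas) > z_crit * ses
    heuristic_flags = np.abs(betas) >= rs_cutoff
    return formal_flags.mean(), heuristic_flags.mean()
```

If the TestGraf standard errors are too small relative to the empirical standard error, as shown in Tables 5 to 10, the formal test will reject far too often, which is what the next set of results documents.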
Type I error rate of TestGraf. Research Question 4: What is the Type I error rate of the formal statistical test of TestGraf?

The Type I error rates for the formal statistical test of DIF produced by TestGraf are reported in Tables 11 and 12 for nominal alphas of .05 and .01, respectively. These two tables have the same layout as Tables 3 and 4. By examining Tables 11 and 12, one can see that all of the Type I error rates are inflated. This is an expected result, given that the standard error of β produced by TestGraf is too small in comparison to the empirical standard error.

Table 11
Type I error rate of TestGraf at nominal α = .05

          Item Discrimination Level (a):  Low                      High
          Item Difficulty Level (b):      Low     Medium  High     Low     Medium  High
N1/N2                                     Item35  Item37  Item39   Item36  Item38  Item40
500/500                                   .44     .40     .39      .47     .50     .45
200/100                                   .33     .46     .38      .44     .50     .38
200/50                                    .45     .57     .46      .44     .42     .39
100/100                                   .46     .43     .43      .43     .46     .47
100/50                                    .45     .45     .34      .42     .46     .42
50/50                                     .42     .39     .30      .46     .41     .38
50/25                                     .46     .50     .42      .38     .35     .30
25/25                                     .42     .46     .24      .35     .35     .37
Note: Bold denotes an inflated Type I error; every entry in this table is inflated.

Table 12
Type I error rate of TestGraf at nominal α = .01

          Item Discrimination Level (a):  Low                      High
          Item Difficulty Level (b):      Low     Medium  High     Low     Medium  High
N1/N2                                     Item35  Item37  Item39   Item36  Item38  Item40
500/500                                   .30     .28     .18      .38     .36     .34
200/100                                   .26     .33     .22      .36     .41     .29
200/50                                    .33     .40     .35      .30     .32     .28
100/100                                   .36     .28     .29      .30     .29     .33
100/50                                    .33     .29     .24      .25     .31     .34
50/50                                     .29     .31     .26      .32     .30     .24
50/25                                     .34     .32     .26      .21     .15     .22
25/25                                     .26     .32     .17      .22     .25     .24
Note: Bold denotes an inflated Type I error; every entry in this table is inflated.

Cut-Off Indices

Research Question 5: What is the Type I error rate of the statistical test of TestGraf using the cut-off index of |β| < .059, as proposed by Roussos and Stout (1996) for the SIBTEST procedure, when sample size combinations are 500/500 and smaller? If the Roussos-Stout cut-offs result in an inflated error rate, new cut-offs will be determined from the simulation data.

To answer this research question, the cut-off of |β| < .059, as suggested by Roussos and Stout (1996) for the SIBTEST procedure (Gotzmann, 2002), was first applied to investigate the Type I error of the TestGraf DIF detection method. The results can be seen in Table 13. As in Tables 3 and 4, the confidence interval strategy with the p-value of .006 was used. In Table 13 one can see that the test operates appropriately for the sample size combination of 500/500. However, for smaller sample sizes the criterion operates at best inconsistently (sample size combinations of 200/100, 200/50, and 100/100), being quite inflated for most item parameter combinations, and at worst it is always inflated for the smaller sample size combinations of 100/50, 50/50, 50/25, and 25/25. Clearly, a new cut-off is needed for cases of fewer than 500 examinees per group.

By investigating the empirical sampling distribution of β, the simulation design in this study allowed me to compute new cut-off values for the sample sizes for which the Roussos-Stout criterion resulted in inflated error rates. For completeness, the cut-off for the sample size combination of 500/500 is also provided. To obtain the new cut-offs, the percentiles of the β values of each item across sample size combinations were computed at 90%, 95%, and 99% for significance levels of .10, .05, and .01, respectively. These indices answer the fifth research question addressed in the present study and can be seen in Table 14 and Tables 15 to 20. Table 14 provides the criterion irrespective of the item characteristics, whereas Tables 15 to 20 provide the criterion for the variety of item characteristics in the present simulation study.

For example (Table 14), imagine a researcher interested in investigating gender DIF for a test who conducts the nonparametric IRT analysis with TestGraf and computes the β for each of the items. Imagine further that this researcher has 100 male and 100 female examinees; she or he would compare each obtained beta to the cut-off of .0421 for a nominal alpha of .05. Any item with a β greater than .0421 would be flagged as DIF. This allows the researcher to use the same cut-off value irrespective of the item's discrimination and difficulty.
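The decision rule just described is simple enough to express in a few lines; the following sketch, written only as an illustration, hard-codes the .05 column of Table 14 and flags any item whose β exceeds the cut-off for the given sample size combination (the dictionary, function, and use of absolute β values are mine, not part of TestGraf).

```python
# Table 14 cut-offs for beta at a nominal alpha of .05, keyed by N1/N2.
TABLE_14_ALPHA_05 = {
    "500/500": .0161, "200/100": .0373, "200/50": .0540, "100/100": .0421,
    "100/50": .0579, "50/50": .0455, "50/25": .0869, "25/25": .0890,
}

def flag_dif(betas_by_item, sample_sizes="100/100"):
    """Flag items whose TestGraf beta exceeds the Table 14 cut-off
    (absolute value used here; this is an assumption)."""
    cutoff = TABLE_14_ALPHA_05[sample_sizes]
    return [item for item, beta in betas_by_item.items() if abs(beta) > cutoff]

# e.g., flag_dif({"item_07": 0.051, "item_12": 0.018}) returns ["item_07"]
```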
Table 13
Type I error of TestGraf DIF detection using the Roussos-Stout criterion across sample size combinations and item parameters

          Item Discrimination Level (a):  Low                      High
          Item Difficulty Level (b):      Low     Medium  High     Low     Medium  High
N1/N2                                     Item35  Item37  Item39   Item36  Item38  Item40
500/500                                   .01     .02     0        .03     0       0
200/100                                   .11     .14     .02      .18     .17     .08
200/50                                    .28     .33     .15      .26     .21     .09
100/100                                   .17     .22     .09      .18     .17     .10
100/50                                    .24     .31     .15      .25     .22     .18
50/50                                     .29     .18     .14      .28     .22     .21
50/25                                     .39     .34     .20      .27     .31     .28
25/25                                     .37     .34     .17      .25     .33     .31
Note: Bold denotes an inflated Type I error.

Table 14
Cut-off indices for β in identifying TestGraf DIF across sample size combinations and three significance levels, irrespective of the item characteristics

            Level of Significance α
N1/N2       .10      .05      .01
500/500     .0113    .0161    .0374
200/100     .0249    .0373    .0415
200/50      .0460    .0540    .0568
100/100     .0308    .0421    .0690
100/50      .0421    .0579    .0741
50/50       .0399    .0455    .0626
50/25       .0633    .0869    .1371
25/25       .0770    .0890    .1154

Table 15
Cut-off indices for β in identifying TestGraf DIF across sample size combinations and three significance levels for an item that has a low discrimination level (a = 0.50) and low difficulty level (b = -1.00)

            Level of Significance α
N1/N2       .10      .05      .01
500/500     .0369    .0440    .0639
200/100     .0687    .0917    .1519
200/50      .1039    .1100    .2006
100/100     .0839    .0998    .1179
100/50      .1140    .1406    .2560
50/50       .1213    .1386    .2353
50/25       .1680    .2657    .2940
25/25       .1999    .2314    .3774

Table 16
Cut-off indices for β in identifying TestGraf DIF across sample size combinations and three significance levels for an item that has a high discrimination level (a = 1.00) and low difficulty level (b = -1.00)

            Level of Significance α
N1/N2       .10      .05      .01
500/500     .0390    .0519    .0869
200/100     .0893    .0990    .1658
200/50      .1216    .1453    .1769
100/100     .0829    .0968    .1558
100/50      .1070    .1622    .2548
50/50       .1256    .1598    .2000
50/25       .1180    .1640    .1820
25/25       .1432    .1888    .4562
Table 17
Cut-off indices for β in identifying TestGraf DIF across sample size combinations and three significance levels for an item that has a low discrimination level (a = 0.50) and medium difficulty level (b = 0.00)

            Level of Significance α
N1/N2       .10      .05      .01
500/500     .0329    .0380    .0759
200/100     .0720    .0890    .1904
200/50      .1110    .1319    .1768
100/100     .0910    .1010    .1500
100/50      .1295    .1557    .2660
50/50       .1160    .1487    .2867
50/25       .1617    .2187    .3053
25/25       .1667    .2262    .2739

Table 18
Cut-off indices for β in identifying TestGraf DIF across sample size combinations and three significance levels for an item that has a high discrimination level (a = 1.00) and medium difficulty level (b = 0.00)

            Level of Significance α
N1/N2       .10      .05      .01
500/500     .0329    .0360    .0579
200/100     .0798    .0890    .1308
200/50      .1022    .1240    .2919
100/100     .0808    .1122    .1735
100/50      .1093    .1458    .1709
50/50       .1040    .1187    .2019
50/25       .1180    .1460    .2080
25/25       .1712    .2244    .3174

Table 19
Cut-off indices for β in identifying TestGraf DIF across sample size combinations and three significance levels for an item that has a low discrimination level (a = 0.50) and high difficulty level (b = 1.00)

            Level of Significance α
N1/N2       .10      .05      .01
500/500     .0249    .0290    .0410
200/100     .0428    .0520    .0780
200/50      .0754    .0926    .1516
100/100     .0565    .0710    .1059
100/50      .0906    .1120    .1350
50/50       .0638    .0954    .1490
50/25       .1029    .1309    .1700
25/25       .0937    .1756    .2296

Table 20
Cut-off indices for β in identifying TestGraf DIF across sample size combinations and three significance levels for an item that has a high discrimination level (a = 1.00) and high difficulty level (b = 1.00)

            Level of Significance α
N1/N2       .10      .05      .01
500/500     .0288    .0358    .0480
200/100     .0500    .0726    .1089
200/50      .0573    .0719    .1040
100/100     .0606    .0679    .1277
100/50      .0973    .1326    .1856
50/50       .0940    .1427    .1910
50/25       .0999    .1413    .1680
25/25       .1497    .1738    .3431

CHAPTER V
DISCUSSION

The purpose of this dissertation is to investigate psychometric methods for DIF detection in moderate-to-small-scale testing. In particular, the focus is on the test evaluation stage of testing, wherein one may use differential item functioning (DIF) as a method to detect potentially biased items. In setting the stage for the investigation, I defined moderate-to-small-scale testing contexts as those involving 500 or fewer examinees. The literature-based survey I reported at the beginning of this dissertation shows that psychological studies and surveys involve far fewer participants than achievement testing research. The survey also highlights the observation that simulation research on psychometric methods tends to focus on large-scale testing contexts. Both findings empirically confirm what are fairly widely held beliefs about achievement testing and simulation research. The survey findings therefore set the stage by demonstrating the need for more research into moderate-to-small-scale methods that can be used in educational psychology studies, surveys, pilot studies, or classroom testing. The sort of classroom testing I have in mind is the same sort envisioned by Ramsay in the development of his TestGraf nonparametric item response theory model and software: tests used in classes of the size found in post-secondary institutions, which sometimes include as few as 25 students per comparison group and as many as 500 per group. This is in stark contrast to the numbers found in large-scale testing, which involves thousands of examinees.
Doing psychometric research on classroom tests may appear out of step with contemporary thinking in classroom assessment, which de-emphasizes the psychometric properties of classroom tests. However, if one is to take the goal of fair classroom assessment seriously, then investigating test properties for potentially biased items is a natural practice. I am not suggesting that every instructor assess all items in all contexts for potential bias, but rather that a post-secondary instructor may want to maintain an item bank for their tests and hence may want to investigate matters such as potential language or cultural bias on, for example, their mid-term and final examinations. Furthermore, textbook writers and textbook test-bank developers may want to investigate these potential group biases in developing their item banks. As is common in measurement practice, once one has flagged an item as DIF, it does not mean that the item is necessarily biased, nor is it necessarily removed from the test. It is common practice, for example, for an item to be "put on ice" (i.e., set aside) for further investigation and revision rather than simply removed from a test bank.

As Zumbo and Hubley (2003) remind us, in the end, DIF research is about asking questions of whether a test is fair and about providing empirical evidence of the validity of the inferences made from test scores. In addition, DIF research is important in educational psychology research because DIF may threaten the correct interpretation of group differences on psychological scales. For example, when one finds cultural differences in responses to items on a self-esteem scale, one needs to rule out the inferential validity threat that these differences arise from different interpretations of items on the scale. In short, one needs to rule out that the differences are an artefact of the measurement process rather than actual group differences.

The above two examples of the use of DIF methods in moderate-to-small-scale testing highlight the fact that appropriately functioning DIF statistical techniques are needed for sample sizes of 500 or fewer respondents or examinees. The psychometric literature offers three alternatives: two methods based on nonparametric item response theory and the Mantel-Haenszel (MH) test. The nonparametric item response modeling method, incorporated in the software TestGraf, uses either (i) a heuristic decision rule or (ii) a formal hypothesis test of DIF based on the TestGraf results. The TestGraf methods, which were developed with moderate-to-small-scale testing in mind, are becoming more widely used for two reasons: (a) the methodology is graphically based and hence easier to use by individuals who do not have specialized psychometric knowledge of item response theory, and (b) the software is made freely available by its developer, Jim Ramsay of McGill University, via the internet. On the other hand, the MH test is an extension of widely used methods for the analysis of contingency tables and is available in commonly used software such as SPSS or SAS. The purpose of the present study is to investigate the operating characteristics of these DIF methods in the context of moderate-to-small-scale testing. The DIF methodologies in TestGraf (either the formal hypothesis test or the heuristic cut-off values established by Roussos and Stout) had not been investigated in moderate-to-small-scale testing.
In fact, the formal hypothesis test for detecting DIF with TestGraf is, to my knowledge, first introduced in this dissertation. The findings in this dissertation lead to the following recommended practice for moderate-to-small-scale testing:
• The MH DIF test maintains its Type I error rate and is recommended for use.
• TestGraf can be used as a method of DIF detection; however, neither the formal hypothesis test nor the Roussos-Stout heuristic criterion should be used, because of their inflated error rates. Instead, a practical decision rule based on the cut-off indices for TestGraf's beta statistic reported in Tables 14 to 20 should be used. One should note that Tables 15 to 20 provide cut-offs for particular item characteristics, whereas Table 14 provides cut-off indices irrespective of item properties but specific to sample sizes. All of these tables provide values for eight combinations of sample sizes, so, until further research is available, the practitioner may need to interpolate cut-offs for intermediate sample sizes.
• The reader should note that the cut-off values were obtained by computing the 90th, 95th, and 99th percentiles of the null sampling distribution of the beta statistic for the various sample sizes and item characteristics in the simulation. This is based on the statistical principle that creating a statistical hypothesis test, in essence, requires one to divide the sampling distribution of a statistic into a region of acceptance and a region of rejection. In the case of the new cut-off values reported in this dissertation, because the beta statistic did not follow its proposed sampling distribution, this division was done empirically by computing the percentiles from the empirical sampling distribution.

Summary of the Findings

This dissertation has found that:
• The MH DIF test maintains an acceptable Type I error rate.
• The original version of TestGraf, i.e., the version prior to December 20, 2002, produced an incorrect standard error that was too large and hence, as expected, a Type I error rate that was very small (nearly zero). Ironically, with the computing error corrected and a revised version of TestGraf released, the new TestGraf program produced a Type I error rate that was too large.
• Regardless of the error in the standard error produced by TestGraf, the beta statistic is shown to be an unbiased and hence accurate estimate of DIF magnitude.
• Likewise, the shape of the sampling distribution of beta is normal, but the standard error of beta produced by TestGraf is often too small, resulting in too large a test statistic value and hence an inflated Type I error rate.
• Building on the beta statistic's unbiasedness, it has been shown that the Roussos-Stout criterion does not work for moderate-to-small-scale sample sizes. However, new criteria are provided based on the simulation data.

Future research needs to further explore the new cut-off criteria proposed in this dissertation with a variety of sample size combinations. Furthermore, this dissertation focused on Type I error rates, which are necessary for the further study of statistical power in detecting DIF. The Type I error rate of DIF tests is important not only for statistical reasons (i.e., power is not formally defined unless the test protects the Type I error rate) but also for the type of decision being made with DIF tests.
That is, consistent with the definition given in the Introduction of this dissertation, a Type I error in a DIF analysis means that an analyst will conclude that an item displays DIF when in fact it does not. This type of error has serious enough implications for research and practice that the Type I error rate needs to be controlled in DIF analyses. Once one finds a DIF test that protects the Type I error rate, attention then turns to statistical power. Future research should investigate the statistical power of the cut-off scores for beta presented in Tables 14 to 20 and compare this power to that of the MH DIF method for the same data.

REFERENCES

References marked with an asterisk (*) indicate studies included in the small systematic literature-based survey.

*Abedi, J., & Lord, C. (2001). The language factor in mathematics tests. Applied Measurement in Education, 14, 219-234.
Agresti, A., & Coull, B. A. (1998). Approximate is better than "exact" for interval estimation of binomial proportions. The American Statistician, 52, 119-126.
*Allalouf, A., Hambleton, R. K., & Sireci, S. G. (1999). Identifying the causes of DIF in translated verbal items. Journal of Educational Measurement, 36, 185-198.
*Allalouf, A., & Ben-Shakhar, G. (1998). The effect of coaching on the predictive validity of Scholastic Aptitude Tests. Journal of Educational Measurement, 35, 31-47.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Angoff, W. H. (1982). Use of difficulty and discrimination indices for detecting item bias. In R. A. Berk (Ed.), Handbook of methods for detecting test bias (pp. 96-116). Baltimore, MD: Johns Hopkins University.
Angoff, W. H. (1993). Perspectives on differential item functioning methodology. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 3-23). Hillsdale, NJ: Lawrence Erlbaum.
*Ankenmann, R. D., Witt, E. A., & Dunbar, S. B. (1999). An investigation of the power of the likelihood ratio goodness-of-fit statistic in detecting differential item functioning. Journal of Educational Measurement, 36, 277-300.
*Ban, J. C., Hansen, B. A., Wang, T., Yi, Q., & Harris, D. J. (2001). A comparative study of on-line pretest item-calibration/scaling methods in computerized adaptive testing. Journal of Educational Measurement, 38, 191-212.
*Ban, J. C., Hansen, B. A., Yi, Q., & Harris, D. J. (2002). Data sparseness and on-line pretest item calibration-scaling methods in CAT. Journal of Educational Measurement, 39, 207-218.
*Bennett, R. E., Morley, M., Quardt, D., & Rock, D. A. (2000). Graphical modeling: A new response type for measuring the qualitative component of mathematical reasoning. Applied Measurement in Education, 13, 303-322.
*Bielinski, J., & Davison, M. L. (2001). A sex difference by item difficulty interaction in multiple-choice mathematics items administered to national probability samples. Journal of Educational Measurement, 38, 51-77.
*Biggs, J., Kember, D., & Leung, D. Y. P. (2001). The revised two-factor Study Process Questionnaire: R-SPQ-2F. British Journal of Educational Psychology, 71, 133-149.
*Bishop, N. S., & Frisbie, D. A. (1999). The effect of test item familiarization on achievement test scores. Applied Measurement in Education, 12, 327-341.
*Bolt, D. M. (1999). Evaluating the effects of multidimensionality on IRT true-score equating.
Applied Measurement in Education, 12, 383-407. *Bolt , D. M., Cohen, A. S., & Wollack, J. A. (2002). Item parameter estimation under conditions o f test speededness: Application o f a mixture Rasch model wi th ordinal constraints. Journal o f Educational Measurement, 39, 331-348. Bradley, J. V. (1978). Robustness? The British Journal o f Mathematical and Statistical Psychology. 31 . 144-152. *Buckendahl, C. W., Smith, R. W., Lmpara, J. C , & Plake, B. S. (2002). A comparison o f Angof f and bookmark standard setting methods. Journal o f Educational Measurement, 39, 253-263. Camili, G., & Shepard, L. A. (1994). Methods for identifying biased test items. Thousand Oaks, CA: Sage Publications. *Chang, L. (1999). Judgmental item analysis o f the Nedelsky and Angof f standard-setting methods. Applied Measurement in Education, 12, 151-165. *Clauser, B. E., Harik, P., & Clyman, S. G. (2000). The generalizability o f scores for a performance assessment scored wi th a computer-automated scoring system. Journal o f Educational Measurement, 37, 245-261. Clauser, B. E., & Mazor, K. M. (1998). Using statistical procedures to identify differentially functioning test items. Educational Measurement: Issues and Practice, 17, 31-44. *Congdon, P. J., & McQueen, J. (2000). The stability o f rater severity in large-scale assessment programs. Journal o f Educational Measurement, 37, 163-178. Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New York, N Y : CBS College Publishing. *Cross, L. H. & Frary, R. B. (1999). Hodgepodge grading: Endorsed by students and teachers alike. Applied Measurement in Education, 12, 53-72. *De Ayala, R. J., Plake, B. S., & Impara, J. C. (2001). The impact o f omitted responses on the accuracy o f ability estimation in item response theory. Journal o f Educational Measurement, 38, 213-234. *De Champlain, A. & Gessaroli, M . E. (1998). Assessing the dimensionality o f item response matrices wi th small sample sizes and short test lengths. Applied Measurement in Education, 11, 231-253. *De Mars, C. E. (1998). Gender differences in mathematics and science on a high school proficiency exam: The role o f response format. Applied Measurement in Education, 11, 279-299. *De Mars, C. E. (2000). Test stakes and item format interactions. Applied Measurement in Education, 13, 55-77. * Engelhard, G., Jr., Anderson, D. W. (1998). A binomial trials model for examining the ratings o f standard-setting judges. Applied Measurement in Education, 11, 209-230. * Engelhard, G., Jr., Davis, M., & Hansche, L. (1999). Evaluating the accuracy o f judgments obtained from item review committees. Applied Measurement in Education, 12, 199-210. *Enright, M . K., Rock, D. A., & Bennett, R. E. (1998). Improving measurement for graduate admissions. Journal o f Educational Measurement, 35, 250-267. *Ercikan, K., Schwarz, R. D., Julian, M . W., Burket, G. R., Weber, M . M. , Link, V. (1998). Calibration and scoring o f tests wi th multiple-choice and constructed-response item types. Journal o f Educational Measurement, 35, 137-154. *Feldt, L. S., & Quails, A. L. (1998). Approximating scale score standard error o f measurement from the raw score standard error. Applied Measurement in Education, 11, 159-177. 71 *Feltz, D. L., Chase, M. A., Moritz, S. E., & Sullivan, P. J. (1999). A conceptual model o f coaching efficacy: Preliminary investigation and instrument development. Journal o f Educational Psychology, 91 , 765-776. *Ferrara, S., Huynh, H., & Michaels, H. (1999). 
Contextual explanations o f local dependence in item cluster in a large scale hands-on science performance assessment. Journal o f Educational Measurement, 36, 119-140. *Fitzpatrick, A. R., Lee, G., & Gao, F. (2001). Assessing the comparability o f school scores across test forms that are not parallel. Applied Measurement in Education, 14, 285-306. *Fitzpatrick, A. R., & Yen, W. M . (2001). The effects o f test length and sample size on the reliability and equating o f tests composed o f constructed-response items. Applied Measurement in Education, 14, 31-57. *Fitzpatrick, A. R., Ercikan, K., Yen, W. M., & Ferrara, S. (1998). The consistency between raters scoring in different test years. Applied Measurement in Education, 11, 195-208. Fox, J. (2000). Nonparametric simple regression: Smoothing scatterplots. Sage University Papers Series on Quantitative Applications in the Social Sciences, 07-130. Thousand Oaks, CA: Sage. *Fox, R. A., McManus, L C , Winder, B .C. (2001). The shortened Study Process Questionnaire: A n investigation o f its structure and longitudinal stability using confirmatory factor analysis. British Journal o f Educational Psychology, 71 , 511-530. *Fuchs, L. S., Fuchs, D., Karns, K., Hamlett, C. L., Dutka, S., Katzaroff, M. (2000). The importance o f providing background information on the structure and scoring o f performance assessments. Applied Measurement in Education, 13, 1-34. *Gallagher, A., Bridgeman, B., & Cahalan, C. (2002). The effect o f computer-based tests on racial-ethnic and gender groups. Journal o f Educational Measurement, 39, 133-147. *Gao, X., Brennan, R. L. (2001). Variability o f estimated variance components and related statistics in a performance assessment. Applied Measurement in Education, 14, 191-203. *Garner, M. , & Engelhard, D., Jr. (1999). Gender differences in performance on multiple-choice and constructed response mathematics items. Applied Measurement in Education, 12, 29-51. *Ghuman, P. A. S. (2000). Acculturation o f South Asian adolescents in Australia. British Journal o f Educational Psychology, 70, 305-316. *Gierl , M. J., Khaliq, S. N. (2001). Identifying sources o f differential item and bundle functioning on translated achievement tests: A confirmatory analysis. Journal o f Educational Measurement, 38, 164-187. *Goodwin, L. D. (1999). Relations between observed item diff iculty levels and Angof f minimum passing levels for a group o f borderline examinees. Applied Measurement in Education, 12, 13-28. Gotzmann, A. J. (2002). The effect o f large ability differences on Type I error and power rates using SIBTEST and TESTGRAF DEF detection procedures. Paper prepared at the Annual Meeting o f the American Educational Research Association (New Orleans, LA , Apr i l 1-5, 2002). (ERIC Document Reproduction Service No. ED 464 108) *Hal l , B. W., & Hewitt-Gervais, C. M . (2000). The application o f student portfolios in primary-intermediate and self-contained-multiage team classroom environments: Implications for instruction, learning, and assessment. Applied Measurement in Education, 13, 209-228. Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals o f item response theory. Newbury Park, CA: Sage. *Hamilton, L. S. (1999). Detecting gender-based differential response science test. Applied Measurement in Education, 12, 211-235. *Hanson, K., Brown, B., Levine, R., & Garcia, T. (2001). Should standard calculators be provided in testing situations? A n investigation o f performance and preference differences. 
Applied Measurement in Education, 14, 59-72. Hattie, J., Jaeger, R. M., & Bond, L. (1999). Persistent methodological questions in educational testing. In A. Iran-Nejad & P. D. Pearson (Eds.), Review o f Research in Education (Vol. 24, pp. 393-446). Washington, DC: The American Educational Research Association. Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 129-145). Hillsdale, NJ: Lawrence Erlbaum Associates. *Hong, S., Roznowski. (2001). A n investigation o f the influence o f internal test bias on regression slope. Applied Measurement in Education, 14, 351-368. *Impara, J. C , & Plake, B. S. (1998). Teachers' ability to estimate item diff iculty: A test o f the assumptions in the Angof f standard setting method. Journal o f Educational Measurement, 35, 69-81. *Janosz, M., Le Blanc, M., Boulerice, B., & Tremblay, R. E. (2000). Predicting different types o f school dropouts: A typological approach wi th two longitudinal samples. Journal o f Educational Psychology, 92, 171-190. *Jodoin, M . G., & Gierl, M. J. (2001). Evaluating Type I error and power rates using an effect size measure wi th the logistic regression procedure for DIF detection. Applied Measurement in Education, 14, 329-349. *Katz, I. R., Bennett, R. E., & Berger, A. E. (2000). Effects o f response format on diff iculty o f SAT-Mathematics items: It 's not the strategy. Journal o f Educational Measurement, 37, 39-57. *K ing, N. J., Ollendick, T. H., Murphy, G. C., & Mol loy, G. N. (1998). Ut i l i ty o f relaxation training wi th children in school settings: A plea for realistic goal setting and evaluation. British Journal o f Educational Psychology, 68, 53-66. *Kle in, S. P., Stecher, B. M., Shavelson, R. J., McCaffrey, D., Ormseth, T., Bel l , R. M., Comfort, K., & Othman, A. R. (1998). Analytic versus holistic scoring o f science performance tasks. Applied Measurement in Education, 11, 121-137. *Lee, G. (2000). Estimating conditional standard errors o f measurement for test composed of testlets. Applied Measurement in Education, 13, 161-180. *Lee, G. (2002). The influence of several factors on reliability for complex reading comprehension tests. Journal o f Educational Measurement, 39, 149-164. *Lee, G., & Frisbie, D. A. (1999). Estimating reliability under a generalizability theory model for test scores composed o f testlets. Applied Measurement in Education, 12, 237-255. *Lee, W. C , Brennan, R. L., & Kolen, M. J. (2000). Estimators o f conditional scale-score standard errors o f measurement: A simulation study. Journal o f Educational Measurement, 37, 1-20. 75 *Lingbiao, G. & Watkins, D. (2001). Identifying and assessing the conceptions o f teaching o f secondary school physics teachers in China. British Journal o f Educational Psychology, 71,443-469. *Ma , X. (2001). Stability o f school academic performance across subject areas. Journal o f Educational Measurement, 38, 1-18. *McBee, M . M., & Barnes, L. L. B. (1998). The generalizability o f a performance assessment measuring achievement in eighth-grade mathematics. Applied Measurement in Education, 11, 179-194. *Meijer, R. R. (2002). Outlier detection in high-stakes certification testing. Journal o f Educational Measurement, 39, 219-233. *Mokhtari , K. & Reichard, C. A. (2002). Assessing students' metacognitive awareness o f reading strategies. Journal o f Educational Psychology, 94, 249-259. Muii iz, J., Hambleton, R. K., & Xing, D. (2001). 
Small sample studies to detect flaws in item translations. International Journal o f Testing, 1, 115-135. *Muraki , E. (1999). Stepwise analysis o f differential item functioning based on multiple-group partial credit model. Journal o f Educational Measurement, 36, 217-232. *Nichols, P., & Kuehl, B. J. (1999). Prophesying the reliability o f cognitively complex assessments. Applied Measurement in Education, 12, 73-94. * 0 ' N e i l , H. F., Jr., & Brown, R. S. (1998). Differential effects o f question formats in math assessment on metacognition and affect. Applied Measurement in Education, 11, 331-351. *Oosterheert, I. E., Vermunt, J. D., & Denessen, E. (2002). Assessing orientations to learning to teach. Brit ish Journal o f Educational Psychology, 72, 41-64. *Oshima, T. C , Raju, N. S., Flowers, C. P., & Slinde, J. A. (1998). Differential bundle functioning using the DFIT framework: Procedures for identifying possible sources o f differential functioning. Applied Measurement in Education, 11, 353-369. Parshall, C. G., & Mil ler, T. R. (1995). Exact versus asymptotic Mantel-Haenszel DLF statistics: A comparison o f performance under small-sample conditions. Journal o f Educational Measurement, 32 , 302-316. *Penfield, R. D. (2001). Assessing differential item functioning among multiple groups: A comparison o f three Mantel-Haenszel procedures. Applied Measurement in Education, 14,235-259. *Pommerich, M., Nicewander, W. A., & Hanson B. A. (1999). Estimating average domain scores. Journal o f Educational Measurement, 36, 199-216. *Pomplun, M. , & Omar, M . H. (2001). Do reading passages about war provide factorially invariant scores for men and women? Applied Measurement in Education, 14, 171-189. *Pomplun, M., & Omar, M. H. (2001). The factorial invariance o f a test o f a reading comprehension across groups o f limited English proficient students. Applied Measurement in Education, 14, 261-283. *Pomplun, M., & Sundbye, N. (1999). Gender differences in constructed response reading items. Applied Measurement in Education, 12, 95-109. *Ponsoda, V., Olea, J., Rodriguez, M. S., & Revuelta, J. (1999). The effects o f test diff iculty manipulation in computerized adaptive testing and self-adapted testing. Applied Measurement in Education, 12, 167-184. *Powers, D. E., & Bennett, R. E. (1999). Effects o f allowing examinees to select questions on a test o f divergent thinking. Applied Measurement in Education, 12, 257-279. 77 *Powers, D. E., & Fowles, M. E. (1998). Effects o f preexamination disclosure o f essay topics. Applied Measurement in Education, 11, 139-157. Towers , D. E., & Rock, D. A. (1999). Effects o f coaching on SAT I: Reasoning test scores. Journal o f Educational Measurement, 36, 93-118. *Rae, G. & Hyland, P. (2001). Generalisability and classical test theory analyses o f Koppitz's Scoring System for human figure drawings. Brit ish Journal o f Educational Psychology, 7 i , 369-382. Ramsay, J. O. (2000). TESTGRAF: A program for the graphical analysis o f multiple choice test and questionnaire data. Montreal, Quebec, Canada: McGi l l University. *Raymond, M . R. (2001). Job analysis and the specification o f content for licensure and certification examinations. Applied Measurement in Education, 14, 369-415. *Rest, J. R., Narvaez, D., Thoma, S. J., & Bebeau, M. J. (1999). DIT2: Devising and testing a revised instrument o f moral judgment. Journal o f Educational Psychology, 91 , 644-659. Rogers, H. J., & Swaminathan, H. (1993). 
A comparison of logistic regression and Mantel-Haenszel procedures for detecting differential item functioning. Applied Psychological Measurement, 17, 105-116.

Roussos, L. A., & Stout, W. F. (1996). Simulation studies of the effects of small sample size and studied item parameters on SIBTEST and Mantel-Haenszel Type I error performance. Journal of Educational Measurement, 33, 215-230.

*Roussos, L. A., Stout, W. F., & Marden, J. I. (1998). Using new proximity measures with hierarchical cluster analysis to detect multidimensionality. Journal of Educational Measurement, 35, 1-30.

*Ryan, K. E., & Chiu, S. (2001). An examination of item context effects, DIF, and gender DIF. Applied Measurement in Education, 14, 73-90.

*Sachs, J., & Gao, L. (2000). Item-level and subscale-level factoring of Biggs' Learning Process Questionnaire (LPQ) in a Mainland Chinese sample. British Journal of Educational Psychology, 70, 405-418.

*Schwarz, R. D. (1998). Trace lines for classification decisions. Applied Measurement in Education, 11, 311-330.

*Scrams, D. J., & McLeod, L. D. (2000). An expected response function approach to graphical differential item functioning. Journal of Educational Measurement, 37, 263-280.

*Sireci, S. G., & Berberoglu, G. (2000). Using bilingual respondents to evaluate translated-adapted items. Applied Measurement in Education, 13, 229-248.

*Sireci, S. G., Robin, F., & Patelis, T. (1999). Using cluster analysis to facilitate standard setting. Applied Measurement in Education, 12, 301-325.

*Smith, M., Duda, J., Allen, J., & Hall, H. (2002). Contemporary measures of approach and avoidance goal orientations: Similarities and differences. British Journal of Educational Psychology, 72, 155-190.

*Smits, N., Mellenbergh, G. J., & Vorst, H. C. M. (2002). Alternative missing data techniques to grade point average: Imputing unavailable grades. Journal of Educational Measurement, 39, 187-206.

*Sotaridona, L. S., & Meijer, R. R. (2002). Statistical properties of the K-Index for detecting answer copying. Journal of Educational Measurement, 39, 115-132.

*Stecher, B. M., Klein, S. P., Solano-Flores, G., McCaffrey, D., Robyn, A., Shavelson, R. J., et al. (2000). The effects of content, format, and inquiry level on science performance assessment scores. Applied Measurement in Education, 13, 139-160.

*Stocking, M. L., Lawrence, I., Feigenbaum, M., Jirele, T., Lewis, C., & Van Essen, T. (2002). An empirical investigation of impact moderation in test construction. Journal of Educational Measurement, 39, 235-252.

*Stocking, M. L., Ward, W. C., & Potenza, M. T. (1998). Simulating the use of disclosed items in computerized adaptive testing. Journal of Educational Measurement, 35, 48-68.

*Stricker, L. J., Rock, D. A., & Bennett, R. E. (2001). Sex and ethnic-group differences on accomplishments measures. Applied Measurement in Education, 14, 205-218.

*Stricker, L. J., & Emmerich, W. (1999). Possible determinants of differential item functioning: Familiarity, interest, and emotional reaction. Journal of Educational Measurement, 36, 347-366.

*Stuart, M., Dixon, M., Masterson, J., & Quinlan, P. (1998). Learning to read at home and at school. British Journal of Educational Psychology, 68, 3-14.

Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27, 361-370.

*Sykes, R. C., & Yen, W. M. (2000).
The scaling of mixed-item-format tests with the one-parameter and two-parameter partial credit models. Journal of Educational Measurement, 37, 221-244.

*Tsai, T. H., Hanson, B. A., Kolen, M. J., & Forsyth, R. A. (2001). A comparison of bootstrap standard errors of IRT equating methods for the common-item non-equivalent groups design. Applied Measurement in Education, 14, 17-30.

*Van der Linden, W. J., & Glas, C. A. W. (2000). Capitalization on item calibration error in adaptive testing. Applied Measurement in Education, 13, 35-53.

*Vispoel, W. P., Claigh, S. J., Bleiler, T., Hendrickson, A. B., & Ehrig, D. (2002). Can examinees use judgments of item difficulty to improve proficiency estimates on computerized adaptive vocabulary tests? Journal of Educational Measurement, 39, 311-330.

*Vispoel, W. P., & Fast, E. E. F. (2000). Response biases and their relation to sex differences in multiple domains of self-concept. Applied Measurement in Education, 13, 79-97.

*Vispoel, W. P., Rocklin, T. R., Wang, T., & Bleiler, T. (1999). Can examinees use a review option to obtain positively biased ability estimates on a computerized adaptive test? Journal of Educational Measurement, 36, 141-157.

*Vispoel, W. P. (1998). Psychometric characteristics of computer-adaptive and self-adaptive vocabulary tests: The role of answer feedback and test anxiety. Journal of Educational Measurement, 35, 155-167.

*Walker, C. M., Beretvas, S. N., & Ackerman, T. (2001). An examination of conditioning variables used in computer adaptive testing for DIF analyses. Applied Measurement in Education, 14, 3-16.

*Wang, T., Kolen, M. J., & Harris, D. J. (2000). Psychometric properties of scale scores and performance levels for performance assessments using polytomous IRT. Journal of Educational Measurement, 37, 141-162.

*Waugh, R. F. (1999). Approaches to studying for students in higher education: A Rasch measurement model analysis. British Journal of Educational Psychology, 69, 63-79.

*Waugh, R. F., & Addison, P. A. (1998). A Rasch measurement model analysis of the revised approaches to studying inventory. British Journal of Educational Psychology, 68, 95-112.

*Webb, N. M., Schlackman, J., & Sugrue, B. (2000). The dependability and interchangeability of assessment methods in science. Applied Measurement in Education, 13, 277-301.

*Wightman, L. F. (1998). An examination of sex differences in LSAT scores from the perspective of social consequences. Applied Measurement in Education, 11, 255-277.

*Williams, V. S. L., Pommerich, M., & Thissen, D. (1998). A comparison of developmental scales based on Thurstone methods and item response theory. Journal of Educational Measurement, 35, 93-107.

*Willingham, W. W., Pollack, J. M., & Lewis, C. (2002). Grades and test scores: Accounting for observed differences. Journal of Educational Measurement, 39, 1-37.

*Wise, S. L., Finney, S. J., Enders, C. K., Freeman, S. A., & Severance, D. D. (1999). Examinee judgments on changes in item difficulty: Implications for item review in computerized adaptive testing. Applied Measurement in Education, 12, 185-198.

*Wolfe, E. W., & Gitomer, D. H. (2001). The influence of changes in assessment design on the psychometric quality of scores. Applied Measurement in Education, 14, 91-107.

*Yu, F., & Nandakumar, R. (2001). Poly-Detect for quantifying the degree of multidimensionality of item response data. Journal of Educational Measurement, 38, 99-120.

Zumbo, B. D., & Coulombe, D. (1997).
Investigation of the robust rank-order test for non-normal populations with unequal variances: The case of reaction time. Canadian Journal of Experimental Psychology, 51, 139-150.

Zumbo, B. D., & Hubley, A. M. (2003). Item bias. In R. Fernandez-Ballesteros (Ed.), Encyclopedia of psychological assessment (pp. 505-509). Thousand Oaks, CA: Sage Press.

*Zuriff, G. E. (2000). Extra examination time for students with learning disabilities: An examination of the maximum potential thesis. Applied Measurement in Education, 13, 99-117.

APPENDIX A

Criteria for reporting the sample sizes of the articles recorded in the small systematic literature-based survey:

1. All articles reporting sample sizes found in the Journal of Educational Psychology and the British Journal of Educational Psychology were categorized as educational psychology research. Wherever applicable, rater or interviewer sample sizes were labelled as survey.

2. Articles in the Journal of Educational Measurement and Applied Measurement in Education were classified as achievement testing, simulation testing, or survey. Sample sizes of raters, interviewers, and judges were grouped into the survey category. Test scores obtained from a database of real testing were classified as achievement testing unless the study denoted them as simulation testing.

3. For articles with a single study that used more than one sample size, the smallest sample size used (either subjects or items) was chosen for reporting in Table A1.

4. For articles with multiple studies, the sample size of each study was coded in Table A1, with point 3 above applied as well.

5. For achievement testing, the number of complex items was not coded in Table A1 because, when such item formats were involved, the relevant study would use only one or two items; coding them could lead to a wrong conclusion about the number of items involved.

6. DIF studies with various combinations of sample sizes were coded for the smallest sample size used in any group.

7. The above criteria resulted in some studies falling into more than one coded category; for instance, Rae and Hyland (2001) was coded as both educational psychology research and a survey.
Table A1
Details of the systematic literature-based survey by category of research

Achievement Tests        Simulation Studies       Psychological Test       Survey Studies
Examinees    Items       Simulees    Items        Subjects    Items        Subjects    Items

163 40 1,000 36 1,070 18 293 80 2,080 70 12,000 30 75 10 900 20 1,392 48 7,300 25 791 30 26 25 448 100 100 25 174 56 16 40 1,401 125 500 30 18 24 12 53 1,351 20 1,000 40 450 40 50 9 8,454 62 100 200 85 - 186 30 717 31 250 55 229 2,000 75 1,485 35 250 36 475 3 60 2,595 40 1,700 30 443 36 40 198 49 1,500 56 169 200 - 4,637 20 300 24 369 16 388 30 6,515 30 189 390 633 150 3,000 60 200 15 366 40 100 40 346 314 6,883 34 500 20 61 44 4,712 26 1,000 100 - 335 2,115 - 200 43 662 55 250 100 243 8,285 250 39 115 1,217 250 30 728 200 1,000 - 307 1,493 1,000 4 1,065 8,000 20 2,002 1,000 187 73 25* 77 600 20,000* 39 182 - 810 57 782 1,493 98 629 300 1,862 658 107 26 1,029 603 2,919 352 50* 335 25,844* 87,785*

           Examinees   Items   Simulees   Items   Subjects   Items   Subjects   Items
M          4,657.0     54.1    2,558.9    51      321.5      29.7    310.6      43.2
M**        1,808.2             1,962.6
Mdn        1,284       40      1,000      37.5    214.5      27      193.5      40
Min        50          20      25         20      18         10      3          9
Max        87,785      150     20,000     153     1,070      56      2,000      80
25th%      371.5       31      250        30      148        19.5    33.5       26.3
75th%      2,106.3     62      1,600      55      444.8      37.5    361.5      58.3
N          38          17      27         21      16         6       36         10

Note: * denotes sample sizes that were excluded from the means marked **.

APPENDIX B

An Initial Study of the Operating Characteristics of TestGraf's DIF Detection Statistic: Discovering an Error in TestGraf98's Calculation of the Standard Error of Beta¹

As a consequence of sharing the results reported in this Appendix and getting clarification from Ramsay on the computation of the standard error of beta, a revised version of TestGraf98 was released by Ramsay on December 20, 2002. The purpose of this Appendix is to provide a record, for historical and archival reasons, of the Type I error rate and statistical power of the statistical test of DIF in the pre-December 20, 2002 version of TestGraf98. Applied measurement researchers who used the pre-December 20, 2002 version of TestGraf98 for DIF will have had the Type I error rate and power reported in this Appendix B. Anyone using the Roussos-Stout criterion is not affected by the change in TestGraf versions. The version of TestGraf98 used in this Appendix is the version dated July 31, 2001.

Upon completion of this simulation study and after working through the Type I error rate and statistical power patterns, I corresponded with Ramsay to follow up on the results I found. An outcome of this correspondence was that Ramsay found a computational error in TestGraf's computer code for the standard error of the beta statistic, and he released a revised version of TestGraf98 on December 20, 2002 in which the computation of the standard error of beta was corrected. The revised version of the software was used in the study reported in the main body of this dissertation.
This Appendix is written as a freestanding study, with more detail than would typically be found in an appendix, because some readers may be interested only in this study; therefore, some of the information repeats information found in the main body of this dissertation. It was decided to report this study in an Appendix because its conclusion was essentially to point out an error in a widely used software package. Although I consider this an important finding, I have not included it in the main body of my dissertation because I wanted to build on the error I found in TestGraf and return to the initial purpose of studying the operating characteristics of TestGraf's DIF detection test. As Ramsay noted, "Actually, it's an excellent project to research the SE of the DIF index as TestGraf actually produces it, and it sounds like your work has revealed what needed to be revealed, namely that there's something wrong" (personal communication, December 20, 2002).

¹ Author note: I would like to thank Professor Jim Ramsay for his encouragement in this project and for so promptly providing the corrected version of TestGraf98.

The study was originally designed to answer three broad research questions in the context of educational testing and sample size. They were as follows:

1. What was the Type I error rate of this statistical test of TestGraf?

2. What was the power of this statistical test of TestGraf in detecting DIF?

3. Was the standard deviation of the sampling distribution of the beta TestGraf DIF statistic the same as the standard error computed by TestGraf?

B-1. Methodology

The methodology of the study by Muñiz, Hambleton, and Xing (2001) is used by many DIF researchers. Although the present study adopted the sample size combinations used in their study, it differed in how the reference and focal groups were generated for the investigation of Type I error. It is important to make a methodological note about this significant difference between the present study and the previous study by Muñiz et al. (2001). As discussed in the literature review, to compare two or more DIF detection methods on their statistical properties, namely their Type I error rate and power of DIF detection, the Type I error and power estimates must be obtained from the same item(s). In their study, however, Muñiz et al. computed the Type I error rate from different items than those used to assess the power of detecting DIF. Because different item parameter values then underlie the two statistical properties, the Type I error rate is associated with items other than those used for power, and the results may be confounded for evaluation purposes. For example, imagine a test of 40 items in which 34 items are used to compute the Type I error rate and the remaining six items are used to compute the power of detecting DIF. Although this strategy is efficient in terms of computing time, the principles of experimental design applied to simulation experiments suggest that the experimental manipulation (Type I error and power) should be applied to the same items on which the simulation is run.
For this to be an appropriate experiment, the same items should be examined in both the no-DIF and DIF conditions. Unlike Muñiz et al., the present study was carried out as a complete factorial design in which the Type I error rate was explored with the same items that were used to assess the power of detecting DIF. More detail of the methodology of this study is presented below.

Description of DIF detection procedure. Ramsay's TestGraf nonparametric DIF detection method was applied, with the following manipulated factors, to detect the potential DIF items built in for the purposes of this study. The method displays its findings graphically. The two population groups that were compared in the simulation were set to have equal, normally distributed ability with M = 0 and SD = 1.0. The manipulated factors were:

1. The sample sizes of the reference and focal groups. There were five sample size combinations for the reference and focal groups: 50/50, 100/50, 200/50, 100/100, and 500/500.

2. The discrimination and difficulty parameters of the items in which DIF was embedded. Two item discrimination levels were used, low and high; each was crossed with three difficulty levels: easy, medium, and hard.

3. The amount of DIF in the studied items, which produced four conditions: no-DIF, small DIF, medium DIF, and large DIF.

Variables in the study.

Sample sizes. As previously mentioned, the study used the same sample size combinations as the small-sample study by Muñiz et al. (2001). Therefore, combinations of 50/50, 100/50, and 200/50 for the reference and focal groups, respectively, were analyzed. Furthermore, because a sample size of 100 per group is more realistic in practice, the study also included a 100/100 combination in which each of the two groups contained 100 examinees. In order to bridge the small-scale context of this study and the large-scale testing context of over a thousand examinees, the study also used a sample of 500 per group. As a result, the following sample size combinations were analyzed:

Reference group    Focal group
50                 50
100                50
200                50
100                100
500                500

Statistical characteristics of the studied test items with DIF. Previous research (Clauser, Mazor, & Hambleton, 1994) suggested that item statistics are related to the power of DIF detection; therefore, to study this point further, DIF was simulated in easy, medium-difficulty, and hard items, each with either low or high discriminating power. The a value is proportional to the slope of the item characteristic curve at its point of inflection and corresponds to the item's discriminating power, whereas the b value refers to the proficiency level (θ) at the point of inflection and represents the item difficulty parameter. When a b value is reported on the ability scale (scaled to M = 0 and SD = 1), it indicates the ability level at which an examinee has about a 60% chance of giving a correct answer to a five-option multiple-choice item. Higher b values indicate that a higher ability level is required to answer the item correctly; in other words, the higher the b, the harder the item. For the a parameter, values between zero and two are common when ability scores are scaled to an M of zero and an SD of one. The higher the a value, the steeper the item characteristic curve and the more discriminating the item. Two a values were used for the studied DIF items: 0.5 and 1.0.
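Because the 60% figure above depends on the lower asymptote introduced in the next paragraph, the following short derivation may help. It assumes the standard three-parameter logistic (3PL) item characteristic curve with scaling constant D = 1.7; the functional form is not written out in the text, so this is offered only as an illustration.

```latex
% Assumed standard 3PL item characteristic curve; D = 1.7 is a common scaling
% constant and c is the lower asymptote defined in the following paragraph.
\[ P(\theta) = c + \frac{1 - c}{1 + \exp\!\left[-D\,a\,(\theta - b)\right]} \]

% At the point of inflection, \theta = b, the exponent is zero, so
\[ P(b) = c + \frac{1 - c}{2} = \frac{1 + c}{2} \]

% With c = 0.20, as for a five-option multiple-choice item, P(b) = 0.60:
% an examinee at \theta = b has about a 60% chance of answering correctly.
```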
The third item characteristic in the three-parameter item response model is symbolized by c. The c value depicts the lower asymptote, corresponding to the probability that a person with essentially no ability will nevertheless give the right answer; this parameter is therefore called the pseudo-guessing parameter. The c value for the six DIF items was set at .17, which is fairly typical for a multiple-choice item with four or five options.

In each test of 40 items, DIF was built into the last six items. Item statistics for the 40 items are presented in Table B1, and those for the six studied items in Table B2. It should be noted that all four reference population groups were generated from the test without DIF.

Amount of DIF. The amount of DIF was introduced through differences in the b values of the DIF items between the two test sets used to generate the reference and focal population groups (see Table B2). Three levels of b-value difference were applied: 0.5 for small DIF, 1.0 for medium DIF, and 1.5 for large DIF. Based on the amount of DIF investigated, four conditions were generated: no-DIF, small DIF (b-value difference of 0.5), medium DIF (b-value difference of 1.0), and large DIF (b-value difference of 1.5).

Table B1
Item statistics for the 40 items

Item     a      b      c        Item     a      b      c
1       1.59   0.10   .19       21      1.23  -0.43   .10
2       0.60  -0.98   .20       22      0.73   1.13   .27
3       0.75  -0.42   .06       23      0.54  -1.91   .23
4       1.08   0.43   .24       24      0.71  -0.43   .31
5       0.81   0.34   .32       25      0.66  -0.67   .16
6       0.66  -0.57   .38       26      1.14   0.59   .18
7       0.81  -0.18   .20       27      1.12   0.29   .26
8       0.43  -0.36   .30       28      0.96  -0.26   .23
9       0.94   0.45   .34       29      0.95   0.13   .15
10      1.40   0.15   .07       30      1.38   0.66   .16
11      0.98  -0.20   .18       31      1.38   1.11   .16
12      1.28  -0.12   .23       32      0.42  -0.02   .20
13      1.18   0.18   .23       33      1.04  -0.01   .30
14      0.98  -0.63   .30       34      0.73   0.10   .13
15      0.94  -0.14   .17       35      0.50  -1.00   .17
16      1.39   0.94   .43       36      1.00  -1.00   .17
17      0.78   0.25   .16       37      0.50   0.00   .17
18      0.55  -0.82   .20       38      1.00   0.00   .17
19      0.88   0.09   .27       39      0.50   1.00   .17
20      1.10   0.14   .40       40      1.00   1.00   .17

Table B2
Item statistics for the six DIF items

        Item difficulty level (b)
        Reference    Focal group
Item    group        No-DIF    Small-DIF    Medium-DIF    Large-DIF
35      -1.00        -1.00     -0.50         0.00          0.50
36      -1.00        -1.00     -0.50         0.00          0.50
37       0.00         0.00      0.50         1.00          1.50
38       0.00         0.00      0.50         1.00          1.50
39       1.00         1.00      1.50         2.00          2.50
40       1.00         1.00      1.50         2.00          2.50

Data generation. The present study used the three-parameter logistic item-response modeling program MIRTGEN (Luecht, 1996) to generate data for the simulation populations. It should be noted that, although an item response theory framework was used to generate the data, the study was not about item response theory. The three-parameter logistic model was used because the studied test items differed in their difficulty and discrimination parameters and because the model also takes the pseudo-guessing parameter into account; the model therefore provided a convenient way to generate examinee item-response data for analysis.

Two population groups, reference and focal, were generated for each of the four conditions of DIF amount. Each population contained 500,000 examinees and corresponded to a test of 40 items.
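Data generation in this study was done with Luecht's MIRTGEN program. Purely as an illustration of what generating 3PL item responses involves, and not as a reproduction of that program, the sketch below draws dichotomous responses for a reference and a focal population using the six studied items from Tables B1 and B2. The function name, the NumPy implementation, and the D = 1.7 scaling constant are assumptions made for this example; a full run would of course use all 40 items.

```python
import numpy as np

def generate_3pl_responses(theta, a, b, c, rng):
    """Draw 0/1 responses under the 3PL model for every examinee-item pair.

    theta   : (n_examinees,) ability values
    a, b, c : (n_items,) discrimination, difficulty, and pseudo-guessing parameters
    """
    # Probability of a correct response (D = 1.7 assumed as the scaling constant).
    z = 1.7 * a[None, :] * (theta[:, None] - b[None, :])
    p = c[None, :] + (1.0 - c[None, :]) / (1.0 + np.exp(-z))
    return (rng.random(p.shape) < p).astype(int)

rng = np.random.default_rng(2003)

# The six studied items (Tables B1 and B2); only these are shown for brevity.
a = np.array([0.50, 1.00, 0.50, 1.00, 0.50, 1.00])
b_ref = np.array([-1.00, -1.00, 0.00, 0.00, 1.00, 1.00])
b_foc = b_ref + 1.0              # medium DIF: focal-group b values shifted by 1.0
c = np.full(6, 0.17)

# Equal, normally distributed N(0, 1) ability for both populations.
theta_ref = rng.standard_normal(500_000)
theta_foc = rng.standard_normal(500_000)

reference_data = generate_3pl_responses(theta_ref, a, b_ref, c, rng)
focal_data = generate_3pl_responses(theta_foc, a, b_foc, c, rng)
```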
Item statistics for the first 34 non-DIF items were randomly chosen from the mathematics test of the 1999 TIMSS (Third International Mathematics and Science Study) for grade eight, and item statistics for the last six items, which were set as containing DIF, were adopted from the DIF items used in the study by Muñiz et al. (2001). These six items were the focus of this study. For the focal population, the amount of DIF across the six DIF items (Items 35 to 40) was fixed in the same way as in the study being replicated, except for the no-DIF condition. Under no-DIF, used for assessing the Type I error rate, the item characteristics were set the same for both the reference and focal groups. The focal populations were considered to be disadvantaged; therefore, for the focal populations with DIF, the item difficulty parameter values were set higher (harder) than the difficulty levels of the test for the reference population. Small DIF corresponded to a b (item difficulty) value difference of 0.5, medium DIF to a difference of 1.0, and large DIF to a difference of 1.5. In the reference populations, all item difficulty parameters remained the same across all categories: the no-DIF, small DIF, medium DIF, and large DIF groups. The item difficulty parameter values of the six DIF items can be seen in Table B2.

For each combination of variables, 100 data sets were generated and analysed, making it possible to obtain the Type I error rate and power of DIF detection for any combination of variables from the 100 replication pairs. The data sets were therefore reflective of actual test data, except for the insertion of the six DIF items.

Procedure. The study followed a modified version of the procedure originally used by Muñiz et al. (2001), as follows:

1. Set a test of 40 items in which the item parameters of the first 34 items were drawn from the TIMSS data and those of the last six items were adopted from the study by Muñiz et al. In each given data set, the amount of DIF across the six DIF items was set at no-DIF (the b value was the same for both the reference and focal populations), small (b-value difference = 0.5), medium (b-value difference = 1.0), or large (b-value difference = 1.5). These item parameters are provided in Tables B1 and B2.

2. Set a normal and equal ability distribution, N(0, 1), for both groups.

3. Generated sample sizes for the reference and focal groups, as described earlier, in five combinations: 50/50, 100/50, 200/50, 100/100, and 500/500 examinees for the reference and focal groups, respectively.

4. Generated population groups of 500,000 examinees (simulees) for the reference and focal groups of each DIF condition.

5. Generated item-response data for the reference and focal groups for each condition: no-DIF, small DIF, medium DIF, and large DIF. (Steps 1 to 4 produced 20 different combinations of variables.) Each of these combinations was repeated 100 times.

6. Applied the TestGraf procedure to the examinee item-response data and compared each pair of reference and focal groups.

7. Repeated step 6 for 100 replications of each of the 20 combinations of variables in the study (5 sample size combinations x 4 DIF conditions), and determined the Type I error rate and power of DIF detection (a sketch of this tabulation appears after this list).
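As referenced in step 7, each replication yields a TestGraf beta (β) and its standard error for every item (from the .cmp output described in the next section), and a replication is counted as a rejection when |β / SE(β)| exceeds the two-tailed standard-normal critical value. The sketch below shows only this bookkeeping; the array names are hypothetical stand-ins for values read from the 100 replications of one design cell.

```python
import numpy as np

def rejection_rate(betas, ses, alpha=0.05):
    """Proportion of replications in which |beta / SE(beta)| exceeds the
    two-tailed standard normal critical value (1.96 at .05, 2.576 at .01)."""
    z_crit = {0.05: 1.96, 0.01: 2.576}[alpha]
    z = np.asarray(betas) / np.asarray(ses)
    return float(np.mean(np.abs(z) > z_crit))

# Under the no-DIF condition this rate estimates the Type I error rate;
# under the small/medium/large DIF conditions it estimates power.
# Illustration with made-up values for one item in one 100-replication cell:
rng = np.random.default_rng(0)
betas = rng.normal(0.0, 0.09, size=100)   # stand-ins for 100 TestGraf beta values
ses = np.full(100, 0.09)                  # stand-ins for the reported standard errors
type_i_error_estimate = rejection_rate(betas, ses, alpha=0.05)
```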
Because of the graphical interface and the need to check each replication for convergence as well as for an appropriate bin width, the seven steps of the simulation study were conducted without batch computing; instead, I conducted each DIF test and TestGraf run for each replication individually. Each of the 20 cells of the simulation design (with 100 replications per cell) required roughly 20 hours of computational time, resulting in 400 hours of computation.

Compute DIF statistics. When two or more groups were compared in the Compare step, TestGraf produced a file containing statistics summarizing the differences between the studied groups for each item. This file can be identified by its extension, .cmp (Ramsay, 2000, pp. 55-56). The file provided the amount of DIF, beta (β), and the standard error of beta for each item resulting from the comparison of the groups. From this output, the ratio of each β to its standard error, β / SE(β), was computed; the hypothesis was that the sampling distribution of this ratio would look like a standard normal Z-distribution. The Type I error rate and power of DIF detection were computed at nominal alpha levels of .05 (|Z| > 1.96) and .01 (|Z| > 2.576). Any significant result indicated a flagged DIF item: a Type I error (false positive) under the no-DIF condition, or a correct detection contributing to power under a DIF condition.

Data analysis of simulation results. For each of the six studied items and each sample size combination, the per-item Type I error rate of DIF detection was obtained by dividing the number of times the null hypothesis was rejected by 100. The mean and standard deviation of the β values over the 100 replications of each sample size combination were also computed. The following analyses were done to answer the three research questions:

1. Tabulated the Type I error rate.

2. Tabulated the power.

3. Computed the empirical standard error of β and compared it to the TestGraf standard error.

The design was a 2 (levels of item discrimination) x 3 (levels of item difficulty) x 5 (sample size combinations) design.

B-2. Results and Discussion

This section presents the results of the above simulation study, organized by the three research questions. A brief conclusion ends this Appendix.

Research Question 1: What was the Type I error rate of this statistical test of TestGraf?

Table B3 presents the Type I error rates obtained from the pre-December 20, 2002 version of TestGraf DIF detection. As can be seen, at a nominal alpha of .05 the old TestGraf produced Type I error rates of zero in almost all conditions; the exceptions were Item-36 and Item-37 at the 50/50 sample size combination, where the Type I error rate was .01. At a nominal alpha of .01, the Type I error rate was zero across all sample size combinations.

Table B3
Type I error rate of Old-TestGraf at nominal α = .05

            Low discrimination (a)              High discrimination (a)
            Low b     Medium b   High b         Low b     Medium b   High b
N1/N2       Item 35   Item 37    Item 39        Item 36   Item 38    Item 40
50/50       0         .01        0              .01       0          0
100/50      0         0          0              0         0          0
200/50      0         0          0              0         0          0
100/100     0         0          0              0         0          0
500/500     0         0          0              0         0          0

Research Question 2: What was the power of this statistical test of TestGraf in detecting DIF?

The probability of rejecting the hypothesis for the old TestGraf DIF detection, that is,
the power of detecting DIF items given the small amount of DIF (a 0.50 difference between the item difficulty levels), ranged from 0 to .07 for the 50/50 sample size combination, 0 to .09 for 100/50, 0 to .09 for 200/50, 0 to .07 for 100/100, and 0 to .58 for 500/500. All of the above values are at a nominal alpha of .05, as presented in Table B4. At a nominal alpha of .01, the DIF detection power values ranged from 0 to .01, 0 to .02, 0 to .02, all 0, and 0 to .16 for the respective sample size combinations (Table B5).

Table B4
Probability of rejecting the hypothesis for the Old-TestGraf DIF detection in SMALL-DIF at α = .05

            Low discrimination (a)              High discrimination (a)
            Low b     Medium b   High b         Low b     Medium b   High b
N1/N2       Item 35   Item 37    Item 39        Item 36   Item 38    Item 40
50/50       !bl       m          m              o         m          o
100/50      .02       .09        .02            0         .01        0
200/50      .03       .09        .01            0         .02        0
100/100     .01       .07        .01            0         0          0
500/500     .28       .58        .07            .08       .12        0

Table B5
Probability of rejecting the hypothesis for the Old-TestGraf DIF detection in SMALL-DIF at α = .01

            Low discrimination (a)              High discrimination (a)
            Low b     Medium b   High b         Low b     Medium b   High b
N1/N2       Item 35   Item 37    Item 39        Item 36   Item 38    Item 40
50/50       .01       0          0              0         .01        0
100/50      .01       .02        0              0         0          0
200/50      .01       .02        .01            0         0          0
100/100     0         0          0              0         0          0
500/500     0         .16        0              .01       0          0

When the difference in item difficulty level was set at 1.00, representing a medium amount of DIF, the probability of rejecting the hypothesis (the power of DIF detection) ranged from 0 to .38, .01 to .46, .07 to .49, 0 to .56, and .57 to 1.00 for the sample size combinations of 50/50, 100/50, 200/50, 100/100, and 500/500, respectively, at a nominal alpha of .05. These findings can be seen in Table B6. At a nominal alpha of .01, the power of detecting DIF items ranged from 0 to .08, 0 to .14, 0 to .13, 0 to .24, and .08 to 1.00, respectively (Table B7).

Table B6
Probability of rejecting the hypothesis for the Old-TestGraf DIF detection in MEDIUM-DIF at α = .05

            Low discrimination (a)              High discrimination (a)
            Low b     Medium b   High b         Low b     Medium b   High b
N1/N2       Item 35   Item 37    Item 39        Item 36   Item 38    Item 40
50/50       .15       .38        0              .06       .16        .01
100/50      .33       .46        .06            .12       .16        .01
200/50      .39       .49        .07            .16       .24        .07
100/100     .56       .52        .05            .22       .23        0
500/500     1.00      .96        .73            .99       .97        .57

Table B7
Probability of rejecting the hypothesis for the Old-TestGraf DIF detection in MEDIUM-DIF at α = .01

            Low discrimination (a)              High discrimination (a)
            Low b     Medium b   High b         Low b     Medium b   High b
N1/N2       Item 35   Item 37    Item 39        Item 36   Item 38    Item 40
50/50       .02       .08        0              .01       .06        0
100/50      .05       .14        .01            .03       .01        0
200/50      .13       .02        0              .04       .06        0
100/100     .16       .24        0              .01       .03        0
500/500     1.00      .96        .22            .86       .78        .08

Tables B8 and B9 present the probability of rejecting the hypothesis when the difference in item difficulty level was set at 1.50, representing a large amount of DIF. At a nominal alpha of .05, the probability ranged from .01 to .67 for the 50/50 sample size combination; for the 100/50, 200/50, 100/100, and 500/500 combinations, the power of DIF detection ranged from .06 to .93, .11 to .93, .01 to .98, and .94 to 1.00, respectively. At a nominal alpha of .01, the power of detecting a large amount of DIF ranged from 0 to .35, 0 to .64, .01 to .71, .01 to .78, and .67 to 1.00 under the five sample size conditions, respectively.
Table B8
Probability of rejecting the hypothesis for the Old-TestGraf DIF detection in LARGE-DIF at α = .05

            Low discrimination (a)              High discrimination (a)
            Low b     Medium b   High b         Low b     Medium b   High b
N1/N2       Item 35   Item 37    Item 39        Item 36   Item 38    Item 40
50/50       .67       .64        .05            .44       .36        .01
100/50      .79       .93        .06            .56       .63        .07
200/50      .87       .93        .11            .72       .69        .18
100/100     .98       .96        .01            .75       .73        .14
500/500     1.00      1.00       .94            1.00      1.00       .95

Table B9
Probability of rejecting the hypothesis for the Old-TestGraf DIF detection in LARGE-DIF at α = .01

            Low discrimination (a)              High discrimination (a)
            Low b     Medium b   High b         Low b     Medium b   High b
N1/N2       Item 35   Item 37    Item 39        Item 36   Item 38    Item 40
50/50       .34       .35        0              .17       .16        0
100/50      .43       .64        0              .25       .26        0
200/50      .59       .71        .01            .32       .32        .01
100/100     .78       .08        .01            .42       .39        .02
500/500     1.00      1.00       .67            1.00      1.00       .71

Research Question 3: Was the standard deviation of the sampling distribution of the beta TestGraf DIF statistic the same as the standard error computed by TestGraf?

Table B10 presents the SD of the DIF β values and the average of the TestGraf SE values over 100 replications for each of the six studied items across the five sample size combinations. Although the two quantities should reflect the same statistical characteristic of the TestGraf DIF statistic, the study found that the TestGraf SE values were substantially higher than their empirical counterparts. As a consequence of the inflated SE produced by TestGraf, the Type I error rate and power of TestGraf in detecting DIF items were lower than expected.

Table B10
Standard deviations and standard errors of the Old-TestGraf DIF detection (over 100 replications)

                                         N1/N2
Item                                     50/50    100/50   200/50   100/100   500/500
35    SD of beta (empirical SE)          0.088    0.079    0.083    0.060     0.027
      M of the TestGraf SE               0.171    0.158    0.152    0.128     0.067
36    SD of beta (empirical SE)          0.095    0.081    0.076    0.060     0.030
      M of the TestGraf SE               0.171    0.157    0.149    0.126     0.064
37    SD of beta (empirical SE)          0.097    0.085    0.082    0.061     0.026
      M of the TestGraf SE               0.175    0.161    0.154    0.129     0.066
38    SD of beta (empirical SE)          0.097    0.081    0.080    0.064     0.026
      M of the TestGraf SE               0.168    0.153    0.146    0.123     0.060
39    SD of beta (empirical SE)          0.062    0.063    0.060    0.046     0.019
      M of the TestGraf SE               0.147    0.130    0.120    0.109     0.057
40    SD of beta (empirical SE)          0.071    0.075    0.057    0.051     0.021
      M of the TestGraf SE               0.146    0.132    0.124    0.111     0.057

Note: N = sample size, M = mean, SD = standard deviation, SE = standard error.

Summary. The Type I error rate was nearly all zero at a nominal alpha of .05 and all zero at a nominal alpha of .01 across all sample size combinations. Although such results are possible with 100 replications, Type I error rates of zero everywhere are implausibly low. The same pattern carried over to the power of TestGraf DIF detection: the average of the reported standard error over 100 replications was far larger than the empirical standard error, so the test statistics produced by TestGraf were too small and the resulting test was overly conservative. As expected, power increased as sample size increased. The operating characteristics of TestGraf DIF detection therefore indicate a very conservative test. Furthermore, the version of TestGraf DIF detection prior to December 20, 2002 produced average SE values that were larger than the empirical SE of the sampling distribution, that is, the SD of the DIF β values.

B-3. Conclusion

Anyone who has computed a hypothesis test with a version of TestGraf released prior to December 20, 2002 will have had a Type I error rate that was very low and power that was very low as well.
The former is not a problem unless it is accompanied by the latter. It should be noted that, by contrast, the Type I error rate reported in the main body of this dissertation, which used the corrected version of TestGraf, was inflated throughout.
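To make the check behind Research Question 3 and Table B10 concrete, the comparison amounts to a few lines: the empirical standard error is the standard deviation of the β values over the 100 replications, which is set against the mean of the standard errors TestGraf reported for those same replications. The sketch below is an illustration of that analysis, not TestGraf code, and the array names are hypothetical.

```python
import numpy as np

def se_check(betas, reported_ses):
    """Compare the empirical SE of beta (the SD over replications) with the
    mean of the SE values reported by TestGraf for the same replications."""
    empirical_se = float(np.std(betas, ddof=1))
    mean_reported_se = float(np.mean(reported_ses))
    return empirical_se, mean_reported_se, mean_reported_se / empirical_se

# For Item 35 at the 50/50 combination (Table B10), the empirical SE was 0.088
# while the mean reported SE was 0.171, a ratio of roughly 1.9, which is why the
# pre-correction hypothesis test was so conservative.
```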