SCORE SCALE COMPARABILITY IN INTERNATIONAL EDUCATIONAL ASSESSMENTS

by

Debra Anne Sandilands

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF ARTS in The Faculty of Graduate Studies (Measurement, Evaluation and Research Methodology)

University of British Columbia (Vancouver)

December 2008

© Debra Anne Sandilands, 2008

ABSTRACT

Many countries, including Canada, are increasingly using international educational assessments to make comparisons of achievement across countries and to make important decisions regarding issues such as educational policy and curriculum. Most large-scale assessments have different forms that are adapted and/or translated for use across multiple language and cultural groups. Equivalence and fairness for examinees of all groups must be established in order to support valid score comparisons across groups and validity of decisions made based on these assessments. This study investigated the degree of score comparability in the Reader booklet of the Progress in International Reading Literacy Study (PIRLS) 2001 at three levels of scores. At the item level, differential item functioning (DIF) analyses were conducted using Ordinal Logistic Regression and Poly-SIBTEST. DIF items were grouped into bundles and analyzed for differential bundle functioning (DBF) using Poly-SIBTEST. Differences in item response theory-based test characteristic curves (TCCs) were analyzed to investigate comparability at the scale level. The study focussed on four countries: Argentina, Colombia, England and USA. The results of this study confirm previous studies that demonstrate a large degree of DIF in international educational assessments. Results also indicate a high degree of similarity between the two DIF methods used in identifying DIF items, but fail to support the correspondence between their effect size measures. This study expands on the research base regarding DBF and demonstrates a two-stage approach to identifying potential causes of differential functioning. Results of DBF analyses indicate that cognitive levels tapped by reading comprehension questions may represent a source of bias leading to differential functioning in the Reader booklet. This study also contributes preliminary evidence for the possibility that the use of international item parameters to create individual country scores may provide a relative advantage to some countries due to the locations of their score distributions, which may have implications regarding current score scale creation methods.

TABLE OF CONTENTS

ABSTRACT
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ACRONYMS
ACKNOWLEDGEMENTS
CHAPTER 1: INTRODUCTION
    Overview
    Bias and Factors that Affect Score Scale Comparability
    Current Scaling Methods in Large-scale Assessment
    Purpose of Study and Research Questions
CHAPTER II: LITERATURE REVIEW
    Large-scale Assessment in the Canadian Context
    Statistical Methods for Establishing Score Comparability
        Establishing Construct Comparability: Analysis of Factor Structures
        Establishing Item Level Equivalence: Differential Item Functioning Analyses
            Ordinal Logistic Regression
            Poly-SIBTEST
        Establishing Bundle Level Equivalence: Differential Bundle Functioning Analyses
        Establishing Test Level Equivalence: Comparison of TCCs
    Summary
CHAPTER III: METHOD
    Overview
    Measure
    Data
    Analyses
        Item Level Analyses
            Ordinal Logistic Regression
            Poly-SIBTEST
        Bundle Level Analysis
        Scale Level Analyses
        Model-Data Fit
        Factor Structures
    Summary
CHAPTER IV: RESULTS
    Descriptive Statistics of the PIRLS 2001 Reader Booklet Data
    Item Level Analyses
        Ordinal Logistic Regression
        Poly-SIBTEST
        Comparison of OLR and Poly-SIBTEST DIF Results
    Bundle Level Analysis
    Scale Level Analyses
    Model-Data Fit
    Factor Structures
    Summary
CHAPTER V: DISCUSSION
    Discussion of Findings
        Research Question #1: Differential Item Functioning Analyses
        Comparison of Results from Poly-SIBTEST and Ordinal Logistic Regression
        Research Question #2: Differential Bundle Functioning Analyses
        Research Question #3: Differential Test Functioning and TCC Comparisons
        Research Question #4: Why do large degrees of differential functioning in large-scale assessments at the item level result in only small degrees of differences at the test level?
    Implications, Contribution to Research and Further Research Suggestions
    Limitations of this Study
    Summary
REFERENCES
APPENDIX A: Reader booklet

LIST OF TABLES

Table 1: PIRLS 2001 Aspects of Reading
Table 2: PIRLS 2001 Distribution of Literary and Informational Blocks Across Booklets
Table 3: PIRLS 2001 Sample Information and Overall Average Scale Scores
Table 4: PIRLS 2001 Reader Booklet Sample Sizes
Table 5: PIRLS 2001 Reader Booklet Items
Table 6: PIRLS 2001 Reader Booklet Descriptive Statistics
Table 7: Ordinal Logistic Regression DIF Results
Table 8: Poly-SIBTEST DIF Results
Table 9: Number of DIF items identified by effect size by method
Table 10: Items Identified as DIF by Ordinal Logistic Regression and Poly-SIBTEST
Table 11: Differential Bundle Functioning Analyses
Table 12: Average original and transformed discrimination and location parameters
Table 13: Correlation of a and b parameters for each country with the international parameters
Table 14: Descriptive statistics of differences between country TCCs and international TCCs
Table 15: Q1 goodness-of-fit results
Table 16: PCA eigenvalues and variance explained for each factor

LIST OF FIGURES

Figure 1. Argentina and Colombia Score Distributions
Figure 2. England and USA Score Distributions
Figure 3. TCC Comparison for Argentina
Figure 4. TCC Comparison for Colombia
Figure 5. TCC Comparison for England
Figure 6. TCC Comparison for USA
Figure 7. Argentina and Colombia, Differences in TCCs based on international versus country parameters
Figure 8. England and USA, Differences in TCCs based on international versus country parameters
Figure 9. Scree plots for Reader booklet data
LIST OF ACRONYMS

1PL  One parameter logistic model
2PL  Two parameter logistic model
2PPC  Two parameter partial credit model
3PL  Three parameter logistic model
AERA  American Educational Research Association
APA  American Psychological Association
DBF  Differential bundle functioning
DFIT  Differential Fit of Items and Tests
DIF  Differential item functioning
DTF  Differential test functioning
EFA  Exploratory factor analysis
GMH  Generalized Mantel-Haenszel
GPCM  Generalized Partial Credit Model
GRM  Graded Response Model
ICC  Item characteristic curve
IEA  International Association for the Evaluation of Educational Achievement
IRT  Item response theory
ITC  International Test Commission
LDFA  Logistic discriminant function analysis
MH  Mantel-Haenszel
MMD  Multi-dimensional model for DIF
NCME  National Council on Measurement in Education
LR  Logistic Regression
OEL  Ordering of the expected latent trait
OLR  Ordinal logistic regression
PIRLS  Progress in International Reading Literacy Study
PISA  Program for International Student Assessments
Poly-SIBTEST  Polytomous SIBTEST
SIBTEST  Simultaneous item bias test
SMD  Standardized Mean Difference
TCC  Test characteristic curve
TIMSS  Trends in International Mathematics and Science Study

ACKNOWLEDGEMENTS

I would like to express sincere gratitude to my thesis supervisor, Dr. Kadriye Ercikan, for her invaluable guidance over the past few years. I want to thank her not only for her thesis advice, but also for including me in many concurrent projects which have greatly broadened my experience and knowledge. I have benefitted tremendously from the highest quality of graduate education under Dr. Ercikan's mentorship. I am deeply indebted to my committee members, Dr. Nancy Perry, Dr. Jenna Shapka, and Dr. Marion Porath for their constructive feedback and assistance. I am especially grateful to each of them for their willingness to work on a very tight time frame to enable me to achieve this goal. My dear friend and fellow student, Malena Oliveri, has been an amazing source of inspiration and support to me along this journey. What started as a casual friendship in a class on item response theory has become one of the more meaningful friendships I have had the privilege of enjoying. I also want to thank my family for their confidence in me and their never-ending support. My daughters, Kate and Stephanie, and my husband, Andy, have been pillars of strength for me to lean on when needed. Their love has lifted me and carried me along.

CHAPTER 1: INTRODUCTION

Overview

Many countries, including Canada, are increasingly using international, large-scale educational assessments to address issues such as achievement and accountability (Crundwell, 2005). Currently, assessments such as the Progress in International Reading Literacy Study (PIRLS), the Program for International Student Assessments (PISA) and the Trends in International Mathematics and Science Study (TIMSS) are administered in as many as 57 countries simultaneously in dozens of different languages. These international assessments are used to make comparisons of educational achievement across countries or regions, and to make decisions regarding educational policy, curriculum, program development and evaluation.
Since their uses and inferences are concerned with making comparisons, issues regarding appropriate and justifiable comparability of results from these assessments are critical to consider. According to the Standards for Educational and Psychological Testing (American Educational Research Association (AERA), American Psychological Association (APA), & National Council on Measurement in Education (NCME), 1999) (the “Standards”), comparisons across groups are only meaningful if scores are comparable across groups. Establishing score comparability is therefore critical to establishing validity of group comparisons in international assessments. The need to establish comparability increases as the stakes resulting from the assessments increase. When used for measuring systems for policy and accountability purposes, large-scale international assessments are high-stakes  2  assessments (dePascale, n.d.). Therefore it is particularly important to establish that the scores for different groups of test takers are comparable. International assessments typically involve testing a wide variety of examinees with one assessment tool that is intended to be comparable for all groups of examinees. Most large-scale assessments are constructed to have different but comparable forms, which can be used across multiple language and cultural groups. They cannot be assumed to be equivalent and free of bias for all examinees of different cultural, ethnic, language and curricular backgrounds. This equivalence and fairness must be established through the use of quantitative empirical methods. Establishing equal validity of score interpretations for all examinees is a complicated and essential task for test developers. Both the International Test Commission Guidelines (International Test Commission, 2000) (the “ITC Guidelines”) and the Standards (AERA, APA & NCME, 1999) place a responsibility on test developers to provide evidence that all language versions of tests are equivalent in order to support score comparability. However, currently, test developers are not providing this critical evidence to support score comparability. Equivalence of test scores across test versions requires that the different test versions measure the same construct with the same level of measurement accuracy. Much research has indicated that these assumptions are not tenable in large-scale assessments. If this is the case, test developers should use “comparable scales” which result in scalar equivalence (Cook, 2006) rather than  3  using equated scores. For scalar equivalence, scores must have the same measurement unit and origin in all populations and the measurement must be free of bias (van de Vijver & Tanzer, 2004). There are many sources of bias that jeopardize comparability in large-scale international assessments such as Progress in International Reading Literacy Study 2001 (“PIRLS 2001 “). They must be investigated as a prerequisite to determining scalar equivalence. The next section will provide a more comprehensive definition of bias and a discussion of the factors that have been shown to introduce bias and reduce score scale comparability. It begins with the presentation of three conceptualizations of bias and its sources, and ends by noting the common features between these interpretations, which will encapsulate the definition of bias taken throughout this current research. 
Bias and Factors that Affect Score Scale Comparability In the Standards (AERA, APA & NCME, 1999), bias is conceptualized as group differences that result from deficiencies in a test itself or the manner in which it is used rather than true differences in ability on the construct being measured, where a construct is “the concept or characteristic that a test is designed to measure.” The Standards define bias as construct underrepresentation or construct-irrelevant components of test scores that differentially affect the performance of different groups of test takers. There are two general sources of bias discussed in the Standards: bias associated with test content and bias associated with response processes.  4  Bias related to test content can be introduced when there is an inappropriate sampling of test content, for example, when the content of the test does not adequately cover the intended construct, is not relevant to the intended construct, or is not aligned with the curricula for all groups. Occasionally, testing material may be differentially interesting or motivating to different groups, which results in bias related to the test content. Response-related sources of test bias include the use of scoring criteria that may not adequately represent different response approaches taken by different groups. Test items can elicit responses other than those predicted or intended by test developers. The use of particular response formats may cause impediments for some groups or advantages for others depending on familiarity or differential abilities with the response format. Bias has also been conceptualized as one of two broad aspects of fairness, one of which focuses on the validity of an assessment. Bond, Moss, and Carr (1996) define bias as an interpretation or action based on a test that is not equally valid for all relevant subgroups. In other words, bias and validity share as their point of reference the interpretations, uses or decisions that flow from an assessment, rather than the assessment itself. Bond et al. characterize bias in two ways, one being the situation where scores systematically under-estimate the performance of one group on the construct intended to be assessed. The other occurs when an assessment is intended to measure the same construct across the groups, but actually measures different constructs in the groups. Sources of bias can be internal to the assessment, such as bias that arises from the specification of the content framework, the training of scorers, the development of the tasks and  5  exercises being assessed, and the manner in which they are scored. External sources of bias can arise when the behaviours elicited from the test are not those the test is intended to predict or when the conditions of testing are not related to those under which the construct would normally be exhibited. Van de Vijver and Tanzer (2004) also provide a thorough discussion of bias and the sources of bias which affect scalar equivalence. Their conceptualization of bias and its sources is summarized in the following paragraphs. Bias refers to nuisance factors which are present in an assessment that lead to score differences for one group which do not reflect or correspond to differences in the underlying trait which is being measured. Equivalence is defined as a lack of bias, and is required for valid comparisons across cultural or other groups. Three kinds of bias that affect scalar equivalence are identified in the van de Vijver and Tanzer (2004) overview. 
The first type of bias distinguished is construct bias. Construct bias occurs when the construct being measured is not the same for all groups or when the behaviours associated with the construct are differentially appropriate in the groups. Method bias arises from the administration of the instrument, or from the instrument itself. Method bias has significant consequences on validity as it leads to shifts in average scores and affects differences observed in scores at the test level. Method bias is categorized into three types: (1) sample bias, which leads to incomparability of samples due to aspects not related to the target variable such as  6  educational background or motivation levels of examinees; (2) instrument bias, which refers to the fact that different groups respond in different manners to the form of an assessment or to the type of response required by the assessment; and (3) administration bias, which arises from problems or differences across administration of the assessment which compromise the collection of data. The third source of bias, which affects scalar equivalence noted by van de Vijver and Tanzer (2004) is termed item bias and is the only one of the three types of bias which occurs at the item level, that is, at the level of the individual question. Biased items have different meaning across groups. Item bias is one form of differential item functioning (DIF), which occurs when persons from different groups, matched on ability, do not have the same probability of correctly responding to, or endorsing, an item (Hambleton, Swaminathan, & Rogers, 1991). In other words, they will not achieve the same expected score on an item. Item bias is differentiated from item impact, which occurs when two groups have a different probability of correctly responding to an item as a result of true ability differences between the groups on the construct of interest. While item impact is related to group characteristics, item bias is a result of an interaction between group characteristics and item characteristics. The finding of empirical evidence of DIF is necessary but not sufficient to conclude bias is present. Once an item has been determined to exhibit DIF, it is necessary to use judgment to decide whether the item is biased. While slight differences can be seen in the interpretations of bias presented above, it is also clear that there are common foundations or overarching themes  7  among them. First, bias is akin to “differential invalidity” related to group differences in scores and in score interpretations that are not related to the construct or facet of ability intended to be measured. In other words, when bias exists, student scores are not only a function of their ability on the construct of interest; in addition, there is some degree of variance contributing to scores that is not accounted for by ability and this results in a lack of score comparability on the construct of interest. Second, bias can be introduced into an assessment at all stages, beginning with the test development stage where the construct, content, curriculum, format, translation and adaptation, response processes and scoring schemes are critical to consider for each potential subgroup in the population. At the administration stage, bias can be introduced through sampling methodology and inconsistencies in administration that compromise the collection of data. At the scoring stage, bias can result from inconsistencies in training of scorers or in scorers’ applications of scoring procedures. 
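Before turning to scaling methods, the distinction drawn above between item impact and item bias can be made concrete with a small simulation. This is an illustrative sketch only; the item and group parameters below are hypothetical and are not drawn from any assessment discussed in this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

def p_correct(theta, b):
    """Rasch-type item: probability of a correct response given ability theta
    and item difficulty b."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

# Item impact: the two groups genuinely differ in ability, but the item
# behaves identically for both groups (same difficulty, b = 0).
theta_ref = rng.normal(0.5, 1.0, n)   # reference group is truly more able
theta_foc = rng.normal(0.0, 1.0, n)
impact_gap = p_correct(theta_ref, b=0.0).mean() - p_correct(theta_foc, b=0.0).mean()

# Item bias / DIF: the groups have identical ability distributions, but the
# item is harder for the focal group (e.g., a translation problem shifts b).
theta_ref = rng.normal(0.0, 1.0, n)
theta_foc = rng.normal(0.0, 1.0, n)
dif_gap = p_correct(theta_ref, b=0.0).mean() - p_correct(theta_foc, b=0.5).mean()

print(f"Score gap from impact alone: {impact_gap:.3f}")
print(f"Score gap from DIF alone:    {dif_gap:.3f}")
# The two raw score gaps look similar; only the second one remains after
# matching examinees on ability, which is what DIF procedures condition on.
```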
Current Scaling Methods in Large-scale Assessment

In most large-scale assessments, current scaling methods use Item Response Theory ("IRT") to create item parameters. In the 3-Parameter Logistic IRT Model ("3PL model"), three item parameters (item difficulty, discrimination, and pseudo-guessing) are estimated in addition to the ability or latent trait of an examinee. According to IRT, the relationship between examinees' item performance and the set of traits underlying item performance can be described by a monotonically increasing function called an Item Characteristic Curve ("ICC"), which specifies that as the level of the trait (i.e., a latent trait or ability) increases, the probability of a correct response to an item increases (Hambleton et al., 1991). The sum of all the item characteristic curves in a test gives a Test Characteristic Curve ("TCC") for a given assessment.

Current scaling methods in international assessments administered in multiple languages are based on an international sample which is created using samples of data from all countries, with the country samples being weighted so that each country contributes equally to the overall international sample. The international sample is intended to be representative of all of the individual countries. Score scale comparability is directly related to the appropriateness of the international parameters for individual countries. The appropriateness of item parameters derived from an international sample for individual countries is necessary but not sufficient for score creation methods that support comparability. International item parameters must fit the country data well in order to create score scales and make comparisons across countries (Ercikan & Sandilands, in press; Ercikan & Gonzalez, 2008). If the international item parameters do not fit country data well, then TCCs based on international calibration and those based on country calibrations would be expected to have differences and lead to score incomparability. Since poor item fit can accumulate across items, it is possible for overall poor fit to be hidden by the fact that some items may favor a country while others disfavor it, thereby cancelling each other out. In this case there would be no apparent difference at the test level when analyzed using international parameters or using individual country parameters. It might still be possible for scalar incomparability to exist without being evident in TCCs or in score distribution comparisons.

At the item level, previous research has indicated that there is a large degree of DIF in large-scale assessments that have been translated. Ercikan, Gierl, McCreith, Puhan, and Koh (2004) studied bilingual versions of Canada's national achievement tests. Their results point to substantial psychometric differences between language versions at the item level, with up to 36% of the items being identified as DIF for the two language groups. Sireci and Allalouf (2003) found that 34% of the items on Israel's Psychometric Entrance Test exhibited DIF between the Russian and Hebrew versions.

Despite large amounts of DIF in large-scale assessments, preliminary research in comparability at the test level seems to indicate that few to no differences are seen at the test level. In a study of the potential for manifestation of item level DIF at the test (scale) level, Zumbo (2003) used both exploratory and confirmatory factor analysis as well as an analysis of reliability.
He found that the presence of DIF was not necessarily reflected at the whole test level. Ercikan and Sandilands (in press) investigated the TIMSS 2003 grade 8 mathematics assessment and found that there may be only a small (i.e., 5% of maximum score) difference between the international and country TCCs. Ercikan and Gonzalez (2008) conducted a study in which they compared country TCCs to the international TCC using the PIRLS 2001 data. Their study also indicated that there was very little difference between them, with the largest score scale difference being 5% of the maximum possible score. These studies and others  10  (Drasgow, 1987; Pae, 2004) seem to lend support to the use of international parameters to create country scores. However, while small, a 5% difference might have a significant impact if many examinees are found in the score ranges where the differences are seen. In order to gain an understanding of this apparent discrepancy between comparability at the item level (large amounts of DIF items found) and at the test level (small differences in test characteristic curves) in large-scale assessments, the current research will extend the investigation of comparability by examining DIF amplification or cancellation at the bundle level. A bundle is a group of test items that is thought to share, and measure, a common factor or dimension that may be a source of differential functioning. Statistical methods for identifying bundles of items as sources of differential functioning are more powerful than methods for identifying single items (Gierl, Bisanz, Bisanz & Boughton, 2001). Nandakumar (1993) demonstrated that the cumulative effect of DIF expressed through a bundle of items could lead to amplification, partial cancellation, or complete cancellation of DIF at the test level. She argued that even when a test contains individual items that function differently across subgroups, the test as a whole may not, due to partial or complete cancellation of DIF. The current research will examine bundles of items to explore the possibility of DIF cancellation being responsible for the apparent inconsistency between results of item level analyses and test level analyses in large-scale assessment.  11  Purpose of Study and Research Questions The focus of this study is score scale comparability in international assessments. Specifically, using data from the PIRLS 2001 “Reader Booklet”, this study investigated whether there is equivalence between score scales for countries at the item level (DIF), at the bundle level (differential bundle functioning, DBF), and at the test level (differential test functioning, DTF). In addition, this study investigated whether different results are obtained when using different statistical methods to examine comparability. At the item level, the PIRLS data was examined for DIF using two methods: Ordinal Logistic Regression (Zumbo 1999) and Poly-SIBTEST (Chang, Mazzeo, & Roussos, 1996; Shealy & Stout, 1993). Since Poly-SIBTEST can also be used to detect differential functioning at the bundle level, it was employed to examine potential differential bundle functioning (DBF). At the test level, a comparison of test characteristic curves resulting from the use of international parameters versus individual country parameters was conducted. 
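The test-level comparison just described amounts to summing item characteristic curves under two different parameter sets and examining the gap between the resulting curves. A minimal sketch of that computation under the 3PL model is given below; the parameter values are hypothetical placeholders rather than PIRLS estimates, and the operational analyses involve additional steps (such as placing the two calibrations on a common scale) that are omitted here.

```python
import numpy as np

def icc_3pl(theta, a, b, c):
    """Three-parameter logistic item characteristic curve."""
    return c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta - b)))

def tcc(theta, params):
    """Test characteristic curve: expected total score at each theta,
    obtained by summing the ICCs of all items."""
    return sum(icc_3pl(theta, a, b, c) for a, b, c in params)

theta = np.linspace(-4, 4, 161)

# Hypothetical item parameters, one (a, b, c) triple per item.
international = [(1.0, -0.5, 0.2), (0.8, 0.0, 0.2), (1.2, 0.6, 0.2)]
country       = [(1.1, -0.4, 0.2), (0.7, 0.3, 0.2), (1.2, 0.5, 0.2)]

gap = tcc(theta, international) - tcc(theta, country)
max_score = len(international)  # each item worth 1 point in this toy example

# Express the largest discrepancy as a percentage of the maximum score,
# the metric used when TCC differences are reported in this line of research.
print(f"Largest TCC difference: {100 * np.abs(gap).max() / max_score:.1f}% of maximum score")
```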
This study leads to a greater understanding of the issues in comparability and validity in large-scale assessment, specifically about the methodology of how scores and scales are created, and whether or not current methods support the uses, inferences, and comparison practices among the countries that participate in international assessment. In summary, this research seeks to provide answers to the following four research questions:

1. What is the degree of DIF in the PIRLS data? What are similarities in DIF detection by OLR and Poly-SIBTEST?
2. At the bundle level, are there any bundles of items in the PIRLS Reader booklet that may result in DIF amplification or cancellation?
3. At the test level, what are similarities or differences in test characteristic curves based on the international parameters versus those based on individual country parameters? Are the international parameters appropriate for creating country scores?
4. Why do large degrees of differential functioning in large-scale assessments at the item level result in only small degrees of differences at the test level?

CHAPTER II: LITERATURE REVIEW

This chapter provides a review of current uses of large-scale assessments, particularly in the Canadian context, thereby affirming the need to address issues of validity, especially those related to establishing score comparability. Requirements placed on test developers to provide evidence of score comparability by the Standards (AERA, APA & NCME, 1999) and ITC Guidelines (ITC, 2000) are briefly reviewed. The remaining sections give a review of the types of evidence that should be provided in order to support score comparability, as well as a review of and justification for the use of the statistical methods for establishing score comparability.

Large-scale Assessment in the Canadian Context

Most Canadian provinces and territories conduct some form of large-scale assessment and use the results for holding educational systems accountable and for evaluating the effectiveness of school programs and systems (Volante, 2006). The British Columbia Ministry of Education uses large-scale provincial, national and international assessments to obtain information for the purposes of public accountability, to improve programs and schools, determine students' knowledge and skills, and inform decisions to be made at the classroom, district and provincial levels for improving student achievement (British Columbia Ministry of Education, 2004).

In the Province of Ontario, the Education Quality and Accountability Office, an agency of the Ontario provincial government whose mandate is to provide parents, teachers and the public with accurate and reliable information about student achievement, reports that the findings from large-scale assessments are used for setting educational priorities and for improvement planning, as well as to measure Ontario students' achievement against national and international benchmarks (Education Quality and Accountability Office, 2008).

One large-scale assessment used in Canada at the international level is the Progress in International Reading Literacy Study (PIRLS). The Canadian Provinces of Ontario and Quebec took part in the 2001 administration of PIRLS. In the 2006 administration, Alberta, British Columbia, Nova Scotia, Ontario and Quebec participated.
A 2007 news release from the British Columbia Ministry of Education regarding that province's results in the 2006 PIRLS assessment exemplifies the extent to which results from this assessment are used for policy and program planning in British Columbia:

"B.C. is a world leader in literacy and this international assessment proves that. ... The results we see here show that our ReadNow BC literacy strategy is working to make our province the most literate jurisdiction, not only in North America, but in the world. ... Since 2001, the government has invested over $125 million in literacy for British Columbians through the Province's literacy strategy, ReadNow BC." (British Columbia Ministry of Education, 2007)

This news release exemplifies that the Province of British Columbia has relied on a comparison of its PIRLS 2006 scores with those of other jurisdictions and countries, and interpreted a favourable comparison as justification for the financial commitment of over $125 million and for the continued existence of the ReadNow BC literacy strategy. A similar news release by Alberta's Ministry of Education revealed the use of PIRLS 2006 results in that province to make comparisons between itself and the rest of the world with the headline, "Alberta Grade 4 students achieved top three in the world in the latest international literacy tests". This was followed by statements attributing this success in PIRLS to the quality of teaching, resources, curriculum and programs in Alberta, and a justification for continued participation in international large-scale testing as a means of demonstrating the success of Alberta's students in relation to the rest of the world (Alberta Education, 2007).

Clearly, large-scale assessments including PIRLS are employed for comparisons and high-stakes decision making by the individual countries and jurisdictions that take part in them, as is manifested in the Canadian context. Yet, what evidence is there that such interpretations of score comparisons are meaningful and valid? What evidence should be provided? According to the Standards (AERA, APA & NCME, 1999), comparisons across groups are only meaningful if scores have comparable meaning across groups. Standard 9.9, in particular, requires that test developers report evidence of test comparability when different language versions are intended to be comparable, as is the case in large-scale assessments such as PIRLS. Standard 7.3 puts an onus on test developers to conduct research to detect and eliminate aspects of test design, content and format that might lead to DIF. The ITC Guidelines (ITC, 2000) caution against interpreting score differences between populations as though they represent true differences in the construct measured by the test, and advise that interpretations be limited to those for which satisfactory validity evidence has been presented. Specifically, ITC Guidelines D.2, D.6, D.7, D.8, and D.9 place an onus on test developers and publishers to apply appropriate statistical techniques to establish equivalence, and to provide validity evidence of adapted test versions for all intended populations. In short, both the Standards and the ITC Guidelines require that the validity of interpretations of assessment results for all relevant groups of examinees be established through the use of empirical and qualitative methods. The focus of the current study is on statistical methods for establishing score comparability.
Statistical Methods for Establishing Score Comparability

There are many factors in large-scale assessments that can contribute to bias and threaten score equivalence. It is generally accepted that international large-scale assessments are not bias-free as a result of the adaptation and translation processes that can introduce bias and cause difficulties in comparability and equivalence (Allalouf, Hambleton, & Sireci, 1999; Ercikan et al., 1998; Ercikan, 2002; Gierl & Khaliq, 2001; Hambleton, 1993). Any analysis of score equivalence therefore must assess for bias, and weigh the evidence as to the degree of score comparability that exists.

The choice of statistical methods for establishing score comparability depends in part on the scoring schemes of the items being analyzed. The PIRLS 2001 Reader Booklet, like many other large-scale international assessments, contains two types of item scores, dichotomous and polytomous. Dichotomously-scored items provide binary data which is scored in only one of two levels, for example, correct/incorrect or true/false. Polytomously-scored items, also referred to as partial credit or graded response items, provide ordinal data which can be scored in more than two levels, such as on a scale of 0 to 3. The PIRLS Reader contains multiple choice items which were scored dichotomously, as well as constructed-response items which were scored polytomously, with up to 4 possible score levels. Given this, any statistical method considered for inclusion in this study, or in any other similar study of large-scale assessments containing both types of items, must be able to accommodate both dichotomously and polytomously scored data.

Establishing Construct Comparability: Analysis of Factor Structures

Bias can be introduced into an assessment when an assessment, intended to measure the same construct across groups, measures different constructs in the groups (Bond et al., 1996). Validity of group comparisons depends, in part, on whether construct comparability exists across the groups. In addition, some psychometric methods such as IRT, which was used in this study to examine comparability at the test level and which is used in current scaling methods in large-scale assessments, have strict assumptions that must be met before they can be properly applied. One critical assumption is that there is a single latent trait that underlies test performance and that it is the same unidimensional construct across groups. An examination of unidimensionality of the construct across the groups provided evidence as to whether this IRT assumption is met, and hence the suitability of the use of IRT for this particular study. It also provided information regarding the appropriateness of PIRLS 2001's use of IRT to generate item parameters as the basis for establishing scores. This study used exploratory factor analysis (EFA) to examine construct comparability. EFA is commonly used in studies of comparability of different language versions of large-scale academic assessments (Puhan & Gierl, 2006; Robin, Sireci, & Hambleton, 2003).

Establishing Item Level Equivalence: Differential Item Functioning Analyses

DIF analyses are used to examine item level equivalence across groups. DIF, as defined by Hambleton et al. (1991), occurs when individuals having the same ability, but from different groups, do not have the same probability of correctly responding to an item.
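Stated formally (a standard formalization added here for clarity, not reproduced from the original text), an item i is free of DIF when its conditional response probabilities are identical across groups at every level of the ability θ:

\[
P(X_i = 1 \mid \theta,\, G = \text{reference}) \;=\; P(X_i = 1 \mid \theta,\, G = \text{focal}) \quad \text{for all } \theta,
\]

with the analogous condition for a polytomous item being equality of the expected item scores, \( E(X_i \mid \theta, G = \text{reference}) = E(X_i \mid \theta, G = \text{focal}) \). DIF is present whenever these conditional quantities differ over some range of \(\theta\).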
The presence of DIF is related to group membership (such as language, culture, gender or country of residence), which in theory should not play a role in determining an examinee's score on an individual item or on an assessment as a whole. This study uses the common terminology for naming groups in DIF studies, that is, the "reference group" which serves as a group for comparison purposes, and the "focal group" which is the group of interest to the researcher.

There are several methods for investigating DIF. They are grouped into two main categories, those that are parametric and those that are nonparametric. Parametric methods assume that a specific mathematical model underlies the data. Of critical practical importance is that, in order to use parametric analyses for DIF, the data must be shown to fit the statistical model used. In contrast, nonparametric methods make no statistical assumptions and there is no resulting need to establish fit between data and model. Many authors have provided comprehensive discussions of DIF detection methods (Camilli & Shepard, 1994; Clauser & Mazor, 1998; Holland & Wainer, 1993; Millsap & Everson, 1993; Potenza & Dorans, 1995). The commonly-used nonparametric DIF detection methods for data containing a mixture of polytomous and dichotomous items are Mantel-Haenszel (MH, Mantel, 1963; Zwick, Donoghue, & Grima, 1993), Generalized Mantel-Haenszel (GMH), Standardized Mean Difference (SMD, Dorans & Schmitt, 1991) and Poly-SIBTEST (Chang et al., 1996). Commonly used parametric methods for mixed format data include Logistic Regression (LR, Swaminathan & Rogers, 1990), Ordinal Logistic Regression (OLR, Zumbo, 1999) and some Item Response Theory (IRT) based methods such as the Graded Response Model (GRM, Samejima, 1969) and the Generalized Partial Credit Model (GPCM, Muraki, 1992).

It has been shown that different DIF detection methods perform best under different conditions, with the result that identification of DIF can vary according to which detection method is used. Factors that affect DIF detection rates include test length, sample sizes, ratio of the two groups' sample sizes, amount of DIF, balance of DIF between the two groups, differences in item difficulty and discrimination, and differences between the ability levels of the two groups. For this reason, it has been suggested that DIF analyses should be conducted using two different DIF detection methods, and that only those items identified by both methods should be flagged for potential DIF (Ercikan et al., 2004; Kim & Cohen, 1995). The two methods chosen to examine item level equivalence in this study are OLR and Poly-SIBTEST.

Ordinal Logistic Regression

As with many other DIF detection methods, logistic regression was originally developed for analyzing DIF in items that were scored dichotomously. Extensions to the original logistic regression methods have been made to accommodate polytomous items with ordinal data. For example, Miller and Spray (1993) described a form of logistic discriminant function analysis (LDFA) for DIF detection which was able to analyze both dichotomously and polytomously scored items simultaneously. Their method had some advantages over previous DIF detection methods, in that both item scoring formats could be accommodated simultaneously, and both uniform and non-uniform DIF could be assessed. Although there was an associated chi-squared test of significance for the logistic discriminant function analysis method, there was no effect size measure.
Zumbo (1999) introduced an ordinal logistic regression method for analyzing DIF in items which are scored using an ordinal scoring format. He also introduced an effect size measure. His procedure, OLR, is an observed score parametric procedure which provides a hierarchical test of DIF that is able to distinguish uniform and nonuniform DIF. Zumbo suggests a regression equation in which the score on an item, treated as a latent, continuously distributed random variable, is regressed on the predictor variables of total score, group membership, and the interaction between total score and group. The regression is carried out in a hierarchical, or step-wise, fashion and a chi-squared value is calculated for each step of the regression as a test of significance. In addition, a coefficient of determination (R²) is calculated at each step. Differences between R² values with the addition of each new step in the model give an indication of the amount of variance accounted for by the variable being added at that step, and this provides a step-wise effect size in addition to the test of significance.

Zumbo's (1999) OLR method was selected for use in this study because it has advantages over other DIF detection methods such as generalized Mantel-Haenszel and logistic discriminant function analysis (Zumbo, 2008). It can model both binary and ordinal items using the same modelling strategy, is relatively easy to implement and has both a test statistic and a measure of effect size. The OLR effect size, R²Δ, can be compared to the SIBTEST effect size, β̂. Jodoin and Gierl (2001) conducted a cubic curve regression to predict the OLR effect size from the previously established SIBTEST effect size measure, thus establishing comparable effect sizes between the two methods. Few empirical studies exist that draw specific comparisons between DIF detection results using OLR and Poly-SIBTEST. This research study will contribute to the research base by providing such a comparison.

Poly-SIBTEST

Shealy and Stout (1993) presented a procedure for multidimensional modeling of DIF (MMD) called SIBTEST, where SIB stands for "simultaneous item bias". Poly-SIBTEST is a modification of the SIBTEST procedure to accommodate polytomous items. Both are nonparametric procedures that estimate the amount of DIF in an item and provide a statistical test of significance as well as an effect size measure. SIBTEST was designed to detect differential functioning in dichotomous items, and Poly-SIBTEST was designed to extend the detection method to include either dichotomous or polytomous items (Chang et al., 1996). Shealy and Stout's MMD model sets out that DIF is a result of a secondary ability or construct, called eta (η), being measured in addition to the ability of interest or main construct, theta (θ). In other words, there is more than one dimension (construct) being measured by the item and the studied groups differ in ability on the second dimension. Thus, when two groups who are matched on θ have different conditional ability distributions on η, DIF is indicated. In cases where there is a secondary dimension being measured on which there is no group difference in ability, DIF is not indicated. The presence of true ability distribution differences on the primary construct can statistically inflate results, and SIBTEST uses a regression correction to eliminate target ability differences from its test statistic.
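The core quantity behind the SIBTEST family of statistics can be illustrated with a minimal sketch: a weighted difference in mean item scores between reference and focal examinees who are matched on the rest of the test. This is a simplified illustration only; the operational SIBTEST and Poly-SIBTEST programs additionally apply the regression correction described above and divide by a standard error to form the test statistic, and all variable names below are hypothetical.

```python
import numpy as np

def simple_beta_uni(item_scores, matching_scores, is_reference):
    """Illustrative, uncorrected SIBTEST-style effect size for one studied item.

    item_scores     : scores on the studied item (dichotomous or polytomous)
    matching_scores : total scores on the matching (valid) subtest
    is_reference    : boolean array, True for reference-group examinees
    """
    item_scores = np.asarray(item_scores, dtype=float)
    matching_scores = np.asarray(matching_scores)
    is_reference = np.asarray(is_reference, dtype=bool)

    beta = 0.0
    n_focal_total = np.sum(~is_reference)
    for k in np.unique(matching_scores):
        in_stratum = matching_scores == k
        ref = item_scores[in_stratum & is_reference]
        foc = item_scores[in_stratum & ~is_reference]
        if len(ref) == 0 or len(foc) == 0:
            continue  # a stratum contributes nothing without both groups present
        weight = len(foc) / n_focal_total  # weight strata by the focal-group proportion
        beta += weight * (ref.mean() - foc.mean())
    return beta  # positive values favour the reference group
```

A positive value indicates that, among examinees with the same matching-subtest score, the studied item favours the reference group; the published procedure then evaluates the (regression-corrected) estimate against its standard error for a significance test.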
A considerable amount of empirical research has shown that DIF detection with SIBTEST is comparable to that of other DIF detection methods such as Mantel-Haenszel and Logistic Regression (Jiang & Stout, 1998; Roussos & Stout, 1996; Shealy & Stout, 1993). In cases where there is a large difference in group ability levels, SIBTEST has been shown to be more accurate and powerful than other methods (for example, Narayanan & Swaminathan, 1994). Gierl, Jodoin, and Ackerman (2000) studied MH, SIBTEST, and LR under conditions in which there was a large percentage of DIF items, and found that all three provide adequate protection against Type I error (false identification of items as being DIF when they are not) even when 60% of the items contain DIF, but SIBTEST was the most powerful. Considering the amount of research that indicates a large percentage of items display DIF in translated assessments, this result provides evidence that SIBTEST is an appropriate choice in these circumstances.

For this study, which used SIBTEST's polytomous extension, Poly-SIBTEST, it is appropriate to consider how Poly-SIBTEST compares to other DIF detection methods. Two modifications were made to SIBTEST to extend it to the polytomous case (Chang et al., 1996). One modification involved the inclusion of a multiplicative factor in the SIBTEST test statistic so that there would be more matching scores used in Poly-SIBTEST. The other modification was the substitution of coefficient alpha (α) for the KR-20, which SIBTEST uses in its regression correction. The two methods are very closely related and there are many examples in the literature where they are treated as one method. However, they are technically different; therefore a separate literature review was conducted to determine the suitability of Poly-SIBTEST for this study.

Chang et al. (1996) introduced Poly-SIBTEST as an extension of the SIBTEST DIF detection method to include polytomous items. In their simulation study, Chang et al. compared Poly-SIBTEST to the Mantel-Haenszel (MH) and Standardized Mean Difference (SMD) procedures. Their research showed that the Type I error rate of Poly-SIBTEST remained consistent when sample sizes were increased whereas MH and SMD did not. In addition, Poly-SIBTEST performed better than MH and SMD when item discrimination parameters were set at realistic values, such as those from an actual administration of the National Assessment of Educational Progress (NAEP).

In a simulation study comparing two parametric DIF procedures which use the Graded Response Model (GRM) (Samejima, 1969) with the nonparametric Poly-SIBTEST, Bolt (2002) established that Poly-SIBTEST resulted in greater Type I error control when the data did not fit the parametric models used. Poly-SIBTEST Type I error results were consistent across differences in sample size, mean ability, and generating models. Poly-SIBTEST power rates were noticeably lower than the two parametric procedures with low sample sizes (300 examinees) but were slightly improved with sample sizes of 1000. Bolt concluded that Poly-SIBTEST might offer advantages when the parametric DIF detection methods cannot be used due to model misspecification.

A complex simulation study of polytomous DIF and violations of ordering of the expected latent trait (OEL) by the raw score was conducted by DeMars (2008). Although it might be expected that examinees ordered by observed score would be similarly ordered on the expected value of the latent trait, this is not always the case.
DeMars varied the amount of violation of OEL, test length (5 or 20 items), differences in group abilities, and sample size (250 or 1000), and compared Type I error results using four methods: Mantel, GMH, LDFA and Poly-SIBTEST. Results showed that although Poly-SIBTEST displayed greater error than the other procedures for the 5-item tests, it performed at least as well as the other procedures for the 20-item tests. When the sample size was 1,000 per group and the group means differed, Poly-SIBTEST had lower rates of Type I errors than the other procedures.

Zwick, Thayer, and Mazzeo (1997) conducted an analysis of Type I error control of the MH, SMD, and Poly-SIBTEST DIF detection methods. Their results showed that when reference and focal groups had identical ability distributions, SMD outperformed the other two methods. However, when the group abilities differed, Poly-SIBTEST produced the best results; and when the group abilities were different and item discrimination of the studied item was higher than that of the matching items, Poly-SIBTEST had much better Type I error control than the other methods.

Taken together, these studies of Poly-SIBTEST provide evidence for the appropriateness of its use in the current study. Poly-SIBTEST has been shown to exhibit good Type I error control when compared to both parametric and nonparametric methods. It performs at least as well as or better than other methods at sample sizes that approximate those used in this study, and in situations where group differences in ability might be expected, as in this study. In addition, it has both a test of significance and an effect size measure; thus it lends itself to comparison with OLR on both of these criteria.

Establishing Bundle Level Equivalence: Differential Bundle Functioning Analyses

Statistical methods for analyzing bundles of items are more powerful than methods for analyzing single items (Gierl et al., 2001). Douglas, Roussos, and Stout (1996) proposed two methods of selecting bundles of items for DBF analysis. They described a bundle as a cluster of items chosen according to some organizing principle. Based on Shealy and Stout's (1993) multidimensional modeling of DIF (MMD), Douglas et al. proposed the use of expert opinion alone, which provides substantive evidence in a confirmatory approach, or the use of statistical dimensionality analysis, sometimes combined with expert opinion, in an exploratory approach, in order to organize test items into bundles for differential functioning analysis. Roussos and Stout (1996) suggested a paradigm to unify the substantive and statistical DIF analyses by linking them to a theoretically sound and mathematically rigorous multidimensional conceptualization of DIF. When DIF hypotheses are generated based on substantive characteristics, or organizing principles, of the items (such as item content, differential task demands, or predicted group differences), which are then linked to the underlying structure of the data, bias can be minimized.

Two recent studies have been conducted to assess the performance and suitability of SIBTEST specifically for the purpose of analyzing DBF. Russell (2005) conducted simulation study analyses of two IRT-based techniques, Differential Fit of Items and Tests ("DFIT", Raju, 1999, as cited in Russell, 2005) and SIBTEST, in detecting differential test and differential bundle functioning respectively. His results, with regard to SIBTEST and DBF, suggest that SIBTEST performed with adequate power in his DBF power study.
In contrast to other studies, though, Russell determined that SIBTEST performed poorly when there was a large difference in ability between the target and reference groups. This may be due to the fact that he evaluated SIBTEST performance at the conservative alpha level of 0.01 whereas others have used the 0.05 criterion. Ross (2007) also assessed the performance of SIBTEST under various conditions in terms of its Type I error and power for DBF analysis with simulated data. Her study provides evidence that SIBTEST is able to distinguish between multidimensionality and multidimensionality that leads to DIF or DBF. In addition, Ross determined that SIBTEST has good power for detecting DBF, and controlled its Type I error rate for almost all of the simulated conditions. Her results regarding Type I error rate appear to conflict with those of Russell (2005) since Ross had elected to use a less stringent alpha level of 0.05.

Originating with the work of Nandakumar (1993), who showed that amplification, partial cancellation, or complete cancellation of DIF can occur when items are analyzed in groups or "bundles", several studies have since demonstrated the use of SIBTEST's bundle capabilities. Both confirmatory and exploratory analyses have been used to develop theory as to the potential causes for differential functioning in an assessment. Examples can be found in the work of Abbott (2007), Mendes-Barnett and Ercikan (2006), and Stout et al. (2003). Fewer studies (Fouad & Walker, 2004; Gierl, 2005) have used Poly-SIBTEST for developing theory regarding causes of differential functioning. For example, Gierl (2005) demonstrated how Poly-SIBTEST can be used to identify and interpret sources of DIF by first conducting a substantive analysis of test items for organizing principles. This analysis can take the form of test specifications analysis, content analysis (based on expert opinion or literature review), psychological analysis (such as cognitive tasks required) or empirical analysis of dimensionality. Based on the results of the substantive analysis, one can formulate a DIF or DBF hypothesis, that is, whether a bundle of items which are designed to measure a primary dimension also measure a secondary dimension. A statistical analysis can then be carried out to determine whether the organizing principles reveal distinct primary and secondary dimensions which lead to differential functioning. The current study provides additional research on the use of Poly-SIBTEST for developing differential functioning theory.

Establishing Test Level Equivalence: Comparison of TCCs

As with DIF studies, a variety of methods exist for the analysis of differential test functioning (DTF). These methods have been categorized by Pae and Park (2006) as methods that study the association between test score and an external criterion, IRT test characteristic curve difference methods, and methods that compare internal factor structures across groups. This study examined item response theory test characteristic curves to investigate the degree of test level equivalence.

Item response theory methods can be used for the analysis of DIF as well as DTF. Item response theory provides us with mathematical models in which the probability of success on (or endorsing) an item varies with an examinee's latent ability, theta (θ), and certain parameters that distinguish the item.
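For reference, the three-parameter logistic form discussed below can be written as follows (a standard presentation of the model, added here for clarity rather than reproduced from the original text):

\[
P_i(\theta) = c_i + (1 - c_i)\,\frac{1}{1 + \exp\!\left[-D a_i(\theta - b_i)\right]},
\qquad
TCC(\theta) = \sum_{i=1}^{n} P_i(\theta),
\]

where \(a_i\), \(b_i\), and \(c_i\) are the discrimination, difficulty, and pseudo-guessing parameters of item \(i\), \(D\) is a scaling constant (often taken as 1.7), and the test characteristic curve is the sum of the item characteristic curves.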
According to IRT, the probability of success on a given item is a monotonically increasing function of examinees' ability, which can easily be represented graphically for each item on a test, with θ plotted on the abscissa and the probability of a correct response on the ordinate axis. Item parameters determine the shape and positioning of the curve. This graphical representation is referred to as an Item Characteristic Curve (ICC). The sum of all the ICCs in a test gives a Test Characteristic Curve (TCC).

There are three commonly-used item response models, which are related to the number of item parameters being estimated. In the one-parameter logistic model (1PL), a single item parameter, item difficulty (b), is estimated, where difficulty is defined as the point on the ability scale where the probability of a correct response is 0.5. The two-parameter logistic model (2PL) provides an estimation of the item discrimination parameter (a) in addition to item difficulty. The a parameter is proportional to the slope of the ICC at the difficulty point, b, on the ability scale. The higher the value of the discrimination parameter is, the steeper the slope of the ICC will be. The three-parameter logistic model (3PL) adds a parameter for "guessing" or "pseudo-chance" (the c parameter), which represents the probability of examinees with low ability correctly answering an item. The c parameter is seen as the lower asymptote of the item characteristic curve.

It is possible to investigate DTF directly by making comparisons of the item parameters for two or more groups, or indirectly by making comparisons of the TCCs for the groups. If DTF is not present, the TCCs generated separately from data from a focal group and a reference group will be identical. If the TCCs for the two groups are different, this is a representation of DTF between the groups.

Summary

Validity and score comparability are related to the uses and inferences drawn from assessment results. This chapter highlighted some of the uses and inferences from large-scale assessments in the Canadian context, with a particular focus on the PIRLS assessment, the 2001 administration of which is the measure that was analyzed in this study. The responsibilities of test developers to provide empirical evidence for comparability of large-scale assessments which have been translated have been outlined, with reference to the Standards (AERA, APA & NCME, 1999) and the ITC Guidelines (ITC, 2000). This chapter also provided a review of statistical methods that are appropriate for use in establishing score comparability at the item, bundle and test levels. The methods selected for use in this study were introduced. Their suitability for inclusion in this study was substantiated with thorough review and consideration of previous research.

CHAPTER III: METHOD

Overview

The purpose of this research study was to examine the degree of differential item, bundle, and test functioning in the PIRLS 2001 Reader Booklet and to draw comparisons across construct comparability analysis results at these three levels. In addition, this study examined the similarities and differences between two DIF detection methods, OLR and Poly-SIBTEST, and looked for patterns or relationships between differential functioning analysis results at all three levels of analysis.
This chapter presents a description of the PIRLS 2001 measure, including the target population, construct intended to be measured, PIRLS assessment framework, matrix sampling methodology and the manner in which items were grouped by PIRLS into blocks and distributed in booklets. This chapter also provides information regarding the data used in this study, specifically the Reader booklet data for the four countries included in the study, sample sizes and information regarding item types and their placement in two blocks in the Reader booklet. Finally, descriptions of the analyses conducted at the item level (OLR and Poly-SIBTEST), bundle level (Poly-SIBTEST) and scale level (comparison of TCCs) are given. Measure The PIRLS 2001 assessment was developed and conducted by the International Association for the Evaluation of Educational Achievement (TEA). It was developed to provide policymakers, educators, researchers and practitioners  32  with information about how well students are mastering reading literacy, and to provide information for improvement. Thirty-five countries joined together to conduct the first PIRLS assessment in 2001. The assessment was originally constructed in English and was then translated/adapted and administered in 28 different languages (Martin, Mullis, & Kennedy, 2003). The international target population for PIRLS 2001 was “all students enrolled in the upper of the two adjacent grades that contain the largest proportion of 9-year-olds at the time of testing” (Foy & Joncas, 2003, p. 54). In most, but not all countries, this represented the fourth grade. PIRLS 2001 chose this population because the age of 9 years represents an important transition point in the development of reading skills where students are expected to have learned how to read and are now reading to learn. In addition, the PIRLS 2001 study was intended to complement another study, Trends in International Mathematics and Science Study (TIMSS) which assesses math and science achievement at the same grade (Grade 4), thus providing countries who take part in both assessments with data in three subjects at the Grade 4 level. PIRLS 2001 was designed to assess the construct of “reading literacy”  which is defined as the ability to understand and use those written language forms required by society and/or valued by the individual. Young readers can construct meaning from a variety of texts. They read to learn, to participate in communities of readers, and for enjoyment. (Campbell, Kelly, Mullis, Martin, & Sainsbury, 2001, p. 3)  33  This definition reinforces the concept that reading is a complex and interactive process. The PIRLS assessment framework provides a theoretical basis for understanding reading literacy, whose definition was derived from and informed by numerous theories of reading (Sainsbury & Campbell, 2003, p 15). There are three specific aspects of reading literacy which are the focus of PIRLS 2001. They are processes for comprehension, purposes for reading, and reading behaviors and attitudes. Reading behaviors and attitudes were assessed through the use of four questionnaires which were developed to gather contextual information on home and school experiences associated with developing reading literacy, as well as home and school environmental, demographic and socioeconomic factors. Students, their parents, teachers and school principals were surveyed. 
Table I presents the PIRLS 2001 aspects of reading which form the foundation for the PIRLS written achievement assessment, the focus of this study. They are processes for comprehension and purposes for reading.  34  Table 1 PIRLS 2001 Aspects ofReading Processes for Comprehension  Focus on and retrieve explicitly stated information. Make straightforward inferences. Interpret and integrate ideas and information. Examine and evaluating content, language, and textual elements.  Purposes for Reading  Reading for literary experience Reading to acquire and use information  Each PIRLS 2001 item was written to engage students in one of the two purposes for reading and one of the four processes for comprehension (Sainsbury & Campbell, 2003, p 17). The four processes for comprehension are composed of a hierarchical sequence from simple retrieval, to making inferences, interpreting and integrating information, and finally examining and evaluating elements of what is being read. These four processes require increasing meta-cognitive processes and strategies that allow readers to examine their understanding and adjust their approach. In addition, the knowledge and experience that readers bring to reading equip them with an understanding of language, texts, and the world through which they filter their comprehension of the material. In order to respond to questions intended to assess the two higher cognitive levels (interpreting and  35  integrating information, and examining and evaluating elements), readers may need to depend on background knowledge and personal experience (Campbell et al. 2001). Overall, the PIRLS assessment contained a balanced representation of items in each of the four process categories, with each one representing either 20 or 30% of the total items. The two purposes of reading focused on in the written assessment are reading for literary experience and reading to acquire and use information. The PIRLS 2001 reading literacy assessment consists of selected reading passages of approximately 1000 words, with associated questions in two question formats, multiple choice and constructed-response. Reading passages are either literary or informational in nature. In total, the assessment contains 8 blocks of items (4 literary and 4 informational blocks) for a total of 98 items, with a score value of 133 points, distributed across 10 booklets which were matrix sampled. Each student was assigned only I of the 10 booklets. When the purpose of testing is to measure the performance of groups rather than individuals, matrix sampling procedures are frequently used. In matrix sampling, a small subset of items is assigned to each of many subsamples of examinees. Data are aggregated across subsamples and item subsets to obtain a measure of group performance (AERA, APA & NCME, 1999). Matrix sampling allows for broader coverage of the construct while at the same time reducing the amount of time required for each respondent to take the assessment. Student performance on the entire set of items is then inferred based on completing just a subset of the items.  36  Three literary and three informational blocks of items were linked among booklets 1 to 9 through a sophisticated linking procedure. For example, items that comprised block “L3” were responded to by students who received Booklets 2, 3 and 9. The fourth literary and informational blocks were contained only in the Reader booklet. 
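The linked design can be made concrete with a short sketch. It is purely illustrative and assumes Python; the block-to-booklet assignment is the one listed in Table 2 below, with literary blocks written L1 to L4 and informational blocks I1 to I4.

```python
# Block-to-booklet assignment as given in Table 2 below
# (L = literary block, I = informational block).
booklets = {
    1: ("L1", "L2"), 2: ("L2", "L3"), 3: ("L3", "I1"),
    4: ("I1", "I2"), 5: ("I2", "I3"), 6: ("I3", "L1"),
    7: ("L1", "I1"), 8: ("I2", "L2"), 9: ("I3", "L3"),
    "Reader": ("L4", "I4"),
}

# List the booklets in which each block appears.
appearances = {}
for booklet, blocks in booklets.items():
    for block in blocks:
        appearances.setdefault(block, []).append(booklet)

for block in sorted(appearances):
    print(f"{block}: booklets {appearances[block]}")

# Blocks L1-L3 and I1-I3 each appear in three booklets (e.g., L3 in booklets 2, 3, and 9),
# while L4 and I4 appear only in the Reader booklet and are not directly linked to other blocks.
```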
The distribution of blocks is shown in Table 2: Table 2 PIRLS 200] Distribution ofLiterary and Informational Blocks Across Booklets Booklet  Blocks  Booklet  Blocks  1  Li  L2  6  13  LI  2  L2  L3  7  Li  II  3  L3  ii  8  12  L2  4  II  12  9  13  L3  5  12  13  Reader  L4  14  Booklet 10, called the “PIRLS 2001 Reader Booklet”, was unique in many respects. Its uniqueness led to it being chosen as the source of data for the current study, as outlined next. The Reader booklet contained 2 blocks (1 literary and I informational) of items that were not linked to any other blocks or booklets. Rather, the Reader items were linked to the rest of the assessment through the use of randomly equivalent samples. Booklets 1 to 9 were printed in black and white and had reading passages and questions contained within a single leaflet. In contrast, the  37  Reader booklet was printed in full colour in an attractive magazine-style format. The Reader was distributed at three times the rate of the other 9 booklets thus giving it a three-fold increase in sample size over any other individual booklet. Its reading passages and questions were contained in two separate booklets. The Reader was thought to be a natural, authentic reading setting for 9-year-olds. A natural implication of its intended authenticity is that the Reader may be a key booklet for capturing the construct of reading literacy. A last distinction of the Reader booklet that makes it an appropriate choice for study is that both of its reading passages and all associated questions have been released into the public domain and made available to the public in English (Gonzalez & Kennedy, 2003). The Reader Booklet passages and questions are contained in Appendix A attached to this document. The time allotted to complete the reading literacy assessment was 80 minutes, and a further 20 minutes was allotted for each student to complete the student questionnaire. PIRLS 2001 written reading literacy achievement results are reported in the PIRLS 2001 International Report (Mullis, Martin, Gonzalez, & Kennedy, 2003). This report contains assessment results by country for each of the two purposes of reading, and for overall reading achievement. The PIRLS International Achievement in the Processes of Reading Comprehension report (Mullis, Martin, & Gonzalez, 2004) provides achievement results for the comprehension processes. Comparisons are made between countries, and information is given as to whether each country is significantly higher or lower than the international average. Achievement results by gender were also reported,  38  as were the percentages of students in each country reaching defined international benchmarks in reading achievement. The sample size, mean proficiency on overall reading achievement, years of formal schooling, and average age for each of the participating countries is shown in Table 3.  
39  Table 3  PIRLS 2001 Sample Information and Overall Average Scale Scores Overall Average  Years of  Sample  Scale Score  Formal  Average  Country  Size  (Standard Error)  Schooling  Age  Sweden  6044  561 (2.2)  4  10.8  Netherlands  4112  554 (2.5)  4  10.3  England  3156  553 (3.4)  5  10.2  Bulgaria  3460  550 (3.8)  4  10.9  Latvia  3019  545 (2.3)  4  11  Canada (O,Q)a  8253  544 (2.4)  4  10  Lithuania  2567  543 (2.6)  4  10.9  Hungary  4666  543 (2.2)  4  10.7  United States  3763  542 (3.8)  4  10.2  Italy  3502  541 (2.4)  4  9.8  Germany  7726  539 (1.9)  4  10.5  Czech Republic  3022  537 (2.3)  4  10.5  New Zealand  2488  529 (3.6)  5  10.1  Scotland  2717  528 (3.6)  5  9.8  Singapore  7002  528 (5.2)  4  10.1  Russian Fed.  4093  528 (4.4)  3 or4  10.3  40  Table 3 (Continued) Overall Average  Years of  Sample  Scale Score  Formal  Average  Country  Size  (Standard Error)  Schooling  Age  HongKong,SAR  5050  528 (3.1)  4  10.2  Greece  2494  524 (3.5)  4  9.9  SlovakRepublic  3807  518 (2.8)  4  10.3  Iceland  3676  512 (1.2)  4  9.7  Romania  3625  512 (4.6)  4  11.1  Israel  3973  509 (2.8)  4  10  Slovenia  2952  502 (2.0)  3  9.8  500 (0.6)  4  10.3  Intern’! Avg.  Norway  3459  499 (2.9)  4  10  Cyprus  3001  494 (3.0)  4  9.7  Moldova, Rep.  3533  492 (4.0)  4  10.8  Turkey  5125  449 (3.5)  4  10.2  Macedonia,Rep.  3711  442(4.6)  4  10.7  Colombia  5131  422 (4.4)  4  10.5  Argentina  3300  420 (5.9)  4  10.2  Iran, Islamic Rep.  7430  414 (4.2)  4  10.4  Kuwait  7133  396 (4.3)  4  9.9  Morocco  3153  350 (9.6)  4  11.2  Belize 2909 327 (4.7) 4 9.8 aCanath is represented by the provinces of Ontario and Quebec only. The international average does not include the results from these provinces separately.  41  Data  This study analyzed PIRLS 2001 Reader booklet data from four of the thirty-five participating countries: Argentina, Colombia, England and the United States. These countries were chosen based on their relative positions in the distribution of achievement results from PIRLS 2001, the language in which they administered the assessment, and potential cultural similarities and differences. Argentina and Colombia both ranked below the international average of all participating countries (fifth and sixth-lowest respectively), while England and the United States ranked well above the international average (third and ninth-highest respectively). Argentina and Colombia were the only two countries that translated the assessment into Spanish and administered it in Spanish, while England and the United States administered it in English, requiring no translation. This selection of countries to analyze was made based on the fact that, if differential functioning was present (whether at the item, bundle, or test level), it would be most likely to be found between high and low ranking countries. In addition, a significant amount of research has indicated that differential functioning may be a result of many factors such as translation, culture, and socio-economic status. Each of these factors may therefore contribute to the presence of differential functioning in this study. As shown in Table 3, PIRLS 2001 was administered to 146,490 students, whose average age was 10.3 years, and who had had an average of 4 years of formal schooling. The sample sizes for the four countries analyzed in this study are also shown, and range from 3,156 to 5,131. 
Students in England had received,  42  on average, 5 years of formal schooling compared with 4 years in Argentina, Colombia and the United States. Students in Colombia were an average age of 10.5 years, compared with 10.2 years in the other three countries. Since the PIRLS 2001 booklets were matrix sampled, not all students took the Reader Booklet and its associated items. The average sample size per country for the Reader Booklet was 1000 students. Table 4 shows the sample sizes for the PIRLS 2001 Reader Booklet only for the four countries analyzed in this study. Table 4 PIRLS 200] Reader Booklet Sample Sizes Country  Sample Size  Total Sample  35 987  Argentina  794  Colombia  1 285  England  777  USA  926  This study investigated score comparability for four countries: Argentina, Colombia, England, and USA. Data from each of these countries were analyzed separately as a focal group, which were compared to a reference group. The reference group was composed of a random sample of 200 students from each of the 34 remaining countries. For example, the analyses for England compared the entire sample of English students who wrote the Reader Booklet (the focal group) against an “International Sample” which consisted of a random sampling of  43  students who wrote the Reader booklet from all countries except for England (the reference group). This resulted in a reference group size of 6800. The reference group sample creation followed the method used by PIRLS 2001 in creating an international sample for scaling as closely as possible. The international sample is intended to be representative of all individual countries. As with all other PIRLS 2001 booklets, the Reader booklet contains two reading passages with their associated questions. One reading passage, “Hare Heralds the Earthquake”, is intended to assess reading for literary experience, and the other, “Nights of the Pufflings”, is intended to assess reading to acquire and use information. In total, there were 24 items in the Reader associated with these two passages, 13 of which were multiple choice and 11 of which were constructed response items, as shown in Table 5. The corresponding item numbering used throughout this study is also shown in Table 5, in the colunm entitled “Study Item#.  44  Table 5 PIRLS 2001 Reader Booklet Items Reading Passage  PIRLS  Study  Item  Max Score  Item #  Item #  Type*  (points)  Hare Heralds the  1  1  MC  I  Earthquake  2  2  MC  I  3  3  CR  2  4  4  CR  5  5  MC  6  6  MC  1  7  7  CR  2  8  8  CR  9  9  CR  2  10  10  CR  3  11  11  MC  1  Nights of the  1  12  MC  1  Pufflings  2  13  MC  1  3  14  MC  1  4  15  MC  1  5  16  MC  1  6  17  MC  1  7  18  CR  2  8  19  CR  2  9  20  MC  1  10  21  CR  1  11  22  MC  1  12  23  CR  2  13  24  CR  1  Total Score: *MC= Multiple Choice; CR  32 Points =  Constructed Response  45  Analyses Several statistical analyses were conducted to investigate score comparability in the PIRLS 2001 Reader booklet. Analyses were conducted for comparability at the item, bundle and scale levels. These analyses are described below. Item Level Analyses Comparability at the item level was investigated using both Ordinal Logistic Regression and Poly-Sibtest analyses. 
Ordinal Logistic Regression

Zumbo's (1999) OLR method uses a regression equation in which the predictor variables of total score, group membership, and the interaction between total score and group membership are regressed on the dependent variable of item score, treated as a latent, continuously-distributed random variable. The OLR equation is

y^{*} = b_0 + b_1\,\mathrm{TOTAL} + b_2\,\mathrm{GRP} + b_3\,(\mathrm{TOTAL} \times \mathrm{GRP}) + \varepsilon_i

where y* is the continuously-distributed random variable being predicted (in this study, y* represents the predicted score on an item), b_0 is the intercept of the regression equation, and b_1, b_2, and b_3 are the regression coefficients (slopes, or beta coefficients) of the predictor variables total score, group membership, and the total score by group membership interaction. The error terms, ε_i, are distributed with a mean of zero and a variance of π²/3.

Zumbo (1999) sets out that OLR assumes an unobservable continuum of variation and that it can be expressed as

\mathrm{logit}\left[P(Y \le j)\right] = \alpha_j + \beta X

where a logit is the natural logarithm of the ratio of two probabilities and j = 1, 2, ..., c − 1, with c being the number of categories in the ordinal scale. The model requires a separate intercept parameter α_j for each cumulative probability. For a 3-point response, an OLR model describes the effect of X (the total score for the scale) on the odds that Y ≤ 1 instead of Y > 1 on the scale, and the effect of X on the odds that Y ≤ 2 instead of Y > 2. For DIF analysis, the OLR equation becomes

\mathrm{logit}\left[P(Y \le j)\right] = \alpha_j + b_1\,\mathrm{TOTAL} + b_2\,\mathrm{GRP} + b_3\,(\mathrm{TOTAL} \times \mathrm{GRP})

The selection of a matching variable is an important decision when implementing an OLR analysis, because logistic regression models the probability of responding correctly to an item from group membership and a matching variable, often the total scale score for the assessment. The matching variable used in this study was a "rest score," which is the total score excluding the item under investigation. Slocum, Gelin, and Zumbo (in press) suggest that a rest-score matching criterion be used for short scales, and the PIRLS 2001 Reader booklet, with only 24 items, is considered a short scale.

Ordinal logistic regression is carried out in a hierarchical, or step-wise, fashion with the order of entry of variables determined by the researcher. The total score is entered first. The second step holds the effect of the conditioning variable (the total score) constant while measuring the effect of the group membership variable; this step is representative of uniform DIF. In the third step, the first two variables are held constant while the effect of the interaction is measured; this step represents non-uniform DIF, that is, DIF that varies along the ability continuum. A chi-square value is calculated for each step of the regression as a test of significance. For this study, a significance level of p < .01 was set as the criterion for the chi-square test. In addition, the coefficient of determination, Nagelkerke R², was calculated at each step. The difference between R² values with the addition of each new step in the model indicates the amount of variance accounted for by the variable added at that step, and this provides an effect size, R²Δ, in addition to the chi-square test of significance.
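A minimal sketch of how this step-wise OLR DIF procedure could be implemented is given below. It is illustrative only: Python with NumPy, pandas, SciPy, and the OrderedModel class from statsmodels is assumed, the function and variable names (olr_dif, rest, grp) are hypothetical, and Nagelkerke's R² is computed directly from log-likelihoods rather than taken from any particular software package.

```python
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.miscmodels.ordinal_model import OrderedModel

def nagelkerke_r2(ll_model, ll_null, n):
    """Nagelkerke pseudo-R^2 from model and null (thresholds-only) log-likelihoods."""
    cox_snell = 1.0 - np.exp(2.0 * (ll_null - ll_model) / n)
    return cox_snell / (1.0 - np.exp(2.0 * ll_null / n))

def olr_dif(item, rest, group, alpha=0.01):
    """Hierarchical OLR DIF test for a single (possibly polytomous) item.

    item  : ordinal item scores (0, 1, ..., m)
    rest  : rest score, i.e., the total score excluding the studied item
    group : 0 = reference group, 1 = focal group
    """
    item, rest, group = map(np.asarray, (item, rest, group))
    n = len(item)

    # Log-likelihood of the thresholds-only model, used as the pseudo-R^2 baseline.
    counts = pd.Series(item).value_counts()
    ll_null = float((counts * np.log(counts / n)).sum())

    def fit_ll(X):
        return OrderedModel(item, X, distr="logit").fit(method="bfgs", disp=False).llf

    ll_step1 = fit_ll(pd.DataFrame({"rest": rest}))
    ll_step3 = fit_ll(pd.DataFrame({"rest": rest, "grp": group,
                                    "rest_x_grp": rest * group}))

    chi2 = 2.0 * (ll_step3 - ll_step1)          # 2-df test: group + interaction terms
    p_value = stats.chi2.sf(chi2, df=2)
    r2_delta = (nagelkerke_r2(ll_step3, ll_null, n)
                - nagelkerke_r2(ll_step1, ll_null, n))   # effect size, R^2-delta
    return {"chi2": chi2, "p": p_value, "R2_delta": r2_delta,
            "flagged": p_value < alpha}
```

In practice a routine like this would be run once per item, with the rest score recomputed each time so that the studied item never contributes to its own matching variable; the intermediate step with group membership alone (uniform DIF) can be fitted in the same way if the uniform and non-uniform components are to be tested separately.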
Jodoin and Gierl’s (2001) effect size classification scheme will be used in the current study: Negligible DIF:  z\ 2 R  Moderate DIF:  Chi-squared null hypothesis is rejected and .035< R z\ <.070 2  Large DIF:  Chi-scuared null hypothesis is rejected andR A>.070  <  .035  These effect size criteria have been chosen for this study because they have been shown to be comparable to the effect size criteria for SIBTEST (Jodoin & Gierl, 2001).  48  Poly-SJB TEST SIBTEST (Shealy & Stout, 1993) is a latent variable, nonparametric approach to DIF detection. That is, groups are matched on an estimate of a latent variable, 0, instead of on their observed scores, and no functional form for the item response function is assumed. In order to analyze data in SIBTEST or Poly-SIBTEST, the items must be separated into two subtests. Items hypothesized to display DIF are placed into a “studied subtest”. In the Multidimensional Model for DIF, these would be items predicted to contain the primary dimension and at least one secondary dimension. Items believed to be free of DIF, that is, measuring only a primary dimension, are placed into the “matching subtest” (also called a “valid subtest”). The matching subtest is used for the purposes of matching examinees on the latent variable. For dichotomous items, an item does not exhibit DIF if regression scores for the two groups are expected to be the same for all values of 9:  EJ?[Y18j =  Equation (I)  where R refers to the reference group and F refers to the focal group under study. Chang et al. (1996) made two modifications to SIBTEST to make it applicable to polytomous items, as well as a mixture of dichotomous and polytomous items. One modification was to allow for more score categories, since polytomous items are scored in m+1 ordered categories. A polytomous item can be shown not to exhibit DIF if the probability of achieving a score of k for a  49  randomly sampled examinee (with the same proficiency on 0) from each group is the same: Equation (2) Here, m refers to the maximum possible score for the items. Equation (1) has been shown to imply Equation (2) for the commonly used polytomous item response models, thus Equation (I) can be adopted for the latent variable definition of null DIF for ordinally scored polytomous items (Chang et al). Poly-SIBTEST measures the amount of DIF at 0 (known as local DIF) as: 0 B  Poly-SIBTEST also provides a global index of DIF as:  =  where  J () is the density of 0  f B0 (O)f . (O)d& 1 in the focal group. In this equation f3, representing  effect size, is interpreted as the expected amount of DIF that a randomly selected focal group examinee would experience. It has been shown that for the partial credit model, the generalized partial credit model, and the graded response model, matching test true scores (from classical test theory) are a one-to-one transformation of the latent variable 0 (Chang, 1994 and Chang & Mazzeo, 1994, as cited in Chang et al. 1996). Local DIF can then be estimated by: c/k  —YRk  —Y1k,k=O,...,nH  50  where Y 1  —  Y,.k is the group difference in performance on the studied item among  examinees with the same observed score, and an estimation of DIF is given by A  il  /3Pkdk  where pk is the proportion of all examinees achieving a score of k on the matching test. 
Finally, the Poly-SIBTEST test statistic is defined by

B = \frac{\hat{\beta}_{uni}}{\hat{\sigma}(\hat{\beta}_{uni})}, \qquad \hat{\sigma}(\hat{\beta}_{uni}) = \left[ \sum_{k=0}^{n} \hat{p}_k^{\,2} \left( \frac{\hat{\sigma}^2(Y_{k,R})}{N_{Rk}} + \frac{\hat{\sigma}^2(Y_{k,F})}{N_{Fk}} \right) \right]^{1/2}

with σ̂²(Y_{k,R}) and σ̂²(Y_{k,F}) being the sample variances of the studied item scores for examinees in the reference and focal groups, respectively, at matching score k, and N_{Rk} and N_{Fk} the corresponding numbers of examinees.

It is important to note that the above equations were all formulated under the assumption that the focal and reference groups have equal target ability distributions. This is, of course, not always the case, and the presence of unequal target abilities would lead to impact, which must be distinguished from bias. Shealy and Stout (1993) therefore used a regression correction for target ability distribution differences: the matching test true scores for the groups are estimated using the equation for the linear regression of true score on observed score. In the dichotomous SIBTEST case, the slope of the regression line was estimated using KR-20; for Poly-SIBTEST, the slope is calculated using Cronbach's alpha (Chang et al., 1996).

Poly-SIBTEST tests the null hypothesis (H_0: β_uni = 0) against the alternative hypothesis of DIF/DBF against the focal group (H_1: β_uni > 0), against the reference group (H_1: β_uni < 0), or against either group (H_1: β_uni ≠ 0). This study tested the alternative hypothesis of DIF against either group. When β_uni is statistically significant and has a positive value, this represents DIF against the focal group, whereas a significant negative value indicates DIF against the reference group.

For the analysis at the item level, a "standard" one-item-at-a-time DIF analysis using Poly-SIBTEST was conducted as explained in the software manual (The Roussos-Stout Software Development Group, n.d., p. 13), by analyzing each item as the suspect subtest and the rest of the test as the matching subtest. For this study, a significance level of p < .01 was set in order to maintain comparability with the OLR method. In addition, the guessing parameter was set to 0 for consistency with the test level analyses conducted as part of this study, since the TCCs for the scale level analysis were generated using a 2PL model (no guessing) for the dichotomous items.

Roussos and Stout (1996) proposed a classification scheme for interpreting the value of β̂_uni for an item. Their proposed effect size classification is:

Negligible DIF: |β̂_uni| < .059
Moderate DIF: .059 ≤ |β̂_uni| < .088
Large DIF: |β̂_uni| ≥ .088

This effect size classification scheme was adopted for this study. The results of the DIF analyses using both OLR and Poly-SIBTEST were examined to seek out organizing principles on which to formulate a differential bundle functioning hypothesis, as outlined in the next section.

Bundle Level Analysis

Poly-SIBTEST was used in this study for analyzing DBF as well as DIF, providing additional research on the use of Poly-SIBTEST for developing differential functioning theory. The processes for reading comprehension of the items jointly identified as DIF by the two methods were identified (using the test specifications set out in Campbell et al., 2001) in order to seek out organizing principles on which to group items into bundles for DBF, as outlined in Douglas et al. (1996) and Gierl (2005). Once organizing principles were established, the items were grouped into bundles and analyzed using Poly-SIBTEST.
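The core Poly-SIBTEST computation can be sketched as follows. This is a simplified illustration rather than the operational program: it assumes Python with NumPy and SciPy, matches examinees on raw matching-subtest scores, and omits the Shealy-Stout regression correction for unequal group ability distributions described above; the function name poly_sibtest and the handling of sparse score levels are choices made for the sketch.

```python
import numpy as np
from scipy import stats

def poly_sibtest(studied, matching, group):
    """Simplified beta-uni estimate and its test statistic.

    studied  : score on the studied item (or the summed score on a suspect bundle for DBF)
    matching : score on the matching (valid) subtest
    group    : array of 'R' (reference) / 'F' (focal) labels
    """
    studied, matching, group = map(np.asarray, (studied, matching, group))
    n_total = len(studied)
    beta_hat, var_beta = 0.0, 0.0

    for k in np.unique(matching):
        at_k = matching == k
        ref = studied[at_k & (group == "R")]
        foc = studied[at_k & (group == "F")]
        if len(ref) < 2 or len(foc) < 2:        # skip score levels missing either group
            continue
        p_k = at_k.sum() / n_total              # proportion of all examinees at matching score k
        d_k = ref.mean() - foc.mean()           # reference-minus-focal difference on the studied score
        beta_hat += p_k * d_k
        var_beta += p_k**2 * (ref.var(ddof=1) / len(ref) + foc.var(ddof=1) / len(foc))

    b_stat = beta_hat / np.sqrt(var_beta)
    p_value = 2.0 * stats.norm.sf(abs(b_stat))  # two-sided: DIF/DBF against either group
    return beta_hat, b_stat, p_value
```

A positive estimate indicates functioning against the focal group and a negative estimate functioning against the reference group; for the bundle analyses, the only change is that the studied score becomes the summed score on the suspect bundle while the matching subtest is formed from the remaining items.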
Although classification guidelines exist for interpreting effect size for DIF using SIBTEST and Poly-SIBTEST, no such guidelines exist to date for the interpretation of effect size for bundles of items (Boughton, Gierl, & Khaliq, 2000; Gierl, Bisanz, Bisanz, & Boughton, 2003; Gierl & Bolt, 2001). In the absence of established effect size guidelines for bundles, Gierl and Bolt proposed guidelines specifically for bundles of 5 items or fewer: |β̂| < .10 for negligible DBF and |β̂| ≥ .10 for moderate to large DBF. These proposed guidelines were followed if bundles of 5 items or fewer were distinguished in the current research.

Scale Level Analyses

This study compared test characteristic curves (TCCs) based on each international reference sample (as defined earlier in this chapter) versus its respective country (focal) group data in order to investigate the degree of comparability at the scale level and the appropriateness of using the international item parameters for establishing country scores. Recall that TCCs are a graphical representation of the expected test score plotted against ability and are based on item parameters estimated using item response theory. Also recall from Chapter I that current scaling methods in large-scale assessments involve the estimation of international item parameters from a data set containing samples of data from all countries. The estimated international item parameters are then used to create individual country scores, on the assumption that the international item parameters are representative of all countries and are suitable for creating country scores. Differences in TCCs between the international reference sample and the country data may signify inappropriateness of the international item parameters for estimating country scores (Ercikan & Gonzalez, 2008).

Item response theory parameters were estimated for the international sample and for each of the four countries separately using PARDUX (Burket, 1991). PARDUX estimates parameters simultaneously for dichotomous and polytomous items using marginal maximum likelihood procedures implemented via an EM algorithm, and can be used to estimate parameters under a variety of models, including the 2PL and 2PPC models (Fitzpatrick et al., 1996) used in the current study.

The 2PL IRT model, which is used to jointly estimate item difficulty, item discrimination, and ability, is given by the equation

P_i(\theta) = \frac{\exp\left[a_i(\theta - b_i)\right]}{1 + \exp\left[a_i(\theta - b_i)\right]}

where P_i(θ) is the probability that a randomly chosen examinee with ability θ answers item i correctly, b_i is the item difficulty parameter, and a_i is the item discrimination parameter (Hambleton et al., 1991). As explained previously, the probability of a correct response to an item is a monotonically increasing, S-shaped curve taking values between 0 and 1 over the ability range: the ICC function. The sum of all ICCs over a test results in the TCC.

The two-parameter partial credit (2PPC) model is explained in Ercikan et al. (1998) and can be summarized as a special case of Bock's nominal model (Bock, 1972), which is equivalent to Muraki's (1992) generalized partial credit model. Under Bock's nominal model, the probability that an examinee with ability θ obtains a score at the kth level of the jth item is

P_{jk}(\theta) = P(X_j = k \mid \theta) = \frac{\exp(Z_{jk})}{\sum_{i=1}^{m_j} \exp(Z_{ji})}

where Z_{jk} = A_{jk}\theta + C_{jk} and m_j is the number of score levels for item j.
The 2PPC model is a special case in that it includes the following constraints:

A_{jk} = a_j\,(k - 1)

and

C_{jk} = -\sum_{i=0}^{k-1} \gamma_{ji}

In this model, γ_{j0} = 0, and a_j and γ_{ji} are free parameters to be estimated from the data. The first constraint implies that items can vary in their discriminations and that higher item scores reflect higher ability levels. In the 2PPC model, for each item there are m − 1 independent γ_{ji} difficulty parameters and one a_j discrimination parameter; a total of m independent item parameters are estimated (Ercikan et al., 1998, p. 142).

Item response theory item parameters based on separate country calibrations and on the international calibration are not on the same scale, due to the arbitrariness of the IRT scales. Before comparisons can be made, the scales need to be linked, that is, placed onto a common scale. In this study, linking was carried out using the robust mean and sigma transformation described by Stocking and Lord (1983). The scale of reference was that from the calibration of the combined data sets from all countries, and all items were included in the transformation (Ercikan & Gonzalez, 2008).

TCCs were computed and their graphical representations created using an Excel-based software package, MODFIT (Stark, 2001, 2002). Differences between individual countries and the international sample in the estimated percent of maximum score were determined and compared.

Model-Data Fit

Model-data fit was evaluated with Yen's (1984) Q1 fit statistic, calculated using PARDUX (Burket, 1991). Q1 is a chi-square statistic used to test the goodness of fit between an IRT model and test data. Because Q1 is affected by the number of score points for each item as well as by the sample size (Fitzpatrick, Lee, & Gao, 2001), it is transformed into a standardized z-statistic. It is computed from the squared differences between the probabilities of responding correctly to an item observed in the data and those predicted by the item parameters, summed across deciles of the score scale. As described in Fitzpatrick et al., items are considered to fit poorly if |z| ≥ 4.0 for sample sizes of 1,500. For larger sample sizes, the critical value for determining poor fit is adjusted by a factor of N/1,500, where N is the calibration sample size. For the purposes of this analysis, where the international sample size is N = 6,800, a criterion of |z| ≥ 18 was used.

Factor Structures

Exploratory factor analysis using principal components analysis (PCA) was conducted to examine the factor structures of the PIRLS 2001 Reader booklet data separately for the four countries. Similarity of factor structures provides evidence of similarity of the constructs measured at the scale level in the four country administrations of the PIRLS Reader. The underlying structure and number of factors in the data for each country were examined using the Kaiser-Guttman rule, the percentage of variance accounted for by the factors, and examination of scree plots.

Summary

In this chapter the PIRLS 2001 measure was described, with a particular focus on the Reader Booklet. The target population, sample sizes, and item and booklet formats were explained, and the construct intended to be assessed was discussed in detail. The statistical analyses and procedures used in this study were outlined, including the use of OLR and Poly-SIBTEST for item level analyses, Poly-SIBTEST for bundle level analysis, and TCC comparisons for analysis at the scale level.
The next chapter presents the results of these analyses with a view to answering the research questions that are the focus of this study.  58  CHAPTER IV: RESULTS This chapter presents the results of the study, beginning with descriptive statistics of the PIRLS 2001 Reader booklet data. This is followed by results of the item level analyses using OLR and Poly-SIBTEST, together with comparison of the similarities and differences of the two DIF detection methods. The manner in which items were bundled using the results of the DIF analyses is explained, and the results from the bundle level analysis are then presented. Finally, the results of the scale level analysis of TCC comparisons to investigate DTF and the appropriateness of using the international item parameters for establishing country scores are presented. Descriptive Statistics of the PIRLS 2001 Reader Booklet Data The Reader Booklet was written by a total of 35,987 students from the 35 participating countries. This study analyzed data from all students who took the PIRLS 2001 Reader booklet in the countries of Argentina, Colombia, England and USA. In order to create reference group comparison data for differential functioning studies, a random sample of 200 students from each of the remaining 34 participating countries was drawn, resulting in 4 reference group samples, each composed of 6800 examinees. The sample sizes and descriptive statistics for reference and focal groups are shown in Table 6. The Reader booklet contained 24 items (13 multiple choice and 11 constructed response items) associated with two reading passages. The maximum possible score for the booklet is 32 points. The mean raw scores for Argentina (12.02) and Colombia (12.36) are below their respective reference group mean  59  raw scores, with standardized score differences of 0.78 and 0.74. The mean scores for England (20.73) and USA (20.26) are both above their reference group mean scores, with standardized score differences of -0.47 and -0.41 respectively. Table 6 PJRLS 200] Reader Booklet Descriptive Statistics Sample  Standardized Score  Country  Group  Size  Mean (SD)  Argentina  Reference  6800  17.61 (7.17)  794  12.02 (6.73)  Reference  6800  17.59 (7.21)  Focal  1285  12.36 (6.05)  Reference  6800  17.33 (7.22)  777  20.73 (6.79)  6800  17.35 (7.22)  926  20.26 (6.50)  Focal Colombia  England  Focal USA  Reference Focal  a  Reference group  —  Differencea  0.78  0.74  -0.47  -0.41  Focal group  The score distributions for each of the four countries are shown in Figures 1 and 2. These figures, which show the percentage of students achieving each score level in the four countries, demonstrate similar score distribution patterns for Argentina and Colombia and similar score distribution patterns for England and USA. A majority ofArgentinean and Colombian students score in the lower half of the score range, while the majority of English and American students score in the upper half of the range. In fact, 72.8% and 74.2% of students from Colombia  60  and Argentina respectively have total scores between 0 and 16 points, while 74.1% and 73.8% of students from England and USA respectively have total scores between 16 and 32 points.  Score Distributions, Argentina and Colombia  04812  16  2O28  32  Score  Figure 1. Argentina and Colombia Score Distributions  Score Distributions, England and USA 10  048121620242832 Score  Figure 2. 
England and USA Score Distributions  61  Item Level Analyses This study used two DIF methods to examine item level equivalence, OLR and Poly-SIBTEST. Two methods were chosen because it has been shown that different DIF detection methods perform best under different conditions, so it is advisable to consider items as containing DIF if they are identified by two methods (Ercikan et al., 2004; Kim & Cohen, 1995). Results of these two analyses are presented next, followed by a comparison of these results. Ordinal Logistic Regression Results of the OLR analysis are shown in Table 7. OLR identified between 54 and 75% of the items in the Reader booklet as being differentially functioning. All items identified as DIF by OLR were classified as having negligible effect sizes using Jodoin and Gierl’s (2001) effect size classification scheme outlined in the previous chapter, indicating that although significant the amount of variance accounted for by the grouping variable (individual country versus international data) is small. Table 7 Ordinal Logistic Regression DIF Results DIF Items Number  Argentina 13  Colombia 18  England 16  USA 16  Percentage  54.2  75.0  66.7  66.7  Note: Total number of items =24 Note:  p.Ol  62  Poly-SIB TEST Results of the standard, one-item-at-a-time, DIF analysis using Poly SIBTEST also reveal a large percentage of DIF items, as shown in Table 8. Table 8 shows that the DIF items identified in each country are approximately balanced in terms of whether the focal group (country) or the reference group (international) is favoured. Using the effect size classification of Roussos and Stout (1996), Poly-SIBTEST identified items at all three effect size levels (negligible, moderate and large). Table 8 Poly-SIBTEST DJF Results Country  Argentina  Colombia  DIF Favours  F  R  F  R  # of Items  5  5  8  5  % of Items  41.7  Notes: Total number of items F p  =  54.2 =  England  USA ER  6  7 54.2  4  6 50.0  24  Focal Group, R = Reference Group .01 Comparison of OLR and Poly-SIBTEST DIF Results In this study, the results of DIF analyses using OLR and Poly-SIBTEST  are compared for two purposes. The first is as a method of verifying DIF identification, as noted previously, and to seek convergence between the two DIF methods to identify bundles of items for further analysis. The second purpose is to add to the few existing empirical studies in which such a comparison is made.  63  A consideration of 24 items in each of 4 countries provides 96 comparison points of potential agreement between the two DIF detection methods. Over these 96 comparison points there was 77% agreement between the two methods as to the presence or absence of DIF at the significance level of p  .01.  Although there was high agreement between the two methods as to the presence or absence of DIF, there were differences in the effect sizes detected by the two methods. As noted above, all items identified as DIF by OLR were classified as negligible, whereas Poly-SIBTEST identified items at all three classification levels of effect size. Effect size classifications of both methods are shown in Table 9. Table 9 Number ofDIF items identified by effect size by method Negligible  Moderate  PolyOLR  SIB  ARG  13  2  COL  18  1  ENG  16  4  USA  16  2  Country  Large  PolyOLR -  -  -  -  SIB 2 5 3 5  Poly OLR -  -  -  -  SIB 6 7 6 5  Note: p.OI Items which were identified as DIF at p E .01 by both methods are shown in Table 10. 
As can be seen, all 24 items were identified as DIF by both methods in at least one of the four countries. Since the agreement level between the  64  methods was so high with all items being identified as DIF, further examination was carried out only on items that were identified by both methods in at least three of the four countries. Seven such items were found, being items 4, 6, 8, 9, 15, 18 and 22. These seven items provided the basis on which bundles were chosen for the differential bundle study. Table 10 Items IdentjJied as DIF by Ordinal Logistic Regression and Poly-SIBTEST Item  Arg  Col  Eng  Usa  Item  Arg  13  X  Col  Eng  Usa  # 1  X  2  X  3  X  X  4  X  X  5  X  6  X  7  X  14 15 X  X  X  X  X  18  X  19  X  X  X  X  20  9  X  X  X  X  21  10  X  X  11  X X  X  17  X  22 X  X  16  8  12  X  24  X  X  X X  X  X X  X X  X  23 X  X  X  X  X  X  X  X X  65  Bundle Level Analysis The seven items which were identified as containing DIF by both OLR and Poly-SIBTEST in at least three of the four countries were studied further in an attempt to understand whether there may be some inherent similarities among the items on which to form an organizing principle for bundling. The items were analyzed according to their specifications in the PIRLS 2001 data files. Three of the 7 items were multiple choice and 4 were constructed response items. Four of the items were intended to assess the purpose of reading for literary experience and 3 were intended to assess acquiring and using information. Thus item type and purpose of reading were not considered to be potential sources of DIF. Of interest though, six of the 7 items (items 6, 8, 9, 15, 18 and 22) were designed to assess the reading comprehension process of interpreting and integrating ideas and information, and the remaining one item (item 4) was designed to assess the focus and retrieve reading comprehension process. The finding that 6 of the 7 items were intended to assess the same reading comprehension process provided an organizing principle on which to bundle items for the DBF analysis. Based on Shealy and Stout’s (1993) multidimensional model for DIF, the reading comprehension cognitive levels were hypothesized to contain a secondary dimension which represents a potential source of differential functioning. All Reader booklet items from the reading comprehension cognitive process of interpreting and integrating ideas and information were analyzed as a suspect bundle using Poly-SIBTEST. All remaining items in the Reader booklet formed the matching subtest items for this analysis. Results revealed significant  66  DBF against both Argentina and Colombia, when the basis for bundle formation is the reading comprehension cognitive level of interpreting and integrating ideas and information. At the same time, for this cognitive level bundle, significant DBF in favour of the USA was identified. Similarities amongst the items in this cognitive level were noted when reviewing the items. Four items in this bundle required explanations, six of the items used the word “how”, and four of the items were on the topic of the characters’ feelings. Two examples of items in this bundle are: item 9, how did the hare’s feelings change during the story? ; and item  23, write two different feelings Halla might have after she has set the pufflings free and explain why she might have each feeling. 
Items belonging to the lower cognitive level focus and retrieve reading comprehension process were then analyzed as a suspect bundle, with all other Reader items forming the matching subtest. For this grouping of lower cognitive level items, significant DBF was identified in favour of Argentina and Colombia,  and against England and the USA. Items in this bundle tend to begin with the words “what”, “where” or “why”. Two examples are: item 12, why are puffins clumsy at takeoffs and landings?; and item 16, what happens during the nights of the pufflings? DBF analyses results for both bundles of items are shown in Table II. A positive )3 estimate together with a significant p value (.O5) indicates differential bundle functioning against the focal country (favouring the reference group), and a negative 13 estimate together with a significant p value indicates differential bundle functioning against the reference group and favouring the focal country  67  group. As shown, there is significant DBF against Argentina and Colombia in the “interpret and integrate” bundle, and in favour of the USA. In the “focus and retrieve” bundle there is significant DBF favouring Argentina and Colombia and against England and the USA. Table 11 Differential Bundle Functioning Analyses Suspect  #  Focal  Bundle  Items  Group  12  Interpret & Integrate  Focus & Retrieve  8  p-  DBF against  Estimate  value  (group):  ARG  0.71  0.00  Focal  COL  0.71  0.00  Focal  ENG  0.14  0.19  USA  -0.55  0.00  Reference  ARG  -0.164  0.02  Reference  COL  -0.333  0.00  Reference  ENG  0.121  0.04  Focal  USA  0.317  0.00  Focal  -  Given that each suspect bundle contained more than 5 items, no classification guidelines exist for interpreting effect sizes for these results (Boughton et al, 2000; Gierl et al, 2003; Gierl & Bolt, 2001). However, in all cases where significant DBF was identified, the absolute values of the 13-estimates were greater than 0.1 which would have been classified as moderate to large had the bundles contained fewer items.  68  Scale Level Analyses In order to investigate the degree of scale level equivalence, and whether differential item and bundle functioning are manifested at the test level, an analysis of DTF was conducted using PARDUX (Burket, 1991). Item response theory parameters were estimated for the international reference sample and for each of the four focal country samples separately using the 2PL and PPC models (Fitzpatrick et al., 1996) discussed in chapter 3. In order to link item parameters from separate country calibrations to their respective international sample a robust mean and sigma transformation as described by Stocking and Lord (1983) was carried out. All items were included in the transformation. Table 12 shows the average discrimination and difficulty item parameters in the original metric prior to transformation (columns “a” and “b”), the discrimination and difficulty of the items in the transformed metric (colunms “a(t)” and “b(t)”) and the transformation parameters used (columns “Ml” and “M2”).  
Table 12
Average original and transformed discrimination and difficulty parameters

Country   N    a      b       a(t)   b(t)    M1a    M2b
ARG       24   1.16   0.51    1.14   -0.34   1.01   -0.85
COL       24   1.04   0.28    1.36   -0.49   0.77   -0.71
ENG       24   1.32   -1.26   1.30   -0.75   1.01   0.52
USA       24   1.26   -1.17   1.35   -0.66   0.93   0.43

a M1: Stocking and Lord multiplicative factor
b M2: Stocking and Lord additive factor

An analysis of the correlations between country-specific and international item parameters was conducted. This provides information about the similarity of the ordering of the discrimination and difficulty parameters in the two sets of calibrations. These correlations are summarized in Table 13. The discrimination (a) parameter correlations between the country-specific calibrations and the international calibration ranged from 0.66 (for Argentina) to 0.85 (for Colombia). The correlations of the difficulty (b) parameters were higher, ranging from 0.89 (for Argentina) to 0.94 (for USA). These correlations tend to be moderate to high compared to other item parameter correlations conducted with international assessments (Ercikan & Koh, 2005).

Table 13
Correlation of a and b parameters for each country with the international parameters

Country     a parameters   b parameters
Argentina   0.66           0.89
Colombia    0.85           0.92
England     0.71           0.93
USA         0.84           0.94

Test characteristic curves were computed and graphed using MODFIT (Stark, 2001) based on the international and linked country parameters. TCC comparisons for each country are shown in Figures 3 to 6. Each country's TCC looks very similar to its international TCC. These results tend to support the notion of score comparability at the test level (i.e., DTF is not evident in the TCC comparisons).

Figure 3. TCC comparison for Argentina (ARG versus INT-ARG curves, plotted against the latent ability scale).

Figure 4. TCC comparison for Colombia (COL versus INT-COL curves, plotted against the latent ability scale).

Figure 5. TCC comparison for England (ENG versus INT-ENG curves, plotted against the latent ability scale).

Figure 6. TCC comparison for USA (USA versus INT-USA curves, plotted against the latent ability scale).

Differences in the percent of maximum score estimated between the international sample and individual countries were determined, together with descriptive statistics of the differences. These results are shown in Table 14. The negative differences for England and the USA indicate that these two countries would receive higher scores if scores were created based on their own country parameters instead of on the international parameters. The positive differences for Argentina and Colombia indicate that they receive higher scores based on the international parameters than they would if scores were based on their own country parameters.

Table 14
Descriptive statistics of differences between country TCCs and international TCCs

Country     Minimum    Maximum   Mean      Std. Deviation
Argentina   -0.0084    0.0349    0.0175    0.0129
Colombia    -0.0118    0.0276    0.0066    0.0124
England     -0.0585    0.0224    -0.0202   0.0300
USA         -0.0429    0.0146    -0.0118   0.0193

Differences between focal and reference TCCs were also examined along the ability scale, and the results are plotted in Figures 7 and 8. A comparison of the score distributions shown in Figures 1 and 2 with the TCC differences shown in Figures 7 and 8 reveals that the region along the ability scale where Argentina and Colombia would be most advantaged by the use of international parameters coincides with the region of the score distribution where most of their students are located. Conversely, the region along the ability scale where England and USA would be disadvantaged coincides with the region of the score distribution where the fewest students in those two countries are found.

Figure 7. Argentina and Colombia: differences in TCCs based on international versus country parameters, plotted along the ability scale.

Figure 8. England and USA: differences in TCCs based on international versus country parameters, plotted along the ability scale.

Model-Data Fit

An evaluation of model-data fit using Yen's (1984) Q1 fit statistic was also conducted using PARDUX (Burket, 1991) to test the goodness of fit between the IRT models and the test data. Items were considered to fit poorly if |z| ≥ 4.0 for individual country data and if |z| ≥ 18.0 for the international samples. Four of the 24 items were found to display poor fit. Item #19 exhibited poor fit across all four countries and all four international data sets. Item #17 exhibited poor fit in all four international data sets, as well as in Argentina and Colombia. Items #7 and #18 displayed poor fit in Colombia and Argentina, respectively. Items displaying poor fit are summarized in Table 15. Items 17, 18 and 19 were identified as DIF using both DIF methods in either two (items 17 and 19) or three (item 18) countries, and all three items are intended to assess the reading comprehension process of interpreting and integrating ideas and information. They also contributed to the DBF observed when the interpret and integrate reading comprehension process items were analyzed as the suspect bundle. Item 7 was identified as DIF by both DIF methods in one country (Colombia), and is intended to assess the reading comprehension process of making straightforward inferences.

Table 15
Q1 goodness of fit results: |z| values for items identified as misfitting

                  Country (Focal) Dataa               International (Reference) Datab
Item No.     ARG     COL     ENG     USA         ARG      COL      ENG      USA
7            -       6.95    -       -           -        -        -        -
17           4.04    4.33    -       -           38.25    37.58    39.77    37.21
18           4.27    -       -       -           -        -        -        -
19           6.56    5.38    10.06   12.58       63.54    70.54    63.17    61.84

a |z| ≥ 4.0
b |z| ≥ 18.0

Factor Structures

Evidence of construct comparability was analyzed through an examination of factor structures for each of the four countries separately using exploratory factor analysis (EFA). The method of extraction was principal components analysis, using LISREL/PRELIS, version 2.50 (SSI Scientific Software International, 2000). The criteria for establishing factor structure used in this study were the Kaiser-Guttman rule, the eigenvalue ratio of factor 1 to factor 2, and examination of scree plots.
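These criteria can be illustrated with a small sketch. It is not the operational analysis: it assumes Python with NumPy, uses ordinary Pearson correlations of the raw item scores (whereas PRELIS offers correlation types better suited to ordinal data), and the function name factor_structure_summary is hypothetical.

```python
import numpy as np

def factor_structure_summary(scores):
    """Eigenvalue-based summary used for the factor-structure criteria.

    scores : examinees x items array of item scores for one country sample.
    """
    corr = np.corrcoef(scores, rowvar=False)             # item intercorrelation matrix
    eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]    # PCA eigenvalues, largest first
    return {
        "eigenvalues": eigvals,                           # plotted in order, these form the scree plot
        "n_factors_kaiser": int((eigvals >= 1.0).sum()),  # Kaiser-Guttman rule
        "ratio_factor1_factor2": eigvals[0] / eigvals[1],
        "pct_variance_factor1": 100.0 * eigvals[0] / eigvals.sum(),
    }
```

A large first-to-second eigenvalue ratio and a sharp elbow after the first eigenvalue are the patterns taken as evidence of one dominant factor in the results that follow.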
Using the Kaiser-Guttman rule for the lower bound, the number of factors is equal to the number of components with eigenvalues that are greater than or equal to 1.0 (Gorsuch, 1983). The factor analysis results for factors with eigenvalues greater than 1 .0 are shown in Table 16.  77  Table 16 PCA Eigenvalues and variance explainedfor each factor Factor  Ratio of Factor 1 to  2  3  4  5  6 Factor 2  ARG  COL  ENG  USA  Eigenvalue  8.68  1.72  1.38  1.19  1.11  1.04  % Variance  36.15  7.16  5.76  4.96  4.63  4.32  Cumulative%  36.15  43.31  49.07  54.03  58.66  62.99  Eigenvalue  7.96  1.48  1.41  1.11  1.02  %Variance  33.16  6.19  5.89  4.64  4.24  Cumulative%  33.16  39.35  45.24  49.88  54.12  Eigenvalue  10.05  1.59  1.12  1.01  %Variance  41.88  6.62  4.66  4.2  Cumulative%  41.88  48.5  53.16  57.36  Eigenvalue  9.82  1.57  1.23  1.03  %Variance  40.92  6.53  5.14  4.3  Cumulative %  40.92  47.45  52.6  56.9  -  5.04  5.38  -  -  -  6.32  -  -  -  6.25  -  -  Table 16 presents eigenvalues, percentage of variance accounted for by each factor solution, the cumulative percent of variance accounted for by each additional factor, and the eigenvalue ratios of factor 1 to factor 2, for all four countries. Six factors represent the data for Argentina, 5 for Colombia, and 4 for England and the US. The ratio of eigenvalues for factor I to factor 2 range from  78  5.04 to 6.32, providing evidence of one main factor underlying the data, as well as  support for the IRT assumption of essential unidimensionality of the Reader items in all four countries. The one main factor accounts for approximately 33% to 42% of the variance in the data while the second factor accounts for between 6 and 7% of the variance. Scree plots were also examined in this study. Scree plots are graphs of eigenvalues plotted against each factor representing the data. The number of factors that precede the bend, or elbow, in the plot indicates the maximum number of factors that can be extracted from the data. The scree plots, shown in Figure 9, show that the data are represented by one major factor for all four countries, evidenced by the sharp bend in each graph at the second factor. Scree Plots 11 10  9 8 7  6  —  Argentina  —Colombia 4 —  3  Eitlancl  —USA  1 0 1  2  3  4  5  6  7  ComponentNumber  Figure 9. Scree plots for Reader booklet data.  8  9  10  79  Summary This study presents a score scale comparability analysis of the PIRLS 2001 Reader booklet data for the countries of Argentina, Colombia, England and USA. at the item, bundle, and test levels of analysis. The procedures used and the study results were presented in this chapter. Large differences were found in the raw score means between individual countries and their respective reference group samples, with Argentina and Colombia scoring 0.78 and 0.74 standardized score difference units below their reference means, while England and the USA score 0.47 and 0.41 standardized score units above their reference means. An examination of score distributions for the four countries revealed that nearly three-quarters of students from Argentina and Colombia score in the lower half of the score range, while nearly threequarters of students from England and USA score in the upper half of the score distribution. Exploratory factor analysis results indicate that the overall test structure is very similar for all four countries, with one main factor underlying the data. 
However, secondary factors with eigenvalues greater than 1.0 are present in all countries, and these factors may give rise to differential functioning between the groups. At the item level of analysis, both OLR and Poly-SIBTEST analyses identified a large percentage of differentially functioning items in all four countries. These results are consistent with previous studies of large scale educational assessments. There was a high degree of agreement between the two  80  DIF detection methods as to the presence or absence of DIF. However, the effect size classifications resulting from the two methods were different, with OLR identifiing all items as negligible, whereas Poly-SIBTEST identified approximately 20% of the DIF items as negligible, 30% as moderate, and 50% as large effect size. Due to the large percentage of DIF items identified in the data, inclusion criteria were set for the DBF study. Only those items that were identified as DIF by both methods in three or four countries were included for DBF analysis. Seven items met the criteria for further inclusion. The analysis of item content specifications to seek out organizing principles on which to form hypotheses for the DBF analysis revealed that six of the seven DIF items were intended to assess the higher cognitive level reading comprehension process of integrating and interpreting ideas and information and the remaining item was intended to assess the lower cognitive level reading comprehension process of focusing and retrieving specific information. Two DBF analyses were conducted, one in which the suspect items were comprised of all items from the interpret and integrate process, and one in which the suspect items were comprised of all items from the focus and retrieve process. Significant DBF was revealed against Argentina and Colombia, and in favour of USA, for the higher cognitive level reading comprehension process items. Significant DBF in favour of Argentina and Colombia, and against England and USA, was revealed for the lower cognitive level process items. At the test level, item response theory was used to create test characteristic curves for comparison between country and international (reference) groups.  81  There was a moderate to high correlation between country and international parameters, indicating similar ordering of the parameters in the two sets of calibrations. Analysis of model-data fit revealed four items with poor fit to the model, three of which belong to the interpret and integrate reading comprehension process. Test characteristic curves for each country and its international reference sample are very similar, indicating possible score comparability, or lack of differential functioning, at the test level. An analysis of differences between international and country TCCs along the latent ability continuum reveals that Argentina and Colombia may be favoured by the use of international parameters to create country scores in the region where these countries have the greatest score density, that is, in the lower half of the score distribution. Conversely, England and the USA are disadvantaged by the use of international parameters in the region where they have fewer numbers of students. The following chapter will provide a discussion of these results, together with implications for practice, contributions of these findings to the research literature, future directions for study, and limitations of this study.  
CHAPTER V: DISCUSSION

Large-scale international assessments of achievement necessitate the translation and adaptation of one assessment into many different languages for use in a wide variety of countries and cultures worldwide. Results from these assessments are used to make comparisons across countries, and evidence of score comparability is required for such comparisons to be meaningful. This study investigated the degree of score comparability in the Reader booklet of the PIRLS 2001 large-scale educational assessment, using results from four countries: Argentina, Colombia, England, and the USA. Four research questions were addressed:

1. What is the degree of DIF in the PIRLS data? What are the similarities in DIF detection by OLR and Poly-SIBTEST?

2. At the bundle level, are there any bundles of items in the PIRLS Reader booklet that may result in DIF amplification or cancellation?

3. At the test level, what are the similarities or differences in test characteristic curves based on the international parameters versus those based on individual country parameters? Are the international parameters appropriate for creating country scores?

4. Why do large degrees of differential functioning in large-scale assessments at the item level result in only small degrees of differences at the test level?

The findings from this study confirm current DIF research in the field of score comparability in large-scale assessment, and expand on the research base regarding DBF and the comparability of test characteristic curves at the test level. This chapter provides a discussion of the research results, together with implications for practice, contributions of these findings to the research literature, future directions for study, and limitations of this study.

Discussion of Findings

Research Question #1: Differential Item Functioning Analyses

Differential item functioning analyses were conducted using two methods, OLR and Poly-SIBTEST, to investigate score comparability in the PIRLS 2001 Reader booklet at the level of individual items. Both methods identified large percentages of DIF against all four countries when compared to their international reference sample. Converging the results of the two methods to verify the presence of DIF showed that all 24 items were identified as DIF by both methods in at least one country. In addition, 11 items (45.8%) were identified by both methods in 2 countries, 4 items (16.67%) were identified in 3 countries, and 3 items (12.5%) were identified as DIF in all 4 countries by both methods.

The results of this study confirm many previous studies that have demonstrated a large degree of DIF in international large-scale assessments (Ercikan et al., 2004; Sireci & Allalouf, 2003). When comparisons at the level of individual items are made between an individual country and a sample intended to represent all countries, there are a significant number of items that function differentially between these groups. Despite being matched on ability, the groups have different probabilities of correctly responding to individual items. The DIF that is present is fairly evenly distributed across the four countries involved, and does not appear to systematically favour focal or reference groups. This indicates that language of testing alone does not account for the DIF that is present. Recall from Chapter 1 that the finding of empirical evidence of DIF is necessary but not sufficient to conclude that bias is present.
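For readers who want to see how the statistical flag is raised in the first place, the following Python sketch shows a two-degree-of-freedom ordinal logistic regression DIF test in the general form described by Zumbo (1999): a base model conditioning on total score is compared with an augmented model that adds group membership and the group-by-total interaction. The sketch assumes statsmodels' OrderedModel and hypothetical variable names; it is not the software used in this study.

import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.miscmodels.ordinal_model import OrderedModel

def olr_dif_test(item, total, group):
    """Likelihood-ratio test for uniform plus non-uniform DIF on one item.

    item:  ordinal item scores (e.g., 0/1 or 0/1/2).
    total: matching variable (total test score).
    group: 0 for the reference group, 1 for the focal group.
    """
    y = pd.Series(item).astype(int)
    base = pd.DataFrame({"total": np.asarray(total, dtype=float)})
    aug = base.assign(group=np.asarray(group, dtype=float),
                      interaction=base["total"] * np.asarray(group, dtype=float))

    ll_base = OrderedModel(y, base, distr="logit").fit(method="bfgs", disp=False).llf
    ll_aug = OrderedModel(y, aug, distr="logit").fit(method="bfgs", disp=False).llf

    chi2 = 2 * (ll_aug - ll_base)          # 2 df: group effect plus interaction
    return chi2, stats.chi2.sf(chi2, df=2)

The effect size used in this study, Jodoin and Gierl's delta R-squared, is obtained from the same pair of models by comparing their pseudo R-squared values; that step is omitted here to keep the sketch short.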
Once an item has been determined to exhibit DIF, it is necessary to attempt to identify sources of the DIF and use judgment to decide whether the item is biased. One strategy for analyzing possible sources of DIF is to examine the content and topic of DIF items (Ercikan, 2002). The 7 items that were identified as DIF by both methods in three or four countries were examined for their content. Six of these 7 items were intended to assess one cognitive level of reading comprehension, the level of interpreting and integrating ideas and information from the text. As a result of these analyses, a hypothesis was formed that the bundle of items intended to assess the cognitive level of interpreting and integrating ideas and information would contain DBF. This hypothesis was tested using Poly-SIBTEST in a DBF analysis, the results of which are discussed in the section addressing the findings of the DBF study.

Comparison of Results from Poly-SIBTEST and Ordinal Logistic Regression

This study has demonstrated a high correspondence in results between the two DIF methods, with 77% agreement between them as to the presence or absence of DIF. This was expected, as previous research has shown that each of the two methods is capable of identifying DIF in large-scale assessments under circumstances such as those in this research study.

Once DIF has been identified, it is important to determine an effect size to aid in understanding whether statistically significant DIF has practical significance. In this study, all items identified by OLR were classified as negligible according to the guidelines established by Jodoin and Gierl (2001). Recall that Jodoin and Gierl used cubic curve regression to predict the OLR effect size from the SIBTEST effect size, and determined a concordance for negligible, moderate, and large effect sizes between the two methods. However, while all OLR DIF items were classified as negligible, only 19% of the items identified as DIF by Poly-SIBTEST were classified as negligible. In addition, Poly-SIBTEST classified 31% and 50% of the items as moderate and large DIF effect sizes, respectively. Therefore, this research does not support the findings of Jodoin and Gierl on the correspondence between the effect sizes for the two methods.

Research Question #2: Differential Bundle Functioning Analyses

DBF analyses have greater power than analyses of individual items. When bundles of items are selected using an organizing principle, and the substantive characteristics of the items are revealed in the structure of the data, differential functioning may be attributed to the basis on which the items are organized into bundles. Based on the joint results of the two DIF analyses, item type, purpose for reading, and language of testing were hypothesized not to be sources of DIF. However, it was hypothesized that reading comprehension cognitive level may constitute a source of differential functioning. Sources of bias that may contribute to reading comprehension differential functioning in this study were also considered. One source of bias that may be particularly relevant to the results of this study is bias that is introduced when test content is not equally aligned with the curricula for all groups being tested (The Standards, AERA, APA & NCME, 1999).
Differential functioning in reading comprehension items across countries in the PIRLS 2001 Reader booklet may reflect bias resulting from differences between the participating countries' curricula, in either what is intended to be taught or what is actually taught. Other potential sources of bias that may be relevant to the results of this study are opportunity to learn and motivation related to differential access to learning materials, and culture (in that students from different cultures may approach reading comprehension differently).

In order to test the hypothesis of differential functioning related to reading comprehension cognitive levels, all items in the PIRLS 2001 Reader booklet from the reading comprehension process of interpreting and integrating ideas and information were analyzed for DBF. Indeed, significant DBF was found against Argentina and Colombia, and in favour of the USA, for the "interpret and integrate" bundle. Next, all items from the lower cognitive level "focus and retrieve" reading comprehension process were analyzed as a bundle, and significant DBF was found in favour of Argentina and Colombia, and against England and the USA. These results support the hypothesis that reading comprehension cognitive level is a source of differential functioning in the PIRLS 2001 Reader booklet, and provide evidence of a lack of score comparability at the bundle level of analysis.

There is a considerable amount of evidence in the PIRLS 2001 documentation that supports the hypothesis that the reading comprehension DIF represents a source of bias caused by differences in curricula and opportunity to learn across the four countries, as well as differences in the motivation of the students. The PIRLS 2001 Encyclopedia (Mullis, Martin, Kennedy, & Flaherty, 2002) describes the participating countries' policies and practices that guide school organization and classroom reading instruction in the lower grades, and provides contextual information necessary for appropriate interpretation of PIRLS results. A review of the Encyclopedia reveals potentially significant differences between the four countries in this study related to curriculum and differential opportunity to learn the skills associated with reading comprehension. Examples are differences in: the age at which reading instruction begins and the percentages of students receiving early reading instruction; the number of years of formal education students have had when they are assessed; the curriculum taught, as well as differences between the intended curriculum and what is actually taught; and student access to interesting reading materials both at home and at school (Arrieta & Parreño, 2002; Rush & Kapinus, 2002; Sáenz, Rocha, Garcia, & Riveros, 2002; Twist, 2002).

The availability of books is related to student motivation to learn to read and to enjoy reading. Students are motivated to read by the availability of a variety of informational and literary resources across a wide range of difficulties (Early Reading Expert Panel, 2003; Guthrie & Alao, 1997). Arrieta and Parreño (2002) state that the only book available to each Argentinean student is a graded reader or his/her own textbook. In fact, "most classes (and even schools) do not have a bookshelf or reading corner for students to use" (pp. 4-5). Argentinean students are disadvantaged by a lack of reading materials, which represents a source of bias related to motivation.
Several studies have been noted in which culture is shown to affect interpretation of what is read (see, for example, Bock, 2006; Murata, 2007; Reynolds, Taylor, Steffensen, Shirey, & Anderson, 1982). Bock demonstrated that the interpretation of what is read varies according to ethnicity or culture, independent of reading comprehension ability. Li, Cohen, and Ibarra (2004) contend that cultural context can confound the measurement of knowledge on standardized tests, and that culturally distinct groups have specific patterns of thinking and learning that may constitute sources of DIF.

Research Question #3: Differential Test Functioning and TCC Comparisons

Results of DTF analyses using TCC comparisons are similar to those seen in previous research (Ercikan & Sandilands, in press; Ercikan & Gonzalez, 2008). These findings appear to support score comparability at the test level between country and international TCCs, with a maximum absolute difference of approximately 6%. One additional contribution of this study is that it extends previous research by examining TCC differences along the ability scale and comparing each country's TCC difference to its score distribution. Results reveal that the region along the ability scale where Argentina and Colombia would be most advantaged by the use of international parameters to create their country scores coincides with the region in the score distribution where most of their students are located. Therefore, the scores of a large proportion of students in Argentina and Colombia may be artificially inflated (by up to 3.5% of the maximum score) through the use of international item parameters to create individual country scores. Conversely, the region along the ability scale where England and the USA would be disadvantaged coincides with the region in the score distribution where the fewest students are found in those two countries. In other words, a relatively small proportion of student scores in England and the USA may be artificially deflated (by up to 6% of the maximum score for England and 4% for the USA) through the use of international item parameters. It is therefore possible that the use of international item parameters to create individual country scores may have provided a relative advantage to Argentina and Colombia, which creates the appearance of comparability at the test level (a very small degree of DTF is evident in the TCCs) despite incomparability at the item level (high levels of DIF) and at the bundle level (DBF based on cognitive levels in reading comprehension processes).

In addition, analysis of model-data fit revealed four items (17%) with poor fit to the IRT model, indicating that for these items the item response theory assumption of model fit to the data may not be met. A poorly fitting IRT model will not yield invariant item or ability parameters. In other words, one of the fundamental assumptions of IRT will not be met: that item parameters do not depend on the ability distribution of the examinees, and that examinee parameters do not depend on the set of test items (Hambleton et al., 1991). Three of the poorly fitting items are from the interpret and integrate reading comprehension cognitive level, which lends supporting evidence to the finding that these cognitive level items may be functioning differentially in the PIRLS 2001 Reader booklet.

Research Question #4: Why do large degrees of differential functioning in large-scale assessments at the item level result in only small degrees of differences at the test level?
This fourth research question was addressed by examining the results of the analyses for the first three research questions. Similar to other research, the TCC comparisons appear to support score comparability at the test level; yet both DIF and DBF have been found in the Reader booklet data, which fails to provide support for score comparability at the item and bundle levels. Three explanations seem feasible for this apparent discrepancy. First is the possibility that bundles of items are cancelling each other at the test level, in a manner similar to the item-level cancellation that Nandakumar (1993) observed. The DBF of higher cognitive level items may cancel that of the lower cognitive level items in such a way that at the test level neither is detected. The second possibility is that the current method of detecting DTF through examination of IRT TCCs is not sensitive enough to capture potentially subtle amounts of differential functioning that are spread over the ability scale. A third possibility, and one that seems to be supported by the findings of this study, is that the current scaling methods that use IRT to estimate item parameters based on an international sample are not appropriate for creating individual country scores.

Implications, Contributions to Research, and Further Research Suggestions

Each of the research findings noted in the previous sections has implications for practice and leads to suggestions for further research. These implications will have an impact on decisions by many stakeholders in the realms of test development and use.

The first implication is related to validity and the use of scores from the PIRLS 2001 Reader booklet to compare country results. This study has added to the research by providing statistical evidence of differential functioning at the item and bundle levels for items that tap higher cognitive level reading comprehension processes, specifically the interpretation and integration of information and ideas from what is read. Substantive evidence to support this study's statistical finding of differential functioning has been found in the literature on reading comprehension, providing support for the interpretation that the differential functioning is indicative of bias. Recall that equivalence can be defined as a lack of bias, which is required in order to make valid comparisons across cultures or other groups (van de Vijver & Tanzer, 2004). Interpretations of score comparisons across countries that took part in PIRLS 2001 should therefore be made with caution as, without appropriate validity evidence, such comparisons may lack meaning and may result in erroneous decisions. Test developers and publishers have a duty, according to the Standards (AERA, APA & NCME, 1999) and the ITC Guidelines (International Test Commission, 2000), to establish equivalence through the application of appropriate statistical techniques and to provide validity evidence for all intended populations. The results of this study point to the need for test developers and publishers to fulfill these obligations before assessment results are used for important policy or other high-stakes decisions. It may be necessary for test developers to provide evidence at all three levels of analysis in order to allow test users to make appropriately informed decisions.

Further implications arise from the discrepancy between score comparability evidence at the test level versus the item and bundle levels.
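To illustrate the test-level comparison referred to above, the Python sketch below computes test characteristic curves from two sets of item parameters and reports the maximum difference as a percentage of the maximum score. It assumes a generalized partial credit parameterization for every item, which may differ from the calibration models actually used in the study, and the parameter values are invented for the example. In practice the two calibrations would first have to be placed on a common scale (for example, through Stocking-Lord linking) before their TCCs were compared; that step is omitted here.

import numpy as np

def gpcm_expected_score(theta, a, b_steps):
    """Expected item score under a generalized partial credit model."""
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    b = np.asarray(b_steps, dtype=float)
    z = np.zeros((theta.size, b.size + 1))
    z[:, 1:] = a * np.cumsum(theta[:, None] - b[None, :], axis=1)  # cumulative a*(theta - b_j) terms
    p = np.exp(z - z.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)                              # category probabilities
    return p @ np.arange(b.size + 1)                               # expected score at each theta

def tcc(theta, item_params):
    """Test characteristic curve: sum of expected item scores at each ability level."""
    return sum(gpcm_expected_score(theta, a, b) for a, b in item_params)

theta = np.linspace(-4.0, 4.0, 161)

# Hypothetical (discrimination, step-difficulty) pairs for a tiny three-item test,
# one set per calibration; real analyses would use the full Reader booklet.
intl_params = [(1.0, [0.0]), (0.8, [-0.5, 0.5]), (1.2, [0.3])]
country_params = [(0.9, [0.2]), (0.8, [-0.3, 0.6]), (1.1, [0.1])]

max_score = sum(len(b) for _, b in intl_params)
diff_pct = 100 * (tcc(theta, intl_params) - tcc(theta, country_params)) / max_score
print(round(float(np.abs(diff_pct).max()), 2))   # maximum TCC difference as % of max score

Reading the signed difference curve against each country's score distribution, as was done in this study, then shows where along the ability scale a country gains or loses from the use of international parameters.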
This discrepancy may not represent the absence of differential functioning at the test level, but rather may reflect bundle cancellation, the insensitivity of current methods to detect DTF, or the use of inappropriate methods of determining country scores based on international item parameters. If DTF is present but not being detected, true group differences will be masked, and test results may be used to make unfounded policy, curriculum, or funding decisions. Further research is needed to gain a better understanding of which, if any, of these three postulated factors is contributing to the apparent discrepancy between item- and test-level analyses. In addition, new methods of detecting DTF may be necessary to increase the likelihood of detecting DTF when it is present.

This study has contributed to research by introducing preliminary evidence for the possibility that the use of international item parameters to create individual country scores may provide a relative advantage to some countries due to the locations of their score distributions. This relative advantage may then create the appearance of comparability at the test level despite evidence for incomparability at both the item and bundle levels. Significant research indicates that when there is insufficient evidence to support measurement invariance across groups, score scales should be based on separate country calibrations, which are then linked to establish comparability (Cook, 2006; Ercikan & Gonzalez, 2008; Sireci, 1997). The implication for test developers may be that score scales should be estimated separately by country and then linked. Further research should be conducted in this regard. Although this study has demonstrated the possibility of a relative advantage to some countries through the use of international parameters, it has not established whether there is, or may be, a significant effect on country comparisons or the ordering of countries due to such an advantage. Future studies should attempt to quantify the relative advantages and disadvantages to determine their overall effect.

This study also has implications regarding the comparison of effect sizes from OLR and Poly-SIBTEST. Based on statistical significance and effect sizes, test developers and administrators make decisions to exclude items from assessments at the test development stage, or from scoring after a test has been administered. Clearly, in this study, different interpretations regarding practical significance, and different decisions, would flow from the same data depending on which method was used to analyze the data and determine the effect size. Thus the comparability of the existing guidelines for determining effect sizes with these two DIF methods comes into question. Jodoin and Gierl's (2001) effect size prediction was based on equal sample sizes for the focal and reference groups, whereas this study may be limited by the use of unequal sample sizes. It is important to note, though, that unequal sample sizes are a common occurrence in DIF analyses of large-scale assessments and therefore should be taken into account in setting effect size criteria. Jodoin and Gierl also found in their power simulation study that OLR power tended to be lower in the condition of unequal group size combined with unequal ability; again, this condition exists in the current study. Future research may be required to investigate the effect of unequal sample size and ability on effect size criteria and their comparability across the two DIF detection methods.
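To make the classification issue concrete, the short Python sketch below applies effect size cut-offs of the kind discussed above. The specific threshold values shown are the ones commonly cited in the DIF literature for the Jodoin-Gierl delta R-squared and for SIBTEST-family beta-hat statistics; they are included only as an illustration and should be verified against Jodoin and Gierl (2001) and Roussos and Stout (1996) before being relied on.

def classify_olr_effect(delta_r2):
    """Classify an OLR DIF effect size using Jodoin-Gierl style delta R-squared cut-offs."""
    if delta_r2 < 0.035:
        return "negligible"
    return "moderate" if delta_r2 < 0.07 else "large"

def classify_sibtest_effect(beta_hat):
    """Classify a (Poly-)SIBTEST effect size using ETS-style |beta-hat| cut-offs."""
    b = abs(beta_hat)
    if b < 0.059:
        return "negligible"
    return "moderate" if b < 0.088 else "large"

# The same flagged item can fall into different categories under the two schemes,
# which is the kind of divergence reported in this study.
print(classify_olr_effect(0.010), classify_sibtest_effect(0.095))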
This study has contributed to research in the field of large-scale educational assessments such as PIRLS by demonstrating the use of Poly-SIBTEST DBF analyses to establish potential causes for differential functioning seen at the item level. Using statistical results from the DIF study, hypotheses were generated as to the cause of DIF and were then tested in the DBF study. Results of the DBF study confirmed DIF related to cognitive levels of reading comprehension.

Lastly, although this study revealed potential sources of bias in the PIRLS assessment, there are additional sources of bias that could be contributing to DIF. An area requiring further study is the effect of age differences and developmental abilities on reading comprehension results. Students from England and the USA represent a fairly homogeneous age group, whereas the ages of students in Argentina and Colombia exhibit much greater variance. This study has not addressed the impact that age variation within a country may have on PIRLS 2001 comparability results.

Limitations of this Study

This study has examined score comparability in international educational assessments using data from the PIRLS 2001 Reader booklet. It is unknown whether these results would generalize to other international educational assessments, or to the remaining booklets in the PIRLS 2001 study. Further studies should be conducted to determine whether similar results would be found in other international educational assessments. In this study an international sample reference group was created with which to compare the country focal groups, which resulted in unequal reference and focal group sample sizes. The sample sizes for the reference groups were 6800 students, whereas the country sample sizes ranged from 777 to 1285. The results of this study should be replicated using more equally balanced sample sizes.

Summary

This study analyzed score scale comparability in the PIRLS 2001 Reader booklet data in four countries. Score comparability was found to be jeopardized by the presence of differential functioning related to the cognitive levels intended to be assessed by the reading comprehension questions. Potential sources of bias may be related to cultural interpretations of reading material, differential curricula and opportunities to learn, and differential motivation to learn reading comprehension across the countries. These findings have significant implications related to the valid use of results from the PIRLS 2001 assessment and the types of decisions which can appropriately be made from these results.

REFERENCES

Abbott, M. L. (2007). A confirmatory approach to differential item functioning on an ESL reading assessment. Language Testing, 24(1), 7-36.

Alberta Education. (2007). Alberta's grade 4 students demonstrate strong results on international literacy test. Retrieved August 31, 2008, from http://education.alberta.ca/department/news/2007/november/20071128.aspx

Allalouf, A., Hambleton, R. K., & Sireci, S. G. (1999). Identifying the causes of DIF in translated verbal items for testing and evaluation. Journal of Educational Measurement, 36(3), 195-198.

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Arrieta, J., & Parreño, V. (2002). Argentina. In I. V. S. Mullis, M. O. Martin, A. M. Kennedy, & C. L.
Flaherty (Eds.), PIRLS 2001 encyclopedia: A reference guide to reading education in the countries participating in IEA's Progress in International Reading Literacy Study (PIRLS). Chestnut Hill, MA: Boston College.

Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37, 29-51.

Bock, T. (2006). A consideration of culture in moral theme comprehension: Comparing Native and European American students. Journal of Moral Education, 35, 71-87.

Bolt, D. M. (2002). A Monte Carlo comparison of parametric and nonparametric polytomous DIF detection methods. Applied Measurement in Education, 15(2), 113-141.

Bond, L., Moss, P., & Carr, P. (1996). Fairness in large scale performance assessment. In G. W. Phillips (Ed.), Technical issues in large-scale performance assessment (pp. 117-140). National Center for Education Statistics. Retrieved from http://nces.ed.gov/pubs/96802.pdf

Boughton, K. A., Gierl, M. J., & Khaliq, S. N. (2000, May). Differential bundle functioning on mathematics and science achievement tests: A small step toward understanding differential performance. Paper presented at the annual meeting of the Canadian Society for Studies in Education (CSSE), Edmonton, Canada.

British Columbia Ministry of Education. (2004). Policy document: Large scale assessment. Retrieved from http://www.bced.gov.bc.ca/policy/policies/large_scale_assess.htm

British Columbia Ministry of Education. (2007). B.C. students among top in the world for literacy. News Release 2007EDU0169-001536. Retrieved from http://www2.news.gov.bc.ca/news_releases_2005-2009/2007EDU0169-001536.htm

Burket, G. R. (1991). PARDUX [Computer software]. Monterey, CA: CTB/McGraw-Hill.

Camilli, G., & Shepard, L. A. (1994). Methods for identifying biased test items. Thousand Oaks, CA: Sage Publications.

Campbell, J. R., Kelly, D. L., Mullis, I. V. S., Martin, M. O., & Sainsbury, M. (2001). Framework and specifications for PIRLS assessment 2001 (2nd ed.). Chestnut Hill, MA: Boston College.

Chang, H. H., Mazzeo, J., & Roussos, L. (1996). Detecting DIF for polytomously scored items: An adaptation of the SIBTEST procedure. Journal of Educational Measurement, 33, 333-353.

Clauser, B. E., & Mazor, K. M. (1998). Using statistical procedures to identify differentially functioning test items. Educational Measurement: Issues and Practice, 17, 31-44.

Cook, L. (2006, July). Practical considerations in linking scores on adapted tests. Keynote address at the 5th international meeting of the International Test Commission, Brussels, Belgium.

Crundwell, R. M. (2005). Alternative strategies for large scale student assessment in Canada: Is value-added assessment one possible answer. Canadian Journal of Educational Administration and Policy, 41. Retrieved September 5, 2007.

DeMars, C. E. (2008). Polytomous differential item functioning and violations of ordering of the expected latent trait by the raw score. Educational and Psychological Measurement, 68, 379-396.

DePascale, C. A. (n.d.). The ideal role of large-scale testing in a comprehensive assessment system. National Center for the Improvement of Educational Assessment. Retrieved January 6, 2008.

Dorans, N. J., & Schmitt, A. P. (1991). Constructed response and differential item functioning: A pragmatic approach (Research Rep. No. RR-91-47). Princeton, NJ: Educational Testing Service.

Douglas, J. A., Roussos, L.
A., & Stout, W. (1996). Item-bundle DIF hypothesis testing: Identifying suspect bundles and assessing their differential functioning. Journal of Educational Measurement, 33(4), 465-484.

Drasgow, F. (1987). Study of the measurement bias of two standardized psychological tests. Journal of Applied Psychology, 72, 19-29.

Early Reading Expert Panel. (2003). Early reading strategy: The report of the Expert Panel on Early Reading in Ontario. Toronto, ON: Ontario Ministry of Education.

Education Quality and Accountability Office. (2008). National and international assessments program overview. Retrieved August 31, 2008, from http://www.eqao.com/NIA/NIA.aspx?status=logout&Lang=E

Ercikan, K. (2002). Disentangling sources of differential item functioning in multilanguage assessments. International Journal of Testing, 2, 199-215.

Ercikan, K., Gierl, M. J., McCreith, T., Puhan, G., & Koh, K. (2004). Comparability of bilingual versions of assessments: Sources of incomparability of English and French versions of Canada's national achievement tests. Applied Measurement in Education, 17(3), 301-321.

Ercikan, K., & Gonzalez, E. (2008, March). Score scale comparability in international assessments. Paper presented at the annual meeting of the National Council on Measurement in Education, New York, USA.

Ercikan, K., & Koh, K. (2005). Construct comparability of the English and French versions of TIMSS. International Journal of Testing, 5, 23-35.

Ercikan, K., & Sandilands, D. (in press). Score scale comparability in international assessments. IERI Monograph Series. IEA-ETS Research Institute.

Ercikan, K., Schwarz, R. D., Julian, M. W., Burket, G. R., Weber, M. M., & Link, V. (1998). Calibration and scoring of tests with multiple-choice and constructed-response item types. Journal of Educational Measurement, 35, 137-154.

Fitzpatrick, A. R., Lee, G., & Gao, F. (2001). Assessing the comparability of school scores across test forms that are not parallel. Applied Measurement in Education, 14(3), 285-306.

Fitzpatrick, A. R., Link, V. B., Yen, W. M., Burket, G. R., Ito, K., & Sykes, R. C. (1996). Scaling performance assessments: A comparison of one-parameter and two-parameter partial credit models. Journal of Educational Measurement, 33(3), 291-314.

Fouad, N. A., & Walker, C. M. (2005). Cultural influences on responses to items on the Strong Interest Inventory. Journal of Vocational Behaviour, 66, 104-123.

Foy, P., & Joncas, M. (2003). PIRLS sampling design. In M. O. Martin, I. V. S. Mullis, & A. M. Kennedy (Eds.), PIRLS 2001 technical report. Chestnut Hill, MA: Boston College.

Gierl, M. J. (2005). Using dimensionality-based DIF analyses to identify and interpret constructs that elicit group differences. Educational Measurement: Issues and Practice, 24(1), 3-14.

Gierl, M. J., Bisanz, J., Bisanz, G. L., & Boughton, K. A. (2003). Identifying content and cognitive skills that produce gender differences in mathematics: A demonstration of the multidimensionality-based DIF analysis paradigm. Journal of Educational Measurement, 40(4), 281-306.

Gierl, M. J., Bisanz, J., Bisanz, G. L., & Boughton, K. A. (2001). Illustrating the utility of DBF analyses to identify and interpret group differences on achievement tests. Educational Measurement: Issues and Practice, 20(2), 26-36.

Gierl, M. J., & Bolt, D. M. (2001). Illustrating the use of nonparametric regression to assess differential item and bundle functioning among multiple groups. International Journal of Testing, 1(3&4), 249-270.

Gierl, M. J., Jodoin, M.
G., & Ackerman, T. A. (2000, April). Performance of Mantel-Haenszel, simultaneous item bias test and logistic regression when the proportion of DIF items is large. Paper presented at the annual meeting of the American Educational Research Association (AERA), New Orleans, USA.

Gierl, M. J., & Khaliq, S. N. (2001). Identifying sources of differential item and bundle functioning on translated achievement tests: A confirmatory analysis. Journal of Educational Measurement, 38(2), 164-187.

Gonzalez, E. J., & Kennedy, A. M. (Eds.). (2003). PIRLS 2001 user guide for the international database. Chestnut Hill, MA: Boston College.

Gorsuch, R. L. (1983). Factor analysis (2nd ed.). Hillside, NJ: Lawrence Erlbaum Associates.

Guthrie, J. T., & Alao, S. (1997). Designing contexts to increase motivations for reading. Educational Psychologist, 32(2), 95-105.

Hambleton, R. K. (1993). Translating achievement tests for use in cross-national studies. European Journal of Psychological Assessment, 9(1), 57-58.

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage Publications.

Holland, P. W., & Wainer, H. (Eds.). (1993). Differential item functioning. Hillside, NJ: Lawrence Erlbaum.

International Test Commission. (2000). ITC test adaptation guidelines.

Jiang, H., & Stout, W. (1998). Improved type I error control and reduced estimation bias for DIF detection using SIBTEST. Journal of Educational and Behavioral Statistics, 23(4), 291-322.

Jodoin, M. G., & Gierl, M. J. (2001). Evaluating type I error and power rates using an effect size measure with the logistic regression procedure for DIF detection. Applied Measurement in Education, 14(4), 329-349.

Kim, S.-H., & Cohen, A. S. (1995). A comparison of Lord's chi-square, Raju's area measures, and the likelihood ratio test on detection of differential item functioning. Applied Measurement in Education, 8, 291-312.

Li, Y., Cohen, A. S., & Ibarra, R. A. (2004). Characteristics of mathematics items associated with gender DIF. International Journal of Testing, 4(2), 115-136.

Mantel, N. (1963). Chi-square tests with one degree of freedom: Extensions of the Mantel-Haenszel procedure. Journal of the American Statistical Association, 58, 690-700.

Martin, M. O., Mullis, I. V. S., & Kennedy, A. M. (2003). PIRLS 2001 technical report. Chestnut Hill, MA: Boston College.

Mendes-Barnett, S., & Ercikan, K. (2006). Examining sources of gender DIF in mathematics assessments using a confirmatory multidimensional model approach. Applied Measurement in Education, 19(4), 289-304.

Miller, T. R., & Spray, J. A. (1993). Logistic discriminant function analysis for DIF identification of polytomously scored items. Journal of Educational Measurement, 30, 107-122.

Millsap, R. E., & Everson, H. T. (1993). Methodology review: Statistical approaches for assessing measurement bias. Applied Psychological Measurement, 17, 297-334. doi:10.1177/014662169301700401

Mullis, I. V. S., Martin, M. O., & Gonzalez, E. J. (2004). PIRLS 2001 international achievement in the processes of reading comprehension. Chestnut Hill, MA: Boston College.

Mullis, I. V. S., Martin, M. O., Gonzalez, E. J., & Kennedy, A. M. (2003). PIRLS 2001 international report: IEA's study of reading literacy achievement in primary schools in 35 countries. Chestnut Hill, MA: Boston College.

Mullis, I. V. S., Martin, M. O., Kennedy, A. M., & Flaherty, C. L. (2002).
PIRLS 2001 encyclopedia: A reference guide to reading education in the countries participating in IEA's Progress in International Reading Literacy Study (PIRLS). Chestnut Hill, MA: Boston College.

Muraki, E. (1992). A generalized partial credit model: Applications of an EM algorithm. Applied Psychological Measurement, 16, 159-176.

Murata, K. (2007). Unanswered questions: Cultural assumptions in text interpretation. International Journal of Applied Linguistics, 17, 38-59.

Nandakumar, R. (1993). Simultaneous DIF amplification and cancellation: Shealy-Stout's test for DIF. Journal of Educational Measurement, 30(4), 293-311.

Narayanan, P., & Swaminathan, H. (1994). Performance of the Mantel-Haenszel and simultaneous item bias procedures for detecting differential item functioning. Applied Psychological Measurement, 18, 315-328. doi:10.1177/014662169401800403

Pae, T.-I. (2004). DIF for examinees with different academic backgrounds. Language Testing, 21, 53-73.

Pae, T.-I., & Park, G.-P. (2006). Examining the relationship between differential item functioning and differential test functioning. Language Testing, 23(4), 475-496.

Potenza, M. T., & Dorans, N. J. (1995). DIF assessment for polytomously scored items: A framework for classification and evaluation. Applied Psychological Measurement, 19, 23-37.

Puhan, G., & Gierl, M. J. (2006). Evaluating the effectiveness of two-stage testing on English and French versions of a science achievement test. Journal of Cross-Cultural Psychology, 37, 136-154. doi:10.1177/0022022105284492

Reynolds, R. E., Taylor, M. A., Steffensen, M. S., Shirey, L. L., & Anderson, R. C. (1982). Cultural schemata and reading comprehension. Reading Research Quarterly, 17(3), 353-366.

Robin, F., Sireci, S. G., & Hambleton, R. K. (2003). Evaluating the equivalence of different language versions of a credentialing exam. International Journal of Testing, 3(1), 1-20.

Ross, T. R. (2007). The impact of multidimensionality on the detection of differential bundle functioning using SIBTEST. Retrieved from Dissertations & Theses: Full Text database. (AAT 3301009)

Roussos, L., & Stout, W. (1996). A multidimensionality-based DIF analysis paradigm. Applied Psychological Measurement, 20, 355-371. doi:10.1177/014662169602000404

Rush, G., & Kapinus, B. (2002). United States. In I. V. S. Mullis, M. O. Martin, A. M. Kennedy, & C. L. Flaherty (Eds.), PIRLS 2001 encyclopedia: A reference guide to reading education in the countries participating in IEA's Progress in International Reading Literacy Study (PIRLS). Chestnut Hill, MA: Boston College.

Russell, S. S. (2005). Estimates of Type I error and power for indices of differential bundle and test functioning. Retrieved from Dissertations & Theses: Full Text database. (AAT 3175804)

Sáenz, C., Rocha, M., Garcia, G., & Riveros, E. (2002). Colombia. In I. V. S. Mullis, M. O. Martin, A. M. Kennedy, & C. L. Flaherty (Eds.), PIRLS 2001 encyclopedia: A reference guide to reading education in the countries participating in IEA's Progress in International Reading Literacy Study (PIRLS). Chestnut Hill, MA: Boston College.

Sainsbury, M., & Campbell, J. (2003). Developing the PIRLS reading assessment. In M. O. Martin, I. V. S. Mullis, & A. M. Kennedy (Eds.), PIRLS 2001 technical report. Chestnut Hill, MA: Boston College.

Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph No. 17.

Shealy, R., & Stout, W. (1993).
A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DTF as well as item bias/DIF. Psychometrika, 58, 159-194.

Sireci, S. G. (1997). Problems and issues in linking assessment across languages. Educational Measurement: Issues and Practice, 16, 2-19.

Sireci, S. G., & Allalouf, A. (2003). Appraising item equivalence across multiple languages and cultures. Language Testing, 20(2), 148-166.

Slocum, S. L., Gelin, M. N., & Zumbo, B. D. (in press). Statistical and graphical modeling to investigate differential item functioning for rating scale and Likert item formats. In B. D. Zumbo (Ed.), Developments in the theories and applications of measurement, evaluation, and research methodology across the disciplines, Volume 1. Vancouver, BC: Edgeworth Laboratory, University of British Columbia.

SSI Scientific Software International. (2000). LISREL 8.5. Lincolnwood, IL: Scientific Software International, Inc.

Stark, S. (2001). MODFIT: A computer program for model-data fit. Unpublished manuscript, University of Illinois at Urbana-Champaign.

Stark, S. (2002). MODFIT: Plot theoretical item response functions and examine the fit of dichotomous or polytomous IRT models to response data [Computer program]. Department of Psychology, University of Illinois at Urbana-Champaign.

Stocking, M. L., & Lord, F. M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7, 201-210.

Stout, W., Bolt, D., Froelich, A. G., Habing, B., Hartz, S., & Roussos, L. (2003). Bundle methodology for improving test equity with applications for GRE test development (GRE Board Professional Report No. 98-15P, ETS Research Report 03-06). Princeton, NJ: Educational Testing Service.

Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27, 361-370.

The Roussos-Stout Software Development Group. (n.d.). Manual for dimensionality-based DIF package version 1.7: SIBTEST, Poly-SIBTEST, Crossing SIBTEST, IRT-based educational and psychological measurement software. St. Paul, MN: Assessment Systems Corporation.

Twist, L. (2002). England. In I. V. S. Mullis, M. O. Martin, A. M. Kennedy, & C. L. Flaherty (Eds.), PIRLS 2001 encyclopedia: A reference guide to reading education in the countries participating in IEA's Progress in International Reading Literacy Study (PIRLS). Chestnut Hill, MA: Boston College.

van de Vijver, F., & Tanzer, N. K. (2004). Bias and equivalence in cross-cultural assessment: An overview. Revue européenne de psychologie appliquée, 54, 119-135. doi:10.1016/j.erap.2003.12.004

Volante, L. (2006). An alternative vision for large-scale assessment in Canada. Journal of Teaching and Learning, 4(1), 1-14.

Yen, W. M. (1984). Effects of local independence on the fit and equating performance of the three-parameter logistic model. Applied Psychological Measurement, 8, 125-145.

Zumbo, B. D. (1999). A handbook on the theory and methods of differential item functioning (DIF): Logistic regression modeling as a unitary framework for binary and Likert-type (ordinal) item scores. Ottawa, ON: Directorate of Human Resources Research and Evaluation, Department of National Defense.

Zumbo, B. D. (2003). Does item-level DIF manifest itself in scale-level analyses? Implications for translating language tests. Language Testing, 20, 136-147.

Zumbo, B. D. (2008, July). Statistical methods for investigating item bias in self-report measures. In F.
Maggino (Seminar Organizer), Dottorato in Metodologia delle Scienze Sociali, Firenze, Italy.

Zwick, R., Donoghue, J., & Grima, A. (1993). Assessment of differential item functioning for performance tasks. Journal of Educational Measurement, 30, 233-251.

Zwick, R., Thayer, D. T., & Mazzeo, J. (1997). Descriptive and inferential procedures for assessing differential item functioning in polytomous items. Applied Measurement in Education, 10(4), 321-344.

Appendix A: Reader Booklet

PIRLS Reader: The Natural World
PIRLS 2001, Progress in International Reading Literacy Study, Main Survey 2001

Contents
Part 1: Hare Heralds the Earthquake, a short story by Rosalind Kerven
Part 2: Nights of the Pufflings, by Bruce McMillan

Hare Heralds the Earthquake
by Rosalind Kerven

There was once a hare who was always worrying. "Oh dear," he muttered all day long, "oh deary, deary me." His greatest worry was that there might be an earthquake. "For if there was," he said to himself, "whatever would become of me?" He was feeling particularly anxious about this one morning, when suddenly an enormous fruit fell down from a nearby tree—CRASH!—making the whole earth shake. The hare leaped up. "Earthquake!" he cried. And with that he raced across the fields to warn his cousins.

"Earthquake! Run for your lives!" All the hares left the fields and madly followed him. They raced across the plains, through forests and rivers and into the hills, warning more cousins as they went.

"Earthquake! Run for your lives!" All the hares left the rivers and plains, the hills and forests and madly followed. By the time they reached the mountains, ten thousand hares pounded like thunder up the slopes. Soon they reached the highest peak. The first hare gazed back to see if the earthquake was coming any closer, but all he could see was a great swarm of speeding hares. Then he looked in front, but all he could see was more mountains and valleys and, far in the distance, the shining blue sea.

As he stood there panting, a lion appeared. "What's happening?" he asked. "Earthquake, earthquake!" babbled all the hares. "An earthquake?" asked the lion. "Who has seen it? Who has heard it?" "Ask him, ask him!" cried all the hares, pointing to the first one. The lion turned to the hare. "Please Sir," said the hare shyly, "I was sitting quietly at home when there was a terrible crash and the ground shook and I knew it must be a quake, Sir, so I ran as fast as I could to warn all the others to save their lives." The lion looked at the hare from his deep, wise eyes. "My brother, would you be brave enough to show me where this dreadful disaster happened?" The hare didn't really feel brave enough at all, but he felt he could trust the lion. So, rather timidly, he led the lion back down the mountains and the hills, across the rivers, plains, forests and fields, until at last they were back at his home. "This is where I heard it, Sir." The lion gazed around—and very soon he spotted the enormous fruit which had fallen so noisily from its tree.

He picked it up in his mouth, climbed onto a rock and dropped it back to the ground. CRASH! The hare jumped. "Earthquake! Quickly—run away—it's just happened again!" But suddenly he realised that the lion was laughing. And then he saw the fruit rocking gently by his feet.
"Oh," he whispered, "it wasn't really an earthquake after all, was it?" "No," said the lion, "it was not and you had no need to be afraid." "What a silly hare I've been!" The lion smiled kindly. "Never mind, little brother. All of us—even I—sometimes fear things we cannot understand." And with that he padded back to the ten thousand hares that were still waiting on top of the mountain, to tell them that it was now quite safe to go home.

Hare Heralds the Earthquake, based on an Indian folk tale. © Rosalind Kerven. First published in Legends of the Animal World (Cambridge University Press, 1986).

End of Part 1. Now go to your question booklet.

Nights of the Pufflings
by Bruce McMillan

Every year, black and white birds with orange bills visit the Icelandic island of Heimaey. These birds are called puffins. They are known as "clowns of the sea" because of their bright bills and clumsy movements. Puffins are awkward fliers during takeoffs and landings because they have chunky bodies and short wings.

Halla lives on the island of Heimaey. She searches the sky every day. As she watches from high on a cliff overlooking the sea, she spots her first puffin of the season. She whispers to herself "Lundi," which means "puffin" in Icelandic. Soon the sky is speckled with them—puffins, puffins everywhere. They are returning from their winter at sea, returning to Halla's island and the nearby uninhabited islands to lay eggs and raise puffin chicks. These "clowns of the sea" return to the same burrows year after year. It's the only time they come ashore.

Halla and her friends climb over the cliffs to watch the birds. They see pairs tap-tap-tap their beaks together. Each pair they see will soon tend an egg deep inside the cliffs. When the puffin eggs have hatched, the parents will bring fish home to feed their chicks. Each chick will grow into a young puffling. The nights of the pufflings will come when each puffling takes its first flight. Although the nights of the pufflings are still long weeks away, Halla thinks about getting some cardboard boxes ready.

All summer long the adult puffins fish and tend to their chicks. By August, flowers blanket the burrows. With the flowers in full bloom, Halla knows that the wait for the nights of the pufflings is over. The hidden chicks have grown into young pufflings. Now it's time for Halla and her friends to get out their boxes and torches for the nights of the pufflings. Starting tonight, and for the next two weeks, the pufflings will be leaving for their winter at sea. In the darkness of the night, the pufflings leave their burrows for their first flight. It's a short, wing-flapping trip from the high cliffs. Most of the birds splash-land safely in the sea below. But some get confused by the village lights—perhaps they think the lights are moonbeams reflecting on the water. Hundreds of the pufflings crash-land in the village every night. Unable to take off from the flat ground, they run around and try to hide.

Halla and her friends will spend each night searching for stranded pufflings that haven't made it to the water. But the village cats and dogs will be searching, too. Even if the cats and dogs don't get them, the pufflings might get run over by cars or trucks. The children must find the stray pufflings first. By ten o'clock the streets of Heimaey are alive with roaming children. Halla and her friends race to rescue the pufflings.
Armed with torches, they wander through the village, searching dark places. Halla spots a puffling. She races after it, grabs it, and puts it safely in a cardboard box.

For two weeks all the children of Heimaey sleep late in the day so they can stay out at night. They rescue thousands of pufflings. Every night Halla and her friends take the rescued pufflings home. The next day, with the boxes full of pufflings, Halla and her friends go down to the beach. It's time to set the pufflings free. Halla releases one first. She holds it up so that it will get used to flapping its wings. Then, holding the puffling snugly in her hands, she swings it up in the air and launches it out over the water beyond the surf. The puffling flutters just a short distance before splash-landing safely. Day after day Halla's pufflings paddle away, until the nights of the pufflings are over for the year. As she watches the last of the pufflings and adult puffins leave for their winter at sea, Halla bids them farewell until next spring. She wishes them a safe journey as she calls out, "Goodbye, goodbye."

Excerpted from Nights of the Pufflings by Bruce McMillan. © 1995 by Bruce McMillan. Reprinted by permission of Houghton Mifflin Company. All rights reserved.

Stop. End of Part 2. Now go to your question booklet.

Directions

In this test, you will read stories or articles and answer questions about what you have read. You may find some parts easy and you may find some parts difficult. You will be asked to answer different types of questions. Some of the questions will be followed by four choices. You will choose the best answer and fill in the circle next to that answer. Example 1 shows this type of question.

Example 1
1. How many days are there in a week?
A. 2 days
B. 4 days
C. 7 days
D. 10 days

The circle next to "7 days" is filled in because there are 7 days in a week. For some questions you will be asked to write your answers in the space provided in your booklet. Example 2 shows this kind of question.

Example 2
2. Where does the little boy go after he finds the book?

For questions like Example 2, you will see a [symbol] or a [symbol] next to it. Example 3 shows a question with a [symbol] next to it.

Example 3
3. What makes the ending of the story both happy and sad? Use what you have read in the story to help you explain.

You will have 40 minutes to work in your test booklet and then we will take a short break. Then, you will work for another 40 minutes. Do your best to answer all the questions. If you cannot answer a question, move on to the next one.

Stop. Do not turn the page until you are told to do so.

Questions: Hare Heralds the Earthquake

1. What was the hare's greatest worry?
A. a lion
B. a crash
C. an earthquake
D. a falling fruit

2. What made the whole earth shake?
A. an earthquake
B. an enormous fruit
C. the fleeing hares
D. a falling tree

3. Things happened quickly after the hare shouted "Earthquake!" Find and copy two words in the story that show this.
1.
2.

4. Where did the lion want the hare to take him?

5. Why did the lion drop the fruit onto the ground?
A. to make the hare run away
B. to help the hare get the fruit
C. to show the hare what had happened
D. to make the hare laugh

6. How did the hare feel after the lion dropped the fruit onto the ground?
A. angry
B. disappointed
C. foolish
D. worried

7. Write two ways in which the lion tried to make the hare feel better at the end of the story.
1.
2.

8. Do you think the lion liked the hare? What happens in the story that shows this?

9. How did the hare's feelings change during the story?
At the beginning of the story the hare felt __________ because __________.
At the end of the story the hare felt __________ because __________.

10. You learn what the lion and the hare are like from the things they do in the story. Describe how the lion and the hare are different from each other and what each does that shows this.

11. What is the main message of this story?
A. Run away from trouble.
B. Check the facts before panicking.
C. Even lions that seem kind cannot be trusted.
D. Hares are fast animals.

Stop. End of this part of the booklet. Please stop working.

Questions: Nights of the Pufflings

1. Why are puffins clumsy at takeoffs and landings?
A. They live in a land of ice.
B. They hardly ever come to shore.
C. They spend time on high cliffs.
D. They have chunky bodies and short wings.

2. Where do the puffins spend the winter?
A. inside the cliffs
B. on the beach
C. at sea
D. on the ice

3. Why do the puffins come to the island?
A. to be rescued
B. to look for food
C. to lay eggs
D. to learn to fly

4. How does Halla know the pufflings are about to fly?
A. Parents bring fish to the pufflings.
B. Flowers are in full bloom.
C. Chicks are hidden away.
D. Summer has just begun.

5. What happens during the nights of the pufflings?
A. Puffin pairs tap-tap-tap their beaks together.
B. Pufflings take their first flight.
C. Puffin eggs hatch into chicks.
D. Pufflings come ashore from the sea.

6. What could the people in the village do to stop the pufflings from landing there by mistake?
A. turn off the lights
B. get the boxes ready
C. keep the cats and dogs inside
D. shine their torches in the sky

Questions 7 and 8 ask you to explain how Halla rescues the pufflings.

7. Explain how Halla uses her torch to rescue the pufflings.

8. Explain how Halla uses the cardboard boxes to rescue the pufflings.

9. According to the article, which of these is a danger faced by the pufflings?
A. drowning while landing in the sea
B. getting lost in the burrows
C. not having enough fish from their parents
D. being run over by cars and trucks

10. Why does it need to be daylight when the children release the pufflings? Use information from the article to explain.

11. What do the pufflings do after Halla and her friends release them?
A. walk on the beach
B. fly from the cliff
C. hide in the village
D. swim in the sea

12. Write two different feelings Halla might have after she has set the pufflings free. Explain why she might have each feeling.
1.
2.

13. Would you like to go and rescue pufflings with Halla and her friends? Use what you have read to help you explain.

Stop. End of this part of the booklet. Please stop working.
