UBC Theses and Dissertations


Investigations of parameter invariance in IRT models: theoretical and practical avenues for understanding a fundamental property of measurement. Rupp, André Alexander, 2003

INVESTIGATIONS OF PARAMETER INVARIANCE IN IRT MODELS: THEORETICAL AND PRACTICAL AVENUES FOR UNDERSTANDING A FUNDAMENTAL PROPERTY OF MEASUREMENT

by

ANDRÉ ALEXANDER RUPP

B.A. (equivalent), The University of Hamburg, 1997
M.A., Northern Arizona University, 1999
M.S., Northern Arizona University, 2001

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES (Department of Educational and Counselling Psychology, and Special Education; Programme: Measurement, Evaluation and Research Methodology)

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
June 2003
© André Alexander Rupp, 2003

UBC Rare Books and Special Collections - Thesis Authorisation Form

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.
The University of British Columbia
Vancouver, Canada

Abstract

The quest for invariance is the quest for scientific generalizability, and parameter invariance is thus a fundamental property of measurement that is of interest to both theoreticians and practitioners. To investigate invariance properly, a precise mathematical definition is required, which contrasts sharply with the more conceptual and philosophical usage of the term; indeed, it is not uncommon to find that researchers think of the invariance of parameters in a measurement model as a guaranteed property of such models. This dissertation deconstructs this myth through a series of four studies that are connected by a consistent logic of inquiry for understanding what does and does not constitute parameter invariance and how a lack of parameter invariance can be assessed, quantified, and accounted for. The first study shows how the use of correlation coefficients can be insufficient to show that parameter invariance holds, as such coefficients miss group-level differences in the data. The second and third studies show how biases due to a lack of invariance can be analytically derived and numerically quantified, and they reveal that the practical impact of such biases is minor for many conditions. Furthermore, the work shows how the formalization of invariance provides a unique frame for discussions of model optimality, because it is shown that no single unidimensional item response theory model possesses superior optimality properties under all conditions that are considered. The fourth study illustrates how attitudinal and background variables can be used to create examinee profiles, which can be used to group examinees when investigating differential functioning of item sets for these groups using novel tools from functional data analysis.
Specifically, using data from the TIMSS 1999 large-scale assessment, the study shows how observed differential performance can be accounted for using these profiles. The work in this dissertation thus combines multiple theoretical and practical perspectives to better understand this fundamental property of measurement.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
List of Acronyms
Acknowledgements
Chapter I: Introduction
Chapter II: The Quest for Invariance
Chapter III: How to Quantify and Report Whether Parameter Invariance Holds: Why Pearson Correlations Are Not Enough
Chapter IV: Bias Coefficients for Lack of Invariance in Unidimensional IRT Models
Chapter V: An Analytic and Numerical Look at Biases from β-drift across Unidimensional IRT Models
Chapter VI: Quantifying Subpopulation Differences for a Lack of Invariance Using Complex Examinee Profiles: An Exploratory Multi-group Approach Using Functional Data Analysis
Chapter VII: Implications for the Validation and Use of Assessments
References
Appendix

List of Tables

1. Factor correlations for an oblique 8-factor solution under principal components extraction using the sample covariance matrix
2. Variables with loadings exceeding .25 for all 8 factors
3. Descriptive factor labels for orthogonal 8-factor solution
4. Number of terminal nodes and relative error of 'best' trees under different calibration settings
5. 10 most important variables across 8 calibration settings
6. Model-fit and item-fit statistics for calibration of 12 target items
7. Item parameter values for target items under the 3PL for single examinee population
8. Score distribution among terminal nodes in CART tree
A1. Values of Δij for α' = α + .5
A2. Values of Δij for β' = β + .4
A3. Values of Δij for α' = α + .5 and β' = β + .4
A4. Variable loadings for exploratory factor analysis of background variables as precursor to cluster analysis

List of Figures

1. Three ICCs for three hypothetical items calibrated with an IRT model
2. Item parameters for two groups on an achievement test with 12 items
3. Surface and contour plots of Δij for α' = α + .5
4. ICCs for item with α-drift of .5
5. Surface and contour plots of Δij for β' = β + .4
6. ICCs for item with β-drift of .4
7. Surface and contour plots of Δij for α' = α + .5 and β' = β + .4
8. Surface and contour plots of Δij for α' = α + .5 and β' = β + .4
9. Effect of drift for item with identical discrimination calibrated with 2PL and 3PL
10. Surfaces for Δij as a function of location differences, item discrimination values, and amounts of parameter drift
11. Effect of drift for an item with identical discrimination calibrated with 2PL and 3PL
12. Effect of drift for an item with different discrimination calibrated with 1PL and 2PL
13. Scree plot of eigenvalues for exploratory PC factor analysis
14. Scree plot for PC extraction of eigenvalues from polychoric correlation matrix of item responses
15. Scatterplot of estimates for k-means and CART groupings
16. Smoothed difficulty parameter regression functions for k-means and CART
17. Functional eigenvalue plots for k-means and CART groupings
18. PC scores for 10-group k-means cluster solution and 8-group CART solution

List of Acronyms

1PL: One-parameter Logistic
2PL: Two-parameter Logistic
3PL: Three-parameter Logistic
CART: Classification and Regression Trees
CFA: Confirmatory Factor Analysis
CTT: Classical Test Theory
CV: Cross-validation
DBF: Differential Bundle Functioning
DIF: Differential Item Functioning
DTF: Differential Test Functioning
EFA: Exploratory Factor Analysis
FDA: Functional Data Analysis
F-PCA: Functional Principal Components Analysis
ICC: Item Characteristic Curve
IPD: Item Parameter Drift
IRT: Item Response Theory
ISRF: Item Step Response Function
LAD: Least Absolute Deviation
LC: Linear Combinations
LOI: Lack of Invariance
LQ: Lower Quartile
LS: Least Squares
MBC: Model-based Clustering
MC: Multiple Choice
M-MC: Math Multiple Choice
MVN: Multivariate Normal
PC: Principal Components Analysis
PPMCC: Pearson's Product-moment Correlation Coefficient
SEM: Structural Equation Modeling
TCC: Test Characteristic Curve
TIMSS: Third International Math and Science Study
UQ: Upper Quartile

Acknowledgements

There are many people that have influenced the work of this dissertation, and while it is impossible to thank all of them, it is more than appropriate to thank at least a few of them specifically. I would like to thank first and foremost my advisor Dr. Bruno Zumbo for his unconditional support, unsurpassed encouragement, and continuing inspiration. I sincerely thank him on a professional and personal level for the influence he has been on my life. Similarly, I would like to thank Dr. Terry Ackerman, Dr. Kadriye Ercikan, Dr. Anita Hubley, Dr. Hillel Goelman, and Dr. Pam Ratner for carving out time from their busy schedules to provide me with valuable feedback as reviewers of my work. On a personal note, I would like to thank my parents Hans Rupp and Elke Rupp, who have been in my heart and mind throughout the entire process even though they reside several thousand miles away. I love them very much and thank them from the bottom of my heart for the infinite amount of support that they have provided me with in so many different ways. Finally, I would like to thank my friends, in particular Paul Stoesz and Claire Sowerbutt, who have induced the necessary level of fun, relaxation, and sarcasm into my life, as well as all of the colleagues that I have been fortunate to work with.
Chapter I - Introduction

It has been said that all scientific research is the quest for generalizability (Engelhard, 1992, 1994), and surely this sounds like an apt statement for most applied quantitative work. Indeed, most quantitative researchers agree that if their scientific endeavors are to have any practical impact, it is necessary to show that the conclusions made from these endeavors are applicable to a context that is broader than the single instance for which the data were collected. Similarly, the field of psychometric assessment is generally concerned with determining for which populations of respondents (i.e., examinees) and stimuli (i.e., items or measurement opportunities) a given measurement process holds. However, the process of collecting evidence for such generalizations is much more difficult than many people appreciate, and the tools with which psychometricians have attacked this issue have changed considerably over the years.

Psychometric modeling is concerned with making inferences about examinees and assessment instruments based on data that were collected from a particular sample of examinees and a particular set of items. In fact, the first thing to appreciate about data that have been collected for any measurement instrument is that they have been collected at the intersection of examinee and item sets, which highlights that two distinct directions of generalization are typically of interest. First, it is of interest to make statements about the functioning of a particular assessment instrument for groups of examinees that share characteristics with those examinees that have already been scored with it. Second, it is of interest to make statements about the functioning of item sets that share characteristics with those items that are already included on a particular test form.
For example, it is often of interest to show that the scores from different examinee groups are comparable if either the same instrument is administered to the different groups, a parallel version of the instrument is administered to the different groups, or selected subsets of items are administered to the different groups. This requires that item properties, such as how difficult they are or how well they discriminate among learners of different levels of ability, are comparable across different examinee groups. These desiderata are natural goals for psychometric assessments, but to realize them one requires mathematical models that allow one to link observations without introducing any biases along the way.

Test-level versus Item-level Models

Historically, the predominant way to calibrate a set of items and to subsequently score the examinees on these items has been through a series of models collected under the term classical test theory (CTT). The models benefit from the simplicity with which many of their indices can be derived and the relatively small sample sizes that are needed to arrive at reasonable estimates, which is why they continue to be used by many measurement specialists. In CTT, calibrations are done at the scale or test level so that the representative statistic on which examinees are compared is their observed total score, which is an estimate of their unknown and unobserved true score, and which is likely to be somewhat off due to measurement error. The amount of measurement error can be estimated and is inversely related to the reliability of the test, which is the correlation between test scores obtained at two occasions. In other words, the more similar scores are at different occasions, the more reliable they are and, hence, the less measurement error there is.
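The inverse relationship between reliability and measurement error described above is commonly expressed through the classical standard error of measurement, SEM = σ_X √(1 − ρ_XX'), where σ_X is the observed-score standard deviation and ρ_XX' the reliability. A minimal sketch (the numbers are hypothetical, not taken from this dissertation):

```python
import math

def standard_error_of_measurement(score_sd: float, reliability: float) -> float:
    """Classical SEM: the expected spread of observed scores around the true
    score, computed as the score SD times sqrt(1 - reliability)."""
    return score_sd * math.sqrt(1.0 - reliability)

# Hypothetical test with observed-score SD of 10 and test-retest reliability .91
print(standard_error_of_measurement(10.0, 0.91))  # approximately 3.0
```

Note that this single number is assumed to apply to all examinees, a point the discussion of IRT below returns to.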
Unfortunately, the score that an examinee obtains on a test in this fashion is highly dependent on the items that he or she is responding to, so that a test with overall "easier" items will lead to a higher total score and thus a higher estimate of overall "ability", and vice versa. Similarly, one typically also wishes to make statements about the properties of items, with their "difficulty" and "discriminatory power" being the main quantities of interest. In CTT, "item difficulty" is defined as the proportion of examinees in the calibration set that respond to an item correctly and "item discrimination" is defined as the point-biserial correlation between the scores for an individual item and the total test score. Unfortunately, as these quantities depend on the examinees to whom the items are administered, an item will appear "easier" if it is administered to a group of examinees that have a higher level of "ability", and vice versa. In other words, the desired quantities are mutually dependent on one another and the inferences about them are thus confounded. Since the comparison of examinee and item characteristics across populations and occasions for the purpose of making generalizable inferences is such a fundamental goal of psychometric work, alternative means for scoring examinee-item data needed to be developed. This has led to the advancement of methods that allow for the separation of examinee and item parameters by conducting analyses not at the test level but at the item level. That is, while statements about individual examinees and the test as a whole are of course still desired, the relationship between examinee "ability" and item responses is explicitly modeled first and foremost for each item separately.
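The two CTT item indices defined above, proportion correct and the point-biserial correlation with the total score, can be sketched as follows (the response vectors are made-up toy data):

```python
def item_difficulty(item_scores):
    """CTT item 'difficulty': the proportion of examinees in the calibration
    set that answer the item correctly."""
    return sum(item_scores) / len(item_scores)

def point_biserial(item_scores, total_scores):
    """CTT item 'discrimination': the Pearson correlation between the 0/1
    item scores and the total test scores (the point-biserial correlation)."""
    n = len(item_scores)
    mean_x = sum(item_scores) / n
    mean_y = sum(total_scores) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(item_scores, total_scores))
    var_x = sum((x - mean_x) ** 2 for x in item_scores)
    var_y = sum((y - mean_y) ** 2 for y in total_scores)
    return cov / (var_x * var_y) ** 0.5

# Toy data: four examinees' 0/1 scores on one item and their total scores
item = [1, 1, 0, 0]
totals = [4, 3, 2, 1]
print(item_difficulty(item))         # 0.5
print(point_biserial(item, totals))  # approximately 0.894
```

Because both indices are computed from a particular calibration sample, they inherit exactly the sample dependence the paragraph above describes.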
Through linearly combining the item functions one can obtain functions for item sets and the test as a whole, which further allows for the computation of functions for statistical item information, conditional standard errors, and even true scores that are similar to those in CTT. Such models have come to be known under the term item response theory (IRT) and employ a latent variable to estimate "ability" at the item level. They generally express the probability of endorsing an item for binary items (i.e., dichotomously scored items) or the probability of responding to a given category for multi-category items (i.e., polytomously scored items) as a function of the latent variable and at least one item parameter separately for each item. Because response probabilities need to be bounded between 0 and 1, these item-level functions are basically latent variable logistic regressions whose functional curves are sigmoidal in shape. The curves are called item characteristic curves (ICCs) for dichotomously scored items, or item step response functions (ISRFs) for polytomously scored items. Since all items and examinees are calibrated on a common scale, one can compare the functional curves visually, and Figure 1 shows the ICCs for three hypothetical items.

Figure 1. Three ICCs for three hypothetical items calibrated with an IRT model (probability of endorsement plotted against the latent variable).

The graph shows how the probability of endorsing an item increases for all three items as the values of the latent variable, which is denoted 'Theta' (symbolically θ), increase as well. This is an assumption typically referred to as monotonicity. However, the ICCs also show the differential properties of the items. Of all three items, Item 1 discriminates most strongly between examinees, specifically between those in the range of −1 to 1 on the θ scale.
In addition, the curve for this item has a lower asymptote at 0.1, indicating that for examinees with little or no "ability" (i.e., a very low value on the latent variable), the probability of endorsing the item is 0.1, which is typically attributed to "guessing". One can also see that Item 1 is the easiest of the three items while Item 3 is the most difficult, as the latter requires an examinee to have a higher value on the latent variable for an equally high probability of correct response. Mathematically, these ideas are captured in the parameters that are used to compute the ICCs, and in this dissertation item parameters will be subscripted with a j while examinee parameters will be subscripted with an i, assuming that individual examinee scores, rather than merely the score distribution of all examinees, are of interest. The parameter that indicates the inflexion point of an ICC is often called the "item difficulty" parameter and is denoted by β_j; the parameter that indicates the slope of an ICC at its inflexion point (i.e., the "steepness" of the curve) is often called the "item discrimination" parameter and is denoted by α_j; and the parameter that indicates the lower asymptote of an ICC is often called the "item pseudo-guessing" parameter and is denoted by γ_j. The model that allows all three item parameters to be freely estimated from data is called the three-parameter logistic (3PL) model, because its link function is a logistic function. With the examinee parameters being the latent variable values denoted by θ_i, the 3PL can be written as follows (e.g., van der Linden & Hambleton, 1997):

P_j(θ_i) = γ_j + (1 − γ_j) · exp(α_j(θ_i − β_j)) / (1 + exp(α_j(θ_i − β_j))),  with α_j > 0, −∞ < β_j, θ_i < ∞, 0 ≤ γ_j < 1.

By restricting the item parameters to fixed values, one obtains special cases of this model with special properties of the ICCs.
If the lower-asymptote parameter is fixed to a certain value (typically γ_j = 0 for all items), then one obtains the two-parameter logistic (2PL) model,

P_j(θ_i) = exp(α_j(θ_i − β_j)) / (1 + exp(α_j(θ_i − β_j))),  with α_j > 0, −∞ < β_j, θ_i < ∞.

If the slope parameter is fixed to a certain value also (typically α_j = 1 for all items), then one further obtains the one-parameter logistic (1PL) model with ICCs that do not intersect,

P_j(θ_i) = exp(θ_i − β_j) / (1 + exp(θ_i − β_j)),  with −∞ < β_j, θ_i < ∞.

These three nested models are the most commonly used IRT models in practice and they are all called unidimensional models, because the mathematical equations contain only one latent variable. In practice, the parameters in these models need to be estimated from data, and several software programs are currently available for this purpose (see Rupp, in press-a, for a review of two popular ones). For a more detailed discussion of basic and more complex IRT models, see, for example, Rupp (in press-b), Junker (1999), or van der Linden and Hambleton (1997). The strength of IRT comes from the specification of these item-level models, which separate examinee and item parameters mathematically and thus provide separable information about item and examinee properties at the item as well as the scale level. Furthermore, estimation of measurement error can be done flexibly at each level of the latent variable, circumventing the traditional CTT assumption that the standard error of measurement is the same for all examinees. As Zumbo and Rupp (in press) note, however, conditional standard errors of measurement are nowadays also available for models in CTT and models for its cousin, generalizability theory (e.g., Brennan, 2001; Brennan & Lee, 1999; Feldt, 1996; Feldt & Qualls, 1998; Kolen, Hanson, & Brennan, 1992), but they are rather infrequently used and do not resolve the fact that CTT models are essentially scale-level models.
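The nesting of the three models can be made concrete in code: the 2PL and 1PL response functions below are literally the 3PL with parameters fixed, mirroring the restrictions described above (the item parameter values in the usage lines are illustrative only):

```python
import math

def icc_3pl(theta, alpha, beta, gamma):
    """3PL: probability of a correct response for an examinee at theta on an
    item with discrimination alpha, difficulty beta, and lower asymptote gamma."""
    z = alpha * (theta - beta)
    return gamma + (1.0 - gamma) * math.exp(z) / (1.0 + math.exp(z))

def icc_2pl(theta, alpha, beta):
    """2PL: the 3PL with the pseudo-guessing parameter fixed at gamma = 0."""
    return icc_3pl(theta, alpha, beta, gamma=0.0)

def icc_1pl(theta, beta):
    """1PL: the 2PL with the discrimination parameter fixed at alpha = 1."""
    return icc_2pl(theta, alpha=1.0, beta=beta)

# At theta = beta (the inflexion point) the 1PL and 2PL give probability .5;
# the 3PL raises that by its lower asymptote. Far below beta, the 3PL
# probability approaches gamma rather than 0.
print(icc_1pl(0.0, beta=0.0))                          # 0.5
print(icc_2pl(1.2, alpha=1.7, beta=1.2))               # 0.5
print(icc_3pl(-8.0, alpha=1.0, beta=0.0, gamma=0.1))   # close to 0.1
```

Keeping the models as literal restrictions of one another makes the nesting transparent, which is exactly the structure exploited later when biases from drift are compared across the 1PL, 2PL, and 3PL.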
The above advantages have made IRT models a popular choice among current-day measurement specialists and have led to their widespread use in the linking of assessments, test development, and adaptive test administration (van der Linden & Hambleton, 1997). But even though the comparison of examinee and item characteristics across different assessments, occasions, or groups has been facilitated, it has by no means led to a guarantee that examinee scores derived from different item sets are identical, that items function identically for different examinee groups, or that they function identically across different assessment occasions. On the contrary, an unbiased comparison of examinee scores or item characteristics presumes that the parameters of the model that is used to calibrate the data be identical across contexts, a property called parameter invariance. However, this is an idealization, which in practice holds only approximately, and some degree of lack of invariance (LOI) is not uncommon. In particular, research on differential item functioning (DIF) and item parameter drift (IPD) is dedicated to investigating the seriousness of biases due to a LOI from a mathematical perspective. Unfortunately, despite the DIF and IPD research, some practitioners fail to understand the widespread implications of a LOI and tend to see cases of it as unusual or exotic, because they believe too strongly that the theoretical properties of IRT models also hold up in practice (e.g., it is sometimes perceived that IRT models "inject" the parameter invariance property into data). Consequently, if this feature of IRT models is misunderstood, it can lead to fundamentally incorrect perceptions of the inferential limits of such models and the modeling process itself.
The problem of parameter invariance is thus a fundamental problem in measurement, as it is the mathematical prerequisite for inferential generalizability and not merely a minor mathematical fact about a few numbers in some abstract model equation.

Goal of Dissertation

The goal of this dissertation is to investigate the problem of what does and does not constitute parameter invariance, specifically item parameter invariance, and how it can be adequately assessed using both theoretical and practical approaches. It uses the mathematical formalization of invariance for IRT models to tease out how the mathematical conditions for invariance inform our conclusions about what can and cannot be claimed about the validity of our inferences based on the modeling process. No treatment currently exists that unifies and clarifies the interconnections between multiple perspectives on LOI in the literature using a strict mathematical lens, and this dissertation thus intends to partially close that gap. While it is true that numerous measurement models such as exploratory factor analysis (EFA) models and confirmatory factor analysis (CFA) models along with structural equation models (SEM) are currently available to the psychometric analyst, IRT models were chosen for the following reasons. First, even though it can be shown that IRT models share many mathematical commonalities with other latent variable models (e.g., Goldstein & Wood, 1989; Junker, 1999; McDonald, 1999; Muthén, 2002), the modeling traditions differ in their emphases on graphical representations of item functions, parameter interpretations, and even estimation processes (see Rupp, in press-b).
For many practitioners who investigate attributes of questionnaires, psychological scales, health measures, or achievement tests in large-scale settings, IRT models are particularly attractive because they are relatively easy to estimate with software and their theory has been explicitly developed to understand examinee and item characteristics. This means that the literature provides a wealth of information regarding the analysis of item and examinee characteristics and how to best present such information. Hence, investigating parameter invariance from a mathematical perspective in the context of IRT models is both theoretically appropriate and practically relevant. While the discussion so far has provided a rationale for investigating parameter invariance, it is necessary to describe some of the work that has been done on this topic in more detail before pursuing individual research questions. The next chapter therefore provides both some additional philosophical considerations about the concept of invariance generally and several results from formal research on parameter invariance specifically.

Chapter II - The Quest for Invariance

According to Webster's New Universal Unabridged Dictionary (1996), invariant simply means 'constant' and, in mathematics more specifically, 'a quantity or expression that is constant throughout a certain range of conditions' (p. 1003). Even though this seems like an intuitively appealing way to define invariance, it is crucial to note at this early stage that it is the latter part of the second definition, the 'range of conditions', which has caused numerous confusions among measurement specialists.
Within the framework of observed-score models, generalizability theory is a particularly powerful framework for decomposing different assessment design components that contribute to measurement error to better understand what conditions influence scoring accuracy in what manner and to what degree (Brennan, 2001). Within the framework of latent-variable models, it is equally necessary to understand the ranges of conditions for which measurement instruments work "properly" (i.e., allow for equally accurate and valid inferences), which is why parameter invariance is such a crucial concept. Invariant quantities are desirable because they are evidence of the fact that a certain phenomenon can be consistently captured under a given lens (e.g., through a mathematical model) and that conclusions within the range of conditions under which invariance is observed can thus be considered directly comparable if not identical. Moreover, as inferential statistical modeling is the quest of drawing inferences from calibration samples to the respective populations from which these were drawn, the inferential bridge to the range of possible conditions under which invariance is expected to hold thus needs to be solidly built. This is, of course, one of the oldest problems in statistical inference; yet, somehow, latent variable modeling, and in particular item response theory (IRT), has cast a spell on many applied analysts, who believe that item parameter invariance is a universal feature that holds across an almost infinite range of conditions.
Indeed, the fact that different conditions such as non-identical examinee populations, non-identical item populations, and non-identical times of assessment lead to 'differential functioning of items' and 'drifted parameter values' should not come as a surprise, because nowhere in the literature does an 'automatic guarantee' exist that parameters are universally invariant (see Meredith, 1993; Meredith & Horn, 2001). Researchers working with the Rasch model have acknowledged this fact for a long time, as they conceive of their model as fundamental additive conjoint measurement and hence attempt to find data that fit the model and not a model that fits the data (Perline, Wright, & Wainer, 1979). If the response pattern of a person or an item does not fit the Rasch model proposed for the given joint examinee-item space, then the person or the item is often discarded. In other words, it is acknowledged that if these units remained in the data set, then the parameters for a population that included them would not be identical to those for a population that excluded them. In order to investigate parameter invariance and to understand such arguments more precisely, it is indispensable to inspect the varying ways in which this feature has been mathematically formalized in the literature. Before describing a few more applied avenues of investigation, it is therefore insightful to look at more abstract formalizations of invariance first. In particular, the following discussion will focus on invariance from a psychophysical scaling perspective and invariance as defined in the factor-analytic tradition.
In psychophysical scaling theory, the development of invariant scales that provide consistent numerical representations of an underlying empirical relational structure over situations and time has fueled advances in axiomatic measurement theory for years (Narens, 2002), although it is debatable to what extent developments of axiomatic systems coupled with representational theorems, uniqueness theorems, and mathematical theories of meaningfulness have actually impacted applied measurement work (Cliff, 1992). Model invariance has also been attacked from a measurement-theoretic perspective (Eaton, 1989), and properties of invariant hypothesis tests, confidence intervals, and decisions have been known for a long time (e.g., Casella & Berger, 1990). Even though the statistical treatment of invariant decisions is based on mathematical theorems, invariant decisions in the sense of 'consistent' and 'theoretically justifiable' across postulated conditions have occupied validity theorists for years. As pointed out by Messick (1989) in his elegant treatment of validity theory and as reiterated by Zumbo and Rupp (in press), the focus of an applied measurement enterprise is on valid inferences, which need to include considerations about the consequences these have for examinees. Hence, they need to be grounded in a firm evidentiary basis, and parameter invariance in a mathematical model is one of its fundamental building blocks. The following sections describe the mathematical machinery that allows one to collect such evidence.

Mathematical Approaches for Assessing Invariance in Theory

Most prominently, the work of Meredith and colleagues (Meredith, 1964a, 1964b, 1993; Meredith & Millsap, 1992; Meredith & Horn, 2001; Millsap & Meredith, 1992) has investigated different types of invariance and the conditions under which these are likely to hold.
Working originally from a linear common factor framework, they showed four decades ago under what conditions one can find a rotation matrix for such a model that leads to invariant factor loadings for different subpopulations (Meredith, 1964a, 1964b). These ideas were subsequently expanded (Meredith, 1993) and have led to defining different types of invariance mathematically (e.g., configural, strong, and strict measurement invariance). Most importantly, the work has highlighted that, in order to understand these different types of invariance, one has to minimally model mean vectors and dispersion matrices in a linear common factor model and, ideally, decompose the residual term in such a model into two components, specific factors and error, because the different types of invariance are distinguished by an increasing restrictiveness of the assumptions they place on the mean vectors and dispersion matrices (Meredith, 1993; Meredith & Horn, 2001). This research has roots in earlier work by other statisticians and psychometricians, including McDonald (1982), who discussed invariance of factor structures for subpopulations, and Lord and Novick (1968), who showed how weak true score models in classical test theory (CTT) can be cast in a factor-analytic framework with an appropriate decomposition of model terms. One of the most important contributions of this work, apart from the mathematical formalization of invariance conditions, is that it has highlighted that invariance is not only an idealization (Meredith, 1993, p. 540) but also that there is nothing inherent in a linear factor model that guarantees invariance (Meredith & Horn, 2001, p. 210). In other words, the work has reminded psychometricians that invariance of any type is not a feature of such models per se but a hypothesis that can and needs to be subjected to testing (Meredith & Horn, 2001, p. 205).
Because the linear common factor model is equivalent to the one-parameter and two-parameter IRT models under certain conditions (McDonald, 1999; Takane & de Leeuw, 1987), these statements about the nature of invariance also apply to IRT models and certainly echo the concerns of researchers working in areas such as differential item functioning (DIF) and item parameter drift (IPD). Specifically, parameter invariance in subpopulations cannot be inferred from the fact that some linear factor model holds in a large population that can be thought of as the union of different subpopulations, which could be selected via some selection variable. Instead, simultaneous fitting of linear common factor models or linear factor models with separated specific variances, covariances, and error in subpopulations is necessary. Only if certain types of invariance such as strong or strict invariance are found across those subpopulations can one be assured that a factor model with similar structure exists in their union (Meredith, 1993). Ideally, one thus needs to take specific factors and error into account and judiciously place restrictions on the associated model parameters in structural equation models (SEM) to test the invariance hypotheses. However, as Meredith and Horn (2001) discuss in detail, some of these restrictions are quite subtle and even current SEM work often is not conducted properly to test these hypotheses, because (a) alternative means for performing these tests are just now being proposed and (b) it can be difficult to assess whether some of the rather subtle assumptions in these models have been violated. Furthermore, the literature still lacks extensive simulation work and practical applications for investigating these rather fine-grained distinctions between different types of invariance and their practical impact on decision-making.
In other words, despite the apparent sophistication in the mathematical formalization of different types of invariance and the conditions under which these types are likely to hold, some of the above results have thus had only limited practical applicability due to their theoretical subtlety.

Mathematical Approaches for Assessing Invariance in Practice

While some theoretical considerations may appear restrictive for certain practical applications, researchers have been attacking the problem of assessing lack of invariance (LOI) with confirmatory factor analysis (CFA) models for decades. These models have been used extensively to assess whether parameters from different populations are invariant and are useful because they provide statistical goodness-of-fit tests that indicate the degree to which conclusions about LOI are warranted (e.g., Raju, Laffitte, & Byrne, 2002; see Meredith, 1993). However, such scale-level analyses may fail to detect a LOI that is due to item-level effects or item-set-level effects (Zumbo, 2003). Similarly, a wide range of techniques for assessing DIF have been proposed (see, e.g., Clauser & Mazor, 1998), some of which condition on manifest variables such as the observed total score whereas others condition on latent variables such as θ in an IRT model (Millsap & Everson, 1993). The first category includes procedures such as traditional χ² approaches, the Mantel-Haenszel χ² statistic, loglinear models, logistic regression models, and discriminant function analysis (e.g., Clauser & Mazor, 1998; Mellenbergh, 1982; Roussos & Stout, 1996; Roussos, Schnipke, & Pashley, 1999; Zumbo, 1999) but they do not necessarily detect all types of LOI (Meredith & Millsap, 1992). The second category includes approaches based on latent variable models that compare areas between response curves for dichotomous and polytomous scoring models (e.g., Kim & Cohen, 1991; Lord, 1980; Shealy & Stout, 1993).
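As an illustration of the first category of procedures, the Mantel-Haenszel statistic mentioned above pools 2x2 tables (group by correct/incorrect) across score strata. The following minimal sketch (illustrative only; the function name and table values are hypothetical, not taken from the dissertation's analyses) computes the continuity-corrected Mantel-Haenszel chi-square:

```python
def mantel_haenszel_chi2(tables):
    """Mantel-Haenszel chi-square with continuity correction.

    Each table is a 2x2 stratum [[A, B], [C, D]]:
    rows = reference/focal group, columns = correct/incorrect.
    """
    A_sum = E_sum = V_sum = 0.0
    for (a, b), (c, d) in tables:
        n_r, n_f = a + b, c + d      # group sizes in this stratum
        m1, m0 = a + c, b + d        # totals correct / incorrect
        T = n_r + n_f
        if T < 2:
            continue                 # stratum too small to contribute
        A_sum += a
        E_sum += n_r * m1 / T        # expected count under no DIF
        V_sum += n_r * n_f * m1 * m0 / (T * T * (T - 1))
    return (abs(A_sum - E_sum) - 0.5) ** 2 / V_sum

# identical item behaviour in both groups -> statistic near zero
same = [([20, 10], [20, 10]), ([30, 5], [30, 5])]
print(round(mantel_haenszel_chi2(same), 3))
```

Comparing the statistic against a chi-square critical value (3.84 at the .05 level with one degree of freedom) flags items for which the groups differ after conditioning on the observed score.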
An important distinction is made between uniform and non-uniform DIF/LOI, item- and test-level DIF/LOI, as well as bias and impact (e.g., Clauser & Mazor, 1998; Zumbo, 1999). Some of these techniques, particularly those that compare areas between item characteristic curves (ICCs) in IRT models, have been advocated to assess IPD as well because IPD is the same phenomenon in a slightly different disguise as the two parameter calibrations are now separated by a sufficiently large temporal gap (Donoghue & Isham, 1998). While parameter invariance is an ideal state, its counterpart, LOI, represents a continuum of variation in terms of degrees of LOI, which, if properly quantified, could provide insight into the degree to which inferences from a given data set are generalizable to the measurement conditions under consideration. Existing research has so far neither extensively investigated different facets of LOI nor its complex sources, and the research in this dissertation fills in some of these gaps. As stated in Chapter 1, the research is grounded in the mathematical formalization of parameter invariance in IRT models and consists of analytical results and a novel methodology to account for LOI. It not only contributes to an advanced theoretical understanding of this fundamental property of measurement but also allows practitioners to assess the interactions between item sets, examinee populations, and measurement models.

Structure of Dissertation

This dissertation consists of a series of four studies that are organized into four consecutive chapters and follow a consistent logic of inquiry to investigate LOI. The chapters are written to be separate publications and thus contain some degree of content overlap, particularly in the introductory sections, but any descriptions in this dissertation attempt to minimize that repetitive characteristic by focusing on their methodological and inferential differences.
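The uniform versus non-uniform distinction above can be made concrete with the logistic regression approach (Zumbo, 1999): a group main effect indicates uniform DIF, a group-by-ability interaction indicates non-uniform DIF. The sketch below (simulated data with a hypothetical 0.8-logit uniform effect; the conditioning variable is a simulated trait rather than an observed total score) fits that model with a plain Newton-Raphson routine:

```python
import numpy as np

rng = np.random.default_rng(7)

def fit_logistic(X, y, iters=50):
    """Unpenalized logistic regression via Newton-Raphson."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X @ w))
        W = p * (1 - p)                       # IRLS weights
        H = X.T @ (X * W[:, None])            # observed information
        w += np.linalg.solve(H, X.T @ (y - p))
    return w

n = 4000
theta = rng.normal(size=n)
group = rng.integers(0, 2, size=n)            # 0 = reference, 1 = focal
# uniform DIF: item is 0.8 logits harder for the focal group
p_true = 1 / (1 + np.exp(-(theta - 0.5 - 0.8 * group)))
y = (rng.random(n) < p_true).astype(float)

X = np.column_stack([np.ones(n), theta, group, theta * group])
b = fit_logistic(X, y)
print(np.round(b, 2))   # group effect b[2] near -0.8, interaction b[3] near 0
```

A sizeable group coefficient with a negligible interaction recovers the simulated uniform DIF; a sizeable interaction would instead signal non-uniform DIF.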
The four studies build on each other and address complementary research questions. In the first study, the two-parameter IRT model is used to highlight that the quantities that we choose to assess whether item parameter invariance is likely to hold across examinee populations need to be carefully chosen for the analysis to be valid. The work was motivated by the research question that asked whether a correlation coefficient as a measure of a general linear trend between parameter estimates is sufficient to assess the degree of LOI. It is shown that this is not the case, because parameter invariance implies a restrictive set of linear relationships that this coefficient cannot capture. The work was specifically motivated by recent research on parameter invariance in CTT and IRT models (Fan, 1998; McDonald & Paunonen, 2002) where such a correlation coefficient was used. In the second study, an analytical treatment of a particular type of LOI, item parameter drift, is provided. The study seeks to answer the research question that asks how serious biases introduced by a LOI are for practical decision-making and was again motivated by a recent study on this topic that left some connections implicit (Wells, Subkoviak, & Serlin, 2002). The study complements that paper and clarifies many of the connections between LOI and latent variable indeterminacy. Furthermore, the study shows how the biases that are introduced due to this kind of LOI can be quantified. This proves to be a useful means for simulating the mathematical impact of LOI on response probabilities and to gauge its seriousness for practical decision-making. The third study addresses the research question that asks to what degree it is possible to claim superior properties of one basic unidimensional IRT model over another when LOI is considered.
This is the natural extension of the second study, which investigates biases due to LOI only for a particular model, namely the 2PL. The particular type of LOI that is investigated in this study is drift of the item difficulty parameter as it is the only parameter that is common to all three basic unidimensional models. While the first three studies are primarily theoretically driven with a practical context in mind, the fourth study is primarily practically driven with the theoretical problem in mind. It addresses the research question that asks how one can assess the degree of LOI for multiple examinee groups simultaneously and, if it is indeed present, how one can account for its occurrence. The study applies methods in functional data analysis to sets of item difficulty parameters calibrated for multiple examinee groups to quantify examinee group differences. First, the different examinee groups are created using two different exploratory grouping methodologies, exploratory factor analysis coupled with a k-means clustering algorithm as well as a classification and regression trees methodology. The groupings are based on attitudinal and background variables available in the Third International Mathematics and Science Study 1999 (TIMSS 1999) data set. Following the estimation of item difficulty parameters for the different groups, the estimates are plotted and smoothed using non-parametric methods. The variation in the smoothed functional curves is then decomposed via functional principal component analysis and is used to quantify differences between the multiple groups simultaneously in a two-dimensional space consisting of the scores on the principal components. Finally, using the examinee profiles, the observed group differences are accounted for.
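The decomposition step in the fourth study can be sketched in miniature. The example below is not the dissertation's functional-data pipeline (which smooths the difficulty curves non-parametrically first); it only illustrates the core idea of decomposing variation in group-by-item difficulty profiles via principal components, using hypothetical simulated difficulties in which a group-level shift dominates:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical item difficulties: 8 examinee groups x 12 items,
# generated as a common profile plus a group-specific additive shift
base = np.linspace(-1.5, 1.5, 12)
shift = rng.normal(0, 0.4, size=(8, 1))
betas = base + shift + rng.normal(0, 0.05, size=(8, 12))

# principal component decomposition of the centred difficulty profiles
centred = betas - betas.mean(axis=0)
U, s, Vt = np.linalg.svd(centred, full_matrices=False)
scores = U * s                          # group scores on the components
var_explained = s**2 / np.sum(s**2)
print(np.round(var_explained[:2], 2))   # first component dominates
```

Plotting each group's scores on the first two components places all groups in a common low-dimensional space, which is how the fourth study quantifies group differences simultaneously.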
The four studies thus present a consistent logic for investigating parameter invariance, which begins with a mathematical formalization of the property, continues on to discuss how it should, or rather should not, be assessed in all situations, and then leads to the quantification of biases for a single model and multiple models. The theoretical ideas are then complemented by a novel exploratory methodology that provides guidance for the practitioner in large-scale assessment as to the degree to which invariance of the item difficulty parameter is likely to hold for multiple examinee groups. The following chapters now contain more detailed descriptions of the individual studies; the dissertation closes with a final section that synthesizes the most salient findings and points out important directions for future research. Due to the length of the overall document, key acronyms will be reintroduced in each chapter (for reference see also the table on page vii) and a consistent mathematical notation will be used for all equations to facilitate the reading of the material.

Chapter III - How To Quantify and Report Whether Parameter Invariance Holds: When Pearson Correlations Are Not Enough

Item response theory (IRT) is one of the most popular current methodological frameworks for modeling response data from assessments. It is used directly in computer adaptive testing, cognitively diagnostic assessment, and test equating among other applications (e.g., Hambleton, Swaminathan, & Rogers, 1991; Junker, 1999; Kaskowitz & de Ayala, 2001). Furthermore, output from IRT models has more recently been incorporated into hierarchical regression models for multilevel data (Adams, Wilson, & Wu, 1997; Fox & Glas, 2001).
The versatility of IRT models has made them the preferred tool of choice for many psychometric modelers, but beyond the flexibility of IRT models it is the often misunderstood feature of parameter invariance that is frequently cited in introductory or advanced texts as one of their most important characteristics (e.g., Hambleton & Jones, 1993; van der Linden & Hambleton, 1997; Hambleton et al., 1991; Lord, 1980). In this chapter, the mathematical formalization of parameter invariance is used for algebraic investigations of biases introduced by different types of lack of invariance (LOI). The derivations in this chapter are not novel and are not likely to be of interest to modeling experts in IRT; yet, it is hoped that the chapter helps to clarify invariance issues for a broader and more applied audience. More specifically, motivated by recent studies on parameter invariance (Fan, 1998; McDonald & Paunonen, 2002), the purpose of this chapter is to remind readers that parameter invariance implies restrictive linear relationships between parameter sets from different populations that cannot be sufficiently quantified with a correlation coefficient as a measure of general linear association only. In the phrase parameter invariance, the term parameter indicates that the phrase refers to population quantities. In psychometrics, one needs to consider two different collections of population parameters, because test data are the result of the intersection of item and examinee sample spaces and the model is the "glue" that binds the examinees and items together. Therefore, the parameters referred to are the sets of item parameters (e.g., the discrimination parameter α, the difficulty parameter β, and the lower asymptote parameter γ in a three-parameter model) and the set of examinee parameters (e.g., the set of unidimensional θ parameters in a unidimensional IRT model).
The word invariance indicates that the values of the parameters are identical in different populations or across different conditions of interest, which is assessed when they are estimated repeatedly with different calibration samples. It may be helpful to look at four scenarios where comparisons of item and examinee parameter estimates from different calibrations are relevant in practice.

Scenario 1: Consider administering a proficiency test of Canadian history with items written in English administered to monolingual Anglophones as well as bilingual Francophones, resulting in separate calibrations of item and examinee parameters. It would then be of interest to compare the item parameters for both calibrations to investigate whether the same IRT model holds for these linguistic groups.

Scenario 2: Consider a personnel selection test. Legislation in Canada and the U.S. dictates that the test properties should be invariant across legislated subgroups such as those due to different genders and ethnicities. In this case, it would also be of interest to determine whether the item parameters are invariant across subgroups.

Scenario 3: Consider an intervention study that consists of administering a pretest and posttest form consisting of different items that are meant to measure the same construct to the same group of examinees. In this case, it is indispensable to ensure that the item parameters are invariant so as to avoid a confounding of intervention and measurement effects.

Scenario 4: Consider computer adaptive testing. In order for the examinee scores of all people who take such a test from a common item pool to be considered comparable, it is paramount that the examinee parameters are invariant across all possible subsets of administered items.
Given the prevalence of concerns for parameter invariance in practical applications, it is worthwhile to look at its formalization more closely to achieve a better understanding of this important property. Invariance is a term denoting an absolute state, so any discussion about whether there are "degrees of invariance" or whether there is "some invariance" is technically inappropriate (Hambleton et al., 1991). Moreover, the question of whether there is invariance in a single population is illogical as invariance requires at least two populations for parameter comparisons. The mathematical relationships that define parameter invariance are of course not novel and can be found, albeit more cryptically, in other sources. Lord (1980), for example, states that

the probability of a correct answer to item i from examinees at a given ability level θ_0 depends only on θ_0, not on the number of people at θ_0 nor on the number of people at other ability levels θ_1, θ_2, ... Since the regression is invariant, its lower asymptote, its point of inflexion, and the slope at this point all stay the same regardless of the distribution of ability in the group tested. [...] According to the model, they remain the same regardless of the group tested. (p. 34)

In this citation it is the phrase "according to the model" which is key to an understanding of invariance. The phrase can be translated to "if the model holds" and indeed renders invariance a relatively trivial issue (as the author implies himself), because one can say that if a given model holds perfectly for examinees and items in the respective populations, then the sets of item and examinee parameters are invariant. "In other words, invariance only holds when the fit of the model to the data is exact in the population" (Hambleton et al., 1991, p. 23).
This illustrates the paradox that is parameter invariance; on the one hand, it is a trivial identity; on the other hand, it is probably the most important property of IRT models that sets them apart from classical test theory (CTT) models. Indeed, the property of parameter invariance allows for linking of tests (e.g., Kaskowitz & de Ayala, 2001; Sireci, 1997) and drives as well as unifies investigations of differential item functioning (DIF) and item parameter drift (IPD; e.g., Donoghue & Isham, 1998) as both are instantiations of LOI. However, the literature does not provide simple and accessible algebraic work on the conditions of invariance and possible violations of these, which is why the work in this dissertation is necessary as it clarifies the subtleties of parameter invariance for practitioners and theoreticians alike. We have seen that parameter invariance is an ideal state in populations and is mathematically guaranteed only for perfect model fit and not by the mere fact that an IRT model is fit to data per se (see Engelhard Jr., 1994; van der Linden & Hambleton, 1997). Mathematically, parameter invariance is a simple identity for parameters that are on the same scale. Yet the latent scale in IRT models is arbitrary so that unlinked sets of model parameters are invariant only up to a set of linear transformations specific to a given IRT model. When estimating these parameters in unidimensional IRT models with calibration samples, this indeterminacy is typically resolved by requiring that the latent indicator θ be normally distributed with mean 0 and standard deviation 1 (i.e., θ ~ N(0, 1)). In orthogonal multidimensional IRT models, the latent scale indeterminacy implies that parameters are identical up to an orthogonal rotation, a translation transformation, and a single dilation or contraction.
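The unidimensional scale indeterminacy just described can be verified directly: replacing θ by ε + δθ, β by ε + δβ, and α by α/δ leaves every response probability unchanged. A minimal sketch (arbitrary hypothetical parameter values) for a two-parameter logistic kernel:

```python
import math

def p2pl(theta, a, b):
    """Two-parameter logistic response probability."""
    return 1 / (1 + math.exp(-a * (theta - b)))

a, b, theta = 1.2, -0.3, 0.8
eps, delta = 0.5, 2.0          # arbitrary relocation and rescaling

p_original = p2pl(theta, a, b)
p_rescaled = p2pl(eps + delta * theta, a / delta, eps + delta * b)
print(round(p_original, 6) == round(p_rescaled, 6))   # True
```

Because (α/δ)((ε + δθ) − (ε + δβ)) = α(θ − β), the transformation cancels exactly, which is why unlinked calibrations can only be compared up to such linear transformations.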
When estimating these parameters with calibration samples, the indeterminacy is typically resolved by requiring that the multivariate latent indicator θ be multivariate normal (MVN) distributed with mean vector 0 and variance-covariance matrix I, where I is the identity matrix of appropriate size (i.e., θ ~ MVN(0, I)), which is the multidimensional analogue to the unidimensional case (Davey, Oshima, & Lee, 1996; Li & Lissitz, 2000). Once estimated values of the parameters for different populations are available on their respective scales, it is of interest to determine the type of relationship that exists between them as a yardstick to assess whether the same IRT model is likely to hold in both populations (i.e., whether invariance across the populations is likely to hold). In this chapter examinees are indexed by i = 1, ..., I, items are indexed by j = 1, ..., J, θ signifies the unidimensional latent indicator, and P_j(θ_i) is the probability of responding to item j correctly as a function of θ. The following focuses on the unidimensional three-parameter logistic (3PL) IRT model for dichotomously scored items for illustrative purposes, but the logic applies to other models as well. The 3PL can be written as follows:

P_j(θ_i) = γ_j + (1 − γ_j) · exp(α_j(θ_i − β_j)) / (1 + exp(α_j(θ_i − β_j))),  0 < γ_j < 1, α_j > 0, −∞ < β_j, θ_i < ∞

As is typical in the literature, α_j is the "item discrimination" parameter related to the slope of an item characteristic curve (ICC), β_j is the "item difficulty" parameter related to the location of the ICC, and γ_j is the "pseudo-guessing" parameter, which is the lower asymptote of the ICC. Now let the superscript (′) denote parameters from a second population. It follows that joint parameter invariance of item and examinee parameters for two distinct populations of items and examinees implies that

α′_j = α_j,  β′_j = β_j,  θ′_i = θ_i,  γ′_j = γ_j

or, equivalently,

exp(α′_j(θ′_i − β′_j)) / (1 + exp(α′_j(θ′_i − β′_j))) = exp(α_j(θ_i − β_j)) / (1 + exp(α_j(θ_i − β_j)))

for all items and all examinees if the scales of the populations are linked. For unlinked scales, due to the indeterminacy of the latent scale itself, the above identity is only preserved up to a linear transformation. In symbols, for nonzero real numbers ε and δ, we have

α′_j = δ⁻¹α_j,  β′_j = ε + δβ_j,  θ′_i = ε + δθ_i,  γ′_j = γ_j

or, equivalently,

exp(δ⁻¹α_j((δθ_i + ε) − (δβ_j + ε))) / (1 + exp(δ⁻¹α_j((δθ_i + ε) − (δβ_j + ε)))) = exp(α_j(θ_i − β_j)) / (1 + exp(α_j(θ_i − β_j)))

Hence, instead of looking for identity relationships after the item parameters have been linked, one requires the following relationships to hold for parameters to be invariant:

(i) equal slopes for the regression of β′_j on β_j and of θ′_i on θ_i,
(ii) the slope in (i) is the reciprocal of the slope for the regression of α′_j on α_j,
(iii) equal intercepts for the regression of β′_j on β_j and of θ′_i on θ_i,
(iv) zero intercept for the regression of α′_j on α_j, and
(v) zero intercept and unit slope for the regression of γ′_j on γ_j.

Of course, not all equations are relevant for every unidimensional model and only some are relevant for situations where only certain sets of parameters are investigated for invariance (e.g., only conditions (i)-(iv) are required for parameter invariance in a two-parameter logistic (2PL) model). Again, the linear transformations that result in equality of parameters are more restrictive than general linear transformations, because they do not merely require any linear relationship but a specific linear relationship (i.e., an identity on linked scales). However, once sets of parameter estimates are available, it is tempting to use a measure of linear association such as Pearson's product-moment correlation coefficient (PPMCC) to compare the estimates (see Fan, 1998; McDonald & Paunonen, 2002), because it is easy to compute, easy to interpret generally, and easy to report.
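Conditions (i)-(iv) above can be checked numerically: for two parameter sets that differ only by the admissible transformation, the fitted slope of β′ on β and the fitted slope of α′ on α must be reciprocals, and the α regression must have zero intercept. A small sketch with hypothetical item parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
alphas = rng.uniform(0.5, 2.0, size=12)      # hypothetical discriminations
betas = rng.normal(0.0, 1.0, size=12)        # hypothetical difficulties

eps, delta = 0.4, 1.5                        # admissible scale transformation
alphas2 = alphas / delta                     # population 2 on its own scale
betas2 = eps + delta * betas

slope_b, icept_b = np.polyfit(betas, betas2, 1)
slope_a, icept_a = np.polyfit(alphas, alphas2, 1)

# conditions (i)-(iv): reciprocal slopes, zero intercept for alpha
print(round(slope_b * slope_a, 3), round(icept_a, 3))   # 1.0 0.0
```

Any departure of the product of slopes from 1, or of the α intercept from 0, signals that the two calibrations are not related by an admissible transformation, i.e., that invariance fails.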
Unfortunately, as the above discussion shows, a PPMCC of large absolute magnitude is a necessary but not sufficient condition for parameter invariance to hold. Specifically, while it measures the strength and direction of linear relationships in bivariate data, it fails to capture non-linear relationships or, as is perhaps more likely in practice, it fails to capture additive shifts in parameter estimates that separate one examinee population from another at the test level, because it is unitless. For example, consider the scatterplot in Figure 2, which shows item difficulty parameters from two groups for 12 dichotomously scored items on an achievement test.

Figure 2. Item parameters for two groups on an achievement test with 12 items.

It is clearly visible from this graph that the parameters for group 1 all display an additive downward shift, indicating that the items are easier for this group, on average, while the parameters for group 2 all display an additive upward shift, indicating that the items are more difficult for this group, on average. Nevertheless, the relationship between the population and group parameters is a relatively strong positive linear one and the correlation coefficients between the population values and the group 1 and group 2 values are ρ = .95 and ρ = .83, respectively. While one could argue that the lower correlation coefficient captures some of the nonlinearity in the trend for that group, the larger one certainly does not pick up any additive shift, because the PPMCC is insensitive to any considerations about the slope or intercept of regression lines. The data in this plot are actually not artificial but have been observed in the TIMSS 1999 data set as parameter estimates and will be revisited in the fourth study in this dissertation.
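The point made with Figure 2 is easy to reproduce with artificial numbers (these are illustrative values, not the TIMSS estimates): parameter sets that are additively shifted or rescaled correlate perfectly with the population values even though they are clearly not invariant.

```python
import numpy as np

pop = np.linspace(-2.0, 0.5, 12)          # population difficulty values
group1 = pop - 0.6                        # additive downward shift
group2 = 1.2 * pop + 0.8                  # upward shift plus rescaling

r1 = np.corrcoef(pop, group1)[0, 1]
r2 = np.corrcoef(pop, group2)[0, 1]
print(round(r1, 3), round(r2, 3))         # both 1.0
```

Both correlations are perfect because the PPMCC standardizes away exactly the slope and intercept information on which the invariance conditions depend.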
To see the above statements about the PPMCC properties mathematically, consider the scenario where one would allow for equated parameters to be merely linearly related. In symbols, for nonzero real numbers ν, ξ, ϑ, λ, ψ, and κ, define the linear transformations

α′_j = ν + ξα_j,  β′_j = ϑ + λβ_j,  θ′_i = ψ + κθ_i

which capture that the parameters are not invariant. For example, this results in the following relationship for the exponent part of the model:

α′_j(θ′_i − β′_j) = (ν + ξα_j)((ψ + κθ_i) − (ϑ + λβ_j))
= νκθ_i + ξκα_jθ_i − νλβ_j − ξλα_jβ_j + ξψα_j − ξϑα_j + νψ − νϑ
= Ω_jθ_i − Δ_j
≠ α_jθ_i − α_jβ_j

where in the above

Ω_j = νκ + κξα_j
Δ_j = νλβ_j + ξλα_jβ_j − ξψα_j + ξϑα_j − νψ + νϑ

As seen above, the estimated PPMCCs that one can obtain for estimated parameter pairs from two calibrations could all be of large absolute magnitude, yet the parameters do not have to be invariant. Unfortunately, such possibilities do not seem to be addressed by researchers who investigate parameter invariance in an IRT or CTT context (e.g., Fan, 1998; McDonald & Paunonen, 2002) despite research that shows that such trends exist (e.g., Kolen & Brennan, 1995; Sireci & Allalouf, 2003). Even though the results presented here are not novel per se, it is important that researchers and practitioners alike are reminded of the exact meaning of the term parameter invariance and how it should be assessed. Parameter invariance is not a mysterious property but rather a simple identity of population parameters that manifests itself in restrictive linear relationships between parameters from different populations and hence estimates from different calibration sets.
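The expansion above can be verified numerically for arbitrary (hypothetical) values of the constants: the transformed exponent equals Ω_jθ_i − Δ_j exactly, yet differs from the original kernel α_j(θ_i − β_j).

```python
# arbitrary nonzero constants and parameter values (hypothetical)
nu, xi, vartheta, lam, psi, kappa = 0.3, 1.4, -0.2, 0.9, 0.5, 1.1
a, b, th = 1.2, -0.4, 0.7          # alpha_j, beta_j, theta_i

a2 = nu + xi * a                   # alpha'_j
b2 = vartheta + lam * b            # beta'_j
th2 = psi + kappa * th             # theta'_i

omega = nu * kappa + kappa * xi * a
delta_j = (nu * lam * b + xi * lam * a * b
           - xi * psi * a + xi * vartheta * a - nu * psi + nu * vartheta)

lhs = a2 * (th2 - b2)              # transformed exponent
rhs = omega * th - delta_j         # Omega_j * theta_i - Delta_j
print(abs(lhs - rhs) < 1e-12, abs(lhs - a * (th - b)) > 0.1)
```

The first check confirms the algebraic identity; the second shows that the merely-linear transformation changes the response logits, i.e., invariance fails even though all parameter pairs remain perfectly linearly related.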
It should also be noted that there is no reason to believe that an excellent model fit in practice implies that parameter estimates are now valid for any set of items or any group of examinees from arbitrarily defined populations, which would be wishful thinking. It is always an important question to investigate for whom or for what items a given IRT model appears plausible to hold. In cases where lack of invariance is suspected, one could quantify the magnitude of introduced differences in response probabilities and test scores using bias coefficients, which is the topic of the next chapter.

Chapter IV - Bias Coefficients for Lack of Invariance in Unidimensional IRT Models

In this chapter, the mathematical formalization of parameter invariance is used to develop a framework that allows for algebraic, numerical, and visual investigations of different types of lack of invariance (LOI). Specifically, since item parameter drift (IPD) is a case of LOI, the framework is used to explicitly show the different types of bias introduced by drifting parameters and the dependence of these biases on model parameters that have not drifted. The work was motivated in part by a recent paper on IPD (Wells, Subkoviak, & Serlin, 2002) that presented results from simulation studies and it will be shown how their empirically derived results can be accurately predicted from the algebraically derived results of the framework. The term 'bias' has a variety of different usages in the statistical and non-statistical literature. In textbooks of statistical inference, bias is generally defined as the difference between the expected value of an estimator and the quantity it is trying to estimate (Casella & Berger, 1990, p. 303).
In the literature on differential item functioning (DIF), bias is sometimes referred to as an undesired differential functioning of items that is not attributable to ability differences on the latent dimensions the test is intended to measure. Therefore, bias produces an unfair advantage for one group of examinees over another as the examinees in both groups possess differing amounts of proficiency on those nuisance dimensions (Shealy & Stout, 1993). In this study, the term 'bias coefficients' is used to denote quantities that are derived from differences in model parameters due to IPD, because, if IPD goes undetected, the examinees are assigned a score that is different from the one they should be correctly assigned if the drift were detected. As an additional point of clarification, just like in the previous chapter, all of the following equations involve population quantities only, because the focus of this study is not the estimation of biases but the analytical derivation of the idealized population analogues. Circumventing the estimation process allows for discussions of what can be considered "best-case" and "worst-case" scenarios with any real data applications being instantiations of these cases. To derive bias coefficients, consider the unidimensional two-parameter logistic (2PL) model for illustrative purposes where examinees are indexed by i = 1, ..., I, items are indexed by j = 1, ..., J, and P_j(θ_i) is the probability of examinee i responding to item j correctly as a function of the latent trait θ. The 2PL model can be written as follows:

P_j(θ_i) = exp(α_j(θ_i − β_j)) / (1 + exp(α_j(θ_i − β_j))),  α_j > 0, −∞ < β_j, θ_i < ∞

where α_j is the item discrimination parameter, β_j is the item location or difficulty parameter, and θ_i is the latent predictor variable. In the following, parameters from a second population of interest are again denoted by a prime (′).
Conceptually, neither population is considered more 'important' in any sense; they will thus not be semantically distinguished with terms such as 'reference' or 'focal' population as is done in, for example, the literature on differential item functioning (DIF; see Clauser & Mazor, 1998; Donoghue & Isham, 1998; Zumbo, 1999). It is only relevant for this discussion that two populations are considered, which shows why a LOI has such widespread general implications and has effectively been discussed in the literature under a variety of different names for several years. For parameters in the 2PL to be invariant in the populations of interest, one simply requires α′_j = α_j, β′_j = β_j, and θ′_i = θ_i to hold jointly for all items and examinees relevant to the practical context at hand if the parameters are linked onto the same scale. Due to the indeterminacy of the latent scale for θ, the above identities are equivalent to the following equations for unlinked scales:

α′_j = δ⁻¹α_j,  β′_j = ε + δβ_j,  θ′_i = ε + δθ_i

where ε and δ are non-zero real numbers as in the previous chapter. Mathematically, parameters fail to be invariant if at least one of these equations does not hold for at least one item or examinee in the populations of interest depending on which parameters are investigated for invariance. As stated in the previous chapter, the above equations represent restrictive kinds of linear transformations, which is why it is inappropriate to compare parameter estimates from different calibrations with indices that measure linear association only. Hence, considerations about invariance need to include considerations of item sets as well as of individual items (e.g., Donoghue & Isham, 1998; Zumbo, 2003). To understand the types of biases that are possible under a LOI, it is insightful to consider the impact of different violations of the conditions above on the response probabilities.
In generic terms, the linked parameters from the first and second population are related by α′ = f(α), β′ = g(β), and θ′ = h(θ) and are invariant only if the transformation functions f(·), g(·), h(·) are identity functions for all items and examinees; otherwise, they fail to be invariant. For the sake of simplicity, the following examples of parameter invariance will be restricted to item parameters and will consider only linear transformation functions for α and β; the derivations for θ are very similar to those for β, since the two parameters are on the same scale, even though the meaning behind the invariance investigation is very different. Since each linear relationship is represented by a line with an intercept and a slope, there are three cases to consider for each item parameter. For α_j we have

(I)   α′_j = α_j + τ_j
(II)  α′_j = ω_j α_j
(III) α′_j = ω_j α_j + τ_j

where ω_j and τ_j are non-zero real numbers. For β_j we similarly have

(IV) β′_j = β_j + κ_j
(V)  β′_j = λ_j β_j
(VI) β′_j = λ_j β_j + κ_j

where κ_j and λ_j are non-zero real numbers. Note that these six cases are not distinguishable for a given item. That is, if an item parameter value has drifted and only the drifted value is observed (as is generally the case when we work with estimates), then there is exactly one real-valued constant τ_j, one real-valued factor ω_j, and an infinite number of real-valued pairs (τ_j, ω_j) that could have given rise to the transformed value. However, if a transformation applies to sets of items, a distinction between the above cases is crucial, as the biases under different transformations are of different form and magnitude across all the drifted items. The six basic cases (I)–(VI) lead to a total of 15 cases if joint violations of invariance in α_j and β_j are considered. However, they will not all be described in detail, because cases that are combinations of the six basic cases follow logically from those. Hence, in the following section, only the six basic cases are used to express biases, first on the logit scale. The section after that then shows how these biases can be translated into biases on the probability scale to clearly highlight their practical utility, as differences in response probabilities and related true scores are the focus of practical decision-making.

Bias on the Logit Scale

For some cases, the biases that are introduced by violations (I)–(VI) can be compactly written with coefficients on the scale that is defined by the link function. The logit scale was chosen for analytical convenience, but any other transformation with appropriate properties (e.g., the probit transformation) will technically work as well. For each case, (a) the new relationship between the parameter values and (b) the introduced bias on the logit scale will be presented. The bias coefficients will then be interpreted, but it is crucial to note that the interpretation is with respect to the logit scale and does not necessarily mirror the interpretation that would be appropriate on the probability scale. Since most practitioners are probably more interested in the implications of biases for response probabilities and test scores, these will be discussed in a later section, and several biases on the probability scale will be interpreted there. The following description primarily highlights succinctly the interrelationships between the parameter transformation function (i.e., the type of LOI) and the logit scale formulation of the two-parameter kernel.

Case (I) - Non-zero intercept for α′

For non-zero real numbers δ and τ_j,

(a) α′_j = δ⁻¹(α_j + τ_j) = δ⁻¹α_j + δ⁻¹τ_j
(b) logit[P′_j(θ′_i)] = (α_j + τ_j)(θ_i − β_j) = logit[P_j(θ_i)] + B^{θβ}_{ij}

where B^{θβ}_{ij} = τ_j(θ_i − β_j) is an additive bias coefficient whose absolute magnitude depends on the location difference θ_i − β_j, and δ is the global transformation parameter required to link scales. Hence, for a given item, the introduced logit-scale bias is larger in absolute magnitude for an examinee whose ability is very different from the difficulty of the item than for an examinee whose ability level is closer to the difficulty of the item. No bias exists for examinees whose ability level is identical to the item difficulty.

Case (II) - Different slope for α′

For non-zero real numbers δ and ω_j,

(a) α′_j = δ⁻¹(ω_j α_j) = (δ⁻¹ω_j)α_j
(b) logit[P′_j(θ′_i)] = (ω_j α_j)(θ_i − β_j) = B^α_j · logit[P_j(θ_i)]

where B^α_j = ω_j is a multiplicative bias coefficient and δ is the global transformation parameter required to link scales.

Case (III) - Non-zero intercept and different slope for α′

For non-zero real numbers δ, τ_j, and ω_j,

(a) α′_j = δ⁻¹(ω_j α_j + τ_j) = (δ⁻¹ω_j)α_j + (δ⁻¹τ_j)
(b) logit[P′_j(θ′_i)] = (ω_j α_j + τ_j)(θ_i − β_j) = B^α_j · logit[P_j(θ_i)] + B^{θβ}_{ij}

where again B^α_j is a multiplicative bias coefficient, B^{θβ}_{ij} is an additive bias coefficient whose absolute magnitude depends on the location difference θ_i − β_j, and δ is the global transformation parameter required to link scales. This case is of course a combination of the two cases above.

Case (IV) - Different intercept for β′

For non-zero real numbers ε, δ, and κ_j,

(a) β′_j = ε + δ(β_j + κ_j) = (ε + δκ_j) + δβ_j
(b) logit[P′_j(θ′_i)] = α_j(θ_i − (β_j + κ_j)) = logit[P_j(θ_i)] + B^β_j

where B^β_j = −α_j κ_j is an additive bias coefficient whose magnitude depends for each item on its discrimination parameter, and ε and δ are the global transformation parameters required to link scales. Hence, items with higher discrimination values will have a larger logit-scale bias independent of the location difference between examinee and item; this is actually similar to the description of the bias on the probability scale for this case.

Case (V) - Different slope for β′

For non-zero real numbers ε, δ, and λ_j,

(a) β′_j = ε + δ(λ_j β_j) = ε + (δλ_j)β_j
(b) logit[P′_j(θ′_i)] = α_j(θ_i − λ_j β_j) = logit[P_j(θ_i)]^#

where ε and δ are the global transformation parameters required to link scales. The pound sign superscript (#) on the right-hand side indicates that this is a transformed logit that cannot be written compactly using the original logit and bias coefficients.

Case (VI) - Different slope and intercept for β′

For non-zero real numbers ε, δ, κ_j, and λ_j,

(a) β′_j = ε + δ(λ_j β_j + κ_j) = (ε + δκ_j) + (δλ_j)β_j
(b) logit[P′_j(θ′_i)] = α_j(θ_i − (λ_j β_j + κ_j)) = logit[P_j(θ_i)]^# + B^β_j

where ε and δ are the global transformation parameters required to link scales. The pound sign superscript (#) again indicates that the first part cannot be compactly written using bias coefficients, and B^β_j = −α_j κ_j is again an additive bias coefficient whose magnitude depends for each item on its discrimination parameter. This case is of course a combination of the prior two.

In all cases it is clear that the biases result in differences in ICCs, which equal differences in response probabilities for all or almost all examinees. But since the logit transformation is non-linear, the effects of biases on the logit and probability scales are different, and additivity of bias is not preserved. It is thus necessary to translate the logit-scale biases into probability-scale biases. The following section thus discusses the practical utility of the bias coefficients for the estimation of response probabilities and true scores and shows how the results presented here are useful for the study of IPD.
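The additive bias coefficients for cases (I) and (IV) can be verified numerically. The sketch below (with illustrative parameter values, not values from the thesis) checks that the drifted logit equals the original logit plus τ_j(θ_i − β_j) for α-drift, and minus α_j κ_j for β-drift:

```python
def logit_2pl(theta, alpha, beta):
    # logit[P(theta)] for the 2PL is simply alpha * (theta - beta)
    return alpha * (theta - beta)

alpha, beta, theta = 0.9, 0.2, 1.1   # illustrative values
tau, kappa = 0.5, 0.4                # drift amounts

# Case (I): alpha' = alpha + tau  ->  additive bias tau * (theta - beta)
bias_I = logit_2pl(theta, alpha + tau, beta) - logit_2pl(theta, alpha, beta)
assert abs(bias_I - tau * (theta - beta)) < 1e-12

# Case (IV): beta' = beta + kappa  ->  additive bias -alpha * kappa
bias_IV = logit_2pl(theta, alpha, beta + kappa) - logit_2pl(theta, alpha, beta)
assert abs(bias_IV - (-alpha * kappa)) < 1e-12
```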
Bias on the Probability Scale

It is possible to use the above formulations to analytically compute differences in response probabilities at the population level, as is done empirically in studies of IPD (e.g., Wells et al., 2002; see also Donoghue & Isham, 1998). Conceptually, IPD is typically defined as the differential shift of item parameters over time (Goldstein, 1983), which is often attributed to educational, technological, or cultural changes (Bock, Muraki, & Pfeiffenberger, 1988). Mathematically, it is readily seen that IPD represents LOI at the item level, where IPD in either α or β leads to a change in the respective parameter value, with the form of the exact transformation from α to α′ or β to β′ unknown. Hence, one way to represent IPD at the item level is

(A) α′_j = α_j + τ_j
(B) β′_j = β_j + κ_j

where −α_j < τ_j < ∞ and −∞ < κ_j < ∞, with the first inequality ensuring that α′_j > 0. In other words, all cases (I)–(VI) are cases of IPD, but the simplest way to simulate drift and to analytically investigate it is by casting it as an additive formulation. Since the formulation of (A) and (B) corresponds to cases (I) and (IV), the biases are additive on the logit scale, but since graphical comparisons of item characteristic curves (ICCs) are made on the probability scale, it is helpful to translate the above statements into bias statements on that scale. To combine the discussion for both cases into one, consider a general additive bias on the logit scale, where φ is any non-zero real number:

logit[P′_j(θ′_i)] = α_j(θ_i − β_j) + φ = logit[P_j(θ_i)] + φ.

On the probability scale, this is written as

P′_j(θ′_i) = exp[α_j(θ_i − β_j)]exp[φ] / (1 + exp[α_j(θ_i − β_j)]exp[φ]) = exp[α_j(θ_i − β_j)] / (exp[−φ] + exp[α_j(θ_i − β_j)]),

which can be compared to the response function with item parameters that have not drifted,

P_j(θ_i) = exp[α_j(θ_i − β_j)] / (1 + exp[α_j(θ_i − β_j)]).

A few basic algebraic steps result in the following relationships, where the arrow denotes an implication:

(R1) φ < 0 ⟹ P′_j(θ′_i) < P_j(θ_i)
(R2) φ = 0 ⟹ P′_j(θ′_i) = P_j(θ_i)
(R3) φ > 0 ⟹ P′_j(θ′_i) > P_j(θ_i)

In other words, if the additive logit bias is positive, the probability under the drifted parameters will be positively biased; if it is negative, it will be negatively biased; otherwise, the two probabilities will be identical. The relationships (R1)–(R3) are not equivalences, however, because observed differences in response probabilities can have many causes, only one of which is an additive bias on the logit scale.

Consider the Wells et al. (2002) study for illustrative purposes. The authors simulated shifts in the population values of the item difficulty and discrimination parameters in a 2PL. Only positive amounts of drift were considered, and the effect of item parameter drift on the estimation of examinee ability parameters was investigated under 48 conditions: test length (2 conditions) x sample size (2 conditions) x type of drift (3 conditions) x number of drifted items (4 conditions). More specifically, if an item was selected to display item parameter drift, the authors increased either the discrimination parameter by .5, or the difficulty parameter by .4, or both simultaneously by .5 and .4, respectively.

It is immediately clear that increasing a difficulty parameter by some positive number leads to an ICC that is shifted to the right and that increasing a discrimination parameter by some positive number leads to an unchanged inflection point but a steeper slope. Yet in addition to the conceptual appeal, it is possible to quantify these changes more precisely, and the bias coefficients allow us to do just that. When the authors changed an item discrimination parameter value by .5, they introduced a bias of

B^{θβ}_{ij} = τ_j(θ_i − β_j) = .5(θ_i − β_j)

according to case (I).
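The sign relationships (R1)–(R3) can be checked directly: adding a constant φ to the logit raises the probability when φ > 0 and lowers it when φ < 0. The sketch below uses arbitrary illustrative values:

```python
import math

def p_from_logit(x):
    # Inverse logit: maps a logit value back to a probability
    return 1 / (1 + math.exp(-x))

logit = 0.7  # alpha * (theta - beta) for some illustrative examinee/item pair
for phi in (-0.8, 0.0, 0.8):
    p_drifted = p_from_logit(logit + phi)
    p_original = p_from_logit(logit)
    # sign of (p_drifted - p_original) matches sign of phi, per (R1)-(R3)
    print(phi, p_drifted - p_original)
```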
For drifted items, this results in ICC segments that are shifted upward for positive bias (R3), which occurs when θ_i > β_j, ICC segments that are shifted downward for negative bias (R1), which occurs when θ_i < β_j, and an identical ICC value for no bias (R2) at θ_i = β_j. This pattern was observed (see figure 1a, p. 80) and plotted with respect to the estimated true score, which is computed as the sum of the ICCs over all items in the test:

T(θ_i) = Σ_{j=1}^{J} P_j(θ_i).

The resulting curve that traces the true score as a function of the latent indicator θ is called the test characteristic curve (TCC), and it was seen that the overall shift in the TCC was relatively minimal, because only a few items exhibited drift in each of the design conditions in the study. This also stems from the fact that the differences in response probabilities are actually relatively minor. To illustrate this, let us formally denote the difference in response probabilities by Δ_ij,

Δ_ij = P_j(θ_i) − P′_j(θ′_i), with −1 < Δ_ij < 1.

Table A1 shows the Δ_ij values (cell entries) for an α-drift of .5 as a function of the original discrimination value of an item (row value) and the location difference between an examinee and an item on the θ scale, θ_i − β_j (column value). For example, take an item with an original discrimination value of .75 and an examinee whose location on the latent scale is .5 units above the location of the item (i.e., θ_i − β_j = .5). The bias that gets introduced for this examinee on this item under drift of the discrimination parameter manifests itself in a difference in response probabilities of only Δ_ij = −.0586883. In other words, the response probability under the drifted discrimination parameter is about 6 percentage points higher than under the non-drifted parameter. It can be seen in this table that most Δ_ij values are between .05 and .10 in absolute magnitude.
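The worked example above can be reproduced directly; the computation below assumes the Δ convention just defined (original-parameter probability minus drifted-parameter probability):

```python
import math

def p_2pl(theta, alpha, beta):
    return math.exp(alpha * (theta - beta)) / (1 + math.exp(alpha * (theta - beta)))

alpha, drift = 0.75, 0.5   # original discrimination and amount of alpha-drift
theta_minus_beta = 0.5     # examinee .5 units above the item location
delta_ij = (p_2pl(theta_minus_beta, alpha, 0.0)
            - p_2pl(theta_minus_beta, alpha + drift, 0.0))
print(round(delta_ij, 7))  # -0.0586883, matching the value quoted above
```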
In other words, between 10 and 20 items with an α-drift of this magnitude are necessary to produce a true-score difference of only 1 point, a difference that would probably be considered trivial in most practical circumstances. When the authors changed an item difficulty parameter value by .4, they introduced a negative bias of

B^β_j = −α_j κ_j = −.4α_j < 0,

according to case (IV), where the inequality stems from the fact that item discrimination values are always positive. For drifted items, this results in ICCs that are shifted to the right according to (R1), independent of the values of θ_i and β_j, which was observed with again relatively minimal effects in terms of the TCC (see figure 2a, p. 82). Again, the Δ_ij values for a variety of item discrimination parameters and location differences can be computed (Table A2), and, again, most of the Δ_ij values for moderately discriminating items and moderate location differences are between .05 and .10, albeit some cases with higher values can be observed. Just as before, this means that for the majority of cases between 10 and 20 items with a β-drift of this magnitude are required to produce a true-score change of 1 point, a relatively minor effect. Finally, when the authors changed both the item discrimination parameter value by .5 and the item difficulty parameter value by .4, they introduced multiple biases. Even though conditions for when upward and downward shifts of the ICCs occur can be formally stated, those conditions are relatively cumbersome to present and are omitted here. In the study, the authors report that the TCCs cross at a value θ₀ where θ₀ > β_j. The effects of the biases as seen in the TCCs were again relatively minimal for most values of θ but now started to increase in magnitude for specific sub-regions of the θ space compared to the previous scenarios (see figure 3a, p. 83).
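The same computation for the β-drift of .4 (again an illustrative sketch, evaluated here at θ_i = β_j) shows a Δ value of similar size, and dividing a 1-point true-score difference by that per-item Δ recovers the 10-to-20-item figure quoted above:

```python
import math

def p_2pl(theta, alpha, beta):
    return math.exp(alpha * (theta - beta)) / (1 + math.exp(alpha * (theta - beta)))

alpha, kappa = 1.0, 0.4  # illustrative discrimination and beta-drift
delta_ij = p_2pl(0.0, alpha, 0.0) - p_2pl(0.0, alpha, kappa)
print(round(delta_ij, 4))   # per-item probability difference at theta = beta
print(round(1 / delta_ij))  # rough number of drifted items per 1 true-score point
```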
Table A3 shows the Δ_ij values for this scenario, and it can be seen that these are still often between .05 and .10, but that now there are also quite a few values in the range of .10 to .15, with some even reaching .20. Thus, even though between 10 and 20 items are required to produce a true-score difference of 1 point in many cases under this joint α- and β-drift, only around 5 to 7 items are required in other cases. All three scenarios show that when the pattern of introduced biases is expressed with respect to response probabilities, it appears quite complex due to the curvature and asymptotic behavior of the ICCs. It is possible to plot the Δ_ij values to illustrate this complex behavior more closely. Figure 3 shows the Δ_ij surface and contour plots for the α-drift of .5, item discrimination values of non-drifted items between 0 and 2, and location differences between −2 and 2 to match the structure of Table A1 while utilizing more grid points. Note that for the surface plot Δ_ij is labeled 'Delta', the location difference is labeled 'Theta-Beta', and the non-drifted discrimination values are labeled 'Alpha'. Furthermore, the orientation of the contour plot matches the orientation of the surface plot, so that the horizontal axis represents the location difference values, the vertical axis represents the item discrimination values, and the contour lines and shades represent the Δ_ij values, with lighter shades corresponding to higher Δ_ij values and darker shades corresponding to lower Δ_ij values.

Figure 3. Surface and contour plots of Δ_ij for α′ = α + .5 ((a) surface plot for α-drift; (b) contour plot for α-drift).
To understand these plots, note that when an item discrimination parameter drifts, the slope of the ICC for the item with the drifted parameter is steeper, which results in increasing differences in response probabilities in both directions from the inflection point for some range of θ values, followed by decreasing differences as the original ICC and the ICC under drift approach their asymptotes. The differences are positive to the left of the inflection point and negative to the right of the inflection point as a result of how Δ_ij was defined here. As an example of this behavior, Figure 4 shows a plot of a drifted item with original discrimination value α = 1, discrimination value α′ = α + .5 = 1.5 after drift, and difficulty value β = 0. Note that the latent trait θ is labeled 'Theta' and that P_j(θ_i) is labeled 'Probability'.

Figure 4. ICCs for item with α-drift of .5.

The surface and contour plots of Figure 3 show graphically how differences are largest in absolute magnitude for the least discriminating items and smallest for the most discriminating items for the range of location differences considered. This makes sense, because if the slope of an ICC is already rather steep without drift present (i.e., when an item is already highly discriminating), then a further increase in slope will have relatively little impact on response probability differences. This implies that if items are of at least reasonable discriminatory power for a given population (e.g., if they have an α_j value of at least 1), biases are not as extreme. Figure 5 shows the Δ_ij surface and contour plots for the β-drift of .4, item discrimination values of non-drifted items between 0 and 2, and location differences between −2 and 2 to match the structure of Table A2 while again utilizing more grid points. The labeling corresponds to that of Figure 3.
Figure 5. Surface and contour plots of Δ_ij for β′ = β + .4 ((a) surface plot for β-drift; (b) contour plot for β-drift).

This shows that when an item difficulty parameter drifts, the effect is asymmetric with respect to the original inflection point of the ICC. Figure 6 shows a plot of a drifted item with original difficulty value β = 0, drifted difficulty value β′ = β + .4 = .4, and discrimination value α = 1 to illustrate this behavior. The labeling corresponds to that of Figure 4.

Figure 6. ICCs for item with β-drift of .4.

As seen in the surface and contour plots of Figure 5, the difference in response probabilities gets larger in absolute magnitude as the discrimination value gets larger, and items with higher discrimination values have a higher bias for smaller location differences. Finally, Figure 7 shows the Δ_ij surface and contour plots for a joint α-drift of .5 and β-drift of .4 for item discrimination values between 0 and 2 and location differences between −2 and 2 to match the structure of Table A3 while again utilizing more grid points. The labeling corresponds to that of Figure 3 and Figure 5.

Figure 7. Surface and contour plots of Δ_ij for α′ = α + .5 and β′ = β + .4 ((a) surface plot for joint α- and β-drift; (b) contour plot for joint α- and β-drift).

This plot shows the complex effects that both drift types have on the difference in response probabilities, and one can readily identify characteristics of the previous two cases. For example, note the almost linear difference values for poorly discriminating items in the location difference range considered here, due to flat ICCs, as well as the pronounced spike in difference values for highly discriminating items, similar to the cases before.
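The pattern in Figure 5 (larger |Δ| for more discriminating items near the inflection point) can be reproduced on a small grid; the grid values below are arbitrary illustrative choices:

```python
import math

def p_2pl(theta, alpha, beta):
    return math.exp(alpha * (theta - beta)) / (1 + math.exp(alpha * (theta - beta)))

kappa = 0.4  # beta-drift
for alpha in (0.5, 1.0, 2.0):
    # difference in response probability at the original inflection point (theta = beta)
    delta = p_2pl(0.0, alpha, 0.0) - p_2pl(0.0, alpha, kappa)
    print(alpha, round(delta, 4))
```

The printed Δ grows with α, matching the surface plot's ridge along the discrimination axis.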
As an example of the complex behavior of the Δ_ij values, Figure 8 shows the ICC of an item with original parameter values β = 0, α = 1, and the ICC of the same item with drifted parameter values β′ = .4 and α′ = 1.5. The values were chosen to match the effects shown in Figure 4 and Figure 6, and the labeling is identical to these figures as well.

Figure 8. Plot of ICCs for item with joint α-drift of .5 and β-drift of .4.

Finally, these complex relationships raise the issue of what kind of discrimination values and location differences are typically observed in practice. It seems clear that extreme differences of, say, ±2 or 3 units can be observed in many practically relevant cases if test data are collected with item and examinee population subsets that yield a wide range of item and examinee parameter values. Indeed, a good test often consists of items with a wide variety of difficulty levels and a moderate range of discrimination values and is typically given to examinees with a wide range of ability levels, with the implicit hope that item and examinee properties are well captured by the chosen model. Whether or not the intersection of a given examinee with a given item results in a large bias under drift of some parameter cannot be generally answered, however, and depends on the type and magnitude of drift. On the upside, the above framework can be used to investigate the magnitude and impact of biases under a variety of conditions, which has implications for various research arenas that are concerned with LOI.

Conclusion

This chapter summarized the basic algebraic relationships between parameter values from distinct populations to underscore that parameter invariance is an ideal state that is violated if at least one equality condition does not hold for at least one examinee or item.
Violations can be of any kind, but three basic cases of linear transformation per item parameter were considered, and the biases introduced on the logit scale by this LOI were represented with bias coefficients whenever possible. All of these violations constitute IPD and, using a recent study, the analytic foundations of the empirical results presented therein were derived. It was furthermore illustrated how the magnitude of the drift impacts differences in response probabilities and examinee true scores. The bias coefficients highlight primarily the dependencies of different types of bias on model parameters and allow one to quickly gauge the severity of the bias on the logit scale. From a practical viewpoint, the algebraic perspective taken here allows one to compute and visualize different biases directly, which can easily be done for a variety of different conditions. This process can be used to cleanly assess the impact certain biases have on response probabilities, and hence true scores, of examinees. Any real-life data set will be a mixed bag of different biases that falls somewhere between the clean analytical extremes. If the extremes display relatively minor effects, as this chapter and other research seem to suggest, then the different theoretical conditions are the "worst-case scenarios" and practitioners might even encounter a less severe impact. This chapter has looked at only one model, the 2PL, and the natural extension is to compare multiple models under drift. In particular, it is reasonable to ask whether one of the three basic models possesses superior bias properties with respect to item parameter drift. In other words, it is interesting to ask whether the introduced bias is globally smallest for one particular model, so that that model could be advocated over others in contexts where biases are suspected.
This would be relevant for practitioners, who appreciate simple and straightforward recommendations for model choice. As the next chapter shows, perhaps not too surprisingly, the patterns of bias are relatively complex and unfortunately allow one to show superior properties of one model over another only under very restrictive conditions.

Chapter V - An Analytic and Numerical Look at Biases from β-drift Across Unidimensional IRT Models

Even though various measurement models based on item response theory (IRT) are available to psychometric modelers (for overviews see, e.g., Junker, 1999; Rupp, in press-b; van der Linden & Hambleton, 1997), the practical day-to-day applications of more complex IRT models are rather limited. Instead, it is not uncommon for testing agencies, research institutes, and consultants to use one of the more basic unidimensional IRT models, such as the one-parameter logistic (1PL) or Rasch model, the two-parameter logistic (2PL) model, or the three-parameter logistic (3PL) model for dichotomously scored items. Alternatively, the graded response model (Samejima, 1969) and the multiple response model (Thissen & Steinberg, 1984) are often used for polytomously scored items. In fact, some of the most popular estimation software, such as BILOG-MG (Zimowski, Muraki, Mislevy, & Bock, 1996) and MULTILOG (Thissen, 1991), is designed to estimate primarily these basic, powerful models (see Rupp, in press-a, for a recent review of these programs). Given some of the discrepancies between the theoretical availability of models and the practical model choice that needs to be made in a given context, it is of inherent interest to understand some of the driving forces behind model choice. It is particularly worthwhile investigating whether it is possible to collect some evidence that supports the eventual model choice based on mathematical criteria about its inferential properties.
This chapter seeks to provide some pieces of evidence which underscore that the process of model choice is complex and cannot be unambiguously resolved by invoking certain lack of invariance (LOI) properties under item parameter drift (IPD). Specifically, it will be shown how biases due to drift of the item difficulty parameter may be only minor from a practical perspective but are rather complex from a theoretical perspective. Therefore, the issue of model choice under this LOI lens can be considered either practically unimportant, as biases in response probabilities are relatively minor across all models, or theoretically important, as the bias patterns show that no single model possesses globally superior bias properties under drift of the item difficulty parameter. But before these arguments are analytically and numerically developed, it is worthwhile to say a few words about the process of model choice in general.

Choosing a Measurement Model

At the simplest level, choosing a measurement model can be a matter of training or tradition. In the former case, knowledge about a certain class of measurement models, such as classical test theory (CTT) models or structural equation models (SEM), may lead one to choose one modeling paradigm over the other. This is of course a very reasonable course of action, because in order to use a model responsibly, one needs to be knowledgeable, at least to a reasonable degree, about its mathematical structure, its parameter estimation process, the meaning of the different parameters, the meaning of information provided in the output, and the contexts where it can be used. Hence, if one is knowledgeable about a particular class of models, one is typically at least a responsible user of those models, which may be better than being an irresponsible user of a larger class of models that one knows very little about.
If model choice is driven by tradition, a similar logic often underlies the process. For example, by reputation and perception, the Educational Testing Service and CTB/McGraw-Hill are companies that often use the 3PL IRT model, whereas many language testers and the National Board of Medical Examiners often use the 1PL IRT or Rasch model. Similarly, many educational and psychological researchers still use CTT models, and the University of Iowa is often thought of as one of the centers for its cousin, generalizability theory. As in almost all areas of life, traditions have histories that are hard to break, because they have been shown to be consistently beneficial to the users and to be of high practical utility. In many practical scenarios, model choice is driven by very real constraints that models place on the data input and the consumers of their output. For example, simpler models have fewer parameters that need to be estimated and thus make less stringent sample size requirements for stable parameter estimation. But even if parameters in more complex models can be estimated to a desired degree of accuracy, it may be difficult to interpret them from a substantive theoretical viewpoint, and that may be what is desired. Conceptually, one could thus argue that the question of model choice can be answered by invoking more fundamental properties of measurement, which would, for example, favor the 1PL or Rasch model over other models because it can be considered an instance of additive conjoint measurement (Perline, Wright, & Wainer, 1979). Researchers working seriously in cognitive assessment might not even consider a mathematical model unless it allows them to formally operationalize elements of the substantive theory as parameters in the model.
For these people, a model of choice needs to have an appropriate mathematical structure that provides adequate fit to the data as well as to the underlying theory that generated the data. Alternatively, one could argue that the choice between models can be considered a question of empirical model fit, which implies that every new data set should be modeled with that specific model which fits the data best. However, detecting model fit for real data sets can be a tricky issue, in particular for smaller data sets. Indeed, as seen before in the discussion of model complexity, sample size is often one of the main determining factors for model choice, even though sample size requirements are somewhat moderated these days by advances in the theory and implementation of Bayesian estimation paradigms (see Rupp, Dey, & Zumbo, in press, for an overview). But even though it is relatively easy to find faults with a given application of a model to a data set, either on empirical or theoretical grounds, even strong critics are not always able to offer superior alternatives (see, e.g., Traub, 1983, for a passionate criticism of unidimensional IRT models). It is thus of inherent interest to determine whether a consistent preference for a given model bears positive or negative consequences in the long run. To answer this question, this study explores under what conditions of IPD the largest amounts of bias in response probabilities occur, to determine whether one can argue for globally superior properties of one model over another. IPD, an instantiation of LOI, is a commonly observed phenomenon where item parameters from a previous calibration of an item set have changed (i.e., drifted) over time.
IPD has a practical impact on decision-making because, if it goes undetected, a bias in response probabilities is introduced, which affects the true-score estimates of examinees and, by implication, their respective θ estimates. Thus, if one could make general statements that showed one model to be superior in that it resulted generally in smaller amounts of bias across different drift conditions, more substantial arguments about model choice could be developed. This chapter explores analytical, numerical, and visual methods to answer that question, and it highlights the conditions that have to hold to make such general claims, along with the overall complexity of the optimality issue. The study focuses on basic unidimensional IRT models for dichotomously scored item sets (i.e., the 1PL, 2PL, and 3PL models), as those facilitate discussions of the underlying theoretical concerns. All discussions are developed at the population level and hence circumvent the problem of detecting the true amount of bias in the population with calibration samples. This is not necessary for the purpose of this chapter, as the population analogue, as a clean idealization, shows the minimum and maximum amounts of bias without any confounding due to sample-to-sample fluctuation. In the previous chapter, the relationships that exist between parameters from different populations on both the logit and probability scales were made explicit, and the impact of IPD on item response probabilities as well as examinee true scores was demonstrated analytically, numerically, and visually. Consistent with previous research, it was found that for moderate amounts of item discrimination parameter drift (i.e., α-drift), item difficulty parameter drift (i.e., β-drift), and joint α- and β-drift, the effect on examinee true scores was relatively minimal.
Since that research had been motivated by and was complementary to a recent study on the same problem (Wells, Subkoviak, & Serlin, 2002), the model that was used there was the 2PL, to allow for a comparison of the findings. This chapter presents the natural extension of the previous study by offering a cross-model comparison of the impact of item parameter drift on item response probabilities and examinee true scores. Technically, there are nine scenarios to consider for this purpose, since one is dealing with two design factors that can be crossed: (1) the generating model (3 levels: 1PL, 2PL, 3PL) and (2) the fitted model (3 levels: 1PL, 2PL, 3PL). However, certain types of parameter drift cannot be considered for all scenarios. In particular, one cannot consider α-drift for the 1PL, because by default all discrimination parameter values are held constant for this model (typically αj = 1 for all items) and a "drift" for some items would effectively constitute going from a 1PL to a 2PL. In other words, it would constitute an instance of model misfit and not IPD, which presumes that the same model holds across calibrations for the item set under consideration. In addition, one cannot consider γ-drift for either the 1PL or the 2PL, because that would constitute another instance of model misfit, as only the 3PL allows differing lower-asymptote parameters across items. It is clear by now that item parameter drift can only be considered across models for an item parameter that is common across all models; hence, one has to limit attention to the item difficulty parameter β. The guiding idea and questions for this chapter are the following. Consider an item that displays β-drift and further consider that the item is calibrated with a 1PL, 2PL, and 3PL.
For what combinations of α-values, examinee locations on the latent scale, and amounts of β-drift is the bias in item response probabilities and examinee true scores smallest (i.e., where are local minima for bias)? In addition, is it possible to identify one of these models as the one that has the smallest overall amount of bias due to the mathematical structure of the model (i.e., is there a global minimum of bias across models)? To answer these questions, a formalization of bias is necessary.

Formalization of Bias in Response Probabilities

To begin, define the bias in response probabilities for examinee i on item j as the difference

\[ \Delta_{ij} = P_j(\theta_i) - P_j'(\theta_i) \]

where the first probability is computed using the original value of the item difficulty parameter β_j, whereas the second probability is computed using β'_j, the drifted value of the item difficulty parameter (i.e., for a drift of magnitude τ_j, β'_j = β_j + τ_j; see chapter 4). Since the 3PL is the most general of the three unidimensional models, one can compare the difference in biases across models most easily if one uses it as a starting point. The above definition of Δ_ij results in the following expression of bias for the 3PL:

\[
\Delta_{3PL} = \left[ \gamma_j + (1-\gamma_j)\frac{\exp(\alpha_j(\theta_i - \beta_j))}{1 + \exp(\alpha_j(\theta_i - \beta_j))} \right] - \left[ \gamma_j + (1-\gamma_j)\frac{\exp(\alpha_j(\theta_i - (\beta_j + \tau_j)))}{1 + \exp(\alpha_j(\theta_i - (\beta_j + \tau_j)))} \right]
= (1-\gamma_j)\left[ \frac{\Psi_{ij}}{1+\Psi_{ij}} - \frac{\Psi_{ij}}{\eta_{ij}+\Psi_{ij}} \right]
\]

where \( \eta_{ij} = \exp(\alpha_j \tau_j) \) and \( \Psi_{ij} = \exp(\alpha_j(\theta_i - \beta_j)) \). Since the 2PL is a special case of the 3PL with γ_j = c, c ∈ [0, 1), where c is typically chosen to be 0, one can see analytically that the bias for the 2PL can be considered a "special case" of the bias for the 3PL under certain conditions, as

\[
\Delta_{2PL} = \frac{\Psi_{ij}}{1+\Psi_{ij}} - \frac{\Psi_{ij}}{\eta_{ij}+\Psi_{ij}},
\qquad\text{and hence}\qquad
\Delta_{3PL} = (1-\gamma_j)\,\Delta_{2PL}.
\]

This equation illustrates analytically the first result of cross-model comparisons.
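These relations are easy to check numerically. The sketch below is illustrative (the function names and the chosen parameter values are mine, not the dissertation's); it computes the bias Δ directly from the 3PL response function and confirms that, for identical discrimination values, Δ_3PL = (1 − γ_j)Δ_2PL.

```python
import math

def p3pl(theta, alpha, beta, gamma):
    """3PL response probability: gamma + (1 - gamma) * logistic(alpha * (theta - beta))."""
    return gamma + (1.0 - gamma) / (1.0 + math.exp(-alpha * (theta - beta)))

def bias(theta, alpha, beta, gamma, tau):
    """Bias Delta = P(theta; beta) - P(theta; beta + tau) under beta-drift of size tau."""
    return p3pl(theta, alpha, beta, gamma) - p3pl(theta, alpha, beta + tau, gamma)

# Example values loosely based on the Figure 9 item (beta = 0, tau = .6, alpha = 1.5)
d3 = bias(theta=0.5, alpha=1.5, beta=0.0, gamma=0.3, tau=0.6)  # 3PL with gamma = .3
d2 = bias(theta=0.5, alpha=1.5, beta=0.0, gamma=0.0, tau=0.6)  # 2PL analogue
assert abs(d3 - (1 - 0.3) * d2) < 1e-12  # Delta_3PL = (1 - gamma) * Delta_2PL
```

The assertion holds at every θ, not just the sample point, because the factor (1 − γ_j) multiplies the whole bracketed difference.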
If one calibrates an item with a 2PL and a 3PL and the item discrimination value is the same under both model calibrations, then, for an identical difference between examinee and item location on the latent θ scale, the introduced bias is smaller under the 3PL. Specifically, it is smaller by a factor proportional to the probability of not guessing. This result makes sense if one considers the item characteristic curves (ICCs) for this scenario. Both ICCs would have the same slope and the same location displacement under drift; yet, since the lower asymptote for the 3PL is higher than for the 2PL, the horizontal displacement due to β-drift results in a smaller vertical difference Δ_ij between the curves for the 3PL than the identical displacement does for the 2PL. It is already clear that the difference is not going to be very large, since lower-asymptote parameter values are rarely larger than .3; Figure 9 illustrates this for an item with β_j = 0, β'_j = .6, α_j = 1.5, and γ_j = .3 under the 3PL.

Figure 9. Effect of drift for an item with identical discrimination calibrated with 2PL (γ_j = 0) and 3PL (γ_j = .3).

Since the 1PL is a special case of the 2PL and, by implication, of the 3PL, the bias under a 1PL is also a special case of the bias under a 2PL and 3PL. However, it cannot be written as compactly in an equation as above, because the difference between a 1PL and the other two models is that the discrimination parameter is fixed, which is a parameter that is not directly on the probability scale and hence precludes a simple analytic equation. This scenario will be discussed further in the next section, where graphics are used to illustrate these biases, which makes patterns in biases easier to describe.

Visualization of Biases Using the Δ_ij Function

To appreciate the differential impact of various types of biases and their dependencies on model parameters, it is useful to graphically visualize the Δ_ij values. To understand the meaning of such plots, it is important to realize that Δ_ij is actually a function of four variables:

1. The discrimination parameter (α_j)
2. The location difference (θ_i − β_j)
3. The lower-asymptote parameter (γ_j)
4. The amount of β-drift (τ_j)

One way to explore the differential impact these values have on the response probability bias is to plot the Δ_ij values as a function of the location difference and the discrimination parameter, which yields 3-dimensional surfaces, and then to consider a matrix of these 3-dimensional surfaces where the rows are defined by different values of the drift parameter τ_j and the columns are defined by different values of the lower-asymptote parameter γ_j. For illustrative purposes, attention will be restricted to the following ranges of values:

1. The discrimination parameter (α_j): (0, 2]
2. The location difference (θ_i − β_j): [-3, 3]
3. The lower-asymptote parameter (γ_j): 0, .1, .2, .3
4. The amount of β-drift (τ_j): .2, .4, .6, .8

This results in the 4 × 4 matrix of 3-dimensional plots shown in Figure 10.

Figure 10. Matrix of 3-dimensional bias surfaces, with rows defined by the amount of β-drift (τ_j) and columns defined by the lower-asymptote parameter (γ_j).

The plots in Figure 10 reflect the nested nature of the three unidimensional models and illustrate some general trends. First, it is apparent that the introduced bias is generally largest for the largest amount of drift (row 1 with τ_j = .8) and smallest for the smallest amount of drift (row 4 with τ_j = .2), which is intuitively clear.
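The cell-by-cell behavior of this 4 × 4 grid can also be confirmed without plotting. The sketch below is illustrative (the function name and grid resolution are mine): it evaluates the bias surface over the stated ranges of α_j and θ_i − β_j for each (τ_j, γ_j) cell and records the maximum bias per cell.

```python
import numpy as np

def bias_surface(gamma, tau):
    """Bias Delta over a grid of discrimination values alpha in (0, 2]
    and location differences d = theta - beta in [-3, 3], for a fixed
    lower asymptote gamma and drift tau."""
    alpha = np.linspace(0.05, 2.0, 40)[None, :]
    d = np.linspace(-3.0, 3.0, 121)[:, None]
    p = lambda x: gamma + (1 - gamma) / (1 + np.exp(-alpha * x))
    return p(d) - p(d - tau)

taus = [0.8, 0.6, 0.4, 0.2]      # rows of Figure 10
gammas = [0.0, 0.1, 0.2, 0.3]    # columns of Figure 10
max_bias = np.array([[bias_surface(g, t).max() for g in gammas] for t in taus])
```

Scanning `max_bias` row-wise and column-wise reproduces the trends discussed in the text: the maximum bias shrinks as γ_j grows within a row and shrinks as τ_j decreases down the rows.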
Second, it can be seen how, for each amount of drift, the amount of bias gets progressively smaller as the lower-asymptote parameter increases (i.e., within a row, the surfaces are generally flatter for higher values of γ_j). Third, for a given model and fixed amount of drift (i.e., for a particular 3-dimensional surface), the bias appears to be larger for items that have higher discrimination values than for items that have lower discrimination values for many location differences. Fourth, it is clear that the introduced biases are not very large in absolute magnitude across all surfaces unless there is an unusually large amount of drift. For example, if the introduced bias is .1, which is a reasonably typical value for many of the cases shown above, then 10 items with that amount of drift are required to produce a true-score difference of 1 point. This is a minute difference for most practical purposes and hence speaks well to the robustness properties of IRT models. This robustness result furthermore presumes that there are no other items in the data set that display drift in the opposite direction, in which case effects might cancel out, leading to even less of an overall impact on examinee true scores for item sets.

For cross-model comparisons, the nested nature of the three models becomes important. First, the 2PL is a special case of the 3PL with a fixed lower-asymptote parameter (typically γ_j = 0 for all items) and hence, in order to compare the biases introduced under a 2PL to those introduced under a 3PL, one has to look across a row. Here one can see that the introduced bias is globally smaller for the 3PL for all identical location differences, given that the item discrimination value is identical under both calibrations; this was shown analytically earlier in the chapter.
As an illustrative example, consider an extreme case where an item with a discrimination parameter value of α_j = 1.5 is calibrated under a 2PL (with γ_j = 0) and under a 3PL (with γ_j = .3) and displays a drift of τ_j = .8. This corresponds to the two slices in Figure 11, which are taken from the first and last 3-dimensional surfaces in the first row of the matrix in Figure 10.

Figure 11. Effect of drift for an item with identical discrimination calibrated with 2PL (γ_j = 0) and 3PL (γ_j = .3).

In accordance with Figure 9, one can see how the introduced bias is lower at all location differences for the 3PL, which would hold for other discrimination values as well. Note, however, that if the discrimination parameter values for a given item are different under the two models, or if different location differences are considered, then one would have to inspect the height of the surfaces locally, which will be discussed later. Second, the 1PL is a special case of the 2PL with a fixed discrimination parameter value (typically α_j = 1 for all items), and so comparing bias under the 1PL to that under the 2PL is tantamount to inspecting a horizontal slice from a particular 3-dimensional surface in relation to the rest of the surface. For most location differences the introduced bias is higher if the item has a higher discrimination value under the 2PL than under the 1PL (i.e., typically an α_j greater or less than 1 under the 2PL); however, that is not true for all location differences. As an illustrative example, consider an extreme case where an item is calibrated under a 1PL (with α_j = 1) and also under a 2PL (with α_j = 2) and displays a β-drift of τ_j = .8.
This corresponds to the 3-dimensional surface in the upper left-hand corner of the matrix in Figure 10 and leads to the two horizontal slices shown in Figure 12.

Figure 12. Effect of drift for an item with different discrimination calibrated with 1PL (α_j = 1) and 2PL (α_j = 2).

Here one can see that for location differences of less than about -.7 units or more than 1.5 units, there is more bias introduced under the 1PL than there is under the 2PL, whereas for location differences between -.7 and 1.5 units the opposite is true.

Translating Theoretical Conditions into Practical Contexts

In order to make statements about the amount of bias introduced into the response probabilities, and hence true scores, under β-drift, several conditions had to be explicitly stated in each of the above cases. Practitioners will recognize that most of these conditions appear unrealistic. For example, if an item set were calibrated once with a 2PL and once with a 3PL and both models seemed to provide adequate fit, as perhaps judged by some empirical fit statistic, then it is unlikely that a given item would have the same discrimination value under both models, allowing for the global statements that were made earlier. Similarly, for a given examinee the θ estimate is likely to be different under the two models in this situation as well, which will result in different location differences relative to the item difficulty value for the two models, further complicating matters. Moreover, the amount of bias for an item calibrated under a 2PL on separate occasions is likely going to be different from the bias for the same item calibrated under a 3PL on separate occasions, adding yet another layer of complexity to the issue.
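Returning to the 1PL-versus-2PL comparison of Figure 12, the crossover pattern can be located numerically. The sketch below is illustrative (the grid and names are mine): it evaluates the two bias slices and finds where they cross; under these settings the sign changes fall near the -.7 and 1.5 units described above.

```python
import numpy as np

def bias_slice(d, alpha, tau):
    """Bias under beta-drift of size tau for a logistic (gamma = 0) item
    with discrimination alpha, as a function of d = theta - beta."""
    s = lambda x: 1.0 / (1.0 + np.exp(-x))
    return s(alpha * d) - s(alpha * (d - tau))

d = np.linspace(-3.0, 3.0, 1201)
diff = bias_slice(d, 1.0, 0.8) - bias_slice(d, 2.0, 0.8)  # 1PL bias minus 2PL bias
cross = d[:-1][np.diff(np.sign(diff)) != 0]               # approximate crossover points
```

Between the two crossover points the 2PL slice lies above the 1PL slice; outside them the ordering reverses, exactly as in the figure.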
It thus appears that the answer to the question of which model leads to optimal properties in terms of introduced biases cannot be given simply and is indeed rather complex. In an attempt to provide some closure, however, a few general considerations about the types of biases can be made. These considerations look at the biases one can expect if the population data are generated with a certain model (e.g., the 1PL) but a different model (e.g., a 2PL) is fit and deemed acceptable. Given the analytical results in this study, these synthetic descriptions should be understood as simplifications only.

Case 1: Generating Model is 1PL

If one fits a 1PL to data, one is effectively choosing one of the horizontal slices in a particular 3-dimensional surface in a particular column in the above matrix. Fitting a 2PL or a 3PL to such data, which is really an example of overfitting, does not lead to differences in α or γ values compared to those that are obtained under a 1PL calibration, since the assumptions of constant α values and constant γ values hold. Hence, theoretically no differences in biases exist, and practically any differences in biases between the different calibrations will be due to sampling fluctuations, will be extremely minor, and will have almost no practical impact on decision-making.

Case 2: Generating Model is 2PL

If one fits a 1PL to data that were originally generated with a 2PL, one forces all α values to be equal when in fact they are not. Depending on the degree of deviation between the fixed value the 1PL imposes and the true generating values, as well as the location differences of examinees, one observes higher or lower biases for either model, as one effectively chooses different slices of a selected surface.
In extreme cases these differences might lead to some differences in decision-making, but due to potential cancellation effects across item sets these are likely going to be minor. If one fits a 3PL to data that were generated with a 2PL, one will theoretically not observe any bias differences, as the constant lower-asymptote assumption holds.

Case 3: Generating Model is 3PL

If one fits a 1PL to data that were originally generated with a 3PL, one inherits the results from the previous case, which are now compounded by the fact that one also forces the γ parameter to be 0 for all items. If one fits a 2PL to data that were generated with a 3PL, one will find that the biases under β-drift are going to be uniformly larger for those items that have γ values different from 0 under the 3PL but identical discrimination values otherwise.

Conclusion

This chapter set out to take an analytical look at the issue of model optimality, to see how one could gather evidence for making the case that one of the three unidimensional IRT models is more robust than the others. For all practical purposes this was not possible, and it was only theoretically possible if one compares 2PL and 3PL calibrations under some restrictive side conditions. The goal of this chapter was not to address calibration issues that may further compound the complexity of the problem, because it is clear that sample-to-sample fluctuations in parameter estimates affect the estimates of population biases and hence an investigator's ability to cleanly detect the case with which he or she is working. At the same time, circumventing this problem was advantageous because it allowed for a clearer presentation of the underlying logic, for which it is irrelevant whether one is able to pinpoint exactly the true parameter values.
The results in this study highlight theoretical considerations that need to be taken into account in discussions about model optimality if a rich discussion on that subject is sought. It is, of course, up to the modeler to decide whether robustness properties of this kind are indeed desirable. On the one hand, one may argue that IPD is a feature of the data that a model should be very sensitive to, so that a drift of a certain magnitude should have a rather strong impact on model-based inferences. On the other hand, one may argue that it is, practically, rather difficult to disentangle IPD from other factors that could lead to a differential functioning of items, so that it is a nuisance that IRT models should be robust towards. Depending on the philosophical beliefs about what models are supposed to accomplish that the modeler brings to the table, either line of reasoning could be considered appropriate.

This chapter concludes the more theoretical section of the dissertation, which was concerned primarily with mathematical investigations of LOI. In the previous chapters it has been shown that the practical implications of a LOI are relatively minor, but no specific methodological tools have so far been presented to detect a LOI for real data. The fourth study in this dissertation, presented in the next chapter, fills this void. It presents a novel exploratory methodology, which allows for the simultaneous comparison of multiple examinee groups and the quantification of their relative differences under a LOI lens.
Chapter VI - Quantifying Subpopulation Differences for a Lack of Invariance Using Complex Examinee Profiles: An Exploratory Multi-group Approach Using Functional Data Analysis

Item parameters in item response theory (IRT) models are typically said to be robust toward separate calibration with examinees from different subsets of the target population. However, as research in differential item functioning (DIF) and item parameter drift (IPD) has shown, this is not necessarily always the case, and the potential for lack of invariance (LOI) needs to be empirically assessed (e.g., Wells, Subkoviak, & Serlin, 2002). In most research on DIF, two groups of examinees, a reference and a focal group, are compared. Under a latent variable approach, comparisons can be done with various means such as effect tests in structural equation models or indices that measure the area between the item characteristic curves (ICCs) for the two groups (Clauser & Mazor, 1998; Ferrando, 1996; Raju, Lafitte, & Byrne, 2002; Reise, Widaman, & Pugh, 1993). Investigations of DIF can be extended to the testlet level (differential item bundle functioning; DBF) and the overall test level (differential test functioning; DTF) as well as to multiple groups, but in practice the focus is most commonly on item-level DIF for two groups. The research presented in this chapter extends such investigations in two major directions. First, many groupings that are commonly employed in practice appear to be based on rather coarse proxy indicators of more influential variables that cause the differences in item parameters, and such groupings are thus not inherently meaningful from an interpretative viewpoint.
If one assigns linguistic signifiers such as 'item difficulty' and 'item discrimination' to item parameters and indeed detects reliable differences in their values for different populations, it should be a primary goal to account for these differences in a rich fashion. In other words, if we want to use these terms properly, it is indispensable to understand what makes items difficult and what allows them to discriminate well between learners of different ability levels, both from an inferential validation and a test design perspective. Most commonly, researchers attempt to identify reliable differences in item parameters based on statistical DIF detection methods and then hand over the problem of accounting for these differences to subject-matter experts in the field (e.g., Allalouf, Hambleton, & Sireci, 1999; Budgell, Raju, & Quartetti, 1995). Instead, the research presented herein advocates the use of examinee profiles based on attitudinal and background variables to statistically define groups for which LOI investigations should be conducted. It is of course not claimed that subject-matter experts can or should ever be replaced entirely, but it is certainly desirable to provide more powerful explanatory evidence for differential test functioning that is derived through statistical methods. In addition, this approach can also help researchers and practitioners define statistically, through variable interactions, the populations of examinees for which test items are functioning similarly and hence, for what groups of examinees inferences from an assessment instrument are comparably valid. Second, what is missing in the field at the moment is a strong impetus to utilize the LOI occurrences that are observed in DIF and IPD research to quantify examinee population differences.
In this study, it is proposed to model the differences in item parameter values explicitly using non-parametric spline methods and to then use the observed functional variation to quantify differences between multiple populations using functional principal components analysis (F-PCA). In combination with having the groups defined by profiles based on attitudinal and background variables, this helps to further understand the observed variation in item parameter values and their relative differences in magnitude within a framework that allows for the simultaneous comparison of different examinee populations. Typically, groups are defined based on rather simplistic proxy variables (e.g., gender, ethnicity), and then it is investigated whether an item, a testlet, or a test displays LOI for these groups. While this is meaningful from a perspective where equality and fairness across generic groups is at stake, it provides little evidence as to general LOI properties across groups defined in a more complex fashion or as to the reasons for any observed LOI. Therefore, this study develops grouping structures based on examinee profiles first and then explores whether LOI is present for such groups. If LOI is indeed found, the group profiles will provide a richer description for the observed differences, which augments a posteriori subject-matter analyses of items or tests as is typically done in generic two-group comparisons. However, the tools advocated to pursue these questions are only suitable to investigate sets of items and not individual items, which makes them similar to tools for investigating DBF and DTF rather than tools for DIF. The chapter is organized as follows. In the next section, the process of identifying background variables and items is described. In the section after that, different approaches for grouping examinees based on background variables are described.
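The F-PCA idea introduced above can be sketched in a simplified form: once each group's item-parameter relationship has been smoothed onto a common evaluation grid (the spline smoothing itself is omitted here), principal component functions and group scores follow from an ordinary singular value decomposition of the centered curves. All names and the toy curves below are illustrative, and a full FDA treatment would add quadrature weights and a roughness penalty.

```python
import numpy as np

def fpca(curves, n_comp=2):
    """Unweighted functional PCA on curves evaluated on a common grid.

    curves: (n_groups, n_grid) array of smoothed curves.
    Returns the mean curve, the leading principal component functions,
    per-group scores, and the variance proportion of each component.
    """
    mean = curves.mean(axis=0)
    centered = curves - mean
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    pcs = vt[:n_comp]                 # principal component functions
    scores = centered @ pcs.T         # one row of scores per group
    var_prop = s[:n_comp] ** 2 / np.sum(s ** 2)
    return mean, pcs, scores, var_prop

# Toy example: four groups whose curves differ by the amplitude of one shape
grid = np.linspace(0.0, 1.0, 50)
curves = 1.0 + np.outer([0.5, 1.0, 2.0, 3.0], np.sin(2 * np.pi * grid))
mean, pcs, scores, var_prop = fpca(curves, n_comp=2)
```

The scores quantify how far each group's curve lies along each dominant mode of functional variation, which is the quantity used later to compare multiple populations simultaneously.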
The section after that describes how item parameters are estimated, how their relationship is modeled, and how this information can be used to quantify the differences in multiple examinee populations. Throughout each section, ideas are illustrated with real data from a subset of the Third International Math and Science Study 1999 (TIMSS 1999; Mullis et al., 2001). The TIMSS 1999 assessment was designed to allow for international comparisons of examinees in the areas of Math and Science achievement. It consists of dichotomously and polytomously scored responses to multiple-choice items as well as responses to attitudinal and background variables by students in grades 8 or 9 from 38 countries around the world. As one of the most comprehensive international studies of academic achievement in Math and Science, it lends itself naturally to investigations of parameter invariance and differential test performance, since a multitude of examinee groups already exists naturally and the comparability of results needs to be ensured.

Identifying Items and Background Variables

Before any groupings of examinees can be determined, one first has to determine which items and background variables should be chosen as input for the different analyses. The choice of the items is perhaps easier than the choice of the background variables, because it is driven by practical concerns about which sets of items function similarly for different examinee groups. The choice of background variables is much more challenging, however, because the quality of the variables affects the quality of the grouping structure and hence, the explanatory power of the approach as a whole.

Item Selection

In general there is a great degree of freedom for determining which items to choose, since background information and item responses are conceivably available for all examinees and items.
While the advantage of the TIMSS 1999 data set is the richness of the background variable set and the large number of examinees, its "disadvantage", for the purpose of this study, is the use of a matrix-sampling design. In this design, different examinees are administered different sets of items in different booklets to reduce the cognitive load for examinees and to facilitate administration. It was first decided to use data from only those countries that had English as their dominant language, to minimize any potential translation effects in items and cultural differences in examinee groups, which resulted in 28,440 examinees from Canada, the U.S., New Zealand, Australia, and England. It was then decided to focus on the math multiple-choice (M-MC) items only. The choice between math and science items was arbitrary, while the choice for MC items was made because (a) the majority of items were MC and (b) a consistent scoring format (i.e., 0 = incorrect, 1 = correct for MC items) allowed for the use of a single calibration model (e.g., the one-parameter logistic (1PL), two-parameter logistic (2PL), or three-parameter logistic (3PL) model) and thus allowed for parameter comparisons across all items. However, there were only 6 items that were administered to all examinees (i.e., anchor items). All other items appeared in 3 or 4 booklets, with booklets containing sets of about 6 items. To maximize the number of items and examinees available, it was decided to choose the 6 anchor items and the 6 items that appeared in 4 booklets, yielding a total of 12 items and 14,060 examinees. The examinees were in grades 8 or 9 (mean age = 14.11) and evenly split between boys and girls (boys = 49.9%, girls = 51.5%).
Furthermore, the educational level of the parents was relatively high, with 73.8% of the parents finishing at least secondary school or even university and only 26.2% having attained a lower educational degree or no degree at all. Since the TIMSS 1999 assessment distinguished between different subdomains, it should be noted that out of the 12 items, 4 are 'Fractions and Number Sense' items, 1 is a 'Measurement' item, 3 are 'Data Representation, Analysis, and Probability' items, 3 are 'Geometry' items, and 1 is an 'Algebra' item. It is noted by the TIMSS 1999 researchers, however, that they calibrated items from different subdomains as one set, as that yielded the most stable parameter estimates; indeed, the TIMSS 1999 data contain a national Rasch score for all math items. Therefore, it seemed justified to pursue the same approach for the 12 items considered here and also treat them as one set.

Background Variable Selection

One of the prerequisites for conducting the type of investigation proposed here is access to a set of background variables that can function as powerful predictor variables for differences in item parameter values. In other words, if one is interested in accounting for variation in difficulty, discrimination, or guessing levels across populations, one should choose variables that influence response processes in a way that creates variation in these parameter values. This is one immediate reason why variables such as 'gender' and 'ethnicity' appear intuitively insufficient for this purpose and seem to be rather coarse proxies for more complex constructs that cause differential performance. Instead, it seems much more reasonable to choose variables that attempt to measure psychological and environmental characteristics of examinees such as self-efficacy, self-concept, self-determination, perceived valence of task, educational level of parents, and educational resources at home.
However, there exists no clear set of candidate variables that have been repeatedly shown across studies to contribute substantially to variation in item parameter values (but see, e.g., Buck, Tatsuoka, & Kostin, 1997; Rupp, Garcia, & Jamieson, 2002; Wolf & Smith, 1995; Wolf, Smith, & Birnbaum, 1995). In addition, it should be noted that labels for item parameters such as 'difficulty', 'discrimination', and 'guessing' are not without problems, as their substantive meaning is often only marginally understood at best (Rupp & Zumbo, 2003; Zumbo, Pope, Watson, & Hubley, 1997), but they certainly provide a heuristic starting point for thinking about which variables to select. More research such as this will hopefully shed more light on how certain variable types and sets are of differential utility across content domains, task types, and assessment instruments. What is indeed sorely needed is a theory about the types of information that are required to account for differential performance and the best ways to gather such information, so that one does not fall into the trap of replacing one set of proxy variables with another. Since the 12 M-MC items from the TIMSS 1999 subset were all Math items, it was decided to focus primarily on those background variables that were semantically related directly to performance in Math, such as Likert ratings on statements such as "I am just not talented in mathematics" or "I like mathematics". This resulted in a set of 73 individual background variables and a set of 11 derived background indices. The 11 derived indices were provided in the TIMSS 1999 database and were summary indices that combined responses on a set of background variables and recoded the combined responses.
For example, the variable 'bsdmpatm' represents an index of positive attitude toward mathematics, scored on a 3-point scale and derived from the Likert ratings on 5 variables that measure attitude toward mathematics: (1) "I like mathematics", (2) "I enjoy learning mathematics", (3) "Mathematics is boring" (reversed scoring), (4) "Mathematics is important to everyone's life", and (5) "I would like a job that involved using mathematics". Unfortunately, data on background variables were incomplete, with the percentage of missing data per individual variable ranging from 2% to 15%, and with most variables having about 4% - 6% of the cases missing. Since listwise deletion would have resulted in about 50% of the cases being unavailable for analysis due to the large number of variables, it was decided to impute the data using the expectation-maximization algorithm (e.g., Dempster, Laird, & Rubin, 1977) in PRELIS (Jöreskog & Sörbom, 2001). This resulted in a complete data set of 12 M-MC items and 84 background variables for 14,060 examinees from countries with English as a dominant language, who could now be grouped.

Grouping Algorithms

With a set of background variables at hand, the next step is to choose either a single grouping algorithm or, alternatively, a set of grouping algorithms that use the information contained in these background variables to form interpretatively meaningful groups. It would be naive to believe that a single 'best' algorithm exists because, as is well known in the literature, different algorithms have their strengths and weaknesses (e.g., Berkhin, 2002; Fasulo, 1999; Fraley & Raftery, 1998; Johnson & Wichern, 1982). In addition, the type of algorithm that is most suitable for the data set at hand also depends on the measurement scales of the variables.
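The PRELIS implementation of the EM imputation step is not reproduced here, but its flavor can be sketched as follows. This simplified version iterates conditional-mean imputation under a multivariate normal model (it omits the conditional-covariance correction of a full EM M-step, a common simplification); all names and the toy data are mine.

```python
import numpy as np

def em_style_impute(X, n_iter=50):
    """Iteratively impute np.nan entries with their conditional means
    under a multivariate normal model for the columns of X."""
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    mu = np.nanmean(X, axis=0)
    Xf = np.where(miss, mu, X)          # start from column-mean imputation
    for _ in range(n_iter):
        sigma = np.cov(Xf, rowvar=False, bias=True)
        for i in range(X.shape[0]):
            m = miss[i]
            if not m.any():
                continue
            o = ~m
            # E-step analogue: conditional mean of missing given observed
            soo = sigma[np.ix_(o, o)]
            smo = sigma[np.ix_(m, o)]
            Xf[i, m] = mu[m] + smo @ np.linalg.solve(soo, Xf[i, o] - mu[o])
        mu = Xf.mean(axis=0)            # update the mean from the filled data
    return Xf

# Toy data: two correlated columns with every tenth value of column 2 missing
rng = np.random.default_rng(1)
z = rng.normal(size=(200, 1))
X_true = np.hstack([z, z + 0.1 * rng.normal(size=(200, 1))])
X_obs = X_true.copy()
X_obs[::10, 1] = np.nan
X_imp = em_style_impute(X_obs)
```

Rows with all entries missing are not handled, and a production implementation would also track conditional covariances rather than only conditional means.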
If the number of variables is relatively small (e.g., 10 - 20) and consists exclusively of continuous variables measured on interval or ratio scales (Stevens, 1946), or of a mixture of continuous variables measured on interval, ratio, or ordinal scales with a sufficiently large number of scale points, a traditional exploratory clustering algorithm could be appropriate. There are numerous clustering algorithms available with various methods to construct clusters (e.g., agglomerative, divisive, centroid) and, depending on the multidimensional properties of the true underlying clusters in the population, different algorithms are best suited for different problems. Even though one can pool clustering results for different input parameter constellations and investigate the stability of membership assignment, these methods provide no numerical means for selecting a 'good' clustering structure. Moreover, the results of a clustering algorithm depend on the type of similarity measure used to construct the similarity matrix (e.g., Euclidean distance, city-block distance, correlational measure, conditional proximity matrix) and the type of linkage used (e.g., single linkage, complete linkage, average linkage). An improvement over these purely exploratory techniques is model-based clustering (MBC) algorithms, which assume a statistical model for the joint multivariate distribution of the background variables and hence allow for model comparisons through goodness-of-fit indices. For nested model structures, a likelihood-based measure can be appropriate, whereas for non-nested models the Bayesian Information Criterion provides guidance, albeit no definitive answer, for which clustering method to choose.
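The sensitivity of hierarchical clustering to the similarity measure and linkage choices described above can be sketched with a toy example (synthetic data; the function names are from SciPy's hierarchical clustering utilities):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Two well-separated synthetic groups of examinees on 10 continuous variables
X = np.vstack([rng.normal(0, 1, (50, 10)), rng.normal(6, 1, (50, 10))])

D = pdist(X, metric="euclidean")    # one choice of similarity measure
labels = {}
for method in ("single", "complete", "average"):    # linkage choices
    Z = linkage(D, method=method)
    labels[method] = fcluster(Z, t=2, criterion="maxclust")

# With clean separation all linkages agree; with messier data they often do not
agree = all(len(set(l[:50])) == 1 and len(set(l[50:])) == 1
            for l in labels.values())
print("all linkages recover the two groups:", agree)
```

With real background-variable data the separation is rarely this clean, which is precisely why the linkage and similarity-measure choices matter.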
A number of MBC methods are currently available for the practitioner, and they basically differ with respect to whether they model multivariate normal distributions, non-continuous scale distributions, or mixtures of these, as well as whether they estimate the covariance structures in the clusters and how they estimate the model parameters (Vermunt & Magidson, 2002). As is common with many multivariate techniques, an assumption of multivariate normality provides flexible estimation and testing frameworks yet also requires that the background variables be quantitative and, for computational purposes, be of a reasonably small number. Even though recent advances have been made in the estimation of models with mixed predictor variables, many approaches are still bound by memory and speed limitations for large numbers of observations and data points. In the TIMSS 1999 data set, the sheer number of background variables and observations itself precludes running an MBC algorithm on the original variables for most software programs. In addition, most background variables are at best ordinal with typically around 3 scale points, further precluding the use of a method that assumes multivariate normality.[1] Hence, it was decided to perform an exploratory factor analysis (EFA) on the items first and then to use the resulting continuous factor scores as input into an MBC algorithm, which had the additional advantage that the dimensionality of the data was significantly reduced.

Factor Analysis as Precursor to Clustering

The scree plot in Figure 13 resulted from a principal components (PC) extraction of the eigenvalue structure of the covariance matrix. It shows that 8 factors are most dominant in the data structure even though they only account for about 41.34% of the observed multivariate variation in the data.
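The PC extraction underlying such a scree plot is simply a spectral decomposition of the covariance matrix; a minimal sketch on synthetic data (not the actual TIMSS background variables) is:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-in for a background-variable matrix: a few dominant
# factors plus noise, mimicking a scree plot with an elbow
n, p, k = 500, 30, 4
F = rng.normal(size=(n, k))
L = rng.normal(size=(k, p))
X = F @ L + rng.normal(scale=2.0, size=(n, p))

# Principal components = eigendecomposition of the sample covariance matrix
S = np.cov(X, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(S))[::-1]
explained = eigvals / eigvals.sum()

print("first 4 PCs explain:", round(float(explained[:4].sum()), 3))
```

Plotting `eigvals` against the component index reproduces the kind of scree plot shown in Figure-13-style displays, with the elbow marking the number of dominant factors.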
[1] Future research will investigate the utility of the perhaps most flexible latent-class MBC approach implemented in Latent GOLD (Vermunt & Magidson, 2000), but this approach was not pursued for this study.

Figure 13. Scree plot of eigenvalues for exploratory PC factor analysis.

The first 8-factor solution that was fit to the covariance matrix was a principal components extraction with an oblique rotation. The principal components extraction was chosen because it places no requirements onto the scale type of the variables or their multivariate distribution, unlike methods such as maximum-likelihood or weighted-least-squares, as it is basically a spectral decomposition of the covariance matrix (Johnson & Wichern, 1982). An oblique rotation allowed for correlated factors, but the resulting interfactor correlations turned out to be negligible - except for perhaps three small correlations associated with factor 6 - as illustrated in Table 1.

Table 1
Factor Correlations for an Oblique 8-factor Solution under Principal Components Extraction Using the Sample Covariance Matrix

      F1     F2     F3     F4     F5     F6     F7     F8
F1    --   -.086   .028  -.047  -.022  -.212  -.008   .020
F2           --   -.095   .015  -.098   .252  -.001   .019
F3                  --    .024   .080  -.233  -.015  -.013
F4                         --    .024  -.038   .022   .241
F5                                --   -.168   .022  -.121
F6                                       --   -.006   .069
F7                                              --    .010
F8                                                     --

Note. F = Factor.

Consequently, an orthogonal varimax rotation with principal components extraction for 8 factors was fit to the covariance matrix to allow for easier interpretation of the factor loadings. To determine which items loaded on which factor, a cut-off value of .25 in absolute magnitude was chosen. Table A4 shows all variable loadings on each factor along with a brief verbal description of all variables as given in TIMSS 1999.
In addition, Table 2 lists only the variables whose loadings for the individual factors exceeded .25; the variables for each factor are listed in decreasing order of the absolute magnitude of their loadings on the factors.

Table 2
Variables with Loadings Exceeding .25 for All 8 Factors

Factor 1
Math is not one of my strengths.
I am just not talented in Math.
Math is more difficult for me than it is for others.
I would like Math much more if it were not so difficult.
Sometimes, when I don't really understand a topic, I feel I will never understand it.
How much do you like Math?
I usually do well in Math.
Do you think that you enjoy learning Math?
To do well in Math you need good luck.
To do well in Math I memorize notes.
To do well in Math you need natural talent.

Factor 2
Index of daily hours spent studying.
Index of study time for other subjects outside of school.
Index of study time for Math outside of school.
Study time for other subjects outside of school.
Index of study time outside of school.
Study time for Math outside of school.
Index of study time for all three fields outside of school.

Factor 3
Do you think that you enjoy learning Math?
How much do you like Math?
My friends think that it is important to do well in Math.
My friends think that it is important to do well in Science.
How much do you like using computers to learn Math?
To do well in Math you need lots of hard work studying at home.
Students in the classroom do exactly as they are told.
How often do you use the internet to work on Math projects?
To do well in Math you need to memorize notes.
Students in the classroom are orderly and quiet.
How often do you use email to work with other students on Math projects?
I usually do well in Math.

Factor 4
How often does the teacher use the overhead projector in class?
How often do you discuss completed homework in class?
How often do you check each others' homework in class?
How often do you work in small groups or pairs in class?
How often do you have a quiz or test in class?
How often do you begin homework in class?
How often do the students use the board in class?
How often do you use things from everyday life to solve problems?
How often do you work on projects?
How often does the teacher check homework?
How often do you use computers in class?
How often does the teacher show you how to do problems in class?
How often do you copy notes from the board in class?
How often does the teacher give homework?

Factor 5
I would like a job involving Math.
I need to do well in Math to get the job that I want.
I need to do well in Math to please myself.
I think that Math is boring.
I need to do well in Math to enter the desired school I want to attend.
I think that Math is important in life.
I think that Math is an easy subject.
I need to do well in Math to please my parents.
How often does the teacher use the board in class?

Factor 6
How often does the teacher discuss practical problems when introducing a new topic?
How often does the teacher ask students what they know when introducing a new topic?
How often do you work in small groups when a new topic gets introduced?
How often does the teacher explain rules when introducing a new topic?
How often do you look at the textbook when a new topic gets introduced?
How often do you solve related examples when a new topic gets introduced?
How often do students use the overhead in class?
How often does the teacher use the computer in class?

Factor 7
Index of home educational resources.
Index of highest educational level of parents.
How many books are there in your home?
Index of home equipment (computer, desk, books).
How far do you expect to go in school?
Do you possess a computer at home?

Factor 8
Highest educational level of father.
Highest educational level of mother.
Index of highest educational level of parents.

Both tables show how variables related to different aspects of performance load relatively cleanly, although not perfectly, on separate factors. Note that the negative and positive signs of the loadings in Table A4 should not be interpreted as contrasting indicators; on the contrary, it can be shown that the negative and positive signs reflect the reverse coding of the variables in the TIMSS 1999 data set. For example, for factor 1, lower values indicate agreement with the statement for all variables. Hence, for the first variable, which has a negative factor loading, a lower value indicates more confidence in Math ability, whereas for the other variables a higher value indicates more confidence in Math, because this is reflected in the student disagreeing with the statements. Overall, therefore, higher values on this factor indicate a higher confidence in Math ability. Using the variable descriptions, descriptive labels for the factors were developed as listed in Table 3 below.

Table 3
Descriptive Factor Labels for Orthogonal 8-factor Solution

Factor  Description
1       Self-efficacy Factor
2       Outside-of-school Study Time Factor
3       Attribution & Learning Practices Factor
4       Instructional Assessment & General Approach Factor
5       Professional Utility Factor
6       Instructional Approach for New Topic Factor
7       Home Resource Factor
8       Educational Level of Parents Factor

Using this factor structure, factor scores were estimated using ordinary least-squares, yielding factor scores that are effectively weighted PC scores (Johnson & Wichern, 1982); the factor scores were used as input into a subsequent clustering algorithm.

Clustering with Factor Scores

It was originally decided to employ the MBC method using the multivariate normal mixture distribution approach available in S-PLUS through the mclust and EMclust routines.
However, as Fraley and Raftery (1998) state, this procedure requires extensive computation time for larger data sets and, for the TIMSS 1999 data, was not able to handle more than about 2000 observations under the relatively simple model of spherical clusters with variable volumes. Hence, since more flexible models with an even larger number of parameters were certainly prohibitive, it was decided to perform MBC approaches on random subsets of 2000 observations under the spherical-clusters-with-variable-volume constraint and to inspect the resulting cluster structure. The idea was to use the information about the resulting cluster structure to determine a likely number of dominant clusters that could then be used as input into a more heuristic method such as the centroid method, which can easily handle all 14,060 observations in a program such as SPSS. Unfortunately, no clear number of clusters was apparent from these analyses, as the Bayesian information criterion values showed a linear downward trend across the spectrum. Thus, an alternative approach needed to be sought and it was decided to fall back on an informed heuristic exploratory approach. It was decided to choose the centroid method and to inspect resulting cluster structures for a range of cluster solutions; eventually, based on the resulting cluster sample sizes and the interpretability of the cluster structures, solutions for 7-13 clusters were compared. For each solution, an iterative classification approach was allowed and the maximum number of iterations was set to 100, resulting in all clustering routines converging (i.e., showing no further meaningful numerical reduction in inter-cluster distances). For each solution, the resulting factor score distributions across clusters were inspected via box plots, which allowed the detection of numerically identified outliers and median comparisons per factor score variable across clusters.
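The centroid-style clustering step can be sketched in a few lines (plain Lloyd's algorithm on synthetic factor scores; this is an illustration only, not the exact centroid implementation in SPSS):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Lloyd's algorithm with an iteration cap, mirroring the 100-iteration
    limit used above (illustrative stand-in for the centroid method)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(max_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        new = np.array([X[labels == j].mean(0) if (labels == j).any()
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):   # converged: no meaningful movement
            break
        centers = new
    return labels, centers

rng = np.random.default_rng(2)
scores = rng.normal(size=(2000, 8))     # 8 factor scores, subsample of 2000
labels, centers = kmeans(scores, k=10)
print("cluster sizes:", np.bincount(labels, minlength=10))
```

The resulting cluster sizes and centers are what one would inspect across the 7-13 cluster solutions described above.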
The goal was to choose a cluster structure that satisfied the two properties that (a) every cluster can be characterized by unusually large or small median factor scores for at least one factor score variable, and (b) every factor score variable is used in that sense to characterize at least one cluster. The two solutions that satisfied these criteria best were the 8-cluster and 10-cluster solutions, which, of course, did not differ much from each other, because the 10-cluster solution basically redistributed a few cases across clusters. Since some of the clusters could be characterized more cleanly with the 10-cluster solution, its cluster assignment was used to create groups and to calibrate the item parameters for these groups; a more detailed interpretation of the most relevant groups in terms of the attitudinal and background variables that define them is given in the section on multi-group comparisons below. The clustering methodology appears to be promising for investigating the robustness properties of IRT parameters for subgroups defined through the interaction of variables that may help to account for observed group differences in test functioning. The approach does not presume any predictive structure of the underlying variables per se but uses them to derive what one may call "naturally occurring" groups, whose structure can reflect subject-matter understanding about which factors seem to differentially impact test functioning. The following section proposes an alternative exploratory method for clustering examinees, which is particularly well suited for large data sets that contain a large number of examinees as well as a large number of explanatory variables of different scale types.
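Criteria (a) and (b) can be checked mechanically from per-cluster medians; a sketch with synthetic scores and a hypothetical cluster assignment (the one-robust-z cutoff is arbitrary and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
scores = rng.normal(size=(1000, 8))         # 8 factor-score variables
labels = rng.integers(0, 10, size=1000)     # hypothetical 10-cluster assignment
scores[labels == 0, 0] += 3.0               # make cluster 0 stand out on factor 1

# A cluster is "characterized" by a factor if its median is unusually far
# from the overall median (here: more than 1 robust z-unit, using the MAD)
overall = np.median(scores, axis=0)
mad = np.median(np.abs(scores - overall), axis=0)
flags = np.zeros((10, 8), dtype=bool)
for c in range(10):
    med = np.median(scores[labels == c], axis=0)
    flags[c] = np.abs(med - overall) / mad > 1.0

# Criterion (a): every row has a True; criterion (b): every column has a True
print("cluster 0 characterized by factor 1:", bool(flags[0, 0]))
```

In practice this kind of summary complements, rather than replaces, the visual box-plot inspection described above.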
CART Methodology

If the number of variables is large (e.g., 50-100) and the variables are exclusively or predominantly categorical, then any approach that does not use some form of dimensionality reduction for the data is problematic and alternative means for clustering need to be sought. One tool that is particularly flexible for clustering observations in high-dimensional spaces defined by categorical as well as numerical variables is the Classification and Regression Tree (CART) methodology (Breiman, Friedman, Olshen, & Stone, 1984), which has numerous applications in the data-mining field. Software that "grows" CART trees is readily available nowadays and is either integrated into software programs (e.g., the 'tree models' function in S-PLUS), available as a separate suite (e.g., the AnswerTree™ module for SPSS), or available as a stand-alone product (e.g., CART™ by Salford-Systems). Despite differences in the tree growing and pruning methodologies and the sets of available measures for each software program, they all build a tree using, typically, binary splits starting with the entire data set in a parent node. The algorithm seeks a recursive partitioning of the data and thus splits the data into disjoint subsets. At each step, the next splitting variable is chosen as that variable which reduces a certain deviance measure (e.g., misclassification rate, least-squares deviance) most strongly. Since CART trees cluster observations in general, the observations could be either individuals or items in psychometric contexts. But in contrast to traditional or MBC algorithms, CART trees are predictive models and thus require the user to choose a response variable whose variation a resulting cluster structure is supposed to predict optimally.
Hence, one is required to choose either a categorical response variable (leading to classification trees and associated deviance measures) or a numerical response variable (leading to regression trees and associated deviance measures) for growing a CART tree. Since the eventual goal in the TIMSS 1999 investigation was to predict variation in item difficulty parameter values, and since it is reasonable to assume that item difficulty parameter values are not invariant for separated ability groups (i.e., low, medium, and high ability examinees), a variable that measures ability seems reasonable. Hence it was decided to regress the examinee total score on the 12 items onto the background variables, because this provides insight into which variables best predict total performance. Moreover, the Pearson correlations between the total score and the national Rasch score as well as the standardized Math score were r = .838 (p < .01) and r = .856 (p < .01), respectively; yet, the total score was easily available and did not require adjustment of the metrics due to the fact that examinees from different countries made up the data set. At the same time, the disadvantage of using the total score, just as with any achievement score, is that it basically separates poor from good performers, and one can argue that this process is somewhat redundant for the 12 items used here, as one could simply group examinees into different groups by observed total scores. Nevertheless, this serves to illustrate the methodology, which works particularly well for larger item sets with a larger range of scores. Moreover, it also provides some empirical data to verify claims about the relationship between group ability and the degree of LOI. The choice of a "best" tree is of course another sensitive matter and there are various means for making such a choice.
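The greedy deviance-reduction step that a regression tree performs at each node can be sketched as follows (synthetic data only; `y` plays the role of the total score, and variable 2 is constructed to matter):

```python
import numpy as np

def best_split(X, y):
    """One CART step: find the (variable, threshold) binary split that most
    reduces the least-squares deviance (sum of squared errors)."""
    base = ((y - y.mean()) ** 2).sum()
    best = (None, None, 0.0)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= t
            dev = (((y[left] - y[left].mean()) ** 2).sum()
                   + ((y[~left] - y[~left].mean()) ** 2).sum())
            gain = base - dev
            if gain > best[2]:
                best = (j, t, gain)
    return best

rng = np.random.default_rng(4)
X = rng.integers(1, 5, size=(500, 6)).astype(float)   # ordinal background variables
y = 3.0 * (X[:, 2] > 2) + rng.normal(size=500)        # total score driven by variable 2
j, t, gain = best_split(X, y)
print(f"best split: variable {j} at {t}, deviance reduction {gain:.1f}")
```

Growing a full tree simply applies this step recursively to each resulting subset until a stopping or pruning rule intervenes.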
One way is to look at a relative error measure that is derived from applying test sets of data (e.g., a random subset of 20% of the input data) and measuring misclassification rates or total deviance; another way is a model-based approach that fits regression models at each splitting stage and hence produces sequences of log-likelihood values for nested model comparisons (e.g., available in AnswerTree™). As with traditional clustering algorithms, the eventual decision is not purely numerically based but is also influenced by the interpretability of the resulting grouping structure and the number of cases in each terminal node (i.e., group). For the purpose of this investigation it is important to have a sufficient number of examinees in each terminal node, because the goal is to calibrate item parameters separately for each group, and that requires sufficiently large group sizes for stable parameter estimation with an IRT calibration program such as BILOG-MG (Zimowski et al., 1996) or PARDUX (Burket, 1998). In order to avoid an overreliance on a single numerical measure of best fit, a hybrid method that borrows strength from multiple calibration runs with different calibration settings was employed (see Rupp, Garcia, & Jamieson, 2002). Since the CART software by Salford-Systems (Steinberg & Colla, 1997) was used, the following basic options were available. First, the choice of splitting variable at each split could be based on either a least-squares (LS) or least-absolute-deviation (LAD) deviance measure; second, the splits could be based on linear combinations (LC) of predictor variables or they could not; and third, one could set aside a random portion of the data set for testing (the default 20% was used) or use V-fold cross-validation (CV) for that purpose (the default 10-fold CV was used).
To pool results, one can make use of the 'variable importance rankings' reported by the software, which report the proportion of reduction in deviance achieved by the most important splitting variable, with the other important splitting variables scaled with respect to that number, resulting in a rank order of all predictor variables. In addition, these importance ratings can be computed based on main splitting variables only (i.e., those that are directly reported in the tree) or based on surrogate splitting variables as well (i.e., those variables that would result in "similar" splits if used instead of the main splitting variable). Since it is not possible to compute this rank order for main splitting variables only if LC splits are allowed, the rank ordering based on main and surrogate splitters was chosen each time. The above choices result in 8 different calibrations, and Table 4 shows the resulting number of terminal nodes and the associated relative error of the resulting 'best' tree for each calibration:

Table 4
Number of Terminal Nodes and Relative Error of 'Best' Trees Under Different Calibration Settings

                 LS Deviance Measure              LAD Deviance Measure
LC               Random Testing   10-fold CV      Random Testing   10-fold CV
Allowed          4                8               6                9
                 .886 (.014)      .870 (.006)     .920 (.010)      .910 (.005)
Disallowed       18               35              22               38
                 .891 (.013)      .884 (.006)     .928 (.010)      .929 (.004)

Note. The upper entries in the main cells are the number of terminal nodes of the tree with the lowest relative error rate (i.e., residual variation across all nodes using test data). The lower entries in the main cells are the relative error rate of that tree followed by an estimate of its standard error in parentheses.
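Pooling importance rankings across the 8 calibrations can be sketched as follows (random numbers stand in for the software's importance scores; the scaling of all variables relative to the top variable mirrors the style of the reported output but is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical importance scores for 20 predictors from 8 calibration runs
imp = rng.random((8, 20))

# Within each run, express importance relative to the top variable (= 100)
scaled = 100 * imp / imp.max(axis=1, keepdims=True)

# Pool across runs by summing ranks (rank 1 = most important in a run)
ranks = (-scaled).argsort(axis=1).argsort(axis=1) + 1
rank_sum = ranks.sum(axis=0)
order = rank_sum.argsort()
print("top 5 variables by pooled rank sum:", order[:5])
```

The pooled `rank_sum` is the kind of quantity reported in the "Rank Sum" column of Table 5.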
In order to pool results across calibrations to determine a single tree to be chosen to group examinees, the most important variables across all calibrations were first determined; Table 5 shows the 10 most important variables.

Table 5
10 Most Important Variables Across 8 Calibration Settings

Variable    Question / Statement / Measurement                                                      Rank Sum
BSBMMYT3    "I am just not talented in mathematics."                                                11
BSBMMYT5    "Mathematics is not one of my strengths."                                               14
BSDMCMAI    Index of confidence in mathematics ability                                              29
BSBGBOOK    "About how many books are there in your home?"                                          47
BSBMDOW2    "To do well in mathematics you need good luck."                                         63
BSBMMYT4    "Sometimes, when I do not understand a topic initially in math, I never understand it." 71
BSBMGOOD    "I usually do well in mathematics."                                                     86
BSDGHERI    Index of home educational resources                                                     119
BSDGEDUP    Highest educational level of parents                                                    123
BSBMMYT2    "Although I do my best, mathematics is more difficult for me than for my classmates."   131

Table 5 shows that the general disposition toward mathematics, confidence in math ability, and home and educational resources are the strongest predictors of variation in examinee scores. It was decided to choose the one tree whose variable importance ranking is most similar in structure to the overall variable importance ranking across all trees. To determine this tree, the rank orders for the variables across all trees were correlated with the overall rank order using Spearman's correlation coefficient (ρs). The highest coefficient value was ρs = .838, which was associated with the tree that was built using the LS deviance measure, 10-fold CV, and LC splits allowed. Interestingly, this tree is also the one that has the smallest overall relative error out of all trees (see Table 4).
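The tree-selection step via Spearman's ρ can be sketched as follows (hypothetical rank orders; ρ is computed as the Pearson correlation of ranks, which equals Spearman's coefficient when there are no ties):

```python
import numpy as np

def spearman(a, b):
    """Spearman's rho via Pearson correlation of ranks (no ties assumed)."""
    ra = a.argsort().argsort().astype(float)
    rb = b.argsort().argsort().astype(float)
    return float(np.corrcoef(ra, rb)[0, 1])

rng = np.random.default_rng(6)
overall = np.arange(20, dtype=float)    # pooled overall rank order
# Three hypothetical trees whose rankings resemble the overall order to
# decreasing degrees (noise scales are arbitrary illustration values)
trees = [overall + rng.normal(scale=s, size=20) for s in (1, 5, 15)]

rhos = [spearman(overall, t) for t in trees]
chosen = int(np.argmax(rhos))
print("rho per tree:", [round(r, 3) for r in rhos], "-> choose tree", chosen)
```

The tree whose ranking correlates most strongly with the pooled ranking is retained, just as described above.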
The splitting rules of this tree were exported into SPSS and used to create the 8 groups that make up the terminal nodes; a more detailed interpretation of the most relevant groups in terms of the attitudinal and background variables that define them is given in the section on multi-group comparisons below. It should be noted that the only disadvantage of this tree is that the LC splits make interpretations of the splits more challenging, as each splitting rule is a weighted linear composite of predictor variables; nevertheless, the positive and negative contributions to these splits allow for some interpretation. In future investigations, one could omit the LC split rule to facilitate interpretation. From a theoretical perspective this shows, however, how predicting test performance is a matter of complex and not simple groupings. In this section two sets of groupings were developed, one via FA and clustering routines and one via CART methodology. Once such groups are determined, the next step is to calibrate the items and to compare the results across groups, which is the topic of the next section.

Calibrating Items and Comparing Groups

In the TIMSS 1999 report, it is stated that a calibration of global domain scores (i.e., math and science) produced more stable estimates than computing subdomain-specific subscale scores (e.g., math-algebra, math-measurement). As stated above, the 12 items that were selected for this study came from different math subdomains, so it was first necessary to assess whether they could be reasonably modeled with a unidimensional IRT model. For this purpose, an exploratory FA was conducted in PRELIS, resulting in a root mean square error of approximation for a 1-factor solution of .0339, indicating an acceptable fit.
This is further supported by the scree plot of the eigenvalues extracted via PC analysis from the polychoric correlation matrix of the items as shown in Figure 14, which shows clearly that there is only one dominant factor and that no single factor or small set of additional factors would result in any meaningful increase in variance explained.

Figure 14. Scree plot for PC extraction of eigenvalues from polychoric correlation matrix of item responses.

The one-factor solution accounts for 37.48% of the observed variance in the item responses, and despite the fact that this presents weak evidence for a substantively meaningful one-factor solution, the above evidence was deemed sufficient to proceed with item calibrations. In order to compare the item parameter calibrations from different groups using functional data analysis (FDA) methods, it is necessary to have one set of calibrated item parameters at hand that can be treated as baseline values. In a real-life application this could consist of the parameter values from a previous year or of those that have been obtained when all examinees are treated as belonging to a single population (see Zhang, Matthews-Lopez, & Dorans, 2003); this is the path pursued here, and so the 12 items were first calibrated jointly for all examinees with BILOG-MG (Zimowski et al., 1996). To assess the fit of different types of unidimensional models, a 1PL, 2PL, and 3PL were fit. Recall again that the 3PL is of the form

P_j(θi) = γj + (1 - γj) · exp(αj(θi - βj)) / [1 + exp(αj(θi - βj))],   αj > 0, 0 ≤ γj < 1, -∞ < βj, θi < ∞,

where αj is the item slope or "discrimination" parameter, βj is the item location or "difficulty" parameter, γj is the item lower-asymptote or "pseudo-guessing" parameter, and θi is a latent predictor variable.
For the 2PL, γj = c (with c = 0 in most applications) for all items, and for the 1PL, in addition to that, αj = k (with k = 1 in most applications) for all items. The log-likelihood statistics were recorded for each model as presented in Table 6.

Table 6
Model-fit and Item-fit Statistics for Calibration of 12 Target Items

Model   -2 log-likelihood   Change in -2 log-likelihood   df for χ²   p-value
1PL     197224.5938
2PL     194810.7023         2413.8915                     12          < .001
3PL     194751.3036         59.3987                       12          < .001

These results indicate that the 1PL certainly does not fit the data well and that the 3PL is the best-fitting model; hence, the 3PL was chosen for all calibrations in this study. It should be noted that for all three models all item-fit χ²-statistics were statistically significant, indicating that no model really fit well for all items if those statistics are chosen as a criterion, which is a confirmation of the factor-analytic results from above, which pointed to only a moderately strong one-factor structure. These are interesting points, because they somewhat contradict the use of a 1PL or Rasch model for this subset of items and, given that the subset consists of a mix of items from all 5 math subscales just like the entire math section, this brings up the question of how appropriate the national Rasch scores for examinees in the TIMSS 1999 data set really are from a model-fit perspective. However, the purpose of the research presented here is not to investigate that question, and the results here could of course be an artifact of the items chosen. Since the calibration size of 14,060 examinees is rather large for 12 items, the item parameter values obtained from a joint calibration of all examinees were considered to be the 'known' baseline values for the remainder of the study.
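The 3PL response function and the likelihood-ratio comparison in Table 6 can be reproduced in a few lines (the -2 log-likelihood values are those reported above; SciPy's chi-square survival function supplies the p-value):

```python
from math import exp
from scipy.stats import chi2

def p3pl(theta, a, b, g):
    """3PL response probability: g + (1 - g) / (1 + exp(-a(theta - b)))."""
    return g + (1 - g) / (1 + exp(-a * (theta - b)))

# At theta = b the curve sits halfway between the lower asymptote g and 1
print(p3pl(0.0, a=1.5, b=0.0, g=0.2))

# Nested-model LR test, as in Table 6: chi2 = -2logL(1PL) - (-2logL(2PL))
lr = 197224.5938 - 194810.7023
p_value = chi2.sf(lr, df=12)
print(f"LR = {lr:.4f}, p = {p_value:.3g}")
```

The same comparison for the 2PL versus the 3PL uses the 59.3987 difference, again on 12 degrees of freedom.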
Table 7 lists the item parameter values under the 3PL as estimated with PARDUX, which was chosen over BILOG-MG at this point for equating purposes that will be explained below.

Table 7
Item Parameter Values for Target Items Under the 3PL for Single Examinee Population

Item   αj     SE(αj)   βj      SE(βj)   γj    SE(γj)
1      1.50   .0368    -.19    .0160    .09   .0089
2       .63   .0251   -1.30    .1170    .00   .0574
3       .86   .0231    -.70    .0370    .00   .0189
4      1.00   .0323     .27    .0260    .14   .0113
5       .31   .0310    -.04    .3340    .00   .0841
6       .47   .0265   -1.31    .2250    .00   .0877
7       .59   .0249   -1.18    .1280    .00   .0585
8       .32   .0334   -1.63    .6790    .00   .1889
9       .53   .0242    -.79    .1190    .00   .0474
10     1.20   .0289     .03    .0160    .04   .0076
11      .82   .0281    -.62    .0550    .11   .0263
12      .99   .0301    -.76    .0420    .10   .0229

It is apparent that the 12 items represent a set that is overall too easy for the examinees chosen here, as most item difficulty values are less than 0 and most item discrimination values are between .5 and 1. Since the focus of the study is on the item parameter estimates from different groups relative to these baseline values, however, this fact indicates more that these items would probably not be given as a test by themselves in a real context rather than that they pose any mathematical problems for the following analyses. With a certain grouping structure at hand, the next step is to calibrate and compare the item parameter estimates from different groups to these baseline values. Since the latent scale in IRT models is indeterminate, however, the item parameters are not on the same scale and the scales need to be linked first. The literature on scale linking and score equating is vast and it is beyond the scope of this chapter to provide a comprehensive review of the most important issues (but see, e.g., Kolen & Brennan, 1995; Petersen, Kolen, & Hoover, 1989; Zhu, 1998).
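The kind of TCC-based linking used in the next step can be sketched as follows (a toy grid-search version of a Stocking-Lord-style criterion; operational software minimizes this kind of loss numerically, and all item values here are synthetic):

```python
import numpy as np

def tcc(theta, a, b, g):
    """Test characteristic curve: expected number-correct score under the 3PL."""
    p = g[None, :] + (1 - g[None, :]) / (
        1 + np.exp(-a[None, :] * (theta[:, None] - b[None, :])))
    return p.sum(axis=1)

def sl_loss(AB, a, b, g, a_ref, b_ref, g_ref, theta):
    """Squared TCC difference after the linear transformation
    theta* = A*theta + B (i.e., a -> a/A, b -> A*b + B)."""
    A, B = AB
    diff = tcc(theta, a / A, A * b + B, g) - tcc(theta, a_ref, b_ref, g_ref)
    return float((diff ** 2).sum())

rng = np.random.default_rng(7)
a_ref = rng.uniform(0.5, 1.5, 12)          # baseline item parameters (synthetic)
b_ref = rng.normal(0.0, 1.0, 12)
g_ref = np.full(12, 0.1)

A_true, B_true = 1.2, -0.3                 # hypothetical scale difference
a_grp = a_ref * A_true                     # group estimates on their own scale
b_grp = (b_ref - B_true) / A_true

theta = np.linspace(-3, 3, 61)
grid = [(A, B) for A in np.linspace(0.8, 1.6, 41)
               for B in np.linspace(-0.8, 0.4, 61)]
A_hat, B_hat = min(grid, key=lambda ab: sl_loss(ab, a_grp, b_grp, g_ref,
                                                a_ref, b_ref, g_ref, theta))
print(f"recovered A = {A_hat:.2f}, B = {B_hat:.2f}")
```

Once (A, B) are recovered, the group's difficulty estimates can be placed on the baseline scale before any cross-group comparison.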
The IRT-based test characteristic curve (TCC) method advocated by Stocking and Lord (1983) was used for this study, and calibration and scale linking were done with PARDUX (Burkett, 1998), because the software implements this equating method and allows for easy handling of the parameter estimates.²

² Parameter equating was also attempted with the EQUATE (Baker, 1995) software, but despite convergence it returned some nonsensical values, such as discrimination values of '0'; hence, PARDUX was chosen here.

Once the item parameters had been placed onto the same scale, the next step was to analyze the sets of parameters. The following section describes this process and illustrates the analyses for the k-means and CART groupings.

Analyzing Item-set LOI Using Functional Data Analysis Tools

For the purpose of this study only item difficulty parameters were inspected, because variation in difficulty is often easiest to account for and because combining results about difficulty, discrimination, and guessing parameters adds layers of complexity that will be investigated in a follow-up study. A first step for comparing the parameter estimates to the baseline values is to visually inspect a scatterplot, which is shown here in Figure 15 for the item difficulty parameters for both the k-means and CART groupings.

Figure 15. Scatterplot of estimates for k-means and CART groupings.

Both scatterplots show that there seem to be upward and downward displacements for some groups.
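Such displacements can be quantified by treating each group's linked difficulty estimates as a function of the baseline values and measuring the deviation from the identity (invariance) line, which is the idea behind the smoothing step described next. A minimal sketch with SciPy's interpolating B-splines; the thesis used dedicated FDA software, and the uniform -0.25 shift below is a hypothetical group, not data from the study.

```python
import numpy as np
from scipy.interpolate import make_interp_spline

# Baseline item difficulties (sorted values from Table 7) and a
# hypothetical group whose linked estimates are uniformly shifted,
# i.e., every item is 0.25 logits easier for this group.
baseline = np.array([-1.63, -1.31, -1.30, -1.18, -0.79, -0.76,
                     -0.70, -0.62, -0.19, -0.04, 0.03, 0.27])
group = baseline - 0.25

# Cubic B-spline through the 12 points (the study used a 12-function
# B-spline basis with knots at the baseline values; an interpolating
# spline is the simplest analogue of that choice).
spline = make_interp_spline(baseline, group, k=3)

grid = np.linspace(baseline.min(), baseline.max(), 100)
deviation = spline(grid) - grid  # signed distance from the invariance line
print(deviation.mean())
```

For this artificial group the deviation curve is flat at about -0.25, i.e., a pure vertical displacement of the kind the smoothed curves in Figure 16 display.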
For example, for the k-means grouping, items appear to be overall easier than in the reference population for groups 7 and 10, whereas for the CART grouping, items appear to be more difficult for groups 5 and 4 and easier for groups 7 and 8. To investigate this visual pattern more formally, tools from FDA (Ramsay & Dalzell, 1991; Ramsay & Silverman, 1997, 2002) were used. First, non-parametric splines were fit to these item difficulty values. It was decided to use a B-spline basis with 12 basis functions and knots located at the baseline parameter values, because such a basis is more flexible than a polynomial basis and is appropriate for data that do not require a cyclical basis such as a Fourier basis. Figure 16 shows the resulting smoothed functions for both the k-means and CART groupings.

Figure 16. Smoothed difficulty parameter regression functions for k-means and CART groupings.

Note that the solid line located in the center of the plots is the invariance line and that deviations from this line indicate a potential LOI. However, since FDA is a set of exploratory tools and does not allow one to disentangle the influence of sampling variation and true LOI, relative comparisons are more meaningful than potentially misleading strong statements about LOI in the populations. Nevertheless, the smoothed curves visually highlight the statements made above about particular groups for both grouping approaches. To analyze these functions further, an F-PCA was undertaken. First, the number of dominant dimensions of variation in the data was investigated via functional eigenvalues, as shown in Figure 17 for both the k-means and CART groupings.

Figure 17.
Functional eigenvalue plots for k-means and CART groupings.

Both plots show that there is one dominant dimension of variation in the respective data sets, which stems from the fact that most of the deviations from the invariance line appear to be vertical functional displacements, and that two principal components would certainly account well for the observed functional variation. Hence, it was decided to score the functions on two principal components and to plot the groups in a principal components score space, as shown in Figure 18 for both the k-means and CART groupings.

Figure 18. PC scores for the 10-group k-means cluster solution and the 8-group CART solution.

The k-means plot shows how group 10 is most unusual compared to the remaining groups, followed by groups 7 and 1. All of these groups have lower values on the first principal component, which could indicate that the items are generally easier for these groups, but they differ with respect to the second principal component; the remaining groups 2, 3, 4, 5, 6, 8, and 9 show the strongest evidence of parameter invariance. The CART graph shows nicely how groups 5, 7, and 8 are separated from the remaining groups and how group 4 is somewhat in between; groups 1, 2, 3, and 6 show the strongest evidence of parameter invariance. At this stage it is useful to explain these findings with respect to the group characteristics.

Explaining the Observed Patterns

For the k-means groupings, the median factor scores for each group were investigated to arrive at a profile in terms of the attitudinal and background variables by which the factors were defined.
The description in terms of these variables is relatively simple, however, because most groups are defined by one or two unusual median factor scores across factors. Examinees in group 10, for whom items appear to be easier than in the reference population, are examinees whose teacher does not provide a lot of instructional variety when introducing a new topic, while examinees in group 7 find themselves in the opposite situation. Examinees in group 1 find themselves in classroom environments where they are assessed more frequently and where they also compare work with other students more frequently than others. Given that the k-means grouping structure did not show strong indications of LOI, these results are unfortunately not very powerful, except to say that they provide some indication that differential test performance may be attributable to differences in the instructional environments that students find themselves in, which is a finding frequently cited in the literature on IPD. For the CART groupings, the picture is different. First, it should be noted that the raw scores for the individual groups align with the observed item parameter patterns. Specifically, groups 7 and 8, for whom the items were in general easier, scored higher on the 12 items, whereas groups 4 and 5, for whom the items were generally more difficult, scored lower on the 12 items. Table 8 lists the minimum, maximum, and median score for each group along with the 25th (LQ) and 75th (UQ) percentiles.
Table 8
Score Distribution Among Terminal Nodes in CART Tree

          All     1     2     3     4     5     6     7     8
Minimum     0     0     0     0     0     0     0     0     0
LQ          6     4     5     6     4     3     5     7     8
Median      8     6     8     8     6     5     7     9    10
UQ         10     8     9    10     8     7     9    11    11
Maximum    12    12    12    12    12    12    12    12    12

It is apparent that all groups have a large amount of score variation in them, indicating that the groups are not very strongly separated under this approach, which is in part due to the restricted range of scores available for only 12 items. Nevertheless, it is clear that the examinees in node 5 tend to perform more poorly than examinees in other nodes, in particular those in nodes 7 and 8. Characterizing the examinees in these nodes by inspecting the background variables that were used to define the groups is somewhat difficult for this tree, because the tree used LC splits, so that for each split the positive and negative contributions of individual variables had to be considered. Nevertheless, examinees in group 5, the general low performers, can be characterized by low confidence in their math ability (e.g., they do not believe they are talented, they believe they will never understand math, and they believe that they need good luck to succeed); they do not want a job involving math and do not believe it is necessary to succeed in school, despite high academic peer pressure. Furthermore, their parents have not completed an advanced educational degree, home and school resources for studying are low, and they participate in numerous clubs. Nevertheless, they do copy notes from the board and are in classes where the teacher discusses practical problems using computers. Examinees in group 4, who also perform lower but generally not as low as those in group 5, have a similar profile, but they are in a climate of low peer pressure and spend some time outside of school studying math, science, and other subjects.
Furthermore, the teacher does not assess them as frequently with tests and quizzes and does not use a computer as frequently. Examinees in group 8, the general high performers, are characterized by a high confidence in math, which means that they particularly believe in their talent for math. They usually do well in math but do not attribute that to luck. Furthermore, they have a moderate to large number of books at home and also have access to a computer. Examinees in group 7, also generally high performers, differ from these examinees only in that they may not have a computer at home or, if they do, they have less strong beliefs about doing well in math, albeit still positive ones.

Conclusions

This novel approach to investigating item-set LOI corroborated two important characteristics of IRT models. First, if examinees are of a generally limited range of "ability", as they were in the CART approach, item parameters are likely going to display LOI, which indicates that meaningful discussions about LOI are only possible if a comparable range of abilities is available in each examinee group. This is of course difficult to know beforehand, which is why phenomena such as IPD are observed in practice. Second, it is possible to account for differential item-set functioning (i.e., DBF or DTF) with respect to item difficulty by inspecting attitudinal and background variables that provide richer descriptions of those examinee groups for whom a test might function differently. However, numerous challenges were encountered along the way. First, in a clustering approach, powerful software is needed to cluster a large number of examinees on a large number of background variables that are of different measurement scale types. Second, even if one arrives at a cluster structure through some means, the resulting patterns may not be as strong as expected.
On the one hand, this is disappointing because group differences are less pronounced, but on the other hand it speaks well to the robustness properties of IRT parameters. Third, the approaches introduced here are neither inferential in nature nor based on a probabilistic model for the observed differences. The only time a probabilistic model is used is in the calibration of the item parameters via a unidimensional IRT model and, potentially, when an MBC approach is used. Hence, it is not possible to disentangle sampling variability from LOI in the population, and one can properly speak only of relative variation across groups. It is of course open to debate at this point what types of variables should be used to create examinee groups for multiple calibrations of item parameters and how one should create these groups once the variables have been selected. In order to prevent replacing one unsatisfying approach with another, what is sorely needed is a powerful theory about response processes that can guide variable selection. This is indispensable to allow practitioners and theoreticians alike to gauge what constitutes a "meaningful" set of variables for accounting for differences in "meaningfully" defined groups, and one can expect this to vary according to the context and purpose of the measurement process. Despite these challenges and shortcomings, the methodology holds strong potential for future investigations. It presents an intriguing way of comparing multiple examinee groups along potential dimensions of variation, which can be done for any grouping structure imposed on the data set. For example, if one were interested in comparing relative item performance across multiple ethnic groups, this methodology allows one to do that.
In such cases, it can provide a useful introductory step for more detailed LOI analyses, which is the purpose of most multivariate exploratory techniques. In addition, this approach allows for multi-group comparisons that take multiple item parameters into account, which introduces an additional layer of complexity in how to combine different measures but can also yield stronger results because multiple item parameters are considered simultaneously. The methodology also illustrates that parameter invariance is an ideal state and that one should not consider LOI a contrasting state either, but rather a continuum, which may even be made up of different subcontinua. It is therefore important to recognize that research such as this is designed to open the door to rich discussions about how psychometricians can account for observed differences in item parameter values across populations. If one wants to go beyond merely detecting differential functioning and then handing the problem over to subject-matter experts in the field, approaches such as this one are indispensable to strengthen the validity of any conclusions about LOI.

Chapter VII - Implications for the Validation and Use of Assessments

Invariance is a concept that has universal appeal for many researchers, because it represents a condition for generalizability. Specifically, generalizability relates to inferences across different examinee groups, item sets, or contexts, and is one of the fundamental desiderata of scientific practice. However, to discuss invariance in a meaningful way, the conceptual notion of invariance has to be translated into a specific framework for inquiry, such as statistical modeling theory, because only then is it possible to assess the degree to which invariance of a certain form exists.
Throughout the process of assessing the degree of lack of invariance (LOI) for a specific context, one eventually encounters numerous challenges. A formalized notion of invariance can lose some of the appeal that a conceptual notion carries with it, as the former typically appears to be much more restrictive. One appears to work with a very narrow definition of invariance when one attempts to formalize it, and it is easy to get the impression that assessing LOI mathematically is somewhat divorced from the larger philosophical debates in which it is grounded. However, such a narrow focus is indeed necessary for a critical and detailed understanding of invariance, which can inform practitioners and theoreticians alike. Even though it is easy to accuse mathematicians and psychometricians who investigate invariance of a certain degree of academic "tunnel vision" because they formalize invariance so narrowly, one needs to appreciate that the same happened in the area of validity research. Even though current conceptualizations of validity are inclusive, comprehensive, and both discursive and mathematical, we should not forget that, for a long time, one major way in which validity was operationalized was via so-called validity coefficients. In other words, it was common practice to take this complex and intricate concept and to reduce it to a single mathematical number that was essentially a correlation between the score on the assessment instrument under consideration and the score on a gold-standard instrument administered to the same group of examinees. As the research literature matured, it became clear that not everything that looked golden was indeed golden and that such an operationalization was much too narrow to support statements about conceptual validity if it stood alone.
But that does not necessarily mean that this particular operationalization of validity is useless or merely a minor side product of the psychometric enterprise; on the contrary, such a narrow definition of validity has made it clear that arguments based on mathematical definitions can rarely capture the complexity of a construct in its entirety but that they can provide a unique and necessary perspective on the issue. Just as statistics, as a discipline, does not guarantee complete certainty about generalizability unless population data are available, mathematical formalizations of validity or invariance do not provide complete certainty about generalizability either. Unfortunately, grand and unrealistic statements about properties of psychometric models ensue if such a seemingly limiting definition of invariance is not injected into the scientific discourse. Statements such as "sample-free" item and examinee parameter estimation for item response theory (IRT) models are evidence of such a danger. The research presented in this dissertation shows clearly that very detailed and specific analyses of the property of invariance are necessary to fully appreciate its subtlety and to debunk unrealistic statements about what psychometric modeling practice can accomplish. Even though this dissertation could have easily been about the epistemological and ontological considerations regarding invariance as a construct within a post-positivistic research paradigm, it evolved into a work on the mathematical nuts and bolts of invariance. Almost to the point of disappointment, invariance of model parameters is a rather "trivial" mathematical requirement in that it states that parameters need to be identical across calibration contexts.
This formalizes common sense about what it means for something to be invariant, yet many practitioners fail to appreciate that parameter invariance in scoring models is therefore neither more nor less than that. By implication, any deviation from such an ideal state constitutes a LOI, which has been shown throughout this dissertation to be a continuum, perhaps even a multidimensional one, rather than a contrasting categorical state. This is evidenced in the chapter on bias coefficients, where different amounts of bias are shown to arise from different degrees of LOI, as well as in the chapter on multi-group comparisons, where populations appeared to differ on at least one or two LOI dimensions. Even if researchers are willing to discuss invariance mathematically, not everyone agrees on which tools to use to investigate it. This, of course, is not surprising, because psychometric assessment faces the challenge that measurement error is difficult to separate from sampling fluctuations in statistics, thereby ideally requiring tools that can disentangle these two sources of variability. However, such tools are not easily developed. Moreover, separating signal from noise is, of course, a problem as old as statistics, and it is thus commonly known that there will be no definitive answer as to whether an observed data pattern actually presents "sufficient" evidence for a similar pattern in the population. Because, in psychometric research, there are actually two potential populations to consider, those of examinees and those of items, and typically neither is statistically well defined, the problem becomes even more prone to confounding. Intuitive measures, despite their widespread appeal, have their limitations.
As the chapter on assessing invariance via correlation coefficients reminds the reader, a measure such as Pearson's correlation coefficient may appear to be appropriate due to its simplicity, but even advanced statisticians tend to forget once in a while that it was designed for linear association only. In recent articles that used this measure, no mention was made as to whether the observed trends were indeed linear or whether additive shifts in parameter estimates existed in the data. This may appear to some as a rather subtle occurrence but, as the chapter on multi-group comparisons has shown, additive shifts are indeed possible. Even though the strongest linear shifts were observed for the classification and regression trees (CART) groupings that separated examinees by total score as a proxy measure of ability, both the k-means and the CART grouping structures displayed some shifts and, specifically, only additive shifts. Even though it has to be acknowledged that this may be a result of sampling variability or ability differences and required a linking of parameter estimates along the way, which is a process that can itself be prone to error, the observed trends cannot be completely denied. With differential item, bundle, and test functioning, as well as item parameter drift, being real problems to which a lot of space is devoted in the research literature, one would expect that decided efforts are constantly undertaken to shed light on the driving forces behind such phenomena. Unfortunately, this is not as commonplace as one would hope. Indeed, the development of procedures to statistically detect LOI is miles ahead of the development of procedures that actually account for its sources.
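The limitation of the correlation coefficient noted above is easy to demonstrate: Pearson's r between baseline and group estimates equals 1 under a uniform additive shift, even though every parameter has drifted, so a distance-based summary is needed to expose the displacement. A small hypothetical example (the values are illustrative, not estimates from the study):

```python
import numpy as np

# Hypothetical baseline difficulties and a group's estimates that are
# uniformly shifted by 0.5 -- a clear lack of invariance.
baseline = np.array([-1.63, -1.31, -1.18, -0.79, -0.70,
                     -0.62, -0.19, -0.04, 0.03, 0.27])
shifted = baseline + 0.5

r = np.corrcoef(baseline, shifted)[0, 1]
rmsd = np.sqrt(np.mean((shifted - baseline) ** 2))
print(r)     # essentially 1.0: the correlation is blind to the shift
print(rmsd)  # 0.5: a root-mean-square difference exposes it
```

The same logic is why the analyses above rely on deviations from the invariance line rather than on correlations alone.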
In recent years, there has been a push in the psychometric literature to promote interdisciplinary and comprehensive investigations of observed phenomena, which has led to the advancement of cognitive models and complex assessment structures. The chapter on multi-group comparisons presented in this dissertation is another stepping stone in this direction and is primarily a call to recognize that current explanatory mechanisms for LOI could benefit from a synthesis of statistical and psychological perspectives, just as other modeling endeavors do. The answers provided in that study may not be conclusive, but they highlight the necessity for a more comprehensive and in-depth look at reasons for LOI. Once more research clears this path of the dense foliage that is currently present, due to a lack of research on how best to utilize background variables for explanatory purposes, an exciting vista for understanding differential performance of tests will be reached. The tools presented in chapter six represent an exploratory modeling approach, and it is claimed neither that this approach is best nor that it is the only one that provides deeper insight into the problem. Any such claims would be naive at best and arrogant at worst. Instead, it is important to remember that all research practice is fundamentally a set of choices that investigators make based on the paradigms of inquiry in which they are trained, the current research accessible within such paradigms, and the practical limitations of the endeavor as a whole. Good statistical practice recognizes the choices that are inherent in any analysis of measurement data and the implications for the claims that are made based on it. Therefore, the true benefit of a realm of methods of inquiry into LOI is not the superiority of one method over another but the multiple perspectives that they offer in concert.
This allows one to consider multiple specialized perspectives on invariance, joined "at birth" by a common understanding of the necessary precise mathematical definition of invariance, and thus to pool the results from such perspectives to reach a higher-level analytical consensus. It should also be understood that one of the fundamental choices about invariance investigations comes from the decision about which mathematical model is used to calibrate a given data set. The structural equation modeling and IRT literatures offer numerous tools to specify simple and complex models, but many developments of advanced models have taken place in specialized contexts, fueled first and foremost by academic research, so that, in practice, basic unidimensional IRT models are still predominantly used. This should, of course, not be condemned per se but rather be recognized as a necessary compromise between academic elegance and rigor and practical implementation. In this sense, the chapter on model comparisons via impact from item difficulty parameter drift highlighted that no model choice can ever be best under all conditions, not even under a lens as restrictive as parameter invariance. Instead, LOI effects are subtle and complex and require researchers to make cautious statements about their implications within the frame of reference of a particular model. But if an exposé on invariance is to have strong impact on the world of practitioners, it should have something to say about the practical impact of LOI investigations. On the upside, it was shown here in both the bias coefficient chapter and the model comparison chapter that the effects of LOI in terms of practical decision-making are rather minor and speak well to the robustness properties of latent variable models. This should comfort those practitioners who are primarily concerned with the numerical stability of results across contexts.
However, in the chapter on multi-group comparisons, it was also shown how situational factors, in the case of the k-means grouping structure, and attitudinal and background factors, in the case of the CART grouping structure, impacted the relative functioning of item sets. This relates back to the scenarios presented in chapter 3, where invariance is of practical importance. For example, if the proficiency test of Canadian history with both item types is administered to monolingual Anglophones as well as bilingual Francophones and some differential item functioning of the magnitudes in chapters 4 and 5 is detected, then it is unlikely to lead to incorrect placement decisions, unless the examinee's score is close to a cut-off score. If differential functioning of this assessment for other language groups in addition to these two were of interest, and the factors causing this functioning could be measured in variables such as cultural or linguistic profiles of such examinee groups, then the approach proposed in chapter 6 could be useful. Whether the relative group differences, if they indeed exist, would be considered 'problematic' or merely 'insightful' as they explain differential achievement profiles is dependent on the users of the test and not the method itself. This ties back to the notion that validity is a unitary concept whose investigation is always socially contextualized and value-laden (Messick, 1995), no matter whether mathematical results point to relatively mild or rather strong violations of model assumptions. Because most assessments are designed to differentiate between learners at different levels of ability or disposition, the challenging question is whether differential functioning of item sets for different examinee groups constitutes evidence of a restricted validity of tests or whether it can be logically subsumed under the basic goal of testing.
At first, this might seem like a simple question, because the statistical literature on differential functioning seems to state that matching examinees on a reference variable resolves that problem. Unfortunately, the problem is not as straightforward. In particular, which variable should be chosen for matching, what design was used to collect the data, and what assumptions about examinee ability and item functioning are made along the way all make this question much more complex (Wainer, 1999; see Shealy & Stout, 1993). Moreover, whenever differential functioning is statistically detected, an ambiguous process already, it is by no means straightforward to classify it as substantially meaningful (see Penfield & Lam, 2000); otherwise, subject-matter experts would not be needed. For example, the results in the last study show clearly that invariance is not a general property that holds exactly over all possible sets of subpopulations, however defined. Yet to make claims that attribute the differential functioning to examinee group differences grounded in their characteristics (i.e., to resolve the confounding of examinee "ability" and item "difficulty"), assumptions about either their "abilities" or the item "difficulties" would have to be made, but neither is independent of the other. Moreover, even statistical techniques that match on a latent variable require that the latent variable can be related to the construct the test is designed to measure for additional analyses to be valid, which is by no means an easy task. In other words, "proper" test functioning depends on "proper" test scoring, which in turn depends on the "proper" calibration of items with examinees from a "suitable" population.
Intuition tells us that a large group of examinees is never really a homogeneous single population and instead consists of a set of disjoint subpopulations which, taken together, make up the entire population but which, individually, represent subgroups in which a test functions differently. Even though most practitioners would agree that differential test functioning for subpopulations is not desirable, very little is often known about what constitutes an "allowable" and a "disallowable" population of examinees. For example, a language test typically is designed with a specific population of examinees in mind, but the operational definition of that population exists mostly linguistically in the test blueprint and is rarely operationalized via an exhaustive set of statistically measurable variables that define that population. If that were at least attempted, examinees from populations of varying degrees of similarity to the target population could be administered such a test, and the degree to which the test functions differently in these populations could be investigated. Typically, a test that functions "properly" with respect to different examinee groups is one for which the item parameters are identical across the different examinee populations for whom the test is designed. Therefore, the logic goes, if differences in values of an item parameter such as item difficulty are observed for certain groups of people who are compared with others on the same metric, after they have been matched on ability, then the test appears to function differently for these groups. But the argument is based on a subtle disclaimer, which is that "important" differential functioning only exists when it is observed for groups for whom it should not exist in the first place. For example, if boys and girls should not perform differently on a test, any differential functioning is alarming.
Hence, the condition for when such differential functioning is detected and classified as alarming is dependent on a definition of populations by the researcher who conducts that investigation. If one takes the results in the last study, for example, one can conceivably ignore the fact that the CART groupings were constructed via a predictive model and consider the examinees with low self-esteem and low educational resources a group of interest. The results seem to suggest that, for those people, the test functions differently. This should be just as alarming as if this group had been made up solely of boys and all remaining groups had consisted solely of girls. One can of course argue at this point that a mere detection of this differential functioning is a result of differences in ability, but that requires a separate valid measure of ability. Similarly, consider the example of the group with differential instructional practices whenever new topics are introduced in the k-means grouping structure. The fact that instructional practices can lead to differential advantages of one group over another surely should give one pause for a moment, after which one realizes that this is what every informed practitioner has suspected all along. More generally, results such as this one are reasons why we send our children to schools with specific educational environments, where past test results have indicated superior general student performance or, at least, have provided a unique environment for growth. From this perspective, LOI research in the fashion advocated here can be used to uncover sets of variables that indicate differential performance generally. Taken a step further, this implies that, even without having to believe in the latent variable score as an indicator of true 'ability', one can use the scoring model as a filter to carve out groups of
Taken a step further, this implies that, even without believing in the latent variable score as an indicator of true "ability", one can use the scoring model as a filter to carve out groups of examinees that are distinct from others with respect to the instrument as a benchmark. This line of thinking promotes the function of any assessment as a diagnostic tool that informs educational practice alongside its function as a tool to rank-order examinees (see Klieme & Baumert, 2001). In a nutshell, it is important to define precisely the populations for whom a test is supposed to function identically and to investigate differential functioning for subpopulations under the scoring model that is actually used, be that an observed-score or a latent-variable model. Specifically, even though many IRT models appear to be relatively robust with respect to variations in population profiles, this is not necessarily true for all populations, for all IRT models, and certainly not for all scoring models generally. Rather than believing that a certain scoring model ensures proper test functioning across an infinite range of conditions because it is theoretically robust, one should ask whether such beliefs mask improper test functioning under certain conditions, since the scoring methodology may be practically less robust for precisely the conditions one cares about. This dissertation has highlighted that considerations about parameter invariance are by no means simple and straightforward, neither at a mathematical nor at an implicational level of analysis. While the research presented in this dissertation complements past and current investigations of the same topic, its broad coverage and depth of analysis at specified junctures, coupled with the use of novel statistical methodology, make it a unique point of departure for further discussions of this fundamental property of measurement.
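The contrast drawn above between theoretical robustness and population dependence can be made concrete with the two-parameter logistic (2PL) model: the item response function itself does not depend on the examinee population, whereas the classical difficulty (proportion correct) does. The item parameters and ability values below are hypothetical, chosen only to illustrate the point.

```python
import math

def p_2pl(theta, a, b):
    """2PL item response function: P(X = 1 | theta, a, b)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

a, b = 1.2, 0.5          # item parameters, fixed across populations

def expected_p(thetas):
    """Classical difficulty (proportion correct) in a given population."""
    return sum(p_2pl(t, a, b) for t in thetas) / len(thetas)

low_ability  = [-1.5, -1.0, -0.5, 0.0, 0.5]   # lower-ability population
high_ability = [-0.5,  0.0,  0.5, 1.0, 1.5]   # higher-ability population

# Same (a, b) for both groups, but the classical p-value shifts with the
# ability distribution: invariance holds for the IRT parameters, not for
# the observed-score statistic.
print(round(expected_p(low_ability), 3), round(expected_p(high_ability), 3))
```

This is the textbook sense in which IRT parameters are population-invariant; the dissertation's point is that such invariance is a property of the model under ideal conditions, not a guarantee for estimated parameters in any particular application.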
In order to understand the validity of inferences made from assessments, one has to understand parameter invariance, a feature that researchers have shrugged off as gratuitous and self-evident for far too long. It is time that parameter invariance be taken seriously, so that it is investigated properly and comprehensively more frequently, if we really want to advance our understanding of scientific generalizability. We cannot limit ourselves to conceptual and philosophical arguments every time someone wants to define invariance precisely. Neither can we afford to simply call upon theoretical features of certain classes of mathematical models as if they were invariance oracles in practice. If invariance is the quest for generalization, this dissertation has shown that the quest is clearly just beginning.

References

Adams, R. J., Wilson, M., & Wu, M. (1997). Multilevel item response models: An approach to errors in variables regression. Journal of Educational and Behavioral Statistics, 22, 47-76.
Allalouf, A., Hambleton, R. K., & Sireci, S. G. (1999). Identifying the causes of DIF in translated verbal items. Journal of Educational Measurement, 36, 185-198.
Baker, F. B. (1995). EQUATE [Computer software]. Madison, WI: Laboratory of Experimental Design, University of Wisconsin.
Berkhin, P. (2002). Survey of clustering data mining techniques (Technical report). San Jose, CA: Accrue Software.
Bock, R. D., Muraki, E., & Pfeiffenberger, W. (1988). Item pool maintenance in the presence of item parameter drift. Journal of Educational Measurement, 25, 275-285.
Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1997). Classification and regression trees. Pacific Grove, CA: Wadsworth.
Brennan, R. L. (2001). Generalizability theory. New York: Springer-Verlag.
Brennan, R. L., & Lee, W. (1999).
Conditional scale-score standard errors of measurement under binomial and compound-binomial assumptions. Educational and Psychological Measurement, 59, 5-24.
Buck, G., Tatsuoka, K., & Kostin, I. (1997). The subskills of reading: Rule-space analysis of a multiple-choice test of second language reading comprehension. Language Learning, 47, 423-466.
Budgell, G. R., Raju, N. S., & Quartetti, D. A. (1995). Analysis of differential item functioning in translated assessment instruments. Applied Psychological Measurement, 19, 309-321.
Burket, G. (1998). PARDUX [Computer software]. Los Angeles: CTB/McGraw-Hill.
Casella, G., & Berger, R. L. (1990). Statistical inference. Belmont, CA: Duxbury Press.
Clauser, B. E., & Mazor, K. M. (1998). Using statistical procedures to identify differentially functioning test items. Educational Measurement: Issues and Practice, 17, 31-45.
Cliff, N. (1992). Abstract measurement theory and the revolution that never happened. Psychological Science, 3, 186-190.
Davey, T., Oshima, T. C., & Lee, K. (1996). Linking multidimensional item calibrations. Applied Psychological Measurement, 20, 405-416.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B, 39, 1-38.
Donoghue, J. R., & Isham, S. P. (1998). A comparison of procedures to detect item parameter drift. Applied Psychological Measurement, 22, 33-51.
Eaton, M. L. (1989). Group invariance applications in statistics. Alexandria, VA: American Statistical Association.
Engelhard, G., Jr. (1992). Historical views of invariance: Evidence from the measurement theories of Thorndike, Thurstone, and Rasch. Educational and Psychological Measurement, 52, 275-291.
Engelhard, G., Jr. (1994). Historical views of the concept of invariance in measurement theory. In M.
Wilson (Ed.), Objective measurement: Theory into practice (Vol. 2, pp. 73-99). Norwood, NJ: Ablex.
Fan, X. (1998). Item response theory and classical test theory: An empirical comparison of their item/person statistics. Educational and Psychological Measurement, 58, 357-381.
Fasulo, D. (1999). An analysis of recent work on clustering algorithms (Technical Report 01-03-02). Seattle, WA: Department of Computer Science & Engineering, University of Washington.
Feldt, L. S. (1996). Estimation of measurement error variance at specific score levels. Journal of Educational Measurement, 33, 141-156.
Feldt, L. S., & Qualls, A. L. (1998). Approximating scale score standard error of measurement from the raw score standard error. Applied Measurement in Education, 11, 159-177.
Ferrando, P. J. (1996). Calibration of invariant item parameters in a continuous item response model using the extended LISREL measurement submodel. Multivariate Behavioral Research, 31, 419-439.
Fox, J., & Glas, C. A. W. (2001). Bayesian estimation of a multilevel IRT model using Gibbs sampling. Psychometrika, 66, 271-288.
Fraley, C., & Raftery, A. E. (1998). How many clusters? Which cluster method? Answers via model-based cluster analysis (Technical Report 329). Seattle, WA: Department of Statistics, University of Washington.
Goldstein, H. (1983). Measuring changes in educational attainment over time: Problems and possibilities. Journal of Educational Measurement, 20, 369-377.
Goldstein, H., & Wood, R. (1989). Five decades of item response modeling. British Journal of Mathematical and Statistical Psychology, 42, 139-167.
Hambleton, R. K. (1989). Principles and selected applications of item response theory. In R. L. Linn (Ed.), Educational measurement (pp. 147-200). New York: Macmillan.
Hambleton, R. K., & Jones, R. W. (1993).
Comparison of classical test theory and item response theory and their applications to test development. Educational Measurement: Issues and Practice, 12, 38-47.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.
Johnson, R. A., & Wichern, D. W. (1982). Applied multivariate statistical analysis. Englewood Cliffs, NJ: Prentice-Hall.
Jöreskog, K. G., & Sörbom, D. (2001). PRELIS [Computer software]. Chicago: Scientific Software International.
Junker, B. W. (1999). Some statistical models and computational methods that may be useful for cognitively-relevant assessment. Unpublished manuscript. Available online at http://www.stat.cmu.edu/~brian/nrc/cfa
Kaskowitz, G. S., & de Ayala, R. J. (2001). The effect of error in item parameter estimates on the test response function method of linking. Applied Psychological Measurement, 25, 39-52.
Kim, S., & Cohen, A. S. (1991). A comparison of two area measures for detecting differential item functioning. Applied Psychological Measurement, 15, 269-278.
Klieme, E., & Baumert, J. (2001). Identifying national cultures of mathematics education: Analysis of cognitive demands and differential item functioning in TIMSS. European Journal of Psychology of Education, 16, 385-402.
Kolen, M. J., & Brennan, R. L. (1995). Test equating: Methods and practices. New York: Springer-Verlag.
Kolen, M. J., Hanson, B. A., & Brennan, R. L. (1992). Conditional standard errors of measurement for scale scores. Journal of Educational Measurement, 29, 285-307.
Li, Y. H., & Lissitz, R. W. (2000). An evaluation of the accuracy of multidimensional IRT linking. Applied Psychological Measurement, 24, 115-138.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores.
Reading, MA: Addison-Wesley.
McDonald, R. P. (1982). Linear versus nonlinear models in item response theory. Applied Psychological Measurement, 6, 379-396.
McDonald, R. P. (1999). Test theory: A unified treatment. Mahwah, NJ: Erlbaum.
McDonald, P., & Paunonen, S. V. (2002). A Monte Carlo comparison of item and person statistics based on item response theory versus classical test theory. Educational and Psychological Measurement, 62, 921-943.
Mellenbergh, G. J. (1982). Contingency table models for assessing item bias. Journal of Educational Statistics, 7, 105-118.
Meredith, W. (1964a). Notes on factorial invariance. Psychometrika, 29, 177-185.
Meredith, W. (1964b). Rotation to achieve factorial invariance. Psychometrika, 29, 187-206.
Meredith, W. (1993). Measurement invariance, factor analysis, and factorial invariance. Psychometrika, 58, 525-543.
Meredith, W., & Horn, J. (2001). The role of factorial invariance in modeling growth and change. In L. M. Collins & A. G. Sayer (Eds.), New methods for the analysis of change: Decade of behavior (pp. 203-240). Washington, DC: American Psychological Association.
Meredith, W., & Millsap, R. E. (1992). On the misuse of manifest variables in the detection of measurement bias. Psychometrika, 57, 289-311.
Messick, S. (1989). Validity. In R. Linn (Ed.), Educational measurement (pp. 13-103). New York: Macmillan.
Millsap, R. E., & Everson, H. T. (1993). Methodology review: Statistical approaches for assessing measurement bias. Applied Psychological Measurement, 17, 297-334.
Millsap, R. E., & Meredith, W. (1992). Inferential conditions in the statistical detection of measurement bias. Applied Psychological Measurement, 16, 389-402.
Mullis, I. V. S., Martin, M. O., Smith, T. A., Garden, R. A., Gregory, K. D., Gonzalez, E. J., Chrostowski, S. J., & O'Connor, K. M. (2001). TIMSS assessment framework and specifications 2003.
Chestnut Hill, MA: International Study Center, Lynch School of Education, Boston College. Available online at http://timss.bc.ed
Muthén, B. O. (2002). Beyond SEM: General latent variable modeling. Behaviormetrika, 29, 81-117.
Narens, L. (2002). Theories of meaningfulness. Mahwah, NJ: Erlbaum.
Penfield, R. D., & Lam, T. C. M. (2000). Assessing differential item functioning in performance assessment: Review and recommendations. Educational Measurement: Issues and Practice, 19, 5-15.
Perline, R., Wright, B. D., & Wainer, H. (1979). The Rasch model as additive conjoint measurement. Applied Psychological Measurement, 3, 237-255.
Petersen, N. S., Kolen, M. J., & Hoover, H. D. (1989). Scaling, norming, and equating. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 221-262). New York: Macmillan.
Raju, N. S., Laffitte, L. J., & Byrne, B. M. (2002). Measurement equivalence: A comparison of methods based on confirmatory factor analysis and item response theory. Journal of Applied Psychology, 87, 517-529.
Ramsay, J. O., & Dalzell, C. J. (1991). Some tools for functional data analysis. Journal of the Royal Statistical Society, Series B, 53, 539-572.
Ramsay, J. O., & Silverman, B. W. (1997). Functional data analysis. New York: Springer-Verlag.
Ramsay, J. O., & Silverman, B. W. (2002). Applied functional data analysis: Methods and case studies. New York: Springer-Verlag.
Reise, S. P., Widaman, K. F., & Pugh, R. H. (1993). Confirmatory factor analysis and item response theory: Two approaches for exploring measurement invariance. Psychological Bulletin, 114, 552-566.
Roussos, L., & Stout, W. (1996). Simulation studies of the effects of small sample size and studied item parameters on SIBTEST and Mantel-Haenszel Type I error performance. Journal of Educational Measurement, 33, 215-230.
Roussos, L. A., Schnipke, D. L., & Pashley, P. J. (1999).
A generalized formula for the Mantel-Haenszel differential item functioning parameter. Journal of Educational and Behavioral Statistics, 24, 293-322.
Rupp, A. A. (in press-a). BILOG-MG and MULTILOG for Windows [Software review]. International Journal of Testing.
Rupp, A. A. (in press-b). Feature selection for choosing and assembling measurement models: A building-block-based organization. International Journal of Testing.
Rupp, A. A., Dey, D. K., & Zumbo, B. D. (in press). To Bayes or not to Bayes, from whether to when: Applications of Bayesian methodology to item response modeling. Structural Equation Modeling.
Rupp, A. A., Garcia, P., & Jamieson, J. (2002). Combining multiple regression and CART to understand difficulty in second language reading and listening comprehension test items. International Journal of Testing, 1, 185-216.
Rupp, A. A., & Zumbo, B. D. (2003). Putting flesh on the psychometric bone: Substantive interpretation of IRT item parameters to enhance inferential validity. Paper presented at the Conry Conference, February 7, Vancouver, BC, Canada.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded responses. Psychometrika Monograph Supplement, No. 17.
Shealy, R., & Stout, W. (1993). A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DTF as well as item bias/DIF. Psychometrika, 58, 159-194.
Sireci, S. G. (1997). Problems and issues in linking assessments across languages. Educational Measurement: Issues and Practice, 16, 12-19, 29.
Sireci, S. G., & Allalouf, A. (2003). Appraising item equivalence across multiple languages and cultures. Language Testing, 20, 248-266.
Steinberg, D., & Colla, P. (1997). CART: Classification and regression trees [Computer software]. San Diego, CA: Salford Systems.
Stevens, S. S. (1946). On the theory of scales of measurement.
Science, 103, 677-680.
Stocking, M. L., & Lord, F. M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7, 201-210.
Takane, Y., & de Leeuw, J. (1987). On the relationship between item response theory and factor analysis of discretized variables. Psychometrika, 52, 393-408.
Thissen, D. (1991). MULTILOG: Multiple category item analysis and test scoring using item response theory [Computer software]. Chicago: Scientific Software International.
Thissen, D., & Steinberg, L. (1984). A response model for multiple-choice items. Psychometrika, 49, 501-519.
van der Linden, W. J., & Hambleton, R. K. (1997). Handbook of modern item response theory. New York: Springer-Verlag.
Vermunt, J. K., & Magidson, J. (2000). Latent GOLD 2.0 user's guide. Belmont, MA: Statistical Innovations.
Vermunt, J. K., & Magidson, J. (2002). Latent class cluster analysis. In J. A. Hagenaars & A. L. McCutcheon (Eds.), Advances in latent class analysis (pp. 89-106). Cambridge, UK: Cambridge University Press.
Wainer, H. (1999). Comparing the incomparable: An essay on the importance of big assumptions and scant evidence. Educational Measurement: Issues and Practice, 18, 10-16.
Webster's new universal unabridged dictionary. (1996). New York: Barnes & Noble Books.
Wells, C. S., Subkoviak, M. J., & Serlin, R. C. (2002). The effect of item parameter drift on examinee ability estimates. Applied Psychological Measurement, 26, 77-87.
Wolf, L. F., & Smith, J. R. (1995). The consequence of consequence: Motivation, anxiety, and test performance. Applied Measurement in Education, 8, 227-242.
Wolf, L. F., Smith, J. K., & Birnbaum, M. E. (1995). Consequences of performance, test motivation, and mentally taxing items. Applied Measurement in Education, 8, 341-351.
Zhang, Y., Matthews-Lopez, J. L., & Dorans, N. J. (2003).
Assessing effects of item deletion due to DIF on the performance of SAT I: Reasoning Test subpopulations. Paper presented at the annual meeting of the National Council on Measurement in Education, April 23, Chicago, IL.
Zhu, W. (1998). Test equating: What, why, how? Research Quarterly for Exercise and Sport, 69, 11-23.
Zimowski, M. F., Muraki, E., Mislevy, R. J., & Bock, R. D. (1996). BILOG-MG: Multiple-group IRT analysis and test maintenance for binary items [Computer software]. Chicago: Scientific Software International.
Zumbo, B. D. (1999). A handbook on the theory and methods of differential item functioning (DIF): Logistic regression modeling as a unitary framework for binary and Likert-type (ordinal) item scores. Ottawa, ON: Directorate of Human Resources Research and Evaluation, Department of National Defense.
Zumbo, B. D. (2003). Does item-level DIF manifest itself in scale-level analyses? Implications for translating language tests. Language Testing, 20, 136-147.
Zumbo, B. D., & Rupp, A. A. (in press). Responsible modeling of measurement data for appropriate inferences: Important advances in reliability and validity theory. In D. Kaplan (Ed.), Handbook of quantitative methodology for the social sciences. Newbury Park, CA: Sage.
Zumbo, B. D., Pope, G. A., Watson, J. E., & Hubley, A. M. (1997). An empirical test of Roskam's conjecture about the interpretation of an ICC parameter in personality inventories. Educational and Psychological Measurement, 57, 963-969.

Appendix 8

[Tables of numeric item parameter and invariance statistics; the values are not recoverable from this text extraction.]


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items