EVALUATING THE ERROR OF MEASUREMENT DUE TO CATEGORICAL SCALING WITH A MEASUREMENT INVARIANCE APPROACH TO CONFIRMATORY FACTOR ANALYSIS

by

BRENT F. OLSON

B.A., Western Washington University, 2005

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF ARTS in THE FACULTY OF GRADUATE STUDIES (Measurement, Evaluation, and Research Methodology)

THE UNIVERSITY OF BRITISH COLUMBIA

January 2008

© Brent F. Olson, 2008

Abstract

It has previously been determined that using 3 or 4 points on a categorized response scale will fail to produce a continuous distribution of scores. However, there is no evidence, thus far, revealing the number of scale points that may indeed possess an approximate or sufficiently continuous distribution. This study provides the evidence to suggest the level of categorization in discrete scales that makes them directly comparable to continuous scales in terms of their measurement properties. To do this, we first introduced a novel procedure for simulating discretely scaled data that was both informed and validated through the principles of the Classical True Score Model. Second, we employed a measurement invariance (MI) approach to confirmatory factor analysis (CFA) in order to directly compare the measurement quality of continuously scaled factor models to that of discretely scaled models. The simulated design conditions of the study varied with respect to item-specific variance (low, moderate, high), random error variance (none, moderate, high), and discrete scale categorization (number of scale points ranged from 3 to 101). A population analogue approach was taken with respect to sample size (N = 10,000). We concluded that there are conditions under which response scales with 11 to 15 scale points can reproduce the measurement properties of a continuous scale. Using response scales with more than 15 points may be, for the most part, unnecessary.
Scales having from 3 to 10 points introduce a significant level of measurement error, and caution should be taken when employing such scales. The implications of this research and future directions are discussed.

Table of Contents

Abstract ... ii
Table of Contents ... iii
List of Tables ... v
List of Figures ... vi
Dedication ... vii
Introduction ... 1
  Procedural Background ... 4
Method ... 8
  Design Conditions ... 8
  Model Specification ... 9
  Data Generation ... 14
  Data Validation ... 20
Results ... 23
  Data Validation ... 23
  Measurement Invariance ... 29
Discussion ... 34
  Conclusions ... 36
  Limitations and Future Directions ... 38
Footnotes ... 41
References ... 42
Appendix A. Selected EFA Results for the “Low” ISV – “None” REV Condition.a ... 45
Appendix B. Selected EFA Results for the “Low” ISV – “Moderate” REV Condition.a ... 46
Appendix C. Selected EFA Results for the “Low” ISV – “High” REV Condition.a ... 47
Appendix D. Selected EFA Results for the “Moderate” ISV – “None” REV Condition.a ... 48
Appendix E. Selected EFA Results for the “Moderate” ISV – “High” REV Condition.a ... 49
Appendix F. Selected EFA Results for the “High” ISV – “None” REV Condition.a ... 50
Appendix G. Selected EFA Results for the “High” ISV – “Moderate” REV Condition.a ... 51
Appendix H. Selected EFA Results for the “High” ISV – “High” REV Condition.a ... 52
Appendix I. Selected MI Results for the “Low” ISV – “None” REV Condition.a ... 53
Appendix J. Selected MI Results for the “Low” ISV – “Moderate” REV Condition.a ... 54
Appendix K. Selected MI Results for the “Low” ISV – “High” REV Condition.a ... 55
Appendix L. Selected MI Results for the “Moderate” ISV – “None” REV Condition.a ... 56
Appendix M. Selected MI Results for the “Moderate” ISV – “High” REV Condition.a ... 57
Appendix N. Selected MI Results for the “High” ISV – “None” REV Condition.a ... 58
Appendix O. Selected MI Results for the “High” ISV – “Moderate” REV Condition.a ... 59
Appendix P. Selected MI Results for the “High” ISV – “High” REV Condition.a ... 60

List of Tables

Table 1 Relative Multivariate Kurtosis (RMK) of the Continuous Model and Selected Discrete Models ... 24
Table 2 Selected EFA Results for the “Moderate” ISV – “Moderate” REV Condition ... 26
Table 3 Data Validation with Exploratory Factor Analysis: Comparing the Continuous Scale Model to the 3-point Model ... 28
Table 4 Selected MI Results for the “Moderate” ISV – “Moderate” REV Condition ... 30
Table 5 The Scale Point Level at which Successful Invariance was Achieved According to RMSEA, χ2, and CFI ... 32

List of Figures

Figure 1 A Comparison Between Methods for Simulating Observed Test Scores According to the Classical True Score Model ... 6
Figure 2 Conceptual Diagram of the Dependent Samples Model Used to Test the Measurement Invariance Between Continuous and Discrete Scale Models ... 10
Figure 3 Three Examples of Frequency Histograms for Continuously Scaled Items and Their Corresponding Categorized Scale
Versions ... 18

Dedication

for Forrest Lee Olson

Introduction

Errors in measurement are one of the most pervasive problems in modern psychological test construction (Gregory, 2004, chap. 3). Measurement error can take many forms and arise from many sources. One source of error that has gained increasing attention in the literature is the use of coarsely categorized response formats in measuring theoretically continuous constructs. Response formats with a continuous scale of measurement possess interval properties that provide useful information about the distance between observed scores on psychological measures. That is, any two observed scores from a continuous scale have a known distance from each other, and a known mid-point. Categorization of continuous scales is a process of using cut-off points to segment the scale into a set of discrete categories. Researchers decide on the number and type of scale points to use during the development of their measures. Test and survey developers often find it impractical or impossible to utilize continuous scales directly in their instruments, so they must resort to using discrete scales (DiStefano, 2002). Categorization has the effect of removing the interval measurement properties of the scale, as the information that exists in between each scale point is lost. For coarsely categorized response scales, this loss of information can reduce the meaningfulness of test scores, and may lead to a misrepresentation of the construct being measured (Lubke & Muthén, 2004; Russel, Pinto, & Bobko, 1991). Specifically, coarseness in categorized responses has been shown to generate systematic error that affects the interaction between variables in regression models (Russel et al., 1991), and may cause a substantial loss in statistical power to detect true relationships between predictors and outcomes (Taylor, West, & Aiken, 2006).
Coarseness causes errors in measurement that can distort the factor structure of latent variable models (Lubke & Muthén, 2004), and can reduce the reliability of test scores (Bandalos & Enders, 1996; Jenkins & Taber, 1977; Lissitz & Green, 1975). Problems due to categorization manifest themselves when researchers attempt to use coarsely scaled data to perform analyses designed for continuously scaled data. For example, the Pearson product moment correlation – a commonly used analysis in the social sciences – relies on the assumption that the input data are continuous (Garson, n.d.; Pearson, 1909). Jöreskog (1994) argues that the biggest problem with categorized variables is that their distributions are simply incapable of being continuous. Jöreskog and others (e.g., Bandalos & Enders, 1996; Bollen & Barb, 1981; Dolan, 1994; Muthén & Kaplan, 1985) seem to agree that scales with four categories or fewer are too coarse to be treated as continuous. However, there is considerable disagreement about the “optimum” number of scale points above four. While there is some evidence to establish which scale points are clearly non-continuous, there is no evidence, thus far, to say which scales do have a sufficiently continuous distribution. Not only is there disagreement in the literature about how many categories constitute an approximately continuous measure, but several issues may have been overlooked by previous researchers. First, any assumption about the way categorized scales perform must incorporate the fact that observed score data follow the tenets of the Classical True Score Model (i.e., Observed Score = True Score + Error). That is, any categorization of the data applies to the true scores and the error scores, not just the observed scores.
Second, the methods that have been used to determine the “optimum” number of categories, in terms of measurement properties, do not directly compare the performance of discretely scaled test items against continuously scaled test items. A strict test of the measurement equality between discrete scales and continuous scales is necessary to determine the proper number of scale points that should be used. The present study will address these concerns in the following ways. First, we will introduce a novel procedure for generating valid observed score data by simulating the effect of categorization on the Classical True Score Model. Second, we will employ a measurement invariance (MI) approach to confirmatory factor analysis (CFA) in order to directly compare continuously scaled factor models to discretely scaled models in terms of their measurement properties. In general, MI can be used to determine whether a given psychological measure performs equally across sample groups or over repeated measures of the same group (Brown, 2006, chap. 3). In our case, we will use repeated measures MI-CFA to determine whether continuous and discrete scale models have equal measurement properties, including equal measurement error. In short, this study will utilize MI-CFA, in conjunction with simulated data, to assess the measurement properties of latent factor models composed of test items expressed on continuous and discrete response scales. The result of this evaluation will help determine the level of coarseness in a categorized scale that creates enough measurement error to make it incomparable to a continuous scale. If dramatic differences in error exist between the two types of response scales, it indicates that the measurement properties of discrete scales behave in a fundamentally different way than continuous scales.
This study will provide a justification for deciding how many discrete scale points are necessary to adequately reproduce the measurement properties of a continuous scale.

Procedural Background

In CFA, the covariances among the variables in the model are relied upon to evaluate the model (Brown, 2006, chap. 2). Because of this dependency on the covariance, simulation studies that evaluate CFA models often begin by producing a covariance matrix (Bandalos, 2006). This, in turn, informs the creation of raw data, onto which the manipulations imposed by the design of the study are performed. However, the raw data generated for the current study attempt to simulate how raw data are believed to be collected in a natural context. Ever since the early 1900s, researchers have theorized that observed raw score data from psychological measures are derived from the combination of an underlying “true score” with an “error score” (Crocker & Algina, 1986, chap. 6). This Classical True Score Model can be expressed with the following equation:

Xij = Tij + Eij

where Xij is the observed score for the ith individual on the jth measure (i.e., test or item), Tij is the true score, and Eij is the error score. Under the natural (non-simulated) conditions of psychological test administration, neither the true score nor the error term can be directly obtained (Crocker & Algina, 1986, chap. 6; McDonald, 1999, chap. 5). Under unnatural (simulated) conditions, however, true scores and errors can be artificially generated using provisions set forth by Classical Test Theory, the theory from which the Classical True Score Model is derived. Classical Test Theory assumes that the mean of the error term is zero, the correlation between the errors across items is zero, and the correlation between errors and true scores is zero. When these assumptions are met, simply adding the simulated error to the simulated true scores produces theoretically informed simulated observed scores.
That being said, it is important to note that several previous studies in the field of categorical scaling have failed to properly incorporate the tenets of the Classical True Score Model into their simulation procedures. Previous simulation studies commonly take one of the following approaches to generating their data. In some cases, they ignore the Classical True Score Model altogether by first creating a continuous observed score variable, which is simply categorized into a discrete observed score variable (e.g., Bollen & Barb, 1981; Johnson & Creech, 1983; Taylor et al., 2006). In other cases, continuous true scores and error scores are created and then added together to form a continuous observed score variable, which is subsequently categorized to form a discrete observed score variable (e.g., Bandalos & Enders, 1996; Jenkins & Taber, 1977; Lissitz & Green, 1975). In either case, these methods ignore the fact that categorization has an impact on true scores and error scores directly, not simply on the observed scores. Figure 1 depicts this inappropriate conventional approach to deriving categorized observed scores. From the perspective of the Classical True Score Model, true scores can be affected by categorization when researchers assume respondents can translate a nearly infinite range of feelings or attitudes into a single discrete scale value. Thus, the impact of categorization on the measurement process is introduced to the true scores when researchers decide how many discrete scale points a test item will have. Additionally, error scores can be affected by categorization when random error forces a respondent’s score to change by one or more full scale point(s). This is in contrast to continuous scales, where random error might change the score by some minute fraction of a point. Therefore, categorization makes its impact well before the existence of observed scores.
This is why the simulation of realistic discrete observed scores requires the simulation of categorized true scores and error scores in order to follow the Classical True Score Model.

Figure 1. A Comparison Between Methods for Simulating Observed Test Scores According to the Classical True Score Model.

The present study will attempt to remedy some of the shortcomings of previous simulations with a novel procedure that undertakes the following steps. First, continuous true scores are generated and added to a set of continuous error scores in order to produce continuous observed scores. Second, separate copies of the continuous true and error scores are made, and subsequently categorized. Third, the categorized true scores are added to the categorized error scores to produce a unique discrete version of the observed scores. This process results in a set of continuous observed scores and a set of discrete observed scores, each having been derived from separate true and error scores. See Figure 1 for a visual comparison between the conventional approach and the current novel approach. When continuous and discrete observed scores are derived in this way, it implies that these two sets of scores should perform equally in terms of their measurement properties. The only difference between them is the level of categorization imposed on the discrete scores. For that reason, we will repeat the simulation process over multiple levels of categorization, spanning a complete range of discrete scale points (from 3 to 101 points). This will allow us to use MI-CFA to pinpoint the level of categorization that crosses the measurement threshold for having nearly continuous properties. In summary, the aim of this study is twofold. First, we present and validate a novel procedure for generating raw data that simulates the Classical True Score Model.
Second, we use a MI approach to CFA to establish the number of points a discretely scaled factor model must have in order to perform equally to a continuously scaled model in a measurement context. 8 Method Design Conditions The simulated data for this study were generated under several research design conditions resulting in multiple sets of raw scores. The conditions formed a completely crossed 3 x 3 x 27 design with various levels of item-specific variance (low, moderate, high), levels of random error variance (none, moderate, high), and 27 levels of discrete scale categorization (discussed below). A population analogue approach was taken with respect to sample size. For each design condition a grand sample of 50,000 cases was generated, from which a random sample of 10,000 was drawn for use in all subsequent analyses. The size of this sample was deemed sufficient to produce the required stability among covariances, parameter estimates, and fit indices. There were a total of 243 unique sets of raw score data, and thus, 243 separate CFA model evaluations. Data were generated with syntax code written for SPSS 13.0. All data sets were subsequently imported into LISREL 8.54 for analysis using SIMPLIS syntax. For all CFA model evaluations, maximum likelihood (ML) estimation was used. Most of the categorical scaling literature concentrates on response scales with 7 points or fewer (Bandalos, 2006; Bollen & Barb, 1981; Dolan, 1994). However, scales that are commonly seen throughout the social science research literature can have a range in categorization anywhere from 3 to 101 scale points.1 Therefore, we wanted the discrete scales evaluated in this study to represent the entire range of categorization. Specifically, we included all non-binary scales with 21 points or fewer, as well as, all scales with 31 or more points at intervals of 10, up to 101 points (i.e., 3 – 21, 31, 41, 51, 61, 71, 81, 91, 101). 
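The 27 categorization levels and the fully crossed design can be enumerated as a quick check. This Python sketch is illustrative only; the study itself generated its data in SPSS:

```python
# 27 discrete scale-point levels: all non-binary scales up to 21 points,
# then every tenth scale from 31 to 101 points.
scale_points = list(range(3, 22)) + list(range(31, 102, 10))

isv_levels = ["low", "moderate", "high"]   # item-specific variance
rev_levels = ["none", "moderate", "high"]  # random error variance

# Completely crossed 3 x 3 x 27 design
conditions = [(isv, rev, tau)
              for isv in isv_levels
              for rev in rev_levels
              for tau in scale_points]

print(len(scale_points))  # 27 categorization levels
print(len(conditions))    # 243 unique data sets / CFA evaluations
```

The two counts reproduce the figures given above: 19 scales from 3 to 21 points plus 8 scales from 31 to 101 points gives 27 levels, and 3 x 3 x 27 gives 243 conditions.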
All the discrete scales simulated in this study possess a low scale point of zero, which means the high point is always the number of scale points minus one. For example, the 101-point scale ranges from 0 to 100, and the 3-point scale ranges from 0 to 2. Because the hypothesis tests performed in this study were applied to each of the 27 levels of scale points, the statistical significance level (α) was corrected using a Bonferroni adjustment. The corrected significance level was α’ = 0.05 ÷ 27 ≈ .002. It should be noted that the entire set of results produced from all of the different scale points is not reported here. The overall results are summarized, and only the output from the most relevant or illustrative scales is presented.

Model Specification

As is true for most CFA studies, the raw data used for this study were produced with a particular latent factor model in mind. The following is a description of the type and specification of the intended factor model. As was mentioned above, there are two versions of each true score and error score used to create the raw data. In fact, the discrete versions of these variables are derived directly from the continuous versions. Consequently, these variables come from the same sample; that is, the data consist of dependent samples. Therefore, the type of factor analytic model used to compare the continuous version to the discrete version was a dependent samples model. This model resembles a repeated measures or longitudinal model, whereby a single configural specification is identified for the continuous and the discrete data, but both versions are present in the same model. Figure 2 shows the conceptual diagram of the hypothesized configural model.

Figure 2. Conceptual Diagram of the Dependent Samples Model Used to Test the Measurement Invariance Between Continuous and Discrete Scale Models.

This proposed model consists of one latent factor for each of the two versions of the variables. Continuous variables are hereafter referred to as “Y” variables and discrete variables are referred to as “X” variables. The model specification for both versions is identical, and they are placed side-by-side within one large model to maximize comparability. We chose to use four items to indicate the factors; this creates sufficient over-specification of the model and provides the opportunity to assess variability in the pattern of factor loadings. It is important to note that each indicator in the model is an individual test item consisting of a unique set of observed scores. Each set of observed scores consists of a unique set of true scores and error scores. However, the true scores are all derived from the same source; that is, they have the same underlying common factor. We hypothesize that the common factor among the “Y” variables is precisely equal to that of the “X” variables; therefore, the equality of the two factors was restricted by setting the correlation between them to unity (ryx = 1.0). The metric for the latent variables was fixed by setting the variance of the latent variables to 1.0. Error variance is unique and uncorrelated among the four continuous “Y” items. Likewise, error is uncorrelated among the four discrete “X” items. However, the error for respective items across versions is derived from the same source and should be correlated (see the set of four correlational arcs in Figure 2). The equality of error variances across versions was assessed through the tests of MI (discussed below), rather than by setting the correlation to 1.0. Thus, the correlation between respective items across versions was allowed to be freely estimated. Assessment of this dependent samples model for MI followed the procedure called “longitudinal measurement invariance” outlined by Brown (2006, chap. 7).
Accordingly, the continuous and discrete scale versions of the model were considered measurement invariant if the model held up against a series of four increasingly strict evaluations of equality. The first step was to test for configural invariance by constraining the number of indicators and number of latent variables to be equal across the continuous and discrete models (see Figure 2). All of the model parameters were allowed to be freely estimated. For the decision of whether configural invariance held for our model, several formal measures of global goodness-of-fit were compared to established criteria. We followed common practice by reporting the overall goodness-of-fit Chi-Square (χ2) value. The degrees of freedom (df) for the configural model were 16; therefore, the χ2 critical value when α = .002 was χ2crit = 37.2. The criterion used to evaluate the Root Mean Square Error of Approximation (RMSEA) was RMSEA ≤ .06. The criterion for the Comparative Fit Index (CFI) was CFI ≥ .95. Models that failed to meet the criteria established for each level of MI were excluded from all subsequent tests in the series of MI evaluations. If a given model showed appropriate goodness-of-fit under the restrictions of configural invariance, additional constraints were applied in order to test for loading invariance. Loading invariance was tested by placing equality constraints on the configural specification, as well as on the factor loadings (path coefficients) between the latent variables and their respective indicator variables. This test was important for determining whether the discrete latent variable had the same unit of measurement as the continuous latent variable. Loading invariance was determined by the degree of change in model fit observed when proceeding from configural invariance. Following the recommendation of Brown (2006, chap. 7), each successive test of invariance involved an assessment of the change in χ2 from the previous invariance level (i.e., Δχ2).
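As a sketch of where these χ2 cut-offs come from, the critical values can be recovered from the χ2 distribution at the corrected α = .002. The use of scipy here is an assumption for illustration; the thesis does not describe how its critical values were obtained:

```python
from scipy.stats import chi2

alpha = 0.002  # Bonferroni-corrected significance level (0.05 / 27, rounded)

# Critical value for the configural model's overall chi-square (df = 16)
crit_configural = chi2.ppf(1 - alpha, df=16)  # close to the 37.2 used in the text

# Critical value for a chi-square difference test with 4 degrees of freedom
crit_delta = chi2.ppf(1 - alpha, df=4)        # close to the 16.9 used in the text
```

Any model whose χ2 (or Δχ2, at later invariance levels) exceeds the relevant critical value fails that level of the MI sequence.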
The change in degrees of freedom (Δdf) for the loading invariance model was 4; therefore, the Δχ2 critical value when α = .002 was Δχ2crit = 16.9. Stemming from stark criticism of the use of the χ2 difference test as the only means of deciding whether MI holds over successive invariance tests of a given model, Cheung and Rensvold (2002) and Wu, Li, and Zumbo (2007) have evaluated the appropriateness of ΔCFI as an alternative to Δχ2. These researchers found that for multi-group CFA applications (i.e., independent samples), using a criterion of ΔCFI ≤ -0.01 was far more stable and realistic in its allocation of MI decisions than was a statistically non-significant Δχ2. However, we have found no studies that have attempted to apply a ΔCFI ≤ -0.01 decision rule to the type of CFA model we are evaluating here (i.e., dependent samples). Cautiously, we compared the results of both the Δχ2 and the ΔCFI as a means of determining loading invariance. In addition to the equality constraints of the configural specification and the factor loadings, the third level of measurement invariance tested the equality of factor means. The evaluation of means invariance was a test of whether the continuous and discrete models are equally centered on an underlying latent distribution. The mean of the latent variable should be the same for both models if they are to be considered measurement invariant. Means invariance is established by the change in model fit when proceeding from loading invariance, where Δχ2crit = 16.9 and ΔCFI ≤ -0.01 must hold for a given CFA model. The fourth and final test in the series was a test of residual error variance invariance (hereafter referred to as error invariance). This test placed equality constraints on the configural specification, the factor loadings, the factor means, and the residual error variance parameters for each of the respective indicator variables.
Error invariance was a direct test of whether the discretely scaled version of the model had the same amount of measurement error as the continuously scaled version. This test is critical in determining the existence of systematic sources of error variance that may influence the measurement properties of coarsely categorized response scales. Error invariance of the model was determined by the degree of change in model fit observed when proceeding from means invariance. Once again, a comparison between Δχ2crit = 16.9 and ΔCFI ≤ -0.01 provided the criteria for deciding whether error invariance holds in a given model. If the model held up against all four levels of successively restrictive tests, the model was said to be measurement invariant. That is, the tests have provided strong evidence that continuous and discrete scores – at a given level of categorization – possess the same fundamental measurement quality and that they are directly comparable in terms of their measurement properties.

Data Generation

The simulation procedure used to produce the raw data for this study was divided into the following steps:

1. Generate the common factor,
2. Generate item-specific variance,
3. Compute the item true scores,
4. Re-express the true scores to fit the distribution of the desired scale,
5. Generate the random error scores,
6. Compute the item observed scores.

A further explanation of these six basic parts is as follows.

1. Generate the common factor. Latent common factors are, by definition, unobservable and have an unknown mean and variance. However, the distribution of common factors is often considered to be randomly normally distributed (Crocker & Algina, 1986, chap. 6; McDonald, 1999, chap. 5). The mean and variance of the common factor generated in this simulation were set arbitrarily to emulate the standard normal curve, where MC = 0.0 and SDC = 1.0. For the purposes of demonstration, the common factor variable is referred to as the ‘C’ variable.

2. Generate item-specific variance. No two test items on a single instrument should contain precisely the same content (Gregory, 2004, chap. 3). Differences in the wording of items add necessary and systematic item-specific variance to the items’ scores (Brown, 2006, chap. 2). This process was simulated by generating sets of four randomly normally distributed variables, where each set had one of three different levels of variance. These variables represented the item-specific variance of the four test items in our proposed CFA model. The item-specific variables are referred to as ‘IS’ variables; they have a mean of zero (i.e., MIS = 0.0), and a low, moderate, or high standard deviation (i.e., SDIS = 0.5, 1.0, or 2.0, respectively).

3. Compute the item true scores. The original item true scores were derived from the addition of the item-specific variables to the common factor variable. That is, each of the four ‘IS’ variables was separately added to the ‘C’ variable, thus creating four original item true score variables. True score variables are hereafter referred to as ‘T’ variables.

4. Re-express the true scores to fit the distribution of the desired scale. The previously mentioned steps (i.e., steps 1, 2, and 3) ensure that the true scores possess a mean of zero (i.e., MT = MC + MIS = 0.0). However, if, for example, the true score for a variable was measured on a 5-point discrete scale, where the lowest point was 0 and the highest point was 4, then we would not expect the mean of that variable to be 0. In order to simulate the scores that one might get from live data collection, we had to re-express the true scores to fit a distribution with a realistic mean and standard deviation. Accordingly, it was necessary to adapt the original true score variables to fit the distribution we would expect at each level of categorization.
This part of the procedure involved standardizing the item true scores and re-expressing them to fit the expected distribution of the desired scale. Also involved in this step was the all-important process of categorizing the continuous scales into equivalently distributed discrete scales. This re-expression and categorization procedure includes the following sub-steps:
(a) Compute the mean of the desired scale,
(b) Compute the standard deviation of the desired scale,
(c) Standardize the original item true scores using a Z-score transformation,
(d) Re-express the true scores along the distribution of the desired scale,
(e) Categorize the continuous scale items to create new discrete scale items,
(f) Identify and remove special cases containing impossible scores.
The reader is reminded that the above sub-steps, except for (e) and (f), deal exclusively with continuously measured scales. However, the entire process is designed to ensure that the resulting discretely measured scales possess the expected distribution. This implies that the mean and standard deviation of the desired scale are actually the expected mean and standard deviation of the discrete scale. Practically speaking, the resulting "Y" variables will be continuously scaled but will have means and standard deviations equal to those of the discrete "X" variables.

(a). The following equation was used to calculate the mean of the desired scale:

MY = (τ − 1) / 2     (1)

where τ is the number of scale points. All of the discrete scales simulated in this study possess a low point of zero and a high point of τ − 1. Dividing the high point of the scale by 2 ensures the distribution will be symmetrical.

(b). To calculate the standard deviation of the desired scale we used:

σY = τ / (rangeT / σT)     (2)

where τ is the number of scale points, rangeT is the range (max minus min) of the original item true score, and σT is the standard deviation of the original true score.
This equation determines the number of standard deviation units the new variable should have in order to account for the entire range of true scores. That is, the standard deviation of the re-expressed scale will change, but the number of standard deviation units will remain the same.

(c) - (d). The procedure for standardizing the original item true scores and then re-expressing them along the distribution of the desired scale was accomplished in a single Z-score transformation equation:

TY = ((T − MT) / σT) · σY + MY     (3)

where TY is the new re-expressed item true score, T is the original true score, MT is the mean of the original true score, σT is the standard deviation of the original true score, σY is the desired scale standard deviation from equation (2), and MY is the desired scale mean from equation (1). The result of the procedure thus far is to output the four continuous "Y" true score variables with distributions relevant to a given number of scale points.

(e). The categorization of these "Y" variables into discrete "X" variables is accomplished by simply rounding the continuous scores to the nearest integer. Mathematically this would be expressed as: TX = round(TY). With only a few exceptions (see the following step), the Z-score standardization process ensures that the number of possible integers the rounding process can create is equal to the number of desired scale points. Figure 3 shows the effect that the categorization process has on the distributions of true score variables.
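Equations (1) through (3), together with the rounding in sub-step (e), can be sketched as a single function. This is a hypothetical Python re-implementation (the study used SPSS syntax); `reexpress` and all variable names are illustrative:

```python
# Sketch of re-expression step 4: equations (1)-(3) plus categorization.
import numpy as np

def reexpress(T, tau):
    """Re-express true scores T onto a tau-point scale, then categorize."""
    M_Y = (tau - 1) / 2.0                    # equation (1): desired mean
    range_T = T.max() - T.min()
    sigma_Y = tau / (range_T / T.std())      # equation (2): desired SD
    # Equation (3): Z-standardize, then rescale to the desired distribution.
    T_Y = ((T - T.mean()) / T.std()) * sigma_Y + M_Y
    T_X = np.round(T_Y)                      # sub-step (e): round to integers
    return T_Y, T_X

rng = np.random.default_rng(seed=2)
T = rng.normal(0.0, np.sqrt(2.0), size=10_000)   # moderate-ISV true scores
T_Y, T_X = reexpress(T, tau=5)                   # 5-point scale: mean 2.0
```

The continuous T_Y keeps the desired mean exactly, while the rounded T_X can occasionally land just outside 0 to τ − 1, which is what sub-step (f) screens for.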
Figure 3. Three Examples of Frequency Histograms for Continuously Scaled Items and Their Corresponding Categorized Scale Versions. [Figure omitted: paired frequency histograms of the continuous "Y" and discrete "X" versions of items on the 101-point, 11-point, and 3-point scales.]

(f). There were some exceptional cases that arose from the process described above. The original true scores were created by adding some variance (item-specific variance) to the distribution of the common factor. This process occasionally produced extreme scores with values either above or below the possible maximum or minimum of the given response scale. For example, we found scores from the 101-point scale with values of -8 and 121. Such cases were identified and flagged for removal from the data set (discussed below).

5. Generate the random error scores. Girard and Cliff's (1976) research on human errors in judgment has provided the model of random measurement error used in this study. Girard and Cliff claim that the standard deviation of the random error associated with a 9-point response scale can theoretically range from 0.0 to 1.0. Because the current study deals with a variety of response scales other than the 9-point, an equation was developed to determine the error standard deviation relevant to any given number of scale points:

σE = (τ · σ9-point) / 9     (4)

where σE is the standard deviation of the random error for the desired scale, τ is the desired number of scale points, and σ9-point is the error standard deviation of the 9-point scale. To generate the random error terms for each of the four items in the current CFA model, four new normally distributed variables were created. Each variable had a mean of zero and a standard deviation set to one of three levels, such that σ9-point = 0.0, 0.5, or 1.0, representing a none, moderate, or high level of error, respectively. This resulted in the continuous "Y" version of the four random error variables, which were subsequently rounded to the nearest integer in order to create the four discrete "X" random error variables (i.e., EX = round(EY)).

6. Compute the item observed scores. Item observed scores are derived simply from the addition of the random error variables to the re-expressed item true scores. That is, the four continuous error terms were added to the four continuous true scores to create raw scores for the simulated observed "Y" variables (i.e., Y = TY + EY). Similarly, the four discrete error terms were added to the four discrete true scores to create raw scores for the simulated observed "X" variables (i.e., X = TX + EX). However, the combination of error with true scores invited another opportunity for the resulting scores to be pushed beyond the tails of the distribution. It is inevitable that introducing random error would increase the chance of producing extreme scores in observed variables. Any extreme cases, produced either from the generation of observed scores or of true scores (discussed above), were identified and subsequently removed from the data set. The frequency of all such removed cases was minute and varied according to the design conditions of the data set (i.e., the percent of cases removed from each data set ranged from 0.004% to 1.0%). This removal was applied to the grand sample of 50,000 and in no way affected the size of the random sample of 10,000.

Data Validation

Prior to the CFA evaluation of our dependent samples model, we wanted to validate the simulated raw data for its ability to uphold the assumptions of the Classical True Score Model.
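Steps 5 and 6 of the generation procedure, including the removal of impossible cases, can be sketched in the same hypothetical Python style (the study used SPSS syntax; all names here are illustrative):

```python
# Sketch of steps 5-6: error generation (equation 4), observed scores,
# and trimming of out-of-range cases; illustrative names only.
import numpy as np

def observe(T_Y, T_X, tau, sd_9point, rng):
    sigma_E = (tau * sd_9point) / 9.0            # equation (4)
    E_Y = rng.normal(0.0, sigma_E, size=T_Y.shape)
    E_X = np.round(E_Y)
    Y = T_Y + E_Y                                # continuous observed scores
    X = T_X + E_X                                # discrete observed scores
    keep = (X >= 0) & (X <= tau - 1)             # drop impossible cases
    return Y[keep], X[keep]

rng = np.random.default_rng(seed=3)
tau = 11
T_Y = rng.normal((tau - 1) / 2.0, 1.5, size=10_000)
T_X = np.round(T_Y)
Y, X = observe(T_Y, T_X, tau, sd_9point=0.5, rng=rng)
```

As in the study, only a small fraction of cases falls outside the 0 to τ − 1 bounds and is trimmed, so the retained sample is barely reduced.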
The validation of the data involved testing for the assumptions of multivariate normality and unidimensionality. We performed exploratory factor analysis (EFA) to determine whether the four proposed test items, from each level of categorization, could uncover the underlying factor from which they were derived. We can claim that our data conform to the Classical True Score Model if the following four criteria are met:
1. The variables meet the assumption of normality according to Mardia's (1970) test of multivariate normality at p > .002.
2. The χ2 goodness-of-fit test of the four-item one-factor EFA model should be non-significant (p > .002), indicating the model is a good fit for the data.
3. For each set of four test items, the EFA must show only one dominant factor with an initial eigenvalue greater than 1.0.
4. The factor loadings for each item must approximate the true correlation between that item and the underlying common factor.

As opposed to the dependent samples CFA model, the simplified EFA model tested here includes only one common factor and four indicator variables, expressed on either continuous or discrete scales. A total of 252 EFAs were performed: 243 involved models with items expressed on one of the 27 different discrete scales (i.e., 3 to 101 points), and 9 involved continuously scaled items. For all EFA model evaluations, ML estimation was employed.

Normality was assessed through visual inspection of frequency histograms and through an extension of Mardia's (1970) test of multivariate normality called Relative Multivariate Kurtosis (RMK). RMK is reported by the PRELIS module in the LISREL 8.54 software package. However, PRELIS does not provide critical values for interpreting RMK, so they must be calculated by hand using the following formula (see SAS Institute, 2004, chap. 19):

RMKcrit = [ p(p + 2) ± Zcrit · sqrt( 8p(p + 2) / N ) ] / [ p(p + 2) ]     (5)

where Zcrit is the desired critical value from a two-tailed Z-score distribution, p is the number of variables in the multivariate analysis, and N is the sample size. For our analysis, the critical interval when Z(α=.002, 2-tail) = ± 3.09, p = 4, and N = 10,000 is: 0.982 ≤ RMK ≤ 1.018. If the observed value of RMK reported by PRELIS is outside this interval, the variables in the analysis are deemed to be non-normal.

Results

Data Validation

The data generation procedure was evaluated for its ability to create theoretically justifiable data, capable of producing factor analytic models consistent with the expectations set forth by the Classical True Score Model. For this, a comprehensive examination of the normality, as well as the dimensionality, of the simulated data was conducted.

Normality. Frequency histograms of the variables were visually assessed for their ability to approximate the normal curve. Figure 3 provides examples of these histograms. All of the simulated variables appeared to be highly symmetrically distributed and closely followed the normal curve. However, formal tests of multivariate kurtosis revealed some deviations from normality. Table 1 shows that both the number of scale points and high levels of random error have a strong influence on multivariate normality as measured by RMK. In general, scales with a low number of points (3 to 6 points) are more likely to suffer from non-normality. Scales with 9 points or more, including the continuous scale, rarely fail to achieve normality. Notice that under the "high" random error conditions, even the continuously scaled variables fail to meet the critical value for RMK. Similarly, discrete scales with a large number of scale points, such as from 21 points up to 101 points, unexpectedly fail in this regard as well. There are two possible explanations that can account for this phenomenon.
Table 1. Relative Multivariate Kurtosis (RMK) of the Continuous Model and Selected Discrete Models.

                                 Number of scale points
Uncommon variance    Cont.   101    21     11     9      8      7      6      5      4      3
Low item-specific variance
  REV: None          0.999  0.999  0.995  0.994  0.997  0.985  0.991  1.022  1.121  1.125  2.351
  REV: Moderate      0.998  0.998  0.995  0.995  1.003  1.018  1.048  1.084  1.173  1.164  2.346
  REV: High          0.981  0.982  0.982  0.979  0.980  0.977  0.981  0.970  0.983  0.991  1.327
Moderate item-specific variance
  REV: None          0.990  0.990  0.989  0.988  0.984  0.988  0.988  0.980  1.005  0.905  1.696
  REV: Moderate      0.996  0.996  0.996  0.985  0.996  1.004  1.013  1.020  1.060  0.941  1.691
  REV: High          0.975  0.975  0.976  0.975  0.972  0.977  0.973  0.968  0.969  0.962  1.207
High item-specific variance
  REV: None          1.002  1.002  1.001  1.004  1.004  0.997  0.996  0.997  1.003  0.903  1.485
  REV: Moderate      0.990  0.991  0.991  0.988  0.993  0.993  0.995  1.002  1.024  0.931  1.462
  REV: High          0.976  0.976  0.974  0.972  0.967  0.971  0.963  0.968  0.962  0.948  1.119

Note. REV = Random Error Variance. The critical interval was: 0.982 ≤ RMK ≤ 1.018. Values outside this interval indicate a failure to meet multivariate normality.

First, high levels of random error variance naturally increase the proportion of scores in the tails of a variable's distribution. However, the tails of the variables in this study are limited by the maximum and minimum of their discrete scale. That is, no matter how much random error is introduced into the variable, the scores cannot exceed the maximum or minimum point of that scale. When large amounts of random error are added to the distribution, the extreme tails of the variables are effectively cut off, which may cause negative kurtosis, and hence, a failure in multivariate normality. Second, Mardia's test of multivariate kurtosis has been shown to be sensitive to sample size (Mardia, 1974; see equation (5)).
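The sample-size sensitivity of equation (5) is easy to illustrate numerically: the critical interval around RMK = 1.0 narrows as N grows, so a very large sample can flag even trivial kurtosis deviations as significant. The helper below is illustrative, not part of the original analysis:

```python
# Critical interval for RMK from equation (5); the interval half-width
# shrinks with sqrt(N), illustrating the test's sample-size sensitivity.
import math

def rmk_interval(p, N, z_crit=3.09):
    half_width = z_crit * math.sqrt(8 * p * (p + 2) / N) / (p * (p + 2))
    return 1 - half_width, 1 + half_width

lo, hi = rmk_interval(p=4, N=10_000)
print(round(lo, 3), round(hi, 3))      # the study's interval: 0.982 1.018
for N in (100, 1_000, 10_000):
    print(N, rmk_interval(p=4, N=N))   # wider intervals at smaller N
```

At N = 100 the same four-variable analysis tolerates RMK values roughly ten times further from 1.0 than at N = 10,000.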
As can be seen in Table 1, many of the scale points that failed to meet the critical value for RMK failed only by a slight margin. It may simply be that the current sample of 10,000 cases caused Mardia's test to overestimate the degree of non-normality in otherwise normally distributed variables. While many of the normality failures were marginal, it was clear that the discretely scaled variables with 3 or 4 points consistently expressed some of the most severe violations of normality. This may be expected, as some researchers in the field have argued that scales with 4 points or fewer are, by nature, incapable of achieving normality (Jöreskog, 1994; Lubke & Muthén, 2004). In this context, the normality violations observed among scales with few points may not be caused by the data generation procedure per se, but may simply arise out of having an inherently limited number of values in the distribution. Despite these apparent problems, it is important to note that many of the other variables created by the data generation procedure did achieve multivariate normal distributions according to RMK. Nonetheless, if there are adverse consequences to the observed non-normality, they will bear themselves out in the results of the MI - CFA evaluations discussed below.

Dimensionality. The results from the 252 EFA evaluations were consolidated onto nine different summary tables (see Table 2 and Appendices A – H). There is one summary table for each combination of item-specific variance (ISV) and random error variance (REV). Any unique combination of these two types of variance is hereafter referred to as uncommon variance. An example of one of the nine EFA summary tables is shown in Table 2. This table highlights the regularity with which the data generation procedure can produce sound factor analytic models.
Table 2. Selected EFA Results for the "Moderate" ISV – "Moderate" REV Condition.

                     Cont.   101    21     11     9      8      7      6      5      4      3
χ2                   0.005  0.007  0.064  0.208  0.462  1.097  1.685  0.117  0.496  0.608  0.120
df                   2      2      2      2      2      2      2      2      2      2      2
p                    0.99   0.99   0.97   0.90   0.79   0.58   0.43   0.94   0.78   0.74   0.94
Initial eigenvalue
  Factor 1           2.244  2.243  2.224  2.141  2.111  2.089  2.080  2.058  2.058  2.030  1.884
  Factor 2           0.59   0.59   0.60   0.63   0.64   0.65   0.65   0.66   0.66   0.67   0.72
  Factor 3           0.59   0.59   0.59   0.61   0.63   0.63   0.64   0.64   0.65   0.66   0.70
  Factor 4           0.58   0.58   0.58   0.61   0.62   0.63   0.63   0.64   0.64   0.65   0.69
Factor loadings
  Item 1             0.644  0.643  0.641  0.619  0.612  0.603  0.606  0.599  0.602  0.596  0.550
  Item 2             0.644  0.644  0.638  0.621  0.610  0.610  0.595  0.600  0.581  0.585  0.530
  Item 3             0.652  0.653  0.646  0.623  0.613  0.610  0.609  0.594  0.603  0.589  0.562
  Item 4             0.636  0.635  0.629  0.604  0.600  0.588  0.591  0.582  0.589  0.574  0.529
Correlation between the item and the common factor
  Item 1             0.648  0.647  0.645  0.623  0.615  0.606  0.607  0.599  0.599  0.586  0.548
  Item 2             0.642  0.641  0.636  0.618  0.607  0.606  0.596  0.596  0.581  0.585  0.517
  Item 3             0.653  0.652  0.646  0.626  0.624  0.620  0.605  0.604  0.605  0.597  0.553
  Item 4             0.633  0.632  0.625  0.604  0.595  0.583  0.590  0.578  0.581  0.571  0.491

Note. EFA = Exploratory Factor Analysis, ISV = Item-Specific Variance, REV = Random Error Variance.

Each of the nine EFA tables was similar in this respect. According to the χ2 goodness-of-fit test, the four-item one-factor model was a good fit for the data across all scale points. In fact, the χ2 test was non-significant (p > .002) for all scale points regardless of uncommon variance (observed significance ranged from p = .99 to .008). Moreover, there was no discernible pattern among χ2 values; that is to say, goodness-of-fit does not necessarily diminish as the number of scale points decreases.
Initial eigenvalues represent the total variance explained by the common factor(s) among items (Brown, 2006). The initial eigenvalue of the first common factor in our four-item one-factor model was greater than 1.0 and was clearly dominant over the other factors (for an example, see Table 2). The continuously scaled model showed the highest first-factor eigenvalue, and there was a consistent decline in eigenvalue as the number of scale points decreased. Therefore, the proposed four-item one-factor model accounted for the greatest amount of variance, but discretely scaled items did not explain as much variance as continuously scaled ones. This result was observed across all 9 combinations of uncommon variance (see Appendices A – H). Table 3 shows the further consolidated results of all EFA evaluations as they pertain to eigenvalues, factor loadings, and item-to-factor correlations.

The factor loadings for all EFA models mirrored the behavior of the eigenvalues, such that continuously scaled items had the highest factor loadings, and as the scale points decreased, so did the factor loadings (for an example, see Table 2). In EFA, the factor loadings for a one-factor model can be interpreted as the estimated correlation between an item and its underlying latent factor (Brown, 2006). Because our data generation procedure creates the common factor variable (i.e., the 'C' variable mentioned in the Method section), we are afforded the rare opportunity to check the estimated item-to-factor correlation (i.e., the factor loading) against the true item-to-factor correlation.

Table 3. Data Validation with Exploratory Factor Analysis: Comparing the Continuous Scale Model to the 3-point Model.
                                   Initial eigenvalue                      Average factor        Average item-to-
                                   Continuous          3-point             loading across items  factor correlation
Uncommon variance                  Factor 1  Factor 2  Factor 1  Factor 2  Cont.     3-point     Cont.     3-point
Low item-specific variance
  REV: None                        3.391     0.21      2.597     0.48      0.893     0.730       0.893     0.654
  REV: Moderate                    2.937     0.37      2.598     0.49      0.804     0.730       0.802     0.649
  REV: High                        2.224     0.61      1.773     0.76      0.639     0.508       0.640     0.452
Moderate item-specific variance
  REV: None                        2.509     0.50      1.892     0.72      0.709     0.545       0.708     0.530
  REV: Moderate                    2.244     0.59      1.884     0.72      0.644     0.543       0.644     0.527
  REV: High                        1.775     0.76      1.434     0.89      0.508     0.380       0.507     0.364
High item-specific variance
  REV: None                        1.601     0.81      1.358     0.89      0.447     0.346       0.445     0.337
  REV: Moderate                    1.490     0.85      1.331     0.90      0.404     0.332       0.404     0.326
  REV: High                        1.341     0.90      1.202     0.94      0.337     0.259       0.337     0.255

Tables 2 and 3 show that for almost every level of scale points, the factor loading successfully approximated the true item-to-factor correlation down to the one-hundredth decimal place or lower. The factor loading estimates among the 3- and 4-point scales were the least accurate. They were still appropriate, however, having approximated the true item-to-factor correlation to within the one-tenth decimal place or lower. Overall, these results provided strong evidence to suggest that the data generated from the newly developed simulation were in concordance with the theoretical tenets of the Classical True Score Model. The MI evaluations discussed in the next section will reveal whether the unexpected failures in multivariate normality have a direct influence on the measurement invariant properties of discrete scales. Given the evidence that has been presented thus far, subsequent analyses were conducted under the assumption that the data were theoretically and statistically valid.
Measurement Invariance

Similar to the EFA evaluations, the results from the MI - CFA evaluations were consolidated onto nine summary tables (see Table 4 and Appendices I – P), one for each combination of uncommon variance. Table 4 is a representative example of one of the nine tables. Not all of the scale points are represented in the tables because there is sufficient consistency in the values to infer the pattern of results from the existing scales. Table 4 provides a context for how we arrived at our decisions about MI between the continuously scaled CFA model and the various levels of discretely scaled models.

Table 4. Selected MI Results for the "Moderate" ISV – "Moderate" REV Condition.

        (a) Configural invariance     (b) Loading invariance
Scale   RMSEA   χ2       CFI          χ2        Δχ2       CFI     ΔCFI
101     0.000   15.364   1.000        16.886    1.522     1.000   0.000
21      0.000   5.963    1.000        11.295    5.332     1.000   0.000
16      0.000   7.181    1.000        8.260     1.079     1.000   0.000
15      0.000   14.015   1.000        16.059    2.044     1.000   0.000
14      0.000   12.013   1.000        13.959    1.946     1.000   0.000
13      0.000   9.458    1.000        11.323    1.865     1.000   0.000
12      0.000   8.325    1.000        10.511    2.186     1.000   0.000
11      0.000   8.030    1.000        11.625    3.595     1.000   0.000
10      0.000   9.319    1.000        15.182    5.863     1.000   0.000
9       0.005   19.742   1.000        20.783    1.041     1.000   0.000
8       0.000   15.557   1.000        17.774    2.217     1.000   0.000
7       0.003   17.601   1.000        25.397    7.796     1.000   0.000
6       0.000   7.879    1.000        12.244    4.365     1.000   0.000
5       0.000   9.869    1.000        14.725    4.856     1.000   0.000
4       0.000   9.779    1.000        20.929    11.150    1.000   0.000
3       0.017   58.938   0.999        442.436   383.498   0.988   -0.011

        (c) Means invariance                  (d) Error invariance
Scale   χ2       Δχ2     CFI     ΔCFI         χ2         Δχ2        CFI     ΔCFI
101     19.109   2.223   1.000   0.000        37.738     18.629     1.000   0.000
21      19.319   8.024   1.000   0.000        266.504    247.185    0.997   -0.003
16      9.764    1.504   1.000   0.000        607.463    597.699    0.993   -0.007
15      24.036   7.977   1.000   0.000        543.962    519.926    0.993   -0.007
14      14.954   0.995   1.000   0.000        700.307    685.353    0.991   -0.009
13      15.506   4.183   1.000   0.000        952.276    936.770    0.988   -0.012
12      11.837   1.326   1.000   0.000        1043.072   1031.235   0.987   -0.013
11      19.968   8.343   1.000   0.000        1228.667   1208.699   0.984   -0.016
10      15.649   0.467   1.000   0.000        1344.038   1328.389   0.982   -0.018
9       24.406   3.623   1.000   0.000        1501.714   1477.308   0.980   -0.020
8       20.630   2.856   1.000   0.000        1687.008   1666.378   0.977   -0.023
7       25.922   0.525   1.000   0.000        1596.128   1570.206   0.977   -0.023
6       15.606   3.362   1.000   0.000        1581.638   1566.032   0.977   -0.023
5       22.782   8.057   1.000   0.000        1216.860   1194.078   0.981   -0.019
4       22.048   1.119   1.000   0.000        1721.116   1699.068   0.972   -0.028
3       not tested                            not tested

Note. MI = Measurement Invariance, ISV = Item-Specific Variance, REV = Random Error Variance.
Note. Critical values were: RMSEA = 0.06, χ2(df=16) = 37.2, CFI = 0.95, Δχ2(df=4) = 16.9, and ΔCFI = -0.01.
Note. Values that exceed these critical values indicate a failure to meet invariance with the continuous model.

In general, when there was agreement among the fit indices that a particular scale point model exceeded the critical value for fit, that model was dropped from subsequent tests of MI. For example, in Table 4(a) – the test of configural invariance – fit indices for the 3-point scale indicate the χ2 test was significant, whereas RMSEA and CFI were within accepted bounds. Because of this disagreement among indices, the 3-point model was considered measurement invariant to a continuous scale model at the configural level. In Table 4(b) – the test of loading invariance – the change in fit for the 3-point scale proceeding from configural invariance exceeded the critical value for both Δχ2 and ΔCFI. Consequently, we concluded that the 3-point model failed to be measurement invariant at the loading invariance level, and dropped the model from subsequent evaluations (see Tables 4(c)-(d)). Analyses proceeded in this way for all nine different combinations of uncommon variance conditions (see Appendices I – P). The culmination of all of our MI decisions across all design conditions is summarized in Table 5.

Configural Invariance.
In general, discretely scaled models seem to have very little trouble meeting the criteria for configural invariance. There were only three conditions where models failed in this regard. When ISV is "low" and REV is either "moderate" or "high", RMSEA and χ2 are in agreement to fail the 3-point model. Additionally, when ISV is "low" and REV is "none", RMSEA and χ2 agree to fail both the 3- and 4-point models. Conversely, all discretely scaled models, including the 3-point model, met the criteria for goodness-of-fit according to the CFI index at the level of configural invariance.

Loading Invariance. At the level of loading invariance, where ISV is "moderate", the Δχ2 test and ΔCFI were in agreement to fail the 3-point model. All other discretely scaled models – that were not previously dropped from the analysis – were deemed to be loading invariant to a continuously scaled model.

Table 5. The Scale Point Level at which Successful Invariance was Achieved According to RMSEA, χ2, and CFI.

                       RMSEA     χ2        Δχ2                           CFI        ΔCFI
Uncommon variance      Config.   Config.   Loading   Means   Error       Config.    Loading   Means   Error
Low item-specific variance
  REV: None            5         6         6         6       > 101a      3          5         5       11
  REV: Moderate        4         6         6         6       > 101a      3          4         4       13
  REV: High            4         6         6         6       71          3          4         4       12
Moderate item-specific variance
  REV: None            3         4         5         5       71          3          4         4       11
  REV: Moderate        3         4         4         4       > 101a      3          4         4       14
  REV: High            3         4         4         4       71          3          4         4       12
High item-specific variance
  REV: None            3         3         4         4       61          3          3         3       11
  REV: Moderate        3         3         4         4       91          3          3         3       15
  REV: High            3         3         4         4       > 101a      3          3         3       12

a Indicates that all of the scale points, including the 101-point scale, have failed to achieve invariance.
Note. All scale points that lie at or above the values shown can be considered invariant to a continuous scale at the given invariance level.
Note. All scale points that lie below the values shown have failed to achieve invariance with a continuous scale at the given invariance level.

Means Invariance.
As mentioned briefly above, the data generation procedure sets the mean of all continuously scaled items equal to the mean of the corresponding discretely scaled items. It was expected that this would greatly minimize the chances that the discretely scaled models would fail with respect to means invariance to the continuously scaled models. Accordingly, Table 5 shows no additional invariance failures among the discretely scaled models – that were not previously dropped from the analysis – at the level of means invariance.

Error Invariance. At the level of error invariance, the results revealed the dramatic effect that the number of scale points has on measurement invariance. According to the Δχ2 test, there are several conditions in which even the 101-point model fails to achieve error invariance with a continuous model. In fact, there was no condition in which the criterion for the Δχ2 test was met among models with fewer than 61 scale points. The ΔCFI index was far more realistic in its allocation of error invariance decisions. According to ΔCFI, the minimum number of scale points for which error invariance holds for discretely scaled models ranged from 11 to 15 scale points. Neither of the two indices revealed an association between the uncommon variance conditions and the allocation of error invariance decisions. If we employ the established policy of accepting MI decisions when two of the fit indices are in agreement, then we may conclude that discretely scaled models with a range of scale points from 11 to 15 or higher are measurement invariant to continuously scaled models.

Discussion

The aim of this study was twofold. First, we presented and validated a novel procedure for generating raw data that simulates the Classical True Score Model. Second, we used a MI approach to CFA to establish the number of points a discretely scaled factor model must have in order to perform equally to a continuously scaled model in a measurement context.
Successful realization of the second goal was, of course, dependent on realizing the first. In order to draw sound conclusions about the equality of discrete and continuous models, it was necessary to scrutinize the simulated data used to test the models. The SPSS syntax code designed for this study attempted to simulate how data are believed to be produced in natural settings. In general, we expected the data to exhibit unidimensionality and multivariate normality. Unidimensionality of the data was confirmed by the fact that all 252 four-item one-factor EFA models passed the χ2 goodness-of-fit test and had dominant first-factor eigenvalues. In the presence of correlated errors, or correlations between errors and true scores, we would expect the estimated factor loadings from the EFA to diverge from the item-to-factor correlations, but this was not the case. The fact that the EFA maximum likelihood estimation routine produced factor loading parameters that closely replicated the true item-to-factor correlations is strong evidence that the data generation procedure follows the Classical True Score Model.

We did observe an unexpectedly high rate of violations of the multivariate normality assumption among variables with a low number of scale points (i.e., from 3 to 6 points) and variables in the "high" random error variance condition. We offered two potential explanations for the problem. First, high levels of error variance can push the extreme tails of a variable's distribution past the maximum or minimum of the response scale, which effectively removes the tails of the distribution altogether – potentially causing non-normality. Second, the confidence interval for evaluating Mardia's test of multivariate normality may be too sensitive to the large sample size of this study (Mardia, 1974; see equation (5)). Oversensitivity to sample size may cause an overestimate of the degree of non-normality in otherwise normally distributed variables.
Based on our results, there is evidence to suggest that the normality violations are likely due to the latter explanation. Kline (1998) has shown that significantly elevated χ2 goodness-of-fit values are prevalent among factor models suffering from normality problems. Interestingly, none of the models in this study that were found to be non-normal according to RMK had significant χ2 values according to the EFA evaluation. Likewise, with the exception of the 3-, 4- and 5-point scale models, significant RMK values were not associated with significant χ2 values according to the configural invariance CFA evaluations. Far from contradicting the findings of Kline, our results suggest instead that the confidence interval for Mardia's RMK value may be inappropriately detecting significant non-normality when there is none. Furthermore, if one compares how the scale models perform on Mardia's test (see Table 1) to how they perform on tests of MI (see Table 5), no clear pattern between them emerges. The comparison between these two tests revealed examples of RMK normal models that passed MI evaluations under certain conditions and failed MI under others (i.e., the RMK normal 11-point model passed MI under the "low" REV condition and failed MI under the "moderate" REV condition). Additionally, examples were also found where RMK non-normal models were indeed able to pass MI evaluations (i.e., the RMK non-normal 21-point model consistently passed MI under the "high" REV condition). Because of the lack of a pattern between RMK and MI, we suspect that the confidence interval around Mardia's RMK value was oversensitive to the large sample size in the study and is, therefore, unduly attributing non-normality to some legitimately normally distributed models. This leads us to conclude that the novel data generation procedure presented in this study was both informed and validated through the principles of the Classical True Score Model.
In general, the data were deemed to be well suited for studying the equality of discrete and continuous measurement models. It should be emphasized that models with 3 and 4 scale points suffered the most from normality issues; they had the lowest performance in the EFA, and were more likely to fail at the configural invariance level of MI. It remains unclear, however, whether these scales’ poor performance in the EFA and CFA evaluations is directly due to their lack of normality. Further study is needed in order to uncover the exact relationship between the normality of discrete scales and their measurement invariance with continuous scales.

Conclusions

From the results of the MI-CFA, we conclude that there are conditions under which response scales with 11 to 15 scale points can reproduce the measurement properties of a continuous scale. In very general terms, the more susceptible a measure is to random error variance, the more scale points should be used. However, our results provide strong support for the claim that using response scales with more than 15 points is, for the most part, unnecessary. We have found that scales with fewer than 11 points have significantly more measurement error than continuous scales, even under ideal conditions. Thus, scales with 3 to 10 points can be considered coarsely categorized. Regardless of whether these scales are capable of producing normal distributions, they do not compare to a continuous scale in terms of their measurement properties. Moreover, the error that was found to be inherent in coarse scales may have direct implications for the accuracy of instruments that employ them. We know that measurement error serves to reduce the reliability of test scores (Crocker & Algina, 1986, chap. 6; Gregory, 2004, chap. 3; McDonald, 1999, chap. 5). What is unknown, however, is the exact relationship between reliability and the measurement error introduced by coarse response scales.
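One way to see the reliability cost of coarse categorization, under stated assumptions, is to compute reliability directly from simulated true and observed scores. The sketch below is illustrative only (variances chosen to give a continuous-scale reliability of .80, with equal-width cuts; not the study’s design conditions): it treats the squared correlation with the true score as each coarsened scale’s reliability.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 10_000
true = rng.normal(0.0, 1.0, N)   # true scores
error = rng.normal(0.0, 0.5, N)  # random error
continuous = true + error        # reliability = 1 / 1.25 = .80 by construction

def coarsen(x, k):
    """Collapse x onto a k-point equal-width response scale (categories 1..k)."""
    edges = np.linspace(x.min(), x.max(), k + 1)[1:-1]
    return np.digitize(x, edges) + 1

# The squared true-score correlation estimates the reliability of each scale;
# coarser scales recover less of the continuous scale's reliability.
reliability = {k: np.corrcoef(true, coarsen(continuous, k))[0, 1] ** 2
               for k in (3, 5, 11, 101)}
```

Under these assumptions the 101-point scale essentially reproduces the continuous reliability, while the 3- and 5-point scales fall noticeably short, which parallels the pattern reported above.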
Our results are consistent with previous studies which have concluded that 3- and 4-point scales should not be treated as if they were continuous (Bandalos & Enders, 1996; Bollen & Barb, 1981; Dolan, 1994; Jenkins & Taber, 1977; Johnson & Creech, 1983; Lissitz & Green, 1975; Taylor et al., 2006). We extend this conclusion by finding that scales with 10 points or fewer have serious comparability problems with continuous scales, and therefore, caution should be taken whenever coarse scales are employed. Researchers will likely continue using coarse scales in tests and surveys simply because they are perceived to be more convenient. It is important, however, that researchers understand the level of error that is introduced through the use of coarse response scales. While further research is still needed, following the guidelines set forth by this study could help to reduce the error in survey results, and potentially raise the standard of accuracy for future psychological measures. Also noteworthy is the fact that our results provide additional evidence for the assertion put forward by Cheung and Rensvold (2002) and Wu et al. (2007) that ΔCFI is more stable and realistic in its allocation of MI decisions than Δχ2. This is the first study, to our knowledge, to show that ΔCFI is just as stable for dependent-sample models as it is for independent-sample models. The Δχ2 test rejected MI for discrete scale models far more often than would be expected; we suspect this test was highly sensitive to the large sample size involved in this study. Researchers who are interested in conducting MI-CFA studies should consider the results from both Δχ2 and ΔCFI evaluations.

Limitations and Future Directions

The analyses conducted in this study were performed upon data produced under ideal simulated conditions. Any conclusions drawn from such analyses must be understood to have somewhat limited generalizability to real-world applications.
However, the data generation procedure was shown to have followed the tenets of the Classical True Score Model, the same model that data collected under natural conditions are believed to follow (Crocker & Algina, 1986, chap. 6). Additionally, a MI approach to CFA with dependent-samples models is one of the strictest tests of measurement equality among factor analytic models (Brown, 2006, chap. 7). The choice to use simulated data and a dependent-samples model for our evaluation implies that our recommendations about the number of scale points to use should be considered fairly conservative. That is, using 11 to 15 scale points is perhaps a realistically safe overestimate of the “optimum” number of points necessary for most psychological measures. Researchers should feel confident in using 11 to 15 points; however, future research is needed in order to determine whether a less conservative estimate exists. Many of the conditions in which the current data were generated could be manipulated in ways that would further our understanding of the effect of categorization on the measurement invariance between discrete and continuous response scales. The ideal, symmetrically distributed variables seen in the current study are unlikely to be found in natural data. Thus, a thorough study of the effect of skewed and/or kurtotic distributions on the equality of scales should be a high priority for future research. Additionally, the specified model for the CFA was quite simple; models with only four indicators and one latent factor are fairly unrepresentative of the models commonly seen in social science research. Perhaps a larger or more complex measurement model would show different results. Finally, the sample size could also be manipulated to better represent samples normally found in CFA studies.
An evaluation of samples ranging from 500 down to 100 cases may improve the applicability of recommendations for the number of scale points that should be used. A population analogue approach was taken in this study as a means of compensating for the lack of empirical random sampling. While the parameter estimates and fit indices reported here were shown to be consistent and stable across all design conditions, it is common practice among simulation studies to compile data over multiple iterations of the simulation in order to produce stable results (Bandalos, 2006). Future research should consider implementing an iterative approach to data generation. In this study, the Pearson product-moment correlation (PPM) was relied upon to calculate the covariance matrix for both the EFA and the CFA. As previously mentioned, the PPM assumes that all the variables in the analysis have a continuous scale of measurement (Pearson, 1909). Because of this assumption, the use of the PPM provided an additional level of strictness to the test of MI between continuous and discrete scale models. However, it is often inappropriate to apply this assumption to discretely scaled data collected under natural conditions (Jöreskog, 1994; Muthén, 1984). If it is assumed that the discretely scaled variables are not continuous themselves, but are instead derived from an unknown latent continuum, then the polychoric correlation should be used to calculate covariance matrices for factor analysis. Further study is needed in order to establish the equality of continuous and discrete scale models when different covariance matrices are applied. In an attempt to add to the base of knowledge concerning the relationship between scale points and the reliability of test scores, the current data generation procedure may be well suited to help determine the number of scale points necessary to reproduce the reliability estimates made by a continuous scale.
Because our simulation procedure produces both the true scores and the error scores, it may be possible to quantify the amount of measurement error introduced by discretely categorized response scales. Once quantified, the measurement error can be used to calculate a reliability adjustment capable of correcting reliability estimates for the effect of categorization. The computer simulation program designed for this study is highly versatile. In the future, it could be used to examine a multitude of measurement-related subjects. For example, it could be used to explore the effect that categorization has on the outcomes produced by various factor analytic estimation methods, such as maximum likelihood and weighted least squares. Similarly, there is much work needed in improving our methods of calculating reproduced factor scores (e.g., the Bartlett method or the Anderson-Rubin method). The data generator presented here could also help illuminate the differences in the way group biases are detected when using either a MI approach or a differential item functioning (DIF) approach.

Footnotes

1. Note that 2-point scales also appear in the literature, but are subject to specific statistical analyses that are beyond the scope of the current treatment.

References

Bandalos, D. L. (2006). Use of Monte Carlo studies in structural equation modeling research. In G. R. Hancock & R. O. Mueller (Eds.), Structural equation modeling: A second course (pp. 385-426). Greenwich, CT: Information Age Publishing, Inc.
Bandalos, D., & Enders, C. (1996). The effects of nonnormality and number of response categories on reliability. Applied Measurement in Education, 9(2), 151-160.
Bollen, K. A., & Barb, K. (1981). Pearson's r and coarsely categorized measures. American Sociological Review, 46, 232-239.
Brown, T. A. (2006). Confirmatory factor analysis for applied research. New York, NY: The Guilford Press.
Cheung, G. W., & Rensvold, R. B. (2002).
Evaluating goodness-of-fit indexes for testing measurement invariance. Structural Equation Modeling, 9, 235-255.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Orlando, FL: Holt, Rinehart and Winston, Inc.
DiStefano, C. (2002). The impact of categorization with confirmatory factor analysis. Structural Equation Modeling, 9(3), 327-346.
Dolan, C. V. (1994). Factor analysis of variables with 2, 3, 5 and 7 response categories: A comparison of categorical variable estimators using simulated data. British Journal of Mathematical and Statistical Psychology, 47, 309-326.
Garson, G. D. (n.d.). Correlation. In Statnotes: Topics in multivariate analysis. Retrieved December 13, 2007, from http://www2.chass.ncsu.edu/garson/pa765/statnote.htm
Girard, R. A., & Cliff, N. (1976). A Monte Carlo evaluation of interactive multidimensional scaling. Psychometrika, 41, 43-64.
Gregory, R. J. (2004). Psychological testing: History, principles, and applications. Boston, MA: Pearson.
Jenkins, G. D., Jr., & Taber, T. D. (1977). A Monte Carlo study of factors affecting three indices of composite scale reliability. Journal of Applied Psychology, 62, 392-398.
Johnson, D. R., & Creech, J. C. (1983). Ordinal measures in multiple indicator models: A simulation study of categorization errors. American Sociological Review, 48, 398-407.
Jöreskog, K. G. (1994). On the estimation of polychoric correlations and their asymptotic covariance matrix. Psychometrika, 59, 381-389.
Kline, R. B. (1998). Principles and practice of structural equation modeling. New York, NY: Guilford Press.
Lissitz, R. W., & Green, S. B. (1975). Effect of the number of scale points on reliability: A Monte Carlo approach. Journal of Applied Psychology, 60, 10-13.
Lubke, G., & Muthén, B. (2004). Applying multigroup confirmatory factor models for continuous outcomes to Likert scale data complicates meaningful group comparisons. Structural Equation Modeling, 11(4), 514-534.
Mardia, K. V. (1970).
Measures of multivariate skewness and kurtosis with applications. Biometrika, 57, 519-530.
Mardia, K. V. (1974). Applications of some measures of multivariate skewness and kurtosis for testing normality and robustness studies. Sankhya, 36, 115-128.
McDonald, R. P. (1999). Test theory: A unified treatment. Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Muthén, B. O. (1984). A general structural equation model with dichotomous, ordered categorical, and continuous latent variable indicators. Psychometrika, 49, 115-132.
Muthén, B., & Kaplan, D. (1985). A comparison of some methodologies for the factor analysis of non-normal Likert variables. British Journal of Mathematical and Statistical Psychology, 38, 171-189.
Pearson, K. (1909). On a new method for determining the correlation between a measured character A and a character B. Biometrika, 7, 96-109.
Russell, C. J., Pinto, J. K., & Bobko, P. (1991). Appropriate moderated regression and inappropriate research strategy: A demonstration of information loss due to scale coarseness. Applied Psychological Measurement, 15, 125-135.
SAS Institute Inc. (2004). SAS/STAT 9.1 user's guide. Cary, NC: SAS Institute Inc.
Taylor, A., West, S., & Aiken, L. (2006). Loss of power in logistic, ordinal logistic, and probit regression when an outcome variable is coarsely categorized. Educational & Psychological Measurement, 66(2), 228-239.
Wu, A. D., Li, Z., & Zumbo, B. D. (2007). Decoding the meaning of factorial invariance and updating the practice of multi-group confirmatory factor analysis: A demonstration with TIMSS data. Practical Assessment, Research & Evaluation, 12(3), 1-26. Retrieved June 17, 2007, from http://pareonline.net/getvn.asp?v=12&n=3

Appendix A.
Selected EFA Results for the “Low” ISV – “None” REV Condition.a

Scale: Continuous | 101-point | 21-point | 11-point | 9-point | 8-point | 7-point | 6-point | 5-point | 4-point | 3-point
χ2: 2.380 2.059 1.196 6.005 4.334 0.269 1.141 4.407 1.039 0.979 2.693
df: 2 2 2 2 2 2 2 2 2 2 2
p: 0.30 0.36 0.55 0.05 0.11 0.87 0.57 0.11 0.59 0.61 0.26
Initial Eigenvalue
Factor 1: 3.391 3.390 3.358 3.277 3.219 3.177 3.117 3.043 2.927 2.815 2.597
Factor 2: 0.21 0.21 0.22 0.25 0.27 0.28 0.31 0.33 0.37 0.40 0.48
Factor 3: 0.20 0.20 0.21 0.24 0.26 0.27 0.29 0.32 0.36 0.40 0.47
Factor 4: 0.20 0.20 0.21 0.23 0.25 0.27 0.28 0.31 0.35 0.39 0.46
Factor Loadings
Item 1: 0.892 0.891 0.888 0.872 0.861 0.855 0.843 0.823 0.802 0.775 0.731
Item 2: 0.892 0.892 0.886 0.870 0.856 0.852 0.838 0.824 0.807 0.779 0.724
Item 3: 0.894 0.893 0.887 0.873 0.864 0.853 0.848 0.829 0.808 0.783 0.733
Item 4: 0.894 0.893 0.885 0.870 0.860 0.848 0.831 0.824 0.789 0.774 0.730
Correlation between the item and the common factor
Item 1: 0.893 0.893 0.888 0.873 0.862 0.854 0.841 0.828 0.800 0.762 0.665
Item 2: 0.893 0.892 0.886 0.870 0.858 0.852 0.839 0.821 0.800 0.756 0.646
Item 3: 0.894 0.894 0.889 0.874 0.865 0.856 0.847 0.833 0.806 0.769 0.680
Item 4: 0.893 0.893 0.885 0.868 0.858 0.846 0.833 0.819 0.784 0.750 0.625
a EFA = Exploratory Factor Analysis, ISV = Item-Specific Variance, REV = Random Error Variance.

Appendix B. Selected EFA Results for the “Low” ISV – “Moderate” REV Condition.a
a EFA = Exploratory Factor Analysis, ISV = Item-Specific Variance, REV = Random Error Variance.
Scale: Continuous | 101-point | 21-point | 11-point | 9-point | 8-point | 7-point | 6-point | 5-point | 4-point | 3-point
χ2: 0.474 0.515 0.518 1.453 2.158 0.312 0.625 1.327 1.729 7.502 1.941
df: 2 2 2 2 2 2 2 2 2 2 2
p: 0.79 0.77 0.77 0.48 0.34 0.86 0.73 0.52 0.42 0.02 0.38
Initial Eigenvalue
Factor 1: 2.937 2.936 2.900 2.786 2.730 2.703 2.659 2.647 2.636 2.697 2.598
Factor 2: 0.37 0.37 0.38 0.42 0.44 0.45 0.47 0.47 0.47 0.45 0.49
Factor 3: 0.35 0.35 0.36 0.40 0.42 0.43 0.44 0.45 0.46 0.44 0.46
Factor 4: 0.35 0.35 0.36 0.39 0.41 0.42 0.43 0.43 0.44 0.42 0.45
Factor Loadings
Item 1: 0.807 0.807 0.800 0.775 0.765 0.761 0.753 0.756 0.743 0.762 0.741
Item 2: 0.803 0.803 0.797 0.771 0.758 0.752 0.740 0.738 0.733 0.744 0.717
Item 3: 0.811 0.811 0.802 0.781 0.772 0.764 0.761 0.746 0.751 0.757 0.739
Item 4: 0.794 0.793 0.785 0.759 0.742 0.737 0.721 0.725 0.727 0.746 0.722
Correlation between the item and the common factor
Item 1: 0.807 0.806 0.799 0.774 0.767 0.757 0.751 0.745 0.738 0.735 0.667
Item 2: 0.798 0.798 0.790 0.767 0.752 0.746 0.736 0.735 0.729 0.723 0.641
Item 3: 0.811 0.810 0.804 0.781 0.772 0.761 0.755 0.747 0.748 0.741 0.674
Item 4: 0.794 0.793 0.784 0.760 0.746 0.738 0.728 0.726 0.719 0.720 0.614

Appendix C.
Selected EFA Results for the “Low” ISV – “High” REV Condition.a

Scale: Continuous | 101-point | 21-point | 11-point | 9-point | 8-point | 7-point | 6-point | 5-point | 4-point | 3-point
χ2: 6.025 5.880 6.667 9.458 5.072 5.568 1.680 1.475 3.873 0.082 4.492
df: 2 2 2 2 2 2 2 2 2 2 2
p: 0.05 0.05 0.04 0.01 0.08 0.06 0.43 0.48 0.14 0.96 0.11
Initial Eigenvalue
Factor 1: 2.224 2.223 2.204 2.161 2.134 2.112 2.083 2.040 1.980 1.960 1.773
Factor 2: 0.61 0.61 0.62 0.64 0.64 0.65 0.65 0.67 0.69 0.69 0.76
Factor 3: 0.60 0.60 0.60 0.62 0.63 0.64 0.64 0.66 0.68 0.68 0.75
Factor 4: 0.57 0.57 0.57 0.58 0.60 0.60 0.62 0.63 0.65 0.67 0.71
Factor Loadings
Item 1: 0.645 0.645 0.640 0.629 0.626 0.618 0.613 0.601 0.582 0.566 0.518
Item 2: 0.634 0.634 0.629 0.615 0.607 0.600 0.594 0.581 0.568 0.550 0.496
Item 3: 0.658 0.657 0.653 0.643 0.631 0.630 0.611 0.607 0.588 0.582 0.534
Item 4: 0.618 0.617 0.612 0.601 0.595 0.587 0.585 0.567 0.547 0.565 0.483
Correlation between the item and the common factor
Item 1: 0.652 0.651 0.647 0.635 0.628 0.622 0.616 0.595 0.580 0.556 0.470
Item 2: 0.633 0.633 0.628 0.614 0.605 0.602 0.592 0.583 0.562 0.545 0.442
Item 3: 0.654 0.654 0.651 0.640 0.632 0.625 0.616 0.608 0.586 0.561 0.493
Item 4: 0.619 0.619 0.616 0.601 0.595 0.586 0.582 0.570 0.540 0.543 0.402
a EFA = Exploratory Factor Analysis, ISV = Item-Specific Variance, REV = Random Error Variance.

Appendix D.
Selected EFA Results for the “Moderate” ISV – “None” REV Condition.a

Scale: Continuous | 101-point | 21-point | 11-point | 9-point | 8-point | 7-point | 6-point | 5-point | 4-point | 3-point
χ2: 0.164 0.205 0.305 0.628 0.394 0.191 0.162 0.863 1.473 0.172 2.089
df: 2 2 2 2 2 2 2 2 2 2 2
p: 0.92 0.90 0.86 0.73 0.82 0.91 0.92 0.65 0.48 0.92 0.35
Initial Eigenvalue
Factor 1: 2.509 2.509 2.491 2.441 2.399 2.391 2.353 2.302 2.221 2.119 1.892
Factor 2: 0.50 0.50 0.51 0.52 0.54 0.54 0.55 0.57 0.61 0.64 0.72
Factor 3: 0.50 0.50 0.51 0.52 0.53 0.53 0.55 0.56 0.59 0.63 0.70
Factor 4: 0.49 0.49 0.50 0.52 0.53 0.53 0.54 0.56 0.59 0.62 0.69
Factor Loadings
Item 1: 0.707 0.707 0.701 0.692 0.688 0.683 0.670 0.659 0.641 0.614 0.543
Item 2: 0.705 0.705 0.701 0.692 0.679 0.677 0.668 0.656 0.628 0.608 0.546
Item 3: 0.710 0.710 0.707 0.695 0.683 0.684 0.676 0.665 0.640 0.624 0.555
Item 4: 0.715 0.714 0.711 0.693 0.682 0.680 0.673 0.655 0.643 0.598 0.538
Correlation between the item and the common factor
Item 1: 0.704 0.703 0.700 0.688 0.681 0.678 0.668 0.655 0.640 0.609 0.540
Item 2: 0.701 0.701 0.697 0.685 0.676 0.675 0.666 0.653 0.627 0.608 0.525
Item 3: 0.713 0.713 0.710 0.697 0.687 0.685 0.677 0.667 0.644 0.616 0.555
Item 4: 0.713 0.712 0.708 0.691 0.681 0.676 0.669 0.651 0.637 0.598 0.499
a EFA = Exploratory Factor Analysis, ISV = Item-Specific Variance, REV = Random Error Variance.

Appendix E.
Selected EFA Results for the “Moderate” ISV – “High” REV Condition.a

Scale: Continuous | 101-point | 21-point | 11-point | 9-point | 8-point | 7-point | 6-point | 5-point | 4-point | 3-point
χ2: 2.914 2.760 2.989 3.412 4.289 3.035 3.388 2.281 2.736 0.564 3.299
df: 2 2 2 2 2 2 2 2 2 2 2
p: 0.23 0.25 0.22 0.18 0.12 0.22 0.18 0.32 0.25 0.75 0.19
Initial Eigenvalue
Factor 1: 1.775 1.775 1.767 1.736 1.725 1.709 1.685 1.667 1.616 1.583 1.434
Factor 2: 0.76 0.77 0.77 0.78 0.78 0.78 0.79 0.80 0.82 0.81 0.89
Factor 3: 0.73 0.73 0.74 0.75 0.75 0.76 0.77 0.77 0.79 0.80 0.86
Factor 4: 0.73 0.73 0.73 0.74 0.74 0.75 0.76 0.76 0.77 0.80 0.82
Factor Loadings
Item 1: 0.515 0.514 0.512 0.501 0.504 0.490 0.489 0.491 0.479 0.443 0.421
Item 2: 0.521 0.521 0.520 0.509 0.500 0.497 0.486 0.479 0.451 0.442 0.366
Item 3: 0.516 0.516 0.513 0.498 0.495 0.490 0.480 0.474 0.467 0.449 0.417
Item 4: 0.481 0.481 0.477 0.473 0.467 0.468 0.457 0.441 0.415 0.430 0.316
Correlation between the item and the common factor
Item 1: 0.521 0.520 0.518 0.507 0.505 0.497 0.491 0.483 0.468 0.443 0.399
Item 2: 0.502 0.502 0.500 0.489 0.483 0.482 0.470 0.458 0.445 0.435 0.356
Item 3: 0.523 0.523 0.519 0.508 0.501 0.498 0.492 0.483 0.475 0.456 0.397
Item 4: 0.481 0.481 0.478 0.468 0.462 0.460 0.454 0.439 0.424 0.418 0.303
a EFA = Exploratory Factor Analysis, ISV = Item-Specific Variance, REV = Random Error Variance.

Appendix F.
Selected EFA Results for the “High” ISV – “None” REV Condition.a

Scale: Continuous | 101-point | 21-point | 11-point | 9-point | 8-point | 7-point | 6-point | 5-point | 4-point | 3-point
χ2: 0.662 0.585 1.186 0.638 2.776 0.788 2.396 1.474 0.359 0.514 1.193
df: 2 2 2 2 2 2 2 2 2 2 2
p: 0.72 0.75 0.55 0.73 0.25 0.67 0.30 0.48 0.84 0.77 0.55
Initial Eigenvalue
Factor 1: 1.601 1.600 1.596 1.572 1.575 1.547 1.547 1.514 1.489 1.433 1.358
Factor 2: 0.81 0.81 0.81 0.82 0.83 0.83 0.83 0.84 0.84 0.87 0.89
Factor 3: 0.80 0.80 0.80 0.81 0.80 0.82 0.82 0.83 0.84 0.86 0.89
Factor 4: 0.79 0.79 0.79 0.80 0.79 0.80 0.80 0.81 0.83 0.84 0.87
Factor Loadings
Item 1: 0.440 0.438 0.440 0.433 0.423 0.421 0.412 0.408 0.398 0.362 0.360
Item 2: 0.460 0.461 0.459 0.442 0.454 0.447 0.443 0.432 0.419 0.405 0.353
Item 3: 0.449 0.449 0.443 0.441 0.444 0.420 0.432 0.411 0.407 0.381 0.342
Item 4: 0.441 0.441 0.440 0.431 0.430 0.420 0.421 0.405 0.391 0.371 0.328
Correlation between the item and the common factor
Item 1: 0.446 0.445 0.444 0.439 0.431 0.428 0.419 0.414 0.400 0.378 0.336
Item 2: 0.434 0.434 0.431 0.419 0.420 0.415 0.412 0.416 0.397 0.386 0.340
Item 3: 0.455 0.454 0.450 0.445 0.441 0.438 0.428 0.421 0.409 0.396 0.339
Item 4: 0.446 0.446 0.443 0.438 0.434 0.425 0.423 0.410 0.403 0.380 0.333
a EFA = Exploratory Factor Analysis, ISV = Item-Specific Variance, REV = Random Error Variance.

Appendix G.
Selected EFA Results for the “High” ISV – “Moderate” REV Condition.a

Scale: Continuous | 101-point | 21-point | 11-point | 9-point | 8-point | 7-point | 6-point | 5-point | 4-point | 3-point
χ2: 1.205 1.121 0.734 1.708 1.797 1.890 2.375 1.568 0.019 1.540 0.673
df: 2 2 2 2 2 2 2 2 2 2 2
p: 0.55 0.57 0.69 0.43 0.41 0.39 0.30 0.46 0.99 0.46 0.71
Initial Eigenvalue
Factor 1: 1.490 1.489 1.481 1.448 1.446 1.426 1.431 1.420 1.422 1.401 1.331
Factor 2: 0.85 0.85 0.85 0.87 0.87 0.87 0.87 0.87 0.87 0.88 0.90
Factor 3: 0.84 0.84 0.84 0.85 0.85 0.86 0.86 0.86 0.86 0.86 0.90
Factor 4: 0.82 0.82 0.83 0.83 0.83 0.85 0.84 0.84 0.85 0.85 0.87
Factor Loadings
Item 1: 0.380 0.379 0.379 0.361 0.357 0.369 0.352 0.355 0.351 0.349 0.314
Item 2: 0.414 0.414 0.408 0.401 0.398 0.380 0.394 0.384 0.388 0.366 0.316
Item 3: 0.405 0.404 0.401 0.380 0.381 0.373 0.386 0.370 0.381 0.388 0.328
Item 4: 0.417 0.417 0.414 0.402 0.404 0.385 0.385 0.387 0.380 0.359 0.372
Correlation between the item and the common factor
Item 1: 0.392 0.392 0.388 0.374 0.374 0.369 0.372 0.362 0.365 0.360 0.321
Item 2: 0.405 0.405 0.400 0.387 0.386 0.383 0.379 0.378 0.374 0.372 0.324
Item 3: 0.407 0.407 0.402 0.387 0.380 0.384 0.374 0.377 0.370 0.382 0.323
Item 4: 0.412 0.412 0.405 0.398 0.396 0.387 0.387 0.384 0.387 0.379 0.337
a EFA = Exploratory Factor Analysis, ISV = Item-Specific Variance, REV = Random Error Variance.

Appendix H.
Selected EFA Results for the “High” ISV – “High” REV Condition.a

Scale: Continuous | 101-point | 21-point | 11-point | 9-point | 8-point | 7-point | 6-point | 5-point | 4-point | 3-point
χ2: 0.488 0.532 0.365 0.484 0.983 0.854 0.332 0.239 0.683 0.035 0.169
df: 2 2 2 2 2 2 2 2 2 2 2
p: 0.78 0.77 0.83 0.78 0.61 0.65 0.85 0.89 0.71 0.98 0.92
Initial Eigenvalue
Factor 1: 1.341 1.341 1.340 1.323 1.315 1.309 1.303 1.290 1.256 1.252 1.202
Factor 2: 0.90 0.90 0.90 0.90 0.91 0.91 0.91 0.91 0.93 0.92 0.94
Factor 3: 0.89 0.89 0.89 0.90 0.90 0.90 0.90 0.91 0.92 0.92 0.93
Factor 4: 0.87 0.87 0.87 0.88 0.88 0.88 0.89 0.89 0.90 0.91 0.93
Factor Loadings
Item 1: 0.325 0.325 0.328 0.318 0.312 0.316 0.301 0.302 0.282 0.291 0.253
Item 2: 0.367 0.366 0.364 0.358 0.344 0.351 0.339 0.342 0.323 0.298 0.274
Item 3: 0.336 0.336 0.338 0.320 0.322 0.317 0.317 0.303 0.282 0.292 0.265
Item 4: 0.320 0.320 0.316 0.315 0.318 0.301 0.316 0.295 0.283 0.278 0.245
Correlation between the item and the common factor
Item 1: 0.333 0.333 0.333 0.323 0.320 0.320 0.314 0.307 0.300 0.295 0.249
Item 2: 0.333 0.333 0.332 0.324 0.319 0.323 0.315 0.312 0.295 0.294 0.252
Item 3: 0.348 0.349 0.348 0.340 0.335 0.332 0.324 0.316 0.306 0.300 0.261
Item 4: 0.333 0.333 0.333 0.323 0.323 0.316 0.311 0.314 0.294 0.301 0.256
a EFA = Exploratory Factor Analysis, ISV = Item-Specific Variance, REV = Random Error Variance.

Appendix I. Selected MI Results for the “Low” ISV – “None” REV Condition.a

1. Configural Invariance 2.
Loading Invariance

Scale: RMSEA χ2 CFI (configural) | χ2 Δχ2 CFI ΔCFI (loading)
101: 0.000 10.773 1.000 | 10.864 0.091 1.000 0.000
21: 0.000 15.735 1.000 | 17.199 1.464 1.000 0.000
16: 0.000 10.815 1.000 | 19.844 9.029 1.000 0.000
13: 0.003 17.767 1.000 | 23.670 5.903 1.000 0.000
12: 0.000 11.291 1.000 | 26.983 15.692 1.000 0.000
11: 0.004 18.600 1.000 | 25.996 7.396 1.000 0.000
10: 0.005 20.514 1.000 | 23.937 3.423 1.000 0.000
9: 0.006 20.888 1.000 | 28.678 7.790 1.000 0.000
8: 0.002 16.532 1.000 | 19.383 2.851 1.000 0.000
7: 0.000 15.727 1.000 | 22.145 6.418 1.000 0.000
6: 0.008 27.086 1.000 | 32.234 5.148 1.000 0.000
5: 0.022 88.797 1.000 | 97.911 9.114 0.999 -0.001
4: 0.067 653.666 0.996 | not tested
3: 0.138 2517.920 0.979 | not tested

3. Means Invariance 4. Error Invariance
Scale: χ2 Δχ2 CFI ΔCFI (means) | χ2 Δχ2 CFI ΔCFI (error)
101: 11.720 0.856 1.000 0.000 | 39.281 27.561 1.000 0.000
21: 18.002 0.803 1.000 0.000 | 567.507 549.505 0.997 -0.003
16: 24.096 4.252 1.000 0.000 | 925.949 901.853 0.995 -0.005
13: 34.302 10.632 1.000 0.000 | 1446.655 1412.353 0.992 -0.008
12: 35.800 8.817 1.000 0.000 | 1648.995 1613.195 0.991 -0.009
11: 26.951 0.955 1.000 0.000 | 1843.481 1816.530 0.990 -0.010
10: 26.689 2.752 1.000 0.000 | 2297.172 2270.483 0.987 -0.013
9: 42.628 13.950 1.000 0.000 | 2899.778 2857.150 0.984 -0.016
8: 29.141 9.758 1.000 0.000 | 3504.711 3475.570 0.980 -0.020
7: 34.314 12.169 1.000 0.000 | 4574.824 4540.510 0.973 -0.027
6: 32.425 0.191 1.000 0.000 | 5974.675 5942.250 0.964 -0.036
5: 100.174 2.263 1.000 0.001 | 7906.446 7806.272 0.949 -0.051
4: not tested | not tested
3: not tested | not tested
a MI = Measurement Invariance, ISV = Item-Specific Variance, REV = Random Error Variance.
Note: Critical Values: RMSEA=0.06, χ2(df=16)= 37.2, CFI=0.95, Δχ2(df=4)= 16.9, and ΔCFI= -0.01.
Note: Models that fail to meet invariance with a continuous model are in bold.

Appendix J. Selected MI Results for the “Low” ISV – “Moderate” REV Condition.a

1. Configural Invariance 2.
Loading Invariance

Scale: RMSEA χ2 CFI (configural) | χ2 Δχ2 CFI ΔCFI (loading)
101: 0.000 14.217 1.000 | 21.434 7.217 1.000 0.000
21: 0.005 19.924 1.000 | 27.029 7.105 1.000 0.000
16: 0.000 10.524 1.000 | 12.950 2.426 1.000 0.000
15: 0.005 19.516 1.000 | 22.999 3.483 1.000 0.000
14: 0.000 10.344 1.000 | 14.454 4.110 1.000 0.000
13: 0.000 16.092 1.000 | 20.532 4.440 1.000 0.000
12: 0.000 13.086 1.000 | 18.156 5.070 1.000 0.000
11: 0.000 14.029 1.000 | 19.862 5.833 1.000 0.000
10: 0.000 7.599 1.000 | 13.596 5.997 1.000 0.000
9: 0.008 25.365 1.000 | 25.787 0.422 1.000 0.000
8: 0.000 10.798 1.000 | 12.953 2.155 1.000 0.000
7: 0.004 18.103 1.000 | 20.318 2.215 1.000 0.000
6: 0.000 14.851 1.000 | 17.337 2.486 1.000 0.000
5: 0.015 49.169 1.000 | 52.581 3.412 1.000 0.000
4: 0.047 336.772 0.997 | 446.772 110.000 0.996 -0.001
3: 0.113 1742.710 0.981 | not tested

3. Means Invariance 4. Error Invariance
Scale: χ2 Δχ2 CFI ΔCFI (means) | χ2 Δχ2 CFI ΔCFI (error)
101: 25.772 4.338 1.000 0.000 | 52.689 26.917 1.000 0.000
21: 30.890 3.861 1.000 0.000 | 483.131 452.241 0.997 -0.003
16: 14.979 2.029 1.000 0.000 | 881.208 866.229 0.994 -0.006
15: 27.543 4.544 1.000 0.000 | 1120.791 1093.248 0.992 -0.008
14: 16.672 2.218 1.000 0.000 | 1280.902 1264.230 0.991 -0.009
13: 26.559 6.027 1.000 0.000 | 1383.484 1356.925 0.990 -0.010
12: 21.110 2.954 1.000 0.000 | 1700.401 1679.291 0.987 -0.013
11: 24.286 4.424 1.000 0.000 | 1831.697 1807.411 0.986 -0.014
10: 20.221 6.625 1.000 0.000 | 2101.347 2081.126 0.984 -0.016
9: 29.224 3.437 1.000 0.000 | 2473.103 2443.879 0.980 -0.020
8: 17.601 4.648 1.000 0.000 | 2718.607 2701.006 0.978 -0.022
7: 22.758 2.440 1.000 0.000 | 2882.374 2859.616 0.976 -0.024
6: 18.419 1.082 1.000 0.000 | 2709.370 2690.951 0.977 -0.023
5: 52.756 0.175 1.000 0.000 | 2249.776 2197.020 0.980 -0.020
4: 452.086 5.314 0.996 0.000 | 2759.906 2307.820 0.975 -0.021
3: not tested | not tested
a MI = Measurement Invariance, ISV = Item-Specific Variance, REV = Random Error Variance.
Note: Critical Values: RMSEA=0.06, χ2(df=16)= 37.2, CFI=0.95, Δχ2(df=4)= 16.9, and ΔCFI= -0.01.
Note: Models that fail to meet invariance with a continuous model are in bold.

Appendix K. Selected MI Results for the “Low” ISV – “High” REV Condition.a

1. Configural Invariance 2. Loading Invariance
Scale: RMSEA χ2 CFI (configural) | χ2 Δχ2 CFI ΔCFI (loading)
101: 0.008 26.142 1.000 | 26.698 0.556 1.000 0.000
21: 0.003 17.166 1.000 | 17.970 0.804 1.000 0.000
16: 0.007 22.624 1.000 | 27.273 4.649 1.000 0.000
13: 0.005 19.661 1.000 | 27.517 7.856 1.000 0.000
12: 0.004 18.379 1.000 | 22.014 3.635 1.000 0.000
11: 0.002 16.862 1.000 | 18.395 1.533 1.000 0.000
10: 0.006 21.792 1.000 | 31.925 10.133 1.000 0.000
9: 0.000 11.362 1.000 | 13.684 2.322 1.000 0.000
8: 0.004 18.999 1.000 | 22.650 3.651 1.000 0.000
7: 0.007 23.862 1.000 | 27.845 3.983 1.000 0.000
6: 0.002 16.596 1.000 | 17.615 1.019 1.000 0.000
5: 0.018 66.017 0.999 | 73.121 7.104 0.999 0.000
4: 0.039 240.839 0.996 | 260.935 20.096 0.996 0.000
3: 0.070 709.375 0.984 | not tested

3. Means Invariance 4. Error Invariance
Scale: χ2 Δχ2 CFI ΔCFI (means) | χ2 Δχ2 CFI ΔCFI (error)
101: 29.836 3.138 1.000 0.000 | 39.057 9.221 1.000 0.000
71: 22.296 n/a 1.000 n/a | 37.259 14.963 1.000 0.000
61: 36.060 n/a 1.000 n/a | 62.099 26.039 1.000 0.000
21: 21.240 3.270 1.000 0.000 | 250.966 229.726 0.997 -0.003
16: 29.880 2.607 1.000 0.000 | 401.444 371.564 0.995 -0.005
13: 32.794 5.277 1.000 0.000 | 539.243 506.449 0.993 -0.007
12: 23.226 1.212 1.000 0.000 | 623.203 599.977 0.992 -0.008
11: 28.945 10.550 1.000 0.000 | 834.060 805.115 0.989 -0.011
10: 33.893 1.968 1.000 0.000 | 821.962 788.069 0.989 -0.011
9: 22.133 8.449 1.000 0.000 | 1098.923 1076.790 0.986 -0.014
8: 25.189 2.539 1.000 0.000 | 1339.995 1314.806 0.982 -0.018
7: 31.674 3.829 1.000 0.000 | 1835.902 1804.228 0.974 -0.026
6: 19.472 1.857 1.000 0.000 | 2260.531 2241.059 0.967 -0.033
5: 76.076 2.955 0.999 0.000 | 3143.781 3067.705 0.951 -0.048
4: 263.755 2.820 0.996 0.000 | 4379.371 4115.616 0.926 -0.070
3: not tested | not tested
a MI = Measurement Invariance, ISV = Item-Specific Variance, REV = Random Error Variance.
Note: Critical Values: RMSEA=0.06, χ2(df=16)= 37.2, CFI=0.95, Δχ2(df=4)= 16.9, and ΔCFI= -0.01.
Note: Models that fail to meet invariance with a continuous model are in bold.

Appendix L. Selected MI Results for the “Moderate” ISV – “None” REV Condition.a

1. Configural Invariance 2. Loading Invariance
Scale: RMSEA χ2 CFI (configural) | χ2 Δχ2 CFI ΔCFI (loading)
101: 0.005 19.412 1.000 | 31.929 12.517 1.000 0.000
21: 0.000 12.040 1.000 | 15.799 3.759 1.000 0.000
16: 0.005 20.189 1.000 | 25.833 5.644 1.000 0.000
13: 0.000 7.438 1.000 | 8.147 0.709 1.000 0.000
12: 0.005 20.130 1.000 | 31.808 11.678 1.000 0.000
11: 0.000 11.364 1.000 | 12.033 0.669 1.000 0.000
10: 0.000 8.390 1.000 | 10.054 1.664 1.000 0.000
9: 0.000 10.418 1.000 | 16.459 6.041 1.000 0.000
8: 0.000 5.774 1.000 | 6.567 0.793 1.000 0.000
7: 0.000 11.054 1.000 | 11.971 0.917 1.000 0.000
6: 0.000 11.093 1.000 | 15.032 3.939 1.000 0.000
5: 0.000 8.271 1.000 | 13.262 4.991 1.000 0.000
4: 0.000 10.438 1.000 | 36.081 25.643 1.000 0.000
3: 0.022 89.624 0.999 | 727.004 637.380 0.988 -0.011

3. Means Invariance 4.
Error Invariance

Scale: χ2 Δχ2 CFI ΔCFI (means) | χ2 Δχ2 CFI ΔCFI (error)
101: 33.056 1.127 1.000 0.000 | 45.993 12.937 1.000 0.000
71: 33.110 n/a 1.000 n/a | 44.508 11.398 1.000 0.000
61: 20.475 n/a 1.000 n/a | 70.971 50.496 1.000 0.000
21: 19.083 3.284 1.000 0.000 | 203.960 184.877 0.998 -0.002
16: 28.081 2.248 1.000 0.000 | 407.015 378.934 0.996 -0.004
13: 15.851 7.704 1.000 0.000 | 593.868 578.017 0.994 -0.006
12: 41.607 9.799 1.000 0.000 | 679.470 637.863 0.993 -0.007
11: 15.724 3.691 1.000 0.000 | 749.099 733.375 0.992 -0.008
10: 20.242 10.188 1.000 0.000 | 1035.759 1015.517 0.989 -0.011
9: 18.478 2.019 1.000 0.000 | 1216.402 1197.924 0.987 -0.013
8: 11.468 4.901 1.000 0.000 | 1343.392 1331.924 0.986 -0.014
7: 16.933 4.962 1.000 0.000 | 1768.126 1751.193 0.981 -0.019
6: 22.047 7.015 1.000 0.000 | 2540.902 2518.855 0.971 -0.029
5: 14.192 0.930 1.000 0.000 | 3646.435 3632.243 0.956 -0.044
4: 49.517 13.436 1.000 0.000 | 6364.731 6315.214 0.916 -0.084
3: not tested | not tested
a MI = Measurement Invariance, ISV = Item-Specific Variance, REV = Random Error Variance.
Note: Critical Values: RMSEA=0.06, χ2(df=16)= 37.2, CFI=0.95, Δχ2(df=4)= 16.9, and ΔCFI= -0.01.
Note: Models that fail to meet invariance with a continuous model are in bold.

Appendix M. Selected MI Results for the “Moderate” ISV – “High” REV Condition.a

1. Configural Invariance 2.
Loading Invariance Scale RMSEA χ2 CFI χ2 Δ χ2 CFI ΔCFI 101 0.003 17.595 1.000 23.442 5.847 1.000 0.000 21 0.000 12.049 1.000 16.745 4.696 1.000 0.000 16 0.001 16.018 1.000 22.555 6.537 1.000 0.000 13 0.001 16.111 1.000 22.934 6.823 1.000 0.000 12 0.000 13.792 1.000 17.979 4.187 1.000 0.000 11 0.000 12.892 1.000 14.710 1.818 1.000 0.000 10 0.000 7.171 1.000 13.413 6.242 1.000 0.000 9 0.000 15.352 1.000 20.821 5.469 1.000 0.000 8 0.000 9.105 1.000 16.183 7.078 1.000 0.000 7 0.000 11.865 1.000 12.411 0.546 1.000 0.000 6 0.007 24.077 1.000 27.313 3.236 1.000 0.000 5 0.000 12.710 1.000 15.234 2.524 1.000 0.000 4 0.002 16.527 1.000 24.104 7.577 1.000 0.000 3 0.013 41.950 0.999 499.912 457.962 0.984 -0.015 3. Means Invariance 4. Error Invariance Scale χ2 Δ χ2 CFI ΔCFI χ2 Δ χ2 CFI ΔCFI 101 24.396 0.954 1.000 0.000 34.225 9.829 1.000 0.000 71 16.924 n/a 1.000 n/a 31.865 14.941 1.000 0.000 61 30.684 n/a 1.000 n/a 68.584 37.900 0.999 -0.001 21 19.530 2.785 1.000 0.000 161.502 141.972 0.998 -0.002 16 25.652 3.097 1.000 0.000 323.379 297.727 0.995 -0.005 13 31.617 8.683 1.000 0.000 436.280 404.663 0.992 -0.008 12 21.442 3.463 1.000 0.000 476.196 454.754 0.992 -0.008 11 16.492 1.782 1.000 0.000 687.883 671.391 0.988 -0.012 10 22.807 9.394 1.000 0.000 740.211 717.404 0.986 -0.014 9 22.502 1.681 1.000 0.000 868.899 846.397 0.984 -0.016 8 18.524 2.341 1.000 0.000 1035.455 1016.931 0.980 -0.020 7 15.350 2.939 1.000 0.000 1392.265 1376.915 0.972 -0.028 6 30.801 3.488 1.000 0.000 1752.112 1721.311 0.964 -0.036 5 17.188 1.954 1.000 0.000 2584.474 2567.286 0.943 -0.057 4 27.642 3.538 1.000 0.000 3579.847 3552.205 0.912 -0.088 3 not tested not tested a MI = Measurement Invariance, ISV = Item-Specific Variance, REV = Random Error Variance. Note: Critical Values: RMSEA=0.06, χ2(df=16)= 37.2, CFI=0.95, Δχ2(df=4)= 16.9, and ΔCFI= -0.01. Note: Models that fail to meet invariance with a continuous model are in bold. 58 Appendix N. 
Selected MI Results for the “High” ISV – “None” REV Condition.a 1. Configural Invariance 2. Loading Invariance Scale RMSEA χ2 CFI χ2 Δ χ2 CFI ΔCFI 101 0.008 27.353 1.000 34.756 7.403 1.000 0.000 21 0.000 15.103 1.000 21.201 6.098 1.000 0.000 16 0.000 11.024 1.000 11.671 0.647 1.000 0.000 13 0.000 13.156 1.000 16.703 3.547 1.000 0.000 12 0.000 11.520 1.000 21.195 9.675 1.000 0.000 11 0.000 10.234 1.000 21.629 11.395 1.000 0.000 10 0.000 10.013 1.000 12.013 2.000 1.000 0.000 9 0.000 15.575 1.000 24.780 9.205 1.000 0.000 8 0.005 20.131 1.000 22.618 2.487 1.000 0.000 7 0.008 26.004 1.000 28.629 2.625 1.000 0.000 6 0.000 15.668 1.000 18.916 3.248 1.000 0.000 5 0.000 12.580 1.000 12.607 0.027 1.000 0.000 4 0.000 14.710 1.000 18.133 3.423 1.000 0.000 3 0.001 16.169 1.000 105.350 89.181 0.997 -0.003 3. Means Invariance 4. Error Invariance Scale χ2 Δ χ2 CFI ΔCFI χ2 Δ χ2 CFI ΔCFI 101 39.743 4.987 1.000 0.000 56.523 16.780 0.999 -0.001 61 22.482 n/a 1.000 n/a 34.660 12.178 1.000 0.000 51 23.017 n/a 1.000 n/a 48.107 25.090 1.000 0.000 21 24.872 3.671 1.000 0.000 184.030 159.158 0.997 -0.003 16 13.717 2.046 1.000 0.000 270.223 256.506 0.995 -0.005 13 20.424 3.721 1.000 0.000 421.068 400.644 0.992 -0.008 12 22.985 1.790 1.000 0.000 495.297 472.312 0.990 -0.010 11 23.531 1.902 1.000 0.000 433.616 410.085 0.991 -0.009 10 16.775 4.762 1.000 0.000 708.222 691.447 0.985 -0.015 9 28.988 4.208 1.000 0.000 759.635 730.647 0.984 -0.016 8 24.774 2.156 1.000 0.000 964.815 940.041 0.979 -0.021 7 31.256 2.627 1.000 0.000 1199.746 1168.490 0.974 -0.026 6 21.989 3.073 1.000 0.000 1750.380 1728.391 0.960 -0.040 5 16.433 3.826 1.000 0.000 2426.783 2410.350 0.941 -0.059 4 24.240 6.107 1.000 0.000 4317.859 4293.619 0.884 -0.116 3 107.715 2.365 0.997 0.000 1969.487 1861.772 0.933 -0.064 a MI = Measurement Invariance, ISV = Item-Specific Variance, REV = Random Error Variance. Note: Critical Values: RMSEA=0.06, χ2(df=16)= 37.2, CFI=0.95, Δχ2(df=4)= 16.9, and ΔCFI= -0.01. 
Note: Models that fail to meet invariance with a continuous model are in bold. 59 Appendix O. Selected MI Results for the “High” ISV – “Moderate” REV Condition.a 1. Configural Invariance 2. Loading Invariance Scale RMSEA χ2 CFI χ2 Δ χ2 CFI ΔCFI 101 0.000 10.921 1.000 13.172 2.251 1.000 0.000 21 0.000 11.175 1.000 12.942 1.767 1.000 0.000 16 0.000 7.679 1.000 12.308 4.629 1.000 0.000 15 0.000 10.425 1.000 18.306 7.881 1.000 0.000 14 0.000 10.944 1.000 13.551 2.607 1.000 0.000 13 0.003 17.704 1.000 26.489 8.785 1.000 0.000 12 0.000 15.280 1.000 19.676 4.396 1.000 0.000 11 0.000 15.858 1.000 22.238 6.380 1.000 0.000 10 0.000 13.439 1.000 14.610 1.171 1.000 0.000 9 0.000 8.475 1.000 11.124 2.649 1.000 0.000 8 0.003 17.732 1.000 20.029 2.297 1.000 0.000 7 0.000 15.444 1.000 17.527 2.083 1.000 0.000 6 0.000 10.319 1.000 11.263 0.944 1.000 0.000 5 0.000 8.126 1.000 8.651 0.525 1.000 0.000 4 0.000 11.443 1.000 13.271 1.828 1.000 0.000 3 0.000 13.687 1.000 96.899 83.212 0.997 -0.003 3. Means Invariance 4. 
Error Invariance Scale χ2 Δ χ2 CFI ΔCFI χ2 Δ χ2 CFI ΔCFI 101 14.987 1.815 1.000 0.000 22.177 7.190 1.000 0.000 91 48.928 n/a 0.999 n/a 66.210 17.282 0.999 0.000 81 16.721 n/a 1.000 n/a 40.150 23.429 1.000 0.000 21 20.561 7.619 1.000 0.000 271.624 251.063 0.995 -0.005 16 15.688 3.380 1.000 0.000 396.726 381.038 0.992 -0.008 15 19.459 1.153 1.000 0.000 447.896 428.437 0.991 -0.009 14 15.160 1.609 1.000 0.000 533.837 518.677 0.989 -0.011 13 28.956 2.467 1.000 0.000 679.718 650.762 0.985 -0.015 12 24.988 5.312 1.000 0.000 722.078 697.090 0.984 -0.016 11 22.541 0.303 1.000 0.000 792.293 769.752 0.982 -0.018 10 18.528 3.918 1.000 0.000 1044.708 1026.180 0.976 -0.024 9 13.502 2.378 1.000 0.000 1153.326 1139.824 0.973 -0.027 8 26.107 6.078 1.000 0.000 1248.435 1222.328 0.970 -0.030 7 22.376 4.849 1.000 0.000 1280.539 1258.163 0.968 -0.032 6 17.482 6.219 1.000 0.000 1081.917 1064.435 0.972 -0.028 5 15.915 7.264 1.000 0.000 941.193 925.278 0.974 -0.026 4 18.591 5.320 1.000 0.000 1182.785 1164.194 0.963 -0.037 3 102.118 5.219 0.997 0.000 275.726 173.608 0.990 -0.007 a MI = Measurement Invariance, ISV = Item-Specific Variance, REV = Random Error Variance. Note: Critical Values: RMSEA=0.06, χ2(df=16)= 37.2, CFI=0.95, Δχ2(df=4)= 16.9, and ΔCFI= -0.01. Note: Models that fail to meet invariance with a continuous model are in bold. 60 Appendix P. Selected MI Results for the “High” ISV – “High” REV Condition.a 1. Configural Invariance 2. 
Loading Invariance Scale RMSEA χ2 CFI χ2 Δ χ2 CFI ΔCFI 101 0.006 21.487 1.000 21.980 0.493 1.000 0.000 21 0.004 18.887 1.000 23.274 4.387 1.000 0.000 16 0.000 11.922 1.000 16.787 4.865 1.000 0.000 13 0.000 11.649 1.000 13.129 1.480 1.000 0.000 12 0.006 20.998 1.000 23.372 2.374 1.000 0.000 11 0.000 14.236 1.000 16.170 1.934 1.000 0.000 10 0.007 24.545 1.000 29.466 4.921 1.000 0.000 9 0.004 18.206 1.000 20.476 2.270 1.000 0.000 8 0.000 4.847 1.000 5.865 1.018 1.000 0.000 7 0.000 8.166 1.000 11.537 3.371 1.000 0.000 6 0.000 9.810 1.000 17.893 8.083 1.000 0.000 5 0.000 4.793 1.000 14.755 9.962 1.000 0.000 4 0.000 7.972 1.000 9.988 2.016 1.000 0.000 3 0.000 7.354 1.000 79.922 72.568 0.997 -0.003 3. Means Invariance 4. Error Invariance Scale χ2 Δ χ2 CFI ΔCFI χ2 Δ χ2 CFI ΔCFI 101 35.359 13.379 1.000 0.000 56.725 21.366 0.999 -0.001 21 24.227 0.953 1.000 0.000 176.409 152.182 0.997 -0.003 16 23.708 6.921 1.000 0.000 297.139 273.431 0.994 -0.006 13 14.326 1.197 1.000 0.000 297.457 283.131 0.994 -0.006 12 26.770 3.398 1.000 0.000 419.192 392.422 0.991 -0.009 11 17.931 1.761 1.000 0.000 478.853 460.922 0.989 -0.011 10 32.409 2.943 1.000 0.000 617.373 584.964 0.985 -0.015 9 21.184 0.708 1.000 0.000 731.986 710.802 0.982 -0.018 8 9.544 3.679 1.000 0.000 843.124 833.580 0.979 -0.021 7 17.503 5.966 1.000 0.000 1178.805 1161.302 0.970 -0.030 6 23.272 5.379 1.000 0.000 1568.589 1545.317 0.958 -0.042 5 22.869 8.114 1.000 0.000 2120.063 2097.194 0.939 -0.061 4 16.456 6.468 1.000 0.000 2741.555 2725.099 0.912 -0.088 3 83.991 4.069 0.997 0.000 442.301 358.310 0.982 -0.015 a MI = Measurement Invariance, ISV = Item-Specific Variance, REV = Random Error Variance. Note: Critical Values: RMSEA=0.06, χ2(df=16)= 37.2, CFI=0.95, Δχ2(df=4)= 16.9, and ΔCFI= -0.01. Note: Models that fail to meet invariance with a continuous model are in bold.
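The decision rule behind these tables can be sketched in a few lines. The following Python fragment is an illustrative sketch, not the author's original code: it applies the critical values stated in the table notes (Δχ²(df=4) = 16.9 and ΔCFI = -0.01) to a single nested invariance step. Treating the two criteria as a joint requirement is an assumption made here for illustration; the thesis may weight them differently.

```python
# Critical values taken from the table notes of the appendices above.
DELTA_CHI2_CRIT = 16.9   # critical chi-square difference for df = 4
DELTA_CFI_CRIT = -0.01   # a CFI drop at or beyond -0.01 signals non-invariance

def invariance_step_holds(delta_chi2: float, delta_cfi: float) -> bool:
    """Return True when a nested invariance step passes both criteria.

    Joint use of both cutoffs is an assumption of this sketch.
    """
    return delta_chi2 < DELTA_CHI2_CRIT and delta_cfi > DELTA_CFI_CRIT

# Two loading-invariance rows from Appendix L ("Moderate" ISV, "None" REV):
# the 13-point scale (dchi2 = 0.709, dCFI = 0.000) and the 3-point scale
# (dchi2 = 637.380, dCFI = -0.011).
print(invariance_step_holds(0.709, 0.000))     # 13-point scale -> True
print(invariance_step_holds(637.380, -0.011))  # 3-point scale -> False
```

Reading the tables with this rule in hand makes the pattern visible: coarse scales (3 to 10 points) fail mainly at the error-invariance step, where both Δχ² and ΔCFI deteriorate sharply.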
Item Metadata

| Field | Value |
| --- | --- |
| Title | Evaluating the error of measurement due to categorical scaling with a measurement invariance approach to confirmatory factor analysis |
| Creator | Olson, Brent |
| Publisher | University of British Columbia |
| Date Issued | 2008 |
| Extent | 392764 bytes |
| Subject | optimum number of scale points; continuous scale; discrete scale; categorization; coarseness; measurement error; Classical True Score Model; simulation study; data generation; item specific variance; random error variance; longitudinal measurement invariance; Comparative Fit Index; Relative Multivariate Kurtosis |
| Genre | Thesis/Dissertation |
| Type | Text |
| File Format | application/pdf |
| Language | eng |
| Date Available | 2008-02-11 |
| Provider | Vancouver : University of British Columbia Library |
| Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
| DOI | 10.14288/1.0054570 |
| URI | http://hdl.handle.net/2429/332 |
| Degree | Master of Arts - MA |
| Program | Measurement, Evaluation and Research Methodology |
| Affiliation | Education, Faculty of; Educational and Counselling Psychology, and Special Education (ECPS), Department of |
| Degree Grantor | University of British Columbia |
| Graduation Date | 2008-05 |
| Campus | UBCV |
| Scholarly Level | Graduate |
| Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
| Aggregated Source Repository | DSpace |