EVALUATING THE ERROR OF MEASUREMENT DUE TO CATEGORICAL SCALING WITH A MEASUREMENT INVARIANCE APPROACH TO CONFIRMATORY FACTOR ANALYSIS

by

BRENT F. OLSON
B.A., Western Washington University, 2005

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF ARTS in THE FACULTY OF GRADUATE STUDIES (Measurement, Evaluation, and Research Methodology)

THE UNIVERSITY OF BRITISH COLUMBIA
January 2008
© Brent F. Olson, 2008

Abstract

It has previously been determined that using 3 or 4 points on a categorized response scale will fail to produce a continuous distribution of scores. However, there is no evidence, thus far, revealing the number of scale points that may indeed possess an approximate or sufficiently continuous distribution. This study provides the evidence to suggest the level of categorization in discrete scales that makes them directly comparable to continuous scales in terms of their measurement properties. To do this, we first introduced a novel procedure for simulating discretely scaled data that was both informed and validated through the principles of the Classical True Score Model. Second, we employed a measurement invariance (MI) approach to confirmatory factor analysis (CFA) in order to directly compare the measurement quality of continuously scaled factor models to that of discretely scaled models. The simulated design conditions of the study varied with respect to item-specific variance (low, moderate, high), random error variance (none, moderate, high), and discrete scale categorization (number of scale points ranged from 3 to 101). A population analogue approach was taken with respect to sample size (N = 10,000). We concluded that there are conditions under which response scales with 11 to 15 scale points can reproduce the measurement properties of a continuous scale. Using response scales with more than 15 points may be, for the most part, unnecessary.
Scales having from 3 to 10 points introduce a significant level of measurement error, and caution should be taken when employing such scales. The implications of this research and future directions are discussed.

Table of Contents

Abstract ... ii
Table of Contents ... iii
List of Tables ... v
List of Figures ... vi
Dedication ... vii
Introduction ... 1
  Procedural Background ... 4
Method ... 8
  Design Conditions ... 8
  Model Specification ... 9
  Data Generation ... 14
  Data Validation ... 20
Results ... 23
  Data Validation ... 23
  Measurement Invariance ... 29
Discussion ... 34
  Conclusions ... 36
  Limitations and Future Directions ... 38
Footnotes ... 41
References ... 42
Appendix A. Selected EFA Results for the “Low” ISV – “None” REV Condition.a ... 45
Appendix B. Selected EFA Results for the “Low” ISV – “Moderate” REV Condition.a ... 46
Appendix C. Selected EFA Results for the “Low” ISV – “High” REV Condition.a ... 47
Appendix D. Selected EFA Results for the “Moderate” ISV – “None” REV Condition.a ... 48
Appendix E. Selected EFA Results for the “Moderate” ISV – “High” REV Condition.a ... 49
Appendix F. Selected EFA Results for the “High” ISV – “None” REV Condition.a ... 50
Appendix G. Selected EFA Results for the “High” ISV – “Moderate” REV Condition.a ... 51
Appendix H. Selected EFA Results for the “High” ISV – “High” REV Condition.a ... 52
Appendix I. Selected MI Results for the “Low” ISV – “None” REV Condition.a ... 53
Appendix J. Selected MI Results for the “Low” ISV – “Moderate” REV Condition.a ... 54
Appendix K. Selected MI Results for the “Low” ISV – “High” REV Condition.a ... 55
Appendix L. Selected MI Results for the “Moderate” ISV – “None” REV Condition.a ... 56
Appendix M. Selected MI Results for the “Moderate” ISV – “High” REV Condition.a ... 57
Appendix N. Selected MI Results for the “High” ISV – “None” REV Condition.a ... 58
Appendix O. Selected MI Results for the “High” ISV – “Moderate” REV Condition.a ... 59
Appendix P. Selected MI Results for the “High” ISV – “High” REV Condition.a ... 60

List of Tables

Table 1. Relative Multivariate Kurtosis (RMK) of the Continuous Model and Selected Discrete Models ... 24
Table 2. Selected EFA Results for the “Moderate” ISV – “Moderate” REV Condition ... 26
Table 3. Data Validation with Exploratory Factor Analysis: Comparing the Continuous Scale Model to the 3-point Model ... 28
Table 4. Selected MI Results for the “Moderate” ISV – “Moderate” REV Condition ... 30
Table 5. The Scale Point Level at which Successful Invariance was Achieved According to RMSEA, χ2, and CFI ... 32

List of Figures

Figure 1. A Comparison Between Methods for Simulating Observed Test Scores According to the Classical True Score Model ... 6
Figure 2. Conceptual Diagram of the Dependent Samples Model Used to Test the Measurement Invariance Between Continuous and Discrete Scale Models ... 10
Figure 3. Three Examples of Frequency Histograms for Continuously Scaled Items and Their Corresponding Categorized Scale
Versions ... 18

Dedication

for Forrest Lee Olson

Introduction

Errors in measurement are one of the most pervasive problems in modern psychological test construction (Gregory, 2004, chap. 3). Measurement error can take many forms and arise from many sources. One source of error that has gained increasing attention in the literature is the use of coarsely categorized response formats in measuring theoretically continuous constructs. Response formats with a continuous scale of measurement possess interval properties that provide useful information about the distance between observed scores on psychological measures. That is, any two observed scores from a continuous scale have a known distance from each other, and a known mid-point. Categorization of continuous scales is a process of using cutoff points to segment the scale into a set of discrete categories. Researchers decide on the number and type of scale points to use during the development of their measures. Test and survey developers often find it impractical or impossible to utilize continuous scales directly in their instruments, so they must resort to using discrete scales (DiStefano, 2002). Categorization has the effect of removing the interval measurement properties of the scale, as the information that exists in between each scale point is lost. For coarsely categorized response scales, this loss of information can reduce the meaningfulness of test scores, and may lead to a misrepresentation of the construct being measured (Lubke & Muthén, 2004; Russell, Pinto, & Bobko, 1991). Specifically, coarseness in categorized responses has been shown to generate systematic error that affects the interaction between variables in regression models (Russell et al., 1991), and may cause a substantial loss in statistical power to detect true relationships between predictors and outcomes (Taylor, West, & Aiken, 2006).
Coarseness causes errors in measurement that can distort the factor structure of latent variable models (Lubke & Muthén, 2004), and can reduce the reliability of test scores (Bandalos & Enders, 1996; Jenkins & Taber, 1977; Lissitz & Green, 1975). Problems due to categorization manifest themselves when researchers attempt to use coarsely scaled data to perform analyses designed for continuously scaled data. For example, the Pearson product moment correlation – a commonly used analysis in the social sciences – relies on the assumption that the input data are continuous (Garson, n.d.; Pearson, 1909). Jöreskog (1994) argues that the biggest problem with categorized variables is that their distributions are simply incapable of being continuous. Jöreskog and others (e.g., Bandalos & Enders, 1996; Bollen & Barb, 1981; Dolan, 1994; Muthén & Kaplan, 1985) seem to agree that scales with four categories or fewer are too coarse to be treated as continuous. However, there is considerable disagreement about the “optimum” number of scale points above four. While there is some evidence to establish which scale points are clearly non-continuous, there is no evidence, thus far, to say which scales do have a sufficiently continuous distribution. Not only is there disagreement in the literature about how many categories constitute an approximately continuous measure, but there are several issues that may have been overlooked by previous researchers. First, any assumption about the way categorized scales perform must incorporate the fact that observed score data follow the tenets of the Classical True Score Model (i.e., Observed Score = True Score + Error). That is, any categorization of the data applies to the true scores and the error scores, not just observed scores.
Second, the methods that have been used to determine the “optimum” number of categories, in terms of measurement properties, do not directly compare the performance of discretely scaled test items against continuously scaled test items. A strict test of the measurement equality between discrete scales and continuous scales is necessary to determine the proper number of scale points that should be used. The present study will address these concerns in the following ways. First, we will introduce a novel procedure for generating valid observed score data by simulating the effect of categorization on the Classical True Score Model. Second, we will employ a measurement invariance (MI) approach to confirmatory factor analysis (CFA) in order to directly compare continuously scaled factor models to discretely scaled models in terms of their measurement properties. In general, MI can be used to determine whether a given psychological measure performs equally across sample groups or over repeated measures of the same group (Brown, 2006, chap. 3). In our case, we will use repeated measures MI-CFA to determine whether continuous and discrete scale models have equal measurement properties, including equal measurement error. In short, this study will utilize MI-CFA, in conjunction with simulated data, to assess the measurement properties of latent factor models comprised of test items expressed on continuous and discrete response scales. The result of this evaluation will help determine the level of coarseness in a categorized scale that creates enough measurement error to make it incomparable to a continuous scale. If dramatic differences in error exist between the two types of response scales, it indicates that the measurement properties of discrete scales behave in a fundamentally different way from continuous scales.
This study will provide a justification for deciding how many discrete scale points are necessary to adequately reproduce the measurement properties of a continuous scale.

Procedural Background

For CFA, evaluation of the model relies on the covariances among the variables in the model (Brown, 2006, chap. 2). Because of this dependence on the covariances, simulation studies that evaluate CFA models often begin by producing a covariance matrix (Bandalos, 2006). This, in turn, informs the creation of raw data, onto which the manipulations imposed by the design of the study are performed. However, the raw data that will be generated for the current study attempts to simulate how raw data are believed to be collected in a natural context. Ever since the early 1900s, researchers have theorized that observed raw score data from psychological measures are derived from the combination of an underlying “true score” with an “error score” (Crocker & Algina, 1986, chap. 6). This Classical True Score Model can be expressed with the following equation:

Xij = Tij + Eij

where Xij is the observed score for the ith individual on the jth measure (i.e., test or item), Tij is the true score, and Eij is the error score. Under the natural (non-simulated) conditions of psychological test administration, neither the true score nor the error term can be directly obtained (Crocker & Algina, 1986, chap. 6; McDonald, 1999, chap. 5). Under unnatural (simulated) conditions, however, true scores and errors can be artificially generated using provisions set forth by Classical Test Theory, the theory from which the Classical True Score Model is derived. Classical Test Theory assumes that the mean of the error term is zero, the correlation between the errors across items is zero, and the correlation between errors and true scores is zero. When these assumptions are met, simply adding the simulated error to the simulated true scores produces theoretically informed simulated observed scores.
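The assumptions just listed are easy to check empirically in a simulation. The following is a minimal sketch, in Python rather than the SPSS syntax used for the actual study, of generating true scores and error scores that satisfy the Classical Test Theory assumptions and summing them into observed scores; the variable names and the error standard deviation of 0.5 are illustrative choices, not values from the thesis.

```python
import random

random.seed(42)
N = 10000

# True scores and error scores drawn independently, so that the CTT
# assumptions hold: error mean is zero and errors are uncorrelated
# with true scores (approximately, in any finite sample).
T = [random.gauss(0.0, 1.0) for _ in range(N)]
E = [random.gauss(0.0, 0.5) for _ in range(N)]
X = [t + e for t, e in zip(T, E)]  # observed = true + error

def mean(v):
    return sum(v) / len(v)

def corr(a, b):
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)
    sa = (sum((x - ma) ** 2 for x in a) / len(a)) ** 0.5
    sb = (sum((y - mb) ** 2 for y in b) / len(b)) ** 0.5
    return cov / (sa * sb)

print(abs(mean(E)) < 0.05)     # error mean is approximately zero
print(abs(corr(T, E)) < 0.05)  # errors approximately uncorrelated with true scores
```

With N = 10,000 draws, sampling error in these checks is negligible, which is one reason a population analogue sample size makes the simulated scores "theoretically informed" in the sense described above.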
That being said, it is important to note that several previous studies in the field of categorical scaling have failed to properly incorporate the tenets of the Classical True Score Model into their simulation procedures. Previous simulation studies commonly take one of the following approaches to generating their data. In some cases, they ignore the Classical True Score Model altogether by first creating a continuous observed score variable, which is simply categorized into a discrete observed score variable (e.g., Bollen & Barb, 1981; Johnson & Creech, 1983; Taylor et al., 2006). In other cases, continuous true scores and error scores are created and then added together to form a continuous observed score variable, which is subsequently categorized to form a discrete observed score variable (e.g., Bandalos & Enders, 1996; Jenkins & Taber, 1977; Lissitz & Green, 1975). In either case, these methods ignore the fact that categorization has an impact on true scores and error scores directly, not simply on the observed scores. Figure 1 depicts this inappropriate conventional approach to deriving categorized observed scores. From the perspective of the Classical True Score Model, true scores can be affected by categorization when researchers assume respondents can translate a nearly infinite range of feelings or attitudes into a single discrete scale value. Thus, the impact of categorization on the measurement process is introduced to the true scores when researchers decide how many discrete scale points a test item will have. Additionally, error scores can be affected by categorization when random error forces a respondent’s score to change by one or more full scale point(s). This is in contrast to continuous scales, where random error might change the score by some minute fraction of a point. Therefore, categorization makes its impact well before the existence of observed scores.
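The consequence of categorizing only the observed scores, rather than the true and error scores directly, can be made concrete with a small numerical sketch (in Python, not the thesis's SPSS syntax; rounding to the nearest integer stands in for the categorization step, and all distribution parameters here are arbitrary illustrations):

```python
import random

random.seed(1)
N = 10000

T = [random.gauss(2.0, 0.7) for _ in range(N)]   # continuous true scores
E = [random.gauss(0.0, 0.4) for _ in range(N)]   # continuous error scores
Y = [t + e for t, e in zip(T, E)]                # continuous observed scores

# Conventional approach: categorize only the summed observed scores.
X_conventional = [round(y) for y in Y]

# Categorizing true and error scores directly, then adding them,
# generally yields a different discrete observed score.
X_direct = [round(t) + round(e) for t, e in zip(T, E)]

# The two approaches disagree for a nontrivial share of cases.
disagree = sum(1 for a, b in zip(X_conventional, X_direct) if a != b)
print(disagree > 0)
```

Because round(t) + round(e) need not equal round(t + e), the two categorization orders are not interchangeable, which is exactly the point the paragraph above makes.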
This is why the simulation of realistic discrete observed scores requires the simulation of categorized true scores and error scores in order to follow the Classical True Score Model.

[Figure 1. A Comparison Between Methods for Simulating Observed Test Scores According to the Classical True Score Model. The conventional approach sums Tij and Eij into a continuous Yij and categorizes that into Xij; the novel approach categorizes Tij and Eij separately and sums them into Xij.]

The present study will attempt to remedy some of the shortcomings of previous simulations with a novel procedure that undertakes the following steps. First, continuous true scores are generated and added to a set of continuous error scores in order to produce continuous observed scores. Second, separate copies of the continuous true and error scores are made, and subsequently categorized. Third, the categorized true scores are added to the categorized error scores to produce a unique discrete version of the observed scores. This process results in a set of continuous observed scores and a set of discrete observed scores, each having been derived from separate true and error scores. See Figure 1 for a visual comparison between the conventional approach and the current novel approach. When continuous and discrete observed scores are derived in this way, it implies that these two sets of scores should perform equally in terms of their measurement properties. The only difference between them is the level of categorization imposed on the discrete scores. For that reason, we will repeat the simulation process over multiple levels of categorization, spanning a complete range of discrete scale points (from 3 to 101 points). This will allow us to use MI-CFA to pinpoint the level of categorization that crosses the measurement threshold for having nearly continuous properties. In summary, the aim of this study is twofold. First, we present and validate a novel procedure for generating raw data that simulates the Classical True Score Model.
Second, we use an MI approach to CFA to establish the number of points a discretely scaled factor model must have in order to perform equally to a continuously scaled model in a measurement context.

Method

Design Conditions

The simulated data for this study were generated under several research design conditions resulting in multiple sets of raw scores. The conditions formed a completely crossed 3 × 3 × 27 design, with three levels of item-specific variance (low, moderate, high), three levels of random error variance (none, moderate, high), and 27 levels of discrete scale categorization (discussed below). A population analogue approach was taken with respect to sample size. For each design condition a grand sample of 50,000 cases was generated, from which a random sample of 10,000 was drawn for use in all subsequent analyses. The size of this sample was deemed sufficient to produce the required stability among covariances, parameter estimates, and fit indices. There were a total of 243 unique sets of raw score data, and thus, 243 separate CFA model evaluations. Data were generated with syntax code written for SPSS 13.0. All data sets were subsequently imported into LISREL 8.54 for analysis using SIMPLIS syntax. For all CFA model evaluations, maximum likelihood (ML) estimation was used. Most of the categorical scaling literature concentrates on response scales with 7 points or fewer (Bandalos, 2006; Bollen & Barb, 1981; Dolan, 1994). However, scales that are commonly seen throughout the social science research literature can have a range in categorization anywhere from 3 to 101 scale points.1 Therefore, we wanted the discrete scales evaluated in this study to represent the entire range of categorization. Specifically, we included all non-binary scales with 21 points or fewer, as well as all scales with 31 or more points at intervals of 10, up to 101 points (i.e., 3 – 21, 31, 41, 51, 61, 71, 81, 91, 101).
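As a quick cross-check of this design, the 27 categorization levels, and the Bonferroni-style correction that one α = .05 test per level implies, can be reproduced in a few lines (shown here in Python rather than the SPSS syntax used for the actual data generation):

```python
# All non-binary scales with 21 or fewer points, plus 31 to 101 in steps of 10.
scale_points = list(range(3, 22)) + list(range(31, 102, 10))
print(len(scale_points))  # 27 levels of categorization

# One hypothesis test per categorization level, so the alpha = .05
# significance level is divided across the 27 tests.
alpha_adjusted = 0.05 / len(scale_points)
print(round(alpha_adjusted, 4))
```

The adjusted level works out to about .0019, which the thesis rounds to .002.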
All the discrete scales simulated in this study possess a low scale point of zero, which means the high point is always the number of scale points minus one. For example, the 101-point scale ranges from 0 to 100, and the 3-point scale ranges from 0 to 2. Because the hypothesis tests performed in this study were applied to each of the 27 levels of scale points, the statistical significance level (α) was corrected using a Bonferroni adjustment. The corrected significance level was α’ = 0.05 ÷ 27 ≈ .002. It should be noted that the entire set of results produced from all of the different scale points is not reported here. The overall results are summarized, and only the output from the most relevant or illustrative scales is presented.

Model Specification

As is true for most CFA studies, the raw data used for this study were produced with a particular latent factor model in mind. The following is a description of the type and specification of the intended factor model. As mentioned above, there are two versions of each true score and error score used to create the raw data. In fact, the discrete versions of these variables are derived directly from the continuous versions. Consequently, these variables come from the same sample; that is, the data consist of dependent samples. Therefore, the type of factor analytic model used to compare the continuous version to the discrete version was a dependent samples model. This model resembles a repeated measures or longitudinal model, whereby a single configural specification is identified for the continuous and the discrete data, but both versions are present in the same model. Figure 2 shows the conceptual diagram of the hypothesized configural model. This proposed model consists of one latent factor for each of the two versions

[Figure 2. Conceptual Diagram of the Dependent Samples Model Used to Test the Measurement Invariance Between Continuous and Discrete Scale Models. The factor correlation is fixed at ryx = 1.00 and both factor variances at 1.00.]
of the variables. Continuous variables are hereafter referred to as “Y” variables and discrete variables as “X” variables. The model specification for both versions is identical, and they are placed side-by-side within one large model to maximize comparability. We chose to use four items to indicate the factors; this creates sufficient over-identification of the model and provides the opportunity to assess variability in the pattern of factor loadings. It is important to note that each indicator in the model is an individual test item consisting of a unique set of observed scores. Each set of observed scores consists of a unique set of true scores and error scores. However, the true scores are all derived from the same source; that is, they have the same underlying common factor. We hypothesize that the common factor among the “Y” variables is precisely equal to that of the “X” variables; therefore, the equality of the two factors was restricted by setting the correlation between them to unity (ryx = 1.0). The metric for the latent variables was fixed by setting the variance of the latent variables to 1.0. Error variance is unique and uncorrelated among the four continuous “Y” items. Likewise, error is uncorrelated among the four discrete “X” items. However, the error for respective items across versions is derived from the same source and should be correlated (see the set of four correlational arcs in Figure 2). The equality of error variances across versions was assessed through the tests of MI (discussed below), rather than by setting the correlation to 1.0. Thus, the correlation between respective items across versions was allowed to be freely estimated. Assessment of this dependent samples model for MI followed the procedure called “longitudinal measurement invariance” outlined by Brown (2006, chap. 7).
Accordingly, the continuous and discrete scale versions of the model were considered measurement invariant if the model held up against a series of four increasingly strict evaluations of equality. The first step was to test for configural invariance by constraining the number of indicators and number of latent variables to be equal across the continuous and discrete models (see Figure 2). All of the model parameters were allowed to be freely estimated. To decide whether configural invariance held for our model, several formal measures of global goodness-of-fit were compared to established criteria. We followed common practice by reporting the overall goodness-of-fit Chi-Square (χ2) value. The degrees of freedom (df) for the configural model was 16; therefore, the χ2 critical value at α = .002 was χ2crit = 37.2. The criterion used to evaluate the Root Mean Square Error of Approximation (RMSEA) was RMSEA ≤ .06. The criterion for the Comparative Fit Index (CFI) was CFI ≥ .95. Models that fail to meet the criteria established for each level of MI are excluded from all subsequent tests in the series of MI evaluations. If the given model showed appropriate goodness-of-fit under the restrictions of configural invariance, additional constraints were applied in order to test for loading invariance. Loading invariance was tested by placing equality constraints on the configural specification, as well as on the factor loadings (path coefficients) between the latent variables and their respective indicator variables. This test was important for determining whether the discrete latent variable had the same unit of measurement as the continuous latent variable. Loading invariance was determined by the degree of change in model fit observed when proceeding from configural invariance. Following the recommendation of Brown (2006, chap. 7), each successive test of invariance involved an assessment of the change in χ2 from the previous invariance level (i.e., Δχ2).
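The configural fit criteria just listed can be collected into a small decision helper. This is an illustrative Python sketch, not part of the thesis's LISREL workflow; the function name is my own, and the critical χ2 value of 37.2 (df = 16, α = .002) is taken directly from the text.

```python
# Critical chi-square for the configural model (df = 16, alpha = .002),
# as reported in the text.
CHI2_CRIT = 37.2

def configural_fit_ok(chi2, rmsea, cfi):
    """Return True when a model meets all three configural fit criteria:
    chi-square below the critical value, RMSEA <= .06, and CFI >= .95."""
    return chi2 < CHI2_CRIT and rmsea <= 0.06 and cfi >= 0.95

print(configural_fit_ok(25.0, 0.04, 0.97))  # meets all criteria
print(configural_fit_ok(40.0, 0.04, 0.97))  # chi-square exceeds the critical value
```

Models that fail this gate are dropped before any of the stricter invariance tests are attempted, mirroring the exclusion rule described above.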
The change in degrees of freedom (Δdf) for the loading invariance model was 4; therefore, the Δχ2 critical value at α = .002 was Δχ2crit = 16.9. Stemming from some stark criticism of the use of the χ2 difference test as the only means of deciding whether MI holds over successive invariance tests of a given model, Cheung and Rensvold (2002) and Wu, Li, and Zumbo (2007) have evaluated the appropriateness of ΔCFI as an alternative to Δχ2. These researchers found that for multi-group CFA applications (i.e., independent samples), using a criterion of ΔCFI ≤ -0.01 (i.e., a drop in CFI greater than .01 signaling noninvariance) was far more stable and realistic in its allocation of MI decisions than was a statistically non-significant Δχ2. However, we have found no studies that have attempted to apply a ΔCFI ≤ -0.01 decision rule to the type of CFA model we are evaluating here (i.e., dependent samples). Cautiously, we compared the results of both the Δχ2 and ΔCFI as a means of determining loading invariance. In addition to the equality constraints of the configural specification and the factor loadings, the third level of measurement invariance tested the equality of factor means. The evaluation of means invariance was a test of whether continuous and discrete models are equally centered on an underlying latent distribution. The mean of the latent variable should be the same for both models if they are to be considered measurement invariant. Means invariance is established by the change in model fit when proceeding from loading invariance, where the Δχ2 must remain below Δχ2crit = 16.9 and the ΔCFI must not fall below -0.01 for a given CFA model. The fourth and final test in the series was a test of residual error variance invariance (hereafter referred to as error invariance). This test placed equality constraints on the configural specification, the factor loadings, the factor means, and the residual error variance parameters for each of the respective indicator variables.
Error invariance was a direct test of whether the discretely scaled version of the model had the same amount of measurement error as the continuously scaled version. This test is critical in determining the existence of systematic sources of error variance that may influence the measurement properties of coarsely categorized response scales. Error invariance of the model was determined by the degree of change in model fit observed when proceeding from means invariance. Once again, a Δχ2 below Δχ2crit = 16.9 and a ΔCFI no lower than -0.01 provided the criteria for deciding whether error invariance holds in a given model. If the model held up against all four levels of successively restrictive tests, the model was said to be measurement invariant. That is, the tests have provided strong evidence that continuous and discrete scores – at a given level of categorization – possess the same fundamental measurement quality and that they are directly comparable in terms of their measurement properties.

Data Generation

The simulation procedure used to produce the raw data for this study was divided into the following steps:

1. Generate the common factor.
2. Generate item-specific variance.
3. Compute the item true scores.
4. Re-express the true scores to fit the distribution of the desired scale.
5. Generate the random error scores.
6. Compute the item observed scores.

A further explanation of these six basic parts is as follows.

1. Generate the common factor. Latent common factors are, by definition, unobservable and have an unknown mean and variance. However, the distribution of common factors is often considered to be randomly normally distributed (Crocker & Algina, 1986, chap. 6; McDonald, 1999, chap. 5). The mean and variance of the common factor generated in this simulation were set arbitrarily to emulate the standard normal curve, where MC = 0.0 and SDC = 1.0. For the purposes of demonstration, the common factor variable is referred to as the ‘C’ variable.

2.
Generate item-specific variance. No two test items on a single instrument should contain precisely the same content (Gregory, 2004, chap. 3). Differences in the wording of items add necessary and systematic item-specific variance to the items’ scores (Brown, 2006, chap. 2). This process was simulated by generating sets of four randomly normally distributed variables, where each set had one of three different levels of variance. These variables represented the item-specific variance of the four test items in our proposed CFA model. The item-specific variables are referred to as ‘IS’ variables; they have a mean of zero (i.e., MIS = 0.0) and a low, moderate, or high standard deviation (i.e., SDIS = 0.5, 1.0, or 2.0, respectively).

3. Compute the item true scores. The original item true scores were derived from the addition of the item-specific variables to the common factor variable. That is, each of the four ‘IS’ variables was separately added to the ‘C’ variable, thus creating four original item true score variables. True score variables are hereafter referred to as ‘T’ variables.

4. Re-express the true scores to fit the distribution of the desired scale. The previously mentioned steps (i.e., steps 1, 2, and 3) ensure that the true scores possess a mean of zero (i.e., MT = MC + MIS = 0.0). However, if, for example, the true score for a variable was measured on a 5-point discrete scale, where the lowest point was 0 and the highest point was 4, then we would not expect the mean of that variable to be 0. In order to simulate the scores that one might get from live data collection, we had to re-express the true scores to fit a distribution with a realistic mean and standard deviation. Accordingly, it was necessary to adapt the original true score variables to fit the distribution we would expect at each level of categorization.
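Steps 1 through 3 above can be sketched as follows. This is an illustrative Python rendering, not the SPSS syntax actually used; the "moderate" item-specific condition (SDIS = 1.0) is shown, and the seed and variable names are my own.

```python
import random

random.seed(7)
N = 10000

# Step 1: the common factor C, emulating the standard normal curve
# (M_C = 0.0, SD_C = 1.0).
C = [random.gauss(0.0, 1.0) for _ in range(N)]

# Step 2: item-specific variance for four items; SD_IS is 0.5, 1.0, or 2.0
# for the low, moderate, and high conditions (moderate shown here).
SD_IS = 1.0
IS = [[random.gauss(0.0, SD_IS) for _ in range(N)] for _ in range(4)]

# Step 3: each item's true scores, T_j = C + IS_j.
T = [[c + s for c, s in zip(C, item)] for item in IS]

print(len(T), len(T[0]))  # 4 items, 10000 cases each
```

Because every item's true score contains the same C, the four T variables share one common factor while differing in their item-specific components, exactly the structure the intended CFA model assumes.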
This part of the procedure involved standardizing the item true scores and re-expressing them to fit the expected distribution of the desired scale. Also involved in this step was the all-important process of categorizing the continuous scales into equivalently distributed discrete scales. This re-expression and categorization procedure includes the following sub-steps:
(a) Compute the mean of the desired scale,
(b) Compute the standard deviation of the desired scale,
(c) Standardize the original item true scores using a Z-score transformation,
(d) Re-express the true scores along the distribution of the desired scale,
(e) Categorize the continuous scale items to create new discrete scale items,
(f) Identify and remove special cases containing impossible scores.
The reader is reminded that the above sub-steps, except for (e) and (f), deal exclusively with continuously measured scales. However, the entire process is designed to ensure that the resulting discretely measured scales possess the expected distribution. This implies that the mean and standard deviation of the desired scale is actually the expected mean and standard deviation of the discrete scale. Practically speaking, the resulting "Y" variables will be continuously scaled but will have means and standard deviations equal to those of the discrete "X" variables.
(a). The following equation was used to calculate the mean of the desired scale:

MY = (τ − 1) / 2    (1)

where τ is the number of scale points. All of the discrete scales simulated in this study possess a low point of zero, and a high point of τ − 1. Dividing the high point of the scale by 2 ensures the distribution will be symmetrical.
(b). To calculate the standard deviation of the desired scale we used:

σY = τ / (rangeT / σT)    (2)

where τ is the number of scale points, rangeT is the range (maximum minus minimum) of the original item true score, and σT is the standard deviation of the original true score.
This equation determines the number of standard deviation units the new variable should have in order to account for the entire range of true scores. That is, the standard deviation of the re-expressed scale will change, but the number of standard deviation units will remain the same.
(c) - (d). The procedure for standardizing the original item true scores and then re-expressing them along the distribution of the desired scale was accomplished in a single Z-score transformation equation:

TY = ((T − MT) / σT) * σY + MY    (3)

where TY is the new re-expressed item true score, T is the original true score, MT is the mean of the original true score, σT is the standard deviation of the original true score, σY is the desired scale standard deviation from equation (2), and MY is the desired scale mean from equation (1). The result of the procedure thus far is to output the four continuous "Y" true score variables with distributions relevant to a given number of scale points.
(e). The categorization of these "Y" variables into discrete "X" variables is accomplished by simply rounding the continuous scores to the nearest integer. Mathematically this would be expressed as: TX = round(TY). With only a few exceptions (see the following step), the Z-score standardization process ensures that the number of possible integers the rounding process can create is equal to the number of desired scale points. Figure 3 shows the effect that the categorization process has on the distributions of true score variables.
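Sub-steps (a) through (e) can be sketched as follows (an illustrative translation of equations (1)–(3); the function names are ours, and T is a single item's vector of original true scores):

```python
import numpy as np

def reexpress(T, tau):
    """Re-express true scores T onto a tau-point metric (sub-steps a-d)."""
    M_Y = (tau - 1) / 2                  # equation (1): desired scale mean
    range_T = T.max() - T.min()
    sigma_T = T.std()
    sigma_Y = tau / (range_T / sigma_T)  # equation (2): desired scale SD
    # equation (3): z-standardize, then rescale to the desired distribution
    return ((T - T.mean()) / sigma_T) * sigma_Y + M_Y

def categorize(T_Y):
    """Sub-step (e): round continuous scores to the nearest integer
    (NumPy rounds exact halves to the nearest even integer)."""
    return np.round(T_Y)

rng = np.random.default_rng(0)
T = rng.normal(0.0, 1.0, 10_000) + rng.normal(0.0, 1.0, 10_000)  # C + IS
T_Y = reexpress(T, tau=5)   # continuous "Y" true scores on a 5-point metric
T_X = categorize(T_Y)       # discrete "X" true scores
# sub-step (f) would then remove any cases with scores outside 0..4
```

Note that the handful of rounded values falling outside 0 to τ − 1 correspond exactly to the "impossible scores" that sub-step (f) removes.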
Figure 3. Three Examples of Frequency Histograms for Continuously Scaled Items and Their Corresponding Categorized Scale Versions. [The figure pairs frequency histograms of the continuous "Y" and discrete "X" versions of an item on the 101-point, 11-point, and 3-point scales.]

(f). There were some exceptional cases that arose from the process described above. The original true scores were created by adding some variance (item-specific variance) to the distribution of the common factor. This process occasionally produced extreme scores with values either above or below the possible maximum or minimum of the given response scale. For example, we found scores from the 101-point scale with values of -8 and 121. Such cases were identified and flagged for removal from the data set (discussed below).
5. Generate the random error scores. Girard and Cliff's (1976) research on human errors in judgment has provided the model of random measurement error used in this study. Girard and Cliff claim that the standard deviation of the random error associated with a 9-point response scale can theoretically range from 0.0 to 1.0. Because the current study deals with a variety of response scales other than the 9-point, an equation was developed to determine the ratio of the error standard deviation relevant to any given number of scale points:

σE = (τ * σ9-point) / 9    (4)

where σE is the standard deviation of the random error for the desired scale, τ is the desired number of scale points, and σ9-point is the error standard deviation of the 9-point scale.
To generate the random error terms for each of the four items in the current CFA model, four new normally distributed variables were created. Each variable had a mean of zero, and a standard deviation set to one of three levels such that σ9-point = 0.0, 0.5, or 1.0, representing a none, moderate, or high level of error, respectively. This resulted in the continuous "Y" versions of the four random error variables, which were subsequently rounded to the nearest integer in order to create the four discrete "X" random error variables (i.e., EX = round(EY)).
6. Compute the item observed scores. Item observed scores are derived simply from the addition of the random error variables to the re-expressed item true scores. That is, the four continuous error terms were added to the four continuous true scores to create raw scores for the simulated observed "Y" variables (i.e., Y = TY + EY). Similarly, the four discrete error terms were added to the four discrete true scores to create raw scores for the simulated observed "X" variables (i.e., X = TX + EX). However, the combination of error with true scores invited another opportunity for the resulting scores to be pushed beyond the tails of the distribution. It is inevitable that introducing random error would increase the chance of producing extreme scores in observed variables. Any extreme cases, produced either from the generation of observed scores or of true scores (discussed above), were identified and subsequently removed from the data set. The frequency of all such removed cases was minute, and varied according to the design conditions of the data set (i.e., the percent of cases removed from each data set ranged from 0.004% to 1.0%). This removal was applied to the grand sample of 50,000, and in no way affected the size of the random sample of 10,000.
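Steps 5 and 6 can be sketched as follows (again an illustration; the observed-score lines at the end assume the T_Y and T_X arrays produced by the earlier steps):

```python
import numpy as np

def error_sd(tau, sigma_9point):
    """Equation (4): error SD for a tau-point scale, scaled from the 9-point value."""
    return (tau * sigma_9point) / 9

rng = np.random.default_rng(1)
N, tau = 10_000, 11
sigma_E = error_sd(tau, sigma_9point=0.5)    # "moderate" random error condition

# Step 5: continuous "Y" error terms for the four items, then discrete "X" terms
E_Y = rng.normal(0.0, sigma_E, size=(4, N))
E_X = np.round(E_Y)

# Step 6: observed = true + error (using T_Y and T_X from the earlier steps)
# Y = T_Y + E_Y
# X = T_X + E_X
print(round(sigma_E, 3))  # 0.611
```

By construction the function returns σ9-point unchanged when τ = 9, and scales the error SD proportionally for coarser or finer scales.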
Data Validation

Prior to the CFA evaluation of our dependent samples model, we wanted to validate the simulated raw data for its ability to uphold the assumptions of the Classical True Score Model. The validation of the data involved testing for the assumptions of multivariate normality and unidimensionality. We performed exploratory factor analysis (EFA) to determine whether the four proposed test items, from each level of categorization, could uncover the underlying factor from which they were derived. We can claim that our data conform to the Classical True Score Model if the following four criteria are met:
1. The variables meet the assumption of normality according to Mardia's (1970) test of multivariate normality at p > .002.
2. The χ2 goodness-of-fit test of the four-item one-factor EFA model should be non-significant (p > .002), indicating the model is a good fit for the data.
3. For each set of four test items, the EFA must show only one dominant factor with an initial eigenvalue greater than 1.0.
4. The factor loadings for each item must approximate the true correlation between that item and the underlying common factor.
As opposed to the dependent samples CFA model, the simplified EFA model tested here includes only one common factor and four indicator variables, expressed on either continuous or discrete scales. A total of 252 EFAs were performed: 243 involved models with items expressed on one of the 27 different discrete scales (i.e., 3 to 101 points), and 9 involved models in which continuously scaled items were entered. For all EFA model evaluations, ML estimation was employed. Normality was assessed through visual inspection of frequency histograms, and through an extension of Mardia's (1970) test of multivariate normality called Relative Multivariate Kurtosis (RMK). RMK is reported by the PRELIS module in the LISREL 8.54 software package.
However, PRELIS does not provide critical values for interpreting RMK, so they must be calculated by hand using the following formula (see SAS Institute, 2004, chap. 19):

RMKcrit = [ p(p + 2) ± Zcrit √( 8p(p + 2) / N ) ] / p(p + 2)    (5)

where Zcrit is the desired critical value from a two-tailed Z-score distribution, p is the number of variables in the multivariate analysis, and N is the sample size. For our analysis, the critical interval when Z(α=.002, 2-tail) = ±3.09, p = 4, and N = 10,000 is: 0.982 ≤ RMK ≤ 1.018. If the observed value of RMK reported by PRELIS is outside this interval, the variables in the analysis are deemed to be non-normal.

Results

Data Validation

The data generation procedure was evaluated for its ability to create theoretically justifiable data, capable of producing factor analytic models consistent with the expectations set forth by the Classical True Score Model. For this, a comprehensive examination of both the normality and the dimensionality of the simulated data was conducted.
Normality. Frequency histograms of the variables were visually assessed for their ability to approximate the normal curve. Figure 3 provides examples of these histograms. All of the simulated variables appeared to be highly symmetrically distributed and closely followed the normal curve. However, formal tests of multivariate kurtosis revealed some deviations from normality. Table 1 shows that both the number of scale points and high levels of random error have a strong influence on multivariate normality as measured by RMK. In general, scales with a low number of points (3 to 6 points) are more likely to suffer from non-normality. Scales with 9 points or more, including the continuous scale, rarely fail to achieve normality. Notice that under the "high" random error conditions, even the continuously scaled variables fail to meet the critical value for RMK.
Similarly, discrete scales with a large number of scale points, such as from 21 points up to 101 points, unexpectedly fail in this regard as well.

Table 1. Relative Multivariate Kurtosis (RMK) of the Continuous Model and Selected Discrete Models.

                        Number of scale points
Condition        Cont.   101    21     11     9      8      7      6      5      4      3
Low ISV
 REV None        0.999  0.999  0.995  0.994  0.997  0.985  0.991  1.022  1.121  1.125  2.351
 REV Moderate    0.998  0.998  0.995  0.995  1.003  1.018  1.048  1.084  1.173  1.164  2.346
 REV High        0.981  0.982  0.982  0.979  0.980  0.977  0.981  0.970  0.983  0.991  1.327
Moderate ISV
 REV None        0.990  0.990  0.989  0.988  0.984  0.988  0.988  0.980  1.005  0.905  1.696
 REV Moderate    0.996  0.996  0.996  0.985  0.996  1.004  1.013  1.020  1.060  0.941  1.691
 REV High        0.975  0.975  0.976  0.975  0.972  0.977  0.973  0.968  0.969  0.962  1.207
High ISV
 REV None        1.002  1.002  1.001  1.004  1.004  0.997  0.996  0.997  1.003  0.903  1.485
 REV Moderate    0.990  0.991  0.991  0.988  0.993  0.993  0.995  1.002  1.024  0.931  1.462
 REV High        0.976  0.976  0.974  0.972  0.967  0.971  0.963  0.968  0.962  0.948  1.119

Note. ISV = Item-Specific Variance, REV = Random Error Variance. The critical interval was: 0.982 ≤ RMK ≤ 1.018. Values outside this interval indicate a failure of multivariate normality.

There are two possible explanations that can account for this phenomenon. First, high levels of random error variance naturally increase the proportion of scores in the tails of a variable's distribution. However, the tails of the variables in this study are limited by the maximum and minimum of their discrete scale. That is, no matter how much random error is introduced into the variable, the scores cannot exceed the maximum or minimum point of that scale. When large amounts of random error are added to the distribution, the extreme tails of the variables are effectively cut off, which may cause negative kurtosis, and hence, a failure in multivariate normality.
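This truncation effect can be illustrated with a quick numerical sketch (the parameter values are hypothetical; kurt computes excess kurtosis, which is approximately zero for a normal distribution):

```python
import numpy as np

def kurt(x):
    """Excess kurtosis: ~0 for a normal distribution, negative when platykurtic."""
    z = (x - x.mean()) / x.std()
    return (z ** 4).mean() - 3

rng = np.random.default_rng(7)
# scores destined for a 3-point scale (0..2) with a large amount of random error
scores = rng.normal(loc=1.0, scale=1.5, size=100_000)

# cases beyond the scale's minimum or maximum are removed, cutting off the tails
kept = scores[(scores >= 0) & (scores <= 2)]

print(round(kurt(scores), 2))  # close to 0: the full distribution is normal
print(round(kurt(kept), 2))    # clearly negative: the tails have been cut off
```

The removed tails leave a flattened, nearly uniform distribution, which is exactly the negative-kurtosis pattern described above.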
Second, Mardia's test of multivariate kurtosis has been shown to be sensitive to sample size (Mardia, 1974; see equation (5)). As can be seen in Table 1, many of the scale points that failed to meet the critical value for RMK failed only by a slight margin. It may simply be that the current sample of 10,000 cases caused Mardia's test to overestimate the degree of non-normality in otherwise normally distributed variables. While many of the normality failures were marginal, it was clear that the discretely scaled variables with 3 or 4 points consistently expressed some of the most severe violations of normality. This may be expected, as some researchers in the field have argued that scales with 4 points or fewer are, by nature, incapable of achieving normality (Jöreskog, 1994; Lubke & Muthén, 2004). In this context, the normality violations observed among scales with few points may not be caused by the data generation procedure per se, but may simply arise out of having an inherently limited number of values in the distribution. Despite these apparent problems, it is important to note that many of the other variables created by the data generation procedure did achieve multivariate normal distributions according to RMK. Nonetheless, if there are adverse consequences to the observed non-normality, they will bear themselves out in the results of the MI - CFA evaluations discussed below.
Dimensionality. The results from the 252 EFA evaluations were consolidated onto nine different summary tables (see Table 2 and Appendices A – H). There is one summary table for each combination of item-specific variance (ISV) and random error variance (REV). Any unique combination of these two types of variance is hereafter referred to as uncommon variance. An example of one of the nine EFA summary tables is shown in Table 2. This table highlights the regularity with which the data generation procedure can produce sound factor analytic models. Each of the nine EFA tables was similar in this respect.

Table 2. Selected EFA Results for the "Moderate" ISV – "Moderate" REV Condition.a

                     Cont.   101    21     11     9      8      7      6      5      4      3
χ2                   0.005  0.007  0.064  0.208  0.462  1.097  1.685  0.117  0.496  0.608  0.120
df                   2      2      2      2      2      2      2      2      2      2      2
p                    0.99   0.99   0.97   0.90   0.79   0.58   0.43   0.94   0.78   0.74   0.94
Initial eigenvalue
 Factor 1            2.244  2.243  2.224  2.141  2.111  2.089  2.080  2.058  2.058  2.030  1.884
 Factor 2            0.59   0.59   0.60   0.63   0.64   0.65   0.65   0.66   0.66   0.67   0.72
 Factor 3            0.59   0.59   0.59   0.61   0.63   0.63   0.64   0.64   0.65   0.66   0.70
 Factor 4            0.58   0.58   0.58   0.61   0.62   0.63   0.63   0.64   0.64   0.65   0.69
Factor loadings
 Item 1              0.644  0.643  0.641  0.619  0.612  0.603  0.606  0.599  0.602  0.596  0.550
 Item 2              0.644  0.644  0.638  0.621  0.610  0.610  0.595  0.600  0.581  0.585  0.530
 Item 3              0.652  0.653  0.646  0.623  0.613  0.610  0.609  0.594  0.603  0.589  0.562
 Item 4              0.636  0.635  0.629  0.604  0.600  0.588  0.591  0.582  0.589  0.574  0.529
Correlation between the item and the common factor
 Item 1              0.648  0.647  0.645  0.623  0.615  0.606  0.607  0.599  0.599  0.586  0.548
 Item 2              0.642  0.641  0.636  0.618  0.607  0.606  0.596  0.596  0.581  0.585  0.517
 Item 3              0.653  0.652  0.646  0.626  0.624  0.620  0.605  0.604  0.605  0.597  0.553
 Item 4              0.633  0.632  0.625  0.604  0.595  0.583  0.590  0.578  0.581  0.571  0.491

a EFA = Exploratory Factor Analysis, ISV = Item-Specific Variance, REV = Random Error Variance.

According to the χ2 goodness-of-fit test, the four-item one-factor model was a good fit for the data across all scale points. In fact, the χ2 test was non-significant (p > .002) for all scale points regardless of uncommon variance (observed significance ranged from p = .99 to .008). Moreover, there was no discernible pattern among χ2 values; that is to say, goodness-of-fit does not necessarily diminish as the number of scale points decreases.
Initial eigenvalues are a representation of the total variance explained by the common factor(s) among items (Brown, 2006). The initial eigenvalue of the first common factor in our four-item one-factor model was greater than 1.0 and was clearly dominant over the other factors (for an example see Table 2). The continuously scaled model showed the highest first-factor eigenvalue, and there was a consistent decline in eigenvalues as the number of scale points decreased. Therefore, the proposed four-item one-factor model accounted for the greatest amount of variance, but discretely scaled items did not explain as much variance as continuously scaled ones. This result was observed across all 9 combinations of uncommon variance (see Appendices A – H). Table 3 shows the further consolidated results of all EFA evaluations as they pertain to eigenvalues, factor loadings, and item-to-factor correlations.
The factor loadings for all EFA models mirrored the behavior of the eigenvalues, such that continuously scaled items had the highest factor loadings, and as the scale points decreased so did the factor loadings (for an example see Table 2). In EFA, the factor loadings for a one-factor model can be interpreted as the estimated correlation between an item and its underlying latent factor (Brown, 2006). Because our data generation procedure creates the common factor variable (i.e., the 'C' variable mentioned in the Method section), we are afforded the rare opportunity to check the estimated item-to-factor correlation (i.e., the factor loading) against the true item-to-factor correlation.

Table 3. Data Validation with Exploratory Factor Analysis: Comparing the Continuous Scale Model to the 3-point Model.

                    Initial eigenvalue                 Average factor loading    Average item-to-factor correlation
                    Continuous        3-point
Condition           F1      F2        F1      F2       Continuous   3-point      Continuous   3-point
Low ISV
 REV None           3.391   0.21      2.597   0.48     0.893        0.730        0.893        0.654
 REV Moderate       2.937   0.37      2.598   0.49     0.804        0.730        0.802        0.649
 REV High           2.224   0.61      1.773   0.76     0.639        0.508        0.640        0.452
Moderate ISV
 REV None           2.509   0.50      1.892   0.72     0.709        0.545        0.708        0.530
 REV Moderate       2.244   0.59      1.884   0.72     0.644        0.543        0.644        0.527
 REV High           1.775   0.76      1.434   0.89     0.508        0.380        0.507        0.364
High ISV
 REV None           1.601   0.81      1.358   0.89     0.447        0.346        0.445        0.337
 REV Moderate       1.490   0.85      1.331   0.90     0.404        0.332        0.404        0.326
 REV High           1.341   0.90      1.202   0.94     0.337        0.259        0.337        0.255

Tables 2 and 3 show that for almost every level of scale points, the factor loading successfully approximated the true item-to-factor correlation to within one-hundredth or better. The factor loading estimates among the 3- and 4-point scales were the least accurate. They were still appropriate, however, having approximated the true item-to-factor correlation to within one-tenth or better. Overall, these results provided strong evidence to suggest that the data generated from the newly developed simulation were in concordance with the theoretical tenets of the Classical True Score Model. The MI evaluations discussed in the next section will reveal whether the unexpected failures in multivariate normality have a direct influence on the measurement invariant properties of discrete scales. Given the evidence that has been presented thus far, subsequent analyses were conducted under the assumption that the data were theoretically and statistically valid.
Measurement Invariance

Similar to the EFA evaluations, the results from the MI - CFA evaluations were consolidated onto nine summary tables (see Table 4 and Appendices I – P), one for each combination of uncommon variance. Table 4 is a representative example of one of the nine tables. Not all of the scale points are represented in the tables because there is sufficient consistency in the values to infer the pattern of results from the existing scales. Table 4 provides a context for how we arrived at our decisions about MI between the continuously scaled CFA model and the various levels of discretely scaled models. In general, when there was agreement among the fit indices that a particular scale point model exceeded the critical value for fit, that model was dropped from subsequent tests of MI.

Table 4. Selected MI Results for the "Moderate" ISV – "Moderate" REV Condition.a

(a) Configural invariance
Scale    χ2        RMSEA   CFI
101      15.364    0.000   1.000
21       5.963     0.000   1.000
16       7.181     0.000   1.000
15       14.015    0.000   1.000
14       12.013    0.000   1.000
13       9.458     0.000   1.000
12       8.325     0.000   1.000
11       8.030     0.000   1.000
10       9.319     0.000   1.000
9        19.742    0.005   1.000
8        15.557    0.000   1.000
7        17.601    0.003   1.000
6        7.879     0.000   1.000
5        9.869     0.000   1.000
4        9.779     0.000   1.000
3        58.938    0.017   0.999

(b) Loading invariance
Scale    χ2        Δχ2       CFI     ΔCFI
101      16.886    1.522     1.000   0.000
21       11.295    5.332     1.000   0.000
16       8.260     1.079     1.000   0.000
15       16.059    2.044     1.000   0.000
14       13.959    1.946     1.000   0.000
13       11.323    1.865     1.000   0.000
12       10.511    2.186     1.000   0.000
11       11.625    3.595     1.000   0.000
10       15.182    5.863     1.000   0.000
9        20.783    1.041     1.000   0.000
8        17.774    2.217     1.000   0.000
7        25.397    7.796     1.000   0.000
6        12.244    4.365     1.000   0.000
5        14.725    4.856     1.000   0.000
4        20.929    11.150    1.000   0.000
3        442.436   383.498   0.988   -0.011

(c) Means invariance
Scale    χ2        Δχ2      CFI     ΔCFI
101      19.109    2.223    1.000   0.000
21       19.319    8.024    1.000   0.000
16       9.764     1.504    1.000   0.000
15       24.036    7.977    1.000   0.000
14       14.954    0.995    1.000   0.000
13       15.506    4.183    1.000   0.000
12       11.837    1.326    1.000   0.000
11       19.968    8.343    1.000   0.000
10       15.649    0.467    1.000   0.000
9        24.406    3.623    1.000   0.000
8        20.630    2.856    1.000   0.000
7        25.922    0.525    1.000   0.000
6        15.606    3.362    1.000   0.000
5        22.782    8.057    1.000   0.000
4        22.048    1.119    1.000   0.000
3        not tested

(d) Error invariance
Scale    χ2         Δχ2        CFI     ΔCFI
101      37.738     18.629     1.000   0.000
21       266.504    247.185    0.997   -0.003
16       607.463    597.699    0.993   -0.007
15       543.962    519.926    0.993   -0.007
14       700.307    685.353    0.991   -0.009
13       952.276    936.770    0.988   -0.012
12       1043.072   1031.235   0.987   -0.013
11       1228.667   1208.699   0.984   -0.016
10       1344.038   1328.389   0.982   -0.018
9        1501.714   1477.308   0.980   -0.020
8        1687.008   1666.378   0.977   -0.023
7        1596.128   1570.206   0.977   -0.023
6        1581.638   1566.032   0.977   -0.023
5        1216.860   1194.078   0.981   -0.019
4        1721.116   1699.068   0.972   -0.028
3        not tested

a MI = Measurement Invariance, ISV = Item-Specific Variance, REV = Random Error Variance.
Note. Critical values were: RMSEA = 0.06, χ2(df=16) = 37.2, CFI = 0.95, Δχ2(df=4) = 16.9, and ΔCFI = -0.01.
Note. "not tested" indicates a model dropped after failing invariance at an earlier level.

For example, in Table 4(a) – the test of configural invariance – the fit indices for the 3-point scale indicate the χ2 test was significant, whereas RMSEA and CFI were within accepted bounds. Because of this disagreement among indices, the 3-point model was considered measurement invariant to a continuous scale model at the configural level. In Table 4(b) – the test of loading invariance – the change in fit for the 3-point scale proceeding from configural invariance exceeded the critical value for both Δχ2 and ΔCFI. Consequently, we concluded that the 3-point model failed to be measurement invariant at the loading invariance level, and dropped the model from subsequent evaluations (see Tables 4(c)-(d)). Analyses proceeded in this way for all nine different combinations of uncommon variance conditions (see Appendices I – P). The culmination of all of our MI decisions across all design conditions is summarized in Table 5.
Configural Invariance.
In general, discretely scaled models seem to have very little trouble meeting the criteria for configural invariance. There were only three conditions where models failed in this regard. When ISV is "low" and REV is either "moderate" or "high", RMSEA and χ2 are in agreement to fail the 3-point model. Additionally, when ISV is "low" and REV is "none", RMSEA and χ2 agree to fail both the 3- and 4-point models. Conversely, all discretely scaled models, including the 3-point model, met the criteria for goodness-of-fit according to the CFI index at the level of configural invariance.
Loading Invariance. At the level of loading invariance, where ISV is "moderate", the Δχ2 test and ΔCFI were in agreement to fail the 3-point model. All other discretely scaled models – that were not previously dropped from the analysis – were deemed to be loading invariant to a continuously scaled model.

Table 5. The Scale Point Level at which Successful Invariance was Achieved According to RMSEA, χ2, and CFI.

                    RMSEA     χ2 (Δχ2)                              CFI (ΔCFI)
Condition           Config.   Config.  Loading  Means   Error       Config.  Loading  Means   Error
Low ISV
 REV None           5         6        6        6       > 101a      3        5        5       11
 REV Moderate       4         6        6        6       > 101a      3        4        4       13
 REV High           4         6        6        6       71          3        4        4       12
Moderate ISV
 REV None           3         4        5        5       71          3        4        4       11
 REV Moderate       3         4        4        4       > 101a      3        4        4       14
 REV High           3         4        4        4       71          3        4        4       12
High ISV
 REV None           3         3        4        4       61          3        3        3       11
 REV Moderate       3         3        4        4       91          3        3        3       15
 REV High           3         3        4        4       > 101a      3        3        3       12

a Indicates that all of the scale points, including the 101-point scale, have failed to achieve invariance.
Note. The Loading, Means, and Error columns reflect the Δχ2 and ΔCFI criteria, respectively. All scale points that lie at or above the values shown can be considered invariant to a continuous scale at the given invariance level; all scale points that lie below the values shown have failed to achieve invariance with a continuous scale at that level.

Means Invariance.
As mentioned briefly above, the data generation procedure sets the mean of all continuously scaled items equal to the mean of the corresponding discretely scaled items. It was expected that this would greatly minimize the chances that the discretely scaled models would fail with respect to means invariance to the continuously scaled models. Accordingly, Table 5 shows no additional invariance failures among the discretely scaled models – that were not previously dropped from the analysis – at the level of means invariance.
Error Invariance. At the level of error invariance, the results revealed the dramatic effect that the number of scale points has on measurement invariance. According to the Δχ2 test, there are several conditions in which even the 101-point model fails to achieve error invariance with a continuous model. In fact, there was no condition in which the criterion for the Δχ2 test was met among models with fewer than 61 scale points. The ΔCFI index was far more realistic in its allocation of error invariance decisions. According to ΔCFI, the minimum number of scale points for which error invariance holds for discretely scaled models ranged from 11 to 15 scale points. Neither of the two indices revealed an association between the uncommon variance conditions and the allocation of error invariance decisions. If we employ the established policy for accepting MI decisions when two of the fit indices are in agreement, then we may conclude that discretely scaled models with a range of scale points from 11 to 15 or higher are measurement invariant to continuously scaled models.

Discussion

The aim of this study was twofold. First, we presented and validated a novel procedure for generating raw data that simulates the Classical True Score Model. Second, we used a MI approach to CFA to establish the number of points a discretely scaled factor model must have in order to perform equivalently to a continuously scaled model in a measurement context.
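The two-index decision rule applied throughout these MI evaluations can be sketched as follows (a simplified illustration; the function name and the encoding of the "indices must agree" policy are ours):

```python
def invariance_step(delta_chi2, cfi_prev, cfi_curr,
                    delta_chi2_crit=16.9, delta_cfi_crit=-0.01):
    """Apply the two nested-model criteria used in the study:
    delta-chi2 (df = 4) against 16.9, and the CFI change against -0.01.
    A model was dropped only when both indices agreed that it failed."""
    delta_cfi = cfi_curr - cfi_prev
    fails_chi2 = delta_chi2 > delta_chi2_crit
    fails_cfi = delta_cfi < delta_cfi_crit
    return not (fails_chi2 and fails_cfi)

# 3-point model at the loading step (Table 4b): both criteria fail -> dropped
print(invariance_step(383.498, cfi_prev=0.999, cfi_curr=0.988))  # False

# 4-point model at the same step: both criteria pass -> retained
print(invariance_step(11.150, cfi_prev=1.000, cfi_curr=1.000))   # True
```

Under this rule a model such as the 101-point scale at the error level, which fails Δχ2 but passes ΔCFI, is retained, which is exactly the disagreement pattern discussed above.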
Successful realization of the second goal was, of course, dependent on realizing the first. In order to draw sound conclusions about the equality of discrete and continuous models, it was necessary to scrutinize the simulated data used to test the models. The SPSS syntax code designed for this study attempted to simulate how data are believed to be produced in natural settings. In general, we expected the data to exhibit unidimensionality and multivariate normality. Unidimensionality of the data was confirmed by the fact that all 252 four-item one-factor EFA models passed the χ2 goodness-of-fit test, and had dominant first-factor eigenvalues. In the presence of correlated errors, or correlations between errors and true scores, we would expect the estimated factor loadings from the EFA to diverge from the item-to-factor correlations, but this was not the case. The fact that the EFA maximum likelihood estimation routine produced factor loading parameters that closely replicated the true item-to-factor correlation is strong evidence that the data generation procedure follows the Classical True Score Model. We did observe an unexpectedly high rate of violations of the multivariate normality assumption among variables with a low number of scale points (i.e., from 3 to 6 points), and variables in the "high" random error variance condition. We offered two potential explanations for the problem. First, high levels of error variance can push the extreme tails of a variable's distribution past the maximum or minimum of the response scale, which effectively removes the tails of the distribution altogether – potentially causing non-normality. Second, the confidence interval for evaluating Mardia's test of multivariate normality may be too sensitive to the large sample size of this study (Mardia, 1974; see equation (5)). Oversensitivity to sample size may cause an overestimate of the degree of non-normality in otherwise normally distributed variables.
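This sensitivity is visible directly in equation (5): the half-width of the RMK critical interval shrinks in proportion to 1/√N, so very large samples flag even trivial departures. A small sketch (the function name is ours):

```python
import math

def rmk_interval(p, N, z_crit=3.09):
    """Critical interval for RMK from equation (5)."""
    centre = p * (p + 2)
    half = z_crit * math.sqrt(8 * p * (p + 2) / N)
    return (centre - half) / centre, (centre + half) / centre

print(rmk_interval(4, 10_000))  # approximately (0.982, 1.018), as in the text
print(rmk_interval(4, 500))     # a much wider interval at a smaller N
```

At N = 500 the same Z-critical value yields an interval of roughly 0.92 to 1.08, wide enough that almost none of the marginal failures in Table 1 would have been flagged.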
Based on our results, there is evidence to suggest that the normality violations are likely due to the latter explanation. Kline (1998) has shown that significantly elevated χ2 goodness-of-fit values are prevalent among factor models suffering from normality problems. Interestingly, none of the models in this study that were found to be non-normal according to RMK had significant χ2 values according to the EFA evaluation. Likewise, with the exception of the 3, 4, and 5-point scale models, significant RMK values were not associated with significant χ2 values according to the configural invariance CFA evaluations. Far from contradicting the findings of Kline, our results suggest instead that the confidence interval for Mardia's RMK value may be inappropriately detecting significant non-normality when there is none. Furthermore, if one compares how the scale models perform on Mardia's test (see Table 1) to how they perform on tests of MI (see Table 5), no clear pattern between them was found. The comparison between these two tests revealed examples of RMK-normal models that passed MI evaluations under certain conditions and failed MI under others (i.e., the RMK-normal 11-point model passed MI under the "low" REV condition and failed MI under the "moderate" REV condition). Additionally, examples were also found where RMK non-normal models were indeed able to pass MI evaluations (i.e., the RMK non-normal 21-point model consistently passed MI under the "high" REV condition). Because of the lack of a pattern between RMK and MI, we suspect that the confidence interval around Mardia's RMK value was oversensitive to the large sample size in the study, and is therefore unduly attributing non-normality to some legitimately normally distributed models. This leads us to conclude that the novel data generation procedure presented in this study was both informed and validated through the principles of the Classical True Score Model.
In general, the data were deemed to be well suited for studying the equality of discrete and continuous measurement models. It should be emphasized that models with 3 and 4 scale points suffered the most from normality issues; they had the lowest performance in the EFA and were more likely to fail at the configural invariance level of MI. It remains unclear, however, whether these scales’ poor performance in the EFA and CFA evaluations is directly due to their lack of normality. Further study is needed to uncover the exact relationship between the normality of discrete scales and their measurement invariance with continuous scales.

Conclusions

From the results of the MI-CFA, we conclude that there are conditions under which response scales with 11 to 15 scale points can reproduce the measurement properties of a continuous scale. In very general terms, the more susceptible a measure is to random error variance, the higher the number of scale points that should be used. However, our results provide strong support for the claim that using response scales with more than 15 points is, for the most part, unnecessary. We found that scales with fewer than 11 points have significantly more measurement error than continuous scales, even under ideal conditions. Thus, scales with 3 to 10 points can be considered coarsely categorized. Regardless of whether these scales are capable of producing normal distributions, they do not compare to a continuous scale in terms of their measurement properties. Moreover, the error that was found to be inherent in coarse scales may have direct implications for the accuracy of instruments that employ them. We know that measurement error serves to reduce the reliability of test scores (Crocker & Algina, 1986, chap. 6; Gregory, 2004, chap. 3; McDonald, 1999, chap. 5). What is unknown, however, is the exact relationship between reliability and the measurement error introduced by coarse response scales.
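The gap between coarse and continuous scales can be demonstrated directly. This sketch discretizes bivariate normal data onto progressively coarser scales and tracks the Pearson correlation; the ±3 SD equal-width binning is an assumed illustrative scheme, not the study's procedure:

```python
import numpy as np

def discretize(x, k):
    """Map a continuous variable onto k ordered categories using
    equal-width bins over +/-3 SD (an assumed scheme for illustration)."""
    return np.digitize(x, np.linspace(-3, 3, k + 1)[1:-1])

rng = np.random.default_rng(0)
rho = 0.8
z = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=10_000)

r = {k: np.corrcoef(discretize(z[:, 0], k), discretize(z[:, 1], k))[0, 1]
     for k in (101, 11, 5, 3)}
for k, v in r.items():
    print(k, round(v, 3))   # Pearson r sinks further below rho = 0.8 as k shrinks
```

The attenuation is negligible at 101 points but becomes pronounced by 3 points, mirroring the coarse-categorization pattern described above.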
Our results are consistent with previous studies which have concluded that 3- and 4-point scales should not be treated as if they were continuous (Bandalos & Enders, 1996; Bollen & Barb, 1981; Dolan, 1994; Jenkins & Taber, 1977; Johnson & Creech, 1983; Lissitz & Green, 1975; Taylor et al., 2006). We extend this conclusion by finding that scales with 10 points or fewer have serious comparability problems with continuous scales, and therefore, caution should be taken whenever coarse scales are employed. Researchers will likely continue using coarse scales in tests and surveys simply because they are perceived to be more convenient. It is important, however, that researchers understand the level of error that is introduced through the use of coarse response scales. While further research is still needed, following the guidelines set forth by this study could help to reduce the error in survey results and potentially raise the standard of accuracy for future psychological measures.

Also noteworthy is the fact that our results provide additional evidence for the assertion forwarded by Cheung and Rensvold (2002) and Wu et al. (2007) that ΔCFI is more stable and realistic in its allocation of MI decisions than Δχ2. This is the first study we know of to show that ΔCFI is just as stable for dependent-samples models as it is for independent-samples models. The Δχ2 test rejected MI for discrete scale models far more often than would be expected; we suspect this test was highly sensitive to the large sample size involved in this study. Researchers who are interested in conducting MI-CFA studies should consider the results from both Δχ2 and ΔCFI evaluations.

Limitations and Future Directions

The analyses conducted in this study were performed upon data produced under ideal simulated conditions. Any conclusions drawn from such analyses must be understood to have somewhat limited generalizability to real-world applications.
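The disagreement between the two nested-model criteria discussed above can be sketched as follows; the cutoffs are the conventional α = .05 chi-square critical value and the Cheung and Rensvold (2002) −0.01 rule, and the example fit statistics are illustrative rather than quoted from our tables:

```python
# Critical value of chi-square with df = 4 at alpha = .05 (standard table value)
CHI2_CRIT_DF4 = 9.488

def invariance_decisions(chi2_constrained, chi2_free, cfi_constrained, cfi_free,
                         delta_cfi_cut=-0.01):
    """Return (reject by delta-chi2, reject by delta-CFI) for a pair of
    nested models with df difference 4, contrasting the chi-square
    difference test with the Cheung-Rensvold change-in-CFI rule."""
    d_chi2 = chi2_constrained - chi2_free
    d_cfi = cfi_constrained - cfi_free
    return d_chi2 > CHI2_CRIT_DF4, d_cfi < delta_cfi_cut

# A large-N pattern like the one we observed: an enormous delta-chi2
# alongside a negligible drop in CFI -- the two criteria disagree.
print(invariance_decisions(567.5, 39.3, 0.997, 1.000))  # (True, False)
```

With N = 10,000, even substantively trivial constraint misfit can produce a huge Δχ2, while ΔCFI stays inside the −0.01 bound, which is why we recommend consulting both.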
However, the data generation procedure was shown to have followed the tenets of the Classical True Score Model, the same model that data collected under natural conditions are believed to follow (Crocker & Algina, 1986, chap. 6). Additionally, a MI approach to CFA with dependent-samples models is one of the strictest tests of measurement equality among factor analytic models (Brown, 2006, chap. 7). The choice to use simulated data and a dependent-samples model for our evaluation implies that our recommendations about the number of scale points to use should be considered fairly conservative. That is, using 11 to 15 scale points is perhaps a realistically safe overestimate of the “optimum” number of points necessary for most psychological measures. Researchers should feel confident in using 11 to 15 points; however, future research is needed to determine whether a less conservative estimate exists.

Many of the conditions under which the current data were generated could be manipulated in ways that would further our understanding of the effect of categorization on the measurement invariance between discrete and continuous response scales. The ideal symmetrically distributed variables seen in the current study are unlikely to be found in natural data. Thus, a thorough study of the effect of skewed and/or kurtotic distributions on the equality of scales should be a high priority for future research. Additionally, the specified model for the CFA was quite simple; models with only four indicators and one latent factor are fairly unrepresentative of the models commonly seen in social science research. Perhaps a larger or more complex measurement model would show different results. Finally, the sample size could also be manipulated to better represent samples normally found in CFA studies.
An evaluation of samples ranging from 500 down to 100 cases may improve the applicability of recommendations for the number of scale points that should be used. A population analogue approach was taken in this study as a means of compensating for the lack of empirical random sampling. While the parameter estimates and fit indices reported here were shown to be consistent and stable across all design conditions, it is common practice among simulation studies to compile data over multiple iterations of the simulation in order to produce stable results (Bandalos, 2006). Future research should consider implementing an iterative approach to data generation.

In this study, we relied on the Pearson product-moment (PPM) correlation to calculate the covariance matrix for both the EFA and the CFA. As previously mentioned, the PPM assumes that all the variables in the analysis have a continuous scale of measurement (Pearson, 1909). Because of this assumption, the use of the PPM provided an additional level of strictness to the test of MI between continuous and discrete scale models. However, it is often inappropriate to apply this assumption to discretely scaled data collected under natural conditions (Jöreskog, 1994; Muthén, 1984). If it is assumed that the discretely scaled variables are not continuous themselves, but are instead derived from an unknown latent continuum, then the polychoric correlation should be used to calculate covariance matrices for factor analysis. Further study is needed to establish the equality of continuous and discrete scale models when different covariance matrices are applied.

To add to the base of knowledge concerning the relationship between scale points and the reliability of test scores, the current data generation procedure may be well suited to help determine the number of scale points necessary to reproduce the reliability estimates made by a continuous scale.
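Because such a generator retains the true scores, the reliability cost of categorization can be computed directly rather than inferred. A minimal sketch (the error variance and the ±3 SD binning scheme are illustrative assumptions, not the thesis's settings):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
T = rng.standard_normal(n)            # true scores, known by construction
E = 0.6 * rng.standard_normal(n)      # random error (variance is an assumption)
X = T + E                             # continuous observed scores

def to_scale(x, k):
    """Round onto a k-point scale via equal-width bins over +/-3 SD
    (an illustrative scheme)."""
    return np.digitize(x, np.linspace(-3, 3, k + 1)[1:-1])

rel_cont = np.corrcoef(X, T)[0, 1] ** 2   # reliability of the continuous scale
loss = {k: rel_cont - np.corrcoef(to_scale(X, k), T)[0, 1] ** 2
        for k in (15, 7, 3)}
for k, v in loss.items():
    print(k, round(v, 3))   # reliability forfeited to categorization grows as k shrinks
```

Once the forfeited reliability is quantified this way, it becomes the raw material for the kind of categorization-aware reliability adjustment discussed below.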
Because our simulation procedure produces both the true scores and the error scores, it may be possible to quantify the amount of measurement error introduced by discretely categorized response scales. Once quantified, the measurement error could be used to calculate a reliability adjustment capable of correcting reliability estimates for the effect of categorization.

The computer simulation program designed for this study is highly versatile. In the future, it could be used to examine a multitude of measurement-related subjects. For example, it could be used to explore the effect that categorization has on the outcomes produced by various factor analytic estimation methods, such as maximum likelihood and weighted least squares. Similarly, there is much work needed in improving our methods of calculating reproduced factor scores (e.g., the Bartlett method or the Anderson-Rubin method). The data generator presented here could also help illuminate the differences in the way group biases are detected when using either a MI approach or a differential item functioning (DIF) approach.

Footnotes

1 Note that 2-point scales also appear in the literature, but are subject to specific statistical analyses that are beyond the scope of the current treatment.

References

Bandalos, D. L. (2006). Use of Monte Carlo studies in structural equation modeling research. In G. R. Hancock & R. O. Mueller (Eds.), Structural equation modeling: A second course (pp. 385-426). Greenwich, CT: Information Age Publishing, Inc.

Bandalos, D., & Enders, C. (1996). The effects of nonnormality and number of response categories on reliability. Applied Measurement in Education, 9(2), 151-160.

Bollen, K. A., & Barb, K. (1981). Pearson’s R and coarsely categorized measures. American Sociological Review, 46, 232-239.

Brown, T. A. (2006). Confirmatory factor analysis for applied research. New York, NY: The Guilford Press.

Cheung, G. W., & Rensvold, R. B. (2002).
Evaluating goodness-of-fit indexes for testing measurement invariance. Structural Equation Modeling, 9, 233-255.

Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Orlando, FL: Holt, Rinehart and Winston, Inc.

DiStefano, C. (2002). The impact of categorization with confirmatory factor analysis. Structural Equation Modeling, 9(3), 327-346.

Dolan, C. V. (1994). Factor analysis of variables with 2, 3, 5 and 7 response categories: A comparison of categorical variable estimators using simulated data. British Journal of Mathematical and Statistical Psychology, 47, 309-326.

Garson, G. D. (n.d.). Correlation. In Statnotes: Topics in multivariate analysis. Retrieved December 13, 2007, from http://www2.chass.ncsu.edu/garson/pa765/statnote.htm.

Girard, R. A., & Cliff, N. (1976). A Monte Carlo evaluation of interactive multidimensional scaling. Psychometrika, 41, 43-64.

Gregory, R. J. (2004). Psychological testing: History, principles, and applications. Boston, MA: Pearson.

Jenkins, G. D. Jr., & Taber, T. D. (1977). A Monte Carlo study of factors affecting three indices of composite scale reliability. Journal of Applied Psychology, 62, 392-398.

Johnson, D. R., & Creech, J. C. (1983). Ordinal measures in multiple indicator models: A simulation study of categorization errors. American Sociological Review, 48, 398-407.

Jöreskog, K. G. (1994). On the estimation of polychoric correlations and their asymptotic covariance matrix. Psychometrika, 59, 381-389.

Kline, R. B. (1998). Principles and practice of structural equation modeling. New York, NY: Guilford Press.

Lissitz, R. W., & Green, S. B. (1975). Effect of the number of scale points on reliability: A Monte Carlo approach. Journal of Applied Psychology, 60, 10-13.

Lubke, G., & Muthén, B. (2004). Applying multigroup confirmatory factor models for continuous outcomes to Likert scale data complicates meaningful group comparisons. Structural Equation Modeling, 11(4), 514-534.

Mardia, K. V. (1970).
Measures of multivariate skewness and kurtosis with applications. Biometrika, 57, 519-530.

Mardia, K. V. (1974). Applications of some measures of multivariate skewness and kurtosis for testing normality and robustness studies. Sankhya, 36, 115-128.

McDonald, R. P. (1999). Test theory: A unified treatment. Mahwah, NJ: Lawrence Erlbaum Associates, Inc.

Muthén, B. O. (1984). A general structural equation model with dichotomous, ordered categorical, and continuous latent variable indicators. Psychometrika, 49, 115-132.

Muthén, B., & Kaplan, D. (1985). A comparison of some methodologies for the factor analysis of non-normal Likert variables. British Journal of Mathematical and Statistical Psychology, 38, 171-189.

Pearson, K. (1909). On a new method for determining the correlation between a measured character A and a character B. Biometrika, 7, 96-109.

Russell, C. J., Pinto, J. K., & Bobko, P. (1991). Appropriate moderated regression and inappropriate research strategy: A demonstration of information loss due to scale coarseness. Applied Psychological Measurement, 15, 125-135.

SAS Institute Inc. (2004). SAS/STAT 9.1 user’s guide. Cary, NC: SAS Institute Inc.

Taylor, A., West, S., & Aiken, L. (2006). Loss of power in logistic, ordinal logistic, and probit regression when an outcome variable is coarsely categorized. Educational & Psychological Measurement, 66(2), 228-239.

Wu, A. D., Li, Z., & Zumbo, B. D. (2007). Decoding the meaning of factorial invariance and updating the practice of multi-group confirmatory factor analysis: A demonstration with TIMSS data. Practical Assessment, Research & Evaluation, 12(3), 1-26. Retrieved June 17, 2007, from http://pareonline.net/getvn.asp?v=12&n=3.

Appendix A.
Selected EFA Results for the “Low” ISV – “None” REV Condition.a

Scale:     Cont.   101     21      11      9       8       7       6       5       4       3
χ2         2.380   2.059   1.196   6.005   4.334   0.269   1.141   4.407   1.039   0.979   2.693
df         2       2       2       2       2       2       2       2       2       2       2
p          0.30    0.36    0.55    0.05    0.11    0.87    0.57    0.11    0.59    0.61    0.26

Initial eigenvalues
Factor 1   3.391   3.390   3.358   3.277   3.219   3.177   3.117   3.043   2.927   2.815   2.597
Factor 2   0.21    0.21    0.22    0.25    0.27    0.28    0.31    0.33    0.37    0.40    0.48
Factor 3   0.20    0.20    0.21    0.24    0.26    0.27    0.29    0.32    0.36    0.40    0.47
Factor 4   0.20    0.20    0.21    0.23    0.25    0.27    0.28    0.31    0.35    0.39    0.46

Factor loadings
Item 1     0.892   0.891   0.888   0.872   0.861   0.855   0.843   0.823   0.802   0.775   0.731
Item 2     0.892   0.892   0.886   0.870   0.856   0.852   0.838   0.824   0.807   0.779   0.724
Item 3     0.894   0.893   0.887   0.873   0.864   0.853   0.848   0.829   0.808   0.783   0.733
Item 4     0.894   0.893   0.885   0.870   0.860   0.848   0.831   0.824   0.789   0.774   0.730

Correlation between the item and the common factor
Item 1     0.893   0.893   0.888   0.873   0.862   0.854   0.841   0.828   0.800   0.762   0.665
Item 2     0.893   0.892   0.886   0.870   0.858   0.852   0.839   0.821   0.800   0.756   0.646
Item 3     0.894   0.894   0.889   0.874   0.865   0.856   0.847   0.833   0.806   0.769   0.680
Item 4     0.893   0.893   0.885   0.868   0.858   0.846   0.833   0.819   0.784   0.750   0.625

a EFA = Exploratory Factor Analysis, ISV = Item-Specific Variance, REV = Random Error Variance. Scale column headings give the number of scale points ("Cont." = continuous).

Appendix B.
Selected EFA Results for the “Low” ISV – “Moderate” REV Condition.a

Scale:     Cont.   101     21      11      9       8       7       6       5       4       3
χ2         0.474   0.515   0.518   1.453   2.158   0.312   0.625   1.327   1.729   7.502   1.941
df         2       2       2       2       2       2       2       2       2       2       2
p          0.79    0.77    0.77    0.48    0.34    0.86    0.73    0.52    0.42    0.02    0.38

Initial eigenvalues
Factor 1   2.937   2.936   2.900   2.786   2.730   2.703   2.659   2.647   2.636   2.697   2.598
Factor 2   0.37    0.37    0.38    0.42    0.44    0.45    0.47    0.47    0.47    0.45    0.49
Factor 3   0.35    0.35    0.36    0.40    0.42    0.43    0.44    0.45    0.46    0.44    0.46
Factor 4   0.35    0.35    0.36    0.39    0.41    0.42    0.43    0.43    0.44    0.42    0.45

Factor loadings
Item 1     0.807   0.807   0.800   0.775   0.765   0.761   0.753   0.756   0.743   0.762   0.741
Item 2     0.803   0.803   0.797   0.771   0.758   0.752   0.740   0.738   0.733   0.744   0.717
Item 3     0.811   0.811   0.802   0.781   0.772   0.764   0.761   0.746   0.751   0.757   0.739
Item 4     0.794   0.793   0.785   0.759   0.742   0.737   0.721   0.725   0.727   0.746   0.722

Correlation between the item and the common factor
Item 1     0.807   0.806   0.799   0.774   0.767   0.757   0.751   0.745   0.738   0.735   0.667
Item 2     0.798   0.798   0.790   0.767   0.752   0.746   0.736   0.735   0.729   0.723   0.641
Item 3     0.811   0.810   0.804   0.781   0.772   0.761   0.755   0.747   0.748   0.741   0.674
Item 4     0.794   0.793   0.784   0.760   0.746   0.738   0.728   0.726   0.719   0.720   0.614

a EFA = Exploratory Factor Analysis, ISV = Item-Specific Variance, REV = Random Error Variance.

Appendix C.
Selected EFA Results for the “Low” ISV – “High” REV Condition.a

Scale:     Cont.   101     21      11      9       8       7       6       5       4       3
χ2         6.025   5.880   6.667   9.458   5.072   5.568   1.680   1.475   3.873   0.082   4.492
df         2       2       2       2       2       2       2       2       2       2       2
p          0.05    0.05    0.04    0.01    0.08    0.06    0.43    0.48    0.14    0.96    0.11

Initial eigenvalues
Factor 1   2.224   2.223   2.204   2.161   2.134   2.112   2.083   2.040   1.980   1.960   1.773
Factor 2   0.61    0.61    0.62    0.64    0.64    0.65    0.65    0.67    0.69    0.69    0.76
Factor 3   0.60    0.60    0.60    0.62    0.63    0.64    0.64    0.66    0.68    0.68    0.75
Factor 4   0.57    0.57    0.57    0.58    0.60    0.60    0.62    0.63    0.65    0.67    0.71

Factor loadings
Item 1     0.645   0.645   0.640   0.629   0.626   0.618   0.613   0.601   0.582   0.566   0.518
Item 2     0.634   0.634   0.629   0.615   0.607   0.600   0.594   0.581   0.568   0.550   0.496
Item 3     0.658   0.657   0.653   0.643   0.631   0.630   0.611   0.607   0.588   0.582   0.534
Item 4     0.618   0.617   0.612   0.601   0.595   0.587   0.585   0.567   0.547   0.565   0.483

Correlation between the item and the common factor
Item 1     0.652   0.651   0.647   0.635   0.628   0.622   0.616   0.595   0.580   0.556   0.470
Item 2     0.633   0.633   0.628   0.614   0.605   0.602   0.592   0.583   0.562   0.545   0.442
Item 3     0.654   0.654   0.651   0.640   0.632   0.625   0.616   0.608   0.586   0.561   0.493
Item 4     0.619   0.619   0.616   0.601   0.595   0.586   0.582   0.570   0.540   0.543   0.402

a EFA = Exploratory Factor Analysis, ISV = Item-Specific Variance, REV = Random Error Variance.

Appendix D.
Selected EFA Results for the “Moderate” ISV – “None” REV Condition.a

Scale:     Cont.   101     21      11      9       8       7       6       5       4       3
χ2         0.164   0.205   0.305   0.628   0.394   0.191   0.162   0.863   1.473   0.172   2.089
df         2       2       2       2       2       2       2       2       2       2       2
p          0.92    0.90    0.86    0.73    0.82    0.91    0.92    0.65    0.48    0.92    0.35

Initial eigenvalues
Factor 1   2.509   2.509   2.491   2.441   2.399   2.391   2.353   2.302   2.221   2.119   1.892
Factor 2   0.50    0.50    0.51    0.52    0.54    0.54    0.55    0.57    0.61    0.64    0.72
Factor 3   0.50    0.50    0.51    0.52    0.53    0.53    0.55    0.56    0.59    0.63    0.70
Factor 4   0.49    0.49    0.50    0.52    0.53    0.53    0.54    0.56    0.59    0.62    0.69

Factor loadings
Item 1     0.707   0.707   0.701   0.692   0.688   0.683   0.670   0.659   0.641   0.614   0.543
Item 2     0.705   0.705   0.701   0.692   0.679   0.677   0.668   0.656   0.628   0.608   0.546
Item 3     0.710   0.710   0.707   0.695   0.683   0.684   0.676   0.665   0.640   0.624   0.555
Item 4     0.715   0.714   0.711   0.693   0.682   0.680   0.673   0.655   0.643   0.598   0.538

Correlation between the item and the common factor
Item 1     0.704   0.703   0.700   0.688   0.681   0.678   0.668   0.655   0.640   0.609   0.540
Item 2     0.701   0.701   0.697   0.685   0.676   0.675   0.666   0.653   0.627   0.608   0.525
Item 3     0.713   0.713   0.710   0.697   0.687   0.685   0.677   0.667   0.644   0.616   0.555
Item 4     0.713   0.712   0.708   0.691   0.681   0.676   0.669   0.651   0.637   0.598   0.499

a EFA = Exploratory Factor Analysis, ISV = Item-Specific Variance, REV = Random Error Variance.

Appendix E.
Selected EFA Results for the “Moderate” ISV – “High” REV Condition.a

Scale:     Cont.   101     21      11      9       8       7       6       5       4       3
χ2         2.914   2.760   2.989   3.412   4.289   3.035   3.388   2.281   2.736   0.564   3.299
df         2       2       2       2       2       2       2       2       2       2       2
p          0.23    0.25    0.22    0.18    0.12    0.22    0.18    0.32    0.25    0.75    0.19

Initial eigenvalues
Factor 1   1.775   1.775   1.767   1.736   1.725   1.709   1.685   1.667   1.616   1.583   1.434
Factor 2   0.76    0.77    0.77    0.78    0.78    0.78    0.79    0.80    0.82    0.81    0.89
Factor 3   0.73    0.73    0.74    0.75    0.75    0.76    0.77    0.77    0.79    0.80    0.86
Factor 4   0.73    0.73    0.73    0.74    0.74    0.75    0.76    0.76    0.77    0.80    0.82

Factor loadings
Item 1     0.515   0.514   0.512   0.501   0.504   0.490   0.489   0.491   0.479   0.443   0.421
Item 2     0.521   0.521   0.520   0.509   0.500   0.497   0.486   0.479   0.451   0.442   0.366
Item 3     0.516   0.516   0.513   0.498   0.495   0.490   0.480   0.474   0.467   0.449   0.417
Item 4     0.481   0.481   0.477   0.473   0.467   0.468   0.457   0.441   0.415   0.430   0.316

Correlation between the item and the common factor
Item 1     0.521   0.520   0.518   0.507   0.505   0.497   0.491   0.483   0.468   0.443   0.399
Item 2     0.502   0.502   0.500   0.489   0.483   0.482   0.470   0.458   0.445   0.435   0.356
Item 3     0.523   0.523   0.519   0.508   0.501   0.498   0.492   0.483   0.475   0.456   0.397
Item 4     0.481   0.481   0.478   0.468   0.462   0.460   0.454   0.439   0.424   0.418   0.303

a EFA = Exploratory Factor Analysis, ISV = Item-Specific Variance, REV = Random Error Variance.

Appendix F.
Selected EFA Results for the “High” ISV – “None” REV Condition.a

Scale:     Cont.   101     21      11      9       8       7       6       5       4       3
χ2         0.662   0.585   1.186   0.638   2.776   0.788   2.396   1.474   0.359   0.514   1.193
df         2       2       2       2       2       2       2       2       2       2       2
p          0.72    0.75    0.55    0.73    0.25    0.67    0.30    0.48    0.84    0.77    0.55

Initial eigenvalues
Factor 1   1.601   1.600   1.596   1.572   1.575   1.547   1.547   1.514   1.489   1.433   1.358
Factor 2   0.81    0.81    0.81    0.82    0.83    0.83    0.83    0.84    0.84    0.87    0.89
Factor 3   0.80    0.80    0.80    0.81    0.80    0.82    0.82    0.83    0.84    0.86    0.89
Factor 4   0.79    0.79    0.79    0.80    0.79    0.80    0.80    0.81    0.83    0.84    0.87

Factor loadings
Item 1     0.440   0.438   0.440   0.433   0.423   0.421   0.412   0.408   0.398   0.362   0.360
Item 2     0.460   0.461   0.459   0.442   0.454   0.447   0.443   0.432   0.419   0.405   0.353
Item 3     0.449   0.449   0.443   0.441   0.444   0.420   0.432   0.411   0.407   0.381   0.342
Item 4     0.441   0.441   0.440   0.431   0.430   0.420   0.421   0.405   0.391   0.371   0.328

Correlation between the item and the common factor
Item 1     0.446   0.445   0.444   0.439   0.431   0.428   0.419   0.414   0.400   0.378   0.336
Item 2     0.434   0.434   0.431   0.419   0.420   0.415   0.412   0.416   0.397   0.386   0.340
Item 3     0.455   0.454   0.450   0.445   0.441   0.438   0.428   0.421   0.409   0.396   0.339
Item 4     0.446   0.446   0.443   0.438   0.434   0.425   0.423   0.410   0.403   0.380   0.333

a EFA = Exploratory Factor Analysis, ISV = Item-Specific Variance, REV = Random Error Variance.

Appendix G.
Selected EFA Results for the “High” ISV – “Moderate” REV Condition.a

Scale:     Cont.   101     21      11      9       8       7       6       5       4       3
χ2         1.205   1.121   0.734   1.708   1.797   1.890   2.375   1.568   0.019   1.540   0.673
df         2       2       2       2       2       2       2       2       2       2       2
p          0.55    0.57    0.69    0.43    0.41    0.39    0.30    0.46    0.99    0.46    0.71

Initial eigenvalues
Factor 1   1.490   1.489   1.481   1.448   1.446   1.426   1.431   1.420   1.422   1.401   1.331
Factor 2   0.85    0.85    0.85    0.87    0.87    0.87    0.87    0.87    0.87    0.88    0.90
Factor 3   0.84    0.84    0.84    0.85    0.85    0.86    0.86    0.86    0.86    0.86    0.90
Factor 4   0.82    0.82    0.83    0.83    0.83    0.85    0.84    0.84    0.85    0.85    0.87

Factor loadings
Item 1     0.380   0.379   0.379   0.361   0.357   0.369   0.352   0.355   0.351   0.349   0.314
Item 2     0.414   0.414   0.408   0.401   0.398   0.380   0.394   0.384   0.388   0.366   0.316
Item 3     0.405   0.404   0.401   0.380   0.381   0.373   0.386   0.370   0.381   0.388   0.328
Item 4     0.417   0.417   0.414   0.402   0.404   0.385   0.385   0.387   0.380   0.359   0.372

Correlation between the item and the common factor
Item 1     0.392   0.392   0.388   0.374   0.374   0.369   0.372   0.362   0.365   0.360   0.321
Item 2     0.405   0.405   0.400   0.387   0.386   0.383   0.379   0.378   0.374   0.372   0.324
Item 3     0.407   0.407   0.402   0.387   0.380   0.384   0.374   0.377   0.370   0.382   0.323
Item 4     0.412   0.412   0.405   0.398   0.396   0.387   0.387   0.384   0.387   0.379   0.337

a EFA = Exploratory Factor Analysis, ISV = Item-Specific Variance, REV = Random Error Variance.

Appendix H.
Selected EFA Results for the “High” ISV – “High” REV Condition.a

Scale:     Cont.   101     21      11      9       8       7       6       5       4       3
χ2         0.488   0.532   0.365   0.484   0.983   0.854   0.332   0.239   0.683   0.035   0.169
df         2       2       2       2       2       2       2       2       2       2       2
p          0.78    0.77    0.83    0.78    0.61    0.65    0.85    0.89    0.71    0.98    0.92

Initial eigenvalues
Factor 1   1.341   1.341   1.340   1.323   1.315   1.309   1.303   1.290   1.256   1.252   1.202
Factor 2   0.90    0.90    0.90    0.90    0.91    0.91    0.91    0.91    0.93    0.92    0.94
Factor 3   0.89    0.89    0.89    0.90    0.90    0.90    0.90    0.91    0.92    0.92    0.93
Factor 4   0.87    0.87    0.87    0.88    0.88    0.88    0.89    0.89    0.90    0.91    0.93

Factor loadings
Item 1     0.325   0.325   0.328   0.318   0.312   0.316   0.301   0.302   0.282   0.291   0.253
Item 2     0.367   0.366   0.364   0.358   0.344   0.351   0.339   0.342   0.323   0.298   0.274
Item 3     0.336   0.336   0.338   0.320   0.322   0.317   0.317   0.303   0.282   0.292   0.265
Item 4     0.320   0.320   0.316   0.315   0.318   0.301   0.316   0.295   0.283   0.278   0.245

Correlation between the item and the common factor
Item 1     0.333   0.333   0.333   0.323   0.320   0.320   0.314   0.307   0.300   0.295   0.249
Item 2     0.333   0.333   0.332   0.324   0.319   0.323   0.315   0.312   0.295   0.294   0.252
Item 3     0.348   0.349   0.348   0.340   0.335   0.332   0.324   0.316   0.306   0.300   0.261
Item 4     0.333   0.333   0.333   0.323   0.323   0.316   0.311   0.314   0.294   0.301   0.256

a EFA = Exploratory Factor Analysis, ISV = Item-Specific Variance, REV = Random Error Variance.

Appendix I. Selected MI Results for the “Low” ISV – “None” REV Condition.a 1. Configural Invariance Scale 101 21 16 13 12 11 10 9 8 7 6 5 4 3 RMSEA 0.000 0.000 0.000 0.003 0.000 0.004 0.005 0.006 0.002 0.000 0.008 0.022 0.067 0.138 χ2 10.773 15.735 10.815 17.767 11.291 18.600 20.514 20.888 16.532 15.727 27.086 88.797 653.666 2517.920 2. Loading Invariance χ2 CFI 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.996 0.979 10.864 17.199 19.844 23.670 26.983 25.996 23.937 28.678 19.383 22.145 32.234 97.911 3.
Means Invariance Scale 101 21 16 13 12 11 10 9 8 7 6 5 4 3 χ2 11.720 18.002 24.096 34.302 35.800 26.951 26.689 42.628 29.141 34.314 32.425 100.174 Δ χ2 0.856 0.803 4.252 10.632 8.817 0.955 2.752 13.950 9.758 12.169 0.191 2.263 not tested not tested CFI 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 Δ χ2 0.091 1.464 9.029 5.903 15.692 7.396 3.423 7.790 2.851 6.418 5.148 9.114 not tested not tested CFI 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.999 ΔCFI 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 -0.001 4. Error Invariance ΔCFI 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.001 χ2 39.281 567.507 925.949 1446.655 1648.995 1843.481 2297.172 2899.778 3504.711 4574.824 5974.675 7906.446 Δ χ2 27.561 549.505 901.853 1412.353 1613.195 1816.530 2270.483 2857.150 3475.570 4540.510 5942.250 7806.272 not tested not tested CFI 1.000 0.997 0.995 0.992 0.991 0.990 0.987 0.984 0.980 0.973 0.964 0.949 ΔCFI 0.000 -0.003 -0.005 -0.008 -0.009 -0.010 -0.013 -0.016 -0.020 -0.027 -0.036 -0.051 a MI = Measurement Invariance, ISV = Item-Specific Variance, REV = Random Error Variance. Note: Critical Values: RMSEA=0.06, χ2(df=16)= 37.2, CFI=0.95, Δχ2(df=4)= 16.9, and ΔCFI= -0.01. Note: Models that fail to meet invariance with a continuous model are in bold. 53 Appendix J. Selected MI Results for the “Low” ISV – “Moderate” REV Condition.a 1. Configural Invariance Scale 101 21 16 15 14 13 12 11 10 9 8 7 6 5 4 3 RMSEA 0.000 0.005 0.000 0.005 0.000 0.000 0.000 0.000 0.000 0.008 0.000 0.004 0.000 0.015 0.047 0.113 χ2 14.217 19.924 10.524 19.516 10.344 16.092 13.086 14.029 7.599 25.365 10.798 18.103 14.851 49.169 336.772 1742.710 2. Loading Invariance χ2 CFI 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.997 0.981 21.434 27.029 12.950 22.999 14.454 20.532 18.156 19.862 13.596 25.787 12.953 20.318 17.337 52.581 446.772 3. 
Means Invariance Scale 101 21 16 15 14 13 12 11 10 9 8 7 6 5 4 3 χ2 25.772 30.890 14.979 27.543 16.672 26.559 21.110 24.286 20.221 29.224 17.601 22.758 18.419 52.756 452.086 Δ χ2 4.338 3.861 2.029 4.544 2.218 6.027 2.954 4.424 6.625 3.437 4.648 2.440 1.082 0.175 5.314 not tested CFI 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.996 Δ χ2 7.217 7.105 2.426 3.483 4.110 4.440 5.070 5.833 5.997 0.422 2.155 2.215 2.486 3.412 110.000 not tested CFI 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.996 ΔCFI 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 -0.001 4. Error Invariance ΔCFI 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 χ2 52.689 483.131 881.208 1120.791 1280.902 1383.484 1700.401 1831.697 2101.347 2473.103 2718.607 2882.374 2709.370 2249.776 2759.906 Δ χ2 26.917 452.241 866.229 1093.248 1264.230 1356.925 1679.291 1807.411 2081.126 2443.879 2701.006 2859.616 2690.951 2197.020 2307.820 not tested CFI 1.000 0.997 0.994 0.992 0.991 0.990 0.987 0.986 0.984 0.980 0.978 0.976 0.977 0.980 0.975 ΔCFI 0.000 -0.003 -0.006 -0.008 -0.009 -0.010 -0.013 -0.014 -0.016 -0.020 -0.022 -0.024 -0.023 -0.020 -0.021 a MI = Measurement Invariance, ISV = Item-Specific Variance, REV = Random Error Variance. Note: Critical Values: RMSEA=0.06, χ2(df=16)= 37.2, CFI=0.95, Δχ2(df=4)= 16.9, and ΔCFI= -0.01. Note: Models that fail to meet invariance with a continuous model are in bold. 54 Appendix K. Selected MI Results for the “Low” ISV – “High” REV Condition.a 1. Configural Invariance Scale 101 21 16 13 12 11 10 9 8 7 6 5 4 3 RMSEA 0.008 0.003 0.007 0.005 0.004 0.002 0.006 0.000 0.004 0.007 0.002 0.018 0.039 0.070 χ2 26.142 17.166 22.624 19.661 18.379 16.862 21.792 11.362 18.999 23.862 16.596 66.017 240.839 709.375 2. 
Appendix table (continued).

2. Loading Invariance

| Scale | CFI (configural) | χ² | Δχ² | CFI | ΔCFI |
|---|---|---|---|---|---|
| 101 | 1.000 | 26.698 | 0.556 | 1.000 | 0.000 |
| 21 | 1.000 | 17.970 | 0.804 | 1.000 | 0.000 |
| 16 | 1.000 | 27.273 | 4.649 | 1.000 | 0.000 |
| 13 | 1.000 | 27.517 | 7.856 | 1.000 | 0.000 |
| 12 | 1.000 | 22.014 | 3.635 | 1.000 | 0.000 |
| 11 | 1.000 | 18.395 | 1.533 | 1.000 | 0.000 |
| 10 | 1.000 | 31.925 | 10.133 | 1.000 | 0.000 |
| 9 | 1.000 | 13.684 | 2.322 | 1.000 | 0.000 |
| 8 | 1.000 | 22.650 | 3.651 | 1.000 | 0.000 |
| 7 | 1.000 | 27.845 | 3.983 | 1.000 | 0.000 |
| 6 | 1.000 | 17.615 | 1.019 | 1.000 | 0.000 |
| 5 | 0.999 | **73.121** | 7.104 | 0.999 | 0.000 |
| 4 | 0.996 | **260.935** | **20.096** | 0.996 | 0.000 |
| 3 | 0.984 | not tested | | | |

3.–4. Means and Error Invariance

| Scale | χ² (means) | Δχ² | CFI | ΔCFI | χ² (error) | Δχ² | CFI | ΔCFI |
|---|---|---|---|---|---|---|---|---|
| 101 | 29.836 | 3.138 | 1.000 | 0.000 | **39.057** | 9.221 | 1.000 | 0.000 |
| 71 | 22.296 | n/a | 1.000 | n/a | **37.259** | 14.963 | 1.000 | 0.000 |
| 61 | 36.060 | n/a | 1.000 | n/a | **62.099** | **26.039** | 1.000 | 0.000 |
| 21 | 21.240 | 3.270 | 1.000 | 0.000 | **250.966** | **229.726** | 0.997 | -0.003 |
| 16 | 29.880 | 2.607 | 1.000 | 0.000 | **401.444** | **371.564** | 0.995 | -0.005 |
| 13 | 32.794 | 5.277 | 1.000 | 0.000 | **539.243** | **506.449** | 0.993 | -0.007 |
| 12 | 23.226 | 1.212 | 1.000 | 0.000 | **623.203** | **599.977** | 0.992 | -0.008 |
| 11 | 28.945 | 10.550 | 1.000 | 0.000 | **834.060** | **805.115** | 0.989 | **-0.011** |
| 10 | 33.893 | 1.968 | 1.000 | 0.000 | **821.962** | **788.069** | 0.989 | **-0.011** |
| 9 | 22.133 | 8.449 | 1.000 | 0.000 | **1098.923** | **1076.790** | 0.986 | **-0.014** |
| 8 | 25.189 | 2.539 | 1.000 | 0.000 | **1339.995** | **1314.806** | 0.982 | **-0.018** |
| 7 | 31.674 | 3.829 | 1.000 | 0.000 | **1835.902** | **1804.228** | 0.974 | **-0.026** |
| 6 | 19.472 | 1.857 | 1.000 | 0.000 | **2260.531** | **2241.059** | 0.967 | **-0.033** |
| 5 | **76.076** | 2.955 | 1.000 | 0.000 | **3143.781** | **3067.705** | 0.951 | **-0.048** |
| 4 | **263.755** | 2.820 | 1.000 | 0.000 | **4379.371** | **4115.616** | **0.926** | **-0.070** |
| 3 | not tested | | | | not tested | | | |

ᵃ MI = Measurement Invariance, ISV = Item-Specific Variance, REV = Random Error Variance.
Note: Critical values: RMSEA = 0.06, χ²(df = 16) = 37.2, CFI = 0.95, Δχ²(df = 4) = 16.9, and ΔCFI = -0.01.
Note: Models that fail to meet invariance with a continuous model are in bold.

Appendix L. Selected MI Results for the “Moderate” ISV – “None” REV Condition.ᵃ

1.–2. Configural and Loading Invariance

| Scale | χ² (configural) | RMSEA | CFI | χ² (loading) | Δχ² | CFI | ΔCFI |
|---|---|---|---|---|---|---|---|
| 101 | 19.412 | 0.005 | 1.000 | 31.929 | 12.517 | 1.000 | 0.000 |
| 21 | 12.040 | 0.000 | 1.000 | 15.799 | 3.759 | 1.000 | 0.000 |
| 16 | 20.189 | 0.005 | 1.000 | 25.833 | 5.644 | 1.000 | 0.000 |
| 13 | 7.438 | 0.000 | 1.000 | 8.147 | 0.709 | 1.000 | 0.000 |
| 12 | 20.130 | 0.005 | 1.000 | 31.808 | 11.678 | 1.000 | 0.000 |
| 11 | 11.364 | 0.000 | 1.000 | 12.033 | 0.669 | 1.000 | 0.000 |
| 10 | 8.390 | 0.000 | 1.000 | 10.054 | 1.664 | 1.000 | 0.000 |
| 9 | 10.418 | 0.000 | 1.000 | 16.459 | 6.041 | 1.000 | 0.000 |
| 8 | 5.774 | 0.000 | 1.000 | 6.567 | 0.793 | 1.000 | 0.000 |
| 7 | 11.054 | 0.000 | 1.000 | 11.971 | 0.917 | 1.000 | 0.000 |
| 6 | 11.093 | 0.000 | 1.000 | 15.032 | 3.939 | 1.000 | 0.000 |
| 5 | 8.271 | 0.000 | 1.000 | 13.262 | 4.991 | 1.000 | 0.000 |
| 4 | 10.438 | 0.000 | 1.000 | 36.081 | **25.643** | 1.000 | 0.000 |
| 3 | **89.624** | 0.022 | 0.999 | **727.004** | **637.380** | 0.988 | **-0.011** |

3.–4. Means and Error Invariance

| Scale | χ² (means) | Δχ² | CFI | ΔCFI | χ² (error) | Δχ² | CFI | ΔCFI |
|---|---|---|---|---|---|---|---|---|
| 101 | 33.056 | 1.127 | 1.000 | 0.000 | **45.993** | 12.937 | 1.000 | 0.000 |
| 71 | 33.110 | n/a | 1.000 | n/a | **44.508** | 11.398 | 1.000 | 0.000 |
| 61 | 20.475 | n/a | 1.000 | n/a | **70.971** | **50.496** | 1.000 | 0.000 |
| 21 | 19.083 | 3.284 | 1.000 | 0.000 | **203.960** | **184.877** | 0.998 | -0.002 |
| 16 | 28.081 | 2.248 | 1.000 | 0.000 | **407.015** | **378.934** | 0.996 | -0.004 |
| 13 | 15.851 | 7.704 | 1.000 | 0.000 | **593.868** | **578.017** | 0.994 | -0.006 |
| 12 | **41.607** | 9.799 | 1.000 | 0.000 | **679.470** | **637.863** | 0.993 | -0.007 |
| 11 | 15.724 | 3.691 | 1.000 | 0.000 | **749.099** | **733.375** | 0.992 | -0.008 |
| 10 | 20.242 | 10.188 | 1.000 | 0.000 | **1035.759** | **1015.517** | 0.989 | **-0.011** |
| 9 | 18.478 | 2.019 | 1.000 | 0.000 | **1216.402** | **1197.924** | 0.987 | **-0.013** |
| 8 | 11.468 | 4.901 | 1.000 | 0.000 | **1343.392** | **1331.924** | 0.986 | **-0.014** |
| 7 | 16.933 | 4.962 | 1.000 | 0.000 | **1768.126** | **1751.193** | 0.981 | **-0.019** |
| 6 | 22.047 | 7.015 | 1.000 | 0.000 | **2540.902** | **2518.855** | 0.971 | **-0.029** |
| 5 | 14.192 | 0.930 | 1.000 | 0.000 | **3646.435** | **3632.243** | 0.956 | **-0.044** |
| 4 | **49.517** | 13.436 | 1.000 | 0.000 | **6364.731** | **6315.214** | **0.916** | **-0.084** |
| 3 | not tested | | | | not tested | | | |

ᵃ MI = Measurement Invariance, ISV = Item-Specific Variance, REV = Random Error Variance.
Note: Critical values: RMSEA = 0.06, χ²(df = 16) = 37.2, CFI = 0.95, Δχ²(df = 4) = 16.9, and ΔCFI = -0.01.
Note: Models that fail to meet invariance with a continuous model are in bold.

Appendix M. Selected MI Results for the “Moderate” ISV – “High” REV Condition.ᵃ

1.–2. Configural and Loading Invariance

| Scale | χ² (configural) | RMSEA | CFI | χ² (loading) | Δχ² | CFI | ΔCFI |
|---|---|---|---|---|---|---|---|
| 101 | 17.595 | 0.003 | 1.000 | 23.442 | 5.847 | 1.000 | 0.000 |
| 21 | 12.049 | 0.000 | 1.000 | 16.745 | 4.696 | 1.000 | 0.000 |
| 16 | 16.018 | 0.001 | 1.000 | 22.555 | 6.537 | 1.000 | 0.000 |
| 13 | 16.111 | 0.001 | 1.000 | 22.934 | 6.823 | 1.000 | 0.000 |
| 12 | 13.792 | 0.000 | 1.000 | 17.979 | 4.187 | 1.000 | 0.000 |
| 11 | 12.892 | 0.000 | 1.000 | 14.710 | 1.818 | 1.000 | 0.000 |
| 10 | 7.171 | 0.000 | 1.000 | 13.413 | 6.242 | 1.000 | 0.000 |
| 9 | 15.352 | 0.000 | 1.000 | 20.821 | 5.469 | 1.000 | 0.000 |
| 8 | 9.105 | 0.000 | 1.000 | 16.183 | 7.078 | 1.000 | 0.000 |
| 7 | 11.865 | 0.000 | 1.000 | 12.411 | 0.546 | 1.000 | 0.000 |
| 6 | 24.077 | 0.007 | 1.000 | 27.313 | 3.236 | 1.000 | 0.000 |
| 5 | 12.710 | 0.000 | 1.000 | 15.234 | 2.524 | 1.000 | 0.000 |
| 4 | 16.527 | 0.002 | 1.000 | 24.104 | 7.577 | 1.000 | 0.000 |
| 3 | **41.950** | 0.013 | 0.999 | **499.912** | **457.962** | 0.984 | **-0.015** |

3.–4. Means and Error Invariance

| Scale | χ² (means) | Δχ² | CFI | ΔCFI | χ² (error) | Δχ² | CFI | ΔCFI |
|---|---|---|---|---|---|---|---|---|
| 101 | 24.396 | 0.954 | 1.000 | 0.000 | 34.225 | 9.829 | 1.000 | 0.000 |
| 71 | 16.924 | n/a | 1.000 | n/a | 31.865 | 14.941 | 1.000 | 0.000 |
| 61 | 30.684 | n/a | 1.000 | n/a | **68.584** | **37.900** | 0.999 | -0.001 |
| 21 | 19.530 | 2.785 | 1.000 | 0.000 | **161.502** | **141.972** | 0.998 | -0.002 |
| 16 | 25.652 | 3.097 | 1.000 | 0.000 | **323.379** | **297.727** | 0.995 | -0.005 |
| 13 | 31.617 | 8.683 | 1.000 | 0.000 | **436.280** | **404.663** | 0.992 | -0.008 |
| 12 | 21.442 | 3.463 | 1.000 | 0.000 | **476.196** | **454.754** | 0.992 | -0.008 |
| 11 | 16.492 | 1.782 | 1.000 | 0.000 | **687.883** | **671.391** | 0.988 | **-0.012** |
| 10 | 22.807 | 9.394 | 1.000 | 0.000 | **740.211** | **717.404** | 0.986 | **-0.014** |
| 9 | 22.502 | 1.681 | 1.000 | 0.000 | **868.899** | **846.397** | 0.984 | **-0.016** |
| 8 | 18.524 | 2.341 | 1.000 | 0.000 | **1035.455** | **1016.931** | 0.980 | **-0.020** |
| 7 | 15.350 | 2.939 | 1.000 | 0.000 | **1392.265** | **1376.915** | 0.972 | **-0.028** |
| 6 | 30.801 | 3.488 | 1.000 | 0.000 | **1752.112** | **1721.311** | 0.964 | **-0.036** |
| 5 | 17.188 | 1.954 | 1.000 | 0.000 | **2584.474** | **2567.286** | **0.943** | **-0.057** |
| 4 | 27.642 | 3.538 | 1.000 | 0.000 | **3579.847** | **3552.205** | **0.912** | **-0.088** |
| 3 | not tested | | | | not tested | | | |

ᵃ MI = Measurement Invariance, ISV = Item-Specific Variance, REV = Random Error Variance.
Note: Critical values: RMSEA = 0.06, χ²(df = 16) = 37.2, CFI = 0.95, Δχ²(df = 4) = 16.9, and ΔCFI = -0.01.
Note: Models that fail to meet invariance with a continuous model are in bold.

Appendix N. Selected MI Results for the “High” ISV – “None” REV Condition.ᵃ

1.–2. Configural and Loading Invariance

| Scale | χ² (configural) | RMSEA | CFI | χ² (loading) | Δχ² | CFI | ΔCFI |
|---|---|---|---|---|---|---|---|
| 101 | 27.353 | 0.008 | 1.000 | 34.756 | 7.403 | 1.000 | 0.000 |
| 21 | 15.103 | 0.000 | 1.000 | 21.201 | 6.098 | 1.000 | 0.000 |
| 16 | 11.024 | 0.000 | 1.000 | 11.671 | 0.647 | 1.000 | 0.000 |
| 13 | 13.156 | 0.000 | 1.000 | 16.703 | 3.547 | 1.000 | 0.000 |
| 12 | 11.520 | 0.000 | 1.000 | 21.195 | 9.675 | 1.000 | 0.000 |
| 11 | 10.234 | 0.000 | 1.000 | 21.629 | 11.395 | 1.000 | 0.000 |
| 10 | 10.013 | 0.000 | 1.000 | 12.013 | 2.000 | 1.000 | 0.000 |
| 9 | 15.575 | 0.000 | 1.000 | 24.780 | 9.205 | 1.000 | 0.000 |
| 8 | 20.131 | 0.005 | 1.000 | 22.618 | 2.487 | 1.000 | 0.000 |
| 7 | 26.004 | 0.008 | 1.000 | 28.629 | 2.625 | 1.000 | 0.000 |
| 6 | 15.668 | 0.000 | 1.000 | 18.916 | 3.248 | 1.000 | 0.000 |
| 5 | 12.580 | 0.000 | 1.000 | 12.607 | 0.027 | 1.000 | 0.000 |
| 4 | 14.710 | 0.000 | 1.000 | 18.133 | 3.423 | 1.000 | 0.000 |
| 3 | 16.169 | 0.001 | 1.000 | **105.350** | **89.181** | 0.997 | -0.003 |

3.–4. Means and Error Invariance

| Scale | χ² (means) | Δχ² | CFI | ΔCFI | χ² (error) | Δχ² | CFI | ΔCFI |
|---|---|---|---|---|---|---|---|---|
| 101 | **39.743** | 4.987 | 1.000 | 0.000 | **56.523** | 16.780 | 0.999 | -0.001 |
| 61 | 22.482 | n/a | 1.000 | n/a | 34.660 | 12.178 | 1.000 | 0.000 |
| 51 | 23.017 | n/a | 1.000 | n/a | **48.107** | **25.090** | 1.000 | 0.000 |
| 21 | 24.872 | 3.671 | 1.000 | 0.000 | **184.030** | **159.158** | 0.997 | -0.003 |
| 16 | 13.717 | 2.046 | 1.000 | 0.000 | **270.223** | **256.506** | 0.995 | -0.005 |
| 13 | 20.424 | 3.721 | 1.000 | 0.000 | **421.068** | **400.644** | 0.992 | -0.008 |
| 12 | 22.985 | 1.790 | 1.000 | 0.000 | **495.297** | **472.312** | 0.990 | -0.010 |
| 11 | 23.531 | 1.902 | 1.000 | 0.000 | **433.616** | **410.085** | 0.991 | -0.009 |
| 10 | 16.775 | 4.762 | 1.000 | 0.000 | **708.222** | **691.447** | 0.985 | **-0.015** |
| 9 | 28.988 | 4.208 | 1.000 | 0.000 | **759.635** | **730.647** | 0.984 | **-0.016** |
| 8 | 24.774 | 2.156 | 1.000 | 0.000 | **964.815** | **940.041** | 0.979 | **-0.021** |
| 7 | 31.256 | 2.627 | 1.000 | 0.000 | **1199.746** | **1168.490** | 0.974 | **-0.026** |
| 6 | 21.989 | 3.073 | 1.000 | 0.000 | **1750.380** | **1728.391** | 0.960 | **-0.040** |
| 5 | 16.433 | 3.826 | 1.000 | 0.000 | **2426.783** | **2410.350** | **0.941** | **-0.059** |
| 4 | 24.240 | 6.107 | 1.000 | 0.000 | **4317.859** | **4293.619** | **0.884** | **-0.116** |
| 3 | **107.715** | 2.365 | 0.997 | 0.000 | **1969.487** | **1861.772** | **0.933** | **-0.064** |

ᵃ MI = Measurement Invariance, ISV = Item-Specific Variance, REV = Random Error Variance.
Note: Critical values: RMSEA = 0.06, χ²(df = 16) = 37.2, CFI = 0.95, Δχ²(df = 4) = 16.9, and ΔCFI = -0.01.
Note: Models that fail to meet invariance with a continuous model are in bold.

Appendix O. Selected MI Results for the “High” ISV – “Moderate” REV Condition.ᵃ

1.–2. Configural and Loading Invariance

| Scale | χ² (configural) | RMSEA | CFI | χ² (loading) | Δχ² | CFI | ΔCFI |
|---|---|---|---|---|---|---|---|
| 101 | 10.921 | 0.000 | 1.000 | 13.172 | 2.251 | 1.000 | 0.000 |
| 21 | 11.175 | 0.000 | 1.000 | 12.942 | 1.767 | 1.000 | 0.000 |
| 16 | 7.679 | 0.000 | 1.000 | 12.308 | 4.629 | 1.000 | 0.000 |
| 15 | 10.425 | 0.000 | 1.000 | 18.306 | 7.881 | 1.000 | 0.000 |
| 14 | 10.944 | 0.000 | 1.000 | 13.551 | 2.607 | 1.000 | 0.000 |
| 13 | 17.704 | 0.003 | 1.000 | 26.489 | 8.785 | 1.000 | 0.000 |
| 12 | 15.280 | 0.000 | 1.000 | 19.676 | 4.396 | 1.000 | 0.000 |
| 11 | 15.858 | 0.000 | 1.000 | 22.238 | 6.380 | 1.000 | 0.000 |
| 10 | 13.439 | 0.000 | 1.000 | 14.610 | 1.171 | 1.000 | 0.000 |
| 9 | 8.475 | 0.000 | 1.000 | 11.124 | 2.649 | 1.000 | 0.000 |
| 8 | 17.732 | 0.003 | 1.000 | 20.029 | 2.297 | 1.000 | 0.000 |
| 7 | 15.444 | 0.000 | 1.000 | 17.527 | 2.083 | 1.000 | 0.000 |
| 6 | 10.319 | 0.000 | 1.000 | 11.263 | 0.944 | 1.000 | 0.000 |
| 5 | 8.126 | 0.000 | 1.000 | 8.651 | 0.525 | 1.000 | 0.000 |
| 4 | 11.443 | 0.000 | 1.000 | 13.271 | 1.828 | 1.000 | 0.000 |
| 3 | 13.687 | 0.000 | 1.000 | **96.899** | **83.212** | 0.997 | -0.003 |

3.–4. Means and Error Invariance

| Scale | χ² (means) | Δχ² | CFI | ΔCFI | χ² (error) | Δχ² | CFI | ΔCFI |
|---|---|---|---|---|---|---|---|---|
| 101 | 14.987 | 1.815 | 1.000 | 0.000 | 22.177 | 7.190 | 1.000 | 0.000 |
| 91 | **48.928** | n/a | 0.999 | n/a | **66.210** | **17.282** | 0.999 | 0.000 |
| 81 | 16.721 | n/a | 1.000 | n/a | **40.150** | **23.429** | 1.000 | 0.000 |
| 21 | 20.561 | 7.619 | 1.000 | 0.000 | **271.624** | **251.063** | 0.995 | -0.005 |
| 16 | 15.688 | 3.380 | 1.000 | 0.000 | **396.726** | **381.038** | 0.992 | -0.008 |
| 15 | 19.459 | 1.153 | 1.000 | 0.000 | **447.896** | **428.437** | 0.991 | -0.009 |
| 14 | 15.160 | 1.609 | 1.000 | 0.000 | **533.837** | **518.677** | 0.989 | **-0.011** |
| 13 | 28.956 | 2.467 | 1.000 | 0.000 | **679.718** | **650.762** | 0.985 | **-0.015** |
| 12 | 24.988 | 5.312 | 1.000 | 0.000 | **722.078** | **697.090** | 0.984 | **-0.016** |
| 11 | 22.541 | 0.303 | 1.000 | 0.000 | **792.293** | **769.752** | 0.982 | **-0.018** |
| 10 | 18.528 | 3.918 | 1.000 | 0.000 | **1044.708** | **1026.180** | 0.976 | **-0.024** |
| 9 | 13.502 | 2.378 | 1.000 | 0.000 | **1153.326** | **1139.824** | 0.973 | **-0.027** |
| 8 | 26.107 | 6.078 | 1.000 | 0.000 | **1248.435** | **1222.328** | 0.970 | **-0.030** |
| 7 | 22.376 | 4.849 | 1.000 | 0.000 | **1280.539** | **1258.163** | 0.968 | **-0.032** |
| 6 | 17.482 | 6.219 | 1.000 | 0.000 | **1081.917** | **1064.435** | 0.972 | **-0.028** |
| 5 | 15.915 | 7.264 | 1.000 | 0.000 | **941.193** | **925.278** | 0.974 | **-0.026** |
| 4 | 18.591 | 5.320 | 1.000 | 0.000 | **1182.785** | **1164.194** | 0.963 | **-0.037** |
| 3 | **102.118** | 5.219 | 0.997 | 0.000 | **275.726** | **173.608** | 0.990 | -0.007 |

ᵃ MI = Measurement Invariance, ISV = Item-Specific Variance, REV = Random Error Variance.
Note: Critical values: RMSEA = 0.06, χ²(df = 16) = 37.2, CFI = 0.95, Δχ²(df = 4) = 16.9, and ΔCFI = -0.01.
Note: Models that fail to meet invariance with a continuous model are in bold.

Appendix P. Selected MI Results for the “High” ISV – “High” REV Condition.ᵃ

1.–2. Configural and Loading Invariance

| Scale | χ² (configural) | RMSEA | CFI | χ² (loading) | Δχ² | CFI | ΔCFI |
|---|---|---|---|---|---|---|---|
| 101 | 21.487 | 0.006 | 1.000 | 21.980 | 0.493 | 1.000 | 0.000 |
| 21 | 18.887 | 0.004 | 1.000 | 23.274 | 4.387 | 1.000 | 0.000 |
| 16 | 11.922 | 0.000 | 1.000 | 16.787 | 4.865 | 1.000 | 0.000 |
| 13 | 11.649 | 0.000 | 1.000 | 13.129 | 1.480 | 1.000 | 0.000 |
| 12 | 20.998 | 0.006 | 1.000 | 23.372 | 2.374 | 1.000 | 0.000 |
| 11 | 14.236 | 0.000 | 1.000 | 16.170 | 1.934 | 1.000 | 0.000 |
| 10 | 24.545 | 0.007 | 1.000 | 29.466 | 4.921 | 1.000 | 0.000 |
| 9 | 18.206 | 0.004 | 1.000 | 20.476 | 2.270 | 1.000 | 0.000 |
| 8 | 4.847 | 0.000 | 1.000 | 5.865 | 1.018 | 1.000 | 0.000 |
| 7 | 8.166 | 0.000 | 1.000 | 11.537 | 3.371 | 1.000 | 0.000 |
| 6 | 9.810 | 0.000 | 1.000 | 17.893 | 8.083 | 1.000 | 0.000 |
| 5 | 4.793 | 0.000 | 1.000 | 14.755 | 9.962 | 1.000 | 0.000 |
| 4 | 7.972 | 0.000 | 1.000 | 9.988 | 2.016 | 1.000 | 0.000 |
| 3 | 7.354 | 0.000 | 1.000 | **79.922** | **72.568** | 0.997 | -0.003 |

3.–4. Means and Error Invariance

| Scale | χ² (means) | Δχ² | CFI | ΔCFI | χ² (error) | Δχ² | CFI | ΔCFI |
|---|---|---|---|---|---|---|---|---|
| 101 | 35.359 | 13.379 | 1.000 | 0.000 | **56.725** | **21.366** | 0.999 | -0.001 |
| 21 | 24.227 | 0.953 | 1.000 | 0.000 | **176.409** | **152.182** | 0.997 | -0.003 |
| 16 | 23.708 | 6.921 | 1.000 | 0.000 | **297.139** | **273.431** | 0.994 | -0.006 |
| 13 | 14.326 | 1.197 | 1.000 | 0.000 | **297.457** | **283.131** | 0.994 | -0.006 |
| 12 | 26.770 | 3.398 | 1.000 | 0.000 | **419.192** | **392.422** | 0.991 | -0.009 |
| 11 | 17.931 | 1.761 | 1.000 | 0.000 | **478.853** | **460.922** | 0.989 | **-0.011** |
| 10 | 32.409 | 2.943 | 1.000 | 0.000 | **617.373** | **584.964** | 0.985 | **-0.015** |
| 9 | 21.184 | 0.708 | 1.000 | 0.000 | **731.986** | **710.802** | 0.982 | **-0.018** |
| 8 | 9.544 | 3.679 | 1.000 | 0.000 | **843.124** | **833.580** | 0.979 | **-0.021** |
| 7 | 17.503 | 5.966 | 1.000 | 0.000 | **1178.805** | **1161.302** | 0.970 | **-0.030** |
| 6 | 23.272 | 5.379 | 1.000 | 0.000 | **1568.589** | **1545.317** | 0.958 | **-0.042** |
| 5 | 22.869 | 8.114 | 1.000 | 0.000 | **2120.063** | **2097.194** | **0.939** | **-0.061** |
| 4 | 16.456 | 6.468 | 1.000 | 0.000 | **2741.555** | **2725.099** | **0.912** | **-0.088** |
| 3 | **83.991** | 4.069 | 0.997 | 0.000 | **442.301** | **358.310** | 0.982 | **-0.015** |

ᵃ MI = Measurement Invariance, ISV = Item-Specific Variance, REV = Random Error Variance.
Note: Critical values: RMSEA = 0.06, χ²(df = 16) = 37.2, CFI = 0.95, Δχ²(df = 4) = 16.9, and ΔCFI = -0.01.
Note: Models that fail to meet invariance with a continuous model are in bold.
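The pass/fail judgments in these appendix tables follow the critical values given in the table notes: a discretized model fails an invariance step against the continuous model when the chi-square difference between nested models exceeds 16.9 (df = 4) or when the CFI drops by more than 0.01. A minimal sketch of that decision rule follows; the function name and example are illustrative only, since the thesis obtained its fit statistics from its SEM software rather than from code like this.

```python
# Decision rule behind the invariance judgments in the appendix tables,
# using the critical values stated in the table notes.

CHI_SQ_DIFF_CRIT = 16.9   # critical chi-square difference, df = 4
DELTA_CFI_CRIT = -0.01    # CFI may not drop by more than 0.01

def invariance_failure(chi_sq_nested, chi_sq_baseline, cfi_nested, cfi_baseline):
    """Return (delta_chi_sq, delta_cfi, failed) for one invariance step."""
    delta_chi_sq = chi_sq_nested - chi_sq_baseline
    delta_cfi = cfi_nested - cfi_baseline
    failed = delta_chi_sq > CHI_SQ_DIFF_CRIT or delta_cfi < DELTA_CFI_CRIT
    return delta_chi_sq, delta_cfi, failed

# 3-point scale, loading step, "Moderate" ISV / "None" REV (Appendix L):
# configural chi-square = 89.624, loading chi-square = 727.004.
d_chi, d_cfi, failed = invariance_failure(727.004, 89.624, 0.988, 0.999)
# d_chi = 637.380 and d_cfi = -0.011, so the 3-point model fails loading
# invariance on both criteria, matching the table.
```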
Item Metadata
| Title | Evaluating the error of measurement due to categorical scaling with a measurement invariance approach to confirmatory factor analysis |
| Creator | Olson, Brent |
| Publisher | University of British Columbia |
| Date Issued | 2008 |
| Extent | 392764 bytes |
| Subject | optimum number of scale points, continuous scale, discrete scale, categorization, coarseness, measurement error, Classical True Score Model, simulation study, data generation, item specific variance, random error variance, longitudinal measurement invariance, Comparative Fit Index, Relative Multivariate Kurtosis |
| Genre | Thesis/Dissertation |
| Type | Text |
| FileFormat | application/pdf |
| Language | eng |
| Date Available | 2008-02-11 |
| Provider | Vancouver : University of British Columbia Library |
| Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
| DOI | 10.14288/1.0054570 |
| URI | http://hdl.handle.net/2429/332 |
| Degree | Master of Arts - MA |
| Program | Measurement, Evaluation and Research Methodology |
| Affiliation | Education, Faculty of; Educational and Counselling Psychology, and Special Education (ECPS), Department of |
| Degree Grantor | University of British Columbia |
| GraduationDate | 2008-05 |
| Campus | UBCV |
| Scholarly Level | Graduate |
| Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
| AggregatedSourceRepository | DSpace |