"Graduate and Postdoctoral Studies"@en . "DSpace"@en . "UBCV"@en . "Eom, Han J."@en . "2008-09-10T18:14:20Z"@en . "1993"@en . "Doctor of Philosophy - PhD"@en . "University of British Columbia"@en . "The present study employed Monte Carlo procedures to investigate the effects of data categorization and noncircularity on generalizability (G) coefficients for the one-facet and two-facet fully-crossed balanced designs as well as on the Type I error rates for F tests in repeated measures ANOVA designs. Computer programs were developed to conduct a series of simulations under various sampling conditions. Five independent parameters were considered in the simulations: (a)three levels of repeated measures (3, 5, 7); (b) three G coefficients (.60, .75, .90); (c) three epsilon values (.50,.70, 1.0); (d) three sample sizes (15, 30, 45); and (e) six measurement scales (Continuous, 5-point and 3-point scales with either normal or uniform distribution, and dichotomous). For the one-facet design, the results of the simulations indicated that categorical data resulted in a considerably smaller G coefficient than for the parent continuous data, especially for a 3-point or less scale. Noncircularity did not introduce any bias to the estimate, but yielded more variable estimates of the G coefficient. The sampling theory of G coefficients with continuous data was fairly robust to a moderate departure from circularity, but somewhat sensitive to severe noncircularity (about 6% for E = .7 and about 7.2% for E =.5 of the sample estimates lay in the 5% region of the upper tail). However, it was not adequate for categorical data, especially for a 3-point or less scale. The results of the two-facet design closely paralleled those of the one-facet design in terms of the effects of categorization, sample size, and population G values. 
The primary difference in the findings between the two designs was that the sampling theory of G coefficients for the two-facet design, which was developed using Satterthwaite's procedure, was very satisfactory and quite robust to violations of the circularity assumption. Type I error rates of the F test for continuous data were inflated when the circularity assumption failed, with categorization causing a slight reduction in this inflation. Relationships among the population epsilon, the sample estimate, and the Type I error rates across the 81 simulated conditions revealed the presence of a strong negative relationship between the epsilon estimates and the associated Type I error rates, thus supporting current theory. However, for the e = 1.0condition the associated Type I error rates were all close to the nominal level, and the correlation with the estimated epsilon was near zero. Further investigation of the correlations among the sample estimates (\"C, MSe, and MSr) within each population epsilon condition suggested that the inflation in Type I error rates is not, as is commonly assumed, merely a function of the population epsilon value. This led us to question the current practice of utilizing an epsilon-adjusted F test in repeated measures ANOVA designs."@en . "https://circle.library.ubc.ca/rest/handle/2429/1758?expand=metadata"@en . "8985409 bytes"@en . "application/pdf"@en . 
"t the required standardTHE INTERACTIVE EFFECTS OF DATA CATEGORIZATION ANDNONCIRCULARITY ON THE SAMPLING DISTRIBUTION OF GENERALIZABILITYCOEFFICIENTS IN ANALYSIS OF VARIANCE MODELS:AN EMPIRICAL INVESTIGATIONByHAN J00 EOMB.P.E and M.P.E., Sung Kyun Kwan University in Korea, 1985M.P.E., The University of British Columbia in Canada, 1989A THESIS SUBMITTED IN PARTIAL FULFILLMENT OFTHE REQUIREMENT FOR THE DEGREE OFDOCTOR OF PHILOSOPHYinTHE FACULTY OF GRADUATE STUDIES(Interdisciplinary Studies:Human Kinetics / Educational Psychology / Statistics)We accept this thesis as conformingTHE UNIVERSITY OF BRITISH COLUMBIAOctober 1993\u00C2\u00A9 Han Joo Eom, 1993In presenting this thesis in partial fulfilment of the requirements for an advanceddegree at the University of British Columbia, I agree that the Library shall make itfreely available for reference and study. I further agree that permission for extensivecopying of this thesis for scholarly purposes may be granted by the head of mydepartment or by his or her representatives. It is understood that copying orpublication of this thesis for financial gain shall not be allowed without my writtenpermission.(Signature)Department of HUMAN KINETICSThe University of British ColumbiaVancouver, CanadaDate^OCTOBER, 15, 1991DE-6 (2/88)i iABSTRACTThe present study employed Monte Carlo procedures toinvestigate the effects of data categorization andnoncircularity on generalizability (G) coefficients for the one-facet and two-facet fully-crossed balanced designs as well as onthe Type I error rates for F tests in repeated measures ANOVAdesigns. Computer programs were developed to conduct a seriesof simulations under various sampling conditions. 
Five independent parameters were considered in the simulations: (a) three levels of repeated measures (3, 5, 7); (b) three G coefficients (.60, .75, .90); (c) three epsilon values (.50, .70, 1.0); (d) three sample sizes (15, 30, 45); and (e) six measurement scales (continuous, 5-point and 3-point scales with either normal or uniform distribution, and dichotomous).

For the one-facet design, the results of the simulations indicated that categorical data resulted in a considerably smaller G coefficient than the parent continuous data, especially for scales with three or fewer points. Noncircularity did not introduce any bias into the estimate, but it yielded more variable estimates of the G coefficient. The sampling theory of G coefficients with continuous data was fairly robust to a moderate departure from circularity, but somewhat sensitive to severe noncircularity: about 6% of the sample estimates for epsilon = .70, and about 7.2% for epsilon = .50, lay in the 5% region of the upper tail. The theory was not adequate, however, for categorical data, especially for scales with three or fewer points. The results for the two-facet design closely paralleled those for the one-facet design in terms of the effects of categorization, sample size, and population G values. The primary difference between the two designs was that the sampling theory of G coefficients for the two-facet design, developed using Satterthwaite's procedure, was very satisfactory and quite robust to violations of the circularity assumption.

Type I error rates of the F test for continuous data were inflated when the circularity assumption failed, with categorization causing a slight reduction in this inflation. Relationships among the population epsilon, its sample estimate, and the Type I error rates across the 81 simulated conditions revealed a strong negative relationship between the epsilon estimates and the associated Type I error rates, thus supporting current theory.
However, for the epsilon = 1.0 condition the associated Type I error rates were all close to the nominal level, and their correlation with the estimated epsilon was near zero. Further investigation of the correlations among the sample estimates (ε̂, MSe, and MSr) within each population epsilon condition suggested that the inflation in Type I error rates is not, as is commonly assumed, merely a function of the population epsilon value. This led us to question the current practice of utilizing an epsilon-adjusted F test in repeated measures ANOVA designs.

TABLE OF CONTENTS

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgement

CHAPTER ONE: INTRODUCTION
    Overview of test theory
    Sampling error of variance components
    Sampling distribution of generalizability coefficients
        Inferential procedures for coefficient alpha
        Inferential procedures for intraclass correlation
        Inferential procedures for G coefficients
    Purpose of the study

CHAPTER TWO: THEORIES AND MATHEMATICAL DEVELOPMENT
    Classical test theory
    Generalizability theory
    Variance component estimation
    Sampling theory of G coefficients

CHAPTER THREE: METHODS AND PROCEDURES
    Overview of problems and simulation conditions
    Simulation I: One-facet design
        Sampling populations
        Parameter specification
        Data generation
    Simulation II: Two-facet design
        Design specification
        Sampling populations
        Parameter specification
        Data generation

CHAPTER FOUR: RESULTS AND DISCUSSION
    Simulation I: One-facet design
        Calculated population G coefficient (Gcp)
        Estimated G coefficient (Ĝ1)
        Empirical sampling distribution of Ĝ1
        Sample estimates of epsilon and Type I error rates
    Simulation II: Two-facet design
        Calculated population G coefficient (Gcp)
        Estimated G coefficient (Ĝ2)
        Empirical sampling distribution of Ĝ2
        Type I error rates in quasi F tests

CHAPTER FIVE: SUMMARY AND CONCLUSIONS
    One-facet design
    Two-facet design
    Implications of the present study
    Suggestions for future research

REFERENCES
Appendix A: Circularity assumptions in repeated measures ANOVA
Appendix B: Input population covariance matrices

LIST OF TABLES

Table 2-1. The two-way (Persons by Raters) random effects ANOVA model with a single observation per cell
Table 2-2. The three-way (Persons by Occasions by Raters) random effects ANOVA model with a single observation per cell
Table 3-1. Characteristics of population parameters in the covariance matrices for data generation in the one-facet design
Table 3-2. Scale proportions of transformed data
Table 3-3. Partitioned covariance matrix for the two-facet design
Table 3-4. Population characteristics of the nine covariance matrices
Table 4-1. The effect of categorization of continuous data on the G coefficient (Gcp) with simulated population data (N=90000)
Table 4-2. The mean of Ĝ1, averaged over the levels of n and k, and the calculated population G coefficient (Gcp) across the six scales
Table 4-3. The mean (standard deviation) of Ĝ1 (k=5, 2000 replications)
Table 4-4. The difference between the calculated population G (Gcp) and the estimated G (Ĝ1) values for k=5 and n=30
Table 4-5. The theoretical standard deviation of Ĝ1 for k=5
Table 4-6. The mean (standard deviation) of the observed mean square for persons (MSp) and error (MSe) for some selected conditions (k=5, n=30, continuous data only)
Table 4-7. The theoretical standard deviations for some selected conditions (k=5 and n=30), calculated by using Gcp instead of Ĝ1
Table 4-8. Empirical sampling distribution of Ĝ1 and a goodness of fit test (k=5, n=15, G1=.75, and 6000 replications in each condition with continuous data only)
Table 4-9. Empirical percentage of Ĝ1 falling beyond the limits of the 100(1-a)% theoretical tolerance interval, averaged over the levels of k, n, and G
Table 4-10. Empirical percentage of Ĝ1 falling beyond the limits of the 90% theoretical tolerance interval, averaged over the levels of k and n
Table 4-11. Empirical percentage of Ĝ1 falling beyond the limits of the 90% theoretical tolerance interval, averaged over the levels of k and G
Table 4-12. Empirical percentage of Ĝ1 falling beyond the limits of the 90% theoretical tolerance interval, averaged over the levels of n and G
Table 4-13. The lower and upper limits of the 90% theoretical tolerance interval for Ĝ1
Table 4-14. The effect of categorization of continuous data on epsilon (εcp) with simulated population data (N=90000)
Table 4-15. The mean (standard deviation) of ε̂ for the three levels of k (n=15, averaged over the levels of G)
Table 4-16. The mean (standard deviation) of ε̂ for the three sample sizes (k=5, averaged over the levels of G)
Table 4-17. The mean (standard deviation) of the epsilon estimates for the three levels of G (k=5, n=15, 2000 replications)
Table 4-18. Empirical percentage of the Type I error rates, averaged over the levels of k, n, and G
Table 4-19. Empirical percentage of the Type I error rates for some selected conditions
Table 4-20. Correlation between the Type I error rates and the epsilon estimates for alpha=.05 only
Table 4-21. Type I error rates, correlation, and descriptive statistics for some selected conditions (k=5, n=15, G1=.90)
Table 4-22. Expected mean squares (EMS), Satterthwaite's degrees of freedom (fa), and theoretical standard deviation (SD) of Ĝ2
Table 4-23. The mean (standard deviation) of Ĝ2 for the two-facet design (n=30 only, 2000 replications)
Table 4-24. The mean (standard deviation) of observed mean squares and their correlations (continuous data only, n=30, 2000 replications)
Table 4-25. The mean (standard deviation) of the limits of the 90% confidence intervals for Ĝ2 in the two-facet design for some selected conditions (2000 replications)
Table 4-26. Empirical proportion of confidence intervals that failed to include a population G2 value (averaged over the levels of G and n)
Table 4-27. Empirical proportion of 90% confidence intervals that failed to include a specified population G2 value
Table 4-28. Empirical percentage of the Type I error rates for quasi F tests in the three-way random effects ANOVA model (averaged over the levels of G and n, each condition having 2000 replications, alpha=.05)
Table 5-1. An overview of the results regarding Gcp, Ĝ1, and empirical proportions beyond the theoretical limits of the tolerance interval of Ĝ1 in the one-facet design
Table 5-2. An overview of the results regarding εcp, ε̂, and Type I error rates in the one-facet design

LIST OF FIGURES

Figure 4-1. The effect of categorization on the G coefficient (Gcp) with simulated population data (k=5, N=90000)
Figure 4-2. The mean of Ĝ1 (n=30, averaged over the levels of ε)
Figure 4-3. Effect of categorization on the G coefficient with population data (two-facet design, N=90000)
Figure 4-4. The mean of Ĝ2 for the three sample sizes (2000 replications)
Figure 4-5. Comparison of Gcp and Ĝ2, averaged over the levels of epsilon

ACKNOWLEDGEMENT

This dissertation is the culmination of many years of study, along with the combined efforts of many individuals to whom I am indebted. I would therefore like to take this opportunity to express my sincere gratitude to all those who have given me continued encouragement and support throughout my graduate education.

My most sincere appreciation is offered to Dr. Robert Schutz, my supervisor. Over the six years that I have studied under his guidance, Dr. Schutz has invested enormous time and effort in my academic development. He has always provided numerous valuable suggestions and stimulating discussions during all phases of my graduate study, and has given me the confidence to pursue this research.
Without his guidance and support, this research would not have been completed.

I am grateful to Dr. Robert Conry, who introduced me to Measurement Theory, for his support and helpful advice.

I am truly indebted to Dr. Ian Franks, who served as a committee member for both my Master's and Ph.D. programs, for sharing his expertise in Sport Analysis and for his continued support throughout my graduate study.

I am deeply grateful to Dr. Michael Schulzer for his teaching in Statistical Consulting and Biostatistics, and for his many valuable suggestions during this dissertation.

I would further like to thank Dr. Robert Hindmarch, Dr. Robert Morford, and Mr. Dale Ohman. They made it possible for me to come to Canada and begin my graduate education at U.B.C., and they have continued to provide support and encouragement to the present time.

I am grateful to my family, who have supported me both morally and financially throughout my graduate education: most notably, my mother, my brothers, my sisters Kwi Ok Eom and Chong Ae Suh, and their families. I am also grateful to my friends who have provided their moral and intellectual support since the beginning of my graduate education. In particular, I thank my faithful friends Sung Won Yoon, Yoon Hak Park, Chi Yong Shin, and Drs. Sun Dong Yoo and In Gyu Kim.

Finally, I gratefully acknowledge and thank Mr. David Chiu for sharing his expertise in computer programming, and for freely devoting so much of his time to developing the simulation programs which made this study possible.

CHAPTER ONE: INTRODUCTION

Anyone who regularly plays a game with objective scoring, whether it be one of physical activities or of mental tasks, is acutely aware of the variability in human performance. This inconsistency in human performance stems from a variety of factors, depending on the nature of the measurement.
Among the important factors are subtle variations in the physical and mental efficiency of the test taker, uncontrollable fluctuations in external conditions, variations in the specific tasks required of individual examinees, and inconsistencies among those who evaluate examinee performance. Quantification of these sources of error that affect the measurement process constitutes the essence of reliability analysis within the context of classical test theory or generalizability theory.

Generalizability theory (G theory) was proposed by Cronbach and his colleagues (1963, 1972) as an alternative to classical test theory. G theory can be viewed as an extension and liberalization of classical test theory, achieved primarily through the application of analysis of variance (ANOVA) procedures to measurement data. The use of an appropriate factorial ANOVA model permits one to identify and independently estimate several sources of measurement error, which classical test theory treats as a single amorphous quantity. The application of G theory to measurement problems has greatly expanded over the past several years in a wide range of behavioral research. In the educational and psychological literature a number of authors have demonstrated the use of G theory to deal with multiple sources of measurement error and have stressed the advantages of the G theory approach to reliability estimation over the classical test theory approach (e.g., Brennan, 1983; Shavelson & Webb, 1991).

The use of G theory is evident in observational studies (e.g., Godbout & Schutz, 1983; Huysamen, 1990; Lomax, 1982; Morgan, 1988; Ulrich, Ulrich, & Branta, 1988). Although these observational studies vary widely in content and method, they all use human observers as the primary source of the necessary measure by classifying and recording certain behaviors of subjects.
Particular examples of observational research that are of interest in the present study include the assessment of the consistency of judges' ratings of motor behavior and of individual and team performances in sport competitions, the evaluation of the dependability of ratings of social and classroom behaviors in educational settings, and the assessment of the diagnostic accuracy of subjects with physical or mental problems in clinical settings. Data collection in any of these examples can be done through a simple checklist or through a sophisticated instrument such as a computerized recording system. However, the quantification of qualitative performances from "excellent" to "bad", the determination of the occurrence or nonoccurrence of a particular behavior, or the classification of a symptom from "not at all" to "severe" rests solely on human judgement before any data recording takes place. Therefore, reliable and consistent judgement is a primary objective, and unreliable observation, particularly in clinical settings, may have substantial effects on subsequent treatments. In fact, Fleiss (1981, p. 192) noted that the single most important source of the discrepancies in the results of similar studies he examined in clinical settings was the unreliability of psychiatric diagnosis.

Researchers employing G theory often place great emphasis on the G coefficient in their interpretation, and report it as a means of summarizing the adequacy of the measurement process. However, the estimation of G coefficients involves some combination of variance component estimates (or mean squares), each of which is subject to sampling error as well as to violations of the assumptions underlying ANOVA procedures. Consequently, these variance components could collectively produce a considerable amount of error in the estimated G coefficient, and this would be especially true for small samples.
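The dependence can be made concrete. In the simplest one-facet (persons x raters) design, the G coefficient is assembled from just two mean squares, Ĝ1 = (MSp - MSe)/MSp, so sampling error in either mean square flows directly into the estimate. The sketch below shows that computation on a small, purely hypothetical ratings matrix; it illustrates the standard ANOVA formulas and is not code from the thesis.

```python
# One-facet (persons x raters) G coefficient from two-way ANOVA mean squares.
# Illustrative sketch only: the ratings matrix is hypothetical, not thesis data.

def g_coefficient(scores):
    """scores[p][r] is the rating of person p by rater r. Returns (MSp, MSe, G)."""
    n = len(scores)          # number of persons
    k = len(scores[0])       # number of raters
    grand = sum(map(sum, scores)) / (n * k)
    p_means = [sum(row) / k for row in scores]
    r_means = [sum(scores[p][r] for p in range(n)) / n for r in range(k)]
    ss_p = k * sum((m - grand) ** 2 for m in p_means)
    ss_r = n * sum((m - grand) ** 2 for m in r_means)
    ss_tot = sum((x - grand) ** 2 for row in scores for x in row)
    ss_e = ss_tot - ss_p - ss_r               # residual (persons x raters)
    ms_p = ss_p / (n - 1)
    ms_e = ss_e / ((n - 1) * (k - 1))
    return ms_p, ms_e, (ms_p - ms_e) / ms_p   # G = (MSp - MSe) / MSp

ratings = [[7, 6, 7], [4, 5, 4], [8, 8, 9], [3, 4, 4], [6, 5, 6]]
ms_p, ms_e, g = g_coefficient(ratings)
```

With only n = 5 persons, both mean squares rest on very few degrees of freedom, which is precisely why small-sample estimates of G can wander far from the population value.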
Therefore, in the absence of information such as the likely range of the estimate under given experimental conditions, one cannot be sure whether a large G coefficient indicates that the measurement process is reliable, or is merely an overestimate due to sampling bias. The main focus of the present study, therefore, is to investigate and compare the sampling distributions of G coefficients obtained under various simulated sampling conditions.

Overview of test theory

Within the context of classical test theory, a variety of procedures have been developed for estimating different aspects of reliability, for example: calculation of test-retest correlations to estimate the stability of measurements over different testing occasions; correlation of scores obtained from parallel or alternative forms of a test to estimate the equivalence of measurements based on different sets of items; and application of various formulas to estimate the internal consistency, or homogeneity, of a pool of test items (e.g., see Feldt & Brennan, 1989). Each of these approaches defines the concept of measurement error in a somewhat different way, and thus identifies a different source of error depending on the purpose of the study under consideration. Classical test theory, which is based on the concept of parallel measures (see p. 21 for a definition), postulates that an observed score on a test can be decomposed into a true score and a single source of random error. As such, any single application of the classical test theory model cannot clearly differentiate among the multiple sources of potential error that are inherent in most behavioral measurements. A technical description of reliability estimation in classical test theory is given in Chapter II.

To overcome some of the measurement problems underlying classical test theory, Cronbach and his colleagues (1963, 1972) proposed generalizability theory (G theory) as an alternative to classical test theory.
Following the pioneering work on the analysis of variance (ANOVA) approach to measurement issues by Burt (1955), Ebel (1951), Horst (1949), and Hoyt (1941), among others, Cronbach, Rajaratnam, and Gleser (1963) formulated a theoretical model that does not rely upon the restrictive assumptions of classical test theory. By applying the mathematical rationale of Cornfield and Tukey (1956), they reformulated reliability estimation procedures with no assumptions of equivalence among the conditions. Based on the variance component estimates obtained from the Cornfield and Tukey method, Cronbach et al. (1963) derived a formula for calculating an intraclass correlation coefficient for a composite score in a one-facet design (i.e., a two-way, np persons by nr raters, random effects ANOVA model), which is identical to the reliability coefficient from the Hoyt ANOVA procedure. (The term "facet" is analogous to the ANOVA term "factor", except that the subject factor is not considered a facet in G theory.) Cronbach, Gleser, Nanda, and Rajaratnam (1972) later denoted this coefficient Ep2 and named it the coefficient of generalizability. The use of the symbol Ep2 for the G coefficient was "... intended to imply that a generalizability coefficient is approximately equal to the expected value ... of the squared correlation between observed and universe scores" (Brennan, 1983, p. 17). Following the foundation papers by Cronbach et al. (1963) and Gleser, Cronbach, and Rajaratnam (1965), extensive treatments of G theory were documented in a book by Cronbach et al. (1972), and more recently by Brennan (1983) and by Shavelson and Webb (1991). In addition, many measurement specialists have provided extensive reviews and pedagogical treatments of G theory (e.g., Brennan & Kane, 1977; Cardinet, Tourneur, & Allal, 1976, 1981; Godbout & Schutz, 1983; Hopkins, 1984; Morrow, 1989; Shavelson & Webb, 1981; Webb, Rowley, & Shavelson, 1988).
These authors presented the fundamental concepts of G theory within a conceptual framework for estimating the reliability of behavioral measurements in a wide range of educational and psychological research. By demonstrating the application of various factorial ANOVA procedures, they advocated the use of G theory and stressed the advantages of a multifaceted approach to reliability estimation over the traditional classical test theory approach. The basic concepts and terminology of G theory, along with the underlying statistical models, are presented in Chapter II.

The application of G theory to measurement problems has greatly expanded over the past several years in a wide range of behavioral research. In the educational and psychological literature a number of authors have used a generalizability approach to deal with measurement issues. Bert (1979) and Mitchell (1979), for example, applied G theory to the estimation of inter- and intra-rater reliability; Gillmore (1983) to problems of program evaluation; Johnson and Bell (1985) to the assessment of survey efficiency; Kane and Brennan (1977) to the assessment of class means; Lane and Sabers (1989) to the evaluation of scoring systems for sample essays; Macready (1983) to diagnostic testing problems; Morrow (1986) to the reliability of anthropometric measures; Staybrook and Corno (1979) to the disattenuation of measurement error in path-analytic approaches; and Violato and Travis (1988) to the assessment of behavior and personality. In addition, the use of G theory has increased rapidly in observational research.
Examples include the evaluation of social and classroom behavior in educational settings (Huysamen, 1990; Lomax, 1982), the assessment of diagnostic accuracy in clinical settings (Morgan, 1988), and the evaluation of motor behavior and sport performances (Godbout & Schutz, 1983; Looney & Heimerdinger, 1991; Ulrich, Ulrich, & Branta, 1988).

While it is evident that the number of research papers employing G theory has greatly increased in recent years, most appear to be limited to applications and to the flexibility of the theory in dealing with measurement problems; very little attention has been given to the presence of sampling error in the estimated G coefficient. A possible reason for the lack of attention in this area is that the standard sources on G theory (e.g., Brennan, 1983; Crocker & Algina, 1986; Cronbach et al., 1972; Shavelson & Webb, 1991), as well as many review and pedagogical papers on G theory (e.g., Cardinet, Tourneur, & Allal, 1981; Shavelson, Webb, & Rowley, 1989), have paid relatively little attention to the fact that the G coefficient is subject to sampling error as well as to violations of the ANOVA assumptions. The major emphasis has been on the concept of randomness and on the use of ANOVA techniques as a means of obtaining variance component estimates (or observed mean squares).

Sampling error of variance components

G theory makes extensive use of ANOVA procedures to estimate variance components. The estimated variance components serve as the basis in G theory for describing and indexing the relative contribution of each source of error and the dependability of a measurement.
However, problems associated with estimating variance components (e.g., negative estimates) are frequently encountered in practice, and several alternative approaches to variance component estimation, such as maximum likelihood, restricted maximum likelihood, nonnegative, and Bayesian estimators, have subsequently been proposed to deal with such problems for both balanced and unbalanced experimental designs (e.g., see the reviews by Khuri & Sahai, 1985; Shavelson & Webb, 1981).

Cronbach et al. (1972) suggested early on that large-scale G studies should be conducted to provide accurate and consistent variance component estimates. They further warned that "... the behavioral scientist is on dangerous ground when he employs estimates of components and coefficients from a G study with the usual modest value of ni and nj, unless he can confidently make assumptions of equivalence, homoscedasticity, and normality." (p. 49). It may be intuitively obvious that in such large-scale G studies the variance component estimates are likely to be stable. However, there are many situations in which sufficient resources are not available to conduct a large-scale preliminary G study. This may be particularly true in ratings, clinical, and observational studies where human judgement provides the necessary measure.
Although G theory was not devised with ratings particularly in mind, the use of the method in observational research has increased rapidly, and most published research in these fields involves relatively small samples with a few conditions in each facet (e.g., Booth, Mitchell, & Solin, 1979; Huysamen, 1990; Lane & Sabers, 1989; Looney & Heimerdinger, 1991; Violato & Travis, 1988).

In response to the increased use of G theory with small samples in the literature, Smith (1978, 1982) conducted empirical simulation studies to examine the sampling properties of variance component estimates with "small" samples (i.e., np = 25, 50, or 100 and ni, nj = 2, 4, or 8, where np is the number of subjects and ni, nj are the numbers of conditions in each facet) under several two-facet designs (i.e., crossed and nested designs). His results indicated that variance component estimates based on small samples are very unstable, resulting in discouragingly wide confidence intervals. Bell (1986) and Smith (1982) further showed that the degree of instability in the variance component estimates depends on the combined relationship among the sample sizes, the magnitude of the variance components, and the design configuration. More recently, Marcoulides (1990) empirically demonstrated that the variance component estimates in one-facet and two-facet designs (np = 25) are sensitive to nonnormal distributional forms. Given these empirical results, it is apparent that the estimation of G coefficients would also be unstable, since a G coefficient is computed from some combination of estimated variance components. However, no further investigations were attempted in these studies to examine the sampling characteristics of G coefficients.
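Smith's point is easy to reproduce. The sketch below repeatedly estimates the person variance component in a one-facet (persons x raters) design at one of Smith's smallest configurations (np = 25 subjects, two conditions); the population components (a person variance of 0.1 against an error variance of 1.0) are arbitrary illustrative choices, not values taken from Smith's study or from this thesis.

```python
# Sampling instability of the ANOVA person variance component estimate with a
# small sample, in the spirit of Smith (1978, 1982). np and k follow Smith's
# smallest configuration; the population variances are arbitrary illustrations.
import random

random.seed(1)
N_P, K, REPS = 25, 2, 2000
VAR_P, VAR_E = 0.1, 1.0              # population person and error components

estimates, negatives = [], 0
for _ in range(REPS):
    # one persons x raters table under the two-way random effects model
    data = []
    for _ in range(N_P):
        p = random.gauss(0.0, VAR_P ** 0.5)
        data.append([p + random.gauss(0.0, VAR_E ** 0.5) for _ in range(K)])
    grand = sum(map(sum, data)) / (N_P * K)
    p_means = [sum(row) / K for row in data]
    r_means = [sum(data[i][r] for i in range(N_P)) / N_P for r in range(K)]
    ss_p = K * sum((m - grand) ** 2 for m in p_means)
    ss_r = N_P * sum((m - grand) ** 2 for m in r_means)
    ss_e = sum((x - grand) ** 2 for row in data for x in row) - ss_p - ss_r
    ms_p = ss_p / (N_P - 1)
    ms_e = ss_e / ((N_P - 1) * (K - 1))
    var_p_hat = (ms_p - ms_e) / K    # ANOVA estimate of the person component
    estimates.append(var_p_hat)
    negatives += var_p_hat < 0       # count inadmissible negative estimates

mean_hat = sum(estimates) / REPS
spread = (sum((e - mean_hat) ** 2 for e in estimates) / (REPS - 1)) ** 0.5
```

Averaged over replications the estimator is close to the population value (it is unbiased), yet its standard deviation exceeds the parameter itself and individual estimates fall below zero a substantial fraction of the time; this is the instability Smith reported.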
Estimation procedures for variance components are presented in Chapter II.

Sampling distribution of G coefficients

Although very little research has focused on the inferential properties of G coefficients, a considerable amount of work has been directed toward examining sampling properties and inferential procedures for reliability coefficients (e.g., Feldt, 1965, 1969, 1980; Hakstian & Whalen, 1976; Kraemer, 1981; Kristof, 1963, 1970; Sedere & Feldt, 1976; Woodruff & Feldt, 1986). The investigations in this area have not dealt with G theory per se, but have mostly addressed the properties of various forms of the intraclass correlation coefficient: Cronbach's coefficient alpha, Kuder-Richardson 20 (KR-20), and the generalized Spearman-Brown formula. These indices are algebraically equivalent to the G coefficient in a one-facet crossed design in generalizability terminology.

What follows is a brief summary of the inferential procedures for reliability coefficients (e.g., Feldt, 1965; Fleiss & Shrout, 1978; Kristof, 1963) as well as those for the G coefficients of various two-facet designs developed by Schroeder and Hakstian (1990).

Inferential procedures for coefficient alpha. Kristof (1963) derived a sampling theory for reliability estimates and demonstrated a method for applying it to hypothesis testing. Feldt (1965) derived similar results based on a two-way random effects ANOVA model and presented a method for constructing (1 - a) probability tolerance limits for the sample estimate in terms of an F-distributed quantity. The derived 100(1 - a)% tolerance interval for the sample estimate provides the basis for describing the distributional properties of the estimate and can be used to compute any percentile point of the estimate's distribution for a known population parameter. Feldt (1969, 1980) extended this work to develop inferential techniques for making comparisons of two independent, as well as two dependent, sample alpha coefficients.
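For reference, a Feldt-type tolerance interval can be written out. Under the two-way random effects model with n persons and k conditions, writing 1 - α̂ = MSe/MSp makes the ratio to the population value an F variate. The statement below is a sketch reconstructed from that derivation, not a transcription from Feldt (1965); some presentations state the reciprocal form, with the degrees of freedom exchanged accordingly.

```latex
% Sketch of a Feldt-type 100(1-a)% tolerance interval for sample alpha,
% given a known population value \alpha_0.
\[
  \frac{1-\hat{\alpha}}{1-\alpha_0} \;\sim\; F\!\left[(n-1)(k-1),\; n-1\right]
\]
\[
  1-(1-\alpha_0)\,F_{1-a/2} \;\le\; \hat{\alpha} \;\le\; 1-(1-\alpha_0)\,F_{a/2}
\]
```

Here F_{a/2} and F_{1-a/2} are quantiles of that F distribution, and the interval delimits the range within which 100(1 - a)% of sample alphas should fall when the population value is alpha_0.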
Woodruff and Feldt (1986) took an extra step to consider the general case involving K dependent coefficients. Using Paulson's (1942) normalizing transformation for an F-variable, they developed a test statistic that is distributed approximately as a chi-square. This test is essentially an extension of that by Hakstian and Whalen (1976), who developed inferential procedures for testing the equality of K independent alpha coefficients.

Inferential procedures for intraclass correlation coefficients. Many of the reliability indices can be viewed as versions of the intraclass correlation, typically a ratio of the variance of interest over the sum of the variance of interest plus error variance. These intraclass correlation coefficients can give, however, quite different results when applied to the same data, depending on the definition of error variance under a particular experimental design (Bartko, 1976; Shrout & Fleiss, 1979). Fleiss and Shrout (1978) and Shrout and Fleiss (1979) formulated six forms of intraclass correlation coefficients under the one- and two-way ANOVA models and presented guidelines for choosing the appropriate form depending on the intent of the study. They also derived approximate 100(1-α)% confidence intervals for these intraclass correlation coefficients using Satterthwaite's (1941, 1946) approximation to the F distribution for a composite of mean squares. Kraemer (1981) also demonstrated procedures for testing the homogeneity of intraclass correlation coefficients based on the sampling theory of Kristof and Feldt. Recently, Alsawalmeh and Feldt (1992) derived, using Satterthwaite's procedure, an approximate statistical test for the hypothesis that two independent intraclass coefficients are equal within the context of a two-way random effects ANOVA model.

Inferential procedures for G coefficients.
Schroeder and Hakstian (1990) extended the work of Feldt (1965, 1969) and of Hakstian and Whalen (1976) to develop inferential procedures for G coefficients for various two-facet designs. In doing so, they first applied Satterthwaite's approximation procedure to a composite of independent mean squares, which is involved in the calculation of G coefficients in a two-facet design. This allowed them to treat the quantity (1 - ^Eρ²)/(1 - Eρ²), which is the ratio of two chi-squared variates, as an approximate F-variate. From this, they took a further step to derive the normalized expression of the quantity (1 - ^Eρ²)^(1/3) using Paulson's (1942) normalizing transformation, and developed an asymptotic variance expression for the estimate (1 - ^Eρ²)^(1/3) by employing the delta method (Rao, 1973, p.387). The resulting variance expressions permit the construction of confidence intervals for a single sample G coefficient and can be applied to develop an inferential procedure for testing the equality of K independent G coefficients under normal theory.

The sampling theory of the G coefficient (including coefficient alpha) and the inferential procedures described above have been developed under conditions in which the underlying ANOVA assumptions are fully met. It is, however, conceivable that "real-world" data will not always fulfill the rigorous underlying assumptions of the models. In addition, unlike test development studies, most observational studies employing G theory rarely involve more than a few conditions (e.g., usually 3 to 5) of each facet, along with rather small or moderate sample sizes. Data collection in these studies is often done in such a way that a rater or a group of raters successively observes the behavior or performance of subjects and numerically codes it using an instrument which yields only a limited number of score values.
Therefore, due to the nature of the data, it is quite likely that violations of ANOVA assumptions will occur in conjunction with limitations of the score scale when such data are subjected to ANOVA procedures. Considerable research has focused on examining the effects of the number of scale points on the estimated reliability coefficient (e.g., Cicchetti, Showalter, & Tyrer, 1985; Jenkins & Taber, 1977; Lissitz & Green, 1975). In general, studies have indicated that the reliability of a test increases with an increasing number of scale points, but in most cases this increase quickly levels off for anything beyond a 5-point scale. However, these studies were mainly concerned with the magnitude of the estimates under various measurement scales, and no attention was given to the sampling variability of the estimates under violation of ANOVA assumptions.

In principle, G theory is based on random effects repeated measures ANOVA models. The assumption of randomness itself does not carry with it the assumption of normality. Most estimation procedures for variance components, and thus mean squares, do not require normality. However, when distributional properties of the resulting estimators are of interest, normality is assumed in the distribution of random effects (Searle, Casella, & McCulloch, 1992). In fact, Scheffé (1959, p.345) demonstrated early on that non-zero kurtosis seriously affects inferences about variances of random effects, although it has little effect on inferences about means.

In the statistical literature, a number of empirical studies on the effects of violating the ANOVA assumptions of normality and homogeneity of variance on Type I error rates have shown that the ANOVA F statistic is generally robust with respect to moderate departures from these assumptions, especially if sample sizes are equal (e.g., Glass, Peckham, & Sanders, 1972; but see Bradley, 1978).
However, ANOVA loses its robustness when the covariance matrix underlying the repeated measures deviates from a certain pattern, referred to as compound symmetry or circularity (see Appendix A for details). Box (1954), for example, showed that violation of this assumption can result in more unstable estimates of the mean squares than would be the case if all observations were independent. Subsequent investigations have also suggested that the inflation in the Type I error rate of the F test introduced by violating the circularity assumption is quite substantial in a variety of specific cases (e.g., Collier, Baker, Mandeville, & Hayes, 1967; Gessaroli & Schutz, 1983; Greenhouse & Geisser, 1959; Huynh, 1978; Huynh & Feldt, 1976; Maxwell & Bray, 1986; Stoloff, 1970; Wilson, 1975).

Knowing that violating ANOVA assumptions will result in unstable estimates of the observed mean squares, it is to be expected that such violations will add variability to the estimates of the G coefficient. This would be especially true in a measurement design involving a limited number of score values, and in such cases the usual interpretations of the estimated G coefficient, and subsequent generalization over the conditions of the universe, could be misleading. In fact, Schroeder and Hakstian (1990) found that G coefficient estimates are highly variable with small samples, especially if the error variances are relatively large.
They suggested that researchers should interpret with caution G coefficients calculated from designs involving small numbers of objects of measurement (i.e., np = 25), since the population value could vary markedly from the observed estimate, ranging from trivially low to impressively high. However, there is very little published research regarding the extent to which the sampling variability of the estimates and/or the magnitude of the sample estimates will be affected by the violation of ANOVA assumptions, especially the circularity assumption.

Feldt (1965, 1969) provided some empirical evidence to support the proposed sampling theory. He concluded that despite certain violations of ANOVA assumptions inherent in binary data, the empirical distribution corresponds, in general, quite closely to the theoretical one, at least when the number of items equals 80. However, Bay (1973, p.56) found, employing a one-facet (n=30 by k=8) design, that non-zero kurtosis of the true score distribution has a substantial influence on the sampling distribution and standard error of reliability estimates, although the influence of the error score distribution is negligible for fairly large k.

By applying Box's work to reliability estimation, Maxwell (1968, p.810) showed analytically that correlated errors in an (n by k) repeated measures ANOVA model will positively bias the estimate of the reliability coefficient. Recently, Smith and Luecht (1992) empirically investigated the effect of correlated errors on the variance component estimates in a one-facet G study design. Their results showed that serially correlated errors led to underestimation of the residual variance component and to overestimation, by a similar amount, of the person variance component. Although they did not examine the G coefficient explicitly, they noted from these results that "Together or separately, these biases will result in an overestimation of the computed generalizability coefficient..." (p.232).
Smith and Luecht further noted that the effects of serially correlated errors are equally likely to be present in designs employing more than a single facet (p.234). However, there appears to be very little, if any, published research regarding the effects of violating compound symmetry or circularity on inferential procedures for G coefficients in either one- or two-facet designs. In fact, Schroeder and Hakstian (1990) are the only ones who stated the assumption of circularity explicitly in their study. In developing inferential procedures for G coefficients in the two-facet designs, they assumed, besides normality, that the covariance matrices for all subsets of first and second facet conditions and their interaction have (local) circularity or sphericity properties, which are necessary to treat the sums of squares in the model as central chi-squared variates. Although they provided some evidence of the insensitivity of the proposed procedures to nonnormality, the effect of noncircularity was not part of their study; they suggested that "... the procedures may also be robust with respect to violation of the local circularity assumption, although at this point we have no proof of this." (p.443).

Purposes of the present study

There is insufficient information in the related literature substantiating the conditions under which the estimated G coefficient would be most variable, and the relative extent to which the estimated G coefficient for categorical data will be affected by a particular sampling condition or a combination of simulated conditions, in comparison to the parent continuous data. Empirical work is certainly needed to provide information about the sampling characteristics of G coefficients under various sampling conditions of G study designs.
Thus, the major focus of the present study is an empirical investigation of the sampling variability of the estimated G coefficients for both categorical and continuous data under violation of the circularity assumption. In doing so, the empirical distributions of the G coefficient estimates under various simulated sampling conditions are obtained for both one- and two-facet designs, and the variabilities of the estimates are compared across simulated conditions within each design to investigate the precision and accuracy of the sample estimates. Then, the empirical percentages of the sample estimates falling beyond the theoretical limits are compared to the corresponding theoretical values to assess the robustness of the proposed sampling theory. For all simulated conditions, empirical results for both categorical and the parent continuous data are obtained and compared in order to assess the degree of relative bias in sample estimates for categorical data, in comparison to the parent continuous data.

CHAPTER TWO: THEORIES AND MATHEMATICAL DEVELOPMENT

This chapter consists of four sections: classical test theory, generalizability theory, variance component estimation, and sampling theory of G coefficients. First, definitions and estimation procedures for various reliability coefficients are presented. Second, the basic concepts and terminology in G theory are briefly reviewed, along with the statistical models for one- and two-facet fully-crossed designs and the formulations of G coefficients. Third, a brief overview of estimation procedures and variance expressions for variance components is given. Lastly, the sampling theory of coefficient alpha (equivalently, a G coefficient for relative decisions in the one-facet crossed design) developed by Feldt (1965) and Kristof (1963) is presented in detail and extended to develop an approximate sampling distribution of the G coefficient for relative decisions in the two-facet fully-crossed design.

A.
Classical test theory

According to the true-score model in classical test theory, an observed score is viewed as a composite of two components -- a theoretical score (true score) and an error score. In symbols:

[2-1]  x = t + e

where x is the observed score, t is the true score, and e is random error.

Fundamental assumptions imposed by classical test theory are:
(a) the true scores are stable over time;
(b) the expected error score is zero: E(e) = 0;
(c) the correlation between error score and true score is zero: r_te = 0 (Ghiselli, Campbell, & Zedeck, 1981).

Consequently, the variance of observed scores is simply the sum of the true and error score variances:

[2-2]  σ²x = σ²t + σ²e.

Given this, the ratio of the true score variance to the observed variance is called the reliability of measure x and can be expressed as:

[2-3]  r_x = σ²t / σ²x.

This ratio can be shown to be equal to the squared correlation between observed and true scores:

[2-4]  r_xt = σ_xt / (σ_t σ_x)
            = σ_(t+e),t / (σ_t σ_x)
            = (σ_tt + σ_te) / (σ_t σ_x)
            = σ²t / (σ_t σ_x)        (since σ_te = 0)
            = σ_t / σ_x.

Thus, squaring the last expression in [2-4] gives the same result as Equation [2-3], which indicates the degree to which test scores are free from errors of measurement.

Parallel Measures

Since the true-score model includes an unobservable element (the true score), in practice the reliability of a measure is often assessed by correlating parallel measurements. By definition (Ghiselli et al., 1981), two measurements are said to be parallel if they have identical true scores and equal error variances. As a result, the means and variances of both measures are also equal. In addition, according to the assumption that errors are independent, it follows that errors associated with parallel measures are not correlated among themselves, nor are they correlated with true scores.
That is, r(e1,e2) = r(e1,t) = r(e2,t) = 0, where the subscripts 1 and 2 denote the parallel measures. Using these assumptions, it can be shown that the correlation between parallel measures is an estimate of the reliability of either one of them. Expressing observed scores on each measure as composites of true and error scores, the correlation between two parallel measures is:

[2-5]  r_x1x2 = σ_x1x2 / (σ_x1 σ_x2)
              = σ_[(t+e1)(t+e2)] / (σ_x1 σ_x2)
              = [σ_tt + σ_te2 + σ_te1 + σ_e1e2] / (σ_x1 σ_x2).

Since the last three terms of the numerator equal zero, and because σ_x1 = σ_x2, r_x1x2 = σ²t / σ²x, which is consistent with the definition of reliability given earlier. Thus, one can estimate the reliability of a measure by administering two parallel measures. However, because of the very restrictive assumptions underlying parallel measures, which are rarely met in practice, the usefulness of this approach is limited. Some researchers have proposed different formulations that may be viewed as variations on the true-score model. Such alternatives are tau-equivalent measures (identical true scores), essentially tau-equivalent measures (true scores differ by a constant), and congeneric measures (error variances, true-score variances, and true-score means need not be equal as long as both measure the same phenomenon). These models are listed in order of decreasing restrictiveness of the assumptions on the measures (Novick & Lewis, 1967; Joreskog, 1971). A comparison among these models, in terms of determining which is better or preferred for estimating reliability, can be carried out with a computer program such as LISREL (Joreskog & Sorbom, 1989).

Alternative Form and Test-Retest

Commonly used procedures that require two test administrations to estimate the reliability of a measure are the alternative-form method and the test-retest method. The former is used to assess the degree of interchangeability between two alternative forms of a test.
It requires a test user to construct two similar forms of a test and administer both forms to the same group of examinees within a very short time period. The correlation coefficient between the two sets of scores, called the coefficient of equivalence, is then computed and taken as an estimate of the test reliability. It is generally suggested that the mean and variance of each form should be quite similar, but ideally the two forms should meet the condition of parallelism as defined above if the coefficient of equivalence is to be interpreted as a reliability estimate (Crocker & Algina, 1986).

The test-retest method is used when a test user is interested in how consistently examinees respond to the same measure at different times. Thus, the same measure is readministered to the same group within a certain time period. The correlation coefficient between the two sets of scores, called the coefficient of stability, is taken as an estimate of the test reliability, as it indicates the degree of consistency. However, the use of the correlation coefficient in the test-retest method as a measure of reliability has been criticized by several researchers (Carmines & Zeller, 1979; Erikson, 1978; Heise, 1969). The correlation between the test and retest scores of the same measure will inevitably be less than perfect because of the temporal instability of measures taken at multiple points in time and because of measurement error. As a result, a simple test-retest correlation is inappropriate for estimating a variable's true reliability, as well as the variable's temporal stability, unless one can assume either that the underlying variable remains perfectly stable or that the variable is measured with perfect reliability (Erikson, 1978). This point is illustrated analytically in the following.
Let x1 and x2 be two test scores. Then the correlation between the two scores can be expressed as follows:

[2-6]  r_x1x2 = σ_x1x2 / (σ_x1 σ_x2)
              = σ_t1t2 / [(σ²t1 + σ²e1)(σ²t2 + σ²e2)]^(1/2).

Substituting (r_t1t2)(σ_t1 σ_t2) for the covariance term in the numerator, σ_t1t2, since r_t1t2 = σ_t1t2 / (σ_t1 σ_t2), yields

[2-7]  r_x1x2 = (r_t1t2)(σ_t1 σ_t2) / [(σ²t1 + σ²e1)(σ²t2 + σ²e2)]^(1/2).

Furthermore, since the reliability of a variable is the true score variance divided by the true score variance plus the error variance, Equation [2-7] can be rewritten as:

[2-8]  r_x1x2 = (r_t1t2)(r_1 r_2)^(1/2)

where r_j is the reliability of x_j, j = 1, 2.

From the above equation, we find that a simple test-retest correlation is inappropriate as a measure of reliability unless one can assume that the underlying variable remains perfectly stable (i.e., r_t1t2 = 1.0). In addition, it is possible that an obtained low correlation in the test-retest case may not indicate that the reliability of the test is low but may, instead, signify that the underlying theoretical concept itself (the true score) has changed (Carmines & Zeller, 1979).

Internal Consistency

Split-Half. The earliest form of the internal consistency approach to the estimation of reliability may be the split-half reliability estimate. The split-half approach may be viewed as a variation of the alternative-form estimate of reliability. The items that comprise a given measure are split in half, and each half is treated as if it were an alternative form of the other, thereby obviating the need to construct two forms of the same measure. A reliability estimate is obtained by correlating scores on the two halves of the measure. In order to estimate the reliability of the original measure, which is twice as long as each half, the split-half correlation is stepped up by the Spearman-Brown formula.
The general Spearman-Brown formula, of which the split-half method is a special case, for estimating the reliability of a measure is:

[2-9]  r_x = k r_12 / [1 + (k - 1) r_12]

where k is the factor by which the instrument is increased or decreased (i.e., k = 2 in the case of the split-half method), and r_x is the estimated reliability of a measure k times longer than the existing one, whose reliability is r_12.

Rulon (1939) proposed a simplified procedure for estimating the reliability coefficient by means of split-halves. This method involves the use of difference scores between the half-tests (i.e., d = a - b, a and b being the examinee's scores on the first half and the second half of the original test, respectively). The variance of the difference scores, σ²d, is then used as an estimate of the error variance, σ²e, in the definitional formula of the reliability coefficient, so that:

[2-10]  r_x = 1 - σ²d / σ²x

where σ²x is the variance of the scores on the total test. The two methods above yield identical results when the variances of the two half-tests are equal. Otherwise, the Spearman-Brown formula yields systematically larger coefficients than Rulon's method. In general, the split-halves method does not yield a unique estimate, since there are many possible ways of dividing a test into halves. If a particular way of splitting a measure into halves happens to be an unlucky one, not parallel, it may result in an underestimate or an overestimate of reliability.

Coefficient Alpha. The logical extension of the split-half approach to estimating the reliability of a measure is to split the measure into as many parts as it has items, thereby avoiding the arbitrariness of splitting the measure into halves. Several approaches to the estimation of internal consistency have been formulated, based on the assumption that all items are measures of the same underlying attribute -- that is, that the test is homogeneous in content.
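Before turning to coefficient alpha, the split-half computations above can be sketched numerically. In this sketch the half-test scores are invented and the function names are mine; Equations [2-9] (with k = 2) and [2-10] are applied to the same split:

```python
import numpy as np

def spearman_brown(r12, k=2):
    """Equation [2-9]: step up the half-test correlation r12."""
    return k * r12 / (1 + (k - 1) * r12)

def rulon(half_a, half_b):
    """Equation [2-10]: 1 - var(d)/var(x), with d = a - b."""
    total = half_a + half_b
    return 1 - np.var(half_a - half_b, ddof=1) / np.var(total, ddof=1)

# scores of six examinees on the two halves of a test (invented data)
a = np.array([10, 12,  9, 15, 11, 14])
b = np.array([11, 13, 10, 14, 10, 15])

r12 = np.corrcoef(a, b)[0, 1]
print(spearman_brown(r12), rulon(a, b))
```

With these data the two estimates differ slightly because the half-test variances differ; with equal half-test variances they coincide, and the Spearman-Brown value is never the smaller of the two.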
The most general form of this approach is known as Cronbach's (1951) coefficient alpha, which can be computed by the formula:

[2-11]  α = [k / (k - 1)] [1 - Σ σ²i / (Σ σ²i + 2 Σ σ_ij)]

where k is the number of items in the measure, and σ²i and σ_ij are the variance of item i and the covariance of any pair of items i and j (with i > j), respectively. Cronbach's alpha is a general form of the Kuder-Richardson 20, or KR-20 (Kuder & Richardson, 1937). Thus, it yields results identical to the KR-20 when items are scored dichotomously. In addition, when all items are standardized, having a mean of zero and a variance of one, it reduces to the Spearman-Brown formula, with the mean of all pairwise inter-item correlations replacing r_12 in Equation [2-9]. Accordingly, Cronbach's alpha can be interpreted as the mean of all possible split-half coefficients of a given measure.

Intraclass Correlations

Within the context of the variance component model of analysis of variance (ANOVA), the intraclass correlation coefficient is derived from the concept of statistical dependence between any two observations x_pi and x_pi′ (i ≠ i′) in the same class (i.e., with the same p) (Scheffé, 1959, p.223). In a two-way array with scores on a test (having i items) for np persons, the observed score of person p on item i, x_pi, is viewed as the sum of four independent components:

[2-12]  x_pi = u + (u_p - u) + (u_i - u) + (x_pi - u_p - u_i + u)
             = u + p + i + e

where u_p is the universe (true) score of person p (the mean of x_pi over all i in the universe); u_i is the mean score on item i over all persons in the population; and u is the mean over both the u_p and the u_i (the grand mean). The p, i, and e are considered to be independently distributed random variables with zero means and their own variances.
Thus, the model renders itself as an additive model in ANOVA, which forces the item variances, σ²i, to be equal and the covariances (or correlations) between items to be equal, leading to a covariance matrix with compound symmetry (Winer, 1971).

Most treatments of reliability based on [2-12] define reliability as the ratio of true score variance (σ²p) to observed score variance (σ²p + σ²e), which is an intraclass correlation coefficient by definition in classical test theory. Using the mean squares and variance components obtained by the ANOVA procedure, the intraclass correlation coefficient for a single item score is:

[2-13]  r_1 = (MS_p - MS_e) / [MS_p + (n_i - 1) MS_e]

and that for the mean test score over all items is:

[2-14]  r = (MS_p - MS_e) / MS_p.

One of the first attempts (among others being Burt, 1955; Ebel, 1951; Horst, 1949) to use the intraclass correlation as an estimate of reliability appears in Hoyt (1941). Hoyt derived Equation [2-14] as a means of estimating reliability in a Persons by Items design. Hoyt related this formula to the theoretical definition of reliability by noting that MS_p represents the observed score variance and MS_e represents the error variance in the theoretical reliability expression [i.e., (σ²x - σ²e) / σ²x].

Hoyt (1941, p.155) noted that ANOVA procedures do not depend on any particular choice in subdividing the items, and that they approximate an average of all the possible correlations that might have been obtained by different ways of assigning items to alternative forms. Therefore, this method of estimating the reliability of a test gives a better estimate than any method based on an arbitrary division of the test into halves or into any other fractional parts.
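Hoyt's mean-square formulation can be checked numerically against Equation [2-11]. A small sketch with invented scores for five persons on four items (function names are mine):

```python
import numpy as np

def cronbach_alpha(x):
    """Equation [2-11], via item variances and total-score variance."""
    k = x.shape[1]
    item_vars = x.var(axis=0, ddof=1).sum()
    total_var = x.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

def hoyt(x):
    """Equation [2-14]: (MS_p - MS_e) / MS_p from a Persons x Items table."""
    n, k = x.shape
    ms_p = k * x.mean(axis=1).var(ddof=1)
    resid = x - x.mean(axis=1, keepdims=True) - x.mean(axis=0) + x.mean()
    ms_e = (resid ** 2).sum() / ((n - 1) * (k - 1))
    return (ms_p - ms_e) / ms_p

scores = np.array([[3, 4, 3, 5],
                   [2, 2, 3, 2],
                   [4, 5, 4, 4],
                   [1, 2, 1, 2],
                   [3, 3, 4, 4]], dtype=float)
print(cronbach_alpha(scores), hoyt(scores))   # identical values
```

Because [2-14] and [2-11] are algebraically the same quantity, the two functions agree to machine precision on any complete Persons x Items table.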
Although Hoyt drew attention to its application only to the case where items are scored dichotomously, Equation [2-14] yields results identical to Equation [2-11], Cronbach's alpha, as well as to KR-20. Numerous papers on reliability subsequently made use of the ANOVA procedure or the closely related intraclass correlation. Various definitions and procedures were formulated, each defining the measurement error in somewhat different ways and thus estimating different aspects of reliability, depending on the purpose of the study under consideration (e.g., Algina, 1978; Bartko, 1976; Berk, 1979; Fleiss, 1975; Lahey, Downey, & Saal, 1983; Mitchell, 1979; Shrout & Fleiss, 1979).

B. Generalizability (G) theory

It has been shown that reliability can be defined as the correlation between parallel measures, as the squared correlation between the true score and the observed score, or as the ratio of the true score variance to the observed score variance (i.e., r_x1x2 = r²xt = σ²t/σ²x). It has also been shown that the different formulations for the estimation of reliability lead essentially to the same results. Although they appear different in form due to different theoretical orientations, they all share the same underlying concepts of classical test theory.

The concept of parallelism, or equivalence of measures, has been criticized as a major limitation of classical test theory, since parallel measures are very difficult to construct and are often unattainable in practice. As a result, Cronbach, Rajaratnam, and Gleser (1963) proposed generalizability theory as an alternative to classical theory that does not rely upon its restrictive assumptions.
Generalizability theory (hereafter G theory) extends in some respects the concept of the intraclass correlation to the estimation of reliability. G theory offers a comprehensive set of concepts and estimation procedures for various measurement designs by making use of various factorial designs in ANOVA in which the conditions of observation are classified in several respects.

In the following, concepts and terminology in G theory are presented briefly, followed by the statistical models underlying G theory and the formulation of G coefficients for the one-facet and two-facet fully-crossed measurement designs. Throughout the manuscript, the symbols G1 and G2, instead of Eρ², are used to denote the population G coefficient for notational convenience, with ^G1 and ^G2 for the sample estimates. The subscript indicates the number of facets involved in the measurement design.

Basic Concepts and Terminology in G theory

Object of measurement. The object of measurement in G theory is the element of the study about which one wishes to make judgments. In most applications of G theory, persons are the object of measurement, but it can be any population of objects other than persons.

Universe of admissible observations. The universe of admissible observations is defined by all possible combinations of the conditions that theoretically could be included in a G study. The variation of these conditions is central to the study. Thus, G theory requires one to specify a universe of conditions of observation over which one wishes to generalize. Related to the concept of universe is the concept of universe score. The universe score is viewed as the mean score for an object of measurement over all conditions in the universe of generalization, analogous to the notion of true score in classical test theory.

G and D studies. G theory draws a distinction between a G study and a decision study (D study).
The purpose of the G study is to obtain as much information as possible about the sources of variation in the measurement. Therefore, the G study should ideally define the universe of admissible observations as broadly as possible. The purpose of the D study is to make decisions about the object of measurement. The D study makes use of the information provided by the G study to design the best possible application of the measurement for a particular purpose. In planning a D study, the decision maker defines a universe of generalization over which the scores are to be generalized. As well, using information from the G study about the magnitude of the various sources of measurement error, the decision maker evaluates the effectiveness of alternative designs (i.e., nested or fixed) in order to optimize reliability. In practice, however, the same data are usually used for both the G and D studies; in this case the G and D studies are the same.

Facet and Condition. The design of a measurement procedure implies the specification of the sources of error affecting the measurement (e.g., judges, occasions), which are called facets. The term facet is analogous to the term factor in ANOVA terminology, but the subject factor is not considered a facet in G theory. The conditions (cf. levels of a factor in ANOVA) representing these facets usually constitute a random sample from the predefined universe of conditions. Variability in the measurement due to a facet or an interaction among facets is defined as error variance, whereas the variability among individuals over all objects of measurement is defined as universe score variance (Cronbach et al., 1972, p.15). Brennan (1983, pp.16-18) further clarifies that a set of randomly sampled conditions is just one of an infinite number of sets from the universe.
Thus, in G theory, randomly parallel tests, for example, can have different means, and the between-test variance is therefore generally not zero, since any test may consist of an especially easy or difficult set of items relative to the entire universe of items.

Relative and absolute decisions. In G theory, how generalizable a measure is depends on how the data will be used in the D study. In the D study, one of two kinds of decisions is made. A relative decision is made when interest attaches to the standing of individuals relative to one another. In contrast, an absolute decision is made when the concern is with how well a person's observed score estimates the universe score for that person, without regard to the performance of others. The variance components contributing to measurement error are somewhat different for relative and absolute decisions. For relative decisions, those variance components that interact with the object of measurement, and thus influence the relative standing of individuals, contribute to error. For example, in a one-facet Persons-by-Raters design, systematic disagreement between the raters would not introduce error into the estimation of a person's universe score relative to the average universe score for all persons. Therefore, the variance component due to the Rater facet does not contribute to error in a relative decision. For absolute decisions, all variance components except that for the object of measurement contribute to measurement error; these include all interactions and the facet main effects.

Statistical models underlying G theory

One-facet crossed design. Consider a measurement design where, for example, n persons are observed by k raters in an observational setting, the n persons and k raters being random samples from their respective populations. The resulting scores from such a design can be arranged in a two-way (n x k) array.
A two-way (Persons by Raters) random effects ANOVA interaction model can be used to partition observed scores into their effects, which can be written as:

[2-15]  x_ij = u + p_i + r_j + pr_ij + e_ij,  (i = 1,...,n; j = 1,...,k).

Note that because there is a single observation per cell, an extra subscript for the number of entries in each cell is not used. In this model, p_i is the effect due to Person i, r_j is the effect due to Rater j, pr_ij is the interaction effect between the two, and e_ij is the residual error. All the effects in [2-15], except for the grand mean u, are assumed to be random variables with zero means and their own variances, and all pairwise covariances are zero (Searle, Casella, & McCulloch, 1992):

E(p_i) = E(r_j) = E(pr_ij) = E(e_ij) = 0;
var(p_i) = E(p_i²) = σ²_p;
cov(p_i, p_i') = E(p_i p_i') = 0 for all i ≠ i';
cov(p_i, r_j) = cov(p_i, pr_ij) = cov(p_i, e_ij) = 0.

The symbol E denotes expectation. Similar statements can be made for the remaining terms. Thus, the variance of an observed score can be expressed as:

[2-16]  σ²_x = σ²_p + σ²_r + σ²_pr + σ²_e.

The variance component for persons (σ²_p), or universe-score variance, for example, can be obtained by taking E(u_p - u)², which is the average (over the population of persons) of the squared deviations of the persons' universe scores from the grand mean (Shavelson & Webb, 1991). The other terms can be defined similarly. In practice, the estimates of the variance components are usually obtained by solving the expected mean square expressions from ANOVA procedures. Searle et al. (1992, p.129) showed that the ANOVA estimators of variance components with balanced data are unbiased and have the smallest variance of all estimators that are both quadratic functions of the observations and unbiased.
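To make this estimation concrete, the following sketch (not from the thesis; all names and values are illustrative) simulates data from a model like [2-15], with the p x r interaction folded into the residual, and solves the expected mean square equations for the variance components:

```python
import numpy as np

rng = np.random.default_rng(42)
n, k = 200, 5                                   # persons, raters
sigma2_p, sigma2_r, sigma2_e = 4.0, 1.0, 1.0    # true variance components

# x_ij = u + p_i + r_j + e_ij (interaction folded into the residual term)
p = rng.normal(0.0, np.sqrt(sigma2_p), size=(n, 1))
r = rng.normal(0.0, np.sqrt(sigma2_r), size=(1, k))
x = 10.0 + p + r + rng.normal(0.0, np.sqrt(sigma2_e), size=(n, k))

grand = x.mean()
row_mean = x.mean(axis=1, keepdims=True)        # person means
col_mean = x.mean(axis=0, keepdims=True)        # rater means

ss_p = k * ((row_mean - grand) ** 2).sum()
ss_r = n * ((col_mean - grand) ** 2).sum()
ss_e = ((x - row_mean - col_mean + grand) ** 2).sum()

ms_p = ss_p / (n - 1)
ms_r = ss_r / (k - 1)
ms_e = ss_e / ((n - 1) * (k - 1))

# Solve the EMS equations: E(MS_p) = sigma2_e + k*sigma2_p, E(MS_e) = sigma2_e
sigma2_p_hat = (ms_p - ms_e) / k
sigma2_e_hat = ms_e
```

For balanced data the decomposition SS_total = SS_p + SS_r + SS_e holds exactly, which provides a quick internal check on the computation.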
Searle et al. also noted that ANOVA estimation of variance components does not demand any normality assumptions for the error term or for the random effects unless the distributional properties are of interest.

The statistical model in [2-15] is what Cronbach et al. (1963) used to formulate generalizability theory in an attempt to break away from the restrictive assumptions of classical test theory. Following the work of Cornfield and Tukey (1956), Cronbach et al. (1963) derived the expected mean square expressions for each effect in the model. Furthermore, they combined the variance components σ²_pr and σ²_e, stating that "... with one observation per cell, it is impossible to separate the interaction component of variance from the within-cell residual" (p.150). This allowed them to derive a G coefficient that is comparable with a reliability coefficient and coefficient alpha.

Huck (1978) demonstrated the application of Tukey's (1949) 'one degree of freedom for non-additivity' in estimating reliability by decomposing the interaction effect from the error term. However, this method may not be relevant in an observational study. For example, when a group of judges independently rates the behaviors of n persons (e.g., class or social activities, live or taped sport performances), each person has a true score which must remain constant across judges. Therefore, any interaction between the judges and subjects in this case should be considered a consequence of inconsistency among the raters themselves and thus be part of the measurement error. Under this presumption, the p x r interaction term drops out of the model [2-15]. The model without the pr_ij term is then the two-way additive, rather than nonadditive, ANOVA model with a single observation per cell (Winer, 1971, p.394).
Nonetheless, with a single observation per cell the expected mean square expressions for the nonadditive model with Cronbach's modification and for the additive model are structurally identical; they are presented in Table 2-1.

Table 2-1
The two-way (Persons by Raters) random effects ANOVA model with a single observation per cell

Source of variability   df           Mean square   Expected mean square
Person (p)              n-1          MS_p          σ²_e + k σ²_p
Rater (r)               k-1          MS_r          σ²_e + n σ²_r
pr.e (e)                (n-1)(k-1)   MS_e          σ²_e

Note: the term pr.e reflects the combined effect of pr_ij and e_ij, following Cronbach's modification of model [2-15].

A generalizability coefficient for relative decisions is defined as the ratio of the universe score variance to the expected observed score variance, which is a form of intraclass correlation coefficient. The G coefficient, like the reliability coefficient, reflects the proportion of variability in individuals' scores (i.e., the object of measurement) that is systematic. The population G coefficient for the one-facet design, expressed in terms of variance components, is:

[2-17]  G_1 = σ²_p / (σ²_p + σ²_e/k),

and expressing the estimate of G_1, ^G_1, in terms of the estimated mean squares yields:

[2-18]  ^G_1 = (MS_p - MS_e) / MS_p.

This formula is exactly identical to Hoyt's formula for a reliability coefficient presented in the previous section, which in turn gives the same result as Cronbach's alpha for any metric, and as the KR-20 when items are scored dichotomously. In addition, it can be shown that the formula [2-17] yields the same result as the generalized Spearman-Brown formula. For example, as noted previously, the intraclass correlation is defined as σ_ii' / σ²_x = σ²_p / (σ²_e + σ²_p), which indicates the degree of statistical dependence between two conditions within the same subject.
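The numerical identity between [2-18] and coefficient alpha is easy to check; a minimal sketch (illustrative data, not from the thesis) computes both from the same n x k score matrix:

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 50, 4                       # persons, raters
# Person effect shared across raters makes the k columns correlated
x = rng.normal(0, 1, (n, 1)) + rng.normal(0, 0.8, (n, k))

grand = x.mean()
row_mean = x.mean(axis=1, keepdims=True)
col_mean = x.mean(axis=0, keepdims=True)
ms_p = k * ((row_mean - grand) ** 2).sum() / (n - 1)
ms_e = ((x - row_mean - col_mean + grand) ** 2).sum() / ((n - 1) * (k - 1))

g1 = (ms_p - ms_e) / ms_p                           # Equation [2-18] (Hoyt)

S = np.cov(x, rowvar=False)                         # k x k covariance matrix
alpha = (k / (k - 1)) * (1 - np.trace(S) / S.sum()) # Cronbach's alpha
```

The two quantities agree exactly (not just approximately), because MS_p and MS_e can be written in terms of the mean variance and mean covariance of the raters, as shown next in the text.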
This follows from the fact that the expected variance of any condition i is σ²_i = σ²_e + σ²_p, and the expected covariance over pairs of conditions within the same subject is σ(p_i, p_i') = σ_ii' = σ²_p. From this, it may be shown that the expectation of all of the observed variances (mean squares) in the analysis of variance can be expressed in terms of parameters of the variance-covariance matrix. Winer (1971) showed, for example, that (hereafter writing n_p for the number of persons n and n_r for the number of raters k)

MS_person = var + (n_r - 1) cov, and
MS_error = var - cov,

where var and cov are the means of the variances and covariances, respectively, in the n_r by n_r variance-covariance matrix. Therefore,

E(MS_p) = E[var + (n_r - 1) cov]
        = σ²_i + (n_r - 1) σ²_p
        = σ²_e + σ²_p + n_r σ²_p - σ²_p
        = σ²_e + n_r σ²_p, and

E(MS_e) = E[var - cov]
        = σ²_i - σ²_p
        = [σ²_e + σ²_p] - σ²_p
        = σ²_e.

Applying these terms in Equation [2-17] yields:

[2-19]  G_1 = n_r cov / [var + (n_r - 1) cov],

which is identical in form to the Spearman-Brown formula.

A G coefficient for absolute decisions for the one-facet design is defined as the ratio of the universe score variance to the total observed variance. Brennan and Kane (1977) called this an index of dependability, which can be defined as

[2-20]  G_abs = σ²_p / [σ²_p + (σ²_r + σ²_e)/n_r],

and expressing the estimate of G_abs in terms of obtained mean squares, with some algebra, as:

[2-21]  ^G_abs = (MS_p - MS_e) / (MS_p + [MS_r - MS_e]/n_p).

This coefficient, which counts the variability of the raters as error, has been given special attention in observational and clinical studies where an absolute decision is often required (Berk, 1979; Booth et al., 1979, 1983; Brennan & Kane, 1977; Huysamen, 1990; Mitchell, 1979; Lomax, 1982). Since the classical test theory notion of parallelism requires the means across conditions (i.e., raters) to be equal, the rater effect is assumed to be zero; if not, classical theory cannot formally distinguish between the error variance and the rater variance.
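Both one-facet coefficients can be computed directly from the three mean squares; a small sketch (function names and input values are ours, for illustration), with a second route through the variance components as a cross-check of [2-18] and [2-21]:

```python
def g_coefficients(ms_p, ms_r, ms_e, n_p, n_r):
    """One-facet G coefficients from ANOVA mean squares.

    Relative decisions per Equation [2-18]; absolute per Equation [2-21].
    """
    g1 = (ms_p - ms_e) / ms_p
    g_abs = (ms_p - ms_e) / (ms_p + (ms_r - ms_e) / n_p)
    return g1, g_abs

def g_from_components(ms_p, ms_r, ms_e, n_p, n_r):
    """Same coefficients via the variance components ([2-17] and [2-20])."""
    s2_p = (ms_p - ms_e) / n_r
    s2_r = (ms_r - ms_e) / n_p
    s2_e = ms_e
    g1 = s2_p / (s2_p + s2_e / n_r)
    g_abs = s2_p / (s2_p + (s2_r + s2_e) / n_r)
    return g1, g_abs
```

With, say, MS_p = 21, MS_r = 11, MS_e = 1, n_p = 30, n_r = 5, both routes give identical values, since the mean-square forms are exact algebraic rearrangements of the component forms.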
However, in G theory, randomly parallel conditions (Brennan, 1983) can have different means, and thus the rater variance is generally not zero. Consequently, there is no equivalent formula in classical reliability theory to the G coefficient for absolute decisions.

Two-facet fully-crossed design. The partitioning of observed scores into their effects, and the decomposition of the variance of observed scores into variance components for the separate effects, can easily be extended to measurement designs with additional facets. Consider, for example, the two-facet, fully-crossed design where n persons are observed by r raters on o different occasions. Persons here are the object of measurement; raters and occasions constitute sources of unwanted variation in the measurement. Persons, raters, and occasions are considered to be randomly sampled from their respective populations or universes. The ANOVA model for this design is a three-way (Persons by Occasions by Raters) random effects ANOVA model with a single observation per cell:

[2-22]  x_ijk = u + p_i + o_j + r_k + po_ij + pr_ik + or_jk + por_ijk.e.

In this model, p_i is the effect due to Person i (i = 1,2,...,n_p), o_j is the effect due to Occasion j of the first facet (j = 1,2,...,n_o), and r_k is the effect due to Rater k of the second facet (k = 1,2,...,n_r). The por_ijk.e term reflects the combined effect of the three-way interaction and the residual. The remaining terms in [2-22] represent two-way interaction effects. As in the one-facet design, all the effects in [2-22], except for the grand mean u, are assumed to be random variables with zero means and their own variances, and all pairwise covariances are zero.
Normality assumptions on the effects are added when the distributional properties are of interest. Given the independence of the components in [2-22], the variance of observed scores can be decomposed into variance components for each effect as:

[2-23]  σ²_x = σ²_p + σ²_r + σ²_o + σ²_pr + σ²_po + σ²_ro + σ²_por.e.

For the G coefficient for relative decisions, all variance components representing interactions with the object of measurement (i.e., persons) contribute to the unwanted variation in the measurement. Thus, in the two-facet design, the population G coefficient is defined as:

[2-24]  G_2 = σ²_p / (σ²_p + σ²_po/n_o + σ²_pr/n_r + σ²_e/(n_o n_r)).

Note that the symbol σ²_e is used for the σ²_por.e term. The estimation of the variance components in [2-24] leads to an estimate of G_2, and it is usually done by ANOVA procedures. Table 2-2 presents the expected mean square expressions for this design from the ANOVA procedure. These expressions can be solved to obtain estimates of each variance component.
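Solving the expected mean square equations of Table 2-2 amounts to simple back-substitution; a sketch (the function and its input names are ours, for illustration):

```python
def two_facet_components(ms, n_p, n_o, n_r):
    """Variance components for the fully-crossed p x o x r random model.

    `ms` maps source labels ('p', 'o', 'r', 'po', 'pr', 'ro', 'e') to their
    observed mean squares; the equations follow Table 2-2.
    """
    s2 = {}
    s2['e'] = ms['e']
    s2['po'] = (ms['po'] - ms['e']) / n_r
    s2['pr'] = (ms['pr'] - ms['e']) / n_o
    s2['ro'] = (ms['ro'] - ms['e']) / n_p
    s2['p'] = (ms['p'] - ms['po'] - ms['pr'] + ms['e']) / (n_o * n_r)
    s2['o'] = (ms['o'] - ms['po'] - ms['ro'] + ms['e']) / (n_p * n_r)
    s2['r'] = (ms['r'] - ms['pr'] - ms['ro'] + ms['e']) / (n_p * n_o)
    return s2
```

Substituting the resulting components into Equation [2-24] reproduces the mean-square form of the estimate in Equation [2-25] exactly, since the denominator of [2-24] collapses to MS_p/(n_o n_r).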
It is, however, more convenient to express the estimate of G_2, ^G_2, in terms of the observed mean squares as:

[2-25]  ^G_2 = (MS_p - MS_po - MS_pr + MS_e) / MS_p.

A G coefficient for absolute decisions for the two-facet design can be defined as:

[2-26]  G_abs = σ²_p / (σ²_p + σ²_r/n_r + σ²_o/n_o + σ²_pr/n_r + σ²_po/n_o + σ²_ro/(n_r n_o) + σ²_e/(n_r n_o)).

The estimate of G_abs could be expressed in terms of the observed mean squares presented in Table 2-2, but its complexity of form would remain the same.

Table 2-2
The three-way (Persons by Occasions by Raters) random effects ANOVA model with a single observation per cell

Source        VC       MS       EMS
Person (p)    σ²_p     MS_p     σ²_e + n_r σ²_po + n_o σ²_pr + n_o n_r σ²_p
Rater (r)     σ²_r     MS_r     σ²_e + n_p σ²_ro + n_o σ²_pr + n_p n_o σ²_r
Occasion (o)  σ²_o     MS_o     σ²_e + n_r σ²_po + n_p σ²_ro + n_p n_r σ²_o
pr            σ²_pr    MS_pr    σ²_e + n_o σ²_pr
po            σ²_po    MS_po    σ²_e + n_r σ²_po
ro            σ²_ro    MS_ro    σ²_e + n_p σ²_ro
por.e (e)     σ²_e     MS_e     σ²_e

Note: VC = variance components; MS = mean squares; and EMS = expected mean squares.

C. Variance component estimation

In most cases a variance component estimate is computed using a linear combination of the available mean squares (MS) divided by a constant:

[2-27]  ^σ²_i = [Σ a_i MS_i] / c_i,

where a_i = ±1, MS_i is the i-th (i = 1,2,...,m) mean square, and c_i is the constant associated with the variance component σ²_i.

Under the normality and independence assumptions of the random effects model with a balanced design, it is known that f_i (MS_i)/EMS_i is distributed as a chi-square (χ²) variable with f_i degrees of freedom, and the MS's are independent of one another (Searle, 1971, p.409). Therefore, the sampling variability of variance component estimates can be considered a linear combination of χ²-variables. For any mean square in the model,

[2-28]  MS_i = (EMS_i / f_i) χ²_fi.

Thus,

[2-29]  var(MS_i) = (EMS_i / f_i)² var(χ²_fi) = 2(EMS_i)² / f_i,

since var(χ²_f) = 2f, where the symbol E denotes expectation and 'var' denotes variance.
Therefore,

[2-30]  var(^σ²_i) = (1/c²_i) Σ var(MS_i) = Σ [2(EMS_i)² / (c²_i f_i)].

It is more convenient to express the above derivations in general form using matrix notation (e.g., Brennan, 1983; Searle, 1971; Winer, 1971). Let m be a k-by-1 column vector of the mean squares in the design, having the same order as σ², the vector of variance components in the model. Suppose P is such that

E(m) = P σ².

P is a k by k (nonsingular) matrix of the coefficients of the variance components in the expected mean square expressions for the model. Then the ANOVA estimator of σ² is ^σ², obtained from m = P ^σ² as

[2-31]  ^σ² = P⁻¹ m,

which is a k by 1 column vector whose elements are unbiased, because (Searle et al., 1992)

[2-32]  E(^σ²) = P⁻¹ E(m) = P⁻¹ P σ² = σ².

From [2-31], the variance of ^σ² can be expressed as:

[2-33]  var(^σ²) = P⁻¹ var(m) P⁻¹',

and with the last expression of [2-29], it can be rewritten as:

[2-34]  var(^σ²) = P⁻¹ [diag 2(EMS_i)²/f_i] P⁻¹'.

Although the mean squares in the balanced design are distributed independently of one another, the estimated variance components are themselves subject to sampling variability. Furthermore, two estimated variance components are generally correlated, unless no common mean square is used in estimating the two components. Assuming a multivariate normal distribution for the score effects, the variance-covariance matrix associated with the estimated variance components in ^σ² is

[2-35]  V = P⁻¹ D P⁻¹',

where D is a k by k diagonal matrix containing the variances of the MS_i expressed in Equation [2-29]. Equation [2-35] is a theoretical expression for the population values; in practice, the estimate ^V of V can be obtained by substituting the observed mean squares [i.e., 2(MS_i)²/f_i] into the diagonal of D in place of the corresponding expected values. However, Searle (1971, p.417) showed that the elements of ^V so obtained are biased.
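For the one-facet design of Table 2-1, the matrix expressions [2-31] and [2-35] can be sketched numerically as follows (all mean-square values are illustrative, not from the thesis):

```python
import numpy as np

n, k = 30, 5                                      # persons, raters
f = np.array([n - 1, k - 1, (n - 1) * (k - 1)])   # df for MS_p, MS_r, MS_e

# Rows: EMS_p, EMS_r, EMS_e; columns: sigma2_p, sigma2_r, sigma2_e (Table 2-1)
P = np.array([[k, 0.0, 1.0],
              [0.0, n, 1.0],
              [0.0, 0.0, 1.0]])

m = np.array([21.0, 11.0, 1.0])      # observed mean squares (illustrative)
sigma2_hat = np.linalg.solve(P, m)   # Equation [2-31]

ems = m                              # plug-in values for the expected MS
D = np.diag(2 * ems ** 2 / f)        # Equation [2-29] on the diagonal
P_inv = np.linalg.inv(P)
V = P_inv @ D @ P_inv.T              # Equation [2-35]
se = np.sqrt(np.diag(V))             # standard errors of the components
```

Here V[0,0] equals the closed-form var(^σ²_p) = [2(EMS_p)²/f_p + 2(EMS_e)²/f_e]/k², since ^σ²_p = (MS_p - MS_e)/k is a combination of two independent mean squares.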
Searle showed that an unbiased estimate of V can be obtained by replacing f_i with (f_i + 2) in the diagonal of D. For example, by definition var(MS) = E(MS²) - (EMS)², from which it follows that

[2-36]  E(MS²) = (EMS)² + var(MS)
              = (EMS)² + [2(EMS)²/f]
              = [(f + 2)(EMS)²] / f.

Therefore, the unbiased estimator of 2(EMS)²/f is 2(MS²)/(f + 2). The square roots of the diagonal elements of ^V are the estimated standard errors of the variance components, which may be used to construct a confidence interval of interest. The variance components are essentially linear functions of mean squares, and the exact distributional properties of a composite of mean squares are too complicated to be of practical utility (e.g., Burdick & Graybill, 1988; Fleiss, 1971). However, Satterthwaite (1941, 1946) suggested that the sampling distribution of a linear combination of mean squares can be approximated by a chi-square distribution in which the number of degrees of freedom is chosen so as to provide good agreement between the two.

Several approximation procedures for constructing confidence intervals for variance components have been proposed, and these (about ten procedures, including Satterthwaite's) were empirically compared by Boardman (1974). More recent work on variance component analysis, and on confidence intervals for a linear combination or a ratio of variance components in both balanced and unbalanced random models, is thoroughly reviewed in Burdick and Graybill (1988) and in Khuri and Sahai (1985). In addition, two bibliographies on this subject, Sahai (1979) and Sahai, Khuri, and Kapadia (1985), provide comprehensive coverage of variance components and related topics.

In practice, the variance components are usually estimated through an ANOVA procedure.
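The (f + 2) correction in [2-36] can be seen in a quick simulation; a sketch (values illustrative), drawing mean squares from the chi-square representation [2-28]:

```python
import numpy as np

rng = np.random.default_rng(1)
ems, f, reps = 10.0, 8, 200_000

# MS = EMS * chi2_f / f, per Equation [2-28]
ms = ems * rng.chisquare(f, size=reps) / f

true_var = 2 * ems ** 2 / f                 # var(MS) = 2(EMS)^2/f, Eq. [2-29]
naive = np.mean(2 * ms ** 2 / f)            # plug-in estimator: biased upward
corrected = np.mean(2 * ms ** 2 / (f + 2))  # Searle's unbiased version
```

On average the plug-in estimator overshoots the target 2(EMS)²/f by the factor (f + 2)/f, while the corrected estimator centers on it.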
As is almost always the case with real-world data, the estimates of the mean squares, and thus of the variance components, are subject to sampling error. This is particularly true with small samples. Smith (1978, 1982), for example, conducted empirical studies investigating the sampling error of variance component estimates based on small samples with few conditions for each facet. His results revealed that the variance component estimates are unstable and sometimes negative. He further noted that the confidence intervals for variance component estimates with a small number of facet conditions are discouragingly wide.

Negative variance estimates are not uncommon in practice (e.g., Khuri & Sahai, 1985; Shavelson & Webb, 1981; Verdooren, 1982). Several methods have been proposed to treat such a negative estimate (e.g., Brennan, 1983; Cronbach et al., 1972; Searle, 1971). The most common method is to set the negative estimate to zero and carry the zero through wherever that variance component enters the expected mean square of another variance component. Another option is to set the negative estimate to zero in reporting, but to use the negative value wherever that component enters the expected mean squares of another variance component.

As an alternative approach to variance component estimation, Shavelson and Webb (1981) reviewed a Bayesian approach, but they concluded that this approach is not well enough developed to have widespread applicability. More recently, Marcoulides (1990) empirically examined the performance of restricted maximum likelihood estimation (RMLE) and compared it with ANOVA estimates. In most of his simulations, RMLE provided estimates of the variance components that were more stable and closer to the true parameters than those from the least squares estimation of ANOVA.
He also found that ANOVA estimates were more sensitive to nonnormal distributional form and produced incorrect estimates to a consistently greater degree than RMLE. Only in balanced data sets from normal distributions did the two methods perform similarly. However, he concluded that although the sampling variability of RMLE estimates is smaller than that of the ANOVA procedure, it is still quite sizeable; unfortunately, RMLE does not completely solve the problem of large sampling variability.

D. Sampling theory of G coefficients

As shown in the previous sections, the formulation of the G coefficient for relative decisions in the one-facet design is exactly identical to Hoyt's formula for a reliability coefficient, which in turn gives the same result as Cronbach's alpha. Therefore, although the sampling theory derived by Feldt and Kristof was for coefficient alpha, we use it here in the context of a G coefficient for the one-facet design. We then extend Feldt's approach to develop an approximate sampling distribution of the G coefficient for the two-facet fully-crossed design using Satterthwaite's (1946) approximation procedure. The reason for using Satterthwaite's procedure in the present study is that it is the most commonly used method and generally works well in many applications for constructing confidence intervals on a sum or ratio of variance components. Some researchers have reported that it provides somewhat liberal intervals under certain conditions and have suggested modified or new procedures, but the complexities of the proposed methods are overwhelming (e.g., Birch, Burdick, & Ting, 1990; Burdick & Graybill, 1988). An additional reason for using Satterthwaite's procedure lies in its simplicity. In addition, several simulation studies have found that the quasi F is an acceptable approximation to the conventional F as long as the total degrees of freedom are relatively large (e.g., Davenport & Webster, 1973; Gaylor & Hopper, 1969).
Hudson and Krutchkoff (1968) also found that when the total number of observations was 64 or greater, no negative values of the quasi F were observed in 2000 simulations.

Following the presentation of the sampling theory of G coefficients, the amount of bias in the sample estimate for the one-facet design derived by Kristof (1963) is presented. Additionally, the derivation of an unbiased estimator for the population G coefficient is presented and extended to the two-facet design. A subsequent modification is made in the sampling distribution expressions for the unbiased estimator. Then, variance expressions for the estimated G coefficient for the one- and two-facet designs are presented. Finally, the application of this theory is illustrated by constructing a 100(1-a)% confidence interval for the population parameter as well as a (1-a) probability tolerance interval for the sample estimate.

One-facet design. In deriving the sampling distribution of the estimated G coefficient for relative decisions, we start by rearranging Equation [2-18] as:

[2-37]  ^G_1 = (MS_p - MS_e)/MS_p = 1 - 1/(MS_p/MS_e).

From this expression, it is evident that the sampling distribution of ^G_1 can be defined by deriving the sampling distribution of MS_p/MS_e. Suppose SS is a sum of squares on f degrees of freedom, and MS is the corresponding mean square. Under the normality assumptions, the quantity SS/E(MS) = f MS/E(MS) is distributed as a chi-squared variable with f degrees of freedom (Searle, Casella, & McCulloch, 1992, p.131). From this relationship, it can be shown that

[2-38]  f_p MS_p / E(MS_p) = χ² with df = f_p;

thus,

        MS_p / (n_r σ²_p + σ²_e) = χ²_fp / f_p  with df = f_p = (n_p - 1).

Similarly,

[2-39]  MS_e / σ²_e = χ²_fe / f_e  with df = f_e = (n_p - 1)(n_r - 1).

According to Craig's theorem (1938), these chi-squares are independent of one another.
In addition, the ratio of two independent χ²-variables, each divided by its degrees of freedom, has an F-distribution (Searle et al., 1992, p.465). Therefore, the ratio of the two quantities in [2-38] and [2-39], namely

[2-40]  [MS_p / (n_r σ²_p + σ²_e)] / [MS_e / σ²_e],

is distributed as a central F with f_p = (n_p - 1) and f_e = (n_p - 1)(n_r - 1) degrees of freedom. Rearrangement of this ratio yields:

[2-41]  (MS_p / MS_e) [σ²_e / (n_r σ²_p + σ²_e)] = F(n_p - 1), (n_p - 1)(n_r - 1).

Denoting the ratio MS_p/MS_e as F_obs, meaning the "observed F", and (n_r σ²_p + σ²_e)/σ²_e, which equals EMS_p/EMS_e, as F_pop, meaning the "population F", Equation [2-41] may be rewritten as:

[2-42]  F_obs (1/F_pop) = F(n_p - 1), (n_p - 1)(n_r - 1).

Since F_obs = 1/(1 - ^G_1) and F_pop = 1/(1 - G_1), Equation [2-42] can be rewritten as:

[2-43]  (1 - G_1)/(1 - ^G_1) = F(n_p - 1), (n_p - 1)(n_r - 1).

Feldt (1965, p.362) noted that regardless of whether the variance component σ²_p is zero or greater than zero, the ratio in [2-43] is distributed as a central F with (n_p - 1) and (n_p - 1)(n_r - 1) degrees of freedom. This sampling property of ^G_1 in [2-43] can be used to derive a variance expression for ^G_1 as well as to define a critical region in inferential applications for an unknown population parameter.

Two-facet design. Following procedures similar to those demonstrated above, we have extended Feldt's approach to develop an approximate sampling distribution of G coefficients for the two-facet design. First, the formula for the population G coefficient in Equation [2-24] can be rearranged as:

[2-44]  G_2 = 1 - (σ²_e + n_r σ²_po + n_o σ²_pr) / (σ²_e + n_r σ²_po + n_o σ²_pr + n_o n_r σ²_p).

Similarly, Equation [2-25] can be rewritten as:

[2-45]  ^G_2 = 1 - 1/[MS_p / (MS_po + MS_pr - MS_e)].

Notice that the ratio of the expected mean square expressions in [2-44] has the proper structural requirements for an F statistic for the test of σ²_p = 0.
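The one-facet result [2-43] can be checked by simulation; a sketch (design values assumed for illustration): for normal data the ratio (1 - G_1)/(1 - ^G_1) should behave as a central F variate, so its Monte Carlo mean should be close to f_e/(f_e - 2).

```python
import numpy as np

rng = np.random.default_rng(3)
reps, n_p, n_r = 20_000, 10, 4
s2_p, s2_e = 1.0, 1.0
g1 = s2_p / (s2_p + s2_e / n_r)                # population G_1 = 0.8

# Rater main effects are omitted; they enter neither MS_p nor MS_e.
p = rng.normal(0, np.sqrt(s2_p), (reps, n_p, 1))
x = p + rng.normal(0, np.sqrt(s2_e), (reps, n_p, n_r))

grand = x.mean(axis=(1, 2), keepdims=True)
row = x.mean(axis=2, keepdims=True)
col = x.mean(axis=1, keepdims=True)
ms_p = n_r * ((row - grand) ** 2).sum(axis=(1, 2)) / (n_p - 1)
ms_e = ((x - row - col + grand) ** 2).sum(axis=(1, 2)) / ((n_p - 1) * (n_r - 1))

ratio = (1 - g1) * ms_p / ms_e                 # (1 - G_1)/(1 - ^G_1), Eq. [2-43]
f_e = (n_p - 1) * (n_r - 1)
expected_mean = f_e / (f_e - 2)                # mean of a central F variate
```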
Therefore, as in the one-facet design, the sampling distribution of ^G_2 can be defined by deriving the sampling distribution of [MS_p/(MS_po + MS_pr - MS_e)] in [2-45]. However, because the denominator of [2-45] involves a composite of different sources of variation, the sampling distribution of this ratio (i.e., a quasi F ratio) is not the usual F distribution, and the exact distributional properties of this ratio are too complicated to be of practical utility (e.g., see Burdick & Graybill, 1988). Consequently, Satterthwaite (1941, 1946) suggested that the sampling distribution of the quasi F ratio can be approximated by the usual F distribution, and recommended the use of a chi-square distribution in which the number of degrees of freedom is chosen so as to provide good agreement between the two. For example, if W is a linear combination of independent mean squares with v_1, v_2, ..., v_k degrees of freedom,

W = a_1 MS_1 + a_2 MS_2 + ... + a_k MS_k,

where a_i is ±1, then the quantity f W/E(W) is approximately distributed as a chi-squared variable with f degrees of freedom, where f is given by

f = (Σ a_i EMS_i)² / Σ [(a_i EMS_i)² / v_i],  (i = 1,2,...,k).

Applying Satterthwaite's approximation to Equation [2-45], we see that the sampling distribution of ^G_2 can be defined by considering the quantity [MS_p / (MS_po + MS_pr - MS_e)]. From the one-facet case, we know that the quantity (n_p - 1)MS_p/EMS_p is distributed as a chi-squared variate with (n_p - 1) degrees of freedom.
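Satterthwaite's effective degrees of freedom are easy to compute; a small helper (a sketch; the function name is ours, and the numerical values are illustrative):

```python
def satterthwaite_df(coefs, ems, dfs):
    """Effective df for W = sum(a_i * MS_i), per Satterthwaite (1946)."""
    num = sum(a * e for a, e in zip(coefs, ems)) ** 2
    den = sum((a * e) ** 2 / v for a, e, v in zip(coefs, ems, dfs))
    return num / den

# Composite MS_po + MS_pr - MS_e with n_p = 30, n_o = 3, n_r = 5 and
# illustrative expected mean squares EMS_po = 66, EMS_pr = 55, EMS_e = 31;
# the df are (n_p-1)(n_o-1), (n_p-1)(n_r-1), and (n_p-1)(n_o-1)(n_r-1).
f_a = satterthwaite_df([1, 1, -1], [66.0, 55.0, 31.0], [58, 116, 232])
```

This f_a is the 'adjusted' denominator degrees of freedom of Equation [2-46].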
Furthermore, the composite of mean squares (i.e., MS_po + MS_pr - MS_e) is approximately distributed as a chi-squared variate with f_a degrees of freedom (the subscript a denotes 'adjusted'), given by:

[2-46]  f_a = (EMS_po + EMS_pr - EMS_e)² / [ EMS²_po/((n_p-1)(n_o-1)) + EMS²_pr/((n_p-1)(n_r-1)) + EMS²_e/((n_p-1)(n_o-1)(n_r-1)) ].

Therefore, the ratio of these two chi-squared variates in [2-47] below is approximately distributed as an F-variate with degrees of freedom f_1 = (n_p - 1) and f_a:

[2-47]  [MS_p / (σ²_e + n_r σ²_po + n_o σ²_pr + n_o n_r σ²_p)] / [(MS_po + MS_pr - MS_e) / (σ²_e + n_o σ²_pr + n_r σ²_po)].

Rearranging the terms in [2-47] yields:

[2-48]  [MS_p / (MS_po + MS_pr - MS_e)] [(EMS_po + EMS_pr - EMS_e) / EMS_p],

which is approximately distributed as F(f_1, f_a). To be consistent with the terminology used in the one-facet design, we again denote the first term and the reciprocal of the second term of [2-48] as F_obs and F_pop, respectively, and rewrite the expression as:

[2-49]  F_obs (1/F_pop) ≈ F(f_1, f_a).

Since F_obs = 1/(1 - ^G_2) and F_pop = 1/(1 - G_2), Equation [2-49] can be rewritten as:

[2-50]  (1 - G_2)/(1 - ^G_2) ≈ F(f_1, f_a).

This expression describes the distributional property of ^G_2, which is precisely the same in form as [2-43], except that the degrees of freedom of the denominator involve Satterthwaite's procedure.

Bias of the sample estimates

In this section the accuracy of the estimator for the one-facet design is examined, and the desired unbiased estimator is subsequently presented, based on the work of Kristof (1963). The results are directly extended to the two-facet design.

One-facet design. To show that the estimator ^G_1 is biased, we begin with Equation [2-43]. The reciprocal of the ratio in [2-43] is also distributed as F, with the degrees of freedom reversed, namely:

[2-51]  (1 - ^G_1)/(1 - G_1) = F((n_p - 1)(n_r - 1), (n_p - 1)).

Let f_1 = (n_p - 1)(n_r - 1) and f_2 = (n_p - 1).
Denoting the expected value of ^G_1 by E(^G_1), it follows that

[2-52]  E(^G_1) = 1 - (1 - G_1) E[F(f_1, f_2)].

Since the expected value of the F distribution is f_2/(f_2 - 2) (Winer, 1971, p.832), substituting this for E[F(f_1, f_2)] in [2-52] yields:

[2-53]  E(^G_1) = 1 - (1 - G_1) [f_2/(f_2 - 2)].

Replacing f_2 with (n_p - 1), with some simplification, yields

[2-54]  E(^G_1) = [G_1(n_p - 1) - 2] / (n_p - 3).

From [2-54], it is apparent that E(^G_1) is not equal to the population parameter G_1, except in the unrealistic case G_1 = 1. Thus ^G_1 is biased and tends to underestimate the population parameter G_1. In addition, this bias becomes larger for a smaller population parameter, and it is independent of n_r, the number of levels of the facet. Kristof (1963, p.232) presented the desired unbiased estimator of G_1, ^Gu_1, in relation to ^G_1 as follows:

[2-55]  ^Gu_1 = [^G_1(n_p - 3) + 2] / (n_p - 1).

From this, the sampling distribution of the unbiased estimator ^Gu_1 can easily be derived by substituting [^Gu_1(n_p - 1) - 2]/(n_p - 3) for ^G_1 in the denominator of Equation [2-43], which after some simplification gives:

[2-56]  [(n_p - 3)/(n_p - 1)] [(1 - G_1)/(1 - ^Gu_1)] = F(n_p - 1), (n_p - 1)(n_r - 1).

Two-facet design. As can be expected from the structural similarity of the sampling distributions of ^G_1 and ^G_2 in the previous section, the unbiased estimator of G_2, ^Gu_2, is precisely the same in form as in the one-facet design. That is, the amount of bias is the same for both ^G_1 and ^G_2; furthermore, it is independent of Satterthwaite's adjusted degrees of freedom, namely:

[2-57]  ^Gu_2 = [^G_2(n_p - 3) + 2] / (n_p - 1).

Substituting [^Gu_2(n_p - 1) - 2]/(n_p - 3) for ^G_2 in the denominator of Equation [2-50], the sampling distribution of the unbiased estimator ^Gu_2 can be derived as:

[2-58]  [(n_p - 3)/(n_p - 1)] [(1 - G_2)/(1 - ^Gu_2)] ≈ F(f_1, f_a).
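The bias formula [2-54] and the correction [2-55] can be checked with a quick Monte Carlo draw from the F representation of ^G_1 in [2-51] (a sketch; the condition values are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(11)
g1, n_p, n_r = 0.75, 20, 4
f1, f2 = (n_p - 1) * (n_r - 1), (n_p - 1)

# (1 - ^G_1)/(1 - G_1) = F(f1, f2), per Equation [2-51]
g1_hat = 1 - (1 - g1) * rng.f(f1, f2, size=500_000)

expected = (g1 * (n_p - 1) - 2) / (n_p - 3)    # Equation [2-54]
gu1 = (g1_hat * (n_p - 3) + 2) / (n_p - 1)     # Equation [2-55]
```

The mean of g1_hat lands near the biased value given by [2-54] (about .721 here, below the true .75), while the mean of the corrected gu1 lands near G_1 itself.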
Variance expressions for the sample estimates

The variability of the distribution of the estimates is an important aspect for assessing the performance of the estimator, and it can also be used to compare the variabilities of the theoretical and empirical distributions. Therefore, we now derive the variance expressions for ^G_1 and ^G_2, using the properties of the F distribution.

One-facet design. In deriving the variance expression for ^G_1, we start with Equation [2-51], namely:

(1 - ^G_1)/(1 - G_1) = F(f_1, f_2),

where f_1 = (n_p - 1)(n_r - 1) and f_2 = (n_p - 1). From this it follows that

[2-59]  ^G_1 = 1 - (1 - G_1) F(f_1, f_2).

Thus, the variance expression for ^G_1, var(^G_1), can be written as:

[2-60]  var(^G_1) = var[1 - (1 - G_1) F(f_1, f_2)] = (1 - G_1)² var[F(f_1, f_2)].

The variance of the F distribution is given in Winer (1971, p.832) as:

[2-61]  var[F(f_1, f_2)] = 2f_2²(f_1 + f_2 - 2) / [f_1(f_2 - 2)²(f_2 - 4)].

Replacing f_1 = (n_p - 1)(n_r - 1) and f_2 = (n_p - 1) in [2-61] yields:

[2-62]  var[F(f_1, f_2)] = 2(n_p - 1)²[(n_p - 1)(n_r - 1) + (n_p - 1) - 2] / { (n_p - 1)(n_r - 1)[(n_p - 1) - 2]²[(n_p - 1) - 4] }.

Therefore, the theoretical variance expression for ^G_1, with some simplification, can be written as:

[2-63]  var(^G_1) = (1 - G_1)² { 2(n_p - 1)[n_r(n_p - 1) - 2] / [(n_r - 1)(n_p - 3)²(n_p - 5)] }.

And, from the relationship between ^G_1 and ^Gu_1 shown in [2-55], the variance expression for ^Gu_1, in relation to ^G_1, is:

var(^Gu_1) = [(n_p - 3)/(n_p - 1)]² var(^G_1).

It is apparent from Equation [2-63] that the variability will increase as G_1 becomes smaller, for fixed n_p and n_r.
For example, the theoretical standard deviation of ^G_1 in a simulated condition with G_1 = .60, n_p = 30, and n_r = 5 would be .1349 = [(.16)(.1138)]^1/2, which is considerably larger than .0337 = [(.01)(.1138)]^1/2 for G_1 = .90 with the same n_p and n_r. In practice, the estimate of var(^G_1) can be obtained by substituting ^G_1 for G_1 in the calculation.

Two-facet design. The variability of the distribution of ^G_2 also follows precisely the same structure as in the one-facet design, except for the associated degrees of freedom. Following the same steps, starting from the reciprocal of Equation [2-50], i.e., (1 - ^G_2)/(1 - G_2) ≈ F(f_a, f_1), we derive the variance expression for ^G_2 as:

[2-64]  var(^G_2) = var[1 - (1 - G_2) F(f_a, f_1)] = (1 - G_2)² var[F(f_a, f_1)],

where f_a is defined in [2-46] and f_1 = (n_p - 1). The variance of the F distribution with f_a and f_1 = (n_p - 1) degrees of freedom is:

[2-65]  var[F(f_a, f_1)] = 2(n_p - 1)²[f_a + (n_p - 1) - 2] / [f_a(n_p - 3)²(n_p - 5)].

Thus, the theoretical variance expression for ^G_2, with some simplification, can be written as:

[2-66]  var(^G_2) = (1 - G_2)² { 2(n_p - 1)²[f_a + n_p - 3] / [f_a(n_p - 3)²(n_p - 5)] },

and that of ^Gu_2 as:

var(^Gu_2) = [(n_p - 3)/(n_p - 1)]² var(^G_2).

For example, the theoretical standard deviation of ^G_2 in a simulated two-facet condition with G_2 = .90, n_p = 30, n_o = 3, n_r = 5, and f_a = 76.91 (from Equation [2-46] with EMS_po = 66, EMS_pr = 55, EMS_e = 31) would be .0353 = [(.01)(.1246)]^1/2. In practice, the estimate of var(^G_2) can be obtained by substituting the sample estimates ^G_2 and ^f_a for G_2 and f_a in the calculation.
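The worked values above can be reproduced directly from [2-63], [2-46], and [2-66]; a sketch (function names are ours):

```python
import math

def var_g1(g1, n_p, n_r):
    """Theoretical variance of ^G_1, Equation [2-63]."""
    bracket = (2 * (n_p - 1) * (n_r * (n_p - 1) - 2)
               / ((n_r - 1) * (n_p - 3) ** 2 * (n_p - 5)))
    return (1 - g1) ** 2 * bracket

def var_g2(g2, n_p, f_a):
    """Theoretical variance of ^G_2, Equation [2-66]."""
    bracket = (2 * (n_p - 1) ** 2 * (f_a + n_p - 3)
               / (f_a * (n_p - 3) ** 2 * (n_p - 5)))
    return (1 - g2) ** 2 * bracket

# f_a from Equation [2-46] with EMS_po = 66, EMS_pr = 55, EMS_e = 31,
# and n_p = 30, n_o = 3, n_r = 5:
f_a = (66 + 55 - 31) ** 2 / (66 ** 2 / 58 + 55 ** 2 / 116 + 31 ** 2 / 232)

sd1 = math.sqrt(var_g1(0.60, 30, 5))     # about .1349
sd1_high = math.sqrt(var_g1(0.90, 30, 5))  # about .0337
sd2 = math.sqrt(var_g2(0.90, 30, f_a))   # about .0353
```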
From Equation [2-43] we know that the ratio (1 - G1)/(1 - Ĝ1) is distributed as an F with (np-1) and (np-1)(nr-1) degrees of freedom. From this, we can construct a 100(1-α)% tolerance interval for the sample estimate as follows. If the probability P[FL < F < FU] = 1 - α, then we may rewrite this as:

[2-67]
1 - α = P[ FL < (1 - G1)/(1 - Ĝ1) < FU ]

where FL is the lower α/2 percentage point and FU is the upper (1 - α/2) percentage point of the F distribution with degrees of freedom (np-1) for the numerator and (np-1)(nr-1) for the denominator. Further manipulation of the inequality above yields:

[2-68]
1 - α = P[ 1/FL > (1 - Ĝ1)/(1 - G1) > 1/FU ]
      = P[ (1 - G1)/FL > 1 - Ĝ1 > (1 - G1)/FU ]
      = P[ 1 - (1 - G1)/FL < Ĝ1 < 1 - (1 - G1)/FU ].

Following the same steps and denoting the quantity (np-3)/(np-1) by M in Equation [2-56], we can also derive a 100(1-α)% tolerance interval for the unbiased sample estimate as:

[2-69]
1 - α = P[ 1 - M(1 - G1)/FL < Ĝu1 < 1 - M(1 - G1)/FU ].

[…]

Results and Discussion

[…] .75 and n > 30. The heterogeneity of covariance did not have any effect on the magnitude of Ĝ1, but it did have some positive effects on the sampling variability of the estimate, especially for continuous data. Although its effect was not large, it was large enough to lead to more variable estimates of the G coefficient (rather than to bias the estimate). This would result in too many large estimates of the G coefficient (as well as too many small ones). Lastly, the effect of categorization on the G coefficient in terms of population characteristics was consistent in the sample estimates. Thus, the G coefficient under the categorical scales was seriously underestimated, which resulted in a large sampling variability of Ĝ1, especially for a 3-point or less scale.
These results led us to question the adequacy of the sampling theory of G1 with the categorical scales.

Table 4-7
Theoretical standard deviations for some selected conditions (k = 5 and n = 30), calculated using Gcp instead of G1

G     ε:     C      N5     U5     N3     U3     U2
.90   H:   .0338  .0403  .0413  .0503  .0500  .0677
      M:   .0338  .0402  .0409  .0498  .0492  .0664
      L:   .0338  .0401  .0406  .0492  .0484  .0647
.75   H:   .0846  .0932  .0957  .1058  .1071  .1295
      M:   .0848  .0929  .0951  .1053  .1056  .1266
      L:   .0849  .0927  .0935  .1037  .1033  .1210
.60   H:   .1355  .1444  .1474  .1578  .1591  .1810
      M:   .1362  .1441  .1461  .1567  .1568  .1761
      L:   .1361  .1439  .1426  .1522  .1512  .1642

C. Empirical sampling distribution of Ĝ1

The characteristics of the sampling variability of Ĝ1 across the simulated conditions reported in the previous section are directly related to the properties of the sampling distribution of Ĝ1 examined in this section. Therefore, with these characteristics of the variability of Ĝ1 in mind, we now examine the resulting empirical sampling distributions of Ĝ1 under the simulated conditions and compare them to the theoretical ones, in order to assess the precision of the sample estimate and the robustness of the sampling theory of the G coefficient.
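The per-replication computation underlying these empirical distributions can be sketched in a few lines. The design size and variance components below are illustrative assumptions (not the study's actual generator), and Ĝ1 is computed from the usual one-facet mean squares:

```python
import numpy as np

rng = np.random.default_rng(1)
n_p, n_r = 30, 5                     # persons and raters (illustrative values)
sigma_p2, sigma_e2 = 1.0, 0.67      # variance components (assumed)

# One replication: X[p, r] = person effect + residual (compound symmetry)
x = (rng.normal(0.0, sigma_p2 ** 0.5, (n_p, 1))
     + rng.normal(0.0, sigma_e2 ** 0.5, (n_p, n_r)))

grand = x.mean()
ms_p = n_r * np.sum((x.mean(axis=1) - grand) ** 2) / (n_p - 1)
ss_res = np.sum((x - x.mean(axis=1, keepdims=True)
                 - x.mean(axis=0) + grand) ** 2)
ms_res = ss_res / ((n_p - 1) * (n_r - 1))

g1_hat = (ms_p - ms_res) / ms_p      # sample one-facet G coefficient
print(round(g1_hat, 3))
```

Repeating this over many replications and many parent covariance structures yields empirical distributions of Ĝ1 of the kind analyzed below.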
The main focus in this section is on the effect of heterogeneity of covariance on the performance of Ĝ1.

As shown in Chapter II, to describe and evaluate the empirical sampling distribution of Ĝ1 we can use either the confidence interval approach, by establishing the 100(1-α)% limits for the population G1 value, or the tolerance interval approach, by constructing the theoretical 100(1-α)% limits for the estimated G coefficient, namely:

[4-4]
[ Lower limit < Ĝ1 < Upper limit ]
= [ 1 - (1 - G1)/FL < Ĝ1 < 1 - (1 - G1)/FU ]

where FL and FU are the critical F values corresponding to the lower (α/2) and the upper (1 - α/2) percentage points of the F distribution with degrees of freedom (np-1) for the numerator and (nr-1)(np-1) for the denominator (note that nr = k).

The empirical percentage of the confidence intervals failing to include the population G value is essentially the same as the empirical percentage of the estimated G coefficients falling beyond the limits of the tolerance interval. Therefore, we use the latter approach (using Equation [4-4]) to present the results.

We first conducted a chi-square goodness-of-fit test (Gibbons, 1985) between the empirical sampling distribution and the theoretical F distribution in order to assess the adequacy of the sampling theory of G coefficients. An empirical sampling distribution of Ĝ1 with k = 5, n = 15, and G1 = .75 (6000 replications) was obtained for each of the three population epsilon conditions (ε = 1.0, .70, and .50). Using Equation [4-4], a theoretical tolerance limit of Ĝ1 was computed for each of the following eleven percentiles of the F distribution: 1.0th, 2.5th, 5th, 10th, 25th, 50th, 75th, 90th, 95th, 97.5th, and 99th.
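The limits of Equation [4-4] are straightforward to compute directly. A minimal sketch using scipy (the condition shown, G1 = .75 with n = 30 and k = 5, corresponds to one entry of Table 4-13):

```python
from scipy.stats import f

def tolerance_limits(g1, n_p, n_r, alpha=0.10):
    """100(1-alpha)% tolerance limits for the one-facet estimate, Equation [4-4]."""
    df1, df2 = n_p - 1, (n_r - 1) * (n_p - 1)
    f_lo = f.ppf(alpha / 2.0, df1, df2)        # lower alpha/2 percentage point
    f_up = f.ppf(1.0 - alpha / 2.0, df1, df2)  # upper (1 - alpha/2) point
    return 1.0 - (1.0 - g1) / f_lo, 1.0 - (1.0 - g1) / f_up

lo, up = tolerance_limits(0.75, 30, 5)  # G1 = .75, n = 30, k = 5
print(round(lo, 4), round(up, 4))
```

Up to rounding, the printed limits should reproduce the corresponding Table 4-13 entries (.5750 and .8403).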
The empirical proportion of estimated G coefficients that fell below each of these theoretical tolerance limits was tabulated for each empirical sampling distribution, and the results are presented in Table 4-8 (frequencies were used for the chi-square calculations, but proportions are presented for interpretation).

As can be seen in Table 4-8, the empirical proportion in each region for the ε = 1.0 condition was very close to the corresponding theoretical value, and the goodness-of-fit test (χ²(11) = 9.2133, p = .6022) also indicates a high degree of agreement between the theoretical and empirical sampling distributions. However, the significant goodness-of-fit tests for the ε < 1.0 conditions suggest that the empirical sampling distribution of Ĝ1 was not in close agreement with the theoretical distribution when the circularity assumption failed, especially for severe noncircularity (i.e., ε = .50).

Table 4-8
Empirical sampling distribution of Ĝ1 and goodness-of-fit tests (k = 5, n = 15, G1 = .75, 6000 replications in each condition, continuous data only)

                          Empirical proportions
Theoretical percentile:  ε = 1.0    .70     .50
99.0                  :    98.7    98.6    98.0
97.5                  :    97.2    96.6    95.5
95.0                  :    94.5    94.0    92.5
90.0                  :    89.2    88.6    86.8
75.0                  :    74.5    73.6    72.0
50.0                  :    49.6    49.4    48.8
25.0                  :    25.4    25.7    25.9
10.0                  :    10.2    10.3    10.7
5.0                   :     5.4     5.3     5.6
2.5                   :     2.8     3.1     2.9
1.0                   :     1.2     1.1     1.3

Mean of Ĝ1            :   .7077   .7080   .7074
Empirical SD          :   .1449   .1471   .1563
Theoretical SD        :   .1437 for all three conditions

χ²(11)                :  9.2133  36.0073  124.4687
p                     :   .6022    <.01     <.001

Examination of the empirical proportions for the ε < 1.0 conditions suggests that the sampling distribution of Ĝ1 was more spread out, leaving a larger proportion in both tails of the distribution. This is also evident from the larger empirical standard deviations of Ĝ1 for the ε < 1.0 conditions.
For example, the empirical proportion above the 95th percentile was 5.5%, 6.0%, and 7.5% for ε = 1.0, .70, and .50, respectively. Although the goodness-of-fit test for the ε = .70 condition was significant (χ²(11) = 36.0073, p < .01), the departure of the empirical proportions of Ĝ1 from the corresponding theoretical values does not seem too serious for practical purposes; with a large sample size (i.e., 6000), a chi-square test may detect even a minuscule departure from the theoretical distribution, and thus is almost certain to lead to a significant result. Nevertheless, these results indicate that the sampling theory of the G coefficient is very satisfactory when the circularity assumption is met, but quite sensitive to severe noncircularity.

We now examine the empirical sampling distributions of G coefficients in detail across the simulated conditions. For each simulation condition 2000 replications were performed. Table 4-9 presents the empirical percentage of Ĝ1 falling beyond the theoretical limits of the 100(1-α)% tolerance interval, averaged over the levels of G, n, and k (for α = .10 and .05, two-tailed). Thus, the results in this table represent the general pattern of the effect of heterogeneity of covariance on the sampling distribution of Ĝ1 across the six measurement scales. To assess the extent to which the other simulated conditions affect the empirical proportion of Ĝ1, Table 4-9 was further broken down by the levels of G, n, or k.
The respective results, presented separately in Tables 4-10 through 4-12, were used to investigate interaction effects on the empirical proportion across the measurement scales.

Table 4-9
Empirical percentage of Ĝ1 falling beyond the limits of the 100(1-α)% theoretical tolerance interval, averaged over the levels of k, n, and G

α           ε:     C     N5    U5    N3    U3    U2
Upper 5%    H:    5.2   2.5   2.4    .9   1.1    .5
            M:    5.9   2.9   2.9   1.3   1.4    .7
            L:    7.2   3.7   3.9   2.0   2.2   1.1
Lower 5%    H:    5.0   9.4  10.2  20.1  20.1  40.7
            M:    5.1   9.3   9.8  19.6  18.8  38.2
            L:    5.7   9.8   9.7  18.7  17.8  33.5
Upper 2.5%  H:    2.6   1.2   1.2    .4    .5    .3
            M:    3.3   1.4   1.5    .6    .7    .4
            L:    4.2   2.0   2.2   1.0   1.1    .6
Lower 2.5%  H:    2.5   5.1   5.5  12.6  12.4  30.4
            M:    2.6   5.1   5.3  12.3  11.7  28.5
            L:    2.8   5.4   5.2  11.5  10.8  24.4

As can be seen in Table 4-9, the empirical proportions of Ĝ1 falling beyond the limits of the 100(1-α)% tolerance interval for the continuous data were very close to the nominal levels for the ε = 1.0 condition, indicating that the sampling theory of G1 works well for continuous data. However, the results also indicate some positive effects of heterogeneity of covariance on the sampling distribution of Ĝ1. These results are somewhat expected when we consider the results of the goodness-of-fit test reported above and the effect of the ε conditions on the sampling variability of Ĝ1 described in the previous section. Note also that the positive inflation was more apparent in the empirical percentage beyond the upper limit of the tolerance interval; the reason may be the skewness of the F distribution. The pattern of the empirical percentages for the continuous data was very consistent across the levels of G, n, and k, as shown later in Tables 4-10 through 4-12.

Although the sampling theory of the G coefficient seems to work well for the continuous data across all the simulated conditions, it is not adequate for the categorical scales, especially for a 3-point or less scale.
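The attenuation that categorization induces in the G coefficient can be illustrated directly. In this sketch the compound-symmetric generator, the variance components, and the median split standing in for the U2 scale are all illustrative assumptions (the study's actual cutpoints are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(7)
n_p, n_r = 90000, 5                  # a large synthetic "population" (illustrative)
sigma_p2, sigma_e2 = 1.0, 5.0 / 3.0  # chosen so G1 = 1/(1 + sigma_e2/(sigma_p2*n_r)) = .75

x = (rng.normal(0.0, sigma_p2 ** 0.5, (n_p, 1))
     + rng.normal(0.0, sigma_e2 ** 0.5, (n_p, n_r)))

def g1(scores):
    """One-facet G coefficient, (MSp - MSres)/MSp, from a persons x raters table."""
    n, k = scores.shape
    grand = scores.mean()
    ms_p = k * np.sum((scores.mean(axis=1) - grand) ** 2) / (n - 1)
    ss_res = np.sum((scores - scores.mean(axis=1, keepdims=True)
                     - scores.mean(axis=0) + grand) ** 2)
    return (ms_p - ss_res / ((n - 1) * (k - 1))) / ms_p

g_cont = g1(x)
g_dich = g1((x > np.median(x)).astype(float))  # dichotomized, U2-style median split
print(round(g_cont, 3), round(g_dich, 3))
```

With these assumptions the continuous value sits near .75 while the dichotomized value drops to roughly the low .60s, the same order of attenuation as the Gcp values reported for the U2 scale.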
As can be seen in Table 4-9, the empirical percentage beyond the lower limit increased considerably as the scale approached the U2 scale, whereas a reverse pattern was evident in the upper limit direction. Furthermore, comparison among the three ε conditions reveals that this trend was more apparent for the higher ε condition.

The effect of the population G values on the empirical percentage of the tolerance interval of Ĝ1 is shown in Table 4-10. As discussed in the previous section, Ĝ1 is a negatively biased estimator of its Gcp. Additionally, categorization resulted in a lower value of Gcp for the ε = 1.0 condition under the categorical scales, considerably smaller than the corresponding G1 value. On the other hand, the theoretical limits of the tolerance interval of Ĝ1 are constructed using G1, producing a narrower width of the theoretical limits for a larger G1. Therefore, as can be seen in Table 4-10, it is not surprising that the ε = 1.0 and G = .90 conditions yielded a relatively large empirical percentage beyond the lower limit under the categorical scales. For example, the mean (and standard deviation) of Ĝ1 for k = 5, n = 15, and G = .90 under the U2 scale is .7809 (.1088) (see Table 4-3), whereas the corresponding 90% theoretical lower and upper limits from Equation [4-4] are .7770 and .9488.
Since the mean is already close to the lower limit, a large number of the Ĝ1 estimates fall below the lower limit, but few fall beyond the upper limit.

Table 4-10
Empirical percentage of Ĝ1 falling beyond the limits of the 90% theoretical tolerance interval, averaged over the levels of k and n

α         ε    G:     C     N5    U5    N3    U3    U2
Upper 5%  H  .90:    5.2   1.4   1.6    .3    .5    .3
             .75:    5.2   2.8   2.5    .9   1.2    .5
             .60:    5.2   3.4   3.1   1.6   1.8    .8
          M  .90:    6.1   1.6   2.0    .4    .6    .4
             .75:    5.8   3.3   3.0   1.4   1.5    .6
             .60:    5.8   3.9   3.7   2.0   2.1   1.1
          L  .90:    7.4   2.2   2.6    .6    .9    .4
             .75:    7.3   4.1   4.0   1.8   2.1    .8
             .60:    7.0   5.0   5.1   3.6   3.5   2.1
Lower 5%  H  .90:    4.9  12.7  13.5  32.6  31.3  64.6
             .75:    5.0   8.3   9.4  16.3  16.7  35.7
             .60:    5.1   7.3   7.9  11.3  12.3  21.8
          M  .90:    5.0  12.6  13.1  32.0  30.0  62.9
             .75:    5.3   8.4   9.1  15.9  15.7  33.3
             .60:    4.9   6.9   7.1  10.9  10.8  18.5
          L  .90:    5.5  13.2  13.3  30.5  29.1  59.4
             .75:    5.8   8.9   9.0  15.3  14.6  27.5
             .60:    5.8   7.3   7.0  10.1   9.5  13.5

Table 4-11
Empirical percentage of Ĝ1 falling beyond the limits of the 90% theoretical tolerance interval, averaged over the levels of k and G

α         ε   n:     C     N5    U5    N3    U3    U2
Upper 5%  H  15:    5.1   3.1   3.3   1.4   1.8   1.1
             30:    5.1   2.3   2.3    .9   1.0    .3
             45:    5.3   2.1   1.7    .6    .6    .1
          M  15:    5.7   3.7   3.8   1.9   2.2   1.3
             30:    5.9   2.6   2.6   1.1   1.2    .5
             45:    6.2   2.5   2.2    .9    .9    .2
          L  15:    7.3   4.6   5.0   2.8   3.2   1.8
             30:    7.1   3.4   3.6   1.7   1.8    .8
             45:    7.3   3.3   3.1   1.4   1.5    .6
Lower 5%  H  15:    5.1   7.4   7.8  12.2  12.3  24.6
             30:    5.3   9.8  10.5  20.6  20.6  43.0
             45:    4.5  11.1  12.4  27.5  27.5  54.5
          M  15:    4.9   7.2   7.6  12.3  11.9  23.3
             30:    5.4   9.7   9.7  19.8  19.4  40.3
             45:    4.9  11.0  11.9  26.7  25.2  51.1
          L  15:    5.5   7.7   7.6  11.8  11.7  20.6
             30:    5.8  10.0  10.1  19.1  18.2  35.1
             45:    5.8  11.7  11.6  25.0  23.4  44.6

With respect to the effect of the sample size on the empirical results, it is apparent from Equation [4-4] that the width of the theoretical tolerance limits of Ĝ1 becomes smaller as the sample size increases.
Given that the Gcp for the categorical scales is already much smaller than its corresponding population G value, the narrower limits produced by a larger sample size yield an even greater proportion of Ĝ1 beyond the lower limit. Table 4-11 clearly shows this effect: the empirical proportion beyond the lower limit increased considerably with increasing sample size within the same value of ε.

It should be noted that if we had constructed the theoretical limits of the tolerance interval using the Gcp value instead of G1, the empirical results would have shown a completely different pattern. For example, the Gcp under the U2 scale for the condition with G = .75, k = 5, and ε = H was equal to .6186. Substituting this value for G1 in Equation [4-4] (and using n = 30 for the critical F values) yields lower and upper limits of the 90% tolerance interval of .3516 and .7563, respectively (as compared to the corresponding theoretical limits of .5750 and .8403, calculated using G1 = .75). The mean of Ĝ1 for this condition was .5956, with a standard deviation of .1245, as shown in Table 4-3. Therefore, if the limits calculated using Gcp instead of G1 were used, most of the 2000 sample estimates for this condition would have fallen within the limits. This example also demonstrates that categorization was the main factor producing the large empirical percentage beyond the theoretical tolerance limits of Ĝ1. We could also look at this problem from a statistical power point of view; for example, we know that power increases as the sample size increases.
If we presume that the difference between the G1 and Gcp values is indeed a true difference, the empirical percentage beyond both limits would be interpreted as power for detecting that difference, rather than Type I error; similarly, the empirical proportion of Ĝ1 falling within the two limits would be considered Type II error. From these two examples, it is clear that the sampling theory of G1 is not adequate for categorical data, at least under the conditions simulated in the present study.

Finally, as can be seen in Table 4-12, the effect of the number of repeated measures (k) on the sampling distribution of Ĝ1 was relatively small. The absence of an effect of the levels of k on the sampling characteristics of Ĝ1 is also evident from Table 4-13; for example, the size of the theoretical tolerance interval was practically identical among the levels of k. Although a larger value of k tended to slightly reduce the empirical percentage beyond the lower limit within the same ε, the difference in the proportions among the three levels of k was relatively small. Therefore, unless a larger number of measurements is used in a design, the use of categorical scales, especially a 3-point or less scale, could result in a serious downward bias in estimating the population G coefficient. The values of k used in the present study were certainly not large enough to produce substantial differences in the empirical results across the simulated conditions.

In summary, the empirical results for continuous data were very close to the corresponding theoretical values across the simulated conditions, which suggests the adequacy of the sampling theory of the G coefficient. It was also evident that the heterogeneity of covariance had some positive effects on the sampling distribution of Ĝ1.
Although the effect was not large, it was large enough to produce a moderate inflation in the empirical percentage beyond the upper limit of the theoretical tolerance interval for Ĝ1. However, the sampling theory was not adequate for categorical data, especially for a 3-point or less scale. This inadequacy was due to the effect of categorization on the G coefficient in terms of population characteristics, which led to a serious inflation in the empirical percentage beyond the lower limit of the theoretical tolerance interval. These results led us to question the practical utility of the sampling theory of G1 with categorical data.

Table 4-12
Empirical percentage of Ĝ1 falling beyond the limits of the 90% tolerance interval, averaged over the levels of n and G

α         ε   k:     C     N5    U5    N3    U3    U2
Upper 5%  H   3:    5.2   2.5   2.6   1.1   1.4    .8
              5:    5.4   2.5   2.4    .9   1.1    .5
              7:    5.0   2.5   2.3    .8    .9    .3
          M   3:    6.4   3.2   3.3   1.5   1.7   1.1
              5:    5.7   2.8   2.7   1.2   1.4    .6
              7:    5.6   2.9   2.6   1.1   1.1    .4
          L   3:    7.9   3.8   4.3   2.3   2.5   1.7
              5:    7.2   3.8   3.8   1.9   2.1    .8
              7:    6.6   3.6   3.6   1.8   1.9    .8
Lower 5%  H   3:    4.9  10.4  10.9  22.7  21.8  42.3
              5:    5.1   9.2  10.2  19.6  19.7  40.5
              7:    4.9   8.7   9.7  17.9  18.8  39.2
          M   3:    5.0  10.0   9.9  21.7  20.3  39.0
              5:    5.2   9.1   9.7  19.4  18.6  38.3
              7:    5.1   8.8   9.7  17.7  17.6  37.4
          L   3:    5.7  10.5   9.5  19.8  18.5  32.5
              5:    5.8   9.5   9.9  18.7  17.8  34.6
              7:    5.6   9.4   9.9  17.5  17.0  33.3

Table 4-13
Lower (LL) and upper (UL) limits of the 90% theoretical tolerance interval for Ĝ1

              G = .90          G = .75          G = .60
k   n       LL     UL        LL     UL        LL     UL
3  15  :  .7680  .9515  :  .4200  .8788  :  .0719  .8062
   30  :  .8243  .9399  :  .5607  .8497  :  .2971  .7594
   45  :  .8429  .9340  :  .6074  .8350  :  .3718  .7359
5  15  :  .7770  .9488  :  .4426  .8665  :  .1081  .7864
   30  :  .8300  .9361  :  .5750  .8403  :  .3200  .7445
   45  :  .8474  .9308  :  .6185  .8270  :  .3896  .7233
7  15  :  .7803  .9448  :  .4507  .8620  :  .1211  .7792
   30  :  .8320  .9348  :  .5800  .8369  :  .3281  .7390
   45  :  .8490  .9297  :  .6225  .8242  :  .3960  .7187

D.
Sample estimates of epsilon and Type I error rates

As noted in the introduction to this chapter, a primary purpose of examining Type I error rates in this study was to use these results as a partial validation of the simulation procedure: inflations in error rates similar to those reported in the literature, given the same epsilon values, would suggest validity and accuracy in the simulation and subsequent calculations. A second purpose was to allow for a systematic examination of the sampling characteristics of epsilon for both continuous and categorical data, as there appears to be very little published on this topic. The results of this study have, in fact, provided the validation of the simulation procedure and added some detailed information on the sampling characteristics of epsilon. However, these results have also revealed some interesting and unexpected findings with respect to epsilon estimates and Type I error rates, findings which cause one to question the appropriateness of current practice in applying a "correction factor" to the conventional F test in the presence of a low epsilon estimate. Thus, this section on the sampling characteristics of epsilon and Type I error rates, which had been expected to be very brief, has been expanded considerably to present and discuss these new findings.

A number of empirical studies have investigated the effect of heterogeneity of covariance on the degree of inflation in Type I error rates. The results from these studies are generally in close agreement and have been well documented in the related literature (e.g., Collier et al., 1967; Hertzog & Rovine, 1985; Huynh & Feldt, 1976, 1978; Stoloff, 1970; Wilson, 1975). Empirical studies with categorical data have also been conducted to investigate the degree of positive bias in the F tests.
For example, Lunney (1970) investigated the Type I error rates of the F tests in various repeated measures designs with dichotomous data and demonstrated that the actual error rates were quite close to their nominal levels. Hsu and Feldt (1969) and Gregoire and Driver (1987) also examined the degree of positive bias in the F tests with categorical data. However, in these studies, conditions of heterogeneity of covariance were not part of the simulations. Furthermore, Myers et al. (1982) found that heterogeneous covariance in dichotomous data produced a positive inflation in the empirical Type I error rates for the usual F tests. However, they did not examine the sampling characteristics of the epsilon estimates, nor did they incorporate the corresponding continuous-data conditions in their simulations. Thus, they provided no information on the degree of relative bias in Type I error rates for dichotomous data in comparison to the parent continuous conditions.

Since the early work of Box (1954), epsilon has been used as a correction factor in repeated measures ANOVA designs in order to control the inflation in Type I error rates brought about by heterogeneity of covariance. However, despite its wide use, the properties of the sampling distribution of epsilon appear to be unavailable. For any covariance matrix (either a sample or a population matrix) the numerical maximum of ε is unity. If there is any deviation from homogeneity of covariance (or from the circularity condition), ε will be less than unity. This means that a sample matrix can always be expected to exhibit some degree of violation of the condition, even when the population matrix does not. Although we know that the estimator of ε is biased, an unbiased estimator of ε is not known [Huynh & Feldt (1976) reduced this bias by introducing a correction factor].
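For reference, Box's ε for a given k × k covariance matrix can be computed from the double-centered matrix. A short sketch (the example matrices are illustrative, not the study's population matrices):

```python
import numpy as np

def box_epsilon(cov):
    """Box's epsilon for a k x k covariance matrix (Greenhouse-Geisser form)."""
    k = cov.shape[0]
    # Double-center, then epsilon = (trace D)^2 / ((k-1) * sum of squared entries of D)
    d = cov - cov.mean(axis=0) - cov.mean(axis=1)[:, None] + cov.mean()
    return np.trace(d) ** 2 / ((k - 1) * np.sum(d * d))

k = 5
cs = np.full((k, k), 0.5) + 0.5 * np.eye(k)  # compound symmetry: circularity holds
ar = 0.8 ** np.abs(np.subtract.outer(np.arange(k), np.arange(k)))  # AR(1)-like

eps_cs, eps_ar = box_epsilon(cs), box_epsilon(ar)
print(round(eps_cs, 3), round(eps_ar, 3))  # 1.0 for cs; below 1.0 for ar
```

Applied to a sample covariance matrix this gives ε̂; applied to a simulated population matrix of N = 90000 it gives the εcp quantity used below.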
In addition, because of the theoretical upper and lower limits on ε [i.e., 1.0 and 1/(k-1)], ε̂ is in general negatively skewed for high values of ε and positively skewed for low values of ε. Thus, we can also expect the variability of ε̂ for both large and small population epsilons to be smaller than that for an ε in the middle range.

In addition to the aforementioned characteristics of epsilon, as discussed in relation to Gcp, we also know that categorization not only produces a smaller mean value of the covariances, but also reduces the degree of heterogeneity of covariance in the covariance matrix. Since the condition of ε is directly related to the Type I error rates of the F test, we first investigate the effect of categorization on the performance of epsilon. Second, the behavior of the sample estimates of epsilon across the simulated conditions is examined. Following this, the empirical Type I error rates of the conventional F test for the 'rater effect' (MSr/MSe) under the simulated conditions are observed and compared to previous findings in the related literature.

Table 4-14 summarizes the effect of categorization of continuous data on ε across the levels of G, k, and ε. The values in each line of Table 4-14 are based on unique simulated population data sets of N = 90000. The epsilon calculated on the simulated population data is called the calculated population epsilon (εcp), to distinguish it from the population ε value. The results in Table 4-14 show that the εcp values for the continuous data were virtually identical to the corresponding population ε values defined in the simulations. (Note that the difference in εcp among the three G values under the continuous scale is due to the initial population covariance matrices defined in the simulations.)
The results also indicate that when the population epsilon equals unity (i.e., a perfect circularity condition), categorization did not affect the structure of the covariance matrix, and the homogeneity of covariance condition remained the same for all categorical scales, regardless of the level of k. However, for the ε < 1.0 conditions the εcp increased considerably as the scale approached U2, and this was more apparent for the higher G values (see the last column of Table 4-14). This trend in εcp was consistent across the three levels of k, within the same ε.

With the aforementioned characteristics of epsilon in mind, we now examine the empirical means and variabilities of ε̂ across the measurement scales, which are summarized in Tables 4-15, 4-16, and 4-17 for some selected conditions. It is clear from Table 4-15 that ε̂ for ε = 1.0 was a (downward) biased estimator, and this bias became considerably larger with increasing k. However, for a fixed k, the magnitudes of ε̂ and its standard deviations are very consistent across the categorical scales, though there was a slight decrease in ε̂ and an increase in its standard deviation at the U2 scale (see Table 4-15).
As was the case for εcp, these results indicate that categorization did not have any effect on the sample estimates of epsilon under the homogeneity of covariance condition.

Table 4-14
The effect of categorization of continuous data on epsilon (εcp), with simulated population data (N = 90000)

k  ε   G  :    C      N5     U5     N3     U3     U2    (U2-C)
3  H  .90 :
      .75 :         All equal 1.00
      .60 :
   M  .90 :  .7052  .7918  .7697  .8542  .8301  .8815   .1763
      .75 :  .7052  .7570  .7565  .8170  .8062  .8744   .1692
      .60 :  .7082  .7500  .7501  .7979  .7919  .8522   .1440
   L  .90 :  .5269  .6305  .6018  .7056  .6760  .7414   .2145
      .75 :  .5397  .6020  .5940  .6721  .6535  .7356   .1959
      .60 :  .5387  .5825  .5702  .6201  .6088  .6564   .1177
5  H  .90 :
      .75 :         All equal 1.00
      .60 :
   M  .90 :  .6935  .7651  .7569  .8304  .8158  .8887   .1952
      .75 :  .7061  .7549  .7599  .8083  .8074  .8752   .1691
      .60 :  .7068  .7463  .7528  .7939  .7945  .8600   .1532
   L  .90 :  .5010  .5883  .5767  .6815  .6566  .7640   .2630
      .75 :  .5188  .5722  .5679  .6347  .6235  .7062   .1874
      .60 :  .5050  .5469  .5419  .5974  .5851  .6549   .1499
7  H  .90 :
      .75 :         All equal 1.00
      .60 :
   M  .90 :  .7009  .7623  .7591  .8233  .8147  .8852   .1843
      .75 :  .7034  .7465  .7530  .7990  .7977  .8630   .1596
      .60 :  .7001  .7370  .7407  .7830  .7816  .8440   .1439
   L  .90 :  .5052  .5792  .5741  .6660  .6463  .7518   .2466
      .75 :  .5072  .5530  .5468  .6080  .5932  .6627   .1555
      .60 :  .5134  .5531  .5472  .5994  .5879  .6536   .1402

Note: Exact population ε values for the three levels of G and k were given in Table 3-1 in Chapter III.

Table 4-15
The mean (standard deviation) of ε̂ for the three levels of k (n = 15, averaged over the levels of G)

ε      Scale:   k = 3            5                7
High    C  :  .8948 (.0817)  .7720 (.0769)  .6905 (.0641)
        N5 :  .8932 (.0830)  .7735 (.0767)  .6945 (.0639)
        U5 :  .8899 (.0846)  .7703 (.0776)  .6909 (.0637)
        N3 :  .8916 (.0844)  .7729 (.0773)  .6920 (.0637)
        U3 :  .8862 (.0871)  .7684 (.0776)  .6904 (.0642)
        U2 :  .8702 (.1032)  .7637 (.0824)  .6895 (.0680)
Med.
        C  :  .6969 (.0889)  .6136 (.0921)  .5536 (.0772)
        N5 :  .7437 (.0979)  .6468 (.0932)  .5808 (.0771)
        U5 :  .7367 (.0992)  .6473 (.0933)  .5816 (.0759)
        N3 :  .7831 (.1070)  .6785 (.0915)  .6065 (.0747)
        U3 :  .7693 (.1052)  .6701 (.0916)  .6020 (.0753)
        U2 :  .7919 (.1241)  .7006 (.0932)  .6324 (.0755)
Low     C  :  .5369 (.0212)  .4744 (.0679)  .4329 (.0601)
        N5 :  .6051 (.0557)  .5215 (.0765)  .4698 (.0633)
        U5 :  .5883 (.0476)  .5169 (.0753)  .4668 (.0625)
        N3 :  .6549 (.0845)  .5675 (.0852)  .5083 (.0694)
        U3 :  .6364 (.0726)  .5557 (.0827)  .4978 (.0658)
        U2 :  .6785 (.1095)  .6063 (.0954)  .5415 (.0748)

For the ε < 1.0 conditions, the bias in ε̂ was almost null for k = 3 under the continuous scale, but ε̂ decreased with increasing k. Note also that the variability of ε̂ was greatest for the ε = M condition, within the same k. For the categorical scales, the nature of the bias in ε̂ shifted from negative to positive, but the scale at which this change occurred varied depending on ε, k, and the type of scale. This trend seemed to result from interactive effects among the categorization, the theoretical limits on ε, and the nature of the downward bias in ε̂. In general, for the ε < 1.0 conditions the magnitude of ε̂ became smaller for larger values of k, regardless of the ε condition and type of scale, but increased with increasing n (see Table 4-16).

Table 4-16
The mean (standard deviation) of ε̂ for the three sample sizes (k = 5, averaged over the levels of G)

ε      Scale:   n = 15           30               45
High    C  :  .7720 (.0769)  .8717 (.0499)  .9097 (.0369)
        N5 :  .7735 (.0767)  .8749 (.0482)  .9127 (.0360)
        U5 :  .7703 (.0776)  .8719 (.0496)  .9102 (.0374)
        N3 :  .7729 (.0773)  .8744 (.0498)  .9121 (.0368)
        U3 :  .7684 (.0776)  .8712 (.0497)  .9106 (.0371)
        U2 :  .7637 (.0824)  .8682 (.0537)  .9094 (.0394)
Med.
        C  :  .6136 (.0921)  .6571 (.0755)  .6713 (.0646)
        N5 :  .6468 (.0932)  .7004 (.0761)  .7180 (.0660)
        U5 :  .6473 (.0933)  .6998 (.0754)  .7182 (.0661)
        N3 :  .6785 (.0915)  .7426 (.0748)  .7635 (.0656)
        U3 :  .6701 (.0916)  .7357 (.0761)  .7578 (.0657)
        U2 :  .7006 (.0932)  .7814 (.0736)  .8116 (.0626)
Low     C  :  .4744 (.0679)  .4924 (.0537)  .4971 (.0449)
        N5 :  .5215 (.0765)  .5472 (.0608)  .5538 (.0517)
        U5 :  .5169 (.0753)  .5403 (.0591)  .5474 (.0507)
        N3 :  .5675 (.0852)  .6047 (.0691)  .6148 (.0592)
        U3 :  .5557 (.0827)  .5900 (.0673)  .6004 (.0574)
        U2 :  .6063 (.0954)  .6550 (.0792)  .6734 (.0691)

With respect to the effect of the population G conditions (see Table 4-17), within the same G, the size of ε̂ was fairly consistent across the categorical scales for ε = 1.0, but increased as the scale approached U2 for the ε < 1.0 conditions. A comparison of the magnitude of ε̂ among the three G values reveals that ε̂ was slightly larger for the G = .60 condition under ε = 1.0, but a reverse pattern appeared under the ε = L condition. However, the magnitude of these differences among the three G conditions was relatively small. Therefore, the magnitude of the mean of the covariances in the population matrix did not seem to have an appreciable influence on the estimates of epsilon.

Table 4-17
The mean (standard deviation) of the epsilon estimates for the levels of G (k = 5, n = 15, 2000 replications)

ε      Scale:   G1 = .90         .75              .60
High    C  :  .7722 (.0775)  .7720 (.0768)  .7718 (.0764)
        N5 :  .7662 (.0789)  .7739 (.0759)  .7803 (.0752)
        U5 :  .7525 (.0815)  .7746 (.0772)  .7837 (.0742)
        N3 :  .7609 (.0809)  .7755 (.0768)  .7822 (.0741)
        U3 :  .7443 (.0817)  .7727 (.0768)  .7881 (.0742)
        U2 :  .7230 (.0946)  .7754 (.0793)  .7928 (.0732)
Med.
        C  :  .6136 (.1016)  .6171 (.0933)  .6100 (.0814)
        N5 :  .6512 (.1024)  .6478 (.0946)  .6415 (.0826)
        U5 :  .6402 (.1032)  .6525 (.0938)  .6491 (.0829)
        N3 :  .6833 (.0980)  .6799 (.0914)  .6722 (.0851)
        U3 :  .6615 (.0970)  .6749 (.0930)  .6739 (.0848)
        U2 :  .6730 (.1006)  .7117 (.0924)  .7170 (.0866)
Low     C  :  .4764 (.0784)  .4826 (.0691)  .4643 (.0563)
        N5 :  .5387 (.0899)  .5246 (.0776)  .5011 (.0621)
        U5 :  .5290 (.0894)  .5225 (.0766)  .4993 (.0600)
        N3 :  .5960 (.0994)  .5671 (.0840)  .5393 (.0721)
        U3 :  .5759 (.0952)  .5589 (.0839)  .5324 (.0690)
        U2 :  .6156 (.1016)  .6150 (.0968)  .5883 (.0877)

In summary, the results reported above indicate that the magnitude of ε̂ decreased with increasing k and increased with increasing n. The results also showed that ε̂ was biased, but constant across the categorical scales for ε = 1.0. However, for the ε < 1.0 conditions, the magnitude of ε̂ increased considerably as the scale approached U2, regardless of the conditions of k, n, and G.

We now investigate the empirical Type I error rates of the conventional F test for the 'rater effect' (MSr/MSe) under the simulated conditions. The results for the Type I error rates are examined in relation to the behavior of the sample estimates of epsilon reported above, and also compared to previous findings in the related literature.

Table 4-18 shows the general pattern of Type I error rates (α = .01, .05, and .10) for the three ε conditions across the measurement scales, averaged over the levels of k, n, and G. Note that the empirical percentages for ε = 1.0 were very close to the corresponding nominal levels, and consistent across the categorical scales. However, as expected, the Type I error rates were noticeably inflated by heterogeneous covariance, and the magnitudes of inflation were similar to those reported in the literature. For the ε < 1.0 conditions, the size of the error rates decreased as the scale approached U2.
As a result, the Type I error rates with dichotomous data did not appear to be too serious, especially under the ε = M condition. Note also that the error rates for the uniform distributions were slightly higher than those for the normal distributions, but the magnitude of the difference appears to be negligible. These results are generally in close agreement with previous empirical findings (e.g., Lunney, 1970; Myers et al., 1982).

Table 4-18
Empirical percentage of the Type I error rates, averaged over the levels of k, n, and G

α    ε:      C     N5    U5    N3    U3    U2
.01  H:     0.9   0.9   0.9   0.9   1.0   0.9
     M:     2.2   2.0   2.0   1.8   1.8   1.4
     L:     3.5   3.1   3.2   2.6   2.8   2.3
.05  H:     4.7   4.6   4.9   4.9   4.9   4.6
     M:     7.0   6.6   6.8   6.3   6.4   5.7
     L:     8.9   8.0   8.4   7.5   7.8   7.0
.10  H:     9.4   9.4   9.9  10.0  10.0   9.7
     M:    11.8  11.3  11.6  11.2  11.1  10.4
     L:    13.6  12.7  13.1  12.2  12.4  11.5

The results in Table 4-18 were further broken down by the levels of k, n, or G for some selected conditions, and are presented in Table 4-19 for α = .05 only. The selected conditions for the Type I error rates presented in Table 4-19 are in accordance with those used to construct Tables 4-14 through 4-16 for the epsilon estimates.
Thus, the effect of the magnitude of ε̂ on the error rates can be easily examined by comparing appropriate entries in the corresponding tables.

Table 4-19
Empirical percentage of the Type I error rates for some selected conditions

(a) n = 15, averaged over the levels of G
α    ε   k:     C     N5    U5    N3    U3    U2
.05  H   3:    5.6   4.9   5.2   5.1   4.9   4.8
         5:    4.6   4.3   4.8   5.0   4.8   4.6
         7:    4.7   4.6   4.8   5.0   5.1   4.4
     M   3:    7.1   6.0   6.4   6.0   5.8   5.4
         5:    7.3   6.5   7.2   6.3   6.2   5.9
         7:    7.8   7.5   7.4   7.0   6.8   5.9
     L   3:    8.5   7.3   8.1   7.3   7.2   6.7
         5:    9.2   8.0   8.6   7.4   8.0   7.6
         7:    9.8   8.8   9.1   7.9   8.2   7.2

(b) k = 5, averaged over the levels of G
α    ε   n:     C     N5    U5    N3    U3    U2
.05  H  15:    4.6   4.3   4.8   5.0   4.8   4.6
        30:    4.6   4.9   5.0   5.2   5.1   4.4
        45:    4.7   4.8   5.1   5.2   5.3   4.9
     M  15:    7.3   6.5   7.2   6.3   6.2   5.9
        30:    6.6   6.6   6.3   6.2   6.8   6.0
        45:    6.8   6.7   7.1   6.2   7.0   5.5
     L  15:    9.2   8.0   8.6   7.4   8.0   7.6
        30:    8.8   7.8   8.5   7.2   8.0   7.1
        45:    9.0   8.5   8.7   7.9   8.3   6.9

(c) n = 15 and k = 5
α    ε   G:     C     N5    U5    N3    U3    U2
.05  H  .90:   4.8   4.2   4.9   4.9   4.7   4.9
        .75:   4.5   4.4   4.9   4.9   4.8   4.3
        .60:   4.6   4.3   4.7   5.1   4.9   4.5
     M  .90:   7.5   6.6   7.3   6.8   6.2   5.6
        .75:   7.0   6.1   7.2   6.2   6.0   5.8
        .60:   7.3   6.9   7.2   6.0   6.5   6.4
     L  .90:   9.7   7.9   8.7   7.2   7.9   6.8
        .75:   9.1   7.6   8.0   7.6   8.2   7.3
        .60:   8.9   8.4   9.1   7.3   7.9   8.6

As can be seen in Table 4-19, the general pattern of Type I error rates across the categorical scales was repeatedly demonstrated. That is, the Type I error rates for ε = 1.0 were all close to the nominal level, regardless of the conditions of k, n, and G. The results also showed that the heterogeneous covariance conditions inflated the error rates to some extent, but the magnitude of the inflation decreased as the scale approached U2. This trend was consistent across the levels of k, n, and G. Additionally, for the ε < 1.0 conditions, the error rates increased somewhat with increasing k within the same ε, and this trend was consistent across the categorical scales. Note also in Table 4-19 that the error rates were slightly smaller for the two larger sample sizes, but there was no appreciable difference in the error rates among the three levels of G.
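The ε̂ whose behavior is tracked in these tables is the Greenhouse-Geisser (Box) estimate computed from the sample covariance matrix of the k repeated measures. A minimal sketch of that computation (the function name and the use of NumPy are illustrative assumptions, not taken from the study's own programs):

```python
import numpy as np

def gg_epsilon(S):
    """Greenhouse-Geisser (Box) epsilon from a k x k covariance matrix S.

    epsilon = 1.0 under circularity (e.g., compound symmetry) and has a
    lower bound of 1/(k - 1) under maximal noncircularity.
    """
    k = S.shape[0]
    J = np.eye(k) - np.ones((k, k)) / k   # centering matrix
    H = J @ S @ J                         # double-centered covariance
    return np.trace(H) ** 2 / ((k - 1) * np.trace(H @ H))

# A compound-symmetric matrix (equal variances, equal covariances)
# satisfies circularity, so epsilon should equal 1.0:
cs = 1.0 * np.eye(4) + 0.5 * np.ones((4, 4))
eps_cs = gg_epsilon(cs)
```

Because ε̂ is a ratio of quadratic forms in the sample covariances, it is biased downward in small samples, which is the behavior summarized above.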
From these results, it is evident that the present study replicated, in general, previous findings in the related literature and demonstrated that heterogeneity of covariance inflates the Type I error rates for the usual F statistic.

With respect to the relationship between the magnitude of the epsilon estimates and the Type I error rates, the results in Tables 4-17 and 4-18 showed that for the ε < 1.0 conditions, the pattern and magnitude of changes in Type I error rates across the simulated conditions were very closely related to those shown for the sample estimates of epsilon. For example, as shown in Tables 4-14 through 4-16, the magnitude of ε̂ decreased with increasing k, but increased with increasing n, for the ε < 1.0 conditions. Furthermore, the magnitude of ε̂ increased considerably as the scale approached U2, regardless of the conditions of k, n, and G. Comparatively, the Type I error rates under the ε < 1.0 conditions decreased as the scale approached U2. Within the same ε, they increased for larger k values, and slightly decreased for the two larger sample sizes. Furthermore, the small difference in Type I error rates among the three G values shown at the bottom of Table 4-19 was in accordance with the marginal difference in ε̂ among them shown in Table 4-17.

The dependency between the magnitude of ε̂ and the Type I error rates across the simulated conditions is illustrated in Table 4-20, which presents the general pattern of the correlation between the two variables for the three ε conditions across the six scales. The values in each line of Table 4-20 were based on the estimates across the levels of k, n, and G (i.e., 27 conditions), which add up to 81 conditions for the total (i.e., the 'All' line at the bottom of the table). Inspection of Table 4-20 for the ε < 1.0 conditions shows that the correlations were negative and reasonably high for the continuous data, and were reduced as the scale approached U2.
This trend in the correlation supports the aforementioned relationship between ε̂ and the Type I error rates.

While the size of the Type I error rates varied depending on the magnitude of ε̂ across the simulated conditions for the ε < 1.0 conditions, this was not the case under the ε = 1.0 condition. As can be seen in Tables 4-14 through 4-16, the estimated epsilons for ε = 1.0 varied considerably in size, ranging from .69 to .91, depending on the particular combination of simulated conditions. However, the associated Type I error rates were remarkably similar to one another, and very close to the corresponding nominal levels (see Table 4-19). This relationship is also evident from Table 4-20, as the correlation between ε̂ and the Type I error rates for the ε = 1.0 condition was almost zero, especially under the continuous and 5-point scales. In other words, these results indicate that inflation in the probability of a Type I error for the usual F tests may not be an issue, regardless of the magnitude of ε̂, as long as it is certain that the sample data at hand come from a population which possesses homogeneity of covariance.
However, when the correlation was computed based upon all three ε conditions combined, the aforementioned relationship between ε̂ and the Type I error rates for the ε = 1.0 condition was completely obscured (see the correlation for 'All' in Table 4-20).

Table 4-20
Correlation between the Type I error rates and the epsilon estimates for alpha = .05 only

ε:        C       N5      U5      N3      U3      U2
H:     -.0543   .0863  -.0246  -.1395  -.1552   .1907
M:     -.6680  -.6662  -.6080  -.5927  -.3981  -.2439
L:     -.8145  -.7709  -.7509  -.5512  -.7232  -.6784
All:   -.9304  -.8948  -.9093  -.8634  -.8683  -.7696

This highly negative correlation between the magnitude of ε̂ and the Type I error rates would lead to a misleading interpretation of the relationship between them, namely that the Type I error rates increase with decreasing sample estimates of epsilon, regardless of the population condition of ε. In fact, it is common practice in repeated measures ANOVA designs to use ε̂-adjusted F tests (i.e., the Greenhouse-Geisser or Huynh-Feldt correction) in order to protect against a probable inflation in the Type I error rates whenever an observed epsilon is less than unity. Interestingly, an estimated epsilon from a sample covariance matrix is always less than unity because of the downward bias in ε̂, and the population epsilon is always unknown in practice. Therefore, under such circumstances the ε̂-adjusted F test would be correct only if one presumes that the population covariance matrix from which the sample at hand was taken possesses heterogeneous covariance. Otherwise, the ε̂-adjusted F test would result in an unduly conservative test, and thus increase the probability of a Type II error if the estimated epsilon is in fact from a population matrix with homogeneous covariance.
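The ε̂-adjusted procedure at issue simply multiplies both degrees of freedom of the conventional F test by the estimated epsilon before the p value is obtained. A hedged sketch of this mechanism (SciPy's F distribution is used for the tail probability; the function name is ours):

```python
from scipy.stats import f as f_dist

def adjusted_p_value(F_obs, k, n, eps_hat=1.0):
    """p value for the repeated measures F test on k levels and n subjects.

    With eps_hat = 1.0 this is the conventional test with (k - 1) and
    (n - 1)(k - 1) degrees of freedom; an epsilon-adjusted test
    (Greenhouse-Geisser or Huynh-Feldt style) shrinks both degrees of
    freedom by eps_hat, which can only make the test more conservative.
    """
    df1 = eps_hat * (k - 1)
    df2 = eps_hat * (k - 1) * (n - 1)
    return f_dist.sf(F_obs, df1, df2)

# Same observed F, conventional versus adjusted (eps_hat = .7):
p_conv = adjusted_p_value(2.5, k=5, n=15)
p_adj = adjusted_p_value(2.5, k=5, n=15, eps_hat=0.7)
```

Since ε̂ < 1 in every sample, the correction is triggered whenever it is requested, which is precisely the practice the results above call into question.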
This leads us to query the common practice of utilizing the ε̂-adjusted F test in repeated measures ANOVA designs, and raises the question: Is it always justifiable to use an ε̂-adjusted F test when an observed epsilon is less than unity?

To summarize the results obtained in this section in terms of the relationships among the sample estimates of epsilon, heterogeneity of covariance, observed mean squares, and Type I error rates, the sample estimates of these variables are presented in Table 4-21 for some selected conditions. Consider first the observed mean squares (i.e., MSe and MSr). The magnitude of the mean of the observed mean squares was virtually identical among the three ε conditions, and very close to the expected value (note that EMSe = EMSr under σ²r = 0, as defined in the simulation), and this trend was consistent across the categorical scales. However, the variability of both mean squares was greater under the ε < 1.0 conditions, while it was very close to the theoretical value under ε = 1.0 (the variance expression for mean squares was given at the bottom of Table 4-6). This resulted in a more variable sampling distribution for the F ratio than indicated by the theoretical F distribution under the ε < 1.0 conditions. Since interest lies in the upper tail of the distribution, the cumulative proportion beyond the upper limit contributes to the Type I error rate, which was larger under the ε < 1.0 conditions, as shown in Table 4-21.

For the categorical scales, ε̂ for ε < 1.0 became larger as the scale approached U2, and consequently there was little difference in the magnitude of ε̂ among the three ε conditions under the U2 scale. As the size of ε̂ had a positive impact on the variability of the mean squares, this lack of difference in variability among the three ε conditions appeared in the observed mean squares as well as in the F ratio.
Therefore, the difference in the error rates among the three ε conditions under the U2 scale was relatively small.

Table 4-21
Type I error rates, correlations, and descriptive statistics for some selected conditions (k = 5, n = 15, G1 = .90)

Estimates:      C                  N5                 U3                 U2
εcp (and ε̂)
H:  1.00  (.7722)      1.00  (.7662)      1.00  (.7443)      1.00  (.7230)
M:  .6935 (.6136)      .7651 (.6512)      .8158 (.6615)      .8887 (.6730)
L:  .5010 (.4764)      .5883 (.5387)      .6566 (.5759)      .7640 (.6156)
MSr (sd)
H:  35.6962 (24.6115)  .5471 (.3783)      .3110 (.2134)      .1397 (.1009)
M:  35.9424 (29.8885)  .5474 (.4396)      .3088 (.2451)      .1367 (.1017)
L:  36.1523 (35.4509)  .5443 (.4929)      .3075 (.2780)      .1367 (.1110)
MSe (sd)
H:  35.5015 (6.6744)   .5445 (.1041)      .3080 (.0689)      .1380 (.0334)
M:  35.4153 (7.8132)   .5461 (.1173)      .3061 (.0705)      .1369 (.0337)
L:  35.4021 (9.1513)   .5424 (.1275)      .3021 (.0739)      .1352 (.0338)
EMSe:  35.71 (6.7486)  .5458              .3055              .1370
F (sd)
H:  1.0446 ( .7616)    1.0418 ( .7613)    1.0459 ( .7439)    1.0461 ( .7611)
M:  1.0677 ( .9422)    1.0488 ( .8859)    1.0514 ( .8588)    1.0371 ( .7922)
L:  1.0958 (1.1488)    1.0669 (1.0385)    1.0774 (1.0169)    1.0573 ( .8832)
Type I error (α = .05, .10)
H:  4.8   9.8          4.2   9.4          4.7   9.1          4.9   9.8
M:  7.5  12.3          6.6  11.6          6.2  11.7          5.6  10.6
L:  9.7  14.6          7.9  13.4          7.9  13.2          6.8  11.6
Correlations (ε̂, MSe, MSr)
              ε̂      MSe     ε̂      MSe     ε̂      MSe     ε̂      MSe
H  MSe:     .0031           .0008           .1065           .4330
   MSr:     .0236  -.0259  -.0062   .0105   .0143   .0898   .1991   .1307
M  MSe:    -.4567          -.3468          -.1339           .3101
   MSr:    -.0068  -.0217  -.0015   .0035   .0595   .0664   .1684   .1193
L  MSe:    -.4801          -.4754          -.2877           .1730
   MSr:    -.0198  -.0264   .0101  -.0218   .0602   .0100   .1596   .0880

Finally, examination of the correlations among the sample estimates (ε̂, MSe, and MSr) reveals a very interesting but unclear phenomenon, particularly the correlation between ε̂ and MSe across the levels of ε. For example, a negative correlation between them under the ε < 1.0 conditions suggests that a smaller ε̂ is associated with a larger MSe.
Given that the correlations between ε̂ and MSr, and between MSe and MSr, are almost zero, this negative correlation between ε̂ and MSe indicates that the ratio MSr/MSe (F) tends to be smaller for a smaller ε̂ under the ε < 1.0 conditions. This is opposite to what is reported in the literature. It appears that, given moderate to severe noncircularity in the population, F tests conducted on samples with moderate to severe noncircularity are probably not associated with inflated Type I error rates. If this is so, then the current practice of applying an epsilon-based correction factor to the F test in repeated measures ANOVA designs could be inappropriate. Additional studies are being conducted to explore this phenomenon, and the results will be reported elsewhere (Eom & Schutz, 1993).

Simulation II: Two-facet design

In this section, the simulation results for the two-facet (3 Occasions by 5 Raters), fully-crossed design are examined across the levels of G, n, ε, and type of scale. The results of the two-facet design closely paralleled those of the one-facet design in terms of the effects of categorization, population G value, and sample size, and thus they are discussed only briefly. More emphasis is placed on the effect of noncircularity on the sampling variability of Ĝ2, which resulted in some discrepancy between the two designs. To be consistent, the results are presented in a sequence similar to that of the one-facet design. First, the effect of transforming continuous data into categorical scales on the G coefficient is examined. Second, the characteristics of the sample estimates of the G coefficient are compared across the simulated conditions.
Third, the effect of noncircularity on the sampling variability of the G coefficient is investigated. Finally, the empirical proportions of confidence intervals that failed to include a population G coefficient, and the Type I error rates of quasi F ratios for the "Occasion" and "Rater" main effects, are examined and compared among the three local circularity conditions at specified alpha levels.

A. Calculated population G coefficient (Gcp)

Figure 4-3 illustrates the general pattern of the G coefficient (Gcp) across the six scales for each combination of ε and G. The values in each line shown at the bottom of Figure 4-3 were based on a unique simulated population of N = 90000 (i.e., 9 different simulated populations).

As can be seen in Figure 4-3, the Gcp under continuous data was identical to the corresponding population G value, and consistently so, regardless of the ε conditions. However, as the scale approached U2, the Gcp gradually decreased. The pattern of changes in Gcp across the categorical scales was virtually identical for the three G values, but the magnitude of the decrease in Gcp was somewhat larger for G = .75 and .60. Note also that within the same G, the ε = O/OR condition produced a slightly smaller Gcp value, especially under the U2 scale, but the amount of difference among the three ε conditions seemed to be negligible. In comparison with the findings of the one-facet design, the results of the two-facet design showed a similar pattern of changes in Gcp across the simulated conditions. However, the magnitude of the decrease in Gcp from the C to the U2 scale was somewhat smaller for the two-facet design, and the interactive effects between the categorization, G, and ε conditions appear to be marginal. This difference between the two designs seemed to be due to the larger dimension involved in the two-facet design, as well as to less variation in the covariance elements among the population covariance matrices defined in the two-facet design simulation.

Figure 4-3.
Effect of categorization on the G coefficient with population data (two-facet design, N = 90000)

          Scale:     C       N5      U5      N3      U3      U2
G=.90, ε=NONE:     0.9004  0.8912  0.8877  0.8772  0.8751  0.8479
       ε=O/OR:     0.9003  0.8903  0.8863  0.8765  0.8723  0.8426
       ε=R/OR:     0.9011  0.8913  0.8868  0.8766  0.8724  0.8406
G=.75, ε=NONE:     0.7504  0.7394  0.7341  0.7225  0.7184  0.6872
       ε=O/OR:     0.7495  0.7380  0.7295  0.7198  0.7116  0.6740
       ε=R/OR:     0.7514  0.7397  0.7333  0.7230  0.7165  0.6815
G=.60, ε=NONE:     0.6001  0.5901  0.5840  0.5714  0.5691  0.5375
       ε=O/OR:     0.5986  0.5860  0.5775  0.5661  0.5575  0.5221
       ε=R/OR:     0.6014  0.5904  0.5820  0.5729  0.5666  0.5330

B. Estimated G coefficient (Ĝ2)

A sample estimate of the G coefficient (Ĝ2) for the two-facet, fully-crossed design is calculated from the observed mean squares as:

[4-5]    Ĝ2 = 1 - (MSpo + MSpr - MSe) / MSp

As described in Chapter II, Ĝ2 is a negatively biased estimator of the population G2 value; the amount of bias is greater for smaller G2 values, but decreases with increasing sample size. Furthermore, it was also shown that the magnitude of the bias in Ĝ2 is independent of the number of facets as well as of the number of levels of a facet.

The aforementioned characteristics of Ĝ2 were well reflected in the empirical results, and they are graphically illustrated in Figure 4-4, separately for the three sample sizes. Note that the Ĝ2 values under the continuous data were almost identical among the three ε conditions within the same G and n. These results indicate that violation of the circularity condition does not have any effect on the magnitude of the sample estimates of the G coefficient.
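Equation [4-5] can be checked directly against the expected mean squares reported later in Table 4-22, which reproduce the three population G values exactly. A sketch (the function name is ours):

```python
def g2_hat(ms_p, ms_po, ms_pr, ms_e):
    """Two-facet G coefficient estimate from observed mean squares,
    Equation [4-5]: G2 = 1 - (MSpo + MSpr - MSe) / MSp."""
    return 1.0 - (ms_po + ms_pr - ms_e) / ms_p

# Expected mean squares from Table 4-22 recover the population values:
print(g2_hat(900, 66, 55, 31))    # 0.9
print(g2_hat(680, 116, 85, 31))   # 0.75
print(g2_hat(550, 151, 100, 31))  # 0.6
```

In any single replication the observed mean squares replace the expected ones, which is the source of the negative bias discussed above.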
However, under the categorical scales, the ε = O/OR condition yielded a slightly smaller Ĝ2 value, especially for the combination of G < .90 and n = 15, but the magnitude of the difference in Ĝ2 among the three ε conditions was quickly reduced with increasing sample size.

Figure 4-4. The mean of Ĝ2 for the three sample sizes (2000 replications); panels (a) n = 15, (b) n = 30, and (c) n = 45, each plotted across the scales C, N5, U5, N3, U3, and U2 for the ε = NONE, O/OR, and R/OR conditions.

To examine the nature of the bias in the sample estimator of G2 under categorical scales, the values of Ĝ2 were compared to the corresponding Gcp values and are depicted in Figure 4-5 for the three sample sizes, averaged over the three ε conditions. Figure 4-5 clearly illustrates the effect of the population G value and sample size on Ĝ2: the amount of bias in Ĝ2 increased for the smaller G2 values, but decreased with increasing sample size. The parallel trends between Gcp and Ĝ2 across the categorical scales also indicate that Ĝ2 was biased, but that the amount of bias was consistent across all six scales.

Figure 4-5. Comparison of Gcp and Ĝ2, averaged over the levels of epsilon.

In summary, the pattern of changes in Ĝ2 across the simulated conditions resembled that of Gcp shown in Figure 4-3, except for the magnitude of the bias in Ĝ2. Furthermore, the effects of the simulated conditions (i.e., categorization, G, and n) on the sample estimates were similar to those of the one-facet design, but to a somewhat lesser extent.
Although there was some discrepancy in the degree of noncircularity between the two designs, it was apparent from the results of both designs that violation of the circularity condition did not have any effect on the magnitude of either Ĝ1 or Ĝ2 under continuous data.

With respect to the sampling variability of Ĝ2, it is apparent from Equation [4-6] that the variability of Ĝ2 increases with decreasing G2 and decreases with increasing sample size. Note also that the variance expression involves Satterthwaite's adjusted degrees of freedom, fa:

[4-6]    var(Ĝ2) = (1 - G2)² · [2(np - 1)²(fa + np - 3)] / [fa(np - 3)²(np - 5)]

The theoretical variabilities of Ĝ2 across the simulated conditions are presented in Table 4-22. These values were used to examine whether the sampling theory of the G coefficient is robust to noncircularity, by comparing them with the corresponding empirical values among the ε conditions.

Table 4-22
Expected mean squares (EMS), Satterthwaite's degrees of freedom (fa), and theoretical standard deviation (SD) of Ĝ2

G      EMSp   EMSpo   EMSpr   EMSe    n      fa     SD(Ĝ2)
.90    900     66      55      31     15    37.13   .0600
                                      30    76.91   .0353
                                      45   116.69   .0273
.75    680    116      85      31     15    46.75   .1462
                                      30    96.84   .0859
                                      45   146.93   .0664
.60    550    151     100      31     15    48.32   .2332
                                      30   100.11   .1369
                                      45   151.89   .1059

Table 4-23 presents the mean and standard deviation of Ĝ2 based on 2000 replications across the three ε conditions for some selected conditions. Note that the standard deviation for the ε = O/OR condition with n = 15 was somewhat larger than for the other two conditions, but this difference vanished as the sample size increased. Examination of the range of Ĝ2 for this condition showed that the large standard deviation was mainly due to some outliers (e.g., the smallest value of Ĝ2 for ε = O/OR was .1087, with a corresponding z value of -11.9 under the continuous data, as compared to .4461 and .5060 for the other two conditions).
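The fa and SD(Ĝ2) entries in Table 4-22 follow from Satterthwaite's approximation for the composite MSpo + MSpr - MSe together with Equation [4-6]. A sketch reproducing the first row, assuming np persons, no = 3 occasions, and nr = 5 raters (function names are ours):

```python
def satterthwaite_fa(ems_po, ems_pr, ems_e, n_p, n_o, n_r):
    """Satterthwaite degrees of freedom for MSpo + MSpr - MSe."""
    df_po = (n_p - 1) * (n_o - 1)
    df_pr = (n_p - 1) * (n_r - 1)
    df_e = (n_p - 1) * (n_o - 1) * (n_r - 1)
    num = (ems_po + ems_pr - ems_e) ** 2
    den = ems_po**2 / df_po + ems_pr**2 / df_pr + ems_e**2 / df_e
    return num / den

def sd_g2(G2, n_p, fa):
    """Theoretical standard deviation of the G estimate, Equation [4-6]."""
    var = (1 - G2)**2 * 2 * (n_p - 1)**2 * (fa + n_p - 3) / (
        fa * (n_p - 3)**2 * (n_p - 5))
    return var ** 0.5

# First row of Table 4-22 (G = .90, n = 15):
fa15 = satterthwaite_fa(66, 55, 31, n_p=15, n_o=3, n_r=5)
print(round(fa15, 2))                 # 37.13
print(round(sd_g2(0.90, 15, fa15), 4))
```

The remaining fa entries in Table 4-22 follow from the same two functions with the other EMS rows and sample sizes.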
In general, the empirical standard deviations under the continuous data were virtually identical among the three ε conditions for a fixed value of G, and were close to the corresponding theoretical values reported in Table 4-22. However, this was not the case for the categorical scales. As can be seen in Table 4-23, the variability of Ĝ2 increased considerably as the scale approached U2, and was substantially larger than its theoretical counterpart. As was the case in the one-facet design, the large sampling variability of Ĝ2 under the categorical scales appeared to be a result of the initial difference between G2 and Gcp brought about by the effect of categorization. Considering that the standard deviations were similar among the three ε conditions under the same categorical scale, it appears that violation of the circularity assumption had a minimal effect on the sampling variability of Ĝ2 for the categorical scales as well. Taken together, these results suggest that the sampling theory of G2 is quite robust to violation of the circularity condition.

With respect to the effect of the ε conditions on the sampling variability of Ĝ2, the results from the two-facet design were somewhat different from those of the one-facet design. For example, as shown in Table 4-3, under the continuous data the standard deviation of Ĝ1 was larger for the ε < 1.0 conditions, which indicates some positive effect of heterogeneity of covariance on the sampling variability of Ĝ1. In the two-facet design, however, although the standard deviation of Ĝ2 in Table 4-23 tends to be slightly larger for the ε = O/OR and R/OR conditions, the magnitude of the difference among the three ε conditions seemed to be very small.
Even though some discrepancy may be expected between the two designs because of the unmatched degree of ε defined in each design, as well as the difference in the designs themselves, the reason why noncircularity did not have any appreciable effect on the sampling variability of Ĝ2 needs further clarification.

Table 4-23
The mean (standard deviation) of Ĝ2 for the two-facet design (2000 replications)

n   G    ε       C               N5              U3              U2
15  .90  NONE:  .8838 (.0598)   .8752 (.0649)   .8595 (.0751)   .8312 (.0898)
         O/OR:  .8830 (.0649)   .8741 (.0699)   .8560 (.0798)   .8256 (.0964)
         R/OR:  .8851 (.0587)   .8758 (.0636)   .8571 (.0726)   .8232 (.0934)
30  .90  NONE:  .8926 (.0352)   .8839 (.0378)   .8678 (.0437)   .8404 (.0542)
         O/OR:  .8931 (.0357)   .8841 (.0387)   .8655 (.0459)   .8352 (.0579)
         R/OR:  .8932 (.0359)   .8839 (.0388)   .8650 (.0452)   .8323 (.0572)
    .75  NONE:  .7308 (.0854)   .7210 (.0894)   .7009 (.0961)   .6697 (.1050)
         O/OR:  .7315 (.0855)   .7209 (.0891)   .6948 (.1000)   .6562 (.1136)
         R/OR:  .7316 (.0870)   .7215 (.0903)   .6985 (.0973)   .6617 (.1111)
    .60  NONE:  .5688 (.1365)   .5597 (.1415)   .5402 (.1476)   .5099 (.1565)
         O/OR:  .5689 (.1383)   .5580 (.1428)   .5298 (.1533)   .4933 (.1662)
         R/OR:  .5695 (.1397)   .5602 (.1422)   .5366 (.1507)   .5029 (.1597)
45  .90  NONE:  .8958 (.0265)   .8871 (.0289)   .8711 (.0340)   .8437 (.0418)
         O/OR:  .8959 (.0272)   .8869 (.0297)   .8684 (.0358)   .8379 (.0458)
         R/OR:  .8965 (.0267)   .8872 (.0291)   .8684 (.0346)   .8359 (.0447)

To investigate the cause of this discrepancy, we examined the sampling characteristics of the observed mean squares. Table 4-24 presents the means and standard deviations of the mean square estimates, and the pair-wise correlations between them, across the three ε conditions for some selected conditions.
As expected, the means of the observed mean squares (MSp, MSpo, MSpr, MSe) were very similar among the three ε conditions within the same G, and close to their corresponding population values (to be more precise, they were almost identical to the mean squares calculated on the simulated population data set).

Table 4-24
The mean (standard deviation) of the observed mean squares and their correlations (continuous data only, n = 30, 2000 replications)

                              G coefficient
MS      ε         .90                .75                .60
MSp     NONE:   906.13 (241.03)   683.22 (181.48)   551.96 (146.17)
        O/OR:   903.66 (240.70)   681.07 (182.05)   549.78 (146.57)
        R/OR:   905.34 (241.44)   683.04 (181.53)   551.63 (146.58)
MSpo    NONE:    66.58 (12.60)    117.09 (21.99)    152.46 (28.53)
        O/OR:    66.31 (15.13)    116.93 (27.05)    152.46 (35.04)
        R/OR:    65.87 (12.15)    116.55 (21.90)    151.84 (28.42)
MSpr    NONE:    55.06 (7.28)      85.15 (11.33)    100.21 (13.36)
        O/OR:    54.83 (7.04)      84.82 (10.97)     99.66 (12.81)
        R/OR:    55.05 (8.88)      85.08 (13.61)    100.12 (16.29)
MSe     NONE:    30.98 (2.84)      30.97 (2.83)      30.97 (2.82)
        O/OR:    30.98 (3.51)      30.93 (3.46)      30.91 (3.50)
        R/OR:    31.08 (4.30)      31.07 (3.79)      31.06 (3.96)

Correlation:
                 MSe    MSpo       MSe    MSpo       MSe    MSpo
NONE   MSpo:    -.02               -.02               -.02
       MSpr:    -.05    .00        -.05   -.01        -.05   -.01
O/OR   MSpo:    -.02               -.01               -.01
       MSpr:     .14   -.01         .25   -.02         .44   -.01
R/OR   MSpo:     .13                .06                .02
       MSpr:    -.01   -.01        -.02   -.02         .00   -.02

Note: correlations between MSp and the other MS's were all less than |.05|.

With respect to the variability of the mean squares, Table 4-24 shows that the standard deviation of MSp was virtually identical for the three ε conditions, but that those of MSpo, MSpr, and MSe were positively inflated under violation of the respective local circularity. For example, the standard deviation (15.13) of MSpo under the ε = O/OR condition was considerably larger than its theoretical counterpart [12.26 = (2(EMSpo)²/df)^(1/2) = (2(66)²/58)^(1/2)]. A similar result was shown for MSpr under the ε = R/OR condition.
Note also that the standard deviation of MSe was positively inflated under both the ε = O/OR and R/OR conditions, and that it was slightly larger under the ε = R/OR condition -- this was due to the smaller ε value defined for the O by R interaction term in the simulation (see Table 3-4). Taken together, these results indicate that the mean square estimates were unbiased, but more variable, under violation of local circularity. Given that the variability of MSp was fairly consistent across the three ε conditions, the variance of the linear combination of observed mean squares in the numerator of Equation [4-5] appears to be the main source determining the degree of sampling variability of Ĝ2. Since the mean squares in the numerator of Equation [4-5] were more variable when circularity failed, we might expect the variability of Ĝ2 also to be somewhat larger under noncircularity, as was the case in the one-facet design. However, the results in Table 4-23 showed that this did not happen: there was only a minimal effect of noncircularity on the sampling variability of Ĝ2 for the two-facet design. The reason for this is perhaps that the computational formula for Ĝ2 involves a combination of mean squares.

Inspection of Table 4-24 in relation to Equation [4-5] suggests a possible explanation for why the sampling variability of Ĝ2 was not sensitive to noncircularity. For example, the ε = O/OR condition produced more variable MSpo and MSe, whereas the ε = R/OR condition yielded larger variability for MSpr and MSe. Therefore, in order for Ĝ2 to be more variable under noncircularity conditions, either MSpo and MSe, or MSpr and MSe, must have at least some degree of negative correlation, given that the correlation between MSpo and MSpr was essentially zero under both the ε = O/OR and R/OR conditions. However, as shown at the bottom of Table 4-24, this was not the case.
All correlations were nearly zero, except for those between MSpr and MSe under the ε = O/OR condition, and between MSpo and MSe under the ε = R/OR condition. A positive correlation between MSpr and MSe, for example, may be due in part to the fact that MSpr is a composite of MSe and no·σ̂²pr. Thus, under the ε = O/OR condition, MSpr would tend to fluctuate with MSe, given that the quantity no·σ̂²pr is almost independent of MSe. The cause of the varying size of the correlation coefficients for these pairs across the levels of G is unclear, but it could be a result of the difference in the magnitude of the mean squares among the levels of G. Nonetheless, these results suggest that more-variable individual mean squares are not necessarily associated with a particularly small or large value of Ĝ2. Since the characteristics of the sampling variability of Ĝ2 are directly related to the properties of the sampling distribution of Ĝ2, we examine this problem further in the following section.

C. Empirical sampling distribution of Ĝ2

As presented in Chapter II, the ratio (1 - G2)/(1 - Ĝ2) is approximately distributed as an F-variate with degrees of freedom (np - 1) for the numerator and fa for the denominator. It was also shown that from this a 100(1 - α)% confidence interval for a population G2 value can be constructed as:

[4-7]    Lower limit < G2 < Upper limit
         1 - (1 - Ĝ2)FU < G2 < 1 - (1 - Ĝ2)FL

The terms FL and FU are the critical values corresponding to the lower α/2 and upper (1 - α/2) percentage points, respectively, of the F distribution with degrees of freedom (np - 1) for the numerator and f̂a for the denominator. The quantity f̂a is estimated using the observed mean squares, and thus varies over replications. In the simulation, the lower and upper limits of a 100(1 - α)% confidence interval for a specified population G2 value were obtained for each replication.
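The interval in Equation [4-7] can be sketched as follows, using SciPy for the F percentage points (the function name is ours; in the simulation f̂a would be recomputed from the observed mean squares of each replication):

```python
from scipy.stats import f as f_dist

def g2_confidence_interval(g2_hat, n_p, fa_hat, alpha=0.10):
    """100(1 - alpha)% interval for G2 from Equation [4-7]."""
    F_L = f_dist.ppf(alpha / 2, n_p - 1, fa_hat)      # lower alpha/2 point
    F_U = f_dist.ppf(1 - alpha / 2, n_p - 1, fa_hat)  # upper 1 - alpha/2 point
    return 1 - (1 - g2_hat) * F_U, 1 - (1 - g2_hat) * F_L

# Illustrative values only: a sample estimate near .88 with f̂a near the
# theoretical fa = 37.13 of Table 4-22 (n = 15, G = .90).
lo, hi = g2_confidence_interval(0.88, n_p=15, fa_hat=37.13)
```

Because FU > 1 > FL, the interval is asymmetric about Ĝ2, extending farther below the estimate than above it, consistent with the long lower tails in Table 4-25.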
The mean and standard deviation of the limits of the 2000 confidence intervals are presented in Table 4-25 for some selected conditions.

Table 4-25
The mean (standard deviation) of the limits of the 90% confidence intervals for G2 in the two-facet design for some selected conditions (2000 replications)

                      C scale                          U2 scale
n   G    ε       LL              UL              LL              UL
15  .90  NONE:  .7703 (.1161)   .9490 (.0264)   .6486 (.1770)   .9274 (.0395)
         O/OR:  .7685 (.1260)   .9487 (.0287)   .6411 (.1888)   .9246 (.0425)
         R/OR:  .7723 (.1140)   .9496 (.0260)   .6344 (.1831)   .9237 (.0411)
30  .90  NONE:  .8263 (.0562)   .9379 (.0204)   .7344 (.0874)   .9092 (.0314)
         O/OR:  .8270 (.0573)   .9382 (.0208)   .7276 (.0932)   .9059 (.0336)
         R/OR:  .8272 (.0573)   .9383 (.0209)   .7220 (.0917)   .9044 (.0332)
    .75  NONE:  .5731 (.1351)   .8428 (.0499)   .4644 (.1682)   .8094 (.0610)
         O/OR:  .5740 (.1358)   .8432 (.0499)   .4452 (.1821)   .8012 (.0660)
         R/OR:  .5743 (.1375)   .8433 (.0509)   .4525 (.1774)   .8047 (.0646)
    .60  NONE:  .3178 (.2160)   .7478 (.0798)   .2101 (.2501)   .7163 (.0910)
         O/OR:  .3179 (.2197)   .7480 (.0807)   .1877 (.2656)   .7059 (.0966)
         R/OR:  .3188 (.2208)   .7483 (.0818)   .1995 (.2552)   .7122 (.0930)
45  .90  NONE:  .8456 (.0389)   .9328 (.0172)   .7632 (.0617)   .9004 (.0271)
         O/OR:  .8457 (.0401)   .9328 (.0177)   .7558 (.0675)   .8964 (.0296)
         R/OR:  .8466 (.0392)   .9332 (.0174)   .7522 (.0658)   .8953 (.0289)

With respect to the relationship between the mean squares and Ĝ2 under noncircularity conditions, it can be anticipated from Equation [4-7] that if the more-variable mean squares obtained under the ε = O/OR or R/OR conditions were associated with a particularly small or large Ĝ2, this would result in even greater fluctuation in the limits of the confidence interval, because the critical values FU and FL in Equation [4-7] vary depending on the denominator degrees of freedom, f̂a. As can be seen in Table 4-25, however, the results do not suggest such a relationship between the mean squares and Ĝ2. There was little difference among the three ε conditions in the mean and standard deviation of both limits of the 90% confidence intervals under the continuous data within the same G and n. Therefore, these results also support the robustness of the sampling theory of G2 to violation of circularity. One further note from Table 4-25 is that the width of the confidence interval for the U2 scale became narrower with increasing G and n, so that its upper limit was already close to the population G2. Therefore, as is shown later, a larger empirical proportion of the confidence intervals would fail to include the population G2.

To assess the adequacy and robustness of the sampling theory of the G coefficient in the two-facet design, the empirical proportion of the 2000 confidence intervals that failed to include a specified population G2 value in either the lower or upper direction was obtained at three significance levels (α = .10, .05, .01, two-tailed). Table 4-26 presents the empirical proportions for the three ε conditions, averaged over the levels of G and n. Thus, the results in this table represent the general pattern of the effect of noncircularity on the sampling distribution of Ĝ2 across the six scales. Table 4-26 was further broken down by the levels of G or n, and the results are presented in Table 4-27 for α = .10 only. Note that the values for the upper 5%, for example, are the empirical proportions of confidence intervals whose lower limit was greater than the specified population G2 value; this proportion can thus be interpreted as a Type I error rate.

From Tables 4-25 and 4-26 it is apparent that the sampling theory of G2 is robust to violation of the circularity condition under continuous data, as the empirical proportion was almost identical among the three ε conditions. Although the proportion was slightly larger for the ε = O/OR and R/OR conditions, the magnitude of the difference seemed to be negligible.
Furthermore, both upper and lower empirical proportions were all close to the corresponding nominal levels, regardless of the levels of G and n.

Table 4-26
Empirical proportion of confidence intervals that failed to include a population G2 value (averaged over the levels of G and n)

α            ε         C     N5    U5    N3    U3    U2
Upper 5%     NONE:    4.7   3.7   3.5   2.5   2.3   1.3
             O/OR:    5.2   4.0   3.7   2.3   2.2   1.0
             R/OR:    5.2   3.7   3.5   2.4   2.2   1.1
Lower 5%     NONE:    5.0   6.6   7.3  10.2  10.7  19.5
             O/OR:    5.2   7.1   8.1  10.8  12.5  22.3
             R/OR:    5.2   6.8   7.7  10.1  11.7  22.1
Upper 2.5%   NONE:    2.2   1.7   1.7   1.2   1.2    .6
             O/OR:    2.7   1.8   1.8   1.1   1.0    .4
             R/OR:    2.5   1.8   1.7   1.1   1.1    .5
Lower 2.5%   NONE:    2.5   3.4   3.6   5.5   5.9  12.3
             O/OR:    2.6   3.5   4.2   5.8   6.9  15.0
             R/OR:    2.6   3.4   4.0   5.6   6.6  14.4
Upper .5%    NONE:     .5    .4    .4    .2    .2    .1
             O/OR:     .5    .4    .3    .2    .2    .1
             R/OR:     .5    .4    .4    .2    .2    .1
Lower .5%    NONE:     .4    .7    .7   1.2   1.3   3.7
             O/OR:     .4    .7    .9   1.4   1.7   5.1
             R/OR:     .5    .6    .7   1.2   1.4   4.9

Table 4-27
Empirical percentage of 90% confidence intervals that failed to include a specified population G2 value

α         ε     G        C     N5    U5    N3    U3    U2   (averaged over the levels of n)
Upper 5%  NONE  .90:    4.6   2.9   2.8   1.3   1.2    .4
                .75:    5.0   3.9   3.6   2.7   2.5   1.5
                .60:    4.6   4.4   4.0   3.5   3.1   1.9
          O/OR  .90:    5.3   3.2   3.1   1.2   1.2    .4
                .75:    5.2   4.2   4.0   2.5   2.4   1.0
                .60:    5.2   4.5   4.0   3.3   3.0   1.6
          R/OR  .90:    4.9   2.9   2.8   1.2   1.2    .3
                .75:    5.2   3.7   3.7   2.6   2.3   1.2
                .60:    5.4   4.5   4.1   3.5   3.1   1.8
Lower 5%  NONE  .90:    5.1   7.9   8.8  14.1  15.4  32.7
                .75:    4.9   6.2   6.9   8.8   9.3  15.3
                .60:    5.0   5.8   6.3   7.5   7.5  10.5
          O/OR  .90:    5.6   8.3  10.2  15.0  17.6  35.5
                .75:    5.0   6.6   7.5   9.3  11.2  18.7
                .60:    5.0   6.3   6.7   7.9   8.6  12.7
          R/OR  .90:    4.8   7.6   9.4  14.4  17.2  37.8
                .75:    5.4   6.6   7.1   8.8   9.8  17.1
                .60:    5.3   6.1   6.5   7.3   8.1  11.5

α         ε     n        C     N5    U5    N3    U3    U2   (averaged over the levels of G)
Upper 5%  NONE  15:     5.0   4.7   4.5   3.4   3.2   2.1
                30:     4.6   3.5   3.3   2.2   2.0    .8
                45:     4.7   3.0   2.6   1.9   1.6    .8
          O/OR  15:     5.5   4.8   5.0   3.3   3.5   1.7
                30:     5.2   3.8   3.3   2.0   1.8    .7
                45:     5.0   3.4   2.9   1.7   1.3    .5
          R/OR  15:     5.4   4.1   4.5   3.1   3.0   1.9
                30:     5.0   3.6   3.4   2.4   2.0    .9
                45:     5.1   3.3   2.7   1.8   1.6    .4
Lower 5%  NONE  15:     5.3   6.4   6.7   8.1   8.8  13.0
                30:     5.5   7.0   7.7  10.3  11.1  19.9
                45:     4.2   6.5   7.6  12.0  12.3  25.5
          O/OR  15:     5.7   6.7   6.8   8.8   9.5  14.3
                30:     5.2   7.1   8.5  10.9  12.6  22.7
                45:     4.8   7.5   9.0  12.6  15.3  29.9
          R/OR  15:     5.4   6.4   6.7   8.2   8.7  13.9
                30:     5.6   7.2   8.0  10.8  12.2  22.8
                45:     4.5   6.8   8.3  11.4  14.2  29.8

The results in Tables 4-25 and 4-26 also indicate that the sampling theory of Ĝ2 is not adequate for the categorical scales, especially for a 3-point or less scale. As expected, the empirical proportion decreased in the upper direction and increased considerably, to an unacceptable level, in the lower direction as the scale approached U2, and this trend was more apparent for larger G and n values (see Table 4-27). In general, these results reflected the characteristics of the sampling variability of Ĝ2 reported above. That is, a condition with a larger variability of Ĝ2 is associated with a larger empirical proportion in the sampling distribution of Ĝ2. Therefore, it can be concluded that the sampling theory of G2 for the two-facet design works well under continuous data, and is quite robust to violation of the circularity condition. However, it was not acceptable for categorical scales, especially for a 3-point or less scale. As discussed in relation to the one-facet design, this inadequacy was mainly due to the effect of categorization on the population characteristics.

D. Type I error rates in quasi F tests

In this section, we present empirical Type I error rates for quasi F ratios for the tests of the Rater and Occasion main effects in the context of a three-way (Subjects by Raters by Occasions) random effects ANOVA model. In general, for any given design requiring the quasi F there are several different ways to form a test statistic (Winer, 1971). The two quasi F ratios used for our two-facet design are presented at the bottom of Table 4-28.
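In outline, both forms are assembled directly from the cell mean squares. The sketch below (illustrative Python; the numeric mean-square values are arbitrary, not simulation output) mirrors the formulas given in full at the bottom of Table 4-28:

```python
def quasi_f_occasion(ms_o, ms_po, ms_or, ms_e, form=1):
    """Quasi F for the Occasion main effect in the three-way random model.
    Form 1 combines mean squares in the denominator only; Form 2
    combines them in both numerator and denominator."""
    if form == 1:
        return ms_o / (ms_po + ms_or - ms_e)          # QFO1
    return (ms_o + ms_e) / (ms_po + ms_or)            # QFO2

def quasi_f_rater(ms_r, ms_pr, ms_or, ms_e, form=1):
    """Quasi F for the Rater main effect, same two forms."""
    if form == 1:
        return ms_r / (ms_pr + ms_or - ms_e)          # QFR1
    return (ms_r + ms_e) / (ms_pr + ms_or)            # QFR2

qfo1 = quasi_f_occasion(12.0, 4.0, 3.0, 2.0, form=1)  # 12 / 5 = 2.4
qfo2 = quasi_f_occasion(12.0, 4.0, 3.0, 2.0, form=2)  # 14 / 7 = 2.0
```

Under the respective null hypothesis, numerator and denominator of each form have equal expected mean squares, which is what licenses treating the ratio as an approximate F statistic.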
Under the null hypothesis that σ²r = 0 (i.e., no Rater effect), for example, the numerator and denominator of both QFR1 and QFR2 have the same structure of expected mean squares. Thus, the test statistic can be set up in the usual way, but the degrees of freedom for the terms formed from a combination of mean squares are obtained from Satterthwaite's procedure. In general, Satterthwaite's adjusted degrees of freedom are fractional, and thus an exact critical F value for fractional degrees of freedom was obtained in the simulation from the F-inverse function in the IMSL subroutine library.

The reason for including the first form of the quasi F ratio (Form 1 in Table 4-28) was that its structure is similar to that involved in the sampling distribution of Ĝ2 for the two-facet design. As reported for the one-facet design, the effect of noncircularity on the empirical proportion of the sampling distribution of Ĝ1 was not as large as that on the Type I error rates of the F test. The main cause is that the F test involves both MSr and MSe, which are more variable when the circularity assumption fails, whereas the sampling distribution of Ĝ1 involves MSe and MSp, and the variability of MSp is not sensitive to noncircularity conditions. Therefore, for similar reasons, it may be expected that noncircularity would have a larger effect on the quasi F test than on the sampling distribution of Ĝ2, and thus inflate the quasi F test at least somewhat for either the Rater or the Occasion main effect, because the quasi F ratio includes more-variable mean squares in both numerator and denominator. As can be seen in Table 4-28, the results do show that the Type I error rates for the tests of both the Occasion and Rater main effects using the first form of the quasi F ratio were somewhat positively inflated under violation of the respective local circularity.
For example, the test for the Occasion effect resulted in slightly larger Type I error rates under the ε = O/OR condition, whereas the Type I error rates for the same test were somewhat conservative or close to the nominal level under the other two ε conditions. Similar but slightly larger inflation in the Type I error rates was produced for the test of the Rater effect under the ε = R/OR condition.

With respect to the second form of the quasi F ratio, Maxwell and Bray (1986) used this form in a simulation study to investigate the effect of violating sphericity (circularity) on the quasi F ratio in a three-way ANOVA design with one nested factor. They concluded that the quasi F ratio was in general quite robust to noncircularity, though it produced conservative results for some of the conditions simulated. Although the design, and thus the expected values of the mean squares, differ between their study and the present study, we expected that the results of this section would also show the robustness of quasi F tests to violation of circularity. These results would then serve as a partial validation of the simulation procedure and the subsequent calculations implemented in the simulation program. As can be seen in Table 4-28, the Type I error rates for the test of the Occasion effect were very close to the nominal level under the ε = O/OR condition, indicating the robustness of the quasi F, whereas the same test under the other two conditions was somewhat conservative. A slightly larger inflation was shown for the test of the Rater effect under the ε = R/OR condition. In general, the Type I error rates for the second form of the quasi F test were smaller than those for the first form. This may be because the second form involves Satterthwaite's degrees of freedom in both numerator and denominator, which can be adjusted in such a way that the critical F value is larger or smaller, keeping the actual probability of a Type I error near the nominal level.
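Satterthwaite's adjusted degrees of freedom for a linear combination L = Σ cᵢMSᵢ follow the standard rule df = L² / Σ (cᵢMSᵢ)²/dfᵢ, and are generally fractional. A minimal sketch (the mean squares and component dfs below are arbitrary illustrations, not thesis values):

```python
def satterthwaite_df(coefs, mean_squares, dfs):
    """Approximate df for L = sum(c_i * MS_i) by Satterthwaite's rule:
    df = L**2 / sum((c_i * MS_i)**2 / df_i). The result is usually
    fractional, so the critical F value needs an F-inverse that accepts
    non-integer degrees of freedom."""
    terms = [c * ms for c, ms in zip(coefs, mean_squares)]
    L = sum(terms)
    return L ** 2 / sum(t ** 2 / d for t, d in zip(terms, dfs))

# e.g. the Form 1 denominator MSpo + MSor - MSe with made-up values:
df_hat = satterthwaite_df([1, 1, -1], [4.0, 3.0, 2.0], [28, 8, 56])
```

In the simulation this fractional df was passed to the IMSL F-inverse routine; in modern code `scipy.stats.f.ppf` accepts fractional degrees of freedom directly.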
One further note from these results is that the Type I error rates were virtually identical between the normal and uniform distributions within the same number of response categories, and they decreased as the scale approached U2, regardless of the ε conditions and the type of quasi F ratio.

In summary, the results of the two-facet design closely paralleled those of the one-facet design in terms of the effects of categorization, sample size, and population G value. However, some discrepancy was observed between the two designs in the effect of noncircularity on the sampling variability of the G coefficient. That is, the sampling distribution of Ĝ2 was quite robust to violation of the circularity condition, whereas that of the one-facet design was not. Finally, as was the case in the one-facet design, noncircularity had more effect on the F test (and quasi F) than on the sampling distribution of Ĝ2. The results also indicate that the quasi F tests were relatively robust to noncircularity, and their Type I error rates were generally in close agreement with previous findings in the literature.

Table 4-28
Empirical percentage of Type I error rates for quasi F tests in the three-way random effects ANOVA model (averaged over the levels of G and n, each condition having 2000 replications, alpha = .05)

Quasi F   ε         C     N5    U5    N3    U3    U2
QFO1      NONE:    4.5   4.2   4.2   4.3   4.2   3.9
          O/OR:    6.1   6.1   5.7   5.6   5.5   5.1
          R/OR:    4.3   4.5   4.0   4.1   4.2   3.5
QFR1      NONE:    5.3   5.3   5.4   5.0   5.2   4.6
          O/OR:    3.1   3.3   3.2   3.4   3.5   3.2
          R/OR:    7.6   7.1   7.1   6.3   6.6   5.7
QFO2      NONE:    3.4   3.2   3.2   3.3   3.1   2.9
          O/OR:    5.0   5.0   4.6   4.4   4.4   4.0
          R/OR:    3.1   3.3   3.0   3.0   3.1   2.5
QFR2      NONE:    4.0   4.2   4.1   4.0   4.1   3.7
          O/OR:    2.2   2.3   2.3   2.5   2.5   2.6
          R/OR:    6.4   5.8   6.0   5.3   5.4   5.0

          Occasion effect                        Rater effect
Form 1:   QFO1 = MSo / (MSpo + MSor - MSe)      QFR1 = MSr / (MSpr + MSor - MSe)
Form 2:   QFO2 = (MSo + MSe) / (MSpo + MSor)    QFR2 = (MSr + MSe) / (MSpr + MSor)

CHAPTER FIVE: SUMMARY AND CONCLUSIONS

This chapter presents a brief summary of the findings
of the present study, followed by the implications of the empirical results. It concludes with the limitations of the present study and suggestions for future research.

The present study employed Monte Carlo procedures to investigate the interactive effect of data categorization and heterogeneity of covariance on the generalizability coefficient for the one-facet and two-facet designs, as well as on the Type I error rates for the F tests in repeated measures ANOVA designs. The primary focus was to examine and compare the sampling characteristics of the G coefficients obtained on both categorical scales and their parent continuous data under violation of the circularity assumption. Computer programs were developed to construct the population covariance matrices of interest with desired G and ε values, and to conduct a series of simulations under various sampling conditions.

One-facet design

An overview of the results with respect to the G coefficient and to the Type I error rates is illustrated in a tree diagram and presented in Table 5-1 and Table 5-2, respectively, for the one-facet design.

Table 5-1
An overview of the results regarding Gcp, Ĝ1, and empirical proportions beyond the theoretical limits of the tolerance interval of Ĝ1 in the one-facet design
[Tree diagram: for each combination of scale (C, U2), ε (1.0, .50), k (7, 3), G, and n (45, 15), the diagram lists Gcp, the mean and standard deviation of Ĝ1, and the empirical percentages beyond the lower (L%) and upper (U%) tolerance limits.]

Table 5-2
An overview of the results regarding εcp, ε̂, and Type I error rates in the one-facet design
[Tree diagram: for each combination of scale, ε, k, and n, the diagram lists εcp, the mean and standard deviation of ε̂, and the empirical Type I error rate at α = .05.]

Cost of data categorization. Categorization of continuous data had a marked influence on the G coefficient, resulting in a considerably smaller Gcp than for the parent continuous data, especially for a 3-point or less scale. Although the magnitude of the reduction in Gcp from the C to the U2 scale varied in a rather complicated manner, depending on the particular combination of G, k, and ε of the simulated conditions, it was largest with a small number of measures (i.e., k = 3).
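The categorization manipulation summarized above can be illustrated with a small routine that cuts standard-normal scores into c ordered categories at equal-probability thresholds, which yields a uniform categorical distribution (the thesis's exact cut points for the normal-shaped scales are not reproduced here, so this is a hedged sketch of one plausible scheme):

```python
from statistics import NormalDist

def categorize_uniform(z_scores, c):
    """Cut continuous z-scores into categories 1..c at equal-probability
    normal quantiles, producing a (roughly) uniform categorical scale."""
    cuts = [NormalDist().inv_cdf(j / c) for j in range(1, c)]
    out = []
    for z in z_scores:
        cat = 1 + sum(1 for t in cuts if z > t)   # count thresholds passed
        out.append(cat)
    return out

# Five illustrative z-scores mapped onto a 3-point (U3-style) scale;
# the cut points fall at about -0.43 and +0.43.
cats = categorize_uniform([-2.0, -0.5, 0.0, 0.6, 2.0], 3)
```

Applying such a mapping before computing the G coefficient is what drives the gap between Gcp and the parent continuous G reported above; the coarser the scale, the larger the loss.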
In practice, researchers rarely have control over the population parameters (G and ε), but these results can be used as a guide for planning a G study. The findings of the one-facet design suggest that in situations where a categorical response is inevitable (e.g., a G study in observational research), the practitioner should consider using a 5-point or more scale and try to avoid combining a 3-point or less response category with a small number of raters (i.e., k = 3) in a G study. Otherwise, he or she may estimate a G coefficient which is already about 20% lower than its population G coefficient.

Sample estimates. The sample estimate of the G coefficient (Ĝ1) is a downward biased estimator, as shown in the mathematical derivation where E(Ĝ1) < G1, and the amount of bias varies as a function of the size of G1 and n, but is independent of the number of measures (k). The empirical results in the present study suggest that this bias became substantially larger when G1 < .75 and n < 30. In such circumstances, the use of the unbiased estimator of the G coefficient is strongly recommended.

Although Ĝ1 is a biased estimator of G1, the mean of Ĝ1 over the replications was very close to its expected value [i.e., E(Ĝ1)] and was consistently so across all simulated conditions (k, ε, and type of scale) for a given G1 (or Gcp). These findings suggest that the degree of heterogeneity of covariance did not introduce any additional bias into the magnitude of Ĝ1; nor did nonnormality, nor a moderate departure from homogeneity of variance (i.e., a ratio of .6 to 1.4 among the variances). However, heterogeneity of covariance, especially ε = .5, did result in more variable estimates of MSe (but not of MSp). This in turn produced a larger sampling variability of Ĝ1.
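For reference, the one-facet estimate discussed here is computed from the persons and residual mean squares of the p × r ANOVA as Ĝ1 = (MSp - MSe)/MSp, the relative-decision G coefficient. A compact sketch (illustrative Python; the data grids are made up):

```python
def g1_hat(data):
    """One-facet (relative-decision) G coefficient from an n x k grid:
    G1_hat = (MSp - MSe) / MSp, where MSp is the persons mean square
    and MSe the person-by-measure residual mean square."""
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    p_means = [sum(row) / k for row in data]                      # person means
    m_means = [sum(data[i][j] for i in range(n)) / n
               for j in range(k)]                                 # measure means
    ms_p = k * sum((m - grand) ** 2 for m in p_means) / (n - 1)
    ms_e = sum((data[i][j] - p_means[i] - m_means[j] + grand) ** 2
               for i in range(n) for j in range(k)) / ((n - 1) * (k - 1))
    return (ms_p - ms_e) / ms_p

# Perfectly additive scores (person effect + rater effect): residual is 0,
# so the estimate is exactly 1.
additive = [[p + r for r in (0, 1, 2)] for p in (0, 2, 4, 6)]
```

Because only MSp and MSe enter the estimate, and MSp is insensitive to noncircularity, the extra variability under low ε flows entirely through MSe, consistent with the findings above.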
In particular, the magnitude of Ĝ1 with k = 3, G = .60, ε = .5, and n = 15 varied markedly, ranging anywhere from .93 to -2.0.

The variability of Ĝ1 across the categorical scales showed a rather complicated trend. The magnitude of the empirical standard deviations for the categorical data was considerably larger than for the parent continuous data, but this was deemed to be a result of the initial difference between Gcp and G1 brought about by the interactive effects among the population conditions (G, ε, type of scale). When the effect of categorization was partialled out, the variability of Ĝ1 appeared to be very close to its corresponding theoretical value, although it was still considerably larger than that for the continuous data. Furthermore, the comparison of empirical standard deviations among the three ε conditions for the categorical data showed that the variability of Ĝ1 for the ε = H condition was larger, especially under the U2 scale. This result was deemed to be due to the sampling characteristics of epsilon as well as to the effect of categorization -- the categorization resulted in a smaller Gcp for the ε = H condition and produced a larger epsilon estimate, especially under the U2 scale.

Therefore, these results suggest that violation of the circularity assumption did not add any bias to the estimate, but yielded more variable estimates of the G coefficient for continuous data. Thus, it is likely to produce too many large estimates of the G coefficient (as well as too many small ones). However, the sampling variability of Ĝ1 for categorical data was less sensitive to heterogeneity of covariance, especially for a 3-point or less scale.

Sampling distribution of Ĝ1. Heterogeneity of covariance had some inflating effects, though not large, on the sampling distribution of Ĝ1, as evidenced by the inflated empirical proportions beyond the upper limit of the theoretical tolerance interval of Ĝ1, especially under the ε = .5 condition.
The empirical proportions were, in general, about 6% for ε = .7 and about 7.2% for ε = .5 at α/2 = .05. Considering the criteria of robustness suggested by Bradley (1978, p. 146) -- a stringent criterion being 0.9α < actual value < 1.1α, and the most liberal one being 0.5α < actual value < 1.5α -- these results indicate that the sampling theory of the G coefficient is fairly robust to a moderate departure from circularity (i.e., under the ε = .7 condition), but somewhat sensitive to severe noncircularity. Therefore, when the circularity assumption is not seriously violated, the sampling theory of the G coefficient can be adequately applied to an inferential test for an estimated G coefficient.

The sampling theory of the G coefficient was not adequate for categorical scales. The empirical proportion beyond the theoretical upper limit was reduced to close to zero, and that beyond the lower limit increased considerably to an unacceptable level -- for the U2 scale it was as large as about 80% for n = 45 and G1 = .90 for all three levels of k. These results were deemed to be mainly due to the initial difference between G1 and Gcp brought about by the effect of categorization.
An interesting yet rather contradictory interpretation of these results would be that the empirical proportion of Ĝ1 falling within the two limits of the tolerance interval could be interpreted as a Type II error, if one could presume that the difference between G1 and Gcp is indeed a true difference (but the sampling theory and the statistical model assume that the observed variables have an underlying continuous metric). Nevertheless, these findings indicate that the sampling theory of the G coefficient and its inferential procedure are not adequate for categorical data, especially for a 3-point or less scale, unless the design involves a number of measures (k) large enough to bring Gcp close to its parent continuous G1 value.

Type I error rates in the F test. As expected, the results showed that the Type I error rates of the F test for the Rater main effect (MSr/MSe) were inflated when circularity failed. For categorical data, the degree of this inflation decreased as the scale approached U2. This appeared to be due to the sampling characteristics of epsilon -- epsilon estimates were larger under categorical scales than under continuous data. As a result, the error rates for categorical scales were not too seriously inflated for a moderate departure from circularity, especially for a 3-point or less scale. The results also suggest that the effect of noncircularity on the F test was larger than that on the inferential procedure of the G coefficient (about 9% for the F test vs. about 7% for the G coefficient procedure under ε = .50). In general, the empirical results in the present study were in close agreement with previous findings in the literature, and thus provided validation of the simulation procedure and the accuracy of the subsequent calculations implemented in the simulation programs.

Examination of the relationships among the population epsilon, the sample estimate, and the Type I error rates revealed an interesting phenomenon.
There was a strong negative relationship between the magnitude of the epsilon estimates and the Type I error rates across the simulated conditions when covariances were heterogeneous, thus supporting current theory. However, for the ε = 1.0 condition, although the magnitude of the sample estimates varied widely, the associated Type I error rates were all close to the nominal level, yielding a near-zero correlation between them. Further investigation of the correlations among the sample estimates (ε̂, MSe, and MSr) showed the presence of a negative correlation between ε̂ and MSe for the low epsilon conditions. This negative correlation indicates that the ratio MSr/MSe tends to be smaller for a smaller ε̂ under violation of the circularity assumption, which contradicts the relationship reported in the literature.

Two-facet design

The results of the two-facet design closely paralleled those of the one-facet design in terms of the effects of categorization, sample size, and population G value. However, a primary difference in the findings between the two designs was that violation of the circularity assumption did not have any appreciable effect on the sampling characteristics of the G coefficient for the two-facet design. The results on the sampling variability and empirical distribution of Ĝ2 suggest that the sampling theory of the G coefficient for the two-facet design, which is based on an approximated F distribution using Satterthwaite's procedure, was very satisfactory and quite robust to violation of the circularity assumption for continuous data and for a 5-point scale.

With respect to quasi F ratios, the magnitude of the Type I error rates for the tests of the Rater or Occasion effect varied somewhat depending on the particular form of the F ratio.
It was found that the Type I error rates for the quasi F ratio that includes a combination of mean squares only in the denominator (Form 1) were somewhat positively inflated under noncircularity. As in the conventional F test, the degree of inflation in the error rates was reduced to close to the nominal level as the scale approached U2. However, the second form of the quasi F test, which includes a combination of mean squares in both the numerator and the denominator, was quite robust to noncircularity. On the other hand, the same test was somewhat conservative when the circularity assumption was met. These results for the second form of the quasi F test were in close agreement with those in the related literature.

Implications of the present study

The findings of the present empirical study have implications for the use of G theory. First, the results showed that categorization of continuous data into a 3-point or less scale resulted in a considerable loss of measurement information. For example, a G coefficient for a 5-point scale was only about 5% smaller than that for the parent continuous data. However, it was about 10% and about 20% smaller for a 3-point scale and for a dichotomous scale, respectively. Therefore, dichotomization of any Likert-scale variables, or the use of a dichotomous scale for simplicity of ratings, should be avoided whenever possible.

Second, it was shown that Ĝ1 is a downward biased estimator of G1, and the amount of bias increases with decreasing G and n. Furthermore, sample size and the magnitude of the G value have a relatively larger influence on the variability of Ĝ1 than the number of measures. Therefore, if possible, a large sample size should be used in conducting a G or D study in order to reduce the amount of bias as well as the sampling variability of the G coefficient.
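The unbiased correction recommended in this chapter takes the form G̃1 = [Ĝ1(n - 3) + 2]/(n - 1), as in the worked example below (Ĝ1 = .68 with n = 15 gives about .73). A one-line sketch:

```python
def g1_unbiased(g_hat, n):
    """Unbiased one-facet G coefficient from the biased estimate g_hat
    and sample size n: (g_hat * (n - 3) + 2) / (n - 1)."""
    return (g_hat * (n - 3) + 2) / (n - 1)

g_tilde = g1_unbiased(0.68, 15)   # about .726, rounding to .73
```

The correction matters most exactly where the bias is largest: small n and modest G, the conditions flagged above.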
However, when a small-scale G study is inevitable, researchers should consider using an unbiased estimator of G1. For example, consider a measurement design in which a researcher obtained a G coefficient of .68 with n = 15. The corresponding unbiased value for the same data would be .73 = [(.68)(15-3) + 2]/(15-1). The two values may lead him or her to quite different conclusions in decision making, one being classified as an unacceptable value and the other considered a reasonable value.

Third, the sampling theory for the one-facet design was quite robust to a moderate violation of the circularity assumption (i.e., ε = .70). Researchers applying an inferential test for a G coefficient may not need to worry much about noncircularity unless severe noncircularity is observed (i.e., ε = .50). For the two-facet design, noncircularity did not have any effect on the sampling distribution of the G coefficient. However, for both designs the use of inferential procedures for a G coefficient with categorical data, especially with a 3-point or less scale, should be given extra consideration. Since a G coefficient for a 3-point or less scale is already considerably lower than that for the parent continuous data, the limits of a confidence interval for an unknown population G coefficient would be shifted downward, and would thus give a misleading range for the true population G coefficient.

Fourth, Type I error rates for continuous data were inflated when circularity failed, but the error rates for categorical data were not too seriously inflated for a moderate departure from circularity, especially for a 3-point or less scale. However, the Type I error rates were close to the nominal level for the ε = 1.0 condition, regardless of the size of the epsilon estimates. These findings suggest that an ε̂- or ε̃-adjusted F test would be correct only if one can presume that the population covariance matrix from which a sample is taken exhibits noncircularity.
Otherwise, the adjusted F test would be conservative, and would thus increase the probability of a Type II error, if the estimated epsilon in fact comes from a population matrix with homogeneous covariances. Furthermore, the empirical results revealed the presence of negative correlations between MSe and ε̂ for the ε < 1.0 conditions. Given that the correlations between MSr and ε̂ and between MSr and MSe were near zero, the negative relationship between MSe and ε̂ suggests that a smaller ε̂ is associated with a larger MSe, which in turn tends to produce a smaller F ratio (MSr/MSe). These results led us to question the validity of the common practice of utilizing the ε̂- or ε̃-adjusted F test in repeated measures ANOVA designs.

Suggestions for future research

As with any simulation study, the results obtained in the present study must be interpreted with a certain degree of caution. There are a number of limitations imposed by the conditions simulated, and these limitations suggest possible directions for future research in this area. One such limitation is that the present study investigated the sampling behavior of only one form of the G coefficient, that is, the G coefficient for a relative decision.
There are a number of reliability-like indices frequently used in practice, such as intraclass correlation coefficients and the G coefficient for absolute decisions, and it is uncertain to what extent the results obtained for one specific index can be generalized to other indices. Further research is also needed to investigate the extent to which the results obtained for the specific measurement designs used in the present study can be generalized to other measurement designs, such as fixed and nested designs, and to designs having a large number of facets and levels within a facet.

Another restriction on the generalizability of the present results is that the simulated population data were generated from a particular form of population covariance matrix, and the data were transformed to categorical scales having normal and uniform distributions. Although it seems clear that the sampling characteristics of the G coefficient would be insensitive to a certain level of variation in nonnormality, represented by the uniform distribution, and in heterogeneity of variances and covariances, it is less clear whether similar conclusions hold for the performance of the G coefficient under other distributional forms (e.g., exponential) or under a radically different form of covariance matrix with severe noncircularity for all facets and their interactions.

Finally, it would be worthwhile to conduct further investigations of the relationships between the population epsilon, the sample estimates, and the Type I error rates for the F tests (perhaps for quasi F ratios as well). An interesting yet contradictory preliminary finding regarding the correlations among the sample estimates raises a question about the current practice of utilizing an ε̂-adjusted F test in repeated measures ANOVA designs, and this question requires further confirmation with extensive empirical work.

REFERENCES

Algina, J. (1978).
Comment on Bartko's "On various intraclass correlation reliability coefficients". Psychological Bulletin, 85, 135-138.
Alsawalmeh, Y.M., & Feldt, L.S. (1992). Test of the hypothesis that the intraclass reliability coefficient is the same for two measurement procedures. Applied Psychological Measurement, 16, 195-205.
Andersen, A.H., Jensen, E.B., & Schou, G. (1981). Two-way analysis of variance with correlated errors. International Statistical Review, 49, 153-167.
Bartko, J.J. (1976). On various intraclass correlation reliability coefficients. Psychological Bulletin, 83, 762-765.
Bay, K.S. (1973). The effect of non-normality on the sampling distribution and standard error of reliability coefficient estimates under an analysis of variance model. British Journal of Mathematical Psychology, 26, 45-57.
Bell, J.F. (1986). Simultaneous confidence intervals for the linear functions of expected mean squares used in generalizability theory. Journal of Educational Statistics, 11, 197-205.
Berk, R.A. (1978). An analysis of variance model for assessing reliability of naturalistic observations. Perceptual and Motor Skills, 47, 271-278.
Berk, R.A. (1979). Generalizability of behavioral observations: A clarification of interobserver agreement and interobserver reliability. American Journal of Mental Deficiency, 83, 460-472.
Birch, N.J., Burdick, R.K., & Ting, N. (1990). Confidence intervals and bounds for a ratio of summed expected mean squares. Technometrics, 32, 437-444.
Boardman, T.J. (1974). Confidence intervals for variance components - A comparative Monte Carlo study. Biometrics, 30, 251-269.
Boik, R.J. (1981). A priori tests in repeated measures designs: Effects of nonsphericity. Psychometrika, 46, 241-255.
Booth, C.L., Mitchell, S.K., & Solin, F.K. (1979). The generalizability study as a method of assessing intra- and interobserver reliability in observational research. Behavior Research Methods & Instrumentation, 11, 491-494.
Box, G.E.P. (1954).
Some theorems on quadratic forms applied in the study of analysis of variance problems, II. Effect of inequality of variance and correlation between errors in the two-way classification. Annals of Mathematical Statistics, 25, 484-498.
Bradley, J.V. (1978). Robustness? British Journal of Statistical Psychology, 31, 144-152.
Bradley, J.V. (1980). Nonrobustness in one-sample Z and t tests: A large-scale sampling study. Bulletin of the Psychonomic Society, 15, 29-32.
Bradley, J.V. (1980). Nonrobustness in classical tests on means and variances: A large-scale sampling study. Bulletin of the Psychonomic Society, 15, 275-278.
Bradley, J.V. (1980). Nonrobustness in Z, t, and F tests at large sample sizes. Bulletin of the Psychonomic Society, 16, 333-336.
Brennan, R.L. (1983). Elements of generalizability theory. Iowa: The American College Testing Program.
Brennan, R.L., & Kane, M.T. (1977). An index of dependability for mastery tests. Journal of Educational Measurement, 14, 277-289.
Brennan, R.L., & Lockwood, R.E. (1980). A comparison of the Nedelsky and Angoff cutting score procedures using generalizability theory. Applied Psychological Measurement, 4, 219-240.
Burdick, R.K., & Graybill, F.A. (1988). The present status of confidence interval estimation on variance components in balanced and unbalanced random models. Communications in Statistics A: Theory and Methods, 17, 1165-1195.
Burt, C. (1955). Test reliability estimated by analysis of variance. British Journal of Statistical Psychology, 8, 103-118.
Cardinet, J., Tourneur, Y., & Allal, L. (1976). The symmetry of generalizability theory: Applications to educational measurement. Journal of Educational Measurement, 13, 119-135.
Cardinet, J., Tourneur, Y., & Allal, L. (1981). Extension of generalizability theory and its applications in educational measurement. Journal of Educational Measurement, 18, 183-204.
Carmines, E.G., & Zeller, R.A. (1979). Reliability and validity assessment.
California: Sage Publications, Series 07-017.
Cicchetti, D.V., Showalter, D., & Tyrer, P.J. (1985). The effect of number of rating scale categories on levels of interrater reliability: A Monte Carlo investigation. Applied Psychological Measurement, 9, 31-36.
Cohen, J. (1983). The cost of dichotomization. Applied Psychological Measurement, 7, 249-253.
Collier, R.O., Jr., Baker, F.B., Mandeville, G.K., & Hayes, T.F. (1967). Estimates of test size for several test procedures based on conventional variance ratios in the repeated measures design. Psychometrika, 32, 339-353.
Connor, R.J. (1972). Grouping for testing trends in categorical data. Journal of American Statistical Association, 67, 601-604.
Cornfield, J., & Tukey, J.W. (1956). Average values of mean squares in factorials. Annals of Mathematical Statistics, 27, 907-949.
Cox, D.R. (1957). Note on grouping. Journal of American Statistical Association, 52, 543-547.
Craig, A.T. (1938). On the dependence of certain estimates of variance. Annals of Mathematical Statistics, 9, 48-55.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New York: Holt, Rinehart and Winston.
Crocker, L., Llabre, M., & Miller, M.D. (1988). The generalizability of content validity ratings. Journal of Educational Measurement, 25, 287-299.
Cronbach, L.J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.
Cronbach, L.J., Gleser, G.C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley.
Cronbach, L.J., Rajaratnam, N., & Gleser, G.C. (1963). The theory of generalizability: A liberalization of reliability theory. The British Journal of Statistical Psychology, 16, 137-163.
Davenport, J.M., & Webster, J.T. (1973). A comparison of some approximate F-tests. Technometrics, 15, 779-789.
Doverspike, D., Carlisi, A.M., Barrett, G.V., & Alexander, R.B. (1983).
Generalizability analysis of a point-method job evaluation instrument. Journal of Applied Psychology, 68, 476-483.
Ebel, R.L. (1951). Estimation of the reliability of ratings. Psychometrika, 16, 407-424.
Eom, H.J., & Schutz, R.W. (1993, August). Data categorization, noncircularity, and Type I error rates in repeated measures ANOVA designs. Paper presented at the meeting of the American Statistical Association, San Francisco, CA.
Erickson, R.S. (1978). Analyzing one variable-three wave panel data: A comparison of two methods. Political Methodology, 5, 151-166.
Feldt, L.S. (1965). The approximate sampling distribution of Kuder-Richardson reliability coefficient twenty. Psychometrika, 30, 357-370.
Feldt, L.S. (1969). A test of the hypothesis that Cronbach's alpha or Kuder-Richardson coefficient twenty is the same for two tests. Psychometrika, 34, 363-373.
Feldt, L.S. (1980). A test of the hypothesis that Cronbach's alpha reliability coefficient is the same for two tests administered to the same sample. Psychometrika, 45, 99-105.
Feldt, L.S., & Brennan, R.L. (1989). Reliability. In R.L. Linn (Ed.), Educational measurement (3rd ed.) (pp. 105-146). New York: American Council on Education.
Finsturen, K., & Campbell, M.E. (1979). Further comments on Bartko's "On various intraclass correlation reliability coefficients". Psychological Reports, 45, 375-380.
Fleiss, J.L. (1971). On the distribution of a linear combination of independent chi squares. Journal of American Statistical Association, 66, 142-144.
Fleiss, J.L. (1975). Measuring agreement between two judges on the presence or absence of a trait. Biometrics, 31, 651-658.
Fleiss, J.L. (1981). Statistical methods for rates and proportions (2nd ed.). New York: John Wiley & Sons.
Fleiss, J.L., & Shrout, P.E. (1978). Approximate interval estimation for a certain intraclass correlation coefficient. Psychometrika, 43, 259-262.
Gaylor, D.W., & Hopper, F.N. (1969).
Estimating the degrees of freedom for linear combinations of mean squares by Satterthwaite's formula. Technometrics, 11, 691-706.
Geisser, S., & Greenhouse, S.W. (1958). An extension of Box's results on the use of the F distribution in multivariate analysis. Annals of Mathematical Statistics, 29, 885-891.
Gessaroli, M.E., & Schutz, R.W. (1983). Variable error: Variance-covariance heterogeneity, block size and Type I error rates. Journal of Motor Behavior, 15, 74-95.
Ghiselli, E.E., Campbell, J.P., & Zedeck, S. (1981). Measurement theory for the behavioral sciences. New York: W.H. Freeman and Company.
Gibbons, J.D. (1985). Nonparametric methods for quantitative analysis (2nd ed.). Ohio: American Sciences Press, Inc.
Gillmore, G.M. (1983). Generalizability theory: Applications to program evaluation. In L.J. Fyans, Jr. (Ed.), Generalizability theory: Inferences and practical applications. San Francisco: Jossey-Bass.
Glass, G.V., Peckham, P.D., & Sanders, J.R. (1972). Consequences of failure to meet assumptions underlying the fixed effects analysis of variance and covariance. Review of Educational Research, 42, 237-288.
Gleser, G.C., Cronbach, L.J., & Rajaratnam, N. (1965). Generalizability of scores influenced by multiple sources of variance. Psychometrika, 30, 395-418.
Godbout, P., & Schutz, R.W. (1983). Generalizability of ratings of motor performances with reference to various observational designs. Research Quarterly for Exercise and Sport, 54, 20-27.
Green, S.B., Lissitz, R.W., & Mulaik, S.A. (1977). Limitations of coefficient alpha as an index of test unidimensionality. Educational and Psychological Measurement, 37, 827-839.
Greenhouse, S.W., & Geisser, S. (1959). On methods in the analysis of profile data. Psychometrika, 24, 95-112.
Gregoire, T.G., & Driver, B.L. (1987). Analysis of ordinal data to detect population differences. Psychological Bulletin, 101, 159-165.
Grieve, A.P. (1984).
Tests of sphericity of normal distributions and the analysis of repeated measures designs. Psychometrika, 49, 257-267.
Hakstian, A.R., & Whalen, T.E. (1976). A k-sample significance test for independent alpha coefficients. Psychometrika, 41, 219-231.
Heise, D.R. (1969). Separating reliability and stability in test-retest correlation. American Sociological Review, 34, 93-101.
Horst, P. (1949). A generalized expression for the reliability of measures. Psychometrika, 14, 21-31.
House, A.E., House, B.J., & Campbell, M.B. (1981). Measures of interobserver agreement: Calculation formulas and distribution effects. Journal of Behavioral Assessment, 3, 37-57.
Hoyt, C.J. (1941). Test reliability estimated by analysis of variance. Psychometrika, 6, 153-160.
Huck, S.W. (1978). A modification of Hoyt's analysis of variance reliability estimation procedure. Educational and Psychological Measurement, 38, 725-736.
Hudson, J.D., & Krutchkoff, R.G. (1968). A Monte Carlo investigation of the size and power of tests employing Satterthwaite's synthetic mean squares. Biometrika, 55, 431-433.
Huynh, H. (1978). Some approximate tests for repeated measurement designs. Psychometrika, 43, 161-175.
Huynh, H., & Feldt, L.S. (1970). Conditions under which mean square ratios in repeated measurements designs have exact F distributions. Journal of American Statistical Association, 65, 1582-1589.
Huynh, H., & Feldt, L.S. (1976). Estimation of the Box correction for degrees of freedom from sample data in randomized block and split-plot designs. Journal of Educational Statistics, 1, 69-82.
Huynh, H., & Mandeville, G.K. (1979). Validity conditions in repeated measures designs. Psychological Bulletin, 86, 964-973.
Huysamen, G.K. (1990). The application of generalizability theory to the reliability of ratings. South African Journal of Psychology, 20, 200-205.
IMSL. (1991). International Mathematical and Statistical Libraries (10th ed.). Houston, TX.
Jenkins, G.D., Jr., & Taber, T.D. (1977).
A Monte Carlo study of factors affecting three indices of composite scale reliability. Journal of Applied Psychology, 62, 392-398.
Johnson, S., & Bell, J.F. (1985). Evaluating and predicting survey efficiency using generalizability theory. Journal of Educational Measurement, 22, 107-119.
Joreskog, K.G. (1971). Statistical analysis of sets of congeneric tests. Psychometrika, 36, 109-133.
Joreskog, K.G., & Sorbom, D. (1989). LISREL 7: User's reference guide. Mooresville: Scientific Software, Inc.
Kane, M.T. (1986). The role of reliability in criterion-referenced tests. Journal of Educational Measurement, 23, 221-224.
Kane, M.T., & Brennan, R.L. (1977). The generalizability of class means. Review of Educational Research, 47, 267-292.
Kane, M.T., Gilmore, G.M., & Crooks, T.J. (1976). Student evaluations of teaching: The generalizability of class means. Journal of Educational Measurement, 13, 171-183.
Kenny, D.A., & Judd, C.M. (1986). Consequences of violating the independence assumption in analysis of variance. Psychological Bulletin, 99, 422-431.
Khuri, A.I. (1981). Simultaneous confidence intervals for functions of variance components in random models. Journal of American Statistical Association, 76, 878-885.
Khuri, A.I., & Sahai, H. (1985). Variance components analysis: A selective literature survey. International Statistical Review, 53, 279-300.
Kogan, L.S. (1948). Analysis of variance - repeated measurements. Psychological Bulletin, 45, 131-143.
Kraemer, H.C. (1981). Extension of Feldt's approach to testing homogeneity of coefficients of reliability. Psychometrika, 46, 41-45.
Kristof, W. (1963). The statistical theory of stepped-up reliability coefficients when a test has been divided into several equivalent parts. Psychometrika, 28, 221-238.
Kristof, W. (1970). On the sampling theory of reliability estimation. Journal of Mathematical Psychology, 7, 371-377.
Kuder, G.F., & Richardson, M.W. (1937). The theory of the estimation of test reliability.
Psychometrika, 2, 151-160.
Lahey, M.A., Downey, R.G., & Saal, F.E. (1983). Intraclass correlations: There's more there than meets the eye. Psychological Bulletin, 93, 586-595.
Lane, S., & Sabers, D. (1989). Use of generalizability theory for estimating the dependability of a scoring system for sample essays. Applied Measurement in Education, 2, 195-205.
Lissitz, R.W., & Green, S.B. (1975). Effect of the number of scale points on reliability: A Monte Carlo approach. Journal of Applied Psychology, 60, 10-13.
Lomax, R.G. (1982). An application of generalizability theory to observational research. Journal of Experimental Education, 51, 22-30.
Looney, M.A., & Heimerdinger, B.M. (1991). Validity and generalizability of social dance performance ratings. Research Quarterly for Exercise and Sport, 62, 399-405.
Lunney, G.H. (1970). Using analysis of variance with a dichotomous dependent variable: An empirical study. Journal of Educational Measurement, 7, 263-269.
Macready, G.B. (1983). The use of generalizability theory for assessing relations among items within domains in diagnostic testing. Applied Psychological Measurement, 7, 149-157.
Marcoulides, G.A. (1989). The estimation of variance components in generalizability studies: A resampling approach. Psychological Reports, 65, 883-889.
Marcoulides, G.A. (1990). An alternative method for estimating variance components in generalizability theory. Psychological Reports, 66, 379-386.
Masters, J.R. (1974). The relationship between number of response categories and reliability of Likert-type questionnaires. Journal of Educational Measurement, 11, 49-53.
Matell, M.S., & Jacoby, J. (1971). Is there an optimal number of alternatives for Likert scale items? Study I: Reliability and validity. Educational and Psychological Measurement, 31, 657-674.
Maxwell, A.E. (1968). The effect of correlated errors on estimates of reliability coefficients. Educational and Psychological Measurement, 28, 803-811.
Maxwell, S.E., & Bray, J.H. (1986).
Robustness of the quasi F statistic to violations of sphericity. Psychological Bulletin, 99, 416-421.
McCall, R.B., & Appelbaum, M.I. (1973). Bias in the analysis of repeated-measures designs: Some alternative approaches. Child Development, 44, 401-415.
McHugh, R.B., Sivanich, G., & Geisser, S. (1961). On the evaluation of changes by psychometric test profiles. Psychological Reports, 7, 335-344.
Mendoza, J.L., Toothaker, L.E., & Crain, B.R. (1976). Necessary and sufficient conditions for F ratios in the L x J x K factorial design with two repeated factors. Journal of American Statistical Association, 71, 992-993.
Mitchell, S.K. (1979). Interobserver agreement, reliability, and generalizability of data collected in observational studies. Psychological Bulletin, 86, 376-390.
Mitzel, H.C., & Games, P.A. (1981). Circularity and multiple comparisons in repeated measure designs. British Journal of Mathematical and Statistical Psychology, 34, 253-259.
Morgan, S. (1988). Diagnostic assessment of autism: A review of objective scales. Journal of Psychoeducational Assessment, 6, 139-151.
Morrow, J.R., Jr., et al. (1986). Generalizability of the AAHPERD health related skinfold test. Research Quarterly for Exercise and Sport, 57, 187-195.
Morrow, J.R., Jr. (1989). Generalizability theory. In M.J. Safrit & T.M. Wood (Eds.), Measurement concepts in physical education and exercise science (pp. 73-96). Illinois: Human Kinetics.
Myers, J., DiCecco, J.V., White, B.J., & Borden, V.M. (1982). Repeated measurements on dichotomous variables: Q and F tests. Psychological Bulletin, 92, 517-525.
Novick, M.R., & Lewis, C. (1967). Coefficient alpha and the reliability of composite measurements. Psychometrika, 32, 1-13.
Paulson, E. (1942). An approximate normalization of the analysis of variance distribution. Annals of Mathematical Statistics, 13, 233-235.
Rao, C.R. (1973). Linear statistical inference and its applications (2nd ed.). New York: John Wiley & Sons.
Rasmussen, J.L. (1989).
Analysis of Likert-scale data: A reinterpretation of Gregoire and Driver (1987). Psychological Bulletin, 105, 167-170.
Rasmussen, J.L., Heumann, K.A., Heumann, M.T., & Botzum, M. (1989). Univariate and multivariate groups by trials analysis under violation of variance-covariance and normality assumptions. Multivariate Behavioral Research, 24, 93-105.
Rogan, J.C., & Keselman, H.J. (1977). Is the ANOVA F-test robust to variance heterogeneity when sample sizes are equal?: An investigation via a coefficient of variation. American Educational Research Journal, 14, 493-498.
Rogan, J.C., Keselman, H.J., & Mendoza, J.L. (1979). Analysis of repeated measurements. British Journal of Mathematical and Statistical Psychology, 32, 269-286.
Rouanet, H., & Lepine, D. (1970). Comparison between treatments in a repeated-measurement design: ANOVA and multivariate methods. The British Journal of Mathematical and Statistical Psychology, 23, 147-163.
Rulon, P.J. (1939). A simplified procedure for determining the reliability of a test by split-halves. Harvard Educational Review, 9, 99-103.
Saal, F.E., Downey, R.G., & Lahey, M.A. (1980). Rating the ratings: Assessing the psychometric quality of rating data. Psychological Bulletin, 88, 413-428.
Sahai, H. (1979). A bibliography on variance components. International Statistical Review, 47, 177-222.
Sahai, H., Khuri, A.I., & Kapadia, C.H. (1985). A second bibliography on variance components. Communications in Statistics A: Theory and Method, 14, 63-115.
Santa, J.L., Miller, J.J., & Shaw, M.L. (1979). Using quasi F to prevent alpha inflation due to stimulus variation. Psychological Bulletin, 86, 37-46.
Satterthwaite, F.E. (1941). Synthesis of variance. Psychometrika, 6, 309-316.
Satterthwaite, F.E. (1946). An approximate distribution of estimates of variance components. Biometrics, 2, 110-114.
Scheffe, H. (1959). The analysis of variance. New York: Wiley.
Schroeder, M.S., & Hakstian, R. (1990).
Inferential procedures for multifaceted coefficients of generalizability. Psychometrika, 55, 429-447.
Searle, S.R. (1971). Linear models. New York: John Wiley.
Searle, S.R., Casella, G., & McCulloch, C.E. (1992). Variance components. New York: John Wiley.
Sedere, M.U., & Feldt, L.S. (1976). The sampling distributions of the Kristof reliability coefficient, the Feldt coefficient, and Guttman's lambda-2. Journal of Educational Measurement, 14, 53-62.
Shavelson, R.J., & Webb, N.M. (1981). Generalizability theory: 1973-1980. British Journal of Mathematical and Statistical Psychology, 34, 133-166.
Shavelson, R.J., & Webb, N.M. (1991). Generalizability theory: A primer. Newbury Park: Sage.
Shavelson, R.J., Rowley, G.L., & Webb, N.M. (1988). Using generalizability theory in counseling and development. Measurement and Evaluation in Counseling and Development, 21, 81-93.
Shavelson, R.J., Webb, N.M., & Rowley, G.L. (1989). Generalizability theory. American Psychologist, 44, 922-932.
Shrout, P.E., & Fleiss, J.L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86, 420-428.
Smith, P.L. (1978). Sampling errors of variance components in small sample multifacet generalizability studies. Journal of Educational Statistics, 3, 319-346.
Smith, P.L. (1981). Gaining accuracy in generalizability theory: Using multiple designs. Journal of Educational Measurement, 18, 147-154.
Smith, P.L. (1982). A confidence interval approach for variance component estimates in the context of generalizability theory. Educational and Psychological Measurement, 42, 459-466.
Smith, P.L., & Luecht, R.M. (1992). Correlated effects in generalizability studies. Applied Psychological Measurement, 16, 229-235.
Stayrook, N., & Corno, L. (1979). An application of generalizability theory in disattenuating a path model of teaching and learning. Journal of Educational Measurement, 16, 227-237.
Stoloff, P.H. (1970).
Correcting for heterogeneity of covariance for repeated measures designs of the analysis of variance. Educational and Psychological Measurement, 30, 909-924.
Tukey, J.W. (1949). One degree of freedom for non-additivity. Biometrics, 5, 232-242.
Ulrich, D., Ulrich, B.D., & Branta, C.F. (1988). Developmental gross motor skill ratings: A generalizability study. Research Quarterly for Exercise and Sport, 59, 203-209.
Verdooren, L.R. (1982). How large is the probability for the estimate of a variance component to be negative. Biometrical Journal, 24, 339-360.
Violato, C., & Travis, L.D. (1988). An application of generalizability theory to the consistency-specificity problem: The transituational consistency of behavioral persistence. The Journal of Psychology, 122, 389-407.
Webb, N.M., Rowley, G.L., & Shavelson, R.J. (1988). Using generalizability theory in counseling and development. Measurement and Evaluation in Counseling and Development, 21, 81-90.
Wike, E.L., & Church, J.D. (1980). Nonrobustness in F tests: 1. A replication and extension of Bradley's study. Bulletin of the Psychonomic Society, 20, 165-167.
Wike, E.L., & Church, J.D. (1982). Nonrobustness in F tests: 2. Further extensions of Bradley's study. Bulletin of the Psychonomic Society, 20, 168-170.
Wilcox, R.R. (1987). New designs in analysis of variance. Annual Review of Psychology, 38, 29-60.
Wilson, K. (1975). The sampling distribution of conventional, conservative and corrected F-ratios in repeated measurements designs with heterogeneity of covariance. Journal of Statistical Computation and Simulation, 3, 201-215.
Winer, B.J. (1971). Statistical principles in experimental design (2nd ed.). New York: McGraw-Hill.
Woodruff, D.J., & Feldt, L.S. (1988). Tests for equality of several alpha coefficients when their sample estimates are dependent. Psychometrika, 51, 393-413.
Zimmerman, D.W. (1980).
Is classical test theory 'robust' under violation of the assumption of uncorrelated errors? Canadian Journal of Psychology, 34, 227-237.

Appendix A
Circularity assumptions in repeated measures ANOVA

Repeated measures analysis of variance (ANOVA) procedures are extensively used in educational and psychological research. When the repeated measures are obtained from the same individuals, the successive measures or responses will naturally tend to be positively correlated. In this case, besides the usual ANOVA assumptions of normality of distribution and homogeneity of variances, there is an additional assumption regarding the pattern of these correlated measures. A number of empirical studies on the effects of violating the normality and homogeneity-of-variance assumptions on Type I error rates have shown that the ANOVA F statistic is generally robust to moderate departures from these assumptions, especially when sample sizes are equal (e.g., Glass, Peckham, & Sanders, 1972; but see Bradley, 1978). However, ANOVA loses its robustness when the covariance matrix underlying the repeated measures deviates from a certain pattern, referred to as compound symmetry or circularity.

A covariance matrix is said to possess the property of circularity if the variances of all pairwise differences between the repeated measures are equal. A special case of circularity is compound symmetry, a covariance matrix with equal variances and equal covariances (Huynh & Feldt, 1970; Rouanet & Lepine, 1970; Winer, 1971). For example, for a two-way ANOVA model, which is equivalent to a one-facet crossed design (i.e., subjects by raters) in G theory, compound symmetry implies that the r by r covariance matrix has equal variances on the diagonal and equal covariances off the diagonal.
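The circularity definition can be checked directly from a covariance matrix. The following is a minimal NumPy sketch (not part of the thesis; the function name and the third example matrix are illustrative):

```python
import numpy as np

def is_circular(S, tol=1e-9):
    """True if the variances of all pairwise differences,
    Var(X_i - X_j) = s_ii + s_jj - 2*s_ij, are equal (circularity)."""
    k = S.shape[0]
    d = [S[i, i] + S[j, j] - 2 * S[i, j]
         for i in range(k) for j in range(i + 1, k)]
    return np.ptp(d) < tol  # max minus min of the difference variances

# Compound symmetry (equal variances, equal covariances) is circular:
S_cs = np.array([[100, 75, 75], [75, 100, 75], [75, 75, 100.0]])
print(is_circular(S_cs))   # True

# A matrix can be circular without compound symmetry: here the variances
# differ, yet every pairwise difference variance equals 2 (my own example):
S_hf = np.array([[1.0, 0.5, 1.0], [0.5, 2.0, 1.5], [1.0, 1.5, 3.0]])
print(is_circular(S_hf))   # True
```

The second example previews the point made below: compound symmetry is sufficient but not necessary for circularity.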
Furthermore, for a three-way ANOVA model, i.e., a two-facet (Persons x Raters x Occasions) fully crossed design in G theory, each of the covariance matrices Σr (nr x nr), Σo (no x no), and Σro (nrno x nrno) is required to possess local circularity (Rouanet & Lepine, 1970) in order for the F or quasi-F statistic to be valid (Huynh & Mandeville, 1979; Maxwell & Bray, 1986; Mendoza, Toothaker, & Crain, 1976).

Box (1954) has shown that violating this assumption yields more variable estimates of the mean squares, and thus results in more extreme large as well as small F ratios than indicated by the theoretical F distribution. Since interest is directed to the upper tail of this distribution, the cumulative proportion beyond the theoretical upper limit is attributed to the Type I error rate. Consequently, he developed a measure of the degree of departure from compound symmetry, known as epsilon (ε). Epsilon is used to correct the positive bias in the usual F test by adjusting the degrees of freedom by an amount proportional to ε or its estimate ε̂. Epsilon is a function of the variances and covariances in the population matrix Σx, and can be calculated as:

[A1]    \varepsilon = \frac{k^2 (\bar{\sigma}_{ii} - \bar{\sigma}_{..})^2}{(k-1)\left(\sum_{i}\sum_{j} \sigma_{ij}^2 - 2k \sum_{i} \bar{\sigma}_{i.}^2 + k^2 \bar{\sigma}_{..}^2\right)}

where:
  k = the order of the covariance matrix,
  \bar{\sigma}_{ii} = the mean of the variances (diagonal elements),
  \bar{\sigma}_{..} = the grand mean of the covariance matrix,
  \bar{\sigma}_{i.} = the mean of the ith row (or column) of the covariance matrix, and
  \sigma_{ij} = an individual element of the matrix (i, j = 1, 2, ..., k).

Huynh and Feldt (1970) and Rouanet and Lepine (1970) demonstrated independently that compound symmetry of the covariance matrix is a sufficient condition for the ratio of mean squares to have an F distribution, but it is not a necessary condition. That is, a matrix Σ may have other patterns, but the ratio of the mean squares may still have an F distribution with ε = 1.0.
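Formula [A1] translates directly into code. The sketch below (mine, not the thesis's simulation program) computes Box's ε from a population covariance matrix and reproduces, to rounding, the value reported in Appendix B for one of the Table B-1 matrices:

```python
import numpy as np

def box_epsilon(S):
    """Box's epsilon (formula [A1]) from a k x k covariance matrix S."""
    S = np.asarray(S, dtype=float)
    k = S.shape[0]
    mean_var = np.trace(S) / k      # mean of the variances (diagonal)
    grand = S.mean()                # grand mean of all elements
    rows = S.mean(axis=1)           # row (= column) means
    num = k**2 * (mean_var - grand)**2
    den = (k - 1) * ((S**2).sum() - 2*k*(rows**2).sum() + k**2 * grand**2)
    return num / den

# Compound symmetry gives epsilon = 1.0:
cs = np.full((3, 3), 75.0)
np.fill_diagonal(cs, 100.0)
print(round(box_epsilon(cs), 4))    # 1.0

# The noncircular G = .90 matrix from Table B-1 gives epsilon close to .705:
S = [[100, 75, 61], [75, 100, 89], [61, 89, 100]]
print(round(box_epsilon(S), 3))     # 0.705
```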
If the difference scores between all pairs of measures are equally variable, the covariance matrix possesses the circularity condition (Rouanet & Lepine, 1970). This property implies that when a k x k covariance matrix Σx is transformed orthonormally, using a (k-1) by k orthonormal matrix M, the resultant (k-1) by (k-1) matrix Σy contains a set of orthonormal variables. If the original matrix Σx has the circularity pattern, then the resultant matrix Σy satisfies the sphericity condition, Σy = M Σx M' = cI, where I is the identity matrix of order (k-1) and c is a constant. From this relationship, epsilon can be defined alternatively in terms of the orthonormally transformed matrix as:

[A2]    \varepsilon = \frac{\left(\sum c_i\right)^2}{(k-1)\sum c_i^2}

where the c_i are the (k-1) eigenvalues of the (k-1) by (k-1) matrix Σy.

Under the circularity or sphericity condition, all eigenvalues are equal; consequently, (Σ c_i)² = (k-1) Σ c_i², and ε = 1.0. Under maximum departure from sphericity, all eigenvalues except one are equal to zero (e.g., Boik, 1981; Grieve, 1984); thus (Σ c_i)² = Σ c_i², and ε = 1/(k-1).

Box's work was extended to more complex designs by Geisser and Greenhouse (1958), Greenhouse and Geisser (1959), McHugh, Sivanich, and Geisser (1961), and Huynh (1978). In addition, many subsequent simulation studies (with continuous dependent variables) have shown that the degree of bias introduced by violating the circularity assumption is quite substantial in a variety of specific cases (Collier, Baker, Mandeville, & Hayes, 1967; Greenhouse & Geisser, 1959; Huynh, 1978; Rasmussen, Heumann, Heumann, & Botzum, 1989; Wilson, 1975). For example, Collier et al.
(1967) showed that computing ε̂ from a sample covariance matrix and adjusting the degrees of freedom for the critical F by the amount of ε̂ produced an approximate F test that is relatively robust for reasonable samples of 15 or larger. However, Stoloff (1970) and Huynh and Feldt (1976) have demonstrated, through Monte Carlo studies, that the ε̂-adjusted test is negatively biased (i.e., too conservative). This bias is greatest when the population ε is near or above .75, especially when the sample size is small. Consequently, Huynh and Feldt proposed an alternative estimator of ε. Their estimator ε̃ is a function of n (sample size), g (number of groups), k (number of levels), and Box's ε̂:

[A3]    \tilde{\varepsilon} = \frac{n(k-1)\hat{\varepsilon} - 2}{(k-1)\left[n - g - (k-1)\hat{\varepsilon}\right]}

Thus, for any value of n and k, ε̃ is equal to or greater than ε̂, and the equality holds when ε̂ = 1/(k-1). The upper bound of ε̃ is set to unity, though it theoretically can be greater than unity. Huynh and Feldt (1976, 1978) and Rogan, Keselman, and Mendoza (1979) reported that the ε̃-adjusted test produced a test size closer to the specified alpha level than did the ε̂-adjusted test.

Most of the empirical studies mentioned above have focused on the behavior of the ratio of the mean square estimates and the degree of positive bias in the F test. However, given that departure from the circularity condition results in unstable estimates of the mean squares, it is quite likely that the more variable mean squares add further variability to the estimate of the G coefficient, since the G coefficient is a function of observed mean squares. This would be especially true with a small-scale measurement design.
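Formula [A3] can be applied as follows. This is a sketch under the notation above; defaulting `g` to a single-group design is my assumption, not the thesis's:

```python
def huynh_feldt_epsilon(eps_hat, n, k, g=1):
    """Huynh-Feldt epsilon-tilde (formula [A3]) from Box's sample
    estimate eps_hat, with n subjects, k levels, and g groups.
    The estimate is truncated at its upper bound of 1.0."""
    num = n * (k - 1) * eps_hat - 2
    den = (k - 1) * (n - g - (k - 1) * eps_hat)
    return min(num / den, 1.0)

# Equality with eps_hat holds at the lower bound 1/(k-1):
print(round(huynh_feldt_epsilon(1/3, n=20, k=4), 4))   # 0.3333

# Otherwise epsilon-tilde exceeds eps_hat (less conservative adjustment):
print(round(huynh_feldt_epsilon(0.75, n=20, k=4), 3))  # 0.856
```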
There is, however, little research in the literature that has systematically examined the effects of noncircularity on the magnitude of the sample estimates of the G coefficient, as well as on the sampling variability of those estimates.

Appendix B
Input population covariance matrices

Table B-1
Population covariance matrices for the one-facet design (k = 3)

G = .90, ε = 1.0:
100
 75 100
 75  75 100

G = .75, ε = 1.0:
100
 50 100
 50  50 100

G = .60, ε = 1.0:
100
 33.33 100
 33.33  33.33 100

G = .90, ε = .7051:
100
 75 100
 61  89 100

G = .75, ε = .7051:
100
 50 100
 22  78 100

G = .60, ε = .7088:
100
 14 100
 10  76 100

G = .90, ε = .5268:
100
 76 100
 54  95 100

G = .75, ε = .5395:
100
 50 100
 10  90 100

G = .60, ε = .5389:
100
  3 100
  2  95 100

Table B-2
Population covariance matrices for the one-facet design (k = 5)

G = .90, ε = 1.0:
100
 64.29 100
 64.29  64.29 100
 64.29  64.29  64.29 100
 64.29  64.29  64.29  64.29 100

G = .90, ε = .6930:
100
 77 100
 64  77 100
 51  64  77 100
 41  51  64  77 100

G = .90, ε = .5001:
100
 83 100
 65  83 100
 41  65  83 100
 34  41  65  83 100

G = .75, ε = 1.0:
100
 37.50 100
 37.50  37.50 100
 37.50  37.50  37.50 100
 37.50  37.50  37.50  37.50 100

G = .75, ε = .7061:
100
 60 100
 33  60 100
 12  33  60 100
 12  12  33  60 100

G = .75, ε = .5179:
100
 80 100
 25  50 100
 10  25  50 100
 10  10  25  90 100

G = .60, ε = 1.0:
100
 23.08 100
 23.08  23.08 100
 23.08  23.08  23.08 100
 23.08  23.08  23.08  23.08 100

G = .60, ε = .7066:
100
 30 100
  1  40 100
  0   5  60 100
  0   0  25  70 100

G = .60, ε = .5047:
100
 95 100
  0  25 100
  0   0  26 100
  0   0   0  90 100

Table B-3
Population covariance matrices for the one-facet design (k = 7)

G = .90, ε = 1.0:
100
 56.25 100
 56.25  56.25 100
 56.25  56.25  56.25 100
 56.25  56.25  56.25  56.25 100
 56.25  56.25  56.25  56.25  56.25 100
 56.25  56.25  56.25  56.25  56.25  56.25 100

G = .90, ε = .7024:
100
 71 100
 60  71 100
 54  60  71 100
 40  54  60  71 100
 40  40  54  60  71 100
 40  40  40  54  60  71 100

G = .90, ε = .5069:
100
 80 100
 60  80 100
 50  60  80 100
 39  50  60  80 100
 35  39  50  60  80 100
 15  35  39  50  60  80 100

G = .75, ε = 1.0:
100
 30 100
 30  30 100
 30  30  30 100
 30  30  30  30 100
 30  30  30  30  30 100
 30  30  30  30  30  30 100

G = .75, ε = .7046:
100
 65 100
 20  50 100
 20  20  50 100
 15  20  20  50 100
 10  15  20  30  50 100
 10  10  20  30  40  65 100

G = .75, ε = .5080:
100
 90 100
 25  40 100
 15  25  90 100
 10  15  25  40 100
 50  10  15  25  40 100
  5  10  10  15  25  90 100

G = .60, ε = 1.0:
100
 17.65 100
 17.65  17.65 100
 17.65  17.65  17.65 100
 17.65  17.65  17.65  17.65 100
 17.65  17.65  17.65  17.65  17.65 100
 17.65  17.65  17.65  17.65  17.65  17.65 100

G = .60, ε = .6993:
100
 20 100
 10  50 100
 10  10  70 100
  0  10  10  20 100
  0   0  10  10  50 100
  0   0   0  10  10  70 100

G = .60, ε = .5136:
100
 90 100
  5  25 100
  0   5  25 100
  0   0   5  90 100
  0   0   0   5  25 100
  0   0   0   0   5  90 100

Table B-4
Population covariance matrices for the two-facet (3 Occasions by 5 Raters) design

G = .90, Matrix: NONE (εO = 1.0, εR = 1.0, εOR = 1.0):
100
 61 100
 61  61 100
 61  61  61 100
 61  61  61  61 100
 62  54  54  54  54 100
 54  62  54  54  54  61 100
 54  54  62  54  54  61  61 100
 54  54  54  62  54  61  61  61 100
 54  54  54  54  62  61  61  61  61 100
 62  54  54  54  54  62  54  54  54  54 100
 54  62  54  54  54  54  62  54  54  54  61 100
 54  54  62  54  54  54  54  62  54  54  61  61 100
 54  54  54  62  54  54  54  54  62  54  61  61  61 100
 54  54  54  54  62  54  54  54  54  62  61  61  61  61 100

G = .90, Matrix: O/OR (εO = .6729, εR = 1.0, εOR = .6752):
100
 75 100
 75  75 100
 75  75  75 100
 75  75  75  75 100
 66  54  54  54  54 100
 54  66  54  54  54  60 100
 54  54  66  54  54  60  60 100
 54  54  54  66  54  60  60  60 100
 54  54  54  54  66  60  60  60  60 100
 40  54  54  54  54  80  54  54  54  54 100
 54  40  54  54  54  54  80  54  54  54  48 100
 54  54  40  54  54  54  54  80  54  54  48  48 100
 54  54  54  40  54  54  54  54  80  54  48  48  48 100
 54  54  54  54  40  54  54  54  54  80  48  48  48  48 100

G = .90, Matrix: R/OR (εO = 1.0, εR = .6810, εOR = .4495):
100
 80 100
 60  80 100
 40  60  80 100
 30  40  60  80 100
 70  54  54  54  54 100
 54  60  54  54  54  80 100
 54  54  60  54  54  60  80 100
 54  54  54  60  54  40  60  80 100
 54  54  54  54  60  30  40  60  80 100
 70  54  54  54  54  70  54  54  54  54 100
 54  60  54  54  54  54  60  54  54  54  80 100
 54  54  60  54  54  54  54  60  54  54  60  80 100
 54  54  54  60  54  54  54  54  60  54  40  60  80 100
 54  54  54  54  60  54  54  54  54  60  30  40  60  80 100

G = .75, Matrix: NONE (εO = 1.0, εR = 1.0, εOR = 1.0):
100
 51 100
 51  51 100
 51  51  51 100
 51  51  51  51 100
 52  34  34  34  34 100
 34  52  34  34  34  51 100
 34  34  52  34  34  51  51 100
 34  34  34  52  34  51  51  51 100
 34  34  34  34  52  51  51  51  51 100
 52  34  34  34  34  52  34  34  34  34 100
 34  52  34  34  34  34  52  34  34  34  51 100
 34  34  52  34  34  34  34  52  34  34  51  51 100
 34  34  34  52  34  34  34  34  52  34  51  51  51 100
 34  34  34  34  52  34  34  34  34  52  51  51  51  51 100

G = .75, Matrix: O/OR (εO = .6543, εR = 1.0, εOR = .6742):
100
 80 100
 80  80 100
 80  80  80 100
 80  80  80  80 100
 50  34  34  34  34 100
 34  50  34  34  34  50 100
 34  34  50  34  34  50  50 100
 34  34  34  50  34  50  50  50 100
 34  34  34  34  50  50  50  50  50 100
 30  34  34  34  34  76  34  34  34  34 100
 34  30  34  34  34  34  76  34  34  34  23 100
 34  34  30  34  34  34  34  76  34  34  23  23 100
 34  34  34  30  34  34  34  34  76  34  23  23  23 100
 34  34  34  34  30  34  34  34  34  76  23  23  23  23 100

G = .75, Matrix: R/OR (εO = 1.0, εR = .6810, εOR = .5673):
100
 70 100
 50  70 100
 30  50  70 100
 20  30  50  70 100
 40  40  30  30  30 100
 40  50  40  30  30  70 100
 30  40  50  40  30  50  70 100
 30  30  40  50  40  30  50  70 100
 30  30  30  40  70  20  30  50  70 100
 40  40  30  30  30  40  40  30  30  30 100
 40  50  40  30  30  40  50  40  30  30  70 100
 30  40  50  40  30  30  40  50  40  30  50  70 100
 30  30  40  50  40  30  30  40  50  40  30  50  70 100
 30  30  30  40  70  30  30  30  40  70  20  30  50  70 100

G = .60, Matrix: NONE (εO = 1.0, εR = 1.0, εOR = 1.0):
100
 46 100
 46  46 100
 46  46  46 100
 46  46  46  46 100
 45  22  22  22  22 100
 22  45  22  22  22  46 100
 22  22  45  22  22  46  46 100
 22  22  22  45  22  46  46  46 100
 22  22  22  22  45  46  46  46  46 100
 45  22  22  22  22  45  22  22  22  22 100
 22  45  22  22  22  22  45  22  22  22  46 100
 22  22  45  22  22  22  22  45  22  22  46  46 100
 22  22  22  45  22  22  22  22  45  22  46  46  46 100
 22  22  22  22  45  22  22  22  22  45  46  46  46  46 100

G = .60, Matrix: O/OR (εO = .6552, εR = 1.0, εOR = .6542):
100
 82 100
 82  82 100
 82  82  82 100
 82  82  82  82 100
 33  22  22  22  22 100
 22  33  22  22  22  36 100
 22  22  33  22  22  36  36 100
 22  22  22  33  22  36  36  36 100
 22  22  22  22  33  36  36  36  36 100
 20  22  22  22  22  82  22  22  22  22 100
 22  20  22  22  22  22  82  22  22  22  20 100
 22  22  20  22  22  22  22  82  22  22  20  20 100
 22  22  22  20  22  22  22  22  82  22  20  20  20 100
 22  22  22  22  20  22  22  22  22  82  20  20  20  20 100

G = .60, Matrix: R/OR (εO = 1.0, εR = .6566, εOR = .5149):
100
 70 100
 40  70 100
 20  40  70 100
 20  20  40  70 100
 30  30  20  15  15 100
 30  45  30  15  15  70 100
 20  30  45  30  20  40  70 100
 15  15  30  45  30  20  40  70 100
 15  15  20  30  60  20  20  40  70 100
 30  30  20  15  15  30  30  20  15  15 100
 30  45  30  15  15  30  45  30  15  15  70 100
 20  30  45  30  20  20  30  45  30  20  40  70 100
 15  15  30  45  30  15  15  30  45  30  20  40  70 100
 15  15  20  30  60  15  15  20  30  60  20  20  40  70 100
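The ε = 1.0 entries in Tables B-1 through B-3 follow from the one-facet G coefficient of a compound symmetric matrix, G = kρ / (1 + (k-1)ρ), solved for the common correlation ρ = G / (k - (k-1)G). A NumPy sketch (mine, for verification only; the function name is illustrative):

```python
import numpy as np

def cs_matrix(G, k, var=100.0):
    """Compound symmetric covariance matrix (epsilon = 1.0) whose
    one-facet G coefficient for k levels equals G."""
    rho = G / (k - (k - 1) * G)   # solve G = k*rho / (1 + (k-1)*rho)
    S = np.full((k, k), rho * var)
    np.fill_diagonal(S, var)
    return S

# Reproduces the Table B-2 entry for G = .90, k = 5 (off-diagonals 64.29):
print(round(cs_matrix(0.90, 5)[0, 1], 2))   # 64.29
```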