COMPARABILITY OF TEST SCORES FOR NON-ABORIGINAL AND ABORIGINAL STUDENTS

by

BONITA MARIE DAVIDSON

B.A., Laurentian University, 1991
B.Ed., Laurentian University, 1992
D.Edu., The University of British Columbia, 1997
M.A., The University of British Columbia, 2000

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES (Department of Educational and Counselling Psychology, and Special Education: Measurement, Evaluation, and Research Methodology)

THE UNIVERSITY OF BRITISH COLUMBIA

We accept this thesis as conforming to the required standard

July 2004

© Bonita Marie Davidson, 2004

ABSTRACT

The purpose of this study was to examine the comparability of the BC Ministry of Education's Grades 4 and 7 Reading and Numeracy Foundation Skills Assessment (FSA) scores for aboriginal and non-aboriginal students. It was found that the compositions of the constructs being measured had many similarities across the aboriginal and non-aboriginal populations and were congruent for the reading assessments but not for the numeracy assessments. The reliability estimates of the scores for each population were high and very similar. The Grade 7 Numeracy assessment provided more measurement accuracy for the aboriginal group than for the non-aboriginal group, while the Grade 4 Numeracy assessment and the Grades 4 and 7 Reading assessments provided less measurement accuracy for the aboriginal group than for the non-aboriginal group. For all assessments, items were ordered similarly in terms of their difficulty level and their degree of discrimination, and were ordered moderately similarly in their inherent possibility of being answered correctly by chance. For all assessments there was a low level of differential item functioning. Overall, the results indicated that, for this study, there was a high degree of comparability across the aboriginal and non-aboriginal populations for the Reading FSA scores, because all four analyses for both grades showed them to be highly comparable. There was a moderately high degree of comparability across the two populations for the Grade 4 Numeracy FSA scores, because three of the four analyses showed them to be highly comparable. There was a moderate degree of comparability across the two populations for the Grade 7 Numeracy FSA scores, because two of the four analyses showed them to be highly comparable.

TABLE OF CONTENTS

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements
Chapter One: Introduction
    Overview of the Study
    Problem
    Purpose of Study & Research Questions
    The Comparable Nature of Scores
    Methods of Construct Comparability
    Importance of the Study
    Preview of Chapter Two
Chapter Two: Literature Review
    Scope and Organization of the Review
        Validity
        Aboriginal Educational Performance
    Literature Review
        Professionally-Upheld Practical Rules for the Development of a Test
        Theory Underpinning a Unitary Notion of Validity
        Early History of Aboriginal Education with Regard to Academic Success
        Recent History of Aboriginal Education with Regard to Academic Success
Chapter Three: Method
    Instrument
    Participants
    Procedures
    Factor Analysis
        Number of Factors to Extract
        Optimal Rotation
        Congruence Coefficients
    Reliability
    Item Information Functions
    IRT-Based Parameters
    Differential Item Functioning (DIF)
        Linn and Harnisch Method
        Logistic Regression Method
Chapter Four: Results
    Participants
    Factor Analysis
        Selection of the Number of Factors to be Extracted
        Optimal Rotation
        Factor Patterns and Correlations
        Factor Loading Summary
            Grade 4 Numeracy
            Grade 7 Numeracy
            Grade 4 Reading
            Grade 7 Reading
        Factor Correlation Matrices
    Reliability
    IRT-Based Analyses
        Evaluation of the IRT Model Assumptions
            Model Fit
            Unidimensionality
            Local Item Dependence (LID)
        Item Information Functions
        IRT Item Parameter Correlations
    Differential Item Functioning
    Results Summary
Chapter Five: Discussion
    Summary of Statistical Findings
        Factor Analysis
        Reliability
        Item Information Functions and Item Parameters
        Differential Item Functioning
    Research Question 1
    Research Question 2
    Findings in Context
    Implication of Findings
        Interpretation Implications
        Methodological Implications
    Limitations of Findings
    Future Directions
References
Appendixes
    Appendix A: Grade 4 Numeracy Item Details
    Appendix B: Grade 7 Numeracy Item Details
    Appendix C: Grade 4 Reading Item Details
    Appendix D: Grade 7 Reading Item Details
    Appendix E: Aboriginal Grade 4 Numeracy Scree Plot
    Appendix F: Non-Aboriginal Grade 4 Numeracy Scree Plot
    Appendix G: Aboriginal Grade 7 Numeracy Scree Plot
    Appendix H: Non-Aboriginal Grade 7 Numeracy Scree Plot
    Appendix I: Aboriginal Grade 4 Reading Scree Plot
    Appendix J: Non-Aboriginal Grade 4 Reading Scree Plot
    Appendix K: Aboriginal Grade 7 Reading Scree Plot
    Appendix L: Non-Aboriginal Grade 7 Reading Scree Plot
    Appendix M: Numeracy: Eigenvalues Greater than One
    Appendix N: Reading: Eigenvalues Greater than One
    Appendix O: Numeracy: Maximum Likelihood Estimations for the Number of Factors
    Appendix P: Reading: Maximum Likelihood Estimations for the Number of Factors
    Appendix Q: Pattern Matrix for Aboriginal Grade 4 Numeracy Scores
    Appendix R: Pattern Matrix for Non-Aboriginal Grade 4 Numeracy Scores
    Appendix S: Pattern Matrix for Aboriginal Grade 7 Numeracy Scores
    Appendix T: Pattern Matrix for Non-Aboriginal Grade 7 Numeracy Scores
    Appendix U: Pattern Matrix for Aboriginal Grade 4 Reading Scores
    Appendix V: Pattern Matrix for Non-Aboriginal Grade 4 Reading Scores
    Appendix W: Pattern Matrix for Aboriginal Grade 7 Reading Scores
    Appendix X: Pattern Matrix for Non-Aboriginal Grade 7 Reading Scores

LIST OF TABLES

Table 1: Grade 4 Numeracy Item Summary
Table 2: Grade 7 Numeracy Item Summary
Table 3: Grade 4 Reading Item Summary
Table 4: Grade 7 Reading Item Summary
Table 5: Gender Percentages for each Assessment and Population
Table 6: Mean Scores and Standard Deviations for the FSA Scores
Table 7: Cases Removed Based on Non-Response
Table 8: Selection of Number of Factors to be Extracted
Table 9: Simple Structure for Grade 4 Aboriginal Numeracy
Table 10: Simple Structure for Grade 4 Non-Aboriginal Numeracy
Table 11: Simple Structure for Grade 4 Aboriginal Reading
Table 12: Simple Structure for Grade 4 Non-Aboriginal Reading
Table 13: Simple Structure for Grade 7 Aboriginal Numeracy
Table 14: Simple Structure for Grade 7 Non-Aboriginal Numeracy
Table 15: Simple Structure for Grade 7 Aboriginal Reading
Table 16: Simple Structure for Grade 7 Non-Aboriginal Reading
Table 17: Salient Loading and Item Detail for Aboriginal Grade 4 Numeracy Scores
Table 18: Salient Loading and Item Detail for Non-Aboriginal Grade 4 Numeracy Scores
Table 19: Salient Loading and Item Detail for Aboriginal Grade 7 Numeracy Scores
Table 20: Salient Loading and Item Detail for Non-Aboriginal Grade 7 Numeracy Scores
Table 21: Salient Loading and Item Detail for Aboriginal Grade 4 Reading Scores
Table 22: Salient Loading and Item Detail for Non-Aboriginal Grade 4 Reading Scores
Table 23: Salient Loading and Item Detail for Aboriginal Grade 7 Reading Scores
Table 24: Salient Loading and Item Detail for Non-Aboriginal Grade 7 Reading Scores
Table 25: Factor Correlation Matrix for Aboriginal Grade 4 Numeracy Scores
Table 26: Factor Correlation Matrix for Non-Aboriginal Grade 4 Numeracy Scores
Table 27: Factor Correlation Matrix for Aboriginal Grade 7 Numeracy Scores
Table 28: Factor Correlation Matrix for Non-Aboriginal Grade 7 Numeracy Scores
Table 29: Factor Correlation Matrix for Aboriginal Grade 4 Reading Scores
Table 30: Factor Correlation Matrix for Non-Aboriginal Grade 4 Reading Scores
Table 31: Factor Correlation Matrix for Aboriginal Grade 7 Reading Scores
Table 32: Factor Correlation Matrix for Non-Aboriginal Grade 7 Reading Scores
Table 33: Feldt-Raju Reliability Estimates
Table 34: Model Goodness of Fit for Grade 4 Numeracy
Table 35: Model Goodness of Fit for Grade 7 Numeracy
Table 36: Model Goodness of Fit for Grade 4 Reading
Table 37: Model Goodness of Fit for Grade 7 Reading
Table 38: Goodness of Fit Information for the Poorest Fitting Item for Each Test
Table 39: FSA Item Pairs Displaying Local Item Dependence
Table 40: Stocking and Lord Transformation Values
Table 41: Grade 4 Numeracy Item Information
Table 42: Grade 7 Numeracy Item Information
Table 43: Grade 4 Reading Item Information
Table 44: Grade 7 Reading Item Information
Table 45: Correlations between IRT Item Parameters for Aboriginals and Non-Aboriginals
Table 46: Identified DIF Items
Table 47: Number of DIF Items Found Using both the LH and the LR DIF Detection Methods
Table 48: Grade 4 Numeracy DIF Item Details
Table 49: Grade 7 Numeracy DIF Item Details
Table 50: Grade 4 Reading DIF Item Details
Table 51: Grade 7 Reading DIF Item Details
Table 52: Degree of Comparability Across the Aboriginal and Non-Aboriginal Populations

LIST OF FIGURES

Figure 1: Grade 4 Numeracy: Standard error of measurement as a function of scaled scores for the aboriginal and non-aboriginal populations
Figure 2: Grade 7 Numeracy: Standard error of measurement as a function of scaled scores for the aboriginal and non-aboriginal populations
Figure 3: Grade 4 Reading: Standard error of measurement as a function of scaled scores for the aboriginal and non-aboriginal populations
Figure 4: Grade 7 Reading: Standard error of measurement as a function of scaled scores for the aboriginal and non-aboriginal populations

ACKNOWLEDGEMENTS

Thanks to my husband Paul for his love, support, and friendship. Thanks to my new son Sam for being so wonderfully perfect & adorable. Thanks to my mom and dad for believing in me and loving me & Sam so much. Thanks to my niece Alex for her love and support and for being a great influence in my life.

Thanks to Tanya Steele for being an incredibly fun, brilliant, thoughtful, helpful, and motivating friend... the best sister EVER!

Thanks to Rina Bonano for her kindness & friendship... for sharing so much of herself & for thinking that I am special.

Thanks to Kadriye Ercikan for being very smart, patient, and supportive.

Thanks to David Robitaille for being the most perfect role model.

Thanks to Charles Ungerleider for being insightful and inspiring.

Finally, thanks to Ralph Hakstian for his knowledge, ideas, and strength.

CHAPTER ONE: INTRODUCTION

A few years ago, I was a teacher in a beautiful coastal community in British Columbia (BC). I was a recent graduate of a teacher education program and eager to motivate the students to learn the skills and concepts included in the BC Ministry of Education curriculum. I was responsible for a split-grade class composed mainly of Grade 5 students. About half of the students in my class were aboriginal children who lived on the nearby reserve, and about half were non-aboriginal children who lived on the nearby islands. As I worked and developed a relationship with my students, I began to believe that the cognitive abilities of the two groups were very similar, but that their learning styles and the factors that motivated them were quite different. I did my best to tailor classroom activities to meet those different learning styles and motivating factors. By Christmas I felt that all the students were finding success as learners. In the spring, my school principal presented me with a table of my Grade 5 students' results from a large-scale standardized test that they had written during the previous school year.
The results were presented in graphical form, and it was clear that there were two distinct groups of scores: a high-performing group (above the 60th percentile) and a low-performing group (below the 40th percentile). As I examined the results and read the names belonging to each of the two groups, I was surprised to see that the high-performing group consisted of all my Grade 5 non-aboriginal students and that the low-performing group consisted of all my Grade 5 aboriginal students. I was surprised by these results because my perception of the students' academic abilities did not match the test results; the dramatic difference between the two groups did not, in my view, accurately reflect the students' performance in my class. For years afterwards, I wondered if that large-scale assessment was a fair and well-designed measurement tool for both groups.

Upon a recent examination of the BC Ministry of Education's Foundation Skills Assessment (FSA) 2000-2001 results, I saw that, at the provincial level, aboriginal students consistently scored lower than non-aboriginal students in almost all content areas measured for Grades 4 and 7. I found myself wondering the same thing about this assessment as I had about the one described above: Was the FSA a fair and well-designed measurement tool for both aboriginal and non-aboriginal students? Were the items included in the FSA developed in such a way that neither group was placed at an unfair advantage or disadvantage by being given an assessment whose content, context, or language was unfamiliar to them? I decided to perform a construct comparability study of the FSA scores to see whether the test was measuring the same thing for both the aboriginal and non-aboriginal groups, in an effort to see whether biases existed. Gould (1995) says that test scores may be biased either culturally or statistically: "... culturally biased when one group (typically a minority population) performs consistently lower than some reference population ..." and "... statistically biased if two individuals (e.g., one African American, one White) who get the same test score nevertheless perform differently on some criterion external to the test, such as school grades" (p. 2). For the present study, the focus will be on the degree of comparability of test scores for the aboriginal and non-aboriginal students in British Columbia, Canada.

Overview of the Study

Large-scale assessments of students' academic achievement have been widely adopted by states and provinces in North America in an effort to measure learning outcomes and the effectiveness of schooling. Establishing the validity and comparability of scores from large-scale assessments across gender, cultural, racial, or ethnic subgroups is critical to interpreting assessment results accurately. As Messick (1995) has stated, "The extent to which score meaning and action implications hold across persons or population groups and across settings or contexts is a persistent and perennial empirical question" (p. 741). The present study examines the validity and comparability of aboriginal and non-aboriginal students' scores from a large-scale assessment: specifically, the 2000-2001 BC Ministry of Education FSA English-language (as opposed to French) Numeracy and Reading scores for Grades 4 and 7.

Problem

The problem addressed in the present study stems from the consistently poor academic achievement results of Canadian aboriginal students.
A high school dropout rate of nearly 75% for students of aboriginal heritage makes this population a special concern, especially when research findings show that poor academic achievement is a strong factor in a student's decision to quit school (Alexander, Entwisle, & Kabbani, 2001). At a national level, the educational attainment of the Canadian aboriginal population is well below that of the non-aboriginal population; based on 1996 Canadian census data, 42% of the aboriginal population aged 20-29 did not graduate from high school, as opposed to 17% of the comparable non-aboriginal population (Council of Ministers of Education, 2002).

In an effort to understand the causes and sources of the general disparity between the aboriginal and non-aboriginal populations in academic testing results, one must first examine the test scores to see whether the measurement tools being used are equally appropriate for both populations. Recent FSA results indicate that the aboriginal student population consistently performed at lower levels than non-aboriginal students in the content area of mathematics for Grades 4 and 7. These interpretations are based on the findings for the 2000-2001 academic year (British Columbia Ministry of Education, 2002b). This study will determine whether the FSAs administered to aboriginal and non-aboriginal students actually measure the same constructs for both groups of students by examining evidence of construct validity and comparability.

Purpose of Study & Research Questions

The purpose of the present study is to explore the statistical nature of large-scale assessment scores in an effort to establish the comparability (or lack thereof) of the interpretation of the scores for aboriginal and non-aboriginal students. The two research questions that guide this study are as follows: (1) Are scores from the Foundation Skills Assessment comparable across the aboriginal and non-aboriginal populations? and (2) Should score interpretations be the same for both populations?

The Comparable Nature of Scores

In a recent publication by the United States National Research Council, it was stated that "... a school whose students have higher test scores is not necessarily better than one whose students have lower test scores ... the quality of inputs, such as the entry characteristics of students or educational resources available, must be considered" (Committee on Foundations of Assessments, 2001, p. 36).
Further, in an effort to encourage an accurate and respectful understanding of the complexities surrounding the educational performance and attainment of the aboriginal population, the Council of Ministers of Education, Canada (CMEC) made a statement about the context in which this performance and attainment should be viewed:

(a) The first language of many aboriginal children is neither English nor French; hence, attending a school taught in a language different from their first language presents undue and often unaddressed challenges to these learners;
(b) Cultural differences typically exist between aboriginal children and their teachers, and between aboriginal children and their non-aboriginal classmates;
(c) Negative stereotyping of aboriginal children and their families currently exists;
(d) There are relatively few aboriginal people who have found success in post-secondary education and who can act as role models for educational attainment in the aboriginal community; and
(e) The geographically remote nature of many aboriginal communities makes it difficult to attract and retain well-qualified teachers for the local schools (Council of Ministers of Education, 2002).

An individual's performance on an assessment can be influenced by many cognitive and non-cognitive factors other than his or her ability. Scores may vary "for reasons unrelated to achievement, such as the specific content being assessed, the particular format of the assessment items, the timing and conditions for administering the assessments ..." (Committee on Foundations of Assessments, 2001, p. 37). Further, ability in such skills as reading and writing will certainly have an impact on an individual's performance on a typical mathematics assessment if the items are presented in a word-problem format. Familiarity with the context of the language used in word problems, as well as the time allotted for completion, will also affect an individual's performance on a test. Thus, to ensure the validity of test-score interpretations, one must ask, "To what degree—if at all—on the basis of evidence and rationales, should the test scores be interpreted and used in the manner proposed?" (Messick, 1989a, p. 5).

Conceptually, if the members of the aboriginal population interpret the test items in a different manner than the members of the non-aboriginal population, then a unique interpretation for the aboriginal students is necessary in order to be accurate. Item-interpretation differences between the aboriginal and non-aboriginal populations would imply that direct construct comparability does not exist between the two. Whether one is exploring the comparability of the constructs measured by one test across two populations, or the validity of test score interpretations for a single population, one should judge the validity of the test scores in terms of whether the test accomplishes the mission it was developed to achieve (Messick, 1989b). According to Messick, this judgment requires an evaluation of the intended and unintended social consequences of test interpretation and use. FSA scores are intended to measure the foundation skills of reading, writing, and numeracy. According to the BC Ministry of Education, the purposes of the FSA are as follows:

The main purpose of this assessment is to help the province, school districts, schools, and school planning councils evaluate how well foundation skills are being addressed and make plans to improve student achievement. A secondary purpose is to provide teachers, students, and parents or guardians with external information about student performance. The information provided by
A secondary purpose is to provide teachers, students, and parents or guardians with external information about student performance. The information provided by  7  F S A can facilitate discussion at the provincial, district, and school levels. (British Columbia Ministry of Education, 2002a) In terms of intended score use, the B C Ministry of Education stated: "As with all assessment data, it is important to place F S A results in context, carefully considering the characteristics of the assessment instrument and various factors that might influence the results" (British Columbia Ministry of Education, 2002a). The Ministry highlighted such influencing factors as participation rate on the assessment, local policy, and instructional strategies (British Columbia Ministry of Education). They also suggested certain approaches to interpreting the F S A results, such as, "in comparison to local expectations, in relation to past performance, and against external references" (British Columbia Ministry of Education). For the sake of clarification, I would like to highlight that the Ministry did not refer to the aboriginal identity of students as a factor that may influence the appropriate F S A score interpretations, or as a factor by which score interpretations should be referenced. In terms o f unintended score use, The Fraser Institute published a Report Card on British Columbia's Elementary Schools: 2003 Edition in June 2003 that used F S A scores as its sole source of information on which to rate B C elementary schools' overall academic performance. Once rated, the 812 schools were ranked in descending order from best to worst. The Fraser Institute's reports have not been sanctioned by the Ministry, but they have received a great amount of media attention in British Columbia. I would consider this influence to be in the category of unintended score interpretations, and they should be identified as part of the construct validation study.  Methods of Construct Comparability For the proposed study, the aboriginal population is defined as all aboriginal students who were in Grades 4 or 7 during the 2000-2001 school year, and who attended publicly and independently funded B C schools, including independent schools. The comparability of test scores for the aboriginal and non-aboriginal populations will be evaluated by an examination of the degree of congruence of the resulting factor structures, the degree of equivalence of the reliability estimates of the scores, the relative efficiency of the scores in terms of item-information functions, and the existence of items found to have significant Differential Item Functioning (DIF). Factor analysis involves the study of order and structure in multivariate data; its objective is to summarize the empirical relationships among a given set of data (Gorsuch, 1983). The aim of this factor analysis will be to summarize the interrelationships among the measured variables (items) accurately and succinctly. This allows us to investigate whether the test data have similar structures for both groups; and, in this case, it also allows us to examine whether items are related to the overall test score in the same way. The aim of the reliability estimates will be to indicate the degree to which individuals' scores would remain relatively consistent over repeated administration of the same test or alternate test forms (Crocker & Algina, 1986). Reliability estimates are indicators of accuracy and are at the core of examining the degree to which test scores are accurate. 
The aim of the reliability estimates will be to indicate the degree to which individuals' scores would remain relatively consistent over repeated administrations of the same test or of alternate test forms (Crocker & Algina, 1986). Reliability estimates are indicators of score consistency and are central to examining the degree to which test scores are accurate. The aim of calculating the relative efficiency of the scores in terms of item information is to display any disparity in the contribution that each item makes to the estimation of ability along the ability continuum. Finally, the aim of the DIF analysis will be to determine whether groups that are expected to perform similarly differ in their mean performances on specific items (Hambleton, Swaminathan, & Rogers, 1991, p. 109).
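To illustrate how such a DIF screening can be carried out, the sketch below applies the logistic-regression DIF approach, one of the two detection methods named in the Chapter Three outline and commonly attributed to Swaminathan and Rogers (1990), to simulated data. All variable names and values here are hypothetical; the matching criterion, group indicator, and item responses are invented, and the sketch shows the general technique rather than this study's actual implementation.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Hypothetical data for one item: responses (1 = correct), each examinee's
# total test score (the matching criterion), and group code
# (0 = reference group, 1 = focal group).
n = 1000
group = rng.integers(0, 2, size=n).astype(float)
total = rng.normal(0.0, 1.0, size=n)
true_logit = 0.8 * total - 0.4 * group      # uniform DIF built in for the demo
item = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_logit)))

def fit_logit(predictors):
    """Fit a logistic regression of the item response on the predictors."""
    return sm.Logit(item, sm.add_constant(predictors)).fit(disp=0)

m0 = fit_logit(np.column_stack([total]))                        # matching only
m1 = fit_logit(np.column_stack([total, group]))                 # + group term
m2 = fit_logit(np.column_stack([total, group, total * group]))  # + interaction

# Likelihood-ratio chi-square statistics, each on 1 degree of freedom; the
# group term tests uniform DIF, the interaction tests non-uniform DIF.
print(f"Uniform DIF:     chi2 = {2 * (m1.llf - m0.llf):.2f}")
print(f"Non-uniform DIF: chi2 = {2 * (m2.llf - m1.llf):.2f}")
```

In practice the test would be repeated for every item, with some correction for multiple comparisons or an accompanying effect-size criterion.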
Importance of the Study

The principal contribution of the present study will be the determination of score validity and comparability across the aboriginal and non-aboriginal groups. If it can be shown that test scores are not valid or comparable across the groups, then a better-suited assessment tool, or at least a better-suited set of score interpretations, could be designed to replace what is currently in use, so that each group's scores could be deemed valid.

Preview of Chapter Two

Chapter Two reviews and summarizes research literature regarding the educational measurement notion of validity as well as the academic testing performance of the aboriginal population in Canada.

CHAPTER TWO: LITERATURE REVIEW

This chapter reviews and summarizes research literature regarding the educational measurement notion of validity as well as the academic testing performance of the aboriginal population in Canada. Establishing the appropriateness of educational testing for the aboriginal population in Canada has remained in a strained state since the inception of the Indian Act's policy mandating the formal education of all aboriginal children in Canada (Kirkness, 1999). Early studies of the academic success of the aboriginal population in Canada show a strong similarity to recent studies on the same topic. Studies from both the early 1960s and the 2000s offer differing hypotheses about the factors related to the relatively poor academic performance of the aboriginal population, but one vein that runs through more than 40 years of educational research about aboriginal students is that, as a whole, they consistently perform at lower levels than their non-aboriginal counterparts.

Identifying the factors related to the relatively poor academic performance of the aboriginal population does not, by itself, provide the information needed to ensure fair and equitable testing, or to make changes in that direction. Nor does it provide the information needed to establish whether a test is actually measuring the same thing for aboriginal students as for non-aboriginal students. One very powerful way to examine whether testing is fair and equitable across the aboriginal and non-aboriginal groups is to perform a construct validation study with scores from tests administered to both aboriginal and non-aboriginal students.

This study will examine the psychometric properties of test scores in an effort to gain insight into such possible issues as differing factor structures, differing degrees of internal consistency, differing item information and efficiency functions, and test-item bias. This information will support the development of a set of test-development or test-modification recommendations that may be acted upon to help ensure that test score interpretations are valid for both the aboriginal and non-aboriginal populations.

Scope and Organization of the Review

Validity

For this review, the notion of validity is examined from two perspectives: one is the current professionally upheld set of practical rules for developing a test so that the planned inferences from its scores are valid; the other is the currently upheld theory underpinning a unitary notion of validity. The practical perspective is meant to inform the development of the instrument from all angles. The theoretical perspective is also meant to inform the development of the instrument, but much of what validity theory extols is focused on the test scores, implying a post-development perspective on the test. Simply put, the practical perspective outlines how valid test score inferences should be obtained, while the theoretical perspective outlines how valid test score inferences should be obtained, how to check whether this is so, and what to do about it if there is a problem. Both the practical and the theoretical perspectives have recent seminal works that will be highlighted in this review. These recent seminal papers on validity have augmented and corrected flaws in past theory; consequently, the conclusions of the earlier papers are suspect and will not be included in the present review.

Aboriginal Educational Performance

For this review, the educational performance of the aboriginal population is examined from an evolutionary perspective, meaning that early research about the educational performance of aboriginal students is paired with current research in an effort to identify historically stable or newly emerging themes. One historically stable theme, stated previously, is that 40 years of educational research about aboriginal students shows that they consistently perform at lower levels than their non-aboriginal counterparts. This theme will be the main focus of the aboriginal education section of the present review.

Literature Review

Professionally-Upheld Practical Rules for the Development of a Test

The Joint Committee on Testing Practices was established in 1985 by the American Educational Research Association (AERA), the American Psychological Association (APA), and the National Council on Measurement in Education (NCME). According to its bylaws, this committee provides "a means by which professional organizations and test publishers can work together to improve the use of tests in assessment and appraisal" (American Psychological Association, 2002). In 1999, the committee produced the second edition of The Standards for Educational and Psychological Testing (hereafter referred to as "the Standards document"). The Standards document was written as a practical guide that offers a theoretical base for its claims, and it is considered a major seminal work on the topic of test and assessment development. In the following paragraphs, I summarize and critique the Standards document with regard to issues of validity and fairness in testing for identifiable sub-populations.

Generally speaking, the authors of the Standards document (AERA, APA, & NCME, 1999) referred to validity as "the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests" (p. 9) and deemed validity the most fundamental consideration in developing and evaluating tests (p. 9). These same authors identified validity as a "unitary concept" (p. 11)
that identifies the "degree to which all the accumulated evidence supports the intended interpretation of the test scores for the proposed purpose" (p. 11). Based on the Standards document, for either planned or post-hoc score-interpretation validation, the process of identifying the degree of support mentioned above begins with the identification of the construct or concepts the test is intended to measure, followed by an explicit description of what the test scores will be used for. Based on these two factors, appropriate validation evidence can be identified and sought. Of noted importance in the search for evidence is the degree of both construct underrepresentation (insufficient attention to, or focus on, the construct or concepts the test is intended to measure) and construct-irrelevant variance (influences on test responses that are not relevant to the construct or concepts the test is intended to measure).

In terms of examining evidence based on the consequences of testing, the Standards document (AERA et al., 1999) clarified that when different populations produce score distributions that differ, this is evidence of invalidity only if the test actually measured constructs or concepts unrelated to what it was proposed to measure, or if the group differences "were due to the test's sensitivity to some examinee characteristic not intended to be part of the test construct" (p. 16).

To assist in accomplishing all that the Standards document (AERA et al., 1999) holds to be professionally responsible with regard to creating valid score inferences, the document provides a list of 24 standards intended to define the criteria that should be upheld, when applicable, to ensure that the interpretation of test scores is valid, and 12 standards intended to define the criteria that should be upheld, when applicable, to ensure fairness of testing for every examinee or sub-population of examinees. Of these 36 standards, three are most closely related to the present study: Standards 7.1, 7.2, and 7.3 highlight the critical link between the notion of validity and the criteria of fairness in testing and test use, in a manner that informed and justified most of the methodology used in the present study. The three standards are as follows:

Standard 7.1
When credible research reports that test scores differ in meaning across examinee subgroups for the type of test in question, then to the extent feasible, the same forms of validity evidence collected for the examinee population as a whole should also be collected for each relevant subgroup. Subgroups may be found to differ with respect to appropriateness of test content, internal structure of test responses, the relation of test scores to other variables, or the response processes employed by individual examinees. Any such findings should receive due consideration in the interpretation and use of scores as well as in subsequent test revisions. (p. 80)

Standard 7.2
When credible research reports differences in the effects of construct-irrelevant variance across subgroups of test takers on performance on some part of the test, the test should be used, if at all, only for those subgroups for which evidence indicates that valid inferences can be drawn from test scores. (p. 81)
Standard 7.3
When credible research reports that differential item functioning exists across age, gender, racial/ethnic, cultural, disability, and/or linguistic groups in the population of test takers in the content domain measured by the test, test developers should conduct appropriate studies when feasible. Such research should seek to detect and eliminate aspects of test design, content, and format that might bias test scores for particular groups. (p. 81)

Theory Underpinning a Unitary Notion of Validity

Messick (1995) critiqued traditional notions of validity, claiming that a compartmentalized vision of validity, one that sees content validity, criterion validity, and construct validity as separate aspects of a test, was incomplete to a fault because it failed to "take into account both evidence of the value implications of score meaning as a basis for action and the social consequences of score use" (p. 741). Messick presented a non-compartmentalized vision of validity that relates all aspects of validity while encompassing the value implications of score meaning and the social consequences of score use. His comprehensive view of validity "integrates considerations of content, criteria, and consequences into a construct framework for empirically testing rational hypotheses about score meaning and utility" (p. 742).

Of fundamental importance to Messick (1995) with regard to the true meaning of validity in psychological and educational testing is a claim he had made in the previous decade and repeated here: "Validity is an overall evaluative judgement of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of interpretations and actions on the basis of test scores or other modes of assessment" (p. 741). Messick insisted that the concept of validity must encompass the interpretation of test scores, not the test scores themselves, and that the fairness of a specific test-score use can only be evaluated by examining the particular interpretation of the test scores at hand. From there, Messick re-introduced a notion of construct validity that "is based on an integration of any evidence that bears on the interpretation or meaning of the test scores—including content- and criterion-related evidence—which are thus subsumed as part of construct validity" (p. 742).

For the practical application of his vision of construct validity, Messick (1995) presented a list of six potential sources of evidence, or aspects, of validity for researchers to investigate when making an overall evaluative judgment about the validity of the interpretation of the test scores at hand. These six sources are: (1) content relevance and representativeness; (2) substantive theories, process models, and process engagement; (3) scoring models as reflective of task and domain structure; (4) generalizability and the boundaries of score meaning; (5) convergent and discriminant correlations with external variables; and (6) consequences as validity evidence. Messick defended the appropriateness and completeness of his selection of the six aspects of validity by stating that the "six aspects are highlighted because most score-based interpretations and action inferences, as well as the elaborated rationales or arguments that attempt to legitimize them, either invoke these properties or assume them, explicitly or tacitly" (p. 747).
In his conclusion, Messick (1995) put strong emphasis on the importance of examining the social consequences, "both potential and actual" (p. 748), of the interpretation of test scores. In this specific context, Messick claimed that "it is not that adverse social consequences of test use render the use invalid but, rather, that adverse social consequences should not be attributable to any source of test invalidity, such as construct underrepresentation or construct-irrelevant variance" (p. 748). In his closing words, Messick stated that "validity and values are one imperative, not two, and test validation implicates both the science and the ethics of assessment, which is why validity has force as a social value" (p. 749). Messick's notion of validity makes clear that for interpretations to be meaningful and justified, all components of validity need to be studied, and that this is part of the ethics of assessment. This unitary notion of validity implies, to some degree, that the content aspects of a test need to be examined as well as its statistical aspects.

Early History of Aboriginal Education with Regard to Academic Success

Before European contact in North America, aboriginal people had a form of education "... in which the community was the classroom, its members were the teachers, and each adult was responsible to ensure that each child learned how to live a good life" (National Indian Brotherhood, 1973, as cited in Kirkness, 1999). Once contact was made between the Europeans and the aboriginal peoples, and the cultures began to interact with one another, new skills became important for each group. Skills relating to hunting, fishing, trapping, boating, navigating, and medicine, to name a few, were taught to the Europeans by the aboriginal people; skills relating to commerce and the English language were taught to the aboriginal people by the Europeans. By the 1600s, Europeans had begun to establish day schools in Canada and the United States designed to "civilize" the aboriginal population. By the 1800s, the day schools were being replaced by residential schools; at the peak, in the 1930s, there were approximately 80 residential schools in Canada. In the 1950s, day schools were brought back to replace the residential schools, and by the 1970s most residential schools in Canada had closed. Around this time, a national policy on integration was brought forward, and aboriginal children began to attend public schools (Kirkness, 1999). This integration brought children together in terms of proximity, but it did not ensure an integration of cultures and educational beliefs. Problems related to aboriginal student success in these integrated schools were such that by 1972, it was reported that 96% of aboriginal children did not finish high school (Council of Ministers of Education, 2002). The integration of aboriginal children into public schools was thus followed by reports of extremely poor levels of school success. The 1960s saw a surge of research exploring the problems of the education provided to the aboriginal populations in both Canada and the United States, accompanied by many recommendations for change and reform. The degree to which these recommendations were accepted and acted upon will not be addressed in the present study; rather, the focus is on the similarities between the 1960s research findings and more recent turn-of-the-century findings.

In 1961, Clarence Wesley of Arizona, an Apache tribal chairman, spoke of the "uneasiness and deep concern" he had about the education being provided to aboriginal children in the United States, both on and off reservations (Wesley, 1961).
In 1961, Clarence Wesley of Arizona, an Apache Tribe Chairman, spoke of the "uneasiness and deep concern" he had about the education being provided to aboriginal  19  children in United States both on and off reservations (Wesley, 1961). His paper was not a formal research paper, but rather an opinion piece from the chairman of a Tribe. This opinion paper echoes problems about aboriginal education that are still being heard today in both the United States and Canada. Wesley highlighted problems such as dropout rates, delinquency, and the weak skills that their students were being told were sufficient to graduate from Grade 12. He claimed that once the Indian students graduated and went to college or university, they were left "unable to compete with the non-Indian students" (Wesley). Wesley refused to attribute the lack o f success of aboriginal students to the schools alone, but he did state that the schools must take partial responsibility because it is the fundamental function o f the school "to prepare Indians to become responsible participants in the American way of life". Wesley (1961) identified the causes of the lack of success to things such as the native students' weak English language skills and the dichotomy of SES between the native and non-native families. Wesley claimed that the above two influencing factors made the adoption o f non-native language and culture very challenging for native children, but still saw this as part o f the solution. He stated that the main advantage of having native children attend the same schools as non-native English-speaking children, "is the fact that here these youngsters are forced to use the English language on the playground because that is the only way they can make themselves understood by their non-Indian playmates." Wesley claimed that when the above scenario is not fully achieved, the school curriculum, which is prepared for students who are expected to have a certain level of proficiency with the English language, there is an extraordinary challenge for the native students to succeed and this scenario is hard to achieve when the  20  non-native and native groups come from such diverse cultural and S E S backgrounds. In the end, Wesley's recommendation for the solution of the problems of native education was to focus on the use of the English language from a very early age while, simultaneously, firmly holding on to the Native culture. Wesley's (1961) apparent perception of the problems involved with Native education, although written 40 years ago, seem to hold true to some degree today. His ideals were commendable and respectful o f both native and non-native cultures; he simply affirmed that aspects of the cultures differed in ways that could have led to culturally-based obstacles for native children in their pursuit of success in the non-native school system. David Lloyd (1961), also of Arizona, wrote a paper that attempted to examine whether native students performed better on standardized tests i f they spend more time in a public school system whose population was made up of over 97% non-native students. He held a hypothesis that Indian students in the public school system were "doing as well, both intellectually and academically, as the non-Indian student." Lloyd did not substantiate his hypothesis with previously published research findings, so it was no surprise to me when, at the end o f his paper, he concluded that his hypothesis was not supported by his data. 
Lloyd's statistical methodology was very weak, however, which makes his conclusion that the hypothesis was unsupported itself questionable. He examined both standardized intellectual-ability scores and standardized academic-achievement scores from a group of students across various grades. He claimed certain differences between the Indian and non-Indian students' scores based on a visual inspection of mean scores and graphed stanine scores, without taking into account that less than 3% of the student scores being examined came from Indian students. To his paper's detriment, he did not perform any inferential statistical tests to determine whether there was a statistically significant difference between the scores of the non-Indian students and those of the Indian students.
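To make the criticism concrete, the kind of inferential check that was missing is simple to run. The sketch below uses Welch's two-sample t-test, which tolerates severely unbalanced group sizes and unequal variances, both relevant here given that fewer than 3% of the scores came from Indian students. The score vectors are simulated and purely illustrative; this is not a re-analysis of Lloyd's data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Invented stanine-like scores: a small group and a much larger group,
# mirroring the <3% share of Indian-student scores in Lloyd's data.
indian_scores = rng.normal(4.6, 1.9, size=25)
non_indian_scores = rng.normal(5.1, 1.9, size=900)

# Welch's t-test does not assume equal group sizes or equal variances.
t_stat, p_value = stats.ttest_ind(indian_scores, non_indian_scores,
                                  equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```

With such unbalanced groups, an apparent mean difference can easily fall within sampling error, which is precisely why a formal test, rather than visual inspection, was needed.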
Lloyd made an additional attempt to compare the Grade 6, 8, and 10 Indian students who had attended the public school system for almost all of their public school years (a one-year exception was allowed) with Indian students who had attended public school for only a couple of years (two years or less for the Grade 6 and 8 students, three years or less for the Grade 10 students). Again, Lloyd performed a visual inspection of the scores and made a claim as follows:

... there seems to be evidence that those Indians who have spent their entire educational life in the Mesa Public Schools tend to have a higher mean intelligence quotient for language, non-language and total mental as measured by the California Test of Mental Maturity than those who have been in the system a relatively short time.

Although Lloyd's (1961) conclusions that intellectual and academic achievement scores differ between the Indian and non-Indian students are founded on weak and faulty methodology, his closing remarks are worth noting. He concluded that the tests of intellectual ability do seem to favour students educated in the public school system, and that although some Indian students may have spent the majority of their educational years in public school, they still live in a "segregated situation where the socio-economic standard is much lower and where many of the enriching experiences are lacking." Lloyd wisely suggested that the standardized tests themselves may be biased instruments that favour the non-Indian student; a bias that could have resulted from assuming the tests would measure the same thing for both groups of students even though the Indian students come from drastically different cultural and socio-economic situations.

In his work with the Bureau of Indian Services, Witherspoon (1962) deduced that some of the aspects of assessment that led to the failures of Indian children were controllable in the assessment design. He saw the following as controllable aspects that could be altered to reduce the assessment failures of the Indian children:

1. the predominantly verbal content of the tests;
2. the necessity for speed;
3. the observed difficulty Indian children had with separate answer sheets; and
4. an apparent lack of motivation, thought to come at least in part from the difficulties listed above (Witherspoon, 1962).

Witherspoon revised parts of previously existing measures and combined them in what he thought was the most meaningful way for the Indian students. In his examination of the students' test scores, both Indian and non-Indian, he found that some consistencies with past research findings regarding Indian students and assessment remained. He found that the disparity between Indian and non-Indian children grew as the grade level increased, and that the most significant progress made by the Indian students in the core subjects came in the first six to eight years in public school. Through further examination of his results, Witherspoon claimed that Indian children, in general, do not "begin with the same preparation, nor do they, as is sometimes claimed, keep up with their non-Indian peers through the first three, four, or six grades." Although Witherspoon claimed that his research findings supported previous findings about Indian-student performance on assessments, his paper did not address to what degree his initial claim, that better-suited assessment tools for Indian children could be created, was affirmed.

In 1961, Keeler, chairman of the Task Force on Indian Education, gave a speech at the Indian Education Conference at Arizona State University (Keeler, 1961). In this speech, Keeler brought forth what he thought was an erroneous assumption about the simplicity of the problem of Indian education: "... there is widespread public assumption that all you have to do to make the Indian child a facsimile of other American youngsters in education and habits of mind is just put them in the public schools with the white children, and the job is done." Keeler stated that for educators to ignore the intellectual habits of the Indian child, habits that are valued in their homes, is "a bad oversight." He spoke about intellectual habits such as "... craftsmanship habits that the Indian has, his painstaking ability in craftwork, his attention to detail, his patience. Also, the Indian child, without ever knowing that he's getting it, observes in nature things that the white man still hasn't seen." Keeler cited the words of Dr. Reifel (citation unknown) stating that "... the Indian's idea of time was different from the average American and white man because the American was future minded." He claimed that the Indian does not seem to care much about the duration of time. Keeler ended his speech with a declaration of his belief that "the Indian is an extremely intelligent individual" whose motivation is the factor that educators need to seek out if they want to succeed in educating the Indian child.

Although Keeler's (1961) comments were stereotypical, his perspective was one of respect and admiration for the Indian people. He attributed their lack of academic success to the lack of Indian content and perspective-taking in the development of the curriculum in use.
Evvard and Weaver described what happened when they attempted to administer the verbal section of their measures to the native children: "They began to wring their hands, to tap their feet, and to show other signs of emotional distress. They became unable to respond in any way that made it possible to score their performance at all." In speaking of the performance o f native children on intelligence tests, Evvard and Weaver (1966) concluded that i f the "goal is to enable the Indian students to achieve the same success with these materials as does the Anglo child, then we must deal with the differences in language, environment, and value systems of the Indian student." Alternatively, they offered that i f the goal is not to imitate the non-native child's achievements, then an instrument that does not depend on the English language and that is built around the Indian environment is more appropriate. Evvard and Weaver stated  25  that i f either of the above two courses of action were taken, the Indian students could perform equally well and have equal levels of achievement as non-Indian students. The findings of Evvard and Weaver (1966) would have been better served i f they had presented some of the data in their paper, or particular reasons for lower achievement levels for aboriginal children.  Recent History ofAboriginal Education with Regard to Academic Success Demmert (2001) summarized the current poor state of aboriginal education in a manner that is echoed throughout Canada and United States: Except for the tribal schools, responsibility for the education of Native children and youth has been transferred from the tribes to state agencies, mostly to administrators and other individuals outside the communities or tribes. With this transfer of responsibility, Native students began experiencing high levels of educational failure and a growing ambivalence toward learning traditional tribal knowledge and skills. They often exhibited indifference to formal Western academic learning as well. (p. 2) In response to the poor performance of aboriginal students, recent research has revealed some new causes and influences, but has, not surprisingly, included most of the ones deduced from the 1960s research (presented above) as well. A s highlighted by Demmert, academic performance of aboriginal students is still generally lower than non-aboriginal students, yet more recent research has shown that the following efforts have led to a lessening of the gap between the two: (a) a nurturing early childhood environment (Swisher & Deyhle, 1989); (b) inclusion of native language and cultural programs in the school (Ayoungman, 1991; Barnhardt, 1990, 1999; deMarrais, 1992; James, Chavez,  26  Beauvais, Edwards, & Oetting, 1995; Lipka & McCarty, 1994; Rubie, 1999; Slaughter & Lai, 1994; McLaughlin, 1992; Watahomigie & McCarty, 1994); and (c) enhanced community and parental influences on academic performance (Leveque, 1994; Mclnerney, Mclnerny, Ardington, & De Rachewiltz, 1997). Regardless of all the work being done to enhance the academic performance of aboriginal children, these children continue to face unique problems and continue to remain at risk. The state of aboriginal child welfare is in need of improvement. Recently, members of five Canadian national aboriginal organizations worked to develop an improvement-based document titled " A National Children's Agenda: Developing a Shared Vision." 
In this document, the aboriginal leaders stated,

Aboriginal children face far greater risk than most non-aboriginal children since, among many things, they are: twice as likely to be born prematurely, underweight, or die within their first year of life; three or four times more likely to suffer Sudden Infant Death Syndrome; fifteen to thirty-eight times more likely to suffer from the effects of Fetal Alcohol Syndrome; three times more likely to be physically disabled; six times more likely to die by injury, poisoning or violence; and five times more likely to take their own life. (Indian and Northern Affairs Canada, 2002)

In terms of the future success of aboriginal people in Canada, this group of five aboriginal organizations concluded that hope for the future is vested in their children. They stated, "Aboriginal people firmly believe that children represent the primary means through which cultures can preserve their traditions, heritage and languages. In this sense, children are considered the hope of the future" (Indian and Northern Affairs Canada, 2002).

With this hope of the future vested in the children, attention needs to be put on specific aspects of the impediments to aboriginal children's success. The one impediment to success that the present paper explored is aboriginal children's comparatively poor performance on large-scale assessments. The specific case that was explored in this study was the British Columbia Ministry of Education provincial assessments of mathematics and reading. For the 2000-2001 academic year, students' scores from the Grades 4 and 7 provincial assessments revealed that aboriginal students performed at a lower level than any other identified group, including ESL students (British Columbia Ministry of Education, 2002b).

The present study investigated student performance in an effort to find out whether the assessments were measuring the same constructs, with the same degree of accuracy, for both the aboriginal and non-aboriginal students. If the assessment scores proved to be comparable for the two groups, then the interpretation of the scores for the two groups should be the same. Conversely, if the assessment scores did not prove to be comparable for the two groups, then the interpretation of the scores for the two groups should not be the same. I believe that the findings of this study will assist the aboriginal organizations, as well as federal and provincial organizations, to more accurately measure and interpret the performance of aboriginal students in a manner that will contribute to their improvement.

CHAPTER THREE: METHOD

This chapter provides a detailed description of the methodology used in the study. The selection of the instrument and the data for use in this study are discussed, followed by a description of the methodological procedures used to examine the data.

Instrument

The results from the 2001 British Columbia Ministry of Education's FSA Numeracy and Reading assessments for Grades 4 and 7 students were used in the present study. For both grades, the numeracy test booklets contained 32 multiple-choice (MC) items and 4 open-ended (OE) items. For Grade 4, the reading examination booklet contained 35 MC items and 4 OE items. For Grade 7, the reading examination booklet contained 42 MC items and 4 OE items. For each of the four assessments, all students wrote the same items presented in the same order.
For the present study, both item types were included in all analyses except the factor analysis, as the specific factor analysis method selected could not accommodate mixed item formats.

For both grades, the numeracy items can be categorized into four sub-content areas: (1) number; (2) patterns and relationships; (3) shape and space; and (4) statistics and probability. The number sub-content area included such topics as ratio, height estimation, and rounding numbers. The patterns and relationships sub-content area included such topics as identification of number patterns and identification of shape patterns. The shape and space sub-content area included such topics as estimating height, converting between different metric units, and telling time. The statistics and probability sub-content area included such topics as reading bar graphs and calculating probabilities. Summaries of items by sub-content area, context, and item type for the Grades 4 and 7 Numeracy items are provided in Tables 1 and 2 below. For example, from Table 1, the number sub-content area had 15 items, the context of field trip had 18 items, and the MC item type had 32 items. See Appendixes A and B for full details of each item with regard to sub-content area, context, item type, and number-of-words per item for the Grades 4 and 7 Numeracy items.

Table 1
Grade 4 Numeracy Item Summary

                                                Field Trip      Activity Day
Sub-Content Area            Total # of Items     MC    OE        MC    OE
Number                             15             6     0         7     2
Patterns & Relationships            7             3     1         3     0
Shape & Space                       8             5     0         3     0
Statistics & Probability            6             2     1         3     0

Table 2
Grade 7 Numeracy Item Summary

                                                Ski Trip        School Fun Fair
Sub-Content Area            Total # of Items     MC    OE        MC    OE
Number                             15             7     1         7     0
Patterns & Relationships            7             2     1         3     1
Shape & Space                       9             4     0         4     1
Statistics & Probability            5             3     0         2     0

For both grades, the reading items can be categorized into three sub-content areas: (1) critical analysis; (2) identify and interpret key concept and main idea; and (3) locate, interpret, and organize details. The critical analysis sub-content area items included such aspects as requiring students to read between the lines to infer correct meaning, and students' recognizing the author's purpose for presenting specific information in an article. The identify and interpret key concept and main idea sub-content area items included such aspects as distinguishing events in the story from the main idea of the story and distinguishing trivial points of the story from the main idea of the story. The locate, interpret, and organize details sub-content area items included such aspects as separating one's own ideas from those portrayed in the story and using context clues to infer the meaning of a word. Summaries of items by sub-content area, context, and item type for the Grades 4 and 7 Reading items are provided in Tables 3 and 4. For example, from Table 3, the critical analysis sub-content area had 6 items, the context of crime solving had 4 items, and the MC item type had 35 items. See Appendixes C and D for the full details of each item with regard to sub-content area, context, item type, and number-of-words per item for the Grades 4 and 7 Reading items.
Table 3
Grade 4 Reading Item Summary

                                   Critical      Identify & Interpret    Locate, Interpret, &
                                   Analysis      Key Concept &           Organize Details
                                                 Main Idea
Context         Total # of Items    MC   OE       MC   OE                 MC   OE
Crime Solving          4             0    0        1    0                  3    0
Rain                   5             0    0        1    0                  3    1
House Pets             5             1    0        0    0                  4    0
Polar Bears            5             0    0        1    0                  3    1
Frogs & Toads          5             1    0        0    0                  4    0
Rabbits                5             1    1        0    0                  3    0
Memory                 5             1    0        1    0                  3    0
Tree Growth            5             1    0        1    0                  2    1

Table 4
Grade 7 Reading Item Summary

                                   Critical      Identify & Interpret    Locate, Interpret, &
                                   Analysis      Key Concept &           Organize Details
                                                 Main Idea
Context         Total # of Items    MC   OE       MC   OE                 MC   OE
Egypt                  6             1    0        0    0                  4    1
Baseball Game          6             2    0        0    1                  3    0
Whales                 4             1    0        0    0                  3    0
Ponies                 6             2    0        1    0                  2    1
Frogs & Toads          5             1    0        0    0                  4    0
Goldfish               6             4    0        0    0                  2    0
Willow Tree            7             1    0        0    0                  5    1
Snakes                 6             1    0        1    0                  4    0

Participants

All students who completed the English version (as opposed to French) of the FSA assessment in May of 2001 were included in the study (for both grades, more than 99% of the students wrote the English version of the FSA assessment). Students were divided into two populations: the aboriginal population and the non-aboriginal population. The aboriginal student population, which made up approximately 8% of the population for both assessments, included all students who identified themselves, or were identified by their parents or guardians, as being of aboriginal ancestry. The non-aboriginal student population, which made up approximately 92% of the population for both assessments, included students who were not identified as aboriginal.

Procedures

The validity and comparability of test scores for the aboriginal and non-aboriginal populations were evaluated by examining the resulting factor structures, reliability estimates, item information functions, item response theory based parameters, and DIF items.

Factor Analyses

Exploratory factor analysis methods were used to identify the factor structures of the scores for each population in each grade. Factor analysis helped in understanding the structure of the correlation matrix of the items from the FSA assessments. In factor analysis studies, the purpose is to encapsulate the relationships among the variables "in a concise but accurate manner as an aid in conceptualization . . . by including the maximum amount of information from the original variables in as few derived variables, or factors, as possible to keep the solution understandable" (Gorsuch, 1983, p. 2).

Exploratory factor analysis models are of two types: the full component model and the common factor model, both being variants of the multivariate linear factor model. Each of these factor analysis models can be further divided depending on whether the researcher assumes the factors to be correlated or uncorrelated. The full component model "is based on perfect calculation of the variables from the components" (Gorsuch, 1983, p. 14) with the assumption of no other sources of variance, while the common factor model "includes sources of variance not attributable to the common factors" (Gorsuch, p. 14) that are unique to each variable. Selection of a model should be based on existing substantive theory in the area of research as well as cost and time efficiency issues (Gorsuch). For this study, I selected the common factor model. It is the more complex of the two, but I believe that it is more meaningful to assume that, for the data at hand in the present study, sources of variance that are not attributable to the common factors do exist for each variable.
Further, the model allows for the factors to be correlated; if they prove to be uncorrelated, the model will allow for that, and no error would be made.

There are two main steps in applying a common factor model that assumes correlated factors to actual data: (1) prepare the relevant correlation matrix and extract the initial factors; and (2) determine the optimal rotation. When the same number of factors is extracted for the two groups being compared, these steps are followed by the calculation of congruence coefficients, which determine the degree of comparability of the relevant factors for the different populations. For all subsequent analyses, unweighted least squares (ULS) common factor analysis was used because this approach ignores the diagonal of the correlation matrix, thus maximally accounting for the variance in the off-diagonal elements and resulting in minimized off-diagonal residuals.

Number of Factors to Extract

The first step in factor analysis was to find the number of factors that could adequately explain the correlations among the observed variables or items. The main concern in this step is finding out if a smaller number of factors can account for the covariation among a much larger number of variables. In order to perform an extraction of factors, the researcher must provide either the number of common factors to be extracted, or the criteria by which such a number can be determined. I used three criteria by which such a number could be determined: (1) the scree test; (2) the Kaiser-Guttman criterion of retaining only those components with associated eigenvalues greater than one; and (3) the Likelihood Ratio Test of fit (Hakstian, Rogers, & Cattell, 1982). These procedures are described below in order of priority.

The first criterion was the scree test. For this test, the relevant eigenvalues were plotted in descending order. The pattern revealed by the plot of eigenvalues was examined for a sharp descent and a point of levelling off. The number of eigenvalues in the sharp descent (in the part of the plot before the eigenvalues start to level off) corresponds to the number of factors to be extracted based on the scree test.

To be precise about the scree procedure, it consists of entering unities in the diagonal of the given correlation matrix and extracting successive latent roots by a principal axis program down to n roots, when n is the number of variables (the last root may be zero). . . . When the size of these roots (their variances) are successively plotted, one characteristically gets a falling curved section followed by a straight line (or two or even three), at a much lesser angle to the horizontal, extending over perhaps ten successive eigenvalues each. The resemblance of this "debris" to the scree of rock debris running straight, at an angle of "boulder stability," at the foot of a mountain gave the term "scree test". (Cattell & Vogelmann, 1977, p. 292)

The second criterion was the Kaiser-Guttman criterion of retaining only those components with associated eigenvalues greater than one (the eigenvalue-greater-than-one criterion). The third criterion was the Likelihood Ratio Test of fit (a Chi-squared test using the maximum likelihood factor analysis procedure), which was used to perform statistical inference testing on the number of factors proposed to explain the data. For the present study an alpha level of 0.05 was used.
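To make the first two extraction criteria concrete, the following minimal sketch computes the eigenvalues of the inter-item correlation matrix and applies the Kaiser-Guttman rule; the sorted eigenvalues are exactly what would be plotted for the scree test. The sketch is illustrative only: the use of Python with NumPy, and all variable names, are my own assumptions, not the software used for the analyses reported in this study.

```python
import numpy as np

def extraction_criteria(responses):
    """responses: examinee-by-item score matrix.

    Returns the eigenvalues of the inter-item correlation matrix, sorted in
    descending order (the input to a scree plot), and the Kaiser-Guttman
    count of eigenvalues greater than one.
    """
    r = np.corrcoef(responses, rowvar=False)       # inter-item correlation matrix
    eigvals = np.sort(np.linalg.eigvalsh(r))[::-1] # descending eigenvalues
    kaiser_guttman = int(np.sum(eigvals > 1.0))    # eigenvalue-greater-than-one count
    return eigvals, kaiser_guttman
```

The scree judgement itself remains visual: the analyst plots the returned eigenvalues and looks for the point where the sharp descent gives way to the level "scree."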
Not all criteria necessarily lead to a convergence on the same number of factors to be extracted. The three criteria were presented in order of priority, with the scree test being the best criterion because, unlike the other two, it is not subject to arbitrary boundaries of eigenvalues and alpha levels; it lets the data "talk" without pre-set boundaries of importance. Hence, when the three criteria revealed different numbers of factors to be extracted, I used the scree test exclusively (Hakstian, Rogers, & Cattell, 1982). When there was agreement between two criteria, I used the number of factors for extraction that they converged upon, regardless of which two.

Optimal Rotation

For common factor analyses, "it is generally assumed that the initial factors will be rotated so that the factors meet criteria that make them more relevant to the purpose of the study" (Gorsuch, 1983, p. 176). The purpose of rotation, not unlike a linear transformation, is to shift the factors into the most parsimonious position without altering their relative values. In factor analyses, the most parsimonious position is referred to as simple structure. For this study, the criteria I chose to use to determine simple structure were that a particular rotation (transformation) revealed (a) a maximum number of salient loadings, (b) a minimum number of complex loadings, and (c) a maximum number of hyperplanar coefficients. Gorsuch (1983) highlighted a general approach to determining salient loadings in that "anything that would be of interest for interpretation would be significant" (p. 208) if sample sizes were large. Typically, salient loadings are identified when a variable's factor loading is larger than 0.30, but because of the very large sample sizes of the present study, this critical value may resolve to be different from 0.30 once the factor loadings are examined. Complex loadings occur when a variable has salient loadings on more than one factor, and hyperplanar coefficients occur when a variable has a factor loading of less than 0.10.

In search of the optimal rotation for each of the subgroups, I used oblique transformations because they allow for the correlation of factors. Further, an oblique transformation procedure was selected based on Hakstian (1971), who showed that, in comparison to other currently prominent methods, the oblique procedure "produces solutions that best exemplify simple structure" (p. 175). A common factor pattern rotated to simple structure was obtained for each subgroup via an Oblimin method, which is a method for oblique (nonorthogonal) rotation. This oblique technique rotates to simple structure through a variance or covariance function of the factor loadings. The Oblimin function allows the researcher to test various values of a parameter, delta, which roughly determines the degree of obliquity. For each subgroup, four values of the delta parameter were used: 0.00, 0.25, 0.50, and 0.80. The corresponding results were examined in search of the delta value that led to the most exemplary simple structure. These values of delta were then used, and the resulting set of rotated factors was considered to have simple structure.
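Given a candidate rotated pattern matrix, the three simple-structure criteria reduce to simple counts. The following minimal sketch (my own illustration in Python/NumPy, with the 0.30 salience and 0.10 hyperplane thresholds taken from the text above; it is not the software used in this study) tallies the counts used to compare rotations obtained at different delta values.

```python
import numpy as np

def simple_structure_counts(pattern, salient=0.30, hyperplane=0.10):
    """pattern: variables-by-factors matrix of rotated pattern loadings.

    Returns (salient count, complex count, hyperplanar count), the three
    quantities used to compare candidate Oblimin rotations.
    """
    abs_p = np.abs(np.asarray(pattern, dtype=float))
    salient_per_var = (abs_p >= salient).sum(axis=1)  # salient loadings per variable
    n_salient = int((abs_p >= salient).sum())         # total salient loadings
    n_complex = int((salient_per_var > 1).sum())      # variables salient on >1 factor
    n_hyper = int((abs_p < hyperplane).sum())         # near-zero (hyperplanar) loadings
    return n_salient, n_complex, n_hyper
```

Running these counts on the pattern matrices produced at delta = 0.00, 0.25, 0.50, and 0.80 identifies the rotation with the most salient and hyperplanar loadings and the fewest complex loadings.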
Congruence Coefficients

When there were equal numbers of factors extracted for each of the aboriginal and non-aboriginal populations for a given assessment, congruence coefficients were calculated for each factor in an effort to examine the degree of factor structure comparability. The congruence coefficient between corresponding factors is denoted by \(\phi_f\), where

\[
\phi_f = \frac{\sum_{i=1}^{p} a_{1if}\, a_{2if}}{\sqrt{\left(\sum_{i=1}^{p} a_{1if}^{2}\right)\left(\sum_{i=1}^{p} a_{2if}^{2}\right)}}. \qquad (1)
\]

In this definition, \(a_{1if}\) is a pattern element for the first sample, \(i\)th variable, and \(f\)th factor, and \(p\) is the number of variables (Harman, 1976). There are no firm rules for interpreting congruence coefficients, and no procedures for testing significance. They range from +1.0 to -1.0, with +1.0 being perfect agreement and -1.0 being perfect inverse agreement (Harman). In general, 0.95 might be interpreted as very comparable, 0.90 as quite comparable, and 0.70 as only somewhat comparable (Harman).

Reliability

Reliability estimates were used to indicate the degree to which individuals' scores would remain relatively consistent over repeated administrations of the same test or alternate test forms (Crocker & Algina, 1986). The reliability estimates for each population were compared in an effort to identify differences in the internal consistency of the scores for the aboriginal and non-aboriginal populations. Considering that census-style data are being used, a difference in the reliability of test scores was deemed to exist if the coefficients differed for the aboriginal and non-aboriginal populations in their respective grades and subject areas. In the absence of an appropriate external criterion for determining a meaningful difference between coefficients, I used my subjective judgement that a difference of more than 0.10 would be considered a meaningful difference.

In an effort to determine the most appropriate reliability coefficient to use for this study, the best fitting model for reliability had to be determined. Qualls (1995) found that the two most common internal consistency estimation techniques, the split-halves and the Cronbach's alpha coefficient, "would generally be inappropriate for multidimensional instruments" (p. 111) as well as for instruments with multiple item formats. Hence, when the test parts have different functional lengths, a congeneric model should be employed (Qualls). In identifying that the congeneric model fits the data best, it was recognised that for both the Numeracy and Reading FSAs there are multiple item formats as well as subcategories of items based on differing content, implying differences in the functional lengths of the test parts or items. Differences in functional length between two test parts can arise when different item types (multiple-choice and open-ended) and different scoring methods (i.e., dichotomous or polytomous) are employed in the same test. When such differences in functional length are expected, the use of the Feldt-Raju formula for estimating the reliability of congeneric measures is deemed most appropriate; it is denoted \(F\text{-}R\,\rho_{XX'}\), where

\[
F\text{-}R\,\rho_{XX'} = \frac{\sigma_X^{2} - \sum_i \sigma_i^{2}}{\sigma_X^{2}\left(1 - \sum_i \lambda_i^{2}\right)}, \qquad (2)
\]

\(\sigma_i^{2}\) equals the observed part-score variances, \(\lambda_i\) represents the functional length of each test part (or item), and \(\sigma_X^{2}\) is the total test variance (Qualls). For the present study, the Feldt-Raju reliability estimate in Equation 2 was used to calculate the reliabilities of the FSAs.
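To make Equations 1 and 2 concrete, the sketch below implements both statistics. It is illustrative only (Python/NumPy is my own choice, not the software used in this study), and estimating the functional lengths \(\lambda_i\) as part-total covariance divided by total variance is an assumption consistent with the congeneric formulation rather than a detail stated in the text.

```python
import numpy as np

def congruence_coefficient(a1, a2):
    """Congruence coefficient between matched factor-loading vectors (Eq. 1)."""
    a1, a2 = np.asarray(a1, float), np.asarray(a2, float)
    return np.sum(a1 * a2) / np.sqrt(np.sum(a1 ** 2) * np.sum(a2 ** 2))

def feldt_raju(parts):
    """Feldt-Raju congeneric reliability (Eq. 2).

    parts: examinee-by-part score matrix (one column per item or test part).
    """
    parts = np.asarray(parts, float)
    total = parts.sum(axis=1)
    var_total = total.var(ddof=1)          # total test variance
    var_parts = parts.var(axis=0, ddof=1)  # observed part-score variances
    # Assumed estimator of functional length: lambda_i = cov(part_i, total) / var(total)
    lam = np.array([np.cov(parts[:, i], total)[0, 1]
                    for i in range(parts.shape[1])]) / var_total
    return (var_total - var_parts.sum()) / (var_total * (1.0 - np.sum(lam ** 2)))
```

Because the congruence coefficient works factor by factor, it is applied once per matched pair of pattern-matrix columns for the two populations.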
Item Information Functions

Item Response Theory (IRT) provides a method of describing and examining the measurement accuracy provided by test items. For each item, IRT can be used to estimate an item information function, denoted by \(I_i(\theta)\), where

\[
I_i(\theta) = \frac{\left[P_i'(\theta)\right]^{2}}{P_i(\theta)\, Q_i(\theta)}, \quad i = 1, 2, 3, \ldots, n. \qquad (3)
\]

\(I_i(\theta)\) is the total information provided by item \(i\) at ability \(\theta\); \(P_i(\theta)\) is the probability that a randomly selected examinee with ability \(\theta\) will answer item \(i\) correctly; \(P_i'(\theta)\) is the first derivative of \(P_i(\theta)\); and \(Q_i(\theta) = 1 - P_i(\theta)\). Item information functions present the "contribution items can make to ability estimation at any point along the ability continuum" (Hambleton, 1989, p. 162). In other words, the item information functions reveal three things: (1) the point along the ability continuum at which each item functions maximally; (2) the measurement accuracy provided by the item at each ability level; and (3) the degree of discriminating power a particular item has at a particular ability level.

The item information functions were used to compare the aboriginal and non-aboriginal populations in two ways: (1) to determine the difference in the area under the item information functions for the two populations for each item of the four FSAs; and (2) to determine the degree of similarity of the standard error of measurement (SEM) values at various ability levels for each set of items that comprise the FSAs for the two populations. In the absence of appropriate external criteria for determining a meaningful difference between the two aspects of the item information functions that were analyzed, I used my subjective judgement that a difference of more than 20% would indicate a meaningful difference in the area under the item information functions; the similarities of the graphically represented SEM values across the various ability levels were inspected visually and are discussed in an effort to indicate the similarity of measurement accuracy across the populations.
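As an illustration of how Equation 3 is evaluated in practice, the sketch below computes item information and a test-level SEM curve for the three-parameter logistic model used for the MC items (described in the next section). The parameter values are hypothetical, and the implementation is my own sketch rather than the software used for these analyses.

```python
import numpy as np

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response, with scaling constant D = 1.7."""
    return c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta - b)))

def item_info(theta, a, b, c):
    """Item information I(theta) = [P'(theta)]^2 / (P(theta) Q(theta)) (Eq. 3)."""
    p = p_3pl(theta, a, b, c)
    q = 1.0 - p
    dp = 1.7 * a * (p - c) * q / (1.0 - c)  # closed-form P'(theta) for the 3PL
    return dp ** 2 / (p * q)

theta = np.linspace(-4.0, 4.0, 161)
items = [(1.2, 0.3, 0.2), (0.8, -0.5, 0.15)]        # hypothetical (a, b, c) triples
infos = np.array([item_info(theta, *abc) for abc in items])
areas = infos.sum(axis=1) * (theta[1] - theta[0])   # area under each item's curve
sem = 1.0 / np.sqrt(infos.sum(axis=0))              # test SEM at each ability level
```

Comparing `areas` computed from the two populations' item parameter estimates operationalizes the 20% area criterion, while plotting `sem` for each population supports the visual SEM comparison.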
IRT Based Parameters

In an effort to determine the degree of similarity of the IRT-based parameters, I examined the respective correlations for each FSA for the two populations. A correlation coefficient equal to or greater than 0.70 would be taken to indicate a meaningfully strong relationship between the IRT parameters.

Differential Item Functioning (DIF)

It is commonly agreed upon by psychometricians that an item is seen as showing DIF "if individuals having the same ability, but from different groups, do not have the same probability of getting the item right" (Hambleton et al., 1991, p. 110). In other words, an item shows DIF if the response functions across different subgroups are not the same (Hambleton et al.). Some of the sources of DIF are understood to be the use of language, content, or context that is not universally understood in the same way amongst the subgroups. There are a number of different DIF detection methods. These methods tend to provide consistent DIF identification, yet many flag items as DIF when the items are not DIF. To verify the DIF status of items, this study used two statistical methods for identifying DIF items: the Linn-Harnisch and the Logistic Regression DIF detection methods. Consensus between these two methods of DIF detection was used to ensure the DIF status of items. These methods of DIF detection were conducted in an effort to determine whether aboriginal and non-aboriginal students who have similar abilities have differing probabilities of answering the items correctly.

Linn and Harnisch Method

The IRT-based LH method (Linn & Harnisch, 1981) of DIF detection compares the item characteristic curves (ICCs) for groups. This method of DIF detection is implemented by the PARDUX software (Burket, 1998). The LH method assesses the fit of the model for the minority group using item and ability parameter estimates.

The LH method obtains estimates of the parameters for the combined group, assuming a three-parameter logistic (3PL) model for the MC items (Lord, 1980) and a two-parameter partial credit (2PPC) model (Yen, 1993) for the OE items. For the 3PL model, the probability \(P_{ij}\) (the estimated probability that person \(j\) would answer item \(i\) correctly) is

\[
P_{ij} = c_i + \frac{1 - c_i}{1 + \exp\left[-1.7\, a_i \left(\theta_j - b_i\right)\right]}, \qquad (4)
\]

where \(a_i\), \(b_i\), \(c_i\), and \(\theta_j\) are all parameter estimates. The probabilities for the reference group (the non-aboriginal group in this case) are based on item parameter estimates for the combined sample, and those for the target group (the aboriginal group in this case) are based on the aboriginal group sample. These probabilities are then compared for the target group and reference group using the observed proportion correct statistics. The proportion of people in a target group who are expected to correctly answer an item is

\[
P_{ig} = \frac{\sum_{j \in g} P_{ij}}{n_g}, \qquad (5)
\]

where \(j\) is a member of target group \(g\), \(i\) is the item, and \(n_g\) is the number of people in the target group (Linn & Harnisch, 1981). The corresponding expected proportion for the combined group of \(N\) examinees is

\[
P_i = \frac{\sum_{j=1}^{N} P_{ij}}{N}. \qquad (6)
\]

From this, the observed proportion correct on item \(i\) for target group \(g\) can be calculated by dividing the number of people in target group \(g\) who answered item \(i\) correctly by the number of persons in target group \(g\). For the complete target group (including all ability levels), the observed proportion correct, \(O_{ig}\), for an item \(i\), is

\[
O_{ig} = \frac{\sum_{j \in g} u_{ij}}{n_g}, \qquad (7)
\]

where \(u_{ij}\) is the scored (0 or 1) response of person \(j\) to item \(i\). From the above equations, an index of the degree to which members of a target group perform better or worse than the complete group, called the overall difference, can be easily calculated:

\[
D_{ig} = O_{ig} - P_{ig}. \qquad (8)
\]

Further, such differences calculated for a number of different levels of ability can be used to identify whether the differences are not uniform over all ability levels. This can happen, for example, when an item with a small overall difference has a large positive difference at one ability value and a large negative difference at another ability value.

The item parameters for the OE items were obtained using the two-parameter partial credit (2PPC) model (Yen, 1993). The 2PPC model is a special case of Bock's (1972) nominal model and is equivalent to Muraki's (1992) generalized partial credit model. Similar to the generalized partial credit model, in the 2PPC model items can vary in their discriminations, and each item has location parameters, one less than the number of score levels. The nominal model states that the probability of an examinee with ability \(\theta\) having a score at the \(k\)th level of the \(j\)th item is

\[
P_{jk}(\theta) = P(X_j = k - 1 \mid \theta) = \frac{\exp Z_{jk}}{\sum_{l=1}^{m_j} \exp Z_{jl}}, \quad k = 1, \ldots, m_j, \qquad (9)
\]

where

\[
Z_{jk} = A_{jk}\,\theta + C_{jk}. \qquad (10)
\]

For the special case of the 2PPC model used here, the following constraints were used:

\[
A_{jk} = \alpha_j (k - 1), \qquad (11)
\]

and

\[
C_{jk} = -\sum_{i=0}^{k-1} \gamma_{ji}, \qquad (12)
\]

where \(\gamma_{j0} = 0\), and \(\alpha_j\) and \(\gamma_{ji}\) are the free parameters to be estimated from the data. The first constraint implies that items can vary in their discriminations and that higher item scores reflect higher ability levels. In the 2PPC model, for each item there are \(m_j - 1\) independent \(\gamma_{ji}\) difficulty parameters and one \(\alpha_j\) discrimination parameter; a total of \(m_j\) independent item parameters are estimated.
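The observed-minus-expected logic of Equations 5 through 8 can be expressed compactly. The following sketch is my own illustration (function and variable names are assumptions, and PARDUX's actual implementation additionally forms Z-statistics from these differences); it computes the overall difference \(D_{ig}\) and its decile-level analogues for one dichotomous item.

```python
import numpy as np

def linn_harnisch_difference(u, p_hat, group_mask):
    """Observed-minus-expected proportion correct for one item (Eqs. 5, 7, 8).

    u          : 0/1 scored responses for one item, all examinees
    p_hat      : model-predicted probabilities P_ij from combined-group
                 item parameters and each examinee's theta estimate
    group_mask : boolean array marking members of the target group g
    """
    p_ig = p_hat[group_mask].mean()  # expected proportion correct (Eq. 5)
    o_ig = u[group_mask].mean()      # observed proportion correct (Eq. 7)
    return o_ig - p_ig               # overall difference D_ig (Eq. 8)

def decile_differences(u, p_hat, theta, group_mask, n_bins=10):
    """D_ig within ability deciles of the target group, to reveal differences
    that are not uniform over ability levels."""
    th_g = theta[group_mask]
    edges = np.quantile(th_g, np.linspace(0.0, 1.0, n_bins + 1))
    bins = np.digitize(th_g, edges[1:-1])  # decile index 0..n_bins-1
    return np.array([u[group_mask][bins == d].mean()
                     - p_hat[group_mask][bins == d].mean()
                     for d in range(n_bins)])
```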
To summarize, the LH-based PARDUX software:

computes the observed and expected mean response (expected and observed p-values) and the difference between them (observed minus predicted, p-diff) for each item by deciles of the specified group. The expected values are computed using the parameter estimates obtained from the entire sample, and the theta estimates (ability estimates) for the members of the specified subgroup. Based on the difference between expected and observed p-values, a Z-statistic is calculated for each decile and an average Z-statistic for the item is computed for identifying degree of DIF. (Ercikan, Gierl, McCreith, Puhan, & Koh, in press, p. 8)

The LH method uses both statistical significance and effect size in categorizing DIF items. The Level 1 degree of DIF (free of DIF) includes items with |Z| < 2.58; the Level 2 degree of DIF (moderate differences) includes items where the absolute value of the expected mean difference is < 0.10 and |Z| >= 2.58; and the Level 3 degree of DIF (relatively large differences) includes items where the absolute value of the expected mean difference is >= 0.10 and |Z| >= 2.58 (Ercikan & McCreith, 2002).

For items having Level 2 or 3 DIF, the interpretation is that the parameters for these items are not invariant across the two groups and that the model obtained for the total group will not fit the target/minority group. A negative difference implies the item favours the reference group, in this case the non-aboriginal students. Conversely, a positive difference implies the item favours the target/minority group, in this case the aboriginal students.

Logistic Regression Method

The Logistic Regression (LR) method (Swaminathan & Rogers, 1990) of DIF detection is based on a model for predicting the probability of a correct response to an item using the standard logistic regression model for predicting a dichotomous dependent variable from given independent variables,

\[
P(u = 1 \mid \theta) = \frac{e^{\beta_0 + \beta_1 \theta}}{1 + e^{\beta_0 + \beta_1 \theta}}, \qquad (13)
\]

where \(u\) is the response to the item, \(\theta\) is the observed ability of an individual, \(\beta_0\) is the intercept parameter, and \(\beta_1\) is the slope parameter.

For the LR method, separate logistic regression curves are calculated for the target group and for the population minus the target group. DIF is considered to be present only if the logistic regression curves for the two groups are not the same. If the regression curves differ in either a uniform or non-uniform manner, DIF will be considered present in the respective manner. The regression model formula can be reformulated to include group membership for DIF testing:

\[
P(u = 1) = \frac{e^{z}}{1 + e^{z}}, \qquad (14)
\]

where

\[
z = \tau_0 + \tau_1 \theta + \tau_2 g + \tau_3 (\theta g). \qquad (15)
\]

In this formulation, \(g\) represents group membership, \(\tau_2\) corresponds to the group difference, and \(\tau_3\) corresponds to the interaction between group and ability. If \(\tau_2\) is nonzero while \(\tau_3\) is zero, uniform DIF is inferred. If \(\tau_3\) is nonzero, whether or not \(\tau_2\) is zero, we can infer nonuniform DIF. The null hypotheses are \(\tau_2 = 0\) and \(\tau_3 = 0\), tested against the alternatives that \(\tau_2 \neq 0\) and \(\tau_3 \neq 0\).

The Chi-square model-fit statistic is the test of significance associated with the LR method, and it is applied differently for uniform and non-uniform DIF. When testing for the presence of uniform DIF, the Chi-square statistic is examined for a statistical difference when a term for group membership is added. When testing for the presence of non-uniform DIF, the Chi-square statistic is examined for a statistical difference when a group-by-ability interaction term is added (as cited in Ercikan, 2003). It follows that

\[
\chi^2 = \hat{\tau}' C' \left( C \hat{\Sigma} C' \right)^{-1} C \hat{\tau}, \qquad (16)
\]

where \(C\) is the contrast matrix selecting \(\tau_2\) and \(\tau_3\) and \(\hat{\Sigma}\) is the estimated covariance matrix of the parameter estimates; this statistic has the \(\chi^2\) distribution with 2 degrees of freedom. When the calculated value of \(\chi^2\) exceeds \(\chi^2_{\alpha,2}\), DIF is found to exist; when the calculated value of \(\chi^2\) does not exceed \(\chi^2_{\alpha,2}\), no DIF is identified.
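The LR tests can equivalently be carried out by comparing nested logistic models with likelihood-ratio chi-square statistics (each on one degree of freedom), rather than through the joint two-degree-of-freedom form of Equation 16. The sketch below takes that nested-model route; it is an illustration under my own naming assumptions, not the implementation used in this study.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

def lr_dif(u, theta, group):
    """Likelihood-ratio tests for uniform and nonuniform DIF (Eqs. 13-15).

    u     : 0/1 item responses
    theta : observed ability estimates
    group : 0/1 group membership indicator
    Fits nested logistic models (ability; + group; + group-by-ability)
    and returns p-values for the uniform and nonuniform DIF tests.
    """
    base = sm.Logit(u, sm.add_constant(np.column_stack([theta]))).fit(disp=0)
    grp = sm.Logit(u, sm.add_constant(np.column_stack([theta, group]))).fit(disp=0)
    full = sm.Logit(u, sm.add_constant(
        np.column_stack([theta, group, theta * group]))).fit(disp=0)
    uniform_p = chi2.sf(2.0 * (grp.llf - base.llf), df=1)     # adding tau_2
    nonuniform_p = chi2.sf(2.0 * (full.llf - grp.llf), df=1)  # adding tau_3
    return uniform_p, nonuniform_p
```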
The results from both DIF detection methods were used for deciding on the DIF status of the items. Those items that were identified as DIF by both methods were considered to be truly functioning differently for the two groups.

CHAPTER FOUR: RESULTS

The results presented in this chapter are directly related to the research questions and the analysis procedures introduced in the earlier chapters. This chapter provides a detailed description of the participants, their scores on the FSA, and the results of the data analysis used to answer the research questions. In terms of a research question, the unitary concept of construct validation was applied to the data in an effort to explore the degree to which the FSA scores are comparable across the aboriginal and non-aboriginal populations. Implied in the above exploration is finding out whether the FSA measured the same constructs for both populations and whether or not score interpretations should be the same for both populations.

Participants

All students who completed the English version of the FSA assessment in May of 2001 were included in the study. Students were divided into two populations: the aboriginal population and the non-aboriginal population. Once the population was split into aboriginal and non-aboriginal populations, the percentage of males and females for each population for each assessment was near a 50:50 split in all cases. Table 5 shows these percentages for each case. For example, for the aboriginal students who wrote the Grade 4 Numeracy assessment, 50 percent were female and 50 percent were male.

Table 5
Gender Percentages for each Assessment and Population

                            Aboriginal              Non-Aboriginal
Assessment               % Females  % Males       % Females  % Males
Grade 4 Numeracy FSA         50        50             49        51
Grade 7 Numeracy FSA         51        49             49        51
Grade 4 Reading FSA          50        50             49        51
Grade 7 Reading FSA          51        49             50        50

Table 6 shows the number of students, mean scores, and standard deviations based on FSA scores for aboriginal and non-aboriginal students. For example, 3,339 aboriginal Grade 4 students wrote the Numeracy FSA; they had a mean score of 15.21 (out of 32) and a standard deviation of 5.99 on the MC items. The aboriginal students' mean scores for the Grade 4 Numeracy MC items are approximately 3.90 points lower than those of the corresponding non-aboriginal students (based on 42,547 students), with a difference in the population-based standard deviation of 0.14.

Table 6
Mean Scores and Standard Deviations for the FSA Scores

                       Aboriginal                           Non-Aboriginal
Assessment      N      MC M(SD)       OE M(SD)       N      MC M(SD)       OE M(SD)
Numeracy 4     3339   15.21 (5.99)    7.28 (3.79)   42547  19.11 (5.85)    9.30 (3.52)
Numeracy 7     3072   13.76 (5.54)    5.44 (3.32)   44043  18.16 (6.40)    7.81 (3.53)
Reading 4      3329   22.50 (7.19)    6.70 (4.17)   42579  26.98 (5.75)    9.27 (3.86)
Reading 7      3106   23.57 (6.92)    6.00 (4.26)   44126  28.00 (6.17)    8.82 (4.06)

Note. All MC numeracy mean scores are based on scores that have a maximum value of 32. All MC reading mean scores are based on scores that have a maximum value of 35 for Grade 4 and 42 for Grade 7. All OE mean scores are based on scores that have a maximum value of 16.

The data sets did not contain any missing values when they were delivered to me; rather, a score of zero had been entered for a student's non-response. For the cases where students had a zero entered for every item, I made the assumption that those students did not respond to the test, and in an effort to clean the data, those cases were removed from the data set. Table 7 shows the counts of removed cases. For example, for the Grade 4 Numeracy assessment, 241 aboriginal students were removed. This accounted for 6.7% of the total number of cases.

Table 7
Cases Removed Based on Non-Response

                     Aboriginal                 Non-Aboriginal
Assessment       Removed   % Removed        Removed   % Removed
Numeracy 4         241        6.7             1104       2.5
Numeracy 7         217        6.9             1099       2.4
Reading 4          246        6.6             1053       2.7
Reading 7          181        5.5             1093       2.4
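As a concrete illustration of this cleaning rule, the sketch below (my own; the DataFrame layout and names are assumptions, not a description of the original data handling) removes examinees whose entire item-response vector was recorded as zero.

```python
import pandas as pd

def drop_nonrespondents(scores: pd.DataFrame) -> pd.DataFrame:
    """Remove examinees with a zero entered for every item.

    scores: one row per examinee, one column per item; zeros encode both
    incorrect answers and non-response in the delivered data, so only
    all-zero rows are treated as non-response.
    """
    responded = (scores != 0).any(axis=1)  # keep rows with at least one nonzero score
    return scores.loc[responded]
```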
Factor Analysis

The purpose of the factor analysis procedure was to examine the relationships among the test items making up the assessments. When factor analyses are conducted separately for groups, differences in these relationships provide information regarding the degree to which similar constructs are being assessed for the groups. For this study, these relationships were examined across the aboriginal and non-aboriginal populations for similarities and differences in the number of factors and their respective compositions as shown by the factor loadings. This exploratory factor analysis identified the factor structures of the scores for each population in each grade, as well as their degree of congruence across the aboriginal and non-aboriginal populations. The extraction and rotation of factors was used to identify the factor structures of the scores for each population in each grade.

When examining the factorial results that follow, the reader should keep in mind that the primary purpose of the present study was not to explore the dimensionality of the FSAs but rather to explore the comparability of the aboriginal and non-aboriginal populations with regard to score performance on the FSAs. Further, my interpretations of the factors were subjective to some extent, as is always the case with common factor analysis, and may not match those of all readers. Some of the factors are easily understood, but other factors presented a greater interpretive challenge (Hakstian, Farrell, & Tweed, 2002).

Selection of the Number of Factors to be Extracted

For this common factor analysis, the number of factors to be extracted for each population was based on three values: the scree-plot count (see Appendixes E-L), the number of eigenvalues greater than one (see Appendixes M-N), and the number of factors that led to a fit via the Maximum Likelihood Ratio Test (see Appendixes O-P). In a principal components analysis (PCA), consideration of the percent of variance explained would be one of the criteria for the selection of the number of factors to be extracted because the goal of a PCA "is to extract maximum variance from the data set with each component" (Tabachnick & Fidell, 1996, p. 664).
But for the common factor analysis, which is the analysis of this study, the goal "is to minimize squared differences between the observed and reproduced correlation matrices" (Tabachnick & Fidell, p. 665); hence, considering the percent of variance explained as a criterion for selecting the number of factors would be meaningless here, because the purpose of the common factor analysis is to reduce total variance to common variance in an effort to reproduce the correlation matrix in the best possible manner.

The selection of the number of factors to extract was based on consensus among the three methods. When consensus did not occur, the value determined by at least two methods was used. If all three methods converged on different values, then the scree plot was used, as it is the least affected by arbitrary limit settings (Cattell, 1966). Scree plots were examined in an effort to identify a major change in the slope of the line that follows the descent of the eigenvalues. I examined the scree plots from left to right, stopping my count of eigenvalues when I came across a major change in the slope of the line that follows the descent of the eigenvalues. For the purpose of amplifying my ability to visually identify a major change in slope, I expanded the eigenvalue axis.

For the scores examined, Table 8 shows the number of factors that were selected for extraction for each assessment and population. For example, for the aboriginal student scores on the Grade 4 Numeracy assessment, the scree plot analysis suggested three factors, the eigenvalue analysis suggested six factors, and the maximum likelihood analysis suggested nine factors. Three factors were selected for the number of factors to be extracted. The number of factors selected for extraction for each of the matched student populations (e.g., aboriginal Grade 4 Numeracy was matched with non-aboriginal Grade 4 Numeracy) was the same for Grades 4 and 7 Reading and different for Grades 4 and 7 Numeracy. Hence, congruence coefficients were only calculated for each factor of the Grades 4 and 7 Reading scores.

Table 8
Selection of Number of Factors to be Extracted

                       Aboriginal                                Non-Aboriginal
             Scree  Eigenvalue  Maximum     Selection   Scree  Eigenvalue  Maximum     Selection
                                Likelihood                                 Likelihood
Num 4          3        6           9           3          4        4          17           4
Num 7          5        7          10           5          2        5          20           2
Read 4         3        5          11           3          3        6          20           3
Read 7         3        9          10           3          3        7          24           3

Note. "Num" represents "Numeracy", and "Read" represents "Reading".

Optimal Rotation

Once the number of factors to be extracted was decided upon, the optimal rotation was determined. The purpose of rotation was to shift the factors into the most parsimonious position without altering their relative values; this parsimonious position leads to a simple structure. Oblique rotations were used in the present study. These types of rotations focus on simplifying the factor structure by shifting the factors so that each item loads on the fewest number of factors. For the present study, the criterion of simple structure was met when a particular rotation revealed a maximum number of salient loadings, a minimum number of complex loadings, and a maximum number of hyperplanar coefficients. Because these were Oblimin (oblique) rotations, the counts of salient loadings, complex loadings, and hyperplanar coefficients were calculated for four different delta values that represent a spectrum of degrees of obliquity, in an effort to select the best fitting rotation.
Based on the counts related to each value of delta for each subject, grade, and population, the best value of delta was selected, as shown in Tables 9-16. For example, in Table 9, the delta value of 0.25 for the Grade 4 aboriginal numeracy factor rotation resulted in 22 salient loadings, 0 complex loadings, and a hyperplanar count of 45; this was the best result of the four values of delta. The pattern matrix corresponding to each of the best fitting rotations was selected for the examination and composition of factor loadings.

When interpreting factors, a pattern loading is considered salient if it is "sufficiently high to assume that a relationship exists between the variable and the factor" and that "the variable can aid in interpreting the factor and vice versa" (Gorsuch, 1983, p. 208). Gorsuch stated that what may be a salient value for one analysis may not be a salient value for another, and that there is "no exact way to determine salient loadings" (p. 208). Gorsuch highlighted a general approach to determining salient loadings in that "anything that would be of interest for interpretation would be significant" (p. 208) if sample sizes were large, as they are for the present study. Loadings of approximately 0.25 for the rotated factor pattern solutions appeared to define a variable's relationship with a factor when the loadings for the other factors were near zero. Hence, for the presentation of salient loadings and the discussion of the composition and meaning of the factors, the loading value of 0.25 was used.

For the Grade 4 Numeracy scores, simple structure was found with a rotation delta value of 0.25 for the aboriginal population and 0.50 for the non-aboriginal population.

Table 9
Simple Structure for Grade 4 Aboriginal Numeracy

Delta   Salient Loadings   Complex Loadings   Hyperplanar Count   Simple Structure
0.00          20                  0                  46
0.25          22                  0                  45              Selected
0.50           *                  *                   *
0.80           *                  *                   *

Note. * indicates that the rotation failed to converge in 1000 iterations.

Table 10
Simple Structure for Grade 4 Non-Aboriginal Numeracy

Delta   Salient Loadings   Complex Loadings   Hyperplanar Count   Simple Structure
0.00          13                  0                  66
0.25          14                  0                  72
0.50          26                  0                  54              Selected
0.80           *                  *                   *

Note. * indicates that the rotation failed to converge in 1000 iterations.

For the Grade 4 Reading scores, simple structure was found with a rotation delta value of 0.50 for both the aboriginal and non-aboriginal populations.

Table 11
Simple Structure for Grade 4 Aboriginal Reading

Delta   Salient Loadings   Complex Loadings   Hyperplanar Count   Simple Structure
0.00          35                  2                  42
0.25          36                  2                  43
0.50          37                  2                  42              Selected
0.80          77                 32                   6

Note. * indicates that the rotation failed to converge in 1000 iterations.

Table 12
Simple Structure for Grade 4 Non-Aboriginal Reading

Delta   Salient Loadings   Complex Loadings   Hyperplanar Count   Simple Structure
0.00          33                  1                  46
0.25          33                  1                  46
0.50          35                  2                  49              Selected
0.80           *                  *                   *

Note. * indicates that the rotation failed to converge in 1000 iterations.

For the Grade 7 Numeracy scores, simple structure was found with a rotation delta value of 0.50 for the aboriginal population and 0.80 for the non-aboriginal population.

Table 13
Simple Structure for Grade 7 Aboriginal Numeracy

Delta   Salient Loadings   Complex Loadings   Hyperplanar Count   Simple Structure
0.00          20                  0                  99
0.25          21                  0                  99
0.50          26                  3                  99              Selected
0.80           *                  *                   *

Note. * indicates that the rotation failed to converge in 1000 iterations.
Table 14
Simple Structure for Grade 7 Non-Aboriginal Numeracy

Delta   Salient Loadings   Complex Loadings   Hyperplanar Count   Simple Structure
0.00          30                  1                  14
0.25          30                  0                  14
0.50          31                  0                  16
0.80          32                  0                  14              Selected

Note. * indicates that the rotation failed to converge in 1000 iterations.

For the Grade 7 Reading scores, simple structure was found with a rotation delta value of 0.50 for both the aboriginal and non-aboriginal populations.

Table 15
Simple Structure for Grade 7 Aboriginal Reading

Delta   Salient Loadings   Complex Loadings   Hyperplanar Count   Simple Structure
0.00          30                  2                  62
0.25          30                  2                  62
0.50          32                  3                  59              Selected
0.80           *                  *                   *

Note. * indicates that the rotation failed to converge in 1000 iterations.

Table 16
Simple Structure for Grade 7 Non-Aboriginal Reading

Delta   Salient Loadings   Complex Loadings   Hyperplanar Count   Simple Structure
0.00          25                  0                  68
0.25          29                  0                  69
0.50          32                  1                  59              Selected
0.80          97                 39                  13

Note. * indicates that the rotation failed to converge in 1000 iterations.

Factor Patterns and Correlations

Once the number of factors to be extracted and the best fitting rotation were determined, the factor analysis procedure was performed on each of the eight assessment categories: the aboriginal and non-aboriginal populations for each of the four assessments. Executing this procedure led to the production of eight factor pattern matrices, which are presented in complete form in Appendixes Q-X and in summary form in Tables 17-24. The summary tables include only the items with salient loadings, which were matched with item details such as number-of-words per item, context, and sub-content area. The pattern loadings taken from the pattern matrices "may be interpreted as measures of the unique contribution each factor makes to the variance of the variables" (Rummel, 1970, p. 397). In other words, these pattern loadings indicate the degree of dependence of the variables on the different factors (Rummel). These loadings allow for the determination of the clusters of variables defined by the resulting oblique factors.

The summary Tables 17-24 present the item number, number-of-words per item, context, sub-content area, and pattern loading for each item that had a salient loading with a factor. Pairs of tables matched across the two populations were examined separately as well as in comparison to one another to identify similarities and differences in the patterns of the salient loadings. Table 17, the summary table for the Grade 4 Numeracy scores for the aboriginal population, showed three factors. Two of the factors were dominant and appeared to split by either the context of items or the order of items. This means that the two factors were either associated with being set in one of two contexts, or with being set in the first half or the second half of the test. The corresponding (paired) table, Table 18, the summary table for the non-aboriginal Grade 4 Numeracy scores, showed four factors. Upon visual examination, there did not appear to be an obvious pattern in these salient loadings, and they did not appear to be very similar to the corresponding aboriginal group's pattern of loadings.

Eight salient loading summary tables (Tables 17-24) are presented next, followed by a discussion of the similarities and differences of the salient loading patterns for the paired populations. Further, in cases where there was the same number of factors for each paired population, congruence coefficients were calculated for each factor and are presented in the text.
Following the salient loading summary tables are a presentation (Tables 25-32) and discussion of the eight corresponding factor correlation matrices, which indicate the strength of the relationships between factors.

Table 17
Salient Loadings and Item Detail for Aboriginal Grade 4 Numeracy Scores

Item #   # of Words   Context        Sub-Content Area            Salient Loading(s)
1            21       Field Trip     Number                       0.28
2            26       Field Trip     Patterns & Relationships     0.25
3            31       Field Trip     Number                       0.35
4            32       Field Trip     Shape & Space                0.25
5            11       Field Trip     Shape & Space                0.44
6             7       Field Trip     Statistics & Probability     0.35
7            11       Field Trip     Number                       0.29
8            41       Field Trip     Number                       0.32
9            25       Field Trip     Shape & Space                0.37
10           26       Field Trip     Shape & Space                0.34
11           25       Field Trip     Number                       0.45
12           25       Field Trip     Number                       0.38
13            9       Field Trip     Statistics & Probability     0.33
14           37       Field Trip     Patterns & Relationships     0.36
15           20       Field Trip     Patterns & Relationships     0.31
16           26       Field Trip     Shape & Space                0.28
19           17       Activity Day   Number                       0.12
20           29       Activity Day   Patterns & Relationships     0.44
21           18       Activity Day   Shape & Space                0.29
22           25       Activity Day   Patterns & Relationships     0.50
23           19       Activity Day   Patterns & Relationships     0.44
24           26       Activity Day   Shape & Space                0.29
25           18       Activity Day   Number                       0.44
26           19       Activity Day   Number                       0.30
28           26       Activity Day   Number                       0.38
30           25       Activity Day   Statistics & Probability    -0.43
31           19       Activity Day   Statistics & Probability    -0.40
32           20       Activity Day   Statistics & Probability     0.33
34           26       Activity Day   Number                       0.31

Table 18
Salient Loadings and Item Detail for Non-Aboriginal Grade 4 Numeracy Scores

Item #   # of Words   Context        Sub-Content Area            Salient Loading(s)
1            21       Field Trip     Number                       0.34
3            31       Field Trip     Number                       0.27
4            32       Field Trip     Shape & Space                0.34
5            11       Field Trip     Shape & Space                0.48
7            11       Field Trip     Number                       0.30
8            41       Field Trip     Number                       0.28
9            25       Field Trip     Shape & Space                0.29
10           26       Field Trip     Shape & Space                0.34
11           25       Field Trip     Number                       0.40
12           25       Field Trip     Number                      -0.28
13            9       Field Trip     Statistics & Probability    -0.29
14           37       Field Trip     Patterns & Relationships     0.75
15           20       Field Trip     Patterns & Relationships     0.37
16           26       Field Trip     Shape & Space                0.36
19           17       Activity Day   Number                      -0.26
20           29       Activity Day   Patterns & Relationships     0.25
21           18       Activity Day   Shape & Space               -0.44
22           25       Activity Day   Patterns & Relationships    -0.32
23           19       Activity Day   Patterns & Relationships    -0.32
24           26       Activity Day   Shape & Space               -0.56
25           18       Activity Day   Number                      -0.69
26           19       Activity Day   Number                       0.82
27           19       Activity Day   Number                       0.53
30           25       Activity Day   Statistics & Probability     0.78
31           19       Activity Day   Statistics & Probability     0.67
32           20       Activity Day   Statistics & Probability     0.35
34           26       Activity Day   Number                       0.25

Table 19
Salient Loadings and Item Detail for Aboriginal Grade 7 Numeracy Scores

Item #   # of Words   Context    Sub-Content Area            Salient Loading(s)
1            36       Ski Trip   Number                      -0.25   0.38   0.39
3            37       Ski Trip   Number
4            25       Ski Trip   Number                       0.31  -0.47
5            32       Ski Trip   Number                      -1.03
6            19       Ski Trip   Number
7            37       Ski Trip   Number                       0.30
8            28       Ski Trip   Patterns & Relations         0.52  -0.25   0.47
9            25       Ski Trip   Statistics & Probability     0.40
10           31       Ski Trip   Patterns & Relations         0.43
11           20       Ski Trip   Shape & Space
12           15       Ski Trip   Shape & Space
13           17       Ski Trip   Shape & Space
15           44       Ski Trip   Statistics & Probability     0.33
19           27       School Fun Fair   Number                       0.40
20           35       School Fun Fair   Shape & Space                0.52
21           27       School Fun Fair   Statistics & Probability     0.59
24           26       School Fun Fair   Patterns & Relations
26           19       School Fun Fair   Patterns & Relations         0.39
27           30       School Fun Fair   Patterns & Relations         0.50
30           34       School Fun Fair   Number
31           23       School Fun Fair   Shape & Space                0.42
32           25       School Fun Fair   Shape & Space                0.31
33           28       School Fun Fair   Statistics & Probability     0.38
Unplaced loadings (as printed): 0.29   0.29   0.31   0.26   0.28

Table 20
Salient Loadings and Item Detail for Non-Aboriginal Grade 7 Numeracy Scores

Item #   # of Words   Context           Sub-Content Area            Salient Loading(s)
1            36       Ski Trip          Number                       0.53
2            34       Ski Trip          Statistics & Probability     0.40   0.07
3            37       Ski Trip          Number                       0.48  -0.13
4            25       Ski Trip          Number                       0.36  -0.02
5            32       Ski Trip          Number                       0.43
6            19       Ski Trip          Number
7            37       Ski Trip          Number                       0.32
8            28       Ski Trip          Patterns & Relations         0.47
9            25       Ski Trip          Statistics & Probability
10           31       Ski Trip          Patterns & Relations         0.47
11           20       Ski Trip          Shape & Space                0.51
12           15       Ski Trip          Shape & Space                0.31
13           17       Ski Trip          Shape & Space                0.31
14           16       Ski Trip          Shape & Space
15           44       Ski Trip          Statistics & Probability
16           33       Ski Trip          Number
19           27       School Fun Fair   Number                       0.38
20           35       School Fun Fair   Shape & Space                0.36
21           27       School Fun Fair   Statistics & Probability
Unplaced loadings (as printed): 0.54   0.28   0.31   0.34   0.26   0.52   0.35
22           35       School Fun Fair   Number
23           17       School Fun Fair   Number
24           26       School Fun Fair   Patterns & Relations         0.40
25           32       School Fun Fair   Number                       0.25
26           19       School Fun Fair   Patterns & Relations         0.25
27           30       School Fun Fair   Patterns & Relations         0.43
28           20       School Fun Fair   Shape & Space                0.41
29           19       School Fun Fair   Number                       0.48
30           34       School Fun Fair   Number                       0.48
31           23       School Fun Fair   Shape & Space
32           25       School Fun Fair   Shape & Space                0.44   0.33
33           28       School Fun Fair   Statistics & Probability
34           46       School Fun Fair   Number                       0.28
Unplaced loadings (as printed): 0.29   0.25

Table 21
Salient Loadings and Item Detail for Aboriginal Grade 4 Reading Scores

Item #   # of Words   Context         Sub-Content Area   Salient Loading(s)
1             6       Crime Solving   Identify           -0.46
2             7       Crime Solving   Locate             -0.52
3             3       Crime Solving   Locate             -0.63
4            27       Crime Solving   Locate             -0.35
5             8       Rain            Identify           -0.34
6             6       Rain            Locate             -0.49
7             7       Rain            Locate             -0.42
8            12       Rain            Locate             -0.47
10            9       House Pets      Locate             -0.66
11            8       House Pets      Locate             -0.65
12            4       House Pets      Locate             -0.77
13            9       House Pets      Locate             -0.58
14            7       House Pets      Critical           -0.61
15            8       Polar Bears     Identify           -0.41
16            8       Polar Bears     Locate             -0.44
17           11       Polar Bears     Locate             -0.31
18           16       Polar Bears     Locate              0.39  -0.31
20            5       Frogs & Toads   Locate
21            7       Frogs & Toads   Locate              0.39
22            9       Frogs & Toads   Locate              0.33
24           10       Frogs & Toads   Critical            0.50
25            6       Rabbits         Locate              0.33
26           19       Rabbits         Locate              0.50
27           11       Rabbits         Locate              0.71
28           12       Rabbits         Critical            0.54
30            6       Memory          Identify            0.49
31            8       Memory          Critical            0.36
32            6       Memory          Locate              0.42
33           12       Memory          Locate              0.44
34            8       Memory          Locate              0.63
35            8       Tree Growth     Identify            0.63
36           10       Tree Growth     Locate              0.36
37            9       Tree Growth     Locate              0.27
38           11       Tree Growth     Critical            0.42
Unplaced loadings (as printed): -0.25   0.34

Note. The sub-content areas of identify, locate, and critical represent identify and interpret key concepts and main ideas; locate, interpret, and organize details; and critical analysis, respectively.
Table 22
Salient Loadings and Item Detail for Non-Aboriginal Grade 4 Reading Scores

Item #   # of Words   Context         Sub-Content Area   Salient Loading(s)
1             6       Crime Solving   Identify           -0.03  -0.49
2             7       Crime Solving   Locate             -0.40
3             3       Crime Solving   Locate             -0.40
4            27       Crime Solving   Locate             -0.39
5             8       Rain            Identify           -0.36
6             6       Rain            Locate             -0.56
7             7       Rain            Locate             -0.59
8            12       Rain            Locate             -0.51
10            9       House Pets      Locate             -0.29   0.45
11            8       House Pets      Locate             -0.36   0.33
12            4       House Pets      Locate             -0.33   0.59
13            9       House Pets      Locate             -0.36
14            7       House Pets      Critical           -0.45
15            8       Polar Bears     Identify           -0.38
16            8       Polar Bears     Locate             -0.47
17           11       Polar Bears     Locate             -0.36
18           16       Polar Bears     Locate             -0.44
20            5       Frogs & Toads   Locate              0.26
21            7       Frogs & Toads   Locate              0.32
22            9       Frogs & Toads   Locate              0.29
24           10       Frogs & Toads   Critical            0.45
25            6       Rabbits         Locate              0.31
26           19       Rabbits         Locate              0.44
27           11       Rabbits         Locate              0.55
28           12       Rabbits         Critical            0.53
30            6       Memory          Identify            0.35
31            8       Memory          Critical            0.26
32            6       Memory          Locate              0.32
33           12       Memory          Locate              0.38
34            8       Memory          Locate              0.53
35            8       Tree Growth     Identify            0.55
36           10       Tree Growth     Locate             -0.34
38           11       Tree Growth     Critical            0.37

Note. The sub-content areas of identify, locate, and critical represent identify and interpret key concepts and main ideas; locate, interpret, and organize details; and critical analysis, respectively.

Table 23
Salient Loadings and Item Detail for Aboriginal Grade 7 Reading Scores

Item #   # of Words   Context         Sub-Content Area   Salient Loading(s)
1            15       Baseball Game   Critical            0.27
3            21       Baseball Game   Locate              0.36
4            10       Baseball Game   Critical            0.34
5            10       Baseball Game   Locate              0.25
12           12       Ponies          Critical            0.45
13           15       Ponies          Locate              0.26
14           14       Ponies          Critical            0.31
17            5       Frogs & Toads   Locate             -0.68
18            7       Frogs & Toads   Locate             -0.84
19            9       Frogs & Toads   Locate             -0.55
20           12       Frogs & Toads   Locate             -0.29
21           10       Frogs & Toads   Critical           -0.82
22            8       Goldfish        Locate              0.54
23            8       Goldfish        Locate              0.59
24           11       Goldfish        Critical            0.25
25           10       Goldfish        Critical            0.49
27            9       Goldfish        Critical            0.32
28            9       Willow Tree     Locate              0.27
30           17       Willow Tree     Locate
31           13       Willow Tree     Locate              0.28
33           12       Willow Tree     Critical            0.40
35            5       Snakes          Identify            0.50
36            8       Snakes          Locate              0.28
37            5       Snakes          Critical            0.66
38            9       Snakes          Locate              0.33
39            6       Snakes          Locate              0.73
40            6       Snakes          Locate              0.47
41            9       Egypt           Locate              0.34
42           16       Egypt           Critical            0.27
43           13       Egypt           Locate              0.28
Unplaced loadings (as printed): 0.35   0.28   -0.26   0.28

Note. The sub-content areas of identify, locate, and critical represent identify and interpret key concepts and main ideas; locate, interpret, and organize details; and critical analysis, respectively.
Table 24

Salient Loadings and Item Detail for Non-Aboriginal Grade 7 Reading Scores

Item # | # of Words | Context | Sub-Content Area | Salient loading(s)
1 | 15 | Baseball Game | Critical | 0.33
3 | 21 | Baseball Game | Locate | 0.51
4 | 10 | Baseball Game | Critical | 0.45
8 | 13 | Whales | Critical | 0.37
9 | 15 | Whales | Locate | 0.36
12 | 12 | Ponies | Critical | 0.50
13 | 15 | Ponies | Locate | 0.36
14 | 14 | Ponies | Critical | 0.38
15 | 9 | Ponies | Locate | 0.27
17 | 5 | Frogs & Toads | Locate | -0.65
18 | 7 | Frogs & Toads | Locate | -0.92
19 | 9 | Frogs & Toads | Locate | -0.63
20 | 12 | Frogs & Toads | Locate | -0.31
21 | 10 | Frogs & Toads | Critical | -0.90
22 | 8 | Goldfish | Locate | 0.49
23 | 8 | Goldfish | Locate | 0.57
25 | 10 | Goldfish | Critical | 0.41
27 | 9 | Goldfish | Critical | 0.29
28 | 9 | Willow Tree | Locate | 0.26
29 | 11 | Willow Tree | Locate | 0.27
30 | 17 | Willow Tree | Locate | 0.39
31 | 13 | Willow Tree | Locate | 0.38, 0.39
33 | 12 | Willow Tree | Critical | 0.41
35 | 5 | Snakes | Identify | 0.45
37 | 5 | Snakes | Critical | 0.58
38 | 9 | Snakes | Locate | 0.34
39 | 6 | Snakes | Locate | 0.70
40 | 6 | Snakes | Locate | 0.42
43 | 13 | Egypt | Locate | 0.41
44 | 6 | Egypt | Locate | 0.36
45 | 6 | Egypt | Locate | 0.27

Note. The sub-content areas of identify, locate and critical represent identify and interpret key concepts and main ideas; locate, interpret and organize details; and critical analysis respectively.

Factor Loading Summary

Grade 4 Numeracy

The salient factor loadings for Grade 4 Numeracy are presented in Tables 17 and 18 for the aboriginal and non-aboriginal groups, respectively. There were three factors for the aboriginal population and four factors for the non-aboriginal population. For the aboriginal population, the first and third factors were dominant, with 15 and 12 salient loadings respectively. These two factors appeared to split either by the context of the items or by the order in which the items were presented to the examinees; that is, the two factors were associated either with being set in one of two contexts or with being set in the first or second half of the test. For the non-aboriginal population, visual inspection did not reveal any consistent patterns in the salient loadings, except for the second factor, which appeared to be related to the sub-content area of statistics and probability with three salient loadings. Further, the loadings of the two populations did not appear to be very similar to one another. Because a different number of factors was extracted for each population, no congruence coefficients could be calculated and interpreted. As will be shown later in this study, there is a low degree of local dependence amongst the items, which means that responses to the test questions are conditionally (locally) independent given the examinee's ability. This gives evidence that item order in itself was not the source of the difference in factors.

Grade 7 Numeracy

Table 19 presents the salient loadings for the aboriginal Grade 7 Numeracy factors, and Table 20 presents the salient loadings for the non-aboriginal Grade 7 Numeracy factors. For the aboriginal population, there were five factors; for the non-aboriginal population, there were two. For the aboriginal population, the first two factors were dominant, with 11 and 7 salient loadings respectively, and appeared to split by the order in which the items were presented to the examinees.
For the non-aboriginal population, both factors appeared dominant, with 18 and 17 salient loadings respectively. Visual inspection did not reveal any consistent patterns in these salient loadings. Further, the loadings of the two populations did not appear to be very similar to one another. Because a different number of factors was extracted for each population, no congruence coefficients could be calculated.

Grade 4 Reading

Table 21 presents the salient loadings for the aboriginal Grade 4 Reading factors, and Table 22 presents the salient loadings for the non-aboriginal Grade 4 Reading factors. For both populations, there were three factors: two dominant factors and one minor one. For the aboriginal population, there were 17 salient loadings on each of the first two factors; for the non-aboriginal population, there were 16 and 18 salient loadings on the first two factors respectively. For both populations, the dominant factors appeared to be related to the order of the items. There was an apparent anomaly in that Item 36 loaded on the third factor for the aboriginal population and on the second factor for the non-aboriginal population; for the non-aboriginal population, this item was the only one from the second half of the test that loaded on the second factor. In this study, Item 36 was not found to be DIF, and there was little difference in the measurement accuracy provided by this item for the two populations. Examining Item 36's sub-content area, context, and number of words did not reveal why this item had a different factor loading pattern.

Congruence coefficients gave more evidence about the similarity of the three factors across the populations. The congruence coefficient for the first factor was 0.94, meaning that there was a very high degree of comparability for this factor across the two populations. The congruence coefficient for the second factor was 0.89, meaning that there was a high degree of comparability for this factor across the two populations. The congruence coefficient for the third factor was 0.71, meaning that there was a relatively low degree of comparability for this factor across the two populations. The coefficients for Factors 1 and 2 equalled or exceeded 0.85, the criterion for factorial equivalence stated by Harpur, Hakstian, and Hare (1988).

Grade 7 Reading

Table 23 presents the salient loadings for the aboriginal Grade 7 Reading factors, and Table 24 presents the salient loadings for the non-aboriginal Grade 7 Reading factors. For both populations, there were three factors. For both populations, the first factor appeared to represent the second half of the items in terms of the order of their presentation to the examinees, with 16 and 11 salient loadings respectively; this could be related to test fatigue. For both populations, the second factor appeared to be related to one specific context, with five salient loadings matching the five items of a frogs and toads context. For the aboriginal population, there was no consistent pattern in the 12 salient loadings of the third factor. For the non-aboriginal population, however, the third factor appeared to be consistent with sets of items based on context: all 16 salient loadings on this factor are accounted for when aligned with specific contexts.
For example, three salient loadings match the three items of a baseball game context, and four salient loadings match the four items of a ponies context. As with the aboriginal population, then, the second factor was aligned with a single context (the five frogs and toads items), while the third factor differed: it showed no consistent pattern for the aboriginal population but a context-based pattern for the non-aboriginal population.

Congruence coefficients gave more evidence about the similarity of the three factors across the populations. The congruence coefficient for the first factor was 0.72, meaning that there was a relatively low degree of comparability for this factor across the two populations. The congruence coefficient for the second factor was 0.98, and that for the third factor was 0.97, meaning that there was a very high degree of comparability for these two factors across the two populations. The coefficients for Factors 2 and 3 equalled or exceeded 0.85, the stated criterion for factorial equivalence.
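The congruence coefficients reported here can be computed directly from paired columns of the two populations' factor loading matrices. A minimal sketch follows, using Tucker's coefficient of congruence; the four unambiguous frogs-and-toads loadings (Items 17-20) from Tables 23 and 24 are used only to illustrate the computation, so the printed value will not exactly reproduce the 0.98 reported above, which was computed over the full loading vectors.

```python
import numpy as np

def congruence(x: np.ndarray, y: np.ndarray) -> float:
    """Tucker's coefficient of congruence between two factor loading vectors."""
    return float(np.sum(x * y) / np.sqrt(np.sum(x ** 2) * np.sum(y ** 2)))

# Second-factor loadings for Items 17-20 (frogs and toads context):
# aboriginal values from Table 23, non-aboriginal values from Table 24.
aboriginal = np.array([-0.68, -0.84, -0.55, -0.29])
non_aboriginal = np.array([-0.65, -0.92, -0.63, -0.31])

phi = congruence(aboriginal, non_aboriginal)
print(f"congruence = {phi:.2f}")  # coefficients >= 0.85 were taken as factorial equivalence
```

On this four-item subset the coefficient is close to 1.00; including the remaining items' near-zero loadings pulls the study's full-vector value down to the reported 0.98.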
Factor Correlation Matrices

Correlations amongst factors are found through oblique rotation; oblique rotation does not require that the factors be correlated, but it does allow them to be. Each of the eight factor analyses performed in this study produced a factor correlation matrix. These correlations indicate how the factors are related to each other for each test and population, and therefore provide evidence regarding the degree to which the factor structures are similar for the two groups. The correlation matrices for each test and population are presented in Tables 25-32.

Table 25 is the factor correlation matrix for the aboriginal Grade 4 Numeracy analysis. From this table, it can be seen that there are weak negative relationships between Factors 1 and 2 and between Factors 2 and 3; this could be explained by the fact that there were very few salient loadings on Factor 2, and therefore few strong loadings for those of Factors 1 and 3 to correlate with. The same reasoning partly explains the high correlation between Factors 1 and 3, because there were many salient loadings on each of those factors. The only two items that loaded on the second factor shared the statistics and probability sub-content area, but they were not the only items from that sub-content area, so this alone does not explain the factor.

Table 25

Factor Correlation Matrix for Aboriginal Grade 4 Numeracy

Factor | 1 | 2 | 3
1 | 1.00 | -0.27 | 0.74
2 | | 1.00 | -0.31
3 | | | 1.00

Table 26 is the factor correlation matrix for the non-aboriginal Grade 4 Numeracy analysis. From this table, it can be seen that there are strong relationships among all four factors.

Table 26

Factor Correlation Matrix for Non-Aboriginal Grade 4 Numeracy

Factor | 1 | 2 | 3 | 4
1 | 1.00 | 0.81 | 0.85 | -0.87
2 | | 1.00 | 0.79 | -0.79
3 | | | 1.00 | -0.85
4 | | | | 1.00

Table 27 is the factor correlation matrix for the aboriginal Grade 7 Numeracy analysis. From this table, it can be seen that there are weak relationships between Factor 4 and Factors 1, 2, and 3 and between Factor 5 and Factors 1 and 3; essentially no relationship between Factors 4 and 5; a moderately strong relationship between Factors 2 and 5; and moderately strong (partly negative) relationships among Factors 1, 2, and 3.

Table 27

Factor Correlation Matrix for Aboriginal Grade 7 Numeracy

Factor | 1 | 2 | 3 | 4 | 5
1 | 1.00 | 0.61 | -0.69 | 0.10 | 0.27
2 | | 1.00 | -0.65 | 0.09 | 0.57
3 | | | 1.00 | -0.18 | -0.36
4 | | | | 1.00 | 0.00
5 | | | | | 1.00

Table 28 is the factor correlation matrix for the non-aboriginal Grade 7 Numeracy analysis. From this table, it can be seen that there is a strong relationship between the two factors.

Table 28

Factor Correlation Matrix for Non-Aboriginal Grade 7 Numeracy

Factor | 1 | 2
1 | 1.00 | 0.82
2 | | 1.00

Table 29 is the factor correlation matrix for the aboriginal Grade 4 Reading analysis. From this table, it can be seen that there are weak relationships between Factors 1 and 3 and between Factors 2 and 3, and a strong negative relationship between Factors 1 and 2.

Table 29

Factor Correlation Matrix for Aboriginal Grade 4 Reading

Factor | 1 | 2 | 3
1 | 1.00 | -0.82 | -0.13
2 | | 1.00 | 0.10
3 | | | 1.00

Table 30 is the factor correlation matrix for the non-aboriginal Grade 4 Reading analysis. From this table, it can be seen that there is a strong negative relationship between Factors 1 and 2, and weak relationships between Factors 1 and 3 and between Factors 2 and 3.

Table 30

Factor Correlation Matrix for Non-Aboriginal Grade 4 Reading

Factor | 1 | 2 | 3
1 | 1.00 | -0.82 | 0.35
2 | | 1.00 | -0.25
3 | | | 1.00

Table 31 is the factor correlation matrix for the aboriginal Grade 7 Reading analysis. From this table, it can be seen that there are moderately strong relationships between Factors 1 and 2, between Factors 1 and 3, and between Factors 2 and 3.

Table 31

Factor Correlation Matrix for Aboriginal Grade 7 Reading

Factor | 1 | 2 | 3
1 | 1.00 | -0.61 | 0.58
2 | | 1.00 | -0.53
3 | | | 1.00

Finally, Table 32 is the factor correlation matrix for the non-aboriginal Grade 7 Reading analysis. From this table, it can be seen that there are moderately strong relationships amongst all three factors.

Table 32

Factor Correlation Matrix for Non-Aboriginal Grade 7 Reading

Factor | 1 | 2 | 3
1 | 1.00 | 0.69 | 0.70
2 | | 1.00 | -0.73
3 | | | 1.00

The inter-factor correlations presented in Tables 25-32 suggest that the relationships among, and therefore the meaning of, these factors differ to varying degrees across the populations for all four assessments.

Reliability

The Feldt-Raju reliability estimates presented in Table 33 indicate the degree to which individuals' scores would remain relatively consistent over repeated administrations of the same test or alternate test forms (Crocker & Algina, 1986). These estimates, calculated separately for each population, were compared in an effort to identify differences in the internal consistency of the scores for the aboriginal and non-aboriginal populations. For Grades 4 and 7 Numeracy, the FSA scores were slightly more reliable for the non-aboriginal population than for the aboriginal population; for Grades 4 and 7 Reading, the scores were slightly more reliable for the aboriginal population than for the non-aboriginal population. When each non-aboriginal reliability coefficient is subtracted from the corresponding aboriginal coefficient, the differences are -0.01, -0.06, 0.04, and 0.03, respectively. These results indicate little to no difference between the reliability estimates of the Numeracy and Reading tests for the aboriginal and non-aboriginal groups.

Table 33

Feldt-Raju Reliability Estimates

Assessment | Aboriginal N | Aboriginal Reliability | Non-Aboriginal N | Non-Aboriginal Reliability | Difference in Reliability
Numeracy 4 | 3339 | 0.81 | 42547 | 0.82 | -0.01
Numeracy 7 | 3072 | 0.78 | 44043 | 0.84 | -0.06
Reading 4 | 3329 | 0.88 | 42579 | 0.84 | 0.04
Reading 7 | 3106 | 0.83 | 44126 | 0.80 | 0.03

Note. "Difference" is calculated by subtracting the non-aboriginal Feldt-Raju reliability estimate from the corresponding aboriginal one.

IRT-Based Analyses

IRT-based analyses were conducted to compare, for the two populations, the measurement accuracy provided at the item and test levels and the parameters of the items. First, the appropriateness of the IRT models was examined by evaluating the degree to which the IRT model assumptions were met. Evaluations of the IRT model assumptions were conducted by examining fit statistics, local item dependence statistics, and unidimensionality. All of these analyses capture deviations of the response data from the model.

Evaluation of IRT Model Assumptions

Model Fit

The fit of item responses to the respective IRT models was evaluated by the Q1 statistic (Yen, 1981). The Q1 statistic compares observed and predicted trace lines and is a χ² statistic. The power of the fit statistic is affected by sample size; hence, for large sample sizes, small deviations from model predictions can be statistically significant. When the Q1 fit statistic was examined for the four sets of test scores, corresponding to the four assessments, model misfit was identified for items whose Z statistic was greater than 4.60 (Ercikan et al., 1998). For all four assessments, many of the items were identified as having poor fit. This was not surprising considering the influence of a relatively large sample size (over 40,000) on the χ² statistic. Upon visual examination of the Item Characteristic Curves (ICCs) and the corresponding observed and predicted statistics for each item identified as misfitting, I found that the items do fit their respective models. Tables 34-37 present the observed and predicted values for each item that was found to be poorly fitting. For each such item, the observed proportions of a given response were compared with the proportions predicted using the estimated thetas and item parameters, showing that the differences were minimal. For example, from Table 34, Item 36 was found to have the poorest fit according to the Q1 statistic. This item was assessed for fit to the 2PPC model with 45,856 cases. It produced a χ² statistic value of 770.21 with 35 degrees of freedom, and a Z-statistic value of 87.87, which is substantially over the threshold of 4.60 for identifying poorly fitting items. Yet for this item, the difference between the observed and predicted values is 0.0026, less than 1% of the predicted value. A summary of the worst fitting items for each assessment is presented in Table 38. The differences between the observed and predicted values were nearly zero for the poorest fitting item of each assessment, indicating that the data do fit the models well and, hence, that the Q1 fit statistic is not an accurate indicator of fit with the large sample sizes of the present study.
Table 34 Model Goodness of Fit for Grade 4 Numeracy 2  Item  Model  X Statistic  DF  1  3PL  25.42  3 5 9 11 12  3PL 3PL 3PL 3PL 3PL 3 PL 3PL 3 PL  60.96 51.88 24.57 27.13 41.58 33.14 37.38 24.54  7 7 7 7 7 7 7 7 7  47.06 39.48 38.79 46.47 70.46 39.39 606.81 214.34 140.09 770.21  7 7 7 7 7 7 35 35 35 35  13 15 24 25 26 27 30 32 33 17 18 35 36  3 PL 3 PL 3 PL 3 PL 3PL 3 PL 2PPC 2PPC 2PPC 2PPC  45856 45856 45856 45856 45856 45856 45856 45856 45856  ObservedObserved Predicted Predicted 0.0035 0.7724 0.7690 4.92 0.6095 -0.0015 14.42 0.6080 0.8754 0.0057 0.8812 11.99 0.0016 0.6772 0.6789 4.69 0.0046 0.7974 0.7928 5.38 -0.0033 0.5237 0.5270 9.24 0.0004 0.6045 0.6048 6.99 0.4605 -0.0056 0.4549 8.12 -0.0066 0.4834 0.4900 4.69 -0.0096 0.3022 0.3118 10.71  45856 45856 45856 45856 45856 45856 45856 45856  8.68 8.50 10.55 16.96 8.66 68.34 21.44 12.56  45856  87.87  Total N 45856  Z-Statistic  0.5907 0.6695 0.8706 0.6850 0.4210 0.6246 0.4447 0.5836 0.6344  0.5918 0.6696 0.8647 0.6828 0.4256 0.6229 0.4476 0.5835 0.6317  -0.0011 -0.0001 0.0059 0.0022 -0.0046 0.0017 -0.0028 0.0001 0.0026  Table 35  Model Goodness of Fit for Grade 7 Numeracy Item  Model  x  2  Statistic  DF  Total N  Z-Statistic  ObservedObserved Predicted Predicted  1  3 PL  127.61  7  47056  32.23  0.7903  0.7854  0.0049  3  3PL  38.89  7  47056  8.52  0.3383  0.3445  -0.0062  5  3 PL  36.77  7  47056  7.96  0.4863  0.4908  -0.0044  6  3PL  90.08  7  47056  22.20  0.9032  0.8953  0.0079  7  3PL  36.62  7  47056  7.92  0.4956  0.4987  -0.0031  8  3PL  63.41  7  47056  15.08  0.4552  0.4608  -0.0055  9  3PL  79.38  7  47056  19.34  0.5778  0.5779  -0.0001  10  3PL  37.45  7  47056  8.14  0.3282  0.3348  -0.0066  11  3 PL  158.30  7  47056  40.44  0.4334  0.4401  -0.0068  13  3 PL  36.32  7  47056  7.84  0.5825  0.5836  -0.0010  15  3 PL  32.68  7  47056  6.86  0.7048  0.7001  0.0047  19  3 PL  30.08  7  47056  6.17  0.6270  0.6267  0.0003  20  3 PL  24.30  7  47056  4.62  0.7756  0.7686  0.0070  21  3PL  138.91  7  47056  35.26  0.8303  0.8220  0.0083  22  3PL  75.68  7  47056  18.36  0.2997  0.3051  -0.0054  24  3PL  32.08  7  47056  6.70  0.3911  0.3971  -0.0061  7.80  0.6597  0.6594  0.0003  26  3PL  36.20  7  47056  28  3 PL  35.65  7  47056  7.66  0.4365  0.4413  -0.0048  29  3 PL  69.54  7  47056  16.71  0.4612  0.4666  -0.0055  0.4233  0.4290  -0.0056  30  3 PL  102.06  7  47056  25.41  33  3 PL  65.45  7  47056  15.62  0.4149  0.4202  -0.0054  34  3PL  56.54  7  47056  13.24  0.5121  0.5163  -0.0042  0.5699  0.0025  17  2PPC  831.12  35  47056  95.15  0.5724  18  2PPC  294.99  35  47056  31.08  0.3881  0.3911  -0.0030  35  2PPC  218.88  35  47056  21.98  0.2703  0.2741  -0.0037  36  2PPC  660.03  35  47056  74.70  0.6800  0.6744  0.0056  Table 36 Model Goodness of Fit for Grade 4 Reading x 2  Item Model 4 3 PL 5 3 PL 3PL 7 3PL 8 10 3 PL 11 3PL 12 3PL 13 3PL 18 3PL 20 3PL 21 3 PL 22 3PL 23 3 PL 27 3PL 28 3 PL 30 3PL 31 3 PL 33 3 PL 34 3PL 35 3PL 36 3PL 37 3PL 38 3PL 2PPC 9 19 2PPC 29 2PPC 39 2PPC  Statistic 34.63 37.11 34.44 25.41 56.98 27.10 29.40 51.82 48.10 8.97 65.92 16.66 165.64 79.87 39.13 252.64 35.34 36.87 25.19 27.66 102.68 91.60 29.23 315.95 1005.15 1488.69 226.58  DF  ObservedTotal N Z-Statistic Observed Predicted Predicted 0.7106 -0.0030 7.39 0.7076 7 45774 0.7454 -0.0004 0.7450 8.05 7 45774 0.0002 0.7667 0.7665 7.33 7 45774 0.0024 4.92 0.8605 0.8581 7 45774 0.8963 0.0039 13.36 0.9003 7 45774 0.8320 0.0031 5.37 0.8350 7 45774 0.9212 0.0055 5.99 0.9267 7 45774 0.9126 0.0053 11.98 0.9179 7 45774 0.7412 10.98 0.7419 
-0.0007 7 45774 0.7597 0.7605 -0.0008 0.53 7 45774 0.0112 15.75 0.8501 0.8389 7 45774 0.7274 -0.0026 0.7248 7 45774 2.58 42.40 0.5670 0.5700 -0.0030 7 45774 0.8774 0.8745 0.0029 19.48 7 45774 0.0038 0.8495 0.8457 8.59 7 45774 0.7570 0.7588 -0.0018 65.65 7 45774 7.57 0.6125 0.6169 -0.0044 7 45774 0.6708 0.6735 -0.0027 7.98 7 45774 0.8524 0.8496 0.0029 4.86 7 45774 0.8844 0.0043 5.52 0.8887 7 45774 0.5247 -0.0074 25.57 0.5173 7 45774 0.3735 0.3857 -0.0122 22.61 7 45774 5.94 0.7197 0.7201 -0.0004 7 45774 0.7636 0.0009 33.58 0.7645 35 45774 0.4233 -0.0085 0.4148 115.96 35 45774 0.4813 -0.0068 0.4745 173.75 35 45774 0.6146 -0.0022 22.90 0.6125 35 45774  T a b l e 37  Model Goodness of Fit for Grade 7 Reading x 2  Item  Model  2 3  3 PL 3 PL  Statistic 56.55 41.99  4 7 8 11  3PL 3PL 3PL 3 PL  12 14 15 18 19 21 22 23 24 25  3PL 3 PL 3 PL 3 PL 3 PL 3PL 3PL 3PL 3PL 3 PL 3PL 3 PL  26 27 28 30 31 32  DF  Total N  7 7  47227 47227  13.24 9.35  32.89 26.82 82.94 131.42  7 7 7 7  47227 47227 47227 47227  6.92 5.30 20.30 33.25  98.87 28.88 42.22  7 7 7 7 7 7 7 7 7 7 7 7  47227 47227 47227 47227 47227 47227 47227 47227 47227 47227 47227 47227 47227 47227 47227 47227  24.55 5.85 9.41 5.71 7.06 8.71 13.78 13.63 7.40 19.07 9.97 9.61 12.44 7.20 8.51 4.91  47227 47227 47227  28.37 33.41 39.59 58.54 58.01 34.68 78.35 44.29 42.97 53.54  3PL 3 PL 3 PL 3 PL 3PL 3 PL  33.96 38.85 25.37  7 7 7 7  52.93 182.36  7 7  3 PL  71.30  40  3PL 3PL  175.70 802.33  7 7  41  3PL  42 43 44 45 6  3 PL 3 PL 2PPC 2PPC  33 35 37 39  16 34 46  Z-Statistic  0.0003 -0.0031 -0.0021 -0.0031 0.0056 0.0051  0.9091  0.8559 0.9013  0.0059 0.0078 -0.0007  0.5900  0.0009 -0.0008  0.5436  0.5482  -0.0047  0.4420 0.2789 0.6544  0.4481 0.2857 0.6538 0.5184  -0.0061 -0.0068 0.0007 -0.0029  0.6182  0.0000  0.3754  -0.0055  3 PL  69.58  16.72  73.19  7 7  47227  3PL  47227  7 7 35 35  47227 47227 47227 47227  17.69 10.62 6.29 14.56 62.90  35  47227  43.35  47227  0.7018 0.5919 0.6217 0.4475 0.8284 0.8498  0.7842  11.11  35  0.7958  -0.0035 0.0010  0.6879  47227  278.92  0.0020 -0.0006 0.0014 0.0067 0.0058 -0.0011 0.0042  0.6873  7  2PPC  -0.0036 0.0031 -0.0081 -0.0028 -0.0101 -0.0047 -0.0024  0.7851 0.5892  48.57  561.27 397.66  0.0043 -0.0089  0.8618  47227  2PPC  0.8670 0.7347 0.8580 0.8957 0.9358 0.6751 0.8308 0.4868 0.7969 0.7022 0.5888 0.6196 0.4444  0.9650 0.3791 0.6473 0.8767 0.4390 0.5119 0.3075 0.5693 0.6807 0.8649 0.7354 0.8565 0.8891 0.9300 0.6763 0.8266 0.4904  0.8340 0.8550  7  46.73 30.53 156.84  0.9693 0.3703 0.6437 0.8798 0.4309 0.5091 0.2974 0.5646 0.6782  12.28 46.87 17.19 45.09 212.56  47227  ObservedObserved Predicted Predicted  29.15  0.5156 0.6182 0.3698  80  Table 38  Goodness of Fit Information for the Poorest Fitting Item for Each Test Test 4 Numeracy 7 Numeracy 4 Reading 7 Reading  2 ObservedZX Item Model Statistic D F Total N Statistic Observed Predicted Predicted 0.0026 0.6344 0.6317 87.87 770.21 35 45856 36 2PPC 0.5724 0.5699 0.0025 95.15 831.12 35 47056 17 2PPC 0.4813 -0.0068 0.4745 1488.69 35 45774 173.75 29 2PPC -0.0007 0.6873 0.6879 802.33 7 47227 212.56 40 3PL  Unidimensionality The IRT models used in the present study assumed that the underlying ability being measured was unidimensional. The assumption of unidimensionality can be satisfied i f the test data can be represented by a  "dominant component or factor"  (Hambleton, 1989, p. 150). 
Using this criterion, the factor analytic results indicated that these tests were essentially unidimensional, as represented by a dominant factor in all of the solutions (see Appendixes M & N). Specifically, for all four assessments, and for both populations, there is one relatively large eigenvalue along with many relatively small ones, indicating that each assessment has one dominant component. Hence, the structure of the underlying ability being measured was found to be essentially unidimensional for each assessment (Reckase, 1979).

Local Item Dependence (LID)

The IRT models used in this study assume that the responses to the test questions are conditionally (locally) independent given the examinee's ability. Local independence requires that any two items be uncorrelated when ability is fixed (Lord, 1980). Local item independence was evaluated by the Q3 statistic (Yen, 1984). The Q3 statistic is the correlation between performance on two items after taking into account overall test performance. When the value of the Q3 statistic is greater than 0.20 (Ercikan et al., 1998), the corresponding pair of items is identified as displaying LID. If Q3 values greater than 0.20 are found but remain relatively low and occur for a relatively small number of items, the effect of LID on applying the IRT models is expected to be minimal.

For the Grade 4 Numeracy assessment, one item pair out of a possible 648 was found to display LID; for the Grade 7 Numeracy assessment, none of the 648 possible item pairs was found to display LID; for the Grade 4 Reading assessment, two item pairs out of a possible 760 were found to display LID; and for the Grade 7 Reading assessment, five item pairs out of a possible 1058 were found to display LID. Table 39 presents a summary of the item pairs found to display LID for the four FSAs. For example, from Table 39, for the Grade 4 Numeracy FSA there was one item pair identified as displaying LID; the correlation between these items is 0.20 when the abilities influencing test performance are held constant (Hambleton et al., 1991).

Table 39

FSA Item Pairs Displaying Local Item Dependence

Assessment | Item Pair | |Q3| Value
Grade 4 Numeracy | 35 & 36 | 0.200
Grade 7 Numeracy | (none) |
Grade 4 Reading | 9 & 11 | 0.253
Grade 4 Reading | 10 & 11 | 0.207
Grade 7 Reading | 15 & 16 | 0.266
Grade 7 Reading | 16 & 17 | 0.224
Grade 7 Reading | 15 & 19 | 0.230
Grade 7 Reading | 16 & 19 | 0.359
Grade 7 Reading | 17 & 19 | 0.244

Because of the small number of item pairs displaying LID and the relatively small values of the corresponding Q3 statistics, the effect of LID on the four FSA score sets when applying the IRT models is low.

Item Information Functions

Item information function values were computed from the item parameter estimates using the FLUX software (Burket, 1993). The item information functions indicate the degree of measurement accuracy provided by the test items at different ability levels. For each item, the maximum value of the item information function (information) and the area under the item information function (area), an indicator of the total measurement accuracy provided by the test item, were calculated for each population. To determine whether the measurement accuracy provided by the test items was comparable for the aboriginal and non-aboriginal populations, the difference in the area values was calculated for each item. In order to obtain comparable item information functions across the two populations, the scales were linked using the Stocking and Lord (1983) equating procedure. This method solves for the linear transformation that minimizes the squared differences between the test characteristic curves from two separate calibrations for a given ability level. This equating method does not affect the relative values of the item parameters to one another and, therefore, does not affect the definition of the scale or trait being estimated.
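A hedged sketch of the Stocking and Lord step just described: find the linear transformation θ* = Aθ + B that minimizes the squared difference between the two calibrations' test characteristic curves, computed here over a grid of ability points. The 3PL parameter values, the ten linking items, and the quadrature grid are illustrative assumptions; the built-in constants were chosen only to mirror the Numeracy 4 row of Table 40, and the optimizer should recover them.

```python
import numpy as np
from scipy.optimize import minimize

def p_3pl(theta, a, b, c):
    """3PL item response function (with the usual 1.7 scaling constant)."""
    return c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta - b)))

# Ten hypothetical linking items calibrated on the base scale.
rng = np.random.default_rng(3)
a1 = rng.uniform(0.6, 1.6, 10)
b1 = rng.normal(0.0, 1.0, 10)
c1 = np.full(10, 0.2)

# The same items from a second, separate calibration whose scale differs by
# theta* = A*theta + B (constants mirror Table 40's Numeracy 4 row).
A_true, B_true = 1.08, -0.70
a2, b2, c2 = a1 / A_true, A_true * b1 + B_true, c1

grid = np.linspace(-4.0, 4.0, 41)                   # ability quadrature points
tcc_base = p_3pl(grid[:, None], a1, b1, c1).sum(axis=1)

def loss(AB):
    A, B = AB
    # Map the second calibration's parameters back to the base scale: a*A, (b - B)/A.
    tcc_linked = p_3pl(grid[:, None], a2 * A, (b2 - B) / A, c2).sum(axis=1)
    return float(np.sum((tcc_base - tcc_linked) ** 2))

A_hat, B_hat = minimize(loss, x0=[1.0, 0.0], method="Nelder-Mead").x
print(f"A = {A_hat:.2f}, B = {B_hat:.2f}")          # expect about 1.08 and -0.70
```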
The linking of the scales for the aboriginal and non-aboriginal groups was based on a set of 10 test items that were considered to be comparable for the two groups and were DIF free. Table 40 presents the transformation values used in the linking procedure.

Table 40

Stocking and Lord Transformation Values

Test | Multiplicative Constant | Additive Constant
Numeracy 4 | 1.08 | -0.70
Numeracy 7 | 0.90 | -0.75
Reading 4 | 1.16 | -0.73
Reading 7 | 1.09 | -0.70

The transformed (and therefore comparable) information function values are shown in Tables 41-44. Table 41 shows that for Item 1 of the Grade 4 Numeracy assessment, the area under the item information function is 0.0076 for the aboriginal scores and 0.0073 for the non-aboriginal scores, and the height of the item information function at the location of maximum utility is 0.0300 for the aboriginal scores and 0.0270 for the non-aboriginal scores. For this item, the difference in area across the aboriginal and non-aboriginal populations was 4%.

The sum of the differences in area across all items is an indication of the difference in the amount of information provided by the FSA scores for the two populations. It was found that the Grade 4 Numeracy assessment and both Reading assessments provided less information for the aboriginal group than the non-aboriginal group. In contrast, the Grade 7 Numeracy assessment provided more information for the aboriginal group than the non-aboriginal group. Specifically, the Grade 4 Numeracy FSA provided 4% less information for the aboriginal population; the Grade 7 Numeracy FSA provided 19% more information for the aboriginal population; the Grade 4 Reading FSA provided 3% less information for the aboriginal population; and the Grade 7 Reading FSA provided 3%
less information for  84 Table 41 Grade 4 Numeracy Item Information Population Aboriginal Non-Aboriginal Difference Information Area Area Area % Information Area Item 4 0.0073 0.0003 0.0076 0.0270 1 0.0300 0.0007 9 0.0160 0.0069 0.0076 2 0.0270 -0.0014 -11 0.0142 0.0128 0.0910 0.0820 3 -14 0.0105 -0.0013 0.0092 0.0340 4 0.0310 -1 0.0520 0.0108 -0.0001 0.0107 5 0.0550 -21 -0.0015 0.0380 0.0200 0.0070 0.0085 6 -12 0.0082 0.0340 -0.0009 0.0073 7 0.0250 -0.0012 -16 0.0088 0.0076 0.0260 0.0230 8 -18 0.0087 -0.0013 0.0074 0.0270 0.0220 9 3 0.0031 0.0001 0.0032 0.0080 0.0070 10 0.0124 -0.0020 -19 0.0104 0.0570 11 0.0480 -0.0019 -16 0.0700 0.0137 0.0118 12 0.0590 -14 -0.0012 0.0310 0.0100 0.0088 0.0300 13 -24 -0.0014 0.0072 0.0058 0.0280 14 0.0200 5 0.0008 0.0830 0.0148 0.0156 15 0.0960 5 0.0110 0.0006 0.0116 0.0500 0.0580 16 7 0.0007 0.0460 0.0097 0.0104 0.0500 19 -0.0015 -18 0.0300 0.0100 0.0085 0.0250 20 -0.0017 -20 0.0083 0.0300 0.0100 0.0240 21 -4 -0.0005 0.0139 0.0134 0.0650 0.0750 22 9 0.0011 0.0550 0.0118 0.0700 0.0129 23 8 0.0008 0.0104 0.0520 0.0096 24 0.0620 0.0174 0.0017 9 0.0191 0.0980 0.1240 25 -12 0.0155 -0.0017 0.0900 0.0730 0.0138 26 -0.0020 -21 0.0117 0.0097 0.0700 0.0570 27 -20 -0.0025 0.0122 0.0590 0.0147 0.0510 28 -14 -0.0011 0.0090 0.0079 0.0270 0.0240 29 16 0.0016 0.0102 0.0400 0.0086 0.0500 30 16 0.0019 0.0500 0.0100 0.0119 0.0530 31 8 0.0098 0.0009 0.0107 0.0320 0.0410 32 -94 -0.0030 0.0032 0.0240 0.0062 0.0060 33 -1 0.0101 -0.0001 0.0350 0.0100 34 0.0390 0.0002 10 0.0030 0.0018 0.0020 0.0040 17 0.0004 12 0.0034 0.0060 0.0030 0.0070 18 0.0014 27 0.0037 0.0090 0.0130 0.0051 35 40 0.0024 0.0016 0.0040 0.0070 0.0040 36 -0.0135 0.3450 0.3315 Total Note. Values in bold text indicate that the difference in area for the two populations was at least 20%. 'Difference' represents Aboriginal-Non-aboriginal  85  Table 42 Grade 7 Numeracy Item Information Population Aborij>inal Information Area 0.0940 0.0143 0.0700 0.0121 0.1420 0.0174 0.0420 0.0090 0.1280 0.0182 0.0770 0.0115 0.0940 0.0138 0.3100 0.0286 0.1120 0.0150 0.2330 0.0241 0.0194 0.2360 0.0370 0.0083 0.0141 0.0910 0.0360 0.0097 0.0510 0.0110 0.0114 0.0630 0.1010 0.0158 0.0560 0.0119 0.0700 0.0135 0.1580 0.0181 0.0520 0.0096 0.1210 0.0146 0.0900 0.0128 0.0104 0.0430 0.0620 0.0112 0.1200 0.0175 0.3110 0.0234 0.3460 0.0256 0.0350 0.0094 0.0450 0.0103 0.2300 0.0233 0.2030 0.0201 0.0080 0.0040 0.0140 0.0053 0.0400 0.0103 0.0080 0.0037 0.5087  Non-Aboriginal Difference Information Area Area Area % 0.0510 0.0117 0.0026 18 0.0880 0.0131 -0.0010 -8 0.0870 0.0138 0.0036 21 0.0440 0.0092 -0.0002 -2 0.1260 0.0173 0.0009 5 0.0440 0.0084 0.0031 27 0.0600 0.0117 0.0021 15 0.1740 0.0207 0.0079 28 0.0570 0.0125 0.0025 17 0.1080 0.0172 0.0069 29 0.1730 0.0175 0.0019 10 0.0320 0.0077 0.0006 7 0.0790 0.0135 0.0006 4 0.0460 0.0104 -0.0007 -7 0.0410 0.0107 0.0003 3 0.0600 0.0113 1 0.0001 0.0890 0.0012 0.0146 8 0.0390 0.0098 0.0021 18 0.0430 0.0096 0.0039 29 0.0620 0.0114 0.0067 37 0.0440 0.0092 0.0004 4 0.1020 0.0133 0.0013 9 0.0104 0.0550 0.0024 19 0.0560 0.0098 0.0006 6 0.0440 0.0099 12 0.0013 0.0960 0.0156 0.0019 11 0.0132 0.0940 0.0102 44 0.0850 0.0133 0.0123 48 0.0310 0.0089 0.0005 5 0.0290 0.0092 0.0011 11 0.1000 0.0165 0.0068 29 0.1100 0.0153 0.0048 24 0.0020 0.0012 0.0028 70 0.0070 0.0035 0.0018 34 0.0250 0.0073 0.0030 29 0.0020 0.0011 0.0026 70 0.4098 0.0989 Note. Values in bold text indicate that the difference in area for the two populations was at least 20%. 
'Difference' represents Aboriginal-Non-aboriginal Item 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 17 18 35 36 Total  86  Table 43 Grade 4 Reading Item Information Population  Item 1 2 3 4 5 6 7 8 10 11 12 13 14 15 16 17 18 20 21 22 23 24 25 26 27 28 30 31 32 33 34 35 36 37 38 9 19 29 39 Total  AborijSinai Information Area 0.0121 0.0580 0.0077 0.0400 0.0103 0.0570 0.0048 0.0130 0.0080 0.0270 0.0111 0.0650 0.0119 0.0760 0.0097 0.0570 0.0144 0.0710 0.0139 0.0610 0.0180 0.1110 0.0166 0.1030 0.0158 0.0700 0.0128 0.0550 0.0172 0.1040 0.0080 0.0230 0.0107 0.0450 0.0077 0.0350 0.0065 0.0400 0.0090 0.0590 0.0030 0.0018 0.0097 0.0510 0.0089 0.0330 0.0091 0.0440 0.0083 0.0350 0.0168 0.1170 0.0230 0.0076 0.0101 0.0460 0.0088 0.0320 0.0100 0.0440 0.0107 0.0450 0.0123 0.0720 0.0184 0.1290 0.0156 0.0970 0.0144 0.0850 0.0009 0.0010 0.0111 0.0330 0.0380 0.0120 0.0071 0.0160 0.4198  Non-Abor ginal Information Area 0.0109 0.0540 0.0083 0.0450 0.0095 0.0510 0.0280 0.0065 0.0066 0.0220 0.0108 0.0660 0.0127 0.0920 0.0108 0.0790 0.0137 0.0700 0.0146 0.0670 0.0217 0.1410 0.0194 0.1240 0.0177 0.0920 0.0132 0.0500 0.0192 0.1140 0.0094 0.0320 0.0103 0.0570 0.0065 0.0230 0.0440 0.0094 0.0087 0.0520 0.0011 0.0020 0.0092 0.0500 0.0087 0.0280 0.0117 0.0610 0.0112 0.0510 0.0172 0.1390 0.0042 0.0110 0.0110 0.0550 0.0080 0.0220 0.0098 0.0460 0.0115 0.0500 0.0760 0.0148 0.0176 0.1210 0.0151 0.0910 0.0126 0.0680 0.0010 0.0020 0.0107 0.0320 0.0093 0.0250 0.0058 0.0130 0.4304  Difference Area Area % 0.0012 10 -8 -0.0006 8 0.0008 -35 -0.0017 0.0014 18 3 0.0003 -7 -0.0008 -0.0011 -11 5 0.0007 -5 -0.0007 -21 -0.0037 -0.0028 -17 -12 -0.0019 -0.0004 -3 -12 -0.0020 -0.0014 -18 4 0.0004 0.0012 16 -45 -0.0029 3 0.0003 0.0007 39 0.0005 5 2 0.0002 -0.0026 -29 -35 -0.0029 -0.0004 -2 0.0034 45 -9 -0.0009 9 0.0008 2 0.0002 -7 -0.0008 -20 -0.0025 4 0.0008 3 0.0005 13 0.0018 -11 -0.0001 4 0.0004 0.0027 23 18 0.0013 -0.0106  Note. Values in bold text indicate that the difference in area for the two populations was at least 20%. 
'Difference' represents Aboriginal-Non-aboriginal  87  Table 44 Grade 7 Reading Item Information Population  Item 1 2 3 4 5 7 8 9 10 11 12 13 14 15 17 18 19 20 21 22 23 24 25 26 27 28 30 31 32 33 35 36 37 38 39 40 41 42 43 44 45 6 16 34 46 Total  Aboriginal Information Area 0.0360 0.0076 0.0320 0.0066 0.0650 0.0128 0.0910 0.0123 0.0220 0.0057 0.0240 0.0067 0.0820 0.0117 0.0440 0.0083 0.0200 0.0063 0.0120 0.0046 0.1290 0.0177 0.0400 0.0084 0.0430 0.0097 0.0210 0.0062 0.0610 0.0099 0.0830 0.0102 0.0920 0.0130 0.0050 0.0024 0.0620 0.0092 0.0790 0.0168 0.0950 0.0181 0.0130 0.0051 0.0470 0.0126 0.0090 0.0038 0.0130 0.0050 0.0840 0.0150 0.0500 0.0106 0.0620 0.0125 0.0170 0.0057 0.0510 0.0123 0.0560 0.0138 0.0250 0.0072 0.0780 0.0169 0.0410 0.0111 0.1070 0.0199 0.0080 0.0036 0.0120 0.0045 0.0230 0.0089 0.0360 0.0103 0.0280 0.0084 0.1310 0.0168 0.0150 0.0065 0.0280 0.0100 0.0220 0.0087 0.0200 0.0082 0.4416  Non-Aboriginal Information Area 0.0290 0.0065 0.0470 0.0083 0.0670 0.0124 0.0540 0.0094 0.0110 0.0043 0.0210 0.0059 0.0610 0.0099 0.0480 0.0080 0.0150 0.0045 0.0070 0.0028 0.0850 0.0146 0.0350 0.0076 0.0400 0.0084 0.0260 0.0060 0.0250 0.0064 0.0540 0.0075 0.0630 0.0102 0.0180 0.0048 0.0400 0.0090 0.0560 0.0108 0.0790 0.0128 0.0120 0.0046 0.0380 0.0105 0.0100 0.0036 0.0150 0.0052 0.1000 0.0154 0.0420 0.0096 0.0560 0.0116 0.0130 0.0049 0.0780 0.0156 0.0480 0.0121 0.0210 0.0079 0.0650 0.0148 0.0430 0.0121 0.1100 0.0200 0.0020 0.0011 0.0130 0.0046 0.0190 0.0060 0.0500 0.0104 0.0410 0.0098 0.0380 0.0084 0.0640 0.0187 0.1170 0.0304 0.0900 0.0259 0.0740 0.0232 0.4565  Difference Area 0.0011 -0.0017 0.0004 0.0029 0.0014 0.0008 0.0018 0.0003 0.0018 0.0018 0.0031 0.0008 0.0013 0.0002 0.0035 0.0027 0.0028 -0.0024 0.0002 0.0060 0.0053 0.0005 0.0021 0.0002 -0.0002 -0.0004 0.0010 0.0009 0.0008 -0.0033 0.0017 -0.0007 0.0021 -0.0010 -0.0001 0.0025 -0.0001 0.0029 -0.0001 -0.0014 0.0084 -0.0122 -0.0204 -0.0172 -0.0150 -0.0149  Area % 14 -26 3 24 25 12 15 4 29 39 18 10 13 3 35 26 22 -100 2 36 29 10 17 5 -4 -3 9 7 14 -27 12 -10 12 -9 -1 69 -2 33 -1 -17 50 -188 -204 -198 -183  Note. Values in bold text indicate that the difference in area for the two populations was at least 20%. 'Difference' represents Aboriginal-Non-aboriginal  88  A test information function is obtained by summing the information of the items that contributed to the test score. The standard error of measurement ( S E M ) of a given ability level is the reciprocal of the square root of the test information at that ability level. In an effort to examine the similarities and differences of the accuracy of each F S A across the two populations, S E M as a function of scaled scores was presented for each assessment in Figures 1-4. These resulting functions visually depict the measurement accuracy provided by each test for the two groups across many ability levels. For the four figures, the vertical axis represents the S E M values, and the horizontal axis represents the scaled score values. When examining the figures, the reader should interpret the functions as such: the lower the values on the vertical axis, the higher the accuracy of the test. So when comparing the functions of the two populations, functions that are lower on the S E M axis for certain levels of the scaled scores, mean that their population's scores are more accurate at those points. 
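The relationship just described can be made concrete: item information functions are summed into a test information function, and the SEM at each ability level is the reciprocal of the square root of the test information at that level. The sketch below uses the standard 3PL item information formula (Lord, 1980) with illustrative parameters; it is not the FLUX computation used in the study.

```python
import numpy as np

def p_3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta - b)))

def item_info_3pl(theta, a, b, c):
    """3PL item information function (Lord, 1980)."""
    p = p_3pl(theta, a, b, c)
    return (1.7 * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

# Hypothetical parameters for a 36-item test.
rng = np.random.default_rng(4)
a = rng.uniform(0.6, 1.6, 36)
b = rng.normal(0.0, 1.0, 36)
c = np.full(36, 0.2)

theta = np.linspace(-3.0, 3.0, 7)
test_info = item_info_3pl(theta[:, None], a, b, c).sum(axis=1)  # sum over items
sem = 1.0 / np.sqrt(test_info)                                  # SEM = 1 / sqrt(I(theta))
for t, s in zip(theta, sem):
    print(f"theta = {t:+.1f}   SEM = {s:.3f}")
```

Plotting sem against theta (or against scaled scores) for each population separately produces curves of the kind shown in Figures 1-4.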
It should be noted that comparing the two populations' functions can show that the scores for one population are more accurate than the other's at some scaled score values, while the reverse holds at other scaled score values. From Figure 1, the SEM functions for Grade 4 Numeracy revealed that the accuracy of the test is similar for both populations, but that the test is more accurate at higher scores for the aboriginal population and at lower scores for the non-aboriginal population. From Figure 2, the SEM functions for Grade 7 Numeracy revealed that the test is slightly more accurate for the aboriginal population than for the non-aboriginal population. From Figure 3, the SEM functions for Grade 4 Reading revealed that the accuracy of the test is similar for both populations, but that the test is more accurate at higher scores for the aboriginal population and at lower scores for the non-aboriginal population. From Figure 4, the SEM functions for Grade 7 Reading revealed the same pattern: the accuracy of the test is similar for both populations, but the test is more accurate at higher scores for the aboriginal population and at lower scores for the non-aboriginal population.
Table 45  Correlations between IRT Item Parameters for Aboriginals and Non-Aboriginals c a Assessment a b P Numeracy Grade 4  0.85  0.92  0.53  0.90  0.99  0.98  Numeracy Grade 7  0.77  0.98  0.77  0.99  1.00  0.99  Reading Grade 4  0.89  0.94  0.68  0.98  0.71  0.98  Reading Grade 7  0.84  0.96  0.80  0.99  0.92  0.99  Note. Correlation values that are less than 0.70 are in bold text.  Differential Item Functioning DIF analyses were conducted using both the Linn-Harnisch (LH) and Logistic Regression (LR) DIF detection methods. DIF status of items was determined when an item was identified as DIF by both methods. Table 46 presents the DIF items identified by both methods, and Table 47 summarizes the number of items found to be DIF by both  93 methods. For example from Table 45, for the Grade 4 Numeracy assessment, there were seven items (Items 1, 2, 4, 8, 10, 29, & 33) identified as DIF using the LR method that favoured the aboriginal population, two of which were identified by both methods of DIF detection, and nine items (Items 1,9, 11, 13, 19, 20, 22, 23, & 28) identified as DIF that favoured the non-aboriginal population, two of which were identified by both methods of DIF detection. All items identified by both detection methods were consistent in their indication of which population, the aboriginal or the non-aboriginal students, was favoured by a particular DIF item.  94 Table 46  Identified DIF Items DIF Detection Method  Assessment Grade 4 Numeracy Grade 7 Numeracy Grade 4 Reading Grade 7 Reading  Logistic Regression Pro-NonProAboriginal Aboriginal 1,2, 4, 8, 7, 9, 11, 10, 29, 33 13, 19, 20, 22, 23, 28 5, 8, 9, 23,25 12,19, 26, 28,31,32 2,3,5, 10, 15,20, 25, 11,12,13, 34, 35, 38 24 1,2, 5, 10, 7, 11,22, 25, 35, 37, 17, 18,21, 38, 39, 42 27,36  Linn-Harnish Pro-NonProAboriginal Aboriginal 1, 14, 27, 7, 18, 19, 29, 36 25, 31, 32 6, 11, 14, 15,20  8, 18, 35  2,3, 11, 12, 13,27, 28 12, 22, 23, 25,37,39  15, 17, 19, 20, 29, 35, 36,39 3, 10, 16, 18,20,21, 34, 46  Both ProPro-NonAboriginal Aboriginal 1,29 7, 19 8 2,3, 11, 12, 13  15,20,35  22, 25, 37, 39  10, 18,21  Table 47  Number of DIF Items Found Using Both the LH and the LR DIF Detection Methods Assessment Numeracy 4 Numeracy 7 Reading 4 Reading 7  LHLevel 2 5 5 7 6  Pro-Aborij»inal LR LHLevel 3 7 0 0 2 0 9 0 9  LH & LR 2 0 5 4  LHLevel 2 6 3 7 8  Pro-Non-Aboriginal LHLR Level 3 0 9 0 9 1 6 0 9  LH & LR 2 1 3 3  Items that were determined to be DIF by both detection methods are presented with item details in Tables 48-51. This was done in an effort to identify patterns o f DIF items based on specific characteristics of items as well as to present the corresponding estimated amount o f DIF-related bias for each assessment. The item details presented in Tables 48-51 describe the item in terms of item type, context, and sub-content area, as well as describe the item in terms of the degree of DIF such as the Z-statistic, the number of participants in the focal group (the aboriginal population), the observed and predicted values o f the mean responses over deciles, and the difference between the observed and  95 predicted mean responses over deciles. A n example from Table 48, for Grade 4 Numeracy, Item 1 favours the aboriginal population, is a M C item type, is in the  number  sub-content area of Numeracy, and is written in the context of a field trip. 
Item 1 showed a Z-statistic value o f 2.703, an observed value of the mean responses over deciles of 0.68, a predicted value of the mean responses over deciles of 0.66, with an observed-predicted value o f 0.02. For interpretation purposes, the focus o f the item D I F details was on the observed-predicted value and the sum o f these values for each population favoured, for each assessment. From the manner in which the observed-predicted values were calculated, a positive result implies that the item favours the aboriginal population, and a negative result implies that the item favours the non-aboriginal population. For each F S A , these values were summed over the DIF items favouring a particular population to provide a crude estimation of the proportion of bias for each population in terms of item functioning. For the Grade 4 Numeracy F S A , the proportion of bias in favour of the aboriginal population, based on two DIF items, was 0.04; the proportion o f bias in favour of the non-aboriginal population, based on two DIF items, was 0.06. Table 48  Grade 4 Numeracy DIF Item Details Item  #of Words  Item Type  Sub-Content Area  Context  ZStatistic  O  P  3339 3339  0.68 0.54  0.66 0.52  0.02 0.02  3339 3339  0.64 0.48  0.67 0.51  -0.03 -0.03  N  O-P  Pro-Aboriginal 1 29  21 23  MC MC  Number Number  Field trip Activity day  2.702 2.855  Pro-Non-Aboriginal Field trip -3.984 Number 11 MC 7 Activity day -3.943 Number MC 17 19 Note. ' O ' represents 'Observed', and ' P ' represents 'Predicted'.  96  For the Grade 7 Numeracy F S A , the proportion of bias in favour of the aboriginal population, because there were no DIF items, was 0.00; the proportion of bias in favour of the non-aboriginal population, based on one DIF item, was 0.02. Table 49  Grade 7 Numeracy DIF Item Details #of Item Words  Item Type  Sub-Content Area  Context  ZStatistic  N  O  P  O-P  0.27  0.29  -0.02  Pro-Non-Aboriginal Patterns & Ski trip -2.794 Relations Note. 'O' represents 'Observed', and 'P' represents 'Predicted'. 8  28  MC  3071  For the Grade 4 Reading F S A , the proportion of bias in favour of the aboriginal population, based on five DIF items, was 0.13; the proportion of bias in favour of the non-aboriginal population, based on three DIF items, was 0.10. Table 50  Grade 4 Reading DIF Item Details #of Item Words  Item Type  Sub-Content Area  Context  ZStatistic  N  O  P  O-P  3328 3328 3328 3328 3328  0.90 0.87 0.74 0.85 0.84  0.88 0.85 0.71 0.82 0.80  0.02 0.02 0.03 0.03 0.03  Pro-Aboriginal 2 3 11  12 13  7 3 8 4 9  MC MC MC MC MC  Locate Locate Locate Locate Locate  Crime solving Crime solving House pets House pets House pets  3.130 4.042 4.189 4.270 4.763  Pro-Non-Aborij>inal 0.53 -0.04 -5.263 3328 0.49 Identify Polar bears MC 8 15 Frogs & Toads -4.113 3328 0.62 0.66 -0.03 MC Locate 5 20 Tree growth -4.330 3328 0.33 0.36 -0.03 MC Identify 35 8 Note. 'O' represents 'Observed', and 'P' represents 'Predicted'. The sub-content areas of "identify", "locate" and "critical" represent "identify and interpret key concepts and main ideas", "locate, interpret and organize details", and "critical analysis" respectively.  97  For the Grade 7 Reading FSA, the proportion of bias in favour of the aboriginal population, based on four DIF items, was 0.13; the proportion of bias in favour of the non-aboriginal population, based on three DIF item, was 0.09. 
Table 51 Grade 7 Reading DIF Item Details Item  #of Words  Item Type  Sub-Content Area  Context  ZStatistic  N  O  P  0.82 0.74 0.78 0.83  0.78 0.72 0.74 0.79  O-P  Pro-Aboriginal 22 25 37 39  8 10 5 6  MC MC MC MC  Locate, Critical Critical Locate  Goldfish Goldfish Snakes Snakes  4.379 2.728 5.022 4.305  3106 3106 3106 3106  0.03 0.02 0.04 0.04  Pro-Non-Aboriginal MC Locate 10 10 Whales -3.216 0.64 3106 0.67 -0.03 18 7 MC Locate Frogs & Toads -3.180 3106 0.74 0.76 -0.03 21 10 MC Critical Frogs & Toads -2.931 3106 0.74 0.76 -0.03 Note. 'O' represents 'Observed', and 'P' represents 'Predicted'. The sub-content areas of "identify", "locate" and "critical" represent "identify and interpret key concepts and main ideas", "locate, interpret and organize details", and "critical analysis" respectively.  For the Grade 4 Numeracy scores, neither number-of-words per item, item type, sub-content area, nor context can be used to suggest an explanation of the sources of DIF. The minimal nature of DIF items for the Grade 7 Numeracy scores, only one in total, indicates that in this case, there is no evidence to suggest that a specific item detail is the source of DIF. The item details for the DIF items identified for each population of Grade 4 Reading scores indicated that items of the locate, interpret, and organize sub-content area favoured the aboriginal population. Because of the similarity of the item details for the DIF items identified for each population for Grade 7 Reading scores, neither numberof-words per item, item type, sub-content area, nor context can be used to suggest an explanation of the sources of DIF.  98 Results Summary In this study, the dimensional structure of the test was found to be very similar for Grades 4 and 7 Reading FSAs, with high levels of comparability based on the congruence coefficients calculated for the factors across the two populations. The dimensional structure of the test was not found to be very similar for Grades 4 and 7 Numeracy FSAs. For all four FSAs, the reliability estimates of the test scores were found to be high and similar for both populations, meaning that the degree to which individuals' scores would remain relatively consistent over repeated administration of the same test or alternate test forms was high. For each of the Grade 4 Numeracy and the Grades 4 and 7 Reading FSAs, there was less than 4% difference in the level of accuracy of scores between the two populations, and for the Grade 7 Numeracy FSA, there was 19% difference in the level of accuracy of scores between the two populations, favouring the aboriginal population. For all four FSAs, relatively low degrees of DIF were found. In terms of the number of DIF items for each FSA, there were two favouring each of the two populations for the Grade 4 Numeracy; there was only one for the Grade 7 Numeracy and it favoured the non-aboriginal population; there were five favouring the aboriginal population and three favouring the non-aboriginal population for Grade 4 Reading; and there were four favouring the aboriginal population and three favouring the non-aboriginal population for Grade 7 Reading. Further, for all four FSAs, there was a minimal difference across the populations in the proportions of bias related to the collective impact of all DIF items.  
99  CHAPTER FIVE: DISCUSSION This study examined the psychometric properties of scores from a large-scale assessment in an effort to establish the comparability (or lack thereof) of the interpretation of the scores for aboriginal and non-aboriginal students. The comparability of the scores was addressed by the following two research questions: (1) Are scores from the Foundation Skills Assessment comparable across aboriginal and non-aboriginal populations!; and (2) Should score interpretations be the same for both populations! In answering these research questions, I performed four statistical analyses with the FSA data: factor analysis, comparison of reliability estimates, comparison of item information functions, and an analysis of differential item functioning. Because both research questions are associated with the unitary notion of validity, and the unitary notion of validity can only be assessed with an integrative look at all of the findings, I will present a short summary of the findings before answering the research questions in detail. Summary of Statistical Findings Factor Analysis For each FSA (Grade 4 Numeracy, Grade 4 Reading, Grade 7 Numeracy, and Grade 7 Reading) there were varying degrees of similarities in the factor structures of the scores across the aboriginal and non-aboriginal populations. These similarities in factor structure were strong across the two populations for the Grades 4 and 7 Reading FSAs. For both Reading FSAs, an equal number of factors and a high degree of comparability across the two populations for the two (out of three) dominant factors were found. The similarities in factor structure were not as strong across the two populations for the Grades 4 and 7 Numeracy FSAs. For both Numeracy FSAs, an unequal number of factors  100  was found for the two populations making it impossible to mathematically estimate the degree of comparability. For both Numeracy F S A s , visual inspection of the composition of the factors did not reveal any common patterns across the two populations. There was limited overlap in the composition of factor solutions with respect to sub-content area, context, and number of words.  Reliability For each assessment the reliability estimates were high (ranging from 0.78-0.88). Further, there were small differences found in the reliability estimates of the scores for the aboriginal and non-aboriginal populations. The Grade 4 and Grade 7 Numeracy F S A s had scores that were slightly more reliable for the non-aboriginal population, and the Grade 4 and Grade 7 Reading F S A s had scores that were slightly more reliable for the aboriginal population. In general, the reliability of the scores for each population was high, and very similar to one another.  Item Information Functions and Item Parameters The item information functions indicate the degree of measurement accuracy provided by test items for different ability levels. The area under the item information function is an indicator of total measurement accuracy provided by each test item. For each F S A , the item information functions were summed to get the total test information for each population. Then, the differences between the test information for the two populations were used as an indication of the difference in the amount of information provided by the scores of the two populations. It was found that the Grade 4 Numeracy assessment and both reading assessments provided less information for the aboriginal group than the non-aboriginal group. 
In contrast, it was found that the Grade 7 Numeracy  101  assessment provided more information for the aboriginal group than the non-aboriginal group. Specifically, the Grade 4 Numeracy F S A provided 4%  less information for the  aboriginal population; the Grade 7 Numeracy F S A provided 19%  more information for  the aboriginal population; the Grade 4 Reading F S A provided 3%  less information for the  aboriginal population; and the Grade 7 Reading F S A provided 3%  less information for  the aboriginal population. T o further examine the differences in accuracy of the two populations for each F S A , the standard error of measurement ( S E M ) functions for each scale score point were estimated. The S E M functions provide a graphic display of measurement accuracy provided by the test for each scale score point. A n examination of the S E M functions revealed that the Grade 4 Numeracy test and the Grades 4 and 7 Reading tests provided lower measurement accuracy for the non-aboriginal group at the higher end of the scale, and lower measurement accuracy for the aboriginal group at the lower end of the scale. For the Grade 7 Numeracy test, the above was not the case, but rather, except for the very low end of the ability scale, the test provided more measurement accuracy for the aboriginal group. For each assessment examined, comparisons of IRT based parameters for the aboriginal and non-aboriginal populations showed that the items were ordered very similarly in terms of their difficulty level and their degree of discrimination, and moderately similar in their probability of responding correctly due to chance.  Differential Item Functioning For the present study, DIF items were identified based on the consensus of two different DIF detection methods. For all four assessments the presence of DIF was  102  minimal. Overall, there were more DIF items in the two Reading F S A s than in either of the Numeracy F S A s . For the Grade 4 Numeracy F S A , approximately 10% of the items (4 items) were found to be differentially functioning, which were evenly distributed between the two populations in terms of the direction they favored. The proportion of bias in favour of the aboriginal population was 0.04; the proportion of bias in favour of the nonaboriginal population was 0.06. There was no pattern evident with respect to item details (i.e., number-of-words per item, item type, sub-content area, and context) that provided a plausible explanation of the sources of DIF. For the Grade 7 Numeracy F S A , only one item was identified as DIF. This item favored the non-aboriginal population, and represents a proportion of bias in favour of this group of 0.02. For the Grade 4 Reading F S A , approximately 21% of the items of were found to be DIF, with 13% of the items (five items) favoring the aboriginal population and 8% of the items (three items) favoring the non-aboriginal population. The proportion of bias in favour of the aboriginal population was 0.13; the proportion of bias in favour of the nonaboriginal population was 0.10. The item details for the DIF items identified for each population indicated that 6 of the DIF items of a total of 8 DIF items were from the  locate, interpret, and organize sub-content area favoured the aboriginal population. For the Grade 7 Reading F S A approximately 15% of the items of were found to be DIF, with 9% of the items (four items) favoring the aboriginal population and 6% (three items) of the items favoring the non-aboriginal population. 
For each assessment examined, comparisons of IRT-based parameters for the aboriginal and non-aboriginal populations showed that the items were ordered very similarly in terms of their difficulty level and their degree of discrimination, and moderately similarly in terms of the probability of responding correctly due to chance.

Differential Item Functioning

For the present study, DIF items were identified based on the consensus of two different DIF detection methods. For all four assessments the presence of DIF was minimal. Overall, there were more DIF items in the two Reading FSAs than in either of the Numeracy FSAs.

For the Grade 4 Numeracy FSA, approximately 10% of the items (four items) were found to be differentially functioning, and these were evenly distributed between the two populations in terms of the direction they favored. The proportion of bias in favour of the aboriginal population was 0.04; the proportion of bias in favour of the non-aboriginal population was 0.06. There was no pattern evident with respect to item details (i.e., number of words per item, item type, sub-content area, and context) that provided a plausible explanation of the sources of DIF.

For the Grade 7 Numeracy FSA, only one item was identified as DIF. This item favored the non-aboriginal population, and represents a proportion of bias in favour of this group of 0.02.

For the Grade 4 Reading FSA, approximately 21% of the items were found to be DIF, with 13% of the items (five items) favoring the aboriginal population and 8% of the items (three items) favoring the non-aboriginal population. The proportion of bias in favour of the aboriginal population was 0.13; the proportion of bias in favour of the non-aboriginal population was 0.10. The item details for the DIF items identified for each population indicated that six of the eight DIF items were from the locate, interpret, and organise sub-content area, and items from this sub-content area tended to favour the aboriginal population.

For the Grade 7 Reading FSA, approximately 15% of the items were found to be DIF, with 9% of the items (four items) favoring the aboriginal population and 6% of the items (three items) favoring the non-aboriginal population. The proportion of bias in favour of the aboriginal population was 0.13; the proportion of bias in favour of the non-aboriginal population was 0.09. Because of the similarity of the item details for the DIF items identified for each population, neither number of words per item, item type, sub-content area, nor context provided a plausible explanation of the sources of DIF.
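One standard detector of this kind is the logistic regression procedure of Swaminathan and Rogers (1990), in which each item response is modelled from the total score (the ability proxy), group membership, and their interaction; a significant group term signals uniform DIF and a significant interaction term signals non-uniform DIF. A minimal sketch with simulated data, illustrative only and not the study's operational analysis:

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(0)
n = 2000
total = rng.integers(0, 37, n).astype(float)  # total score as ability proxy
group = rng.integers(0, 2, n).astype(float)   # 1 = focal, 0 = reference group
logit = -3 + 0.15 * total + 0.0 * group       # item simulated with no DIF
item = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Full model: logit P(correct) = b0 + b1*total + b2*group + b3*total*group
X = sm.add_constant(np.column_stack([total, group, total * group]))
fit = sm.Logit(item, X).fit(disp=False)

# Likelihood-ratio test of the group and interaction terms against the
# ability-only model; ~ chi-square with 2 df when the item has no DIF
fit0 = sm.Logit(item, sm.add_constant(total)).fit(disp=False)
lr = 2 * (fit.llf - fit0.llf)
print(f"LR = {lr:.2f}, p = {chi2.sf(lr, 2):.3f}")
```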
Research Question 1: Are Scores From the Foundation Skills Assessment Comparable Across Aboriginal and Non-Aboriginal Populations?

The BC Ministry of Education used the FSA scores to compare individuals from the aboriginal and non-aboriginal populations based on a self-declaration of aboriginal ancestry by each student. For these comparisons, differentiation with regard to membership in each group (aboriginal or non-aboriginal) can be confounded with such features as language, socio-economic status, environmental exposure, recreational skills, access to computers, teachers' skills and level of dedication, and availability of good textbooks. Once group distinction is made by the bodies governing the assessments, comparisons of scores amongst the groups can be problematic when related factors such as those listed above cannot be disentangled.

The purpose of this study was to examine four FSAs to provide information about the degree of equivalence and comparability of scores for these two populations: aboriginal and non-aboriginal. This project was based on Messick's (1989a; 1989b; 1995) unitary notion of validity, in which one must make an overall evaluative judgment of the degree to which evidence supports the adequacy and appropriateness of score inferences. A number of pieces of evidence were gathered in order to make the determination of the degree of comparability. Each piece is presented below, along with how it contributes to the evaluation of comparability.

One piece of evidence about the degree of comparability of these assessments for aboriginals and non-aboriginals was the factor analysis. This set of analyses examined whether the dimensional structure of a test was consistent for both populations. If the structure of the test was found to be the same for both populations, then this evidence was deemed consistent with the hypothesis that the test is measuring the same construct for both populations. In this study, the dimensional structure of the test was found to be very similar for the Grades 4 and 7 Reading FSAs, with high levels of comparability based on the congruence coefficients calculated for the factors across the two populations. The dimensional structure of the test was not found to be very similar for the Grades 4 and 7 Numeracy FSAs. For both Numeracy assessments, the number of factors defining the dimensional structure of the test scores across the two populations was different. Further, the composition of the resulting dimensions for both Numeracy FSAs did not reveal common factors across the populations. This means that how the construct of numeracy is represented and measured is somehow different for the two populations. For example, there may be a difference in the levels of the skills required to solve the Numeracy assessments across the two populations. Similarity of factor structures for the two populations is a necessary, but not sufficient, component of construct comparability. From this study, the factor structure evidence is supportive of a high degree of comparability across the aboriginal and non-aboriginal populations for the Reading FSA scores, but a low degree of comparability across the aboriginal and non-aboriginal populations for the Numeracy FSA scores.
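The congruence coefficients referred to here can be computed directly from the two groups' factor loading matrices: Tucker's coefficient of congruence for a pair of factors is the uncentred cosine of the two loading vectors. A small sketch with hypothetical loadings (values near 1.0 indicate highly similar factors):

```python
import numpy as np

def congruence(x, y):
    """Tucker's coefficient of congruence between two loading vectors."""
    return np.dot(x, y) / np.sqrt(np.dot(x, x) * np.dot(y, y))

# Hypothetical loadings of the same items on one factor in each population
load_ab = np.array([0.42, 0.44, 0.29, 0.50, 0.44, 0.29])
load_non = np.array([0.38, 0.41, 0.33, 0.47, 0.40, 0.31])

print(f"congruence = {congruence(load_ab, load_non):.3f}")
```

Note that this comparison presupposes the same number of factors in both solutions, which is why, as described above, no such coefficient could be computed for the Numeracy FSAs.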
Investigating the internal consistency of scores, as indicated by the reliability estimates, was another piece of evidence gathered to examine construct comparability. For this study, for all four assessments, the reliability estimates of the test scores were high and similar for both populations, meaning that the degree to which individuals' scores would remain relatively consistent over repeated administrations of the same test or alternate test forms was high. The reliability analysis gave evidence of a high degree of comparability across the aboriginal and non-aboriginal populations.
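As an illustration of the kind of internal-consistency estimate being compared, the sketch below computes coefficient alpha for a simulated examinees-by-items matrix. The data are placeholders; the operational FSA estimates would also need to account for the mixed multiple-choice and open-ended item formats (cf. Qualls, 1995), which this simple dichotomous example ignores.

```python
import numpy as np

def cronbach_alpha(scores):
    """Coefficient alpha for an examinees-by-items score matrix."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = scores.sum(axis=1).var(ddof=1)     # variance of total scores
    return k / (k - 1) * (1 - item_vars / total_var)

rng = np.random.default_rng(1)
ability = rng.normal(size=(500, 1))
# 36 hypothetical dichotomous items driven by a common ability
items = (ability + rng.normal(scale=1.2, size=(500, 36)) > 0).astype(float)
print(f"alpha = {cronbach_alpha(items):.2f}")
```

In a comparability study, this estimate would simply be computed separately for each population and the two values compared.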
Differences in the item information functions were examined for each FSA in an effort to detect differences in the degree of measurement accuracy provided by the test items for the two populations. For each of the Grade 4 Numeracy and the Grades 4 and 7 Reading FSAs, there was less than a 4% difference in the level of accuracy of scores between the two populations; minimal by my subjective standard. For the Grade 7 Numeracy FSA, there was a substantial difference, 19%, in the level of accuracy of scores between the two populations, favoring the aboriginal population. The item information function analysis provided consistent evidence that there is a high degree of comparability across the aboriginal and non-aboriginal populations for the Grade 4 Numeracy and Grades 4 and 7 Reading scores, and a low degree of comparability across the aboriginal and non-aboriginal populations for the Grade 7 Numeracy scores.

The presence of DIF items for the two populations showed that, depending on group membership (aboriginal or non-aboriginal), a student's probability of answering particular items correctly when matched on ability could be different on some items. Because each FSA had fewer than 30% of the items identified as DIF, high degrees of DIF were not found in this study (Zenisky, Hambleton, & Robin, 2003). In terms of the number of DIF items for each FSA, there were two items favoring each of the two populations for the Grade 4 Numeracy; there was only one item for the Grade 7 Numeracy, and it favored the non-aboriginal population; there were five items favoring the aboriginal population and three items favoring the non-aboriginal population for the Grade 4 Reading; and there were four items favoring the aboriginal population and three favoring the non-aboriginal population for the Grade 7 Reading.

Further, for each FSA, an estimate was made of the amount of total bias related to the collective impact of DIF items for each population, based on the average difference in the probability of obtaining the maximum item score for the two populations, matched on ability. For the Grade 4 Numeracy FSA, the proportions of bias were 4% and 6% in favor of the aboriginal and non-aboriginal populations, respectively. For the Grade 7 Numeracy FSA, the proportions of bias were 0% and 2% in favor of the aboriginal and non-aboriginal populations, respectively. For the Grade 4 Reading FSA, the proportions of bias were 13% and 10% in favor of the aboriginal and non-aboriginal populations, respectively. For the Grade 7 Reading FSA, the proportions of bias were 13% and 9% in favor of the aboriginal and non-aboriginal populations, respectively. For all four FSAs, the relatively low degree of DIF and the minimal difference across the populations in the proportions of bias related to the collective impact of all DIF items lead me to conclude that the DIF analysis provided evidence in support of a high degree of comparability across the aboriginal and non-aboriginal populations' FSA scores. Overall, the DIF analyses demonstrated the presence of DIF for the two populations, but the results did not show consistent bias against one group or the other.
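The sketch below shows one way such a matched-ability bias summary could be computed from group-specific 3PL parameters: for each DIF item, the difference between the two groups' probabilities of a correct response is averaged over an ability grid. The parameter values, the normal weighting over the grid, and the aggregation itself are assumptions for illustration, not the study's exact operationalization.

```python
import numpy as np

def p_correct_3pl(theta, a, b, c, D=1.7):
    """Probability of a correct response under a 3PL model."""
    return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))

theta = np.linspace(-3, 3, 61)       # matched-ability grid
weights = np.exp(-0.5 * theta ** 2)  # N(0, 1) weights over the grid
weights /= weights.sum()

# Hypothetical ((a, b, c) aboriginal, (a, b, c) non-aboriginal) per DIF item
dif_items = [
    ((1.0, 0.2, 0.20), (1.0, -0.1, 0.20)),
    ((0.9, -0.3, 0.25), (0.9, 0.1, 0.25)),
]

for pa, pn in dif_items:
    gap = np.sum(weights * (p_correct_3pl(theta, *pa)
                            - p_correct_3pl(theta, *pn)))
    print(f"ability-weighted gap (aboriginal minus non-aboriginal): {gap:+.3f}")
```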
This study resulted in many findings. Table 52 summarizes the determination of the degree of comparability across the aboriginal and non-aboriginal populations for each analysis. For ease of presentation and synthesis, results of analyses were considered to provide evidence for either a low or a high degree of comparability. From Table 52, it can be seen that for this study, there was a high degree of comparability across the two populations for the Grades 4 and 7 Reading FSA scores because all four analyses showed them to be highly comparable. There was a moderately high degree of comparability across the two populations for the Grade 4 Numeracy FSA scores because three out of the four analyses showed them to be highly comparable. There was a moderate degree of comparability across the two populations for the Grade 7 Numeracy FSA scores because two out of the four analyses showed them to be highly comparable.

Table 52
Degree of Comparability Across the Aboriginal and Non-Aboriginal Populations

Analysis                         High degree of comparability           Low degree of comparability
Common Factor Analysis           Grade 4 Reading; Grade 7 Reading       Grade 4 Numeracy; Grade 7 Numeracy
Reliability                      Grade 4 Numeracy; Grade 7 Numeracy;
                                 Grade 4 Reading; Grade 7 Reading
Item Information Functions       Grade 4 Numeracy; Grade 4 Reading;     Grade 7 Numeracy
                                 Grade 7 Reading
Differential Item Functioning    Grade 4 Numeracy; Grade 7 Numeracy;
                                 Grade 4 Reading; Grade 7 Reading

The differences in the FSA scores across the populations indicated by the findings above, although minimal in nature, are likely because the items are assessing additional skills/competencies and the distribution of these additional skills/competencies is different for the two populations.

Research Question 2: Should Score Interpretations be the Same for Both Populations?

The phrasing of this research question could imply that there may be separate interpretations for the aboriginal and non-aboriginal FSA scores, but what it really asks is whether the scores were adequately attained in preparing an unbiased common scale for the two populations. To re-phrase what was stated above, this study found that the scores were very comparable across the aboriginal and non-aboriginal populations for the Grades 4 and 7 Reading FSAs; quite comparable for the Grade 4 Numeracy FSA; and moderately comparable for the Grade 7 Numeracy FSA. In answering Research Question 2, this study found that: (a) interpretations should be the same for the aboriginal and non-aboriginal populations for the Grades 4 and 7 Reading FSA scores; (b) interpretations should be made with care and caution when comparing the two populations for the Grade 4 Numeracy FSA scores; and (c) interpretations that imply score comparability across the two populations should not be made for the Grade 7 Numeracy FSA. Making the same interpretations for these two populations based on scores from the Grade 7 Numeracy FSA could lead to faulty conclusions, such as comparing the aboriginal and non-aboriginal students based on their performance and reporting the differences as if they reflected the same construct measured on a common scale.
Findings in Context

As previously stated in the literature review, the Standards document (AERA et al., 1999) contends that when different populations have scores with different distributions, this is evidence of invalidity only if the different score distributions "were due to the test's sensitivity to some examinee characteristic not intended to be part of the test construct" (p. 16). One statistical method used to distinguish whether there are true differences in the abilities of the two groups, or whether the differences represent construct-irrelevant bias, is differential item functioning (DIF). For each assessment there were relatively few DIF items based on the context of the item, and this provides evidence that the group differences were minimally influenced by characteristics that were not intended to be part of what was to be measured as laid out by the BC Ministry of Education. As was stated in Chapter 1 of this study, the main purpose of the FSAs was to help the province, school districts, schools, and school planning councils evaluate how well foundation skills are being addressed and make plans to improve student achievement (British Columbia Ministry of Education, 2002a).

Some literature has dealt with the disparity of scores and the difference in the level of academic success between the aboriginal and non-aboriginal populations. Some studies presented hypotheses that consistently blamed the drastic differences in cultural (including language) and socio-economic situations between the groups as the explanation (Wesley, 1961; Lloyd, 1961). Others presented explanations indicating that it was the lack of aboriginal culture and aboriginal-like perspective taking in the assessments that resulted in the disparity of scores and the difference in the level of academic success between the aboriginal and non-aboriginal populations (Keeler, 1961; Evvard & Weaver, 1966). More recent research has shown that the following efforts have led to a reduction in the disparity of scores between the aboriginal and non-aboriginal populations: (a) a nurturing early childhood environment (Swisher & Deyhle, 1989); (b) inclusion of native language and cultural programs in the school (Ayoungman, 1991; Barnhardt, 1990, 1999; deMarrais, 1992; James, Chavez, Beauvais, Edwards, & Oetting, 1995; Lipka & McCarty, 1994; Rubie, 1999; Slaughter & Lai, 1994; McLaughlin, 1992; Watahomigie & McCarty, 1994); and (c) enhanced community and parental influences on academic performance (Leveque, 1994; McInerney, McInerney, Ardington, & De Rachewiltz, 1997). It is my belief that these researchers presented plausible explanations for the differences, and suggestions for improvement, but unfortunately, I did not have the data, or the resources, to explore their hypotheses in full. A natural extension of this study would be to further explore these plausible explanations by teaming up with other researchers, such as those listed above, as well as by attaining relevant data that could inform these issues in British Columbia from Statistics Canada, the BC Ministry of Education, the BC Ministry of Children and Family Development, the BC Ministry of Community, Aboriginal, and Women's Services, and the BC Ministry of Health Services.
Implications of Findings

Interpretation Implications

For the Grades 4 and 7 Reading FSAs, the high degree of comparability of scores found in this study led to the conclusion that these tests are measuring the same construct with nearly the same degree of accuracy for the aboriginal and non-aboriginal populations. The implication for how these scores should be interpreted is simple: the scores from the two populations can be interpreted in the same way and can be compared to one another in a psychometrically sound manner.

For the Grade 4 Numeracy FSA, the moderately high degree of comparability of scores found in this study led to the conclusion that this test did not measure exactly the same construct, but what was measured was done with nearly the same degree of accuracy for the aboriginal and non-aboriginal populations. The implication for how these scores should be interpreted is that the scores from the two populations should not be interpreted as if they were comparable. The reliability analysis, the item information function analysis, and the differential item functioning analysis all indicated that there was a high degree of comparability. However, the differences in the factor structures led me to conclude that further investigation into how this construct is represented and measured for these populations is necessary in order to understand, address, and possibly reduce the differences across the populations.

For the Grade 7 Numeracy FSA, the moderate degree of comparability of scores found in this study led to the conclusion that this test did not measure exactly the same construct, and that what was measured was not done with similar degrees of accuracy for the aboriginal and non-aboriginal populations. The implication for how these scores should be interpreted is that the scores from the two populations should not be interpreted as if they were comparable. The fact that the reliability analysis and the differential item functioning analysis both indicated a high degree of comparability led me to believe that further investigation into the factor structure and the item information functions may lead to an understanding of the differences across the populations, which could in turn be addressed and possibly reduced.

In the cases where substantial psychometric differences were found to exist in the FSA scores across the aboriginal and non-aboriginal populations, these differences would have had an impact on the FSA total scores. This impact on total scores would have continued its influence as the BC Ministry of Education classified each student into cut-score-based proficiency levels (not yet within expectations, meets expectations, exceeds expectations). The BC Ministry of Education then used individual proficiency levels to compare aboriginals and non-aboriginals at the provincial level. In conclusion, such comparisons would have misrepresented the true differences in academic performance between these two populations.

Of the psychometric differences found, the difference in the degree of measurement accuracy of scores for the two populations is of serious concern. It was shown that for the Grade 4 Numeracy and the Grades 4 and 7 Reading FSAs, the scores were more accurate for the non-aboriginal population than the aboriginal population at the lower end of the ability scale, and the opposite was true at the higher end of the ability scale. This means that there is less measurement accuracy for aboriginal students with lower ability levels than for non-aboriginal students with lower ability levels. Further, there are more aboriginal students than non-aboriginal students with lower ability levels. When the BC Ministry of Education reduced the total scores to proficiency levels, one would hope that there was an equally high degree of measurement accuracy for each population at the cut-scores that make this determination, especially for the cut-score determining the lowest proficiency level. Although I did not have the cut-scores to determine whether this was the case, the difference in measurement accuracy at the lower end of the ability scale is an indication that there were probably different degrees of measurement accuracy for the two populations in determining which students were classed into the lowest proficiency level.

With educational decisions being made based on students' FSA scores, it is the responsibility of the BC Ministry of Education to demonstrate the extent to which inferences based on these scores are valid and comparable for all identifiable sub-groups. The implication of the findings of this study is that scores from the Grades 4 and 7 Reading FSAs are comparable for the aboriginal and non-aboriginal populations, meaning that the inferences based on these scores are valid and comparable. Further, the scores from the Grades 4 and 7 Numeracy FSAs are not strictly comparable, meaning that the inferences based on these scores are not valid and comparable. In the case of the Grades 4 and 7 Numeracy FSA scores, there appear to be factors inherent in membership in the aboriginal population that influence the academic assessment scores. When the influence of group membership on assessment scores is strong enough to produce nonequivalent factor structures, differing item information functions, or DIF items, some type of change needs to be made to either the testing instrument or the interpretation of the scores if the score interpretations are to be valid and comparable. Some suggestions for changes are discussed in the Methodological Implications section below.

Methodological Implications

This study represents the first stage of a multi-staged process in that it identified problems with FSA score comparability across the aboriginal and non-aboriginal populations. This study did not find high levels of DIF for the FSAs examined. Regardless, it is important to understand the methodological implications of DIF items because they point to differences between the performance of students from the two populations on these items, the sources of which we may not know yet. For the Grade 4 Reading FSA, the sub-content area and context of items were found to be influencing factors on the finding of DIF; for the Grade 7 Reading FSA, the context of items was found to be an influencing factor on the finding of DIF. For the Grades 4 and 7 Numeracy FSAs, the sources of DIF were more complex than the identifiable variables listed above could explain.

Other possible sources of differences for the two populations could be familiarity with the FSA testing techniques, item formats, test conventions, and testing procedures (Hambleton & De Jong, 2003). Further sources of DIF could be related to differences between the two populations with regard to individuals' interpretations of the relevance of the items, their intrinsic interest in the items, and their familiarity with the item content (Ercikan, 1998). To investigate the complicated nature of understanding the sources of DIF, techniques for judgemental review of DIF items and procedures for examining student cognitive processes have been developed (Ercikan et al., 2002). The purpose of a judgemental review is to discover why items are performing differently between groups. The judgemental review uses a panel of reviewers to examine items for inconsistencies in meaning across groups or cultures. Both of these methods provide insights about sources of DIF and why test items are functioning differentially for students from different socio-cultural backgrounds.

If assessment scores are found not to be comparable across populations, the adaptation or revision of items, via the judgemental reviewers' summary critique, are two legitimate solutions. They are both relatively inexpensive and fast alternatives to preparing a completely new test for a second cultural or language group (Hambleton & Patsula, 1998). One benefit of adapting a test across identifiable sub-populations is that it can enhance fairness by enabling persons to take tests in their preferred format, context, or language. But with the development of different adapted versions for identifiable sub-populations come new challenges in terms of verifying that each version, administered to the appropriate population, produces scores that are valid and comparable across all versions. The other alternative suggested above is to revise the items in such a way that the biased nature of the item is removed. The benefit of altering the DIF items is that individuals from the identifiable cultural groups will all be given the same items; hence, comparable scores would result.

The real focus of the implications of this study should be on the necessity of undertaking construct validity investigations for all large-scale assessments when comparisons are to be made about scores from identifiable sub-groups. Test developers bear the responsibility of demonstrating that their tests produce scores that have valid and comparable interpretations for all persons whom they were designed to assess. It should be a routine part of their development procedures; without it, no appropriate inferences can be made.
Limitations of Findings

There were certain limitations of my study that I would like to discuss; I will attempt to convey how these limitations may or may not have affected my findings. One limitation was related to the manner in which the students were categorized as either aboriginal or non-aboriginal. This categorization was based on the self-declaration of aboriginal heritage by the student; no other criteria were applied for this declaration. The self-declaration process allowed students who were of aboriginal heritage to choose not to declare themselves as aboriginal, and students who were not of aboriginal heritage to choose to declare that they were. Considering this self-declaration variable for categorizing students into the two populations (aboriginal and non-aboriginal) for comparison, the composition of the two populations might have been different if another criterion, such as Indian Status, had been used to define the aboriginal population. But then I have to wonder: would only including Status Indian students as aboriginal cause error in the comparison by placing the Metis and other non-status aboriginal groups in the non-aboriginal population? It would be interesting to follow the self-declaration pattern of students who were actually Status Indians to see whether this self-declaration variable was static or dynamic. If this variable were dynamic, and students changed their minds from year to year about declaring themselves as having aboriginal ancestry, the comparison of the scores across the aboriginal and non-aboriginal populations would become unstable and impossible to understand or track over the years.

Another limitation of this study was the lack of demographic variables for the students. As mentioned above, I would have liked to have other culturally embedded variables, such as socio-economic status, in order to examine their effects on score performance.

Another limitation of the study was that the data were received with no identifiable missing data. Before the data were delivered to me, all non-response data (the missing data) were coded as zeros. In an effort to clean the data, I removed the cases in which every response was zero, as these cases most likely did not reflect real people. In doing this, I may have deleted some cases in which a student answered all the questions incorrectly.

Future Directions

From this study came the realization that there are directions that interested researchers could pursue in an effort to fully understand the nature of the differences in large-scale assessment scores for the aboriginal and non-aboriginal student populations. Further, with regard to the application of the construct validation aspect of the study, there are research directions that could lead to greater specificity regarding how much and what kind of differences between identifiable populations on assessments justify changes to the assessment instrument or different interpretations.

One research direction would be to include the examination of convergent and discriminant correlations with external variables in determining the comparability of the score inferences for the aboriginal and non-aboriginal populations. This would broaden the scope of the investigation, but it would have to be done cautiously, as measurement error and biases may exist in the external data.

Another research direction would be to interview groups of students from both aboriginal and non-aboriginal groups with regard to examining student cognitive processes, as well as their beliefs about the appropriateness of the contexts of the items that make up the assessment in question. This type of research could inform the test designers as to sources of DIF and construct incomparability.

Another research direction would be to apply a two-stage DIF approach in an effort to reduce the contamination of the matching criterion across the populations. In a typical DIF analysis, the matching criterion is, as it was in this study, the individual's total score. This means that the total score was used as the proxy for ability when students' scores were matched on ability. When a high level of DIF items is present, the errors associated with the DIF items are embedded in the total score used to match the scores; this leads to circularity of the error (Zenisky, Hambleton, & Robin, 2003). This type of contamination "is likely to result in less than optimal identification of DIF items and complicate efforts to interpret the findings" (Zenisky, Hambleton, & Robin, p. 52). The two-stage approach refines the total score by removing items that were initially identified as DIF. Once examinees are matched on the revised criterion, the DIF analysis is repeated to identify newly emerged DIF items. This process may be repeated until some a priori condition is reached (Zenisky, Hambleton, & Robin). With the use of this more thorough DIF detection method, one could be more confident that all DIF items had in fact been identified.
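A sketch of this two-stage (purification) logic follows, with a deliberately crude stand-in detector; a real application would substitute its own DIF method (e.g., the logistic regression procedure sketched earlier) for `flag_dif`.

```python
import numpy as np

def flag_dif(scores, group, matching, threshold=0.10):
    """Crude illustrative detector: flags item j when the average
    within-stratum score gap between the two groups exceeds `threshold`
    (a standardization-style index; assumptions for illustration only)."""
    strata = np.digitize(matching, np.quantile(matching, [0.2, 0.4, 0.6, 0.8]))
    flagged = []
    for j in range(scores.shape[1]):
        gaps, sizes = [], []
        for s in np.unique(strata):
            in_s = strata == s
            g1, g0 = in_s & (group == 1), in_s & (group == 0)
            if g1.any() and g0.any():
                gaps.append(scores[g1, j].mean() - scores[g0, j].mean())
                sizes.append(in_s.sum())
        if abs(np.average(gaps, weights=sizes)) > threshold:
            flagged.append(j)
    return flagged

def purified_dif(scores, group, max_rounds=5):
    """Two-stage DIF: rematch on a criterion purged of flagged items and
    repeat until no new DIF items emerge (cf. Zenisky et al., 2003)."""
    flagged = set()
    for _ in range(max_rounds):
        keep = [j for j in range(scores.shape[1]) if j not in flagged]
        matching = scores[:, keep].sum(axis=1)  # refined total score
        newly = set(flag_dif(scores, group, matching)) - flagged
        if not newly:                           # converged: no new DIF items
            break
        flagged |= newly
    return sorted(flagged)

# Demonstration on simulated dichotomous responses with no built-in DIF
rng = np.random.default_rng(2)
ability = rng.normal(size=1000)
group = rng.integers(0, 2, 1000)
p = 1 / (1 + np.exp(-(ability[:, None] - np.linspace(-1, 1, 20))))
scores = (rng.random((1000, 20)) < p).astype(float)
print(purified_dif(scores, group))
```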
Finally, and of most interest to me, another research direction concerns how much and what kind of psychometric-property differences between identifiable populations on assessments justify the categorizations of comparable and not comparable when performing a construct comparability study. This aspect provided me with one of the biggest hurdles in terms of interpreting the results of this study. Sireci, Xing, Bastari, Allalouf, and Fitzgerald (1999) spoke to the lack of appropriate statistical tests for determining the degree of structural equivalence; I found the same void when I tried to determine the degree of DIF, the degree of internal consistency differences, and the degree of item information function differences. With the attention being drawn to the need for construct validation and comparability studies with assessment scores comes the need for a set of guidelines by which researchers can make claims about their findings in a manner that is acceptable to their peers, and that ultimately leads to consistency in the improvement of assessment.

REFERENCES

Alexander, K. L., Entwisle, D. R., & Kabbani, N. S. (2001). The dropout process in life course perspective: Early risk factors at home and school. Teachers College Record, 103, 760-823.

Allalouf, A., Hambleton, R. K., & Sireci, S. (1999). Identifying causes of DIF in adapted verbal items. Journal of Educational Measurement, 36, 185-198.

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Psychological Association.

American Psychological Association. Joint committee on testing practices. Retrieved December 4, 2002, from http://www.apa.org/science/ictpweb.html

Ayoungman, V. (1991). Siksika language renewal efforts: Description and assessment (Indians, Canada) (Doctoral dissertation, Arizona State University, 1991). Dissertations Abstract International, 52, 1188.

Barnhardt, C. (1999). Kuinerrarmiut Elitnaurviat: The school of the people of Quinhagak, a case study. Portland, OR: Northwest Regional Educational Lab. (ERIC Document Reproduction Service No. ED437252)

Barnhardt, R. (1990). Two cultures, one school: St. Mary's, Alaska. Canadian Journal of Native Education, 17, 54-65.

British Columbia Ministry of Education. Interpreting and communicating British Columbia Foundation Skills Assessment results. Retrieved November 10, 2002a, from http://www.bced.gov.bc.ca/assessment/fsa/fsa_interpretation_2003.pdf
British Columbia Ministry of Education. Reports and publications. Retrieved November 10, 2002b, from http://www.bced.gov.bc.ca/reportfinder/publicschools.php

Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37, 29-51.

Burket, G. R. (1993). FLUX [Computer software]. Monterey, CA: CTB/McGraw-Hill.

Burket, G. R. (1998). PARDUX (Version 4.01) [Computer software]. Monterey, CA: CTB/McGraw-Hill.

Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1, 245-276.

Cattell, R. B., & Vogelmann, S. (1977). A comprehensive trial of the scree and KG criteria for determining the number of factors. Multivariate Behavioral Research, 13, 289-325.

Committee on Foundations of Assessments. (2001). Knowing what students know: The science and design of educational assessment. Washington, DC: National Academy Press.

Council of Ministers of Education, Canada. Education indicators in Canada. Retrieved November 15, 2002, from http://www.cmec.ca/stats/pceip/1999old/pceipmac/english/pages/page26e.html

Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Fort Worth, TX: Harcourt Brace Jovanovich College Publishers.

deMarrais, K. B. (1992). Meaning in mud: Yup'ik Eskimo girls at play. Anthropology & Education Quarterly, 23, 120-144.

Demmert, W. G. (2001). Improving academic performance among Native American students: A review of the research literature. Retrieved November 12, 2002, from http://www.edrs.com/Webstore/Download.cfm?ID=692162

Ercikan, K. (1998). Translation effects in international assessment. International Journal of Educational Research, 29, 543-553.

Ercikan, K. (2002). Disentangling sources of differential item functioning in multilanguage assessments. International Journal of Testing, 2(3&4), 199-215.

Ercikan, K. (2003). Are the English and French versions of the Third International Mathematics and Science Study administered in Canada comparable? Effects of adaptations. International Journal of Educational Policy, Research and Practice.

Ercikan, K., Gierl, M., McCreith, T., Puhan, G., & Koh, K. (in press). Comparability of English and French versions of SAIP for reading, mathematics and science items. Applied Measurement in Education.

Ercikan, K., & McCreith, T. (2002). Effects of adaptations on comparability of test items. In D. Robitaille & A. Beaton (Eds.), Secondary analysis of TIMSS results (pp. 391-405). Dordrecht, The Netherlands: Kluwer Academic Publishers.

Ercikan, K., Schwarz, R. D., Julian, M. W., Burket, G. R., Weber, M. M., & Link, V. (1998). Calibration and scoring of tests with MC and constructed-response item types. Journal of Educational Measurement, 35, 137-154.

Evvard, E., & Weaver, Jr., R. R. (1966). Testing - some implications for counsellors and teachers. Journal of American Indian Education, 5(3). Retrieved October 1, 2002, from the Journal of American Indian Education Electronic Journal Collection (via the Xwi7xwa Library, UBC).

Gierl, M. J., & Khaliq, S. N. (2001). Identifying sources of differential item functioning on adapted achievement tests: A confirmatory approach. Journal of Educational Measurement, 38, 164-187.

Gorsuch, R. L. (1983). Factor analysis. Hillsdale, NJ: Lawrence Erlbaum Associates.

Gould, S. J. (1995). Mismeasure by any measure. In R. Jacoby & N. Glauberman (Eds.), The bell curve debate (pp. 3-13). New York: Random House/Times Books.
Hambleton, R. K. (1989). Principles and selected applications of item response theory. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 147-200). Washington, DC: American Council on Education.

Hambleton, R. K., & De Jong, J. H. A. L. (2003). Advances in translating and adapting educational and psychological tests [Electronic version]. Language Testing, 20, 127-134.

Hambleton, R. K., & Jones, R. W. (1994). Comparison of empirical and judgemental procedures for detecting differential item functioning. Educational Research Quarterly, 75, 21-36.

Hambleton, R. K., & Patsula, L. (1998). Adapting tests for use in multiple languages and cultures [Electronic version]. Social Indicators Research, 45, 153-171.

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. London: Sage Publications.

Hakstian, R. A. (1971). A comparative evaluation of several prominent methods of oblique factor transformation. Psychometrika, 36, 175-193.

Hakstian, R. A., Farrell, S., & Tweed, R. G. (2002). The assessment of counterproductive tendencies by means of the California Psychological Inventory. International Journal of Selection & Assessment, 10, 58-86.

Hakstian, R. A., Rogers, W. T., & Cattell, R. B. (1982). The behavior of number-of-factors rules with simulated data. Multivariate Behavioral Research, 17, 193-219.

Harman, H. H. (1976). Modern factor analysis. Chicago: The University of Chicago Press.

Harpur, T. J., Hakstian, A. R., & Hare, R. D. (1988). Factor structure of the psychopathy checklist. Journal of Consulting and Clinical Psychology, 56, 741-747.

Indian and Northern Affairs Canada. United Nations working group on indigenous populations: Statement by the observer delegation of Canada on the theme of "Children and Youth". Retrieved November 6, 2002, from http://www.ainc-inac.gc.ca/nr/spch/2000/chay_e.html

James, K., Chavez, E., Beauvais, F., Edwards, R., & Oetting, G. (1995). School achievement and dropout among Anglo and Indian females and males: A comparative examination. American Indian Culture and Research Journal, 19, 181-206.

Keeler, W. W. (1961). Challenges in Indian education. Journal of American Indian Education, 1(2). Retrieved October 1, 2002, from the Journal of American Indian Education Electronic Journal Collection (via the Xwi7xwa Library, UBC).

Kirkness, V. J. (1999). Aboriginal education in Canada: A retrospective and a prospective. Journal of American Indian Education, 39, 14-30.

Leveque, D. M. (1994). Cultural and parental influences on achievement among Native American students in Barstow Unified School District. Paper presented at the national meeting of the Comparative and International Educational Society, San Diego, CA.

Linn, R. L., & Harnisch, D. L. (1981). Interactions between item content and group membership on achievement test items. Journal of Educational Measurement, 18, 109-118.

Lipka, J., & McCarty, T. L. (1994). Changing the culture of schooling: Navajo and Yup'ik cases. Anthropology & Education Quarterly, 25, 266-284.

Lloyd, D. O. (1961). Comparison of standardized test results of Indian and non-Indian in an integrated school system. Journal of American Indian Education, 1. Retrieved October 3, 2002, from the Journal of American Indian Education Electronic Journal Collection (via the Xwi7xwa Library, UBC).

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates.
McInerney, D. M., McInerney, V., Ardington, A., & De Rachewiltz, C. (1997, March). School success in cultural context: Conversations at Window Rock. Preliminary report. Paper presented at the annual meeting of the American Educational Research Association, Chicago, IL.

McLaughlin, D. (1992). When literacy empowers: Navajo language in print. Albuquerque: University of New Mexico Press.

Messick, S. (1989a). Meaning and values in test validation: The science and ethics of assessment. Educational Researcher, 18, 5-11.

Messick, S. (1989b). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). Washington, DC: American Council on Education.

Messick, S. (1995). Validity of psychological assessment. American Psychologist, 50, 741-749.

Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159-176.

Qualls, A. L. (1995). Estimating the reliability of a test containing multiple item formats. Applied Measurement in Education, 8, 111-120.

Reckase, M. D. (1979). Unifactor latent trait models applied to multi-factor tests: Results and implications. Journal of Educational Statistics, 4, 207-230.

Rubie, C. (1999). Kia Kaha: Improving classroom performance through developing cultural awareness. Paper presented at the joint conference of the Australian Association for Research in Education and the New Zealand Association for Research in Education, Melbourne, Australia. (ERIC Document Reproduction Service No. ED441651)

Rummel, R. J. (1970). Applied factor analysis. Evanston: Northwestern University Press.

Sireci, S. G., Xing, D., Bastari, B., Allalouf, A., & Fitzgerald, C. (1999). Evaluating construct equivalence across tests adapted for use across multiple languages. Unpublished manuscript, University of Massachusetts at Amherst.

Slaughter, H. B., & Lai, M. (1994). Indigenous language immersion as an alternative form of schooling for children of Hawaiian ancestry: Lessons from a six-year study. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA. (ERIC Document Reproduction Service No. ED375637)

Stocking, M. L., & Lord, F. M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7, 201-210.

Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27, 361-370.

Swisher, K., & Deyhle, D. (1989). The styles of learning are different, but the teaching is just the same: Suggestions for teachers of American Indian youth. Journal of American Indian Education, Special Issue, 1-14.

Tabachnick, B. G., & Fidell, L. S. (1996). Using multivariate statistics. Northridge, CA: Harper Collins College Publishers.

The Fraser Institute. Report card on British Columbia's elementary schools: 2003 edition. Retrieved October 15, 2003, from http://www.fraserinstitute.ca/shared/readmore.asp?sNav=pb&id=536

Watahomigie, L. J., & McCarty, T. L. (1994). Bilingual/bicultural education at Peach Springs: A Hualapai way of schooling. Peabody Journal of Education, 69(2), 26-42.

Wesley, C. (1961). Indian education. Journal of American Indian Education, 1(1). Retrieved October 1, 2002, from the Journal of American Indian Education Electronic Journal Collection (via the Xwi7xwa Library, UBC).
Witherspoon, Y. T. (1962). The measurement of Indian children's achievement in the academic tool subjects. Journal of American Indian Education, 1. Retrieved October 1, 2002, from the Journal of American Indian Education Electronic Journal Collection (via the Xwi7xwa Library, UBC).

Yen, W. M. (1981). Using simulation results to choose a latent trait model. Applied Psychological Measurement, 5, 245-262.

Yen, W. M. (1984). Effects of local item dependence on the fit and equating performance of the three-parameter logistic model. Applied Psychological Measurement, 8, 125-145.

Yen, W. M. (1993). Scaling performance assessments: Strategies for managing local item dependence. Journal of Educational Measurement, 30, 187-214.

Zenisky, A. L., Hambleton, R. K., & Robin, F. (2003). Detection of differential item functioning in large-scale state assessments: A study evaluating a two-stage approach [Electronic version]. Educational and Psychological Measurement, 63, 51-64.

APPENDIX A
Grade 4 Numeracy Item Details

Item   # of Words   Context                               Item Type   Sub-content Area
1      21           Field Trip To Nature Park             MC          Number
2      26           Field Trip To Nature Park             MC          Patterns & Relationships
3      31           Field Trip To Nature Park             MC          Number
4      32           Field Trip To Nature Park             MC          Shape & Space
5      11           Field Trip To Nature Park             MC          Shape & Space
6      7            Field Trip To Nature Park             MC          Statistics & Probability
7      11           Field Trip To Nature Park             MC          Number
8      41           Field Trip To Nature Park             MC          Number
9      25           Field Trip To Nature Park             MC          Shape & Space
10     26           Field Trip To Nature Park             MC          Shape & Space
11     25           Field Trip To Nature Park             MC          Number
12     25           Field Trip To Nature Park             MC          Number
13     9            Field Trip To Nature Park             MC          Statistics & Probability
14     37           Field Trip To Nature Park             MC          Patterns & Relationships
15     20           Field Trip To Nature Park             MC          Patterns & Relationships
16     26           Field Trip To Nature Park             MC          Shape & Space
17     35           Field Trip To Nature Park             OE          Patterns & Relationships
18     40           Field Trip To Nature Park             OE          Statistics & Probability
19     17           Activity Day For Students & Parents   MC          Number
20     29           Activity Day For Students & Parents   MC          Patterns & Relationships
21     18           Activity Day For Students & Parents   MC          Shape & Space
22     25           Activity Day For Students & Parents   MC          Patterns & Relationships
23     19           Activity Day For Students & Parents   MC          Patterns & Relationships
24     26           Activity Day For Students & Parents   MC          Shape & Space
25     18           Activity Day For Students & Parents   MC          Number
26     19           Activity Day For Students & Parents   MC          Number
27     19           Activity Day For Students & Parents   MC          Number
28     26           Activity Day For Students & Parents   MC          Number
29     23           Activity Day For Students & Parents   MC          Number
30     25           Activity Day For Students & Parents   MC          Statistics & Probability
31     19           Activity Day For Students & Parents   MC          Statistics & Probability
32     20           Activity Day For Students & Parents   MC          Statistics & Probability
33     16           Activity Day For Students & Parents   MC          Shape & Space
34     26           Activity Day For Students & Parents   MC          Number
35     15           Activity Day For Students & Parents   OE          Number
36     30           Activity Day For Students & Parents   OE          Number

APPENDIX B
Grade 7 Numeracy Item Details

Item   # of Words   Context           Item Type   Sub-content Area
1      36           Ski Trip          MC          Number
2      34           Ski Trip          MC          Statistics & Probability
3      37           Ski Trip          MC          Number
4      25           Ski Trip          MC          Number
5      32           Ski Trip          MC          Number
6      19           Ski Trip          MC          Number
7      37           Ski Trip          MC          Number
8      28           Ski Trip          MC          Patterns & Relations
9      25           Ski Trip          MC          Statistics & Probability
10     31           Ski Trip          MC          Patterns & Relations
11     20           Ski Trip          MC          Shape & Space
12     15           Ski Trip          MC          Shape & Space
13     17           Ski Trip          MC          Shape & Space
14     16           Ski Trip          MC          Shape & Space
15     44           Ski Trip          MC          Statistics & Probability
16     33           Ski Trip          MC          Number
17     49           Ski Trip          OE          Number
18     40           Ski Trip          OE          Patterns & Relations
19     27           School Fun Fair   MC          Number
20     35           School Fun Fair   MC          Shape & Space
21     27           School Fun Fair   MC          Statistics & Probability
22     35           School Fun Fair   MC          Number
23     17           School Fun Fair   MC          Number
24     26           School Fun Fair   MC          Patterns & Relations
25     32           School Fun Fair   MC          Number
26     19           School Fun Fair   MC          Patterns & Relations
27     30           School Fun Fair   MC          Patterns & Relations
28     20           School Fun Fair   MC          Shape & Space
29     19           School Fun Fair   MC          Number
30     34           School Fun Fair   MC          Number
31     23           School Fun Fair   MC          Shape & Space
32     25           School Fun Fair   MC          Shape & Space
33     28           School Fun Fair   MC          Statistics & Probability
34     46           School Fun Fair   MC          Number
35     50           School Fun Fair   OE          Shape & Space
36     20           School Fun Fair   OE          Patterns & Relations

APPENDIX C
Grade 4 Reading Item Details

Item   # of Words   Context         Item Type   Sub-content Area
1      6            Crime Solving   MC          Identify & Interpret Key Concept & Main Idea
2      7            Crime Solving   MC          Locate, Interpret, & Organise Details
3      3            Crime Solving   MC          Locate, Interpret, & Organise Details
4      27           Crime Solving   MC          Locate, Interpret, & Organise Details
5      8            Rain            MC          Identify & Interpret Key Concept & Main Idea
6      6            Rain            MC          Locate, Interpret, & Organise Details
7      7            Rain            MC          Locate, Interpret, & Organise Details
8      12           Rain            MC          Locate, Interpret, & Organise Details
9      33           Rain            OE          Locate, Interpret, & Organise Details
10     9            House Pets      MC          Locate, Interpret, & Organise Details
11     8            House Pets      MC          Locate, Interpret, & Organise Details
12     4            House Pets      MC          Locate, Interpret, & Organise Details
13     9            House Pets      MC          Locate, Interpret, & Organise Details
14     7            House Pets      MC          Critical Analysis
15     8            Polar Bears     MC          Identify & Interpret Key Concept & Main Idea
16     8            Polar Bears     MC          Locate, Interpret, & Organise Details
17     11           Polar Bears     MC          Locate, Interpret, & Organise Details
18     16           Polar Bears     MC          Locate, Interpret, & Organise Details
19     23           Polar Bears     OE          Locate, Interpret, & Organise Details
20     5            Frogs & Toads   MC          Locate, Interpret, & Organise Details
21     7            Frogs & Toads   MC          Locate, Interpret, & Organise Details
22     9            Frogs & Toads   MC          Locate, Interpret, & Organise Details
23     12           Frogs & Toads   MC          Locate, Interpret, & Organise Details
24     10           Frogs & Toads   MC          Critical Analysis
25     6            Rabbits         MC          Locate, Interpret, & Organise Details
26     19           Rabbits         MC          Locate, Interpret, & Organise Details
27     11           Rabbits         MC          Locate, Interpret, & Organise Details
28     12           Rabbits         MC          Critical Analysis
29     19           Rabbits         OE          Critical Analysis
30     6            Memory          MC          Identify & Interpret Key Concept & Main Idea
31     8            Memory          MC          Critical Analysis
32     6            Memory          MC          Locate, Interpret, & Organise Details
33     12           Memory          MC          Locate, Interpret, & Organise Details
34     8            Memory          MC          Locate, Interpret, & Organise Details
35     8            Tree Growth     MC          Identify & Interpret Key Concept & Main Idea
36     10           Tree Growth     MC          Locate, Interpret, & Organise Details
37     9            Tree Growth     MC          Locate, Interpret, & Organise Details
38     11           Tree Growth     MC          Critical Analysis
39     26           Tree Growth     OE          Locate, Interpret, & Organise Details

APPENDIX D
Grade 7 Reading Item Details

Item   # of Words   Context               Item Type   Sub-content Area
1      15           Baseball Game         MC          Critical Analysis
2      13           Baseball Game         MC          Locate, Interpret, & Organise Details
3      21           Baseball Game         MC          Locate, Interpret, & Organise Details
4      10           Baseball Game         MC          Critical Analysis
5      10           Baseball Game         MC          Locate, Interpret, & Organise Details
6      38           Baseball Game         OE          Identify & Interpret Key Concept & Main Idea
7      10           Whales                MC          Locate, Interpret, & Organise Details
8      13           Whales                MC          Critical Analysis
9      15           Whales                MC          Locate, Interpret, & Organise Details
10     10           Whales                MC          Locate, Interpret, & Organise Details
11     8            Ponies                MC          Identify & Interpret Key Concept & Main Idea
12     12           Ponies                MC          Critical Analysis
13     15           Ponies                MC          Locate, Interpret, & Organise Details
14     14           Ponies                MC          Critical Analysis
15     9            Ponies                MC          Locate, Interpret, & Organise Details
16     26           Ponies                OE          Locate, Interpret, & Organise Details
17     5            Frogs & Toads         MC          Locate, Interpret, & Organise Details
18     7            Frogs & Toads         MC          Locate, Interpret, & Organise Details
19     9            Frogs & Toads         MC          Locate, Interpret, & Organise Details
20     12           Frogs & Toads         MC          Locate, Interpret, & Organise Details
21     10           Frogs & Toads         MC          Critical Analysis
22     8            Goldfish              MC          Locate, Interpret, & Organise Details
23     8            Goldfish              MC          Locate, Interpret, & Organise Details
24     11           Goldfish              MC          Critical Analysis
25     10           Goldfish              MC          Critical Analysis
26     14           Goldfish              MC          Critical Analysis
27     9            Goldfish              MC          Critical Analysis
28     9            Weeping Willow Tree   MC          Locate, Interpret, & Organise Details
29     11           Weeping Willow Tree   MC          Locate, Interpret, & Organise Details
30     17           Weeping Willow Tree   MC          Locate, Interpret, & Organise Details
31     13           Weeping Willow Tree   MC          Locate, Interpret, & Organise Details
32     17           Weeping Willow Tree   MC          Locate, Interpret, & Organise Details
33     12           Weeping Willow Tree   MC          Critical Analysis
34     36           Weeping Willow Tree   OE          Locate, Interpret, & Organise Details
35     5            Snakes                MC          Identify & Interpret Key Concept & Main Idea
36     8            Snakes                MC          Locate, Interpret, & Organise Details
37     5            Snakes                MC          Critical Analysis
38     9            Snakes                MC          Locate, Interpret, & Organise Details
39     6            Snakes                MC          Locate, Interpret, & Organise Details
40     6            Snakes                MC          Locate, Interpret, & Organise Details
41     9            Egypt                 MC          Locate, Interpret, & Organise Details
42     16           Egypt                 MC          Critical Analysis
43     13           Egypt                 MC          Locate, Interpret, & Organise Details
44     6            Egypt                 MC          Locate, Interpret, & Organise Details
45     6            Egypt                 MC          Locate, Interpret, & Organise Details
46     27           Egypt                 OE          Locate, Interpret, & Organise Details

APPENDIX E
Aboriginal Grade 4 Numeracy Scree Plot
[Figure: scree plot of eigenvalues by factor number; the image was not preserved in this extraction.]

APPENDIX F
Non-Aboriginal Grade 4 Numeracy Scree Plot
[Figure: scree plot of eigenvalues by factor number; the image was not preserved in this extraction.]

APPENDIX G
Aboriginal Grade 7 Numeracy Scree Plot
[Figure: scree plot of eigenvalues by factor number; the image was not preserved in this extraction.]

APPENDIX H
Non-Aboriginal Grade 7 Numeracy Scree Plot
[Figure: scree plot of eigenvalues by factor number; the image was not preserved in this extraction.]

APPENDIX I
Aboriginal Grade 4 Reading Scree Plot
[Figure: scree plot of eigenvalues by factor number; the image was not preserved in this extraction.]

APPENDIX J
Non-Aboriginal Grade 4 Reading Scree Plot
[Figure: scree plot of eigenvalues by factor number; the image was not preserved in this extraction.]

APPENDIX K
Aboriginal Grade 7 Reading Scree Plot
[Figure: scree plot of eigenvalues by factor number; the image was not preserved in this extraction.]

APPENDIX L
Non-Aboriginal Grade 7 Reading Scree Plot
[Figure: scree plot of eigenvalues by factor number; the image was not preserved in this extraction.]

APPENDIX M
Numeracy: Eigenvalues Greater than One

Grade 4 Numeracy: Eigenvalues Greater than One

Population        Factor   Eigenvalue
Aboriginal        1        5.304
                  2        1.237
                  3        1.202
                  4        1.093
                  5        1.045
                  6        1.031
Non-Aboriginal    1        5.433
                  2        1.175
                  3        1.099
                  4        1.055

Grade 7 Numeracy: Eigenvalues Greater than One

Population        Factor   Eigenvalue
Aboriginal        1        4.534
                  2        1.525
                  3        1.183
                  4        1.148
                  5        1.104
                  6        1.023
                  7        1.019
Non-Aboriginal    1        5.916
                  2        1.308
                  3        1.041
                  4        1.034
                  5        1.007

APPENDIX N
Reading: Eigenvalues Greater than One

Grade 4 Reading: Eigenvalues Greater than One

Population        Factor   Eigenvalue
Aboriginal        1        7.391
                  2        1.696
                  3        1.327
                  4        1.093
                  5        1.045
Non-Aboriginal    1        6.301
                  2        1.556
                  3        1.286
                  4        1.118
                  5        1.048
                  6        1.005

Grade 7 Reading: Eigenvalues Greater than One

Population        Factor   Eigenvalue
Aboriginal        1        5.918
                  2        1.923
                  3        1.585
                  4        1.221
                  5        1.170
                  6        1.074
                  7        1.063
                  8        1.045
                  9        1.016
Non-Aboriginal    1        5.918
                  2        1.923
                  3        1.585
                  4        1.221
                  5        1.170
                  6        1.074
                  7        1.063

APPENDIX O
Numeracy: Maximum Likelihood Estimations for the Number of Factors

Grade 4 Numeracy: Maximum Likelihood Estimations for the Number of Factors

Population        Factors   Chi-Square   df    p-value
Aboriginal        1         1322.79      464   0.00
                  2         1022.29      433   0.00
                  7         387.49       293   0.00
                  8         326.68       268   0.01
                  9         276.19       244   0.08
Non-Aboriginal    1         9033.63      464   0.00
                  2         5979.04      433   0.00
                  18        88.89        73    0.10
                  19        59.99        59    0.44
                  20        43.44        46    0.58

Note. The critical p-value was set at 0.05.

Grade 7 Numeracy: Maximum Likelihood Estimations for the Number of Factors

Population        Factors   Chi-Square   df    p-value
Aboriginal        1         1618.11      464   0.00
                  2         1009.92      433   0.00
                  8         329.97       268   0.01
                  9         282.18       244   0.05
                  10        234.76       221   0.25
Non-Aboriginal    1         11906.19     464   0.00
                  2         7440.72      433   0.00
                  18        137.09       73    0.00
                  19        91.35        59    0.00
                  20        60.77        46    0.07

Note. The critical p-value was set at 0.05.

APPENDIX P
Reading: Maximum Likelihood Estimations for the Number of Factors

Grade 4 Reading: Maximum Likelihood Estimations for the Number of Factors

Population        Factors   Chi-Square   df    p-value
Aboriginal        1         3329.90      560   0.00
                  2         1930.52      526   0.00
                  9         398.24       316   0.00
                  10        342.81       290   0.02
                  11        287.95       265   0.16
Non-Aboriginal    1         28714.35     560   0.00
                  2         15039.99     526   0.00
                  18        209.69       118   0.00
                  19        144.52       101   0.00
                  20        99.03        85    0.14

Note. The critical p-value was set at 0.05.

Grade 7 Reading: Maximum Likelihood Estimations for the Number of Factors

Population        Factors   Chi-Square   df    p-value
Aboriginal        1         4329.02      819   0.00
                  2         2323.94      778   0.00
                  8         652.92       553   0.00
                  9         581.99       519   0.03
                  10        517.56       486   0.16
Non-Aboriginal    1         40415.89     819   0.00
                  2         19827.57     778   0.00
                  22        232.16       168   0.00
                  23        191.67       148   0.01
                  24        154.00       129   0.07

Note. The critical p-value was set at 0.05.
APPENDIX Q

Pattern Matrix for Aboriginal Grade 4 Numeracy Scores

                Factor
Item        1        2        3
1        0.14     0.02     0.27
2        0.25     0.11     0.14
3        0.35     0.10     0.15
4        0.16     0.01     0.25
5       -0.04    -0.09     0.44
6        0.02     0.01     0.34
7        0.08    -0.04     0.29
8        0.05     0.00     0.32
9       -0.02    -0.06     0.37
10      -0.11    -0.02     0.34
11       0.01    -0.03     0.45
12       0.13     0.14     0.38
13       0.09     0.00     0.33
14      -0.02     0.04     0.36
15       0.31     0.10     0.20
16       0.28     0.10     0.13
19       0.42    -0.04     0.00
20       0.44    -0.02    -0.09
21       0.29    -0.04     0.04
22       0.50    -0.04    -0.02
23       0.44    -0.04     0.05
24       0.29     0.00    -0.05
25       0.44     0.04    -0.06
26       0.30     0.00     0.20
27       0.18    -0.08     0.20
28       0.38    -0.07     0.08
29       0.23    -0.20     0.07
30       0.14    -0.43     0.12
31       0.13    -0.40     0.16
32       0.33    -0.18     0.05
33       0.10    -0.11     0.05
34       0.31    -0.06     0.00

APPENDIX R

Pattern Matrix for Non-Aboriginal Grade 4 Numeracy Scores

                     Factor
Item        1        2        3        4
1        0.34    -0.04     0.14     0.09
2        0.18    -0.08     0.13    -0.06
3        0.24    -0.16     0.27    -0.15
4       -0.05    -0.09     0.34    -0.20
5       -0.01     0.09     0.48     0.21
6        0.11    -0.07     0.24    -0.12
7        0.01     0.02     0.30    -0.05
8        0.28    -0.11     0.18    -0.01
9        0.22    -0.05     0.29     0.11
10      -0.08    -0.07     0.34    -0.01
11       0.08     0.07     0.40     0.10
12       0.75    -0.18    -0.01     0.10
13      -0.03     0.00     0.37    -0.04
14       0.36    -0.08     0.12     0.06
15       0.19    -0.16     0.18    -0.26
16       0.25    -0.10    -0.01    -0.24
19       0.11     0.03    -0.01    -0.28
20       0.18     0.02    -0.12    -0.29
21      -0.08    -0.05     0.06    -0.44
22       0.18     0.06    -0.06    -0.32
23       0.15     0.10    -0.10    -0.32
24      -0.14    -0.07    -0.04    -0.56
25      -0.04    -0.13    -0.06    -0.69
26       0.82    -0.06    -0.22     0.02
27       0.53     0.07    -0.17    -0.04
28       0.00     0.08     0.20    -0.22
29       0.13     0.17    -0.05    -0.11
30      -0.13     0.78    -0.10     0.12
31      -0.15     0.67     0.00     0.06
32      -0.14     0.35     0.06    -0.15
33       0.03     0.08    -0.09    -0.15
34       0.25     0.02    -0.13    -0.22

APPENDIX S

Pattern Matrix for Aboriginal Grade 7 Numeracy Scores

                          Factor
Item        1        2        3        4        5
1        0.38     0.24    -0.02    -0.11    -0.25
2        0.14     0.15    -0.06     0.06     0.06
3       -0.06     0.31     0.00     0.39    -0.15
4       -0.09    -0.06    -0.47     0.05    -0.03
5       -0.22    -0.13    -1.03    -0.14    -0.01
6        0.47     0.04     0.00    -0.10    -0.25
7        0.04     0.30    -0.05     0.03    -0.03
8        0.11     0.52     0.03     0.02    -0.06
9        0.08     0.40     0.00    -0.05    -0.01
10      -0.04     0.43     0.00     0.04    -0.03
11      -0.16     0.38     0.08    -0.01     0.04
12       0.29     0.19     0.09    -0.12    -0.03
13       0.22     0.29     0.03    -0.06    -0.04
14       0.21     0.10    -0.01    -0.02     0.04
15       0.33     0.10    -0.02    -0.10    -0.02
16       0.15     0.11    -0.02    -0.08     0.12
19       0.40     0.10     0.06     0.10     0.07
20       0.52    -0.06     0.08     0.02    -0.01
21       0.59    -0.15     0.02     0.07    -0.06
22       0.00    -0.03    -0.02    -0.05     0.24
23       0.20     0.16     0.03     0.00     0.03
24       0.13    -0.05    -0.02     0.31     0.10
25       0.18     0.04     0.01     0.07     0.18
26       0.39     0.01     0.08     0.05     0.11
27       0.50    -0.08     0.02     0.05     0.02
28       0.19     0.22     0.06     0.05     0.10
29       0.04     0.10     0.00     0.13     0.24
30       0.03     0.09    -0.03     0.11     0.26
31       0.42    -0.06     0.04    -0.05     0.10
32       0.31     0.00     0.05     0.03     0.17
33       0.12     0.11    -0.03     0.00     0.28
34       0.20     0.08     0.00     0.06     0.20

APPENDIX T

Pattern Matrix for Non-Aboriginal Grade 7 Numeracy Scores

            Factor
Item        1        2
1       -0.14     0.53
2        0.40     0.07
3        0.48    -0.13
4        0.36    -0.02
5        0.43     0.11
6       -0.24     0.54
7        0.32     0.12
8        0.47     0.10
9        0.19     0.28
10       0.47     0.00
11       0.51    -0.14
12       0.03     0.31
13       0.19     0.07
14       0.31     0.34
15       0.06    -0.13
16       0.26     0.17
19       0.14     0.38
20      -0.01     0.36
21      -0.19     0.52
22       0.35    -0.12
23       0.13     0.40
24       0.31     0.28
25       0.25     0.16
26       0.17     0.25
27      -0.05     0.43
28       0.41     0.09
29       0.48    -0.09
30       0.48    -0.07
31       0.07     0.29
32       0.12     0.25
33       0.44     0.07
34       0.33     0.18

APPENDIX U

Pattern Matrix for Aboriginal Grade 4 Reading Scores

                Factor
Item        1        2        3
1        0.02    -0.46     0.22
2        0.14    -0.52     0.03
3        0.17    -0.63     0.07
4        0.08    -0.35     0.08
5       -0.04    -0.34     0.09
6        0.05    -0.49     0.20
7       -0.03    -0.42     0.21
8        0.02    -0.47     0.08
10       0.11    -0.66     0.23
11       0.12    -0.65     0.11
12       0.17    -0.77     0.25
13       0.01    -0.58     0.14
14       0.05    -0.61     0.01
15      -0.04    -0.41     0.19
16      -0.12    -0.44     0.13
17      -0.05    -0.31     0.12
18      -0.13    -0.31     0.13
20       0.39     0.02     0.13
21       0.39     0.08     0.12
22       0.33     0.01     0.17
23       0.24     0.06     0.07
24       0.50     0.01     0.01
25       0.33    -0.09     0.06
26       0.50     0.04     0.03
27       0.71     0.24    -0.17
28       0.54    -0.09    -0.03
30       0.49     0.08    -0.03
31       0.36    -0.02     0.14
32       0.42     0.06     0.13
33       0.44     0.01     0.09
34       0.63     0.12    -0.09
35       0.63     0.08    -0.03
36       0.36    -0.01     0.34
37       0.27     0.01     0.24
38       0.42    -0.09     0.17

APPENDIX V

Pattern Matrix for Non-Aboriginal Grade 4 Reading Scores

                Factor
Item        1        2        3
1       -0.03    -0.49    -0.09
2       -0.14    -0.40     0.09
3       -0.10    -0.40     0.09
4       -0.05    -0.39     0.05
5       -0.03    -0.36     0.03
6       -0.08    -0.56     0.07
7       -0.07    -0.59     0.09
8       -0.05    -0.51     0.00
10      -0.03    -0.29     0.45
11      -0.02    -0.36     0.33
12      -0.06    -0.33     0.59
13       0.03    -0.36     0.19
14       0.00    -0.45     0.19
15       0.06    -0.38     0.01
16       0.08    -0.47     0.02
17       0.02    -0.36     0.03
18       0.00    -0.44     0.03
20       0.26    -0.10     0.09
21       0.32    -0.09     0.09
22       0.29    -0.15     0.14
23       0.21     0.09     0.01
24       0.45     0.01     0.06
25       0.31    -0.07     0.08
26       0.44    -0.05     0.06
27       0.55     0.16     0.01
28       0.53    -0.05    -0.03
30       0.35     0.07    -0.02
31       0.26    -0.22     0.14
32       0.32    -0.05     0.12
33       0.38    -0.09     0.13
34       0.53     0.12    -0.03
35       0.55     0.11    -0.01
36       0.24    -0.34     0.23
37       0.20    -0.23     0.19
38       0.37    -0.16     0.13

APPENDIX W

Pattern Matrix for Aboriginal Grade 7 Reading Scores

                Factor
Item        1        2        3
1        0.11    -0.04     0.27
2        0.19    -0.13    -0.01
3       -0.06    -0.05     0.36
4        0.03    -0.06     0.34
5        0.08     0.03     0.25
7        0.16    -0.16     0.07
8       -0.10    -0.08     0.20
9        0.11    -0.08     0.23
10       0.13    -0.10     0.16
11       0.12    -0.08     0.09
12      -0.08     0.03     0.45
13       0.02    -0.11     0.26
14       0.02    -0.11     0.31
15       0.05    -0.13     0.19
17      -0.08    -0.68    -0.03
18      -0.15    -0.84    -0.08
19      -0.08    -0.55     0.12
20      -0.05    -0.29    -0.04
21      -0.13    -0.82    -0.10
22       0.54     0.05    -0.01
23       0.59     0.07    -0.07
24       0.25     0.05     0.11
25       0.49     0.11     0.07
26       0.20     0.05     0.06
27       0.32     0.10     0.08
28       0.27     0.02     0.35
29       0.00     0.03     0.09
30       0.18     0.02     0.28
31       0.28     0.07     0.28
32       0.08     0.05     0.20
33       0.40     0.07     0.16
35       0.50     0.05     0.02
36       0.28     0.00     0.11
37       0.66     0.01    -0.14
38       0.33     0.01     0.16
39       0.73     0.00    -0.22
40       0.47    -0.03    -0.26
41       0.34     0.04    -0.02
42       0.27     0.04     0.15
43       0.22     0.07     0.28
44       0.10     0.00     0.23
45      -0.05     0.02     0.22

APPENDIX X

Pattern Matrix for Non-Aboriginal Grade 7 Reading Scores

                Factor
Item        1        2        3
1       -0.01    -0.01     0.33
2       -0.10    -0.16     0.02
3        0.11     0.06     0.51
4        0.03     0.02     0.45
5       -0.05    -0.01     0.19
7       -0.09    -0.11     0.10
8        0.10     0.04     0.37
9       -0.03     0.00     0.36
10       0.00    -0.09     0.19
11      -0.03    -0.04     0.12
12       0.12     0.06     0.50
13       0.02    -0.01     0.36
14       0.05    -0.03     0.38
15       0.03    -0.10     0.27
17       0.12    -0.65    -0.10
18       0.15    -0.92    -0.23
19       0.10    -0.63     0.01
20       0.11    -0.31     0.02
21       0.17    -0.90    -0.24
22       0.49     0.09     0.02
23       0.57     0.08    -0.07
24      -0.19     0.08     0.17
25       0.41     0.11     0.07
26       0.13     0.07     0.16
27       0.29     0.05     0.05
28       0.26     0.11     0.39
29       0.07     0.05     0.27
30      -0.10     0.08     0.39
31      -0.16     0.12     0.41
32      -0.08     0.07     0.23
33       0.38     0.10     0.19
35       0.45     0.08     0.02
36       0.24     0.08     0.18
37       0.58     0.07    -0.06
38       0.34     0.08     0.19
39       0.70     0.05    -0.17
40       0.42     0.03    -0.23
41      -0.24     0.04     0.06
42      -0.20     0.07     0.19
43      -0.08     0.08     0.41
44      -0.03     0.04     0.36
45       0.02     0.05     0.27
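As an aid to reading the pattern matrices in Appendixes Q through X, the sketch below flags loadings whose absolute value reaches 0.30, a commonly used salience convention; the cutoff and the excerpted rows (the first three items of Appendix Q) are illustrative assumptions, not the interpretive criterion applied in this study.

import numpy as np

# Flag salient pattern loadings (rows = items, columns = factors).
# The |loading| >= 0.30 cutoff is a common convention, assumed here for
# illustration; it is not necessarily the criterion used in this study.
loadings = np.array([
    [0.14, 0.02, 0.27],   # Item 1 of Appendix Q
    [0.25, 0.11, 0.14],   # Item 2 of Appendix Q
    [0.35, 0.10, 0.15],   # Item 3 of Appendix Q
])
items = [1, 2, 3]

salient = np.abs(loadings) >= 0.30
for item, flags in zip(items, salient):
    factors = [f + 1 for f, s in enumerate(flags) if s]
    print(f"Item {item}: salient on factor(s) {factors if factors else 'none'}")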
