ANALYSIS OF CONSTRUCT COMPARABILITY IN THE PROGRAM FOR INTERNATIONAL STUDENT ASSESSMENT, PROBLEM-SOLVING MEASURE

by

Maria Elena Oliveri

M.Ed., Counselling Psychology, The University of British Columbia, 2002
B.Ed., Education, The University of British Columbia, 1997
B.A., Psychology, The University of British Columbia, 1996

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF ARTS in The Faculty of Graduate Studies (Measurement, Evaluation and Research Methodology)

THE UNIVERSITY OF BRITISH COLUMBIA

December 2007

© Maria Elena Oliveri, 2007

ABSTRACT

In Canada, many large-scale assessments are administered in English and French. The validity of decisions made using these assessments depends critically on the meaningfulness and comparability of scores obtained from the different language versions. This study examined (1) the degree of construct comparability and (2) possible sources of incomparability of the Canadian English and French versions of the Programme for International Student Assessment (PISA) 2003 problem-solving measure (PSM). Statistical analyses and qualitative linguistic reviews were used to examine construct comparability and potential sources of incomparability. These procedures sought to (1) determine the degree of comparability of the measure, (2) identify whether there are items that function differentially, and (3) identify the potential sources of differential item functioning in the two language versions of the measure. Evidence from these procedures was used to evaluate the comparability of inferences based on test scores from the PISA 2003 PSM. A comparative analysis of the two language versions indicated psychometric differences at both the scale and item levels, which may jeopardize the comparability of assessment results.
TABLE OF CONTENTS

ABSTRACT
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGMENTS

CHAPTER I: INTRODUCTION
  Overview
  Test Translation and Adaptation Effects
  Measurement Equivalence
  Stereotype Threat
  Norm Sample Representation
  Cultural Bias & Cultural Loading
  Linguistic Demand
  Purpose of Study and Research Questions

CHAPTER II: LITERATURE REVIEW
  Factors Affecting Performance Differences among Groups
  Construct Comparability
  Statistical Evaluations of Construct Comparability
  Qualitative Evaluations of DIF
  Summary

CHAPTER III: METHOD
  Overview
  Measure
    Test Design
    Data
    PISA 2003, Problem-Solving Measure
    Translation Process
    Item Format
    Scoring
  Procedure
    Stage 1 - Examination of Factor Structures
    Stage 2 - Internal Consistency
    Stage 3 - DIF Analyses
      Logistic Regression Method of DIF Detection
      Linn-Harnisch Method of DIF Detection
      Evaluation of IRT Model Assumptions
        Unidimensionality
        Local Item Dependence
        Item Fit
        Item and Test Characteristic Curves
    Stage 4 - Judgmental Review
      Selection and Training of Judgmental Reviewers
      Training Session
      Review Session of Test Items
  Summary

CHAPTER IV: RESULTS
  Data
  Descriptive Statistics of the Problem-Solving Measure
  Stage 1 - Examination of Factor Structures
  Summary
  Stage 2 - Internal Consistency
  Stage 3 - Differential Item Functioning
    Item Response Theory Based Analyses
    Evaluation of IRT Model Assumptions
      Unidimensionality
      Item Fit
      Local Item Dependence
    Identification of DIF Using IRT
    Identification of DIF Using Logistic Regression
    Correlation of Item Parameters
  Stage 4 - Judgmental Review
    Differences in Graphic Representation of Items
    Differences in Item Instructions and Questions
    Correspondence between DIF Identification and Linguistic Reviews
  Summary

CHAPTER V: DISCUSSION
  Summary
  Research Question One
  Research Question Two
  Research Question Three
  Degree of Comparability
  Implications
  Implications for Practice
  Limitations
  Contributions of Findings to Literature
  Future Directions

REFERENCES

APPENDICES
  Appendix A: Codes for the Sources of Translation Differences
  Appendix B: Judgmental Review Sample Worksheet
  Appendix C: P-values for Booklet Sets 1 and 2
  Appendix D:
    English Source Version of Problem-Solving Measure (Booklet Set 1)
    English Source Version of Problem-Solving Measure (Booklet Set 2)
    French National Version of Problem-Solving Measure (Booklet Set 1)
    French National Version of Problem-Solving Measure (Booklet Set 2)
  Appendix E: Ethics Approval

LIST OF TABLES

Table 1: PISA, 2003 Sample Sizes
Table 2: Problem-Solving Measure - Canada Sample Size
Table 3: Problem Requirements in the Problem-Solving Measure
Table 4: Problem-Solving Units and their Characteristics
Table 5: Statistical Rules for the Identification of DIF
Table 6: Judgmental Review Rating Scale
Table 7: Organization of Booklet Sets
Table 8: Sample Sizes for Booklet Sets 1 and 2
Table 9: Descriptive Statistics for Booklet Sets 1 and 2
Table 10: PCA Eigenvalues and Variance Explained for Each Factor for Booklet Set 1
Table 11: PCA Eigenvalues and Variance Explained for Each Factor for Booklet Set 2
Table 12: PROMAX Rotated Factor Loadings for Booklet Set 1
Table 13: PROMAX Rotated Factor Loadings for Booklet Set 2
Table 14: Inter-factor Correlation Matrix for Booklet Set 1
Table 15: Inter-factor Correlation Matrix for Booklet Set 2
Table 16: Reliabilities for the English & French Versions of PISA, 2003 PSM
Table 17: Evaluation of IRT Assumptions
Table 18: Q1 Goodness of Fit Results for Booklet Set 1
Table 19: Q1 Goodness of Fit Results for Booklet Set 2
Table 20: Item Pairs with Local Item Dependency (LID)
Table 21: Number of DIF Items in Booklet Sets 1 and 2 Using IRT
Table 22: Number of DIF Items in Booklet Set 1
Table 23: Number of DIF Items in Booklet Set 2
Table 24: Correlation between IRT Item Parameters
Table 25: Summary of Judgmental Review Ratings of the Graphic Representation of Items
Table 26: Summary of Review Ratings & Noted Differences in Booklet Set 1
Table 27: Summary of Review Ratings & Noted Differences in Booklet Set 2
Table 28: Correspondence of DIF with Linguistic Reviews Ratings
Table 29: Correspondence of DIF with Linguistic Reviews Ratings
Table 30: Summary of Completed Analyses by Booklet Set

LIST OF FIGURES

Figure 1: Scree Plots for English- and French-speaking Examinee Groups for Booklet Set 1
Figure 2: Scree Plots for English- and French-speaking Examinee Groups for Booklet Set 2
Figure 3: Scatter Plot of Discrimination Parameters (a) for Booklet Set 1
Figure 4: Scatter Plots of Difficulty Parameters (b1 and b2) for Booklet Set 1
Figure 5: Scatter Plot of Discrimination Parameters (a) for Booklet Set 2
Figure 6: Scatter Plots of Difficulty Parameters (b1 and b2) for Booklet Set 2

ACKNOWLEDGMENTS

Completing this thesis and my M.A. is one of my highest achievements, and I would like to thank and express my gratitude towards those people who helped me achieve this goal. I would like to express my gratitude towards my parents, Ramon and Olga Llanos, for being supportive, teaching me about the value of education, and helping me set high goals for myself. I would like to thank my husband, Gianfranco Oliveri, for continuously reminding me that it was within me to achieve this goal, and for his continued support, focus and love; my sister, Ana Maria, and my cousin, Gerardo, for long conversations about life and for their continuous support and encouragement. Finally, I would like to thank my brother, Ramon, for teaching me throughout the years how to understand and enjoy math, and for solving problems with me that at first looked too great for me to solve.

I would like to express my sincere appreciation towards Kadriye Ercikan, my supervisor and mentor. Achieving this goal would not have been possible without Kadriye's mentorship, positive attitude and lasting support, guidance and encouragement. I greatly value our long conversations and her passion towards fairness, equity, minority and large-scale assessment issues in Educational Measurement. Thank you for involving me in all areas of research, helping me set increasingly higher standards for myself and supporting me throughout the process!

I would like to thank my committee, Dr. Ruth Ervin, Dr. Monique Bournot-Trites and Dr. Kim Schonert-Reichl, for the reflective questions which helped me guide my research and improve my thesis.

I would like to thank Nicole Roy and Cathy Ionta for helping me conduct the linguistic reviews, which were an essential part of my thesis, and for their comments and insights. I am very thankful as well for friends who helped me stay motivated and interested, and with whom, most of all, I enjoyed great moments of relaxation and fun. First of all, I would like to thank my friend Dallie Sandilands for helping me become interested in this field and area of research, and for long hours spent talking and learning together. I look forward to continuing our friendship past our master's program into our doctoral program and beyond. I would like to thank Rubab Arim for her help in preparing for my thesis and in learning to use new computer programs, the CARME lab, and all of my school psych friends with whom I started this master's program: Julie, Janet, Stephen, Rashmeen, Sarah and Stacey.

CHAPTER I: INTRODUCTION

Overview

In Canada, a bilingual country, many large-scale assessments are administered in English and French. Performance data from such assessments are used to make high-stakes decisions regarding program development, evaluation, and curriculum. The validity of such decisions critically depends on the meaningfulness of scores from these assessments. Previous research has demonstrated that multilingual versions of assessments cannot be assumed to provide comparable scores (Sireci, 1997; Allalouf, Hambleton, & Sireci, 1999). In such assessments, linguistic differences and other related differences created by the adaptation process of assessment measures may affect examinee performance and may result in lack of score comparability.
Large-scale assessment measures are used in order to inform curriculum, program development and evaluation and decisions concerning educational policies, and to make comparisons of student achievement across countries. An implicit assumption made in these assessments is that constructs being measured are the same for all groups of interest, such as language groups or countries. However, construct equivalence needs to be ascertained before scores from such assessments can be used meaningfully to inform decisions that require comparison of scores across these language groups or countries (Sireci, 1997; Allalouf et al., 1999; Ercikan & McCreith, 2002; Gierl & Khaliq, 2001; Ercikan, 2007).  2 In order to make fair and valid interpretations using large-scale assessment data, it is important to ensure that constructs being measured by the test are equivalent. In other words, when factors outside of the construct measured are interpreted as real differences, practitioners may run the risk of making invalid interpretations of assessment results (Sireci, 1997). Efforts to create and use measures in a way that minimizes bias are important in order to make valid decisions and comparisons across language groups or countries. As stated in the Standards for Educational and Psychological Testing (American Educational Research Association [AERA], American Psychological Association [APA], National Council on Measurement in Education [NCME], & Joint Committee on Standards for Educational and Psychological Testing (U.S.), 1999) efforts to achieve fairness in testing and to eliminate or reduce bias should be made in order to test individuals from different linguistic and cultural backgrounds in a fair and reliable way. These efforts should include constructing technically adequate measures that require test developers and users to use and administer measures conscientiously and in a way that aims to minimize potential bias. In addition, the APA Ethical Principles of Psychologists and Code of Conduct states that psychologists are required to "use appropriate psychometric procedures and current scientific or professional knowledge for test design, standardization, validation, reduction or elimination of bias, and recommendations for use" (APA, 2002, p. 1072). In addition, as proposed by Messick (1995), validity should extend into a more encompassing construct  3 incorporating factors beyond test items in order to include examinee factors such as culture and language and the context of the assessment. Further, within Messick's more encompassing view of validity it is important to understand the types of issues in cross-cultural and cross-lingual assessments. Particularly, in order to arrive at appropriate conclusions based on assessment data, it is important to be able to interpret data within the context of the culture and environment within which it is produced. Otherwise, inappropriate interpretations which may lead to misjudgment and lack of meaningfulness and accuracy of scores may be made. As stated by Rhodes, Ochoa, & Ortiz: In order to understand the functioning of an individual on a measured task, we must first understand the influences which caused the individual to perform in the manner observed. When we fail to account for such culturally based behaviour, we run the greatest risk of identifying simple differences as serious deficits (2005, p. 136). 
There are several threats to construct equivalence that should be examined when determining whether two measures have construct equivalence and are comparable. These factors include test translation and adaptation effects, measurement equivalence, sample representation and cultural and linguistic loadings of tests (Hambleton, Merenda, & Spielberger, 2005). These factors may lead to assessment bias and may pose additional threats to test validity and comparability of scores.  4 Test Translation and Adaptation Effects When tests are translated from one language to another and are used to make comparisons across languages or cultures, there is the possibility of bias. The Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999) describes bias as the principle that: "examinees of equal standing with respect to the construct the test is intended to measure should on average earn the same test score, irrespective of group membership" (p. 74). Examinee sub-groups may be defined by race, ethnicity, gender, disability, and other characteristics. Measures that systematically lead to higher performance of one sub-group versus another may said to be biased and may occur due to deficiencies in a test itself for reasons such as lack of clarity in test instructions, inappropriately selecting items or test content, or in the way a test is scored leading systematically to favour one group over another. Bias may take place as well because items may differ in their degree of difficulty across languages. Items may be easier in one language than another, or may provide more clues in one language than the other. A hypothetical example may help illustrate this point. In an English/ French translated test given to students containing a picture with a boy holding a tennis ball as well as other distractors such as a dog playing with a ball or a girl holding a ball, students administered the assessment in French or English may have different degrees of difficulty answering the question. In English the student may be asked to point to the boy holding a ball, which is a fairly straightforward question. In French, however, the student may be asked to point to the word `ballon' which means 'large ball' in English. This item is more  5 difficult for the student answering the question in French because in fact nobody is holding a `ballon' or a large ball. As a result, differences in performance may be observed among the students answering the item in French or in English simply due to differences in cognitive requirements by the two language versions of this item. In this way, translation effects would lead to favouring one group of examinees over another, in this case the group who took the assessment in English. Measurement Equivalence The examination of measurement equivalence seeks to determine whether measures function similarly for different cultural groups. Measurement equivalence has many layers including functional equivalence, measurement unit equivalence and full score equivalence. Functional equivalence refers to the test's ability to measure the same psychological construct across cultural and linguistic groups. Functional equivalence seeks to answer whether the different language versions of tests measure the same psychological construct. Measurement unit equivalence requires that scales in each group have the same metric and that measurements be at the same interval level. 
Full score equivalence is the highest level of equivalence and it refers to instruments that have the same measurement unit across cultural groups (Hambleton et al., 2005). Difficulties with measurement inequivalence may arise out of test translation and adaptation effects of tests or may be related to other factors such as cultural differences or stereotype threat (Hambleton et al., 2005; Suzuki & Aronson, 2005). The subsections below discuss these and other potential sources of measurement inequivalence.  6 Stereotype Threat Stereotype threat refers to having "anxiety regarding one's performance in a particular domain (e.g. intelligence) based on negative stereotypes that exist in reference to one's group" (Suzuki & Aronson, 2005, p. 323). This type of anxiety is not related to the individual's ability level but instead is related to the situation in which a negative stereotype such as "Blacks are unintelligent" may be perceived to be confirmed by one's performance on a task. Stereotype threat's effects may cause standardized tests to have differential scores for different groups irrespective of ability. In addition, it may produce decreased performance on tests for various groups for whom stereotypes claim lower abilities in some domain (Aronson, 2002 as cited in Suzuki & Aronson, 2005). Stereotype threat may take place in large-scale assessments as students taking the examination may know that their scores will be compared to the scores of other students across the nation and may thus negatively influence students' results. This stereotype may exist in the case of French versus English Canadian test administrations and may lead to measurement inequivalence because English-language Canadian students have often performed better than French-language Canadian students particularly in French language minority settings (Council of Ministers of Education Council [CMEC], 2000). Norm Sample Representation One of the factors that may affect score comparability and lead to problems in fairness and bias is the representativeness of the sample on which the test was normed on. In norm-referenced interpretations of scores from large-scale  7 assessment tests, we are comparing each individual's performance with those scores obtained by a group of people who make up the norm sample. The norming sample is tested with the same measure and it is intended to be representative of the general population the norms refer to in terms of characteristics such as gender, age, intelligence, acculturation of parents, geography, disability status, race and cultural identity (Salvia & Ysseldyke, 2001). In order to be representative, the proportions of these characteristics should match the general population as closely as possible, which can usually be determined through a comparison with census data (Salvia & Ysseldyke, 2001). For the purposes of assessment of French and English Canadian students, it is important for the norming sample to be representative of these language groups in order for norm-referenced scores to be meaningful for these groups. Cultural Bias & Cultural Loading Another situation in which assessments may be measuring something other than intended abilities is when tests are culturally loaded, meaning that items on the test require "specific knowledge of, or experience with, the mainstream culture" (Rhodes et al., 2005, p. 186). 
Because many measures are developed in the U.S., they are reflections of the culture of those who developed them, and thus usually have some degree of cultural loading (Rhodes et al., 2005). Another related factor that may threaten the validity of tests is the "degree to which an individual espouses the cultural values, beliefs, and practices of a given cultural group versus that of the dominant ethnic/cultural group" or their degree of acculturation (Landrine & Klonoff, 1995, as cited in Kennepohl, Shore, Nabors,  8 & Hanks, 2004, p. 567). Studies suggest that acculturation is an important factor to consider when assessing individuals who speak different languages or come from different cultures and the effect of acculturation on performance on cognitive measures was greater for tasks with higher linguistic loading (Kennepohl et al. 2004). When a test has a high cultural loading, it may not be comparable across different cultural groups (Helms, 1992, as cited in Helms, 1997), in this case, French and English Canadian groups. Construct equivalence may be jeopardized then by having measures that are not equally applicable to both groups due to cultural or linguistic loadings or demands. Linguistic Demand A test containing a high degree of linguistic demand requires the examinee to have good receptive and expressive language abilities as the test contains a high proportion of items requiring these abilities to respond to it (Laing & Kami, 2003). Therefore, for a test containing high linguistic demands, in order to perform well, examinees must have receptive language developed enough to be able to understand directions, and expressive language abilities developed enough, in order to respond to test items. When a test with high linguistic demands is used with individuals who are culturally and linguistically diverse (CLD), the examiner may be equivocally assuming that the examinee has adequate receptive and expressive language skills. Linguistic bias can occur when there is a discrepancy between either the language or dialect used by the examinee, the language or dialect used by the examiner, or the language or dialect the examinee is expected  9 to respond in (Laing & Kami, 2003). Additional threats to validity may be brought about when tests containing high linguistic demands are used with individuals who are CLD and are not proficient enough in English to understand test directions and respond to test items. Assessment practices that are conducted in this manner lead to test results that are not representative of students' abilities. In a situation like this, the test will not be measuring the individual's abilities or what it is designed to measure, but will instead be measuring the examinee's level of English language proficiency, which can lead to erroneous decisions (Laing & Kami, 2003). Purpose of Study and Research Questions This study focused on multilingual assessments used to compare academic achievement and abilities of students within the same country receiving instruction in one of the official languages in Canada: English or French. In particular, it will examine the construct equivalence of the Programme for International Student Assessment (PISA), 2003, problem-solving measure for the two language groups in Canada. It will seek to answer the following research questions: (1) what is the degree of construct and score comparability among the Canadian English and French test administrations of PISA, 2003 problem-solving measure? 
(2) What are the potential sources of incomparability of the Canadian English and French test administrations? Establishing construct comparability is important in order to ascertain that scores from measures that are translated across languages and are administered to examinees from different cultures are comparable. In this study, statistical procedures were conducted in order to  10 determine the degree of construct comparability of PISA, 2003 problem-solving measure. These procedures sought to (1) determine whether the English and French versions of the test are comparable at the test level; (2) identify whether there are certain items that function differentially in the English and French versions of the test and (3) examine the linguistic comparability of test items to identify the potential sources of differential functioning items. Several factors may threaten score comparability including effects of test translation and adaptation, norm representation and cultural and linguistic loadings of tests. In order to determine construct comparability, equivalence at the test and item level, as well as the sources of incomparability were examined. The next chapter focuses on how these issues apply to construct comparability of the English and French versions of the test administrations.  11  CHAPTER II: LITERATURE REVIEW In this chapter, particular emphasis will be given to construct comparability and the different factors that compromise it, including construct irrelevance and the presence of differential functioning items. Statistical procedures to measure construct comparability will be described and exemplified from previous research studies conducted on the topic. As the focus of this study is on English-language and minority French language Canadian students, particular reference will be made to previous studies related to these students' performance on provincial assessments. Factors Affecting Performance Differences among Groups Particularly relevant to this research study are findings related to Englishlanguage and minority French-language Canadian students' performance on provincial assessments. Several studies indicate that minority Francophone students perform lower than their English speaking counterparts (Bussiere, Cartwright, Knighton, & Rogers, 2004; Ercikan & McCreith, 2002). Specifically, findings from a study conducted by Bussiere et al. (2004) using the Canadian English and French language test administrations of PISA, 2003 results indicate that students enrolled in the French-language school systems in Nova Scotia, New Brunswick, Ontario and Manitoba performed significantly lower in reading and science than did students in the English-language system in the same provinces. Similar findings were reported in a study conducted using results from Canadian English and French-language administrations of PISA, 2000 (Bussiere et al.,  12 2004) and in a study conducted using PISA, 2003, problem-solving measure. The latter study indicated that significant differences favouring the English-language system in Nova Scotia, New Brunswick, and Ontario were found. Several underlying reasons have been brought forward seeking to explain the differences between cultural and linguistic groups regarding differential performance in intelligence and standardized testing. Traditionally, the views of hereditarianism were used to explain White-minority groups' performance differences as stemming from genetic differences among Caucasians and minority groups (Suzuki & Valencia, 1997). 
Other researchers have conducted studies in which they have sought to determine the role of socio-economic status (SES) in explaining ethnic differences in performance. These studies suggest that when individual scores are aggregated by race-ethnicity, groups with higher SES perform higher on intelligence tests. In order to examine the degree of variance contributed by SES in explaining ethnic differences, studies have been conducted in which SES has been controlled for. These studies suggest that even when SES is controlled for, White-minority group differences variance persists yet at a lower rate. Hence, although SES may explain some of the variance, it does not explain all of it. Additional studies indicate that parental occupational status and schooling attainment are stronger predictors of performance differences than SES for some groups, particularly White students (Suzuki & Valencia, 1997). Given the different educational and occupational opportunities available for different groups, these variables may not be equally applicable to different ethnic groups and using these variables to explain ethnic differences may lead to further bias.  13 Further studies have reported that differences in home intellectual environments also contribute to differences in performance and that White groups often have more enriching home environments than Latinos or African Americans and that these differences in the home environment may be used to explain performance differences (Eamon, 2002). In addition to seeking to explain minority group performance differences in genetics, SES or the home environment, other researchers have sought to explain these differences by looking at the test instrument itself and have studied whether the test functions equally for both groups and whether it measures comparable constructs for both groups. As stated by Ford, 1998: "Regardless, of the view one holds when explaining minority students' test performance, few educators and researchers would disagree that the most important aspect of any test or assessment instrument is the degree to which it is valid and reliable." (p.8). Central to validity and reliability is construct comparability or the concept that constructs measured by a test should be comparable and that a test measures that which it intends to measure, students' ability. Several research findings suggest that instrumentation bias may have greatly contributed to differences in minority group's performance particularly test translation and test adaptation effects (Ercikan, 1998; Ercikan, 2002; Ercikan, Gierl, McCreith, Puhan, & Koh, 2004; Ercikan & Koh, 2005). The comparability of the construct assessed for the different comparison groups is central to instrumentation bias in multilingual assessments and will be described in the following subsections.  14 Construct Comparability When assessments are used to compare performance of students across countries or cultural groups, it is essential that the validity of cross-country or cross-cultural score comparisons be ascertained. Validity in multilingual and cross-country comparisons may be defined as construct comparability which requires that scores from different cultures or languages measure the same construct on the same metric (Wu, Li, & Zumbo, 2007). Score differences may be analyzed and the discrepancy across countries or cultures may be examined as truly representing performance or ability between groups only when construct comparability is achieved (Hambleton et al., 2005). 
Construct comparability refers to whether the constructs captured by the test are comparable for two population groups of equal abilities. Construct comparability may be confounded with a variety of factors including difficulties with test adaptation or translation, lack of clarity in test instructions, inappropriately selecting items or test content, curriculum differences, or in the way a test is scored leading systematically to favour one group over another. The aforementioned factors may confound construct comparability and may lead to performance differences of one group of examinees relative to another. When performance differences are due to factors that are irrelevant to the construct the test is intending to measure, validity may be threatened and scores may not be comparable. An important threat to construct validity and a source of bias for individuals or groups in test scoring and interpretation and of unfairness in test  15 use is construct-irrelevant variance. As described by Messick (1995) constructirrelevant variance may threaten construct validity and it occurs when the assessment is too broad and contains excess reliable variance associated with other constructs that may not be relevant to the construct being measured. A type of construct-irrelevant variance, particularly construct-irrelevant difficulty entails making the task unduly difficult due to the inclusion of extraneous factors into the task. For example, a fluid reasoning measure may be loaded with reading comprehension requirements that might favour one group of examinees over another (favouring those examinees that are good readers). Construct-irrelevant difficulty may be evident in PISA's 2003, problem-solving measure as the questions are heavily loaded with reading tasks. Construct-irrelevant difficulty is brought up due to the heavy demands on reading confounding the measurement of problem-solving abilities. In multilingual assessments, construct-irrelevant difficulty may be introduced due to translation differences as poor translations may make an item more difficult in one language over another or for one group versus another group. Test differences in multilingual assessments may occur at the item level as different items in a test may function differentially for one group of examinees versus another. When different patterns of response at the item level take place for groups of examinees of equal ability leading to differences in group performance, differential item functioning (DIF) may be taking place. DIF may threaten the validity of separate items on a test and may lead to lack of score comparability and biased measures. DIF may be caused by extraneous factors  16 such as increased word or passage difficulty favouring groups differently and producing differences in performance across groups. These differences may be produced by factors outside the construct being measured such as linguistic, cultural or curricular differences and may lead to invalid interpretations or inferences about group performance. DIF threatens the validity of the test, the examiners' ability to interpret scores and may lead to invalid test results (AERA, APA, & NCME, 1999). When tests have several items functioning differently across groups, differential test functioning (DTF) may take place (Hambleton et al., 2005). 
Due to DIF and DTF, construct comparability and equivalence of tests should be examined before score differences among populations can be interpreted as being due to real performance or ability differences (Sireci, 1997). DIF and DTF methods are used to identify group differences stemming from differences in construct-irrelevant sources. Statistical Evaluations of Construct Comparability at the Item Level Differential item functioning analysis is one of the methods used to examine the degree of construct comparability among different language versions of tests. DIF analysis seeks to answer the question: are group differences between assessments in multiple languages due to true differences in student abilities or group differences or are they due to construct-irrelevant variance, in other words the result of item bias? (AERA, APA, & NCME, 1999). Research studies using DIF analyses to examine construct comparability suggest that differences among translated or adapted tests may be due to linguistic differences between language versions due to different degrees of word difficulty, sentence structure,  17 grammatical and verb tense usage among different language versions of tests. The following research presents evidence of previous studies conducted using DIF analyses in multi-language assessments and particularly with Frenchlanguage and English-language Canadian test administrations. These studies indicate that generally large amounts of DIF are found between adapted or translated versions of tests and provide information about the sources of DIF (Allalouf et al., 1999; Sireci & Berberoglu, 2000; Gierl & Khaliq, 2001; Ercikan & McCreith, 2002; Ercikan, 1998, 2002, 2003). Allalouf et al. (1999) conducted statistical analyses of DIF items in multilanguage large-scale assessments in order to determine the degree of construct comparability of the Psychometric Entrance Test (PET). Although the PET was translated into several languages including Arabic, Russian, French, Spanish and English, only the original Hebrew and Russian translations were analyzed in this study. DIF findings from this study indicate that large amounts of DIF items were found, indicating that 34% of items functioned differentially across languages with some components of the test having up to 65% DIF items. Sireci and Berberoglu (2000) conducted a study comparing the English and Turkish versions of a course evaluation form and found that 3 out of 7 items (42%) functioned differentially for these two groups. Some of the causes for these differences included difficulties translating gender specific phrases and long and complex sentences. Large degrees of differences among tests administered to French-language and English-language students have also been found in several studies (Gierl &  18 Khaliq, 2001; Ercikan et al., 2004; Ercikan & McCreith, 2002; Ercikan, 1998, 1999, 2002, 2003). Gierl, Rogers and Klinger (1999) conducted a study comparing English and French students' performance on a Grade 6 social studies achievement test. DIF findings reported from this study indicate that DIF was high as 26 out 50 items (52%) displayed moderate to large DIF (Gierl et al., as cited in Gierl & Khaliq, 2001). In a study conducted by Ercikan (1999) using Trends in International Mathematics and Science Study (TIMSS) data comparing English and French Canadian students' performance in science found that 58 out of 140 items (41%) had moderate to large DIF and 29 out of 158 items (18%) displayed moderate to large DIF in mathematics. 
Similarly, a study conducted by Ercikan (2002) using TIMSS data found that 27% of items in mathematics and 37% of items in science had moderate to large DIF related to adaptation-related differences. Large degrees of DIF were also found in a study conducted by Ercikan and McCreith (2002) using TIMSS, with 37% of items showing DIF in science and 14% in mathematics. Results from this study indicate that the French-English Canadian test administrations had lower levels of DIF than the other French-English language comparisons made (England-France and United States-France), possibly due to a greater degree of cultural similarity between the French and English Canadian group administrations. Similarly, large numbers of DIF items indicating differences between two language versions of assessments were found in a study conducted using English and French versions of reading, mathematics, and science tests that were administered as part of a survey of achievement in Canada (Ercikan et al., 2004). In this assessment, 18% to 36% of items were identified as DIF, and 36% of these items were attributed to adaptation-related differences. These and other studies examining the comparability of translated or adapted versions of tests indicate that there are large numbers of DIF items when multi-language assessments are used. In order to increase construct comparability in multi-language assessments, it is important to understand DIF and its sources. Procedures to examine the sources of DIF include the use of judgmental review panels and qualitative analyses of test items.

Qualitative Evaluations of DIF

Interpreting the causes of DIF requires going beyond statistical analysis of item response data and undertaking qualitative analysis by using panels of experts or knowledgeable persons who can contribute towards identifying causes of DIF and provide guidelines for how to write and select items in test development (Allalouf et al., 1999). Judgmental reviews may lead to the identification of the reasons underlying DIF, including problems brought up during the translation of items and cultural, curricular or educational differences (McCreith, 2004). For example, in the study of the PET, Allalouf et al. (1999) used a two-stage procedure for identifying possible causes of DIF. In this procedure, a blind review of 60 items consisting of 42 DIF and 18 non-DIF items was conducted by five Hebrew-Russian translators. The translators filled out a questionnaire in order to determine whether the items displayed DIF, the degree of DIF (large or moderate), and which group the item would favour, and to identify the potential causes of DIF. In the second stage, the translators were joined by three Hebrew-speaking researchers and the findings were presented to the translators. Then, the translators were asked to present the reasons for their predictions and to reach an agreement regarding the causes of DIF for each of the 42 items (Allalouf et al., 1999). Judgmental reviews to determine the sources of DIF were also used in a study conducted by Ercikan and McCreith (2002) examining the comparability of the English and French versions of TIMSS. In this study, four bilingual English-French translators analyzed the two language versions of the test to determine the degree of differences among all items within the test. In this review, three levels were given depending on the degree of differences found between items: level 1 indicated differences that would most likely not pose comparability problems, and level 3 indicated items that might have led to comparability differences. More in-depth analyses were conducted for items identified at level 3 in order to identify the potential sources of DIF. In this study, replication analyses were conducted among three comparisons in order to examine sources of DIF. Replication studies allowed evidence to be gathered regarding whether DIF was caused by translation, curricular or cultural differences. DIF due to curricular differences was suspected when DIF favouring one group of examinees was found in two or more comparisons related to the same curricular area, such as geometry in mathematics (Ercikan & McCreith, 2002).

Results from these and other studies using qualitative analysis of items and judgmental reviews to identify the sources of DIF suggest that sources of DIF may include adaptation-related differences stemming from cultural and curricular differences, grammar, sentence structure and sentence length, among other differences. These studies show that although some of the causes of DIF were identified from the judgmental review and analysis of the curricular and translation interpretations and from replication analyses, causes for some DIF items were not found (Ercikan & McCreith, 2002). Specifically, in the study of the Hebrew and Russian versions of the PET conducted by Allalouf et al. (1999), findings indicate that some of the differences found in the 42 DIF items analyzed stemmed from: (1) changes in word or sentence difficulty from one version of the test to the other; (2) changes in content, perhaps due to an incorrect translation which might have caused a change in the meaning of the item; (3) changes in format, such as extending the length of a sentence and making it too long; and (4) differences in cultural relevance, so that even if the meaning stayed the same, the item became more or less familiar or relevant for one of the groups. In Ercikan and McCreith's (2002) study of TIMSS, sources of DIF were identified as stemming from adaptation-related problems in 22% (64 out of 296) of the items analyzed. Specifically, 17% of the mathematics items (4 items) and 27% of the science items (14 items) were attributable to DIF due to translation effects, and 17% (4 items) of the mathematics and 13% (7 items) of the science items were associated with curricular-related causes. Adaptation-related differences (as reported by TIMSS analysis) included differences in vocabulary specificity, statement clarity and level of vocabulary difficulty. Additional adaptation differences included word use, word difficulty and grammatical tenses, which may have acted differently in one language versus another, thereby making the tasks more or less difficult for certain groups. In this study, although some of the potential sources of DIF were identified from the judgmental review and analysis of the curricular and translation interpretations, causes for some DIF items were not found; DIF causes for 66% of mathematics items and 60% of science items were not explained (Ercikan & McCreith, 2002). In a study conducted by Ercikan et al. (2004), adaptation-related sources of DIF included differences related to measurement units, such as using centimetres or inches, and different ways of writing time, such as 14h00 or its equivalent, 2:00 p.m. Additional DIF sources reported by this study included adaptation-related procedures and varying degrees of student exposure to the curriculum (Ercikan et al., 2004). In another study conducted by Gierl and Khaliq (2001), four sources of DIF were identified in Canadian achievement tests administered in English and French. These sources were (a) omission or addition of words or phrases that affect meaning, (b) and (c) differences in words or expressions inherent and not inherent to the language or culture, and (d) format differences. Exposure to the curriculum may vary across schools, provinces or regions, and countries. Identifying sources of DIF is an important element of construct comparability, in order to ensure that the constructs being measured are equivalent and that differences between tests in fact reflect students' true differences in ability rather than construct-irrelevant differences.

Summary

As outlined by the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999), seminal work by Messick (1995), and several research studies (Hambleton et al., 2005; Ercikan 1998, 2002, 2003; Ercikan et al., 2004; Ercikan & McCreith, 2002), identifying ways to increase construct comparability across tests is important in order to increase fairness in testing, reduce test biases and increase comparability of scores across cultural groups. International large-scale assessments are used for decision-making purposes, allocation of resources and implementation of educational policies. Test scores from provincial assessments are used for purposes including entrance into universities and as performance indicators for comparing student academic achievement across provinces and schools. Ensuring that measures used across cultures and languages are fair and valid is central to using these assessments in a fair and representative way. Methodologies for conducting construct comparability studies have been discussed, including using statistical procedures to examine structural equivalence at the test level and DIF analysis to examine comparability at the item level. The use of judgmental reviews to identify potential sources of DIF was also outlined. Some of the sources of DIF found were due to adaptation and curricular differences, including differences in vocabulary, sentence structure, differential levels of information given by the test, and content or language that is differentially familiar to one group of examinees versus another. The next chapter will discuss the methods used to conduct the statistical analyses examining the structural equivalence of the PISA, 2003 problem-solving measure, will outline the analyses that were conducted to identify the presence and degree of DIF at the item level, and will describe the linguistic review procedures conducted to identify the potential sources of DIF. In the next chapter, further detail regarding DIF detection methods will be outlined, and further information regarding the PISA 2003 problem-solving measure (PSM) will be provided.
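Before turning to the method, the item-level logic referred to above can be sketched in code. The example below is a minimal, hypothetical illustration of logistic-regression DIF screening for a single dichotomous item, using simulated data and made-up variable names ('resp', 'total', 'group'); it follows the general nested-model approach of testing group and group-by-ability terms over a matching score, and is not the exact specification or software used in this study.

```python
# Minimal sketch of logistic-regression DIF screening for one dichotomous item.
# Hypothetical data: 'resp' (0/1 item score), 'total' (matching/ability score),
# 'group' (0 = English, 1 = French). Illustrative only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "total": rng.normal(0, 1, n),
    "group": rng.integers(0, 2, n),
})
# Simulate a response with a small uniform DIF effect, purely for illustration.
logit = 0.9 * df["total"] - 0.4 * df["group"]
df["resp"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Nested models: ability only; add group (uniform DIF); add group x ability (non-uniform DIF).
m1 = smf.logit("resp ~ total", data=df).fit(disp=0)
m3 = smf.logit("resp ~ total + group + total:group", data=df).fit(disp=0)

# Likelihood-ratio chi-square comparing the full model with the ability-only model
# (2 df: uniform and non-uniform DIF tested jointly).
lr = 2 * (m3.llf - m1.llf)
p_value = chi2.sf(lr, 2)
# Change in pseudo R-squared serves as a rough effect-size check alongside the test.
delta_r2 = m3.prsquared - m1.prsquared
print(f"LR chi2(2) = {lr:.2f}, p = {p_value:.3f}, delta pseudo-R2 = {delta_r2:.3f}")
```

In practice, each item would be screened in this way, with an effect-size criterion considered alongside the significance test before an item is flagged as displaying DIF.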
CHAPTER III: METHOD

Overview

Several methodologies have been used to analyze the construct comparability of standardized measures, including the use of exploratory factor analysis (EFA) to analyze the comparability of the internal structure of measures, the use of differential item functioning (DIF) analyses to examine the comparability of items, and the use of linguistic reviews to identify the possible sources of DIF. In this chapter, the PISA, 2003 problem-solving measure will be described, including item characteristics, sample sizes, translation process, scoring, and data sources. The procedures used to analyze the construct comparability of the measure, including IRT-based analyses, methods of DIF detection and the linguistic reviews to be conducted, will be discussed. These analyses are based on previous construct comparability studies (Hambleton et al., 2005; Ercikan 1998, 2002, 2003; Ercikan et al., 2004; Ercikan & McCreith, 2002; Allalouf, 2003; Allalouf et al., 1999), seminal work on validity by Messick (1995), and the Standards for Educational and Psychological Testing for guidelines on test construction (AERA, APA, & NCME, 1999). The purpose of this study is to (1) analyze the degree of construct comparability across the English and French versions of the problem-solving measure by using statistical analyses, including EFA at the test level and DIF analyses at the item level; (2) investigate the internal consistency of the French and English versions of the PSM; and (3) identify potential sources of incomparability by using judgmental reviews of items.

Measure

PISA is a project of the Organisation for Economic Co-operation and Development (OECD). It was developed in 1997 and was designed to provide policy-oriented international indicators of the skills and knowledge of 15-year-old students as they approach graduation. PISA is intended to assess lifelong learning ability and students' ability to apply knowledge and skills that will be needed in their future lives after leaving formal schooling. PISA is administered across 41 countries and assesses students' reading, mathematics and scientific literacy every three years (OECD, 2004). PISA is a pencil-and-paper test, and students are expected to take two hours to complete the main cognitive test, including mathematics, science, reading, and problem solving. Students were also asked to complete a questionnaire intended to take 20-30 minutes and designed to gather background data about the students' personal characteristics, preferences, aspirations and their opinions regarding characteristics of their home, family and school environments. School principals also completed a short questionnaire about broader aspects of the school context, including school demographic characteristics and quality of learning environments.

Test Design

In PISA, 2003, test items were allocated to 13 item clusters, each of which represented 30 minutes of test time. There were seven mathematics clusters and two clusters in each of the other domains: reading, science and problem-solving. Items from the different domains assessed by PISA were arranged in 13 test booklets, and a rotation design was implemented so that all students responded to clusters of items representing each cognitive domain assessed.
A linked design was thus established so that each test item appeared in four of the test booklets, which allowed estimates of item difficulties and student abilities to be obtained from the resulting student response data. Students were randomly assigned one of the booklets, which represented two hours of testing. The two-hour test booklets were arranged in two one-hour parts, and each part was made up of 30-minute time blocks. A short break was allowed between administration of the two parts of the test booklet, and a longer break was allowed between administration of the test and the questionnaire (OECD, 2005).

Data

Students participating in PISA were 15-year-old students who attended educational institutions within the country of assessment, starting in Grade 7 and higher. Criteria for student inclusion consisted of 15-year-old students who attended full-time educational institutions or attended educational institutions on a part-time basis, students who were enrolled in vocational training or other related programmes, and students attending foreign schools within the country of assessment. The 15-year-old international target population was slightly adapted in order to better fit the age structure of most northern hemisphere countries and included students aged 15 years and 3 months to 16 years and 2 months at the beginning of the assessment period. Criteria for exclusion included students who were being home schooled or who resided outside of the country of assessment. International within-school exclusion criteria consisted of students with severe intellectual and/or physical disabilities who could not have participated in the assessment and students who had received less than one year of instruction in the language of the assessment. School-level criteria consisted of schools that were inaccessible. It was required that overall exclusion rates within a country be kept below 5 per cent (OECD, 2005).

As shown in Table 1 below, in 2003 PISA administered a set of measures to a total sample of 276,165 15-year-old students across 41 participating countries. Approximately 4,000 to 10,000 students were assessed in each participating country. In Canada, approximately 28,000 15-year-old students from more than 1,000 schools participated in PISA. Canada's sample size was larger than that of other participating countries in order to obtain assessment information at the provincial level and to allow for estimates for both official language groups (OECD, 2004).

Table 1
PISA, 2003 Sample Sizes

  Total sample size:       276,165
  Average per country:     4,000 to 10,000
  Total sample in Canada:  28,000

Although close to 28,000 students in Canada took the PISA, 2003 assessment, not all of these students took the problem-solving assessment. As shown in Table 2, the numbers of examinees who took the problem-solving assessment are as follows: a total of 12,886 students took the problem-solving measure, of which 11,009 students took the measure in English and 1,877 students
Within PISA' s problem-solving assessment framework, problem-solving is defined as an individual's capacity to use cognitive processes to confront and resolve, real cross-disciplinary situations that contain solutions that are not immediately obvious (OECD, 2004). Problems presented in the assessment needed to be novel, where strategies to solve them would not have been previously learned or studied formally and where students were required to apply cross-disciplinary skills and integrate multiple content areas. Within the problem-solving measure, the following abilities were assessed: students' capacity to understand the process of solving problems set in unfamiliar situations by thinking flexibly and pragmatically including understanding a situation, identifying relevant information or constraints, representing possible alternatives or solution paths, trouble-shooting, selecting, designing and evaluating a solution strategy, checking  30 or reflecting on the solution and communicating the result (OECD, 2004). The following table presents information regarding the type of cognitive requirements presented in the measure.  31 Table 3 Problem Requirements in the Problem-Solving Measure' Decision Making Goals  ^  System Analysis & Design Trouble-Shooting  ^ ^ Diagnose and correct a To design a system or To choose under ^ ^ faulty or underidentify and express constraints from ^ relationships between parts performing system or different  ^  alternatives^of a system^mechanism Processes^Understand & Involved^identify relevant  Understand information characterizing a given  constraints, represent system, identify relevant parts of the system & and decide among possible alternatives, represent its relationships, check & evaluate the analyze or design systems,  Understand & diagnose the main features or mechanism of a system and its (mal) functioning, represent how a system functions,  decision, communicate &  check and evaluate the analysis or design of the  propose solutions, check and evaluate the  justify the decision  system, communicate the analysis or justify the  diagnosis, communicate or justify the diagnosis  design  and solution.  Sources of Number of  Number of interrelated  Number of interrelated  Complexity constraints, number  variables and nature of relationships, number and  parts of the system or mechanism and how its  type of representations used (verbal, pictorial,  parts interact, number & type of representations  numerical)  used (verbal, pictorial,  and type of representations used (verbal, pictorial, numerical)  numerical) Table modified from information presented in PISA, 2003 Problem Solving for Tomorrow's World (Figure 2.1, p. 29, OECD, 2004)  32  Translation Process Translation of items is often a major cause of items functioning poorly in international assessments and is often a greater problem than cultural biases or curricular differences (OECD, 2005). Due to the fact that the aim of PISA is to provide comparative performance information across countries, the interpretation of a scale may be biased by items that function differentially across countries. Two parallel source versions of the instruments in English and French were created. The French version was developed in the early stages of test development as items were being created by test development teams in Australia, Japan, and the Netherlands. The French version was developed through double translation or translation to the target language and back to its source language. 
This translation process served to find potential difficulties in subsequent translations to other languages, identify language ambiguities, pitfall expressions, and make further modifications to both source languages to reduce differences in translation. The English and French versions of the assessment were provided to participating countries requiring it, in this case Canada. This process facilitated consistent quality and linguistic equivalence of the PISA assessment. The final French version was further reviewed by a French domain expert for technical terminology, a French proofreader for linguistic correctness and by bilingual English/French translators for further refinements. The translation process was based on findings from studies conducted to compare the English and French versions of the assessment in PISA, 2000 which resulted in reasonably high consistency. Differences between the English and  33 French versions of the assessment were found primarily in the average length of words and sentences, with the French version containing longer words and sentences than the English version, primarily in the area of reading. These findings suggest that longer French units were (proportionally) more difficult for French-speaking students than those with slightly greater word count. Thus, although not substantial an effect of length was found on students' performance that could not be discarded (OECD, 2005). In PISA, 2000, of a total of 531 items in the field trial in all four Frenchspeaking countries or communities involved, only one reading item had translation errors leading to performance differences yet no item in the Englishspeaking countries showed translation errors. Double translation which refers to translation to the target language and back to the source language produced national versions that did not differ significantly regarding the number of translation errors from the versions derived through adaptation from one of the source languages. Number of translation errors was found to be greater with countries that did not use double translation or both sources. In addition, double translation from the English source only appeared to be effective when accompanied with crosschecks against the French source. As a result of findings from PISA, 2000, a double translation and reconciliation or crosscheck procedure using both source languages was used in PISA, 2003 (OECD, 2005). Item Format The items varied in format, some questions required students to construct their own responses, and asked students to provide a brief answer from a wide  34 range of possible answers (short-response items) or construct a longer response (open-constructed response items; OCR) allowing students to provide more divergent and individualized answers. Other sections of the test were based on students constructing their own responses but rather than being more open-ended, students were asked to construct their responses from a very limited range of possible responses (close-constructed response items; CCR) which were scored either correct or incorrect. The remaining items were multiple-choice and students were asked to select the correct response from four or five options (multiple-choice items; MC). Alternatively in complex multiple-choice items, students were provided with a series of choices and were asked to circle a word or short phrase such as a yes or no question for each choice. Students received credit for each choice answered correctly in the item (complex multiple-choice item). 
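To make the scoring implications of these formats concrete, the sketch below shows one way a complex multiple-choice item, in which credit is given for each correctly circled choice, could be scored and then collapsed into no, partial, or full credit. The function, threshold, and answer key are hypothetical illustrations and are not PISA's actual scoring rules.

```python
# Illustrative sketch: scoring a complex multiple-choice item in which a
# student circles Yes/No for each of several statements and receives credit
# for each choice answered correctly. The item key and thresholds below are
# hypothetical, not actual PISA 2003 content.

def score_complex_mc(responses, key, partial_credit_threshold=None):
    """Return the number of choices answered correctly, or a 0/1/2 credit code.

    responses : the student's Yes/No answers, e.g. ["Yes", "No", "Yes"]
    key       : the correct Yes/No answers, same length
    partial_credit_threshold : if given, collapse the count into polytomous
                               credit (all correct = 2, at least the threshold
                               correct = 1, otherwise 0).
    """
    n_correct = sum(r == k for r, k in zip(responses, key))
    if partial_credit_threshold is None:
        return n_correct
    if n_correct == len(key):
        return 2          # full credit
    if n_correct >= partial_credit_threshold:
        return 1          # partial credit
    return 0


if __name__ == "__main__":
    key = ["Yes", "No", "No", "Yes"]
    student = ["Yes", "No", "Yes", "Yes"]
    print(score_complex_mc(student, key))                              # 3 of 4 choices correct
    print(score_complex_mc(student, key, partial_credit_threshold=3))  # collapses to partial credit (1)
```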
As described in Table 2 above, in the problem-solving measure three types of problems were designed to capture problem-solving: decision making, system analysis and design, and trouble-shooting. These problem types were chosen as they are widely applicable to a variety of settings in everyday life. The 19 items used in the assessment were clustered into ten units which ranged from one to three items in length. Table 4 below displays problem-solving units and their characteristics including items in each unit, the type of problems administered and the type of answers required of examinees. Scoring criteria for each item (partial or full credit responses) is also outlined. Full credit responses refer to whether a student received either a 0 or 1 for each question (dichotomous responses) and partial credit responses refer to whether a student received a 0, 1 or 2 points for  35 each question (polytomous responses). In total, out of the 19 problem-solving items, 11 items required dichotomous responses and 8 items required polytomous responses. Hence, as it will be described later on in this chapter, DIF detection methods required the use of statistical models for dichotomous and polytomous items. Table 4 Problem-Solving Units and their Characteristics 2  Unit Name  # of items  Item type  Score ranges  Decision-Making  2  CCR  Full credit (0-1 points)  3  OCR  Partial credit (0-2 points)  2  MC  Full credit (0-I points)  System Analysis  1  CCR  Full credit (0-1 points)  & Design  4  OCR  Partial credit (0-2 points)  2  MC  Full credit (0-1 points)  2  OCR  Full credit (0-1 points)  3  MC  Full credit (0-1 points)  Trouble-shooting  Table modified from information presented in PISA, 2003 Problem Solving for Tomorrow's World (Figure 4.1, p. 61, OECD, 2004) Scoring Coding was very complex due to high numbers of responses that needed to be hand scored. For example, approximately 4,500 students per country  36 participated in PISA, 2003. A total of 41 countries resulted in approximately 140,000 responses needing to be hand scored. In order to ensure consistency of scoring a system was set up which included providing several examples of acceptable and unacceptable response patterns, selection of booklets that would be coded multiple times and local examples of responses to use in training, adaptation or translation of scoring guides as needed. In order to ensure that there was consistency in marking between coder to coder and country to country several steps were carried out. For example, an inter-country reliability study was conducted in order to ensure that countries did not differ significantly in harshness or leniency in scoring so that particular groups or countries would not be at an advantage or disadvantage when compared to other countries in the international sample. Secondly, an independent panel of centrally trained expert markers scored a sub-sample of student responses from each country in order to verify that the marking process was done in an equivalent way across countries. These scoring processes were intended to help reduce differences in scoring practices between countries and lead to increased comparability of test results across countries. 
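One common way to quantify the coder-to-coder consistency described above is an agreement index such as exact agreement or Cohen's kappa. The sketch below computes both for two hypothetical coders scoring the same set of 0-2 responses; it is purely illustrative, and the verification procedures PISA actually used are those documented in the OECD (2005) technical report.

```python
# Illustrative sketch of a marker-consistency check: exact agreement and
# Cohen's kappa between two coders scoring the same responses (0, 1, or 2).
# The data are invented and do not come from the PISA coding study.
from collections import Counter

def exact_agreement(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    n = len(a)
    categories = set(a) | set(b)
    p_obs = exact_agreement(a, b)
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement: product of the two coders' marginal proportions.
    p_exp = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    return (p_obs - p_exp) / (1 - p_exp)

if __name__ == "__main__":
    coder_1 = [0, 1, 2, 2, 1, 0, 2, 1, 1, 0]
    coder_2 = [0, 1, 2, 1, 1, 0, 2, 1, 2, 0]
    print(f"exact agreement = {exact_agreement(coder_1, coder_2):.2f}")
    print(f"Cohen's kappa   = {cohens_kappa(coder_1, coder_2):.2f}")
```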
Additional procedures conducted included single and multiple coding of booklets, marking the more complex items by up to four markers, marking a sub-sample of student responses from each country by an independent panel of centrally trained expert markers to verify that the marking process was done in an equivalent way across countries as well as other quality monitoring procedures. Results of these analyses showed that marking had been achieved consistently across countries. Requirements of raters  37 included speaking the language of the countries whose material they would score, to speak more than one language so they could score more than one country (for international coders) and to have command of the English and French language to refer back to the source languages if any disagreements in coding were to occur (OECD, 2005). Procedure As shown by previous research, in order to analyze the construct comparability of various measures, establishing construct comparability is a multi-stage process consisting of several steps. The following 4 stages were conducted in order to examine evidence of construct comparability for the English and French versions of the PSM (1) examination of factor structures in order to empirically test factor similarity or dimensions (2) investigation of internal consistency (3) analyses at the item level to identify items that function differentially through the use of DIF detection methods and (4) completion of linguistic reviews to qualitatively analyze the comparability of the two language versions of the PSM. Each of these stages and the relevant methodologies used to conduct analyses to examine construct comparability are discussed in the following section. Stage 1 - Examination of Factor Structures Examination of similarity of factor structures for groups or versions of tests is a source of evidence regarding the degree of construct comparability. In this study, factor analyses were conducted in order to examine whether the structure of the English and French Canadian versions of the PSM are the same.  38 If factor structures are the same, inferences regarding measurement of the same construct for both language groups may be made. EFA was conducted in order to examine the dimensionality and structure of the English and French Canadian versions of the PSM. EFA is a multivariate statistical procedure that is used in order to simultaneously analyze multiple relationships between data using a multiple number of variables and seeks to condense the number of variables into a smaller number of factors (Hair, Anderson, Tatham, & Black, 1998). As stated in Allalouf (2003), prior to conducting DIF analysis, equivalence of structural dimensionality needs to be established. In this study, EFA and particularly separate factor analyses for each language version of the test were conducted for each language group used in order to determine whether the 19 items within the problem-solving measure could be condensed into a smaller set of factors. The method of principal component analysis (PCA) was used to determine the minimum number of factors needed to account for the maximum portion of variance represented in the original set of variables (Hair et al., 1998). In this study, three criteria for determining the number of factors representing the variables in the data were used. These criteria included: latent root or eigenvalue criterion, scree test and percentage of variance criterion. 
The latent root criterion uses a cut-off eigenvalue of one to determine whether a factor is significant. This criterion is most appropriate when there are between 20 and 50 variables; with fewer than 20 variables it may produce too few factors, and with more than 50 it may produce too many. Secondly, the scree test criterion was used to identify the optimum number of factors that may be extracted from the data. The scree test provides a graphical representation of the latent roots, or eigenvalues, plotted against factor number. In a scree plot, the point at which the plotted line becomes close to horizontal indicates the maximum number of factors that should be extracted. The scree plot criterion usually produces a greater number of factors than the latent root criterion. Thirdly, the percentage of variance criterion was used to establish the practical significance of the derived factors by ensuring that each factor explains a minimum specified amount of variance. Generally, factors that explain less than 5% of the variance are not retained (Hair et al., 1998).

Stage 2 - Internal Consistency
Internal consistency is another source of evidence used to examine construct comparability. In this study, internal consistency, or reliability, evidence was examined to determine how accurately the purported constructs of the English and French Canadian versions of the PSM are measured: are these constructs measured with the same accuracy for each of the language groups? An internal consistency coefficient is based on the relationships among scores on the individual items within a test (AERA, APA, & NCME, 1999). To calculate an internal consistency reliability, the total test must be broken down into separately scoreable parts. As shown in Table 4, the items in the PSM included a variety of item formats, including multiple-choice and open- and close-constructed responses, and used different scoring methods (dichotomous and polytomous scoring). When a test containing items with different score ranges (polytomous scoring) is split into parts, the parts may not have the same functional length because of differences in possible scores. The Feldt-Raju coefficient is an internal consistency estimate that may be used when tests have varied item formats (Qualls, 1995) and may be expressed as

$$ F\text{-}R\,\hat{\rho}_{tt'} = \frac{\hat{\sigma}_X^{2} - \sum \hat{\sigma}_i^{2}}{\hat{\sigma}_X^{2}\left(1 - \sum \lambda_i^{2}\right)} $$

where the $\hat{\sigma}_i^{2}$ are the observed part-score variances, $\lambda_i$ represents the functional length of part $i$ (the proportion of total-score variance associated with that part), and $\hat{\sigma}_X^{2}$ is the total test variance (Qualls, 1995). In this study, the Feldt-Raju coefficient was used as the measure of reliability because the PSM contains a variety of item types and scoring points.
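As a hedged illustration of how the Feldt-Raju coefficient defined above can be computed, the sketch below treats each item as a separately scored part of a simulated mixed-format test. The data and the function are invented for this example and were not used to produce the reliability estimates reported later.

```python
# Illustrative sketch of the Feldt-Raju coefficient for a test whose parts
# (here, individual items) differ in functional length, e.g. a mix of
# dichotomous (0/1) and polytomous (0-2) items. The response matrix is
# simulated; it is not PISA data.
import numpy as np

def feldt_raju(scores):
    """scores: 2-D array, rows = examinees, columns = item (part) scores."""
    scores = np.asarray(scores, dtype=float)
    total = scores.sum(axis=1)
    var_total = total.var(ddof=1)                        # total test variance
    var_parts = scores.var(axis=0, ddof=1)               # observed part-score variances
    # Functional length of each part: its covariance with the total score as a
    # proportion of the total-score variance (the lambdas sum to 1).
    lambdas = np.array([np.cov(scores[:, i], total, ddof=1)[0, 1]
                        for i in range(scores.shape[1])]) / var_total
    return (var_total - var_parts.sum()) / (var_total * (1.0 - (lambdas ** 2).sum()))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ability = rng.normal(size=500)
    # Six dichotomous items and three 0-2 polytomous items driven by one ability.
    dich = (ability[:, None] + rng.normal(scale=1.0, size=(500, 6)) > 0).astype(float)
    poly = np.clip(np.round(ability[:, None] + rng.normal(scale=0.8, size=(500, 3)) + 1), 0, 2)
    print(f"Feldt-Raju reliability = {feldt_raju(np.hstack([dich, poly])):.3f}")
```

With parts of equal functional length this expression reduces to coefficient alpha, which is one reason it is convenient for mixed-format tests.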
Stage 3 - DIF Analyses
DIF analyses were used to compare how test items function for the two examinee groups, in this case the English and French Canadian groups. In this study, two methods of DIF detection were used to increase accuracy in DIF detection. As stated by Ercikan et al. (2004), "at least two DIF procedures should be used to establish a consistent and defensible pattern of DIF when attempting to identify items that function differentially between language groups" (p. 305). The methods used in this study were the Linn-Harnisch (LH; Linn & Harnisch, 1981) and logistic regression (LR; Narayanan & Swaminathan, 1996; Swaminathan & Rogers, 1990) methods of DIF detection. Both methods can detect uniform and non-uniform DIF.

Uniform DIF occurs when one group is favoured over the other consistently across the ability scale and is indicated by item characteristic curves that are parallel. Non-uniform DIF occurs when the direction or magnitude of the advantage varies along the ability scale (Clauser & Mazor, 1998) and is indicated by item characteristic curves that are not parallel (Swaminathan & Rogers, 1990). Because the effects of language on performance may differ along the ability scale, methods able to detect both uniform and non-uniform DIF were used in this study (McCreith, 2004). In addition, using two methods of DIF detection is important because, for example, the LR method provides tests of significance that are not provided by indexes such as the area between item characteristic curves used in item response theory (IRT) approaches (Swaminathan & Rogers, 1990).

Logistic Regression Method of DIF Detection
Logistic regression provides tests of significance and a measure of effect size. LR may be used to detect uniform and non-uniform DIF and takes into account the continuous nature of the ability scale. The LR method provides a significance test in the form of a chi-square test of improvement in model fit, with one degree of freedom for uniform DIF and two degrees of freedom for non-uniform DIF. The standard logistic regression model (Swaminathan & Rogers, 1990), derived from Bock's model (1975, as cited in Swaminathan & Rogers, 1990), may be used to detect DIF for dichotomous items. The LR model for predicting the probability of a correct response to an item is

$$ P(u_{pj} = 1 \mid \theta_{pj}) = \frac{\exp(\beta_{0j} + \beta_{1j}\theta_{pj})}{1 + \exp(\beta_{0j} + \beta_{1j}\theta_{pj})}, \qquad p = 1, 2, \ldots, n_j;\; j = 1, 2. $$

In this model, $u_{pj}$ is the response of person $p$ in group $j$ to the item, $\theta_{pj}$ is the observed ability of person $p$ in group $j$, $\beta_{0j}$ is the intercept parameter, and $\beta_{1j}$ is the slope parameter for group $j$. An item is considered unbiased when $\beta_{01} = \beta_{02}$ and $\beta_{11} = \beta_{12}$, because the LR functions for the two groups are then identical. An item exhibits uniform DIF if the LR functions for the two groups are parallel but have different intercepts, that is, $\beta_{01} \neq \beta_{02}$ and $\beta_{11} = \beta_{12}$. An item exhibits non-uniform DIF if there is an interaction between ability level and group membership, that is, $\beta_{11} \neq \beta_{12}$, so that the LR functions are not parallel. The model can also be expressed as

$$ P(u_{pj} = 1) = \frac{\exp(Z_{pj})}{1 + \exp(Z_{pj})}, $$

where

$$ Z_{pj} = \tau_0 + \tau_1\theta_{pj} + \tau_2 g_j + \tau_3(\theta_{pj} g_j), $$

and where $P(u_{pj} = 1)$ is the probability of a correct response for person $p$ in group $j$, $g_j$ refers to group membership, $\tau_0$ is the intercept, $\tau_1$ is the coefficient of ability, $\tau_2 = (\beta_{01} - \beta_{02})$ is the group effect, and $\tau_3 = (\beta_{11} - \beta_{12})$ is the interaction between group and ability (Narayanan & Swaminathan, 1996).

DIF detection using the LR method for polytomous items (Miller & Spray, 1993; French & Miller, 1996) uses the model below. Let $Y$ denote the score on the $j$th item, assume there are $K$ ordered categories with item scores $0, 1, \ldots, K-1$, and let $\pi_0, \pi_1, \ldots, \pi_{K-1}$ denote the category response probabilities, with $\sum_{k=0}^{K-1}\pi_k = 1$. The cumulative probabilities are

$$ P(Y \geq k) = \pi_k + \cdots + \pi_{K-1}, $$

where $P(Y \geq 0) = 1$ and $P(Y \geq K) = 0$. The category probabilities for $k = 0, 1, \ldots, K-1$ are

$$ P(Y = k) = P(Y \geq k) - P(Y \geq k + 1). $$

For example, for a polytomous item with three categories ($K = 3$) and item scores 0, 1, and 2, the LR models are based on $\log[(\pi_1 + \pi_2)/\pi_0]$ and $\log[\pi_2/(\pi_0 + \pi_1)]$ (Kim, Cohen, Alagoz, & Kim, 2007).
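A minimal sketch of this procedure for a single dichotomous item, using simulated data, is shown below: the baseline model conditions the item response on an observed ability proxy, adding a group term tests uniform DIF, and adding a group-by-ability interaction tests non-uniform DIF through chi-square tests of improvement in fit. The operational analyses in this study were run with dedicated DIF software, not this code.

```python
# Minimal sketch of logistic regression DIF detection for one dichotomous
# item (after Swaminathan & Rogers, 1990), using simulated data. Ability is
# proxied by an observed score and group is coded 0/1; all values are invented.
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(1)
n = 2000
group = rng.integers(0, 2, size=n)               # 0 = reference group, 1 = focal group
theta = rng.normal(size=n)                       # latent ability
# Simulate a studied item with a small uniform DIF effect against the focal group.
p_item = 1.0 / (1.0 + np.exp(-(1.2 * theta - 0.2 - 0.4 * group)))
item = rng.binomial(1, p_item)
ability = theta + rng.normal(scale=0.5, size=n)  # observed ability proxy (e.g., total score)

def loglik(X):
    """Log-likelihood of a logistic regression of the item response on X."""
    return sm.Logit(item, sm.add_constant(X)).fit(disp=0).llf

ll_base    = loglik(np.column_stack([ability]))                           # ability only
ll_uniform = loglik(np.column_stack([ability, group]))                    # + group
ll_full    = loglik(np.column_stack([ability, group, ability * group]))   # + interaction

g2_uniform = 2 * (ll_uniform - ll_base)   # 1-df test for uniform DIF
g2_nonunif = 2 * (ll_full - ll_uniform)   # 1-df test for the added interaction
g2_joint   = 2 * (ll_full - ll_base)      # 2-df test for group plus interaction
print(f"uniform DIF:         G2 = {g2_uniform:6.2f}, p = {chi2.sf(g2_uniform, 1):.4f}")
print(f"interaction (1 df):  G2 = {g2_nonunif:6.2f}, p = {chi2.sf(g2_nonunif, 1):.4f}")
print(f"group + interaction: G2 = {g2_joint:6.2f}, p = {chi2.sf(g2_joint, 2):.4f}")
```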
In this study, the LR method for polytomous and dichotomous items, for both uniform and non-uniform DIF, was used to analyze the presence of DIF using the EZDIF software (Waller, 1998).

Linn-Harnisch Method of DIF Detection
In this study, the Linn-Harnisch (LH) method was applied to item parameters to detect uniform and non-uniform DIF. The LH method of DIF detection involves computing, within ability deciles, the difference between the observed and model-predicted probabilities of responding correctly to an item (p_diff). Based on this difference, a chi-square statistic is obtained and converted into a Z-statistic. A negative difference indicates DIF against the language sub-group and a positive difference indicates DIF in favour of the sub-group (Ercikan, 2007). Items are flagged either in favour of or against a language sub-group according to the criteria used by PARDUX (Burket, 1991), outlined in Table 5. Level 2 and Level 3 DIF indicate that parameters are not invariant across the two groups: Level 2 DIF indicates medium level poor fit and Level 3 DIF indicates serious, or high level, poor fit.

Table 5
Statistical Rules for the Identification of DIF

DIF Level   Rule                               Implications
Level 1     |Z| < 2.58                         No DIF
Level 2     |Z| > 2.58 and |p_diff| < 0.10     DIF (medium level poor fit)
Level 3     |Z| > 2.58 and |p_diff| > 0.10     Serious DIF (high level poor fit)

In this study, DIF analyses were conducted using the three-parameter logistic (3PL) model (Lord, 1980) for the multiple-choice items and the two-parameter partial credit (2PPC) model (Yen, 1993) for the open- and close-constructed responses. Joint calibration of the items was conducted using PARDUX (Burket, 1991). Analyses of the items were conducted by estimating parameters for the French and English groups in order to determine the degree of DIF between the two language groups.

In the three-parameter logistic (3PL) model (Lord, 1980), the probability that a person with ability $\theta$ responds correctly to item $i$ is

$$ P_i(\theta) = c_i + (1 - c_i)\,\frac{e^{1.7a_i(\theta - b_i)}}{1 + e^{1.7a_i(\theta - b_i)}}, \qquad i = 1, 2, \ldots, n. $$

In this model, $a_i$ is the item discrimination parameter, $b_i$ is the item difficulty parameter, and $c_i$ is the pseudo-chance-level parameter, which represents the probability that examinees with very low ability respond to the item correctly (Hambleton, Swaminathan, & Rogers, 1991).

The two-parameter partial credit (2PPC) model (Yen, 1993) is a special case of Bock's (1972) nominal model, in which the probability that an examinee with ability $\theta$ obtains a score at the $k$th level of the $j$th item is

$$ P_{jk}(\theta) = P(X_j = k - 1 \mid \theta) = \frac{\exp(Z_{jk})}{\sum_{i=1}^{m_j}\exp(Z_{ji})}, \qquad k = 1, \ldots, m_j, $$

where

$$ Z_{jk} = A_{jk}\theta + C_{jk}. $$

In this study, the special case of the 2PPC model was used, which includes the following constraints:

$$ A_{jk} = \alpha_j(k - 1) $$

and

$$ C_{jk} = -\sum_{i=0}^{k-1}\gamma_{ji}. $$

In this model, $\gamma_{j0} = 0$, and $\alpha_j$ and the $\gamma_{ji}$ are free parameters to be estimated from the data. The first constraint implies that items can vary in their discriminations and that higher item scores reflect higher ability levels. In the 2PPC model, for each item there are $m_j - 1$ independent $\gamma_{ji}$ difficulty parameters and one $\alpha_j$ discrimination parameter; a total of $m_j$ independent item parameters are estimated (Ercikan, Schwarz, Julian, Burket, Weber, & Link, 1998).

Evaluation of the IRT Model Assumptions
Three primary assumptions apply to IRT models: (1) unidimensionality, (2) local item independence, and (3) model fit.

Unidimensionality. The assumption of unidimensionality states that only one ability should be measured by the items making up the test.
Due to the fact that cognitive, personality and test-taking factors often involve more than one factor, the assumption of unidimensionality is said to be met when there is the presence of one primary component in the data measured (Hambleton et al., 1991). In this study, EFA was used in order to examine whether there is only one factor representing the data.  48 Local item dependence. The assumption of local item dependence (LID) refers to whether students' responses are independent of each other. In this study, this assumption was examined by using the Q3 statistic (Yen, 1984) in PARDUX (Burket, 1991). The Q3 statistic measures the correlation between two items in a test. When items are locally independent their correlation is expected to be 0 and items are flagged as being locally dependent at Q3 > .20 (Ercikan et al., 1998). Item fit. The assumption of item fit may be examined by using the Qi statistic (Yen, 1981) and it is used to test the null hypothesis that there is no statistical difference between the model and the observed responses. In this study, PARDUX (Burket, 1991) was used to estimate item parameters and to examine the Q/ statistic. The Q, statistic sums over all the standardized residuals for all the different ability groups, and its distribution is measured as  i  with degrees of  freedom equal to the number of ability groups minus the number of parameters in the model. The Qi statistic can be standardized in the form of a Z score (Fitzpatrick, Link, Yen., Burket, Ito, & Sykes, 1996). Z-values greater than 4.6 should be flagged as having poor fit (Ercikan et al., 1998). Item and Test Characteristic Curves A function of IRT is to describe relationships between the performance of examinees on an item and their ability underlying their performance on the item. This relationship may be described by a monotonically increasing function (an item characteristic curve) that exemplifies an examinee's performance on a test item based on the examinee's ability. An item characteristic curve (ICC) visually displays the way an item functions for examinees, and portrays the probability of  49 answering an item correctly at various levels of ability (Clauser & Mazor, 1998; McCreith, 2004). Stage 4 - Judgmental reviews Whereas DIF procedures serve to answer whether there are differences in performance between two language groups for examinees who are expected to perform at the same level (French and English speaking Canadian students) and to detect the presence of DIF, DIF procedures do not provide information regarding potential sources of DIF. The link between DIF, bias, and impact is largely methodological: statistical analyses are used to identify items with DIF, and judgmental analyses are used to determine if DIF is attributable to curricular, cultural or linguistic differences. Reviews intended to evaluate the fairness of an item cannot proceed as either a statistical or a judgmental analysis; both procedures are needed (Linn, 1993; Allalouf et al., 1999). In this study, judgmental review procedures primarily linguistic reviews were conducted based on previous research (Allalouf et al., 1999; Gierl & Khaliq, 2001; Ercikan & McCreith, 2002). 
Selection and Training of Judgmental Reviewers Guidelines related to translating or adapting tests into other languages state that translators should be (a) familiar with principles of good test development practices (b) familiar with the construct being assessed and (c) familiar and competent with the source and target languages and cultures involved (Geisinger, 1994; van de Vijver & Hambleton, 1996; Hambleton & Patsula, 1998). In conducting these analyses the use of more than one translator has been  50 recommended to allow for dialogue to resolve the different points that arise when preparing a test adaptation or translation (Hambleton & Patsula, 1998). In this study, two reviewers conducted linguistic reviews to examine the comparability of the English and French versions of the PSM. The reviewers were bilingual teachers, who met the criteria described above. They were familiar with principles of test development practices, the construct being assessed and were familiar and competent in English and French. A training session was conducted in order to help the reviewers familiarize themselves with the process of the linguistic review and provide information regarding past findings regarding potential sources of NT, Training Session A training session was conducted in order to explain to reviewers the process of a linguistic review, examples of translations and adaptation errors were provided as shown in Appendix A and were based upon previous research (Geisinger, 1994; van de Vijver & Hambleton, 1996; Hambleton & Patsula, 1998). Reviewers were debriefed regarding the purpose of the study (to examine the comparability of the English and French Canadian administration of PISA 2003 PSM), the purpose of the linguistic review was highlighted (to examine the items of the two language versions of the PSM to detect whether there were differences that may have led to performance differences between the two language groups). At the beginning of the session, reviewers were provided with a list containing information regarding types and examples of translation errors previously made  51  when translating and adapting tests (see Appendix A). This list was used as a guide for the linguistic review. The reviewers were subsequently introduced to the rating system that was used when reviewing the comparability of the two language versions of PISA 2003 PSM, presented in Table 6. Reviewers were asked to review, examine and  compare all items for each language version of the PSM simultaneously and identify whether there were differences between the two language versions. If differences were found, reviewers were instructed to rate the degree of differences between the two language versions according to rating criteria presented in Table 6 and to determine whether the differences would lead to performance differences between the two language groups. Reviewers were instructed to pay particular attention to differences between items related to the translation of test items (e.g., appropriate word use, similarity of meaning, word frequency, length, etc.), presentation format and familiarity to tasks or graphics used in both language versions of the test (Ercikan, 2003). Reviewers were asked to record the results of their qualitative analyses in a worksheet as presented in Appendix B.  
52 Table 6 Judgmental Review Rating Scale  Rating^Meaning associated with rating 0^No difference between both test versions 1^Minimal differences between the two versions 2^Clear differences between the two versions which may not necessarily lead to differences in performance between the groups 3^Clear differences between the two versions which would likely lead to differences in performance between the groups Review Session of Test Items A comprehensive review of the items from the two language versions of the test was conducted by the two reviewers. Reviewers were given copies of the items for each language version of the test, the English Canadian version was the original source version developed by PISA and the French Canadian version contained the modifications that were made by PISA Canadian national consortium. Reviewers completed a comprehensive review of test items in order to examine potential reasons for differences in performance between the two language groups. The reviewers were asked to identify items that were different between the two language versions and to judge whether this difference would be likely to provide one group of examinees with an advantage if the item impacted their performance positively or at a disadvantage if the item impacted their performance negatively. Each reviewer had their own set of materials and  53 worked independently. The reviewers then worked collaboratively if their review ratings were different. A discussion took place in French in order to identify the sources of these differences, until a consensus was arrived upon regarding the nature and degree of differences and the way in which the difference would impact the performance of one group or the other. These differences were recorded in the worksheet provided to the reviewers (see Appendix B). The linguistic review provided information regarding the qualitative comparability of the items of the two language versions of the PSM and served to complement the statistical analyses conducted in order to evaluate the comparability of the test items. Summary In this chapter, PISA, 2003 problem-solving measure was described including sample sizes, scoring and the translation process, among other characteristics of the measure. Statistical analyses procedures were described including EFA at the test level, DIF detection methods at the item level, particularly LH and LR methods of DIF detection, and the linguistic review process to identify potential sources of DIF was outlined. In the next chapter, results of these analyses are presented in order to answer the following two research questions (1) are the French and English versions of PISA, 2003 comparable to each other and (2) what are its potential sources of incomparability.  54 CHAPTER IV: RESULTS In this chapter, results of the study are presented. Descriptive statistics findings are presented first followed by results from statistical analyses conducted at the test and item levels. Qualitative analyses conducted to identify the potential sources of DIF are presented last. In order to examine evidence of construct comparability for the English and French Canadian versions of the PSM analyses were conducted in four stages. In stage 1, statistical analyses were conducted at the scale level data in order to determine whether the dimensionality and test structure of the PSM were the same for the two language versions. 
In stage 2, internal consistency was examined using Feldt-Raju internal consistency coefficient to determine measurement accuracy of constructs represented in the PSM and to determine whether these constructs were measured with the same level of accuracy for the two language groups. In Stage 3, DIF analyses including Linn-Harnisch (LH) and Logistic Regression (LR) DIF detection procedures were conducted in order to identify items that function differentially for the two language versions of the test. In Stage 4, linguistic review studies were completed in order to examine linguistic comparability of the two language versions of the PSM and to determine possible sources of DIF. Data The previous sections describe the samples of the English and French speaking Canadian students who took the PISA 2003, problem-solving measure. All items included in both versions of the measure were the same except for the language in which the measure was written. The PSM consisted of a total of 19  55 questions which were divided into booklets, Booklets 3, 9, 11 and 12 (which will be referred to as booklet set 1) contained 10 common questions and Booklets 4, 10, 12 and 13 (which will be referred to as booklet set 2) contained 9 common questions. The analysis of results of booklet sets 1 and 2 was conducted separately. Type of questions and the total number of questions per booklet set are presented in Table 7. The actual questions contained in the English and French PSM are presented in Appendix D and are organized by booklet set. Table 7 Organization of Booklet Sets Booklets  ^  Questions in Each Booklet^Number of Questions  Set 1 (Booklets^Design by Numbers 1 — 3;^10 3, 9, 11 & 12)^Children's Camp; Freezer 1 & 2; Energy Needs 1 & 2; Cinema Outing 1 & 2  Set 2 (Booklets 4, 10, 12 & 13)  Library System 1 & 2;^9 Course Design; Transit System; Holiday 1 & 2; Irrigation 1 — 3  Descriptive Statistics of the Problem-Solving Measure Approximately 12,886, 15 year-old students took the PSM; 11,009 students took the measure in English and 1,877 took the measure in French. Approximately 50% of students took booklet set 1 and the remaining 50% took  56 booklet set 2. In order to have comparable sample sizes for analyzing each booklet set, English-speaking Canadian students were randomly selected to match the sample sizes of the French-speaking Canadian student group. The sample sizes and descriptive statistics for English- and French-speaking students are presented in Tables 8 and 9. The sample size for booklet set 1 (n=1,078) consisted of slightly more male students for the English-speaking examinee group than the French-speaking group (n=526 and n=505, respectively) and slightly fewer females (n=552 and n=573, respectively). The sample size for booklet set 2 (n=1,116) consisted of slightly fewer male students for the English-speaking examinee group than the French-speaking group (n=541 and n=544, respectively) and slightly more female students (n=575 and n=572, respectively). Table 8 Sample Sizes for Booklet Sets 1 and 2 English^  French  Booklet Set^Male Female Total^Male^Female^Total 1  526  552  1078  505  573  1078  2  541  575  1116  544  572  1116  There were small differences between raw scores of the two language groups on the problem-solving measure but they were not statistically significant.  
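The comparison summarized in Table 9 below can be illustrated with a short sketch: the larger English-speaking group is randomly subsampled to the size of the French-speaking group, and mean raw scores are then compared with an independent-samples t-test. The scores generated here are simulated values loosely based on the booklet set 1 summary statistics, not the PISA data.

```python
# Illustrative sketch: match group sizes by random subsampling, then compare
# mean raw scores with Welch's t-test. Simulated scores, not the PISA data.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)
english_all = rng.normal(loc=7.55, scale=3.38, size=5500)   # larger pool of scores
french      = rng.normal(loc=7.54, scale=3.41, size=1078)

# Randomly select an English subsample equal in size to the French group.
english = rng.choice(english_all, size=french.size, replace=False)

t, p = ttest_ind(english, french, equal_var=False)
print(f"English mean = {english.mean():.2f} (SD {english.std(ddof=1):.2f})")
print(f"French  mean = {french.mean():.2f} (SD {french.std(ddof=1):.2f})")
print(f"Welch t = {t:.2f}, p = {p:.3f}")
```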
Table 9
Descriptive Statistics for Booklet Sets 1 and 2

Booklet Set   # of items   English n   English Mean (SD)   French n   French Mean (SD)   Difference
1             10           1074        7.55 (3.38)         1069       7.54 (3.41)        0.01
2             9            1100        9.06 (4.72)         1083       9.09 (4.71)        -0.03

Stage 1 - Examination of Factor Structures
Exploratory factor analysis (EFA) was used to evaluate the structural equivalence of the problem-solving measure for the English- and French-speaking Canadian examinee groups. The method of extraction used was principal component analysis (PCA) with an oblique rotation (PROMAX). SPSS (version 13.0) was used for this analysis. The factor analysis results are shown in Tables 10 to 15 for booklet sets 1 and 2. Information on eigenvalues, the percentage of variance accounted for by the unrotated factor solution, and the cumulative percentage of variance accounted for by each subsequent factor for each language version of the test is presented in Table 10 for booklet set 1 and Table 11 for booklet set 2. The results displayed in Tables 10 and 11 show that the English version of the PSM has three factors representing the data and the French version has two factors. In both the English and French versions of the assessment there is one main factor representing the assessment data. The primary factor for the English version of booklet set 1 has an eigenvalue of 2.70, which accounts for 27.02% of the variance, and the primary factor for the French version of booklet set 1 has an eigenvalue of 3.66, which accounts for 36.60% of the variance. Likewise, as shown in Table 11, the primary factor for the English version of booklet set 2 has an eigenvalue of 3.00, accounting for 33.40% of the variance, and the primary factor for the French version of booklet set 2 has an eigenvalue of 3.79, which accounts for 42.16% of the variance.

Table 10
PCA Eigenvalues and Variance Explained for Each Factor for Booklet Set 1

                         English                                    French
Factor   Eigenvalue   % of variance   Cumulative %   Eigenvalue   % of variance   Cumulative %
1        2.70         27.02           27.02          3.66         36.60           36.60
2        1.41         14.06           41.09          1.24         12.36           48.96
3        1.11         11.12           52.21

Table 11
PCA Eigenvalues and Variance Explained for Each Factor for Booklet Set 2

                         English                                    French
Factor   Eigenvalue   % of variance   Cumulative %   Eigenvalue   % of variance   Cumulative %
1        3.00         33.40           33.40          3.79         42.16           42.16
2        1.05         11.71           45.11          1.07         11.89           54.05
3        1.02         11.34           56.45

In this study, scree plots were used to identify the optimum number of factors to be extracted from the data. Scree plots for English- and French-speaking Canadian students for booklet sets 1 and 2 are shown in Figures 1 and 2, respectively. The scree plots in Figures 1 and 2 show that the data may be represented by one dominant factor, as indicated by the sharp drop in the plotted eigenvalues after the first factor for both the English- and French-speaking examinee groups.

[Two scree plots: eigenvalue by component number for the English-speaking and French-speaking examinee groups, booklet set 1.]
Figure 1. Scree Plots for English- and French-speaking examinee groups for Booklet Set 1.
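For illustration, the sketch below mirrors the logic behind Tables 10 and 11 and the scree plots in Figures 1 and 2: it extracts principal-component eigenvalues from a simulated item-score matrix, reports the percentage and cumulative percentage of variance, and applies the eigenvalue-greater-than-one rule. The operational analyses were run separately for each language group in SPSS; this code is only a stand-in with invented data.

```python
# Illustrative PCA sketch: eigenvalues of the item correlation matrix,
# percentage of variance per component, and the latent-root (eigenvalue > 1)
# criterion. The item responses are simulated, not the PISA data.
import numpy as np

rng = np.random.default_rng(3)
n_examinees, n_items = 1078, 10
ability = rng.normal(size=n_examinees)
# Ten items that all draw on one dominant dimension plus item-specific noise.
loadings = rng.uniform(0.4, 0.8, size=n_items)
scores = ability[:, None] * loadings + rng.normal(size=(n_examinees, n_items))

corr = np.corrcoef(scores, rowvar=False)               # item correlation matrix
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]  # principal component eigenvalues
pct = 100 * eigenvalues / n_items                      # trace of a correlation matrix = n_items
cumulative = np.cumsum(pct)

for i, (ev, p, c) in enumerate(zip(eigenvalues, pct, cumulative), start=1):
    print(f"Component {i:2d}: eigenvalue = {ev:5.2f}, % variance = {p:5.2f}, cumulative = {c:6.2f}")
print("Retained by the eigenvalue > 1 rule:", int((eigenvalues > 1).sum()))
# A scree plot is simply these eigenvalues plotted against component number;
# the point where the curve flattens suggests how many components to keep.
```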
61  Scree Plot English-Speaking Examinee Group: Booklet Set 2  I^1^1^1^r^1^r^1^1 2^3^4^5^6  1  7  8  9  Component Number  Scree Plot French-Speaking Examinee Group: Booklet Set 2  1  ^  2^3^4^5^6  ^  Component Number  Figure 2. Scree Plots for English- and French-speaking examinee groups for Booklet Set 2.  1  9  62 Factor loadings for the two factors for the English and French versions of the PSM are summarized in Tables 12 and 13 for booklet sets 1 and 2, respectively. Results displayed in Tables 12 and 13 show rotated factor loadings using PROMAX oblique rotation. For example, Table 12 shows that item 9 has a factor loading of .75 for the English version group and .77 for the French version. Results displayed in Table 12 (booklet set 1) show that items 1 — 4 and 7 — 9 have similar factor loadings for the two language groups. Some differences in factor loadings were found for items 5 and 6 as these items loaded on one factor (factor 1) for the English version and on two factors for the French version. Item 10 had higher factor loadings for the French version for factor 1 (.71) than the English version (.43). Results displayed in Table 13 (booklet set 2) show that items 1 - 4 and 7 — 9 had similar factor loadings for the English and French versions of the PSM. Some differences in factor loadings were found for items 5 and 6 as these items had similar loadings for the first factor but showed some differences in factor loadings for the second factor. Overall, greater similarities were found in factor loadings for booklet set 2 than booklet set 1, 7 out of 10 items (70%) were found to have similar factor loadings for booklet set 1 and 7 out of 9 items (78%) were found to have similar factor loadings for booklet set 2.  63 Table 12 PROMAX Rotated Factor Loadings for Booklet Set 1 English  French  Item #  1  2  1  2  1  .51  .57  .56  .58  2  .47  .62  .61  .60  3  .60  .52  4  .49  .64  5  .72  .79  .42  6  .70  .80  .44  7  .32  .38  .55  8  .49 .65  .67  9  .75  .77  10  .43  .71  Notes. Factor loadings less than .30 are omitted.  64 Table 13 PROMAX Rotated Factor Loadings for Booklet Set 2 French  English Item #  1  1 2  .40  3  2  1  2  .68  .31  .68  .62  .36  .63  .69  .40  .73  4  .43  .70  .34  .67  5  .61  .34  .66  .54  6  .57  .37  .61  .55  7  .73  .35  .87  .43  8  .75  .31  .83  .42  9  .62  .19  .79  .33  Notes. Factor loadings less than .30 are omitted.  The inter-factor correlation matrices for booklet sets 1 and 2 are shown in Tables 14 and 15, respectively. Table 14 shows that the correlation between the first and second factor for the English-speaking group is (.30) and it is (.46) for the French-speaking group for booklet set 1. As shown in Table 14, higher interfactor correlations were found for the French version of the PSM (.46) than for the English version (.30) for booklet set 1. Table 15 shows that the correlation between the first and second factor for the English-speaking group is (.46) and it is (.53) for the French-speaking group for booklet set 2. As shown in Table 15 as in booklet set 1, higher inter-factor correlations were found for the French version  65 of the PSM (.53) than for the English version (.46) for booklet set 2. Consistent with findings from factor analysis loadings outlined above, higher inter-factor correlations were found for booklet set 2 (.53 and .46) than for booklet set 1 (.46 and .30) for the French and English versions of the test, respectively. 
Table 14 Inter-factor Correlation Matrix for Booklet Set 1 French  English Factor  1  2  1  2  1  1.00  .30  1.00  .46  2  .30  1.00  .46  1.00  Table 15 Inter-factor Correlation Matrix for Booklet Set 2 French  English Factor  1  2  1  2  1  1.00  .46  1.00  .53  2  .46  1.00  .53  1.00  Summary In this section, results related to the examination of the factor structure similarity between the tests administered in English and French were presented. Factor solutions for each language version of the test for each booklet set were presented in Tables 10 and 11. One primary factor was found to represent the  66 data for the French and English versions of the measure for each booklet set. Discrete differences were found for the English and French versions of the measure as the English version was represented by three factors and the French version was represented by two factors. Similarity of factor loadings was found for most items between the English and French versions of the measure for each booklet set. Inter-factor correlations show that the French version has a higher degree of inter-correlation for both booklet sets as compared to the English version. Overall, greater similarities in factor loadings and inter-factor correlations were found for booklet set 2 than for booklet set 1. Stage 2 - Internal Consistency Results of analyses related to internal consistency, are presented in this section for the two language groups and for each booklet set. These analyses were conducted to examine how accurately constructs assessed by the PSM were measured and whether these constructs were measured with the same level of accuracy for the two language groups. In order to calculate the internal consistency of the PSM, Feldt-Raju (Qualls, 1995) was used. This method of internal consistency estimation was chosen because the problem-solving measure contained both dichotomous and polytomous items. Table 16 shows reliability estimates calculated separately for each language version of the test and for each booklet set. These estimates were compared in order to examine differences in the internal consistency of the scores between each language version and booklet set. Table 16 shows that for booklet set 1 the reliability estimates were the same  67 for the English and French language groups (.73) and were very similar for booklet set 2 (.96 for the English group and .95 for the French group). Table 16 Reliabilities for the English & French versions of PISA, 2003 PSM Booklet Set^# of Items^English^French 1  10  .73  .73  2  9  .96  .95  These results indicate that measurement accuracy was very similar for the English and French language versions of the PSM. Booklet set 2 had high internal consistency whereas booklet set 1 had moderate internal consistency. Higher reliability estimates were found for booklet set 2 than booklet set 1. Stage 3 - Differential Item Functioning DIF detection procedures were used to identify whether the test items had similar psychometric properties for the two language groups. IRT based LH (Linn & Harnisch, 1981) and LR (Swaminathan & Rogers, 1996) DIF between language groups were used to detect items containing DIF for booklet sets 1 and 2. The two DIF detection procedures were used to verify and confirm the DIF status of the items. Item Response Theory Based Analyses IRT analyses were conducted using PARDUX (Burket, 1991). Model assumptions were analyzed to examine unidimensionality, model fit and local  item dependence.  
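Two of the quantities behind these checks can be sketched briefly: the 3PL probability used to obtain model-expected item scores, and a Q3-style correlation of item residuals as an index of local dependence. The item parameters and data below are simulated for illustration; the operational calibration and fit statistics were produced by PARDUX.

```python
# Illustrative sketch: 3PL model probabilities and a Q3-style residual
# correlation between two items (observed score minus model-expected score).
# PARDUX performs the operational calibration; this code only simulates data.
import numpy as np

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response (with the 1.7 scaling constant)."""
    return c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))

rng = np.random.default_rng(4)
n = 2000
theta = rng.normal(size=n)

# Hypothetical item parameters (discrimination a, difficulty b, pseudo-guessing c).
items = [(0.9, -0.5, 0.20), (1.1, 0.3, 0.18)]
responses = np.column_stack(
    [rng.binomial(1, p_3pl(theta, a, b, c)) for a, b, c in items])

# Q3-style index: correlate the two items' residuals after removing the
# model-expected performance at each examinee's ability.
residuals = np.column_stack(
    [responses[:, i] - p_3pl(theta, *items[i]) for i in range(2)])
q3 = np.corrcoef(residuals, rowvar=False)[0, 1]
print(f"Q3 for the item pair = {q3:.3f}  (values above 0.20 would be flagged)")
```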
68 Evaluation of the IRT Model Assumptions There are three assumptions related to the use of IRT models: (a) unidimensionality of the ability being measured, (b) local independence of examinee's responses to the items in a test, and (c) fit of item responses to the IRT model (Hambleton et al., 1991). Results of the evaluation of these assumptions are presented in Table 17. Table 17 Evaluation of IRT Assumptions # of Items  1  10  Yes  2  # of Item-Pairs with Local Dependency (Q3) 2 2  2  9  Yes  3  6  Booklet Set  Unidimensional  # of items with poor fit (0) 1  'Poor fit is defined as Z>4.60. Local dependence is defined as Q3 >0.2000.  2  Unidimensionality. The assumption of unidimensionality can be satisfied if the test data can be represented by "a dominant component or factor" (Hambleton et al., 1991). The factor analytic results conducted with booklet sets 1 and 2 indicated that these tests were essentially unidimensional, as data was represented by one dominant factor (see Tables 12 and 13 & Figures 1 and 2). Due to the fact that there was one dominant factor representing the data, essential unidimensionality was met which is considered sufficient to satisfy the unidimensionality assumption required in IRT applications. Item fit. In order to examine poor data fit to the model, the chi-squarebased Qi goodness of fit statistic was used. Tables 18 and 19 present information  69 on model fit for each item, for booklet sets 1 and 2. Items that display model misfit are identified by the Q I statistic and are shown in bold italic typeface. For booklet set 1 (2 out of 10) items were identified as poorly fitting and for booklet set 2 (3 out of 9 items) were identified as poorly fitting. Examination of the differences between Observed and Predicted values show that these differences were rather small as the difference between observed and predicted values is close to zero for the poorest fitting items in booklet set 1 such as items 8 and 9 (z value = 10.34;10-PI= -.02 and 10.37; 10-11=.01, respectively) and in booklet set 2 such as item 2 (z value= 17.74;10-PI= -.03). The chi-square statistics shown for poorly fitting items in booklet sets 1 and 2 are perhaps being inflated by relatively large sample sizes.  70  Table 18 01 Goodness of Fit Results for Booklet Set 1 Item #  Total N X2  Df  Z-  Observed Predicted Observed-  value  Predicted  1  2021  7.04  5  0.64  0.592  0.589  0.003  2  2023  17.80  5  4.05  0.571  0.573  0.002  3  1655  19.25  13  1.23  0.557  0.568  -0.011  4  1997  32.83  13  3.89  0.467  0.472  -.0.005  5  2055  21.35  6  4.43  0.542  0.538  0.004  6  2051  11.37  6  1.55  0.463  0.466  -0.003  7  2056  20.82  6  4.28  0.867  0.843  0.024  8  1855  65.71  13  10.34  0.405  0.426  -0.021  9  2039  65.88  13  10.37  0.639  0.628  0.011  10  1880  11.41  5  2.03  0.814  0.791  0.023  Notes. Items that display model misfit as identified by the Q i statistic are in bold italic typeface.  71 Table 19 01 Goodness of Fit Results for Booklet Set 2  Item #  Total N X2  Df  Z-  Observed Predicted Observed-  value  Predicted  1  1980  23.67  5  5.90  0.880  .853  .027  2  1846  94.23  11  17.74  .237  .268  -.031  3  2021  19.32  11  1.77  .474  .476  -.002  4  2035  29.08  11  3.86  .209  .228  -.018  5  1968  4.32  5  -.21  .554  .549  .005  6  1811  23.05  11  2.57  .524  .528  -.004  7  1990  11.63  5  2.10  .616  .602  .014  8  1913  12.87  5  2.49  .694  .675  .019  9  1865  21.31  5  5.16  .786  .769  .016  Notes. 
Items that display model misfit as identified by the Q1 statistic are in bold italic typeface.  Local item dependence (LID). The Q3 statistic was used to evaluate the correlation between the performance of examinees on two items after taking into account overall test performance (Yen, 1984). An item pair may be flagged as locally dependent if1Q31?.20 (Ercikan et al., 1998). There were two item pairs flagged as LID for booklet set 1 and six item pairs flagged as LID for booklet set 2. Table 20 shows item pairs flagged as LID and the values of the Q3 statistic. As shown in Table 20, The Q3 values for these items ranged from .20 to .30. A large amount of LID items may result in inaccurate estimates of examinee scores however there were only two LID items in booklet set 1 and six LID items in booklet set 2, with relatively low Q3 (all  72 equal to or less than 0.30). This level of LID is expected to have a minimal effect on item parameters and examinee scores. Examination of the items showing highest levels of LID (item 6, Freezer Question 2 and item 7, Energy Needs Question 1, LID = .30) suggest that these items require similar levels of analytical reasoning to answer the items correctly as shown by results of exploratory factor analysis conducted by PISA (OECD, 2004b). For example, Freezer Question 2 has a factor loading of (.157) and Energy Needs Question 1 has a factor loading of (.170). These results show that these items have very similar loadings which might have led to higher levels of LID (.30) between them. Table 20 Item Pairs with Local Item Dependency (LID) Booklet Set  Item Pair^1031 Value  1  8 and 1^.24 8 and 3^.25  2  ^  2 and 4^.20 7 and 2^.21  7 and 3^.27 6 and 7^.30 6 and 8^.25 9 and 6^.23  73 Identification of DIF Using IRT The results of IRT based LH method of DIF detection are summarized in Table 21. Table 21 presents the number of items containing DIF, the degree of DIF and whether items favoured the French- or English-speaking groups. Overall in booklet set 1, out of ten items, there were four items containing DIF (40%). Two of these DIF items favoured the French-speaking group and two favoured the English-speaking group. Overall in booklet set 2, out of nine items, there were two items containing DIF (22%). Of these DIY items, one item favoured the French-speaking group and the other item favoured the English-speaking group. Table 21 Number of DIF Items in Booklet Sets 1 and 2 Using IRT Pro-English^Pro-French  Booklet Set^# of Items^Level 2^Level 3^Level 2^Level 3 1  10  2  0  2  0  2  9  1  0  1  0  Identification of DIF Using Logistic Regression In this study, a second method of DIF detection (logistic regression) was used for verification of DIF identification. Results of these analyses are shown in Tables 22 and 23 for booklet sets 1 and 2, respectively. As shown in Tables 22 and 23, one item was found to contain DIF by both the LH and LR methods of DIF detection in each booklet set, item 8 (booklet set 1) and item 7 (booklet set 2). More items were identified as containing DIF using the IRT based LH method as  74 compared to the LR method of DIF detection. Using the LH method of DIF detection, three items were found to contain DIF in booklet set 1, and one item in booklet set 2 that had not been identified by the LR method. Using the LR method of DIF detection, two additional items were found to contain DIF in booklet set 1 and no additional items were found in booklet set 2. 
Four items in booklet set 1 and seven items in booklet set 2 were identified as DIF free by both the LH and LR methods of DIF detection. Overall, more items were identified as containing DIF in booklet set 1 (6 out of 10 items) than in booklet set 2 (2 out of 9 items) by one or the other of the two methods of DIF detection, LH and LR.

Table 22
Number of DIF Items in Booklet Set 1

                       LR
LH           DIF       No DIF
DIF          1         3
No DIF       2         4

Table 23
Number of DIF Items in Booklet Set 2

                       LR
LH           DIF       No DIF
DIF          1         1
No DIF       0         7

Correlation of Item Parameters
Table 24 shows correlations of the item parameters, a and β, based on separate calibrations of the items for each language group and booklet set; p-values for each booklet set for the two language groups are presented in Appendix C. The correlations of the location parameters β1 and β2 (the difficulty parameters for each level of the polytomous scoring) ranged from .95 to .99. These correlations for the difficulty parameters were very high, indicating that the ordering of the item difficulties was very similar for the two language versions. The correlations of the discrimination parameter (a) were lower, ranging from .43 to .82. The lower a parameter correlations indicate that the relationship between what the items assess and the overall construct assessed by the test varied for some of the items between the two language versions of the test; the lower the correlation, the larger the number of items for which differences in the ordering of the discrimination parameters were observed. The correlation for the a parameters was notably lower for booklet set 1 (.43), which consisted of a combination of multiple-choice questions and open-constructed responses, than for booklet set 2 (.82), which consisted of open-constructed responses only. An examination of p-values, or the probability of correctly answering an item, shows that there were similarities between the two language groups for each booklet set (see Appendix C).

Table 24
Correlation between IRT Item Parameters

                               Correlations
Booklet Set   # of Items    a      β1     β2     p
1             10            .43    .95    .98    .94
2             9             .82    .98    .99    .99

Scatter plots showing the relationships between the English and French estimates of each parameter (a, β1 and β2) are presented in Figures 3 and 4 for booklet set 1 and Figures 5 and 6 for booklet set 2. Scatter plots are visual representations of the relative order of items for the two populations and show which items were particularly different in terms of the discrimination and difficulty parameters. Item numbers are displayed in the scatter plots in order to obtain a visual representation of how items were ordered for the two language groups. Figure 3 shows that items 1 and 4 in booklet set 1 contributed to the lower correlation of the a parameter, as all other items are more closely aligned.

[Scatter plot of English versus French discrimination parameter (a) estimates, booklet set 1.]
Figure 3. Scatter Plot of Discrimination Parameters (a) for Booklet Set 1

[Scatter plots of English versus French difficulty parameter (β1 and β2) estimates, booklet set 1.]
Figure 4. Scatter Plots of Difficulty Parameters (β1 and β2) for Booklet Set 1

[Scatter plot of English versus French discrimination parameter (a) estimates, booklet set 2.]
Figure 5. Scatter Plot of Discrimination Parameters (a) for Booklet Set 2

[Scatter plots of English versus French difficulty parameter (β1 and β2) estimates, booklet set 2.]
Figure 6. Scatter Plots of Difficulty Parameters (β1 and β2) for Booklet Set 2
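The comparisons in Table 24 and Figures 3 to 6 amount to correlating, and plotting against one another, the item parameter estimates obtained from the separate English and French calibrations. The sketch below does this for a set of invented discrimination estimates and flags the items that depart most from agreement, in the spirit of the discussion of items 1 and 4 above; the values are illustrative and are not the PSM estimates.

```python
# Illustrative sketch: correlate item parameter estimates from two separate
# group calibrations and flag the items that deviate most from agreement.
# The parameter values below are invented for illustration.
import numpy as np

# Hypothetical discrimination (a) estimates for ten items, English vs French.
a_english = np.array([0.78, 0.95, 0.70, 1.05, 0.88, 0.92, 0.66, 0.81, 0.99, 0.74])
a_french  = np.array([1.10, 0.93, 0.72, 0.71, 0.90, 0.95, 0.69, 0.84, 1.02, 0.76])

r = np.corrcoef(a_english, a_french)[0, 1]
print(f"correlation of a parameters across groups = {r:.2f}")

# Items far from the identity line (large |English - French| difference) are
# the ones that pull the correlation down, as items 1 and 4 did in Figure 3.
diff = np.abs(a_english - a_french)
for idx in np.argsort(diff)[::-1][:3]:
    print(f"item {idx + 1}: English a = {a_english[idx]:.2f}, "
          f"French a = {a_french[idx]:.2f}, |difference| = {diff[idx]:.2f}")
# Plotting a_english against a_french with item numbers as point labels
# reproduces the kind of display shown in Figures 3 and 5.
```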
Stage 4 - Judgmental Review
A linguistic examination of the French and English versions of the PSM was conducted by two bilingual reviewers in order to evaluate the comparability of the two language versions of the test. The results of the linguistic reviews are organized by the type of differences that were identified. Differences found in the graphic representation of items are described in Table 25, and differences found in item instructions and questions are presented in Tables 26 and 27 for booklet sets 1 and 2, respectively. In these tables, differences that were identified as "somewhat different" or "very different" are described below.

Differences in Graphic Representation of Items
The linguistic review process identified one difference in the graphic representation of items between the two language versions of the test. Specifically, a difference was found in Library System, which asked examinees to develop a decision tree diagram. In the English version, the start of the tree diagram is located at the edge of the page, whereas in the French version it is located at the centre of the page. The location of the start of the tree diagram may favour the French group because it makes it clearer that the diagram should take up most of the blank space allocated to the question, whereas English-speaking examinees may think that the diagram is not meant to take much room and may therefore underestimate the importance of drawing a diagram to answer the question. Reviewers did not expect this difference to lead to performance differences among examinees. It is not surprising that only small differences in the graphic representation of items were found between the two language versions of the test, as the graphic representation of items remained the same for the two language versions with few exceptions, such as the tree diagram location in the Library System item.

Differences in Item Instructions & Questions
The two reviewers identified several differences in item instructions and questions between the two language versions of the test. For the most part, these differences favoured the English-speaking examinee group. The differences were related to sentence length and complexity, differences in vocabulary (including differences in word usage and word difficulty), and grammatical differences (including differences in verb tense use). Particular examples of these differences are presented below and are summarized in Tables 26 and 27 for booklet sets 1 and 2, respectively.

The reviewers found several inconsistencies between the English and French versions of the PSM. Most differences were judged unlikely to lead to performance differences between the examinee groups. Reviewers also found differences involving longer and more complex sentences, mostly affecting the French-speaking examinee group. For example, in Design by Numbers the English version uses the phrase "a design tool for generating graphics on computers" and the French version uses the phrase "un outil de conception assistée par ordinateur qui permet de générer des éléments graphiques", that is, "a computer-assisted design tool that allows for the generation of graphical elements." This sentence is longer and more complex in French. The reviewers suggested that the phrase "un outil de dessin par ordinateur" would have been a
This sentence is longer and more complex in French. The reviewers suggested that the phrase "un outil de dessin par ordinateur" would have been a better translation. In another example, in Energy Needs the English version uses the phrase "recommended daily amount," whereas the French version uses the phrase "aux apports quotidiens en énergie recommandés," which has a longer and more complicated sentence structure. In Cinema Outing the English version uses the phrase "some scenes may be unsuitable for young children," whereas the French version uses the phrase "certaines scènes peuvent heurter la sensibilité des plus jeunes," which means "some scenes may hurt young ones' sensibility." This phrase is less clear in French. Reviewers suggested that a better and closer translation would have been "certaines scènes peuvent être non recommandées pour les jeunes enfants."

Reviewers found differences in vocabulary use, including differences in word usage and word difficulty, and suggested that some words were used more often in French from France than in French from Canada. For example, in Freezer the English version uses the term "connect... to the power" and the French version uses the term "brancher... sur le secteur." This expression is not commonly used in Canadian French. Reviewers indicated that a better translation would have been "brancher... sur la prise de courant." Likewise, in Freezer the English version uses the term "red warning light," whereas the French version uses both "voyant rouge" and "voyant lumineux" for the same term. Reviewers suggested that the term should have been "lumière rouge" in both cases and that using different terms for the same object may be confusing for French-speaking examinees. In Energy Needs the English version uses the terms "below" and "above," whereas the French version uses the terms "inférieur" and "supérieur," or "inferior" and "superior," respectively. These terms are more difficult in French than in English.

Dialectal differences were also found when comparing the two language versions of the test. For example, in Irrigation the English version uses the term "crops," whereas the French version uses the term "parcelles cultivées." This term is used more often in French from France than in French from Canada. The Canadian French version should have used the term "champs cultivés" or "terrains cultivés" instead of "parcelles cultivées" to translate the word "crops."

Reviewers also found differences in the use of verb tenses. For example, in Freezer the English version uses the future tense in "will hear," whereas the French version uses the present tense in "entendez," or "hear." Likewise, in Freezer the English version uses the future tense in "will light up," whereas the French version uses the present tense in "s'allume," or "lights up." Reviewers thought that keeping the verbs in the future tense in both instances, as "allez entendre" and "s'allumera" respectively, would have made the two versions more comparable. These and other differences found between the two language versions are outlined in Tables 25 to 27 below. Overall, more differences were found in booklet set 1 than in booklet set 2.

Table 25
Summary of Judgmental Review Ratings of the Graphic Representation of Items

Booklet & Item     Rating*   Favours   Noted Differences
Set 2,                2       French   In the English version, the location of the tree diagram is at the edge of the page. In the French version it is located at the centre of the page. The tree diagram should have been located at the same place for both language versions.
Library System

* A rating of 2 indicated that there were clear differences between the translations that may not necessarily lead to performance differences; a rating of 3 indicated that there were clear differences between the two versions that were expected to lead to performance differences.
Table 26
Summary of Review Ratings and Noted Differences in Booklet Set 1

Item                        Rating*   Favours   Noted Differences & Type of Problems Found by Reviewers
Design by Numbers              2      English   In the French version, the title of the question "Design by Numbers" is written in English and has no meaning in French. The title should have been translated as "Dessin par numéros" to make it more comparable to the English version.
Design by Numbers              2      English   The English version uses the phrase "a design tool for generating graphics on computers." The French version uses a phrase meaning "a computer assisted conceptual tool that allows for the generation of graphical elements." This sentence is longer and more complex in French.
Children's Camp                2      English   The title of the question "Children's Camp" was translated as "Colonie de Vacances." The key word "Children's" was left out. This term should have been translated as "Camp d'été pour enfants."
Freezer Instructions           2      English   The English version uses the term "connect... to the power." The French version uses the term "brancher... sur le secteur." It should have been "brancher... sur la prise de courant."
Freezer Instructions           2      English   The English version uses the term "will hear." The French version uses the term "entendez," or "hear." It should have been "allez entendre."
Freezer Instructions           2      English   The English version uses the term "red warning light." The French version uses the terms "voyant rouge" and "voyant lumineux" for the same object. It should have been "lumière rouge" in both cases.
Freezer Instructions           2      English   The English version uses the term "will light up." The French version uses the term "s'allume," or "lights up." It should have been "s'allumera."
Freezer Instructions                            The English version uses the term "load the freezer with food." The French version uses the term "disposer les aliments dans le congélateur." This expression is not commonly used in French. It should have been "Placer les aliments dans le congélateur."
Energy Needs                   2      English   The English version uses the terms "below" and "above." The French version uses the terms "inférieur" and "supérieur." These terms are more difficult in French.
Energy Needs Question 2        2      English   The English version uses the phrase "recommended daily amount." The French version uses the phrase "aux apports quotidiens en énergie recommandés." The phrase in French is longer and more complex. It should have been "aux apports en énergie recommandés."
Cinema Outing Instructions     2      English   The English version uses the phrase "Parental Guidance" and the French version uses "Accord parental souhaitable." These phrases do not have the same meaning. A closer translation would have been "Accompagné par une adulte."
Cinema Outing                  2      English   The English version uses the phrase "some scenes may be unsuitable for young children." The French version uses the phrase "certaines scènes peuvent heurter la sensibilité des plus jeunes." This phrase is more difficult and complex for French examinees. It should have been "certaines scènes peuvent être non recommandées pour les jeunes enfants."
Cinema Outing Instructions     2      English   The English version uses the movie title "King of the Wild." The French version uses the movie title "Le Roi de la savane." This title is more difficult for French examinees to understand. It should have been "Le Roi de la jungle."

* A rating of 2 indicated that there were clear differences between the translations that may not necessarily lead to performance differences; a rating of 3 indicated that there were clear differences between the two versions that were expected to lead to performance differences.

Table 27
Summary of Review Ratings and Noted Differences in Booklet Set 2

Booklet & Item                Rating*   Favours   Noted Differences & Type of Problem
Transit System Instructions      2       French   The English version uses the terms "From here"... "To here." The French version uses the terms "Départ"... "Arrivée." These terms are more specific in French.
Library System                   2       English  The English version uses the terms "two outcomes" and the French version uses the terms "deux issues," which differ slightly in meaning. A better translation would have been "deux possibilités."
Course Design Instructions       2                The English version uses the singular "A student," whereas the French version uses the plural "Les étudiants," or "Students." The same form should have been used in both versions.
Holiday Question 2               2       English  The English version uses the phrase "can break her journey by camping overnight anywhere between towns" and the French version uses "peut couper ses trajets en campant n'importe où entre deux villes." This phrase is less clear in French and more difficult to understand. A better translation would have been "peut arrêter n'importe où entre deux villes pour dormir en faisant du camping."
Irrigation Instructions          2       English  The English version uses the term "crops." The French version uses the term "parcelles cultivées." This term is used more often in French from France than in French from Canada. In the Canadian test version the term "terrains cultivés" should have been used.

* A rating of 2 indicated that there were clear differences between the translations that may not necessarily lead to performance differences; a rating of 3 indicated that there were clear differences between the two versions that were expected to lead to performance differences.

Correspondence between DIF identification and linguistic reviews

In stage 4, a linguistic review of items was conducted by two bilingual reviewers in order to identify potential sources of DIF. Although not all potential sources of DIF were identified, these reviews pointed to some language-related potential sources of DIF.
In this study, the qualitative reviews focused on linguistic differences only; therefore, additional sources were not identified in these reviews. Overall, six items in booklet set 1 and two items in booklet set 2 were identified as containing DIF by either the Linn-Harnisch IRT or the logistic regression method. Tables 28 and 29 summarize these items for booklet sets 1 and 2, respectively, indicating which language group was favoured by each item and whether potential sources of DIF were identified by the linguistic reviews. Statistically, no items with high level DIF (level 3 DIF) were found, and the linguistic reviews did not identify any items with a rating of 3 (i.e., ratings indicating clear differences between the two versions that were expected to lead to performance differences). Therefore, these tables present items with medium level DIF (level 2 DIF) and linguistic review ratings of 2, which indicated that there were clear differences between the translations or adaptations of the two language versions that may not necessarily lead to performance differences between language groups.

Table 28
Correspondence of DIF with Linguistic Review Ratings in Booklet Set 1

Item #   Language Group Item Favours   Potential Source of DIF Identified by Linguistic Review
1           French                     No
2           French                     No
4           French                     No
7           English                    Yes
8           English                    Yes
10          English                    Yes

Table 29
Correspondence of DIF with Linguistic Review Ratings in Booklet Set 2

Item #   Language Group Item Favours   Potential Source of DIF Identified by Linguistic Review
1           English                    Yes
7           English                    Yes

As shown in Tables 28 and 29, the linguistic reviews consistently identified possible sources of DIF when items favoured the English group, whereas no potential sources of differences were found when items favoured the French-speaking group, which may point to other potential sources of DIF, such as cultural or curricular differences, for these items. The match between the linguistic reviews and DIF favouring the English group possibly indicates that translation or adaptation differences led French-speaking examinees to perform less favourably. The consistency of this match indicates that conducting linguistic reviews of items containing DIF provides an important source of information regarding potential sources of DIF. Additional studies of curricular and cultural differences among items could potentially have identified sources of DIF favouring the French-speaking group.

Summary

In this study, an analysis of the English and French Canadian versions of the PSM was conducted in order to determine their degree of comparability. In this chapter, the results of these analyses were described, including the procedures used and the stages followed to determine the degree of construct comparability of the PSM. Very small differences were found in means and standard deviations for the two language groups and the two booklet sets. In stage 1, analyses conducted with scale level data to determine whether the dimensionality and test structure of the PSM were the same for both language versions showed that there was one main factor representing the data for the two language groups and for the two booklet sets. Some differences in the percentage of variance explained by the dominant factor were found between booklet sets; primarily, a greater percentage of variance was accounted for by the dominant factor in booklet set 2 than in booklet set 1.
Greater similarities were found for factor loadings and interfactor correlations for booklet set 2 than booklet set 1. In stage 2, results from the examination of internal consistency coefficients indicate that there was moderate  93 to high degree of internal consistency for the two language groups for the two booklet sets. The degree of internal consistency was higher for booklet set 2 than booklet set 1. These findings are not surprising given that there was a greater percentage of variance explained by the dominant factor in booklet set 2 as well as higher factor loadings and inter-factor correlation similarities for booklet set 2. In Stage 3, DIF analyses were conducted in order to identify items that function differentially for the two language versions of the test. Consistent with previous studies, results from DIF detection procedures suggest that there was a large percentage of items containing DIF ranging from 20% (booklet set 2) to 60% (booklet set 1) of items using either the LH IRT based analyses and LR methods of DIF detection. In Stage 4, judgmental reviews were completed to qualitatively analyze the comparability of the two language versions of the PSM. Type of differences found included differences in sentence length, complexity and structure, vocabulary use, difficulty and grammatical differences including verb tense differences. Reviewers suggested that most differences favoured the English-speaking examinee group. Reviewers believed that most differences would not lead to performance differences. The next chapter presents a discussion of these findings and discusses implications of inferences from these findings.  94 CHAPTER V: DISCUSSION Summary In this study, an investigation of the construct comparability of the English and French versions of PISA, 2003 problem-solving measure was conducted. The purpose of this study was to answer the following three research questions: Research Question One: Are the English and French Canadian versions of PISA, 2003 problem-solving measure measuring the same construct? Research Question Two: Are there specific items that function differentially for English and French examinees? Research Question Three: What are the potential sources of differences in constructs being assessed for the two language groups? The degree of comparability between the English and French versions of PISA, 2003 problem-solving measure was determined from evidence gathered by answering these questions. Results from statistical and qualitative analyses indicated that these two versions were psychometrically similar however some differences at the scale and item levels were found. The summary of evidence related to each question is presented below. Research Question One Exploratory factor analysis was used in order to assess whether there were differences in the dimensionality and structure of the French and English Canadian versions of PISA, 2003 problem-solving measure as evidence of differences in constructs measured by the two language versions. Overall, the factor structures for the French and English versions of the measure were found to  95  be "essentially uni-dimensional" (Stout, 1990). One primary factor was found to account for a large portion of the variance and this factor accounted for a similar amount of variance for each test version and for each booklet set. 
Discrete differences were found in the structural equivalence of the test as there were three factors for the English version of the measure and two factors for the French version of the measure for the two booklet sets. Although there were different numbers of factors for each language version of the test, there were similarities in the factor solutions for several of the test items in each booklet set for the two language versions of the test. A higher degree of factor loadings and inter-factor correlations was found for booklet set 2 than booklet set 1. In the two language versions of the test and in each booklet set there was a large portion of variance that was not explained by the factor solutions indicating that some critical information was lost from the analyses. The two factor solutions appeared to be detecting mathematical reasoning and reading comprehension abilities. Analyses of the test items indicate that items containing longer reading passages and containing fewer visuals and graphs loaded higher on factor 1 such as Freezer, questions 1 & 2 and Cinema Outing questions 1 & 2 in booklet set 1 and Irrigation questions 1 to 3 in booklet set 2. Items that required students to interpret, generate and/or extrapolate information from graphs loaded more heavily on Factor 2 such as Design by Numbers 1 to 3, Children's Camp and Energy Needs 1 and 2 in booklet set 1 and Library System 1 and Transit System in booklet set 2. Latent correlation analyses further support the interpretation of these two factors as problem-solving items were most highly  96 correlated to mathematics (r=.89) and reading (r=.82). The correlation between mathematics and problem-solving is of about the same order of magnitude as correlations among the four mathematics sub-scales (OECD, 2004b, p.55). Research Question Two Differential Item Functioning analyses were conducted to determine whether there were specific items from PISA 2003, problem-solving measure that functioned differentially for the English- and French-speaking examinees. Two DIF detection procedures were used to increase accuracy of analyses (LH) and (LR) as a second DIF verification procedure. Analyses conducted using these two DIF detection methods showed that booklet set 1 had the greatest number of DIF items (six out of ten items, 60%), detected by both methods. Booklet set 2 had two DIF items which were evenly split in the direction of DIF between the two test language versions. Even though neither one of the booklet sets contained items with high level DIF, the fact that up to 60% of items in a given booklet set were identified as containing DIF suggests that pair-wise comparisons should be made at the field-testing stages of test development and before interpreting test results. In addition, caution should be used when interpreting differences between the two language groups as indicating performance differences between these two groups. Examination of the correlations between IRT parameters for both booklet sets indicated that there were some discrepancies for the two language versions. If correlations between discrimination parameters are low, then there is a large difference between how items are ordered according to what they measure in the  97 two language versions. A greater degree of differences were found for booklet set 1 (r=.43) for item discrimination parameters than booklet set 2 (r=.82). Analyses of these results indicate that two items in booklet set 1 (items 1 and 4) were leading to lower correlation coefficients. 
Examination of these two items indicates that they contained DIF. Item 1 (Design by Numbers Question 1) was identified by reviewers as having a longer and more complex sentence structure in French. Item 4 (Children's Camp) was identified by reviewers as having a French title, "Colonie de Vacances," that was less precise, was missing the key word "children," and did not give French-speaking examinees as many clues about the item content as the title Children's Camp did in English. These differences were summarized in Table 26.

Research Question Three

A linguistic comparison of the English and French versions of the test items was conducted to identify the sources of differences in the constructs assessed by the two language versions of the test. The results of this linguistic review indicated that there were small differences in the graphic representation of items between the two language versions; these were limited to differences in the location of graphics. Some linguistic differences were found when examining items in the PSM, with a greater degree of differences found for booklet set 1 than for booklet set 2. The differences found throughout both booklet sets included longer and more complex sentences in French, which included more difficult words that were used less frequently. Given that examinees had a limited time to answer the questions in each booklet set, this might have affected French-speaking students' motivation to read through the lengthier questions as well as their overall performance.

Several differences were found at the item level, including items containing medium level DIF and items containing linguistic differences; taken individually, these differences might not have led to differences in examinee performance. However, the cumulative effect of these differences might have led to test bias and might have led French-speaking examinee groups to perform lower than they could have had these differences been addressed at the test design and test development stage. In addition, the overall effect of length differences may not have been captured by conducting either DIF or linguistic comparisons at the item level. This difference might have led to French-speaking examinees performing less favourably at the overall test level than they could have had the tests been translated and adapted more closely.

Several modifications were made to the French source version to make it more appropriate to the Canadian context. Words or terms such as names of currency, food items and places were changed to reflect Canadian conventions. For example, in Cinema Outing references to currency were changed from "euros" to "dollars" and phone numbers were changed from "08 00 42 30 00" to "(555) 743-6666." In Energy Needs food items were changed from "crumble" to "tarte" and from "tarte" to "gâteau," and specific expressions such as "dîner" were changed to "souper" to reflect the local terms and expressions used in Canada. In Library System, names of places such as school names were modified from "Lycée Montaigne" to "l'École secondaire Montaigne" and from "Lycée Coulanges" to "l'École secondaire Coulanges." These modifications made the test more appropriate for French-speaking students from Canada.
In addition to making modifications to better suit the Francophone Canadian culture and language, PISA used a large number of diagrams and pictures that were meant to help students understand the questions. Several modifications were also made to include names that were more common in the language of the test, such as the English names "Jane," "Fred," and "Stanley," which were rendered in French as "Janine," "François" and "Simon," respectively. The items and questions from the PSM were also designed using fictitious names of places, currency, cities and countries that were intended to be neutral and not favour any particular language or culture. For example, in booklet set 1, Children's Camp uses the name of the "Zedish" community, Transit System uses "Zedland" as the name of a city, and "zed" is used as the name of the currency in "Zedland." In Holiday, a map of roads between towns uses made-up city names such as "Kado," "Lapat," and "Megal." These modifications served to make the test versions more comparable and easier to understand for both language groups. At the field-testing stage, extensive studies were conducted in which all participating countries were compared to one another to find out whether any items contained DIF; items containing DIF were removed at that stage. Think-aloud procedures were also conducted by panels of experts and students representative of different participating countries in order to verify that items would be interpreted as test developers intended. Given the rigour with which PISA translated the PSM, the studies conducted at the field-testing stage, and the large number of pictures and diagrams used in the PSM, it is not surprising that no items in this study were found to contain high level DIF, nor were any items found by reviewers to be expected to lead to performance differences between examinee groups.

Degree of Comparability

The purpose of this study was to examine the degree of comparability of the French and English Canadian test administrations of PISA, 2003 problem-solving measure. These analyses were based on Messick's (1989a; 1989b; 1995) unitary notion of validity regarding the type of evidence that needs to be presented to support the adequacy and appropriateness of inferences made using test scores. Table 30 presents a summary of the evidence gathered to investigate the comparability of both language versions of the Canadian PSM administration. The sources of evidence gathered through statistical and qualitative analyses, including analyses at the structural data level, internal consistency estimates, DIF and item parameter correlations, and judgmental reviews at the item level, suggest that the comparability of the PSM was moderate for booklet set 1 and high for booklet set 2; differences in the degree of comparability between the booklet sets were more prominent in the item level analyses.

Table 30
Summary of Completed Analyses by Booklet Set

                              IRT Based Analyses
Booklet   EFA   Internal      DIF   Correlation of     Judgmental
                Consistency         Item Parameters    Reviews
Set 1      *        *          *          *                *
Set 2      *        *          *          *                *

At the structural equivalence level, booklet sets 1 and 2 were found to have high degrees of comparability, suggesting that the constructs measured by both language versions of the test were similar. These results were not surprising, as the test was designed to measure problem-solving abilities and each item contributes to the assessment of that ability.
However, structural equivalence is necessary yet not sufficient to demonstrate the degree of comparability of the two language versions (McCreith, 2004). Internal consistency estimates show that there is a high degree of comparability between the two language versions. However there was a higher degree of reliability found for booklet set 2 (.96) than booklet set 1 (.73). In addition, lower discrimination parameters were found for booklet set 1 as compared to booklet set 2. Linguistic reviews indicated that there were no items that would be expected to lead to performance differences between the two language versions of the test. A greater number of differences were found for booklet set 1 than booklet set 2 (consistent with DIF findings). Examination of the above findings indicate moderate to high degree of comparability for the PSM between the two language versions of the test.  102 Implications The results of this study have several implications related to decisions made based upon and the methods used to establish the degree of comparability of the French and English Canadian versions of the PSM. Three implications will be discussed below. The first implication is related to the validity of using scores from the PSM to compare the two language groups, the second one is related to how these measures should be compared and the third one is related to the type of analyses that should be conducted at the test design stage and prior to test interpretation. These implications are described below. The following question is central to making inferences based on score interpretations: Can conclusions be drawn with the same amount of confidence from the two groups being compared? Validity of score interpretation depends on the test's ability to capture skills identified in the test framework (measurement of the problem-solving construct). Item content and format differences such as those created by translation differences might make this comparison difficult. Statistical and qualitative comparative analyses conducted in this study indicate that there was moderate to high level of comparability between the two language versions of the test indicating that comparisons made using the two booklet sets may be done; however, caution needs to be used when making these comparisons. Caution may include taking into account that when comparisons of overall group performance are made between two language groups these differences may be analyzed as being due in part to differences in performance between the two language groups and partly due to construct incomparability.  103 The second implication is related to whether results from booklet sets 1 and 2 should be treated as equivalent or whether results from the two booklet sets should be conducted and interpreted separately. A larger percentage of medium level DIF was found for booklet set 1 (60% of items contained DIF) than booklet set 2 (20% of items contained DIF). Due to differences in booklet sets 1 and 2 in levels of DIF, internal consistency estimates and somewhat different factor structures, these analyses and interpretation of results should be conducted separately in order to obtain more meaningful results. Otherwise, the use of nonequivalent booklet sets may lead to interpretation errors and faulty conclusions. The third implication is related to the type of analyses that should be conducted at the test design stage. Think-aloud and DIF analyses procedures were conducted at the test design stage by PISA (OECD, 2005). 
These procedures were conducted across all participating countries and were designed to identify any items that would function differentially for the different countries or languages of administration. Pair-wise comparisons across participating countries or sub-groups within countries were not conducted. Hence, although no items containing high level DIF were identified in this study, a large percentage of items containing medium level DIF were identified, particularly in booklet set 1. Due to these findings, conducting pair-wise comparisons at the test design stage would be beneficial in order to identify items containing DIF at the country or group level before test administration. Conducting pair-wise comparisons with all 41 countries involved in PISA would not be possible however pair-wise comparisons  104 should be made in bilingual or multi-lingual countries where groups are likely to be compared to each other, as in the case of Canada. These analyses should be conducted particularly with a measure such as the PSM that contains between 9 to 10 questions in each booklet set. In order to link the measure across booklet sets, items that have similar measurement accuracy and are DIF free should be used. As the PSM contains very few items, removing items containing DIF after test administration would lead to shortening the measure even further, which may lead to construct under-representation or reduced content coverage and thus reduced reliabilities and unstable construct measurement. As pair-wise comparisons were not done at the test-development stage, it is important to conduct DIF and linguistic comparisons prior to test interpretation. Doing so would provide information regarding number of items containing DIF and degree of caution needed when interpreting results from measures that contain larger percentages of items containing DIF. Implications for Practice The PSM was developed to collect information regarding students' problem-solving abilities across 41 countries participating in PISA, 2003. In this particular study, the PSM was analyzed in order to examine whether inferences from scores based on English- and French- Canadian examinee groups would be comparable. If the comparability of scores of the English and French Canadian test administrations is not demonstrated then the intention to compare language groups may not be done in an accurate manner and meaningful decisions and inferences may not be drawn from the data. For example, decisions concerning  105 educational policies, comparisons of student achievement across language groups and program development and evaluation which are contingent upon pattern of results from the PSM may be applied inappropriately. Significant efforts and human and financial resources are spent creating these measures it should therefore make sense that decisions based upon these results be based upon high quality data. The Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999) indicate that it is the test developers' responsibility to provide "empirical and logical evidence for score reliability and the validity of translated test's score inferences for the intended uses" (p.99) and that evidence particularly relevant to language groups may be provided so that "the different language versions measure equivalent or similar constructs, and that score reliability and validity of inferences from scores from the two versions are comparable" (p.99). 
The results from this study indicate that inferences based upon scores from booklet sets 1 and 2 have small to medium level differences. Practitioners may treat, interpret, and compare scores from these tests with caution. Due to differences in comparability between booklet sets, each booklet set should be treated separately and analyses of results should not treat both booklet sets as comparable. Limitations There were two main limitations related to this study, the first one is related to not having enough information regarding the sample, particularly whether the English- and French-speaking examinee groups came from linguistic  106 minority or majority settings. Having information regarding whether students came from minority or majority settings would have allowed for more meaningful analyses to be conducted so that students from minority Anglophone settings could have been compared to minority Francophone settings in the same way that students from majority Francophone settings could have been compared to students from majority Anglophone settings. In this study, conducting comparisons among students from similar settings may have produced more meaningful results. The second limitation is due to focusing only on linguistic differences as potential sources of DIF. Although linguistic comparisons of items provided some insights regarding potential sources of DIF, not all potential sources of DIF were explored. Other potential sources of DIF include cultural and curricular differences. These sources of DIF could have been explored by conducting thinkaloud procedures between English- and French-speaking Canadian examinees. Think-aloud procedures might have provided further information regarding whether the two language groups understood the items as they were intended to by test developers and might have helped uncover additional sources of DIF. Contributions of Findings to Literature This research study focused on the comparability of scores between English and French Canadian versions of cognitive ability tests. The study was based upon guidelines that have been written in order to highlight the importance of examining the comparability of translated tests (AERA, APA, & NCME, 1999) and Guidelines for Adapting Educational and Psychological Tests (van de Vijver  107 & Hambleton, 1996). The study contributes to the literature on construct comparability and validity of translated tests in three ways (1) it contributed towards the paucity of data on construct comparability using cognitive ability tests (2) it contributed towards putting into practice the guidelines mentioned above for examining and investigating construct comparability issues using translated tests and (3) it provided empirical evidence regarding the degree of construct comparability in different language versions of PISA, 2003 PSM. It is important that test developers and test users establish that the constructs measured by instruments used to compare sub-groups is equivalent across languages. Degree of comparability should be examined empirically by conducting statistical and qualitative studies. This study contributes to research on construct comparability issues in large-scale assessments using translated tests of cognitive abilities with implications at the national and international level. Future Directions There are a number of different areas of research related to construct comparability research and the measures used in this study that deserve further study. 
First, although this study focuses on construct comparability of PISA, 2003 PSM, other test domains (such as mathematics, science and reading) and other large-scale assessments (such as TIMSS and PIRLS) may be investigated. Degree of comparability of other large-scale assessments should be investigated in order to ensure that meaningful conclusions are drawn from the use of these tests. Construct comparability and meaningfulness of scores should be based upon studies of construct, content, and cognitive validity and should be conducted  108 according to Messick's unitary notion of validity. These studies should encompass all sources of validity evidence and should include factors beyond test items such as culture, language, and the assessment context (Messick, 1995). These studies may be conducted by either (a) examining existing data or (b) collecting new data. Examination of existing data particularly relevant to examining English- and French-speaking Canadian students' performance would require more specific information related to these two examinee groups. For example, if further information was available regarding whether students sampled came from minority or majority English- or French-speaking groups the following questions may be asked (1) how do performance between minority or majority Francophone and minority or majority Anglophone Canadian students compare? (2) Are there specific cognitive domains in which minority or majority Francophone students perform better or worse than minority or majority Anglophone Canadian students and vice versa? (3) Which areas are these? (4) Are there specific programs that may be used to increase either group's performance in these particular domains? If specific programs have been implemented perhaps based upon previous findings from other large-scale assessments then how useful have results been in informing policy development of new programs based on test results? Answers to these questions seek to improve methodology currently used when comparing standings across countries or regions using large-scale assessments and lead to making more accurate, relevant and meaningful inferences of individual and sub-groups for whom these assessments are geared. This approach may have some limitations, however. For example, by further  109 specifying sub-groups, sample sizes may decrease, which may jeopardize parameter estimations, and decrease power. If these studies were conducted by collecting new data then what could have been done differently? First, although 19 items were developed to measure the problem-solving construct these items were divided into two booklet sets with 9 to 10 items per booklet set, there were no common items between booklet sets to link the two measures. Therefore, if I were to collect new data in another assessment cycle, I would construct a measure that includes more items, more questions per item and some common items that would allow for linking among the different booklet sets. When making the measure longer, I would be careful to not add too many new items so as to not add significantly more reading content to the measure further confounding the measurement of problem-solving abilities. 
At the test development stage, I would conduct pair-wise comparisons between countries or groups within countries at the test design stage including DIF analyses and think-aloud procedures (particularly with bilingual or multilingual countries such as Canada, Switzerland and Belgium which would be likely to compare sub-groups to each other). Pair-wise DIF analyses procedures would allow for the removal of items containing DIF at the test development stage. Think-aloud procedures would allow for analyses of how examinees from different language groups are interpreting the test items and whether they are being interpreted as intended to by test developers. These analyses would allow for a greater number of DIF free items to be used when conducting linking  110  between booklet sets and would increase comparability of scores between different language groups. Important decisions are based upon large-scale assessment data concerning educational policies, comparisons of student achievement across regions or countries and program development and evaluation. Particularly, reports from the PSM provide indicators related to how well students are prepared for productive participation in today's world. Results of these analyses suggest that 22 out of 40 countries are prepared for productive participation in the world. These results indicate that students' who perform in the upper levels of the problem-solving scale (55% of students) are able to compete in a rapidly changing world and have increased opportunities for employment (OECD, 2004b), These results are important as they provide countries with important information regarding country rankings and information about where their population is placed as they are leaving compulsory education and are getting ready to enter the employment world. Due to the present movement towards economic globalization, it is important that international large-scale assessments such as the PSM provide scores and results that are meaningful. Meaningful and valid data may allow low performing countries to direct appropriate resource allocation and policy development to raise lower standards. Thereby allowing them to better prepare to meet the challenges of the future and compete in a global economy. For countries like Canada who are performing in the top one third valid and ,  reliable data may provide information related to what they are doing well and  111 provide information to monitor their progress and set up policies and programs that would allow them to maintain and continue raising their standards. Due to important decisions based upon the use of large-scale assessments, and significant capital and human resources invested into developing, interpreting and using these measures; it is important that these tests meet the highest psychometric and ethical standards. This study contributed towards addressing and examining these requirements. More steps and efforts need to be invested in this area of research in the future.  112 REFERENCES Allalouf, A. (2003). Revising translated differential item functioning items as a tool for improving cross-lingual assessment. Applied Measurement in Education, 16, 55-73. Allalouf, A., Hambleton, R. K., & Sireci, S. G. (1999). Identifying the causes of DIF in translated verbal items. Journal of Educational Measurement, 36, 185-198. 
American Educational Research Association, American Psychological Association, National Council on Measurement in Education, & Joint Committee on Standards for Educational and Psychological Testing (1999). Standards for educational and psychological testing. Washington DC: American Educational Research Association. American Psychological Association. (2002). Ethical Principles of Psychologists and Code of Conduct. American Psychologist, 57, 1060-1073. Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37, 29-51. Burket, G. R. (1991). PARDUX. [Computer software]. Monterey, CA: CTBIMcGraw-Hill. Bussiere, P., Cartwright, F., Knighton, T., & Rogers, T. (2004). PISA 2003 — The 2003 Canadian Report Measuring Up: Canadian Results of the OECD PISA Study. The performance of Canada's Youth in Mathematics, Reading, Science and Problem Solving 2003 First Findings for Canadians  113 Aged 15. Retrieved June 1 st , from http://www.cmec.calpisa/2003/Pisa2003.en.pdf Clauser, B. E. & Mazor, K. M. (1998). Using statistical procedures to identify differentially functioning test items. National Council of Measurement in Education — Instructional Topics in Educational Measurement, 31-43. Council of Ministers of Education, Canada (2000). Report on Science Assessment, School Achievement Indicators Program, 1999. Toronto, ON: Council of Ministers of Education, Canada. Eamon, M. K. (2002). Effects of poverty on mathematics and reading achievement of young adolescents. Journal of Early Adolescence, 22, 4974. Ercikan, K. (1998). Translation effects in international assessment. International Journal of Educational Research, 29, 543-553.  Ercikan, K. (April, 1999). Translation DIF on TIMSS. Paper presented at the annual meeting of the National Council on Measurement in Education, Montreal, Quebec, Canada. Ercikan, K. (2002). Disentangling sources of differential item functioning in multilanguage assessements. International Journal of Testing, 2, 199-215. Ercikan, K. (2003). Are the English and French versions of the Third International Mathematics and Science Study administered in Canada comparable? Effects of adaptations. International Journal of Educational Policy, Research and Practice, 4, 55-76.  114 Ercikan, K. (April, 2007). Score scale comparability in International assessments. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, USA. Ercikan, K., Gierl, M. J., McCreith, T., Puhan, G., & Koh, K. (2004). Comparability of bilingual versions of assessments: Sources of incomparability of English and French versions of Canada's national achievement tests. Applied Measurement in Education, 17, 301-321. Ercikan, K., & Koh, K. (2005). Construct comparability of the English and French versions of TIMSS. International Journal of Testing, 5, 23-35. Ercikan, K., & McCreith, T. (2002). Effects of adaptations on comparability of test items. In D. Robitaille & A. Beaton (Eds), Secondary Analysis of TIMSS Results (pp. 391-405). Dordrecht The Netherlands: Kluwer  Academic Publisher. Ercikan, K., Schwarz, R. D., Julian, M. W., Burket, G. R., Weber, M. M, & Link, V. (1998). Calibration and scoring of tests with multiple-choice and constructed-response item types. Journal of Educational Measurement, 35, 137-154.  Fitzpatrick, A. R., Link, V. B., Yen, W. M., Burket, G. R., Ito, K., & Sykes, R. C. (1996). 
Scaling performance assessments: A comparison of one-parameter and two-parameter partial credit models. Journal of Educational Measurement, 33, 291-314.  115 Ford, D. Y. (1998). The underrepresentation of minority students in gifted education: Problems and promises in recruitment and retention. Journal of Special Education, 32, 4-14. French, A. W., & Miller, T. R. (1996). Logistic regression and its use in detecting differential item functioning in polytomous items. Journal of Educational Measurement, 33, 315-332. Geisinger, K. F. (1994). Cross-cultural normative assessment: Translation and adaptation issues influencing the normative interpretation of assessment instruments. Psychological Assessment, 6, 304-312. Gierl, M. J., & Khaliq, S. N. (2001). Identifying sources of differential item and bundle functioning on translated achievement tests: a confirmatory analysis. Journal of Educational Measurement, 38, 164-187. Hair, J. F., Anderson, R. E., Tatham, R. L., & Black, W. C. (1998). Multivariate Data Analyses (5 th ed.). Upper Saddle River, NJ: Prentice-Hall, Inc. Hambleton, R. K., Merenda, P. F., & Spielberger, C. D. (2005). Adapting Educational and Psychological Tests for Cross-cultural Assessment. Lawrence Erlbaum Associates, Mahwah, New Jersey. Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of Item Response Theory. Newbury Park, DA: Sage. Hambleton, R. K., & Patsula, L. (1998). Adapting tests for use in multiple languages and cultures. Social Indicators Research, 45, 153-171. Helms, J. E. (1997). The triple quandary of race, culture, and social class in standardized cognitive ability testing. In D. P. Flanagan, J. L. Genshaft, &  116 P. L. Harrison (Eds.), Contemporary intellectual assessment (pp. 517-532). NY: The Guilford Press. Kennepohl, S., Shore, D., Nabors, N., & Hanks, R. (2004). African American acculturation and neuropsychological test performance following traumatic brain injury. Journal of the International Neuropsychological Society, 10, 566-577.  Kim, S. H., Cohen, A. S., Alagoz, C., Kim., S. (2007). DIF detection and effect size measures for polytomously scored items. Journal of Educational Measurement, 44, 93-116.  Laing, S. P., & Kamhi, A. (2003). Alternative assessment of language and literacy in culturally and linguistically diverse populations. Language, Speech, and Hearing Services in Schools, 34, 44-55.  Linn, R. L. (1993). Linking results of distinct assessments. Applied Measurement in Education, 6, 83-102.  Linn, R. L., & Harnisch, D. L. (1981). Interactions between item content and group membership on achievement test items. Journal of Educational Measurement, 18, 109-118.  Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum. McCreith, T. M. (2004). A construct comparability analysis of cognitive ability tests in different languages (Doctoral dissertation, University of British Columbia, 2004). ProQuest Dissertations and Theses, AAT, NQ 93153  117 Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons' responses and performance as scientific inquiry into score meaning. American Psychologist, 50, 741-749. Miller, T. A., & Spray, J. A. (1993). Logistic discriminant function analysis for DIF identification of polytomously scored items. Journal of Educational Measurement, 30, 107-122. Narayanan, P., & Swaminathan, H. (1996). Identification of items that show nonuniform DIE Applied Psychological Measurement, 20, 257-274. 
Organisation de Cooperation et de developpement economiques. (2004a). Resoudre des problemes, un atout pour reussir — premieres evaluations des competences transdisciplinaires issues de PISA 2003. Retrieved on December 1 st, 2006 from: http://www.pisa.oecd.org/dataoecd/48/49/34474406.pdf Organization for Economic Co-Operation and Development. (2004b). Problem solving for tomorrow's world — first measures of cross-curricular competencies from PISA 2003. Retrieved on December 1 st, 2006 from: http://www.pisa.oecd.org/dataoecd/25/12/34009000.pdf Organization for Economic Co-Operation and Development. (2005). PISA, 2003 Technical Manual, Retrieved on July 5 th , 2007 from: http://www.pisa.oecd.org/dataoecd/49/60/35188570.pdf. Rhodes, R. L., Ochoa, S. H., & Ortiz, S. 0. (2005). Assessing culturally and linguistically diverse students: A practical guide. NY: The Guilford Press.  118 Salvia, J., & Ysseldyke, J. (2001). Assessment in Special and Inclusive Education. Boston: Houghton Mifflin. Sireci, S. G. (1997). Problems and issues in linking assessment across languages. Educational Measurement: Issues and Practice, 16, 2-19. Sireci, S. G., & Berberoglu, G. (2000). Using bilingual respondents to evaluate translated-adapted items. Applied Measurement in Education, 35, 229259. Suzuki, L. A., & Aronson, J. (2005). The cultural malleability of intelligence and its impact on the racial/ethnic hierarchy. Psychology, Public Policy, and Law, 11, 320-327. Suzuki, L. A., & Valencia, R. R. (1997). Race-ethnicity and measured intelligence — Educational implications. American Psychologist, 52, 1103 — 1114. Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal ofEducational Measurement, 27, 361-370. van de Vijver, F., & Hambleton, R. K. (1996). Translating tests: Some practical guidelines. European Psychologist, 1, 89-99. Waller, N. G. (1998). EZDIF: Detection of Uniform and Nonuniform Differential Item Functioning with Mantel-Haenszel and Logistic Regression Procedures. Applied Psychological Measurement, 22(2), 391. Wu, A. D., Li, Z., & Zumbo, B. D. (2007). Decoding the meaning of factorial invariance and updating the practice of multi-group confirmatory factor  119 analysis: A demonstration with TIMSS data [electronic version]. Practical Assessment, Research and Evaluation, 12, 1- 26, Yen, W. M. (1984). Effects of local item dependence on the fit and equating performance of the three-parameter logistic model. Applied Psychological Measurement, 8, 125-145. Yen, W. M. (1993). Scaling performance assessments: Strategies for managing local item dependence. Journal of Educational Measurement, 30, 187-214.  120 APPENDICES Appendix A Codes for the Sources of Translation Differences The following list summarizes the kind of differences that have been found in the translation and adaptation of tests into different languages (List adapted from McCreith, 2004). 1. Cultural relevance or cultural differences found in relevance to item content. Example: A reading comprehension passage contains content that is more relevant to one of the groups. 2. Changes in content producing a change in the meaning of the item (Hambleton, 1994 as cited in van de Vijver & Hambleton, 1996). Example: A word that has a single meaning was translated into a word containing more than one meaning. Example: In a Swedish-English comparison of educational achievement contained the following item: Where is a bird with webbed feet most likely to live? a. in the mountains b. 
in the woods c. in the sea d. in the desert. The Swedish version translated the words "webbed feet" into "swimming feet," thereby providing cues about the correct answer.

3. Changes in format, including differences in punctuation, capitalization, item structure, typeface, and other formatting usages that may affect the performance of one of the examinee groups.
Example: A word presented only in the stem of the English form was presented in all four options of the French form.
Example: A letter was located above the X-axis in a graph for all distractors in the French form, whereas in the English form two of the distractors had the letter below the X-axis. The variation in the position of the letter might have led the English-speaking examinees to believe that it was relevant.

4. Omissions or additions of words, phrases, or expressions that may affect the meaning of an item and may affect the performance of one group of examinees.
Example: The English form of an item contained the expression "this number written in standard form" while the French form had the phrase "ce nombre est" ("this number is"); in this item the idea of "standard form" was excluded from the French translation.

5. Differences in verb tense.
Example: The word "reads" (present tense) in the English form was translated into "a lu" (past tense) in the French form.

6. Differences in word difficulty, frequency or commonness of vocabulary.
Example: In the English form the word "burns" was translated into the word "combustion" in French. The word "combustion" is more difficult and used less frequently than the word "burns."

7. Exclusion or inappropriate translation of key words. Key words provide additional clues to guide examinees' thinking processes; their exclusion or inappropriate translation may lead to differences in performance among different language examinee groups.
Example: The stem in the English form stated "whenever scientists carefully measure any quantity many times, they expect that..." The answer is "most of the measurements will be close but not exactly the same." In the French form, the stem was translated as "when scientists measure the same quantity many times, they expect that..." The word "same" in the French form could have led the examinees to think that this quantity was known and that the answer should be that the scientists should get the same amount every time.

8. Differences in additional information given to guide examinees' thinking process.
Example: The English item asks "At what point will the reflection of the candle appear to be?" In the French form, the item asks "En quel point l'image de la bougie apparaîtra-t-elle?" ("At which point will the image of the candle seem to appear to be?") The French version provides more information by telling the examinees that the reflection in the mirror may seem different than the actual object. This additional information could have made the item easier for French-speaking examinee groups.

9. Differences in length or sentence complexity making the item more or less difficult for one of the examinee groups.

10. Differences in words, expressions, or sentence structure inherent in one language or culture that are likely to affect the performance of one of the examinee groups.
Example: An English item used a 12-hour clock with AM and PM while the French translation used a 24-hour clock. The 12- vs. 24-hour clock represents an English-French cultural difference.
Appendix B
Judgmental Review Rating Form

[The original appendix reproduces the rating form completed by the judgmental reviewers for each item. Only the column headings are recoverable from the scan: Item, Rating, Confidence Rating, Favours (E / F), and Code(s).]

Appendix C
Item P-Values for Booklet Sets 1 & 2

                 Booklet Set 1            Booklet Set 2
Item          English    French        English    French
1               .63        .56           .83        .90
2               .62        .54           .25        .24
3               .57        .56           .48        .46
4               .49        .47           .21        .22
5               .55        .54           .57        .53
6               .48        .46           .51        .54
7               .85        .88           .64        .58
8               .37        .47           .69        .68
9               .64        .65           .79        .77
10              .82        .81
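The p-values above are classical difficulty indices: for each item, the proportion of examinees in a language group answering correctly (for partial-credit items, the mean proportion of the maximum score). A minimal sketch of how such values could be reproduced from a scored response file is given below; the file layout, column names, and maximum-score values are assumptions for illustration only, not the PISA data format.

import pandas as pd

def item_p_values(scores: pd.DataFrame, item_cols, max_scores):
    """Mean proportion of each item's maximum score, by booklet set and language group."""
    props = scores[["booklet_set", "language"]].copy()
    for item in item_cols:
        # Express each examinee's score as a proportion of the item maximum so that
        # dichotomous and partial-credit items are reported on the same 0-1 scale.
        props[item] = scores[item] / max_scores[item]
    return props.groupby(["booklet_set", "language"]).mean().round(2)

# Hypothetical usage with a scored item-response file:
# scores = pd.read_csv("psm_scored_responses.csv")          # assumed file name
# items = [f"item_{i}" for i in range(1, 11)]
# max_scores = {item: 1 for item in items}                  # adjust for partial credit
# print(item_p_values(scores, items, max_scores))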
Appendix D
Test Items in English Source Version for Booklet Set 1

The test items included in this appendix (in English and French) were obtained from OECD Problem-Solving for Tomorrow's World: First Measures of Cross-Curricular Competencies (OECD, 2004).

DESIGN BY NUMBERS©

Design by Numbers is a design tool for generating graphics on computer. Pictures can be generated by giving a set of commands to the program. Study carefully the following example commands and pictures before answering the questions.

[Four example panels follow, each pairing a short command script (Paper, Pen, and Line commands with numeric arguments) with the picture it produces; the panels are not recoverable from the scan. A footnote states that Design by Numbers was developed by the Aesthetics and Computation Group at the MIT Media Laboratory, Massachusetts Institute of Technology, and that the program may be downloaded from http://dbn.media.mit.edu.]

DESIGN BY NUMBERS© - Questions 1, 2 and 3
[The item text and response options for these three questions are not recoverable from the scan.]

CHILDREN'S CAMP

The Zedish Community Service is organising a five-day Children's Camp. 46 children (26 girls and 20 boys) have signed up for the camp, and 8 adults (4 men and 4 women) have volunteered to attend and organise the camp.

Table 1. Adults: Mrs Madison, Mrs Carroll, Ms Grace, Ms Kelly, Mr [name illegible in the scan], Mr Neil, Mr Williams, Mr Peters.

Table 2. Dormitories (name / number of beds): Red 12, Blue 8, Green 8, Purple 8, Orange 8, Yellow 6, White 6.

Dormitory rules:
1. Boys and girls must sleep in separate dormitories.
2. At least one adult must sleep in each dormitory.
3. The adult(s) in a dormitory must be of the same gender as the children.

CHILDREN'S CAMP - Question 1
Dormitory allocation: Allocate the 46 children and 8 adults to dormitories, keeping to all the rules, by completing a table with the columns Name (Red, Blue, Green, Purple, Orange, Yellow, White), Number of boys, Number of girls, and Name(s) of adult(s).

FREEZER

Jane bought a new chest freezer. The manual gave the following instructions:
• Connect the appliance to the power and switch the appliance on.
• You will hear the motor running now.
• A red warning light (LED) on the display will light up.
• Turn the temperature control to the desired position. Position 2 is normal.
[A small table, not fully recoverable from the scan, matches each control position to a freezer temperature; position 2 corresponds to the normal setting of -18 °C.]
• The red warning light will stay on until the freezer temperature is low enough. This will take 1-3 hours, depending on the temperature you set.
• Load the freezer with food after four hours.

Jane followed these instructions, but she set the temperature control to position 4. After four hours, she loaded the freezer with food. After eight hours, the red warning light was still on, although the motor was running and it felt cold in the freezer.

FREEZER - Question 1
[The scanned English text is largely illegible. The item lists six warnings that Jane finds when she rereads the manual and asks, for each one, "Could ignoring this warning have caused a delay in the warning light going out?" (Yes / No for Warnings 1 to 6). The legible French version of this question appears later in this appendix.]

FREEZER - Question 2
[The scanned English text is largely illegible. The item asks which of three actions and observations suggest that the warning light was working properly, each answered Yes / No: she put the control to position 5 and the red light went off; she put the control to position 1 and the red light went off; she put the control to position 1 and the red light stayed on.]

ENERGY NEEDS

[A table, only partly legible in the scan, gives the recommended daily energy needs (in kJ) for men and women in three age bands (18 to 29, 30 to 59, 60 and above) at three activity levels (light, moderate, heavy). A second table assigns occupations to activity levels: light (indoor salesperson, office worker, housewife), moderate (teacher, outdoor salesperson, nurse), heavy (construction worker, labourer, sportsperson).]

ENERGY NEEDS - Question 1
[Largely illegible in the scan; the legible French version appears later in this appendix. The question asks for the recommended daily energy need, in kilojoules, of a 45-year-old teacher, and then presents a restaurant menu with a diner's estimates of the energy content (in kJ) of soups, main courses, salads, desserts and milkshakes, together with a fixed-price menu (Tomato Soup, Caribbean Ginger Chicken, Carrot Cake) at 50 zeds.]

ENERGY NEEDS - Question 2
[Largely illegible in the scan; the legible French version appears later in this appendix. The question states the energy already consumed that day and asks the examinee to show whether the fixed-price menu would keep the diner within 500 kJ of her recommended daily amount.]

CINEMA OUTING

This problem is about finding a suitable time and date to go to the cinema. Isaac, a 15-year-old, wants to organise a cinema outing with two of his friends, who are of the same age, during the one-week school vacation. The vacation begins on Saturday, 24 March and ends on Sunday, 1 April.

Isaac asks his friends for suitable dates and times for the outing. The following information is what he received.
Fred: "I have to stay home on Monday and Wednesday afternoons for music practice between 2:30 and 3:30."

Stanley: "I have to visit my grandmother on Sundays, so it can't be Sundays. I have seen Pokamin and don't want to see it again."

Isaac's parents insist that he only goes to movies suitable for his age and does not walk home. They will fetch the boys home at any time up to 10 p.m.

Isaac checks the movie times for the vacation week. This is the information that he finds:

[A cinema listing follows, only partly legible in the scan. It gives the Tivoli Cinema's advance booking and 24-hour information telephone numbers, a bargain day (Tuesdays, all films $3), and the films showing from 23 March for two weeks, with running times, screening times and age classifications: Children in the Net, Pokamin, Monsters from the Deep, Enigma, Carnivore, and King of the Wild. The legible French version of the listing appears later in this appendix.]

CINEMA OUTING - Question 1
Taking into account the information Isaac found about the films and obtained from his friends, which of the six movies should the three boys consider watching? Circle "Yes" or "No" for each movie (Children in the Net, Monsters from the Deep, Carnivore, Pokamin, Enigma, King of the Wild).

CINEMA OUTING - Question 2
[The item text is not recoverable from the scan.]

Test Items in English Source Version for Booklet Set 2

LIBRARY SYSTEM

The John Hobson High School library has a simple system for lending books: for staff members the loan period is 28 days, and for students the loan period is 7 days. The following is a decision tree diagram showing this simple system:

[Decision tree diagram: "Is the borrower a staff member?" If Yes, the loan period is 28 days; if No, the loan period is 7 days.]

The Greenwood High School library has a similar, but more complicated, lending system:
• All publications classified as "Reserved" have a loan period of 2 days.
• For books (not including magazines) that are not on the reserved list, the loan period is 28 days for staff, and 14 days for students.
• For magazines that are not on the reserved list, the loan period is 7 days for everyone.
• Persons with any overdue items are not allowed to borrow anything.

LIBRARY SYSTEM - Question 1
You are a student at Greenwood High School, and you do not have any overdue items. You want to borrow a book that is not on the reserved list. For how long can you borrow the book?

LIBRARY SYSTEM - Question 2
Develop a decision tree diagram for the Greenwood High School library system so that an automated checking system can be designed to deal with book and magazine loans at the library. Your checking system should be as efficient as possible (i.e., it should have the fewest checking steps). Note that each checking step should have only two outcomes, and the outcomes should be labelled appropriately (e.g., "Yes" and "No").

COURSE DESIGN

A technical college offers the following 12 subjects for a three-year course, where the length of each subject is one year:

     Subject Code   Subject Name
1    M1             Mechanics Level 1
2    M2             Mechanics Level 2
3    E1             Electronics Level 1
4    E2             Electronics Level 2
5    B1             Business Studies Level 1
6    B2             Business Studies Level 2
7    B3             Business Studies Level 3
8    C1             Computer Systems Level 1
9    C2             Computer Systems Level 2
10   C3             Computer Systems Level 3
11   T1             Technology and Information Management Level 1
12   T2             Technology and Information Management Level 2

COURSE DESIGN - Question 1
Each student will take 4 subjects per year, thus completing the 12 subjects in 3 years. A student may only take a subject at a higher level if he or she has completed the lower level(s) of the same subject in a previous year; for example, Business Studies Level 3 can only be taken after completing Business Studies Levels 1 and 2. In addition, Electronics Level 1 can only be taken after completing Mechanics Level 1, and Electronics Level 2 can only be taken after completing Mechanics Level 2. Decide which subjects should be offered in which year by completing the table below. Write the subject codes in the table.

[Table to be completed: Subject 1, Subject 2, Subject 3, Subject 4 for Year 1, Year 2, Year 3.]

TRANSIT SYSTEM

The following diagram shows part of the transport system of a city in Zedland, with three railway lines. It shows where you are at present, and where you have to go.

[Diagram of the three railway lines (Lines A, B and C), not recoverable from the scan. The key states that one symbol means a station on a railway line and another means a station that is a junction, where you can change from one railway line to another.]

The fare is based on the number of stations travelled (not counting the station where you start your journey). Each station travelled costs 1 zed. The time taken to travel between two adjacent stations is about 2 minutes. The time taken to change from one railway line to another at a junction is about 5 minutes.

TRANSIT SYSTEM - Question 1
The diagram shows the station where you are at present ("From here") and the station where you want to go ("To here"). Mark on the diagram the best route in terms of cost and time, and give the fare you would pay and the approximate time the journey would take.

HOLIDAY

This problem is about planning the best route for a holiday. Figures 1 and 2 show a map of the area and the distances between towns.

Figure 1. Map of roads between towns. [Not recoverable from the scan.]

Figure 2. Shortest road distances of towns from each other in kilometres. [The table is only partly legible in the scan; it lists the towns Angaz, Kado, Lapat, Megal, Nuben and Piras.]

HOLIDAY - Question 1
Calculate the shortest distance by road between Nuben and Kado.

HOLIDAY - Question 2
Zoe lives in Angaz. She wants to visit Kado and Lapat. She can travel no more than 300 kilometres in any one day, but she can break her journey by camping overnight anywhere between towns. Zoe will stay for two nights in each town, so that she can spend one whole day sightseeing in each town. Show Zoe's itinerary by completing the table below to indicate where she stays each night.

Day   Overnight stay
1     Camp-site between Angaz and Kado
2
3
4
5
6
7     Angaz

IRRIGATION

Below is a diagram of a system of irrigation channels for watering sections of crops. The gates A to H can be opened and closed to let the water go where it is needed. When a gate is closed, no water can pass through it.

[Diagram of the channel system with gates A to H, an inflow and an outflow; not recoverable from the scan.]

This is a problem about finding a gate which is stuck closed, preventing water from flowing through the system of channels.

Michael notices that the water is not always going where it is supposed to. He thinks that one of the gates is stuck closed, so that when it is switched to open, it does not open.

IRRIGATION - Question 1
Michael uses the settings given in Table 1 to test the gates.

Table 1. Gate settings. [The individual open/closed settings for gates A to H are not fully recoverable from the scan.]

With the gate settings as given in Table 1, draw on the diagram all the possible paths for the flow of water. Assume that all gates are working according to these settings.

IRRIGATION - Question 2
Michael finds that, when the gates have the Table 1 settings, no water flows through, indicating that at least one of the gates set to "open" is stuck closed. For each problem case below, decide whether the water will flow through all the way. Circle "Yes" or "No" for each case.

Gate A is stuck closed. All other gates are working properly as set in Table 1.   Yes / No
Gate D is stuck closed. All other gates are working properly as set in Table 1.   Yes / No
Gate F is stuck closed.
All other gates are working properly as set in Table 1.   Yes / No

IRRIGATION - Question 3
Michael wants to be able to test whether gate D is stuck closed. In the table below, give settings for the gates (each one open or closed) that would test whether gate D is stuck closed when it is set to open.

Settings for gates (each one open or closed): A   B   C   D   E   F   G   H

Test Items in French Canadian Version for Booklet Set 1

DESIGN BY NUMBERS©

Design by Numbers est un outil de conception assistée par ordinateur qui permet de générer des éléments graphiques. On peut générer des images en donnant au logiciel une série d'instructions. Étudiez attentivement les exemples d'instructions et d'images ci-dessous avant de répondre aux questions.

[Four example panels follow, each pairing a short command script (Papier, Crayon, and Ligne commands with numeric arguments) with the picture it produces; the panels are not recoverable from the scan. A footnote states that Design by Numbers was developed by the Aesthetics and Computation Group at the MIT Media Laboratory, Massachusetts Institute of Technology, and that the program may be downloaded from http://dbn.media.mit.edu.]

DESIGN BY NUMBERS© - Questions 1, 2 et 3
[The item text and response options for these three questions are not recoverable from the scan.]

COLONIE DE VACANCES

Les services de la ville de Zedish organisent une colonie de vacances qui durera cinq jours. Il y a 46 enfants (26 filles et 20 garçons) qui se sont inscrits à la colonie de vacances, et 8 adultes (4 hommes et 4 femmes) se sont portés volontaires pour les accompagner et pour organiser la colonie.

Tableau 1. Adultes : Mme Marlette, Mme Chantal, Mlle Greta, Mlle Lorraine, M. Simon, M. Noël, M. [nom illisible dans le document numérisé], M. Pascal.

Tableau 2. Dortoirs (nom / nombre de lits) : Rouge 12, Bleu 8, Vert 8, Violet 8, Orange 8, Jaune 6, Blanc 6.

Règlement du dortoir :
1. Les garçons et les filles doivent dormir dans des dortoirs séparés.
2. Il faut qu'au moins un adulte dorme dans chaque dortoir.
3. L'adulte ou les adultes qui dorment dans un dortoir doivent être du même sexe que les enfants.

COLONIE DE VACANCES - Question 1
Affectation des dortoirs : répartissez les 46 enfants et les 8 adultes dans les dortoirs, en respectant toutes les règles, en complétant le tableau (Nom : Rouge, Bleu, Vert, Violet, Orange, Jaune, Blanc ; Nombre de garçons ; Nombre de filles ; Nom(s) de l'adulte ou des adultes).

CONGÉLATEUR

Janine a acheté un nouveau congélateur de type « armoire ». Le manuel d'utilisation contient les instructions suivantes :
• Brancher l'appareil sur le secteur, puis l'allumer.
• Vous entendez le moteur se mettre en marche.
• Un voyant rouge (DEL) s'allume.
• Tourner le bouton de réglage de température jusqu'à la position souhaitée. La position normale est la position 2.
[A small table, not fully recoverable from the scan, matches each control position to a freezer temperature; position 2 corresponds to the normal setting of -18 °C.]
• Le voyant rouge restera allumé jusqu'à ce que la température du congélateur soit suffisamment basse. Cela prendra de une à trois heures, en fonction de la température à laquelle vous avez réglé l'appareil.
• Disposer les aliments dans le congélateur après quatre heures.

Janine suit ces instructions, mais elle tourne le bouton de réglage de température en position 4. Après 4 heures, elle dispose des aliments dans le congélateur.

CONGÉLATEUR - Question 1
Janine relit le manuel pour voir si elle a fait quelque chose de travers. Elle y trouve les six avertissements suivants :
1.
Ne pas brancher l'appareil sur une prise de courant qui n'est pas reliée à la terre.
2. Ne pas régler le congélateur sur une température plus basse que nécessaire (la température normale est -18 °C).
3. Les grilles d'aération ne doivent pas être obstruées. Cela pourrait diminuer la capacité de réfrigération de l'appareil.
4. Ne pas congeler de laitue, de radis, de raisins, de pommes ou de poires entières, ou de viandes grasses.
5. Ne pas saler ou assaisonner les aliments frais avant de les congeler.
6. Ne pas ouvrir la porte du congélateur trop souvent.

Parmi ces six avertissements, lequel ou lesquels pourraient avoir été négligés, retardant le moment où le voyant lumineux s'éteint ? Encerclez « Oui » ou « Non » pour chacun des 6 avertissements.

Avertissement   Le fait de négliger cet avertissement aurait-il pu retarder le moment où le voyant s'éteint ?
Avertissement 1   Oui / Non
Avertissement 2   Oui / Non
Avertissement 3   Oui / Non
Avertissement 4   Oui / Non
Avertissement 5   Oui / Non
Avertissement 6   Oui / Non

CONGÉLATEUR - Question 2
Janine se demande si le voyant lumineux fonctionne correctement. Parmi les actions et les observations suivantes, laquelle ou lesquelles suggèrent que le voyant fonctionne correctement ? Encerclez « Oui » ou « Non » pour chacun des trois cas.

Action et observation   Est-ce que les faits observés suggèrent que le voyant fonctionne correctement ?
Elle tourne le bouton de réglage à la position 5 et le voyant rouge s'éteint.   Oui / Non
Elle tourne le bouton de réglage à la position 1 et le voyant rouge s'éteint.   Oui / Non
Elle tourne le bouton de réglage à la position 1 et le voyant rouge reste allumé.   Oui / Non

BESOINS EN ÉNERGIE

[A table, only partly legible in the scan, gives the recommended daily energy intake (in kJ) for men and women in three age bands (de 18 à 29 ans, de 30 à 59 ans, 60 ans et plus) at three activity levels (léger, modéré, intense). A second table assigns occupations to activity levels: léger (vendeur en intérieur, travail de bureau, ménagère), modéré (enseignant, vendeur en extérieur, infirmier), intense (ouvrier du bâtiment, manœuvre, sportif).]

BESOINS EN ÉNERGIE - Question 1
M. David Edison est un enseignant de 45 ans. Quel est (en kJ) l'apport en énergie recommandé pour satisfaire ses besoins énergétiques quotidiens ?
Réponse : ........ kilojoules.

Jeanne Gibon est une sauteuse en hauteur de 19 ans. Un soir, des amis l'invitent à souper au restaurant. La carte proposée est la suivante (estimation par Jeanne de l'apport en énergie des divers plats, en kJ) :

Potages :           Soupe à la tomate 355 ; Soupe de champignons à la crème 585
Plats principaux :  Poulet à la mexicaine (relevé) 960 ; Poulet au gingembre des Caraïbes 795 ; Brochettes de porc à la sauge 920
Salades :           Salade de pommes de terre 750 ; Salade d'épinards aux abricots et aux noisettes 335 ; Taboulé 480
Desserts :          Tarte aux pommes et aux framboises 1 380 ; Gâteau au fromage et au gingembre 1 005 ; Gâteau aux carottes 565
Milk-shakes :       Chocolat 1 590 ; Vanille 1 470

Le restaurant propose également un menu spécial à prix fixe.
Menu à prix fixe (50 zeds) : Soupe à la tomate ; Poulet au gingembre des Caraïbes ; Gâteau aux carottes.

BESOINS EN ÉNERGIE - Question 2
Jeanne tient un relevé de ce qu'elle mange chaque jour.
Ce jour-là, les aliments qu'elle a déjà consommés avant le souper ont représenté un apport en énergie de 7 520 kJ.

Jeanne ne veut pas que son apport en énergie soit inférieur ou supérieur de plus de 500 kJ aux apports quotidiens en énergie recommandés dans son cas. Déterminez si le « Menu à prix fixe » permettra à Jeanne de respecter, à ± 500 kJ près, l'apport en énergie recommandé dans son cas. Montrez votre travail.

SORTIE AU CINÉMA

Dans cet exercice, il s'agit de trouver une date et une heure appropriées pour aller au cinéma.

Isaac a 15 ans. Il veut organiser une sortie au cinéma avec deux de ses copains du même âge que lui, pendant la prochaine semaine de vacances scolaires. Les vacances commencent le samedi 24 mars et se terminent le dimanche 1er avril.

Isaac demande à ses camarades quels sont les jours et les heures qui leur conviennent pour cette sortie. Il a reçu les informations suivantes :

François : « Je dois rester chez moi le lundi et le mercredi après-midi de 14 h 30 à 15 h 30 pour mes leçons de musique. »

Simon : « Je dois rendre visite à ma grand-mère les dimanches, donc les dimanches sont exclus. J'ai déjà vu Pokamin et je ne veux pas le revoir. »

Les parents d'Isaac insistent pour qu'il choisisse un film qui ne soit pas interdit aux jeunes de son âge et pour qu'il ne rentre pas à pied ; ils proposent de ramener les garçons chez eux à n'importe quelle heure jusqu'à 10 heures du soir.

Isaac se renseigne sur les programmes de cinéma pour la semaine de vacances. Voici les informations qu'il a recueillies :

CINÉMA TIVOLI. Réservations au numéro : (555) 743-6666. Infos 24 h/24 : (555) 743-6678.
Promotion spéciale les mardis : tous les films à 3,00 $.
Programme en vigueur à partir du vendredi 23 mars, pour deux semaines :

Enfants sur la Toile, 113 min. Interdit aux moins de 12 ans. 14:00 (lun-ven seulement), 21:35 (sam/dim seulement).
Pokamin, 105 min. Accord parental souhaitable. Pour tous, mais certaines scènes peuvent heurter la sensibilité des plus jeunes. 13:40 (tous les jours), 16:35 (tous les jours).
Les monstres des profondeurs, 164 min. Interdit aux moins de 18 ans. 19:55 (ven/sam seulement).
Enigma, 144 min. Interdit aux moins de 12 ans. 15:00 (lun-ven seulement), 18:00 (sam/dim seulement).
Carnivore, 148 min. Interdit aux moins de 18 ans. 18:30 (tous les jours).
Le Roi de la savane, 117 min. Pour tous. 14:35 (lun-ven seulement), 18:50 (sam/dim seulement).

SORTIE AU CINÉMA - Question 1
En tenant compte des renseignements qu'Isaac a recueillis sur le programme de cinéma et auprès de ses copains, le(s)quel(s) des six films Isaac et ses amis peuvent-ils envisager d'aller voir ? Encerclez « Oui » ou « Non » pour chacun des films.

Film   Les trois garçons peuvent-ils envisager d'aller voir le film ?
Enfants sur la Toile   Oui / Non
Les monstres des profondeurs   Oui / Non
Carnivore   Oui / Non
Pokamin   Oui / Non
Enigma   Oui / Non
Le Roi de la savane   Oui / Non

SORTIE AU CINÉMA - Question 2
[The item text is not recoverable from the scan.]

Test Items in French Canadian Version for Booklet Set 2

SYSTÈME DE GESTION D'UNE BIBLIOTHÈQUE

La bibliothèque de l'École secondaire Montaigne utilise un système simple de gestion du prêt de livres : pour les membres du personnel, la durée du prêt est de 28 jours et pour les élèves, la durée du prêt est de 7 jours. On peut voir ci-dessous un schéma de décision en arbre qui présente ce système simple :

[Decision tree diagram: « L'emprunteur est-il un membre du personnel ? » Si oui, la durée du prêt est de 28 jours ; si non, la durée du prêt est de 7 jours.]
La bibliothèque de l'École secondaire Coulanges utilise un système de gestion des prêts similaire, mais plus complexe :
• Pour toutes les publications classées comme « réservées », la durée du prêt est de 2 jours.
• Pour les livres (mais pas les magazines) qui ne sont pas sur la liste des publications réservées, la durée du prêt est de 28 jours pour les membres du personnel et de 14 jours pour les élèves.
• Pour les magazines qui ne sont pas sur la liste des publications réservées, la durée du prêt est de 7 jours pour tout le monde.
• Les personnes ayant des emprunts en cours pour lesquels la date de retour est dépassée ne peuvent effectuer aucun nouvel emprunt.

SYSTÈME DE GESTION D'UNE BIBLIOTHÈQUE - Question 1
Vous êtes un élève de l'École secondaire Coulanges et vous n'avez pas d'emprunts en cours pour lesquels la date de retour est dépassée. Vous souhaitez emprunter un livre qui n'est pas sur la liste des publications réservées. Pour combien de temps pouvez-vous emprunter ce livre ?
Réponse : ........ jours.

SYSTÈME DE GESTION D'UNE BIBLIOTHÈQUE - Question 2
Créez un schéma de décision en arbre pour le système de gestion des prêts de la bibliothèque de l'École secondaire Coulanges, permettant de concevoir un système de contrôle automatisé des prêts de livres et de magazines de la bibliothèque. Votre système de contrôle doit être aussi efficace que possible (c'est-à-dire qu'il doit avoir le plus petit nombre possible d'étapes de contrôle). Notez que chaque étape de contrôle ne doit présenter que deux issues et que ces issues doivent être étiquetées correctement (par exemple : « Oui » et « Non »).

PROGRAMME DE COURS

Un institut d'enseignement technique propose les 12 matières suivantes dans le cadre d'un programme de 3 ans, où chaque matière est donnée pendant une année :

     Code de la matière   Intitulé de la matière
1    M1                   Mécanique niveau 1
2    M2                   Mécanique niveau 2
3    E1                   Électronique niveau 1
4    E2                   Électronique niveau 2
5    B1                   Études commerciales niveau 1
6    B2                   Études commerciales niveau 2
7    B3                   Études commerciales niveau 3
8    C1                   Systèmes informatiques niveau 1
9    C2                   Systèmes informatiques niveau 2
10   C3                   Systèmes informatiques niveau 3
11   T1                   Technologie et gestion de l'information niveau 1
12   T2                   Technologie et gestion de l'information niveau 2

PROGRAMME DE COURS - Question 1
Chaque étudiant devra suivre 4 matières par an et verra donc les 12 matières en 3 ans. Les étudiants ne sont autorisés à suivre les cours de niveau supérieur dans une matière qu'à la condition d'avoir terminé le(s) niveau(x) inférieur(s) dans la même matière lors d'une année antérieure. Par exemple, vous ne pouvez suivre les Études commerciales de niveau 3 qu'après avoir terminé les Études commerciales de niveaux 1 et 2. De plus, on ne peut suivre l'Électronique de niveau 1 qu'après avoir terminé la Mécanique de niveau 1, et on ne peut suivre l'Électronique de niveau 2 qu'après avoir terminé la Mécanique de niveau 2.

Décidez quelles matières il faut proposer dans quelle année et complétez le tableau ci-dessous. Inscrivez les codes des matières dans le tableau.

[Table to be completed: Matière 1, Matière 2, Matière 3, Matière 4 for 1re année, 2e année, 3e année.]

CORRESPONDANCES

Le schéma ci-dessous montre une section du réseau de transports publics d'une ville de Zedlande, comprenant trois lignes de métro. Il montre également l'endroit où vous vous trouvez actuellement et celui où vous devez vous rendre.

[Diagram of the part of the métro network (lignes A, B et C), not recoverable from the scan.]

Légende :
● Représente une
station sur une des lignes de métro.
◉ Représente une jonction, c'est-à-dire une station où existe une correspondance permettant de changer de ligne de métro (lignes A, B ou C).

[A note inserted in the scanned page indicates corrections to the wording used in the graphic; the replacement wording is not recoverable from the scan.]

Le prix d'un billet de métro est en fonction du nombre de stations traversées (sans compter la station de départ). Le coût s'élève à 1 zed par station traversée. La durée du parcours entre deux stations successives est d'environ 2 minutes. La durée nécessaire pour changer de ligne à une jonction est d'environ 5 minutes.

CORRESPONDANCES - Question 1
Sur le schéma, on peut voir la station où vous vous trouvez en ce moment (« départ ») et celle où vous souhaitez vous rendre (« arrivée »). Indiquez sur le schéma le meilleur parcours (en termes de durée et de coût) et inscrivez, ci-dessous, le prix que vous paierez, ainsi que la durée approximative du trajet.
Prix : ........ zeds.

VACANCES

Dans ce problème, il s'agit de déterminer le meilleur itinéraire de vacances. Les figures 1 et 2 présentent une carte de la région et les distances entre les villes.

Figure 1. Carte des routes d'une ville à l'autre. [Not recoverable from the scan.]

Figure 2. Distances routières les plus courtes entre les villes, exprimées en kilomètres. [The table is only partly legible in the scan; it lists the towns Angaz, Kado, Lapat, Megal, Nuben et Piras.]

VACANCES - Question 1
Calculez la plus courte distance par route entre Nuben et Kado.
Distance : ........ kilomètres.

VACANCES - Question 2
Zoé habite à Angaz. Elle veut visiter Kado et Lapat. Elle ne peut pas faire plus de 300 kilomètres par jour, mais elle peut couper ses trajets en campant, pour la nuit, n'importe où entre deux villes. Zoé restera deux nuits dans chaque ville, de manière à pouvoir passer chaque fois une journée entière à les visiter.

Donnez l'itinéraire de Zoé en remplissant le tableau ci-dessous pour indiquer où elle passera chacune des nuits.

Jour   Logement pour la nuit
1      Terrain de camping entre Angaz et Kado
2
3
4
5
6
7      Angaz

IRRIGATION

Le schéma ci-dessous représente un système de canaux destiné à l'irrigation de parcelles cultivées. Les vannes A à H peuvent être ouvertes ou fermées pour amener l'eau là où elle est nécessaire. Quand une vanne est fermée, l'eau ne passe pas.

[Diagram of the channel system with gates A to H, an inflow and an outflow (« Sortie »); not recoverable from the scan.]

Dans ce problème, il s'agit d'identifier une vanne qui est bloquée, empêchant l'eau de s'écouler au travers du système de canaux.

Michel a remarqué que l'eau ne va pas toujours là où elle est censée aller. Il pense qu'une des vannes est bloquée en position fermée, de sorte qu'elle ne s'ouvre pas, même lorsqu'on en commande l'ouverture.

IRRIGATION - Question 1
Michel utilise les réglages présentés par le tableau 1 pour vérifier le fonctionnement des vannes.

Tableau 1 : Réglages des vannes. [The individual open/closed settings for gates A to H are not fully recoverable from the scan.]

Compte tenu des réglages qui figurent au tableau 1, tracez sur le schéma ci-dessus tous les chemins possibles par où l'eau peut s'écouler. Supposez que l'ensemble des vannes fonctionnent selon les réglages.

IRRIGATION - Question 2
Michel s'aperçoit que, quand les vannes sont réglées comme indiqué dans le tableau 1, il n'y a pas d'eau qui s'écoule à la sortie, indiquant qu'au moins une des vannes réglées en « position ouverte » est en fait bloquée en position fermée.

Pour chacune des pannes décrites ci-dessous, indiquez si l'eau s'écoulera jusqu'à la sortie. Encerclez « Oui » ou « Non » pour chaque panne.

Panne   L'eau s'écoulera-t-elle jusqu'à la sortie ?
La vanne A est bloquée en position fermée.
Toutes les autres vannes fonctionnent correctement selon les réglages du tableau 1.   Oui / Non
La vanne D est bloquée en position fermée. Toutes les autres vannes fonctionnent correctement selon les réglages du tableau 1.   Oui / Non
La vanne F est bloquée en position fermée. Toutes les autres vannes fonctionnent correctement selon les réglages du tableau 1.   Oui / Non

IRRIGATION - Question 3
Michel veut pouvoir vérifier si la vanne D est bloquée en position fermée. Dans le tableau ci-dessous, indiquez comment devront être réglées les vannes pour savoir si la vanne D est bloquée en position fermée alors qu'on l'a réglée en « position ouverte ».

Réglages des vannes (chacune ouverte ou fermée) : A   B   C   D   E   F   G   H

The University of British Columbia
Office of Research Services
Behavioural Research Ethics Board
Suite 102, 6190 Agronomy Road, Vancouver, B.C. V6T 1Z3

CERTIFICATE OF APPROVAL - MINIMAL RISK

PRINCIPAL INVESTIGATOR: Kadriye Ercikan
INSTITUTION / DEPARTMENT: UBC/Education/Educational & Counselling Psychology, and Special Education
UBC BREB NUMBER: H07-01033

INSTITUTION(S) WHERE RESEARCH WILL BE CARRIED OUT:
Institution: UBC; Site: Vancouver (excludes UBC Hospital)
Other locations where the research will be conducted: UBC campus

CO-INVESTIGATOR(S): N/A
SPONSORING AGENCIES: N/A

PROJECT TITLE: Analysis of construct comparability in cognitive ability tests

CERTIFICATE EXPIRY DATE: September 26, 2008
DATE APPROVED: September 26, 2007

DOCUMENTS INCLUDED IN THIS APPROVAL:
Document Name: Protocol: Analysis of Construct Comparability in Cognitive Ability Tests; Version: N/A; Date: September 12, 2007

The application for ethical review and the document(s) listed above have been reviewed and the procedures were found to be acceptable on ethical grounds for research involving human subjects.

Approval is issued on behalf of the Behavioural Research Ethics Board and signed electronically by one of the following:
Dr. M. Judith Lynam, Chair
Dr. Jim Rupert, Associate Chair
Dr. Laurie Ford, Associate Chair

https://rise.ubc.ca/rise/Doc/O/PGFJDI6JBOEK37UB2T4QS5JED6/fromString.html   29/11/2007
