THE EFFECT OF ITEM FORMAT ON MATHEMATICS ACHIEVEMENT TEST SCORES

by

Leslie Hubert Dukowski
B.Sc., University of British Columbia, 1973

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF ARTS
in the Faculty of Education

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
April, 1982
(c) Leslie Hubert Dukowski

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

The University of British Columbia
1956 Main Mall
Vancouver, Canada V6T 1Y3

Abstract

Thesis Supervisor: Dr. David F. Robitaille

The purpose of the study was to determine whether item format significantly affected scores on a mathematics achievement test. A forty-two item test was constructed and cast in both multiple-choice and constructed-response formats. The items were chosen in such a way that in each of three content domains, Computation, Application, and Algebra, there were seven items at each of two difficulty levels. The two tests were then administered on separate occasions to a sample of 213 Grade 7 students from a suburban/rural community in British Columbia, Canada.

The data gathered was analysed according to a repeated measures analysis of variance procedure using item format and item difficulty as trial factors and using student ability and gender as grouping factors.
Item format did have a significant (p < 0.05) effect on test score. In all domains multiple-choice scores were higher than constructed-response scores. The multiple-choice scores were also transformed using the traditional correction for guessing procedure and analysed. Multiple-choice scores were still significantly higher in two of the three domains, Application and Algebra. There were significant omnibus F-statistics obtained for a number of interactions for both corrected and uncorrected data but there were significant Tetrad differences (p < 0.10) only for interactions involving format and difficulty.

The results indicate that students score higher on a multiple-choice form of a mathematics achievement test than on a constructed-response form, and therefore the two scores cannot be considered equal or interchangeable. However, because of the lack of interactions involving format, the two scores may be considered equivalent in the sense that they rank students in the same manner and that the intervals between scores may be interpretable in the same manner under both formats. Therefore, although the traditional correction for chance formula is not sufficient to remove differences between multiple-choice and constructed-response scores, it may be possible to derive an empirical scoring formula which would equate the two types of scores on a particular test.

CONTENTS

LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGEMENT

Chapter

1. BACKGROUND
     Statement of the Problem
     Definition of Terms
     Organization of the Following Chapters

2. REVIEW OF RELATED LITERATURE
     Recall and Recognition
     Advantages and Disadvantages of Recall and Recognition Items
     Reliability and Validity Studies
     Mental Processes Involved in Recall and Recognition
     Response Sets
     Characteristics of Response Sets
     Effects of Response Sets and Guessing
     Formula Scoring and Corrections for Guessing
     Sex-related Differences in Mathematics Achievement
     Hypotheses

3. METHOD
     Development of Tests
     Origin of the Test Items
     Content Domains
     Selection of Test Items
     Pilot Testing
     Sample Selection
     Description of the Population
     Selection Technique
     Test Administration
     Data Analysis
     Test Analyses
     Preliminary Analyses
     Order of Administration
     Class
     Difficulty
     Final Analyses

4. RESULTS
     Description of the Sample
     Descriptive Analysis of the Test Scores
     Inferential Analyses
     Gender, Ability, Item Format, and Item Difficulty
     Correction for Guessing
     Regression Analysis

5. SUMMARY, CONCLUSIONS, AND IMPLICATIONS
     Summary
     Item Format
     Class
     Gender
     Ability
     Item Difficulty
     Conclusions
     Implications
     Limitations of the Study
     Suggestions for Future Research

REFERENCES

APPENDIX
     A. Copies of the Test Instruments and Table of Question Distribution
     B. Directions to Test Administrators
     C. Summary ANOVA Tables for Order of Test Administration, Class, and Item Difficulty
     D. Cell Means and Standard Deviations for Transformed Scores With and Without the Correction for Guessing

LIST OF TABLES

Table

4.1 Distribution of Subjects by Gender and Ability
4.2 Mean Item Difficulty
4.3 Subtest Reliabilities, Means, Standard Deviations, and Standard Errors of Measurement
4.4 Summary Analysis of Variance; Gender, Ability, Item Format, and Item Difficulty; Computation Domain
4.5 Summary Analysis of Variance; Gender, Ability, Item Format, and Item Difficulty; Application Domain
4.6 Summary Analysis of Variance; Gender, Ability, Item Format, and Item Difficulty; Algebra Domain
4.7 Summary Analysis of Variance; Gender, Ability, Item Format, and Item Difficulty; Computation Domain, Multiple-Choice Scores Corrected for Guessing
4.8 Summary Analysis of Variance; Gender, Ability, Item Format, and Item Difficulty; Application Domain, Multiple-Choice Scores Corrected for Guessing
4.9 Summary Analysis of Variance; Gender, Ability, Item Format, and Item Difficulty; Algebra Domain, Multiple-Choice Scores Corrected for Guessing
4.10 Regression Weights and Intercepts, and Correlation Coefficients for C-R Scores Regressed on M-C Scores
A.1 Distribution of Questions by Domain and Difficulty
C.1 Summary Analysis of Variance; Class, Order, and Item Difficulty; Multiple-Choice Computation
C.2 Summary Analysis of Variance; Class, Order, and Item Difficulty; Multiple-Choice Application
C.3 Summary Analysis of Variance; Class, Order, and Item Difficulty; Multiple-Choice Algebra
C.4 Summary Analysis of Variance; Class, Order, and Item Difficulty; Constructed-Response Computation
C.5 Summary Analysis of Variance; Class, Order, and Item Difficulty; Constructed-Response Application
C.6 Summary Analysis of Variance; Class, Order, and Item Difficulty; Constructed-Response Algebra
D.1 Cell Means and Standard Deviations; Computation
D.2 Cell Means and Standard Deviations; Application
D.3 Cell Means and Standard Deviations; Algebra
D.4 Cell Means and Standard Deviations; Computation; Scores Corrected for Guessing
D.5 Cell Means and Standard Deviations; Application; Scores Corrected for Guessing
D.6 Cell Means and Standard Deviations; Algebra; Scores Corrected for Guessing

LIST OF FIGURES

Figure

4.1 Plot of Cell Means versus Gender for Items of High and Low Difficulty; Application Domain
4.2 Plot of Cell Means versus Item Difficulty by Ability Level; Computation Domain
4.3 Plot of Cell Means versus Item Difficulty by Ability Level; Application Domain
4.4 Plot of Cell Means versus Item Difficulty by Ability Level; Algebra Domain
4.5 Plot of Cell Means versus Item Format by Item Difficulty; Application Domain
4.6 Plot of Cell Means versus Item Format by Item Difficulty for Three Ability Levels; Algebra Domain
4.7 Plot of Cell Means versus Item Format by Item Difficulty; Computation Domain; Chance-Corrected Data
4.8 Plot of Cell Means versus Item Format by Item Difficulty; Algebra Domain; Chance-Corrected Data
4.9 Summary of Main Effects and Interactions

ACKNOWLEDGEMENT

I wish to thank the members of my thesis committee, Dr. David Robitaille, Dr. Tom O'Shea, and Dr. Todd Rogers for their guidance. It has been a privilege for me to work with them.

In addition I would like to express my gratitude to the Grade 7 students and teachers in Langley School District who participated in the study. I am pleased to have been associated with colleagues of their professional stature.
Finally, I would like to thank my family for their encouragement, especially my wife Lesley for her patience and constant support.

Chapter 1

BACKGROUND

During the past decade, educators have witnessed a movement toward minimum-competency testing and large-scale evaluation of educational programs. In British Columbia, assessments in Reading, Writing, Mathematics, Science, Physical Education, and Social Studies have been conducted. As with comparable state and provincial assessments, the goal of the B.C. Learning Assessment Program is not to measure individual student performance, but to provide information about student learning on a province-wide basis. That is, the intent is to measure the extent to which the basic objectives of the educational system are being achieved by all students. On the other hand, the intent of the former provincial examination system appeared to be to determine the percentage of students who should be admitted to higher schooling (Mussio and Greer, 1980, pp. 26-27).

Once data have been collected and analysed, some judgment must be made regarding the acceptability of students' performance. A major component of the B.C. Learning Assessment Program is the interpretation of the results obtained by students on the assessment tests. This interpretation process is not without difficulties. Mussio and Greer (1980, p. 35) have discussed the concern over the confusion of norms and standards when interpreting assessment data. For example, Mathematics and Science Assessments utilized items from the National Assessment of Educational Progress in the United States and B.C. students outperformed U.S. students on a number of these items. On the basis of such evidence from a norm-referenced standpoint, one might conclude that the students in the schools are achieving the goals of the curriculum at a satisfactory level.
From a criterion-referenced standpoint, however, the decision is not at all clear. Even though the students outperformed their American counterparts, it still may be that the level of achievement on certain basic skill items was unacceptably low. Therefore, because the Assessment program is intended to determine whether or not students have achieved the objectives of the curriculum, a traditional, norm-referenced approach is not suitable. Mussio and Greer (1980) point out, however, that

     Experience dealing with interpretation panels, involving both educators and members of the public, for six assessments has repeatedly demonstrated an initial skepticism [on the part of the interpretation panels] of any method of interpretation that is other than normative. (p. 35)

It is not surprising that people are skeptical of criterion-referenced interpretation procedures. In the past, educators, and particularly those involved with educational measurement, have tended to stress procedures which measure relative rather than absolute worth (Burton, 1972, p. 1).

As a part of the 1981 Mathematics Assessment, criterion-referenced interpretations were made of student achievement on multiple-choice items (Robitaille, 1981). Mason (1979) claims that if provincial assessments are intended to measure essential, or core curriculum, learning objectives and if these objectives reflect real-life skills, then multiple-choice items are not appropriate. He points out that there are very few real-life situations where one is required to select a response from a variety of options. More often, one is required to construct a response. Mason also claims that the type of thinking required to answer multiple-choice questions is different from that required to answer constructed-response items.
For example, on a multiple-choice item it may be possible for respondents to choose the correct answer by guessing, eliminating alternatives judged unreasonable, or working backward from the given answers. These strategies are of little use on constructed-response questions. The scores from multiple-choice and open-ended forms of an achievement test may rank students in essentially the same manner; however, the scores may not be interchangeable when one wishes to make criterion-referenced judgments (Mason, 1979, p. 11).

Although these concerns regarding the use of multiple-choice items appear sound, Bracht and Hopkins' (1968) review of the literature led them to conclude that many of the differences of opinion regarding the content validity of objective tests, including multiple-choice tests, are not based on empirical evidence. With regard to the question of whether or not the cognitive processes involved in recall and recognition are the same, Tulving and Watkins (1973) claimed that the same psychological processes underlie both activities. Traub and Fisher (1977) also claimed that both constructed-response and multiple-choice forms of tests of mathematical reasoning measure the same attributes.

Statement of the Problem

The 1981 British Columbia Mathematics Assessment sought to determine the attainment of curriculum objectives by the entire student population at Grades 4, 8, and 12 (Robitaille, 1981). The technical considerations of large-scale testing made the use of multiple-choice items preferable to constructed-response items. Following the administration and scoring of the Assessment instruments, an attempt was made to arrive at criterion-referenced judgments based on the test item results.
There have been concerns expressed that the scores obtained by students on multiple-choice items may not be interchangeable with, nor equivalent to, scores obtained on the same items in constructed-response format. This being the case, the judgments made by the Interpretation Panels may be suspect. This study sought to investigate whether or not there were differences in the character of the scores which would seriously affect the interpretation of item results.

A demonstration that the two scores measure the same attributes in essentially the same way may relieve some of the apprehension on the part of those interpreting Assessment data. If the scores are shown not to be equivalent measures, then a description of the precise nature of the differences between the scores may make the interpretation more meaningful.

The general questions to be addressed by the study, then, arise from the concerns as to whether or not multiple-choice test item results are interpretable in the same way as constructed-response item results. The specific questions are as follows:

1. Do students score higher on a multiple-choice Mathematics achievement test than on the same test in constructed-response format?
2. Do boys outperform girls on Mathematics achievement tests in either format?
3. Is there an interaction between format and gender, and if so, what is the nature of the interaction?
4. Is there an interaction between format and ability, and if so, what is the nature of the interaction?
5. Is there an interaction between format and item difficulty, and if so, what is the nature of the interaction?
6. Are there any more complex interactions which may affect test score, and if so, what are the natures of those interactions?

Definition of Terms

The following terms are used throughout the study and are defined here for convenience.
Recognition items are those items for which the respondent chooses an alternative from a given list of choices. Recognition items will also be referred to as multiple-choice items.

Recall items are those items for which the respondent must construct a response. Recall items will also be referred to as constructed-response items.

Objective tests are those which contain recognition items or those recall items which require only single word or single phrase responses. That is, there is a clearly defined right answer and there is a dichotomous decision made on the part of the grader as to whether or not the response is acceptable.

Essay-type tests are those which contain items to which the respondent must answer using more than one phrase or sentence. Although there may be well defined criteria for the acceptability of responses, the grader rarely makes a dichotomous decision as to whether a response is acceptable.

Response sets have been defined in the following way:

     A response set is defined as any tendency causing a person to give different responses to test items than he would when the same content is presented in a different form. (Cronbach, 1946, p. 476, italics in the original)

For example, a person may have a tendency to agree with statements framed in a positive way and to disagree with statements framed in a negative way. With such a response set, a person might agree with content in one context, framed positively, and disagree with the same content in a different context, framed negatively.

Formula scoring is the procedure by which student scores on tests containing multiple-choice items are adjusted to correct for guessing on the part of respondents.
The usual formula for correcting individual scores is:

     S = R - (W/(k - 1))

where R is the number of correct responses, W is the number of incorrect responses, k is the number of answer options on a single item, and S is the score corrected for guessing. This formula is used under the assumption of random guessing on the part of respondents who do not know the correct answer.

Content domain. A content domain is a body of material defined by a set of learning outcomes. The three content domains into which test items were grouped in this study were Computation, Application, and Algebra. These content domains are operationally defined in Chapter 3.

Organization of the Following Chapters

A review of the literature pertaining to the experimental questions, a description of methodology, the results of statistical analysis and a discussion of the findings of the study are found in the following Chapters. The research hypotheses are presented at the end of Chapter 2 following the review of literature. Chapter 3 contains the details of sample selection, development and administration of test instruments, and methods of data analysis. The results of the descriptive and inferential analyses are discussed in Chapter 4 and summarized in Chapter 5 which also contains a discussion of the findings of the study and their implications. The test instruments and directions for test administration are included in the appendices following the references.

Chapter 2

REVIEW OF RELATED LITERATURE

The review of the literature is organized in four sections. First, literature pertaining to the issue of recall versus recognition is presented. Next, the literature regarding response sets is discussed. Literature related to formula scoring and corrections for guessing is then reviewed and followed, finally, by a review of studies of sex-related differences in Mathematics achievement.
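Because formula scoring recurs throughout the review that follows, the correction defined in Chapter 1, S = R - (W/(k - 1)), can be sketched in code as a reference point. This is only an illustrative sketch and not part of the study's analysis: the function name, the example counts, and the choice of k = 5 options are hypothetical, and omitted items are assumed to count as neither right nor wrong.

```python
def corrected_score(right: int, wrong: int, options: int) -> float:
    """Formula score S = R - W / (k - 1).

    right   -- R, number of correct responses
    wrong   -- W, number of incorrect responses
               (assumption: omitted items are not counted as wrong)
    options -- k, number of answer options on a single item
    """
    if options < 2:
        raise ValueError("an item needs at least two answer options")
    return right - wrong / (options - 1)


# Hypothetical respondent: 30 right, 8 wrong, 4 omitted on a 42-item test,
# with an assumed k = 5 options per item.
print(corrected_score(30, 8, 5))  # 30 - 8/4 = 28.0
```

Under the random-guessing assumption stated with the formula, each wrong answer is taken as evidence of 1/(k - 1) lucky right answers, which is why the penalty per wrong answer shrinks as the number of options grows.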
Recall and Recognition

One of the major questions in educational testing is that of whether or not recall and recognition tests are equally effective in measuring student ability. Proponents of the use of multiple-choice items cite the technical advantages of recognition items whereas advocates of constructed-response items point out that recall items may be used to assess partial knowledge (Cronbach, 1970, pp. 30-32).

Advantages and Disadvantages of Recall and Recognition Items

Stanley and Hopkins (1972, p. 236) claim that the "multiple-choice form is usually regarded as the most valuable and most generally applicable test form." Cronbach (1970) points out that one of the criticisms of multiple-choice tests is that they tend to be restricted to low-level thinking. He then goes on to claim that it is possible to construct multiple-choice tests which require a great deal of understanding and the application of higher cognitive processes. This opinion that multiple-choice items can test both simple and complex learning outcomes is shared by other measurement specialists as well (Gronlund, 1968, p. 26; Ebel, 1979, pp. 56-57).

The advantages of constructed-response or recall items are listed by Stanley and Hopkins (1972) as the following:

1. They are familiar to most children as they are commonly used on teacher made tests.
2. They almost completely eliminate guessing.
3. They are particularly suited to arithmetic and the physical sciences where computations are required.

They claim that a disadvantage of items which require single word or short phrase responses is that they tend to measure only factual knowledge. Gronlund (1968, p. 26), however, claims that essay questions are often used when more complex learning outcomes requiring unique responses are assessed. Ebel (1979, pp. 56-57) states that it is a misconception that item type indicates the ability tested.
Rather, good essay and objective tests can require the same kind and level of ability.

According to Ebel (1979, p. 57), multiple-choice tests are more difficult to construct than essay tests. However, they can be more rapidly and reliably scored than essay tests. Stanley and Hopkins (1972, p. 218) also claim that even single-word constructed-response items are time consuming to score and not always entirely objective.

Reliability and Validity Studies

Most of the empirical studies of objective and essay tests have been concerned with reliability (Bracht and Hopkins, 1968, p. 3). For example, Kinney and Eurich (1932) tabulated the results of thirteen studies comparing recall, multiple-choice, true-false, and essay-type examinations. In six of the nine studies comparing the reliabilities of multiple-choice and constructed-response tests, the constructed-response tests were shown to have higher reliabilities; however, no tests were performed to determine whether the reliabilities were significantly different. In contrast, more recent results support the claim that objective tests consistently show higher reliabilities than essay tests (Bracht and Hopkins, 1968).

With regard to the validity of objective and essay tests, Bracht and Hopkins (1968) reviewed a wide range of studies designed to compare content validity. Although there was considerable difference of opinion among the authors of the studies reviewed, Bracht and Hopkins concluded that the evidence supported the contention that both types of tests measure the same things. With reference to Mathematics, Cronbach (1970, p. 31) cited College Entrance Examination Board data which showed that the results from multiple-choice and constructed-response questions had essentially the same correlations with grades in later courses in Mathematics.
Mental Processes Involved in Recall and Recognition

Mason (1979), among others, has expressed the opinion that responses to recall and recognition items do not require the same mental processes. Among psychologists, there appear to be two competing theories to describe the processes involved in recall and recognition (Rabinowitz, Mandler, & Patterson, 1977). The two theories are the unitary strength theory and the generation-recognition hypothesis.

The unitary strength theory asserts that both recall and recognition have accompanying associative stimulus-response strengths. According to the theory, recognition items generally have more strength to elicit a correct response whereas recall items may not have as much stimulus information, that is, enough strength, to cause a respondent to produce a correct response.

On the other hand, the generation-recognition hypothesis asserts that while recognition requires a simple decision process, recall requires the generation of possible responses and then the elimination of unsatisfactory candidates. The decisions made in the elimination of candidates are essentially the same as those made in recognition.

Rabinowitz et al. (1977) reported the results of a number of studies designed to provide information regarding recall and recognition of information, specifically, lists of words. These studies provided data on recall tests preceding or following recognition tests, the effect of specific instructions in recalling words from the lists provided, the strength of items, and the effects of using a taxonomy when recalling or recognizing words. Based on an analysis of the results, the authors claimed that recall and recognition made use of similar mental processes.

Similarly, Tulving and Watkins (1973) claimed that the same psychological processes underlie both recall and recognition.
They reported a study in which subjects were shown a number of sequences of five-letter words. Following each sequence, the subjects were asked to reproduce the sequence. Cues of one, two, three, four, and five letters were given on each test except one, where no cues were provided. Performance improved in direct relation to the number of cues given. However, there were no discontinuous jumps in performance even though, as the authors claimed, for example, the task of generating probable alternative words from three-letter cues is much more difficult than from four-letter cues. This continuity of performance in relation to the number of cues provided led the authors to believe that there is no clear distinction between recall and recognition. Moreover, they felt that the hypothesis that the two are continuous is a more useful construct than the proposition that the two are disjoint processes.

Bracht and Hopkins (1970) reported a study using sophomores enrolled in an educational psychology course. During the semester, students were administered both essay and multiple-choice examinations. Each examination was designed to measure higher cognitive processes and great care was taken to control for confounding factors such as answer length and penmanship when grading the essay questions. Following an analysis of the data, the authors reported that the assumption that multiple-choice and essay tests measure different variables was not supported.

Traub and Fisher (1977) conducted a study in which they investigated whether or not tests which differed only in response format measured the same attributes. Two sets of mathematical reasoning tests and two sets of verbal comprehension tests were administered to Grade 8 students under three different formats. The tests were presented in constructed-response, multiple-choice, and Coombs multiple-choice format.
In the Coombs format, respondents were required to identify incorrect options. Based on the data obtained, the authors claimed that the tests of mathematical reasoning measured the same attributes regardless of format. The tests of verbal reasoning were found not to be equivalent.

Burton (1972) conducted a study to investigate the effects of item-scoring formulas. In her study items which differed only in response format, multiple-choice or constructed-response, were administered. The results led her to speculate that there was an inherent difference between the two types of items. In particular, differences in student achievement ranged between five and fifteen percent in favour of the multiple-choice items (pp. 129-130). However, she went on to state that for items which contained very plausible distractors, this difference did not appear, and in most cases the interpretation of the data was not affected by the supposed inherent difference.

Response Sets

Response sets cause subjects of equal ability to consistently give differing responses according to irrelevant characteristics of test items. Consequently, response sets are a factor which must be considered when constructing achievement tests.

Characteristics of Response Sets

Cronbach (1946) lists the following response sets:

1. Guessing.
2. Acquiescence: some people, when faced with a choice about which they are not sure, tend to answer positively; others tend to answer negatively.
3. Speed versus accuracy: students who work quickly, guessing on items, may receive undeserved higher scores in comparison to their slower, more thoughtful classmates.
4. Definition of judgment categories: the categories used on a test, such as like or dislike, may mean different things to different people.
5. Inclusiveness: some people may be more inclined to include many answers or points in their responses while others may be more selective or limited in the points made in an essay or choose a smaller number of alternatives on multiple-choice tests.
6. Response sets on essay tests: some of these sets are inclusiveness, style of composition, and degree of organization.

Stanley and Hopkins (1972) list the first three categories above and also add a category, positional preference: the tendency of some respondents to consistently choose the option appearing in a particular place in the list of possible answers. Stanley and Hopkins go on to point out, however, that recent research has failed to confirm this response set.

Having categorized response sets, Cronbach (1946) lists these characteristics of response sets:

1. They are reliable from test to test.
2. They increase test reliability.
3. They raise or lower test validity depending on whether or not the response sets are correlated with the criterion.
4. They always lower content validity.
5. They have the most effect on difficult items.
6. They have the greatest influence in situations perceived by the respondent as ambiguous or unstructured.
7. They appear to be uncorrelated across subject fields. That is, a respondent may gamble on Mathematics tests but not English tests, or vice versa.
8. They interfere with inferences made on the basis of the results of tests.

A study reported by Hopkins (1964) contains a literature review which supports the claims about the characteristics of response sets given above. In addition, Hopkins also claimed, on the basis of the review, that response sets have been found to be relatively independent of ability, but related to personality.

Sherriffs and Boomer (1954) conducted a study which examined the relationship of guessing behaviour to personality.
In the study, the subjects were told that the scores on a true-false examination would be computed by subtracting the number of wrong responses from the number of correct responses. The results of the study showed that introverted subjects with low self-esteem and high concern with the impression that they make on others tended to be penalized by a correction for guessing when their scores were compared to those of other students. In other words, these students were less likely to gamble and guess answers.

Effects of Response Sets and Guessing

It has been stated above that response sets have an effect on test reliability and validity. In a study conducted by Hopkins (1964), grade equivalent scores on standardized multiple-choice mathematics achievement tests were investigated and the scores on equivalent forms of the tests in constructed-response format were compared with the multiple-choice scores. The results showed that the score attained by answering all items with only chance success could be interpreted as an acceptable grade equivalent score. Hopkins went on to explain that the reliability of such standardized tests may be due to the speed versus accuracy set. That is, although the instruments may appear to test content objectives, and yield high reliability coefficients, they may indeed be measuring irrelevant but stable factors.

In an assessment based on multiple-choice items, then, those interpreting the results will be faced with the problem of scores possibly inflated by guessing. Thorndike (1971) defines guessing as a "loose, general term for an array of behaviors that occur when an examinee responds to an alternate choice question to which he does not 'know' the answer" (pp. 59-61). Thorndike then goes on to list some behaviours that occur during guessing:

1. judging some answer choices to be wrong and selecting a response from the remaining alternatives;
2. using unintended semantic and syntactic cues in the wording of the responses or the question stem;
3. being misled by plausible but wrong responses constructed by the item writer;
4. making an unsure response based on an attractive element in one of the choices;
5. responding in a random fashion, using a pattern of responses or some specific response pattern.

Rowley and Traub (1977) conducted a short review of formula scoring literature. Their review focussed on the assumptions regarding pupil behaviour during examinations. The evidence suggested that the tendency to guess is correlated with personality and therefore introduces a confounding effect. However, "Do not guess" instructions may introduce a confounding influence as well because there may be differential compliance with the instructions according to personality characteristics.

Rowley and Traub (1977) also claimed that students tend to score higher than chance by "just guessing". This assertion is only weakly supported by the results of a study conducted by Ebel (1968). In that study students responded to true-false items and indicated which of the responses were simply blind guesses. Only a small proportion of the responses (3%-8%) were reported as blind guesses, and those blind guesses were between 52% and 56% correct, only slightly better than the 50% expected to be correct by chance alone. Ebel (1979, p. 201), however, encouraged students to make rational guesses on objective items.

Burton's (1972) review of the literature on guessing showed that some changes in format, notably the inclusion of the "I don't know" alternative, may reduce guessing. She was of the opinion, though, that there is no way to prevent all guessing on multiple-choice achievement tests.
Formula Scoring and Corrections for Guessing

A great deal has been written about formula scoring and the effects of correction formulas on test reliability and validity. Rowley and Traub (1977) reported that the results of empirical studies are equivocal and that, on unspeeded tests, the decision to use or not use formula scoring is a value judgment.

Ebel (1979, pp. 194-8) claimed that both corrected and uncorrected scores rank students in essentially the same order and the probability of achieving a respectable score on a good objective test by guessing is slight. Also, if examinees are well-motivated and have time to attempt all items, the effects due to guessing will be reduced. In addition, it is not poor practice to encourage students to make rational guesses, the results of which may provide information on the general achievement of the students. Finally, a guessing correction may remove the incentive for slower students to guess on speeded tests but the corrected scores may be contaminated by the action of response sets.

Lord and Novick (1968, pp. 302 ff) described more complicated scoring formulas than the most commonly used S = R - W/(k - 1), where R is the number of right answers, W is the number of wrong answers, and k is the number of options per item. Some of these formulas have been developed in an attempt to evaluate partial knowledge and/or make use of weighting the scores assigned to certain items or response choices within the item to minimize mean square error. These scoring weights are determined empirically from the test data.

Burton (1972) compared a number of scoring methods. One of the methods was simple number right and another was the simple correction given above. The other eighteen methods investigated employed two formulas which used empirical data to assign scoring weights and 16 different combinations of formulas based on a complicated series of decision procedures.
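The traditional correction given above can be illustrated with a short sketch. The function name and the example figures (a hypothetical test of 40 four-option items) are assumptions introduced for illustration only; they are not drawn from any of the studies reviewed.

```python
def formula_score(right: int, wrong: int, options: int) -> float:
    """Traditional correction for guessing: S = R - W / (k - 1).

    right   -- number of items answered correctly (R)
    wrong   -- number of items answered incorrectly (W); omitted items are not counted
    options -- number of answer options per item (k)
    """
    if options < 2:
        raise ValueError("an item needs at least two options")
    return right - wrong / (options - 1)


# Hypothetical examinee who guesses blindly on all 40 four-option items and
# obtains the chance result of 10 right and 30 wrong.
print(formula_score(10, 30, 4))  # 0.0

# Hypothetical examinee with 30 right, 8 wrong, and 2 items omitted.
print(round(formula_score(30, 8, 4), 2))
```

Under the random-guessing assumption the expected corrected score of a blind guesser is zero, which is the sense in which the formula is said to remove the gain due to chance.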
Lord and Novick (1968) suggest that researchers and theoreticians are unlikely to abandon the search for more refined methods to glean an increased amount of information from test scores. They point out, however, that

    . . . what little experimental work has been done in the traditional methods of formula scoring has not been encouraging, and that no experimental work has been published that supports the new methods. Thus, at present, the sole recommendation of these new methods is their strong conceptual attractiveness. In evaluating any new response method, it will be necessary to show that it adds more relevant ability variation to the system than error variation, and that any such relative increase in information retrieved is worth the effort . . . (p. 314)

Educators and measurement specialists are divided on the issue of formula scoring. Lord (1975) stated that "Religion, politics, and formula scoring are areas where two informed people often hold opposing views with great assurance" (p. 7).

The central assumption made when employing the most common corrections for guessing is that respondents will guess randomly when faced with an item to which they do not know the answer with absolute confidence. Lord (1975) showed that under this assumption, both formula scoring and simple number right give unbiased estimates of the same quantity. However, Lord (1963, 1975) also suggested that the assumption of random guessing is indefensible.

Sex-related Differences in Mathematics Achievement

Swafford (1980) reported that while the literature of the 1960's and 1970's generally held that sex-related differences in mathematics achievement did not appear until adolescence, more recent studies have shown that these differences are negligible at all ages when the number of years of mathematics studied by the subject is controlled.
Similarly, Wolleat, Pedro, Becker, and Fennema (1980) claimed that while research on cognitive factors has been inconclusive, studies examining non-cognitive factors have yielded interesting results. In particular, females have been found to be less confident than males about their ability in mathematics and tend to underestimate their ability. Compared with males, females believe that mathematics will be less useful to them in the future. These factors seem to have caused females to avoid senior courses in mathematics.

Differences in achievement between males and females have been discussed in the General Reports from both of the British Columbia Assessments of Learning in Mathematics (Robitaille and Sherrill, 1977; Robitaille, 1981). The results of the first B.C. Assessment and of the 1979 NAEP Mathematics study are also described by Erickson, Erickson, and Haggerty (1980). The three sets of assessment data show some common trends. In the primary years, males tend to outperform females on measurement items and females tend to outperform males in computation. In junior high school and beyond, males outperform females in all areas except computation.

The concern over sex-related differences is evidenced by the number of programs and projects developed in response to the demonstrated sex-related differences. Erickson et al. (1980) and Fennema, Wolleat, Pedro, and Becker (1981) have described some of these programs and commented on their effectiveness. In addition, the recommendations made as a result of the 1977 and 1981 B.C. Mathematics Assessments have included some relating to sex-related differences (Robitaille and Sherrill, 1977; Robitaille, 1981).

Hypotheses

The review of literature gives rise to the following hypotheses.
Given two Mathematics achievement tests containing the same items, one using multiple-choice format and the other using constructed-response format,

1. There is no difference in the mean score obtained by Grade 7 students on the two forms in each of the content domains considered.
2. There is no difference between the test scores obtained by Grade 7 males and Grade 7 females on either form of the achievement test for each of the content domains considered.
3. There is no difference between the test scores obtained by Grade 7 students of differing abilities on either form of the achievement test in each of the content domains considered.
4. There are no interactions involving item format and item difficulty, item format and gender, or item format and student ability which have significant Tetrad differences.

Chapter 3

METHOD

The purpose of this study was to investigate the effect of item format on achievement test score. In order to measure this effect, a group of Grade 7 students was administered a Mathematics achievement test on two occasions. On the first occasion half the students completed a test consisting of 42 items in multiple-choice format, while the other half completed the same test with the items in constructed-response format. Two weeks later the students were administered the tests once again. On this occasion those who had previously written the multiple-choice test responded to the test in constructed-response format, and those who had written the constructed-response test on the first occasion wrote the test in multiple-choice format. The tests were subsequently scored, and the test scores analysed.

Descriptions of the test development, sample selection, and pilot testing are found below. The details of test administration, test scoring, and data analysis are also presented.
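The counterbalanced two-occasion design described above can be sketched in a few lines. This is only an illustration: the labels "MC" and "CR", the function name, and the strict alternation of forms down a class list are assumptions made for the sketch and simplify the actual distribution of test booklets.

```python
from itertools import cycle

FORMS = ("MC", "CR")  # multiple-choice and constructed-response forms


def assign_forms(students):
    """Return {student: (form on occasion 1, form on occasion 2)}.

    Forms are alternated across the list for the first occasion, and each
    student receives the other form on the second occasion.
    """
    schedule = {}
    for student, first in zip(students, cycle(FORMS)):
        second = "CR" if first == "MC" else "MC"
        schedule[student] = (first, second)
    return schedule


# Hypothetical class list of four students.
for name, forms in assign_forms(["S01", "S02", "S03", "S04"]).items():
    print(name, forms)
```

Because every student writes both forms, each student serves as his or her own control for the format comparison, and roughly half the group meets each form first.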
Development of the Tests

The following section contains a description of the origin of the test items and the content domains. The procedures used to pilot the tests are also described.

Origin of the Test Items

During the summer of 1981, three graduate students in Mathematics Education at U.B.C., including the writer, were contracted to construct multiple-choice achievement test items for the 1981 B.C. Mathematics Assessment (Klassen, Dukowski, and deGroot, 1981). Test objectives were determined for each of the three grades (4, 8, and 12) involved in the Assessment, and pools of items were constructed for each objective for each grade. During the course of development, the items were reviewed by the Contract Team for the Assessment and also by the Assessment Advisory Committee. This Advisory Committee consisted of educators from the schools and colleges, Ministry of Education personnel, members of B.C. Research (the technical agency for the Assessment program), and a school trustee (see Robitaille, 1981 for further discussion). Pilot tests were constructed from the pools of items and administered to Grades 4, 8, and 12 classes in November, 1980. The items used for this study were selected from those piloted items for Grade 8.

Content Domains

The items on the tests used in this study were grouped into three content domains, Computation, Application, and Algebra. These content domains were defined so as to include only core material, that is, essential learning for all students, from the British Columbia mathematics curriculum for Grades 7 and 8.

The prescribed content for Grades 7 and 8 Mathematics in British Columbia schools is described in Mathematics, Curriculum Guide Years One to Twelve (B.C. Ministry of Education, 1978, pp. 22-26). In the Guide the learning outcomes for Grades 7 and 8 are grouped under eight strands.

I. Set and set operations
II. Number and number operations
III. Geometry
IV. Measurement
V. Problem Solving
VI. Graphs and functions
VII. Applications of mathematics
VIII. Logical thinking

For this study the writer chose objectives from the strands and grouped them in three content domains. As mentioned above, only those objectives in the Guide designated as essential learning for all students were considered for inclusion in a content domain. The objectives in each content domain and their relationship to the strands in the Guide are described below.

Computation

The Computation content domain is defined by the following learning outcomes.

The student is able to

1. add, subtract, multiply, and divide whole numbers, common fractions, and decimal fractions
2. compare fractions
3. convert among common fractions, decimal fractions, and percent
4. calculate with percent.

All of the material in the Computation domain is classified in the Number and Number Operations strand of the Curriculum Guide.

Application

The Application content domain is defined by the following learning outcomes.

The student is able to

1. apply computational skills to solve word, or story, problems
2. use skills with percent, ratio, and proportion to solve word, or story, problems.

These objectives are categorized under V. Problem Solving and VII. Applications of Mathematics in the Curriculum Guide.

Algebra

The Algebra content domain is defined by the following learning outcomes.

The student is able to

1. solve simple open sentences
2. translate verbal statements into expressions or open sentences
3. evaluate expressions.

These objectives are categorized under II. Number and Number Operations and V. Problem Solving in the Curriculum Guide.

Selection of Test Items

There were four criteria which governed the selection of test items:

1. The items had to be such that they could be stated in both multiple-choice and constructed-response formats with the same item stem.
2. The items had to test content in one of the three content domains (Computation, Application, or Algebra) as defined for the study.
3. There had to be an equal number of items for each difficulty level considered. The two levels of difficulty were high difficulty (0.375 < p < 0.500) and low difficulty (0.625 < p < 0.750). Items were classified based upon difficulty levels obtained from pilot testing for the 1981 B.C. Mathematics Assessment.
4. The number of items testing content in each domain had to be equal, yet the total number of items had to be such that the total test administration time would not exceed one hour.

Using these criteria, 42 items were selected from the items pilot tested in November 1980 for the 1981 B.C. Mathematics Assessment. There were seven items chosen for each of six subtests: Computation High Difficulty, Computation Low Difficulty, Application High Difficulty, Application Low Difficulty, Algebra High Difficulty, and Algebra Low Difficulty. For each subtest, two test forms were constructed: multiple-choice and constructed-response.

The items were then randomly distributed throughout the test with the restriction that the first two items were Computation Low Difficulty items. The two forms of the test were identical except that on the multiple-choice form the students selected one of five answer options, including "I don't know", following each item stem, whereas on the constructed-response form of the test the same item stems, in the same order, were followed by a line upon which the students recorded their answers. Students responded directly in the test booklets, which are reproduced in Appendix A.
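The third selection criterion can be made concrete with a small sketch. The difficulty index p of an item is taken here to be the proportion of examinees answering it correctly; the function names and the sample responses are hypothetical and are not taken from the Assessment pilot data.

```python
def p_value(responses):
    """Item difficulty index: proportion of examinees answering correctly.

    responses -- iterable of 1 (correct) and 0 (incorrect or omitted)
    """
    responses = list(responses)
    return sum(responses) / len(responses)


def difficulty_level(p):
    """Classify an item using the two bands stated in criterion 3.

    Returns "high" for 0.375 < p < 0.500, "low" for 0.625 < p < 0.750,
    and None for items outside both bands, which would not be eligible.
    """
    if 0.375 < p < 0.500:
        return "high"
    if 0.625 < p < 0.750:
        return "low"
    return None


# Hypothetical item answered correctly by 9 of 20 pilot examinees (p = 0.45).
p = p_value([1] * 9 + [0] * 11)
print(p, difficulty_level(p))
```

Note that a smaller p corresponds to a more difficult item, so the "high difficulty" band is the one with the lower p-values.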
In order to verify the content validity of the test items, two of the investigator's colleagues, both experienced mathematics teachers, were given descriptions of the content domains and they independently classified the items according to domain. There was unanimous agreement as to item classification.

Pilot Testing

The test forms were piloted in March, 1981 in the investigator's Grade 8 Mathematics class. All but one of the 27 students completed the test within 50 minutes. No problems were encountered during the administration of the tests. Item analysis of this pilot test data, conducted using the computer program LERTAP 2.0 (Nelson, 1974), revealed that the items were considerably easier than would have been expected on the basis of the Assessment pilot data. For example, the Assessment pilot data indicated that the mean difficulty for items on the Computation Low Difficulty multiple-choice subtest should be approximately 0.68; the mean difficulty obtained on the study pilot was 0.93. The tests were then piloted in the two Grade 7 classes of a neighbouring elementary school. As in Grade 8, the administration time was less than one hour. An analysis of the test results showed the item difficulties at Grade 7 to be closer to those obtained in the Assessment pilot. (The item difficulties are summarized in Table 4.2 found in the next chapter.)

The discrepancy between the item difficulties may be due to the fact that the Assessment pilot was conducted in November, while the pilot for this study was performed in March. It seems reasonable to expect that the skills of Grade 8 students would have improved over the intervening four months and that they would do better on the test items.
Because the item difficulties computed from the Grade 7 pilot data were closer than the Grade 8 pilot data to those required, the study was performed using Grade 7 students.

Sample Selection

Description of the Population

The sample used in the study was selected from the population of intact Grade 7 classes of School District 35, Langley, British Columbia. Langley is a suburban-rural community located approximately 50 km from Vancouver, B.C. In contrast to more well-established school districts in the Lower Mainland, Langley is experiencing growth in student population. A wide range of socio-economic levels is represented in the community. Many Langley residents commute to blue-collar and white-collar jobs in Vancouver, and there is a sizeable number of families for whom farming is a primary or secondary source of income. The Grade 7 population consisted of 1017 students in 28 elementary schools, 20 of which had full Grade 7 classes enrolled. The other eight schools had only split Grade 6/7 classes.

Selection Technique

Of the 20 schools which had full classes of Grade 7 enrolled, only the 18 schools which had a population of at least 25 Grade 7 students, excluding the school in which the tests had been piloted, were considered for participation in the study. The director of elementary instruction for the school district provided a list of those six schools which he felt were representative of the population and which had principals who were likely to agree to participate in the study. The six principals were contacted by telephone. Only one of the six principals declined to participate.

The five schools which took part in the study enrolled a total of 237 Grade 7 students in nine classes. Of these 237 students, 24 children failed to write one or both forms of the test due to absence on one or both of the testing dates.
None of the data from these students were included in any of the analyses.

Subjects were grouped into three ability levels according to IQ scores as measured by the Canadian Cognitive Abilities Quantitative Battery (Thorndike, Hagen, and Wright, 1974). This Battery was administered to Grade 7 students in Langley in the Fall of 1980. IQ scores were available for 191 of the 213 students who wrote both tests. These scores were used to partition the sample into low, average, and high ability groups of roughly equal size.

Test Administration

The repeated measures design of the study required that each student respond to both forms of the test. For this reason, two testing periods were required. In order to minimize memory effects, a two-week interval separated the two testing periods. A two-week period was also used by Traub and Fisher (1977) to separate testing periods in a similar study.

The tests were first administered to students during the week of April 29, 1981 in their regular classrooms by their teachers. Each teacher received a bundle of tests with the forms alternated throughout. They were asked to distribute the tests randomly to their classes, read the test administration directions, and, when the hour-long testing period was over, to collect the tests. The investigator then collected all the used and unused tests from the schools. Teachers were also asked not to alter their teaching plans because of the test material. The directions to test administrators are reproduced in Appendix B.

Two weeks later the tests were readministered. In order to ensure that students received the form of the test alternate to the one received on the first occasion, their names were affixed to the proper form before the tests were sent to the participating teachers. Teachers once again read the test directions, collected the tests, and returned all the papers.
Two of the nine participating classes postponed the second test administration until the start of the third week in order to accommodate a school play. The classes in the other schools all wrote the tests in the middle of the week. During the week of the second test administration, the principals provided information regarding students' gender and Quantitative IQ score.

Data Analysis

The tests were hand scored by the investigator and checked by a research assistant. The test answer key was constructed by the investigator. On those items in constructed-response format where more than one answer was acceptable, each such response was considered correct. The scores obtained by the investigator and research assistant were in 100% agreement for both forms of the test.

The test data and student data were subsequently entered into a computer file at the U.B.C. Computing Centre. The file was then checked, the errors were corrected, and the file was checked once more. No further errors were discovered.

Test Analyses

An item analysis of the test data was performed using the computer program LERTAP (Nelson, 1974). Descriptive statistics were generated by the programs BMDP2D and BMDP2V (Dixon and Brown, 1979).

Preliminary Analyses

The interpretation of an analysis of variance becomes very complicated for large numbers of factors. This is particularly the case if some of the factors are nuisance factors; that is, variables of no particular interest but which must be included because they contribute significantly to the overall variance. The order of test administration (that is, multiple-choice form written first or constructed-response form written first) and the class in which a student is enrolled are two such variables. Therefore a 2x9x2 (order-by-class-by-item difficulty) fixed effects analysis of variance was performed using the computer program BMDP2V.
Order of test administration and class membership were considered as grouping variables and item difficulty was treated as a trial variable with two observations. The difficulty factor was included to add more precision to the analyses.

Six of these analyses were performed, one for each test format in each content domain. A significance level of 0.05 was chosen and 213 cases, that is, all students who completed both forms of the test, were included in the analyses. The summary ANOVA tables for these analyses are contained in Appendix C.

Order of Administration

The results of the analyses showed that order of administration affected the scores of only two of the six subtests. Order of administration did not affect the scores on the multiple-choice subtests in any of the domains but did affect the constructed-response scores in the Application and Algebra domains. In both cases the constructed-response scores for those students who wrote that form of the test second were significantly higher (p < .05) than the scores of those students who wrote the constructed-response form first. There were no significant first order interactions involving order of administration and only one significant second order interaction involving order of administration.

Order of test administration had a very limited effect on the overall test scores. In addition, it was not a factor of great interest in the study. Therefore order of administration was eliminated as a factor in further analyses.

Class

The results of the analyses revealed that in three cases of the six there were significant differences (p < .05) in subtest scores among classes. Therefore the raw scores were transformed in order to remove the class effects and yet retain all other information. To achieve this, the raw scores were standardized within each class and content domain.
This transformation was performed in the following manner. Within each of the three content domains there are four subtests: two difficulty levels in each of two formats. The mean and standard deviation of the total of the four subtest scores within each class were used to transform the individual raw scores over the four subtests to mean zero and standard deviation one. In order to check the effect of this procedure, the three-way analyses of variance involving class, order, and item difficulty were repeated using the standard scores. As expected, the analyses showed no class effects. All effects involving factors other than class were similar to those computed when raw score data were used. Thus, standard scores were used in all further analyses of variance. A summary of cell means and standard deviations for these standard scores is contained in Appendix D.

Difficulty

As expected, the difficulty of the items caused scores to differ significantly. This factor was retained in subsequent analyses.

Final Analyses

To examine the effects of gender, ability, item format, and item difficulty, a 2x3x2x2 (gender-by-ability-by-format-by-difficulty) analysis of variance procedure was performed using BMDP2V. The data from the 191 students for whom complete information was available were used in this analysis. The students missing ability measures were evenly distributed among classes. Therefore it is reasonable to expect that the class means of zero and standard deviations of one achieved by the transformation of raw scores were not seriously affected by deleting the data from the 22 students for whom ability measures were not available.

Gender and three levels of ability were considered as grouping factors. Item format and item difficulty were treated as trial factors each with two levels.
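The within-class standardization described under Class above can be sketched as follows. The sketch rests on one reading of that description, namely that all raw subtest scores in a class for a given content domain (four per student) are pooled, and the pooled mean and standard deviation are used to convert each raw score. The data layout, the function name, and the use of the population standard deviation are assumptions; the computation actually performed in the study may have differed in detail.

```python
from statistics import mean, pstdev


def standardize_within_class(scores):
    """Standardize raw subtest scores within each class for one content domain.

    scores -- {class_id: [(student, subtest, raw_score), ...]}
    Returns the same structure with raw scores replaced by standard scores
    having mean zero and standard deviation one within each class.
    """
    standardized = {}
    for class_id, rows in scores.items():
        raw = [score for _, _, score in rows]
        centre, spread = mean(raw), pstdev(raw)
        if spread == 0:
            raise ValueError(f"class {class_id!r} has no score variation")
        standardized[class_id] = [
            (student, subtest, (score - centre) / spread)
            for student, subtest, score in rows
        ]
    return standardized


# Hypothetical scores for two small classes on two of the four subtests.
data = {
    "A": [("s1", "MC-high", 3), ("s1", "CR-high", 2),
          ("s2", "MC-high", 5), ("s2", "CR-high", 4)],
    "B": [("s3", "MC-high", 6), ("s3", "CR-high", 6),
          ("s4", "MC-high", 7), ("s4", "CR-high", 5)],
}
for class_id, rows in standardize_within_class(data).items():
    print(class_id, [round(z, 2) for _, _, z in rows])
```

Because one mean and one standard deviation are applied to all four subtests in a class, differences between formats and between difficulty levels are preserved while differences between class means are removed.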
A level of significance of 0.05 was chosen. Post hoc comparisons were made using Scheffé's procedure (Kirk, 1968) with a significance level of 0.10. Scheffé (1959, p. 71) suggests this level of significance as appropriate for testing contrasts of this nature, as this test is a conservative one. Ferguson (1981, pp. 308-309) also recommends a more liberal level of significance.

Although the effect of a correction for chance performed on multiple-choice data and the power of multiple-choice scores to predict constructed-response scores are not central to the research questions, some analyses were performed with regard to these issues. The data from the 191 completed cases were rescored applying the traditional correction for chance formula. The chance-corrected data were standardized within class using the same procedure as for the uncorrected data and subjected to a 2x3x2x2 (gender-by-ability-by-format-by-difficulty) analysis of variance. A significance level of 0.05 was chosen. Post hoc comparisons were made using Scheffé's procedure with a significance level of 0.10.

To estimate the ability of multiple-choice scores to predict constructed-response scores, a simple linear regression analysis was performed. Before the regression analysis was done, however, the raw score data in each content domain were subjected to a 9x2x2x2 (class-by-order-by-difficulty-by-format) fixed effects analysis of variance to check for format-by-class interactions. At the 0.05 level of significance there were no format-by-class interactions using 191 complete cases. The raw scores in each content domain were subsequently employed in a simple linear regression analysis performed using the computer program BMDP1R. Constructed-response score was treated as the dependent variable and multiple-choice score was treated as the independent variable.

Chapter 4

RESULTS

The results of the study are reported in the following order.
First, there is a description of the sample, then the results of the descriptive analysis of the test scores and the subtest characteristics are given. Finally, the effects of gender, ability, item format, and item difficulty are presented, and the results of regression analyses performed on the raw scores are discussed.

Description of the Sample

The sample drawn for this study consisted of 237 Grade 7 students enrolled in nine classes in five elementary schools in a suburban-rural school district. Of the total number of students, 24 failed to write one or both forms of the tests due to absence on the testing dates. None of the data from these 24 students were used in any of the analyses. There were also 22 students for whom ability measures were unavailable. The data from these students were included only in the descriptive analyses of the test items and the inferential analysis of the effect of class membership and order of test administration. There were, then, 191 students, 88 boys and 103 girls, from whom complete sets of data were obtained.

The ability measures for the 191 complete cases were IQ scores from the Canadian Cognitive Abilities Quantitative Battery (Thorndike, Hagen, and Wright, 1974). The mean IQ score was 102.6 with a standard deviation of 13.9. The sample was partitioned into low, average, and high ability groups of roughly equal size. Students with Quantitative IQ scores of 96 or less were considered to be low ability students; students with scores greater than 96 but less than or equal to 107 were considered to be of average ability; and students with scores higher than 107 were considered to be high ability students. Table 4.1 contains the distribution of students by ability and gender.
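The cut-off scores just described amount to a simple three-way classification, sketched below. The function name is hypothetical; the boundaries of 96 and 107 are those stated in the text.

```python
def ability_group(iq: float) -> str:
    """Assign an ability level from a Quantitative IQ score.

    low:     score <= 96
    average: 96 < score <= 107
    high:    score > 107
    """
    if iq <= 96:
        return "low"
    if iq <= 107:
        return "average"
    return "high"


# Boundary scores fall in the lower of the two adjacent groups.
for score in (96, 97, 107, 108):
    print(score, ability_group(score))
```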
Table 4.1
Distribution of Subjects by Gender and Ability

Ability     Male   Female   Total
High          33       32      65
Average       24       40      64
Low           31       31      62
Total         88      103     191

Descriptive Analysis of the Test Scores

There are four sets of information regarding item difficulty and subtest reliability. Difficulty indices for the multiple-choice items are available from the pilot testing in Grade 8 for the 1981 Assessment (Robitaille, 1981), from the pilot testing performed in Grades 7 and 8 as part of the present study, and from the main study itself. There is test reliability information from three sources, the Grades 7 and 8 pilots and the main study.

The results of the Grade 8 pilot indicated that for Grade 8 subjects, the items were too easy. Therefore subjects in Grade 7 were selected for the study. Table 4.2 contains the average p-values for the items on each subtest from the Assessment pilot, the Grade 7 pilot, and the main study. The items were less difficult than one would expect on the basis of the Assessment pilot data. The high difficulty multiple-choice items were chosen so as to have an average p-value of approximately 0.4. In the main study the p-values for the Computation and Application domains were 0.529 and 0.566 respectively. The high difficulty Algebra multiple-choice items, however, had an average p-value of 0.388. Similarly, the p-values of the low difficulty multiple-choice items were expected to be approximately 0.7 but they had computed values of 0.808, 0.832, and 0.717 for the three content domains Computation, Application, and Algebra. Nonetheless, although they were easier than anticipated, the items were partitioned into two distinct difficulty levels in each content domain.

Table 4.2
Mean Item Difficulty

                    Assessment    Grade 7 Pilot        Study
                    Pilot
Subtest             M-C           M-C      C-R         M-C      C-R
                    n=240         n=28     n=29        n=213    n=213

Computation
  High Difficulty   .404          .591     .483        .529     .456
                    (.058)        (.148)   (.199)      (.121)   (.170)
  Low Difficulty    .688          .842     .827        .808     .716
                    (.036)        (.074)   (.113)      (.076)   (.100)
Application
  High Difficulty   .429          .638     .468        .566     .413
                    (.044)        (.136)   (.124)      (.079)   (.104)
  Low Difficulty    .710          .893     .799        .832     .749
                    (.073)        (.077)   (.098)      (.087)   (.156)
Algebra
  High Difficulty   .402          .571     .320        .388     .232
                    (.053)        (.196)   (.235)      (.104)   (.172)
  Low Difficulty    .702          .842     .630        .717     .521
                    (.033)        (.068)   (.235)      (.099)   (.258)

Note. Each subtest contained 7 items. The numbers in parentheses are the standard deviations of the p-values.

Test reliabilities are influenced by a number of factors including item difficulties and length of test. Each of the subtests in this study contained only a small number of items (seven) and a restricted range of difficulty. As a result it is not surprising that the subtest reliabilities computed using Hoyt's ANOVA procedure are not high. They range from a low of 0.47 to a high of 0.76. These reliabilities are found in Table 4.3.

Inferential Analyses

Two sets of inferential analyses were performed on the data. The preliminary analyses were done to determine whether or not variables which were considered to be nuisance variables could be safely eliminated from the final analyses. These preliminary analyses were discussed in Chapter 3. The final analyses were performed using the variables of greatest interest in the study. The final analyses are discussed below.
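The Hoyt reliabilities reported in Table 4.3 come from a persons-by-items analysis of variance. As a minimal sketch of that computation (not the program used in the study, which is not identified here), the Python function below estimates reliability as one minus the ratio of the residual mean square to the between-persons mean square. The function name hoyt_reliability and the small 0/1 score matrix DEMO_SCORES are hypothetical and serve only to illustrate the arithmetic.

```python
def hoyt_reliability(scores):
    """Hoyt's ANOVA reliability for a persons-by-items score matrix.

    scores: list of rows, one row per person, one item score per column.
    Returns 1 - MS_residual / MS_persons.
    """
    n_persons = len(scores)
    n_items = len(scores[0])
    grand_mean = sum(sum(row) for row in scores) / (n_persons * n_items)
    person_means = [sum(row) / n_items for row in scores]
    item_means = [sum(row[j] for row in scores) / n_persons
                  for j in range(n_items)]

    # Two-way ANOVA without replication: total = persons + items + residual.
    ss_total = sum((x - grand_mean) ** 2 for row in scores for x in row)
    ss_persons = n_items * sum((m - grand_mean) ** 2 for m in person_means)
    ss_items = n_persons * sum((m - grand_mean) ** 2 for m in item_means)
    ss_residual = ss_total - ss_persons - ss_items

    ms_persons = ss_persons / (n_persons - 1)
    ms_residual = ss_residual / ((n_persons - 1) * (n_items - 1))
    return 1.0 - ms_residual / ms_persons


# Hypothetical data: 5 persons by 4 items, scored right (1) or wrong (0).
DEMO_SCORES = [
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]

if __name__ == "__main__":
    print(f"Hoyt reliability: {hoyt_reliability(DEMO_SCORES):.2f}")
```

For this toy matrix the estimate is 0.80. With only seven items per subtest, as in the present study, estimates of this kind would be expected to be modest.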
Table 4.3
Subtest Reliabilities, Means, Standard Deviations, and Standard Errors of Measurement

Subtest                          Mean Score     Hoyt           Standard
                                                Reliability    Error

Computation, high difficulty
  Multiple-choice                3.71 (1.77)    0.55           1.10
  Constructed-response           3.19 (1.70)    0.53           1.08
Computation, low difficulty
  Multiple-choice                5.66 (1.38)    0.52           0.88
  Constructed-response           5.01 (1.52)    0.47           1.02
Application, high difficulty
  Multiple-choice                3.96 (1.75)    0.53           1.12
  Constructed-response           2.90 (1.80)    0.58           1.09
Application, low difficulty
  Multiple-choice                5.83 (1.42)    0.62           0.81
  Constructed-response           5.24 (1.59)    0.62           0.90
Algebra, high difficulty
  Multiple-choice                2.72 (1.96)    0.68           1.03
  Constructed-response           1.62 (1.60)    0.68           0.84
Algebra, low difficulty
  Multiple-choice                5.02 (1.75)    0.65           0.96
  Constructed-response           3.65 (1.98)    0.76           0.89

Note. Each subtest contained 7 items. Two hundred thirteen subjects wrote each subtest. The numbers in parentheses are standard deviations.

Gender, Ability, Item Format, and Item Difficulty

The effects due to gender, ability, item format, and item difficulty were examined in each content domain using a 2x3x2x2 (gender-by-ability-by-format-by-difficulty) analysis of variance with repeated measures. Gender and ability were considered grouping factors and item format and item difficulty were trial factors. Tables 4.4 to 4.6 show the summary ANOVA tables for the analyses in each of the content domains. The effects due to each variable are presented separately below and then the significant interactions are presented. A significance level of 0.05 was chosen for each of the omnibus F's.

Gender

An examination of the summary ANOVA tables found in Tables 4.4 to 4.6 indicates that there was a significant effect due to gender in only one of the domains, Application.
In this case, the mean score obtained by males was significantly higher than that obtained by females.

Ability

In each of the three content domains, the analyses of the data revealed that there were significant effects due to ability. The mean scores of the students in each ability level were compared using Scheffe's procedure with a significance level of 0.10. It was found that the means of the total scores in each content domain were ordered strictly according to ability level. Students of high ability scored significantly higher than students of average ability, who, in turn, scored significantly higher than students of low ability.

Table 4.4
Summary Analysis of Variance
Gender, Ability, Item Format, and Item Difficulty
Computation Domain

Source of variance    Degrees of    Mean Square          F
                      Freedom

Mean                       1            1.078         0.49
Gender (G)                 1            3.178         1.43
Ability (A)                2          121.791        54.84*
G x A                      2            3.384         1.52
Error                    185            2.221

Format (F)                 1           49.568        70.61*
F x G                      1            0.206         0.29
F x A                      2            0.044         0.06
F x G x A                  2            0.890         1.27
Error                    185            0.702

Difficulty (D)             1          502.638       380.96*
D x G                      1            0.198         0.15
D x A                      2            5.253         3.98*
D x G x A                  2            3.325         2.52
Error                    185            1.319

F x D                      1            0.034         0.06
F x D x G                  1            1.628         2.76
F x D x A                  2            0.632         1.07
F x D x G x A              2            0.753         1.27
Error                    185            0.591

*p < .05

Table 4.5
Summary Analysis of Variance
Gender, Ability, Item Format, and Item Difficulty
Application Domain

Source of variance    Degrees of    Mean Square          F
                      Freedom

Mean                       1            0.753         0.33
Gender (G)                 1           11.440         4.94*
Ability (A)                2          138.500        59.86*
G x A                      2            4.023         1.74
Error                    185            2.314

Format (F)                 1           71.234        95.72*
F x G                      1            2.891         3.88
F x A                      2            2.133         2.87
F x G x A                  2            1.252         1.68
Error                    185            0.744

Difficulty (D)             1          492.124       484.47*
D x G                      1            4.154         4.09*
D x A                      2            3.262         3.21*
D x G x A                  2            1.316         1.30
Error                    185            1.016

F x D                      1            7.321        13.90*
F x D x G                  1            0.113         0.21
F x D x A                  2            0.270         0.51
F x D x G x A              2            0.079         0.15
Error                    185            0.527

*p < .05

Table 4.6
Summary Analysis of Variance
Gender, Ability, Item Format, and Item Difficulty
Algebra Domain

Source of variance    Degrees of    Mean Square          F
                      Freedom

Mean                       1            2.416         1.03
Gender (G)                 1            0.455         0.19
Ability (A)                2          133.618        56.70*
G x A                      2            0.217         0.09
Error                    185            2.357

Format (F)                 1          122.306       225.05*
F x G                      1            0.262         0.48
F x A                      2            0.157         0.29
F x G x A                  2            0.745         1.37
Error                    185            0.543

Difficulty (D)             1          412.578       491.77*
D x G                      1            0.061         0.07
D x A                      2            2.696         3.21*
D x G x A                  2            0.047         0.06
Error                    185            0.839

F x D                      1            1.073         2.91
F x D x G                  1            0.041         0.11
F x D x A                  2            4.555        12.37*
F x D x G x A              2            0.184         0.50
Error                    185            0.368

*p < .05

Item Format

In each domain, the format had a significant effect on scores. In each case multiple-choice scores were higher than constructed-response scores.

Item Difficulty

The analyses showed that there was an effect due to item difficulty. In each domain, the mean scores on items of low difficulty were significantly greater than the mean scores on items of high difficulty.
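Each F statistic in the summary ANOVA tables is the ratio of an effect's mean square to the error mean square listed in the same block of the table. As a minimal sketch of that arithmetic, the Python fragment below recomputes several of the Application domain ratios from the mean squares in Table 4.5. The names TABLE_4_5 and f_ratio are hypothetical, and because the tabled mean squares are rounded, the recomputed ratios can differ from the printed ones in the second decimal place.

```python
# (effect mean square, error mean square) pairs copied from Table 4.5.
# Each effect is paired with the error term of its own block of the table.
TABLE_4_5 = {
    "Gender": (11.440, 2.314),
    "Ability": (138.500, 2.314),
    "Format": (71.234, 0.744),
    "Difficulty": (492.124, 1.016),
    "Format x Difficulty": (7.321, 0.527),
}


def f_ratio(ms_effect, ms_error):
    """F statistic: effect mean square divided by error mean square."""
    return ms_effect / ms_error


if __name__ == "__main__":
    for effect, (ms_effect, ms_error) in TABLE_4_5.items():
        print(f"{effect:22s} F = {f_ratio(ms_effect, ms_error):7.2f}")
```

Pairing each effect with the error term from its own block matters in a repeated measures design: the grouping factors (gender and ability) are tested against between-subject error, while the trial factors (format and difficulty) and their interactions are tested against the corresponding within-subject error terms.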
First Order Interactions

The summary ANOVA tables show five significant first-order interactions: item difficulty by gender in the Application domain, item difficulty by ability in all three content domains, and item format by item difficulty in the Application domain. Many contrasts can be formed to test interactions. For the purposes of this study, tetrad differences were considered to be the only contrasts of interest. The tetrad differences for each of the significant interactions were analysed using Scheffe's procedure with an alpha level of 0.10. The results of the analyses of each of the first order interactions are discussed below.

Difficulty by Gender: The difficulty by gender interaction resulted in a significant omnibus F-ratio only in the Application domain. An analysis of the tetrad differences, however, failed to show significant differences. Figure 4.1 contains a graph of cell means versus gender for each of the item difficulty levels. The failure of the test of significance on the tetrad differences indicates that the change in performance of males on high difficulty items when compared to their performance on low difficulty items is not different from the corresponding change in performance for females. The significant omnibus F-ratio implies that one could construct a complex comparison of the difficulty-by-gender cell means which would be statistically significant; however, such a comparison would not be helpful in answering the experimental questions.

Figure 4.1
Plot of Cell Mean versus Gender for Items of High and Low Difficulty
Application Domain
(Vertical axis: cell mean. Horizontal axis: gender, male and female. Lines: low difficulty, high difficulty.)

Difficulty by Ability: The difficulty by ability interaction resulted in a significant omnibus F in all three domains. Figure 4.2 contains graphs of cell means versus item difficulty for the Computation domain. Similar graphs are found in Figures 4.3 and 4.4 for the Application and Algebra domains respectively.

The tetrad differences for each of these interactions were analysed and the null hypothesis was not rejected in any case. It appears that the differences between performance on high difficulty and low difficulty items do not vary significantly at the 0.10 level among ability levels for any of the content domains.

Figure 4.2
Plot of Cell Means versus Item Difficulty by Ability Level
Computation Domain
(Vertical axis: cell means. Horizontal axis: item difficulty level, high and low. Lines: high ability, average ability, low ability.)

Figure 4.3
Plot of Cell Means versus Item Difficulty by Ability Level
Application Domain
(Vertical axis: cell means. Horizontal axis: item difficulty level, high and low. Lines: high ability, average ability, low ability.)

Figure 4.4
Plot of Cell Means versus Item Difficulty by Ability Level
Algebra Domain
(Vertical axis: cell means. Horizontal axis: item difficulty level, high and low. Lines: high ability, average ability, low ability.)

Format by Difficulty: The format by difficulty interaction resulted in a significant omnibus F statistic in the Application domain. A graph of cell means versus item format for two levels of difficulty is found in Figure 4.5.
Figure 4.5
Plot of Cell Means versus Item Format by Item Difficulty
Application Domain
(Vertical axis: cell means. Horizontal axis: item format, C-R and M-C. Lines: low difficulty, high difficulty.)

Scheffe's test showed the tetrad difference to be significant (p < 0.10). That is, the difference in scores on items of high difficulty in multiple-choice and constructed-response formats was significantly different from the difference in scores on items of low difficulty in the two formats. There is a greater difference in achievement between formats for difficult items than for easy items.

Second Order Interactions

The summary ANOVA tables show one significant omnibus F for a second order interaction. That interaction, item format by item difficulty by ability, was found in the Algebra domain. Figure 4.6 shows three graphs. These graphs plot cell mean versus item format at each of two levels of difficulty for the three ability levels. An analysis of the tetrad differences shows that while the interactions of item format by item difficulty for average and high ability students are not significantly different from one another, both of these are significantly different from the interaction of item format by item difficulty for low ability students.

Figure 4.6
Plot of Cell Mean versus Item Format by Item Difficulty for Three Ability Levels
Algebra Domain
(Three panels: low ability, average ability, high ability. Vertical axis: cell means. Horizontal axis: item format, C-R and M-C. Lines: low difficulty, high difficulty.)

Correction for Guessing

The multiple-choice scores were transformed using the traditional correction for guessing and then standardized within the class and subjected to the same set of 2x3x2x2 (gender-by-ability-by-format-by-difficulty) analyses of variance as the uncorrected scores (Appendix D contains a summary of the cell means and standard deviations for both corrected and uncorrected scores). The results of these analyses are summarized in Tables 4.7 to 4.9. These results followed a pattern very similar to that of the analyses performed on the scores not corrected for chance. The findings are summarized in Figure 4.9. In particular, the main effects were identical for both sets of data except for the effect of format. The format effect (multiple-choice score greater than constructed-response score) was significant (p < .05) in the Application and Algebra domains but not significant in the Computation domain.

The patterns of significant interactive effects were also very similar. For the corrected data there was a significant difficulty-by-gender interaction in the Application domain, but no significant tetrad differences (p < .10) were found. Similarly, for the significant difficulty-by-ability interactions in the Computation and Algebra domains, no significant tetrad differences were found.

There were significant format-by-difficulty interactions in the Computation and Algebra domains. Significant tetrad differences were found in these interactions.
These tetrad differences indicate that the difference between scores on the low difficulty and high difficulty subtests in multiple-choice format is larger than the difference between scores on the low difficulty and high difficulty subtests in constructed-response format. These interactions are plotted in Figures 4.7 and 4.8.

As in the case of the uncorrected data, there was one significant format-by-difficulty-by-ability interaction in the Algebra domain. The character of this interaction was identical for both corrected and uncorrected data; the same set of tetrad differences was significant for both.

Table 4.7
Summary Analysis of Variance
Gender, Ability, Item Format, and Item Difficulty
Computation Domain
Multiple-Choice Scores Corrected for Guessing

Source of variance    Degrees of    Mean Square          F
                      Freedom

Mean                       1            1.048         0.47
Gender (G)                 1            2.979         1.33
Ability (A)                2          123.043        55.07*
G x A                      2            3.647         1.63
Error                    185            2.234

Format (F)                 1            1.292         1.77
F x G                      1            0.045         0.06
F x A                      2            1.438         1.97
F x G x A                  2            1.281         1.75
Error                    185            0.731

Difficulty (D)             1          519.856       378.84*
D x G                      1            0.371         0.27
D x A                      2            5.681         4.14*
D x G x A                  2            3.550         2.59
Error                    185            1.372

F x D                      1            5.949         9.75*
F x D x G                  1            2.007         3.29
F x D x A                  2            1.123         1.84
F x D x G x A              2            1.059         1.74
Error                    185            0.610

*p < .05

Table 4.8
Summary Analysis of Variance
Gender, Ability, Item Format, and Item Difficulty
Application Domain
Multiple-Choice Scores Corrected for Guessing

Source of variance    Degrees of    Mean Square          F
                      Freedom

Mean                       1            0.570         0.25
Gender (G)                 1           10.330         4.44*
Ability (A)                2          140.988        60.62*
G x A                      2            4.262         1.83
Error                    185            2.326

Format (F)                 1           13.831        18.38*
F x G                      1            2.286         3.04
F x A                      2            1.929         2.56
F x G x A                  2            0.824         1.09
Error                    185            0.752

Difficulty (D)             1          503.352       475.12*
D x G                      1            4.594         4.34*
D x A                      2            3.148         2.97
D x G x A                  2            0.953         0.90
Error                    185            1.059

F x D                      1            0.300         0.52
F x D x G                  1            0.317         0.55
F x D x A                  2            0.162         0.28
F x D x G x A              2            0.091         0.16
Error                    185            0.575

*p < .05

Table 4.9
Summary Analysis of Variance
Gender, Ability, Item Format, and Item Difficulty
Algebra Domain
Multiple-Choice Scores Corrected for Guessing

Source of variance    Degrees of    Mean Square          F
                      Freedom

Mean                       1            2.257         0.95
Gender (G)                 1            0.324         0.14
Ability (A)                2          135.774        57.23*
G x A                      2            0.327         0.14
Error                    185            2.372

Format (F)                 1           24.969        42.87*
F x G                      1            0.184         0.32
F x A                      2            0.893         1.53
F x G x A                  2            0.751         1.29
Error                    185            0.582

Difficulty (D)             1          426.849       481.29*
D x G                      1            0.009         0.01
D x A                      2            2.766         3.12*
D x G x A                  2            0.082         0.09
Error                    185            0.887

F x D                      1            8.433        20.90*
F x D x G                  1            0.006         0.02
F x D x A                  2            4.952        12.27*
F x D x G x A              2            0.201         0.50
Error                    185            0.404

*p < .05

Figure 4.7
Plot of Cell Means versus Item Format by Item Difficulty
Computation Domain
Chance-Corrected Data
(Vertical axis: cell means. Horizontal axis: item format, C-R and M-C. Lines: low difficulty, high difficulty.)

Figure 4.8
Plot of Cell Means versus Item Format by Item Difficulty
Algebra Domain
Chance-Corrected Data
(Vertical axis: cell means. Horizontal axis: item format, C-R and M-C. Lines: low difficulty, high difficulty.)

Figure 4.9
Summary of Main Effects and Interactions
(Rows: the main effects Gender, Ability, Format, and Difficulty, and the interactions Difficulty x Gender, Difficulty x Ability, Format x Difficulty, and Format x Difficulty x Ability. Columns: the Computation, Application, and Algebra domains, first for the uncorrected scores and then for the scores corrected for guessing. A check mark denotes a significant omnibus F; an asterisk denotes significant tetrad differences.)

Regression Analyses

A set of simple linear regression analyses was performed on the raw score data. In these analyses, the constructed-response scores were treated as the dependent variables and the multiple-choice scores were considered to be the independent variables.

A series of analyses of variance, one analysis for each content domain, preceded the regression analyses. These were done to ensure that there were no significant class-by-format interactions in the raw score data which would confound the regressions. No such interactions were found.

The regression weights, correlations between the scores in each format, and the standard errors are found in Table 4.10. In each of the content domains, the correlations were greater than 0.7, which indicates that there is a strong relationship between the scores in each format. The multiple R-squared statistics show that in each domain, approximately half of the variance in the constructed-response scores can be predicted by the multiple-choice scores.

Table 4.10
Regression Weights and Intercepts, and Correlation Coefficients for C-R Scores Regressed on M-C Scores

Test           Weight    Intercept    Standard Error    Correlation
Computation    0.717       1.497         0.050             0.721
Application    0.807       0.274         0.056             0.726
Algebra        0.768      -0.604         0.046             0.772

Chapter 5

SUMMARY, CONCLUSIONS, AND IMPLICATIONS

This study was undertaken to determine the effect of item format on students' scores on a mathematics achievement test. A total of 191 Grade 7 students who wrote the achievement test in two formats, multiple-choice and constructed-response, one on each of two occasions, provided the data for the study. These data were analysed using an analysis of variance procedure with repeated measures. Other factors besides item format were considered. These factors included gender, ability, and item difficulty.
The results of the analyses shed some light on the relationship between the scores obtained on multiple-choice and constructed-response mathematics achievement tests.

In the paragraphs which follow, a summary of the findings and conclusions based on those findings are presented. Implications of the results are then discussed and suggestions for future research are presented.

Summary

Item Format

Test results showed that item format had a significant effect on the scores obtained by students on a mathematics achievement test. In each of the three content domains, Computation, Application, and Algebra, scores on the multiple-choice form of the test were higher than on the constructed-response form. It would therefore be unwise to consider the multiple-choice and constructed-response scores as interchangeable measures. However, although the scores are of different magnitude, it may be that with a suitable change of scale, one score can be transformed into the other.

In order to feel comfortable about such a transformation, however, it would be necessary to consider any possible confounding effects due to other variables. Previous research has shown that multiple-choice test scores may be affected by response sets, particularly guessing. It has also been shown that response sets are related to personality characteristics. The equivalence of scores depends not only on the transformation of the scores, but also upon the assumption that other factors do not interact with item format to produce unique effects.

The findings regarding other variables considered in the study are discussed below. Particular attention is given to describing the nature of any interactions observed.

Class

It was found that there were differences in achievement among the nine classes sampled.
Although one might have hoped for a consistent level of performance among classes at the same grade level, it is not surprising that such differences exist. More important to the questions of the study, however, is the absence of significant format by class interactions. This suggests that although the levels of performance among classrooms are significantly different, none of those differences can be attributed to the effect of item format.

Gender

In the Application domain males' scores were significantly higher than females'. Sex-related differences in the other two domains were not significant. These findings are in agreement with recent studies (Robitaille and Sherrill, 1977; Robitaille, 1981; Swafford, 1980) which show that if differences in achievement exist between males and females, males tend to outperform females only on those items which require the application of higher cognitive skills.

There were no significant interactions between format and gender. This indicates that both males and females tend to respond to multiple-choice tests in the same way and respond to constructed-response items in the same way. Therefore, if one wishes to use multiple-choice scores as an indication of probable score on the same test in constructed-response format, there is no need to make an adjustment on the basis of a subject's gender.

There was a significant omnibus F obtained for the item difficulty by gender interaction in the Application domain. The interaction, however, was such that there were no significant contrasts among the means which are relevant to the questions of the study.

Ability

As expected, in all content domains high ability students scored significantly higher than average ability students who, in turn, scored significantly higher than low ability students.
There were also significant item difficulty by ability interactions in all three domains. However, there were no contrasts among the means which produced significant results and which were meaningful in terms of the experimental questions. Therefore, differences in scores between high and low difficulty items did not vary according to ability level.

The ability variable did not interact with item format to produce unique effects in any domain. It is reasonable to suspect that low ability students might achieve undeserved higher scores due to guessing. Because there are more items for which low ability students do not know the answer, these students have more of an opportunity to guess. Answers guessed correctly will then inflate the scores to a higher degree than for students who did not guess. This suspicion was not confirmed by the data: the ability by format interactions were not significant. Although there are score differences related to ability, the scores are not affected by a unique combination of ability and item format. Students of differing abilities are not differentially affected by item format. This finding also supports earlier claims that response sets are relatively independent of ability (Hopkins, 1964).

Item Difficulty

Results show that students' scores were lower on difficult items than on easy items in all content domains. This finding is not of great interest. Of greater interest is the interaction between item format and item difficulty. There was a significant interaction between format and difficulty in the Application domain only.
An analysis of the interaction showed that the difference between multiple-choice and constructed-response scores on high difficulty items was greater than the corresponding difference for low difficulty items.

It is not obvious why this interaction should exist only in the Application domain. It may be that content in the Application domain is familiar enough so that clues provided in the multiple-choice alternatives were enough to elicit a correct response. In contrast, the Computation domain may have been so familiar that the students' responses were not affected by clues. The students knew whether or not they could do the exercise and therefore did not search for clues. In the Algebra domain, the content may have been so unfamiliar that clues were of no help. It may also be that because the Application domain contained items that were applications of mathematics to real situations, students may have been more willing or able to seek reasonable responses from those provided in the list of alternatives.

There was a significant second-order interaction involving item format, item difficulty, and ability in the Algebra domain. The analysis of this interaction indicated that the differences between high and low difficulty item scores did not vary with format for high and average ability students. However, for students of low ability, the difference between scores on high and low difficulty items in constructed-response format was less than the corresponding differences for students of high and average ability. Similarly, the difference between scores on high and low difficulty items in multiple-choice format was greater than the corresponding differences for high and average ability students. The reader is referred to Figure 4.6.
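The interactions discussed above were all examined through tetrad differences, that is, differences between two differences of cell means. As a minimal sketch of the arithmetic only, the Python fragment below forms the format-by-difficulty tetrad difference for the Application domain from the raw subtest means in Table 4.3. This is an illustration under stated assumptions: the study tested tetrad differences on within-class standardized scores from the 191 complete cases using Scheffe's procedure, whereas the raw means used here come from all 213 subjects, so the value printed is not the contrast that was tested. The names APPLICATION_MEANS and tetrad_difference are hypothetical.

```python
# Raw subtest means (out of 7 items) for the Application domain, from Table 4.3.
# Keys are (format, difficulty); M-C = multiple-choice, C-R = constructed-response.
APPLICATION_MEANS = {
    ("M-C", "high"): 3.96,
    ("C-R", "high"): 2.90,
    ("M-C", "low"): 5.83,
    ("C-R", "low"): 5.24,
}


def tetrad_difference(cell_means):
    """Format gap on high difficulty items minus format gap on low difficulty items."""
    gap_high = cell_means[("M-C", "high")] - cell_means[("C-R", "high")]
    gap_low = cell_means[("M-C", "low")] - cell_means[("C-R", "low")]
    return gap_high - gap_low


if __name__ == "__main__":
    print(f"Application tetrad difference: {tetrad_difference(APPLICATION_MEANS):.2f}")
```

A positive value means the multiple-choice advantage is larger on difficult items than on easy ones, which is the direction reported for the Application domain. A second-order interaction, such as format by difficulty by ability, amounts to comparing tetrad differences of this kind across the ability groups.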
This pattern of achievement may also be due to the value of clues supplied to the students by the multiple-choice alternatives. For difficult items, the difference between scores over formats for low ability students is less than that of more able students, indicating that perhaps the clues did not help students choose a response for an item with which they were unfamiliar. On the other hand, for easy items, the clues may have provided low ability students with enough information to make a reasonable guess.

The reason that format interacts with item difficulty is not clear. However, significant effects due to this interaction are not widespread. Therefore it would be ill-advised to claim that this effect is of major importance.

Conclusions

The results of this study appear to answer the experimental questions. First, students do score higher on a multiple-choice form of a mathematics achievement test than on the same test in constructed-response format. With regard to gender, males significantly outperformed females in only one domain, Application. There are no differential effects due to format related to gender. Similarly, there are no differential effects due to format related to ability or related to the class in which a student is enrolled.

The effects of item difficulty combined with format are not clear. Format and difficulty showed a significant first-order interaction in only one content domain and these two factors were also involved in a second order interaction in another domain. It appears that format and difficulty have a unique effect only when the student has partial knowledge and is able to make use of the clues provided in the alternatives of the multiple-choice questions. This speculation, however, is only weakly supported by the results.

The absence of format interactions with class, gender, and ability is encouraging.
This indicates that one may be able to develop a procedure which, when applied to multiple-choice scores, will transform them into the scores which would have been obtained if the subjects had written the test in constructed-response format. This transformation would be such that it would not penalize, or give advantage to, any particular group of students from a population partitioned by class, gender, or ability.

The multiple-choice scores were corrected for chance and then analysed in the same manner as the uncorrected scores. The differences due to test format were eliminated in only the Computation domain. In addition there were format-by-difficulty interactions in the Computation and Algebra domains. It appears that the traditional correction for guessing is not sufficient to make multiple-choice scores equivalent to constructed-response scores.

The constructed-response scores obtained in the study were regressed on the multiple-choice scores. The results of the simple linear regression showed that the constructed-response scores were moderately to strongly correlated with multiple-choice scores. The correlations ranged between 0.72 and 0.77 for the three content domains. This implies that the multiple-choice scores account for between 52% and 60% of the variance in the constructed-response scores. Given that the test reliabilities were somewhat low, there were a small number of items, and there was a restricted range of item difficulties, it may be that on longer, more reliable tests these correlations would be significantly higher.

The regression weights were computed and found to range between 0.72 and 0.81. The intercept values, however, ranged from approximately -0.60 to 1.5. Although the slopes of the regression lines are fairly constant, the intercepts are not.
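The regression results just summarized can be written as a prediction rule of the form C-R = weight x M-C + intercept. As a minimal sketch, the Python fragment below applies the weights and intercepts as read from Table 4.10 to a multiple-choice domain score and squares the reported correlations to obtain the proportion of variance accounted for. The names REGRESSION, predict_constructed_response, and variance_explained are hypothetical, and the example score of 10 (out of the 14 items in a domain) is arbitrary.

```python
# (weight, intercept, correlation) for each content domain, from Table 4.10.
REGRESSION = {
    "Computation": (0.717, 1.497, 0.721),
    "Application": (0.807, 0.274, 0.726),
    "Algebra": (0.768, -0.604, 0.772),
}


def predict_constructed_response(domain, mc_score):
    """Predicted constructed-response raw score from a multiple-choice raw score."""
    weight, intercept, _ = REGRESSION[domain]
    return weight * mc_score + intercept


def variance_explained(domain):
    """Squared correlation: share of C-R variance predicted by M-C scores."""
    _, _, r = REGRESSION[domain]
    return r ** 2


if __name__ == "__main__":
    for domain in REGRESSION:
        predicted = predict_constructed_response(domain, 10)
        print(f"{domain:12s} predicted C-R = {predicted:5.2f}, "
              f"r squared = {variance_explained(domain):.2f}")
```

For the same multiple-choice score of 10, the three domains give predicted constructed-response scores of roughly 8.7, 8.3, and 7.1. The spread is driven mainly by the intercepts rather than the slopes.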
This indicates that it is unlikely that there is a global scoring formula which can be applied to all multiple-choice tests to obtain an estimate of the constructed-response score. Rather, the scoring formula for each test may have to be determined empirically.

Implications

The results of the study make it clear that the scores of multiple-choice and constructed-response tests are not interchangeable when making criterion-referenced judgments. However, the results indicate that the two measures are equivalent except for a change of scale. In fact, the change of scale may be a simple linear transformation. This finding is in agreement with that of Rowley and Traub (1977).

Mason's (1979) objection to basing criterion-referenced judgments on multiple-choice data is largely unfounded. Although the absolute scores are not equivalent, the items in both formats appear to measure the same attributes in the same way. The format apparently does not have a differential effect according to content variables or subject variables. Therefore, after allowing a fixed amount for format differences, the two types of scores are interpretable in the same manner.

These findings make the task of interpreting the results of multiple-choice assessment tests less ambiguous. The claim that multiple-choice and constructed-response tests rank students in essentially the same order (Cronbach, 1970) is confirmed. It also appears that intervals within multiple-choice scores may be interpreted in the same way as intervals within scores obtained from constructed-response items. Therefore, the results of multiple-choice tests do not need to be considered as simply ordinal information. The scores can be used to compare groups in other than a norm-referenced fashion.
The skepticism of Interpretation Panels toward making criterion-referenced judgments, as reported by Mussio and Greer (1980), may then be alleviated.

Limitations of the Study

Although the present study investigated the effect of item format on achievement test scores, the following conditions are limiting factors. Only students enrolled in Grade 7 were sampled, and all of those students attended schools in the same suburban/rural school district. The content sampled by the test items was mathematics and did not represent the total Grade 7 mathematics curriculum. The items tested the areas of Computation, Application, and Algebra.

Suggestions for Future Research

Results of this study have cast some light on the relationship between multiple-choice and constructed-response test scores. A number of questions remain unanswered, however. These may be pursued by further research.

In this study constructed-response scores were regressed on multiple-choice scores to obtain estimates of the correlation between the two types of scores. At first sight, the consistency of the regression weights suggests that there may be a simple linear transformation which would change one score into the other. It may be possible to determine this transformation empirically. Studies directed toward determining an empirical scoring formula should be pursued.

In the study there was no attempt made to analyse the discrepancies between the two types of scores. However, the multiple-choice scores were subjected to the traditional correction for guessing procedure and reanalysed. The correction for guessing did not remove the format effect. The usefulness of this procedure should be investigated empirically, perhaps to examine the predictive ability of corrected scores. Such an examination might also provide information about the effect of omitted items on constructed-response score prediction.
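The two procedures discussed in this section, the traditional correction for guessing and an empirically fitted linear scoring formula, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the analysis code used in the study: the function names, the choice of four scorable alternatives per item, and the sample scores are all hypothetical.

```python
from statistics import mean


def corrected_score(right: int, wrong: int, n_options: int) -> float:
    """Traditional correction for guessing: R - W / (k - 1).

    Omitted items are counted as neither right nor wrong. Whether an
    "I don't know" alternative counts toward k is an assumption left
    to the caller.
    """
    return right - wrong / (n_options - 1)


def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r), where r is the Pearson correlation.
    """
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r


if __name__ == "__main__":
    # Hypothetical subtest scores for six students (not data from the study).
    multiple_choice = [5, 6, 8, 10, 12, 13]
    constructed_response = [3, 4, 6, 7, 10, 10]

    slope, intercept, r = fit_line(multiple_choice, constructed_response)
    print(f"slope={slope:.2f} intercept={intercept:.2f} r^2={r * r:.2f}")

    # Empirical scoring formula: predicted constructed-response score.
    print(intercept + slope * 9)

    # Traditional correction, assuming 4 scorable alternatives per item.
    print(corrected_score(right=10, wrong=3, n_options=4))
```

The squared correlation printed by the sketch is the proportion of variance in one score accounted for by the other, which is how correlations are converted to percentages of variance in the Conclusions.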
In the discussion of the results there was mention made of the effect of partial knowledge. Partial knowledge is the body of facts and understandings which, although it does not allow the respondent to construct or choose the correct answer with certainty, enables him or her to eliminate unlikely responses or aids in constructing an acceptable response. There are other multiple-choice formats which attempt to measure the extent of the subject's partial knowledge. An examination of the effect of partial knowledge on the predictive ability of multiple-choice test scores may shed some light on the item format by item difficulty interactions found in this study.

REFERENCES

Bracht, G. H. and Hopkins, K. D. Comparative validities of essay objective tests. Research paper No. 20. Boulder: University of Colorado, Laboratory of Educational Research, 1968.

Bracht, G. H. and Hopkins, K. D. The communality of essay and objective tests of academic achievement. Educational and Psychological Measurement, 1970, 30, 359-364.

British Columbia Ministry of Education. Mathematics curriculum guide years one to twelve. Victoria, B.C.: 1978.

Burton, N. W. An investigation of item-scoring formulas which take into account random guessing, partial information, and misinformation. Unpublished doctoral dissertation, University of Colorado, 1972.

Cronbach, L. J. Response sets and test validity. Educational and Psychological Measurement, 1946, 475-494.

Cronbach, L. J. Essentials of psychological testing (3rd ed.). New York: Harper and Row, 1970.

Dixon, W. J. and Brown, M. B. Biomedical Computer Programs, P-Series. Berkeley: University of California Press, 1979.

Ebel, R. L. Blind guessing on objective achievement tests. Journal of Educational Measurement, 1968, 5, 321-325.

Ebel, R. L. Essentials of educational measurement (3rd ed.).
Englewood Cliffs, N.J.: Prentice-Hall, 1979.

Erickson, G., Erickson, L., and Haggerty, S. Gender and mathematics/science education in elementary and secondary schools. Discussion paper 08/80. Victoria, B.C.: Ministry of Education, 1980.

Fennema, E., Wolleat, P. L., Pedro, J. D., and Becker, A. D. Increasing women's participation in mathematics: An intervention study. Journal for Research in Mathematics Education, 1981, 12, 3-14.

Ferguson, G. H. Statistical analysis in psychology and education (5th ed.). New York: McGraw-Hill, 1981.

Gronlund, N. E. Constructing achievement tests. Englewood Cliffs, N.J.: Prentice-Hall, 1968.

Hopkins, K. D. Extrinsic reliability: Estimating and attenuating variance from response sets, chance, and other irrelevant sources. Educational and Psychological Measurement, 1964, 24, 271-281.

Kinney, L. B. and Eurich, A. C. A summary of investigations comparing different types of tests. School and Society, 1932, 36, 540-544.

Kirk, R. E. Experimental design: Procedures for the behavioral sciences. Belmont, California: Wadsworth, 1968.

Klassen, W., Dukowski, L., and deGroot, I. Item preparation for the 1981 B.C. Mathematics Assessment. Vector, Winter, 1981, 22 (2), 22-25.

Lord, F. M. Formula scoring and validity. Educational and Psychological Measurement, 1963, 23 (4), 663-672.

Lord, F. M. Formula scoring and number right scoring. Journal of Educational Measurement, 1975, 12, 7-11.

Lord, F. M. and Novick, M. R. Statistical theories of mental test scores. Reading, Massachusetts: Addison-Wesley, 1968.

Mason, G. P. Test purpose and item type. Canadian Journal of Education, 1979, 4 (4), 8-13.

Mussio, J. J. and Greer, R. N. The British Columbia Assessment Program: An overview. Canadian Journal of Education, 1980, 5 (4), 22-40.

Nelson, L. R. Guide to LERTAP Use and Interpretation. Dunedin, New Zealand: University of Otago, 1974.

Rabinowitz, J.
C., Mandler, G., and Patterson, K. Determinants of recognition and recall: Accessibility and generation. Journal of Experimental Psychology: General, 1977, 106, 302-329.

Robitaille, D. (Ed.). The 1981 B.C. mathematics assessment: General report. Victoria, B.C.: Ministry of Education, 1981.

Robitaille, D. and Sherrill, J. British Columbia Mathematics Assessment: Summary Report. Victoria, B.C.: Ministry of Education, 1977.

Rowley, G. L. and Traub, R. E. Formula scoring, number-right scoring, and test taking strategy. Journal of Educational Measurement, 1977, 14, 15-22.

Scheffe, H. A. The analysis of variance. New York: John Wiley and Sons, 1959.

Sherriffs, A. C. and Boomer, D. S. Who is penalized by the penalty for guessing? Journal of Educational Psychology, 1954, 45, 81-90.

Stanley, J. C. and Hopkins, K. D. Educational and psychological measurement and evaluation. Englewood Cliffs, N.J.: Prentice-Hall, 1972.

Swafford, J. O. Sex differences in first-year algebra. Journal for Research in Mathematics Education, 1980, 11, 335-346.

Thorndike, R. L. The problem of guessing. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, D.C.: American Council on Education, 1971, 59-61.

Thorndike, R. L., Hagen, E., and Wright, E. N. Canadian Cognitive Abilities Test, Form 1, Levels A-F. Toronto: Thomas Nelson and Sons, 1974.

Traub, R. E. and Fisher, C. W. On the equivalence of constructed-response and multiple-choice tests. Applied Psychological Measurement, 1977, 1, 355-369.

Tulving, E. and Watkins, M. J. Continuity between recall and recognition. American Journal of Psychology, 1973, 86 (4), 739-748.

Wolleat, P. L., Pedro, J. D., Becker, A. D., and Fennema, E. Sex differences in high school students' causal attributions of performance in mathematics. Journal for Research in Mathematics Education, 1980, 11, 356-366.

APPENDIX A.
Copies of the Test Instruments and Table of Question Distribution

Table A.1
Distribution of Questions by Domain and Difficulty

Domain        Difficulty   Question Number
Computation   High          8  9 14 26 27 33 38
              Low           1  2  7 12 15 18 29
Application   High          3  6 13 24 25 41 42
              Low           4  5 10 17 28 30 36
Algebra       High         16 20 31 32 34 37 40
              Low          11 19 21 22 23 35 39

NAME
TEACHER'S NAME

1. Divide: 45 ) 1232
   A) 25 remainder 7  B) 27 remainder 17  C) 29 remainder 27  D) 207 remainder 17  E) I don't know

2. Subtract: 51.2 - 4.35 =
   A) 46.95  B) 46.85  C) 17.7  D) 7.7  E) I don't know

3. One fourth of a cake is shared equally among 3 children. What fraction of the whole cake did each of the children receive?
   A) 1/7  B) 3/4  C) 1/12  D) [illegible]  E) I don't know

4. Seven pies are to be cut into fourths. How many pieces will there be?
   A) 14  B) 7  C) 28  D) 36  E) I don't know

5. The chart shows how long it took Ted to deliver papers last week. He worked a total of 320 minutes during the week. How long did it take him to deliver papers on Wednesday?

   Day       Mon  Tues  Wed  Thurs  Fri  Sat
   Minutes    50    60    ?     60   55   45

   A) 54  B) 50  C) 55  D) 60  E) I don't know

6. A bicycle bought for $80.00 was sold at a loss of 30%. What was the selling price?
   A) $24  B) $104  C) $56  D) $30  E) I don't know

7. Multiply: 6 x [fraction illegible]
   A) 22  B) [illegible]  C) 4  D) [illegible]  E) I don't know

8. Divide: [fractions illegible]
   A) 32  B) [illegible]  C) [illegible]  D) [illegible]  E) I don't know

9. 12 is 15% of what number?
   A) 80  B) 180  C) 800  D) 1.8  E) I don't know

10. A map of B.C. is to be drawn so that 1 millimetre represents 5 kilometres. If the actual distance between Vernon and Penticton is 125 kilometres, how many millimetres apart should these two points be on the map?
   A) 125  B) 625  C) 120  D) 25  E) I don't know

11. Solve for n: n/4 = 8
   A) 32  B) 8  C) 2  D) 1/32  E) I don't know

12. Calculate: 4^3 =
   A) 36  B) 64  C) 12  D) 32  E) I don't know

13. Patti took 20 pictures with her new camera.
Five of the pictures were over-exposed and could not be developed. It cost $4.50 to develop the roll. What was the cost of each developed picture?
   A) 30¢  B) 18¢  C) 22½¢  D) 25¢  E) I don't know

14. Divide: 0.0228 ÷ 0.003
   A) 7.6  B) 76.0  C) 13.0  D) 0.13  E) I don't know

15. Subtract: 12 [fraction illegible] - 3 [fraction illegible]
   [options A) to D) illegible]  E) I don't know

16. In the formula [illegible] = R, if I = 250, P = 1000 and T = 2 then R = ?
   A) [illegible]  B) 50  C) 1  D) [illegible]  E) I don't know

17. If 37% of the Canadian population is under 20 years of age, what percent of the population is 20 years of age or older?
   A) 37%  B) 63%  C) 67%  D) 137%  E) I don't know

18. 0.95 as a percent is
   A) 9.5%  B) 0.95%  C) 95%  D) [illegible]  E) I don't know

19. Solve: 3x - 3 = 12
   A) x = 7  B) x = 4  C) x = 3  D) x = 5  E) I don't know

20. Write an equation which represents the sentence: "If 9 is added to 4 times a number the result is 29".
   A) 4x = 29 + 9  B) 4(x + 9) = 29  C) 9x + 4 = 29  D) 4x + 9 = 29  E) I don't know

21. Write an expression which represents a number increased by 5.
   A) 5 - x  B) x + 5  C) 5 > x  D) [illegible]  E) I don't know

22. If n = 5, then 2n + 4 =
   A) 14  B) 18  C) 20  D) 11  E) I don't know

23. One number is 3 times as large as a second number. The sum of the two numbers is 72. What are the numbers?
   A) 24 and 8  B) 18 and 6  C) 12 and 36  D) 18 and 54  E) I don't know

24. A traffic signal has four equally spaced lights. How far apart are the centres of lights 2 and 4?
   [Figure: traffic signal with four lights; a 90 cm dimension is marked]
   A) 22.5 cm  B) 30 cm  C) 45 cm  D) 60 cm  E) I don't know

25. A pasture is 48 m long and 30 m wide. How wide should a scale model of the pasture be if the length of the model is 24 cm?
   A) 15 cm  B) 38.4 cm  C) 60 cm  D) 12 cm  E) I don't know

26. Some of the digits have been covered. What digit was under the circle?
   [Figure: arithmetic problem with some digits covered]
   A) 1  B) 3  C) 5  D) 4  E) I don't know

27. Written as a decimal, [fraction illegible] =
   A) 0.12  B) 0.8  C) 0.125  D) 0.18  E) I don't know

28. If a man mowed [fraction illegible] of his lawn, what part of his lawn does he still have to mow?
   A) [illegible]  B) [illegible]  C) [illegible]  D) 0  E) I don't know

29. Written as a decimal, four and four hundredths is
   A) 0.44  B) 44.00  C) 4.4  D) 4.04  E) I don't know

30. British Columbia became a province of Canada in 1871. Alberta became a province in 1905. How many years after British Columbia did Alberta become a province?
   A) 24  B) 134  C) 74  D) 34  E) I don't know

31. If 12(n + 7) = 108 then the value of n is
   A) 9  B) 89  C) 2  D) 18 [illegible]  E) I don't know

32. If n is an odd number then the next odd number is:
   A) n + 1  B) n + 2  C) n + 3  D) 2n - 1  E) I don't know

33. Which of these numbers is largest?
   [numbers and options A) to D) illegible]  E) I don't know

34. If m = 2 and n = 3, then what is the value of 5(3m + 4n)?
   A) 35  B) 90  C) 85  D) 17  E) I don't know

35. What is the solution to 3n = 15?
   A) 45  B) 18  C) 5  D) 12  E) I don't know

36. If one kg of oranges costs $0.85, what will be the cost of 4.2 kg?
   A) $4.55  B) $4.85  C) $3.98  D) $3.57  E) I don't know

37. If 3n = 1, then n =
   A) 1  B) -2  C) [illegible]  D) 2  E) I don't know

38. Simplify: [fraction illegible]
   A) 0  B) Infinity  C) 6  D) Cannot be done  E) I don't know

39. What is the solution of 2n + 8 = 20?
   A) 12  B) 14  C) 6  D) 10  E) I don't know

40. What values of n make the sentence (n + 5) - 5 = n TRUE?
   A) 0 only  B) 0 and 5 only  C) all values of n  D) no values of n  E) I don't know

41. A salesman sold $2200.00 worth of merchandise in one month. If he earns 8% commission on sales, what is his commission for this month?
   A) $220.00  B) $176.00  C) $22.00  D) $17.60  E) I don't know

42. Paul earned $12 272 in twenty-six weeks. What was his weekly income?
   A) $482  B) $472  C) $468  D) $293  E) I don't know

NAME
TEACHER'S NAME

1. Divide: 45 ) 1232

2. Subtract: 51.2 - 4.35 =

3. One fourth of a cake is shared equally among 3 children. What fraction of the whole cake did each of the children receive?

4. Seven pies are to be cut into fourths. How many pieces will there be?

5.
The chart shows how long it took Ted to deliver papers last week. He worked a total of 320 minutes during the week. How long did it take him to deliver papers on Wednesday?

   Day       Mon  Tues  Wed  Thurs  Fri  Sat
   Minutes    50    60    ?     60   55   45

6. A bicycle bought for $80.00 was sold at a loss of 30%. What was the selling price?

7. Multiply: 6 x [fraction illegible]

8. Divide: [fractions illegible]

9. 12 is 15% of what number?

10. A map of B.C. is to be drawn so that 1 millimetre represents 5 kilometres. If the actual distance between Vernon and Penticton is 125 kilometres, how many millimetres apart should these two points be on the map?

11. Solve for n: n/4 = 8

12. Calculate: 4^3 =

13. Patti took 20 pictures with her new camera. Five of the pictures were over-exposed and could not be developed. It cost $4.50 to develop the roll. What was the cost of each developed picture?

14. Divide: 0.0228 ÷ 0.003

15. Subtract: 12 [fraction illegible] - 3 [fraction illegible]

16. In the formula [illegible] = R, if I = 250, P = 1000 and T = 2 then R = ?

17. If 37% of the Canadian population is under 20 years of age, what percent of the population is 20 years of age or older?

18. 0.95 as a percent is

19. Solve: 3x - 3 = 12

20. Write an equation which represents the sentence: "If 9 is added to 4 times a number the result is 29".

21. Write an expression which represents a number increased by 5.

22. If n = 5, then 2n + 4 =

23. One number is 3 times as large as a second number. The sum of the two numbers is 72. What are the numbers?

24. A traffic signal has four equally spaced lights. How far apart are the centres of lights 2 and 4?
   [Figure: traffic signal with four lights]

25. A pasture is 48 m long and 30 m wide. How wide should a scale model of the pasture be if the length of the model is 24 cm?

26. Some of the digits have been covered. What digit was under the circle?
   [Figure: arithmetic problem with some digits covered]

27. Written as a decimal, [fraction illegible] =

28.
If a man mowed [fraction illegible] of his lawn, what part of his lawn does he still have to mow?

29. Written as a decimal, four and four hundredths is:

30. British Columbia became a province of Canada in 1871. Alberta became a province in 1905. How many years after British Columbia did Alberta become a province?

31. If 12(n + 7) = 108 then the value of n is

32. If n is an odd number then the next odd number is:

33. Which of these numbers is largest?
   [numbers illegible]

34. If m = 2 and n = 3, then what is the value of 5(3m + 4n)?

35. What is the solution to 3n = 15?

36. If one kg of oranges costs $0.85, what will be the cost of 4.2 kg?

37. If 3n = 1, then n =

38. Simplify: [fraction illegible] =

39. What is the solution of 2n + 8 = 20?

40. What values of n make the sentence (n + 5) - 5 = n TRUE?

41. A salesman sold $2200.00 worth of merchandise in one month. If he earns 8% commission on sales, what is his commission for this month?

42. Paul earned $12 272 in twenty-six weeks. What was his weekly income?

APPENDIX B.

Instructions to Test Administrators
(First Testing Occasion)

First, let me thank you for taking time from your busy schedule to administer these tests to your Grade 7 students. The purpose of the study of which this testing is a part is to determine what effect, if any, item format has on the scores obtained by students on mathematics achievement tests and how those scores are affected by gender, ability, item difficulty, and content.

You will notice that the two tests are identical except that one test is in multiple-choice format and the other is in open-ended format. The test is actually made up of six subtests of seven items each. Each of three content domains, Computation, Application, and Algebra, contains both easy and hard items mixed throughout the length of the test.

The study is a counterbalanced repeated measures design.
Each student will take the test under both formats, one on April 29th in Math class, and the other two weeks later. On each occasion half the students will write one format and the other half will write the other format. Because of this, and so that gender and ability can be coded along with the test results, the tests must be identified with the students.

Because the tests are given on two different occasions, it is important that you not alter your teaching plans based on the test content in the two-week period between the test dates. Please carry on as though the testing had not taken place. In addition, please make sure that you do not inform the students that they will be writing the same test again in two weeks' time.

With this letter, you should receive enough copies of the test for your Grade 7 class. You will notice that the test formats are mixed. Please distribute the tests randomly to your class.

Once the tests have been distributed, please say to the class:

"Today you are going to write a test so that you can find out how well you write Math tests. Although this mark may not count as part of your total grade, I expect you to do your best. If you try hard on this test it is to your advantage. The results of these tests will be used to help teachers design better and fairer tests.

"Please write your full name, both your first name and your last name, on the front page of the test. Put my name (insert teacher's name) on the test as well.

"Now turn over the front page of the test. Each of you has a test which has 42 questions. Some of you have multiple-choice tests and some of you have tests where you must
fill in a blank with the right answer. The questions on both tests are the same.

"To answer the multiple-choice questions, circle the letter of the best answer. If you have a fill in the blanks test then write your answer neatly on the line next to the question. You may do your working on the test, just make sure that your answer is neatly written in the proper place.

"You have one hour to finish the test. Do your best. Don't spend too much time on one question. You can always come back and answer it after you have finished the others. Check your paper before you finish.

"When you finish please put your paper face down on your desk and sit quietly and read."

After the hour is up, please collect the papers and check that each student has put his or her full name on the test. I'll collect the tests on the test date or the day after. Please return all the tests, used or unused. If any students ask for help while writing the test, please do no more than read the question to them.

Instructions to Test Administrators
(Second Testing Occasion)

Thank you once again for agreeing to administer these tests to your Grade 7 students. Enclosed you will find a test for each student who took part in the first testing period. In addition, there are some blank tests for those who do not have complete tests or who were absent for the first testing period.

This second set of tests is to be administered on the day two weeks following the first testing period. On that day, please distribute the tests to the students. Any students who did not write the first test should be given one of the blank tests.

Once the tests have been distributed, please say to the class:

"Today you are going to write another test so that you can find out how well you write Math tests. Although this mark may not count as part of your total grade, I expect you to do your best. If you try hard on this test it is to your advantage. The results of these tests will be used to help teachers design better and fairer tests.

"Please make sure that you have the test with your name on it.
If you received a blank test then write your full name and my name (teacher's name) on the front page of the test.

"Now turn over the front page of the test. Each of you has a test which has 42 questions. Some of you have multiple-choice tests and some of you have tests where you must fill in the blank with the right answer. The questions on both tests are the same.

"To answer the multiple-choice questions, circle the letter of the best answer. If you have a fill in the blanks test then write your answer neatly on the line next to the question. You may do your working on the test, just make sure that your answer is neatly written in the proper place.

"You have one hour to finish the test. Do your best. Don't spend too much time on one question. You can always come back and answer it after you have finished the others. Check your paper before you finish.

"When you finish please put your paper face down and sit quietly and read."

If a student's test needs to be replaced because of missing pages etc., please be sure to give the student a replacement of the same form. That is, if a student has a multiple-choice test then replace it with a multiple-choice test, and if a student has an open-ended test then replace it with an open-ended test.

After the hour is up, please collect the papers. I'll collect the tests on the test date or the day after. Please return all tests, used or unused. If any students ask for help while writing the test, please do no more than read the question to them.

APPENDIX C.
Summary Analysis of Variance Tables for Order of Administration, Class, and Item Difficulty

Table C.1
Summary Analysis of Variance: Class, Order, and Item Difficulty
Multiple-Choice Computation

Source of variance   Degrees of freedom   Mean square   F
Mean                   1                  8591.477      2614.01*
Class (C)              8                    12.990         3.95*
Order (O)              1                     1.723         0.52
C x O                  8                     5.025         1.53
Error                195                     3.287
Difficulty (D)         1                   346.277       268.84*
D x C                  8                     2.919         2.27*
D x O                  1                     0.398         0.31
D x C x O              8                     1.170         0.91
Error                195                     1.288
*p < .05

Table C.2
Summary Analysis of Variance: Class, Order, and Item Difficulty
Multiple-Choice Application

Source of variance   Degrees of freedom   Mean square   F
Mean                   1                  9434.786      2484.68*
Class (C)              8                     3.546         0.93
Order (O)              1                     0.087         0.02
C x O                  8                     2.838         0.75
Error                195                     3.797
Difficulty (D)         1                   317.591       230.38*
D x C                  8                     1.009         0.73
D x O                  1                     2.646         1.92
D x C x O              8                     0.656         0.48
Error                195                     1.379
*p < .05

Table C.3
Summary Analysis of Variance: Class, Order, and Item Difficulty
Multiple-Choice Algebra

Source of variance   Degrees of freedom   Mean square   F
Mean                   1                  5843.306      1134.73*
Class (C)              8                    13.437         2.61*
Order (O)              1                     2.264         0.44
C x O                  8                     6.777         1.32
Error                195                     5.150
Difficulty (D)         1                   516.146       357.67*
D x C                  8                     1.382         0.96
D x O                  1                     0.991         0.69
D x C x O              8                     1.061         0.74
Error                195                     1.443
*p < .05

Table C.4
Summary Analysis of Variance: Class, Order, and Item Difficulty
Constructed-Response Computation

Source of variance   Degrees of freedom   Mean square   F
Mean                   1                  6620.177      2194.50*
Class (C)              8                    19.306         6.40*
Order (O)              1                     6.754         2.24
C x O                  8                     4.670         1.55
Error                195                     3.017
Difficulty (D)         1                   299.478       221.52*
D x C                  8                     2.779         2.06*
D x O                  1                     0.870         0.64
D x C x O              8                     2.702         2.00*
Error                195                     1.352
*p < .05

Table C.5
Summary Analysis of Variance: Class, Order, and Item Difficulty
Constructed-Response Application

Source of variance   Degrees of freedom   Mean square   F
Mean                   1                  6487.121      1429.61*
Class (C)              8                     3.143         0.69
Order (O)              1                    40.168         8.85*
C x O                  8                     3.992         0.88
Error                195                     4.538
Difficulty (D)         1                   499.965       449.90*
D x C                  8                     2.602         2.34*
D x O                  1                     1.347         1.21
D x C x O              8                     0.739         0.67
Error                195                     1.111
*p < .05

Table C.6
Summary Analysis of Variance: Class, Order, and Item Difficulty
Constructed-Response Algebra

Source of variance   Degrees of freedom   Mean square   F
Mean                   1                  2727.003       567.16*
Class (C)              8                     9.468         1.97
Order (O)              1                    72.963        15.17*
C x O                  8                     9.042         1.88
Error                195                     4.808
Difficulty (D)         1                   382.270       359.67*
D x C                  8                     1.926         1.81
D x O                  1                     2.295         2.16
D x C x O              8                     0.613         0.58
Error                195                     1.063
*p < .05

APPENDIX D.

Table D.1
CELL MEANS AND STANDARD DEVIATIONS
COMPUTATION
(standard deviations in parentheses)

                                   ABILITY
FORMAT   DIFFICULTY   LOW             AVERAGE         HIGH
MALES                 n = 31          n = 24          n = 33
M-C      HIGH         -1.39 (1.03)    -0.31 (1.34)     0.15 (1.06)
         LOW           0.24 (1.35)     1.32 (0.74)     1.47 (0.76)
C-R      HIGH         -1.90 (1.03)    -1.17 (1.19)    -0.36 (1.12)
         LOW          -0.21 (1.24)     0.82 (1.02)     1.02 (0.83)
FEMALES               n = 31          n = 40          n = 32
M-C      HIGH         -1.54 (1.07)    -0.59 (1.40)     0.50 (1.05)
         LOW           0.95 (0.91)     0.98 (1.23)     1.77 (0.79)
C-R      HIGH         -1.76 (1.11)    -0.80 (1.21)    -0.20 (1.31)
         LOW           0.16 (1.23)     0.51 (1.05)     1.26 (0.79)

Table D.2
CELL MEANS AND STANDARD DEVIATIONS
APPLICATION
(standard deviations in parentheses)

                                   ABILITY
FORMAT   DIFFICULTY   LOW             AVERAGE         HIGH
MALES                 n = 31          n = 24          n = 33
M-C      HIGH         -1.05 (0.97)    -0.47 (1.41)     0.33 (1.09)
         LOW           0.61 (1.12)     1.20 (0.93)     1.80 (0.53)
C-R      HIGH         -1.77 (1.03)    -0.91 (1.33)    -0.51 (1.24)
         LOW           0.20 (1.07)     1.21 (0.71)     1.23 (0.74)
FEMALES               n = 31          n = 40          n = 32
M-C      HIGH         -1.06 (1.13)    -0.45 (1.23)     0.47 (1.04)
         LOW           0.03 (1.35)     1.11 (0.94)     1.57 (0.59)
C-R      HIGH         -2.17 (1.05)    -1.38 (1.26)    -0.39 (1.36)
         LOW          -0.84 (1.28)     0.82 (0.85)     1.17 (0.92)

Table D.3
CELL MEANS AND STANDARD DEVIATIONS
ALGEBRA
(standard deviations in parentheses)

                                   ABILITY
FORMAT   DIFFICULTY   LOW             AVERAGE         HIGH
MALES                 n = 31          n = 24          n = 33
M-C      HIGH         -1.13 (0.82)    -0.42 (1.24)     0.52 (1.07)
         LOW           0.66 (1.20)     1.30 (1.01)     1.59 (0.84)
C-R      HIGH         -1.70 (0.44)    -1.05 (1.16)    -0.41 (1.12)
         LOW          -0.67 (1.02)     0.66 (0.94)     1.06 (0.96)
FEMALES               n = 31          n = 40          n = 32
M-C      HIGH         -1.20 (0.75)    -0.26 (1.19)     0.58 (1.22)
         LOW           0.60 (1.17)     1.41 (1.13)     1.90 (0.69)
C-R      HIGH         -1.59 (0.67)    -1.21 (0.94)    -0.34 (0.99)
         LOW          -0.57 (1.16)     0.58 (1.14)     1.08 (0.92)

Table D.4
CELL MEANS AND STANDARD DEVIATIONS
COMPUTATION SCORES CORRECTED FOR GUESSING
(standard deviations in parentheses)

                                   ABILITY
FORMAT   DIFFICULTY   LOW             AVERAGE         HIGH
MALES                 n = 31          n = 24          n = 33
M-C      HIGH         -1.76 (1.09)    -0.59 (1.52)    -0.12 (1.19)
         LOW           0.05 (1.44)     1.17 (0.89)     1.40 (0.83)
C-R      HIGH         -1.53 (0.95)    -0.88 (1.09)    -0.13 (1.02)
         LOW           0.01 (1.14)     0.95 (0.93)     1.14 (0.77)
FEMALES               n = 31          n = 40          n = 32
M-C      HIGH         -1.98 (1.21)    -0.92 (1.52)     0.31 (1.16)
         LOW           0.84 (0.99)     0.83 (1.35)     1.73 (0.85)
C-R      HIGH         -1.42 (1.01)    -0.54 (1.11)     0.01 (1.20)
         LOW           0.34 (1.13)     0.65 (0.96)     1.35 (0.73)

Table D.5
CELL MEANS AND STANDARD DEVIATIONS
APPLICATION SCORES CORRECTED FOR GUESSING
(standard deviations in parentheses)

                                   ABILITY
FORMAT   DIFFICULTY   LOW             AVERAGE         HIGH
MALES                 n = 31          n = 24          n = 33
M-C      HIGH         -1.35 (1.05)    -0.79 (1.62)     0.12 (1.23)
         LOW           0.48 (1.20)     1.22 (1.04)     1.78 (0.56)
C-R      HIGH         -1.48 (0.95)    -0.71 (1.23)    -0.31 (1.13)
         LOW           0.33 (0.99)     1.26 (0.66)     1.30 (0.69)
FEMALES               n = 31          n = 40          n = 32
M-C      HIGH         -1.43 (1.28)    -0.68 (1.29)     0.31 (1.17)
         LOW          -0.17 (1.47)     1.02 (1.03)     1.55 (0.64)
C-R      HIGH         -1.86 (0.97)    -1.12 (1.16)    -0.21 (1.25)
         LOW          -0.63 (1.18)     0.90 (0.79)     1.23 (0.86)

Table D.6
CELL MEANS AND STANDARD DEVIATIONS
ALGEBRA SCORES CORRECTED FOR GUESSING
(standard deviations in parentheses)

                                   ABILITY
FORMAT   DIFFICULTY   LOW             AVERAGE         HIGH
MALES                 n = 31          n = 24          n = 33
M-C      HIGH         -1.55 (0.87)    -0.72 (1.37)     0.31 (1.18)
         LOW           0.50 (1.25)     1.17 (1.06)     1.50 (0.92)
C-R      HIGH         -1.36 (0.41)    -0.78 (1.07)    -0.21 (1.03)
         LOW          -0.40 (0.94)     0.80 (0.88)     1.15 (0.90)
FEMALES               n = 31          n = 40          n = 32
M-C      HIGH         -1.61 (0.85)    -0.55 (1.31)     0.38 (1.36)
         LOW           0.35 (1.28)     1.28 (1.22)     1.80 (0.77)
C-R      HIGH         -1.28 (0.62)    -0.91 (0.86)    -0.13 (0.90)
         LOW          -0.34 (1.08)     0.73 (1.06)     1.18 (0.86)
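As a reading aid for the Appendix C tables, each F value should equal the effect mean square divided by the error mean square listed for the same block of rows. The following is a minimal sketch of that check using the Table C.1 entries; the function and variable names are hypothetical, and small gaps are expected because the printed mean squares are rounded.

```python
def f_ratio(ms_effect: float, ms_error: float) -> float:
    """F statistic as the ratio of an effect mean square to its error mean square."""
    return ms_effect / ms_error


# (source, effect mean square, error mean square, F as printed in Table C.1)
TABLE_C1 = [
    ("Class (C)", 12.990, 3.287, 3.95),
    ("Order (O)", 1.723, 3.287, 0.52),
    ("C x O", 5.025, 3.287, 1.53),
    ("Difficulty (D)", 346.277, 1.288, 268.84),
    ("D x C", 2.919, 1.288, 2.27),
    ("D x O", 0.398, 1.288, 0.31),
    ("D x C x O", 1.170, 1.288, 0.91),
]


def max_relative_gap(rows) -> float:
    """Largest relative difference between recomputed and printed F values."""
    return max(abs(f_ratio(ms, err) - f) / f for _, ms, err, f in rows)


if __name__ == "__main__":
    for source, ms, err, printed in TABLE_C1:
        print(f"{source:15s} computed F = {f_ratio(ms, err):8.2f}   printed F = {printed:8.2f}")
    print(f"largest relative gap: {max_relative_gap(TABLE_C1):.4f}")
```

The same check can be applied to Tables C.2 through C.6 by substituting their mean squares.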
Title | The effect of item format on mathematics achievement test scores |
Creator | Dukowski, Leslie Hubert |
Publisher | University of British Columbia |
Date Issued | 1982 |
Subject | Mathematics -- Study and teaching (Elementary); Educational tests and measurements |
Genre | Thesis/Dissertation |
Type | Text |
Language | eng |
Date Available | 2010-04-01 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0055084 |
URI | http://hdl.handle.net/2429/23287 |
Degree | Master of Arts - MA |
Program | Education |
Affiliation | Education, Faculty of |
Degree Grantor | University of British Columbia |
Campus | UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |