@prefix vivo: . @prefix edm: . @prefix ns0: . @prefix dcterms: . @prefix skos: . vivo:departmentOrSchool "Education, Faculty of"@en ; edm:dataProvider "DSpace"@en ; ns0:degreeCampus "UBCV"@en ; dcterms:creator "McKie, Thomas Douglas Muir"@en ; dcterms:issued "2011-11-15T20:54:42Z"@en, "1962"@en ; vivo:relatedDegree "Master of Arts - MA"@en ; ns0:degreeGrantor "University of British Columbia"@en ; dcterms:description """It was hypothesized that on an achievement test, items measuring complex cognitive objectives would exhibit a higher mean discrimination index — based on the whole test as criterion — than would an equal number of items measuring less complex cognitive objectives; and that the mean discrimination index of these items would in turn be higher than that of the same number of still less complex items. The proviso was made that the difficulty indices of the items be similarly distributed within the several categories of items, hereafter called "relevance-categories," since discriminating power is related to difficulty. The categories selected were, from simplest to most complex, the Knowledge, Comprehension, and Application categories of Bloom's "Taxonomy of Educational Objectives." An achievement test was constructed, consisting of items in all three categories, and covering the content of two units of the British Columbia university-programme grade nine science course. A try-out of this test, on 200 students in two schools, permitted negatively discriminating items to be rejected and, in addition, provided difficulty indices for the remaining items. It was possible to match forty Knowledge items and forty Comprehension items very closely for difficulty; however, the mean difficulty of the Application items was so high that they could not be used in a test of the hypothesis without reducing numbers too drastically in all categories. 
Two "equivalent forms," matched for content, relevance-category, and difficulty, were constructed from these eighty items and administered to 530 students in three schools. The reliability coefficient of the total test, estimated by correlating the sub-test scores and applying the Spearman-Brown formula, was .84; those of the Knowledge and Comprehension categories were similarly found to be .69 and .77, respectively. Revised difficulty indices, based on the new and larger sample, were calculated. Their distributions within the two relevance categories were found to be very similar, though not as closely matched as on the basis of the try-out test. For each item, the point-biserial coefficient of correlation between item and total score was computed (this being the selected index of discrimination), and Fisher's z-transformation was applied to produce measures with more nearly an equal-unit scale, in the hope that the parametric t-test could be used. However, the shapes of the resulting distributions were such that they could not be claimed to be samples from a normal population or populations. Accordingly, the t-test was rejected in favour of the non-parametric Mann-Whitney test of "no difference in median discrimination indices." The respective medians were .27 and .30, in terms of Fisher's z-values, but the difference proved to be non-significant at the pre-selected 1%-level of significance. It was concluded that this experiment provided no grounds for accepting the hypothesis of the study. However, the actual probability of obtaining, in random sampling from a single population, a difference as large as that observed was only about .10; in addition, the results consistently favoured the Comprehension items, whose discrimination indices exceeded those of the Knowledge items at the extremes as well as at the mean. 
It was therefore suggested that if adequate testing time could be obtained, the use of larger numbers of items in all categories might increase test-reliability and possibly produce a significant result. Suggestions were advanced, based upon observations from the data, for refining the experiment and for further research."""@en ; edm:aggregatedCHO "https://circle.library.ubc.ca/rest/handle/2429/39012?expand=metadata"@en ; skos:note "AN INVESTIGATION OF THE RELATIONSHIP BETWEEN THE RELEVANCE CATEGORY OF ACHIEVEMENT TEST ITEMS AND THEIR INDICES OF DISCRIMINATION
by THOMAS DOUGLAS MUIR McKIE
A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF ARTS in the COLLEGE OF EDUCATION
We accept this thesis as conforming to the required standard
THE UNIVERSITY OF BRITISH COLUMBIA
September, 1962

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the Head of my Department or by his representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.
The University of British Columbia, Vancouver, Canada.
Date [illegible]

ABSTRACT
It was hypothesized that on an achievement test, items measuring complex cognitive objectives would exhibit a higher mean discrimination index (based on the whole test as criterion) than would an equal number of items measuring less complex cognitive objectives; and that the mean discrimination index of these items would in turn be higher than that of the same number of still less complex items. 
The proviso was made that the difficulty indices of the items be similarly distributed within the several categories of items, hereafter called \"relevance-categories,\" since discriminating power is related to difficulty. The categories selected were, from simplest to most complex, the Knowledge, Comprehension, and Application categories of Bloom's \"Taxonomy of Educational Objectives.\" An achievement test was constructed, consisting of items in all three categories, and covering the content of two units of the British Columbia university-programme grade nine science course. A try-out of this test, on 200 students in two schools, permitted negatively discriminating items to be rejected and, in addition, provided difficulty indices for the remaining items. It was possible to match forty Knowledge items and forty Comprehension items very closely for difficulty; however, the mean difficulty of the Application items was so high that they could not be used in a test of the hypothesis without reducing numbers too drastically in all categories. Two \"equivalent forms,\" matched for content, relevance-category, and difficulty, were constructed from these eighty items and administered to 530 students in three schools. The reliability coefficient of the total test, estimated by correlating the sub-test scores and applying the Spearman-Brown formula, was .84; those of the Knowledge and Comprehension categories were similarly found to be .69 and .77, respectively. Revised difficulty indices, based on the new and larger sample, were calculated. Their distributions within the two relevance categories were found to be very similar, though not as closely matched as on the basis of the try-out test. 
For each item, the point-biserial coefficient of correlation between item and total score was computed (this being the selected index of discrimination), and Fisher's z-transformation was applied to produce measures with more nearly an equal-unit scale, in the hope that the parametric t-test could be used. However, the shapes of the resulting distributions were such that they could not be claimed to be samples from a normal population or populations. Accordingly, the t-test was rejected in favour of the non-parametric Mann-Whitney test of \"no difference in median discrimination indices.\" The respective medians were .27 and .30, in terms of Fisher's z-values, but the difference proved to be non-significant at the pre-selected 1%-level of significance. It was concluded that this experiment provided no grounds for accepting the hypothesis of the study. However, the actual probability of obtaining, in random sampling from a single population, a difference as large as that observed was only about .10; in addition, the results consistently favoured the Comprehension items, whose discrimination indices exceeded those of the Knowledge items at the extremes as well as at the mean. It was therefore suggested that if adequate testing time could be obtained, the use of larger numbers of items in all categories might increase test-reliability and possibly produce a significant result. Suggestions were advanced, based upon observations from the data, for refining the experiment and for further research.

TABLE OF CONTENTS
CHAPTER
I. THE PROBLEM AND HYPOTHESIS
  The Problem
    Statement of the problem
    Statement of the hypothesis
  Definition of Terms
  Discussion of the Hypothesis
  Importance of the Study
  Organization of Remainder of the Thesis
II. REVIEW OF THE LITERATURE
  Literature on Categorizing Educational Objectives
  Literature Specifically Related to the Present Hypothesis
III. 
DELIMITATION OF THE PROBLEM
  Grade-Level and Content of Test
  Degree of Speededness of the Test
  Type and Number of Items
  Nature and Size of Sample of Students
  The Tryout Problem
IV. METHODOLOGY
  Preparation of the Tryout Tests
    Writing and classifying the items
    Assembly of tryout tests
  Administration of the Tryout Tests
    The problem of guessing
    The test-directions
    Administration and results of the tryout tests
  The Preparation and Administration of the Final Test
    Selection of items for the final test
    Equivalence of relevance categories in respect of difficulty
    The final test
    Administration of the final test
    The test results
V. THE TEST RESULTS AND THEIR STATISTICAL ANALYSIS
  The Test Scores and Reliabilities
    Knowledge, comprehension, and total scores
    Reliability of the test and relevance categories
    The relationship between the knowledge and comprehension scores
  Distributions of Difficulty Indices
  Distributions of Discrimination Indices
  The Statistical Test of the Hypothesis
VI. CONCLUSIONS AND SUMMARY
  Weaknesses
  Conclusions
  Possibilities of Further Research
  Summary of the Investigation
BIBLIOGRAPHY
APPENDIX A. Content-Specifications for Test (Condensed)
APPENDIX B. Test-Directions (Excerpts)
APPENDIX C. Reproduction of the Final Test
APPENDIX D. Table XI: Taxonomy Category, Difficulty Index, and Discrimination Indices (Point-Biserial and Biserial) for all Items

LIST OF TABLES
TABLE
I. Distribution of Scholastic Aptitude Scores in Two Groups of Students Used for Tryout Tests
II. Distribution of Difficulty Indices Computed from Tryout Tests, Grouped by Relevance Category
III. 
Distribution of Difficulty Indices in the Knowledge and Comprehension Categories of the Final Test (Frequency and Cumulative Frequency)
IV. Values and Ranks of the Difficulty Indices in the Knowledge and Comprehension Categories
V. (a) Distribution of Unit I Items by Content Category in the Two Final Subtests
   (b) Distribution of Unit II Items by Content Category in the Two Final Subtests
VI. Distribution of Knowledge, Comprehension, and Total Scores in Final Tests
VII. Distribution of Transformed Difficulty Indices in the Knowledge and Comprehension Categories
VIII. Distribution of Discrimination Indices of Knowledge and Comprehension Items
IX. Value and Ranks of Discrimination Indices of Items in Knowledge and Comprehension Categories
X. Content-Specifications for Test (Condensed)
XI. Taxonomy Category, Difficulty Index, and Discrimination Indices (Point-Biserial and Biserial) for all Items

LIST OF FIGURES
FIGURE
1. Cumulative Distribution of Transformed Difficulty Indices of Items Selected for the Final Knowledge and Comprehension Categories
2. Distribution of Final Scores of 530 Students on the Final Test
3(a). Distribution of Knowledge-Scores of 530 Students on the Final Test
3(b). Distribution of Comprehension-Scores of 530 Students on the Final Test
4. Cumulative Distributions of Transformed Difficulty Indices of Items in the Knowledge and Comprehension Categories Obtained from the Administration of the Final Test
5(a). Distribution of Discrimination Indices for Knowledge Items
5(b). 
Distribution of Discrimination Indices for Comprehension Items

CHAPTER I
THE PROBLEM AND HYPOTHESIS

In recent years, there have been several attempts, for example, those of Bloom and his associates,1 Ebel,2 and Gerberich,3 to classify the objectives of cognitive learning in what have been called \"relevance categories.\" These categories are arranged in a kind of hierarchy intended to reflect increasing complexity of behaviours. To illustrate, the simplest of these categories usually deals with simple recall of information, while the more complex ones involve behaviours such as the ability to apply principles in unfamiliar situations. Of these systems of classification, that proposed by Bloom and his co-workers, in the Taxonomy of Educational Objectives,4 is the most detailed; it is the one in terms of which the problem of this study is stated, and which was used throughout. For simplicity of expression, categories involving simpler behaviours will be referred to as \"lower categories\" and those involving more complex behaviours as \"higher categories.\"

A major purpose of such classifications and of the specifications which accompany them is to aid the test constructor in the preparation of items which will serve to sample behaviours of the degrees of complexity represented by the various categories.

1 Benjamin S. Bloom, et al., Taxonomy of Educational Objectives, Handbook I: Cognitive Domain (New York: Longmans, Green & Co., 1956).
2 Robert L. Ebel, \"How an Examination Service Helps College Teachers to Give Better Tests,\" Proceedings of the Invitational Conference on Testing Problems (Princeton, N.J.: Educational Testing Service, 1953).
3 J. Raymond Gerberich, Specimen Objective Test Items: A Guide to Achievement Test Construction (New York: Longmans, Green & Co., 1956).

I. THE PROBLEM

Statement of the problem. 
It was assumed that the categories of the Taxonomy do represent the different degrees of complexity of behaviour which they purport to represent; and, in general terms, it was the purpose of this investigation to discover whether, under restrictions to be specified, the magnitudes of the discrimination indices of test items sampling behaviours in different relevance categories reflect differences in complexity assumed to exist among the categories.

Reasons for supposing that such differences in discrimination indices might be found lay in the following considerations, which formed the basis of the hypothesis presented below. On a course whose objectives in the cognitive field may be classified in terms of relevance categories of increasing complexity, it is generally assumed that the best student is the one who obtains the highest score on a test. The items of the test are constructed to probe for all of the stated objectives, in so far as these are measurable. On such a test, it would appear that while the top students are likely to obtain comparatively high scores within all of the relevance categories involved, the performance of the poorer students may be expected to be less uniform. That is to say, they are likely to perform even more poorly on items sampling the more complex categories than on items sampling the less complex ones. It would then seem to follow that the characteristic which distinguishes the poor student from the good student should be not only his general inability to perform well in any of the categories, but also the fact that he will be likely to do even more poorly on higher-category items than on lower-category ones. Thus, it was anticipated that items in the more complex relevance categories would exhibit higher indices of discrimination than would those in the less complex categories, under restrictions specified below.

Statement of the hypothesis. 
Given a self-defining achievement test with equal numbers of items sampling behaviours in the \"Knowledge,\" \"Comprehension,\" and \"Application\" categories of the Taxonomy, it was hypothesized that the magnitudes of the discrimination indices of the items, based on the whole test as criterion, would tend to increase from the first- to the last-named category, provided that mean item-difficulty were kept constant from category to category, and the difficulty indices were similarly distributed within each category.

II. DEFINITION OF TERMS

Self-defining achievement test. This term was interpreted as meaning a test of achievement in a given subject field, whose claim to validity rests upon the care with which it was constructed to reflect the judgment of authorities regarding the content and measurable objectives of the course, and which thus constitutes its own criterion-measure.5

Relevance. The definition of this term contained in the Dictionary of Education was adopted, namely, \". . . the quality on the part of a test of measuring a function that it is used to measure; the more nearly the behaviour elicited by the test approximates the criterion behaviour, the more relevant the test.\"6

5 Adapted from Frederick B. Davis, \"Item Selection Techniques,\" Educational Measurement, E. F. Lindquist, editor (Washington: American Council on Education, 1951).
6 Carter V. Good (ed.), Dictionary of Education (second edition; New York: McGraw-Hill Book Co., 1959), p. [illegible].

Relevance category. In extension of the foregoing definition of \"relevance,\" any system of categorizing by degree of complexity the intended behaviours constituting the measurable objectives of a course was considered to result in a set of \"relevance categories.\" In particular, the categories employed in this study were the \"Knowledge,\" \"Comprehension,\" and \"Application\" categories of the Taxonomy. 
Fuller discussion is provided in Chapter II, where the relevant literature is surveyed. For the present, it will suffice to note that the authors of the Taxonomy consider these categories to constitute a hierarchy of learning-outcomes, arranged from simplest to most complex in terms of the behaviours involved.7

Discrimination index. Following Davis, this term was taken to mean a measure of \". . . the value of each test item for making the test in which it is included rank examinees accurately with respect to a defined criterion variable.\"8 In this study, the interest was in a measure of \"internal consistency discrimination,\" that is, one in which the total test scores are used as the criterion. Of the many indices in common use for this purpose, the point-biserial correlation coefficient, r_pb, was selected. Further discussion of its choice for use in this investigation is deferred until the hypothesis is discussed in Section III of this chapter.

Item difficulty. The definition adopted was that of Davis, namely, \". . . the per cent of the tryout group that marks it correctly.\"9 However, as will be seen in Chapter IV, a transformation, suggested by Davis, was applied to the difficulty indices to try to produce a scale approximating an interval scale.

7 Bloom, et al., op. cit., p. 18.
8 Frederick B. Davis, \"Item Selection Techniques,\" Educational Measurement, E. F. Lindquist, editor (Washington: American Council on Education, 1951), 285.
9 Ibid., p. 267.

III. DISCUSSION OF THE HYPOTHESIS

It will have been remarked that the statement of the hypothesis required that mean difficulty be kept constant from category to category, and that item difficulty be similarly distributed within each category. This means that within all categories, the distributions of difficulty indices must have equal means and variances, and also have the same form. 
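As an illustration of the two item statistics defined in Section II, the following is a minimal sketch, not taken from the thesis, of how an item's difficulty index and its point-biserial discrimination index might be computed with the total test score as criterion. The function names and the six-examinee data set are hypothetical and serve only to make the definitions concrete.

```python
import math


def item_difficulty(item_scores):
    # Per cent of examinees who mark the item correctly (1 = correct, 0 = wrong).
    return 100.0 * sum(item_scores) / len(item_scores)


def point_biserial(item_scores, total_scores):
    # Product-moment correlation between a 0/1 item score and the total score.
    n = len(item_scores)
    mean_x = sum(item_scores) / n
    mean_y = sum(total_scores) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(item_scores, total_scores))
    var_x = sum((x - mean_x) ** 2 for x in item_scores)
    var_y = sum((y - mean_y) ** 2 for y in total_scores)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: six examinees, one item, and their total test scores.
item = [1, 1, 1, 0, 0, 0]
total = [38, 35, 30, 28, 22, 15]

difficulty = item_difficulty(item)
discrimination = point_biserial(item, total)
```

With these made-up data half of the examinees answer the item correctly, so the difficulty index is 50 per cent, and the point-biserial coefficient comes to about .82.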
The reason for including these restrictions was the familiar one of attempting to control an irrelevant variable which would otherwise exert its own influence on the dependent variable and thus lead to a confounding of influences.

The way in which differences in the mean level of difficulty from category to category could affect the size of the discrimination indices of the items in those categories may be explained in the following way. If it be assumed, for illustrative purposes, that of all the items in a test the application-items were much more difficult than the others, then the scores obtained on those items would be lower than those obtained on the lower-category items. They would thus constitute a smaller fraction of the total or criterion score. The items most nearly measuring what the test as a whole was measuring would be those most heavily represented in the criterion scores. Thus, the lower-category items and not the application-items would account for most of the variance of the total scores. In other words, as a result of their greater difficulty, the application-items might be expected to exhibit smaller discrimination indices. From this argument it follows that if item difficulty were effectively to be prevented from introducing an extraneous influence on the mean discrimination indices of the three relevance categories, the items in all categories must have about the same distribution of difficulty indices.

In the early stages of the planning of this study, before the point made in the foregoing paragraph became evident, a different manifestation of the effect of difficulty on discrimination indices was noted. 
This concerned the fact, illustrated graphically by Thorndike,10 that the maximum size of most kinds of discrimination indices in common use rises from zero to unity and then decreases again to zero as item difficulty changes from zero through fifty to one hundred per cent. To minimize this effect, it had been decided to use biserial-r as a measure of discriminative power. This was because, as noted by Adams, biserial-r is theoretically independent of difficulty level when the criterion variable is normally distributed, and nearly so even when moderate degrees of non-normality are present.11 However, once the principle had been adopted of controlling difficulty by ensuring that the three categories of items had about the same distribution of difficulty, the need for using a measure of discrimination which is independent of difficulty vanished. The question of which index of discrimination to use was therefore reconsidered.

10 Robert L. Thorndike, Personnel Selection (New York: John Wiley and Sons, Inc., 1949), 229.
11 James E. Adams, \"The Effect of Non-Normally Distributed Criterion Scores on Item Analysis Techniques,\" Educational and Psychological Measurement, XX (No. 2, Summer, 1960), 319.

What was required was an index which would use all of the information provided by the test scores, rather than only that given by an upper and a lower group. The two correlation coefficients, the biserial and the point-biserial, both share this property. It was also considered desirable to select an index to which parametric statistical tests of significance could be applied, since non-parametric tests lack power; but even the basic operation of averaging cannot be applied directly to correlation coefficients. 
Davis notes:

It is well known that it becomes more and more difficult to raise a correlation coefficient by a certain number of hundredths as perfect correlation is approached. A difference of .05 between correlation coefficients of .90 and .95 represents a far greater difference in relationship than a difference of .05 between coefficients of .05 and .10. This is not a serious limitation for item analysis purposes, but in the case of product-moment coefficients it is so easily removed that there is no reason to tolerate it if it works any inconvenience.12

The usual procedure, for product-moment correlation coefficients, is to apply to them Fisher's z-transformation, or a linear function thereof, and to proceed thereafter on the assumption that something approaching an interval scale has been attained. Peters and Van Voorhis state that unit increments of Fisher's-z have nearly the same meaning, all along their range, in terms of difficulty of attaining them. This fact makes adding, subtracting, or averaging Fisher's-z's more legitimate processes than are like processes with correlation coefficients.13

However, biserial-r is not a product-moment coefficient. The only transformation which has been developed for use with it is that of Tate, which is applicable only when the proportion of students responding correctly to the item is near .50, and when the population-value, ρ, is not near plus or minus one.14

12 Davis, op. cit., p. 299.
13 Charles C. Peters and W. R. Van Voorhis, Statistical Procedures and their Mathematical Bases (New York: McGraw-Hill Book Co., 1940), 156-57.
14 Helen M. Walker and Joseph Lev, Statistical Inference (New York: Henry Holt and Co., Inc., 1953), 271. 
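The point made by Davis and by Peters and Van Voorhis can be checked numerically. The following minimal sketch, which is not part of the thesis, applies Fisher's z-transformation to the coefficients mentioned in the quotation and shows one way of averaging coefficients on the z-scale. The function names are hypothetical, and math.atanh is used on the assumption that the transformation has its usual form, z = 0.5 ln((1 + r) / (1 - r)).

```python
import math


def fisher_z(r):
    # Fisher's z-transformation; atanh(r) equals 0.5 * ln((1 + r) / (1 - r)).
    return math.atanh(r)


def mean_r_via_z(coefficients):
    # Average on the z-scale, then transform the mean back to a correlation.
    mean_z = sum(fisher_z(r) for r in coefficients) / len(coefficients)
    return math.tanh(mean_z)


# The same raw difference of .05 at the two ends of the scale.
high_gap = fisher_z(0.95) - fisher_z(0.90)
low_gap = fisher_z(0.10) - fisher_z(0.05)
```

On the z-scale the step from .90 to .95 is roughly 0.36, about seven times the step from .05 to .10 (roughly 0.05), which is the inequality Davis describes.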
The point-biserial coefficient, on the other hand, \"is a product-moment coefficient of correlation and can be used with, and under the same conditions as, the usual product-moment coefficient.\"15 This coefficient is a legitimate and frequently used index of item-discrimination, with which the use of the Fisher transformation would seem justified. Unfortunately, its purity as an index of discriminating power is impaired by the fact that its maximum size is restricted for very easy and very hard items. As noted earlier, given a similar distribution of item-difficulty within each relevance category, this restrictive effect should apply equally to all categories. However, the possibility had to be contemplated that the final difficulty-indices might differ somewhat from those obtained on the tryout group. Under such a possibility, the \"purer\" biserial-r would be the preferred index, despite the fact that non-parametric methods would have to be used for the test of the hypothesis. The choice of which index to use thus presented a problem, complicated by the thought that the final distributions of discrimination indices might so grossly violate the assumptions of normality and homogeneity of variance as to forbid, in any event, the use of parametric methods. The decision finally taken was to employ the point-biserial coefficient and z-transformation, in the hope that control of difficulty would be adequate and that the greater efficiency of the parametric tests could be employed. Results using biserial-r would be reported for comparison.

15 Palmer O. Johnson and Robert W. B. Jackson, Modern Statistical Methods: Descriptive and Inductive (Chicago: Rand McNally and Co., 1959), 366.

Finally, in this discussion of the hypothesis, it is to be noted that if the number of items were permitted to vary from category to category, this inequality of numbers might be expected to have a direct effect upon 
the magnitudes of the discrimination indices. Items belonging to the relevance category with the greatest number of items in the test would be likely to have higher discrimination indices than if all categories had been equally represented. It was decided, therefore, to make the number of items in each category the same.

IV. IMPORTANCE OF THE STUDY

This study is essentially an attempt to add to knowledge within a restricted area of test-theory, namely, the behaviour of item-discrimination indices under specified conditions. Much is already known, for example, regarding the effect upon indices of item-discrimination of such variables as the difficulty level of the item, the heterogeneity of content of the test, and the distribution of the criterion variable, that is, whether it is normally distributed, leptokurtic, platykurtic or skewed. Such knowledge is of value to those who are concerned with the construction of tests, and to the professional test-builder in particular. The present study is an attempt to extend what is known in this area by clarifying the relationship which may exist between relevance category and indices of item-discrimination.

All of the practical values attached to a discovery in the area of theory do not generally become immediately apparent; but if the present hypothesis is confirmed, at least one possibility of application becomes evident. It is frequently desired to select from a group of examinees the outstanding candidates. This may be because one wishes to determine scholarship winners or members of some such special group. A test used for such a purpose must discriminate with maximum accuracy at such a level of difficulty that the desired percentage of candidates can be selected. 
For a test to possess such discrimination, its items should all be at the level of difficulty corresponding to the cut-off point, as has been shown by Richardson;16 in addition, of course, of the available items at that level of difficulty, those with the highest discrimination indices would be chosen. If it were known in advance that higher-category items tend to have higher discrimination indices than do lower-category items, the test constructor could concentrate on the production of these more complex items, with the knowledge that they are the ones most likely to be effective for the purpose stated. He should, however, avoid overloading the test with such items, since if he did this his test would lose validity through its failure to sample adequately all of the objectives of the course. It would seem preferable to administer a battery of tests, each designed to measure achievement at a different level of complexity. If, however, administrative time is limited (and it usually is), so that several sub-tests, each long enough to possess adequate reliability, cannot be given, then it is well to know how best to build a single composite test. For such a purpose it is useful to know what kind of items are likely to be most effective.

Further justification for this study arises from the following consideration: if items sampling more complex behaviours do discriminate better than those sampling less complex behaviours, they will increasingly be used by test constructors. The cause of education would benefit from such a trend, since teachers often teach only toward objectives which they know will be tested.

16 Davis, op. cit., p. 309.

V. 
ORGANIZATION OF REMAINDER OF THE THESIS

Chapter II surveys the literature pertaining to the categorization of educational objectives by degree of complexity of intended behaviours, and also provides a critical discussion of the one study which was conducted to test an hypothesis similar to that of the present study. In Chapter III, the problem is delimited and the general plan of procedure outlined. The preparation of items, their classification into relevance categories and subsequent assembly into tryout tests are discussed in Chapter IV. This chapter also treats the administration of the tryout tests and the preparation of the final tests on the basis of information provided by the tryout tests. In particular, the statistical equivalence of the relevance categories, in terms of the difficulty indices of the items, is demonstrated. The data obtained from the tests are presented in Chapter V. Test reliability is discussed here, together with the correlations obtained between pairs of categories. The test of the hypothesis is also described in this chapter. The results and conclusions are summarized in Chapter VI, which also discusses weaknesses that must be considered in evaluating the conclusions.

CHAPTER II
REVIEW OF THE LITERATURE

I. LITERATURE ON CATEGORIZING EDUCATIONAL OBJECTIVES

Reference was made in the last chapter to the Taxonomy of Educational Objectives, developed by Bloom and his associates, and to its selection as the system used in this study for classifying test items into relevance categories. The Taxonomy, which was the product of the deliberations of over thirty workers in the behavioural sciences, was based on the belief that educational objectives, stated in behavioural terms, have their counterparts in the behaviour of individuals, that such behaviours can be observed and described, and that these descriptive statements can be classified.1 
1 While educational considerations were given priority i n deciding how behaviours should be categorized, logical and psychological factors were kept constantly i n mind. With regard to this latter point, the authors note: One may take the Gestalt point of viexv that complex behavior i s more than the sum of the simpler behaviors, or one may view the complex behavior as being completely analyzable into simpler components. But either way, so long as the simpler behaviors may be viewed as components of the more complex behaviors, we can view the educational process as one of building on the simpler behavior. . . In order to f i n d a single place for each type of behavior, the taxonomy must be organized from simple to complex classes of behavior. 2 ^Benjamin S. Bloom, et a l . , Taxonomy of Educational Objectives Handbook I: Cognitive Domain (New York: Longmans, Green & Co., 1956), 5. 2Ibid., p.l6. 13 In empirical justification of their claim that their system of classification does have this simple-to-complex order, the authors note that evidence has been found suggesting that test items sampling the more complex categories are more d i f f i c u l t than those sampling the simpler 3 categories. They further note that i t is more common for students to have high scores on lower-category items and low scores on higher-category items than the reverse. They conclude the discussion with the statement: \"Our evidence on this i s not. entirely satisfactory, but there i s an unmistakable trend pointing toward a hierarchy of classes of behavior which is i n accordance with our present tentative classification of these k behaviors.\" Mention has been made of the care taken by the authors of the Taxonomy to justify their classification upon psychological, l o g i c a l , educational,, and empirical grounds. 
Apart from this, the major argument for its selection for use in this study was its detailed description and exemplification of the behaviours classified in each category. Such detail helped to ensure maximum accuracy of classification of the test items into the relevance categories which were basic to this investigation.

It would be difficult to communicate fairly, in small compass, the essential distinctions made by the authors of the Taxonomy among the selected categories, but it was felt that some attempt should be made to provide a brief description of each. The behaviours included in the Knowledge category involve \"little more than bringing to mind the appropriate material,\"⁵ though in a knowledge-test situation the subject must find in the problem the clues and cues needed to stimulate recall of the relevant material.⁶ The Comprehension category is divided into three parts labelled \"Translation,\" \"Interpretation,\" and \"Extrapolation,\" themselves forming a hierarchy. Given an unfamiliar communication, which may be in verbal, symbolic or pictorial form, the examinee displays translation-behaviours when he demonstrates understanding of its literal meaning by rendering it at a different level of abstraction.⁷ If he goes beyond this to draw inferences from it or to make generalizations, then he is \"interpreting.\"⁸ If he displays sufficient understanding of trends and conditions described in the communication to enable him to make accurate predictions, he is \"extrapolating.\"⁹

³ Ibid., pp. 18-19.
⁴ Ibid., p. 19.
⁵ Ibid., p. 201.

The essence of the definition supplied for application-behaviours seems to be the following:

A problem in the comprehension category requires the student to know an abstraction well enough so that he can correctly demonstrate its use when specifically asked to do so. \"Application,\" however, requires a step beyond this.
Given a problem new to the student, he will apply the appropriate abstraction without having to be prompted as to which abstraction is correct or without having to be shown how to use it in that situation.¹⁰

This brief outline summarizes the definitions and distinctions which, together with many illustrative test items, formed the criteria used to classify according to relevance category the items of the test used in this investigation.

⁶ Ibid.
⁷ Ibid., p. 91.
⁸ Ibid., p. 90.
⁹ Ibid.
¹⁰ Ibid., p. 120.

II. LITERATURE SPECIFICALLY RELATED TO THE PRESENT HYPOTHESIS

With the implicit purpose of attempting to validate the concept of relevance category, Cook¹¹ conducted an investigation. One of his hypotheses is similar to that of the present study. Specifically, he hypothesized that item-difficulty and item-discrimination indices would show an increase from the simplest to the most complex of several relevance categories. He was, however, unable to accept either of these hypotheses. Indeed, the mean discrimination indices showed a slight tendency to decrease with increasing complexity of relevance category.

In view of Cook's failure to substantiate the hypothesis relating to the discrimination indices, it is proposed to discuss his methodology in some detail and to point out certain weaknesses which constitute more than adequate grounds for taking a fresh look at the problem.

Cook selected from the files of the Iowa University Examinations Service a sample of tests each of which had already been item-analyzed on the basis of total test score, and whose items had previously been classified as to relevance category. The item statistics obtained were based upon Johnson's Upper-Lower method.¹² The tests were of the objective type, and had been built as classroom tests by instructors in ten different subject fields.
Cook grouped together items belonging to each of six relevance categories, using the system of classification proposed by Ebel.¹³ He hypothesized, as already stated, that the mean discrimination and difficulty indices for the several categories would show an increase from the simplest to the most complex category, and failed to establish either hypothesis using analysis of variance procedures.

¹¹ Desmond L. Cook, \"A Note on Relevance Categories and Item Statistics,\" Educational and Psychological Measurement, XX (No. 2, Summer, 1960).
¹² A. Pemberton Johnson, \"Notes on a Suggested Index of Item Validity: The U-L Index,\" Journal of Educational Psychology, 42 (1951), 500.

Cook points to several weaknesses which might have affected his findings. He notes: \"The reliability of the item analysis data, the intra-individual reliability of item classification, and the variation in number of items in each category operate to limit the results.\"¹⁴ The small number of cases, sixteen, in his simplest category, the possible inadequacy of the instructors as test makers, and the existence of \"good\" students who study details and \"poor\" students who don't, are advanced as further reasons for cautious interpretation of the results.

Certain additional weaknesses are apparent, however. First, the tests had been constructed by different instructors and covered ten subjects ranging from Air Science through Elementary Nutrition to Western Civilization. A finding of McConnell,¹⁵ later confirmed by Furst,¹⁶ bears directly upon this point. Their investigations involved correlating scores obtained by students on simple-recall items with those obtained by them on items relating to application of principles. The investigations were conducted both within and across several subject fields. In the words of McConnell, \". . . the differences between the outcomes within the same subject seem to be smaller than those between the subjects within the same outcome.\"¹⁷ This would imply that, for example, application-items covering a variety of subject fields should exhibit less similarity than should items sampling a variety of objectives within a single subject field. It would seem, therefore, that the items in each of Cook's relevance categories must be heterogeneous rather than homogeneous with respect to the behaviours for which he was probing.

¹³ Robert L. Ebel, \"How an Examination Service Helps College Teachers to Give Better Tests,\" Proceedings of the Invitational Conference on Testing Problems (Princeton, N.J.: Educational Testing Service, 1953), 8.
¹⁴ Cook, op. cit., p. 327.
¹⁵ T. R. McConnell, \"A Study of the Extent of Measurement of Differential Objectives of Instruction,\" in An Appraisal of Techniques of Evaluation: Symposium (Washington: Amer. Educ. Research Assoc., National Educ. Assoc., February, 1940).
¹⁶ Edward J. Furst, \"Effect of the Organization of Learning Experiences upon the Organization of Learning Outcomes,\" Journal of Experimental Education, 18 (1954).

The second of these additional weaknesses is that the tests were item-analyzed on different groups whose ability levels are unlikely to have been the same, so that Cook's indices of difficulty and discrimination can scarcely be considered comparable from test to test.

A third weakness lies in Cook's implicit assumption that his difficulty and discrimination indices are mutually independent, which is not the case. As previously stated, Cook used as his index of discrimination the U-L Index of Johnson, which is defined by the formula

U-L Index = $(R_U - R_L) / (.27N)$,

where $R_U$ and $R_L$ represent the number of correct responses made to an item by the upper and lower 27% of the students, respectively, and $N$ represents the total number of students.
The effect of difficulty level upon the maximum values which can be assumed by the U-L Index may readily be demonstrated algebraically. In particular, if it is assumed that the index of difficulty of an item is represented by the symbol $x$, where $x$ is less than or equal to .27, then $xN$ of the students must have responded correctly; for maximum discrimination, all of these must be in the upper 27%, so that $R_U = xN$ and $R_L = 0$. Hence,

U-L Index (max) = $(xN - 0) / (.27N) = (1/.27) x$.

It is therefore clear that for items with difficulty indices between 0 and .27, the maximum possible value of the discrimination index is a linear function of the difficulty index, decreasing from unity to zero as the difficulty index varies from .27 to zero, that is, as the items become more difficult. It thus appears that it is impossible for U-L indices to continue to increase as item difficulty approaches maximum. Cook's hypotheses are thus evidently mutually incompatible; if his hypothesis of increasing difficulty with increasing complexity of relevance category were true, then his hypothesis of increasing discrimination could not be expected to hold. To test the latter hypothesis, it becomes necessary to hold difficulty level constant from category to category. It is perhaps worth repeating what was demonstrated in Chapter I on this point; namely, that it is insufficient merely to employ a discrimination index independent of difficulty.

¹⁷ McConnell, op. cit., p. 7.

This critical discussion of the study most pertinent to the present investigation concludes the survey of the relevant literature.
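The ceiling which item difficulty places on the U-L Index can be checked numerically. The Python sketch below is an illustration only: the function names are hypothetical, and the extension of the ceiling to difficulty indices above .27 (where correct answers must eventually spill into the lower group) is an assumption that goes beyond the case treated in the text.

```python
# Illustrative sketch only; the function names are hypothetical.

GROUP = 0.27  # proportion of students in each of the upper and lower groups


def ul_index(upper_correct, lower_correct, n_students):
    # U-L Index = (R_U - R_L) / (.27 N)
    return (upper_correct - lower_correct) / (GROUP * n_students)


def max_ul_index(difficulty):
    # difficulty = proportion of all students answering the item correctly.
    # Best case (an assumption for values above .27): correct answers fill
    # the upper group first and reach the lower group only after every
    # student outside the lower group has answered correctly.
    best_upper = min(difficulty, GROUP)
    forced_lower = max(0.0, difficulty - (1.0 - GROUP))
    return (best_upper - forced_lower) / GROUP


if __name__ == '__main__':
    # For x between 0 and .27 the ceiling rises linearly as x / .27.
    for x in (0.0, 0.09, 0.18, 0.27):
        print(x, round(max_ul_index(x), 3))
```

For difficulty indices between 0 and .27 the function reduces to x/.27, the linear relation derived above, so a very difficult item cannot reach a high U-L value whatever its quality.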
CHAPTER III

DELIMITATION OF THE PROBLEM

Reference to the hypothesis indicates that a self-defining achievement test was required, and that its items had to be classifiable, with respect to educational objectives, into one or another of the Knowledge, Comprehension, and Application categories of the Taxonomy. It was further required that the number of items should be the same from category to category and that the items should have approximately the same distribution of difficulty within each category. It was evident that no standardized, commercial test could be obtained which would meet all of these specifications, so that a test had to be constructed.

A number of decisions had to be taken prior to constructing the test, however. These concerned the grade-level and content of the test, the number and type of items to be used, and whether it was to be a power test or one which was somewhat speeded. In addition, since arrangements had to be made before the beginning of the school year with those schools in which the test was to be administered, some consideration had to be given to the size and nature of the sample of students to be used. Such preliminary decisions are discussed in this chapter.

I. GRADE-LEVEL AND CONTENT OF TEST

It was decided to restrict the study to two teaching-units in the field of science at the grade nine level. In passing, it may be noted that the objectives of science-teaching, in the cognitive field, as prescribed for the public schools of British Columbia, include those quoted below from the relevant curriculum bulletin:

1. To acquire useful facts and information concerning the environment and to develop functional concepts and an understanding of scientific principles. . . .
2. To acquire an appreciation of the scientific or problem-solving method and to develop the ability to use it. . . .
3. To acquire instrumental skills such as the following: (a) Reading science content with understanding. . . .¹

It thus appears that a valid test of achievement in this area should contain items probing for behaviours classifiable as Knowledge, Comprehension, and Application, as previously defined.

A single grade-level was chosen for a variety of reasons. Among these was the fact that it is not known whether the relationship among the behaviours represented by the various relevance categories is constant from one educational level to another; it is conceivable that they may show relatively slight relationship at one level but be more closely integrated at another. Another reason was the difficulty, if not impossibility, of constructing a test which would have content-validity and relevance for a variety of grades.

A single subject was chosen in view of the conclusion of McConnell and of Furst, already noted, to the effect that, for example, the ability to apply principles does not seem to be a general ability, but is rather specific to a subject field. The restriction of the study to the two units of the course dealing with the properties of matter and energy, from the points of view of chemist and physicist, narrowed the field still further.

¹ British Columbia Department of Education, The Sciences (Victoria, B.C.: The Queen's Printer, 1956), 9.

It has been noted by Bloom that \"We can only distinguish between behaviors as we analyze the relation between the problem and each student's background of experience.\"² This provides further justification for the choice of grade nine science, designated \"Science 10\", inasmuch as the writer's familiarity with the course and with the preceding courses in science could be expected to aid him in making valid judgments when categorizing the items.
While no one can know every student's background of experience, an observation of Dressel and Nelson, who classified a very large number of items, is pertinent:

The classification of any item under a particular objective is a fallible judgment, but nevertheless one upon which a fair degree of agreement has been achieved by individuals who have independently undertaken to classify the same group of items.³

To endeavour to reduce the number of possible errors in classifying items, the co-operation of two science specialists, one of whom was familiar with the Taxonomy, was enlisted, as will be described in Chapter IV.

II. DEGREE OF SPEEDEDNESS OF THE TEST

It was decided that the test should be essentially a power test since, to whatever extent a test is speeded, both difficulty and discrimination indices will be affected. Speed thus becomes an extraneous variable to be controlled.

² Benjamin S. Bloom, et al., Taxonomy of Educational Objectives, Handbook I: Cognitive Domain (New York: Longmans, Green & Co., 1956), 16.
³ Paul L. Dressel and Clarence H. Nelson, Questions and Problems in Science: Test Item Folio No. 1 (Princeton, N.J.: Co-operative Test Division, Educational Testing Service, 1956), viii.
⁴ Joy P. Guilford, Psychometric Methods (second edition; New York: McGraw-Hill Book Co., 1954), 436-37.

This effect of speeding is discussed by Guilford.⁴ He notes that toward the end of a test in which speed is a significant factor, the number of items omitted increases, so that the proportion of students \"passing\" the later items is lowered. Thus, there is an apparent increase in difficulty, not necessarily because these items are inherently much more difficult, but because fewer students have had the opportunity to try them.
And this situation would not be improved by defining the difficulty index as \"the ratio of those passing to those attempting an item,\" since in this case the later items would appear easier in virtue of the fact that it is only the faster, and in general more able, students who would attempt them.

Again, if a test is timed so as to permit only the more able students to finish it, the discrimination indices of the later items are likely to be higher than if all students were permitted to finish. This is because most of the students who reach these items would do quite well while the poorer students, who fail to reach them, would be the ones counted wrong. This argument would not, of course, apply to a pure speed-test. It does, however, apply to a test in which speed and the kind of abilities being tested are positively correlated; and this is the usual situation in testing achievement in schools.

All things considered, then, the decision was taken to time the subtests so that not more than three per cent of the examinees failed to finish.

III. TYPE AND NUMBER OF ITEMS

In constructing objective test items, it is often found that a desired response can more readily be elicited by casting the item in one form rather than another, for example, in \"matching\" rather than \"multiple-choice\" form. Consequently, some tests employ a variety of item-types. In the present situation, however, it was believed that if more than one type were permitted, then the variation in the proportion of each type which would be likely to occur from one relevance category to another might introduce an unknown factor into the problem. For a similar reason it was decided to have the same number of options for each item. Accordingly, only one type, the multiple-choice item with four options, was permitted.
This choice was dictated mainly by the fact that of all item-types the multiple-choice item appears to be the one which can serve adequately in the widest variety of situations. Ebel notes that it can \"be used with great skill and effectiveness to measure complex abilities and fundamental understandings.\"⁵ Furst stresses its relative freedom from response-sets.⁶ Regarding the number of options, it was felt that while five would be preferable, as tending to minimize the effect of guessing, four would be a more practical aim, since in many cases it is difficult to produce additional distracters which are really functional.

The decision on the number of items to use was influenced by both theoretical and practical considerations. It is well known that, other things being equal, increasing the number of items tends to increase the reliability of a test. In addition, while the concept of \"degrees of freedom\" makes possible the use of small samples in statistical tests of significance, studies like those of Norton⁷ and Boneau⁸ indicate that the effect on the F and t distributions of any violations of the underlying assumptions is less for large than for small samples. Further, as a result of the fact that each item is part of the criterion with which it is to be correlated, an element of spurious correlation is introduced: this will be reduced as the number of items is increased.

⁵ Robert L. Ebel, \"Writing the Test Item,\" Educational Measurement, E. F. Lindquist, editor (Washington: Amer. Coun. on Educ., 1951), 196.
⁶ Edward J. Furst, Constructing Evaluation Instruments (New York: Longmans, Green and Co., 1958), 252.
⁷ E. F. Lindquist, Design and Analysis of Experiments in Psychology and Education (Boston: Houghton Mifflin Co., 1953), 79-83.
⁸ C. Alan Boneau, \"The Effects of Violations of Assumptions Underlying the t Test,\" Psychological Bulletin, 57 (No. 1, 1960), 49-64.
For these three reasons, therefore, it was decided to use as many items as maintenance of item quality and other practical considerations would permit. In an effort to set a lower limit to this number, it was recalled that constructors of achievement test batteries commonly produce reliability coefficients of around .85 to .95 for their subtests by using from about forty to eighty items, the majority containing fifty or more. Accepting these figures as a rough guide, it was felt that an attempt should be made to obtain at least fifty items for each relevance category.

However, limitations imposed by the practical situation had also to be considered. Of these, the most serious was the time available for testing, a factor which was aggravated by the fact that test items sampling the more complex categories of behaviours require more testing time per item than do simple-recall items. On the optimistic assumption that an average time of one minute per item would suffice, it was clear that a test containing one hundred fifty items could not be administered in fewer than three fifty-minute periods. Since more testing time than this could not reasonably be hoped for, it became apparent that fifty items in each of the three categories was the maximum that might be achieved.

IV. NATURE AND SIZE OF SAMPLE OF STUDENTS

Careful scrutiny of the selected portion of the Science 10 course indicated that variations in students' experiential background resulting from such factors as geographical location would be unlikely to exert much direct influence on their performance. The portion which was tested dealt with the properties of matter, physical and chemical changes, energy-transformations and various forms of energy, liberally illustrated with laboratory demonstrations.
Had a later part of the course been chosen, for example, that dealing with elementary geology and mineralogy, or with adaptations among wild animals, it would have been considered necessary to allow for factors such as urban-rural proportions in the sample. As it was, it was considered likely that the two major factors influencing results would be (a) the ability of the students in the field of science, and (b) the quality of teaching.

With regard to the second of these, little could be done in view of the well-known difficulty of measuring teacher-effectiveness; not only is there a severe criterion-problem, but there also exists an ethical difficulty which restricts attempts to locate the \"best teachers\" for experimental purposes. Uniformity of teaching-quality could, therefore, only be hoped for.

An adequate measure of science-ability, particularly with regard to the more complex kinds of behaviours, was not available. It was felt, therefore, that a measure of scholastic aptitude was the next best criterion that could be used for this purpose. Efforts were made to secure from the provincial Division of Tests, Research and Standards information regarding the distribution of scholastic aptitude in the grade nine university-programme population. These proved unavailing. The assumption was then made that a fairly large sample of students from several schools in the southern part of the province could be considered to be a random sample from the population of students in the province.

No evidence was available on the size of sample desirable, on either theoretical or practical grounds, to ensure a high degree of stability for internal-consistency discrimination indices, though Conrad regards four to five hundred students as appropriate.⁹ Accordingly, it was arbitrarily decided that a minimum of five hundred students should be used.

⁹ Herbert S. Conrad, \"The Experimental Tryout of Test Materials,\" Educational Measurement, E. F. Lindquist, editor (Washington: Amer. Coun. on Educ., 1951), 252.

Six schools were approached, with a total of nearly one thousand grade nine university-programme students. Of these schools, no reply was received from one, arrangements broke down with another over the timing of the test, and one small school failed to send in its results sufficiently early for them to be of use. The remaining three schools provided the sample of five hundred thirty cases on which the final results were based. These schools were the junior high schools at Trail, Kelowna and Sardis in British Columbia.

V. THE TRYOUT PROBLEM

The decisions reported to this point concern the test in its final form, that is to say, the test with equal numbers of items and equivalence of difficulty-level in all categories, and with defective items removed. To obtain indices of difficulty for all items, and to detect faulty items, it was necessary to try them out on a sample of students different from those to be used in the final administration. Such a tryout test was also required to obtain reasonably precise estimates of the time limits required for the final test. Naturally, most of the decisions about the final test applied equally to the tryout test, though the number of items in the tryout test had to be considerably greater.

At an estimated one minute per item, the tryout test of 180 items or more would require about four fifty-minute periods. Since for any one group of students and teachers, this would have involved almost all the science periods for one complete week, it was decided to try out half of the items on one group of students and the other half on a second group, both groups being reasonably similar to the group chosen to write the final test, with respect to scholastic aptitude.
While this was by no means an ideal arrangement, it appeared to be the best solution to the problem.

The most practical division of the items was in terms of content. One group of students would have to study Unit I while the other group studied Unit II. This was possible since the two units were nearly independent of each other. At least one practical advantage followed from this mode of division: the tryout tests for both units could be administered in late October and the final test could be assembled ready for administration by the time the selected schools had completed both Units I and II in December. Thus, it would be possible to do the final testing while the work of both units was fresh in the minds of the students, instead of at Easter when the students would also be loaded with additional tests covering the units studied between Christmas and Easter.

Since factors other than scholastic aptitude (such as the educational climate of a community and the general discipline of a school) affect student performance, it was believed that each of the two tryout groups should include classes from more than one school. Unfortunately, this could not be achieved; most school authorities appear reluctant to permit disruption of the status quo and to devote class-time to trying out unproved and perhaps worthless tests. The best that could be achieved was to have the students of the Creston junior high school act as the tryout group for the Unit I items while the students of the Salmon Arm junior high school handled the Unit II items.

The distributions of the scholastic aptitude scores of the grade nine university-programme students in these two schools, together with the distribution for the final group, are shown in Table I. The measures employed were Otis \"Gamma-I.Q.'s\". No means were available since the data reported for scholastic aptitude were in the form of frequencies in seven letter-grade categories as shown in the table.
Using the proportions in the final group as the \"expected proportions\" for each category, the chi-square technique described by Walker and Lev¹⁰ was employed to test the hypothesis that the discrepancies between the tryout and the final proportions were such as might have occurred rather frequently in random sampling. The obtained value of chi-square for the Creston group was 5.326, and for the Salmon Arm group was 6.474. Since, for six degrees of freedom, the probability of obtaining values greater than these exceeds .50 and .30, respectively, the hypothesis was accepted.

As noted in the foregoing section, this arrangement of using two tryout groups in two different schools was far from ideal, and was adopted only out of practical necessity. Since the final test would consist, within each relevance category, of about equal numbers of items on each content-unit, it was hoped that any discrepancies in the difficulty indices resulting from the use of different groups of students might manifest themselves more or less equally within each category.

This discussion of the preliminary decisions and arrangements being concluded, it is now possible to proceed to a description of the methodology of the investigation.

¹⁰ Helen M. Walker and Joseph Lev, Statistical Inference (New York: Henry Holt and Co., Inc., 1953), 94-95.
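The chi-square comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name is hypothetical, the frequencies are those legible in Table I (with the blank Creston entry for group E taken as zero), and, because the table is read from a scanned copy and the thesis does not state how the expected proportions were rounded, the values the sketch produces need not agree exactly with the 5.326 and 6.474 reported in the text.

```python
# Illustrative sketch only; chi_square_gof is a hypothetical helper name.


def chi_square_gof(observed, expected_props):
    # Goodness-of-fit statistic: sum of (O - E)^2 / E, where each expected
    # frequency E is the group size times the expected proportion.
    n = sum(observed)
    statistic = 0.0
    for obs, prop in zip(observed, expected_props):
        expected = n * prop
        statistic += (obs - expected) ** 2 / expected
    return statistic


# Frequencies by letter-grade group (A, B, C+, C, C-, D, E) as read from Table I.
final_group = [49, 131, 124, 133, 78, 26, 3]
creston = [15, 27, 22, 27, 23, 7, 0]  # blank E cell assumed to be zero
salmon_arm = [10, 20, 30, 31, 13, 6, 2]

# Expected proportions taken from the final group, as in the text.
props = [f / sum(final_group) for f in final_group]
df = len(final_group) - 1  # seven categories give six degrees of freedom

if __name__ == '__main__':
    print('Creston:', round(chi_square_gof(creston, props), 3), 'df =', df)
    print('Salmon Arm:', round(chi_square_gof(salmon_arm, props), 3), 'df =', df)
```

The statistic would then be referred to a chi-square table with six degrees of freedom, as the text describes.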
TABLE I

DISTRIBUTION OF SCHOLASTIC APTITUDE SCORES IN TWO GROUPS OF STUDENTS USED FOR TRYOUT TESTS

Scholastic aptitude   I.Q.-range   Expected frequencies   Creston          Salmon Arm
(letter-grade¹)                    (from final group)     (Unit I)         (Unit II)
                                   f       %              f       %        f       %
A                     126 -        49      9.0            15      12.4     10      8.9
B                     116 - 125    131     24.2           27      22.3     20      17.9
C+                    110 - 115    124     22.8           22      18.2     30      26.8
C                     104 - 109    133     24.4           27      22.3     31      27.7
C-                    98 - 103     78      14.3           23      19.0     13      11.6
D                     85 - 97      26      4.7            7       5.8      6       5.3
E                     - 84         3       0.6                             2       1.8
Total                              544     100.0          121     100.0    112     100.0

¹ The Department of Education of British Columbia divides the grade nine population into seven groups designated by letters, \"A\" representing the most able and \"E\" the least able group. In the entire grade nine population the proportions represented by the letters are: A: 5%, B: 20%, C+: 15%, C: 20%, C-: 15%, D: 20%, E: 5%.

CHAPTER IV

METHODOLOGY

The preparation, administration, and results of the tryout test are described in this chapter, which also treats the selection of the items for the final test. The statistical equivalence of the relevance categories of the final test, in terms of their difficulty level, is also demonstrated. The chapter concludes with a discussion of the composition and administration of the final test, consideration of results being deferred until Chapter V.

I. PREPARATION OF THE TRYOUT TESTS

Writing and classifying the items. Guidance was sought from the appropriate curriculum bulletin¹ regarding sections of the course considered particularly important and topics to be omitted or treated in outline only. A detailed table of specifications was then prepared to cover the content of the chosen units. This table was used constantly as a guide during the preparation of the items, and copies were mailed to the teachers of the participating classes to try to ensure uniformity of content-coverage.
Valuable as it was for these purposes, this table was far too detailed for use in classifying items by content. Such classification was a necessary preliminary to assembling the final forms of the test, in which items were matched on content as well as on relevance-category and difficulty. Accordingly, this table was condensed and, in its abbreviated form, is reproduced in Appendix A.

¹ British Columbia Department of Education, The Sciences (Victoria, B.C.: The Queen's Printer, 1956), 39-42.

A total of 192 items was produced, each item being typed on a 5\" x 8\" card and classified by content according to the condensed table of specifications. As the items were written they were also classified by relevance category, constant reference being made to the Taxonomy in an effort to secure maximum accuracy of classification. In addition, the hundreds of items classified according to the Taxonomy by Dressel and Nelson were consulted frequently.² There were, altogether, 60 knowledge-items, 67 comprehension-items, and 65 application-items.

The aid of two colleagues was enlisted to check the accuracy of the classification and to provide estimates of difficulty. Both were science-graduates with experience in teaching the Science 10 course and previous courses. In addition, one had taken graduate work in educational measurement and was familiar with the Taxonomy. Both consulted the Taxonomy to resolve doubts that arose. The criterion for acceptance of an item was set at 100 per cent agreement among the classifiers.

Twelve items were rejected since they failed to satisfy the 100 per cent criterion, another eleven as being too abstruse or otherwise difficult, and six as possessing defects unperceived by the writer. The remaining 163 items comprised 49, 58, and 56 in the knowledge, comprehension, and application categories, respectively. This number was less than had been hoped for.
It left little room for the rejection of further items during the process of equating the relevance categories for difficulty. However, this discovery was made too close to the time scheduled for the tryout tests to make it possible for additional items to be constructed and appraised.

²Paul L. Dressel and Clarence H. Nelson, Questions and Problems in Science: Test Item Folio No. 1 (Princeton, N.J.: Co-operative Test Division, Educational Testing Service, 1956).

Assembly of tryout tests. Six subtests were prepared, three for each unit. The number of items in Unit I was 82. These were divided, on the basis of content, into subtests containing 31, 31 and 20 items, respectively. For Unit II, the total number was 81, similarly divided. Each subtest contained items in all three relevance categories, and within each subtest the items were arranged in order of increasing difficulty, as estimated by the three teachers involved in the checking of the items. The arrangement of the items in order of difficulty is generally accepted in testing-practice.

The purpose of mixing items of different relevance categories was two-fold. First, it was intended to eliminate what Lindquist calls \"Type G errors,\" that is, errors which would systematically affect one relevance category rather than another, with respect to difficulty indices.³ Such errors could result from changes taking place within students in the interval between subtests, and should be allowed to manifest themselves equally in all relevance categories. And secondly, the items were mixed to avoid the possibility of giving the students all at once a large number of items whose difficulty level might prove to be fairly high if Bloom's contention⁴ regarding increasing difficulty with increasing complexity of behaviours should turn out to be true (as indeed it did!).

³E.F. Lindquist, Design and Analysis of Experiments in Psychology and Education (Boston: Houghton Mifflin Co., 1953), 9.
⁴Benjamin S. Bloom, et al., Taxonomy of Educational Objectives, Handbook I: Cognitive Domain (New York: Longmans, Green & Co., 1956), 18-19.

The first two subtests were timed to take fifty minutes each, and the third to take thirty minutes, these times being sufficient for almost everyone to consider every item. Care was taken to ensure that the correct responses were divided about equally among the alternatives a, b, c, and d. In addition, the order in which the correct responses appeared was without pattern so as to provide no extraneous clues for the students.

II. ADMINISTRATION OF THE TRYOUT TESTS

The problem of guessing. It was clearly desirable to curb blind guessing as far as possible since it would distort the distribution of criterion scores and also affect the item-statistics; but it was felt that it would be unwise for a number of reasons to apply a \"correction for guessing\" to the scores in this investigation. First, the derivation of the usual correction-formula assumes that all wrong answers are derived from pure guessing, so that when the correct answer is not known all possible responses to an item are equally likely to be chosen. It is, in fact, extremely improbable that all incorrect responses are the result of pure guessing. Many wrong responses are likely to result from partial information or misinformation. In the former case, the correction-formula would undercorrect, since the subject is actually guessing among fewer alternatives and thus increasing the possibility of chance-success. In the latter case, the formula would overcorrect since the subject is not guessing at all; he is marking what he genuinely believes to be the correct response.
Second, the occurrence of pure or blind guessing is likely to be greatest in speeded tests, where subjects have insufficient time to finish, and in the last few seconds blindly mark the remaining questions; but the test considered here was constructed as a power test, so that this situation would not be likely to occur. Further, a strongly worded penalty for guessing would be likely to influence the cautious, conscientious, sensitive student, while the one with the \"gambler's\" personality would be relatively unaffected. Lastly, and not least in importance, it was believed that correction for guessing might introduce a factor which would interact in a totally unknown manner with the relevance categories.

Thus, despite the existence of arguments in the opposite direction, the decision was taken to apply no correction for guessing, but to phrase the directions in such a way as to discourage blind guessing. The instruction finally selected was:

\"You may answer questions even when you are not completely sure that your answers are correct. In such cases, intelligent consideration of the choices provided may help you to gain marks. HOWEVER, you should AVOID WILD GUESSING as this may result in a reduction of your score.\"

It was hoped that the suggestion regarding the intelligent weighing of alternatives might encourage the cautious, and at the same time appeal to the \"gamblers\" as a reasonable alternative to blind guessing.

The test-directions. The directions for the final test are reproduced in Appendix B. Since the directions for the tryout tests were essentially the same it is not proposed to detail them here. However, brief reference may be made to their content. Besides the instructions concerning guessing, they included certain preliminary directions. It was impossible to schedule all students in a school to write a given sub-test during the same period.
Consequently it was possible that students who had written a sub-test might communicate information to those who had not. Directions were included to minimize this. The preliminary directions also ensured that students were given advance preparation on sample items of the types they were likely to meet. Further, they dealt with the physical arrangements which had to be made in advance of the tests. In addition, they included the instructions, previously reported, regarding time-limits, and the necessity for strict adherence to these, together with instructions to the students to draw a line under the highest-numbered item which they had time to consider, whether they could answer it or not. The purpose of this latter instruction was, of course, to determine the extent to which the tryout tests were speeded, and hence to estimate the reliability of the difficulty indices of the later items.

Administration and results of the tryout tests. The three subtests dealing with Unit I were written by 107 students taught by four teachers, while those dealing with Unit II were written by 100 students under three teachers. No student indicated that he was unable to consider every item. For each group, the papers were arranged in order of total score, and divided into three groups consisting of the top 27%, middle 46%, and bottom 27%. The performance of every student on every item was recorded. A note was also made of omissions, but only for the upper and lower groups.

Four of the Unit I-items and two of the Unit II-items were answered correctly by more students in the lower than in the upper portion of the groups writing them. These items were rejected as being unsuitable for further consideration, on the ground that (perhaps as a result of some unperceived ambiguity of wording or other defect) they were getting at something essentially different from what the total test was measuring.
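The screening step just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name flag_negative_items and the small response matrix are hypothetical, items are scored 1 (right) or 0 (wrong), and ties at the 27% cut-off are resolved by the order in which the papers happen to be listed, which may differ from the procedure actually followed.

```python
def flag_negative_items(responses, fraction=0.27):
    # responses: one list of 0/1 item scores per student (hypothetical layout)
    totals = [sum(r) for r in responses]
    # Arrange the papers in order of total score, highest first
    order = sorted(range(len(responses)), key=lambda i: totals[i], reverse=True)
    k = round(fraction * len(responses))      # size of upper and of lower group
    upper, lower = order[:k], order[-k:]
    flagged = []
    for j in range(len(responses[0])):
        right_upper = sum(responses[i][j] for i in upper)
        right_lower = sum(responses[i][j] for i in lower)
        # An item passed by more lower-group than upper-group students
        # discriminates negatively and is flagged for rejection
        if right_lower > right_upper:
            flagged.append(j)
    return flagged


# Hypothetical data: ten students, four items; the last item is defective
demo = [
    [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 1, 0],
    [1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 0, 1], [0, 0, 0, 1], [0, 0, 0, 0],
]
print(flag_negative_items(demo))
```

With ten students the upper and lower groups contain three papers each, and only the last item (index 3) is flagged.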
In addition, the presence of such items would have increased heterogeneity within the test, and have resulted in a reduction of the reliability of the criterion scores.

Difficulty indices were computed for all other items: these indicated the per cent of the tryout group responding correctly to the item, according to the definition adopted in Chapter I. Davis has pointed out that the use of per cents as difficulty indices is open to the criticism that per cents \"do not even approach an interval scale of difficulty.\"⁵ He suggests, on the assumption of approximate normality in the trait being measured, a procedure to transform the percentage-indices into numerical indices which can, with reasonable legitimacy, be averaged and otherwise statistically manipulated. His suggested method is to use the table of the normal probability integral to transform the per cents to standard-deviation units, to multiply these figures by 21.063, and to add 50.⁶ This amounts to setting the mean at 50 and the standard deviation at 21.063, and results in a set of indices for which the values 1, 50 and 99 correspond to the 1%, 50%, and 99% levels respectively. This procedure was adopted. In the following discussion, the term \"difficulty index\" is to be regarded as referring to the transformed indices, unless otherwise specified.

The distribution of these indices within each category is shown in Table II below. Their means, for the knowledge, comprehension, and application categories were 49.5, 44.1 and 39.1, respectively, thus displaying the trend anticipated by the authors of the Taxonomy toward increasing difficulty of items with increasing complexity of behaviours sampled.
Indeed, as will be seen, this trend was sufficiently pronounced to force the elimination of the application category from further consideration, and to reduce the test of the hypothesis to a comparison of the knowledge and comprehension categories only.

⁵Frederick B. Davis, \"Item Selection Techniques,\" Educational Measurement, E.F. Lindquist, editor (Washington: Amer. Coun. on Educ., 1951), 267.
⁶Ibid.

TABLE II

DISTRIBUTION OF DIFFICULTY INDICES COMPUTED FROM TRYOUT TESTS GROUPED BY RELEVANCE CATEGORY¹

Transformed            Original
difficulty indices²    difficulty indices    Number of items
(real intervals)       (per cents)           Knowledge   Comprehension   Application
82.25 - 85.75          95 - 96                   1
78.75 - 82.25          92 - 94                   1
75.25 - 78.75          89 - 91
71.75 - 75.25          85 - 88                   2
68.25 - 71.75          81 - 84
64.75 - 68.25          76 - 80                   2            2
61.25 - 64.75          71 - 75                   3            3              2
57.75 - 61.25          65 - 70                   3            2              1
54.25 - 57.75          59 - 64                   1            4              3
50.75 - 54.25          52 - 58                   8            6              4
47.25 - 50.75          45 - 51                   3            3              3
43.75 - 47.25          39 - 44                   8            7              5
40.25 - 43.75          33 - 38                   5            7              5
36.75 - 40.25          27 - 32                   2            8              4
33.25 - 36.75          22 - 26                   3            5             12
29.75 - 33.25          17 - 21                   4            5              5
26.25 - 29.75          13 - 16                   1                           2
22.75 - 26.25          10 - 12                   1            2              3
19.25 - 22.75           8 - 9                                 2              1
15.75 - 19.25           6 - 7                                                1
12.25 - 15.75           4 - 5                                                1
 8.75 - 12.25           3
 5.25 - 8.75            2                                                    1
N =                                             48           56             53
Mean difficulty index =                         49.5         44.1           39.1

¹Six items, rejected as discriminating negatively, have been omitted from this table.
²The transformed difficulty indices were obtained from the original per cents by using Davis' procedure, described on page 36.
³The dotted lines indicate the boundaries of the range of difficulty indices from which items were selected for the final test.
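Davis' transformation, as described in this chapter, can be written as a short Python function. The sketch below assumes only what the text states (normal deviate multiplied by 21.063, plus 50); the function name davis_index is hypothetical, and the inverse normal is taken from the standard library rather than from a printed table, so results may differ slightly from hand-computed values.

```python
from statistics import NormalDist


def davis_index(percent_correct):
    # Normal deviate below which the given per cent of the area lies
    z = NormalDist().inv_cdf(percent_correct / 100.0)
    # Rescale so that 1%, 50% and 99% map to about 1, 50 and 99
    return 50.0 + 21.063 * z


for pct in (1, 50, 95, 99):
    print(pct, round(davis_index(pct), 2))
```

An item answered correctly by 95 per cent of the group receives an index of about 84.6, which falls in the 82.25 - 85.75 interval shown against 95 - 96 per cent in Table II.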
Before passing on to a discussion of the preparation of the final tests, it may be of interest to note that of the students who wrote the Unit I test, twenty-nine, or 50%, of the students in the upper and lower groups omitted one or more items, as did thirty, or 55%, of those who wrote the Unit II test. It would appear that these students, at any rate, believed that they were not guessing blindly. However, the fact that the remaining 45 to 50% of the upper and lower groups responded to all items suggests that a certain amount of guessing may have been taking place and that the results must be assessed with this in mind.

III. THE PREPARATION AND ADMINISTRATION OF THE FINAL TEST

Selection of items for the final test. As indicated in the discussion of the hypothesis in Chapter I, it was necessary that the difficulty indices of the items within each relevance category of the final test should have as nearly as possible the same distribution. The distribution aimed at was a symmetrical one with a mean of about 50. A mean much lower than this was considered to be undesirable since it might lead to increased guessing and consequent distortion of both criterion scores and discrimination indices.

As may be seen from Table II, the four easiest knowledge-items had to be excluded since there were no items of a corresponding level of difficulty in either of the other categories. At the other end of the scale, the knowledge items having a difficulty index of less than 31.5 were rejected, leaving forty items with a mean of 47.8. To obtain a mean of 50.0, it would have been necessary to reject six additional items, reducing the total number in that category, and hence in all categories, to thirty-four, which was considered too low to offer hope of reasonably reliable results.
At this point, it became clear that the application-items could not be retained, for the mean difficulty of the top forty of these items was only 43.8; to raise the mean to that of the forty knowledge-items would have meant using only the top twenty-eight items, which was again too low. It was with reluctance that this decision was taken. It had, throughout, been foreseen as a possibility but, as was pointed out earlier in this chapter, the construction of more items to provide a greater choice would probably have been futile, owing to the impossibility of securing significantly more testing time.

The next problem was to select from the available comprehension-items a group of forty with a mean difficulty index of about 47.8 and roughly the same distribution of these indices as in the knowledge category. First, the four items whose difficulty indices fell outside the range covered by those of the knowledge-items were rejected. Then, twelve more items were discarded. The final distributions are shown in Table III, in which smaller class-intervals have been used than were employed in Table II, in order to show more detail.

A consideration in determining which items to discard was that since only the knowledge- and comprehension-items remained, the most time-consuming items having been rejected, it should be possible to assemble the final test in the form of two subtests. Since these would have to be administered in two separate periods, it was considered that from the point of view of determining test-reliability it would be better to try to assemble the tests as \"equivalent forms\" in which difficulty level, relevance category, and content were as nearly alike in both forms as possible.
In this way, the component of variance in the final test scores attributable to testing on two different occasions would be properly treated as error variance rather than as systematic variance.

TABLE III

DISTRIBUTION OF DIFFICULTY INDICES IN THE KNOWLEDGE AND COMPREHENSION CATEGORIES OF THE FINAL TEST (FREQUENCY AND CUMULATIVE FREQUENCY)

Difficulty indices    No. of items     Cumulative       Max. difference
(real intervals)      (frequency)      frequency¹       in cumulative
                       K      C         K      C        frequency¹
63.99 - 65.99          2      2        40     40
61.99 - 63.99          2      3        38     38
59.99 - 61.99          1      1        36     35           -1
57.99 - 59.99          2      1        35     34           -1
55.99 - 57.99                 1        33     33
53.99 - 55.99          3      3        33     32           -1
51.99 - 53.99          6      5        30     29           -1
49.99 - 51.99                 1        24     24
47.99 - 49.99          2      1        24     23           -1
45.99 - 47.99          2      3        22     22
43.99 - 45.99          7      5        20     19           -1
41.99 - 43.99          3      2        13     14            1
39.99 - 41.99          2      1        10     12            2
37.99 - 39.99          2      3         8     11            3
35.99 - 37.99                 3         6      8            2
33.99 - 35.99          3      3         6      5           -1
31.99 - 33.99          2      2         3      2           -1
29.99 - 31.99          1                1                  -1
N =                   40     40
Mean =                47.8   47.9
Median =              46.0   46.7
S.D. =                 9.03   9.55

¹These figures are provided for reference in connection with a statistical test used on page 43 of this chapter.

In view of these considerations, the problem of discarding the excess comprehension-items became one of card-shuffling with two aims in view: (1) to produce similar distributions of difficulty for both relevance categories, together with near-equivalence of content, and (2) to produce, from the same items, two subtests characterized by near-equivalence in terms of difficulty, relevance category and content.

Equivalence of relevance categories in respect of difficulty. It is evident from Table III that the final distributions of difficulty indices for items in the knowledge and comprehension categories do not approach the normal form. Despite this, the distributions appeared very similar, as may be seen from Figure 1, below, which depicts the distributions as ogives.
For the knowledge- and comprehension-items, respectively, the means were 47.8 and 47.9, the medians 46.0 and 46.5, and the standard deviations 9.03 and 9.55. Though the close correspondence of these statistics and of the ogives indicated that the two distributions could reasonably be said to satisfy the requirements of the hypothesis, it was thought desirable to supplement such a subjective claim by statistical tests.

The non-normal form of the distributions rendered parametric tests of doubtful validity, so the Kolmogorov-Smirnov non-parametric test was applied. This test, which Walker and Lev describe as \"sensitive to any kind of difference in the distributions from which the samples are drawn,\"⁷ tests the null hypothesis that both samples were drawn from populations with the same continuous distribution. With reference to the assumption of continuity, it is not unreasonable to believe that item-difficulty is continuously distributed, even though the numerical indices used to characterize it may exhibit discreteness. In any event, the parametric tests of significance in common use are frequently applied to discrete or only quasi-continuous data, though the theoretical requirements call for random samples from a normal, and therefore continuous, population.

⁷Helen M. Walker and Joseph Lev, Statistical Inference (New York: Henry Holt and Co., Inc., 1953), 426-27.

FIGURE 1

CUMULATIVE DISTRIBUTION OF TRANSFORMED DIFFICULTY INDICES OF ITEMS SELECTED FOR THE FINAL KNOWLEDGE AND COMPREHENSION CATEGORIES

(Horizontal axis: difficulty indices, mid-points. Key: knowledge items; comprehension items.)

For equal samples, each with forty cases or fewer, the Kolmogorov-Smirnov test involves referring the statistic DN to a specially devised significance table.⁸
Since D represents the maximum distance between the cumulative percentage distributions, and N is the number of cases in each of the two samples, DN represents the maximum difference, in any interval, between the actual cumulative frequencies. The last column of Table III shows this difference to be three. Since the size of this statistic required to reject the null hypothesis at the .05-level for sample-size forty is twelve, the null hypothesis must be accepted, and the conclusion drawn that it is highly reasonable to accept the view that the two samples have essentially the same distribution.

In corroboration, the Mann-Whitney test was employed to test the null hypothesis of \"no difference in medians.\"⁹ This involved assigning ranks to the eighty difficulty indices, finding the sum of the ranks of the indices of one sample, and referring to the table of the normal probability integral the statistic

z = [2R₁ - N₁(N + 1)] / √[N₁N₂(N + 1)/3]

where R₁ = sum of ranks of difficulty indices of knowledge-items,
N₁ = number of cases in the knowledge category,
N₂ = number of cases in the comprehension category,
N = N₁ + N₂.

⁸Ibid., p. 427.
⁹Ibid., pp. 434-35.

The difficulty indices, together with their ranks, are supplied in Table IV, below, for reference. Where ties occurred between indices, the rank assigned to each was the mean of the ranks which would have been assigned to them had they been slightly different. Here, the obtained value of z was .072. Since the probability of obtaining an absolute value of z greater than .072 is about .94, this figure being obtained from the table of the normal probability integral, for a two-tailed test, it was concluded that there was no basis for believing that the two samples differed significantly in their medians.
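The two statistics used above can be sketched in Python as follows. The function names are hypothetical. The first function assumes two equal-sized samples grouped into the same class intervals, and is shown with the knowledge and comprehension frequencies read from Table III, listed from the lowest interval upward. The second applies the large-sample z formula quoted above, with mean ranks for ties and rank 1 given to the largest value; it is demonstrated on a small invented data set rather than on the thesis data.

```python
from math import sqrt


def max_cumulative_difference(freq_a, freq_b):
    # DN: largest absolute gap between the two cumulative frequency columns
    cum_a = cum_b = largest = 0
    for fa, fb in zip(freq_a, freq_b):
        cum_a += fa
        cum_b += fb
        largest = max(largest, abs(cum_a - cum_b))
    return largest


def midranks(values):
    # Rank 1 goes to the largest value; tied values share the mean rank
    first, count = {}, {}
    for pos, v in enumerate(sorted(values, reverse=True), start=1):
        first.setdefault(v, pos)
        count[v] = count.get(v, 0) + 1
    return {v: first[v] + (count[v] - 1) / 2 for v in first}


def mann_whitney_z(sample_1, sample_2):
    ranks = midranks(sample_1 + sample_2)
    n1, n2 = len(sample_1), len(sample_2)
    n = n1 + n2
    r1 = sum(ranks[v] for v in sample_1)      # rank sum of the first sample
    return (2 * r1 - n1 * (n + 1)) / sqrt(n1 * n2 * (n + 1) / 3)


# Frequencies from Table III, lowest class interval first
knowledge = [1, 2, 3, 0, 2, 2, 3, 7, 2, 2, 0, 6, 3, 0, 2, 1, 2, 2]
comprehension = [0, 2, 3, 3, 3, 1, 2, 5, 3, 1, 1, 5, 3, 1, 1, 1, 3, 2]
print(max_cumulative_difference(knowledge, comprehension))

# Invented samples, used only to exercise the z formula
print(round(mann_whitney_z([3, 5, 5, 8], [1, 5, 7, 9]), 4))
```

For the Table III frequencies the first function returns 3, the value of DN reported in the text.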
As a result of these two tests, it was felt safe to assume that the knowledge- and comprehension-items had been adequately matched on the variable, difficulty, as required by the hypothesis.

The final test. The decision to assemble the final test in the form of two subtests closely matched in terms of difficulty level, content, and relevance category, has already been reported. The result of the matching process was two subtests, I and II, with mean difficulty indices of 47.9 and 47.8, respectively, and each containing equal numbers of knowledge- and comprehension-items. Table V indicates that the subtests were also quite closely matched on the basis of the content-groupings shown in the table of specifications in Appendix A. Care was taken, however, to ensure that items were not matched to the point of near-identity of content for, as Thorndike points out, this could result in a spurious inflation of the reliability coefficient of the test.¹⁰

TABLE IV

VALUES AND RANKS OF THE DIFFICULTY INDICES IN THE KNOWLEDGE AND COMPREHENSION CATEGORIES

Difficulty indices      Ranks of indices
  K        C              K        C
65.5     64.9             1        2.5
64.1     64.9             4        2.5
62.8     62.8             6        6
62.2     62.8             8.5      6
59.9     62.2            11        8.5
58.6     60.9            12.5     10
58.2     58.6            14       12.5
54.8     56.9            19       15
54.2     55.9            20.5     16
54.2     55.3            20.5     17.5
53.8     55.3            22.5     17.5
53.1     53.8            24.5     22.5
52.7     53.1            26.5     24.5
52.7     52.1            26.5     30
52.1     52.1            30       30
52.1     52.1            30       30
49.4     51.1            34       33
48.3     48.3            35.5     35.5
47.3     47.3            37.5     37.5
46.2     46.9            40.5     39
45.8     46.2            44       40.5
45.8     45.8            44       44
45.2     45.8            47       44
44.7     45.8            48.5     44
44.7     44.1            48.5     51.5
44.1     44.1            51.5     51.5
44.1     43.1            51.5     57
43.5     43.1            54.5     57
43.5     40.8            54.5     61
43.1     39.5            57       62.5
41.8     39.1            59       65
41.4     39.1            60       65
39.5     37.8            62.5     67
39.1     37.2            65       68
35.9     36.5            70.5     69
35.9     35.1            70.5     72.5
34.4     35.1            74.5     72.5
33.0     34.4            77.5     74.5
33.0     33.0            77.5     77.5
31.5     33.0            80       77.5
Sum of ranks           1622.5   1617.5
N =                      40       40

TABLE V (a)

DISTRIBUTION OF UNIT I-ITEMS BY CONTENT CATEGORY IN THE TWO FINAL SUBTESTS

Unit I                Number of items
content-category      Subtest I    Subtest II
1                                      1
2                         4            4
3                         5            5
4                         5            4
5                         4            5
6                         3            3
                         21           22
No. of Unit I items in final test: 43

TABLE V (b)

DISTRIBUTION OF UNIT II-ITEMS BY CONTENT CATEGORY IN THE TWO FINAL SUBTESTS

Unit II               Number of items
content-category      Subtest I    Subtest II
1                         3            3
2                         2            2
3                         2            2
4                         4            4
5                         5            4
6                         3            3
                         19           18
No. of Unit II items in final test: 37

As in the tryout test, the items in each final subtest were arranged in order of increasing difficulty; also, the correct responses were about equally distributed among the four options for the items, and were arranged in a quasi-random fashion. The final test is reproduced in Appendix C.

Administration of the final test. It was estimated from the tryout tests that, with the time-consuming application-items deleted, the eighty items of the final test could be administered in two fifty-minute periods, without introducing a speed-factor, provided that the students were familiarized with the directions before the test began. This decision was incorporated into the directions, which are reproduced in Appendix B. The teachers administering the test were requested to read the directions without embellishing them in any way. They were asked to give prior practice to their students in handling sample items chosen from among the rejects.
In reading the directions they conveyed the information that on tests like this it is expected that the average student will score around fifty per cent, and that the pass-mark on this test would be considerably lower than that. The purpose of this was to help in reducing tension and, it was hoped, to decrease the ever-present tendency among some students to guess wildly, regarding which the instructions used in the tryout test were repeated. It should be remarked that the students were told that the test-results would be used to help determine their Christmas mark in science. This had the merit of forcing them to take the test seriously and of providing a kind of motivation. It could also have had the effect of increasing the tendency toward guessing, but this was in any event inevitable since the tests would never have been administered at all if the teachers had been denied the opportunity to use the results and the norms.

¹⁰Robert L. Thorndike, \"Reliability,\" Educational Measurement, E.F. Lindquist, editor (Washington: Amer. Coun. on Educ., 1951), 575.

Strict adherence to time-limits was stressed. In addition, students were instructed to draw a line under the highest-numbered item which they had a chance to consider, and, if they had been able to consider all items, to place a check-mark in a space provided on the cover-sheet. Thus, it was possible to confirm the expectation that the test was a power-test. For convenience, it may be reported at this point that only one student indicated that he had been unable to consider all of the items of one subtest. He had marked the last few items, but had omitted some in the body of the subtest, so that it appeared that he had deliberately left for later consideration some items which had, on cursory reading, appeared difficult. Essentially, then, the final test was a power-test.
The teachers were requested to make internal arrangements so that subtest I could be administered during the same period for all classes, or at least during the same morning or afternoon; the same request was made for subtest II. These requests were complied with so that leakage during change of periods could be considered minimal. The two subtests were run on successive days.

The test results. A few students were absent for one or the other subtest, but 530 students wrote both. Each student's response to every item was recorded. The results were sent to the Computing Centre at the University of British Columbia for the computation of the point-biserial correlation coefficients which represented the discrimination indices of all items. The test data and their interpretation are deferred to Chapter V.

CHAPTER V

THE TEST RESULTS AND THEIR STATISTICAL ANALYSIS

In this chapter the distributions of scores in each relevance category, and of total scores, are presented. Test reliability is discussed together with inter-category correlation. The distributions of final difficulty indices are depicted graphically, and the adequacy of the process of matching for difficulty is tested. Finally, after applying Fisher's z-transformation to the point-biserial correlation coefficients, or discrimination indices, a test of significance is made. This is supported by corroborative evidence using biserial-r.

I. THE TEST SCORES AND RELIABILITIES

Knowledge, comprehension, and total scores. Table VI shows the distributions of knowledge scores, comprehension scores, and total scores. The same information is presented visually in Figures 2 and 3 which follow. All three distributions are somewhat positively skewed, this effect being least for the knowledge items. All exhibit a certain \"piling up\" of scores near the lower end, an effect which is particularly noticeable with the comprehension items.
While the means and medians of the knowledge and comprehension scores, shown in the table, are highly similar, their difference in variance is noteworthy. Before discussing this point further, data on reliability will be considered.

TABLE VI

DISTRIBUTION OF KNOWLEDGE, COMPREHENSION, AND TOTAL SCORES IN FINAL TESTS

Scores             Frequency        Scores             Frequency
(real limits)       K       C       (real limits)      Total Test
38.5 - 40.5                 1       67.5 - 70.5             2
36.5 - 38.5                         64.5 - 67.5             2
34.5 - 36.5         1       1       61.5 - 64.5             4
32.5 - 34.5         3       4       58.5 - 61.5             5
30.5 - 32.5         2       6       55.5 - 58.5            13
28.5 - 30.5        17      24       52.5 - 55.5            23
26.5 - 28.5        19      21       49.5 - 52.5            29
24.5 - 26.5        47      50       46.5 - 49.5            38
22.5 - 24.5        38      58       43.5 - 46.5            46
20.5 - 22.5        70      53       40.5 - 43.5            49
18.5 - 20.5        83      58       37.5 - 40.5            62
16.5 - 18.5        75      64       34.5 - 37.5            50
14.5 - 16.5        74      70       31.5 - 34.5            64
12.5 - 14.5        50      57       28.5 - 31.5            52
10.5 - 12.5        29      27       25.5 - 28.5            43
 8.5 - 10.5        18      32       22.5 - 25.5            24
 6.5 - 8.5          3       4       19.5 - 22.5            17
 4.5 - 6.5          1               16.5 - 19.5             7
N =               530     530                             530
M =                19.1    19.3                            38.4
Mdn =              18.9    18.9                            37.9
s² =               25.97   33.19                          100.60
s =                 5.10    5.76                           10.03

FIGURE 2

DISTRIBUTION OF FINAL SCORE OF 530 STUDENTS ON THE FINAL TEST

(Vertical axis: frequency. Horizontal axis: total score, mid-points.)

[...] reliance must continue to be placed on the care with which the items were constructed, and double-checked, in accordance with the specifications of the Taxonomy.

⁵Ibid., p. 299.
⁶Edward J. Furst, \"Effect of the Organization of Learning Experiences upon the Organization of Learning Outcomes,\" Journal of Experimental Education, 18 (1950).
⁷T.R. McConnell, \"A Study of the Extent of Measurement of Differential Objectives of Instruction,\" in An Appraisal of Techniques of Evaluation: Symposium (Washington: Amer. Educ. Research Assoc., National Educ. Assoc., February, 1940).

II. DISTRIBUTIONS OF DIFFICULTY INDICES

As at the tryout stage, the difficulty indices obtained from the final test were transformed by Davis' procedure. These transformed indices are shown in Table VII, and their cumulative frequency curves in Figure 4 below. It is clear that the matching process was partially but not entirely successful. The means are still almost identical, though about one unit larger. However, the dispersion of the knowledge items has increased while that of the other has decreased; the comprehension items have turned out to be more homogeneous than the knowledge items, in respect of difficulty. This fact may be partly responsible for their greater reliability.

Despite the agreement in means and in general form of distribution, the presence of one very easy knowledge item and two or three quite difficult ones, not matched by corresponding comprehension items, gives rise to concern as to their effect on the discrimination indices. As noted in Chapter I, the value of the point-biserial correlation coefficient cannot rise very high for very easy or very hard items. The presence of these few more extreme difficulty indices in the knowledge category gives an advantage to the comprehension category in respect of any test of significance involving the discrimination indices.

III. DISTRIBUTIONS OF DISCRIMINATION INDICES

Point-biserial correlation coefficients were obtained for all items. For reference purposes, these are listed, for each item, in Appendix D, together with the item's Taxonomy-category, its final percentage-difficulty index, and its biserial correlation coefficient.

TABLE VII

DISTRIBUTION OF TRANSFORMED DIFFICULTY INDICES IN THE KNOWLEDGE AND COMPREHENSION CATEGORIES

Transformed            Knowledge items         Comprehension items
difficulty indices
(real limits)           f       cu. f            f       cu. f
78.75 - 81.25           1        40
76.25 - 78.75
73.75 - 76.25
71.25 - 73.75
68.75 - 71.25           1        39
66.25 - 68.75                                    2        40
63.75 - 66.25           1        38              1        38
61.25 - 63.75           3        37
58.75 - 61.25
56.25 - 58.75           4        34              6        37
53.75 - 56.25           2        30              3        31
51.25 - 53.75           4        28              4        28
48.75 - 51.25           3        24              3        24
46.25 - 48.75           1        21              2        21
43.75 - 46.25           6        20              9        19
41.25 - 43.75           1        14              1        10
38.75 - 41.25           5        13              5         9
36.25 - 38.75           5         8              2         4
33.75 - 36.25           2         3              2         2
31.25 - 33.75           1         1
N                      40                       40
M                      48.5 (47.8)¹             49.0 (47.9)
Mdn                    46.3 (46.0)              47.3 (46.7)
s                      10.76 (9.03)              8.50 (9.55)

¹The figures in parenthesis are the corresponding values from the tryout stage.

FIGURE 4

CUMULATIVE DISTRIBUTIONS OF TRANSFORMED DIFFICULTY INDICES OF ITEMS IN THE KNOWLEDGE AND COMPREHENSION CATEGORIES, OBTAINED FROM THE ADMINISTRATION OF THE FINAL TEST

(Horizontal axis: difficulty indices, mid-points. Key: comprehension items; knowledge items.)

According to the procedure described in Chapter I, Fisher's z-transformation was applied to each point-biserial coefficient. The distributions of these z-values for both relevance categories are shown in Table VIII, and the corresponding histograms in Figure 5, below. These distributions exhibit the irregularity of form which often accompanies the use of small samples. Although both are somewhat negatively skewed, they have a predominance of cases in the middle range and fewer at the extremes. Nevertheless, neither can be considered to have been drawn from a normal population.
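The discrimination index and its transformation can be illustrated with a minimal Python sketch. The function names and the five-student data set are hypothetical; the point-biserial coefficient is computed from the usual textbook formula (mean total score of those passing the item, minus the mean of all, divided by the standard deviation of all scores, times the square root of p/q), which may differ in computational detail from the routine used at the Computing Centre.

```python
from math import atanh, sqrt, tanh


def point_biserial(item_scores, total_scores):
    # item_scores: 1 if the student passed the item, else 0
    # Note: undefined when every student (or none) passes the item
    n = len(total_scores)
    p = sum(item_scores) / n
    q = 1 - p
    mean_all = sum(total_scores) / n
    sd_all = sqrt(sum((t - mean_all) ** 2 for t in total_scores) / n)
    passed = [t for x, t in zip(item_scores, total_scores) if x]
    mean_pass = sum(passed) / len(passed)
    return (mean_pass - mean_all) / sd_all * sqrt(p / q)


def fisher_z(r):
    # Fisher's z-transformation: z = (1/2) ln[(1 + r) / (1 - r)]
    return atanh(r)


# Hypothetical data: five students, one item, total scores 5 down to 1
r = point_biserial([1, 1, 1, 0, 0], [5, 4, 3, 2, 1])
print(round(r, 4), round(fisher_z(r), 4))

# Converting mean z-values back to the r scale
print(round(tanh(0.2580), 3), round(tanh(0.2908), 3))
```

The last line converts the mean z-values of .2580 and .2908 reported in Table VIII back to the r scale, giving about .252 and .283, the figures entered beneath them in that table.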
Among the factors responsible for this, three may be cited: (1) the use of only forty cases in each category does not provide a very typical sample of all possible items which might have been written; (2) at the tryout stage, the negatively discriminating items were rejected, thus disrupting the randomness of selection; and (3) while the point-biserial correlation coefficient is unrestrictedly used for item analysis by test constructors, and under certain conditions its sampling distribution is known for a given item, there is little that can be said about its expected distribution for a set of items.

It had been hoped that the obtained distributions of discrimination indices would, despite the considerations of the previous paragraph, be (a) reasonably symmetrical, (b) similar in form, and (c) about equally variable; granted these conditions, the empirical work of Boneau⁸ and Norton⁹ on using the t-test and the F-distribution under considerable violation of the assumptions could have been invoked to justify a parametric significance test. However, the irregularity of the obtained distributions is such that they cannot be said to resemble any of the populations which Boneau and Norton sampled in their investigations, so that their conclusions cannot be considered applicable. Accordingly the decision was taken to use the non-parametric Mann-Whitney test already employed in Chapter IV.

⁸C. Alan Boneau, "The Effects of Violations of Assumptions Underlying the t Test," Psychological Bulletin, 57 (No. 1, 1960), 49-64.
⁹E.F. Lindquist, Design and Analysis of Experiments in Psychology and Education (Boston: Houghton Mifflin Co., 1953), 79-83.

TABLE VIII
DISTRIBUTION OF DISCRIMINATION INDICES¹ OF KNOWLEDGE AND COMPREHENSION ITEMS

z-values                 Knowledge items    Comprehension items
(real limits)                  f                   f
 .4725 -  .4975                -                   1
 .4475 -  .4725                1                   2
 .4225 -  .4475                -                   2
 .3975 -  .4225                3                   1
 .3725 -  .3975                1                   1
 .3475 -  .3725                2                   5
 .3225 -  .3475                6                   4
 .2975 -  .3225                2                   5
 .2725 -  .2975                5                   1
 .2475 -  .2725                5                   3
 .2225 -  .2475                1                   4
 .1975 -  .2225                1                   3
 .1725 -  .1975                3                   5
 .1475 -  .1725                3                   1
 .1225 -  .1475                3                   1
 .0975 -  .1225                2                   -
 .0725 -  .0975                -                   -
 .0475 -  .0725                1                   -
 .0225 -  .0475                -                   1
-.0025 -  .0225                -                   -
-.0275 - -.0025                -                   -
-.0525 - -.0275                1                   -

N                             40                  40
Mean z                      .2580               .2908
r corresponding to mean z    .252                .283
Mdn(z)                       .27                 .30

¹These are Fisher's z-values corresponding to the point-biserial correlation coefficients obtained for each item.

[FIGURE 5(a). Distribution of discrimination indices¹ for knowledge items. Horizontal axis: discrimination indices (mid-points); vertical axis: frequency.]

[FIGURE 5(b). Distribution of discrimination indices¹ for comprehension items. Horizontal axis: discrimination indices (mid-points); vertical axis: frequency.]

¹These are Fisher's z-values corresponding to point-biserial correlation coefficients.

IV. THE STATISTICAL TEST OF THE HYPOTHESIS

The statistical hypothesis to be tested was the null hypothesis, $H_0$: $\mathrm{Mdn}_K - \mathrm{Mdn}_C = 0$, the admissible alternative hypothesis being $H_1$: $\mathrm{Mdn}_C > \mathrm{Mdn}_K$. This, therefore, required a one-tailed test. The level of significance was pre-set at the rather severe level of 1%. It was recognized that this would increase the risk of committing a Type II error, but it was considered better that the status quo should be maintained, even if false, rather than that a possibly true (null) hypothesis should be incorrectly rejected.

Ranks were assigned to the discrimination indices, and the procedure described in Chapter IV followed. The discrimination indices, still expressed in terms of Fisher's z, together with their ranks, are listed in Table IX. The normal deviate,

$$z = \frac{2R_K - N_K(N + 1)}{\sqrt{N_K N_C (N + 1)/3}},$$

was calculated using the values shown in Table IX, and its value found to be 1.26. For the one-tailed test, the probability of obtaining a value of z greater than 1.26 was determined from the table of the normal probability integral to be .1038. There was thus no basis for rejecting the null hypothesis, which was therefore accepted.

TABLE IX
VALUES AND RANKS OF DISCRIMINATION INDICES¹ OF ITEMS IN KNOWLEDGE AND COMPREHENSION CATEGORIES

Discrimination indices      Ranks of indices
   K        C                  K        C
  .456     .476                3        1
  .421     .462                7.5      2
  .417     .454                9        4
  .414     .443               10        5
  .382     .439               11        6
  .368     .421               13        7.5
  .363     .374               15       12
  .347     .365               20       14
  .341     .360               22       16.5
  .339     .360               23       16.5
  .328     .356               27       18
  .327     .349               28       19
  .325     .344               29       21
  .310     .336               33.5     24
  .304     .334               35       25
  .296     .332               37       26
  .289     .321               39       30
  .288     .316               40       31
  .287     .312               41       32
  .283     .310               42       33.5
  .267     .298               43       36
  .265     .292               45.5     38
  .265     .266               45.5     44
  .259     .264               49       47
  .251     .262               50       48
  .242     .243               52       51
  .220     .235               56       53
  .195     .234               60       54
  .187     .225               61       55
  .186     .211               62       57
  .171     .206               68       58
  .152     .200               70       59
  .149     .185               71       63
  .131     .184               73       64
  .127     .180               74       65
  .124     .176               75       66
  .121     .175               76       67
  .120     .170               77       69
  .051     .135               78       72
 -.049     .025               80       79

$N_K = 40$, $N_C = 40$, $N = 80$; $R_K = 1751$, $R_C = 1489$

¹Discrimination indices reported are Fisher's z-values of point-biserial correlation coefficients.

For comparison, the same test applied to the biserial correlation coefficients is offered.
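The normal deviate reported above can be checked directly from the rank sums in Table IX. The sketch below uses the usual large-sample normal approximation to the Mann-Whitney test, without any correction for ties; that choice, and the function names, are assumptions made for illustration, since the study's own procedure is the one described in Chapter IV.

```python
import math


def mann_whitney_normal_deviate(rank_sum_k, n_k, n_c):
    """Normal deviate from the rank sum of one group (no tie correction).

    z = (2*R_K - N_K*(N + 1)) / sqrt(N_K * N_C * (N + 1) / 3), with N = N_K + N_C.
    Rank 1 is given to the largest index, so a positive z means the knowledge
    items tend to hold the larger ranks, that is, the smaller indices.
    """
    n = n_k + n_c
    numerator = 2 * rank_sum_k - n_k * (n + 1)
    denominator = math.sqrt(n_k * n_c * (n + 1) / 3)
    return numerator / denominator


def upper_tail_probability(z):
    """One-tailed probability P(Z > z) for a standard normal deviate."""
    return 0.5 * math.erfc(z / math.sqrt(2))


# Rank sum of the knowledge items for the point-biserial indices (Table IX).
z_pbis = mann_whitney_normal_deviate(1751, 40, 40)
p_pbis = upper_tail_probability(z_pbis)
print(round(z_pbis, 2), round(p_pbis, 3))
```

With $R_K = 1751$ and forty items in each category, the sketch gives a deviate of about 1.26 and a one-tailed probability of about .10, in agreement with the figures in the text and well short of the 1% level.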
These coefficients are listed in Appendix D, for reference. Here, $R_K$ was found to be 1716; and using the values 40, 40, and 80 for $N_K$, $N_C$, and $N$, respectively, the value of the normal deviate was calculated to be .9238. The probability of obtaining a value greater than this is .18, so that using the biserial r, the null hypothesis must still be accepted.

In view of the non-significance of the difference, it was concluded that the observed difference could be ascribed to sampling error, and that there is no statistical basis for maintaining that the comprehension items discriminate better than do the knowledge items.

CHAPTER VI

CONCLUSIONS AND SUMMARY

This investigation was designed to avoid the inadequacies of the earlier one reported by Cook. It is believed that this objective was achieved, but unfortunately the present study has weaknesses of its own. It is in the light of these that conclusions must be stated.

I. WEAKNESSES

The major weakness lay in the small number of items used. Among the effects of this were the following.

First, although an objective of the Science 10 course is to develop functional problem-solving ability, the application items generally proved too hard; too few of moderate difficulty could be found. A larger number of items would have provided more items of moderate to low difficulty, and avoided the loss of what might have proved to be a set of items of high average discrimination.

Second, the almost rectangular distribution of difficulty indices shown in Table VII makes it clear that though the mean difficulty-level in both categories was around 50%, there were too many rather difficult items. Again, a larger number of items would have made some control of the distribution of difficulty possible.
In any event, it may be surmised that the presence of these more difficult items increased the tendency to guess, thus contributing to lower the reliability of the test and its categories.

This unduly low reliability constitutes the third effect of small numbers. The generalized Spearman-Brown formula¹ indicates that twice as many items of the same quality should produce a reliability coefficient for the total test of over .90. It is likely that, with a better distribution of difficulty indices, the same value could be produced with fewer than twice as many items. In passing, however, it may be remarked that in an experiment of this kind, really high reliability may always remain elusive, since the design forbids the rejection of poorly discriminating items during construction of the final test. The final test, therefore, can never be thoroughly refined.

Finally, on this point, a larger number of items was required to smooth out the distributions of discrimination indices and to permit their form to become apparent.

Another source of weakness lay in the use of two tryout groups, though this was inevitable in view of the lack of testing time available, which was also the root-cause of the use of so few items. However, its disadvantages could have been avoided to some extent by the use of larger numbers of students in each group. Greater stability of the difficulty indices might have resulted.

¹Robert L. Thorndike and Elizabeth Hagen, Measurement and Evaluation in Psychology and Education (New York: John Wiley and Sons, Inc., 1955), 137.

II. CONCLUSIONS

The weaknesses noted in the foregoing reduce the precision with which conclusions can be drawn. The statistical tests provide no basis for believing the observed discrepancy in average discrimination index to be anything but sampling error. Nevertheless, the results obtained are in the right direction; and inspection of Table IX indicates that the superiority of the comprehension items over the knowledge items is highly consistent, being evident from top to bottom of the table.

The possibility cannot be ignored that the use of a larger number of items (had this been possible to time-table) might have increased the reliability of the criterion scores, smoothed the distributions of discrimination indices, and generally improved the precision of the experiment to the point where a significant discrepancy might have been obtained. In addition, it is impossible to avoid speculating whether the apparent trend in the direction hypothesized would have been continued by the application items, had they survived.

This is not the definitive experiment on this problem.

III. POSSIBILITIES OF FURTHER RESEARCH

With the exception of one point, the general design of the investigation appears sound. This point concerns the distribution of difficulty indices.

It was observed in Chapter V that we cannot predict the distribution of point-biserial correlation coefficients to be expected for a set of items. We can, however, say that for a given item, the sampling distribution of the point-biserial coefficient will be that of r under certain conditions; and under the same conditions the sampling distribution of the corresponding Fisher's z-values will be approximately normal. These conditions are that during sampling the number of students "passing" the item and the number "failing" it remain constant, and that the criterion scores within each category, "pass" or "fail", are normally distributed.² For a set of items these conditions are unlikely to be met.

²Helen M. Walker and Joseph Lev, Statistical Inference (New York: Henry Holt and Co., Inc., 1953), 271.
In practice they could best be approximated by choosing items such that their transformed difficulty indices were roughly normally distributed about a selected value and had a fairly narrow range. In this way the number of students "passing" and "failing" each item would not be too far from constant. Making allowance for the rejection at the tryout stage of a small number of negatively discriminating items, the distribution of Fisher's z-values corresponding to point-biserial coefficients might not be too far from normal. The appropriate parametric test could then be applied.

It is therefore suggested that the present design be retained, but that greater control be exercised over the difficulty indices of the items selected for the final test. The use of items with transformed difficulty indices distributed approximately normally about 50, with a standard deviation of about 10, would have certain advantages. First, very few items would be exceptionally difficult; most would have difficulty indices between 40 and 60. Thus guessing should be reduced and test reliability improved. In addition, conclusions could be more precisely drawn since they would refer to items whose difficulty-distribution was specified.

Further, it will be necessary to increase substantially the number of items used at the tryout stage. The number used in the final test should also be increased by perhaps fifty per cent. It is therefore plain that the co-operation of the provincial educational authority would be necessary to implement the expanded design.

It will be recalled from Chapter V that the comprehension items manifested a much larger true-variance than did the knowledge items, while exhibiting about the same error-variance.
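As a rough guide to what lengthening the test alone might achieve, the generalized Spearman-Brown formula cited in Section I can be applied to the obtained total-test reliability of .84. The sketch below is illustrative only: the function names are assumptions, and the formula presumes that the added items are of the same quality as the existing ones.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when a test is lengthened by `length_factor`.

    Generalized Spearman-Brown formula: r_k = k*r / (1 + (k - 1)*r).
    """
    k = length_factor
    return k * reliability / (1 + (k - 1) * reliability)


def length_factor_needed(reliability, target):
    """Lengthening factor k needed to raise `reliability` to `target`."""
    return target * (1 - reliability) / (reliability * (1 - target))


doubled = spearman_brown(0.84, 2.0)            # twice as many items
half_again = spearman_brown(0.84, 1.5)         # fifty per cent more items
needed_for_90 = length_factor_needed(0.84, 0.90)
print(round(doubled, 3), round(half_again, 3), round(needed_for_90, 2))
```

Doubling the test projects a reliability of about .91, consistent with the "over .90" noted in Section I; an increase of fifty per cent projects about .89, and roughly 1.7 times the present length would be needed to reach .90 by lengthening alone.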
It would be interesting to know whether it is generally characteristic of comprehension items (and, indeed, of higher category items of all kinds) to "spread" the scores of students more effectively. If it were shown that higher category items did in fact possess superior discrimination, then the ability to "spread" more effectively the scores of students would be a useful adjunct to their discriminating power. This would be particularly important where, as in the case of a cut-off test, it is desired to discriminate among students with particular precision at a stipulated level of difficulty.

Unconnected directly with the present study, but related to it, is the question of the extent to which schools are actually working toward developing in students the ability to apply principles, scientific or otherwise, to the solution of unfamiliar problems. The evidence of this study suggests that there may be a weakness here.

IV. SUMMARY OF THE INVESTIGATION

It was hypothesized, on a priori grounds, that on a self-defining achievement test, equally weighted with items sampling behaviours classifiable in the Knowledge, Comprehension, and Application categories of Bloom's Taxonomy of Educational Objectives, the mean discrimination indices of the items in each category would show an increase from the first- to the last-named category. The distributions of difficulty indices within the three categories were to be similar, to avoid weighting the total or criterion scores unequally with items from any one category, and thus affecting the discrimination indices. The discrimination index chosen was the point-biserial correlation coefficient, since of those indices which utilize all information provided by the test, it is the only one which has properties permitting the employment of a parametric test of significance. The subject chosen was General Science, and the grade level was nine.
Items were written, categorized according to Taxonomy-specifications, and checked by two colleagues. Assembled into two subtests, half of the items were written by one tryout group, and the other half by a second tryout group, limitations of testing time forbidding the use of a single group. Both groups were reasonably, though not highly, similar in scholastic aptitude to the final group.

Items which discriminated negatively were discarded. Of the remainder, the application items were found to be so difficult that they had to be rejected. The knowledge and comprehension items were then matched for difficulty, necessitating further rejections until only forty remained in each category. Two subtests were then assembled, their items matched for relevance category, content, and difficulty, forming essentially "equivalent forms."

Five hundred thirty students wrote both subtests. The reliability coefficient of the total test was .84, and those of the knowledge and comprehension "sub-tests" were .69 and .77, respectively.

Revised difficulty indices were obtained. The adequacy of the original matching was found to have been fair, but not completely satisfactory; the two categories had almost identical mean difficulty indices, but the knowledge category contained two or three very hard items and some easy ones, unmatched by corresponding comprehension items. This was unfortunate, since such items are restricted with respect to the size of point-biserial discrimination index which they can attain.

The point-biserial coefficients were transformed to Fisher's z-values, and the distributions graphed. These were not sufficiently close to the normal form to justify assuming normality in the population. Thus, the non-parametric Mann-Whitney test was employed, rather than the t-test. At the 1% level, the null hypothesis of no difference in medians had to be accepted.
This decision was confirmed by application of the same test to the biserial correlation coefficients, whose advantage is that their maximum size is relatively unaffected by difficulty level.

It was concluded that the hypothesis could not be sustained on the basis of this experiment. However, certain inadequacies and trends combined to suggest that the attack on the problem be not dropped. Suggestions were made for refining the experimental design, and for further research.

BIBLIOGRAPHY

A. BOOKS

Bloom, Benjamin S., et al. Taxonomy of Educational Objectives, Handbook I: Cognitive Domain. New York: Longmans, Green and Co., 1956.

Conrad, Herbert S. "The Experimental Tryout of Test Materials," Educational Measurement, E.F. Lindquist, editor. Washington: Amer. Coun. on Educ., 1951.

Davis, Frederick B. "Item Selection Techniques," Educational Measurement, E.F. Lindquist, editor. Washington: Amer. Coun. on Educ., 1951.

Dressel, Paul L., and Clarence H. Nelson. Questions and Problems in Science: Test Item Folio No. 1. Princeton, N.J.: Co-operative Test Division, Educational Testing Service, 1956.

Ebel, Robert L. "Writing the Test Item," Educational Measurement, E.F. Lindquist, editor. Washington: Amer. Coun. on Educ., 1951.

Furst, Edward J. Constructing Evaluation Instruments. New York: Longmans, Green and Co., 1958.

Gerberich, J. Raymond. Specimen Objective Test Items: A Guide to Achievement Test Construction. New York: Longmans, Green and Co., 1956.

Good, Carter V. (ed.). Dictionary of Education. Second edition. New York: McGraw-Hill Book Co., 1959.

Guilford, Joy P. Psychometric Methods. Second edition. New York: McGraw-Hill Book Co., 1954.

Johnson, Palmer O., and Robert W.B. Jackson. Modern Statistical Methods: Descriptive and Inductive. Chicago: Rand McNally and Co., 1959.

Lindquist, E.F. Design and Analysis of Experiments in Psychology and Education. Boston: Houghton Mifflin Co., 1953.
Peters, Charles C., and W.R. Van Voorhis. Statistical Procedures and their Mathematical Bases. New York: McGraw-Hill Book Co., 1940.

Thorndike, Robert L. Personnel Selection. New York: John Wiley and Sons, Inc., 1949.

________. "Reliability," Educational Measurement, E.F. Lindquist, editor. Washington: Amer. Coun. on Educ., 1951.

________, and Elizabeth Hagen. Measurement and Evaluation in Psychology and Education. New York: John Wiley and Sons, Inc., 1955.

Walker, Helen M., and Joseph Lev. Statistical Inference. New York: Henry Holt and Co., Inc., 1953.

B. PUBLICATIONS OF GOVERNMENT AND OTHER ORGANIZATIONS

British Columbia Department of Education. The Sciences. Victoria, B.C.: The Queen's Printer, 1956.

Ebel, Robert L. "How an Examination Service Helps College Teachers to Give Better Tests," Proceedings of the Invitational Conference on Testing Problems, pp. 3-16. Princeton, N.J.: Educational Testing Service, 1953.

McConnell, T.R. "A Study of the Extent of Measurement of Differential Objectives of Instruction," in An Appraisal of Techniques of Evaluation: Symposium, pp. 3-8. Washington: Amer. Educ. Research Assoc., National Educ. Assoc., February, 1940.

C. PERIODICALS

Adams, James F. "The Effect of Non-Normally Distributed Criterion Scores on Item Analysis Techniques," Educational and Psychological Measurement, XX (No. 2, Summer, 1960), 317-20.

Boneau, C. Alan. "The Effects of Violations of Assumptions Underlying the t Test," Psychological Bulletin, 57 (No. 1, 1960), 49-64.

Cook, Desmond L. "A Note on Relevance Categories and Item Statistics," Educational and Psychological Measurement, XX (No. 3, Summer, 1960), 321-31.

Furst, Edward J. "Effect of the Organization of Learning Experiences upon the Organization of Learning Outcomes," Journal of Experimental Education, 18 (1954), 215-28; 343-52.

Johnson, A. Pemberton. "Notes on a Suggested Index of Item Validity: The U-L Index," Journal of Educational Psychology, 42 (1951), 499-504.

APPENDIX A

TABLE X
CONTENT-SPECIFICATIONS FOR TEST (CONDENSED)

Unit   Content    Content
       Category

I      1          Units, systems of measurement; characteristics of science.
       2          Forms of matter: elements, compounds, mixtures, etc.; properties; physical and chemical change; chemical activity.
       3          Oxygen, hydrogen; preparation, properties, uses. Catalysis. Oxidation, combustion.
       4          Structure of matter: atoms, molecules; electrons, protons, neutrons. Atomic Theory. Law of Definite Proportions. Formulae. Equations.
       5          Acids, bases, salts; electrolysis; types of reaction.
       6          Calcium carbonate and derivatives; limewater test for carbon dioxide. Sodium chloride and derivatives. Uses, laboratory and industrial. Mortar, cement.

II     1          Force, energy, work; elementary problems involving work. Kinetic and potential energy. Conservation of energy. Forms of energy; energy-transformations.
       2          Magnetic properties of some materials; power of iron to "concentrate lines of force;" temporary, permanent magnets. Molecular theory of magnetism. Induction and disruption of magnetism. Variation, inclination.
       3          Magnetic effects of electric current; increasing magnetic field of "live" wire. Relative motion of circuit-wire and magnetic field produces current. Dynamo; A.C., D.C.
       4          Electrostatic charges and friction; compared with magnetism. Conductors and insulators. Charging by induction, conduction; electron-flow in charging. Electroscope. Electrostatic Law. Fields of force. Lightning. Discharging charged bodies.
       5          Current as electron-flow; conductors, insulators. Potential difference. Resistance; factors affecting. Chemical generation of electricity; cells. Circuits.
       6          Transmission of A.C.; transformers. Fuses; structure, function. Heat-effects of electric current. Light bulbs; function of parts. Electrical effects in solution; electrolysis; electroplating.
APPENDIX B

TEST-DIRECTIONS (EXCERPTS)

I. PRELIMINARY ARRANGEMENTS

1. Every student should be warned in advance to provide himself with
   a) 2 pencils
   b) an eraser
   c) some scrap-paper, for rough calculations

2. Each supervisor should have a watch with a second-hand, so that he may be able to adhere strictly to the time-limits.

II. PRETEST DIRECTIONS

Since each test will last exactly 50 minutes, schools operating on a 50-minute period are requested to take a few minutes at the end of the science-period immediately preceding the test-series to read the following test-directions to the participating classes. Schools operating on a 60-minute period will have time to read these directions to the classes at the beginning of the first test-period.

III. DIRECTIONS TO STUDENTS

Please read these without attempting to elaborate.

1. Many of the items on these tests will appear different from any that you have seen before. Most of them require a little bit of thinking as well as just remembering something that you have learned. However, ALL of them CAN be done by any student who has learned the work done to date WELL. So however strange or hard a question may appear at first sight, remember YOU KNOW ENOUGH TO ANSWER IT; all it needs is your knowledge plus a little thought!

2. You may answer questions, even when you are not completely sure that your answers are correct. In such cases, intelligent consideration of the choices provided may help you to gain marks. HOWEVER, you should AVOID WILD GUESSING as this may result in a reduction of your score.

3. Give each question careful thought, but work as quickly as you can. If you find a question too difficult, do not linger over-long on it. Pass on to the next ones, and return later to any that you missed, if you have time.

4. To show you how to mark each item, here is a sample item, taken from the first unit of your text. (Teachers should demonstrate this at the board.)

   ITEM: Which of the following formulae represents a substance that might be expected to form a salt if reacted with vinegar?
   a) HCl     a( )
   b) NaOH    b( x )
   c) NaCl    c( )
   d) H2O     d( )

   The correct answer is b), NaOH, so we write a cross (x) in the parentheses to the right of b, as shown. Make sure that the cross is actually in the parentheses or it may be overlooked and the item marked wrong.

5. Mark only ONE of the 4 choices given for each item.

6. As soon as you receive your paper, complete the required information on the front page (i.e., NAME, LOCATION OF SCHOOL, AND U.P. or G.P.). Do not turn the page until you receive the signal to start.

7. On tests of this kind, it is not expected that any student will get all of the items correct; in fact, it is usual for most average students to score around 50%. The pass-mark will undoubtedly be much less than 50%, so if you have to leave quite a few questions unanswered, do not worry, and DO NOT GUESS.

8. Are there any questions, since you may not ask questions during the test (except regarding misprints or missing pages or things like that)?

IV. DIRECTIONS FOR ADMINISTRATION

If the directions above were read during the science-period preceding the test, please check to see if any student is present for the test who was absent for the directions. Such a student may write the test, but his paper should bear a notation explaining that he missed the directions, since his results will be of no use to me in my project.

1. Hand out papers; remind students, as you do so, to complete the blanks on the front page and wait for the signal to start.

2. Say: YOU HAVE 50 MINUTES. START NOW. (Note precise time of starting.)

3. After exactly 50 minutes, say: STOP WORK. DRAW A LINE UNDER THE HIGHEST-NUMBERED QUESTION YOU HAD TIME TO CONSIDER, WHETHER YOU WERE ABLE TO ANSWER IT OR NOT . . . .
   IF YOU HAD TIME TO CONSIDER ALL OF THE ITEMS, PUT A CHECK MARK IN THE CIRCLE AT THE BOTTOM OF THE FRONT PAGE, WHETHER YOU WERE ABLE TO ANSWER ALL OF THEM OR NOT.

4. Collect the papers.

APPENDIX C

SCIENCE 10 TEST: SUB-TEST I

NAME: ________
U.P. or G.P.: ________
LOCATION OF SCHOOL: ________ (Town or City.)

(DO NOT TURN TO PAGE 1 UNTIL YOU ARE TOLD.)

THE DIAGRAM BELOW REFERS TO TWO ITEMS APPEARING IN THE TEST

[Diagram not reproduced.]

CHECK HERE ( ) if you had time to consider all items.

SCIENCE 10 TEST
SUB-TEST I

1. Which of the following types of chemical reaction best describes that represented by the equation H2CO3 → H2O + CO2?
   a) Combination       a( )
   b) Decomposition     b( X )
   c) Replacement       c( )
   d) Neutralization    d( )

2. Which of the following comparisons of lime-mortar and concrete is true?
   a) Both harden under water.    a( )
   b) Both set to about the same degree of hardness.    b( )
   c) Only lime-mortar undergoes chemical change in hardening.    c( )
   d) Both use limestone at some stage of their preparation.    d( X )

3. (THIS ITEM REFERS TO THE DIAGRAM ON THE FRONT PAGE.) In this diagram, box A contains
   a) an electric generator to produce current at 110 volts for the house-circuits.    a( )
   b) a circuit-breaker to protect the house-circuits from overloading.    b( )
   c) a metering-device to measure the number of kilowatt-hours used each month.    c( )
   d) a device to reduce to 110 volts the current supplied to the house.    d( X )

4. Which of these compounds is used in fire extinguishers and baking powders?
   a) magnesium hydroxide    a( )
   b) magnesium sulphate     b( )
   c) sodium bicarbonate     c( X )
   d) sodium carbonate       d( )

5. The statement has been made: "Never replace the fuse in a house-lighting circuit with one of a higher ampere-rating, except on the advice of a skilled electrician." The main reason for this is that
   a) you may allow dangerous amounts of heat to form.    a( X )
   b) you may cause your range or water-heater to burn out.    b( )
   c) you may receive a dangerous electric shock.    c( )
   d) you may "blow" the other fuses and lose their protection.    d( )

6. The coil and soft-iron core of a dynamo or magneto are together referred to as the
   a) armature                    a( X )
   b) commutator                  b( )
   c) field- or electro-magnet    c( )
   d) split-ring and brushes      d( )

7. Machine A can raise a 50-lb. weight to a height of 40 feet in 5 seconds, while machine B can raise a 100-lb. weight to a height of 20 feet in 10 secs. If A and B perform these tasks, the work done by A will be
   a) one-quarter that done by B    a( )
   b) one-half that done by B       b( )
   c) the same as that done by B    c( X )
   d) twice that done by B          d( )

8. When an ebonite rod is charged by rubbing with fur
   a) positive charges are removed from the fur and transferred to the rod.    a( )
   b) negative charges are removed from the fur and transferred to the rod.    b( X )
   c) positive charges are removed from the rod and transferred to the fur.    c( )
   d) negative charges are removed from the rod and transferred to the fur.    d( )

9. The firing of a rifle is an illustration of the principle that before useful work can be done
   a) potential energy must be converted to kinetic energy.    a( X )
   b) mechanical energy must be converted to potential energy.    b( )
   c) potential energy must be converted to chemical energy.    c( )
   d) mechanical energy must be converted to kinetic energy.    d( )

10. Which of the following statements is true of a compound but not of a mixture? The original ingredients
   a) can be separated out by heating.    a( )
   b) may include both elements and compounds.    b( )
   c) no longer have their original characteristics.    c( X )
   d) have the same total energy as the product(s).    d( )

11. Whether a material can be magnetized depends entirely upon whether or not
   a) it is composed at least partially of iron.    a( )
   b) it is composed of either iron or nickel.    b( )
   c) it is a metal, possessing lines of force.    c( )
   d) it is composed of magnetic particles.    d( X )

12. (THIS ITEM REFERS TO THE DIAGRAM ON THE FRONT PAGE.) In this diagram, the letters, B, refer to a device which is intended to
   a) reduce loss of electrons through contact with conducting materials.    a( X )
   b) prevent people who touch the lower part of the pole from receiving electric shocks.    b( )
   c) prevent damage to wires and pole during lightning storms.    c( )
   d) reduce danger of overheated wires by lowering resistance.    d( )

13. The metals iron, sodium and gold differ greatly in their chemical activity. In which of the following are these metals arranged in order of increasing activity (i.e., least active first)?
   a) iron, sodium, gold    a( )
   b) gold, iron, sodium    b( X )
   c) gold, sodium, iron    c( )
   d) sodium, gold, iron    d( )

14. Which of the following materials is a mixture?
   a) air               a( X )
   b) water vapour      b( )
   c) carbon dioxide    c( )
   d) iron sulphide     d( )

15. In which of the following materials is the Law of Definite Proportions best illustrated?
   a) air         a( )
   b) iron        b( )
   c) concrete    c( )
   d) water       d( X )

16. According to the Electron Theory of matter, most of the elements tend to combine with others if the number of electrons in the outer orbit of each of their atoms is different from 8. Which of the following elements is most likely to have an outer orbit containing exactly 8 electrons?
   a) chlorine    a( )
   b) mercury     b( )
   c) argon       c( X )
   d) calcium     d( )

17. A wire carrying a strong, direct current lies in a north-south direction. Over it is held a compass. If north is in the direction indicated by the arrow at the right, which of the following most nearly represents the new direction of the compass needle?
   [Arrow and answer diagrams a) to d) not reproduced.]
   a( X )  b( )  c( )  d( )

18. What is the main advantage of using soft iron rather than steel as a material for the core of a lifting magnet?
   a) It has less electrical resistance and thus conducts the current more efficiently.    a( )
   b) Its amount of magnetism changes when the current-strength is varied.    b( X )
   c) It obstructs the passage of lines of force, which are thus available for work in lifting objects.    c( )
   d) It is less affected by the heat produced by the flow of electrical energy through the coils.    d( )

19. Consider the electrolysis of water. Which of the following statements, taken from your text, contains the explanation of why the passage of an electric current produces chemical change? In chemical change
   a) the reacting materials always change to different materials.    a( )
   b) the new substances formed always have properties different from those of the reacting materials.    b( )
   c) energy is always given off or absorbed.    c( X )
   d) the weights of the reacting substances always equal the weights of the new substances formed.    d( )

20. To detect (or discover) the presence of a static charge on a body, an appropriate instrument to use would be
   a) a galvanometer       a( )
   b) an electroscope      b( X )
   c) an induction coil    c( )
   d) an ammeter           d( )

21. A certain compound is burned. The products are analyzed and found to be water vapour and carbon dioxide. Which of the formulae listed below might reasonably be considered to be that of the original compound?
   a) C2H2       a( X )
   b) CO         b( )
   c) Ca(OH)2    c( )
   d) CaH2       d( )

22. The resistance of a circuit carrying a direct electric current may be raised by
   a) using a lower voltage than before.    a( )
   b) coiling a part of the wire.    b( )
   c) shortening the length of the circuit.    c( )
   d) using thinner wire of the same material and length.    d( X )

23. [Diagram: tube A containing CuO, followed by tube B containing calcium chloride.]
   A gas, containing no moisture, was passed through tube A, containing heated copper oxide (CuO), and then through tube B, containing calcium chloride, which readily absorbs moisture. At the end of the experiment, a considerable amount of the CuO was found to have been transformed to pure copper (Cu), and the calcium chloride in tube B had become damp. Of the following four gases, the one used in this experiment could have been
   a) oxygen            a( )
   b) hydrogen          b( X )
   c) chlorine          c( )
   d) carbon dioxide    d( )

24. The reason that a dry cell is able to produce a steady current for much longer than an ordinary voltaic cell is that
   a) it has a carbon electrode which does not react with the electrolyte.    a( )
   b) it contains particles of carbon which reduce the cell's internal resistance.    b( )
   c) it contains a substance which reduces the rate at which hydrogen collects.    c( X )
   d) its electrolyte, ammonium chloride, reacts with zinc less vigorously than does sulphuric acid.    d( )

25. A form of slaked lime is sometimes spread on the soil of gardens. The reason for this is that slaked lime
   a) is acidic in nature and hence useful in correcting soils containing too much alkali.    a( )
   b) is a base and hence useful in neutralizing sour soil.    b( X )
   c) reacts with water to form a substance which makes sandy soil less porous.    c( )
   d) contains chemical elements which make it a useful fertilizer.    d( )

26. The statement, "These exist only in sets or pairs", applies
   a) to both electric charges and magnetic poles.
   b) to neither electric charges nor magnetic poles.
   c) only to electric charges and not to magnetic poles.
   d) only to magnetic poles and not to electric charges.

27. Consider the following simplified "equations". (Pt is the chemical symbol for platinum.)
      SO2 + O2 → little reaction at 400°C.
      SO2 + O2 + Pt → SO3 + Pt at 400°C.
   These equations most strongly illustrate
   a) a typical replacement reaction.
   b) the chemical activity of platinum.
   c) the operation of a catalyst.
   d) the formation of a radical.

28. A voltaic cell transforms chemical energy to electrical energy. The chemical energy comes from
   a) the zinc and sulphuric acid.    a( X )
   b) the copper and sulphuric acid.    b( )
   c) the copper and zinc plates.
Of the following four gases, the one used in this experiment could have been
a) oxygen
b) hydrogen
c) chlorine
d) carbon dioxide
a( ) b(X) c( ) d( )

24. The reason that a dry cell is able to produce a steady current for much longer than an ordinary voltaic cell is that
a) it has a carbon electrode which does not react with the electrolyte.
b) it contains particles of carbon which reduce the cell's internal resistance.
c) it contains a substance which reduces the rate at which hydrogen collects.
d) its electrolyte, ammonium chloride, reacts with zinc less vigorously than does sulphuric acid.
a( ) b( ) c(X) d( )

25. A form of slaked lime is sometimes spread on the soil of gardens. The reason for this is that slaked lime
a) is acidic in nature and hence useful in correcting soils containing too much alkali.
b) is a base and hence useful in neutralizing sour soil.
c) reacts with water to form a substance which makes sandy soil less porous.
d) contains chemical elements which make it a useful fertilizer.
a( ) b(X) c( ) d( )

26. The statement, \"These exist only in sets or pairs\", applies
a) to both electric charges and magnetic poles.
b) to neither electric charges nor magnetic poles.
c) only to electric charges and not to magnetic poles.
d) only to magnetic poles and not to electric charges.
a( ) b( ) c( ) d(X)

27. Consider the following simplified \"equations\". (Pt is the chemical symbol for platinum.)
SO2 + O2 → little reaction at 400°C.
SO2 + O2 + Pt → SO3 + Pt at 400°C.
These equations most strongly illustrate
a) a typical replacement reaction.
b) the chemical activity of platinum.
c) the operation of a catalyst.
d) the formation of a radical.
a( ) b( ) c(X) d( )

28. A voltaic cell transforms chemical energy to electrical energy. The chemical energy comes from
a) the zinc and sulphuric acid.
b) the copper and sulphuric acid.
c) the copper and zinc plates.
d) the zinc, copper and sulphuric acid.
a(X) b( ) c( ) d( )

29. During a thunderstorm, the safest place to stay (of those listed below) is
a) in the middle of a large, open, level field.
b) under a tree in the middle of a large, open, level field.
c) in a building with a steel frame.
d) in the bathtub.
a( ) b( ) c(X) d( )

30. Nowadays helium is used, rather than hydrogen, in most balloons employed by the armed forces. This is accounted for, or explained, by the fact that helium
a) is somewhat less easily oxidized than hydrogen.
b) is much more chemically inert than hydrogen.
c) has greater density than hydrogen.
d) does not decompose as easily as hydrogen.
a( ) b(X) c( ) d( )

31. Powdered graphite is frequently used as a lubricator because
a) it becomes liquid easily as heat is produced by friction.
b) its atoms are arranged in flat crystals which can slide easily.
c) its atoms are spherical and behave like tiny ball-bearings.
d) its atoms break down under pressure to form an oily paste.
a( ) b(X) c( ) d( )

32. When a ball swings back and forth on the end of a string, it has kinetic energy due to its motion, and also potential energy due to its height above its lowest point. However, when it stops swinging, it has neither K.E. nor P.E. Why does this not contradict the Law of Conservation of Energy? Because, according to this Law,
a) although energy cannot be produced out of nothing, it can be destroyed.
b) energy cannot be created or destroyed.
c) energy cannot be created, but may be transformed or destroyed.
d) energy can be changed into other forms.
a( ) b( ) c( ) d(X)

33. Each molecule of the compound whose formula is Na3PO4 contains
a) 7 atoms
b) 8 atoms
c) 9 atoms
d) 14 atoms
a( ) b(X) c( ) d( )

34. When an unknown gas is bubbled through clear limewater, the limewater becomes milky.
The gas can be assumed to be carbon dioxide, provided
a) the gas does not react chemically with the limewater.
b) no other substance turns milky with CO2.
c) no other gas turns limewater milky.
d) all of the last three statements are true.
a( ) b( ) c(X) d( )

35. A roll of steel wool was dipped in a liquid which was supposed to prevent oxidation from taking place. It was then sealed in a large container filled only with moist air (a little moisture being necessary for rusting to occur). The container was weighed immediately, and again after several weeks, still unsealed. No change of weight had occurred. What conclusion could be drawn?
a) Since there was no increase in weight, oxidation could not have taken place.
b) Since rusted steel is lighter than pure steel, oxidation could not have taken place.
c) Since the total weight at the end of a chemical change equals that at the beginning, oxidation took place.
d) There is not enough evidence to tell if oxidation occurred.
a( ) b( ) c( ) d(X)

36. Below is a set of equations. The one which best displays the meaning of the term \"radical\" is
a) AgNO3 + KCl → KNO3 + AgCl
b) Ca + Cl2 → CaCl2
c) Ca + MgCl2 → CaCl2 + Mg
d) H2SO3 → H2O + SO2
a(X) b( ) c( ) d( )

37. Which of the following formulae does NOT represent an actual chemical substance?
a) Fe
b) H
c) C
d) Na
a( ) b(X) c( ) d( )

38. When a positively charged glass rod is held near a neutral pith-ball, the ball will be attracted to the rod because
a) a magnetic field of force surrounds the rod and influences the ball.
b) electrons are transferred from the ball to the rod.
c) electrons are transferred from the rod to the ball.
d) electrons are moved over the surface of the ball towards the rod.
a( ) b( ) c( ) d(X)

39.
Of the following formulae, which one represents a substance whose water-solution is likely to feel slippery and to change the colour of red litmus paper?
a) HClO3
b) LiCl
c) HI
d) LiOH
a( ) b( ) c( ) d(X)

40. A transformer may be used to
a) convert an alternating current to a larger current at a higher voltage.
b) transform alternating current into direct current.
c) increase an alternating current and decrease its voltage.
d) step up the voltage of a battery.
a( ) b( ) c(X) d( )

SCIENCE 10 TEST: SUB-TEST II
NAME:          U.P. or G.P.
LOCATION OF SCHOOL: (City or Town)
(DO NOT TURN TO PAGE 1 UNTIL YOU ARE TOLD.)
THE DIAGRAMS BELOW REFER TO AN ITEM APPEARING IN THE TEST
[Diagrams: charged rods; wire carrying current from dry-cell; pith ball on insulating stand; compass-needle]
CHECK HERE ==> if you had time to consider all items.

41. No one has yet been able to make cobalt by joining two or more simpler substances, or to break it down into any other substances, using chemical means. This suggests that cobalt is
a) a metal
b) a non-metal
c) an element
d) a compound
a( ) b( ) c(X) d( )

42. (THIS ITEM REFERS TO THE DIAGRAMS OF THE FIGURE ON THE FRONT PAGE.) Which of the following pairs of these devices could be used in repeating Oersted's famous experiment, establishing that a wire carrying an electric current has a field of force?
a) 1,3
b) 2,3
c) 3,4
d) 3,5
a( ) b( ) c( ) d(X)

43. Electrical currents produced by electrostatic processes are of limited value in our civilization because
a) they are brief and not properly controllable.
b) they are weaker than those produced by regular methods.
c) they are an essentially different kind of electricity from that produced by regular methods.
d) they flow from negative to positive instead of from positive to negative.
a(X) b( ) c( ) d( )

44.
According to the definition of \"work\", which of the following represents a situation in which work is being done?
a) a wooden prop, 4 feet long, holds up a corner of a building weighing 1½ tons.
b) a man pushes against a lawnmower, stuck against a stone, with a force of 40 lb.
c) a piece of rock is thrown 100 feet into the air by a volcano.
d) a mountaineer is supporting his 180-lb. comrade who, having lost his foothold, is dangling from the other end of a 12-ft. rope.
a( ) b( ) c(X) d( )

45. \"Reduction\" is the name given to a chemical process which is just the reverse of oxidation. Select, from the following equation, the formula of the substance which is being reduced:
EQUATION: C + H2O → CO + H2
The substance which is being reduced is
a) C
b) H2O
c) CO
d) H2
a( ) b(X) c( ) d( )

46. If you wished to prepare a sample of carbon dioxide, you could do it by putting some pieces of calcium carbonate in a solution of
a) acid
b) base
c) salt
d) limewater
a(X) b( ) c( ) d( )

47. To distinguish with certainty and ease between aluminum and zinc, which of the following four properties should be investigated?
a) density
b) state
c) combustibility
d) lustre
b( ) c( ) d( )

48. A condition which is necessary for magnetic induction to take place is that the material to be magnetized be
a) placed within a magnetic field of force.
b) tapped with a hammer while held in a north-south direction.
c) stroked with a magnet, always in the same direction.
d) placed within a coil of wire through which an electric current is flowing.
a(X) b( ) c( ) d( )

49. In Europe in the 15th Century, it was known that the needle of a compass did not point to True North, but somewhat to the west of True North. One of the early voyages of Columbus nearly ended in failure when his men discovered that the compass was pointing east of True North. This may be considered evidence that
a) even in the 15th Century, the north magnetic pole was slowly changing its location.
b) in those days, it was not known that the basic principles of magnetism are reversed in the southern hemisphere.
c) the sailors of that time did not know that the earth's rotation changes the relative position of Magnetic and True North.
d) the nature of magnetic variation was not understood in the 15th Century.
a( ) b( ) c( ) d(X)

50. During the 16th Century, Gilbert repeatedly tried to produce an electrical charge upon materials like brass, but failed each time. This was because brass is a
a) non-electrolyte
b) non-magnetic material
c) metallic insulator
d) good conductor
a( ) b( ) c( ) d(X)

51. The number of degrees between the freezing point and boiling point of water on a centigrade thermometer is
a) 212
b) 180
c) 100
d) 32
a( ) b( ) c(X) d( )

52. Recently, a fire started in a heap of fresh, damp sawdust, a material having low conductivity. Of the following statements, which one provides the best explanation of this?
a) Fresh, damp sawdust \"heats\" easily, so it produced combustion.
b) Chemical combination produced energy that could not escape.
c) Damp sawdust has a higher kindling point than dry sawdust.
d) Damp sawdust has a lower kindling point than dry sawdust.
a( ) b(X) c( ) d( )

53. To give a positive charge to an uncharged conductor, we must
a) rub it with fur.
b) rub it with silk.
c) remove some electrons.
d) add some electrons.
a( ) b( ) c(X) d( )

54. The element which is present in the earth's crust in greatest quantity is
a) hydrogen
b) oxygen
c) nitrogen
d) silicon
a( ) b(X) c( ) d( )

55. A method of altering the molecules of liquid oils so as to produce solid fats is known as
a) chlorination
b) oxidation
c) hydrogenation
d) fluoridation
a( ) b( ) c(X) d( )

56. Which of the following best illustrates \"charging by conduction\"?
a) Uncharged body comes in contact with charged body and acquires same charge as that of charged body.
b) Uncharged body comes in contact with charged body and acquires charge opposite to that of charged body.
c) Uncharged body approaches charged body, is grounded, and acquires same charge as that of charged body.
a(X) b( ) c( )

[...] containing a fuse (F) and an [...] of their alternating motion.
d) A's alternating current produces a variable magnetic field whose lines of force cut coil B in one direction, then another.

72. Power losses during transmission of electrical energy are lowest when the current is
a) A.C. with low amperage and high voltage.
b) D.C. with low amperage and high voltage.
c) A.C. with high amperage and low voltage.
d) D.C. with high amperage and low voltage.
a(X) b( ) c( ) d( )

73. Consider the equations:
C + O2 → CO2
CO2 + H2O → H2CO3
Decide which is the most likely result of holding a piece of litmus paper in the fumes of a barbecue-pit (using charcoal or carbon for fuel.)
a) Moist blue litmus paper will turn red.
b) Moist red litmus paper will turn blue.
c) Dry blue litmus paper will turn red.
d) Dry red litmus paper will turn blue.
a(X) b( ) c( ) d( )

74. Which of the following represents an actual compound?
a) CO, b) NO, c) NH3 d) SO4
a( ) b( ) c(X) d( )

75. Which of the following observations refers to a physical, rather than a chemical property?
a) Burns rapidly in oxygen, above 80°C.
b) Boils at 68°C, at standard pressure.
c) Imparts a milky appearance to lime-water.
d) Releases bubbles when some metal is added.
a( ) b(X) c( ) d( )

76. Newspapers sometimes carry reports of the use by scientists of Uranium-235 and Uranium-238. These are samples of the element uranium, whose atoms weigh either 235 units or 238 units. Such reports appear to be in conflict with
a) the Electron Theory of matter.
b) Dalton's Atomic Theory.
c) the Law of Definite Proportions.
d) the Molecular Theory of matter.
a( ) b(X) c( ) d( )

77. Which of the following formulae represents a salt?
a) Mg(OH)2
b) NH3
c) H2S
d) FeSO4
a( )

78. Which of the following equations illustrates the type of chemical reaction known as \"replacement\"?
a) NH3 + 2O2 → HNO3 + H2O
b) Ca + SnCl2 → CaCl2 + Sn
c) Ca + Cl2 → CaCl2
d) CaCO3 → CaO + CO2
a( ) b(X) c( ) d( )

79. The combination of hydrogen (H) from one compound with hydroxyl (OH) from another compound, to produce water, is characteristic of all chemical reactions of a particular kind. Which kind?
a) Replacement
b) Combination
c) Decomposition
d) Neutralization
a( ) b( ) c( ) d(X)

80. Which of the following compounds is manufactured from one or more of the others?
a) sodium carbonate
b) calcium carbonate
c) sodium chloride
d) cellulose
a(X) b( ) c( ) d( )

APPENDIX D

TABLE XI
TAXONOMY CATEGORY, DIFFICULTY INDEX, AND DISCRIMINATION INDICES (POINT BISERIAL AND BISERIAL) FOR ALL ITEMS

Item no.   Taxonomy    Difficulty   r(pbis)   r(bis)
(final)    category    index (%)
 1         2.10 (C)    80.0          .335      .479
 2         1.12 (K)    71.5          .253      .336
 3         2.10 (C)    64.3          .425      .546
 4         1.11 (K)    72.6          .193      .258
 5         2.10 (C)    54.7          .284      .356
 6         1.11 (K)    53.2          .169      .213
 7         2.20 (C)    64.3          .256      .329
 8         1.31 (K)    53.2          .185      .232
 9         2.10 (C)    64.7          .302      .390
10         1.31 (K)    77.4          .295      .411
11         1.32 (K)    47.6          .051      .065
12         2.10 (C)    31.5          .197      .258
13         1.31 (K)    64.3          .327      .420
14         1.11 (K)    65.5          .328      .424
15         2.10 (C)    65.3          .324      .418
16         2.20 (C)    60.4          .320      .406
17         1.12 (K)    32.3          .261      .341
18         2.10 (C)    41.9          .345      .437
19         2.30 (C)    43.6          .178      .225
20         1.25 (K)    55.3          .316      .397
21         2.20 (C)    40.2          .208      .263
22         1.31 (K)    50.6          .334      .418
23         2.20 (C)    42.3          .443      .560
24         1.31 (K)    42.3          .281      .356
25         1.31 (K)    40.4          .364      .462
26         1.31 (K)    32.6          .119      .156
27         2.20 (C)    58.1          .413      .522
28         1.12 (K)    34.2          .123      .159
29         1.12 (K)    43.0          .151      .190
30         2.10 (C)    53.6          .260      .328
31         1.31 (K)    33.4          .352      .457
32         2.10 (C)    39.1          .230      .293
33         2.10 (C)    51.1          .357      .447
34         2.20 (C)    41.5          .300      .379
35         2.20 (C)    39.8          .310      .393
36         2.10 (C)    22.8          .221      .306
37         1.11 (K)    28.1          .126      .168
38         1.31 (K)    39.4          .300      .381
39         2.10 (C)    32.8          .398      .515
40         1.31 (K)    24.7         -.049     -.066
41         1.11 (K)    93.2          .288      .555
42         2.10 (C)    80.0          .289      .413
43         1.31 (K)    66.0          .392      .507
44         2.10 (C)    54.5          .416      .522
45         2.20 (C)    74.5          .203      .276
46         1.31 (K)    49.8          .148      .186
47         2.10 (C)    64.9          .331      .426
48         2.10 (C)    48.3          .350      .440
49         2.10 (C)    59.8          .174      .221
50         1.31 (K)    27.9          .279      .373
51         1.12 (K)    81.9          .276      .405
52         2.20 (C)    64.3          .183      .236
53         1.25 (K)    57.6          .427      .538
54         1.11 (K)    61.7          .130      .167
55         1.11 (K)    72.3          .317      .424
56         2.10 (C)    46.0          .258      .324
57         2.10 (C)    31.1          .345      .452
58         1.31 (K)    42.1          .120      .152
59         1.31 (K)    31.9          .237      .309
60         2.20 (C)    56.4          .231      .292
61         1.31 (K)    53.2          .348      .435
62         2.10 (C)    33.4          .342      .444
63         2.20 (C)    40.6          .306      .389
64         1.11 (K)    61.3          .398      .506
65         1.11 (K)    51.1          .394      .493
66         2.10 (C)    25.9          .025      .034
67         1.25 (K)    31.5          .314      .410
68         2.20 (C)    50.8          .432      .543
69         1.31 (K)    27.9          .246      .329
70         1.31 (K)    27.2          .280      .375
71         2.20 (C)    35.9          .168      .215
72         1.31 (K)    20.4          .184      .261
73         2.20 (C)    24.0          .134      .184
74         1.23 (K)    29.1          .216      .286
75         1.23 (K)    38.5          .259      .331
76         2.20 (C)    27.6          .173      .231
77         2.10 (C)    41.5          .322      .407
78         2.10 (C)    43.0          .238      .301
79         2.10 (C)    33.4          .182      .236
80         1.22 (K)    22.6          .259      .361
"@en ; edm:hasType "Thesis/Dissertation"@en ; edm:isShownAt "10.14288/1.0105669"@en ; dcterms:language "eng"@en ; ns0:degreeDiscipline "Education"@en ; edm:provider "Vancouver : University of British Columbia Library"@en ; dcterms:publisher "University of British Columbia"@en ; dcterms:rights "For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use."@en ; ns0:scholarLevel "Graduate"@en ; dcterms:title "An investigation of the relationship between the relevance category of achievement test items and their indices of discrimination"@en ; dcterms:type "Text"@en ; ns0:identifierURI "http://hdl.handle.net/2429/39012"@en .