UBC Theses and Dissertations
An investigation of the relationship between the relevance category of achievement test items and their indices of discrimination
McKie, Thomas Douglas Muir
It was hypothesized that on an achievement test, items measuring complex cognitive objectives would exhibit a higher mean discrimination index (based on the whole test as criterion) than would an equal number of items measuring less complex cognitive objectives; and that the mean discrimination index of these items would in turn be higher than that of the same number of still less complex items. The proviso was made that the difficulty indices of the items be similarly distributed within the several categories of items, hereafter called "relevance-categories," since discriminating power is related to difficulty. The categories selected were, from simplest to most complex, the Knowledge, Comprehension, and Application categories of Bloom's "Taxonomy of Educational Objectives." An achievement test was constructed, consisting of items in all three categories, and covering the content of two units of the British Columbia university-programme grade nine science course. A try-out of this test, on 200 students in two schools, permitted negatively discriminating items to be rejected and, in addition, provided difficulty indices for the remaining items. It was possible to match forty Knowledge items and forty Comprehension items very closely for difficulty; however, the mean difficulty of the Application items was so high that they could not be used in a test of the hypothesis without reducing the number of items too drastically in all categories. Two "equivalent forms," matched for content, relevance-category, and difficulty, were constructed from these eighty items and administered to 530 students in three schools. The reliability coefficient of the total test, estimated by correlating the sub-test scores and applying the Spearman-Brown formula, was .84; those of the Knowledge and Comprehension categories were similarly found to be .69 and .77, respectively. Revised difficulty indices, based on the new and larger sample, were calculated.
Their distributions within the two relevance categories were found to be very similar, though not as closely matched as on the basis of the try-out test. For each item, the point-biserial coefficient of correlation between item and total score was computed as the selected index of discrimination, and Fisher's z-transformation was applied to produce measures with more nearly an equal-unit scale, in the hope that the parametric t-test could be used. However, the shapes of the resulting distributions were such that they could not be claimed to be samples from a normal population or populations. Accordingly, the t-test was rejected in favour of the non-parametric Mann-Whitney test of "no difference in median discrimination indices." The respective medians were .27 and .30, in terms of Fisher's z-values, but the difference proved to be non-significant at the pre-selected 1% level of significance. It was concluded that this experiment provided no grounds for accepting the hypothesis of the study. However, the actual probability of obtaining, in random sampling from a single population, a difference as large as that observed was only about .10; in addition, the results consistently favoured the Comprehension items, whose discrimination indices exceeded those of the Knowledge items at the extremes as well as at the mean. It was therefore suggested that if adequate testing time could be obtained, the use of larger numbers of items in all categories might increase test-reliability and possibly produce a significant result. Suggestions were advanced, based upon observations from the data, for refining the experiment and for further research.
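The statistical steps named in the abstract (Spearman-Brown step-up, point-biserial item-total correlation, Fisher's z-transformation, and the Mann-Whitney test) can be illustrated with a minimal sketch. This is not the author's computation: the function names, the toy data, and the use of a large-sample normal approximation without tie or continuity correction are all assumptions made for illustration only.

```python
import math
from statistics import mean, pstdev


def spearman_brown(r_part, factor=2.0):
    """Step up a part-test correlation to a test `factor` times as long."""
    return factor * r_part / (1.0 + (factor - 1.0) * r_part)


def point_biserial(item, total):
    """Correlation between a 0/1 item score and the total test score."""
    right = [t for i, t in zip(item, total) if i == 1]
    wrong = [t for i, t in zip(item, total) if i == 0]
    p = len(right) / len(item)  # proportion answering the item correctly
    return (mean(right) - mean(wrong)) / pstdev(total) * math.sqrt(p * (1 - p))


def fisher_z(r):
    """Fisher's z-transformation, z = atanh(r)."""
    return math.atanh(r)


def mann_whitney_u(x, y):
    """U for sample x: count of pairs with x_i > y_j, ties counted as one half."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)


def normal_approx_p(u, n1, n2):
    """Two-sided p-value from the normal approximation to U
    (assumption: no tie correction, no continuity correction)."""
    mu = n1 * n2 / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return math.erfc(abs(u - mu) / sigma / math.sqrt(2.0))


# Hypothetical toy data, not taken from the thesis.
item = [0, 0, 1, 1, 1, 0]            # one item, scored wrong/right per student
total = [12, 15, 22, 30, 19, 17]     # total test score per student
z_item = fisher_z(point_biserial(item, total))

knowledge_z = [0.18, 0.22, 0.27, 0.31, 0.25]       # z-values, one per item
comprehension_z = [0.24, 0.30, 0.33, 0.29, 0.36]
u = mann_whitney_u(comprehension_z, knowledge_z)
p = normal_approx_p(u, len(comprehension_z), len(knowledge_z))
print(round(spearman_brown(0.72), 2), round(z_item, 2), u, round(p, 3))
```

As a point of arithmetic only, a correlation of about .72 between two half-tests would step up to roughly .84 under the Spearman-Brown formula; the abstract reports the stepped-up value but not the underlying sub-test correlation. With forty items per category, as in the study, the normal approximation would be more defensible than it is for the five-item toy samples used here.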