UBC Theses and Dissertations
Reliability of composite test scores: an evaluation of teachers' weighting methods
Steele, Bonita Marie
Abstract
Teachers often combine component scores from varied item formats to create composite scores. These composite scores are intended to reflect students' ability or achievement level in a specific domain such as mathematics or science. Sometimes teachers weight the component scores before combining them in an effort to increase the validity of the inferences made from the resulting composite scores (Chen, Hanson, & Harris, 1998). Such composite scores have intuitive, validity-related merit for the teachers who use them, but no research-demonstrated value. The present study examined some consequences of such teacher-designed ad-hoc weighting methods so that their psychometric properties could be evaluated.

The item formats of the two components examined in this study were constructed-response and multiple-choice. Because of the unequal variances of the two component scores and the dissimilar metric used for each item format, the resulting composite score is considered a congeneric measure. This study attempted to discover which of the typically used teacher-designed ad-hoc methods for weighting the two component scores is best according to a particular criterion: the bias in the reliability estimates of the weighted composite scores.

To evaluate bias, the deviations in reliability estimates between un-weighted composite scores and the corresponding weighted composite scores were examined. Reliability estimates of un-weighted composite scores provide the true estimates calculated from the actual test data, while the reliability estimates of the weighted composite scores reflect the post-hoc manipulations imposed by some teachers. If a teacher decides to use an ad-hoc weighting method in an effort to increase the validity of a composite score formed by combining multiple-choice and constructed-response items, he or she should be able to do so without causing much change, or bias, in the reliability of the resulting composite test score. Manipulations such as component-score weighting can artificially inflate or deflate the reliability estimate of the resulting composite score. This could produce a composite score that deviates undesirably, in psychometric terms, from its original form, which in turn could misrepresent an examinee's ability or achievement level.

In all cases of test design and item combining, the validity of the resulting scores is of paramount concern. Many teachers have an intuitive sense of this but lack the psychometric expertise to examine test scores critically from this perspective. Working intuitively, teachers have often tried to ensure the validity of scores by imposing ad-hoc weighting methods on the multiple-choice and constructed-response components involved.

Four of the weighting methods examined in this research were chosen because of their common usage in educational research and public education, and one was derived from measurement theory. The component-score weighting methods are: score points as weights; number of items as weights; number of minutes (time) as weights; equal weights; and z-scores as weights.
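The abstract names the five weighting schemes but not their exact operationalization. As a reading aid only, the following minimal Python sketch shows one plausible interpretation of each scheme applied to a multiple-choice (MC) and a constructed-response (CR) component; the class data, item counts, score points, and testing minutes are invented for illustration and are not taken from the thesis.

```python
import numpy as np

# Hypothetical class of 5 students: raw MC and CR component totals.
mc = np.array([18, 22, 15, 25, 20], dtype=float)  # e.g., 30 one-point MC items
cr = np.array([9, 12, 7, 14, 10], dtype=float)    # e.g., 5 CR items worth 3 points each

# Hypothetical test characteristics used to define the ad-hoc weights.
n_items = {"mc": 30, "cr": 5}    # number of items as weights
points  = {"mc": 30, "cr": 15}   # score points (component maxima) as weights
minutes = {"mc": 30, "cr": 25}   # allotted testing time as weights

def composite(w_mc, w_cr):
    """Weighted composite: w_mc * MC + w_cr * CR."""
    return w_mc * mc + w_cr * cr

composites = {
    "POINTS": composite(points["mc"], points["cr"]),
    "ITEMS":  composite(n_items["mc"], n_items["cr"]),
    "TIME":   composite(minutes["mc"], minutes["cr"]),
    "EQUAL":  composite(1.0, 1.0),
    # z-scores as weights: standardize each component before summing,
    # which equalizes the component variances.
    "Z": (mc - mc.mean()) / mc.std(ddof=1) + (cr - cr.mean()) / cr.std(ddof=1),
}

for name, c in composites.items():
    print(f"{name:>6}: {np.round(c, 2)}")
```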
With the minimal amount of measurement training most classroom teachers receive before certification, they become accustomed to relying on intuition, rather than measurement theory, for their test-design methodology. Teacher-designed ad-hoc weighting methods are an example of intuition-based logic that has not yet been examined psychometrically. The research question is as follows: Which commonly used, teacher-designed ad-hoc method of weighting and combining multiple-choice and constructed-response component scores is best? The best method was selected according to the criterion of minimal artifactual bias in the resulting composite-score reliability estimates.

With the increased use of assessments that include mixed item types, it is certain that some school administrators and parents will want a composite test score. A composite score brings with it the dilemma of how to combine the component scores in a manner that maximizes the validity and reliability of the composite without artificially inflating or deflating it and violating the principles of measurement. Crocker and Algina (1986) stated that to interpret composite scores in a meaningful way, that is, to make valid inferences from the scores, it is important to understand how the statistical properties of the scores on each subtest or component influence those of the composite score.

When component test scores are combined to produce a single composite score, a problem arises if the component scores are not based on the same item format. When multiple-choice and constructed-response are the two item formats of the respective components, strict measurement principles of parallel measures reveal that test homogeneity does not exist. Because of this, the scalings of the two components differ by more than an additive constant, and hence a composite score is illogical (Lord & Novick, 1968). Multiple-choice items are typically scored dichotomously, whereas constructed-response items are typically scored polytomously. Dichotomously scored items carry no magnitude of quantity, just a binary code indicating whether the item was answered correctly. Polytomously scored items are scored in a more continuous manner: the content of each response is rated on an arbitrary scale in an attempt to measure the magnitude of the attribute in question (Russell, 1897).

Like many things in the realm of educational psychology, pure measurement principles provide good guidelines, but in application they can also impose unrealistic restrictions. Current educational and psychological assessment research in North America is riddled with violations of measurement principles, yet that does not mean psychological constructs are not being measured in a relatively efficient and revealing manner. In small-scale assessments across North America, component scores of unlike item types are often combined to form composite scores, even though such scores may be deemed illogical by researchers such as Michell (1990) and Lord and Novick (1968). It is likely that many future assessments will combine multiple-choice and constructed-response item types (Wainer & Thissen, 1993). Often, weighting methods are imposed on such item components with minimal psychometric foundation.
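A worked formula makes concrete how the statistical properties of the components feed into the composite. Under classical test theory (in the spirit of Lord & Novick, 1968), the reliability of a weighted composite C = w₁X₁ + w₂X₂ of a multiple-choice component X₁ and a constructed-response component X₂ can be written in terms of the component reliabilities, variances, and covariance; the notation below is mine, not the thesis author's, and it assumes uncorrelated errors across the two components.

```latex
% Reliability of the weighted composite C = w_1 X_1 + w_2 X_2,
% assuming uncorrelated errors across the two components.
\rho_{CC'} \;=\; 1 \;-\;
  \frac{w_1^{2}\sigma_1^{2}\,(1-\rho_{11'}) \;+\; w_2^{2}\sigma_2^{2}\,(1-\rho_{22'})}
       {\sigma_C^{2}},
\qquad
\sigma_C^{2} \;=\; w_1^{2}\sigma_1^{2} + w_2^{2}\sigma_2^{2} + 2\,w_1 w_2\,\sigma_{12}.
```

Because the weights scale the error variances and the covariance term differently, changing the weights changes the composite reliability even though the item responses themselves are unchanged; this is the kind of artifactual inflation or deflation the thesis uses as its evaluation criterion.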
For the sake of fair measurement of students' abilities and achievement levels, research on this psychometric question is necessary. The findings of the present study provide a basis of forethought and understanding for the selection of an ad-hoc weighting method, enabling teachers to be more certain about the precision of the composite scores they report.

The recent evolution of educational assessment described above has resulted in a shift of focus in educational research. An example is Chen, Hanson, and Harris' (1998) findings regarding classroom teachers' typical ad-hoc methods of weighting multiple-choice and constructed-response item scores to produce a composite score. The unknown, questionable reliability of composite scores derived using these ad-hoc weighting methods is essentially the problem investigated in this thesis.

The main research task of the present study was to find out which teacher-designed ad-hoc component-score weighting method, designed for combining two component scores (one made up of multiple-choice items and the other of constructed-response items), was best. The evaluative criterion of composite-score reliability bias was examined across variations in class size and item ratio (multiple-choice to constructed-response) to ensure that conclusions reflected real-life class sizes and test-item ratios. By examining three factors (class size, test mixture, and weighting method), the general tendency of each weighting method to artificially deflate or inflate composite-score reliability estimates was evaluated. It was found that the weighting method that treated the components equally, called the equal weights (EQUAL) method in the present study, led to the least deflation or inflation in the composite-score reliability estimates. The limitations of the study should be kept in mind: the results were calculated using test scores from the academic domain of science only, and these scores were gathered only from Canadian grade eight students. Data from the Third International Mathematics and Science Study (TIMSS) were used for this research; the TIMSS data set was selected because it includes both multiple-choice and constructed-response item formats.
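To illustrate the evaluation criterion, the sketch below computes the composite reliability from the formula given earlier for several weighting schemes and reports the deviation from the un-weighted value. This is a toy illustration, not the thesis's actual procedure or data: the component variances, covariance, reliabilities, and weights are invented, and the thesis estimated reliabilities from the TIMSS item data rather than from summary statistics like these.

```python
def composite_reliability(w1, w2, var1, var2, cov12, rel1, rel2):
    """Classical-test-theory reliability of C = w1*X1 + w2*X2,
    assuming uncorrelated errors across the two components."""
    var_c = w1 ** 2 * var1 + w2 ** 2 * var2 + 2 * w1 * w2 * cov12
    err_c = w1 ** 2 * var1 * (1 - rel1) + w2 ** 2 * var2 * (1 - rel2)
    return 1 - err_c / var_c

# Invented component statistics: X1 = multiple-choice, X2 = constructed-response.
var_mc, var_cr, cov = 16.0, 9.0, 7.2
rel_mc, rel_cr = 0.85, 0.70

# Reference value: the un-weighted composite (raw component scores simply summed).
rho_unweighted = composite_reliability(1, 1, var_mc, var_cr, cov, rel_mc, rel_cr)

# Hypothetical weights standing in for some of the ad-hoc schemes.
# In this simplified setup EQUAL coincides with the un-weighted reference;
# the thesis defined and estimated these quantities from actual item responses.
schemes = {"ITEMS": (30, 5), "POINTS": (30, 15), "TIME": (30, 25), "EQUAL": (1, 1)}

for name, (w1, w2) in schemes.items():
    rho_w = composite_reliability(w1, w2, var_mc, var_cr, cov, rel_mc, rel_cr)
    print(f"{name:>6}: reliability = {rho_w:.3f}, bias = {rho_w - rho_unweighted:+.3f}")
```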
Item Metadata
Title |
Reliability of composite test scores: an evaluation of teachers' weighting methods
|
Creator |
Steele, Bonita Marie
|
Publisher |
University of British Columbia
|
Date Issued |
2000
|
Extent |
4759969 bytes
|
Genre | |
Type | |
File Format |
application/pdf
|
Language |
eng
|
Date Available |
2009-07-07
|
Provider |
Vancouver : University of British Columbia Library
|
Rights |
For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use.
|
DOI |
10.14288/1.0053909
|
URI | |
Degree | |
Program | |
Affiliation | |
Degree Grantor |
University of British Columbia
|
Graduation Date |
2000-05
|
Campus | |
Scholarly Level |
Graduate
|
Aggregated Source Repository |
DSpace
|