Reliability of composite test scores: an evaluation of teachers' weighting methods

Steele, Bonita Marie
Abstract
Teachers often combine component scores from varied item formats to create
composite scores. These composite scores are supposed to reflect students' ability or
achievement level in a specific domain such as mathematics or science. Sometimes these
same teachers weight the component scores before combining them in an effort to
increase the validity of the inferences made from the resulting composite scores (Chen,
Hanson, & Harris, 1998). The resulting composite scores have intuitive, validity-related
merit for the teachers who use them, but their value has not been demonstrated by research.
The present study examined some consequences of such teacher-designed ad-hoc weighting
methods so that their psychometric properties could be evaluated.
The item formats of the two components examined in this study were constructed-response
and multiple-choice. Because of the unequal variances of the two item-format
component scores, and the dissimilar metrics used for the two item formats, the
resulting composite score is considered a congeneric measure.
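For reference, a compact statement of what "congeneric" means in classical test theory (a textbook definition, not a formula given in the abstract itself) is:

```latex
% Congeneric components j = 1, 2 measure a common true score \tau but may
% differ in origin, scale (loading), and error variance:
X_j = \nu_j + \lambda_j \tau + \varepsilon_j, \qquad
\operatorname{Var}(\varepsilon_j) = \theta_j .
% Tau-equivalent measures additionally have \lambda_1 = \lambda_2;
% parallel measures have \lambda_1 = \lambda_2 and \theta_1 = \theta_2.
```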
This study attempted to discover which of the typically used
teacher-designed ad-hoc methods for weighting the two
component scores is best according to a particular criterion. The criterion adopted to
evaluate the worth of a weighting method was the bias in the reliability estimates of the
weighted composite scores.
For the evaluation of bias, the deviations in reliability estimates between un-weighted
composite scores and the corresponding weighted composite scores were
examined. Reliability estimates of the un-weighted composite scores serve as the reference
values calculated from the actual test data, while the reliability estimates of the
weighted composite scores reflect the post-hoc manipulations imposed by some teachers.
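Stated compactly, and assuming bias is taken as a simple difference between the two estimates (the abstract does not spell out the exact form), the criterion for a weighting method w can be written as:

```latex
% Bias of weighting method w: the shift it induces in the estimated
% reliability of the composite, relative to the un-weighted composite.
\operatorname{Bias}(w) \;=\; \hat{\rho}_{C(w)} \;-\; \hat{\rho}_{C(\mathrm{unweighted})} .
```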
If a teacher decides to use an ad-hoc weighting method in an effort to increase the
validity of a composite score formed by combining multiple-choice and constructed-response
items, then he or she should do so without causing much of a change, or bias, in the
reliability of the resulting composite test score. Manipulations such as component-score
weighting can artificially inflate or deflate the reliability estimate of the resulting
composite score. This could produce a composite score that deviates undesirably, in
psychometric terms, from its original form, which in turn could misrepresent
an examinee's ability or achievement level.
In all cases of test design and item combining, the validity of the resulting scores
is of paramount concern. Many teachers have an intuitive sense of this, but lack the
psychometric expertise to examine test scores critically from this perspective. Working
intuitively, teachers have often attempted to ensure the validity of
scores by imposing some ad-hoc weighting method on the multiple-choice and
constructed-response components involved.
Four of the weighting methods examined in this research were chosen because of
their common usage in educational research and public education, and one method was
derived from measurement theory. The component-score weighting methods are: score
points as weights; number of items as weights; number of minutes (time) as weights;
equal weights; and z-scores as weights.
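As an illustration only (the abstract does not give the exact formulas, so the quantities used as weights below, and the helper names, are assumptions), the five schemes might be sketched as follows:

```python
import numpy as np

def weighted_composite(mc, cr, method, mc_info=None, cr_info=None):
    """Combine multiple-choice (mc) and constructed-response (cr) component
    totals into one composite score per student, under one of the five
    weighting schemes named in the abstract.

    mc, cr           : 1-D arrays of students' component totals
    mc_info, cr_info : dicts with hypothetical keys 'points', 'n_items',
                       'minutes' describing each component (assumed layout)
    """
    if method == "score_points":      # weight each component by its maximum score points
        w_mc, w_cr = mc_info["points"], cr_info["points"]
    elif method == "n_items":         # weight by the number of items in each component
        w_mc, w_cr = mc_info["n_items"], cr_info["n_items"]
    elif method == "minutes":         # weight by the testing time allotted to each component
        w_mc, w_cr = mc_info["minutes"], cr_info["minutes"]
    elif method == "equal":           # treat both components equally
        w_mc, w_cr = 1.0, 1.0
    elif method == "z_scores":        # standardize each component, then sum
        z_mc = (mc - mc.mean()) / mc.std(ddof=1)
        z_cr = (cr - cr.mean()) / cr.std(ddof=1)
        return z_mc + z_cr
    else:
        raise ValueError(f"unknown weighting method: {method}")
    return w_mc * np.asarray(mc) + w_cr * np.asarray(cr)
```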
With the minimal amount of measurement training that most classroom teachers
have before they receive their teaching certification, they become accustomed to relying
on intuition, rather than measurement theory, for their test design methodology. Teacher-designed
ad-hoc weighting methods are examples of intuition-based logic that has not yet
been examined in a psychometric fashion. The research question is as follows: What
commonly-used teacher-designed ad-hoc method of weighting and combining multiple-choice
and constructed-response component scores is the best? The selection of the best
method was made according to the criterion of minimal artifactual bias in the resulting
composite score reliability estimates.
With the increased use of assessments that include mixed item types, it is certain
that some school administrators and parents will desire a composite test score.
A composite score brings with it the dilemma of how to combine the component
scores in a manner that maximizes the validity and reliability of the composite
score without artificially inflating or deflating it or violating the principles of
measurement. Crocker and Algina (1986) stated that to interpret composite scores in a
meaningful way, in other words, to make valid inferences from the scores, it is important
to understand how the statistical properties of the scores on each subtest or component
influence those of the composite score.
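One standard way to make that dependence explicit (a classical test theory textbook result, assuming uncorrelated component errors; it is not a formula quoted in the abstract) is:

```latex
% Weighted composite C = \sum_j w_j X_j built from components X_j with
% variances \sigma_j^2, covariances \sigma_{jk}, and reliabilities \rho_j:
\sigma_C^2 = \sum_j w_j^2 \sigma_j^2 \;+\; 2 \sum_{j<k} w_j w_k \, \sigma_{jk},
\qquad
\rho_C = 1 \;-\; \frac{\sum_j w_j^2 \sigma_j^2 \,(1 - \rho_j)}{\sigma_C^2}.
```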
When attempting to combine component test scores to produce a single composite
score, a problem arises if the component scores are not based on the same
item format. When multiple-choice and constructed-response are the two item formats of
the respective components, strict application of the measurement principles of parallel
measures reveals that test homogeneity does not exist. Because of this, the scalings of
the measurements of the two components differ by more than just an additive constant,
and hence a composite score is illogical (Lord & Novick, 1968).
Multiple-choice items are typically scored dichotomously, whereas
constructed-response items are typically scored polytomously. Dichotomously
scored items carry no magnitude of quantity, just a simple binary code that
indicates whether the item was answered correctly. Polytomously scored items are scored in a
graded manner, with the content of the response rated on an arbitrary
scale in an attempt to measure the magnitude of the attribute in question (Russell, 1897).
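For concreteness, here is a hypothetical scoring setup (not drawn from the thesis itself) in which a multiple-choice item is keyed 0/1 and a constructed-response item is rated on a 0-3 rubric:

```python
def score_mc(response, key):
    """Dichotomous scoring: 1 if the keyed option was chosen, 0 otherwise."""
    return 1 if response == key else 0

def score_cr(rubric_rating, max_points=3):
    """Polytomous scoring: a rater assigns an integer from 0 to max_points."""
    if not 0 <= rubric_rating <= max_points:
        raise ValueError("rating outside the rubric range")
    return rubric_rating
```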
Like many things in the realm of educational psychology, pure measurement
principles provide good guidelines; but in application, they can also provide unrealistic
restrictions. Current educational/psychological assessment research in North America is
riddled with violations of measurement principles, but that does not mean that
psychological constructs are not being measured in a relatively efficient and revealing
manner. In small-scale assessments across North America, component scores of unlike
item types are often combined to form composite scores, regardless of the fact that these
scores may be deemed illogical by researchers such as Michell (1990) and Lord and
Novick (1968).
It is likely that many future assessments will be made up of a combination of
multiple-choice and constructed-response item types (Wainer & Thissen, 1993). Often,
weighting methods are imposed on such item components with minimal psychometric
foundation. For the sake of fair measurement of students' abilities and achievement
levels, research regarding this psychometric question is necessary. The findings of
the present study provide a basis of forethought and understanding for the
selection of an ad-hoc weighting method, which will
enable teachers to be more certain about the precision of their reported composite scores.
The recent evolution of educational assessment described above has resulted in
a shift in focus in educational research. An example of this is Chen, Hanson, and Harris's
(1998) findings regarding classroom teachers' typical ad-hoc methods of weighting multiple-choice
and constructed-response item scores in an effort to produce a composite score.
The unknown/questionable nature of the reliability of the composite scores derived using
these ad-hoc weighting methods is essentially the problem that was investigated in this
thesis.
The main research task posed by the present study was to find out which teacher-designed
ad-hoc component-score weighting method, designed for the combining of two
component scores (one made of multiple-choice items and the other of constructed-response
items), was the best. The evaluative criterion of composite score reliability bias
was examined across variations in class size and item ratio (multiple-choice:
constructed-response) to ensure that conclusions were reflective of real-life classroom
sizes and test-item ratios. By examining three factors (class size,
test mixture, and weighting method), the general tendency of each
weighting method to artificially deflate or inflate the composite-score reliability
estimates was evaluated. It was found that the component-score weighting method that
treated the components equally, called the equal weights (EQUAL) weighting method in
the present study, led to the least deflation/inflation in the composite score
reliability estimates. The limitations of the study should be kept in mind: the results were
calculated using test scores from the academic domain of science only, and these test
scores were gathered only from Canadian grade eight students.
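A minimal sketch of that evaluation, assuming Cronbach's alpha as the reliability estimator and the classical weighted-composite formula shown earlier (the abstract does not name the exact estimators, so the data layout and function names here are assumptions), might look like this:

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (examinees x items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    return k / (k - 1) * (1 - items.var(axis=0, ddof=1).sum()
                          / items.sum(axis=1).var(ddof=1))

def weighted_composite_reliability(mc_tot, cr_tot, rho_mc, rho_cr, w_mc, w_cr):
    """Classical-test-theory reliability of w_mc*MC + w_cr*CR
    (assumes uncorrelated component errors)."""
    var_mc, var_cr = np.var(mc_tot, ddof=1), np.var(cr_tot, ddof=1)
    cov = np.cov(mc_tot, cr_tot)[0, 1]
    var_c = w_mc**2 * var_mc + w_cr**2 * var_cr + 2 * w_mc * w_cr * cov
    err = w_mc**2 * var_mc * (1 - rho_mc) + w_cr**2 * var_cr * (1 - rho_cr)
    return 1 - err / var_c

def reliability_bias_table(samples, methods):
    """Tabulate reliability bias for each (class size, item ratio) sample
    and each weighting method.

    `samples` is assumed to map (class_size, item_ratio) to a dict holding
    item-level matrices 'mc_items' and 'cr_items' (hypothetical layout);
    `methods` maps a method name to its (w_mc, w_cr) weight pair.
    """
    bias = {}
    for (size, ratio), data in samples.items():
        mc, cr = data["mc_items"], data["cr_items"]
        mc_tot, cr_tot = mc.sum(axis=1), cr.sum(axis=1)
        rho_mc, rho_cr = cronbach_alpha(mc), cronbach_alpha(cr)
        # un-weighted reference: alpha over all raw items pooled together
        reference = cronbach_alpha(np.hstack([mc, cr]))
        for name, (w_mc, w_cr) in methods.items():
            weighted = weighted_composite_reliability(
                mc_tot, cr_tot, rho_mc, rho_cr, w_mc, w_cr)
            bias[(size, ratio, name)] = weighted - reference
    # the "best" method is the one whose bias stays closest to zero across cells
    return bias
```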
The Third International Mathematics and Science Study (TIMSS) data was used
for this research. The TIMSS data set was selected because it included both multiple-choice
and constructed-response item formats.
Item Metadata
| Field | Value |
| --- | --- |
| Title | Reliability of composite test scores: an evaluation of teachers' weighting methods |
| Creator | Steele, Bonita Marie |
| Publisher | University of British Columbia |
| Date Issued | 2000 |
| Extent | 4759969 bytes |
| File Format | application/pdf |
| Language | eng |
| Date Available | 2009-07-07 |
| Provider | Vancouver : University of British Columbia Library |
| Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
| DOI | 10.14288/1.0053909 |
| Degree Grantor | University of British Columbia |
| Graduation Date | 2000-05 |
| Scholarly Level | Graduate |
| Aggregated Source Repository | DSpace |