Reliability of composite test scores: an evaluation of teachers' weighting methods

Steele, Bonita Marie
Abstract
Teachers often combine component scores from varied item formats to create
composite scores. These composite scores are supposed to reflect students' ability or
achievement level in a specific domain such as mathematics or science. Sometimes these
same teachers weight the component scores before combining them in an effort to
increase the validity of the inferences made from the resulting composite scores (Chen,
Hanson, & Harris, 1998). The resulting composite scores have intuitive, validity-related
merit for the teachers who use them, but their value has not been demonstrated by research.
The present study examined some consequences of such teacher-designed ad-hoc weighting
methods so that their psychometric properties could be evaluated.
The item formats of the two components examined in this study were constructed-response
and multiple-choice. Because of the unequal variances of the two item-format
component scores, and the dissimilar metrics used for the two item formats, the
resulting composite score is considered a congeneric measure.
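For reference, a compact statement of what "congeneric" means in classical test theory (a textbook definition, not a formula given in the abstract itself) is:

```latex
% Congeneric components j = 1, 2 measure a common true score \tau but may
% differ in origin, scale (loading), and error variance:
X_j = \nu_j + \lambda_j \tau + \varepsilon_j, \qquad
\operatorname{Var}(\varepsilon_j) = \theta_j .
% Tau-equivalent measures additionally have \lambda_1 = \lambda_2;
% parallel measures have \lambda_1 = \lambda_2 and \theta_1 = \theta_2.
```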
This study attempted to discover which of the typically used
teacher-designed ad-hoc methods for weighting the two
component scores is best according to a particular criterion. The criterion adopted to
evaluate the worth of a weighting method was the bias in the reliability estimates of the
weighted composite scores.
For the evaluation of bias, the deviations in reliability estimates between un-weighted
composite scores and the corresponding weighted composite scores were
examined. Reliability estimates of the un-weighted composite scores serve as the reference
values calculated from the actual test data, while the reliability estimates of the
weighted composite scores reflect the post-hoc manipulations imposed by some teachers.
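Stated compactly, and assuming bias is taken as a simple difference between the two estimates (the abstract does not spell out the exact form), the criterion for a weighting method w can be written as:

```latex
% Bias of weighting method w: the shift it induces in the estimated
% reliability of the composite, relative to the un-weighted composite.
\operatorname{Bias}(w) \;=\; \hat{\rho}_{C(w)} \;-\; \hat{\rho}_{C(\mathrm{unweighted})} .
```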
If a teacher decides to use an ad-hoc weighting method in an effort to increase the
validity of a composite score formed by combining multiple-choice and constructed-response
items, then he or she should do so without causing much of a change, or bias, in the
reliability of the resulting composite test score. Manipulations such as component-score
weighting can artificially inflate or deflate the reliability estimate of the resulting
composite score. This could produce a composite score that deviates undesirably, in
psychometric terms, from its original form, which in turn could misrepresent
an examinee's ability or achievement level.
In all cases of test design and item combining, the validity of the resulting scores
is of paramount concern. Many teachers have an intuitive sense of this, but lack the
psychometric expertise to examine test scores critically from this perspective. Working
intuitively, teachers have often attempted to ensure the validity of
scores by imposing some ad-hoc weighting method on the multiple-choice and
constructed-response components involved.
Four of the weighting methods examined in this research were chosen because of
their common usage in educational research and public education, and one method was
derived from measurement theory. The component-score weighting methods are: score
points as weights; number of items as weights; number of minutes (time) as weights;
equal weights; and z-scores as weights.
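As an illustration only (the abstract does not give the exact formulas, so the quantities used as weights below, and the helper names, are assumptions), the five schemes might be sketched as follows:

```python
import numpy as np

def weighted_composite(mc, cr, method, mc_info=None, cr_info=None):
    """Combine multiple-choice (mc) and constructed-response (cr) component
    totals into one composite score per student, under one of the five
    weighting schemes named in the abstract.

    mc, cr           : 1-D arrays of students' component totals
    mc_info, cr_info : dicts with hypothetical keys 'points', 'n_items',
                       'minutes' describing each component (assumed layout)
    """
    if method == "score_points":      # weight each component by its maximum score points
        w_mc, w_cr = mc_info["points"], cr_info["points"]
    elif method == "n_items":         # weight by the number of items in each component
        w_mc, w_cr = mc_info["n_items"], cr_info["n_items"]
    elif method == "minutes":         # weight by the testing time allotted to each component
        w_mc, w_cr = mc_info["minutes"], cr_info["minutes"]
    elif method == "equal":           # treat both components equally
        w_mc, w_cr = 1.0, 1.0
    elif method == "z_scores":        # standardize each component, then sum
        z_mc = (mc - mc.mean()) / mc.std(ddof=1)
        z_cr = (cr - cr.mean()) / cr.std(ddof=1)
        return z_mc + z_cr
    else:
        raise ValueError(f"unknown weighting method: {method}")
    return w_mc * np.asarray(mc) + w_cr * np.asarray(cr)
```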
With the minimal amount of measurement training that most classroom teachers
have before they receive their teaching certification, they become accustomed to relying
on intuition, rather than measurement theory, for their test design methodology. Teacher-designed
ad-hoc weighting methods are examples of intuition-based logic that has not yet
been examined in a psychometric fashion. The research question is as follows: What
commonly-used teacher-designed ad-hoc method of weighting and combining multiple-choice
and constructed-response component scores is the best? The selection of the best
method was made according to the criterion of minimal artifactual bias in the resulting
composite score reliability estimates.
With the increased use of assessments that include mixed item types, it is certain
that some school administrators and parents will desire a composite test score.
A composite score brings with it the dilemma of how to combine the component
scores in a manner that maximizes the validity and reliability of the composite
score without artificially inflating or deflating it or violating the principles of
measurement. Crocker and Algina (1986) stated that to interpret composite scores in a
meaningful way, in other words, to make valid inferences from the scores, it is important
to understand how the statistical properties of the scores on each subtest or component
influence those of the composite score.
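One standard way to make that dependence explicit (a classical test theory textbook result, assuming uncorrelated component errors; it is not a formula quoted in the abstract) is:

```latex
% Weighted composite C = \sum_j w_j X_j built from components X_j with
% variances \sigma_j^2, covariances \sigma_{jk}, and reliabilities \rho_j:
\sigma_C^2 = \sum_j w_j^2 \sigma_j^2 \;+\; 2 \sum_{j<k} w_j w_k \, \sigma_{jk},
\qquad
\rho_C = 1 \;-\; \frac{\sum_j w_j^2 \sigma_j^2 \,(1 - \rho_j)}{\sigma_C^2}.
```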
When attempting to combine component test scores to produce a single composite
score, a problem arises if the component scores are not based on the same
item format. When multiple-choice and constructed-response are the two item formats of
the respective components, strict application of the measurement principles of parallel
measures reveals that test homogeneity does not exist. Because of this, the scalings of
the measurements of the two components differ by more than just an additive constant,
and hence a composite score is illogical (Lord & Novick, 1968).
Multiple-choice items are typically scored dichotomously, whereas
constructed-response items are typically scored polytomously. Dichotomously
scored items carry no magnitude of quantity, just a simple binary code that
indicates whether the item was answered correctly. Polytomously scored items are scored in a
graded manner, with the content of the response rated on an arbitrary
scale in an attempt to measure the magnitude of the attribute in question (Russell, 1897).
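For concreteness, here is a hypothetical scoring setup (not drawn from the thesis itself) in which a multiple-choice item is keyed 0/1 and a constructed-response item is rated on a 0-3 rubric:

```python
def score_mc(response, key):
    """Dichotomous scoring: 1 if the keyed option was chosen, 0 otherwise."""
    return 1 if response == key else 0

def score_cr(rubric_rating, max_points=3):
    """Polytomous scoring: a rater assigns an integer from 0 to max_points."""
    if not 0 <= rubric_rating <= max_points:
        raise ValueError("rating outside the rubric range")
    return rubric_rating
```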
Like many things in the realm of educational psychology, pure measurement
principles provide good guidelines; but in application, they can also provide unrealistic
restrictions. Current educational/psychological assessment research in North America is
riddled with violations of measurement principles, but that does not mean that
psychological constructs are not being measured in a relatively efficient and revealing
manner. In small-scale assessments across North America, component scores of unlike
item types are often combined to form composite scores, regardless of the fact that these
scores may be deemed illogical by researchers such as Michell (1990) and Lord and
Novick (1968).
It is likely that many future assessments will be made up of a combination of
multiple-choice and constructed-response item types (Wainer & Thissen, 1993). Often,
weighting methods are imposed on such item components with minimal psychometric
foundation. For the sake of fair measurement of students' abilities and achievement
levels, research regarding this psychometric question is necessary. The findings of
the present study provide a basis of forethought and understanding for the
selection of an ad-hoc weighting method, which will
enable teachers to be more certain about the precision of their reported composite scores.
The recent evolution of educational assessment described above has resulted in
a shift in focus in educational research. An example of this is Chen, Hanson, and Harris's
(1998) findings regarding classroom teachers' typical ad-hoc methods of weighting multiple-choice
and constructed-response item scores in an effort to produce a composite score.
The unknown/questionable nature of the reliability of the composite scores derived using
these ad-hoc weighting methods is essentially the problem that was investigated in this
thesis.
The main research task posed by the present study was to find out which teacher-designed
ad-hoc component-score weighting method, designed for the combining of two
component scores (one made of multiple-choice items and the other of constructed-response
items), was the best. The evaluative criterion of composite score reliability bias
was examined across variations in class size and item ratio (multiple-choice:
constructed-response) to ensure that conclusions were reflective of real-life classroom
sizes and test-item ratios. By examining three factors (class size,
test mixture, and weighting method), the general tendency of each
weighting method to artificially deflate or inflate the composite-score reliability
estimates was evaluated. It was found that the component-score weighting method that
treated the components equally, called the equal weights (EQUAL) weighting method in
the present study, led to the least deflation/inflation in the composite score
reliability estimates. The limitations of the study should be kept in mind: the results were
calculated using test scores from the academic domain of science only, and these test
scores were gathered only from Canadian grade eight students.
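A minimal sketch of that evaluation, assuming Cronbach's alpha as the reliability estimator and the classical weighted-composite formula shown earlier (the abstract does not name the exact estimators, so the data layout and function names here are assumptions), might look like this:

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (examinees x items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    return k / (k - 1) * (1 - items.var(axis=0, ddof=1).sum()
                          / items.sum(axis=1).var(ddof=1))

def weighted_composite_reliability(mc_tot, cr_tot, rho_mc, rho_cr, w_mc, w_cr):
    """Classical-test-theory reliability of w_mc*MC + w_cr*CR
    (assumes uncorrelated component errors)."""
    var_mc, var_cr = np.var(mc_tot, ddof=1), np.var(cr_tot, ddof=1)
    cov = np.cov(mc_tot, cr_tot)[0, 1]
    var_c = w_mc**2 * var_mc + w_cr**2 * var_cr + 2 * w_mc * w_cr * cov
    err = w_mc**2 * var_mc * (1 - rho_mc) + w_cr**2 * var_cr * (1 - rho_cr)
    return 1 - err / var_c

def reliability_bias_table(samples, methods):
    """Tabulate reliability bias for each (class size, item ratio) sample
    and each weighting method.

    `samples` is assumed to map (class_size, item_ratio) to a dict holding
    item-level matrices 'mc_items' and 'cr_items' (hypothetical layout);
    `methods` maps a method name to its (w_mc, w_cr) weight pair.
    """
    bias = {}
    for (size, ratio), data in samples.items():
        mc, cr = data["mc_items"], data["cr_items"]
        mc_tot, cr_tot = mc.sum(axis=1), cr.sum(axis=1)
        rho_mc, rho_cr = cronbach_alpha(mc), cronbach_alpha(cr)
        # un-weighted reference: alpha over all raw items pooled together
        reference = cronbach_alpha(np.hstack([mc, cr]))
        for name, (w_mc, w_cr) in methods.items():
            weighted = weighted_composite_reliability(
                mc_tot, cr_tot, rho_mc, rho_cr, w_mc, w_cr)
            bias[(size, ratio, name)] = weighted - reference
    # the "best" method is the one whose bias stays closest to zero across cells
    return bias
```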
The Third International Mathematics and Science Study (TIMSS) data was used
for this research. The TIMSS data set was selected because it included both multiple-choice
and constructed-response item formats.
Item Metadata
| Field | Value |
| --- | --- |
| Title | Reliability of composite test scores: an evaluation of teachers' weighting methods |
| Creator | Steele, Bonita Marie |
| Publisher | University of British Columbia |
| Date Issued | 2000 |
| Extent | 4759969 bytes |
| File Format | application/pdf |
| Language | eng |
| Date Available | 2009-07-07 |
| Provider | Vancouver : University of British Columbia Library |
| Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
| DOI | 10.14288/1.0053909 |
| Degree Grantor | University of British Columbia |
| Graduation Date | 2000-05 |
| Scholarly Level | Graduate |
| Aggregated Source Repository | DSpace |