RELATIVE PERFORMANCE OF SCORING DESIGNS FOR THE ASSESSMENT OF CONSTRUCTED RESPONSES

by

SHARON GALE BARNETT

M.A., The University of British Columbia, 2001
B.Ed., The University of Winnipeg, 1986

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES
Measurement, Evaluation, and Research Methodology

THE UNIVERSITY OF BRITISH COLUMBIA
June 2005

© Sharon Gale Barnett, 2005

ABSTRACT

This study used computer simulation to investigate the relative performance of three rater allocation designs and an objective scoring condition at two levels of comparison: comparison to the examinee's true ability, and comparison to the ability estimate obtained under the objective scoring condition. At both comparison levels, score-level (theta estimate) differences across theta bins and decisions at four cut-off levels were investigated. At comparison level one, small differences across scoring designs were found in the accuracy of the ability estimates obtained under the scoring designs investigated. However, large differences in the percentage of accurate pass and fail decisions were found at this first comparison level. This pattern of findings was repeated at comparison level two, the comparison to the objective scoring estimates and decisions. The decisions associated with the row allocation design in particular were somewhat erratic when compared with the other scoring methods at both comparison levels, making it the least desirable of the rater allocation designs. The random and spiral allocation designs provided similar levels of accuracy in estimating examinee ability and similar levels of decision accuracy. These allocation designs also showed similar levels of discrepancy from, and similar percentages of agreement with, the decisions obtained under the objective scoring condition.
TABLE OF CONTENTS

ABSTRACT
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF EQUATIONS
ACKNOWLEDGEMENTS
CHAPTER I: CONTEXT OF THE STUDY
    Challenges in Scoring Constructed Responses
CHAPTER II: REVIEW OF THE LITERATURE
    Definition and Use of Constructed Response Items
    The Effect of Raters in Scoring Constructed Responses
        Rater Effects
    Influence of Technology in the Scoring of Constructed Responses
        Movement toward Automated and Human Scoring
        Movement from Traditional to Computer Interface Scoring
    The Use and Setting of Cut-off Levels
    Development of Psychometric (Analysis) Models
        Facets
        Hierarchical Rater Models
        Rater Bundle Model
    Rater Allocation Designs
        Fully Crossed Design
        Random Allocation Design
        Nested Allocation Designs
    Recent Findings at the Intersection of Rater Designs and Analysis Models
        Study 1: Patz, Wilson, and Hoskens (1997)
        Study 2: Hombo, Donoghue, and Thayer (2001)
        Study 3: Lee and Sykes (2000)
    Research Questions
CHAPTER III: METHOD
    Overview
        Comparison Level 1
        Comparison Level 2
    Scoring Designs
        Random Allocation Designs
        Nested Allocation Designs
            Rater Nested in Examinee (Row)
            Spiral Allocation
        Objective Scoring
    Item Characteristics (Tests)
    Cut-off Levels
    Dependent Variables
        Level 1: Comparison to True Theta
            Score Differences
            Decisions
        Level 2: Comparison to the Objective Estimate
            Discrepancy
            Decisions
    Steps in the Simulation
        Step 1: Generation of the Simulated Rater-Scored Item Responses
            Model Used in the Simulation of Data (Data Generation)
            Rater Parameters
        Step 2: Psychometric Analysis of the Rating Data
            Estimation of Examinee Ability
        Step 3: Analysis of Simulation Results
            Statistical Properties
    Comparison Rules
    Summary
CHAPTER IV: RESULTS
    Section A: Comparison to True Theta
        Accuracy of the Scoring Designs Across Levels of Examinee Ability
            Relative Accuracy of the Rater Allocation Designs
            Accuracy of the Objective Scoring
            Accuracy Across Theta Bin Levels
        Consistency of the Scoring Designs Across Theta Bins
            Relative Consistency of the Rater Allocation Designs
            Consistency of the Objective Scoring Design
        Decision Accuracy
            Correct Decision to Pass an Examinee
            Correct Decision to Pass for the Objective Scoring
            Correct Decision to Fail
            Decision Accuracy for a Fail Across Allocation Designs
            Decision Accuracy for a Fail with Objective Scoring
    Section B: Comparison to Objective Scoring Estimate
        Estimate Differences
        Decisions Compared to Those of Objective Scoring
            Decision Agreement for a Pass
            Decision Agreement for a Fail
CHAPTER V: DISCUSSION
    Level 1: Comparison to True Theta
    Level 2: Comparison to the Objective Scoring
    Contributions of this Study
    Limitations of this Study
    Future Research
    Summary
REFERENCES
APPENDIX A: Item Parameters for Dichotomously Scored Items
APPENDIX B: Differences in Squared Bias for Test 1
APPENDIX C: Differences in Squared Bias for Test 2
APPENDIX D: Differences in Squared Bias for Test 3
APPENDIX E: Differences in Mean Square Error for Test 1
APPENDIX F: Differences in Mean Square Error for Test 2
APPENDIX G: Differences in Mean Square Error for Test 3
APPENDIX H: Percentage Differences in Decision Accuracy (Pass) for Test 1
APPENDIX I: Percentage Differences in Decision Accuracy (Pass) for Test 2
APPENDIX J: Percentage Differences in Decision Accuracy (Pass) for Test 3
APPENDIX K: Percentage Differences in Decision Accuracy (Fail) for Test 1
APPENDIX L: Percentage Differences in Decision Accuracy (Fail) for Test 2
APPENDIX M: Percentage Differences in Decision Accuracy (Fail) for Test 3
APPENDIX N: Discrepancy Differences for Test 1
APPENDIX O: Discrepancy Differences for Test 2
APPENDIX P: Discrepancy Differences for Test 3
APPENDIX Q: Percentage Difference for the Decision Agreement Between the Rater Allocation Designs and the Objective Scoring (Pass) for Test 1
APPENDIX R: Percentage Difference for the Decision Agreement Between the Rater Allocation Designs and the Objective Scoring (Pass) for Test 2
APPENDIX S: Percentage Difference for the Decision Agreement Between the Rater Allocation Designs and the Objective Scoring (Pass) for Test 3
APPENDIX T: Percentage Difference for the Decision Agreement Between the Rater Allocation Designs and the Objective Scoring (Fail) for Test 1
APPENDIX U: Percentage Difference for the Decision Agreement Between the Rater Allocation Designs and the Objective Scoring (Fail) for Test 2
APPENDIX V: Percentage Difference for the Decision Agreement Between the Rater Allocation Designs and the Objective Scoring (Fail) for Test 3

LIST OF TABLES

Table 1. Item Parameters for Tests 1-3
Table 2. Theta and SEM at Each Cut-off Level of the Domain Score
Table 3. Squared Bias for Each Theta Bin in Test 1
Table 4. Squared Bias for Each Theta Bin in Test 2
Table 5. Squared Bias and Greatest Difference for Each Theta Bin in Test 3
Table 6. Bias for Each Theta Bin in Test 1
Table 7. Bias for Each Theta Bin in Test 2
Table 8. Bias for Each Theta Bin in Test 3
Table 9. Mean Square Error for Each Theta Bin in Test 1
Table 10. Mean Square Error for Each Theta Bin in Test 2
Table 11. Mean Square Error for Each Theta Bin in Test 3
Table 12. Percentage of Correct Pass Decisions in Test 1
Table 13. Percentage of Correct Pass Decisions in Test 2
Table 14. Percentage of Correct Pass Decisions in Test 3
Table 15. Percentage of Correct Fail Decisions in Test 1
Table 16. Percentage of Correct Fail Decisions in Test 2
Table 17. Percentage of Correct Fail Decisions in Test 3
Table 18. Differences Between the Rater Allocation Designs and Objective Scoring at Each Theta Bin Level for Test 1
Table 19. Differences Between the Rater Allocation Designs and Objective Scoring at Each Theta Bin Level for Test 2
Table 20. Differences Between the Rater Allocation Designs and Objective Scoring at Each Theta Bin Level for Test 3
Table 21. Percentage of Agreement with the Objectively Scored Test on a Decision to Pass at Each of the Cut-off Levels for Test 1
Table 22. Percentage of Agreement with the Objectively Scored Test on a Decision to Pass at Each of the Cut-off Levels for Test 2
Table 23. Percentage of Agreement with the Objectively Scored Test on a Decision to Pass at Each of the Cut-off Levels for Test 3
Table 24. Percentage of Decision Agreement for a Fail in Test 1
Table 25. Percentage of Decision Agreement for a Fail in Test 2
Table 26. Percentage of Decision Agreement for a Fail in Test 3

LIST OF FIGURES

Figure 1. Scoring Model with Rater
Figure 2. Basic Spiral Pattern
Figure 3. Rater Nested in Examinee (Row) Allocation Design
Figure 4. Test Information and Standard Error for Test 1
Figure 5. Test Information and Standard Error for Test 2
Figure 6. Test Information and Standard Error for Test 3
Figure 7. Decision Grid
Figure 8. Bias Across Theta Bins for Test 1
Figure 9. Bias Across Theta Bins for Test 2
Figure 10. Bias Across Theta Bins for Test 3
Figure 11. Mean Square Error for Test 1
Figure 12. Mean Square Error for Test 2
Figure 13. Mean Square Error for Test 3
Figure 14. Percentage of Correct Pass Decisions Across Cut-Off Levels in Test 1
Figure 15. Percentage of Correct Pass Decisions Across Cut-Off Levels in Test 2
Figure 16. Percentage of Correct Pass Decisions Across Cut-Off Levels in Test 3
Figure 17. Percentage of Correct Fail Decisions Across Cut-Off Levels in Test 1
Figure 18. Percentage of Correct Fail Decisions Across Cut-Off Levels in Test 2
Figure 19. Percentage of Correct Fail Decisions Across Cut-Off Levels in Test 3
Figure 20. Differences Between the Rater Allocation Designs and Objective Scoring at Each Theta Bin Level for Test 1
Figure 21. Differences Between the Rater Allocation Designs and Objective Scoring at Each Theta Bin Level for Test 2
Figure 22. Differences Between the Rater Allocation Designs and Objective Scoring at Each Theta Bin Level for Test 3
Figure 23. Percentage of Pass Decision Agreement Across Cut-Off Levels in Test 1
Figure 24. Percentage of Pass Decision Agreement Across Cut-Off Levels in Test 2
Figure 25. Percentage of Pass Decision Agreement Across Cut-Off Levels in Test 3
Figure 26. Percentage of Fail Decision Agreement Across Cut-Off Levels in Test 1
Figure 27. Percentage of Fail Decision Agreement Across Cut-Off Levels in Test 2
Figure 28. Percentage of Fail Decision Agreement Across Cut-Off Levels in Test 3

LIST OF EQUATIONS

Equation 1. Squared Bias of the Ability Estimate
Equation 2. Mean Square Error
Equation 3. Transformation to the Domain Score
Equation 4. Two-Parameter Logistic Model for Dichotomously Scored Rater Data
Equation 5. Classical η²
Equation 6. Partial η²

ACKNOWLEDGEMENTS

I wish to thank the following people who have, each in their own way, contributed to the success of this study and assisted me in the completion of my program of studies:

My thesis supervisor, Professor Bruno D. Zumbo, for his support, guidance, and commitment to the successful completion of this dissertation. His wonderful sense of humour and encouragement were most certainly appreciated.

Dr. Anita Hubley, for her attention to detail. Thank you, Anita, for the enthusiasm and energy that you brought to this task. I have, by the way, purchased a punctuation primer.

Dr. Laurie Ford, for her encouragement and emphasis on clear expression. Eventually, I will get that CBC talk down.

The departmental examiner, Dr. Marion Porath, who is a living example of professionalism. Her encouragement and support throughout my graduate studies have been appreciated.
Dr. Catherine McClellan, for her generous assistance with PARSCALE and for her evident dedication to student research.

Dr. Kadriye Ercikan, for her continued, quiet support over the years, and for believing that I can contribute to the field of measurement, even when I was not so sure.

The university examiner, Dr. Ann Anderson, and the external examiner, Dr. Walter Muir, for their time and effort.

My amazing parents, Len and Evelyn Mendes, for their ongoing encouragement, support, and love.

My husband, Lance, and son, Khyber, who somehow survived the moods, cursing, and other side effects that accompanied the completion of this thesis and program of study. I am looking forward to spending more time with both of you.

CHAPTER I
CONTEXT OF THE STUDY

The use of constructed response items in large-scale and high-stakes testing is becoming common (Patz, Junker, Johnson, & Mariano, 2002; Wilson & Hoskens, 2001). Currently, constructed response items are routinely used in exams such as the National Assessment of Educational Progress (NAEP) and California's Golden State Exam. The United States military also employs constructed response items in its testing. In Canada, constructed response items are used in academic testing at the provincial level, such as the Foundation Skills Assessment in British Columbia and the Yukon Achievement Tests. Some provinces have instituted mandatory graduating exams for students leaving grade 12, such as Alberta's Diploma and Achievement Examination Programs and British Columbia's Provincial Graduating Exams. The purpose of these examination programs is, in general, to establish provincial-level performance standards, promote accountability, and ensure equity for students who plan to enter post-secondary studies. Scores from graduating exams may also be used to determine provincial scholarships. In short, decisions resulting from these exams have the potential to affect the choices open to the examinee. Often included in exams such as these are constructed response items.

Challenges in Scoring Constructed Responses

Scoring challenges accompany the use of constructed responses. At times, a few constructed responses can make up a considerable portion of the final score. For example, in British Columbia's English 12 (2003, 2004) Graduating Exam, four constructed response items made up 69% of the total exam mark. Given the emphasis placed on the results of these items, high scoring accuracy is needed. When traditional items such as multiple-choice or true-or-false are used, scoring is a relatively simple process because there is only one possible correct response. When constructed responses are used, however, a rating is generated. At this time, constructed response items are often scored by trained professionals or field experts. These experts are commonly referred to as raters. The use of raters in the scoring of constructed response items has been most frequently challenged on reliability grounds (Lukhele, Thissen, & Wainer, 1994). Certainly, examinees (or parents) could question whether the outcome of the test (a different score, or a pass rather than a fail) would have been different if scoring had been objective rather than carried out by a human rater. Raters often demonstrate consistent individual differences in rating tendencies, with some raters being consistently either more lenient or more severe than others. This variation in rater accuracy is observable (Engelhard, 1996).

Given this, the score a student obtains could depend on who scored the response. Even with rigorous training and frequent calibration checks, raters still contribute a source of uncertainty that translates into overall measurement error (Haertel & Linn, 1996), potentially biasing test results. This uncertainty has the potential to affect decisions and provides challenges for equating (e.g., Muraki, Hombo, & Lee, 2000). Several avenues to reduce the impact of differential ratings have been explored. These avenues include the extensively researched effects of rater training, the inclusion of a rater parameter in the scoring model, and the use of various rater allocation designs.

Selection of rater allocation design has been shown in previous research to have an effect on estimates of examinee ability and on measurement error (Hombo, Donoghue, & Thayer, 2001; Patz, Wilson, & Hoskens, 1997). Although some research has shown that the selection of rater allocation design can greatly influence test scores (Patz et al., 1997), its potential impact has only minimally been explored. It has been recognized that the allocation of raters to item responses in tasks that contain more than one constructed response item can affect the total outcome score (Sykes, Heidorn, & Lee, 1999). At this time, however, studies evaluating the decisions associated with the outcome scores made under differing rater allocation designs are not evident in the literature.

Accurate and consistent estimates of examinee ability (θ̂) at every level of true examinee ability (θ) are desirable. In addition, we want to ensure that a decision to pass or to fail an examinee is accurate at whatever cut-off score is required by the intention and characteristics of a given test. Taking this a bit further, one question of some concern when using rater-scored constructed responses is whether an examinee would have received a different score if he or she had taken an objectively scored test. In other words, we would like to know whether we would have reached the same conclusions about the examinee's ability and made the same decision to pass or fail that examinee under both a rater and an objective scoring condition. Two levels of concern are apparent here. At the first level are concerns about comparative levels of accuracy and consistency across scoring designs. This concern extends to the decision to pass or fail an examinee. It would be logical to expect that differences in accuracy and consistency would translate to differences in the pass and fail decisions. The second level of concern is about the similarity or difference between ability estimates obtained from rater scoring of a test and objective scoring of the same test. Again, if differences in the ability estimates were found between the rater scoring and the objective scoring, then differences are also likely to be seen in the resulting pass or fail decisions.

The purpose of this study was to investigate the relative accuracy and consistency of ability estimates across various levels of true examinee ability (θ) for three rater allocation designs and an objective scoring condition as defined by item response theory (IRT). In this study, differences in the accuracy of a pass or fail decision resulting from application of the scoring design conditions at the various levels of θ were also of interest.
The further purpose of this study was to compare the ability estimates obtained under each rater allocation design condition with those of the objective scoring condition. Again, the similarity of the pass and fail decisions resulting from application of the scoring design conditions at the various levels of θ was also of interest. To these ends, Chapter II will review the literature associated with the purpose of this study and close with the research questions. Chapter III will describe the methodology associated with this simulation study. The results of this study are presented in Chapter IV and the findings discussed in Chapter V.

CHAPTER II
REVIEW OF THE LITERATURE

This review of the literature examines developments in the accurate scoring of constructed responses. It begins with the definition of constructed response items and an explanation of the complexity associated with their use. Included also is an examination of how technology has influenced the scoring of constructed responses. Specifically, developments in the automated scoring of constructed responses, innovations in digital technology, and their influence on the movement from traditional scoring to computer interface scoring are presented. This is followed by a brief examination of the use and setting of cut-off levels in assessment. An important step in the accurate scoring of constructed responses was the development, refinement, and promotion of psychometric (analysis) models that detect and incorporate leniency/severity differences among raters. Three notable analysis models are examined here. These analysis models have been the subjects of recent studies, along with methods of allocating raters to score constructed responses. These recently scrutinized rater allocation designs are also examined here. Finally, three landmark studies at the intersection of rater designs and analysis models are presented in detail.

Definition and Use of Constructed Response Items

A constructed response item is one that requires an examinee to provide a written response to a prompt. Often the focus of constructed responses is on problem solving, reasoning, and the integration of knowledge. Essays, creative compositions, and written responses to open-ended questions are examples of constructed response items, and they are now used routinely in educational assessment. Proponents of constructed response items claim greater face validity (Birenbaum & Tatsuoka, 1987), and this item type is perceived to be a more authentic form of assessment when compared with multiple-choice, true/false, or fill-in-the-blank items (Hombo et al., 2001; Muraki, Hombo, & Lee, 2000; Patz et al., 2002; Williamson, Johnson, Sinharay, & Bejar, 2002). Constructed responses allow examinees to demonstrate skills that are not easily accessed with traditional items. Unfortunately, as noted by Williamson et al. (2002), "when assessment design calls for the use of constructed response tasks the complexity of such tasks produces ripples of complexity throughout the assessment" (p. 2).

The Effect of Raters in Scoring Constructed Responses

Despite the widespread use of constructed response items, the inclusion of raters in the scoring model brings limitations. The use of constructed response items poses several challenges for scoring and for comparability. The measurement constructs are also likely to be complex and, at times, multidimensional and unstable across contexts. Human scoring of constructed responses is time consuming and relatively expensive when compared to the scoring of traditional items or automated scoring (Wainer & Thissen, 1993). Further, human scoring of constructed responses decreases generalizability (Clauser, Subhiyah, Nungester, Ripkey, Clyman, & McKinley, 1995) and contributes an additional source of variance to the scoring model. A traditional scoring model for multiple-choice items, for example, includes examinee proficiency, item characteristics, and one interaction. With constructed response items, the probability that a given response will earn a particular score depends on examinee proficiency, item characteristics, and rater characteristics (severity/leniency). The inclusion of a rater in the model, then, adds an additional source of variance to the traditional model, resulting in four interactions, as may be seen in Figure 1.

Figure 1. Scoring Model with Rater

Each additional interaction represents a potential source of bias. Lack of consistency between raters can add variability to the data, thus reducing the reliability of the assessment and the validity of the inferences. Linn (1998) noted a large discrepancy between the standards set by raters when examining multiple-choice, short answer, and extended response type items, a finding that raises questions about the conceptual coherence of the judgments.

Rater Effects

Rater effects are the systematic consequences of mistakes or inconsistencies of raters. Specifically, a rater effect is the difference between a rater's average and the average of all ratings. If the rater effect is zero, no systematic bias exists in the scores. Because of rater errors, the rater effect is rarely zero. These effects, in the form of differential severity, have been detected in large-scale tests such as the National Assessment of Educational Progress (NAEP; Hombo et al., 2001; Patz et al., 2002). The size and scope of rater effects have been documented in previous research (Engelhard, 1994, 1996; Myford & Mislevy, 1995; Wilson & Case, 1996; Wilson & Wang, 1995). Wilson and Case (1996) found that raters could vary in multiple ways. Several forms of intra-rater effects, such as the halo effect, stereotyping, scale shrinking, and rater drift, have been identified. The halo effect refers to a situation in which the impressions that a rater forms about an examinee on one dimension influence impressions of that person on other dimensions. The viewpoints and past experiences of an evaluator can affect how he or she interprets behavior. Stereotyping, where the impressions that a rater forms about an entire group can alter his or her impressions about a group member, may also occur. Some raters will not use the ends of any scale; this form of rater effect is known as scale shrinking. Raters can also drift, becoming more severe or lenient over the course of a rating period. Between-rater variation may be due to background and personality differences between raters and/or to differential effects of training. Rater bias is present when individual raters have consistent tendencies to be differentially severe or lenient in rating particular test items. When raters vary in severity, the same raw scores derived from two raters are not necessarily the result of the same ability estimates (Wilson & Wang, 1995). This can have serious consequences for the reported results of educational tests and assessments.
In this system the e-rater identifies features in the text of a constructed response (essays) that have the characteristics of good writing identified in a scoring guide used by trained human raters. Each of the features is assigned a weight based on the score assigned by an ETS-trained rater to essays written to the same prompt. Chung and O'Neil (1997) noted the role played by human scoring in the examination of two methods of automated scoring of constructed responses. Project Essay Grade (PEG), for example, begins by having trained raters score constructed responses. Computerized predicted scores are correlated with the scores from the raters to produce reliability estimates. Latent  Scoring Designs 11 Semantic Analysis (LSA) also compares rater scores with these of computed scores for essays. Movement From Traditional to Computer Interface Scoring  Technology is also beginning to impact scoring procedures. Technology has made it easier for data to be passed more efficiently from one location to another distant location. Digital imaging of constructed responses has been refined, making it more feasible to consider responses to unique items rather than booklets. According to Sykes, Heidorn, and Lee (1999), "With the advent of imaging of cr. [constructed response] item responses the logistic problem of allocating papers to readers is effectively solved" (p. 6). As such, there is currently a movement awayfromthe traditional "table" scoring of constructed response items toward the presentation of items via the computer to raters at different locations and this trend is expected to expand (Patz et al., 2002). This has allowed test developers and researchers opportunities to more easily vary the patterns by which raters were allocated to score examinee constructed responses in order to potentially minimize the effect of rater variance in the examinee proficiency estimates (Hombo et al., 2001; Patz et al., 1997; Wilson & Hoskens , 2001). Myford and Cline (2002) noted that scoring with human raters usually involves bringing raters to a central location, training them, monitoring their work, data entry of ratings, and then analysis of the assessment data. This can be a lengthy and costly process. Traditionally, the scoring of constructed responses follows a procedure that evolved out of experience with performance assessments of students' writing. There are several methods of scoring currently being used: (a) holistic scoring whereby a single, overall holistic judgment about the quality of a response is formed, (b) analytic scoring in which responses are  Scoring Designs  12  rated on several different dimensions, and (c) primary trait scoring where raters ascertain the extent that a given exercise fulfills a specific purpose. Procedurally, a scoring rubric is developed first. A scoring rubric is a set of scoring guidelines that describe the characteristics of the different levels of performance used in scoring or judging a performance. The rubric provides a set of ordered categories to which a given piece of work can be compared and specifies the qualities or processes that must be exhibited in order for an item response to be assigned a particular evaluative rating. A rubric may also provide raters with criteria to differentiate between a correct and incorrect response (dichotomous scoring). 
Model papers, conventionally referred to as anchor papers, are then chosen to exemplify each performance level or the boundaries between each successive level in order to ensure (or support) comparability. Raters are trained in the use of the rubric before scoring live papers. During operational scoring in a traditional scoring scenario the raters are often arranged at tables with a table leader. A table leader may literally behind  read  a rater to monitor performance. Second scorings may be used for quality control  purposes. To ensure that rater drift does not occur some previously scored papers are seeded throughout those being scored for the first time. In this way, rater accuracy may be continually monitored (Haertel & Linn, 1996). Generally, quality control starts off as random then becomes targeted. Blind double-reads may also take place whereby standard raters second-score other pieces rated first by other standard raters for the purpose of summative quality control of raters. Wilson and Hoskens (2001) noted that multiple re-rating of work is often done in research and development studies but less often in large-scale testing practice.  Scoring Designs 13 Patz et al. (2002) noted that as digital technology improves, decentralized online scoring will eventually replace the traditional method of having raters in one room with a table leader. Raters instead will work in a decentralized manner infrontof a computer terminal at a location of their choosing. NAEP (2002) for example, now scans images of item responses to their constructed-response items and presents them via computer to human raters for scoring. The province of British Columbia is currently investigating this procedure for use with the scoring of the Foundation Skills Assessments (FSA). This development makes more convenient the presentation of isolated items to raters and may be generally more cost effective. This development also has important implications for rater allocation. No longer will rater allocation be restricted by location or booklet. Digital technology has made possible random or patterned assignment of raters to items without consideration of the number of raters available in a particular location. Patz et al. (2002), however, noted that under these new conditions supervisors would no longer be able to monitor things such as body language and flipping. Discussion between raters and supervisors would be difficult even when supervisors are available to provide on-line assistance. As such, with computer interface scoring, there is greater opportunity for raters to get off track and less opportunity to "bring them back into the fold when they stray" (Patz et al., 2002, p. 381). It has been suggested that to comprehensively minimize this inevitable rater variation, methods of making corrections in an ongoing way need be developed (Patz et al., 1997). Possible (non-exclusive) ways of minimizing rater variation in the scoring of constructed responses may be found through the inclusion of the rater in the analysis model and by selecting an appropriate method of allocating raters to score items.  Scoring Designs 14 The Use and Setting of Cut-off Levels Research into the scoring of constructed responses has been concentrated heavily in the areas of rater training, automated scoring, and reliability. Little attention has been given to validity of ratings data (Harwell, 1999). Phrased differently, little attention has been given to the effect that rater error has on the validity of the consequential decisions. 
Patz et al. (1997) questioned the inferences that were made by NAEP from ratings data when rater effects were ignored. One way to examine the effect of the rater error on the validity of decisions is to examine pass/fail decisions at particular cut-off levels. Cut-off levels are what delineate success from failure. Previous research examined decisions made around cut-scores for different proficiency levels (Ercikan & Julian, 2002) and for training and scoring variables in writing assessments (Moon & Hughes, 2002). Despite the fact that high stakes decisions are made at established cut-off levels, we currently know little about how these decisions are influenced by the scoring design. In defining substantive issues that need to be addressed in future performance assessment research, Englehard (1996) left open the question of whether a cut-off score is needed to define acceptable rater accuracy and how the cut-off levels should be determined. Although setting cut-off levels has potentially high stakes consequences for examinees, such as in professional licensure or program placement decisions, their determination can be a subjective exercise. Hambleton (2001) noted that there is general agreement that cut-off levels are constructed rather than found. In other words, the determination of these levels through standard setting can be a less than precise exercise. Moon and Hughes (2002), for example, used cut-off levels obtained from previous standard setting in order to examine training and scoring issues in large-scale writing  Scoring Designs 15 assessments. Zieky (2001) noted that standards may be considered acceptable if they have (a) a legitimate purpose, (b) adequate notice, and (c) fundamental fairness. Development of Psychometric (Analysis) Models The use of item response theory (IRT) is a popular approach to analyzing largescale assessment data. IRT consists of a family of models that have been demonstrated to be useful in the design, construction, and evaluation of educational and psychological tests (Hambleton, Rogers & Swaminathan, 1991). This approach is one adopted by largescale assessments such as NAEP where item response probabilities are modeled in terms of student proficiencies and item characteristics. IRT supposes that: (a) the performance of an examinee on an assessment can be explained by the characteristics (latent traits) of that examinee, (b) scores on examinee latent traits can be estimated, and (c) estimated latent trait scores can be used to predict item or test performance. In practice, rater effects were historically modeled and analyzed on the raw score scale using either analysis of variance or generalizability theory. IRT analyses of rater effects followed. IRT analyses help explain how the characteristics of examinees, items, and raters interact in the formation of constructed responses. With constructed response items, the probability that a given response will earn a particular score depends on examinee proficiency, item characteristics, and rater characteristics (severity / leniency). This suggests that rater effects should be modeled at the item response level. Un-modeled rater effects decrease the reliability of scoring (Hombo et al., 2001; Wolfe & Myford, 1997). Incorporating rater effects into analysis models has the potential to increase the measurement precision of assessments. 
Wolfe and Myford (1997) suggested using rater effects models to detect various types of  Scoring Designs 16 systematic drift, thus enhancing the reliability and validity of the scoring process itself. Three recently investigated rater effects models, the Facets model, the Hierarchical Rater Model (HRM), and the Rater Bundle Model (RBM), are presented here. Facets  Englehard (1996) described an IRT approach to modeling rater effects based on a multifaceted version of the Rasch model for ordered response categories (Linacre, 1989). The Facets model is an extension of the Rasch model that includes a parameter for the effect of raters. This model distinguished itself from the existing models at the time in that accuracy was not based on group-level data, but rather was conceptualized as a latent variable with individual raters and benchmarks as the level of analysis. The Facets model also avoided the existing two-step process of first calculating the rater accuracy index for each of the raters and then conducting a separate examination of differences between raters. Instead, Facets allowed a systematic framework for the explicit examination of statistical analysis of differences in rater accuracy. Unlike previous models, Facets also accounted for potential differences between benchmarks and it was based on a probabilistic IRT model. This model allows for an ANOVA-like additive decomposition in the logit scale, and, as such, is an additive form of the Linear Logistic Test Model (LLTM; Fischer, 1973, 1983). The Facets model has the same mathematical form as LLTM and became a common item response modeling approach. LLTM may be extended to include polytomous responses and additional facets such as rater by item interactions (Patz et al., 1997). The Facets model is a 6-shift model (Hombo et al., 2001) because, it incorporates rater effect by shifting the entire item response function (IRF) up or down the ability  Scoring Designs 17 scale. The rater, then, has the effect of changing the ^-parameter. The model seeks to explicitly capture rater leniency/severity behavior. The Facets model stumbles in the face of multiple scoring of item responses because, in the Facets model, rater severity parameters may be constrained to be the same for a specific rater across items. The likelihood is specified such that each time a rater scores an item response by an examinee it is treated as a conditionally independent event. This is the standard assumption in IRT modeling and is necessary to the standard interpretation of the parameters in IRT. The Facet model, however, ignores violations of the assumption of conditional independence and, as such, underestimates measurement error, overestimates reliability (Donoghue & Hombo, 2000), and distorts model parameters (Patz et al., 2002). Given this, alternative methods of modeling raters' effects have been developed and tested. Hierarchical Rater Models  An alternative class of model, the hierarchical rater model (HRM; Donoghue & Hombo 2000; Junker & Patz, 1998; Patz et al., 2002), was proposed to correct the way the Facets model accumulates information in multiple ratings to estimate examinee proficiency. In contrast to the 6-shift model that shifts the IRF up or down the 6 scale to account for severe or lenient raters, the HRM assumes a latent, true category score, £ , which is related to ability by an IRT (2PL or generalized partial credit) model. 
With HRM, the job of raters is to discern the true category score of the response, with the concept being that raters make errors while trying to discern the true category scores. Both independence of examinees and local independence are assumed. HRM accounts for marginal dependence between different ratings of the same examinee's work and makes  Scoring Designs 18 possible the calibration and monitoring of individual rater effects seen in multiple rating designs. The problem of downward bias of standard error seen in the Facets model was corrected by breaking the data generation process into two stages. In stage one, the HRM posits ideal ratings variables describing a given examinee's performance on a given item as unobserved per-item latent variables. This ideal rating variable might follow a partial credit model for example. HRM corrects the Facets problem by using the ideal rating to capture dependence between multiple ratings of the same response to an item. In stage two, raters produce a rating for the given examinee on the given item. This rating may or may not be the same as the ideal rating category. This process is modeled as a discrete signal detection problem using a matrix of rating probabilities that may be constrained to focus attention on specific features of rater behavior. Interaction and dependence of ratings on rater covariates may be modeled in stage two. Rater Bundle Model  Rosenbaum (1984, 1988) introduced the concept of bundle independence as a way to address issues of conditional or local dependence. The likelihood function was expressed as a product of the probabilities of a response to a subset of items or bundles among which one has reason to expect conditional dependence rather than as a product of the probabilities of each individual item. Wilson and Adams (1995) applied this concept of bundle independence to Masters' (1982) partial credit model and to other related models. This application resulted in the birth of the item bundle model. Similarly, Wilson and Hoskens (2001) extended the Facet model to establish the rater bundle model (RBM).  Scoring Designs 19 Wilson and Hoskens (2001) proposed RBM as a result of observations that, when repeated ratings occur, the assumption of conditional independence is violated whereas test information and reliability are overestimated. In RBM, pairs of ratings to a given item response are defined as a rater bundle. Probabilities are modeled at the bundle level. Patz et al. (2002) noted that RBM worked well for modeling a few specific dependencies, such as between specific pairs of raters or specific raters and specific items. They further noted that RBM is more feasible than HRM for larger numbers of ratings per item because it provides a simpler model of dependence between ratings. Wilson and Case (1997) noted that it is important to recognize the limitations of using a statistical model to check consistency. In general, mathematical models are premised on the belief that raters, examinees and items will act in a consistent manner. Raters may, however randomly rate some items too severely or too leniently. Examinees may also respond to items in unexpected ways, such as when a strong examinee makes a mistake on a relatively easy question. Adjustments to scores could improve the overall reliability and consistency of individual scores; however, this may be done at the cost of reducing the correctness of some individual scores. 
In the same way, these models may help in the identification of biased raters but may not be successful in helping us find inappropriate individual scores for item responses. While the development of several analysis models that incorporate a rater parameter has been shown to be beneficial in recent research (Hombo et al., 2001; Patz et al., 2002; Patz et al., 1997; Wilson & Hoskens, 2001), the relative contribution of the scoring designs has been explored less extensively. Recently, the potential benefits of using specific methods of allocating items to raters (or raters to items) have been  Scoring Designs 20 investigated in conjunction with analysis models research (Hombo et al., 2001; Patz et al. 1997). Rater Allocation Designs Ignoring rater effects could have serious consequences. Specifically, Patz et al. (1997) noted that the design in the assignment of raters to items has implications for our ability to detect and correct any rater effects. Furthermore, the method of distributing item responses to raters has the potential to mitigate raters' effects and has a significant effect on the accuracy of examinee ability estimates (Hombo et al., 2001; Patz et al., 1997). There are many possible methods of allocating raters to score item responses. One possible method is to have every rater score every item. This is referred to as a fully crossed rater allocation design. More feasible, however, is a design in which raters are allocated to either a set of examinees and/or to a set of items. Brennan (1992) suggested that scoring designs, in which not all responses are scored by all raters, resulting in a savings of time and money in addition to increased rater endurance. Patz et al. (1997) proposed a stratified randomization allocation design that attempts to cancel out the residual rater biases at test score level. When the distribution of multiple ratings was balanced, this design was shown to significantly improve proficiency estimation in the presence of severe rater effects. When raters are randomly assigned to a set of item responses or to a set of examinees, the design may be referred to simply as a random allocation design. When a systematic scoring pattern can be determined, the rater design may be referred to as a nested allocation design. Similar to random allocation, a rater may be  Scoring Designs 21 nested in either a set of examinees or in a set of item responses. Bock, Brennan, and Muraki (2002) examined the information obtained for nested designs under multiple rating conditions. Like random allocation designs, nested designs have shown great promise in minimizing the effects of rater variation (Hombo et al., 2001). The distinction between random allocation designs and nested designs is that in the former no scoring pattern should be evident. Allocation is literally random. In a nested design, however, a clear allocation pattern is visually evident in the scoring matrix. The fully crossed design, random allocation design, and two nested designs (row and spiral) are described here. Fully Crossed Design  In a fully crossed design, every rater scores every item response in every cell. Patz et al. (1997) noted that an optimal situation for monitoring the impact of rater effects is one in which the two facets, rater and item, are completely crossed and the design is balanced. 
Despite the fact that the fully crossed design would not be feasible in practice, it has been used in previous research as a gold standard to which other designs may be compared (Hombo et al, 2001; Lee & Sykes, 2000; Patz et al, 1997). Hombo et al. used the fully crossed design only as a baseline in their research. They noted that, although this design would ensure maximum available information and reduce measurement error, it is clearly not practical, particularly in large-scale assessment. Not only would this design be costly but also raters would have to score hundreds of thousands of item responses. Random Allocation Design  Raters may be randomly allocated to either examinees or items. It was suggested by Patz et al. (1997) that random allocation of raters to items could help to minimize the effect of rater variability. Notably, with computer and digital technology, the random  Scoring Designs 22 allocation of raters to examinees or items would be fairly simple, even when raters are not at a central location. In a 1997 study, Patz and his colleagues addressed concerns about NAEP's fragmented analysis of errors in the rating of open-ended responses by developing methodology for a more unified analysis. They applied the developed methodology to investigate, among other things, potential ways of minimizing rater effects using modern imaging technology. Among other things, Patz et al. examined two rater allocation designs. These were random assignment of raters to intact student booklets, in which the rater scored all items for a randomly assigned examinee, and random assignment of rater to item responses in which all raters scored some responses to randomly assigned item responses. Randomization of raters to individual responses (item by item randomization) instead of intact booklets resulted in a significant reduction in the errors associated with estimated proficiencies. Further details of this study are provided later. Nested Allocation Designs  Brennan (1992) suggested a nested design, in which all raters need not score items for all examinees, would be more practical than the fully crossed design with regards to constraints of time, rater endurance, and cost. As in random allocation, there are two ways in which a rater may be nested: (a) Spiral allocation. In the spiral allocation design, the rater is assigned to score a subset, or column, of item responses. As such, for a block of examinees, the rater is nested in a particular item or set of items. A basic spiral pattern is shown in Figure 2. Essentially, the rater scores a predetermined repeating pattern of responses. In Figure 2, rater four (r4) appears in bold to highlight the spiral pattern. The benefit of the spiral  Scoring Designs 23 design is that, as in the random allocation design, it is possible to maximize the number of raters that contribute to the total score. Figure 2. Basic Spiral Pattern Item Examinee  A  B  C  D  1  rl  r2  r3  r4  2  rl  r2  r3  r4  3  rl  r2  r3  r4  4  r2  r3  r4  r5  5  r2  r3  r4  r5  6  r2  r3  r4  r5  7  r3  r4  r5  rl  8  r3  r4  r5  rl  9  r3  r4  r5  rl  10  r4  r5  rl  r2  Hombo et al. (2001) investigated several rating designs and found biases associated with a spiral allocation design to be low. (b) Row allocation. In this scoring design, each rater would score all of the items administered to a given examinee. Specifically, each rater would score the responses of a subset of examinees. This would visually result in a row, or set of rows, of scored responses for each rater. 
An example of this type of allocation design is provided in Figure 3. In this figure, rater one (rl) is presented in bold so that the row pattern may be  Scoring Designs 24 more clearly seen. This was a commonly used design because it had its origins in teacher rating of student writing. It was also practical for administration because it is relatively easy to assign booklets to raters. With the ease of administration, however, comes a potential cost in accuracy. Hombo et al. found that this design resulted in less accurate estimates of examinee ability than that of a spiral rater allocation design when items were scored dichotomously. Figure 3: Row Allocation Design Item Examinee  A  B  C  D  1  rl  rl  rl  rl  2  r2  r2  r2  r2  3  r3  r3  r3  r3  4  r4  r4  r4  r4  5  rl  rl  rl  rl  6  r2  r2  r2  r2  7  r3  r3  r3  r3  8  r4  r4  r4  r4  9  rl  rl  rl  rl  10  r2  r2  r2  r2  Scoring Designs 25 Recent Findings at the Intersection of Rater Designs and Analysis Models Three studies that examined the relative efficiency of differing designs for allocating raters to item response stand out in the existing literature. These studies are: Hombo et al.(2001), Lee and Sykes (2000), and Patz et al. (1997). Two of these studies, Hombo et al., and Patz et al. (1997) are based on NAEP data and were conducted using simulated data whereas Lee and Sykes' study was conducted using "real" data. The former studies also examined the effects of analysis models. Terminology was used inconsistently in these studies. As such, it is necessary to clarify terminology. To start, the term "rater allocation design" used here was suggested by Patz et al. (1997). Hombo et al. referred to the same area of study as examining rater designs whereas Lee and Sykes referred to the examination of scoring modalities. Many of the individual allocation designs described in these studies are identical or similar across the three studies yet the designs are given different names in each of the studies. The studies conducted by Patz et al. (1997) and by Hombo et al. were both simulation studies, based on NAEP data. Both examined rater allocation designs in conjunction with analysis models. Whereas Hombo and her colleagues took the approach of examining specified allocation patterns, Patz and his colleagues investigated allocation designs based on random assignment of raters to item responses. Both studies used item response theory in their investigations. Lee and Sykes based their study on real data and studied the relative accuracy of rater allocation designs using generalizability theory. Patz et al. (1997) proposed a stratified randomization allocation design that attempts to cancel out the residual rater biases at the test score level. They assessed this rater design by simulatingfroma fitted rater model in order to investigate the  Scoring Designs 26 implications of ignoring rater effects on inferences regarding IRT parameters and examinee proficiencies. When rater effects were not modeled, error was greater than that of the modeled data. The stratified randomization allocation design was also shown to significantly improve proficiency estimation in the presence of severe rater effects. Similarly, Hombo et al. investigated the impact of rating designs on the estimation of examinee ability both with and without the rater in the analysis models. They found biases associated with a spiral rater design to be low whether a rater parameter was included in the analysis model or not. 
The Lee and Sykes study distinguishes itself from that of Patz et al. (1997) and from Hombo et al. in several ways. First, the Lee and Sykes study uses real item responses and raters rather than simulated item responses and raters. Second, results from both Hombo et al. and Patz et al. are based on item response theory whereas Lee and Sykes use generalizability theory to assess accuracy. Third, in addition to accuracy, Lee and Sykes examined efficiency, wherein efficiency is measured according to the movement of paperfromone rater to the next. Fourth, the specific manner in which raters were allocated to items and the patterns formed (or lack of in the case of random allocation) was an important consideration in the Patz et al. (1997) and in the Hombo et al. studies. Lee and Sykes, however, specified the model without making explicit the patterns (random versus spiral, for example) resultingfromthe manner in which the task or rater subsets were assigned. These three studies provide the foundation for this proposed study. As such, details of these studies relevant to this proposal are explicitly presented below. Aspects of the studies that are not related to the research questions here have been left out for the sake of clarity.  Scoring Designs 27 Study 1: Patz, Wilson, and Hoskins (1997)  Patz et al. (1997) considered NAEP's fragmented analysis of errors in the rating of open-ended responses for the purpose of developing methodology for a more unified analysis. They then applied the developed methodology to analyze rater effects in NAEP data and investigated potential ways of minimizing rater effects using modern imaging technology. They conducted a study in two phases. Phase 1 was a pilot study conducted using 1992 datafromthe NAEP Trial State Assessment in reading (grade 4). Preliminary analyses were conducted on a relatively small scale to carry out a prototype simulation study to investigate the impact of rater effects on item calibration and proficiency estimation for two rater allocation designs. These were: random assignment to student papers, in which the rater scores all items for an individual, and random assignment to student responses in which all raters scored some responses to all items. These researchers found that, using a random allocation design in a multiple scoring situation, an unbalanced allocation could result. Problematic, in their study, was the uneven number of ratings provided by individual raters in the randomized design. For example, with these NAEP data, item 1 was scored 14 times by three different raters whereas other items were rated only once. Also, estimation of rater severity for items with very few ratings was problematic because the lack of balance also complicated interpretation of estimated rater severity parameters. Patz et al. concluded that the randomization of responses to raters should be carried out in a way that ensures that an unbalanced design will not result. A larger (phase 2) study using the 1994 NAEP Trial State Assessment in reading (grade 4) followed and addressed the problem of unbalanced allocation with multiple  Scoring Designs 28 ratings. Of primary importance in this second set of analyses was the distribution of item responses to the set of raters. Noting that the impact of rater biases on test scores depends on both the nature of the biases and on the rater allocation design, Patz et al. 
investigated these relationships by simulating responses under the four configurations of rater effect types used in the pilot investigation: 1) control - no rater effects/2pl model 2) same as NAEP 3) raters vary primarily in terms of overall severity 4) rater variability is heterogeneous across items. A fully crossed allocation design in which the assignment of raters to items is balanced and the two facets, rater and item, are completely crossed would be optimal for monitoring the impact of rater effects. Patz and his colleagues, however, noted that it would not be feasible. An appropriate partially balanced design intended to facilitate the detection of rater by item bias was seen to be a significant improvement so a third allocation design, stratified allocation, was added to the investigation in the second phase such that the rater to task allocation designs were: 1) random assignment to student papers whereby the rater scores all item responses for a randomly assigned examinee. This design represented the practice of NAEP in 1992. 2) random assignment to student responses whereby the rater scores the randomly assigned items. This design represented the practice of NAEP in 1994. The problem that was found with this design was that with multiple ratings it was an  Scoring Designs 29 unbalanced allocation design. In other words, some raters were assigned to score certain items many times whereas others scored certain items very few times. 3) stratified random allocation: a set of raters is divided into 10 deciles based on rater severity, separately for each set of items. Each of a student's 10 open-ended responses is then distributed randomly so that one raterfromeach severity decile rates one response. It was expected that this design would systematically cancel out the effects of rater bias at the test booklet level because it eliminates the possibility that a given examinee booklet would be rated primarily by either lenient or severe raters. It is important to note that, in this investigation, it was assumed that rater severities were known in the stratified allocation design. Proficiency estimates were fixed at 100 equally spaced quartiles of a normal distribution with 10 students at each unique theta and each set of thetas being consistent with a normal distribution for 1000 examinees. The twelve constructed response items varied in scoring levels. Also simulated were 20 raters with severities fixed at equally spaced quantiles of the normal distribution. So that the standard error of the mean, which quantifies the uncertainty attributable to the simulation process, could be reported, each simulation was conducted 10 times. Two conditions were varied: rater effect type and rater to task design. One hundred data sets for 1000 examinees on 12 dichotomously scored constructed response items were simulated. Proficiencies were generated according to a normal distribution and difficulty (location) and discrimination parameters were generated in a manner consistent with observed distributions of the estimated parameters in NAEPs 1992 Trial  Scoring Designs 30 State Assessment Program in Reading (grade 4) for the pilot study then, later, using NAEPs 1994 Trial State Assessment Program in Reading (grade 4) in phase 2. In phase 2, the implications of rater effects for IRT scale scores and classical reliability estimates under the three different allocation designs were investigated. 
This time the investigation focused on one complete set of items presented to a subset of examinees in which there were 10 multiple-choice items and 10 constructed response items. Of interest was the accuracy of the resulting scale scores as measured by the root mean squared error (RMSE) for the estimated true thetas along with classical test reliability as measured by the correlation of the two replicated raw scores. Two estimated reliabilities were viewed as significantly different when the roughly four standard error wide intervals centered at the estimates did not overlap. These researchers fit the following three LLTM models to the 1992 NAEP data to see whether the faster MML estimation technique applied to the PC-R model would supply useful information: 1) Regular partial credit model (PCM) - Ignores potential rater effects; 2) PC-R model with general rater effects - includes parameters for rater severity that are constant over items in addition to item difficulties and item step parameters; 3) PC-R model with item-specific rater effects that includes rater severity parameters specific to each item in addition to item difficulties and item steps. The item specific parameters indicated how much more severe or lenient a rater is than the average rater when scoring a particular item.  Scoring Designs 31 The square root of the RMSE for the parameters was calculated. It was concluded for these data that rater effects, when present but not modeled, increase error in the estimation of item parameters. The most notable effect was in the location parameter. The increase was not sensitive to the design for assigning raters to responses within examinees in the designs investigated. The impact of non-modeled rater effects on proficiency estimation was large when effects were systematic within rater and when the rater scored all responses for a given examinee. The impact of the effects was significantly mitigated in the random assignment to student responses design. In the study conducted by Patz et al., the stratified random allocation design provided a significant increase in reliability and an increase in measurement accuracy in the presence of known severe rater effects. It should be noted that the RMSE for the stratified randomization under severe rater effects was found to be lower than the RMSE attained when no rater effects were present. When this unexpected finding was explored, it was revealed that the stratification results in smaller standard deviations for both realized raw scores and estimated scale scores and consequently results in smaller RMSE without necessarily improving reliability. Given this, Patz et al. noted that the smaller RMSE should not be considered as the sole basis for comparison of methodologies. When the complete simulation was conducted again using only the constructed response items, the benefits of randomization under severe rater effects were found to be proportionally greater for tests consisting of only constructed response items. It is under these conditions that the stratified randomization brings the greatest improvement. Patz et al. noted that the design in the assignment of raters to items has implications for our ability to detect and correct any rater effects. Randomization of raters  Scoring Designs 32 to individual responses (item by item randomization) instead of intact booklets may lead to a significant reduction in the errors associated with estimated proficiencies. 
However, the uneven number of ratings provided by individual raters (unbalanced design) in the randomized design was problematic. Also problematic was the estimation of rater severity for items with very few ratings, because having an unbalanced design complicates the interpretation of estimated rater severity parameters. Patz and his colleagues concluded that the randomization of responses to raters should be carried out in a way that ensures that unbalanced designs do not result. Regardless of which procedure is used, the distribution of responses to raters should be conducted in a statistically balanced fashion. Conclusions and recommendations were made for future NAEP studies, given both the findings from their analyses and other analyses in the literature. The stratified randomization allocation design investigated in phase 2 provided a significant increase in reliability and an increase in measurement accuracy in the presence of known severe rater effects. However, it is important to note that, in this investigation, it was assumed that rater severities were known. This study recognized the important role played by optimizing rater allocation procedures, noting that the method of distributing responses to raters may have significant consequences for the impact of rater effects. Study 2: Hombo, Donoghue and Thayer (2001)  While similar in objectives, Hombo et al. (2001) examined somewhat different rater allocation designs. Similar conclusions were drawn regarding the difference seen between the modeled and unmodeled rater effects and the manipulation of the rater allocation design.  Scoring Designs 33 Hombo et al. conducted a simulation study of the effect of rater designs on ability estimation whereby the designs were evaluated in terms of their impact on the accuracy of examinee ability estimation. Like Patz et al. (1997), Hombo and her colleagues studied the effects of different rater allocation designs, and the inclusion of a rater in the analysis models. These researchers had two objectives: 1. To examine the bias of estimation of examinee ability when rated data was analyzed. They did this using a model first without, and then with, a rater parameter. 2. To examine the impact of rating designs on the estimation of examinee ability. Hombo et al. identified the three types of rater allocation designs examined in their report as fully crossed, nested, and spiral. The fully crossed design, in which every rater rates every item response in every cell was used as a baseline. Although this design would ensure maximum available information and reduce measurement error, it is clearly not practical, particularly in large-scale assessment. Not only would this design be costly but also raters would have to rate hundreds of thousands of item responses. Hombo et al. included three nested designs in their study that represented different combinations of the matching of ability levels to varying levels of rater severity that would result in moderate effects on the ability estimates in one design and "exacerbated effects" in two of the designs. In a spiral design, the rater is assigned to rate responses on a subset of test items. In other words, the rater is nested in items rather than examinees. Columns of item responses would be scored by each rater. Four spiral designs were examined, each adhering to the following conditions:  Scoring Designs 34 1. Items were grouped in sets composed of one each of easy, moderately easy, medium, moderately difficult, and difficult. 2. 
Each item set was assigned to four raters who were harsh, moderately harsh, moderately lenient, or lenient.
3. Examinees were grouped into sets of four, by levels of ability.
4. A pattern of item sets assigned to particular raters was established within the first set of examinees and then applied to all others.

It was noted that simulating the data allowed for control over which factors were varied, thus strengthening conclusions about influences on ability estimates. In the simulation, the item and rater parameters were treated as known to allow the focus to remain on the estimation of examinee ability. These researchers argued that item parameters are sometimes fixed and known from pre-calibration and that there existed the possibility of pre-calibration of raters as part of rater training. The purpose of the study itself required fixing parameters because the nested and spiral designs used in their study would not lend themselves to the estimation of the rater parameter.

All responses were generated as dichotomously scored, and the rater effects were incorporated under the one-parameter logistic (Rasch) and two-parameter logistic models. Each data set consisted of 1000 replications under each of the study conditions. True examinee ability, item characteristics (difficulty, discrimination, and guessing), and rater parameters were fixed across rater designs. Sixteen true examinee ability values were uniformly distributed over the interval from -2.0 to 2.0. Twenty item difficulties (b parameters) were also uniformly distributed over the interval from -2 to 2. For the 1PL, all discrimination parameters (a) were set to 1.0, whereas for the 2PL twenty values were evenly distributed from 0.85 to 3.4. The a-values were randomly paired with the b-values. Zero guessing was assumed. Hombo et al. identified four rater parameters (-0.5, -0.25, 0.25, 0.5) in which the larger values are associated with the more severe raters. They based the effect sizes on an analysis of a set of 1994 reading data from NAEP.

To evaluate the effect of using an analysis model not incorporating a rater parameter on rated data, Hombo et al. analyzed the data twice for every spiral and nested design: once with a rater parameter and once without. Analyses were conducted using two computer programs, FACETS and PARSCALE. Two measures of the accuracy of the ability estimate were examined: squared bias of the ability estimate and mean square error. Squared bias was defined by Hombo et al. as the squared difference between the mean of an examinee's replicated estimates and the true ability of that examinee. Specifically,

\mathrm{Bias}^2 = \left(\bar{\hat{\theta}} - \theta_{\mathrm{true}}\right)^2          (1)

and mean square error is

\mathrm{MSE} = \frac{1}{k}\sum_{r=1}^{k}\left(\hat{\theta}_r - \theta_{\mathrm{true}}\right)^2          (2)

where k is the number of replications.
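Equations 1 and 2 reduce to a few lines of code once the k replicated estimates for an examinee are stored. The sketch below is an illustration of the two accuracy measures only; the array names and example values are hypothetical and are not taken from Hombo et al.

```python
import numpy as np

def squared_bias(theta_hats, theta_true):
    """Equation 1: squared difference between the mean replicated estimate and true ability."""
    return (np.mean(theta_hats) - theta_true) ** 2

def mean_square_error(theta_hats, theta_true):
    """Equation 2: average squared deviation of the k replicated estimates from true ability."""
    return np.mean((np.asarray(theta_hats) - theta_true) ** 2)

# Hypothetical example: 1000 replicated estimates for an examinee with true theta = 0.5
rng = np.random.default_rng(1)
theta_hats = rng.normal(loc=0.6, scale=0.3, size=1000)
print(squared_bias(theta_hats, 0.5), mean_square_error(theta_hats, 0.5))
```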
Like Patz and his colleagues (1997), Hombo et al. found that some of the rater allocation designs they examined also mitigated the effects of raters. When judged by the most extreme raters, the row allocation design resulted in biased ability estimates for the examinees, while the spiral allocation designs were found to be quite robust under the same conditions. Hombo et al., however, manipulated the spiral allocation such that raters of varying severity scored items. While this provided valuable information about the interaction between the rater parameters and bias, the rater parameters are not likely to be known before the spiral is established. Although Hombo et al. argued that item parameters are sometimes fixed and known from pre-calibration and that there existed the possibility of routine pre-calibration of raters as part of rater training, the degree to which these conditions could be replicated in practice outside of NAEP is questionable. Even if this were possible, it does not account for possible changes in the rater parameter between estimation and actual scoring.

Study 3: Lee and Sykes (2000)

Lee and Sykes (2000) also examined the accuracy of scores and the efficiency of scoring procedures for several scoring modalities (rater allocation designs). Five distinct modalities were examined:
1. Univariate person (p) x task (t) x rater (r) modality [p x t x r]. This has come to be known as the fully crossed design.
2. A rater scores each examinee's responses to a subset of tasks. This is a univariate person, crossed with task, nested in raters modality [p x (t:r)]. This is a deviation from the fully crossed model in that, although all raters read responses for all examinees, each rater rates only a subset of responses.
3. Several raters read each student's response to the same task. Each task, however, is read by a different set of raters. This is a univariate design with person crossed with raters nested within tasks [p x (r:t)]. Here, it may be said that the rater is nested in the item.
4. A rater reads all responses to the tasks of a subset of examinees. The rater, then, is nested in examinee. In this univariate (p:r) x t design, person is nested within raters, crossed with task.
5. A rater reads only one student's response to a task. Here, raters nested within persons and tasks form the univariate r:(p x t) design.

Lee and Sykes used data from a Mathematics field test (grade 8) that included 11 constructed response items.¹ Nine of these items were dichotomously scored. The remaining two items were scored on a four-point scale. A stratified random sample of 2000 students was used for their study. In order to maintain scoring standards, raters were trained, anchor papers were established, and read-behind rater checks were used. Each item response was scored once, with the exception of the fully crossed design, which incorporated multiple ratings of the same item.

Lee and Sykes started with one rater scoring all of the items for every examinee. When a second rater was added to the design, the second rater scored all of the item responses such that each item was scored twice, resulting in a fully crossed design. With the other modalities, Lee and Sykes began with two raters. In the second modality, with two raters, one rater scored the first half of the item responses and the other rater scored the second half. In the third modality, it was necessary for the number of raters to equal the number of items because the raters scored all of the examinee responses for a given item. In the fourth modality, with two raters, each rater scored the responses of half of the examinees. In the fifth modality, the number of raters was equal to the number of items multiplied by the number of examinees.

¹ In the reporting of results, Lee and Sykes refer to twelve tasks. It is unclear from where the additional task comes.

The definition of n_r (the number of raters) was quite different for the first modality, the fully crossed design, than it was for the others. In the first modality, an increase in n_r was a direct increase, wherein a change from n_1 to n_2 meant an increase in the number of raters from 1 to 2.
With the subsequent modalities, a change from n_1 to n_2 meant an increase in the number of raters from 1 to 2 for each of the ratings. A by-product of this was an inconsistency in the number of raters across the scoring modalities.

Lee and Sykes found that, of all the rating designs examined, the second modality, in which the rater reads each examinee's responses to a subset of tasks, was the least preferable in terms of scoring accuracy. This design is most similar to the spiral design described by Hombo et al., which they found to be the most desirable. Differences in findings could be due to differences in the details of the design. Whereas in both studies each rater was assigned to a subset of items, in the Lee and Sykes study the raters were paired such that one rater scored the first half of the examinee responses and the other rater scored the second half. In the Hombo et al. study, however, there was no overlap between raters for any examinee. The findings of Lee and Sykes are nevertheless counterintuitive because logic would dictate that, as the number of raters increased, error would decrease (i.e., the leniency of one rater would absorb the variance produced by the severity of another). In the Lee and Sykes study, we know little of the characteristics of the raters, a factor that may also have influenced these results. In the Hombo et al. study, rater characteristics were controlled.

Modality four, in which the rater reads all responses to the tasks of a subset of examinees, was found to be the most favorable by Lee and Sykes when both scoring accuracy and efficiency of scoring procedures were considered. These researchers recommended this modality for use with large-scale performance assessments. One key reason for this decision was that scoring accuracy did not depend on the number of raters. It would make sense that scoring accuracy would not change as a result of an increased number of raters because the raters were scoring all of the items for a given subset of examinees. This finding was quite different from those of Hombo et al. and of Patz et al. (1997). In the Hombo et al. study, this design was most similar to the nested design in which a rater was assigned to rate examinee performance for a subset of examinees such that each rater scored a row of responses. In the Patz et al. study, modality four is similar to the rater-to-task design in which the rater scored all item responses for a randomly assigned examinee. An important difference between the Lee and Sykes study and that of Patz et al. is that in the former study, the method by which raters were assigned to examinees is not described.

Although the area of constructed response scoring is not new to researchers, the increased adoption of constructed response items in large-scale testing and developments in technology have created the potential for solutions to some serious problems associated with the scoring of constructed responses. More recent research has seen the development and testing of analysis models, and an acceptance that raters need to be included in the IRT model. It has also investigated several potential rater allocation designs that may be used to reduce bias resulting from variations in rating tendencies.
Research Questions

Certain large-scale assessments, such as NAEP and TIMSS, are concerned with group-level variables such as the reliability of scores across years (Hess, Donoghue, & Hombo, 2003) and/or with the incorporation of statistical models driving research in that direction. However, when licensure, promotion, or program acceptance is at stake, the focus is on the accuracy of ability estimation and decisions. It is evident in the existing literature that additional research into the scoring of constructed responses is needed in several areas. Although some preliminary research has examined the relative accuracy and standard error of several rater allocation designs, a comparison across designs from these studies has yet to be completed. Previous studies (Hombo et al., 2001; Lee & Sykes, 2000) used the fully crossed rater allocation design as the gold standard to which the other rater allocation designs were compared; however, it was also recognized in these studies that the design was not feasible when dealing with high numbers of examinees. The present study does not seek to compare allocation designs to the fully crossed design because: a) this study seeks to compare allocation designs that have been used or have shown promise in previous research, and b) this study is only interested in designs in which there is no multiple rating of items. There are two reasons for this. First, as Wilson and Hoskens (2001) noted, multiple re-rating of work is often done in research and development studies but less often in practice. Second, Patz et al. (2002) noted that with as few as one or two ratings per response, important differences emerge in the ways that various rated response models handle single and multiple ratings.

The present study examines the relative performance of three rater allocation designs: a random design, a spiral design, and the traditional row design. Also included in this study is a no-rater condition. The no-rater condition is included to mimic an objective scoring condition and is therefore referred to herein as "the objective scoring condition." This should not be confused with multiple-choice or other such items, wherein changes in format must be considered. The relative accuracy and consistency of the examinee ability estimates obtained under these design conditions are examined across levels of true examinee ability (θ). The differences in estimated ability from each of the rater allocation designs and the objective scoring are also compared. It is further recognized in this study that administration of constructed response items occurs frequently in the context of high-stakes assessments where a decision is made at a cut-off level. These decisions have the potential to greatly affect the opportunities open to an individual examinee. The research questions for this study are:
1. What is the relative accuracy (θ - θ̂) of the spiral allocation design, row allocation design, random allocation design, and objective scoring across levels of true examinee ability?
2. What is the relative consistency of the spiral allocation design, row allocation design, random allocation design, and objective scoring across levels of true examinee ability?
3. How does the accuracy of pass and fail decisions differ when scored using the spiral allocation design, row allocation design, random allocation design, and objective scoring for four cut-off levels?
4.
What is the relative discrepancy between the ability estimates obtained under each of the rater allocation designs (the spiral allocation design, row allocation design,  Scoring Designs 42 random allocation design) compared to the objective scoring across levels of true examinee ability? 5. To what degree do the pass and fail decisions obtained under objective scoring agree with that of the spiral allocation design, row allocation design, and random allocation design, at four cut-off levels? This study requires that the true examinee score be known, and befreefromthe influence of rater variation, so that the deviance between the true score and observed score may be identified. As such, these questions are addressed using computer simulation. Through computer simulation, control may be gained over which factors are varied thus strengthening conclusions about influence on ability estimates. As in the studies conducted by Hombo et al. and by Patz et al. (1997), it was necessary in the simulation that the item and rater parameters be treated as known to allow the focus to remain on the estimation of examinee ability.  Scoring Designs 43 CHAPTER ni METHOD This study builds upon a foundation laid down by Hombo et al. while considering the rater allocation designs examined by Patz et al. (1997) and by Lee and Sykes (2000). Like its predecessors, both traditional and newly proposed rater allocation designs were examined in this study. Following the logic and methodology of Hombo et al. and Patz et al., this study was conducted with simulated data. The relative performance of rater allocation designs and objective scoring are examined under short test conditions wherein changes in item characteristics are expected to greatly influence examinee scores. This study deviatesfromits predecessors in that it is acknowledged here that an analysis model incorporating the rater is a necessary element in the correct analysis of data resultingfromrater scoring. Findings by Hombo et al. have spawned several studies in this area (Patz et al., 2002; Wilson & Hoskens, 2001). This study adds two additional steps in the examination of the relative performance of rater allocation designs. First, objective scoring is considered along with, and compared to, the rater allocation designs. Second, a comparison of decisions around cut-off levels resultingfromthe use of particular scoring designs was conducted. Before providing details of the simulation design and the dependent variablesfromthe simulation experiment, an overview of the two levels of comparison used in this study is presented. Overview The purpose of this chapter is to describe the simulation study methodology designed to answer the research questions listed at the end of Chapter II. The research questions are organized in terms of the object of comparison. That is, for thefirstthree  Scoring Designs 44 research questions, the comparisons are with the true ability level, denoted as theta (#), whereas for the last two research questions, the object of comparison is the examinees' estimated ability (6) under objective scoring (e.g., automated scoring via computer) conditions. Comparison Level 1  To answer the first three research questions, four scoring designs (three rater allocation designs and objective scoring) were investigated. These designs are described in detail in a later section. To answer the research questions about accuracy and consistency, it was important to consider the true ability level of the examinee. 
That is, the true ability distribution was divided into six bins (ranges) of scores, noting that the ability distribution had a mean of 0 and a standard deviation of 1. The six bins involved examinees whose true ability levels were: a. less than or equal to -2.0 b. between -1.999 and -1.0 inclusive c. between -0.999 and 0 inclusive d. between 0.001 and 1.0 inclusive e. between 1.001 and 2.0 inclusive f. 2.001 and greater. Although this division of the theta distribution is somewhat arbitrary, the goal was to investigate theta in increments of one standard deviation above and below the mean and yet acknowledge that the density of theta beyond two standard deviations is relatively small. It should be noted that this division ofthe theta distribution is of the true theta. This provides the conditional bias and consistency. That is, bias and consistency are  Scoring Designs 45 conditional on the true theta. Accuracy and consistency were then investigated for each of these theta bins resulting in a four (scoring designs) by six (theta bins) factorial design. To answer the third research question about the accuracy of the decisions made from test scores, four cut-off levels: 50%, 60%, 70%, and 80% of the domain score for a decision to pass and to fail were investigated. This is described in more detail below. These cut-off levels were selected to sample a range of potential cut-off levels found in practice. In practice, the test developer, according to the purpose and characteristics of the associated test, determines cut-off levels. This resulted in a four (scoring design) by four (cut-off levels) factorial design. Comparison Level 2  The last two research questions involve a comparison to the ability estimates obtained under the objective scoring condition. The two factorial designs described above become a three (rater allocation designs) by six (theta bins) factorial design, and a three (rater allocation designs) by four (cut-off levels) factorial design, respectively. For each cell of the factorial designs, there were 50 replications and 16,000 examinees. In order to investigate whether the resultsfromthe simulation experiments were not specific to one set of item characteristics, the above simulation experiments were replicated for three different tests. It is notable that this simulation experiment is, in essence, a within subject design. This means that the same person (true examinee ability for a given simulated examinee) is used in all conditions of the simulation. Specifically, this experiment simulates an artificial situation wherein an individual could take a test and have that test be scored by the same rater using first a spiral allocation design, and then a random allocation design,  Scoring Designs 46 then a row allocation design without any concern for carry-over effect (i.e., the memory of the previous rating is erased). In addition, we would have this same examinee's score on the test had it been objectively scored. This within subject design allows for a very strong comparison of scoring designs. In the next three sections, the scoring designs, tests and cut-off levels are described in detail. Scoring Designs Raters may be allocated to either a set of examinees and/or to a set of items. When a systematic scoring pattern may be determined, the rater design may be viewed as nested. 
In the matrix, this would look like either columns or rows of response vectors are allocated to the raters depending on whether the raters were allocated to item responses or to examinees, respectively. Similarly, raters may be randomly assigned to score responses for a given set of items, or raters may be randomly assigned to score responses for a given set of examinees. The reader is reminded that the distinction between the random allocation designs and nested designs (given that it may be easily argued that, in random allocation, raters are still nested in either examinee or items) is that, in a random design, no clear scoring pattern should be evident in the same way that no pattern should be evident if, for example, one were to draw numbers from a hat. In the nested design, however, a clear pattern is evident and may be repeated to form patterns that are visually evident when viewing the scoring matrix. Two nested rater allocation designs, (i.e., raters nested in examinee (row) and raters nested in item response) are described here. The first design represents traditional scoring, whereas the second design is one that has shown potential in previous research. I begin, however, with a random allocation design.  Scoring Designs 47 Random Allocation Design  It was suggested by Patz et al. (1997) that random allocation of raters to items could help to minimize rater effects with multiple ratings of data. The problem encountered with this general random allocation type of design was that, under multiple rating conditions, an unbalanced allocation could result. In other words, some items would be rated many times whereas other items would be rated few times, or even just once. Patz and his colleagues suggested and tested a stratified rater allocation design; however, the study was limited by the need to fix the rater parameters and assume that the test parameters were known so that deciles could be established for the stratification. The random allocation design proposed here does not make the same assumptions about the rater parameters. Because technology has progressed such that items may be presented to raters via a computer interface and the supervisor may interact with the raters electronically, the item responses may be randomly assigned to sub-groupings without compromising ease of administration/supervision. Raters are generally limited by time. When hiring raters, the human side of the equation dictates that the raters get an estimate of the number of items that they are expected to rate and are paid accordingly. Hence, expense is projected for each rater. With random allocation, raters may be randomly assigned to items based on their limit. This may be a flexible limit. One practical advantage of randomly allocating raters to items is that there is no need to assemble a rating pattern that accommodates a given rater who can score a limited number of items or one who can score more items. If, for example, rater 1 is only able to commit to scoring 100 items whereas rater 2 is able to score 150, this design allows for a simple accommodation to be  Scoring Designs 48 made in the assignment. In this study, this was accomplished by randomly selecting one of a pool of 39 raters to score a given item. Each of the 39 raters had been previously randomly assigned to one of the four rater parameters. Once the rater was "used", his number could not be drawn again until all other raters had been assigned to an item. 
This process was then repeated until all item responses were randomly paired with raters. In practice, using computers to provide randomization, this could easily be adapted so that raters appeared an uneven number of times if necessary. Nested Allocation Designs  As in the case of random allocation, in nested scoring designs, not all responses are scored by all raters resulting in a savings of time and money in addition to increased rater endurance (Brennan, 1992). Raters may be nested in examinees such that a rater scores all items for a given examinee. This sort of nesting may be referred to as a row allocation design. An example of the row allocation design was shown in Figure 3. A rater may also be nested in items such that the rater scores a particular item or subset of items. An example of this type of nesting, referred to here as a spiral allocation design, was shown in Figure 2. Rater Nested in Examinee (Row)  In this scoring design, each rater scores all items administered to an examinee. Specifically, each rater scores the responses of a subset of examinees such that the rater allocation would visually form rows in the matrix as seen in Figure 3. In this study, this was achieved by systematically assigning each rater a set of examinees. Recall that the 39 raters were randomly assigned 1 of the 4 rater parameters. Starting with rater number 1, the rater was assigned to score all of the responses of approximately 410 examinees.  Scoring Designs 49 Specifically, Rater 1 would score all of the item responses for examinee number 1, 40, 79, etc. Rater 2 was assigned to score all of the item responses for examinee number 2, 41, 80 etc. In practice, this could be done just as efficiently and easily by randomly dividing the examinees into as many groups are there are raters. Spiral Allocation  One potential predetermined pattern that has shown promise is the spiral allocation design (Hombo et al., 2001). In the spiral allocation design, the rater is assigned to score a subset of item responses. Essentially the rater scores a set of predetermined responses that form a spiral pattern through repetition. The benefit of the spiral design is that it is possible to maximize the number of raters that contribute to the total score. There are multiple possible ways of arranging a spiral pattern. One relatively basic spiral pattern practical for use with large numbers of examinees is examined in this study. The basic spiral pattern was seen in Figure 2. This study utilized four items and 39 raters. In previous studies, raters were assigned to score examinee responses according to preset patterns based on defined rater parameters. In practice, the rater parameter is generally not known before scoring has taken place. As such, the 39 raters were ordered randomly before the spiral pattern allocation was implemented. This mirrors, to a greater degree than in previous studies, the conditions often found in practice. The raters were first randomly ordered, then assigned to rate items one at a time, moving through all item responses until each item response was paired with a rater and a simple spiral pattern emerged.  Scoring Designs 50 Objective Scoring  A no rater allocation design was also simulated. In other words, responses were simulated under no rater conditions (rater parameter = 0) such that scoring was objective. 
This step allowed for the opportunity to compare the scoring results under each of the rater allocation designs with those of objective scoring of the same set of items administered to the same set of examinees.

Item Characteristics (Tests)

Three separate tests were simulated with differing parameters. Each set of parameters was randomly selected from those used by Hombo et al.; the original item parameters used by Hombo et al. are provided in Appendix A. The item discrimination (a) and item difficulty (b) parameters for the three tests are provided in Table 1. The pseudo-guessing parameter (c) was set to 0, as the probability of guessing is considered to be minimal in constructed response questions and other performance assessments. The randomly selected item parameters were fixed in the simulation.

Table 1
Item Parameters for Tests 1-3

             Test 1              Test 2              Test 3
Item        a        b          a        b          a        b
  1      1.6553  -1.3684     0.9842  -2.0000     2.5947  -0.3158
  2      0.8500  -0.5263     3.4000  -0.9474     2.3263  -0.1053
  3      2.1921   0.3158     1.5211   0.9474     2.1921   0.3158
  4      3.1316   1.5789     2.0579   1.5279     2.7289   2.0000

Note: a denotes discrimination, and b denotes difficulty.

The test information (solid line) and standard error (dashed line) for the three tests are provided in Figures 4 through 6, for Tests 1 through 3, respectively. These figures illustrate how differently the three tests were operating. The less than optimal test information functions are tied closely to the small number of items in the tests. In Test 1, information is highest at θ = 1.6, and the standard error is greater at the higher end of the ability scale than at the lower end. In contrast, Test 2 provides the greatest information at θ = -1; this test does not perform as well for examinees of higher ability. Test 3 provides the greatest level of information at the middle of the ability scale and fairly high information at θ = 2. In Test 3, the standard error is remarkably high at the lowest end of the ability scale.

Figure 4. Test Information and Standard Error for Test 1
[Plot of test information and standard error of measurement against the scale score for Test 1.]
Note: Test information and standard error of measurement are represented by solid and dashed lines, respectively.

Figure 5. Test Information and Standard Error for Test 2
[Plot of test information and standard error of measurement against the scale score for Test 2.]
Note: Test information and standard error of measurement are represented by solid and dashed lines, respectively.

Figure 6. Test Information and Standard Error for Test 3
[Plot of test information and standard error of measurement against the scale score for Test 3.]
Note: Test information and standard error of measurement are represented by solid and dashed lines, respectively.
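The curves in Figures 4 through 6 follow directly from the Table 1 parameters under the standard 2PL information formula, I(θ) = Σ D²a_i²P_i(θ)[1 - P_i(θ)], with SEM(θ) = 1/√I(θ). The sketch below, written for Test 1, is offered as an illustration of that computation rather than as the code used to produce the figures.

```python
import numpy as np

D = 1.7
a = np.array([1.6553, 0.8500, 2.1921, 3.1316])    # Test 1 discriminations (Table 1)
b = np.array([-1.3684, -0.5263, 0.3158, 1.5789])  # Test 1 difficulties (Table 1)

def test_information(theta):
    """Sum of 2PL item informations: I(theta) = sum of D^2 a^2 P (1 - P)."""
    theta = np.atleast_1d(theta)[:, None]
    p = 1.0 / (1.0 + np.exp(-D * a * (theta - b)))
    return np.sum((D * a) ** 2 * p * (1.0 - p), axis=1)

theta_grid = np.linspace(-3.0, 3.0, 121)
info = test_information(theta_grid)
sem = 1.0 / np.sqrt(info)                 # standard error of measurement (dashed line)
print(theta_grid[np.argmax(info)])        # peaks near theta = 1.6, as described for Figure 4
```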
Cut-off Levels

In educational assessment, cut-off scores may be described in terms of percentage correct. The precise percentage correct considered acceptable is a product of the characteristics and purpose of the test, which encompasses a wide range of possible cut-off levels for a pass/fail decision. Hambleton, Swaminathan, and Rogers (1991) noted, "When a pass-fail decision must be made, it is often difficult to set a cut-off score on the θ-scale. Since the domain score scale is familiar, a cut-off level (such as 80% mastery) is typically set on the domain-score scale" (p. 85). The percentage correct scores of 50%, 60%, 70%, and 80% were selected for this investigation. As indicated earlier, these cut-off levels were selected in order to sample a range of potential cut-off levels found in practice. The reader is reminded that, in practice, test developers determine cut-off levels according to the purpose and characteristics of the associated tests.

Following Hambleton et al., the transformation to the domain score scale was completed using

\pi = \frac{1}{n}\sum_{i=1}^{n} P_i(\theta)          Equation 3

The domain score was then plotted against θ to identify the corresponding cut-off score. Theta and the standard error of measurement (SEM) at each cut-off level of the domain score are provided in Table 2. This step takes place outside of the simulation procedure.

Table 2
Theta and SEM at each Cut-off Level of the Domain Score

                        Test 1            Test 2            Test 3
Cut-Off Level (%)      θ      SEM        θ      SEM        θ      SEM
        50            0.10    0.53     -0.15    0.84      0.20    0.36
        60            0.45    0.52      0.70    0.63      0.42    0.42
        70            0.90    0.67      1.15    0.51      0.80    0.66
        80            1.40    0.41      1.45    0.48      1.70    0.53
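The theta values in Table 2 can be recovered from Equation 3 by finding the ability at which the domain score equals each percentage-correct cut-off. The sketch below does this with a simple grid search for Test 1; it is an illustrative reconstruction, not the exact plotting procedure described above.

```python
import numpy as np

D = 1.7
a = np.array([1.6553, 0.8500, 2.1921, 3.1316])    # Test 1 parameters (Table 1)
b = np.array([-1.3684, -0.5263, 0.3158, 1.5789])

def domain_score(theta):
    """Equation 3: average probability of a correct response across the n items."""
    p = 1.0 / (1.0 + np.exp(-D * a * (theta - b)))
    return p.mean()

def theta_at_cutoff(pi_cut, grid=np.linspace(-4.0, 4.0, 8001)):
    """Return the theta whose domain score is closest to the cut-off proportion."""
    scores = np.array([domain_score(t) for t in grid])
    return grid[np.argmin(np.abs(scores - pi_cut))]

for cut in (0.50, 0.60, 0.70, 0.80):
    print(cut, round(theta_at_cutoff(cut), 2))    # compare with the Test 1 column of Table 2
```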
Dependent Variables

The dependent variables recorded in the simulation are described here, followed by a detailed description of the simulation steps. The dependent variables are divided into two levels. In level 1 are the comparisons to the true ability. In level 2 are the comparisons to the ability estimate from the objective scoring condition.

Level 1: Comparison to True Theta

The score differences (bias and error) and decision (pass and fail) differences considered in the comparison to true examinee ability are described here.

Score Differences

Two score-level differences were considered. First, the difference between the examinee's true ability and estimated ability was recorded and examined for the scoring designs. This difference (θ - θ̂) is referred to as bias, and it is an indicator of accuracy. Second, the difference in the mean square error (MSE) for the scoring designs was recorded and compared. MSE provides a measure of consistency within the scoring design. These aspects of scoring are essential in assessment. Although it is important for ability estimates to be accurate, if this accuracy is inconsistent, comparisons may not be made across individuals. In the same way, it is important for ability to be estimated in a consistent manner; however, if the estimate is wrong, no matter how consistent, it is of little use. For the examination of accuracy and consistency, first the rater allocation designs were compared to one another, and then their accuracy and consistency were compared to those of the objective scoring estimate.

Decisions

Decision accuracy was examined in two ways. The percentage of correct passes was identified for each scoring design and compared at each cut-off level. The percentage of correct failures was then identified for each scoring design and compared at each cut-off level. Again, for the examination of decision accuracy, first the rater allocation designs were compared to one another, and then their decision accuracy was compared to that obtained under objective scoring. The four possible decision outcomes are shown in Figure 7.

Figure 7. Decision Grid

  Incorrect decision to pass   |   Correct decision to pass
  Correct decision to fail     |   Incorrect decision to fail

Level 2: Comparison to the Objective Estimate

The score differences (discrepancy) and decision (pass and fail) differences considered in the comparison to the objective estimate are described here.

Discrepancy

The difference between the ability estimate obtained through objective scoring and that of each of the rater allocation designs (θ̂_obj - θ̂_AD) is referred to here as discrepancy. Discrepancy is compared across rater allocation designs and provides a measure of the difference between the ability estimate obtained through objective scoring and that of the rater allocation designs. While θ - θ̂ provides information as to which scoring design is most accurate, θ̂_obj - θ̂_AD provides information as to which allocation design provides ability estimates most, or least, like those that would have been obtained had the test been scored objectively. This is an important distinction because θ̂_obj - θ̂_AD allows for an answer to the question "Would my score (ability estimate) have been different if I had taken an objective test, rather than one scored by a rater under a given design condition?" This is not a question that can be answered by examining relative accuracy and consistency.

Decisions

As with decision accuracy, decision agreement was examined by first comparing the percentage of corresponding pass decisions (between the objective scoring and a given rater allocation design) at each cut-off level and then by comparing the percentage of corresponding fail decisions at each cut-off level across rater allocation designs. Again, a distinction is made between decision accuracy and decision agreement. While decision accuracy provides information as to which scoring design provided the most accurate decisions, decision agreement allows an answer to the question "Would I have passed (or failed) this test if I had taken an objective test, rather than one scored by a rater under a given design condition?" The reader is reminded that, unlike decision accuracy, decision agreement takes into account the percentage of agreement even when the decision is wrong.

Steps in the Simulation

In their investigation, Hombo et al. (2001) noted that simulating the data allowed for control over which factors were varied, thus strengthening conclusions. In the simulation, the item and rater parameters were treated as known to allow the focus to remain on the estimation of examinee ability. IRT reflects the item response models commonly found in educational measurement. Further, IRT allows for an estimate of ability (θ̂) to be calculated, thus providing the information needed to assess accuracy. The purpose of the Hombo et al. study required the fixing of parameters because the designs used in their study would not lend themselves to the estimation of the rater parameter. The same is true here.

Step 1: Generation of the Simulated Rater Scored Item Responses

In this simulation, examinee true ability (θ) was generated from a standard normal distribution with a mean of 0 and a standard deviation of 1, N(0, 1), for each examinee in a sample of 16,000. True ability was generated only once for each replication and used in the simulation of ability estimates under each of the scoring conditions across an interval from -3.0 to 3.0. Thus the true score for each examinee remained the same under all design conditions for each replication. The experiment was replicated 50 times for each of the three sets of item parameters (tests). Four constructed response items made up the simulated test. This number is reflective of that found in practice; many of the 2004 provincial graduating assessments in the province of British Columbia contain approximately four constructed response items.

Model Used in the Simulation of Data (Data Generation)

Rater effects were incorporated into a two-parameter logistic model (2PL) because, for constructed response items, it is typically assumed that there is little probability of guessing the correct response. For dichotomously scored rater data, the IRT model that allows for the inclusion of a rater parameter (γ) takes the form

P_{ij}(\theta_n) = \frac{e^{Da_i(\theta_n - b_i - \gamma_j)}}{1 + e^{Da_i(\theta_n - b_i - \gamma_j)}}          Equation 4

where
P_ij(θ_n) is the probability that a randomly chosen examinee with latent trait θ_n responds correctly or positively to item i when scored by rater j,
a_i is the item discrimination parameter,
b_i is the item difficulty parameter,
γ_j is the rater parameter,
n is the number of items, and
D is a scaling factor used to make the logistic function as close as possible to the normal ogive function (D = 1.7).

The item parameters used in the simulation were described earlier in the section entitled "Item Characteristics (Tests)".
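As a sketch of Step 1 under Equation 4, the fragment below draws true abilities from N(0, 1) and generates a dichotomous rater-scored response for every examinee-item pair, with the assigned rater's severity shifting the effective item difficulty. The function name and the allocation-matrix argument are illustrative assumptions; an all-zero severity vector reproduces the objective scoring condition.

```python
import numpy as np

D = 1.7

def generate_responses(a, b, gamma, allocation, n_examinees=16000, rng=None):
    """Simulate dichotomous rater-scored responses under Equation 4.

    a, b       : item discrimination and difficulty parameters
    gamma      : severity parameter for each rater in the pool
    allocation : (n_examinees x n_items) matrix of rater indices produced by a
                 spiral, row, or random design (zeros of gamma = objective scoring)
    """
    rng = rng or np.random.default_rng()
    a, b, gamma = np.asarray(a), np.asarray(b), np.asarray(gamma)
    theta = rng.normal(0.0, 1.0, size=n_examinees)            # true ability ~ N(0, 1)
    rater_severity = gamma[allocation]                         # severity applied to each cell
    z = D * a * (theta[:, None] - b - rater_severity)          # Equation 4 exponent
    p = 1.0 / (1.0 + np.exp(-z))
    responses = (rng.uniform(size=p.shape) < p).astype(int)    # Bernoulli draw per response
    return theta, responses
```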
Rater Parameters

Rater parameters were fixed in order for comparisons to be made across designs. Four rater parameters, -0.5, -0.25, 0.25, and 0.5, defined here as γ_j, were used. Rater severity increases with the value of the parameter. The size of these effects was chosen to correspond to those in the Hombo et al. study, which were based on an analysis of a set of 1994 reading data from the National Assessment of Educational Progress (NAEP).

Step 2: Psychometric Analysis of the Rating Data

All analyses were completed using the computer program PARSCALE (Muraki & Bock, 1999). PARSCALE was used because it allows for a variation of the 2PL for either binary or polytomous item ratings; its use also allows the present study to be extended easily to include polytomously scored items in the future. As in the Hombo et al. study, for all design conditions the rater and item parameters were treated as known (i.e., not estimated by PARSCALE), allowing the focus of the analysis to remain on the rater allocation designs. Also, the designs investigated in this study do not allow one to estimate the rater parameters because they represent the situation wherein there is no overlap in the examinees rated by the raters (Englehard, 1997).

Estimation of Examinee Ability

Examinees were sorted into bins based on their simulated true ability level. Bias and the differences between each of the allocation designs and the objective scoring design were recorded for each theta bin. The ability estimates were obtained using Bayesian estimation procedures (EAP). Previous studies have used maximum likelihood estimation (MLE); however, MLE estimates of ability become infinite for examinees with either all-1 (all correct) or all-0 (all incorrect) response vectors when scoring is dichotomous, and PARSCALE eliminates these examinees from the analysis. Whereas previous studies eliminated these scores before analysis, the potential for a high number of all-1 or all-0 response vectors on a short test would make this an unfeasible choice. The use of Bayesian estimation procedures eliminates this concern.
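Although the operational estimation was carried out in PARSCALE, the EAP estimator referred to above is easy to sketch when the item and rater parameters are treated as known: the estimate is the posterior mean of theta over a quadrature grid with an N(0, 1) prior. The fragment below is an illustration of the estimator, not of PARSCALE's implementation.

```python
import numpy as np

D = 1.7

def eap_estimate(responses, a, b, gamma_row, n_quad=61):
    """EAP ability estimate for one examinee's dichotomous response vector,
    treating item (a, b) and rater (gamma_row) parameters as known."""
    quad = np.linspace(-4.0, 4.0, n_quad)                      # quadrature points
    prior = np.exp(-0.5 * quad ** 2)                            # N(0, 1) prior (unnormalized)
    z = D * a * (quad[:, None] - b - gamma_row)                 # Equation 4 exponent
    p = 1.0 / (1.0 + np.exp(-z))
    like = np.prod(np.where(responses == 1, p, 1.0 - p), axis=1)
    post = like * prior
    return np.sum(quad * post) / np.sum(post)                   # posterior mean = EAP
```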
Step 3: Analysis of Simulation Results

The analysis of the simulation results is broken down for score differences and for decisions at the two comparison levels: comparison to true theta, and comparison to the objective scoring estimate. Several recent studies have examined both commonly used and new analysis models that could allow for increased accuracy in ability estimates with human-rated data and for the inclusion of information from second or additional ratings. The benefit of using an analysis model that incorporates a rater parameter has been well documented in recent studies (Hombo et al., 2001; Patz et al., 2002; Patz et al., 1997; Wilson & Hoskens, 2001). Lessons can be learned by examining the treatment of item responses in testing programs such as the National Assessment of Educational Progress (NAEP). NAEP statistically models item response probabilities in terms of examinee proficiencies and item characteristics such that, for constructed response items, the probability that a given response will obtain a particular score depends on the proficiency and item characteristics in addition to rater characteristics (e.g., severity). IRT allows us to examine item, test, examinee, and rater characteristics. IRT refers to a family of models that includes the θ-shift model, in which the effect of raters may shift item difficulty (the b parameter) in the item characteristic curve in the direction of the rater's rating tendency.

At comparison level one, effect sizes associated with accuracy and consistency were compared across scoring designs at each theta bin level for each of the three tests. Effect sizes associated with a correct decision to pass an examinee and a correct decision to fail an examinee were then compared at each cut-off level for each of the three tests. At comparison level two, effect sizes associated with discrepancy were compared across scoring designs at each theta bin level for each of the three tests. Effect sizes associated with a decision discrepancy for a pass decision and a decision discrepancy for a fail decision were then compared at each cut-off level for each of the three tests.

In mathematical statistics, when we look at properties, we look at bias (θ̂ - θ) and variance (consistency). When we look at outcomes, we want one with the least bias and the least sample-to-sample variability. The definition of unbiased is that the average of the statistic is equal to the population quantity. In simulation experiments we calculate bias. If that bias is negative, then the statistic underestimates (the statistic is lower than it should be). When it is 0, we say that the outcome is unbiased. A positive difference, then, indicates that the scores are inflated (too high). Similarly, when it comes to variability, we compute the mean square error (MSE). If the statistic is unbiased, then the MSE is equal to the variability of the statistic. Alternatively, if it is biased, the MSE will not equal the variance of the statistic because MSE = bias² + variance. Ideally, we want a measure with little bias and little variability, but these are relative terms.

Statistical Properties

At comparison level one, the relative accuracy and consistency of the scoring designs across levels of examinee ability were examined. Specifically, bias and MSE scores were calculated. Total bias and total squared bias were calculated and summarized for each allocation design, theta bin, and set. This allowed for observations of relative accuracy under these design conditions. The percentage of correct pass decisions was considered separately from the percentage of correct decisions to fail an examinee.
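The conditional summaries described under Statistical Properties amount to simple bookkeeping once the true and estimated abilities are stored. The sketch below bins examinees on true theta, computes conditional bias (taken here as estimate minus true ability, matching the sign convention above) and MSE, and tallies the percentages of correct pass and fail decisions at a theta cut-off; the bin edges mirror the six ranges defined earlier, and the function name is an illustrative assumption.

```python
import numpy as np

BIN_EDGES = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])   # boundaries of the six true-theta bins

def conditional_summaries(theta_true, theta_hat, cutoff):
    """Conditional bias and MSE per true-theta bin, plus percent correct
    pass and fail decisions at a theta cut-off."""
    bins = np.digitize(theta_true, BIN_EDGES)                  # bin index 0..5
    bias = np.array([np.mean(theta_hat[bins == k] - theta_true[bins == k])
                     for k in range(6)])
    mse = np.array([np.mean((theta_hat[bins == k] - theta_true[bins == k]) ** 2)
                    for k in range(6)])
    should_pass = theta_true >= cutoff                          # decision implied by true ability
    does_pass = theta_hat >= cutoff                             # decision implied by the estimate
    pct_correct_pass = 100.0 * np.mean(does_pass[should_pass])
    pct_correct_fail = 100.0 * np.mean(~does_pass[~should_pass])
    return bias, mse, pct_correct_pass, pct_correct_fail
```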
At comparison level two, differences in discrepancy were examined, and then a comparison of the agreement in the decisions across rater allocation designs at the cut-off levels was conducted. At both comparison levels, an examination of the profile plots served to clarify small differences and allowed for discussion of patterns of consistency (or variation) that were not as readily seen in tables.

Comparison Rules

In this study it is recognized that a statistically significant difference may not represent an important difference. Similarly, a difference that is not found to be statistically significant may have great ramifications. The inappropriate use of statistical testing has been extensively criticized (Cohen, 1994; Daniel, 1998; Shaver, 1993). A logical step in the analysis of these data would most certainly have been to fit an ANOVA model and estimate η²; however, given the nature of this simulation study, replication to replication differences were expected to be very small. This means that the mean square error for any given effect would also be small. As such, nearly all effects would be found to be statistically significant. When sample size is small, η² is an upwardly biased estimate of the population strength of association between the dependent and independent variables (Pierce, Block, & Aguinis, 2004). Although a sample size of 16,000 was used for each replication in a cell in the simulations, it would be necessary to conduct the ANOVAs on summary data. As such, the mean scores of the 50 replications for each test would be used in the ANOVA. Further, Pierce et al. recently provided a cautionary note on reporting η² values from multifactor ANOVA designs such as that used in the current study. This study would require the use of partial η². Pierce et al. reinforced the distinction between classical and partial η² as measures of strength of association. Classical η² is defined as the proportion of total variance attributable to a given factor, specifically,

classical η² = SS_factor / SS_total        Equation 5

where SS_factor is the variation attributable to the factor and SS_total is the total variation. Partial η², as given by Pierce et al., however, is defined as the portion of total variance attributable to the factor, excluding other factors from the nonerror variation. Specifically,

partial η² = SS_factor / (SS_factor + SS_error)        Equation 6

where SS_factor is the variation attributable to the factor and SS_error is the error variation. The denominator in Equation 6, then, will always be less than or equal to the denominator in Equation 5. As such, partial η² is generally greater than classical η² for a source of variance in a multifactor ANOVA. Another notable difference is that, although classical and partial η² are said to range from 0 to 1, in a multifactor ANOVA the partial η² values can sum to greater than 1, while the classical η² values cannot (Cohen, 1972). The reason for this is that partial eta-squared is not a measure of unique variation. Some of the non-error variation, then, could be accounted for by other factors in the analysis, leading to confusing conclusions. If this study had chosen to focus on null hypothesis statistical testing, even minute differences would be identified as statistically significant. In this study, as in previous studies in this area, the focus is not on statistical differences but, rather, on the size and differences in the effects.
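Equations 5 and 6 differ only in their denominators, which the short sketch below makes explicit. The sums of squares would ordinarily come from an ANOVA table; the values used here are placeholders chosen only to show that partial η² is at least as large as classical η².

def classical_eta_squared(ss_factor, ss_total):
    # Equation 5: factor variation as a proportion of the total variation.
    return ss_factor / ss_total

def partial_eta_squared(ss_factor, ss_error):
    # Equation 6: factor variation relative to factor-plus-error variation only.
    return ss_factor / (ss_factor + ss_error)

# Placeholder sums of squares: one factor of interest, other factors, and error.
ss_factor, ss_other, ss_error = 10.0, 30.0, 40.0
ss_total = ss_factor + ss_other + ss_error
print(classical_eta_squared(ss_factor, ss_total))   # 0.125
print(partial_eta_squared(ss_factor, ss_error))     # 0.2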
The importance of a given effect size (an effect size worthy of discussion) is quite different in different fields and under different design and analysis conditions. In the spirit of Cohen (1994), this study focuses on what we want to know without trying to make it so through statistical testing. As such, this study compares magnitude of effects. An effect is the actual difference between population means. The use of direct comparisons of actual differences to evaluate the relative performance of scoring designs is in keeping with previous research in this area (Hombo et al., 2001; Patz et al., 1997). Unlike these previous studies, a particular size of difference will provide the basis from which to report and discuss results. Differences beyond 5% in the accuracy of the decision to pass or fail an examinee are considered to be worthy of discussion and will henceforth be referred to as a "concerning difference".² In this study, with 16,000 examinees, a difference of 5% between allocation designs would affect the decisions made for 800 examinees. Acceptable differences in percentage of correct decisions could range greatly according to the context and purpose of an assessment. All differences, depending on the context, are worthy of consideration. It is important to note that a difference of .05, and the 5% rule used in this study, are established solely for the purpose of discussing the results and should not be used as a general rule.

² It is recognized that this is a new term in the literature and may be grammatically awkward; nonetheless, it is the most appropriate term because it conveys the spirit of my methodology.

A difference in discrepancy greater than .05 between rater allocation designs is again established as the criterion for a concerning difference, and the 5% rule is applied to the percentage of agreement in the decisions for the objective scoring and the rater allocation designs. There is no precedent for the classification of differences in discrepancy.

Summary

In summary, three sets of item parameters for four items were randomly selected from those used by Hombo et al. Using these parameters, data were simulated and then analyzed. In this study, a θ-shift rater model was used to analyze simulated 2PL data. Scoring design refers to the three rater allocation designs and the objective scoring together. Bias is defined as the difference between the observed examinee ability (θ̂) and the true examinee ability (θ). As such, bias is an indicator of accuracy. The examination of bias here provides a measure of accuracy obtained for the three sets under six bins on the true score scale (θ_Bin) and four scoring design conditions (spiral, row, random, and objective). MSE is a measure of consistency, or variability. Specifically, here, the examination of MSE provides a measure of consistency across the three sets, under theta bins and across scoring design conditions. The reader is reminded to keep in mind that the items used in this study were scored with a rater parameter in the model. We know from previous research that when a rater parameter is not included in the scoring model, rater allocation designs perform in a differential manner (Hombo et al., 2001). This study acknowledges this finding and is interested in accuracy, consistency, and decision differences in a situation in which the effects of raters have, theoretically, been addressed.

CHAPTER IV: RESULTS

The results of the analyses presented here are organized into two sections.
In Section A, results of the comparisons to the true examinee ability are presented (comparison level one). Specifically, in Section A, the relative accuracy of examinee ability estimation and the relative consistency of estimates across replications are reported. This is completed for each rater scoring design across theta bin levels. In order to efficiently report results from the theta bin levels, the categorizations of theta bin are identified by a letter rather than by the somewhat lengthier description of the theta bin. The theta bins are identified, then, as:

A. less than or equal to -2.0
B. between -1.999 and -1.0 inclusive
C. between -0.999 and 0 inclusive
D. between 0.001 and 1.0 inclusive
E. between 1.001 and 2.0 inclusive
F. between 2.001 and greater.

Profile plots are used to show, graphically, how the scoring designs relate to one another over categorizations of true examinee ability. This is followed by the results of the decision accuracy analyses. Decision accuracy is reported first for a correct decision to pass an examinee and then for a correct decision to fail an examinee, across four cut-off levels. Profile plots are used to show, graphically, how the scoring designs relate to one another at the four cut-off levels.

In Section B (comparison level two), the results are reported for the analyses pertaining to the difference in ability estimates obtained under each of the three rater allocation designs and the objective scoring condition. This difference is referred to in this study as discrepancy. Discrepancy is an indicator of how different the estimate of examinee ability would be if scored by a rater using a specified allocation design rather than if the same test, given to the same examinees, were to be scored objectively (no rater). Discrepancy was compared across rater allocation designs at each theta bin level. Again, theta bin is identified by a letter. Profile plots were again used to illustrate the differences in discrepancy across scoring designs over the theta bin levels. Finally, results from the decision analyses are reported for comparison level two. Specifically, the agreement in the decision to pass or to fail an examinee under the objective scoring condition and each of the rater allocation designs is reported for each cut-off level. Differences are calculated for each analysis and given in the appendices.

Section A: Comparison to True Theta

Accuracy of the Scoring Designs Across Levels of Examinee Ability

The relative accuracy of the scoring designs across levels of examinee ability is examined here. Accuracy is defined by bias (θ̂ − θ). Therefore, smaller differences indicate less bias and greater accuracy. Conversely, scoring designs with larger biases are deemed to be less accurate. In order to interpret the results, a difference of .05 between scoring designs was established as indicating a concerning difference. The reader is reminded that what constitutes a concerning difference is relative to the purpose of the study and the characteristics of the data. The squared bias for each theta bin level is provided in Tables 3, 4 and 5 for Tests 1, 2 and 3, respectively.
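The theta bin categorization and the .05 comparison rule described above can be expressed directly. The sketch below is illustrative only; the two summary values passed to the comparison are hypothetical and simply show the form of the check.

import numpy as np

BIN_EDGES = [-np.inf, -2.0, -1.0, 0.0, 1.0, 2.0, np.inf]
BIN_LABELS = ["A", "B", "C", "D", "E", "F"]

def theta_bin(theta):
    # Assign a simulated true ability to one of the six bins listed above
    # (bin A closes at -2.0, bin B at -1.0, and so on).
    idx = np.searchsorted(BIN_EDGES, theta, side="left") - 1
    return BIN_LABELS[idx]

def concerning_difference(summary_1, summary_2, criterion=0.05):
    # Flag a difference between two scoring designs that exceeds the criterion.
    return abs(summary_1 - summary_2) > criterion

print(theta_bin(-2.3), theta_bin(0.4))       # A D
print(concerning_difference(0.92, 0.89))     # False: a difference of .03 is not flagged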
Relative Accuracy of the Rater Allocation Designs

The relative accuracy of the rater allocation designs was examined first. The difference in squared bias for the rater allocation designs was trivial in Tests 1 and 2. Differences in accuracy were less than .05 across the allocation designs. These differences for Tests 1 and 2 are provided in Appendix B and Appendix C, respectively. In Test 3, a relatively large difference between the spiral and the random allocation designs (.20) is seen at the lowest level of theta bin (bin A). Differences for Test 3 are detailed in Appendix D. An examination of the test information function (Figure 6) reveals that for this test, within this theta bin level, information is particularly low and standard error is very high. As such, this result was not surprising. Overall, trivial differences in accuracy were seen across rater allocation designs.

Accuracy of the Objective Scoring Design

Accuracy of the objective scoring was then compared to the accuracy of each of the rater allocation designs. In Test 1, a difference in accuracy of .08 was found between the objective scoring condition and the row and random allocation designs at the highest theta bin level, with the objective scoring showing greater accuracy. At this level, a difference in accuracy of .09 was also found between the objective scoring condition and the spiral allocation design, with the objective scoring again showing greater accuracy. Differences at the other theta bin levels were less than or equal to .05 in Test 1. In Test 2, a difference in accuracy between the objective scoring condition and the row allocation design was .07 in theta bin B. The objective scoring was indicated to be more accurate within this theta bin. Accuracy differences between the objective scoring condition and the remaining two allocation designs (spiral = .11, and random = .10) were also found at the highest theta bin level. In Test 3, small differences in accuracy were again seen between the objective scoring condition and each of the allocation designs, except at the highest theta bin level. At this theta bin level, a relatively large difference in accuracy was seen, where the difference in accuracy between the objective scoring condition and the spiral allocation design was .15. The accuracy difference between the objective scoring condition and the remaining allocation designs at this theta bin level was .11. The objective scoring was again indicated to be more accurate. Overall, the objective scoring was found to be more accurate than the rater allocation designs when differences greater than .05 were seen in squared bias.³

³ The reader should note that in interpreting squared bias, it is important to consider the pattern and not just the absolute value. In particular, it is important to compare the numerical values down the columns in Table 3.

Table 3
Squared Bias For Each Theta Bin in Test 1

Theta Bin                                 Spiral    Row    Random    Objective
A. less than or equal to -2.0              0.92     0.94     0.94       0.89
B. between -1.999 and -1.0 inclusive       0.19     0.18     0.19       0.18
C. between -0.999 and 0 inclusive          0.02     0.01     0.01       0.01
D. between 0.001 and 1.0 inclusive         0.03     0.03     0.02       0.02
E. between 1.001 and 2.0 inclusive         0.11     0.08     0.08       0.11
F. between 2.001 and greater.              0.42     0.41     0.41       0.33

Table 4
Squared Bias For Each Theta Bin in Test 2

Theta Bin                                 Spiral    Row    Random    Objective
A. less than or equal to -2.0              0.69     0.71     0.73       0.68
B. between -1.999 and -1.0 inclusive       0.10     0.09     0.10       0.03
C. between -0.999 and 0 inclusive          0.05     0.04     0.04       0.09
D. between 0.001 and 1.0 inclusive         0.04     0.03     0.03       0.05
E. between 1.001 and 2.0 inclusive         0.16     0.16     0.15       0.14
F. between 2.001 and greater.              0.57     0.58     0.58       0.47
Table 5
Squared Bias For Each Theta Bin in Test 3

Theta Bin                                 Spiral    Row    Random    Objective
A. less than or equal to -2.0              1.57     1.74     1.77       1.76
B. between -1.999 and -1.0 inclusive       0.10     0.14     0.15       0.12
C. between -0.999 and 0 inclusive          0.00     0.00     0.00       0.01
D. between 0.001 and 1.0 inclusive         0.00     0.00     0.00       0.00
E. between 1.001 and 2.0 inclusive         0.11     0.11     0.11       0.15
F. between 2.001 and greater.              0.45     0.41     0.41       0.30

Accuracy Across Theta Bin Levels

When examined across theta bin levels, it was noted that accuracy was considerably lower for all of the scoring conditions at the two ends of the theta distribution in all of the tests. This is not surprising. As N decreases at the lower and higher ends of the theta distribution, it makes sense that squared bias would increase. Although squared bias allows data to be examined on the same metric as error, one problem with examining only the squared bias is that it does not allow for direction to be considered. As such, bias was also examined. Tables 6 through 8 summarize bias prior to squaring for each of the scoring designs across theta bin levels. Notable in these data is that the bias at the low end of the ability scale is positive whereas the bias at the high end is negative. This means that, at the lower ability levels, the scores were over-estimated (higher than they should be), whereas at the high end the scores were under-estimated (lower than they should be). This was very slightly less pronounced at the high end of the theta scale under the objective scoring condition in all three tests.

Table 6
Bias For Each Theta Bin in Test 1

Theta Bin                                 Spiral    Row    Random    Objective
A. less than or equal to -2.0              0.96     0.97     0.96       0.94
B. between -1.999 and -1.0 inclusive       0.43     0.42     0.44       0.42
C. between -0.999 and 0 inclusive          0.11     0.09     0.10       0.08
D. between 0.001 and 1.0 inclusive        -0.16    -0.16    -0.14      -0.14
E. between 1.001 and 2.0 inclusive        -0.33    -0.29    -0.28      -0.33
F. between 2.001 and greater.             -0.64    -0.64    -0.64      -0.57

Table 7
Bias For Each Theta Bin in Test 2

Theta Bin                                 Spiral    Row    Random    Objective
A. less than or equal to -2.0              0.83     0.84     0.85       0.83
B. between -1.999 and -1.0 inclusive       0.32     0.31     0.31       0.16
C. between -0.999 and 0 inclusive          0.22     0.21     0.21       0.29
D. between 0.001 and 1.0 inclusive        -0.19    -0.17    -0.17      -0.22
E. between 1.001 and 2.0 inclusive        -0.40    -0.40    -0.38      -0.37
F. between 2.001 and greater.             -0.75    -0.76    -0.76      -0.69

Table 8
Bias For Each Theta Bin in Test 3

Theta Bin                                 Spiral    Row    Random    Objective
A. less than or equal to -2.0              1.25     1.32     1.33       1.33
B. between -1.999 and -1.0 inclusive       0.31     0.37     0.38       0.35
C. between -0.999 and 0 inclusive         -0.03    -0.04     0.00      -0.10
D. between 0.001 and 1.0 inclusive         0.00    -0.01    -0.01       0.05
E. between 1.001 and 2.0 inclusive        -0.33    -0.33    -0.33      -0.39
F. between 2.001 and greater.             -0.67    -0.64    -0.64      -0.55

Figures 8 through 10 show the bias across theta-bin levels for Tests 1 through 3, respectively. An examination of these profile plots for bias reinforces the generally trivial difference found between scoring designs in all three of the tests. On the ordinate are the estimated marginal means. Although ANOVA results are not presented in this study, the estimated marginal means are the cell means. In this case, the profile plot shows the relationship between bias and the examinee true ability (theta) found on the abscissa.
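Profile plots of the kind shown in Figures 8 through 10 can be produced directly from the binned summaries. The sketch below uses matplotlib and, for illustration, the Test 1 bias values for the spiral and objective conditions from Table 6; it is meant only to show the form of the plot, not to reproduce the original figures.

import matplotlib.pyplot as plt

bins = ["A", "B", "C", "D", "E", "F"]
bias_by_design = {
    "Spiral":    [0.96, 0.43, 0.11, -0.16, -0.33, -0.64],   # Table 6, spiral column
    "Objective": [0.94, 0.42, 0.08, -0.14, -0.33, -0.57],   # Table 6, objective column
}

for design, values in bias_by_design.items():
    plt.plot(bins, values, marker="o", label=design)

plt.axhline(0.0, linewidth=0.5)   # reference line at zero bias
plt.xlabel("Theta bin")
plt.ylabel("Bias")
plt.legend()
plt.show()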
In Tests 1 and 3, the scoring designs are virtually indistinguishable in their level of bias, whereas slight deviations in the bias associated with the objective scoring are seen in Test 2. Clearly seen on the profile plots is the fact that at the low theta bin levels bias is positive, whereas at the higher levels bias is negative.

Figure 8. Bias Across Theta-bins for Test 1

Figure 9. Bias Across Theta-bins for Test 2

Figure 10. Bias Across Theta-bins for Test 3

Consistency of the Scoring Designs Across Theta Bin

Clearly, scoring designs that produce accurate estimates of examinee ability are desirable. So too is consistency in the estimate. In this section, consistency in the scoring designs is examined for each of the categorizations of true theta. MSE was used as an indicator of consistency. Examination of MSE allows for replication to replication differences to be compared across the four scoring designs. The MSE for each of the scoring designs at each theta bin level in the three tests is provided in Tables 9 through 11. When examining MSE, larger values indicate less replication to replication consistency. Hence, smaller values mean greater consistency. Again, a difference of greater than .05 between scoring designs indicated concerning differences in consistency.

Relative Consistency of the Rater Allocation Designs

In Test 1, a comparison of the rater allocation designs showed a concerning difference in consistency in all but the highest and lowest theta bins. The MSE for each of the scoring designs at each theta bin level is provided in Table 9 for Test 1, and the MSE differences between scoring designs are provided in Appendix E. At these theta bin levels, consistency was quite low. A difference of .07 was found between the row and random allocation designs at theta bin B. Here the row allocation design had the highest level of consistency. At theta bin level C, the spiral and row allocation designs showed the same degree of consistency. The MSE for each of these rater allocation designs was .22. A difference in consistency of .06 was found between these two allocation designs and the random allocation design. A difference in consistency of .09 was found between the spiral and random allocation designs in theta bin D, with the spiral allocation design showing greater consistency across replications. A difference of .11 was seen in theta bin level E between the row and random allocation designs. The row allocation design showed the greatest replication to replication consistency at this theta bin level. In general, the row allocation design was indicated to be slightly more consistent than the other rater allocation designs in Test 1.

Table 9
Mean Square Error for each Theta Bin in Test 1

Theta Bin                                 Spiral    Row    Random    Objective
A. less than or equal to -2.0              1.11     1.13     1.13       1.05
B. between -1.999 and -1.0 inclusive       0.41     0.39     0.46       0.41
C. between -0.999 and 0 inclusive          0.22     0.22     0.28       0.19
D. between 0.001 and 1.0 inclusive         0.23     0.24     0.32       0.33
E. between 1.001 and 2.0 inclusive         0.33     0.30     0.41       0.32
F. between 2.001 and greater.              0.63     0.59     0.58       0.47

The MSE for each of the scoring designs at each theta bin level is provided in Table 10 for Test 2. In this test, differences in consistency greater than .05 were seen at three levels of the theta distribution.
MSE differences between scoring designs are provided in Appendix F for Test 2. At theta bin B, a difference of .11 was found for the row and random allocation designs, with the row allocation design being the more consistent of the two. In theta bin C, a difference in consistency of .09 was found between the random allocation design and the remaining two allocation designs, with the random design being considerably less consistent than the other designs. In theta bin level E, differences of .11 and .10 were found for the row and spiral allocation designs when compared to the random allocation design, respectively. Here, the spiral and row allocation designs showed greater consistency than the random allocation design.

Table 10
Mean Square Error for each Theta Bin in Test 2

Theta Bin                                 Spiral    Row    Random    Objective
A. less than or equal to -2.0              0.87     0.88     0.86       0.81
B. between -1.999 and -1.0 inclusive       0.35     0.32     0.43       0.24
C. between -0.999 and 0 inclusive          0.25     0.25     0.34       0.28
D. between 0.001 and 1.0 inclusive         0.23     0.22     0.26       0.23
E. between 1.001 and 2.0 inclusive         0.44     0.43     0.54       0.45
F. between 2.001 and greater.              0.75     0.77     0.76       0.62

The MSE for each of the scoring designs at each theta bin level is provided in Table 11 for Test 3. In Test 3, a difference of greater than .05 was found between rater allocation designs at all theta bin levels. MSE differences between scoring designs are provided in Appendix G for Test 3. At theta bin level A, a difference of .20 was found between the spiral allocation design and the row allocation design, where the spiral allocation design was the more consistent of the two. The spiral allocation design was also more consistent than the random allocation design at the same theta bin level, with a difference of .17. The row allocation design was the least consistent of the three at theta bin level A. At theta bin level B, a difference of .06 for consistency was seen between the spiral and row allocation designs. Again, the smaller MSE found for the spiral allocation design indicated that it was the more consistent of the two. At theta bin level C, the row and spiral designs both showed greater consistency (MSE = .19) than the random allocation design (MSE = .32). The resulting difference was .13. This pattern was repeated again at theta bin level D. At this theta bin level, the spiral and row allocation designs showed similar levels of consistency (MSE = .16 and .17, respectively) but showed a difference of .10 and .11, respectively, from the random allocation design. At theta bin level E, a difference of .06 was found for the spiral and random designs, whereas in theta bin level F, the concerning difference (.07) was found between the row allocation design and the random allocation design.

Table 11
Mean Square Error for each Theta Bin in Test 3

Theta Bin                                 Spiral    Row    Random    Objective
A. less than or equal to -2.0              1.72     1.92     1.89       1.88
B. between -1.999 and -1.0 inclusive       0.21     0.27     0.24       0.20
C. between -0.999 and 0 inclusive          0.19     0.19     0.32       0.19
D. between 0.001 and 1.0 inclusive         0.16     0.17     0.26       0.15
E. between 1.001 and 2.0 inclusive         0.27     0.29     0.33       0.30
F. between 2.001 and greater.              0.70     0.63     0.70       0.56

Although differences in consistency were seen across theta bin levels, different rater allocation designs were shown to be the most consistent over the three tests. In Test 1, the row allocation design was seen as the most consistent design most often.
Yet, in Test 2, the results were quite mixed. In Test 2, each of the rater allocation designs was shown to be the most consistent at varying theta bin levels. In Test 3, the spiral design showed the highest levels of consistency most frequently.

Consistency of the Objective Scoring Design

When the consistency of the objective scoring was compared to that of the rater allocation designs, a concerning difference was found at all theta bin levels in Test 1. The MSE associated with these differences are given in Table 9 and the differences are provided in Appendix E. The objective scoring design provided the greatest across-replications consistency of the scoring designs at the highest and lowest theta bin levels, as well as in theta bin level C. The spiral allocation design provided the greatest consistency at theta bin level D, whereas the row allocation design provided the greatest consistency at theta bin levels B and E.

In Test 2, differences in consistency greater than .05 were seen between the objective scoring and specific rater allocation designs throughout the theta distribution, with the exception of theta bin level D. The MSEs associated with these differences were given in Table 10 and the differences provided in Appendix F. At theta bin level A, a difference of .06 was found between the consistency of the spiral allocation design and the objective scoring, whereas a difference of .07 was found between the row allocation design and the objective scoring. Both of these differences reflected greater consistency in the objective scoring. The differences in consistency found between the objective scoring and each of the rater allocation designs at theta bin level B were all greater than .05. Superior consistency was found in the objective scoring when compared to each of the three rater allocation designs. At theta bin level C, a concerning difference (.09) was found between the random allocation design and the objective scoring condition. The objective scoring, again, demonstrated greater consistency. At theta bin level E, a consistency difference of .11 was found between the random allocation design and the objective scoring, with the objective scoring showing greater consistency. The objective scoring provided greater consistency than each of the scoring designs, again, at the highest theta bin level, with a consistency difference of .13, .15, and .14 for the spiral, row, and random allocation designs, respectively.

When the consistency of the objective scoring condition was compared with that of the rater allocation designs in Test 3, a consistency difference of greater than .05 was found at all theta bin levels except theta bin level E. The MSE for each of these scoring designs is provided in Table 11 and the differences are detailed in Appendix G. At theta bin level A, a difference of .16 was found between the spiral allocation design and the objective scoring, in which the spiral allocation design was the more consistent of the two. At theta bin level B, a difference of .07 for consistency was seen between the objective scoring and row allocation design. Again, the objective scoring provided the more consistent estimates. At theta bin level C, a difference in consistency of .13 was found for the objective scoring and the random allocation design. This was repeated at theta bin level D with a difference of .11. At each of these theta bin levels, the objective scoring showed superior consistency.
At theta bin level E, a difference of .06 was found for the spiral and random designs, whereas in theta bin level F the concerning difference was found between the row allocation design and the random allocation design (.07). At this bin level, a difference of greater than .05 was found between the objective scoring and each of the rater allocation designs. Again, the objective scoring showed the greatest consistency when compared to each of the rater allocation designs.

The profile plots for MSE (Figures 11 through 13) illustrate the similarity in consistency found in the scoring designs in all three of the tests. Here, the profile plot shows the relationship between MSE and the examinee true ability (theta). As indicated earlier, the estimated marginal means on the ordinate are the cell means. An examination of Figures 11 through 13 shows that the random allocation design is slightly less consistent than the other scoring designs. The greatest fluctuation in consistency across levels of true ability for the four scoring designs is seen in Test 2. All of the scoring designs were found to be less consistent at the upper and lower ends of the theta distribution.

Figure 11. Mean square error for Test 1

Figure 12. Mean square error for Test 2

Figure 13. Mean square error for Test 3

Decision Accuracy

Decision accuracy refers to the percentage of correct decisions to pass or fail an examinee. The accuracy of the decisions made at the four cut-off levels is reported here. Decision accuracy was examined in two ways in this study. First, the percentage of correct passes associated with each scoring design and cut-off value was compared. Second, the percentage of correct decisions to fail an examinee was compared across scoring designs at each cut-off value. Differences of greater than 5% in decision accuracy were considered here to be concerning. It is important to note that the 5% rule⁴ established here is solely for the purpose of discussing the results found in this study and should not be used as a general rule. Researchers need to consider the trade-offs of the consequences when establishing their own guidelines for examining differences such as these.

⁴ It should be noted that although 5% (or .05) are somewhat arbitrary, they were chosen considering: (a) the historical precedent of .05 in statistical decision making, and (b) a consideration of the balance of consequences of ignoring an important finding versus missing an important finding; the former being considered a more egregious error.
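The decision-accuracy percentages reported below are agreement tallies between the decision implied by an ability estimate and the decision implied by the true ability at the same cut-off. A minimal sketch of one plausible tally is shown here; it assumes each cut-off level has already been converted to a value on the theta scale (a test-specific step not shown), and that correct pass and correct fail percentages are computed over examinees whose true ability lies on the corresponding side of the cut-off.

import numpy as np

def decision_accuracy(theta_true, theta_hat, theta_cut):
    # theta_true and theta_hat are NumPy arrays for the same examinees.
    true_pass = theta_true >= theta_cut
    est_pass = theta_hat >= theta_cut
    # Percentage of correct pass decisions among examinees who truly pass,
    # and percentage of correct fail decisions among examinees who truly fail.
    pct_correct_pass = 100.0 * np.mean(est_pass[true_pass])
    pct_correct_fail = 100.0 * np.mean(~est_pass[~true_pass])
    return pct_correct_pass, pct_correct_fail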
Correct Decision to Pass across Rater Allocation Designs

The percentages of accurate pass decisions in Test 1 are provided in Table 12. Percentage differences are detailed in Appendix H. In Test 1, a difference in decision accuracy greater than 5% was found at two of the cut-off levels. The row allocation design provided the greatest accuracy at the 50% cut-off level. The difference in percentage of correct pass decisions for the random and row allocation designs was 7.54%, and the difference for the spiral and row allocation designs was 6.32%. The row allocation design also provided the greatest accuracy when compared to the spiral and random allocation designs at the 70% cut-off level. Here, a difference of 25.32% was seen for the random and row allocation designs, and a difference of 10.49% was seen between the spiral and row allocation designs. The magnitude of the difference at the 70% cut-off level was unexpected and counterintuitive. An examination of the item parameters, test information, and standard error of the test revealed no clue as to why this result might occur. Further, a rigorous check of all simulation files and scores confirmed that this unexpected difference was not an error.

Table 12
Percentage of Correct Pass Decisions in Test 1

                        Scoring Design
Cut Score        Spiral      Row     Random    Objective
50%               74.94     81.26     73.72      76.15
60%               79.03     79.07     78.53      80.95
70%               55.94     66.43     41.11      35.47
80%               62.90     63.20     64.84      67.51

Differences in decision accuracy greater than 5% were found at all cut-off levels in Test 2. The percentages of correct pass decisions for Test 2 are provided in Table 13. The calculated differences for this test are provided in Appendix I. The random allocation design showed superior accuracy for a pass decision at all but the 70% cut-off level for this test. A difference of 8.91% was found at the 50% cut-off level between the row and random allocation designs, and a difference of 6.79% was seen between the spiral and random design. At the 60% cut-off level, the random allocation outperformed the row allocation design by 17.1%. A shift took place at the 70% cut-off level again. At this cut-
A 5.45%  Scoring Designs 86 difference was also found between the spiral and random allocation design at this cut-off level, with the spiral allocation design showing a slightly higher level of decision accuracy. The random allocation design showed greater decision accuracy at the 70% cut-off level. A difference of 34.59% was seen between the row allocation design and the random allocation design. Although a difference of greater than 5% was seen between all of the allocation designs at this cut-off level, the decision accuracy associated with the row allocation design was disproportionately small. A difference of 14.25% was also found between the random and spiral allocation designs at the 70% cut-off level. The difference in percentage of correct decisions to pass an examinee between the row and spiral allocation designs at this cut- off level was 20.34% with the spiral allocation design showing greater decision accuracy. Table 14 Percentage of Correct Pass Decisions in Test 3  Cut Score  Spiral  Scoring Design Row Random  50%  86.46  87.6  87.51  93.02  60%  84.88  92.62  79.43  84.25  70%  76.61  56.27  90.86  95.00  80%  55.44  54.86  55.52  54.57  Objective  Correct Decision to Pass for the Objective Scoring  In general, the accuracy of the decision to pass an examinee under the objective scoring condition was seen to be most consistent with the decision accuracy found using the spiral and random allocation designs across the three tests. The percentage of correct  Scoring Designs 87 pass decisions in Test 1 were given in Table 12 and the differences are presented in Appendix H. A difference of greater than 5% in decision accuracy was seen between the objective scoring and the row allocation design at the 50% and 70% cut-off levels in Test 1. A difference of 7.54 % was seen at the 50% cut-off level. At the 70% cut-off level the results were again counter-intuitive. The row allocation design provided remarkably higher decision accuracy than the objective scoring, with a difference of 30.96%. These data were again checked for errors. A small difference (5.64%) in decision accuracy was found between the random allocation design and the objective scoring condition while a 20% difference was found between the spiral allocation design and the objective scoring condition. In Test 2, the objective scoring produced the most accurate pass decisions at each of the cut-off levels except 70%. These results were summarized in Table 13 and the calculated differences provided in Appendix I. At the 50% cut-off level, a difference of 9.52% was found between the objective scoring and row allocation design. A difference of 7.77% was found between the objective scoring and the spiral allocation design at this cut-off level. A difference of greater than 5% was found again between the objective scoring and the row (18.57%) and spiral (6.18%) allocation designs at the 60% cut-off level. At the 70% cut-off level, a 14.31% difference was seen between the objective scoring and the row allocation design with the row allocation design showing greater accuracy for a pass decision. By contrast, the spiral and random allocation designs showed decision accuracy levels consistent with that obtained under the objective scoring condition. At the 80% cut-off value, the percentage difference for correct decision to pass between the objective scoring and the row allocation design was 23.46%, with the  Scoring Designs 88 objective scoring showing greater decision accuracy. 
Once again, the percentage of correct decisions associated with the other rater allocation designs was closer in value to the objective scoring. The percentage of correct pass decisions for Test 3 were given in Table 14. In this test, differences greater than 5% were found at all but the highest cut-off level for the objective test condition and the allocation designs. These differences are presented in Appendix J. At the 50% cut-off level the objective scoring condition produced more accurate pass decisions 6.56% more often than did the spiral allocation design, 5.51% more often than the random allocation design and 5.42% more often than the row allocation design. At the 60 % cut-off level, an 8.37% difference in decision accuracy was found between the row allocation design and the objective scoring condition, with the objective scoring showing less decision accuracy. The percentage of correct decisions to pass an examinee at the 60% cut-off level was similar for the objective scoring and that of the remaining rater allocation designs. Again, at the 70% cut-off level, a remarkable 38.73% difference in decision accuracy was seen between the row allocation design and the objective scoring condition. While the accuracy rate at this cut-off level was 95% for the objective scoring condition, it was only 56.27% for the row allocation design. An 18.39% difference in decision accuracy was also found between the spiral allocation design and the objective scoring condition at this cut-off level. An examination of the profile plots for a correct decision to pass (Figures 14 through 16) illustrates the difference between the performance of the row allocation design and that of the other scoring designs. Again, on the ordinate, are the estimated marginal means, or predicted cell means from the ANOVA model. On the abscissa are  Scoring Designs 89 the four cut-off values. Here, the profile plot shows the relationship between percentage of accurate decisions to pass an examinee at the cut-off value. Recallfromthe earlier tables, that the estimated marginal means are the cell means. The decision accuracy of the row allocation design is erratic when compared to the other scoring designs, at some cutoff levels moving in a direction opposite to that of the other scoring designs. This difference is seen most clearly in Figures 15 and 16 that correspond to Tests 2 and 3. Also notable is the divergence in decision accuracy around the 70% cut-off value. In Test 1, there is a notable divergence in the decision accuracy of the scoring designs at the 70% cut-off level. While the random and objective scoring designs remain consistent with each other, the spiral allocation design provides slightly deviant levels of decision accuracy. An extreme difference in decision accuracy between the objective scoring and the row allocation design is clearly seen in Figure 14. In Test 2, despite smaller intervals on the ordinal scale, it is clear that the objective scoring , random allocation design, and spiral allocation design provided a similar pattern of decision accuracy across the cut-off levels. The row allocation design, however, provided lower levels of decision accuracy at the 60% cut-off level and higher levels at the 70% cut-off level and considerably lower levels of decision accuracy at the 80% cut-off level when compared to the other scoring designs. Again, in Test 3, the objective scoring and the random allocation design provide similar levels of decision accuracy at the varying cutoff levels. 
The spiral allocation showed a decrease in decision accuracy at the 70% cutoff level whereas the objective scoring and random allocation design both showed an increase. The row allocation design displayed erratic levels of decision accuracy when compared to the other scoring designs in Test 3.  Scoring Designs Figure 14.  Percentage of Correct Pass Decisions Across Cut-Off Levels in Test 1  90  50%  60%  70%  80%  CUT  Figure 15. Percentage  CUT  of Correct Pass Decisions Across Cut-Off Levels in Test 2  Scoring Designs 91  Figure 16. Percentage of Correct Pass Decisions Across Cut-Off Levels in Test 3  Correct Decision to Fail  A summary of the percentages of a correct decision to fail an examinee is given in Tables 15 through 17 for Tests 1 through 3, respectively. Again the 5% rule was used to distinguish concerning differencesfromthose that are not as concerning in the context of this study. Acceptability of a given percentage of correct failures, the reader is again reminded, is a factor of the purpose and characteristics of the specific test. Differences in the decision accuracy percentages for a fail in this study rangedfrom.97% to 20.02%.  Scoring Designs 92 Decision Accuracyfor a Fail Across Allocation Designs  In Test 1, no differences greater than 5% were found for a correct decision to fail an examinee at any of the cut-off levels. Percentages associated with this examination is given in Table 15. The calculated differences are provided in Appendix K. In Test 2, a difference of greater than 5% decision accuracy for a failure was found at the 50% cut-off level and the 60% cut-off level between the row and random allocation designs. The row allocation design showed 18.42% greater decision accuracy for a fail at the 50% cut-off level and 7.50% greater decision accuracy for a fail at the 60% cut-off level. These percentages are provided in Table 16. The differences for Test 2 are detailed in Appendix L. The percentage of correct decisions to fail an examinee for Test 3 are given in Table 17. The calculated differences across scoring designs for this test are provided in Appendix M. In Test 3, differences greater than 5% were seen across allocation designs for the correct decision to fail an examinee at all but the highest cut-off level. At the 50% cut-off level the difference in percentage of correct decisions between the row and random allocation designs was 7.09% with the row allocation design showing greater decision accuracy. At the same cut-off level, the percentage difference between the spiral and random allocation designs was 7.49%. Here, the spiral allocation design showed greater decision accuracy. At the 60% cut-of level a difference of 6.05% was seen between the row and spiral allocation designs with the spiral allocation design showing greater accuracy for the decision to fail an examinee. At the 70% cut-off level, decision accuracy differences of greater than 5% were seen across the three rater allocation designs with the row allocation design showing the highest percentage of accurate  Scoring Designs 93 decisions to fail an examinee and the random allocation design showing the smallest percentage of correct decisions. 
Table 15 Percentage of Correct Fail Decisions in Test 1  Scoring Design Cut Score  Spiral  Row  Random  Objective  50%  89.02  84.75  89.11  93.29  60%  87.88  86.89  85.35  89.71  70%  96.96  95.78  99.24  99.82  80%  97.45  97.12  96.88  98.74  Table 16 Percentage of Correct Fail Decisions in Test 2  Cut Score  Spiral  Scoring Design Row Random  50%  62.04  67.54  49.12  47.52  60%  93.08  97.01  89.51  92.09  70%  98.28  96.60  97.89  99.15  80%  97.40  98.43  96.24  97.88  Objective  Scoring Designs 94 Table 17 Percentage of Correct Fail Decisions in Test 3  Cut Score  Spiral  Scoring Design Row Random  50%  89.29  88.89  81.80  83.67  60%  89.60  83.55  87.62  92.35  70%  90.07  96.72  79.53  82.49  80%  98.44  98.39  98.35  99.32  Objective  Decision Accuracy for a Fail with Objective Scoring  An examination of Appendix K and Table 15 shows a difference of 8.54% for a correct decision to fail an examinee in Test 1. This difference is found between the objective scoring condition and row allocation design at the 50% cut-off level, with the objective scoring showing greater decision accuracy. No other differences greater than 5% were found in Test 1. A large percentage difference (20.02%) in decision accuracy is found at the lowest cut-off level in Test 2. The percentage of correct fail decisions for Test 2 was given in Table 16. The associated differences are provided in Appendix L. The row allocation design provided the greatest percentage of accuracy for a decision to fail (67.54%), whereas the objective scoring provided the least percentage of accurate decisions (47.52%) to fail an examinee. No other concerning differences were found in Test 2. An examination of Appendix M, for Test 3, shows a difference of 5.62% in decision accuracy for a fail at the 50% cut-off level. This difference was found between the objective scoring condition and the row allocation design. A difference of 5.62% between the objective scoring condition and the spiral allocation design was also found at  Scoring Designs 95 the 50% cut-off level. In both cases, the rater allocation design showed greater decision accuracy. An 8.80% difference in decision accuracy for the fail decision was found for the objective scoring condition and the row allocation design at the 60% cut-off level. The objective scoring showed greater accuracy at this cut-off level. At the 70% cut-off level in Test 3, a decision accuracy difference of 14.23% was found between the objective scoring and the row allocation design whereas a 7.58%) difference was found for the correct decision to fail an examinee between the objective scoring and the spiral allocation design. In both cases, the rater allocation design showed greater decision accuracy. The decision accuracy (fail) profile plots for the three tests are provided in Figures 17 through 19. These plots illustrate the percentage of correct decisions to fail an examinee. Here, the profile plot shows the relationship between percentage of accurate decisions to fail an examinee and the cut-off value. In test 1, the objective scoring, random allocation design and spiral allocation design follow a similar pattern, with decision accuracy (fail) decreasing slightly at the 60% cut-off level and increasing slightly at the 70% cut-of level. While slightly exaggerated by the scale, a slight deviation in the pattern of decision accuracy obtained using the row allocation design may be seen when compared to the other scoring designs. 
The row allocation design displays slightly lower decision accuracy than the other scoring designs at the 50% cut-off level. Decision accuracy increases at the 60% cut-off level, and then again at the 70% cut-off level. In Test 2, despite smaller intervals on the ordinal scale, it is clear that the objective scoring and random allocation design provided a similar pattern of decision accuracy across the cut-off levels, whereas the row allocation design, while appearing to provide higher levels of decision accuracy, provided results quite different from those of these two scoring designs. Again, in Test 3, the objective scoring and the random allocation design provide similar levels of decision accuracy at the varying cut-off levels. While the spiral allocation design is not perfectly in line with these two scoring designs, decision accuracy differences are relatively small. The erratic decision accuracy associated with the row allocation design is quite clear in Test 3.

Figure 17. Percentage of Correct Fail Decisions Across Cut-off Levels in Test 1

Figure 18. Percentage of Correct Fail Decisions Across Cut-off Levels in Test 2

Section B: Comparison to Objective Scoring Estimate

In this section, results from the comparison to the objective scoring are presented. Results from the discrepancy analysis are presented first. Here, the ability estimates obtained using each of the rater allocation designs are compared according to how different the estimates are from those obtained under the objective scoring condition at each of the theta bin levels. In the same way that a difference between the estimates obtained under each of the scoring designs and true examinee ability provided a measure of accuracy, the difference between estimates here provided a measure of discrepancy. The second set of analyses at this comparison level were for the pass and fail decisions. The rater allocation designs were compared according to how closely they agreed with the decisions to pass and to fail an examinee under the objective scoring condition at the four cut-off levels.

Estimate Differences

The difference between the estimates obtained under each of the rater allocation designs and the objective scoring condition (discrepancy) was calculated at each theta bin level. These results are summarized in Tables 18 through 20 for Tests 1 through 3, respectively. Differences for these tables are provided in Appendices N through P. The differences between the estimates obtained for the rater allocation designs and the objective scoring were trivial. The discrepancy across rater allocation designs was, at times, indistinguishable in Tests 1 and 2. In Test 3, the difference between theta estimates at the lowest level of true examinee ability is slightly greater for the spiral (.07) than for the random allocation design (0), wherein the estimation of examinee ability was identical to that of the objective scoring. In general, discrepancy was trivial across theta bin levels and rater allocation designs in all of the tests.
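At this comparison level, discrepancy and decision agreement reduce to a difference of estimates and an agreement tally. The sketch below is illustrative only; theta_alloc and theta_obj stand for estimates of the same examinees obtained under a rater allocation design and under objective scoring, and the percentage of pass agreement is assumed here to be taken over examinees passed under the objective scoring.

import numpy as np

def discrepancy(theta_alloc, theta_obj):
    # Mean difference between the allocation-design estimates and the objective estimates.
    return np.mean(theta_alloc - theta_obj)

def pass_agreement(theta_alloc, theta_obj, theta_cut):
    # Percentage of examinees for whom the allocation design reaches the same
    # pass decision as the objective scoring at the given cut-off.
    obj_pass = theta_obj >= theta_cut
    alloc_pass = theta_alloc >= theta_cut
    return 100.0 * np.mean(alloc_pass[obj_pass])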
Table 18
Discrepancy at each theta bin level for Test 1

                                          Allocation Design
Theta Bin                                 Spiral     Row    Random
A. less than or equal to -2.0              0.02     0.03     0.04
B. between -1.999 and -1.0 inclusive       0.01    -0.01     0.02
C. between -0.999 and 0 inclusive          0.01     0.01     0.02
D. between 0.001 and 1.0 inclusive        -0.02    -0.02     0
E. between 1.001 and 2.0 inclusive         0        0.04     0.05
F. between 2.001 and greater.             -0.07    -0.07    -0.06

Table 19
Discrepancy at each theta bin level for Test 2

                                          Allocation Design
Theta Bin                                 Spiral     Row    Random
A. less than or equal to -2.0              0.01     0.01     0.03
B. between -1.999 and -1.0 inclusive       0.16     0.14     0.15
C. between -0.999 and 0 inclusive         -0.07    -0.08    -0.08
D. between 0.001 and 1.0 inclusive         0.03     0.05     0.05
E. between 1.001 and 2.0 inclusive        -0.03    -0.02    -0.01
F. between 2.001 and greater.             -0.07    -0.07    -0.07

Table 20
Discrepancy at each theta bin level for Test 3

                                          Allocation Design
Theta Bin                                 Spiral     Row    Random
A. less than or equal to -2.0             -0.07    -0.01     0
B. between -1.999 and -1.0 inclusive      -0.04     0.02     0.04
C. between -0.999 and 0 inclusive          0.06     0.05     0.09
D. between 0.001 and 1.0 inclusive        -0.05    -0.07    -0.06
E. between 1.001 and 2.0 inclusive         0.06     0.06     0.06
F. between 2.001 and greater.             -0.12    -0.09    -0.09

The profile plots for the differences between the rater allocation designs and objective scoring at each theta bin level for the three tests are provided in Figures 20 through 22. Again, on the ordinate are the estimated marginal means, or cell means. On the abscissa are the theta bin levels. Here, the profile plot shows the relationship between discrepancy and the theta bin levels for each of the rater allocation designs.

The reader is cautioned that tiny differences are exaggerated in the profile plots presented here: because the differences between the calculated discrepancies for the allocation designs are so small, the steps in the scale used for the plots are also quite small. This is particularly true for Test 1 (Figure 20). It is, however, still apparent in the plots that not only was the discrepancy found between the objective scoring estimates and the estimates from each of the three rater allocation designs remarkably similar, but that the value of the discrepancy was remarkably similar across rater allocation designs. If it were not for the next set of analyses, we would be forced to conclude that it makes little difference whether raters are used to score constructed responses or whether responses are scored objectively.

Figure 20. Differences Between the Rater Allocation Designs and Objective Scoring at each Theta bin level for Test 1

Figure 21. Differences Between the Rater Allocation Designs and Objective Scoring at each Theta bin level for Test 2

Figure 22. Differences Between the Rater Allocation Designs and Objective Scoring at each Theta bin level for Test 3

Decisions Compared to those of Objective Scoring

Decision Agreement for a Pass

Although discrepancy differences appeared trivial across rater allocation designs, the differences found across the allocation designs, for the comparable decisions, were not. Tables 21 through 23 show the percentage of times each of the allocation designs agreed with the objectively scored test on a decision to pass an examinee, at each of the cut-off levels, for the three tests. Percentage differences ranged from 1.30% to 41.24%.
Once again, a difference of 5% in the percentage of agreement between the decision obtained under objective scoring and that obtained under each of the rater allocation designs, across rater allocation designs, was used to distinguish a trivial difference from that which is concerning. In Test 1, differences greater than 5% were found at the 50% and 70% cut-off level. These differences are provided in Appendix Q for Test 1. At the 50% cut-off level, a difference of 5.51% was found between the decision agreement found with the random allocation design and the row allocation design. The row allocation design agreed with the decisions obtained through objective scoring slightly more often than did the other two rater allocation designs. A difference of 17.69% was found between the random and row allocation designs at the 70% cut-off level. At this cut-off level the random allocation design agreed with the objective scoring on a decision to pass 75.58% of the time whereas the row allocation design agreed with the objective scoring 93.27% of the time. There was a 10.28% difference in percentage of agreement to the objective scoring decision to pass an examinee between spiral and random design. Here, the spiral design agreed with the objective scoring decision more often. A 7.41% difference was also  Scoring Designs 104 found at this cut-off level between the row and spiral allocation design. Here, the spiral allocation design agreed with the objective scoring 85.86% ofthe time. An examination of Appendix R shows that, in Test 2, differences were seen across allocation designs at each of the cut-off levels. In other words, at each ofthe cut off levels, the rater allocation designs showed greater than a 5% difference with each other on the percentage of agreement to the decision obtained under objective scoring. At the 50% cut-off level, a difference of 13.5% was found between the random and row allocation designs. Here, the decisions obtained under the random allocation design agreed with decisions obtained under objective scoring 90.03% of the time, as opposed to the 79.53% level of agreement found between the row allocation design and the objective scoring. The decisions obtained using the random allocation design agreed with those obtained under objective scoring 9.95% more often than that of the spiral allocation design. At the 60% cut-off, a difference of 24.37% was seen between the decision agreement, to the objective scoring, of the row (58.35%) and random (82.72%) allocation designs. Here, the random allocation design agreed with the pass decision obtained through objective scoring most often. The random allocation design also agreed with the objective decision 8.46% more often than the spiral allocation design. Further, at this same cut-off level, a decision agreement difference of 15.91% was also found between the spiral and the row allocation designs. The spiral allocation design agreed most often with the decision to pass an examinee obtained under the objective scoring condition. At the 70% cut-off level, in Test 2, a shift in the decisions took place such that the row allocation design provided the higher percentage of decision agreement with the objective scoring. A difference of 11.35% in decision agreement between the spiral allocation  Scoring Designs 105 design and the row allocation design was found, whereas a difference of 11.96% in decision agreement was found between the row allocation design and the random allocation design. 
Here, the row allocation design showed a higher percentage of agreement with the objective scoring decisions. The spiral and random allocation designs, however, showed similar levels of agreement with the decisions to pass obtained under the objective scoring condition at this cut-off level. At the 80% cut-off level, the spiral and random allocation designs again showed similar levels of agreement with the objective scoring on the decision to pass an examinee. The percentage difference between the row and the spiral allocation designs was 18.64%, and the difference between the row allocation design and the random allocation design was 23.64%. Notably, the row allocation design showed relatively low levels of agreement with the decision to pass an examinee obtained under the objective scoring condition at the 60% and 80% cut-off levels. The percentage differences for decision agreement in Test 2 are provided in Appendix R.

In Test 3, the random allocation design agreed with the decision to pass an examinee obtained under the objective scoring condition most often at three of the four cut-off levels. The agreement percentages are provided in Table 23, and the calculated differences are provided in Appendix S. Differences of greater than 5% in decision agreement were found at the 60% and 70% cut-off levels in this test. At the 60% cut-off level, a difference of 10.55% was seen between the row and random allocation designs, and a difference of 7.38% was found between the row and spiral allocation designs. At this cut-off level, the spiral and random allocation designs displayed similar levels of agreement with that obtained under the objective scoring condition. At the 70% cut-off level, the row allocation design and the objective scoring reached the same pass decision 42.69% of the time, whereas the random allocation design and the objective scoring reached the same pass decision 83.93% of the time. This amounted to a 41.24% difference in decision agreement between these rater allocation designs. Similarly, the difference in the degree of agreement with the objective scoring on the decision to pass an examinee between the row and spiral allocation designs was 22.27%; here, the spiral allocation design showed a considerably higher level of decision agreement than the row allocation design. Notably, the row allocation design agreed with the decision to pass an examinee obtained under the objective scoring less than one half of the time at the 70% cut-off level in Test 3.

Table 21
Percentage of agreement with the objectively scored test on a decision to pass at each of the cut-off levels for Test 1

                   Allocation Designs
Cut Score      Spiral      Row     Random
50%             85.83    90.60      85.09
60%             83.57    82.84      84.55
70%             85.86    93.27      75.58
80%             72.96    73.32      75.47

Table 22
Percentage of agreement with the objectively scored test on a decision to pass at each of the cut-off levels for Test 2

                   Allocation Designs
Cut Score      Spiral      Row     Random
50%             83.08    79.53      93.03
60%             74.26    58.35      82.72
70%             76.53    87.88      75.92
80%             70.92    52.28      75.92

Table 23
Percentage of agreement with the objectively scored test on a decision to pass at each of the cut-off levels for Test 3

                   Allocation Designs
Cut Score      Spiral      Row     Random
50%             80.56    81.28      84.53
60%             87.10    94.48      83.93
70%             64.96    42.69      83.93
80%             72.85    71.69      72.99

The profile plots for the agreement on a pass decision are provided below.
The profile plots serve to highlight the inconsistency in the differences between the allocation designs when they are compared to the objective scoring. Note that each point on a plot indicates the percentage of agreement between a given allocation design and the objective scoring condition on a decision to pass an examinee. Clear differences are seen across allocation designs for the decision to pass an examinee. In Test 1, the largest difference across rater allocation designs was seen at the 70% cut-off level; this may be seen in Figure 23. The profile plots for Tests 2 and 3 (Figures 24 and 25) highlight the erratic performance of the row allocation design. While small differences are seen between the spiral and random allocation designs at the four cut-off levels, the row allocation design displays extremes when compared to the other two.

Figure 23. Percentage of Pass Decision Agreement Across Cut-Off Levels in Test 1 [profile plot: percentage of agreement by cut-off level].

Figure 24. Percentage of Pass Decision Agreement Across Cut-Off Levels in Test 2 [profile plot: percentage of agreement by cut-off level].

Figure 25. Percentage of Pass Decision Agreement Across Cut-Off Levels in Test 3 [profile plot: percentage of agreement by cut-off level].

Decision Agreement for a Fail

Tables 24 through 26 summarize the percentage of agreement in the decision to fail an examinee, between each of the rater allocation designs and the objective scoring condition, for the three tests. Recall that decision agreement refers to the percentage of times the decision obtained under each rater allocation design agreed with the decision obtained under objective scoring. These differences are provided in Appendices T through V for Tests 1 through 3, respectively. The largest difference in decision-agreement percentage across allocation designs was 11.20%; the smallest was 0.08%.

In Test 1, the random allocation design agreed with the objective scoring condition on the decision to fail an examinee most often at the 50% and 70% cut-off levels, while the spiral allocation design agreed most often at the 60% and 80% cut-off levels. In this test, differences across rater allocation designs greater than 5% were found at the 50% and 70% cut-off levels. At the 50% cut-off level, a difference of 6.38% was seen between the row and random allocation designs, and a difference of 5.97% was seen between the row and spiral allocation designs. Here, the spiral and random allocation designs both agreed with the objective scoring condition on a decision to fail an examinee more often than the row allocation design did, and they showed similar levels of agreement with each other. At the 70% cut-off level, a 6.79% difference was seen between the row and random allocation designs, with the random allocation design showing a higher level of agreement with the objective scoring on the decision to fail an examinee than the row allocation design.

In Test 2, differences greater than 5% were seen at the 50% and 60% cut-off levels. At the 50% cut-off level, an 11.20% difference was found between the decision agreement levels of the row and random allocation designs; here, the row allocation design showed the higher level of agreement. A difference of 7.92% was also seen at this cut-off level between the random and spiral allocation designs, with the spiral allocation design displaying the higher level of decision agreement. At the 60% cut-off level, a difference of 5.39% was seen between the row and random allocation designs.
The row allocation design showed the higher level of decision agreement, at this cut-off level, on the decision to fail an examinee.

In Test 3, differences in levels of agreement with the decisions obtained under the objective scoring condition were seen, across allocation designs, at three of the cut-off levels. At the 50% cut-off level, a difference of 5.56% was seen between the spiral and random allocation designs. At this cut-off level, the spiral allocation design agreed most often with the decisions obtained under the objective scoring condition. At the 60% cut-off level, a difference of 6.23% was seen between the spiral and row allocation designs; again, the spiral allocation design agreed most often with the decisions obtained under the objective scoring condition. At the same cut-off level, a difference of 5.46% was seen between the row and random allocation designs, with the random allocation design showing a higher level of agreement with the decision to fail an examinee found using objective scoring. At the 70% cut-off level, a difference of 10.40% was found between the random and row allocation designs, with the row allocation design showing greater agreement with the objective scoring condition on a decision to fail an examinee. The spiral design also showed a higher level of decision agreement at this cut-off level, with a difference of 7.39% between it and the random allocation design.

Table 24
Percentage of Decision Agreement for a Fail in Test 1

                   Allocation Designs
Cut Score      Spiral      Row     Random
50%             88.37    82.40      88.78
60%             90.90    89.61      89.07
70%             92.53    89.83      96.62
80%             97.20    96.87      96.65

Table 25
Percentage of Decision Agreement for a Fail in Test 2

                   Allocation Designs
Cut Score      Spiral      Row     Random
50%             87.69    90.97      79.77
60%             95.07    98.02      92.63
70%             97.26    94.78      96.79
80%             97.95    98.94      96.80

Table 26
Percentage of Decision Agreement for a Fail in Test 3

                   Allocation Designs
Cut Score      Spiral      Row     Random
50%             93.34    92.64      87.78
60%             90.28    84.05      89.51
70%             96.90    99.91      89.51
80%             98.22    98.17      98.14

Profile plots for the percentage of agreement with the objective scoring condition on a decision to fail an examinee, across cut-off levels, are provided in Figures 26 through 28 for Tests 1 through 3, respectively. The estimated marginal means, or cell means, given on the ordinate make these plots somewhat deceptive: the scale used on each of the plots is different, and the reader is reminded to keep this in mind when examining Figures 26 through 28. Although it would appear that there are large differences in the decision agreement (fail) between the rater allocation designs, the scale exaggerates small differences across these designs. The most pronounced difference in Tests 1 and 3 is seen at the 70% cut-off level. This divergence in performance was also seen in earlier analyses.

Figure 26. Percentage of fail decision agreement across cut-off levels in Test 1 [profile plot: percentage of agreement by cut-off level].

Figure 27. Percentage of fail decision agreement across cut-off levels in Test 2 [profile plot: percentage of agreement by cut-off level].

Figure 28. Percentage of fail decision agreement across cut-off levels in Test 3 [profile plot: percentage of agreement by cut-off level].

CHAPTER V
DISCUSSION

The questions asked in this study dealt with two comparison levels. Examinee ability estimates from three rater allocation designs and an objective scoring (no rater) condition for the scoring of constructed responses were compared to the known examinee true ability (θ).
By comparing the ability estimates to the known true ability of the examinee it was possible to obtain a measure of accuracy and consistency for each of the scoring designs. It was also possible to compare the accuracy of the decision to either pass or fail an examinee made under each of the scoring designs. As such, test developers may gain important information about the relative accuracy of the ability estimates and of the decisions obtained under each of the scoring design conditions. This information will help test developers to make decisions about methods of scoring constructed responses. Furthermore, because each of the scoring designs was examined in theta bins, rather than overall, it was possible to see how differences in the true examinee ability, given a normal distribution, affected the accuracy and comparability of ability estimates, and to examine the accuracy of the pass and fail decisions at predetermined cut-off levels.

The second set of comparisons was to the ability estimate obtained under the objective scoring condition. The benefit of this set of analyses was to ascertain the degree to which each of the rater allocation designs was comparable to objective scoring. This is quite different from examining accuracy because these analyses allowed for a comparison of the ability estimates regardless of whether they were correct. This is an important consideration because this study was also interested in whether an examinee would have received the same score, or whether the same decision would have been reached, had his test been subject to objective, rather than human, scoring.

Level 1: Comparison to True Theta

Three questions were asked about the relative performance of the scoring designs when ability estimates were compared to true ability. The first question addressed the relative accuracy of the three rater allocation designs and objective scoring across levels of true examinee ability. The second question dealt with the relative consistency of the scoring designs across levels of true examinee ability. The third question addressed comparisons in decision accuracy.

Primarily, small differences were seen in the accuracy of the rater allocation designs over the three tests. Accuracy differences between the objective scoring and at least one of the allocation designs were seen in all three of the tests; when differences were found, they were generally at the higher and lower levels of true examinee ability. There was no clear pattern of differences in consistency across rater allocation designs over the three tests. Each of the rater allocation designs was, at times, indicated to be the most consistent of the three and, at times, the least consistent. The objective scoring condition provided the most consistent estimates most often.

Although results from the accuracy and consistency analyses did not raise concerns, larger differences in the pass and fail decisions were evident. In Test 1, for example, the greatest difference in accuracy between the scoring designs was .09, with trivial differences found at all other theta bin levels. Yet, a percentage-correct difference as large as 30.96% was found at the 70% cut-off level for a correct decision to pass an examinee.
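The decision comparisons above rest on turning a percentage cut-off into a pass/fail classification for both the true and the estimated ability. The sketch below illustrates one way this could be done. The 2PL form, the D = 1.7 scaling constant, the definition of the domain score as the expected proportion correct over the item set (cf. the parameters in Appendix A), and the conditioning on examinees who should truly pass are illustrative assumptions rather than the exact procedure used in this study.

```python
import numpy as np

def domain_score(theta, a, b, D=1.7):
    """Expected proportion correct for ability theta under a 2PL item set (assumed)."""
    theta = np.atleast_1d(theta).astype(float)[:, None]
    p = 1.0 / (1.0 + np.exp(-D * a * (theta - b)))
    return p.mean(axis=1)

def correct_pass_rate(theta_true, theta_hat, a, b, cut=0.70):
    """Percent of examinees who should pass (by true theta) and are passed (by the
    estimate), using the same domain-score cut-off for both classifications."""
    should_pass = domain_score(theta_true, a, b) >= cut
    is_passed = domain_score(theta_hat, a, b) >= cut
    return 100.0 * is_passed[should_pass].mean()

# Hypothetical item parameters and simulated ability estimates.
rng = np.random.default_rng(7)
a = rng.uniform(0.8, 3.0, size=20)
b = rng.uniform(-2.0, 2.0, size=20)
theta_true = rng.normal(size=2000)
theta_hat = theta_true + rng.normal(scale=0.4, size=2000)
print(correct_pass_rate(theta_true, theta_hat, a, b, cut=0.70))
```

A companion function for correct fail decisions would condition on examinees who should truly fail.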
In Test 2, the accuracy of the row allocation design was virtually indistinguishable from that of the other scoring designs, yet the decision accuracy of the row allocation design contrasted greatly with the other scoring designs at three of the four cut-off levels. Indeed, for this test, the difference between the row and random allocation designs was 21.26% for a correct decision to pass an examinee at a cut-off level of 80%. In Test 3, large differences were again seen in the percentage of correct decisions between the row allocation design and the other scoring designs at the 70% cut-off level. The inconsistent findings between the theta-difference and decision analyses illustrate the need to take the additional step of examining decisions.

Differences in decisions emerged between the row allocation design and the other scoring designs, particularly at the 70% cut-off level in all tests. The decision accuracy for a pass using the row allocation design tended to be either considerably lower or considerably higher than that of the other scoring designs, and only occasionally was in line with the others. Further, when the other scoring designs (particularly objective scoring) had low percentages of correct decisions, the row allocation design was often relatively high. This unusual and often counterintuitive performance was found at the 70% cut-off level in all three of the tests, and again at the 80% cut-off level in Test 3. The pattern for a correct decision to fail was similar: at times the row allocation design was found to be considerably more accurate than the other scoring designs and, at times, considerably less accurate.

When accuracy was examined across levels of examinee ability, it was found that bias at the low end of the ability scale was positive while bias at the high end was negative. This means that, at the lower ability levels, the examinee scores were over-estimated; in other words, the examinee ability was estimated to be higher than it really was. At the high end of the ability scale, the scores were under-estimated; the ability estimates were lower than the examinee true ability. Although this is not a surprising finding, given that IRT estimation works best at the center of the ability distribution, it is a reminder for test developers that, given a normal distribution, they are likely to get less accurate estimates of examinee ability the higher and lower they go on the ability scale, whether responses are rater scored or scored objectively.

In summary, when ability estimates were compared to the true examinee ability, small differences in accuracy and the slight inconsistency demonstrated by the row allocation design had a great impact on decision accuracy, particularly at the 70% cut-off level. Less decision accuracy was seen across designs at the highest and lowest levels of the ability scale, and the row allocation design provided erratic decisions. It should be noted, however, that although the pattern is erratic, it is replicated across the three tests and hence is not a spurious, or chance, finding.

Level 2: Comparison to the Objective Scoring Estimate

The distinction between a comparison to true theta and a comparison to the estimate obtained through objective scoring is an important one. A comparison to the true ability provides a measure of accuracy: essentially, did the design get it right? A comparison to the objective scoring estimate provides a measure of agreement.
This is very different from accuracy because the designs can agree even when wrong. We may, however, determine whether an examinee would have obtained a different score on an objectively scored test than on one scored through one of the rater allocation designs. Two questions were asked at this comparison level. First, the relative discrepancy between the objective and rater allocation design ability estimates was investigated across designs for each theta bin level. Second, the percentage of agreement between the pass and fail decisions obtained under objective scoring and each of the rater allocation designs was examined for each of the cut-off levels. An examination of relative discrepancy provides test developers with information about differences in the degree to which each of the allocation designs agrees with the objective test. The results of this study provide a step for test developers to engage in a discussion with parents, examinees, and educators about the similarity or difference of using human rather than objective scoring of test items.

The first question asked at this comparison level was about differences between the ability estimates obtained using objective scoring and each of the three rater allocation designs. Stated differently, this study was interested in which of the three rater allocation designs provided ability estimates most, and least, like those of the objective scoring condition. It was found that the differences between the rater allocation designs and the objective scoring were trivial at all levels of true examinee ability. This was true for each of the three tests investigated. Examining only theta estimate differences, we might conclude that the ability estimates from the three allocation designs were remarkably similar to those of the objective scoring and that any of the rater allocation designs may be substituted for objective scoring.

This study, however, went beyond the examination of differences in theta estimates. An examination of decision agreement between the objective scoring and each of the rater allocation designs, for the pass and fail decisions at each of the four cut-off levels, was conducted to address the second question at comparison level two. The rater allocation designs showed differential agreement with the decisions to pass an examinee, particularly at the 70% cut-off level. While the percentage of agreement with the objective decision was steady for the spiral allocation design and the random allocation design, the percentage of agreement with the objective decision for the row allocation design was erratic. Concerning characteristics of the row allocation design were revealed: on a decision to pass, agreement levels were as low as 42.69% for the row allocation design. As seen previously in the examinations of accuracy and decision accuracy, the row allocation design showed itself to be inconsistent. At times, the row allocation design showed high agreement with the estimate or decision obtained under objective scoring, yet, at times, it showed very low levels of agreement. This is a concerning finding considering that an examination of differences in theta estimates for these scoring designs would lead to quite a different conclusion. Specifically, an examination of theta differences alone might easily lead researchers to conclude that the rater allocation designs examined here are interchangeable.
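To make the designs underlying these conclusions concrete, the sketch below shows one way the three rater allocation rules and a rater effect modelled as a shift in theta could be simulated over an examinee-by-item response matrix. The rotation rule used for the spiral design, the nesting rule for the row design, the rater shift values, and the item parameters are illustrative assumptions and are not the settings used in this study.

```python
import numpy as np

# Sketch of the three rater allocation rules plus dichotomous response generation
# with a rater effect modelled as a shift in theta (all values are hypothetical).
rng = np.random.default_rng(42)
n_examinees, n_items, n_raters = 1000, 4, 4
rater_shift = np.array([-0.3, -0.1, 0.1, 0.3])   # assumed leniency/severity effects

def assign_raters(design):
    """Return an (examinee, item) matrix of rater indices under each design."""
    if design == "random":   # each response scored by an independently drawn rater
        return rng.integers(n_raters, size=(n_examinees, n_items))
    if design == "row":      # rater nested in examinee: one rater scores the whole row
        return np.repeat(rng.integers(n_raters, size=(n_examinees, 1)), n_items, axis=1)
    if design == "spiral":   # raters rotate across items, with a random starting rater
        start = rng.integers(n_raters, size=(n_examinees, 1))
        return (start + np.arange(n_items)) % n_raters
    raise ValueError(design)

def generate_scores(theta, a, b, raters, D=1.7):
    """Dichotomous rater scores: P(correct) uses theta shifted by the rater effect."""
    shifted = theta[:, None] + rater_shift[raters]        # theta-shift per response
    p = 1.0 / (1.0 + np.exp(-D * a * (shifted - b)))
    return (rng.random(p.shape) < p).astype(int)

theta = rng.normal(size=n_examinees)
a = np.array([1.0, 1.5, 2.0, 2.5])   # hypothetical constructed-response item parameters
b = np.array([-1.0, 0.0, 0.5, 1.0])
scores = generate_scores(theta, a, b, assign_raters("spiral"))
```

Under a setup of this kind, the row rule ties every item score for an examinee to a single rater's severity, whereas the spiral and random rules spread rater effects across an examinee's items, which is consistent with the patterns of agreement reported above.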
Contributions of this Study

This research adds to the growing research and scientific discourse centered on the scoring of constructed responses. The spiral, row, and random allocation designs have not all been previously compared, and they have not been compared to an objective test in previous studies examining allocation designs. Further, this study incorporated a θ-shift model in the analysis of data. Previous research has demonstrated the usefulness of including a rater in the scoring model. This study distinguishes itself in that differences in accuracy, consistency, and discrepancy across scoring designs were examined for bins across the ability scale. In other words, the study addresses the degree to which accuracy differs for these designs when one has a normal distribution of high, moderate, and low ability examinees.

The application of the rater allocation designs was slightly different in this study than in the parent studies for this research. The spiral allocation simulation in this study incorporated an additional step: the raters were assigned randomly to the spiral pattern, unlike in previous studies where the spiral pattern was manipulated using rater parameters. This step allowed the rater effect to appear at intervals, more like that which would be found in practice when rater parameters are not estimated in advance. Also, the random allocation design was implemented under single-rating conditions, unlike the study by Patz et al. (1997). This allowed for a direct comparison to be made between the random allocation design and the spiral allocation design, both of which purport to mitigate rater effects by distributing these effects across items. As such, the effect of a lenient rater, for example, would serve to cancel out, or reduce the impact of, the effect of a severe rater. These two rater allocation designs provided generally similar levels of accuracy, consistency, and decision accuracy across theta bin levels in all three tests at comparison level one. They showed minimal difference in discrepancy at comparison level two, and they provided similar percentages of agreement with the decisions obtained under objective scoring.

Limitations of this Study

As in all research, this study has its limitations. First, as is commonly seen in practice, only four items were used to create the tests. While the selection of four items was necessary in order to mirror possible performance assessment situations such as essay testing, estimation using item response theory is best with a larger number of items (Hambleton et al., 1991). That said, concerns about estimation errors were mitigated in two ways: first, only the ability parameters were estimated whereas item parameters were fixed, and second, a check of the item parameter recovery was completed. Nevertheless, it is not unusual to see constructed response tests with a higher number of items. Another limitation of this study is that only dichotomous scoring was examined; results might be quite different under polytomous scoring conditions.

Deviant results were often found at the 70% cut-off level. Patz et al. (1997) also had some deviant findings: they found that the root mean square error (RMSE) for their stratified randomization design under severe rater effects was lower than the RMSE attained when no rater effects were present.
When this unexpected finding was explored, they found that the stratification resulted in smaller standard deviations for both realized raw scores and estimated scale scores, and consequently in a smaller RMSE, without necessarily improving reliability. In this study, the most plausible explanation for the deviant results found at the 70% cut-off level is the length of the tests. The reader is reminded, however, that the cut-off levels are set on the domain score, which makes even this explanation somewhat questionable; this needs to be investigated in future research. Although it was noted that the row allocation design demonstrated unexpectedly high consistency, the nature of this study does not support an investigation into that phenomenon. This may also be an interesting, and potentially important, phenomenon for future study.

This was a simulation study by necessity, a reality that limits the generalizability of the study. It would not have been possible to investigate the rater allocation designs using the same raters, on the same set of items, for the same set of examinees outside of a simulation, due to memory effects and, in the case of this study, due to the necessity of a known true theta. The rater parameters used in this study, although based on realistic values, were also limiting in that only four rater parameters were used. It is expected that there would be greater diversity of rater effects in the population of raters; it would be unreasonable to assume such a clean distribution of rater leniency and severity in the real population of raters.

The objective scoring condition in this study was one where the rater parameter was set to 0. Objective scoring commonly refers to the use of objective items such as multiple-choice, true-or-false, or fill-in-the-blank items. Item format was not a factor in this study; as such, this study may not be generalized to changes in item format. Objective scoring could also take place under automated scoring conditions. The reader is cautioned that direct comparisons to objective scoring using specific automated methods of scoring did not take place, although this study does provide a good base from which to move into this area of study.

Future Research

Findings in this study provide a solid base for examining other aspects of the scoring process, such as the scoring of polytomous items and the combining of results from objectively scored items with those from rater-scored items. In the future, researchers may wish to compare polytomously scored rater allocation designs with the dichotomously scored rater allocation designs and objectively scored ability estimates. This study also examined only a single rating of each item; researchers may also wish to examine the questions asked in this study in a repeated-rating situation.

One important aspect of this study was the examination of a no-rater design to represent objective scoring. The next logical step would be to compare the performance of the rater allocation designs with that of automated scoring. Future research might compare the performance of the spiral and random allocation designs to that of the e-rater scoring used by ETS, or to Latent Semantic Analysis (LSA), which incorporates a comparison of rater scores with computed scores for essays.
Summary

These findings will assist test administrators in their decisions about the scoring methods and procedures that will produce the most accurate ability estimates and decisions, and will help to answer questions about the relative use of rater scoring versus objective scoring. In general, with the θ-shift model in place, there is relatively little difference in the overall accuracy of the spiral allocation design, the random allocation design, and the objective scoring. However, the results suggest that even with trivial differences in accuracy, large differences in decisions may be seen across scoring designs. Researchers interested in investigating the effects of raters are advised to go beyond examinations of differences in theta (true and estimated ability) and to consider the outcome decisions; examining only theta differences can be misleading. The spiral allocation design and random allocation design provided similar accuracy and led to similar pass and fail decisions. The overall performance of the row allocation design was somewhat erratic when compared to the other scoring methods, and this erratic performance makes it the least desirable of the rater allocation designs.

REFERENCES

Bennett, R. E., & Bejar, I. I. (1998). Validity and automated scoring: It's not only the scoring. Educational Measurement: Issues and Practice, 17, 9-17.

Bock, R. D., Brennan, R. L., & Muraki, E. (2002). The information in multiple ratings. Applied Psychological Measurement, 26, 364-375.

Brennan, R. L. (1992). Generalizability theory. Educational Measurement: Issues and Practice, 11(4), 27-34.

Chung, G. K. W. K., & O'Neil, H. F., Jr. (1997). Methodological approaches to online scoring of essays. Los Angeles, CA: National Center for Research on Evaluation, Standards, and Student Testing. (ERIC Document Reproduction Service No. ED418101)

Clauser, B. E., Subhiyah, R. G., Nungester, R. J., Ripkey, D. R., Clyman, S. G., & McKinley, D. (1995). Scoring a performance-based assessment by modeling the judgment of experts. Journal of Educational Measurement, 32, 397-415.

Cohen, J. (1973). Eta-squared and partial eta-squared in fixed factor ANOVA designs. Educational and Psychological Measurement, 33, 107-112.

Cohen, J. (1994). The Earth is round (p < .05). American Psychologist, 49, 997-1003.

Daniel, L. G. (1998). Statistical significance testing: A historical overview of misuse and misinterpretation with implications for the editorial policies of educational journals. Research in the Schools, 5, 23-32.

Donoghue, J. R. (1994). An empirical examination of the IRT information of polytomously scored reading items under the generalized partial credit model. Journal of Educational Measurement, 31, 295-311.

Donoghue, J. R., & Hombo, C. M. (2000, April). A comparison of different model assumptions about rater effects. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, LA, USA.

Donoghue, J. R., & Hombo, C. M. (2000, June). How rater effects are related to reliability indices. Paper presented at the North American Meeting of the Psychometric Society, Vancouver, BC, Canada.

Engelhard, G., Jr. (1994). Examining rater errors in the assessment of written composition with many-faceted Rasch models. Journal of Educational Measurement, 31, 93-112.

Engelhard, G., Jr. (1996). Examining rater accuracy in performance assessments. Journal of Educational Measurement, 33, 56-70.

Ercikan, K., & Julian, M. (2002).
Classification accuracy of assigning student performance to proficiency levels: Guidelines for assessment design. Applied Measurement in Education, 15, 269-294.

Fischer, G. H. (1973). The linear logistic test model as an instrument in educational research. Acta Psychologica, 37, 359-374.

Fischer, G. H. (1983). Logistic latent trait models with linear constraints. Psychometrika, 48, 3-26.

Haertel, E. H., & Linn, R. L. (1996). Comparability. In G. W. Phillips (Ed.), Technical issues in large-scale performance assessment (pp. 59-78). Washington, DC: National Center for Education Statistics.

Hambleton, R. K. (2001). Setting performance standards on educational assessments and criteria for evaluating the process. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 89-116). Mahwah, NJ: Lawrence Erlbaum Associates.

Hambleton, R., Swaminathan, H., & Rogers, J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.

Harwell, M. R. (1999). Evaluating the validity of educational rating data. Educational and Psychological Measurement, 59, 25-27.

Hess, M. R., Donoghue, J. R., & Hombo, C. M. (2003). Investigating constructed response scoring over time: The effects of study design on trend rescore statistics. Paper presented at the annual meeting of the American Educational Research Association, Chicago, IL, USA.

Hombo, C. M., Donoghue, J. R., & Thayer, D. T. (2001, March). A simulation study of the effect of rater designs on ability estimation (Research report). Princeton, NJ: Educational Testing Service.

Hoskens, M., & Wilson, M. (2001). Real-time feedback on rater drift in constructed response items: An example from the Golden State Examination. Journal of Educational Measurement, 38, 121-145.

Huot, B. (1990). The literature of direct writing assessment: Major concerns and prevailing trends. Review of Educational Research, 60, 237-263.

Junker, B. W., & Patz, R. J. (1998, June). The hierarchical rater model for rated test items. Presented at the North American meeting of the Psychometric Society, Champaign-Urbana, IL, USA.

Lee, G., & Sykes, R. (2000). A comparison of scoring modalities for performance assessment: A case of constructed response items. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.

Linacre, J. M. (1989). Many-faceted Rasch measurement. Chicago, IL: MESA Press.

Linn, R. L. (1998). Partitioning responsibility for the evaluation of the consequences of assessment programs. Educational Measurement: Issues and Practice, 16, 28-30.

Lukhele, R., Thissen, D., & Wainer, H. (1994). On the relative value of multiple choice, constructed response, and examinee selected items on two achievement tests. Journal of Educational Measurement, 31, 234-250.

Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.

Moon, T. R., & Hughes, K. R. (2002). Training and scoring issues involved in large-scale writing assessments. Educational Measurement: Issues and Practice, 21, 15-19.

Muraki, E., & Bock, R. D. (1999). PARSCALE: Parameter scaling of rating data [Computer program]. Chicago, IL: Scientific Software International Inc.

Muraki, E., Hombo, C., & Lee, Y. (2000). Equating and linking of performance assessments. Applied Psychological Measurement, 24, 325-337.

Myford, C. M., & Mislevy, R. J. (1995). Monitoring and improving a portfolio assessment system (Center for Performance Assessment Research Report).
Princeton, NJ: Educational Testing Service.

Myford, C. M., & Cline, F. (2002, April). Looking for patterns in disagreements: A Facets analysis of human raters' and e-rater's scores on essays written for the Graduate Management Admission Test (GMAT). Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.

Patz, R. J., Junker, B. W., Johnson, M. S., & Mariano, L. T. (2002). The hierarchical rater model for rated test items and its application to large-scale educational assessment data. Journal of Educational and Behavioral Statistics, 27, 341-384.

Patz, R. J., Wilson, M., & Hoskins, M. (1997). Optimal rating procedures and methodology for NAEP open-ended items. Final report to the National Center for Education Statistics under the redesign of NAEP.

Rosenbaum, P. R. (1984). Testing the conditional independence and monotonicity assumptions of item response theory. Psychometrika, 49, 425-435.

Rosenbaum, P. R. (1988). Item bundles. Psychometrika, 53, 349-359.

Shaver, J. P. (1993). What statistical testing is, and what it is not. Journal of Experimental Education, 61, 293-316.

Sireci, S. G., & Rizavi, S. (2000). Comparing computerized and human scoring of students' essays (Laboratory of Psychometric and Evaluative Research No. 354). Amherst, MA.

Sykes, R. C., Heidorn, M., & Lee, G. (1999, April). The assignment of raters to items: Controlling for rater effects. Paper presented at the annual meeting of the National Council on Measurement in Education, Montreal, Canada.

Wainer, H., & Thissen, D. (1993). Combining multiple-choice and constructed response test scores: Toward a Marxist theory of test construction. Applied Measurement in Education, 6, 103-118.

Williamson, D. M., Bejar, I. I., & Hone, A. S. (1999). 'Mental model' comparison of automated and human scoring. Journal of Educational Measurement, 36, 158-184.

Williamson, D. M., Johnson, M. S., Sinharay, S., & Bejar, I. I. (2002, April). Hierarchical IRT examination of isomorphic equivalence of complex constructed response tasks. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA, USA.

Wilson, M., & Adams, R. J. (1995). Rasch models for item bundles. Psychometrika, 60, 181-198.

Wilson, M., & Case, H. (1997). An examination of variation in rater severity over time: A study in rater drift. Berkeley Evaluation and Assessment Research Center, University of California.

Wilson, M., & Hoskens, M. (2001). The rater bundle model. Journal of Educational and Behavioral Statistics, 26, 283-306.

Wilson, M., & Wang, W. (1995). Complex composites: Issues that arise in combining different modes of assessment. Applied Psychological Measurement, 19, 51-72.

Wolfe, E. W., & Myford, C. M. (1997, April). Detecting order effects with a multi-faceted Rasch scale model. Presented at the annual meeting of the American Educational Research Association, Chicago, IL, USA.

Zieky, M. J. (2001). So much has changed: How the setting of cut-scores has evolved since the 1980s. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 19-51). Mahwah, NJ: Lawrence Erlbaum Associates.
Scoring Designs 133  APPENDIX A Item Parameters for Dichotomously Scored Items Item  a  b  1  0.98420  -2.00000  2  1.92370  -1.78950  3  1.25260  -1.57890  4  1.65530  -1.36840  5  1.78950  -1.15790  6  3.40000  -0.94740  7  3.26580  -0.73680  8  0.85000  -0.52630  9  2.59470  -0.31580  10  2.32630  -0.10530  11  2.86320  0.10530  12  2.19210  0.31580  13  1.11840  0.52630  14  1.38680  0.73580  15  1.52110  0.94740  16  2.05790  1.15790  17  2.99740  1.36840  18  3.13160  1.57890  19  2.46050  1.78950  20  2.72890  2.00000  Scoring Designs 134 APPENDIX B Difference in Squared Bias for Test 1 Scoring Designs Spiral  Row Random Objective  A. less than or equal to -2.0  0  -0.02  -0.02  0.03  B. between -1.999 and -1.0  0  0.01  0  0.01  C. between-0.999 and 0  0  0.01  0.01  0.01  D. between 0.001 and 1.0  0  0  0.01  0.01  E. between 1.001 and 2.0  0  0.03  0.03  0  F. between 2.001 and greater.  0  0.01  0.01  0.09  A. less than or equal to -2.0  0.02  0  0  0.05  B. between -1.999 and -1.0  -0.01  0  -0.01  0  C. between -0.999 and 0  -0.01  0  0  0  D. between 0.001 and 1.0  0  0  0.01  0.01  E. between 1.001 and 2.0  -0.03  0  0  -0.03  F. between 2.001 and greater.  -0.01  0  0  0.08  0.02  0  0  0.05  0  0.01  0  0.01  C. between-0.999 and 0  -0.01  0  0  0  D. between 0.001 and 1.0  -0.01  -0.01  0  0  E. between 1.001 and 2.0  -0.03  0  0  -0.03  F. between 2.001 and greater.  -0.01  0  0  0.08  Objective A. less than or equal to -2.0  -0.03  -0.05  -0.05  0  B. between -1.999 and -1.0  -0.01  0  -0.01  0  C. between-0.999 and 0  -0.01  0  0  0  D. between 0.001 and 1.0  -0.01  -0.01  0  0  E. between 1.001 and 2.0  0  0.03  0.03  0  -0.09  -0.08  -0.08  0  Spiral  Row  Random A. less than or equal to -2.0 B. between -1.999 and -1.0  F. between 2.001 and greater.  Scoring Designs 135 APPENDIX C Difference in Squared Bias for Test 2 Scoring Designs Random Objective  Spiral  Row  A. less than or equal to -2.0  0  -0.02  -0.04  0.01  B. between -1.999 and -1.0  0  0.01  0  0.07  C. between-0.999 and 0  0  0.01  0.01  -0.04  D. between 0.001 and 1.0  0  0.01  0.01  -0.01  E. between 1.001 and 2.0  0  0  0.01  0.02  F. between 2.001 and greater.  0  -0.01  -0.01  0.1  A. less than or equal to -2.0  0.02  0  -0.02  0.03  B. between -1.999 and -1.0  -0.01  0  -0.01  0.06  C. between-0.999 and 0  -0.01  0  0  -0.05  D. between 0.001 and 1.0  -0.01  0  0  -0.02  E. between 1.001 and 2.0  0  0  0.01  0.02  0.01  0  0  0.11  0.04  0.02  0  0.05  0  0.01  0  0.07  C. between-0.999 and 0  -0.01  0  0  -0.05  D. between 0.001 and 1.0  -0.01  0  0  -0.02  E. between 1.001 and 2.0  -0.01  -0.01  0  0.01  F. between 2.001 and greater.  0.01  0  0  0.11  Objective A. less than or equal to -2.0  -0.01  -0.03  -0.05  0  B. between -1.999 and -1.0  -0.07  -0.06  -0.07  0  C. between -0.999 and 0  0.04  0.05  0.05  0  D. between 0.001 and 1.0  0.01  0.02  0.02  0  E. between 1.001 and 2.0  -0.02  -0.02  -0.01  0  F. between 2.001 and greater.  -0.1  -0.11  -0.11  0  Spiral  Row  F. between 2.001 and greater. Random A. less than or equal to -2.0 B. between -1.999 and -1.0  Scoring Designs 136 APPENDIX D Difference in Squared Bias for Test 3 Scoring Designs Spiral  Row  A. less than or equal to -2.0  0  -0.17  -0.2  -0.19  B. between -1.999 and -1.0  0  -0.04  -0.05  -0.02  C. between-0.999 and 0  0  0  0  -0.01  D. between 0.001 and 1.0  0  0  0  0  E. between 1.001 and 2.0  0  0  0  -0.04  F. between 2.001 and greater.  0  0.04  0.04  0.15  A. less than or equal to -2.0  0.17  0  -0.03  -0.02  B. 
between -1.999 and -1.0  0.04  0  -0.01  0.02  C. between-0.999 and 0  0  0  0  -0.01  D. between 0.001 and 1.0  0  0  0  0  E. between 1.001 and 2.0  0  0  0  -0.04  -0.04  0  0  0.11  Random A. less than or equal to -2.0  0.2  0.03  0  0.01  B. between -1.999 and -1.0  0.05  0.01  0  0.03  C. between-0.999 and 0  0  0  0  -0.01  D. between 0.001 and 1.0  0  0  0  0  E. between 1.001 and 2.0  0  0  0  -0.04  F. between 2.001 and greater.  -0.04  0  0  0.11  Objective A. less than or equal to -2.0  0.19  0.02  -0.01  0  B. between -1.999 and -1.0  0.02  -0.02  -0.03  0  C. between -0.999 and 0  0.01  0.01  0.01  0  D. between 0.001 and 1.0  0  0  0  0  E. between 1.001 and 2.0  0.04  0.04  0.04  0  F. between 2.001 and greater.  -0.15  -0.11  -0.11  0  Spiral  Row  F. between 2.001 and greater.  Random Objective  Scoring Designs 137 APPENDLX E Differences in Mean Square Error for Test 1 Scoring Designs Spiral  Row  A. less than or equal to -2.0  0  -0.02  -0.02  0.06  B. between -1.999 and -1.0  0  0.02  -0.05  0  C. between-0.999 and 0  0  0  -0.06  0.03  D. between 0.001 and 1.0  0  -0.01  -0.09  -0.1  E. between 1.001 and 2.0  0  0.03  -0.08  0.01  F. between 2.001 and greater.  0  0.04  0.05  0.16  A. less than or equal to -2.0  0.02  0  0  0.08  B. between -1.999 and -1.0  -0.02  0  -0.07  -0.02  C. between -0.999 and 0  0  0  -0.06  0.03  D. between 0.001 and 1.0  0.01  0  -0.08  -0.09  E. between 1.001 and 2.0  -0.03  0  -0.11  -0.02  F. between 2.001 and greater.  -0.04  0  0.01  0.12  Random A. less than or equal to -2.0  0.02  0  0  0.08  B. between -1.999 and -1.0  0.05  0.07  0  0.05  C. between-0.999 and 0  0.06  0.06  0  0.09  D. between 0.001 and 1.0  0.09  0.08  0  -0.01  E. between 1.001 and 2.0  0.08  0.11  0  0.09  F. between 2.001 and greater.  -0.05  -0.01  0  0.11  -0.06  -0.08  -0.08  0  0  0.02  -0.05  0  C. between-0.999 and 0  -0.03  -0.03  -0.09  0  D. between 0.001 and 1.0  0.1  0.09  0.01  0  E. between 1.001 and 2.0  -0.01  0.02  -0.09  0  F. between 2.001 and greater.  -0.16  -0.12  -0.11  0  Spiral  Row  Objective A. less than or equal to -2.0 B. between -1.999 and -1.0  Random Objective  Scoring Designs 138  APPENDIX F Differences in Mean Square Error for Test 2 Scoring Designs Random Objective  Spiral  Row  A. less than or equal to -2.0  0  -0.01  0.01  0.06  B. between -1.999 and -1.0  0  0.03  -0.08  0.11  C. between -0.999 and 0  0  0  -0.09  -0.03  D. between 0.001 and 1.0  0  0.01  -0.03  0  E. between 1.001 and 2.0  0  0.01  -0.1  -0.01  F. between 2.001 and greater.  0  -0.02  -0.01  0.13  A. less than or equal to —2.0  0.01  0  0.02  0.07  B. between -1.999 and -1.0  -0.03  0  -0.11  0.08  C. between -0.999 and 0  0  0  -0.09  -0.03  D. between 0.001 and 1.0  -0.01  0  -0.04  -0.01  E. between 1.001 and 2.0  -0.01  0  -0.11  -0.02  F. between 2.001 and greater.  0.02  0  0.01  0.15  Random A. less than or equal to -2.0  -0.01  -0.02  0  0.05  B. between -1.999 and -1.0  0.08  0.11  0  0.19  C. between -0.999 and 0  0.09  0.09  0  0.06  D. between 0.001 and 1.0  0.03  0.04  0  0.03  E. between 1.001 and 2.0  0.1  0.11  0  0.09  F. between 2.001 and greater.  0.01  -0.01  0  0.14  Objective A. less than or equal to -2.0  -0.06  -0.07  -0.05  0  B. between -1.999 and -1.0  -0.11  -0.08  -0.19  0  C. between-0.999 and 0  0.03  0.03  -0.06  0  D. between 0.001 and 1.0  0  0.01  -0.03  0  E. between 1.001 and 2.0  0.01  0.02  -0.09  0  F. between 2.001 and greater.  
-0.13  -0.15  -0.14  0  Spiral  Row  Scoring Designs 139  APPENDIX G Differences in Mean Square Error for Test 3 Scoring Designs Spiral  Row  A. less than or equal to -2.0  0  -0.2  -0.17  -0.16  B. between -1.999 and -1.0  0  -0.06  -0.03  0.01  C. between -0.999 and 0  0  0  -0.13  0  D. between 0.001 and 1.0  0  -0.01  -0.1  0.01  E. between 1.001 and 2.0  0  -0.02  -0.06  -0.03  F. between 2.001 and greater.  0  0.07  0  0.14  A. less than or equal to -2.0  0.2  0  0.03  0.04  B. between -1.999 and -1.0  0.06  0  0.03  0.07  C. between-0.999 and 0  0  0  -0.13  0  D. between 0.001 and 1.0  0.01  0  -0.09  0.02  E. between 1.001 and 2.0  0.02  0  -0.04  -0.01  F. between 2.001 and greater.  -0.07  0  -0.07  0.07  Random A. less than or equal to -2.0  0.17  -0.03  0  0.01  B. between -1.999 and -1.0  0.03  -0.03  0  0.04  C. between-0.999 and 0  0.13  0.13  0  0.13  D. between 0.001 and 1.0  0.1  0.09  0  0.11  E. between 1.001 and 2.0  0.06  0.04  0  0.03  0  0.07  0  0.14  Objective A. less than or equal to -2.0  0.16  -0.04  -0.01  0  B. between -1.999 and -1.0  -0.01  -0.07  -0.04  0  C. between-0.999 and 0  0  0  -0.13  0  D. between 0.001 and 1.0  -0.01  -0.02  -0.11  0  E. between 1.001 and 2.0  0.03  0.01  -0.03  0  F. between 2.001 and greater.  -0.14  -0.07  -0.14  0  Spiral  Row  F. between 2.001 and greater.  Random Objective  Scoring Designs 140 APPENDLX H Percentage Differences in Decision Accuracy (Pass) for Test 1  Scoring Designs  Spiral  Row  Random  Objective  Spiral  Row  0  -6.32  1.22  -1.21  60%  -0.04  0.5  -1.92  70%  -10.49  14.83  20.47  80%  -0.3  -1.94  -4.61  50%  Random Objective  50%  6.32  7.54  5.11  60%  0.04  0.54  -1.88  70%  10.49  25.32  30.96  80%  0.3  0  -1.64  -4.31  50%  -1.22  -7.54  60%  -0.5  -0.54  70%  -14.83  -25.32  80%  1.94  1.64  0  -2.67  50%  1.21  -5.11  2.43  0  60%  1.92  1.88  2.42  0  70%  -20.47  -30.96  -5.64  0  80%  4.61  4.31  2.67  0  -2.43 0  -2.42 5.64  Scoring Designs 141 APPENDIX I Percentage Differences in Decision Accuracy (Pass) for Test 2  Scoring Designs  Spiral  Spiral  Row  0  1.92  -6.99  -7.77  60%  12.39  -4.71  -6.18  70%  -12.67  -0.53  1.64  80%  16.63  -4.61  -6.83  50%  Random Objective  50%  -1.92  -8.91  -9.69  60%  -12.39  -17.1  -18.57  70%  12.67  12.14  14.31  80%  -16.63  -21.24  -23.46  50%  6.99  8.91  0  -0.78  60%  4.71  17.1  0  -1.47  70%  0.53  -12.14  0  2.17  80%  4.61  21.24  0  -2.22  Objective 50%  7.77  9.69  0.78  0  60%  6.18  18.57  1.47  0  70%  -1.64  -14.31  -2.17  0  80%  6.83  23.46  2.22  0  Row  Random  Scoring Designs 142 APPENDIX J Percentage Differences in Decision Accuracy (Pass) for Test 3  Scoring Designs  Spiral  Spiral  Row  0  -1.14  -1.05  -6.56  60%  -7.74  5.45  0.63  70%  20.34  -14.25  -18.39  80%  0.58  -0.08  0.87  0.09  -5.42  50%  Random Objective  50%  1.14  60%  7.74  0  13.19  8.37  70%  -20.34  0  -34.59  -38.73  80%  -0.58  -0.66  0.29  50%  1.05  -0.09  -5.51  60%  -5.45  -13.19  -4.82  70%  14.25  34.59  -4.14  80%  0.08  0.66  0.95  Objective 50%  6.56  5.42  5.51  0  60%  -0.63  -8.37  4.82  0  70%  18.39  38.73  4.14  0  80%  -0.87  -0.29  -0.95  0  Row  Random  Scoring Designs 143 APPENDIX K Percentage Differences in Decision Accuracy (Fail) for Test 1  Scoring Designs  Spiral  Row  Random  Objective  Spiral  Row  Random Objective  50%  0  4.27  -0.09  -4.27  60%  0  0.99  2.53  -1.83  70%  0  1.18  -2.28  -2.86  80%  0  0.33  0.57  -1.29  50%  -4.27  0  -4.36  -8.54  60%  -0.99  0  1.54  -2.82  70%  -1.18  0  -3.46  -4.04  80%  -0.33  0  
0.24  -1.62  50%  0.09  4.36  0  -4.18  60%  -2.53  -1.54  0  -4.36  70%  2.28  3.46  0  -0.58  80%  -0.57  -0.24  0  -1.86  50%  4.27  8.54  4.18  0  60%  1.83  2.82  4.36  0  70%  2.86  4.04  0.58  0  80%  1.29  1.62  1.86  0  Scoring Designs 144 APPENDIX L Percentage Differences in Decision Accuracy (Fail) for Test 2  Scoring Designs  Spiral  Row  Random  Objective  Spiral  Row  Random Objective  50%  0  -5.5  12.92  14.52  60%  0  -3.93  3.57  0.99  70%  0  1.68  0.39  -0.87  80%  0  -1.03  1.16  -0.48  50%  5.5  0  18.42  20.02  60%  3.93  0  7.5  4.92  70%  -1.68  0  -1.29  -2.55  80%  1.03  0  2.19  0.55  50%  -12.92  -18.42  0  1.6  60%  -3.57  -7.5  0  -2.58  70%  -0.39  1.29  0  -1.26  80%  -1.16  -2.19  0  -1.64  50%  -14.52  -20.02  -1.6  0  60%  -0.99  -4.92  2.58  0  70%  0.87  2.55  1.26  0  80%  0.48  -0.55  1.64  0  Scoring Designs 145 APPENDLX M Percentage Differences in Decision Accuracy (Fail) for Test 3  Scoring Designs  Spiral  Spiral  Row  0  0.4  7.49  5.62  60%  6.05  1.98  -2.75  70%  -6.65  10.54  7.58  80%  0.05  0.09  -0.88  50%  Random Objective  50%  -0.4  7.09  5.22  60%  -6.05  -4.07  -8.8  70%  6.65  17.19  14.23  80%  -0.05  0.04  -0.93  50%  -7.49  -7.09  -1.87  60%  •1.98  4.07  -4.73  70%  -10.54 -17.19  -2.96  80%  -0.09  -0.04  -0.97  Objective 50%  -5.62  -5.22  1.87  0  60%  2.75  8.8  4.73  0  70%  -7.58  -14.23  2.96  0  80%  0.88  0.93  0.97  Row  Random  Scoring Designs 146  APPENDIX N Discrepancy Differences for Test 1 Allocation Design Spiral  Row  Random  A. less than or equal to -2.0  0  -0.01  -0.02  B. between -1.999 and -1.0  0  0.02  -0.01  C. between-0.999 and 0  0  0  -0.01  D. between 0.001 and 1.0  0  0  -0.02  E. between 1.001 and 2.0  0  -0.04  -0.05  F. between 2.001 and greater.  0  0  -0.01  A. less than or equal to -2.0  0.01  0  -0.01  B. between -1.999 and -1.0  -0.02  0  -0.03  C. between -0.999 and 0  0  0  -0.01  D. between 0.001 and 1.0  0  0  -0.02  E. between 1.001 and 2.0  0.04  0  -0.01  0  0  -0.01  Random A. less than or equal to -2.0  0.02  0.01  0  B. between -1.999 and -1.0  0.01  0.03  0  C. between-0.999 and 0  0.01  0.01  0  D. between 0.001 and 1.0  0.02  0.02  0  E. between 1.001 and 2.0  0.05  0.01  0  F. between 2.001 and greater.  0.01  0.01  0  Spiral  Row  F. between 2.001 and greater.  Scoring Designs APPENDLX O Discrepancy Differences for Test 2 Allocation Design Spiral  Row  Random  A. less than or equal to -2.0  0  0  -0.02  B. between -1.999 and -1.0  0  0.02  0.01  C. between-0.999 and 0  0  0.01  0.01  D. between 0.001 and 1.0  0  -0.02  -0.02  E. between 1.001 and 2.0  0  -0.01  -0.02  F. between 2.001 and greater.  0  0  0  A. less than or equal to -2.0  0  0  -0.02  B. between -1.999 and -1.0  -0.02  0  -0.01  C. between-0.999 and 0  -0.01  0  0  D. between 0.001 and 1.0  0.02  0  0  E. between 1.001 and 2.0  0.01  0  -0.01  0  0  0  Random A. less than or equal to -2.0  0.02  0.02  0  B. between -1.999 and -1.0  -0.01  0.01  0  C. between-0.999 and 0  -0.01  0  0  D. between 0.001 and 1.0  0.02  0  0  E. between 1.001 and 2.0  0.02  0.01  0  0  0  0  Spiral  Row  F. between 2.001 and greater.  F. between 2.001 and greater.  Scoring Designs 148 APPENDIX P Discrepancy Differences for Test 3 Allocation Design Spiral  Row  Random  A. less than or equal to -2.0  0  -0.06  -0.07  B. between -1.999 and -1.0  0  -0.06  -0.08  C. between-0.999 and 0  0  0.01  -0.03  D. between 0.001 and 1.0  0  0.02  0.01  E. between 1.001 and 2.0  0  0  0  F. between 2.001 and greater.  0  -0.03  -0.03  A. 
less than or equal to -2.0  0.06  0  -0.01  B. between -1.999 and -1.0  0.06  0  -0.02  C. between-0.999 and 0  -0.01  0  -0.04  D. between 0.001 and 1.0  -0.02  0  -0.01  E. between 1.001 and 2.0  0  0  0  0.03  0  0  Random A. less than or equal to -2.0  0.07  0.01  0  B. between -1.999 and -1.0  0.08  0.02  0  C. between-0.999 and 0  0.03  0.04  0  D. between 0.001 and 1.0  -0.01  0.01  0  E. between 1.001 and 2.0  0  0  0  0.03  0  0  Spiral  Row  F. between 2.001 and greater.  F. between 2.001 and greater.  Scoring Designs 149 APPENDLX Q Percentage Difference for the Decision Agreement Between the Rater Allocation Designs and the Objective Scoring (Pass) for Test 1  Spiral  Row  Random  Spiral  Row  Random  50%  0  -4.77  0.74  60%  0  0.73  -0.98  70%  0  -7.41  10.28  80%  0  -0.36  -2.51  50%  4.77  0  5.51  60%  -0.73  0  -1.71  70%  7.41  0  17.69  80%  0.36  0  -2.15  50%  -0.74  -5.51  0  60%  0.98  1.71  0  70%  -10.28  -17.69  0  80%  2.51  2.15  0  Scoring Designs 150 APPENDIX R Percentage Difference for the Decision Agreement Between the Rater Allocation Designs and the Objective Scoring (Pass) for Test 2  Allocation Designs  Spiral  Row  Random  Spiral  Row  Random  50%  0  3.55  -9.95  60%  0  15.91  -8.46  70%  0  -11.35  0.61  80%  0  18.64  -5  50%  -3.55  0  -13.5  60%  -15.91  0  -24.37  70%  11.35  0  11.96  80%  -18.64  0  -23.64  50%  9.95  13.5  0  60%  8.46  24.37  0  70%  -0.61  -11.96  0  80%  5  23.64  0  Scoring Designs 151 APPENDIX S Percentage Difference for the Decision Agreement Between the Rater Allocation Designs and the Objective Scoring (Pass) for Test 3  Spiral  Row  Random  Spiral  Row  Random  50%  0  -0.72  -3.97  60%  0  -7.38  3.17  70%  0  22.27  -18.97  80%  0  1.16  -0.14  50%  0.72  0  -3.25  60%  7.38  0  10.55  70%  -22.27  0  -41.24  80%  -1.16  0  -1.3  50%  3.97  3.25  0  60%  -3.17  -10.55  0  70%  18.97  41.24  0  80%  0.14  1.3  0  Scoring Designs 152 APPENDLX T Percentage Difference for the Decision Agreement Between the Rater Allocation Designs and the Objective Scoring (Fail) for Test 3  Allocation Designs  Spiral  Row  Random  Spiral  Row  Random  50%  0  5.97  -0.41  60%  0  1.29  1.83  70%  0  2.7  -4.09  80%  0  0.33  0.55  50%  -5.97  0  -6.38  60%  -1.29  0  0.54  70%  -2.7  0  -6.79  80%  -0.33  0  0.22  50%  0.41  6.38  0  60%  -1.83  -0.54  0  70%  4.09  6.79  0  80%  -0.55  -0.22  0  Scoring Designs 153 APPENDIX U Percentage Difference for the Decision Agreement Between the Rater Allocation Designs and the Objective Scoring (Fail) for Test 2  Allocation Designs  Spiral  Row  Random  Spiral  Row  Random  50%  0  -3.28  7.92  60%  0  -2.95  2.44  70%  0  2.48  0.47  80%  0  -0.99  1.15  50%  3.28  0  11.2  60%  2.95  0  5.39  70%  -2.48  0  -2.01  80%  0.99  0  2.14  50%  -7.92  -11.2  0  60%  -2.44  -5.39  0  70%  -0.47  2.01  0  80%  -1.15  -2.14  0  Scoring Designs 154 APPENDLX V Percentage Difference for the Decision Agreement Between the Rater Allocation Designs and the Objective Scoring (Fail) for Test 3  Allocation Designs  Spiral  Row  Random  Spiral  Row  Random  50%  0  OJ  5.56  60%  0  6.23  0.77  70%  0  -3.01  7.39  80%  0  0.05  0.08  50%  -0.7  0  4.86  60%  -6.23  0  -5.46  70%  3.01  0  10.4  80%  -0.05  0  0.03  50%  -5.56  -4.86  0  60%  -0.77  5.46  0  70%  -7.39  -10.4  0  80%  -0.08  -0.03  0  
