A framework for validation of the use of performance assessment in science Bartley, Anthony William 1995

A FRAMEWORK FOR VALIDATION OF THE USE OF PERFORMANCE ASSESSMENT IN SCIENCE

by

ANTHONY WILLIAM BARTLEY

B.A., The University of Essex
M.Sc., The University of Kent at Canterbury

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES (Centre for the Study of Curriculum and Instruction)

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
April, 1995
© Anthony William Bartley 1995

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

(Signature)
The University of British Columbia
Vancouver, Canada
Date: 6 April 1995

ABSTRACT

The assessment of learning in school science is important to the students, educators, policy makers, and the general public. Changes in curriculum and instruction in science have led to greater emphasis upon alternative modes of assessment. Most significant of these newer approaches is "performance assessment", where students manipulate materials in experimental situations. Only recently has the development of performance assessment procedures, and the appropriate strategies for interpreting their results, received substantial research attention.

In this study, educational measurement and science education perspectives are synthesized into an integrated analysis of the validity of procedures, inferences and consequences arising from the use of performance assessment. The Student Performance Component of the 1991 B.C. Science Assessment is offered as an example. A framework for the design, implementation, and interpretation of hands-on assessment in school science is presented, with validity and feasibility considered at every stage. Particular attention is given to a discussion of the influence of construct labels upon assessment design. A model for the description of performance assessment tasks is proposed. This model has the advantage of including both the science content and the science skill demands for each task. The model is then expanded to show how simultaneous representation of multiple tasks enhances the ability to ensure adequate sampling from appropriate content domains.

The main conclusion of this validation inquiry is that every aspect of performance assessment in science is influenced by the perspective towards learning in science that permeates the assessment, and that this influence must be considered at all times. Recommendations are made for those carrying out practical assessments, as well as suggestions of areas that invite further research.

TABLE OF CONTENTS

ABSTRACT ii
TABLE OF CONTENTS iii
LIST OF TABLES vi
LIST OF FIGURES vii
ACKNOWLEDGEMENTS ix

CHAPTER 1 — INTRODUCTION
  Description of the Problem 1
  Historical and Theoretical Perspectives for this Study 4
  Research Questions 7
  Significance of the Study 9
  Delimitations of the Study 9

CHAPTER 2 — THEORIES OF VALIDITY, PROCESSES OF SCIENCE AND RELIABILITY
  Validity — Changing Conceptualizations 10
  Validity Evidence including Consequential Evidence 13
  Validation Procedures for Performance Assessments 22
  Processes of Science 38
  Processes in Hands-on Science Assessment 41
  Reliability — Meanings and Requirements 54
  Classical Theory 54
  Generalizability Theory 58
  Instrumental Variables and Performance Assessment 62

CHAPTER 3 — THE 1991 BRITISH COLUMBIA SCIENCE ASSESSMENT
  An Historical Perspective 64
  Governance and Structure 65
  Component 1: The Classical Component 67
  Component 2: The Student Performance Component 68
  Component 3: The Socioscientific Issues Component 69
  Component 4: The Context for Science Component 69
  Student Performance Component: A Detailed Description
    Planning for the Assessment 70
    The Assessment Framework 71
    Stations and Investigations 73
    Sampling 77
    Teacher Preparation for Data Collection 78
    Preparation for Data Analysis 78
    Coding Workshop 80
    Analytical and Statistical Procedures 81
  Reporting and Interpretation of Results 82
    Gender-Related Differences 83
    Grade-Related Differences 85
    Inter-coder Consistency Ratings for the Stations 88
    Student Performance Across Tasks 89
    Atomistic versus Holistic Scoring 90
  Project Recommendations 91

CHAPTER 4 — A FRAMEWORK FOR THE VALIDATION OF THE USE OF PERFORMANCE ASSESSMENT IN SCIENCE
  Validity in the Assessment Context 93
  A Validation Framework 96
  Purposes of the Assessment 98
  Learning and Communication in Science 101
  Content Analysis 108
  Instrumental Stability 112
  Administration Stability 113
  Internal Consistency and Generalizability 114
  Fairness 120
  Consequences 123
  Reflections upon Validation Questions 129

CHAPTER 5 — DESCRIBING STUDENT PERFORMANCE IN SCIENCE
  School Science 132
  Assessment Frameworks 135
  A Model for Describing Student Performance 154

CHAPTER 6 — INTERPRETING STUDENT PERFORMANCE IN SCIENCE
  Scoring Procedures 162
  Holistic Scoring 163
  Analytical Scoring 166
  Procedures for Interpretation 171
  Interpretation of Hands-on Performance Assessment in Science 173

CHAPTER 7 — CONCLUSIONS
  Response to the Four Research Questions 179
  Summary 183
  Final Remarks 186

REFERENCES 189

LIST OF TABLES

Table 1 Gender-related Differences in Student Performance 83
Table 2 Inter-coder Consistency Coefficients for Grade 7, Circuit A 116
Table 3 Pearson Correlation Matrix for Grade 7, Circuit A Data 118
Table 4 Grade 10 Student Performance — Satisfactory or Better Rating by Gender, Stations 1 to 6 121
Table 5 Student Performance on Common Tasks by Gender 122
Table 6 Stations Where the Percentage of Females and Males Judged to Have Satisfactory or Better Levels of Performance Differs by More than 15% 122
Table 7 Stations Where the Percentage of Females and Males Judged to Have Satisfactory or Better Levels of Performance Differs by More than 10% 123
Table 8 SISS Skill Category Mean Correlation Coefficients 146

LIST OF FIGURES

Figure 1 Facets of Validity as a Progressive Matrix (Messick, 1989a) 15
Figure 2 Messick's Facets of Validity Framework (Messick, 1989b) 35
Figure 3 APU Procedures for Scientific Enquiry 48
Figure 4 Overview of Science Assessment Components 68
Figure 5 Dimensions of Science 72
Figure 6 Venn Diagram of Stations 75
Figure 7 Cycle of Pilot-Testing 76
Figure 8 Criteria for the Choice of Assessment Tasks 77
Figure 9 Scale for Evaluating Performance on Station Tasks — "In your opinion how well did the student...?" 79
Figure 10 Evaluative Questions for Investigations 80
Figure 11 Traditional "Types" of Validity Evidence 94
Figure 12 An Integrated Model of Validity Evidence 95
Figure 13 Mapping of Performance Assessment Dimensions to Task Type 102
Figure 14 Criteria for the Selection of Tasks 103
Figure 15 Grade 4 Student Comments about the Assessment Tasks 103
Figure 16 Student Instructions for Magnets Investigation 104
Figure 17 Judgement Questions for Grade 7, Circuit A 117
Figure 18 Seven Curriculum Emphases and Associated Views of Science 134
Figure 19 Development of the Achievement Instruments 135
Figure 20 Higher Order Thinking in Science and Mathematics 139
Figure 21 SISS Process Test Skill Categories 142
Figure 22 SISS Procedures 143
Figure 23 Correspondence between SISS Practical Skill Categories and Klopfer's Scheme 143
Figure 24 Klopfer Table of Specifications for Science Education 144
Figure 25 Classification of Exercise 2A1 145
Figure 26 Classification of Task 2A1 by SISS Practical Skill Categories 145
Figure 27 SISS Skill/Content Matrix 146
Figure 28 Aspects and Major Categories of the TIMSS Science Framework 149
Figure 29 The Dimensions and Abilities used in the Student Performance Component of the 1991 B.C. Science Assessment 152
Figure 30 Sub-categories of 'Practical Skills' 157
Figure 31 A Three-dimensional Model of Performance in Science 157
Figure 32 A Two-dimensional Model of Performance in Science 159
Figure 33 The B.C. Science Assessment Grade 10 Station Tasks Mapped onto the Template 160
Figure 34 CLAS Science Scoring-guide Shell, Points 4 and 1 164
Figure 35 CLAS Science Score Points 4 and 1 for "Spaceship U.S.A." 165
Figure 36 IEP Style of Presenting Scores 168
Figure 37 Berlak's Measurement Paradigms 172

ACKNOWLEDGEMENTS

My thanks and appreciation goes to my wife, Jan MacPhail, her parents Donald and Emma MacPhail, and my mother Hilda Bartley, who has watched and listened from afar.

My work on the Student Performance Component of the 1991 B.C. Science Assessment with Gaalen Erickson, Bob Carlisle, Karen Meyer, Lorna Blake and Ruth Stavy was a vital part of my growth. The Ministry of Education of the Province of British Columbia funded the 1991 B.C. Science Assessment; Jim Gaskill and Amy Bryden valued my work on the provincial assessment project, which led to my involvement in the workshops for district performance assessment around the province.

My committee — Gaalen Erickson, Dave Bateson and Nand Kishor — were most encouraging. I value my contact with each and all of them, particularly as I was framing the focus of the study.

The community of graduate students at U.B.C. in Math and Science (now Curriculum Studies) provided a supportive environment. Foremost among these supporters were Tony Clarke and Renee Fountain, who were there at the beginning and the finale. I wish both of them well in their own careers.

CHAPTER 1 — INTRODUCTION

DESCRIPTION OF THE PROBLEM

A powerful message about the systemic nature of educational reform was sent through North America when the American Association for the Advancement of Science (AAAS) organized the 1990 Forum for School Science, Assessment in the Service of Instruction (Champagne, Lovitts, & Callinger, 1990).
The crux of this message was that for change in curriculum and instruction in science to be effective there must also be change in assessment practices. Good assessment in science must expand from mere indicators of performance in the form of multiple-choice tests to include direct measures of performance (Lovitts and Champagne, 1990). In this dissertation I address some of the problems related to this shift in emphasis to performance-based assessment, and propose some solutions.

The history and changes of perspective in the development of large-scale multiple-choice tests are outlined by Cole (1991). At their inception it was assumed that such tests were "neutral indicators of student progress largely isolated from classroom concerns" (Cole, 1991, pp. 97-98). However, when competency in the so-called "basic skills" became an issue in the 1970s, and criterion-referenced testing programs were established with district and school scores published, these neutral indicators became an integral part of the system. Teachers perceived a need to focus on test preparation by spending more time on topics covered by the tests and felt obliged to prepare students for the tests by drilling specific item formats. Lovitts and Champagne (1990) argue that such large-scale, multiple-choice assessments serve the needs of policy-makers rather than those of classroom teachers or their students, in part because assessments of this type produce aggregated data amenable to statistical manipulation. This enables politicians and senior administrators to monitor the status of science programs within a district, state, or even nation, as in the case of the National Assessment of Educational Progress (NAEP). Lovitts and Champagne contend that assessment by multiple-choice tests emphasizes the less important aspects of scientific literacy, such as recognition and recall of factual information, rather than the more significant aspects of science, such as generating and testing hypotheses, designing and conducting experiments, and solving multi-step problems. Their opinion is supported in the Carnegie Commission report In the National Interest: The Federal Government in the Reform of K-12 Math and Science Education (1991), which identifies the emphasis upon standardized testing as leaving "students without the capacity to think quantitatively and solve problems for themselves" (p. 23). In the National Interest followed a series of national reports describing problems in the American school system [1], each with an emphasis upon the position of science and mathematics (Tucker, 1991) and describing the need for systemic change.

In British Columbia, a Royal Commission embarked on an extensive examination of the education system and published the report A Legacy for Learners: The Report of the Royal Commission on Education 1988 (Sullivan, 1988). This report led to a response from the Ministry of Education in the form of the document Year 2000: A Framework for Learning (Ministry of Education, 1990). Year 2000 sets out the framework for education reform in the Province for the decade leading to the year 2000. A vital part of this proposed change is a learner-focused framework for curriculum and assessment which is defined as:

    developmentally appropriate, allows for continuous learning, provides for self direction, meets the individual learning needs of the students as much as possible and deals with matters of relevance to the learners. (Emphasis in original. Ministry of Education, 1990, p. 9)

[1] A Nation at Risk (National Commission on Excellence in Education, 1983), A Nation Prepared (Carnegie Forum on Education and the Economy, 1986), A Time For Results (National Governors' Association, 1986), National Goals for Education (U.S. Department of Education, 1990), and America's Choice: High Skills or Low Wages! (National Center on Education and the Economy, 1991).

The 1991 B.C. Science Assessment was the first major project of the Assessment, Examinations, and Reporting Branch of the Ministry of Education to be conceived and conducted following the publication of the Year 2000 document. The design of the Assessment reflects this new policy, particularly in the range of the components.

British Columbia has a long history of science assessments as part of the Provincial Learning Assessment Program (Hobbs, Boldt, Erickson, Quelch, & Sieben, 1980; Taylor, Hunt, Sheppy, & Stronck, 1982; Bateson, Anderson, Dale, McConnell, & Rutherford, 1986). The 1991 Science Assessment (Bateson, Erickson, Gaskell, & Wideen, 1992), with its principal focus upon Grade 4, 7 and 10 students, built upon this history. The mode of assessment was extended from primarily the use of multiple-choice questions to include open-ended written questions, a classroom observation component, a socio-scientific issues component, and a performance assessment component. These additional components enabled collection of a wide range of valuable data about science education in British Columbia. The performance component is the most significant development in the current debate about the quality of alternative assessments (Rothman, 1990a; Linn, Baker, & Dunbar, 1991), particularly with respect to performance assessment in science. For example, students from B.C. took part in the performance option of the 1991 International Assessment of Educational Progress (IAEP) in mathematics and science (Semple, 1992). The United States did not because "the performance-based items were not of the same quality as the multiple-choice questions" (Rothman, 1990b, p. 10). Responding to this criticism, and the absence of the U.S.A. from the performance option, Rothman reports that Lapointe, the project director at the Educational Testing Service (ETS), acknowledged that there had been little research on the validity and reliability of performance assessment use. This is particularly remarkable because the ETS had coordinated the assessment.

My intent is to address the concerns regarding validity and reliability of performance assessments, in the curriculum area of science, by proposing a framework for validation. The description of student performance in science continues to evoke debate about the use of the "processes of science" as a scaffold about which the assessment tasks are designed and student performance reported (Donnelly and Gott, 1985; Bryce and Robertson, 1985; Millar and Driver, 1987; Johnson, 1989; Millar, 1991), and about the model of science which is portrayed by such an approach (Woolnough, 1989). Recent developments in the theory of validity (Messick, 1989a, 1989b, 1994; Moss, 1992, 1994; Shepard, 1993) have identified validity as a unitary concept, based upon "construct validity". The most significant development in this reconceptualization of validity is the inclusion of "consequential validity" as a component of construct validity. Consideration of consequences entails an examination of the values that give rise to the construct labels used in defining an assessment.

The Student Performance Component of the 1991 B.C. Science Assessment is used to exemplify arguments, procedures, and conclusions presented in this study.
In the British Columbia Assessment of Science 1991 Technical Report II: Student Performance Component (Erickson, Bartley, Blake, Carlisle, Meyer, & Stavy, 1992), performance assessment has been defined as hands-on assessment of student abilities in science, using tasks that are classified as either "stations" or open-ended "investigations". "Stations" are short tasks, each focusing upon different aspects of school science; every student completes six different stations. The data for stations consist of the students' written responses. "Investigations" are more complex tasks in which students are able to use any combination of provided equipment to find a solution to a preset problem. The data for investigations consist of observers' records, students' written responses, and interview responses.

HISTORICAL AND THEORETICAL PERSPECTIVES FOR THIS STUDY

My perspective for this validation inquiry is derived from current conceptions of validity (Messick, 1989a, 1989b, 1994; Moss, 1992, 1994; Shepard, 1993). Score interpretation, along with consequential aspects of the assessment process, makes up the significant elements of this inquiry. Hein (1990) warns science educators of the problems of abdicating responsibility or involvement in science assessment:

    Science assessment is too important to be left in the hands of only psychometricians and other test developers. As a problem that combines theoretical and real-world issues, it requires input from groups representing many perspectives. Only then will we come to solutions that have both theoretical validity and application in the real world of schools. (p. 279)

Specific emphasis is placed upon the construct labels used to describe student performance with materials in science (Bryce and Robertson, 1985; Black, 1986; Millar and Driver, 1987; Millar, 1989, 1991; Hodson, 1986).
Concerns about the quality and technical adequacy of performance assessments focus upon the relationship between the reliability of the data collected and the validity of the interpretations of those data (Linn et al., 1991). Reliability is seen as a necessary but not sufficient condition in the validity of data interpretation. Feldt and Brennan (1989), in their chapter entitled "Reliability" in the handbook Educational Measurement (Linn, 1989), warn against exaggerated concerns for reliability. They acknowledge:

    the primacy of validity in the evaluation of the adequacy of an educational measure. No body of reliability data, regardless of the methods used to analyze it, is worth very much if the measure to which it applies is irrelevant or redundant. (p. 143)

Conceptions of validity (Moss, 1992; Shepard, 1993) have shifted with changing perspectives in the philosophy of science (Messick, 1989b). While these three authors set out some primary considerations, the perspective for this validation inquiry synthesizes the work of:

1) Fredericksen and Collins (1989), who define criteria for enhancing systemic validity;

2) Linn, Baker, and Dunbar (1991), who set out expanded criteria for validity inquiry for alternative assessments;

3) the Quantitative Understanding: Amplifying Student Achievement and Reasoning (QUASAR) Project, a mathematics curriculum reform project intended to promote the acquisition of thinking and reasoning skills in mathematics. This project made extensive use of performance assessment; the paper Principles for Developing Performance Assessments (Lane, Parke, & Moskal, 1992) presents its perspective;

4) Berlak, Newmann, Adams, Archbald, Burgess, Raven, and Romberg (1992), who argue for a "New Science of Educational Testing and Assessment". This group includes Archbald and Newman (1988), who introduced the phrase "authentic academic achievement", which has had a significant impact upon the measurement community; and

5) Herman (1992), who believes that "good assessment is built upon current theories of learning and cognition". She cites several authors, including Wittrock (1991), who consider meaningful learning to be "reflective, constructive, and self-regulated" (p. 75). Herman argues that student motivation is a significant factor in assessment, acknowledging the importance of a "disposition to use the skills and strategies as well as the knowledge of when to apply them" (p. 75).

Validation inquiry studies for large-scale alternative forms of assessment have been made in the curriculum areas of written composition (Welch, 1993) and mathematics (Magone, Cai, Silver, & Wang, 1992). Shavelson and his associates have made some progress in clarifying certain issues in the validation of alternative assessment in science (Pine, 1990; Baxter, Shavelson, Goldman, & Pine, 1992; Shavelson, Baxter, & Pine, 1992; Shavelson and Baxter, 1992; Shavelson, Baxter, & Gao, 1993; Shavelson, Gao, & Baxter, 1994). Aspects relating to scoring procedures, task stability, and sampling variability have been addressed through generalizability theory. Although much of the Shavelson group's work has been set in a research context and used a small number of tasks, the 1993 paper addresses the effects of sampling variability using some of the extensive data amassed from the 1990 California Assessment Program (CAP).

Pine (1990) examines the tension between judgements based upon assessing students' science knowledge and competence from work done over a short period of time in an assessment situation, and those judgements based upon longer term assessments, for example those by a classroom teacher.
Pine describes the Assessment of Performance Unit (APU) as "most likely the world's most experienced group in developing test questions for assessing science process skills", but he points out that "the APU does not have data to establish the validity of its questions" (1990, p. 91). Pine considers that expert judgement could be considered as "prejudice" and that "validating methodology is an area in which informed opinion will not suffice" (1990, p. 91). Moss believes that "we need to consider yet another expansion in our delimitation of the concept of validity" (1992, p. 252). This is particularly so in the curriculum area of science, where developments in assessment procedures have moved ahead of our capacity to give meaning to the data produced, i.e., the validation procedures.

Of the issues raised in the assessment literature, the validation of hands-on performance assessments in science is the one that I address in this dissertation. Linn, Baker, and Dunbar (1991) envisage that a general set of criteria used to judge the adequacy of assessments should include consequences, fairness, transfer and generalizability, cognitive complexity, content quality, content coverage, meaningfulness, and cost and efficiency. These criteria are evaluated, clarified, and expanded in considering the validation of performance assessment in science.

RESEARCH QUESTIONS

My work in this dissertation is to propose and evaluate procedures for validation in the assessment of students' hands-on performance in science. The range of evidence considered in such an approach necessarily encompasses both analytical and empirical domains. Procedures developed for, and data produced in, the Student Performance Component of the 1991 B.C. Science Assessment (Erickson et al., 1992) are used to exemplify the model proposed. Specifically, this validation inquiry addresses four major questions:

1. What are the essential components of a systematic framework for the development and administration of performance assessments in science?

This question should be considered a foundation question which sets out a myriad of concerns about principles and procedures that must be in place to enable valid inferences about student performance in science.

2. What are the essential characteristics of descriptors of student performance on performance tasks in science?

Question 2 addresses the issue of the choice of construct labels. It requires a discussion of the identification and labelling of skills/processes and content in science, and the reasons why this approach is attractive in assessment. This is an issue that permeates the dissertation since it is at the core of validity for performance assessment in science.

3. What are the implications of using different strategies for scoring student achievement upon the interpretation of student performance?

This question expands the analysis of holistic and analytical (atomistic) scoring into an examination of construct validity, reliability, and the messages about learning in science given by the use of a specific scoring system. This aspect of consequential validity is receiving greater prominence (Baxter, Shavelson, Herman, Brown, & Valadez, 1993), though little has yet been published in the field of science assessment.

4. How could/should test scores [2] in performance assessments be interpreted and used?

Question 4 is not only a "so what?" question, but also a "who?" question. The interpretation of student performance, and the consequences of that interpretation, are the crux of the matter. It is essential to discuss who should interpret student performance and the implications of their involvement. Students, teachers, administrators, parents, politicians, and professional evaluators all have perspectives towards the interpretation of performance. The nature of the interpretation, and the power to effect change, are issues that must be considered.

[2] The term test score is used here generically in the same way that Messick (1989b, p. 14) uses score to mean "any observed consistency" and test to include questionnaire, observation procedure or other assessment device. He qualifies this general usage to include qualitative and quantitative summaries not only of persons but of judgemental consistencies and attributes.

SIGNIFICANCE OF THE STUDY

Performance assessment is still in its infancy in North America. Herman warns that "what we know about them [performance assessments] is relatively small compared to what we have yet to discover" (1992, p. 74). Baxter, Shavelson, Goldman, and Pine (1992) provide evidence that hands-on performance assessment appears to be measuring something different from pencil-and-paper modes of assessment. However, these authors concentrate upon procedural and technical issues rather than attempting to explain the conceptual problems in terms of a learning theory or a philosophy of science. In addressing conceptual as well as technical issues pertaining to performance assessment in science, this study will make a significant contribution in an area of expanding interest and concern.

DELIMITATIONS OF THE STUDY

This validation inquiry is expressly focused upon assessment of hands-on student performance in the context of school science. Questions 1 and 2 are set explicitly within the domain of science and are intended to extract some operational understanding, in this context, for the terms "reliability" and "construct validity". Questions 3 and 4 direct an analysis of consequential validity, focusing upon the consequences of scoring strategies and the consequences of interpretation. Many of the procedures in the development and use of performance assessment tasks in mathematics, particularly in the use of manipulatives, are similar to those in science.
Because of these parallels it is likely that many of the proposals for the validation of performance assessments in science are equally applicable to performance assessments in mathematics.

CHAPTER 2 — THEORIES OF VALIDITY, PROCESSES OF SCIENCE AND RELIABILITY

The theoretical rationale behind this validation inquiry of the assessment of hands-on performance in science is derived from current concepts of validity. In this chapter I discuss how the theories of validity have undergone a significant re-alignment as the original "types" of validity have been absorbed into construct validity, and construct validity has been expanded to cover the consequences of test use. This expansion of validity leads to a discussion of the approaches that have been taken to re-examine how validation of performance assessments can be achieved. The breadth of the terms of reference of validity, particularly the considerations embedded in the consequences of test use, requires an analysis of the value positions implied in the choice of construct labels in science assessment. I set out the rationale behind the extensive use of "processes of science" as an organizing theme in hands-on assessment, and explore the concerns of those educators who argue against such a perspective.

Assessment data must conform to some agreed standard of reliability. For performance assessments involving the use of manipulatives, traditional measures of reliability are inappropriate. The assumptions of classical reliability theory are examined, and the application of generalizability theory is reviewed. Qualitative factors that influence reliability, including the instrumental variables arising from the use of equipment, are also considered.

VALIDITY — CHANGING CONCEPTUALIZATIONS

The theory of validity has undergone significant evolution in recent times. Of particular importance is the chapter by Messick (1989b), in the third edition of Educational Measurement, which presents validity as a unitary concept.
Messick defines validity as “anintegrated evaluative judgement of the degree to which empirical evidence and theoreticalrationales support the adequacy and appropriateness of inferences and actions based on testscores or other modes of assessment”1(1989b, p. 13). This is significant because Messickidentifies that it is inferences and actions, together with their underlying theories, that mustbe validated. Messick argues that validity is a matter of degree, continuous over a range,and likely to change over time. Validity evidence gains or loses strength with newfindings, and as the expected consequences of testing are realized (or not) by the actualconsequences. This being the case, validation is never complete and the test developermust continue to search for evidence to make “the most reasonable case to guide bothcurrent use of the test and current research to advance understanding of what test scoresmean” (Messick, 1989b, p. 13). Messick is consistent in emphasizing the need forvalidation studies to use multiple sources to obtain a range of evidence to support score-based inferences, as “validity is a unitary concept” (Messick, 1989b, p. 13). Messickdescribes validation as following the methods of science in the collection of evidence tosupport inferences; this is seen as hypothesis testing in the context of interpretive theoriesof score meaning. The major concern of validity is “to account for consistency inbehavior, or item responses, which frequently reflects distinguishable determinants”(Emphasis in original. Messick, 1989b, p. 14). In restating this key aspect of validity,Messick reminds us that the:emphasis is on scores and measurements as opposed to tests or instrumentsbecause the properties that signify adequate assessment are properties ofscores, not tests. Tests do not have reliabilities and validities, only testresponses do. 
    This is an important point because test responses are a function not only of the items, tasks, or stimulus conditions but of persons responding and the contexts of the measurement. (1989b, p. 14)

1 Evidence includes data, facts and the rationale or arguments used to justify the inferences (Messick, 1989b).

The context of the measurement is a further issue in validity research. Messick refers to earlier work by Cronbach (1971) and argues that it is the interpretation of data arising from a specified procedure that is to be validated, and this validation should include an analysis of the generalizability of the interpretation, particularly over time.

A further aspect of generalizability is derived from consideration of test behaviour as a sample of a domain behaviour, or as an indicator of some other underlying process or trait2. This step beyond the obvious has led to the use of constructs, for example “critical thinking skills”, in an attempt to describe these underlying traits. Such traits are given significant emphasis in educational measurement. Messick perceives these underlying structures as a source of much contention among measurement theorists: behaviorists and social behaviorists interpret scores as samples of response classes3, but trait theorists and cognitive theorists consider scores as signs of underlying processes or structures (Messick, 1989b). Whereas a trait represents a stable set of relationships, a person’s state is considered likely to change across contexts and over time4. Test scores may be interpreted as signs of trait disposition, or internal states, or some combination of these. Whichever extreme is chosen there is a requirement for validation of the hypothesis (Messick, 1989b). In making this stipulation, Messick warns validity researchers against taking for granted a mode of interpretation and using this assumption to identify the nature of the evidence for the validation inquiry. Data do not constitute evidence. Messick (1989b, p.
16) cites Kaplan (1964, p. 375) to make this point:

    What serves as evidence is the result of a process of interpretation — facts do not speak for themselves; nevertheless, facts must be given a hearing, or the scientific point to the process of interpretation is lost.

2 “A trait is a relatively stable characteristic of a person which is consistently manifested to some degree when relevant despite considerable variation in the range of settings and circumstances” (Messick, 1989b, p. 15).

3 Messick defines response class as a “class of behaviors that reflect essentially the same changes when the person’s relation to the environment is altered” (1989b, p. 15).

4 A state is considered as a temporary condition of mentality or mood, transitory level of arousal or drive (Messick, 1989b).

This perspective leads to a conceptualization of evidence as a blending of facts together with the theoretical rationale used for their interpretation. Messick compounds the issue further by adding consideration of values:

    Hence, just as data and theoretical interpretation were seen to be intimately intertwined in the concept of evidence, so data and values are intertwined in the concept of interpretation. And this applies not just to evaluative interpretation, where the roles of values is often explicit, but also to theoretical interpretations more generally, where value assumptions frequently lurk unexamined. Fact, meaning and value are thus quintessential constituents of the evidence and rationales underlying the validity of test interpretation and use. (Messick, 1989b, p. 16)

Thus all interpretations must be considered value-laden, and these values must be recognized as guiding the choice of data collection methods, i.e., the test.

Validity Evidence including Consequential Evidence

Messick (1989b, p.
16) lists six possible basic sources of validity evidence:

- content of the test in relation to the content of the domain of reference;
- descriptions of individuals’ responses;
- internal structure of the test responses, i.e., relationships between items or parts of the test;
- external structure of the test responses, i.e., relationships between test scores and other measures;
- differences in test processes and structures over time, in different contexts, or as a result of some treatment;
- social consequences of interpreting test scores in a particular way, with some examination of the intended outcomes and the unintended effects.

Messick considers it important to contrast this approach to validity with the traditional categories of validity: content, criterion-related and construct. Each of these facets of validity had been given status as a type of validity rather than as a component of the whole. The relationship between these three facets and Messick’s six sources of validity evidence is meaningful in clarifying developments in validation procedures.

Content validity is defined as “based on the professional judgements about the relevance of test content to the content of a particular behavioral domain of interest and about the representativeness with which item or task content covers that domain” (Messick, 1989b, p. 17). As considerations of content validity are not concerned with responses or scores that arise from the use of a test, Messick questions whether content validity should be described as a form of validity. He recognizes that content relevance and representation influence the nature of score inferences, but believes that validity of score interpretation must be supported by other evidence.

Criterion-related validity is concerned only with “specific test-criterion correlations” (Messick, 1989b, p. 17). Messick describes this as focusing upon a specific part or parts of a test’s external structure.
This may lead to a range of criterion-related validities as scores are compared with other measures. For examples, see Burger and Burger (1994).

Construct validity is “based on an integration of any evidence that bears on the interpretation or meaning of the test scores” (Messick, 1989b, p. 17). The test score cannot be equated with the construct it is intended to tap, and should not be considered as defining the construct, but rather as “one of an extensible set of indicators of the construct” (Messick, 1989b, p. 17). In this sense, a construct would be “invoked as a latent variable or ‘causal factor’ to account for the relationships among its indicators” (Messick, 1989b, p. 17). The breadth of construct validity enables “almost any kind of information about a test” (Messick, 1989b, p. 17) to contribute to an understanding of its construct validity, but this contribution is strengthened if there is justification of the theoretical rationale behind the interpretation of the scores. In the past, it was the pattern of relationships between internal and external test structures which made the major contribution to construct validation. Construct validity subsumes content representation and criterion-relatedness because such information contributes to score interpretation. With the consideration of the consequences of testing, the three historical “types” of validity, and the evidence supporting them, have been embraced by construct validity.

Messick has developed a “progressive matrix” which illustrates the inclusion of the facets of validity, each building upon construct validity.
Figure 1, presented here, is taken from the paper Meanings and Values in Test Validation: The Science and Ethics of Assessment (Messick, 1989a), as this schema represents the progression with greater clarity than does the matrix in the chapter “Validity” (Messick, 1989b) in Educational Measurement.

                          TEST INTERPRETATION             TEST USE
    EVIDENTIAL BASIS      Construct Validity (CV)         CV + Relevance/Utility (R/U)
    CONSEQUENTIAL BASIS   CV + Value Implications (VI)    CV + R/U + VI + Social Consequences

Figure 1. Facets of Validity as a Progressive Matrix (Messick, 1989a)

In remarking upon the connection between the “consequences” and “functional worth” of testing, Messick presents a strong case for consideration of consequential evidence in validity studies:

    It is ironic that little attention has been paid over the years to the consequential basis of test validity, because validity has been cogently conceptualized in the past in terms of the functional worth of testing — that is, in terms of how well the test does the job it is employed to do. (Messick, 1989b, pp. 17-18)

As the influence of testing has become more pervasive, particularly in the change of role from a neutral indicator to an explicit component of educational programs, the need to consider the consequences of testing has become more imperative. Messick considers that the consequential basis of testing is concerned with two issues, first:

    the appraisal of the value implications of the construct label, of the theory underlying test interpretation, and of the ideologies in which the theory is embedded...

and second:

    the appraisal of both potential and actual social consequences of the applied testing. (1989b, p. 20)

The first issue leads to the question of how to take account of values in test validation rather than whether values should be considered, because such action is “virtually mandatory” (Messick, 1989b, p. 58).
The second issue identifies a requirement to broaden the scope of validation inquiry beyond merely looking for consequences that are immediate, positive and anticipated, to search for effects that are long-term, negative and unanticipated, even if finding evidence of such outcomes imperils any future program of testing. It is in this latter situation that the value-system implicit in the mode of validation becomes of great significance.

Test validation is a process of inquiry (Messick, 1989b) and as such can be conceptualized in terms of some combination of the five systems of inquiry that are described by Churchman (1971) and his co-workers Mitroff and Sagasti (1973). In drawing together these five systems Churchman and his colleagues reconstruct “some of the major theories of epistemology in a way that makes them more pertinent to the knowledge-acquisition and model-building goals of the practicing scientist” (Messick, 1989b, p. 31). Churchman (1971) named these five systems after the philosophers whose work is best captured by the system of inquiry. The labels are Leibnizian, Lockean, Kantian, Hegelian, and Singerian (Messick, 1989b).

Messick adjudges the Leibnizian and Lockean approaches to be best suited for tackling well-structured problems, with the Lockean system appearing to capture the essence of validation of multiple-choice tests:

    A Lockean inquiring system entails an empirical and inductive approach to problem representation. Starting with an elementary data set of experiential judgments or observations and some consensually accepted methods for generating factual propositions inductively, an expanding network of facts is produced. The standard of validity in a Lockean inquiry system is the consensus of experts on the objectivity, or lack of bias, with respect to data and methods. (Messick, 1989b, p. 31)

Where there is no consensually agreed definition of the problem, Messick posits that the Kantian, Hegelian, or Singerian systems offer the most promise.
A Kantian mode of inquiry starts with:

    at least two alternative theories or problem representations, from each of which are developed corresponding alternative data sets or fact networks. (Messick, 1989b, p. 31)

In the Kantian mode of inquiry these alternative theories may be divergent, but not antagonistic. Alternative theories may lead to multiple interpretations of the data, each with its own strengths and weaknesses. The test for validity in the Kantian system is the “goodness of fit or match between the theory and its associated data” (Messick, 1989b, p. 31). An Hegelian approach to inquiry is developed through the formulation of conflicting perspectives on the problem:

    Specifically, the Hegelian approach starts with at least two antithetical or contrary theories, which are then applied to a common data set. It is hoped that this dialectical confrontation between rival interpretations of the same data will expose the contrary assumptions of the competing models to open examination and policy debate. The guarantor of a Hegelian inquiry system is conflict. (Messick, 1989b, p. 31)

The Hegelian system of inquiry is most suitable for problems where there is little agreement upon the structure or even the nature of the problem (Messick, 1989b), particularly as it requires discussion of the underlying value assumptions. It is intended that the debate should be between ideas rather than between people, and that scientists should examine alternate perspectives to an extent that challenges their own positions. Churchman identifies the product of Hegelian inquiry as “two stories, one supporting the most prominent policy on one side, the other supporting the most prominent policy on the other side” (1971, p. 177).
A Singerian approach is one which uses multiple systems of inquiry to “observe” the target inquiry process5:

    A Singerian inquiring system starts with the set of other inquiring systems (Leibnizian, Lockean, Kantian, Hegelian) and applies any system recursively to another system, including itself. The intent is to elucidate the distinctive technical and value assumptions underlying each system application and to integrate the scientific and ethical assumptions of the inquiry. (Messick, 1989b, p. 32)

While Messick finds that most educational and psychological research has been either Leibnizian, Lockean or some combination of the two, he rejoices in the fertility of the “methodological heuristics” of the Singerian inquiring system (Messick, 1989b, p. 32). The common feature of the Kantian, Hegelian and Singerian systems is the requirement of the existence of alternative theories. There must be dialogue to enable comparison between alternative or antithetical theories, and recognition that “observations and meanings are differentially theory-laden and theories are differentially value-laden” (Messick, 1989b, p. 32).

The discussion of the role of values centres around perceptions that the influence of values upon scientific inquiry lessens the worth of that inquiry. In the social sciences it has been an unambiguous expectation that researchers make explicit their own values in conducting research (Howe, 1985). To reinforce his argument that values guide approaches to validation, Messick traces the roots of the words ‘valid’ and ‘value’ back to the same Latin root, valere, which means “to be strong” (1989b, p. 59).
The word ‘value’ came directly from the French verb valoir, meaning “to be worth”, which Messick applies to the functional worth of testing.

5 Moss, in her paper “Can There be Validity Without Reliability?” in Educational Researcher (March, 1994), provides an example of Singerian inquiry in discussing the hermeneutic and psychometric perspectives.

In science, and validation inquiry, the values of the investigator, and to a great extent those of the specific scientific community to which the investigator belongs, are major determinants of problem selection, identification of research perspectives and theories, data collection and processing, and inferences that are made from those data. The value implications of the choice of construct labels are particularly important in terms of the “range of the implied theoretical and empirical referents” (Messick, 1989b, p. 60). Messick invokes the term referent generality, used earlier by Snow (1974), where there are difficulties in embracing all of the substantive features of a construct in a single or a composite measure. The use of such measures is very likely to allow hidden values extensive opportunities for pernicious growth and pervasive implications. There is tension between levels of generality: too specific at one end of the spectrum and too broad at the other.

    At one extreme is the apparent safety in using strictly descriptive labels tightly tied to behavioral exemplars in the test (such as Adding and Subtracting Two-Digit Numbers). The use of neutral labels descriptive of test tasks rather than of the processes presumably underlying task performance is a sound strategy with respect to test names to be sure (Cronbach, 1971). But with respect to construct labels, choices on this side sacrifice interpretive power and range of applicability if the construct might defensibly be viewed more broadly (e.g., as Number Facility).
    At the other extreme is the apparent richness of high-level inferential labels such as intelligence, creativity, or introversion. Choices on this side suffer from the mischievous value consequences of untrammeled surplus meaning. (Messick, 1989b, p. 60)

The search for a defensible balance of construct reference is seen to be governed by the range of research evidence available. However, Messick warns against carving this in stone by citing Cronbach’s (1971) reminder that constructs refer to potential as well as actual relationships. Thus as new data are collected, it may be possible, or necessary, to review and modify the construct labels used to describe test behaviour, and these data must also guide the general statements about test behaviour.

As choices of construct labels are guided by theories or ideologies, these too must be considered part of the value implications. This is particularly important when one considers the value implications that arise when alternative theories are brought to bear upon the same data. Messick argues that such effects support one of the relativists’ basic points: that the advocates of opposing theories appear to live in different worlds. As the boundaries between subjectivity and objectivity have become blurred, it has become more important to set out theories, ideologies6 and world views for public scrutiny. A specific problem that Messick describes arises when two different theories share the same metaphorical perspective or model. This leads to investigators “talking past one another” by using the same language, but each interpreting according to her/his own framework.

Broader ideologies that give theories perspective and purpose must also be considered. A particular difficulty arises when there is “ideological overlay”, which Messick considers to be a frequent occurrence in educational measurement and likely to cause much fallout as a result of radically different perceptions of the value implications of test interpretation and use.
Even the claim of scientific behaviour must be seen as value-laden, as are the epistemic principles which guide scientific choices — predictive accuracy, internal consistency, unifying power and simplicity (Messick, 1989b, p. 62). Messick does not believe that it is necessary to go so far as to label scientific judgements as value judgements. He does, however, insist that there must be some empirical support or rational defence against criticism. The challenge in test validation is to excavate the value assumptions of a construct theory and its ideological foundations.

The consequential basis of test use is concerned not only with how well the test does the job it is employed to do but also with the question of what other jobs the test actually does or might do, whether by intent or by accident. These consequences should be considered at all levels — individual, institutional, and systemic or societal.

6 Messick defines an ideology as “a complex configuration of shared values, affects, and beliefs that provides, among other things, an existential framework for interpreting the world — a ‘stage setting’, as it were, for viewing the human drama in ethical, scientific, economic or whatever terms” (1989b, p. 62).

Messick states the point eloquently:

    Although appraisal of the intended ends of testing is a matter of social policy, it is not only a matter of policy formulation but also of policy evaluation that weighs all of the outcomes and side effects of policy implementation by means of test scores. Such evaluation of the consequences and side effects of testing is a key aspect of the validation of test use. (1989b, p. 85)

For example, unintended effects may appear as gender or ethnic differences in score distributions. In the case of a test to be used for selection screening, such differences have some impact upon the functional worth of the selection process. Messick (1989b, p.
85) considers that it is important to distinguish between issues of test invalidity and test validity in this discussion7. Test invalidity arises when irrelevant sources of test and criterion variance overpower other relevant properties of the instrument. When gender or ethnic-related score differences arise because of valid properties of the construct tapped, then such differences contribute to score meaning and are a reflection of test validity. The challenge for validation is the identification of the irrelevant causes of variance, and consideration of the consequences of such identification upon interpretation and actions. Many of the factors that are seen to cause variance must be discounted because of political or social policies; these lead to special pressure upon professional judgement in guiding the use and interpretation of tests. Messick captures the issue:

    The point is that the functional worth of the testing depends not only on the degree to which the intended purposes are served but also on the consequences of the outcomes produced, because the values captured by the outcomes are at least as important as the values unleashed by the goals. (1989b, p. 85)

Messick perceives that testing of consequences could parallel the testing of constructs in the presentation of alternative hypotheses. In this case, counter proposals would be used to expose “key technical and value assumptions of the original proposal to critical evaluation” (Messick, 1989b, p. 86). These counter proposals could involve a range of alternative assessment techniques which tap the construct through different interpretive methodologies, and could also include an analysis of the social consequences of not testing at all!

7 I am using the term ‘test validity’ in the manner set out by Messick (1989b) to focus upon the interpretation of scores and the validity of test use.
In this respect Messick considers that the recognition of alternative perspectives about standards, and about the social values to be served by assessment methods, is laudable. Such recognition would enable, even encourage, investigators to use multiple modes of inquiry to search for a range of effects.

Validation Procedures for Performance Assessments

The central dilemma in validating performance assessments is identified by Moss:

    Performance assessments present a number of validity problems not easily handled with traditional approaches and criteria for validity research. The assessments typically permit students substantial latitude in interpreting, responding to, and perhaps designing tasks; they result in fewer independent responses, each of which is complex, reflecting integration of multiple skills and knowledge; and they require expert judgement for evaluation. Consequently, meeting criteria related to such validity issues as reliability, generalizability, and comparability of assessments — at least as they are typically defined and operationalized — becomes problematic. This results in a tension between traditionally accepted criteria for validity and criteria that derive from concerns about the instructional consequences of assessment, such as “authenticity” (Newmann, 1990), “directness” (Frederiksen and Collins, 1989), or “cognitive complexity” (Linn, et al., 1991), which are commonly invoked when arguments for the value of performance assessments are made. (1992, p. 230)

Moss argues that developments in the philosophy of validity, particularly the inclusion of consequential aspects, provide some theoretical support for performance assessment. The problem that Moss identifies is to find “the appropriate set of criteria and standards to simultaneously support the validity of an assessment-based interpretation and the validity of its impact upon the educational system” (1992, p. 230).
In her examination of the Standards for Educational and Psychological Testing [American Educational Research Association (AERA), American Psychological Association (APA), & National Council on Measurement in Education (NCME), 1985], Moss finds that they “stop short of articulating the centrality of construct validity and the importance of considering the social consequences of any validity effort” (1992, p. 232). Recognizing that the Standards is a political document produced by a group that represents a large and varied constituency, Moss cites Messick’s lament for the missed opportunity to move the measurement field forward in the consideration of construct validity as a unifying force, and the appraisal of value implications and social consequences. The argument that the “consequences of performance assessments are likely to be more beneficial to teaching and learning than the consequences of multiple-choice assessment used alone” is identified by Moss (1992, p. 248) as a unifying theme in the work of Frederiksen and Collins (1989) and Linn, Baker, and Dunbar (1991). Both these sets of authors perceive that

    performance assessments will not fare as well as multiple-choice assessment in terms of traditional criteria, particularly those that emphasize comparability, internal consistency of scores across items and readings and efficiency. (Moss, 1992, p. 249)

The positions and proposals of these authors are presented below.

Frederiksen and Collins discuss the effects of assessment practices upon the educational system in their paper, A Systems Approach to Educational Testing (1989). They address the issue of the validity of educational tests in an educational system in which approaches to curriculum and instruction, and student learning strategies, are modified to produce higher scores. Frederiksen and Collins question the validity of introducing tests into such a dynamic system which adapts itself to the characteristics of the test.
They perceive that a challenge to validity is associated with:

    the instructional changes engendered by the use of the test and whether or not they contribute to the development of knowledge and/or skills that the test purportedly measures. (1989, p. 27)

Frederiksen and Collins introduce the term systemic validity of a test to extend the notion of construct validity to allow for consideration of the effects of instructional change brought about by the introduction of a test into an educational system. Frederiksen and Collins state that:

    A systemically valid test is one that introduces in the educational system curricular and instructional changes that foster the development of cognitive skills that the test is designed to measure. Evidence for systemic validity would be an improvement in those skills after the test has been in place within the educational system for a period of time. (1989, p. 27)

The conditions for systemic validity pose particular problems in the construction of tests, most specifically in accounting for the evolution of instructional strategies and student learning engendered by the use of the test. Frederiksen and Collins believe that attempts to reduce or slow these modifications to instruction and learning are bound to fail, particularly because such an approach would “deny the educational system the ability to capitalize on one of its greatest strengths: to invent, modify, assimilate, and in other ways improve instruction as a result of experience” (1989, p. 28). They conclude that the most appropriate approach is to recognize that various components of the system will react by developing tests that “directly reflect and support the development of the aptitudes and traits they are supposed to measure” (Italics by authors; Frederiksen and Collins, 1989, p. 28).

Frederiksen and Collins identify two characteristics of tests that they believe will facilitate educational improvement.
These are (1) directness of the cognitive skill of interest in the performance of some extended task, and (2) the degree of subjectivity or judgement, analysis and reflection required on the part of the person who assigns the score. Direct assessment, argue Frederiksen and Collins, is systemically valid because “instruction that improves the test score will also have improved performance on the extended task and the expression of the cognitive skill within the task context” (p. 29). The argument in favour of subjective scoring has the prerequisite that scorers understand how to use the scoring categories. This has the prior requirement that scorers must be taught how to use the categories consistently. This condition, Frederiksen and Collins maintain, will lead to the development of training materials that will become a medium for communication of the critical traits that are to be promoted through the use of the assessment. They cite the example of the writing tasks developed for the National Assessment of Educational Progress (NAEP) (Mullis, 1980) as “a particularly good example of this approach...seminal in influencing our thinking” (Frederiksen and Collins, 1989, p. 29).

Frederiksen and Collins identify three crucial considerations that must be addressed in the design of a systemically valid testing system: the components, the standards to be sought in the design, and the methods by which the system seeks to encourage learning. The components of the system that Frederiksen and Collins seek to promote are:

    Set of tasks. The tests should consist of a representative set of tasks that cover the spectrum of knowledge, skills and strategies needed for the activity or domain being tested... The tasks should be authentic, ecologically valid tasks in that they are representative of the ways in which knowledge and skills are used in the real world.

    Primary trait for each task and sub-process. The knowledge and skills used in performing each task may consist of distinct sub-processes...
    Each subprocess must be characterized by a small number of primary traits or characteristics that cover the knowledge and skills necessary to do well in that aspect of the activity. The primary traits chosen should be ones that the test takers should strive to achieve, and thus should be traits that are learnable.

    A library of exemplars. In order to ensure reliability of scoring and learnability, it is important that for each task there be a library of exemplars of all levels of performance for each primary trait assessed in the test. The library should include exemplars representing the different ways to do well (or poorly) with respect to each trait... The library should be accessible to all, and particularly to the testees, so that they can learn to assess their own performance reliably and thus develop goals to strive for in their learning.

    A training system for scoring tests. There are three groups who must learn to score test performance reliably: (a) the administrators of the testing system who develop and maintain the assessment standards (i.e., the master assessors), (b) the coaches in the testing system whose role it is to help test takers to perform better, and (c) the test takers themselves, who must internalize the criteria by which their work is being judged. (p. 30)

In consideration of standards, Frederiksen and Collins identify:

    Directness. ...it is essential that whatever knowledge and skills we want test takers to develop be measured directly.

    Scope. The test should cover, as far as possible, all the knowledge, skills and strategies required to do well in the activity.

    Reliability. ...the most effective way to obtain reliable scoring that fosters learning is to use the primary trait scoring borrowed from the evaluation of writing.

    Transparency. The terms in which the test takers are judged must be clear to them if a test is to be successful in motivating and directing learning.
    In fact, we argue that the test must be transparent enough so that they can assess themselves and others with almost the same reliability as the actual test evaluators achieve. (p. 30)

Methods for fostering improvement on the test are:

    Practice in self-assessment. The test takers should have ample opportunity to practice taking the test and should have coaching to help them assess how well they have done and why...

    Repeated testing. If what is measured by the test is important to learn, then the test should not be taken once and forgotten. It should serve as a beacon to guide future learning.

    Feedback on test performance. Whenever a person takes a test, there should be a “rehash” with a master assessor or teacher. This rehash should emphasize what the testee did well and poorly on, and how performance might be improved.

    Multiple levels of success. There should be various landmarks of success in performance on the test, so that students can strive for higher levels of performance in repeated testing. (p. 31)

The paper Complex, Performance-Based Assessment: Expectations and Validation Criteria written by Linn, Baker, and Dunbar (1991) represents a watershed in the measurement literature, as in it they propose a set of criteria for judgements that might be considered in the validation of performance assessments. What is notable about this paper is that the writers are respected members of the educational measurement community, and that the paper was published in a journal, Educational Researcher, which has a broad readership in North American faculties of education.

Linn, Baker, and Dunbar are concerned that arguments for the use of direct assessments, particularly those which focus upon the fidelity of the assessment tasks to the goals of instruction, place an unhealthy weighting on the face validity of the tasks.
In stating that face validity alone is "not enough" they argue that

evidence must support the interpretations and must demonstrate the technical adequacy of 'authentic' assessments... But what sort of evidence is needed, and by what criteria should these alternatives to current standardized tests be judged? (Linn et al., 1991, p. 16)

Linn, Baker, and Dunbar find that few of the advocates of alternatives to standardized tests have addressed the issue of criteria for evaluating these measures. They consider that a measure derived from actual performance or a simulation does not necessarily offer greater validity than a multiple-choice test. In response to their own statement of concern, Linn, Baker, and Dunbar propose a set of criteria that might be considered in evaluating the technical quality of performance assessments. They consider that these criteria might expand upon the base of "well established psychometric criteria for judging the technical adequacy of measures" (1991, p. 16). However, they caution that

Reliability has been too often overemphasized at the expense of validity; validity has itself been viewed too narrowly. (Linn et al., 1991, p. 16)

In particular, Linn, Baker, and Dunbar review the historical criteria of efficiency, reliability and comparability of assessment from year to year, and speculate that these criteria would almost always favour the traditional multiple-choice assessment tasks in any comparison with the newer alternatives. They suggest that as there is expansion of modes of assessment there should also be an expansion of the criteria used to judge the adequacy of assessments. Linn, Baker, and Dunbar consider that modern views of validity, particularly as enunciated by Messick (1989b), present a theoretical rationale for such an expansion of the criteria. In broadening their view of validity to justify their criteria, Linn, Baker, and Dunbar caution that their "set of proposed criteria is not exhaustive" (1991, p.
16) and defend their position by claiming consistency with current views of validity and potential uses of new forms of assessment. The eight criteria proposed by Linn, Baker, and Dunbar for consideration in evaluating the adequacy of performance assessments are described here.

(1) Consequences — Linn, Baker, and Dunbar believe that the consequential basis of validity has come to the fore at an opportune moment for the advocates of authentic assessment8. They refer to Messick (1989b) and Cronbach (1988) as theoreticians who have

stressed the criticality of giving attention to the consequential basis of validity prior to the recent pleas for authentic assessment

and find that

consequences could rarely be listed among the major criteria by which the technical adequacy of an assessment was evaluated. (Linn et al., 1991, p. 17)

In particular, for performance-based assessments, Linn, Baker, and Dunbar argue that consequences must be given greater consideration:

If performance-based assessments are going to have a chance of realizing the potential that the major proponents of the movement hope for, it will be essential that the consequential basis of validity be given much greater prominence among the criteria that are used in judging assessments. (Linn et al., 1991, p. 17)

Linn, Baker, and Dunbar point out that it cannot be assumed that all the consequences of performance-based modes of assessment will be positive or conducive to learning. They argue that there needs to be higher priority given to broadening the range of evidence collected about the assessment. In particular, the intended and unintended effects upon "the ways that teachers and students spend their time and think about the goals of education" need to be monitored (Linn et al., 1991, p.
17).

(2) Fairness — Linn, Baker, and Dunbar believe that the criterion of fairness must be applied to any assessment, with specific judgements to depend upon the uses and interpretations of the assessment results.

8 The use of the term "authentic" in assessment stems from the work of Archbald and Newmann (1988). These authors have refined the defining features of authentic achievement (Newmann and Archbald, 1992) to include (a) the production of knowledge, (b) disciplined inquiry, and (c) value beyond evaluation. Meyer (1992) argues that certain performance assessment tasks may not be authentic.

They identify the concern of biases against racial or ethnic minorities and cite results from the NAEP writing (assessed by open-ended essays) and reading (assessed mainly by multiple-choice questions) assessments, where differences in achievement between Black and White students were similar for each type of test. Linn, Baker, and Dunbar state that these "gaps in performance among groups exist because of difference in familiarity, exposure, and motivation on the tasks of interest" (1991, p. 18). They advise that there needs to be substantial change in instructional strategies and resource allocation, particularly by providing training and support for teachers, to give students adequate preparation for assessments. In proposing these changes, they question the possibility of the success of such developments because:

validly teaching for success on these assessments is a challenge in itself and pushes the boundaries of what we already know about teaching and learning. Because we have no ready technology to assist on the instructional side, performance gaps may persist. (Linn et al., 1991, p.
18)

These statements are puzzling in that they suggest that Linn, Baker, and Dunbar believe that methods of teaching directed at complex, time-consuming, open-ended assessments should be developed, with a goal of reducing the differences between the performance of Blacks and Whites; they appear to be suggesting that the problem is merely one of finding a "teaching technology". It is strange that Linn, Baker, and Dunbar do not raise the issue of gender-related differences in open-ended assessments. Perhaps this is because the authors have concentrated on assessments in reading and writing, where fewer significant gender differences have been observed compared with mathematics and science.

Linn, Baker, and Dunbar maintain that questions about fairness must also be asked of the procedures for scoring, particularly in the training of evaluators, to ensure consistent and unbiased ratings of performance. The authors examine the use of differential item functioning (DIF) procedures as a possible method of identifying items that may perform differently for minority groups. However, DIF procedures require that several items be used as matching criteria in judging each individual item. This has not been feasible with most performance assessments, as the number of tasks has been quite small. As Linn, Baker, and Dunbar are writing from the perspective of consistency in measurement here, they appear to regret the inevitability of "greater reliance on judgemental reviews of performance tasks" (1991, p. 18).

Linn, Baker, and Dunbar express concern about prior knowledge as a source of bias. They cite reading assessments as being vulnerable, since children who have prior knowledge of the topic tend to show a higher level of comprehension. While this may be the case in assessment of general skills such as reading, it would be of concern in science only to those who espouse "process" assessments.
A similar discussion concludes the section on fairness, where Linn, Baker, and Dunbar cite Miller-Jones (1989) and the use of "functionally equivalent" tasks which would be "specific to the culture and instructional context of the individual being assessed" (Linn et al., 1991, p. 18). Miller-Jones recognizes that it would be "exceedingly difficult" to establish task equivalence. While Linn, Baker, and Dunbar consider that such development would pose a "major challenge", they do not give an opinion as to whether the challenge should be taken up.

(3) Transfer and Generalizability — Generalizability theory (Cronbach, Gleser, Nanda, & Rajaratnam, 1972; Brennan, 1983) is identified by Linn, Baker, and Dunbar as providing "a natural framework for investigating the degree to which performance assessment results can be generalized" (1991, pp. 18-19). Generalizability theory is used to compare the magnitudes of variability for different factors of interest. Linn, Baker, and Dunbar consider that variability due to raters and sampling of tasks is the minimum to be considered. Results from direct writing assessments (e.g., Hieronymus and Hoover, 1987) and hands-on performance tasks in science (e.g., Baxter, Shavelson, Goldman, & Pine, 1990) show that the variance component for the sampling of tasks tends to be much greater than that for the sampling of raters. Linn, Baker, and Dunbar consider that this finding supports the research in learning and cognition that emphasizes the situation- and context-specific nature of thinking (Greeno, 1989). The authors consider that the limited degree of generalizability across tasks must necessarily lead either to increasing the number of tasks or to the use of a matrix sampling approach.

Linn, Baker, and Dunbar argue that the traditional view of reliability is subsumed by this transfer and generalizability criterion.
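The variance-component comparison that generalizability theory performs can be made concrete with a small computational sketch. The code below is illustrative only: the function names are my own, and the design is simplified to a single facet (persons crossed with tasks, one score per cell), rather than the persons-by-tasks-by-raters designs used in the studies cited above. It estimates the person, task, and residual variance components from the ANOVA mean squares, and computes the relative generalizability coefficient for a score averaged over a given number of tasks.

```python
import numpy as np

def g_study_components(scores):
    """Estimate variance components for a fully crossed
    persons x tasks design with one observation per cell,
    using ANOVA expected mean squares.
    scores: 2-D array, rows = persons, columns = tasks."""
    n_p, n_t = scores.shape
    grand = scores.mean()
    person_means = scores.mean(axis=1)
    task_means = scores.mean(axis=0)

    ss_p = n_t * ((person_means - grand) ** 2).sum()
    ss_t = n_p * ((task_means - grand) ** 2).sum()
    ss_res = ((scores - grand) ** 2).sum() - ss_p - ss_t  # p x t interaction + error

    ms_p = ss_p / (n_p - 1)
    ms_t = ss_t / (n_t - 1)
    ms_res = ss_res / ((n_p - 1) * (n_t - 1))

    var_res = ms_res                          # sigma^2(pt,e)
    var_p = max((ms_p - ms_res) / n_t, 0.0)   # sigma^2(persons)
    var_t = max((ms_t - ms_res) / n_p, 0.0)   # sigma^2(tasks)
    return var_p, var_t, var_res

def g_coefficient(var_p, var_res, n_tasks):
    """Relative generalizability coefficient for a mean over n_tasks."""
    return var_p / (var_p + var_res / n_tasks)
```

In this framework, a large task component relative to the rater component (the pattern reported for hands-on science tasks) implies that adding tasks, not raters, is the effective way to raise the generalizability coefficient toward an acceptable level.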
They identify a need for expansion of the traditional view of reliability:

Consistency from one part of a test to another or from one form to another is insufficient. Whether conclusions about educational quality are based on scores on fixed response tests or ratings of performance on written essays, laboratory experiments, or portfolios of student work, the generalization from the specific assessment tasks to the broader domain of achievement needs to be justified. (1991, p. 19)

Linn, Baker, and Dunbar continue by calling for evidence that abilities demonstrated in specific modes of assessment are transferable to solving real-world problems. In their view, this should be a requirement of all assessments, multiple-choice as well as the newer alternative modes of assessment.

(4) Cognitive Complexity — According to Linn, Baker, and Dunbar this is a worthwhile criterion and should be used in judging all forms of assessment. They believe that many assumptions must be challenged, particularly in hands-on science and mathematics. For example, just because an assessment in science is hands-on, it does not necessarily incorporate problem-solving skills or more complex mental models. Similarly, in mathematics they believe that complex open-ended problems do not always entail the use of complex cognitive processes. They declare that:

Judgements regarding the cognitive complexity of an assessment need to start with an analysis of the task; they also need to take into account student familiarity with the problems and the ways in which students attempt to solve them. (Linn et al., 1991, p.
19)

It is interesting that Linn, Baker, and Dunbar recognize the importance of examining the student responses to questions in the analysis of cognitive complexity, rather than leaving the assessment of complexity solely to the inferences of "experts".

(5) Content Quality — This criterion specifies that the content of an assessment "be consistent with the best current understandings of the field" (Linn et al., 1991, p. 19). In addition, these authors consider it important that "tasks selected to measure a given content domain should themselves be worthy of the time and effort of students and raters" (p. 19). Linn, Baker, and Dunbar emphasize the value of involving subject matter "experts" in the design and review of tasks, particularly in the analysis of the quality of content knowledge displayed in various student responses.

(6) Content Coverage — Linn, Baker, and Dunbar appear to put lower priority upon content coverage. They describe it as "another potential criterion of interest" and state that:

Performance assessment recognizes the importance of process sampling, giving it primacy over traditional content sampling. But breadth of coverage should not be overlooked. (1991, p. 20)

These comments do not appear to have been directed towards any specific area of assessment. While these assertions might be accepted without too much debate in assessments of written comprehension or reading, in science there has been much dispute over the focus of performance assessment. Linn, Baker, and Dunbar observe that one of the consequences of gaps in content coverage is the under-emphasis of parts of the content that are excluded from the assessment. They believe that this is an area where there may have to be some trade-off with other criteria. In fact, they suggest that this may be a place where "traditional tests appear to have an advantage over more elaborate performance assessments" (Linn et al., 1991, p.
20).

(7) Meaningfulness — This criterion addresses the issue of provision of more worthwhile educational experiences. Linn, Baker, and Dunbar consider that some "investigation of student and teacher understandings of performance assessments, and their reactions to them would provide more systematic information relevant to this criterion" (1991, p. 20). They speculate that low performance on NAEP assessments might be related to students' perceptions of lack of meaning in the test situation.

(8) Cost and Efficiency — Linn, Baker, and Dunbar compare labor-intensive performance assessments to paper-and-pencil, multiple-choice tests. They argue that greater attention must be given to the "development of efficient data collection and scoring procedures" for performance assessments (Linn et al., 1991, p. 20).

Linn, Baker, and Dunbar conclude with a reminder that their eight criteria are not an exhaustive set. Their key point is that "the traditional criteria need to be expanded to make the practice of validation more adequately reflect theoretical concepts of validity" (Linn et al., 1991, p. 20). In addition, they assert that this perspective is not a "theoretical nicety" but a way of identifying suitable criteria to make judgements about the relative merits of newly developed alternative assessments. To those who might accuse them of "stacking the deck in favor of alternative assessments over traditional fixed-response tests", Linn, Baker, and Dunbar respond that:

the issue is not which form of assessment may be favored by a particular criterion, however. Rather, it is the appropriateness and importance of the criteria for the purposes to which the assessments are put and interpretations that are made of the results. (1991, p.
20)

Linn, Baker, and Dunbar consider it essential that the evolving conceptions of validity should be dominant in any argument about traditional and alternative forms of assessment, particularly in the context of the "fundamental purpose of measurement — the improvement of instruction and learning" (1991, p. 20).

Another group of authors, Lane, Parke, and Moskal in The Principles for Developing Performance Assessments (1992), refer to Messick's work as fundamental in their conceptualization of validity. They cite Frederiksen and Collins (1989), Glaser (in press), Linn, Baker, and Dunbar (1991) and Mehrens (1992) as references for establishing criteria to ensure reliable and valid performance assessments. The Principles were developed in the context of the QUASAR (Quantitative Understanding: Amplifying Student Achievement and Reasoning) project, an American instructional program based at the University of Pittsburgh that attempts to promote the acquisition of thinking and reasoning skills in mathematics. Lane and her co-authors suggest that their principles "may be extended to other types of performance assessments and can be used with other subject matters" (1992, p. 2). Twenty-six principles are grouped within the following categories:

construct specification
content representation and relevancy
specification of task format and task representation
task wording and directions
task fairness
specifications for the scoring rubrics

These categories, and the attendant principles, are intended for sequential use in the planning and implementation of performance assessments. While Lane, Parke, and Moskal claim that these principles are "discussed in relation to current conceptualizations of validity" (1992, p. 3), there is a paucity of explicit discussion in this paper about the consequential aspects of validity.

The chapter "Evaluating Test Validity" by Shepard in Review of Research in Education (1993) also builds upon the work of Messick (1989a; 1989b).
Shepard covers the historical aspects of validity in the first section, and uses the second to reconfirm "construct validity as the whole of validity" (Shepard, 1993, p. 405). The third section is devoted to an analysis of Messick's unified theory, followed by Shepard's reformulation of Messick's theory in terms of evaluation arguments. Shepard presents specific examples and concludes the chapter with a discussion of the implications for the development of the next set of standards for educational and psychological testing9. The pertinent sections for this review are Shepard's reformulation of validation in terms of an evaluation argument, and her discussion of the potential for revision of the standards.

9 The planning for these standards has started with the formation of a joint committee of the NCME, AERA and APA. The draft standards are to be reviewed during 1995.

Shepard considers that Messick's presentation of validity in the four-fold table10 — shown as Figure 2 below — has the potential for a "new segmentation of validity requirements" (1993, p. 426).

                       TEST INTERPRETATION    TEST USE
  EVIDENTIAL BASIS     Construct validity     Construct validity + Relevance/utility
  CONSEQUENTIAL BASIS  Value implications     Social consequences

Figure 2. Messick's Facets of Validity Framework (Messick, 1989b)

Shepard expresses three concerns about this schema:

(1) The faceted presentation allows the impression that values are distinct from a scientific evaluation of test score meaning.

(2) By locating construct validity in the first cell and then reinvoking it in subsequent cells, it is not clear whether the term names the whole or the part. I argue for the larger, more encompassing view of construct validity.

(3) The complexity of Messick's analysis does not identify which validity questions are essential to support test use. (Shepard, 1993, pp.
426-7)

Shepard believes that the use of positivistic principles, where facts and values are separated, is "no longer defensible in contemporary philosophy of science" (1993, p. 427). While Shepard recognizes Messick's acknowledgment that "scientific observations are theory-laden and theories are value-laden" (Messick, 1989b, p. 62), she argues that Messick lapses on his next page "into discourse that separates value implications from substantive or trait implications". Shepard states that this:

plays into the hands of researchers who deny that their construct definition or predictive equation follows from value choices. This should not be read to mean that scientific facts are indistinguishable from value judgements. Although scientific inquiry is distinct from politics and moral philosophy at the extremes, the concern here is with value perspectives that are entwined with scientific investigations. (1993, p. 427)

10 This is the table used in the Handbook of Educational Measurement. I chose to use the similar table from the Educational Researcher article earlier in this study.

Shepard also voices concern with Messick's sequential arrangement of cells, which leads to an apparent segmentation of validity. While Messick describes his framework as a "progressive matrix" with construct validity appearing in every cell, and something more to be added in each subsequent cell, Shepard identifies a significant problem with this approach:

Messick has implicitly equated construct validity with a narrow definition of score meaning, whereas I would equate it with the full set of demands implied by all four cells, which all involve score meaning. Intended effects entertained in the last cell are integrally a part of test meaning in applied contexts. (1993, pp. 427-8)

Shepard clarifies that her disagreement with Messick relates more to issues in communication than to divergent perspectives about validity.
Shepard emphasizes that it is vitally important to communicate issues that have arisen to those involved in the theory and practice of measurement. She refers to how Cole and Moss (1989) had earlier used the term validity to refer only to the interpretive component of a framework but:

since then, we have expanded our definition of validity to include the consequential component, in part, because we were concerned that excluding consideration of consequences from the definition of validity risks diminishing its importance. (Moss, 1992, p. 235)

Shepard goes on to look at the operational level of planning and conducting validity evaluations. She argues that Messick's sequential approach "may misdirect the conceptualization of theoretical frameworks intended to guide validity evaluations" (Shepard, 1993, p. 429). Shepard contends that "measurement specialists need a more straightforward means to prioritize validity questions" and argues that the current Standards (AERA, APA, & NCME, 1985) provide little help. The absence of a coherent conceptual framework in which to organize validation does not help the standards to:

answer the question "How much evidence is enough?" nor do they clarify that the stringency of evidential demands should vary as a function of potential consequences. (Shepard, 1993, p. 429)

In response to this problem, Shepard proposes that:

validity evaluations be organized in response to the question "What does the testing practice claim to do?" Additional questions are implied: What are the arguments for and against the intended use of the test? and What does the test do in the system other than what it claims, for good or bad? All of Messick's issues should be sorted through at once, with consequences as equal contenders alongside domain representativeness as candidates for what must be assessed in order to defend test use. (1993, pp. 429-30)

Shepard credits Cronbach (1988, 1989) for proposing that validation should be considered as an evaluation argument.
In this process, the evaluator identifies relevant questions for intensive research, justifying this particular set in terms of the "prior uncertainty, information yield, cost and leverage"11 (Shepard, 1993, p. 430). The process is taken further by Kane (1992), who conceptualizes validation as an interpretive argument. Shepard paraphrases Kane's criteria for evaluating the argument as:

(a) The argument must be clearly stated so that we know what is being claimed; (b) the argument must be coherent in the sense that conclusions follow reasonably from assumptions; and (c) assumptions should be plausible or supported by evidence, which includes investigating plausible counterarguments. (Shepard, 1993, pp. 430-31)

11 Shepard identifies "leverage" as referring to the importance of the study information in achieving consensus about test use in the relevant audience.

Thus by setting out assumptions it is possible to identify specific areas of study and the types of evidence that are needed to support specific hypotheses.

Shepard concludes her chapter by confirming that "construct validation is the one unifying and overarching framework for conceptualizing validity evaluations" (1993, p. 443). Both analysis of test content (content validity or representation) and empirical confirmation of hypothesized relationships (criterion validity) are essential but not sufficient for Shepard in her approach to construct validation. She insists that there must be a conceptual framework which:

portrays the theoretical relationships believed to connect the test responses to a domain of performance and desired ends implied by the intended test use. In all but rarefied research contexts, test uses have intended consequences that are an essential part of the validity framework. Given that theory testing must also include empirical evaluation of the most compelling rival hypotheses, construct validation entails a search for both alternative meaning and unintended consequences as well. (Shepard, 1993, p.
443)

Shepard encapsulates these developments by considering a shift in the key validity question from "Does the test measure what it purports to measure?" to "Does the test do what it claims to do?" (Shepard, 1993, p. 444). The metaphors that Shepard uses illustrate the change in emphasis — from "truth in labelling" to searching for both side effects and intended effects of newly developed drugs. Too many "side effects" would necessarily lead to a decision to remove the test from that specific application. Shepard's final paragraph contains a reminder that there is no universality in time or location for test use; therefore all validation studies must be considered context-specific. Shepard suggests that test developers should:

be able to specify the evaluation argument for the test use they intend and gather the necessary evidence, paying close attention to the competing interpretations and potential side effects. (1993, p. 445)

In developing an evaluation argument for performance assessment in science, the choice of construct labels is an important factor, and will be the focus of the next part of this literature review.

PROCESSES OF SCIENCE

My focus in this study is the validation of hands-on performance assessments in the context of school science. In arguing that performance assessment in science poses special challenges, specific concerns arise from the conceptual framework implied by the use of "processes of science" as an organizing theme. The construct labels that are used as a result of this framework have had a significant impact on the design of curriculum and assessment in the English-speaking world. The "processes of science" issue must play a significant role in any discussion of hands-on assessment of student performance in science. Donnelly and Gott examine the impact of this approach on curriculum theory and development, and on large-scale curriculum evaluation and student assessment.
They conclude that:

process specification, the cognitive and epistemological status which has been ascribed to them, and their place in the formal structures which have been used to describe the science curriculum, have generally remained problematic. (1985, p. 237)

In an attempt to resolve some of these problems, Millar and Driver (1987) consider many of the pertinent issues in a paper entitled Beyond Processes. They start by tracing the recent use of the term 'process' back to the work of Gagné, and his influence upon the course Science - A Process Approach (1965). Gagné argues that scientific concepts and principles are acquired through the operation of basic scientific processes (such as observing, classifying and inferring) and integrated processes (such as formulating hypotheses, interpreting data, etc.). Gagné believes that processes are skills used by all scientists, and can be learned by students; they are transferable across content domains. Donnelly and Gott are concerned about the lack of attention "devoted to their [processes] definition, interrelationship or justification" (1985, p. 237). Millar and Driver address these concerns in Beyond Processes (1987), where they discuss the apparent separation of content and process in science education. A teaching approach that stresses "content" is seen as transmissive and passive, whereas stress on "process" is seen as more active and progressive. They support Black, who argues that:

the content-process separation is artificial. One does not observe except through selective attention, guided by one's aims and one's theories. One cannot tackle even the simplest experimental problem without some model of the working of the system under investigation. (Black, 1986, p. 672)

Millar and Driver present a multi-dimensional critique of the claims for process science. They start with an examination from the perspective of philosophy of science, where the processes of science are seen as the methods of science, and of scientists.
Feyerabend (1975), perhaps the most uncompromising of modern philosophers writing about methods, is cited as providing the argument for "the underlying point that there is no algorithm for gaining or validating scientific evidence" (Millar and Driver, 1987, p. 40). Donnelly and Gott (1985) suggest that conceptualizing science in terms of the activities of professional scientists is a top-down approach, providing processes that are "vague and unoperationalized" (p. 239).

Millar and Driver continue their critique from the perspective of cognitive psychology. The view that learners are actively constructing personal knowledge has a relatively long history; Millar and Driver (1987) identify the work of Piaget, Bruner and Kelly as examples of this approach. Millar and Driver indicate that contemporary perspectives on cognition (e.g., Bereiter, 1985; Resnick, 1983)

reject the view that learning is a one-way process whereby the learner receives and organizes stimuli from an external world. Instead learning is viewed as an active process in which the learner brings prior sets of ideas, schemes or internal mental representations to any interaction with the environment. (Millar and Driver, 1987, p.
45)

This view gains support from the "alternative conceptions" literature (Driver, Guesne, and Tiberghien, 1985; Osborne and Freyberg, 1985): the mental representations that learners bring to a situation enable them to make predictions and conceptualize physical phenomena in ways that are internally consistent, but are not accepted as "science".

In examining the impact of a process approach on pedagogy, Millar and Driver are concerned about the implicit expectation of transferability:

The idea of teaching the processes of science implies both that general cognitive skills can be transferred from the specific area and context within which they are taught to other analogous contexts, and that instruction can foster the learner's progress or development in deploying these skills. (Millar and Driver, 1987, p. 51)

Millar and Driver doubt both that such content-independent processes can be taught and that transfer will occur to new situations. They posit that learners tend to understand new situations through reasoning by analogy with other previously encountered situations, rather than by any general procedures. Such use of analogy or modeling, almost through trial and error, enables learners to transfer and generalize from one context to another. Millar and Driver also question the possibility of a "progressive scale of difficulty or of the pattern of stages of children's performance" (1987, p. 53) and express doubt that such a scale could be adequately described. In comparison, analysis of logical structures of subject matter has led to "an empirical base of knowledge about the development of patterns of children's growth in conceptual understanding" (Millar and Driver, 1987, p. 53). Donnelly and Gott (1985, p.
239) also examine the "strength of arguments for generalized processes when confronted with the authority of the discipline-based curriculum" and find them to be weak; they believe that this weakness should be acknowledged in order that science processes coexist with the traditional science discipline in secondary curricula12.

12 Donnelly and Gott offer their working definition of 'science processes':
classes of tasks undertaken by pupils which
• can be identified across a wide range of disciplinary areas; and
• can be systematically connected with a specifically scientific epistemology, which is analytical, manipulative and materialist. Its proximate results are variable-based descriptions of phenomena (data) and the establishment of functional relationships between variables. (1985, p. 239)

Processes in Hands-on Science Assessment

Donnelly and Gott (1985) offer two main explanations why science processes have become such a major feature of large scale assessments in science:

the objective of assessment in science, which compels a search for a unifying aspect of science disciplines; and

the need for assessment to extend across pupils and institutions, and thus to minimize the impact of curriculum-based variables. This assumes that curricula are largely based on traditional content. (p. 240)

These reasons ring particularly true as the struggle to present a unified face for science from primary to secondary grades continues in England with the National Curriculum (Department of Education and Science and the Welsh Office, 1991) and in the U.S.A. with the proposed National Science Education Standards (National Committee on Science Education Standards and Assessment {NCSECA}, 1993). Both of these national projects have an embedded emphasis upon the processes (or skills) of science. In England this is identified as Science Attainment Target 1 - Scientific Investigation, while in the U.S.A.
thisis seen as a set of inquiry skills that students “should be able to demonstrate in a newexperiment” (NCSECA, 1993; p. 56). As there are also content dimensions for each ofthese national projects, neither can be considered set exclusively in a “science processes”mode. However, the Techniques for the Assessment of Practical Skills in the FoundationSciences (TAPS) Project (Bryce, McCall, MacGregor, Robertson, & Weston, 1984) wasso conceived. Bryce and Robertson (1985) describe the focus of TAPS as “the non-trivialaspects of practical skills as they manifest themselves in the classroom” and state that theskills are “conceptualized in accord with the ways in which teachers think and act in thelaboratory”. It is of interest that these authors focus on the teacher, rather than on thestudents who will perform the tasks, reflecting their interest in enabling teachers to feel thatthe assessment scheme supports their work. The framework used in the TAPS scheme isbased upon the work of Whittaker (1974) and Swain (1974). The TAPS team has beenextremely thorough in the definition of skills, and has developed what is called “a step-upapproach”. Basic skills (e.g., observation, recording and measurement) are assessed withtasks from a bank of over 300 items. The higher level skills (e.g., inference and selectionof procedures) constitute the second level, with students performing investigations as thepinnacle of achievement. The TAPS team is explicit that students should start at the basiclevel, and work through a hierarchy of process tasks until they are capable of attemptinginvestigations (Bryce et al., 1984).This “step-up” theory of “science processes” in assessment has been stronglycriticized by Woolnough and Ailsop (1985), Woolnough (1989, 1991), Millar and Driver- 42 -(1987) and Milar (1989, 1991). Woolnough argues against a reductionist approach with:a tightly prescriptive structure for science teaching which has reduced itselfto a series of small component parts. 
Though not claiming each of these parts, in itself, is scientifically significant, the implied message is that when they are ultimately put together by the pupil they will produce competence in the 'scientific method'. (1989, p. 41)

Woolnough is concerned that using discrete processes in assessment will lead to separate teaching of different skills. Woolnough values complete investigations in science and argues for a holistic approach where students "learn to do investigations by actually doing scientific investigations, simple ones at first but complete investigations none the less, becoming more sophisticated as confidence and experience increase" (Woolnough, 1989, p. 43). Woolnough wrote his chapter before the publication of the National Curriculum for Science in December 1988 (Department of Education and Science and the Welsh Office). In a footnote, Woolnough rejoices at the holistic approach that is presented for "Attainment Target 1: Exploration of Science" in the 1988 curriculum. But in 1991 a revised curriculum was published, and "Attainment Target 1: Scientific Investigation" appears to have been re-aligned with a step-up approach.

Millar and Driver (1987) are particularly critical of the process approach to assessment. They identify two reasons for concern: (1) the influence of content and context upon student scores, and (2) the problem of identifying the processes used from the written outcome, as tasks do not necessarily predetermine the strategies that students use. This latter point is significant, particularly with respect to complex tasks. The requirement that tasks be uni-dimensional is an important aspect of the TAPS program (Kempa, 1986). Bryce and Robertson (1985, p. 18) describe as "purifying" the modification of tasks to ensure that they focus upon only one specific objective. The TAPS program in Scotland was developed in parallel with the Assessment of Performance Unit (APU) for science in England and Wales.
The work of the APU has been the more influential of these two British-based assessment projects and deserves detailed examination.

The Assessment of Performance Unit for science (Johnson, 1989) examined the performance in science of English and Welsh students at age 11 (Russell, Black, Harlen, Johnson, & Palacio, 1988), age 13 (Schofield, Black, Bell, Johnson, Murphy, Qualter, & Russell, 1988) and age 15 (Archenhold, Bell, Donnelly, Johnson, & Welford, 1988). The APU work provided a significant foundation for the first attempt at practical-based assessment in North America, the National Assessment of Educational Progress (NAEP) project A Pilot Study of Higher Order Thinking Skills Assessment Techniques in Science and Mathematics (Blumberg, Epstein, MacDonald, & Mullis, 1986). APU-derived tasks appeared in the New York State Grade 4 performance tasks in 1989 (New York State Department of Education), which in turn influenced the design of the fifth-grade tasks used by the California State Department of Education in 1990.

Johnson (1989) reports that the APU inherited a process-based framework from a Science Working Group, although it was acknowledged that "processes are rarely deployed without the simultaneous use of knowledge and conceptual understanding" (Johnson, 1989, p. 7). The first task of the APU Science Monitoring Team13 was to develop an appropriate approach to the issue of process-content interdependence. As England had no national curriculum at that time, content lists were developed and refined by consultation with teachers. The content was grouped into the three disciplines of biology, chemistry and physics.
Process skills are represented by a framework of Science Activity Categories, each considered to portray an identifiably different activity.

13 The APU teams were based at two centres, at King's College, London (formerly Chelsea College) for ages 11 and 13, and at the University of Leeds for age 15.

The categories are:

Using symbolic representations
Use of apparatus and measuring instruments
Using observations
Interpretation and application
Design of investigations
Performing investigations

Johnson (1989, p. 10) reminds us that "it has never been claimed that these Categories are mutually exclusive" and that there is clearly some overlap in terms of skills and abilities. The Monitoring Teams chose to use a practical mode of assessment for "use of apparatus and measuring instruments", "using observations", and "performing investigations", with pencil and paper tests for the other categories. Circuses of experiments were chosen for both "use of apparatus and measuring instruments" and "using observations"14. "Performing investigations" was structured so that a student would spend an extended period of time working through a single investigation. The tasks and the procedures developed by the APU make up perhaps the richest legacy for those who follow in the design and implementation of performance assessment in science. Welford, Harlen and Schofield describe how tasks were chosen for the "using observations" circus:

In observation, between 40-50 questions are selected at random from the bank for each survey at each age. These are divided into two or three circuses of experiments, each administered separately. Between 12 and 20 experiments are distributed among eight or nine stations in the laboratory or classroom being used. Time is the basis for distributing questions among the stations. Extensive question trialling has resulted in tasks which require two, four, six, or eight minutes, and so questions are grouped to allow completion of a station within eight minutes. (1985, p. 25)

14 The circus approach involves students working through a set of as many as 8 or 9 different experiments, called stations, one after another. The name circus was given because the students move around, generally in some kind of circular pattern.

This large number of questions, and a similar set for "use of apparatus and measuring instruments", enabled the APU to collect annual data from 1980 through 1984. The APU produced 15 extensive reports, and 11 short reports for teachers.

While the circus approach to student practical work was a regular feature in the practice of science education in England, open-ended investigations most certainly were not. Gott and Murphy, in Assessing Investigations at Ages 13 and 15 (1987), wrote of the position of processes in school science:

In general, little emphasis is placed on what have become labelled the 'processes' of scientific enquiry; planning, interpreting and so on. These 'processes' are seen as serving the ends of concept acquisition rather than being of intrinsic value in themselves. This is not to argue that such 'processes' are absent from this view of science: rather it is to suggest that they are often unacknowledged and, where present, are there because of the style of teaching rather than explicit aims of the course.

Assessment, not unnaturally, has similar aims. It is concerned with pupils' ability to explain phenomena using their conceptual knowledge and understanding. (p. 6)

Gott and Murphy go on to describe courses designed to develop the processes of science. They warn that:

The danger in such approaches lies in the degree to which the emphasis on concept in the one case or 'process' in the other effectively underplays or even denies the significance of the other. (p. 6)

Investigations were seen by the APU monitoring teams as catering deliberately to the interaction between process and concept by presenting the students with practically based problems.
As much of the language used by teachers and those concerned with assessment has both everyday and technical meaning, the APU chose to be explicit and defined a problem as:

a task for which the pupil cannot immediately see an answer or recall a routine method for finding it. (Gott and Murphy, 1987, p. 7)

The set of problems that was developed by the APU was by necessity different from the traditional exercises that students meet in school science. These investigations were designed so that "science concepts and procedures are essential elements of any solution" (Gott and Murphy, 1987, p. 8). Gott and Murphy report that a descriptive framework for the categorization of problems emerged as the trials proceeded. They point out that these descriptions are based upon:

what an assessor would consider to be important and does not reflect either the pupils' perception of the problems or the knowledge they, individually, use to solve them. (Gott and Murphy, 1987, pp. 8-9)

The elements of the APU investigation framework are purpose, nature, content and context. The purpose of a problem is seen as a statement of what the student is asked to do. Three types of purpose are identified: "Decide which..." problems, which may lead to students designing investigations to make a choice between competing products; "Find a way to..." problems, which ask students to find a way to extend the use of apparatus beyond its usual use; and "Find the effect of..." problems, which are usually set in a more scientific context and are concerned with the effect of changing one or more variables. The nature of a task is identified as the arrangement of apparatus, and the development of a single strategy or a series of interconnected investigations.
The content and context are similar in that either may be scientific (i.e., likely to have been encountered only in the school laboratory) or may be everyday (e.g., concerned with the absorbing properties of paper towels).

These elements suggest limits upon the conceptual understanding and the procedural understanding that might be expected or applied in any investigation. Conceptual understanding is seen as a "loose division of concepts into taught science and everyday science, or some combination of these" (Gott and Murphy, 1987, p. 12). In addition, the APU perceives that "conceptual understanding is vital in the identification of the key variables in a problem", particularly in controlling variables for a fair test (Gott and Murphy, 1987, p. 12). Procedural understanding is identified as "strategies of scientific enquiry such as will occur in a many variable problem" (Gott and Murphy, 1987, p. 12). The specific actions that constitute a scientific procedure, according to the APU, are shown in Figure 3.

- defining the status of variables as:
  independent
  dependent
  or control
- systematically varying the independent variable,
- developing a measurement strategy for the dependent variable,
- controlling variables by choice of apparatus,
- choosing an appropriate scale for the quantities of variables. This will require a consideration of the match between quantities and the measurement instruments available,
- developing a strategy for sifting complex data involving many variables,
- transforming data from one form to another.

During the course of the investigation a variety of intellectual and practical skills may also be called upon:

setting up apparatus
reading instruments
recording data
interpreting data
evaluating the data obtained or the methods used.

Figure 3. APU Procedures for Scientific Enquiry (Gott and Murphy, 1987, p. 15)

The APU used this framework to generate and explore problems (Gott and Murphy, 1987). The approach developed by the APU for assessment of student performance in investigations involves a trained observer completing a checklist developed previously by the monitoring team. The checklist enables the observing teachers and the monitoring team to reconstruct the investigation in the manner the student performed it. Data collected from the checklists enable the APU to provide aggregate results to describe student performance. The APU developed holistic scales of performance with:

criteria based on a scientist's view of what is or is not an appropriate experiment. We have not attempted to say, for instance, that a particular experiment is good for that age and ability of pupil but rather to say — this is what a scientist would do — how does a pupil match up to this 'expert'. (Gott and Murphy, 1987, p. 33)

The researchers report that:

The data collected so far suggest that many pupils can make a very creditable attempt at investigations which emphasize the 'processes' and 'procedures of science'. (Gott and Murphy, 1987, p. 40)

Compared with traditional approaches to assessment using written questions, where the results uniformly indicate low levels of performance, the "higher success rate of the investigative problem must come as something of a surprise" (Gott and Murphy, 1987, p. 40). Investigations such as these were not commonplace in secondary schools, yet students appeared to approach the tasks with confidence and success. With regard to age-related differences in performance, Gott and Murphy (1987) report the somewhat unexpected finding that pupils:

do not seem to improve significantly between the ages of 13 and 15, a finding which suggests either that pupils are already performing to their limit at age 13 or that the science of years 4 and 5 is doing little to assist their problem-solving capacity, at least on this type of problem15. (p. 40)

15 The English system of the time identified years 4 and 5 as students in their 4th and 5th years of secondary school, equivalent to North American grades 9 and 10.

Gott and Murphy suggest five possibilities which may explain these findings:

1. The reduced writing load: it was found that the sub-sample of pupils who were asked to 'write up' their experiment did so in a fashion which conveyed much of what they had done. Students were capable of writing the report, if asked to do so.

2. The practical nature of the task presents many more clues which assist in the perception of the problem; it is the failure to grasp the problem in the first place which is the difficulty. Gott and Murphy describe the differences in performance on prose, pictorial and practical versions of the paper towel investigation16. In the prose version, pupils are presented with a written stem and no description of the possible apparatus; the pictorial version presents the pupils with a picture of possible apparatus and an identical stem; and the practical version has a complete set of equipment for the pupils to use in working through the experiment, along with the written question stem. In the APU study, the percentage of students performing at the highest level was lowest for prose (12%), higher for pictorial (24%) and highest (43%) for practical versions. At the lowest level of performance the percentage for prose was 71%, pictorial 51%, and 23% for the practical version. This pattern of data was interpreted as indicating that "practical tasks allow more pupils to see the problem in its entirety rather than as a series of disconnected activities". Gott and Murphy make the suggestion that "interaction with the apparatus during the experiment is an important facet of practical work" (Emphasis in original. p. 45).

3. The practical nature of the task provides opportunity for reconsideration and refinement of the experiment, an interaction which permits pupils to identify the necessary stages.
Gott and Murphy describe how the APU broke up investigations into component parts for prose, pictorial and practical versions. The performance of students was generally lower in the versions which required extended responses. Gott and Murphy argue that:

practical tasks (and to a lesser extent extended response questions) not only allow pupils to perceive the questions as a whole, but also give some measure of feedback as they attempt to put together a logical sequence of activities that will lead to a solution. This feedback is either very abstract and self generated or, more likely, non-existent in the structured questions. (1987, p. 49)

16 In this investigation pupils are given three different kinds of paper towel and asked to "Describe an experiment you could do to find out which kind of paper will hold the most water" (Gott and Murphy, 1987, p. 43).

4. The tasks require very little in the way of taught science concepts; their absence is responsible for the improved performance: Gott and Murphy point to detailed analysis of student performance over several different investigations where conceptual understanding, either in the form of conceptualization of the problem or appreciation of the effects of specific variables, is seen as a limiting or enabling factor. Gott and Murphy conclude that:

the activities that we have labelled problem-solving in this booklet are characterized by the mutual dependency of scientific procedural and conceptual understanding. From this viewpoint the pupils are regarded as having to access the pool of concept and knowledge available to them in order to first conceptualize the problem. A procedural strategy has then to be developed but the pupils' particular conceptualization of the problem is the link which determines the procedures that are understood to be appropriate in the problem situation.
In practical activity the pupils can both refine their procedural strategy and their conceptualization of the problem by evaluating their own procedures and their outcomes in situ. (Emphasis in original. 1987, p. 51)

5. The everyday context in which the tasks are set encourages more pupils to "have a go". The APU considered this to be an important issue and ran a special study in which student performance was scrutinized across a range of investigations (10 for most students). Examination of these results indicates that "questions set in a scientific context can inhibit some pupils' performance" (Gott and Murphy, 1987, p. 51). The authors argue that the effect appears to be linked with "pupils' belief in their own incompetence in a specific topic area". Inconsistency in pupil performance may be explained in part by pupils perceiving:

alternative problems which appear to be loaded with the concepts they associate with the content: they can therefore decide that they do not 'know' how to solve such a problem. (Gott and Murphy, 1987, p. 51)

One effect of setting the problem in an everyday context was to lead "some pupils to decide an 'everyday' answer is all that is required" (Gott and Murphy, 1987, p. 51). A corollary to this is a cause for concern:

The idea that we should only behave scientifically when the task looks scientific should give us cause to reflect on the nature of science teaching. If we only ever carry out investigations related to the concepts of science using apparatus that is never encountered outside the laboratory, there is perhaps little wonder that pupils assume that scientific methods can only be applied in such circumstances. (Gott and Murphy, 1987, p. 51)

This issue of context was identified as a focus for a subsequent phase of APU Science research, which was unfortunately cancelled when government funding for the project was terminated.

By 1987, the educational climate in England and Wales had become relatively unsettled.
The four years which led to the 1988 introduction of the General Certificate of Secondary Education (GCSE) were followed by lobbying leading to publication of the National Curriculum for Science (Department of Education and Science and the Welsh Office, 1988). In their analysis of the implications of the APU results, Gott and Murphy refer to the significance of using the view of science adopted by the APU to guide an assessment program, particularly as "the ability to plan and perform investigations is an important part of science education for all pupils" (Gott and Murphy, 1987, p. 52). Gott and Murphy consider that accepting such a value leads to a need for clarification of purpose, and types of investigation, together with a formalization of the role of procedural understanding in science education. They perceive that:

the interaction of procedure and concept inherent in such problem-solving activities is a useful addition to current courses and will allow pupils to demonstrate their conceptual understanding in a situation which does not rely solely on explanation. (Gott and Murphy, 1987, p. 53)

My examination of the APU's work has concentrated upon the role of conceptual and procedural knowledge in investigations. The Performance of Investigations was identified by the APU as the "putting together of the component activities within science performance, hence an overlap with other categories was assumed" (Murphy, 1989, p. 149). For the other categories, assessed by stations or written questions, there was a self-imposed requirement to identify a unique category for each question:

in establishing category definitions questions were written to fit one category only.
The degree of fit was established during the external validation exercise carried out by groups of experts in the education field. The criterion of fit used was the burden of the given question's demand, i.e. its ultimate loading in terms of the demands made upon the pupils. Questions which failed to fit the category they were designed for were rejected or rewritten. If this criterion was met then what might be overlap in principle would in practice have little effect on performance score correlation. (Italics in original. Murphy, 1989, p. 149)

This was not achieved. Murphy reports that the APU came to identify Use of graphical and symbolic representation (Category 1) and Use of apparatus and measuring instruments (Category 2) as "enabling skills" which contribute to student performance in other categories. Murphy also relates that the data gathered over the five years of surveys indicate that:

many questions made demands on pupils other than those specified in the definition of the question type. The question type was seen to represent the major demand of a question but the subsidiary demands also appeared to influence pupils' performance. (1989, pp. 149-50)

Johnson (1989) describes some of the pilot work that was done to provide empirical, albeit correlational, evidence of relationships among categories. She warns that:

empirical data cannot substitute for educational judgment. A poor correlation between performance on two groups of questions might support an assumption that the two groups measure different things, or might throw doubt on a previous assumption about equivalence. A strong association between different test questions is not sufficient evidence that these do measure the same thing(s). (Johnson, 1989, p. 13)

Correlations between subcategory test scores were found to be in the range of 0.55 to 0.75, with modal values in the order of 0.6. These correlations appear to have been calculated between the aggregated scores for each test package.
Johnson is particularly tentative in discussing these "moderately high correlations", which "could be taken as confirmation of the overlap between subcategories" or "might also be merely coincidental" (1989, p. 13).

The APU also pioneered some significant technical procedures in the establishment of the reliability and generalizability of student performance. An explanation of some of these procedures will follow in the section examining reliability and generalizability of student performance.

RELIABILITY - MEANINGS AND REQUIREMENTS

The widely accepted guidelines for the use and development of tests, the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1985), cover technical standards for test construction and evaluation as well as professional standards for test use. The section on reliability is written in the language of classical reliability theory: standard error of measurement, reliability coefficient and true score. Classical reliability theory has definite limitations as a model when applied to performance assessments. These problems are identified below and the use of Generalizability Theory (Shavelson and Webb, 1991) is explored.

Classical Theory

Feldt and Brennan (1989) describe the historical perspectives of reliability, and classical reliability theory as an attempt to quantify the consistency and inconsistency in examinee performance. They define the standard error of measurement as:

the standard deviation of a hypothetical set of repeated measurements on a single individual. Because such measurements are presumed to vary only because of measurement error, the standard deviation represents the potency of random error sources. (p. 105)

Feldt and Brennan characterize the reliability coefficient as quantifying "reliability by summarizing the consistency or inconsistency among several error-prone measurements" (1989, p. 105). Error manifests itself by depressing the correlation coefficient between two experimentally independent measures.
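These two quantities are linked by a standard identity of classical theory: the standard error of measurement equals the score standard deviation multiplied by the square root of one minus the reliability coefficient. A minimal sketch (the function names and the numerical values are mine, chosen for illustration, not drawn from the assessments discussed here):

```python
import math

def sem_from_reliability(score_sd, reliability):
    """Classical identity: SEM = SD_x * sqrt(1 - r_xx)."""
    return score_sd * math.sqrt(1.0 - reliability)

def reliability_from_sem(score_sd, sem):
    """The same identity inverted: r_xx = 1 - (SEM / SD_x)^2."""
    return 1.0 - (sem / score_sd) ** 2

# With a score standard deviation of 10 and a reliability of 0.91,
# the standard error of measurement is 3 score points.
sem = sem_from_reliability(10.0, 0.91)          # 3.0

# The same instrument (same SEM of 3) used with a more homogeneous
# group (SD = 5) yields a lower reliability coefficient: the
# coefficient is sensitive to the group, while the SEM is stable.
r_homogeneous = reliability_from_sem(5.0, 3.0)  # 0.64
```

The second call illustrates Feldt and Brennan's point that the reliability coefficient depends on the character of the examinee group, whereas the standard error is relatively stable from group to group.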
These two statistics are seen by Feldt and Brennan as having specific limitations and advantages in use. The standard error is useful because the error is presented in score-oriented units; it can be used to suggest how testing procedures can be improved, and it is relatively stable from group to group. However, the standard error cannot be compared from one instrument to another, or used to compare different scoring procedures for the same instrument. Reliability coefficients have a much more generalizable use. For example, they can be compared between instruments; instruments with coefficients of less than 0.70 are considered unsuited for individual student evaluations (Feldt and Brennan, 1989). Feldt and Brennan warn that reliability coefficients are sensitive to the character of the group from which the data were collected.

Feldt and Brennan consider "true score" to be a further vital concept in the quantification of reliability. Their definition is taken from conventional practice:

the true score of a person is regarded as a personal parameter that remains constant over the time required to take at least several measurements, though in some settings a person might be deemed to have a true score subject to change almost moment to moment. (1989, p. 106)

In warning against considering the true score as merely the "limit approached by the average of observed scores as the number of these observed scores increases", Feldt and Brennan offer a reminder that this measurement is set in the behavioral sciences rather than the physical sciences (1989, p. 106). The measurement process itself is likely to effect a change in the examinee. A second concern is voiced about the specification of instruments that would provide such a true score. Feldt and Brennan argue that it must be possible to "define what is meant by interchangeable test forms without the use of the concept of true score, lest the definition be circular" (1989, p. 107).
The third concern expressed by Feldt and Brennan is that conditions may vary legitimately from score to score, and the effects of subtle differences in obtaining scores might lead to redefinitions of true score. Such variations can lead to multiple reliabilities, even within a set population.

The shadow of "true score" is "measurement error"; an observed score is made up of these two components. Feldt and Brennan identify three possible sources of error variance: (1) random variations within individuals (e.g., health, motivation, carelessness, luck); (2) situational factors such as the working environment of the examinee (factors may be psychological or physical); and (3) instrumental variables such as variations in machinery, or effects arising from a misfit of the domain sampled for individual examinees, advantaging some and disadvantaging others. Feldt and Brennan describe the five assumptions of classical reliability theory:

1. An observed score on a test or measurement is the sum of a true score component and a measurement error component.

2. Parallel forms are measures constructed to the same specifications, give rise to identical distributions of observed scores for any very large (infinite) population of examinees, covary with each other, and covary equally with any measure that is not one of the parallel forms.

3. If it were possible for an individual to be measured many times, using different parallel forms on each occasion, the average of the resultant errors of measurement would approach zero as the number of measurements increased.

4. If an infinite population of examinees were tested via any given form of a test, the average of the resultant errors would equal zero, provided the examinees were not chosen on the basis of the magnitude of the observed test score.

5. Consistent with the foregoing assumptions, the true score of an individual is perceived as a personal constant that does not change from form to form.
For convenience, the true score is regarded as theaverage score the individual would obtain if tested on an infinite numberof parallel forms. (1989, p. 108)Parallel forms are an important part of classical reliability theory. Feldt and Brennanconsider the development of a large number of parallel forms as impractical. Instead theydescribe strategies for the use of two interchangeable test forms, and for the readministration of a form previously used. The parallel-forms approach is considered theideal, but administratively most difficult approach. Its main disadvantage stems from thereluctance of school authorities to retest students; in practice, available second forms tendto be ignored. For the test-retest approach, Feldt and Brennan believe that:A second administration of the same tasks or stimuli is a natural andappropriate procedure when two conditions are met:a) no significant changes in examinee proficiency or character is to beexpected within the period of measurement, andb) memory of the tasks or stimuli will not influence responses onsubsequent presentations. (1989, p. 110)- 56 -They warn that the second characteristic is unlikely to be met in pencil-and-paperadministrations, as many examinees believe that they are being assessed on theirconsistency, and attempt to respond as they did in the first administration. This causes afalsely high level of reliability.The estimation of reliability from a single administration of a test is based uponpart-test similarity. In this approach the test itself is divided into two forms. In an idealsituation the mean and variance must be the same for each of the forms. As this is anextremely harsh condition, the measurement community has found ways of accommodatingthe reality of non-parallel forms, and alternative models have been created. Each of thesemodels represents a progressive reduction in adherence to the ideal. Feldt and Brennan,(1989, pp. 
110-111) describe these models:

— tau-equivalent forms have identical means, but differences in error variance and hence in observed score variances;
— essentially tau-equivalent forms exhibit mean differences in addition to variance differences, but the difference of the true score will be constant between parts for every examinee;
— congeneric forms are even less parallel, but the true scores between parts are perfectly correlated in a linear sense, and have different error variances and score variances;
— multi-factor congeneric forms attempt to identify systematic components within the forms and apply constants to weight each of these factors. The true scores of the two parts are represented by the same factors, in differing combinations.

These different types of form allow the calculation of various coefficients of reliability17. This discussion of reliability is focused upon the underlying assumptions of the theories rather than the modes of calculating coefficients, particularly as Traub and Rowley (1991) stress that:

reliability is not simply a function of the test. It is an indicator of the quality of a set of test scores; hence reliability is dependent on characteristics of the group of examinees who take the test, in addition to being dependent on the characteristics of the test and the test administration. (p. 41)

Traub and Rowley assert that “inconsistent measurements are a bane to persons engaged in research” (1991, p. 37). While this statement may hold for closed assessment procedures where a single “right answer” is valued, there are many examples of performance assessment in science where students are asked to plan, design or experiment, with a range of possible outcomes or answers.

Generalizability Theory

Generalizability theory (G theory) has been used in analysis of student test scores in performance assessments by several researchers (e.g., Shavelson et al., 1992; Koretz, Stecher, Klein, McCaffrey, & Deibert, 1992; Candell and Ercikan, 1992).
As performance assessments become used more widely, this approach to analysis of sources of variation in performance is likely to see greater use. Generalizability theory was first proposed by Cronbach, Gleser, Nanda, and Rajaratnam in The Dependability of Behavioral Measurements (1972). Much of this summary of G theory is taken from Generalizability Theory: A Primer by Shavelson and Webb (1991). Shavelson and Webb characterize G theory as “a statistical theory about the dependability of behavioral measurements” and within this definition they refer to dependability as:

the accuracy of generalizing from a person’s observed test score on a test or other measure (e.g., behavior observation, opinion survey) to the average score that the person would have achieved under all possible conditions that the test user would be equally willing to accept. (1991, p. 1)

17 Spearman-Brown, Cronbach alpha and Guttman, among others.

A necessary condition, and limiting assumption, is that a person’s ability or measured attribute remains in a steady state over different occasions of measurement. Thus differences among scores earned by an individual arise because of one or more sources of error in measurement18. Other methods of estimating error have generally been based upon classical test theory and are limited to estimates of one source of error at a time. The alleged advantage of G theory is that it is possible to estimate the value of each of several sources of error in a single analysis. Shavelson and Webb claim that:

G theory enables the decision maker to determine how many occasions, test forms, and administrators are needed to provide a dependable score. In the process, G theory provides a summary coefficient reflecting the level of dependability, a generalizability coefficient that is analogous to classical test theory’s reliability coefficient. (1991, p.
2)

Thus, by running a generalizability theory analysis (as done by Baxter et al., 1992; Shavelson et al., 1993; Shavelson et al., 1994), and by making several assumptions as to the impact of the types of error, it is possible to propose model studies in which the error is reduced to an acceptable minimum19, comparable to those established for classical theory. A consistent finding in the reported results of Shavelson and his associates is that “measurement error is introduced by task-sampling variability, and not by variability due to other measurement facets” (Shavelson et al., 1993, p. 229). Shavelson, Baxter and Gao explain the consequences of this finding for performance tasks in large-scale assessments:

Regardless of the subject matter (mathematics or science) or the level of analysis (individual or school), large numbers of tasks are needed to get a generalizable measure of achievement. One practical implication of these findings is that — assuming 15 minutes per CAP task, for example — a total of 2.5 hours testing time would be needed to obtain a generalizable measure (0.80) of student achievement. (1993, p. 229)

18 “Error in measurement” changes a student’s score from the “true score”.
19 Shavelson and his collaborators have chosen to use a G value of 0.8 as the critical value about which to manipulate the number of tasks, studies or raters to find an acceptable level of dependability.

In a later paper, Shavelson, Gao, and Baxter (1994) demonstrate the influence of task domain upon task-sampling variability. Shavelson, Gao, and Baxter revisit some of the group’s earlier work (Shavelson, Baxter, Pine, Yur, Goldman, & Smith, 1991) in which students investigated the preferences of sow bugs for different environments: (a) light vs. dark conditions, (b) damp vs. dry conditions, and (c) factorial combinations of these environments (light and damp, etc.).
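The task-count arithmetic quoted above can be illustrated with a small calculation. The following Python fragment is a minimal sketch of the kind of decision-study projection involved, for a simple person-by-task design; the variance components used are hypothetical illustrative values, not estimates from any of the cited studies:

```python
def g_coefficient(var_person, var_residual, n_tasks):
    """Projected relative G coefficient for a person x task design:
    person variance divided by person variance plus the residual
    (task-sampling) variance averaged over n_tasks tasks."""
    return var_person / (var_person + var_residual / n_tasks)

def tasks_needed(var_person, var_residual, target=0.80, max_tasks=100):
    """Smallest number of tasks whose projected G coefficient
    reaches the target dependability level."""
    for n in range(1, max_tasks + 1):
        if g_coefficient(var_person, var_residual, n) >= target:
            return n
    return None

# Hypothetical components with large task-sampling variability
# relative to person variance (the pattern Shavelson et al. report):
print(tasks_needed(var_person=1.0, var_residual=2.5))  # -> 10
```

With these illustrative components, ten tasks are needed to reach 0.80, which at 15 minutes per task corresponds to the 2.5 hours of testing time cited in the passage above.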
Shavelson, Gao, and Baxter report that “Observations and subsequent data analyses showed convincingly that the third experiment was qualitatively different, and more difficult, than the first two” (1994, p. 3). In this 1994 paper, Shavelson, Gao, and Baxter base their analysis upon the premise that the content of the domain of the questions appropriately represents the domain or universe to which the researcher wishes to generalize. Shavelson, Gao, and Baxter argue that the factorial experiment is beyond the domain of fifth-grade science and as such represents part of an inaccurately specified domain for the set of tasks. However, if only the first two tasks are considered, then the domain is deemed appropriate. Shavelson, Gao, and Baxter consider this a significant development:

The practical implications of this misspecification are telling. To obtain a coefficient of 0.80, 1 rater and 3 experiments would be needed for both relative and absolute decisions, when the domain is correctly specified. When misspecified, 1 rater and no less than 7 experiments would be needed for relative decisions, and 1 rater and 11 experiments for absolute decisions. (1994, p. 8)

With this “elegant” solution of eliminating the “inappropriate” experiment, Shavelson, Gao, and Baxter found a way to enhance the generalizability of the two experiments. Shavelson and his colleagues claim that they can generalize performance on these two sow bug tasks to the domain of “living things” from the California Science Framework (1990). There are problems with content representation that make this claim troublesome for science educators, who might argue that an acceptable representation of the Framework requires more than two similar tasks to control experimental variables for insects. An analysis of the Framework would likely show that content areas such as human biology and plant biology require some representation.
Statistical manoeuvres such as those performed by Shavelson, Gao, and Baxter highlight the need for vigilance in the application of generalizability theory, and provide strong evidence to support Hein’s (1990) plea for the involvement of science educators in the design and interpretation of science assessments.

The factors within a test that contribute to the reliability of test scores (test length, item type and item quality; Traub and Rowley, 1991) also apply in the application of generalizability theory to the analysis of the variance of test responses. Longer tests are considered more reliable and to provide a more generalizable set of scores, but the principal condition for the additional items is that they should:

function in the same way as those already present. They should be of the same type (multiple-choice, short answer, etc.) and should test similar knowledge and skills. But also it is necessary that students should approach them similarly; if the length of the test is such that fatigue, boredom, or resentment begin to affect the students’ behavior, we could not expect the Spearman-Brown formula to give us sensible predictions. (Traub and Rowley, 1991, p. 43)

The type of item, and its method of scoring, favours a large number of shorter, objective items. This generally eliminates scorer inconsistency as a source of measurement error and extends the represented content, reducing unreliability resulting from chance in question selection.

The issue of item quality deserves careful examination. Traub and Rowley (1991) warn that unclear or ambiguous items will lead to multiple interpretations by students, clearly reducing reliability. If items are too difficult then students will guess and introduce randomness to the scores; items that are too easy will be answered correctly by everyone and will not distinguish between students.
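The Spearman-Brown prediction invoked in the quotation on test length can be shown numerically. This is a minimal sketch under the formula's usual assumption (that added items function like those already present); the reliability values are illustrative, not drawn from the assessments discussed here:

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability of a test lengthened by length_factor,
    assuming added items behave like those already present."""
    k, r = length_factor, reliability
    return (k * r) / (1 + (k - 1) * r)

# Doubling a test whose scores have reliability 0.60:
print(round(spearman_brown(0.60, 2), 2))  # -> 0.75
```

As the quotation cautions, such projections break down when fatigue, boredom, or resentment changes how students approach the added items.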
Traub and Rowley explain the issue clearly:

Items that contribute most to test reliability are those that discriminate — in the technical sense, this refers to items on which students who possess the knowledge and skill needed to answer the question correctly have a better chance of success than students not in possession of this knowledge and skill. Items that are either very easy or very difficult for all the students being tested cannot be good discriminators. Therefore, it can be said that in order to maximize reliability a test should be pitched at a level of difficulty that matches the abilities of the students, neither too easy for them nor (the worse of the two) too difficult. (1991, p. 43)

These three factors pose particular problems for criterion-referenced tests. As most performance items have been designed with reference to specific criteria, the use of such reliability coefficients is necessarily inappropriate.

Similar concerns exist in examining the variance requirements of G theory. The set of tasks must enable the task-sampling variance to be low (Shavelson et al., 1993). This means that students should perform at similar levels from one task to the next. By making the tasks very similar this can be achieved (Shavelson et al., 1994), but only at the expense of a reduction in content representation.

Instrumental Variables and Performance Assessment

In the context of complex performance tasks it is most appropriate to consider reliability in terms of the consistency of the data collection procedures and the reproducibility of the data obtained. These procedures can be considered in terms of:

1. equipment stability;
2. administration procedures; and
3. uniformity in the data analysis procedures, leading to the application of consistent evaluative standards.

Stability of equipment used in hands-on assessment in science is vital.
The equipment must not only perform to a well-defined specification, but must behave in the same manner from one student to the next, and in many different assessment sites. The administration procedures developed for an assessment must enable each student to have an equal opportunity to perform at her/his best, but not enable any group of students to have a particular advantage. The training of the administrators must ensure that issues relating to fairness are clarified, and that procedures such as orientation time on tasks and responses to student questions are considered.

The issue of the reliability of the rating of student performance has received considerable attention in the literature (Raymond and Houston, 1990; Slater and Ryan, 1993). Complex tasks which lead to qualitatively different responses can challenge the reliability of score interpretation, as for example in the Vermont portfolios (Koretz et al., 1992). However, Shavelson, Gao, and Baxter report more positive results for performance assessments in science:

The findings are consistent. Inter-rater reliability is not a problem. Raters can be trained to score performance reliably in real time or from surrogates such as notebooks. (1994, p. 1)

Hardy (1992) reports a study in which inter-rater reliability coefficients ranged from 0.76 on one task to 1.00 on two other tasks. Both analytical and holistic scoring techniques were used to produce inter-rater reliability values of 1.00, with student samples of 183 and 240 respectively.

CHAPTER 3 -- THE 1991 BRITISH COLUMBIA SCIENCE ASSESSMENT

AN HISTORICAL PERSPECTIVE

The 1991 B.C. Science Assessment is the fourth in a series of science assessments organized as part of the Provincial Learning Assessment Program (P.L.A.P.). The P.L.A.P. was set up in the 1970s by the provincial government to evaluate the effectiveness of the core program for education in British Columbia. The first science assessment took place in 1978 (Hobbs et al., 1980).
Further assessments followed in 1982 (Taylor et al., 1982), and in 1986 (Bateson et al., 1986). These first three assessments were similar in structure, with three constituents:

1. Achievement tests for all students in Grades 4, 8 and 12 (Grade 10 in 1986);
2. A two-part instrument to survey students’ attitudes towards science and scientists, and also to examine students’ perceptions of the type of science teaching to which they had been exposed; and
3. Teaching/learning surveys for a sample of teachers in elementary, junior-secondary and senior-secondary schools.

These three parts provided sufficient data to enable the evaluators to make the wide range of inferences about the state of science education in B.C. demanded by the statement of major purposes of the P.L.A.P. These purposes are:

1. To monitor student learning over time;
2. To inform the public of the strengths and weaknesses of the public school system;
3. To provide the province and individual districts with information that can be used to identify strengths and overcome identified weaknesses;
4. To assist curriculum developers at the provincial and local levels in the process of improving curriculum and developing resource materials;
5. To provide directions for change in teacher education and professional development;
6. To provide information that can be used in the allocation of resources at the provincial and district level; and
7. To provide directions for educational research.
(British Columbia Department of Education, 1975)

The initial assessment in 1978 provided baseline data that have been used and extended by subsequent assessments to identify trends in student achievement. Anchor items from 1978, 1982 and 1986 were used in the 1991 assessment.

GOVERNANCE AND STRUCTURE

The management structure for the 1991 science assessment followed a model developed for earlier assessments.
Approximately two years before the expected date of an assessment, the Ministry of Education requests proposals to conduct the assessment from teams of evaluators. Before proposals are submitted, Ministry personnel brief potential contractors about possible assessment instruments, sampling procedures, and levels of reporting. The proposals are reviewed by the Ministry and the contract for the assessment is awarded to a “Contract Team”. Contract Teams have generally been based in university faculties of education, e.g., the University of British Columbia (U.B.C.) in 1978, the University of Victoria in 1982. A single coordinated proposal with a range of optional components was submitted for the 1991 assessment, and included personnel from each of the three universities in B.C.

Contract Teams are independent of the Ministry, but report to a Ministry-led management or review committee. This committee is made up of Ministry personnel, teachers of different grades, university/college instructors, parents, school trustees and members of the public. The Contract Team, in consultation with the review committee, prepares the materials for the assessment and presents the items and procedural details to sub-committees of the management committee. These sub-committees review the items, focusing upon the suitability of wording and content, and make recommendations for change. As the first three science assessments were all machine-scored, interpretation panels with a similar structure to the review committees met to examine student results. The range of components in the 1991 assessment led to significantly broader approaches. These are discussed below.

The 1991 B.C. Science Assessment saw a transition in the definition of the purposes of the assessment by the Ministry of Education.
The report of the Royal Commission on Education (Sullivan, 1988) and the subsequent Ministry response in Year 2000: A Framework for Learning (Ministry of Education, 1990) provided the impetus for the Ministry to seek greater breadth in the 1991 assessment. The emphasis on learner-focused curriculum and assessment in the Year 2000 document led to a restatement of the objectives for the science assessment. These are to:

• Describe to professionals and the public what children of various ages CAN DO in the curricular area of Science
• Examine CHANGES in student performance over time, and provide baseline data for the proposed changes in educational programs
• Describe RELATIONSHIPS among instructional activities, use of materials, teacher and student background variables, and student performance and attitudes
• Describe the differences in student performance and attitude that are related to GENDER
• Provide EXEMPLARS OF ASSESSMENT TOOLS that can be used by teachers, schools, and districts for assessment of science processes and performances
• Provide directions for educational RESEARCH
• Suggest areas of need, and provide direction for decisions regarding both in-service and pre-service TEACHER EDUCATION. (Emphasis in original. Erickson et al., 1992, p. 1)

While the pedigree of these objectives can be traced back to the original set of major purposes, there are many significant changes in emphasis. The overall tenor of the assessment objectives has become more focused upon students, teachers and the practices of education, and appears to reflect the outlook that education is about people and what they do together. These revised objectives are much more tentative in tone, and indicate that there has been a recognition that changes in education cannot be forced from above. For example, “identify” becomes “suggest” when dealing with teacher education.
Equal opportunity has become an issue in science and assessment, so there is an “up front” requirement to identify and describe gender-related differences.

The first objective in the list is pertinent to the discussion of performance assessment. The change from “the strengths and weaknesses” to “what children of various ages can do” is evidence of a philosophical metamorphosis. This “kinder, gentler” approach towards assessment led to the development of the four components of the assessment. The relationship between the four components is shown in the Overview (Figure 4).

Component 1: The Classical Component

This component retained many of the features of the earlier assessments — multiple-choice achievement items, student background survey and attitude scales, and a teacher questionnaire. An extension to the procedure was made through the use of open-ended items. The Ministry chose to report results at the provincial rather than district or school level. This led to the choice to use a 30% sample1 rather than blanket coverage of all students in Grades 4, 7 and 10. In addition, data were collected from 10% samples of students in Grades 3, 5, 6, 8 and 9.

1 The sample was chosen to represent the Province by geographic zones, and also by school size.

Figure 4. Overview of Science Assessment Components (taken from Erickson et al., 1992, p. 5)

Component 2: The Student Performance Component

In this component of the assessment the Contract Team identified and developed two distinct modes of assessment, “stations” and “investigations”, both of which involve student interaction with materials. Students in Grades 4, 7 and 10 participated in this component, with equal numbers working through each mode.
[Figure 4 diagram: the Classical Component (multiple-choice booklets; 30% sample of students in Grades 4 and 7 and Science 10; 10% sample of Grades 3, 5, 6, Science 8 and Science 9; about 70% of students in Grades 4, 7 and 10 not participating in the assessment); Performance Tasks (individual investigations, 2 at 45 min. each; station tasks, 6 at 7 min. each); Socioscientific Issues (4 topics in print and video; interviews and booklets); Context for Science (classroom context; ethnographic interviews and description with teacher, students, principal, and district representative in 48 English and 12 French Immersion classrooms).]

For stations and investigations, the results are reported in terms of descriptions of what the students actually did, as well as in terms of levels of student performance on a five-point scale.

Component 3: The Socioscientific Issues Component

The socioscientific issues component was developed to:

study student understandings of, or points of view with respect to, science, technology and society issues, and
compare ways of assessing understandings of, or points of view with respect to, socioscientific issues. (Gaskell, Fleming, Fountain, & Ojelel, 1992, p. 3)

The Contract Team prepared videotapes for students in Grades 7 and 10. These present conflicting perspectives on four issues, e.g., clear-cut logging, and the use of animals in scientific research. Through a series of interviews and open-ended questions, specific multiple-choice and written-response instruments for each scenario were developed. Distributions of students’ points of view, by gender, by grade and by medium (print or video) are reported.

Component 4: The Context for Science Component

The focus of this component was to provide “more information about classroom practices and the context within which students gain their knowledge and understanding about science” (Wideen, Mackinnon, O’Shea, Wild, Shapson, Day, Pye, Moon, Cusack, Chin, & Pye, 1992, p. 9).
The Contract Team chose to collect data from structured observation of 60 classrooms, and interviews with teachers, samples of students, principals and district representatives. In the results, the Contract Team describes patterns of classroom practice and discusses the influence of district support upon these practices. The report British Columbia Assessment of Science 1991 Technical Report IV: Context for Science Component concludes by presenting an “agenda for improvement” (Wideen et al., 1992, pp. 129-141).

STUDENT PERFORMANCE COMPONENT: A DETAILED DESCRIPTION

The time-frame of this component of the assessment spanned three years, from the fall of 1989 until the Technical Report was signed off by the Ministry of Education and Ministry Responsible for Multiculturalism and Human Rights at the end of 1992.

Planning for the Assessment

Task development began with a review of the literature on performance assessment projects from around the world. The Assessment of Performance Unit (APU) for Science in England (Johnson, 1989) and the National Assessment of Educational Progress (NAEP) document A Pilot Study of Higher Order Thinking Skills Assessment Techniques in Science and Mathematics (Blumberg et al., 1986) were the prime sources of information about development and exemplars of potential tasks. In addition, the assessment programs that had been developed in New York (New York State Department of Education, 1989), California (Anderson, 1990; Comfort, 1990), England (CHASSIS, edited by Wilson, 1986), Manitoba (1988), and Scotland (Techniques for the Assessment of Practical Skills in Foundation Science — “TAPS” by Bryce, McCall, MacGregor, Robertson, & Weston, 1984) were scrutinized.
These assessments illuminated the need for breadth in the modes of assessment, particularly in representing content and a range of activities for the students. The NAEP pilot study of 1986 was heavily influenced by the work of the APU, and many of the features of these two earlier assessments were considered carefully in the planning and development of the Student Performance Component of the 1991 B.C. Science Assessment.

The Assessment Framework

The assessment framework is described in The Assessment of Students’ Practical Work in Science (Erickson, 1990), a paper produced for the Ministry of Education at the time the Contract Team started preparations for the assessment. In addition to recognizing the influence of educational initiatives in British Columbia, Erickson identifies two other contexts that he considers significant in the development of an assessment framework: the context of the constructivist perspective and the context of assessment. Given the influence of the constructivist perspective, Erickson emphasizes the importance of experimentation:

If we take a constructivist perspective of learning seriously then we see that a fundamental aspect of learning anything new consists of a form of continual experimentation. Thus at a very general level of description we obtain an image of children as well as adults constantly engaged in a process of constructing conjectures about the nature of the social and physical worlds in which they inhabit and testing these against “the reality” of those worlds.
(Erickson, 1990)

That the framework, with its concentration upon the notion of experimentation, differs from those previously identified in curricular documents in British Columbia is seen by Erickson as a necessary and desirable consequence of this constructivist perspective. Earlier science assessments had shown that few teachers engaged in assessing their students in practical mode; indeed, as students progress through their education, the amount of hands-on science decreases (Bateson et al., 1986). It became an explicit focus of the Contract Team to exemplify experimentation to teachers:

The specification of a framework of objectives in the area of students’ practical work in science should provide teachers with a better understanding of the importance of practical work in science instruction and enable them to construct effective assessment strategies. (Erickson, 1990)

The framework is also intended to assist teachers in working towards assessing “their students’ progress in developing the types of skills and competencies that they consider to be important in this particular curricular domain, considering the age and experience of the students they are currently teaching” (Erickson, 1990). The language of the framework is deliberately general, a position that is justified by the statement that

the framework is intended to be used as a basis for generating specific outcomes suitable for the developmental levels of the students to be assessed and embedded in an appropriate content area for that group of students — for instance, those content areas specified by the curriculum guide. (Erickson, 1990)

There are two levels of description within the framework. General level descriptors are called “dimensions”. For example, “measurement”, the “use of apparatus” and “communication” are classified as dimensions; the detailed levels of description, the sub-categories within each dimension, are called “abilities”.
The dimensions are shown in Figure 5; the complete table of dimensions and abilities is given in Figure 29 (page 152).

(1) Observation and Classification of Experience
(2) Measurement
(3) Use of Apparatus or Equipment
(4) Communication
    i) Receiving and interpreting information
    ii) Reporting information
(5) Planning Experiments
(6) Performing Experiments

Figure 5. Dimensions of Science

Framework descriptors require elaboration to include a specific context and content for each task. The Contract Team called these statements of specification “objectives”. For example, the station “Rolling Bottles” is described using the following statements:

1. Students observe three bottles filled with different amounts of sand roll down the ramp, and identify the fastest. 1.a
2. Students explain why one bottle went fastest. 4.b
3. Students observe an empty bottle roll down the ramp and compare its speed with one bottle previously rolled. 1.a
4. Students explain why one of the two bottles went faster. 4.b
(Erickson et al., 1992, p. 61)

It can be argued that to use only abilities 1.a (ability to describe observations made using the senses about a variety of living and non-living objects) and 4.b (ability to draw inferences from data presented in tabular, pictorial or graphic format or generated experimentally) to describe the station is insufficient. These “behavioral objectives” serve only to emphasize the salient features of each of the stations and act as guidelines for the evaluation of students’ performance.

Stations and Investigations

The literature review identified many different modes of performance assessment: short exercises of two to six minutes in the TAPS scheme (Bryce et al., 1984), longer stations of 13 to 15 minutes in the California Assessment Program (Anderson, 1990), and investigations of up to one hour in the APU (Johnson, 1989). In examining the tasks through the lens provided by the B.C.
assessment framework, it was perceived that the shorter tasks tend to be prescriptive in nature, with students following instructions and responding to short questions; the majority of these tasks fit into Dimensions 1 to 4. Investigations are open-ended tasks in which students design and perform experiments to solve specific questions; such investigations appear to focus upon Dimensions 5 and 6.

A limit of one hour of contact per student was placed upon the assessment by the Ministry of Education. The Contract Team chose to use two distinct types of task to broaden the range of the assessment in an attempt to cover the major part of the spectrum of school science. The two distinct types of assessment task are:

Stations: in which students spend seven minutes working through short, time-limited tasks. Each station focuses upon different dimensions and abilities in diverse contexts of school science. Students work through one of two different circuits, each comprised of six distinct stations. The data for station tasks are students’ written responses.

Investigations: in which students work in pairs on a problem, presented as an operational question, and use a specific set of materials. Each pair of students spends between 40 and 50 minutes upon a single investigation. The data for investigation tasks consist of observers’ records, students’ written responses, and students’ verbal responses in an interview following the investigation.

The process of pilot-testing is described in British Columbia Assessment of Science 1991 Technical Report II: Student Performance Component (Erickson et al., 1992).

A variety of tasks used in the earlier mentioned assessment projects were pilot-tested for possible use in our station format. New tasks were created by members of the Contract Team to broaden the curriculum coverage and extend the range of dimensions and abilities assessed.
During the cycle of pilot testing, attention was paid to student reactions to the questions within each task, and student criticism of each station was considered. Each of the stations was pilot-tested with at least 40 students from at least three different schools. (Erickson et al., 1992, p. 10)

Stations that were “borrowed” from other assessment jurisdictions were presented to the students for pilot-testing as they had been used in those regimes. The Grade 4 students had significant problems with the language of some instructions, and there was much deliberation, both with students and within the Contract Team, about refinements to the tasks. Further decisions were made by watching students perform the tasks, in revising the structures of both the instructions and the student response sheets. Some assessments, for example the New York assessment at fourth grade (New York State Department of Education, 1989), separated the instruction sheets from the response sheets; others integrated the instruction sheet with the response sheet. The integrated sheet appeared to work better with the tasks used in B.C., particularly when the response sheet was limited to a single page. Twenty-two stations were eventually considered suitable for use in the assessment. As four stations overlapped all three grades, and six stations overlapped two grades (either Grades 4 and 7 or Grades 7 and 10), it was possible to allocate 12 stations for each grade. This was done by constructing two circuits, each of six stations. Figure 6 shows the relationships between overlapping stations.

Figure 6. Venn Diagram of Stations (Erickson et al., 1992, p. 11)

The pilot-testing of the investigations led to many decisions after each school visit. Two critical decisions were made about the structure of the investigations following the experiences with the pilot testing. The first of these was to assess students working in pairs, rather than individually, as had been the procedure with the APU.
This decision was made because students appeared to be more comfortable working with a partner, and it was easier for observers to identify procedures as students discussed their work. The second decision was that only two investigations were to be chosen for the assessment. "Magnets" is based upon Meyer's doctoral work (1992) and "Paper Towels" is derived from the work of the APU (Gott and Murphy, 1987). Other investigations were pilot-tested but rejected on conceptual or administrative grounds. The investigation "Survival", developed by the APU (Driver, Child, Gott, Head, Johnson, Worsley, & Wylie, 1984), involves students modeling a human body with an aluminum drink can which contained hot water. While many students visualized the model effectively, few were aware that the amount of time needed to cool the can was several minutes rather than seconds, an effect of the magnitude of the specific heat of water upon the rate of cooling. Administrative problems with live animals, snails and mealworms, led to the rejection of investigations based upon animal behaviour! The observation schedules for the investigations were developed as the pilot-testing proceeded and members of the Contract Team became more familiar with the experimental procedures. Although each investigation required its own observation schedule, the basic structure was maintained for both.

Figure 7 summarizes the cycle of pilot-testing for both stations and investigations, covering the phases of students working through the tasks, discussion of the tasks with members of the Contract Team, through to revision of the tasks for further piloting.

Figure 7. Cycle of Pilot-Testing (pilot tasks with students; discuss tasks with students; Contract Team members debrief and discuss; revise tasks and create new tasks, or reject tasks)

To direct the choice of particular stations and investigations for use in the assessment, the assessment goals were operationalized in terms of the questions shown in Figure 8.
Four clusters of questions were used for both stations and investigations. The selection of tasks was complete before the third rotation of pilot-testing.

Engagement
Stations: Is the station interesting and likely to engage the student?
Investigations: Is the investigation interesting and likely to engage the student?

Appropriate to student abilities
Stations: Is the content knowledge appropriate for the grade level? Can the student use the equipment? Can the student complete the station in the allotted time? Does this station assess different abilities from the other stations?
Investigations: Do students have the knowledge of the materials and procedures to answer the operational question? Do the operational question and the materials facilitate variability in student performance? Can students make appropriate measurements using the supplied equipment?

Appropriate task characteristics
Stations: Will the materials stand up to repeated use? Will the equipment provide consistent results? Is the equipment available at reasonable cost? Can the equipment be transported easily?
Investigations: Can an observer describe student performance, and reconstruct what the students did? Can students complete the investigation in the allotted time?

Range of appropriateness
Stations: Is the station suitable for more than one grade level?
Investigations: Is the investigation appropriate for all grade levels?

Figure 8. Criteria for the Choice of Assessment Tasks (Developed from Erickson et al., 1992, p. 10 and p. 12)

Sampling

The overall structure of the assessment required that students who participated in the Alternative Components be a subset of the 30% sample in the Classical Component. Geographical representation from around the Province was an important element in the sampling design. The assessment coordinator divided the Province into six regions, each composed of approximately 12 school districts.
Data were collected from a random sample of three school districts in each region, with equal numbers of students in each district working through either stations or investigations, approximately 115 per type of task at each grade level. The unit of sampling was a complete class, and students in the class were allocated randomly to either stations or investigations by classroom teachers.

Teacher Preparation for Data Collection

Having identified grades and districts, the Ministry of Education funded two classroom teachers from each grade in each region (36 teachers in total) to collect and interpret the data. These teachers were nominated to the assessment by district administrators. The nominees attended an orientation workshop for data collection. This workshop took place at U.B.C. and included teachers working through both sets of tasks as though they were being assessed. After debriefing, the teachers administered a "dress rehearsal" of each mode of assessment to students from local schools. Members of the Contract Team took the teachers through the use of the observation schedules for both investigations, and helped use them with the "dress rehearsal" students. The orientation workshop concluded with the teachers checking the contents of the kits they were to use for data collection in their own regions.

The details for the tasks and the assessment procedures are presented in Student Performance Tasks: Administration Manual (Bartley, 1991). This manual contains a set of protocols for use by the teacher/administrators in introducing both investigations and stations to the students. In addition, equipment lists for all tasks are included, many with diagrams illustrating specific equipment.

Preparation for Data Analysis

The data from the station tasks are contained in the students' completed response booklets. These booklets were stamped with a unique identity number on each page.
The booklets were then taken apart and sets of station response sheets were created to enable coding to proceed with teachers not being aware of the identity, location or gender of the student. For the investigations there are multiple sources of data, including observation schedules, student response booklets, and notes from the teachers' interviews with students.

The Contract Team chose to develop coding sheets for both stations and investigations. The station coding sheets contain two parts, a descriptive part that was constructed using a random sample of 25% to 30% of student response sheets to generate response categories for each question, and an evaluative part which consists of one or two questions focused upon the significant aspects of each station. Generation of these "judgement questions" entailed extended discussions within the Contract Team. The format of the questions is consistent from station to station; the stem for all evaluations is "In your judgement, how well did the student...?". Judgements were made on a five-point scale where the central point is a "satisfactory" level of performance. The scale is shown as Figure 9.

1. Not at all
2.
3. Satisfactory
4.
5. Extremely well

Figure 9. Scale for Evaluating Performance on Station Tasks — "In your opinion how well did the student...?"

For the investigations the Contract Team identified five questions that covered the salient features of student investigations. These questions are shown in Figure 10.

1. How well do the students plan an experiment to answer the question?
2. How well do the students develop a suitable measuring strategy?
3. How well do the students interpret the data collected to answer the question?
4. How well do the students report the results of their experiment?
5. In your judgement, considering ALL of the students' experiments, how would you rate the quality of their performance on this task?

Figure 10. Evaluative Questions for Investigations
This holistic approach towards evaluating student performance required criterion descriptors for each level of performance. The Contract Team chose to leave the writing of the descriptors to the teachers at the coding workshop; it was believed that this would bring the specification of appropriate levels of performance, and attendant descriptors, closer to classroom practice.

Coding Workshop

The coding workshop took place over three days in July 1991. The teachers who had collected the data returned to complete the coding, spending equal time on stations and investigations. For the stations the teachers worked in teams of four; each team was responsible for coding four stations. Sets of response sheets for each station were prepared for teacher-orientation and development of criteria for levels of performance. These sheets were copied on brown paper, with four sets of five response sheets for each station. The coding sheets that were used at this stage also were printed on coloured paper. At the conclusion of coding a station, a further set of five response sheets, printed on red paper, was given to the teachers for collection of data on the consistency of their coding.

As most teachers had specialized in observing a specific investigation, coding of the investigations took place in two groups. Teachers coded the investigations performed by the student pairs that they had observed. The Contract Team asked the teachers to code a maximum of four different experiments, as some pairs of students had been identified as performing as many as 11 different experiments in the Magnets investigation! The observing teacher was required to identify the "best experiment", performance on which was evaluated using the questions shown in Figure 10.
In addition, the teachers were asked to evaluate the students' overall performance in the set of experiments they had observed. (A "different experiment" is one where the students appear to be changing their experimental approach. This might be by using different apparatus or by controlling different variables. Approaches and apparatus are shown on the coding sheets.)

Analytical and Statistical Procedures

The coding sheets were passed on to the Educational Measurement Research Group (EMRG) at U.B.C. Data from these sheets were entered into ASCII files for use with Statistical Package for the Social Sciences (SPSS) software. The basic analysis involves aggregation of the numbers and percentages of students coded for each descriptive category, and at each level of performance for the judgement questions. The results are presented in terms of the percentage of all students in that grade, and also percentages by gender. Thus, descriptions of what students did in performing the assessment tasks, and how well they performed, are presented as percentages of the sample who participated in these tasks.

Further analysis of the data led to the production of correlation matrices of teacher judgements for each circuit of stations and also for the teacher judgements for each investigation. In addition, factor analysis tables were generated, but are not reported in the Technical Report due to time constraints. At the inception of the assessment it was hoped that data for the Student Performance Component of the 1991 B.C. Science Assessment could be merged with those from the Classical Component. Again, time constraints prevented this work from being completed.

Reporting and Interpretation of Results

The results for the assessment are reported in the British Columbia Assessment of Science 1991 Technical Report II: Student Performance Component (Erickson et al., 1992). Aggregated data for each station are followed by a discussion of the significance of the data for those students.
Also included are the teacher-developed criteria for judgements at each station. Discussions for each station are broad-ranging. They include an analysis of the descriptive data (focusing upon the nature of student approaches to the tasks) and also cover the evaluative data, with an analysis of the impact of the teacher-developed criteria upon the percentages of students performing at each level. Contract Team members specialized in writing about particular stations, and where tasks were used across grades, a single author prepared all the discussions. The comments conclude with an interpretation of the students' overall levels of performance on each station, and some incorporate a critical evaluation of the virtues and deficits of the specific station.

Results from the investigations are examined in a single chapter in the assessment report. This chapter presents the data for all three grades, together with the interpretations made by the Contract Team. This arrangement facilitates comparisons between the experimental approaches of students in different grades.

The British Columbia Assessment of Science 1991 Technical Report II: Student Performance Component (Erickson et al., 1992) contains a chapter entitled "Issues in the Assessment of Student Performance on Science Tasks". Chosen for consideration are two distinct facets: (1) comparative analyses of performance, and (2) specific technical issues concerning the validity of the assessment process. In comparing student performance in these assessment tasks the Contract Team chose to look at gender-related differences and grade-related differences.
Both these issues had been identified by the Ministry of Education as needing to be addressed; these sections in the "issues" chapter are a response to the Ministry goals.

Gender-Related Differences

The section on gender-related differences in student performance shows tables of percentages of all students, female students, and male students who were evaluated as having performed at "satisfactory or better" levels, with differences between females and males of over 15% shown. The reason given for the selection of 15% as a notable difference is "pedagogical significance" (Erickson et al., 1992, p. 227). It was recognized that statistically significant differences are likely to be much smaller than 15%, but these would not translate easily into student numbers in a typical classroom of 20 to 30 students. Judgement scores with gender differences of at least 15% are shown in Table 1.

Table 1
Gender-related Differences in Student Performance (Derived from Erickson et al., 1992)

Percentage of students rated "satisfactory or better" (All / Female / Male):

Grade 4, Station 4.7 Making Measurements: How well did the student measure temperature, length, volume and time? All: 91% (N=108); Female: 99% (N=58); Male: 82% (N=50)

Grade 4, Station 4.10 Rolling Bottles: How well did the student explain the motion of the bottles (consider both "why" questions)? All: 63% (N=104); Female: 71% (N=56); Male: 54% (N=48)

Grade 7, Station 7.7 Instruments: How well did the student measure selected properties, given a set of instruments? All: 46% (N=111); Female: 34% (N=59); Male: 58% (N=52)

Grade 7, Station 7.11 Magnet Power: A. How well did the student develop a strategy and use materials to identify the stronger of two magnets? All: 67% (N=111); Female: 76% (N=59); Male: 56% (N=52)

Grade 10, Station 10.4 Environmental Testing: How well did the student draw inferences from collected data? All: 65% (N=106); Female: 75% (N=44); Male: 58% (N=62)

That only five out of the 57 judgements across the three grades showed differences of greater than 15% led the Contract Team to state that "similarities in performance between females and males were more evident than the differences" (Erickson et al., 1992, p.
223). Four out of these five differences appear to indicate superior performance of females, with variations from grade to grade.

The Contract Team did not report any pattern in gender-related differences in student performance for common stations. Measurement stations show a difference favouring females in "Making Measurements" at Grade 4, favouring males in "Instruments" at Grade 7, and "similar levels" in "Instruments" at Grade 10 (Erickson et al., 1992, p. 227). The station "Magnet Power" shows:

at Grades 4 and 7 more females than males (differences of 14% and 20% respectively) were judged to perform at a "satisfactory or better" level in developing the strategy and using the materials; and

at Grade 10 more males (74%) than females (64%) performed at a "satisfactory or better" level for the same aspect of the task.

There was also a similar trend, but smaller differences, in the second teacher judgement on this station, communication of the strategy. This reversal of gender differences at Grade 10 is somewhat puzzling. Perhaps the open-ended nature of the problem and its solution appealed to the younger females (Grade 4 and 7) and showed in the quality of their responses, whereas the Grade 10 results may be related to students' experiences with the physics component of the junior secondary curriculum. At this age physics is often associated more with males' prior interests and knowledge (Erickson and Farkas, 1991) and so this may have created the different performance that we observed. (Erickson et al., 1992, p. 227)

The Contract Team is tentative in its interpretation of these differences, particularly in attempting to explain any pattern or cause. Other gender-related differences in student performance posed similar problems in explanation, and appeared to be quite small (less than 15%).
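The 15% screening rule described above can be illustrated with a short computational sketch. This is an illustrative reconstruction, not the analysis actually performed for the assessment; the first three rows reproduce values reported in Table 1, while the final station entry is hypothetical.

```python
# Illustrative sketch only: screening teacher-judgement results for
# female-male differences of at least 15 percentage points, the
# "pedagogical significance" threshold used in the report.

THRESHOLD = 15  # percentage points

judgements = [
    # (station and judgement, % female "satisfactory or better", % male)
    ("4.7 Making Measurements", 99, 82),
    ("4.10 Rolling Bottles", 71, 54),
    ("7.7 Instruments", 34, 58),
    ("10.x Hypothetical Station", 61, 57),  # gap below threshold: screened out
]

def notable_differences(rows, threshold=THRESHOLD):
    """Return (name, female - male) pairs whose gap meets the threshold."""
    return [(name, female - male)
            for name, female, male in rows
            if abs(female - male) >= threshold]

for name, gap in notable_differences(judgements):
    favoured = "females" if gap > 0 else "males"
    print(f"{name}: {abs(gap)} points favouring {favoured}")
```

On the sample data, the first two stations pass the screen favouring females and "7.7 Instruments" passes favouring males, mirroring the mixed direction of the reported differences.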
Given earlier reported gender-related differences in performance in science assessments (Johnson, 1987; Robertson, 1987), the Contract Team was keen to restate that:

gender-related differences are noteworthy by their absence in this study, especially in areas such as the physical sciences, where previous studies using more conventional modes of assessment have consistently reported large differences (e.g., electrical circuits). Many of the differences reported, whether showing males or females at higher performance levels, particularly with across grade variations are difficult to explain. Clearly, much more detailed work and analysis is required in some of these areas. (Erickson et al., 1992, pp. 228-9)

In concluding this section, the Contract Team comments upon the anecdotal data from teachers and students which indicate that students enjoyed working through the assessment tasks. Citing studies (Erickson and Farkas, 1992; Linn and Hyde, 1989) which emphasize the significance of "the influence of motivational and affective factors in producing gender differences on achievement tests" (Erickson et al., 1992, p. 229), the Contract Team speculates that this mode of assessment, and this set of tasks, have reduced gender-related differences. The Contract Team argues that this is an important step toward more "gender-fair assessment" (Bateson and Parsons, 1991).

Grade-Related Differences

The Contract Team chose to limit the analysis of grade-related differences in student performance to the four stations that were common to all three grades in the belief that "these four stations are sufficient to demonstrate and illustrate the grade-related issues of note in this type of assessment" (Erickson et al., 1992, p. 229). In choosing to make these specific comparisons the Contract Team made an effort to identify potential problems:

The criteria for these judgements were made by the teachers who administered and coded these stations at that grade level.
Hence for some of these common stations both the criteria and the way in which the criteria were interpreted differed significantly across the grade levels. For other, multi-grade stations there was more uniformity in these criteria. (Erickson et al., 1992, p. 229)

In working with these different criteria for teacher judgements, Erickson and his colleagues are necessarily cautious in making claims about differences in student performance. However, they express greater confidence in the comparisons of the descriptive data. Pages 231 to 242 of the British Columbia Assessment of Science 1991 Technical Report II: Student Performance Component (Erickson et al., 1992) consist of tables of comparative data. The Contract Team reports many similarities in performance:

For Sound Board the modal response category is the same for all three descriptive questions; for Rolling Bottles it is the same for two out of five questions; and for Magnets it is the same for three out of four questions. Likewise, if one creates a rank order in terms of the frequency of responses in these categories, a similar pattern is obtained from these descriptive questions. (Erickson et al., 1992, p. 243)

These findings were clearly not anticipated. The Contract Team makes two conjectures:

The first is that the tasks did not require the kind of abilities and knowledge beyond the sort of "everyday knowledge" that most Grade 4 students have already constructed and hence the tasks did not discriminate in ways that we thought they might. A second conjecture is that the teachers implicitly interpreted the data in ways appropriate to the age of the students with whom they worked. (Erickson et al., 1992, p. 243)

These two hypotheses serve different purposes in explaining the findings.
The first is seen as most suitable in accounting for the many similarities in student performance in the assessment tasks, particularly when the Grade 4 students demonstrate "considerable physical and linguistic experience with objects like stringed instruments, rolling objects, magnets, and insects in different stages of their life cycle" (Erickson et al., 1992, p. 243). Only where the task requires a more elaborate response is there evidence of substantive differences between the younger and older students. The student explanations in "Rolling Bottles" are used to exemplify this point, as older students showed a "greater degree of linguistic fluency than is available to most Grade 4 students" (Erickson et al., 1992, p. 243).

The second conjecture, the age-appropriateness of teacher interpretations of the data, most likely explains the nature of the criteria developed by teachers for the judgements. This is evident in that many of the criteria for higher ratings require students to provide extended responses or greater numbers of responses. The Contract Team considers that:

these types of criteria are more difficult for Grade 4 students to meet because they generally lack the verbal fluency of the older students and they simply do not work and write as quickly. (Erickson et al., 1992, p. 243)

The data analysis and an extended discussion of the investigation results are presented in Chapter 6 of the Technical Report (Erickson et al., 1992). The Contract Team saw this section as an appropriate place for discussion of the question: "What is developing over the three grade levels as students are engaged in these types of open-ended investigations?" The results for the investigations parallel those for the stations in that there are many similarities in the performance of students in all three grade levels.
The Contract Team reports that:

in both investigation tasks one is immediately struck with the apparent lack of differences between the students at the three age levels especially as it pertains to the planning and performance aspects of the tasks. (Italics in original. Erickson et al., 1992, p. 243)

Detailed examination of experimental approaches for each investigation shows similar choices of both strategies and control of variables. The Contract Team interprets this as strong evidence in support of the use of complete investigations with younger students. The Contract Team also makes explicit that this evidence challenges the position of those favouring a step-up approach from "process skills" to investigations:

What does appear to be changing for older students is: the construction of more elaborate and powerful explanatory models that are used to frame experiments of the type that the teachers observed in this project. (and) ...we do see an increase in the abilities of students to perform adequately some experimental abilities such as conducting appropriate measurements with care and precision, identifying and controlling variables thought to be important to the outcome of the experiment, and finally providing an interpretation of the data and communicating that interpretation to others. (Erickson et al., 1992, p. 244)

Inter-coder Consistency Ratings for the Stations

The reliability of teacher coding and rating of student performance is a major issue in the use of performance assessments; indeed warnings about the use of holistic scoring systems have been made. Bryce and Robertson contend that:

teachers should no longer be urged to judge or rate "holistically" their pupils' performances on "experiments" or "scientific investigations". In a comprehensive examination of the international literature we have recently shown that, however desirable, there is no demonstrable evidence of validity and reliability in currently available versions of assessment by holistic teacher-judgement.
(1986, p. 63)

In the Student Performance Component of the 1991 B.C. Science Assessment, steps were taken to maximize the consistency of the teacher coding and evaluations. A structured set of procedures saw teachers working with practice sheets to develop the criteria for their judgements; this was followed by coding of actual student responses, and then checking consistency with yet another set of response sheets.

The Contract Team reports that for the stations the inter-coder consistency is acceptable (see for example Baxter et al., 1992), and in the range 0.83 to 0.86 for all codings and 0.62 to 0.81 for the teacher judgements. (The term "inter-coder consistency" refers both to descriptive and to evaluative codings completed by the teachers. When data are presented for consistency of the teacher judgements, these figures are synonymous with inter-rater reliability.) Grade 4 teachers show greater levels of consistency and Grade 10 teachers lower levels; this difference is considered as:

likely to have arisen from the widely different set of criteria each group of teachers had developed and the increase in complexity of the criteria used by the Grade 10 teachers. (Erickson et al., 1992, p. 248)

Coding consistency is considered an essential precursor to the subsequent issue of comparing student performance across tasks. The Contract Team believes that the consistency (reliability) is sufficient to proceed to this next step.

Student Performance Across Tasks

Although there are 12 stations for each grade, these were arranged into two circuits of six stations, with each student completing only one of the circuits. This arrangement of tasks enables comparisons of an individual student's performance within circuits A or B rather than between the circuits.
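Within-circuit comparisons of this kind amount to correlating students' judgement scores across the stations of one circuit. The following sketch illustrates the idea with a hand-computed Pearson correlation matrix; the ratings shown are hypothetical, and the assessment's actual analysis was carried out with SPSS rather than in this form.

```python
# Illustrative sketch only: a correlation matrix of judgement scores for one
# circuit. Each row holds one (hypothetical) student's 1-5 ratings, one per station.
import math

def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

def correlation_matrix(rows):
    """rows: one list per student, holding that student's rating on each station."""
    stations = list(zip(*rows))  # transpose: one tuple of ratings per station
    return [[pearson(a, b) for b in stations] for a in stations]

# Four hypothetical students rated on a three-station circuit.
ratings = [
    [3, 4, 2],
    [5, 5, 3],
    [2, 3, 4],
    [4, 4, 2],
]
matrix = correlation_matrix(ratings)
```

A matrix of this shape underlies the patterns discussed below: strong correlations between judgements within a station, and very low correlations between stations that assess similar "skills" in different content areas.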
As stated above, the Contract Team is confident enough in the reliability of the coding to discuss the issue of student performance across the tasks. The consistency of student performance across stations is evaluated by the use of correlation matrices of individual student scores for each circuit. The Grade 7 matrices are discussed in the body of the Technical Report (Erickson et al., 1992) while the other matrices are published in the appendices to the report. The patterns for each of the six matrices are similar. These are:

1. Where there is a strong correlation, greater than 0.5 (p < 0.001), between student performances, it is usually between two judgements within the same station.

2. Correlations between stations with similar objectives or "skills" set in differing content areas are very low.

The Contract Team concludes that these data indicate that performance is influenced by a strong context effect "in the type of content knowledge and experience embedded within the task" (Erickson et al., 1992, p. 251); a conclusion that explains the within-station correlations. For the "between skills" correlations the Contract Team proposes that:

These and other low correlations between judgements based on the assessment of similar "skills" in different stations suggest that the assessment of student performance may be more dependent upon the actual task context and less dependent upon the student possessing some generalizable "skill" or ability that can be applied equally well to a variety of tasks. (Erickson et al., 1992, p. 251)

Correlation matrices of the five teacher judgements in the investigations are seen by the Contract Team as "strongly supportive of the above claim that student performance appears to be very dependent upon the content understanding brought to these tasks by the students" (Erickson et al., 1992, p.
252).

Atomistic versus Holistic Scoring

The options for scoring student performance in hands-on assessment appear to range from atomistic or analytical at one extreme, to holistic at the other. The reliability of holistic scoring in science has been questioned by Bryce and Robertson (1986), but recent work by Baxter, Shavelson, Goldman, and Pine (1992) provides convincing evidence for the reliability of some holistic scoring methods. In explaining its decision to choose the holistic approach, the Contract Team writes:

We have reviewed the literature, talked to the proponent of each perspective, solicited the views of local teachers and researchers, and examined our own proclivities and decided to be holistic. (Erickson et al., 1992, p. 252)

Holistic scoring enabled the Contract Team to ask the teachers to consider each station and make "a global judgement based on their professional experiences of students at that grade level" (Erickson et al., 1992, p. 252). In recognizing the professional judgements of the teachers in developing criteria for scoring rubrics, the Contract Team concedes that these criteria may not be viewed as acceptable by teachers from other school districts. The Contract Team concludes this section with a plea:

We, at all costs, wanted to avoid the return in science teaching to the days of the Science as Process Approach where individual so-called "process skills" were taught separately. (Erickson et al., 1992, p. 253)

Project Recommendations

Although the British Columbia Assessment of Science 1991 Technical Report II: Student Performance Component (Erickson et al., 1992) contains the significant details of the procedures and the results, it is the chapter entitled Performance Assessment Component (Erickson, Carlisle, & Bartley, 1992) in the British Columbia Assessment of Science: Provincial Report 1991 (Bateson et al., 1992) that presents detailed recommendations for science educators and policy-makers in British Columbia.
The four recommendations that appear in the concluding section "Looking to the Future" are written in terms of actions that will lead to improved student performance in a range of contexts. These recommendations are:

1. Given the students' demonstrated abilities and interest in these assessment tasks, students should be given more opportunity to generate questions and to seek answers from their own investigations.

2. Given that students do not report results well, particular attention should be given to this aspect of investigations.

3. Given the positive student response to these tasks, we would encourage teachers to use more of these types of performance tasks in their assessment procedures.

4. Given the importance of content knowledge in the planning and execution of investigations, it is critical that the acquisition and use of relevant content knowledge be recognized and encouraged.

(Erickson, Carlisle, & Bartley, 1992, p. 39)

The first three recommendations are proposals for change in the practice of science teaching, while the fourth recommendation advocates a change in thinking about school science, and has significant implications for elementary school curricula.

Because the Student Performance Component of the 1991 B.C. Science Assessment was successful, a package entitled Science Program Assessment Through Performance Assessment Tasks: A Guide for School Districts (Bartley, Carlisle, & Erickson, 1993) was developed. Ministry personnel are employed in the dissemination of this resource around the Province, and several districts have developed plans for their own assessment procedures.
The package is self-contained and enables districts to collect and interpret their own data with the use of common microcomputer software.

CHAPTER 4 — A FRAMEWORK FOR THE VALIDATION OF THE USE OF PERFORMANCE ASSESSMENT IN SCIENCE

VALIDITY IN THE ASSESSMENT CONTEXT

In this chapter I describe and exemplify the essential components of a systematic framework for the development and administration of performance assessments in science. In my view, such a framework should attend to all significant decision-points right from the initial planning stages through to the analysis of both short and long term consequences of the assessment practice. I believe that employment of such a framework by an assessment developer, or user, is sufficient to validate the inferences and consequences of a specific assessment activity. The formal requirements for validation inquiry direct the focus towards an evaluation of the evidence and theoretical justifications to support the suitability of the inferences and actions based upon test scores (Messick, 1989b). I intend to substantiate the claim that the validation should extend from genesis to conclusion.

This chapter begins with an interpretation of current conceptualizations of validity (Messick, 1989a, 1989b; Moss, 1992; Shepard, 1993) which led to the identification of a set of criteria for the validation of performance assessments, previously described in Chapter 2 (Linn et al., 1991). As the criteria developed by Linn, Baker and Dunbar do not address all the issues that arise in the validation of performance assessment in science, I propose a more specific set of questions set in the context of school science.
The use of these questions is exemplified by an evaluation of the validity of the inferences made in the British Columbia Assessment of Science 1991 Technical Report II: Student Performance Component (Erickson et al., 1992) and subsequent developments in science education in British Columbia.

The definition of validity used in this study is that presented by Messick:

validity is an integrated evaluative judgement of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment (Bold added. Messick, 1989b, p. 13).

The focus upon an integrated judgement of inferences and consequential actions presents a sharp contrast to the traditional separation of validity evidence into content, criterion and construct. The association, or lack thereof, between these "types" is represented in Figure 11.

Figure 11. Traditional "Types" of Validity Evidence

Figure 12 illustrates an integrated view of validity based upon Messick's progressive matrix (1989a) where all "types" of validity evidence are subsumed under construct validity.

Figure 12. An Integrated Model of Validity Evidence

The significant sources of evidence for an examination of construct validity must include interpretations of score meaning, and also the consequences arising from the value implications of evaluators, teacher-administrators and students, as well as the social consequences of testing. Messick proposes that validation evidence should be collected to answer questions (1989b). Shepard identifies the central validity question as "What does the testing practice claim to do?" (1993, p. 429); other general questions, including "How well does the testing practice do what it claims to do?" and "What does the testing practice do beyond that which is claimed?" are implied. These general questions suggest which specific questions should be used for validation in a particular assessment context.
In much the same way as the values of those involved in the management of a testing program lead to specific decisions and interpretations, the questions asked in a validation inquiry will manifest some value system.

The dominant values in the measurement community have been described as "the well established psychometric criteria for judging the technical adequacy of measures" (Linn et al., 1991, p. 16). These well established criteria include efficiency, reliability and comparability of assessment from year to year. Linn, Baker, and Dunbar speculate that such criteria would almost always favour the traditional multiple-choice assessment tasks in any comparison with the newer alternatives. They argue that as there is expansion in modes of assessment there must be expansion in the criteria used to judge the adequacy of assessments. In particular, Linn, Baker, and Dunbar caution that:

Reliability has been too often overemphasized at the expense of validity; validity has itself been viewed too narrowly. (1991, p. 16)

Linn and his colleagues propose a set of validity criteria focused specifically upon performance assessments; these criteria relate to consequences, fairness, transfer and generalizability, cognitive complexity, content quality, content coverage, meaningfulness, and cost and efficiency. Moss (1992) finds these criteria to be within general standards (AERA, APA, & NCME, 1985). However, Messick (1994) warns that these criteria are more limited and "may become a problem for validation practice". In particular, Messick is concerned that this set of criteria might lead to insufficient emphasis upon "score interpretation and its value implications" (1994, p. 13).

A VALIDATION FRAMEWORK

The appraisal of the implications of underlying values is an important component of the understanding of validity in this study, and represents a significant expansion beyond the appraisal of score meaning.
As the values of all the people involved in an assessment affect actions and inferences, the choice of questions in a framework for investigating validity must be examined as part of some value system. The questions can also be considered sequentially as they relate to concerns that should be examined at each stage of planning, administration, scoring, interpretation and recommendations, and subsequently in reflection. The set of questions presented below meets the criteria that Linn, Baker, and Dunbar have proposed. In addition, it addresses the issues of score interpretation raised by Messick (1994). These eight focus questions were written with the specific purpose of guiding a validation inquiry of the Student Performance Component of the 1991 B.C. Science Assessment but are relevant to any hands-on assessment. They include:

(1) Purposes of the Assessment:
What are the explicit purposes of the assessment?
What are the operationalized purposes of the assessment?

(2) Learning and Communication in Science:
What models of science learning and communication are promoted by this mode of assessment?

(3) Content Analysis:
Are the assessment tasks appropriate for the students and within the curriculum?

(4) Instrumental Stability:
Does the equipment behave consistently over time?

(5) Administration Stability:
Are the administration procedures clearly developed and applied consistently?

(6) Internal Consistency and Generalizability:
What factors affect the generalizability of student performance across tasks in a science assessment?

(7) Fairness:
Do the assessment data indicate any bias towards or against any specific identified group?

(8) Consequences:
Were the intended consequences of the assessment achieved?
What were the unintended consequences?
What actions have been taken to support the "good" unintended consequences and abate the "bad" unintended consequences?

The analysis presented in this chapter explores the value implications of both these questions and the nature of responses to them.
Purposes of the Assessment

While it might appear that validation focuses solely on the after-effects of testing, the identification of purposes and intended consequences is a vital precursor in planning any assessment program. Shepard's question "What does the testing practice claim to do?" (1993, p. 429) reverberates through the inquiry. It is pertinent to ask "What does the testing practice intend to do?" as it is this question that guides the decision to assess, as well as the design of the assessment. Thus, in examining purpose, there appear to be two considerations, the "grand design", and the operational aspects of the assessment. This leads to two focus questions:

1. What are the explicit purposes of the assessment?
2. What are the operationalized purposes of the assessment?

The Ministry of Education of British Columbia funds and "owns" the provincial assessments. The Ministry identifies the purposes and modes of these assessments, but each actual assessment is conducted by an independent contract team which reports to the Ministry of Education. The Ministry of Education identified the following objectives for the 1991 B.C. Science Assessment:

• Describe to professionals and the public what children of various ages CAN DO in the curricular area of Science
• Examine CHANGES in student performance over time, and provide baseline data for the proposed changes in educational programs
• Describe RELATIONSHIPS among instructional activities, use of materials, teacher and student background variables, and student performance and attitudes
• Describe the differences in student performance and attitude that are related to GENDER
• Provide EXEMPLARS OF ASSESSMENT TOOLS that can be used by teachers, schools, and districts for assessment of science processes and performances
• Provide directions for educational RESEARCH
• Suggest areas of need, and provide direction for decisions regarding both in-service and pre-service TEACHER EDUCATION. (Emphasis in original. Erickson et al., 1992, p.
1)

The Contract Team for the Student Performance Component of the 1991 B.C. Science Assessment needed to operationalize these objectives. Erickson (1990) had explicitly identified the "constructivist perspective on learning" as the position from which the team would develop the assessment; this translated directly into the "CAN DO" objective receiving priority. Other objectives were also given major emphasis in the Contract Team's development of the assessment, particularly those related to "GENDER", "CHANGES" and "EXEMPLARS".

The assignment of high priority to the "CAN DO" objective had significant implications in the development of the tasks, the administration of the assessment, the coding and scoring of the data, and the interpretation of the results. Two types of hands-on task, "stations" and "investigations", were developed for the assessment. There was extensive piloting and discussion with students about the tasks, with many students reporting a personal sense of success in their own achievements. Tasks were designed with the intention of enabling all students to complete at least some part. A range of data collection procedures was developed and included teacher observations, student open-ended and structured pencil-and-paper responses, and interviews.

Those teachers who were to administer the assessment took part in a three-day orientation workshop. In this workshop they completed both stations and investigations under the proposed assessment conditions so that they would appreciate some of the complexities of the tasks, and be better able to assist students if/when equipment did not perform as required. The same teachers returned for a further workshop to code and evaluate the data describing student performance. Descriptions of student responses were entered onto coding sheets derived by the Contract Team from a sample of actual student response sheets.
The Contract Team also identified one or two major facets of each station, in terms of questions that the teachers should use to evaluate student performance. The common format for these questions was "In your judgement, how well did the student...?" Teachers were asked to rate the students on the five-point scale shown as Figure 9 (page 80). The criteria to describe student performance at each point on the scale were developed by examination of response sheets and discussion within grade-level groups of teachers.

The data were aggregated and are reported in percentages of students who made specific types of response. The evaluative questions are reported in tables, showing the percentages of students who achieved each level of performance. The Contract Team discusses these results in terms of the percentage of students who were evaluated as "satisfactory or better" (levels 3, 4 and 5) or "below satisfactory" (levels 1 and 2).

The technique for the description of student work in the investigations paralleled that for the stations, but used five questions by which teachers judged how well students planned their experiment, developed a measuring strategy, interpreted data, and reported data, with a final integrated judgement of students' overall performance.

An alternative interpretation of the "CAN DO" objective might have led to different task structures. For example, a set of tasks each with a single correct response would have led to the results being reported in terms of Pass/Fail. Similarly, the tasks might have remained similar but analytical scoring could have been used with credit given for success in each major component of the task, an approach demonstrated in Performance Assessment: An International Experiment (Semple, 1992).

Other goals of the assessment were addressed as the assessment was planned, conducted and reported.
Thus, tasks which covered a specific curriculum area considered essential, such as magnetism or electricity, were examined for gender bias, e.g., in familiarity with the equipment to be used, and changes were made to create tasks believed to be more gender-sensitive. Materials from a "female sphere of experience", e.g., paper-clips or hair pins, were used to balance other materials that were considered more likely to be encountered in the "male sphere of experience", e.g., steel washers. As this was the first assessment of this type in B.C., demonstration of change over time was not possible. The Contract Team chose to present similar tasks to students in two or more grades. It was hoped that this approach would enable the Contract Team to describe qualitative differences in student performance.

The elaboration of the purposes of an assessment is a multi-leveled process. While the "grand purposes" are passed down from the Ministry of Education and give direction to the assessment, the operational purposes can be elucidated only as the assessment is developed and negotiated between those involved. In the case of the Student Performance Component of the 1991 B.C. Science Assessment, the Contract Team, Ministry officials, teachers and students all participated in developing the working definitions of the purposes of the assessment.

Learning and Communication in Science

The influence of assessment upon instruction has been reported by many authors (e.g., Lovitts and Champagne, 1990; and Cole, 1991). Baron (1991) writes of tests as "magnets for instruction" and comments on the limiting effects of multiple-choice tests. Many of the arguments for performance assessment revolve around the differences in curriculum emphasis that come from tasks in which students produce solutions rather than merely recognize or reproduce them (Newmann and Archbald, 1992).
The focus question in this part of the validation inquiry examines the face validity of the tasks, i.e., the explicit message about science communicated to students and their teachers. The focus question for this section is:

What models of science learning and communication are promoted by this mode of assessment?

In his 1990 paper The Assessment of Students' Practical Work in Science Erickson identifies the constructivist perspective as central in many aspects of the assessment. Erickson also presents the framework which is used to describe student performance in the assessment. Six dimensions of student performance are proposed, each amplified by a set of more specific abilities. The dimensions identified by Erickson were shown in Figure 5 (page 72).

Stations, which typically occupy students for seven minutes, were examined by Contract Team members in an effort to identify the "dimensions" and "abilities" that are pertinent to the task. It was found that many of the stations are multi-dimensional, with an emphasis upon Dimensions 1 to 4. Not surprisingly, the investigations show an emphasis upon "planning experiments" (Dimension 5) and "performing experiments" (Dimension 6). Mapping of dimensions to type of task is shown in Figure 13.

(1) Observation and Classification of Experience
(2) Measurement
(3) Use of Apparatus or Equipment        } Stations
(4) Communication
(5) Planning Experiments
(6) Performing Experiments               } Investigations

Figure 13. Mapping of Performance Assessment Dimensions to Task Type

In selecting stations and investigations for use in the assessment, the Contract Team applied several criteria. Those relevant to this focus question are shown in Figure 14.

Engagement:
Is the task interesting and likely to engage the student?
Appropriate to student abilities:
Does this task assess different abilities from the other stations?
Range of appropriateness:
Is the task suitable for more than one grade level?

Figure 14.
Criteria for the Selection of Tasks (Taken from Erickson et al., 1992, pp. 10 & 12)

Student Engagement

Student engagement as an explicit criterion for station selection might appear to be self-evident, but discussing the features of "good" tasks with students during pilot-testing reinforced the importance of this focus to the Contract Team. Comments collected from two Grade 4 students by a teacher during the assessment are shown in Figure 15.

"I realy enjoyed this study. The reason I liked it is because you got to see for yourself and if you make a mistake nobody is going to get mad at you and give you a detention for forgetting something"

"I loved it because I learned thing I did not know"

Figure 15. Grade 4 Student Comments about the Assessment Tasks

The positive comments of these two students were echoed by observations made by some of the teachers who had collected the data:

...we were not surprised by the comments from the kids: "You mean, this is science? I like this science." (female teacher);

...people were really positive and we did experience what you suggested might happen in one class where we didn't intend to test all of the kids -- we didn't need all of the kids. But the teacher said, "Oh, they'll feel really badly, please come back and do test them." (female teacher);

...we asked each student, would you rather do science this way or the conventional way? And the response was about 99% this way, so they were just thrilled with the idea. And, you know, we were thinking that we had never seen such enthusiasm for a math test or a reading test. (male teacher);

They walked out of that room, nearly every one, thinking, "Oh, I'm a scientist". And I would love to try and capture that kind of similar experience in our classroom. It was a big thing I walked out of that situation with, this is great, and it was very special to have that time with those kids.
(male teacher).

Very few students, typically less than 5%, did not respond to any part of a station task. Many of the stations elicited responses from 100% of the students, with several (Blocks and Boxes, Making Measurements and Paper Testing in Grade 4; Cool It and Comparing Seeds in Grade 7; and Circuit Building and Cool It in Grade 10) leading to over 90% of students being judged as having performed at "satisfactory or better" levels.

A different strategy to engage students is manifested by the type of questions posed in the investigations. The instruction format on the student response sheet for the two investigations is similar; that for Magnets is shown as Figure 16.

Magnets

You have in front of you three magnets: blue, silver and black.
This is what you have to find out:
Which magnet is the strongest?
You can use any of the things in front of you.
Choose whatever you need to answer the question.
Make a clear record of the results so that someone else can understand what you have found out.

Figure 16. Student Instructions for Magnets Investigation

The question for Paper Towels is "Which kind of paper holds the most water?" (Gott and Murphy, 1987; Erickson et al., 1992). Such questions are very open-ended and allow students considerable latitude in their interpretation and experimental design. Analysis of the logical structures within the investigation tasks, and observation of students performing the investigations, enabled teachers and Contract Team members to identify the operational question that guided the students' experiments. For example, some students interpreted the paper towel question as "How much water can each paper physically hold (as a container)?" but others saw this question in terms of "How much water can be squeezed out of the wet paper?".

Assessing Different Abilities

In essence, this question addresses the face validity of the tasks. This is seen by Linn, Baker, and Dunbar (1991) as insufficient evidence upon which to hang the whole validation procedure.
But, for the student with a different perspective towards the assessment process, the consequences of encountering tasks that are interesting and of educational value in their own right must be considered. For example, the Grade 4 "Circuit A" saw students sorting beans, hitting strings on a sound board and listening to the sounds, testing simple materials for electrical conduction, moving a "puzzle box" to make inferences about the contents, listening to a personal cassette player to receive instructions for an experiment, and comparing how different pieces of wood float in water and salt water. These multi-sensory and tactile experiences were intended to convey the breadth of "hands-on" science to students and their teachers. Feedback from students and teachers indicates that this goal was achieved.

Range of Appropriateness

The intent of this consideration was to demonstrate to teachers that growth in students' science abilities may be observed using a set of common tasks. Three stations (Sound Board, Rolling Bottles and Magnet Power), and two investigations (Paper Towels and Magnets), were used at all three grade-levels. The stations were pilot-tested with students in Grades 4, 7 and 10. It was found that the Grade 4 students had greater problems in developing experiments to identify the stronger of the two magnets. Conversely, many of the Grade 10 students completed the experiment quickly and accurately. In order to provide a differentiated degree of difficulty, the Contract Team decided to use pairs of magnets that were very different in strength for Grade 4 students, closer in strength for Grade 7 students, and very close for Grade 10 students; each magnet was within 5% of the mass of the other magnet and of the same dimensions.

Promotion of Models of Communication

The abilities to read and write are highly valued in our society and the written word remains the favoured medium for most formal education.
As students progress through school there is the quite reasonable expectation that their abilities to communicate will improve, based upon developments in writing skills and an increase in vocabulary.

The data collection techniques for our stations depend completely upon the students being able to construct written responses to the questions. Coding and scoring of the stations was based solely upon the students' written responses. One station at Grade 4 (Blocks and Boxes), and a station used at both Grades 7 and 10 (Cool It), were devised to use a personal cassette player to present the instructions. A station common to all three grades (Magnet Power) requires students to draw a picture or diagram to show what they did in an experiment.

In the investigations there are three types of data: observation schedules completed by teachers, student descriptions, and teachers' notes of interviews with students at the conclusion of their investigation. Reliance upon student written responses is significantly reduced, as the principal source of descriptive data comes from the teachers through the observation schedules, and to a lesser extent through the interviews. This dependence on the teachers' descriptions of the students' experimental procedures is justified by comments from teachers at a debriefing session after the data collection:

After the investigations and you're asking them questions, I always came back to the teacher in me, rather than the assessor and asking, "Have you ever done anything like this, this year?" And in every case the answer was "No". And then I asked, "If you had to write this up as a formal write-up, could you do that?" and the answer in every case was "No." This was Grade 7. So, they hadn't done anything like what we were doing with the magnets or the paper towels, which they really liked.
(male teacher)

The teachers of Grade 10 students also reported that their students were unfamiliar with such open-ended experiments, but these students appeared better able to describe their procedures. The Grade 4 teachers chose to analyze the interview data as an oral report of the students' experimental procedures, an approach believed to be fairly consistent with classroom practice.

A further consideration of communication comes in the examination of the criteria used to describe each level of performance. These criteria were constructed for appropriate grade-levels by the teachers as they were coding the student responses. A consistent theme that emerged across all three grades was that extended responses were more highly rated. In the discussion of the results of the Grade 4 task "Sorting Beans" the Contract Team wrote:

The teachers' judgments as to how well the students observed differences among the beans is puzzling. Sixty-one percent of students were judged to have performed at a satisfactory or better level, while only 1% were judged to have performed "extremely well." Teachers judged student performance by the number and complexity of the criteria they used. Category 3 required 1 criterion, Category 4 required 2 criteria while Category 5 required 3 or more criteria. We think that "more is better" is a teacher judgement which disadvantages students. Some students may well value one salient criterion as an appropriate way to observe and designate similarities. It is in this adult perspective of appropriateness that the student performance may be undervalued. (Erickson et al., 1992, p.
19)

The Contract Team made similar observations at Grade 10 for the station "Sound Board":

The Contract Team thinks that the criteria developed for this latter judgement have been too heavily weighted toward producing extensive explanations of the various relationships involved in this station which tends to undervalue those students who opted for more parsimonious explanations such as simply referring to differences in the vibrations of the strings. (Erickson et al., 1992, p. 143)

The message that students should engage in extended communication in responding to some of these tasks was clearly unanticipated by the Contract Team and perhaps reflects some differences in values. While such requirements may represent the practice in many classrooms, the Contract Team felt it appropriate to comment upon the apparent unfairness of some of the sets of criteria.

Content Analysis

Linn, Baker, and Dunbar (1991) consider that there are two components to be emphasized in considering the content of an assessment: first the coverage of the domain of interest by the content in the assessment, and second the quality of that coverage. These concerns are addressed by the question:

Are the assessment tasks appropriate for the students and within the curriculum?

British Columbia has a provincial curriculum for science. However, many educational initiatives, in particular those leading to the proposals outlined in Year 2000: A Framework for Learning (Ministry of Education, 1989), have led to three disparate curriculum documents: The Primary Program (Ministry of Education, 1990), Elementary Science Curriculum Guide: Grades 1-7 (Ministry of Education, 1981) and Junior Secondary Science Curriculum Guide (Ministry of Education, 1985). The elementary science program in British Columbia therefore has two overlapping curriculum guides. The Elementary guide focuses upon curriculum materials that were already 15-20 years old in 1991, and are available in varying amounts in schools in British Columbia.
Conversely, implementation of the relatively new Primary Program document suffered from a paucity of resources in 1991. Consequently, teachers of students up to Grade 4 were faced with the problem of presenting either the "old" or "new" curriculum with limited resources. Similar problems with the precision of the curriculum documents faced the teachers of Grade 10 students. The Contract Team was aware of these problems when choosing station tasks for the assessment. There was extensive cross-referencing to the curriculum documents to ensure that stations generally fitted with at least some provincial curriculum. Many more station tasks were presented to students for pilot-testing than were actually used in the assessment. Discussions with students, their teachers and teacher-members of an assessment review group led to the choice of stations that were used in the assessment. The traditional approach towards content validation was extended to include the voices and perspectives of students.

One of the tasks, "Rolling Bottles", produced very mixed reactions from teachers. Many who saw it were intrigued by the phenomenon of the bottle one-third full of sand rolling so slowly down the ramp. (When the bottles are rolled down the ramp, the bottle that contains most sand rolls faster than less-full bottles; the one-third full bottle rolls slowest. When the empty bottle is introduced and rolls faster than partially filled bottles, the explanations become interesting.) The Contract Team presented this task to students in Grades 4, 7 and 10 with the intent of probing what students consider important in this situation. The Contract Team identifies the following 'objectives' for this station:

1. Students observe three bottles filled with different amounts of sand roll down the ramp, and identify the fastest.
2. Students explain why one bottle went fastest.
3. Students observe an empty bottle roll down the ramp and compare its speed with one bottle previously rolled.
4.
Students explain why one of the two bottles went faster.
(Erickson et al., 1992, p. 61)

This is a visually stimulating task and has been demonstrated by members of the Contract Team in seminars and at conference presentations. On two separate occasions, members of the audience, both with an interest in physics, have informed the presenter that this task is totally inappropriate for school students, perhaps even for university undergraduates! Both forcefully made the point that the physics of the task is so complex that no student would be capable of understanding the dynamics of the bottles. The presenters responded by pointing out that the intent of the task is to examine the sense that students make from their observations in this context. This type of debate about the quality of tasks demonstrates the significance of different values in evaluating both task quality and curriculum appropriateness.

Content quality and cognitive complexity must also be evaluated for the investigation tasks. The advantage of open-ended questions was discussed earlier in this chapter in terms of how the students were able to interpret questions at a range of levels. Analysis of the aggregated data of the students' work shows another effect:

In both investigation tasks one is immediately struck with the apparent lack of differences between the students at the three age levels especially as it pertains to the planning and performance aspects of the tasks.

The Contract Team concludes that:

...these data demonstrate the capabilities of Grade 4 students to plan, to perform, and to interpret the results of open-ended investigations of this nature. Furthermore, it would seem that their experimental strategies are very similar to those adopted by the older pupils. Our findings indicate that the most common experimental approaches, at least for magnets and the absorption qualities of paper, are constructed by pupils at a much earlier age than many theorists or curriculum developers have predicted.
Thus one curricular approach, which is firmly rooted in many science programs, is that younger pupils (say up to the age of 11 or 12) should be engaged in less complex activities than a complete investigation. The implicit, and some explicit, message is that younger students need to develop first the more basic "process skills" of science such as observing, classifying, measuring, and inferring before they can proceed to the more complex and sophisticated reasoning characteristic of conducting complete and valid investigations. Our data contradict this position. (Erickson et al., 1992, p. 243)

This interpretation of the data has important implications for curriculum planning, particularly in elementary science. It is consistent with perspectives articulated by Millar about general cognitive processes (1991), and Woolnough about the value of introducing elementary students to investigations (1989). Differences between younger and older students reported by the Contract Team are:

...the construction of more elaborate and powerful explanatory models that are used to frame experiments of the type that the teachers observed in this project...

...an increase in the abilities of students to perform adequately some experimental abilities such as conducting appropriate measurements with care and precision, identifying and controlling variables thought to be important to the outcome of the experiment, and finally providing an interpretation of the data and communicating that interpretation to others. (Erickson et al., 1992, p. 243)

Alternative interpretations of certain aspects of these data might lead to the following questions:

1. Did the investigations require sufficient curriculum content knowledge to enable the older students to show the depth of their knowledge?
2.
Was the language of the questions appropriate to elicit cognitively complexresponses from students capable of such responses?The two investigation tasks were very similar in structure and provided convergentevidence which led to the inferences made by the Contract Team. However, it may be- 111 -argued that more complex tasks would have provided data that could more clearly identifythe limits of performance of Grade 4 students and pose more of a challenge to the olderstudents.Instrumental StabilityWhen equipment is used in assessments it is necessary to ensure that the equipmentused by one set of students will behave the same way over the time period of theassessment, and will also behave in the same way as that used by other sets of students.This is a vital consideration in the discussion of the reliability of the data collectionprocedures. The focus question for this part is:Does the equipment behave consistently over time?In preparing this assessment the Contract Team constructed 10 kits of equipment for eachgrade, including over 20,000 individual pieces of equipment. The Contract Team preparedthe 1991 British Columbia Science Assessment Peiformance Tasks Administration Manual(Bartley, 1991) which presents specific details of the equipment and procedures for set-upand administration. A small number of measurements of temperature (room and coldwater) were recorded on “Administrator’s Record Sheets” that were completed each timethe station tasks were used. These sheets included the instruction “If anything happens,such as equipment malfunction or breakage, that you feel may affect the students’performance in this part of the assessment, please record on this sheet”. Although over 50rotations across the three grades were recorded on these sheets there were no reported casesof significant equipment malfunction. 
Typical problems included discharged batteries for the personal cassette players, but as spares were supplied this particular problem was minimized.

More recently the author has been involved in presenting the assessment tasks to teachers in over a dozen school districts around B.C. and at the 1993 National Science Teachers Association Convention in Kansas City, Missouri. All of the equipment performed consistently with appropriate maintenance and replacement, e.g., batteries, calcium metal, pH paper, sandstone, vinegar, etc.

Administration Stability

There are two aspects of this perspective towards data collection. The first is related to the work of the Contract Team in providing clear directions for procedures to collect the data. The second is linked to the ability of the teachers, and the schools where the data were collected, to create similar environments in which students could work through the tasks. The focus question for this part of the inquiry is:

    Are the administration procedures clearly developed and applied consistently?

The 1991 British Columbia Science Assessment Performance Tasks Administration Manual (Bartley, 1991) was developed during the pilot-testing. In addition to the details of equipment discussed earlier in this chapter, the Manual presents a set of procedures for administration of the assessment. These procedures include scripted orientation guides for teachers to read to the students while the assessment is under way. These scripts were used by the Contract Team and teachers during the preparation workshop. At the post-assessment debriefing, teachers declared that there were no difficulties in the application of these procedures during the assessment.

The data were collected at a time when the union that represented teachers in many school districts was in dispute with the Ministry of Education.
Several of the teachers who administered the assessment reported problems in collecting data, and delayed going to schools until the climate was more harmonious. A different problem arose where the school administrators were not able to find sufficient space. One teacher recounted his experience:

    ...we asked for two quiet places to work. The other teacher was OK. He got a whole room, big tables and everything. I got a little room, they call it the smoker's pit and I got two little tables about this wide to put my stuff on, a couple of little chairs.

But this response was countered by another who spoke of the assessment being:

    ...generally well-received. One school, they moved the kids that weren't writing our test to the library, so they opened up the classroom as well. We used it. They were very accommodating.

Discussions with the teachers suggest that few students were prevented from performing in a good working environment. Administrative stability appears to pose no threat to the reliability of the data collection.

Internal Consistency and Generalizability

This consideration has been a fundamental criterion in the evaluation of the reliability and validity of traditional forms of testing. Linn, Baker, and Dunbar (1991) consider that this continues to be an important issue in the validation of alternative assessments. As the B.C. assessment was in the curriculum area of science, the factors that must enter the discussion originate in both the measurement and the science education communities. The focus question here is deliberately general, and interpretations will be articulated in the discussion:

    What factors affect the generalizability of student performance across tasks in a science assessment?

There appear to be three distinct levels for this discussion. The most straightforward of these concerns the stability of the data collection procedures and has been discussed earlier in this chapter.
Next there is the issue of consistency in the coding and scoring procedures. Finally there is scope for conceptualization in how to make sense of the data.

Consistency in Coding and Scoring Procedures

Coding consistency was given high priority by the Contract Team. The requirement for consistency was emphasized to the teachers during the coding workshop, and procedures were instituted to examine the consistency of teachers' coding on all stations. Details of the procedures are presented in the Technical Report (Erickson et al., 1992, pp. 246-248), but a brief summary is presented here. Essentially, each team of four teachers coded the same five student response sheets; these sheets were then examined for matching pairs. Results are reported in terms of percentage of matching pairs, i.e., degree of consistency. A coefficient of 1 indicates 100% agreement of the teachers.[1] The coefficients for percentages of teachers in agreement with each other are reported for all codings on each station, as are results on codings for the judgements alone. Further analysis of the judgement data to identify scores that were only one point from agreement shows a generally high degree of consistency. Table 2 illustrates the inter-coder consistency for the six stations[2] in Circuit A from Grade 7.

[1] If 3 out of the 4 teachers score the student at the same level, then 3 out of a possible 6 pairs are in agreement; the coefficient for inter-coder consistency is reported as 0.5.
[2] These stations are chosen as a representative sample of the inter-coder consistency data generated by this process. The Grade 7 judgements are presented here as they were subjected to extensive analysis in the Technical Report (Erickson et al., 1992).

Table 2
Inter-coder Consistency Coefficients for Grade 7 Circuit A

    Station                        All codings    Judgements     Judgements
                                   for station    exact level    ± 1 level
    7.1 Guess It                   0.97           0.88           0.12
    7.2 Sound Board                0.73           0.48           0.32
    7.3 Electrical Testing         0.89           0.65           0.23
    7.4 Environmental Testing      0.88           0.80           0.20
    7.5 Cool It                    0.92           0.93           0.07
    7.6 Floating Wood              0.72           0.62           0.30

These levels of inter-rater agreement are consistent with published acceptable levels for other performance assessments in science (Ruiz-Primo, Baxter, & Shavelson, 1993) and writing (Dabney, 1993). More recently Shavelson, Gao, and Baxter report that if raters are given appropriate training:

    The findings are consistent. Interrater reliability is not a problem. Raters can be trained to score performance reliably in real time or from surrogates such as notebooks [e.g., Baxter et al., 1992; Shavelson, et al., 1993]. (Shavelson et al., 1994, p. 2)

Consistency Across Tasks

This is perhaps the most contentious issue in analysis of performance assessments in science. In most applications to this point, the number of tasks is usually too small for conventional reliability studies. Generalizability theory (Cronbach et al., 1972; Brennan, 1983) enables the user to "parcel out" variance to controllable facets which may contribute to the error (e.g., occasion, scorer, task). Variance arising from tasks is the area of interest in the examination of consistency of performance. Linn, Baker, and Dunbar identify generalizability theory as providing "a natural framework for investigating the degree to which performance assessment results can be generalized" (1991, pp. 18-19). Extensive use of G theory has been made by the APU in England (Johnson, 1989), and in the U.S.A. by Shavelson and his associates (Baxter et al., 1992; Shavelson et al., 1992; Shavelson and Baxter, 1992; Shavelson et al., 1993; Shavelson et al., 1994).
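The pairwise agreement coefficient reported in Table 2 can be illustrated with a short sketch. This is a reconstruction for illustration only: the codes are invented, and averaging across sheets is one plausible station-level aggregation rather than the Contract Team's documented procedure.

```python
# Sketch of the pairwise inter-coder agreement coefficient of Table 2.
# Each team of four teachers coded the same response sheets; agreement is
# the fraction of the six teacher pairs that assigned the same level.
# The codes below are invented for illustration.
from itertools import combinations

def agreement(codes):
    """Fraction of coder pairs that assigned the same code."""
    pairs = list(combinations(codes, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

# Footnote 1's worked example: 3 of 4 teachers agree -> 3 of 6 pairs -> 0.5
print(agreement([2, 2, 2, 3]))  # 0.5

# One plausible station-level aggregation: average over the five shared sheets
sheets = [[2, 2, 2, 2], [3, 3, 2, 3], [1, 1, 1, 1], [4, 3, 3, 3], [2, 2, 2, 3]]
print(sum(agreement(s) for s in sheets) / len(sheets))  # 0.7
```

A coefficient of 1 corresponds to all six pairs matching, exactly as described in the text.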
G theory has been characterized as "a statistical theory about the dependability of behavioral measurements". Within this definition, dependability is seen as

    the accuracy of generalizing from a person's observed test score on a test or other measure (e.g., behavior observation, opinion survey) to the average score that the person would have achieved under all possible conditions that the test user would be equally willing to accept. (Bold added; Shavelson and Webb, 1991, p. 1)

The application of generalizability theory requires that the test user accept the extension of conditions from the domain of the test to whatever domain or universe s/he wishes to generalize. Generalizations, particularly in science assessments, depend heavily upon construct labels and the definition of the domain of interest.

Rather than pursue a G study with the data, the Contract Team for the Student Performance Component of the 1991 B.C. Science Assessment chose to examine correlation matrices to see if students performed at consistent levels over different stations. The judgement questions for these stations are shown in Figure 17.

    Station 7.1 Guess It
    A. How well did the student estimate properties of length, mass, volume, area and time?

    Station 7.2 Sound Board
    A. How well did the student observe differences between the strings (visual and aural)?
    B. How well did the student explain why the strings sound differently?

    Station 7.3 Electrical Testing
    A. How well did the student make inferences about the characteristics of conductors and non-conductors?
    B. How well did the student predict/justify whether an object will be a conductor or a non-conductor?

    Station 7.4 Environmental Testing
    A. How well did the student draw inferences from collected data?

    Station 7.5 Cool It
    A. How well did the student listen and follow instructions orally?
    B. How well did the student interpolate and extrapolate from data?

    Station 7.6 Floating Wood
    A. How well did the student observe how three wooden blocks float in water?
    B. How well did the student predict what would happen to the third block in salt water (based on prediction and explanation)?

Figure 17. Judgement Questions for Grade 7, Circuit A (Erickson et al., 1992, p. 249)

The correlation matrix for Grade 7, Circuit A is shown as Table 3. The matrix is discussed in the Technical Report:

    In circuit A only 6 out of the possible 45 correlations are statistically significant (1-tailed test of significance, p < .01). Of these 6 correlations there are 4 that represent correlations between judgements within stations, i.e. from stations where there were two teacher judgements about different aspects of a student's performance on the same station. The strong correlations (r > .5 with p < .001) for this circuit are both within-station correlations, on Sound Board and Electrical Testing. (Erickson et al., 1992, p. 251)

Table 3
Pearson Correlation Matrix for Grade 7, Circuit A Data, N = 115 students (Erickson et al., 1992, p. 249)
Entries are r with one-tailed significance p in parentheses; within-station correlations, marked *, appear within the "bold box" of the original.

            7.2A        7.2B        7.3A        7.3B        7.4A        7.5A        7.5B        7.6A        7.6B
    7.1A    .004(.481)  -.019(.421) .176(.030)  .048(.307)  .108(.125)  .151(.053)  .158(.046)  .007(.469)  .096(.156)
    7.2A                .648(.000)* .216(.011)  .122(.099)  .027(.389)  .190(.021)  -.117(.108) -.070(.229) .140(.068)
    7.2B                            .233(.006)  .169(.036)  -.053(.289) .141(.067)  .012(.447)  -.092(.163) .129(.086)
    7.3A                                        .544(.000)* .216(.011)  .259(.003)  .146(.060)  .073(.220)  .021(.414)
    7.3B                                                    -.047(.331) .102(.140)  .106(.131)  .027(.389)  .039(.340)
    7.4A                                                                .210(.012)  .169(.036)  .038(.343)  .150(.055)
    7.5A                                                                            .226(.008)* .098(.149)  .096(.155)
    7.5B                                                                                        .101(.141)  .018(.425)
    7.6A                                                                                                    .234(.006)*

Uniformly, the within-station correlations are statistically significant; student performance was consistent within any given station.
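The screening described in the quoted passage, computing all pairwise correlations and flagging those significant at the one-tailed p < .01 level, can be sketched as follows. The scores are randomly invented, not the 1991 B.C. data, and the critical value of r is an approximation for N = 115.

```python
# Sketch of the Table 3 screening: all pairwise Pearson correlations among
# judgement scores, flagging those past the one-tailed p < .01 critical value.
# Scores are invented illustrative data, not the 1991 B.C. assessment data.
import itertools
import numpy as np

rng = np.random.default_rng(0)
N = 115  # students per circuit
judgements = ["7.1A", "7.2A", "7.2B", "7.3A"]
scores = {j: rng.integers(1, 6, N).astype(float) for j in judgements}
# Mimic a within-station pair: 7.2B tracks 7.2A plus small noise.
scores["7.2B"] = np.clip(scores["7.2A"] + rng.integers(-1, 2, N), 1, 5)

R_CRIT = 0.217  # approximate critical r for one-tailed p < .01, df = 113

for a, b in itertools.combinations(judgements, 2):
    r = np.corrcoef(scores[a], scores[b])[0, 1]
    flag = "  <-- significant" if r > R_CRIT else ""
    print(f"{a} vs {b}: r = {r:+.3f}{flag}")
```

Running the same loop over all ten judgements would give the 45 correlations discussed in the Technical Report; note that the chosen critical value is consistent with the .216 (p = .011) entries in Table 3.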
The Contract Team suggests that student performance in these stations is probably most influenced by "the task context (i.e., the type of content knowledge and experience embedded within the task)" (Erickson et al., 1992, p. 251). Interestingly, correlations between student scores for judgements that purport to measure some aspect of a generalizable skill are very low:

    The most frequently discussed "skill" of science is "observation". In Sound Board students observe differences between strings, while in Floating Wood they observe how the wooden blocks float; the correlation is -.070. (Erickson et al., 1992, p. 251)

These results are consistent with the constructivist perspective that the Contract Team used in preparing the assessment framework. Millar and Driver present an eloquent argument against simply "process-based science" in their paper Beyond Processes (1987). The Contract Team supports this perspective in the Technical Report:

    We, at all costs, wanted to avoid the return in science teaching to the days of the Science as Process Approach where individual so-called "process skills" were taught separately. We believe that an emphasis on separate abilities or skills would encourage such an orientation. (Erickson et al., 1992, p. 253)

The evidence from the Student Performance Component of the 1991 B.C. Science Assessment and the work of Shavelson, Gao, and Baxter (1994) leads me to believe that the question of consistency across tasks depends upon the values (in the sense of a perspective about science) that drive the development of assessment tasks. If tasks are generally homogeneous and drawn from the same content domain, there is a strong likelihood of consistent student performance from one task to the next, i.e., the task scores will show a high degree of correlation and performance can be generalized, but only to further similar tasks (as for Shavelson et al., 1994).
If the tasks are designed to be heterogeneous, covering a range of dimensions or domains (as in B.C.), then consistency of student performance should not necessarily be expected. Sets of heterogeneous tasks are more likely to be valid in representing a complex universe, say a science curriculum, than are sets of limited but reliable homogeneous tasks.

Fairness

The issue of fairness should be considered in any assessment, but is mandatory in an assessment that is intended to be sensitive to gender issues. The question focusing on fairness is:

    Do the assessment data indicate any bias towards or against any specific, identified group?

The Contract Team attempted to be particularly sensitive to gender-related differences in student performance. There is consistent evidence to show that in the majority of science classes, fewer girls than boys handle science equipment (Kahle, 1988; Wienekamp, Jansen, Fickenfrerichs, & Peper, 1987). It was a concern of the Contract Team members that we present females and males with equal opportunities to demonstrate what they "CAN DO" by working through hands-on performance tasks. The pilot-testing indicated that the gender-related differences were subtle, and that only when complete data were collected and aggregated would differences, if any, appear. The stations were coded and scored using number identities; therefore the coders were not aware of the school location or the gender of the students.[4] In the assessment report, aggregated data are presented in tables like the one shown here as Table 4. The Contract Team chose to focus upon tasks for which differences in performance between males and females were greater than 15%. The reason for this choice is "pedagogical significance" (Erickson et al., 1992, p. 227) in that 15% represents a difference of three to five students in a class of 20 to 30 students.
With this condition it is reported that "similarities in performance between males and females were more evident than the differences" (Erickson et al., 1992, p. 223).

[4] Investigations were scored by analysis of the observation data, so the teacher/administrators were aware of the gender and identity of the students.

Table 4
Grade 10 Student Performance — Satisfactory or Better Rating by Gender, Stations 1 to 6 (Erickson et al., 1992, p. 226)

    Judgement                                                   Female %   Male %     Total %
                                                                (N = 55)   (N = 60)   (N = 115)
    10.1 Guess It: How well did the student estimate
         properties of length, mass, volume, area and time?        23         35         30
    10.2 Sound Board: A. How well did the student observe
         differences between the strings (visual and aural)?       77         77         77
         B. How well did the student explain why the strings
         sound differently?                                        36         44         41
    10.3 Circuit Building: A. How well did the student build
         series and parallel circuits?                             93         95         94
         B. How well did the student draw circuit diagrams?        95         94         94
    10.4 Environmental Testing: How well did the student
         draw inferences from collected data?                      75         58         65
    10.5 Cool It: A. How well did the student follow
         instructions given orally?                               100         96         98
         B. How well did the student draw a graph from data
         obtained from a table?                                    68         56         62
    10.6 Microscope: A. How well did the student manipulate
         the microscope to obtain images at the stated
         magnifications?                                           29         38         34
         B. How well did the student calculate the increase
         in magnification?                                         10         10         10

Some differences are reported and discussed in the Technical Report, for example:

    Station 4, "Environmental Testing," produced a small gender difference at Grade 7, but at Grade 10 more females (75%) than males (58%) performed at a "satisfactory or better" level. The Classical Component of the British Columbia Assessment of Science Provincial Report (Bateson et al., 1992) has demonstrated that females show more interest and concern for the environment. We believe that this finding may point to a possible explanation for the observed differences in performance in this station. The abilities of males and females to measure and interpret pH values were similar. The quality of the females' explanations for the change in acidity of a lake was rated higher by the teachers. (Erickson et al., 1992, p. 227)

All three of the stations used across the three grades show interesting gender-related differences in performance. Performance on all of the stations shown in Table 5 favoured females at Grade 4. However, by Grade 10 both "Rolling Bottles" and "Magnet Power" were seen to favour males. The Contract Team postulates that the differences for "Magnet Power" arose because of males' "prior interests and experiences in physics". However, if this is the case, why does this not manifest at an earlier grade? A focus upon the males' current interests and experiences in Grade 10 might be more convincing.

Table 5
Student Performance on Common Tasks by Gender (Derived from Erickson et al., 1992)

    % Satisfactory or Better Rating
                                                Grade 4            Grade 7            Grade 10
    Grade(s)    Station                         Fem.     Male      Fem.     Male      Fem.     Male
                                                (N=58)   (N=50)    (N=59)   (N=52)    (N=57)   (N=50)
    4           Making Measurements             99%      82%
    7 & 10      Instruments                                        34%      58%       55%      58%
    4, 7 & 10   Rolling Bottles                 71%      54%       28%      38%       44%      55%
    4, 7 & 10   Magnet Power — Judgement A      67%      53%       76%      56%       64%      73%
    4, 7 & 10   Magnet Power — Judgement B      52%      43%       76%      64%       76%      89%

As indicated earlier, the Contract Team chose to report only differences of 15% or greater. The number of judgements where the differences between females and males with satisfactory or better performance were greater than 15% is shown in Table 6.

Table 6
Stations Where the Percentage of Females and Males Judged to have Satisfactory or Better Levels of Performance Differs by More Than 15% (Derived from Erickson et al., 1992, pp. 224-226)

    Grade 4                            Grade 7                            Grade 10
    2 judgements favouring females     1 judgement favouring females      2 judgements favouring females
    0 judgements favouring males       1 judgement favouring males        1 judgement favouring males
    Total judgements = 19              Total judgements = 19              Total judgements = 19

These data show that the number of judgements for which there is a large gender-related difference in performance is only one or two per grade out of a total of 19 judgements per grade. However, if differences of 10% are considered pedagogically significant the overall picture changes. These data are shown in Table 7.

Table 7
Stations Where the Percentage of Females and Males Judged to have Satisfactory or Better Levels of Performance Differs by More Than 10% (Derived from Erickson et al., 1992, pp. 224-226)

    Grade 4                            Grade 7                            Grade 10
    5 judgements favouring females     5 judgements favouring females     3 judgements favouring females
    0 judgements favouring males       4 judgements favouring males      5 judgements favouring males
    Total judgements = 19              Total judgements = 19              Total judgements = 19

At Grade 4 females appear to perform better than males; in Grades 7 and 10, both sexes appear to perform at similar levels. It is debatable as to whether the level of reporting should have been 10% rather than 15%.
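The threshold screening behind Tables 6 and 7 can be sketched as follows; the (female %, male %) pairs are invented for illustration and are not the B.C. data.

```python
# Sketch of the Tables 6/7 screening: count judgements whose gender gap in
# "satisfactory or better" percentages exceeds a chosen threshold.
# The percentages below are invented, not the 1991 B.C. assessment data.
gaps = [("Judgement 1", 71, 54), ("Judgement 2", 55, 58),
        ("Judgement 3", 44, 55), ("Judgement 4", 75, 58)]  # (name, female %, male %)

def flag(threshold):
    """Return (count favouring females, count favouring males) past threshold."""
    fem = sum(1 for _, f, m in gaps if f - m > threshold)
    male = sum(1 for _, f, m in gaps if m - f > threshold)
    return fem, male

for t in (15, 10):
    f, m = flag(t)
    print(f"threshold {t}%: {f} favouring females, {m} favouring males")
```

Lowering the threshold from 15% to 10% changes the counts, which is exactly the sensitivity of the reporting level discussed here.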
Perhaps what should be considered is the potential significance of the trend identified in Table 5. When relatively few students perform at satisfactory or better levels of performance, then differences between females and males may be more important, which would require consideration of smaller differences. Conversely, when the majority of students, female and male, perform a task at a satisfactory or better level of performance, it may be more reasonable to consider only larger differences between females and males.

Consequences

The most important component of this validation inquiry is addressed in drawing together earlier sections of this chapter to identify and examine the consequences of the assessment — intended or not. There are three concluding questions:

    Were the intended consequences of the assessment achieved?

    What were the unintended consequences?

    What actions have been taken to support the "good" unintended consequences and abate the "bad" unintended consequences?

Intended Consequences Achieved

Explicit, intended consequences of the assessment were set out by the Ministry of Education and operationalized by the Contract Team. Discussion of the intended consequences of the assessment is presented here first in terms of the Ministry of Education and then from the standpoint of the Contract Team.

In 1992, the Ministry of Education published a document entitled Ministry Response to the 1991 British Columbia Assessment of Science (1992a) in which Ministry staff examine the findings and implications of the recommendations presented in the Technical Report (Erickson et al., 1992). The comments in this response document are directed towards the Contract Team's discussion of student performance in the investigation tasks, particularly the observation that students had experienced few opportunities in their classrooms to investigate in open-ended or self-directed ways.
The Ministry responds by stating that it will encourage:

    ...science teaching that emphasizes greater student involvement with scientific inquiry and open ended scientific and mathematical problem solving. (Ministry of Education, 1992a, p. 8)

A subsequent draft document, Curriculum and Assessment Framework: Science (Ministry of Education, 1992b), includes such a focus.

A further consequence of the assessment for the Ministry of Education has been the development of a school district-level package, Science Program Assessment through Performance Assessment Tasks: A Guide for School Districts (Bartley et al., 1993). This self-contained binder contains details of all tasks, administration, data analysis and interpretation procedures, enabling school districts to use the materials independently to collect their own district-level data.

The original, explicit agenda of the Contract Team is outlined in The Assessment of Students' Practical Work in Science (Erickson, 1990). This document indicates some of the intended consequences that were goals for the Contract Team. The most significant of these is the viewing of student performance in science from a constructivist perspective on learning. The data provide empirical support for the Contract Team's position on two significant issues in science education: advocating an approach that might be described as "Investigations for All", and evidence that "processes" are context-dependent. One intended outcome was that students and teachers would support this mode of assessment; the actual degree of student and teacher enthusiasm surpassed our hopes.

Unintended Consequences

At this stage of the validation inquiry, several unintended consequences of the Student Performance Component of the 1991 B.C. Science Assessment have been perceived.

For the Ministry of Education one unintended consequence has been unwelcome administrative expenses. As the work of the assessment progressed, the labour-intensiveness of the whole enterprise became apparent.
Data collection required trained administrators, as did the coding and evaluation of the data. The 36 teachers who were involved in the assessment each required 10 or 11 days of paid release time, and two visits to Vancouver for workshops. The cost of this teacher time and transportation can be considered as an undesirable consequence if one sets a low priority on teacher in-service education. However, many of the teachers reported that their work on this project was the most fulfilling in-service work of their careers — clearly a positive consequence!

A desirable although unanticipated consequence is the positive international recognition that this component of the Science Assessment has received (as part of the greater set of B.C. Science Assessment reports), including the honour of the AERA Division H Award for the Best Program Evaluation published in 1992. This award recognized and rewarded the decision of the Ministry of Education to examine a broad range of variables in the 1991 B.C. Science Assessment.

From the perspective of the Contract Team, the unintended consequences identified to date have generally been seen as problems to solve rather than as unpleasant surprises. Details are presented in chronological order.

The first problem was related to task development and the procurement of resources. Initially it was expected that three investigations, one with a biological theme, would be used in the assessment. Investigations that have been used by the Assessment of Performance Unit (APU) in England (Gott and Murphy, 1987) entail the use of African land snails, and mealworms or woodlice (sowbugs). Inquiries about sources of African land snails soon led to the discovery that this animal is banned from Canada because of a perceived agricultural risk. As this part of the pilot work took place in the late fall of 1990 and early spring of 1991, local garden snails were not available because of winter hibernation.
Instead, snails were purchased from a grower of edible snails, and plans were made for pilot testing. Lengthy observations of these snails, a tropical variety, indicated that their movement was significantly slower than the local garden snails, but increased as the ambient temperature was raised. Only when their environment was above 25°C were these snails active enough to be useful in an assessment situation. As this presented an unworkable constraint, this investigation was not used. The Contract Team found similar problems with the activity of mealworms. Consequently, the decision was made to use only the two investigations "Magnets" and "Paper Towels" and forgo the biological theme.

The next problem came about when the Contract Team attempted to find a science equipment supplier willing to contract for the construction of the assessment kits. While the accessible suppliers were confident of their ability to provide the range of materials required, they were not willing or staffed to construct the kits. This work fell to the Contract Team and led to many extended workdays. It was, however, completed on time. The teachers who administered the assessment appreciated the work and made comments such as:

    ...my feeling was, and I think the other teacher from my district agrees, is that the job you did setting it up for us was tremendous. Ninety-nine percent of the problems were solved before they even got underway. (male teacher, July 1991)

A third problem arose with some of the teacher-developed criteria for the evaluation of student performance. A consistent theme running through these criteria for a noticeable number of stations was that teachers valued extended or multiple responses from the students. This presented a problem for the Contract Team as the criteria had been developed by a set of teachers with extensive experience not only of teaching appropriate grade levels in B.C., but also in this mode of assessment.
In its discussion of specific stations in the Technical Report, the Contract Team comments upon its perception of the apparent unfairness of such criteria, and the consequences in the interpretation of the data.

The discussion of gender-related differences in the Technical Report concludes that while there may be some task-dependent differences in the performance of females and males, overall results show that "gender-related differences are noteworthy by their absence" (Erickson et al., 1992, p. 228). The choice of 15% as the critical difference to guide the analysis, rather than a careful examination of the context for each of the differences, influenced the interpretation of the data. The potential for revisions to this analysis of gender-related differences remains.

The benefits of extensive involvement of teachers from around the province were seen as an unintended outcome of the assessment program by the Contract Team. Teacher involvement had been planned as part of the data collection procedures originally proposed to the Ministry of Education, but such a positive and extended interaction was not anticipated.

Because the Ministry of Education concluded that the assessment tasks were valuable exemplars, the Contract Team received a further contract from the Ministry of Education. This involved members of the Contract Team in field testing and producing Science Program Assessment through Performance Assessment Tasks: A Guide for School Districts (Bartley et al., 1993). This opportunity for students and teachers across B.C. to work with the station tasks and gain further experience with assessment procedures must be considered another positive consequence.
It does, however, mean that this set of tasks cannot be used in the future to measure change in student performance!

Refinements in Assessment Procedures

Work on field testing Science Program Assessment through Performance Assessment Tasks: A Guide for School Districts (Bartley et al., 1993) enabled the Contract Team to reflect upon its experiences, and to address concerns that had arisen in the Student Performance Component of the 1991 B.C. Science Assessment. Most significant of these was the decision to review, and encourage rewriting of, some teacher-derived criteria for student performance. These revisions were done by practicing teachers, with more guidance from members of the Contract Team than had been given to the teachers who produced the original criteria.

The orientation workshop for using Science Program Assessment through Performance Assessment Tasks: A Guide for School Districts (Bartley et al., 1993) includes discussion of the explicit and implicit messages about science that teachers and/or students perceive while performing the tasks, and also encourages debate about the implications for student learning of using these tasks.

REFLECTIONS UPON VALIDATION QUESTIONS

The approaches to validation inquiry set out in this chapter represent a move forward in the identification and resolution of some of the issues in validation of performance assessments in science. The questions asked in this study are derived from Shepard's fundamental question "What does the testing practice claim to do?" (1993, p. 429), and were influenced by the perspectives presented by Linn, Baker, and Dunbar (1991). However, this study has focused not only upon the claims and recommendations made from analyses of test responses, but has included an examination of the theoretical perspectives used to develop and interpret the assessment.
Examination of the theories underlying test interpretation is a critical issue for validation (Messick, 1989b, 1994). Inclusion of the structural aspects of test development represents an important broadening of the validation process.

The theoretical perspective towards learning for the Student Performance Component of the 1991 B.C. Science Assessment is identified as constructivist in The Assessment of Students' Practical Work in Science (Erickson, 1990). This perspective permeates the approach taken by the Contract Team throughout the assessment from the early stages of task development and pilot studies through to data analysis and recommendations to the Ministry of Education as the project was nearing completion. Inconsistencies arose when different perspectives were brought to bear upon the interpretation of the data. For example, as noted previously, certain teacher-developed criteria in the station tasks appear to value extended but simple responses over shorter answers that indicate depth of understanding. As a result of this, the Contract Team ensured that more consistent perspectives were applied to criterion development for use in the Science Program Assessment through Performance Assessment Tasks: A Guide for School Districts (Bartley et al., 1993).

"Validation is essentially a matter of making the most reasonable case to guide both current use of the test and current research to advance understanding of what the test scores mean" (Messick, 1989b, p. 13). Many of the factors which tend to reduce or obscure the validity of inferences have been addressed by the Contract Team. At least two more issues remain: 1) gender-related differences in student performance require further scrutiny, and 2) the design of cognitively complex tasks, stations and investigations, for all grade levels, should be discussed. Validity is "a matter of degree, not all or none" (Messick, 1989b, p. 13).
This inquiry has identified some specific difficulties with the inferences and theoretical rationales made by the Contract Team. However, my conclusion from the evidence is that in general the assessment did achieve the intended goals.

CHAPTER 5 — DESCRIBING STUDENT PERFORMANCE IN SCIENCE

The issue of how to describe student performance in hands-on science tasks is discussed in this chapter, which constitutes my response to the second research question, "What are the essential characteristics of descriptors of student performance on performance tasks in science?" The characterization of a task, or set of tasks, is that vital element in the set of inferences that would typically be examined during validation. Cherryholmes (1989) summarizes the initial status of construct validity as "the identity between an attribute or quality being measured and a theoretical construct" (Emphasis in original; p. 100). This attempt to provide a correct description of the world was soon found to be impractical. Subsequently, construct validity was redefined as "the entire body of evidence, together with what is asserted about the test in the context of the evidence" (Cronbach and Meehl, 1955, p. 284; cited in Cherryholmes, 1989, p. 101). This shift in definition moved construct validity from a defining identity to a defensible interpretation1.

For assessment in school science the issue is multi-faceted at both the macro and micro levels. At the macro level there is a need to examine the identity of the construct "school science" as it pertains to current curricula. At the micro level the issue becomes one of task description. There must be congruence between the micro and the macro; the domain of items chosen for the assessment tasks must provide an acceptable or defensible representation of the universe of "school science".

The perspectives among science educators that have guided curriculum and assessment are reviewed.
The chapter continues with an examination of the frameworks that have been developed to provide descriptions of hands-on performance assessment items in school science. A model for the description of performance tasks is then presented.

1 The various sets of Standards for Educational and Psychological Testing describe what is accepted as evidence in the argument to defend the interpretation.

SCHOOL SCIENCE

The parallels between assessment design and curriculum design should not be understated as each of these is informed by a view of "What counts as science education?" (Roberts, 1988, p. 27). Roberts focuses upon the tensions between the shaping of science education by policy makers and that done by classroom teachers. He identifies three inherent aspects of the question "What counts as science education?":

First, the answer to it requires that choices be made — choices among science topics and among curriculum emphases. Second, the answer is a defensible decision rather than a theoretically determined solution to a problem theoretically posed. Third, the answer is not arrived at by research (alone), nor with universal applicability; it is arrived at by the process of deliberation, and the answer is uniquely tailored to different situations. Hence the answers to the question will be different for every educational jurisdiction, for every duly constituted deliberative group, and very likely for every science teacher. (Roberts, 1988, p. 30)

The negotiations as to what should constitute an appropriate policy for science education in any particular jurisdiction are complex and require long periods of discussion and refinement. For example, this has been ongoing for almost ten years in the English National Curriculum, and is approaching four years for the drafts of the National Curriculum Standards for Science in the United States.
Roberts argues that individual teachers who receive the "authorized" view of science education in the form of a curriculum or syllabus will maintain their own opinions about what "really" counts as science education. Consequently, there will be significant variations in emphasis from classroom to classroom (Roberts, 1988).

The view of science presented by formal curriculum documents has served to direct the development of student assessments. The developers of the British Columbia Assessment of Science 1991 Provincial Report (Bateson et al., 1992) demonstrate this by identifying "authorized" emphases for science education2 in the Province, where:

An examination of recent curricula and general education documents in British Columbia reveals that the principal intention of science education is to foster "scientific literacy" through:
• developing a positive attitude toward science;
• expressing scientific attitudes;
• developing science skills and processes that allow for exploration and investigation of the natural world;
• understanding and communicating principles and concepts that provide a scientific perspective on the world;
• appreciating cultural and historical contributions to science; and
• understanding how science, technology, and society interact and influence one another, creating socioscientific issues.
(Bateson et al., 1992, p. xvi)

Multiple perspectives as to what should be emphasized in science education are represented in these curriculum documents. As these statements relate to science education for all students in the public school system in British Columbia, such a range in emphasis is expected, and appropriate.

In planning the 1991 British Columbia Assessment of Science, the Ministry of Education recognized that multiple-choice questions alone would not provide sufficient breadth of information from which to make judgements about science education in the Province.
Therefore, as described in Chapter 3, separate contracts were developed for the four components of the assessment3. The student performance component was intended by the Ministry of Education to focus upon "science skills and processes that allow for exploration and investigation of the natural world" (Bateson et al., 1992, p. xvi). This aspect of the assessment appears to be consistent with the curriculum emphasis that Roberts categorizes as "scientific skill development" (Roberts, 1988, p. 45). Roberts surveyed science education practice in elementary and secondary schools in North America and identified seven emphases.

2 This intention has also been stated by the Ministry of Education as the four goals of the science program (Bateson et al., 1992, p. xvi):
Goal A: The Science Program should provide opportunities for students to develop positive attitudes toward science.
Goal B: The Science Program should provide opportunities for students to develop the skills and processes of science.
Goal C: The Science Program should increase students' scientific knowledge.
Goal D: The Science Program should provide opportunities for students to develop creative, critical and formal (abstract) thinking abilities.
These, together with the seven underlying views of science, are shown in Figure 18.

Everyday Coping: A view of meaning necessary for understanding and therefore controlling everyday objects and events.
Structure of Science: A conceptual system for explaining naturally occurring objects and events, which is cumulative and self-correcting.
Science, technology, decisions: An expression of the wish to control the environment and ourselves, intimately related to technology and increasingly related to very significant societal issues.
Scientific skill development: Consists of the outcome of correct usage of certain physical and conceptual processes.
Correct explanations: The best meaning system ever developed for getting at the truth about natural objects and events.
Self as explainer: A conceptual system whose development is influenced by the ideas of the times, the conceptual principles used, and the personal intent to explain.
Solid foundation: A vast and complex meaning system which takes many years to master.

Figure 18. Seven Curriculum Emphases and Associated Views of Science (Roberts, 1988, p. 45)

3 These are:
Component 1: The Classical Component
Component 2: The Student Performance Component
Component 3: The Socioscientific Issues Component
Component 4: The Context for Science Component

The "skill development" perspective, with its dependency upon the processes of science, has guided the development of other performance assessment projects. Of these, the Second International Science Study (SISS) has been most prominent, with a report entitled Assessing Science Laboratory Process Skills at the Elementary and Middle/Junior High Levels (Kanis, Doran, & Jacobson, 1990).

In stating as an underlying view of science the "usage of certain physical and conceptual processes" (1988, p. 45), Roberts does not pursue the description of the set of processes which are included in such a perspective.
Donnelly and Gott (1985) consider that a coherent structure for science processes is an attractive approach for science educators to use in assessment, as it appears to present a unified perspective of science through an assessment framework.

ASSESSMENT FRAMEWORKS

The mechanics of using a curriculum as the starting point for science assessments in British Columbia is illustrated in the 1986 British Columbia Assessment of Science General Report (Bateson et al., 1986). The diagram used to represent the chronology of development is shown as Figure 19.

Figure 19. Development of the Achievement Instruments (Bateson et al., 1986, p. 8)

The steps shown in Figure 19 identify procedures which attempted to "ensure that the items used in the assessment were of the highest quality possible and that the overall survey accurately reflected the appropriate curriculum" (Bateson et al., 1986, p. 8). The second step, development of Tables of Specifications for each grade, plays an important part in translating the curriculum into a set of items used in the assessment by giving formal definition to the relationship between curriculum content and assessment items for each grade level. In addition, the Tables of Specifications act as the reference point for the characterization of each item. The technique of examining items for "goodness-of-fit" within a Table of Specifications, by panels of experts, has long been a part of the formal validation process. Bateson and his collaborators in the Classical Component of the 1991 British Columbia Assessment of Science (Bateson, Anderson, Brigden, Day, Deeter, Eberlé, Gurney, & McConnell, 1992) extended this methodology to include "think aloud" interviews with students about the individual assessment items.

For "hands-on" performance assessment, the Table of Specifications is usually replaced or augmented by an assessment framework.
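The role a Table of Specifications plays in item characterization can be pictured as a content-by-skill grid against which items are registered and coverage is checked. The following is a minimal sketch of that idea only; all content areas, skill labels, item identifiers, and counts are invented for illustration and are not taken from the B.C. assessment documents:

```python
from collections import Counter

# Hypothetical Table of Specifications: each cell (content area, skill
# category) is assigned a planned number of items. Labels and counts are
# invented placeholders.
table_of_specifications = {
    ("Life science", "Knowledge"): 4,
    ("Life science", "Process skills"): 2,
    ("Physical science", "Knowledge"): 3,
    ("Physical science", "Process skills"): 3,
}

# Characterizing an item means locating it in exactly one cell of the grid.
item_classifications = {
    "item-01": ("Life science", "Knowledge"),
    "item-02": ("Physical science", "Process skills"),
}

# A simple coverage check: count the items classified into each cell and
# report cells that still fall short of the plan.
actual = Counter(item_classifications.values())
shortfall = {
    cell: planned - actual.get(cell, 0)
    for cell, planned in table_of_specifications.items()
    if planned > actual.get(cell, 0)
}
print(shortfall)  # cells still needing items
```

The "goodness-of-fit" review by expert panels described above corresponds to checking, for each item, that its assigned cell is defensible; the grid itself only guarantees that the planned balance of content and skill is made explicit.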
Assessment frameworks serve a variety of purposes — from describing the tasks or items at one extreme, to attempting to configure thinking processes at the other. The type of framework, its structure, and its use will vary significantly depending upon the purposes and values of the assessment developers. The frameworks discussed here represent a wide variety of viewpoints ranging from the pragmatic APU approach to the higher-order thinking skills framework of NAEP (Blumberg et al., 1986). My examination of the International Association for the Evaluation of Educational Achievement (IEA) projects looks at the Second International Science Study (SISS) and the Third International Mathematics and Science Study (TIMSS). I conclude with an examination of the framework used in the Student Performance Component of the 1991 B.C. Science Assessment (Erickson et al., 1992).

The APU Framework

Murphy (1990) reflects upon the framework used by the APU:

It is problematic to select the terms by which pupils' attainment in science may be described. No single philosophical or psychological model was found to be an appropriate basis for the assessment exercise and its defined purposes, so none is reflected in the science activity categories. Nor was any hierarchy implied in the list. The activities are often referred to as synonymous with science processes. At other times the link is more cautiously stated. Although many generally accepted process terms (e.g., interpreting, observing, and hypothesising) are represented in either the category titles or in specific question descriptors in the categories, the science activities do not define the science processes. Moreover, the operational definitions of the activities include far more components of performance than are normally covered in discussions of the 'processes,' e.g., identifying the status of variables in investigations. (p. 153)

This cautious approach led the APU to develop a framework with six general category statements. These are:
1. Using symbolic representations
2. Use of apparatus and measuring instruments
3. Using observations
4. Interpretation and application
5. Design of investigations
6. Performing investigations (Johnson, 1989, p. 11)

Extensive lists of scientific concepts and knowledge were developed for use in the assessment of students at age 11 (Russell et al., 1988), at age 13 (Schofield et al., 1988) and at age 15 (Archenhold et al., 1988). The work of the APU took place in a relatively unstructured setting before the development of the National Curriculum in England. The freedom to select any focus allowed the APU management to adopt a view of science as "an experimental subject concerned fundamentally with problem-solving" (Murphy and Gott, 1984, p. 5). Murphy and Gott believe that this decision led to a framework defined by three dimensions:
• the science process involved in answering the question
• the degree of conceptual understanding required for its solution
• the content of the question and the context in which it is set
(1984, p. 5)

These dimensions were used by the APU both in describing the stations and also in the reporting of the results. Stations used by the APU were reviewed by external validators — practicing heads of science — with each station assigned to a unique sub-category as a uni-dimensional task. Murphy reports that each station description includes "what the pupil was given, the expected outcome, and the mode of question response" (Murphy, 1990, p. 153).

Reporting of student performance using these three dimensions was quickly rejected by the APU management as too expensive and time consuming. The use of six framework categories, three content categories, and three context categories (science, other subjects and everyday) would lead to a grid of 54 cells! Johnson (1989) describes the effect of simplifying the reporting procedures as "reduced score interpretability".
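The 54-cell figure is simply the Cartesian product of the three reporting dimensions (6 x 3 x 3). A minimal sketch of how quickly such a grid grows; the six category names are those quoted above from Johnson (1989), while the three content labels are placeholders, since the APU's content categories are not listed here:

```python
from itertools import product

# The six APU framework categories (Johnson, 1989, p. 11).
categories = [
    "Using symbolic representations",
    "Use of apparatus and measuring instruments",
    "Using observations",
    "Interpretation and application",
    "Design of investigations",
    "Performing investigations",
]
# Three content categories (placeholder labels) and the three contexts
# named in the text.
contents = ["content A", "content B", "content C"]
contexts = ["science", "other subjects", "everyday"]

# Every reporting cell is one (category, content, context) combination.
grid = list(product(categories, contents, contexts))
print(len(grid))                         # 54 reporting cells
print(len(categories) * len(contents))   # 18 cells once context is dropped
```

Each additional reporting dimension multiplies the number of cells, which is why the APU management judged full three-dimensional reporting too expensive and time consuming.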
When reporting using the context dimension was abandoned, the remaining matrix consisted of 18 cells and this was considered still too complex. The problem was resolved by the decision that:

the questions developed in any Category should be designed to be free of any dependence on taught science concepts. (Johnson, 1989, p. 11)

One consequence of this decision is that student performance is reported with no reference to science content!

The NAEP Project: A Pilot Study of Higher Order Thinking Skills Assessment Techniques in Science and Mathematics

An alternative model framework is presented by Blumberg, Epstein, MacDonald and Mullis in the NAEP report A Pilot Study of Higher Order Thinking Skills Assessment Techniques in Science and Mathematics (1986). Identified as a model for higher order thinking skills in science and mathematics, the framework is based on the premise that:

at the most general level, higher order thinking skills are used to formulate a question, design and perform an analytical procedure and reach a conclusion to a problem. Further, such thinking was considered to be continuously self-monitored and evaluated as it occurs during the course of working through a problem or situation. Finally, subject-matter knowledge, beliefs, and values also impact upon how effectively an individual employs thinking skills in a particular situation. (Blumberg et al., 1986, p. 11)

The model, shown here as Figure 20, is built around six "aspects" that are described by the authors as representing what is "done" in science and mathematics. These aspects are intended to "collectively comprise the complex network of thinking skills in science and mathematics" (Blumberg et al., 1986, p. 11).

Figure 20. Higher Order Thinking in Science and Mathematics (Blumberg et al., 1986, p. 12)

The framework also invokes the "thinking skills" of generating and evaluating/monitoring, with the additional contextual demands of knowledge, beliefs and the environment4.

Linking each "aspect" with each of the other five is intended to demonstrate the "dynamic" nature of the process. The authors recognize that the "distinctions between the aspects are fuzzy" (Blumberg et al., 1986, p. 15). This "fuzziness" is problematic when using this model, particularly as students reformulate problems to accommodate their own views of science, and their perceptions of the problem. Such models for higher order thinking skills, and similar models for problem solving, often provide "classical" representations which merely approximate the actual procedures. The influence of tacit knowledge is usually so powerful that students have difficulty in describing what they have done and why they have done it (Woolnough and Allsop, 1985). This framework for higher order thinking skills was conspicuous at the initial stages of the NAEP project, but received less attention after the APU personnel became involved. This framework was not even discussed in the concluding section of the report (Blumberg et al., 1986). As the purpose of the NAEP project was to evaluate only the feasibility of the procedures, the measurement properties of the assessment tasks were not published.

4 The environment for these considerations can be seen to include "external parameters such as recent experiences, working conditions, and testing situations, as well as internal parameters such as personality characteristics and interpersonal reactions. These environmental parameters may affect interest, motivation, attitudes, involvement, perseverance and cooperation" (Blumberg et al., 1986, p. 13).

IEA Projects

Two IEA projects are considered here -- the Second International Science Study (SISS) which ran from 1982 to 1992, and the Third International Mathematics and Science Study (TIMSS) which started in 1991 and will run beyond the end of this century.

The Second International Science Study (SISS)

When reporting the results and procedures of the hands-on component of SISS5 in the United States, Kanis, Doran, and Jacobson (1990) recognize the influence of the course Science — A Process Approach (AAAS, 1965) on science education in the United States. Kanis, Doran and Jacobson report discussions of the IEA planning committee in 1983 in which several taxonomies of process skills were examined. Foremost among these was the Table of Specifications developed by Klopfer (1971) which was used for the SISS curriculum analysis. The APU framework (Murphy and Gott, 1984), and a list of skills developed by Tamir and Lunetta (1978) were also considered. The planning committee chose to adopt a three-category system, and justified this choice by stating that such a system "readily related to existing process skill schemes and was simple to use and explain" (Kanis et al., 1990, p. 14). The planning committee recognized that this framework did "not include knowledge and understanding as separate categories, since elements of cognitive outcomes are present in all items" (Kanis et al., 1990, p. 14). The SISS categories are shown in Figure 21.

5 For SISS, Population 1 consisted of students aged 10 (grade 5) and Population 2 was made up of students aged 14 (grade 9).

Performing (P): To include: observing, measuring, manipulating
Investigating (I): To include: planning and design of experiments
Reasoning (R): To include: interpreting data, formulating generalizations, building and revising models

Figure 21. SISS Process Test Skill Categories (Kanis et al., 1990, p. 14)

Kanis, Doran, and Jacobson relate that the coordinators from the six countries participating in SISS scrutinized the set of items chosen for the assessment and "assigned a skill classification to each item" (1990, p. 14). The issue of content representation was addressed by members of National Committees and science educators from participating countries using the SISS curricular grids. Tasks in SISS were characterized by both an identified skill and a content area such as Chemistry, Biology or Physics. In a post hoc analysis (Tamir, Doran, & Chye, 1992) the SISS performance tasks are mapped onto the Klopfer scheme, enabling comparison of the hands-on tasks and paper-and-pencil tasks. This post hoc analysis shows that these tasks were perceived as multi-dimensional; each task for 10-year-olds involved four to six process skills, while the tasks for 14-year-olds were assigned seven to nine process skills.

The complexity of describing performance tasks in science is exemplified by the approach taken for SISS. Tamir, Doran, and Chye (1992) report the procedures that were followed in the development, administration and scoring of the SISS performance tasks. These are shown in Figure 22.

Figure 22. SISS Procedures

These six steps, from task development through to interpretation, allow for the description of task attributes. Some problems arise in the consistency of labelling, for example, where SISS practical skills categories and Klopfer's taxonomy overlap (Figure 23).

Performance: B.1 Observation of objects and phenomena; B.2 Description of observations using appropriate language; B.3 Measurement of objects and changes; G.1 Development of skills in using common laboratory equipment
Investigating: C.3 Selection of suitable tests of a hypothesis; C.4 Design of appropriate procedures for performing experiments
Reasoning: D.1 Processing of experimental data; D.3 Interpretation of experimental data and observations; D.5 Evaluation of a hypothesis under test in the light of data obtained

Figure 23. Correspondence between SISS Practical Skill Categories and Klopfer's Scheme (Tamir et al., 1992)

The category "Performance" in the SISS scheme was seen to map onto four distinct skills in Klopfer's taxonomy (Figure 24). However, the examination of individual tasks led to the identification of sets of skills beyond those that Tamir, Doran and Chye had identified as corresponding between framework and taxonomy.

A.0 Knowledge and Comprehension
A.1 Knowledge of specific facts
A.2 Knowledge of scientific terminology
A.3 Knowledge of concepts of science
A.4 Knowledge of conventions
A.5 Knowledge of trends and sequences
A.6 Knowledge of classifications, categories and criteria
A.7 Knowledge of scientific techniques and procedures
A.8 Knowledge of scientific principles and laws
A.9 Knowledge of theories or major conceptual schemes
A.10 Identification of knowledge in a new context
A.11 Translation of knowledge from one symbolic form to another
B.0 Process of Scientific Inquiry I: Observing and Measuring
B.1 Observation of objects and phenomena
B.2 Description of observations using appropriate language
B.3 Measurement of objects and changes
B.4 Selection of appropriate measuring instruments
B.5 Estimation of measurement and recognition of limits in accuracy
C.0 Process of Scientific Inquiry II: Seeing a Problem and Seeking Ways to Solve It
C.1 Recognition of a problem
C.2 Formulation of a working hypothesis
C.3 Selection of suitable tests of a hypothesis
C.4 Design of appropriate procedures for performing experiments
D.0 Process of Scientific Inquiry III: Interpreting Data and Formulating Generalizations
D.1 Processing of experimental data
D.2 Presentation of data in the form of functional relationships
D.3 Interpretation of experimental data and observations
D.4 Extrapolation and interpolation
D.5 Evaluation of a hypothesis under test in the light of data obtained
D.6 Formulation of generalizations warranted by relationships found
E.0 Process of Scientific Inquiry IV: Building, Testing and Revising a Theoretical Model
E.1 Recognition of the need for a theoretical model
E.2 Formulation of a theoretical model to accommodate knowledge
E.3 Specification of relationships satisfied by a model
E.4 Deduction of new hypotheses from a theoretical model
E.5 Interpretation and evaluation of tests of a model
E.6 Formulation of revised, refined and extended model
F.0 Application of Scientific Knowledge and Methods
F.1 Application to new problems in the same field of science
F.2 Application to new problems in a different field of science
F.3 Application to problems outside of science (including technology)
G.0 Manual Skills
G.1 Development of skills in using common laboratory equipment
G.2 Performance of common laboratory techniques with care and safety

Figure 24. Klopfer Table of Specifications for Science Education (from Tamir et al., 1992)

For example, task 2A1 for Population 2, "Electrical Circuits", is a task in which students "determine the circuit within a black box by testing with battery-bulb apparatus" (Tamir, Doran, Kojima, & Bathory, 1992, p. 279). Figure 25 shows how this task was elaborated further, using both the SISS Content Classification Scheme (Postlethwaite and Wiley, 1991) and Klopfer's Taxonomy of Process Skills:

Task: 2A1
Content: P53 Current Electricity
Klopfer's Process Skills: A1, A2, A3, A4, B1, D1, D6, G1, G2

Figure 25. Classification of Exercise 2A1 (Tamir, Doran, Kojima, & Bathory, 1992)

"Electrical Circuits" consists of three questions or items; so it is possible to identify the SISS skills for each part (Figure 26):

Task Part    SISS Skill    Content
2A1.1        Performing    Physics
2A1.2        Performing    Physics
2A1.3        Reasoning     Physics

Figure 26. Classification of Task 2A1 by SISS Practical Skill Categories (Tamir, Doran, Kojima, & Bathory, 1992, p. 281)

Figures 25 and 26 show the breadth and depth of the scrutiny to which the SISS tasks were subjected. Tamir and Doran confirm some of the difficulties of the categorization procedures employed by SISS.

It was evident that placing an item within one skill area was necessary for analysis purposes, but this approach over-simplified the holistic, multi-faceted nature of many of the items. (Tamir and Doran, 1992b, p. 145)

Tamir and Doran report that "for both test forms, at both population levels, the success rate was highest for the performing skill and lowest for the reasoning skill" (1992b, p. 395). Task structure tended to have "reasoning items" following other skill areas. This led Tamir and Doran to present an interpretation based upon a probable hierarchical structure. They propose that reasoning is the higher order skill, requiring successful use of lower order skills. Tamir and Doran (1992b) interpret these results as supporting similar claims for a hierarchy of science skills made by Bathory (1985).

In SISS, the use of a relatively simple three-state set of skills categories, combined with three content area descriptors, allows construction of a three-by-three matrix as shown in Figure 27.

                   Content Area
Skill              Biology    Chemistry    Physics
Performing
Investigating
Reasoning

Figure 27. SISS Skill/Content Matrix

While there were some explorations in the use both of Klopfer's taxonomy and the SISS Content Classification Scheme (Tamir, Doran, Kojima, & Bathory, 1992), the SISS analysis (Tamir and Doran, 1992b) focuses upon performance as categorized within the matrix shown here as Figure 27. The means of the correlation coefficients between skills, for all students in both populations, are positive but low, as shown in Table 8.

Table 8
SISS Skill Category Mean Correlation Coefficients

                             Population 1    Population 2
Performing/reasoning             0.25            0.63
Performing/investigating         0.27            0.32
Investigating/reasoning          0.29            0.30

Doran and Tamir offer as a possible interpretation that "these subtests were not measuring the same skills" (1992, p. 371). These authors also warn that "the low correlations might be due in part to the substantial errors of measurement in the assessment of the practical skills tests" (1992b, p. 371). One must consider whether it is reasonable to expect a high correlation between the SISS Skill Categories, given the structure of the tasks. The allocation of marks for each part of a task is another significant factor. Tamir and Doran (1992b) contend that the skills are hierarchical with "performing" at the lowest level and "reasoning" at the highest level. The initial parts of each task were defined as "performing" — the lowest level in the hierarchy — yet were allocated the major portion of the marks. Many students were successful in completing the "performing" part of a task, but were not able to accomplish the "reasoning" part of the task with similar success. However, those who were successful in "reasoning" were usually successful in "performing" too. Given this hierarchical structure, the low correlations between categories are not surprising.

"Substantial errors of measurement" (Tamir and Doran, 1992b, p. 370) is a catch-all phrase that can be considered to refer to the reliability of the test scores, and its impact upon the validity of score inferences.
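The point that a hierarchical task structure can depress between-category correlations is easy to illustrate numerically. The following is a minimal sketch with invented marks for eight hypothetical students, not SISS data: nearly everyone succeeds on a "performing" part, so its variance is small, and its Pearson correlation with a more variable "reasoning" part comes out positive but modest even though no student reasons successfully without also performing successfully:

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented marks: "performing" is near ceiling (little variance), while
# "reasoning" varies, and success on it implies success on "performing".
performing = [3, 3, 3, 3, 2, 3, 3, 3]
reasoning  = [0, 1, 2, 0, 0, 2, 1, 0]

print(round(pearson_r(performing, reasoning), 2))  # 0.34: positive but low
```

Restricted variance in the lower-level skill limits how large the correlation can be, which is consistent with the interpretation above that the low SISS coefficients need not mean the categories measure unrelated skills.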
Tamir and Doran do not articulate whether they believe that such measurement errors are systematic or random. There are many opportunities for systematic errors to occur. These range from lack of precision in construct labelling to problems in translation across different languages. Systematic errors tend to affect scores in a regular or consistent fashion. Random errors derive from purely chance happenings (Crocker and Algina, 1986), for example, guessing, scoring errors or inconsistencies in administration. Any analysis of measurement error requires a careful investigation of the structure, procedures and interpretive framework of the assessment.

The Third International Mathematics and Science Study (TIMSS)

The Third International Mathematics and Science Study (TIMSS) is the current IEA project in mathematics and science. This study has attracted the participation of over 50 educational systems, with extensive representation from countries on all continents except Africa. The TIMSS performance option attracted the interest of 20 countries and is the focus of discussion here. It must be noted that student performance data and analytical procedures in the TIMSS project are not yet available for analysis.

While many parts of the TIMSS project can be seen as a progression from the Second International Science Study (SISS) and the Second International Math Study (SIMS), there have been significant developments in the design of the TIMSS framework. Most notable is the move away from the content-by-cognitive behaviour grid.
Criticism by Romberg and Zarinnia (1987) is identified by Robitaille, Schmidt, Raizen, McKnight, Britton, and Nicol (1993) as having stimulated this change.

They (Romberg and Zarinnia) note that the use of a content by cognitive behaviour grid fails to take into account the interrelatedness of content or of cognitive behaviours, and that this forces the description of information into unrealistically isolated segments...

The TIMSS curriculum frameworks were constructed to be powerful organizing tools, rich enough to make possible comparative analyses of curriculum and curriculum changes in a wide variety of settings and from a wide variety of curriculum perspectives. The framework had to allow for a given assessment item or proposed instructional activity to be categorized in its full complexity and not reduced to fit a simplistic classification scheme that distorted and impoverished the student experience embedded in the material classified. (Robitaille et al., 1993, p. 42)

The developers of the TIMSS project hope that the criticism of Romberg and Zarinnia can be addressed by the use of a three-dimensional framework. Three considerations are included: subject matter content, performance expectations, and perspectives or context (see Figure 28).

Content Aspect: Earth sciences; Life sciences; Physical sciences; Science, technology and mathematics; History of science and technology; Environmental issues; Nature of science; Science and other disciplines.

Performance Expectations Aspect: Understanding; Theorizing, analyzing, solving problems; Using tools, routine procedures; Investigating the natural world; Communicating.

Perspectives Aspect: Attitudes; Careers; Participation; Increasing interest; Safety; Habits of mind.

Figure 28. Aspects and Major Categories of the TIMSS Science Frameworks (Robitaille et al., 1993, p. 46)

Robitaille and his co-authors describe the content aspect as "obvious and needs little explanation or development" (1993, p. 44). They regard the performance aspect as:

a reconceptualization of the former cognitive behaviour dimension. The goal of this aspect is to describe, in a non-hierarchical scheme, the many kinds of behaviours that a given test item or block of content may elicit from students.

They state that the perspectives aspect is intended to:

permit the nature of curriculum components [to be characterized] according to the view of the nature of the discipline exemplified in the material, or the way the content within the material is presented.

The addition of an extra dimension to the TIMSS framework is not the only significant development in the classification of curricula or item analysis. Of major import is the recognition that an item is not necessarily uni-dimensional:

In the TIMSS frameworks, a test item or block of content can be related to any number of categories within each aspect, and to one or more of the three aspects — thus, the multi-category, multi-aspect designation. It is no longer appropriate to think of disjoint "cells" since hierarchical levels within each category make overlapping cells possible. An item is no longer represented as a single strand linked to a matrix, and instead may be associated with many combinations of aspect categories in the TIMSS frameworks. (Robitaille et al., 1993, p. 44)

Robitaille and his co-authors go on to introduce the idea that an achievement item, or piece of curriculum material, has its own "signature". They expand by saying:

Technically, a signature is a vector of three components and the three components are themselves each vectors of category codes for one of the three aspects of a TIMSS framework. That is, associated with each item or piece of curriculum is an array of categories from one, two, or all three aspects of the relevant framework...

The signature reflects the multi-aspect, multi-category nature of the frameworks.
It also provides a more realistic depiction of the complex nature of the elements of curriculum, and is less reductionist than the traditional one-to-one mappings. It is more suited to the complexity of student activities emerging from the various national reforms of school mathematics and science, and more suited to the rich, integrated performances expected of students in the new forms of assessment that are emerging along with curricular reforms. (Robitaille et al., 1993, pp. 44-45)

In rejecting the simplistic grids of the earlier assessments, the developers of the TIMSS study have provided a structure that is likely to have two significant effects. It will:

1. enable task descriptors to convey the intricate details of complex tasks; and
2. pose a challenge to those who will interpret student scores.

The requirement for uni-dimensional tasks and the problems of describing complex tasks have coalesced into the principal factor limiting the development and use of performance tasks for many years. The developers for both APU and TAPS projects have made significant efforts to create uni-dimensional tasks.6 Such tasks, particularly those developed by the TAPS group in Scotland (Bryce and Robertson, 1985), are limited by this constraint. The "signature" approach proposed for TIMSS enables the development of longer and more complex tasks that are consistent with current reform efforts.

6 Bryce and Robertson (1985, p. 18) describe the simplifying of tasks to ensure that they only focus upon one objective as "purifying". Similarly, Johnson (1989) reports that the APU used expert judges to identify a single process from the APU framework for each station task.
However, the use of complex tasks leads to significant problems for those who interpret the scores. The challenge of making sense of the TIMSS scores places the project in a position where it will necessarily catalyze developments in the application of measurement theory to large-scale assessments.

Student Performance Component of the 1991 B.C. Science Assessment

The approach taken to describe the tasks in the Student Performance Component of the 1991 B.C. Science Assessment is based upon the premise that content and process are inseparable. A position paper, presented on behalf of the Contract Team to the Ministry of Education at the commencement of the project, clarifies this perspective:

Although we are concerned primarily with the "skills" dimension in this report, it should be clear that the assessment of these skills can only be accomplished by presupposing that the students possess a related knowledge base and the willingness to articulate that understanding in a given assessment setting...

...the descriptors that we will outline below in the framework of students' practical skills in science have been written in as general, content-free terms as possible. This is so that they can be used as a basis for constructing more specific descriptors for the desired age levels as well as the desired content area. (Erickson, 1990, pp. 2-3)

Thus, the framework is written in terms of six dimensions of science, each with a specific subset of abilities (Figure 29, next page). Two types of assessment task were developed: stations and investigations. The investigation tasks are open-ended, in that students plan and perform their own experiments to solve a specified problem. Consequently, the assignment of specific abilities to each investigation is not possible. For the stations, descriptions of observable actions are written in the language of behavioural objectives. The framework was used to describe observable actions of the students, including their manipulation of equipment and their responses to prompts or questions.

(1) Observation and Classification of Experience
1.a. ability to describe observations made using the senses about a variety of living and non-living objects
1.b. ability to group living and non-living things by observable attributes
1.c. ability to select relevant information from observations to address a problem at hand
1.d. ability to use 'keys' for identifying and classifying objects
1.e. ability to observe and recognize changes or regularities which occur over time
1.f. ability to classify objects in different ways and construct keys to show others how this was done
1.g. ability to observe objects or events from different perspectives

(2) Measurement
2.a. ability to make simple measurements of phenomena using 'invented' or established units
2.b. ability to supply correct SI and metric units for common measurements
2.c. ability to read the scales of various measuring instruments
2.d. ability to estimate the quantity of common properties such as weight and length

(3) Use of Apparatus or Equipment
3.a. ability to identify equipment used in investigations
3.b. ability to state the purpose of a piece of equipment
3.c. ability to select a piece of equipment appropriate for a given investigation
3.d. ability to use a variety of equipment in conducting an investigation
3.e. ability to explain the limitations of specific equipment

(4) Communication
i) Receiving and interpreting information
4.a. ability to follow written, oral or diagrammatic instructions
4.b. ability to draw inferences from data presented in tabular, pictorial or graphic format or data generated experimentally
4.c. ability to translate information from one format to another
ii) Reporting information
4.d. ability to discuss orally the results of an investigation
4.e. ability to report results of an investigation in a descriptive format
4.f. ability to use appropriate symbolic representations when reporting results of an investigation

(5) Planning Experiments
5.a. ability to pose an 'operational question'
5.b. ability to develop a plan that is relevant to an identified problem
5.c. ability to identify the relevant variables that will allow the operational question to be addressed in a proposed experiment
5.d. ability to suggest a testable hypothesis
5.e. ability to identify the possible safety risks in a proposed experiment
5.f. ability to predict the sources of error in an experiment
5.g. ability to use representational models in planning an experiment

(6) Performing Experiments
6.a. ability to set up and use materials and equipment safely
6.b. ability to develop and carry out a suitable measuring strategy for the appropriate variables
6.c. ability to develop a strategy for collecting and recording data that is relevant to the operational question being addressed
6.d. ability to transform and interpret the data collected so that a response can be provided to the operational question
6.e. ability to evaluate the results of the experiment to determine whether further experimentation is required

Figure 29. The Dimensions and Abilities used in the Student Performance Component of the 1991 B.C. Science Assessment

It is important to add that each objective is cross-referenced to the "Ability" from which it is derived. For example, consider the station "Sound Board", which was used with Grades 4, 7 and 10. The objectives for this station are:

1. Students observe and describe what is different about four steel strings of a varied diameter. 1.a
2. Students listen to the sound produced by each string and describe the differences they hear. 1.a
3. Students explain why the strings sound differently. 4.b
(Erickson et al., 1992, p. 20)

These three objectives are considered to encapsulate the key performance elements of the station. The first two are derived from ability 1.a, "ability to describe observations made using the senses about a variety of living and non-living objects". However, each objective is amplified in terms of actual student performance with the equipment, hence the differing statements. The order of the objectives reflects the sequence of instructions or activities as students proceed through a particular station.

How to score students was an issue for the Contract Team in the design of this component of the 1991 B.C. Science Assessment. While this is examined in greater detail in the next chapter, the features related to the description of student performance will be reviewed here. One issue concerns the relationship between stated objectives and how performance is rated on the task. A holistic scoring system was chosen for the Student Performance Component of the 1991 B.C. Science Assessment. One or two carefully considered evaluative or scoring questions, referred to as judgements, were formulated for each station. The questions for the station "Sound Board" are:

1. In your judgement how well did the student observe differences between the strings (visual and aural)?
2. In your judgement how well did the student explain why the strings sound differently? (Erickson et al., 1990, p. 21)

In this case, the evaluative questions are derived directly from the set of objectives, with the first two objectives (both based upon ability 1.a) combined as the basis for the first question. This is the model for most stations. But for some, mapping of objectives into evaluative questions in this way did not cover all objectives.7 The Grade 10 station "A Chemical Reaction" demonstrates some of the omissions that can arise when holistic scoring is used. The objectives for this station are:

1. Students measure the pH of water. 2.c
2. Students observe the reaction between calcium and water. 1.a
3. Students measure the pH of the products of the reaction. 2.c
4. Students write a word equation. 4.e
5. Students write a symbol equation. 4.f
(Erickson et al., 1990, p. 171)

These objectives were translated into two judgements:

1. In your judgement how well did the student observe the reaction between calcium and water?
2. In your judgement how well did the student write a symbol equation? (Erickson et al., 1990, p. 176)

In moving from objectives to evaluative questions, both of the pH measurement objectives were lost to scoring, as was that of writing a word equation. Loss of information about student achievement related to specific objectives will occur if such objectives are not incorporated into the evaluative questions. However, in the Student Performance Component of the 1991 B.C. Science Assessment, descriptive data were collected separately for each student. In this way it is possible to examine individual as well as collective performance in relation to specific objectives.

7 In order to facilitate the synthesis of the results for the student evaluations, the Contract Team chose to limit the number of judgements for each station to a maximum of two.

A MODEL FOR DESCRIBING STUDENT PERFORMANCE

The large-scale assessment programs examined in this chapter have used many different strategies to describe science performance assessment tasks. Achieving a balance between specificity and generality in definition requires thought and examination of the consequences of decisions made in other large-scale assessments. To be general in categorization, as in the case of the SISS project, leads to descriptors that enable tasks to be situated within the domain of science, but allows little specificity regarding task description.
Conversely, to be so specific as to describe tasks with minutiae that cannot be adequately addressed with analytical procedures is a strategy that leads to a false sense of precision.

My proposal for task description follows from Shepard's question about what can be claimed from the assessment practice. While hands-on performance assessment in science is very much more than process assessment, it must be recognized that processes or skills are involved. In addition, the content knowledge requirement of each task, and the content representation of the domain of the set of tasks, are further elements that must be considered. Each task has a unique identity in terms of the types of practical skills encountered in it, and the content area in which these skills are demonstrated. Yet all belong to the domain called "science".

A model for differentiation between practical skills was developed by Millar (1991). He chooses to distinguish between general cognitive processes and scientific procedures, which he calls "practical techniques" and "inquiry tactics". The critical feature that Millar uses to separate the elements of practical skills is that cognitive processes cannot be taught, but scientific procedures can, and may be improved through teaching. Millar's position extends the representation of procedural understanding developed by Gott and Murphy (1987). In claiming that practical techniques can be taught, Millar asserts that:

These are specific pieces of know-how about the selection and use of instruments, including measuring instruments, and about how to carry out standard procedures. They can also be seen as progressive, in terms of both the increasing conceptual demand of certain techniques and increasing precision. (1991, p. 51)

The progression from elementary science to secondary science, together with the use of equipment in a school science laboratory, provides strong circumstantial evidence in support of Millar's claims.
Teachers instruct students about the specific methods of using apparatus such as mechanical mass-measuring devices.8 Simple double-bucket balances, with precision to the nearest 10 g, used at early elementary levels, are replaced by balances that measure masses to ± 0.5 g at later elementary grades, to ± 0.1 g in early secondary courses and ± 0.001 g in many senior secondary chemistry and physics classes. Not only are senior students working at higher levels of precision, frequently they are asked to analyse the probable sources of error in their equipment or technique. Millar amplifies his characterization of inquiry tactics. These are:

best thought of as a 'toolkit' of strategies and approaches which can be considered in planning an investigation. These would include repeating measurements and taking an average; tabulating or graphing results in order to see trends and patterns more clearly; considering an investigation in terms of the variables to be altered, measured, controlled; and so on. (Millar, 1991, p. 51)

Millar argues that the tactics or strategies of science are best taught through student involvement with investigations, a perspective supported by Woolnough (1991). In examining the status of general cognitive processes, Millar (1991) warns of confusion between "means and ends. The processes are not the ends or goals of science but the means of attaining those goals" (Emphasis in original; p. 50). Millar's model for the categorization of practical skills is shown as Figure 30.

8 This is now the case only for mechanical as opposed to electronic digital read-out balances. The practical techniques for use of electronic balances are generally the same for elementary students measuring masses to the nearest gramme as for senior secondary students weighing to the nearest milligramme.

Practical skills:
- General cognitive processes (observe, classify, hypothesise) — cannot be taught
- Practical techniques (measure temperature with a thermometer to within 1°C; separate a solid and a liquid by filtration) — can be taught and improved
- Inquiry tactics (repeated measurements; draw graph to see trend in data; identify variables to alter, measure, control) — can be taught and improved

Figure 30. Sub-categories of 'Practical Skills' (Millar, 1991, p. 51)

I have expanded the model proposed by Millar in order to provide a method of describing student performance assessment tasks in science. A fundamental assumption of this perspective is that practical skills must be demonstrated in the context of some science content. In order to escape from the limiting influence of a traditional content-by-process matrix, I propose a three-dimensional relational diagram, shown as Figure 31.

Figure 31. A Three-dimensional Model of Performance in Science (vertical plane: general cognitive processes, inquiry tactics, practical techniques; horizontal plane: science content sectors)

The horizontal plane represents a typical set of content area sectors for school science. The vertical plane shows segments which incorporate Millar's three categories of practical skills. The volume within the quarter-sphere represents the universe of the science curriculum for the assessment. The mapping of individual tasks onto this three-dimensional model is difficult to show on two-dimensional paper. The simplification of the model into a two-dimensional template facilitates mapping of tasks, but requires a reorientation of the categories of practical skills into sectors of the circle.
In the two-dimensional diagram (Figure 32, next page) the area plotted out by a specific task must include both "science content" on the left side and "practical skills" on the right side.

To exemplify use of this template, the station "Environmental Testing" (used with Grades 7 and 10) is mapped onto the template in Figure 32. The objectives presented by the Contract Team for this station are:

1. Students follow instructions.
2. Students make and record observations.
3. Students draw inferences from collected data.
(Erickson et al., 1990, p. 149)

In this station, students are asked to use pH paper to measure the pH of two different samples. One sample is neutral and is labelled "Lake Water 1985"; the second is acidic and is labelled "Lake Water 1990". Students are expected to respond to the question "What might have caused this change?" by making inferences about possible causes of the different pH results. The "science content" that pertains comes from chemistry and environmental science, so the left-hand side of the template maps to these two sectors. Use of the pH paper is considered a "practical technique", while observation of the colours of the pH paper represents "cognitive processes".

Figure 32. A Two-dimensional Model of Performance in Science. (Annotation: Environmental Testing is set in the domains of chemistry and environmental science and depends upon practical techniques and cognitive abilities.)

This scheme for mapping perceived attributes of specific tasks onto a relational diagram has the advantage of enabling visualization of individual task specificity and of the content coverage provided by the set of tasks. Mapping tasks onto this template enables test developers and users to be more secure in their claims about what will be or was measured in an assessment. This is illustrated in Figure 33, which represents the Grade 10 stations used in the Student Performance Component of the 1991 B.C. Science Assessment.

Figure 33. The B.C. Science Assessment Grade 10 Station Tasks Mapped onto the Template (Radial plot dimension does not imply level of content). Stations: 10.1 Guess It; 10.2 Sound Board; 10.3 Circuit Building; 10.4 Environmental Testing; 10.5 Cool It; 10.6 Microscope; 10.7 Instruments; 10.8 A Chemical Reaction; 10.9 Rocks; 10.10 Rolling Bottles; 10.11 Magnet Power; 10.12 Bugs.

Allocation of tasks to sectors of the circle is based upon the analysis of station objectives as written in the Technical Report (Erickson et al., 1992). Each of the 12 stations has its own unique pattern. Examination of the distribution of the plots shows that the science content for most of the stations is in the domain of physics, with a smaller number of stations representing each of the other content areas. Few of the tasks were considered to belong to more than one content area. Conversely, for practical skills, many stations appear to be multi-faceted: each of the sectors is covered by at least six tasks.

This method of drawing tasks on a template also provides a model that helps explain the low levels of across-task generalizability found in performance assessments in science. Discussions of task specificity in measurement journals (e.g., Linn and Burton, 1994; Shavelson et al., 1993) have tended to focus upon the statistical properties of a small number of tasks using generalizability theory (Shavelson and Webb, 1992). The model proposed here can be used to present a set of task structures, with a profile for each task, and the relationship among these profiles can be clearly displayed.
An advantage of this approach is that it enables the reader to examine the congruence of tasks and to predict possible degrees of correlation between them.

My vision of the description of performance assessment tasks leads to two distinct approaches that augment each other. One of these starts with the construction of a framework that sets out generic forms of the three sub-categories of Millar's "practical skills". This framework is then used to describe the students' actions in completing tasks, and leads to a set of content-dependent "objectives" (as in the case of the Student Performance Component of the 1991 B.C. Science Assessment) or to a "signature" (as in the case of TIMSS). I recommend that a second stage of description of assessment tasks be built from the objectives or signature developed in the first stage. This involves mapping the tasks' profiles onto a two-dimensional template as shown in Figure 33. If the test developers and test users are satisfied that the domain of the test items provides an adequate representation of the universe of interest, then they can have some confidence in the inferences that they make from student performance on the set of tasks.

CHAPTER 6 — INTERPRETING STUDENT PERFORMANCE IN SCIENCE

This chapter begins with a discussion of the derivation of student scores, and then moves into a discussion of alternatives for interpretation. Once again, construct validity is the driving theme for the questions considered in this chapter — initially in terms of deciding how score meaning is influenced by scoring systems.
The procedures for score interpretation are then examined in answer to the third research question of this dissertation:

What are the implications of using different strategies for scoring student achievement upon the interpretation of student performance?

This leads to the fourth question:

How could/should test scores in performance assessments be interpreted and used?

An appropriate start is the examination of two extreme approaches to scoring: holistic and analytical. Choice of scoring approach is influenced by a view of the nature of the assessment tasks, and leads to decisions which impact upon how student performance should be evaluated. Having obtained student scores, the critical issues for test developers and users are the establishment of procedures for making inferences about student performance, reporting these inferences, and examining the usefulness of the whole assessment process.

SCORING PROCEDURES

The essential difference between holistic scoring procedures and analytical approaches lies in the process of aggregation. In holistic scoring, all information contained within the student response is amalgamated so that a "whole" score on a task is created. Alternatively, in analytical scoring methods, a score is allocated to each element within a task. Item scores are summed or otherwise manipulated to create a total task score. Crocker and Algina (1986) argue that whenever item scores are summed, the resulting test score should be described as a composite.

Choosing a scoring approach has sometimes stimulated heated debate. Development of the National Curriculum in England and Wales involved particular tension between Woolnough (1989) of England (arguing for holistic assessment) and Bryce and Robertson (1986) of Scotland (favouring an analytical approach). Consider the following section from an article entitled "Practical science assessment. Will it work in England and Wales?":

...teachers should no longer be urged to judge or rate 'holistically' their pupils' performances on 'experiments' or 'scientific investigations'. In a comprehensive examination of the international literature we have recently shown that, however desirable, there is no demonstrable evidence of validity and reliability in currently available versions of assessment by holistic teacher-judgement. You may consult the literature yourselves: it is all well documented. Should you do so you will be dismayed by the lack of practicality in the procedures suggested. An alternative to holistic assessment is the use of carefully structured assessment items such as practical test items and 'item-sets' where product checks and end-checks afford the means by which teachers can conduct practical assessment with individual pupils, in classes. (Bryce and Robertson, 1986, p. 63)

Developments in performance assessment techniques in the years since Bryce and Robertson wrote this article have addressed many issues pertaining to validity and reliability. These are now reviewed.

Holistic Scoring

The advantage of using holistic scoring is explained by the California Learning Assessment System (CLAS) as being that it accommodates "a wide range of student responses, as well as evaluating the student's thinking process" (Comfort, 1994, p. 17). The approach taken in California was to develop a 4-point scale, with a generic scoring shell used to provide descriptors for each point on the scale. The descriptors were then used to derive holistic scoring guides for each task. The scoring-guide shell points 4 and 1 are shown below in Figure 34.

Score Point 4
Demonstrates in-depth understanding of the scientific concept(s) and processes and of the problem and investigations. Responses include all elements of scientific design, and the responses are succinctly and clearly stated.
All data are presented in an organized fashion — complete and accurate diagrams, charts, tables, and/or graphs — with clear and effective supporting evidence. All observations are recorded and valid and demonstrate attention to detail. All questions are clearly communicated and ideas are communicated clearly and effectively. Explanations clearly support and demonstrate the relationship between data and conclusions. Responses demonstrate the need for additional testing and provide appropriate suggestions related to the problem (if appropriate for grade level). Scientific terminology is used in context with meaning (if appropriate for grade level).

Score Point 1
Demonstrates extremely limited or no understanding of the scientific concept(s) and processes or understanding of the problem and investigations. Responses include major misconceptions. Responses lack elements of scientific design and no supporting evidence is present. If data are presented they are vague and unorganized. Diagrams, charts, tables, and/or graphs are missing, incomplete, and inadequate. If supporting evidence is present, it is flawed, disjointed or does not match data. Observations are usually missing, and if recorded they are vague with no attention to detail or do not match the task. Attempts to answer a few questions on the task but responses are incomplete, do not show evidence of understanding and may not match the question. Explanations may be missing and student is not able to show the relationship between data and conclusion. Student may rewrite the prompt or write off topic.

Figure 34. CLAS Science Scoring-guide Shell Points 4 and 1 (Comfort, 1994, p. 20)

Comfort (1994) reports that a score of 4 represents performance that is "clear and dynamic", scores of 3 and 2 are characterized as "acceptable, adequate", and a score of 1 indicates performance that is "attempted, inadequate". The California procedure for creating task-specific criteria involves three steps.
Student response papers from a field testare sorted into groups representing the four score points. Next, a rationale is given foreach grouping. Finally, specific criteria for each score point are drafted with reference tothe scoring guide shell. The draft scoring guidelines for score points 4 and 1 of the CLAStask entitled “Spaceship U.S.A.” are shown in Figure 35.-164-Score Point 4 Data are organized in a table/chart with complete and accuratelabels. Explanations in questions 11 through 14 show a clearunderstanding of living/nonliving interdependency.Conclusions are drawn from interpretation of the student’s owndata on the chart.Score Point 1 Data are not organized well. Labels are incomplete orinaccurate, and there is no evidence of conceptualunderstanding.Figure 35. CLAS Science Score Points 4 and 1 for“Spaceship U.S.A” (Comfort, 1994, p. 32)Once all score points have been described, “anchor papers” are selected to “represent arange of high, medium and low student achievement within each score point” (Comfort,1994, p. 21). The score points established by this process are used at all scoring sitesacross the state. The scoring procedures for the CLAS science assessment are reported tohave acceptable levels of inter-rater reliability, with coefficients typically greater than 0.8(Comfort, 1994). An extensive set of operations lead selected teachers to work throughtasks, review scoring guides, and discuss specifications for the assessment task with pretrained table-leaders, all under the watchful eye of “chief readers” who were involved indefining the score points. After this orientation, the teachers worked on scoring a range ofanchor papers. Eventually they arrived at what was called the “calibration round” in whichthey all read the same anchor paper and defended the score they assigned. This processwas intended to ensure consistency.In order to score live papers, readers must qualify during this calibrationround. 
If a reader strays from the scoring guide while scoring live papers, the table leader is responsible for realigning the reader to the scoring guide. (Comfort, 1994, p. 32)

The number of student scripts in California is immense: over 400,000 fifth grade students were tested in 1994 with six sites designated as scoring centres. Consequently, the training and calibration procedures used were intended to be overtly structured, with the hope of ensuring reliable scoring at the level of the individual student.

The CLAS science assessment designers set out to develop long "coordinated" tasks. For example, "Spaceship USA" (Comfort, 1994) saw students respond to 14 questions in a period of 40 minutes. Comfort reports that "Spaceship USA" was assigned a single holistic score in the 1991 field trials. When it was used for further field trials in 1993, "Spaceship USA" was scored using logical groups of questions in order to provide a set of component scores (Comfort, 1994).

The scoring procedures for the Student Performance Component of the 1991 B.C. Science Assessment were also holistic in orientation, but the administration of scoring procedures was not weighed down by the massive numbers of the Californian sample. The procedures for developing scoring criteria in the B.C. project are set out in Chapter 3 of this dissertation. Of particular significance is the Contract Team's decision to define the key aspects of each task in the form of one or two specific questions, but to delegate the definition of criteria for specific levels of achievement to the teachers who administered the assessment. The Contract Team argues that this approach enabled the teachers to make:

a global judgement based on their professional experiences of students at that grade level. They were not asked to think of the task other than in its entirety, as a procedure that related to the actions of the students as they would perform them in everyday life.
It was this type of holistic judgement, related to the real life and real time aspects of the student's actions, that we felt the teachers were best able to perform, given their previous experience with the tasks and with students in general. (Erickson et al., 1992, p. 252)

The effects on the validity of the assessment process of enabling teachers to derive the scoring criteria are discussed in Chapter 4 of this dissertation.

Analytical Scoring

The perceived advantages of analytical scoring procedures tend to revolve around the fact that a greater number of score points can be generated to demonstrate student achievement in specific parts of a task. Item scores are usually aggregated to produce a composite score. Within an analytical scoring framework it is important to direct attention towards the nature of the item scores. Are scores to represent a continuum of achievement? Are they to be considered in terms of mastery of specific objectives? In my view, the nature of particular assessment tasks should guide the structure of the scoring procedures. Anastasi (1988) proposes that pass/fail scoring be limited to tests of mastery of basic skills. The development of the program "Techniques for the Assessment of Practical Skills in the Foundation Sciences" (TAPS) (Bryce et al., 1984) identified as a first phase the development of items to assess basic skills in science. TAPS was developed with the expectation that once such skills were mastered, students would move on to "more complex processes and strategies of scientific method" (Bryce and Robertson, 1986, p. 63). The TAPS tasks have been criticized on the grounds that the nature of science presented by the tasks is overly simplistic and should not be represented in this way (Millar and Driver, 1987; Woolnough, 1991).
Hanna (1993), writing about mastery tests in general, identifies the core of the problem:

Assessing all-or-none attainment of rigidly sequenced objectives is unnecessary when (a) the content is not inherently sequential and (b) achievement is not dichotomous. (p. 64)

A mastery based scoring and reporting system was used in the most recent international assessment, the 1991 International Assessment of Educational Progress (IAEP). The report, Performance Assessment: An International Experiment (Semple, 1992), presents the scoring scheme in terms of "credit for..." and gives results for students from different jurisdictions in terms of "Percentage of correct responses". The percentage of students succeeding ranged from 10% on certain tasks up to 100% on others. An example of the style of data presentation is shown in Figure 36, although the data in this figure do not represent actual results.

Percentage of Correct Responses (with Standard Errors)
Province A: 60 (3.1)
Province B: 42 (2.5)
Province C: 86 (1.3)
Country D: 77 (3.6)
Country E: 57 (1.1)
(horizontal axis: 0 to 100, % of task performed correctly)

Figure 36. IAEP Style of Presenting Scores (after Semple, 1992)

In Semple's report there are fewer than three pages about interpretation of the numbers, and the procedures used for interpretation. This absence may justify the criticism that "the performance-based items were not of the same quality as the multiple-choice questions" (Rothman, 1990b, p. 10). This was the reason given to excuse the absence of the United States from this project despite the fact that it was coordinated by a division of the Educational Testing Service in Princeton, New Jersey!

Hardy (1992) reports on a study in which four science performance assessment tasks were designed and administered to elementary school students in Georgia. Data are presented for three of the four tasks that were scored using both analytical and holistic scoring rubrics.
The time needed to score the tasks was found to be similar for each rubric and was "directly related to the complexity of the task, e.g., the type and number of judgements required for each student score". Hardy goes on to comment that inter-rater reliability:

...ranged from a low of 0.76 for holistic scoring of Design a Carton to no discrepancies in scoring (r = 1.00) for the objective scoring of Discovering Shadows and the holistic scoring of Shades of Color. (1992, p. 18)

Other values for inter-rater reliability coefficients are greater than 0.85. The correlations between analytical and holistic scoring are also described:

The correlations of analytical and holistic scores for Shades of Color and for Identifying Minerals were r = 0.66 and r = 0.78 respectively. Correlations of this magnitude are common for comparisons of analytic and holistic scoring for essays and writing tasks. The correlation of analytical and holistic scoring for Design a Carton was r = 0.33. This relatively low correlation suggests that, for this assessment, the two methods of scoring are likely measuring different constructs. (Hardy, 1992, p. 18)

The data for the task Design a Carton led Hardy to comment about how the design of the scoring rubric affected student scores:

An inspection of the data distributions suggests that the difficulty of the task for the target population produced a restriction of range that was more evident in the holistic scoring than in the analytical scoring. Although score points of 0-5 were anticipated for the holistic scoring, no scores of 4 or 5 were awarded. These points were possible only if a student group considered more than one carton design and showed a comparison. The restriction in range was less evident in the analytic scoring. Consequently, score points of "2" or "3" by the holistic rubric were correctly associated with a range of score points by the analytic rubric.
Conversely, papers assigned a "3," "4," or "5" by the analytic rubric are all classified as "2" by the holistic rubric. These differences point to the difficulty of developing rubrics when only limited data are available for tryout. A modification of the holistic rubric to better differentiate the actual range of responses would likely lead to a higher correlation with the analytic scoring. (1992, p. 23)

Issues arising from scoring approaches that artificially restrict the range of possible scores must be addressed in the design of assessment tasks, and in the piloting of scoring procedures. The problem appears to surface frequently and requires careful attention. For example, the Contract Team for the Student Performance Component of the 1991 B.C. Science Assessment observed that:

The teachers' judgements as to how well the students observed differences among the beans is puzzling. Sixty-one percent of students were judged to have performed at a satisfactory or better level, while only 1% were judged to have performed "extremely well." Teachers judged student performance by the number and complexity of the criteria they used. Category 3 required 1 criterion, Category 4 required 2 criteria while Category 5 required 3 or more criteria. We think that "more is better" is a teacher judgement which disadvantages students. Some students may well value one salient criterion as an appropriate way to observe and designate similarities. It is in this adult perspective of appropriateness that the student performance may be undervalued. (Erickson et al., 1992, p. 19)

While the original design of the Student Performance Component of the 1991 B.C. Science Assessment used holistic scoring, a consultant in one school district has recently rewritten the scoring guides for the Grades 7 and 10 stations (Klinger, 1994). The revised forms use analytical scoring techniques.
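The kinds of statistics reported in this section, an exact-agreement figure for two raters using the same rubric and a correlation between analytic and holistic scores, can be sketched with a short computation. The scores below are hypothetical values invented for illustration; they are not Hardy's data or the B.C. results.

```python
# A minimal sketch of two scoring statistics discussed in this section:
# inter-rater agreement on a holistic rubric, and the correlation between
# analytic totals and holistic scores. All data are hypothetical.
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def exact_agreement(r1, r2):
    """Proportion of papers given identical scores by two raters."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

# Hypothetical data: holistic scores (1-4) from two raters, and analytic
# totals (0-10) from a marking scheme, for the same ten papers.
rater_a  = [4, 3, 3, 2, 1, 4, 2, 3, 1, 2]
rater_b  = [4, 3, 2, 2, 1, 4, 2, 3, 2, 2]
analytic = [9, 7, 6, 4, 2, 10, 5, 7, 3, 4]

print(f"exact agreement between raters: {exact_agreement(rater_a, rater_b):.2f}")
print(f"analytic vs holistic correlation: {pearson_r(analytic, rater_a):.2f}")
```

The two indices answer different questions: agreement asks whether two raters assign identical categories, while the correlation asks whether two scoring schemes rank the same students similarly. That distinction is why a task can show acceptable inter-rater reliability yet a low analytic-holistic correlation, as Hardy found for Design a Carton.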
The provision of a "marking scheme", rather than holistic descriptors for levels of performance, is seen by Klinger as more analogous to the regular practice of secondary science teachers.

A challenge for developers and users of assessments is to go beyond the explicit statements of criteria and to reflect upon the type of performance that deserves a particular score. Although the criteria for categories of performance were drawn up by practicing teachers who themselves had experienced both working through the task and administering the assessment, there appears to be little congruence between expected and actual levels of student performance. Other teachers, who were involved with pilot-testing the package Science Program Assessment Through Performance Assessment Tasks: A Guide for School Districts (Bartley et al., 1993), described some of these criteria as "blatantly unfair" and argued that the criteria should be revised to concur with what the students actually were asked to do.

Scoring procedures need to be sensitive to the context of the task, as well as able to differentiate between performance levels. Analytical approaches have significant value when the focus of the assessment task is either mastery of clearly specified skills or recall of facts. For open-ended problems, holistic scoring rubrics offer opportunities to evaluate a range of different approaches to solving the problem. If, as Hanna (1993) argues, the purpose of educational measurement is to inform teaching and learning, then carefully crafted holistic scoring rubrics have great potential for educating students about the types of performance valued by their teachers. Analytical approaches have value in the identification of specific, well-defined competencies, and should also have a place in any assessment structure.

PROCEDURES FOR INTERPRETATION

The final piece of the construct validity mosaic is the interpretation of the assessment data. Implied here is also an evaluation of the assessment process itself.
The work of Kaplan (1964), Messick (1989a; 1989b) and Cherryholmes (1988) points to the importance of considering the values and expectations of those persons empowered to interpret test scores. Berlak (1992) believes that the question of assessment context must also be considered. He goes so far as to suggest that an emerging contextual paradigm should be preferred to a traditional psychometric paradigm. Warning against the use of dichotomous categorization of the paradigms, Berlak identifies four foundation assumptions of the psychometric paradigm and presents counter assumptions. My summaries of his analysis are shown in Figure 37.

Berlak and his co-authors of Toward a New Science of Educational Testing and Measurement (Berlak, Newmann, Adams, Archbald, Burgess, Raven, & Romberg, 1992) pose many challenges to those involved in educational measurement. Two crucial factors make their critique of traditional measurement practices particularly influential. First, alternative forms of assessment have gained favour with many educational policy-makers, teachers and students. Second, Berlak's group has access to a large audience through articles in journals such as Educational Leadership and Phi Delta Kappan. Romberg's work (with Zarinnia, 1987) has been recognized by the authors of the TIMSS framework (Robitaille et al., 1993) as a significant factor in their decision to move away from the traditional grid.

Psychometric Paradigm / Contextual Paradigm

Assumption 1: Universality of Meaning. There exists "universality of meaning" where there exists a single meaning about exactly what a standardized or criterion referenced test is claimed to measure. Such agreement in meaning transcends social context and history.

The Counter Assumption: Plural and Contradictory Meanings. Plurality of meanings, differences and contradiction in perspectives are inevitable in a multi-cultural world where individuals and groups have different histories, divergent interests and concerns. Validating educational tests based on psychometric canons represents a quest for certainty and consensus where certainty is impossible, and agreement is unlikely unless differences are suppressed and consensus is overtly or covertly imposed.

Assumption 2: The Separability of Ends and Means and Moral Neutrality of Technique. If tests are constructed and interpreted according to accepted standards, they are scientific instruments which are value neutral and capable of being judged solely on scientific merits. Standardized and criterion-referenced tests represent an advance over prescientific and subjective forms of assessment. The ends and the means of testing are seen to be separable. The assessment scientist is best equipped to make judgements about means, that is to develop the ways of assessing educational outcomes and how these are to be properly used and interpreted.

The Counter Assumption: The Inseparability of Means and Ends. The argument as to whether a test measures what it is claimed to measure rests upon the case made for its construct validity. Judgement about a test's validity requires choices among contradictory values, beliefs and schooling practices. If judgements about assessment procedures and testing are left to experts, then they assume the responsibility for resolving differences over basic moral questions which in a democratic society should be settled by ordinary citizens and/or their democratically elected representatives.

Assumption 3: The Separability of Cognitive from Affective Learning. The assessment of learning outcomes must be separated into distinct and mutually exclusive categories, separating cognition or academic learning from affect, interests, or attitudes. There exists a three-way classification of human learning which divides head, hand and heart, i.e., the realm of the intellect, of feelings and values, and of manual dexterity. Bloom's Taxonomy of Educational Objectives (1956) legitimated the distinctions that are now treated as virtually self-evident in the discourse of teachers.

The Counter Assumption: The Inseparability of Cognitive, Affective and Conative Learning. Raven (1992) disagrees with this classification, which separates the cognitive from the affective, but also subsumes the conative under "affective." Conative behaviour is that related to determination, persistence and will. Development of human capacities is highly contingent upon the social context, as well as the learner's will, interest and knowledge.

Assumption 4: The Need for Control from the Center. Testing and assessment procedures are a form of surveillance whose use is a super-imposition of a power relationship. Tests shape the school's curriculum and pedagogy — a form of social control. The technology used in the assessment process will encourage particular forms of management and human relationship with the organization, while suppressing others. Mass administered tests are suited to exercising control from the center, e.g., district office, and objectify the subject by reducing all human characteristics to a single number.

The Counter Assumption: Assessment for Democratic Management Requires Dispersed Control. There is a need to change the unidirectional nature of the power relationship that has been imposed by the use of standardized and criterion referenced tests. There must be reform of the system of assessment in such a way that it disperses power, vesting it not only in administrative hands but also in the hands of teachers, students, parents and citizens of the community a particular school district serves. It is clear that good schools require a strong measure of autonomy by teachers, other school-based professionals, and participation by the local school-community.

Figure 37. Berlak's Measurement Paradigms (1992)
The impact of Toward a New Science of Educational Testing and Measurement is such that the National Council on Measurement in Education (NCME) reviewed it in both association journals. The review in the Journal of Educational Measurement was written by Loyd (1994), and that in Educational Measurement: Issues and Practice was done by Lane (1994).

Berlak (1992) identifies the key opportunity in a contextual paradigm as holding the power to make decisions that affect oneself. Cherryholmes (1988) is similarly direct:

Construct validity requires occasional departure from conventional practices of the past, or else the past will dominate the future. Construing construct validity as radical, critical pragmatism takes Cronbach and Meehl another step in the direction they headed in 1955.

Construct validity and validation is what those in authority choose to call it. If they choose to exclude phenomenological investigation, so be it. If they choose to exclude ideological criticism and discussions of power, so be it. If they choose to stipulate meanings for constructs and enforce them, so be it. But embracing what is in place does not free authorities from the strictures of inherited discourses. They simply are indulged for being in a privileged position. At bottom, construct validation is bound up with questions such as these: What justifies our theoretical constructs and their measurement? What kinds of communities are we building and how are we building them? (p. 129)

I agree that alternative voices should be included in the interpretation process. This leads me to elaborate on specific areas of concern that should be considered in the interpretation of hands-on performance assessment tasks in science.
My intent is to clarify the features that affect what can be claimed as a result of the administration of an assessment.

INTERPRETATION OF HANDS-ON PERFORMANCE ASSESSMENT IN SCIENCE

A series of structural considerations for the interpretation of performance assessments in science is proposed in this section. These are articulated in terms of a set of pre-suppositions, followed by specific concerns related to each part of the process.1 While the aspirations of Berlak and his co-authors (1992) lead them to argue for a contextual paradigm, authors such as Cronbach (1988, 1989), Messick (1989a, 1989b, 1994), Shepard (1993) and Moss (1992, 1994) have suggested a broadening of the interpretation process as part of the expansion of construct validity. The first set of pre-suppositions arises from considerations pertaining to the purpose of testing, the second set relates to the nature of assessment tasks. The third set is intended to generate discussion about the attributes of the personnel who score and interpret student responses.

Those who develop assessments are responsible for the identification of the purpose of the assessment (AERA, APA, & NCME, 1985). In defining the statement of purpose for hands-on performance assessment in science, I believe four areas should be articulated:

1. A set of explicit goals for the assessment, together with an analysis of how the goals have been operationalized (see Chapter 4 of this dissertation).

2. The intended student population for the test, with details of pilot testing procedures that were used in test development for this population (see Chapter 3 of this dissertation).

3. Purposes and populations for which the set of test items should not be used, together with an extended justification of such claims.

4.
Changes that the testing program is intended to effect, with an analysis of the nature of potential unintended consequences (see Chapter 4 of this dissertation).

These four statements are directed to the assessment developer, who must specify the intended purpose for assessment tasks, and also must identify applications for which the set of tasks are inappropriate, and state why this is so.

1 It is not the author's intent to replace the Standards for Educational and Psychological Testing (AERA, APA & NCME, 1985), but to identify specific features that may impact upon the validity of the inferences that might be made on the basis of student scores.

My second area of concern is the nature of school science that is portrayed by the assessment tasks. Again, the onus is upon the developer to explain:

1. The curriculum emphases that were seen to drive the development of the assessment tasks.

2. The perspective towards learning that guided the development of the assessment items, and how such a predisposition influenced item development.

3. The rationale for the types of task and their selection.

4. The domain represented by the set of test items and its representation of the universe of interest.

5. An articulation of the proposed system of scoring.

These five areas reflect the requirement for the developer to identify and defend the assumptions that influence the development of the assessment instrument.

The final important area I shall discuss is the question of who should interpret test scores. Pertinent to this issue are two statements in the Standards (AERA, APA, & NCME, 1985).

Standard 5.5: Test manuals should identify any special qualifications that are required to administer a test and to interpret it properly. Statements of user qualification should identify the specific training, certification, or experience needed.
(Primary)

Standard 8.2: Those responsible for school testing programs should ensure that the individuals who use the test scores within the school context are properly instructed in the appropriate methods for interpreting test scores. (Primary)

These statements were written in 1985 and are undergoing revision; drafts of new standards are expected in 1995. Application of Standard 5.5 as written above should restrict the interpretation of most assessments to those few experts with specific qualifications in educational measurement. For hands-on performance assessment in science, there are other factors which are worthy of consideration and could broaden the membership of a potential interpretation panel.

1. There are measurement personnel whose training qualifies them to participate in the process. However, if they have not worked through the tasks, or administered the assessment to students, or graded the response sheets, these people may be eminently qualified but their input needs to be augmented as they lack appropriate science curriculum experience.

2. Test developers explore the nuances of each task and are very much aware of the measurement properties of items and how students are likely to respond. Unless this group includes practicing teachers, its members will not be aware of current classroom practices and therefore will have insufficient information and/or experience to make valid interpretations of scores without the participation of others.

3. Science educators from university faculties of education are able to provide a considered perspective about student learning in science. However, unless such personnel have extensive experience in the design of performance assessment tasks, and the specific set of tasks for consideration, useful input is likely to be limited. Nevertheless, such people are valuable members of an interpretation team.

4.
Teachers who have been trained to administer the assessment are constructive interpreters, as these people have themselves completed the tasks and assigned scores to the student responses. Their professional classroom experiences enable them to be aware of the relationship between the test content and the actual school curriculum. It is unlikely that they will have undergone extensive training concerning the issues of test interpretation, but they are valuable contributors to the interpretation team.

5. Politicians, school trustees, parents and representatives of the community are stakeholders whose everyday existence does not take them into classrooms. Nevertheless, their interest in student performance is evident, as is their experience and expertise.

6. Students are increasingly taking greater responsibility for their own learning. A logical consequence of this development is the inclusion of students in the interpretation process. While students lack the experiential background of the other potential members of an interpretation panel, their participation provides an opportunity to expose other members of the education community to students' unique perspectives towards the process.

The package for district assessment, Science Program Assessment Through Performance Assessment Tasks: A Guide for School Districts (Bartley et al., 1993), identifies a preferred mix of people for the interpretation process:

Those teachers who administered and coded the assessment are essential members of the panel. Their knowledge of the purposes of each station, the coding and judgement process, and their sense of how well students did and should do will be invaluable. To provide a different and a wider perspective it is suggested that other teachers, school and district administrators, and trustees would constitute a representative and powerful forum for interpretation.
As suggested earlier, the constitution of the panel should be determined in the planning stages of the assessment process to allow for all members to participate in orientation workshops.

It is recommended that the interpretation panel work in grade level teams. Each team would consist of between three to eight people representing the various special interests identified earlier. The presence of teacher/administrators in each grade level group is essential. (Bartley et al., 1993, p. 35)

The author's experience in developing this district package, and a subsequent series of workshops around British Columbia, was that teachers who first work through performance assessment tasks, and then observe students from their own classes under assessment conditions, are pleasantly surprised by their students' performances.

Students can be included in the interpretation process in many ways. This might start at the level of discussions about students' impressions of the tasks during piloting, or could involve teachers discussing the tasks with students after the assessment has been administered. Alternatively, teachers could collect written responses at the time the assessment is completed. To involve students in the grading process offers an exceptional
It is extremely important that classroom teachers participate, in order that their awareness of how specific tasks are synchronized with district or classroom practices be included in the interpretation process.

CHAPTER 7 — CONCLUSIONS

The purpose of this study was to develop a framework for the validation of the use of hands-on performance assessment tasks in school science. The first section of this chapter is composed of my responses to each of the four research questions. In the next section I draw together the claims that I have made in responding to the questions. A discussion about the use of performance assessment tasks in science is included in this chapter, which concludes with a statement of the implications for future research.

The systematic framework that I propose in response to the first question provides direction for the validation inquiry. However, in order to evaluate the overall utility of the assessment process I strongly believe that it is vital to consider the complete set of research questions. This is because the values of those who propose, develop and use an assessment must be illuminated.

RESPONSE TO THE FOUR RESEARCH QUESTIONS

1. What are the essential components of a systematic framework for the development and administration of performance assessments in science?

Analysis of the requirements for validation serves to shape the framework that I advance in response to this question (see Chapter 4). My view of the process of validation has been guided by the perspectives towards validity elaborated by both Messick and Shepard. Messick writes of the need to consider "functional worth" (1989b, p. 17) while Shepard poses her validity question in the form "What does the testing practice claim to do?" (1993, p. 429).
My interpretation of these authors' work leads me to conclude that a framework for the development and administration of performance assessments should provide an appropriate foundation upon which the final inferences can be built, and in relation to which the functional effects of the testing process can be analysed. The framework presented in Chapter 4 contains eight headings which identify areas to be considered in the collection of evidence. These are:

(1) Purposes of the Assessment
(2) Learning and Communication in Science
(3) Content Analysis
(4) Instrumental Stability
(5) Administration Stability
(6) Internal Consistency and Generalizability
(7) Fairness
(8) Consequences

This approach represents a significant expansion in the evidential basis for a comprehensive and unified analysis of the validity of assessment instruments.

The examination of purposes and consequences of an assessment represents much more than beginning and end points. While the "grand purpose" usually remains consistent throughout the use of a particular assessment instrument, the operational purpose demands careful attention during development. As the assessment proceeds, opportunities abound for subtle deviations from the explicit purpose. That purpose and consequence should be inter-connected is self-evident in the analysis of intended consequences, but the search for unintended consequences is also an essential element of this, and every other type of assessment.

The fact that a model of learning and communication in science is conveyed by assessment tasks is a consequence that requires special attention. The impact of basic skills testing on what teachers present within their classrooms and what students learn is described in Chapter 1. Baron (1991) writes of the "magnetic attraction" of tests, and of how good assessment serves instruction well. In the language of psychometrics, the domain of a test should provide an adequate representation of the universe of interest.
A test which appears to require the student to look for a single correct answer does not serve as an adequate representation of science. However, the type of science espoused in the Draft of the National Science Education Standards (National Research Council, 1994) has been characterized by the science assessment standards working group as "authentic" in approximating how scientists do their work.

Content analysis and the monitoring of instrumental and administrative stability require that some attention be given to the more traditional approaches of evaluating validity. This is not any less important in performance assessment in science, but it is not the ultimate goal. Instead, these considerations take their place alongside other worthy validity concerns.

My intent in developing this framework was to ensure that the process of assessment is defensible, that claims about student performance are justifiable, and that there is vigilance for unexpected consequences. In order that a model be of practical use, its application in real life must be feasible. The framework proposed has been applied in an evaluation of the development and administration of the Student Performance Component of the 1991 B.C. Science Assessment, where its utility is demonstrable.

2. What are the essential characteristics of descriptors of student performance on performance tasks in science?

A starting point in the description of practical skills in science is provided by the work of Millar (1991). I have proposed a further refinement to illustrate the context in which these practical skills are demonstrated. The question "what does this task measure?" is better rephrased as two questions: "what can be reasonably claimed that this task measures?" and "what do students do as they work through this task?". Complex performance assessment tasks tend to be difficult to describe since they are multi-dimensional, including requirements in both skill and content knowledge.
The model I propose facilitates examination of the interaction between practical skills and science content. For example, the skills and content demands of a single task, “Environmental Testing”, are illustrated in Figure 32. In Figure 33 I demonstrate the utility of the model in presenting the complete set of stations used at Grade 10 for the Student Performance Component of the 1991 B.C. Science Assessment. A plot of the stations upon the template clearly portrays the science content and skills to be assessed by the set of tasks. There are two major uses for such a plot: first, to illustrate the adequacy of content representation, and second, to analyze task overlap in science content and skills. Considerable variation in student performance across different tasks has been a consistent finding in performance assessments in science (Shavelson et al., 1994). Theoretically, a student should perform at a similar level on tasks which occupy overlapping areas on the template. This expectation is based upon the premise that the tasks measure similar attributes.

A further element in my description of hands-on science assessment tasks comes in the form of statements about the students’ actions, rephrased as objectives for each task. The behaviorist elements of describing what students do cannot be denied, but the narration of student behaviour takes the description of the task many stages beyond the provision of an equipment list and a set of instructions. The most suitable manner of conveying the attributes of a task is to administer the task to the interested parties, and then enable these people to observe students attempting the tasks under assessment conditions.

3. What are the implications of using different strategies for scoring student achievement upon the interpretation of student performance?

Complex performance assessment tasks may suffer greatly or gain dramatically by the division of each task into component skills in order to develop scoring rubrics.
Analytical scoring systems are most promising where there is a need to focus upon specific responses, for example, in a task to measure specific lengths. The holistic approach to scoring is generally less prescriptive. It has demonstrable advantages when tasks offer students latitude in performance, for example, a “design an experiment to...” type of task. I conclude that the development of a system for scoring performance assessment tasks requires careful analysis of both the nature of the task and the scoring system.

4. How could/should test scores in performance assessments be interpreted and used?

Who should interpret student performance assessments in science is a question of major importance. The constituency of interested parties is large, although some have limited experience or qualifications. I believe that those familiar with the educational context of the curriculum and classroom, and those with knowledge of the measurement issues, should have the most input, but others such as parents, students, and members of the community should be included to ensure that the procedures for interpretation are fair and as free from bias as possible.

SUMMARY

In responding to the four research questions, I have described the critical features of a framework for validation inquiry into the use of hands-on performance assessment tasks in the curriculum area of school science. My decision to propose and use a validation framework that features an “integrated evaluative judgement” (Messick, 1989b, p. 13), rather than a group of loosely linked investigations, sets this work within current measurement theory. Developments in conceptualizations of validity, and consequent new directions for validation inquiry, have had a major impact upon the design of the framework and the approach taken to interpret validity evidence. I must emphasize that responses to measurement-driven validity questions are entwined with the theories of learning science that were used to develop an assessment framework.
School science is the focus of my discussion of learning theories. Permeating the whole analysis of every assessment strategy is an underlying theory of learning. Shepard (1991) is critical of measurement-driven instruction because of its underlying model of learning:

Bracey, Shepard, and others disagree fundamentally with measurement-driven basic skills instruction because it is based upon a model of learning which holds that basic skills be mastered before going on to higher order problems, as Popham has suggested when he says, “Creative teachers can efficiently promote mastery of content-to-be-tested and then get on with other classroom pursuits” (1987, p. 682). (Shepard, 1991, pp. 2-3)

This debate within the measurement community parallels the disagreement between science educators who argue for the “step-up approach”, for example Bryce and Robertson (1986), and those who propose a holistic approach to assessments, such as Woolnough (1989) or Millar (1989).

Performance assessments have generated much discussion in the measurement literature as theoreticians attempt to explain the measurement properties of items where students are offered considerable latitude in their responses. Classical test theory has been found wanting in this regard because of task sampling variability and the low number of items usually included. Generalizability theory has been used by Shavelson and his associates to examine the characteristics of performance tasks in science (Shavelson et al., 1994). They have had mixed success in predicting the number of items required to produce an acceptably generalizable score.
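As a brief illustrative aside (standard generalizability theory, following Cronbach, Gleser, Nanda, and Rajaratnam, 1972, and Brennan, 1983, rather than a detail of the Shavelson analyses), the one-facet person × task design underlying such studies partitions observed-score variance as

```latex
\sigma^2(X_{pt}) \;=\; \sigma^2_{p} \;+\; \sigma^2_{t} \;+\; \sigma^2_{pt,e},
```

and the generalizability coefficient for relative decisions based on a mean score over $n_t$ tasks is

```latex
E\rho^2 \;=\; \frac{\sigma^2_{p}}{\sigma^2_{p} + \sigma^2_{pt,e}/n_t}.
```

When the person-by-task interaction component $\sigma^2_{pt,e}$ is large, as substantial task-sampling variability implies, $n_t$ must be correspondingly large before the coefficient reaches an acceptable level; for absolute decisions the task main effect $\sigma^2_{t}/n_t$ enters the error term as well.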
The difficulties that these authors found were, I believe, a direct result of their failure to examine the problem from the perspective of learning in science.

The challenge for those involved in examining the technical adequacy of performance assessment instruments is to include the cognitive components of the items, and to consider the characteristics of the population used to generate the data, in addition to examining the statistical properties of instruments and data. With this in mind, I conclude that:

The development of performance assessment tasks in science requires a clear enunciation of the perspective towards student learning that has driven the assessment.

I hasten to add that merely to identify a learning theory used in an assessment design is insufficient. Test developers must be able to clarify the assumptions used in operationalizing the theory in the context of the assessment tasks. There are several important constraints. These are:

1. The developer must use a theory of learning science to justify task structures.
2. The questions asked and decisions made during pilot studies and field testing should follow directly from the stated theory of learning.
3. The scoring system should reward features of performance that are consistent with the theory of learning.
4. The theory of learning should be used to predict and explain correlational evidence of convergent and divergent validity.

The choice of a theory of learning in which to embed the design of an assessment system is a defining point for that assessment, and must be clearly articulated for those who will use the items and interpret the performance. This consideration should be examined as part of the validation framework. The importance of clarifying the interpretation of the learning perspective is shown in the following example. The constructivist perspective towards learning in science was identified as inspiring the design of two of the assessment projects described in this dissertation.
The Contract Team for the Student Performance Component of the 1991 B.C. Science Assessment distinguished a “basic tenet of constructivism that the child is both capable and indeed must experiment in order to construct meaning” (Erickson, 1990, p. 6). This led to a structure with many open-ended tasks. Conversely, reflecting upon her experience as research officer for the TAPS project, Robertson states that the “TAPS research was strongly influenced by the constructivist approach, based upon the findings of Piaget and Bruner” (1994, p. 1). The TAPS group perceived that the work of Piaget and Bruner substantiated the design of a project for classroom assessment of science process skills; the resultant series of assessment tasks is highly structured, with many closed questions. Both these projects claim to be constructivist in orientation, yet the range of tasks and the scoring systems are too different to be explained solely by contextual features. The reason for highlighting such striking differences in design is to show that it is insufficient merely to state a view of learning. Developers must clearly enunciate specific details of the chosen perspective, review the implications of its choice, and carefully apply this perspective in attempting to make sense of student performance.

FINAL REMARKS

Much discussion in the measurement community was generated by Messick’s (1989b) treatise on validity. Unfortunately, instances of the application of a unified view have been limited. Indeed, both of two recent validation inquiries¹ are framed with traditional views of validity evidence. Similarly, the 1993 measurement text by Hanna identifies Messick’s unified approach as “an alternative view” while the traditional separation of types of validity evidence is described as “common usage” (1993, p. 408). Application of the unified view may be hampered if the examination of consequential evidence appears to be never-ending (Shepard, 1993).
A further philosophical issue must be considered in the examination of the underlying values in assessment design and validation inquiry. To follow the guidelines set out in the Standards (AERA, APA, & NCME, 1985) is to accept values of the measurement community articulated over 10 years ago. The clarification of values in assessment has been a consistent part of Messick’s agenda for the last 20 years (1975, 1989a, 1989b, 1994). Others have since added their support (Cherryholmes, 1988; Berlak, 1992; Moss, 1992; Shepard, 1993). In a challenge to the Standards (AERA, APA, & NCME, 1985), Berlak and his co-authors (1992) have proposed an alternative paradigm to take account of the many contextual influences that impact upon assessment design and interpretation.

¹ Both studies appeared in Educational Measurement: Issues and Practice. The titles are “Determining the Validity of Performance Based Assessment” by Burger and Burger (1994) and “The Vermont Portfolio Assessment Program: Findings and Implications” by Koretz, Stecher, Klein, and McCaffery (1994).

Many are looking to the revision of the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1985) for guidance in a wide range of contexts. Linn, writing in the December 1994 edition of Educational Researcher, identifies three additional factors that will impact upon the design of the “new” Standards. These are (1) the political agenda of the U.S. federal government, (2) the creation of U.S. national standards, and (3) the expanded reliance upon performance-based assessment. These Standards will be used extensively in Canada, where the parallel pressures are (1) the expanded interest of the provincial governments in setting and monitoring standards², and (2) greater use of performance-based assessment.
Opportunities exist for Canadian measurement professionals to become involved in the revisions of the Standards at meetings or by written submission, as took place with the Program Evaluation Standards (Sage, 1994).

The emerging literature on science education standards is linked with the revisions in measurement standards. Many changes in standards are proposed. Some have moved beyond a statement to proceed and are already at the response stage, for example the Draft of the National Science Education Standards (National Research Council, 1994). Others, such as the provincial curriculum standards in B.C., are in the process of preparation for the consultation process. The influence of the new American national education standards will inevitably permeate north of the border. There is a wide range of emphases: teaching, professional development, assessment, content, program, and system. One of the key areas of emphasis in the new science curriculum documents is the enabling of students to participate in and experience scientific inquiry. This is an area where performance tasks offer significant potential for direct assessment in the form of investigations. The positive experiences of the students and the teacher/administrators with the open-ended investigations used in the Student Performance Component of the 1991 B.C. Science Assessment attest to the value of such an approach to assessment.

² The B.C. Ministry of Education identifies meeting “standards for achievement in curriculum” as part one of its current (September, 1994) goals of education. Similarly, the Ontario Royal Commission on Learning recommends that clearly written standards be developed in science (January, 1995).

The expanding future for hands-on performance assessment tasks in science offers a broad range of research opportunities. Three areas follow directly from this study.
The first relates to the use of the validation framework to evaluate claims and the consequential effects of the use of performance assessments in a context other than that presented here in Chapter 4. The second area is to use the model template illustrated in Chapter 5 to examine content representation, in conjunction with generalizability studies, to predict task-sampling variability. This will aid in the design of assessments that are sensitive to such effects. Research should also be directed towards a comparative analysis of holistic and analytical scoring systems, with the explicit intent of providing some guidelines to suggest under what circumstances they can be synthesized and when a particular approach might illuminate student performance with greater clarity.

REFERENCES

American Association for the Advancement of Science (1965). Science - a process approach. Xerox.
American Psychological Association (1954). Technical recommendations for psychological tests and diagnostic techniques. Washington, DC: Author.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1985). Standards for educational and psychological testing. Washington, DC: American Psychological Association.
Anastasi, A. (1988). Psychological testing (6th ed.). New York, NY: Macmillan.
Anderson, R. (1990). California: The state of assessment. Sacramento: California State Department of Education.
Archbald, D. A., & Newmann, F. M. (1988). Beyond standardized tests: Assessing authentic academic achievement in the secondary school. Reston, VA: National Association of Secondary School Principals.
Archenhold, W. F., Bell, J., Donnelly, J., Johnson, S., & Welford, G. (1988). Science at age 15: A review of A.P.U. findings 1980-84. London: HMSO.
Baron, J. B. (1991). Performance assessment: Blurring the edges among assessment, curriculum and instruction. In G. Kulm & S. Malcolm (Eds.), Science assessment in the service of reform (pp. 247-266).
Washington, DC: American Association for the Advancement of Science.
Bartley, A. W. (1991). Student performance tasks: Administration manual. Vancouver, BC: University of British Columbia.
Bartley, A. W., Carlisle, R. W., & Erickson, G. (1993). Science program assessment through performance assessment tasks: A guide for school districts. Victoria, BC: Ministry of Education and Ministry Responsible for Multiculturalism and Human Rights.
Bateson, D. J., Anderson, J., Brigden, S., Day, E., Deeter, B., Eberlé, C., Gurney, B., & McConnell, V. (1992). British Columbia Assessment of Science 1991 Technical Report I: Classical component. Victoria, BC: Ministry of Education and Ministry Responsible for Multiculturalism and Human Rights.
Bateson, D. J., Anderson, J. O., Dale, T., McConnell, V., & Rutherford, C. (1986). Science assessment 1986, general report. Victoria: Ministry of Education.
Bateson, D. J., Erickson, G., Gaskell, P. J., & Wideen, M. (1992). British Columbia assessment of science: Provincial report 1991. Victoria, BC: Ministry of Education and Ministry Responsible for Multiculturalism and Human Rights.
Bateson, D. J., & Parsons, S. (1989). Sex-related differences in science achievement: A possible testing artifact. International Journal of Science Education, 11 (4), 371-385.
Bathory, Z. (1985). The CTD science practical survey. Studies in Educational Evaluation, (9), 165-174.
Baxter, G. P., Shavelson, R. J., Goldman, S. R., & Pine, J. (1992). Evaluation of procedure-based scoring for hands-on assessment. Journal of Educational Measurement, 29 (1), 1-17.
Baxter, G. P., Shavelson, R. J., Herman, S. J., Brown, K. A., & Valadez, J. R. (1993). Mathematics performance assessment: Technical quality and diverse student impact. Journal for Research in Mathematics Education, 24 (3), 190-216.
Bereiter, C. (1985). Towards a solution of the learning paradox. Review of Educational Research, 55 (2), 201-206.
Berlak, H. (1992). The need for a new science of assessment. In H. Berlak, F. M. Newmann, E.
Adams, D. A. Archbald, T. Burgess, J. Raven, & T. Romberg (Eds.), Toward a new science of educational testing and measurement (pp. 1-21). Albany, NY: State University of New York Press.
Berlak, H., Newmann, F. M., Adams, E., Archbald, D. A., Burgess, T., Raven, J., & Romberg, T. (1992). Toward a new science of educational testing and measurement. Albany, NY: State University of New York Press.
Black, P. (1986). Integrated or coordinated science? School Science Review, 67 (241), 669-681.
Bloom, B. (1956). Taxonomy of educational objectives. New York, NY: David McKay.
Blumberg, F., Epstein, M., MacDonald, W., & Mullis, I. (1986). A pilot study of higher order thinking skills assessment techniques in science and mathematics - final report. Princeton: National Assessment of Educational Progress.
Brennan, R. L. (1983). Elements of generalizability theory. Iowa City, IA: American College Testing Program.
British Columbia Department of Education (1975). Assessment planning: B.C. assessment program. Victoria, BC: Author.
Bryce, T. J. K., McCall, J., MacGregor, J., Robertson, I. J., & Weston, R. A. (1984). Techniques for the assessment of practical skills in foundation science: Report of the project (1980-1983). Glasgow: Jordanhill College of Education.
Bryce, T. J., & Robertson, I. J. (1985). What can they do? A review of practical assessment in science. Studies in Science Education, 12, 1-24.
Bryce, T. J., & Robertson, I. J. (1986). Practical science assessment. Will it work in England and Wales? The Times Educational Supplement, (18.04.86), 63.
Burger, S. E., & Burger, D. L. (1994). Determining the validity of performance-based assessment. Educational Measurement: Issues and Practice, 13 (1), 9-15.
California Department of Education (1990). California science framework. Sacramento, CA: Author.
Candell, G. L., & Ercikan, K. (1992). Assessing the reliability of the Maryland School Performance Assessment Program using generalizability theory.
Paper presented at the Annual Meeting of the National Council on Measurement in Education, San Francisco.
Carnegie Commission on Science, Technology, and Government. (1991). In the national interest: The federal government in the reform of K-12 math and science education. New York, NY: Author.
Carnegie Forum on Education and the Economy. (1986). A nation prepared: Teachers for the 21st century. New York, NY: Carnegie.
Champagne, A. B., Lovitts, B. E., & Callinger, B. J. (1990). This year in school science 1990: Assessment in the service of instruction. Washington, DC: American Association for the Advancement of Science.
Cherryholmes, C. (1988). Power and criticism: Poststructural investigations in education. New York, NY: Teachers College Press.
Churchman, C. W. (1971). The design of inquiring systems: Basic concepts of systems and organizations. New York, NY: Basic Books.
Cole, N. S. (1991). The impact of science assessment on classroom practice. In G. Kulm & S. Malcolm (Eds.), Science assessment in the service of reform (pp. 97-105). Washington, DC: American Association for the Advancement of Science.
Cole, N. S., & Moss, P. (1989). Bias in test use. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 201-219). New York, NY: Macmillan.
Comfort, K. B. (1990). New directions in science assessment. Sacramento: California Department of Education.
Comfort, K. B. (1994). A sampler of science assessment: Elementary. Sacramento: California Department of Education.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New York, NY: Holt, Rinehart and Winston.
Cronbach, L. J. (1970). Essentials of psychological testing (3rd ed.). New York, NY: Harper and Row.
Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 443-507). Washington, DC: American Council on Education.
Cronbach, L. J. (1988). Five perspectives on validity argument. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 3-17).
Hillsdale, NJ: Erlbaum.
Cronbach, L. J. (1989). Construct validity after thirty years. In R. L. Linn (Ed.), Intelligence: Measurement, theory and public policy (pp. 147-171). Urbana: University of Illinois Press.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements. New York, NY: Wiley.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281-302.
Dabney, V. M. (1993). How can we insure accurate and reliable data from authentic assessments, or can we? Paper presented at the Annual Meeting of the American Educational Research Association, Atlanta, GA.
Department of Education and Science and the Welsh Office. (1988). Science in the national curriculum. London: HMSO.
Department of Education and Science and the Welsh Office. (1991). Science in the national curriculum (1991). London: HMSO.
Donnelly, J. F., & Gott, R. (1985). An assessment-led approach to processes in the science curriculum. European Journal of Science Education, 7 (3), 237-251.
Doran, R. L., & Tamir, P. (1992). Results of practical skills testing. Studies in Educational Evaluation, 18 (1), 393-406.
Driver, R., Gott, R., Johnson, S., Worsley, C., & Wylie, F. (1982). Science in schools. Age 15: Report no. 1. London: HMSO.
Driver, R., Guesne, E., & Tiberghien, A. (Eds.). (1985). Children’s ideas in science. Milton Keynes: Open University Press.
Erickson, G. (1990). Report to Ministry of Education on the assessment of students’ practical work in science. Vancouver: The University of British Columbia.
Erickson, G., Bartley, A. W., Blake, L., Carlisle, R. W., Meyer, K., & Stavy, R. (1992). British Columbia assessment of science 1991 technical report II: Student performance component. Victoria, BC: Ministry of Education and Ministry Responsible for Multiculturalism and Human Rights.
Erickson, G., Carlisle, R. W., & Bartley, A. W. (1992). Performance assessment component. In D. J. Bateson, G. Erickson, P. J.
Gaskell, & M. Wideen (Eds.), British Columbia assessment of science: Provincial report 1991 (pp. 25-40). Victoria, BC: Ministry of Education and Ministry Responsible for Multiculturalism and Human Rights.
Erickson, G., & Farkas, S. (1991). Prior experiences and gender differences in science achievement. Alberta Journal of Educational Research, XXXVII (3), 225-239.
Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 105-146). New York, NY: Macmillan.
Feyerabend, P. (1975). Against method. London: New Left Books.
Frederiksen, J. R., & Collins, A. (1989). A systems approach to educational testing. Educational Researcher, 18 (9), 27-32.
Gagné, R. M. (1965). The psychological basis of science - a process approach (pp. 65-68). Washington, DC: American Association for the Advancement of Science.
Gaskell, P. J., Fleming, R., Fountain, R., & Ojelel, A. (1992). British Columbia Assessment of Science 1991 Technical Report III: Socioscientific component. Victoria, BC: Ministry of Education and Ministry Responsible for Multiculturalism and Human Rights.
Glaser, R. (In press). Testing and assessment: O tempora! O mores! Educational Researcher.
Gott, R., & Murphy, P. (1987). Assessing investigations in science, ages 13 and 15. London: HMSO.
Greeno, J. G. (1989). A perspective on thinking. American Psychologist, 39, 193-202.
Hanna, G. S. (1993). Better teaching through better measurement. Fort Worth, TX: Harcourt Brace Jovanovich.
Hardy, R. (1992). Options for scoring performance assessment tasks. Paper presented at the Annual Meeting of the National Council on Measurement in Education, San Francisco.
Harlen, W., Palacio, D., & Russell, T. (1984). The APU assessment framework for science at age 11. London: HMSO.
Hein, G. E. (1990). Conclusion. In G. E. Hein (Ed.), The assessment of hands-on elementary programs (pp. 264-279). Grand Forks, ND: Center for Teaching and Learning, University of North Dakota.
Herman, J. L. (1992).
What research tells us about good assessment. Educational Leadership, 49 (8), 74-78.
Hieronymus, A. N., & Hoover, H. D. (1987). Iowa Tests of Basic Skills: Writing supplement teacher’s guide. Chicago, IL: Riverside.
Hobbs, E. D., Boldt, W. B., Erickson, G. L., Quelch, T. P., & Sieben, G. A. (1980). British Columbia science assessment 1978: General report, volume I: Procedures, student test, conclusions and recommendations. Victoria: Ministry of Education.
Hodson, D. (1986). The nature of scientific observation. School Science Review, 68 (242), 17-29.
Howe, K. R. (1985). Two dogmas of educational research. Educational Researcher, 14 (8), 10-18.
Johnson, S. (1987). Gender differences in science: Parallels in interest, experience and performance. International Journal of Science Education, 9 (3), 467-481.
Johnson, S. (1989). National assessment: The APU science approach. London: HMSO.
Kahle, J. B. (1988). Gender and science education II. In P. Fensham (Ed.), Development and dilemmas in science education (pp. 249-265). Lewes, Sussex: Falmer Press.
Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin, 112, 527-535.
Kanis, I. B., Doran, R. L., & Jacobson, W. J. (1990). Assessing laboratory process skills at the elementary and middle/high levels. New York: The Second International Science Study, Teachers College, Columbia University. Available through NSTA, Washington, DC 20009.
Kaplan, A. (1964). The conduct of inquiry: Methodology for behavioral science. San Francisco, CA: Chandler and Sharp.
Kempa, R. (1986). Assessment in science. Cambridge: Cambridge University Press.
Klinger, D. (1994). Science performance assessment scoring guide. Langley, BC: Langley School District.
Klopfer, L. B. (1971). Evaluation of learning in science. In B. S. Bloom, J. T. Hastings, & G. F. Madaus (Eds.), Handbook on formative and summative evaluation of student learning. New York, NY: McGraw-Hill.
Koretz, D., Stecher, B., Klein, S., McCaffery, D., & Deibert, E. (1992).
Can portfolios assess student performance and influence instruction? The 1991-92 Vermont experience. RAND Institute on Education and Training.
Koretz, D., Stecher, B., Klein, S., & McCaffery, D. (1994). The Vermont portfolio assessment program: Findings and implications. Educational Measurement: Issues and Practice, 13 (3), 5-16.
Lane, S. (1994). Book review: Toward a new science of educational testing and measurement. Educational Measurement: Issues and Practice, 13 (1), 40-43.
Lane, S., Parke, C., & Moskal, B. (1992). Principles for developing performance assessments. Paper presented at the Annual Meeting of the National Council on Measurement in Education, San Francisco.
Linn, M., & Hyde, J. S. (1989). Gender, mathematics, and science. Educational Researcher, November, 17-27.
Linn, R. L. (Ed.). (1989). Educational measurement (3rd ed.). New York, NY: Macmillan.
Linn, R. L. (1994). Performance assessment: Policy promises and technical measurement standards. Educational Researcher, 23 (9), 4-14.
Linn, R. L., Baker, E. L., & Dunbar, S. B. (1991). Complex performance-based assessment: Expectations and validation criteria. Educational Researcher, 20 (8), 15-21.
Linn, R. L., & Burton, E. (1994). Performance-based assessment: Implications of task specificity. Educational Measurement: Issues and Practice, 13 (1), 5-15.
Lovitts, B. E., & Champagne, A. B. (1990). Assessment and instruction: Two sides of the same coin. In A. B. Champagne, B. E. Lovitts, & B. J. Callinger (Eds.), This year in school science 1990: Assessment in the service of instruction (pp. 1-13). Washington, DC: American Association for the Advancement of Science.
Loyd, B. H. (1994). Book review: Toward a new science of educational testing and measurement. Journal of Educational Measurement, 31 (1), 83-87.
Magone, M., Cai, J., Silver, E. A., & Wang, N. (1992).
Validity evidence for cognitive complexity of performance assessments: An analysis of selected QUASAR tasks. Paper presented at the Annual Meeting of the National Council on Measurement in Education, San Francisco.
Manitoba Curriculum Development and Assessment Branch (1988). Manitoba science assessment 1986: Final report. Winnipeg: Author.
McCombs, B. L. (1991). The definition and measurement of primary motivational processes. In M. C. Wittrock & E. L. Baker (Eds.), Testing and cognition. Englewood Cliffs, NJ: Prentice Hall.
Mehrens, W. A. (1992). Using performance assessment for accountability purposes. Educational Measurement: Issues and Practice, 11 (1), 3-9.
Messick, S. (1975). The standard problem: Meaning and values in measurement and evaluation. American Psychologist, 30, 955-966.
Messick, S. (1989a). Meaning and values in test validation: The science and ethics of assessment. Educational Researcher, 18 (5), 5-11.
Messick, S. (1989b). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York, NY: Macmillan.
Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23 (2), 13-23.
Meyer, C. A. (1992). What’s the difference between authentic and performance assessment? Educational Leadership, 49 (8), 39-40.
Meyer, K. (1991). Children as experimenters: Elementary students’ actions in an experimental context with magnets. Unpublished doctoral dissertation, University of British Columbia, Vancouver.
Millar, R. (1989). What is ‘scientific method’ and can it be taught? In J. J. Wellington (Ed.), Skills and processes in science education: A critical appraisal (pp. 47-62). London: Routledge.
Millar, R. (1991). A means to an end: The role of processes in science education. In B. Woolnough (Ed.), Practical science (pp. 43-52). Milton Keynes: Open University Press.
Millar, R., & Driver, R. (1987). Beyond processes. Studies in Science Education, (14), 33-62.
Miller-Jones, D. (1989).
Culture and testing. American Psychologist, 44, 360-366.
Ministry of Education (1981). Elementary science curriculum guide: Grades 1-7. Victoria, BC: Author.
Ministry of Education (1985). Junior secondary science: Curriculum guide and resource book. Victoria, BC: Author.
Ministry of Education (1990). Year 2000: A framework for learning. Victoria, BC: Author.
Ministry of Education (1991). Primary program: Foundation document. Victoria, BC: Author.
Ministry of Education (1994). The kindergarten to grade 12 education plan. Victoria, BC: Province of British Columbia.
Ministry of Education and Ministry Responsible for Multiculturalism and Human Rights (1992a). Ministry response to the 1991 British Columbia Assessment of Science. Victoria, BC: Ministry of Education and Ministry Responsible for Multiculturalism and Human Rights.
Ministry of Education and Ministry Responsible for Multiculturalism and Human Rights (1992b). Curriculum and assessment framework: Science. Victoria, BC: Ministry of Education and Ministry Responsible for Multiculturalism and Human Rights.
Mitroff, I. I., & Sagasti, F. (1973). Epistemology as general systems theory: An approach to the design of complex decision-making experiments. Philosophy of Social Sciences, (3), 117-134.
Moss, P. A. (1992). Shifting conceptions of validity in educational measurement: Implications for performance assessment. Review of Educational Research, 62 (3), 229-258.
Moss, P. A. (1994). Can there be validity without reliability? Educational Researcher, 23 (2), 5-12.
Mullis, I. V. S. (1980). Using the primary trait system for evaluating writing. Denver, CO: Education Commission of the States.
Murphy, P. (1989). Across-category performance issues. In B. Schofield, P. Black, J. Bell, S. Johnson, P. Murphy, A. Qualter, & T. Russell, Science at age 13: A review of APU findings 1980-84 (pp. 149-157). London: HMSO.
Murphy, P. (1990). What has been learned about assessment from the work of the APU science project? In G. E.
Hem (editor.), The assessment of hands-on elementaryprograms. (pp. 148-179) Grand Forks, ND: Center for Teaching and Learning,University of North Dakota.Murphy, P., & Gott, R. (1984). The assessmentframeworkfor science at ages 13 and 15.(Science report for teachers 3) London: Department of Education and Science.National Center on Education and the Economy. (1990). America choice: High skills orlow wages! Rochester, NY: National Center on Education and the Economy.National Commission on Excellence in Education. (1983). A nation at risk. Washington,DC: U.S. Department of Education.- 196 -National Committee on Science Education Standards and Assessment. (1993). Nationalscience education standards: An enhanced sampler. Washington, DC: NationalResearch Council.National Governor& Association. (1986). A time for results: The governors’ report oneducation. Washington, DC: National Governors’ Association.National Research Council. (1994). National science education standards: Draftforreview. Washington, DC: National Academy Press.New York State Department of Education (1989). Teachers’ guide to administration ofgrade 4 peiformance tasks. Albany: Author.Newmann, F. M. (1991). Linking restructuring to authentic student achievement. PhiDelta Kappan, (February), 458-463.Newmann, F. M., & Archbald, D. A. (1992). The nature of authentic academicachievement. In H. Berlak, F. M. Newmann, E. Adams, D. A. Archbald, T.Burgess, J. Raven, & T. Romberg (Eds.), Towards a new science of educationaltesting and assessment. Albany, NY: State University of New York Press.Osborne, R., & Freybourg, P. (1985). Learning in science- The implications ofchildren’s science. Auckland: Heinemann.Pine, J. (1990). Validity of science assessments. In G. E. Hem (editor.), The assessmentof hands-on elementary programs. (pp. 83-94). Grand Forks, ND: Center forTeaching and Learning, University of North Dakota.Popham, W. J. (1987). The merits of measurement-driven instruction. 
Phi Delta Kappan,68 , 679-682.Postlethwaite, T. N., & Wiley, D. E. (Eds). (1992). Science achievement in twenty-threecountries. Oxford: Pergamon Press.Raven, J. (1992). A model of competence, motivation, and behavior, and a paradigm forassessment. In H. Berlak, F. M. Newmann, E. Adams, D. A. Archbald, T.Burgess, J. Raven, & T. Romberg, Toward a new science of educational testing andmeasurement. (pp. 85-116). Albany, NY: State University of New York Press.Raymond, M. R., & Houston, W. M. (1990, April). Detecting and correcting for ratereffects in performance assessment. Paper presented at the annual meetings of theAmerican Educational Research Association and the National Council onMeasurement in Education, Boston, MA.Resnick, L. B. (1983). Mathematics and science learning: A new conception. Science,(220), 477-478.Roberts, D. (1988). What counts as science education? In P. J. Fensham (Ed.),Development and dilemmas in science education. (pp. 27-54). Lewes, England:Falmer Press.Robertson, I. J. (1987). Girls and boys and practical science. International Journal ofScience Education, 9 (5), 505-5 18.- 197 -Robertson, I. J. (1994). Making inferences and evaluating evidence in practicalinvestigations. Paper presented at the Annual Meeting of National Association forResearch in Science Teaching, Anaheim, CA.Robitaille, D. F., Schmidt, W. H., Raizen, S., McKnight, C., Britton, E., & Nicol, C. C.(1993). TIMSS Monograph No. 1: Curriculumframeworks for mathematics andscience. Vancouver, BC: Pacific Educational Press.Romberg, T., & Zarinnia, A. (1987). Consequences of the new world view to assessmentof students knowledge in mathematics. In T. Romberg, & D. Stewart (Eds.), Themonitoring of school mathematics: Background papers. Vol. 2. Implications frompsychology: Outcomes of instruction. Wisconsin: University of Wisconsin-Madison.Rothman, R. (1990 a). New tests based on performance raise questions - assessmentmethods said like star wars. 
Education Week, (Volume X, Number 2), 1,10 & 12.Rothman, R. (1990 b). U.S. opts out of international performance assessment. EducationWeek, (Volume X, Number 2), 10.Ruiz-Primo, M. A., Baxter, G. P., & Shavelson, R. J. (1993). On the stability ofperformance assessments. Journal ofEducational Measurement, 30 (1), 41-53.Russell, T., Black, P., Harlen, W., Johnson, S., & Palacio, D. (1988). Science at age 11:a review ofAPUfindings 1980-84. London: HMSO.Sage Publications Inc. (1994). The program evaluation standards. Thousand Oaks, CA:Sage Publications.Schofield, B., Black, P., Bell, J., Johnson, S., Murphy, P., Qualter, A., & Russell, T.(1989). Science at age 13: a review ofAPUfindings 1980-84. London: HMSO.Semple, B. (1992). Performance assessment: an international experiment. Princeton:International Assessment of Educational Progress.Shavelson, R. J., & Baxter, G. P. (1992). What we’ve learned about assessing hands-onscience. Educational Leadership, 49 (8), 20-25.Shavelson, R. J., Baxter, G. P., & Gao, X. (1993). Sampling variability of performanceassessments. Journal ofEducational Measurement, 30 (3), 2 15-232.Shavelson, R. J., Baxter, G. P., & Pine, J. (1992). Performance assessments: politicalrhetoric and measurement reality. Educational Researcher 21(4), 22-27.Shavelson, R. J., Baxter, G. P., Pine, J., Yur, J., Goldman, S. R., & Smith, B. (1991).Alternative technologies for large-scale science assessment: Instruments ofeducational reform. School Effectiveness and School Improvement, 2 (2), 97-114.Shavelson, R. J., Gao, X., & Baxter, G. P. (1994). On the content validity ofperformance assessments: Centrality of domain specification. An invited paper at theFirst European Electronic Conference on Assessment and Evaluation. : Conferenceon the Internet by The European Association for Research on Learning andInstruction (EARLI).- 198 -Shavelson, R. J., & Webb, N. (1991). Generalizability theory: A primer. NewburyPark, CA: Sage Publications.Shepard, L. (1991). 
Psychometricians’ beliefs about learning. Educational Researcher, 20(7), 2-16.Shepard, L. (1993). Evaluating test validity. Review ofResearch in Education, (19), 405-450.Slater, T. F., & Ryan, J. M. (1993). Laboratory performance assessment. The PhysicsTeacher, 31 , 306-308.Snow, R. E. (1974). Representative and quasi-representative designs for research onteaching. Review ofEducational Research, 44 , 265-29 1.Sullivan, B. M. (1987-88). A legacyfor learners: the report of the royal commission oneducation. Victoria: Ministry of Education.Swain, J. R. L. (1974). Practical objectives - A review. Education in Chemistry, 11(5),152-156.Tamir, P., & Doran, R. L. (1992a). Scoring guidelines. Studies in EducationalEvaluation, 18 (1), 355-363.Tamir, P., & Doran, R. L. (1992b). Conclusions and discussion of findings related topractical skills testing in science. Studies in Educational Evaluation, 18 , 393-406.Tamir, P., Doran, R. L., & Chye, Y. 0. (1992). Practical skills testing in science.Studies in Educational Evaluation, 18 (1), 263-275.Tamir, P., Doran, R. L., Kojima, S., & Bathory, Z. (1992). Procedures used in practicalskills testing in science. Studies in Educational Evaluation, 18 (1), 277-290.Tamir, P., & Lunetta, V. N. (1978). An analysis of laboratory activities in BSCS yellowversion. American Biology Teacher, (40), 426-428.Taylor, H., Hunt, R., Sheppy, J., & Stronck, D. (1982). Science assessment 1982,general report. Victoria: Ministry of Education.Traub, R. E., & Rowley, G. L. (1991). Understanding reliability. EducationalMeasurement: Issues and Practice, 10 (1), 37-45.Tucker, M. S. (1991). Why assessment is now issue number one. In G. Kulm, & S.Malcolm (Eds.), Science assessment in the service of reform. (pp. 3-15).Washington D.C.: American Association for the Advancement of Science.United States Department of Education (1990). National goals for education. Washington,DC: Author.Welch, C. J. (1993). Issues in developing and scoring performance assessments. 
Paperpresented at the Annual Meeting of the National Council on Measurement inEducation, Atlanta, GA.- 199 -Welford, G., Harlen, W., & Schofield, B. (1985). Practical testing at ages 11, 13 and 15.London: HMSO.Whittaker, R. J. (1974). The assessment of practical work. In Macintosh H. G. (Ed.),Techniques and problems in assessment. London: Arnold.Wideen, M., Mackinnon, A., O’Shea, T., Wild, R., Shapson, S., Day, E., Pye, I.,Moon, B., Cusack, S., Chin, P., & Pye, K. (1992). British Columbia AssessmentofScience 1991 Technical Report 1½ Contextfor Science Component. Victoria,BC: Ministry of Education and Ministry Responsible for Multiculturalism and HumanRights.Wienekamp, H., Jansen, W., Fickenferichs, H., & Peper, R. (1987). Does theunconscious behaviour of teachers cause chemistry lessons to be unpopular withgirls? International Journal of Science Education, 9 (3), 28 1-286.Wilson, G. (1986). CHASSIS: Cheshire achievement of scientific skills in schools.Chester, England: Cheshire County Council.Wittrock, M. C. (1991). Testing and recent research in cognition. In M. C. Wittrock, &E. L. Baker (Eds.), Testing and cognition. Englewood Cliffs, NJ: Prentice Hall.Woolnough, B. (Ed.). (1991). Practical Science. Milton Keynes: Open University Press.Woolnough, B. (1989). Towards a holistic view of science education (or the whole isgreater than the sum of its parts, and different. In J. J. Wellington (Ed.), Skills andprocesses in science education: A critical appraisal. London: Routledge.Woolnough, B., & Alisop, T. (1985). Practical work in science. Cambridge: CambridgeUniversity Press.- 200-

