UBC Theses and Dissertations


The effects of experience and test revision on the rate of practitioners' clerical errors on the WISC-R and WISC-III. Klassen, Robert Mark, 1994.

THE EFFECTS OF EXPERIENCE AND TEST REVISION ON THE RATE OF PRACTITIONERS' CLERICAL ERRORS ON THE WISC-R AND WISC-III

by

ROBERT MARK KLASSEN
B.Ed., University of British Columbia, 1987

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF ARTS in THE FACULTY OF GRADUATE STUDIES
Department of Educational Psychology and Special Education

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
APRIL, 1994
© Robert Mark Klassen, 1994

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Educational Psychology and Special Education
The University of British Columbia
Vancouver, Canada

ABSTRACT

The purpose of this study was to investigate the effect of test revision and examiner experience on the rate of examiner clerical error on the Wechsler Intelligence Scale for Children - Revised (WISC-R) and the Wechsler Intelligence Scale for Children - Third Edition (WISC-III). A total of seven school psychologists provided a sample of 252 protocols: 18 from each psychologist for the WISC-R, and 18 from each psychologist for the WISC-III. The errors tabulated consisted of clerical errors only, that is, addition of subtest raw and scaled scores, computation of chronological age, transformation from tables of raw scores to scaled scores, and transformation of scaled scores to IQ scores. Clerical errors were found on 38% of WISC-R protocols and 42% of WISC-III protocols. Errors caused by incorrect addition of raw scores were the most common error type.
No significant difference was found between the two tests for any of the types of clerical errors. A positive but statistically non-significant correlation was found between examiners' years of experience on the WISC-R and number of errors made.

On the WISC-III only, a comparison was made between the clerical error rate at the beginning of usage of the test and the clerical error rate in two time periods in the 12 months following the beginning of usage of the test. For overall errors, no significant difference was found between the three time periods. However, examiners made significantly more Full Scale IQ-changing errors at the beginning of WISC-III administration than they did in the third six-month time period. It was suggested that test publishers provide clear directions on the protocol itself to help test users lower the rate of clerical errors. It was also suggested that, because of the high rate of clerical errors at the beginning of the use of the WISC-III, practitioners need formal retraining on a test when it is revised.

TABLE OF CONTENTS

ABSTRACT
LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGEMENTS

CHAPTER
I INTRODUCTION
  Purpose of the Study
  Error of Measurement
  Previous Studies
  Examiner Differences
  Test Revisions
  Protocol Changes
  Questions and Hypotheses
  Scope of Study
II LITERATURE REVIEW
  Introduction
  Studies with Graduate Students
  Training Programs
  Unscored Protocols
  Completed Protocols
  Definitions of Error
  Test Revisions
  Examiner Characteristics
  Assessment Practices
  Summary
III METHODOLOGY
  Sample
  Characteristics of Examiners
  Characteristics of Examinees
  Assumptions
  Procedure
  Analysis
IV RESULTS
  Errors and IQ Changes
  Examiners
  Location of Errors
  Error Category
  Comparison between WISC-R and WISC-III
  Examiner Experience and the WISC-III
  Summary of Results
V DISCUSSION
  Error Rates
  Standard Error of Measurement
  Examiners
  WISC-III Over Time
  Strengths and Weaknesses of the Study
  Conclusions
  Implications of the Study
REFERENCES

LIST OF TABLES

TABLE
1 Examiners' Years of Experience
2 Characteristics of Examinees
3 Number and Proportion of Protocols with Errors
4 Number of Total Errors by Test per Examiner
5 Location of Clerical Errors by Test
6 Error Categories Across Tests
7 Comparison of Error-Types
8 WISC-III Overall Errors Over Time

LIST OF FIGURES

FIGURE
1 Number of Protocols with Errors by Test
2 Errors by Examiners
3 Clerical Errors by Category
4 WISC-III Mean Errors

ACKNOWLEDGEMENTS

I wish to express my sincere appreciation to Dr. Nand Kishor, my research supervisor, for his help and direction throughout the course of this research. Much guidance was given to help focus this paper on each successive draft.

Thanks also to Dr. Harold Ratzlaff, who helped enlighten some murky research questions, and who provided the original impetus to begin the program and to persevere to the end.

I am also appreciative of the feedback of Dr. Kim Schonert-Reichl, who provided valuable critical analysis on several drafts.

Finally, I wish to acknowledge the support, patience and love of my family: Andrea, Danielle, and especially Lenore, who tolerated my 4:30 a.m. awakenings, and who encouraged me to continue to the end.

Chapter I
Introduction

The Wechsler Intelligence Scales are among the most commonly used individual intelligence tests within educational and clinical psychology settings (Slate & Chick, 1989). In the school setting, critical educational placement and planning decisions are based on the data derived from these tests, which are assumed to be administered in a standardized manner and scored with minimal error.
In fact, the American Psychological Association's Standards for Educational and Psychological Testing (1985) suggests that the scoring of intelligence tests should be completely free of errors. Numerous studies have shown, however, that test protocols are far from error-free, and that virtually all examiners make errors (e.g., Sherrets, Gard, & Langner, 1979; Slate, Jones, Coulter, & Covert, 1992). Unavoidably, the conclusion must be drawn that critical educational placement and planning decisions are sometimes based on the results from inaccurately scored tests. These test results, consequently, are not as true a reflection of the child as they might seem to be. Any information, then, that helps professionals working in the field avoid errors in the scoring of these tests is potentially of great assistance in achieving the ideal of error-free testing.

Purpose of the Study

The Wechsler Intelligence Scale for Children - Revised (WISC-R) has recently (1991) been revised as the Wechsler Intelligence Scale for Children - Third Edition (WISC-III). These tests are among the most widely used tests of mental ability for children aged 6 to 16. This study investigated how the nature and rate of examiner error change from one test, the WISC-R, to its recently published revision, the WISC-III.

Error of Measurement

Standardized tests such as the WISC-R and its revision, the WISC-III, generally report a standard error of measurement (SEM), which is an estimate of the amount of error involved in an examinee's obtained test score. This SEM accounts only for content sampling errors and, in the majority of tests, including the WISC-R and WISC-III, is estimated through the use of the Spearman-Brown correction of split-half correlations. In this approach to reliability, the test is divided into two approximately parallel forms or components, which are then correlated.
The resulting correlation, when adjusted for the greater length of the composite through the use of the Spearman-Brown correction, represents the reliability of the test.

On the WISC-R and WISC-III, the reliability coefficients were first calculated for each individual subtest through the split-half method and Spearman-Brown correction formula (except for Coding and, on the WISC-III, Symbol Search, which are speeded tests, for which split-half methods are inappropriate). The individual subtest reliability coefficients were then used to compute the overall test reliability, through the use of Nunnally's formula for the reliability of a composite of several tests (Nunnally, 1978, p. 246).

On most standardized tests, the actual scoring of the test by the examiner is assumed to be almost completely objective and error-free. In the case of many tests, no measure of examiner error or of interrater reliability is reported, although this has been recommended: "There is as much a need for a measure of scorer reliability as there is for the more usual reliability coefficients" (Anastasi, 1988, p. 108).

The WISC-R manual, published in 1974, does not include any information on interscorer/interrater reliability or on how examiner error might affect test scores. The WISC-III manual, published in 1991, does offer some improvement over the WISC-R in the area of agreement between scorers. In the new manual, results are reported from a test-development standardization study on interscorer agreement for four of the thirteen subtests, with coefficients generally quite high, ranging from .90 for the Comprehension subtest to .98 for the Similarities and Vocabulary subtests.

The data from the WISC-III standardization study on interscorer agreement were drawn from four examiners who each scored 4 subtests from 60 protocols. The examiners, however, were presumably chosen from among the specialists working on the standardization study, whose sole duty was accurate administration of the WISC-III.
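The reliability machinery described above (a split-half correlation stepped up by the Spearman-Brown correction, Nunnally's composite formula, and the SEM derived from the resulting coefficient) can be sketched in a few lines. This is an editorial illustration of the standard textbook formulas, not code or data from the thesis; the example numbers are hypothetical.

```python
import math

def spearman_brown(r_half):
    """Step a split-half correlation up to full-length reliability:
    r_full = 2 * r_half / (1 + r_half)."""
    return 2 * r_half / (1 + r_half)

def composite_reliability(variances, reliabilities, composite_variance):
    """Nunnally's (1978) reliability of a composite of several tests:
    r_yy = 1 - sum(var_i * (1 - r_ii)) / var_composite."""
    error_variance = sum(v * (1 - r) for v, r in zip(variances, reliabilities))
    return 1 - error_variance / composite_variance

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - r_xx), the error band around an obtained score."""
    return sd * math.sqrt(1 - reliability)

# Hypothetical numbers: a half-test correlation of .90 steps up to about .947,
# and an IQ scale (SD = 15) with reliability .95 has an SEM of about 3.35 points.
print(round(spearman_brown(0.90), 3))
print(round(standard_error_of_measurement(15, 0.95), 2))
```

Note that all three quantities account only for content-sampling error; none of them reflects the clerical errors this study investigates.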
The WISC-III manual states that in the standardization study "all protocols were double-scored by trained scorers... The computer flagged any discrepancy between scorers ... questions were answered by in-house experts. Scorers received individual feedback on a daily basis and group feedback weekly" (1991, pp. 28-29).

These highly specialized standardization examiners can hardly be considered representative of practitioners working in the field. The practitioner uses a number of measurement instruments, focuses less on the test than on the subject or client, and has a great number of accompanying responsibilities. Moreover, in most cases, practitioners are not formally re-trained on revisions of tests. The protocols used in the standardization study, double-checked and closely scrutinized, also cannot be considered representative of the protocols in the field, where protocols are typically scored once only.

In the WISC-III test manual (1991), only a brief mention is made of research that suggests examiner error may contribute significantly to making test scores less reliable. The admission is made that "errors are common ... and may lead to misleading scores" (p. 57) but, surprisingly, apart from a few brief suggestions on how to avoid clerical errors (record scores legibly, and check the addition for raw scores and scaled scores), more information on the nature or degree of these errors is not given.

Although the reported internal reliability and interscorer reliability coefficients in the WISC-III manual are quite high, the sample of examiners used in the standardization study does not accurately reflect the population of practicing professionals. Also, the reported interscorer agreement coefficients do not account for errors directly attributable to the examiner giving the test, but only for the scoring of four of the subtests.
The end result of not attempting to remedy the fact that examiner errors affect test scores is that "important decisions will continue to be made on the basis of inappropriately scored IQ measures" (Miller & Chansky, 1973, p. 344).

Previous Studies

The question of examiner errors on standardized intelligence tests has been the subject of considerable research. Many researchers have circumvented the difficulty of gaining access to confidential data from practitioners (e.g., school and clinical psychologists, psychiatrists) by using data drawn from graduate students enrolled in training courses for administration of standardized tests (e.g., Slate & Chick, 1989; Stewart, 1987). This information is particularly useful in shedding light on training issues and on errors made by beginning testers. Training programs in graduate schools have been designed as a result of these studies (Conrad, 1991).

The generalizability of the data on errors made by graduate students to errors made by practicing professionals, however, is suspect; the nature of testing in the field is vastly different from that of testing by students in graduate school. Those practicing in the field are quite likely to view their own administration of any particular test as incidental, that is, as a means of gaining more information about a student or a client, and not as an end in itself, which is quite frequently the case in graduate training courses. At the same time, the greater experience and familiarity a professional has with a test may result in increased skill in avoiding error and working with an instrument. In any case, the testing practices of graduate students are not likely representative of the testing practices of examiners working in the field. Although studies investigating student errors are potentially useful, "investigation of the examiner errors made by practicing professionals is sorely needed" (Slate, Jones, Coulter, & Covert, 1992, p.
78).

Some of the earlier studies did investigate the error rate of practicing professional psychologists by having the examiners score completed (i.e., filled-in) but unscored protocols (e.g., Miller & Chansky, 1972; Oakland, Lee, & Axelrad, 1975). These studies usually depended on a volunteer sample of psychologists who responded to a general mail-out to a particular professional body (e.g., National Association of School Psychologists). A number of sampling problems arise in studies of this nature, however, because the data are collected from volunteer subjects.

Several limitations exist when data are collected in this manner. Firstly, the two studies cited above did not have a high rate of return: 23.5% and 32%, respectively. With these disappointingly low return rates, the sample loses its randomness and may be less representative of the population. The results from studies of this nature, which place relatively heavy demands on the subjects, inevitably suffer from the bias inherent in volunteer samples.

Secondly, little is known regarding the degree to which attention to scoring details changes for examiners anonymously scoring protocols on their own time rather than on the job in a clinical situation. Relatedly, the scoring of an already completed protocol is a contrived situation, with the subject quite likely aware of the nature of the study. Because this approach may not produce the same results as having an examiner fill out and score an actual protocol while administering the test, the generalizability of data obtained in this manner is limited.

Finally, because the researcher is practically limited to having only one or two protocols scored by the volunteer scorers, the generalizability of the results is questionable. Test protocols vary a great deal in their ease or difficulty of scoring, and one or two protocols may not well represent all protocols.
It seems, then, that in order to gain the most generalizable and accurate information, actual completed protocols from those practicing in the field should be analyzed. And because the rate and nature of examiner error on one test may not be generalizable to errors on another test, it is imperative that each widely used test (and its revisions) be investigated.

Examiner Differences

Differences among examiners may result in differences in their error rate on standardized tests. Studies have been completed investigating the relationship between rate of examiner error and gender (Hanley, 1977; Levenson, Golden-Scaduto, Aiosa-Karpas, & Ward, 1988), education (Hanley, 1977; Levenson et al., 1988; Sherrets, Gard, & Langner, 1979), expectancy (Egeland, 1969), and experience (Conner & Woodall, 1983). Of these areas, one might expect that an examiner's experience and practice on a test would have the strongest impact on the rate of error. As expected, Conner and Woodall (1983) found a significant decrease in the number of errors on the WISC-R made by examiners as they gained experience. However, the examiners in their study were student-examiners enrolled in a graduate training program, and cannot be considered representative of practitioners in the field.

Although little research has been done in the area of examiner experience and error, and more is clearly needed, it is likely that the error rate of practitioners changes as they gain experience working with any particular test. And though one would predict a lowering of the rate of examiner error on a graduate student's test protocols as the result of instruction and experience, this pattern may not be the same with practitioners, who may be less concerned with the actual administration and scoring of a test, and more concerned with the results and their interpretation.

Test Revisions

There is no question that test revisions bring about changes in subjects' test results.
For example, subjects usually score lower on newer standardized tests than on older ones, as was the case in the norm changes in the revision of the Wechsler Preschool and Primary Scale of Intelligence (WPPSI) to the Wechsler Preschool and Primary Scale of Intelligence - Revised (WPPSI-R), as has been detailed elsewhere (Kaufman, 1990; Sattler, 1991). There is also little question that the format of a test or questionnaire influences a subject's or interviewee's response pattern: "Good design is crucial for producing a reliable and valid questionnaire. Respondents feel less intimidated by a questionnaire which has a clear layout and is easy to understand, and take their task of completing the questionnaire more seriously" (Rust & Golombok, 1989, p. 155).

In the same way that testing subjects or interviewees are influenced by the format of a questionnaire, it is also likely that the format and layout of a test protocol influence the response pattern and scoring accuracy of an examiner. A test revision, then, with its corresponding revision of the test protocol, will see changes in the response pattern of examiners, which may include a change in the rate and pattern of examiners' clerical errors.

Protocol Changes

Changes to the WISC-R protocol had been strongly recommended in order to reduce the rate of examiners' clerical error:

The present investigators believe that a revision in the WISC-R Record Form must be considered by the test's publisher, the Psychological Corporation. Such a revised record form might include directions to the psychologist to verify and recheck all response scoring, mathematical, administrative, IQ, and birth computations (Levenson et al., 1988, p.
663).

Although an alternate and expanded protocol was subsequently published for the WISC-R, it did not follow any of the suggestions made by Levenson et al., but simply provided more space for recording clinical observations.

The publishers of a test undergoing revision, such as the WISC-R to the WISC-III, have an opportunity to improve the format of the test's protocol, making it less prone to examiner errors. Although the test's publisher does not explicitly claim that the Record Form or test protocol has been revised in order to lower the rate of examiner error, it does claim in the manual (1991) that "changes from the WISC-R ... have made it easier for the examiner to use" (p. iv), and that "A final goal of WISC-III development was improvement of ... administration" (p. 12). Unfortunately, no empirical evidence is cited to back these claims. Although perhaps true for administration of the test itself, the WISC-III test protocol
Howdoes the examiner’s experience, or lack of experience, on a new test- the WISC-lll- affect the rate of clerical errors made?Most of the changes on the WISC-lll test protocol appear to bechanges in layout only, and are not seen to have been made for thepurpose of lowering examiners’ clerical errors.Hypothesis #1: The rate of practitioners’ clerical error on completedand scored WISC-IlI protocols will be equal to their rate of error onthe WISC-R.Little research has been completed in the area of the effectsof a practitioner’s experience on error rate. Past studies show thatgraduate students’ error rates decrease with experience. This maynot be true, however, for practitioners who approach testing in adifferent manner. And although one would intuitively expect anexaminer’s errors to decrease with experience, it is possible that as12an examiner becomes more comfortable with using a test, theclerical error rate of that examiner increases, as carelessness setsin. Psychologists with PhDs have been shown to have a higher rateof examiner error than Masters-level psychologists (Levenson et al.,1988; Ryan, Prifitera & Powers, 1983).Hypothesis #2:Practitioners’ clerical error rates will not change significantly withincreasing experience on the WISC-III.Scope of StudyThis study investigated the nature and rate of examiners’clerical errors on the WISC-R and WISC-lll. Comparisons of errorpatterns between the two tests were made, and the relationshipbetween examiner experience on the WISC-lll and rate of clericalerror was studied.Examiner error, for the purpose of this study, consisted ofclerical errors as defined by Sherrets et al. (1979): addition ofsubtest raw and scaled scores, computation of chronological age,and transformation from tables. The data for this study were drawnfrom analysis of actual completed test protocols from schoolpsychologists in a large suburban school district. 
Because the data were drawn from practitioners' completed protocols, the results are likely to be representative of present practice in the field.

Information derived from this study will enable users of psychoeducational tests to have a better understanding of their own potential clerical errors on the WISC-III, and may result in improved practice in the area of re-training professionals for test revisions, which in turn would increase the reliability of the tests. Improvements in testing practices in educational settings should result in improvement of the accuracy and quality of planning and placement decisions.

Chapter II
Literature Review

Introduction

Although a look through the manuals of standardized tests might suggest that examiner error is not a prevalent problem, or at least that it does not have any serious impact on a test's reliability, many researchers have found substantial evidence suggesting that errors made by examiners are an important source of error, not covered by merely reporting the SEM of a test. These studies regarding the nature and rate of examiner error have utilized many different designs, used both graduate students and practitioners of various stripes, investigated errors made on actual completed protocols and on fabricated protocols, and have looked at a variety of error types on a collection of different tests.

Studies with Graduate Students

The majority of the previous studies have investigated the errors made by graduate students enrolled in test-administration courses. Obtaining raw data from graduate students is frequently easier than from practitioners, who may not volunteer as readily to take part in a study which reflects on their professional skills. Many of the earliest studies followed this route.
Miller, Chansky and Gredler (1970) had 32 Clinical and School Psychology graduate students score a single completed WISC protocol and found the resulting Verbal and Performance scale scores to vary by as much as 21 points and the Full Scale IQs by as much as 17 points.

Warren and Brown (1972) examined the first three and last three completed protocols of 40 student examiners in a graduate-level psychology program. Of the 240 protocols, there were 89 cases in which Full Scale IQs were changed (according to the authors, the number would have been much higher had some errors which added points not cancelled out others which deducted points). On the WISC protocols, IQ changes ranged from 1 to 16 points, with only 2 of the 240 protocols error-free. The mean IQ change after correction was 3.32 points for the WISC, and 4.77 points for the Binet. There was no significant change in error rate from the beginning to the end of the training period. From the above studies, it is clear that scoring and administration errors are common among graduate students.

Without question, data are more easily obtained from graduate students than from practicing professionals. However, the generalizability of the results from studies involving graduate students to practitioners working in the field is questionable. In training programs, at least in the initial phases, the focus of students and instructors is, naturally, on the actual administration and scoring of the test. In the field, however, the focus is more likely to be on the results of the test, and competent administration and scoring of the test is assumed.
This is not to say that studies which look at novice examiners are not valuable; several studies using student examiners have had important implications for the training of test users in graduate psychology programs, training which, currently, is widely discrepant from one institution to another.

Training Programs

Conner and Woodall (1983) examined students' errors in relation to a training program that provided the students with written and verbal feedback on types and frequencies of errors. They concluded that:

Response Scoring, Mathematical, and IQ Errors remain a source of variance, affecting the reliability and validity of the WISC-R. Further research is needed in this area to determine if there are any effective teaching methodologies that can significantly decrease the number of Response Scoring errors (p. 378).

Slate and Jones (1989) challenged the conventionally held wisdom that testing courses should consist mostly of opportunities to practice test administrations, and that instructional time should be secondary: "In this study, students' performance improved only as a result of detailed classroom instruction. ... Neither group improved during practice administrations" (p. 409). In a subsequent study the investigators stated that "Five practice administrations were not sufficient to improve student accuracy in scoring and 10 practice administrations improved student accuracy only slightly. Even on the 10th administration error rates were high" (Slate & Jones, 1990c, p. 139).

It seems, then, that traditional training programs often fail to prepare graduate students to administer and score standardized tests in a competent manner. Russ (1978) outlined the causes for the gap between academic training in psychological assessment and the comparatively higher standards established by supervisors in internship settings. Other studies have offered more detailed and structured training programs or methods for systematic observation of graduate student assessment behavior (Strein, 1984).
Blakey, Fantuzzo, Gorsuch and Moon (1987) found that a peer-mediated, competency-based training package, which uses student peers to give detailed feedback to examiners, significantly lowered the error rate for student examiners. Conrad (1991) constructed a comprehensive performance checklist, the Criteria for Competent WISC-III Administration - Revised (CCW3A-R), to be used to aid students in locating and avoiding their most frequent errors.

It is clear that useful information, information that may result in improved training programs, may be derived from studying the errors of student-examiners. However, the population which is of most concern is not graduate students, but professionals who use standardized tests in order to gain greater understanding and to make decisions.

Unscored Protocols

A number of studies have looked at the issue of examiner error by having psychologists and graduate students score test protocols which have been filled in but not scored. The intent of this type of study is that the examiner will display typical behavior in scoring the protocol. In a study by Perot (1992) investigating the cognitive processes behind scoring decisions, eight school psychologists scored the Verbal Scale subtests from a completed but unscored WISC-R protocol. It was found that the psychologists produced Verbal IQ scores which varied by as much as 11 points.

Ryan, Prifitera and Powers (1983) used practicing psychologists as well as graduate students in their study of examiner error on the Wechsler Adult Intelligence Scale - Revised (WAIS-R). They had the 39 examiners score two protocols and found that the standard deviations of the Full Scale IQ scores averaged about 1.95. In comparison, the WAIS-R manual reports a SEM of 2.53 for the Full Scale IQ (on the WISC-R, the reported Full Scale SEM is 3.19; on the WISC-III, 3.2). In an earlier study, Miller and Chansky (1972) sent a single completed WISC protocol to 200 doctoral-level school or clinical psychologists.
From the 64 returned protocols it was found that Verbal and Performance IQs varied by as much as 23 points and Full Scale IQs by as much as 17 points.

It must be questioned, however, just how seriously the responding sample treated their task in the study. In a follow-up phase of the study, Miller and Chansky sent a letter in which they offered to discuss with the responding psychologists the types of errors that they had made. Out of the 64 volunteers, only 1 psychologist responded to the offer. Miller and Chansky concluded that "If the apathy discovered in this brief follow-up is representative of the field at large, we can assume that important decisions will continue to be made on the basis of inappropriately scored IQ measures" (p. 345). The difficulty of knowing the representativeness of the sample to the population is raised by Miller and Chansky themselves: it is unknown whether the results are representative of the field at large when a volunteer sample of 32% (64 out of 200) responded to the study, and 1.6% (1 out of 64) responded to the follow-up offer. Another difficulty with this type of study is the representativeness of the to-be-scored protocol. When only one protocol is used, it is not necessarily a typical protocol, but may be much easier or much more difficult to score than most.

In order to overcome the problem of non-representative single protocols, some researchers have several scorers score two or more protocols, varying either in examinee ability level or in the expected ease or difficulty of scoring the protocol. Whereas one protocol might not well represent the population of protocols, two or three can better cover the range of ease or difficulty of scoring, or of examinee ability.
Behind this type of study is the idea that examinee differences reflected in the protocol may result in differences in error rate for examiners.

Oakland, Lee, and Axelrad (1975) sent three completed but unscored actual (as opposed to fabricated) WISC protocols, reflecting three different levels of intellectual ability (below average, average, and above average), to 400 APA-registered school psychologists. Of the 400 randomly selected psychologists, 94 (23.5%) volunteered to score the tests and send them back to the investigators. From the 94 responding psychologists, data were collected in the form of means and standard deviations for subtest scaled scores and Verbal, Performance, and Full Scale IQs for the three protocols. Oakland et al. found no significant correlation between examiner error rate and examinee's level of intellectual ability.

As in the Oakland et al. (1975) study, the standard deviation of scores is commonly used as an index of interrater agreement in studies comparing examiners' scoring of one or two protocols: the higher the standard deviation, the lower the interrater reliability. However, it must be remembered that, when comparing the SD to the SEM, two different sources of error are being compared. The SEM, when measured through split-half reliability coefficients, reflects error attributable only to item or content sampling, whereas the SD reflects error variance attributable to examiner differences. In other words, the SD of the differences between examiners should be viewed as an additional source of error.

In another study involving multiple protocols sent to a random sample of psychologists, Bradley, Hanna, and Lucas (1980) attempted to investigate the differences in rate of examiner error between "hard" (in which responses were deliberately ambiguous) and "easy" (in which responses were deliberately clear) WISC-R protocols. The
Of the 280 randomly sampled psychologists, 63 volunteered to participate in the study, for a 27% return rate. The measure of interrater agreement used - the standard deviation - varied from 2.9 for the easier protocol to 4.3 for the more difficult one (compared to the published SEM of 3.19 in the test manual). They concluded that the lack of scoring accuracy was due to a shortage of scoring criteria for the more objective subtests and was cause for increased attention and further study.

This genre of study does provide some beneficial information for users of standardized tests, but weaknesses inherent in the research design limit the generalizability of the results. First, the task of scoring a completed protocol is contrived; in actual practice the administration and scoring are inextricably linked. Second, it is difficult to determine how much attention the practitioner pays to the task of scoring (usually anonymously) a protocol for a research study. Third, the representativeness of the sample to the population is impossible to establish when using a volunteer sample: none of the studies using volunteer samples had a very high return rate of protocols. Finally, because this type of study is limited in the number of protocols used, the generalizability of one or two (usually fabricated) protocols to the population of actual protocols is again impossible to establish.

Completed Protocols

One way to avoid the problems inherent in having volunteers score already filled-in protocols is to investigate the frequency and nature of examiner error from a sample of completed (and scored) protocols. In a review of the literature on examiner error, Slate and Hunnicutt (1988) state: "Future research needs to address the frequency, type, and specificity of errors on actual protocols. Are practicing professionals making scoring errors on test protocols in the field?" (p. 286). Few studies to date, however, have been able to answer this question.
The problem of gaining access to completed and scored protocols has prevented most researchers from investigating the error rate of practicing professionals. Only a small number of investigations of this type have been conducted.

In one study, impressive because of the size of the sample, Sherrets et al. (1979) obtained two random samples of 100 WISC and WISC-R protocols drawn from the files of a psychiatric facility and from the files of the student services department of a large school system. They found that nearly 89% of the 39 examiners had made at least one error and that 46.5% of the protocols contained at least one error. Although most of the errors found did not change the diagnostic classification of the child, the largest errors found would have raised the Full Scale IQ by 9 points or lowered it by 7 points. No significant difference was found between the error rates on protocols from the two facilities, or between levels of training.

Although Sherrets et al. (1979) used a sample considerably larger than in most other studies of its kind, they neglected to investigate two key areas. No comparisons were made of error rates between the two tests (WISC and WISC-R), something which might have shed some light on the contribution of the protocol layout to examiner error. The researchers also failed to investigate the relationship between examiner error rate and experience.

Slate et al. (1992) used data from nine school psychologists and found that these practitioners made errors on all 56 of the protocols drawn from a metropolitan school district. In this particular study, though, the term "error" was used rather liberally, and included failure to record responses, failure to circle scores, and failure to record times.
These may technically be considered errors because they depart from the prescribed manner of test administration, but they cannot change the score of an examinee.

Peterson, Steger, Slate, Jones, and Coulter (1991) investigated examiner error on the Wide Range Achievement Test - Revised (WRAT-R). They found that 95% of their sample of 55 protocols contained errors, with an average of three errors per protocol. They concluded that "Practitioners make a significant number of errors that adversely affect standard scores" (p. 208). Wagoner (1988) examined a similar number of protocols - 50 - in her study of examiner error on the WISC-R. She found an average of 8.4 errors per protocol, with 34% of the Full Scale IQs altered as a result of these errors.

Definitions of Error

Most investigators use their own definition of error, depending on the purpose of their study. On a standardized test, three broad areas of examiner error can be categorized. The first is administration error: error that occurs when the examiner strays from the prescribed standardized procedure in giving the test. The importance of adhering to standardized procedure is well stated by Buckhalt (1990):

One of our explicit objectives in testing is to standardize the test situation so that variations in test performance can be assumed to be a function of true differences in the examinees' abilities rather than of any extraneous contextual variables (p. 478).

Studies which investigate administration errors are generally restricted to small samples because of the difficulty involved in actually observing test administrations. Buckhalt's study (1990) used the results derived from just 14 test administrations by practitioners, and 12 by graduate students.
Although the information drawn from this type of study may help direct future research, its generalizability is questionable due to the small sample size.

The second category of examiner error is scoring error: error that occurs when the examiner scores the responses of the examinee. Three WISC subtests (i.e., Vocabulary, Comprehension, Similarities) frequently elicit ambiguous responses and have been the focus of some study. Brannigan (1975) suggested revisions for the WISC and WAIS: "more thorough scoring standards are needed for the Wechsler Intelligence Scales" and, "It would also be desirable to revise those test items which lend themselves to ambiguous replies" (p. 314). Many other studies have also investigated the problem of scoring errors and found them to be one factor contributing to examiner error on standardized tests (e.g., Ryan et al., 1983; Slate & Jones, 1990a; Warren & Brown, 1972).

A major difficulty, though, with studies investigating response-scoring errors is that psychologists and other examiners frequently do not record verbatim the responses of the examinee. Slate et al. (1992) found that school psychologists frequently departed from the prescribed standardized administration procedure. In their study they found that the school psychologists failed to record responses verbatim 30 times per protocol! Without the benefit of direct observation of the test administration, it is clearly difficult to determine whether the examiner is scoring the response recorded on the test protocol or the actual (but not recorded in its entirety) response of the examinee. Studies which purport to examine the rate of scoring errors by looking at completed protocols may end up with misleading conclusions.

A final category of examiner error on standardized tests is clerical error: errors caused by examiner carelessness, and perhaps exacerbated by complicated or poorly laid-out test record forms. Sherrets et al.
(1979) found that nearly 89% of the examiners in their study made clerical errors, and that 46.5% of the protocols had errors. They recommended: "Greater caution is called for in the clerical computations, but especially in the simple addition of the scaled scores, where the majority of errors appeared in the present samples" (p. 496). Similarly, Levenson et al. (1988) found in their study that 57% of school psychologists and interns made clerical errors, leading the authors to call for improved training in graduate programs and a revision of the WISC-R protocol in order to reduce this type of error.

Slate and Hunnicutt (1988) claimed that "Mechanical and clerical errors seem to occur frequently and detract from the accuracy of Wechsler scores" (p. 284). The problem is of great enough magnitude for them to recommend that clerks could be employed to check the areas on the protocol where clerical errors are most prevalent. Clearly, clerical errors are a major problem on standardized tests, and more information about them is called for in order to minimize their impact in weakening test reliability.

Test Revisions

The revision of a psychological test may bring about changes in the examiner's response pattern and also changes in the rate of examiners' clerical errors. The Wechsler Intelligence Scale for Children - Third Edition (WISC-III) is a revision of the Wechsler Intelligence Scale for Children - Revised (WISC-R). Because it is relatively new (1991), few studies have been published which examine differences between the two tests, especially in the area of susceptibility to examiner errors. Little (1992) compared the two tests and found few major changes apart from the appearance of actual test materials. The record form or protocol has been redesigned with some changes; for example, there is now more space for writing behavioral observations.

Of greatest concern for Little (1992) is the changed process for converting raw scores to standard scores.
The subtests on the record form are listed in their order of administration, while on the conversion table they are split, with the Verbal subtests in one location and the Performance subtests in another. Little predicts that "This different format of presentation is sure to be confusing and cause recording errors" (p. 151).

Examiner Characteristics

The interaction between examiner characteristics and examiner error has been the subject of several studies. In one of the earlier studies, Cieutat (1965) evaluated the effect of examiner differences on Stanford-Binet scores. Although he found no significant effect for sex of examinee, sex of examiner was significant, with the female examiners eliciting significantly higher IQ scores than the male examiners. The interaction between sex of examiner and examinee was "of marginal significance" (p. 318). The results of this study can be called into question, though, because of difficulties with the interpretation of the data analysis. Because of unequal ns in the analysis of variance cells, Cieutat issued the caveat that "tests of significance are only approximate" (p. 318).

Levenson et al. (1988) also looked at the effects of sex of examiner. In their study, sex of examiner was not a significant factor, but education of examiner did account for a significant portion of the differences in error rates. They found that doctoral school psychologists made more errors than both school psychology interns and Master's-level school psychologists.

In a study comparing PhD psychologists and graduate students, Ryan et al. (1983) found that the PhD psychologists committed a significantly higher rate of scoring errors than the graduate students on the Performance scale of the WAIS, although there were no significant differences between the two groups on any of the subtest or IQ score means.
Unfortunately, the results from this study are weakened by the earlier-mentioned basic design flaw in which examiners find themselves in the contrived situation of scoring completed but unscored protocols.

Bird (1992) attempted to find which examiner characteristics were most closely linked to rate of clerical error on the WISC-R. In her study, the percentage of protocols with Full Scale IQ changes ranged from 4% to 24% in the three samples of protocols (50 in each sample) she selected. She concluded that the issue of time pressure appeared to be more related to the error rate than did any variability in training.

The effect of examiner experience has been found to be worthy of some study, but most of the research has dealt exclusively with graduate students and not with practitioners. Conner and Woodall (1983) found that student examiners showed a significant decrease in administrative and scoring errors with experience in administering the WISC-R. Clerical errors, however, did not decrease, leading the authors to call for further research in this area. Slate and Jones (1990b) found that error rates decreased after 8 administrations of the WAIS-R. Gilbert (1992) also found student error rates on the WAIS-R decreasing from the beginning to the end of a course on test administration.

Studies involving graduate students are relevant for those organizing testing courses for graduate programs. It does not necessarily follow, however, that the same trend holds true for practitioners, who are not typically in the process of learning how to use a test. One of the omissions in studies involving practitioners is the inclusion of examiner experience as an independent variable in the analysis of the data. Whereas sex and education of examiner have been studied, experience has rarely been investigated.
The result of this gap in the research is that little is known regarding the effect that experience has on practitioners' error rates. Research into the impact of experience on the error rates of practicing professionals is in short supply but plainly needed.

Assessment Practices

A great deal of a clinician's time is spent in assessment. A study by Anderson, Cancelli, and Kratochwill (1984) revealed that 44% of the school psychologist respondents to their questionnaire spent between 41% and 80% of their time in testing. During this time spent testing, a plethora of different tests are used. Studies investigating examiner error have been conducted on many of these tests, including the Stanford-Binet (e.g., Cieutat, 1965; Warren & Brown, 1972), the Kaufman Assessment Battery for Children (Hunnicutt, Slate, Gamble, & Wheeler, 1990), and the Wide Range Achievement Test - Revised (Petersen et al., 1991), with all finding significant error attributable to the examiner.

But the most widely used tests among practitioners (Lubin, Larsen, & Matarazzo, 1984), and the tests on which most of the work investigating examiner error has been done, are the Wechsler scales. On these scales, studies of examiner error have largely focussed on the WISC-R and the WAIS and WAIS-R, but few studies have yet been published on the WISC-R's revision, the WISC-III. Because the WISC-III is a frequently used instrument, and because little is available at this time regarding its rate of examiner error, any study which attempts to investigate this instrument would be valuable.

Summary

Much research has been completed in the area of examiner error and standardized tests.
Many of these studies have analyzed the testing practices of graduate students, and have found that their examiner errors are a significant source of variation in test scores. Although data from graduate students are more readily available and accessible than data from practicing professionals, and although the results may be useful for improving graduate training programs, the data produce results that are not easily generalizable to professionals working in the field.

Another way of avoiding the difficult task of obtaining data from actual, completed test records is to have a number of subjects score a completed, but unscored, test protocol. Researchers that have utilized this design typically send out one or two fabricated protocols to a random sample of professionals. The returned and scored protocols form the database for the subsequent analysis. Unfortunately, the sample loses its randomness, and hence its representativeness, when a typically low proportion of professionals choose to participate in the study. The task of scoring, but not administering, a test is an exercise that is contrived, and not representative of practice in the field.

Investigators have looked at a number of different error types, including administrative error, scoring error, and clerical error. Administrative error is studied infrequently because of the difficulty and time involved in observing test administrations. Scoring error is difficult to substantiate, because examiners very frequently do not record responses verbatim. A look at completed protocols, therefore, does not always give a clear indication of the actual responses of examinees. The third category of error, which can be accurately measured from examinations of test protocols, is clerical error.
Clerical error has been found to be a factor in the protocols of most practitioners (e.g., Levenson et al., 1988; Sherrets et al., 1979).

The magnitude of examiner error varies from study to study, and is measured in different ways depending on the design of the study and on the definition of examiner error. Investigations using multiple scorers of single protocols generally report the standard deviation of scores. For example, the study by Bradley et al. (1980) on the WISC-R reports the standard deviation ranging from 2.9 for the easier-to-score protocol to 4.3 for the harder protocol.

In studies investigating errors in completed and scored protocols, error rates have been reported as change in Full Scale IQ score (as high as 17 points for Full Scale IQ in Miller et al.'s study [1970]). Proportion of errors among examiners and protocols has also been used, as in the study by Sherrets et al. on the rate of clerical errors on the WISC and WISC-R. The investigators found that 89% of examiners made clerical errors, and that 46.5% of the protocols contained this type of error. If the definition of error is broadened to include scoring and administrative errors, almost all protocols contain errors (e.g., Franklin, Stillman, & Burpeau, 1982; Slate & Jones, 1990a; Wagoner, 1988).

Various examiner characteristics have been investigated in relation to rate of examiner error. Sex of examiner appears to have little impact on error rate (Levenson et al., 1988). Education of examiner (Ph.D. vs. Master's degree) may make a difference, with studies showing doctoral-level psychologists making more errors than master's-level psychologists (Levenson et al., 1988; Ryan et al., 1983).

Although examiner experience appears to be related to the errors graduate students make on standardized tests, little research has been completed using practitioners such as clinical or school psychologists, for whom assessment is a major part of their day-to-day practice.
The value of studies involving practitioners' errors would be enhanced if the experience of the examiner were included as an independent variable in the analysis of the data. More research is clearly called for to investigate the relation between practitioners' experience and clerical errors. Test revisions, with accompanying changes in test protocols, may also result in a change in examiners' clerical errors. Research comparing practitioners' error rates between test versions would be widely welcomed.

Chapter III
Methodology

The purpose of this ex post facto study was to investigate how the nature and rate of examiner error changed from the Wechsler Intelligence Scale for Children - Revised to the Wechsler Intelligence Scale for Children - Third Edition. The two central questions addressed in this study were how the examiner clerical error rate changed as a result of the test's revision, which included major alterations to the format of the test record or protocol, and how the pattern or trend of error changed as a result of the examiner's increasing experience with the new test, the WISC-III.

This chapter will first define the sample and population of test protocols as used in this study. Next, the characteristics of the examiners will be examined, followed by the assumptions that need to be made if the results of this study are to be seen as generalizable to a greater population. Finally, the actual procedures and data analysis methods used will be explained.

Sample

The population of WISC-III protocols was made up of the total protocols (approximately 1200) completed by all of the school psychologists in a suburban school district since the beginning of the test's use, in 1991. The population of WISC-R protocols (approximately 1800) consisted of the total number of protocols completed by the school psychologists in the district in the last three years of its use.
Protocols from before this time period were not available.

The sample consisted of a total of 252 test protocols: 18 WISC-R and 18 WISC-III protocols randomly drawn from the files of each of seven school psychologists practicing in a large suburban community near a large Western Canadian city, chosen because of the convenience of access to files for the researcher. The community showed demographic similarity with the province in several key areas: the community's average annual income, educational level of the population, and ethnic composition were very similar to the same provincial indicators (Ministry of Education, 1993). Also similar were the student/teacher ratio and teacher experience and certification.

Of the total number of school psychologists in the school district, seven fit the criteria for this study: a) they had administered both the WISC-R and WISC-III, and b) they still had access to older (i.e., WISC-R) protocols. In some cases, older WISC-R protocols had been destroyed because of lack of storage space.

The sample size - that is, the number of protocols from each examiner - was determined through comparison with the number of protocols used in previous studies of a similar nature. For example, Slate et al. (1992) collected 56 WISC-R protocols from nine examiners in a large metropolitan school district. Wagoner (1988) collected 50 protocols from 6 examiners for her study on the WISC-R. In a study of examiner errors on the WRAT-R, Peterson et al. (1991) looked at 55 protocols from 9 examiners. The study found to have the largest sample was conducted by Sherrets et al. (1979), in which 100 protocols from each of two institutions were examined. The present study's total of 126 protocols for each test was deemed sufficiently large in comparison with the sample sizes used in previous studies of a similar nature.

Characteristics of Examiners

All of the examiners had Master's degrees in either school or clinical psychology from four different graduate training programs in Canada and the United States.
Their years of experience (see Table 1) ranged from 3 to 27 years (Mean = 9.6, S.D. = 8.6). All of the participants were trained to administer the WISC-R, but received no formal training on its revision, the WISC-III. The number of administrations per year varied with each school psychologist and with the particular year, with the mean number of WISC-R or WISC-III administrations ranging from 50 to over 70 per examiner per school year.

Table 1
Examiners' Years of Experience

Examiner                1    2    3    4    5    6    7
Years of Experience    15   27    4    6    5    3    7

Note. Years of experience consists of the number of years employed in a position which requires the administration of standardized tests.

The protocols from each school psychologist were drawn in the following manner: the WISC-R protocols were randomly drawn from the last three years of its administration, and the WISC-III protocols from the first year and a half of its administration. The sample, therefore, consisted of 18 protocols for the WISC-R and 18 protocols for the WISC-III from each examiner, drawn from a pool ranging from 150 to 210 total protocols for the WISC-R, and from 75 to 105 total protocols for the WISC-III, for each examiner. In order to examine the trend or pattern of error over time on the WISC-III, the 18 months of administration were further divided into half-year segments, and six protocols were drawn from each of the three half-year periods.

Characteristics of Examinees

Table 2 displays the characteristics of the examinees from whom the protocols were drawn in this study. Note that the mean and median IQs reflect a special education population, i.e., a population of children who have been referred for assessment or service because of questions regarding their performance in an academic setting.

Table 2
Characteristics of Examinees

                  WISC-R         WISC-III
1. Age Range      6-2 to 16-8    6-3 to 16-7
2. Median Age     9-10           10-6
3. IQ Range       46 to 126      50 to 115
4. Mean IQ        89             85
5. Median IQ      89             86

Note. Total of 126 protocols per test.
Also, "6-2" is read as 6 years and 2 months.

Assumptions

Several assumptions need to be made if the results of this study are to be generalizable to the population of protocols in the school district and in the province. It is assumed that the sample protocols drawn from the school district files are representative of the total number of protocols in the files, and of the population of protocols in the province. Because the protocols were drawn randomly (on the WISC-III, the total number of a particular examiner's protocols for each half-year period was divided by the number necessary to draw 6 files, and then every 4th or 5th protocol was drawn), this assumption can be supported. The characteristics of the examinees and the demographics of the community of the examinees, as outlined previously, are not atypical for the province, suggesting that the examinees, and therefore the protocols, did not vary demographically from other protocols in the province.

It is assumed that the testing practices of the seven school psychologists who completed the test protocols are similar to the practices of other school psychologists using the tests. The graduate training and years of experience of the seven are varied enough to support this assumption. Previous studies have shown that Master's-level psychologists have a lower examiner error rate than doctoral-level psychologists (e.g., Levenson et al., 1988; Ryan et al., 1983). Thus, because only Master's-level psychologists were included in this study, the results may be an under-estimate of the actual rate of error among all practitioners.

It is not assumed that the examiners' training is the same for each of the two tests. It is acknowledged that the examiners received formal graduate-level training on the WISC-R, but no direct formal training on the WISC-III. In a study by Perot (1992), no significant difference in number of scoring errors was found between those with formal training on the WISC-R and those without.
Most practitioners in the field are not formally re-trained on revisions of tests, and the conditions in this study are likely representative of the conditions in the population of examiners in the province. In any case, the results of this study, and any conclusions drawn regarding the ease of use of the WISC-III protocol, must be seen in the light of the fact that any training on the second test (the WISC-III) is not as comprehensive as the training on the first test (the WISC-R).

Procedure

In order to protect the identity of the examiners, an associate photocopied the 252 completed protocols with the names of tester and test subject deleted, in order to ensure the anonymity of the examiner and the confidentiality of the examinee and actual test results. The examiner was identified by a code number only; the examinee was not identifiable in any way.

The protocols were then analyzed, item by item, for clerical errors using an error checklist and data compilation sheet. The error types tabulated were clerical in nature (Levenson et al., 1988; Sherrets et al., 1979) and consisted of: addition of subtest raw and scaled scores, computation of chronological age, transformation from tables of raw scores to scaled scores, and transformation of scaled scores to IQ scores. Because the WISC-III is relatively new (published 1991) and has been used for a relatively brief time period, no comparison between error rates over an extended time period was possible.

Analysis

The analysis of the data consisted of a tally of overall clerical errors on each test, as well as a tally of "serious" clerical errors, that is, errors causing Full Scale IQ changes (as has previously been done, e.g., by Bird, 1992). On each test, the number of errors was shown as a proportion of total protocols, along with the mean, range, and standard deviation of errors.
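One of the tabulated error categories, computation of chronological age, is itself simple calendar arithmetic and can be recomputed mechanically when checking a protocol. The following is only an illustrative sketch (a hypothetical helper, not the checklist used in this study); it assumes the conventional column-wise subtraction with a 30-day-month borrow often taught for Wechsler record forms, and the test manual governs the exact rule:

```python
def chronological_age(birth, test):
    """Chronological age as (years, months, days), computed by
    column-wise subtraction with borrowing. Dates are given as
    (year, month, day) tuples. Assumes a 30-day month is borrowed
    when the day column is negative, and 12 months per year."""
    by, bm, bd = birth
    ty, tm, td = test
    if td < bd:          # borrow 30 days from the month column
        td += 30
        tm -= 1
    if tm < bm:          # borrow 12 months from the year column
        tm += 12
        ty -= 1
    return ty - by, tm - bm, td - bd

# Example: tested 1993-10-05, born 1984-02-20
print(chronological_age((1984, 2, 20), (1993, 10, 5)))  # (9, 7, 15)
```

An age of 9 years, 7 months would be read as "9-7" in the notation of Table 2; the 30-day borrow means the result can differ by a day or two from exact calendar subtraction, which is why the manual's rule is authoritative.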
The number of errors made was also correlated, on both tests, with the examiners' years of experience in administering standardized tests.

The errors made were divided into their respective categories: addition of raw scores, transformation of raw to scaled scores, addition of scaled scores, transformation of scaled to IQ scores, and chronological-age computation. Comparisons between the two tests were made using t-tests for paired or dependent observations for each error type or category, as well as for overall errors and for "serious" (Full Scale IQ-changing) errors. The location of errors (type #1 only; the other error categories cannot logically be given a subtest location) on each test was examined.

On the WISC-III, the rate of clerical errors was examined over the three six-month time periods. Both overall and serious (i.e., IQ-changing) errors were examined. A repeated-measures 2-factor ANOVA was used to compare the number of errors over the three time periods. When significant differences were found, Tukey's multiple comparison technique was used to locate them.

Chapter IV
Results

This chapter will describe the results of the study outlined in Chapter III. The nature and range of clerical errors on the WISC-R and WISC-III will be described, with the frequency of error per examiner given, as well as error location and category. Error frequencies between the two tests will be compared, and the correlation between examiner experience and number of errors will be examined. On the WISC-III only, the number of overall and FSIQ-changing errors will be plotted over three six-month time periods, with comparisons made between the number of errors in the three time periods.

Errors and IQ Changes

Table 3 (see also Figure 1) displays the proportion of clerical errors on both tests and the resulting IQ changes. The WISC-III had a higher number of total errors, Verbal IQ (VIQ), Performance IQ (PIQ), and Full Scale IQ (FSIQ) changes than did the WISC-R.
Although 38% and 42% of the WISC-R and WISC-III protocols, respectively, contained clerical errors, not all of these clerical errors resulted in a VIQ, PIQ, or FSIQ change. Nevertheless, almost one-quarter (23%) of the WISC-III protocols completed reported incorrect Full Scale IQ scores.

Table 3
Number and (Proportion) of Protocols with Errors by Test

                 WISC-R      WISC-III    DIFFERENCE
Protocols        126         126
Overall Errors   48 (.38)    53 (.42)     5 (.04)
VIQ changes       9 (.07)    12 (.09)     3 (.02)
PIQ changes      15 (.12)    23 (.18)     8 (.06)
FSIQ changes     21 (.17)    29 (.23)     8 (.06)

Note. Proportions of each error type for each test are presented within parentheses. Error proportions do not sum to 1.0: errors may cause a VIQ or a PIQ change and a FSIQ change, or only a VIQ or PIQ change, or no change in scoring at all.

Figure 1. Number of Protocols with Errors by Test (per 126 protocols). [Bar chart, WISC-R vs. WISC-III, by error effect: Total Errors, VIQ Changes, PIQ Changes, FSIQ Changes; chart not reproduced.]

On the WISC-R protocols with FSIQ changes, the mean FSIQ change was 2.3 points, with a range of 1-8 points. The mean FSIQ change on the WISC-III protocols with errors was 1.8 points, with a range of 1-7 points. In comparison, the SEM on the WISC-III is 3.20, but it can be argued that this does not account for clerical errors. Of the errors which led to a change in Full Scale IQs, one error on the WISC-III led to a descriptive category change, from Mildly Mentally Handicapped to Borderline. Both tests showed numerous other changes (e.g., 70-73 on the WISC-R; 71-75 on the WISC-III) which were close enough to category changes to possibly influence placement decisions.

Examiners

Six of the seven examiners made clerical errors (see Table 4; also shown graphically in Figure 2), with one examiner (#5) completing error-free protocols on both tests. On the WISC-R, examiners committed a mean of 6.86 clerical errors per 18 protocols (38%, Range 0-11, S.D.
4.41), while on the WISC-III a mean of 7.57 errors was found per 18 protocols (42%, Range 0-12, S.D. 3.78).

The relationship between the examiners' years of experience as school psychologists and the number of clerical errors made on each of the tests was examined using a Pearson product-moment correlation coefficient. On the WISC-R, no strong relationship was found between years of experience and errors. On the WISC-III, however, a positive relationship was found between overall errors and experience (r = .54). When years of experience was correlated with Full Scale IQ-changing errors, the correlation coefficient was even higher (r = .60). These correlation coefficients were not statistically significant, but, based on this sample alone, the pattern suggests that the correlation likely would have been significant if the n had been larger.

Table 4
Number of Total Errors by Test per Examiner

Examiner      WISC-R    WISC-III    DIFFERENCE
1             11         9          -2
2              8        12           4
3              6         7           1
4             10         8          -2
5              0         0           0
6             11        10          -1
7              2         7           5
Mean errors
per test       6.9       7.6          .70

Note. There are 18 protocols per test for each examiner.

Figure 2. Errors by Examiner (per 18 protocols). [Bar chart of each examiner's WISC-R and WISC-III error counts; chart not reproduced.]

Location of Errors

The location of errors on each test (see Table 5) showed a definite pattern, with more errors made on the Performance subtests than on the Verbal subtests. The subtest most prone to error on both tests was Coding (errors on this subtest could be defined as either scoring or clerical errors), with 54% and 38% of total errors. The errors on this subtest, though numerous, frequently did not have a great impact on the subtest scaled score and the resulting Performance IQ and Full Scale IQ. On the WISC-R, three subtests were without any clerical errors: Information, Similarities, and Object Assembly. On the WISC-III, only Information and Arithmetic were error-free.

Far more category-#1 (addition of raw score) errors were found on the Performance subtests than on the Verbal subtests for both tests.
On the WISC-R, 67% of category-#1 errors were found on the Performance subtests; on the WISC-III, 82% of these errors were found on the Performance subtests.

Table 5
Location of Clerical Errors by Test

                        WISC-R    WISC-III    Difference
Protocols               126       126
Verbal subtests           6        14            8
  Information             0         0            0
  Similarities            0         2            2
  Arithmetic              1         0           -1
  Vocabulary              4         5            1
  Comprehension           0         3            3
  Digit Span              1         4            3
Performance subtests     27        28            1
  Picture Completion      1         2            1
  Coding                 18        16           -2
  Picture Arrangement     1         1            0
  Block Design            6         1           -5
  Object Assembly         0         6            6
  Mazes                   1         1            0
  Symbol Search          n/a        1           n/a

Error Category

Table 6 (shown graphically in Figure 3) displays the breakdown of clerical errors by category for each test.

Table 6
Error Categories Across Tests

                                  WISC-R     WISC-III    Difference
Protocols                         126        126
1. Addition of raw scores         33 (.69)   42 (.79)     9 (.10)
2. Transformation of raw to
   scaled scores                   6 (.12)    5 (.09)    -1 (.03)
3. Addition of scaled scores       2 (.04)    5 (.09)     3 (.05)
4. Transformation of scaled
   to IQ scores                    7 (.15)    1 (.02)    -6 (.13)
Total errors                      48         53           5

Note. Parentheses show the proportion of each test's total errors.

Figure 3. Clerical Errors by Category

The most common error, by far, for both tests was the addition of raw scores. On the WISC-R, 69% of clerical errors came as a result of incorrect addition of raw scores (error category #1); on the WISC-III, the proportion was slightly higher, with 79% of clerical errors resulting from incorrect addition of raw scores.

Although addition-of-raw-scores errors were the most common, other error types were more likely to have a greater impact on IQ scores.
Whereas incorrect addition of raw scores affects only one subtest, and then often only minimally, the other error types involve scaled scores and standard scores and are more certain to change the IQ scores. The single error category showing the most change was error category #4, transformation of scaled to IQ scores, which accounted for 15% of errors on the WISC-R but only 2% of errors on the WISC-III. No chronological-age computation errors (error category #5) were found on any protocols for either test.

Comparison between WISC-R and WISC-III

A series of t-tests for paired or dependent observations was run to analyze the difference between the means on the two tests for total errors, Full Scale IQ-changing errors, and each error category (see Table 7). No significant difference was found between the means on the two tests for any of these categories.

Table 7
Comparison of Error Types for WISC-R and WISC-III

                          WISC-R            WISC-III
Error Type             Mean    S.D.      Mean    S.D.      t-test
                       (n=7)             (n=7)
1. Overall errors       6.86   4.41      7.57    3.77      0.67
2. FSIQ-changing
   errors               3.00   2.08      4.14    2.67      1.43
3. Category #1          4.71   2.81      6.00    2.89      2.12
4. Category #2          0.86   1.21      0.71    1.25      1.0
5. Category #3          0.29   0.76      1.14    0.90      1.87
6. Category #4          1.00   1.41      0.14    0.38      1.87

Note. The t-tests used were for paired or correlated observations. Means are numbers of errors per 18 protocols. Type-#1 errors: addition of raw scores; type-#2 errors: transformation of raw to scaled scores; type-#3 errors: addition of scaled scores; type-#4 errors: transformation of scaled to IQ scores.

Examiner Experience and the WISC-III

One of the questions asked in this study pertains to the effect that examiner experience with a test has on the rate of clerical error. Table 8 (and Figure 4) displays each examiner's overall errors over three consecutive half-year time periods from the beginning of WISC-III administration.
This table shows the mean error rate decreasing over each of the three time periods, from 3.0 errors per six protocols at the beginning of administration to 2.0 errors per six protocols in the third six-month period.

A repeated-measures two-factor ANOVA was run in order to compare the rate of overall examiner clerical error on the WISC-III in each of the three time periods. No significant effect was found (F[2,12] = 1.10, MSe = 1.59) for the time/experience factor (the variance for the person factor cannot be calculated because there is only one observation per cell). Thus, there was no significant difference in overall error rates between the three six-month time periods.

Table 8
WISC-III Overall Errors Over Time

Examiner    Time 1    Time 2    Time 3
1             4         2         3
2             5         4         3
3             4         1         2
4             3         2         3
5             0         0         0
6             4         4         2
7             1         5         1
M            3.0       2.6       2.0

Note. Errors listed consist of clerical errors per six protocols for each of the three six-month time periods for each examiner.

Figure 4. WISC-III Mean Errors (per 6 protocols): total errors and FSIQ-changing errors across the three six-month time periods

Table 9 (and Figure 4) displays examiners' Full Scale IQ-changing errors over the three time periods from the beginning of WISC-III administration. A decrease in IQ-changing errors was noted: from 2.0 per six protocols at the beginning of WISC-III administration, to .57 in the third six-month period of administration.

Table 9
WISC-III FSIQ-Changing Errors Over Time

Examiner    Time 1    Time 2    Time 3
1             3         2         2
2             3         3         1
3             3         1         0
4             1         1         1
5             0         0         0
6             4         2         0
7             0         2         0
M            2.0       1.6        .6

Note. Errors listed consist of Full Scale IQ-changing clerical errors per six protocols for each of the three six-month time periods.

Another repeated-measures ANOVA was run to investigate the differences between the means of serious errors - that is, errors which cause changes in Full Scale IQ - for the three time periods.
It was found that there was a significant difference (F[2,12] = 4.06, p < .05, MSe = 0.93) between the three means. Through further investigation using the Tukey honest significant difference multiple-comparisons technique (HSD.05 = 1.37), it was found that the mean for time period 1 was significantly higher than the mean for time period 3. In other words, significantly more Full Scale IQ-changing clerical errors were made at the very beginning of WISC-III administration than after a year to a year and a half of experience.

Summary of Results

Clerical errors were found on 38% and 42% of WISC-R and WISC-III protocols, respectively, with 17% and 23% of protocols having Full Scale IQ changes as a result. The mean FSIQ change on the WISC-R was 2.3 points (range 1-8); on the WISC-III the mean change was 1.8 points, with a range of 1-7 points. Only one error led to a change in descriptive category, from Mildly Mentally Handicapped to Borderline.

The study's first hypothesis stated that the rate of practitioners' clerical error on completed and scored WISC-III protocols would be equal to their rate on the WISC-R. No significant differences were found between the mean errors on the two tests for overall errors, Full Scale IQ-changing errors, or any of the four error categories examined. A positive, but statistically non-significant, correlation was found between examiners' years of experience and the number of clerical errors made on the WISC-III.

The second hypothesis stated that practitioners' clerical error rates would not change significantly with increasing experience on the WISC-III. The examiners' mean overall error rate was reduced from 3.0 errors per six protocols at the beginning of administration to 2.0 errors per six protocols in the third six-month period. The difference was not statistically significant (F[2,12] = 1.10). The examiners' Full Scale IQ-changing errors were also compared over the three six-month time periods.
At the beginning of test administration, 2.0 errors per six protocols were made which changed the Full Scale IQs; during the third six-month time period, the figure was 0.57. The difference between these means was significant (F[2,12] = 4.06; p < .05).

Chapter V
Discussion

The purpose of this study was to investigate the difference in error rates for practitioners between the WISC-R and its revision, the WISC-III. The first hypothesis presented in this study stated that the rate of practitioners' clerical error on completed and scored WISC-III protocols would be equal to their rate of error on the WISC-R. The results of this study show that the WISC-III error rate was not significantly different from the error rate on the WISC-R.

Also of interest was the pattern of error rates on the WISC-III over three time periods, starting from the beginning of its administration. The second hypothesis stated that practitioners' clerical error rates would not change significantly with increasing familiarity (experience) on the WISC-III. The results from this study suggested that whereas overall clerical error rates showed no significant change, Full Scale IQ-changing errors showed a significant decrease over the three time periods; that is, school psychologists made significantly more errors at the beginning of test administration than after 18 months of administration.

Error Rates

Of the 252 total protocols, 40% contained clerical errors, with almost 20% of the total number of protocols having incorrect Full Scale IQs as a result of these errors. Slightly more clerical errors were found on the WISC-III (53, or 42% of the 126 protocols) than on the WISC-R (48, or 38% of the 126 protocols), but the difference was not significant. There was also no significant difference between the two tests for errors that caused changes in Full Scale IQs.

Other studies have reported similar rates of clerical error. For example, Sherrets et al.
(1979) reported that 46.5% of the 200 WISC-R protocols examined in their study contained at least one clerical error. Wagoner (1988) found that 34% of the 50 WISC-R protocols contained errors which changed Full Scale IQs, although the errors examined in her study included scoring errors as well as clerical errors, thus inflating the rate of errors in comparison to this study. Error rates in studies using student examiners are generally much higher than the rates reported in the present study. For example, Slate and Chick (1989) found that 67.4% of their student examiners made errors which changed Full Scale IQ scores.

Of the WISC-R protocols with FSIQ changes, the mean change was 2.3 points, with a range of 1-8 points. The standard deviation of the change in FSIQ scores was 1.14. On the WISC-III, the mean change on protocols with errors was 1.8 points, with a range of 1-7 points. The standard deviation of the change in FSIQ scores was 1.05. The SEM of the WISC-III is 3.20, but because of the method used in calculating this SEM, it is evident that clerical errors must be seen as a source of additional error.

Standard Error of Measurement

As stated earlier (in Chapter 1), the SEM on the Wechsler scales is calculated from a composite of split-half reliability coefficients taken from individual subtests. The resulting SEM, therefore, cannot be thought to account for error from sources outside of these subtests. The SEM published for a Wechsler test might be seen to account for one type of clerical error - addition of raw scores - in the protocols used in the reliability studies in the production of the test. However, it is almost certain that the rate of clerical errors is higher in actual practice than is the rate given from the test's original reliability study.
Those errors not found within the actual subtests - transformation of raw to scaled scores, addition of scaled scores, transformation of scaled scores to IQ scores - are not accounted for at all by the published SEM of the test.

It has been suggested that the SEM of a test should include not only errors brought about through content sampling, as in the case of the Wechsler scales, but also time-sampling errors, errors that result from mistakes in scoring, and errors from improper test administration (Bradley, Hanna, & Lucas, 1981). The Bradley et al. study suggested that a composite SEM be composed in the following manner: SEMcomp = √(SEMcs² + SEMts² + SEMs² + SEMa²), where cs = content sampling, ts = time sampling, s = scoring, and a = administration. Their estimation of a composite SEM on the WISC-R came to 6.5 IQ points (as opposed to the published 3.19 points).

The results of the present study suggest that because clerical errors exist and have an effect on test reliability, a composite SEM should also include clerical errors (SEMc). The standard deviation of IQ changes on the WISC-R was 1.14 points (1.05 on the WISC-III). When this term is included in the above formula, the composite SEM rises to 6.7 points for the WISC-R, as opposed to the published SEM of 3.19 points. This pattern is likely similar for the WISC-III.

Examiners

Six of the seven examiners (86%) made clerical errors on both tests, with one examiner completing error-free protocols on all 18 protocols of the WISC-R and all 18 of the WISC-III. In comparison, Sherrets et al. (1979) found that 89% of the examiners in their study had committed clerical errors; Levenson et al. (1988) found that 57% of examiners made clerical errors on the 162 WISC-R protocols used in their study. The results from this study in this area are similar to results in previous studies and suggest that the sample of examiners is similar to the samples in previous research.

The examiners' years of experience were correlated with the number of errors made on the two tests.
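A Pearson product-moment coefficient of this kind is computed directly from the paired (experience, error-count) observations. The sketch below shows the calculation with illustrative values only; the seven pairs used here are hypothetical and are not the study's raw data, which are reported in aggregate in Table 4.

```python
# Pearson product-moment correlation between years of experience and
# clerical-error counts. The data below are illustrative, not the study's.
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical example: seven examiners' years of experience and error counts.
experience = [2, 4, 5, 7, 9, 12, 15]
errors = [3, 5, 4, 7, 6, 9, 10]
print(round(pearson_r(experience, errors), 2))
```

With only seven examiners, even a coefficient of this size can fall short of statistical significance, which is consistent with the non-significant r values reported above.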
On the WISC-R, no relationship was seen between years of experience and the number of clerical errors made. On the WISC-III, however, a positive correlation was seen between overall errors and experience (r = .54). Even stronger was the correlation between IQ-changing errors and experience (r = .60). These positive correlations (although not statistically significant) suggest that, in this sample, those examiners with the most years of experience had the most difficulty adapting to the changes in a newly revised standardized test and were more likely to make careless errors. These more experienced examiners made more clerical errors, and they reported more inaccurate results than did examiners with less experience.

The error rates in the present study are consistent enough with those in other studies to support the assumption that the examiners are representative of the population of examiners. Conclusions drawn in this study should be applicable to other practitioners using these tests. Also, the demographic characteristics of the community from which the data were taken suggest that the examinees were likely representative of the population of examinees in the province.

Error Categories

The most common error type by far, on both tests, was addition of raw scores. On the WISC-R, 69% of all errors were of that type, and on the WISC-III, 79% of all errors were in that category. The difference between the two tests was not significant. This type of error, though most frequent, is least likely to make a major difference in the IQ score at the end of the test.

Category-#4 errors - transformation of scaled to IQ scores - were the second most frequent error type on the WISC-R, with 15% of errors, but were the least frequent error type on the WISC-III, with only 2% of errors.
The difference between the two tests for this error type was not statistically significant, but it suggests a difference between the two tests.

Little (1992) suggested that "Of greatest concern with regard to administration and scoring is the process of converting raw scores to standard scores ... This different format of presentation is sure to be confusing and cause recording errors" (p. 151). That the WISC-III seems less prone to error in the area of score transformation is therefore somewhat surprising. The changes made on the WISC-III protocol appear to make the process of transforming raw scores to scaled, or standard, scores more confusing. Contrary to Little's predictions, however, the new format does not seem to be more error-prone in this area, with the present study showing relatively few errors committed in that location.

The second most frequent error types on the WISC-III were transformation of raw to scaled scores (9%) and addition of scaled scores (9%). These errors were found in similar numbers on the WISC-R, with no significant difference between the two tests in this area. That these error rates are the same should be of some concern to those using the WISC-III: these two error categories are among the most likely to have an impact on Full Scale IQ.

Most studies report at least some clerical errors in the area of computation of chronological age (e.g., Levenson et al., 1988; Sherrets et al., 1979). No errors of this type were found in this study, probably as a result of the referral and reporting process used in the school district from which the protocols were taken. In the school district selected for this study, the school psychologists are given referrals with age and birth date supplied; after the assessment, the reports with birth date and age are distributed to parents and school personnel involved with the child.
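The chronological-age computation itself is a simple date subtraction. The sketch below shows the generic clerical step of deriving years and months from a birth date and a test date; scoring-manual conventions for handling partial months and days vary by test and are not reproduced here, and the dates used are hypothetical.

```python
# Chronological age at testing, as (years, months) elapsed between the
# birth date and the test date. A generic sketch of the clerical step;
# manual-specific rounding conventions are not reproduced.
from datetime import date

def chronological_age(birth: date, test: date):
    """Return (years, months) elapsed between the birth date and the test date."""
    years = test.year - birth.year
    months = test.month - birth.month
    if test.day < birth.day:          # an incomplete month is not counted
        months -= 1
    if months < 0:                    # borrow twelve months from the year column
        years -= 1
        months += 12
    return years, months

print(chronological_age(date(1983, 9, 14), date(1994, 4, 2)))  # (10, 6)
```

Because the subtraction involves borrowing across the month and year columns, it is exactly the kind of arithmetic a supplied, pre-checked birth date protects against, as described above.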
It is likely that any chronological-age computation error would be noticed and the test protocol corrected.

The WISC-III Over Time

The WISC-III protocols were also scrutinized for the rate of examiner errors over three time periods, starting with the beginning of administration and continuing for three half-year time periods. All of the examiners came to the WISC-III with training and experience on the WISC-R, but no formal retraining on the WISC-III. The revised test is similar enough to the WISC-R that examiners are expected to be able to make the transition with a minimum of difficulty.

Two types of errors were looked at: overall clerical errors, and serious errors (i.e., those errors which cause the Full Scale IQs to change). The overall error rate declined over the three time periods from 3.0 per six protocols to 2.0 per six protocols, but the decrease was non-significant. However, for the errors which cause Full Scale IQ change, a significant difference was found between the first and third time periods. The mean number of IQ-changing errors (2.0 per six protocols, or 33%) at the beginning of administration was significantly higher than during the third six-month period, which had an IQ-changing error rate of 0.6 per six protocols (10%).

Past studies have shown some decrease in examiner errors in graduate students' administrations of standardized tests, but few, if any, studies to date have looked at practitioners. Slate and Jones (1990c) found no decrease in graduate students' scoring error rates after 5 practice administrations of the WISC-R, and a slight decrease after 10 administrations. In their study, error rates were very high: 79.7% of the protocols completed contained errors which changed the Full Scale IQ.
They concluded that the training for administration of standardized tests was inadequate and should consist of more than simple repetition of test administrations.

Gilbert (1992) investigated an alternative method of teaching test administration and found students' WAIS-R administration errors significantly reduced from the beginning to the end of the semester. Gilbert's study, however, looked only at administration errors, and not at scoring or clerical errors, which may or may not have followed a similar pattern.

The results of the present study indicate that practicing school psychologists make a significantly higher number of IQ-changing errors at the beginning of administration of a newly revised instrument than they do 18 months later. Interestingly, though, the rate of overall errors does not change significantly over time. The reason for this difference in pattern likely lies in the make-up of each of these two broad categories of errors (overall and IQ-changing).

The overall errors consisted mostly of category-#1 errors, that is, addition of raw scores. These errors often result in minimal change in scaled scores and, consequently, IQ scores. The WISC-R and WISC-III do not differ vastly in the way raw scores are added up, and therefore no change over time is noted when the examiners begin using the revised test. The IQ-changing errors, though, are more likely to be category-#2, -#3, or -#4 errors, which have more potential to change IQ scores. These types of errors can be directly attributed to changes in the protocol format of the newly revised test, and to the examiners' inexperience with the test. When all three time periods of WISC-III administration were examined, the WISC-III did not show a significantly higher rate of category-#2, -#3, and -#4 errors than the WISC-R.
It may be construed that the high rate of IQ-changing errors at the beginning of WISC-III administration can be attributed to examiners' inexperience and lack of retraining rather than to a deficiency in protocol design.

Strengths and Weaknesses of the Study

The relatively small number of examiners who provided protocols is perhaps a limitation of this study, though the total number of protocols used compares favourably with most other similar studies in the literature. Also, the variety of training backgrounds (school and clinical psychology, many different graduate programs) may lessen the impact of this limitation. Another limitation is that only Master's-level psychologists participated in this study, thus likely underestimating the rate of clerical error (Ryan et al., 1983). However, error rates in this study were similar to those in other studies (Sherrets et al., 1979; Wagoner, 1988).

Another possible limitation is that this study focussed only on clerical errors, and ignored scoring and administration errors. The error rate in this study therefore likely underestimates the actual rate of total examiner error.

Several key areas of strength were noted in this study. For example, unlike most research completed in this area, this study used actual protocols from practitioners. Many other studies have used either graduate students' protocols or fabricated protocols scored by volunteer professionals. The size of the sample of protocols also compares favourably with other studies. This study also examined examiner error from two new perspectives: very few, if any, studies to date have made comparisons between a test and its revision, or looked at the pattern of error rates over time.

Conclusions

Two central questions were addressed in this study. The first was regarding the change in the rate of examiner clerical error between the WISC-R and the WISC-III.
The second involved the rate of error over time on the WISC-III, from the beginning of the test's administration to a period 18 months later.

A claim made in the WISC-III examiner's manual (1991) states: "Changes from the WISC-R ... have made it easier for the examiner to use" (p. iv). I interpreted the claim to mean that the test, including the newly revised protocol, has been designed so as to be less prone to examiners' clerical errors. Even if "easier to use" wasn't intended to be interpreted as reduced clerical errors, the problem of clerical errors is well enough documented that it should have been addressed in the design of the revised test.

The results from this study suggest examiners continue to make a substantial number of clerical errors on the WISC-III: 42% of protocols had clerical errors, with Full Scale IQ-changing errors on 23% of all protocols. The results suggest that there is no significant difference in clerical errors between the two tests, and therefore the WISC-III does not seem to be easier to use than the WISC-R, if ease of use is interpreted as the propensity to make clerical errors.

A more extensive retraining on the WISC-III might reduce the frequency of IQ-changing errors in the initial months of test administration, when these errors are most likely to occur. This retraining might not have a significant effect on reducing overall clerical errors, which in this study occurred at a high rate throughout WISC-R administration and remained high throughout all three time periods of WISC-III administration.

Because clerical errors remain a problem with the WISC-III, and because the provided SEM does not account for these clerical errors, it is suggested that the amount of error in a test could be better indicated with a composite SEM which would include not only content-sampling errors, but all other error types, including clerical errors. This suggestion, a variation of a suggestion made by Hanna et al.
(1981), would provide test users with important additional information that is not presently published. The issue of reliability and SEM is not just a theoretical or psychometric issue, but one of practical significance. Scores from standardized tests are frequently given within a surrounding confidence interval, which is directly related to the reliability coefficient of the test.

This study also examined the rate of clerical error on the WISC-III over three six-month time periods, beginning with the first six months of administration of the test. It was found that although the overall clerical error rate did not change significantly between the three time periods, Full Scale IQ-changing errors were significantly higher in the first time period than in the last. In other words, examiners made a relatively high number of serious (FSIQ-changing) errors at the beginning of their administration of the WISC-III.

It was earlier hypothesized that there would be no significant change in the rate of clerical error with examiners' increasing familiarity with the WISC-III. Although little, if any, previous research had been conducted in this area with practitioners, and studies with graduate students showed a decrease in error rates with experience, it was thought that practitioners who had no formal retraining would show special vigilance in avoiding errors with a new and somewhat unfamiliar test revision. The present study suggests that practitioners with no formal retraining make a relatively high number of IQ-changing errors at the beginning of the test's use. It was found that the examiners who had the most difficulty with the transition from the WISC-R to the WISC-III were those with the most experience on the WISC-R. There is a clear need for the retraining of examiners who use the revised version of a standardized test.

Implications of the Study

Clerical errors continue to be a problem with the Wechsler scales.
The results from this study suggest that examiner errors continue unabated on the WISC-III, with no significant reduction in either overall or IQ-changing errors. The revised protocol, with one possible exception (transformation of raw to scaled scores), does not seem to be less prone to clerical errors. And although one might attribute all clerical errors to examiner carelessness and consider these errors unavoidable, valuable suggestions have been made which may lessen their rate and impact. Levenson et al. (1988) made the suggestion that a "revised record form might include directions to the psychologist to verify and recheck all response scoring, mathematical, administrative, IQ, and birth date computations" (p. 663). Suggestions like these have been wholly ignored on the WISC-III protocol, with the result that the error rate remains high and unchanged. The WISC-III would be a more reliable instrument if more emphasis were placed by the test designer on avoiding clerical errors. A few key instructions on the actual protocol - perhaps a step-by-step guide to check and recheck scoring and mathematical computations - would likely result in more reliable scores.

Recently introduced computer scoring programs for standardized tests may help reduce the incidence of some types of clerical errors. For example, the Scoring Assistant for the Wechsler Scales, introduced in 1992, completes all calculations for transformation of raw to scaled scores, addition of scaled scores, and transformation of scaled to IQ scores. The most frequent of clerical errors - addition of raw scores - would not be reduced through the use of a computer scoring program, nor would scoring or administration errors. Also introduced through the use of computer scoring programs would be another source of error: mistakes in transferring data from the protocol to the computer.

Another issue that has been largely ignored is the accuracy of the reported reliability coefficients and SEMs.
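The composite SEM discussed earlier combines independent error components in quadrature (Bradley, Hanna, & Lucas, 1981), extended here with a clerical-error component as this study proposes. The sketch below illustrates that combination; apart from the published WISC-R content-sampling SEM (3.19) and the clerical standard deviation observed in this study (1.14), the component values are placeholders, not figures from the thesis.

```python
# Composite standard error of measurement: independent error components
# combine as the square root of the sum of their squares.
from math import sqrt

def composite_sem(*components):
    """Combine independent SEM components in quadrature."""
    return sqrt(sum(c * c for c in components))

sem_cs = 3.19   # content sampling (published WISC-R SEM)
sem_ts = 4.0    # time sampling -- placeholder value
sem_s = 3.5     # scoring -- placeholder value
sem_a = 2.0     # administration -- placeholder value
sem_c = 1.14    # clerical (S.D. of FSIQ changes observed in this study)

print(round(composite_sem(sem_cs, sem_ts, sem_s, sem_a, sem_c), 1))
```

Because the components add in squares, a small clerical term enlarges the composite only modestly, yet omitting it still understates the total error band around a reported IQ.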
The results from this study underscore the need to include in a composite SEM not only the content-sampling, time-sampling, and test-administration errors suggested by Hanna et al. (1981), but also the component of clerical error.

Professional examiners are not often formally retrained on revisions of standardized tests. For the most part, it has been assumed that they possess the knowledge and experience to undertake the necessary changes in practice brought on by the test's revision. This study suggests that professionals make a relatively higher number of serious clerical errors when they are in the process of learning to use a revised test. It has been suggested (e.g., Franklin et al., 1982) that practicing psychologists are lacking in opportunities for continuing education. From the results of this study, it can be recommended that practitioners undertake a more extensive retraining when a standardized test is revised, even when the changes appear to be largely cosmetic, as is the case with the WISC-III.

REFERENCES

American Educational Research Association, American Psychological Association, and National Council on Measurement in Education. (1985). Standards for educational and psychological testing. Washington, DC: American Psychological Association.

Anastasi, A. (1988). Psychological testing (6th ed.). New York: Macmillan.

Anderson, T., Cancelli, A., & Kratochwill, T. (1984). Self-reported assessment practices of school psychologists: Implications for training and practice. Journal of School Psychology, 17-29.

Bird, C. A. (1992). Examiner errors in scoring the Wechsler Intelligence Scale for Children - Revised (PhD dissertation, Texas Woman's University, 1992). Dissertation Abstracts International, 54/01, p. 140.

Blakey, W., Fantuzzo, J., Gorsuch, R., & Moon, G. (1987). A peer-mediated, competency-based training package for administering and scoring the WAIS-R. Professional Psychology: Research and Practice, (1), 17-20.

Borg, W., & Gall, M. (1989).
Educational research: An introduction (5th ed.). New York: Pitman.

Bradley, F., Hanna, G., & Lucas, M. (1980). The reliability of scoring the WISC-R. Journal of Consulting and Clinical Psychology, 530-531.

Brannigan, G. (1975). Scoring difficulties on the Wechsler intelligence scales. Psychology in the Schools, 12, 313-314.

Buckhalt, J. (1990). Attributional comments of experienced and novice examiners during intelligence testing. Journal of Psychological Assessment, 478-484.

Cleutat, V. (1965). Examiner differences with the Stanford-Binet IQ. Perceptual and Motor Skills, 317-318.

Conner, R., & Woodall, F. (1983). The effects of experience and structured feedback on WISC-R error rates made by student examiners. Psychology in the Schools, 376-379.

Conrad, P. (1991). Error frequencies of student-examiners on WISC-III administrations (Doctoral dissertation, Fuller Theological Seminary, School of Psychology, 1991). Dissertation Abstracts International, 52-10.

Egeland, B. (1969). Examiner expectancy: Effects on the scoring of the WISC. Psychology in the Schools, 313-315.

Franklin, M., Stillman, P., & Burpeau, M. (1982). Examiner error in intelligence testing: Are you a source? Psychology in the Schools, 19, 563-569.

Gilbert, T. (1992). Students' errors on the administration of the WAIS-R. Canadian Journal of School Psychology, 126-130.

Hanley, J. H. (1977). A comparative study of the sensitivity of the WISC and WISC-R to examiner and subject variables (Doctoral dissertation, Saint Louis University). Dissertation Abstracts International, 7814571.

Hanna, G., Bradley, F., & Holen, M. (1981). Estimating major sources of measurement error in individual intelligence scales: Taking our heads out of the sand. Journal of School Psychology, 19, 370-376.

Hunnicutt, C., Slate, J., Gamble, C., & Wheeler, M. (1990). Examiner errors on the Kaufman Assessment Battery for Children: A preliminary investigation. Journal of School Psychology, 271-278.

Kaufman, A. (1990).
The WPPSI-R: You can't judge a test by its colors. Journal of School Psychology, 387-394.

Levenson, R., Golden-Scaduto, C., Aiosa-Karpas, C., & Ward, A. (1988). Effects of examiners' education and sex on presence and type of clerical errors made on WISC-R protocols. Psychological Reports, 659-664.

Little, S. (1992). The WISC-III: Everything old is new again. School Psychology Quarterly, 7, 148-154.

Lubin, B., Larsen, R., & Matarazzo, J. (1984). Patterns of psychological test usage in the United States. American Psychologist, 39(4), 451-453.

Miller, C., & Chansky, N. (1972). Psychologists' scoring of WISC protocols. Psychology in the Schools, 9, 144-152.

Miller, C., & Chansky, N. (1973). Psychologists' scoring of WISC protocols: A follow-up note. Psychology in the Schools, 10, 344-345.

Miller, C., Chansky, N., & Gredler, G. (1970). Rater agreement on WISC protocols. Psychology in the Schools, 7, 190-193.

Ministry of Education, Information Services Branch (1993). Information profile for school year 1992/93.

Nunnally, J. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill.

Oakland, T., Lee, S., & Axelrad, K. (1975). Examiner differences on actual WISC protocols. Journal of School Psychology, 227-233.

Perot, J. (1992). Cognitive strategies and heuristics underlying psychologists' judgements on the WISC-R verbal scales: A protocol analysis. Unpublished master's thesis, University of British Columbia, Vancouver, B.C.

Petersen, D., Steger, H., Slate, J., Jones, C., & Coulter, C. (1991). Examiner errors on the WRAT-R. Psychology in the Schools, 205-208.

Psychological Corporation (1991). Manual for the Wechsler Intelligence Scale for Children - Third Edition. New York: Harcourt Brace Jovanovich, Inc.

Rothman, C. (1973). Differential vulnerability of WISC subtests to tester effects. Psychology in the Schools, 300-302.

Russ, S. (1978). Teaching psychological assessment: Training issues and teaching approaches. Journal of Personality Assessment, 42(5), 452-456.

Rust, J., & Golombok, S.
(1989). Modern Psychometrics. The scienceof Psychological Assessment. London: Routledge.80Ryan, J., Prifitera, A., & Powers, L. (1983). Scoring reliability on theWAIS-R. Journal of Consulting and Clinical Psychology, 5j,149-150.Sattler, J. M. (1991). Normative changes on the WechslerPreschool and Primary Scale of Intelligence- Revised AnimalPegs subtest. Psychological Assessment: A Journal ofConsulting and Clinical Psychology, a, 691 -692.Sherrets, S., Gard, G., & Langner, H. (1979). Frequency of clericalerrors on WISC protocols. Psychology in the Schools, j., 495-6Slate, J. & Chick. (1989). WISC-R examiner errors: Cause forconcern. Psychology in the Schools, , 78-83.Slate, J., & Hunnicutt, L. (1988). Examiner errors on the Wechslerscales. Journal of Psycho-Educational Assessment, , 280-288.Slate, J. & Jones, C. (1989). Can teaching of the WISC-R beimproved? Quasi-experimental exploration. ProfessionalPsychology: Research and Practice, ZQ, 408-410.Slate, J. & Jones, C. (1990a). Examiner errors on the WAIS-R: Asource of concern. Journal of Psychology, 124, 343-345.81Slate, J., & Jones,C. (1990b). Identifying students’ errors inadministering the WAIS-R. Psychology in the Schools, , 83-87.Slate, J., & Jones, C. (1990c). Student error in administering theWISC-R: Identifying problem areas. Measurement andEvaluation in Counseling and Development, , 137-140.Slate, J., Jones, C., Coulter, C., & Covert, T. (1992). Practitioner’sadministration and scoring of the WISC-R: Evidence that we doerr. Journal of School Psychology, 3.Q., 77-82Strein, W., (1984). A method for the systematic observation ofexaminer behavior during psychoeducational assessments.Psychology in the Schools, Zj, 318-323.Stewart, K., (1987). Assessment of technical aspects of WISC-Radministration. Psychology in the Schools, 24, 221-228.Wagoner, R. (1988). Scorinç. errors made by practicing psychologistson the WISC-R. Unpublished Master’s thesis, Western CarolinaUniversity.Warren, S., & Brown, W. (1972). 
Examiner scoring errors on individualtests. Psychology in the Schools, 9, 118-122.82Wechsler, D. (1974). Manual for the Wechsler Intelligence Scale forChildren - Revised. New York: Psychological Corporation.

