The effects of experience and test revision on the rate of practitioners’ clerical errors on the WISC-R… Klassen, Robert Mark 1994

THE EFFECTS OF EXPERIENCE AND TEST REVISION ON THE RATE OF PRACTITIONERS' CLERICAL ERRORS ON THE WISC-R AND WISC-III

by

ROBERT MARK KLASSEN

B.Ed., University of British Columbia, 1987

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF
THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF ARTS
in
THE FACULTY OF GRADUATE STUDIES
Department of Educational Psychology and Special Education

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
APRIL, 1994
© Robert Mark Klassen, 1994

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Educational Psychology and Special Education
The University of British Columbia
Vancouver, Canada

ABSTRACT

The purpose of this study was to investigate the effect of test revision and examiner experience on the rate of examiner clerical error on the Wechsler Intelligence Scale for Children - Revised (WISC-R) and the Wechsler Intelligence Scale for Children - Third Edition (WISC-III). A total of seven school psychologists provided a sample of 252 protocols: 18 from each psychologist for the WISC-R, and 18 from each psychologist for the WISC-III. The errors tabulated consisted of clerical errors only, that is, addition of subtest raw and scaled scores, computation of chronological age, transformation from tables of raw scores to scaled scores, and transformation of scaled scores to IQ scores. Clerical errors were found on 38% of WISC-R protocols and 42% of WISC-III protocols. Errors caused by incorrect addition of raw scores were the most common error type. No significant difference was found between the two tests for any of the types of clerical errors. A positive but statistically non-significant correlation was found between examiners' years of experience on the WISC-R and the number of errors made.

On the WISC-III only, a comparison was made between the clerical error rate at the beginning of use of the test and the clerical error rate in two time periods in the 12 months following the beginning of use. For overall errors, no significant difference was found among the three time periods. However, examiners made significantly more Full Scale IQ-changing errors at the beginning of WISC-III administration than they did in the third six-month time period. It was suggested that test publishers provide clear directions on the actual protocol to help test users lower the rate of clerical errors. It was also suggested that, because of the high rate of clerical errors at the beginning of the use of the WISC-III, practitioners need formal retraining on a test when it is revised.
TABLE OF CONTENTS

ABSTRACT ........ ii
LIST OF TABLES ........ vi
LIST OF FIGURES ........ vii
ACKNOWLEDGEMENTS ........ viii

CHAPTER I    INTRODUCTION ........ 1
    Purpose of the Study ........ 2
    Error of Measurement ........ 2
    Previous Studies ........ 5
    Examiner Differences ........ 8
    Test Revisions ........ 9
    Protocol Changes ........ 10
    Questions and Hypotheses ........ 11
    Scope of Study ........ 12

CHAPTER II   LITERATURE REVIEW ........ 14
    Introduction ........ 14
    Studies with Graduate Students ........ 14
    Training Programs ........ 16
    Unscored Protocols ........ 17
    Completed Protocols ........ 22
    Definitions of Error ........ 24
    Test Revisions ........ 26
    Examiner Characteristics ........ 27
    Assessment Practices ........ 30
    Summary ........ 30

CHAPTER III  METHODOLOGY ........ 34
    Sample ........ 34
    Characteristics of Examiners ........ 36
    Characteristics of Examinees ........ 37
    Assumptions ........ 38
    Procedure ........ 40
    Analysis ........ 40

CHAPTER IV   RESULTS ........ 42
    Errors and IQ Changes ........ 42
    Examiners ........ 45
    Location of Errors ........ 48
    Error Category ........ 48
    Comparison between WISC-R and WISC-III ........ 52
    Examiner Experience and the WISC-III ........ 54
    Summary of Results ........ 58

CHAPTER V    DISCUSSION ........ 60
    Error Rates ........ 60
    Standard Error of Measurement ........ 62
    Examiners ........ 64
    WISC-III Over Time ........ 66
    Strengths and Weaknesses of the Study ........ 68
    Conclusions ........ 69
    Implications of the Study ........ 72

REFERENCES ........ 75

LIST OF TABLES

TABLE
1  Examiners' Years of Experience ........ 37
2  Characteristics of Examinees ........ 38
3  Number and Proportion of Protocols with Errors ........ 43
4  Number of Total Errors by Test per Examiner ........ 46
5  Location of Clerical Errors by Test ........ 49
6  Error Categories Across Test ........ 50
7  Comparison of Error-Types ........ 53
8  WISC-III Overall Errors Over Time ........ 55

LIST OF FIGURES

FIGURE
1  Number of Protocols with Errors by Test ........ 44
2  Errors by Examiners ........ 47
3  Clerical Errors by Category ........ 51
4  WISC-III Mean Errors ........ 56

ACKNOWLEDGEMENTS

I wish to express my sincere appreciation to Dr. Nand Kishor, my research supervisor, for his help and direction throughout the course of this research. Much guidance was given to help focus this paper on each successive draft.

Thanks also to Dr. Harold Ratzlaff, who helped enlighten some murky research questions, and who provided the original impetus to begin the program and to persevere to the end.

I also am appreciative of the feedback of Dr. Kim Schonert-Reichl, who provided valuable critical analysis on several drafts.

Finally, I wish to acknowledge the support, patience and love of my family: Andrea, Danielle, and especially Lenore, who tolerated my 4:30 a.m. awakenings, and who encouraged me to continue to the end.

Chapter I
Introduction

The Wechsler Intelligence Scales are among the most commonly used individual intelligence tests within educational and clinical psychology settings (Slate & Chick, 1989). In the school setting, critical educational placement and planning decisions are based on the data derived from these tests, which are assumed to be administered in a standardized manner, and scored with minimal error. In fact, the American Psychological Association's Standards for Educational and Psychological Testing (1985) suggests that the scoring of intelligence tests should be completely free of errors.

Numerous studies have shown, however, that test protocols are far from error-free, and that virtually all examiners make errors (e.g., Sherrets, Gard, & Langner, 1979; Slate, Jones, Coulter, & Covert, 1992). Unavoidably, the conclusion must be drawn that critical educational placement and planning decisions are sometimes based on the results from inaccurately-scored tests. These test results, consequently, are not as true a reflection of the child as they might seem to be.
Any information, then, that helps  professionals working in the field avoid errors in the scoring of these tests, is potentially of great assistance in achieving the ideal of error-free testing.  2  Purpose of the Study The Wechsler Intelligence Scale for Children  -  Revised (WISC  R) has recently (1991) been revised as the Wechsler Intelligence Scale for Children  -  Third Edition (WISC-lll).  These tests are among  the most widely used tests of mental ability for children aged 6 to 16.  This study investigated how the nature and rate of examiner  error change from one test, the Wechsler Intelligence Scale for Children  -  Revised (WISC-R), to its recently published revision, the  Wechsler Intelligence Scale for Children  -  Third Edition (WISC-llI).  Error of Measurement Standardized tests  -  such as the WISC-R and its revision, the  WISC-lll, generally report a standard error of measurement (SEM) which is an estimate of the amount of error involved in an examinee’s obtained test score.  This SEM accounts only for content  sampling errors, and, in the majority of tests, including the WISC-R and WISC-lll, is measured through the use of the Spearman-Brown correction of split-half correlations.  In this approach to reliability,  the test is divided into two approximately parallel forms or components which are then correlated.  The resulting correlation,  when adjusted for the greater length of the composite through the use of the Spearman-Brown correction, represents the reliability of the test.  On the WISC-R and WISC-lll, the reliability coefficients were first calculated for each individual subtest through the split-half method and Spearman-Brown correction formula (except for Coding,  3  and on the WISC-III, Symbol Search, which are speeded tests, for which split-half methods are inappropriate).  The individual subtest  reliability coefficients were then used to compute the overall test reliability, through the use of Nunnally’s formula for the reliability of a composite of several tests (Nunnally, 1978, p. 246).  On most standardized tests, the actual scoring of the test by the examiner is assumed to be almost completely objective and error-free.  In the case of many tests, no measure of examiner error  or of interrater reliability is reported, although this has been recommended: “There is as much a need for a measure of scorer reliability as there is for the more usual reliability coefficients” (Anastasi, 1988, p. 108).  The WISC-R manual, published in 1974 does not include any information on interscorer/interrater reliability or on how examiner error might affect test scores.  The WISC-lll manual, published in  1991 does offer some improvement over the WISC-R in the area of agreement between scorers.  In the new manual, results are reported  from a test-development standardization study on interscorer agreement for four of the thirteen subtests, with coefficients generally quite high, ranging from .90 for the Comprehension subtest, to .98 for the Similarities and Vocabulary subtests.  The data from the WISC-IlI standardization study on interscorer agreement were drawn from four examiners who each scored 4 subtests from  60 protocols.  The examiners, however, were  4  presumably chosen from among the specialists working on the standardization study, whose sole duty was accurate administration of the WISC-lll.  The WISC-llI manual states that in the  standardization study “all protocols were double-scored by trained scorers... 
The computer flagged any discrepancy between scorers questions were answered by in-house experis.  Scorers received  individual feedback on a daily basis and group feedback weekly” (1991, p. 28-29).  These highly-specialized standardization examiners can hardly be considered representative of practitioners working in the field. The practitioner uses a number of measurement instruments, focuses less on the test than on the subject or client, and has a great number of accompanying responsibilities.  Moreover, in most  cases, practitioners are not formally re-trained on revisions of tests.  The protocols used in the standardization study  checked and closely scrutinized  -  -  double-  also cannot be considered  representative of the protocols in the field where protocols are typically scored once only.  In the WISC-IIl test manual (1991) only a brief mention is made of research that suggests examiner error may contribute significantly to making test scores less reliable. made that “errors are common  ...  The admission is  and may lead to misleading scores”  (p. 57) but, surprisingly, apart from a few brief suggestions on how to avoid clerical errors (record scores legibly, and check the  5  addition for raw scores and scaled scores) more information on the nature or degree of these errors is not given.  Although the reported internal reliability and interscorer reliability coefficients in the WISC-lll manual are quite high, the sample of examiners used in the standardization study does not accurately reflect the population of practicing professionals.  Also,  the reported interscorer agreement coefficients do not account for errors directly attributable to the examiner giving the test, but only for scoring four of the subtests.  The end result of not attempting to  remedy the fact that examiner errors affect test scores is that “important decisions will continue to be made on the basis of inappropriately scored lQ measures” (Miller & Chansky, 1973, p. 344).  Previous Studies The question of examiner errors on standardized intelligence tests has been the subject of considerable research.  Many  researchers have circumvented the difficulty of gaining access to confidential data from practitioners (e.g., school and clinical psychologists, psychiatrists) by using data drawn from graduate students enrolled in training courses for administration of standardized tests (e.g., Slate & Chick, 1989; Stewart, 1987).  This  information is particularly useful in shedding light on training issues and on errors made by beginning testers.  Training programs  in graduate schools have been designed as a result of these studies (Conrad, 1991).  6  The generalizability of the data on errors made by graduate students to errors made by practicing professionals, however, is suspect  -  the nature of testing in the field is vastly different from  that of testing by students in graduate school.  Those practicing in  the field are quite likely to view their own administration of any particular test as incidental; that is, as a means of gaining more information about a student or a client, and not as an end in itself, which is quite frequently the case in graduate training courses.  At  the same time, the greater experience and familiarity a professional has with a test may result in increased skill in avoiding error and working with an instrument.  In any case, the testing practices of  graduate students are not likely representative of the testing practices of examiners working in the field.  
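As a reading aid for the studies reviewed in this section, which repeatedly compare the spread of scores across examiners against a test's published SEM, the reliability quantities described under Error of Measurement above can be restated in their standard textbook forms. The thesis gives these only in prose; the notation below is supplied here and uses conventional symbols, not symbols taken from the thesis.

The Spearman-Brown correction of a split-half correlation, where r_hh is the correlation between the two half-tests:

    \[ r_{xx} = \frac{2\, r_{hh}}{1 + r_{hh}} \]

The standard error of measurement implied by that reliability and the score standard deviation:

    \[ SEM = SD \sqrt{1 - r_{xx}} \]

The reliability of a composite of several subtests (the formula attributed above to Nunnally, 1978), with subtest variances \(\sigma_i^2\), subtest reliabilities \(r_{ii}\), and composite variance \(\sigma_c^2\):

    \[ r_{cc} = 1 - \frac{\sum_i \sigma_i^2 (1 - r_{ii})}{\sigma_c^2} \]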
Although studies  investigating student errors are potentially useful, “investigation of the examiner errors made by practicing professionals is sorely needed” (Slate, Jones, Coulter, & Covert, 1992, p. 78).  Some of the earlier studies did investigate the error rate of practicing professional psychologists by having the examiners score completed (i.e., filled-in) but unscored protocols (e.g., Miller & Chansky, 1972; Oakland, Lee, & Axelrad, 1975).  These studies  usually depended on a volunteer sample of psychologists who responded to a general mail-out to a particular professional body (e.g, National Association of School Psychologists).  A number of  sampling problems arise in studies of this nature, however, because the data are collected from volunteer subjects.  7  Several limitations exist when data are collected in this manner.  Firstly, the two studies cited above did not have a high rate  of return: 23.5 and 32%, respectively.  With these disappointingly  low return rates, the sample loses its randomness may be less representative of the population.  The results from studies of this  nature which place relatively heavy demands on the subjects inevitably suffer from the bias inherent in volunteer samples.  Secondly, little is known regarding the degree to which attention to scoring details changes for examiners anonymously scoring protocols on their own time rather than on the job in a clinical situation.  Relatedly, the scoring of an already completed  protocol is a contrived situation, with the subject quite likely aware of the nature of the study.  Because this approach may not  produce the same results as having an examiner filling out and scoring an actual protocol while administering the test, the generalizability of data obtained in this manner is limited.  Finally, because the researcher is practically limited to having only one or two protocols scored by the volunteer scorers, the generalizability of the results is questionable.  Test protocols vary  a great deal in their ease or difficulty of scoring, and one or two protocols may not well represent all protocols.  It seems then, that  in order to gain the most generalizable and accurate information, actual completed protocols from those practicing in the field should be analyzed.  And because the rate and nature of examiner error on  8  one test may not be generalizable to errors on another test, it is imperative that each widely-used test (and its revisions) should be investigated.  Examiner Differences The differences among examiners may result in differences in their error rate on standardized tests.  Studies have been completed  investigating the relationship between rate of examiner error and gender  (Hanley, 1977; Levenson, Golden-Scaduto, Aiosa-Karpas, &  Ward, 1988), education (Hanley, 1977; Levenson et al., 1988; Sherrets, Gard, & Langner, 1979), expectancy (Egeland, 1969) and experience (Conner & Woodall, 1983).  Of these three areas, one  might expect that an examiner’s experience and practice on a test would have the strongest impact on the rate of error.  As expected,  Conner and Woodall (1983) found a significant decrease in the number of errors on the WISC-R made by examiners as they gained experience.  However, the examiners in their study were student-  examiners enrolled in a graduate training program, and cannot be considered representative of practitioners in the field.  
Although little research has been done in the area of examiner experience and error, and more is clearly needed, it is likely that the error rate of practitioners changes as they gain experience working with any particular test.  And though one would predict a lowering of  the rate of examiner error on a graduate student’s test protocols as the result of instruction and experience, this pattern may not be the same with practitioners, who may be less concerned with the actual  9  administration and scoring of a test, and more concerned with the results and their interpretation.  Test Revisions There is no question that test revisions bring about changes in subjects’ test results. For example, subjects usually score lower on newer standardized tests than on older ones, as was the case in the norm changes in the revision of the Wechsler Preschool and Primary Scale of Intelligence (WPPSI) to the Wechsler Preschool and Primary Scale of Intelligence  -  Revised (WPPSI-R) as has been detailed  (Kaufman, 1990; Sattler, 1991).  There is also little question that  the format of a test or questionnaire influences a subject’s or interviewee’s response pattern: “Good design is crucial for producing a reliable and valid questionnaire.  Respondents feel less  intimidated by a questionnaire which has a clear layout and is easy to understand, and take their task of completing the questionnaire more seriously” (Rust & Golambok, 1989, p. 155).  In the same way that testing subjects or interviewees are influenced by the format of a questionnaire, it is also likely that the format and layout of a test protocol influences the response pattern and scoring accuracy of an examiner.  A test revision, then, with its  corresponding revision of the test protocol, will see changes in the response pattern of examiners, which may include a change in the rate and pattern of examiners’ clerical errors.  10  Protocol Changes Changes to the WISC-R protocol had been strongly recommended in order to reduce the rate of examiner’s clerical error: The present investigators believe that a revision in the WISC-R Record Form must be considered by the test’s publisher, the Psychological Corporation.  Such a  revised record form might include directions to the psychologist to verify and recheck all response scoring,  mathematical,  administrative,  lQ,  and  birth  computations (Levenson et al., 1988, p. 663). Although an alternate and expanded protocol was subsequently published for the WISC-R, it did not follow any of the suggestions made by Levenson et al., but simply provided more space for recording clinical observations.  The publishers of a test undergoing revision, such as the WISC R to the WISC-lll, have an opportunity to improve the format of the test’s protocol, an opportunity to make it less prone to examiner errors.  Although the test’s publisher does not explicitly claim that  the Record Form or test protocol has been revised in order to lower the rate of examiner error, it does claim in the manual (1991) that “changes from the WISC-R  ...  have made it easier for the examiner to  use” (p.iv), and that “A final goal of WISC-lll development was improvement of  ...  administration  “  (p. 12).  Unfortunately, no  empirical evidence is cited to back these claims.  Although perhaps  true for administration of the test itself, the WISC-lll test protocol  11  does not appear to be easier to use and less prone to examiner error. 
In fact, on the surface, the protocol appears to be more complicated and less clear, especially in the section where raw scores are to be converted to scaled scores.  Questions and Hypotheses What then, are the changes brought about in examiner error by the revision of the WISC-R and its protocol? WISC-lll test and  To what extent is the  protocol less prone to examiner clerical error,  and is this shown by a lowering of the rate of examiner error?  How  does the examiner’s experience, or lack of experience, on a new test -  the WISC-lll  -  affect the rate of clerical errors made?  Most of the changes on the WISC-lll test protocol appear to be changes in layout only, and are not seen to have been made for the purpose of lowering examiners’ clerical errors.  Hypothesis #1: The rate of practitioners’ clerical error on completed and scored WISC-IlI protocols will be equal to their rate of error on  the WISC-R.  Little research has been completed in the area of the effects of a practitioner’s experience on error rate.  Past studies show that  graduate students’ error rates decrease with experience.  This may  not be true, however, for practitioners who approach testing in a different manner.  And although one would intuitively expect an  examiner’s errors to decrease with experience, it is possible that as  12  an examiner becomes more comfortable with using a test, the clerical error rate of that examiner increases, as carelessness sets in.  Psychologists with PhDs have been shown to have a higher rate  of examiner error than Masters-level psychologists (Levenson et al., 1988; Ryan, Prifitera & Powers, 1983).  Hypothesis #2: Practitioners’ clerical error rates will not change significantly with increasing experience on the WISC-III.  Scope of Study This study investigated the nature and rate of examiners’ clerical errors on the WISC-R and WISC-lll.  Comparisons of error  patterns between the two tests were made, and the relationship between examiner experience on the WISC-lll and rate of clerical error was studied.  Examiner error, for the purpose of this study, consisted of clerical errors as defined by Sherrets et al. (1979): addition of subtest raw and scaled scores, computation of chronological age, and transformation from tables.  The data for this study were drawn  from analysis of actual completed test protocols from school psychologists in a large suburban school district.  Because the data  were drawn from practitioners’ completed protocols, the results are likely to be representative of present practice in the field.  13  Information derived from this study will enable users of psychoeducational tests to have a better understanding of their own potential clerical errors on the WISC-lII, and may result in improved practice in the area of re-training of professionals for test revisions, which in turn would result in the increased reliability of the tests.  Improvements in testing practices in educational settings  should result in the improvement of the accuracy and quality of planning and placement decisions.  14  Chapter II Literature Review  Introduction Although a look through the manuals of standardized tests might suggest that examiner error is not a prevalent problem, or at least that it does not have any serious impact on a test’s reliability, many researchers have found substantial evidence that suggests that errors made by examiners are an important source of error, not covered by merely reporting the SEM of a test. 
These studies regarding the nature and rate of examiner error have utilized many different designs, used both graduate students and practitioners of various stripes, investigated errors made on actual completed protocols and on fabricated protocols, and have looked at a variety of error types on a collection of different tests.  Studies with Graduate Students The majority of the previous studies have investigated the errors made by graduate students enrolled in test-administration courses.  Obtaining raw data from graduate students is frequently  easier than from practitioners who may not volunteer as readily to take part in a study which reflects on their professional skills. Many of the earliest studies followed this route.  Miller, Chansky and  Gredler (1970) had 32 Clinical and School Psychology graduate students score a single completed WISC protocol and found the resulting Verbal and Performance scale scores to vary by as much as 21 points and the Full Scale lQs by as much as 17 points.  15  Warren and Brown (1972) examined the first three and last three completed protocols of 40 student examiners in a graduatelevel psychology program.  Of the 240 protocols, there were 89  cases in which Full Scale lQs were changed (according to the authors, the number would have been much higher had not some errors which added points not cancelled out others which deducted points).  On the WISC protocols, IQ changes ranged from 1 to 16  points, with only 2 of the 240 protocols error-free.  The mean IQ  change after correction was 3.32 points for the WISC, and 4.77 points for the Binet.  There was no significant change in error rate  from the beginning to the end of the training period.  From the above  studies, it is clear that scoring and administration errors are common among graduate students.  Without question, data are more easily obtained from graduate students than from practicing professionals.  However, the  generalizability of the results from studies involving graduate students to practitioners working in the field is questionable.  In  training programs, at least in the initial phases, the focus of students and instructors is, naturally, on the actual administration and scoring of the test.  In the field, however, the focus is more  likely to be on the results of the test, and competent administration and scoring of the test is assumed.  This is not to say that studies  which look at novice examiners are not valuable  -  several studies  using student examiners have had important implications for the training of test users in graduate psychology programs, training  16  which, currently, is widely discrepant from one institution to another.  Training Programs Conner and Woodall (1983) examined students’ errors in relation to a training program that provided the students with written and verbal feedback on types and frequencies of errors. They concluded that: Response Scoring, Mathematical, and IQ Errors remain a source  of variance,  validity of the WISC-R. this  area to  determine  affecting  the  reliability  and  Further research is needed in if there  are  any  effective  teaching methodologies that can significantly decrease the number of Response Scoring errors (p. 378).  Slate and Jones (1989) challenged the conventionally-held wisdom that testing courses should consist mostly of opportunities to practice test administrations, and that instructional time should be secondary: “In this study, students’ performance improved only as a result of detailed classroom instruction. 
during practice administrations” (p. 409).  ...  Neither group improved  In a subsequent study the  investigators stated that “Five practice administrations were not sufficient to improve student accuracy in scoring and 10 practice administrations improved student accuracy only slightly.  Even on  the 10th administration error rates were high” (Slate & Jones, 1990c, p. 139).  17  It seems, then, that traditional training programs often fail to prepare graduate students to administer and score standardized tests in a competent manner.  Russ (1978) outlined the causes for  the gap between academic training in psychological assessment and the comparatively higher standards established by supervisors in internship settings.  Other studies have offered more detailed and  structured training programs or methods for systematic observations of graduate student assessment behavior (Strein, 1984).  Blakey, Fantuzzo, Gorsuch and Moon (1987) found that a peer-  mediated, competency-based training package, which uses student peers to give detailed feedback to examiners, significantly lowered the error rate for student examiners.  Conrad (1991) constructed a  comprehensive performance checklist, the Criteria for Competent WISC-lll Administration  -  Revised (CCW3A-R) to be used to aid  students in locating and avoiding their most frequent errors.  It is clear that useful information result in improved training programs the errors of student-examiners.  -  -  information that may  may be derived from studying  However, the population which is  of most concern is not graduate students, but professionals who use standardized tests in order to gain greater understanding and to make decisions.  Unscored Protocols A number of studies have looked at the issue of examiner error by having psychologists and graduate students score test protocols which have been filled in but not scored.  The intent of this type of  18  study is that the examiner will display typical behavior in scoring the protocol.  In a study by Perot (1992) investigating the cognitive  processes behind scoring decisions, eight school psychologists scored the Verbal Scale subtests from a completed but unscored WISC-R protocol.  It was found that the psychologists produced  Verbal IQ scores which varied by as much as 11 points.  Ryan, Prifitera and Powers (1983) used practicing psychologists as well as graduate students in their study of examiner error on the Wechsler Adult Intelligence Scale (WAIS-R).  -  Revised  They had the 39 examiners score two protocols and found  that the standard deviations of the Full Scale IQ scores averaged about 1 .95.  In comparison, the WAIS-R manual reports a SEM of 2.53  for Full Scale IQ (On the WISC-R, the reported Full Scale SEM is 3.19; on the WISC-lll, 3.2). In an earlier study, Miller and Chansky (1972) sent a single completed WISC protocol to 200 doctoral-level school or clinical psychologists.  From the 64 returned protocols it  was found that Verbal and Performance IQ’s varied by as much as 23 points and Full Scale lQs by as much as 17 points.  It must be questioned, however, just how seriously the responding sample treated their task in the study.  In a follow-up  phase of the study, Miller and Chansky sent a letter in which they offered to discuss with the the responding psychologists the types of errors that they had made.  Out of the 64 volunteers, only 1  psychologist responded to the offer.  
Miller and Chansky concluded  that “If the apathy discovered in this brief follow-up is  19  representative of the field at large, we can assume that important decisions will continue to be made on the basis of inappropriately scored IQ measures” (p. 345).  The difficulty in knowing the  representativeness of the sample to the population is raised by Miller and Chansky themselves: it is unknown if the results are representative of the field at large when a volunteer sample of 32% (64 out of 200) responded to the study, and 1.6% (1 out of 64) responded to the follow-up offer.  Another difficulty with this type  of study is the representativeness of the to-be-scored protocol. When only one protocol is used, it is not necessarily a typical protocol, but may be much easier or much more difficult to score than most.  In order to overcome the problem of non-representative single protocols, some researchers have several scorers score two or more protocols either varying in examinee ability level or in the expected ease and difficulty of scoring the protocol.  Whereas one protocol  might not well-represent the population of protocols, two or three can better cover the range of ease or difficulty of scoring, or examinee ability.  Behind this type of study is the idea that  examinee differences reflected in the protocol may result in differences in error rate for examiners.  Oakland, Lee, and Axelrad (1975) sent three completed but unscored actual (as opposed to fabricated) WISC protocols reflecting three different levels of intellectual abilities (below average, average and above average) to 400 APA-registered school  20  psychologists.  Of the 400 randomly selected psychologists, 94  (23.5%) volunteered to score the tests and send them back to the investigators.  From the 94 responding psychologists, data were  collected in the form of means and standard deviations for subtest scale scores and Verbal, Performance, and Full Scale lQs for the three protocols.  Oakland et al. found no significant correlation  between examiner error rate and examinee’s level of intellectual ability.  As in the Oakland et al. (1975) study, the standard deviation of scores is commonly used as an index of interrater agreement in studies comparing examiners scoring on one or two protocols: the higher the standard deviation, the lower the rate of interrater reliability.  However, it must be remembered that, when comparing  the SD to the SEM, two different sources of error are being compared.  The SEM  when measured through split-half reliability  coefficients reflects error attributable only to item or content sampling, whereas the SD reflects error variance attributable to examiner differences.  In other words, the SD of the differences  between examiners should be viewed as an additional source of error.  In another study involving multiple protocols sent to a random sample of psychologists, Bradley, Hanna, and Lucas (1980) attempted to investigate the differences in rate of examiner error between “hard” (which responses were deliberately ambiguous) and “easy” (which responses were deliberately clear) WISC-R protocols.  The  21  widely differing protocols were used to ensure that the results of the study would cover the range of the population of protocols.  Of  the 280 randomly-sampled psychologists, 63 volunteered to participate in the study, for a 27% return rate. 
interrater agreement used  -  The measure of  the standard deviation  -  varied from 2.9  for the easier protocol and 4.3 for the more difficult one (compared to the published SEM of 3.19 in the test manual).  They concluded  that the lack of scoring accuracy was due to a shortage of scoring criteria for the more objective subtests and was cause for increased attention and further study.  This genre of study does provide some beneficial information for users of standardized tests, but weaknesses inherent in the research design limit the generalizability of the results. task of scoring a completed protocol is contrived.  First, the  In actual practice  the administration and scoring are inextricably linked.  Second, it is  difficult to determine how much attention the practitioner pays to the task of scoring (usually anonymously) a protocol for a research study.  Third, the representativeness of the sample to the population  is impossible to establish when using a volunteer sample.  None of  the studies using volunteer samples had a very high return rate of protocols.  Finally, because this type of study is limited in the  number of protocols used, the generalizability of one or two (usually fabricated) protocols to the population of actual protocols is again impossible to establish.  22  Completed Protocols One way to avoid the problems inherent in having volunteers score already filled-in protocols is to investigate the frequency and nature of examiner error from a sample of completed (and scored) protocols.  In a review of the literature on examiner error, Slate and  Hunnicutt (1988) state: “Future research needs to address the frequency, type, and specificity of errors on actual protocols.  Are  practicing professionals making scoring errors on test protocols in the field?” (p. 286). answer this question.  Few studies to date, however, have been able to The problem of gaining access to completed  and scored protocols has prevented most researchers from investigating the error rate of practicing professionals.  Only a  small number of investigations of this type have been conducted.  In one study, impressive because of the size of the sample, Sherrets et al. (1979), obtained two random samples of 100 WISC and WISC-R protocols drawn from the files of a psychiatric facility and from the files of the student services department of a large school system.  They found that nearly 89% of the 39 examiners had  made at least one error and that 46.5% of the protocols contained at least one error.  Although most of the errors found did not change  the diagnostic classification of the child, the largest errors found would have raised the Full Scale IQ by 9 points or lowered it by 7 points.  No significant difference between the error rates on  protocols from the two facilities or difference in levels of training was found.  23  Although Sherrets et al. (1979) used a sample considerably larger than in most other studies of its kind, they neglected to investigate two key areas.  No comparisons were made of error rates  between the two tests (WISC and WISC-R), something which might have shed some light on the contribution of the protocol layout to examiner error.  The researchers also failed to investigate the  relationship between examiner error rate and experience.  Slate et al. (1992) used data from nine school psychologists and found that these practitioners made errors on all 56 of the protocols drawn from a metropolitan school district.  
In this  particular study, though, the term “error” was used rather liberally, and included failure to record response, failure to circle scores, and failure to record times.  These may be technically considered errors  because they depart from the prescribed manner of test administration, but they cannot change the score of an examinee.  Peterson, Steger, Slate, Jones, and Coulter (1991) investigated examiner error on the Wide Range Achievement Test (WRAT-R).  -  Revised  They found that 95% of their sample of 55 protocols  contained errors, with an average of three errors per protocol.  They  concluded that “Practitioners make a significant number of errors that adversely affect standard scores” (p. 208). examined a similar number of protocols examiner error on the WISC-R.  -  50  -  Wagoner (1988) in her study of  She found a total of 8.4 total errors  per protocol, with 34% of the Full Scale lQs altered as a result of these errors.  24  Definitions of Error Most investigators use their own definition of error depending on the purpose of their study.  On a standardized test, three broad  areas of examiner error can be categorized.  First, administration  error, that is, error that occurs when the examiner strays from the prescribed standardized procedure in giving the test.  The  importance of adhering to standardized procedure is well-stated by Buckhalt (1990): of  One  our  explicit  objectives  in  testing  is  to  standardize the test situation so that variations in test performance can be assumed to be a function of true differences in the examinees’ abilities rather than of any extraneous contextual variables (p. 478).  Studies which investigate administration errors are generally restricted to using small samples because of the difficulty involved in actually observing test administrations.  Buckhalt’s study (1990)  used the results derived from just 14 test administrations from practitioners, and 12 from graduate students.  Although the  information drawn from this type of study may help direct future research, its generalizability is questionable due to the small sample size.  The second category of examiner error is scoring error: error that occurs when the examiner scores the responses of the examinee.  Three WISC subtests (i.e., Vocabulary, Comprehension,  25  Similarities) frequently result in ambiguous responses and have been the focus of some study.  Brannigan, (1975) suggested revisions  for the WISC and WAIS: “more thorough scoring standards are needed for the Wechsler Intelligence Scales” and, “It would also be desirable to revise those test items which lend themselves to ambiguous replies” (p. 314).  Many other studies have also  investigated the problem of scoring errors and found them to be one factor contributing to examiner error on standardized tests (e.g., Ryan et al., 1983; Slate & Jones, 1990a; Warren & Brown, 1972).  A major difficulty, though, with studies investigating response-scoring errors is that psychologists and other examiners frequently do not record verbatim the responses of the examinee. Slate et at. (1992) found that school psychologists frequently departed from the prescribed standardized administration procedure. In their study they found that the school psychologists failed to record responses verbatim 30 times per protocol!  
Without the  benefit of direct observation of the test administration, it is clearly difficult to determine if the examiner is scoring the response recorded on the test protocol or the actual (but not recorded in its entirety) response of the examinee.  Studies which purport to  examine the rate of scoring errors by looking at completed protocols may end up with misleading conclusions.  A final category of examiner error on standardized tests is clerical error; that is, errors caused by examiner carelessness, and perhaps exacerbated by complicated or poorly laid-out test record  26  forms.  Sherrets et al. (1979) found that nearly 89% of the  examiners in their study made clerical errors, and that 46.5 % of the protocols had errors.  They recommended “Greater caution is called  for in the clerical computations, but especially in the simple addition of the scaled scores, where the majority of errors appeared in the present samples” (p. 496).  Similarly, Levenson et al. (1988)  found in their study that 57% of school psychologists and interns made clerical errors, leading the authors to call for improved training in graduate programs and a revision of the WISC-R protocol in order to reduce this type of error.  Slate and Hunnicutt (1988) claimed that “Mechanical and clerical errors seem to occur frequently and detract from the accuracy of Wechsler scores” (p. 284).  The problem is of great  enough magnitude for them to recommend that clerks could be employed to check the areas on the protocol where clerical errors are most prevalent.  Clearly, clerical errors are a major problem on  standardized tests, and more information about them is called for in order to minimize their impact in weakening test reliability.  Test Revisions The revision of a psychological test may bring about changes in the examiner’s response pattern and also changes in the rate of examiner’s clerical errors. Children  -  The Wechsler Intelligence Scale for  Third Edition (WISC-lll) is a revision of the Wechsler  Intelligence Scale for Children  -  Revised (WISC-R).  Because it is  relatively new (1991), few studies have been published which  27  examine differences between the two tests, especially in the area of susceptibility to examiner errors.  Little (1992) compared the two  tests and found few major changes apart from the appearance of actual test materials.  The record form or protocol has been re  designed with some changes, for example, there is now more space for writing behavioral observations.  Of greatest concern for Little (1992) is the changed process for converting raw scores to standard scores.  The subtests on the  record form are listed in their order of administration, while on the conversion table, they are split with the Verbal subtests in one location, and the Performance subtests in another.  Little predicts  that “This different format of presentation is sure to be confusing and cause recording errors” (p. 151).  Examiner Characteristics The interaction between examiner characteristics and examiner error has been the subject of several studies.  In one of the  earlier studies, Cieutat (1965) evaluated the effect of examiner differences on Stanford-Binet scores.  Although he found no  significant effect for sex of examinee, sex of examiner was significant, with the female examiners eliciting significantly higher IQ scores than the male examiners.  Interaction between sex of  examiner and examinee was “of marginal significance” (p. 318).  
The  results of this study can be called into question, though, because of difficulties with the interpretation of the data analysis.  Because of  28  unequal ns in the analysis of variance cells, Cieutat issued the caveat that “tests of significance are only approximate” (p. 318).  Levenson et al. (1988) also looked at the effects of sex of examiner.  In their study, sex of examiner was not a significant  factor, but education of examiner did account for a significant portion of the differences in error rates.  They found that doctoral  school psychologists made more errors than both school psychology interns and Master’s-level school psychologists.  In a study comparing PhD psychologists and graduate students, Ryan et al. (1983) found that the PhD psychologists committed a significantly higher rate of scoring errors than the graduate students on the Performance scale of the WAIS although there were no significant differences between the two groups on any of the subtest or lQ score means.  Unfortunately, the results from this  study are weakened from the earlier-mentioned basic design flaw in which examiners find themselves in the contrived situation of scoring completed but unscored protocols.  Bird (1992) attempted to find which examiner characteristics were most closely linked to rate of clerical error on the WISC-R.  In  her study, the percentage of protocols with Full Scale lQ changes ranged from 4% to 24% in the three samples of protocols (50 in each sample) she selected.  She concluded that the issue of time pressure  appeared to be more related to the error rate than did any variability in training.  29  The effect of examiner experience has been found to be worthy of some study, but most of the research has dealt exclusively with graduate students and not with practitioners.  Conner and Woodall  (1983) found that student examiners showed a significant decrease in administrative and scoring errors with experience in administering the WISC-R.  Clerical errors, however, did not  decrease, leading the authors to call for further research in this area.  Slate and Jones (1990b) found that error rates decreased after  8 administrations of the WAIS-R.  Gilbert (1992) also found student  error rates on the WAIS-R decreasing from the beginning to the end of a course on test administration.  Studies involving graduate students are relevant for those organizing testing courses for graduate programs.  It does not  necessarily follow, however, that the same trend holds true for practitioners who are not typically in the process of learning how to use a test.  One of the omissions in studies involving practitioners  is the inclusion of examiner experience as an independent variable in the analysis of the data.  Whereas sex and education of examiner  have been studied, experience has rarely been investigated.  The  result of this gap in the research is that little is known regarding the effect that experience plays in practitioners’ error rates. Research into the impact that experience has on the error rates of practicing professionals is in short supply but plainly needed.  30  Assessment Practices A great deal of a clinician’s time is spent in assessment.  A  study by Anderson, Cancelli, and Kratochwill (1984) revealed that 44% of the school psychologist respondents to their questionnaire spent between 41% and 80% in testing.  During this time spent  testing, a plethora of different tests are used.  
Studies investigating  examiner error have been conducted on many of these tests, including the Stanford-Binet (e.g., Cieutat, 1965; Warren & Brown, 1972), the Kaufman Assessment Battery for Children (Hunnicutt, Slate, Gamble, & Wheeler, 1990) and, the Wide Range Achievement Test  -  Revised (Petersen et al., 1991) with all finding significant  error attributable to examiner error.  But the most widely-used tests among practitioners (Lubin, Larsen, & Matarazzo, 1984), and the tests on which most of the work investigating examiner error has been done, are the Wechsler scales. On these scales, studies of examiner error have largely focussed on the WISC-R and the WAIS and WAIS-R, but few studies have yet to be published on the WISC-R’s revision, the WISC-Ill.  Because the WISC  Ill is a frequently-used instrument, and because little is available at this time regarding rate of examiner error, any study which attempts to investigate this instrument  would be valuable.  Summary Much research has been completed in the area of examiner error and standardized tests.  Many of these studies have analyzed  the testing practices of graduate students, and have found that their  31  examiner errors are a significant source of variation in test scores. Although data from graduate students are more readily available and accessible than data from practicing professionals, and although the results may be useful to improve graduate training programs, the data produce results that are not easily generalizable to professionals working in the field.  Another way of avoiding the difficult task of obtaining data from actual, completed test records is to have a number of subjects score a completed, but unscored test protocol.  Researchers that  have utilized this design typically send out one or two fabricated protocols to a random sample of professionals.  The returned and  scored protocols form the data-base for the subsequent analysis. Unfortunately, the sample loses its randomness, and hence its representativeness, when a typically low proportion of professionals choose to participate in the study.  The task of  scoring, but not administering a test is an exercise that is contrived, and not representative of practice in the field.  Investigators have looked at a number of different error-types, including administrative error, scoring error, and clerical error. Administrative error is studied infrequently because of the difficulty and time involved in observing test administrations. Scoring error is difficult to substantiate, because examiners very frequently do not record responses verbatim.  A look at completed  protocols, therefore, does not always give a clear indication of the actual responses of examinees.  The third category of error, which  32  can be accurately measured from examinations of test protocols, is clerical error.  Clerical error has been found to be a factor in the  protocols of most practitioners (e.g., Levenson et al., 1988; Sherrets et a!., 1979).  The magnitude of examiner error varies from study to study, and is measured in different ways depending on the design of the study and on the definition of examiner error.  Investigations using  multiple scorers of single protocols generally report the standard deviation of scores.  For example, the study by Bradley et al. (1980)  on the WISC-R, reports the standard deviation ranging from 2.9 for the easier-to-score protocol, to 4.3 for the harder protocol.  
In studies investigating errors in completed and scored protocols, error rates have been reported as change in Full Scale IQ score (as high as 17 points for Full Scale IQ in Miller et al.'s study [1970]). Proportion of errors among examiners and protocols has also been used, as in the study by Sherrets et al. on the rate of clerical errors on the WISC and WISC-R. The investigators found that 89% of examiners made clerical errors, and that 46.5% of the protocols contained this type of error. If the definition of error is broadened to include scoring and administrative errors, almost all protocols contain errors (e.g., Franklin, Stillman, & Burpeau, 1982; Slate & Jones, 1990a; Wagoner, 1988).

Various examiner characteristics have been investigated in relation to rate of examiner error. Sex of examiner appears to have little impact on error rate (Levenson et al., 1988). Education of examiner (Ph.D. vs. Master's degree) may make a difference, with studies showing doctoral-level psychologists making more errors than master's-level psychologists (Levenson et al., 1988; Ryan et al., 1983).

Although examiner experience appears to be related to the errors graduate students make on standardized tests, little research has been completed using practitioners such as clinical or school psychologists, for whom assessment is a major part of their day-to-day practices. The value of studies involving practitioners' errors would be enhanced if the experience of the examiner were included as an independent variable in the analysis of the data. More research is clearly called for to investigate the relation between practitioners' experience and clerical errors. Test revisions, with accompanying changes in test protocols, may also result in a change in examiners' clerical errors. Research comparing practitioners' error rates between test versions would be widely welcomed.

Chapter III
Methodology

The purpose of this ex post facto study was to investigate how the nature and rate of examiner error change from the Wechsler Intelligence Scale for Children - Revised to the Wechsler Intelligence Scale for Children - Third Edition. The two central questions addressed in this study were how the examiner clerical error rate changed as a result of the test's revision, which included major alterations to the format of the test record or protocol, and how the pattern or trend of error changed as a result of the examiner's increasing experience on the new test, the WISC-III.

This chapter will first define the sample and population of test protocols as used in this study. Next, the characteristics of the examiners will be examined, followed by the assumptions that need to be made if the results of this study are to be seen as generalizable to a greater population. Finally, the actual procedures and data analysis methods used will be explained.

Sample

The population of WISC-III protocols was made up of the total protocols (approximately 1200) completed by all of the school psychologists in a suburban school district since the beginning of the test's use in 1991. The population of WISC-R protocols consisted of the total number of protocols (approximately 1800) completed by the school psychologists in the district in the last three years of its use. Protocols from before this time period were not available.
The sample consisted of a total of 252 test protocols: 18 WISC-R and 18 WISC-III protocols randomly drawn from the files of each of seven school psychologists practicing in a large suburban community near a large Western Canadian city, chosen because of the convenience of access to files for the researcher. The community showed demographic similarity with the province in several key areas. The community's average annual income, educational level of the population, and ethnic composition were very similar to the corresponding provincial indicators (Ministry of Education, 1993). Also similar were the student/teacher ratio and teacher experience and certification.

Of the total number of school psychologists in the school district, seven fit the criteria for this study: a) had administered both the WISC-R and WISC-III, and b) still had access to older (i.e., WISC-R) protocols. In some cases, older WISC-R protocols had been destroyed because of lack of storage space.

The sample size - that is, the number of protocols from each examiner - was determined through comparison to the number of protocols used in previous studies of a similar nature. For example, Slate et al. (1992) collected 56 WISC-R protocols from nine examiners in a large metropolitan school district. Wagoner (1988) collected 50 protocols from 6 examiners for her study on the WISC-R. In a study of examiner errors on the WRAT-R, Peterson et al. (1991) looked at 55 protocols from 9 examiners. The study found to have the largest sample was conducted by Sherrets et al. (1979), in which 100 protocols from each of two institutions were examined. The present study's total of 126 protocols for each test was deemed sufficiently large in comparison with the sample sizes used in previous studies of a similar nature.

Characteristics of Examiners

All of the examiners had Master's degrees in either school or clinical psychology from four different graduate training programs in Canada and the United States. Their years of experience (see Table 1) ranged from 3 to 27 years (Mean = 9.6, S.D. = 8.6). All of the participants were trained to administer the WISC-R, but received no formal training on its revision, the WISC-III. The number of administrations per year varied with each school psychologist and with the particular year, with the mean number of WISC-R or WISC-III administrations ranging from 50 to over 70 per examiner per school year.

Table 1
Examiners' Years of Experience

Examiner               1    2    3    4    5    6    7
Years of Experience   15   27    4    6    5    3    7

Note. Years of experience consists of the number of years employed in a position which requires the administration of standardized tests.

The protocols from each school psychologist were drawn in the following manner: the WISC-R protocols were randomly drawn from the last three years of its administration, and the WISC-III protocols from the first year-and-a-half of its administration. The sample, therefore, consisted of 18 protocols for the WISC-R and 18 protocols for the WISC-III from each examiner, drawn from a pool ranging from 150 to 210 total protocols for the WISC-R, and 75 to 105 total protocols for the WISC-III, for each examiner. In order to examine the trend or pattern of error over time on the WISC-III, the 18 months of administration were further divided into half-year segments, and six protocols were drawn from each of the three half-year periods.
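That half-year, every-kth-file draw can be made concrete with a short sketch. The following Python fragment is purely illustrative: the thesis contains no code, and the period labels, per-period file counts, and the draw_systematic helper are hypothetical, but the logic mirrors the systematic draw described under Assumptions below (the files available in a period divided by the six required, then every 4th or 5th file taken).

    import random

    def draw_systematic(protocols, n_needed):
        """Draw n_needed protocols from a chronologically ordered list by taking
        every k-th file, where k = len(protocols) // n_needed."""
        if len(protocols) < n_needed:
            raise ValueError("not enough protocols to draw from")
        k = len(protocols) // n_needed       # e.g. 25 files / 6 needed -> every 4th file
        start = random.randrange(k)          # random starting point within the first interval
        return [protocols[start + i * k] for i in range(n_needed)]

    # Hypothetical WISC-III files for one examiner: three half-year periods,
    # six protocols drawn from each (18 per examiner in total).
    examiner_files = {
        "period_1": [f"wisc3_{i:03d}" for i in range(25)],
        "period_2": [f"wisc3_{i:03d}" for i in range(25, 55)],
        "period_3": [f"wisc3_{i:03d}" for i in range(55, 90)],
    }

    sample = {period: draw_systematic(files, 6) for period, files in examiner_files.items()}
    for period, drawn in sample.items():
        print(period, drawn)

A similar draw over each examiner's last three years of WISC-R files would yield the 18 WISC-R protocols per examiner described above.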
Characteristics of Examinees
Table 2 displays the characteristics of the examinees from whom the protocols were drawn in this study. Note that the mean and median IQs reflect a special education population, i.e., a population of children who have been referred for assessment or service because of questions regarding their performance in an academic setting.

Table 2
Characteristics of Examinees

                   WISC-R         WISC-III
1. Age Range       6-2 to 16-8    6-3 to 16-7
2. Median Age      9-10           10-6
3. IQ Range        46 to 126      50 to 115
4. Mean IQ         89             85
5. Median IQ       89             86

Note. Total of 126 protocols per test. Also, "6-2" is read as 6 years and 2 months.

Assumptions
Several assumptions need to be made if the results of this study are to be generalizable to the population of protocols in the school district and in the province. It is assumed that the sample protocols drawn from the school district files are representative of the total number of protocols in the files, and of the population of protocols in the province. Because the protocols were drawn randomly (on the WISC-III, the total number of a particular examiner's protocols for each half-year period was divided by the number needed to draw 6 files, and then every 4th or 5th protocol was drawn), this assumption can be supported. The characteristics of the examinees and the demographics of the community of the examinees, as outlined previously, are not atypical for the province, suggesting that the examinees, and therefore the protocols, did not vary demographically from other protocols in the province.

It is assumed that the testing practices of the seven school psychologists who completed the test protocols are similar to the practices of other school psychologists using the tests. The graduate training and years of experience of the seven are varied enough to support this assumption. Previous studies have shown that Master's-level psychologists have a lower examiner error rate than Doctoral-level psychologists (e.g., Levenson et al., 1988; Ryan et al., 1983). Thus, because only Master's-level psychologists were included in this study, the results may be an under-estimate of the actual rate of error among all practitioners.

It is not assumed that the examiners' training is the same for each of the two tests. It is acknowledged that the examiners received formal graduate-level training on the WISC-R, but no direct formal training on the WISC-III. In a study by Perot (1992), no significant difference in number of scoring errors was found between those with formal training on the WISC-R and those without. Most practitioners in the field are not formally re-trained on revisions of tests, and the conditions in this study are likely representative of the conditions in the population of examiners in the province. In any case, the results of this study, and any conclusions drawn regarding the ease of use of the WISC-III protocol, must be seen in the light of the fact that any training on the second test (the WISC-III) was not as comprehensive as the training on the first test (the WISC-R).

Procedure
To protect the identity of the examiners, an associate photocopied the 252 completed protocols with the names of tester and test subject deleted, ensuring anonymity of the examiner and confidentiality of the examinee and the actual test results. The examiner was identified by a code number only; the examinee was not identifiable in any way.
The protocols were then analyzed, item by item, for clerical errors using an error checklist and data compilation sheet. Error types tabulated were clerical in nature (Levenson et al., 1988; Sherrets et al., 1979) and consisted of: addition of subtest raw and scaled scores, computation of chronological age, transformation from tables of raw scores to scaled scores, and transformation of scaled scores to IQ scores. Because the WISC-III is relatively new (published 1991) and has been used for a relatively brief time, no comparison between error rates over an extended time period was possible.

Analysis
The analysis of the data consisted of a tally of overall clerical errors on each test, as well as a tally of "serious" clerical errors, that is, errors causing Full Scale IQ changes, as has previously been done (e.g., Bird, 1992). On each test, this number of errors was shown as a proportion of total protocols, along with the mean, range, and standard deviation of errors. The number of errors made was also correlated, on both tests, with the examiners' years of experience in administering standardized tests.

The errors made were divided into their respective categories: addition of raw scores, transformation of raw to scaled scores, addition of scaled scores, transformation of scaled to IQ scores, and chronological-age computation. Comparisons between the two tests were made using t-tests for paired or dependent observations for each error type, or category, as well as for overall and "serious", or Full Scale IQ-changing, errors. Location of error (type #1 only; the other error categories cannot logically be given a subtest location) on each test was examined.

On the WISC-III, the rate of clerical errors was examined over the three six-month time periods. Both overall and serious (i.e., IQ-changing) errors were examined. A repeated-measures 2-factor ANOVA was used to compare the number of errors over the three time periods. When significant differences were found, Tukey's multiple comparison technique was used to locate these significant differences.

Chapter IV
Results

This chapter will describe the results of the study outlined in Chapter III. The nature and range of clerical errors on the WISC-R and WISC-III will be described, with the frequency of error per examiner given, as well as error location and category. Error frequencies between the two tests will be compared, and the correlation between examiner experience and number of errors will be examined. On the WISC-III only, the number of overall and FSIQ-changing errors will be plotted over three six-month time periods, with comparisons made between the number of errors in the three time periods.

Errors and IQ Changes
Table 3 (see also Figure 1) displays the proportion of clerical errors on both tests and the resulting IQ changes. The WISC-III had a higher number of total errors, Verbal IQ (VIQ), Performance IQ (PIQ), and Full Scale IQ (FSIQ) changes than did the WISC-R. Although 38% and 42% of the WISC-R and WISC-III protocols, respectively, contained clerical errors, not all of these clerical errors resulted in a VIQ, PIQ, or FSIQ change. Nevertheless, almost one-quarter (23%) of the completed WISC-III protocols reported incorrect Full Scale IQ scores. On the WISC-R protocols with FSIQ changes, the mean FSIQ change was 2.3 points, with a range of 1-8 points.
The mean FSIQ change on the WISC-III protocols with errors was 1.8 points, with a range of 1-7 points.

Table 3
Number and (Proportion) of Protocols with Errors by Test

                  WISC-R       WISC-III     DIFFERENCE
Protocols         126          126
Overall Errors    48 (.38)     53 (.42)     5 (.04)
VIQ changes        9 (.07)     12 (.09)     3 (.02)
PIQ changes       15 (.12)     23 (.18)     8 (.06)
FSIQ changes      21 (.17)     29 (.23)     8 (.06)

Note. Proportions of error-type for each test are presented within parentheses. Error proportions do not sum to 1.0: errors may cause a VIQ or a PIQ change and a FSIQ change, or only a VIQ or PIQ change, or no change in scoring at all.

[Figure 1. Number of Protocols with Errors by Test (per 126 protocols): bar graph comparing the WISC-R and WISC-III on total errors and on VIQ, PIQ, and FSIQ changes.]

In comparison, the SEM on the WISC-III is 3.20, but it can be argued that this does not account for clerical errors. Of the errors which led to a change in Full Scale IQs, one error on the WISC-III led to a descriptive category change, from Mildly Mentally Handicapped to Borderline. Both tests showed numerous other changes (e.g., 70-73 on the WISC-R; 71-75 on the WISC-III) which were close enough to category changes to possibly influence placement decisions.

Examiners
Six of the seven examiners made clerical errors (see Table 4; also shown graphically in Figure 2), with one examiner (#5) completing error-free protocols on both tests. On the WISC-R, examiners committed a mean of 6.86 clerical errors per 18 protocols (38%, Range 0-11, S.D. 4.41), while on the WISC-III, a mean of 7.57 errors was found per 18 protocols (42%, Range 0-12, S.D. 3.78).

The relationship between the examiners' years of experience as school psychologists and the number of clerical errors made on each of the tests was examined using a Pearson product-moment correlation coefficient. On the WISC-R, no strong relationship was found between years of experience and errors. On the WISC-III, however, a positive relationship was found between overall errors and experience (r = .54). When years of experience was correlated with Full Scale IQ-changing errors, the correlation coefficient was even higher (r = .60). These correlation coefficients were not statistically significant, but, based on this sample alone, the pattern suggests that the correlation likely would have been significant if the n had been larger.

Table 4
Number of Total Errors by Test per Examiner

Examiner                WISC-R    WISC-III    DIFFERENCE
1                         11         9           -2
2                          8        12            4
3                          6         7            1
4                         10         8           -2
5                          0         0            0
6                         11        10           -1
7                          2         7            5
Mean errors per test      6.9       7.6          .70

Note. There are 18 protocols per test for each examiner.

[Figure 2. Errors by Examiner (per 18 protocols): bar graph of each examiner's total errors on the WISC-R and WISC-III.]

Location of Errors
The location of errors on each test (see Table 5) showed a definite pattern, with more errors made on the Performance subtests than on the Verbal subtests. The one subtest most prone to error on both tests was Coding (errors on this subtest could be defined as either scoring or clerical errors), with 54% and 38% of total errors. The errors on this subtest, though numerous, frequently did not have a great impact on the subtest scaled score and the resulting Performance IQ and Full Scale IQ. On the WISC-R, three subtests were without any clerical errors: Information, Similarities, and Object Assembly. On the WISC-III, only Information and Arithmetic were error-free.
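As an aside, the WISC-III coefficient reported above can be recomputed directly from the tabled values: years of experience from Table 1 and total WISC-III errors per examiner from Table 4. The short sketch below is illustrative only; it is written in plain Python and is not the statistical software that was actually used for the analysis.

from math import sqrt

def pearson_r(x, y):
    # Pearson product-moment correlation between two equal-length lists.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

experience = [15, 27, 4, 6, 5, 3, 7]       # Table 1: years of experience, examiners 1-7
wisc_iii_errors = [9, 12, 7, 8, 0, 10, 7]  # Table 4: total WISC-III errors per examiner

print(round(pearson_r(experience, wisc_iii_errors), 2))  # 0.54, matching the value reported above

Applying the same computation to the Full Scale IQ-changing error counts per examiner (the row totals of Table 9, presented later in this chapter) gives a coefficient consistent with the reported r = .60.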
Far more category-#1 (addition of raw score) errors were found on the Performance subtests than on the Verbal subtests for both tests. On the WISC-R, 67% of category-#1 errors were found on the Performance subtests; on the WISC-III, 82% of these errors were found on the Performance subtests.

Table 5
Location of Clerical Errors by Test

                        WISC-R    WISC-III    DIFFERENCE
Protocols               126       126
Verbal Subtests           6        14            8
  Information             0         0            0
  Similarities            0         2            2
  Arithmetic              1         0           -1
  Vocabulary              4         5            1
  Comprehension           0         3            3
  Digit Span              1         4            3
Performance Subtests     27        28            1
  Picture Comp.           1         2            1
  Coding                 18        16           -2
  Picture Arr.            1         1            0
  Block Des.              6         1           -5
  Object Ass.             0         6            6
  Mazes                   1         1            0
  Symbol Search          n/a        1           n/a

Error Category
Table 6 (shown graphically in Figure 3) displays the breakdown of clerical error by category for each test. The most common error, by far, for both tests was the addition of raw scores. On the WISC-R, 69% of clerical errors came as a result of incorrect addition of raw scores (error-category #1), while on the WISC-III the proportion was slightly higher: 79% of the clerical errors were a result of incorrect addition of raw scores.

Table 6
Error Categories Across Test

                                            WISC-R      WISC-III    DIFFERENCE
Protocols                                   126         126
1. Addition of Raw Scores                   33 (.69)    42 (.79)     9 (.10)
2. Transformation of Raw to Scaled Scores    6 (.12)     5 (.09)    -1 (.03)
3. Addition of Scaled Scores                 2 (.04)     5 (.09)     3 (.05)
4. Transformation of Scaled to IQ Scores     7 (.15)     1 (.02)    -6 (.13)
Total Errors                                48          53           5

Note. The parentheses show the proportion of error per 126 protocols.

[Figure 3. Clerical Errors by Category: percentage of WISC-R and WISC-III errors in Category 1 (Addition of Raw Scores), Category 2 (Transformation of Raw to Scaled Scores), Category 3 (Addition of Scaled Scores), and Category 4 (Transformation of Scaled to IQ Scores).]

Although addition-of-raw-score errors were most common, other error types were more likely to have a greater impact on IQ scores. Whereas incorrect addition of raw scores affects only one subtest, and then often only minimally, the other error types involve scaled scores and standard scores and are more certain to change the IQ scores.

The single error category showing the most change was error category #4, transformation of scaled to IQ scores, which accounted for 15% of errors on the WISC-R, but only 2% of errors on the WISC-III. No chronological-age computation errors (error-category #5) were found on any protocols for either test.

Comparison between WISC-R and WISC-III
A series of t-tests for paired or dependent observations was run to analyze the difference between means on the two tests for total errors, Full Scale IQ-changing errors, and each error category (see Table 7). No significant difference was found between the means on the two tests for any of these categories.

Table 7
Comparison of Error-types for WISC-R and WISC-III

                              WISC-R             WISC-III
Error Type                    Mean     S.D.      Mean     S.D.      t-test
1. Overall Errors             6.86     4.41      7.57     3.77      0.67
2. FSIQ-changing errors       3.00     2.08      4.14     2.67      1.43
3. Category #1                4.71     2.81      6.00     2.89      2.12
4. Category #2                0.86     1.21      0.71     1.25      1.0
5. Category #3                0.29     0.76      1.14     0.90      1.87
6. Category #4                1.00     1.41      0.14     0.38      1.87

Note. The t-tests used were for paired or correlated observations. Means are the mean number of errors per 18 protocols (n = 7 examiners per test).
Error categories in Table 7: Type #1, Addition of Raw Scores; Type #2, Transformation of Raw to Scaled Scores; Type #3, Addition of Scaled Scores; Type #4, Transformation of Scaled to IQ Scores.

Examiner Experience and the WISC-III
One of the questions asked in this study pertains to the effect that examiner experience on a test has on the rate of clerical error. Table 8 (and Figure 4) displays each examiner's overall errors over three consecutive half-year time periods from the beginning of WISC-III administration. This table shows the mean error rate decreasing over each of the three time periods, from 3.0 errors per six protocols at the beginning of administration to 2.0 errors per six protocols in the third six-month period.

A repeated-measures 2-factor ANOVA was run in order to compare the rate of overall examiner clerical error on the WISC-III in each of the three time periods. No significant effect was found (F(2, 12) = 1.10, MSe = 1.59) for the time/experience factor (the variance for the person factor cannot be calculated because there is only one observation per cell). Thus, there was no significant difference in overall error rates between the three six-month time periods.

Table 9 (and Figure 4) displays examiners' Full Scale IQ-changing errors over the three time periods from the beginning of WISC-III administration. A decrease in IQ-changing errors was noted: from 2.0 per six protocols at the beginning of WISC-III administration, to .57 in the third six-month period of administration.

Table 8
WISC-III Overall Errors Over Time

Examiner    Time 1    Time 2    Time 3
1             4         2         3
2             5         4         3
3             4         1         2
4             3         2         3
5             0         0         0
6             4         4         2
7             1         5         1
M            3.0       2.6       2.0

Note. Errors listed consist of clerical errors per six protocols for each of the three six-month time periods for each examiner.

[Figure 4. WISC-III Mean Errors (per 6 protocols): mean total errors and mean FSIQ-changing errors across the three six-month time periods.]

Table 9
WISC-III FSIQ-Changing Errors Over Time

Examiner    Time 1    Time 2    Time 3
1             3         2         2
2             3         3         1
3             3         1         0
4             1         1         1
5             0         0         0
6             4         2         0
7             0         2         0
M            2.0       1.6        .6

Note. Errors listed consist of Full Scale IQ-changing clerical errors per six protocols for each of the three six-month time periods.

Another repeated-measures ANOVA was run to investigate the differences between means of serious errors - that is, errors which cause changes in Full Scale IQ - for the three time periods. It was found that there was a significant difference (F(2, 12) = 4.06, MSe = 0.93, p < .05) between the three means.

Through further investigation using the Tukey honest significant difference multiple comparisons technique (HSD.05 = 1.37), it was found that the mean for time period 1 was significantly higher than the mean for time period 3. In other words, significantly more Full Scale IQ-changing clerical errors were made at the very beginning of WISC-III administration than after a year to a year-and-a-half of experience.

Summary of Results
Clerical errors were found on 38% and 42% of WISC-R and WISC-III protocols, respectively, with 17% and 23% of protocols having Full Scale IQ changes as a result. The mean FSIQ change on the WISC-R was 2.3 points (range 1-8); on the WISC-III the mean change was 1.8 points, with a range of 1-7 points.
Only one error led to a change in descriptive category, from Mildly Mentally Handicapped to Borderline.

The study's first hypothesis stated that the rate of practitioners' clerical error on completed and scored WISC-III protocols would be equal to their rate on the WISC-R. No significant differences were found between the mean errors on the two tests for overall errors, Full Scale IQ-changing errors, or any of the four error categories examined. A positive, but statistically non-significant, correlation was found between examiners' years of experience and the number of clerical errors made on the WISC-III.

The second hypothesis stated that practitioners' clerical error rates would not change significantly with increasing experience on the WISC-III. The examiners' mean overall error rate was reduced from 3.0 errors per six protocols at the beginning of administration to 2.0 errors per six protocols in the third six-month period. The difference was not statistically significant (F(2, 12) = 1.10). The examiners' Full Scale IQ-changing errors were also compared over the three time periods. At the beginning of test administration, 2.0 errors were made per six protocols; during the third six-month time period, 0.57 errors were made which changed the Full Scale IQs. The difference between these means was significant (F(2, 12) = 4.06, p < .05).

Chapter V
Discussion

The purpose of this study was to investigate the difference in error rates for practitioners between the WISC-R and its revision, the WISC-III. The first hypothesis presented in this study stated that the rate of practitioners' clerical error on completed and scored WISC-III protocols would be equal to their rate of error on the WISC-R. The results of this study show that the WISC-III error rate was not significantly different from the error rate on the WISC-R.

Also of interest was the pattern of error rate on the WISC-III over three time periods, starting from the beginning of its administration. The second hypothesis stated that practitioners' clerical error rates would not change significantly with increasing familiarity (experience) on the WISC-III. The results from this study suggested that whereas overall clerical error rates showed no significant change, Full Scale IQ-changing errors showed a significant decrease over the three time periods; that is, school psychologists made significantly more errors at the beginning of test administration than after 18 months of administration.

Error Rates
Of the 252 total protocols, 40% contained clerical errors, with almost 20% of the total number of protocols having incorrect Full Scale IQs as a result of these errors. Slightly more clerical errors were found on the WISC-III (53, or 42% of the 126 protocols) than on the WISC-R (48, or 38% of the 126 protocols), but the difference was not significant. There was also no significant difference between the two tests for errors that caused changes in Full Scale IQs.

Other studies have reported similar rates of clerical error. For example, Sherrets et al. (1979) reported that 46.5% of the 200 WISC-R protocols examined in their study contained at least one clerical error. Wagoner (1988) found that 34% of the 50 WISC-R protocols contained errors which changed Full Scale IQs, although the errors examined in her study included scoring errors as well as clerical errors, thus inflating the rate of errors in comparison to this study.
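Before continuing the comparison with other studies, it is worth noting that the time-trend result restated above can be checked computationally. The sketch below re-derives the repeated-measures ANOVA on the Full Scale IQ-changing errors from the per-examiner counts in Table 9 (Chapter IV); it is purely illustrative and is not the analysis software used in the study - it simply partitions the sums of squares by hand.

# Full Scale IQ-changing errors per examiner (rows) across the three
# six-month periods (columns), as tabled in Chapter IV (Table 9).
data = [
    [3, 2, 2],  # examiner 1
    [3, 3, 1],  # examiner 2
    [3, 1, 0],  # examiner 3
    [1, 1, 1],  # examiner 4
    [0, 0, 0],  # examiner 5
    [4, 2, 0],  # examiner 6
    [0, 2, 0],  # examiner 7
]
n_subj, n_time = len(data), len(data[0])
grand = sum(sum(row) for row in data) / (n_subj * n_time)

time_means = [sum(row[t] for row in data) / n_subj for t in range(n_time)]
subj_means = [sum(row) / n_time for row in data]

ss_time = n_subj * sum((m - grand) ** 2 for m in time_means)
ss_subj = n_time * sum((m - grand) ** 2 for m in subj_means)
ss_total = sum((x - grand) ** 2 for row in data for x in row)
ss_error = ss_total - ss_time - ss_subj

df_time, df_error = n_time - 1, (n_subj - 1) * (n_time - 1)
f_ratio = (ss_time / df_time) / (ss_error / df_error)

# Prints roughly 4.05 and 0.93; Chapter IV reports F(2, 12) = 4.06, MSe = 0.93.
print(round(f_ratio, 2), round(ss_error / df_error, 2))

With an MSe of about 0.93 and seven examiners per period, Tukey's HSD at the .05 level works out to approximately 1.37 (using a tabled studentized-range value of roughly 3.77 for three means and 12 error degrees of freedom), which matches the HSD.05 = 1.37 reported in Chapter IV; minor rounding accounts for the 4.05 versus 4.06 difference in the F ratio.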
Error rates in studies using student examiners are generally much higher than the rates reported in the present study. For example, Slate and Chick (1989) found that 67.4% of their student-examiners made errors which changed Full Scale IQ scores.

Of the WISC-R protocols with FSIQ changes, the mean change was 2.3 points, with a range of 1-8 points. The standard deviation of change in FSIQ scores was 1.14. On the WISC-III, the mean change on protocols with errors was 1.8 points, with a range of 1-7 points. The standard deviation of change in FSIQ scores was 1.05. The SEM of the WISC-III is 3.20, but because of the method used in calculating this SEM, it is evident that clerical errors must be seen as a source of additional error.

Standard Error of Measurement
As stated earlier (in Chapter I), the SEM on the Wechsler scales is calculated from a composite of split-half reliability coefficients taken from individual subtests. The resulting SEM, therefore, cannot be thought to account for error from sources outside of these subtests. The SEM published for a Wechsler test might be seen to account for one type of clerical error - addition of raw scores - in the protocols used in the reliability studies in the production of the test. However, it is almost certain that the rate of clerical errors is higher in actual practice than the rate given from the test's original reliability study. Those errors not found within the actual subtests - transformation of raw to scaled scores, addition of scaled scores, transformation of scaled scores to IQ scores - are not accounted for at all by the published SEM of the test.

It has been suggested that the SEM of a test should include not only errors brought about through content sampling, as in the case of the Wechsler scales, but also time-sampling errors, errors that result from mistakes in scoring, and errors from improper test administration (Bradley, Hanna, & Lucas, 1981). The Bradley et al. study suggested that a composite SEM be composed in the following manner: SEMcomp = the square root of (SEMcs² + SEMts² + SEMs² + SEMa²), where cs = content sampling, ts = time sampling, s = scoring, and a = administration. Their estimation of a composite SEM on the WISC-R came to 6.5 IQ points (as opposed to the published 3.19 points).

The results of the present study suggest that because clerical errors exist and have an effect on test reliability, a composite SEM should also include a clerical-error term (SEMc) under the square root. The standard deviation of IQ changes on the WISC-R was 1.14 points (1.05 on the WISC-III). When included in the above formula, the composite SEM rises to 6.7 points for the WISC-R, as opposed to the published SEM of 3.19 points. This pattern is likely similar for the WISC-III.

Examiners
Six of the seven examiners (86%) made clerical errors on both tests, with one examiner completing error-free protocols on all 18 protocols of the WISC-R and all 18 of the WISC-III. In comparison, Sherrets et al. (1979) found that 89% of the examiners in their study had committed clerical errors; Levenson et al. (1988) found that 57% of examiners made clerical errors on the 162 WISC-R protocols used in the study. The results from this study in this area are similar to results in previous studies and suggest that the sample of examiners is similar to the samples in previous research.

The examiners' years of experience were correlated with the number of errors made on the two tests. On the WISC-R, no relationship was seen between years of experience and the number of clerical errors made.
On the WISC-III, however, a positive correlation was seen between overall errors and experience (r = .54). Even stronger was the correlation between IQ-changing errors and experience (r = .60). These positive correlations (although not statistically significant) suggest that, in this sample, those examiners with the greatest years of experience had the most difficulty in adapting to the changes in a newly revised standardized test, and were more likely to make careless errors. These more experienced examiners made more clerical errors, and they reported more inaccurate results than did examiners with less experience.

The error rates in the present study are consistent enough with those in other studies to support the assumption that the examiners are representative of the population of examiners. Conclusions drawn in this study should be applicable to other practitioners using these tests. Also, the demographic characteristics of the community from which the data were taken suggest that the examinees were likely representative of the population of examinees in the province.

Error Categories
The most common error type by far, on both tests, was addition of raw scores. On the WISC-R, 69% of all errors were of that type, and on the WISC-III, 79% of all errors were in that category. The difference between the two tests was not significant. This type of error, though most frequent, is least likely to make a major difference in the IQ score at the end of the test.

Category #2 errors - transformation of raw to scaled scores - were the second most frequent error on the WISC-R, with 15% of errors, but were the least-frequent error type on the WISC-III, with only 2% of errors. The difference between the two tests for this error type was not statistically significant, but suggests a difference between the two tests.

Little (1992) suggested that "Of greatest concern with regard to administration and scoring is the process of converting raw scores to standard scores ... This different format of presentation is sure to be confusing and cause recording errors" (p. 151). That the WISC-III seems less prone to error in the area of transformation of raw to scaled scores is somewhat surprising. The changes made on the WISC-III protocol appear to make the process of transformation of raw to scaled, or standard, scores more confusing. Contrary to Little's predictions, however, the new format does not seem to be more error-prone in this area, with the present study showing relatively few errors being committed in that location.

The second most frequent error-types on the WISC-III were transformation of raw to scaled scores (9%) and addition of scaled scores (9%). These errors were found in similar number on the WISC-R, with no significant difference between the two tests in this area. That these error rates are the same should be of some concern to those using the WISC-III: these two error categories are among the most likely to have an impact on Full Scale IQ.

Most studies report at least some clerical errors in the area of computation of chronological age (e.g., Levenson et al., 1988; Sherrets et al., 1979). No errors of this type were found in this study, probably as a result of the referral and reporting process used in the school district from which the protocols were taken. In the school district selected for this study, the school psychologists are given referrals with age and birth-date supplied; after the assessment, the reports with birthdate and age are distributed to parents and school personnel involved with the child. It is likely that any chronological-age computation error would be noticed and the test protocol would be corrected.

The WISC-III Over Time
The WISC-III protocols were also scrutinized for the rate of examiner errors over three time periods, starting with the beginning of administration and continuing for three half-year time periods. All of the examiners came to the WISC-III with training and experience on the WISC-R, but no formal re-training on the WISC-III. The revised test is similar enough to the WISC-R that examiners are expected to be able to make the transition with a minimum of difficulty.

Two types of errors were looked at: overall clerical errors, and serious errors (i.e., those errors which cause the Full Scale IQs to change). The overall error rate declined over the three time periods from 3.0 per six protocols to 2.0 per six protocols, but the decrease was non-significant. However, for the errors which cause Full Scale IQ change, a significant difference was found between the first and third time periods. The mean number of IQ-changing errors (2.0 per 6 protocols, or 33%) at the beginning of administration was significantly higher than during the third six-month period, which had an IQ-changing error rate of 0.6 per 6 protocols (10%).

Past studies have shown some decrease in examiner errors in graduate students' administrations of standardized tests, but few, if any, studies to date have looked at practitioners. Slate and Jones (1990c) found no decrease in graduate students' scoring error rates after 5 practice administrations of the WISC-R, and a slight decrease after 10 administrations. In their study, error rates were very high: 79.7% of the protocols completed contained errors which changed the Full Scale IQ. They concluded that the training for administration of standardized tests was inadequate, and should consist of more than simple repetition of test administrations.

Gilbert (1992) investigated an alternative method of teaching test administration and found students' WAIS-R administration errors significantly reduced from the beginning to the end of the semester. Gilbert's study, however, looked only at administration errors, and not at scoring or clerical errors, which may or may not have followed a similar pattern.

The results of the present study indicate that practicing school psychologists make a significantly higher number of IQ-changing errors at the beginning of administration of a newly revised instrument than they do 18 months later. Interestingly, though, the rate of overall errors does not change significantly over time. The reason for this difference in pattern likely lies in the make-up of each of these two broad categories of errors (overall and IQ-changing).

The overall errors consisted mostly of category #1 errors, that is, addition of raw scores. These errors often result in minimal change in scaled scores and, consequently, IQ scores. The WISC-R and WISC-III do not differ vastly in the way raw scores are added up, and therefore no change over time is noted when the examiners begin using the revised test. The IQ-changing errors, though, are more likely to be category #2, #3, or #4 errors, which have more potential to change IQ scores.
These types of errors can be directly attributed to changes in the protocol format of the newly-revised test, and to the examiners' inexperience with the test. When all three time periods of WISC-III administration were examined, the WISC-III did not show a significantly higher rate of category #2, #3, and #4 errors than the WISC-R. It may be construed that the cause of the high rate of IQ-changing errors at the beginning of WISC-III administration can be attributed to examiners' inexperience and lack of retraining rather than to a deficiency in protocol design.

Strengths and Weaknesses of the Study
The relatively small number of examiners who provided protocols is perhaps a limitation of this study, though the total number of protocols used compares favorably with most other similar studies in the literature. Also, the variety of training backgrounds (school and clinical psychology, many different graduate programs) may lessen the impact of the limitation stated above. Another limitation is that only Master's-level psychologists participated in this study, thus likely underestimating the rate of clerical error (Ryan et al., 1983). However, error rates in this study were similar to those in other studies (Sherrets et al., 1979; Wagoner, 1988).

Another possible limitation is that this study focussed only on clerical errors, and ignored scoring and administration errors. The error rate in this study therefore likely underestimates the actual rate of total examiner error.

Several key areas of strength were noted in this study. For example, unlike most research studies completed in this area, this study used actual protocols from practitioners. Many other studies have used either graduate students' protocols, or fabricated protocols scored by volunteer professionals. The size of the sample of protocols also compares favourably with other studies. This study also examined examiner error from two new perspectives: very few, if any, studies to date have made comparisons between a test and its revision, or looked at the pattern of error rates over time.

Conclusions
Two central questions were addressed in this study. The first was regarding the change in the rate of examiner clerical error between the WISC-R and the WISC-III. The second involved the rate of error over time on the WISC-III, from the beginning of the test's administration to a period 18 months later.

A claim made in the WISC-III examiner's manual (1991) states that "Changes from the WISC-R ... have made it easier for the examiner to use" (p. iv). I interpreted the claim to mean that the test, including the newly-revised protocol, has been designed so as to be less prone to examiners' clerical errors. Even if "easier to use" wasn't intended to be interpreted as reduced clerical errors, the problem of clerical errors is well enough documented that it should have been addressed in the design of the revised test.

The results from this study suggest examiners continue to make a substantial number of clerical errors on the WISC-III: 42% of protocols had clerical errors, with Full Scale IQ-changing errors on 23% of all protocols. The results suggest that there is no significant difference in clerical errors between the two tests, and therefore, the WISC-III does not seem to be easier to use than the WISC-R, if ease of use is interpreted as propensity to make clerical errors.
A more extensive retraining on the WISC-III might reduce the frequency of IQ-changing errors in the initial months of test administration, when these errors are most likely to occur. This retraining might not have a significant effect on reducing overall clerical errors, which in this study occurred at a high rate through WISC-R administration, and remained high throughout all three time periods of WISC-III administration.

Because clerical errors remain a problem with the WISC-III, and because the provided SEM does not account for these clerical errors, it is suggested that the amount of error in a test could be better indicated with a composite SEM which would include not only content sampling errors, but all other error types, including clerical errors. This suggestion, a variation of a suggestion made by Hanna et al. (1981), would provide test users with important additional information that is not presently published. The issue of reliability and SEM is not just a theoretical or psychometric issue, but one which is of practical significance. Scores from standardized tests are frequently given within a surrounding confidence interval, which is directly related to the reliability coefficient of the test.

This study also examined the rate of clerical error on the WISC-III over three six-month time periods, beginning with the first six months of administration of the test. It was found that although the overall clerical error rate did not change significantly between the three time periods, the Full Scale IQ-changing errors were significantly higher in the first time period than in the last time period. In other words, examiners made a relatively high number of serious (FSIQ-changing) errors at the beginning of their administration of the WISC-III.

It was earlier hypothesized that there would be no significant change in the rate of clerical error with examiners' increasing familiarity with the WISC-III. Although little, if any, previous research was conducted in this area with practitioners, and studies with graduate students showed a decrease in error rates with experience, it was thought that practitioners who had no formal retraining would show special vigilance in avoiding errors with a new and somewhat unfamiliar test revision. This present study suggests that practitioners with no formal retraining make a relatively high number of IQ-changing errors at the beginning of the test's use. It was found that the examiners who had the most difficulty with the transition from the WISC-R to the WISC-III were those with the most experience on the WISC-R. There is a clear need for the re-training of examiners who use the revised version of a standardized test.

Implications of the Study
Clerical errors continue to be a problem with the Wechsler scales. The results from this study suggest that examiner errors continue unabated in the WISC-III, with no significant reduction in either overall or IQ-changing errors. The revised protocol, with one possible exception (transformation of raw to scaled scores), does not seem to be less prone to clerical errors. And although one might attribute all clerical errors to examiner carelessness, and consider these errors unavoidable, valuable suggestions have been made which may lessen their rate and impact. Levenson et al.
(1988) made  the suggestion that a “revised record form might include directions to the psychologist to verify and recheck all response scoring, mathematical, administrative, IQ, and birth date computations” (p. 663).  Suggestions like these have been wholly ignored on the WISC  Ill protocol, with the result that the error rate remains high and  73  unchanged.  The WISC-lll would be a more reliable instrument if  more emphasis was placed by the test designer on avoiding clerical errors.  A few key instructions on the actual protocol  -  perhaps a  step-by-step guide to check and recheck scoring and mathematical computations  would likely result in more reliable scores.  -  Recently-introduced computer scoring programs for standardized tests may help reduce the incidence of some types of  clerical errors.  For example, the Scoring Assistant for the Wechsler  Scales, introduced in 1992, completes all calculations for transformation of raw to scaled scores, addition of scaled scores, and transformation of scaled to IQ scores. clerical errors  -  addition of raw scores  -  The most frequent of would not be reduced  through the use of a computer scoring program, nor would scoring or administration errors.  Also introduced through the use of computer  scoring programs would be another source of error: mistakes in transferring data from the protocol to the computer.  Another issue that has been largely ignored is the issue of the accuracy of the reported reliability coefficients and SEMS.  The  results from this study underscore the need to include in a composite SEM not only the content sampling, time sampling and test administration errors suggested by Hanna et al. (1981), but to also include the component of clerical error.  Professional examiners are not often formally re-trained on revisions of standardized tests.  For the most part, it has been  74  assumed that they possess the knowledge and experience to undertake the necessary changes in practice brought on by the test’s revision.  This study suggests that professionals make a relatively  higher number of serious clerical errors when they are in the process of learning to use a revised test.  It has been suggested (e.g.,  Franklin et al., 1982) that practicing psychologists are lacking in opportunities for continuing education.  From the results of this  study, it can be recommended that practitioners should undertake a more extensive retraining when a standardized test is revised, even when the changes appear to be largely cosmetic, as is the case with the WISC-lll.  75  REFERENCES  American Educational Research Association, American Psychological Association, and National Council on Measurement in Education. (1985). Standards for educational and psychological testing. Washington, DC: American Psychological Association.  Anastasi, A. (1988). Psychological Testing (6th ed.). New York: Macmillan.  Anderson, T., Cancelli, A., & Kratochwill, T. (1984).  Self-reported  assessment practices of school psychologists: Implications for training and practice.  Journal of School Psychology,  .,  17-  29.  Bird, C. A. (1992).  Examiner errors in scoring the Wechsler  Intelligence Scale for Children  -  Revised.  Texas Woman’s University, 1992).  (PhD Dissertation,  Dissertations Abstract  International, 54/01, p.140  Blakey, W., Fantuzzo, J., Gorsuch, R., & Moon, G. (1987). A peer mediated, competency-based training package for administering and scoring the WAIS-R.  Professional  Psychology: Research and Practice, j.(1), 17-20.  
Borg, W., & Gall, M, (1989). Educational research: An introduction, Fifth Edition.  New York: Pitman.  76  Bradley, F., Hanna, G., & Lucas, M. (1980). The reliability of scoring the WISC-R. Journal of Consulting and Clinical Psychology, 4.., 530-531.  Brannigan, G. (1975).  Scoring difficulties on the Wechsler  intelligence scales.  Psychology in the Schools, 12, 313-314.  Buckhalt, J. (1990). Atlributional comments of experienced and novice examiners during intelligence testing. Psychological Assessment,  Cleutat, V. (1965).  .,  Journal of  478-484.  Examiner differences with the Stanford-Binet IQ.  Perceptual and Motor Skills, SQ., 317-318.  Conner, R., & Woodall, F. (1983). The effects of experience and structured feedback on WISC-R error rates made by student examiners. Psychology in the Schools,  ,  376-379.  Conrad, P. (1991). Error frequencies of student-examiners on WISC-lll administrations.  (Doctoral DisserLation,  Fuller  Theological Seminary, School of Psychology, 1991). Dissertation Abstracts  Egeland, B.  (1969).  the WISC.  International. 52-10.  Examiner expectancy: Effects on the scoring of  Psvcholoav in the Schools,  .,  31 3-315.  77  Franklin, M., Stiliman, P. & Burpeau, M.  (1982).  intelligence testing: Are you a source?  Examiner error in  Psychology in the  Schools, 19., 563-569.  Gilbert, T.  (1992).  WAIS-R.  Students’ errors on the administration of the  Canadian Journal of School Psychology,  Hanley, J.H. (1977).  ,  126-130.  A comparative study of the sensitivity of the  WISC and WISC-R to examiner and subject variables. dissertation, Saint Louis University).  (Doctoral  Dissertation Abstracts  International, i, 7814571  Hanna, G., Bradley, F., & Holen, M.  (1981).  Estimating major sources  of measurement error in individual intelligences scales: Taking our heads out of the sand. Journal of School Psychology, 19., 370-376.  Hunnicutt, C., Slate, J., Gamble, C., & Wheeler, M.  (1990).  Examiner  errors on the Kaufman Assessment Battery for Children: A preliminary investigation.  Journal of School Psychology,  aa,  271 -278.  Kaufman, A. colors.  (1990).  The WPPSI-R: You can’t judge a test by its  Journal of School Psychology,  ,  387-394.  Levenson, R., Golden-Scaduto, C., Aiosa-Karpas, C., & Ward, A. (1988).  Effects of examiners’ education and sex on presence  78  and type of clerical errors made on WISC-R protocols. Psychological Reports,  Little, S.  (1992).  ,  659-664.  The WISC-lIl: Everything old is new again.  School  Psychology Quarterly, Z 148-154.  Lubin, B., Larsen, R., & Matarazzo, J. (1984).  Patterns of  psychological test usage in the United States.  American  Psychologist, 3.9.(4), 451-453.  Miller, C., & Chansky, N., (1972).  Psychologists’ scoring of WISC  protocols. Psychology in the Schools, 9., 144-152.  Miller, C., & Chansky, N., (1973). protocols: A follow-up note.  Psychologists’ scoring of WISC Psychology in the Schools, 10,  344-345.  Miller, C., Chansky, N., & Gredler, G. (1970). Rater agreement on WISC protocols. Psychology in the Schools, Z, 190-193.  Ministry of Education, Information Services Branch (1993). Information profile for school year 1992/93.  Nunnally, J. (1978). Psychometric theory (2nd ed.). McGraw-Hill.  New York:  79  Oakland, T., Lee, S. & Axelrad, K. (1975). Examiner differences on actual WISC protocols. Journal of School Psychology, ta, 227233.  Perot, J. (1992).  Cognitive strategies and heuristics underlying  psychologists’ judgements on the WISC-R verbal scales: A protocol analysis. 
Unpublished master’s thesis, University of British Columbia, Vancouver, B.C.  Petersen, D., Steger, H., Slate, J., Jones, C., & Coulter, C. (1991). Examiner errors on the WRAT-R. ,  Psychology in the Schools,  205-208.  Psychological Corporation (1991).  Manual for the Wechsler  Intelligence Scale for Children  -  Third Edition, New York:  Harcourt Brace Jovanovich, Inc.  Rothman, C.  (1973).  Differential vulnerability of WISC subtest to  tester effects. Psychology in the Schools, it, 300-302.  Russ, S. (1978).  Teaching psychological assessment: Training issues  and teaching approaches.  Journal of Personality Assessment,  4Z(5), 452-456. Rust, J., & Golambok, S. (1989).  Modern Psychometrics.  of Psychological Assessment.  London: Routledge.  The science  80  Ryan, J., Prifitera, A., & Powers, L. (1983). Scoring reliability on the WAIS-R. Journal of Consulting and Clinical Psychology, 5j, 149-150.  Sattler, J. M.  (1991).  Normative changes on the Wechsler  Preschool and Primary Scale of Intelligence Pegs subtest.  Revised Animal  -  Psychological Assessment: A Journal of  Consulting and Clinical Psychology,  Sherrets, S., Gard, G., & Langner, H.  a, 691 -692.  (1979). Frequency of clerical  errors on WISC protocols. Psychology in the Schools, j., 495-6  Slate, J. & Chick. (1989). WISC-R examiner errors: Cause for concern. Psychology in the Schools,  Slate, J., & Hunnicutt, L. scales.  (1988).  ,  78-83.  Examiner errors on the Wechsler  Journal of Psycho-Educational Assessment,  ,  280-  288.  Slate, J. & Jones, C. (1989). Can teaching of the WISC-R be improved?  Quasi-experimental exploration.  Professional  Psychology: Research and Practice, ZQ, 408-410.  Slate, J. & Jones, C.  (1990a).  source of concern.  Examiner errors on the WAIS-R: A  Journal of Psychology, 124, 343-345.  81 Slate, J., & Jones,C. (1990b). administering the WAIS-R.  Identifying students’ errors in Psychology in the Schools,  ,  83-  87.  Slate, J., & Jones, C. (1990c).  Student error in administering the  WISC-R: Identifying problem areas.  Measurement and  Evaluation in Counseling and Development,  Slate, J., Jones, C., Coulter, C., & Covert, T.  ,  (1992).  137-140.  Practitioner’s  administration and scoring of the WISC-R: Evidence that we do err. Journal of School Psychology, 3.Q., 77-82  Strein, W., (1984). A method for the systematic observation of examiner behavior during psychoeducational assessments. Psychology in the Schools, Zj, 318-323.  Stewart, K., (1987). administration.  Assessment of technical aspects of WISC-R Psychology in the Schools, 24, 221-228.  Wagoner, R. (1988). Scorinç. errors made by practicing psychologists on the WISC-R.  Unpublished Master’s thesis, Western Carolina  University.  Warren, S., & Brown, W. (1972). Examiner scoring errors on individual tests. Psychology in the Schools, 9, 118-122.  82  Wechsler, D. Children  (1974). -  Manual for the Wechsler Intelligence Scale for  Revised.  New York: Psychological Corporation.  
