An examination of several in-basket scoring strategies and their effect on reliability and criterion-related… Harlos, Karen P. 1992

AN EXAMINATION OF SEVERAL IN-BASKET SCORING STRATEGIES AND THEIR EFFECT ON RELIABILITY AND CRITERION-RELATED VALIDITY

by

KAREN P. HARLOS

B. A. University of British Columbia, 1982

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF ARTS

in

THE FACULTY OF GRADUATE STUDIES

Department of Psychology

We accept this thesis as conforming to the required standing

THE UNIVERSITY OF BRITISH COLUMBIA

November, 1992

© Karen P. Harlos, 1992

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Psychology
The University of British Columbia
Vancouver, Canada

ABSTRACT

The in-basket exercise, a paper-and-pencil measure of administrative ability, is an assessment technique characterized by complex, often subjective scoring procedures which have limited wide-scale applications of this popular instrument. Because the literature affords no clear classification system for this and related exercises, a structure which clarified the definitions of and relationships among management games, management simulations, work sample tests, and in-basket exercises was introduced. The primary purpose of the present research was to investigate the cross-sample generalizability of several strategies for scoring the in-basket exercise, including a reduced-item scoring approach in which an optimal subset of items was identified for scoring. The preliminary studies upon which the present research is based are also discussed in the present work.
In addition to examining the impact of these scoring strategies on reliability and criterion-related validity, consideration was given to addressing long-standing concerns of in-basket training- and scoring-time demands.

Three hundred and twenty-one entry-level employees from a large western Canadian utility company were administered the same in-basket exercise previously applied in a different Canadian utility company. Contrary to expectations, the shrinkage in validity using an empirically-based scoring key was substantial, pointing to the selection of a more logically-derived panel key as the method of choice. The introduction of a new cognitive-based measure of in-basket performance showed promising results. In addition, the reduced-item scoring approach did not result in significant losses in reliability or criterion-related validity, thus allowing substantial reductions in training- and scoring-time. Implications for in-basket development are also considered.

TABLE OF CONTENTS

ABSTRACT
LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGMENT

INTRODUCTION
    The In-Basket Exercise
    Definition and Design
    The In-Basket as a Work Sample Test
    Legal influences on work sample testing
    Economic influences on work sample testing
    Classification of Scoring Methods
    The objective method
    The subjective method
    The combination method
    Rationale for the Present Study
    Need for Further In-Basket Research
    Overview of Preliminary Studies
    Study 1
    Study 2
    The Present Study

LITERATURE REVIEW
    Section 1: Management Simulations
    History and Development
    Terms and Definitions
    Historical Background of Management Simulations
    Management Simulations as Work Sample Tests
    Issues in Management Simulation Design and Research
    Key Theoretical Issues: Realism and Fidelity
    Realism
    Fidelity
    Validity in the Context of Management Simulations
    Content validity
    Construct validity
    Criterion-related validity
    Summary of Section 1: Management Simulations
    Section 2: Psychometric Properties of the In-Basket Exercise
    The Seminal In-Basket Work
    Goal of Frederiksen's work
    Identification of Performance Dimensions
    Development and selection of exercise problems
    Development of the scoring method
    Significance
    Introductory Comments on In-Basket Empirical Findings
    Reliability of the In-Basket Exercise
    Alternate Form Reliability Estimation
    Conceptual foundation
    Empirical findings
    Summary
    Internal Consistency: Split-Half and Alpha Coefficients
    Conceptual foundation
    Empirical findings
    Summary
    Inter-Rater Reliability Estimation
    Conceptual foundation
    Empirical findings
    Summary
    General Summary of In-Basket Reliability Results
    Validity of the In-Basket Exercise
    Criterion-related Validity
    Air Force in-baskets
    School administration in-baskets
    The Port of New York Authority studies
    The Management Progress Study
    Cross' work with the school administration in-basket
    IBM study
    The General Electric research
    A "leadership" in-basket exercise
    A return to the self-report scored in-basket exercise
    A predictive validation study
    A quickly-scored in-basket
    Summary of criterion-related validity evidence
    Construct Validity
    School administration in-baskets
    Frederiksen (1966)
    A parallel form study
    The measurement of participative decision-making
    Another multitrait-multimethod study
    Factor-analytic studies
    Summary of construct-related validity evidence
    Content Validity
    General Summary of In-Basket Validity Results
    Summary of Section 2: Psychometric Properties of the In-Basket Exercise
    Section 3: A Review of Current In-Basket Scoring Strategies
    Limitations of a Literature-Based Review
    Review of Current In-Basket Scoring Methodology in Industry
    The Multiple-Choice In-Basket Exercise
    General Management In-Basket Exercise
    ETS Consolidated Fund In-Basket Exercise
    An example of the combination scoring method
    Summary of current in-basket scoring methodology in industry
    Summary of Section 3: A Review of Current In-Basket Scoring Strategies

PRELIMINARY STUDIES
    Study 1
    Method
    Participants and Setting
    Materials
    The Telephone Supervisor In-Basket Exercise
    Criterion measurement
    Procedure
    Development of the TSIB
    Identification of the TSIB dimensions
    Development of the TSIB dimension scoring keys
    General scoring protocol
    Results
    Reliability
    Criterion measurement
    TSIB dimensions
    Validity
    Discussion
    Study 2
    Method
    Participants and Setting
    Procedure
    Modification of the Quality of Judgement scoring key
    A new, expanded panel
    Derivation of the new scoring keys for Quality of Judgement
    Derivation of the reduced-item sets
    Results
    Reliability
    Validity
    Intercorrelations
    Construct validity
    Discussion

RATIONALE AND HYPOTHESES FOR THE PRESENT STUDY
    General Summary
    Proposal and Hypotheses
    Criterion measurement
    Quality of Judgement Scores
    Predictive validities of scores based on the Panel, Empirical and Merged scoring keys
    Additional analyses based on the Panel scoring key
    Additional Performance Dimensions in the TSIB
    Understanding of Situation Questionnaire
    Productivity
    In-Basket Stylistic Dimensions
    Additional New Measures
    Number of Priority Items attempted
    Scorer's Impression of Involvement

METHOD
    Participants and Setting
    Assessment Measures
    Predictor measures
    Criterion measures

RESULTS
    Overview of Study Design and Data Analysis
    Design
    Data Analysis
    Reliability
    Criterion-related validity
    A brief comment on the Results
    Major Reliability and Criterion-related Validity Findings
    Reliability of the Performance Appraisal Dimension Scores
    Cross-Validation of the Quality of Judgement Scores
    Other Performance Dimensions in the TSIB
    Understanding of Situation Questionnaire
    Productivity
    TSIB Stylistic Dimensions
    Reliability of the TSIB dimensions
    Additional New Measures
    Number of Priority Items Attempted
    Scorer's Impression of Involvement
    Factor Analysis of the TSIB Action Elements
    A Series of Factorings
    The first factor analysis
    The second factoring
    The third factoring
    Interpretation of the factors
    Factor I: Consultation and Discussion
    Factor II: Independent Decision-Making
    Factor III: Company-Based Decision-Making
    Factor IV: Supervising Staff
    Factor V: Integration and Analysis
    Assignment of Factors to Action Elements

DISCUSSION
    Reliability and Criterion-related Validities of Selected Dimensions of the TSIB
    Cross-Validation of the Quality of Judgement Scores
    Predictive validities of scores based on the Panel, Empirical, and Merged keys
    Additional analyses based on the Panel scoring key
    Additional Performance Dimensions in the TSIB
    Understanding of Situation Questionnaire
    Productivity
    TSIB Stylistic Dimensions
    Reliability of the TSIB dimensions
    Additional New Measures
    Scorer's Impression of Involvement
    Number of Priority Items Attempted
    Selection of the Optimal Item Subset: Integrating Empirical Results and Practical Issues
    Factor Analysis of the TSIB Action Elements
    A Note Regarding A Linear Combination of the TSIB Dimensions
    Summary and Conclusions

REFERENCES

APPENDICES
    Appendix A  List of Selected Cognitive Ability and Personality Measures Used in the Construct Validity Analysis

LIST OF TABLES

Table 1  Review of Major Empirical Findings on the In-Basket
Table 2  Inter-Rater Reliability Estimates for the TSIB
Table 3  Validity Coefficients between Overall Performance Criterion (OPC) and TSIB Dimension Scores
Table 4  Alpha and Stepped-up Split-Half Reliability Estimates for Quality of Judgement Scores Obtained from Three Item Sets as Scored by the Merged Key
Table 5  Bivariate Correlations between Overall Performance Criterion and Quality of Judgement Scores Obtained from Four Scoring Keys and from Three Item Sets
Table 6  Intercorrelations of Dimension Scores based on the 21-item Set: Stylistic Dimensions Scored by Logical Keying System and Quality of Judgement Scored by Panel 1 (Liberal) Key
Table 7  Intercorrelations of Dimension Scores based on the 8-item Set: Stylistic Dimensions Scored by Logical Keying System and Quality of Judgement Scored by Panel 1 (Liberal) Key
Table 8  Reliability Analysis (Alpha Coefficient) of the Five Performance Appraisal Dimensions
Table 9  Bivariate Correlations between Overall Management Performance and Quality of Judgement Scores Obtained from Three Scoring Keys and from Three Item Sets
Table 10  Bivariate Correlations between Overall Management Performance and Quality of Judgement Scores Obtained Using Scoring Keys based on Different Methods of Combining Panel Judgements (Liberal or Conservative), Different Origins of Panel Ratings (Panel 1 or Panel 2), and Applied to Different Datasets (Company 1 or Company 2)
Table 11  Validity Coefficients between Overall Management Performance and Understanding of Situation Questionnaire Total Score (32-item); Alpha Reliability Estimates for Total Score
Table 12  Validity Coefficients between Overall Management Performance and Understanding of Situation Questionnaire Total Score (20-item) and Reliability Estimates for Total Score
Table 13  Validity Coefficients between Overall Management Performance and Productivity as Measured by Several Individual Units of Output, Linear Composites, and Standard Total Action Indices
Table 14  Validity Coefficients between Overall Management Performance and TSIB Stylistic Dimension Scores
Table 15  Intercorrelations of TSIB Stylistic Dimensions and Productivity Scores Based on the 8-Item Set
Table 16  Intercorrelations of TSIB Stylistic Dimensions and Productivity Scores Based on the 10-Item Set
Table 17  Intercorrelations of TSIB Stylistic Dimensions and Productivity Scores Based on the 12-Item Set
Table 18  Reliability Estimates of the TSIB Dimensions
Table 19  Bivariate Correlations between Number of High Priority Items Attempted and Overall Management Performance (OMP)
Table 20  Primary-Factor Intercorrelations
Table 21  Factor-Based Scale Score Intercorrelations
Table 22  Validity Coefficients between Overall Management Performance (OMP) and TSIB Factor-Based Scale Scores and Reliability Estimates
Table 23  Summary of Validity Coefficient between Overall Management Performance and Quality of Judgement Scores Obtained Using Scoring Keys based on Different Methods of Combining Panel Judgements (Liberal or Conservative), Different Origins of Panel Ratings (Panel 1 or Panel 2), and Whether Panel Origin and Dataset Match (Intra- or Inter-company analysis)
Table 24  Mean Validity Coefficients Determined by Aggregation Across Method of Combining Panel Judgements, Origin of Panel, and Nature of Analysis (Intra- or Inter-company)

LIST OF FIGURES

Figure 1  Relationship between simulations and work sample tests
Figure 2  Relationships among simulations, work sample tests, management simulations, management games, and in-basket exercises
Figure 3  Sample Item from the Telephone Supervisor In-Basket Exercise
Figure 4  Excerpt from the Scoring Manual for Sample Item (shown in Figure 3)
Figure 5  Design grid for planned panel analyses in terms of origin of the Panel key, of method of combining panel judgements, and of the dataset to which the keys will be applied
Figure 6  Sample Item from the Understanding of Situation Questionnaire

ACKNOWLEDGEMENT

I would like to thank my committee members, Dr. Ralph Hakstian, Dr. Jerry Wiggins, and Dr. Beth Haverkamp for their contributions toward the organization and preparation of what became a lengthy tome. As my advisor, Dr. Hakstian deserves particular thanks for his assistance in the provision of research resources and his guidance in the direction and evaluation throughout both the preliminary and present studies.

In addition, Lois Crooks and Peggy Mahoney of ETS, who provided me with patient and pleasurable instruction in the scoring of in-baskets, are gratefully acknowledged. In the present study, further assistance in scoring and administration was provided by Derek Sam, Susan Greaves, and Rodney Hicks. Several employees of the Human Resources Department of the second company were very important to the successful completion of the present cross-validation: Cherie Chapman, Paula Weber, Carol Harper, and Joy Powell.

Lastly, I must thank Danny for his cherished intellectual and emotional support.

INTRODUCTION

The identification and development of future administrators is of great interest to company executives and industrial psychologists. Psychometric assessment of administrative ability is most commonly conducted by means of the in-basket exercise, which attempts to simulate the job conditions of managers by providing typical samples of managerial work and by requiring the candidate to take action on the problems presented. The in-basket exercise was first developed in the late 1950s as a training tool for officers in the U.S. Air Force (Frederiksen, Saunders & Wand, 1957). Its applications have now expanded to include certification and assessment for selection and promotion decisions. The in-basket exercise is an established feature of the managerial assessment centre, which is a method (rather than location) characterized by multiple assessment procedures and pooled judgements of multiple assessors to evaluate the performance of managerial candidates (Bray & Grant, 1966; Cascio, 1987).
In fact, a survey of large American companies with assessment centres found that 31 out of 34 used the in-basket; it was also rated the most popular exercise among job candidates (Bender, 1973). More recently, Gaugler, Bentson and Pohley (cited in Thornton, 1992) surveyed the types of exercises used in over 200 assessment centres and found that 81% regularly used the in-basket.

The In-Basket Exercise

Definition and Design

The in-basket exercise consists of a stack of letters, memos, reports and related materials that are presumed to have accumulated in the in-basket of a manager. The participant is directed to deal with the materials as if he or she were actually on the job as the manager. Performance on the exercise can be scored on various dimensions that are important to the successful performance of managerial work. Although usually group-administered, the in-basket exercise allows the participant to act independently by reading and responding in writing to the mountain of memos, letters and other documents. Typically, a job analysis is conducted to ensure that the problems and situations presented in the in-basket exercise are representative and accurate samples of work and so will be valid measures of on-the-job performance.

Generally, the in-basket exercise is composed of two substantive parts and a procedural element that have become the traditional model for design and construction of this instrument (Lopez, 1966). The first part is made up of background materials such as instructions, an organizational chart, and job descriptions to help orient participants to the fictitious company and outline the expectations and guidelines for responding. The second part involves the actual set of items or problems (letters, memos, etc.) to which the participant responds, as well as the materials needed to respond (letterhead, memo forms, paper clips, etc.).
The procedural element of the in-basket exercise refers to the method by which the participant's actions are evaluated, whether scored objectively from written responses, or assessed more subjectively through an interview with a trained evaluator, for example.

The In-Basket as a Work Sample Test

As noted, the problems presented in the in-basket exercise are based on samples of managerial work. Accordingly, the in-basket can be considered as a member of a broader category of instruments known as "work sample" tests, which present samples of work representative of the target job and measure candidates' ability to do the work presented in the test.

Legal influences on work sample testing. In recent years, work sample testing has undergone extensive growth in response to legal pressures regarding the fairness of selection procedures, especially in relation to women and minorities. In Griggs v. Duke Power Company (1971), the U.S. Supreme Court issued a landmark ruling which established several key principles in employment discrimination cases, including the requirement that selection tests be job relevant (that a given requirement for employment is related to job performance). The importance of job relevance for selection measures was later reaffirmed by the U.S. Supreme Court in Albemarle Paper Co. v. Moody (1975). In order to comply with the judicial requirement of fairness in selection procedures, the U.S. federal government published the Uniform Guidelines on Employee Selection Procedures (1978), which articulates three validation procedures to establish the legal defensibility of selection measures, namely construct-, criterion-, and content-validation (these will be discussed in greater detail in Section 1 of the Literature Review).

By their design, work sample tests are likely to satisfy the guideline of content validity (i.e., the extent to which the exercise truly measures the skills and abilities required for and used in the target job).
Because work sample tests are based on representative samples of work from the target job, they are designed to be content valid, and therefore, are job relevant. As a work sample test, the development of the in-basket exercise is also based on the notion of content validity and job relevance. Thus, the legal requirement of job relevance and the federal guideline of content validity have significantly influenced the growth and development of work sample testing and in-basket exercises.

Economic influences on work sample testing. In addition to legal pressures to ensure fairness, selection procedures have also been subjected to economic pressures resulting from the current recession. Limited human resources budgets have called for assessment measures which can be administered and scored efficiently, thereby reducing costs. Thus, assessment measures which are psychometrically sound (reliable and valid), legally defensible (job relevant and fair), as well as economically viable (quickly-administered and quickly-scored) are required.

While work sample tests tend to be psychometrically sound and legally defensible, their economic viability may be limited. Hunter and Hunter (1984) pointed out that work sample tests usually incur high costs for the initial development of the instrument, for further re-development if the target job changes, and for test administration. Furthermore, it will be seen shortly that, because of the complexity of in-basket scoring, it is not unusual for an in-basket allowing free-format or open-ended responses to require from 1 1/2 to 5 hours to evaluate, whereas many established personality inventories used for managerial assessment can be fully scored in under five minutes.
Consequently, the use of the in-basket as a measure of administrative ability is a costly component of managerial assessment, given its high development and administration costs, as well as the considerable time required both to train scorers and for trained scorers to evaluate a completed in-basket. A unique challenge, then, facing in-basket exercise developers has been in meeting the concerns of company executives to improve economic efficiency in an increasingly competitive corporate market.

Classification of Scoring Methods

Lopez (1966) noted that, despite the apparent utility of this instrument, "the major weakness lay in its complex, tedious scoring process" (p. 108). In an effort to simulate real-world conditions, the typical in-basket exercise allows the candidate to respond in a written free-format fashion. Given a free-format design, subjectivity and lengthy time demands have, to varying degrees, affected all methods of in-basket scoring. Generally, the training process required to reliably score in-baskets is complex, given the number and scope of the problems presented and the judgement often needed to determine candidates' intent. Even when scorers are fully trained, it is a complex process for them to decipher and interpret written responses, which are often hastily scrawled as candidates try to complete as much work as they can in the time available.

There are three main methods of scoring in-basket performance: (a) objective (psychometric), (b) subjective (interpretive), and (c) a combination of the two. At this time, no known information exists regarding the relative rates of use of these in-basket scoring approaches (B. Gaugler, personal communication, June 30, 1992; G. Thornton, personal communication, June 30, 1992).

The objective method. The "traditional objective" scoring approach, which follows principles established by Frederiksen et al. (1957) and later refined by Lopez (1966), focuses primarily on candidates' free-format written responses.
In-basket scoring using this method attempts to break down these open-ended, free-format responses into molecular units of action (called action elements) which are then quantitatively scored for style and content. This psychometrically-based objective method can often require 1 1/2 to 2 hours per exercise to score, which, for large-scale applications, can present prohibitive costs in terms of both time (i.e., time required for training scorers and scoring time itself) and money (i.e., administration costs).

In response to these concerns of lengthy scoring time and high costs, a recent development in in-basket design has been away from the free-format response mode and toward a multiple-choice, self-report response mode (Kesselman, Lopez & Lopez, 1982; Morris, 1991). Candidates still may initially respond to items in written, open-ended form, but, upon completion of the exercise, this development then requires candidates to indicate the course of actions they took using a self-report multiple-choice form. Only this self-report form is then evaluated by scorers; the open-ended written responses are typically not examined by scorers. This design change has resulted in a modification to the traditional objective scoring method because the scoring of only the multiple choice responses precludes the scoring of action elements derived from open-ended responses. A scoring key is applied to the candidate's choice of the response alternatives provided, thereby eliminating scorers' interpretive judgements in response assessment.

Although objectivity and efficiency in scoring are greatly improved by restricting responses, a clear shortcoming of the self-report response mode is the new threat of motivational distortion. Because scorers rarely consider the open-ended responses, participants may, in hindsight, select a self-report course of action that they believe is superior (and is contrary to the course of action indicated in their open-ended responses).
Consequently, deception is a clear possibility for those in-basket exercises combining open-ended responses with a self-report, multiple-choice response format. In addition, concerns of lack of realism and excessive restriction of candidates' responses make multiple-choice in-basket exercises a controversial development in in-basket design and scoring.

The subjective method. In this commonly-used approach, the primary source of information about in-basket performance comes from an interview with the candidate upon completion of the in-basket (Bray & Grant, 1966; Frederiksen, Jensen & Beaton, 1972). The scorer's interpretive judgements of the candidate's verbal (and possibly written) responses then form the basis of a report describing administrative performance. A consequence of the highly subjective, interpretive nature of this scoring method is its inherent unreliability. Furthermore, given this unreliability, empirical evidence of a predictive relationship between in-basket "scores" and performance criteria (i.e., criterion-related validity) would be difficult to provide using the subjective scoring method. Like the objective method, the subjective scoring method also involves similarly significant training- and scoring-time demands.

The combination method. The combination method of in-basket scoring typically involves both an interview with the candidate and an objective examination of the candidate's written responses. This approach, which continues to be widely used in managerial assessment centres, requires even greater time and financial demands than the objective scoring method.
Even with a trained assessor, it is not unusual for in-basket scoring using the combination method to require 2 to 5 hours per exercise when used in managerial assessment centres (Thornton & Byham, 1982).

Rationale for the Present Study

Need for Further In-Basket Research

Despite its forty-year history and widespread use, relatively few empirical studies examining the reliability and validity of the in-basket exercise have been conducted, perhaps because its high face validity has led to an over-reliance on the in-basket's appearance of job-relatedness (Gill, 1979). Gill's observation, although made over a decade ago, remains valid today; Schippmann, Prien and Katz (1990), in a review of the psychometric properties of in-basket measures, concluded, "Whatever the reason, research and opinion about the in-basket can best be described as incomplete and arrested in time" (p. 856).

In addition, as will be shown in the Literature Review, the quality of the research that is reported in this area can be criticized for its lack of concrete information on methods, procedures, and results. Moreover, great variations in in-basket construction, performance dimensions, and external non-instrument criterion measures (e.g., salary progression, rate of promotion, supervisory ratings) make meaningful comparisons across studies difficult. There is continued controversy as to whether the psychometric evidence from published studies is strong enough to warrant the in-basket's extensive and firm foundation in assessment programs (Gill, 1979; Schippmann et al., 1990).

As noted earlier, assessment measures which are psychometrically sound (reliable and valid), legally defensible (job relevant and fair), as well as economically viable (quickly-administered and quickly-scored) are required. Given their classification as work sample tests, in-basket exercises are likely to be legally defensible.
At the same time, the psychometric evidence from published studies is inconclusive regarding the in-basket's extensive and firm foundation in assessment programs. Moreover, the economic viability of the in-basket exercise has been severely limited because of high development, administration, and scoring costs (resulting from extensive training of scorers and scoring time itself). Excessive time demands stemming from the complex nature of in-basket scoring have limited the wide-scale application of the exercise with large numbers of candidates.

However, if modifications to the scoring system could be made which significantly reduced the time demands but did not significantly decrease the validity and reliability of the instrument, perhaps the in-basket exercise could enjoy an expanded application with large numbers of candidates for training and selection. As noted earlier, the scoring of the in-basket has long been recognized as the weak link in this important and popular instrument. Nearly forty years after the initial development of the in-basket exercise, Brannick, Michaels, and Baker (1989) cited comparisons of scoring systems in in-basket research as a key area for much-needed investigation.

Accordingly, it is this aspect of the in-basket, the scoring method, which is the focus of the present study. More specifically, because of the inherent unreliability of the subjective and, to a lesser extent, combination methods of scoring, modifications to the traditional objective scoring method will be examined.
This will provide the opportunity to assess the criterion-related validities of the modifications to the traditional objective scoring method within a free-format response mode in-basket exercise.

The present study is the result of efforts to build on previous work by Hakstian, Woolsey and Schroeder (1986) which outlined an attempt to develop an in-basket exercise with an objective scoring key and modest training- and scoring-time demands. Two preliminary studies (described below under Study 1 and Study 2) were conducted which made modifications to both the instrument itself and scoring strategies to further the goals of Hakstian et al. (1986). A more complete discussion of the two studies is found in the Preliminary Studies section of the present work.

The present study is a cross-validation of the findings of Study 2 which examined the impact of several scoring strategies on the psychometric properties of the instrument. In addition, using a novel approach of scoring an optimal subset, rather than the entirety of items in the exercise, industry concerns regarding time and cost required for training and actual scoring of the in-basket are considered in the evaluation of the instrument.

Overview of Preliminary Studies

Study 1. Nearly three years ago, a sample of 321 entry-level managers from a large Canadian utility company were administered a newly-revised in-basket exercise as part of a concurrent validity study for a managerial assessment battery. The psychometric properties of the instrument were determined; inter-rater reliability and criterion-related validity of seven new performance dimensions were assessed. In addition, in order to reduce scoring time, a smaller, optimal subset of items was identified so that this smaller subset (rather than the entire set of items) would be scored. One of the seven in-basket performance dimensions was selected as the most critical aspect of administrative performance, and so became the primary predictor used in this and subsequent studies.
The derivation of the optimal item-subset was made on the basis of results from this most critical in-basket performance dimension as measured by one scoring key. The Preliminary Studies section contains an outline of the development of the in-basket used in this (and the present) study and the details of the scoring keys generated for the seven performance dimensions, and also contains a discussion of the procedure used to derive the item-subset. The empirical results are summarized and briefly discussed under Study 1 in the Preliminary Studies section.

Study 2. Analysis of the findings from Study 1 led to a further development designed to improve the criterion-related validity of the in-basket exercise and preserve the use of the optimal subset of items for reduced scoring time. To improve the validity for the critical performance dimension chosen in Study 1, the scoring key for this dimension was modified. Specifically, three separate scoring strategies were developed and applied to this most critical in-basket performance dimension. In addition, the item composition of the optimal item-subset identified in Study 1 was also evaluated using the new scoring strategies. In fact, in addition to the original subset of items (identified in Study 1), two new subsets of items were also scored (for the critical performance dimension) using the three scoring strategies in order to select the most predictive scoring method as applied to the optimal subset of items. In this way, the effects of both the item-subset composition and the scoring method could be examined at the same time.
The findings suggested that an empirically-based scoring method yielded the highest criterion-related validity coefficients (which were substantially higher than the validity using the Study 1 scoring method), and one particular subset emerged as optimal (psychometrically and practically).

The Present Study

In the present study, 321 first- and second-level managers from another large Canadian utility company will be administered the same in-basket exercise used in the preliminary studies (1 and 2), but a newly-developed Insight questionnaire designed to measure the candidate's understanding of the situation is included in the exercise. Criterion data of supervisory ratings of performance will be collected concurrently. In order to cross-validate the findings from Study 2, the same scoring strategies and the identical item-subset compositions will be re-applied to this new sample, which will be scored for the same critical performance dimension evaluated in Study 2. The primary purpose of the present study, then, is to examine the degree to which the observed validities are spuriously inflated because of capitalization on chance, which can produce statistically significant correlations by chance alone. Assessment of the amount of shrinkage in the cross-validated coefficients will thus allow a more accurate determination of the true validity of the critical performance dimension selected for investigation, as measured by the scoring strategies developed in Study 2.

An additional goal of the present study follows from a finding of high intercorrelations among dimension scores in Study 2, as well as reported disagreement in the literature regarding the complexity of in-basket performance dimensions (Frederiksen, 1966; Lopez, 1966). For these reasons, a factor analysis of actions endorsed by candidates will be conducted in order to produce an empirically, rather than rationally, derived set of performance dimensions.
A further aim of the present study is to assess the psychometric properties of the newly-designed Insight questionnaire. That is, to what degree will the aspects of in-basket performance measured by this new instrument improve the prediction of administrative ability?

A more detailed outline of the hypotheses and analyses involved in the present study will be provided in the fourth main section of this work entitled Rationale and Hypotheses for the Present Study. First, however, the conceptual and empirical groundwork for both the preliminary and present studies will be laid in the Literature Review. Next, the specific procedures and findings from the initial development and application of the in-basket used in the present study are related in the Preliminary Studies section. Then, overall findings from the Literature Review and Preliminary Studies sections will be integrated in the Rationale and Hypotheses for the Present Study section in order to provide a solid conceptual and empirical justification for the hypotheses and analyses in the present study. Finally, the procedures, findings, and implications of the present study will be described in the Method, Results and Discussion sections.

LITERATURE REVIEW

The following Literature Review is divided into three sections. In the first section, management simulations and their relation to in-basket exercises will be explored in some depth.
Section 2 then presents a summary of the published findings of the psychometric properties of the in-basket exercise (i.e., reliability and validity). The third and final section of the Literature Review provides an overview and empirical evaluation of the current approaches to scoring the in-basket, including those based on in-basket technology and scoring methods presently used in industry.

Section 1: Management Simulations

History and Development

One consequence of the varied influences on the history of management simulations has been considerable confusion regarding the definitions of both games and simulations (Biggs, 1990; Meier, Newell & Pazer, 1969). The lack of uniform terminology has resulted in little consensus regarding the definitions of and relationships among simulations, work sample tests, management simulations, management games, and in-basket exercises. The literature affords no clear classification system for these measurement instruments. Accordingly, a primary goal of this section is to provide a structure which clarifies the definitions of and relationships among these instruments. The terminology and relationships provided here may be more restrictive than those typically implied by researchers of games and simulations.

Terms and Definitions

The history and development of management simulations are closely tied to the history and development of games. A simulation is an imitative representation of real-world events in which the essential features of an activity are duplicated without portraying reality itself (Jones, 1972; Morgenthaler, 1961). In its broadest sense, a game is defined as an interactive process between the game system and one or more players who are given background information to study and rules and conditions to follow, and who are usually provided specific roles to play (Jones, 1972).
Consequently, games that attempt to emulate real-world events or processes are also types of simulations, and the marriage of the two has borne a field known as gaming simulations (Jones, 1972).

In an organizational context, a managerial simulation attempts to mimic the essential features of managerial activity by recreating the typical work, decisions, and challenges of a managerial position. Management simulations are a broad classification of assessment devices which includes the management game as a smaller sub-type of assessment device. Management games simulate important aspects of business operations by involving individuals or groups of participants who partake in the running of important aspects of business operations and managerial functioning in a "live" fashion (Cascio, 1987; Howard, 1983). By reconstructing and representing critical business processes, the management game is therefore a type of managerial simulation and is properly classified as a type of gaming simulation.

The in-basket exercise has been classified as a management game (Lopez, 1966), and is thus also a gaming simulation. Lopez delineated three main classes of management games: solitaire games, small group games, and complex team games. The in-basket exercise is perhaps the most well-known example of the solitaire game, in which the player or candidate responds independently to a series of administrative or general management problems as if he or she had assumed a particular manager's role. An example of a small group game is The Manufacturing Problem used in the 1956 Management Progress Study at the American Telephone and Telegraph Company (AT & T). In this game, a manufactured product was represented by tinker toys that could be built into objects of varying complexity (e.g., ladder, airplane).
The challenges for the group of six participants were deciding what type of tinker toy product to manufacture, buying parts, and selling the finished product given changing market conditions (Bray, Campbell & Grant, 1974). An example of a complex team game is The Looking Glass (Lombardo, McCall & Devries, 1976) which involves 20 participants (placed in three divisions across four plant levels) who, in a day-long total organizational simulation, run a fictitious glass manufacturing company. Whether in the form of solitaire, small group or complex team games, management games continue to be widely used for a variety of managerial functions, including assessment and diagnosis, training, and research into managerial behaviour (Faria, 1987).

While all management games are simulations, not all management simulations are management games. The leaderless group discussion, for example, is a simulation which involves a small group of participants who are asked to carry on a discussion, usually about a work-related topic (Bass, 1954). This simulation is characterized by a lack of structure and standardization; no person is designated as leader and there are no restrictions on the amount or content of participants' discussion. Individual interpersonal skills such as leadership, persuasiveness, flexibility, and verbal expression are the most commonly measured aspects of performance. Although roles may be assigned, the leaderless group discussion is not strictly a game in that it does not provide set rules and conditions for participants to follow, and it is not designed to simulate key aspects of business decisions and operations.

Management games, although properly classified as gaming simulations, are commonly referred to in the literature simply as simulations (Thornton & Cleveland, 1990). The considerable confusion in the definitions of games and simulations is exacerbated by the oversimplification and inconsistent use of these terms.
In this work, the convention of using the shortened term, simulation, will be followed in referring to both gaming and non-gaming simulations. Accordingly, the in-basket exercise, although a management game, will be referred to more generally as a management simulation.

Historical Background of Management Simulations

Management simulations grew out of developments in four main areas: military war games, operations research, role playing, and performance testing (Thornton & Cleveland, 1990). War game simulations have a long history, beginning with their use in the re-enactment of battles in China in 3,000 B.C. for military education and training (Keys & Wolfe, 1990). In this century, war games have been adapted to reflect a variety of military situations ranging from actual physical battle simulations to more logistical, administrative aspects of military operations (Hausrath, 1971). An example of a simulation in military operations research is the 1955 Rand Corporation's Monopologs simulation of the U.S. Air Force supply system which provided decision-making experience without the risk of giving critical responsibilities to untrained staff (Jackson, 1959). The methodology of operations research was then applied to problems of management in non-military settings. For instance, Monopologs led to the development in 1957 of the first widely-known management game, Top Management Decision Simulation, by the American Management Association (Ricciardi, Malcolm, Bellman, Clark, Kebbee, & Rawdon, 1957). The third development, the role-playing exercise (Moreno, 1959), allowed greater realism for those simulations that involve the adoption or enactment of certain roles.
The fourth and perhaps most influential development in the history of managerial simulations was the large-scale application of managerial simulations as a performance test, which is a standardized assessment of what candidates can do, rather than what they know, their knowledge being commonly measured by paper-and-pencil tests (Cascio & Phillips, 1979). The important step toward the use of management simulations as a performance test in personnel assessment deserves closer inspection, for it was here that the first business in-basket was developed and applied.

Historically, because of the central role of paper-and-pencil tests (rather than performance tests) in personnel assessment and selection, there was limited opportunity for the measurement of behaviours or actions. However, during the 1940's, German military psychologists began to use multiple assessment procedures which allowed the candidate to show behaviours in complex situations in order to generate a holistic appraisal of abilities, rather than an atomistic appraisal determined by the paper-and-pencil approach (Cascio, 1987). During the Second World War, the U.S. Office of Strategic Services (OSS) adopted the holistic approach in its use of multiple assessment techniques to select spies. A principal technique used by the OSS was the situational test, which is based on the belief that behavioural performance results from the interaction of both individual and situational variables (Cascio, 1987). Situational tests, then, are re-constructions of typical, realistic situations which are representative of the performance to be predicted (Flanagan, 1954). Complexity is a key feature of the situations presented so that candidates are less likely to discern the specific reactions or variables being measured. It is believed that allowing candidates to behave naturally and spontaneously will yield more typical, valid data than could be generated by other, less realistic measures.
The OSS required potential spies to develop a cover story to hide their identity; several complex situational tests were then administered to try to trick candidates into breaking their cover (OSS, 1948).

As noted, the situational test, hallmark of the OSS assessment approach, is designed to re-construct or simulate real-world scenarios or events. Situational tests, then, can also be called simulations. The situational assessment principles of the OSS were adapted for managerial selection and were first applied in Douglas Bray's initiation of the 1956 Management Progress Study at AT & T. This longitudinal study is described as the largest and most comprehensive examination of managerial career development ever undertaken (Cascio, 1987). The Management Progress Study is also the first known industrial application of the assessment centre method, which is characterized by multiple assessment procedures and pooled judgements of multiple assessors to evaluate the performance of managerial candidates. Using multiple assessment techniques, the primary purpose of the study was to identify those skills and characteristics that were most predictive of managerial potential. The assessment program (i.e., battery of measures used to assess candidates) included "standard" assessment procedures such as paper-and-pencil tests, interviews, and projective tests. However, the inclusion of situational tests in the assessment program was an innovative feature of the Management Progress Study. A variety of simulations were introduced, including the in-basket exercise.

As noted in the Introduction, the in-basket had been developed in the 1950's as a training tool for officers in the U.S. Air Force (Frederiksen et al., 1957), but was readily adaptable to a variety of settings and applications. John Hemphill, then at Educational Testing Service (ETS), worked with AT & T on the design and materials for the first assessment centre application of the in-basket.
As a result, what is believed to be the first business in-basket exercise was developed by ETS, in conjunction with AT & T, for the assessment program of the Management Progress Study (Crooks, 1974). Norman Frederiksen, whose pioneering work introduced the in-basket as an effective training instrument, was also a research associate at ETS. Thus, the early developmental and experimental work by Hemphill and Frederiksen established ETS as the founding figure in the history of the in-basket.

Management Simulations as Work Sample Tests

The Introduction noted that the in-basket can be considered as a member of a broader category of instruments known as work sample tests, which present samples of work representative of the target job and measure candidates' ability to do the work presented in the test. The in-basket is also a type of managerial simulation, which mimics the essential features of managerial activity by recreating the typical work of a managerial position and requiring the candidate to take action on the problems presented. By providing typical examples of managerial work, management simulations, then, are also a type of work sample test. Several figures will be provided in the following discussion to illustrate the relationships among these instruments. Because the literature more commonly refers to situational tests as simulations (it will be recalled that these terms are interchangeable), this convention will be followed here.

The area of work sample testing has undergone extensive growth in response to legal and social pressures regarding the fairness of selection procedures. Employment discrimination is minimized with the use of work sample tests because, as will be shown, they are well-suited to reducing the possibility of bias or adverse impact (Cascio, 1987). A test is biased if consistent non-zero errors of prediction are made for members of a subgroup (Cleary, 1968).
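The definition of bias just given can be made concrete with a small numerical sketch. The Python below is an illustration only, not a procedure taken from Cleary (1968) or from the present study: it assumes a single straight-line prediction of the criterion from the test score for all candidates, and the scores, ratings, and subgroup labels are invented. The function names are hypothetical.

```python
from statistics import mean


def fit_common_line(x, y):
    """Least-squares slope and intercept for predicting y from x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def mean_prediction_errors(x, y, groups):
    """Mean error (actual minus predicted) per subgroup under one common line."""
    slope, intercept = fit_common_line(x, y)
    errors = {}
    for xi, yi, g in zip(x, y, groups):
        errors.setdefault(g, []).append(yi - (intercept + slope * xi))
    return {g: mean(e) for g, e in errors.items()}


# Invented data: test scores, supervisory ratings, and subgroup labels.
scores = [10, 12, 14, 16, 18, 10, 12, 14, 16, 18]
ratings = [3.0, 3.5, 4.0, 4.5, 5.0, 4.0, 4.5, 5.0, 5.5, 6.0]
groups = ["A"] * 5 + ["B"] * 5

result = mean_prediction_errors(scores, ratings, groups)
for group, err in sorted(result.items()):
    print(f"subgroup {group}: mean prediction error = {err:+.2f}")
```

With these invented figures the common line over-predicts the ratings of subgroup A and under-predicts those of subgroup B by the same amount. Consistent non-zero errors of this kind are what the definition treats as bias; mean errors near zero in every subgroup would indicate its absence.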
Adverse impact is a condition in which a substantially different rate of selection in hiring, promotion, or other employment decisions works to the disadvantage of members of a race, sex, or ethnic group (Schneider & Schmitt, 1986). How work sample tests (and, by extension, managerial simulations and in-basket exercises) reduce the possibility of bias and adverse impact will now be considered by looking more closely at the theoretical underpinnings of this important class of instruments.

Historically, measurement for predictive purposes has been founded on the notion that test results are "signs" or indicators of predispositions to behave in certain ways. A different and perhaps more compelling view was provided by Wernimont and Campbell (1968), who argued that prediction of behaviour would be most fruitful if "samples", rather than signs, of behaviour were studied. Wernimont and Campbell's views have been termed more generally the "behaviour-consistency model" because they argued that predictor measures (e.g., selection tests) should be as similar as possible to outcome or criterion measures. As a result, in order to understand and predict behaviour in organizations, tests which are related to observable job-behaviour measures should be used. As will be discussed in more detail, work sample tests provide an accurate "past performance indicator" by presenting actual or imitative work samples. Therefore, they comport with the basic notion of the behaviour-consistency model that the best predictor of future performance is past performance.

Asher and Sciarrino (1974) conducted a review of work sample tests, resulting in the construction of a two-category classification system of work sample tests: a) motor tests involving the direct manipulation of objects and b) verbal tests presenting problem situations that are mainly language or people-oriented.
Two examples of a motor test are a typing test for office personnel (Giese, 1949) and a meat weighing test for meat scalers (Bridgman, Spaethe & Dignan, 1958). Verbal work sample tests can be classified as either group discussions/decision-making tests or individual situational decision-making tests. Examples of verbal tests include the leaderless group discussion (Bass, 1954), a role-playing test that simulates telephone contacts with customers (Gael & Grant, 1972), and the in-basket exercise. Because the in-basket exercise is usually completed independently, it is properly classified as an individual situational decision-making test.

An important distinction arising from Asher and Sciarrino's (1974) classification system is whether the work samples presented in the test are actual versus imitative representations of work. This distinction is useful in understanding the relationship between work sample tests and simulations, illustrated in Figure 1. It will be recalled that, in its broadest sense, a simulation is an imitative representation of real-world events in which the essential features of an activity are duplicated without portraying reality itself. In terms of Asher and Sciarrino's (1974) work sample classification system, it is likely that motor work sample tests are not simulations because no imitation of reality is presented; instead, actual work (i.e., reality itself) is presented. However, it is likely that verbal work sample tests are also simulations because a representation or image, rather than the actual work, is being presented and candidates are required to make decisions similar to those made in the job in question. In such cases, then, the simulation is also a type of work sample test, but a simulated one; it is an imitative representation of work (Howard, 1983).

It would appear that all managerial simulations can be classified as verbal work sample tests.
Not all verbal work sample tests are managerial simulations, however. The broader category of verbal work sample tests includes non-managerial simulations, such as a law school admissions test involving cases, data interpretation, and reading comprehension (Breslow, 1957), to cite one example. As shown in Figure 1, not all simulations are work sample tests. Non-work sample simulations are typically used to generate a sequence of activities in a system and record statistics regarding system operation. Thus, the computer-based construction and analysis of mathematical models (e.g., economic analysis) are common forms of non-work sample simulations.

Figure 1. Relationship between simulations and work sample tests.

Furthering the behaviour-consistency model of Wernimont and Campbell (1968), Asher and Sciarrino (1974) hypothesized that the closer the "point-to-point" correspondence between the predictor and criterion, the greater the validity. They compared the validities of several classes of predictor tests and found that work sample tests generally yielded higher validity coefficients than ability, aptitude or personality tests. More specifically, when job proficiency was the criterion (measured by supervisory ratings), motor work sample tests were a close second to biographical information in terms of the number of high validity coefficients. Why biographical information was most predictive of job proficiency was puzzling to Asher and Sciarrino. Verbal work sample tests were "in the top-half of predictors," yielding higher validity coefficients than personality, mechanical aptitude, spatial relations and finger dexterity tests in predicting job proficiency.
When the criterion was success in training (measured by grades achieved or a rating), verbal work sample tests had substantially more significant validity coefficients than motor work sample tests. Comparisons of the validities of different tests in predicting training success were not reported.

Asher and Sciarrino's (1974) finding of the strong predictive power of work sample tests was echoed by Hunter and Hunter's (1984) meta-analysis of several predictors of job performance, including the work sample test. Concerns regarding the adverse impact of cognitive ability tests led to their review of the cumulative research on the use of alternative predictors of job performance. As Hunter and Hunter acknowledged, "the use of cognitive ability tests presents a serious problem for American society; there are differences in the mean ability scores of different racial and ethnic groups that are large enough to affect selection outcomes" (p. 73). In contrast, concerns regarding the adverse impact of work sample tests are considerably less than for achievement or aptitude tests because differences in test scores between majority and minority groups are substantially smaller and usually nonsignificant (Robertson & Kandola, 1982; Schmidt, Greenthal, Hunter, Berner & Seaton, 1977).

Hunter and Hunter reviewed the reported validities for six predictors used for promotion or certification where supervisor ratings were used as the criterion for current job performance. The resulting mean validities of the predictors were close in magnitude, ranging from .43 to .54. Work sample tests emerged (marginally) as the best predictor, with an ability composite predictor a very close second at .53.
The goal of finding a predictor that was as valid as ability but that resulted in less adverse impact appeared to have been met.

Hunter and Hunter (1984) further conducted a utility analysis (Brogden, 1949; Schmidt, Hunter, McKenzie & Muldrow, 1979) of the predictors in order to quantify the dollar value gains in productivity with the use of various methods of selection. They determined that, when used for promotion or certification decisions, the gross utility estimates of ability and work sample tests were essentially equal, with a productivity gain for the U.S. federal government of $15.61 billion for one year due to hiring on the basis of ability compared to a productivity gain of $15.33 billion due to hiring on the basis of work sample tests (net utility estimates were not provided). However, they acknowledged that, unlike ability tests, work sample tests incurred high costs that were not taken into account in the utility calculations (e.g., initial development and re-development costs as the job changes, and high administration costs). Practical considerations, therefore, may limit the applications suggested by the favorable empirical findings reported for work sample tests.

In concluding this discussion of management simulations as work sample tests, it should be clearly understood that all management simulations are also work sample tests. We recall that the in-basket exercise is a type of management simulation. Thus, the in-basket is also a type of work sample test. As such, the in-basket exercise enjoys the advantages of reduced bias and reduced adverse impact, but it also suffers the disadvantage of reduced utility (both gross and, in all likelihood, net) because of high development and high administration costs. Figure 2 provides an overview of the relationships among simulations, work sample tests (including the motor versus verbal distinction), management simulations, management games, and in-basket exercises.
It should be clearly recognized that, in the literature, the similarities and differences among these instruments are not clear, and so the paradigm presented here is one of the unique contributions of this study.

Figure 2. Relationships among simulations, work sample tests, management simulations, management games, and in-basket exercises.

Issues in Management Simulation Design and Research

Simulations vary greatly in the nature and complexity of the stimuli (background information, items, etc.) and the nature and complexity of the response format. The traditional form involves written stimulus materials and open-ended written responses. A recent trend in management simulation design is toward the use of videotaped stimulus information and either open-ended written responses or multiple-choice responses (Goldsmith, 1991). Another new development is the use of simulations that are completely computer-based. Here, scenarios and situations are presented via microcomputers and the candidate responds using the computer keyboard. For example, in a complete computer-based simulation exercise called Utopia, the candidate is given the task of governing a fictitious island by simultaneously stimulating the economy and protecting the environment during his or her term (Diete, 1991).

Currently, the majority of simulations used in assessment centres are non-computerized. For this reason, the following discussion of design and research issues concerning management simulations will be limited to non-computerized simulations.

Key Theoretical Issues: Realism and Fidelity

Realism. In-baskets are designed to provide a setting that closely approximates reality by requiring participants to "appreciate the social subtleties and technical niceties that always complicate any management problem" (Lopez, 1966, p. 68).
In this way, realism is a key feature, which is increased by providing ambiguous rather than straightforward issues and by presenting problems that are meaningful and appropriate given the background of participants and the target job (Gill, 1979; Lopez, 1966). Realism is also enhanced by having participants feel time-pressured, which requires them to use judgement in deciding which items and issues are most important, and more closely simulates the action-oriented, quick-thinking approach needed for most managerial work (Lopez, 1966).

Realism is necessary to provide motivation for the participants to take the exercise seriously and become "ego-involved" (Lopez, 1966). When they can identify with the prescribed roles, participants become more involved and they are more likely to behave in the exercise as they would under similar circumstances in the organization (Wernimont & Campbell, 1968). In this way, the key situational test principle of allowing participants to behave naturally and spontaneously in order to yield more typical, valid data than could be generated by other, less realistic measures is supported.

It must be recognized that increasing realism to augment the richness of the information provided by the in-basket can present difficulties, particularly in administration and scoring. Either or both the stimulus materials and response materials can be targeted for greater realism. Improving realism of the stimulus materials by introducing "spontaneous" announcements of budgetary changes, lost shipments, etc. can reduce the control and standardization of the exercise, as administrators must ensure the announcements come at precisely the same point across administrations. In addition, making the response materials more realistic by providing several types and sizes of stationery usually makes both preparation of the exercise packages and scoring of the mass of papers more time-consuming and more costly for the organization.

Fidelity.
A more precise, operational definition of realism applied to assessment exercises has been called the fidelity of the exercise, or the degree to which the task stimulus and response format match the conditions on the job (Thornton, 1992). Fidelity is viewed as a continuum of realism, with a decrease in fidelity as stimulus materials and responses become less and less exact approximations of job stimuli and responses (Motowidlo, Dunnette & Carter, 1990). The most technical, widely-used definitions of low- and high-fidelity exercises were applied to simulations by Motowidlo et al. A high-fidelity simulation, they asserted, presents realistic, accurate samples of the task stimulus and elicits actual responses for performing the task. In contrast, a low-fidelity simulation would likely present a verbal or written description of a hypothetical work situation. In a low-fidelity simulation, participants describe or choose how they would deal with situations (typically in a questionnaire format) rather than directly handling problems.

Following Wernimont and Campbell's (1968) logic of behaviour-consistency, Motowidlo et al. (1990) postulated that because high-fidelity simulations most closely resemble actual work conditions, high-fidelity simulations should be better indicators of future job performance than low-fidelity simulations. High-fidelity simulations, however, can be very costly to develop, implement, and score. Motowidlo et al. (1990) were unsure whether the gain in predictive potential from the use of high-fidelity simulations would justify their high costs. Accordingly, they developed a low-fidelity simulation for selection in order to explore the predictive usefulness of low fidelity.

A simulation was produced which presented 58 short, hypothetical problem situations (items) with a multiple-choice response format for each item.
Five alternative courses of action were listed after each item; the participant was instructed to select the one alternative he or she would most likely take and the one alternative he or she would least likely take for each of the 58 task situations. The scoring key was developed from experienced managers' ratings of the most effective and least effective alternatives. Each of the two alternatives chosen by the participant was scored as either -1, 0 or +1, depending on its identification by the experienced managers as the least effective, neutral, or most effective way to handle the situation. In a sample of 120 managerial incumbents, a correlation of .30 (p < .01) between total simulation scores and supervisory performance ratings of overall effectiveness was observed. Motowidlo et al. (1990) concluded that a carefully constructed low-fidelity simulation can yield satisfactory validity. In other words, the degree of realism provided in complex and costly high-fidelity simulations is not always necessary in order to obtain empirical validity. As a result, most organizations must carefully consider the choice between a low- versus high-fidelity instrument, particularly in view of the current economic recession.

It should be noted that the term "fidelity" as it is used here is quite different from its original meaning, rooted in psychological testing and decision theory. Communications engineering provides the framework for describing the dilemma often faced by test developers, that of a choice between "wideband" and "narrowband" tests. The more varied and fast wideband signal transmits more information, but the clarity or dependability of the information received ("fidelity") is usually less than for the slower, homogeneous message sent along the narrowband. In general, narrowband signals have greater fidelity, with fewer errors confusing the signal. 
In this context, then, fidelity refers to the thoroughness of testing to obtain more certain information (Cronbach & Gleser, 1965), rather than the more recent adaptation to describe the realism and complexity of simulations.

The clarification of the terms bandwidth and fidelity (both in the broader, traditional test theory and the more specific, recent application to simulations) is important, for it will provide a structure for classifying and describing different simulations, including in-basket exercises. That is, a simulation designed to measure a broad range of behaviours may be seen as a wideband exercise. Conversely, a simulation which focuses on one key aspect of managerial behaviour can be seen as a narrowband exercise. In addition to the bandwidth continuum, the level of fidelity can also provide a means of classifying simulations. That is, both wideband and narrowband exercises can vary in the degree of realism in the task stimulus and the manner of response. As we shall see, high-fidelity, wideband simulations likely would pose greater challenges in establishing psychometric soundness and predictive usefulness, as well as presenting practical disadvantages in terms of lengthy scoring time.

Validity in the Context of Management Simulations

The reliability and validity of managerial simulations can be evaluated in order to determine their effectiveness as assessment devices. The psychometric principles underlying the reliability of managerial simulations are relatively straightforward and easily extend to the in-basket exercise (to be described in Section 2 of the Literature Review). 
In contrast, the principles behind the validity of managerial simulations are more complex; clarification of these psychometric underpinnings will make the forthcoming extension of the psychometric principles underlying the validity of managerial simulations to in-basket exercises more comprehensible.

Validity, or the extent to which a measurement procedure does measure what it is designed to measure, can be assessed by looking at three main sources of evidence, namely, content-related evidence, construct-related evidence and criterion-related evidence (American Psychological Association Standards, 1985). Validation is the process of gathering and evaluating data to examine these sources of evidence. Two key goals of validation are to determine what the instrument measures (i.e., constructs) and how well it measures those constructs (Cascio, 1987). Because legal imperatives concern the fairness of selection measures, the following discussion of validation strategies for managerial simulations will be concerned with those instruments used for selection (rather than training, for example).

Content validity. In general, the content validity of measurement procedures refers to whether the procedures contain a representative sample of the universe of situations they are intended to reflect (content domain). Content validity is a key psychometric foundation of selection procedures; more specifically, it refers to the degree to which the content of a selection device is representative of important aspects of job-related performance (Uniform Guidelines on Employee Selection Procedures, 1978). Well-designed management simulations typically involve some type of job analysis (critical incidents, interviews, etc.) 
in order to determine the central tasks and requirements of the target job and incorporate them into the assessment device. Depending on their purpose, management simulations may be designed to measure either broad content domains (e.g., general administrative ability) or more narrow content domains (e.g., specific decision-making styles of administrators) through the use of either wideband or narrowband instruments. According to Thornton and Cleveland (1990), management simulations should compare very favorably in content validity with more traditional paper-and-pencil tests because of the careful attention typically given in the design and presentation of job-related stimulus materials and realistic response formats.

Construct validity. As applied to selection, construct validity refers to the extent to which a selection measure can assess candidates' levels of identifiable characteristics which have been determined to be important for successful job performance (Uniform Guidelines on Employee Selection Procedures, 1978). The relation between the targeted content domain and the features of a measurement instrument is an important consideration in establishing construct validity (Thornton & Cleveland, 1990). In other words, in order to ensure construct validity, consideration must be given both to the targeted managerial skills and to the type of assessment device chosen to measure those skills; some constructs cannot be adequately measured by instruments other than a simulation. For example, some essential skills involved in general managerial behaviour include interactions with personnel, demonstrating a sound decision-making process, and showing leadership, to name a few. Simulations, by allowing written role-play interactions with fictitious characters, may be more appropriate than non-situational, traditional paper-and-pencil tests for the measurement of those general managerial skills that are social or interactive in nature. 
As Thornton and Cleveland (1990) pointed out, "simulations may be necessary to engage social processes and to measure the application of social skills" (p. 195).

Perhaps an important qualifier should be added to this statement; high-fidelity simulations may be necessary to engage social processes and measure the application of social skills. High-fidelity simulations, with their open-ended response format, allow for flexibility, spontaneity, and individuality which are necessary to adequately measure socially-oriented skills. On the other hand, low-fidelity simulations, with their relatively unrealistic stimulus items and restricted, multiple-choice response format, may not be capable of tapping into the social process constructs which are important aspects of general managerial skills. The multiple-choice, questionnaire format of low-fidelity simulations can measure knowledge and beliefs about social interactions, but not the direct application of social skills through interactions with fictitious characters. Low-fidelity simulations, therefore, may be better-suited to measure more narrowly-defined, specific managerial behaviours that are not socially-oriented.

In sum, because general managerial skills, by their nature, include social skills (determined to be necessary for successful job performance), an instrument which is capable of measuring those interactive skills is necessary in order to achieve a high degree of construct validity. Therefore, we may expect that construct validity would be greatest when realistic, high-fidelity simulations are used in the measurement of general managerial performance.

Criterion-related validity. Evidence of criterion-related validity is demonstrated by selection measures shown to be predictive of or significantly correlated with important elements of work behaviour (Uniform Guidelines on Employee Selection Procedures, 1978). 
Criterion-related validation is the most appropriate and most important validation procedure to apply when measures of individual differences are used to predict behaviour. Accordingly, the psychometric principles underlying the criterion-related validity of selection measures in general and managerial simulations in particular will be considered in some detail.

Two alternative approaches to assessing criterion-related validity are available, namely, predictive and concurrent validation. If both the predictor scores and the criterion scores are available and considered at the same time, concurrent validity is being assessed. If the criterion results are not gathered until some time after predictor data are gathered, predictive validity is being measured. In essence, concurrent validation studies are concerned with assessing existing level(s) of behaviour(s) whereas predictive validation studies attempt to predict future performance level(s) of behaviour(s). Both approaches are primarily concerned with assessing the strength of the predictor-criterion relationship.

The choice of the most appropriate criterion-validation procedure (concurrent versus predictive) must take into account the purpose of measurement. If an instrument is to be used to make predictions of administrative ability, for instance, it would be most appropriate to conduct a predictive validation study using a longitudinal research design. In such a study, a typical sequence of events would be as follows: (a) assess administrative ability for job candidates, (b) select candidates without using the administrative ability results, (c) gather criterion performance data some time in the future, and (d) measure the strength of the predictor-criterion relationship. 
However, real-world conditions of economic/budgetary constraints, high rates of staff turnover, and so on, usually preclude predictive validation studies. Consequently, concurrent validation studies, where predictor scores are correlated with criterion measures for employees already on the job, are often substituted for predictive validation studies (Thornton & Byham, 1982).

Little published empirical information is available on most managerial simulations used in industry (Thornton & Cleveland, 1990) and, as a result, evidence of criterion-related validity for managerial simulations remains comparatively scarce. Considering the popularity and widespread use of managerial simulations, the paucity of published empirical validity evidence may seem surprising. However, a brief consideration of the dynamic, heterogeneous nature of managerial simulations will help explain the unexpected lack of published findings.

First, it is clear that work sample testing (and, by extension, managerial simulations) is a dynamic field characterized by recent growth and responsiveness to rapidly changing job requirements. Unlike more traditional paper-and-pencil tests, managerial simulations are capable of incorporating new technologically-based design features, such as videotaped stimulus materials and computer response formats. Moreover, in comparison to more static, homogeneous assessment measures (e.g., cognitive ability tests, personality inventories), management simulations are often more heterogeneous in both content (breadth of target behaviours: e.g., leadership ability, written and oral communication skills, personality characteristics) and form (e.g., leaderless group discussions, complex team games, etc.). This heterogeneity often results in company-specific, custom-tailored managerial simulations with comparatively limited, intra-company use. 
This limited use, in turn, translates into limited diffusion of costs for validation studies within industry, often making direct methods of establishing criterion validity (on a per exercise basis) cost prohibitive.

More indirect evidence of the criterion-related validity of managerial simulations comes from the studies of work sample validities noted earlier (Asher & Sciarrino, 1974; Hunter & Hunter, 1984; Robertson & Kandola, 1982; Wernimont & Campbell, 1968). It may be that, because many work samples are also managerial simulations, the evidence of work sample criterion-related validity has been generalized to apply to managerial simulations. However, as shown in Figure 1, not all work sample tests are managerial simulations. Moreover, given the widespread use of heterogeneous, company-specific management simulations, the generalization of criterion-related validities from broader work sample tests to particular, custom-tailored management simulations seems tenuous, at best.

Summary of Section 1: Management Simulations

The first section of the Literature Review outlined the development of the management simulation and acknowledged influences from several fields, including gaming. Because the literature affords no clear classification system, a primary purpose of this section was to provide a structure which clarified the definitions of and relationships among management games, management simulations, work sample tests and in-basket exercises. The in-basket exercise was classified as a type of gaming simulation. The historical background of management simulations was briefly related, followed by a discussion of management simulations as work sample tests. Conceptual issues and psychometric findings for work sample tests were reviewed. The in-basket exercise, as a type of work sample test, was seen to enjoy the same advantages (reduced bias and adverse impact) but also suffer from the disadvantages (reduced utility) of work sample tests. 
Issues in the design and research of managerial simulations were then explored. The notion of fidelity of simulations was introduced and discussed, followed by a brief consideration of realism in in-basket design and research. Lastly, the validity of managerial simulations was considered by looking at the three validation strategies (content-, construct- and criterion-related validation) as they apply to management simulations.

Section 2: Psychometric Properties of the In-Basket Exercise

The Seminal In-Basket Work

The pioneering work by Frederiksen and his colleagues in the early 1950's established the founding principles and procedures that became the basis for conventional in-basket design and traditional objective scoring strategies. A synopsis of this seminal work, designed as a training device for officers in the U. S. Air Force, will illustrate several central issues of in-basket design (and subsequent development) and will set the stage for the review of the psychometric properties of the in-basket exercise which is to follow. This review of published information will be presented chronologically and will include a description of in-basket design and procedural developments, as well as a summary of empirical findings.

Goal of Frederiksen's work. Although this was not the first attempt to design a situational test measuring high-level performance, the in-basket developed by Frederiksen et al. (1957) was one of the first efforts to design a group-administered instrument that was presented in written form to measure individual performance. Frederiksen et al. recognized that, in the design of instruments to measure performance in high-level jobs, test developers usually face a choice between an objective, reliably scored instrument or one which is wideband or "sensitive" (i.e., can adequately measure the broad, complex set of skills involved in high-level jobs). A primary goal of Frederiksen et al. 
was to "devise a sensitive measure which may at the same time be objectively and reliably scored ....[which].... proceeds from the faith that progress toward both goals of sensitivity and of objectivity may be made in one operation" (p. 2).

Identification of performance dimensions. The first step toward an objective and reliable scoring method was to identify those performance dimensions believed to be important aspects of effective executive performance. By studying the officers' training curriculum and by discussing with instructors the desired skills to be evaluated in the newly-trained Air Force officers, an initial set of 12 "functional categories of behaviour" (i.e., performance dimensions) was established. Four categories were selected from the 12 as the primary focus in the development of the in-basket, namely Flexibility, Efficient Use of Routines, Foresight, and Evaluating Data Effectively.

Development and selection of exercise problems. In the next step toward the development of an objective scoring system, the specific problems presented in the exercise were formulated in consultation with officers of several U. S. Air Force bases. Interviews were conducted and written descriptions of typical problems were prepared. In addition, the actual contents of administrators' in-baskets were examined. The final set of problems or items selected were designed to elicit one of the four performance dimensions listed above. The assignment of more than one problem to a particular performance dimension ensured that dimension was adequately measured. The problems chosen reflected the varying scope and complexity of administrative work, with simpler problems merely describing background information and other more complex problems (sometimes involving more than one item) describing more detailed administrative situations. 
A guiding principle in problem preparation was to minimize the amount of reading required by the participant, and so materials were designed to be brief and clear in an attempt to equalize reading comprehension skills required for optimal performance.

In order to minimize possible confounding effects of candidates' previous knowledge and/or experience with a similar situation, four two-hour in-basket exercises were constructed which required candidates to play four hypothetical roles in succession. In the first two-hour exercise, the required role was that of a Commanding Officer of a hypothetical Air Force wing. This was immediately followed by the role of the Director of Material in the second exercise, and the roles of Director of Personnel and Director of Operations in the third and fourth exercises, respectively. A comprehensive package of materials describing the hypothetical situation was provided to each candidate for each of the four exercises. Each package included detailed background information, an organizational chart, maps, a history of the particular division (e.g., Material, Personnel), and a mission statement for each division. In addition, stationery (letterhead, memo forms, paper clips, etc.) was furnished to encourage realistic responses. The broad scope of the package and the realism of the response materials were designed to stimulate candidates' complete involvement in the complex situations presented.

Development of the scoring method. The series of in-basket exercises was then administered to a "tryout group" of students training to be officers. Their responses to each problem were examined, and a list of the range of responses was derived by breaking down answers into the smallest, distinct units of action, later known as "action elements" (e.g., concurring in a recommendation, referring the problem to a higher authority). 
Only those action elements that were relevant to the predetermined functional category of behaviour (or performance dimension) for that problem were recorded. In other words, an action element may have primarily displayed Flexibility, but if the particular problem was designed to measure Foresight, the action element demonstrating Flexibility was ignored.

The entire list of action elements across all problems was then shown to two panels of expert judges in the Air Force who evaluated each action element and assigned scoring weights for the functional category of behaviour using a 5-point scale. Each point on the scale was designed to be used at least once for each problem. Using the judgements from the two panels, final scoring weights were then assigned by the test developers for each action element so that the scoring weights reflected a single functional category of behaviour. The panel of judges also ranked the importance of the problems presented in the exercise on a priority basis. The level of priority given each problem was later used in assigning scoring weights in the event a candidate failed to respond to the problem.

Significance. With some later variations, the process of striking a panel of expert judges to assign scoring weights for action elements became the established procedure in in-basket scoring key development for those researchers affiliated with Frederiksen and his colleagues at ETS. Although the exercise constructed by Frederiksen et al. (1957) was based on a situation involving the Air Force, the in-basket exercise was designed to be readily adapted for other situations and scenarios. Accordingly, as indicated in Section 1, the in-basket was soon adapted for its first industry application in the 1957 Management Progress Study. In the 1960's, the Port of New York Authority carried out extensive research and development with the in-basket exercise after adapting the problems and materials to reflect the nature of their work (Lopez, 1966). 
The Port Authority also adopted and further refined Frederiksen et al.'s expert panel process in the development and investigation of psychometrically-based objective scoring keys (this research will be explored more fully in the following review of empirical findings).

The goal and approaches used in Frederiksen et al.'s (1957) seminal work have been considered here in some detail, not only for their historical significance, but also because of their considerable influence on the present work. As noted earlier, the present work is a cross-validation of several scoring strategies, one of which is the expert panel approach initially developed by Frederiksen et al. In addition, the challenge originally identified by Frederiksen et al., to design a situational instrument with the dual features of sensitivity and objectivity, has been a driving conceptual force behind the design and evaluation of the in-basket exercise investigated in the present work. However, Frederiksen's challenge has required some modification to its original two-part formulation. Specifically, a third aspect of practicality must be added to sensitivity and objectivity because the current global recession makes cost effectiveness a paramount concern. Practical issues of training time, scoring time and administration costs demand consideration in the overall evaluation of the in-basket exercise. The present work is an attempt to design and cross-validate an instrument which meets this three-part goal of sensitivity, objectivity, and practicality.

We turn now to a review of the reported empirical findings of the in-basket exercise. An examination of the published psychometric properties of the instrument will tell us whether Frederiksen's "faith that progress toward both sensitivity and objectivity may be made in one operation..." (Frederiksen et al., 1957, p. 
2) was justified.

Introductory Comments on In-Basket Empirical Findings

Over the last forty years, the in-basket exercise has been evaluated with respect to its psychometric properties of reliability and validity. To provide some structure for these findings, the empirical results have been summarized and presented in Table 1. It should be noted that the framework for the following review of results is a variation of an outline provided by Schippmann et al. (1990).

Summarizing empirical findings for the in-basket exercise is difficult because of several sources of variation across studies. Firstly, the research setting and therefore the content of the in-basket (which often reflects the research setting) varies widely across studies; universities, the public and private sectors, the military, and the educational administration system have all been selected as settings for in-basket research. Even for studies conducted within the same setting, there may be much variation in the level and specificity of the particular target job. For example, in management settings, the in-basket exercise may be designed for entry-level, first/second level, mid- or high-level managerial positions and may be tailored to reflect general or specific target jobs. Secondly, there is great variation in the type of criterion measures not only across disparate research settings but also within a single research setting. Within management settings alone, for example, salary level, job/salary progression, and supervisory ratings are commonly-used criterion measures, each based on different metrics.

A third source of variation across studies has to do with the in-basket exercise itself. 
The exercises examined in the following review vary greatly in several significant ways, including the complexity and scope of the performance dimensions to be measured by the exercise (wideband or narrowband), the realism of the stimulus materials and response format (high-fidelity or low-fidelity), the exercise construction methods, as well as the method of scoring.

In addition to these sources of variation across studies, there is a lack of important detail in published studies regarding these sources of variation (Frederiksen et al., 1972; Schippmann et al., 1990). For instance, information regarding the development and description of outcome criteria is often not provided, and important facts concerning the scoring method used to evaluate the in-basket exercise are simply not offered.

Table 1
Review of Major Empirical Findings on the In-Basket

                                          Reliability           Validity
Study                                     Method  Range         Method  Range         Criterion

Frederiksen, Saunders & Wand (1957)       IR      .62 to .93    C       .02 to .14    Grades
                                          IR      .47 to .94    C       .01 to .25    Exam scores
Frederiksen (1962)                        SH^a    ? to .92
                                          ALT     0 to .69
Hemphill, Griffiths & Frederiksen (1962)  SH      .52 to .97    C       -.05 to .24   Supervisor ratings
Lopez (1966)                              IR      -.20 to .97   C       -.12 to .20   Supervisor ratings
                                                                C       -.15 to .18   Supervisor ratings
                                                                C       -.29 to .27   Supervisor ratings
Bray & Grant (1966)                       IR      .92           P       -.19 to .44   Salary progress
                                                                C       .45 to .76    OAR
Cross (1969)                              SH      -.39 to .95   P       -.45 to .59   Supervisor ratings
                                          ALT     -.45 to .75   C       -.33 to .72   Supervisor ratings
Wollowick & McNamara (1969)                                     P       .32           Job progression
Meyer (1970)                              SH      .50 to .95    C       -.19 to .37   "Supervision" rating
                                                                C       -.09 to .31   "Planning" rating
Bourgeois & Slivinski (1974)              IR      .54 to .93
Brass & Oldham (1976)                     IR      .64 to .95    C       .17 to .40    Supervisor ratings
                                          SH^a    .13 to .58
Hinrichs & Haanpera (1976)                ALPHA   -.04 to .73
Kesselman, Lopez & Lopez (1982)           SH      .83           C       .13 to .33    Supervisor ratings
Turnage & Muchinsky (1984)                                      P       .08           Supervisor ratings
                                                                P       .25           Career potential
                                                                P       .03           Promotions
                                                                P       .01           Salary progression
Hakstian, Woolsey & Schroeder (1986)      SH^a    .80 to .84    C       .00 to .33    Supervisor ratings
                                          IR      .80 to .99
Brannick, Michaels & Baker (1989)         IR      .71 to .94
                                          SH      .19 to .62
                                          ALT     .21 to .43
                                          ALPHA   .35 to .72
Tett & Jackson (1990)                     IR      .65 to .95
                                          ALPHA   .41 to .70

Note. Table 1 is adapted from Schippmann et al. (1990). IR = inter-rater agreement; SH = split-half reliability; ALT = alternate form reliability; ALPHA = Cronbach's alpha; C = concurrent validity; P = predictive validity; OAR = Overall assessment centre rating.
^a Corrected using Spearman-Brown formula.

Reliability of the In-Basket Exercise

Typically, the most direct measure of the reliability of an instrument is obtained by the test-retest method (i.e., administering the same instrument to the same group of subjects at two different times). Correlations of the two scores arising from this method allow computation of the "coefficient of stability" of the instrument. However, random error is often introduced with the use of stability coefficients because of variables which influence the performance of participants on one test administration but not the other (e.g., differences in testing conditions, mood of participant, etc.). Furthermore, in some situations, multiple test administrations are not feasible; this is certainly true for the in-basket exercise, where, because of logistical difficulties and high costs associated with multiple test administrations, assessment of test-retest reliability is generally not viable.

As a result, other methods have been used to assess in-basket reliabilities, including analyses of alternate forms of the same instrument, internal consistency estimates (i.e., split-half analyses and, to a lesser extent, coefficient alpha), and the degree of inter-rater agreement. 
Each of these methods will now be briefly introduced, followed by a review of the specific empirical findings for each reliability estimation method as applied to the in-basket. A more complete description of the design and methodology used in the more central studies will be provided in the following Validity section.

Alternate Form Reliability Estimation

Conceptual foundation. Alternate form reliability estimation is based on the premise that an instrument contains only a sample of selected items from a content domain. In theory, it is possible to select a different, interchangeable sample of items in order to construct two or more different forms of the same instrument which contain the same number of items at the same level of difficulty with non-significant differences in means and variances across the forms. The correlation between scores obtained from the two forms yields a reliability estimate called the "coefficient of equivalence," which takes into account error due to different samples of items. Generally, alternate form reliability estimation requires as many administrations as there are forms. The administrations are typically conducted as closely together in time as possible in order to minimize random error arising from the probable increase in confounding factors over time.

Empirical findings. Hemphill, Griffiths and Frederiksen (1962) developed and applied several forms of The Whitman School In-Basket exercise in order to assess ability in school administration. The alternate form reliabilities were reported by Frederiksen (1962), who had also developed and applied the Bureau of Business In-Basket, and compared these results with those from Hemphill et al.'s (1962) study of school administrators. The reported median r's ranged from .25 to .38 using the four alternate forms across 68 performance categories. 
Frederiksen (1972) again compared several alternate versions of an in-basket exercise and found reliability coefficients of .15, .17 and .27 for the three comparisons conducted. Brannick et al. (1989) reported reliability correlations ranging from .21 to .43 across the five performance dimensions measured by two alternate forms of an in-basket exercise.

Summary. Generally, reliability coefficients based on alternate forms of an in-basket are not strong. One possible reason for the low coefficients is that the situations presented in the various forms of the in-basket can be quite different. For instance, the study by Frederiksen (1962) involved alternate forms of an in-basket exercise describing both a school administration situation and a business situation. Reliability coefficients from alternate forms based on the school situation were higher than those coefficients based on alternate forms across both the school and business situations. Therefore, as expected, there was greater consistency in performance across those forms dealing with the same, as opposed to different, situations.

A major difficulty in interpreting alternate form reliabilities is that in-basket performance across forms has typically been evaluated by multiple raters, causing performance reliability to be confounded with rater (or scorer) reliability. Little published information is available regarding the proportion of variance contributed by in-basket scorers. Overall, the very limited data available on alternate form reliability estimation allow only a tentative conclusion of a lack of strong consistency across individual in-basket performances.

Alternate forms of instruments are recognized as costly and difficult to construct (Cascio, 1987). 
Moreover, in some situations, because of logistical difficulties, only a single administration of a test is possible, making neither test-retest nor alternate form reliability estimation the method of choice for in-basket reliability estimation. Consequently, other methods are used which are based on measuring the effect of different samples of items on reliability, namely, methods of internal consistency.

Internal Consistency: Split-Half and Alpha Coefficients

Conceptual foundation. As noted, in cases where only a single test administration is possible, reliance is placed on measuring the internal consistency of the instrument, which is most commonly assessed by the split-half and coefficient alpha methods. Split-half analyses involve several methods of separating the test into two equivalent halves and then correcting for double length by the Spearman-Brown formula. As with alternate form reliability estimation, the correlation between scores based on the two halves of the instrument (alternate forms) is interpreted as a coefficient of equivalence. In split-half analyses, a common method of creating two equivalent forms is to split the test in half by computing two separate scores for each individual based on responses to odd-numbered items and responses to even-numbered items. Other methods of splitting the test include randomly selecting the items, or selecting items consecutively to form the first half with the remaining items making up the second half (e.g., items 1-10 in the first half and items 11-20 in the second half of a 20-item test).

Coefficient alpha (Cronbach, 1951) is an additional technique used to assess internal consistency that is based on an analysis of item variances and is considered to be a measure of homogeneity (rather than equivalence) as it indicates the degree to which items within a test are intercorrelated. The coefficient alpha estimates the average correlation of all possible half-splits of items within a test (Cascio, 1987).

Empirical findings.
In the following review of split-half reliability findings, the particular split-half method used (i.e., odd-even, random) has been specified where it is known. Because each item in the in-basket used by Frederiksen et al. (1957) was designed to measure only one of the four performance dimensions, and because the total number of items measuring each category ranged from merely four to a maximum of seven items, split-half estimates of reliability were not directly calculated. Instead, a series of intercorrelations of the groups of items measuring each of the four performance categories was determined; these ranged from -.31 to .42. Later studies brought developments in item preparation and in-basket design (detailed in the Validity section) which, among other results, made the calculation of split-half reliability more direct.

Hemphill et al. (1962) reported odd-even split-half reliability coefficients for 68 stylistic categories of in-basket performance ranging from zero to .97. On the basis of these results, forty categories were retained for subsequent analyses; the reliabilities of these categories ranged from .52 to .97, with a median r of .78. It should be noted, however, that Hemphill et al. employed different raters for the different halves of the test and so consistency of performance across the two halves of the exercise was confounded with the degree of inter-scorer agreement resulting from the use of different raters. Consequently, these reliability results should be interpreted cautiously.

Frederiksen's (1962) analyses with The Whitman School In-Basket and the Bureau of Business In-Basket resulted in Spearman-Brown corrected split-half median correlations (across performance categories) which ranged from .47 to .56.
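
As an illustrative aside, the odd-even split-half estimate with the Spearman-Brown correction and coefficient alpha, as described in the conceptual foundation above, could be computed roughly as follows. This is a minimal sketch and not the procedure of any study reviewed here: the function names and the small 0/1 response matrix (rows are participants, columns are items) are hypothetical, and the rater confound noted for some studies is ignored.

```python
from statistics import mean, variance


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman_brown(r_half):
    """Step a half-test correlation up to full (double) test length."""
    return 2 * r_half / (1 + r_half)


def odd_even_split_half(scores):
    """Return (half-test r, Spearman-Brown corrected r) for an odd-even split."""
    odd = [sum(row[0::2]) for row in scores]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in scores]  # items 2, 4, 6, ...
    r_half = pearson_r(odd, even)
    return r_half, spearman_brown(r_half)


def cronbach_alpha(scores):
    """Coefficient alpha from item variances and total-score variance."""
    k = len(scores[0])
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


if __name__ == "__main__":
    # Hypothetical data: 1 = course of action endorsed, 0 = not endorsed.
    responses = [
        [1, 1, 1, 0, 1, 1],
        [1, 0, 1, 1, 0, 0],
        [0, 0, 1, 0, 0, 1],
        [1, 1, 1, 1, 1, 1],
        [0, 0, 0, 0, 1, 0],
        [0, 1, 0, 0, 0, 0],
    ]
    r_half, r_full = odd_even_split_half(responses)
    print("half-test r:", round(r_half, 2))
    print("Spearman-Brown corrected r:", round(r_full, 2))
    print("coefficient alpha:", round(cronbach_alpha(responses), 2))
```

The sketch assumes every item and every half-test total shows some variance across participants; with constant scores the correlation and alpha are undefined.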
A study by Cross (1969) also examined the Whitman School In-Basket, and, in two administrations of the exercise, odd-even split-half reliability coefficients of .57 and .49 were reported.

Meyer (1970) conducted an in-basket analysis at General Electric Company by constructing approximately 50 performance categories. Results showed that out of these 50 initial categories, only 27 demonstrated sufficient reliability for further analysis, with odd-even split-half coefficients of the reduced set ranging from .50 to .95. Like Hemphill et al. (1962), Meyer used different raters for the different halves of the test, and, for reasons provided above, these results should be interpreted with caution (Schippmann et al., 1990). In 1976, Brass and Oldham found odd-even split-half correlations ranging from .13 to .58 across six scoring categories designed to measure leadership. Stronger reliability results were seen by Kesselman et al. (1982), who reported a split-half odd-even reliability coefficient of .83 for a composite in-basket score. Hakstian et al. (1986) obtained corrected split-half reliability coefficients from .80 to .84 for the Content dimension of a two-dimension exercise. Brannick et al. (1989) reported odd-even split-half reliability coefficients for five dimensions of the two alternate in-basket forms ranging from .19 to .61 (Form A) and .20 to .62 (Form B), with an overall median r of .34.

Using the coefficient alpha to estimate internal consistency, Hinrichs and Haanpera (1976) reported an average alpha coefficient of .49 measured across 14 performance categories of an in-basket exercise. Brannick et al. (1989) computed alpha coefficients for six dimensions across two alternate forms of an in-basket exercise. The results ranged from .35 to .72, averaging .51. Alpha coefficients were also calculated by Tett and Jackson (1990) for the six aspects of participative decision-making measured in their in-basket exercise.
The alpha coefficients ranged in magnitude from .41 to .70, averaging .52 across the six performance dimensions.

Summary. Reported reliability estimates derived from the split-half method of assessing internal consistency vary greatly. The results reported by Kesselman et al. (1982) and Hakstian et al. (1986), when averaged, suggest that in-basket exercises can attain satisfactory levels of equivalence-based estimates of reliability (approximately .82). However, the median split-half reliability coefficient of .34 reported by Brannick et al. (1989) is considerably lower, in sharp contrast to the earlier, stronger findings of Kesselman et al. and Hakstian et al. Given such a wide range in reported findings, it is difficult to conclude, with confidence, whether or not the split-half reliabilities of in-basket exercises are satisfactory. In contrast, less variability is seen in the alpha coefficient results reported. Despite greater convergence, the mean coefficient alpha across studies is .51; this modest degree of homogeneity does not clearly indicate that the selected samples of items result in solid reliability.

In an internally consistent instrument (indicated by high coefficients of equivalence), items are considered to be mutually equivalent. In general, evidence of inconsistency (indicated by low coefficients of equivalence) is interpreted as error variance arising from inconsistent sampling of the content domain (Cascio, 1987). Overall, both the split-half and alpha coefficients reported here may seem somewhat [...] baskets, the findings summarized here may appear less negative, with such features as the breadth and complexity of items, cases of low within-item variance (due, in large part, to low endorsement rates of courses of action), and the numerous unquantifiable influences that affect the participant's perception of the items, constituting some possible reasons for the modest internal consistency results observed.

Inter-Rater Reliability Estimation

Conceptual foundation.
The assessment of inter-rater agreement measures the likelihood that two or more scorers will arrive at similar scores or ratings for an individual's performance. Calculation of inter-rater reliability allows an estimation of the errors of measurement attributable to scorer variance (Cascio, 1987).

Empirical findings. In the context of the in-basket exercise, inter-rater reliability was first examined by Frederiksen et al. (1957), who compared individual item scores and the total score assigned by three raters and expressed reliability in terms of the product-moment correlation coefficient between various pairs of raters (e.g., Scorer A vs. B; B vs. C, etc.). Reliability coefficients from .47 to .94 were seen. Lopez (1966) conducted a study with police lieutenants in the Port Authority of New York and reported inter-rater reliabilities for 78 performance categories ranging from -.20 to .97 with a median r of .60. When the 78 categories were collapsed to eight more global performance categories, the inter-rater correlations improved, ranging from .35 to .97 with a median r of .80. Bourgeois and Slivinski (1974) reported a median r of .86 across nine performance categories.

Two years later, Brass and Oldham (1976) reported inter-rater correlations ranging from .64 to .95 across the six dimensions measured. Brostoff and Meyer (1984) computed correlations for two raters in terms of four performance dimensions; the resulting correlations ranged from .84 to .91. Inter-rater reliability estimates for two dimensions measured in a study by Hakstian et al. (1986) were .80 and .99. Brannick et al. (1990) also reported high inter-rater agreement within each alternate form of the in-basket measured, with coefficients ranging from .71 to .94 across the two forms. Lastly, Tett and Jackson (1990) published inter-rater correlations from six categories of in-basket performance ranging from .65 to .95 with an average reliability coefficient of .82.

Summary.
The inter-rater reliability findings suggest that performance dimensions measured by the in-basket exercise can, for the most part, be assessed reliably by scorers. Across studies and exercises, the reported reliability coefficients converge to an average between .80 and .85. However, within in-basket exercises, there can be a considerable range in inter-rater values (e.g., .35 to .97 reported by Lopez, 1966). This intra-exercise variation may be the result of differing levels of specificity of the performance dimensions and the degree to which they can be clearly and objectively operationalized. For example, we would expect a dimension concerned with productivity (measured, perhaps, by counting the number of words written and/or number of actions taken) to be more reliably assessed than a dimension concerned with sensitivity in written responses (measured, perhaps, by the use of a courteous tone). The range in inter-rater reliability coefficients across studies may be due, in part, to the differences in scoring methods used (i.e., subjective, objective, combination), or the efficacy of scorer training (e.g., whether scoring manuals were available, the length of the training process).

General Summary of In-Basket Reliability Results

In summary, it appears that evidence of reliability of the in-basket has been modestly established. Alternate form reliability results are relatively low. Low coefficients of equivalence are likely due, in large part, to the comparison of in-basket exercises which are based on different scenarios and situations. Because of practical difficulties and costs associated with multiple administrations, reliability estimates based on equivalence and homogeneity of items (i.e., internal consistency) rather than stability (i.e., test-retest) are favored forms of reliability estimation.
However, there are great variations in the magnitude of the split-half reliability estimates reported for the in-basket, and internal consistency reliability estimates are generally not strong across studies. More promising results suggesting that the in-basket exercise can be reliably scored are seen in the reported findings of the degree of inter-rater agreement. Coding schemes that are designed to be more objective, with behavioural and coding rule descriptions set out in coding manuals, probably account for the considerable levels of inter-rater reliability seen across studies. However, no coding scheme for a free-format response in-basket can completely eliminate the inherent subjectivity and judgement involved in scoring an in-basket exercise. All in all, considering the evidence of the reliability of the in-basket, it appears that Frederiksen's (1957) original goal of designing an instrument with the property of objectivity, reflected by in-basket reliability results, has been adequately met.

Validity of the In-Basket Exercise

For several reasons, the following review of the validity findings of the in-basket exercise is not completely exhaustive. Not only is the literature voluminous, but the research reported in this area is also fragmented and fraught with methodological and conceptual shortcomings (Schippmann et al., 1990). Accordingly, only those studies which sufficiently report necessary descriptive (i.e., development and design) and evaluative (i.e., validity) information are included in this review. An additional consideration guiding the selection of studies discussed here is the significance of the contributions made by the work. Generally, those studies which reported unique and important developments in in-basket design or which involved large-scale applications of existing methodology were selected over those studies which simply replicated previous approaches or which involved small-scale applications, precluding the generalizability of findings.
For example, the innovative in-basket design and scoring methods behind the development of Frederiksen's (1962) Bureau of Business In-Basket were also used in the development of The Whitman School In-Basket in a much more elaborate study by Hemphill et al. (1962). Consequently, only the in-basket principles and their application in the latter study are discussed in this review.

Criterion-related Validity

Air Force in-baskets. In order to validate the in-basket exercises used in the seminal work by Frederiksen et al. (1957), they were administered to a class of 92 students at the U. S. Air Force Command and Staff School. The efficacy of the in-basket as a measure to evaluate military instruction was assessed using a total in-basket score (summed across the four previously-mentioned roles) as the predictor measure. Both scores on course grades and an external educational test were used as criterion measures. Validity coefficients of .25 with the external test and .15 with course grades were reported.

Frederiksen et al. (1957) concluded their investigation by acknowledging that the in-basket exercises they developed were disappointing in terms of their psychometric properties. Consequently, they recognized that the in-basket exercise, as it was then, was not useful for the selection of individual candidates. Instead, the exercise was useful solely for training and educational purposes. Although discouraging to the researchers, the results of the Air Force in-basket study laid an important foundation for continued research and development with the in-basket.

School administration in-baskets. Further work by Hemphill, Griffiths and Frederiksen (1962) resulted in the first evidence that in-basket exercise performance is positively associated with on-the-job administrative performance. Using the newly-developed series of three Whitman School In-Baskets (Frederiksen, 1962), they undertook a study of the administrative performance of 232 elementary school principals (137 men, 95 women).
The Whitman School In-Baskets differed from the Air Force in-baskets in both the preparation of in-basket items and the method of scoring. As was shown earlier, each item in the Air Force in-baskets was prepared to elicit only one preselected behaviour judged to be relevant to the officer training program. In contrast, each item in the school administration in-baskets was written to reflect a broad range of target behaviours that were representative of the work required in school administration.

Two general components of school administration in-basket performance were targeted for evaluation, namely, style and content. Content referred to what specific actions were taken (e.g., praising a teacher), whereas style referred to the manner in which, or how, actions were taken in handling an item (e.g., using a courteous tone). These two aspects of administrative performance were not seen as independent, for it would be unlikely that praising a teacher (content) would be done discourteously (style). The inclusion of the content scoring category is notable, for it is a unique contribution of this study. Identifying what specific actions were taken allowed the important development, introduced in subsequent studies, of evaluating the appropriateness of actions.

In scoring for style, a total of 68 scoring categories were selected based both on observations of school principals at work and on theories of leadership. For example, a leadership theory of Hemphill (1958) involved concepts of initiation of structure and consideration which resulted in the in-basket stylistic categories of Initiates a New Structure and Courtesy to Subordinates. Other observation-based categories included Postpones Decisions, Asks Subordinate For Information, and Sets a Deadline. Each in-basket item then was measured using each of the 68 stylistic scoring categories and the overall score for a certain stylistic category was the number of times a particular style was evidenced across all items from all four in-basket exercises (132 items in total).

Following odd-even split-half reliability analyses, 28 of the original set of 68 stylistic categories were eliminated, leaving a final set of 40 categories. A factor analysis of these 40 stylistic categories revealed eight stylistic factors: Exchanging Information, Discussing before Acting, Complying with Suggestions made by Others, Analyzing the Situation, Maintaining Organizational Relationships, Organizing Work, Responding to Outsiders, and Directing the Work of Others.

In scoring for content, rather than style, the major courses of action taken by the school principals were identified for each in-basket item (e.g., refers to secretary, communicates with superintendent's office). The number of courses of action derived for any one item was arbitrarily limited to ten (no reason was provided). A value of 0 or 1 was recorded for each course of action, with 1 indicating the selection or endorsement of a particular course of action for that item. Several methods were then considered in scoring the courses of action, or content component, of in-basket performance. An interesting, but unsuccessful, effort was made to derive content scores by evaluating the appropriateness of the courses of action taken as solutions to the problem presented in the item. This approach was abandoned because of several unsuccessful attempts to obtain a consensus among qualified judges of the appropriateness of the courses of action for an item.
However, this panel-based logical approach of evaluating the appropriateness of actions was successfully adopted in subsequent studies and became the leading method of scoring the content of in-basket performance.

Ultimately, two methods of scoring for content of responses were adopted in the school administration study using the categories of Imaginativeness and Organization Change. A score of 1 for Imaginativeness was applied to those courses of action identified by scorers to be good, creative ideas that go beyond the courses of action immediately apparent in the exercise. A score of 1 for Organization Change was applied to those actions which reflected the introduction of, or consideration of, changes in the policies, practices or procedures of the organization.

In addition to scoring for style and content, Hemphill et al. (1962) devised a third scoring component that was seen as more global and subjective than the other two, more analytical, components. This third component required scorers to make impressionistic ratings of school principals' performance using 21 pairs of adjectives (e.g., friendly versus aloof, logical versus intuitive, witty versus humourless) and to select the one adjective from each pair which best described performance in the in-basket. The score for each adjective-pair was based on the number of times the first adjective in the pair was checked by the scorer (further rationale was not provided). On the basis of reliability estimates, 10 of the 21 pairs of adjectives were retained for subsequent analyses.

An additional development in the school administration study was the first large-scale introduction of a "Reasons for Action" form, which required candidates, upon completion of the in-basket exercise, to state very briefly what was done for each item and why. The primary purpose for the inclusion of the form was to clarify the actions taken and the intent behind the responses made in order to facilitate more accurate interpretation and scoring of responses.
For instance, one of the 68 in-basket style categories was Number of Items not Attempted. It was recognized that the fact that a candidate did not actually write in response to an item did not mean that no action was attempted. On the Reasons for Action form, the candidate may have clarified that action was postponed because it was not an urgent problem. Postponing a decision, in this study, was considered a scorable response.

Several criterion measures were employed in the school administration study, including ratings by subordinates (teachers) and ratings by superiors (superintendents). Because of their critical importance in actual administrative situations, only the criterion-related validity results from performance ratings by the superiors of the principals will be related here. The principals were evaluated on 13 performance appraisal dimensions, such as Interest in Work, Ability to Get Along with Parents, and an Overall General Impression rating. Each performance dimension used a 5-point subjective scale, multiplied by a number reflecting the judge's level of confidence about the rating.

Correlations between the superiors' Overall General Impression rating and scores from 32 stylistic categories were calculated (no reasons were provided for eliminating 8 of the final set of 40 stylistic categories). Only 4 of the 32 correlations were significant at the .01 level. Composite scores for the eight stylistic in-basket factors (listed earlier) were also approximated and correlated with the Overall General Impression rating. Here, correlations ranged from -.05 to .24, with two factors--Discussing before Acting and Exchanging Information--reaching significance. Lastly, the impressionistic ratings of the ten adjective pairs yielded validity coefficients ranging from -.15 to .17 when correlated with the superiors' Overall General Impression (three pairs reached significance).

Overall, the criterion-related validity coefficients for the school administration in-basket exercises were low.
However, important methodological developments in item preparation and scoring approaches (i.e., the notion of a content category, Reasons for Action form) set the groundwork for a more psychometrically-based design and scoring approach researched in the Port of New York Authority studies.

The Port of New York Authority studies. An important series of studies by the New York Port Authority represented the first use of the in-basket exercise as a means of assessing candidates for promotion to higher levels of management. Unlike previous training or research applications, in-basket performance for candidates in the Port Authority studies held serious, personal consequences because of the application of performance results to selection and promotion. These studies have been summarized by Lopez (1966), who provided much descriptive, but little quantitative, information (Frederiksen et al., 1972).

Historically, the Port Authority was committed to using objective assessment methods for selection and promotion and was actively engaged in researching a number of assessment devices. In the late 1950's, they determined that their assessment program was not measuring important skills such as critical thinking, organization of ideas, and communication skills. In 1960, the Port Authority then contracted with ETS (specifically, Frederiksen and Hemphill) to develop an in-basket exercise which would measure these and other skills for the evaluation of police lieutenants for promotion. The conclusion drawn from this evaluation was that, although seen as an improvement to its assessment program, the in-basket exercise was too complex and too costly to score (fully trained scorers required three hours to evaluate each exercise). Accordingly, until a larger-scale application was required, the in-basket exercise would not be used.

Four years later, the Port Authority initiated a comprehensive management selection and promotion program for both junior- and middle-level management.
This was seen as an excellent opportunity to conduct further research and development with the in-basket. Consequently, in 1964, Lois Crooks of ETS developed two in-basket exercises that were designed to assess predicted success in administrative and facility management jobs, respectively. The completed in-basket exercises were scored by ETS, who employed 42 of the 68 stylistic performance categories generated in the school administration study (Hemphill et al., 1962). Eight stylistic scoring factors similar to those identified by Hemphill were derived from the 42 categories. In addition, a "content" category scoring key was successfully developed using a panel of subject matter experts to logically assign a weight of -1, 0 or +1 for appropriateness of each course of action (no further development details were provided). A Productivity category was also scored, based on the number of items attempted, the number of courses of action taken, and the number of words written. Finally, an impressionistic rating by the scorer of how well the candidate would perform on the job was also made.

The criterion measure used in the study was a composite score derived from the ratings of 10 supervisors, calculated for each candidate. In the sample of 58 administrative service candidates, no significant correlations between any of the in-basket categories (or factors) and supervisory ratings were seen. However, in the sample of 97 facility management candidates, the impressionistic rating, Productivity, Content and several stylistic factors were significantly related (at both the .05 and .01 levels) to the supervisory rating. Significant correlations ranged from -.29 to .27.

The following year, the Port Authority's management selection and promotion program was again administered. However, in 1965, a different in-basket exercise was used, rather than the ones developed for the 1964 program. The "Ama Company, Inc." in-basket had been developed several years before for the American Management Association.
Lopez (1966), who was involved in its development, confirmed that this in-basket was based on standard ETS principles and had already been administered to over 2,500 executives. In a departure from the traditional ETS scoring approach, an innovative feature of the Ama in-basket was the inclusion of a multiple-choice, self-report questionnaire used to score the exercise. The in-basket "Action Report Form" was developed as a way to efficiently and economically meet the significant scoring demands of large-scale in-basket applications. Upon completion of the exercise, the candidate was asked to respond "yes" or "no" to a total of 822 statements provided to describe all possible actions which could be taken for all problems. Little provision was made for "unusual" responses because it was believed that, with statements based on tabulated responses from 2,500 participants, it would be unlikely that subsequent participants would respond in a significantly different way. A sheet was provided for unusual responses to be recorded, if necessary, but it appears they were not scored.

Few details are available regarding the scoring system developed and applied in scoring the Action Report. It is known that initially four in-basket performance factors (Judgement, Output, Social Style and Leadership), based on 12 stylistic categories, were scored. Analyses of the intercorrelations among categories and factors from a combined sample of Port Authority candidates and management executives from the 1965 American Management Association's Management Course (totalling 726 participants) led to refinements of the performance categories. Specifically, a final set of three factors called Organization of Work, Productivity, and Delegation were identified and retained. It is not clear, however, how the factors were derived or how the 822 yes/no statements were weighted or assigned to the categories or factors.
Lopez (1966) simply stated that the Action Report used a scoring system that was based on ETS procedures and previous research.

A complication with the reporting of this research was that, in the validation of the Action Report, the in-basket performance results and criterion data from 150 Port Authority candidates were combined with results from a study of 93 management executives. The criterion measure used was a composite score, summed across seven job performance dimensions (e.g., interpersonal competence, emotional maturity, etc.). The ratings were completed independently by the supervisors of the combined sample of 243 participants. Results of correlations between the composite job performance measure and the three final in-basket factors revealed significant relationships for Organization of Work (r = .20) and Productivity (r = .25) but no significant result with Delegation was seen.

In sum, the Port Authority's landmark use of the in-basket exercise for personnel decisions was an historic step toward later widespread industrial applications of the instrument. The conclusions regarding the costly and complex nature of in-basket scoring from the 1960 police lieutenant promotion study clearly identified the scoring system as requiring further research and modification. Subsequently, the results of the in-basket exercises used in the 1964 management evaluation program provided limited evidence of criterion-related validity, although the exercise did prove to be a very popular assessment device among candidates. Therefore, the in-basket exercise was retained as part of the management evaluation program and the scoring system was further refined in the following year's evaluation program.

After the 1965 application, Lopez (1966) optimistically concluded that the introduction of the Action Report form scoring method for in-basket evaluation was a promising way of assessing administrative ability and that it may represent a breakthrough in in-basket scoring.
A careful consideration of the findings, however, suggests his optimism was not warranted. As noted in the Introduction, a key concern with the use of self-report questionnaires as the basis for in-basket scoring is whether candidates' actions from the exercise will be accurately reflected in the self-report. Either unintentionally, through carelessness, or intentionally, through choosing a more desirable course of action suggested by the questionnaire, the completed scoring form may be distorted compared to candidates' actual performance. Lopez (1966) did little, empirically, to alleviate these concerns of motivational distortion and deception. The only information he provided about efforts to investigate such a serious disadvantage with self-report scoring was to assert that "later experimentation proved that these concerns were groundless, even in the competitive atmosphere of the assessment situation" (p. 109). This assertion remains questionable, because, to date, there appear to be no published empirical examinations of the extent or effects of deception or motivational distortion with the use of in-basket self-report forms. The self-report, questionnaire method of scoring was not widely adopted and did not represent the breakthrough Lopez had hoped for. Nevertheless, Lopez continued to research the self-report scoring format and his additional findings will be reported shortly (Kesselman, Lopez & Lopez, 1982).

The Management Progress Study. The historical significance of this major longitudinal research of managerial assessment and career development by AT & T has been acknowledged in Section 1 of the Literature Review. Following the design principles of predictive validation previously outlined, the original assessment centre data gathered on 422 subjects were not released to company officials.
Neither their performance nor their subsequent annual evaluations had any influence on the careers of the men being studied; hence, there was no contamination of subsequent criterion data by the assessment results.

The in-basket exercise used in the study was developed with the assistance of ETS. An interesting and unique deviation from standard ETS objective scoring principles (typically based on action elements) was that the evaluation of in-basket performance in the Management Progress Study was solely subjective, conducted by means of a 45-minute interview with the participant upon completion of the exercise. In the interview, questions were raised with participants concerning the approach they took in the exercise, the reasons for specific actions taken, and their views of superiors, peers, and subordinates. After the interview, the rater prepared a detailed, written summary of in-basket performance. A set of guidelines for conducting the interview and a manual for report preparation were made available for raters to make the subjective evaluation process more reliable. It should be made clear that the in-basket exercise was not scored in any quantitative way during the assessment component of the study.

However, to determine the empirical relationship between in-basket performance and progress in management, a simple method to quantitatively evaluate the in-basket reports was derived. This post hoc method required two raters to independently read the performance reports and rate overall performance on a 5-point scale. By mutual agreement, a composite rating for each participant was determined. In 1966, ten years after the initiation of the program, the first set of predictive validity findings was published.
At that time, data from 125 of the original 274 college-educated men and 144 of the original 148 non-college educated men were available. The criterion measure used in this analysis was salary progress, determined by taking the difference between salary at the time of assessment and that as of June 30, 1965. Across samples from seven telephone companies, in-basket validity coefficients ranged from -.19 to .44, with two coefficients reaching significance (.27 and .44).

Bray and Grant (1966) concluded that the two situational techniques (the in-basket and The Manufacturing Problem described on page 13) used in the Management Progress Study produced reasonably reliable results. In addition to evidence of predictive validity, both procedures were shown to have significantly influenced assessment staff evaluations (evidenced by correlations between situational test performance and assessment staff judgements). Bray and Grant were confident that neither technique could have been omitted without losing important information and therefore, despite their high costs, continued use of situational exercises in assessment centres was justified.

Cross' work with the school administration in-basket. Furthering the research of Hemphill et al. (1962), Cross (1969) conducted an investigation into the predictive and concurrent validity of the Whitman School In-Basket used in the school administration study. With a sample of 14 school principals, the stability of stylistic in-basket scores and their relation to on-the-job measures of performance were examined by the administration of the first in-basket (labelled the predictive exercise), followed by a second administration of the same in-basket (labelled the concurrent exercise). Participants were given the predictive in-basket when they were involved in an administrator preparation program from 1961 to 1964 (not all subjects were involved in the program at the same time).
In 1966, the second, concurrent in-basket was administered after participants had been working in the school system for lengths of time ranging from 2 to 5 years. This research design suggests a test-retest reliability paradigm allowing measurement of the stability of in-basket performance. However, because the time interval between administrations exceeded the six-month maximum for assessing test-retest reliability advised by Anastasi (1982) and because the time interval was not consistent for all participants, it is not instructive to consider the correlations between predictive and concurrent in-basket scores.

Cross (1969) scored the in-basket responses using 26 of the 40 original stylistic categories identified by Hemphill et al. (1962); the strategy for selecting the 26 categories was not reported. The empirical validity of the scorers' impressionistic ratings from the same ten adjective pairs used in Hemphill's earlier study was also re-examined by Cross. No attempt was made to score in-basket performance for content. The criterion measures were based on multiple observations of the principals at work, interviews with the principals, and analyses of samples of their written work. Separate criterion measures were determined for each of the 26 stylistic in-basket categories. Little additional descriptive information on the measurement of the criterion variables was provided, except Cross' statement that "rules for scoring the stylistic measures of on-the-job behaviour were extrapolated from procedures for scoring stylistic in-basket performance" (p. 27). The reliabilities for these on-the-job stylistic behaviours were based exclusively on the observational data and were determined by correlating results across several observation periods.
The reliability coefficients across the 26 categories ranged from -.04 to .87, and, in Cross' view, all but two were at an acceptable level.

For the predictive in-basket exercise, the validity coefficients ranged from -.45 to .59 with only two stylistic categories reaching significance (at the .05 level). Similarly, only two categories from the concurrent in-basket were significantly correlated with their corresponding measures of on-the-job behaviour, resulting in coefficients of .60 and .72. Scorers' impressionistic ratings of in-basket performance showed slightly more promising, but still weak, evidence of empirical validity with 3 of 20 correlations (10 predictive and 10 concurrent coefficients) reaching significance.

Cross (1969) concluded that the in-basket exercise was best used as a training device instead of a device for predictive purposes. In considering whether the in-basket exercise could act as an effective selection device, Cross recognized that "the data do not permit a resounding affirmative response" (p. 28). It should be noted that the sample size in this study was exceedingly small and that conclusions regarding the lack of empirical validity for selection purposes using this instrument could only be tentative, at best.

IBM study. Wollowick and McNamara (1969) conducted a study of 94 male lower-to-middle managers from a large electronics firm (International Business Machines, or IBM) in order to determine the relative validities of the components of an assessment centre. The in-basket exercise was included as an individual, situational measure in the assessment program. The in-basket performance dimensions rated were Planning and Organizing, Self-Confidence, Decision-Making, Risk-Taking, Oral and Written Communication, and Administrative Ability, which were measured subjectively in an interview with an observer who inquired as to the actions taken by the candidate and the reasons for his decisions (no further scoring information was provided).
The criterion measure used was the increase in managerial responsibility, measured by change in position level three years after the participants' initial involvement in the assessment program. Wollowick and McNamara reported a significant (p < .01) correlation coefficient of .32 between a composite in-basket score and increase in managerial responsibility.

The General Electric research. As with many earlier studies, Meyer's (1970) in-basket analyses at the General Electric Company were the result of collaboration between the company and ETS, in a specific effort to resolve selection problems for a particular middle management position. Following the in-basket design methods developed by Frederiksen et al. (1957) and Hemphill et al. (1962), approximately 50 stylistic categories of administrative performance were generated (e.g., Makes a Concluding Decision, Involves Subordinates, etc.). As discussed in the Reliability section, only 27 of the 50 stylistic categories were judged to have had adequate reliability to warrant retention in the study. The 27 categories were then factor analyzed and four in-basket factors emerged: Preparation for Decision, Taking Final Action, Organizing Systematically, and Orienting to Subordinate Needs. Finally, the Scorer's Rating (a subjective measure of overall performance based on the scorer's impression of the managerial skills shown in handling the items in the in-basket) was also examined as a predictor.

The criterion measures used in this study were generated from performance ratings of the 81 participating middle managers by their superiors on 12 "key areas of job responsibility" (these were not further described). A factor analysis was then conducted on these areas of job responsibility, yielding two job-performance factors: Supervision (concerned with human relations) and Planning-Administration (concerned with intellectual tasks).
These two job-performance factors were then correlated with the scores from the 27 stylistic categories and with the four in-basket factors. At the five percent level, significant correlations between ratings of the Supervision factor and 3 of 27 performance categories were seen, ranging from .22 to .31, whereas the Planning-Administration factor yielded significant validity coefficients on 7 of 27 categories, ranging from .22 to .35. Two out of the four in-basket factors were significant when correlated against the Supervision factor (r = .25 and .32); similarly, two out of four in-basket factors were significantly correlated with the Planning-Administration factor (r = .31 and .40). Interestingly, the impressionistic Scorer's Rating yielded a greater, but nonsignificant, validity coefficient than two out of four in-basket performance factors when correlated against the Supervision factor (r = .21) and yielded a significant correlation greater than three out of four in-basket factors when correlated against the Planning-Administration factor (r = .37).

In addition to scoring for style, Meyer (1970) also measured the content of in-basket responses with an innovative and purely statistical, empirical approach rather than a logical, panel-based approach. More specifically, he compared those courses of action taken by the sample of 81 managers who had received above average (median) on-the-job performance ratings on the two job-performance factors with those who had received below average performance ratings. Values were then assigned to each course of action based on its correlation with each of the two criterion measures (job-performance factors). No further information was provided regarding this unique process, other than Meyer's statement that "weights were assigned and scores derived from courses of action in the same way as would be the case if empirical keys were developed for scoring item alternatives in a biographical inventory" (p. 302).
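The general idea of empirical keying described above can be illustrated with a minimal sketch. Because Meyer (1970) reported no further procedural detail, everything specific here is an assumption made for illustration only: the function names, the action labels, the invented ratings, the `min_gap` threshold, and the simple +1/0/-1 weighting based on the difference in endorsement rates between the above-median and below-median groups. It should not be read as a reconstruction of Meyer's actual weights.

```python
from statistics import median


def empirical_key(actions_taken, criterion, min_gap=0.2):
    """Derive a +1/0/-1 weight for each course of action.

    actions_taken: one set of action labels per manager (hypothetical labels).
    criterion:     one job-performance rating per manager, in the same order.
    min_gap:       assumed threshold on the difference in endorsement rates.
    """
    cut = median(criterion)
    # Split the sample at the median of the criterion measure.
    high = [a for a, c in zip(actions_taken, criterion) if c > cut]
    low = [a for a, c in zip(actions_taken, criterion) if c <= cut]
    if not high or not low:
        raise ValueError("both criterion groups must be non-empty")

    key = {}
    for action in sorted(set().union(*actions_taken)):
        p_high = sum(action in a for a in high) / len(high)
        p_low = sum(action in a for a in low) / len(low)
        gap = p_high - p_low
        # Actions favoured by high performers earn credit; those favoured
        # by low performers lose credit; the rest are unkeyed.
        key[action] = 1 if gap >= min_gap else (-1 if gap <= -min_gap else 0)
    return key


def content_score(actions, key):
    """Sum the keyed weights of the actions one candidate took."""
    return sum(key.get(action, 0) for action in actions)


# Invented data: four managers, three possible courses of action.
sample_actions = [{"delegate", "call"}, {"delegate"}, {"ignore", "call"}, {"ignore"}]
sample_ratings = [5, 4, 2, 1]
key = empirical_key(sample_actions, sample_ratings)
```

A key built this way capitalizes on chance in the sample used to derive it, so scores from the key are only meaningful when computed on a different sample from the one that produced the weights.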
In this way, two empirically-derived content keys for the courses of action were developed, one based on correlations with the Supervision factor and one based on correlations with the Planning-Administration factor. Validity coefficients between the scores derived from these two empirically-derived content keys and the two job-performance factors were not determined for the sample of 81 managers.

However, a second, smaller sample of 45 middle-level managers was administered the same in-basket exercise and content scores based on the two keys were calculated. Only the Planning-Administration content key showed evidence of criterion validity, with a significant correlation of .31 with the Planning-Administration job-performance factor. Additional correlations with scores based on the 27 individual stylistic categories were not calculated because, in Meyer's view, this second "cross-validation" sample was too small. Instead, 17 stylistic categories from two of the four in-basket factors (Preparation for Decision and Organizing Systematically) were combined to form a composite factor score, which was then correlated with the Planning-Administration job-performance factor in this second sample, resulting in a validity coefficient of .38 (significant at the .01 level). The Scorer's Rating was also re-correlated with the Planning-Administration job-performance factor; the validity coefficient increased from .37 in the first sample to .43 in this second sample.

Meyer (1970) concluded that the in-basket "might serve as a valuable aid in the selection of managers" (p. 306) and emphasized its unique advantage (compared to ability or aptitude tests) of high face validity. This distinctive feature is important, he claimed, because it made the exercise more acceptable to candidates and it resulted in straightforward interpretation which facilitated the feeding back of results to the candidates.
Meyer agreed with Lopez (1966) that a negative consequence of high face validity is that few studies of criterion-related validity of the in-basket have been conducted, despite its widespread use in managerial assessment. The results from Meyer's study indicated that certain styles in handling in-basket items correlate with on-the-job managerial performance. Despite these positive findings, Meyer called for additional research into the predictive validity of the in-basket, recognizing that, of the few in-basket studies conducted to date, most, including his, had been concurrent validation studies. Although Meyer did not emphasize the importance of his empirically-derived content scoring keys, this innovative scoring development forms the basis of one of three scoring strategies assessed in the present study.

A "leadership" in-basket exercise. Brass and Oldham (1976), like Lopez (1966) and Meyer (1970), cited the paucity of in-basket criterion validation research. Brass and Oldham further claimed that the few findings which had been reported had not adequately established whether in-basket performance corresponded to actual managerial performance; "few of the correlations between scores on the various in-basket scoring categories and managers' on-the-job performance ratings have reached acceptable levels of significance" (p. 652). They speculated that a chief reason for these disappointing validation results may have been that in-basket scoring categories used by researchers did not accurately reflect effective managerial behaviour. According to Brass and Oldham, scoring well on in-basket performance dimensions bore little or no relation to effectively performing one's duties as a manager.

The target, therefore, for their research efforts was the derivation of the dimensions or categories used to score the in-basket that would more accurately reflect effective managerial behaviour.
Brass and Oldham's (1976) approach was to score participants on behavioural dimensions that have been shown to reliably predict managerial and subordinate performance. Specifically, they identified leadership ability as a critical measure of managerial effectiveness. The basis for the derivation of the in-basket dimensions was a theory of "environmental control" involving the degree to which the manager enhanced the work motivation and performance of subordinates by controlling their work environments (Oldham, 1976). Accordingly, by operationalizing aspects of leadership based on this notion of environmental control, six specific leadership activities were selected as in-basket categories: Personally Punishing, Personally Rewarding, Setting Goals, Placing Personnel, Designing Job Systems, and Designing Feedback Systems.

A sample of 71 male first-level foremen from a large American manufacturing company were administered this leadership in-basket exercise designed specifically for this study. A scoring manual was developed which set out guidelines and specific rules for scoring each of the six leadership activities in relation to the 28 items comprising the in-basket. Following ETS principles of item preparation, each of the items was designed to provide an opportunity for one or more of the leadership activities to be demonstrated. Scorers examined the response(s) to each item and assigned a 1 or 0 to reflect whether or not each of the particular six leadership categories had been evidenced by the participant. Multiple occurrences of a particular leadership activity within one item were not scored and so the maximum score for a single leadership category was 28. No other aspects of in-basket performance (i.e., scorers' impressionistic ratings, and content, as opposed to style) were measured.

The criterion data used by Brass and Oldham (1976) were supervisory ratings provided by the immediate supervisors of the foremen.
By adapting procedures suggested by Smith and Kendall (1963), eight measures of managerial effectiveness were derived, each based on critical incidents of both effective and ineffective foreman behaviour. The supervisors were asked to sort the incidents into their appropriate performance categories and scale them on a 7-point effectiveness continuum. A single composite measure of managerial effectiveness was then derived, based on an average of the eight measures. Reliability results for the criteria were not reported.

When correlated against the composite criterion measure, four of the six leadership categories were positively and significantly related, with validity coefficients of .24 (p < .05), .27, .28, and .34 (p's < .01). Foremen who personally rewarded their subordinates for good work, who punished subordinates for poor work, who set specific performance objectives, or who enriched their subordinates' jobs also tended to be rated by their supervisors as effective performers. Brass and Oldham (1976) recognized that, because some of the leadership categories were used infrequently, some of the validity coefficients may have been inflated (frequency data were not provided). For this reason, they advised caution in the interpretation of the validity coefficients.

Nevertheless, Brass and Oldham (1976) firmly concluded that the results of their investigation of the criterion-related validity of the in-basket were substantially stronger than past findings. They asserted that such strong findings were due, in large part, to their focus on the derivation and selection of in-basket scoring categories which were relevant to actual managerial performance. It may also be that using a narrowband instrument (leadership ability is only one of several components of general managerial behaviour) allows a tighter, more controlled operationalization of the scoring categories, making measurement more objective and accurate.
However, the relatively restricted measurement of managerial behaviour used in Brass and Oldham's leadership in-basket may preclude the generalizability of their positive findings to situations where managerial performance is more broadly defined (as is typically the case).

A return to the self-report scored in-basket exercise. Continuing the novel approach reported by Lopez (1966), Kesselman et al. (1982) designed an in-basket exercise scoring format based on a self-report questionnaire (the Action Report form). A group of 85 first-line supervisors (65 male and 20 female) from a utility company, upon completion of the exercise, were provided with a list of all possible actions for each of the 26 items included in this exercise. The list was constructed by first grouping possible actions globally (e.g., "I communicated with Shivers as follows:") and then providing more specific options within that grouping (e.g., "#171. Contact Riddle, determine priority and follow-up on request; #172. Have Riddle send me his performance data for review"). There were, in total, 684 possible actions listed. Respondents completed the Action Report form by choosing the appropriate numbered option(s) that corresponded to the action(s) they took. In addition, they were asked to rate the priority of each item and provide the reasons for the action taken. The Action Report form included an "Unusual Action" form wherein candidates could add actions that were not already listed. Kesselman et al. reported that only 4% of candidates used the Unusual Action form, and nearly all of the unusual responses could be classified into one of the 684 listed actions. A total of three hours was required for administration, with two hours needed to complete the in-basket exercise and one hour to fill in the scoring report form.

The participants' supervisory positions involved two job areas, classified as administrative and physical/technical.
Standard job analysis techniques (i.e., interviews, questionnaires, work samples) were employed in both job areas to determine the in-basket performance dimensions to be measured and to provide information for in-basket item preparation. In addition, a method called Threshold Traits Analysis (TTA) was used. TTA lists 33 traits thought to be necessary for effective job performance and requires supervisors of the job being analyzed to determine the relevance, level, and practicality of each trait (Lopez, Kesselman & Lopez, 1981). The results from both the TTA and standard job analysis techniques provided a list of specific traits required for effective job performance across both job areas. Three traits were selected and the following three in-basket dimensions, based on the three traits, were developed: (a) Problem-Solving, (b) Planning, and (c) Decision-Making. Because of high intercorrelations among these dimensions, scores were subsequently summed to yield a total in-basket score.

Kesselman et al. (1982) selected ten middle managers to form a panel of subject matter experts who were asked to assess the appropriateness of each of the 684 possible actions and assign scoring weights, by group consensus, based on a scale from 0 (inappropriate) to 3 (highly appropriate) for each of the three in-basket dimensions. In addition, each panel member was asked to rate each action on a 1 to 4 scale reflecting the priority of the item (immediacy of action required). Final scoring weights for each possible action were then determined by multiplying the appropriateness value by the priority rating.

Next, to establish the job performance criteria, eight performance dimensions, based on two separate measures, were completed by the immediate supervisors of the participants. The two separate measures both identified a set of traits related to job performance, but used two different measurement methods.
The first performance measure used a graphic scale to measure job performance on each of the 33 traits based exclusively on the TTA (no further details were provided). The second performance measure included traits from the TTA as well as traits derived from standard job analysis techniques. Two 6-point behavioural observation scales were used to measure job performance based on the set of combined traits, one for the administrative and one for the physical/technical positions. The two types of performance measures (also called forms) were correlated, and an average "inter-form" reliability of .70 was reported. No further criterion reliability information was provided. Ultimately, four of the eight job performance dimensions were chosen from each of the first and second measures. In addition, an Overall Job Performance rating for each of the two measures was determined. The results using this criterion measure (only) will now be related.

The validity coefficients reported by Kesselman et al. (1982), based on correlations between the total in-basket score (summed across the three in-basket dimensions) and the Overall Job Performance rating from the two forms, were .33 and .27, both significant at the .01 level. Corrected coefficients, adjusted for attenuation in the criteria, were also reported, based (apparently) on the earlier inter-form reliability estimate of .70. The corrected correlations were .38 and .31. The individual validity coefficients for the three in-basket dimensions were not reported.

In sum, Kesselman et al. (1982) sought to clarify what they believed were mixed criterion-related validity results reported for situational exercises. In their view, the limited research evidence did not unequivocally support the use of situational exercises over paper-and-pencil tests. At the same time, they also sought to develop a scoring format which was less costly and less time-consuming than other in-basket scoring procedures. Kesselman et al.
concluded that, with their refinements to and application of the Action Report form, the results showed that their in-basket scoring method reliably, accurately and economically predicted managerial ability. However, the same problems of deception and motivational distortion which plagued Lopez's (1966) self-report form remained unaddressed. Kesselman et al. acknowledged that future research should be directed to address both the extent and impact of deception in the self-report format. As noted earlier, despite continued development and application of self-report scored in-baskets, there appear to be few, if any, published empirical examinations of the extent or effects of deception or motivational distortion. In addition, the use of group consensus in this study to assign scoring weights to the courses of action is a questionable methodological practice because of the likely effects of undesirable group dynamics in what should be an objective, empirical process. According to ETS, factors like the undue influence from one or two "strong personalities" within the group, "political" pressures from using members of different hierarchical levels within the company, and physical fatigue leading to a "let's just get this done" approach are some examples of the serious, subjective flaws inherent in the process of group consensus to determine scoring weights (K. Abbey, personal communication, May 8, 1991). Consequently, ETS typically solicits individual, independent (as opposed to group-based, inter-dependent) ratings from panel members to determine scoring weights for the appropriateness of the courses of action.

A predictive validation study. A large-scale study by Turnage and Muchinsky (1984) examined the predictive validity of assessment centre evaluations and several "traditional" predictor variables in predicting managerial performance.
The in-basket exercise was one of a battery of assessment devices administered, over a four-year period, to a sample of 799 employees (92% male) in a large manufacturing firm. No details were provided regarding the design or scoring of the in-basket; it is not known whether a purely subjective (i.e., interview) or a combination (interview and objective) scoring method was used. Ten criterion variables were used, including salary progress, promotions, career potential ratings, and standard job performance ratings. As with the predictor variables, little specific information (such as development, descriptions, reliability, etc.) was provided regarding the criterion variables. None of the traditional predictor variables, including the in-basket exercise, was found to be predictive of salary progression, job performance ratings, or promotions. In-basket performance, however, was found to be significantly related (at the .01 level) to ratings of career potential (r = .25).

A quickly-scored in-basket. Hakstian, Woolsey and Schroeder (1986) developed an in-basket exercise for the prediction of first-level supervisory performance which could be scored quickly by hourly workers in a provincial telephone company. From an initial pool of six in-basket performance dimensions, Productivity, measured by the number of items dealt with and the approximate number of words per item written (summed over all items), and Content, measured by an evaluation of 10 of 22 items administered as part of the exercise, showed evidence of concurrent validity. The Content scoring key was generated by a three-member panel who examined each of the ten items and derived a series of 2-point, 1-point and 0-point responses.
Unanimity among the panel members was needed to make final Content value assignments.

The criterion measures were provided by a 5-point behaviourally-anchored performance appraisal which measured seven performance dimensions and was completed by the immediate supervisors of the 238 participants (128 men and 110 women). Two global criterion measures, Overall Work Performance and Overall Interpersonal Skills, were generated by taking the first principal component of scores for the appropriate section of the performance appraisal. When correlated with Overall Work Performance, the in-basket Productivity dimension yielded validity coefficients, significant at the .01 level, of .23 (for males) and .26 (females). With this same criterion measure, the in-basket Content category was not significantly related to performance with either gender. Using Overall Interpersonal Skills as the criterion, only the Productivity dimension applied in the female subsample resulted in a significant correlation (r = .25).

Summary of criterion-related validity evidence. The in-basket research reviewed here may appear to present substantial evidence of criterion-related validity. However, a closer inspection of the data reveals a collection of disparate instruments applied in a variety of settings with little information provided on methodology. Many of the research settings are not industrial (e.g., military, academic, etc.), and there are great variations in the constructs measured and outcome criteria used. Also, there is little consistency in the design format and scoring approaches of the in-baskets reviewed.
Given the disparity in test construction, methodology, performance dimensions, and outcome criteria, perhaps it is not surprising that reviews of in-basket criterion-related validity results appear mixed (Gill, 1979; Schippmann et al., 1990; Thornton & Byham, 1982).

Notwithstanding the disparate methods and settings of the in-basket studies reviewed, an approximate average range of the validity coefficients reported in Table 1 is between .25 and .30. However, it should be recognized that predictive validities for the in-basket in the high .20's and low .30's are not inconsequential. In a utility analysis of the assessment centre for selection, Cascio and Silbey (1978) reported that even with a validity of .05, the gain in the dollar-value utility of the assessment centre over random selection represented a net savings of $173 per selectee. The in-basket exercise, which makes a significant and unique contribution to the overall assessment centre rating (Huck, 1974), no doubt results in substantial dollar-value utility gains when used in the assessment centre as a selection device.

Construct Validity

School administration in-baskets. In addition to gathering evidence for criterion-related validity, a key purpose of the school administration study (Hemphill et al., 1962) was to investigate the relationships between the major dimensions of administrative performance and a variety of other personal characteristics. A series of cognitive ability tests, a personality inventory, and an interest inventory were among the measures used to gather data for a broad set of characteristics. Specifically, the Strong Vocational Interest Blank for Men (Strong, 1951), the Sixteen Personality Factor Questionnaire (Cattell, 1957), four fluency factors (Guilford, 1957), and a set of ability tests from the Kit of Selected Tests for Reference Aptitude and Achievement Factors (French, 1951) were employed.
It is not appropriate to summarize all findings here; only the strongest relationships between the most salient stylistic factors of in-basket performance and the set of personal correlates examined in the study are briefly described.

Those school principals who scored high on the in-basket Exchanging Information dimension also tended to have a strong vocabulary and be emotionally sensitive. Strong skills in oral communication were associated with high scores on Analyzing the Situation and Organizing Work. The tendency to organize work as measured by the in-basket, in turn, was more pronounced among principals who were anxious, insecure and characterized by nervous tension. Strong skills in word and ideational fluency and skill with simple numerical tasks were characteristic of principals who scored high on the Complying with Suggestions Made by Others dimension but low on Directing the Work of Others. Principals who were friendly and adventurous were more likely to be involved in maintaining relationships with others in the school portrayed in the exercise. In general, most of the in-basket stylistic factors were unrelated to interest scores.

Frederiksen (1966). A similar study was later conducted wherein in-basket performance from a sample of 115 male administrators in the federal government was related to ability, personality, and interest factors. Among his findings, Frederiksen reported that those who tended to take many imaginative courses of action also tended to have higher scores on a vocabulary test. Also, those who wrote a great deal, attempted many items, involved many people, and took many leading actions also tended to have high scores on the Active scale of the Thurstone Temperament Schedule (Thurstone, 1953).
Those who planned to have many face-to-face discussions with subordinates tended to resemble life insurance salesmen in their responses to the Strong Vocational Interest Blank, whereas those who resembled forest service men tended to avoid discussions and asking for information.

A parallel form study. Brannick et al. (1989) conducted an examination of two parallel forms of an in-basket exercise, developed as part of a statewide managerial training program, in order to study the construct validity of in-basket scores. Using a pretest-posttest design, 88 university students (49 males, 39 females) were first randomly assigned to take either Form A or Form B of the in-basket (pretest). One month later, the students then took the in-basket that was the alternate form of the one taken previously (posttest). The in-basket forms were designed, following standard ETS principles, to measure Organization, Leadership, Perceptiveness, Decision-making, and Delegating. Separate scoring keys were developed for each in-basket after first constructing a list of the usual courses of action for each item. For each form, and for each item within that form, each response was judged as positive, neutral, or negative and assigned a +1, 0, or -1 on each of the five dimensions. It should be noted that this application of qualitative or evaluative-based scoring of dimensions is a departure from the standard ETS scoring practice of using quantitative or frequency-based scoring. The final keying of each response was determined by consensus of a panel of five (three authors and two scorers). A total score was calculated by summing the dimension scores.

A multitrait-multimethod matrix (Campbell & Fiske, 1959) was set up to examine the convergent and discriminant validity of in-basket performance.
According to Campbell (1960), in order to demonstrate construct validity, a test should be shown both to correlate highly with other variables with which it should theoretically correlate (convergent validity) and not to correlate significantly with variables from which it should differ (discriminant validity). Therefore, Brannick et al. (1989) divided the two parallel in-basket forms into two halves, creating four distinct exercises. One half of each exercise contained scores on the even numbered items, and the other half contained scores on the odd numbered items. Generally, the validity diagonal results were non-zero (Form A: r's ranged from .19 to .61; Form B: r's ranged from .20 to .62), but, for the most part, they were not larger than the off-diagonal values.

Brannick et al. (1989) concluded that there was limited evidence of convergent validity and even less evidence of discriminant validity. In their view, the data "...failed to support the interpretation of in-basket scores as indicants of managerial ability" (p. 962). This conclusion, while distressing, may be explained, in part, by the scoring system employed in the study. In addition to scoring for actions taken, candidates were also scored for actions not taken. Failing to take action on an urgent item was typically seen as a negative response and, because each action was evaluated along each dimension, such non-action was usually scored negatively across several (if not all) dimensions. As the researchers themselves acknowledged, scoring the same behaviour, in the same way, for several managerial abilities would likely result in high correlations among dimensions.
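The dependency just described can be illustrated with a small Python sketch. All numbers, dimension names, and the keying below are invented for illustration and do not reproduce the data or scoring key of Brannick et al. (1989). The sketch assumes five hypothetical candidates whose dimension-specific scores on two dimensions are uncorrelated, and then adds the same non-action penalty (-1 per urgent item left unhandled) to both dimensions.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical dimension-specific scores for five candidates,
# chosen so that the two dimensions are uncorrelated on their own.
organization_only = [2, -1, 0, 1, -2]
leadership_only = [1, 0, -2, 0, 1]

# Hypothetical penalty for urgent items left unhandled (-1 per item),
# keyed identically on BOTH dimensions.
non_action_penalty = [0, -3, -6, -1, -5]

organization = [u + p for u, p in zip(organization_only, non_action_penalty)]
leadership = [u + p for u, p in zip(leadership_only, non_action_penalty)]

r_without_shared_keying = pearson_r(organization_only, leadership_only)
r_with_shared_keying = pearson_r(organization, leadership)

print(round(r_without_shared_keying, 2))
print(round(r_with_shared_keying, 2))
```

Under these invented numbers, the only change between the two correlations is the shared penalty, so any rise in the second value is attributable to keying one behaviour on both dimensions rather than to a relationship between the dimensions themselves.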
It is not surprising, then, that little evidence of discriminant validity was seen; it is unlikely that multiple dimensions, measured in this way, would satisfy the "heterotrait" condition of the multitrait-multimethod design.

Similarly, if the methods used in a multitrait-multimethod matrix are not substantially different, it is unlikely that a true "heteromethod" situation exists. The absence of true "heteromethod" comparisons would likely result in higher off-diagonal correlations. It is questionable whether it is reasonable to expect evidence of discriminant validity when alternate forms of the same instrument are used. Although the assessment of administrative ability does not easily lend itself to multiple methods of measurement, it may be more instructive to conduct convergent and discriminant validation using in-baskets which differ in the nature of the stimuli (e.g., paper-and-pencil versus videotaped presentation) and the nature of the response format (e.g., open-ended versus multiple-choice), in order to more closely replicate a multitrait-multimethod design, and so more soundly demonstrate construct validity (or the lack thereof).

The measurement of participative decision-making (PDM). Tett and Jackson (1990) designed and evaluated an in-basket measure of managerial participative decision-making and examined the relations between participative tendency and various personality traits. The identification of in-basket dimensions was theory-driven (Jago, 1978; Vroom & Yetton, 1973) and resulted in six participative behaviours to be measured in the exercise: Delegation of Decision-Making Authority, Requesting Advice, Following Advice, Requesting to Meet with Subordinates to Discuss a Problem, Requesting Information, and Asking to be Kept Informed (as to how a problem is developing or being resolved).
Items were prepared, using standard job-analysis techniques, to be realistic and representative of typical managerial problems. The process of deciding which dimensions were relevant to which items was not clarified.

The scoring system for the six participative dimensions was unique, with the number of subordinates involved forming the score for the dimensions Requesting to Meet, Requesting Information, Requesting Advice, and Asking to be Kept Informed. The Delegation dimension was scored on a scale from 1 (explicit directions) to 6 (complete delegation). Following Advice was scored as 0 for advice rejection, 1 for a neutral response, and 2 for advice acceptance. Scores within a particular dimension and across items were averaged to yield six participative dimension scores. Dimension scores were then summed to provide an overall PDM score.

Three theory-based components of participative decision-making (power sharing, interactions among co-workers, and information exchange) related to worker satisfaction (Wall & Lischeron, 1977) were used as a framework to then select a total of six related personality traits from the Personality Research Form (Jackson, 1986) and the Jackson Personality Inventory (Jackson, 1976). The component of power sharing led to the selection of Dominance and Autonomy as personality correlates in this study; interactions among co-workers provided Affiliation and Social Recognition; information exchange pointed to Cognitive Structure and Tolerance. (External criterion validation of the in-basket exercise was not carried out in this study.)

The sample consisted of 89 mid- to upper-level managers (82 men, 7 women). When the personality traits were correlated with the overall PDM score, three of the six personality traits were found to be unrelated to participative tendency and two were correlated in the direction opposite to that predicted.
These unexpected positive correlations were between Delegation and the traits of Autonomy and Dominance (r = .23 and r = .36, respectively, both significant at the .05 level), and led Tett and Jackson (1990) to recognize the need for further work into the effect of authority-threatening situations in the relation between power-based motives and participative decision-making. In addition, following a principal components analysis, evidence of the unidimensionality of the participative behaviours was found. Consistent with the trend toward computer scoring (Thornton, 1992), a computerized version of the PDM in-basket exercise is under development, based on a multiple-choice response format, to "...permit highly reliable but less labor-intensive scoring" (Tett & Jackson, 1990, p. 181).

Another multitrait-multimethod study. Recently, dimensions from the in-basket featured in the present study (Telephone Supervisor In-Basket Exercise, or TSIB) and the most closely-corresponding dimensions from the ETS Consolidated Fund In-Basket Test (Educational Testing Service, c. 1970) were combined to form a multitrait-multimethod matrix (Hakstian & Harlos, 1992). Because both instruments will later be discussed in detail, only the results of the study will be related here. (A limited assessment of the construct validity of the TSIB will also be reported under Study 2 of the Preliminary Studies section.)

Consistent with the findings from Brannick et al. (1989), some evidence for convergent, but little for discriminant, validity was seen. The mean validity diagonal was .43 (without disattenuation), and although the mean heterotrait-heteromethod correlation was lower (.32), three of the five TSIB dimensions correlated more highly with non-corresponding than with corresponding ETS in-basket dimensions. Unlike the in-baskets used by Brannick et al. (1989), the TSIB used more standard ETS scoring protocols. However, like Brannick et al.
(1989), multiple keying of courses of action across dimensions was common in the TSIB, creating dimensional dependency. Perhaps a multitrait-multimethod matrix based on factor-analytically derived in-basket dimensions would be instructive, reducing the contribution to variance from dimensional dependency and allowing a more focused analysis of discriminant validity.

Factor-analytic studies. Factor-analytic studies of in-basket performance may provide more indirect evidence of construct validity than those studies relating in-basket performance to data from external (i.e., non-test) variables (Schippmann et al., 1990). Nevertheless, these findings merit consideration because they do provide additional information about the nature of in-basket performance. Factor-analytic studies of the correlations of large numbers of in-basket style categories have suggested some common dimensions of administrative performance: Complying with Suggestions, Preparing for Action by Gathering More Information, Directing Others, and Discussing Problems with Others (Thornton & Byham, 1982). The factor consistency across studies (and high intercorrelations among dimensions in non-factor-analytic studies) has led some researchers to postulate that the underlying characteristics measured by the in-basket exercise are based on a single generalized trait (Kesselman et al., 1982; Lopez, 1966). Other researchers, however, have contended that the factor consistency "...may represent nothing more than the potentially clouding influence of an ETS dominated measurement system" (Schippmann et al., 1990, p. 856).

Summary of construct-related validity evidence. Considering the widespread use of in-basket exercises, there are few studies reported in the literature which directly assess construct validity. The limited evidence suggests that, particularly when related to cognitive ability, in-basket performance reflects logical and generally consistent characteristics.
The in-basket exercise appears capable of measuring specific theoretical constructs. However, factor-analytic findings have resulted in some disagreement among researchers as to the complexity of administrative performance. On one hand, it is believed that skills in this area are multi-dimensional, that several separate, identifiable styles and action-approaches are present and thus need to be measured to fully understand and to make accurate predictions about administrative potential (Frederiksen, 1966; Thornton & Byham, 1982). Conversely, some researchers maintain that a more global, unitary skill underlies performance (Kesselman et al., 1982; Lopez, 1966).

Content Validity

In general, the realism, relevance, and representativeness of items in an in-basket contribute to content validity. As noted earlier, conducting thorough job analyses in order to more realistically reflect the nature and complexity of the target job substantially enhances content validity. Although the theoretical procedures to establish content validity for in-baskets are clear, in the view of Schippmann et al. (1990), "all of the reviewed studies which suggest that their procedures are [italics added] content valid fall seriously short of the mark" (p. 851). The reason the studies fell short, they contended, is that researchers have confused face validity with content validity; the exercise, on the face of it, may seem to reflect the challenges present in a position, and so proper content validation is not completed. Another factor which may discourage test developers from adequately establishing content validity is the length of time required to sample and integrate documentation from real in-baskets (Gill, 1979). Schippmann et al.
concluded their review of content validity findings by recognizing that "there simply are no published or widely distributed reports which describe how to develop an in-basket that is well-grounded with regard to content validity" (p. 851).

General Summary of In-Basket Validity Results

Despite the range of in-basket scoring approaches developed and researched, it is simply not possible to know, with confidence, whether any one key is correct. For several reasons, accurate assessment of in-basket validity is a daunting task. Because managerial behaviour is multi-faceted and lacks clear, consistent definitions and descriptions, there are inherent limitations in accessing and measuring managerial performance. In addition, problems of judgement bias remain a serious obstacle in performance appraisal assessment (Cascio, 1987).

As previously noted, it is difficult to summarize published empirical findings of the in-basket because important detail is often missing in the reporting of methods, scoring procedures, and results. Moreover, substantial variations in research settings, in-basket construction and scoring, performance dimensions, level of exercise fidelity, and external criterion measures make meaningful comparisons difficult. Despite these difficulties, some summary observations can be made which point to several areas in need of further in-basket research.

The equivocal support among reviewers regarding the use of the in-basket exercise as a selection measure suggests that further evidence of criterion-related validity is called for. Several explanations have been proposed for the relative lack of consistent, sound empirical validation for this instrument, including its high face validity, the lengthy time required to train in-house scorers, the difficulty of scoring the exercise, and the distortion that arises from subjective evaluations of in-basket performance.
Industry users may believe that further research is unnecessary, either due to high face validity or due to the inherent soundness suggested by the lengthy training of scorers and the complexity of scoring (Lopez, 1966; Kesselman et al., 1982; Thornton & Byham, 1982). Even when in-basket research is undertaken, sound results may be hard to obtain because of the difficulty in properly constructing in-basket exercises and the reliance, in assessment centre applications of the in-basket exercise, on subjective evaluation methods. Regardless of the reasons, more evidence of criterion-related validity appears warranted before it can be concluded, with confidence, that the in-basket exercise possesses the predictive properties to justify its widespread use in industry.

Further evidence of construct validity also appears warranted, because of both the lack of published studies directly assessing the nature of in-basket performance and the controversy surrounding those results which have been reported. Thus, additional research into the nature and number of in-basket dimensions that are involved in administrative performance is needed. Finally, there is a need for clear guidelines for developing content-valid in-basket exercises, and for these guidelines to be applied consistently across industry.

This review of the content-, construct-, and criterion-related evidence of validation suggests that, for the most part, Frederiksen's (1957) goal of designing an instrument which is "sensitive" (can adequately measure the broad, complex set of skills involved in high-level jobs) has not yet been satisfactorily met. This is especially apparent when one considers the overwhelming gender imbalance of subjects in studies assessing criterion-related validity. Over the forty years of in-basket research, in those studies which clearly identified gender, roughly 1,672 men, compared to 290 women, were involved as subjects. (The same trend, although to a lesser degree, is apparent in the few published studies of construct validity.)
It is assumed that, in studies where the gender of subjects was not clearly specified (e.g., Lopez, 1966), subjects were male. The reasons for such a disturbing discrepancy are not relevant here. What is relevant is the realization that the findings and summary observations from published in-basket research cannot be assumed, in all cases, to apply to women. With gender-based differences in cognitive ability, it is not unreasonable, at the least, to gather additional data on women's in-basket performance, thereby setting the groundwork to further investigate the possibility of differential attributes and predictive accuracy of administrative ability based on gender.

Summary of Section 2: Psychometric Properties of the In-Basket Exercise

The second section of the Literature Review considered, in some detail, the pioneering work by Frederiksen in the early 1950's, which laid the foundation for conventional in-basket design and traditional objective scoring strategies. A chronological review of the published empirical findings of the in-basket exercise was then presented, which included both a description of design and procedural developments, as well as a summary of the reported psychometric properties of the instrument. Several significant inadequacies in both the quantity and quality of the research were identified and areas for further research were discussed.

Section 3: A Review of Current In-Basket Scoring Strategies

Limitations of a Literature-Based Review

In contrast to the disparate studies and contradictory findings reported in Section 2 (Psychometric Properties), the difficulty faced by researchers in scoring in-baskets is a uniquely consistent observation. Since Frederiksen et al.'s (1957) disappointing psychometric results, research which considers both psychometric and practical implications of scoring methods has continued, albeit slowly, resulting in the development of several scoring approaches.
The Introduction described these three main approaches to evaluating in-basket performance: a) qualitative, subjective, clinical (typically based on a post-exercise interview); b) quantitative, objective, psychometric (based on analyses of written responses); and c) a combination of the two. The psychometric approach was further broken down into two methods: the more traditional, ETS-based scoring method based on molecular courses of action (also known as action elements), and the less conventional, self-report multiple-choice form.

It should be noted that the classification of scoring strategies introduced here is not present in the literature. Brief reviews of general scoring practices have appeared (Gill, 1979; Kesselman et al., 1982), but these tend to be too cursory to be useful. Currently, neither a systematic delineation nor a systematic evaluation of scoring approaches exists. As acknowledged earlier, many studies fail to provide adequate detail on the scoring method used, and, as a result, valid comparisons across studies are often precluded. In one of the few references to the combination method of scoring in-basket performance, Clutterbuck (1974) reported one industry user's claim that the combined interview and psychometric approach could increase the efficacy of the in-basket by as much as 50 per cent. This assertion, however, is purely subjective. To date, no known studies examining the combination method of in-basket scoring have been published. In one of the two major reviews of in-basket research, Gill (1979) did not even refer to the ETS action element approach in his description of the quantitative scoring method, instead selecting only the multiple-choice self-report format.

With the widely-acknowledged difficulty of in-basket scoring and without a systematic, comprehensive delineation of scoring approaches, it is not surprising that very few studies have focused specifically on scoring methods.
Consequently, research and opinion into the relative merits (psychometrically and practically) of the various scoring strategies are inconclusive (Gill, 1979). As noted in the Introduction, Brannick et al. (1989) cited comparisons of in-basket scoring systems as a principal area for much-needed investigation.

Given the limited findings on in-basket scoring, inaccuracies in the few published reviews of in-basket research make the accurate assessment of in-basket scoring even more difficult. For example, Gill (1979) falsely described Meyer's (1970) analysis of in-basket performance as resulting in two in-basket performance dimensions for scoring, namely, a supervision and a planning/administrative dimension. As discussed in Section 2 of the Literature Review, these dimensions were job performance dimensions, not in-basket performance dimensions. Gill compounded his error by asserting that many other studies supported Meyer's finding of "in-basket" dimensions of supervision and planning/administration to be used for scoring in-basket performance. In another misstatement, Gill claimed that the study by Wollowick and McNamara (1969) was one of the few studies which showed that the objective, psychometric approach to in-basket scoring was superior. Yet, as Section 2 indicated, Wollowick and McNamara's very limited discussion of the in-basket scoring method they used clearly identified it as the subjective (interview) approach. They had reported that a subjectively-derived Overall Assessment Rating (OAR) yielded substantially lower validities in the prediction of management success than an empirically-derived combination of scores and ratings obtained across performance measures. Moreover, Wollowick and McNamara's conclusion of the superiority of the statistical (objective) scoring approach was, in fact, relevant only to methods used to combine variables in the derivation of the OAR, not in-basket scoring.
Because of the importance of Gill's work as one of only two reviews of in-basket research, the mis-identification of in-basket scoring dimensions and the inaccuracy regarding the relative efficacy of scoring methods may, if unrecognized as erroneous, make the synthesis of findings (which already are disparate and insufficiently detailed) even less conclusive.

Considering the paucity of sound research (or reviews) conducted on in-basket scoring, it may be more instructive to turn to current industry practice, rather than rely on the literature, in order to gather more complete and accurate information on current in-basket scoring strategies.

Review of Current In-Basket Scoring Methodology in Industry

Serious obstacles exist which make the goal of synthesizing industry-based scoring information laborious, if not unattainable. Detailed descriptions of the scoring protocols and specific concepts used in the development and application of industry in-baskets are simply not available. Many companies, through consultants, opt for custom-designed, "in-house" in-baskets, rather than "off-the-shelf" exercises. The concerns of developers of both in-house and off-the-shelf in-basket exercises regarding competition from rival in-basket consultants have led to a fiercely protective, vigilant control over "design and scoring secrets." Despite the difficulties in surveying in-basket scoring methodology used in industry, some important information regarding several widely-used in-basket exercises has been ascertained. What follows is a brief review of the design and scoring of these instruments.

The Multiple-Choice In-Basket Management Exercise (MCIME). Morris (1991) has developed a wideband, low-fidelity simulation, very similar in format and scoring to that described by Motowidlo et al. (1990). According to Morris, the construction of the MCIME was based, in part, on a request from the U.S.
Equal Employment Opportunity Commission for alternatives to traditional multiple-choice tests used in selection. In addition, the evidence that assessment centres are good predictors of managerial potential (Cohen, Moses & Byham, 1974), that assessment centres do not result in adverse impact for minorities or women (Huck & Bray, 1976), that in-baskets correlate well with OARs (Huck, 1974), and that in-baskets themselves are good predictors of managerial performance, also led Morris to consider the application of a multiple-choice format in-basket to the assessment centre method.

The MCIME consists of 81 questions or items presented to the participant with four response options provided for each question. The participant is directed to select the option representing the "best" and "worst" way of handling each problem using a strictly multiple-choice format; no free-format written responses are made. Standard job analysis techniques were used to both prepare items and identify dimensions relevant to performance as a manager. The in-basket items were ultimately prepared to measure six performance dimensions: Planning and Organizing, Decision-Making, Interpersonal Relations, Problem Analysis and Issue Identification, Written Communication, and Overall Job Performance. The MCIME is unique in that no time limit is imposed for completing the exercise.

A racially and gender-balanced panel of subject matter experts used the consensus approach to determine the scoring key for the in-basket dimensions, first by deciding which dimensions apply to each item and then by reaching a consensus on the best and worst options for handling each item. One point each (toward the dimension score) is assigned for the participant's correct (as defined by the panel) selection of best and worst option. Thus, for each item, scores range from 0 to 2, with 2 reflecting the accurate selection of both the best and worst options.
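The best/worst scoring rule just described can be sketched in Python. This is a minimal illustration only: the item identifiers, keyed options, dimension assignments, and function name below are hypothetical, since the actual MCIME key has not been published.

```python
from collections import defaultdict

# Hypothetical key (not the MCIME's): item -> (panel's best option,
# panel's worst option, dimensions on which the item is keyed).
KEY = {
    "item01": ("B", "D", ["Decision-Making"]),
    "item02": ("A", "C", ["Decision-Making", "Planning and Organizing"]),
    "item03": ("C", "A", ["Planning and Organizing"]),
}

def score_responses(responses, key=KEY):
    """Return (item_scores, dimension_scores).

    responses maps item -> (option chosen as best, option chosen as worst).
    An item earns 1 point for matching the keyed best option and 1 point
    for matching the keyed worst option, so item scores range from 0 to 2.
    An item's points are added to every dimension on which it is keyed.
    """
    item_scores = {}
    dimension_scores = defaultdict(int)
    for item, (best, worst, dimensions) in key.items():
        chosen_best, chosen_worst = responses.get(item, (None, None))
        points = int(chosen_best == best) + int(chosen_worst == worst)
        item_scores[item] = points
        for dimension in dimensions:
            dimension_scores[dimension] += points
    return item_scores, dict(dimension_scores)

# One hypothetical participant.
responses = {
    "item01": ("B", "D"),  # matches both keyed options
    "item02": ("A", "B"),  # matches the keyed best option only
    "item03": ("A", "C"),  # matches neither keyed option
}
item_scores, dimension_scores = score_responses(responses)
print(item_scores)
print(dimension_scores)
```

In this sketch an item keyed on two dimensions contributes its points to both, which is one simple way to represent an item being scored on more than one dimension; how the MCIME itself weights such items is not reported.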
Multiple keying of items to dimensions does occur, from a minimum of 13 items used to measure Written Communication to a maximum of 49 items being scored for Problem Analysis and Issue Identification. Feedback is in the form of a computer-generated diagnostic report prepared for each participant.

The objective scoring approach used in the MCIME represents a recent revision to a previous objective scoring approach, with the result that little empirical data are available. Nevertheless, a recent study of 150 subjects using the newly-revised objective scoring approach obtained uncorrected validity coefficients ranging from .26 to .42 across the six dimensions measured (D. Morris, personal communication, October, 1991). Additional information about either the MCIME itself or related empirical findings is simply not attainable. Efforts are currently underway to prepare and publish a synopsis of the design, development, and predictive accuracy of the instrument. For this reason, until publication, Morris wishes to restrict the release of information through other sources (D. Morris, personal communication, June 29, 1992).

General Management In-Basket (GMIB). Another objective scoring method is applied by Joines (1991), who has developed a more traditional, medium-fidelity in-basket exercise wherein participants read a series of realistic-looking items and, on standardized forms, provide both an analysis of the managerial issues involved in each item and a description of the actions they would take in handling the item. In addition, on a second standardized form, participants write out responses or memos to fully execute their actions in completing the item.

The GMIB's scoring principles reflect its intended development as a generic in-basket, designed to measure managerial skills regardless of the target job classification. Its content is described as theory-driven, in that many items are written to require participants to apply sound management theory to practice.
Management concepts from McGregor's Theory Y (1960), participative management, and situational leadership form the basis not only of the instrument's content, but also of its scoring system. Unlike the standard ETS panel approach, the GMIB scoring keys reflect the views of the test developer, not industry personnel. The scoring key is based more on acceptance and application of the relevant management theory for each item (as defined by the test developer) rather than on the consensus judgements of subject matter experts or panels.

The GMIB also differs from ETS-developed in-basket exercises in that no Reasons for Action form is involved. In-basket evaluation is based solely on the participants' written responses to the 15 items that make up the GMIB. Three of the 15 items are considered critical and these are scored on a scale from 0 to 5, whereas the remaining 12 items are scored using values ranging from 0 to 4. Scoring values for courses of action are assigned by the test developers, who have also provided response examples, set out in a scoring manual, for each scoring value. According to Joines (1991), trained raters require only 20 minutes to score one GMIB, including selecting, for each item, an appropriate narrative statement from a bank of statements. Using this approach, a 5-6 page narrative report is prepared for each participant.

Joines (1991) reported that four reliability studies have been conducted on the GMIB. The largest of these used two raters who each scored 100 exercises. The inter-rater reliability coefficient, based on the total in-basket score, was .95. In addition, an alpha coefficient of .71 was reported for the total GMIB score.

A large-scale criterion-validation study of the GMIB was conducted with a sample of 365 employees from several supervisory levels within a public sector organization (the gender breakdown was not provided).
Two sets of performance ratings (gathered from both immediate supervisors and next-higher-level supervisors) served as the criterion measures, with participants' job performance assessed along six dimensions: Written Communication, Leadership, Interpersonal Relations, Planning and Organizing, Analyzing Problems and Making Sound Decisions, and Oral Communication. Although several criterion composites were constructed, only the results based on the mean of the immediate supervisors' ratings across the sample of 365 employees will be reported here, in order to more closely match the criterion measures used in in-basket studies described in Section 2 of the Literature Review. An inter-rater reliability coefficient of .56 was reported for this criterion composite, based on a subsample of 194 subjects. Correlating the in-basket total score with the criterion composite yielded a corrected validity coefficient of .41 (p < .01). (The uncorrected validity coefficient was .31, reported to be significant at the .0001 level.)

A factor analysis of GMIB scores was also performed, resulting in four interpretable factors: Leadership Style and Practices, Handling Priorities/Sensitive Issues, Managing Conflict/Interpersonal Insight, and Organizational Practices/Management Control. The factor scores across the four dimensions were summed and correlated with the criterion composite, yielding a validity coefficient of .41 (p < .01).

The GMIB has attempted to balance the realism of a medium-fidelity design with sound psychometric properties and practical scoring methods. These efforts seem to have been successful, as evidenced by several large-scale contracts using the GMIB for administrative ability assessment and a national database of results from 4500 GMIB participants (R. Joines, personal communication, November, 1991).

ETS Consolidated Fund In-Basket Exercise.
This widely-used, objectively scored exercise uses the scenario of a fictitious volunteer community fund in which the participant assumes the role of the Fund's paid director who, in coordinating fund-raising activities, must deal with a small staff and Board of Directors made up of prominent citizens. This high-fidelity exercise uses scoring procedures which follow ETS principles that are applied, through analysis of participants' open-ended written responses, to scoring dimensions derived from factor analyses of a much larger set of in-basket categories (Crooks, 1968). The in-basket performance dimensions measured by the Consolidated Fund In-Basket are: Taking Action Toward Solving Problems, Exercising Supervision and Control, Problem Analyzing and Relating, Communicating in Person, Delegating, Scheduling Systematically, Productivity (Amount of Work Accomplished), Quality of Actions Taken, and Scorer's Rating of Overall Performance. The Quality of Actions dimension is the only qualitative or evaluative dimension of the nine because its scoring key is based on the panel approach of industry experts who evaluate all possible courses of action and assign scoring weights for the appropriateness of each action. The Scorer's Rating is impressionistic (raters use a 1 to 5 scale to indicate how well they believe the participant would perform on the job), whereas the remaining seven dimensions are stylistic and are measured by the frequency of occurrence (i.e., quantitative assessment) of actions which involve those stylistic dimensions. The ETS scoring scheme also uses the "Reasons for Action" form, which asks participants, upon completion of the in-basket, to record what they did, and why they chose those actions for each item. As described in the school administration study (Hemphill et
al., 1962), information from the Reasons for Actionform is applied to several different dimensions, although the exact keying ontodimensions and weighing of responses is not known.98Very little published empirical information on the Consolidated Fund In-BasketExercise is available. Even ETS itself has no prepared technical reports on thedevelopment and systematic criterion-validation of this popular instrument (K. Abbey,personal communication, June 3, 1992). However, the results from two studies havebeen published, although they are relatively small-scale applications of the exercise.The first study involved a comparison of in-basket performance across four managerialgroups from several companies: 101 high-potential middle managers, 96 lower-middlemanagers, 34 upper-middle managers, and 175 MBA students (Crooks & Slivinski,1972). A multiple-group discriminant analysis of scores revealed that while the overallprofiles of the four groups were significantly different statistically, subsequent analysesshowed considerable overlapping of subgroups within each of the four groups. A seriesof two-group discriminant analyses were then performed, comparing in-basketperformance of the high-potential group with each of the three other groups. Asexpected, the results indicated that the administrative performance of the high-potentialgroup was most like that of the upper-middle managerial group whereas it was leastlike performance by the MBA students.The second study of the Consolidated Fund In-Basket Exercise (French version)examined inter-rater reliability in a sample of 38 French-speaking middle-levelmanagers. Two ETS-trained scorers independently scored each exercise along the ninedimensions previously listed. The correlations across these dimensions ranged from.54 (the impressionistic Scorer's Rating of Overall Performance) to .93 (the moreobjective Productivity dimension). 
Interestingly, despite a detailed scoring manual which limits the influence of personal judgement, the important Quality of Actions dimension yielded the second lowest reliability coefficient (r = .75).

An example of the combination scoring method. An additional key corporate player in the in-basket field is Development Dimensions International (DDI). Whereas the MCIME and the GMIB use an objective or psychometric scoring method, the scoring approach used by DDI in both its off-the-shelf and custom-designed in-basket exercises is the combination method, with both an objective and a subjective (interview) component. A large component of DDI's work is the development of tailor-made in-basket exercises for assessment centre applications, many of which are computer-scored. However, neither descriptions of the development nor empirical results of the efficacy of these or other off-the-shelf instruments have been published (A. Smith, personal communication, June 4, 1992).

A description of DDI's off-the-shelf Sellmore Manufacturing Company Foreman's In-Basket (DDI, 1978) illustrates (albeit briefly) the combination scoring method. This high-fidelity in-basket exercise presents items on varying sizes of company stationery and provides blank letterheads for free-format written responses by the participant. The in-basket performance dimensions measured by the instrument are: Sensitivity, Initiative, Planning and Organizing, Analysis, Judgement, Decisiveness, Delegation, and Management Control. In scoring in-basket performance, a manual outlining the protocol for the interview is used. The manual also contains several preliminary and concluding interview questions, such as "What did you think of the in-basket?" and "What are the major problems confronting you?"
In addition, the scorer/interviewer is directed to examine the written responses for each item and discuss each response with the participant in order to decide how many of the "mandatory dimensions to be evaluated in an item" (listed for each item in the manual) were exemplified in the response. If the action(s) taken for the item show evidence of the mandatory dimension, a +1 score toward that dimension is awarded. On the other hand, if no evidence of action(s) exhibiting the mandatory dimension is shown, a -1 score is given for that dimension. The manual also lists non-mandatory dimensions for each item and, unlike the scoring of mandatory dimensions, lack of action on non-mandatory dimensions does not imply negative behaviour (absence of action demonstrating that dimension is not awarded a negative evaluation).

After reviewing the written response for an item, the scorer/interviewer questions the participant as to why a particular action was taken. The interview therefore allows information similar to what is gleaned from the ETS Reasons for Action Form to be collected (i.e., what was done, and why) but without using the Form's paper-and-pencil format. A primary advantage of the interview is that it provides a greater richness of performance information not measurable through objective means, such as the evaluation of Oral Communication skills (e.g., eye contact, persuasiveness, clarity of expression).

The final step in scoring in-basket performance using this example of the combination method is the scorer's completion of the Overview of In-Basket and In-Basket Interview. For each dimension, a "Dimension Summary Sheet" contains an item grid of all items across the exercise in order to record summaries of actions that demonstrate that dimension.
An overall rating of the effectiveness of the participant's performance is made in the particular metric applied by individual assessment centres. The manual acknowledges that usually a 0 to 5 scale is applied, with 0 indicating that no opportunity existed for the dimension to be shown and 5 indicating that a great deal of the dimension was shown. Lastly, a subjective summary is made of the major strengths and weaknesses of the participant's in-basket performance.

Summary of current in-basket scoring methodology in industry. There is a marked lack of information regarding instrument descriptions, scoring protocols, and empirical evaluations of current in-basket scoring practices in the field. For example, no known published information exists regarding instruments solely using the subjective method of scoring, and important detail is often missing in reports of in-basket exercises using the objective and combination scoring methods. Moreover, as noted earlier, an accurate assessment of the relative rates of use of the three scoring methods in industry is not available. According to Thornton, "[no] good data exists...we simply don't know about in-basket scoring practices in industry" (G. Thornton, personal communication, June 30, 1992). Although little is known about general scoring practices in the field, several specific examples of widely-used in-basket exercises were discussed in order to illustrate (as fully as possible) traditional industry applications (e.g., The Consolidated Fund In-Basket Exercise), as well as newer, less conventional in-baskets which incorporate recent design and scoring innovations (e.g., The MCIME).

Summary of Section 3: A Review of Current In-Basket Scoring Strategies

This third and final section of the Literature Review began by elaborating several limitations of a literature-based review of current in-basket scoring strategies, including the absence of a comprehensive classification system of in-basket scoring methods.
Accordingly, the present work has provided a scoring classification system, first outlined in the Introduction and briefly reviewed in this section. Consideration was then given to current industry-based in-basket designs and the scoring practices used to evaluate administrative performance. Frederiksen's observation, made two decades ago, that "the work of the industrial psychologist often does not find its way into the psychological literature" (1972, p. 69) continues to be valid regarding in-basket design, as a whole, and in-basket scoring, in particular. The reporting of in-basket design and scoring must become more common and standardized in order to allow an accurate assessment of the efficacy (psychometrically and practically) of various scoring methods.

PRELIMINARY STUDIES

Study 1

Method

Participants and Setting

The sample consisted of 321 first-level supervisory employees (entry-level management personnel) from different departments of a large Canadian utility company (168 males; 153 females). In the fall of 1989, participants were given a newly-revised version of an existing in-basket exercise as part of a concurrent validation study for a managerial assessment battery containing intellectual, personality, biodata, and supervisory judgement measures. The full assessment battery required 8 hours to complete and was administered in a 1-day session. Participants were given 1 1/2 hours to complete the in-basket exercise, preceded by 15 minutes of scripted instructions read aloud to enhance both the clarity and standardization of exercise instructions.

Materials

The Telephone Supervisor In-Basket Exercise (TSIB). This in-basket exercise consisted of 21 items or problems to be dealt with by the fictitious character Chris Wilson, who was recently promoted to First-Level Supervisor for the Independent Telephone Company of Iowa.
The scenario required Chris to handle, on a Sunday afternoon, the work that had accumulated in the in-basket of the new position, 90 minutes prior to leaving with the new boss to attend the yearly Budget Meeting. The in-basket items included an angry letter from a dissatisfied customer, internal memos suggesting a large backlog for residential and industrial telephone service, evidence of budget overruns, and staff conflicts. Considerable effort was made to enhance the realism of the stimulus materials and response format (i.e., fidelity). Accordingly, company stationery of varying sizes and design, complete with the company logo, was provided. Participants were instructed to take actions as necessary and accomplish as much as they could in the time available. They were told to write out memos and letters, to sign papers, if necessary, and to use paper clips provided to attach their response memos and letters to the relevant items.

Criterion measurement. A combination of three behaviourally-anchored rating scales (BARS; Smith & Kendall, 1963) and three behavioural observation scales (BOS; Latham & Wexley, 1981) developed specifically for this study was used for each dimension. A total of 12 performance appraisal dimensions were measured by the participants' immediate supervisors. These appraisal dimensions were developed by the company from an exhaustive list of behavioural incidents generated by 22 company managers. The specific appraisal measures are described more fully in an article by Hakstian, Woolley, Woolsey and Kryger (1991). They comprised Leadership, Planning/Organizing/Control, Oral Communication, Analysis, Judgement, Decisiveness, Work Ethic, Initiative, Behaviour Flexibility, Sensitivity, Performance Stability, and Written Communications. The supervisors providing criterion ratings were trained in the meaning of the definitions used for the criteria and were given clear instructions for the proper completion of the appraisal forms.
Each participant was measured on each of the 12 performance dimensions by means of three BARS and three BOS, for a total of six scales. The average time required to complete a rating on one participant was 1 1/2 hours. Assessment battery and performance criterion scores were collected concurrently.

The first principal component of the 12 performance dimension scores was then calculated and called the Overall Performance Criterion (OPC). Because it is a weighted average of the performance dimensions with the added advantage of maximal internal consistency, this criterion measure is seen as the best measure of overall management performance.

Procedure

Development of the TSIB. An existing in-basket exercise described in Hakstian et al. (1986) was modified in order to improve its reliability and validity. After a review of the literature, several steps were taken to change the original instrument, including modifications to item design and response format of the in-basket, as noted earlier.

Firstly, it was apparent that the ratio of number of items per hour in the original exercise was much higher than that typically found in other in-baskets. Accordingly, the time allotted for the exercise was extended from 1 to 1 1/2 hours to fit the typical ratio of approximately 15 items per hour. It was hypothesized that such a change would allow greater breadth and depth of responses, and so help to yield richer information along a greater number of performance dimensions.

A further focus for change was to improve the realism of the response format. As we have seen in the Literature Review, realism is necessary to provide motivation for the participants to take the exercise seriously and become "ego-involved" (Lopez, 1966). Thus, in order to allow female participants to adopt the role and fit in more easily, the male persona portrayed in the original in-basket was changed to a gender-neutral persona.
Similarly, the structure of the response format was altered to provide more realism; instead of using an "action booklet" wherein responses were recorded in a small bound notebook, company stationery of varying sizes and design was introduced. Participants were instructed to use paper clips to attach their response memos and letters to the relevant items. Moreover, the original in-basket required participants to determine the exact order in which they would complete all items, as a means of assessing the ability to recognize the priority of items. It was felt that such a determination was somewhat artificial and may over-emphasize a cognitive, rather than action-oriented, approach. Again, given greater freedom in response format, allowing more creativity and uniqueness of responses, and given more time in which to show administrative acumen, it was hypothesized that more accurate and valid data would emerge.

Such positive changes, however, would not be without consequence. In essence, the main impact of such modifications was to increase both the complexity of, and time required for, scoring. In addition, a more complex scoring system could result in decreased reliability with an attendant reduction in validity. Thus, in conjunction with these changes to item design and response format, an additional approach aimed at modifying the scoring system was designed and incorporated in order to minimize possible deleterious psychometric effects.

Identification of the TSIB dimensions. This additional approach involved a series of changes designed to make the scoring system as objective, reliable and valid as possible. In the summer of 1989, a large Canadian utility company using the in-basket described in Hakstian et al. (1986) in its management selection program agreed to make available 60 randomly chosen in-baskets completed by managerial candidates. The author compiled a list of the responses for all 21 in-basket items (and across all 60 in-baskets) in order to identify the typical courses of action taken and thus derive a list of action elements for each item. (It will be recalled from the Literature Review that action elements are the smallest, distinguishable units of action which result from breaking down in-basket responses.)

Later that summer, the author met with Lois Crooks and Peggy Mahoney of Educational Testing Service (ETS), whose research department sponsored Frederiksen's (1957, 1962) pioneering work, for instruction in the development of an objective in-basket scoring system. This consultation led to the identification of the dimensions to be used in the newly-revised in-basket, which were operationalized as follows:

1. Planning and Organizing Work: This dimension included organizing work according to related content or priority, and scheduling systematically for definite times.

2. Interpersonal Relations: This dimension reflected the degree of sensitivity, interpersonal awareness and skills shown.

3. Leadership in a Supervisory Role: This dimension indicated the extent to which the participant assumed leadership, i.e., took the lead in solving problems and in carrying out policies set by superiors, changing procedures, and deciding critically whether to comply with the suggestions or proposals of others.

4. Managing Personnel: This dimension involved supervising and controlling the activities of staff, asking for information, resolving staff conflicts, and giving directions and suggestions to personnel.

5. Analysis and Synthesis in Decision-Making: This dimension reflected the extent to which the participant analyzed problems through means such as taking into account available information and considering policy or other aspects of problems.

6. Productivity: This dimension was measured by the number of actions taken on all items and the number of items attempted.

7. Quality of Judgement: This dimension reflected the appropriateness of actions taken, based on pooled opinions of experienced persons who had previously considered the range of the possible courses of action.

Development of TSIB dimension scoring keys. In the late summer of 1989, the scoring key for Quality of Judgement was generated using the standard ETS practice of determining the plurality panel ratings of appropriateness for each action element.

However, before this method is described, an important development to improve the validity of the crucial Quality of Judgement dimension should be acknowledged. As described in the Literature Review, a panel of industry experts is typically constructed in order to determine the Quality of Judgement scoring key by assigning ratings of appropriateness (or judgement) for each action element. The development introduced here involved increasing the size of the panel based on the hypothesis that a larger, company-specific set of ratings would increase the stability and validity of the judgement ratings. Whereas the original in-basket used a three-member panel (two human resource employees and a senior researcher) to assign appropriateness ratings to action elements, the new version of the in-basket used an 11-member panel of mid-to-senior level managers from the target company (later referred to as Panel 1). Despite efforts to construct a gender-balanced panel, only 1 of the 11 panel members was female.

The process of determining the Quality of Judgement scoring key involved a 1-day workshop with the entire 11-member panel. In order to familiarize the panel with the in-basket exercise, the first part of the workshop required each panel member to actually complete the in-basket exercise following the standardized administration procedure.
Next, each member was given a prepared list of action elements for each item (derived from the 60 in-basket exercises referred to in the preceding Identification of the TSIB dimensions section). Each panel member was then asked to consider each element independently and assign a value of -1, 0, or +1 to reflect whether that element would be an unfavorable, neutral, or favorable step toward solving the problem presented in the item. After the workshop, the author tallied the scoring assignments for each action element across all panel members and determined final scoring weights by plurality of judgements. For example, if six panel members assigned a +1, whereas five members assigned a -1, that action element would be keyed positively. If five members assigned a -1 to an action element, while four rated it +1, and the remaining members gave it a 0, the action element would be keyed negatively.

The remaining five dimensions, from Planning and Organizing Work to Analysis and Synthesis in Decision-Making, are seen as stylistic dimensions because they are descriptive measures of the manner in which a participant tends to respond. The process of determining the scoring keys for these dimensions was based on the ETS procedure of considering each action element and logically assigning the dimensions to elements. Specifically, the author and two colleagues made independent assignments based on the perceived relevance of each dimension to each action element. The author then made the final decisions regarding the assignment of the dimension(s) to action elements. For the majority of action elements, multiple dimension assignments were made. Figure 3 presents a sample item from the in-basket exercise used in both the preliminary and present studies. Figure 4 displays an excerpt from the Scoring Manual for the same item, listing the action elements and outlining the Quality of Judgement key and remaining stylistic dimensions coded for each action element.

General scoring protocol.
Scoring of the 321 in-basket exercises was conducted by three employees from the target company and two external consultants who underwent a 5-day training process directed by the author. Scorers used a detailed Scoring Manual prepared by the author (110 pages) which provided a description of the in-basket exercise, general procedures and guidelines for scoring, and a list of the action elements and scoring keys for each item in the in-basket exercise (as noted, an illustration of the action elements and scoring key for one item has been provided in Figure 4). In total, the Manual listed approximately 600 action elements (with assigned dimensions for scoring) for the 21 items, with an average of 29 elements per item.

Figure 3. Sample Item from the Telephone Supervisor In-Basket Exercise.

1. Item 4: Memo from Hal regarding trucks parked by Coffee Shop

Quality of   Other
Judgement    Dimensions   Action Elements
+            3,4,5        1. Plan to find out who is involved
+            2,3,4,5      2. Give reprimands to those involved
0            2,3,4        3. Call to attention of Service Centre informing someone of break problem
+            3,4,5        4. Plan to drive by Coffee Shop
+            1            5. Defer until return (no other action)
+            1,5          6. Note urgency (do immediately, ASAP)
•            2,3,4        7. Plan to motivate employees to work harder and more efficiently by talking to them in friendly, positive way - must be directly stated
0            2,3          8. Memo to Hal that Wilson is looking into it (no mention of follow-up or further contact)
•            1,2,3        9. Plan to discuss with Hal in car on way to Budget Meeting/upon return
+            2,3          10. Plan to discuss with Hal (time not specified)
0            2,3          11. Memo to Hal about corrective action (not specified)
+            2,3,4,5      12. Memo to truckers to clarify 15 minute, not 45 minute breaks
+            1,5          13. Relate to Item 20 (letter from customer about lack of service)
             2,3,4,5      14. Memo to Wayne Freed threatening disciplinary action if further complaints of extended breaks are received
•            1,5          15. Relate to Item 10 (Proficiency/Productivity memo from Hal)
             2,3,4        16. Add to Item 10 memo that coffee breaks are not to be abused
•            1,5          17. Relate to Item 3 (Personnel Record)

Figure 4. Excerpt from the Scoring Manual for Sample Item (shown in Figure 3).

The primary task of the scorer was to match, as closely as possible, the participant's written response to the most appropriate action element listed in the Scoring Manual for the item. In keeping with standard ETS scoring procedures, provision for "Unusual" or imaginative responses was made (those not provided in the list of typical action elements). In such cases, the scorer had to first decide whether the response was similar enough to one listed in the Manual. If not (i.e., the response was truly unusual), the scorer then had to decide whether the response was scorable and construct a new action element for that response. These responses were then credited or assigned under Productivity and any other dimensions deemed relevant by the scorer, but no Quality of Judgement value was assigned because no panel had evaluated the appropriateness of these novel actions.

Results

Reliability

Criterion measurement. A serious consequence of unreliability of criterion measurement is the attenuation of validity coefficients. Much effort was devoted to minimizing the possibility of rater error and bias in the development and application of the criterion measures. A generalizability analysis was carried out on the scales used to measure the performance dimensions. The average generalizability coefficient of the twelve performance dimensions measured by the six scales was .83 (with each assessed in a two-facet generalizability design).

TSIB dimensions.
The three trained company scorers independently scored 39 in-baskets (19 males; 20 females) for the seven dimensions outlined previously. Table 2 shows the inter-rater reliabilities for the dimensions, with coefficients ranging from .82 to .95. These high coefficients suggest the raters were in strong agreement in scoring the action elements measuring the seven in-basket exercise dimensions. The internal consistency reliability of the seven in-basket dimensions was also calculated, separately by gender, yielding values of .87 (males) and .85 (females). The seven dimensions, therefore, are relatively highly correlated.

Table 2
Inter-Rater Reliability Estimates for TSIB Dimensions

In-Basket Dimension                              Inter-Rater Reliability (a)
1. Planning and Organizing Work                  .95
2. Interpersonal Relations                       .82
3. Leadership in a Supervisory Role              .82
4. Managing Personnel                            .87
5. Analysis and Synthesis in Decision-Making     .92
6. Productivity                                  .94
7. Quality of Judgement                          .91

Note. These results are based on a reliability subsample of 19 males and 20 females.
(a) This is the inter-rater reliability estimate, by analysis of variance, for a single rater.

Validity

Table 3 presents the concurrent validity coefficients between the Overall Performance Criterion, OPC, and the in-basket dimension scores. Validity results presented in Table 3 are somewhat mixed; two of seven dimensions showed low, but statistically significant, relationships with overall performance for both genders, with Quality of Judgement for the males also contributing useful information.

Discussion

In summary, the results obtained in Study 1 revealed that the scoring system for the seven dimensions was highly reliable, and that certain TSIB dimension scores were significantly related to job performance ratings. Validity results, however, were very modest, especially considering previous findings reported in the literature (Brass & Oldham, 1976; Kesselman et al., 1982).
It should be noted that, on average, one in-basket required between 1 1/4 and 1 1/2 hours to score. In short, the exercise required as long to score as it did to administer.

The modest validity findings of Study 1 (particularly for the Quality of Judgement dimension), as well as company concerns about lengthy scoring time, supported the continued development of the TSIB. Future research to assess the impact of modifications to the scoring keys appeared warranted; in this way, it was hypothesized that psychometric and practical improvements could be realized. Specifically, a promising area for further investigation would be to examine the effect of several scoring keys, each constructed differently, on the validity of Quality of Judgement scores. Because this dimension is regarded as the most critical in terms of understanding managerial potential, it would be the logical choice as the focus for scoring key development.

Table 3
Validity Coefficients between Overall Performance Criterion (OPC) and TSIB Dimension Scores

                                                 Bivariate Correlation with OPC
In-Basket Dimension                              Males     Females
1. Planning and Organizing Work
2. Interpersonal Relations                       .21       .15
3. Leadership in a Supervisory Role
4. Managing Personnel                            .21       .21
5. Analysis and Synthesis in Decision-Making
6. Productivity
7. Quality of Judgement                          .20

Note. Reported r's have been corrected for effects of range restriction and criterion unreliability. All are significant at the .05 level (one-tailed tests). All non-significant correlations were omitted.

An additional avenue for future research would be to assess the impact that scoring a subset of items, rather than the complete set of 21 items, would have on the validity of Quality of Judgement scores. In addition to reducing scoring time, it seemed plausible that eliminating those items which showed negative or low validity coefficients with the criterion, as well as deleting those which showed inconsistent results across genders, would result in an increase in validity coefficients.
It should be noted that although fewer items would be scored, participants would still be administered the full exercise, and they would have no knowledge of which items were to be selected for scoring.

Study 2

Method

Participants and Setting

Three new scoring keys to measure the Quality of Judgement dimension were constructed and applied to the same dataset from Study 1. Moreover, three different subsets of items consisting of 8, 10 and 12 items were selected and the results from the 321 subjects were rescored on these three subsets of items.

Procedure

Modification of the Quality of Judgement scoring key. In an effort to increase the validity of the Quality of Judgement dimension, a re-analysis of all Unusual responses across the 21 items in Study 1 was undertaken by the author. It should be recalled that Unusual action elements carry no Quality of Judgement rating, and that possible inconsistencies (hence error) could be introduced whenever a scorer must decide whether an action is or is not scorable (i.e., listed in the action element list provided in the Scoring Manual). Thus, to glean as much scorable information as possible about participants' responses and to reduce possible error from scorers' judgements, Unusual responses from the 321 subjects were studied. When three or more participants endorsed the same Unusual action, it was then included on the new list of action elements for that item. Logical keying of the stylistic dimensions was also made for each "new" action element. What remained was to determine and assign Quality of Judgement ratings for these new elements.

A new, expanded panel. A new, second panel was struck at a different company whose work was very similar to that of the first company. The same 1-day workshop format described in Preliminary Study 1 was followed here.
To increase the accuracy and soundness of the appropriateness ratings even further, 21 judges were used in total for this second panel, compared to 11 for the first panel used in Study 1. Because of personnel requirements, however, not all of the selected panel members could attend the 1-day panel workshop at the same time. As a result, the 21-member panel was divided into two sub-groups of a 9- and a 12-member panel. Ratings collected from these two sub-panels from the two workshops were then combined to yield the overall 21-member evaluations. The first 11-member panel and the second 21-member panel are referred to as Panel 1 and Panel 2, respectively, reflecting the origins of the key from Company 1 or Company 2. Size is not the most useful classification basis for the panels because, as will be shown, not all action elements from Company 2 were rated by 21 members (as noted, some were based on the 12-member sub-panel).

The establishment of Panel 2 (from Company 2) allowed the newly-added action elements from the Unusual action re-analysis to be rated. However, the re-analysis and recoding of Unusual responses into action elements described earlier was completed after the first nine-member sub-panel had met. The second sub-panel of 12 members was then provided with the new, expanded list of action elements to make their ratings of appropriateness of actions. Thus, although most of the elements were rated based on a 21-member panel, some of the newly-added action elements were based on judgements from the 12-member sub-panel.

Derivation of the new scoring keys for Quality of Judgement. Panel 2 ratings also provided the opportunity for several new Quality of Judgement keys to be generated. For the first key, a variation of the original panel method of deriving overall panel ratings was introduced.
Labelled the Panel 2 (Conservative) key, this method differed from the original method used to derive the Panel 1 key mainly in its application of more conservative guidelines, which required a higher proportion of agreement before a positive or negative value would be assigned. Specifically, instead of a simple majority, 75% panel agreement was needed before a maximum value of +2 or -2 would be given. Between 50% and 75% agreement resulted in either a +1 or -1 (as long as 25% or fewer marked the opposite sign), and zero would be assigned for all other outcomes. Thus, the metric was also moderately expanded from a three-point to a five-point scale to allow finer gradations of agreement. This same method of using a 50% and 75% cutoff was used for those elements based on judgements from the 12-member sub-panel. For clarification, the convention of indicating the method of combining panel judgements will now be followed. Specifically, the Panel 1 key will be referred to as the Panel 1 (Liberal) key, and the Panel 2 key will be referred to as the Panel 2 (Conservative) key.

The second key, called the Empirical key, followed Meyer's (1970) correlational analyses to derive scoring weights in that the endorsement of each action element was correlated with the Overall Performance Criterion (OPC) by means of a point-biserial correlation coefficient. Action elements were treated as dichotomous predictor variables; where an action element was endorsed, the candidate received a score of 1, whereas his/her action element score for unendorsed elements was zero. The criterion measure, OPC, was a continuous variable. If those rated higher on the overall performance scale tended to endorse an element like "meeting with a subordinate" while lower performers tended not to take such an action, we would see a positive correlation, and that element would be keyed positively. If poorer performers chose an action element while the better performers did not, we would see a negative correlation and that element would be keyed negatively.
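The keying logic just described rests on the point-biserial correlation between a 0/1 endorsement indicator and the continuous criterion. A minimal Python sketch follows, assuming the standard textbook formula; the function name and the toy data are illustrative only and are not taken from the thesis.

```python
import math


def point_biserial(endorsed, criterion):
    """Point-biserial r between a 0/1 endorsement vector and a continuous criterion.

    Uses r = (M1 - M0) / s * sqrt(p * q), where M1 and M0 are the criterion
    means of endorsers and non-endorsers, s is the population standard
    deviation of the criterion, and p and q are the proportions endorsing
    and not endorsing. This equals Pearson's r computed on the 0/1 variable.
    """
    if len(endorsed) != len(criterion):
        raise ValueError("endorsed and criterion must have the same length")
    n = len(criterion)
    group1 = [y for x, y in zip(endorsed, criterion) if x == 1]
    group0 = [y for x, y in zip(endorsed, criterion) if x == 0]
    if not group1 or not group0:
        raise ValueError("correlation is undefined when everyone (or no one) endorses")
    mean_all = sum(criterion) / n
    sd = math.sqrt(sum((y - mean_all) ** 2 for y in criterion) / n)
    if sd == 0:
        raise ValueError("correlation is undefined when the criterion is constant")
    mean1 = sum(group1) / len(group1)
    mean0 = sum(group0) / len(group0)
    p = len(group1) / n
    return (mean1 - mean0) / sd * math.sqrt(p * (1 - p))


# Hypothetical data: the two endorsers have the higher criterion ratings,
# so the coefficient is positive and the element would be keyed positively.
r = point_biserial([1, 1, 0, 0], [4, 5, 2, 1])
```

A positive coefficient corresponds to an element endorsed more often by higher performers; a negative coefficient corresponds to the reverse.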
If no clear pattern between management performance level and endorsement of action elements was present, we would see negligible correlations, and those action elements would receive a neutral value.

More specifically, the Empirical key decision rules for assigning quality of judgement values considered both the size and direction of the observed correlations, as well as gender (the correlations from both genders had to agree in sign in order for a positive or negative value to be assigned). In effect, each gender acted as a cross-validity check on the other. The specific guidelines used in the Empirical key to assign values to all action elements were as follows:

1. +2 or -2: separate-gender correlations agreed in sign and the product of p values was less than or equal to .05;

2. +1 or -1:
   (a) separate-gender correlations agreed in sign and the product of p values was greater than .05 and less than or equal to .24;
   OR
   (b) separate-gender correlations differed in sign and:
       i) the larger correlation had a p value less than or equal to .10 and the smaller correlation had a p value of greater than .30
       OR
       ii) the larger correlation had a p value less than or equal to .01 and the smaller correlation had a p value larger than .20
   For (b), the sign keyed was the same as that of the larger correlation

3. 0 for all other outcomes

It should also be noted that zero correlations were counted as agreeing in sign with the other, non-zero, correlation and a p value of 1 was assigned to all zero correlations. Lastly, correlations based on endorsements across the sample of fewer than three participants were omitted.

The third key, called the Merged key, was a complex combination of the Panel 2 (Conservative) and Empirical keys. It included consideration of both the correlational results (size, significance level, and directions for each gender) as well as overall Panel 2 (Conservative) ratings.
Specifically:

1. +2 or -2:
   (a) sign in panel was matched in both genders, and the product of the p values was less than or equal to .12;
   (b) in the case of the panel having assigned a zero, both gender results were in the same direction with the product of p values less than or equal to .06;

2. +1 or -1:
   (a) sign in panel was matched in both genders and the product of the p values was greater than .12 and less than or equal to .24;
   (b) sign in the panel agreed for one gender result, with a p value less than or equal to .12 (this sign is keyed), and the other gender result (in the other direction) had a p value of greater than .20;
   (c) when panel = 0, both gender results were in the same direction, with the product of the p values larger than .06 and less than or equal to .15;
   (d) sign in panel was opposite to both gender results (which were in the same direction), and the product of the p values was less than or equal to .05 (keyed in direction of the empirical results);
   (e) when the panel = 0, gender results were in opposite directions, with one gender result (the one keyed) with a p value of less than .06 and the other (opposite) result with a p value of greater than .25.

3. 0 for all other outcomes

Here, zero correlations were counted as agreeing in sign with either the panel or the other gender result. If a need arose to choose between the panel and the gender result, the correlation was seen as agreeing with the gender result. A p value of 1 was given to all zero correlations for computations. As with the Empirical key, correlations based on endorsements across the sample of fewer than three participants were omitted.

Derivation of the reduced-item sets. To select the best items for the 8-, 10-, and 12-item subsets, item analyses were conducted using the item-reliability index and the item-validity index.
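As a concrete illustration of the two indices used in these item analyses, the following Python sketch computes each one as an item's standard deviation multiplied by a correlation (with the total test score for reliability, with the criterion score for validity). It assumes a population standard deviation and an item-total correlation in which the item is left in the total; neither detail is specified in the text, and the function names are hypothetical.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def _population_sd(values):
    """Population standard deviation (divides by n); an assumption of this sketch."""
    m = _mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def _pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    mx, my = _mean(xs), _mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)


def item_reliability_index(item_scores, total_scores):
    """Item standard deviation times the item-total correlation."""
    return _population_sd(item_scores) * _pearson_r(item_scores, total_scores)


def item_validity_index(item_scores, criterion_scores):
    """Item standard deviation times the item-criterion correlation."""
    return _population_sd(item_scores) * _pearson_r(item_scores, criterion_scores)


# Hypothetical scores for four participants on one item.
item = [0, 1, 2, 3]
reliability = item_reliability_index(item, total_scores=[1, 2, 3, 4])
validity = item_validity_index(item, criterion_scores=[4, 3, 2, 1])
```

Items with larger positive values on an index contribute more to the corresponding property of the total score, which is why both indices can be used to rank candidate items for a reduced set.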
The item-reliability index is the product of the standard deviation of the item and the correlation between the item score and the total test score. The item-validity index is the product of the standard deviation of the item and the correlation between the item score and the criterion score. Reliability and validity indices were calculated for each of the 21 items in the TSIB using the data from Study 1 for each gender and the Quality of Judgement scores as measured by the Panel 1 (Liberal) key. Final selections of the components of the three subsets of items were made on the basis of consistency in results across genders and on the basis of the items' relative contributions to validity, and, to a lesser degree, reliability.

Results

Reliability

In Table 4, internal consistency reliability results appear for the Quality of Judgement scores obtained from the three subsets of items as scored by the Merged key. Although the internal consistency reliabilities reported in Table 4 are moderately low, the split-half reliability estimates are more promising than those obtained by the Panel 1 (Liberal) method applied to the 21-item set (these are reported under the Note in Table 4).

Validity

Table 5 contains the results from correlating the Quality of Judgement scores calculated using several scoring keys (Panels 1--Liberal and 2--Conservative, Empirical, and Merged) and three combinations of items with the Overall Performance Criterion score. It should be clearly understood that Panel 1 refers to that scoring key derived from the first panel of 11 members from Company 1 whereas Panel 2 refers to ratings based on Company 2's panel of 21 members.

Table 5 shows, not surprisingly, that validity coefficients are strongest for the Empirical key. The contrast in magnitude of coefficients between both Panel keys and the Empirical key is marked.
It should also be noted that incorporating Panel 2 (Conservative) judgements with the Empirical results to yield the Merged key did not result in significantly lower validities. It should be made clear that the item composition of the item sets measured (8-, 10- and 12-items) is exactly the same for each scoring key applied.

Intercorrelations

Table 6 contains the results of the intercorrelations among the seven dimension scores based on the 21-item set whereas Table 7 provides the intercorrelations based on the 8-item subset, using the previously described logical dimension keying system for scoring these stylistic dimensions, and the Panel 1 (Liberal) key to score the Quality of Judgement dimension. The fairly high level of intercorrelation among the seven dimensions as measured by both the 21- and 8-item sets suggests a lack of meaningful discriminant validity of several aspects of administrative performance.

Table 4
Alpha and Stepped-up Split-Half Reliability Estimates for Quality of Judgement Scores Obtained from Three Item Sets as Scored by the Merged Key

Item Sets          Males   Females
Alpha
  8 item set        .45      .40
  10 item set       .53      .54
  12 item set       .56      .57
Split Half
  8 item set        .52      .41
  10 item set       .58      .58
  12 item set       .65      .62

Note. For comparison, the alpha reliability coefficients for the 21-item set, scored by the Panel 1 (Liberal) key, were .54 (males) and .60 (females). The stepped-up split-half reliability coefficients for the 21-item set, scored by the Panel 1 (Liberal) key, were .28 (males) and .48 (females).

Table 5
Bivariate Correlations between Overall Performance Criterion and Quality of Judgement Scores Obtained from Four Scoring Keys and from Three Item Sets

                                  Scoring Key
Item Sets        Panel 1      Panel 2           Empirical    Merged
                 (Liberal)    (Conservative)
Males
  8 item set     .37*         .29*              .61**        .60**
  10 item set    n/a          .23*              .64**        .62**
  12 item set    n/a          .23*              .63**        .60**
Females
  8 item set     .29*         .31*              .63**        .59**
  10 item set    n/a          .28*              .62**        .62**
  12 item set    n/a          .25*              .63**        .62**

Note. Reported r's have been corrected for effects of range restriction and criterion unreliability. Correlations for the 10- and 12-item sets, scored by the Panel 1 (Liberal) key, are not available (n/a).
*p < .01 (one-tailed). **p < .001 (one-tailed).

Table 6
Intercorrelations of Dimension Scores based on the 21-Item Set: Stylistic Dimensions Scored by Logical Keying System and Quality of Judgement Scored by Panel 1 (Liberal) key

Males
Dimension                                    P & O     IR    LSR     MP  A & S      P     QJ
Planning and Organizing Work                          .39    .30    .24    .77    .87    .62
Interpersonal Relations                                      .79    .79    .11    .54    .45
Leadership in a Supervisory Role                                    .65    .19    .50    .36
Managing Personnel                                                         .02    .44    .44
Analysis and Synthesis in Decision-Making                                         .76    .52
Productivity                                                                             .67
Quality of Judgement

Females
Dimension                                    P & O     IR    LSR     MP  A & S      P     QJ
Planning and Organizing Work                          .28    .19    .25    .74    .84    .63
Interpersonal Relations                                      .78    .79   -.07    .39    .46
Leadership in a Supervisory Role                                    .62   -.03    .32    .34
Managing Personnel                                                        -.05    .39    .50
Analysis and Synthesis in Decision-Making                                         .78    .48
Productivity                                                                             .69
Quality of Judgement

Table 7
Intercorrelations of Dimension Scores based on the 8-Item Set: Stylistic Dimensions Scored by Logical Keying System and Quality of Judgement Scored by Panel 1 (Liberal) key

Males
Dimension                                    P & O     IR    LSR     MP  A & S      P     QJ
Planning and Organizing Work                          .36    .32    .22    .70    .73    .36
Interpersonal Relations                                      .76    .77    .22    .61    .61
Leadership in a Supervisory Role                                    .56    .34    .61    .51
Managing Personnel                                                         .13    .46    .58
Analysis and Synthesis in Decision-Making                                         .84    .38
Productivity                                                                             .58
Quality of Judgement

Females
Dimension                                    P & O     IR    LSR     MP  A & S      P     QJ
Planning and Organizing Work                          .21    .14    .22    .64    .71    .45
Interpersonal Relations                                      .77    .69    .04    .51    .50
Leadership in a Supervisory Role                                    .54    .11    .43    .49
Managing Personnel                                                         .17    .48    .54
Analysis and Synthesis in Decision-Making                                         .79    .53
Productivity                                                                             .67
Quality of Judgement

Construct validity

Lastly, the conceptual nature of in-basket performance was examined by correlating Quality of Judgement scores as derived
by the 10-item set (to choose a "moderately-sized" subset) as scored by the Merged key (Key 3) with selected intellectual and personality variables. (A list of the cognitive ability and personality tests used in this analysis is provided in Appendix A.)

Rather than summarize all findings here, only the strongest relationships between in-basket performance (Quality of Judgement) and the set of external variables examined are briefly described. All following reported correlation coefficients were tested using one-tailed significance tests. Bivariate correlations with intellectual measures revealed that reading comprehension was related to higher in-basket performance for both men (r = .20, p < .005) and women (r = .26, p < .001). A stronger pattern emerged for women across additional communication skills like writing ability (r = .19, p < .01) and vocabulary (r = .20, p < .01). Both genders showed a significant relationship between Quality of Judgement and the ability to think in cognitively flexible ways (males: r = .20, p < .005; females: r = .34, p < .001).

Correlations with personality measures suggest that, across genders, characteristics such as assertiveness and dominance are moderately related to in-basket performance. Specifically, correlations of .17 (males, p < .05) and .20 (females, p < .01) were observed when Quality of Judgement was related to dominance. When correlated with assertiveness (16PF Factor E), coefficients of .15 (females, p < .05) and .19 (males, p < .01) were seen. Gender-specific results suggested that women who were unconventional (r = .25; p < .001) and independent (r = .24; p < .001) also seemed to show better administrative judgement.
A different pattern emerged for men; those who tended to score higher on the Quality of Judgement dimension were also enthusiastic (16PF Factor F; r = .18; p < .01) and extroverted (16PF Second-Order Factor QI; r = .17; p < .05).

Discussion

The goals of Study 2 were to further the research begun in Study 1 by (a) examining the effect that three scoring keys, each constructed differently, would have on the validity of the Quality of Judgement scores, and (b) constructing three different subsets of items and assessing the impact of scoring each on reliability and validity.

The three keys were constructed on the basis of three frames of reference:
(1) the logical or rational Panel 2 (Conservative) key,
(2) the Empirical key, and
(3) the Merged key, a combination of the Panel 2 and Empirical keys.

In the derivation of the reduced item sets, the items were selected more for their contribution to validity than to reliability. One of the chief concerns with reducing the set of scored items in this way was the possible detrimental effect this would have on the reliability of the instrument by reducing the variability and the overall stability of the scores. It was decided to methodically examine this concern by constructing the three different subsets of items: an 8-item, 10-item, and 12-item scored set of items. Thus, the relative impact of a reduced item set on both reliability and validity could be assessed.

In sum, then, the main question asked in Study 2 was what effect the three ways of scoring Quality of Judgement, applied to three subsets of scored items, would have on the psychometric properties of the exercise.

Overall, we found confirmation of past research findings that the in-basket is a reliable and valid way of assessing administrative ability, even when the set of items actually scored is reduced substantially and new scoring keys are applied.
Scoring time was decreased from between 1 1/4 and 1 1/2 hours to 23 minutes, on average, for the 8-item key. Another practical benefit to the modifications examined in Study 2 was the reduction in time required to train scorers, who would be responsible for becoming skilled in scoring fewer than half of the original items. The time required to train scorers was slightly more than half of what had been required when all items were scored (i.e., three days for the 8-item subset versus five days for the full set of 21 items). Accordingly, it was found that important practical advantages of reducing training and scoring time could be realized without attendant reductions in the reliability and validity of the instrument.

As noted earlier, the internal consistency reliabilities presented in Table 4 showed moderately low reliability coefficients for the Quality of Judgement dimension as based on the three subsets of items as scored by the Merged key, although the split-half reliability estimates were more promising than those obtained by the Panel 1 (Liberal) key applied to the full 21-item set. However, if we were attempting to measure a unitary, factorially-pure trait, the items would have to be seen as lacking the requisite internal consistency. It is possible, however, that the Quality of Judgement dimension is less homogeneous than originally thought; perhaps these items are measuring somewhat independent aspects of administrative behaviour that make up the overall Quality of Judgement dimension. It should also be recalled that, in the derivation of the reduced-item subsets, the items were selected more for their contribution to validity, rather than reliability, so maximal internal consistency of items was not expected.
Across all dimensions (particularly the stylistic dimensions), Tables 6 and 7 showed evidence of a fairly high level of intercorrelation for both the full and a reduced item set, suggesting a lack of meaningful discriminant validity of several aspects of administrative performance.

We have seen that concerns of reduction in validity with the scoring key changes and reduced-item scoring approach examined in this study were not realized. In fact, the Quality of Judgement dimension results shown in Table 5 suggested marked improvements to the validity coefficients for this dimension using the Empirical and Merged keys, in particular. However, some serious questions remain as to the cross-sample generalizability of the positive findings reported in Study 2.

Specifically, to what degree are the reported validities spuriously inflated because of capitalization on chance, resulting in statistically significant correlations simply on the basis of chance, or Type I error? Whenever items from a larger set are chosen on the basis of their relation to the criterion, it is inevitable that some of the newly-selected items, when examined in a new sample, will not be related to the criterion. We would expect that the observed validity coefficients (especially those resulting from the Empirical key) are not the true validities, and so we would expect to see shrinkage in the cross-validated coefficients. This expectation comes from two primary sources. First, it will be recalled that reduced-item subsets were selected on the basis of maximal item validities for the original sample. Second, in the derivation of the Empirical and Merged keys, a total of six hundred correlations between the action elements and the Overall Performance Criterion were considered in order to determine the final Quality of Judgement value assigned each action element.
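As a rough illustration of the scale of the problem, the expected number of correlations reaching significance by chance alone can be computed under a deliberately simple assumption: every one of the tested correlations is truly zero and each is tested on its own at a fixed level alpha. This is not the procedure used to derive the keys (which combined p values across genders); the function name and the alpha levels shown are illustrative assumptions.

```python
def expected_chance_significant(n_tests, alpha):
    """Expected count of Type I errors when all n_tests null hypotheses are true.

    By linearity of expectation this is n_tests * alpha, whether or not
    the individual tests are independent of one another.
    """
    if n_tests < 0 or not 0 <= alpha <= 1:
        raise ValueError("n_tests must be non-negative and alpha must lie in [0, 1]")
    return n_tests * alpha


N_CORRELATIONS = 600  # number of correlations reported in the text

# Illustrative alpha levels only; the keys themselves used other cutoffs.
for alpha in (0.05, 0.01):
    count = expected_chance_significant(N_CORRELATIONS, alpha)
    print(f"alpha = {alpha}: about {count:.0f} chance-significant correlations expected")
```

Under those assumptions, 600 tests at alpha = .05 give an expected count of 30, which conveys why some shrinkage on cross-validation should be anticipated.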
Clearly, the large number of correlations required for Study 2 increased the likelihood that chance factors could operate in this process of alternate scoring key development. A final consideration is that, although action elements endorsed by fewer than three participants across the sample were not used in the correlational analyses, this cut-off was somewhat arbitrary, and low endorsement rates for other action elements may not yield stable correlations.

These considerations pointed, then, to the likelihood that shrinkage would occur in a cross-sample validation. It should be noted, however, that using a fairly large initial sample (321 subjects) will tend to reduce shrinkage because of smaller sampling errors. At this point, it is difficult to determine the extent of the phenomenon of capitalization on chance in order to more accurately assess the true validity of the Quality of Judgement dimension as measured by the Empirical and Merged keys. It is possible, nonetheless, to estimate (albeit roughly) the degree of possible capitalization on chance by considering the number of statistically significant items and the number that would be expected from chance or Type I error. The most accurate means, however, to directly quantify the effect of this artifact is to apply the same method (multiple key derivations and reduced-item subsets) to a new sample.

It should be noted that one problem with this approach is the introduction of possible bias and variation in the results due to differences in the sample (geographic, cultural etc.).
A double cross-validation, in which the original sample is split and evaluated against itself, would be an effective way to estimate the true validity. However, a substantially larger sample would be needed than that available here. Given this limitation (of a changing sample), the most telling method to determine the degree to which the promising results observed in Study 2 are true, as opposed to inflated, predictive validities of the in-basket Quality of Judgement scale would be to re-administer the instrument in another setting, applying the same three scoring keys to the same three combinations of item sets.

RATIONALE AND HYPOTHESES FOR THE PRESENT STUDY

General Summary

The Literature Review has provided a considerable number of summaries of the empirical findings from forty years of research on the in-basket exercise. Consequently, specific summary observations, such as the marginal evidence of reliability and mixed evidence of criterion-related validity, will not be restated here. Instead, the broader implications from these observations will be incorporated with salient findings from Preliminary Studies 1 and 2 in order to identify the objectives and to develop the hypotheses for the present research study. (It should be noted that, although there is agreement among researchers for the need for further construct validity examinations of the in-basket, this area will not be addressed in the present study.)

Summaries from the literature and past research findings, especially those research results reported in the two preliminary studies, point to particular areas needing further investigation. For example, it was observed in the Literature Review that, given its widespread use, there was a relative lack of sound empirical validation for the in-basket exercise. A need for an approach to in-basket research and development which considered both psychometric and practical implications was identified.
The past dilemma of developing an objectively-scored in-basket with maximal reliability and criterion validity which could also be readily scored by industry personnel appears to remain after nearly forty years of research on this popular instrument. It was recognized that research efforts which focused on methods of scoring the instrument seemed a viable way to simultaneously address the somewhat contrary needs of empirical soundness and feasibility or practicality in application. These considerations and concerns guided the focus and goals of the two preliminary studies reported earlier.

The results of the first preliminary study examining a newly-revised in-basket, the TSIB, suggested that, although the traditional, panel-based scoring system yielded reliable dimensions, the criterion-related validity results were very modest. Moreover, an in-basket exercise which required nearly as long to score as it did to administer (1 1/2 hours) would simply not be practical in industry. The economic pressures resulting from the current recession and the high costs associated with lengthy scoring- and training-time would likely pose significant impediments to organizations considering the large-scale application of the in-basket in its standard form (i.e., written, free-format responses).

The second preliminary study was designed to investigate the relative psychometric and practical benefits of less conventional, more empirical, in-basket scoring strategies. The call from Brannick et al. (1989) for further research into in-basket scoring systems has therefore been answered, in part, by Study 2. This study also examined the effects of expanding the Scoring Manual to include more scorable units of action by the re-analysis of the Unusual actions, the effects of selecting several subsets of items for scoring, and the effects of developing and applying several Quality of Judgement scoring keys to these reduced-item subsets.
As noted, scoring key modifications and a reduced-item scoring approach did not result in a reduction in reliability and validity. In fact, significantly higher criterion-related validity coefficients across a greater number of dimensions were observed. Furthermore, substantial reductions in scoring time (and training time required for scoring) were realized. Although promising, the results from this most recent investigation must be researched further before the conclusion can be made with confidence that the observed validity coefficients are, in fact, more accurate estimates of the true validity for this critical dimension of administrative performance.

Proposal and Hypotheses

Accordingly, in the present study, a cross-validation of the new Quality of Judgement scoring approaches in another, related setting will be conducted. The same set of three scoring keys, combined with the reduced-item scoring approach (using the same item-subsets) will be applied to assess the extent of shrinkage in the cross-validated validity coefficients. It is hypothesized that detectable, but not substantial, reductions in the Quality of Judgement validity coefficients will be observed. It is further hypothesized that a new application of the reduced-item scoring approach will reaffirm its ability to provide optimal psychometric properties while also addressing more practical, application-oriented concerns. It should be made clear that, because scoring the full 21-item set no longer appears necessary or feasible, only those subsets of items used in Study 2 will be re-analyzed (8-, 10-, and 12-items). The full set of 21 items will not be scored for the new sample.

In addition to a cross-validation of the Quality of Judgement scoring strategies introduced in Study 2, further research and continued development of the TSIB in several additional areas appears warranted.
In particular, ways to further refine and understand the remaining dimensions are needed.

The seven in-basket dimensions that comprise the TSIB can be grouped into two broad categories: performance and stylistic dimensions. The two performance dimensions, Productivity and Quality of Judgement, generally reflect more evaluative, maximum-performance aspects of in-basket responses, whereas the remaining stylistic dimensions are frequency-based descriptive measures of the manner in which a participant tends to respond. To summarize, the dimensions measured in the TSIB were as follows:

Stylistic:
1. Planning and Organizing
2. Interpersonal Relations
3. Leadership in a Supervisory Role
4. Managing Personnel
5. Analysis and Synthesis in Decision-Making

Performance:
6. Productivity
7. Quality of Judgement

Criterion measurement

A company performance appraisal instrument, administered annually, will be used as the criterion measure for the present study. Data from 15 dimensions of work-related performance, measured by a 7-category ordinal scale involving descriptors (rather than numbers) reflecting levels of performance, will be collected from the Human Resources Department records for the previous two years of participants' performance (1989 and 1990). Numerical values will be assigned to each category (e.g., "excelling" = 1, "unsatisfactory" = 7). Averages will then be computed for each dimension across the two sets of ratings.

Quality of Judgement Scores

Predictive validities of scores based on the Panel, Empirical and Merged scoring keys. The critical Quality of Judgement dimension will form the basis for the cross-validation of the present study with a comparative analysis of the predictive accuracy of the Quality of Judgement scores as measured by the three scoring keys (Panel 2--Conservative, Empirical, and Merged keys), and as applied to each of the three TSIB reduced-item sets in the two studies.
The focus, then, will be to examine the effects of capitalization on chance by re-applying the same scoring keys used in Study 2 to measure the Quality of Judgement dimension in a new, second sample. The main research question to be addressed is to what degree the validity coefficients reported in Study 2 for the 8-, 10-, and 12-item subsets will hold up when applied in a new setting. It is hypothesized that the Merged and, to a lesser degree, the Empirical key used to measure the Quality of Judgement dimension will result in the largest validities among the three scoring keys.

Additional analyses based on the Panel scoring key. Whereas Preliminary Studies 1 and 2 both used a sample from Company 1, the present study will use a new sample from Company 2 which will allow, through sample comparisons, a more finely-tuned analysis of the Panel scoring keys applied to the Quality of Judgement dimension. For example, the question of whether the weak link in Study 1 arose from the specific judgements (i.e., calibre of decisions) from Panel 1, or from the actual method of combining the judgements of the panel can now be examined by separating the two critical variables: origin of the panel (Panel 1 versus Panel 2) and the method of combining panel judgements (Liberal or majority method with a three-point rating scale used in Study 1 versus Conservative method using a five-point rating scale used in Study 2) and examining several combinations of these variables in relation to the two datasets (Company 1 versus Company 2). A framework for the planned analyses is illustrated in Figure 5.
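The design just described crosses three two-level factors: method of combining panel judgements, origin of the panel, and company dataset. A minimal Python sketch, with hypothetical variable names, enumerates the cells of that 2 x 2 x 2 grid and removes the two combinations already covered by Studies 1 and 2 to show which analyses remain.

```python
from itertools import product

# The three two-level factors named in the text.
METHODS = ("Liberal", "Conservative")
PANELS = ("Panel 1", "Panel 2")
DATASETS = ("Company 1", "Company 2")

# Cells already analysed in the preliminary studies.
ALREADY_CONDUCTED = {
    ("Liberal", "Panel 1", "Company 1"),       # Study 1
    ("Conservative", "Panel 2", "Company 1"),  # Study 2
}


def remaining_analyses():
    """Return every (method, panel, dataset) cell not yet analysed."""
    return [
        cell
        for cell in product(METHODS, PANELS, DATASETS)
        if cell not in ALREADY_CONDUCTED
    ]


planned = remaining_analyses()
for method, panel, dataset in planned:
    print(f"{method} method, {panel}, applied to {dataset} dataset")
```

Eight cells minus the two already conducted leaves six planned analyses.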
The shaded areas represent analyses already conducted in Studies 1 and 2, namely Study 1 involved the Liberal Method using Panel 1 applied to the Company 1 dataset and Study 2 involved the Conservative Method using Panel 2 applied to the Company 1 dataset.

[Figure 5. Design grid for planned panel analyses in terms of origin of the Panel key, of method of combining panel judgements, and of the dataset to which the keys will be applied. Grid dimensions: PANEL (Panel 1, from Company 1; Panel 2, from Company 2); COMPANY DATASET to which the scoring key is applied (Company 1; Company 2); METHOD OF COMBINING PANEL JUDGEMENTS (Liberal; Conservative).]

In addition to the two analyses already conducted, then, six additional analyses will also be carried out:

1. Conservative combination method using Panel 1 applied to Company 1 dataset.
2. Conservative combination method using Panel 1 applied to Company 2 dataset.
3. Conservative combination method using Panel 2 applied to Company 2 dataset.
4. Liberal combination method using Panel 1 applied to Company 2 dataset.
5. Liberal combination method using Panel 2 applied to Company 1 dataset.
6. Liberal combination method using Panel 2 applied to Company 2 dataset.

It is hypothesized that, because of the greater consensus required among the panel members and the use of a 5-point, rather than 3-point, scale, the Conservative method of combining judgements will result in greater validity coefficients than the Liberal method. Also, it is expected that, in general, Panel 2 ratings will yield higher validities than Panel 1, because of the expanded list of action elements and the larger panel size used for the action element ratings. Lastly, "intra-company" analyses, comparing results wherein the source of the panel and the dataset match (e.g., Panel 1 applied to Company 1 dataset; Panel 2 applied to Company 2 dataset), will produce greater validity coefficients than "inter-company" analyses (Panel 1 applied to Company 2 dataset).
This pattern is expected because intra-company analyses should tend to eliminate possible confounding effects due to differences in corporate culture or geographical location.

Additional Performance Dimensions in the TSIB

Understanding of Situation Questionnaire. In the present study, a new eighth dimension (performance rather than stylistic) will be added in order to measure the participant's understanding of the issues, problems and broader implications of the scenario. Although it is often clear what actions the participant took on the exercise, it is often not known why a particular action was chosen. It is hypothesized that more information about a participant's rationale or thinking will constitute useful predictive information about that person's administrative potential. As suggested by the fairly low internal consistency results in Study 2, it may be that Quality of Judgement is made up of multidimensional, independent aspects of appropriateness of actions. Thus, by asking participants additional questions about their attitudes, strategies, and rationales behind the actions they chose, it is believed more valuable, predictive information will be obtained from participants' particular approaches to problems in the exercise. Although such data is perhaps more directly cognitive than is generally true of in-basket data, it nonetheless seems capable of improving in-basket assessment. To this end, a 33-item multiple choice Understanding of Situation Questionnaire will be designed and administered upon completion of the in-basket exercise. A sample item from the TSIB Understanding of Situation Questionnaire is provided in Figure 6.

The development of the scoring key for the Understanding of Situation Questionnaire will involve the same 21-member panel that provided the New Panel Quality of Judgement ratings.
They also completed this questionnaire, and the pattern of their selections will be analyzed; a preliminary key, using a 0, 1, 2 scoring system assigned to each of the five options per question, will then be applied. The particular value assigned to each option will depend on the level of endorsement from the expert panel. Across the five response options provided for an item, the panel members' pattern of option selection will be examined and an empirical approach, followed by a logical one, will be used to assign scoring weights. The empirical approach is based on the frequencies of option selections within an item. If, for example, option "b" were selected most frequently by panel members, it would initially be keyed "2" whereas if option "d" were chosen by very few members (or none), it would be keyed "0". Options receiving moderate endorsement by the panel (i.e., options neither maximally nor minimally selected) would initially be keyed "1". Following this empirically-based assignment of scoring weights, final determination of the scoring key will be made by a logical, or rational, consideration of the relative differences between option selection rates. If, for example, the two most frequently selected options within an item differed by one or perhaps two panel members' selections, both options may be keyed "2". Similarly, if two moderately-endorsed options showed a minimal difference in their endorsement rates, they may both be keyed "1".

At the moment, probably the most important issue to be dealt with is:
a. the budget overrun on overtime
b. conflicts among employees
c. public image/customer satisfaction
d. low productivity
e. low charitable donations

Figure 6. Sample Item from the Understanding of Situation Questionnaire.

The results from this instrument will be analyzed by correlating both the individual item scores and overall questionnaire scores with the company performance appraisal ratings.
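The empirical-then-logical keying procedure described above can be sketched in code. The sketch is illustrative only: the function names (key_item, score_questionnaire), the cutoff for "very few" selections (low_cutoff), and the margin within which an option counts as nearly tied with the most popular one (tie_margin) are assumptions, not values given in the text. The logical step, which the text leaves to judgement, is approximated here by a fixed rule.

```python
from collections import Counter


def key_item(selections, options="abcde", low_cutoff=1, tie_margin=2):
    """Assign a 0/1/2 weight to each option from the panel's selections.

    low_cutoff and tie_margin are assumed thresholds, not thesis values.
    """
    counts = Counter(selections)
    freq = {opt: counts.get(opt, 0) for opt in options}
    top = max(freq.values())

    key = {}
    for opt, n in freq.items():
        if n == top:
            key[opt] = 2          # most frequently selected option
        elif n <= low_cutoff:
            key[opt] = 0          # chosen by very few members (or none)
        else:
            key[opt] = 1          # moderate endorsement

    # Logical adjustment (approximation): a moderately endorsed option that
    # trails the top option by no more than tie_margin selections is also keyed 2.
    for opt, n in freq.items():
        if key[opt] == 1 and top - n <= tie_margin:
            key[opt] = 2
    return key


def score_questionnaire(answers, keys):
    """Sum the keyed weights of one participant's chosen options."""
    return sum(key[choice] for choice, key in zip(answers, keys))


# Hypothetical selections from a 21-member panel for two items.
item1 = key_item("b" * 10 + "c" * 9 + "a" + "e")
item2 = key_item("b" * 12 + "c" * 5 + "a" * 4)
total = score_questionnaire("cb", [item1, item2])
```

With these hypothetical selections, options "b" and "c" of the first item are both keyed 2 because they differ by a single panel member, whereas in the second item only "b" is keyed 2.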
Item-analyses using item-reliability and item-validity indices will be conducted in order to select a subset of items from the questionnaire with maximal reliability and validity. An important advantage of the questionnaire format compared to the standard ETS Reasons for Action form or a post-exercise interview is that the multiple-choice scoring format requires less administration and scoring time, and it provides improved reliability by using a quantitative, objectively-applied scoring key.

Productivity. Several different ways of operationalizing Productivity to increase the validity of this dimension will be examined. As seen in Study 2, the Productivity dimension did not contribute significantly to in-basket performance for men or women. It will be recalled that, in Study 1, Productivity was operationalized as the total number of actions taken across the exercise (including Unusual actions), as well as the number of items attempted.

In the present study, several specific methods of defining this dimension will be used. In essence, the approach will be based on that described in Hakstian et al. (1986), involving the aggregation of smaller, more concrete units of output. In all, seven units of productivity will be assessed: (a) the number of items attempted across 21 items, (b) the number of letters or memos written across 21 items, (c) a score reflecting the number of words written per memo or letter across the 21 items, (d) number of actions scheduled for a definite time, (e) number of entries made on the calendar, (f) whether a "things to do" list was completed, and (g) whether a "summary" list of the items was completed. Productivity linear composites will then be derived to determine the optimal Productivity operationalization by correlating the composite scores with company performance appraisal ratings.

In-Basket Stylistic Dimensions

The efficacy of logically assigning the Stylistic dimensions will be re-examined using the 8-, 10-, and 12-item subsets.
Using the expanded list of action elements described in Study 2, the total scores for each stylistic dimension (derived by summing the number of action elements logically coded for each dimension) will be correlated with company performance appraisal ratings. The purpose of this analysis is to examine the validity coefficients for these reduced-item dimensions when applied in the Company 2 dataset used in the present study. The intercorrelations of the stylistic dimensions will also be calculated, separately by gender, with the expectation that, as in Study 1, a fairly high level of intercorrelation among the stylistic dimensions will be seen. Results can then be compared with those obtained in Study 2 using the Company 1 dataset (although only broadly, because of the use of the expanded list of action elements in the present study).

A factor analysis of the action element endorsements will also be conducted because of the high intercorrelations among the dimension scores seen in Study 2 (Tables 6 and 7). This analysis will produce an empirical, as opposed to rational, set of administrative dimensions of performance from the second dataset. Using factor loadings, each action element will be coded for the most relevant factor. As observed in the Literature Review, there is some disagreement regarding the complexity of in-basket performance dimensions (Frederiksen, 1966; Lopez, 1966). It is hypothesized that this analysis will help resolve the conflicting results and conclusions seen to date. Criterion-related validities of each factorially-derived dimension will also be determined by correlating dimension scores with the company performance appraisal ratings.

Additional New Measures

In addition to the introduction of the new Understanding of Situation dimension and a re-examination of the existing dimensions as outlined above, two new measures will also be investigated.

Number of High Priority Items attempted.
The first new measure involves an evaluation, by the 21-member panel of Study 2, of the priority or urgency which should be afforded each item. The Priority rating was based on a consideration of the immediacy of action required by the item. Each panel member independently assigned a priority rating to each item using a 3-point scale in which "1" indicated a low priority item where action could be deferred one calendar week or more. A weight of "2" stood for a medium priority item where action, although not necessarily required immediately, was required within the week, and "3" was assigned to high priority items which required immediate action. Final identification of high priority items was made by considering the majority judgements of the panel.

In the present research, participants will receive one point for each high priority item attempted in the exercise, based on an evaluation of the entire set of 21 items, rather than a reduced-item subset. The total number of points, or high priority items attempted, will then be correlated with the company performance appraisal ratings. It is hypothesized that the number of high priority items attempted across the TSIB will show positive, significant validity when correlated with the criterion rating.

Scorer's Impression of Involvement. The second new measure follows from several promising results reported in the literature with the impressionistic Scorer's Rating of overall performance (e.g., Meyer, 1970). The predictive accuracy of a similar subjective measure will be explored in the present study. As a way to operationalize Lopez's (1966) crucial concept of "ego-involvement," a 3-point rating scale of the degree of involvement will be used by scorers to record their impression, from "no involvement" (assigned a 1) to "extreme involvement" (assigned a 3).
The assigned Scorer's Impression of Involvement values will be correlated with the company performance appraisal ratings to determine whether they contribute useful predictive information about participants' administrative performance.

METHOD

Participants and Setting

The participants were 321 employees of a large western Canadian utility company (176 males and 145 females). Sixty-one percent of the sample were first-level managers (195 employees), while the remaining 126 participants were second-level managers. In the summer of 1991, participants were given the TSIB and a low-fidelity simulation in a concurrent validation study of these instruments.

To recruit participants, the Human Resources Department first generated a random sample of 1,000 first- and second-level managers from a pool of approximately 2,000 employees across all major departments. Next, two factors guided the quasi-random selection of employees for possible participation: first, an equal gender balance of participants was sought, and secondly, equal employee involvement from across the seven main divisions of the company was desired. Potential participants were contacted by company mail to solicit their voluntary involvement in the study. Participants were provided with extensive, confidential feedback on their exercise results.

The two instruments required three hours to complete. As described in Preliminary Study 1, participants were first given 1 1/2 hours to complete the TSIB, preceded by 15 minutes of scripted instructions. Upon completion of the exercise, participants were then given the Understanding of Situation questionnaire, for which no time limit was imposed. (The average time required to complete the post-exercise questionnaire was 30 minutes.)
Participants were then administered the second, untimed, low-fidelity instrument, which usually required 1 hour to complete. Typically, two administration sessions were held each testing day, and participants were given the option of attending a morning or afternoon session. In total, 30 testing sessions were conducted over a testing period of approximately six weeks, in groups ranging from 5-6 employees to 20-25 employees at a time.

Assessment Measures

Predictor measures.

1. The Telephone Supervisor In-Basket Exercise (TSIB).
21 items. The TSIB used was identical to that employed in Studies 1 and 2, except for the inclusion of the Understanding of Situation Questionnaire (33 items).

2. The Supervisory Profile Inventory (SPI).
The SPI is a low-fidelity simulation consisting of two parts: a) Part A, a biodata questionnaire designed to measure employees' personal interests, background experiences, and opinions (50 items) and, b) Part B, also a questionnaire format, which assessed individual differences in managerial style (22 items). Because the SPI is not a focus of the present work, it shall not be discussed further.

Criterion measures.

1. Company Performance Appraisal.
Initially, a company performance appraisal instrument, administered annually, was selected as the criterion measure for the present study. Fifteen dimensions of work-related performance along a 7-point scale were measured. Criterion data were collected for each participant along the 15 performance dimensions measured for both 1989 and 1990. However, the performance ratings were severely positively skewed (the low end of the scale measured better performance whereas the high end of the scale measured poorer performance) with little variability. Consequently, the company appraisal instrument appeared to be demonstrating little discrimination in managerial performance between participants.
The likely causes of the limited criterion variability were the cumulative effects from several judgemental rating biases such as leniency, central tendency, and halo bias, as well as missing data from incomplete files.

2. The Employee Appraisal Inventory (EAI).
Because of these concerns about low variability in the company performance appraisal measure, an additional questionnaire, the Employee Appraisal Inventory (EAI), was then administered by company mail to the participants' immediate supervisors in order to collect further criterion data on work-related performance. All participants were informed by company mail of the need for this additional criterion measure and were assured that EAI results were confidential and for research (validation) purposes only. Participants could refuse to allow the collection of this additional criterion data with impunity. The EAI, which required approximately 10 minutes to complete, was returned for 296 of the 321 participants (159 males; 137 females). Of the 25 non-completions, five participants refused the EAI while the remaining 20 inventories were not returned by the immediate supervisor.

The EAI is made up of the same set of three behavioural observation scales (BOS) used to measure each of 12 performance dimensions previously described in Studies 1 and 2 in the Preliminary Studies section. As noted earlier, the development of the BOS scales is described more fully in an article by Hakstian et al. (1991). The instrument requires the supervisor to rate the frequency of certain job behaviours (corresponding to the performance dimensions) observed in the employee. The order of the 36 BOS statements, each measured by a scale ranging from 1 (low) to 5 (high), was rotated both by dimension and by tone (i.e., a positively-worded statement was followed by a negatively-worded one).
An example of one of the three BOS scale statements used to measure the management performance dimension Planning/Organizing/Control is the following:

Establishes a plan for the fiscal year; maps out when each event must take place in order to meet the stated goals; allows for unexpected circumstances.

Almost Never 1 2 3 4 5 Almost Always

A factor analysis of the 12 performance dimensions conducted in Study 2 resulted in four global management performance dimensions: (a) Interpersonal Effectiveness, made up of the dimensions Leadership, Behaviour Flexibility, and Sensitivity; (b) Methodical/Stable Performance, made up of the dimensions Planning/Organizing, Work Ethic, Initiative, and Performance Stability; (c) Insightful/Decisive Performance, made up of the dimensions Analysis, Judgement and Decisiveness; and (d) Communication Effectiveness, made up of Oral Communication and Written Communication. It will be recalled that, in the preliminary studies, three BARS scales and three BOS scales were constructed for each performance dimension. Whereas the preliminary studies based each of the four global management dimensions on a sum of the BARS and BOS scales, the present study, in order to facilitate the acquisition of criterion data, used only the BOS scales in the measurement of the performance dimensions.

These four dimensions were then added to yield a fifth, more molar criterion called Overall Management Performance (OMP). It is this Overall outcome measure that figures most prominently in the criterion-related analyses of the present study. It should be made clear that the four global dimensions and the OMP to be used here were based on half the number of scales used in earlier studies because of the exclusion of the BARS scales.
(Where possible and appropriate, comparable criterion correlations between the OMP and relevant TSIB dimensions from Study 2 will be reported, in order to facilitate the evaluation of the scoring keys used in the Quality of Judgement dimension cross-validation.)

RESULTS

Overview of Study Design and Data Analysis

Design

The primary focus in the present study was to assess the degree of shrinkage in the validity coefficients of a cross-sample validation of Quality of Judgement scores derived from the three scoring keys (Panel 2--Conservative, Empirical, and Merged keys) previously applied in Study 2 to the three reduced-item subsets (8-, 10-, and 12-items). Accordingly, a concurrent validation of the TSIB was undertaken by administering it to a second, related sample. The validities of the Quality of Judgement scores were examined and the validity coefficients from the two applications were compared. It should be clear that the main difference in the version of the TSIB used in Preliminary Study 2 compared to that used in the present study is the introduction of a multiple-choice questionnaire administered upon completion of the TSIB.

The concurrent validation allowed further specific analyses, involving new measures (Number of Priority Items Attempted and Scorer's Impression of Involvement), new operationalizations of existing dimensions (e.g., Productivity) and a factor-analytic derivation of TSIB stylistic dimensions, to be carried out. These analyses are summarized in the following section.

Data Analysis

Reliability. In order to validate the TSIB, data from existing measures of job performance were collected from the company (against which TSIB scores were correlated). The reliability (alpha coefficient) of the criterion performance dimensions was calculated. Reliability estimates (internal consistency) of the TSIB Quality of Judgement dimension as measured by the three scoring keys, applied to the three reduced-item subsets, were also determined.
In addition, internal consistency estimates of the reliability of the TSIB stylistic and remaining performance dimensions (Understanding of Situation Questionnaire and Productivity) were carried out.

Criterion-related validity. The validity of the Quality of Judgement scores, derived from the three scoring keys and applied to the three reduced-item subsets, was examined by correlating dimension scores with on-the-job measures of performance. Comparisons with previously-obtained validity coefficients for the same reduced-item subsets, scored using the same scoring keys, were made.

In addition, to further assess the predictive accuracy of the Panel-based derivation of the Quality of Judgement scoring key, Quality of Judgement scores derived in the following six ways were correlated with job performance measures: (a) Conservative panel judgement combination method using Panel 1 applied to Company 1 dataset, (b) Conservative combination method using Panel 1 applied to Company 2 dataset, (c) Conservative combination method using Panel 2 applied to Company 2 dataset, (d) Liberal combination method using Panel 1 applied to Company 2 dataset, (e) Liberal combination method using Panel 2 applied to Company 1 dataset, and (f) Liberal combination method using Panel 2 applied to Company 2 dataset. Thus, three aspects of the Panel-based Quality of Judgement keys--the method of combining panel judgements, the origins of the Panel key, and the origins of the data to which the keys were applied--were targeted in order to identify those factors related to the Panel key that are most useful in predicting administrative ability.

The TSIB stylistic dimensions (Planning and Organizing Work through Analysis and Synthesis in Decision-Making) were measured in the present study using the expanded list of action elements applied to the reduced-item subsets.
Criterion-related validity results and the intercorrelations of the stylistic dimensions were then compared with those obtained in Study 2 (which used the Company 1 dataset). A factor analysis of the action element endorsements from the Company 2 dataset was conducted in order to produce an empirical, as opposed to rational, set of descriptive dimensions of administrative performance. Reliability estimates (internal consistency) and criterion-related validities of each factorially-derived dimension were determined.

Validity coefficients were corrected for both criterion unreliability and range restriction (Schmidt & Hunter, 1981). The correction for range restriction was based on logic more fully described in Hakstian et al. (1991). Briefly, as a maximum performance measure, administrative skill assessment relied on an unrestricted variance estimate that was midway between the general population variances published in standardized cognitive measure test manuals and the restricted variance estimates supplied by the data at hand.

A brief comment on the Results. It should be noted that, in the ensuing presentation of results, the data are merely described and briefly summarized. More detailed analyses of the hypotheses and consideration of the implications of the results are found in the following Discussion section.

Major Reliability and Criterion-related Validity Findings

Reliability of the Performance Appraisal Dimension Scores

A reliability analysis (Cronbach's alpha) was conducted on the four global management criteria and the fifth, more molar criterion (previously described in the Methods section). These reliability estimates are reported in Table 8. The alpha coefficients of the four global management performance dimensions range from a low of .48 to a high of .92, with a mean of .80. The mean alpha coefficient of the four global management dimensions increases to .84 when the lowest, anomalous result (.48 for female Communication Effectiveness) is omitted.
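For reference, the alpha coefficient used in this reliability analysis can be computed as in the following minimal sketch. The function name and the toy ratings are hypothetical and are not thesis data. Only the eight dimension-level alphas at the end are taken from Table 8, and they reproduce the two means quoted above.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of k scale scores."""
    k = len(rows[0])
    # Sample variance of each scale across respondents.
    scale_vars = [variance([row[j] for row in rows]) for j in range(k)]
    # Sample variance of the respondents' total scores.
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(scale_vars) / total_var)


# Hypothetical ratings: four employees rated on three scales (not thesis data).
toy_ratings = [[1, 2, 2], [2, 3, 3], [4, 4, 5], [5, 5, 4]]
toy_alpha = cronbach_alpha(toy_ratings)

# Alphas of the four global dimensions from Table 8 (males, then females).
table8_alphas = [.81, .92, .91, .79, .76, .86, .85, .48]
mean_all = sum(table8_alphas) / len(table8_alphas)

# Same mean with the lowest value (.48) omitted.
without_lowest = sorted(table8_alphas)[1:]
mean_without_lowest = sum(without_lowest) / len(without_lowest)
```

Rounded to two decimals, mean_all and mean_without_lowest correspond to the .80 and .84 reported in the text.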
For comparison, a reliability estimate based on a generalizability analysis of the four global performance dimensions measured in Preliminary Study 1 (involving Company 1) and then corrected for length for application in the present study (the number of scales used here was half that of Study 1) yielded single-facet generalizability (alpha) coefficients ranging from .83 to .90, with a mean of .87. Thus, the four global management performance dimensions show consistent evidence of strong reliability. In addition, as shown in Table 8, the fifth, most molar performance criterion, Overall Management Performance or OMP, is highly reliable.

Cross-Validation of the Quality of Judgement Scores

Of central interest in the present study is the degree to which the three TSIB Quality of Judgement keys (Panel 2--Conservative, Empirical, and Merged), as applied to the three reduced-item subsets (8-, 10-, and 12-item), yield similar criterion-related validity results when applied to the Company 2, rather than Company 1, dataset. Table 9 presents the resulting validity coefficients, with relevant coefficients from the previous application (correlated against the same OMP criterion) provided in parentheses beside each result. As expected, the Merged key yields higher validity coefficients than the Empirical key, from which only two of six correlations (the 8- and 10-item subsets for women) are significant. As noted earlier, these results for the Quality of Judgement scores will be reviewed in greater depth in the following Discussion section.
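The validity coefficients reported in this section are corrected for criterion unreliability and range restriction. A minimal sketch of two standard corrections follows, assuming the classical correction for attenuation in the criterion and Thorndike's Case II formula for range restriction; the thesis does not spell out its formulas, so these may differ in detail from the procedure actually used. The function names are hypothetical, the numeric inputs are invented for illustration, the order in which the two corrections are applied is an assumption, and "midway" is read here as the arithmetic midpoint of the two variances.

```python
from math import sqrt


def correct_for_criterion_unreliability(r, criterion_reliability):
    """Correction for attenuation in the criterion only: r / sqrt(r_yy)."""
    return r / sqrt(criterion_reliability)


def correct_for_range_restriction(r, sd_unrestricted, sd_restricted):
    """Thorndike Case II: r*U / sqrt(1 + r^2 * (U^2 - 1)), with U = SD ratio."""
    u = sd_unrestricted / sd_restricted
    return (r * u) / sqrt(1 + r * r * (u * u - 1))


def midway_variance(population_variance, sample_variance):
    """Assumed reading of 'midway': the arithmetic midpoint of the two variances."""
    return (population_variance + sample_variance) / 2


# Invented illustrative inputs (not thesis data).
observed_r = 0.20
criterion_alpha = 0.96          # an alpha of the size reported for the OMP
population_var, sample_var = 100.0, 64.0

unrestricted_var = midway_variance(population_var, sample_var)
r_step1 = correct_for_range_restriction(
    observed_r, sqrt(unrestricted_var), sqrt(sample_var)
)
# Assumed order: range restriction first, then criterion unreliability.
r_corrected = correct_for_criterion_unreliability(r_step1, criterion_alpha)
```

Both corrections can only increase the magnitude of a correlation when the criterion reliability is below 1 and the unrestricted standard deviation exceeds the restricted one, which is why corrected coefficients are larger than the observed ones.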
(The reliability of the Panel 2--Conservative key, selected because it yielded the highest overall validity coefficients, was determined and the findings are shown later, in Table 18, along with the reliability of the remaining TSIB dimensions.)

Table 8
Reliability Analysis (Alpha Coefficient) of the Five Performance Appraisal Dimensions

                                      Alpha Coefficient
Dimension                             Males     Females
Interpersonal Effectiveness           .81       .76
Methodical/Stable Performance         .92       .86
Insightful/Decisive Performance       .91       .85
Communication Effectiveness           .79       .48
Overall Management Performance        .96       .92

Table 9
Bivariate Correlations between Overall Management Performance and Quality of Judgement Scores Obtained from Three Scoring Keys and from Three Item Sets

                                  Scoring Key
Item Sets       Panel 2           Empirical         Merged
                (Conservative)
Males
  8 item set    .30** (.29)       (.61)             .19* (.60)
  10 item set   .26** (.23)       (.64)             .18* (.62)
  12 item set   .26** (.24)       (.65)             .18* (.61)
Females
  8 item set    .25** (.28)       .19* (.63)        .23* (.59)
  10 item set   .32** (.26)       .19* (.62)        .23* (.62)
  12 item set   .25** (.23)       (.64)             .23* (.63)

Note. Reported r's have been corrected for effects of range restriction and criterion unreliability. Values in parentheses are bivariate correlations between Overall Management Performance and Quality of Judgement scores obtained in Study 2, provided to facilitate comparison between the previous and present study.
*p < .05 (one-tailed). **p < .01 (one-tailed).

As noted in the Rationale and Hypotheses for the Present Study section, the application of Quality of Judgement scores to a new dataset allowed more finely-grained analyses of the Panel-based scoring key. Table 10 shows the results of six separate, but related, analyses based on varying three factors: the origin of the Panel key, the method of combining panel judgements, and the dataset to which the keys were applied.
Once more, these results and their relation to the hypotheses outlined earlier will be reviewed in the following Discussion section.

Table 10
Bivariate Correlations between Overall Management Performance and Quality of Judgement Scores Obtained Using Scoring Keys based on Different Methods of Combining Panel Judgements (Liberal or Conservative), Different Origins of Panel Ratings (Panel 1 or Panel 2), and Applied to Different Datasets (Company 1 or Company 2).

                                                                Item Sets
Quality of Judgement Scoring Approach                 8-item    10-item   12-item
Males
1. Conservative Method, Panel 1, Company 1 dataset    .30**     .24**     .25**
2. Conservative Method, Panel 1, Company 2 dataset    .19*      .18*      .21*
3. Conservative Method, Panel 2, Company 1 dataset    .29**     .23**     .24**
4. Conservative Method, Panel 2, Company 2 dataset    .30**     .26**     .26**
5. Liberal Method, Panel 1, Company 1 dataset         .37**
6. Liberal Method, Panel 1, Company 2 dataset         .26**     .23**     .18**
7. Liberal Method, Panel 2, Company 1 dataset         .32**     .25**     .25**
8. Liberal Method, Panel 2, Company 2 dataset         .29**     .23**     .24**
Females
1. Conservative Method, Panel 1, Company 1 dataset    .19*  .20* †
2. Conservative Method, Panel 1, Company 2 dataset    .37**     .38**     .36**
3. Conservative Method, Panel 2, Company 1 dataset    .28**     .26**     .23**
4. Conservative Method, Panel 2, Company 2 dataset    .25**     .32**     .25**
5. Liberal Method, Panel 1, Company 1 dataset         .29**
6. Liberal Method, Panel 1, Company 2 dataset         .35**     .35**     .36**
7. Liberal Method, Panel 2, Company 1 dataset         .26**     .25**     .23**
8. Liberal Method, Panel 2, Company 2 dataset         .30**     .30**     .32**

Note. Reported r's have been corrected for the effects of range restriction and criterion unreliability. Correlations for the 10- and 12-item sets, scored by the Liberal Method, Panel 1 key, applied to the Company 1 dataset are not available. All non-significant correlations were omitted.
*p < .05 (one-tailed). **p < .01 (one-tailed).
[† The source text shows only two coefficients for this row and does not indicate which two of the three item-set columns they belong to.]

Other Performance Dimensions in the TSIB

Understanding of Situation Questionnaire. As described in the Rationale and Hypotheses for the Present Study section, the Understanding of Situation Questionnaire consisted of 33 items, with five response options per item, designed to measure the participant's understanding of the issues, problems and broader implications of the scenario presented in the in-basket exercise. The panel whose ratings provided the scoring key (previously described) strongly recommended the deletion of one particular item, which was subsequently eliminated to yield a final set of 32 items comprising the questionnaire.

Table 11 presents the reliability (internal consistency) and criterion-related validity results of the 32-item questionnaire. The resulting validity coefficient is modest for females but marginal for males. Therefore, item analyses were conducted in order to select a subset of items from the original 32 which would yield stronger validity coefficients for both genders. Item-reliability and item-validity indices were first calculated and then represented on a graph to better select those items with maximal reliability and validity for both genders. Several sets of items were examined, with the inclusion of individual items based more on their contribution to validity than to reliability. Table 12 presents the reliability and criterion-related validity results for the most promising subset of items, comprising 20 items selected for retention in the final version of the Understanding of Situation Questionnaire.

Table 11
Validity Coefficients between Overall Management Performance (OMP) and Understanding of Situation Questionnaire Total Score (32-item): Alpha Reliability Estimates for Total Score

                                                        Bivariate Correlation with OMP (a)
                                                        Males     Females
Understanding of Situation Questionnaire Total Score    .19 (b)   .29 (c)

                                                        Alpha Coefficient
                                                        Males     Females
Understanding of Situation Questionnaire Total Score    .24       .38

Note. (a) The reported r's have been corrected for the effects of range restriction and criterion unreliability.
(b) Significant at the .05 level (one-tailed test).
(c) Significant at the .01 level (one-tailed test).

Table 12
Validity Coefficients between Overall Management Performance and Understanding of Situation Questionnaire Total Score (20-item) and Reliability Estimates for Total Score

                                                        Bivariate Correlation with OMP (a)
                                                        Males     Females
Understanding of Situation Questionnaire Total Score    .35       .34

                                                        Alpha Coefficient     Split Half
                                                        Males     Females     Males     Females
Understanding of Situation Questionnaire Total Score    .23       .27         .23       .31

Note. (a) The reported r's have been corrected for the effects of range restriction and criterion unreliability. Both are significant at the .001 level (one-tailed tests).

Productivity. In its original operationalization applied in Preliminary Studies 1 and 2, Productivity was measured by the sum of the total number of items attempted, the total number of action elements, and the total number of Unusual actions taken across the exercise.
It will be recalled that, for both males and females, criterion-related validity preliminary study findings were non-significant.

In the present study, several new ways of operationalizing Productivity were derived, based on the aggregation of smaller, concrete units of output, in order to examine the effect on the criterion-related validity of this performance dimension. In total, seven units of productivity were assessed: (a) the number of items attempted across 21 items, (b) the number of letters or memos written across 21 items, (c) a score reflecting the number of words written per memo or letter across the 21 items, (d) number of actions scheduled for a definite time, (e) number of entries made on the calendar, (f) whether a "things to do" list was completed, and (g) whether a "summary" list of the items was completed. More standard indices of output were also calculated for each item-subset, comprised of the total number of action elements and total number of Unusual actions taken over the 8-, 10-, and 12-item subsets and are referred to as Total action (8), Total action (10), and Total action (12), respectively. Productivity linear composites were also derived in order to determine the optimal Productivity operationalization, based on the most promising of the individual units of output.

The upper portion of Table 13 presents the validity coefficients between the individual units of output and the OMP.
The middle portion of the table displays the results of the correlations between the linear composites and the OMP, whereas the lower portion of Table 13 presents validity coefficients between the more standard operationalizations of Productivity (as an index of the overall quantity of participants' output) and the OMP.

TSIB Stylistic Dimensions

Using the expanded list of action elements prepared in Study 2, the total scores for each stylistic dimension (calculated by summing the number of action elements logically coded for each dimension) were correlated with the OMP for each of the three item-subsets and are displayed in Table 14. As Table 14 shows, a pattern that differs across genders but is consistent within each gender is evident. Specifically, across the item-subsets, Interpersonal Relations and Leadership in a Supervisory Role are the only significant stylistic dimensions for men, whereas Planning and Organizing Work, Managing Personnel, and Analysis and Synthesis in Decision-Making yield useful predictive information for women.

Intercorrelations among the TSIB stylistic dimension scores were determined and are reported in Tables 15, 16, and 17 for the 8-, 10-, and 12-item subsets, respectively. Although more accurately considered a performance dimension, the Productivity dimension, measured in the same way as in Study 1, was included in these analyses for completeness. (The Quality of Judgement dimension was not included because of the variety of methods used to calculate scores.)

Table 13
Validity Coefficients between Overall Management Performance (OMP) and Productivity as Measured by Several Individual Units of Output, Linear Composites, and Standard Total Action Indices

Bivariate Correlation with OMP
Productivity--Units of Output   Males   Females

Individual Units of Output
1. Item Total (across 21 items)
2. # Letters/memos (across 21 items)   .24**
3. Total # words written (across 21 items)   .24**
4. Definite actions
5. Calendar entries
6. "Things to do" list
7. "Summary" list   .20*

Linear Composites of Units of Output
1. Item Total, # Letters/memos, Total action (10)   .24**   .24**
2. Item Total, Total # words written, Total action (10)   .25**
3. Item Total, Definite actions, Total action (10)   .19*   .23*
4. Item Total (across 10 items), # Letters/memos, Total action (10)   .25**   .25**

Standard Approach to Scoring of Productivity
1. Total action (8-item)   .18*   .21*
2. Total action (10-item)   .18*   .25**
3. Total action (12-item)   .18*   .25**

Note. Reported r's have been corrected for the effects of range restriction and criterion unreliability. All non-significant correlations were omitted.
*p < .05 (one-tailed). **p < .01 (one-tailed).

Table 14
Validity Coefficients between Overall Management Performance (OMP) and TSIB Stylistic Dimension Scores

                                                  Bivariate Correlation with OMP
Stylistic Dimension                               Males     Females

8-item set
1. Planning and Organizing Work                             .21*
2. Interpersonal Relations                        .26**
3. Leadership in Supervisory Role                 .28**
4. Managing Personnel                                       .19*
5. Analysis and Synthesis in Decision-Making                .24**

10-item set
1. Planning and Organizing Work                             .20*
2. Interpersonal Relations                        .26**
3. Leadership in Supervisory Role                 .29**
4. Managing Personnel                                       .20*
5. Analysis and Synthesis in Decision-Making                .23**

12-item set
1. Planning and Organizing Work                             .23*
2. Interpersonal Relations                        .25**
3. Leadership in Supervisory Role                 .26**     .18*
4. Managing Personnel                                       .21*
5. Analysis and Synthesis in Decision-Making                .25**

Note. All non-significant correlations were omitted.
*p < .05 (one-tailed).
**p < .01 (one-tailed).

Table 15
Intercorrelations of TSIB Stylistic Dimensions and Productivity Scores based on the 8-Item Set

Males
Dimension                                   P & O  IR     LSR    MP     A & S  P
Planning and Organizing Work                       .14    .14    .13    .47    .53
Interpersonal Relations                                   .69    .75    .25    .55
Leadership in a Supervisory Role                                 .47    .29    .51
Managing Personnel                                                      .28    .45
Analysis and Synthesis in Decision-Making                                      .77
Productivity

Females
Dimension                                   P & O  IR     LSR    MP     A & S  P
Planning and Organizing Work                       .37    .36    .24    .53    .55
Interpersonal Relations                                   .73    .78    .40    .67
Leadership in a Supervisory Role                                 .47    .36    .57
Managing Personnel                                                      .38    .58
Analysis and Synthesis in Decision-Making                                      .78
Productivity

Table 16
Intercorrelations of TSIB Stylistic Dimensions and Productivity Scores based on the 10-Item Set

Males
Dimension                                   P & O  IR     LSR    MP     A & S  P
Planning and Organizing Work                       .10    .11    .08    .64    .60
Interpersonal Relations                                   .76    .76    .13    .51
Leadership in a Supervisory Role                                 .54    .18    .48
Managing Personnel                                                      .17    .42
Analysis and Synthesis in Decision-Making                                      .75
Productivity

Females
Dimension                                   P & O  IR     LSR    MP     A & S  P
Planning and Organizing Work                       .30    .28    .21    .69    .58
Interpersonal Relations                                   .75    .76    .26    .62
Leadership in a Supervisory Role                                 .49    .28    .57
Managing Personnel                                                      .27    .55
Analysis and Synthesis in Decision-Making                                      .72
Productivity

Table 17
Intercorrelations of TSIB Stylistic Dimensions and Productivity Scores based on the 12-Item Set

Males
Dimension                                   P & O  IR     LSR    MP     A & S  P
Planning and Organizing Work                       .13    .14    .12    .71    .67
Interpersonal Relations                                   .82    .79    .16    .53
Leadership in a Supervisory Role                                 .61    .20    .51
Managing Personnel                                                      .19    .44
Analysis and Synthesis in Decision-Making                                      .77
Productivity

Females
Dimension                                   P & O  IR     LSR    MP     A & S  P
Planning and Organizing Work                       .34    .34    .26    .77    .67
Interpersonal Relations                                   .82    .81    .28    .64
Leadership in a Supervisory Role                                 .59    .31    .60
Managing Personnel                                                      .27    .55
Analysis and Synthesis in Decision-Making                                      .75
Productivity

Reliability of the TSIB dimensions

Reliability estimates (split half and alpha coefficients) were calculated for the seven dimensions, with Productivity measured as the standard index of overall output (sum of total action elements and Unusuals) and Quality of Judgement measured by the Panel 2 (Conservative) key, for each of the three item-subsets. These results are provided in Table 18. Not surprisingly, a pattern of increasing reliability with larger item-subsets is seen.

Additional New Measures

Number of High Priority Items Attempted. Ratings of priority or urgency (based on a consideration of the immediacy of action required by the item) were provided by the 21-member panel of Study 2 using a 3-point scale where 1 indicated a low priority item and 3 indicated a high priority item. Final identification of High Priority items was made from the majority judgements of the panel. Consequently, six items in the TSIB were labelled as High Priority items. Participants received one point, along a 6-point scale, for each High Priority item attempted across the entire set of 21 items in the TSIB. This total "High Priority Score" was then correlated with the OMP. Table 19 presents the results from this analysis.

Scorer's Impression of Involvement. A 3-point rating scale of the degree of involvement was used by scorers to measure their subjective impression of participants' efforts on the TSIB. The level of participation and commitment in the actions taken were scored from a low of 1, indicating "no involvement", to 3, indicating "extreme involvement". No significant results, for either gender, emerge when the Scorer's Impression of Involvement scores are correlated with the OMP. It appears, then, that this new impressionistic measure does not contribute useful predictive information about participants' administrative performance.

Table 18
Reliability Estimates of TSIB Dimensions

                                              Alpha Coefficient   Split Half
TSIB Dimension                                Males     Females   Males     Females

8-item set
1. Planning and Organizing Work               .16       .30       .25       .28
2. Interpersonal Relations                    .33       .29       .37       .43
3. Leadership in Supervisory Role             .05       .35       .08       .41
4. Managing Personnel                         .44       .42       .51       .46
5. Analysis and Synthesis in Decision-Making  .45       .42       .54       .46
6. Productivity                               .45       .51       .57       .59
7. Quality of Judgement(a)                    .24       .37       .34       .39

10-item set
1. Planning and Organizing Work               .38       .48       .51       .48
2. Interpersonal Relations                    .41       .33       .43       .37
3. Leadership in Supervisory Role             .22       .40       .27       .33
4. Managing Personnel                         .43       .42       .48       .45
5. Analysis and Synthesis in Decision-Making  .59       .58       .66       .68
6. Productivity                               .45       .55       .47       .61
7. Quality of Judgement(a)                    .26       .38       .36       .41

12-item set
1. Planning and Organizing Work               .45       .56       .61       .61
2. Interpersonal Relations                    .47       .45       .57       .49
3. Leadership in Supervisory Role             .30       .51       .39       .55
4. Managing Personnel                         .48       .49       .57       .61
5. Analysis and Synthesis in Decision-Making  .63       .65       .71       .82
6. Productivity                               .55       .67       .59       .78
7. Quality of Judgement(a)                    .41       .46       .56       .58

Note. (a) Quality of Judgement scores are measured by the Panel 2 key (Conservative method of combining judgements).

Table 19
Bivariate Correlation between Number of High Priority Items Attempted and Overall Management Performance (OMP)

Bivariate Correlation with OMP
Males   Females
Number of High Priority Items Attempted   .21*

Note. Reported r has been corrected for effects of range restriction and criterion unreliability. Non-significant correlations were omitted.
*p < .05 (one-tailed test).

Factor Analysis of the TSIB Action Elements

It was desired to construct an empirically rather than logically-derived set of scoring keys for the stylistic aspects of administrative performance.
Accordingly, the action elements from a subset of the TSIB items were combined in a factor analysis in order to develop a set of meaningful, independent stylistic dimensions of in-basket performance using factor loadings to assign action elements to performance dimensions (factors).

The same items comprising the reduced-item subsets used in the Quality of Judgement analyses were included in the factor analysis. Specifically, the action elements from the largest set of 12 items, which incorporated the same items used in both the 8- and 10-item sets, along with action elements from two additional items, were combined in a factor analysis. In total, 440 action elements from the 14 items were selected as variables. No statistical software package was available which could factor analyze such an unusually large number of variables. Software limitations also precluded an examination of the data from the 440 variables for mean gender differences and for homogeneity of covariance matrices. As a result, separate gender data were pooled and several factorings of smaller sets of selected variables were carried out. In all, three factorings were conducted.

A Series of Factorings

The first factor analysis. In order to reduce the number of variables to permit computer analysis (the least limited software package allowed approximately 275 variables), the set of 440 action elements was first reduced by eliminating those with low endorsement rates, defined as those action elements endorsed four times or fewer over the sample of 321 participants. In addition, the action elements from two items (those not part of the 12-item subset) were excluded from the first factor analysis, resulting in 269 action elements remaining for inclusion as variables in the factor analysis.

The criterion used to extract the optimal number of factors was primarily rational rather than psychometric.
The commonly-used Kaiser-Guttman rule of the number of eigenvalues greater than unity was not appropriate to determine the optimal number of factors because the number of eigenvalues greater than unity was excessive (95, in total). Cattell's (1966) scree test indicated that four factors were likely the correct number. The decision to extract six factors in this first analysis was made with the rationale that fewer factors may not be comprehensive enough to describe the data and a greater number of factors would be difficult to interpret. The six factors were obtained and transformed to an optimal oblique simple structure using the oblimin rotation procedure (Carroll, 1960). Factor loadings from the oblique primary factor-pattern matrix were examined in order to select a subset of variables for inclusion in the second factoring. Variables with loadings of .30 or greater were targeted for retention whereas those variables with loadings less than .30 were excluded in the subsequent factoring.

The second factoring. The variables involved in the second factoring of the data included the targeted variables from the first analysis and the action elements from the remaining two items of the 14 items identified for factor analysis, for a total of 218 action elements/variables. Again, six factors were obtained and transformed to an optimal oblique simple structure by the oblimin rotation procedure. Factor loadings from this second factor pattern matrix were examined to select variables for inclusion in the final factoring of the data. Relatively few large loadings were evident. Accordingly, it was decided to exclude the action elements from the two new items added in the second factoring, and provide a more complete factoring of the 12-item subset by including those action elements which had been previously excluded because of low endorsement rates.

The third factoring. For this final factoring, a total of 268 action elements were retained as variables.
Six factors were extracted and transformed to an optimal oblique simple structure by the oblimin rotation procedure. In order to facilitate the interpretation of the factors from the sizable factor-pattern matrix, the 268 action elements (variables) were grouped, within each factor, by their factor-pattern matrix loadings: Group "A" action elements with loadings of .26 or greater, and Group "B" action elements with loadings between .175 and .25. The extensive factor-pattern matrix is too large to report here. Instead, the general sense of each factor will be discussed below, with some specific action element content given to make the discussion more concrete.

Interpretation of the factors

On the basis of this third oblique factor solution, five interpretable factors emerged, which were interpreted as follows:

FACTOR I: CONSULTATION AND DISCUSSION. This factor indicates a tendency to discuss with superiors and subordinates before taking action, to communicate in person with a focus on immediate, short-range concerns and issues, and to require an exchange of views in an attempt to reach a decision. The central trait for persons high on this factor appears to be an orientation to people.

A total of 16 action elements were classified as Group "A" action elements, whereas 26 action elements were considered Group "B" elements. The action elements loading on this factor (listed in Appendix B) show a pattern, across items, of initiating discussions with various staff rather than making final decisions. Although the content of some action elements (primarily those with B loadings) suggests meetings are not to be held, it is likely that, given the particular item content, action elements calling for discussions and meetings with personnel would be more conclusive, rather than preparatory, in dealing with the problem presented in the item.
For example, Item 10 consisted of a terse memo from the participant's superior threatening disciplinary action if productivity is not improved. The participant is directed to sign the memo and distribute it to all personnel. It is likely that meeting with staff to discuss the productivity problem is a definitive step toward resolving the issue, and is more action-oriented than holding discussions in the context of issues presented in other items.

FACTOR II: INDEPENDENT DECISION-MAKING. This factor appears concerned with the ability to take final action without guidance from superiors or subordinates and the ability to get things done in an autonomous, decisive way. Persons who score high on this factor would likely not require close supervision or control to make decisions, and would show initiative and drive in identifying and following through with actions.

For this factor, a total of 17 action elements were classified as Group "A" action elements and 17 action elements were categorized into Group "B" elements. The action elements loading on this factor consistently display an action-oriented approach to dealing with the items in the TSIB. There is a tendency to not discuss issues with superiors and to act conclusively in handling problems. For example, the action elements with the largest loadings from Item 4 (reproduced in Figure 4) include finding out who is involved in the coffee break violations, driving by the Coffee Shop, and giving reprimands to those involved.

FACTOR III: COMPANY-BASED DECISION-MAKING. This factor describes one who tends to comply with policies and decisions set out by superiors and who takes action, but likely after consultation with superiors.
A tendency to solicit opinions and guidance from higher levels in developing and carrying out managerial duties, a concern with maintaining consistency of individual decisions and actions with the broader mandate from the company, and a sensitivity to company priorities characterize those who score high on this factor.

Here, a total of 18 action elements were classified as Group "A" action elements, whereas 19 action elements were considered Group "B" elements. Across items, the action elements with the highest loadings (and thus the most interpretive weight) exhibit a common thread of communication with superiors and a desire to conform to company directives, such as minimizing or prohibiting overtime, despite a backlog of orders.

FACTOR IV: SUPERVISING STAFF. This factor appears related to involvement in scheduling and supervising the activities of staff, the tendency to ask for information and to review and control the allocation of staff. A supervisor scoring high on this factor would likely provide structured, close monitoring of the activities of staff, would delegate and direct work to subordinates, would provide suggestions or proposals to staff to deal with issues or problems, and would be able to disapprove suggestions or plans from staff as necessary.

For this factor, 15 action elements were classified as Group "A" elements and 15 as Group "B" elements. Across all factors, the highest loadings are seen in this fourth factor. A strong and consistent pattern of actions which involves assigning duties to staff, and meeting with both subordinates and the company resource person (likely to discuss staff conflicts), is evident.

FACTOR V: INTEGRATION AND ANALYSIS.
This factor is concerned with the degree of recognition of the inter-relatedness of problems across items or situations, the ability to see the broader view and implications of problems in the exercise, and a consideration of the urgency or priority of taking actions or in deciding the order for work.

For this final factor, seven action elements were classified as Group "A" action elements, whereas 33 action elements were considered Group "B" elements. Although this factor yields the lowest number of action elements with A-level loadings, consideration of the B-level loadings suggested a tendency to investigate through peer discussions and more impersonal fact-finding actions (e.g., ask secretary for overtime figures). Of the five final factors, the few large loadings and numerous moderate loadings (33, in total) made this the most difficult factor to interpret.

The sixth factor extracted from the third oblique factor solution did not present a clear, interpretable pattern based on action element content and so was omitted. The intercorrelations among the five primary factors are shown in Table 20.

Table 20
Primary-Factor Intercorrelations

Factors    I      II     III    IV     V
I
II         .01
III       -.02   -.01
IV        -.01   -.07    .02
V          .03   -.01    .01    .02

Assignment of Action Elements to Factors

Using the factor loadings provided by the factor-pattern matrix of the third factor solution, a combination approach of empirically and logically assigning action elements to factors was taken in order to determine factor-scale scores. In general, the empirical results were given primary consideration, in that action elements which were factorially simple (i.e., loaded onto one factor) were keyed for that factor. In more factorially complex cases, the typical action element assignment was based on the largest loadings for that element.
In more ambiguous cases (factor loadings of similar magnitude), the content of the action element was considered, and the final assignment was based on logically matching the administrative performance involved in the action element to the most closely related (by content) factor. Multiple keying of action elements to factors (i.e., assigning an action element to more than one factor) was minimized in order to retain the distinctive, independent nature of these largely factor-analytically derived dimensions.

It will be recalled that the second factoring included two additional items, not part of the 12-item subset, and that the action elements from these items were subsequently eliminated for the third factoring. However, these action elements were now included in the derivation of the factor-scores in order to increase the number of action elements used and increase the variability of the new dimensions. To this end, a combination process of action element assignment identical to that just described was used, except that factor loadings from the primary factor-pattern matrix from the second factoring were studied to assign the additional 69 action elements to factors in the computation of factor-based scale scores. Finally, to increase the base of measurable behaviours for Factor V (Integration and Analysis), the conceptually-salient variable of the Number of Priority Items Attempted was added in the calculation of Factor V scores. Like the logically-derived stylistic dimension scores, final scale scores were derived by summing the number of action elements with positive loadings coded for each factor and subtracting those with negative loadings that were coded for each factor.

The scale score intercorrelations are provided in Table 21.
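The scoring rule just described (add one for each coded action element keyed positively to a factor, subtract one for each keyed negatively) can be illustrated with a minimal sketch. The action-element labels and loadings below are hypothetical and are not taken from the TSIB keys; the function name is likewise an assumption made for illustration only.

```python
def factor_scale_score(coded_actions, keyed_loadings):
    """Unit-weighted scale score for one factor.

    coded_actions: action elements coded for a participant.
    keyed_loadings: mapping of action element -> loading, for the
        elements assigned (keyed) to this factor only.
    """
    score = 0
    for action in coded_actions:
        loading = keyed_loadings.get(action)
        if loading is None:
            continue  # element is not keyed to this factor
        if loading > 0:
            score += 1  # positively keyed element adds one point
        elif loading < 0:
            score -= 1  # negatively keyed element subtracts one point
    return score


# Hypothetical key for a single factor (labels and loadings are invented).
example_key = {
    "discuss_with_staff": 0.41,
    "meet_with_superior": 0.28,
    "decide_without_consulting": -0.33,
}
# Hypothetical set of actions coded for one participant.
example_actions = [
    "discuss_with_staff",
    "decide_without_consulting",
    "write_memo",  # not keyed to this factor, so it is ignored
    "meet_with_superior",
]
example_score = factor_scale_score(example_actions, example_key)
```

Because only the sign of each loading is used, the size of a loading affects which factor an element is keyed to but not how much it contributes to the score.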
The results of the correlations between these scale scores and the OMP, presented in the upper portion of Table 22, were not promising. Although 4 of 10 possible correlations showed a significant relationship with on-the-job managerial performance, the coefficients were very modest. None was significant for both genders. In addition, the reliability estimates (alpha coefficients) for the scale scores, presented in the lower portion of Table 22, showed strong evidence of unreliability.

Table 21
Factor-Based Scale Score Intercorrelations

Factors: I Consultation and Discussion; II Independent Decision-Making; III Company-Based Decision-Making; IV Supervising Staff; V Integration and Analysis

Males:   -.05  .00  .21*  .10  -.  .13
Females: .05  .09  .19*  .00  -.  .29

Note. *p < .05 (one-tailed test).

Table 22
Validity Coefficients between Overall Management Performance (OMP) and TSIB Factor-Based Scale Scores and Reliability Estimates

Bivariate Correlation with OMP
Factor   Males   Females
Consultation and Discussion
Independent Decision-Making
Company-Based Decision-Making
Supervising Staff
Integration and Analysis
.18*  .18*  .18*  .24**

Alpha Coefficient
Factor                            Males   Females
Consultation and Discussion       .43     .39
Independent Decision-Making       .00     .14
Company-Based Decision-Making     .02     .09
Supervising Staff                 .25     .36
Integration and Analysis          .56     .61

Note. Reported r's have been corrected for the effects of range restriction and criterion unreliability. All non-significant correlations were omitted.
*p < .05 (one-tailed). **p < .01 (one-tailed).

DISCUSSION

In general, mixed support was found for the main hypotheses proposed in the present study.
In this section, because of the scope and number of hypotheses involved, they will be briefly reviewed preceding the evaluation of the results, in order to more fully interpret and consider the implications of the findings.

Reliability and Criterion-related Validities of Selected Dimensions of the TSIB

Cross-Validation of the Quality of Judgement Scores

Predictive validities of scores based on the Panel, Empirical and Merged keys. A central focus of the present study was the degree to which the three Quality of Judgement keys (Panel 2--Conservative, Empirical, and Merged), as applied to the three reduced-item subsets (8-, 10-, and 12-item) would yield similar criterion-related validity results when administered in a second, cross-validation sample (Company 2). It was hypothesized that the Merged and, to a lesser degree, the Empirical key used to measure the Quality of Judgement dimension would result in the largest validities among the three scoring keys and that there would be some shrinkage in the cross-validated coefficients based on the Empirical and Merged keys.

The findings, presented in Table 9, showed that these expected results were only partially realized. The Merged key, across both genders and for each of the three reduced-item subsets, did yield higher validity coefficients than the Empirical key, with coefficients ranging from .18 to .23 (corrected). It should be noted, however, that only two of six Empirical key correlations were significant (8-item set and 10-item set, r = .19). Even though some shrinkage in the cross-validated Quality of Judgement scores was hypothesized, such a marked decrease using the Empirical key was unexpected. Serious limitations in the empirically-based key were revealed in that four of the six reduced-item sets measured using the Empirical key were found to be unrelated to on-the-job measures of managerial performance.
Furthermore, the two significant correlations were observed for females only; therefore, no predictive information about males' managerial performance is provided by the Empirical key. It should also be noted that, although all six possible cross-validation Merged key correlations were significant, the average magnitude of the correlations from the second dataset was substantially lower. For example, the mean Merged key validity coefficient (across both genders and the three reduced-item subsets) generated using the Company 1 dataset was .61, whereas the mean validity coefficient based on the Company 2 dataset was .21.

These reductions in the validity of the Quality of Judgement scores as measured by both the Empirical and Merged keys are likely seen because the original validity coefficients reported in Preliminary Study 2 were spuriously inflated from the effects of capitalization on chance, which produces statistically significant correlations simply on the basis of chance, or Type I error. The considerable number of correlations required in the determination of the Quality of Judgement values for these two keys (228 correlations for the 8-item subset, 291 for the 10-item subset, and 354 for the 12-item subset) presumably increased the likelihood that chance factors could operate. Another potential factor contributing to the inflation of the original validity coefficients was the fact that the reduced-item subsets were chosen largely on the basis of maximal item validities (rather than reliabilities) for the original sample. In addition, it may have been that, in the correlational analyses used to develop these scoring keys, the inclusion of correlations based on action elements with low endorsement rates (e.g., those endorsed by 5 of 321 participants) did not yield stable correlations.
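To give a rough sense of the scale of capitalization on chance with this many correlations, the following minimal sketch computes the number of correlations that would be expected to reach significance by chance alone. It rests on two assumptions that are not stated in this section: that each correlation is tested at a nominal alpha of .05, and that every true correlation is zero. The function and variable names are illustrative only.

```python
# Assumption (not stated in this section): each correlation is tested at a
# nominal alpha of .05 and every true correlation is zero.
ALPHA = 0.05

# Number of correlations computed for each reduced-item subset, as reported
# in the text above.
CORRELATIONS_PER_SUBSET = {8: 228, 10: 291, 12: 354}


def expected_chance_hits(n_tests, alpha=ALPHA):
    """Expected number of spuriously significant results among n_tests
    tests when no true relationship exists (n_tests * alpha)."""
    return n_tests * alpha


for subset, n_tests in CORRELATIONS_PER_SUBSET.items():
    hits = expected_chance_hits(n_tests)
    print(f"{subset}-item subset: {n_tests} correlations, "
          f"{hits:.2f} expected significant by chance")
```

Under these assumptions, the expected count is simply the number of tests multiplied by alpha, and this expectation holds whether or not the individual tests are independent of one another.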
Regardless of the reasons for the over-estimates of the original validity coefficients, it is clear that the second, more accurate assessment of the true validity of the Quality of Judgement dimension as measured by the Empirical and Merged keys shows much less promising findings.

Contrary to expectations, the Panel 2 (Conservative) key yielded the highest validity coefficients of the findings from the three main approaches used in Quality of Judgement scoring key development and reported in Table 9. Unlike the cross-validated coefficients from the Empirical and Merged keys, no shrinkage was seen in the Panel 2 (Conservative) validity coefficients; in fact, the average magnitude of the six correlations from the second dataset was marginally higher. Specifically, the mean Panel 2 (Conservative) key validity coefficient (across both genders and the three reduced-item subsets) generated using the Company 1 dataset was .255, whereas the mean validity coefficient based on the Company 2 dataset was .273. It was expected that the Panel 2--Conservative key would result in at least equal, if not slightly higher, validity coefficients across the two datasets. Whereas the original analysis represented an inter-company analysis (the origin of the panel and the dataset did not match; Panel 2 applied to Company 1 dataset), the cross-validation represented an intra-company analysis (the origin of the panel and the dataset match; Panel 2 applied to Company 2 dataset). Higher validity coefficients from intra-company analyses were expected because confounding effects due to differences in corporate culture or geographical location were presumed to be minimized, if not eliminated.

A final point in the consideration of the Panel approach is that it may have been responsible, at least in part, for the higher coefficients seen with the Merged key compared to the Empirical key.
It will be recalled that, whereas the derivation of the Empirical key was solely quantitative, the derivation of the Merged key incorporated both the Empirical and the qualitative Panel judgements. It is likely that the inclusion of the more predictive Panel judgements in the Merged key increased the observed cross-validated validity coefficients. In sum, it is evident that subjective, human judgements, rather than objective, empirical calculations, provided more enduring, accurate estimates of the predictive validity of the Quality of Judgement scores.

In retrospect, the expectation that the Panel 2 (Conservative) key would yield the lowest validity coefficients from the three main approaches used in Quality of Judgement scoring key development may not have been called for, particularly in light of the findings of Stern, Stein, and Bloom (1956). Stern et al. developed an important classification of three approaches used in personality assessment programs, based on the extent to which an explicit personality model is used in predicting human behaviour. Interestingly, the analytic, empirical, and synthetic approaches identified by Stern et al. closely correspond to the three approaches of the Panel, Empirical, and Merged Quality of Judgement scoring key development used in the present study. Comparable to the Panel approach, the analytic approach to personality assessment is complex. At its simplest level, however, it involves gathering and combining individual ratings of the role requirements made by significant others. A further step in the analytic approach involves a diagnostic council, whose purpose is to assess, by group consensus, the congruence of a candidate's personality with the hypothetical target model.
According to Wiggins (1973), the most outstanding advantage of the analytic approach is its generality of predictive accuracy across samples of both individuals and situations.

The empirical approach identified by Stern, Stein, and Bloom (1956) is based on examination of the degree of correlation between instruments selected for personality prediction and objective standards of performance. Item analysis following a contrasted groups application to high- and low-performing subjects is used to select items able to reflect group differences. In this approach, as in the Empirical Quality of Judgement scoring key approach, cross-validation is required to avoid capitalization on chance influences from the original contrasted groups. The view that "the degree of success of empirical selection is considered to be less than optimal in terms of such considerations as hit rates, costs of testing, and selection ratios" (Wiggins, 1973, p. 469) is supported by the findings of the present study. In addition, Wiggins noted that the generality of the empirical approach is much less than that of the analytic and synthetic approaches.

The synthetic approach, like the Merged key, involves a more global or armchair appraisal of the situation, combining assessors' ratings in a less rigorous way than the empirical approach. It should be recognized, however, that the Merged key approach is less intuitive and follows more clearly established decision rules than the process described by Stern, Stein, and Bloom (1956). Generally, the synthetic approach is considered less accurate than the empirical approach in predicting personality, and it is considered more efficient than the comprehensive analytic approach.

In sum, given the results of Stern, Stein, and Bloom (1956), the finding that, of the three main Quality of Judgement scoring key approaches, the Panel 2 (Conservative) key yielded the highest validity coefficients and is most generalizable is considerably less contrary than originally expected.
Despite obvious differences in domain assessment, there is an interesting convergence to the pattern of findings across the two studies.

Additional analyses based on the Panel scoring key. The administration of the TSIB in a new, related setting allowed a set of more finely-grained analyses of the Panel scoring keys used in determining the validity of the Quality of Judgement scores. It will be recalled that the involvement of Company 2 permitted the construction of a new, expanded panel. To increase the accuracy and stability of the appropriateness ratings, 21 panel members were used for Panel 2, compared to 11 panel members comprising the first panel (from Company 1). In addition to more panel members from a different company, Panel 2 also differed from Panel 1 by the method of combining panel judgements (Panel 1 used the liberal method while Panel 2 used the conservative method). These multiple differences in the two panels, then, would not allow the isolation of the factor(s) responsible for differing validity coefficients resulting when the two panels were used in the derivation of the Quality of Judgement scoring key. More precise comparisons of results required the separation of these three factors: the method of combining panel judgements (liberal or conservative), origins of the panel ratings (Panel 1 or Panel 2), and the dataset to which the particular panel key is to be applied.

Accordingly, the six additional Panel analyses previously outlined in the Rationale and Hypotheses for the Present Study were conducted. (It will be recalled that Study 1 involved the Liberal Method using Panel 1 applied to the Company 1 dataset and Study 2 involved the Conservative Method using Panel 2 applied to the Company 1 dataset.) To review, the new analyses involved:

1. Conservative combination method using Panel 1 applied to Company 1 dataset.
2. Conservative combination method using Panel 1 applied to Company 2 dataset.
3. Conservative combination method using Panel 2 applied to Company 2 dataset.
4. Liberal combination method using Panel 1 applied to Company 2 dataset.
5. Liberal combination method using Panel 2 applied to Company 1 dataset.
6. Liberal combination method using Panel 2 applied to Company 2 dataset.

It was hypothesized that, because of the greater consensus required among the panel members and the use of a 5-point, rather than 3-point, scale, the Conservative method of combining judgements would result in greater validity coefficients than the Liberal method. In addition, it was expected that, in general, Panel 2 ratings would yield higher validities than Panel 1 ratings because of the use of the expanded list of action elements and the larger panel size used for the action element ratings. Lastly, it was believed that "intra-company" analyses, wherein the origin of the panel and the dataset match (e.g., Panel 1 applied to Company 1 dataset; Panel 2 applied to Company 2 dataset), would produce greater validity coefficients than "inter-company" analyses (e.g., Panel 1 applied to Company 2 dataset). This expected pattern was based on the assumption that intra-company analyses would tend to eliminate possible confounding effects arising from differences in corporate culture or geographical location. Table 23 presents a summary of the observed validity coefficients reported in the previous section.

In sum, the effects of three factors on the validities of the Quality of Judgement scores were examined: the method of combining panel judgements, the origin of the Panel, and the nature of the analysis (whether intra- or inter-company). The analyses conducted in the present study were restricted to a consideration of the main effects of these factors; no interactive effects were investigated. Table 24 presents the means of the observed validity coefficients, aggregated over each of the three factors and for each of the three item subsets.
The means, therefore, were determined by a broad aggregation of values shown in Table 23. For example, in the calculation of the Liberal method means, the categories of gender, origin of the Panel, and intra- versus inter-company analysis were collapsed. Similarly, in the calculation of the Panel 2 means, the categories of gender, method of combining panel judgements, and the intra- versus inter-company analyses were collapsed.

Table 23
Summary of Validity Coefficients between Overall Management Performance and Quality of Judgement Scores Obtained Using Scoring Keys based on Different Methods of Combining Panel Judgements (Liberal or Conservative), Different Origins of Panel Ratings (Panel 1 or Panel 2), and Whether Panel Origin and Dataset Match (Intra- or Inter-company analysis)

8-item set
                  Liberal Method                Conservative Method
                  Panel 1       Panel 2         Panel 1       Panel 2
                  M      F      M      F        M      F      M      F
Intra-company     .37    .29    .29    .30      .30    .19    .30    .25
Inter-company     .26    .35    .32    .26      .19    .37    .29    .28

10-item set
                  Liberal Method                Conservative Method
                  Panel 1       Panel 2         Panel 1       Panel 2
                  M      F      M      F        M      F      M      F
Intra-company     --     --     .23    .30      .24    .20    .26    .32
Inter-company     .23    .35    .25    .25      .18    .38    .23    .26

12-item set
                  Liberal Method                Conservative Method
                  Panel 1       Panel 2         Panel 1       Panel 2
                  M      F      M      F        M      F      M      F
Intra-company     --     --     .24    .32      .25    (.16)  .26    .25
Inter-company     .18    .36    .25    .23      :71    .36    .24    .23

Note. Reported r's have been corrected for the effects of range restriction and criterion unreliability. Correlations for the 10- and 12-item sets, scored by the Liberal Method, Panel 1 key, applied to the Company 1 dataset are not available.
Value in parentheses is a non-significant correlation.

Table 24
Mean Validity Coefficients Determined by Aggregation Across Method of Combining Panel Judgements, Origin of Panel, and Nature of Analysis (Intra- or Inter-company)

                                                  Item Sets
Factor                                    8-item    10-item    12-item
Method of Combining Panel Judgements
    Liberal                                .31       .27        .27
    Conservative                           .27       .26        .25
Origin of Panel
    Panel 1                                .29       .27        .26
    Panel 2                                .29       .26        .25
Nature of Analysis
    Intra-company                          .29       .26        .25
    Inter-company                          .29       .28        .26

Note. Means were determined by Fisher transformation of the original bivariate correlations.

As shown in Table 24, no support was found for the hypothesis of higher validity coefficients derived from the Conservative, rather than Liberal, method of combining panel judgements. Across the three item-subsets, the mean validity coefficient for the Conservative method was .26, compared to .28 for the Liberal method. It is evident that requiring greater consensus among panel members and the use of a 5-point, rather than 3-point scale, in the assignment of Quality of Judgement values did not result in improved predictive validity of the Quality of Judgement scores. This finding suggests that, because of simpler, more efficient computations, the Liberal method of combining panel judgements is the method of choice.

Limited support was found for the hypothesis that Panel 2 ratings would yield higher validity coefficients than Panel 1 ratings. The coefficients were higher for all three possible comparisons for males whereas for females, none of the three possible comparisons was higher (the pattern was fully opposite to that predicted). For the males, then, it is apparent that using the expanded list of action elements and the larger panel size resulted in greater predictive accuracy (8-item, .30 versus .28; 10-item, .24 versus .22; 12-item, .25 versus .21).
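The gender-specific means quoted above can be approximately reproduced from the Table 23 entries by averaging in Fisher's z metric, the method named in the note to Table 24. The following is a minimal sketch only: the function name `fisher_mean` is hypothetical, and the coefficients are the 8-item male values as read from Table 23 (one each for the Liberal and Conservative methods, intra- and inter-company).

```python
import math


def fisher_mean(rs):
    """Average correlations via Fisher's r-to-z transformation."""
    zs = [math.atanh(r) for r in rs]       # transform each r to z
    return math.tanh(sum(zs) / len(zs))    # back-transform the mean z to r


# 8-item male coefficients as read from Table 23, in the order
# Liberal intra, Liberal inter, Conservative intra, Conservative inter.
panel_1_males = [0.37, 0.26, 0.30, 0.19]
panel_2_males = [0.29, 0.32, 0.30, 0.29]

print(round(fisher_mean(panel_1_males), 2))
print(round(fisher_mean(panel_2_males), 2))
```

Under these assumptions the two means round to .28 and .30, in line with the 8-item comparison reported for males.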
The question remains whether, given the marginal increase in validity coefficients for males using Panel 2 and the decrease in validities for females, the use of the expanded list of action elements and a panel size nearly double that of the original panel is warranted. Moreover, using the validity coefficients from Table 24, which have been aggregated across genders, the means of the validity coefficients across the three item-subsets are equal for the two Panels. Practical considerations point to the use of Panel 1 as the more favored of the two panels. That is, using a smaller (rather than larger) panel and a less comprehensive list of action elements will yield similar validity coefficients with the added advantage of reduced cost and time demands associated with the smaller panel. It should be noted that, although Panel 1 is referred to as the smaller of the two panels, when compared to panel sizes reported in the literature (Hakstian et al., 1986) and those employed by ETS, Panel 1, with 11 members, is relatively large.

Limited support was also found for the hypothesis that intra-company analyses would yield greater Quality of Judgement validity coefficients than inter-company analyses. Like the previous analyses comparing Panel 1 to Panel 2 ratings, a differential pattern across genders was seen (Table 23). The coefficients were higher for all three possible intra-company comparisons for males, whereas for females, none of the three possible intra-company comparisons was higher (again, the pattern was fully opposite to that predicted). When considered across genders (Table 24), matching the origin of the panel to the dataset to which the Quality of Judgement key will be applied does not appear to be a requirement for satisfactory validity. Specifically, across the three item-subsets, the mean validity coefficient for the intra-company analyses was .27 whereas the mean for the inter-company analyses was .28.
It can be seen that, overall, no significant reduction in validity results if the panel and dataset do not correspond, and so it is unlikely that differences due to corporate culture or geographical location have appreciable confounding effects on the Quality of Judgement key. An important consequence of this finding is the implied versatility of panel-derived keys. It appears that panel keys derived in the original, target company can also be applied in new, related settings without significant reductions in validity.

Additional Performance Dimensions in the TSIB

Understanding of Situation Questionnaire. In the present study, a new eighth dimension (performance rather than stylistic) was added in order to measure the participant's understanding of the issues, problems and broader implications of the scenario presented in the TSIB. It was hypothesized that more information about a participant's rationale or thinking would constitute useful predictive information about that person's administrative potential. Because of the fairly low internal consistency results in Study 2, it was postulated that the Quality of Judgement dimension may be made up of multidimensional, independent aspects of appropriateness of actions. A 33-item multiple choice Understanding of Situation Questionnaire was designed and administered upon completion of the in-basket exercise.

The development of the scoring key for the Understanding of Situation Questionnaire was based on an analysis of the responses from the second 21-member panel involving the application of a 0, 1, 2 scoring system assigned to each of the five options per question. It will be recalled that the particular value assigned to each option depended on the level of endorsement from the expert panel. Across the five response options provided for an item, the panel members' pattern of option selection was examined and an empirical, followed by a logical approach, was used to assign scoring weights.
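As a rough illustration of this two-stage keying, the sketch below assigns 0, 1, 2 weights from panel selection counts for one item. It is not the procedure actually coded for the study: the function name `key_item`, the cut-off for "very few" selections (`low_cutoff`), the near-tie margin (`tie_margin`), and the example counts are all assumptions introduced here.

```python
def key_item(counts, low_cutoff=2, tie_margin=2):
    """Assign 0/1/2 scoring weights to one item's response options.

    counts      -- dict mapping option label to number of panel selections
    low_cutoff  -- assumed threshold: options chosen by this many panel
                   members or fewer are keyed 0
    tie_margin  -- assumed threshold: options within this many selections
                   of the most popular option are also keyed 2
    """
    top = max(counts.values())
    key = {}
    for option, n in counts.items():
        if n == top or (top - n <= tie_margin and n > low_cutoff):
            key[option] = 2   # most endorsed, or a near-tie with it
        elif n <= low_cutoff:
            key[option] = 0   # chosen by very few panel members (or none)
        else:
            key[option] = 1   # moderate endorsement
    return key


# Hypothetical selection counts from a 21-member panel for one item.
example_counts = {"a": 1, "b": 9, "c": 8, "d": 0, "e": 3}
example_key = key_item(example_counts)
print(example_key)
```

With these hypothetical counts, the two leading options differ by a single selection and so share the top weight, while the rarely chosen options receive zero.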
The empirical approach was based on the frequencies of option selections within an item. If, for example, option "b" were selected most frequently by panel members, it was initially keyed "2" whereas if option "d" were chosen by very few members (or none), it was initially keyed "0". Options receiving moderate endorsement by the panel (i.e., options neither maximally nor minimally selected) were initially keyed "1". Following this empirically-based assignment of scoring weights, final determination of the scoring key was made by a logical or rational consideration of the relative differences between option selection rates. If, for example, the two most frequently selected options within an item differed by one or perhaps two panel members' selection, both options were keyed "2". Similarly, if two moderately-endorsed options showed a minimal difference in their endorsement rates, they both were keyed "1".

As outlined earlier in the Rationale and Hypotheses for the Present Study, item-analyses using item-reliability and item-validity indices were conducted in order to select a subset of items from the questionnaire with maximal reliability and validity. Table 11 presented the internal consistency and criterion-related validity findings based on the original, full 32-item set (one item was omitted on the advice of the panel). Table 12 presented the internal consistency results and validity coefficients for the most promising subset of items (20, in total). The reliability results indicate low levels of internal consistency for both the full- and reduced-item questionnaire (alpha coefficients for the full-item set were .24 and .38 for males and females, respectively, compared to .23 and .27 for the 20-item subset). However, given the nature of this data, it may be unrealistic to expect results indicating greater homogeneity. The questionnaire is designed to solicit information regarding participants' attitudes, strategies and rationales behind the numerous and varied actions they chose.
The data, therefore, is more cognitive and item- or issue-specific than is typically true of more general, task-oriented in-basket data.

Based on the criterion-related validity findings from the 20-item questionnaire, strong support was found for the hypothesis that additional, useful predictive information which augments in-basket assessment could be provided by an analysis of participants' particular approaches in the exercise. The criterion-related findings suggest that the 20-item Understanding of Situation Questionnaire not only yields higher, but also more equal, validity coefficients across genders compared to the original, full-item instrument (32-item, .19 and .29 for males and females, respectively; 20-item, .35 and .34 for males and females, respectively). An important practical advantage of the questionnaire format compared to the standard ETS Reasons for Action form or a post-exercise interview, designed to solicit the same information, is afforded with this multiple-choice scoring format. Specifically, less administration and scoring-time is required and improved reliability (inter-rater) follows from using a quantitative, objectively-applied scoring key.

Productivity. Findings from the preliminary studies revealed that the Productivity dimension did not contribute significantly to the prediction of in-basket performance for men or women. It will be recalled that Productivity was operationalized as the total number of actions taken across the exercise (including Unusual actions), as well as the number of items attempted. Consequently, in the present study, several different ways of operationalizing Productivity to increase the validity of this dimension were examined.

New methods of defining this dimension involved the aggregation of smaller, more concrete units of output.
Seven separate units of productivity were identified as individual productivity components: (a) the number of items attempted across 21 items, referred to as Item Total; (b) the number of letters or memos written across 21 items, referred to as # Letters/memos; (c) a score reflecting the number of words written per memo or letter across the 21 items, referred to as Total # words written; (d) number of actions scheduled for a definite time, referred to as Definite actions; (e) number of entries made on the calendar; (f) whether a "things to do" list was completed, and (g) whether a "summary" list of the items was completed. In addition, more standard indices of output were calculated for each item-subset, composed of the total number of action elements and total number of Unusual actions taken over the 8-, 10-, and 12-item subsets. These were referred to as Total action (8), Total action (10), and Total action (12), respectively.

Criterion-related validity results for the individual units of output, presented in Table 13, show a different pattern across genders. For males, the # Letters/memos written (across 21 items) and the Total # words written (also across 21 items) were significantly related to on-the-job managerial performance (r = .24 for both measures). For females, all individual units of output were non-significant, except whether a summary list was completed by the participant (r = .20).

Productivity linear composites were then derived to determine the optimal Productivity operationalization. The most promising individual units of output were selected and four linear composites were determined: (a) Item Total, # Letters/memos, Total action (10); (b) Item Total, Total # words written, Total action (10); (c) Item Total, Definite actions, Total action (10); and (d) Item Total (across 10 items), # Letters/memos, Total action (10).
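A composite of this kind, and its correlation with a criterion, can be sketched as follows. The thesis does not report how the components were weighted, so the example assumes each component is standardized and given unit weight; the helper names and all scores are hypothetical and are not data from the study.

```python
from statistics import mean, pstdev


def zscores(xs):
    """Standardize scores to mean 0 and (population) SD 1."""
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / s for x in xs]


def unit_weight_composite(*components):
    """Sum standardized components with equal (unit) weights. Assumed scheme."""
    standardized = [zscores(c) for c in components]
    return [sum(values) for values in zip(*standardized)]


def pearson_r(xs, ys):
    """Pearson correlation computed as the mean product of z-scores."""
    zx, zy = zscores(xs), zscores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)


# Hypothetical scores for six participants (illustration only).
item_total_10 = [8, 10, 9, 7, 10, 6]        # items attempted, 10-item subset
letters_memos = [5, 9, 7, 4, 8, 3]          # letters or memos written
total_action_10 = [22, 31, 27, 18, 30, 15]  # action elements plus Unusual actions
criterion = [3.1, 4.2, 3.8, 2.9, 4.0, 2.5]  # on-the-job performance rating

composite = unit_weight_composite(item_total_10, letters_memos, total_action_10)
print(round(pearson_r(composite, criterion), 2))
```

Standardizing before summing keeps a component with a large raw range, such as the action count, from dominating the composite.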
As Table 13 showed, the fourth linear composite, with significant validity coefficients of .25 for both genders, proved to be the optimal operationalization of Productivity, although the first linear composite, with significant validity coefficients of .24 for both genders, was a close second. The former linear composite, however, has the added advantage of being the more easily scored of the two composites, because the Item Total is calculated across 10, rather than 21, items. For comparison, a more conventional approach to the scoring of Productivity was examined in the Total action measures for the 8-, 10-, and 12-item subsets. The Productivity totals for the 10- and 12-item sets yielded the highest coefficients (r = .18 for males and r = .25 for females, for both item-sets).

The results of these analyses suggest that measuring Productivity using the composite based on the Item Total (across 10 items), # Letters/memos across the entire exercise, and the Total action (10) count provides the most predictive information about administrative potential (mean validity coefficient is .25). However, it should be recognized that the mean validity coefficient, across genders, for the Total action (10) measure is only marginally lower (r = .22) than this linear composite. Practical considerations point to the selection of this second, slightly less predictive operationalization of Productivity as the optimal method of measuring in-basket output. Unlike the linear composite, the scoring of the standard indices of output requires no additional steps or calculations other than those which follow directly from scoring the in-basket itself. That is, in the scoring of the TSIB, the action elements (both Unusual ones and those listed in the Scoring Manual) are identified.
The total number of action elements and the total number of Unusual actions taken (whether using the 8-, 10-, or 12-item subset), already determined in the scoring of the TSIB, are then summed, making Productivity assessment a straightforward computation. In contrast, to score the linear composite, an additional component of the number of letters and memos written across the entire exercise must be counted. This component, although not complex, is fairly time-consuming to determine because all written responses, not simply those relevant to the subset of items, must be examined. Thus, with such practical advantages, it appears that the Total action measure is the most useful method of measuring in-basket productivity; the ultimate selection of the Total action (8, 10, or 12) rests on the selection of the optimal reduced-item subset.

TSIB Stylistic Dimensions

The efficacy of logically assigning the stylistic dimensions was re-examined in the second dataset using the 8-, 10-, and 12-item subsets. The total scores for each stylistic dimension were correlated with the OMP for each of the three item-subsets and were displayed in Table 14. As reported in the Results section, a distinct pattern across genders, but consistent within each gender, was evident. It will be recalled that, across the item-subsets, Interpersonal Relations and Leadership in a Supervisory Role were the only significant stylistic dimensions for men, whereas Planning and Organizing Work, Managing Personnel, and Analysis and Synthesis in Decision-Making provided useful predictive information for women.

These results using the reduced-item subsets can be compared, generally, with those from Study 1 wherein the full 21-item set was administered in Company 1 (results displayed in Table 5). There, greater consistency across genders in the results was seen with two stylistic dimensions, Interpersonal Relations and Managing Personnel, significant across both genders.
In the present study, it appears that only a small set of stylistic dimensions, unique to each gender, were related to measures of on-the-job performance. It is not clear why such a differential pattern across genders should exist. It is interesting to note that not one of the five stylistic dimensions is entirely nonsignificant; each dimension contributes some useful predictive information about managerial performance, albeit for only one gender.

Tables 15, 16, and 17 present the intercorrelations for the TSIB stylistic dimensions for the three reduced-item subsets. Consistent with the findings from the intercorrelations of the 8-item subset reported in Study 2 (Table 7), a fairly high level of intercorrelations for the reduced-item subsets was seen, suggesting a lack of meaningful discriminant validity of these aspects of administrative performance. This dimensional dependency most likely results from multiple keying of action elements across dimensions. Earlier studies using such multiple keying of logically-assigned dimensions also found evidence of dimensional dependency (Brannick et al., 1989; Hakstian & Harlos, 1992).

Reliability of the TSIB dimensions

The split-half reliability estimates of the seven dimensions, presented in Table 18, fall in the middle to upper part of the range of split-half reliabilities reported by Schippmann et al. (1990) and summarized in Table 1 of the Literature Review. It was noted in the Results section that a pattern of increasing reliability estimates with larger item-subsets was seen, supporting the speculation by Hakstian et al. (1992) that the phenomenon of test length is responsible for the lower split-half reliability of individual TSIB dimension scores for the 8-item subset compared to higher reliability estimates for the full 21-item set. Thus, as expected, the more items scored, the greater the split-half reliability.
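The test-length effect invoked here is conventionally described by the Spearman-Brown prophecy formula. The sketch below is offered only as the standard psychometric account, on the assumption that the added items are parallel to the existing ones; the thesis does not state that the formula was applied, and the starting reliability of .50 is a hypothetical value rather than a result from Table 18.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when a test is lengthened by length_factor.

    Assumes the added items are parallel to the existing ones.
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical reliability of .50 for an 8-item subset, projected to
# 10-, 12-, and 21-item versions of the exercise.
for n_items in (10, 12, 21):
    projected = spearman_brown(0.50, n_items / 8)
    print(n_items, round(projected, 2))
```

The projected values rise with each increase in length, which is the qualitative pattern described for the split-half estimates.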
Similarly, the reported alpha coefficients, presented in Table 18, also show a pattern of increasing magnitude as the number of items increases. For example, for males the mean alpha coefficients across the seven dimensions and the three item subsets were .30 (8-item), .39 (10-item) and .47 (12-item). For females, the mean alpha coefficients were .38 (8-item), .45 (10-item) and .54 (12-item).

It is apparent that, regardless of the reliability estimation method used and the number of items scored, the TSIB dimension scores, like in-basket dimension scores reported in the literature, are not highly internally consistent. The Literature Review observed that there was great variability in the published split-half reliability coefficients, making it difficult to conclude, with confidence, whether or not the split-half reliabilities are satisfactory. Less variability was seen in the alpha coefficients reported in the literature. The mean alpha coefficient, across in-basket studies, was .51. In the present study, only the 12-item subset reached a comparable degree of homogeneity; the 8-item subset, with a mean alpha coefficient across genders of .34, was substantially lower. Although more promising, published internal consistency findings do not clearly indicate solid levels of reliability using estimates of equivalence and homogeneity. As observed in the Literature Review, the most appropriate measure of the reliability of the in-basket would likely be a measure of stability, obtained by the test-retest method. Yet, because of logistical difficulties and high costs associated with multiple test administrations in industry, assessment of test-retest reliability of the in-basket is generally not viable. Given these limitations, it may be unrealistic to expect stronger evidence of in-basket reliability.
If we consider some central features of wideband, high-fidelity in-baskets, such as the breadth and complexity of items, cases of low item variances (due, in large part, to low endorsement rates of courses of action), and the numerous unquantifiable influences that affect the participant's perception of the items, the present findings may appear less negative.

Additional New Measures

Scorer's Impression of Involvement. It will be recalled that two new measures, the Number of Priority Items Attempted and the Scorer's Impression of Involvement, were introduced in the TSIB to investigate their predicted efficacy in improving in-basket assessment. Although previous findings of impressionistic measures of in-basket performance have shown mixed evidence of criterion-related validity (Cross, 1969; Hemphill, 1962; Meyer, 1970), the present attempt to develop a measure which was both subjective and predictive was expected to be favorable because of the clear conceptual link to Lopez's (1966) key notion of ego-involvement in in-basket assessment. Contrary to prediction, the subjective Scorer's Impression of Involvement was not significantly related to on-the-job managerial performance. Consequently, no useful predictive information about participants' administrative performance was contributed by this measure. Previously reported positive findings (albeit modestly so) have typically involved impressionistic measures in the form of adjective-pairs which describe subjective aspects of in-basket performance. It is not clear what unique information these measures contribute for those in-baskets also employing the more standard and predictive ETS stylistic dimension measurement. Therefore, it appears that inclusion of impressionistic measures in traditional in-basket assessment is unwarranted.

Number of Priority Items attempted. Table 19 presented the results of the correlation between the Number of High Priority Items attempted and on-the-job managerial performance.
Mixed support was found for the hypothesis that this measure would be significantly and positively related to managerial performance in that this prediction proved true for males only (r = .21). It may be that, with a total of six items identified as high priority, and the fact that most participants complete most of the exercise in the time provided, the variability for the range of possible scores was insufficient. Descriptive statistics for this measure support this view for, across the sample of 321 participants, the mean score was 5.5, with a standard deviation of .9. It appears, then, that in its present form, a measure of the number of urgent or high priority items completed is too weak to warrant inclusion in the TSIB.

Selection of the Optimal Item Subset: Integrating Empirical Results and Practical Issues

The selection of the optimal item subset (8-, 10-, or 12-item) based on these reliability findings points to the largest item set with its maximal reliability. However, additional considerations of validity and scoring-time must also be taken into account in the final selection of the optimal item subset. The optimal choice of item subsets in relation to the validity of the Quality of Judgement scores clearly depends on the method used to score this crucial dimension. Following the conclusions reached earlier that, in terms of efficiency and accuracy, the Quality of Judgement dimension is best measured by the Liberal method of combining panel judgements using Panel 1 ratings, this scoring approach will be adopted for the current consideration of the optimal item subset.

It was also seen earlier that no significant reductions in validity resulted when the origin of the panel and the dataset did not match (inter-company application).
For this reason, the particular Quality of Judgement validity coefficients selected are based on the Liberal method of combining Panel 1 judgements applied to the Company 2 dataset (no results are available for the 10- and 12-item set of the Liberal method of Panel 1 judgement ratings applied to Company 1). When averaged across genders, these mean validity coefficients (taken from Table 24) were .31 (8-item subset), .27 (10-item subset), and .27 (12-item subset). These findings suggest, then, that the optimal reduced-item subset, in the context of Quality of Judgement validity, is the 8-item set.

As Table 14 shows, the validity coefficients between the TSIB stylistic dimensions and the management criterion are very stable across the item subsets. It appears that no appreciable advantage to the prediction of administrative ability is realized with the selection of the 10- or 12-item subsets.

Clearly, practical considerations affirm the selection of the smallest subset of items. As reported in the findings from Study 2, scoring time using the 8-item key was 23 minutes, on average, and the time required to train scorers was three days. Although it is unlikely that training for scoring the 10- or 12-item set would be substantially longer, estimates for the time required to score the longer subsets are 30 and 40 minutes, respectively. For small-scale applications, the addition of seven or even 17 minutes required for scoring each exercise may not have serious consequences. However, the feasibility of wide-scale applications of the exercise with large numbers of candidates would likely be limited, particularly if the 12-item set were scored.

Factor Analysis of the TSIB Action Elements

It was postulated that, with the high intercorrelations among the stylistic dimension scores seen in Study 2, a factor analysis of the action elements from a subset of the TSIB items would yield a set of meaningful, more independent stylistic dimensions of in-basket performance.
In addition, it was expected that these factor-analytically derived in-basket dimensions would help resolve the conflict seen in the literature over the complexity of in-basket performance (whether uni- or multi-dimensional in nature), by considering the number of significant factors and the degree to which they are consistent with factors seen in past studies.

It will be recalled that five interpretable factors emerged from this analysis: Consultation and Discussion, Independent Decision-Making, Company-Based Decision-Making, Supervising Staff, and Integration and Analysis. The primary-factor intercorrelation results, shown in Table 20, provided evidence of minimal interrelations among the factors (the largest correlation was -.07). The intercorrelations of the scale scores, seen in Table 21, suggest that the independent nature of the dimensions was retained. Criterion-related validity results were less promising than expected, with very modest coefficients significant in 4 of 10 possible correlations. As Table 22 indicates, the strongest relationship with on-the-job managerial performance was seen for the Integration and Analysis factor for women (r = .24, corrected; p < .01). The remaining three coefficients showed weaker evidence of a relationship to managerial performance (r = .18, corrected; p < .05), none of which was consistently significant across genders. No consistent evidence of the reliability of the scale scores was seen, with only the Integration and Analysis factor showing a modest level of internal consistency.

Several reasons likely contributed to the lack of solid criterion-related validity and reliability seen in these factor-analytically derived in-basket dimensions. It will be recalled that, in previous factor-analytic studies of in-basket performance, the unit of analysis was the in-basket style category, thus providing molar-level data.
The present study, in contrast, employed action element endorsements as the unit of analysis, providing very molecular-level data. A likely consequence of the vast numbers of action elements used (recall that 268 were used as variables in the third factoring) was low within-factor variance due, in large part, to low endorsement rates of many of the action elements. In addition, it is likely that correlations based on action elements which have been endorsed by only a small number of participants (e.g., 5 of 321) were not stable. The very molecular-level approach used in the present factor analysis allowed the measurement of a greater scope or range of in-basket performance compared to more molar-level measurement. This greater scope, however, resulted in reduced frequencies for each molecular unit of measurement, the action element endorsement; a longer list of action elements yielded greater dispersion of endorsements with fewer endorsements per action element. Consequently, it may be that the use of more molecular-level data in the present study limited the robustness of possible empirical relationships with criterion measures.

Present findings provided some evidence of consistency with factors previously identified in past studies. The Literature Review observed that factor-analytic studies have suggested some common dimensions of administrative performance: Discussing Problems with Others, Complying with Suggestions, Preparing for Action by Gathering More Information, and Directing Others (Thornton & Byham, 1982). In the present study, conceptual overlap is apparent between Consultation and Discussion and Discussing Problems with Others, and between Company-Based Decision-Making and Complying with Suggestions.
Moreover, it appears that Preparing for Action by Gathering More Information and Directing Others are two smaller stylistic components of the present factor Supervising Staff.

In the Literature Review, the interpretation of factor consistency across in-basket studies was seen as controversial. Whether the further evidence of factor consistency provided by the present study indicates that the underlying characteristics measured by the in-basket exercise are based on a single, generalized trait, as suggested by Kesselman et al. (1982) and Lopez (1966), remains unclear. If in-basket performance were truly unidimensional, the consistent pattern of three to five distinct, independent factors emerging from solitary studies (regardless of the unit of analysis used--style categories or action elements) would not be expected. Instead, the emergence of one, or perhaps two, unique factors would more likely be seen across past studies. More support is seen for the contention by other researchers that the factor consistency observed in the literature is the result of an ETS-dominated measurement system (Schippmann et al., 1990). This view holds that a similar measurement system may result in greater similarity of factor definitions, increasing the likelihood for factor consistency across studies. With the strong influence from ETS in the development of the TSIB, such an explanation for the factor consistency observed here seems readily acceptable.

A Note Regarding A Linear Combination of TSIB Dimensions

In factor analysis, optimal combinations of variables are derived from clusters resulting from the internal structure of the data. Other methods to derive optimal linear combinations of variables typically involve correlation of the variables with external criteria, such as job performance.
Additional research (Hakstian & Harlos, 1992) into the construction of a simple linear composite of TSIB dimensions has been conducted and will be briefly acknowledged here.

Using the Company 1 dataset from Preliminary Study 1, Hakstian and Harlos (1992) reported that, based on criterion correlations, a linear combination of the TSIB dimensions was constructed, with the Quality of Judgement and Managing Personnel dimensions given weights of two and the other five dimensions given unit weights. This new measure, called Overall In-Basket Performance, yielded significant validities of .30 (males) and .34 (females) when correlated against the Overall Management Performance criterion. When applied in the Company 2 dataset, the cross-validities for these two variables were .28 (males) and .30 (females). The development of the Understanding of Situation dimension allowed it to be added to the linear composite, with a weight of two to reflect its stature as a performance, rather than stylistic, dimension. Consequently, the validities for this global composite increased to .34 for both genders. Substantial increments in reliability for this aggregation of predictor variables (rather than individual dimensions) were also seen.

Summary and Conclusions

In the Introduction, the legal and economic influences on work sample testing were acknowledged. It was observed that work sample assessment measures which are psychometrically sound, legally defensible, and economically viable are required. As a work sample test, the in-basket exercise is believed to enjoy the advantages of reduced bias and reduced adverse impact, although these attributes have not (yet) been empirically demonstrated. However, the in-basket also suffers the disadvantage of reduced utility because of its high development and administration costs, and the considerable time required both to train scorers and for trained scorers to evaluate a completed in-basket. Thus, it has been viewed as a costly component of managerial assessment.
A unique challenge, then, facing in-basket developers has been in meeting the concerns of company executives to improve the economic efficiency of administrative ability assessment in an increasingly competitive corporate market.

Accordingly, the focus of the present research has been on the scoring of the in-basket exercise, long recognized as the weak link in its practical implementation. In fact, nearly forty years after the initial development of the in-basket exercise, Brannick et al. (1989) cited comparisons of scoring systems in in-basket research as a principal area for much-needed investigation. Excessive scoring time demands stemming from the complex and often subjective nature of in-basket scoring have limited the wide-scale application of high-fidelity, wideband exercises with large numbers of candidates for selection purposes.

It should be recognized that, because the purpose of measurement in this study is to make predictions about administrative ability, the most appropriate criterion-validation approach is to conduct a predictive validation study using a longitudinal research design. However, due to budgetary and time constraints, a predictive validation study was not feasible, and so a concurrent validation design was substituted. Although such substitution is commonly made (Thornton & Byham, 1982), it is no doubt preferable to more closely match the purpose of measurement with study design.

It is also important to recognize the inherent limitations in accessing and measuring managerial performance given its multi-faceted nature and the lack of clear, consistent definitions and descriptions of managerial behaviour. Moreover, as noted in the Literature Review, problems of judgement bias remain a serious obstacle in performance appraisal assessment (Cascio, 1987).
This study attempts to minimize this possible source of inaccuracy in the assessment of in-basket validity with the use of a factor analytically-derived aggregate of BOS scales as the criterion measure, based on immediate supervisors' ratings of observed frequencies of specific managerial behaviours. Despite this effort to minimize inaccuracy due to subjectivity in criterion measurement, it is no doubt clear that some inaccuracy remains; it is thus impossible to know, with complete confidence, whether any one in-basket scoring key is correct.

Given these limitations, the findings of the present study contain several important implications for both the in-basket developer and the in-basket user. First, contrary to expectations, it appears that Quality of Judgement scoring key development based on more logical, rather than empirical, principles is warranted. Although an empirical approach involving criterion-related validity results in the construction of scoring keys may appear promising, as suggested by Meyer (1970) and the findings from Preliminary Study 2, the present research clearly shows that the cross-sample generalizability of an empirically-derived key is inadequate, at least at sample-size levels like those in the present study.

Instead, paralleling the findings of Stern, Stein, and Bloom (1956), it appears that a panel-derived key, based on pooled judgements of experienced industry personnel, remains psychometrically adequate in cross-validation.
Specifically, more finely-grained analyses of the panel-based approach indicated that (a) panel size can be relatively small (here, 11 members were sufficient), (b) panel judgements can be combined relatively simply (a "liberal", 3-point scale was effective), (c) a less comprehensive list of action elements is as useful as an expanded list, and, lastly, (d) panel-derived Quality of Judgement keys allow company-to-company generalizability (i.e., the origin of the panel and the company or dataset to which it is applied need not correspond). These principles of panel and scoring key construction, when applied by in-basket test developers, should substantially reduce the historically high development costs associated with in-basket exercises.

Secondly, findings from the present study also result in greater understanding of the importance which should be afforded the stylistic dimensions of in-basket assessment. In summary, it was learned that, unlike the performance dimensions, the stylistic dimensions, when used in isolation, contribute little useful information to the prediction of administrative ability. However, previous research with the present exercise indicates that, when used in a linear composite along with performance dimensions (also involving a new Understanding of Situation measure), the stylistic dimensions contribute to a more global measure possessing improved reliability and criterion-related validity (Hakstian & Harlos, 1992). It should also be made clear that, on their own, stylistic dimensions do contribute useful diagnostic information about administrative performance.

A final consideration for the in-basket developer follows from the minimal shrinkage which occurred on cross-validation of the reduced item-subsets.
It appears that, in conjunction with the principles of panel construction discussed above, action element-based scoring keys measuring performance on 8, rather than 21, items result in adequate reliability and criterion-related validity for both the initial development and cross-validation samples. In conclusion, scoring-item reduction represents a new and highly effective way to significantly improve the training and scoring efficiency of the high-fidelity in-basket exercise, thereby allowing more wide-scale applications of the exercise with large numbers of candidates for both training and selection purposes.

REFERENCES

Albemarle Paper Company v. Moody, 45 L. Ed. 2d 180 (U. S. Supreme Court 1975).
American Psychological Association, American Educational Research Association, & National Council on Measurement in Education (Joint Committee) (1985). Standards for educational and psychological testing. Washington, D.C.: American Psychological Association.
Anastasi, A. (1982). Psychological testing (5th ed.). New York: Macmillan.
Asher, J. J., & Sciarrino, J. A. (1974). Realistic work sample tests: A review. Personnel Psychology, 27, 519-533.
Bass, B. M. (1954). The leaderless group discussion as a leadership evaluation instrument. Personnel Psychology, 7, 470-477.
Bender, J. (1973). What is typical of assessment centres. Personnel, July, 50-57.
Biggs, W. D. (1990). Introduction to business management simulations. In J. W. Gentry (Ed.), ABSEL guide to experiential learning and simulation gaming. New York: Nichols Publishing Co.
Bourgeois, R. P., & Slivinski, L. W. (1974). The inter-rater reliability of The Consolidated Fund In-Basket. Studies in Personnel Psychology, 6, 47-52.
Brannick, M. T., Michaels, C. E., & Baker, D. P. (1989). Construct validity of in-basket scores. Journal of Applied Psychology, 74, 957-963.
Brass, D. J., & Oldham, G. R. (1976). Validating an in-basket test using an alternative set of leadership scoring dimensions.
Journal of Applied Psychology, 61, 652-657.
Bray, D. W., Campbell, R. J., & Grant, D. L. (1974). Formative years in business: A long-term AT&T study of managerial lives. New York: Wiley.
Bray, D. W., & Grant, D. L. (1966). The assessment center in the measurement of potential for business management. Psychological Monographs, 80 (17, Whole No. 625).
Breslow, E. (1957). The predictive efficiency of the Law School Admission Test at the New York University School of Law. Psychology Newsletter, 9, 13-22.
Bridgman, C. S., Spaethe, M., & Dignan, F. (1958). Validity information exchange No. 11-21. Personnel Psychology, 11, 264-265.
Brogden, H. E. (1949). When testing pays off. Personnel Psychology, 2, 171-183.
Brostoff, M., & Meyer, H. H. (1984). The effects of coaching on in-basket performance. Journal of Assessment Center Technology, 7, 17-21.
Campbell, D. T. (1960). Recommendations for the APA test standards regarding construct, trait, and discriminant validity. American Psychologist, 15, 546-553.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.
Carroll, J. B. (1960). IBM 704 program for generalized analytic rotation solution in factor analysis. Unpublished manuscript, Harvard University.
Cascio, W. F. (1987). Applied Psychology in Personnel Management (3rd ed.). New Jersey: Prentice-Hall Inc.
Cascio, W. F., & Phillips, N. F. (1979). Performance testing: A rose among thorns? Personnel Psychology, 32, 751-766.
Cascio, W. F., & Silbey, V. (1978). Utility of the assessment centre as a selection device. Journal of Applied Psychology, 64, 107-118.
Cattell, R. B. (1957). Personality and Motivation Structure and Measurement. New York: World.
Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioural Research, 1, 245-276.
Cleary, T. A. (1968). Test bias: Prediction of grades of negro and white students in integrated colleges.
Journal of Educational Measurement, 5, 115-124.
Clutterbuck, D. (1974). Acid test for management potential. International Management, May, 54-57.
Cohen, S. L., Moses, J. L., & Byham, W. C. (1982). The validity of assessment centers: A literature review. Monograph II. Pittsburgh: Development Dimensions Press.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.
Cronbach, L. J., & Gleser, G. C. (1965). Psychological Tests and Personnel Decisions (2nd ed.). Urbana: University of Illinois Press.
Crooks, L. A. (1968). Issues in the development and validation of in-basket exercises for specific objectives. (Research Memorandum 23). Princeton, N.J.: Educational Testing Service.
Crooks, L. A. (1974). The selection and development of performance measures for assessment center programs. (Research Memorandum 6). Princeton, N.J.: Educational Testing Service.
Crooks, L. A., & Slivinski, L. W. (1972). Comparison of in-basket test score profiles of four managerial groups. Studies in Personnel Psychology, 4, 19-30.
Cross, W. R. (1969). Relationships between elementary school principals: In-basket performance and their on-the-job behaviour. Journal of Educational Research, 63, 26-30.
Development Dimensions International. (197$). Sellmore Manufacturing Company Foreman's In-Basket. Pittsburgh, Pennsylvania: Author.
Diete, M. (1991, May). Computer-assisted assessment centre exercises: A sampling of the European market. Paper presented at the 19th Annual Congress on the Assessment Centre Method, Toronto, Ontario.
Educational Testing Service. (c. 1970). ETS Consolidated Fund In-Basket Test (Long Form). Princeton, NJ: Author.
Faria, A. J. (1987). A survey of the use of business games in academia and business. Simulation and Games, 18, 192-206.
Flanagan, J. C. (1954). Some considerations in the development of situational tests. Personnel Psychology, 7, 461-464.
Frederiksen, N. (1961).
Consistency of performance in simulated situations (ONR Technical Report and Research Bulletin 61-22). Princeton, New Jersey: Educational Testing Service.
Frederiksen, N. (1962). Factors in in-basket performance. Psychological Monographs: General & Applied, 76 (22, Whole No. 541).
Frederiksen, N., Jensen, O., & Beaton, A. E. (1972). Prediction of Organizational Behaviour. New York: Pergamon Press.
Frederiksen, N., Saunders, D. R., & Wand, B. (1957). The in-basket test. Psychological Monographs: General & Applied, 71 (Whole No. 438).
French, W., Hull, J. D., & Dodds, B. L. (1951). American High School Administration: Policy and Practice. New York: Rinehart and Co.
Gael, S., & Grant, D. L. (1972). Employment test validation for minority and non-minority telephone company service representatives. Journal of Applied Psychology, 56, 135-136.
Giese, W. J. (1949). A tested method for the selection of office personnel. Personnel Psychology, 2, 525-545.
Gill, R. T. (1979). The in-tray (in-basket) exercise as a measure of management potential. Journal of Occupational Psychology, 52, 185-197.
Goldsmith, R. (1991). Use of videotaped information in assessment centres. Paper presented at the 19th Annual Congress on the Assessment Centre Method, Toronto, Ontario.
Griggs v. Duke Power Company, 28 L. Ed. 2d 158 (U. S. Supreme Court 1971).
Hakstian, A. R., & Harlos, K. P. (1992). Assessment of in-basket performance by quickly-scored methods: Development and psychometric evaluation. Manuscript submitted for publication.
Hakstian, A. R., Woolley, R. M., Woolsey, L. K., & Kryger, B. R. (1991). Management selection by multiple-domain assessment: Concurrent validity. Educational and Psychological Measurement, 51, 883-898.
Hakstian, A. R., Woolsey, L. K., & Schroeder, M. L. (1986). Development and application of a quickly-scored in-basket exercise in an organizational setting. Educational and Psychological Measurement, 46, 385-396.
Hausrath, A. H. (1971).
Venture simulation in war, business, and politics. New York: McGraw-Hill.
Hemphill, J. K. (1958). Administration as problem-solving. In A. W. Halpin (Ed.), Administrative Theory in Education (pp. 89-118). Chicago, Ill: Midwest Administration Center.
Hemphill, J. K., Griffiths, D. E., & Frederiksen, N. (1962). Administrative performance and personality: A study of the principal in a simulated elementary school. New York: Teachers College Bureau of Publications, Columbia.
Hinrichs, J. R., & Haanpera, S. (1976). Reliability of measurement in situational exercises: An assessment of the assessment center method. Personnel Psychology, 15, 335-344.
Howard, A. (1983). Work samples and simulations in competency evaluation. Professional Psychology: Research and Practice, 14, 780-796.
Huck, J. R. (1974). Determinants of assessment center ratings for white and black females and the relationship of these dimensions to subsequent performance effectiveness. Unpublished doctoral dissertation, Wayne State University, Detroit.
Huck, J. R., & Bray, J. R. (1976). Management assessment center evaluations and subsequent job performance of black and white females. Personnel Psychology, 29, 13-20.
Hunter, J. E., & Hunter, R. F. (1984). Validity and utility of alternative predictors of job performance. Psychological Bulletin, 96, 72-93.
Jackson, D. N. (1976). Jackson Personality Inventory Manual. Port Huron, MI: Research Psychologists Press.
Jackson, D. N. (1986). Personality Research Form Manual. Port Huron, MI: Research Psychologists Press.
Jackson, J. R. (1959). Learning from experience in business decision games. California Management Review, 1, 92-107.
Jago, A. G. (1978). A test of spuriousness in descriptive models of participative behaviour. Journal of Applied Psychology, 63, 383-387.
Joines, R. (1991, May). The General Management In-Basket Exercise. Paper presented at the 19th Annual Congress on the Assessment Centre Method, Toronto, Ontario.
Jones, G. T. (1972). Simulations and business decisions.
Middlesex, England: Penguin Press.
Kesselman, G. A., Lopez, F. M., & Lopez, F. E. (1982). The development and validation of a self-report scored in-basket test in an assessment center setting. Public Personnel Management, 11 (3), 228-238.
Latham, G. P., & Wexley, K. N. (1981). Increasing productivity through performance appraisal. Reading, MA: Addison-Wesley.
Lombardo, M., McCall, M., & DeVries, D. (1976). Looking Glass. Greensboro, NC: Centre for Creative Leadership.
Lopez, F. M. (1966). Evaluating executive decision-making: The in-basket technique. (AMA Research Study 75). New York: American Management Association, Inc.
Lopez, F. M., Kesselman, G. A., & Lopez, F. E. (1981). An empirical test of a trait-oriented job analysis technique. Personnel Psychology, 34, 479-502.
McGregor, D. (1960). The human side of enterprise. New York: McGraw-Hill.
Meier, R. C., Newell, W. T., & Pazer, H. L. (1969). Simulation in Business and Economics. New Jersey: Prentice-Hall, Inc.
Meyer, H. H. (1970). The validity of the in-basket test as a measure of managerial performance. Personnel Psychology, 23, 297-307.
Moreno, J. L. (1975). Psychodrama. In S. Arieti (Ed.), American handbook of psychiatry (Vol. 2, pp. 1375-1396). New York: Basic Books.
Morgenthaler, G. W. (1961). The theory and application of simulation in operations research. In R. A. Ackoff (Ed.), Progress in Operations Research (Vol. 1, pp. 366-372). New York: Wiley & Sons, Inc.
Morris, D. (1991, May). The Multiple-Choice In-Basket Management Exercise. Paper presented at the 19th Annual Congress on the Assessment Centre Method, Toronto, Ontario.
Motowidlo, S. J., Dunnette, M. D., & Carter, G. W. (1990). An alternative selection procedure: The low-fidelity simulation. Journal of Applied Psychology, 75, 640-647.
Office of Strategic Services Assessment Staff. (1948). Assessment of men. New York: Rinehart.
Oldham, G. R. (1976). The motivational strategies used by supervisors: Relationships to effectiveness indicators.
Organizational Behaviour and Human Performance, 15, 66-86.
Ricciardi, F. M., Malcolm, D. C., Bellman, R., Clark, C., Kebbee, J. M., & Rawdon, R. H. (1957). Top management decisions simulation: The AMA approach. New York: American Management Association.
Robertson, I. T., & Kandola, R. S. (1982). Work sample tests: Validity, adverse impact and applicant reaction. Journal of Occupational Psychology, 55, 171-183.
Schippmann, J. S., Prien, E. P., & Katz, J. A. (1990). Reliability and validity of in-basket performance measures. Personnel Psychology, 43, 837-859.
Schmidt, F. L., Greenthal, A. L., Hunter, J. E., Berner, J. G., & Seaton, F. W. (1977). Job sample vs. paper-and-pencil trades and technical tests: Adverse impact and examinee attitudes. Personnel Psychology, 30, 187-197.
Schmidt, F. L., & Hunter, J. E. (1981). Employment testing: Old theories and new research. American Psychologist, 36, 1128-1137.
Schmidt, F. L., Hunter, J. E., McKenzie, R. C., & Muldrow, T. W. (1979). Impact of valid selection procedures on work-force productivity. Journal of Applied Psychology, 64, 609-626.
Schneider, B., & Schmitt, N. (1986). Staffing organizations (2nd ed.). Illinois: Scott, Foresman and Company.
Smith, P. C., & Kendall, L. M. (1963). Retranslation of expectation: An approach to the construction of unambiguous anchors for rating scales. Journal of Applied Psychology, 47, 149-158.
Stern, G. G., Stein, M. I., & Bloom, B. S. (1956). Methods in personality assessment. Illinois: Free Press.
Strong, E. K. (1951). Permanence of interest scores over twenty-two years. Journal of Applied Psychology, 35, 89-91.
Tett, R. P., & Jackson, D. J. (1990). Organizational and personality correlates of participative behaviours using an in-basket exercise. Journal of Occupational Psychology, 63, 175-188.
Thornton, G. C. (1992). Assessment centers in human resources management. New York: Addison-Wesley.
Thornton, G. C., & Byham, W. C. (1982).
Assessment centers and managerial performance. New York: Academic Press.
Thornton, G. C., & Cleveland, J. N. (1990). Developing managerial talent through simulation. American Psychologist, 45, 190-199.
Thurstone, L. L. (1953). Examiner Manual for the Thurstone Temperament Schedule (2nd ed.). Chicago: Science Research Associates.
Uniform guidelines on employee selection procedures. (1978). Federal Register, 43, 38290-38309.
Vroom, V. H., & Yetton, P. W. (1973). Leadership and Decision-Making. Pittsburgh, PA: University of Pittsburgh Press.
Wall, T. D., & Lischeron, J. A. (1977). Worker participation: A critique of the literature and some fresh evidence. London: McGraw-Hill.
Wernimont, P. F., & Campbell, J. P. (1968). Signs, samples and criteria. Journal of Applied Psychology, 52, 372-376.
Wiggins, J. S. (1973). Personality and prediction: Principles of personality assessment. Florida: Krieger Publishing Co.
Wollowick, H. B., & McNamara, W. J. (1969). Relationship of the components of an assessment center to management success. Journal of Applied Psychology, 53, 348-352.

Appendix A

Listing of Selected Cognitive Ability and Personality Measures used in the Construct Validity Analysis

Cognitive Ability Measures:
Wonderlic Personnel Test
Culture Fair Intelligence Test (Scale 3)
Nelson-Denny Reading Test (Form E)
    Reading Speed
    Reading Comprehension
    Vocabulary
Flanagan Industrial Test
    Expression (Form A)
    Arithmetic (Form A)
Comprehensive Ability Battery (CAB)
    Ideational Fluency (Fi)
    Spontaneous Flexibility (Fs)

Aggregates:
General Intellectual Level: average of scores of Wonderlic Personnel Test and Culture Fair Intelligence Test
Processing Written Information: aggregate of Nelson-Denny Reading Test and Flanagan Expression Test
Cognitive Flexibility: aggregate of Ideational Fluency and Spontaneous Flexibility

Personality/Motivational Inventories:
California Psychological Inventory (revised)
Sixteen Personality Factor Questionnaire

