Validation of educational assessments: a primer for simulation and beyond. Cook, David A; Hatala, Rose. Dec 7, 2016.

METHODOLOGY ARTICLE

Validation of educational assessments: a primer for simulation and beyond
David A. Cook1,2,3* and Rose Hatala4

Cook and Hatala Advances in Simulation (2016) 1:31, DOI 10.1186/s41077-016-0033-y
* Correspondence: cook.david33@mayo.edu
1Mayo Clinic Online Learning, Mayo Clinic College of Medicine, Rochester, MN, USA
2Office of Applied Scholarship and Education Science, Mayo Clinic College of Medicine, Rochester, MN, USA
Full list of author information is available at the end of the article

Open Access. © The Author(s). 2016. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Abstract
Background: Simulation plays a vital role in health professions assessment. This review provides a primer on assessment validation for educators and education researchers. We focus on simulation-based assessment of health professionals, but the principles apply broadly to other assessment approaches and topics.
Key principles: Validation refers to the process of collecting validity evidence to evaluate the appropriateness of the interpretations, uses, and decisions based on assessment results. Contemporary frameworks view validity as a hypothesis, and validity evidence is collected to support or refute the validity hypothesis (i.e., that the proposed interpretations and decisions are defensible). In validation, the educator or researcher defines the proposed interpretations and decisions, identifies and prioritizes the most questionable assumptions in making these interpretations and decisions (the "interpretation-use argument"), empirically tests those assumptions using existing or newly collected evidence, and then summarizes the evidence as a coherent "validity argument." A framework proposed by Messick identifies potential evidence sources: content, response process, internal structure, relationships with other variables, and consequences. Another framework proposed by Kane identifies key inferences in generating useful interpretations: scoring, generalization, extrapolation, and implications/decision. We propose an eight-step approach to validation that applies to either framework: define the construct and proposed interpretation, make explicit the intended decision(s), define the interpretation-use argument and prioritize needed validity evidence, identify candidate instruments and/or create/adapt a new instrument, appraise existing evidence and collect new evidence as needed, keep track of practical issues, formulate the validity argument, and make a judgment: does the evidence support the intended use?
Conclusions: Rigorous validation first prioritizes and then empirically evaluates key assumptions in the interpretation and use of assessment scores. Validation science would be improved by more explicit articulation and prioritization of the interpretation-use argument, greater use of formal validation frameworks, and more evidence informing the consequences and implications of assessment.

Good assessment is important; simulation can help
Educators, administrators, researchers, policymakers, and even the lay public recognize the importance of assessing health professionals. Trending topics such as competency-based education, milestones, and mastery learning hinge on accurate, timely, and meaningful assessment to provide essential information about performance. Assessment of professional competence increasingly extends beyond training into clinical practice, with ongoing debates regarding the requirements for initial and ongoing professional licensure and certification. Front-line educators and education researchers require defensible assessments of health professionals in clinical and nonclinical settings. Indeed, the need for good assessments has never been greater and will most likely continue to grow.
Although workplace-based assessment is essential [1–3], simulation does and will continue to play a vital role in health professions assessment, inasmuch as it permits the targeting of specific topics and skills in a safe environment [4–6]. The conditions of assessment can be standardized across learners, and the spectrum of disease, clinical contexts, and comorbidities can be manipulated to focus on, for example, common yet critical tasks, infrequently seen conditions, activities that might put patients at risk, or situations that provoke specific emotional responses [7, 8]. Thus, it comes as no surprise that simulation-based assessment is increasingly common. A review published in 2013 identified over 400 studies evaluating simulation-based assessments [9], and that number has surely grown. However, that same review identified serious and frequent shortcomings in the evidence supporting these assessments, and in the research studies designed to collect such evidence (i.e., validation studies).
The gap between the need for good simulation-based assessment and the deficiencies in the process and product of current validation efforts suggests the need for increased awareness of the current state of the science of validation.
The purpose of this article is to provide a primer on assessment validation for educators and education researchers. We focus on the context of simulation-based assessment of health professionals but believe the principles apply broadly to other assessment approaches and topics.

Validation is a process
Validation refers to the process of collecting validity evidence to evaluate the appropriateness of the interpretations, uses, and decisions based on assessment results [10]. This definition highlights several important points. First, validation is a process, not an endpoint. Labeling an assessment as "validated" means only that the validation process has been applied, i.e., that evidence has been collected. It does not tell us what process was used, the direction or magnitude of the evidence (i.e., was it favorable or unfavorable, and to what degree?), what gaps remain, or for what context (learner group, learning objectives, educational setting) the evidence is relevant.
Second, validation involves the collection of validity evidence, as we discuss in a following section.
Third, validation and validity ultimately refer to a specific interpretation or use of assessment data, be these numeric scores or narrative comments [11], and to the decisions grounded in this interpretation. We find it helpful to illustrate this point through analogy with diagnostic tests in clinical medicine [12]. A clinical test is only useful to the degree that (a) the test influences decisions, and (b) these decisions lead to meaningful changes in action or patient outcomes.
Hence, physicians are often taught, "Don't order the test if it won't change patient management." For example, the prostate-specific antigen (PSA) test has high reliability and is strongly associated with prostate cancer. However, this test is no longer widely recommended in screening for prostate cancer because it is frequently elevated when no cancer is present, because testing leads to unnecessary prostate biopsies and patient anxiety, and because treating cancers that are found often does not improve clinical outcomes (i.e., treatment is not needed). In other words, the negative/harmful consequences outweigh the beneficial consequences of testing (screening) in many patients [13–15]. However, PSA testing is still useful as a marker of disease once prostate cancer has been diagnosed and treated. Reflecting this example back to educational tests (assessments) and the importance of decisions: (1) if it will not change management, the test should not be done; (2) a test that is useful for one objective or setting may be less useful in another context; and (3) the long-term and downstream consequences of testing must be considered in determining the overall usefulness of the test.

Why is assessment validation important?
Rigorous validation of educational assessments is critically important for at least two reasons. First, those using an assessment must be able to trust the results. Validation does not give a simple yes/no answer regarding trustworthiness (validity); rather, a judgment of trustworthiness or validity depends on the intended application and context and is typically a matter of degree. Validation provides the evidence to make such judgments and a critical appraisal of remaining gaps.
Second, the number of assessment instruments, tools, and activities is essentially infinite, since each new multiple-choice question, scale item, or exam station creates a de facto new instrument. Yet, for a given educator, the relevant tasks and constructs in need of assessment are finite.
Each educator thus needs information to sort and sift among the myriad possibilities to identify the assessment solution that best meets his or her immediate needs. Potential solutions include selecting an existing instrument, adapting an existing instrument, combining elements of several instruments, or creating a novel instrument from scratch [16]. Educators need information regarding not only the trustworthiness of scores, but also the logistics and practical issues such as cost, acceptability, and feasibility that arise during test implementation and administration.
In addition, simulation-based assessments are almost by definition used as surrogates for a more "meaningful" clinical or educational outcome [17]. Rarely do we actually want to know how well learners perform in a simulated environment; usually, we want to know how they would perform in real life. A comprehensive approach to validation will include evaluating the degree to which assessment results extrapolate to different settings and outcomes [18, 19].

What do we mean by validity evidence?
Classical validation frameworks identified at least three different "types" of validity: content, construct, and criterion; see Table 1. However, this perspective has been replaced by more nuanced yet unified and practical views of validity [10, 12, 20]. Contemporary frameworks view validity as a hypothesis, and just as a researcher would collect evidence to support or refute a research hypothesis, validity evidence is collected to support or refute the validity hypothesis (more commonly referred to as the validity argument). Just as one can never prove a hypothesis, validity can never be proven; but evidence can, as it accumulates, support or refute the validity argument.

Table 1 The classical validity framework
- Content: test items and format constitute a relevant and representative sample of the domain of tasks. Examples of evidence: procedures for item development and sampling.
- Criterion (includes correlational, concurrent, and predictive validity): correlation between actual test scores and the "true" (criterion) score. Examples of evidence: correlation with a definitive standard.
- Construct: scores vary as expected based on an underlying psychological construct (used when no definitive criterion exists). Examples of evidence: correlation with another measure of the same construct; factor analysis; expert-novice comparisons; change or stability over time.
(Note: Some authors also include "face validity" as a fourth type of validity in the classical framework. However, face validity refers either to superficial appearances that have little merit in evaluating the defensibility of assessment [26, 59], like judging the speed of a car by its color, or to influential features that are better labeled content validity, like judging the speed of a car by its model or engine size. We discourage use of the term "face validity.")

The first contemporary validity framework was proposed by Messick in 1989 [21] and adopted as a standard for the field in 1999 [22] and again in 2014 [23]. This framework proposes five sources of validity evidence [24–26] that overlap in part with the classical framework (see Table 2). Content evidence, which is essentially the same as the old concept of content validity, refers to the steps taken to ensure that assessment items (including scenarios, questions, and response options) reflect the construct they are intended to measure. Internal structure evidence evaluates the relationships of individual assessment items with each other and with the overarching construct(s), e.g., reliability, domain or factor structure, and item difficulty. Relationships with other variables evidence evaluates the associations, positive or negative and strong or weak, between assessment results and other measures or learner characteristics. This corresponds closely with classical notions of criterion validity and construct validity. Response process evidence evaluates how well the documented record (answer, rating, or free-text narrative) reflects the observed performance. Issues that might interfere with the quality of responses include poorly trained raters, low-quality video recordings, and cheating. Consequences evidence looks at the impact, beneficial or harmful, of the assessment itself and the decisions and actions that result [27–29].

Table 2 The five sources of evidence validity framework (see [20, 25, 26] for further details and examples)
- Content: "the relationship between the content of a test and the construct it is intended to measure" [24]. Examples of evidence: procedures for item sampling, development, and scoring (e.g., expert panel, previously described instrument, test blueprint, and pilot testing and revision).
- Internal structure: relationship among data items within the assessment and how these relate to the overarching construct. Examples of evidence: internal consistency reliability; interrater reliability; factor analysis; test item statistics.
- Relationships with other variables: "degree to which these relationships are consistent with the construct underlying the proposed test score interpretations" [24]. Examples of evidence: correlation with tests measuring similar constructs; correlation (or lack thereof) with tests measuring different constructs; expert-novice comparisons.
- Response process: "the fit between the construct and the detailed nature of performance . . . actually engaged in" [24]. Examples of evidence: analysis of examinees' or raters' thoughts or actions during assessment (e.g., think-aloud protocol); assessment security (e.g., prevention of cheating); quality control (e.g., video capture); rater training.
- Consequences: "the impact, beneficial or harmful and intended or unintended, of assessment" [27]. Examples of evidence: impact on examinee performance (e.g., downstream effects on board scores, graduation rates, clinical performance, patient safety); other examinee effects (e.g., test preparation, length of training, stress, anxiety); definition of pass/fail standard.
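Internal structure evidence such as internal consistency reliability and item statistics can be computed with a few lines of code. A minimal sketch, using invented 0/1 checklist scores (six hypothetical examinees, four items; the data and instrument are illustrative, not from any real assessment):

```python
from statistics import variance

def cronbach_alpha(scores):
    """Internal consistency: how strongly items co-vary across examinees.
    scores: one row per examinee, one column per item."""
    k = len(scores[0])                          # number of items
    item_vars = [variance(col) for col in zip(*scores)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def item_difficulty(scores):
    """Proportion of examinees scoring 1 on each item (0/1 data)."""
    n = len(scores)
    return [sum(col) / n for col in zip(*scores)]

# Hypothetical 0/1 checklist scores: 6 examinees x 4 items.
scores = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 0],
]

print(round(cronbach_alpha(scores), 2))   # 0.65
print(item_difficulty(scores))
```

A low alpha or an extreme difficulty value would not by itself make the assessment "invalid"; it is simply one strand of internal structure evidence to weigh within the larger argument.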
Educators and researchers must identify the evidence most relevant to their assessment and corresponding decision, then collect and appraise this evidence to formulate a validity argument. Unfortunately, the "five sources of evidence" framework provides incomplete guidance in such prioritization or selection of evidence.
The most recent validity framework, from Kane [10, 12, 30], addresses the issue of prioritization by identifying four key inferences in an assessment activity (Table 3). For those accustomed to the classical or five-evidence-sources framework, Kane's framework is often challenging at first because the terminology and concepts are entirely new. In fact, when learning this framework, we have found that it helps to not attempt to match concepts with those of earlier frameworks. Rather, we begin de novo by considering conceptually the stages involved in any assessment activity. An assessment starts with a performance of some kind, such as answering a multiple-choice test item, interviewing a real or standardized patient, or performing a procedural task. Based on this observation, a score or written narrative is documented that we assume reflects the level of performance; several scores or narratives are combined to generate an overall score or interpretation that we assume reflects the desired performance in a test setting; the performance in a test setting is assumed to reflect the desired performance in a real-life setting; and that performance is further assumed to constitute a rational basis for making a meaningful decision (see Fig. 1). Each of these assumptions represents an inference that might not actually be justifiable. The documentation of performance (scoring inference) could be inaccurate; the synthesis of individual scores might not accurately reflect performance across the desired test domains (generalization inference); the synthesized score also might not reflect real-life performance (extrapolation inference); and this performance (in a test setting or real life) might not form a proper foundation for the desired decision (implications or decision inference). Kane's validity framework explicitly evaluates the justifications for each of these four inferences. We refer those wishing to learn more about Kane's framework to his description [10, 30] and to our recent synopsis of his work [12].

Table 3 The validation inferences validity framework (see Kane [10] and Cook et al. [12] for further details and examples; each inference reflects assumptions about the creation and use of assessment results)
- Scoring: the score or written narrative from a given observation adequately captures key aspects of performance. Examples of evidence: procedures for creating and empirically evaluating item wording, response options, and scoring options; rater selection and training.
- Generalization: the total score or synthesis of narratives reflects performance across the test domain. Examples of evidence: sampling strategy (e.g., test blueprint) and sample size; internal consistency reliability; interrater reliability.
- Extrapolation: the total score or synthesis in a test setting reflects meaningful performance in a real-life setting. Examples of evidence: authenticity of context; correlation with tests measuring similar constructs, especially in real-life context; correlation (or lack thereof) with tests measuring different constructs; expert-novice comparisons; factor analysis.
- Implications/decisions: measured performance constitutes a rational basis for meaningful decisions and actions. Examples of evidence: see Table 2, "Consequences."

Fig. 1 Key inferences in validation

Educators and researchers often ask how much validity evidence is needed and how the evidence from a previous validation applies when an instrument is used in a new context. Unfortunately, the answers to these questions depend on several factors including the risk of making a wrong decision (i.e., the "stakes" of the assessment), the intended use, and the magnitude and salience of contextual differences. While all assessment should be important, some assessment decisions have more impact on a learner's life than others. Assessments with higher impact or higher risk, including those used for research purposes, merit higher standards for the quantity, quality, and breadth of evidence.
Strictly speaking, validity evidence applies only to the purpose, context, and learner group in which it was collected; existing evidence might guide our choice of assessment approach but does not support our future interpretations and use. Of course, in practice, we routinely consider existing evidence in constructing a validity argument. Whether old evidence applies to a new situation requires a critical appraisal of how situational differences might influence the relevance of the evidence. For example, some items on a checklist might be relevant across different tasks while others might be task-specific; reliability can vary substantially from one group to another, with typically lower values among more homogeneous learners; and differences in context (inpatient vs outpatient), learner level (junior medical student vs senior resident), and purpose might affect our interpretation of evidence of content, relations with other variables, or consequences. Evidence collected in contexts similar to ours and consistent findings across a variety of contexts will support our choice to include existing evidence in constructing our validity argument.

What do we mean by validity argument?
In addition to clarifying the four key inferences, Kane has advanced our understanding of "argument" in the validation process by emphasizing two distinct stages of argument: an up-front "interpretation-use argument" or "IUA," and a final "validity argument." As noted above, all interpretations and uses, i.e., decisions, incur a number of assumptions.
For example, in interpreting the scores from a virtual reality assessment, we might assume that the simulation task, including the visual representation, the simulator controls, and the task itself, has relevance to tasks of clinical significance; that the scoring algorithm accounts for important elements of that task; that there are enough tasks, and enough variety among tasks, to reliably gauge trainee performance; and that it is beneficial to require trainees to continue practicing until they achieve a target score. These and other assumptions can and must be tested! Many assumptions are implicit, and recognizing and explicitly stating them before collecting or examining the evidence is an essential step. Once we have specified the intended use, we need to (a) identify as many assumptions as possible, (b) prioritize the most worrisome or questionable assumptions, and (c) come up with a plan to collect evidence that will confirm or refute the correctness of each assumption. The resulting prioritized list of assumptions and desired evidence constitutes the interpretation-use argument. Specifying the interpretation-use argument is analogous, both conceptually and in importance, to stating a research hypothesis and articulating the evidence required to empirically test that hypothesis.
Once the evaluation plan has been implemented and evidence has been collected, we synthesize the evidence, contrast these findings with what we anticipated in the original interpretation-use argument, identify strengths and weaknesses, and distill this into a final validity argument.
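This IUA-then-evidence workflow can be mimicked as a prioritized checklist that collected evidence is later scored against. A minimal sketch, in which every inference, assumption, priority, and evidence label is hypothetical, loosely patterned on the virtual reality example above:

```python
# Sketch of an interpretation-use argument (IUA) as a prioritized list of
# assumptions. Every entry is invented for illustration; priority 1 marks
# the most worrisome assumptions.
iua = [
    {"inference": "scoring",
     "assumption": "scoring algorithm captures key elements of the task",
     "priority": 1, "evidence": "favorable"},
    {"inference": "generalization",
     "assumption": "enough task variety for reliable scores",
     "priority": 1, "evidence": None},
    {"inference": "extrapolation",
     "assumption": "simulated task resembles the clinical task",
     "priority": 2, "evidence": "favorable"},
    {"inference": "implications",
     "assumption": "mandatory practice to target score benefits trainees",
     "priority": 1, "evidence": None},
]

def evidence_gaps(iua, max_priority=1):
    """List high-priority assumptions that still lack evidence: the places
    where new validation data should be collected first."""
    return [a["assumption"] for a in iua
            if a["priority"] <= max_priority and a["evidence"] is None]

print(evidence_gaps(iua))
# ['enough task variety for reliable scores',
#  'mandatory practice to target score benefits trainees']
```

The payoff is the filter at the end: evidence is sought for the most worrisome assumptions first, rather than for whichever assumptions happen to be easiest to test.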
Although the validity argument attempts to persuade others that the interpretations and uses are indeed defensible, or that important gaps remain, potential users should be able to arrive at their own conclusions regarding the sufficiency of the evidence and the accuracy of the bottom-line appraisal. Our work is similar to that of an attorney arguing a case before a jury: we strategically seek, organize, and interpret the evidence and present an honest, complete, and compelling argument, yet it is the "jury" of potential users that ultimately passes judgment on validity for their intended use and context [31].
It is unlikely that any single study will gather all the validity evidence required to support a specific decision. Rather, different studies will usually address different aspects of the argument, and educators need to consider the totality of the evidence when choosing an assessment instrument for their context and needs.
Of course, it is not enough for researchers to simply collect any evidence. It is not just the quantity of evidence that matters, but also the relevance, quality, and breadth. Collecting abundant evidence of score reliability does not obviate the need for evidence about content, relationships, or consequences. Conversely, if existing evidence is robust and logically applicable to our context, such as a rigorous item development process, then replicating such efforts may not be a top priority. Unfortunately, researchers often inadvertently fail to deliberately prioritize the importance of the assumptions or skip the interpretation-use argument altogether, which can result in reporting evidence for assumptions that are easy to test rather than those that are most critical.

A practical approach to validation
Although the above concepts are essential to understanding the process of validation, it is also important to be able to apply this process in practical ways.
Table 4 outlines one possible approach to validation that would work with any of the validity frameworks described above (classical, Messick, or Kane). In this section, we will illustrate this approach using a hypothetical simulation-based example. Imagine that we are teaching first-year internal medicine residents lumbar puncture (LP) using a part-task trainer. At the end of the training session, we wish to assess whether the learners are ready to safely attempt an LP with a real patient under supervision.

Table 4 A practical approach to validation
1. Define the construct and proposed interpretation
2. Make explicit the intended decision(s)
3. Define the interpretation-use argument, and prioritize needed validity evidence
4. Identify candidate instruments and/or create/adapt a new instrument
5. Appraise existing evidence and collect new evidence as needed
6. Keep track of practical issues including cost
7. Formulate/synthesize the validity argument in relation to the interpretation-use argument
8. Make a judgment: does the evidence support the intended use?

1. Define the construct and proposed interpretation
Validation begins by considering the construct of interest. For example, are we interested in the learners' knowledge of LP indications and risks, their ability to perform LP, or their non-technical skills when attempting an LP? Each of these is a different construct requiring selection of a different assessment tool: we might choose multiple-choice questions (MCQs) to assess knowledge, a series of skill stations using a part-task trainer to assess procedural skill with an Objective Structured Assessment of Technical Skills (OSATS) [32], or a resuscitation scenario using a high-fidelity manikin and a team of providers to assess non-technical skills with the Non-Technical Skills (NOTECHS) scale [33].
In our example, the construct is "LP skill" and the interpretation is that "learners have fundamental LP skills sufficient to attempt a supervised LP on a real patient."

2. Make explicit the intended decision(s)
Without a clear idea of the decisions we anticipate making based on those interpretations, we will be unable to craft a coherent validity argument.
In our example, our foremost decision is whether the learner has sufficient procedural competence to attempt a supervised LP on a real patient. Other decisions we might alternatively consider include identifying performance points on which to offer feedback to the learner, deciding if the learner can be promoted to the next stage of training, or certifying the learner for licensure.

3. Define the interpretation-use argument, and prioritize needed validity evidence
In making our interpretations and decisions, we will invoke a number of assumptions, and these must be tested. Identifying and prioritizing key assumptions and anticipating the evidence we hope to find allows us to outline an interpretation-use argument [30].
In our scenario, we are looking for an assessment instrument in which a "pass" indicates competence to attempt a supervised LP on a real patient. We anticipate that this will involve a physician rating student performance on a skills station. Assumptions in this context include that the station is set up to test techniques essential for LP performance (vs generic skills in sterile technique or instrument handling), that the rater is properly trained, that a different rater would give similar scores, and that learners who score higher on the test will perform more safely on their first patient attempt. Considering the evidence we might need to support or refute these assumptions, and using Kane's framework as a guide, we propose an interpretation-use argument as follows. We do not know at this stage whether evidence has already been collected or if we will need to collect it ourselves, but have at least identified what to look for.
(a) Scoring: the observation of performance is correctly transformed into a consistent numeric score. Evidence will ideally show that the items within the instrument are relevant to LP performance, that raters understood how to use the instrument, and that video-recording performance yields similar scores as direct observation.
(b) Generalization: scores on a single performance align with overall scores in the test setting. Evidence will ideally show that we have adequately sampled performance (sufficient number of simulated LPs, and sufficient variety of conditions such as varying the simulated patient habitus) and that scores are reproducible between performances and between raters (inter-station and inter-rater reliability).
(c) Extrapolation: assessment scores relate to real-world performance. Evidence will ideally show that scores from the instrument correlate with other LP performance measures in real practice, such as procedural logs, patient adverse events, or supervisor ratings.
(d) Implications: the assessment has important and favorable effects on learners, training programs, or patients, and negative effects are minimal. Evidence will ideally show that students feel more prepared following the assessment, that those requiring remediation feel this time was well spent, and that LP complications in real patients decline in the year following implementation.
We cannot over-emphasize the importance of these first three steps in validation. Clearly articulating the proposed interpretations, intended decision(s), and assumptions and corresponding evidence collectively sets the stage for everything that follows.

4. Identify candidate instruments and/or create/adapt a new instrument
We should identify a measurement format that aligns conceptually with our target construct and then search for existing instruments that meet or could be adapted to our needs. A rigorous search provides content evidence to support our final assessment. Only if we cannot find an appropriate existing instrument would we develop an instrument de novo.
We find a description of a checklist for assessing PGY-1s' procedural competence in LP [34]. The checklist appears well suited for our purpose, as we will be using it in a similar educational context; we thus proceed to appraising the evidence without changing the instrument.

5. Appraise existing evidence and collect new evidence as needed
We begin our appraisal of the validity argument by searching for existing evidence. The original description [34] offers scoring evidence by describing the development of checklist items through formal LP task analysis and expert consensus. It provides generalization evidence by showing good inter-rater reliability, and adds limited extrapolation evidence by confirming that residents with more experience had higher checklist scores. Other studies using the same or a slightly modified checklist provide further evidence for generalization with good inter-rater reliabilities [35, 36], and contribute extrapolation evidence by showing that scores are higher after training [35, 37] and that the instrument identified important learner errors when used to rate real patients [38]. One study also provided limited implications evidence by counting the number of practice attempts required to attain competence in the simulation setting [37].
Although existing evidence does not, strictly speaking, apply to our situation, for practical purposes we will rely heavily on existing evidence as we decide whether to use this instrument. Of course, we will want to collect our own evidence as well, but we must base our initial adoption on what is now available. In light of these existing studies, we will not plan to collect more evidence before our initial adoption of this instrument. However, we will collect our own evidence during implementation, especially if we identify important gaps, i.e., at later stages in the validation process; see below.

6. Keep track of practical issues including cost
An important yet often poorly appreciated and under-studied aspect of validation concerns the practical issues surrounding development, implementation, and interpretation of scores. An assessment procedure might yield outstanding data, but if it is prohibitively expensive or if logistical or expertise requirements exceed local resources, it may be impossible to implement.
For the LP instrument, one study [37] tracked the costs of running a simulation-based LP training and assessment session; the authors suggested that costs could be reduced by using trained non-physician raters. As we implement the instrument, and especially if we collect fresh validity evidence, we should likewise monitor costs such as money, human and non-human resources, and other practical issues.

7. Formulate/synthesize the validity argument in relation to the interpretation-use argument
We now compare the evidence available (the validity argument) against the evidence we identified up-front as necessary to support the desired interpretations and decisions (the interpretation-use argument). We find reasonable scoring and generalization evidence, a gap in the extrapolation evidence (direct comparisons between simulation and real-world performance have not been done), and limited implications evidence. As is nearly always the case, the match between the interpretation-use argument and the available evidence is not perfect; some gaps remain, and some of the evidence is not as favorable as we might wish.

8. Make a judgment: does the evidence support the intended use?
The final step in validation is to judge the sufficiency and suitability of evidence, i.e., whether the validity argument and the associated evidence meet the demands of the proposed interpretation-use argument.
Based on the evidence summarized above, we judge that the validity argument supports those interpretations and uses reasonably well, and the checklist appears suitable for our purposes. Moreover, the costs seem reasonable for the effort expended, and we have access to an assistant in the simulation laboratory who is keen to be trained as a rater.
We also plan to help resolve the evidence gaps noted above by conducting a research study as we implement the instrument at our institution. To buttress the extrapolation inference we plan to correlate scores from the simulation assessment with ongoing workplace-based LP assessments. We will also address the implications inference by tracking the effects of additional training for poor-performing residents, i.e., the downstream consequences of assessment. Finally, we will measure the inter-rater, inter-case, and internal consistency reliability in our learner population, and will monitor costs and practical issues as noted above.

Application of the same instrument to a different setting
As a thought exercise, let us consider how the above would unfold if we wanted to use the same instrument for a different purpose and decision, for example as part of a high-stakes exam to certify postgraduate neurologist trainees as they finish residency. As our decision changes, so does our interpretation-use argument; we would now be searching for evidence that a "pass" score on the checklist indicates competence to independently perform LPs on a variety of real patients.

Common mistakes to avoid in validation
In our own validation efforts [39–41] and in reviewing
We would require different oradditional validity evidence, with increased emphasis ongeneralization (sampling across simulated patients thatvary in age, body habitus, and other factors that influencedifficulty), extrapolation (looking for stronger correlationbetween simulation and real-life performance), and impli-cations evidence (e.g., evidence that we were accuratelyclassifying learners as competent or incompetent for inde-pendent practice). We would have to conclude that thecurrent body of evidence does not support this argumentand would need to either (a) find a new instrument withevidence that meets our demands, (b) create a new instru-ment and start collecting evidence from scratch, or (c)collect additional validity evidence to fill in the gaps.This thought exercise highlights two important points.First, the interpretation-use argument might change whenthe decision changes. Second, an instrument is not “valid”in and of itself; rather, it is the interpretations or decisionsthat are validated. A final judgment of validity based onthe same evidence may differ for different proposeddee work of others [9, 25, 42], we have identified severalmmon mistakes that undermine the end-user’s abilityto understand and apply the results. We present these asten mistakes guaranteed to alarm peer reviewers, frustratereaders, and limit the uptake of an instrument.Mistake 1. Reinvent the wheel (create a new assessmentevery time)Our review [9] found that the vast majority of validitystudies focused on a newly created instrument ratherthan using or adapting an existing instrument. Yet, thereis rarely a need to start completely from scratch wheninitiating learner assessment, as instruments to assessmost constructs already exist in some form. 
Using or building from an existing instrument saves the trouble of developing an instrument de novo, allows us to compare our results with prior work, and permits others to compare their work with ours and include our evidence in the overall evidence base for that instrument, task, or assessment modality. Reviews of evidence for the OSATS [42], the Fundamentals of Laparoscopic Surgery (FLS) program [43], and other simulation-based assessments [9] all show important gaps in the evidence base. Filling these gaps will require the collaborative effort of multiple investigators, all focused on collecting evidence for the scores, inferences, and decisions derived from the same assessment.

Mistake 2. Fail to use a validation framework

As noted above, validation frameworks add rigor to the selection and collection of evidence and help identify gaps that might otherwise be missed. More important than the framework chosen is the timing (ideally early) and manner (rigorous and complete) in which the framework is applied in the validation effort.

Mistake 3. Make expert-novice comparisons the crux of the validity argument

Comparing the scores of a less experienced group against those of a more experienced group (e.g., medical students vs senior residents) is a common approach to collecting evidence of relationships with other variables, reported in 73% of studies of simulation-based assessment [9]. Yet this approach provides only weak evidence, because the difference in scores may arise from a myriad of factors unrelated to the intended construct [44]. To take an extreme example for illustration, suppose an assessment intended to measure suturing ability actually measured sterile technique and completely ignored suturing.
If an investigator trialed this in practice among third-year medical students and attending physicians, he would most likely find a significant difference favoring the attendings and might erroneously conclude that this evidence supports the validity of the proposed interpretation (i.e., suturing skill). Of course, in this hypothetical example, we know that attendings are better than medical students in both suturing and sterile technique. Yet in real life we lack omniscient knowledge of what is actually being assessed; we only know the test scores, and the same scores can be interpreted as reflecting any number of underlying constructs. This problem of "confounding" (multiple possible interpretations) makes it impossible to say that any differences between groups are actually linked to the intended construct. On the other hand, failure to confirm expected differences would constitute powerful evidence of score invalidity.

Cook provided an extended discussion and illustration of this problem, concluding that "It is not wrong to perform such analyses, … provided researchers understand the limitations. … These analyses will be most interesting if they fail to discriminate groups that should be different, or find differences where none should exist. Confirmation of hypothesized differences or similarities adds little to the validity argument." [44]

Mistake 4. Focus on the easily accessible validity evidence rather than the most important

Validation researchers often focus on data they have readily available or can easily collect. While this approach is understandable, it often results in abundant validity evidence being reported for one source while large evidence gaps remain for other sources that might be equally or more important.
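Expert-novice comparisons are a prime example of easily collected evidence, and the confounding problem described under Mistake 3 can be made concrete with a small toy simulation. Everything here is an assumption for illustration (group sizes, effect sizes, and the score model come from no study): the hypothetical "suturing" score actually measures only sterile technique, which also improves with experience, so a large group difference still appears.

```python
import random
import statistics

rng = random.Random(3)

def flawed_scores(n, experience):
    """Scores from a hypothetical 'suturing' assessment that in fact
    measures only sterile technique, which also tracks experience."""
    return [experience + rng.gauss(0, 1) for _ in range(n)]

students = flawed_scores(200, experience=0.0)    # assumed third-year students
attendings = flawed_scores(200, experience=2.0)  # assumed attending physicians

# The group difference is large and "statistically convincing" even though
# the score ignores the intended construct (suturing) entirely.
mean_gap = statistics.mean(attendings) - statistics.mean(students)
pooled_sd = statistics.pstdev(students + attendings)
effect_size = mean_gap / pooled_sd
```

The point of the sketch is exactly Cook's caution quoted above: a confirmed expert-novice difference is equally consistent with the intended construct and with any confounded one, so it adds little to the validity argument.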
Examples include emphasizing content evidence while neglecting internal structure, reporting inter-item reliability when inter-rater reliability is more important, or reporting expert-novice comparisons rather than correlations with an independent measure to support relationships with other variables. In our review, we found that 306/417 (73%) of studies reported expert-novice comparisons, and 138 of these (45%) reported no additional evidence. By contrast, only 128 (31%) reported relationships with a separate measure, 142 (34%) reported content evidence, and 163 (39%) reported score reliability. While we do not know all the reasons for these reporting patterns, we suspect they are due at least in part to the ease with which some elements (e.g., expert-novice comparison data) can be obtained.

This underscores the importance of clearly and completely stating the interpretation-use argument, identifying existing evidence and gaps, and tailoring the collection of evidence to address the most important gaps.

Mistake 5. Focus on the instrument rather than score interpretations and uses

As noted above, validity is a property of scores, interpretations, and uses, not of instruments. The same instrument can be applied to different uses (the PSA may not be useful as a clinical screening tool, but continues to have value for monitoring prostate cancer recurrence), and much validity evidence is context-dependent. For example, score reliability can change substantially across different populations [44], an assessment designed for one learning context such as ambulatory practice may or may not be relevant in another context such as hospital or acute care medicine, and some instruments, such as the OSATS global rating scale, lend themselves readily to application to a new task while others, such as the OSATS checklist, do not [42].
Of course, evidence collected in one context, such as medical school, often has at least partial relevance to another context, such as residency training; but determinations of when and to what degree evidence transfers to a new setting are a matter of judgment, and these judgments are potentially fallible.

The interpretation-use argument cannot, strictly speaking, be appropriately made without articulating the context of intended application. Since the researcher's context and the end-user's context almost always differ, the interpretation-use argument necessarily differs as well. Researchers can facilitate subsequent uptake of their work by clearly specifying the context of data collection (for example, the learner group, task, and intended use/decision) and also by proposing the scope to which they believe their findings might plausibly apply.

It is acceptable to talk about the validity of scores, but for the reasons articulated above, it is better to specify the intended interpretation and use of those scores, i.e., the intended decision. We strongly encourage both researchers and end-users (educators) to articulate the interpretations and uses at every stage of validation.

Mistake 6. Fail to synthesize or critique the validity evidence

We have often observed researchers merely report the evidence without any attempt at synthesis and appraisal. Both educators and future investigators benefit greatly when researchers interpret their findings in light of the proposed interpretation-use argument, integrate them with prior work to create a current and comprehensive validity argument, and identify shortcomings and persistent gaps or inconsistencies. Educators and other end-users must become familiar with the evidence as well, to confirm the claims of researchers and to formulate their own judgments of validity for their specific context.

Mistake 7. Ignore best practices for assessment development

Volumes have been written on the development, refinement, and implementation of assessment tasks, instruments, and procedures [23, 45–48]. Developing or modifying an assessment without considering these best practices would be imprudent. We cannot begin to summarize them all here, but we highlight two recommendations of particular salience to health professions educators, both of which relate to content evidence (per the classic or five-sources frameworks) and the generalization inference (per Kane).

First, the sample of tasks or topics should represent the desired performance domain. A recurrent finding in health professions assessment is that there are few, if any, generalizable skills; performance on one task does not predict performance on another task [49, 50]. Thus, the assessment must provide a sufficiently large and broad sample of scenarios, cases, tasks, stations, etc.

Second, the assessment response format should balance objectification and judgment or subjectivity [51]. The advantages and disadvantages of checklists and global ratings have long been debated, and it turns out that both have strengths and weaknesses [52]. Checklists outline specific criteria for desired behaviors and guidance for formative feedback, and as such can often be used by raters less familiar with the assessment task. However, the "objectivity" of checklists is largely an illusion [53]: correct interpretation of an observed behavior may yet require task-relevant expertise, and forcing raters to dichotomize ratings may result in a loss of information. Moreover, a new checklist must be created for each specific task, and the items often reward thoroughness at the expense of actions that might more accurately reflect clinical competence.
By contrast, global ratings require greater expertise to use but can measure more subtle nuances of performance and reflect multiple complementary perspectives. Global ratings can also be designed for use across multiple tasks, as is the case for the OSATS. In a recent systematic review, we found slightly higher inter-rater reliability for checklists than for global ratings when averaged across studies, while global ratings had higher average inter-item and inter-station reliability [52]. Qualitative assessment offers another option for assessing some learner attributes [11, 54, 55].

Mistake 8. Omit details about the instrument

It is frustrating to identify an assessment with relevance to local needs and validity evidence supporting intended uses, only to find that the assessment is not specified in sufficient detail to permit application. Important omissions include the precise wording of instrument items, the scoring rubric, the instructions provided to learners or raters, and a description of station arrangements (e.g., materials required in a procedural task, participant training in a standardized patient encounter) and the sequence of events. Most researchers want others to use their creations and cite their publications; this is far more likely to occur if the needed details are reported. Online appendices provide an alternative to print publication if article length is a problem.

Mistake 9. Let the availability of the simulator/assessment instrument drive the assessment

Too often as educators, we allow the availability of an assessment tool to drive the assessment process, such as taking an off-the-shelf MCQ exam for an end-of-clerkship assessment when a performance-based assessment might better align with clerkship objectives.
This issue is further complicated in simulation-based assessment, where the availability of a simulator may drive the educational program, as opposed to designing the educational program and then choosing the simulation that best fits the educational needs [56]. We should align the construct we are teaching with the simulator and assessment tool that best assess that construct.

Mistake 10. Label an instrument as validated

There are three problems with labeling an instrument as validated. First, validity is a property of scores, interpretations, and decisions, not instruments. Second, validity is a matter of degree, not a yes-or-no decision. Third, validation is a process, not an endpoint. The word validated means only that a process has been applied; it does not provide any details about that process, nor does it indicate the magnitude or direction (supportive or opposing) of the empiric findings.

The future of simulation-based assessment

Although we do not pretend to know the future of simulation-based assessment, we conclude with six aspirational developments we hope come to pass.

1. We hope to see greater use of simulation-based assessment as part of a suite of learner assessments. Simulation-based assessment should not be a goal in and of itself, but we anticipate more frequent assessment in general and believe that simulation will play a vital role. The choice of modality should first consider the best assessment approach for a given situation, i.e., the learning objective, learner level, and educational context. Simulation in its various forms will often be the answer, especially for skill assessments requiring standardization of conditions and content.

2. We hope that simulation-based assessment will focus more clearly on educational needs and less on technology.
Expensive manikins and virtual reality task trainers may play a role, but pigs' feet, Penrose drains, wooden pegs, and cardboard manikins may actually offer more practical utility because they can be used with greater frequency and with fewer constraints. For example, such low-cost models can be used at home or on the wards rather than in a dedicated simulation center. As we consider the need for high-value, cost-conscious education [57], we encourage innovative educators to actively seek low-tech solutions.

3. Building on the first two points, we hope to see less expensive, less sophisticated, less intrusive, lower-stakes assessments take place more often and in a greater variety of contexts, both simulated and in the workplace. As Schuwirth and van der Vleuten have proposed [58], this model would, over time, paint a more complete picture of the learner than any single assessment, no matter how well designed, could likely achieve.

4. We hope to see fewer new assessment instruments created and more evidence collected to support and adapt existing instruments. While we appreciate the forces that might incentivize the creation of novel instruments, we believe that the field will advance farther and faster if researchers pool their efforts to extend the validity evidence for a smaller subset of promising instruments, evaluating such instruments in different contexts and successively filling in evidence gaps.

5. We hope to see more evidence informing the consequences and implications of assessment. This is probably the most important evidence source, yet it is among the least often studied. Suggestions for the study of the consequences of assessment have recently been published [27].

6.
Finally, we hope to see more frequent and more explicit use of the interpretation-use argument. As noted above, this initial step is difficult but vitally important to meaningful validation.

Acknowledgements
Not applicable.

Funding
None.

Availability of data and materials
Not applicable.

Authors' contributions
Authors DAC and RH jointly conceived this work. DAC drafted the initial manuscript, and both authors revised the manuscript for important intellectual content and approved the final version.

Competing interests
The authors declare that they have no competing interests.

Consent for publication
Not applicable.

Ethics approval and consent to participate
Not applicable.

Author details
1 Mayo Clinic Online Learning, Mayo Clinic College of Medicine, Rochester, MN, USA. 2 Office of Applied Scholarship and Education Science, Mayo Clinic College of Medicine, Rochester, MN, USA. 3 Division of General Internal Medicine, Mayo Clinic College of Medicine, Mayo 17-W, 200 First Street SW, Rochester, MN 55905, USA. 4 Department of Medicine, University of British Columbia, Vancouver, British Columbia, Canada.

Received: 20 July 2016 Accepted: 16 November 2016

References
1. Norcini J, Burch V. Workplace-based assessment as an educational tool: AMEE Guide No. 31. Med Teach. 2007;29:855–71.
2. Kogan JR, Holmboe ES, Hauer KE. Tools for direct observation and assessment of clinical skills of medical trainees: a systematic review. JAMA. 2009;302:1316–26.
3. Holmboe ES, Sherbino J, Long DM, Swing SR, Frank JR. The role of assessment in competency-based medical education. Med Teach. 2010;32:676–82.
4. Ziv A, Wolpe PR, Small SD, Glick S. Simulation-based medical education: an ethical imperative. Acad Med. 2003;78:783–8.
5. Schuwirth LWT, van der Vleuten CPM. The use of clinical simulations in assessment. Med Educ. 2003;37(s1):65–71.
6. Boulet JR, Jeffries PR, Hatala RA, Korndorffer JR Jr, Feinstein DM, Roche JP. Research regarding methods of assessing learning outcomes. Simul Healthc. 2011;6(Suppl):S48–51.
7. Issenberg SB, McGaghie WC, Hart IR, Mayer JW, Felner JM, Petrusa ER, et al. Simulation technology for health care professional skills training and assessment. JAMA. 1999;282:861–6.
8. Amin Z, Boulet JR, Cook DA, Ellaway R, Fahal A, Kneebone R, et al. Technology-enabled assessment of health professions education: consensus statement and recommendations from the Ottawa 2010 conference. Med Teach. 2011;33:364–9.
9. Cook DA, Brydges R, Zendejas B, Hamstra SJ, Hatala R. Technology-enhanced simulation to assess health professionals: a systematic review of validity evidence, research methods, and reporting quality. Acad Med. 2013;88:872–83.
10. Kane MT. Validation. In: Brennan RL, editor. Educational measurement. 4th ed. Westport: Praeger; 2006. p. 17–64.
11. Cook DA, Kuper A, Hatala R, Ginsburg S. When assessment data are words: validity evidence for qualitative educational assessments. Acad Med. 2016. Epub ahead of print 2016 Apr 5.
12. Cook DA, Brydges R, Ginsburg S, Hatala R. A contemporary approach to validity arguments: a practical guide to Kane's framework. Med Educ. 2015;49:560–75.
13. Barry MJ. Screening for prostate cancer—the controversy that refuses to die. N Engl J Med. 2009;360:1351–4.
14. Moyer VA. Screening for prostate cancer: U.S. Preventive Services Task Force recommendation statement. Ann Intern Med. 2012;157:120–34.
15. Hayes JH, Barry MJ. Screening for prostate cancer with the prostate-specific antigen test: a review of current evidence. JAMA. 2014;311:1143–9.
16. Artino AR Jr, La Rochelle JS, Dezee KJ, Gehlbach H. Developing questionnaires for educational research: AMEE Guide No. 87. Med Teach. 2014;36:463–74.
17. Brydges R, Hatala R, Zendejas B, Erwin PJ, Cook DA. Linking simulation-based educational assessments and patient-related outcomes: a systematic review and meta-analysis. Acad Med. 2015;90:246–56.
18. Fleming TR, DeMets DL. Surrogate end points in clinical trials: are we being misled? Ann Intern Med. 1996;125:605–13.
19. Prentice RL.
Surrogate endpoints in clinical trials: definition and operational criteria. Stat Med. 1989;8:431–40.
20. Downing SM. Validity: on the meaningful interpretation of assessment data. Med Educ. 2003;37:830–7.
21. Messick S. Validity. In: Linn RL, editor. Educational measurement. 3rd ed. New York: American Council on Education and Macmillan; 1989. p. 13–103.
22. American Educational Research Association, American Psychological Association, National Council on Measurement in Education. Standards for educational and psychological testing. Washington, DC: American Educational Research Association; 1999.
23. American Educational Research Association, American Psychological Association, National Council on Measurement in Education. Standards for educational and psychological testing. Washington, DC: American Educational Research Association; 2014.
24. American Educational Research Association, American Psychological Association, National Council on Measurement in Education. Validity. In: Standards for educational and psychological testing. Washington, DC: American Educational Research Association; 2014. p. 11–31.
25. Cook DA, Zendejas B, Hamstra SJ, Hatala R, Brydges R. What counts as validity evidence? Examples and prevalence in a systematic review of simulation-based assessment. Adv Health Sci Educ. 2014;19:233–50.
26. Cook DA, Beckman TJ. Current concepts in validity and reliability for psychometric instruments: theory and application. Am J Med. 2006;119:166.e7–16.
27. Cook DA, Lineberry M. Consequences validity evidence: evaluating the impact of educational assessments. Acad Med. 2016;91:785–95.
28. Moss PA. The role of consequences in validity theory. Educ Meas Issues Pract. 1998;17(2):6–12.
29. Haertel E. How is testing supposed to improve schooling? Measurement. 2013;11:1–18.
30. Kane MT. Validating the interpretations and uses of test scores. J Educ Meas. 2013;50:1–73.
31. Cook DA. When I say… validity. Med Educ. 2014;48:948–9.
32. Martin JA, Regehr G, Reznick R, MacRae H, Murnaghan J, Hutchison C, et al. Objective structured assessment of technical skill (OSATS) for surgical residents. Br J Surg. 1997;84:273–8.
33. Moorthy K, Munz Y, Forrest D, Pandey V, Undre S, Vincent C, et al. Surgical crisis management skills training and assessment: a simulation-based approach to enhancing operating room performance. Ann Surg. 2006;244:139–47.
34. Lammers RL, Temple KJ, Wagner MJ, Ray D. Competence of new emergency medicine residents in the performance of lumbar punctures. Acad Emerg Med. 2005;12:622–8.
35. Shanks D, Brydges R, den Brok W, Nair P, Hatala R. Are two heads better than one? Comparing dyad and self-regulated learning in simulation training. Med Educ. 2013;47:1215–22.
36. Brydges R, Nair P, Ma I, Shanks D, Hatala R. Directed self-regulated learning versus instructor-regulated learning in simulation training. Med Educ. 2012;46:648–56.
37. Conroy SM, Bond WF, Pheasant KS, Ceccacci N. Competence and retention in performance of the lumbar puncture procedure in a task trainer model. Simul Healthc. 2010;5:133–8.
38. White ML, Jones R, Zinkan L, Tofil NM. Transfer of simulated lumbar puncture training to the clinical setting. Pediatr Emerg Care. 2012;28:1009–12.
39. Hatala R, Issenberg SB, Kassen B, Cole G, Bacchus CM, Scalese RJ. Assessing cardiac physical examination skills using simulation technology and real patients: a comparison study. Med Educ. 2008;42:628–36.
40. Hatala R, Scalese RJ, Cole G, Bacchus M, Kassen B, Issenberg SB. Development and validation of a cardiac findings checklist for use with simulator-based assessments of cardiac physical examination competence. Simul Healthc. 2009;4:17–21.
41. Dong Y, Suri HS, Cook DA, Kashani KB, Mullon JJ, Enders FT, et al. Simulation-based objective assessment discerns clinical proficiency in central line placement: a construct validation. Chest. 2010;137:1050–6.
42. Hatala R, Cook DA, Brydges R, Hawkins RE. Constructing a validity argument for the Objective Structured Assessment of Technical Skills (OSATS): a systematic review of validity evidence. Adv Health Sci Educ Theory Pract. 2015;20:1149–75.
43. Zendejas B, Ruparel RK, Cook DA. Validity evidence for the Fundamentals of Laparoscopic Surgery (FLS) program as an assessment tool: a systematic review. Surg Endosc. 2016;30:512–20.
44. Cook DA. Much ado about differences: why expert-novice comparisons add little to the validity argument. Adv Health Sci Educ Theory Pract. 2014. Epub ahead of print 2014 Sep 27.
45. Brennan RL. Educational measurement. 4th ed. Westport: Praeger; 2006.
46. DeVellis RF. Scale development: theory and applications. 2nd ed. Thousand Oaks, CA: Sage Publications; 2003.
47. Streiner DL, Norman GR. Health measurement scales: a practical guide to their development and use. 3rd ed. New York: Oxford University Press; 2003.
48. Downing SM, Yudkowsky R. Assessment in health professions education. New York, NY: Routledge; 2009.
49. Neufeld VR, Norman GR, Feightner JW, Barrows HS. Clinical problem-solving by medical students: a cross-sectional and longitudinal analysis. Med Educ. 1981;15:315–22.
50. Norman GR. The glass is a little full—of something: revisiting the issue of content specificity of problem solving. Med Educ. 2008;42:549–51.
51. Eva KW, Hodges BD. Scylla or Charybdis? Can we navigate between objectification and judgement in assessment? Med Educ. 2012;46:914–9.
52. Ilgen JS, Ma IW, Hatala R, Cook DA. A systematic review of validity evidence for checklists versus global rating scales in simulation-based assessment. Med Educ. 2015;49:161–73.
53. Norman GR, Van der Vleuten CP, De Graaff E. Pitfalls in the pursuit of objectivity: issues of validity, efficiency and acceptability. Med Educ. 1991;25:119–26.
54. Kuper A, Reeves S, Albert M, Hodges BD. Assessment: do we need to broaden our methodological horizons? Med Educ. 2007;41:1121–3.
55. Govaerts M, van der Vleuten CP. Validity in work-based assessment: expanding our horizons. Med Educ. 2013;47:1164–74.
56. Hamstra SJ, Brydges R, Hatala R, Zendejas B, Cook DA. Reconsidering fidelity in simulation-based training. Acad Med. 2014;89:387–92.
57. Cook DA, Beckman TJ. High-value, cost-conscious medical education. JAMA Pediatr. 2015. Epub ahead of print 2014 Dec 23.
58. Schuwirth LW, van der Vleuten CP. A plea for new psychometric models in educational assessment. Med Educ. 2006;40:296–300.
59. Downing SM. Face validity of assessments: faith-based interpretations or evidence-based science? Med Educ. 2006;40:7–8.

