UBC Faculty Research and Publications

A randomised trial of the influence of racial stereotype bias on examiners’ scores, feedback and recollections… Yeates, Peter; Woolf, Katherine; Benbow, Emyr; Davies, Ben; Boohan, Mairhead; Eva, Kevin Oct 25, 2017


RESEARCH ARTICLE Open Access

A randomised trial of the influence of racial stereotype bias on examiners’ scores, feedback and recollections in undergraduate clinical exams

Peter Yeates1,2*, Katherine Woolf3, Emyr Benbow4,5, Ben Davies6, Mairhead Boohan7 and Kevin Eva8

Abstract

Background: Asian medical students and doctors receive lower scores on average than their white counterparts in examinations in the UK and internationally (a phenomenon known as “differential attainment”). This could be due to examiner bias or to social, psychological or cultural influences on learning or performance. We investigated whether students’ scores or feedback show influence of ethnicity-related bias; whether examiners unconsciously bring to mind (activate) stereotypes when judging Asian students’ performance; whether activation depends on the stereotypicality of students’ performances; and whether stereotypes influence examiner memories of performances.

Methods: This is a randomised, double-blinded, controlled, Internet-based trial. We created near-identical videos of medical student performances on a simulated Objective Structured Clinical Exam using British Asian and white British actors. Examiners were randomly assigned to watch performances from white and Asian students that were either consistent or inconsistent with a previously described stereotype of Asian students’ performance.
We compared the two examiner groups in terms of the following: the scores and feedback they gave white and Asian students; how much the Asian stereotype was activated in their minds (response times to Asian-stereotypical vs neutral words in a lexical decision task); and whether the stereotype influenced memories of student performances (recognition rates for real vs invented stereotype-consistent vs stereotype-inconsistent phrases from one of the videos).

Results: Examiners responded to Asian-stereotypical words (716 ms, 95% confidence interval (CI) 702–731 ms) faster than neutral words (769 ms, 95% CI 753–786 ms, p < 0.001), suggesting Asian stereotypes were activated (or at least active) in examiners’ minds. This occurred regardless of whether examiners observed stereotype-consistent or stereotype-inconsistent performances. Despite this stereotype activation, student ethnicity had no influence on examiners’ scores; on the feedback examiners gave; or on examiners’ memories for one performance.

Conclusions: Examiner bias does not appear to explain the differential attainment of Asian students in UK medical schools. Efforts to ensure equality should focus on social, psychological and cultural factors that may disadvantage learning or performance in Asian and other minority ethnic students.

Keywords: Medical education, Assessment, Differential attainment, Ethnicity, Stereotypes

* Correspondence: p.yeates@keele.ac.uk
1 Medical Education Research, School of Medicine, David Weatherall Building, Keele University, Newcastle under Lyme ST5 5BG, UK
2 Acute and Respiratory Medicine at Pennine Acute Hospitals NHS Trust, Bury, UK
Full list of author information is available at the end of the article

© The Author(s).
2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Yeates et al. BMC Medicine (2017) 15:179 DOI 10.1186/s12916-017-0943-0

Background

Medical students and doctors from black and minority ethnic (BME) backgrounds, including those from Asian groups, perform less well on average than their white counterparts in assessments [1]. These ethnic differences are found among British-trained students, not just in international medical graduates, and reflect similar findings from the Netherlands [2], the USA [3] and Australia [4]. This effect is small but consistent [5]; occurs in both written and performance-based assessments [6, 7]; at different stages of the educational continuum [8, 9]; and is incompletely explained by prior attainment [8]. The reasons for this differential attainment are unclear but broadly could arise either because exam systems are biased against BME students or because social, psychological or cultural factors result in a lower average standard of performance by BME trainees. Understanding which effect is responsible is vital to successfully targeting interventions to ensure equality.

Research in other domains has repeatedly demonstrated humans’ susceptibility to unconscious bias when making judgements about individuals from negatively stereotyped groups [10]. Woolf et al.
[11] have described some medical educators in the UK holding stereotyped views of the performance of BME medical students, in which Asian students are conceived as often having good factual knowledge but poor communication skills. This is pertinent; stereotypes are readily activated when making judgements [12] and can bias the information a person attends to [13], the judgement they reach [14] and their memory of what occurred [15], with such memory bias serving to perpetuate stereotypes [16]. Stereotypes often influence judgements beyond conscious awareness [17] and are more likely to have an influence when judgements are mentally taxing [18], as is the case for examiners during medical exams [19]. If any such bias influences medical student exams, as well as influencing the scores that BME students receive, it could also result in provision of feedback that is more negative than white students receive. Prior research on assessment in medical education has shown that other judgemental biases (contrast effects) influence the strength of language used in feedback in a similar pattern to influences on scores [20]. As a result, if a stereotype bias occurs, we may expect to see its influence on the valence of feedback as well as on scores. Equally, due to recollection bias, feedback could focus on stereotyped aspects of performance, thereby distorting the conveyed message and creating a potential determinant of students’ self-efficacy and future learning strategies/performances. It is entirely plausible, therefore, that differential attainment could arise due to examiner bias derived from stereotyping of BME students.

Conversely, a number of retrospective analyses of exam data have gone some way to refuting an influence of examiner bias: Woolf et al. [6] found similar degrees of differential attainment by BME students on both machine-marked written exams and examiner-based performance exams; McManus et al.
[21] found that only 3/1790 examiners showed evidence of ethnic bias compared to other examiners observing the same candidates; Denney et al. [22] found that very few examiners appeared to favour candidates of their own sex or ethnicity. No prior studies have examined the potential of examiner bias under conditions of experimental control. In this study we sought to determine whether examiners show evidence of stereotype activation when examining BME students; whether stereotype activation is dependent on the student’s behaviour matching the described stereotype; whether examiners’ scores or feedback show any evidence of ethnicity bias; or whether examiners’ memory of performances suggests any influence of stereotypes on their judgements.

Research questions

1. Do examiners activate stereotypes relating to students’ ethnicity whilst judging students’ performances?
2. Is any such stereotype activation dependent on students’ performances matching described stereotypes?
3. Are the (a) scores, (b) valence of feedback, or (c) focus of feedback that examiners give to students’ performances influenced by students’ ethnicity?
4. Do examiners’ memories of students’ performances show evidence of stereotype bias?

To operationalise these questions, we chose to focus on “British Asian” students, defining this term as individuals with recent heritage from the Indian subcontinent who had been born or educated within the UK. We chose this group because they are the largest group of BME students in UK medical schools and because of the existence of academic literature describing a stereotype of this group within medical education [11].

Methods

Study design

We used a two-group, double-blinded, randomised Internet-based experimental design.

Participants, recruitment and consent

Participants were current UK undergraduate Objective Structured Clinical Exam (OSCE) examiners. Inclusion criteria were as follows: being a licensed doctor within the UK; having previously received training as an OSCE examiner; having examined a summative OSCE at a UK medical school within the last 2 years; being comfortable to assess both communication skills and knowledge. Recruitment was undertaken by email; medical schools around the UK disseminated the invitation to OSCE examiners. Interested individuals registered on the study website and received the Participants Information Sheet. Consent was obtained online via the study website prior to participation. Participants were offered a £20 shopping voucher for study completion. As prior knowledge of the study’s premise could have biased examiners’ responses, participants were blinded to the study intervention by the use of a deceptive premise that simply stated: “we’re interested in understanding more about how OSCE examiners make judgements on clinical performance when they are assessing OSCEs”, with an assurance that a fuller explanation would follow.

Measures, procedure and hypotheses

An overview of the study design and procedure is shown in Table 1. Details of the validation of the stimulus materials and measures are given in Additional file 1.

Scripted videos of OSCE performances of white and Asian medical students

We created videos of scripted medical student performances on a simulated OSCE station, in which a young woman attends her general practitioner to discuss a new diagnosis of type 1 diabetes mellitus. The scenario required the student to both demonstrate accurate knowledge of type 1 diabetes and to display empathy and good communication skills. The scenario is described in Additional file 1: Section 1.

Three separate scripts were created by a clinical educator (PY): one showed good factual knowledge and poor communication skills (K+/C–) (i.e. a performance consistent with a described stereotype of Asian students’ performance [11]); one showed poor factual knowledge and good communication skills (K–/C+) (i.e.
a performance inconsistent with a described stereotype of Asian students’ performance); and one showed a mixture of both good and poor knowledge and communication (mixed). The scripting was done by drawing from PY’s clinical knowledge and experience of assessing medical trainees at different stages of training and aimed to represent a plausible, authentic performance in an OSCE by an undergraduate medical student. Scripts were reviewed by a panel of six experienced clinical educators who scored the knowledge and communication that was displayed in each script. Details of their scores can be seen in Additional file 1: Section 2.

All scripts were performed and filmed twice: once by an actor who was white with a white British accent, and once by an actor who was Asian with a British Asian accent, giving a total of six performance videos. Four separate actors were involved: two men and two women. Both men performed in both the K+/C– and K–/C+ videos, but within groups participants saw one man for the first performance and the other man for the second performance. The women performed the two versions of the mixed performance. As a result, participants saw a different actor in each video. The similarity of the British Asian and white versions of each performance was judged by a panel of eight experienced clinical educators. All paired performances were judged to be at least “highly similar”. Details of this validation exercise can be seen in Additional file 1: Section 2.

Examiners were randomised to two groups by the study Internet site, using a random number generator with a variable maximum between-group discrepancy function. A stereotype-consistent group (Group A) saw Performance 1 (K+/C–) with an Asian student and Performance 2 (K–/C+) with a white student.
A stereotype-inconsistent group (Group B) saw Performance 1 with a white student and Performance 2 with an Asian student. In order to test the hypothesis relating to memory, both groups also saw the mixed performance, featuring an Asian student in Group A and a white student in Group B. To prevent order effects, the order in which performances were presented within groups was balanced, with equal numbers seeing them in order 1: K+/C–, K–/C+, mixed; and order 2: mixed, K–/C+, K+/C–.

Performance scoring

Prior to seeing the videos, examiners were provided with briefing material describing the OSCE scenario, desirable student behaviours, key case-related information and the mark sheet. This material is available in Additional file 1: Section 1.

Table 1 Overview of study design

Sequence of tasks: Instructions; Performances(a); Lexical decision task; Recollection; Demographics; Debrief

Group                      Performances(a)                   Recollection
                           K+/C–      K–/C+      Mixed       Mixed
Stereotype-consistent      Asian1     White1     Asian2      Asian
Stereotype-inconsistent    White1     Asian1     White2      White

(a) Order of performance was counterbalanced within groups (half of each group saw K+/C–, K–/C+, mixed; the other half saw mixed, K–/C+, K+/C–)
K+ good knowledge demonstrated in performance, K– poor knowledge demonstrated
C+ good communication demonstrated in performance, C– poor communication demonstrated

Examiners watched the three performances they were assigned online. After each performance, examiners scored the observed student on four domains (two relating to communication skills and two relating to factual knowledge) on 7-point rating scales end-anchored with “no elements done” and “all elements done well”. The two domains for each skill were collapsed to give average ratings of communication and knowledge for each participant for each performance.
Examiners also provided an overall global rating on a 7-point scale anchored with Fail (1, 2), Borderline (3), Pass (4), Good (5), Excellent (6, 7). Finally, examiners were asked to “provide up to three suggestions for improvement” as free text feedback. The scoring format was based on the standard format of OSCE mark sheets from one medical school which recruited participants, and will have been very familiar to these examiners. Whilst it may have been less familiar to other participants, it was similar to typical domain-based mark sheets.

We hypothesised that there would be a main effect of student ethnicity, with Asian students receiving lower scores than white students in both groups (Hypothesis 1).

Free text feedback

Free text feedback comments were segmented and analysed by content analysis. Each portion of feedback was segmented by a single researcher into pieces of feedback that were judged to contain a single concept. Each feedback segment was uniquely labelled, and then two researchers independently coded each segment for its focus (communication, factual-knowledge or general) and valence (positive, negative or neutral). Both researchers met repeatedly to discuss and develop a shared interpretation of the data. All analysis was done blind to study group and the ethnicity of the student to whom the feedback had been given. Agreement between the researchers was calculated using Cohen’s kappa. Remaining discrepancies were resolved through discussion prior to unblinding. Once the analysis was complete, the balance of focus was calculated for each student performance for each participant, by allocating a score of +1 to communication-focused segments, –1 to knowledge-focused segments and 0 to general segments, and then summing the segment scores. As a result, feedback with a positive score focused more on communication than knowledge and feedback with a negative score focused more on knowledge than communication.
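The balance-of-focus calculation just described can be illustrated with a minimal sketch. The function name, the dictionary of weights and the example codes below are hypothetical and are not taken from the study materials; only the +1 / –1 / 0 allocation and the summation come from the text.

```python
# Hypothetical weights mirroring the allocation described in the text:
# +1 communication-focused, -1 knowledge-focused, 0 general.
FOCUS_WEIGHTS = {"communication": 1, "knowledge": -1, "general": 0}


def balance_score(segment_codes, weights):
    """Sum the weights of the coded feedback segments that one examiner
    gave to one student performance."""
    return sum(weights[code] for code in segment_codes)


# Illustrative (invented) codes for one examiner's feedback on one performance.
example_codes = ["communication", "communication", "knowledge", "general"]
print(balance_score(example_codes, FOCUS_WEIGHTS))  # 1 (leans towards communication)
```

A different weights dictionary (for example positive, negative and neutral codes) could be passed to the same function, which is why the mapping is kept as a parameter.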
The same procedure was used for the valence of feedback by allocating positive segments +1, negative segments –1 and neutral segments 0. This resulted in a focus and valence score for the feedback given to each student performance by each participant.

On the basis that judgemental bias tends to have a correspondingly positive or negative influence on feedback, we hypothesised that the valence of feedback to Asian students would be comparatively negative compared to the valence of feedback to white students (Hypothesis 2a).

On the basis that examiners tend to focus feedback on areas of weak performance, we hypothesised that the focus of feedback to Asian students would incline more towards communication skills than the focus of feedback for white students (Hypothesis 2b).

Test of stereotype activation

After scoring three performances, examiners performed a lexical decision task to gain a measure of their mental activation of Asian stereotypes (or, put more simply, whether they had brought to mind a stereotype of “Asian-ness” whilst judging students’ performances). Lexical decision tasks are a well-established measure of stereotype activation within psychological research. Numerous previous studies have shown that when a stereotype is activated (for example by someone coming into contact with a person from a stereotyped group), concepts associated with that stereotype become more readily available in the mind of the person who experiences the stereotypical thoughts. As a result they tend to respond to stereotype-related concepts more quickly than neutral concepts [23]. Lexical decision tasks rely on the premise that when a stereotype has been mentally activated, people can respond more quickly to words associated with the stereotype than to neutral words. This enables detection of stereotype activation. The task consisted of 45 letter strings, of which 30 were words and 15 were non-words.
Of the 30 words, 15 were “stereotype words” (words associated with stereotypes of south-Asian people in the UK) and 15 were “neutral” (words that were unrelated to Asian stereotypes). All strings were presented in the same random order to both groups. These words, along with evidence supporting the validity of their “neutral” or “Asian” association, can be viewed in Additional file 1: Section 2. After one practice trial, examiners were asked to determine whether presented strings of letters were either a real word or place name in the English language, or a “non-word” (a string of letters with no meaning), by pressing “D” or “K” on the keyboard, respectively. Examiners were asked to work as quickly but as accurately as possible. To prevent demand characteristics, this was presented as a “test of concentration”; examiners were not made aware that some words were related to an Asian stereotype. All responses were timed locally, using the clock within the participant’s computer, thus negating effects of Internet bandwidth.

We hypothesised that participants would have faster response times to stereotypical words than neutral words (Hypothesis 3a).

We hypothesised that Group A (who had seen comparatively stereotype-consistent performances) would have faster response times to stereotypical words than Group B (who had seen comparatively stereotype-inconsistent performances) (Hypothesis 3b).

Test of memory

Following the stereotype activation task (approximately 5 minutes), examiners completed a recognition-based memory test. Examiners were asked to read 40 quotes ostensibly from the “mixed” performance video, and indicate whether they had occurred in the performance or were invented, by marking them as “true” or “false”. Of the presented statements 20 were accurate quotes from the mixed performance (real) and 20 did not appear in any of the three videos (invented).
Group A had seen the mixed performance played by an Asian female student, whilst Group B had seen the performance played by a white female student. For both the real and invented statements, half were consistent with the literature-based stereotype of Asian students’ performance (a balanced mixture of accurate factual knowledge and examples of poor communication), and half were inconsistent with the stereotype of Asian students’ performance (a balanced mix of inaccurate factual knowledge and examples of good communication). Evidence supporting the validity of these constructs is presented in Additional file 1: Section 2. The manipulated video presentation order meant that within each group exactly half of examiners saw the mixed video first and half saw the mixed video last, thereby balancing any effect of video order on memory across groups.

When stereotypes influence memory, they cause two opposite effects: statements which are real, but inconsistent with the stereotype, seem unexpected, making them more salient and increasing their rate of recognition; conversely, statements that are invented, but consistent with the stereotype, seem plausible, also increasing their rate of recognition [15]. Consistent with this, we compared (1) the proportion of real, stereotype-inconsistent responses marked “True” and (2) the proportion of invented, stereotype-consistent responses marked “True” between groups that had seen an Asian vs a white student for the mixed performance.

We hypothesised that examiners in Group A (stereotype-consistent group) would mark a higher proportion of real stereotype-inconsistent statements and invented stereotype-consistent statements as “True” than examiners in Group B (stereotype-inconsistent group) (Hypothesis 4).

Demographics and debrief

After completing all tasks, examiners provided demographic data including their own ethnicity using UK Office for National Statistics ethnicity categories.
Participants were asked to indicate in free text what they thought the study was testing, before being provided a description of the study’s premise. Repeat consent was then sought.

Analysis

Performance scores

We analysed scores using generalised linear modelling with generalised estimating equations (GLM GEE), with within-subject variables of performance (K+/C–, K–/C+, mixed) and student ethnicity. Co-variate analyses based on demographic data were performed to exclude confounding. These analyses were performed on overall scores, communication scores and knowledge scores, respectively. The influence of students’ ethnicity on the focus and valence of examiners’ feedback was then compared sequentially using GLM GEE, in a similar analysis to that used for scores, but with ‘focus’ and then ‘valence’ as the dependent variables.

As no prior data were available for power calculations, the study was powered based on interim examination of the groups’ standard deviations (without use of inferential tests) to determine how large a sample would be required to find a statistically significant difference of 0.35 out of 7.0 in scores on the “overall scores” measure. This difference would be similar to the difference in scores observed between BME students and white students in a communication-focused OSCE exam by Wass et al. [7] with a similar effect size to that seen in the meta-analysis by Woolf et al. [5].

Stereotype activation

We used repeated-measures analysis of variance (ANOVA) to compare participants’ mean response times for stereotype words and neutral words (within-subject variable) between groups (between-subject variable). Using
Usingthe procedure described by Mussweiler and Epstude [24],responses to individual target words were excluded if theywere either incorrectly identified (for example if a partici-pant indicated a non-word when the target was a word) orwere greater than 2 standard deviations (SD) from themean response time for that category of word (interpretedas erroneous responses or distraction). Median exclusionrates were compared between groups using the Mann-Whitney U test.RecollectionWe used two independent-group univariate ANOVAs tocompare (1) the proportion of real, stereotype-inconsistentstatements marked “True” by each group and (2) theproportion of invented, stereotype-consistent statementsmarked “True” by each group.Yeates et al. BMC Medicine  (2017) 15:179 Page 5 of 11ResultsParticipantsParticipants were recruited between November 2014 andJune 2015, and recruitment closed when the recruitmenttarget was achieved. A total of 181 examiners enrolled,and 159 completed the study. Responses by all partici-pants who completed the study were included in all ana-lyses. Completing participants came from 20 of the UK’s33 medical schools and from a broad range of clinicalspecialities. The majority of examiners were recruitedfrom 4 medical schools (a total of 93 out of 169). Theseare denoted A–D in Table 2. The remaining 16 schoolscontributed 7 or fewer participants each. Groups wereequal in size (Group A: 92 enrolled, 12 dropped out, 80completed vs Group B: 89 enrolled, 10 dropped out, 79completed). As enrolled participants could leave the web-site without giving reasons, no explanations were obtainedfor dropouts. The study groups were similar in all mea-sured demographics: year of qualification, years of OSCEexamining experience and frequency of OSCE examiningper year (see Table 2). Participants were predominantly ofwhite ethnicity but also included individuals from a rangeof other ethnicities. 
To facilitate baseline comparisons, ethnicities were grouped as “white”, “Indian subcontinent” and “other minority ethnic individuals”. Numbers of participants in each of these categories did not vary between groups. These data are also presented in Table 2. Examination of participants’ responses in the debrief phase indicated that only two participants (1.2% of all completed respondents) guessed the study’s true purpose; therefore, no participants were excluded, given that all consented to ongoing inclusion of their data.

Table 2 Comparison of participant characteristics between groups

Group A viewed stereotype-consistent performances; Group B viewed stereotype-inconsistent performances.

Characteristic                          Group A      Group B      Significance
Sex (frequency)                                                   p (chi sq.)
  Male                                  40 (51%)     29 (37%)     0.078
  Female                                39 (49%)     50 (63%)
Clinical speciality
  Anaesthetics                          7            2            0.09 (Fisher exact)
  Diagnostic specialities               5            2
  Hospital medicine                     24           31
  Surgery                               10           4
  Emergency medicine                    3            0
  Child health                          3            7
  Women’s health                        2            7
  General practice                      15           17
  Psychiatry                            4            3
  Non-hospital medical specialities     1            0
  Public health                         0            1
  Other                                 6            5
Medical school
  A                                     5            13           0.15
  B                                     20           21
  C                                     9            4
  D                                     10           12
  Others                                34           26
Participant ethnicity
  White                                 61           66           0.11 (Fisher exact)
  Indian subcontinent                   11           7
  Other minority ethnicity              6            1
  Prefer not to say                     2            5
Median                                                            p (Mann-Whitney U)
  Year of qualification                 1995         1992         0.75
  Years examining OSCEs                 5            5            0.47
  OSCEs/year                            2            2            0.79

Evidence that examiners activate mental stereotypes of students

As it is pertinent to the consideration of further results, we will present these data first. Participants’ responses to 8.2% of target words were excluded due to being incorrect or erroneous (> ± 2 SD from category mean); median exclusion rates showed no significant difference between groups (p = 0.147). Examiners’ response times to stereotype words (mean = 716 ms (95% CI 702–731 ms)) were faster than their response times to neutral words (769 ms (753–786 ms), F = 220.4, p < 0.001).
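The exclusion rule applied to these response times (dropping incorrect responses and responses more than 2 SD from the mean for that word category) can be sketched as follows. This is an illustration only: the function name and data layout are hypothetical, and the sketch assumes the mean and SD are computed over all trials passed in, which the article does not specify.

```python
from statistics import mean, stdev


def filter_responses(trials, n_sd=2.0):
    """Keep response times (ms) that were correct and lie within n_sd
    standard deviations of the mean for this word category.

    trials: list of (response_time_ms, was_correct) tuples for one category.
    Assumption: the mean and SD are taken over every trial supplied.
    """
    times = [rt for rt, _ in trials]
    centre, spread = mean(times), stdev(times)
    return [
        rt for rt, was_correct in trials
        if was_correct and abs(rt - centre) <= n_sd * spread
    ]


# Invented example: one very slow response and one incorrect response.
example_trials = [
    (700, True), (710, True), (720, True), (690, True), (705, True),
    (715, True), (695, True), (2000, True), (700, False),
]
kept = filter_responses(example_trials)
print(len(kept), "of", len(example_trials), "responses retained")
```

Mean response times per word category would then be computed from the retained responses before the repeated-measures comparison.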
No difference was observed, however, between groups in their response times: Group A, 750 ms (729–772) vs Group B, 735 ms (714–756 ms), p = 0.32, and the interaction of word type x group was also non-significant (p = 0.55). The same pattern of results was observed for both white and non-white participants. As a result, Hypothesis 3a was supported and 3b was refuted: examiners in both groups showed evidence of stereotype activation, regardless of whether the performances they had seen by Asian students had been stereotype-consistent or stereotype-inconsistent.

Scores

Score data consisted of 18 separate distributions: 3 performances x 3 measures (knowledge; communication; overall) x 2 groups. Domain scores followed the scripted discordant patterns: K+/C– knowledge mean = 5.6 (95% CI 5.5–5.8) and communication = 2.5 (2.3–2.6); K–/C+ knowledge mean = 3.0 (2.9–3.2) and communication mean = 5.7 (5.6–5.8); mixed knowledge mean = 3.1 (2.9–3.2) and communication mean = 3.6 (3.4–3.7). Knowledge scores (p < 0.001) and communication scores (p < 0.001) differed statistically significantly between performances. These data are shown in Table 3.

Influence of students’ ethnicity on examiners’ scores

Comparison of performance scores showed no difference due to student ethnicity. The average knowledge score when performances were acted by Asian students was 3.9 (95% CI 3.8–4.0) vs 3.9 (3.8–4.0) when acted by white students (p = 0.77). The average communication score when the performances were acted by Asian students was 3.9 (3.8–4.1) vs 3.9 (3.7–4.0) when acted by white students (p = 0.31). The average overall score when the performances were acted by Asian students was 3.1 (2.9–3.3) vs 3.1 (3.0–3.3) when acted by white students (p = 0.88). The scores for each measure on each performance by each group are shown in Table 3. Statistical examination for potential confounding effects of examiners’ sex or ethnicity showed that neither of these variables confounded the comparisons of interest.
The study had 81% power to detect a difference of 0.35 out of 7.0 on the assessment scale, equivalent to an effect size of d = 0.32. As a result, Hypothesis 1 was not supported, suggesting that the Asian stereotypes, which the lexical decision task suggested were activated among examiners, did not influence their scoring.

Table 3 Comparison of scores (knowledge, communication, overall scores) and feedback (focus and valence) by performance and group

Group A viewed stereotype-consistent performances; Group B viewed stereotype-inconsistent performances.

Performance              Group A                  Group B
Performance 1: K+/C–     Asian student            White student
  Scores
    Knowledge            5.6 (5.3–5.9)            5.6 (5.4–5.9)
    Communication        2.4 (2.2–2.6)            2.5 (2.3–2.7)
    Overall              3.2 (3.0–3.5)            3.1 (2.8–3.4)
  Feedback
    Focus                2.7 (2.5–3.0)            2.7 (2.4–2.9)
    Valence              –2.6 (–2.9 to –2.4)      –2.6 (–2.9 to –2.4)
Performance 2: K–/C+     White student            Asian student
  Scores
    Knowledge            3.0 (2.8–3.3)            3.0 (2.8–3.2)
    Communication        5.6 (5.5–5.8)            5.7 (5.6–5.9)
    Overall              3.4 (3.1–3.6)            3.3 (3.1–3.5)
  Feedback
    Focus                –0.3 (–0.5 to 0.0)       –0.4 (–0.7 to 0.0)
    Valence              –1.7 (–2.0 to –1.4)      –1.8 (–2.1 to –1.5)
Performance 3: mixed     Asian student            White student
  Scores
    Knowledge            3.1 (2.9–3.3)            3.1 (2.9–3.3)
    Communication        3.5 (3.3–3.7)            3.6 (3.4–3.8)
    Overall              2.9 (2.7–3.0)            2.8 (2.6–3.0)
  Feedback
    Focus                1.6 (1.3–1.9)            1.5 (1.2–1.9)
    Valence              –2.8 (–3.1 to –2.6)      –3.0 (–3.3 to –2.7)

All comparisons are non-significant

Influence of students’ ethnicity on examiners’ feedback

Agreement between the two analysts regarding how to categorise feedback statements was high (Cohen’s kappa for ‘Focus’ codings = 0.85; quadratic weighted Cohen’s kappa for ‘Valence’ codings = 0.92). The focus of feedback varied by performance, indicating that examiners generally focused their feedback on weaker areas of performance (see Table 3).
Students’ ethnicity had no influence on the focus of feedback, with both groups receiving more feedback comments on communication than on factual knowledge: Asian students, 1.3 (1.2–1.5) vs white students, 1.3 (1.1–1.5), p = 0.87. The interaction of performance by student ethnicity for feedback focus was also non-significant, p = 0.82. All performances received more negative than positive feedback. There was no influence of students’ ethnicity on the valence of feedback: Asian students –2.4 (–2.6 to –2.3); white students –2.4 (–2.6 to –2.3), p = 0.82. The interaction of performance by student ethnicity was also non-significant for feedback valence at p = 0.57. As a result, neither Hypothesis 2a nor Hypothesis 2b was supported, suggesting that the Asian stereotypes which examiners activated did not influence their provision of feedback.

Influence of students’ ethnicity on examiners’ memories for the mixed performance

Participants agreed with real statements more frequently than with invented statements (real 72.5% (70.6–74.4%); invented 32.7% (30.8–34.5%), F = 871.3, p < 0.001). Statements that are real but stereotype-inconsistent are theoretically expected to be more memorable if a stereotype has influenced memory, as they seem unexpected, and so achieve increased saliency; no between-group difference occurred in recognition rates of such statements: Group A (recalling an Asian student on the mixed performance) 75% (72–79%); Group B (recalling a white student on the mixed performance) 78% (75–82%), F = 1.63, p = 0.20. Invented statements that are stereotype-consistent are theoretically expected to be endorsed more often if stereotypes influence judgements because they seem particularly plausible. No between-group difference occurred in these statements: Group A (recalling an Asian student on the mixed performance) 29% (24–33%); Group B (recalling a white student on the mixed performance) 32% (27–36%), F = 0.84, p = 0.36. Recognition data are shown in Table 4.
As a result, Hypothesis 4 was not supported, suggesting that the activated Asian stereotypes examiners demonstrated via the lexical decision task did not influence their memories of performances.

Discussion

Summary of results

For the first time in a double-blinded, randomised, controlled study, we have compared the influence of students’ ethnicity (white vs British Asian) on (1) the scores and feedback that OSCE examiners give to simulated undergraduate student OSCE performances and (2) examiners’ cognitive processing of those performances, including their recollection accuracy and activation of an Asian stereotype when examining Asian students. Examiners showed evidence of stereotype activation (either reflecting a generalised activation or activation induced by exposure to the Asian students in our stimuli), regardless of whether Asian students’ performances were consistent with a described stereotype [11]. Despite this, we found no effect of students’ ethnicity on the scores that were assigned; the valence or focus of the feedback that students received; or any evidence of bias in the recollections of performances by examiners. Our findings are partly consistent with both the extensive literature in social psychology, which describes the prevalence of stereotyping in judgements, and the literature in medical education, which has suggested that examiner bias is not responsible for differential attainment by BME students. The fact that this study and previous uncontrolled observational field studies [5, 9, 22] have found consistent results by different methods helps to support this conclusion.

Practical and theoretical implications

A recent prominent legal case in the UK (see [25]) has reaffirmed an important principle of equality law: that the absence of direct discrimination in an exam system does not mean educational organisations are absolved of responsibility for differential attainment.
Instead, the education system is responsible for ensuring equality of opportunity for minority groups by addressing indirect discrimination and providing reasonable support. The average underperformance of BME students and training-grade doctors in UK medical exams is a robust observation that should not be ignored. Whilst a single study cannot definitively exclude the possibility of examiner bias in all circumstances, this study tends to suggest that efforts to address differential attainment should focus on factors which either reduce learning opportunities or hinder performance for BME students. A variety of potential avenues could be explored. For example, Vaughan et al. [26] showed that the social networks of BME students in UK medical schools may produce a relative disadvantage in creating and accessing educational opportunities.

Table 4 Proportions of each statement type marked as true, by group, within the recognition test of memory

                               Proportion of statements marked “True” (95% CIs)
Statement type                 Group A (recalling        Group B (recalling
                               Asian student)            white student)
Real statements
  Stereotype-consistent        69% (65–73%)              68% (64–71%)
  Stereotype-inconsistent *    75% (71–79%)              78% (75–81%)
Invented statements
  Stereotype-consistent *      29% (24–33%)              32% (27–36%)
  Stereotype-inconsistent      34% (31–36%)              37% (34–40%)

Theorised comparisons (marked *; boldface in the original) are non-significant

Burgess et al. [27] have suggested that “stereotype threat” (a process whereby members of a stereotyped group are unconsciously hindered in their performance by awareness of the stereotype) could account for differential attainment by BME students. Woolf et al. [28] have described that BME doctors in the UK can experience poor relationships with seniors and problems fitting in, which can in turn lead to fewer learning opportunities, lower confidence, and an increased chance of mental health problems than experienced by their white counterparts.
All of these factors have the potential to reduce performance. As a result, efforts to enhance equality should clarify these mechanisms and design interventions to address them.

It is important not to be unduly reassured by our findings: whilst examiners did not show bias in their scores, feedback or recollections, they did show evidence of stereotype activation. This is consistent with the prior findings of Woolf et al. [11] that medical educators stereotype BME students to some degree. Stereotype activation is understood to be a separate cognitive process from stereotype application; or more simply, a stereotype can come to mind during a judgement process without influencing the decision that is reached [17]. People are known to be comparatively resistant to stereotype application if they are strongly motivated to avoid prejudice [29], or if they are motivated to achieve accuracy in the task they are performing [30]. It is notable that by not showing an effect of stereotype bias, these findings are at odds with a significant body of research in social psychology. The reasons for this difference are unclear. Differences in study population may offer some explanation. Many social psychology studies have used undergraduates or members of the public, whereas this study recruited qualified doctors. It could be that doctors’ sense of professionalism as examiners or experience of working with trainees from British Asian backgrounds may produce a greater tendency to individuate and thereby resist stereotyping. All such explanations of this difference must, however, be viewed as speculative at this stage, and further work is needed to replicate and understand this difference. At a practical level, it is important to note that whilst these findings help to reassure us that no systematic bias exists in examiners’ judgements, some individual examiners may still exhibit bias (as was indicated by the previously described study by McManus et al. [21]).
Equality and diversity training of OSCE examiners has been a mandatory component of assessment in medical education for several years [31]; understanding how such training might have influenced the stereotypes which examiners hold or their motivation to resist applying them to judgements is important to continued efforts to enhance equality.

Limitations

This study used an adequately powered, double-blinded, randomised, controlled methodology to determine the influence of students’ ethnicity on examiners’ judgements. Consequently, we assert that the study has strong internal validity to address the stated research questions. Despite this, the study has some limitations. The study was (necessarily) conducted in a simulated context, rather than in a real OSCE. It is possible that the pressure of examining in real life could make examiners more vulnerable to stereotyping than they were in this study. The fact that this study is consistent with other observational studies is reassuring in this respect. We compared white students with British Asian students; it is not possible to exclude the possibility that an effect could arise with other ethnic groups. Examiners in the study only judged three video performances, whereas in real OSCEs examiners may judge a much larger number of performances between breaks. Stereotypes are known to be more influential when individuals’ cognitive resources are depleted [17]; we cannot exclude the possibility that examiners’ judgements could be influenced by students’ ethnicity after a more prolonged series of performances due to (for example) fatigue or lapses in concentration, or that different samples of performance (displaying a different range of behaviours by Asian students) could produce an effect.
Lastly, as we did not have a control condition in which participants performed the LDT without watching the videos, we cannot definitively claim that participants activated an Asian stereotype specifically in response to the videos rather than that the activation observed was already present upon beginning the study. These results support the central conclusion that examiners had a stereotype which was active at the time of judging performances, but which does not appear to have been applied to their judgements. However, it is possible that stereotype-related differences in judgement are induced only when the behaviours of the individuals being judged contribute directly to further stereotype activation.

Recommendations for future research

As with all research, this study would benefit from replication by independent groups in other contexts, and using student performances derived from other minority ethnic groups, to determine the generalisability and repeatability of these findings. Further studies with a longer series of performances would help to exclude the possibility of the fatigue-related effect posited above, whilst more investigation is needed of the stereotypes which medical educators appear to possess to understand whether they influence educators’ interactions with BME students in other circumstances. Further research should focus on understanding how BME students’ learning or performance may be disadvantaged within medical education, and whether effective interventions can be developed to ensure equality.

Conclusions

In this study we have shown that whilst OSCE examiners exhibited evidence of mental stereotypes when examining ethnic minority students, they revealed no evidence that students’ ethnicity (British Asian vs white) had any influence on the scores or feedback that examiners gave to performances. Nor did students’ ethnicity appear to influence examiners’ recollections of performance.
Future efforts to address differential attainment by BME students may, therefore, be best directed at understanding detrimental influences on their learning or performance and developing interventions to ensure equality within the learning environment.

Additional file

Additional file 1: Section 1. Case material and scoring format used in simulated OSCE stations. Section 2. Validation results for stimulus materials. (DOCX 20 kb)

Acknowledgements

We would like to thank Professor Jim Crossley, University of Sheffield, for his thoughtful comments on the manuscript and assistance with recruitment to the study. We would like to thank Mike Eastwood for his assistance with qualitative analysis. We would like to thank all of the medical schools that helped with recruitment, and all of the study’s participants.

Funding

The study was funded by a Starter Grant for New Lecturers from the Academy of Medical Sciences which was awarded to PY. The Academy of Medical Sciences reviewed a funding application which described the planned research. They had no involvement in the design of the study, data collection, analysis, interpretation or writing of the manuscript.

Availability of data and materials

The datasets generated and/or analysed during the current study are not publicly available due to limitations imposed on the use of data by the ethics committee, but they are available from the corresponding author on reasonable request.

Authors’ contributions

PY led the design of the study; material development and validation; data collection, analysis and interpretation; and was a major contributor to drafting the manuscript. KW contributed to the design of the study; material development and validation; data collection, analysis and interpretation; and was a major contributor to drafting the manuscript. EB contributed to material development and validation; and data collection and interpretation, and to drafting the manuscript.
BD contributed to data collection, analysis and interpretation, and to drafting the manuscript. MB contributed to data collection and interpretation, and to drafting the manuscript. KE contributed to the design of the study; material development and validation; and data collection, analysis and interpretation; and was a major contributor to drafting the manuscript. All authors approved the final version of the manuscript.

Ethics approval and consent to participate

Ethical approval was provided by the University of Manchester Ethics committee, reference 14131. Patients were not included in the conduct or the design of the study. Consent was provided by participants online via the Internet site before participating. Participants were asked to indicate whether they continued to consent at the end of the study.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author details

1 Medical Education Research, School of Medicine, David Weatherall Building, Keele University, Newcastle under Lyme ST5 5BG, UK.
2 Acute and Respiratory Medicine at Pennine Acute Hospitals NHS Trust, Bury, UK.
3 University College London Medical School, University College London, London, UK.
4 Division of Medical Education, University of Manchester, Manchester, UK.
5 Central Manchester University Hospitals NHS Foundation Trust, Manchester, UK.
6 North Devon Healthcare NHS Trust, Barnstaple, UK.
7 School of Medicine, Dentistry and Biomedical Sciences, Queens University Belfast, Belfast, UK.
8 Centre for Health Education Scholarship, Faculty of Health, University of British Columbia, Vancouver, Canada.

Received: 24 May 2017 Accepted: 11 September 2017

References

1. Haq I, Higham J, Morris R, Dacre J. Effect of ethnicity and gender on performance in undergraduate medical examinations. Med Educ. 2005;39(11):1126–8.
2. Stegers-Jager KM, Steyerberg EW, Cohen-Schotanus J, Themmen APN. Ethnic disparities in undergraduate pre-clinical and clinical performance. Med Educ. 2012;46(6):575–85.
3. Kleshinski J, Khuder SA, Shapiro JI, Gold JP. Impact of preadmission variables on USMLE step 1 and step 2 performance. Adv Health Sci Educ. 2009;14(1):69–78.
4. Kay-Lambkin F, Pearson S-A, Rolfe I. The influence of admissions variables on first year medical school performance: a study from Newcastle University, Australia. Med Educ. 2002;36(2):154–9.
5. Woolf K, Potts HWW, McManus IC. Ethnicity and academic performance in UK trained doctors and medical students: systematic review and meta-analysis. BMJ. 2011;342:d901.
6. Woolf K, Haq I, McManus IC, Higham J, Dacre J. Exploring the underperformance of male and minority ethnic medical students in first year clinical examinations. Adv Health Sci Educ. 2008;13(5):607–16.
7. Wass V, Roberts C, Hoogenboom R, Jones R, Van der Vleuten C. Effect of ethnicity on performance in a final objective structured clinical examination: qualitative and quantitative study. BMJ. 2003;326(7393):800–3.
8. McManus IC, Woolf K, Dacre J. The educational background and qualifications of UK medical students from ethnic minorities. BMC Med Educ. 2008;8:21. https://bmcmededuc.biomedcentral.com/articles/10.1186/1472-6920-8-21.
9. Dewhurst NG, McManus C, Mollon J, Dacre JE, Vale AJ. Performance in the MRCP(UK) Examination 2003-4: analysis of pass rates of UK graduates in relation to self-declared ethnicity and gender. BMC Med. 2007;5:8. https://bmcmedicine.biomedcentral.com/articles/10.1186/1741-7015-5-8.
10. Greenwald A, Banaji M. Implicit social cognition: attitudes, self-esteem, and stereotypes. Psychol Rev. 1995;102(1):4–27.
11. Woolf K, Cave J, Greenhalgh T, Dacre J. Ethnic stereotypes and the underachievement of UK medical students from ethnic minorities: qualitative study. BMJ. 2008;337:a1220.
12. Macrae CN, Bodenhausen GV. Social cognition: thinking categorically about others. Soc Cogn. 2000;51:93–120.
13. Bodenhausen GV, Todd AR. Social cognition. Cogn Sci. 2010;1:160–71.
14. Bodenhausen GV, Wyer RS. Effects of stereotypes on decision making and information-processing strategies. J Pers Soc Psychol. 1985;48(2):267–82.
15. Stangor C, McMillan D. Memory for expectancy-congruent and expectancy-incongruent information: a review of the social and social developmental literatures. Psychol Bull. 1992;111(1):42–61.
16. Fyock J, Stangor C. The role of memory biases in stereotype maintenance. Br J Soc Psychol. 1994;33(3):331–43.
17. Kunda Z, Spencer SJ. When do stereotypes come to mind and when do they color judgment? A goal-based theoretical framework for stereotype activation and application. Psychol Bull. 2003;129(4):522–44.
18. Macrae CN, Milne AB, Bodenhausen GV. Stereotypes as energy-saving devices: a peek inside the cognitive toolbox. J Pers Soc Psychol. 1994;66(1):37–47.
19. Tavares W, Eva KW. Impact of rating demands on rater-based assessments of clinical competence. Educ Prim. 2014;25(6):308–18.
20. Yeates P, Cardell J, Byrne G, Eva KW. Relatively speaking: contrast effects influence assessors’ scores and narrative feedback. Med Educ. 2015;49:909–19.
21. McManus IC, Elder AT, Dacre J. Investigating possible ethnicity and sex bias in clinical examiners: an analysis of data from the MRCP(UK) PACES and nPACES examinations. BMC Med Educ. 2013;13:103. https://bmcmededuc.biomedcentral.com/articles/10.1186/1472-6920-13-103.
22. Denney ML, Freeman A, Wakeford R. MRCGP CSA: are the examiners biased, favouring their own by sex, ethnicity, and degree source? Br J Gen Pract. 2013;63(616):718–25.
23. Sinclair L, Kunda Z. Reactions to a black professional: motivated inhibition and activation of conflicting stereotypes. J Pers Soc Psychol. 1999;77(5):885–904.
24. Mussweiler T, Epstude K. Relatively fast! Efficiency advantages of comparative thinking. J Exp Psychol Gen. 2009;138(1):1–21.
25. BAPIO vs RCGP and GMC [Internet]. 2014. http://www.rcgp.org.uk/news/2014/may/~/media/Files/News/Judicial-Review-Judgment-14-April-2014.ashx. Accessed 19 Sept 2017.
26. Vaughan S, Sanders T, Crossley N, O’Neill P, Wass V. Bridging the gap: the roles of social capital and ethnicity in medical student achievement. Med Educ. 2015;49(1):114–23.
27. Burgess DJ, Warren J, Phelan S, Dovidio J, van Ryn M. Stereotype threat and health disparities: what medical educators and future physicians need to know. J Gen Intern Med. 2010;25(2):169–77.
28. Woolf K, Rich A, Viney R, Rigby M, Needleman S, Griffin A. Fair training pathways for all: understanding experiences of progression. 2016. http://www.gmc-uk.org/2016_04_28_FairPathwaysFinalReport.pdf_66939685.pdf. Accessed 19 Sept 2017.
29. Moskowitz GB, Li P. Egalitarian goals trigger stereotype inhibition: a proactive form of stereotype control. J Exp Soc Psychol. 2011;47(1):103–16.
30. Kunda Z. The case for motivated reasoning. Psychol Bull [Internet]. 1990;108(3):480–98.
31. General Medical Council. Assessment in undergraduate medical education [Internet]. 2009. http://www.gmc-uk.org/Assessment_in_undergraduate_medical_education___guidance_0815.pdf_56439668.pdf. Accessed 19 Sept 2017.

