
Medical Teacher, Vol. 28, No. 6, 2006, pp. 535–543

Assuring the quality of high-stakes undergraduate assessments of clinical competence

CHRIS ROBERTS1, DAVID NEWBLE2, BRIAN JOLLY3, MALCOLM REED4 & KINGSLEY HAMPTON4

1Office of Teaching and Learning in Medicine, University of Sydney, Australia; 2Department of Medical Education, Flinders University, Adelaide, Australia; 3Centre for Medical Health Sciences Education, Monash University, Melbourne, Australia; 4Sheffield Teaching Hospitals Trust and the University of Sheffield, Sheffield, UK

Correspondence: Chris Roberts, Associate Professor in Medical Education, Office of Teaching and Learning in Medicine, Faculty of Medicine (A27), Edward Ford Building, University of Sydney, NSW 2006, Australia. Tel: +61 2 9036 9453; fax: +61 2 9351 6646; email: chris.roberts@med.usyd.edu.au

ABSTRACT In the UK, and in many Commonwealth countries, a university degree is accepted by registration bodies as an indication of competence to practice as a PRHO or intern. Concerns have been raised that the quality of university examinations may not always be sufficient for such high-stakes decision-making. Assessments of clinical competence are subject to many potential sources of error. The search for standardization, and high validity and reliability, demands the identification and reduction of measurement errors and biases due to poor test design or variation in test items, judges, patients or examination procedures. Generalizability and other research studies have identified where the likely sources of error might arise and have been taken into account in the development of published guidelines on international best practice, which institutions should strive to follow. The purpose of this paper is to describe the development of the integrated final-year assessment of clinical competence at the University of Sheffield. The aim was to introduce a range of strategies to ensure the examination met the best practice guidelines. These included blueprinting the assessment to achieve a high degree of content validity; lengthening the examination by adding a written component to the OSCE component to ensure an adequate level of reliability; providing training and feedback for examiners and simulated patients; paying attention to item development; and providing statistical information to assist the examination committee in standard setting and decision-making. This evidence-based approach should be readily achievable by all medical schools.

Practice points

- There has been increasing awareness that summative assessments used for high-stakes decision-making should demonstrate a high degree of validity and reliability.
- Research has shown that best practice is not always achieved in university-run clinical competence examinations.
- A set of best practice guidelines has been developed by an international group of experts in assessment, which outlines the principles and steps required to ensure such procedures are defensible to both internal and external scrutiny.
- A description of how an institution has developed and introduced a defensible high-quality assessment in practice, using its own resources as opposed to a national institution such as the NBME, should be of interest to all in medical schools with the responsibility of certifying the competence of their graduates.

Background

Over a century ago, concerns about the quality of medical degrees in North America led to the establishment of independent national examinations, such as those conducted by the National Board of Medical Examiners (NBME), which all students have to pass in order to obtain a license to practice. In the UK, and many Commonwealth countries, the award of a degree still certifies graduates as competent to practice as a Pre-registration House Officer (PRHO), or intern, having achieved the attributes defined in the General Medical Council's (GMC) document Tomorrow's Doctors (GMC, 1993, 2003). The accreditation process of the GMC and the university system of Visiting Examiners are seen as providing a guarantee of standards (Newble, 2001).

There is increasing disquiet as to whether the quality of university assessments in the UK is good enough to properly reassure the public that PRHOs will be competent regardless of their medical school of qualification (Fowell & Bligh, 2001). A recent survey concluded that educationally undesirable assessment methods and practices are still used by many medical schools, which in part appeared to be due to an apparent lack of knowledge of the technical aspects of assessment, including the application of robust approaches to standard setting (Fowell et al., 1998). Consequently, some medical schools may be making pass/fail decisions on students' fitness to practice based on assessments that are prone to error. This is worrying in the light of evidence that students who are on the borderline of pass/fail decision-making may graduate to become doctors whose clinical performance will give cause for concern (Tamblyn et al., 1998). Additional emphasis on this issue has come from a national expectation that the public will be protected from incompetent doctors.
While the main media and political attention has focused on the performance of doctors in practice (Bristol Royal Inquiry Report, 2001; Southgate et al., 2001; Shipman Inquiry, 2005), universities are obliged to play their part by ensuring they graduate only those students who have reached the required level of competence.

Dissatisfaction with the present situation in the UK has led, in some quarters, to calls for the establishment of a national medical qualifying examination similar to the National Board of Medical Examiners (NBME) United States Medical Licensing Examination (Patel, 2001; Wakeford, 2001; Bligh, 2003). However, the clinical competence examinations developed by the NBME are complex and expensive, rely heavily on simulated patient-based (SP) ratings, and have to be undertaken at national testing centres. Their quality assurance system (Boulet et al., 2003) depends on the application of psychometric techniques available only within a few UK research-based centres.

In this paper, we describe an integrated final-year assessment of clinical competence, and the accompanying quality assurance mechanisms, which have been developed over the last six years at the University of Sheffield. These ensure that the examination is valid, reliable and feasible to conduct in a university setting so that student scores are an accurate measure of their true ability and reasonably free from measurement error. We expect the principles we demonstrate should be of interest to those in medical schools with the responsibility of certifying the competence of their graduates.

Principles of examination development

There has been increasing awareness that summative assessments used for high-stakes decision-making should demonstrate a high degree of validity and reliability. To assist in this process a set of best practice guidelines has been developed by an international group of experts in assessment outlining the principles and steps required to ensure procedures are defensible to both internal and external scrutiny (Newble et al., 1994). There are four broad areas for an examination committee to consider. These are: blueprinting to ensure content validity; selection of best test formats; applying strategies to achieve adequate levels of reliability; and instituting appropriate standard setting and decision-making procedures.

Blueprinting

Blueprinting is the process whereby an adequate and representative sample of items to be included in the examination is determined. In an undergraduate assessment of clinical competence, the blueprint provides a grid or matrix, which maps the content of the examination against the learning outcome objectives of the course (Newble, 2002). It is the process by which the content validity of the test is established.

Test format

No one single test format is able to assess all aspects of clinical competence (Wass et al., 2001). The Objective Structured Clinical Examination (OSCE) provides a format particularly suitable for assessing many components of clinical competence, especially clinical, technical and practical skills, often with a high degree of fidelity (Newble, 2004). The testing of relevant knowledge, including aspects of diagnosis, investigation and management, can be more feasibly (and cheaply) tested with written formats. However, not all aspects of clinical competence can be validly tested in an examination setting and some need to be part of summative in-course assessments, for example professional behaviours, which are often not formally assessed at undergraduate level.

Reliability

The search for standardization in clinical competence assessments demands the identification and reduction of any measurement error or biases due to variation in test items, examiners, patients or examination procedures that might affect the observed scores of individual examinees. Generalizability studies have been highly influential in providing guidance for identifying where the largest sources of measurement error might lie and suggesting ways of refining assessment procedures in order to enhance reliability. For example, the common finding of lack of correlation of scores across cases in clinical competence assessments is often referred to as content specificity (van der Vleuten, 1996). Studies have shown that content specificity is the major contributor to unreliability, much more so than unreliability attributable to inconsistencies in marking. This means that testing competences across a large sample of cases has to be done before a reliable generalization as to student performance can be made. This is the reason that OSCE examinations have to be quite long to be reliable, irrespective of the effectiveness of structuring marking sheets and examiner training in reducing error. It also means that multiple inputs will have to be collected and collated where in-course assessments are used for summative decision-making (van der Vleuten, 2000).

Standard setting and decision-making

Standard setting refers to the procedures applied to assessments which establish the borderline between those students who pass or are considered competent, and those who should fail or be deemed incompetent (Norcini, 2003). A number of methods have been used to set the standards for written and clinical examinations, which are usually classified into relative and absolute. For assessments of competence, the latter are the most appropriate. All such methods require the input of a number of credible judges who contribute, in various ways, to establishing the intended standard to be applied to the results. The standard is an important, but not the only, contribution to the approach an examination committee might use to make the final decision on who should pass or fail. Whatever the procedures, they should be carefully documented and made available to students, staff and to external reviewers.

Development of the final year assessment of clinical competence

Changes to clinical competence assessments at Sheffield medical school occurred progressively over a number of years as part of the development of a comprehensive assessment strategy.
They included modifications to the structure and format of assessments throughout the course to make them both integrated and standardized. This allowed the quality assurance systems to be developed and to be embedded in the school at both academic and administrative levels. In particular, academics and students gained increasing confidence that the new examination procedures would function as intended and that the resulting decision-making processes were acceptable to the examination committee, the external examiners, the university, the students, and the Postgraduate Dean responsible for PRHO placements and training.

Blueprinting

The learning outcome objectives for clinical competence are defined within a core curriculum. These objectives have been related to 95 clinical problems which graduates are likely to meet as PRHOs (Newble et al., 2005). They are available to both staff and students in a searchable database accessed via Minerva, the school's networked learning environment (Roberts et al., 2003). This information forms the content of the core curriculum pertaining to clinical competence and the underpinning medical sciences. The aim of the final-year assessment of clinical competence is to examine an appropriate sample of this curricular database. The blueprint for the final examination has the competences on one axis (e.g. history-taking, communication skills, practical skills, management etc.) and the problem content on the other axis, with problems allocated as appropriate under body systems, some therefore being represented in more than one system.

A multi-disciplinary team of medical teachers, including experts in assessment and psychometrics, has developed an agreed sampling procedure based on the blueprint and information from prior examinations. Items are then requisitioned or accessed from a bank of items used previously, or developed from workshops as part of the examiner training system. New items are reviewed by the team. The proposed written and clinical components of the examination are then sent for ratification to the external examiners, usually clinicians who are conversant with the best-practice assessment principles we endorse.

Test format

An OSCE was first introduced as the clinical component of the final-year examination in June 1999. The short (five-minute) station format was adopted, as this is the approach being used in most medical schools in the UK (Newble, 2002) and by many licensing bodies such as the GMC (Tombleson et al., 2000) and the Medical Council of Canada (Reznick et al., 1996). The initial reliability for this 16-station examination was 0.64 (Cronbach's alpha), which was not considered adequate. However, it was unfeasible to provide the four or more hours of clinical examination time, requiring at least 40 stations, that generalizability studies predict would be required to achieve a reliability greater than 0.8 for high-stakes summative assessment (Newble & Swanson, 1988). As an alternative strategy, in 2003, the OSCE was combined with a written component consisting of modified essay questions (MEQs). These, in effect, were almost identical in format to some of the non-patient-based OSCE stations assessing the ability to interpret investigational data such as imaging and ECGs. All stations and written questions were still focused on clinical tasks determined in advance from the same blueprint. This now provided a test sample of nearly 40 items.

The examination is taken in two stages, a week apart. The written component is undertaken first in an examination hall. Each MEQ, and each of the written OSCE items within the clinical component of the examination (referred to as static stations), is marked anonymously by a single examiner using structured answer sheets. The clinical component is run in a hospital or clinical skills centre setting. There are four separate OSCE venues, each able to conduct four circuits in the one day. Students rotate through the circuits in groups of 13–15, depending on the cohort size. While generalizability studies have suggested only one examiner need be used for scoring each OSCE station without significantly affecting examination reliability (van der Vleuten, 1996), enough examiners had been recruited through our training programme to double up for added fairness and reassurance to the students. It adds to collegiality and to faculty development (e.g. less experienced examiners can be teamed with more experienced examiners). In the interests of examination security, students are corralled so there is no contact between students who have taken the examination and those about to take it (Swanson et al., 1995).

At clinical stations, a mixture of simulated and real patients is used. Simulated patients are drawn from a pool of volunteers and actors, managed by a coordinator. They are trained on their clinical scenario by clinical skills staff and senior clinical examiners prior to the examination. Real patients with well-established physical signs are involved in at least three of the physical examination stations. One physical examination station is a 10-minute station, accorded double marks, allowing time for students to complete a more extensive task.

The ways in which the refinements to the structure of the whole examination have been introduced during the period 2001–04 are given in Table 1.

Table 1. Changes in the number of items in the clinical competence examination in the years 2001 to 2004.

              2001                2002                2003                2004
              Items  Total marks  Items  Total marks  Items  Total marks  Items  Total marks
Clinical       9     200          12     260          10     220          10     220
Static         3      30           3      30           3      30           3      30
Written        5      50           5      50          26     260          26     260
Overall       17     280          20     340          39     510          39     510
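As an editorial illustration rather than part of the original paper, the Spearman-Brown prophecy formula offers a quick check on this lengthening strategy: starting from the quoted alpha of 0.64 for 16 stations and treating all items as comparable, it predicts that roughly 36 items would be needed to reach 0.8, consistent with the "nearly 40 items" provided by the combined 2003 examination. A minimal sketch (function names are illustrative):

```python
def spearman_brown(reliability_old: float, length_factor: float) -> float:
    """Predicted reliability when a test is lengthened by `length_factor`
    with items of comparable quality (Spearman-Brown prophecy formula)."""
    return (length_factor * reliability_old) / (1 + (length_factor - 1) * reliability_old)

def length_factor_needed(reliability_old: float, reliability_target: float) -> float:
    """How many times longer the test must be to reach the target reliability."""
    return (reliability_target * (1 - reliability_old)) / (reliability_old * (1 - reliability_target))

# Figures quoted in the text: 16 stations with alpha = 0.64, target above 0.8.
k = length_factor_needed(0.64, 0.80)              # 2.25
print(round(k, 2), round(16 * k))                 # 2.25 36  -> "nearly 40 items"
print(round(spearman_brown(0.64, 39 / 16), 2))    # ~0.81 if 39 comparable items are used
```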


Table 2. Typical structured OSCE mark sheet showing marking scheme.

                                            Performed      Performed but not    Not performed
                                            competently    fully competent      or incompetent
Initial approach to the patient                 2                 1                   0
  (introduces him/herself, explains
  what he/she will be doing)
Periodicity
  Previous episodes                             1                 0.5                 0
  Single or multiple                            1                 0.5                 0
Associated symptoms
  Loin pain                                     1                 0.5                 0
  Abdominal pain                                1                 0.5                 0
  Symptoms of cystitis                          1                 0.5                 0
Drug history                                    1                 0.5                 0
Social history
  Smoking                                       1                 0.5                 0
  Occupation                                    1                 0.5                 0
Diagnosis
  Bladder cancer                                1                 0.5                 0
  Kidney cancer                                 1                 0.5                 0
Investigations
  MSU                                           0.5               0                   0
  U & Es                                        0.5               0                   0
  Imaging (IVU, USS)                            1                 0.5                 0
  Cystoscopy                                    1                 0.5                 0
Overall approach to task                        5   4   3   2   1   0
Total (max 20)
Overall rating of station                       Clear Fail / Borderline / Clear Pass

Scoring and standard setting

Examiners receive details of the station they will mark several days prior to the examination. For the written component and static OSCE stations, questions and the marking sheet reflect the aim of assessing relevant knowledge and clinical problem-solving with particular emphasis on diagnosis, investigations and patient management. For the observed stations in the clinical component, a briefing on the day reinforces the rules about the marking system and the standard-setting procedure. On the examiner marking sheet, the checklist items reflect the relevant history, physical examination or practical skills the student should obtain or perform. There are generally 8–12 checklist items for each case (see Table 2). Checklist items are weighted to avoid trivialization. Candidates are rated on each item as: having not performed or having demonstrated incompetence on the item (awarded no marks); having performed the item but not to the required level of competence (awarded half the marks for the task); or having performed the item competently to the level of a starting PRHO (awarded the full mark for this item). Additionally, for both written and observed questions/stations, examiners are asked to provide a global rating, independently of their checklist scores, as to whether in their opinion the student had passed, failed or was borderline for competence on her/his performance of the overall task. This information is required for the standard-setting procedure.

Whilst various standard-setting methods are used for written and clinical examinations (Norcini, 2003), the borderline method was adopted (Smee, 2001; Wilkinson et al., 2001). In this approach, the borderline score for each question/station was calculated as the median score of all students identified as borderline by the examiners. At OSCE stations, where there were two examiners, their marks were averaged. The overall borderline score for the whole examination comprises the sum of the borderline scores for all questions/stations (Table 3).
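A minimal sketch of the borderline group calculation just described, using invented scores and ratings (the function and variable names are illustrative, not taken from the paper): paired examiner marks are averaged, the median checklist score of candidates globally rated borderline becomes the station's borderline score, and the station scores are summed for the whole examination.

```python
from statistics import median, mean

def station_borderline_score(checklist_scores, global_ratings):
    """Median checklist score of candidates whose global rating was 'borderline'.
    checklist_scores: per-candidate score for this station (examiner pairs already averaged).
    global_ratings:   per-candidate global rating ('fail', 'borderline' or 'pass')."""
    borderline = [s for s, g in zip(checklist_scores, global_ratings) if g == "borderline"]
    return median(borderline)

# Hypothetical station with two examiners per candidate.
examiner_a = [14.0, 9.5, 17.0, 11.0, 12.5]
examiner_b = [15.0, 10.5, 16.0, 12.0, 11.5]
ratings    = ["pass", "borderline", "pass", "borderline", "borderline"]

averaged = [mean(pair) for pair in zip(examiner_a, examiner_b)]   # 14.5, 10.0, 16.5, 11.5, 12.0
print(station_borderline_score(averaged, ratings))                # 11.5

# The overall borderline score is then the sum of the per-station borderline scores
# across every blueprinted question/station in the examination (cf. Table 3).
```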
Performance of the examination

A summary of the examination results and psychometric data (Streiner & Norman, 2003) is provided to the examination board to support decision-making (Tables 3–5). The overall reliability (internal consistency) of the fully developed examination in 2003 and 2004 was 0.81 using Cronbach's alpha (Table 4). This is above the conventional 'gold standard' of 0.8 and confirms that the item sample is now large enough for high-stakes decision-making purposes. The Standard Error of Measurement (SEM) of the examination is used for grading and decision-making (Table 5). Various sub-analyses provide insights into the performance of different components of the examination (i.e. test items, examiners, patients or examination procedures), for quality assurance purposes. We have illustrated the appropriate tests and interpretations of data with some examples.
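The internal consistency figure quoted here is Cronbach's alpha computed over per-question/station scores for the whole cohort. A minimal sketch of the calculation, assuming a candidates-by-items score matrix (the demonstration data are invented):

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a (candidates x items) matrix of item scores."""
    scores = np.asarray(scores, dtype=float)
    n_items = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)      # variance of each item across candidates
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of the candidates' total scores
    return (n_items / (n_items - 1)) * (1 - item_variances.sum() / total_variance)

# Invented data: 6 candidates x 4 items.
demo = np.array([
    [7, 6, 8, 5],
    [5, 5, 6, 4],
    [9, 8, 9, 8],
    [4, 5, 5, 3],
    [8, 7, 7, 6],
    [6, 6, 7, 5],
])
print(round(cronbach_alpha(demo), 2))   # 0.97 for this toy data
```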


Table 3. Median borderline scores, correlations and marks available for each blueprinted question/station in the examination.

Station/MEQ ID   Competence   System   Problem   Correlation to total minus item   Max. marks available   Median borderline score

Station1 Physical exam CNS Abnormal Gait 0.30 40 23.50
Station2 (static) Data (X-ray) Resp Wheeze 0.17 10 6.00
Station3 Physical exam GI Abdominal mass 0.26 20 11.75
Station4 History Resp Haemoptysis 0.21 20 12.00
Station6 Patient educ/comms Skin/Misc Rash 0.30 20 13.00
Station7 History Endocrine Weight loss 0.21 20 13.50
Station8 (static) Practical skills Oncology Pain 0.18 10 5.00
Station9 History Eyes/ENT Visual disturbance 0.27 20 12.00
Station10 Physical exam MSS/CNS Laceration 0.32 20 11.00
Station11 Practical skills CVS Collapse 0.31 20 10.50
Station12 Physical exam Oncology Breast lump 0.30 20 13.00
Station14b Patient educ/comms Oncology Dying patient 0.27 20 12.50
Station15 (static) Data Skin/Misc Itch 0.26 10 6.00
Written1 Problem solving/Dx/Manage CVS Claudication 0.47 10 6.00
Written2 Problem solving/Dx/Manage CVS/Endo Palpitations 0.52 10 6.00
Written3 Problem solving/Dx/Manage CVS Heart failure 0.28 10 6.00
Written4 Problem solving/Dx/Manage Resp Haemoptysis 0.31 10 7.00
Written5 Problem solving/Dx/Manage Mental Mania 0.31 10 5.00
Written6 Problem solving/Dx/Manage ENT Hearing difficulty 0.42 10 4.00
Written7 Problem solving/Dx/Manage CNS Coma 0.11 10 7.00
Written8 Problem solving/Dx/Manage MSS Joint pain 0.47 10 5.00
Written9 Problem solving/Dx/Manage MSS Fracture 0.32 10 5.00
Written10 Problem solving/Dx/Manage GI Diarrhoea 0.41 10 7.00
Written11 Problem solving/Dx/Manage GI Haematemesis 0.52 10 6.00
Written12 Problem solving/Dx/Manage Mental Overdose 0.25 10 5.00
Written13 Problem solving/Dx/Manage Endocrine Weight loss 0.40 10 7.00
Written14 Problem solving/Dx/Manage Oncology Breast lump 0.40 10 6.00
Written15 Problem solving/Dx/Manage Haem Bleeding 0.42 10 5.00
Written16 Problem solving/Dx/Manage Oncology Testicular mass 0.34 10 4.00
Written17 Problem solving/Dx/Manage Eyes Red eye 0.46 10 5.00
Written18 Problem solving/Dx/Manage Endocrine Hairy 0.28 10 5.00
Written19 Problem solving/Dx/Manage CNS Weakness 0.41 10 5.50
Written20 Problem solving/Dx/Manage GI Vomiting 0.42 10 7.00
Static1 Investig/Interpretation CNS Headache 0.43 10 6.00
Static2 Investigation/Interpretation GI Jaundice 0.12 10 5.00
Static3 Investigation/Interpretation Resp SOB 0.17 10 6.00
Static4 Investigation/Interpretation Endocrine Dehydration 0.30 10 5.00
Static5 Investigation/Interpretation CNS Consciousness 0.16 10 6.00
Static6 Investigation/Interpretation ENT Abnormal Hearing 0.21 10 6.00
Overall borderline score 510 297.25
Overall borderline score % 100 58.28

Table 4. Summary of examination data 2001–04.

Year   Number of students   Mean score (%)   Overall borderline score (%)   Standard deviation   Standard error of measurement (%)   Pass mark (%)   Reliability
2004   195                  73.25            58.28                          5.50                 2.40                                60.68           0.81
2003   215                  74.4             58.41                          5.77                 2.52                                60.93           0.81
2002   214                  73.0             59.81                          4.95                 2.98                                62.79           0.64
2001   195                  71.85            58.90                          5.86                 3.21                                62.11           0.70


Table 5. Summary of provisional grading procedure for 195 students showing bands of 1 SEM (2.4%) from the overall borderline score of 58.28%.

Grade   Definition of band                              Status                     Range (%)      No. of students
5       Within 2 SEMs of the top mark                   Eligible for distinction   80.79–85.59    16
4       More than 2 SEMs above the borderline score     Good pass                  63.08–80.78    172
3       1–2 SEMs above the borderline score             Pass                       60.68–63.07    3
2       Within 1 SEM of the borderline score            Borderline                 55.89–60.67    4
1       More than 1 SEM below the borderline score      Fail                       <55.88         0

Question/station development

The Pearson correlation between a station score and the average of all the other question/station scores in the examination provides an indication of how well that question/station is functioning (see Table 3). A level above 0.25 is desirable. Those questions/stations returning less are reviewed as they may benefit from re-writing if they are to be used again.
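A minimal sketch of this item-total check with invented data: each question/station is correlated with the mean of the remaining questions/stations, and anything below the 0.25 level mentioned above is flagged for review.

```python
import numpy as np

def corrected_item_total_correlations(scores: np.ndarray) -> np.ndarray:
    """Pearson correlation of each item with the mean of all *other* items.
    scores: (candidates x items) matrix of question/station scores."""
    scores = np.asarray(scores, dtype=float)
    n_items = scores.shape[1]
    out = np.empty(n_items)
    for i in range(n_items):
        rest = np.delete(scores, i, axis=1).mean(axis=1)    # average of the other items
        out[i] = np.corrcoef(scores[:, i], rest)[0, 1]
    return out

# Invented cohort: 8 candidates x 5 items (e.g. % scores per station).
demo = np.array([
    [60, 55, 70, 62, 40],
    [72, 68, 75, 70, 55],
    [55, 50, 58, 52, 65],
    [80, 78, 82, 76, 60],
    [65, 60, 66, 64, 45],
    [50, 48, 55, 47, 70],
    [70, 72, 74, 69, 50],
    [58, 54, 60, 59, 62],
])
r = corrected_item_total_correlations(demo)
flagged = [i for i, v in enumerate(r) if v < 0.25]   # candidates for re-writing
print(np.round(r, 2), flagged)
```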
Examiner issues

Performance of the examiners is investigated in several ways. For example, median borderline scores are inspected to see if for some reason examiners are producing an inappropriately low borderline mark, which might affect the overall standard. This is most likely to happen in the written components where a single examiner is used for each question. This problem is less likely to occur in the clinical component, because of marks being averaged out across many examiners.

Inter-rater reliability is examined across all examiner pairings for patient-based stations. Overall, there is good general agreement between examiners. In 2004 Pearson's correlations ranged from 0.54 to 0.86. In this examination, the lowest correlations were at the two patient education/communication skills stations, items 6 and 14 (see Table 3). A more detailed inspection of individual examiner pairs across OSCE sites showed poor correlations in several examiner pairs on these two stations. In this case, it suggested a need to pay more attention to examiner training for communication skills stations to reduce the effect of assessor subjectivity.

In the written component of the examination, a randomized sample of four questions is double marked. Inter-rater agreement ranged from 0.72 to 0.87, showing that examiners were in good agreement and that our strategy to use single markers for the written component was defensible. However, it should be remembered that experienced and trained examiners were used for this marking task.

Examiners receive feedback on the overall performance of the examination, as well as appreciation for participating.


[Figure 1. Exploring order effects at one OSCE site for four groups of students: mean OSCE % score with 95% CI for successive OSCE student groups (B, F, J, M) at one venue.]

Examination procedures

Within such a complex OSCE examination conducted at multiple sites with multiple rotations, it is valuable to investigate possible biases caused by order or location effects. In most years, it has been possible to show that no such biases existed. However, in 2004 a significant difference in the mean scores was detected between two consecutive rotations of students at one venue using an analysis of variance (see Figure 1). This initially suggested a breach of examination security. However, investigation showed that the differences between the groups concerned (B and F) were not evident once the students' performance had been adjusted relative to their written component marks. Further inspection showed that a higher proportion of top-performing students were in group F and conversely a higher proportion of less able students were in group B. Because of this, greater attention has been paid by the administration to randomization of the students into their examination groups.
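A sketch of the kind of check described above, with invented scores and the group labels of Figure 1: a one-way analysis of variance compares raw OSCE means across successive groups, and a simple regression-based adjustment for written component marks (one plausible way to do the adjustment; the paper does not specify its exact method) indicates whether an apparent difference survives once group composition is taken into account.

```python
import numpy as np
from scipy.stats import f_oneway

# Invented OSCE % scores for four successive groups at one venue.
groups = {
    "B": np.array([62.0, 64.5, 61.0, 66.0, 63.5, 60.5]),
    "F": np.array([70.0, 72.5, 69.0, 74.0, 71.5, 73.0]),
    "J": np.array([66.0, 67.5, 65.0, 68.0, 66.5, 69.0]),
    "M": np.array([65.0, 68.0, 64.0, 67.5, 66.0, 67.0]),
}
f_stat, p_value = f_oneway(*groups.values())
print(f"ANOVA on raw OSCE scores: F = {f_stat:.2f}, p = {p_value:.4f}")

# Adjust for written-component performance: regress OSCE on written marks for the
# whole cohort, then compare the residuals by group.
written = {
    "B": np.array([55.0, 58.0, 54.0, 60.0, 57.0, 53.0]),
    "F": np.array([68.0, 71.0, 66.0, 73.0, 69.0, 72.0]),
    "J": np.array([63.0, 65.0, 61.0, 66.0, 63.5, 67.0]),
    "M": np.array([61.0, 65.0, 60.0, 64.0, 62.0, 63.0]),
}
all_osce = np.concatenate(list(groups.values()))
all_written = np.concatenate(list(written.values()))
slope, intercept = np.polyfit(all_written, all_osce, 1)
residuals = {k: groups[k] - (slope * written[k] + intercept) for k in groups}
f_adj, p_adj = f_oneway(*residuals.values())
print(f"ANOVA on written-adjusted scores: F = {f_adj:.2f}, p = {p_adj:.4f}")
```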
Patient factors

Finally, qualitative feedback is collected from examiners on aspects of the marking sheets and performance of the real and simulated patients. Suggestions are sought for improvements in the questions/stations, which are considered by the examination committee before they are put into the bank.

Reporting results and decision-making

The main priority, of course, is to compile the results and make decisions on who is to pass and who is to fail. Here, the concern of the medical school, and the GMC, is whether the students are competent to be provisionally registered to practice as a PRHO. The university also has regulatory requirements necessitating the awarding of grades. Mechanisms have to be found to satisfy both of these demands, which can be conflicting. The former requires a competency-based or criterion-referenced approach while the latter requires a norm-referenced approach.

The method Sheffield adopted utilizes the SEM calculated from the examination performance indicators discussed previously, using the following formula:

SEM = SD × √(1 − r)

(where r is the reliability of the test and SD is the standard deviation (%) of the scores). If we enter the 2004 data as an example, SEM = 5.50 × √(1 − 0.81) = 2.40%. The SEM can now be used as the equivalent of a confidence interval around the decision point, which in this examination is the overall borderline score, determined as described previously.

A system of grade bands has been agreed and is expressed in a protocol used for decision-making (see Table 5). As this is a high-stakes examination, the school's assessment policy has deemed that a pass will only be awarded to students who have satisfied the examination board that they are not within the borderline group (grade 2), as indicated by being at least one SEM above the overall borderline score. This ensures a 68% confidence interval around the pass/fail decision-making score. The examination board is actively considering raising the bar for the borderline group to two SEMs to increase the confidence interval to 95%. Students more than one SEM below the overall borderline score are awarded a fail (grade 1). The majority of students obtain a grade 4. Those obtaining a mark within two SEMs of the top mark are graded five and are eligible for a distinction. Students scoring a grade 3 may benefit from the feedback that they have only just passed, and additional monitoring in the immediate postgraduate period may be wise. Those scoring grade 2 or less are invited for a repeat examination (described below).
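As an illustrative aid (not from the paper), the following sketch reproduces the SEM arithmetic above and the banding protocol of Table 5 in code. The function names sem and grade are invented; the borderline score, standard deviation, reliability and top mark (85.59%) are the 2004 figures quoted in the text and Table 5.

```python
import math

def sem(standard_deviation: float, reliability: float) -> float:
    """Standard error of measurement: SD * sqrt(1 - r)."""
    return standard_deviation * math.sqrt(1 - reliability)

def grade(score: float, borderline: float, top_mark: float, sem_value: float) -> int:
    """Provisional grade bands built around the overall borderline score (cf. Table 5)."""
    if score >= top_mark - 2 * sem_value:
        return 5            # eligible for distinction
    if score >= borderline + 2 * sem_value:
        return 4            # good pass
    if score >= borderline + 1 * sem_value:
        return 3            # pass (at least 1 SEM above the borderline score)
    if score >= borderline - 1 * sem_value:
        return 2            # borderline group
    return 1                # fail

s = sem(5.50, 0.81)                      # 2.40 (%) with the 2004 data
borderline_2004, top_2004 = 58.28, 85.59
print(round(s, 2))
for mark in (85.0, 70.0, 61.0, 57.0, 54.0):
    print(mark, grade(mark, borderline_2004, top_2004, s))
```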
It is incongruous to use a competence examination to determine those students who are excellent and deserve a higher academic grade or a prize. Previously these best performing students were offered another examination in the form of a distinction viva. This practice has now been discontinued on the basis that such an examination is likely to be less valid and considerably less reliable than the main examination. However, it is beyond available resources to create and administer a separate examination of similar quality that is able to discriminate reliably amongst such excellent students.
A summary of the main statistical indicators over the period 2001–04 is shown in Table 4. It indicates the improvement in reliability since the examination was lengthened in 2003 and generally how stable the mean scores, borderline scores and SEMs are with the differing cohorts of students.
Repeat assessments

University regulations demand that all students deemed not to have passed the examination be given a repeat assessment. Accordingly, all borderline students and students who have failed enter the repeat assessment examination, which is conducted approximately six months later. This has been controversial in that it is usual to provide students with a much earlier opportunity for redemption. However, students not shown to be clinically competent cannot reasonably be expected to improve their performance without a significant period of additional clinical experience and focused remediation.

Numbers in the repeat assessment group have become gratifyingly low (range 2–8). University regulations and fairness to these students dictate that the repeat assessment examination should be of the same format and quality as the main examination. This makes it an expensive exercise for such a small group, and provides some additional challenges when it comes to standard-setting and decision-making.

For efficiency purposes, the examination is blueprinted at the same time as the main administration, with questions drawn from the pre-existing bank or written afresh. However, the borderline method of standard-setting does not work for such small numbers. Generalizability studies (Kramer et al., 2003) indicate that a minimum of 30 students would need to sit the exam to get a dependable pass/fail decision using this approach. The most attractive solution to this dilemma will arise when the bank of questions/stations, on which a borderline standard has previously been calculated, is large enough to construct a valid examination. As of 2004, a hybrid approach has been adopted in which the historical standard, i.e. the mean of the 2003 and 2004 pass marks for the main examination, is used as the pass mark for the repeat assessment. This is justified on the basis that the repeat assessment exam was blueprinted and produced at the same time as the main examination. As a further check, a modified Angoff procedure was used to predict the borderline score for new questions/stations. For previously used items, the borderline score was used from the item bank database.
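The paper does not describe its modified Angoff procedure in detail. As a hedged illustration only, one common form asks each judge to estimate the mark a borderline candidate would obtain on each new question/station; the judge-averaged estimates are then summed to predict the borderline score (all figures below are invented):

```python
import numpy as np

# Hypothetical panel of 4 judges rating 3 new stations: expected marks for a
# borderline candidate, out of each station's maximum.
judge_estimates = np.array([
    #  st1   st2  st3
    [11.0, 5.5, 6.0],
    [12.0, 5.0, 6.5],
    [10.5, 6.0, 5.5],
    [11.5, 5.5, 6.0],
])
per_station = judge_estimates.mean(axis=0)   # judge-averaged borderline estimate per station
predicted_borderline = per_station.sum()     # predicted borderline score for the new items
print(np.round(per_station, 2), round(predicted_borderline, 2))
```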
Summary and conclusions

Developing a valid and reliable assessment of clinical competence is not easy to achieve with the resources available at the university level. Where such assessments are used for high-stakes decision-making, as in the UK and many other countries, the institution has a responsibility both to the students and to the public to ensure that these assessments meet best practice standards. Research has shown that this is currently not always the case. The main reason for this may be that the design and development of such assessments require a considerable degree of educational and psychometric expertise to ensure that students' scores reflect their true ability and that subsequent competence decisions are reasonably free from error.

The examination we have developed progressively over several years has followed published criteria for international best practice, based on evidence derived largely from generalizability studies, and from many years' practical experience in the methods used.

A number of quality-assurance procedures have been described to demonstrate how the assessment is subject to continual monitoring for potential sources of error, resulting in the introduction of several refinements. We endorse the process of blueprinting the assessment; providing well-defined training regimes for examiners and simulated patients; providing regular feedback to those involved in judging student performances; applying appropriate standard-setting procedures; attending to item development; and providing detailed statistical analyses of all scores.

In conclusion, we have demonstrated that a high-quality assessment of clinical competence can be conducted within the resources of a university medical school. If the relevant expertise is not available in-house, we would strongly suggest such support be sought on a consultancy basis and key staff be supported to attain the appropriate skills. This will reduce development time and costs, and ensure unnecessary mistakes are not made.

Notes on contributors

CHRIS ROBERTS is Associate Professor of Medical Education and Director of the Office of Teaching and Learning in Medicine at the University of Sydney, Australia.

DAVID NEWBLE is Emeritus Professor of Medical Education at the University of Sheffield. He returned to Australia in 2004, where he is a Professor of Medical Education at Flinders University and works as a consultant in medical education.

BRIAN JOLLY is Director of the Centre for Medical and Health Sciences Education, Faculty of Medicine, Nursing & Health Sciences, Monash University, Australia.

MALCOLM REED is Professor of Surgery in the Department of Surgery in the Sheffield Teaching Hospitals Trust. He is Chair of the Assessment Committee responsible for the finals examination.

KINGSLEY HAMPTON is Senior Clinical Lecturer in the Division of Genomic Medicine and Consultant Haematologist at the Sheffield Teaching Hospitals Trust. He is the current Director of Studies responsible for the finals examination.

References

BLIGH, J. (2003) Nothing is but what is not, Medical Education, 37, pp. 184–185.
BOULET, J.R., MCKINLEY, D.W., WHELAN, G.P. & HAMBLETON, R.K. (2003) Quality assurance models for performance-based assessments, Advances in Health Science Education, 8, pp. 27–47.
BRISTOL ROYAL INQUIRY (2001) Learning from Bristol: The Report of the Public Inquiry into Children's Heart Surgery at the Bristol Royal Infirmary 1984–1995 (London, HMSO).
FOWELL, S.L., MAUDSLEY, G., MAGUIRE, P., LEINSTER, S.J. & BLIGH, J. (1998) Student assessment in undergraduate medical education in the United Kingdom, Medical Education, 34, pp. 1–49.
FOWELL, S.L. & BLIGH, J. (2001) Assessment of undergraduate medical education in the UK: time to ditch motherhood and apple pie, Medical Education, 35.
GENERAL MEDICAL COUNCIL (1993) Tomorrow's Doctors: Recommendations on Undergraduate Medical Education (London, GMC).
GENERAL MEDICAL COUNCIL (2003) Tomorrow's Doctors: Recommendations on Undergraduate Medical Education (London, GMC). In: NEWBLE, D.I., JOLLY, B. & WAKEFORD, R. (Eds) (1994) The Certification and Recertification of Doctors: Issues in the Assessment of Clinical Competence (Cambridge, Cambridge University Press).
KRAMER, A., MUIJTJENS, A., JANSEN, K., DÜSMAN, H., TAN, L. & VAN DER VLEUTEN, C.P. (2003) Comparison of a rational and an empirical standard setting procedure for an OSCE, Medical Education, 37, pp. 132–139.
NEWBLE, D.I. & SWANSON, D.B. (1988) Psychometric characteristics of the objective structured clinical examination, Medical Education, 22, pp. 325–334.
NEWBLE, D.I. (2001) (Letter to the editor), Medical Education, 35, pp. 308–309.
NEWBLE, D.I. (2002) Assessing Clinical Competence at the Undergraduate Level, Medical Education Booklet, No. 25 (Edinburgh, Association for the Study of Medical Education).
NEWBLE, D.I. (2004) Techniques for measuring clinical competence: objective structured clinical examinations, Medical Education, 35, pp. 199–203.
NEWBLE, D.I., DAUPHINEE, D., DAWSON-SAUNDERS, B., et al. (1994) Guidelines for the development of effective and efficient procedures for the assessment of clinical competence, Teaching and Learning in Medicine, 6, pp. 213–220.
NEWBLE, D.I., STARK, P., LAWSON, M. & BAX, N.D. (2005) Developing an outcome-focused core curriculum: the Sheffield approach, Medical Education (in press).
NORCINI, J.J. (2003) Setting standards on educational tests, Medical Education, 37, pp. 464–469.
PATEL, K. (2001) USMLE-style exam (letter), Medical Education, 35, pp. 306–307.
REZNICK, R.K., BLACKMORE, D., DAUPHINEE, W.D., ROTHMAN, A.I. & SMEE, S.M. (1996) Large scale high stakes testing with an OSCE: report from the Medical Council of Canada, Academic Medicine, 71, pp. S19–S21.
ROBERTS, C., LAWSON, M., NEWBLE, D.I. & SELF, A. (2003) Managing the learning environment in undergraduate medical education: the Sheffield Approach, Medical Teacher, 5, pp. 297–301.
SHIPMAN INQUIRY (2005) Independent inquiry into the issues arising from the case of Harold Fredrick Shipman. Available at: http://www.shipmaninquiry.org.uk/ (accessed February 2005).
SMEE, S.M. (2001) Setting standards for objective structured clinical examination: the borderline group method gains grounds on Angoff, Medical Education, 35, pp. 1009–1010.
SOUTHGATE, L., CAMPBELL, L., COX, J., FOULKES, J., JOLLY, B., MCCRORIE, P. & TOMBLESON, P. (2001) The General Medical Council's performance procedures: the development and implementation of tests of competence with examples from general practice, Medical Education, 35, pp. 20–28.
STREINER, D.L. & NORMAN, G.R. (2003) Health Measurement Scales: A Practical Guide to their Development and Use (Oxford, Oxford University Press).
SWANSON, D.B., NORMAN, G.R. & LINN, R.L. (1995) Performance based assessment: lessons learnt from the health professions, Educational Researcher, 24, pp. 5–11.
TAMBLYN, R.M., ABRAHAMOWICZ, M., BRAILOVSKY, C. & GRAND'MAISON, P. (1998) Association between licensing examination scores and resource use and quality of care in primary care practice, Journal of the American Medical Association, 280, pp. 989–996.
TOMBLESON, P., FOX, R.A. & DACRE, J.A. (2000) Defining the content for the objective structured clinical examination component of the Professional and Linguistic Assessment Board examination: development of a blueprint, Medical Education, 34, pp. 566–572.
VAN DER VLEUTEN, C.P. (1996) The assessment of professional competence: developments, research and practical implications, Advances in Health Science Education, 1, pp. 41–67.
VAN DER VLEUTEN, C.P. (2000) Validity of final examinations in undergraduate medical training, British Medical Journal, 321, pp. 1217–1219.
WAKEFORD, R. (2001) (Author's reply), British Medical Journal, 322, p. 359.
WASS, V., MCGIBBON, D. & VAN DER VLEUTEN, C.P. (2001) Composite undergraduate clinical examinations: how should the components be combined to maximize reliability, Medical Education, 35, p. 330.
WILKINSON, T.J., NEWBLE, D.I. & FRAMPTON, C.M. (2001) Standard setting in an objective structured clinical examination: use of global ratings of borderline performance to determine the passing score, Medical Education, 35, pp. 1043–1049.
