Postgraduate Medical Education and Training Board


Developing and maintaining an assessment system
- a PMETB guide to good practice

January 2007


Developing and maintaining an assessment system - a PMETB guide to good practice completes the guidance for medical
Royal Colleges and Faculties who are developing assessment systems based on curricula approved by PMETB.

As the title implies, this is a good practice guide rather than a cook book providing recipes for assessment systems.
This guide covers the assessment Principles 3, 4 and 6. As indicated by PMETB, the Colleges and Faculties developing
assessment systems have until August 2010 to comply with all nine Principles of assessment produced by PMETB (1).
These Principles highlight the issues that need to be addressed for transparency and fairness to trainees and encourage
curriculum designers not to forget their duty of care to trainers and trainees alike.

The work was undertaken by the Assessment Working Group, which consisted of people from different disciplines of
medicine who are considered experts in designing assessments.

This good practice guide explains some of the challenges faced by anyone who is devising an assessment system. As far as
possible we have developed this guidance in the context of practicality, feasibility in respect of quality management, utility,
sources of evidence required for competency progression, standard setting and integrating various assessments, bearing
in mind the fine balance between the training and service requirements. We have avoided being prescriptive and have not
produced a toolbox of PMETB approved assessment tools. Instead, we recommend that the Colleges and Faculties should
consult the guidance produced by the Academy of Medical Royal Colleges (AoMRC) as well as Modernising Medical
Careers (MMC) to choose assessment tools which comply with PMETB’s Principles for an assessment system for postgraduate
medical training (1).

PMETB has been fortunate in securing the services of highly skilled and enthusiastic experts who worked in their own
time on the Assessment Working Group to produce this document. I would like to extend my grateful thanks to all these
dedicated people on behalf of PMETB. I would particularly like to mention the dedication and effort of the work stream
leaders, Dr Gareth Holsgrove, Dr Helena Davies and Professor David Rowley, who have worked through all hours in
developing this guide.

Dr Has Joshi FRCGP
Chair of the Assessment Committee and the Assessment Working Group
January 2007


The Assessment Working Group
Dr Has Joshi FRCGP: Chair of the Assessment Committee and the Assessment Working Group
Jonathan D Beard: Consultant Vascular Surgeon and Education Tutor, Royal College of Surgeons of England
Dr Nav Chana: General Practitioner, RCGP Assessor
Helena Davies: Senior Lecturer in Late Effects/Medical Education, University of Sheffield
John C Ferguson: Consultant Surgeon, Southern General Hospital, Glasgow
Professor Tony Freemont: Chair of the Examiners in Histopathology, Royal College of Pathologists
Miss L A Hawksworth: Director of Certification, PMETB
Dr Gareth Holsgrove: Medical Education Adviser, Royal College of Psychiatrists
Dr Namita Kumar: Consultant Rheumatologist and Physician
Dr Tom Lissauer FRCPCH: Officer for Exams, Royal College of Paediatrics and Child Health, and Consultant Neonatologist,
St Mary’s Hospital, London
Dr Andrew Long: Consultant Paediatrician and Director of Medical Education, Princess Royal University Hospital
Dr Amit Malik: Specialist Registrar in Psychiatry, Nottinghamshire Healthcare NHS Trust
Dr Keith Myerson: Member of Council, Royal College of Anaesthetists
Mr Chris Oliver: Co-Convener Examinations, Royal College of Surgeons of Edinburgh and Consultant Trauma Orthopaedic
Surgeon, Edinburgh Orthopaedic Trauma Unit
Jan Quirke: Board Secretary, PMETB
Professor David Rowley: Director of Education, Royal College of Surgeons Edinburgh
Dr David Sales: RCGP Assessment Fellow
Professor Dame Lesley Southgate: PMETB Board member to November 2006 and Chair Assessment Committee to January
Dr Allister Vale: Medical Director, MRCP (UK) Examination
Winnie Wade: Director of Education, Royal College of Physicians
Val Wass: Professor of Community Based Medical Education, Division of Primary Care, University of Manchester
Laurence Wood: RCOG representative to the Academy of Royal Colleges, Associate Postgraduate Dean West Midlands

PMETB would like to thank the following individuals for their editorial assistance in the production of this guide to good practice:
Helena Davies - Senior Lecturer in Late Effects/Medical Education, University of Sheffield
Dr Has Joshi, FRCGP - Chair of the Assessment Committee and the Assessment Working Group
Dr Gareth Holsgrove - Medical Education Adviser, Royal College of Psychiatrists
Professor David Rowley - Director of Education, Royal College of Surgeons Edinburgh


Table of contents

Introduction  6

Chapter 1: An assessment system based on principles  7
Introduction  7
Utility  7
  What is meant by utility?  7
  Why is the utility index important?  8
Defining the components  8
  Reliability  8
  Validity  9
  Educational impact  10
  Cost, acceptability and feasibility  12

Chapter 2: Transparent standard setting in professional assessments  13
Introduction  13
Types of standard  13
Prelude to standard setting  14
  Standard setting methods  15
  Test based methods  15
  Trainee or performance based methods  16
  Combined and hybrid methods - a compromise  16
Hybrid standard setting in performance assessment  16
Standard setting for skills assessments  17
A proposed method for standard setting in skills assessments using a hybrid method  17
Standard setting for workplace based assessment  19
Anchored rating scales  19
Decisions about borderline trainees  19
  Making decisions about borderline trainees  19
Summary  20
Conclusion  20

Chapter 3: Meeting PMETB approved Principles of assessment  21
Introduction  21
Quality assurance and workplace based assessment  21
The learning agreement  22
Appraisal  22
The annual assessment  23
Additional information  23
  Quality assuring summative exams  23
Summary  25

Chapter 4: Selection, training and evaluation of assessors  26
Introduction  26
Selection of assessors  26
Assessor training  26
Feedback for assessors  27

Chapter 5: Integrating assessment into the curriculum - a practical guide  28
Introduction  28
Blueprinting  28
Sampling  28
The assessment burden (feasibility)  29
Feedback  29


Chapter 6: Constructing the assessment system  30
Introduction  30
Purposes  30
One classification of assessments  30

References  32
Further reading  34

Appendices  36
Appendix 1: Reliability and measurement error  36
  Reliability  36
  Measurement error  37
Appendix 2: Procedures for using some common methods of standard setting  39
  Test based methods  39
  Trainee based methods  40
  Combined and compromise methods  41
Appendix 3: AoMRC, PMETB and MMC categorisation of assessments  42
  Purpose  42
  The categories  42
Appendix 4: Assessment good practice plotted against GMP  44
Glossary of terms  46


Introduction

This guide explains some of the challenges which face anyone devising an assessment system in response to the Principles for an assessment system for postgraduate medical training laid out by PMETB (1). The original principles have not needed any fundamental changes since they were written and serve to highlight the issues which should be addressed to ensure transparency and fairness for trainees, and encourage curricular designers to ensure a proper duty of care to trainers and trainees alike. Most importantly, the principles assure the general public that trainees who undergo professional accreditation will be assessed properly and only those who have achieved the required level of competence are allowed to progress.

In producing this document we wish to acknowledge that most colleges and many committed individuals, usually as volunteers, have considerable expertise in the area of assessment. Where possible we have drawn on that expertise and we hope this is reflected in the text. What we wish to do is to provide a long term framework for continuing to improve assessments for all parties, from trainees to patients. Educational trends change but principles do not and by providing this document PMETB wishes to set a benchmark against which a programme of continuing quality improvement can progress.

The guide is a reference document rather than a narrative and it is anticipated that it will help organisations collate all the information likely to be asked of them by PMETB in any quality assurance activity.

A full glossary of terms related to assessment in the context of medical education can be found on page 46. In particular, to ensure consistency with other PMETB guidance, we largely refer to ‘assessment systems’ rather than ‘assessment programmes’ and ‘quality management’ rather than ‘quality control’. ‘Assessment instrument’ is used throughout to refer to individual assessment methods. The term ‘assessor’ should be assumed to encompass examiners for formal exams as well as those undertaking assessments in other contexts, including the workplace.

Chapter 1: An assessment system based on principles

Introduction
This chapter aims to:
• define what is meant by utility in relation to assessment;
• explain the importance of utility and its evolution from the original concept;
• provide clear workable definitions of the components of utility that enable the reader to understand their relevance and importance to their own assessment context;
• summarise the existing evidence base in relation to utility both to provide a guide for the interested reader and to reassure those responsible for assessment that there is a body of research that can be referred to;
• highlight gaps where further research is needed.

Utility
The ‘utility index’ described by Cees van der Vleuten (2) in 1996 serves as an excellent framework for assessment design and evaluation (3). It acknowledges that optimising any assessment tool or programme is about balancing the six components of the utility index. Choice of assessment instruments and aspirations for high validity and reliability are limited by the constraints of feasibility, e.g. resources to deliver the tests and acceptability to the trainees. The relative importance of the components of the utility index for a given assessment will depend on both the purpose and nature of the assessment system. For example, a high stakes examination on which progression into higher specialist training is dependent will need high reliability and validity, and may focus on this at the expense of educational impact. In contrast, an assessment which focuses largely on providing a trainee with feedback to inform their own personal development planning would focus on educational impact, with less of an emphasis on reliability (Figure 2). Figure 2 illustrates the relative importance of reliability vs educational impact, depending on the purpose of the assessment.

What is meant by utility?

Figure 1: Utility
Educational impact x validity x reliability x cost x acceptability x feasibility*
*Not in van der Vleuten’s original utility index but explicitly included here because of its importance

The original utility index described by Cees van der Vleuten consisted of five components:
• Reliability
• Validity
• Educational impact
• Cost efficiency
• Acceptability

Given the massive change in postgraduate training in the UK and the significantly increased assessment burden which has occurred, it is important that feasibility is explicitly acknowledged as an additional sixth component (although it is implicit in cost effectiveness and acceptability).
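The multiplicative form of the utility index in Figure 1 can be illustrated with a short sketch. The guide treats utility as a conceptual framework, not a calculation, so everything numeric here is an assumption made for illustration: the 0 to 1 scoring scale, the exponent weights used to express relative importance, and the function and variable names are all hypothetical.

```python
# Illustrative sketch only: the guide presents utility as a conceptual product
# of six components. The 0-1 scale and the exponent weights are assumptions.

# The six components named in Figure 1 of the guide.
COMPONENTS = (
    "reliability",
    "validity",
    "educational_impact",
    "cost_efficiency",
    "acceptability",
    "feasibility",
)


def utility(scores, weights=None):
    """Return the product of score ** weight over the six components.

    scores  -- dict mapping each component name to a value between 0 and 1
    weights -- optional dict of exponents (default 1.0); a larger exponent
               makes that component matter more for this assessment's purpose
    """
    weights = weights or {}
    total = 1.0
    for name in COMPONENTS:
        score = scores[name]
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {score}")
        total *= score ** weights.get(name, 1.0)
    return total


if __name__ == "__main__":
    # Hypothetical profile of a workplace based formative assessment.
    formative = dict(
        reliability=0.6, validity=0.9, educational_impact=0.9,
        cost_efficiency=0.8, acceptability=0.8, feasibility=0.7,
    )
    # A high stakes purpose weights reliability more heavily (assumed weight).
    print(utility(formative))
    print(utility(formative, {"reliability": 2.0}))
```

Because the components are multiplied, a score of zero on any one of them (for example an assessment that is not feasible to deliver) gives zero utility regardless of how strong the others are, which is the balancing idea the chapter describes.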

Why is the utility index important?
There is an increasing recognition that no single assessment instrument can adequately assess clinical performance and that assessment planning should focus on assessment systems with triangulation of data in order to build up a complete picture of a doctor’s performance (4, 5). The purpose of both the overall assessment system and its individual assessment instruments must be clearly defined for PMETB. The rationale for developing/implementing the individual components within the assessment system must be transparent, justifiable and based on supportive evidence from the literature, where possible.

[Figure 2: Utility function. Panels: in-training formative assessment and high stakes assessment, each on a 0% to 100% scale. U = R x V x E, where U = Utility, R = Reliability, V = Validity, E = Educational impact (C van der Vleuten).]

An understanding of these individual components of the utility index will help when planning or reviewing assessment programmes in order to ensure that, where possible, all of its components have been addressed adequately. The utility index is an important component of the framework presented to PMETB and allows a series of questions that should be asked of each assessment instrument - and of the assessment system as a whole.

Defining the components

Reliability
What is the quality of the results? Are they consistent and reproducible?

Reliability is of central importance in assessment because trainees, assessors, regulatory bodies and the public alike want to be reassured that assessments used to ensure that doctors are competent would reach the same conclusions if it were possible to administer the same test again on the same doctor in the same circumstances. Reliability is a quantifiable measure which can be expressed as a coefficient and is most commonly approached using classical test theory or generalisability analysis (6-8, 3, 9). A perfectly reproducible test would have a coefficient of 1.0, that is 100% of the trainees would achieve the same rank order on retesting. In reality, tests are affected by many sources of potential error such as examiner judgments, cases used, trainee nervousness and test conditions. Traditionally, a reliability coefficient of greater than 0.8 has been considered as an appropriate cut off for high stakes assessments. It is recognised, however, that reliability coefficients at this level will not be achievable for some assessment tools but they may nevertheless be a valuable part of an assessment programme, both to provide additional evidence for triangulation and/or because of their effect on learning.

Estimation of reliability as part of the overall quality management (QM) of an assessment programme will require specialist expertise, over and above that to be found in those simply involved as assessors. This means evidence of the application of appropriate psychometric and statistical support for the evaluation of the programme should be provided.

Exploration of sources of bias is essential as part of the overall evaluation of the programme and collection of, for example, demographic data to allow exploration of effects, such as age, gender, race, etc., must be planned at the outset.

Intrinsic to the validity of any assessment is analysis of the scores to quantify their reproducibility. An assessment cannot be viewed as valid unless it is reliable. PMETB will require evidence of the reliability of each component appropriate to the weight given to that component in the utility equation, whilst at the same time recognising the constraints of the ‘real world’.
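As a rough illustration of what a reliability coefficient looks like in practice, the sketch below computes Cronbach's alpha, one widely used coefficient from classical test theory. The guide does not name a specific coefficient, so the choice of alpha, the function name and the sample scores are assumptions for illustration; a real analysis would need the specialist psychometric support described above.

```python
# Illustrative sketch only: Cronbach's alpha is one common classical test
# theory coefficient; the guide does not prescribe it. Data are invented.
from statistics import pvariance

HIGH_STAKES_CUTOFF = 0.8  # threshold quoted in the text for high stakes use


def cronbach_alpha(scores):
    """Return Cronbach's alpha for a trainee-by-item score matrix.

    scores -- list of rows, one per trainee, each row holding that trainee's
              score on every item (or station, or case) of the assessment
    """
    n_items = len(scores[0])
    if n_items < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item across trainees.
    item_variances = [
        pvariance([row[i] for row in scores]) for i in range(n_items)
    ]
    # Variance of the trainees' total scores.
    total_variance = pvariance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


if __name__ == "__main__":
    # Hypothetical scores for five trainees on four stations.
    stations = [
        [6, 7, 6, 8],
        [4, 5, 5, 4],
        [8, 8, 9, 9],
        [5, 6, 4, 6],
        [7, 7, 8, 7],
    ]
    alpha = cronbach_alpha(stations)
    print(f"alpha = {alpha:.2f}, above cut off: {alpha > HIGH_STAKES_CUTOFF}")
```

A value below the 0.8 cut off would not by itself rule an instrument out; as the text notes, it may still contribute evidence for triangulation or have a useful effect on learning.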

PMETB will require an explanation of the weight placed on each assessment tool within the modified utility index. For example, a workplace based assessment aimed at testing performance has a higher weighting for validity (at the apex of Miller’s Pyramid - Figure 3) but may not achieve a high stakes reliability coefficient of >0.8 as it is difficult to standardise content. On the other hand the inclusion of a defensible clinical competency assessment in an artificial summative examination environment to decide on progression may be justified on the grounds of high reliability, but only achieved at the expense of high face validity.

Remember that sufficient testing time is essential in order to achieve adequate reliability. It is becoming increasingly clear that whatever the format, total testing time, ensuring breadth of content sampling and sufficient individual assessments by individual assessors, is critical to the reliability of any clinical competence test (10) (Box 1).

Box 1: Reliability as a function of testing time

Testing time in hours | MCQ¹ | Case based short essay² | PMP¹ | Oral exam³ | Long case⁴ | OSCE⁵ | Mini-CEX⁶ | Practice video assessment⁷ | Incognito SPs⁸
1 | 0.62 | 0.68 | 0.36 | 0.50 | 0.60 | 0.47 | 0.73 | 0.62 | 0.61
2 | 0.76 | 0.73 | 0.53 | 0.69 | 0.75 | 0.64 | 0.84 | 0.76 | 0.76
4 | 0.93 | 0.84 | 0.69 | 0.82 | 0.86 | 0.78 | 0.92 | 0.93 | 0.92
8 | 0.93 | 0.82 | 0.82 | 0.90 | 0.90 | 0.88 | 0.96 | 0.93 | 0.93

¹ Norcini et al., 1985  ² Stalenhoef-Halling et al., 1990  ³ Swanson, 1987  ⁴ Wass et al., 2001  ⁵ Petrusa, 2002  ⁶ Norcini et al., 1999  ⁷ Ram et al., 1999  ⁸ Gorter, 2002

A significant current challenge is to introduce sample frameworks into workplace based assessments of performance which sample sufficiently to address issues of content specificity. Because content specificity (differences in performance across different clinical problem areas) and assessor variability consistently represent the two greatest threats to reliability, sampling of both clinical content and assessors is essential and this should be reflected in assessment system planning.

Validity
Reliability is a measure of how reproducible a test is. If you administered the same assessment again on the same person would you get the same outcome? Validity is a measure of how completely an assessment tests what it is designed to test. It is evaluated against the various facets of clinical competency. There is usually a trade off between validity and reliability as the assessment with perfect validity and perfect reliability does not exist. However, it is important to recognise that if a test is not reliable it cannot be valid.

Validity is a conceptual term which should be approached as a hypothesis and cannot be expressed as a simple coefficient (11). Traditionally, a number of facets of validity have been defined (12) (Box 2), separately acknowledging that evaluating the validity of an assessment requires multiple sources of evidence. An alternative approach arguing that validity is a unitary concept which requires these multiple sources of evidence to evaluate and interpret the outcomes of an assessment has also been proposed (11). Box 2 summarises the traditional facets of validity and can provide a useful framework for evaluating validity.

Predictive and consequential validity are important but poorly explored aspects of assessment, particularly workplace based assessment. Predictive validity may not be able to be evaluated for many years but plans to determine predictive validity should be described and this will be facilitated by high quality centralised data management. Consequential validity is integral to evaluation of educational impact.
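The relationship between testing time and reliability shown in Box 1 can be explored with the Spearman-Brown prophecy formula from classical test theory. The guide does not name this formula, so its use here is an assumption; it projects how reliability would change if a test were lengthened with comparable material. The function names are hypothetical, and the starting value of 0.73 is the one-hour mini-CEX figure from Box 1.

```python
# Illustrative sketch only: the Spearman-Brown formula is a standard classical
# test theory result, applied here as an assumption about how reliability
# scales with testing time. It is not stated in the guide.


def spearman_brown(reliability, length_factor):
    """Project reliability when testing time is multiplied by length_factor."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


def length_factor_needed(reliability, target=0.8):
    """Return the multiple of current testing time needed to reach target."""
    if not 0.0 < reliability < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("reliability and target must be strictly between 0 and 1")
    return target * (1 - reliability) / (reliability * (1 - target))


if __name__ == "__main__":
    one_hour = 0.73  # mini-CEX reliability at one hour, from Box 1
    for hours in (2, 4, 8):
        print(hours, round(spearman_brown(one_hour, hours), 2))
    print(round(length_factor_needed(one_hour, 0.8), 2), "hours to reach 0.8")
```

Starting from 0.73, the projected values round to the mini-CEX figures listed in Box 1 for two, four and eight hours. Other columns need not follow the projection exactly, since each comes from a separate study.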

Box 2: Traditional facets of validity

Type of validity | Test facet being measured | Questions being asked
Face validity | Compatibility with the curriculum’s educational philosophy | What is the test’s face value? Does it match up with the educational intentions?
Content validity (May also be referred to as direct validity) | The content of the curriculum | Does the test include a representative sample of the subject matter?
Construct validity (May also be referred to as indirect validity) | E.g. differentiation between novice and expert on a test of overall clinical assessments | What is the construct, does the evidence support it? Does the evidence in relation to assessment support a sensible underpinning construct or constructs? Do two assessments designed to test different things have a low correlation?
Predictive validity | The ability to predict an outcome in the future, e.g. professional success after graduation | Does the test predict future performance and level of competency?
Consequential validity | The educational consequence or impact of the test | Does the test produce the desired educational outcome?

Educational impact
Assessment must have clarity of educational purpose and be designed to maximise learning in areas relevant to the curriculum. Based on the assumption that assessment drives learning that underpins training, it should be used strategically to promote desirable learning strategies in contrast to some of the learning behaviours that have been promoted by traditional approaches to assessment within medicine. Careful planning is essential. Agreement on how to maximise educational impact must be an integral part of planning assessment and the rationale and thinking underpinning this must be evident to those reviewing the assessment programme. For example, what aptitudes are you aiming to assess, at what level of expertise and how was the content of the assessment defined relative to the curriculum - a process known as blueprinting (13). Several different factors contribute to overall educational impact and a number of questions will therefore need to be considered.

a) What is the educational intent or purpose of the assessment?
In the past a clear distinction between summative and formative assessment has been made. However, in line with modern assessment theory, the PMETB principles emphasise the importance of giving students feedback on all assessments, encouraging reflection and deeper learning (PMETB Principle 5(1)). The purpose of assessment should be clearly described. For example, is it for final certification, is it to determine progress from one stage to another, is it to determine whether to exclude an individual from their training programme, etc? Feedback should be provided that is relevant to the purpose as well as the content of the assessment in order that personal development planning in relation to the relevant curriculum can take place effectively. In order to plan appropriate feedback it is essential for there to be clarity of purpose for the assessment system. If assessment focuses only on certification and exclusion, the all important potential for a beneficial influence on the learning process will be lost. A quality enhanced assessment system cannot be effective without high quality feedback. All those designing and delivering assessment should explore ways of enabling feedback to be provided at all stages and make their intentions transparent to trainees.

b) What level of competence are you trying to assess?
Is it knowledge, competence or performance? A helpful and widely utilised framework for describing levels of competence is provided by Miller’s Pyramid (14) (Figure 3). The base represents the knowledge components of competence: ‘Knows’ (basic facts) followed by ‘Knows How’ (applied knowledge). The progression to ‘Knows How’ highlights that there is more to clinical competency than knowledge alone.

Figure 3: Miller’s Pyramid (levels from base to apex: Knows, Knows How, Shows How, Does; labels: MCQ, CCS, OSCE, Competence, Work based assessment, Performance/action)

‘Shows How’ represents a behavioural rather than a cognitive function, i.e. it is ‘hands on’ and not ‘in the head’. Assessment at this level requires an ability to demonstrate a clinical competency in a controlled environment, e.g. an objective structured clinical examination (OSCE). The ultimate goal for a highly valid assessment of clinical ability is to test performance - the ‘Does’ of Miller’s Pyramid, i.e. what the doctor actually does in the workplace. It must be clear at which level you are aiming to test.

c) At what level of expertise is the assessment set?
Any assessment design must accommodate the progression from novice through competency to expertise. It must be clear against what level the trainee is being assessed. A number of developmental progressions have been described for knowledge, including those in Bloom’s taxonomy (15) (Figure 4). Frameworks are also being developed for the clinical competency model (16). When designing an assessment system it is important to identify the level of expertise anticipated at that point in training. The question ‘Is the assessment and standard appropriate for the particular level of training under scrutiny?’ must always be asked. It is not uncommon to find questions in postgraduate examinations assessing basic factual knowledge at undergraduate level rather than applied knowledge reflective of the trainees’ postgraduate experience. Most of the Royal Colleges are working on assessment frameworks that describe a progression in terms of level of expertise as trainees move through specialty training.

Figure 4: Bloom’s taxonomy (in order of increasing expertise, with example verbs: Knowledge - define; Comprehension - describe, discuss; Application - demonstrate; Analyse - order; Synthesis - integrate, design; Evaluate - appraise, discriminate)

d) Is the clinical content clearly defined?
Once the purpose of the assessment is agreed, test content must be carefully planned against the curriculum and intended learning outcomes, a process known as blueprinting (12, 17, 18) (see also page 28). Content needs careful planning to ensure trainees are comprehensively and fairly assessed across their entire training period. The aim of an assessment blueprint is to ensure that sampling within the assessment system ensures adequate coverage of:

i) A conceptual framework - a framework against which to map assessment is essential. PMETB recommends Good Medical Practice (GMP) (19) as the broad framework for all UK postgraduate assessments.

ii) Content specificity - blueprinting must also ensure that the contextual content of the curriculum is covered. Professionals do not perform consistently from task to task or across the range of clinical content (20). Wide sampling of content is essential (13). Sampling broadly to cover the full range of the curriculum is of paramount importance if fair and reliable assessments are to be guaranteed. Schuwirth (3) and van der Vleuten highlight the importance of consideration of the context as well as content of assessment. Context-rich methods test application of knowledge, whereas context-free questions test only the underpinning knowledge base.

iii) Selection of assessment methods once blueprinting has been undertaken should take account of the likely educational impact; the nature of assessment methods will influence approaches to learning as well as the stated content coverage. Blueprinting is essential to the appropriate selection of assessment methods, i.e. it is not until the purpose and the content of the assessments have been decided that the assessment methods should be chosen. Individual assessment instruments should be chosen in the light of the content and purpose of that component of the assessment system.

e) Triangulation - how do the different components relate to each other to ensure educational impact is achieved?
It is important to develop an assessment system which builds up evidence of performance in the workplace and avoids reliance on examinations alone. Triangulation of observed contextualised performance tasks of ‘Does’ can be assessed alongside high stakes competency based tests of ‘Shows How’ and knowledge tests where appropriate (Figure 5).

PMETB will require evidence of how the different methods used relate to each other to ensure an appropriate educational balance has been achieved.

Figure 5: Triangulation (triangulate evidence from work based assessment and exams)

Cost, acceptability and feasibility
PMETB recognises that assessment takes place in a real world where pragmatism must be balanced against assessment idealism. It is essential that consideration is given to issues of feasibility, cost and acceptability, recognising that assessments will need to be undertaken in a variety of contexts with a wide range of assessors and trainees. These factors should be part of your explanation to justify the design of your assessment package.

a) Feasibility and cost
Assessment is inevitably constrained by feasibility and cost. Trainee numbers, venues for structured examinations, the use of real patients, timing of exit assessments and the availability of assessors in the workplace all place constraints on assessment design. Explicit consideration of feasibility is an essential part of evaluation of any assessment programme. Management of the overall assessment system including infrastructure to support it is an important contributor to feasibility. All assessments incur costs and these must be acknowledged and quantified. In general, centralisation is likely to increase cost effectiveness.

b) Acceptability
Both the trainees’ and assessors’ perspective must be taken into account. At all levels of education, trainees naturally tend to feel overloaded by work and prioritise those aspects of the curriculum which are assessed. To overcome this, the assessment package must be designed to mirror and drive the educational intent. It must be acceptable to the learner. The balance is a fine one. Creating too many burdensome, time consuming assessment ‘hurdles’ can detract from the educational opportunities of the curriculum itself (21, 22).

Consideration of the acceptability of assessment programmes to assessors is also important. All assessment programmes are dependent on the goodwill of assessors who are usually balancing participation in assessment against many other conflicting commitments. Formal evaluation of acceptability is an important component of QM and approaches to this should be documented.

The high stakes of professional assessments need to be acknowledged not only for potential colleagues of those being assessed and trainees, but also for the general public on whom professionals practice. It is therefore essential that any assessment system is transparent, understandable and demonstratively comprehensible to the general public as well as other stakeholders.

Chapter 2: Transparent standard setting in professional assessments

Introduction
Standard setting is the process used to establish the level of performance required by an examining body for an individual trainee to be judged as competent. This might be simply in the recall or (preferably) the application of factual knowledge, competence in specific skills or technical procedures, performance, day in day out, in the workplace, or a combination of some or all of these. Whatever the aspect and level of performance, the standard is the answer to the question, ‘How much is enough?’ (23). In other words, it is the pass mark or, in North America, the cut or cutting score, and is the point that separates those trainees who pass the assessment from those who do not.

However, it must be recognised that although the concept of standard setting might seem straightforward, its methods and the debate surrounding them are not. In fact, there is a cohort of educational academics (such as Gene Glass, 1978 (24)) who are highly critical of the whole concept of standard setting. Certainly, they are not without a case. It is widely known that the standard set for a given test can vary according to the methods used. Experience also shows that different assessors set different standards for the same test using the same method. The aim of this guide, however, is to be practical rather than philosophical and, therefore, PMETB agrees with Cizek’s conclusion that ‘the particular approach to standard setting selected may not be as critical to the success of the endeavour as the fidelity and care with which it is conducted’ (25). In other words, assessors should choose methods that they are happy with.

The literature describes a wide variation of methods (26) and the procedures for many are set out very clearly (27). However, even before describing the processes and outcomes of standard setting, it quickly becomes plain that there is no single best standard setting method for all tests, although there is often a particularly appropriate method for each assessment. Nevertheless, there are three main requirements in the choice of method. It must be:
• defensible, to the extent that it can assure the stakeholders about its validity;
• explicable, through the rationale behind the decisions made;
• stable, as it is not defensible if the standards vary over time (28).

Simply selecting the most appropriate method of standard setting for each element in an examination is not enough. It is essential that the people using those methods are appropriately trained and approach the task in a fair and professional manner. Moreover, the selection and training of the judges or subject experts who set the standard for passing assessments is as important as the chosen methodology (29).

It should be noted, too, that there will almost inevitably be a group of trainees with marks close to the pass mark that the assessment cannot reliably place on one side or the other. Having considered some methods for standard setting, this section will discuss reliability and measurement error, and how to identify these borderline trainees.

Types of standard
There are two different kinds of standard - relative and absolute. Relative standards are based on a comparison between trainees and they pass or fail according to how well they perform in relation to the other trainees. An exam in which there is a fixed pass rate (for example, the top 80% or the top 200 trainees pass) uses a relative standard. By contrast, when an absolute standard is applied trainees pass or fail according to their own performance, irrespective of how any of the other trainees perform. It is generally accepted that unless there is a particularly good reason to pass or fail a predetermined number of trainees, an absolute standard (based on individual trainee performance) should be used.

The use of relative standards might result in passing trainees with little regard to their ability. For example, if all the trainees in a cohort were exceptionally skilled, the use of norm-referenced standards (passing the top n% of the trainees) would result in failing (i.e. misclassifying) a certain proportion who, in fact, possess adequate ability. This is certainly unfair and at variance with the purpose of a test of competence. As mentioned above, since relative standards will vary over time with the ability of the trainees being assessed, the reliability of any competence based classifications could be questionable. Therefore, if valid measures of competence are desired, it is essential to set standards with reference to some absolute and defined performance criterion (30).

The methods described in this chapter are for setting absolute standards. This is because absolute or criterion-referenced standards are preferred for any assessment used to inform licensing decisions. Since PMETB’s priority is to ensure that passing standards are set with due diligence and at sufficiently robust levels as to ensure patient safety, standards should be set using absolute methods.
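The difference between the two kinds of standard can be illustrated with a short sketch. The cohort, the 80% pass fraction and the cut score of 65 below are invented purely for illustration and are not PMETB figures; the function names are likewise hypothetical.

```python
def pass_relative(scores, top_fraction):
    """Relative (norm-referenced) standard: pass the top fraction of the cohort,
    whatever marks those trainees actually achieved."""
    n_pass = round(len(scores) * top_fraction)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:n_pass])


def pass_absolute(scores, cut_score):
    """Absolute (criterion-referenced) standard: each trainee passes or fails
    on their own mark, irrespective of how the others perform."""
    return {name for name, mark in scores.items() if mark >= cut_score}


# Hypothetical, uniformly strong cohort (marks out of 100).
cohort = {"A": 82, "B": 78, "C": 75, "D": 71, "E": 70}

relative = pass_relative(cohort, top_fraction=0.8)  # the top 80% pass
absolute = pass_absolute(cohort, cut_score=65)      # assumed cut score

print(sorted(relative))  # trainee E is failed despite scoring above 65
print(sorted(absolute))  # every trainee who meets the criterion passes
```

In this invented cohort the relative standard fails trainee E, who would have met the absolute criterion, which is the misclassification described above.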

Absolute standard setting methods can be broadly classified as either assessment centred or individual trainee (performance) centred. In test centred methods, theoretical decisions based on test content are used to derive a standard, whereas in trainee centred methods, judgments regarding actual trainee performance are used to determine the passing score.

Prelude to standard setting
Before the standard can be set there must be agreement about the purpose of the assessment, what will be assessed and how, and the level of expertise that trainees might be expected to demonstrate. In order to set the standard, three things need to be established:
• the purpose of the assessment;
• the domains to be assessed;
• the level of the trainee at the time of that particular assessment.

However, before doing so it is necessary to point out that there has been something of a tradition in UK medical education for standards to be set without giving proper consideration to these points. Indeed, despite all the talk about maintaining (or, contemporarily, ‘driving up’) standards, the process of determining what the standards are has been an extraordinarily lax affair (31). For example, there are still examinations in which the pass mark has been set quite arbitrarily, often long before the exams themselves have even been written, and subsequently been enshrined in the regulations, making change particularly difficult. There are ways around this, but it would still be far better if pass marks were not predetermined in this way. Moreover, the examination methods are often also stipulated, but not the content of the exam. Correctly, what is to be assessed should be established before selecting the methods, in accordance with Principle 2 (1).

Clearly, such procedures are quite unacceptable in contemporary postgraduate medical education, particularly when the consequences of passing or failing assessments can be so important. PMETB requirements are proving to be a powerful incentive in bringing about long overdue improvements, and examining bodies are increasingly striving to ensure that all aspects of their assessments are conducted properly, defensibly and transparently. Standard setting is an important element in this.

For example, the assessment might be made to check that progress through the curriculum is satisfactory, or to identify problems and difficulties at an early stage when they will probably be easier to resolve. On the other hand, it might be to confirm completion of a major stage of professional development, such as graduation or Royal College membership.

The content of the assessment will be determined by the curriculum. In formal assessments as part of set piece examinations, the content should reflect the relative importance of aspects of the curriculum, so that essential and important elements predominate. This will also enable the standard setters to agree on the trainees’ expected level of expertise. Experience has shown that the level of expertise is very often overestimated, particularly by content experts. This overestimate of trainees’ ability can lead to the pass mark being set unrealistically high, or to the assessment containing an excessive proportion of difficult items, and is one of several good reasons why assessors should undertake the assessment themselves before they set the standards.

Contrary to the belief held by many assessors that difficult exams sort out the best trainees, the most effective assessment items are generally found to be those that are moderately difficult and a good discriminator through covering a wide sample of the prescribed curriculum. Even trainee centred methods of standard setting, such as the borderline group method and the contrasting groups method (described below and expanded upon in Appendix 2), depend on the discriminatory power of the items - individually in the borderline group method and across the exam as a whole in contrasting groups.

Based on these characteristics, there are various established methods of standard setting that can be used. The methods described below are principally used as part of formal examinations, but this guide has also discussed some issues of standard setting in workplace based assessment. It is implicit that each of these methods will ensure that all performance test material will be subjected to standard setting as an integral part of the test development. In formal examination style assessments the standard will take account of the importance and difficulty of the individual items of assessment and a method is described which includes this consideration. In the case of workplace based assessment, these might be described in terms of competencies and other observable behaviours and rated against descriptions of levels of performance. The standard in workplace based assessment is therefore usually determined by specific levels of performance for items often prescribed on a checklist and this is discussed below.

Standard setting methods
There are several methods for standard setting. They can be seen as falling into four categories:
• relative methods;
• absolute methods based on judgments about the test items;
• absolute methods based on judgments about the trainees;
• combined and compromise methods.

As indicated above, this guide will not consider relative methods in any more detail. This leaves methods based on judgment about the trainees, individual assessment items and combined and compromise methods. In test centred methods, theoretical decisions based on test content are used to derive a standard, whereas in trainee centred methods judgments regarding actual trainee performance are used to determine the appropriate passing score.

Test based and compromise methods consist of the three main methods of standard setting in formal knowledge based exams, though there are several variations of each method. The simplest test based method is Angoff’s (32). Ebel’s (33) method is slightly more complicated, yet probably leads to a better examination design, especially when building and managing question banks. Both are based on judgments about the assessment items. This guide also describes two trainee based methods and the Hofstee method, a combined/compromise method which is more complex and best used with large cohorts of trainees. Trainee based methods are currently gaining in popularity.

Test based methods
In general, test based methods require assessors to act as subject experts to make judgments regarding the anticipated performance of ‘just passing’ trainees (i.e. a ‘borderline pass’) on defined content or skills. The Angoff procedure is probably the best known example of an assessment centred method and has subsequently undergone various modifications.

1) Angoff’s method
Originally developed for standard setting in multiple choice examinations, this method has also been used to set standards on the history taking and physical examination checklist items that are often used for scoring cases in skills assessments. Here, the assessors are required to make judgments as subject experts as to the probability of a ‘just passing’ trainee answering the particular question or performing (correctly) the indicated task. The assessors’ mean scores are used to calculate a standard for the case.

However, this method is better suited to standard setting in knowledge tests as it has some significant disadvantages for performance testing. For example, it is very time consuming and labour intensive, especially when there are multiple checklists. Second, and more importantly, individual checklist items are often interrelated within a task. This means that essentially, the assessors’ judgments are not totally independent, potentially invalidating the use of this method for setting standards (34). An alternate method is to instruct the subject experts to make assessments regarding the number of checklist items that a ‘just passing’ trainee would be expected to obtain credit for. While this may reduce the problem of checklist item dependencies and substantially shorten the time taken to set standards, the task of deciding how many items constitute a borderline pass remains challenging with regard to rules of combination and compensation. Thirdly, the use of this method makes the implicit assumption that ratings on tasks are independent. However, this assumption is often untenable with performance assessments because of the phenomenon of case specificity. As a result, since the resulting standard is a mean assessment across items and/or tasks, the precision of the standard derived may be compromised. It may also yield too stringent standards. See Appendix 2 for more details.

2) Ebel’s method
Only slightly more complicated than Angoff’s method, Ebel’s method can be considerably more useful in practice. In its original form, Ebel’s method was suitable for simple right/wrong items, such as multiple choice questions (MCQs) of the ‘one best answer’ type. Holsgrove and Kauser Ali’s (35) modification of Ebel’s method was developed for a large group of postgraduate medical exams, namely the Membership and Fellowship exams of the College of Physicians and Surgeons, Pakistan. This modification has the advantage of not only helping examiners to set a passing standard, but also to produce an examination with appropriate coverage of essential, important and supplementary material, with a balance of difficult, moderate and easy items. These three factors are important in improving the stability of examinations when repeated many times. The Holsgrove and Kauser Ali modification allows it to be used for more complex items such as OSCE stations and work is underway to explore its potential for workplace based assessment. See Appendix 2 for further details.
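The arithmetic behind an Angoff standard might be sketched as follows. This is a minimal illustration rather than the procedure set out in Appendix 2: it assumes each assessor supplies, for every item, a probability between 0 and 1 that a ‘just passing’ trainee would answer correctly, and the ratings shown are invented.

```python
from statistics import mean


def angoff_cut_score(ratings):
    """Return the expected mark of a 'just passing' trainee.

    ratings[j][i] is assessor j's judgment of the probability that a
    'just passing' trainee answers item i correctly (hypothetical input format).
    """
    n_items = len(ratings[0])
    if any(len(row) != n_items for row in ratings):
        raise ValueError("every assessor must rate every item")
    # Average the assessors' judgments item by item, then sum over items.
    item_means = [mean(row[i] for row in ratings) for i in range(n_items)]
    return sum(item_means)


# Two assessors rating a four item paper (invented figures).
ratings = [
    [0.50, 0.75, 0.25, 1.00],
    [0.50, 0.50, 0.25, 0.75],
]

cut = angoff_cut_score(ratings)
print(cut, cut / 4)  # pass mark in items, and as a proportion of the paper
```

Summing the item means treats the items as independent, which is the assumption the text above warns is often untenable for checklist items within a single performance task.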

Trainee or performance based methods
Trainee or performance based methods have been used increasingly as the standard setting method of choice in clinical skills and performance assessments. These methods are more intuitively appealing to assessors as they afford greater ease when making judgments about specific performances. Instead of providing judgments based on test materials, the panel of subject experts is invited to review a series of trainee performances and make judgments about the demonstrated level of proficiency. Additionally, assessors find the process and results more credible because the standard is derived from judgments based on the actual test performances (30).

1) Borderline group method
Described by Livingston and Zieky in 1982 (23), this method requires expert judges to observe multiple trainees on a single station or case (rather than following a single trainee around the circuit) and give a global rating for each on a three point scale:
• pass;
• borderline;
• fail.

The performance is also scored, either by the same assessor or another, on a checklist. The global ratings using the three point scale are used to establish the checklist ‘score’ that will be used for the passing standard. Trained simulated patients might be considered sufficiently expert to serve as assessors, especially in communication and interpersonal skills. A variety of modifications of this method exist (36), but it is important in all of them that the examiners must be able to determine a borderline performance level of skills in the domain sampled. See Appendix 2 for more detail.

2) Contrasting groups method
Procedures, such as the contrasting groups method (37) and associated modifications, have focused on the actual performance of contrasting groups of trainees identified by a variety of methods. This method requires that the trainees are divided into two groups which can be variously labelled as pass/fail, satisfactory/unsatisfactory, competent/not competent, etc. There are various ways of doing this, for example as external criteria or specific competencies set out in the curriculum and specified on the multiple item score sheet. Assessors rate each trainee’s performance at each station or case, using a specific score sheet for each. However, the group into which they are placed depends on their global rating across the performance criteria.

After the assessment, scores from each of the two contrasting groups are expressed graphically and the passing standard is provisionally set where the two groups intersect. In practice this almost always produces an overlap in the score distributions of the contrasting groups. However, this method allows for further scrutiny and adjustment so that if the point of intersection is found to allow trainees to pass who should have rightly failed (or vice versa) the pass mark can be adjusted appropriately. See Appendix 2 for more detail.

Combined and hybrid methods - a compromise
There are various approaches to standard setting that combine aspects of other methods or, in the case of the Hofstee method described below, both relative and absolute methods. A proposed hybrid method is described later in this section.

Hofstee’s method
This is probably the best known of the compromise methods, which combines aspects of both relative and absolute standard setting. It takes account of both the difficulty of the individual assessment items and of the maximum and minimum acceptable failure rate for the exam and was designed for use in professional assessments with a large number of trainees. See Appendix 2 for more detail.

Hybrid standard setting in performance assessment
Unlike knowledge assessments, which have been extensively researched to guide their standard setting methods and whose standards can be determined by a defined group of modest size, performance assessments have not been so well informed by a standard setting evidence base. Neither test nor trainee based approaches can readily be applied at the case level to a high fidelity clinical assessment in which standards are implicit in the grading of the individual cases.
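The two trainee based methods can be sketched in a few lines. Both functions rest on assumptions that are not stated in this guide: the borderline group standard is taken here as the mean checklist score of the performances rated ‘borderline’, and the graphical intersection of the contrasting groups is approximated by the cut score that misclassifies the fewest trainees (ties resolved in favour of the lower score). All data and names are invented.

```python
from statistics import mean


def borderline_group_cut(results):
    """results: (global_rating, checklist_score) pairs for one station, with
    global_rating one of 'pass', 'borderline' or 'fail'."""
    borderline = [score for rating, score in results if rating == "borderline"]
    if not borderline:
        raise ValueError("no borderline performances were observed")
    return mean(borderline)


def contrasting_groups_cut(pass_scores, fail_scores):
    """Provisional cut score: the observed score that misclassifies the fewest
    trainees, standing in for the point where the two distributions intersect."""
    candidates = sorted(set(pass_scores) | set(fail_scores))

    def misclassified(cut):
        wrongly_failed = sum(s < cut for s in pass_scores)
        wrongly_passed = sum(s >= cut for s in fail_scores)
        return wrongly_failed + wrongly_passed

    return min(candidates, key=misclassified)


station = [("pass", 18), ("borderline", 12), ("borderline", 14),
           ("fail", 8), ("pass", 16), ("borderline", 13)]
borderline_cut = borderline_group_cut(station)

contrast_cut = contrasting_groups_cut(pass_scores=[14, 15, 16, 18, 12],
                                      fail_scores=[8, 10, 11, 13])
print(borderline_cut, contrast_cut)
```

Because the two score distributions overlap, the contrasting groups figure is only provisional; as the text notes, it can be moved after scrutiny if it passes trainees who should have failed, or the reverse.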

A hybrid approach has been proposed (38) and is described below.

Standard setting for skills assessments
Numerous standard setting methods have been proposed for knowledge tests, such as multiple choice examinations (32). However, they are not necessarily appropriate for standard setting in performance assessments. The borderline group and contrasting groups methods of standard setting are well suited to standard setting in controlled assessment systems testing skills and performance. A further method is reviewed below.

Performance or clinical skills assessments are playing an increasingly important role in making certification or licensing decisions. Therefore, justifiable standards must be set. In order for the pass/fail decisions to be fair and valid, it is a prerequisite that due care is taken to ensure that the assessments are standardised, the scores are accurate and reliable, and the resulting decisions regarding competence are realistic, fair and defensible.

However, the methodology is not as well developed for performance standard setting as it is for knowledge based assessments and the influence that the assessor panel has on the process appears to be a greater factor in this form of assessment. A range of standard setting methods have been used in skills assessments, using standardised patient assessments in simulated encounters or OSCEs. Whilst the underlying principles, such as identification of the borderline or ‘just passing’ trainee, are common, this has been shown to yield different (30) and inconsistent results (39), particularly if different sets of assessors are also used (30). In addition, assessments of complex and integrated skills, which need to include assessments on the performance of whole tasks, for example, a clinical case, pose considerable standard setting challenges.

However, one potential shortcoming of trainee centred methods, usually attributable to insufficient training of experts as assessors, is the tendency to attribute performance based on skills or factors that are not directly targeted by the assessment. The attribution of positive ratings based on irrelevant factors (halo effects - ‘he or she was very kind and polite’) is one such phenomenon. This, and other potential sources of assessor bias, can be addressed by offering adequate training to assessors about making judgments. In order to set acceptable standards, it is imperative that the assessor’s task is clear and unambiguous and that any misinterpretation of the task is rectified. It is imperative to emphasise the need to concentrate the assessor’s attention on the pass/fail decision to ensure that they are properly informed as to the definition of ‘just passing’ behaviour.

A proposed method for standard setting in skills assessments using a hybrid method
Assessors must observe either trainees’ actual performance - for example using a DVD, VCR or an authentic simulation which includes a suitable breadth of trainee’s performance - and then make a direct assessment concerning competence based on such particular performance. For clinician standard setters, this task is intuitively appealing, as it articulates with their clinical experience.

Assessors are required to identify a passing standard of performance on each individual assessment that they observe. The passing standard for each individual assessment, using generic and case specific guidance, including the provision of suitable performance descriptors, is set by the assessors as a result of their expertise, training and insights into the performance of ‘just passing’ trainees in real life. Each case would thus be passed or failed.

The standard setting issue then relates to how many individual assessments need to be passed (and/or not failed) in order to pass the assessment as a whole. This is the point at which idealism meets reality and it presents us with a problem. Thus, on the one hand there is a requirement for the standards to be defensible, explicable and stable (28), yet reported problems with both the methods and their implementation (30, 39). In order to address this, the panel of assessors as subject experts would collectively agree on the standard to pass, and it will do this by agreeing a decision algorithm, converting the individual assessment grades to pass/fail overall. Working as a group, assessors can discuss and establish a collective and defensible recommendation for what constitutes a passing standard. Methods such as the Delphi technique, which helps avoid over influencing of decisions by powerful or dominant characters, could facilitate this process. The assessors could carefully review the pass/fail algorithm and collectively support it.

The standard may then be verified and refined by the application of either performance based approach to examples of individual assessments and/or the trainee’s overall performance in the full battery of assessments. The standard may then be modified by the application of additional criteria (see below) or appropriate statistical management, ensuring that the standard is stable over time (‘linear test equating’).

1) Grading system for individual assessment of clinical cases

This section is based on the paper by Wakeford and Patterson (38) mentioned above, to whose work David Sales contributed. For pass/fail licensing decisions, designed to confirm that a doctor is sufficiently safe and proficient to undertake unsupervised independent practice, the trainee’s performance on each of the skills assessments should be assessed in a specified number of domains (such as history taking, examination, communication or practical skills, etc) and each of these graded on a scale, for illustrative example, using four points as follows:
• clear pass;
• bare (marginal) pass;
• bare (marginal) fail;
• clear fail.

There will also be a global, overarching judgment for each individual assessment, which will be the overall grade for that particular assessment. The overall assessment grade for a particular case or scenario will use the same four grades but, since there is no borderline grade, marginal fails and marginal passes can be seen as fails and passes respectively. This overall individual assessment grade is not determined by the simple aggregation of the domain grades, as that would imply equal weight to each. For example in resuscitation assessment, communication with the patient would be far less important than ensuring a clear airway. Although the individual domain grades may be taken into account, the fact that one domain was weakly represented in that individual assessment will also need to be accounted for, by the assessor.

The essential focus of the assessment is upon the trainees’ global performance during a particular overall unit of assessment and not on their performance on the necessarily artificial constructs of the domains within that individual assessment. The essence of such a grading system is that it is based on expert assessments of what is acceptable behaviour in the overall passing criteria of the assessment. It is an accurate overall assessment grade which is the key to its producing a credible overall result.

2) Scoring methods

Such an assessment would produce a number of scores based on the four point system for any individual trainee. Converting a number of scores into an overall result is not straightforward. It raises the following issues:
• psychometric - especially of compensation vs combination;
• stakeholder - including what constitutes competence;
• institutional - including financial issues of pass rates;
• trainee - there will be argument about the relative scores attached to individual assessments.

In view of the complexity of these issues, it is not surprising that there is no easy answer to the problem of converting a series of individual assessments scores into overall assessment results.

3) Is turning the results of assessments into a series of numbers likely to help?

It is clearly necessary to combine the total number of assessment scores in some way so as to produce an overall pass/fail standard. This process must be fair and must ensure that unacceptable combinations of individual assessment ‘grades’ do not lead to a pass, particularly where there are a number of ‘marginal’ grades involved. The end can be achieved in two ways:
• A set of rules can be devised, which might say something like ‘to pass, a trainee must have at least n clear passes and no clear fails’ or ‘allowing compensation between clear passes and clear fails, and between marginal passes and marginal fails, the trainee shall pass if they have a neutral ‘score’ or above’, with possible codicils such as ‘… n clear fails will fail’. This might be termed a categorical approach.
• Alternatively, a scoring system could be devised with different marks being given for each grade (such as 0, 1, 2, 3, or 4, 8, 10, 12) and a pass mark set.

The main difficulty with the former is that it does not produce a ‘score’ that could subsequently be processed statistically. The difficulties with the latter are that, being non-specific, it may well not prevent the unacceptable combinations - such as ‘case blackballing’.
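The difference between the categorical and the numerical approach can be made concrete with a minimal sketch. It assumes the four grades named above, the 0 to 3 marking example from the text (which grade receives which mark is an assumption made here), and a hypothetical rule and pass mark chosen only for illustration; it is not a recommended algorithm.

```python
from collections import Counter

# Marks per grade. The 0-3 scale is one of the examples in the text;
# the assignment of marks to grades is an assumption for illustration.
MARKS = {"clear fail": 0, "marginal fail": 1, "marginal pass": 2, "clear pass": 3}


def categorical_pass(grades, min_clear_passes, max_clear_fails=0):
    """Rule-based decision: enough clear passes and not too many clear fails."""
    counts = Counter(grades)
    return (counts["clear pass"] >= min_clear_passes
            and counts["clear fail"] <= max_clear_fails)


def numerical_pass(grades, pass_mark):
    """Score-based decision: sum the marks and compare with the pass mark."""
    return sum(MARKS[g] for g in grades) >= pass_mark


# Hypothetical trainee: three clear passes and one clear fail over four cases.
grades = ["clear pass", "clear pass", "clear pass", "clear fail"]

rule_result = categorical_pass(grades, min_clear_passes=2)  # blocked by the clear fail
score_result = numerical_pass(grades, pass_mark=8)          # 3 + 3 + 3 + 0 = 9
print(rule_result, score_result)
```

With these illustrative settings the rule rejects the combination because of the single clear fail, while the summed score lets it through by compensation. This is the kind of unacceptable combination that a purely numerical scheme may not prevent.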

In practice, the numerical approach can possess all the advantages of the categorical approach, if the passing ‘score’ approaches the maximum score. In this situation the examiners could consider the scoring approach, otherwise a categorical combination algorithm may be preferable.

Standard setting for workplace based assessment

Assessment in the workplace

Assessment of a trainee’s performance in the workplace is extremely important in ensuring competent practice and good patient care, and in monitoring their progress and attainment. However, it is a relative newcomer to the assessment scene, particularly in the UK, and is probably not particularly well suited to the kind of standard setting methods described above and conventionally applied in well controlled ‘examining’ environments. It seems inevitable that standard setting in workplace based assessment will be an area of considerable research and development activity over the next few years. As mentioned earlier, work is underway to evaluate the contribution that the Holsgrove and Kauser Ali modification to Ebel’s method might make in this area (… 4) but at present the most promising approach is probably to use anchored rating scales with performance descriptors.

Anchored rating scales

An anchored rating scale is essentially a Likert-type scale with descriptors at various points - typically at each end and at some point around the middle, to help assessors to achieve accuracy and consistency. For example, for the rating scales used in most of the assessment forms in the Foundation Programme, descriptors of performance for each item at point 1 (the poorest rating), 6 (the highest) and 4 (the standard for completion) might be deemed sufficient. This is particularly appropriate with competency based curricula where intended learning outcomes are described in terms of observable behaviours, both negative and positive.

To take an example from the assessment programme in the specialty curriculum of the Royal College of Psychiatrists (http://www.rcpsych.ac.uk/training/workplace-basedassessment/wbadownloads.aspx) the descriptors for performance at ST1 level in the mini-CEX (which RCPsych have modified for psychiatry training and renamed mini-Assessed Clinical Encounter [mini-ACE]) describe performance in history taking at points 1, 4 and 6 on the rating scale:
1) Very poor, incomplete and inadequate history taking.
4) Structured, methodical, sensitive and allowing the patient to tell their story; no important omissions.
6) Excellent history taking with some aspects demonstrated to a very high level of expertise and no flaws at all.

However, the performance descriptors used also describe the other three points on the scale:
1) Very poor, incomplete and inadequate history taking.
2) Poor history taking, badly structured and missing some important details.
3) Fails to reach the required standard; history taking is probably structured and fairly methodical, but might be incomplete though without major oversights.
4) Structured, methodical, sensitive and allowing the patient to tell their story; no important omissions.
5) A good demonstration of structured, methodical and sensitive history taking, facilitating the patient in telling their story.
6) Excellent history taking with some aspects demonstrated to a very high level of expertise and no flaws at all.

Decisions about borderline trainees

It is very important that there is a proper policy, agreed in advance, regarding the identification of borderline trainees and, having identified them, what to do about them. Decisions about what happens to borderline trainees once they have been correctly identified should rest with individual assessment boards. However, they must be fair, transparent and defensible, and it is essential that borderline trainees on both sides of the pass mark are treated in exactly the same way.

Making decisions about borderline trainees

Some common current practices for making decisions about borderline trainees are unacceptable and indefensible. For example, the only groups of borderline trainees usually considered at present are those within, say, a couple of percentage points of the pass mark (the range is typically arbitrary rather than evidence based). Borderline ‘passes’ are usually ignored - they are treated as clear passes even though mathematically there is always a group who happen to fall on the ‘pass’ side of the cutting point who, in terms of confidence intervals, are interchangeable with a similar group on the ‘fail’ side. The traditional practices will no longer suffice.

The criteria of fairness, transparency and defensibility would probably exclude one of the most common ways of making decisions about borderline trainees, which is to conduct a viva voce examination. The vast majority of vivas are plagued with problems, such as inconsistency within and between assessors, variability in material covered (not infrequently asking about things that are not in the curriculum) and, above all, extremely poor reliability. It is clearly inappropriate to use perhaps the least reliable assessment method to make pass/fail decisions that even the most reliable methods have been unable to make. Clear and defensible procedures are needed for identifying and making decisions about borderline trainees.

Summary

This chapter is concerned with issues arising from the requirement for assessments to comply with the PMETB Principles (1), and, in particular, the two questions that must be addressed in meeting Principle 4:
i) What is the measurement error around the agreed level of proficiency?
ii) What steps are taken to account for measurement error, particularly in relation to borderline performance?

The first step, identified in the title itself, is to establish what the level of proficiency actually should be - in other words, how is the standard agreed? This chapter has outlined some of the principles of standard setting and described, in Appendices 1 and 2, some methods for establishing the pass mark for assessments, both in the examination hall and in the workplace.

However, PMETB’s Principles require more than having a standard that has been properly set. This chapter also described how, even when the pass mark has been correctly set, there will almost certainly be a group of trainees with marks on either side of it who cannot be confidently declared to have either passed or failed. This is because all assessments, like all other measurement systems, inevitably have an element of measurement error. Appendices 1 and 2 describe how this measurement error can be calculated and note that it is often surprisingly large. The appendices serve to illustrate how measurement error can be reduced by improving the reliability of the assessment. Having calculated the measurement error for their assessment, assessors can identify the borderline trainees. The chapter also pointed out that they will need to have agreed in advance how decisions about the borderline trainees will be made.

By breaking the requirements of Principle 4 down into three elements:
• How is the standard set?
• How is the measurement error calculated?
• How are borderline trainees identified and treated?
PMETB hopes that this chapter will be helpful in assisting those responsible for assessing doctors to ensure that their assessments meet the required standards.

Conclusion

The majority of methods of standard setting have been developed for knowledge (MCQ-type) tests and address the need for setting a passing score within a distribution of marks in which there is no notional pre-existing standard. Much of the currently published evidence relating to standard setting in performance assessments relates to undergraduate medical examinations, which produce checklist scores that may have little other than conceptual relevance to skills tests which assess global performance relevant to professional practice.

Where performance assessments are used for licensing decisions, the responsible organisation must ensure that passing standards achieve the intended purposes (e.g. public protection) and avoid any serious negative consequences. Generalisability theory (7, 8, 40, 41) can be used to inform standard setting decisions by determining conditions (e.g. number of assessors, number of assessments and types of assessments) that would minimise sources of measurement error and result in a more defensible pass/fail standard. Regardless of the method used to set standards in performance assessments, it is imperative that data are collected both to support the assessment system that was used and to establish the credibility of the standard.
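The link between reliability, measurement error and the borderline group can be sketched in a few lines. The example below uses the classical test theory formula SEM = SD x sqrt(1 - reliability) and a 95% band of 1.96 SEM around the pass mark; this is a common textbook approach and may differ in detail from the method set out in the appendices. The function names and all figures (standard deviation, reliability, pass mark, scores) are hypothetical.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Classical test theory: SEM = SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def classify(score, pass_mark, sem, z=1.96):
    """Label a score relative to a band of z * SEM around the pass mark."""
    half_width = z * sem
    if score >= pass_mark + half_width:
        return "clear pass"
    if score <= pass_mark - half_width:
        return "clear fail"
    return "borderline"


# Hypothetical exam: SD of 8 marks, reliability 0.84, pass mark 60.
sem = standard_error_of_measurement(sd=8.0, reliability=0.84)
for score in (70, 63, 57, 50):
    print(score, classify(score, pass_mark=60, sem=sem))
```

Note that the band is symmetric, so trainees just above the pass mark are flagged as borderline in exactly the same way as those just below it. Raising the reliability shrinks the SEM and therefore the size of the borderline group.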

Chapter 3: Meeting the PMETB approved Principles of assessment

Introduction

PMETB’s Principles of assessment (1) are now well established. This chapter provides some of the background source material which will permit interested groups to begin to understand some of the thinking behind the Principles. PMETB is at pains to explain why this amount of work is demanded of already hard pressed groups who are trying, based on a background of established practice, to provide robust evidence which supports what they do.

For the purposes of this section ‘quality management’ (QM) replaces the traditional term ‘quality control’. Given the complexities of training and education as part of the delivery of medical services, our ability to directly control quality is inevitably challenging. Quality and the risks of falling short of achieving the required standards inherent in the Principles, however, must be managed. It is the role of postgraduate deaneries with training programme directors to manage quality. It is the role of PMETB to assure QM takes place at the highest level achievable in the circumstances.

Quality assurance and workplace based assessment

Workplace based assessment should comply with universal principles of QM and quality assurance that one might expect in any assessment. However, the position of workplace based assessment is somewhat different. This might be summarised as being an assessment methodology that has demonstrably high validity, whilst being more challenging in terms of its reliability. This is not to say it should not be quality managed but because of its nature there are issues concerning its best use.

The main value of workplace based assessments is that they provide immediate feedback. The information acquired during a workplace based assessment can also provide evidence of progression of a trainee and therefore contribute evidence suitable for recording in their learning portfolio. This can then be compared to the agreed outcomes set by trainer and trainee in the educational agreement. It is essential that both trainer and trainee are aware that both feedback and assessment of performance that contributes to their learning portfolio of evidence are simultaneously taking place during workplace based assessment.

Although PMETB acknowledges that this dual role of a workplace based assessment is in some ways not ideal because it may inhibit the learning opportunity from short loop feedback, it is necessary to be pragmatic, particularly because the number of assessors and the time available for assessment is precious. Therefore, educational supervisors will sometimes be tutors or mentors and on other occasions will actually be assessors. The agreement reached within the PMETB Assessment Working Group is that this is acceptable provided it is entirely transparent to trainer and trainee in what circumstances they are meeting on a particular occasion.

However, in practice the evidence is that some workplace based assessments have very reasonable reliability. Norcini’s work has demonstrated that mini-CEX can have acceptable reliability with six to ten separate but similar assessments (based on 95% CIs using generalisability) (42). He emphasises the need to re-examine measurement characteristics in different settings and the need for sampling across assessors and clinical problems on which the assessments are based. The use of the 95% CI emphasises the need for more interactions where performance is borderline in order to establish whether the trainee is performing safely or not. In the case of borderline trainees, it makes sense that more assessments are required to distinguish between trainees who are in fact safe and those where doubts remain. Extensive sampling for borderline trainees may be needed to precisely identify the problems behind their difficulties so that a plan can be formed to find remedial solutions where possible.

It is very important that the assessors are trained. Holmboe described a training method which was in practice very simple in that it consisted of less than one day of intensive training (43). A number of groups have demonstrated that both multi-source feedback (MSF) from colleagues and patients can also be defensibly reliable, although larger numbers of patient assessors are needed than colleagues in these assessments (44-49).
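The reasoning that borderline trainees need more encounters than clearly competent ones can be illustrated with a simplified calculation. The sketch below is not a generalisability study: it assumes a known standard error for a single encounter (a hypothetical 0.9 scale points), independent encounters, and a normal approximation, so that the 95% CI around the mean rating narrows with the square root of the number of encounters. The six point scale and the standard of 4 are taken as illustrative values.

```python
import math
from statistics import mean


def mean_with_ci(ratings, sem_single, z=1.96):
    """Mean rating and an approximate 95% CI that narrows as encounters accumulate."""
    if not ratings:
        raise ValueError("at least one rating is required")
    m = mean(ratings)
    half_width = z * sem_single / math.sqrt(len(ratings))
    return m, m - half_width, m + half_width


def decision(ratings, standard, sem_single):
    """Compare the CI with the required standard."""
    _, low, high = mean_with_ci(ratings, sem_single)
    if low >= standard:
        return "meets standard"
    if high < standard:
        return "below standard"
    return "more assessments needed"


STANDARD = 4      # illustrative standard for completion on a 6 point scale
SEM_SINGLE = 0.9  # hypothetical standard error of one encounter

print(decision([5, 6, 5, 6, 5, 6], STANDARD, SEM_SINGLE))
print(decision([4, 5, 4, 4, 5, 4], STANDARD, SEM_SINGLE))
print(decision([2, 3, 2, 3, 2, 2], STANDARD, SEM_SINGLE))
```

Under these assumptions, six encounters settle the question for the first and third trainees, whereas the second trainee's interval still straddles the standard, so further sampling would be required before a confident decision could be made.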

The learning agreement

Workplace based assessments taken in isolation are of limited value. They should be contextualised to a learning agreement that refers to the written curriculum and sets the agenda for a particular training episode. A series of learning agreements must ensure that the whole of the curriculum is covered by the end of training. There will inevitably be small gaps and more overlaps, but by and large what trainer and trainee are creating is an agreement on a direction of travel which is usefully thought of as an educational trajectory. The aim and objective of the trajectory is to cover the whole of the curriculum to a level of competence defined by a series of outcomes, based on GMP (19).

Assessments will be used to provide evidence that the direction and pace of travel is timely, appropriate and valid. They will inform the educational appraisal which in turn informs the specialty trainee assessment process (STrAP) - referred to throughout this document as the annual assessment - which determines whether or not a trainee is able to proceed to the next stage of their training, or remediation is required. Normally, assessments will occur annually but in the case of remedial action being required for a trainee they may occur more frequently.

For example, in anaesthesia, where there is a big pool of trainers with trainees frequently rotating amongst them, the educational supervisor may not act as assessor for the whole or any of the training period. In many ways, this is ideal as it keeps the mentoring role and the assessing activity completely divorced to the particular advantage of effective mentorship. In trauma and orthopaedic surgery, on the other hand, trainees spend six months with a trainer and the vast majority of the time the trainer will also be the workplace based assessor. Provided it is clear when the trainer/assessor is in which role, then a practical way forward can be envisaged. Some assessments may be carried out by other assessors, other than the principal designated trainer; in this case, during a particular training period and over the year a trainee will have worked for at least two trainers who will also be acting as assessors.

This means that the trainee is aware that the individual workplace based assessments contribute to the whole judgment and they would not be ‘hung out to dry’ over one less than satisfactory assessment event. PMETB hopes this encourages the trainees and the trainers to develop an adult-to-adult learning and teaching style, so maximising the learning opportunities inherent in work based assessment.

Appraisal

After a suitable period of training, usually every four or six months, depending on the structure of training rotations, educational appraisal must take place based on the evidence provided by learning agreements and a large number of in-the-workplace based assessments. PMETB recognises there will be different ways of achieving the educational appraisal process that precedes submission of evidence to the annual assessment. It is important, therefore, to have an opportunity for the trainee and the group of trainers a trainee has worked with during the specified training period to review the evidence during an appraisal which is distinct and unequivocally separate from the annual assessment described below.

This is an indisputably formative review with the specific objective of ensuring there have been no immediate problems, such as the inability to have sufficient training or assessment opportunities and to provide timely feedback when difficulties of any kind have arisen. The aim is to resolve problems as soon as they are identified, rather than presenting difficulties which were otherwise remediable to the annual assessment. This formative, low stakes appraisal, therefore, has to take place at school or programme level and is best carried out locally where training during that period has actually taken place.

A structured trainer’s report of the whole period of work ahead of the evidence being submitted to the annual assessment must be made by the designated educational supervisor. This must contain evidence about the development of the trainee in the round and not just a list of completed assessments; the latter are really presented as supporting evidence. It has been made clear elsewhere in this document that assessment must not simply be a ‘summation of the alphas’ (referring to Cronbach’s alpha).

The reason for separation of appraisal from review is to ensure the subsequent annual assessment, which is a high stakes event determining whether progression in training can take place and which has external members relative to the training process, does not simply consider the future of a trainee on the basis of the raw data of a series of scores. The way in which appraisal is separated from review will vary from programme to programme and from discipline to discipline.

The annual assessment

PMETB expects deans and programme directors to ensure that the annual assessment (currently encompassed in RITA processes (50)) is a consistently well structured and conducted process. It is important that there is a demonstrably transparent and fair process for trainees and one that assures the general public that those treating them are fit for their purpose. The annual assessment is a high stakes event for a trainee, but should contain no surprises if the educational appraisal process has worked effectively.

[Figure: diagram with the labels Evidence, Appraisal, Assessment and Annual review]

The annual assessment should be a quality assuring exercise ensuring that the conclusions that the Head of Training and the trainers have reached about a particular trainee during the training interval are reasonable and the trainee has achieved the standards expected, as described in the structured trainer’s report. In doing this, the annual assessment panel would have to assure themselves that the evidence provided was appropriate. The panel would look at the evidence and would either confirm or might differ from the decision reached by the training programme director and their committee about a particular trainee. In the most part, this will be a paper or virtual exercise, although the Gold Guide suggests the option of reviewing borderline trainees should always be applied. This will be very high stakes because the exercise might result in a trainee being removed from training, being asked to repeat training or to have focused training.

The annual review process should have stakeholders from the deaneries, training programmes, external assessors and lay membership. Certainly, the postgraduate dean gives educational externality from a particular programme and internally the training programme director has an overview of a particular trainee. An example of good practice is when a member of the appropriate SAC from outwith the training programme would attend the annual review to give the process externality. The exact composition of annual assessment panels is laid out in the shortly to be published Gold Guide which replaces the current Orange Guide. The role of PMETB in this process is to be assured that the QM mechanisms and the decision making are appropriate.

The second part of the annual review will be a facilitatory event where the training programme director, in conjunction with members of his or her training committee and the trainee, would agree on the content of the next period of training, based on an overview of the trainee’s educational trajectory with the intention of completing more parts of the written curriculum. This must be based on a face-to-face discussion with at least one designated trainer or mentor.

Additional information

The annual assessment for trainees needs to take into account issues around health and probity not dealt with elsewhere. MSF may provide information about health and probity. With structuring, the review in effect provides all the required evidence for NHS appraisal processes.

Quality assuring summative exams

Colleges set exit examinations as a quality managing and assuring process. The exam must be as reliable as possible, which triangulates with evidence from a learning portfolio which includes workplace assessments and accumulated experience such as may be found in log books. This means that correlating exam results with workplace based assessment is one indicator that the process is working well (an example of a process described as triangulation).

Note, however, exams and workplace based assessments are not at all comparable assessments. It would be wrong therefore to assume one is the control or check on the other; they provide complementary information invaluable as part of a triangulation exercise. Inevitably, the artificially created environment of a summative college exam has less intrinsic validity. The corollary is that there is considerably more control over reliability.

What many formal exams are best at doing is testing knowledge and its application. Psychometric analysis demonstrates clearly that MCQs of various types do this most reliably. There is one reason why there is a trend to move to much longer assessments of knowledge, application of knowledge and clinical decision making. Clearly, clinicals and orals can contribute to formal examinations. Assessment experts such as Geoff Norman are not intrinsically against orals and clinicals, but simply point out that they require considerable work to make them robust and they need prolonged examination time to be sure they are reliable.
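Correlating exam results with workplace based assessment, mentioned above as one indicator in a triangulation exercise, can be sketched with a plain Pearson correlation coefficient. The function name and the paired scores are hypothetical and serve only to show the arithmetic. In keeping with the text, a coefficient of this kind is complementary evidence about the process, not a check of one assessment against the other.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of paired scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least two pairs")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list has no variation")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired data for six trainees: exam percentage and mean
# workplace based assessment rating on a 6 point scale.
exam_scores = [52, 61, 58, 70, 66, 75]
wba_means = [3.8, 4.2, 4.0, 4.9, 4.4, 5.1]

r = pearson_r(exam_scores, wba_means)
print(round(r, 2))
```

A markedly low or negative coefficient across a cohort would not show which assessment is at fault; it would only signal that the evidence sources disagree and that the process deserves closer review.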

In order to quality assure an assessment system based on formal examinations, it is necessary to address the following:
• Purpose of exam - should be explicit to examiners and trainees, and available in comprehensible form to the general public.
• Content of exam - should simply match the agreed syllabus and look to test at the level of ‘competent’, but might also encourage excellence.
• Selection of assessment instruments used in the exam - should meet the utility criteria laid out in Chapter 1.
• Question, answers and marking scheme - should be clear that the marking is either normative or criteria based. Most professional exams are criteria based.
• Standard setting procedures - should be selected and applied as explained above in Chapter 2. The standard should also be related to methodology and purpose, which determines:
  • marking and analysis of results;
  • checking and distribution of results;
  • provision of feedback.
• Running the exam - should be reasonable and of equal quality for all trainees so that they can perform to the best of their ability. Where clinical environments are used, the standards must not be compromised by everyday service activities going on around the exam. It is better to have reserved or purpose designed facilities and not impose an exam, for example, on a busy ward or clinic where unexpected events or schedules such as mealtimes or visiting impinge on the exam environment.
• Examination materials - need to be of proper quality and available to all examiners. Materials and props should be approved by examination boards and the introduction of new materials by individual examiners put through the same standard setting processes as any other exam material or question. For example, unshared computer images of which an individual examiner may be fond should only be permitted if the image meets criteria of standard, quality and viability agreed by all assessors.
• Conduct of assessors - should be scrutinised regularly in terms of behaviour and performance. Assessors of exams must, of course, be trained to be fit for purpose.

An example of good practice is to appoint examiner/exam assessors. These individuals should be experienced examiners who are appointed in open competition and with the approval of their peers. Their role would include:
• making multiple visits to inspect the venue or observe trainee assessments;
• sitting unobtrusively as observers and not interfering;
• evaluating assessors against transparent criteria.

They would be expected to prepare a report on the whole process and on individual assessors, which can be fed back to conveners and other assessors to assure consistency of performance. Useful feedback comments might concern:
• quotes and examples of positive and negative behaviour;
• interpersonal skills of assessors with trainee;
• level and appropriateness of questions;
• assessment technique.

The exam should also undertake a review of written policies available to trainees, assessors and as far as possible the general public. It is also necessary to have policies on the following:
• selection of examiners and officials;
• examiner training;
• test development;
• examination security;
• plagiarism and cheating;
• malpractice by college or trainees;
• mobile phones and electronic devices;
• data protection;
• documents, computers, firewalls and buildings.

Summary

PMETB recognises there is no perfect solution. Assessing professionals in the exacting working environment of healthcare means making some compromises, which PMETB accepts make some educators uncomfortable. PMETB assessment principles are predicated on the overriding value that, provided assessment instruments are valid and reliable, they need to assess people holistically and not just represent them as a set of results based on a battery of assessments. The intellectual thrust of the Assessment Working Group in PMETB is to respect the role of peer assessment from genuine experts. Provided these experts acknowledge they also need to learn how to be expert assessors and trainers as well as expert clinicians, there is a genuine way forward.

Chapter 4: Selection, training and evaluation of assessors

Introduction

The role of assessor is an important and responsible one for which individuals should be properly selected, trained and evaluated. All assessments, including work based assessment, must be taken seriously and their importance for the trainee and in terms of patient safety fully acknowledged. Very importantly, individuals undertaking assessment should recognise that they are professionally accountable for the decisions they make. Honest and reliable assessment is also essential in enabling assessors to fulfil their responsibilities in relation to GMP. Submission of assessment judgments which are not actually based on direct observation/discussion by the assessor with the trainee (e.g. handing the form to the trainee to fill in themselves, filling in a form on the basis of ‘I know you are OK’) is a probity issue with respect to GMP.

Selection of assessors

Selection of assessors should be undertaken against a transparent set of criteria in the public domain and therefore available to both assessors and trainees. Particularly in relation to work based assessment this may include guidance in relation to assessor characteristics such as grade or occupational group. Criteria for selection of assessors may include:
• commitment to the assessment process they are participating in;
• up-to-date, both in their field and in relation to assessment processes;
• understanding of assessment principles;
• willingness to undergo training;
• non-discriminatory and able to provide evidence of diversity training;
• willing and able to contribute to standard setting processes;
• willing and able to undertake assessments in a consistent manner;
• willing and able to deliver feedback effectively;
• willingness to have their performance as an assessor evaluated and to respond to feedback from this.

Assessor training

There is evidence that assessor training enhances assessor performance in all types of assessment (43, 51). All assessment systems should include a programme of training for assessors. All assessor training should be seen as a natural part of Continuing Professional Development (CPD) and based on evidence (52). Assessor training should include:
• an overview of the assessment system and specifics in relation to the particular area that is the focus of the training;
• principles of assessment, particularly with reference to the assessment process they are participating in, e.g. assessors for work based assessment will need to understand the principles behind work based assessment, and assessors participating in a standard setting group will need training specifically in standard setting methodologies;
• clarification of their responsibilities in relation to assessment, both specifically and more generally in terms of professional accountability;
• diversity training to ensure that judgments are non-discriminatory (or a requirement for this in another context);
• where assessors have a role which requires them to give face to face feedback to trainees, the importance of the quality of this feedback should be emphasised and provision for training in feedback skills made.

Given that assessors are often trainers in the same environment, it is important that it is made clear to trainees when they are acting as an assessor rather than a trainer.

It is recognised that for some types of assessment - in particular, large scale work based assessment - delivery of face-to-face training for all assessors is likely to take some time but as a minimum, written guidance and an explicit plan to deliver any necessary additional training should be provided. Cascade models for training of assessors where centralised training is provided and then cascaded out at a more local level are attractive and cost efficient, but ensuring standardisation of training is more difficult in this context. Provision of written/visual training materials and observation of local training will help achieve as much consistency as possible.

Evaluation of assessor training should be integral to the training programme and where concerns/gaps are raised in relation to training, these should be responded to and training modified if necessary. Ongoing training for assessors should be provided to ensure that they are up-to-date and CPD approval should be sought for this.

Feedback for assessors

Evaluation of assessor performance and provision of feedback for assessors should be planned within the development of the assessment system. This should include feedback both on their own performance as an assessor and feedback on the QM of the assessment process they are involved in. Assessors are largely unpaid and give their time in the context of many other conflicting pressures. Feedback for assessors should include formal recognition of their contribution (i.e. a ‘thank you’). Planning for evaluation of the assessors should include mechanisms for dealing with assessors about whom concerns are raised. In the first instance, this would usually involve the offer of additional training targeted at addressing the area(s) of concern.

Chapter 5: Integrating assessment into the curriculum - a practical guide

Introduction

Assessment is a necessary process to assure the profession, public and regulatory authorities that practitioners are capable of offering the highest quality of healthcare. It is worth recalling that assessments are as much tools for driving learning as assessing knowledge, skills or understanding. It is essential in this process that assessments are valid and specialty relevant. GMP is the chosen anchor of PMETB and it provides a baseline or benchmark against which everything else can be planned and evaluated. There are alternatives to GMP such as CanMEDS, which are also worthy. However, after wide consultation the GMC has set its own standards down as GMP, which has evolved and matured over the years and which is familiar to every doctor and clinical medical student in the UK. PMETB expects assessment organisations to choose an appropriate assessment system that overall ensures each attribute of GMP is being tested. This chapter addresses these issues from a practical point of view.

Blueprinting

Blueprinting of assessments against the curriculum and GMP is important for any organisation, as it ensures all aspects of the curriculum and GMP are covered over a period of time defined (and justified) by that organisation. PMETB requires that any assessment organisation setting or overseeing an assessment should be able to show, if called upon, that its assessments conform to specific criteria. When blueprinting, they should be able to show that every assessment has been designed to test a particular aspect of the curriculum or an appropriate element of GMP and that assessments are weighted in respect of clinical importance. PMETB also requires that the assessment organisation provides evidence that outcome measures are appropriate. The outcome for an individual trainee or group of trainees should be justified and can be benchmarked against the performance of an optimised trainee group.

Sampling

PMETB has realistic expectations and fully accepts that, even by using a range of methodologies, it will not usually be possible to assess all aspects of the curriculum, even over the full training life of an individual trainee. Although PMETB does not expect any one trainee to be assessed on the whole curriculum, there is a requirement that over a period of time an organisation can provide evidence that all aspects of the curriculum have been sampled; this is the process of sampling. Sampling is entirely appropriate, providing the weighting of that sampling process can be justified. This is important because PMETB is anxious to instil in trainees the appreciation of having a breadth of knowledge and skills, and recognises that evidence of the possibility that any aspect of the curriculum/GMP might be assessed drives relevant learning. It is therefore important that in sampling aspects of a trainee’s skills, knowledge and understanding, there is an appropriate balance of subjects from the curriculum being assessed.

There are a number of general points that PMETB would encourage assessing organisations to take into account when employing assessment tools. Specifically:
• The assessment tools that any assessing organisation might choose are not prescribed by PMETB. All PMETB asks is that the use of any one tool can be justified on the grounds of its ‘fitness for purpose’ and validity as discussed in Chapter 1.
• PMETB recognises that there is no ‘perfect’ assessment instrument and any single assessing organisation will need to use a range of instruments, each with their own reliability/validity, to ensure the curriculum is adequately assessed. Overlap between methods is inevitable and may even be desirable in providing confirmation of performance through triangulation.
• In order to cover all aspects of the curriculum within the framework of the GMC’s GMP, the range of assessment instruments must include those applicable to both exam based and workplace based assessments to ensure that the full scope of knowledge, skills and attitudes is assessed. For example, MCQs may be appropriate for testing knowledge and practical tests for testing skills.
• PMETB encourages the use of assessment instruments which fit in naturally with normal clinical practice in the workplace.
• PMETB recognises that in any specific assessment setting it will not be possible to assess the entire curriculum and that sampling will be a concern to assessment organisations and trainees alike.
• PMETB would also like to suggest that a good assessment instrument should be able to assess a doctor’s ability to support self care.

The assessment burden (feasibility)

PMETB recognises that the frequency of assessments should not be so excessive as to overburden the trainee, or to exhaust the assessors or the assessment system. As Dame Onora O’Neill stated in her 2002 BBC Reith Lecture: “Plants don’t flourish when we pull them up too often to check how their roots are growing.” There is a requirement to be able to balance a number of issues in this context and PMETB will expect assessment organisations to be able to show that they understand the need to recognise:
• There is a tension between having a large number of tests in which the organisation is confident and their overuse to the point that the trainee is overburdened.
• Sufficient time needs to be given between assessments for the trainee to reflect on his/her performance and to allow this to be reinforced through its application in further clinical practice.
• There is always to some degree a conflict between validity and reliability. It is important to avoid the danger of focusing too much on reliability at the expense of important attributes that cannot easily be assessed using traditional examination based methods.
• Whilst it is important to summate good performances, assessment systems are required to assess global professional judgments and expose dangerous weaknesses, even if they represent a minority of the decision making outcomes.

Feedback

PMETB sees appropriate feedback as being at the heart of assessment. Feedback must be provided from all assessments and the assessment organisation must be able to demonstrate that the feedback that has occurred is appropriate and that it has been given in a timely and useful manner. ‘Timeliness’ will vary depending on the nature of the assessment, but as a rule feedback should always be given as soon as possible after the assessment. Whenever feedback is given it should be done in such a way that an external observer would be reassured that the outcome of the assessment had reflected the trainee’s skills within a framework built on the principle that public well-being was the prime driver of medical education and practice. Wherever possible, feedback to the trainee should include identifying those areas of the assessment in which the trainee has shown mastery or excellence to provide them with an understanding not only of their weaknesses, but also their strengths. Feedback should also be such as to demonstrate that the trainee has provided evidence of competence if this is the case or, if not, define a framework suitable for a trainee to use as the basis of acquiring the necessary competence.

Chapter 6: Constructing the assessment system

Introduction

This chapter shows:
• how application of the previous five chapters might lead to a ‘blueprint’ of assessment methodologies, which would collectively address the needs of the training programme.

Purposes

The assessment system will begin by defining the purposes for which assessment is needed. The definition of the purposes will be firmly based upon the GMC’s GMP, which dictated the desired attributes of all UK doctors (and thus of the systems to which they belong). The main areas (‘domains’) of GMP relevant to assessment are:

Good clinical care
Keeping up-to-date
Maintaining and improving performance
Teaching, training, appraising and assessing

Attributes within these domains include:
• clinical assessment.
• treatment.
• record keeping.
• procedural and technical skills.
• specialty based knowledge.
• promoting patient safety.
• promoting the patient’s self care.
• insight.
• fairness.
• understanding evidence based medicine.
• involvement in audit and quality improvement.
• understanding medical education principles.
• commitment to educational activities.

There will be a range of purposes at every level in the training programme, including: developing skills, testing knowledge, developing insight, checking actual day-to-day performance, being certain of minimum competence, etc. Assessment may even have the purpose of enhancing the organisation, rather than just the individual (e.g. the desire to bed in the desired attributes into normal practice).

Starting with the purposes of the training programme, and keeping these constantly in mind, methods of assessment must then be chosen. In selecting methods, it is useful to consider the various modalities of assessment which exist. Assessment tools are of a number of qualitatively different types, each of which tends to be used in a different setting. It is useful to consider these different categories, as such consideration will make it more likely that the complete assessment system is adequately comprehensive.

It is worth checking what is already known before launching into the creation of a new assessment instrument. Under the auspices of the Academy of Medical Royal Colleges, individual Colleges have been invited to share the work they have done and are doing on assessment methodologies. This is being collated on a website within Modernising Medical Careers (MMC) - A compendium of assessments. As well as pointing to areas of good practice, this site will allow Colleges to post their ‘work in progress’ so as to encourage collaboration and learning, both about the assessment instruments and their evaluation. It is emphasised that the reason for such background work is not simply that existing tools can be ‘borrowed’. It is rather that taking account of the experience of others in what does and what does not work can save having to make the same mistakes twice. It is vital to understand that an assessment which has been validated in another setting cannot be assumed to be valid or reliable in a different setting.

One classification of assessments

One classification of assessments is to consider how they are conducted in the life of the trainee. There may be assessment of:

1) A real (medical) patient encounter
This may be done in the work setting (e.g. mini-CEX, mini-ACE), or it may be a video of such an encounter reviewed later. The feature of this type of assessment is that it is real (and therefore difficult to standardise) and that typically an Educational Supervisor might conduct the assessment on a one-to-one basis with the trainee. However, other assessors may sometimes be used, such as other professionals in the team or the patients themselves.

2) Direct observation of a skill
This category is again assessment of real life activities, where the focus of the assessment is the skill with which the activity was performed, e.g. technical skills (DOPS, OSATS), teaching skills and presentation skills. The consistent feature is that one or more assessors, who are trained in the assessment of that skill, make a judgment about a real life performance.

3) Behaviour over time
These assessments ask multiple observers to assess behaviour, typically with regard to generic attributes such as team working, verbal communication and diligence. These are usually referred to as MSF (e.g. mini-PAT, TAB). They share the feature that they are a collection of retrospective and subjective opinions of key professionals based on observation over a period of time.

4) Behaviour in a real situation or environment
The focus of this type of assessment would not be an individual patient, but rather the trainee’s management of the whole situation. Typically, such assessments are performed in busy acute settings such as a labour ward or an emergency room. These types of assessments tend to look at behaviours and skills such as teamwork, clinical prioritisation, clinical leadership, ‘situational awareness’, etc.

5) Discussion of clinical materials
These assessments are usually performed in the absence of patients on a one-to-one basis. There is a large variety of materials which may be used for such discussion, e.g. case notes (CBD), charts (CSR), videos (VSR), clinical letter writing (SAIL), etc. Because this type of assessment is based on actual materials, it is easier to standardise.

6) Simulation
Sometimes the system requires that skills will be developed which need to be tested more often than the clinical realities will allow, or which need to be checked before the trainee is allowed to practise them. In such circumstances, the trainee might be assessed in a simulated setting, often outside the clinical setting, using models or manikins, etc. Because simulation allows more standardisation, this modality is appropriate for a variety of areas, such as practical procedures, communication assessment using standard patients, teamwork assessment using simulated situations, etc. Simulations, although having the advantage of reproducibility, have the disadvantage of being less real than real life. Assessments by computer simulation in some areas are beginning to address these needs.

7) Cognitive assessments
These share the feature that a group of trainees may be assessed simultaneously, typically by means of a written test such as MCQ, EMQ, CRQ, etc.

8) Reflective practice
These assessments are not assessments of actual performance, skill or knowledge, but instead they assess the trainee’s insight when reflecting on these things. Materials on which the trainee might reflect include portfolios, CAEs and other case events, etc.

This classification is simply intended to ‘pigeonhole’ assessments into various types so as to make it easier to share practice and to compare what is being done. It does not in any way intend to dictate that an assessment system must contain all of these types. Very importantly, this categorisation does not attempt to define the purposes to which the assessment is put. In fact, it is the purpose which must come first and the choice of assessment method second. PMETB regards this as the first principle of constructing assessment systems.

It is possible to use this classification to produce a ‘matrix’, where these categories appear along the x-axis of a grid, the y-axis being formed by the domains of GMP (see Appendix 4). This allows a quick visual check to ensure that all the domains of GMP have been appropriately addressed within the assessment system.

Once methods have been selected, consideration then has to be given to:
• ensuring that they fit the principles of utility (Chapter 1).
• standard setting (Chapter 2).
• quality assurance (Chapter 3).
• selection and training of assessors (Chapter 4).
• integration of assessment into the curriculum (Chapter 5).
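The matrix of assessment categories against GMP domains described in this chapter can be sketched in a few lines of Python. This is only an illustration: the domain and category names are taken from the text, but the assignment of each tool to a category and to GMP domains in `TOOLS` is a hypothetical example, not a PMETB mapping, and the function names are invented for this sketch.

```python
# Categories along the x-axis, GMP domains along the y-axis (names from the text).
CATEGORIES = [
    "Real patient encounter",
    "Direct observation of a skill",
    "Behaviour over time",
    "Behaviour in a real situation",
    "Discussion of clinical materials",
    "Simulation",
    "Cognitive assessment",
    "Reflective practice",
]
GMP_DOMAINS = [
    "Good clinical care",
    "Keeping up-to-date",
    "Maintaining and improving performance",
    "Teaching, training, appraising and assessing",
]

# HYPOTHETICAL example mapping: tool -> (category, GMP domains it is meant to address).
TOOLS = {
    "mini-CEX": ("Real patient encounter", ["Good clinical care"]),
    "DOPS": ("Direct observation of a skill", ["Good clinical care"]),
    "MSF": ("Behaviour over time", ["Maintaining and improving performance"]),
    "CBD": ("Discussion of clinical materials", ["Good clinical care", "Keeping up-to-date"]),
    "MCQ": ("Cognitive assessment", ["Keeping up-to-date"]),
}


def build_matrix(tools):
    """Return {domain: {category: [tool names]}} for every domain and category."""
    matrix = {d: {c: [] for c in CATEGORIES} for d in GMP_DOMAINS}
    for name, (category, domains) in tools.items():
        for domain in domains:
            matrix[domain][category].append(name)
    return matrix


def uncovered_domains(matrix):
    """List the GMP domains that no assessment tool addresses."""
    return [d for d, row in matrix.items() if not any(row.values())]


matrix = build_matrix(TOOLS)
for domain, row in matrix.items():
    # One row per domain: 'x' where at least one tool fills the cell, '.' otherwise.
    cells = ["x" if row[c] else "." for c in CATEGORIES]
    print(f"{domain:<45} {' '.join(cells)}")
print("Domains with no assessment:", uncovered_domains(matrix))
```

With this example mapping the check flags one domain as unaddressed, which is the kind of gap the quick visual check is meant to expose. Empty columns are not flagged, since the text is explicit that a system need not contain every type of assessment.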

2006. 32 Developing and maintaining an assessment system . NJ: Lawrence Erlbaum. Wakeford R. van der Vleuten CP. Medical Teacher. 1978. van der Vleuten C. Zieky MJ. 8. London: Falmer Press. Crossley J. 4. Chana N. Passing scores: a manual for setting standards of performance on educational and occupational tests. Miller G. Bridge PD. Norman G. 1996. 2002 Oct. Dauphinee D. 15(4): 237-61. 17. Downing SM. Humphris GM. Medical Education. Standards and criteria. Lescop JM. Eraut M. 25. The certification and recertification of Doctors: Issues in the assessment of clinical competence: Cambridge University Press. Paget 20. My current thoughts on coefficient alpha and successor procedures. Generalisability: a key to unlock professional assessment. In: Downing S. 1994. Wakeford R. 5. van der Vleuten C. Assessing health professionals: introduction to a series on methods of professional assessment. 2005. Jolly B.a PMETB guide to good practice . Education Research. Standard Setting. Assessing professional competence: from methods to programmes. Medical Education. Humphris G. 6. 1995. 2003.References 1. PMETB. 36(10): 972-8. Roberts C. When enough is enough: a conceptual basis for fair and defensible practice performance assessment. 146-57. 2003 Jul. et al. Medical Education. 28(6): 535-43. Livingston SA. Medical Education. 1965. Shavelson R. 16. Medical Education. editors. Streiner D. Schuwirth L. The assessment of clinical skills/competence/performance. Crossley J. 39(3): 309-17. Jolly B. Measurement practices: methods for developing content-valid student examinations. New York: Oxford University Press. Academic Medicine. The assessment of professional competence: developments. 13. How to design a useful test: the principles of assessment. Cizek G. 24. Cronbach L. Sawilowsky S. Jolly B. 12. Fabb W. 2002. 23. 2006 [cited 2002 October].uk/ pmetb/publications/ 2. 1994. Jolly B. Swanwick T. Performance-based assessment: Lessons learnt from the health professions. 5-11: 24(5). 
Southgate L. Musial J. GMC. Schuwirth LW. Schuwirth LW. 25(4): 414-21. Page GG. 14. Trainees’ views of the MRCGP examination and its effects upon approaches to learning: a questionnaire study in the Northern Deanery. 7. 3. 225-57. 2003 Sep. 1982. Educational and Psychological Measurement. The Certification and Recertification of Doctors: Issues in the Assessment of Clinical Competence. Reliability: on the reproducibility of assessment data. 1994. Newble D. 22. Good Medical Practice. Principles for an assessment system for postgraduate medical training. Newble In press. Validity: on meaningful interpretation of assessment data. Cambridge University Press. 2004 June. Downing SM. 2006. Jolly B. 37(9): 830-7. 2nd ed. Bloom B.J. Wealthall S. Medical Teacher. 14. 1: 41-67. 9. Medical Education. Davies H. Norman G. 92-104. Determining the content of certifying examinations. Education for Primary Care.pmetb. Assuring the quality of high-stakes undergraduate assessments of clinical competence. 11. Jolly B. Mahwah. Health Measurement Scales: A Practical Guide to their Development and Use. editors. 2006 Sep. 21. 2004. Available from: www. London: Longman. 2005 Mar. Glass G. Edinburgh: ASME. Haladyna T. In: Newble D. Advances in Health Sciences Education. Handbook of Test Development. Journal of Educational Measurement. editors. 38(9): 1006-12. Langsley D. Princeton: Educational Testing Service. Taxonomy of educational objectives. 19. Available from: http://www. 15. Reed M. Linn R.gmc-uk. Workplace assessment for licensing in general practice. Developing Professional Knowledge and Competence. Roe T. 55: 461-7. research and practical implications. 18. 65(Suppl): S63-S7. Hampton K. Lew SR. 2004 Sep. 64(3): 391-418. 1990. Procopis P. 2002 Oct. Frank R. 36(10): 925-30. Swanson D. British Journal of General Practice. Dixon H. 10.

88(Suppl 1): A50. Setting standards on educational tests. Clyman S. Wakeford R. The Journal of Continuing Education in the Health Professions. Southgate L. Health Measurement Scales: A Practical Guide to their Development and Use. McKinley DW. Oxford: Oxford University Press. 71(Suppl. 9(3): 215-35. 29. Norcini JJ. A contrasting groups approach to standard setting for performance assessments of clinical skills. 51. Stroobant J. 34. Educational Measurement. mini-PAT (Peer Assessment Tool): A Valid Component of a National Assessment Programme in the UK? Advances in Health Sciences Education: Theory and Practice. 71(Suppl): S1-30. 27. Heard S. Annals of Internal Medicine. 40. 42. NHSE. 75: 267-71. Lissauer T. Englewood Cliffs. editors. 49. 128-43. How should paediatric examiners be trained? Archives of Disease in Childhood. 2005 Jan. The mini-CEX: a method for assessing clinical skills. Lockyer JM. Rothman A. Holsgrove G. Downing S. 28. Fortna GS. Medical Teacher. Wakeford R. 48. Academic Medicine. Setting defensible performance standards on OSCEs and standardized patient examinations. 39. Principles of assessment. Academic Medicine. 43. 140: 874-81. Developing and maintaining an assessment system . Crossley J. Boulet JR. De Champlain AF. Duffy FD. Campion P. 2001. Davies H. Skuse D. 2003 May. 44. Mann K. 46. Norcini J. Cohen R. Norman G. Eiser C. Davies H. Teaching medicine in the community. Norcini JJ. 2003 Mar 18. 2000. NJ: Prentice-Hall. Archer J. 2006. 2003. Multisource feedback in the assessment of physician competencies. Davies HA. 1996. An assessment tool whose time has come. Essentials of educational measurement. 3rd ed. editor. Boulet J. Standard setting The next generation: Where few psychometricians have gone before. Academic Medicine. 50. 37. Blank LL. Annals of Internal Medicine. Wenrich MD. Procedures for establishing defensible absolute passing scores on performance examinations in health professional education. Holsgrove G. 183-5. Violato C. 2006 Jan. 
Roland M. Canadian Journal of Anaesthesia. Kaufman D. a randomized trial. Peer ratings. 2003. Fidler H. Generalizability. 1996. Archer JC. Berk R. Washington DC: American Council on Education. 14(9): 581-2. Medical Teacher. In: Throndike R. In: Whitehouse C. Kauser Ali S. Journal of General Internal Medicine. norms and equivalent score. Generalizability Theory. Clauser B. 1997. Streiner D. Yudkowsky R. 1998. 1993. Hout S. New York: Springer Verlag. 2006 Oct 12. 41. 47. Lockyer J. Khera N. Tekian A. Teaching and Learning in Medicine. 53(1): 33-9. 2003 May. 90(1): 43-7. 37(5): 464-9. 31. 32. 2004. de Champlain A. 2003. Ebel R. 330(7502): 1251-3. A comparison of standard setting procedures for an OSCE in undergraduate medical education. Quality assurance. Scales. Effects of training in direct observation of medical residents’ clinical competence.26. 2004. 30. 25: 35. Applied Measurement in Education. Ramsey PG. 18(1): 50-7.a PMETB guide to good practice 33 . British Medical Journal. The MRCGP Clinical Skills Assessment Standard setting and related quality issue. 23(1): 4-12. 138(6): 476-81. Academic Medicine. 2006. Brennan R. 1994. 36. 25(3): 245-9. A comparison of empirically and rationally defined standards for clinical skills checklist. 33. A Guide to Specialist Registrar Training. Medical Education. Setting defensible performance standards on OSCEs and standardized patient examinations. 2005 May 28 . 2003 Winter. 1971. 69(10 Suppl): S42-4. Hawkins RE. van der Vleuten C.): S112-20. Angoff W. The measurement characteristics of children’s and parents’ ratings of the doctor-patient interaction: measuring what matters well. 1972. Norcini J. A multi source feedback program for anaesthesiologists. New York: Oxford University Press. Muijtkens A. McKinley D. College of Physicians and Surgeons: Pakistan. Standard setting in medical education. 1999 Sep. standard-setting and item banking in professional examinations. Davies H. Patterson F. Archives of Disease in Childhood. 
38. 508-600. Use of SPRAT for peer review of paediatricians in training. Cusimano M. 45. Holmboe ES.

Ebel RL. 75: 267-271 34 Developing and maintaining an assessment system . 1997a. Further reading Angoff WH. 2003. 1997b. Academic Medicine. 1996. Journal of Educational Measurement. Procedures for establishing defensible absolute passing scores on performance examinations in health professional education. 3rd ed. Norman G. Hertz NR. 2003. Chapter 29. Mahwah. Roland M. 31(1). Chapter 10. Philadelphia. Berk RA. Cusimano MD. 1-14. 4. Assessing knowledge. 18. 69(10 Suppl): S42-4. A large-scale multicenter objective structured clinical examination for licensure. 71(Suppl. in Teaching medicine in the community (Editors: Whitehouse C. Holsgrove G. 2002. 1998. Joint Centre for Education in Medicine. 28(1): 21-39. Raible M. S37-S39. The next generation: Where few psychometricians have gone before. Setting defensible performance standards on OSCEs and standardized patient examinations. Kauser Ali S. 15(1). 1997. 2004. Alternative approaches to standard setting for licensing and certification examinations. 1993. McKinley DW. Cizek G. Standards and criteria. Campion P). Health Measurement Scales: A Practical Guide to their Development and Use. Does CME work? An analysis of the effect of educational activities on physician performance or health care outcomes. Grand’maison P. Washington DC: American Council on Education. 78(10) Suppl: S85-S87. 67(10 Suppl). 1971. 1996. Establishing Passing Standards for Classroom Achievement Tests in Medical Education: A Comparative Study of Four Methods. Standard setting in medical education. PA: National Board of Medical Examiners. 2001. Champlain A (2004) Ensuring that the competent are truly competent: an overview of common methods and procedures used to set standards on high stakes examinations. 1991. 1994 Oct. 1992. Oxford University Press. Chinn RN. Journal of Veterinary Medical Education. 183-185. 2003 Oct. 2004. Yudkowsky R. Applied Measurement in Education. Jolly B. Principles of assessment. Kaufman DM. 2006. Quality assurance. 
1: 50-57. 54. Standard Setting (2006) in Downing S and Haladyna TM (Eds) Handbook of Test Development. The good assessment guide. 9(3).52. 25: 245-249.a PMETB guide to good practice . 186-194. 215-235. Clyman SG. Streiner D.Verlag). Brennan RL. Lieska N. Holsgrove G. norms and equivalent scores in Educational Measurement. standard-setting and item banking in professional examinations. New York: Oxford University Press. Generalizability Theory (New York: Springer. Case SM. Constructing Written Test Questions for the Basic and Clinical Sciences. 1972. Englewood Cliffs. 237-261. Academic Medicine. International Journal of Psychiatry in Medicine. de Champlain AF. Grant J (Eds). NJ: Prentice-Hall.): S112-120. Roland M. Holsgrove G. Tekian A. Downing S. Cambridge: University of Cambridge Local Examination Syndicate. Ed. 2000. Muijtkens AM. Boulet JR. Assessment and Testing: a survey of research. Clauser BE. Standard setting. Academic Medicine. Downing S. 15. Throndike RL. 1978. 2004. Academic Medicine. 508-600. 53. Wood R. Applied Measurement in Education. Teaching and Learning in Medicine. Chapter 28. Swanson DB. Academic Medicine. Campion P). Glass GV. Mann KV. A contrasting groups approach to standard setting for performance assessments of clinical skills. van der Vleuten CP. Davis D. Scales. NJ: Lawrence Erlbaum. Brailovsky CA. 2006. Essentials of educational measurement. in Teaching medicine in the community (Editors: Whitehouse C. Lescop J. Medical Teacher. Internal document for the College of Physicians and Surgeons Pakistan. Research in Medical Education: Proceedings of the Forty-second Annual Conference. Oxford: Oxford University Press. A comparison of standard setting procedures for an OSCE in undergraduate medical education.

2006. Zieky M J. Teaching and Learning in Medicine. Developing and maintaining an assessment system . Royal College of Psychiatrists. Available from: pmetb/publications/ Rothman AI. Understanding Medical Education. van der Vleuten C. Muijtjens A. NJ: Educational Testing Service. The metric of medical education: setting standards on educational tests. Dusman H. 1996. Medical Education. Norcini J. Wood R. Oxford University Press.a PMETB guide to good practice 35 . Paper to the RCGP/COGPED Assessment Group. Patterson F.Kramer A. 2003. Norcini JJ. Principles for an assessment system for postgraduate medical training.rcpsych. University of Cambridge Local Examination Syndicate. Cohen R. Cambridge. Frampton CM. 2003. Newble Princeton. How to design a useful test: the principles of assessment. Wakeford R. Livingston S A. 1(3): 158-166. Tan L. Swanson DB. 2003. 464-9. 2001. Workplace based assessment materials (2006). 71(Suppl): S1-30. Medical Education. Streiner DL. Standard setting in an objective structured clinical examination: use of global ratings of borderline performance to determine the passing score. 35: 1043-1049. Factors influencing reproducibility of tests using standardized patients. Association for the Study of Medical Education. 1982. van der Vleuten CP (in print). Wilkinson TF. Norman GR. 1991. Medical Education. Academic training/workplace-basedassessment/wbadownloads. Passing scores: a manual for setting standards of performance on educational and occupational tests. PMETB. 37: 132-139. Available from: http://www. A comparison of empirically and rationally defined standards for clinical skills checklists.pmetb. 1989. Comparison of a rational and an empirical standard setting procedure for an OSCE. 2004. Health Measurement Scales (3rd edition). Jansen K.aspx Schuwirth LW. The nMRCGP Clinical Skills Assessment Standard Setting and Related Quality Issues. Assessment and Testing: a survey of research.

Appendices

Appendix 1: Reliability and measurement error

Having set a passing standard, the measurement error associated with the assessment (which is another PMETB requirement) can be established and a protocol to identify and appropriately manage borderline trainees formulated. This involves calculating the reliability and Standard Error of Measurement of the assessment.

All measurement methods have a margin of error. In some instances quite a large measurement error can be acceptable. For example, our bathroom scales might quite reasonably have a measurement error of up to 1 kg and still be fit for routine use. Such a margin would be quite unacceptable for weighing babies, even though both types of scales are making measurements in the same domain - mass.

Since both formal and workplace based assessments are measuring something (clinical competence, for example) they are, therefore, not exempt from the universal rule that all measurement methods have an associated margin of error. Assessors often overestimate the accuracy of marks awarded in their assessments, but in fact, the measurement error of many exams is uncomfortably large. Historically, this has not bothered assessors unduly because the measurement error of UK exams is rarely calculated, or even acknowledged. This situation will change, however. In order to comply with PMETB requirements on quality assurance, quality control and the assessment system (Principle 4), assessing bodies must address two specific questions:

i) What is the measurement error around the agreed level of proficiency?
ii) What steps are taken to account for measurement error, particularly in relation to borderline performance?

Moreover, the answers to these questions must be transparent and in the public domain.

In order to provide satisfactory answers to these questions it is necessary to do three things:

a) calculate the measurement error of the assessment (either as a whole, or for each component part if each is considered separately);
b) agree a policy for determining how the borderline trainees will be identified and how you will be making pass/fail decisions about them;
c) implement strategies for reducing measurement error, where this is necessary.

Reliability

The term reliability has a specific and rather complicated meaning in relation to the mathematical performance of assessment instruments. Readers seeking additional information would be interested in the excellent coverage by Wood (53), Streiner and Norman (54) and in various superb publications by Lee Cronbach. In this context reliability is concerned with the accuracy with which the trainee’s performance is determined and reported.

Reliability has two components and is expressed as:

Reliability = Subject Variability / (Subject Variability + Measurement Error)

Reliability is typically reported as a coefficient called coefficient alpha, or Cronbach’s alpha in honour of its inventor. In essence, an alpha value expresses the amount of variance between trainees that is genuinely due to true differences between them and, therefore, also shows us how much of the variability in the marks is not actually due to differences between the trainees but to other sources of variance such as inconsistencies between assessors and random error. Thus, an alpha of 0.6 would indicate that 60% of the measured variance was due to genuine differences between trainees (and, therefore, that 40% was not).

Traditionally, the accepted minimum value for alpha in an examination has been 0.8. This remains the benchmark below which an exam or elements within it should not fall. However, there is a consensus among medical educationalists that high stakes assessments, such as most of the Royal College examinations, should have a reliability of at least 0.9. That said, it is not necessary to go much beyond 0.9 because of the increasing likelihood that this would mean that the exam was testing more or less the same thing in slightly different ways.
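The guide reports reliability as Cronbach's alpha but does not show how the coefficient is obtained from marks. As a minimal sketch, assuming the standard computational formula alpha = k/(k-1) x (1 - sum of item variances / variance of total marks), the calculation might look like the following. The function name and the mark matrix are hypothetical and are not taken from the guide.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a matrix of marks.

    scores: list of rows, one row per trainee, one column per item.
    Uses the standard formula (an assumption; the guide gives no formula):
        alpha = k / (k - 1) * (1 - sum(item variances) / variance(totals))
    """
    n_trainees = len(scores)
    n_items = len(scores[0])
    if n_trainees < 2 or n_items < 2:
        raise ValueError("need at least two trainees and two items")

    # Variance of each item (column) across trainees.
    item_variances = [variance(column) for column in zip(*scores)]
    # Variance of each trainee's total mark.
    total_variance = variance([sum(row) for row in scores])

    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical marks: five trainees, three items.
example_marks = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 2],
    [5, 4, 4],
    [3, 3, 4],
]
print(f"alpha = {cronbach_alpha(example_marks):.3f}")
```

An alpha computed this way can then be compared against the 0.8 and 0.9 benchmarks discussed above.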

Measurement error

The reliability coefficient involves two components: the true variance between trainees (which is what is really required - at least to the extent that there is confidence that the right trainees have passed or failed) and additional variance due to measurement error, which needs to be measured so that it can be compensated for. These two components also need to be identified for test development and quality assurance purposes.

In exam analysis, measurement error is calculated and reported as the standard error of measurement (SEM). This is done using the simple formula:

SEM = Standard Deviation × √(1 - Cronbach’s alpha)

In terms of assessment development, the SEM can help in identifying individual assessments that need to be improved, though the reliability coefficient is more important in this regard. The main use of the SEM, however, is to enable the proper identification of the borderline trainees - those whom the examination has not been able to confidently place on one side or the other of the pass mark.

a) The standard error of measurement and confidence intervals

The SEM forms the basis on which the range of marks that would determine the group of borderline trainees that poses a similar problem in every examination can be calculated. This is because the SEM equates with the confidence interval for the marks. 1 SEM represents a confidence interval of 68%. In other words, 68% of the time a trainee’s ‘true’ mark would be within ± 1 SEM of the mark they obtained in the test - or, to put it the other way, there is about a 1 in 3 chance that their exam mark was not even within 1 SEM of their ‘true’ mark. However, since the passing standard itself is also associated with errors (for example, as discussed above, different methods and different assessors arriving at different passing scores), a 68% confidence interval is probably adequate for determining borderline trainees.

b) Identifying borderline trainees

The use of SEMs in determining passing, failing and borderline trainees can be illustrated by taking a hypothetical assessment where the pass mark (determined, of course, by one of the methods described above) is 50% and the standard deviation is 10. If the reliability of the assessment was 0.8, the SEM would be 4.47. Based on a confidence interval of 68% (i.e. using 1 SEM), the borderline trainees would be those with marks of 50% ± 4.47 - this means, between 45.53% and 54.47%. Confidence can be better than 68% that trainees with marks above 54.47% really had passed and those below 45.53% really had failed (more confidence could be felt the further above or below these two points). However, the hypothetical assessment would not have been able to place trainees with marks in the 45.53% to 54.47% range on the correct side of the pass mark, even at only a very modest 68% confidence level.

Since the SEM is partly dependent on reliability, it is obvious that the SEM will be smaller in a reliable assessment than in an unreliable one with the same standard deviation. Consequently, one way of reducing the SEM (and, thus, confidently placing a greater proportion of trainees on the correct side of the pass/fail cutting point) is to improve the reliability. There are several ways of doing this, of which the main methods are:

• increase testing time/number of items;
• produce better items (questions);
• improve examiner training;
• improve marking schedules;
• reject badly performing items from the examination before calculating final marks;
• use optical mark reading or computer based testing to minimise marking errors.
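The SEM formula and the borderline zone of the hypothetical assessment (pass mark 50%, standard deviation 10, reliability 0.8) can be sketched in a few lines. This is an illustration only: the function names and the `n_sem` parameter are assumptions introduced here, and the rule of treating marks exactly on the edge of the zone as borderline is a choice made for the sketch, not something the guide specifies.

```python
import math


def standard_error_of_measurement(standard_deviation, reliability):
    """SEM = standard deviation x sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return standard_deviation * math.sqrt(1.0 - reliability)


def classify(mark, pass_mark, sem, n_sem=1.0):
    """Place a mark as 'pass', 'fail' or 'borderline'.

    The borderline zone is pass_mark +/- n_sem * SEM. With n_sem = 1 this
    corresponds to the 68% confidence interval used in the text.
    """
    lower = pass_mark - n_sem * sem
    upper = pass_mark + n_sem * sem
    if mark > upper:
        return "pass"
    if mark < lower:
        return "fail"
    return "borderline"


# The hypothetical assessment from the text.
sem = standard_error_of_measurement(standard_deviation=10, reliability=0.8)
print(f"SEM = {sem:.2f}")  # 4.47, as in the worked example
for mark in (44, 48, 52, 56):
    print(mark, classify(mark, pass_mark=50, sem=sem))
```

Raising `n_sem` widens the zone and raises the confidence level, at the cost of labelling more trainees as borderline.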

The box below illustrates the effect of improving reliability on the SEM by extending the example given above which was based on a reliability of 0.8. Below are comparative figures for the same scenario in our hypothetical exam using the same 50% pass mark and reliabilities of 0.6, 0.7, 0.8 and 0.9:

Reliability           Standard Error     Borderline trainees will have
(Cronbach’s alpha)    of Measurement     marks within this zone (based on 1 SEM)

0.6                   6.32               43.68% to 56.32%
0.7                   5.48               44.52% to 55.48%
0.8                   4.47               45.53% to 54.47%
0.9                   3.16               46.84% to 53.16%

It is clear that the more reliable version of the hypothetical exam is likely to have considerably fewer trainees in the borderline zone.
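The comparative figures can be checked by recomputing the zone for each reliability. The sketch below assumes the same standard deviation of 10 as the earlier worked example; `borderline_zone` is a hypothetical helper name, not part of any PMETB tool.

```python
import math


def borderline_zone(pass_mark, standard_deviation, reliability):
    """Return (SEM, lower bound, upper bound) of the 1 SEM borderline zone."""
    sem = standard_deviation * math.sqrt(1.0 - reliability)
    return sem, pass_mark - sem, pass_mark + sem


# Same hypothetical exam: 50% pass mark, standard deviation assumed to be 10.
print("alpha   SEM    borderline zone")
for reliability in (0.6, 0.7, 0.8, 0.9):
    sem, lower, upper = borderline_zone(50, 10, reliability)
    print(f"{reliability:.1f}    {sem:5.2f}   {lower:.2f}% to {upper:.2f}%")
```

The zone narrows from about 12.6 marks wide at a reliability of 0.6 to about 6.3 marks wide at 0.9, which is why fewer trainees fall inside it.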

Appendix 2: Procedures for using some common methods of standard setting

Test based methods

a) Angoff’s method (32)

1) A group of assessors is assembled and briefed. If necessary, this group can be subdivided later into working groups that should each have at least five members and preferably fewer than ten.
2) In open discussion, the group outlines the characteristics of an imaginary group of borderline trainees - i.e. those with about a 50/50 chance of passing.
3) Each assessor looks at the first exam item and, independently, estimates, for MCQs, the proportion of borderline trainees who would get the correct answer, or, if the individual assessment is an OSCE station, how many of the available marks a borderline trainee would get. For example, an assessor might estimate that about 40% of borderline trainees might get a particular MCQ correct, or that borderline trainees would probably gain about 4 marks out of the 10 available on a particular OSCE station.
4) The estimates are then discussed and assessors can subsequently change their own estimate if they wish.
5) Assessors’ ‘final’ estimates for the item are collected and averaged to give the ‘provisional’ standard for that item of assessment.
6) The process is then repeated for each of the remaining individual assessments in the exam.
7) Finally, the sum of the ‘provisional’ standards for each item is calculated and divided by the number of items. This becomes the pass mark (or standard) for the whole exam, which can subsequently be revised if it becomes clear that it is too high or too low - assessors usually tend towards setting it too high.

b) Ebel’s method (33)

The process begins in the same way as Angoff’s method, with a discussion about the characteristics of a borderline trainee. However, it then moves in a different direction. Assessors enter each item onto a matrix according to their individual assessments about relevance (essential, important, supplementary and questionable) and difficulty (difficult, moderate, and easy).

Matrix for Ebel’s method:

                 Difficult    Moderate    Easy
Essential
Important
Supplementary
Questionable

For example, question 1 might be assessed as essential and difficult, question 2 important and moderate, etc:

                 Difficult    Moderate    Easy
Essential        Q1
Important                     Q2
Supplementary
Questionable

In the classical Ebel method, when all the items have been classified in this way, assessors will make an initial estimate of the number of items in each cell that a borderline trainee would get right. However, in the Holsgrove and Kauser Ali (2004) modification, having made their individual assessments about the difficulty and importance of each item of assessment, the assessors first do three extra things:

1) Agree about the classification of the items of assessment in the matrix (using a majority decision if necessary).
2) Scrutinise any assessment items classified as ‘questionable’ and either reclassify them (which would almost invariably be as ‘supplementary’) or, more frequently, remove them from the exam and replace them with more assessment items testing more important material.

3) Look at the overall matrix to ensure that:
• the majority of assessment items are of moderate difficulty;
• the majority of assessment items test essential material.

It might be necessary to substitute assessment items in order to get this kind of distribution. A reasonable final distribution of items would be along the lines of:

                 Difficult    Moderate    Easy
Essential        10%          35%         10%
Important        5%           20%         5%
Supplementary    5%           5%          5%
Questionable     None

The rationale for aiming for such a pattern is threefold:
• unless an assessment item is clearly essential, important or supplementary (i.e. if it is trivial or irrelevant), it should not be in the assessment system as a whole;
• the focus of assessment should be on essential material;
• the most effective items of assessment are good discriminators of moderate difficulty.

A well designed and maintained assessment item bank should have psychometric data on assessments that have been used on previous occasions. This will include measures of reliability, difficulty and discrimination indices, as well as mapping to the area of the curriculum that they are assessing.

In the modified version, only after the three steps above have been completed will the assessors move on to estimate the number of assessment items in each cell that a borderline trainee would get right. Also in the modified version, if the items of assessment are OSCE stations, the assessors will estimate how many marks a borderline trainee would get for each station. So if, for example, an assessor estimated that a borderline trainee would score 4 out of 10 on Station 1, and 7 out of 12 on Station 2, this would be recorded on the grid in the following manner:

                 Difficult           Moderate            Easy
Essential        Station 1 (4/10)
Important                            Station 2 (7/12)
Supplementary
Questionable     None

From this point onwards, the two versions of the Ebel method are the same. Having made their individual estimates, the assessors discuss the marks for each cell, led by those giving the highest and lowest estimates in each case. As with the Angoff method, assessors are free to change their estimates as a result of these discussions. Following the discussions, the proportions (or, in the case of OSCE stations, the marks) assigned by each assessor are averaged for each of the nine active cells. These averages are then summated to produce the overall standard (pass mark).

Trainee based methods

Two trainee based methods are described by Downing et al (27) and elsewhere, and are summarised here. Both are based on assessments about the performance of individual trainees, rather than the content or difficulty of the items themselves. However, one is based on performance at each individual unit of assessment, the other over the assessment as a whole.

a) Borderline group method

1) Assessors are orientated and briefed about the station or individual assessment item they will be assessing and the checklist and three point rating scale they will be using.
2) Assessors observe each trainee’s performance on their allocated unit of assessment.
3) A global rating (pass/borderline/fail) is made of each trainee, together with a detailed rating on a multiple item score sheet.
4) The mean score on the multiple item score sheet for the trainees receiving a global ‘borderline’ rating for the individual unit of assessment is taken as the passing standard for that unit of assessment.

b) Contrasting groups method

1) Assessors are orientated and briefed about the individual item of assessment they will be examining and the rating scales they will be using.
2) Assessors observe each trainee’s performance on their allocated unit of assessment.
3) Each trainee is placed into one of the two contrasting groups using a global rating based on external criteria, performance descriptors and their overall performance in the assessment.
4) The rating scale results for both contrasting groups are represented graphically as curves. The pass mark is provisionally set where the two curves intersect.
5) The pass mark can subsequently be adjusted in either direction if the provisional mark appears to be unjustly passing or failing certain trainees.

Combined and compromise methods

Hofstee’s method

This is based on item content and difficulty, but also takes account of agreed parameters regarding the proportions of passing and failing trainees, and the highest and lowest acceptable pass mark. The description below is based on that of Case and Swanson (1996).

Based on their assessments about item content and difficulty, the standard setters estimate the highest score for a trainee to fail and the lowest score that would allow someone to pass. Once agreed (this is often done by taking median values between the highest and lowest estimates) these two values are plotted on a graph. For example, the standard setters might agree that trainees with a mark below 50% should not pass (i.e. the lowest score that would allow someone to pass) and that the highest acceptable pass mark would be 60%. Therefore, points are entered on the graph along the ‘score’ axis at 50% and 60%.

The standard setters then agree on the highest and lowest acceptable percentages of failing trainees. They might agree, for example, that a zero failure rate would be acceptable, but that no more than 20% of trainees should be allowed to fail. These, too, are plotted on the graph, this time along the other axis ‘% Fail’.

The graph now contains a rectangle based on the four agreed values established above - 0% and 20% on the ‘% Fail’ axis, and 50% and 60% on the ‘% Correct score’ axis. After the examination and calculation of final marks, the trainees’ scores are plotted as a graph of fail rate as a function of scores obtained. Finally, a line is drawn from the upper left to lower right corner of the rectangle. Where this line intersects the graph determines the standard (pass mark).

[Figure: Hofstee plot of ‘% Fail’ (0 to 100) against ‘% Correct score’ (<45 to 85)]
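Two of the procedures in this appendix reduce to short calculations: the Angoff pass mark (average the assessors' final estimates per item, then divide the sum of those item standards by the number of items) and the Hofstee intersection. The sketch below is illustrative only. The function names, the data layout and the `step` grid used to locate the intersection are assumptions made here; the guide describes the Hofstee intersection graphically, and no particular numerical procedure is prescribed.

```python
from statistics import mean


def angoff_pass_mark(estimates):
    """Angoff standard from assessors' final estimates.

    estimates[a][i] is assessor a's estimate, as a proportion between 0 and 1,
    of how borderline trainees would perform on item i.
    """
    n_items = len(estimates[0])
    # Provisional standard for each item: the average across assessors.
    item_standards = [mean(assessor[i] for assessor in estimates)
                      for i in range(n_items)]
    # Standard for the whole exam: sum of item standards / number of items.
    return sum(item_standards) / n_items


def hofstee_pass_mark(scores, min_pass, max_pass, min_fail, max_fail, step=0.1):
    """Approximate the Hofstee standard on a grid of candidate cut scores.

    scores: trainees' final marks (%). min_pass and max_pass are the lowest and
    highest acceptable pass marks; min_fail and max_fail are the lowest and
    highest acceptable fail rates (%). Returns the first candidate cut score at
    which the observed fail rate meets or exceeds the diagonal drawn from
    (min_pass, max_fail) to (max_pass, min_fail).
    """
    if max_pass <= min_pass:
        raise ValueError("max_pass must be greater than min_pass")
    n_trainees = len(scores)

    def fail_rate(cut):
        # Percentage of trainees whose mark is below the candidate cut score.
        return 100.0 * sum(score < cut for score in scores) / n_trainees

    def diagonal(cut):
        # Height of the upper-left to lower-right diagonal at this cut score.
        fraction = (cut - min_pass) / (max_pass - min_pass)
        return max_fail + fraction * (min_fail - max_fail)

    n_steps = int(round((max_pass - min_pass) / step))
    for i in range(n_steps + 1):
        cut = min_pass + i * step
        if fail_rate(cut) >= diagonal(cut):
            return round(cut, 1)
    # Curve never reaches the diagonal: fall back to the highest pass mark
    # (an assumption of this sketch).
    return max_pass


# Hypothetical data: three assessors, three items; twenty trainees' marks.
example_estimates = [[0.4, 0.6, 0.5], [0.5, 0.7, 0.6], [0.3, 0.5, 0.7]]
example_scores = [45, 54, 54, 55, 57, 58, 60, 61, 62, 63,
                  65, 66, 68, 70, 71, 73, 75, 78, 80, 84]
print(f"Angoff pass mark: {angoff_pass_mark(example_estimates):.1%}")
print("Hofstee pass mark:", hofstee_pass_mark(example_scores, 50, 60, 0, 20))
```

The grid search resolves the intersection only to within `step` marks; a finer step, or interpolation between adjacent candidate cut scores, would be reasonable refinements.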

Appendix 3: AoMRC, PMETB and MMC categorisation of assessments

Purpose

This is a system to enable sharing of best practice. Categorisation makes it easier to find out what exists, what works and to compare like with like. Grouping assessments into categories facilitates being able to see what sort of assessments (and validations of assessments) have already been developed in each category. This may potentially lead to the development of common competencies across the specialties. However, the principal purpose is to prevent unnecessary duplication. This is intended to be a living, evolving system, to be reviewed later. Constructive comment for future iterations should always be welcomed.

The categories

There is a spectrum of methodologies in assessment, from assessing what actually happens in the workplace, through simulation and OSCEs, to assessment in exam halls. There is also a spectrum of sophistication of the level at which an assessment may test. Miller described these levels as ‘Knows’, ‘Knows How’, ‘Shows How’ and ‘Does’. This spectrum might also be labelled ‘Knowledge’, ‘Competence’ and ‘Performance’. This categorisation attempts to reflect these spectra. Inevitably, this may mean a degree of overlap between categories, but this should not interfere with the purpose.

1) A real (medical) patient encounter
• an actual individual patient encounter - e.g. by mini-CEX, mini-ACE;
• a video of patient encounter(s) in the workplace - e.g. in general practice, in psychiatry.

The feature of this type of assessment is that it is real (and therefore difficult to standardise) and that typically an educational supervisor might conduct the assessment on a one-to-one basis with the trainee. This may be done in the work setting (e.g. mini-CEX, mini-ACE) or it may be a video of such an encounter.

2) Direct observation of a skill
• direct observation of a skill - e.g. technical skills (DOPS, OSATS);
• teaching skills;
• presentation skills.

This category is again assessment of real life activities, where the focus of the assessment is the skill with which the activity was performed, e.g. DOPS, OSATS, teaching skills and presentation skills. The consistent feature is that one or more assessors, who are trained in the assessment of that skill, make a judgment about a real life performance.

3) Behaviour over time
• MSF - e.g. mini-PAT, TAB;
• feedback from patients - e.g. ‘patient record’ in general practice.

These assessments ask multiple observers to assess behaviour, based on observation over a period of time, typically with regard to generic attributes such as team working, verbal communication and diligence. These are usually referred to as MSF (e.g. mini-PAT, TAB). They share the feature that they are a collection of retrospective and subjective opinions of key professionals. However, other assessors may sometimes be used, such as other professionals in the team, or the patients themselves.

4) Behaviour in a real situation or environment
• observation of teamwork - e.g. in emergency room, labour ward;
• (simultaneous) multiple actual patient encounter - e.g. in general practice.

These types of assessments tend to look at behaviours and skills such as teamwork, ‘situational awareness’, clinical prioritisation, clinical leadership, etc. Typically, such assessments are performed in busy acute settings such as a labour ward or an emergency room. The focus of this type of assessment would not be an individual patient, but rather the trainee’s management of the whole situation.

e. or of processes undertaken.a PMETB guide to good practice 43 . Developing and maintaining an assessment system . case note review. • computer simulation.g. video stimulated recall .e. etc. • other written assessments.e.g. 7) Cognitive assessments • knowledge . chart simulated recall. moulage. CRQ. • reflective practice . by invigilated test such as MCQ. • simulated teamwork exercise . clinical letter writing (SAIL). Materials on which the trainee might reflect include portfolios. etc. In such circumstances. Sometimes the system requires that skills will be developed which need to be tested more often than the clinical realities will allow. this modality is appropriate for a variety of areas. or which need to be checked before the trainee is allowed to practice them. simulated group discussion.g. CPR. although having the advantage of reproducibility. it is easier to standardise.5) Discussion of clinical materials • review of a documented incident or of medical records .g.e. • problem solving/higher cognitive assessment/application of knowledge . • simulated practical procedure . but instead they assess the trainee’s insight when reflecting on these things.g. teamwork assessment using simulated situations. Because simulation allows more standardisation. etc. using models or manikins.g. the trainee might be assessed in a simulated setting. CAEs and other case events. case based discussion.e. 6) Simulation • consultation skills . have the disadvantage of being less real than real life. on a manikin or a model. on a one-to-one basis . • discussion of clinical material . etc. These assessments are usually performed in the absence of patients. e. These share the feature that a group of trainees may be assessed simultaneously.g. A&E. • simulated situation management . topic or event.e. charts (CSR).g. with ‘standard patient’ or other role player.g. videos (VSR). communication assessment using standard patients.g. reflective diary. written up case.e. 
• review of trainee-held materials .often outside the clinical setting. EMQ.e. skill or knowledge. These assessments are not assessments of actual performance. case notes (CBD). EMQ. typically by means of a written test such as MCQ. Simulations. Assessments by computer simulation in some areas are beginning to address these needs. There is a large variety of materials which may be used for such discussion.e.g. 8) Reflective practice • review of outcomes of care. • critical thinking/understanding/evaluation of evidence. CRQ. file (‘portfolio’) of achievements.e. Because this type of assessment is based on actual materials. such as practical procedures.

e.44 1) A real 2) Direct 3) Behaviour 4) Behaviour 5) Discussion 6) Simulation 7) Cognitive 8) Reflective (medical) observation over time in a real of clinical .g. A&E.g. OSATS feedback . diary labour ward Good clinical care Clinical assessment Treatment Insight Record keeping Developing and maintaining an assessment system .e.practice . mini-PAT . manikin.e. patient of skill .g.e. mini-CEX TAB. e. CBR computer reflective .g. MCQ. multi-source situation or materials . encounter DOPS.e.e.g. environment CBD. CRQ portfolio.a PMETB guide to good practice Fairness Keeping up-to-date Specialty based knowledge Appendix 4: Assessment good practice plotted against GMP Procedural and technical skills Understanding evidence based medicine .g. play.e. VSR. role assessments .g.g.

appraising and assessing Commitment Understanding medical education principles Practical skills in teaching. training.a PMETB guide to good practice Relationship with patients Respect 45 . Maintaining and improving performance Folder Audit & quality Patient safety Teaching. appraising and assessment Relationship with patients Respect Communication with patients Child protection Responding to problems Informed consent Confidentiality Developing and maintaining an assessment system .

Glossary of terms

A working paper used by the Workplace Based Assessment Subcommittee of the Postgraduate Medical Education and Training Board to define their work and documents. This glossary is based on material from the Tehran University Medical School website (http://www.tums. ... htm), the MRCP (UK) Glossary of testing terms and contributions from members of the PMETB Workplace Based Assessment Subcommittee.

Ability
The level of successful performance of the objects of measurement on the variable.

Accommodation
A change in standard examination conditions which aims to lessen the impact of a trainee’s disability on their performance. It should not alter the purpose or nature of the examination or provide an unfair advantage to the disabled trainee.

Accreditation
A self-regulatory process by which governmental, non-governmental, voluntary associations and other statutory bodies grant formal recognition to educational institutions or programmes that meet or exceed stated criteria of educational quality.

Achievement test
A test designed to measure and quantify a person’s knowledge and/or skill.

Adaptive testing
A sequential form of individual testing in which successive items in the test are based primarily on the participant’s response to previous items.

Anchor item
An item with known performance characteristics, which is included in more than one test in order to provide comparative information about the items in the new version of the test and also about the test takers attempting it.

Angoff method
A method of standard setting (discussed in more detail in the PMETB Standard Setting document) based on group judgments about the performance of hypothetical borderline (‘just passing’) trainees.

Appeal
Formal request to the awarding body for reconsideration of a decision (commonly the pass/fail decision).

Appraisal
An individual and private planned review of progress focusing on achievements and future activities.

Assessment
The process of measuring an individual’s progress and accomplishments against defined standards and criteria, which often includes an attempt at measurement. The purpose of assessment in an educational context is to make a judgment about mastery of skills or knowledge, to measure improvement over time, to arrive at some definitions of strengths and weaknesses, to rank people for selection or exclusion, or perhaps to motivate them. Assessment should be as objective and reproducible as possible. A reliable test should produce the same or similar score on two occasions or if given by two assessors. The validity of a test is determined by the extent to which it measures what it sets out to measure. There are different kinds of assessment, though the distinction between the first two - formative and summative - is now becoming less important because of the development of assessment programmes, the use of evidence from assessments for multiple purposes and the increasingly common practice of providing feedback following all assessments.

• Formative assessment is used as part of a developmental or ongoing teaching/learning process. It is a check on progress that does not contribute to pass/fail decisions, but informs teachers and learners about strengths, weaknesses and any problem areas. It is best used when accompanied by feedback to the student.
• Summative assessment traditionally takes the form of tests and often occurs at the end of a term or a course. However, other sources of evidence are increasingly contributing to summative assessment, especially in postgraduate medical education. Summative assessment is used primarily to provide information about whether or not the student has reached the required standard and it can form the basis of pass/fail decisions.
• Criterion referenced assessment refers to an absolute standard, i.e. the individual's performance against a benchmark. Unless there is a particular reason not to do so (such as a limited number of places for students who pass) all summative assessments should be criterion referenced.

• Norm referenced assessment ranks each student’s performance against all the others in the same cohort, with a (usually) predetermined number of the top students passing. Norm referenced assessment is inherently unfair because a student may pass or fail simply because of the company they keep. Unless there is an exceptional reason, such as a limited number of places for successful students to progress into, norm referenced assessment should not be used.

Assessment programmes
Contemporary best practice favours assessment strategies that are multi-faceted and assess an appropriate spectrum of knowledge, skills, competencies and personal attributes in an adequately reliable way. Such a programme of assessment is based upon and determined by the curriculum (see PMETB Principles of assessment).

Assessment: 360-degree
Can be used to assess interpersonal and communication skills, professional behaviours and many aspects of patient care and systems based practice. Most 360-degree assessments use a structured questionnaire to gather information about an individual’s performance in several domains such as teamwork, communication, decision making, leadership and management skills, etc. Assessors completing rating forms in a 360-degree evaluation are usually a mixture of peers, subordinates, other team members, patients and their families. This is a useful instrument for both formative and summative assessment, and an excellent source of evidence on which to give feedback to the person who was assessed.

Bias
Systemic variance that skews the accurate reporting of data in favour of, or against, a particular individual or group.

Blueprint
A template used to define the content of a given test. In medical education, it is often designed as a matrix or a series of matrices.

CanMEDS
Canadian Medical Education Directives for Specialists - an innovative framework for medical education produced by the Royal College of Physicians and Surgeons of Canada, organised around seven key roles: Medical Expert (the central role), Communicator, Collaborator, Health Advocate, Manager, Scholar and Professional.

Certification
The process by which governmental, non-governmental or professional organisations or other statutory bodies grant recognition to an individual who has met certain predetermined standards specified by the organisation and who voluntarily seeks such recognition.

Chart stimulated recall oral examination (CSR)
A measurement tool which permits the assessment of clinical decision making and the application of medical knowledge with real patients using a standardised oral examination. A trained and experienced physician examiner questions the examinee about the provided care, probing for reasons behind the differential diagnoses, interpretation of clinical findings and management plans. The examiners rate the examinee using an established protocol and scoring procedure. CSR can be used in a formal situation where trainees discuss a number of cases, but is increasingly proving to be a valuable instrument in workplace based assessment where it tends to be carried out on a case-by-case basis, or even opportunistically. Current research is also being carried out in using CBR in multi-disciplinary learning (contact Dr Gareth Holsgrove - gholsgrove@rcpsych. ... uk).

Clinical competence
A student’s ability to do what is expected at a satisfactory level of facility, at a certain point in time, e.g. at graduation. It is the acquisition of a body of relevant knowledge and of a range of relevant skills which includes personal, interpersonal, clinical and technical components. In the case of clinical education, which is primarily based on an apprenticeship model, teachers define what the student is expected to do and then test their ability to do it. However, because of the complex reality of what doctors actually do on a day-to-day basis, ‘clinical competence’ gives us a rather limited view of their work, professional experience and expertise. Most clinical actions are concerned with problems for which there is no clear answer or no single solution and where no two patients are the same, even if they have the same condition. An experienced doctor searches his or her mind and sifts through a wide range of options and in some cases the solution will be something he or she has never come up with before. Therefore, competence itself is best seen as a prerequisite for performance in the real clinical setting where it would be expected that a doctor operated at a higher level in many areas and demonstrated mastery in some.

Competency
The knowledge, skill, attitude or combination of these, that enables one to effectively perform the activities of a particular occupation or role to the standards expected.

0 in value. Criterion referencing Criterion referenced assessment measures performance against an absolute standard. learning. having acquired the knowledge and skills necessary to perform those tasks that reflect the scope of professional practices.0 to +1.internal consistency. a correlation coefficient of 0. Strongly positive correlations indicate that they are testing more or less the same thing (because most of the variation in one can be predicted from variation in the other. it will also stipulate the entry criteria and duration of the programme. Negative correlations indicate that the items are either testing material from different domains.0 indicates no relationship between the two variables and a correlation coefficient of -1.49) of the variation in one variable can be predicted from the other. In a test.7 between two variables indicates that 49% ( essential skill for clinical practitioners because of the large and varied number of people doctors must communicate with every day and the range of circumstances. The idea that doctors automatically learn communication through experience or that doctors are inherently either good or bad communicators is long abandoned. which denotes what someone is actually doing in a real life situation. The questions might ask about research methodology. supervision and feedback. items should correlate moderately positively with each other. Cronbach’s alpha The most commonly measured aspect of reliability of a test . CPD refers to the learning activities that doctors undertake after their formal specialist training is complete. stronger trainees performing statistically better than weaker ones. each trainee’s performance against a benchmark (usually the pass mark).a PMETB guide to good practice . or that at least one of them is flawed. Correlations range from assessment based on responses to questions regarding an article (often a research article) from a book or (more commonly) journal. 
assessment.0 indicates a perfect positive relationship. If a correlation coefficient is squared. content. It is now widely acknowledged that both students and postgraduate doctors can be educated in communication skills and their proficiency can develop to extremely high levels of expertise. Correlation coefficient Describes the strength of the relationship between two variables. Construct A specific professional concept.0 indicates a perfect negative relationship. almost invariably a senior doctor.8. clinical implications. See ‘Construct validity’ below. organisation. Communication skills These skills lead to proficiency in communication . as explained in the preceding paragraph). some of which might be very distressing. Discriminator An item that discriminates well between weaker and stronger test takers. For example. under ‘Validity’. A correlation coefficient of 1.a key aspect of life-long learning. processes and methods of teaching. responsible for overseeing a trainee’s clinical work and providing guidance and feedback. skill. CRQ Critical Reading Question . It may be different from performance. Competencies A set of professional abilities that includes elements of knowledge. It states the rationale. in which they must communicate. CPD Continuing Professional Development .7 x 0.7 = 0. It is an average of all possible split half reliability measurements. Curriculum A curriculum is a statement of the aims and intended learning outcomes of an educational programme. Competence The possession of requisite or adequate ability. If appropriate.Clinical supervisor A term used in UK postgraduate medical education to describe an individual. but for high stakes examinations it should be at least 0. robustness of conclusions. etc. a correlation of 0. 48 Developing and maintaining an assessment system . the resulting number indicates the ‘percentage of the variation’ in the two variables that is in common. 
The generally accepted minimum value of Cronbach’s alpha for a test is 0. attitudes and experience. In other words.9.
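The correlation coefficient entry can be illustrated with a short sketch. It computes a Pearson correlation between two sets of marks and squares it to give the proportion of shared variation, matching the glossary's example in which a correlation of 0.7 corresponds to 49%. The function names and the sample data are illustrative assumptions, not part of the PMETB guidance.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists of marks."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)


def shared_variation(r):
    """Proportion of variation the two variables have in common (r squared)."""
    return r * r


# Hypothetical marks for four trainees on two test items.
item_a = [1, 2, 3, 4]
item_b = [2, 4, 6, 8]
r = pearson_r(item_a, item_b)
print(r, shared_variation(r))
```

The result always lies between -1.0 and +1.0, so the squared value lies between 0 and 1 and can be read as a percentage.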

skills and abilities. as instructed. In the UK. Facility A statistical property indicating the level of difficulty of a question (between 0. skills. The distinction between formative and summative . such as the number of items in the test. whether marking is carried out by one examiner or more. Developing and maintaining an assessment system . Fail Awarded a score below the pass mark. Educational agreement A mutually acceptable educational development plan drawn up jointly by the trainee and their educational supervisor. or observation of the trainee performing practical tasks. Examinations might involve written or oral responses. evaluation refers to the process of determining the quality and value of an educational programme.assessment is becoming less important as evidence from assessment is increasingly being used for multiple purposes. for incorrect options in a multiple choice question. Discussed in more detail in the PMETB Principles for an assessment system for postgraduate medical training (1).Distractor A term. and how valid and relevant it is. Examiner A person appropriately skilled. Evidence based medical education (EBME) An education that is based on the best evidence available. followed by a homologous list of at least five options from which the trainee selects one or more. Experience Exposure to a range of medical practice and clinical activity. Examination A formal. based only on the cohort of trainees attempting that particular item. A multiple analysis of variance is used to indicate the magnitude of errors from various specified sources. Evaluation In the UK curriculum. persons and observational conditions that were studied. competencies and professional characteristics that can be combined for practical reasons into one cluster. Generalisability theory An extension of classical reliability theory and methodology that is now becoming the preferred option. what its utility. becoming obsolete. 
Extended matching questions (EMQs) A more detailed form of multiple choice question (MCQ) having a lead-in statement such as a clinical vignette. Domain The scope of knowledge. ‘assessment’ is used of individuals and ‘evaluation’ of programmes. Goal A general aim towards which to strive. The analysis is used both to indicate the reliability of the test and to evaluate the generalisability of scores beyond the specific sample of items. It should take into account such factors as how reliable the available evidence is. It takes account of both the difficulty of the test items and of the maximum and minimum acceptable failure rate for the exam.0) obtained from the average score for the question divided by the maximum achievable score. extent and strength is. decision making. experienced and trained to conduct examinations. In US usage. controlled method or procedure to assess an individual’s knowledge. and was designed for use in high stakes examinations with a large number of trainees. Educational supervisor The person who is responsible for the overall supervision and management of an individual student or trainee’s educational programme. evaluation includes both the quality of the programme and the assessment of individuals on the programme. etc. Hofstee method A ‘compromise’ method of standard setting which combines aspects of both relative and absolute methods. Formative assessment Assessment carried out for the purpose of improvement rather than pass/fail decision making.0 and 1.
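The facility statistic described in this part of the glossary (the average score for a question divided by the maximum achievable score) can be sketched as follows. The function name and the sample marks are illustrative assumptions.

```python
def facility(item_scores, max_score):
    """Facility of one question: average score divided by maximum achievable score.

    item_scores holds the marks of only those trainees who attempted the item.
    The result lies between 0.0 and 1.0, with higher values for easier items.
    """
    if not item_scores:
        raise ValueError("at least one score is required")
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    average = sum(item_scores) / len(item_scores)
    return average / max_score


# Hypothetical one-mark item answered correctly by three of four trainees.
print(facility([1, 0, 1, 1], max_score=1))
```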

experience. MSF Multi-Source Feedback . In certain countries this commitment is a statutory requirement. and to the test as a whole. learning objectives describe the specific knowledge or skills which learners are expected to be able to demonstrate. Medical Informatics “Medical informatics is a rapidly developing scientific field that deals with the storage. It is usually calculated as the Standard Error of Measurement. Measurement error is present in all assessments. H. postgraduate and continuing medical education. Some medical educators are physicians. Measurement error The difference between the ‘true’ score and the score obtained in an assessment. Because medical science changes so rapidly. but can be minimised by good item design and. but increasingly there is a focus on the ‘life- long’ developmental and integrated nature of medical education. Knowledge The acquisition or awareness of facts. communication technology and an increasing awareness that the knowledge base of medicine is essentially unmanageable by traditional paper based methods. it is vital that its practitioners are committed to and engage in life-long learning. classified examination items. retrieval and optimal use of biomedical information. Rapid development is due to advances in computing.a PMETB guide to good practice . This feedback is typically given by completing a questionnaire. Item bank A collection of stored. behavioural science or other health sciences. attitudes and competencies that should be demonstrable on completion of a learning episode. etc.for example. on a doctors’ performance from a number of co-workers such as other team members. IRT also examines individual items in relation to each other. The preferred alternative is intended learning outcome (see above). Rules governing the writing of learning objectives make them difficult to produce and. It has traditionally been divided into undergraduate. Replies are collated and an anonymised summary is produced for feedback. 
as a result. up to a point. responsibility and values. data. Life-long learning Continuous personal educational development over the course of a professional career. Medical educator A professional who focuses on the educational process necessary to transform non-physicians into physicians and to keep them current over their years of practice. but many have backgrounds in education. In-training An adjective used in UK medical education to describe ongoing processes that occur in the workplace . Learning objective A term that is now becoming obsolete. comparatively few items described as learning objectives actually are proper learning objectives. Intended learning outcome The contemporary replacement for learning objectives (see below) which describes (typically in observable terms) the knowledge. administrative staff. skills. Medical education The ongoing integration of knowledge. quantifying such characteristics as item difficulty and their ability to discriminate between good and poor trainees. in-training assessment would refer to collecting evidence of progress and attainment over an extended period of time. Item An individual question or task in an assessment or examination. by increasing the number of test items. Item response theory (IRT) A set of mathematical models for relating an individual’s performance in a test to that individual’s level of ability. These models are based on the fundamental theory that an individual’s expected performance on a particular test item is a function of both the level of difficulty of the item and the individual’s level of ability. ideas or principles to which one has access through formal or individual study. usually with regular staging reviews. data and knowledge for problem solving and decision making” (E. observation experience or intuition. qualities. Shortliffe). skills. information.

See also above under ‘Assessment’. The PDP is an integral part of reflective practice and self-directed learning for professionals. Portfolio based learning or portfolios This refers to a collection of evidence documenting learning and achievements. OSCE Objective Structured Clinical Examination . it denotes what a student or doctor actually does in his/her encounter with patients. In the UK. carried out in the workplace. colleagues. Artefacts such as radiographs. rather than to an established standard (criterion referencing). Peer review This is an important tool in obtaining evidence about professional attitudes and behaviour. lab reports and photographs are also commonly used. On the contrary. development goals. actions and processes. considerable efforts and resources are now being brought to bear on the assessment of doctors’ performance and. It is an important component of 360-degree assessment. Multiple choice An item where the trainee selects what they consider to be the correct answer from a list of options. Stations frequently feature real or (more often) simulated patients. instruments to assess it are predominantly workplace based. nurses and patients to evaluate trainees. only the top n number or x% of trainees pass. portfolios are used routinely in postgraduate medical education where they are known as RITAs (Records of in-training assessment). for example where there is a limited number of posts available for successful trainees to move on to. because of its importance. Performance is not the same as needing to ‘know’ everything. Performance based assessment Assessment of clinical performance is of the greatest importance but is difficult to measure. knowing your own limits. since this performance is carried out and observed in the workplace. Norm referencing should be used only in certain special circumstances. Pass mark The score that allows a trainee to pass an assessment. their relatives and carers. 
It can be carried out by trainees to assess each other and is also used by supervisors. Educational models work from the premise that the outcomes cannot wholly be predicted. Norm referencing A method of establishing passing and failing trainees based on their performance in relation to each other. it may well be about knowing what you don’t or even cannot know . knowledge or stimulus to change (improve) practice. Performance The application of competence in real life.a PMETB guide to good practice 51 . In education. Personal development plan (PDP) A prioritised list of educational needs. are likely to be among the most important developments in medical education over the next few years. irrespective of how strong or weak the cohort is as a whole.a multi-station clinical examination (typically having 15 to 25 stations). Candidates spend a designated time (usually 5 to 10 minutes) at each station demonstrating a clinical skill or competency at each. compiled by learners and used in systematic management and periodic reviews of learning. there is a growing tendency to use the expression ‘intended learning outcomes’. Outcomes An expression reflecting all possible results that may stem from exposure to a causal factor or activity. etc. outcomes are part of the training model and this is usually a new skill. Developing and maintaining an assessment system . In the case of medicine. Commonly used in MCQs (multiple choice questions) and EMQs (extended matching questions). Multiple choice questions (MCQs) A lead-in statement (typically a short clinical description) followed by a homologous list of options (five is generally considered the optimum) from which the trainee selects the best answer. However. team members and other members of staff. Multiple response questions Apart from some types of extended matching questions (EMQs). Performance based assessments. other words. particularly in curriculum design. So for example. 
Pass To achieve a score (mark) that allows progress in training or successful completion of an examination. this is an obsolete question format where trainees select various combinations of the proffered options as their correct answer.
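The contrast between criterion referencing (each trainee measured against a fixed pass mark) and norm referencing (only the top n trainees pass, however strong the cohort) can be sketched as below. The function names, trainee labels and marks are hypothetical. Ties at the cut-off are resolved here by list order, which is an arbitrary assumption of this sketch.

```python
def criterion_referenced(scores, pass_mark):
    """Pass every trainee whose score reaches the fixed pass mark."""
    return {name: score >= pass_mark for name, score in scores.items()}


def norm_referenced(scores, top_n):
    """Pass only the top_n highest-scoring trainees in this cohort."""
    # sorted() is stable, so tied trainees keep their original order.
    ranked = sorted(scores, key=scores.get, reverse=True)
    passed = set(ranked[:top_n])
    return {name: name in passed for name in scores}


# A hypothetical strong cohort: everyone clears a pass mark of 60,
# but norm referencing with three places still fails trainee D.
cohort = {"A": 90, "B": 85, "C": 80, "D": 75}
print(criterion_referenced(cohort, pass_mark=60))
print(norm_referenced(cohort, top_n=3))
```

The example shows why the glossary restricts norm referencing to special circumstances such as a limited number of posts: the outcome for an individual depends on who else sat the assessment.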

which is the learner’s practical and intellectual property relating to their professional development.9. while others are assessed against specific targets of achievement. Quality control This relates to the arrangements (procedures. This quality is usually calculated statistically and reported as coefficient alpha (also known as Cronbach’s alpha in recognition of its developer. the quality of the test and test items themselves. In addition to medical knowledge and skills. duty and honour. Ideally. Raw score A test mark that has not been modified (for example. organisation) within local education providers (Health Boards. • Intra-rater reliability is concerned with the extent to which a single assessor would give similar marks for almost identical 52 Developing and maintaining an assessment system . humility and compassion. There are some other important dimensions of reliability. Professionalism Adherence to a set of values comprising statutory professional obligations. Among the factors contributing to reliability are the consistency of marking. generalisability theory (see above) is becoming the preferred alternative because. standards. The main dimensions of reliability. portfolios contain material collected by the learner over a period of time. It is assembling evidence of performance from different sources and enables an assessment within a framework of established clear criteria and learning outcomes. its presentation for assessment. equivalence and homogeneity. PMETB will undertake planned and systematic activities to provide public and patient confidence that postgraduate medical education satisfies given requirements for quality within the principles of better regulation. in the light of reliability calculations). Lee Cronbach). measurements should yield the same results when repeated by the same person or made by the different assessors. 
Programme director The person with overall day-to-day responsibility for a regional (usually deanery level) postgraduate training programme Quality assurance This encompasses all the policies. are as follows: • Equivalence or alternate form reliability is the degree to which alternate forms of the same measurement instrument produce congruent result. As it is based on the real learner’s experience. apart from internal consistency. Quality management This refers to the arrangements by which the Postgraduate Deanery discharges its responsibility for the standards and quality of postgraduate medical education. national and professional standards. High stakes assessments must have a higher alpha than this and there is general consensus among test developers that the benchmark in high stakes examinations should be 0. It is usually done within some agreed objectives or a negotiated set of learning activities. Some portfolios are developed in order to demonstrate the progression of learning. empathy.especially for test development purposes. Other measures of reliability include stability. The lowest acceptable value of Cronbach’s alpha in summative assessments is generally agreed to be 0. NHS Trusts. The learner takes responsibility for the portfolio’s creation and maintenance. which is a measure of a test’s internal consistency. In the case of tests. If a single measure of the reliability of an assessment instrument is made. accountability. However. On top of this. Key values include acting in the patients’ best interest and maintaining the standards of competence and knowledge expected of members of highly trained professions. it should be this one. Reliability Expresses a trust in the accuracy or provision of the correct results. social responsibility and sensitivity to people’s culture and beliefs. It satisfies itself that local education and training providers are meeting the PMETB standards through robust reporting and monitoring mechanisms. and. if appropriate. 
it is an expression of the consistency and reproducibility (precision) of measurements. formally agreed codes of conduct and the informal expectations of patients and colleagues. and the type and size of the sample. • Homogeneity is the extent to which various items legitimately team together to measure a single characteristic. it provides much richer information .8. Independent sectors) that ensure postgraduate medical trainees receive education and training that meets local. there will also be a component of random error. it links theory and practice and might also usefully include a reflective element. systems and processes directed to ensuring maintenance and enhancement of the quality of postgraduate medical education in the UK. These standards will include ethical elements such as integrity. medical professionals should present psychosocial and humanistic qualities such as caring.In essence.a PMETB guide to good practice . • Inter-rater reliability refers to the extent to which different raters give similar ratings for similar performances. probity. although it is considerably more complicated to calculate.
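Coefficient alpha, named above as the usual single measure of internal consistency, can be computed from a table of item scores. The sketch below uses the standard textbook formula, alpha = k/(k-1) × (1 - sum of item variances / variance of total scores), which this guide does not itself spell out. The function name and data layout are assumptions.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a score table.

    scores: one row per trainee, each row a list of that trainee's item scores.
    """
    if len(scores) < 2:
        raise ValueError("at least two trainees are required")
    n_items = len(scores[0])
    if n_items < 2:
        raise ValueError("at least two items are required")
    if any(len(row) != n_items for row in scores):
        raise ValueError("every trainee must have a score for every item")
    # Sample variance of each item across trainees.
    item_variances = [variance([row[i] for row in scores]) for i in range(n_items)]
    # Sample variance of the trainees' total scores.
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("alpha is undefined when all total scores are equal")
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical two-item test taken by four trainees.
print(cronbach_alpha([[2, 1], [1, 2], [3, 3], [4, 4]]))
```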

nor is it a review of progress although it is likely to be used as a source of evidence. gauge. established by authority. manners or behaviour). testing in the same domain. achievements and performance. Self assessment A process of evaluation of one’s own achievements. or de facto (generally accepted by custom or convention. standard setting and examiner training. This. It is also defined as a ‘criterion. Result The outcome of a test. It is important to note that RITA is not an assessment in its own right. Review Consideration of past events. 3 SEMs = 99%). is highly significant in examinations such as Royal College membership and Fellowship examinations. borderline trainees would be those within 2 or even 3 SEMs of the pass mark. Self assessment is an important part of self directed and life-long learning. the SEM gives the confidence intervals for marks awarded to trainees (1 SEM = a confidence interval of 68%. custom or general consent. and feedback. In high stakes examinations. assessment. 2 SEMs = 95%.a structured rating form for the assessment of outpatient letters between hospital and GP. weight. Meaningful standards should offer a realistic prospect of assessing whether or not they are met. or a systematic and co-ordinated pattern of mental and/or physical activity. RITA Record of in-training assessments. yardstick and touchstone’ by which judgments or decisions may be made. that informs reviews. professional performance and competencies. or would be consistent if re-marking a test item. This is important in identifying borderline trainees. Thus the word ‘standard’ refers simultaneously to both ‘model and example’ and ‘criterion or yardstick’ for determining how well one’s performance approximates the designed model. voluntary (established by private and professional organisations and available for use). A portfolio of assessments that are carried out during training. behaviour. 
This may be either a formal or informal process and can be an integral part of appraisal. Skill The ability to perform a task well usually gained by training or experience. which is used throughout UK postgraduate medical education. Spearman-Brown formula A calculation derived from classical test theory that predicts the reliability of shortened or (usually) lengthened versions of a test. Standard deviation The square root of the variance. Score The mark obtained in a test. SAIL Sheffield Assessment Instrument for Letters . • Test-retest reliability (or stability) is the degree to which the same test produces the same results when repeated under the same conditions. gained through assessment. A standard may be mandatory (required by law). used to indicate the spread of group scores and a component of the equation to calculate Standard Error of Measurement (SEM). Standard error of measurement (SEM) Calculated from Cronbach’s alpha and the standard deviation of a test (SEM = SD × √(1 – alpha)). extent. along with test-retest reliability to which it is closely associated. value or quality. Sometimes programmes use actors to accomplish this goal. and involves such concepts as producing matching items. such as the standard of dress. Simulated patients Individuals who are not ill but adopt a patient’s history and role for learning or assessment in medical education. based on the reliability calculated from a version of that same test of a specific length. example or rule for the measure of quantity. Standard Refers to a model. • Parallel forms reliability refers to the consistency of results between two or more forms of the same assessment.
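The SEM formula quoted above (SEM = SD × √(1 – alpha)) and the Spearman-Brown prediction can both be expressed in a few lines. The SEM function follows the glossary's formula directly. The Spearman-Brown function uses the standard classical test theory form, k·r / (1 + (k – 1)·r), which the guide names but does not write out. All function names and example numbers are illustrative assumptions.

```python
import math


def standard_error_of_measurement(sd, alpha):
    """SEM = SD * sqrt(1 - alpha), as given in the glossary."""
    if sd < 0:
        raise ValueError("standard deviation cannot be negative")
    if alpha > 1:
        raise ValueError("alpha cannot exceed 1")
    return sd * math.sqrt(1 - alpha)


def score_band(score, sem, n_sems=2):
    """Range of n_sems SEMs either side of an observed score.

    The glossary associates 1, 2 and 3 SEMs with roughly 68%, 95% and 99%
    confidence respectively.
    """
    return (score - n_sems * sem, score + n_sems * sem)


def spearman_brown(reliability, length_factor):
    """Predicted reliability when test length is multiplied by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical exam: SD of 10 marks, alpha of 0.84, pass mark of 60.
sem = standard_error_of_measurement(10, 0.84)
low, high = score_band(60, sem)
print(sem, low, high)
print(spearman_brown(0.6, 2))
```

Trainees whose scores fall inside the band around the pass mark would be the borderline group that the glossary says should be identified in high stakes examinations.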

Validity In the case of assessment. knowledge. attainment or difficulties should be obtained from more than one source. for example. • Performance or assessment standards . often in cost benefit form. what teachers are supposed to teach and what students are expected to learn. and the individual components of the assessment. face validity can be described from the perspective of an interested lay observer. then the assessment has good face validity. True score A trainee’s score on a test without measurement error .Standard setting The process of establishing the pass mark. It is concerned with whether the right things are being assessed. using more than one assessment method. attitudes and values. Syllabus A list. Utility Utility refers to an evaluation.these standards define degrees of attainment of content standards and level of competencies in compliance with the professional requirements. This aspect of validity is the one of greatest concern to the teachers. of course contents or topics that might be tested in examinations. or some other kind of summary description.that whenever possible. See also ‘Assessment’ above. If they feel that the right things are being assessed in the right way. • Construct validity The extent to which the assessment. in the right way. Thus a standard is both a goal (what should be done) and a measure of progress towards that goal (how well it was done). workplace based process by which educational experience is provided and competencies obtained. though they should also pay serious attention to consequential validity. skills and attitudes required at the time of the graduation. • Face validity Related to content validity. The assessment must be representative and should.these define availability of staff and other resources necessary for students to be able to meet the content and performance standards. They describe how well the curriculum standards have been attained. 
There might also be ‘essential (core) requirements’ that the medical curriculum must meet to equip physicians with the knowledge. Standards In medical education standards may be defined as ‘a model design or formulations related to different aspects of medical education and presented in such a way to make possible assessment of graduates’ performance in compliance with generally accepted professional requirements’. a detailed curriculum is the document of choice and the syllabus would not be regarded as an adequate substitute.a PMETB guide to good practice . • Content validity This is concerned with sampling what the student is expected to achieve and demonstrate. Trainer An individual providing direct educational support for a doctor in training. on more than one occasion and.these describe skills. In modern medical education. Summative assessment Assessment carried out for the purpose of (usually pass/fail) decision making. and with a positive influence of learning. or of using a test in one manner compared with another. as opposed to not using it. The distinction between formative assessment (to aid improvement) and summative assessment (for decision making) is becoming less important as evidence from assessment is increasingly being used for multiple purposes. • Process (or opportunity-to-learn) standards . if possible. Medical education standards are set up by consent of experts or by decisions of an educational authority. or of using one test as opposed to another. Training The ongoing. tests the professional constructs on 54 Developing and maintaining an assessment system . cover several categories of competence.particularly important in workplace based assessment . a range of patient problems and a number of technical skills. of the relative value of using a test. evidence of progress. although one might usefully be included as an appendix. Triangulation The principle . 
validity refers to the degree to which a measurement instrument truly measures what it is supposed to measure. Three types of interrelated educational standards might be envisaged: • Curriculum standards .true score is the observed score minus the error.

and in what direction. • Consequential validity This is an important. which they are based. are an important determinant of individual and community health. or they might commit large bodies of factual knowledge to memory without really understanding it in order to pass a test of factual recall and then forget it soon afterwards. The z-score transformation is useful to compare the relative standings of items from distributions with different means and/or different standard deviations. reflecting. the extent to which inferences can be made on the basis of a particular assessment of professional concepts. they might omit certain aspects of the curriculum because they do not expect to be assessed on them. Values This is a sociological term referring to what we believe in and what we hold dear about the way we live. so. Workplace based assessment The assessment of working practices based on what they actually do in the workplace and predominantly carried out in the workplace itself. Z-score The z-score for an item indicates how far. • Concurrent validity This is the degree to which a measurement instrument produces the same results as another accepted or proven instrument that measures the same parameters. an individual trainee’s score deviates from the mean distribution of that item. For example. a measure of attitudes towards preventive care should correlate significantly with preventive care behaviours. • Criterion-related validity This is concerned with the overall criteria of the assessment and how it relates to a ‘gold standard’. Values. communities and cultures . It refers to the effect that assessment has on learning and in particular on what students learn and how they learn it. • Predictive validity This refers to the degree to which a measure accurately predicts expected outcomes. 
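The z-score entry can be made concrete with a short sketch that converts a set of marks into standard deviation units, showing how far and in what direction each mark lies from the mean. Using the population standard deviation is an assumption of this sketch, since the guide does not say which form to use. The function name and data are illustrative.

```python
from statistics import mean, pstdev


def z_scores(scores):
    """Express each score as its distance from the mean in standard deviations."""
    mu = mean(scores)
    sigma = pstdev(scores)  # population standard deviation (an assumption)
    if sigma == 0:
        raise ValueError("z-scores are undefined when all scores are equal")
    return [(s - mu) / sigma for s in scores]


# Hypothetical marks for eight trainees on one item.
marks = [2, 4, 4, 4, 5, 5, 7, 9]
print(z_scores(marks))
```

Because the results are in standard deviation units, marks from tests with different means or spreads can be compared on a common scale.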
Weighting Assigning different values to different items. It is expressed in units of its standard deviation. It is usually sub-divided into concurrent validity and predictive validity. Both these behaviours would indicate that the assessment has poor consequential validity because both lead to bad learning practices. but they are difficult to measure objectively. their importance or difficulty in order to increase the effectiveness of a test. Our values influence our behaviour as persons. aspect of the validity of assessment. for example. Postgraduate Medical Education and Training Board Hercules House Hercules Road London SE1 7DU Tel +44 (0)20 7160 6100 Fax +44 (0)20 7160 6102