EVALUATION METHODS: A RESOURCE HANDBOOK

McMaster University
Program for Educational Development
Program for Faculty Development
and
Educating Future Physicians of Ontario (EFPO) Project

1995

TABLE OF CONTENTS

I   INTRODUCTION
    1.1  Overview                                        G Norman, J Wakefield, S Shannon
    1.2  Lessons from the Literature                     G Norman, D Swanson
    1.3  Effective Feedback                              M Brown, B Hodges, J Wakefield
    1.4  A Guide to Sections 2 - 6                       G Norman, J Wakefield, S Shannon

II  SUMMARY REPORTS AND RATINGS
    2.1  Tutorial Performance                            J Hay
    2.2  Clinical Ratings - Ward Evaluation              D Streiner

III ORAL EXAMINATIONS
    3.1  Triple Jump Exercise                            M Westmorland, M Parsons
    3.2  Oral Examinations                               L Muzzin
    3.3  Chart Stimulated Recall                         E Hanna

IV  WRITTEN TESTS
    4.1  Multiple Choice Questions                       G Norman
    4.2  Modified Essay Questions                        P Stratford, J Smeda
    4.3  Essays                                          D Palmer, E Rideout

V   PERFORMANCE TESTS
    5.1  Direct Observation                              M Thomson
    5.2  Objective Structured Clinical Examination       P Salvatori, J Roberts, B Brown

VI  ASSESSMENT OF SKILLS
    6.1  Self and Peer Assessment                        H Arthur, J MacFadyen
    6.2  Communication Skills                            B Hodges, J Turnbull
    6.3  Problem Solving Skills                          G Norman
    6.4  Psychomotor Skills                              M O'Connor
    6.5  Evaluation of Bioethics                         P Singer
    6.6  Critical Appraisal Skills                       G Norman, S Shannon
    6.7  Professional Behaviours                         J Turnbull, D Bienenstock

Journal Abbreviations

CHAPTER 2.1
TUTORIAL PERFORMANCE
John Hay, McMaster University

The evaluation of student performance within problem-based tutorials is a troublesome subject. There are questions of what to evaluate, how to evaluate, and how often to evaluate, all presaged by the question of whether to evaluate at all. What to evaluate is, on the surface, relatively straightforward. There are common expectations of students in tutorials.
They must learn to make relevant hypotheses, identify appropriate learning objectives, direct their learning using suitable resources, provide evidence of learning, and perform throughout as a responsible and collegial member of the tutorial group. Evidence for the accomplishment of each of these tasks should be apparent during the tutorial session. Students typically spend enormous amounts of time and effort in preparation for a tutorial session. In the few hours they then spend with their fellow students and tutor, the fruits of their labour may or may not be evidenced. At the same time the students' ability to act as well-functioning group members should be apparent. Should either or both of these skill areas be formally evaluated? If yes, then how?

Generally the responsibility of forming some judgement of a student's performance based on the evidence provided during a tutorial rests with the tutor, although the student's peers may contribute. Some areas of performance appear to be more easily evaluated than others. One would expect experienced tutors to be able to identify students who are capable, productive members whose actions facilitate the learning of the group. Gaining a sense of how well students directed their learning and made use of appropriate learning resources requires somewhat more deduction but should be relatively straightforward to ascertain. The fact that students are part of a group makes determining a student's ability to hypothesize, suggest learning objectives and strategies, and demonstrate the degree and quality of learning a more difficult proposition. These areas require direct questions to probe knowledge and understanding, the interpretation of questions and comments made during tutorial, and an ability to determine the quality and extent of information provided to the group. There is a skill which students can develop which allows them to mask a lack of understanding and/or effort by simply agreeing with other members, acquiescing, or making broad generalizations rather than providing specifics. Graesser and Person (1994) have documented that tutors tend to ask open-ended questions and rarely ask questions of sufficient specificity. This provides little opportunity to accurately determine an individual student's level of understanding. The ability to draw inferences from a student's questions and comments during tutorial requires both experience and a sound knowledge of the subject matter at hand. Determining the quality of a student's informational contributions requires sufficient content knowledge on the part of the tutor. This latter skill supports Schmidt's (1993) contention that content knowledge expertise is a necessary part of a tutor's repertoire. To be able to evaluate an individual's performance in these areas is a difficult task for a tutor. Often the tutor experiences a conflict between his/her role as facilitator and mentor of individual students and his/her eventual role as arbiter of student progress.

RELIABILITY

What then is known about the reliability and validity of tutorial performance evaluation? Precious little, as it turns out.

Inter-Rater Reliability

There are two potential evaluators, since there are two kinds of witness to individual performance: the tutor and fellow group members. It has been our experience that evaluations made by peers which have a direct impact on grades tend to be inflated and have a range which excludes anything below an "average" rating.
This is not unexpected, but strongly suggests that the bias influencing this group of observers obviates their use as evaluators. However, peers also provide feedback, hopefully on a regular basis, to each other within tutorials. A tutor can summarize and use this feedback to gain a sense of peer evaluation of performance.

The relation between self and tutor evaluations has been investigated, and very strong correlations have been found (Hay, 1994). This appears to suggest that students provide accurate self-evaluations. However, it does not exclude the possibility that tutors and students simply share the same biases and have negotiated a mutually satisfactory position regarding a grade (Hay, 1994). This can be interpreted as a shared Lake Wobegon effect (Cannell, 1989), that all students are necessarily above average. This then brings into question the tutor's ability to evaluate.

Internal Consistency

The OT programme at McMaster has used a number of different forms. The areas evaluated are generally group skills, learning skills, knowledge, and critical thinking. In factor analyses of a 10 item form, however, Hay (1994) found that only a single factor was evident. Very high internal consistency also supported the conclusion that tutors overall were reporting a single attribute. Forms have been useful in identifying very weak areas and in rank-ordering students; however, the ranking tends to begin at a B- level or higher!

VALIDITY

No formal studies of reliability have been undertaken in the M.D. Program; however, several internal studies examined the relation between tutorial ratings and performance on the licensing examination. All used a similar method: blind comparison of written tutorial evaluations of students who had failed, and a control group of students who had passed, the licensing exam. The findings of all studies were similar; none found any relation between performance judged in the tutorial setting (even for the specific category of knowledge) and performance on the licensing examination. Admittedly, tutorial evaluation ostensibly judges many characteristics besides knowledge (see Professional Behaviours, Chapter 6.7); what these qualities are remains undetermined. This is the extent of the validity information we have. It is clear that there is no support for tutorial performance as a measure of knowledge gain. Whether tutorial evaluations have validity in other areas remains to be determined.

ADVANTAGES AND DISADVANTAGES

Tutorial evaluation should have several advantages. It is based on actual performance in the learning setting. To the extent that tutors can establish a close and nurturing environment, it should be possible to provide accurate and constructive feedback. However, the evidence we have reviewed suggests otherwise. The very intimate nature of the tutorial setting makes honest evaluation more, not less, difficult, particularly when it involves "bad news". Moreover, it is unclear just how much opportunity there is to evaluate an individual student's true performance, except in self-selected circumstances when the student chooses to volunteer information or opinion.

DEVELOPMENT AND IMPLEMENTATION

In favour of tutors' capacity to evaluate accurately is the fact that they have multiple opportunities to observe and comment on the student's development, responses to feedback regarding performance, and reaction to feedback from other group members.
Tutors may be assisted by having a formal checklist of behaviours/performance to direct their observations. Sharing these evaluations with students requires that the tutor be able to defend a judgement with documented evidence, a requirement which helps sharpen both observation and documentation. Strategies of these types provide a process which should allow a tutor to provide sound evaluations. Requiring the tutor to complete these evaluations at regular intervals should assist the tutor in focusing attention on appropriate concerns for each member of the group.

Evaluations have been carried out with a variety of schedules, ranging from mid-point and final to six times over a 14 week unit. The weakness of fewer observations is the degree of weight given to each and the resulting greater susceptibility to bias. More frequent evaluations leave less time for forming a judgement and require more time and effort of a tutor. At some point a balance must be determined which allows multiple observations with sufficient intervals to form a judgement, each of which is of enough value to be meaningful but not so great as to be anxiety provoking.

In summary, it is becoming apparent that while some strategies may assist a tutor in evaluating in-tutorial performance, the task remains difficult. In the best of all worlds, a highly skilled tutor with a formidable knowledge of the topic under consideration would be encouraged and supported by the educational programme to set high standards for achievement and given a sound process in which to document judgement of students' tutorial performance. However, the evidence to date suggests that these lofty ideals have not, and perhaps cannot, be achieved in the real world.

REFERENCES

Cannell J. (1989) How public educators cheat on standardized achievement tests. New Mexico: Friends of Education.

Cohen GS, Blumberg P, Ryan NC, Sullivan PL. (1993) Do final grades reflect written qualitative evaluations of student performances? Teach Learn Med 5: 10-15.

Graesser A, Person N. (1994) Questions asked during tutoring. Am Ed Res J 31(1): 104-137.

Hay J. (1994) A comparison of self-evaluations and tutor evaluations of student performance in problem-based tutorials. Proceedings of the 11th World Congress of Occupational Therapists, p 134.

Hay J, Schmuck M. (1994) An investigation of student self-evaluations in problem-based tutorials. Internal document, McMaster University Faculty of Health Sciences.

Norman GR. (1994) Why evaluate? Pedagogue 5(1): 1-6.

CHAPTER 2.2
CLINICAL RATINGS - WARD EVALUATION
David Streiner, McMaster University

Global clinical ratings are ubiquitous. We have never encountered a program in the health sciences which did not, at some point, evaluate students with some kind of one-page form with ten or twenty categories and five or seven point scales.

The ITER (In Training Evaluation Report) of the Royal College of Physicians and Surgeons of Canada is likely the best known and most common of a large number of global rating scales: pieces of paper filled out by some supervising individual after a period of weeks or months of contact with the individual student.

The attractiveness of the form is based on:

1) Its simplicity and efficiency. Supervisors can complete such a form in minutes.

2) Its comprehensiveness.
Typically such forms contain from six to twenty apparently separate categories, ranging from responsibility to clinical judgement to interpersonal skills.

3) Its face validity. It would seem that a colleague or supervisor with weeks or months of contact with a student should have had ample time to observe and comment on a wide range of qualities.

DESCRIPTION

We have already indicated some characteristics of global rating forms. To meet the usual standards of questionnaire design, some others might be added:

1) The number of response levels on the rating scale should be a minimum of 5 (actually used). Since raters rarely use the bottom half of the scale, extending this to 7, 9 or 11 levels may help.

2) Descriptors of levels of performance, using simple adjectives such as "Below expectations", "Fair", or "Meets all objectives", may help.

3) Conversely, attempting to describe levels of behaviour in complex detail, or to reduce performance to detailed checklists, is unlikely to help, and may make things worse (van der Vleuten, Norman, and de Graaff, 1991).

4) There is no advantage to complex weighting schemes to sum individual scores (Wainer, 1976).

HISTORY

As we have indicated, some form of global rating is nearly universal as a method of evaluation in clinical settings for the assessment of all kinds of clinical trainees at all educational levels. The situation exists apparently in defiance of the overwhelming evidence of the limited value of the method (Streiner, 1985). Moreover, many of the problems already described with reference to evaluation in tutorials are equally present in clinical ratings.

RELIABILITY AND VALIDITY

While the guidelines described above may help to some slight degree, no amount of "twiddling" is likely to remove the major structural disadvantages of the method, as outlined below. Despite the apparent (face) validity and the ease with which the form can be completed, literally dozens of studies have shown that the reliability of global ratings is essentially zero (Streiner, 1985). As a consequence, it is nearly impossible to establish any empirical basis for validity, since reliability sets an upper limit on possible validity. There are several possible reasons for this dilemma:

1) The Time Period Involved

The rater faces the daunting task of having to integrate exposures and experiences with the student over periods of weeks, months or even years. A number of psychological studies (Ross, 1989) have shown that such kinds of integration are literally impossible to do. You simply cannot remember any time beyond about three days; therefore any kind of integration is based on recent experience and is fraught with errors.

2) Domain Discrimination and Halo Effect

Another problem is that although it would appear from the form that the rater is assessing something between 10 and 20 different characteristics, in fact, generally speaking, studies of global ratings indicate that there is usually only a single factor underlying these ratings. This was strongly highlighted in a study we completed internally, in which supervisors rated 65 hypothetical residents (simply pieces of paper with individual ratings assigned at random). The supervisors were asked to give an overall judgement of the performance of individual residents on a twelve-month scale.
We then predicted their overall judgement based on a regression analysis of the individual scores, and found that a single characteristic best predicted the final overall scores of all supervisors: Team Relationships, which is likely a synonym for "how well the supervisor got along with the resident". This occurred under circumstances where the "residents" were simply pieces of paper. (A minimal simulation of this kind of analysis is sketched after the references at the end of this chapter.)

Related to this problem is that, since the judgement is based over an extended period of time, and is no longer particularized by individual behaviours, the evaluation takes on more of a judgement of personality than of performance. It is very difficult under these circumstances for a supervisor, who has intimately known the resident or student for this period of time, to then turn around and become judge and jury. The result and consequence of this particular characteristic is that virtually all ratings on these forms are "above average".

3) The Consequences of the Evaluation

Despite their limited psychometric qualities, global ratings are often used as the final arbiter of success or failure in a clinical setting. For example, residents are not allowed to sit the specialty certification examination unless they receive a satisfactory final rating (the FITER). As a consequence, supervisors may practise defensive (or path of least resistance) evaluation. Fear that an unsatisfactory evaluation may be appealed by the student, and the inherent subjectivity of the process, lead supervisors to rate students as above average or satisfactory, simply because they would find the alternative to be indefensible.

Solutions

A potential solution to these problems: first, increase the sample size of observations. Instead of a single assessment at the end of three months, multiple assessments of performance based on a very limited period of observation, for example one or two days, may alleviate many difficulties. This has several advantages: first, the sample size of observations is increased, which, all other things being equal, is beneficial. Secondly, the evaluation is no longer an evaluation of personality, but rather is an evaluation of performance, and can be verified by specific behaviours. Finally, a number of independent observers can be used in this process.

This form of evaluation could also be applied to operative procedures, where residents are routinely observed and supervised in the OR, but observations are rarely turned into any kind of useful documentation. It would seem self-evident that this is an ideal setting in which to use specific forms of supervisor assessment.

REFERENCES

Barrett GV, Depinet RL. (1991) A reconsideration of testing for competence rather than for intelligence. Am Psychol 46: 1012-1024.

Ross M. (1989) Relation of implicit theories to the construction of personal histories. Psychol Rev 96: 341-357.

Streiner DL. (1985) Global rating scales. In: Neufeld VR, Norman GR (eds). Assessing Clinical Competence. New York: Springer, pp 119-141.

van der Vleuten CPM, Norman GR, de Graaff E. (1991) Pitfalls in the pursuit of objectivity: issues of reliability. Med Educ 25(2): 110-118.

Wainer H. (1976) Estimating coefficients in linear models: it don't make no nevermind. Psychological Bulletin 83: 213-217.
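The regression study described in this chapter can be illustrated with a small simulation. The sketch below is hypothetical: the item names, the number of items and the strength of the halo effect are assumptions chosen for illustration, not the original data. It generates ratings for 65 "paper residents", constructs an overall judgement that leans heavily on a single item, and then checks which item a regression weights most heavily.

```python
import numpy as np

rng = np.random.default_rng(0)

# 65 hypothetical "paper residents" rated on 10 items (7-point scales).
# Item names, counts and effect sizes are illustrative assumptions only.
n_residents, n_items = 65, 10
items = rng.integers(1, 8, size=(n_residents, n_items)).astype(float)
team_relationships = items[:, 0]   # the item assumed to drive the halo effect

# Overall judgement: mostly "team relationships" plus noise, mimicking a
# halo effect rather than a genuinely multi-dimensional judgement.
overall = 0.9 * team_relationships + rng.normal(0.0, 0.5, n_residents)

# Ordinary least squares: which items best predict the overall judgement?
X = np.column_stack([np.ones(n_residents), items])
coefficients, *_ = np.linalg.lstsq(X, overall, rcond=None)

for i, weight in enumerate(coefficients[1:], start=1):
    print(f"item {i:2d}: weight {weight:+.2f}")
# With data generated this way, item 1 dominates: a single underlying
# factor, the pattern the chapter describes for global ratings.
```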
CHAPTER 3.1
TRIPLE JUMP EXERCISE (STRUCTURED ORAL ASSESSMENT)
Muriel Westmorland and Marilyn Parsons, McMaster University

The Triple Jump Exercise is a three part structured oral assessment developed at McMaster University and used for both formative and summative evaluation. The objectives are to evaluate the individual student's ability to generate hypotheses from a given clinical situation, seek out and critique relevant data, develop either a diagnosis or a management (care) plan, and evaluate his/her own performance in the exercise.

DESCRIPTION

The Triple Jump Exercise provides both students and faculty with the opportunity to simulate a real clinical situation. In their daily practice, health professionals are confronted with clinical situations for which all the relevant data are not available. The Triple Jump simulates this activity. The student meets with the assessor and is presented with a clinical situation (on paper or with a standardized patient). In step one, she/he is expected to formulate hypotheses based on the given information, identify the issues, collect data to support or refute her/his hypotheses, and identify gaps in her/his knowledge that need to be filled in order to formulate a diagnosis and a management or care plan. In step two, the student spends time on her/his own using available people and print resources to fill the knowledge gaps identified in step one. In step three, she/he discusses the plan with the assessor, supporting the plan with rationale from step two. The student is evaluated by the assessor throughout the process and evaluates her/himself at the end of the exercise. The process varies slightly from program to program.

HISTORY

The Triple Jump Exercise was first developed for use in the undergraduate medical program at McMaster and has since been adopted by both the School of Nursing and the Schools of Occupational Therapy and Physiotherapy. The students named the exercise "Triple Jump" because of the three steps involved in the process.

FEASIBILITY

While the TJ may have some useful characteristics, particularly with respect to the opportunity for one-to-one interaction between teacher and student, it remains somewhat demanding of faculty. In order to test a typical tutorial group of five to six students, faculty must typically devote nearly a full day: the morning for the initial oral examination, and the afternoon for the third step. In view of the limitation of extrapolating from a single case (see Reliability and Validity), this may be difficult to justify for summative evaluation purposes.

RELIABILITY AND VALIDITY

RELIABILITY: Painvin et al. (1979) examined inter-rater reliability and showed intraclass correlations between .60 and .90 for different categories of the triple jump evaluation. Neufeld et al. (unpublished data, 1984) standardized the format of the exercise, trained the evaluators and used a specific rating scale. The results showed inter-rater reliability coefficients of .50 to .80 on six of nine categories of evaluation. Chapman et al. (1993) examined inter-rater reliability using a scale to evaluate eight attributes of the student's performance. The first year the overall reliability was .75. In the second year the evaluators were given specific training and the overall inter-rater reliability improved to .87.
Inter-rater reliability can be enhanced by standardizing the format of the exercise, structuring the oral questions, providing specific evaluator training, and using a specific rating scale. However, although there is no direct evidence, the likely major source of error in Triple Jump assessments is content specificity, the low correlation of performance across cases, which is typically 0.1 to 0.3 using other similar instruments. Such a finding raises serious questions about the feasibility of the Triple Jump, since it could be anticipated that a reliable estimate of a student's competence may require as many as ten to twenty cases, which is clearly unfeasible. One solution to this dilemma is to drastically shorten the length of the case interaction, to as little as 3-5 minutes. Such a study was recently conducted by Neville (1995), and resulted in acceptable levels of reliability after one half hour of testing; moreover, students felt it was a fairer assessment. This method is described in more detail in Section 6.5.

VALIDITY: In the Chapman study, the triple jump exercise evaluations were compared to written clinical problem assignments and tutorial performance to determine validity. In the first year there was a correlation of .57 with the student's tutorial performance, but no correlation was found in the second year. No correlation was found between the triple jump exercise and the written problem assignments in either year. These results may reflect the low overall reliability of the single case as much as anything else. Smith (1993) found very low correlations (.07 to .16) between a single triple jump score and objective measures of unrelated knowledge. He used this finding to claim validity of the Triple Jump on the logically weak ground that it must be measuring something else, namely clinical reasoning. This interpretation was strongly criticized by Norman (1994).

In summary, little evidence of validity exists at present. Moreover, the restriction of the evaluation to a single case represents a serious constraint on both reliability and validity.

ADVANTAGES AND DISADVANTAGES

The Triple Jump has some advantages. It incorporates general learning objectives: a) understanding underlying mechanisms, b) problem-solving (clinical reasoning) ability, c) self-directed learning skills, d) self-assessment skills. It can be used for both formative and summative evaluation, and direct feedback to students is built into the exercise. The Triple Jump exercise is flexible and can be adapted to various situations. For example, a paper problem, a problem card deck, or a simulated patient can be used for the initial problem. Laboratory and/or diagnostic test results, referral forms, and consultant reports can be incorporated as part of the data base. By altering the initial problem, data base and criteria, the Triple Jump can be adapted to assess various levels of competence. For a beginning student, the emphasis can be on understanding mechanisms and concepts, whereas for the experienced student, the emphasis can be on investigation and treatment.

The disadvantages include inconsistent ratings by evaluators, as shown by the occasionally low inter-rater correlations. A more serious disadvantage is the measurement error resulting from the use of a single case. We have already alluded to the relatively high time commitment required of faculty. Finally, validity remains to be demonstrated.
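The estimate of ten to twenty cases quoted under Reliability above follows from the Spearman-Brown prophecy formula. The sketch below is a minimal illustration, assuming the inter-case correlations of 0.1 to 0.3 mentioned above and a target reliability of 0.80 (the target is an assumption chosen for the example, not a figure from the original studies).

```python
def spearman_brown(single_case_reliability, n_cases):
    """Reliability of a score averaged over n_cases comparable cases."""
    r = single_case_reliability
    return n_cases * r / (1 + (n_cases - 1) * r)

def cases_needed(single_case_reliability, target=0.80):
    """Number of cases required to reach the target reliability."""
    r = single_case_reliability
    return target * (1 - r) / (r * (1 - target))

for r in (0.1, 0.2, 0.3):
    print(f"inter-case correlation {r:.1f}: "
          f"about {cases_needed(r):.0f} cases for reliability 0.80")
# Prints roughly 36, 16 and 9 cases respectively -- consistent with the
# "ten to twenty cases" estimate for mid-range values of case specificity.
```

The same arithmetic explains why shortening each case and sampling many of them, as in the Neville (1995) study, is the more efficient route to an acceptable level of reliability.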
DEVELOPMENT AND IMPLEMENTATION

A bank of problems is needed so that students do not become familiar with them (especially if the exercise is used for summative evaluation). Problems need to be appropriate for the level of student being evaluated. Faculty need to develop a straightforward format that is discipline-specific (eg. medical model or nursing framework) to be used in developing a problem. Allowing for clerical, teaching and technical assistance is necessary, as the process is very time consuming.

The time allotment is one to one and a half hours with the tutor for each exercise, but there are 2 to 24 hours between steps two and three. This means that the tutor must plan his/her schedule to meet with the student at the appropriate times. Most tutors set aside one to two days for Triple Jump assessments and schedule three to four students per day.

The lack of consistency of approach between raters can be overcome by assessor training sessions using "dummy" problems and volunteer students who can provide helpful feedback (Chapman and Westmorland, 1990). Assessor guide sheets have been found to be very useful and keep the assessors on track throughout the Triple Jump process. Nursing assessor guide sheets state exactly what the assessor is to say at each step of the exercise, eg. "Now that you have identified your initial problem list, what is your rationale for stating this order of priority?", and what the tutor should be looking for, eg. "please make notes about the student's data gathering, interpretation of the data and current knowledge."

REFERENCES

Callin M, Ciliska D. (1983) Revitalizing problem-solving with triple jump. Cdn Nurse 79: 41-43.

Chapman JA, Westmorland MA, Norman GR, Durrell K, Hall A. (1993) Evaluating occupational therapy and physiotherapy students: one method of assessing clinical reasoning skills. Med Teach 15: 223-236.

Henry N, McAuley R, Norman G, Neufeld V, Powles P, Dodd P. The triple jump exercise: reliability and validity (unpublished data).

Neville AD, Norman GR, Blake JM. (1995) Measuring problem solving by written and oral examination: the non-effect of format. American Educational Research Association Meeting, San Francisco.

Neufeld VR, Norman GR, McAuley R, Repo R, Henry N. (1984) The triple jump exercise: a further study of standardization and interrater reliability (unpublished data).

Painvin C, Neufeld VR, Norman GR, Walker I. (1979) The triple jump exercise: a structured measure of problem-solving and self-directed learning. Proceedings of the 15th Conference on Research in Medical Education, Washington.

Powles AC, Wintrop N, Neufeld VR, Wakefield JG, Coates G, Burrows J. (1981) The triple jump exercise: further studies of an evaluative technique. Proceedings of the 20th Conference on Research in Medical Education, Washington.

CHAPTER 3.2
ORAL EXAMINATIONS
Linda J. Muzzin, University of Toronto

Oral examinations in professional education include a variety of techniques that provoke the candidate to demonstrate the reasoning used in professional practice, usually in response to examiners' questions. While the oral examination has a long history, dating back through millennia to the Greeks, in medicine the oral examination remains the traditional "rite of passage", particularly in the former British Empire. For example, the Royal College in Canada uses an oral examination for certification in many specialties.
This usually consists of a "long case" or "viva", where the candidate examines a patient at length (usually selected from patients in hospital), then reports her findings to the examiners and undergoes further interrogation. This is then followed by a series of hypothetical "short cases" chosen by the examiners.

HISTORY AND USE

The value of the oral exam situation as a flexible technique allowing direct feedback between teacher and student has often been cited. It is institutionalized in most medical schools "at the bedside" in the form of medical rounds and evaluative clinical exercises in residency programs and clerkships (Futcher, Sanderson and Pusler, 1977). With a few exceptions, however (Vu, Johnson and Mertz, 1981; Powles et al, 1981), these exercises are not described or evaluated in the literature.

RELIABILITY AND VALIDITY

RELIABILITY: The majority of the 20 or so research studies on oral exams that have appeared in the past 50 years have focused on the lack of agreement between raters of the same candidate and have assumed that the raters were responsible for the discrepancies. This is certainly the case when particular examiners are found to be more lenient or strict than others in grading - the "dove/hawk" dimension (Bull, 1956). But the lack of agreement typically found between totally independent ratings of a candidate in two different situations, as found, for example, in a very large study undertaken by the National Board of Medical Examiners (Hubbard et al, 1963), might have other explanations - specifically, the behaviour observed is different and the respective skills for each problem or question may not covary. This interpretation would be consistent with the finding, now replicated several times, that pairs of raters observing the same piece of behaviour show better inter-rater reliability in their ratings (.75 to .89) than do individuals observing different sessions (.25 to .45) (Carter, 1962; Wilson et al., 1969; O'Donohue and Wergin, 1978). While this in part may reflect consultation between members of the pair, or access to prior written grades, this does not explain why consensus is unpredictable among members of teams of three or more raters (Colton and Peterson, 1967). At least one group of medical educators feels that the problem is the presence of "nonconformists" on the rating teams (Newble, Hoare and Sheldrake, 1980) and advocates removal of such individuals as oral examiners. This is now routine for some medical specialty boards in the U.S. (Lloyd, 1983).

When checklists and rating forms are used for the guidance of oral examiners, quite high inter-rater reliability coefficients have been reported, ranging from .79 to .92 (Maatsch, 1980; Littlefield, Harrington and Garman, 1977). In general, the more rigid the structure of the oral, the higher the consistency in rating. Compatible with this is the observation that the shorter the oral, and thus the less "content" to be rated, the higher the reliability (Bull, 1956; Wilson et al, 1969). Levine and McGuire, who have perhaps done the most exhaustive work on this problem, attribute the relatively high reliability achieved with their role-playing orals to the fact that they "presuppose much less specific content than do tests of cognitive skills" (1970b, p. 702).
The general argument here is that short, process-oriented orals are more reliable than longer orals that highlight diversities of skills, since the latter may reflect "content specificity", or the tendency for mastery of one domain to be unrelated to mastery of another. As for the low correlations between different sessions, these, once again, likely reflect case or content specificity. One direct test of the within-team versus between-team differences was recently conducted on the Royal College orals in Internal Medicine (Turnbull, Danoff and Norman, 1995), where each candidate attended a morning and an afternoon session with different examiners. The results were completely consistent with previous studies: inter-rater reliability in each session was in the range of 0.70, however the correlation between morning and afternoon scores was only 0.30.

VALIDITY: In two studies that report construct validation of medical orals (Miller, 1968; Maatsch, 1980), the differences in performance within levels of training appear greater than the differences between the levels. Oral exam validation has instead tended to revolve almost exclusively around the extent to which the scores achieved on multiple-choice tests (MCQs) and on orals intercorrelate. The use of written tests as a criterion against which to compare oral exams that are supposed to be measuring clinical reasoning may seem somewhat anomalous. However, since MCQs are highly reliable, widely administered and known to be measuring factual knowledge, they are more of a "known quantity" than other yardsticks that might be used.

Most studies have reported small positive or insignificant correlations between the results of MCQ and oral exams (e.g. Bull, 1956; Ludbrook and Marshall, 1971; O'Donohue and Wergin, 1978) and have usually interpreted this as evidence that orals and written tests measure different aspects of clinical competence. However, a few have pointed out that test unreliability could also reduce such correlations (Levine and McGuire, 1970a; Meskauskas, 1975). Maatsch (1980), who has reported the highest MCQ-oral intercorrelation so far in any study, has suggested several sources of low intercorrelations in previous studies, including the truncation of the candidate population by giving the orals only to those who have passed their written. He has also argued that MCQs are more reliable because they contain more questions, and that increasing the number of orals per candidate would allow orals to approach MCQ reliability. The issue of why some orals correlate with written examinations, however, goes beyond whether they are reliable or not, since at least one type, Levine and McGuire's role-playing oral (1970b), has acceptable reliability, shows a low correlation with written exam scores, and is clearly not designed to measure directly the candidate's "fund of information".

ADVANTAGES AND DISADVANTAGES

Proponents of oral examinations feel that this exercise measures skills that other evaluation tools do not. Examples of such skills include: personal characteristics, capacities for solving problems, breadth as well as depth of knowledge, clinical judgement, and the ability to handle stressful or emergency situations.

If one uses the term "oral exam" loosely, it is used every day, in every teaching activity where there is constant questioning and answering with immediate feedback. In reality, this kind of "oral exam" constitutes a good proportion of a student's final evaluation in postgraduate medical education.
While some strong claims have been made for the usefulness of oral exams in judging a candidate's ability to apply clinical knowledge, to problem solve, to respond in a dynamic situation, to demonstrate interpersonal skills, professional attitudes, and so on, educational researchers have tended to be sceptical of these claims. The claims are typically based, at worst, on uninvestigated impressions that the exams sample these skills and, at best, on examiner and candidate reports that the exam seemed to be valid (Van Wart, 1974), rather than on a more systematic content analysis as advocated by Levine and McGuire (1970a, 1970b).

There is scattered evidence that oral exam scores reflect aspects of candidate behaviour unrelated to clinical competence, such as anxiety level (Pokorny and Frazier, 1966; Waugh and Moyse, 1969), the percentage of words contributed by a candidate (Evans, Ingersoll and Smith, 1966), and the examiner's "visual impression" rating of candidates (Holloway, Collins and Start, 1968) or "self confidence" (Wigton, 1980). Whether oral examiners are able to "use" or ignore such cues and testmanship in their evaluations is an issue that needs to be addressed.

If materials are made overly uniform and questions and topics are defined for each candidate in the attempt to achieve consistency in rating, the flexibility, uniqueness and individualistic nature of the oral exam are lost. O'Donohue and Wergin (1978) suggest that the high reliability achieved by tight structure and standardization of an oral exam may only allow the sampling of a very small part of a student's overall clinical competence. Thus the challenge is to standardize sufficiently to achieve acceptable reliability while not destroying the unique aspects of the oral. Experience of the American Board of Emergency Medicine suggests that it is possible to achieve acceptable reliability with relatively little structure, amounting to a standardization of cases and suggested key criteria of performance.

Content specificity remains a second issue. A longer sequence of oral evaluations extending over multiple cases would yield more reliable, hence more valid, estimates of candidate performance.

Oral exam evaluations are not feasible for large populations because of the logistical problems and costs of having candidates in the same location, examining the same patients and being examined by the same raters, which is now widely accepted as the basis for standardization. Both the National Board of Medical Examiners and the American Board of Internal Medicine, with the largest candidate populations, have had to discontinue their orals for combined reasons of unreliability and cost.

DEVELOPMENT AND IMPLEMENTATION

There are a few simple guidelines to maximize the measurement from the oral examination:

1. Sample broadly, with as many short cases as feasible. A single case discussion should not exceed ten to fifteen minutes.

2. Provide written stimulus cases and suggested criterion responses in advance, so examiners are not at liberty to ask their "pet" questions.

3. Use simple rating scales for examiners. Detailed checklists are mind-numbing to experienced examiners and do not improve reliability or validity.

4. Do not expect it to do more than it is capable of. Using the oral to assess interpersonal skills is more likely to be an assessment of acting ability.
Conversely, using the oral as a primary assessment of knowledge is inefficient of both student and examiner time; this can be achieved more effectively with written or multiple choice formats.

REFERENCES

Bull GM. (1956) An examination of the final examination in medicine. Lancet 271: 368-372.

Carter HD. (1962) How reliable are good oral examinations? California J Ed Res 13: 147-153.

Colton T, Peterson OL. (1967) An assay of medical students' abilities by oral examination. J Med Ed 42: 1005-1014.

Evans L, Ingersoll RW, Smith EJ. (1966) The reliability, validity and taxonomic structure of the oral examination. J Med Ed 41: 651.

Futcher PH, Sanderson EV, Pusler PA. (1977) Evaluation of clinical skills for a specialty board during resident training. J Med Ed 52: 567-577.

Hubbard JP, Levitt FJ, Schumacher CF, Schnabel TG. (1973) An objective evaluation of clinical competence. New England J Med 272: 1321-1328.

Levine HG, McGuire CH. (1970a) The use of role-playing to evaluate affective skills in medicine. J Med Ed 45: 700-705.

Levine HG, McGuire CH. (1970b) The validity and reliability of oral examinations in assessing cognitive skills in medicine. J Med Ed 45: 700-705.

Littlefield JH, Harrington JT, Garman RE. (1977) Use of an oral examination in an internal medicine clerkship. Proceedings of the 16th Conference on Research in Medical Education, Washington DC.

Lloyd JS. (ed.) (1983) Oral examination in medical specialty board certification. Chicago: American Board of Medical Specialties.

Ludbrook JJ, Marshall VR. (1971) Examiner training for clinical examinations. British J Med Ed 6: 152-155.

Maatsch JL. (1980) Model for criterion referenced medical specialty test. Office of Medical Education Research and Development, Michigan State University, in collaboration with the American Board of Emergency Medicine.

Meskauskas MS. (1975) A study of the oral examinations of the Subspecialty Board of Cardiovascular Disease of the American Board of Internal Medicine. Proceedings of the Conference on Oral Examination. Des Plaines, IL: American Board of Medical Specialties.

Miller GE. (1968) The orthopaedic training study. J Am Med Assoc 206: 601-606.

Newble DI, Hoare J, Sheldrake PF. (1980) The selection and training of examiners for clinical examinations. Med Educ 14(5): 345-349.

O'Donohue WJ, Wergin JF. (1978) Evaluation of medical students during a clinical clerkship in internal medicine. J Med Ed 53(1): 55-58.

Pokorny AD, Frazier SH. (1966) An evaluation of oral examinations. J Med Ed 41(1): 28-40.

Powles ACP, Wintrop N, Neufeld VR, Wakefield JG, Coates G, Burrows J. (1981) The "triple-jump" exercise: further studies on an evaluative technique. Proceedings of the 20th Conference on Research in Medical Education, Washington DC.

Turnbull J, Danoff D, Norman GR. (1995) Content specificity and oral certification examinations. Submitted to Med Educ.

Van Wart AD. (1974) A problem-solving oral examination for Family Medicine. J Med Ed 49: 673-679.

Vu NV, Johnson R, Mertz SA. (1981) Oral examination: a model for its use within a clinical clerkship. J Med Ed 56(8): 665-667.

Waugh D, Moyse CA. (1969) Medical education II: Oral examinations: a video study of the reproducibility of grades in pathology. Cdn Med Assoc J 100: 635-640.

CHAPTER 3.3
CHART STIMULATED RECALL
Eileen Hanna, McMaster University

Unlike a chart audit, which uses a fixed, objective checklist, chart stimulated recall is a discussion of a physician's actual practice.
Physicians are rated on the following broad areas: the acquisition of data, problem solving, patient management, comprehension of pathophysiology, sensitivity to the patient's needs, overall clinical competence, and quality of charting.

DESCRIPTION

The candidates are asked to submit charts of specific patients. At the time of examination the candidate is interviewed by the examiner, who rates the candidate on the following categories, using the charts as the basis of the questions.

1) Acquisition of data is directed at the completeness, relevance and efficiency of obtaining key items.

2) Problem solving includes the organization and accurate interpretation of history and physical findings, and investigations. The positive and negative findings are weighed and synthesized clearly and concisely to explain the cause of the presenting complaint (whether pathophysiological and/or psychosocial), the diagnosis and the potential outcome.

3) Patient management rates the appropriateness of the plan, priorities and sequence of actions, patient education in regard to the diagnosis, medications, life-style changes etc., treatment decisions and outcomes.

4) Comprehension of pathophysiology assesses how well the candidate understands the underlying pathophysiology of the problem, is able to apply knowledge of the concepts relating to the condition to the underlying causes and mechanisms of illness, and can explain the scientific rationale for the clinical procedures ordered.

5) Sensitivity to patient's needs rates whether concern was shown in dealing with the patient's emotional and physical state in relation to the social and family setting.

6) Overall clinical competence represents a global assessment of the candidate's ability to not only competently provide health care for a patient with this specific problem, but also manage a patient in this age group and gender in regard to general issues of health promotion/disease prevention.

7) Quality of charting is determined by legibility of notes, completeness as a legal document, and organization of data.

These seven items are reviewed for each chart discussed.

HISTORY AND USE

Chart stimulated recall was developed by Maatsch (1985) to certify emergency physicians in the American Board of Emergency Medicine (ABEM) examination. It was intended to be a free-response format for evaluating clinical judgement based on real patient charts. For the Physician Review and Enhancement Program at McMaster University, the chart stimulated recall tool developed by Maatsch was adapted for use with family doctors.

RELIABILITY AND VALIDITY

The chart stimulated recall developed by Maatsch (1985) asked the candidate to submit three charts of patients that they had treated within a fixed period of time. However, because of some concern that chart stimulated recall would not adequately challenge candidates, ABEM conducted a field test of this new tool. Sixteen examiners who had experience in previous certification examinations and eight candidates whose ten year board certification needed to be renewed were invited to participate. One examiner assessed the chart with the candidate, while the other independently verified the information.
Diagnosis and management of actual cases was rated using a structured interview based on the presenting chart. For this study, examiners participated in an intensive training program that included structured interviewing techniques, videotaped sessions, consensus driven rating points, feedback and ample practice time (Solomon, 1990). The resulting reliability coefficients were .5 or greater.

To determine validity and reliability of the new CSR tool, the well-developed simulated patient encounter was used. Data acquisition, problem solving, patient management, resource utilization, health care provided, comprehension of pathophysiology and overall clinical competence were rated on an 8 point scale for both test tools (Solomon et al. 1990). Correlation between the two sets of scores demonstrated that both tools assessed similar abilities. In addition to the high correlation between the two measures, there was a significant relationship between the results on this field test and the oral and written certification examinations the candidates took 10 years earlier (Solomon et al. 1990). A similar study (Huang, 1980) concluded that trained examiners can produce reliable and valid ratings; however, six or more cases need to be sampled to produce sufficiently accurate results.

ADVANTAGES AND DISADVANTAGES

The CSR is unique in its ability to assess performance, as opposed to competence. The presence of the candidate overcomes the major deficiency of conventional chart review, the uncertainty associated with errors of omission.

The major disadvantage is simply that the method is extremely labour-intensive. It may be of primary use in continuing education, where a group practice can utilize this as a learning tool for self-assessment.

DEVELOPMENT AND IMPLEMENTATION

Generally, the first step in the CSR begins when the doctor is asked to submit a list of the 50 most recent patients that s/he has seen. From this list 10 charts are selected. For CSR, particular care must be taken in choosing charts in order to comprehensively assess all aspects of practice. Because of the diversity of charts used in the assessment, only limited use of condition specific criteria can be made. The assessor reviews the chart with the candidate physician present, using the candidate to "fill in the blanks" and provide rationale for particular aspects. The evaluation is then based on subjective rating according to a small number of predetermined categories, as described above.

REFERENCES

Huang R, Maatsch J. (1980) A model for a criterion-referenced medical specialty test. Final report on grant No. HS-02038-02, Office of Medical Education Research and Development, Michigan State University.

Maatsch JL, Huang R, Downing SM, Munger B. (1985) In: Hart I, Harden R (eds). Newer Developments in Assessing Clinical Competence. Proceedings of the 1st International Conference, Montreal, 352-360.

Norman G, Davis D, Lamb S, Hanna E, Caulford P, Kaigas T. (1993) Competency assessment as part of a peer review program. J Am Med Assoc 270(9): 1046-1051.

Solomon DJ, Reinhart M, Bridgham RG, Munger B, Starnaman S. (1990) An assessment of an oral examination format for evaluating clinical competence in emergency medicine. Acad Med 65(9): S43-S44.

CHAPTER 4.1
MULTIPLE CHOICE QUESTIONS
Geoffrey Norman, McMaster University

Multiple choice tests literally need no introduction.
Everyone involved in all facets of education has at one, or more likely many, times experienced the anxiety associated with sitting the "final exam", a gruelling, several hundred item, multiple choice test. Their logistical advantages are self-evident: it is possible to test hundreds or even thousands of candidates with the same test and with no human intervention beyond the support staff to feed the forms through the computerized scoring machine. Conversely, their apparent disadvantages have been proclaimed for decades - the focus on recall of facts over higher level thinking, the difference between recognizing the "right" answer and being able to recall the answer from memory. The ubiquitous nature of MCQ's in medical education led directly to the invention of many alternatives (including some described in this handbook) with proclaimed advantages. The Patient Management Problem (which is now essentially deceased), the Modified Essay Question, even the O.S.C.E., can claim a lineage directly to dissatisfaction with MCQ's.

How then have MCQ's survived relatively unscathed from all these assaults? A cynic may assume that the sole rationale for their continued existence is the cost factor. However, a careful review of the literature reveals that the apparent disadvantages of MCQ's are more illusory than real, and with careful attention to question writing and sensitivity to their impact on learning, they continue to have a useful role in evaluation at every level.

PURPOSE

MCQ's are expressly designed to assess knowledge. While some would maintain that their use is more circumscribed, to the ability to test recognition of isolated facts, it is clear that well constructed MCQ's can also assess more complex judgement and application of knowledge, perhaps as well as many alternative formats. Moreover, while possession of an adequate knowledge base was once viewed as an unnecessary encumbrance when knowledge was changing so rapidly, over two decades of research into reasoning and thinking, both within the health sciences and elsewhere, has unequivocally shown that knowledge of a domain is the single best determinant of expertise (Glaser, 1984). Conversely, research to elucidate a general problem solving skill, independent of knowledge, has met with a singular lack of success (Perkins and Salomon, 1989). From this perspective, then, knowledge is worth evaluating, and the multiple choice question remains, for the foreseeable future, the most effective means to assess knowledge.

DESCRIPTION

In its standard format, the so-called "A" format, the question consists of a brief question description (the "Stem") followed by five alternative answers (the "Options"). An example of an A-type is:

A 55 year old man presents at the emergency room with crushing chest pain which has been progressively worsening over the past several hours. The pain radiates down the left arm past the elbow. He is diaphoretic. Vital signs are: P 100, BP 120/60. The most likely diagnosis is:

While these items are often written with a fifth option of "All of the above" or "None of the above", presumably by writers suffering from creative lapses, this is to be discouraged (see Development and Implementation below). One simple modification of this format is the "B" type, where the instruction is to pick the correct alternatives, and there may be more than one correct answer.
One more complex format is the "K-type", which was invented by John Hubbard at the National Board of Medical Examiners (NBME) in order to more effectively deal with the inherent uncertainty in clinical medicine. An example:

If 10,000 women with no risk factors are screened for breast cancer with mammography, about 50 will have breast cancer as determined by subsequent confirmatory tests. Of the 50, about 15 will have a negative mammogram. Of the 9950 women without cancer, about 700 will have a positive mammogram.

1) The sensitivity is 35/50 = 70%.
2) The specificity is 9950/10,000 = 99.5%.
3) The positive predictive value is 35/735 = 5%.
4) The prevalence is 35/10,000 = 0.35%.
5) The negative predictive value is 15/50 = 30%.

SELECT:
a) if 1) and 3) are correct. *
b) if 2) and 4) are correct.
c) if 1), 2) and 3) are correct.
d) if all are correct.
e) if none are correct.

While the complexity of choice suggests that this type is likely measuring higher order thinking skills, an accumulation of evidence suggests that the K-type is measuring the same underlying aptitudes as the regular A-type, but because it takes more time and is more error-prone, it is less reliable than the simpler format. Consequently its use has been virtually discontinued by national testing agencies.

Recent advances towards greater flexibility in scoring software and hardware have resulted in a number of more interesting formats. Notable in this regard is the "Pattern Recognition Test" of Case and Swanson (1992), an example of which is shown below:

1. Myocardial Infarction
2. Angina
3. Pulmonary Embolus
4. Esophageal Spasm
5. Pleurisy
6. Acute Bronchospasm
7. Mitral Valve Prolapse
8. Congestive Heart Failure
9. Pneumonia
10. Chest Wall Pain
11. Rheumatic Heart Disease
12. Ventricular Fibrillation

a) A 55 year old man with crushing chest pain worsening over the past several hours and radiating down the left arm.
b) A 25 year old woman with piercing chest pain. She is on the birth control pill.
c) A 75 year old man with penetrating chest pain worsening on exertion.
d) A 12 year old girl with shortness of breath. Two weeks earlier, she had a sore throat.

A second variant, which is now used as one component of the Medical Council of Canada licensing examination, is the so-called "Q4 Problem" (it was called this simply because it was in the fourth question booklet of the examination). In the Q4 problem, candidates are given a series of clinical scenarios and then asked to take the appropriate actions by selecting the code numbers from a long standard list of alternatives (Page et al. 1995). For example:

Paul, a 56-year-old man, consults you in the outpatient clinic because of pain in his left leg which began two days ago and has been getting progressively worse. He states his leg is tender below the knee and swollen around the ankle. He has never had similar problems. His other leg is fine.

1. What diagnoses would you consider at this time? List up to three.

2. With respect to your diagnoses, what elements of his history would you particularly want to elicit? Select up to seven.

1. Activity at the onset of symptoms

RELIABILITY AND VALIDITY

Given the ubiquitous nature of MCQ's, it is not surprising that there is a wealth of evidence about the reliability and validity of the method.
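The arithmetic in the K-type mammography item above can be checked directly from the 2 x 2 table implied by its stem. The short sketch below recomputes the five quantities from the stated frequencies; it is offered as an illustration of the calculation and is not part of the original examination material.

```python
# Frequencies stated in the K-type item above.
total = 10_000
with_cancer = 50
false_negatives = 15          # cancers with a negative mammogram
false_positives = 700         # non-cancers with a positive mammogram

true_positives = with_cancer - false_negatives          # 35
without_cancer = total - with_cancer                    # 9950
true_negatives = without_cancer - false_positives       # 9250

sensitivity = true_positives / with_cancer                    # 35/50    = 0.70
specificity = true_negatives / without_cancer                 # 9250/9950 = 0.93
ppv = true_positives / (true_positives + false_positives)     # 35/735   = 0.048
npv = true_negatives / (true_negatives + false_negatives)     # 9250/9265 = 0.998
prevalence = with_cancer / total                              # 50/10000 = 0.005

print(f"sensitivity {sensitivity:.2f}  specificity {specificity:.2f}  "
      f"PPV {ppv:.3f}  NPV {npv:.3f}  prevalence {prevalence:.3f}")
```

Worked through this way, only statements 1) and 3) hold, which is consistent with the keyed response a) in the item.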
RELIABILITY

A recurrent theme in this handbook is that, while it is often not too difficult to achieve good inter-rater reliability (with the possible exception of global ratings and essay tests), it is far more difficult to achieve reliable estimates of performance from limited samples of content or cases. It is precisely this limitation which gives MCQs an advantage over virtually all other testing methods. MCQs are able to sample large content domains efficiently, and thus can achieve, in relatively short testing times, levels of reliability which are unthinkable with other test formats. Typically, test designers assume that students can complete one MCQ per minute, or 60 per hour. On that basis, a 180 item test, taking 3 hours, will typically achieve an internal consistency reliability exceeding 0.90. By contrast, an MCQ test consisting of fewer than 40 or 50 items can be assumed to have insufficient reliability for any practical use.

Evidence of test-retest reliability is more limited, since students rarely take the same, or parallel, tests on a second occasion. However, there are two exceptional circumstances: progress tests (described in more detail below) and recertification examinations. In the former situation, studies in the M.D. Program have demonstrated test-retest reliability over a three month interval of 0.7 (for a 180 item test). Further, results from a progress test completed one year before graduation were able to identify 4 of 5 M.D. students who eventually failed the licensing examination (while also predicting that 8 of the remaining 82 students would fail). Perhaps even more dramatic is the result of the American Board of Internal Medicine examinations, where the correlation between performance on the original certification examination and a parallel recertification examination 7 to 10 years later was 0.70 (Day, 1988).

VALIDITY

While the reliability of MCQs is never questioned, most antagonists challenge the format on validity grounds. Some of these issues are described below.

1) MCQs test recognition, not recall

While recognition and recall are different psychological processes, careful study of performance on MCQs and parallel free-response (short answer) tests shows true correlations above 0.90; recognition and recall access the same knowledge base (Frederiksen, 1984). In one study in particular (Norman, 1987), MCQ and Modified Essay Question versions of a test were constructed; the true correlation across test formats was equal to one.

2) MCQs only measure factual recall, not problem solving

Poor MCQs only measure factual recall; so do poor short answer questions. There is no basic difference between:

The Latin name for the collarbone is:
a) humerus  b) clavicle  c) radius  d) ulna  e) femur

and:

What is the Latin name for the collarbone?

Certainly, it is likely true that most MCQs measure factual recall, but most short answer tests likely do as well. However, a recent unpublished study of the Medical Council of Canada examination showed that residents rated 49% of the questions as "higher order", indicating that carefully constructed MCQs are capable of addressing higher order thinking skills. Moreover, there is a value statement implicit in this challenge: that recall of facts is somehow less important or worthwhile than problem solving. There is some evidence to the contrary.
Peiteman (1990) studied performance on a multiple choice examination in the second year of medical school as a predictor of performance in the clerkship as rated by supervisors. He divided the MCQ questions into "factual recall" and "higher order" and found that the correlation between factual recall questions and clinical performance was 0.33, and between higher order questions and performance was 0.34! A second approach to this question involved developing two versions of a test question: one which contained an extensive patient case description before asking the question, and a second which simply stated the question (Case SM, unpublished). There was no evidence from either study that the case version was more difficult or more discriminating than the straightforward version.

3) MCQs only measure the ability to recognize the right answer; that has nothing to do with performance or competence

Not so. Not surprisingly, there have been hundreds of studies comparing performance on an MCQ test to performance on another test format taken at the same time. While low correlations are frequently found and then used as evidence for claims such as "My new test must be measuring something else, like problem-solving, competence, clinical reasoning, etc.", it is more likely that the low correlations reflect inadequate content sampling (and hence low reliability) of the new format, or a mismatch of content (Norman, 1994). When such concurrent validity studies are done appropriately, the correlation typically exceeds 0.60, suggesting that recognition of facts is not so dissociated from performance. Two large predictive validity studies have examined the relation between MCQ performance and performance in practice seven to ten years later, assessed by chart review or peer rating (Solomon et al., 1988; Ramsey et al., 1989). The true correlation in both studies was approximately 0.60 to 0.70.

ADVANTAGES AND DISADVANTAGES

Much of this should be evident by now. The one major advantage of the MCQ is that it can sample broad domains of knowledge efficiently and hence reliably. This one characteristic is sufficient to ensure that its edge in reliability more than compensates for some perceived failings in validity. Objective scoring is a logistical advantage in mass testing situations, but in classroom settings the advantage in implementation can be offset by the effort required to generate a large enough number of items.

The disadvantages are occasionally more perceived than real. We have noted the arguments against the validity of MCQs, and are satisfied that they are, for the most part, groundless. However MCQs, when used as a major component of evaluation, do have a measurable, and likely negative, steering effect on student learning. Students do study differently for MCQs than for other test formats, and the usual strategies of "final examing" and "bell-curving" undoubtedly encourage competition and discourage cooperation. One way to introduce MCQs into a curriculum while avoiding this steering effect is described below.

An overlooked disadvantage is that good MCQs are notoriously difficult to construct.
It is not uncommon for national testing agencies to accept only one of every ten test items submitted, although this rate can be raised considerably with training. As a result, the National Board of Medical Examiners has estimated that the cost of a single test item is approximately $1,000.

DEVELOPMENT AND IMPLEMENTATION

Development: While it seems straightforward to write MCQs, there are several common pitfalls. The first we have alluded to already: too often, the question demands simple recall of an isolated and often arcane fact. A second mistake is that writers often use the MCQ to play "Trivial Pursuit" in the following manner:

1. Cystic fibrosis is:
a) More common in children than adults
b) Often symptomatically relieved by bronchodilators
c) Inherited as an autosomal dominant trait
d) More prevalent in Western societies
e) None of the above

A good MCQ should function like a short answer question: you should be able to answer the question simply by looking at the stem. Otherwise, you are asking for unrelated isolated facts.

Authors also commit many errors in writing the response options. One is the use of "All of the above" or "None of the above". If the student knows that one of the options is wrong, this eliminates "All of the above"; and finding the right answer negates the "None of the above" option. Another error results from the uncertainties inherent in the field. Authors often couch the correct answer in probabilistic terms, using phrases like "often associated with", "likely" or "frequent", but make the wrong alternatives more absolute: "always", "never". Partly as a consequence, the correct answer is often the longest of the options, so one strategy for guessing is to pick the longest answer. Finally, writers are reticent about putting the correct option first or last. If you are serious, use a random number table, or order the options alphabetically by the first letter of the response.

A second developmental stage, used routinely by national testing agencies, is the screening of test items after the administration on the basis of item statistics. This is done by calculating two item parameters: the DIFFICULTY index, which is simply the proportion of candidates who answered the question correctly, and the DISCRIMINATION index, which is the correlation (usually point biserial) between performance on the test item and performance on the remainder of the examination. Item difficulties at the extremes (less than .20 or greater than .90) suggest that the item is too difficult or too easy, providing little useful information. Very low or negative discrimination indices imply that good students are no more likely, or even less likely, than poor students to get the correct answer, usually signalling a structural problem with the item. In high stakes situations, poor items are then removed and the test rescored.
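To make the two indices concrete, here is a minimal computational sketch. It is an added illustration with an invented, tiny response matrix; it is not drawn from the handbook or from any testing agency's software.

    # Illustrative item analysis: rows are candidates, columns are items,
    # entries are 1 (correct) or 0 (incorrect). Data are made up.
    import statistics

    responses = [
        [1, 1, 0, 1],
        [1, 0, 0, 1],
        [0, 1, 0, 1],
        [1, 1, 1, 1],
        [0, 0, 0, 1],
    ]

    def item_statistics(responses, item):
        item_scores = [r[item] for r in responses]
        # Difficulty index: proportion of candidates answering the item correctly.
        difficulty = sum(item_scores) / len(responses)
        # Rest score: total on the examination excluding the item itself.
        rest = [sum(r) - r[item] for r in responses]
        # Point-biserial discrimination: Pearson correlation between the
        # dichotomous item score and the rest score.
        if statistics.pstdev(item_scores) == 0 or statistics.pstdev(rest) == 0:
            return difficulty, 0.0   # no variation, so the item cannot discriminate
        return difficulty, statistics.correlation(item_scores, rest)

    for i in range(4):
        p, r_pb = item_statistics(responses, i)
        print(f"item {i + 1}: difficulty={p:.2f}  discrimination={r_pb:.2f}")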
Implementation (Progress Testing): There is good reason for concern about the impact of objective tests. To some degree, the testing of facts in isolation can be viewed as antithetical to the goals of the PBL curriculum, which emphasizes problem-solving, learning for understanding, and so on (Muller, 1984). The challenge, then, is to introduce objective testing methods into the curriculum in such a way as to maximize the potential benefits, in terms of providing students with an accurate and comprehensive assessment of knowledge mastery, while avoiding the attendant risks associated with the potential steering effect of the examination.

One potential solution to this problem was developed independently at the University of Limburg, Maastricht, the Netherlands (van Heesen et al., 1989), and the University of Missouri, Kansas City (Arnold et al., 1992), in the mid-1970s. Both institutions saw that, in order to avoid the steering effect of examinations, it was necessary to break the link between learning and examination by making the examination so comprehensive that it was virtually impossible to study for it. Second, in both institutions, progress through the school was separated from performance on any single examination, so that students were less concerned about passing or failing.

This system was adopted at McMaster in the fall of 1992. Three times per year, students from all classes gather on a single day to sit a 180-item multiple choice test. Within a few days, students receive detailed feedback about their performance in various content areas, and students who have consistent problems are advised and remedial action recommended. Studies of the method have been completed, and show:

1. Average class performance continues to improve monotonically over successive tests.
2. Test-retest reliability over successive administrations is of the order of 0.70, suggesting the test is detecting students with persistent difficulties.
3. A progress test administered 12 months earlier identified 4 out of 5 failures on the licensing examination; 74 of the 81 who passed were indicated as satisfactory by the progress test.
4. The test method has a slight impact on students' awareness of strengths and weaknesses and on their learning approach; 71% of students said it had "No Impact" on tutorial function (on a -4 to +4 scale).

The progress test methodology appears to be a relatively non-invasive means of achieving the testing advantages of MCQs while overcoming some of their negative effects on learning.

CHAPTER 4.2
MODIFIED ESSAY QUESTION
Paul Stratford and John Smeda, McMaster University

The Modified Essay Question (MEQ) was developed by Hodgkin and Knox (1975) as an examination tool for the Royal College of General Practitioners. It is intended to examine the respondent's ability to solve and manage (at least on paper) clinical problems, and to assess problem-related aspects of the basic and clinical sciences.

DESCRIPTION

The MEQ is presented in a booklet. Instructions which inform the respondent of the appropriate procedure are provided on the cover of the booklet. The next page provides a brief scenario (often only elements of the history) reflecting a clinical situation, and one or more questions pertaining to the text. Space is provided on this page for the respondent's answers. Subsequent pages include additional information, questions, and response space. Scenarios can be constructed to represent a single point in time, such as a single patient encounter, or to unfold over a series of encounters. The questions may ask the respondent for factual information, the interpretation of clinical or laboratory data, a diagnosis, a management plan, or a letter of referral.

RELIABILITY AND VALIDITY

The internal consistency of a test is dependent on the homogeneity of items and the number of items composing the test.
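The chapter does not give a formula, but the usual way of relating internal consistency to test length (and presumably the sense in which the coefficients quoted below are "standardized" to a 60 item test) is the Spearman-Brown prophecy formula. The sketch below is an added illustration under that assumption, with invented numbers.

    # Spearman-Brown prophecy formula (an assumption on our part; the handbook
    # does not name it): projected reliability when a test is made k times as long.
    def spearman_brown(reliability, k):
        return (k * reliability) / (1 + (k - 1) * reliability)

    # Invented example: a 30-item MEQ with internal consistency 0.55,
    # projected ("standardized") to a 60-item test.
    print(round(spearman_brown(0.55, 60 / 30), 2))   # about 0.71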
When standardized to a 60 item test, internal consistency coefficients of 0.74 are typical (Feletti, 1980). Inter-rater reliability coefficients varying from 0.80 to 0.97 have been reported (Norman et al., 1987). The magnitude of these coefficients is influenced not only by the quality of the criteria lists, but also by the test's structure. It appears that items presented in the context of a case, as opposed to random items, result in higher inter-rater reliability (Norman et al., 1987). Inter-case reliability has not been reported; however, given estimates from other test formats, it is reasonable to expect that it is low, of the order of 0.1 to 0.3.

Performance on free-response tests improves with the level of training (Newble, Baxter, and Elmslie, 1979; Newble, Hoare, and Elmslie, 1981). Correlations varying from 0.40 to 0.52 have been reported between MEQ and MCQ scores (Irwin and Bamber, 1982; Norman et al., 1987). However, in the latter study, when the correlations were corrected for attenuation, the coefficients approached unity (Norman et al., 1987). Correlations of approximately 0.42 have been reported between MEQ and M.B. clinical examination scores (Irwin and Bamber, 1982). Also, a correlation of 0.64 has been reported between MEQ scores and clinical practice evaluations for physiotherapy students (Stratford and Pierce-Fenn, 1985).

ADVANTAGES AND DISADVANTAGES

Intuitively, the modified essay question is an appealing tool for both formative and summative components of evaluation in a problem based instructional method. The modified essay question demonstrates reasonable psychometric properties. The development time (including criteria setting) is comparable to that of the MCQ. Also, the embedding of test questions in a problem format is intuitively appealing, particularly to PBL curriculum planners and students.

The disadvantages are that MEQs are time consuming to grade because they must be hand marked, and they cannot sample as much content as MCQs per unit of examination time because the free response question requires more candidate time. Since, except under unusual circumstances, free response questions assess the same competency as multiple choice questions, the advantage of the free response format may be more illusory than real (Frederiksen, 1984). Certainly, recent developments in MCQ methodology are employing more extensive use of patient case stems, thereby blurring the apparent advantages of the MEQ format. While embedding a series of questions in a single case may be desirable from the perspective of face validity, it may actually be detrimental to psychometric properties, as the items are no longer independent.

When used as a summative evaluation tool, a greater number of invigilators may be required, compared to MCQ or traditional essay questions, to ensure that students comply with instructions (i.e., that they do not look forward or alter previous answers). In this respect, a computer format which would preclude "looking back" may be advantageous, although we are not aware that such a format has been developed.

DEVELOPMENT AND IMPLEMENTATION

1. Determine the competencies to be assessed.

2. Decide whether cumulative error is desirable.
For example, authors may consider questions for which the answer to a previous question serves as the starting point for the next question:

Question 1. Based on the information provided, what is your diagnosis?
Question 2. What is your choice of therapy?

If cumulative error is not desirable, it can be avoided by providing the preferred answer to Question 1 in the stem of Question 2 and by presenting Question 2 on the next page (e.g., Question 2: What is your choice of therapy if this patient has a complete tear of the achilles tendon?).

3. Determine whether your questions will be undirected (e.g., What do you think is going on?) or directed (e.g., What are two probable causes of dyspnoea on exertion?). Directed questions have been shown to result in higher inter-rater reliability (Norman et al., 1987).

4. Generate and revise questions. Because of the sequential nature of the examination, it is important that the test author provide a written estimate of the amount of time the respondent should spend on each item.

Criterion Setting and Scoring. An inclusive list of possible options and associated weights must be developed. Often this list may be revised after administration, as candidates identify answers which were not anticipated. The particular weighting scheme is not critical, and it has been repeatedly demonstrated that scores are insensitive to the particular weights applied (Bligh, 1980). One appealing technique for criterion setting and scoring, referred to as the aggregate score method (Norman, 1985), involves weights derived from the actual performance of a criterion group, instead of armchair weights.

REFERENCES

Bligh TJ. (1980) Written simulation scoring: a comparison of nine systems. Presented at the American Educational Research Association Annual Meeting, New York.

Feletti GI. (1980) Reliability and validity studies on modified essay questions. J Med Educ 55(11): 933-941.

Feletti GI, Smith EKM. (1986) Modified essay questions: are they worth the effort? Med Educ 20(2): 126-132.

Frederiksen N. (1984) The real test bias. Am Psychol 39: 193-202.

Hodgkin K, Knox JDE. (1975) Problem Centred Learning. London, Churchill Livingstone.

Knox JDE. (1989) What is ... a modified essay question? Med Teach 11(1): 51-57.

Irwin WG, Bamber JB. (1982) The cognitive structure of the modified essay question. Med Educ 16(6): 326-331.

Newble DI, Baxter A, Elmslie RG. (1979) A comparison of multiple choice and free response tests in the examination of clinical competence. Med Educ 13(4): 263-268.

Newble DI, Hoare J, Elmslie RG. (1981) The validity and reliability of a new examination of the clinical competence of medical students. Med Educ 15(1): 46-52.

Norman GR. (1985) Objective measurement of clinical performance. Med Educ 19(1): 43-47.

Norman GR, Smith EKM, Powles ACP, et al. (1987) Factors underlying performance on written tests of knowledge. Med Educ 21(4): 297-304.

Stratford P, Pierce-Fenn H. (1985) Predictive validity of the modified essay question. Physiotherapy Canada 37: 355-359.

CHAPTER 4.3
ESSAYS
David Palmer and Elizabeth Rideout, McMaster University

DESCRIPTION AND PURPOSE

The term "essay" can be used to refer to a variety of kinds of written discourse.
Probably the most useful way of describing the variants of essays is to arrange them along a continuum according to the degree of freedom enjoyed by the writer in the treatment of the subject. One end of this continuum might be represented, for instance, by a short answer in an examination or a specific section within a Modified Essay Question (Chapter 4.2). Towards this end of the range, essays allow the writer little freedom to determine the scope of the subject. They usually require a summary or synthesis of standard information about a phenomenon, expect little individual judgment in the formulation of a position, involve little or no consultation of sources, and are normally completed in minutes rather than in days or weeks.

As we move towards the other end of the scale we find more open-ended essay tasks, in which the writer is given greater freedom to make decisions about the topic to be investigated, the issues to be addressed, the information to be gathered, and the position to be taken. At this end of the scale (to take one example) the writer might be asked to choose and explore an issue in population health. This might involve such processes as establishing the dimensions of the problem, investigating relevant background factors, identifying health policy implications, and recommending, with rationale and likely long-range effects, a social or health-care policy to respond to a particular problem. The writer might have considerable freedom to decide which of these possibilities to deal with at length.

The claim often advanced for this latter kind of essay as an evaluation method is that it allows us to observe a mind at work on a problem: to see, in short, how someone thinks. We watch the writer generate ideas, weigh arguments, organize and integrate information, and build and support conclusions. Essays can assess a person's ability to construct an extended argument about a complex topic after considering a large body of information and ideas. Writers of essays have to decide how to approach a topic and define its scope, which issues to investigate and which to leave aside, where to look for relevant information, and what information is in fact relevant. They are expected to consider alternative ways of interpreting the information they encounter, to formulate their own position on the issues they address, to organize their discourse into a sequence of coherent sections, to present evidence and argumentation in support of their findings, and to anticipate objections.

It is often assumed that essays are particularly compatible with a self-directed approach to learning, because they allow students substantial control over the selection and treatment of the topics to be investigated, and appear to foster independent thinking and learning. They are also thought to be relevant to a problem-based curriculum in that they demand the integration and synthesis of information from different sources, with application of that information to problematic situations, rather than the accumulation and presentation of discrete pieces of information without an integrative context.

A further justification frequently offered for the use of essays is that they allow us to assess a writer's ability to communicate ideas, explain findings clearly and precisely, and use sources appropriately.
These skills are especially significant when one of the objectives of a course of study is to prepare students to write and publish research or other kinds of reports in their professional lives. This lofty view of what essays can do, however, has its critics: one summary of their usefulness cautions against "unsubstantiated claims" that essays can be used to measure "undefined and only vaguely perceived 'higher-order mental abilities'" (Ebel & Frisbie, 1986: 131).

HISTORY AND USE

As indicated above, essays are seen as particularly appropriate for a self-directed and problem-based academic program. In the Faculty of Health Sciences, essays have been used extensively in the School of Nursing and the M.H.Sc. program over the last twenty years, as a major component of the assessment in virtually all non-clinical courses. Essays are also used in the more recently established programs in Occupational and Physical Therapy. In some programs we find the use of the so-called "Patient Problem Write-Up", which can range from a fully-fledged essay (in the sense in which we are primarily using the term) to a more restricted task. In its more restricted form, the Problem Write-Up generally requires the writer to identify the questions that need to be explored or answered in response to a patient situation, summarize the relevant findings from the literature, and draw conclusions about patient management. This usually involves not so much the taking of an individual position as the summarizing of information and applying it in a fairly straightforward way to a particular situation. Some, but not all, of what we have to say about essays as an evaluation instrument applies equally to this kind of task, as will be apparent.

RELIABILITY AND VALIDITY

RELIABILITY

Essay evaluations are subject to major problems of reliability, in that different readers tend to give different grades to the same essay. Much of the evidence for this, however, comes from studies of the use of essays to assess writing skill itself. The situation with regard to the use of essays to evaluate achievement in a particular content area is less clear. One standard work on educational measurement concludes that "the fact that essay tests typically yield highly subjective and unreliable measures of achievement" has been "established beyond dispute" (Ebel & Frisbie, 1986: 130). The magnitude of the problem, another commentator has noted, can be illustrated by a classic example:

In 1961 a study was conducted at the Educational Testing Service in which 300 essays written by college freshmen were rated by 53 readers representing several professional fields... Each rater used a nine-point scale. The results showed that none of the 300 essays received fewer than five of the nine possible ratings, 23 percent of the essays received seven different ratings, 37 percent received eight different ratings, and 34 percent received all possible ratings (Breland, 1983).

Another writer summarizes the situation as follows:

Some markers are "soft" and assign marks that are higher on the average than those assigned by "hard" markers. Some markers assign scores that are bunched together, whereas other markers spread the scores that they assign over the full range of possible scores. Some markers assign marks more reliably (consistently) than other markers. (Traub and others, 1976)
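The three kinds of marker behaviour named in the quotation above (leniency, spread, and consistency) correspond to simple statistics on markers' scores. The following sketch, with invented scores on a nine-point scale, is added purely as an illustration of how those behaviours show up in data.

    # Invented scores from two markers on the same six essays, used only to
    # illustrate leniency (mean), spread (standard deviation) and consistency.
    import statistics

    marker_a = [6, 7, 5, 8, 6, 7]   # a "soft", bunched marker
    marker_b = [3, 8, 2, 9, 4, 6]   # a harder marker using the full range

    for name, scores in (("A", marker_a), ("B", marker_b)):
        print(f"marker {name}: mean={statistics.mean(scores):.1f} (leniency), "
              f"sd={statistics.pstdev(scores):.1f} (spread)")

    # Consistency between the two markers: correlation of their scores.
    print(f"inter-marker correlation = {statistics.correlation(marker_a, marker_b):.2f}")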
One aspect of the variability of essay scores is that readers have different notions about the importance of various criteria when judging a particular set of essays. In addition, a criterion considered by a reader to be of importance in one essay may not be considered so critical by that reader in another essay. For example, the weight given to the fact that an essay is deemed "well organized" is likely to vary according to other circumstances, such as how clear the overall purpose of the essay is perceived to be, or how sensible the argument. Criteria are clearly, to some extent, interactive. Even when instructed to judge an essay according to specified criteria, some readers interpret and apply those criteria differently from other readers. Readers are also influenced by extraneous factors such as length, physical presentation, and the order in which essays are presented (Breland, 1983).

What can be done to minimize the problem of unreliability? There is some evidence that reasonable consistency among markers can be obtained on essays written in examinations if one or both of two conditions apply: 1) what the essay should contain is closely specified (e.g., by highly specific instructions to writers, and the use of model answers or detailed marking schemes by scorers); 2) essays are scored by multiple readers who are trained, co-ordinated and closely monitored (see, for example, Naeraa & Lundgren, 1980; Nichols & Miller, 1984).

The problem with the first condition is that it conflicts with much of what we have described as the main purpose of asking students to write essays, especially those written over a lengthy period of time: giving students the freedom to explore a topic and take a position. The difficulty with the second condition is that it is scarcely feasible in normal academic circumstances. The Educational Testing Service (ETS), for instance, describes as follows the procedure involved in scoring essays designed to evaluate proficiency in writing:

Before a scoring session, a team of scoring leaders meets to read and discuss a set of essays chosen randomly from the test administration. They then select essays for training that clearly exemplify each point on the scale. No names, testing locations, or other identifying data appear on the essays. The leaders then train the readers by explaining the scoring method to be applied, giving explicit instructions, reviewing criteria for scoring, and presenting sample essays for practice scoring and discussion. Training and practice scoring continue until the scoring leaders are satisfied that participants are ready to begin the actual scoring. Throughout the reading, the scoring leaders monitor consistency by collecting and analyzing scoring statistics.

Reader training is a continual process. After rest breaks, for example, the readers score additional sample essays to ensure that standards are applied consistently and accurately (GRE Board, 1992).

One point to note is that reasonably consistent marking on these examinations appears to be achieved only if the preferences and judgments of individual markers are to some extent over-ridden (through the training sessions) by the need for conformity. It is unlikely that in normal academic situations markers would be willing to sacrifice their individual judgments to this extent.
One procedure that is often recommended (e.g. by Reiser, 1980) as a way of reducing inconsistency between scorers of essays is to use "analytical" rather than "holistic" scoring. This involves awarding sub-scores for specific features of an essay and combining them to produce a total, rather than assigning a total score directly on the basis of an overall assessment of the essay's merit. (Various other ways of going about the task of giving an essay a grade are described in Breland, 1983.) Unfortunately, analytical scoring does not seem to improve consistency in general, although it may work as part of a closely controlled examination-style marking process as mentioned above, when the features to be looked for in the essays are specified and markers' compliance is closely monitored. Many essay-readers have had the experience, when using analytical scoring, of adding up the numbers only to conclude that the resulting total under- or over-estimates the essay's worth.

VALIDITY

Essays have high face validity as a measure of the ability to do the sorts of independent thinking and communicating tasks described above (under "Description and Purpose"). It is in fact hard to think of another way of evaluating skills such as organizing, supporting and communicating extended arguments. Essays can have high content validity when topics and evaluation criteria are clearly related to course content and learning objectives. However, because of the time required to complete a single essay, the relatively narrow scope of most essays, and the freedom often accorded to essay-writers to decide where to focus their attention, essays lack breadth of coverage, or comprehensiveness, in relation to course content. Even essay questions in examinations can sample understanding and knowledge only highly selectively. No clear information seems to be available about the relationship between essays (as opposed to examination essay answers) and other measures of subject-area competence. There have been some comparisons of essay examination and multiple-choice test scores (see Neufeld, 1985; Day and others, 1990) which do not indicate a strong relationship between the two. This may, however, largely reflect the problem of reliability of essay scores.

DEVELOPMENT AND IMPLEMENTATION

There is no standard procedure or format for the development of essay assignments. The instructions and assistance provided to students will vary from one situation to another. However, it may be useful to mention here some of the advice often given to assigners of essays by those who assist students in developing their writing skills (for example, Clanchy, 1986).

1. Assigning of essays: Students need a clear understanding of what they are expected to do. Some tend to feel that they have to "guess what the instructor wants" in the face of enigmatic instructions, while some instructors find that students look for an over-simple "template" or format to follow, resisting the invitation to engage with the complexities or open-endedness of the thinking task required. Written directions should be explicit, and should try to avoid making the task sound unnecessarily complex or elaborate. Terms like "discuss", "explore" and "analyze" may require clarification. Class time can usefully be spent explaining the assignment, anticipating misconceptions and common problems, soliciting students' perceptions of what they have to do, modelling the thinking process required, or presenting examples of successful papers.
2. Assistance during the process of writing: Many of the problems in completed essays result from difficulties that were not resolved early in the process of working on the project. It may be helpful for writers to have the opportunity, at stages along the way, to discuss and clarify the issues they are addressing, the sources of information they are using, their tentative findings, the proposed structure of the essay, and so forth.

3. Feedback: This needs to be given while the student still recalls in detail the experience of writing the paper. Comments should be detailed and, where possible, should be clearly related to specific features of the essay. For instance, a long series of comments written at the end of the essay, pointing out a failure to integrate theory with practice, or complaining of poor organization, is unlikely to be very effective. On the other hand, pointing out (at the appropriate points in the text of the essay) that after sections on aspects A, B and C we appear to be back with A, or that there is no clear connection between B and C, or that during the section on assessment of the patient there is no mention of the theory that is supposed to be guiding that assessment, is likely to be more useful. Where feasible, a personal interview in which the marker goes through the essay, pointing out noteworthy features, may be desirable.

ADVANTAGES AND DISADVANTAGES

The main advantage of essays is that they attempt to assess kinds of thinking and communication skills and abilities not generally demanded by other evaluation methods (see above under "Description and Purpose"). Some of these skills and abilities are clearly related to the learning objectives of PBL curricula. Essays allow students to pursue individual interests, set their own objectives, control their own investigations, and feel a substantial sense of achievement. They are therefore especially suitable for courses which encourage self-directed learning.

Another advantage is that essays are relatively straightforward to devise, though some cautions are in order here:

Constructing essay questions that require the specific behaviours emphasized in a particular set of learning outcomes takes considerable time and effort. ... In addition to the invalidity of the measurement, evaluating the answers to carelessly developed questions tends to be a confusing and time-consuming task (Gronlund and Linn, 1990).

Essays are also highly adaptable to different subjects and different levels of sophistication. They are, however, very time-consuming and difficult (and frequently tedious) to read and mark.

The major drawback to essays is their low reliability, and it seems likely that this problem cannot in most cases be significantly ameliorated. To make a significant impact on the problem of reliability there are two main strategies which can be adopted:

(1) Use multiple markers, trained and monitored, to compensate for the various inconsistencies in grading that otherwise occur. For practical reasons this is likely to be feasible only in special circumstances.

(2) Restrict significantly the essay task and the writer's freedom, prescribing very tightly what the writer should do, and setting up very specific criteria for determining the writer's degree of success.
While this may allow for easier and more consistent judgments about the extent to which criteria have been met, it empties the essay task of a great deal of what makes essays a uniquely valuable evaluation instrument: the use of independent thought and judgment in focusing, organizing, synthesizing and so forth. Once this has occurred, there are likely to be methods of evaluation other than essays which can achieve the same purpose more efficiently and reliably.

REFERENCES

Breland HM. (1983) The direct assessment of writing skill: a measurement review. College Board Report No. 83-6. New York (College Entrance Examination Board).

Clanchy J. (1986) Improving student writing. HERDSA News, 26 Feb. 1986. Reprinted in Pedagogical Info (University of Ottawa), January 1988.

Day SC, Norcini JJ, Diserens D, Cebul RD, Schwartz JS, Beck LH, Webster GD, Schnabel TG, Elstein A. (1990) The validity of an essay test of clinical judgement. Acad Med 65(9): S39-S40.

Ebel RL, Frisbie DA. (1986) Essentials of Educational Measurement, 4th ed. Englewood Cliffs (Prentice-Hall).

Gronlund NE, Linn RL. (1990) Measurement and Evaluation in Teaching, 6th ed. New York (Macmillan).

GRE Board (1992) Writing proficiency: how is it assessed? GRE Board Newsletter 8: 3-4.
