List of Tables and Figures
Table 1: Canadian and International Medical Graduate Pass/Fail Rates for the Years 2012-2014
Table 2: Standard Setting Results for Panels 1 and 2 for Rounds 1 and 2
Figure 1: Failure Rates for First-Time Takers (Panel 1)
Figure 2: Failure Rates for First-Time Takers (Panel 2)
Figure 3: Failure Rates for First-Time Takers (Combined Panels)
Figure 4: Failure Rates for all First-Time Takers (Round 2)
Figure 5: Failure Rates for all First-Time Takers and Hofstee Boundaries
The purpose of the standard setting session for the MCCQE Part I that took place October 23-24,
2014, was to arrive at a recommended cut score for subsequent review and approval by the
Central Examination Committee (CEC). The most important aspect of standard setting is the
validity of the process and activities. In the sections that follow, we describe in detail the
pre-session activities, as well as the activities that took place during the standard setting
session for the MCCQE Part I.
Pre-Session Activities
SELECTING A STANDARD SETTING METHOD
Standard setting methodologies abound but not all are well suited for the types of items that are
used in the MCCQE Part I. Several methodologies were considered but the Bookmark method
was chosen because of its simplicity and the ease with which both MCQs and CDM items can be
integrated in the cut score (Cizek, 2007). The Bookmark method is an item mapping procedure
where items are ordered from easiest to most difficult based on operational data, and panelists
are asked to place a bookmark at the point at which they believe a minimally proficient candidate
would correctly answer all items up to that point and incorrectly answer all items beyond it.
Since the panelists selected for a standard setting exercise represent a microcosm of all MCCQE
Part I examination stakeholders, it is critical to select participants who are representative with
respect to a number of key variables, including region of Canada, ethnicity, medical specialty,
and years of experience. Furthermore, to assess the reproducibility of the cut score across 2
groups of physicians, we decided to split our panelists into 2 matched subgroups. The latter
allows us to collect critical validity evidence in support of the recommended cut score.
The process of selecting participants started with an invitation which was forwarded to physicians
from across Canada, targeting Family Physicians as well as a broad range of other specialists. A
total of 22 physicians were retained based on several key criteria (see Appendix A for the
demographic information survey that was filled out by all potential participants). As previously
mentioned, we attempted to select panelists in both subgroups that were reflective of various
regions across the country (i.e., Western, Central, and Eastern Canada); medical specialty (family
medicine, internal medicine, surgery, obstetrics and gynecology, pediatrics, and psychiatry);
ethnicity (i.e., Asian, Black, Caucasian, First Nation, or Hispanic), sex, and years of experience
supervising residents. In Appendix B, we present a summary of the demographics of the two
panels. Some minor imbalance ensued when five participants bowed out a few days before the
session. Two of these people decided not to participate on account of the tragic incident that
occurred in Ottawa at the War Memorial and the Centre Block of Parliament the day before this
session.
All questions used for the standard setting session were taken from the most recent MCCQE
Part I, namely the spring 2014 administration. Dichotomously scored MCQs were calibrated
using the Rasch model (Rasch, 1960/1980) which, in turn, were used as anchors to calibrate the
CDM questions (Rasch model for dichotomous CDMs and the partial credit model (Masters,
1982) for polytomous CDMs). With the bookmark method, the basic question that panelists must
answer is the following: “Is it likely that the borderline candidate will be able to answer this
question correctly?” A typical probability level used with the bookmark method is the 67%
response probability, or a 2/3 chance of answering correctly. Therefore, response probabilities
were calculated using a 2/3 probability criterion for each dichotomously scored MCQ and CDM
and for each step value for polytomously scored CDMs.
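As an illustrative sketch (not part of the operational scoring system), the RP67 value for a dichotomously scored Rasch item has a simple closed form: setting the Rasch probability of success, 1 / (1 + exp(-(theta - b))), equal to 2/3 gives theta = b + ln 2. The helper name below is our own:

```python
import math

def rp67(b: float) -> float:
    """Ability (theta) at which a candidate has a 2/3 chance of
    answering a Rasch-calibrated dichotomous item correctly.

    Solving 1 / (1 + exp(-(theta - b))) = 2/3 yields
    theta = b + ln(2), roughly b + 0.693.
    """
    return b + math.log(2.0)

# An item of average difficulty (b = 0) requires theta of about 0.693
# for a 2/3 chance of success; a harder item (b = 1.2) about 1.893.
```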
To assist panelists to prepare for the standard setting session, we asked them to read an article
(De Champlain, 2004) and a book chapter (De Champlain, 2014) on the topic of standard setting
that we sent out prior to the exercise in October 2014. Additionally, the agenda for the two-day
session was mailed out to participants a few weeks before the session (see Appendix C).
The success of any standard setting session relies heavily on the extensive training of
participating panelists. This helps to ensure that panelists have the same objective in mind and
the same basic premises and understanding of the standard setting process. To this end, we
spent half of the first day of the exercise training our panelists on a number of issues, including
the structure and content of the MCCQE Part I. Examples of questions for both components of
the examination were shown with the type of scoring rubrics that would be seen in the exercises
included in the session. This was followed by a tutorial on standard setting, including issues to
consider, methods and sources of evidence to support the reliability and validity of any cut-score.
Particular attention was paid to the method selected to arrive at a recommended
cut score for the MCCQE Part I exam, namely, the Bookmark method. In addition, a second,
ancillary standard setting method was introduced, the Hofstee method, which was used as a
complement to the item-centered Bookmark approach. The Hofstee method is described in the
literature as a compromise method (Hofstee, 1983) in that it integrates both norm-referenced
(relative interpretations) and criterion-referenced (absolute interpretations) considerations in a
“gut estimate” that is used to further validate the cut-score obtained following the Bookmark
exercise.
Commonly, standard setting methodologies, including the Bookmark method, assume that a cut-
score is set for the minimally proficient or borderline candidate. This hypothetical candidate is
critical in setting the cut-score, i.e., a point on the continuum of professional competence that
separates those deemed as competent candidates from those deemed as incompetent. The
Bookmark method requires that panelists clearly define what constitutes a minimally proficient (or
borderline) candidate, with respect to what they may know and not know in the domains targeted
by the MCCQE Part I exam.
To assist panelists in this task, a basic definition was developed by the Vice-chair of the CEC and
offered to the panelists as a starting point. After much discussion, the participants agreed on a final working definition of the minimally proficient (or borderline) candidate.
To better understand the type of questions that Part I candidates must answer during an
examination, a practice test was administered to the panelists prior to collecting their judgments.
It contained a representative sample of 50 multiple-choice questions and 26 clinical decision-
making questions selected from the spring 2014 MCCQE Part I examination. Panelists were
given 90 minutes to complete the practice test after which they were instructed to self-score their
test using an item map which provided correct answers for each question. The purpose of the
practice test was also to give participants a sense of the level of difficulty of the MCCQE Part I.
Participants were not asked to share their resulting score with other panelists. However, this
exercise did provide the basis for a discussion of their perceived level of difficulty of the questions
and the appropriateness of the content in relation to the purpose of the Part I examination and its
target population (i.e. candidates entering supervised training or residency).
A practice bookmark exercise was planned to train the panelists in this procedure before they
engaged in the actual full-scale activity. The same questions used in the practice test were used
for this exercise as well. However, the questions were now ordered by difficulty level, from
“easiest” to “most difficult”, based on actual spring 2014 MCCQE Part I candidate performances.
The goal of this standard setting method was to allow panelists, in a practice round, to identify a
point on the scale that they believed reflected minimal competency in the domains measured by
the MCCQE Part I examination.
Each participant was presented with a booklet that contained examination questions (one per
page) that were ordered by difficulty from easiest to most difficult. Each participant was asked to
place their bookmark at the point at which they felt a minimally proficient (or borderline) candidate
would correctly answer all items up to that point and incorrectly answer all items beyond that
point. The basic question that panelists must answer in the Bookmark procedure is the following:
“Is it likely that a minimally proficient candidate will be able to correctly answer this test question?”
Of course, the “likeliness” must be defined more specifically. In the Bookmark method, it is
defined as having a 2/3 chance of answering correctly (or 2/3 chance of reaching a CR score or
higher – for polytomous items). The expression “RP67” is often used to capture the essence of a
.67 response probability; simply another way of expressing the 2/3 chance of answering correctly.
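For polytomously scored CDMs under the partial credit model, the theta value giving a 2/3 chance of reaching a given score category or higher has no closed form, but it can be found numerically. The helper names and step difficulties below are our own illustrative assumptions, not operational code:

```python
import math

def pcm_probs(theta, steps):
    """Partial credit model probabilities for score categories
    0..len(steps), given the item's step difficulties `steps`."""
    # Cumulative sums of (theta - step_j); category 0 has sum 0.
    sums, total = [0.0], 0.0
    for step in steps:
        total += theta - step
        sums.append(total)
    exps = [math.exp(s) for s in sums]
    denom = sum(exps)
    return [e / denom for e in exps]

def rp67_step(steps, k, lo=-8.0, hi=8.0, tol=1e-9):
    """Theta at which P(score >= k) = 2/3, found by bisection.
    P(score >= k) increases with theta, so bisection converges."""
    target = 2.0 / 3.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        p_ge_k = sum(pcm_probs(mid, steps)[k:])
        if p_ge_k < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# With a single step the PCM reduces to the dichotomous Rasch model,
# so rp67_step([0.0], 1) is approximately ln(2), i.e. 0.693.
```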
Round 1 (Preliminary round). Following the practice bookmark round, panelists were reminded of
some key points about the Bookmark method and were assigned to their respective panels. They
were then each provided with a booklet that contained 236 items (one form’s worth of items)
which were ordered by difficulty level (based on RP67 value) from easiest to most difficult. They
were then instructed to independently place a bookmark at the point at which they felt a
minimally proficient (or borderline) candidate would correctly answer all items up to that point and
incorrectly answer all items beyond that point. Forms were distributed for documenting each
panelist’s bookmark (see Appendix E). The panelists were given 3.5 hours to complete their
round 1 bookmark placement. Note that the judgments provided in round 1 were solely based on
the item text that was provided, i.e., no performance data were given.
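Constructing the ordered item booklet described above amounts to sorting the pooled items by their RP67 values, easiest first. A minimal sketch with made-up item identifiers and values:

```python
# Hypothetical item records: (item_id, rp67_theta). In practice these
# come from the Rasch/PCM calibration of the spring 2014 form.
items = [("Q17", 0.42), ("Q03", -1.10), ("Q29", 1.85), ("Q08", -0.25)]

# Easiest first: a lower RP67 value means less ability is needed for a
# 2/3 chance of success, so that item appears earlier in the booklet.
booklet = sorted(items, key=lambda item: item[1])

# Assign booklet page numbers (one item per page).
pages = {item_id: page for page, (item_id, _) in enumerate(booklet, start=1)}
```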
Following round 1, panelists were asked to provide answers to the following four Hofstee method
questions: (1) What is the minimum acceptable cut-score (Cmin), even if all candidates attained
this score level; (2) What is the maximum acceptable cut-score (Cmax), even if no candidate attains this
score level; (3) What is the minimum tolerable failure rate (Fmin); and (4) What is the maximum
tolerable failure rate (Fmax). Again, this information is used to gauge the appropriateness of the
Bookmark method cut-score as per the panelists’ holistic views. Forms were distributed (see
Appendix F) to allow panelists to record the data for the Hofstee method. Forms were collected
and provided to Statistical Analysts who in turn entered the data in an application which allowed
us to view each panel’s bookmark overlaid with the Hofstee boundaries. Figures 1 and 2 illustrate
Bookmark and Hofstee data for round 1 for Panels 1 and 2, respectively. Figure 3 combines the
data for both panels. Panel 1 panelists are represented as blue letters on each graph. Panel 1
had 9 panelists: A, C, D, E, G, H, I, J, and K. Panel 2 panelists are represented as red letters.
Panel 2 had 8 panelists: A, B, E, F, G, H, I, and J. The placement of letters on the graphs have
significance only on the x-axis, namely the cut scores on the theta scale. The stacking of some of
the letters was done simply to distinguish panelists whose cut scores were the same, rather than to convey any additional meaning.
Panelists from both panels were gathered in one room to provide them with impact data which
consisted of failure rates given their respective cut scores and combined cut score values. Pass
and failure rates for Canadian and International Medical Graduates for the years 2012-2014 were
presented to all panelists (See Table 1). Also, a cumulative distribution of examination results
was prepared from all first-time candidates who completed the spring 2014 MCCQE Part I. For
each score, a distribution of cumulative percentage of failures was established and a look-up
table was created to obtain a percentage failure for any given cut score obtained from each
panelist.
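The cumulative look-up described above can be sketched as follows, assuming only a list of first-time candidates' theta scores (the sample values are illustrative, not actual examination data):

```python
import bisect

def failure_rate_lookup(candidate_thetas):
    """Return a function mapping any cut score to the percentage of
    first-time candidates who would fail (i.e., score below the cut)."""
    scores = sorted(candidate_thetas)
    n = len(scores)

    def failure_rate(cut):
        # Candidates strictly below the cut score fail; bisect_left
        # counts them in O(log n) per lookup.
        return 100.0 * bisect.bisect_left(scores, cut) / n

    return failure_rate

# Illustrative: 8 candidates; a cut of 0.0 fails the 4 scoring below it.
rate = failure_rate_lookup([-1.5, -0.9, -0.4, -0.1, 0.2, 0.6, 1.1, 1.7])
# rate(0.0) -> 50.0
```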
To translate bookmark placement into cut scores on the item response theory (IRT) ability (theta)
scale, an additional look-up table was created that listed: (1) item identification number for each
item used in the bookmarking exercise; (2) the corresponding booklet page number; (3) the
Rasch item difficulty measure and; (4) the RP67 value or IRT ability value needed to have a 2/3
chance of correctly answering any given item in the sample MCCQE Part I exam form that was
used in our standard setting exercise. Once we obtained all bookmark placement page numbers,
those were entered and a corresponding cut score was identified using the look-up table for each
panelist, panel and overall.
To obtain a panel-level cut score, the median cut score was calculated from the distribution of cut
scores by panel. The median was chosen instead of the mean since it mitigates the influence of
extreme values when they occur. The latter value corresponded to the preliminary or round 1 cut
score.
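The translation from bookmark placements to a panel-level cut score can be sketched as follows; the look-up values, panelist labels, and placements below are illustrative assumptions, not the session's actual data:

```python
import statistics

# Hypothetical look-up table: booklet page -> RP67 theta of that page's item.
page_to_theta = {100: -0.55, 104: -0.41, 112: -0.30, 118: -0.22, 125: -0.10}

# Hypothetical round 1 bookmark placements (panelist -> booklet page).
placements = {"A": 104, "C": 112, "D": 100, "E": 118, "G": 112}

# The panel cut score is the median of the individual theta cut scores,
# which mitigates the influence of extreme placements.
panel_cut = statistics.median(page_to_theta[p] for p in placements.values())
```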
In Figure 1, we can observe that failure rates increase as cut scores increase and that the cut
score obtained by the Hofstee method (established by drawing a line down to the cut score at the
point where Fmax / Cmin and Fmin / Cmax lines traverse the cumulative failure rates curve) for Panel
1 falls between the lower and higher boundaries identified by the Hofstee method. This is a
desirable outcome. It is desirable because it indicates that the cut score (-0.39 on the theta scale)
identified by Panel 1 falls within what they expected in terms of maximum and minimum failure
rates and maximum and minimum cut scores.
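The graphical Hofstee procedure described above, intersecting the line from (Cmin, Fmax) to (Cmax, Fmin) with the cumulative failure-rate curve, can also be sketched numerically. The `failure_rate` function and all boundary values below are illustrative assumptions:

```python
def hofstee_cut(c_min, c_max, f_min, f_max, failure_rate, steps=10_000):
    """Find where the line from (c_min, f_max) to (c_max, f_min)
    crosses the cumulative failure-rate curve; return that cut score.

    The line falls from f_max to f_min while the empirical failure
    rate rises with the cut score, so they cross at most once."""
    for i in range(steps + 1):
        t = i / steps
        cut = c_min + (c_max - c_min) * t
        line = f_max + (f_min - f_max) * t
        if failure_rate(cut) >= line:
            return cut
    return c_max  # no crossing inside the interval; fall back to c_max

# Illustrative: failure rate rising linearly from 0% at theta -2 to
# 100% at theta +2; boundaries Cmin=-1, Cmax=0, Fmin=10%, Fmax=40%.
cut = hofstee_cut(-1.0, 0.0, 10.0, 40.0,
                  lambda c: 100.0 * (c + 2.0) / 4.0)
```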
In Figure 2, Panel 2 results for round 1 are presented. The results indicate that this panel had
incongruent outcomes between what they established as acceptable Hofstee boundaries and the
bookmark cut score (-0.78 on the theta scale). It would seem that 2 panelists (B and E) are largely responsible for this incongruence.
Round 2 (Final round). Panelists were then directed to their respective subgroup to engage in the
second and final round of bookmarking. Results from this second round constitute the
recommended cut score which was subsequently brought forward to the CEC for consideration
and adoption. Panelists were given two hours to complete this final standard setting round. As
was the case in the preliminary round (round 1), forms were gathered from panelists who
indicated their second bookmark placement as well as their responses to the four Hofstee
questions (post round 2). Graphical representations for round 2 bookmarking results are
presented in Figures 4 and 5. In Figure 4, round 2 individual and panel bookmark cut scores and
corresponding failure rates are presented. In Figure 5, the same data are provided with an
additional overlay of the Hofstee boundaries from round 2. The combined (i.e., both panels taken
together) cut score of -0.22 on the IRT ability scale (theta) would fail 14% of all first-time
candidates using the spring 2014 examination results. This cut score would fail 5.1% of first-time
Canadian medical graduates from the spring 2014 MCCQE Part I administration.
Results of the survey are presented in Appendix G. All 17 participants thought that the
information regarding the overview of the MCCQE Part I was either good (18%), very good
(18%), or excellent (65%). They thought that the overview of standard setting was either good
(6%), very good (29%), or excellent (65%). Central to the exercises during this standard setting
session was the notion of the minimally competent (i.e., borderline) candidate. Participants were
asked to assess the clarity of the definition of that target population that they developed. All 17
participants thought that the definition was clear (76%) or very clear (24%).
A significant amount of time was devoted to training panelists on the task, which staff felt
was extremely important to ensure a common understanding of what we expected of them before
they engaged in the actual bookmarking exercise. Ninety-four percent of panelists thought that the
exercise was appropriate, 6% thought that it was somewhat appropriate, and none thought it was
not appropriate. All participants thought that the training provided for the bookmark method was
either good (12%), very good (18%) or excellent (71%).
Among the factors that influenced participants the most when they engaged in the Bookmark
method were their perception of the level of difficulty of the items (94%), the description of the
minimally competent candidate (88%), the item statistics provided in round 2 (76%), and the
knowledge and skills measured by the items (76%). Among the factors that had the least
influence on their bookmarking exercise were the quality of the item distractors (12%) and the
number of answer choices per item (18%).
Participants were asked about their level of understanding of how to apply the bookmark and
Hofstee methods during round 1. For the bookmark method, 16 out of 17 participants indicated that they understood how to apply it.
Participants were also asked about their level of confidence regarding the consequential/
feedback data and the final discussion. Two participants (12%) felt somewhat confident, 6
participants (35%) felt confident, 9 participants (53%) felt very confident, whereas none of the
participants felt that they were not at all confident.
One of the significant outcomes desired following a standard setting exercise is a standard that
participants would recommend with a very high level of confidence. As part of the survey,
participants were asked about the level of confidence in the final recommended passing score.
One participant (6%) felt somewhat confident, while the large majority reported being confident (18%)
or very confident (76%) about the recommended cut score value.
Finally, participants were surveyed on potential improvements to consider for further standard
setting exercises. Among the suggestions for improvement were comments about providing
impact data after the practice bookmark method as well as each panelist’s bookmark placement.
Also, one participant suggested providing failure rates for each panelist’s bookmark following the
practice bookmark method. A few participants felt that there were no improvements to be made.
Concluding Remarks
The main goal of this report was to outline the main activities that constituted the standard setting
exercise for the MCCQE Part I. In summary, two panels were gathered for the purpose of
establishing and recommending a cut score by participating in a 2-day session during which they
were trained in the Bookmark and Hofstee standard setting methods. A significant amount of time
was spent defining the target population and training of panelists on various critical aspects of the
exercise. Two panels established highly comparable cut scores as demonstrated by the overlap
of their respective confidence interval using the standard error of judgment. A high level of
confidence in the recommended cut score was expressed by a majority of participants. Several
staff from Psychometrics and Assessment Services and the Evaluation Bureau participated in
making this a successful session. Finally, a comprehensive description of all the activities and the
resulting cut score as well as impact data for both the spring 2014 and 2015 cohorts were
presented to the CEC on June 8, 2015 for their discussion and consideration. The CEC
unanimously accepted the recommended cut score of -0.22 (427 on the 3-digit MCCQE Part I
reporting scale) at this meeting.
References
De Champlain, A. F. (2004). Ensuring that the competent are truly competent: An overview of
common methods and procedures used to set standards on high-stakes examinations. Journal
of Veterinary Medical Education, 31, 61-65.
Hofstee, W. K. B. (1983). The case for compromise in educational selection and grading. In S. B.
Anderson & J. S. Helmick (Eds.), On educational testing (pp. 109-127). San Francisco: Jossey-
Bass.
Kane, M. (1994). Validating the performance standards associated with passing scores. Review
of Educational Research, 64(3), 425-461.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.
Rasch, G. (1960/1980). Probabilistic models for some intelligence and attainment tests.
Copenhagen: Danish Institute for Educational Research. (Expanded edition, 1980, with foreword
and afterword by B. D. Wright. Chicago: The University of Chicago Press.)
Table 2: Standard Setting Results for Panels 1 and 2 for Rounds 1 and 2
Please provide your name and contact information, and check a box next to each of the
questions. The form can be sent by mail or electronically by 30 April 2014.
Name: __________________________________________________________________
________________________________________________________________________
________________________________________________________________________
1-5 years ☐
6-10 years ☐
11-20 years ☐
21-30 years ☐
More than 30 years ☐
1-5 years ☐
6-10 years ☐
11-20 years ☐
21-30 years ☐
More than 30 years ☐
Yes ☐
No ☐
Yes ☐
No ☐
Canada ☐
Other ☐
Alberta ☐
British Columbia ☐
Manitoba ☐
Maritimes ☐
Ontario ☐
Quebec ☐
Saskatchewan ☐
Territories ☐
7. First Language:
English ☐
French ☐
Other (________________________) ☐
8. Gender:
Male ☐
Female ☐
Asian ☐
Black ☐
Caucasian ☐
First Nations ☐
Hispanic ☐
Pediatrics ☐
Internal Medicine ☐
Psychiatry ☐
Obstetrics and Gynecology ☐
Surgery ☐
Family Medicine ☐
Other ☐
Urban ☐
Rural ☐
Hospital-based ☐
Community-based ☐
Variable of Interest          Group                   Panel A   Panel B   Total
Gender                        Female                  56%       50%       53%
                              Male                    44%       50%       47%
Geographic Region             West                    22%       38%       29%
                              Central                 56%       38%       47%
                              East                    22%       25%       24%
Medical Specialty             Internal Medicine       33%       38%       35%
                              Surgery                 22%       13%       18%
                              Obstetrics/Gynecology   11%       13%       12%
                              Pediatrics              22%       13%       18%
                              Psychiatry              0%        13%       6%
                              Family Medicine         11%       13%       12%
Number of Years               1-5 years               11%       38%       24%
Supervising Residents         6-10 years              44%       13%       29%
                              11-20 years             11%       25%       18%
                              21-30 years             33%       25%       29%
Country of                    Canada                  89%       88%       88%
Medical Training              Other                   11%       12%       12%
Panel: ______________________________________________________________________
Panelist: _____________________________________________________________________
Please indicate the page number of the item on which you placed your bookmark. It is the item for
which, in your judgment, a minimally proficient candidate’s chance of answering correctly falls
below a 2/3 probability.
Panel: ______________________________________________________________________
Panelist: _____________________________________________________________________
1. What is the highest percent correct cut score that would be acceptable, even if no
candidate attains that score? This value represents your estimate of the maximum level of
knowledge that should be required of candidates.
Round 1: ______ Round 2: ______
2. What is the lowest percent correct cut score that would be acceptable, even if every candidate
attains that score? This value represents your judgment of the minimum acceptable
percentage of knowledge that should be tolerated.
Round 1: ______ Round 2: ______
3. What is the maximum acceptable failure rate? This value represents your judgment of the
highest percentage of failing candidates that could be tolerated.
Round 1: ______ Round 2: ______
4. What is the minimum acceptable failure rate? This value represents your judgment of the
lowest percentage of failing candidates that could be tolerated.
Round 1: ______ Round 2: ______
2. What was your impression of the clarity of the information regarding the overview of
the MCCQE Part I exam that was provided on the morning of Day 1? (Select ONE)
3. What was your impression of the clarity of the information regarding the overview of
standard setting that was provided on the morning of Day 1? (Select ONE)
4. What was your impression of the clarity of the information regarding the overview of
the Bookmark Method that was provided on the morning of Day 1? (Select ONE)
6. How clear were you about the description of the “Minimally Competent” (or sometimes
called “Borderline”) candidate on the MCCQE Part I exam as you began the task of
setting a passing score following the training on the afternoon of Day 1? (Select ONE)
7. How would you judge the length of time spent (about 45 minutes on the agenda) on the
afternoon of Day 1 introducing, discussing and editing the definition of the “Minimally
Competent” or “Borderline” candidate? (Select ONE)
8. What is your impression of the practice session for applying the Bookmark Method to a
set of MCQs and CDM questions on the afternoon of Day 1? (Select ONE)
10. What factors influenced your placement of your Bookmark on day 2? (Select ALL
choices that apply)
11. How did you feel about participating in the group discussions regarding the ordered
item booklet? (Select ONE)
13. How comfortable were you in applying the Bookmark Method during marking round 1
on Day 2? (Select ONE)
14. How comfortable were you in applying the Bookmark Method during marking round 2
on Day 2? (Select ONE)
15. How comfortable were you in applying the Hofstee Method during marking round 1 on Day 2?
(Select ONE)
17. What level of confidence do you have in the final recommended passing score?
(Select ONE)
18. How could the method used for setting a passing score on the MCCQE Part I exam have
been improved?
# Response
1. The process as executed was excellent.
2. no
3. I think it took a little while to grasp the concept of minimally competent & hence the
book mark but became very clear after the initial exercise
4. I think that people are pushed to change their scores after the first session on day 2.
The bias was to increase the passing score on the second round because of the large
disparity in panels.
5. This is my first time doing this exercise, so I do not have previous experience for
comparison. Having said that, I don't feel there was nothing to improve.
6. it would have been valuable after the practice bookmark to provide the data including
the impact information and graphical spread, as we had done after round 1 on day 2.
7. I think the discussions were excellent!
8. no improvement needed - there was lots of time for discussion which I think was
important
9. Not sure; I thought the process went well as it is.
10. Develop the list of competencies from the onset of the exercise.