Teaching and Learning in Medicine

An International Journal

ISSN: 1040-1334 (Print) 1532-8015 (Online) Journal homepage: http://www.tandfonline.com/loi/htlm20

A Rasch Analysis Validation of the Maslach Burnout Inventory–Student Survey with Preclinical Medical Students

Yang Shi, P. Cristian Gugiu, Remle P. Crowe & David P. Way

To cite this article: Yang Shi, P. Cristian Gugiu, Remle P. Crowe & David P. Way (2018): A Rasch
Analysis Validation of the Maslach Burnout Inventory–Student Survey with Preclinical Medical
Students, Teaching and Learning in Medicine, DOI: 10.1080/10401334.2018.1523010

To link to this article: https://doi.org/10.1080/10401334.2018.1523010

Published online: 21 Dec 2018.

VALIDATION

Yang Shi (a), P. Cristian Gugiu (a), Remle P. Crowe (b), and David P. Way (c)

(a) Division of Quantitative Research, Evaluation and Measurement, Department of Educational Studies, The Ohio State University, Columbus, Ohio, USA; (b) The National Registry of Emergency Medical Technicians, Columbus, Ohio, USA; (c) Department of Emergency Medicine, The Ohio State University Wexner Medical Center, Columbus, Ohio, USA

ABSTRACT

Construct: Burnout is a psychological construct characterized by emotional exhaustion that arises from an excess of physical, emotional, and social demands over an extended period. Symptoms of burnout include withdrawal or disengagement from work. Burnout has become an important public health concern due to its association with severe negative consequences across numerous professions.

Background: The most widely used instrument for measuring burnout is the Maslach Burnout Inventory (MBI). An adaptation of the MBI, the MBI–Student Survey (MBI-SS), was developed for college students. The MBI-SS consists of 15 items covering 3 domains of burnout: exhaustion, cynicism (CY), and professional efficacy (PE). Although studies have confirmed the validity of the MBI-SS for college student populations, studies of its use with medical students are limited. The purpose of this study was to employ the Rasch model to examine the psychometric properties of the MBI-SS when used with a population of preclinical medical students.

Approach: Data were collected from 787 medical students who answered the MBI-SS at the conclusion of their 1st year. A maximum likelihood exploratory factor analysis for ordinal data confirmed the hypothesized three-factor structure of the MBI-SS. Subsequently, a Rasch analysis was employed to further evaluate the measurement properties of MBI-SS. We used the Rasch Rating Scale model to investigate the extent to which the three MBI subscales conformed to proper measurement characteristics, including comprehensive coverage of person ability and item difficulty along the latent continuum.

Results: Most of the 15 items on the MBI-SS effectively fit the Rasch Rating Scale Model, with minimal measurement error. Respondents effectively used the full range of the rating scale for all 15 items. Two subscales (PE and CY) contained items that were difficult for respondents to endorse, resulting in significant gaps along the measurement continuum. The CY subscale exhibited a slight floor effect. The 3 subscales showed good person reliability, good real-item reliability, and good person separation.

Conclusions: The Rasch analysis confirmed that the MBI-SS works well for measuring burnout among preclinical medical students. However, the Rasch analysis was able to identify that additional items are needed to improve the performance of MBI-SS. New items would be targeted at reducing the floor effect for the CY subscale and filling the other gaps in measurement along the latent continuum for the PE and CY subscales.

KEYWORDS
Rasch analysis; undergraduate medical education; Maslach Burnout Inventory–Student Survey; validity; reliability

CONTACT David P. Way, David.Way@osumc.edu, The Ohio State University Wexner Medical Center, Department of Emergency Medicine, 778 Prior Hall, 376 W 10th Avenue, Columbus, OH 43210, USA.
Color versions of one or more of the figures in the article can be found online at www.tandfonline.com/htlm.
© 2018 Taylor & Francis Group, LLC

Introduction

Occupational burnout represents a major public health concern, affecting both the individual and the organizations for which they work. Symptoms of burnout include fatigue, callousness, withdrawal, cynicism, low work morale, and deterioration of job performance.1 This condition has also been linked with more serious individual health problems, including depression, sleep disturbances, alcoholism, musculoskeletal disorders, hypertension, and myocardial infarction.2–5 Burnout has further been shown to negatively impact work organizations. Besides reduced job performance of workers, other problems such as increased absenteeism and employee turnover affect organizations.5–7

The term burnout was first coined in the 1970s to describe the “physical and mental exhaustion that results from emotionally demanding interactions with
people.”8,9 As originally conceived, burnout was thought to be a syndrome affecting primarily frontline human service workers in occupations such as law enforcement, healthcare, social work, education, and so on.9 By the 1990s, a broader definition of burnout was adopted, which recognized that individuals outside of human service occupations could also exhibit symptoms of burnout.10 This broader definition suggested that burnout could be caused by any individual’s interaction with the stressful demands of work.11 Beyond those in the workforce, burnout was thought to also impact nonworkers, such as students, who suffer from burnout as a consequence of the stress created by their studies.12

Although burnout has been distinguished from other psychological conditions including depression and anxiety, as of yet there are no objective clinical diagnostic tests for burnout outside of self-report questionnaires.13,14 Numerous questionnaire instruments exist for the measurement of burnout; however, the most commonly used and studied is the Maslach Burnout Inventory (MBI).15 By some estimates, the MBI has been used in more than 90% of empirical studies of burnout worldwide.16 The original MBI was introduced in 1981 and was designed for measuring burnout in individuals who worked in professions involving high levels of interaction with people. This instrument consisted of 22 items intended to measure three domains of burnout: exhaustion, depersonalization, and personal accomplishment.17 Of interest, the MBI has traditionally been considered a classification tool that assigns individuals to categories of burnout rather than points on a burnout continuum.7,9 Establishing a continuous measure of burnout would provide more information about the “degree of burnout” an individual may be experiencing.18

Since its inception, the MBI has been adapted for populations beyond the original human service worker. The MBI General Survey (MBI-GS) was developed to accommodate the wider scope of burnout and was adapted for use with a broader population of workers.7 The MBI-GS was reduced to 16 items that measured exhaustion, cynicism, and professional efficacy.7 Work by Gold and by Balogun et al. contributed to Schaufeli’s development of a version of the MBI for measuring burnout in undergraduate college students.12,19,20 The MBI Student Survey (MBI-SS) consists of 15 items organized into three subscales.12 As in the MBI-GS, cynicism replaced depersonalization and professional efficacy replaced personal accomplishment. The MBI-SS uses the same response set as all other MBI instruments: a 7-point behavioral frequency scale ranging from 1 (never), 2 (a few times a year or less), 3 (once a month or less), 4 (a few times a month), 5 (once a week), 6 (a few times a week), to 7 (every day).

Since the establishment of the burnout construct, effective measurement has been of great interest to researchers. As of the time this article was written, more than 3,400 peer-reviewed articles were found through the search term “Maslach Burnout Inventory” in the medical and social science databases. A large proportion of these articles were psychometric evaluations of the various MBI instruments that employed either exploratory factor analysis (EFA) or confirmatory factor analysis to verify dimensionality of the tools. In a meta-analysis of studies that used confirmatory factor analysis to investigate the domain structure, 18 of the 21 studies concluded that a three-factor model provided the best fit.21

Other studies have attempted to verify the unidimensionality of each of the three subscales by calculating the internal consistency coefficients (Cronbach’s alpha). Estimated alpha reliabilities from a sample of human service workers reported in the original MBI article by Maslach and Jackson (1981) were 0.90 for Exhaustion, 0.79 for Depersonalization, and 0.71 for Personal Accomplishment subscales.22 A meta-analysis of 84 studies published between 1995 and 2006 reported 95% confidence intervals for Cronbach’s alpha estimates weighted for sample size as 0.87–0.88 for Exhaustion, 0.77–0.79 for Depersonalization, and 0.77–0.79 for Personal Accomplishment.23 Of interest, all estimates of internal consistency reliability reported in the literature to date have treated MBI response set data as interval, continuous data rather than ordinal, noncontinuous data.24,25

Very few MBI validation studies provide new information regarding the dimensions of the underlying construct, burnout, or about the effectiveness of the measurements. Moreover, it is unknown whether the MBI can be effectively transformed from a blunt instrument of categorization into a more informational instrument, with an associated measurement continuum. Toward this end, Rasch modeling provides an alternative approach to investigating the psychometric properties of the MBI.

The key to the effectiveness of the Rasch model as an alternative method of evaluating the psychometric properties of instruments is the transformation of ratings to probabilities. This transformation effectively converts the rating scale from ordinal (rank-order) to interval (continuous) level measurement of the latent trait of interest (in our case, burnout).18,26 Distinct
from dimensional analyses, like EFA, Rasch modeling allows separate but integrated analyses of both person and item effects. Person ability is loosely defined as a person’s relative standing on the latent trait of interest, the probability that he or she will endorse a particular rating for a specific item.27 The interaction of the person with the item produces an estimate of the item difficulty. Estimates of item difficulty allow researchers to examine whether an instrument contains items the options of which are too difficult (floor effect) or too easy (ceiling effect) for the persons of a certain ability level to endorse.28 Comparing the distributions of person ability and item difficulty captured by an instrument can reveal gaps in measurement that potentially deflate effect sizes, reliability estimates, and correlation coefficients.29 Insights such as these are not attainable through classical item analysis methods.30–33

Although there is a tremendous amount of research exploring the structural validity of the MBI, the same is not true for the MBI-SS. Most investigations have used the MBI-SS to study the relationship of burnout to other variables, such as professional behavior, academic performance, or suicide ideation.20,34,35 A few structural validity studies of the MBI-SS have been conducted but primarily with students outside of the United States and none with medical students.36–38 Although we found a few studies in the medical and medical education literature that employed Rasch analysis in evaluating measurement instruments, surprisingly we found no literature on its use with the MBI-SS and its associated latent trait burnout.39,40 The purpose of this study was to investigate the psychometric properties of the MBI-SS using Rasch analysis and to determine whether the MBI-SS was a practical instrument for measuring burnout among preclinical medical students.

Methods

Participants

The data for this analysis were obtained from preclinical, 1st-year medical students enrolled at a large midwestern medical school. Students completed the MBI-SS at the end of the academic year as part of a battery of instruments used for program evaluation. The MBI-SS was slightly modified from the original MBI-SS by Schaufeli for use with these medical students by replacing the term “university” with “medical school.”12 The final data set for analysis consisted of complete MBI-SS data for 787 medical students, who were enrolled in the 1st-year course at the time of annual data collection in May 2009–2013. These participants represented approximately 75% of the total population of 1st-year medical students (N = 1,052) for the same years. Class cohorts (year of entry) and gender groups were equally represented in the data set, with participants by gender being 54.2% male (n = 427) and 45.7% female (n = 360). Our Institutional Review Board approved the collection of MBI-SS scores for educational program evaluation (Protocol 2008B0122). In addition, our Institutional Review Board approved the use of these MBI-SS scores for the purposes of instrument validation research (Protocol 2012B0331).

Data analysis

Dimensionality. To verify the hypothesized dimensionality of the MBI-SS (exhaustion, cynicism, and professional efficacy), we performed a preliminary maximum likelihood method of EFA using the polychoric correlation matrix for input (ordinal ML-EFA). A parallel analysis was used to identify the number of factors to extract, and Promax rotation (κ = 4) was used to achieve simple structure.41 The parallel analysis method has been shown to have an accuracy rate of higher than 90% in identifying the correct number of factors to retain.42 An ordinal version of this method is available that uses polychoric correlations, which are considered unbiased estimates of the true correlations had an ordinal response scale not been used.43–45 SAS Version 9.4 (SAS Institute, Inc., Cary, NC) was employed to do all statistical calculations.

Rasch analysis. Rasch modeling was performed using WINSTEPS version 4.1.0 (Linacre JM, Beaverton, OR) on each of the MBI-SS subscales separately. The Rating Scale Model, which constrains the category thresholds to be the same across all items, was employed to examine the MBI-SS measurement properties.46

Item fit. A critical step in Rasch analysis is to evaluate the fit of the data to the model. To this end, we assessed model fit using the information-weighted mean square (Infit MNSQ), outlier-sensitive mean square (Outfit MNSQ), and point-measure correlation. Outfit MNSQ is sensitive to people with an ability level far from the item difficulty level, whereas the Infit MNSQ is sensitive to people with an ability level close to the item difficulty level.40 Values of Infit and Outfit MNSQ statistics in the range of .5 to 1.5 were regarded as satisfactory.47 Items that did not meet this criterion were dropped from subsequent analyses. Similarly, good fitting items should exhibit a positive correlation with the subscale formed by the Rasch analysis (point-measure or item-Rasch measure
Table 1. Item statistics generated from Rasch analysis, their definitions and guide for interpretation.

| Rasch term or statistic | Meaning/implications | Criterion for interpretation/evaluation |
|---|---|---|
| Scale or Subscale | Used to identify or verify the existence of meaningful groups of items designed to measure smaller components of a larger more comprehensive construct. | Dimensionality from EFA: Interpretable factor loadings and a match between hypothesized subscales and factors extracted. Overall model fit from Rasch: All items consist of proper fit statistics (.5–1.5). Absence of gaps in measurement across the latent trait scale. |
| Scale Item | Items within a subscale | Proper fit statistics, positive point-measure correlation, orderly response categories, item reliability. |
| Scale Category | Response categories within a subscale | Each category is endorsed by at least 10 persons. Category measure falls into hypothesized order among other categories. Has Outfit MNSQ <2.0. |
| Item Difficulty/Logit | Measure of how easy or hard it is for a person to endorse an item. Hard items are endorsed only by a person experiencing high levels of burnout. Easy ones are endorsed only by those experiencing low levels of burnout. | For our study, large positive item difficulties represented difficult items and large negative difficulties represented easy items. |
| Standard Error | The quality or precision of the estimate of item difficulty. | The closer the standard error is to zero, the better. Large positive standard errors mean a lack of precision in estimating item difficulty. |
| Infit MNSQ: Information Weighted Mean Square; Outfit MNSQ: Outlier Sensitive Mean Square | Used to identify irregular or unexpected patterns in the data. Helps to evaluate how well items fit the Rasch model. | Should fall into this range (.5–1.5) for productive measurement. |
| Point-Measure Correlation | This is like a point-polyserial correlation between the item and the total subscale. | Good fitting items should exhibit positive point-measure correlations. |
| Item Reliability | Indicator of consistency with which persons estimate item and response option difficulties, that is, the rank order in the item-response difficulties. | Reliabilities > .70 are acceptable for research purposes. Reliabilities > .95 are acceptable when scores are being used for decisions about individual persons. |
| Person Reliability | Indicator of consistency with which items estimate the latent score for persons, that is, the consistency in the rank-order of people’s level of burnout. | Reliabilities > .70 are acceptable for research purposes. Reliabilities > .95 are acceptable when scores are being used for decisions about individual persons. |
| Real and Model Reliabilities | Serve as lower and upper bounds of person and item reliability estimates. | |
| Person Separation | Indicator as to how well an item discriminates between persons suffering low and high levels of burnout. | Separation indices >2.0 are satisfactory for research purposes. |
| Wright-Andrich Map | A graphical representation of the distributions of person measures (scores) and item measures (difficulties) on the latent trait. | The person distribution on the left side of the map should align with the item distribution on the right side of the map. |
| Rasch-Andrich Threshold | Most easily understood as part of the Category Probability Curves: the Andrich Threshold is the point at which one category probability curve crosses an adjacent category probability curve. | Threshold values should increase across the response categories. |
| Category Measure | Similar to Item Measure, Category Measures indicate how easy or hard it is for a person to endorse a particular category or rating on a rating scale. Difficult to endorse categories are endorsed only by persons experiencing high levels of burnout. Easy ones are endorsed only by those experiencing low levels of burnout. | For our study, high positive category measures represented difficult categories and high negative category measures represented easy categories. |
| Category Probability Curve | A graphical representation of the probability (likelihood) of the response category being selected over the range of the latent trait continuum. | Ideally, each probability curve should have a region along the latent continuum in which it is the most probable option. Relatively speaking, category curves should not overlap or be separated too much. |

Note: EFA = exploratory factor analysis.
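
The Infit and Outfit mean squares defined in Table 1 can be illustrated with a minimal sketch. It assumes the dichotomous Rasch model (a simplification of the 7-category rating scale used in the study) and hypothetical person measures and responses; the function names `rasch_prob` and `item_fit` are illustrative and do not correspond to WINSTEPS output.

```python
import math

def rasch_prob(theta, delta):
    """P(endorse) under the dichotomous Rasch model (simplifying assumption)."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

def item_fit(responses, thetas, delta):
    """Return (infit_mnsq, outfit_mnsq) for one item.

    responses: 0/1 observations, one per person.
    thetas: person measures in logits; delta: item difficulty in logits.
    """
    squared_residuals = 0.0  # sum of (x - P)^2
    information = 0.0        # sum of model variances P(1 - P)
    standardized = 0.0       # sum of squared standardized residuals
    for x, theta in zip(responses, thetas):
        p = rasch_prob(theta, delta)
        variance = p * (1.0 - p)
        residual_sq = (x - p) ** 2
        squared_residuals += residual_sq
        information += variance
        standardized += residual_sq / variance
    infit = squared_residuals / information  # information-weighted mean square
    outfit = standardized / len(responses)   # unweighted, outlier-sensitive mean square
    return infit, outfit

# Hypothetical data: two well-targeted persons plus one person far above the
# item difficulty who unexpectedly does not endorse the item.
infit, outfit = item_fit([0, 1, 0], [0.0, 0.0, 4.0], delta=0.0)
print(round(infit, 2), round(outfit, 2))
```

In this toy case the single unexpected response from a person far from the item difficulty inflates Outfit much more than Infit, which is the contrast between the two statistics that Table 1 describes.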

correlation). A negative point-measure correlation is an indication that response categories have been reversed (i.e., lower response categories have a higher item measure than subsequent response categories). Essentially, the latter two statistics test whether respondents treated the response options as if they were ordinal in nature. Items that lacked ordinal properties were also dropped from subsequent analyses.

Two indices of reliability were estimated, one for persons and the other for items. Person reliability, an equivalent to the traditional “test” reliability, indicates the consistency with which the items estimate the latent score for persons (i.e., the frequency with which
Figure 1. Plot of actual versus randomly generated eigenvalues from Ordinal Parallel Analysis.

they experienced symptoms of burnout), whereas item reliability indicates the consistency with which persons estimate item and response option difficulties (i.e., the rank order and relative spacing between items). That is, person reliability is a measure of the extent to which items are able to spread people out along the latent continuum, whereas item reliability is a measure of the extent to which people are able to spread items out along the same continuum. The “real” and “model” person reliabilities serve as lower and upper bounds to these values, respectively. Although reliability is a function of both the instrument and the sample to which it was administered, standards do exist for acceptable levels of reliability ranging from .70 to .95.48,49 Nunnally and Bernstein suggested that reliabilities of .70 are acceptable for research,48 whereas Gugiu and Gugiu illustrated that decisions about persons (interpretation of individual-level scores) require reliabilities greater than .95 to offset even modest margins of measurement error.49 Similarly, person separation is an indication of item discrimination around ability levels, and separation index values larger than 2 are satisfactory for clinical research.32

We used a graphical method (Wright-Andrich Maps) to plot the distributions of items and persons for each MBI subscale. Ideally, the distribution of items (or, in our case, response categories) generated from difficulty scores should align with the distribution of person ability scores. Such an alignment or coverage is an indication that the instrument is well suited for measuring burnout in the population of preclinical medical students. Gaps in measurement occur when the person and item distributions do not align. Gaps indicate a lack of precision in effective measurement at that place on the scale. Gaps can appear anywhere along the latent continuum but are most commonly observed as either floor effects (not enough easy items to cover the bottom of the person distribution) or ceiling effects (not enough difficult items to cover the top of the person distribution).

Our criteria for evaluating measurement gaps were inspired by Baghaei.50 Floor and ceiling effects were defined as the existence of persons with logit scores at the bottom or top of the persons distribution (left side of the Wright-Andrich Map) that were at least one logit from the nearest item measure (right side of the Wright-Andrich Map). Specifically, we considered effects to be mild if less than 10% of respondents met this definition, moderate if 10% to 20% met the definition, and severe if more than 20% of respondents met the definition.

Rating scale. The rating scale is considered an indication that higher ratings are consistently used or provided by higher functioning individuals, and vice versa. We evaluated the functionality of the MBI-SS 7-point rating scale (also referred to as response
category scale) using criteria proposed by Linacre, which include the occurrence of more than 10 endorsements per response category, the observation that both average measures and category thresholds increase across each response category, and an observed Outfit mean square residual (MNSQ) value of less than 2 for each response category.51

Finally, a graphical inspection of the category probability curves was performed to determine how effectively respondents used all seven response categories. Ideally, every probability curve should have a region along the latent continuum in which it is the most probable option. A curve (response category) that is buried under an adjacent curve is an indication that respondents did not effectively discriminate between these two response options, thereby suggesting that the response options should be reduced or grouped.52 In contrast, large distances between the peaks of adjacent curves would indicate that respondents might need an additional response category between these two options.52 Table 1 provides a summary of these Rasch terms and statistics along with their definitions and criteria for interpretation and evaluation.
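
The idea of a “buried” category probability curve can be made concrete with a short sketch of the Rasch Rating Scale Model. The code below is an illustration under stated assumptions, not the WINSTEPS procedure: categories are indexed 0 to 6 rather than 1 to 7, and both threshold sets are hypothetical values chosen only to contrast an ordered scale with a disordered one.

```python
import math

def rsm_category_probs(theta, delta, thresholds):
    """Return P(X = k) for k = 0..m under the Rasch Rating Scale Model.

    theta: person measure (logits); delta: item difficulty (logits);
    thresholds: Rasch-Andrich thresholds tau_1..tau_m shared by all items.
    """
    # Cumulative sums of (theta - delta - tau_j); category 0 has an empty sum.
    exponents = [0.0]
    for tau in thresholds:
        exponents.append(exponents[-1] + (theta - delta - tau))
    peak = max(exponents)  # subtract the maximum for numerical stability
    weights = [math.exp(e - peak) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

def modal_categories(delta, thresholds, lo=-8.0, hi=8.0, step=0.05):
    """Categories that are the most probable response somewhere on [lo, hi]."""
    modal = set()
    n_steps = int(round((hi - lo) / step))
    for i in range(n_steps + 1):
        theta = lo + i * step
        probs = rsm_category_probs(theta, delta, thresholds)
        modal.add(probs.index(max(probs)))
    return modal

# Hypothetical thresholds for a 7-category scale (illustrative only).
ordered = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
disordered = [-3.0, -2.0, 1.0, -1.0, 2.0, 3.0]  # third and fourth thresholds reversed

print(sorted(modal_categories(0.0, ordered)))
print(sorted(modal_categories(0.0, disordered)))
```

With ordered thresholds every category has a region where it is the most probable response; with the reversed pair, the category between the two reversed thresholds never becomes modal, which corresponds to a curve buried under its neighbors.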

Results
Dimensionality
The EFA and parallel analysis procedure generated eigenvalues from the polychoric correlation matrix along with the mean eigenvalues and eigenvalues representing the 95th percentile based on the Monte Carlo simulation. This analysis revealed that three eigenvalues (factors) fell above the 95th percentile estimates for the simulated eigenvalues (see Figure 1). In other words, the MBI-SS has three factors consistent with theory and past studies.12 According to the path diagram (Figure 2) generated by the ML-EFA, all items loaded on their expected factor. The exhaustion factor (EX) comprised five items with factor loadings from .55 to .80, the professional efficacy factor (PE) consisted of six items with factor loadings from .43 to .73, and the cynicism factor (CY) consisted of four items with factor loadings from .32 to .80.
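
The retention rule used in parallel analysis can be sketched as follows. The eigenvalues below are hypothetical placeholders, not the values plotted in Figure 1, and `factors_to_retain` is an illustrative helper rather than part of the SAS procedure used in the study.

```python
def factors_to_retain(observed, simulated_p95):
    """Count leading factors whose observed eigenvalue exceeds the
    95th-percentile eigenvalue obtained from simulated random data."""
    count = 0
    for obs, ref in zip(observed, simulated_p95):
        if obs <= ref:
            break  # stop at the first factor that fails the comparison
        count += 1
    return count

# Hypothetical eigenvalues (NOT the study's values), largest first.
observed = [5.9, 2.4, 1.6, 0.7, 0.6]
simulated_p95 = [1.25, 1.19, 1.15, 1.11, 1.08]

print(factors_to_retain(observed, simulated_p95))
```

Stopping at the first failure, rather than counting every eigenvalue above its reference, reflects the usual reading of the scree-style comparison: only the leading run of factors that beat the random-data benchmark is retained.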
Figure 2. Path diagram for Maslach Burnout Inventory–Student
Survey (MBI-SS).
Overall model fit
All items demonstrated Infit and Outfit MNSQ statistics within the range of .5 to 1.5 and had positive item-Rasch measure correlations, suggesting that all the items fit the model (see Table 2). In addition, the response scales functioned as ordinal levels of measurement because no disordered response categories were found. Because the Rasch model assumes unidimensionality, these fit statistics further confirm the dimensionality of the MBI-SS, as does the fact that more than 10 participants selected each response
Table 2. Item statistics including difficulty (in logits), Infit and Outfit mean square, and point-measure correlations for the three MBI-SS subscales, along with separation and reliability estimates by subscale.

| Subscale | Scale item | Item difficulty | SE | Infit MNSQ (a) | Outfit MNSQ (a) | Point-measure correlation |
|---|---|---|---|---|---|---|
| EX | EX4 | 0.94 | 0.05 | 1.16 | 1.17 | 0.77 |
| EX | EX5 | −0.09 | 0.05 | 0.89 | 0.89 | 0.85 |
| EX | EX1 | −0.23 | 0.05 | 0.81 | 0.81 | 0.86 |
| EX | EX3 | −0.29 | 0.05 | 1.18 | 1.18 | 0.82 |
| EX | EX2 | −0.33 | 0.05 | 0.89 | 0.89 | 0.85 |
| rPE | PE5 | 1.55 | 0.05 | 1.24 | 1.16 | 0.60 |
| rPE | PE4 | 0.49 | 0.04 | 0.96 | 0.93 | 0.67 |
| rPE | PE1 | 0.35 | 0.04 | 0.95 | 0.93 | 0.66 |
| rPE | PE3 | −0.12 | 0.04 | 0.81 | 0.79 | 0.74 |
| rPE | PE6 | −1.09 | 0.04 | 0.78 | 0.78 | 0.77 |
| rPE | PE2 | −1.18 | 0.04 | 1.32 | 1.33 | 0.67 |
| CY | CY4 | 0.42 | 0.05 | 1.02 | 0.99 | 0.78 |
| CY | CY1 | 0.15 | 0.05 | 0.98 | 0.91 | 0.83 |
| CY | CY3 | −0.23 | 0.05 | 1.03 | 1.03 | 0.81 |
| CY | CY2 | −0.34 | 0.05 | 0.91 | 0.95 | 0.84 |

| Subscale | Estimate | Person separation | Person reliability | Item reliability |
|---|---|---|---|---|
| EX | Real | 2.74 | 0.88 | 0.99 |
| EX | Model | 3.19 | 0.91 | 0.99 |
| rPE | Real | 1.93 | 0.79 | 1.0 |
| rPE | Model | 2.23 | 0.83 | 1.0 |
| CY | Real | 2.14 | 0.82 | 0.97 |
| CY | Model | 2.44 | 0.86 | 0.97 |

Note: MBI-SS = Maslach Burnout Inventory–Student Survey; Infit MNSQ = information-weighted mean square; Outfit MNSQ = outlier-sensitive mean square; EX = Exhaustion; rPE = Reversed Professional Efficacy; CY = Cynicism.
(a) Values in the range of .5 to 1.5 indicate a good fit.
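
The person separation and person reliability columns of Table 2 are two views of the same quantity. In the Rasch literature the separation index G and the reliability R are commonly related by R = G² / (1 + G²). The sketch below applies that relation to the separation values reported in Table 2; the function names are illustrative, and reading the reliability column as person reliability is an assumption about the table layout.

```python
def reliability_from_separation(g):
    """Reliability implied by a separation index G: R = G^2 / (1 + G^2)."""
    return g * g / (1.0 + g * g)

def separation_from_reliability(r):
    """Separation index implied by a reliability R: G = sqrt(R / (1 - R))."""
    return (r / (1.0 - r)) ** 0.5

# (person separation, person reliability) pairs as reported in Table 2.
table2 = {
    ("EX", "real"): (2.74, 0.88),
    ("EX", "model"): (3.19, 0.91),
    ("rPE", "real"): (1.93, 0.79),
    ("rPE", "model"): (2.23, 0.83),
    ("CY", "real"): (2.14, 0.82),
    ("CY", "model"): (2.44, 0.86),
}

for (subscale, kind), (separation, reported) in table2.items():
    implied = reliability_from_separation(separation)
    print(subscale, kind, f"implied={implied:.3f}", f"reported={reported:.2f}")
```

Under this relation, a separation of 2.0 (the research criterion in Table 1) corresponds to a reliability of .80, which is why the two criteria tend to be met or missed together.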

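Table 3. Response category (rating scale) diagnostics based on Linacre’s criteria involving observations, Outfit MNSQ, Rasch-Andrich Threshold, and category measure.

| Subscale | Scale category | Observed count | % of counts | Outfit MNSQ | Rasch-Andrich threshold | Category measure |
|---|---|---|---|---|---|---|
| Exhaustion | 1 (Never) | 156 | 4 | 1.33 | None | −6.42 |
| Exhaustion | 2 (A few times a year or less) | 547 | 14 | .97 | −5.26 | −4.03 |
| Exhaustion | 3 (Once a month or less) | 1,340 | 34 | .97 | −2.75 | −1.46 |
| Exhaustion | 4 (A few times a month) | 920 | 23 | .91 | −.09 | .49 |
| Exhaustion | 5 (Once a week) | 509 | 13 | .97 | 1.3 | 1.82 |
| Exhaustion | 6 (A few times a week) | 361 | 9 | .91 | 2.2 | 3.5 |
| Exhaustion | 7 (Every day) | 102 | 3 | 1.28 | 4.6 | 5.76 |
| Reversed Professional Efficacy | 1 (Never) | 600 | 13 | 1.06 | None | −4.65 |
| Reversed Professional Efficacy | 2 (A few times a year or less) | 1,347 | 29 | .89 | −3.47 | −2.52 |
| Reversed Professional Efficacy | 3 (Once a month or less) | 1,094 | 23 | .88 | −1.32 | −1.01 |
| Reversed Professional Efficacy | 4 (A few times a month) | 1,031 | 22 | 1.01 | −.58 | .21 |
| Reversed Professional Efficacy | 5 (Once a week) | 478 | 10 | .91 | .91 | 1.38 |
| Reversed Professional Efficacy | 6 (A few times a week) | 125 | 3 | 1.12 | 2.15 | 2.42 |
| Reversed Professional Efficacy | 7 (Every day) | 47 | 1 | 1.60 | 2.30 | 3.8 |
| Cynicism | 1 (Never) | 575 | 18 | 1.21 | None | −5.17 |
| Cynicism | 2 (A few times a year or less) | 800 | 25 | .9 | −3.98 | −3.02 |
| Cynicism | 3 (Once a month or less) | 1,043 | 33 | .84 | −2.0 | −.86 |
| Cynicism | 4 (A few times a month) | 387 | 12 | .92 | .48 | .58 |
| Cynicism | 5 (Once a week) | 179 | 6 | .82 | 1.12 | 1.47 |
| Cynicism | 6 (A few times a week) | 100 | 3 | .91 | 1.71 | 2.44 |
| Cynicism | 7 (Every day) | 64 | 2 | 1.7 | 2.67 | 4.00 |

Note: MNSQ = mean square.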
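
Linacre’s rating scale criteria, as summarized in the Methods, can be checked mechanically. The sketch below applies them to the Exhaustion rows of Table 3. Two assumptions are made: the thresholds and category measures are entered as signed logits, with the lower values taken to be negative (consistent with thresholds that increase across categories), and `check_rating_scale` is an illustrative helper, not a WINSTEPS function.

```python
def check_rating_scale(counts, outfit, thresholds, measures,
                       min_count=10, max_outfit=2.0):
    """Evaluate a rating scale against the criteria described in the text."""
    def strictly_increasing(values):
        return all(a < b for a, b in zip(values, values[1:]))

    return {
        # More than 10 endorsements per response category.
        "enough_observations": all(c > min_count for c in counts),
        # Rasch-Andrich thresholds increase across categories.
        "thresholds_increase": strictly_increasing(thresholds),
        # Category measures increase across categories.
        "measures_increase": strictly_increasing(measures),
        # Outfit MNSQ below 2 for every category.
        "outfit_below_limit": all(m < max_outfit for m in outfit),
    }

# Exhaustion subscale, Table 3 (signs on the lower logits are assumed).
counts = [156, 547, 1340, 920, 509, 361, 102]
outfit = [1.33, 0.97, 0.97, 0.91, 0.97, 0.91, 1.28]
thresholds = [-5.26, -2.75, -0.09, 1.3, 2.2, 4.6]  # category 1 has no threshold
measures = [-6.42, -4.03, -1.46, 0.49, 1.82, 3.5, 5.76]

result = check_rating_scale(counts, outfit, thresholds, measures)
print(result)
```

The same function can be applied to the Reversed Professional Efficacy and Cynicism rows by substituting their counts, Outfit values, thresholds, and category measures.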
Figure 3. Results of Rasch Analysis, Wright-Andrich Map of Exhaustion subscale with Person distribution on the left of the center line, and Item distribution on the right. Note: The Measure Scale (−7 to +7) is the logit scale resulting from the Rasch Analysis.

category across all items. Hence, the Andrich Thresholds and average category measure were stable (Table 3).

Item and person results by subscale

Exhaustion. Column 2 of Table 2 shows the difficulties for the items that compose the Exhaustion subscale. The values presented in this column can be interpreted as the average difficulty estimates for the seven response options that comprise that item. The items are presented in order from most difficult (high exhaustion) to easy (low exhaustion). For medical students, EX4, “Studying or attending a class is really a strain for me,” was rated higher as a source of exhaustion, whereas EX2, “I feel used up at the end of a day at medical school,” was rated lower. The Exhaustion subscale was found to have person reliability of .88, real-item reliability of .99, and real person separation of 2.61, all of which are satisfactory for research.48

The Wright-Andrich Map (Figure 3) plots the person measures along the left side and the item
Figure 4. Category probability curves for Exhaustion Item 1.

difficulties for each of the EX items along the right side, allowing one to compare item difficulties to the person measures. The M found to the left of the center line denotes the mean of the person logits on the left. The M on the right denotes the mean of the option (item) logits. The S’s and T’s denote 1 and 2 standard deviations from the means, respectively. The two distributions are roughly aligned (approximately .5 logit difference), as demonstrated by the proximity of the average person measure to the average item measure. Furthermore, the range of the item measures covers the vast majority of person measures with no gaps of notable significance (>1 logit) along the continuum.

Figure 4 shows the category probability curves for the Exhaustion response options. By evaluating the relative position of the response categories to one another, we see that respondents may have been able to effectively distinguish between all seven categories as demonstrated by the fact that no probability curves were “buried” under other curves. Examination of the interval between the peaks of the curves, however, suggests that respondents may have been able to distinguish two additional response categories: between 1 (never) and 2 (a few times a year) and between options 2 (a few times a year) and 3 (once a month or less).

Reversed Professional Efficacy. Table 2 shows that PE5, “I have learned many interesting things during the course of my studies,” originally had the lowest ratings, which after reverse coding ended up being the highest ratings. Students who selected the high reverse-coded ratings would be higher on the Burnout Latent Trait Scale. The opposite would be true for PE2, “I believe that I make an effective contribution to the classes I attend,” that is, students who selected the low reverse-coded ratings for PE2 would be lower on the Burnout Latent Trait Scale. The reliability indices for the Reversed Professional Efficacy (rPE) subscale were also adequate: person reliability of .79, real-item reliability of 1.0, and model person separation of 2.23.

Figure 5 presents the Wright-Andrich Map for options on the rPE subscale. From this figure, we see a lack of alignment between the person and item distributions, which is best characterized as the 1-point logit difference between the means of the items and persons. Although the rPE scale attained an acceptable level of reliability (person reliability = .79 and real-item reliability = 1.00), the person reliability was considerably lower than that for the EX subscale. The likely explanation for the lower reliability is the lack of alignment between the distributions of person and item measures. Essentially, respondents found the

Figure 5. Results of Rasch Analysis, Wright-Andrich Map of Reversed Professional Efficacy subscale with Person distribution on the left of the center line, and Item distribution on the right. Note: The Measure Scale (−6 to +4) is the logit scale resulting from the Rasch Analysis.

items to be a bit too difficult to endorse. Therefore, more precise measurement would be achieved by adding a few easier items to the rPE subscale.

Figure 6 shows the response option probability curves for the rPE subscale. None of the probability curves were “buried” under other curves, indicating that respondents were able to effectively distinguish between the seven response categories. Examination of the interval gaps between options 6 (a few times a week) and 7 (every day) indicates the probable need for more response categories between these two options. An additional response category may also need to be inserted between options 1 (never) and 2 (a few times a year or less).

Figure 6. Category probability curve for Item 1 of reversed professional efficacy.

Cynicism. The Cynicism subscale performed more poorly than the other two subscales, perhaps because CY consists of fewer items and the item difficulties are more closely clustered (Table 2). Although the overall person and item measure reliabilities were good (person reliability, real-item reliability, and person separation of 0.80, 0.97, and 2.14, respectively), two significant gaps were observed along the latent continuum (see Figure 7). As was the case for the rPE subscale, the distributions of the person measures and the option-item measures did not align well. The means of the two distributions differed by about 1.5 logits. In addition, a mild floor effect was detected, which can be seen by the lack of overlap between persons and items at the bottom of the figure. Given the short length of the subscale, its measurement precision could be improved with the addition of a few midrange and easier items.

Similar to the other two subscales, each category in turn was the modal (most probable) category at some point on the latent variable, indicating that all categories performed distinctly from other categories (see Figure 8). As was true of the EX subscale, respondents may have been able to distinguish between three additional response categories on the CY subscale: one between options 1 (never) and 2 (a few times a year or less), one between options 2 (a few times a year or less) and 3 (once a month or less), and another between options 6 (a few times a week) and 7 (every day).

Discussion

This study aimed to explore the psychometric properties of the MBI-SS for use with preclinical medical students and to identify areas of the instrument that may benefit from further development to enhance measurement effectiveness. To this end, we employed Rasch modeling to investigate the structural and response option validity, along with the reliability of the instrument. Based on the reliability estimates and comparison of the distribution of person ability to item difficulty, all three MBI-SS subscales were found to function adequately but not optimally. We detected a few important areas for improvement.

The Cynicism subscale exhibited two significant gaps and a mild floor effect along the latent continuum, indicating regions in which the amount of burnout a medical student experienced was not measured with adequate precision. This signals the need for improvement by adding items or response options to fill in the measurement gaps. Because gaps occurred between the hardest item of Threshold 2 (point demarking the difference between response

Figure 7. Results of Rasch Analysis, Wright-Andrich Map of Cynicism subscale with Person distribution on the left of the center line, and Item distribution on the right. Note: The Measure Scale (−6 to +5) is the logit scale resulting from the Rasch Analysis.

categories 1 and 2) and the easiest item of Threshold 3 (point demarking the difference between response categories 2 and 3), adding three to four items easier than Item 2 would most effectively close both gaps and eliminate the floor effect. A similar pattern, though somewhat milder, was observed with the EX subscale.

One strategy for improving measurement precision for all three subscales would be to expand the number of response categories from seven to nine, such as 1 (never), 2 (once or twice a year), 3 (three to five times a year), 4 (six to 11 times a year), 5 (once a month), 6 (a few times a month), 7 (once a week), 8 (a few times a week), and 9 (every day). Another strategy would be

Figure 8. Category probability curve for Item 1 of Cynicism.

to compare the MBI-SS to the original 22 items of the MBI to see if important items were lost when translating the measurement of burnout from work to school. Finally, the medical school context requires some modification of items originally written for an undergraduate college population. For instance, because students in medical school study independent of their attendance in class, EX4 becomes a double-barreled question (i.e., two questions in one). In our medical school, studying and attending class are competing endeavors, making this item confusing for the medical school student.

The reliability of the three subscales averaged around 0.8. This level of reliability is good for research purposes at the group level but would not be adequate should medical schools want to use the instrument to screen students for burnout and identify those in need of support. To make decisions at the individual level, one would need the overall reliability to exceed .90.⁴⁹ According to the Spearman-Brown prophecy formula, doubling the length of the survey would yield a reliability of about .89 (= 2 × .80/(1 + .80)).⁵³ This formula highlights the nonlinear relationship between test length and test reliability coefficient.⁵⁴ Hence, if medical schools are interested in adopting the instrument to identify students in danger of experiencing burnout, a possible solution would be to increase the number of items within subscales by a factor of 2 so that each subscale would contain 10 to 15 items, approximately the length of the original MBI. Future research is needed to improve the MBI-SS for use with preclinical medical students and to establish criteria for interpreting scores on the three MBI-SS measurement scales.

Conclusion

This study employed Rasch analysis to assess the performance of the MBI-SS among preclinical first-year medical students. This research provides additional evidence for the structural validity and reliability of the MBI-SS. The MBI-SS demonstrated satisfactory unidimensionality for each of the three subdomains. Regarding response option validity, we found that the 7-point rating scale functioned adequately; however, there is evidence to suggest that respondents may be able to effectively distinguish between 9 points and may need that many to improve measurement precision. Although the subscales are appropriately structured (according to the hypothetical construct), we found significant gaps in measurement along the latent scale, creating additional precision problems.
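The Spearman-Brown projection cited in the Discussion can be checked numerically. The Python sketch below is illustrative only: the function names are hypothetical, and the formula assumes that any added items are parallel to the existing ones (same quality and same construct), which real items may not satisfy.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when a scale is lengthened by length_factor."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)

def required_length_factor(reliability, target):
    """Length factor needed to move from reliability to target (formula inverted)."""
    return target * (1 - reliability) / (reliability * (1 - target))

# Starting reliability of .80, as reported for the subscale average.
doubled = spearman_brown(0.80, 2)                    # 1.6 / 1.8
factor_for_90 = required_length_factor(0.80, 0.90)   # 0.18 / 0.08

print(round(doubled, 3), round(factor_for_90, 2))    # prints 0.889 2.25
```

Under the parallel-items assumption, doubling the scale length projects a reliability of roughly .89, and a length factor of about 2.25 would be needed to reach .90 from a starting value of .80.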

Suggestions for addressing these measurement gaps include the addition of items. These efforts would improve precision but would not likely appreciably improve the scale reliability. Nevertheless, if medical school programs intend to use the instrument to identify preclinical students in need of support services, then this would be a necessary step to increase their person reliability above 0.90.

ORCID

Yang Shi http://orcid.org/0000-0001-9995-4370
P. Cristian Gugiu http://orcid.org/0000-0003-0022-287X
Remle P. Crowe http://orcid.org/0000-0001-9733-9294
David P. Way http://orcid.org/0000-0002-1896-3425

References

1. Bakker AB, Demerouti E, Sanz-Vergel AI. Burnout and work engagement: The JD–R approach. Annu Rev Organ Psychol Organ Behav. 2014;1(1):389–411.
2. Madsen IEH, Lange T, Borritz M, Rugulies R. Burnout as a risk factor for antidepressant treatment–a repeated measures time-to-event analysis of 2936 Danish human service workers. Journal of Psychiatric Research. 2015;65:47–52.
3. Peterson U, Demerouti E, Bergström G, Samuelsson M, Åsberg M, Nygren A. Burnout and physical and mental health among Swedish healthcare workers. J Adv Nurs. 2008;62(1):84–95.
4. Shanafelt TD, Sloan JA, Habermann TM. The well-being of physicians. Am J Med. 2003;114(6):513–519.
5. Sorour AS, El-Maksoud MMA. Relationship between musculoskeletal disorders, job demands, and burnout among emergency nurses. Advanced Emergency Nursing Journal. 2012;34(3):272–282.
6. Borritz M, Christensen KB, Bültmann U, et al. Impact of burnout and psychosocial work characteristics on future long-term sickness absence. Prospective results of the Danish PUMA Study among human service workers. J Occup Environ Med. 2010;52(10):964–970.
7. Maslach C, Jackson SE, Leiter MP. Maslach Burnout Inventory Manual. Palo Alto, California: Consulting Psychologists Press, 1996.
8. Freudenberger HJ. Staff Burn-Out. Journal of Social Issues. 1974;30(1):159–165.
9. Milicevic-Kalasic A. Burnout Examination. New York, NY: Springer Science & Business Media, 2012.
10. Kitaoka-Higashiguchi K, Nakagawa H, Morikawa Y, et al. Construct validity of the Maslach Burnout Inventory-General Survey. Stress and Health. 2004;20(5):255–260.
11. Demerouti E, Bakker AB, Nachreiner F, Schaufeli WB. The job demands-resources model of burnout. J Appl Psychol. 2001;86(3):499.
12. Schaufeli WB, Martinez IM, Pinto AM, Salanova M, Bakker AB. Burnout and engagement in university students: A cross-national study. Journal of Cross-Cultural Psychology. 2002;33(5):464–481.
13. Leiter MP, Durup J. The discriminant validity of burnout and depression: A confirmatory factor analytic study. Anxiety, Stress, and Coping. 1994;7(4):357–373.
14. Leone SS, Wessely S, Huibers MJ, Knottnerus JA, Kant I. Two sides of the same coin? On the history and phenomenology of chronic fatigue and burnout. Psychology and Health. 2011;26(4):449–464.
15. Schaufeli WB, Enzmann D, Girault N. Measurement of Burnout: A Review. Philadelphia, PA: Taylor & Francis, 1993.
16. Schaufeli W, Enzmann D. The Burnout Companion to Study and Practice: A Critical Analysis. London; Philadelphia, PA: Taylor & Francis, 1998.
17. Maslach C, Jackson SE, Leiter MP, Schaufeli W, Schwab RL. Maslach Burnout Inventory. 2017. Available at: http://www.mindgarden.com/117-maslach-burnout-inventory. Accessed January 4, 2018.
18. Wright BD, Masters GN. Rating Scale Analysis: Rasch Measurement. Chicago, IL: MESA Press; 1982:60–89.
19. Gold Y. Does teacher burnout begin with student teaching. Education 1985;105(3):254.
20. Balogun JA, Hoeberlein-Miller TM, Schneider E, Katz JS. Academic performance is not a viable determinant of physical therapy students’ burnout. Percept Mot Skills. 1996;83(1):21–22.
21. Worley JA, Vassar M, Wheeler DL, Barnes LL. Factor structure of scores from the Maslach Burnout Inventory: A review and meta-analysis of 45 exploratory and confirmatory factor-analytic studies. Educational and Psychological Measurement. 2008;68(5):797–823.
22. Maslach C, Jackson SE. The measurement of experienced burnout. J Organiz Behav. 1981;2(2):99–113.
23. Wheeler DL, Vassar M, Worley JA, Barnes LL. A reliability generalization meta-analysis of coefficient alpha for the Maslach Burnout Inventory. Educational and Psychological Measurement. 2011;71(1):231–244.
24. Zumbo BD, Gadermann AM, Zeisser C. Ordinal versions of coefficients alpha and theta for Likert rating scales. J Mod App Stat Meth. 2007;6(1):21.
25. Dembe AE, Lynch MS, Gugiu PC, Jackson RD. The translational research impact scale: development, construct validity, and reliability testing. Eval Health Prof. 2014;37(1):50–70.
26. Tennant A, Conaghan PG. The Rasch Measurement Model in Rheumatology: What Is It and Why Use It? When Should It Be Applied, and What Should One Look for in a Rasch Paper? Arthritis Rheum. 2007;57(8):1358–1362.
27. Bond TG, Fox CM. Applying the Rasch Model: Fundamental Measurement in the Human Sciences (3rd ed., pp. 112–39). New York, NY: Routledge, Taylor & Francis Group, 2015.
28. Hendriks J, Fyfe S, Styles I, Skinner SR, Merriman G. Scale construction utilizing the Rasch unidimensional measurement model: A measurement of adolescent attitudes towards abortion. Amj. 2012;5(5):251.
29. Engelhard G Jr. Invariant Measurement: Using Rasch Models in the Social, Behavioral, and Health Sciences. New York, NY: Routledge, Taylor & Francis Group, 2013.
30. Gugiu MR, Gugiu PC. Utilizing item analysis to improve the evaluation of student performance. Journal of Political Science Education. 2013;9(3):345–361.
31. Wright BD, Stone MH. Best Test Design: Rasch Measurement. Chicago, IL: Mesa Press, 1979.
32. Boone WJ, Staver JR, Yale MS. Rasch Analysis in the Human Sciences (pp. 217-The Netherlands: Springer, 2014.
33. Ewing MT, Salzberger T, Sinkovics RR. An alternate approach to assessing cross-cultural measurement equivalence in advertising research. Journal of Advertising. 2005;34(1):17–36.
34. Dyrbye LN, Massie FS, Eacker A, et al. Relationship between burnout and professional conduct and attitudes among US medical students. JAMA 2010;304(11):1173–1180.
35. Brazeau CM, Shanafelt T, Durning SJ, et al. Distress among matriculating medical students relative to the general population. Academic Medicine. 2014;89(11):1520–1525.
36. Hu Q, Schaufeli WB. The factorial validity of the Maslach Burnout Inventory-Student Survey in China. Psychol Rep. 2009;105(2):394–408.
37. Mostert K, Pienaar J, Gauche C, Jackson L. Burnout and engagement in university students: A psychometric analysis of the MBI-SS and UWES-S. South African Journal of Higher Education. 2007;21(1):147–162.
38. Yavuz G, Dogan N. Maslach burnout inventory-student survey (MBI-SS): a validity study. Procedia-Social and Behavioral Sciences. 2014;116:2453–2457.
39. Loera B, Molinengo G, Miniotti M, Leombruni P. Refining the Frommelt Attitude Toward the Care of the Dying Scale (FATCOD-B) for medical students: A confirmatory factor analysis and Rasch validation study. Pall Supp Care. 2018;16(01):50–59.
40. Molinengo G, Baiardini I, Braido F, Loera B. RhinAsthma patient perspective: A Rasch validation study. J Asthma. 2018;55(2):119–123.
41. Gugiu PC, Coryn C, Clark R, Kuehn A. Development and evaluation of the short version of the Patient Assessment of Chronic Illness Care instrument. Chronic Illn. 2009;5(4):268–276.
42. Hayton JC, Allen DG, Scarpello V. Factor Retention Decisions in Exploratory Factor Analysis: a Tutorial on Parallel Analysis. Organizational Research Methods. 2004;7(2):191–205.
43. Drasgow F. Polychoric and Polyserial Correlations. In Kotz S, Johnson NL (Eds.), Encyclopedia of Statistical Sciences (pp. 68–74). New York: John Wiley & Sons, Inc., 1986.
44. Jöreskog KG. On the estimation of polychoric correlations and their asymptotic covariance matrix. Psychometrika 1994;59(3):381–389.
45. Holgado-Tello FP, Chacon-Moscoso S, Barbero GI, Vila-Abad E. Polychoric versus Pearson correlations in exploratory and confirmatory factor analysis with ordinal variables. Qual Quant. 2010;44(1):153–166.
46. Andrich D. A rating formulation for ordered response categories. Psychometrika 1978;43(4):561–573.
47. Juttner M, Boone W, Park S, Neuhaus BJ. Development and use of a test instrument to measure biology teachers’ content knowledge (CK) and pedagogical content knowledge (PCK). Educ Asse Eval Acc. 2013;25(1):45–67.
48. Nunnally JC, Bernstein IH. Psychometric Theory (3rd ed., pp. 265). New York, NY: McGraw-Hill, 1994.
49. Gugiu C, Gugiu MR. Determining the Minimum Reliability Standard Based on a Decision Criterion. The Journal of Experimental Education. 2018;86(3):458–472.
50. Baghaei P. The Rasch Model as a Construct Validation Tool. Rasch Measurement Transactions 2008;22(1):1145–1146.
51. Linacre JM. Optimizing rating scale category effectiveness. J Appl Meas. 2002;3(1):85–106.
52. Linacre JM. A user’s guide to Winsteps: Program Manual 3.75.0. 2012.
53. Remmers HH, Karslake R, Gage N. Reliability of multiple-choice measuring instruments as a function of the Spearman-Brown prophecy formula, I. Journal of Educational Psychology. 1940;31(8):583.
54. Baumgartner TA. The applicability of the Spearman-Brown prophecy formula when applied to physical performance tests. Research Quarterly. American Association for Health, Physical Education and Recreation. 1968;39(4):847–856.

APPENDIX

Maslach burnout inventory–student survey

Exhaustion
1. I feel emotionally drained by my studies.
2. I feel used up at the end of a day at medical school.
3. I feel tired when I get up in the morning and I have to face another day at medical school.
4. Studying or attending a class is really a strain for me.
5. I feel burned out from my studies.

Cynicism
1. I have become less interested in my studies since my enrollment in medical school.
2. I have become less enthusiastic about my studies.
3. I have become more cynical about the potential usefulness of my studies.
4. I doubt the significance of my studies.

Professional efficacy
1. I can effectively solve the problems that arise in my studies. (R)
2. I believe that I make an effective contribution to the classes I attend. (R)
3. In my opinion, I am a good student. (R)
4. I feel stimulated when I achieve my study goals. (R)
5. I have learned many interesting things during the course of my studies. (R)

6. During class, I feel confident that I am effective in getting things done. (R)
Note: (R) items were reverse coded prior to analyses so that scores are consistent with the other two scales, wherein high values represent more burnout.

Related response set
1 = Never
2 = A few times a year or less
3 = Once a month or less
4 = A few times a month
5 = Once a week
6 = A few times a week
7 = Every day
Less Burnout (low values) to More Burnout (high values)
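
The reverse coding described in the Note can be sketched in a few lines of Python. This is a minimal illustration that assumes the 1 to 7 response set listed above; the function name and the sample ratings are hypothetical and are not data from the study.

```python
def reverse_code(rating, scale_min=1, scale_max=7):
    """Reverse a rating so that high values represent more burnout."""
    if not scale_min <= rating <= scale_max:
        raise ValueError(f"rating {rating} is outside {scale_min}-{scale_max}")
    return scale_min + scale_max - rating

# Hypothetical raw ratings on the six Professional Efficacy items.
raw_pe = [7, 6, 6, 5, 7, 4]
reversed_pe = [reverse_code(r) for r in raw_pe]
print(reversed_pe)  # prints [1, 2, 2, 3, 1, 4]
```

A student who reports feeling effective every day (raw rating 7) receives a reverse-coded rating of 1, placing that response at the low-burnout end of the scale alongside low ratings on the Exhaustion and Cynicism items.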
