
Journal of the Society for Social Work and Research October 2010

Volume 1, Issue 2, 66–82 ISSN 1948-822X DOI:10.5243/jsswr.2010.6

An Introduction to Using Multidimensional Item Response Theory to Assess Latent Factor Structures
Philip Osteen
University of Maryland

This study provides an introduction to the use of multidimensional item response theory (MIRT)
analysis for assessing latent factor structure, and compares this statistical technique to
confirmatory factor analysis (CFA) in the evaluation of an original measure developed to assess
students’ motivations for entering a social work community of practice. The Participation in a
Social Work Community of Practice Scale (PSWCoP) was administered to 506 master's of social
work students from 11 accredited graduate programs. The psychometric properties and latent
factor structure of the scale are evaluated using MIRT and CFA techniques. Although the PSWCoP was designed as
a 3-factor measure, analyses of model fit using both CFA and MIRT do not support this solution.
Instead, analyses using both methods produce convergent results supporting a 4-factor solution.
Discussion includes methodological implications for social work research, focusing on the
extension of MIRT analysis to assessment of measurement invariance in differential item
functioning, differential test functioning, and differential factor functioning.

Keywords: item response theory, factor analysis, psychometrics

In comparison to classical test theory (CTT), item response theory (IRT) is considered the standard, if not the preferred, method for conducting psychometric evaluations of new and established measures (Embretson & Reise, 2000; Fries, Bruce, & Cella, 2005; Lord, 1980; Ware, Bjorner, & Kosinski, 2000). Dubbed the "modern test theory," IRT is used across scientific disciplines, including psychology, education, nursing, and public health. Considered a superior method because of IRT's ability to overcome inherent limitations of CTT, IRT provides researchers with an array of statistical tools for assessing measure characteristics. Unfortunately, there is a resounding paucity of published research in social work using IRT. A review of measurement-based articles appearing in journals specific to the social work field published between 2000 and 2006 showed that fewer than 5% of studies used IRT analysis to evaluate the psychometric properties of new and existing measures (Unick & Stone, 2010). Unick and Stone hypothesized several reasons for the absence of IRT analyses from social work journals, one of which was a lack of familiarity with key conceptual and practical components of IRT.

Regardless of the reasons underlying the absence of IRT-based analyses in the social work literature, the field of social work will benefit from researchers becoming more familiar with IRT methods and incorporating these analyses into social work-based measurement studies. Historically regarded as a method for evaluating latent skill and ability traits in education, the application of IRT to measures of affective latent traits is becoming more common and accepted. As outlined in this article, drawing on the strengths of IRT as an alternative to, or ideally in conjunction with, CTT analyses supports social work researchers' development of rigorously substantiated measures. This article provides social work researchers with a basic overview of IRT and a demonstration of the utility of IRT as compared with CTT-based factor analysis, using actual data obtained with the implementation of a novel measure of the professional motivations of master's of social work (MSW) students. Published studies comparing IRT and confirmatory factor analysis (CFA) have focused almost exclusively on assessing measurement invariance. This study takes a different approach in comparing IRT and CTT by applying these theories to the assessment of multidimensional latent factor structures.

Author note: Philip J. Osteen is an assistant professor in the University of Maryland School of Social Work. All correspondence concerning this article should be directed to posteen@ssw.umaryland.edu

IRT

IRT is based on the premise that only two elements are responsible for a person's response on any given item: the person's ability and the characteristics of the item (Bond & Fox, 2001). The most common IRT model, called the Rasch or one-parameter logistic model, assumes the probability of a given response is a function of the person's ability and the difficulty of the item (Bond & Fox, 2001). More complex IRT models estimate the probability of a given response based on additional item characteristics such as discrimination and guessing (Bond & Fox, 2001).


Derived from its early use in educational measurement, the term ability may seem mismatched to psychosocial constructs; thus, the term latent trait may be more intuitive, and references to level of ability are synonymous with level of the latent trait. The IRT model produces estimates for both of these elements by calculating item-difficulty parameters on the basis of the total number of persons who correctly answered an item, and person-trait parameters on the basis of the total number of items successfully answered (Bond & Fox, 2001). The assumptions underlying these estimates are (a) that a person with more of the trait will always have a greater likelihood of success than a person with less of the trait, and (b) that any person will have a greater likelihood of endorsing items requiring less of the trait than items requiring more of the trait (Müller, Sokol, & Overton, 1999). Samejima (1969) and Andrich (1978) extended this model to measures with polytomous response formats (i.e., Likert scales) by adding an estimate to account for the difficulty in crossing the threshold from one level of response to the next (e.g., moving from agree to strongly agree).

Scale Evaluation Using IRT

The basic unit of IRT is the item response function (IRF) or item characteristic curve. The relationship between a respondent's performance and the characteristics underlying item performance can be described by a monotonically increasing function called the item characteristic curve (ICC; Henard, 2000). The ICC is typically a sigmoid curve estimating the probability of a given response based on a person's level of latent trait. The shape of the ICC is determined by the item characteristics estimated in the model. The ICC in a three-parameter IRT model is derived using the formula

P(θ) = c + (1 − c) · e^[a(θ − b − f)] / (1 + e^[a(θ − b − f)])

where P, the probability of a response given a person's level of the latent trait—denoted by theta (θ)—is a function of guessing (the c parameter), item discrimination (the a parameter), item difficulty (the b parameter), and the category threshold (f) if using a polytomous response format.

For the one-parameter IRT model, the guessing parameter, c, is constrained to zero, assuming little or no impact of guessing. For example, a person cannot guess the correct response to an item using a Likert scale because items are not scored as right or wrong. The item discrimination parameter, a, is set to 1 under the assumption that there is equal discrimination across items. In a one-parameter model the probability of a response is determined only by the person's level of the latent trait and the difficulty of the item. Item difficulty is an indication of the level of the underlying trait that is needed to endorse or respond in a certain way to the item. For items on a rating scale, the IRF is a mathematical function describing the relation between where an individual falls on the continuum of a given construct, such as motivation, and the probability that he or she will give a particular response to a scale item designed to measure that construct (Reise, Ainsworth, & Haviland, 2005). The basic goal of IRT modeling is to create a sample-free measure.

Multidimensional item response theory, or MIRT, is an extension of IRT and is used to explore the underlying dimensionality of an IRT model. Advances in computer software (e.g., Conquest, MULTILOG, and Mplus) allow for testing and evaluation of more complex multidimensional item response models and enable researchers to statistically compare competing dimensional models. ACER Conquest 2.0 (Wu, Adams, & Wilson, 2008), the software used in this study, produces marginal maximum likelihood estimates for the parameters of the models. The fit of the models is ascertained by generalizations of the Wright and Masters (1982) residual-based methods. Alternative dimensional models are evaluated using a likelihood ratio chi-squared statistic (χ²LR; Barnes, Chard, Wolfe, Stassen, & Williams, 2007).

Core statistical output of an IRT analysis of a one-parameter rating scale model includes estimates of person latent trait, item difficulty, model fit, person-fit, item-fit, person reliability, item reliability, and step calibration. A two-parameter model would include estimates for item discrimination, and a three-parameter model would include an additional estimate for guessing. Person latent trait is an estimate of the underlying trait present for each respondent. Persons with high person-ability scores possess more of the underlying trait than persons with low scores. Item difficulty is an estimate of the level of underlying trait at which a person has a 50% probability of endorsing the item. Items with higher item-difficulty scores require a respondent to have more of the underlying trait to endorse or correctly respond to the item than items with lower item-difficulty scores. Consider a measure of reading comprehension: an item requiring a 12th grade reading level is more difficult than an item requiring a 6th grade reading level. The same concept applies to a measure of motivation; an item requiring a high amount of motivation is more "difficult" than an item requiring a low amount of motivation. This idea translates to the concept of person-ability or latent trait. A person who reads at a 12th grade level has more ability than a person who reads at a 6th grade level; a person who is more motivated has more of the latent trait than a person who is less motivated.
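To make the item response function concrete, the following minimal Python sketch (illustrative only; it is not drawn from the software used in this study, and the function name and example values are hypothetical) evaluates the three-parameter formula given above and shows that, with guessing fixed at zero and discrimination fixed at 1, an item's difficulty b is the trait level at which the probability of endorsement is .50.

```python
import math

def irf(theta, b, a=1.0, c=0.0, f=0.0):
    """Three-parameter item response function with an optional category threshold f:
    P(theta) = c + (1 - c) * e^(a(theta - b - f)) / (1 + e^(a(theta - b - f)))."""
    z = a * (theta - b - f)
    return c + (1.0 - c) / (1.0 + math.exp(-z))

# One-parameter (Rasch) case: c = 0 and a = 1. A respondent located exactly at
# the item's difficulty endorses the item with probability .50 ...
print(irf(theta=0.3, b=0.3))               # 0.5
# ... and a respondent well above the item's difficulty endorses it with high probability.
print(round(irf(theta=2.0, b=0.3), 3))     # ~0.846
```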

Analysis of item fit. Fit statistics in IRT analysis include infit and outfit mean square (MNSQ) statistics. Infit and outfit are statistical representations of how well the data match the prescriptions of the IRT model (Bond & Fox, 2001). Outfit statistics are based on the conventional sum of squared standardized residuals, and infit statistics are based on the information-weighted sum of squared standardized residuals (Bond & Fox, 2001). Infit and outfit have expected MNSQ values of 1.00; values greater than or less than 1 indicate the degree of variation from the expected score. For example, an item with an infit MNSQ of 1.33 (1.00 + .33) indicates 33% more variation in responses to that item than was predicted by the model. Mean infit and outfit values represent the degree of overall fit of the data to the model, but infit and outfit statistics are also available for assessing fit at the individual item level (item-fit) and the individual person level (person-fit). Item-fit refers to how well the IRT model explains the responses to a particular item (Embretson & Reise, 2000). Person-fit refers to the consistency of an individual's pattern of responses across items (Embretson & Reise, 2000).
One limitation of IRT is the need for large samples. No clear standards exist for minimum sample size, although Embretson and Reise (2000) briefly noted that a sample of 500 respondents was recommended, and cautioned that parameter estimations might become unstable with samples of less than 350 respondents. Reeve and Fayers (2005) suggested that useful information about item characteristics could be obtained with samples of as few as 250 respondents. One-parameter models may yield reliable estimates with as few as 50 to 100 respondents (Linacre, 1994). As the complexity of the IRT model increases and more parameters are estimated, sample size should increase accordingly. Smith, Schumacker, and Bush (1998) provided the following sample-size-dependent cutoffs for determining poor fit: misfit is evident when MNSQ infit or outfit values are larger than 1.3 for samples of less than 500, 1.2 for samples between 500 and 1,000, and 1.1 for samples larger than 1,000 respondents. According to Adams and Khoo (1996), items with adequate fit will have weighted MNSQs between .75 and 1.33. Bond and Fox (2001) stated that items routinely accepted as having adequate fit will have t-values between -2 and +2. According to Wilson (2005), when working with large sample sizes, the researcher can expect the t-statistic to show significant values for several items regardless of fit; therefore, Wilson suggested that the researcher consider items problematic only if items are identified as misfitting based on both the weighted MNSQ and the t-statistic.

For rating scale models, category thresholds are provided in the IRT analysis. A category threshold is the point at which the probability of endorsing one category is equal to the probability of endorsing a corresponding category one step away. Although thresholds are ideally equidistant, that characteristic is not necessarily the reality. Guidelines indicate that thresholds should be at least 1.4 logits apart but no more than 5 logits apart (Linacre, 1999). Logits are the scale units of the log odds transformation. When thresholds have small logits, response categories may be too similar and nondiscriminant. Conversely, when the threshold logit is large, response categories may be too dissimilar and far apart, indicating the need for more response options as intermediate points. Infit and outfit statistics are also available for step calibrations. Outfit MNSQ values greater than 2.0 indicate that a particular response category is introducing "noise" into the measurement process and should be evaluated as a candidate for collapsing with an adjacent category (Bond & Fox, 2001; Linacre, 1999).
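To show how category thresholds operate in a rating scale model (a minimal sketch of a Rasch-type rating scale formulation; it is not output from the software used in this study, and the threshold values are hypothetical), the function below converts an item difficulty and a set of step thresholds into category response probabilities and demonstrates that a respondent located exactly at a threshold is equally likely to choose the two adjacent categories.

```python
import numpy as np

def category_probabilities(theta, b, taus):
    """Rasch rating scale model: probabilities of responding in categories
    0..m for an item of difficulty b with step thresholds taus (tau_1..tau_m).
    tau_0 is set to 0; the resulting constant shift of every category logit
    leaves the normalized probabilities unchanged."""
    steps = np.concatenate(([0.0], np.asarray(taus, dtype=float)))
    logits = np.cumsum(theta - b - steps)    # cumulative adjacent-category logits
    expo = np.exp(logits - logits.max())     # subtract the max for numerical stability
    return expo / expo.sum()

# A respondent located exactly at the second threshold (theta = b + tau_2)
# is equally likely to respond in category 1 or category 2.
p = category_probabilities(theta=0.5 + 0.8, b=0.5, taus=[-0.8, 0.8, 1.6])
print(np.round(p, 3))
```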
In conjunction with the standard output of IRT analysis, MIRT analysis provides information about dimensionality, the underlying latent factor structure. ACER Conquest 2.0 (Wu et al., 2008) software provides estimations of population parameters for the multidimensional model, which include factor means, factor variances, and factor covariances/correlations. ACER Conquest 2.0 also produces maps of latent variable distributions and response model parameter estimates.

Analysis of nested models. Two models are considered to be nested if one is a subset of the second. Overall model fit of an IRT model is based on the deviance statistic, which follows a chi-square distribution. The deviance statistic changes as parameters are added to or deleted from the model, and changes in fit between nested models can be statistically tested. The chi-square difference statistic (χ²D) can be used to test the statistical significance of the change in model fit (Kline, 2005). The χ²D is calculated as the difference between the model chi-square (χ²M) values of two nested models using the same data; the df for the χ²D statistic is the difference in dfs for the two nested models. The χ²D statistic tests the null hypothesis of identical fit of the two models to the population. Failure to reject the null hypothesis means that the two models fit the population equally. When two nested models fit the population equally well, the more parsimonious model is generally considered the more favorable.
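A brief sketch of this comparison in Python (the scipy call is standard, but the model values shown are hypothetical): the difference between the chi-square (or deviance) values of two nested models is referred to a chi-square distribution whose df is the difference between the models' dfs.

```python
from scipy.stats import chi2

def chi_square_difference(chi2_restricted, df_restricted, chi2_full, df_full):
    """Test the null hypothesis of identical fit for two nested models."""
    diff = chi2_restricted - chi2_full
    df_diff = df_restricted - df_full
    p_value = chi2.sf(diff, df_diff)   # upper-tail probability of the difference
    return diff, df_diff, p_value

# Hypothetical values: the more constrained model fits noticeably worse.
print(chi_square_difference(chi2_restricted=120.0, df_restricted=40,
                            chi2_full=95.0, df_full=35))
```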
Scale Evaluation Using CFA

Factor analysis is a more traditional method for analyzing the underlying dimensionality of a set of observed variables. Derived from CTT, factor analysis includes a variety of statistical procedures for exploring the relationships among a set of observed variables with the intent of identifying a smaller number of factors, the unobserved latent variables, thought to be responsible for these relationships among the observed variables (Tabachnik & Fidell, 2007). CFA is used primarily as a means of testing hypotheses about the latent structure underlying a set of observed data.


A common and preferred method for conducting CFA is structural equation modeling (SEM). The term SEM refers to a family of statistical procedures for assessing the degree of fit between observed data and an a priori hypothetical model in which the researcher specifies the relevant variables, which variables affect other variables, and the direction of those effects. The two main goals of SEM analysis are to explore patterns of correlations among a set of variables, both observed and unobserved, and to explain as much variance as possible using the model specified by the researcher (Klem, 2000; Kline, 2005).

Analysis of SEM models. Analysis of SEM models is based on the fit of the observed variance-covariance matrix to the proposed model. Although maximum likelihood (ML) estimation is the common method for deriving parameter estimates, it is not the only estimation method available. ML estimation produces parameter estimates that minimize the discrepancies between the observed covariances in the data and those predicted by the specified SEM model (Kline, 2005). Parameters are characteristics of the population of interest; without making observations of the entire population, parameters cannot be known and must be estimated from sample statistics. ML estimation assumes interval level data, and alternative methods, such as weighted least squares estimation, should be used with dichotomous and ordinal level data. Guo, Perron, and Gillespie (2009) noted in their review of social work SEM publications that ML estimation was sometimes used and reported inappropriately.

Analysis of model fit. Kline (2005) defined model fit as how well the model as a whole explains the data. When a model is overidentified, it is expected that model fit will not be perfect; it is therefore necessary to determine the actual degree of model fit and whether the model fit is statistically acceptable. Ideally, indicators should load only on the specific latent variable identified in the measurement model. This type of model can be tested by constraining the direct effects between indicators and other factors to zero. According to Kline (2005), "indicators are expected to be correlated with all factors in CFA models, but they should have higher estimated correlations with the factors they are believed to measure" (emphasis in original, p. 177). A measurement model with indicators loading only on a single factor is desirable but elusive in practice with real data. Statistical comparison of models with cross-loadings to models without cross-loadings allows the researcher to make stronger assertions about the underlying latent variable structure of a measure. As Guo et al. (2009) noted, modified models allowing cross-loadings between items and factors have been frequently published in the social work literature without fully explaining how they related to models without cross-loadings.

Analysis of nested models. As noted in the discussion of MIRT analysis, two models are considered to be nested if one is a subset of the second. Overall model fit based on the chi-square distribution will change as paths are added to or deleted from a model. Kline's (2005) chi-square difference statistic (χ²D) can be used to test the statistical significance of the change in model fit.

MIRT versus CFA

MIRT and CFA analyses can be used to assess the dimensionality or underlying latent variable structure of a measure. The choice of statistical procedures raises questions about differences between analyses, whether the results of the two analyses are consistent, and what information can be obtained from one analysis but not the other. IRT addresses two problems inherent in CTT. First, IRT overcomes the problem of item-person confounding found in CTT. IRT analysis yields estimates of item difficulties and person-abilities that are independent of each other, whereas in CTT item difficulty is assessed as a function of the abilities of the sample, and the abilities of respondents are assessed as a function of item difficulty (Bond & Fox, 2001), a limitation that extends to CFA.

Second, the use of ordinal level data (i.e., rating scales), which are routinely treated in statistical analyses as continuous, interval-level data, may violate the scale and distributional assumptions of CFA (Wirth & Edwards, 2007). Violating these assumptions may result in model parameters that are biased and "impossible" to interpret (Wirth & Edwards, 2007, p. 58; DiStefano, 2002). The logarithmic transformation of ordinal level raw data into interval level data in IRT analysis overcomes this problem.

IRT and CTT also differ in the treatment of the standard error of measurement. The standard error of measurement is an indication of variability in scores due to error. Under CTT, the standard error of measurement is averaged across persons in the sample or population and is specific to that sample or population. Under IRT, the standard error of measurement is considered to vary across scores in the same population and to be population-general (Embretson & Reise, 2000). The IRT approach to the standard error of measurement offers the following benefits: (a) the precision of measurement can be evaluated at any level of the latent trait instead of averaged over trait levels as in CTT, and (b) the contribution of each item to the overall precision of the measure can be assessed and used in item selection (Hambleton & Swaminathan, 1985).


MIRT and CFA differ in the estimation of item fit. Where item fit is assessed through error variances, communalities, and factor loadings in CFA, item fit is assessed through unweighted (outfit) and weighted (infit) mean square errors in IRT analyses (Bond & Fox, 2001). Further, the treatment of the relationship between indicator and latent variable, which is constrained to a linear relationship in CFA, can be nonlinear in IRT (Greguras, 2005). CFA uses one number, the factor loading, to represent the relationship between the indicator and the latent variable across all levels of the latent variable; in IRT, the relationship between indicator and latent variable is given across the range of possible values for the latent variable (Greguras, 2005). Potential implications of these differences include inconsistencies in parameter estimates, indicator and factor structure, and model fit across MIRT and CFA analyses.

Both IRT and CFA provide statistical indicators of psychometric performance not available in the other analysis. Using the item information curve (IIC), IRT analysis allows the researcher to establish both item information functions (IIF) and test information functions (TIF). The IIF estimates the precision and reliability of individual items independent of other items on the measure; the TIF provides the same information for the total test or measure, which is a useful tool in comparing and equating multiple tests (Hambleton et al., 1991; Embretson & Reise, 2000). IRT for polytomous response formats also provides estimated category thresholds for the probability of endorsing a given response category as a function of the level of underlying trait. These indices of item and test performance and category thresholds are not available in CFA, in which item and test performance are conditional on the other items on the measure. Conversely, CFA offers a wide range of indices for evaluating model fit, whereas IRT is limited to the use of the χ² deviance statistic. Reise, Widaman, and Pugh (1993) explicitly identified the need for modification indices and additional model fit indicators for IRT analyses as a limitation.
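The following sketch illustrates these ideas under a two-parameter logistic model (an illustration only; it does not reproduce the output of the software used in this study, and the item parameters are hypothetical). Item information functions are summed into a test information function, and test information is converted into the trait-level-specific standard error of measurement discussed above.

```python
import numpy as np

def item_information(theta, a, b):
    """2PL item information: I(theta) = a^2 * P(theta) * (1 - P(theta))."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a ** 2 * p * (1 - p)

theta = np.linspace(-3, 3, 7)                     # grid of trait levels
items = [(1.0, -0.5), (1.2, 0.3), (0.8, 0.9)]     # hypothetical (discrimination, difficulty) pairs
tif = sum(item_information(theta, a, b) for a, b in items)   # test information function
sem = 1.0 / np.sqrt(tif)                          # standard error of measurement at each theta
for t, info, se in zip(theta, tif, sem):
    print(f"theta = {t:+.1f}   information = {info:.2f}   SE = {se:.2f}")
```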
Participation in a Social Work Community of Practice Scale

Although the content of the Participation in a Social Work Community of Practice Scale (PSWCoP) is less important in the current discussion than the methodologies used to evaluate the scale, a brief overview will provide context for interpreting the results of the analyses. The PSWCoP scale is an assessment of students' motivations for entering a master's of social work (MSW) program as conceptualized in Wenger, McDermott, and Snyder's (2002) three-dimensional model of motivation for participation in a community of practice. Wenger et al. (2002) asserted that all communities of practice are comprised of three fundamental elements (p. 27): a domain of knowledge defining a set of issues; a community of people who care about the domain; and the shared practice developed to be effective in that domain. Some individuals are motivated to participate because they care about the domain and are interested in its development. Some individuals are motivated to participate because they value being part of a community as well as the interaction and sharing with others that is part of having a community. Finally, some individuals are motivated to participate by a desire to learn about the practice as a means of improving their own techniques and approaches. The PSWCoP was developed as a multidimensional measure of the latent constructs domain motivation, community motivation, and practice motivation (Table 1). Data were collected from a convenience sample of students enrolled in MSW programs using a cross-sectional survey design and compared to the three-factor model developed from Wenger et al.

Method

Participants

A convenience sample of 528 current MSW students was drawn from 11 social work programs accredited by the Council on Social Work Education (CSWE). Participants were enrolled during two separate recruitment periods. The first round of recruitment yielded a nonrandom sample of 268 students drawn from nine academic institutions. The second round of recruitment yielded a nonrandom sample of 260 students drawn from eight institutions. Six institutions participated in both rounds of data collection, three institutions participated in only the first round of data collection, and two institutions participated in only the second round of data collection. The response rate for the study could not be calculated because there was no way to determine the total number of students who received information about the study or had access to the online survey. Twenty-two cases (4.1%) were removed because of missing data, yielding a final sample of 506 students; listwise deletion was used given the extremely small amount of missing data.

Data were collected on multiple student characteristics including age, gender, race/ethnicity, sexual orientation, religious affiliation, participation in religious activities, family socioeconomic status (SES), and enrollment status. The mean age of participants was 30.2 years (SD = 8.7 years). The majority of students were female (92%). The majority of participants were Caucasian (82.6%), with 7.3% of students self-identifying as African American or Black; 4.1% as Hispanic; 1.8% as Asian/Pacific Islander; and 4.1% as a nonspecified race/ethnicity. Students identified their enrollment status as part-time (19.5%), first year (32.7%), advanced standing (27%), or second year (20.8%).


Table 1
Original Items on the Participation in a Social Work Community of Practice Scale

Community subscale
C_1: My main interest for entering the MSW program was to be a part of a community of social workers.
C_2: I wanted to attend a MSW program so that I could be around people with similar values to me.
C_3: I chose a MSW program because I thought social work values were more similar to my values than those of other professions.
C_4*: There is more diversity of values among students than I expected.
C_5*: Before entering the program, I was worried about whether or not I would fit in with my peers.
C_6*: Learning about the social work profession is less important to me than being part of a community of social workers.

Practice subscale
P_1: Without a MSW degree, I am not qualified to be a social worker.
P_2: A MSW degree is necessary to be a good social worker.
P_3: Learning new social work skills was not a motivating factor in my decision to enter the MSW program.
P_4: My main reason for entering the MSW program was to acquire knowledge and/or skills.
P_5*: A MSW degree will give me more professional opportunities than other professional degrees.
P_6*: Being around students with similar goals is less important to me than developing my skills as a social worker.
P_7*: Learning how to be a social worker is more important to me than learning about the social work profession.

Domain subscale
D_1: I find social work appealing because it is different than the type of work I have done in the past.
D_2: I decided to enroll in a MSW program to see if social work is a good fit for me.
D_3: I wanted to attend a MSW program so that I could learn about the social work profession.
D_4: Entering the MSW program allowed me to explore a new area of professional interest.
D_5: My main reason for entering the MSW program was to decide if social work is the right profession for me.

*Items deleted from the final version of the PSWCoP

Measures

Analyses were conducted on an original measure of students' motivations for entering a social work community of practice, defined as pursuing a MSW degree. The PSWCoP was developed and evaluated using steps outlined by Benson and Clark (1982) and DeVellis (2003). The pilot measure contained 18 items designed to measure three constructs (domain, community, and practice). Items were measured on a 6-point rating scale from strongly disagree to strongly agree. Items from the pilot measure, organized by subscale, are listed in Table 1. In addition to items on the PSWCoP, students were asked to provide demographic information.

Procedures

Participants completed the PSWCoP survey as part of a larger study exploring the relationship between students' motivations to pursue the MSW degree, their attitudes about diversity and historically marginalized groups, and their endorsement of professional social work values as identified in the National Association of Social Workers (2009) Code of Ethics. This research was approved by the University of Denver Institutional Review Board prior to recruitment and data collection. Recruitment consisted of a two-pronged approach: (a) an e-mail providing an overview of the study and a link to the online survey was sent to students currently enrolled in the MSW program, and (b) an announcement providing an overview of the study and a link to the online survey was posted to student-oriented informational Web sites.


Interested participants were able to access the anonymous online survey through www.surveymonkey.com, which is a frequently used online survey provider. Participants were presented with a project information sheet and were required to indicate their consent to participate by clicking on the appropriate response before being allowed to access the actual survey.

Results

Reliability of scores from the PSWCoP was assessed using both CTT and IRT methods. SPSS (v.16.0.0, 2007) was used to calculate internal consistency reliability (Cronbach's α; inter-item correlations). ACER Conquest 2.0 (Wu et al., 2008) was used to assess item reliability. The dimensionality and factor structure of the PSWCoP were evaluated using both a MIRT and a CFA approach. ACER Conquest 2.0 (Wu et al., 2008) was used to conduct the MIRT analysis, and Lisrel 8.8 (Jöreskog & Sörbom, 2007) was used to conduct the CFA analysis. ACER Conquest 2.0 was used to evaluate the PSWCoP with respect to estimates of levels of latent trait and item difficulty using a one-parameter logistic model. Assessment of the measure was based on model fit, person-fit, item-fit, person reliability, item reliability, step calibration, and population parameters for the multidimensional model.

Item Selection

Items were identified for possible deletion from each subscale using Cronbach's alpha, IRT MNSQ infit/outfit results, and theory. Poorly performing items identified through statistical analyses were further assessed using conceptual and theoretical frameworks. A combination of results led to the removal of three items from the community subscale and three items from the practice subscale, but no items from the domain subscale (Table 1). Items C_6, P_6, and P_7 addressed relationships between types of motivations by asking respondents to rate whether one type of motivation was more important than another type. Quantitative differences between types of motivations were not addressed in community of practice theory, and therefore these items were deemed not applicable in the measurement of each type of motivation. Items C_4 and C_5 were deleted from the community subscale because these items specifically addressed relationships between respondents and peers. Community-based motivation arises out of perceived value congruence between the individual and the practice (i.e., professional social work), and not between the individual and other members of the community of practice. All analyses indicated problems with the practice subscale, and ultimately EFA was used with this subscale only. The results of the EFA suggested items P_1 and P_2 formed one factor, and items P_3 and P_4 constituted a second factor. Item P_5 did not load on either factor and was deleted.

The results of the item selection process yielded two competing models. The first model consisted of three factors in which all items developed for the practice subscale were kept together; this model most closely reflected the original hypothetical model developed based on community of practice theory. The second model had four factors, with the items from the hypothesized practice subscale split into the two factors suggested by the EFA. Internal consistency for each of the subscales on the final version of the PSWCoP was assessed using Cronbach's alpha. Cronbach's alpha was 0.64 for scores from the domain subscale, 0.68 for scores from the community subscale, and 0.47 for scores from the practice subscale (three-factor model). Splitting the practice subscale into two factors yielded a Cronbach's alpha of 0.58 for scores from the skills subscale and 0.68 for scores from the competency subscale. Although ultimately indicative of a poor measure, low internal consistency did not prohibit the application and comparison of factor analysis using CFA and MIRT.
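For readers who want to reproduce this kind of internal consistency estimate outside of SPSS, a minimal Python sketch of Cronbach's alpha is shown below (the response data are hypothetical; the formula is the standard ratio of summed item variances to total-score variance).

```python
import numpy as np

def cronbach_alpha(item_scores):
    """item_scores: respondents x items array of rating-scale responses.
    alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores)"""
    item_scores = np.asarray(item_scores, dtype=float)
    k = item_scores.shape[1]
    item_variances = item_scores.var(axis=0, ddof=1)
    total_variance = item_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical responses from five students to a three-item subscale (1-6 rating scale).
subscale = [[4, 5, 4], [2, 3, 2], [5, 5, 6], [3, 4, 3], [6, 5, 5]]
print(round(cronbach_alpha(subscale), 2))
```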
Factor Structure

CFA. CFA analyses of the PSWCoP were conducted using Lisrel 8.8 (Jöreskog & Sörbom, 2007). The data collected using the PSWCoP were considered ordinal based on the 6-point rating scale. When data are considered ordinal, Jöreskog and Sörbom (2007) advocated the use of PRELIS to calculate asymptotic covariances and polychoric correlations of all items modeled, and LISREL or SIMPLIS with weighted least squares estimation to test the structure of the data. Failure to use these guidelines may result in underestimated parameters, biased standard errors, and an inflated chi-square (χ²) model fit statistic (Flora & Curran, 2004). The chi-square difference statistic (χ²D) was used to test the statistical significance of the change in model fit between nested models (Kline, 2005). The χ²D was calculated as the difference between the model chi-square (χ²M) values of nested models using the same data; the df for the χ²D statistic is the difference in dfs for the nested models. The χ²D statistic tested the null hypothesis of identical fit of two models to the population. In all, three nested models were evaluated and compared sequentially: a four-factor model with cross-loadings served as the baseline model, followed by a four-factor model without cross-loadings, and a three-factor model without cross-loadings.


The four-factor model with cross-loadings was chosen as the baseline model because it was presumed to demonstrate the best fit, having the fewest degrees of freedom. The primary models of interest were then compared against this baseline to estimate the change in model fit.

Sun (2005) recommended considering fit indices in four categories: sample-based absolute fit indices, sample-based relative fit indices, population-based absolute indices, and population-based relative fit indices. Sample-based fit indices are indicators of observed discrepancies between the reproduced covariance matrix and the sample covariance matrix. Population-based fit indices are estimations of the difference between the reproduced covariance matrix and the unknown population covariance matrix. At a minimum, Kline (2005) recommended interpreting and reporting four indices: the model chi-square (sample-based), the Steiger-Lind root mean square error of approximation (RMSEA; population-based), the Bentler comparative fit index (CFI; population-based), and the standardized root mean square residual (SRMR; sample-based). In addition to these fit indices, this study examined the Akaike information criterion (AIC; sample-based) and the goodness-of-fit index (GFI; sample-based). According to Jackson, Gillaspy, and Purc-Stephenson (2009), a review of CFA journal articles published over the past decade identified these six fit indices as the most commonly reported.

The range of values indicating good fit of observed data to the measurement model varies depending on the specific fit index. The model chi-square statistic tests the null hypothesis that the model has perfect fit in the population. Degrees of freedom for the chi-square statistic equal the number of observations minus the number of parameters to be estimated. Given its sensitivity to sample size, the chi-square test is often statistically significant. Kline (2005) suggested using a normed chi-square statistic obtained by dividing chi-square by df; ideally, these values should be less than three. The SRMR is a measure of the differences between observed and predicted correlations; in a model with good fit, these residuals will be close to zero. Hu and Bentler (1999) suggested that a SRMR < 0.08 indicates good model fit. The AIC is an indicator of comparative fit across nested models with an adjustment for model complexity. The AIC is not an indicator of fit for a specific model; instead, the model with the lowest AIC from among the set of nested models is considered to have the best fit. The GFI is an assessment of incremental change in fit with an adjustment for model complexity; values greater than 0.90 indicate good fit. The RMSEA fit index is a measure of the lack of fit of the researcher's model to the population covariance matrix and tests the null hypothesis that the researcher's model has close approximate fit in the population. According to Kline (2005), good models have an RMSEA less than 0.05 and models with RMSEA greater than 0.10 have poor fit, while Browne and Cudeck (1993) suggested that an RMSEA less than 0.08 represents acceptable fit. The CFI assesses the improvement in fit of the researcher's model over a baseline model that assumes zero covariances among observed variables; values greater than 0.90 represent an acceptable model fit.
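Two of these indices can be computed directly from the model chi-square, as in the short sketch below (an approximation; the exact values reported by LISREL depend on the estimation method). The normed chi-square is χ²/df, and a common point estimate of the RMSEA is the square root of (χ² − df)/(df(N − 1)), floored at zero.

```python
import math

def normed_chi_square(chi_sq, df):
    return chi_sq / df

def rmsea(chi_sq, df, n):
    """Point estimate of the root mean square error of approximation."""
    return math.sqrt(max(chi_sq - df, 0.0) / (df * (n - 1)))

# Applied to the baseline model reported below (chi-square = 64.48, df = 35, N = 506),
# these functions approximately reproduce the normed chi-square of 1.84 and RMSEA of .04.
print(round(normed_chi_square(64.48, 35), 2), round(rmsea(64.48, 35, 506), 3))
```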
Four-Factor Model with Cross-Loadings. A baseline CFA model was constructed using the four latent variables domain, community, competency, and skills, and items were allowed to cross-load on factors based on modification indices of the LISREL output. Based on the six fit indices previously described, the overall fit of the model was good: χ² = 64.48, df = 35, p = .00175; RMSEA = 0.04, 90% CI [.03, .05]; CFI = 0.98; SRMR = 0.043; AIC = 150.48; GFI = 0.97. Note that this solution was mathematically derived, and as such, there was no conceptual justification for the cross-loadings of these multiple items. This model served as the baseline against which competing models were compared.

Four-Factor Model without Cross-Loadings. The next model to be evaluated used the same four factors as the previous model, but items were constrained to load on specific factors. The standardized solution for the four-factor model without cross-loadings is shown in Figure 1. Based on the six fit indices previously described, the overall fit of the model was acceptable: χ² = 185.82, df = 48, p < .001; RMSEA = 0.07, 90% CI [.06, .09]; CFI = 0.91; SRMR = 0.094; AIC = 245.82; GFI = 0.91. When compared with the four-factor model with cross-loadings, this model demonstrated a significant increase in model misfit (χ²D = 121.04, df = 13, p < .001). However, the fit indices as a whole did not indicate poor fit, and as a conceptually derived and theory-supported model, the four-factor model without cross-loadings was preferable over the four-factor model with cross-loadings.


Figure 1. Standardized solution for four-factor PSWCoP model

Three-Factor Model without Cross-Loadings. The three-factor model corresponded to the original model of the PSWCoP (Figure 2). Three latent variables were included in this model: domain, community, and practice. Items were constrained to load on the factor for which they were designed. The four items originally developed for the practice subscale were constrained to load on a single latent variable, which represented a perfect correlation between the previously used latent variables competency and skills. Based on the six fit indices previously described, the overall fit of the model was poor: χ² = 359.90, df = 51, p < .001; RMSEA = 0.11, 90% CI [.10, .12]; CFI = 0.80; SRMR = 0.12; AIC = 413.90; GFI = 0.85. When compared with the four-factor model without cross-loadings, this model demonstrated a significant increase in model misfit (χ²D = 174.38, df = 3, p < .001). All of the fit statistics indicated that the data did not fit the model.

Figure 2. Standardized solution for three-factor PSWCoP model


Summary of CFA of the PSWCoP. A summary of fit indices across nested models is provided in Table 2. The model with the best overall fit was the four-factor model in which items were allowed to load across all factors. The fit of this model was good, but the model lacked conceptual support and was not interpretable with respect to the underlying latent structure of the PSWCoP. Although the four-factor model with constrained loadings had a significant increase in model misfit over the four-factor model with cross-loadings, the four-factor model with constrained loadings demonstrated acceptable fit. The results of the CFA on the four-factor model without cross-loadings supported the hypothesis of a multidimensional measure because correlations between latent variables were computed and there were no significant correlations between any pair of latent variables (α = .01).

The four-factor model with constrained loadings was compared with a three-factor model based on the originally proposed measurement model for the PSWCoP. The conceptual difference between the two models was the placement of the items developed for the practice subscale. Constraining these four items to load on a single latent variable resulted in a large increase in model misfit. All of the reported fit statistics indicated a model with poor fit.

Table 2
Comparison of Fit Indices Across Nested Models

                              Model 1:           Model 2: Unidimensional   Model 3: Unidimensional
                              4 Factor Model     4 Factor Model*           3 Factor Model**
χ² (df)                       64.48 (35)         185.52 (48)               359.90 (51)
Normed χ² (χ²/df)             1.84               3.86                      7.05
p-value (model)               .002               <.001                     <.001
χ²1 − χ²2 (df1 − df2)         —                  121.04 (13)               174.38 (3)
p-value (model difference)    —                  <.001                     <.001
RMSEA                         .04                .07                       .11
RMSEA 90% CI                  [.03, .05]         [.06, .09]                [.10, .12]
CFI                           0.98               0.91                      0.80
SRMR                          0.04               0.09                      0.12
AIC                           150.48             245.82                    413.90
GFI                           0.97               0.91                      0.85

*Compared to Model 1. **Compared to Model 2.

Multidimensional Item Response Theory Analysis

The PSWCoP data were then analyzed using a one-parameter IRT model using Winsteps 3.66.0 (Linacre, 2006) Rasch measurement software, and MIRT analyses using ACER Conquest 2.0, generalized item-response modeling software (Wu et al., 2008). The parameters for guessing were all constrained to zero, and the parameters for item discrimination were assumed equal and set to one. A thorough psychometric evaluation should ideally utilize at least a two-parameter model, especially when considering established measures, as item discrimination parameters are rarely the same. However, Reeve and Fayers (2005) suggested that the one-parameter model with equal item discrimination parameters is acceptable in the development and revision phase of scale construction, as was the case with the PSWCoP. The first set of analyses evaluated item difficulty, item fit, and reliability for a unidimensional model. The second set of analyses explored the dimensionality of the PSWCoP by comparing four- and three-factor models. The third set of analyses evaluated item difficulty, item fit, and reliability for the multidimensional models.

Rasch measurement results. Winsteps 3.68.0 (Linacre, 2006) Rasch measurement software was used to assess item difficulty, fit, and reliability for a unidimensional model. For affective measures, item difficulty refers to the amount of the construct needed to endorse, or respond positively to, the item. As the PSWCoP was designed to measure motivation, item difficulty was the amount of motivation needed to respond positively to the question. Person-ability or latent trait was the amount of motivation a given student possessed.


In general, the range of the latent trait of the sample and item difficulties were the same, and the distribution of persons and items about the mean were relatively symmetrical, indicating a good match between the latent trait of students and the difficulty of endorsing items. Exact numerical values for item difficulty are provided in Table 3 and ranged from -1.05 to +0.94. Item difficulty was scaled according to the theta metric and indicated the level of the latent trait at which the probability of a given response to the item was .50. Theta (θ) is the level of the latent trait being measured, scaled with a mean of zero and a standard deviation of one. Negative values indicated items that were easier to endorse, and positive values indicated items that were harder to endorse.

Item fit is an indication of how well an item performs according to the underlying IRT model being tested, and it is based on the comparison of observed responses to expected responses for each item. Adams and Khoo (1996) suggested that items with good fit have infit scores between 0.75 and 1.33; Bond and Fox (2001) suggested that items with good fit have t values between -2 and +2. Table 3 provides the fit statistics for the items of the PSWCoP survey; according to this output, only item P_3_R exceeded Bond and Fox's guideline, and no items exceeded Adams and Khoo's guideline.

Table 3
Rasch Analysis of Full Survey Item Difficulty and Fit

                     Model             Infit            Outfit
Item   Label         Est.     S.E.     MNSQ    ZSTD     MNSQ    ZSTD
1 C_1 0.30 .04 1.05 0.9 1.06 1.2
2 C_2 0.04 .04 0.93 -1.1 0.94 -1.0
3 C_3 0.05 .03 1.02 0.5 1.06 1.1
4 P_1 -0.56 .04 1.01 0.1 1.06 0.8
5 P_2 0.30 .04 0.98 -0.4 1.00 0.1
6 D_1 0.68 .04 0.94 -1.1 0.93 -1.1
7 D_2 -0.11 .04 0.91 -1.4 0.89 -1.6
8 D_3 0.24 .04 1.01 0.1 1.05 0.9
9 D_4 -0.33 .04 1.07 1.1 1.08 1.1
10 D_5 0.94 .04 0.97 -0.4 0.95 -0.7
11 P_3 -0.51 .04 1.17 2.1 1.35 4.0
12 P_4 -1.05 .06 0.93 -0.7 0.92 -0.9

IRT analysis produced an item reliability index indicating the extent to which item estimates would be consistent across different samples of respondents with similar abilities. High item reliability indicates that the ordering of items by difficulty will be somewhat consistent across samples. The reliability index of items for the PSWCoP pilot survey was 0.99, and indicated consistency in the ordering of items by difficulty. IRT analysis also produced a person-reliability index that indicated the extent of consistency in respondent ordering based on level of latent trait if given an equivalent set of items (Bond & Fox, 2001). The reliability index of persons for the PSWCoP was 0.60, and indicated low consistency in the ordering of persons by level of latent trait, which was possibly due to a constricted range of the latent trait in the sample or a constricted range of item difficulty.

MIRT factor structure. One of the core assumptions of IRT is unidimensionality; in other words, that person-ability can be attributed to a single latent construct, and that each item contributes to the measure of that construct (Bond & Fox, 2001). However, whether intended or not, item responses may be attributable to more than one latent construct. MIRT analyses allow the researcher to assess the dimensionality of the measure. Multidimensional models can be classified as either "within items" or "between items" (Adams, Wilson, & Wang, 1997). Within-items multidimensional models have items that can function as indicators of more than one dimension, and between-items multidimensional models have subsets of items that are mutually exclusive and measure only one dimension.


Competing multidimensional models can be evaluated based on changes in model deviance and the number of parameters estimated. A chi-square statistic is calculated as the difference in deviance (G²) between two nested models, with df equal to the difference in the number of parameters for the nested models. A statistically significant result indicates a difference in model fit. When a difference in fit is found, the model with the smallest deviance is selected; when a difference in model fit is not found, the more parsimonious model is selected.

The baseline MIRT model corresponded to the four-factor model with no cross-loadings estimated in the CFA (Figure 1). This baseline model was a between-items multidimensional model with items placed in mutually exclusive subsets. The four dimensions in the model were community, competency, domain, and skills. The baseline model fit statistic was G² = 17558.64 with 26 parameters. A three-dimensional, between-items multidimensional model, corresponding to the theoretical model of the PSWCoP (Figure 2), was tested against the baseline model. The three-dimensional model fit statistic was G² = 17728.83 with 22 parameters. When compared with the four-dimensional model, the change in model fit was statistically significant and indicated that the fit of the three-dimensional model was worse than the fit of the four-dimensional model (χ²(4) = 170.19, p < .001).

Table 4
Comparison of Model Fit Across Nested Models

                              Four Factor (Between)    Three Factor* (Between)
Deviance (G²)                 17558.64                 17728.83
df                            26                       22
G²1 − G²2                     —                        -170.19
df1 − df2                     —                        4
(G²1 − G²2)/(df1 − df2)       —                        42.55
p-value                       —                        < .001

*Compared to the Four Factor, Between-Items Model
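As a hedged sketch of the comparison summarized in Table 4 (the scipy call is standard; only the reported deviances and parameter counts are used), the change in deviance between the nested three- and four-dimensional models is referred to a chi-square distribution with df equal to the difference in the number of estimated parameters.

```python
from scipy.stats import chi2

# Deviance (G^2) and parameter counts as reported in Table 4.
g2_four_factor, params_four = 17558.64, 26
g2_three_factor, params_three = 17728.83, 22

delta_g2 = g2_three_factor - g2_four_factor   # 170.19
delta_df = params_four - params_three         # 4
print(delta_g2, delta_df, chi2.sf(delta_g2, delta_df))   # p well below .001
```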

Based on the change in model fit between nested models, the four-dimensional, between-items model had the better fit. This model resulted in a more accurate reproduction of the probability of endorsing a specific level or step of an item for a person with a particular level of the latent trait (Reckase, 1997). Thus, the four-dimensional model yielded the greatest reduction in discrepancy between observed and expected responses.

Item difficulty. MIRT analyses yielded an item-person map by dimension. The output of the MIRT item-person map (Figure 3) provided a visual estimate of the latent trait in the sample, item difficulty, and each dimension. Items are ranked in the right-hand column by difficulty, with items at the top being more difficult than items at the bottom. Although the range of item difficulty was narrow, items were well dispersed around the mean. Each dimension or factor has its own column with estimates for respondents' abilities. Two inferences were made based on the MIRT item-person map. First, although the range of item difficulty was narrow, items appeared to be dispersed in terms of difficulty with a range of -0.81 to +0.84. Furthermore, regarding Dimensions 1, 2, and 3, the item difficulties appeared to be well matched to levels of the latent trait, though over a limited range of the construct as scaled via the theta metric. Second, based on the means of the dimensions, Dimension 2 (competency, mean = 0.069) and Dimension 3 (domain, mean = -0.074) did a better job of representing all levels of these types of motivation than the other two dimensions. The small positive mean of Dimension 1 (community, mean = 0.335) indicated that students sampled for this study found it somewhat easier to endorse those items, whereas the large positive mean of Dimension 4 (skills, mean = 1.42) indicated that students sampled for this study found it very easy to endorse those items.


Figure 3. MIRT Latent Variable Item-Person Map

Item fit. Table 5 summarizes the items' characteristics. In addition to the estimation of item difficulties, infit and outfit statistics are reported. Using Adams and Khoo's (1996) guideline, only item C_2 showed poor fit (MNSQ = 0.68). In contrast, Bond and Fox's (2001) guideline identified several items as having poor fit (based on a 95% CI for MNSQ): C_1, C_2, D_1, D_3, and D_4.

Table 5
Item Parameter Estimates for 4 Dimensional Model
Model Infit Outfit
Item Label Est. S.E. MNSQ ZSTD MNSQ ZSTD
1 C_1 0.40 0.03 0.77 -3.8 0.77 -4.5
2 C_2 0.21 0.03 0.68 -5.6 0.67 -6.5
3 C_3 -0.61* 0.04 1.02 0.4 1.04 0.6
4 P_1 -0.14 0.03 1.01 0.2 1.00 0.0
5 P_2 0.14* 0.03 0.96 -0.5 0.93 -1.1
6 D_1 0.11 0.03 1.21 3.0 1.18 3.0
7 D_2 0.51 0.03 1.02 0.4 1.04 0.7
8 D_3 -0.65 0.03 1.29 4.1 1.30 4.4
9 D_4 -0.81 0.03 1.17 2.5 1.22 3.1
10 D_5 0.84* 0.06 0.95 -0.7 0.98 -0.2
11 P_3_R 0.33 0.04 1.00 -0.0 1.02 0.4
12 P_4 -0.33* 0.04 0.99 -0.2 1.00 -0.0
*Indicates that a parameter estimate is constrained



OSTEEN

Discussion

The rigor and sophistication with which social workers conduct psychometric assessments can be strengthened. Guo et al. (2010) found that social workers underutilize CFA and, more generally, SEM analyses. Further, even when those approaches are used appropriately, considerable room remains for improvement in reporting (Guo et al., 2010). Similarly, Unick and Stone (2010) found the use of IRT analyses for psychometric evaluation was noticeably missing from the social work literature. Developing familiarity and proficiency with strong psychometric methods will empower social workers in developing and selecting appropriate measures for research, policy, and practice.

Integration of CFA and MIRT Results
The primary result from both the CFA and MIRT analyses was the establishment of the PSWCoP as a multidimensional measure. Both sets of analyses identified a four-factor model, in which each item loaded on a single factor, as having the best model fit when compared with the three-factor model. In addition, both analytic strategies identified significant problems with the PSWCoP. Low subscale internal consistencies might be due to the small number of items for the community, skills, and competency subscales, as well as the inability to capture the complexity of different types of motivation for participating in a social work community of practice. CFA identified multiple items with high (>.7) error variances, and IRT analyses indicated poor fit for several items. Although the results of the analyses identified the PSWCoP as having limited utility, these poor psychometric properties did not prohibit CFA and MIRT analyses.

The CFA analysis was found to be more informative at the subscale level, whereas the MIRT analysis was found to be more informative at the item level. CFA was more informative regarding subscale composition and assessing associations among factors. The CFA analysis led to a final form of the PSWCoP with four subscales, and beginning evidence supporting the factorial validity of the measure. As indicated by the nonsignificant correlations among factors, each subscale appeared to be tapping a separate construct. Although MIRT allows the researcher to model factor structure, this approach does not estimate relationships between factors.

MIRT analyses were found to be more informative for assessing individual item performance. Item difficulty estimates were obtained for the PSWCoP as a whole and for each subscale. Items on the PSWCoP appeared to be a good match for the levels of latent trait of the respondents with regard to the community, domain, and competency factors, but too easy for the skills factor. Based on infit and outfit statistics, MIRT analyses identified additional items exhibiting poor fit as compared with the CFA. Specifically, two items on the community subscale had large standardized fit scores in the IRT analysis but displayed high factor loadings and low error variances in the CFA. The IRT analyses also provided estimates of the item information function and test information function, making it possible to obtain specific estimates of standard errors of measurement instead of relying on an averaged standard error of measurement obtained from the CFA.

Strengths and Limitations
Reliance on a convenience sample is a significant limitation of this study. The extent to which participants in this study were representative of the larger population of MSW students could not be determined. Although IRT purports to generate sample-independent item characteristic estimations, the stability of these estimations is enhanced when the sample is heterogeneous with regard to the latent trait. It is possible that students who self-selected to complete the measure were overly similar.

A study strength is its contribution to the field of psychometric assessment. Previous studies comparing IRT and CFA have dealt almost exclusively with assessing measurement invariance across multiple samples (e.g., Meade & Lautenschlager, 2004; Raju, Laffitte, & Byrne, 2002; Reise et al., 1993). The current study addresses emerging issues in measurement theory by applying IRT analyses to multidimensional latent variable measures, and by comparing MIRT and CFA assessments of factor structure in a novel measure.

Implications for Social Work Research
In addition to the benefits of using IRT/MIRT analytic procedures outlined in this paper, the ability of these techniques to assess differential item functioning (DIF) and differential test functioning (DTF) is a major advantage over CTT methods. Wilson (1985) described DIF as an indication of whether an item performs the same for members of different groups who have the same level of the latent trait, whereas DTF refers to the invariant performance of a set of items across different groups (Badia, Prieto, Roset, Díez-Pírez, & Herdman, 2002). If DIF/DTF exists, respondents from the subgroups who share the same level of a latent trait "do not have the same probability of endorsing a test item" (Embretson & Reise, 2000, p. 252). The ability to assess potential bias in items and tests provides a powerful method for developing culturally competent measures (Teresi, 2006). Valid comparisons between groups require measurement invariance, and IRT provides an additional tool for examining both items and tests.
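As a rough illustration of the DIF logic just described, the following Python sketch (not a procedure used in this study) compares item difficulties calibrated separately in two groups and flags items whose difference is large relative to its standard error, in the spirit of a separate-calibration comparison. All item labels, estimates, and standard errors shown are hypothetical.

```python
import numpy as np

def flag_dif(b1, b2, se1, se2, labels, z_crit=2.58):
    """Flag items whose group-1 vs. group-2 difficulty difference is significant.
    Difficulties are centered within each calibration as a crude form of linking."""
    b1 = np.asarray(b1) - np.mean(b1)
    b2 = np.asarray(b2) - np.mean(b2)
    diff = b1 - b2
    z = diff / np.sqrt(np.asarray(se1) ** 2 + np.asarray(se2) ** 2)
    return [(lab, round(float(d), 2)) for lab, d, zi in zip(labels, diff, z) if abs(zi) > z_crit]

# Hypothetical separate calibrations for two subgroups of respondents
labels = ["C_1", "C_2", "C_3", "P_1"]
print(flag_dif([0.40, 0.21, -0.61, -0.14], [0.05, 0.30, -0.55, -0.10],
               [0.05, 0.05, 0.06, 0.05], [0.06, 0.05, 0.07, 0.05], labels))
# Only C_1 is flagged in this invented example.
```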

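Relatedly, the item and test information functions mentioned earlier in the Discussion translate directly into conditional standard errors of measurement. These are standard IRT identities rather than results specific to this study: the test information function is the sum of the item information functions, and the standard error of measurement at a given trait level is the inverse square root of the test information at that level,

\[ I(\theta) = \sum_{i=1}^{k} I_i(\theta), \qquad \mathrm{SEM}(\theta) = \frac{1}{\sqrt{I(\theta)}}, \]

where, for a dichotomous Rasch item, \( I_i(\theta) = P_i(\theta)\,[1 - P_i(\theta)] \). Items therefore contribute the most information, and measurement is most precise, for respondents whose trait levels are near the item difficulties.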

Additional benefits of IRT analyses include the ability to conduct test equating and develop adaptive testing. The core question of test equating is the extent to which scores from two measures presumed to measure the same construct are comparable. For example, are the Beck Depression Inventory and the Center for Epidemiological Studies Depression Scale (CES-D) equatable? Adaptive testing allows the researcher to match specific items to different levels of ability to more finely discern a person's ability; persons estimated to have high ability may receive a different set of items than persons estimated to have low ability. With the increasing availability of statistical software for conducting MIRT analyses, the potential also exists for developing models with greater complexity for testing differential factor functioning (DFF). Akin to testing measurement invariance using CFA techniques, DFF analyses will provide researchers with an assessment of potential bias in the performance of factors (e.g., subscales) across groups.

A final consideration is choosing between the different psychometric strategies outlined in this paper. Ideally, both methods should be integrated. Doing so gives the researcher access to unique information available only from each analytic method, allows the researcher to compare common elements of both analyses, and minimizes the impact of each method's limitations. If applying both methods is not possible, theoretical and practical considerations can inform the decision. IRT is a stronger choice when data are dichotomous or ordinal because raw scores are transformed to an interval scale. If the relationship between items and factors is nonlinear or unknown, IRT will yield less biased estimates than CFA. If the construct to be measured is presumed to be unidimensional, IRT is a better strategy because of the additional information provided in the item analysis. Both MIRT and CFA are informative in assessing latent factor structures, but only CFA allows the researcher to estimate relationships between factors. Both strategies perform better with large sample sizes, but IRT is affected more negatively by smaller samples given the larger number of parameters being estimated. If possible, IRT/MIRT analysis should be limited to samples of 200 or more respondents. Conversely, IRT analyses yield stable results with very few items, whereas CTT reliability varies in part as a function of the number of items.
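As a concrete illustration of the adaptive testing logic described above, the following minimal Python sketch (hypothetical, and not part of this study) selects the next item from a calibrated Rasch item bank by choosing the unadministered item whose difficulty is closest to the respondent's current ability estimate, which is where a Rasch item is most informative. The item bank values are invented for illustration.

```python
def next_item(theta_hat, difficulties, administered):
    """Return the index of the unadministered item whose difficulty is closest
    to the current ability estimate theta_hat (all values in logits)."""
    candidates = [i for i in range(len(difficulties)) if i not in administered]
    return min(candidates, key=lambda i: abs(difficulties[i] - theta_hat))

# Hypothetical calibrated item bank (difficulties in logits)
bank = [-1.5, -0.8, -0.2, 0.3, 0.9, 1.6]
print(next_item(1.2, bank, administered={5}))    # high ability -> item with difficulty 0.9
print(next_item(-1.0, bank, administered=set())) # low ability  -> item with difficulty -0.8
```

In a full computerized adaptive test, the ability estimate would be updated after each response and the selection repeated until a precision or test-length criterion was met.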


References

Adams, R. J., & Khoo, S. T. (1996). ACER Quest [Computer software]. Melbourne, Australia: ACER.
Adams, R. J., Wilson, M., & Wang, W. (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21, 1-24. doi:10.1177/0146621697211001
Andrich, D. (1988). Rasch models for measurement. Newbury Park, CA: Sage.
Badia, X., Prieto, L., Roset, M., Díez-Pírez, A., & Herdman, M. (2002). Development of a short osteoporosis quality of life questionnaire by equating items from two existing instruments. Journal of Clinical Epidemiology, 55, 32-40. doi:10.1016/S0895-4356(01)00432-2
Benson, J., & Clark, F. (1982). A guide for instrument development and validation. American Journal of Occupational Therapy, 36, 789-800.
Bond, T. G., & Fox, C. M. (2001). Applying the Rasch model (2nd ed.). Mahwah, NJ: Lawrence Erlbaum.
DeVellis, R. F. (2003). Scale development: Theory and applications. Thousand Oaks, CA: Sage.
DiStefano, C. (2002). The impact of categorization with confirmatory factor analysis. Structural Equation Modeling, 9, 327-346. doi:10.1207/S15328007SEM0903_2
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum.
Flora, D. B., & Curran, P. J. (2004). An empirical evaluation of alternative methods of estimation for confirmatory factor analysis with ordinal data. Psychological Methods, 9, 466-491. doi:10.1037/1082-989X.9.4.466
Fries, J., Bruce, B., & Cella, D. (2005). The promise of PROMIS: Using item response theory to improve assessment of patient-reported outcomes. Clinical and Experimental Rheumatology, 23(5, Suppl. 39), S53–S57.
Greguras, G. J. (2005). Managerial experience and the measurement equivalence of performance ratings. Journal of Business and Psychology, 19(3), 383-397. doi:10.1007/s10869-004-2234-y
Guo, B., Perron, B. E., & Gillespie, D. F. (2009). A systematic review of structural equation modeling in social work research. British Journal of Social Work, 39, 1556-1574. doi:10.1093/bjsw/bcn101
Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston, MA: Kluwer/Nijhoff.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.
Henard, D. H. (2000). Item response theory. In L. G. Grimm & P. R. Yarnold (Eds.), Reading and understanding more multivariate statistics (pp. 67-98). Washington, DC: American Psychological Association.
Jackson, D. L., Gillaspy, J. A., & Purc-Stephenson, R. (2009). Reporting practices in confirmatory factor analysis: An overview and some recommendations. Psychological Methods, 14(1), 6-23. doi:10.1037/a0014694
Jöreskog, K. G., & Sörbom, D. (2007). LISREL 8.80 for Windows [Computer software]. Lincolnwood, IL: Scientific Software International.
Klem, L. (2000). Structural equation modeling. In L. G. Grimm & P. R. Yarnold (Eds.), Reading and understanding more multivariate statistics (pp. 227-260). Washington, DC: American Psychological Association.
Kline, R. B. (2005). Principles and practices of structural equation modeling (2nd ed.). New York, NY: Guilford Press.
Linacre, J. K. (1994). Sample size and item calibration stability. Rasch Measurement Transactions, 7(4), 328.
Linacre, J. K. (1999). Investigating rating scale category utility. Journal of Outcome Measurement, 3(2), 103-122.
Linacre, J. K. (2006). Winsteps Rasch measurement 3.68.0 [Computer software]. Chicago, IL: Author.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.
Meade, A. W., & Lautenschlager, G. J. (2004). A comparison of item response theory and confirmatory factor analytic methodologies for establishing measurement equivalence/invariance. Organizational Research Methods, 7(4), 361-388. doi:10.1177/1094428104268027
Müller, U., Sokol, B., & Overton, W. F. (1999). Developmental sequences in class reasoning and propositional reasoning. Journal of Experimental Child Psychology, 74, 69-106. doi:10.1006/jecp.1999.2510
Raju, N. S., Laffitte, L. J., & Byrne, B. M. (2002). Measurement equivalence: A comparison of methods based on confirmatory factor analysis and item response theory. Journal of Applied Psychology, 87, 517-529. doi:10.1037/0021-9010.87.3.517
Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago, IL: MESA Press.
Reckase, M. D. (1997). The past and future of multidimensional item response theory. Applied Psychological Measurement, 21, 25-27. doi:10.1177/0146621697211002
Reeve, B. B., & Fayers, P. (2005). Applying item response theory modeling for evaluating questionnaire items and scale properties. In P. M. Fayers & R. D. Hays (Eds.), Assessing quality of life in clinical trials: Methods and practice (2nd ed., pp. 55-73). New York, NY: Oxford University Press.
Reise, S. P., Ainsworth, A. T., & Haviland, M. G. (2005). Item response theory: Fundamentals, applications, and promises in psychological research. Current Directions in Psychological Science, 14(2), 95-101.
Reise, S. P., Widaman, K. F., & Pugh, R. H. (1993). Confirmatory factor analysis and item response theory: Two approaches for exploring measurement invariance. Psychological Bulletin, 114, 552-566. doi:10.1037/0033-2909.114.3.552
Samejima, F. (1969). Estimation of latent ability using a response format of graded scores (Psychometric Monograph No. 17). Richmond, VA: Psychometric Society. Retrieved from http://www.psychometrika.org/journal/online/MN17.pdf
Smith, R. M., Schumaker, R. E., & Bush, M. J. (1998). Using item mean squares to evaluate fit to the Rasch model. Journal of Outcome Measurement, 2(1), 66-78. PMid:9661732
SPSS. (2007). SPSS for Windows, Rel. 16.0.0 [Computer software]. Chicago, IL: SPSS, Inc.
Sun, J. (2005). Assessing goodness of fit in confirmatory factor analysis. Measurement and Evaluation in Counseling and Development, 37(4), 240-256.
Swaminathan, H., & Gifford, J. A. (1979). Estimation of parameters in the three-parameter latent-trait model (Laboratory of Psychometric and Evaluation Research Report No. 90). Amherst: University of Massachusetts.
Tabachnick, B. G., & Fidell, L. S. (2001). Using multivariate statistics (4th ed.). Boston, MA: Allyn and Bacon.
Teresi, J. A. (2006). Overview of quantitative measurement methods: Equivalence, invariance and differential item functioning in health applications. Medical Care, 44, S39–S49.


Unick, G. J., & Stone, S. (2010). State of modern measurement approaches in social work research literature. Social Work Research, 34(2), 94-101.
Ware, J. E., Jr., Bjorner, J. B., & Kosinski, M. (2000). Practical implications of item response theory and computerized adaptive testing: A brief summary of ongoing studies of widely used headache impact scales. Medical Care, 38, 1173–1182. doi:10.1097/00005650-200009002-00011
Wenger, E., McDermott, R., & Snyder, W. M. (2002). Cultivating communities of practice. Boston, MA: Harvard Business School Press.
Wirth, R. J., & Edwards, M. C. (2007). Item factor analysis: Current approaches and future directions. Psychological Methods, 12(1), 58-79. doi:10.1037/1082-989X.12.1.58
Wright, B. D., & Masters, G. N. (1982). Rating scale analysis: Rasch measurement. Chicago, IL: MESA Press.
Wu, M. L., Adams, R. J., Wilson, M., & Haldane, S. (2008). ACER ConQuest 2.0: Generalized item response modeling software [Computer software]. Hawthorn, Australia: ACER.
