An Introduction To Using Multidimensional Item Response Theory To Assess Latent Factor Structures
This study provides an introduction to the use of multidimensional item response theory (MIRT) analysis for assessing latent factor structure, and compares this statistical technique to confirmatory factor analysis (CFA) in the evaluation of an original measure developed to assess students' motivations for entering a social work community of practice. The Participation in a Social Work Community of Practice Scale (PSWCoP) was administered to 506 master of social work students from 11 accredited graduate programs. The psychometric properties and latent factor structure of the scale are evaluated using MIRT and CFA techniques. Although the scale was designed as a 3-factor measure, analysis of model fit using both CFA and MIRT does not support this solution. Instead, analyses using both methods produce convergent results supporting a 4-factor solution. Discussion includes methodological implications for social work research, focusing on the extension of MIRT analysis to assessment of measurement invariance in differential item functioning, differential test functioning, and differential factor functioning.
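The model comparison summarized above, testing a 3-factor against a 4-factor solution, rests on a chi-square difference (likelihood ratio) test between nested models. A minimal sketch of that arithmetic follows; the chi-square and df values are illustrative placeholders, not output from this study:

```python
# Chi-square difference test between two nested models fit to the same data.
# Critical values are the standard alpha = .05 chi-square table.
CHI2_CRIT_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488,
                5: 11.070, 6: 12.592, 7: 14.067}

def chi_square_difference(chi2_restricted, df_restricted,
                          chi2_baseline, df_baseline):
    """Return (chi2_D, df_D, reject) where the restricted model has more
    df (fewer free parameters) than the baseline model."""
    chi2_d = chi2_restricted - chi2_baseline
    df_d = df_restricted - df_baseline
    # Reject H0 of identical fit when the loss of fit exceeds the critical value
    reject = chi2_d > CHI2_CRIT_05[df_d]
    return chi2_d, df_d, reject

# Illustrative values: if fit worsens by more than the critical value, the
# extra constraints of the restricted (e.g., 3-factor) model are not supported.
chi2_d, df_d, reject = chi_square_difference(312.4, 87, 289.1, 84)
```

A nonsignificant difference would instead favor the more parsimonious (restricted) model, which is the decision rule applied throughout the analyses below.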
early use in educational measurement, the term ability may seem mismatched to psychosocial constructs; thus, the term latent trait may be more intuitive, and references to level of ability are synonymous with level of the latent trait. The IRT model produces estimates for both of these elements by calculating item-difficulty parameters on the basis of the total number of persons who correctly answered an item, and person-trait parameters on the basis of the total number of items successfully answered (Bond & Fox, 2001). The assumptions underlying these estimates are (a) that a person with more of the trait will always have a greater likelihood of success than a person with less of the trait, and (b) that any person will have a greater likelihood of endorsing items requiring less of the trait than items requiring more of the trait (Müller, Sokol, & Overton, 1999). Samejima (1969) and Andrich (1978) extended this model to measures with polytomous response formats (i.e., Likert scales) by adding an estimate to account for the difficulty in crossing the threshold from one level of response to the next (e.g., moving from agree to strongly agree).

Scale Evaluation Using IRT

The basic unit of IRT is the item response function (IRF) or item characteristic curve. The relationship between a respondent's performance and the characteristics underlying item performance can be described by a monotonically increasing function called the item characteristic curve (ICC; Henard, 2000). The ICC is typically a sigmoid curve estimating the probability of a given response based on a person's level of latent trait. The shape of the ICC is determined by the item characteristics estimated in the model. The ICC in a three-parameter IRT model is derived using the formula

P(θ) = c + (1 − c)e^(a(θ − b − f)) / (1 + e^(a(θ − b − f)))

where P, the probability of a response given a person's level of the latent trait, denoted by theta (θ), is a function of guessing (c parameter), item discrimination (a parameter), item difficulty (b parameter), and the category threshold (f) if using a polytomous response format.

For the one-parameter IRT model, the guessing parameter, c, is constrained to zero, assuming little or no impact of guessing. For example, a person cannot guess the correct response to an item using a Likert scale because items are not scored as right or wrong. The item discrimination parameter, a, is set to 1 under the assumption that there is equal discrimination across items. In a one-parameter model the probability of a response is determined only by the person's level of the latent trait and the difficulty of the item. Item difficulty is an indication of the level of the underlying trait that is needed to endorse or respond in a certain way to the item. For items on a rating scale, the IRF is a mathematical function describing the relation between where an individual falls on the continuum of a given construct such as motivation and the probability that he or she will give a particular response to a scale item designed to measure that construct (Reise, Ainsworth, & Haviland, 2005). The basic goal of IRT modeling is to create a sample-free measure.

Multidimensional item response theory, or MIRT, is an extension of IRT and is used to explore the underlying dimensionality of an IRT model. Advances in computer software (e.g., Conquest, MULTILOG, and Mplus) allow for testing and evaluation of more complex multidimensional item response models and enable researchers to statistically compare competing dimensional models. ACER Conquest 2.0 (Wu, Adams, & Wilson, 2008), the software used in this study, produces marginal maximum likelihood estimates for the parameters of the models. The fit of the models is ascertained by generalizations of the Wright and Masters (1982) residual-based methods. Alternative dimensional models are evaluated using a likelihood ratio chi-squared statistic (χ²LR; Barnes, Chard, Wolfe, Stassen, & Williams, 2007).

Core statistical output of an IRT analysis of a one-parameter rating scale model includes estimates of person latent trait, item difficulty, model fit, person-fit, item-fit, person reliability, item reliability, and step calibration. A two-parameter model would include estimates for item discrimination, and a three-parameter model would include an additional estimate for guessing. Person latent trait is an estimate of the underlying trait present for each respondent. Persons with high person-ability scores possess more of the underlying trait than persons with low scores. Item difficulty is an estimate of the level of underlying trait at which a person has a 50% probability of endorsing the item. Items with higher item-difficulty scores require a respondent to have more of the underlying trait to endorse or correctly respond to the item than items with lower item-difficulty scores. Consider a measure of reading comprehension. An item requiring a 12th-grade reading level is more difficult than an item requiring a 6th-grade reading level. The same concept applies to a measure of motivation; an item requiring a high amount of motivation is more "difficult" than an item requiring a low amount of motivation. This idea translates to the concept of person-ability or latent trait. A person who reads at a 12th-grade level has more ability than a person who reads at a 6th-grade level; a person who is more motivated has more of the latent trait than a person who is less motivated.

Analysis of item fit. Fit statistics in IRT analysis include infit and outfit mean square (MNSQ) statistics.
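The ICC formula above can be made concrete in a few lines of code. The sketch below is not tied to any particular IRT package; the function names are illustrative:

```python
import math

def icc_probability(theta, b, a=1.0, c=0.0, f=0.0):
    """Probability of endorsement under the three-parameter ICC:
    P(theta) = c + (1 - c) * e^(a(theta - b - f)) / (1 + e^(a(theta - b - f))).
    With the defaults c = 0 and a = 1, this reduces to the
    one-parameter (Rasch-type) model described in the text."""
    z = a * (theta - b - f)
    # Algebraically identical to c + (1 - c) * e^z / (1 + e^z)
    return c + (1.0 - c) / (1.0 + math.exp(-z))

# Under the one-parameter model, a person whose trait level equals the
# item's difficulty (plus category threshold) endorses with probability .50:
p = icc_probability(theta=1.2, b=1.2)
```

Plotting this function over a range of θ values reproduces the sigmoid ICC: higher b shifts the curve right (a harder item), larger a steepens it, and a nonzero c raises its lower asymptote.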
Infit and outfit are statistical representations of how well the data match the prescriptions of the IRT model (Bond & Fox, 2001). Outfit statistics are based on conventional sums of squared standardized residuals, and infit statistics are based on information-weighted sums of squared standardized residuals (Bond & Fox, 2001). Infit and outfit have expected MNSQ values of 1.00; values greater than or less than 1 indicate the degree of variation from the expected score. For example, an item with an infit MNSQ of 1.33 (1.00 + .33) indicates 33% more variation in responses to that item than was predicted by the model. Mean infit and outfit values represent a degree of overall fit of the data to the model, but infit and outfit statistics are also available for assessing fit at the individual item level (item-fit) and the individual person level (person-fit). Item-fit refers to how well the IRT model explains the responses to a particular item (Embretson & Reise, 2000). Person-fit refers to the consistency of an individual's pattern of responses across items (Embretson & Reise, 2000).

One limitation of IRT is the need for large samples. No clear standards exist for minimum sample size, although Embretson and Reise (2000) briefly noted that a sample of 500 respondents was recommended, and cautioned that parameter estimations might become unstable with samples of less than 350 respondents. Reeve and Fayers (2005) suggested that useful information about item characteristics could be obtained with samples of as few as 250 respondents. One-parameter models may yield reliable estimates with as few as 50 to 100 respondents (Linacre, 1994). As the complexity of the IRT model increases and more parameters are estimated, sample size should increase accordingly. Smith, Schumacker, and Bush (1998) provided the following sample-size-dependent cutoffs for determining poor fit: misfit is evident when MNSQ infit or outfit values are larger than 1.3 for samples less than 500, 1.2 for samples between 500 and 1,000, and 1.1 for samples larger than 1,000 respondents. According to Adams and Khoo (1996), items with adequate fit will have weighted MNSQs between .75 and 1.33. Bond and Fox (2001) stated that items routinely accepted as having adequate fit will have t-values between -2 and +2. According to Wilson (2005), when working with large sample sizes, the researcher can expect the t-statistic to show significant values for several items regardless of fit; therefore, Wilson suggested that the researcher consider items problematic only if items are identified as misfitting based on both the weighted MNSQ and t-statistic.

For rating scale models, category thresholds are provided in the IRT analysis. A category threshold is the point at which the probability of endorsing one category is equal to the probability of endorsing a corresponding category one step away. Although thresholds are ideally equidistant, that characteristic is not necessarily the reality. Guidelines indicate that thresholds should be at least 1.4 logits but no more than 5 logits apart (Linacre, 1999). Logits are the scale units for the log odds transformation. When thresholds have small logits, response categories may be too similar and nondiscriminant. Conversely, when the threshold logit is large, response categories may be too dissimilar and far apart, indicating the need for more response options as intermediate points. Infit and outfit statistics are also available for step calibrations. Outfit MNSQ values greater than 2.0 indicate that a particular response category is introducing "noise" into the measurement process and should be evaluated as a candidate for collapsing with an adjacent category (Bond & Fox, 2001; Linacre, 1999).

In conjunction with the standard output of IRT analysis, MIRT analysis provides information about dimensionality, the underlying latent factor structure. ACER Conquest 2.0 (Wu et al., 2008) software provides estimations of population parameters for the multidimensional model, which include factor means, factor variances, and factor covariances/correlations. ACER Conquest 2.0 also produces maps of latent variable distributions and response model parameter estimates.

Analysis of nested models. Two models are considered to be nested if one is a subset of the second. Overall model fit of an IRT model is based on the deviance statistic, which follows a chi-square distribution. The deviance statistic changes as parameters are added to or deleted from the model, and changes in fit between nested models can be statistically tested. The chi-square difference statistic (χ²D) can be used to test the statistical significance of the change in model fit (Kline, 2005). The χ²D is calculated as the difference between the model chi-square (χ²M) values of two nested models using the same data; the df for the χ²D statistic is the difference in dfs for the two nested models. The χ²D statistic tests the null hypothesis of identical fit of the two models to the population. Failure to reject the null hypothesis means that the two models fit the population equally. When two nested models fit the population equally well, the more parsimonious model is generally considered the more favorable.

Scale Evaluation Using CFA

Factor analysis is a more traditional method for analyzing the underlying dimensionality of a set of observed variables. Derived from CTT, factor analysis includes a variety of statistical procedures for exploring the relationships among a set of observed variables with the intent of identifying a smaller number of
factors, the unobserved latent variables, thought to be responsible for these relationships among the observed variables (Tabachnick & Fidell, 2007). CFA is used primarily as a means of testing hypotheses about the latent structure underlying a set of observed data.

A common and preferred method for conducting CFA is structural equation modeling (SEM). The term SEM refers to a family of statistical procedures for assessing the degree of fit between observed data and an a priori hypothetical model in which the researcher specifies the relevant variables, which variables affect other variables, and the direction of those effects. The two main goals of SEM analysis are to explore patterns of correlations among a set of variables, both observed and unobserved, and to explain as much variance as possible using the model specified by the researcher (Klem, 2000; Kline, 2005).

Analysis of SEM models. Analysis of SEM models is based on the fit of the observed variance-covariance matrix to the proposed model. Although maximum likelihood (ML) estimation is the common method for deriving parameter estimates, it is not the only estimation method available. ML estimation produces parameter estimates that minimize the discrepancies between the observed covariances in the data and those predicted by the specified SEM model (Kline, 2005). Parameters are characteristics of the population of interest; without making observations of the entire population, parameters cannot be known and must be estimated from sample statistics. ML estimation assumes interval-level data, and alternative methods, such as weighted least squares estimation, should be used with dichotomous and ordinal-level data. Guo, Perron, and Gillespie (2009) noted in their review of social work SEM publications that ML estimation was sometimes used and reported inappropriately.

Analysis of model fit. Kline (2005) defined model fit as how well the model as a whole explains the data. When a model is overidentified, it is expected that model fit will not be perfect; it is therefore necessary to determine the actual degree of model fit, and whether the model fit is statistically acceptable. Ideally, indicators should load only on the specific latent variable identified in the measurement model. This type of model can be tested by constraining the direct effects between indicators and other factors to zero. According to Kline (2005), "indicators are expected to be correlated with all factors in CFA models, but they should have higher estimated correlations with the factors they are believed to measure" (emphasis in original, p. 177). A measurement model with indicators loading only on a single factor is desirable but elusive in practice with real data. Statistical comparison of models with cross-loadings to models without cross-loadings allows the researcher to make stronger assertions about the underlying latent variable structure of a measure. As Guo et al. (2009) noted, modified models allowing cross-loadings between items and factors have been frequently published in the social work literature without fully explaining how they related to models without cross-loadings.

Analysis of nested models. As noted in the discussion of MIRT analysis, two models are considered to be nested if one is a subset of the second. Overall model fit based on the chi-square distribution will change as paths are added to or deleted from a model. Kline's (2005) chi-square difference statistic (χ²D) can be used to test the statistical significance of the change in model fit.

MIRT versus CFA

MIRT and CFA analyses can be used to assess the dimensionality or underlying latent variable structure of a measurement. The choice of statistical procedures raises questions about differences between analyses, whether the results of the two analyses are consistent, and what information can be obtained from one analysis but not the other. IRT addresses two problems inherent in CTT. First, IRT overcomes the problem of item-person confounding found in CTT. IRT analysis yields estimates of item difficulties and person-abilities that are independent of each other, whereas in CTT item difficulty is assessed as a function of the abilities of the sample, and the abilities of respondents are assessed as a function of item difficulty (Bond & Fox, 2001), a limitation that extends to CFA.

Second, the use of ordinal-level data (i.e., rating scales), which are routinely treated in statistical analyses as continuous, interval-level data, may violate the scale and distributional assumptions of CFA (Wirth & Edwards, 2007). Violating these assumptions may result in model parameters that are biased and "impossible" to interpret (Wirth & Edwards, 2007, p. 58; DiStefano, 2002). The logarithmic transformation of ordinal-level raw data into interval-level data in IRT analysis overcomes this problem.

IRT and CTT also differ in the treatment of the standard error of measurement. The standard error of measurement is an indication of variability in scores due to error. Under CTT, the standard error of measurement is averaged across persons in the sample or population and is specific to that sample or population. Under IRT, the standard error of measurement is considered to vary across scores in the same population and to be population-general (Embretson & Reise, 2000). The IRT approach to the standard error of measurement offers the following benefits: (a) the precision of measurement can be evaluated at any level of the latent trait instead of
averaged over trait levels as in CTT, and (b) the contribution of each item to the overall precision of the measure can be assessed and used in item selection (Hambleton & Swaminathan, 1985).

MIRT and CFA differ in the estimation of item fit. Where item fit is assessed through error variances, communalities, and factor loadings in CFA, item fit is assessed through unweighted (outfit) and weighted (infit) mean square errors in IRT analyses (Bond & Fox, 2001). Further, the treatment of the relationship between indicator and latent variable, which is constrained to a linear relationship in CFA, can be nonlinear in IRT (Greguras, 2005). CFA uses one number, the factor loading, to represent the relationship between the indicator and the latent variable across all levels of the latent variable; in IRT, the relationship between indicator and latent variable is given across the range of possible values for the latent variable (Greguras, 2005). Potential implications of these differences include inconsistencies in parameter estimates, indicator and factor structure, and model fit across MIRT and CFA analyses.

Both IRT and CFA provide statistical indicators of psychometric performance not available in the other analysis. Using the item information curve (IIC), IRT analysis allows the researcher to establish both item information functions (IIF) and test information functions (TIF). The IIF estimates the precision and reliability of individual items independent of other items on the measure; the TIF provides the same information for the total test or measure, which is a useful tool in comparing and equating multiple tests (Hambleton et al., 1991; Embretson & Reise, 2000). IRT for polytomous response formats also provides estimated category thresholds for the probability of endorsing a given response category as a function of the level of underlying trait. These indices of item and test performance and category thresholds are not available in CFA, in which item and test performance are conditional on the other items on the measure. Conversely, CFA offers a wide range of indices for evaluating model fit, whereas IRT is limited to the use of the χ² deviance statistic. Reise, Widaman, and Pugh (1993) explicitly identified the need for modification indices and additional model fit indicators for IRT analyses as a limitation.

Participation in a Social Work Community of Practice Scale

Although the content of the Participation in a Social Work Community of Practice Scale (PSWCoP) is less important in the current discussion than the methodologies used to evaluate the scale, a brief overview will provide context for interpreting the results of the analyses. The PSWCoP scale is an assessment of students' motivations for entering a master of social work (MSW) program as conceptualized in Wenger, McDermott, and Snyder's (2002) three-dimensional model of motivation for participation in a community of practice. Wenger et al. (2002) asserted that all communities of practice are comprised of three fundamental elements (p. 27): a domain of knowledge defining a set of issues; a community of people who care about the domain; and the shared practice developed to be effective in that domain. Some individuals are motivated to participate because they care about the domain and are interested in its development. Some individuals are motivated to participate because they value being part of a community as well as the interaction and sharing with others that is part of having a community. Finally, some individuals are motivated to participate by a desire to learn about the practice as a means of improving their own techniques and approaches. The PSWCoP was developed as a multidimensional measure of the latent constructs domain motivation, community motivation, and practice motivation (Table 1). Data were collected from a convenience sample of students enrolled in MSW programs using a cross-sectional survey design and compared to the three-factor model developed from Wenger et al.

Method

Participants

A convenience sample of 528 current MSW students was drawn from 11 social work programs accredited by the Council on Social Work Education (CSWE). Participants were enrolled during two separate recruitment periods. The first round of recruitment yielded a nonrandom sample of 268 students drawn from nine academic institutions. The second round of recruitment yielded a nonrandom sample of 260 students drawn from eight institutions. Six institutions participated in both rounds of data collection, three institutions participated in only the first round of data collection, and two institutions participated in only the second round of data collection. The response rate for the study could not be calculated because there was no way to determine the total number of students who received information about the study or had access to the online survey. Twenty-two cases (4.1%) were removed because of missing data, yielding a final sample of 506 students; listwise deletion was used given the extremely small amount of missing data.

Data were collected on multiple student characteristics including age, gender, race/ethnicity, sexual orientation, religious affiliation, participation in religious activities, family socioeconomic status (SES), and enrollment status. The mean age of participants
Table 1
Original Items on the Participation in a Social Work Community of Practice Scale

My main interest for entering the MSW program was to be a part of a community of social workers. (Community, C_1)
I wanted to attend a MSW program so that I could be around people with similar values to me. (Community, C_2)
I chose a MSW program because I thought social work values were more similar to my values than those of other professions. (Community, C_3)
There is more diversity of values among students than I expected. (Community, C_4)*
Before entering the program, I was worried about whether or not I would fit in with my peers. (Community, C_5)*
Learning about the social work profession is less important to me than being part of a community of social workers. (Community, C_6)*
Without a MSW degree, I am not qualified to be a social worker. (Practice, P_1)
A MSW degree is necessary to be a good social worker. (Practice, P_2)
Learning new social work skills was not a motivating factor in my decision to enter the MSW program. (Practice, P_3)
My main reason for entering the MSW program was to acquire knowledge and/or skills. (Practice, P_4)
A MSW degree will give me more professional opportunities than other professional degrees. (Practice, P_5)*
Being around students with similar goals is less important to me than developing my skills as a social worker. (Practice, P_6)*
Learning how to be a social worker is more important to me than learning about the social work profession. (Practice, P_7)*
I find social work appealing because it is different than the type of work I have done in the past. (Domain, D_1)
I decided to enroll in a MSW program to see if social work is a good fit for me. (Domain, D_2)
I wanted to attend a MSW program so that I could learn about the social work profession. (Domain, D_3)
Entering the MSW program allowed me to explore a new area of professional interest. (Domain, D_4)
My main reason for entering the MSW program was to decide if social work is the right profession for me. (Domain, D_5)

*Items deleted from the final version of the PSWCoP
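The item-factor assignment in Table 1, with the starred items dropped, can be sketched as a simple scoring routine. The split of the retained practice items into two factors follows the EFA result reported in the Results section; the neutral factor names and the example responses below are assumptions for illustration, not the article's labels or data:

```python
# Scoring sketch for the final PSWCoP (starred items in Table 1 dropped).
# "practice_a"/"practice_b" are placeholder names for the two factors
# produced by the EFA split of the practice subscale.
PSWCOP_FINAL = {
    "community":  ["C_1", "C_2", "C_3"],
    "practice_a": ["P_1", "P_2"],
    "practice_b": ["P_3", "P_4"],
    "domain":     ["D_1", "D_2", "D_3", "D_4", "D_5"],
}

def subscale_scores(responses):
    """Mean response per subscale on the 6-point rating scale (1-6)."""
    return {
        scale: sum(responses[item] for item in items) / len(items)
        for scale, items in PSWCOP_FINAL.items()
    }

# Example: a hypothetical respondent answering "agree" (5) to every retained item
scores = subscale_scores({item: 5 for items in PSWCOP_FINAL.values()
                          for item in items})
```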
was 30.2 years (SD = 8.7 years). The majority of students were female (92%). The majority of the participants were Caucasian (82.6%), with 7.3% of students self-identifying as African American or Black; 4.1% as Hispanic; 1.8% as Asian/Pacific Islander; and 4.1% as a nonspecified race/ethnicity. Students identified their enrollment status as either part-time (19.5%), first year (32.7%), advanced standing (27%), or second year (20.8%).

Measures

Analyses were conducted on an original measure of students' motivations for entering a social work community of practice, defined as pursuing a MSW degree. The PSWCoP was developed and evaluated using steps outlined by Benson and Clark (1982) and DeVellis (2003). The pilot measure contained 18 items designed to measure three constructs (domain, community, and practice). Items were measured on a 6-point rating scale from strongly disagree to strongly agree. Items from the pilot measure organized by subscale are listed in Table 1. In addition to items on the PSWCoP, students were asked to provide demographic information.

Procedures

Participants completed the PSWCoP survey as part of a larger study exploring the relationship between students' motivations to pursue the MSW degree, their attitudes about diversity and historically marginalized groups, and their endorsement of professional social work values as identified in the National Association of Social Workers (2009) Code of Ethics. This research was approved by the University of Denver Institutional Review Board prior to recruitment and data collection. Recruitment consisted of a two-pronged approach: (a) an e-mail providing an overview of the study and a link to the online survey was sent to students currently enrolled in the MSW program; and (b) an announcement providing an overview of the study and a link to the online survey was posted to student-oriented informational Web sites. Interested participants were able to access the anonymous online survey through www.surveymonkey.com, which is a frequently used online survey provider. Participants were presented with a project information sheet and were required to indicate their consent to participate by clicking on the appropriate response before being allowed to access the actual survey.

Results

Reliability of scores from the PSWCoP was assessed using both CTT and IRT methods. SPSS (v.16.0.0, 2007) was used to calculate internal consistency reliability (Cronbach's α; inter-item correlations). ACER Conquest 2.0 (Wu et al., 2008) was used to assess item reliability. The dimensionality and factor structure of the PSWCoP were evaluated using both a MIRT and a CFA approach. ACER Conquest 2.0 (Wu et al., 2008) was used to conduct the MIRT analysis, and Lisrel 8.8 (Jöreskog & Sörbom, 2007) was used to conduct the CFA analysis. ACER Conquest 2.0 was used to evaluate the PSWCoP with respect to estimates of levels of latent trait and item difficulty using a one-parameter logistic model. Assessment of the measure was based on model fit, person-fit, item fit, person reliability, item reliability, step calibration, and population parameters for the multidimensional model.

Item Selection

Items were identified for possible deletion from each subscale using Cronbach's alpha, IRT MNSQ infit/outfit results, and theory. Poorly performing items identified through statistical analyses were further assessed using conceptual and theoretical frameworks. A combination of results led to the removal of three items from the community subscale and three items from the practice subscale, but no items from the domain subscale (Table 1). Items C_6, P_6, and P_7 addressed relationships between types of motivations by asking respondents to rate whether one type of motivation was more important than another type. Quantitative differences between types of motivations were not addressed in community of practice theory, and therefore these items were deemed not applicable in the measurement of each type of motivation. Items C_4 and C_5 were deleted from the community subscale because these items specifically addressed relationships between respondents and peers. Community-based motivation arises out of perceived value congruence between the individual and the practice (i.e., professional social work), and not between the individual and other members of the community of practice. All analyses indicated problems with the practice subscale, and ultimately EFA was used with this subscale only. The results of the EFA suggested items P_1 and P_2 formed one factor, and items P_3 and P_4 constituted a second factor. Item P_5 did not load on either factor and was deleted.

The results of the item selection process yielded two competing models. The first model consisted of three factors in which all items developed for the practice subscale were kept together; this model most closely reflected the original hypothetical model developed based on community of practice theory. The second model had four factors, with the items from the hypothesized practice subscale split into the two factors suggested by the EFA. Internal consistency for each of the subscales on the final version of the PSWCoP was assessed using Cronbach's alpha. Cronbach's alpha was 0.64 for scores from the domain subscale, 0.68 for scores from the community subscale, and 0.47 for scores from the practice subscale (three-factor model). Splitting the practice subscale into two factors yielded a Cronbach's alpha of 0.58 for scores from the skills subscale and .68 for scores from the competency subscale. Although ultimately indicative of a poor measure, low internal consistency did not prohibit the application and comparison of factor analysis using CFA and MIRT.

Factor Structure

CFA. CFA analyses of the PSWCoP were conducted using Lisrel 8.8 (Jöreskog & Sörbom, 2007). The data collected using the PSWCoP were considered ordinal based on the 6-point rating scale. When data are considered ordinal, Jöreskog and Sörbom (2007) advocated the use of PRELIS to calculate asymptotic covariances and polychoric correlations of all items modeled, and LISREL or SIMPLIS with weighted least squares estimation to test the structure of the data. Failure to follow these guidelines may result in underestimated parameters, biased standard errors, and an inflated chi-square (χ²) model fit statistic (Flora & Curran, 2004). The chi-square difference statistic (χ²D) was used to test the statistical significance of the change in model fit between nested models (Kline, 2005). The χ²D was calculated as the difference between the model chi-square (χ²M) values of nested models using the same data; the df for the χ²D statistic is the difference in dfs for the nested models. The χ²D statistic tested the null hypothesis of identical fit of two models to the population. In all, three nested models were evaluated and compared sequentially: a four-factor model with cross-loadings served as the baseline model, followed by a four-factor model without cross-loadings, and a three-factor model without cross-loadings. The four-factor model with cross-loadings
was chosen as the baseline model because it was presumed to demonstrate the best fit, having the fewest degrees of freedom. The primary models of interest were then compared against this baseline to estimate the change in model fit.

Sun (2005) recommended considering fit indices in four categories: sample-based absolute fit indices, sample-based relative fit indices, population-based absolute fit indices, and population-based relative fit indices. Sample-based fit indices are indicators of observed discrepancies between the reproduced covariance matrix and the sample covariance matrix. Population-based fit indices are estimations of the difference between the reproduced covariance matrix and the unknown population covariance matrix. At a minimum, Kline (2005) recommended interpreting and reporting four indices: the model chi-square (sample-based), the Steiger-Lind root mean square error of approximation (RMSEA; population-based), the Bentler comparative fit index (CFI; population-based), and the standardized root mean square residual (SRMR; sample-based). In addition to these fit indices, this study examined the Akaike information criterion (AIC; sample-based) and the goodness-of-fit index (GFI; sample-based). According to Jackson, Gillaspy, and Purc-Stephenson (2009), a review of CFA journal articles published over the past decade identified these six fit indices as the most commonly reported.

The range of values indicating good fit of observed data to the measurement model varies depending on the specific fit index. The model chi-square statistic tests the null hypothesis that the model has perfect fit in the population. Degrees of freedom for the chi-square statistic equal the number of observations minus the number of parameters to be estimated. Given its sensitivity to sample size, the chi-square test is often statistically significant. Kline (2005) suggested using a normed chi-square statistic obtained by dividing chi-square by df; ideally, these values should be less than three. The SRMR is a measure of the differences between observed and predicted correlations; in a model with good fit, these residuals will be close to zero. Hu and Bentler (1999) suggested that an SRMR < 0.08 indicates good model fit. The AIC is an indicator of comparative fit across nested models with an adjustment for model complexity. The AIC is not an indicator of fit for a specific model; instead, the model with the lowest AIC from among the set of nested models is considered to have the best fit. The GFI is an assessment of incremental change in fit with an adjustment for model complexity; values greater than 0.90 indicate good fit. The RMSEA fit index is a measure of the lack of fit of the researcher's model to the population covariance matrix and tests the null hypothesis that the researcher's model has close approximate fit in the population. According to Kline (2005), good models have an RMSEA less than 0.05 and models with an RMSEA greater than 0.10 have poor fit, while Browne and Cudeck (1993) suggested that an RMSEA less than 0.08 represents acceptable fit. The CFI assesses the improvement in fit of the researcher's model over a baseline model that assumes zero covariances among observed variables; values greater than 0.90 represent acceptable model fit.

Four-Factor Model with Cross-Loadings. A baseline CFA model was constructed using the four latent variables (domain, community, competency, and skills), and items were allowed to cross-load on factors based on modification indices in the LISREL output. Based on the six fit indices previously described, the overall fit of the model was good: χ² = 64.48, df = 35, p = .00175; RMSEA = 0.04, 90% CI [.03, .05]; CFI = 0.98; SRMR = 0.043; AIC = 150.48; GFI = 0.97. Note that this solution was mathematically derived and, as such, there was no conceptual justification for the cross-loadings of these items. This model served as the baseline against which competing models were compared.

Four-Factor Model without Cross-Loadings. The next model to be evaluated used the same four factors as the previous model, but items were constrained to load on specific factors. The standardized solution for the four-factor model without cross-loadings is shown in Figure 1. Based on the six fit indices previously described, the overall fit of the model was acceptable: χ² = 185.52, df = 48, p < .001; RMSEA = 0.07, 90% CI [.06, .09]; CFI = 0.91; SRMR = 0.094; AIC = 245.82; GFI = 0.91. When compared with the four-factor model with cross-loadings, this model demonstrated a significant increase in model misfit, χ²D(13) = 121.04, p < .001. However, the fit indices as a whole did not indicate poor fit, and as a conceptually derived and theory-supported model, the four-factor model without cross-loadings was preferable to the four-factor model with cross-loadings.
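As a compact illustration of how these published cutoffs can be applied, the sketch below screens a model's fit statistics against the thresholds cited above (normed χ² < 3 per Kline, RMSEA ≤ .08 per Browne and Cudeck, SRMR < .08 per Hu and Bentler, CFI and GFI > .90). The function name and the idea of returning the failed checks are my own; the cutoffs are rules of thumb, not strict decision rules.

```python
def screen_fit(chi2, df, rmsea, cfi, srmr, gfi):
    """Return the list of rule-of-thumb fit checks a model fails.

    Cutoffs follow the guidelines cited in the text: normed chi-square < 3
    (Kline, 2005), RMSEA <= .08 (Browne & Cudeck, 1993), SRMR < .08
    (Hu & Bentler, 1999), and CFI/GFI > .90.
    """
    checks = {
        "normed chi-square": chi2 / df < 3,
        "RMSEA": rmsea <= 0.08,
        "SRMR": srmr < 0.08,
        "CFI": cfi > 0.90,
        "GFI": gfi > 0.90,
    }
    return [name for name, passed in checks.items() if not passed]

# Four-factor model with cross-loadings (Table 2, Model 1): all checks pass
print(screen_fit(64.48, 35, rmsea=0.04, cfi=0.98, srmr=0.043, gfi=0.97))  # []

# Four-factor model without cross-loadings (Model 2): two checks flag
print(screen_fit(185.52, 48, rmsea=0.07, cfi=0.91, srmr=0.094, gfi=0.91))
```

Run against the Model 2 statistics, only the normed chi-square and SRMR fall outside their bands, which matches the verdict of "acceptable but not good" fit reached in the text.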
Three-Factor Model without Cross-Loadings. The three-factor model corresponded to the original model of the PSWCoP (Figure 2). Three latent variables were included in this model: domain, community, and practice. Items were constrained to load on the factor for which they were designed. The four items originally developed for the practice subscale were constrained to load on a single latent variable, which represented a perfect correlation between the previously used latent variables competency and skills. Based on the six fit indices previously described, the overall fit of the model was poor: χ² = 359.90, df = 51, p < .001; RMSEA = 0.11, 90% CI [.10, .12]; CFI = 0.80; SRMR = 0.12; AIC = 413.90; GFI = 0.85. When compared with the four-factor model without cross-loadings, this model demonstrated a significant increase in model misfit, χ²D(3) = 174.38, p < .001. All of the fit statistics indicated that the data did not fit the model.

Summary of CFA of the PSWCoP. A summary of fit indices across nested models is provided in Table 2. The model with the best overall fit was the four-factor model in which items were allowed to load across all factors. The fit of this model was good, but the model lacked conceptual support and was not interpretable with respect to the underlying latent structure of the PSWCoP. Although the four-factor model with constrained loadings had a significant increase in model misfit over the four-factor model with cross-loadings, the four-factor model with constrained loadings demonstrated acceptable fit. The results of the CFA on the four-factor model without cross-loadings supported the hypothesis of a multidimensional measure: correlations between the latent variables were computed, and no pair of latent variables was significantly correlated (α = .01).

The four-factor model with constrained loadings was compared with a three-factor model based on the originally proposed measurement model for the PSWCoP. The conceptual difference between the two models was the placement of the items developed for the practice subscale. Constraining these four items to load on a single latent variable resulted in a large increase in model misfit. All of the reported fit statistics indicated a model with poor fit.
Table 2
Comparison of Fit Indices across Nested Models

                          Model 1:      Model 2:          Model 3:
                          4-Factor      Unidimensional    Unidimensional
                          Model         4-Factor Model*   3-Factor Model**
χ²(df)                    64.48(35)     185.52(48)        359.90(51)
Normed χ² (χ²/df)         1.84          3.86              7.05
p-value (model)           .002          <.001             <.001
χ²₁ - χ²₂ (df₁ - df₂)                   121.04(13)        174.38(3)
p-value (model diff)                    <.001             <.001
RMSEA                     .04           .07               .11
RMSEA 90% CI              [.03, .05]    [.06, .09]        [.10, .12]
CFI                       .98           .91               .80
SRMR                      .04           .09               .12
AIC                       150.48        245.82            413.90
GFI                       .97           .91               .85

*Compared to Model 1. **Compared to Model 2.
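The derived rows of Table 2 can be recomputed directly from the model chi-squares, which is a useful consistency check when transcribing fit statistics. This is an illustrative sketch, not the authors' analysis script: the function names are mine, and the p-value uses the Wilson–Hilferty normal approximation rather than an exact chi-square distribution, which a real analysis would take from the SEM software or a statistics library.

```python
import math

def chi2_sf(x, df):
    """Upper-tail chi-square probability via the Wilson-Hilferty
    normal approximation (adequate for a rough p-value check)."""
    z = ((x / df) ** (1 / 3) - (1 - 2 / (9 * df))) / math.sqrt(2 / (9 * df))
    return 0.5 * math.erfc(z / math.sqrt(2))

def chi2_difference(chi2_nested, df_nested, chi2_baseline, df_baseline):
    """Chi-square difference test for nested models fit to the same data:
    chi2_D is the difference in model chi-squares, with df equal to the
    difference in model dfs (Kline, 2005)."""
    chi2_d = chi2_nested - chi2_baseline
    df_d = df_nested - df_baseline
    return chi2_d, df_d, chi2_sf(chi2_d, df_d)

# Model 2 vs. Model 1 and Model 3 vs. Model 2, as in Table 2
m2_vs_m1 = chi2_difference(185.52, 48, 64.48, 35)
m3_vs_m2 = chi2_difference(359.90, 51, 185.52, 48)
print(round(m2_vs_m1[0], 2), m2_vs_m1[1])  # 121.04 13
print(round(m3_vs_m2[0], 2), m3_vs_m2[1])  # 174.38 3
```

Both difference statistics reproduce the table's values, and both p-values fall well below .001.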
Multidimensional Item Response Theory Analysis

The PSWCoP data were then analyzed using a one-parameter IRT model with Winsteps 3.68.0 (Linacre, 2006) Rasch measurement software, and MIRT analyses using ACER ConQuest 2.0, generalized item-response modeling software (Wu et al., 2008). The parameters for guessing were all constrained to zero, and the parameters for item discrimination were assumed equal and set to one. A thorough psychometric evaluation should ideally utilize at least a two-parameter model, especially when considering established measures, as item discrimination parameters are rarely the same. However, Reeve and Fayers (2005) suggested that the one-parameter model with equal item discrimination parameters is acceptable in the development and revision phase of scale construction, as was the case with the PSWCoP. The first set of analyses evaluated item difficulty, item fit, and reliability for a unidimensional model. The second set of analyses explored the dimensionality of the PSWCoP by comparing four- and three-factor models. The third set of analyses evaluated item difficulty, item fit, and reliability for the multidimensional models.

Rasch measurement results. Winsteps 3.68.0 (Linacre, 2006) Rasch measurement software was used to assess item difficulty, fit, and reliability for a unidimensional model. For affective measures, item difficulty refers to the amount of the construct needed to endorse, or respond positively to, the item. As the PSWCoP was designed to measure motivation, item difficulty was the amount of motivation needed to respond positively to the question. Person ability, or latent trait, was the amount of motivation a given student possessed. In general, the range of the latent trait in the sample and the range of item difficulties were the same, and the distributions of persons and items about the mean were relatively symmetrical, indicating a good match between the latent trait of students and the difficulty of endorsing items. Exact numerical values for item difficulty are provided in Table 3 and ranged from -1.05 to +0.94. Item difficulty was scaled according to the theta metric and indicated the level of the latent trait at which the probability of a given response to the item was .50. Theta (θ) is the level of the latent trait being measured, scaled with a mean of zero and a standard deviation of one. Negative values indicated items that were easier to endorse, and positive values indicated items that were harder to endorse.

Item fit is an indication of how well an item performs according to the underlying IRT model being tested, and it is based on the comparison of observed responses to expected responses for each item. Adams and Khoo (1996) suggested that items with good fit have infit scores between 0.75 and 1.33; Bond and Fox (2001) suggested that items with good fit have t values between -2 and +2. Table 3 provides the fit statistics for the items of the PSWCoP survey; according to this output, only item P_3_R exceeded Bond and Fox's guideline, and no items exceeded Adams and Khoo's guideline.
Table 3
Rasch Analysis of Full Survey Item Difficulty and Fit
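The interpretation of item difficulty above can be made concrete with the dichotomous Rasch item response function, under which the probability of endorsement is a logistic function of the distance between the latent trait θ and the item difficulty b. This is an illustrative sketch only: a 6-point scale such as the PSWCoP actually calls for the rating scale model, which adds category threshold parameters, and the function name is my own.

```python
import math

def rasch_probability(theta, difficulty):
    """Probability of endorsing a dichotomous item under the Rasch model:
    P(X = 1 | theta) = 1 / (1 + exp(-(theta - difficulty)))."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

# At theta equal to the item difficulty, the endorsement probability is
# exactly .50, which is how the difficulty estimates in Table 3 are scaled.
hardest, easiest = 0.94, -1.05  # endpoints of the reported difficulty range
print(rasch_probability(hardest, hardest))  # 0.5

# A respondent at the sample mean (theta = 0) endorses the easiest item
# with higher probability than the hardest item.
print(rasch_probability(0.0, easiest) > rasch_probability(0.0, hardest))  # True
```

The second comparison shows why negative difficulty estimates mark items that are easier to endorse: for any fixed θ, lowering b raises the endorsement probability.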
IRT analysis produced an item reliability index indicating the extent to which item estimates would be consistent across different samples of respondents with similar abilities. High item reliability indicates that the ordering of items by difficulty will be somewhat consistent across samples. The reliability index of items for the PSWCoP pilot survey was 0.99, and indicated consistency in the ordering of items by difficulty. IRT analysis also produced a person-reliability index that indicated the extent of consistency in respondent ordering based on level of latent trait if given an equivalent set of items (Bond & Fox, 2001). The reliability index of persons for the PSWCoP was 0.60, and indicated low consistency in the ordering of persons by level of latent trait, which was possibly due to a constricted range of the latent trait in the sample or a constricted range of item difficulty.

MIRT factor structure. One of the core assumptions of IRT is unidimensionality; in other words, that person ability can be attributed to a single latent construct, and that each item contributes to the measure of that construct (Bond & Fox, 2001). However, whether intended or not, item responses may be attributable to more than one latent construct. MIRT analyses allow the researcher to assess the dimensionality of the measure. Multidimensional models can be classified as either "within items" or "between items" (Adams, Wilson, & Wang, 1997). Within-items multidimensional models have items that can function as indicators of more than one dimension, and between-items multidimensional models have subsets of items that are mutually exclusive and measure only one dimension.

Competing multidimensional models can be evaluated based on changes in model deviance and the number of parameters estimated. A chi-square statistic is calculated as the difference in deviance (G²) between two nested models, with df equal to the difference in the number of parameters for the nested models. A statistically significant result indicates a difference in model fit. When a difference in fit is found, the model with the smallest deviance is selected; when a difference in model fit is not found, the more parsimonious model is selected.

The baseline MIRT model corresponded to the four-factor model with no cross-loadings estimated in the CFA (Figure 1). This baseline model was a between-items multidimensional model with items placed in mutually exclusive subsets. The four dimensions in the model were community, competency, domain, and skills. The baseline model fit statistic was G² = 17558.64 with 26 parameters. A three-dimensional, between-items multidimensional model, corresponding to the theoretical model of the PSWCoP (Figure 2), was tested against the baseline model. The three-dimensional model fit statistic was G² = 17728.83 with 22 parameters. When compared with the four-dimensional model, the change in model fit was statistically significant and indicated that the fit of the three-dimensional model was worse than the fit of the four-dimensional model (χ²(4) = 170.19, p < .001).
Table 4
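The model-selection rule described above (prefer the smaller deviance when the change is significant, and the more parsimonious model otherwise) can be sketched as follows. This is an illustration with my own function name; 9.49 is the .05 critical value for χ²(4), and a full implementation would compute an exact p-value from the chi-square distribution rather than compare against a fixed critical value.

```python
def compare_deviance(g2_a, k_a, g2_b, k_b, critical_value):
    """Choose between two nested MIRT models ('a' and 'b') by the change
    in deviance (G2); df equals the difference in parameter counts.

    If the change exceeds the critical value, keep the better-fitting
    (smaller-deviance) model; otherwise keep the more parsimonious one.
    """
    delta_g2 = abs(g2_a - g2_b)
    df = abs(k_a - k_b)
    if delta_g2 > critical_value:
        winner = "a" if g2_a < g2_b else "b"
    else:
        winner = "a" if k_a < k_b else "b"  # fewer parameters wins
    return delta_g2, df, winner

# Four-dimensional (26 parameters) vs. three-dimensional (22 parameters)
# between-items models, using the deviances reported in the text.
delta, df, winner = compare_deviance(17558.64, 26, 17728.83, 22, critical_value=9.49)
print(round(delta, 2), df, winner)  # 170.19 4 a
```

With a deviance change of 170.19 on 4 df, the rule retains model "a", the four-dimensional model, matching the conclusion in the text.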
Based on the change in model fit between nested models, the four-dimensional, between-items model had the better fit. This model resulted in a more accurate reproduction of the probability of endorsing a specific level or step of an item for a person with a particular level of the latent trait (Reckase, 1997). Thus, the four-dimensional model yielded the greatest reduction in discrepancy between observed and expected responses.

Item difficulty. MIRT analyses yielded an item-person map by dimension. The output of the MIRT item-person map (Figure 3) provided a visual estimate of the latent trait in the sample, item difficulty, and each dimension. Items are ranked in the right-hand column by difficulty, with items at the top being more difficult than items at the bottom. Although the range of item difficulty was narrow, items were well dispersed around the mean. Each dimension or factor has its own column with estimates for respondents' abilities. Two inferences were made based on the MIRT item-person map. First, although the range of item difficulty was narrow, items appeared to be dispersed in terms of difficulty, with a range of -0.81 to +0.84. Furthermore, regarding Dimensions 1, 2, and 3, the item difficulties appeared to be well matched to levels of the latent trait, though over a limited range of the construct as scaled via the theta metric. Second, based on the means of the dimensions, Dimension 2 (competency, x̄₂ = 0.069) and Dimension 3 (domain, x̄₃ = -0.074) did a better job of representing all levels of these types of motivation than the other two dimensions. The small positive mean of Dimension 1 (community, x̄₁ = 0.335) indicated that students sampled for this study found it somewhat easier to endorse those items, whereas the large positive mean of Dimension 4 (skills, x̄₄ = 1.42) indicated that students sampled for this study found it very easy to endorse those items.
Item fit. Table 5 summarizes the items' characteristics. In addition to the estimation of item difficulties, infit and outfit statistics are reported. Using Adams and Khoo's (1996) guideline, only item C_2 showed poor fit (MNSQ = 0.68). In contrast, Bond and Fox's (2001) guideline identified several items as having poor fit (based on a 95% CI for MNSQ): C_1, C_2, D_1, D_3, and D_4.
Table 5
Item Parameter Estimates for 4 Dimensional Model

                   Model           Infit            Outfit
Item   Label    Est.    S.E.   MNSQ    ZSTD     MNSQ    ZSTD
1      C_1       0.40   0.03   0.77    -3.8     0.77    -4.5
2      C_2       0.21   0.03   0.68    -5.6     0.67    -6.5
3      C_3      -0.61*  0.04   1.02     0.4     1.04     0.6
4      P_1      -0.14   0.03   1.01     0.2     1.00     0.0
5      P_2       0.14*  0.03   0.96    -0.5     0.93    -1.1
6      D_1       0.11   0.03   1.21     3.0     1.18     3.0
7      D_2       0.51   0.03   1.02     0.4     1.04     0.7
8      D_3      -0.65   0.03   1.29     4.1     1.30     4.4
9      D_4      -0.81   0.03   1.17     2.5     1.22     3.1
10     D_5       0.84*  0.06   0.95    -0.7     0.98    -0.2
11     P_3_R     0.33   0.04   1.00    -0.0     1.02     0.4
12     P_4      -0.33*  0.04   0.99    -0.2     1.00    -0.0

*Indicates that a parameter estimate is constrained.
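The Adams and Khoo (1996) screen applied to Table 5 can be reproduced in a few lines. The infit MNSQ values below are transcribed from the table, and the helper name is my own; a flag means the item falls outside the 0.75–1.33 good-fit band.

```python
# Infit MNSQ values transcribed from Table 5
infit_mnsq = {
    "C_1": 0.77, "C_2": 0.68, "C_3": 1.02, "P_1": 1.01, "P_2": 0.96,
    "D_1": 1.21, "D_2": 1.02, "D_3": 1.29, "D_4": 1.17, "D_5": 0.95,
    "P_3_R": 1.00, "P_4": 0.99,
}

def flag_misfit(mnsq_by_item, lower=0.75, upper=1.33):
    """Items whose mean-square fit statistic falls outside the Adams and
    Khoo (1996) good-fit band of 0.75 to 1.33."""
    return [item for item, mnsq in mnsq_by_item.items()
            if not lower <= mnsq <= upper]

print(flag_misfit(infit_mnsq))  # ['C_2']
```

Only C_2 (infit MNSQ = 0.68) falls below the band, matching the result reported in the text; note that the t-based (ZSTD) guideline of Bond and Fox flags a different, larger set of items.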
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.

Henard, D. H. (2000). Item response theory. In L. G. Grimm & P. R. Yarnold (Eds.), Reading and understanding more multivariate statistics (pp. 67-98). Washington, DC: American Psychological Association.

Jackson, D. L., Gillaspy, J. A., & Purc-Stephenson, R. (2009). Reporting practices in confirmatory factor analysis: An overview and some recommendations. Psychological Methods, 14(1), 6-23. doi:10.1037/a0014694

Jöreskog, K. G., & Sörbom, D. (2007). LISREL 8.80 for Windows [Computer software]. Lincolnwood, IL: Scientific Software International.

Klem, L. (2000). Structural equation modeling. In L. G. Grimm & P. R. Yarnold (Eds.), Reading and understanding more multivariate statistics (pp. 227-260). Washington, DC: American Psychological Association.

Kline, R. B. (2005). Principles and practices of structural equation modeling (2nd ed.). New York, NY: Guilford Press.

Linacre, J. K. (1994). Sample size and item calibration stability. Rasch Measurement Transactions, 7(4), 328.

Linacre, J. K. (1999). Investigating rating scale category utility. Journal of Outcome Measurement, 3(2), 103-122.

Linacre, J. K. (2006). Winsteps Rasch measurement 3.68.0 [Computer software]. Chicago, IL: Author.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.

Meade, A. W., & Lautenschlager, G. J. (2004). A comparison of item response theory and confirmatory factor analytic methodologies for establishing measurement equivalence/invariance. Organizational Research Methods, 7(4), 361-388. doi:10.1177/1094428104268027

Müller, U., Sokol, B., & Overton, W. F. (1999). Developmental sequences in class reasoning and propositional reasoning. Journal of Experimental Child Psychology, 74, 69-106. doi:10.1006/jecp.1999.2510

Raju, N. S., Laffitte, L. J., & Byrne, B. M. (2001). Measurement equivalence: A comparison of methods based on confirmatory factor analysis and item response theory. Journal of Applied Psychology, 87, 517-529. doi:10.1037/0021-9010.87.3.517

Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago, IL: MESA Press.

Reckase, M. D. (1997). The past and future of multidimensional item response theory. Applied Psychological Measurement, 21, 25-27. doi:10.1177/0146621697211002

Reeve, B. B., & Fayers, P. (2005). Applying item response theory modeling for evaluating questionnaire items and scale properties. In P. M. Fayers & R. D. Hays (Eds.), Assessing quality of life in clinical trials: Methods and practice (2nd ed., pp. 55-73). New York, NY: Oxford University Press.

Reise, S. P., Ainsworth, A. T., & Haviland, M. G. (2005). Item response theory: Fundamentals, applications, and promises in psychological research. Current Directions in Psychological Science, 14(2), 95-101.

Reise, S. P., Widaman, K. F., & Pugh, R. H. (1993). Confirmatory factor analysis and item response theory: Two approaches for exploring measurement invariance. Psychological Bulletin, 114, 552-566. doi:10.1037/0033-2909.114.3.552

Samejima, F. (1969). Estimation of latent ability using a response format of graded scores (Psychometric Monograph No. 17). Richmond, VA: Psychometric Society. Retrieved from http://www.psychometrika.org/journal/online/MN17.pdf

Smith, R. M., Schumaker, R. E., & Bush, M. J. (1998). Using item mean squares to evaluate fit to the Rasch model. Journal of Outcome Measurement, 2(1), 66-78. PMid:9661732

SPSS. (2007). SPSS for Windows, Rel. 16.0.0 [Computer software]. Chicago, IL: SPSS, Inc.

Sun, J. (2005). Assessing goodness of fit in confirmatory factor analysis. Measurement and Evaluation in Counseling and Development, 37(4), 240-256.

Swaminathan, H., & Gifford, J. A. (1979). Estimation of parameters in the three-parameter latent-trait model (Laboratory of Psychometric and Evaluation Research Report No. 90). Amherst: University of Massachusetts.

Tabachnick, B. G., & Fidell, L. S. (2001). Using multivariate statistics (4th ed.). Boston, MA: Allyn and Bacon.

Teresi, J. A. (2006). Overview of quantitative measurement methods: Equivalence, invariance and differential item functioning in health applications. Medical Care, 44, S39-S49.

Unick, G. J., & Stone, S. (2010). State of modern measurement approaches in social work research literature. Social Work Research, 34(2), 94-101.

Ware, J. E., Jr., Bjorner, J. B., & Kosinski, M. (2000). Practical implications of item response theory and computerized adaptive testing: A brief summary of ongoing studies of widely used headache impact scales. Medical Care, 38, 1173-1182. doi:10.1097/00005650-200009002-00011

Wenger, E., McDermott, R., & Snyder, W. M. (2002). Cultivating communities of practice. Boston, MA: Harvard Business School Press.

Wirth, R. J., & Edwards, M. C. (2007). Item factor analysis: Current approaches and future directions. Psychological Methods, 12(1), 58-79. doi:10.1037/1082-989X.12.1.58

Wright, B. D., & Masters, G. N. (1982). Rating scale analysis: Rasch measurement. Chicago, IL: MESA Press.

Wu, M. L., Adams, R. J., Wilson, M., & Haldane, S. (2008). ACER ConQuest 2.0: Generalized item response modeling software [Computer software]. Hawthorn, Australia: ACER.