MEASUREMENT: INTERDISCIPLINARY RESEARCH AND PERSPECTIVES
2023, VOL. 21, NO. 2, 117–128
https://doi.org/10.1080/15366367.2022.2133528

Item Response Theory and Modeling with Stata


Tenko Raykov
Measurement and Quantitative Methods, Michigan State University

ABSTRACT
This software review discusses the capabilities of Stata to conduct item response theory modeling. The commands needed for fitting the popular one-, two-, and three-parameter logistic models are initially discussed. The procedure for testing the discrimination parameter equality in the one-parameter model is then outlined. The commands for fitting several polytomous models are subsequently indicated, as are those facilitating model comparison. Scoring of individual units of analysis with Stata's item response theory module is next discussed, and various graphical features of this module are pointed out. The review concludes with an illustration example of using Stata for item response modeling on an empirical data set.

KEYWORDS
Item characteristic curve; item information function; item response theory; one-parameter logistic model; Stata; test information function; test characteristic curve; three-parameter logistic model; two-parameter logistic model

Introduction
The statistical analysis and modeling software Stata is now well into its fourth decade. Since the 1980s,
it has developed into one of the most widely circulated and popular statistical packages. A main reason
for this popularity is the fact that it possesses many important features making it the software of choice
for an impressively large number of empirical and theoretical researchers across the behavioral, social,
and educational sciences as well as in numerous related disciplines. One of these features is that, to begin with, Stata is a package with extensive data management capabilities, in addition to offering publication-quality graphics, including tables and reports that can be readily customized to fit one's needs. Further, Stata makes available to the analyst a comprehensive collection of statistical modeling tools, in particular those of item response theory and modeling that the present review is concerned
with. Moreover, Stata includes a rich software development environment that ensures
a straightforward process of accessing extensive user-developed analysis and modeling functions
and subroutines.
With Stata, raw data can be imported from most common statistical packages and databases. The
resulting files can then be merged, appended, and/or processed into highly organized, labeled, and
well-documented datasets. At the analysis and modeling level, empirical data can be analyzed using
a large variety of statistical modeling tools. These include generalized linear models, panel data and
time-series models, multilevel and longitudinal models, structural equation models, latent class and
finite mixture models, Bayesian models, and item response theory models, to name but a few. As part
of the process of model diagnostics, visualizations of the results can be customized and exported to the
most popular high-resolution graphics software. Analysis and modeling results can also be presented
in customized tables, and reports can be created and exported to Microsoft Word, Adobe PDF, HTML,
markdown, and other formats. Stata also offers to the statistical methods and numerical algorithms
programmer the possibility of using its matrix programming language Mata, or alternatively Python,
C++, or Java plugins.

CONTACT Tenko Raykov raykov@msu.edu Measurement and Quantitative Methods, Michigan State University, East
Lansing, MI 48824, USA

Over the past decade, Stata has also developed its own item response theory (IRT) module (for
an introduction to IRT, see de Ayala, 2022; for its use with Stata consult Raykov & Marcoulides,
2018). This module offers numerous important features to the researcher and instructor interested
more generally in latent variable modeling, including in particular – but not limited to – optimally
scoring studied units of analysis (e.g., students, patients, respondents, examinees, or employees) or
groups of them (e.g., schools, physicians, treatment centers, cities, firms, or companies). With the
Stata IRT module, a variety of widely used item response models can be fitted and their parameters optimally point and interval estimated; the opportunity to test statistical hypotheses about their population values (incl. significance) as well as parameter restrictions is also readily available.
The remainder of this review is concerned with a selection from the multitude of features of Stata's IRT module, namely those most likely to be frequently used by behavioral, social, and educational scientists (e.g., Raykov & Marcoulides, 2018), as well as by marketing, business, and economics
researchers. Further information about the module can be found in its Manual (Stata Item Response
Theory Reference Manual, 2019), which is also accessible from the software Help drop-down menu
once Stata is installed in a computing environment.

Item response model fitting with Stata


In this section, the aim is to indicate how easy and straightforward it is to use Stata for item response theory and modeling when a researcher or instructor is interested in fitting any of an array of popular IRT models, as applied in the behavioral, social, and educational sciences as well as cognate disciplines (see, e.g., Raykov & Marcoulides, 2018, for details on the corresponding use of Stata and related activities). We will thereby be concerned with eight key models that may be considered at present the most frequently used in empirical research in these sciences.
Popular IRT models for binary or binary scored items
Stata’s module for IRT, as could be expected, is called or started with the command

.irt

where the dot is the Stata prompt. (Upon reading in a relevant data set, this command is entered in
the command window of the software, which is the bottom center one opened when starting Stata; see,
e.g., Acock, 2016, for details pertaining to using the software once installed.)
Immediately after this command, typically a researcher will state the abbreviation of the IRT model
to be fitted. For instance, to fit the one-parameter logistic model (Rasch model; abbr. 1PL-model in the
sequel), one utilizes the command

.irt 1pl

The abbreviation "1pl" is replaced by "2pl" when the two-parameter logistic model (2PL-model) is to be fitted. Similarly, it is replaced by "3pl" to fit the three-parameter logistic model (3PL-model). (See further below for how to fit other popular IRT models, such as polytomous IRT models, and see also Footnote 3.)
In order to initiate the process of fitting the model of interest, the user needs to state immediately
after its abbreviation the observed variables (latent construct indicators or items) that are to be used
thereby. For instance, to fit the 1PL-model to a data set on say five binary items, denoted i1 through i5,
use this command

.irt 1pl i1-i5

which could be abbreviated to



.irt 1pl i*

employing the wild card notation (and assuming there are no other variables in the analyzed data file whose names start with the letter "i"; it is advisable to keep in mind, though, that some data sets also contain an identifier variable often named "id," in which case the shortest wild card usable in the last command would be "it*," under the last assumption mentioned). One proceeds similarly with more complex models, such as the 2PL- and 3PL-models (as well as those mentioned below).
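For instance, retaining the hypothetical binary items i1 through i5 from above, the 2PL- and 3PL-models would be fitted with the commands

.irt 2pl i1-i5

.irt 3pl i1-i5

respectively.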
The application of any IRT model fitting command, traditionally using the popular maximum likelihood (ML) method, yields upon convergence – and assuming model identification – extensive output that includes the iteration history as well as the maximized log-likelihood and related information. In addition, the item difficulty parameter estimates with their standard errors and 95%-confidence intervals as well as related statistics are presented in tabular form, as are the item discrimination parameter estimates. The fixing of the latent variable mean at 0 and variance at 1, as conventionally done in IRT applications, is used as the default in the variety of IRT models fitted with this Stata module.
The test of equal item discrimination parameters, which is an essential part of the 1PL-model
(Rasch model; see Raykov & DiStefano, 2021, for the particular relevance of this test as part of the
process of evaluating the fit of the 1PL-model), is readily carried out with the likelihood ratio test
(LRT) and pertinent command

.lrtest

As a matter of routine, it is recommended to assign a label, such as say “m1pl,” to the 1PL-model
immediately after fitting it (through the -estimate store- command) and, say, the label "m2pl" to the
2PL-model, in which case the full LRT command will be

.lrtest m1pl m2pl

following the software requirement for the nested model to be stated first. We also mention in
passing and caution that a 1PL-model (Rasch model) could be spuriously found to be plausible in an
empirical study not accounting for population heterogeneity (latent class mixtures; Raykov &
Marcoulides, 2020).
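To fix ideas about the labeling and testing procedure just described, with the hypothetical items i1 through i5 from above the complete sequence could look as follows (the same sequence is used with the actual item names in the empirical illustration later in this review):

.quietly irt 1pl i1-i5

.estimate store m1pl

.quietly irt 2pl i1-i5

.estimate store m2pl

.lrtest m1pl m2pl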
In the three-parameter logistic (3PL-) model, unlike the previously discussed models, the possibility
of respondents (examinees) arriving at the correct answer based on some chance-related processes is
allowed. This is reflected in an added parameter that is usually referred to as the pseudo-guessing
parameter (e.g., de Ayala, 2022). The model is fitted with the command

.irt 3pl

which is followed by the listing of the items with such a parameter. The default setup is to constrain the pseudo-guessing parameters to be equal across all items possessing them. This default can be overridden if need be using the option "sepguessing," but doing so is not in general recommended, due to identification problems that are likely to arise. A Bayesian approach to fitting the 3PL-model is instead worth considering in such situations (e.g., Balov, 2016).
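If needed nonetheless, that override would be requested as follows for the hypothetical items i1 through i5 (a sketch shown for completeness rather than as a recommendation):

.irt 3pl i1-i5, sepguessing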
In relation to the preceding discussion of the 3PL-model, in this author's view its application should only be undertaken after thorough, expert knowledge-based consideration of alternatives. This
is because a 3PL-model fitted to data from an empirical study can be associated, like the 1PL- (Rasch)
model, with potentially serious problems resulting from not accounting for considerable unobserved
population heterogeneity. In the latter case, an item may be found to possess a spurious (non-zero)
pseudo-guessing parameter, or the subsequent unit of analysis scoring can produce potentially
misleading predictions (estimates) of the individual ability levels, at least for some units of analysis (e.g., Raykov & DiStefano, 2021; Raykov & Marcoulides, 2020).1,2 A key reason for such a finding may
well be the fact that while test or instrument developers may be of the view that one or more particular
items “invite” guessing or related chance-based processes, some examinees may still not be guessing on
(some of) these items while other examinees could be so on a given item(s) that presumably follows the
3PL-model.
The 3PL-model may have a certain appeal in hybrid models where a limited number of items may
be considered associated with pseudo-guessing parameters (e.g., de Ayala, 2022), but the same caution
is relevant then as that raised above in this section.3
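For instance, such a hybrid specification could look as follows, assuming hypothetical binary items i1 through i5 of which only i4 and i5 are deemed prone to guessing (with model statements combined in parentheses):

.irt (2pl i1-i3) (3pl i4 i5)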
Polytomous IRT models
Polytomous items (nominal or categorical/ordinal discrete items) are frequently used in behavioral,
social, and educational research as well as in the marketing, biomedical, economics and related
disciplines. This is in particular the case in situations where a true (false) answer is not present
among those offered to respondents, and/or in studies concerned with typical rather than maximal
performance (as in achievement testing settings; e.g., Nering & Ostini, 2010).
One of the earliest polytomous IRT models is the nominal response model (NRM; Bock, 1972). It is
readily fitted with Stata’s IRT module using this command:

.irt nrm

which is followed by a listing of nominal items. It may well be argued that the NRM should only
be used when all of the items analyzed with it are nominal. In some empirical studies, however, for
validity related reasons, a scholar may also include items that are categorical (ordinal) and with
more than two possible response options. In those cases, a hybrid model can be used, whereby all
nominal items are fitted using the NRM while simultaneously all other items are fitted using
a respective polytomous ordinal IRT model (see below; for a discussion of hybrid models, see e.g.
de Ayala, 2022).
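For example, with hypothetical nominal items n1 through n3 and ordinal items o1 through o4, such a hybrid model could be specified as

.irt (nrm n1-n3) (grm o1-o4)

where the graded response model (GRM) used here for the ordinal items is discussed below.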
When the items analyzed are not nominal but still discrete ordinal, an option to consider is the
generalization of the 1PL-model to the polytomous item case. Stata’s IRT module offers two such
models, the partial credit model (PCM) and rating scale model (RSM). The latter may be viewed as
a constrained version of the former as a result of imposing equality constraints on appropriate
differences in difficulty-related parameters when all items have the same number of response options
possessing the same meaning (Andrich, 1978). The PCM is fitted with the command

.irt pcm

followed by a listing of the items analyzed with it. The RSM is fitted similarly with the command

.irt rsm

and has recently been used as a means of providing a framework for defining a generalized location
parameter for polytomous items (Zhang & Petersen, 2018). An extension of the PCM to the case where, unlike in the PCM and RSM, not all items share the same discrimination parameter is the generalized partial credit model (GPCM). It is readily fitted with the command

.irt gpcm

followed by a listing of the items analyzed with it.


A widely used alternative with differing discrimination parameters is the graded response model
(GRM). It may be considered a generalization of the 2PL-model to the polytomous (ordinal) item case
(like the NRM may be seen as a different generalization of the latter model; Stata Item Response Theory
Reference Manual, 2019). The GRM is fitted with the command

.irt grm

similarly followed by the names of the items analyzed with it.


Model comparison
Model choice between two or more IRT models can be readily carried out with Stata’s IRT module.
A comparison of the BIC (and AIC) information indices in particular is facilitated for these purposes
with the command

.estat ic

that is to be used immediately after fitting each of the models considered. It is to be pointed out,
however, that in order for this information criteria-based approach to be correctly used, all models
compared should be fit to the same data set (same item set data) and have the same dependent
variables. (As a side note for our purposes here, in the resulting output the entry in the column “df” is
actually not the degrees of freedom of the pertinent fitted model, but rather the number of its
parameters.) An example of the application of this approach to polytomous IRT model selection is
found, for example, in Raykov and Marcoulides (2018, ch. 11).
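As a sketch of this approach, with hypothetical ordinal items o1 through o6 the PCM and GPCM could be compared as follows:

.quietly irt pcm o1-o6

.estat ic

.quietly irt gpcm o1-o6

.estat ic

whereby the model associated with the lower BIC (and AIC) is preferable in terms of relative fit.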
When interested in comparing the 3PL-model with competing models for binary or binary scored
items, following our earlier critical comments regarding the former, individual item pseudo-guessing
parameters can also be examined for significance in the 3PL-model. This procedure may be considered
as a selection based on “local discrepancies” between the 3PL-model and its considered rivals, unlike
the comparison of their information criterion indices that are global measures of relative model fit.
(See also Footnote 3 with respect to two more complex models, which this approach can also be used
on when comparing either of them with its other four rivals.)

Prediction (estimation) of individual ability levels


A key aim of an application of item response theory and modeling in an empirical study is to enable researchers (i) to make inferences about the unobserved individual levels of a studied latent ability(-ies), based on the respondents' answers to the items used, as well as (ii) to reveal important
measurement-related characteristics of these items and the entire instrument (test, scale, inventory,
self-report, questionnaire, composite, or survey; e.g., Reckase, 2009). That is, in informal terms, an IRT
user typically wishes to obtain predictions (at times referred to informally as estimates) for the
individual ability levels for all studied units of analysis, as well as to be able to make conclusions
about the way each of the items as well as the entire measuring instrument under consideration are
functioning as means of evaluating the ability (abilities) of concern.
To this end, after fitting and selecting an IRT model, evaluation of the individual ability levels can
be readily obtained along with their standard errors using the command

.predict

which is to be followed by the names of the two newly added data file columns containing these
predictions (estimates) and their standard errors, respectively. (As a general rule, in Stata the names of
latent variables must begin with a capital letter.) For example,

.predict Theta, latent se(ThetaSE)

would predict (estimate) the individual unit of analysis latent ability levels, with standard errors,
and name them correspondingly Theta and ThetaSE as the last two added variables to the original
data set. (Note the use of the comma in this command, introducing the needed options stated immediately after it.) The individual predictions, or estimates, in the so-created variable (new data file column) named "Theta," are the evaluated individual ability scores/predictions, which are often
referred to as “theta hats” in the IRT and related literature. That is, they are the result of the IRT
scoring process conducted with the last command, and capitalize on the examinee responses on the
utilized items. As such, these predictions may be difficult to use in some more practice-oriented settings. For this reason, they can be transformed, say using the popular T-transformation or another linear (or non-linear) transformation enhancing interpretability, and where needed then rounded to integer numbers. Various plots of these individual predictions can also
be obtained using the rich set of graphical features available in Stata, in particular within its IRT
module (see Stata Item Response Theory Reference Manual, 2019, and next).
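For instance, a T-score version of the predictions, rounded to integer numbers, could be created as follows (a minimal sketch assuming the default latent mean of 0 and variance of 1, and with the hypothetical variable name ThetaT):

.generate ThetaT = round(50 + 10*Theta)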

Graphical presentation of item response modeling functions


One of the most direct, distinct, and informative features of Stata’s IRT module is the ease with which
a number of essential IRT functions can be graphed. To start with, what may be considered perhaps
the most fundamental IRT concept, that of item characteristic curve (ICC), is readily obtained for each
used item after fitting the pertinent model with this command:

.irtgraph icc

Adding to it an option specifying the range of interest for the underlying latent (theta) scale can be rather helpful for graphical purposes in some empirical studies. For example,

.irtgraph icc, range(-8 4)

could be very informative in studies where at least one item is substantially less difficult than others
(and with a markedly negative difficulty parameter) in a 1PL- or 2PL-model, say (see, e.g., next
section).
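In addition, the ICCs of a subset of items only can be requested by listing their names immediately after the command; the option "blocation" furthermore adds vertical reference lines at the estimated item difficulty parameters. For example, with hypothetical items i1 and i2:

.irtgraph icc i1 i2, blocation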
The item information function (IIF) contains important estimation precision-related information,
e.g., for the purposes of instrument construction and development (e.g., Raykov & Marcoulides, 2018),
and is obtained with this command:

.irtgraph iif

The accumulation of the individual item information functions across the components of a given
instrument renders the test information function (TIF), which is plotted similarly with the command

.irtgraph tif

Last but not least, the test characteristic curve (TCC) that relates the underlying latent variable scale
(at times referred to as “theta scale”) with that of the familiar number correct (expected) score metric is
obtained this way:

.irtgraph tcc

Use of these graphing features of Stata’s IRT module is strongly recommended both in empirical
research and in instruction, as highly informative means of plotting key IRT functions and thus
allowing more informed IRT-related decisions on the part of the researcher, instructor, or user, for
instance with respect to test construction and revision.

Illustration on empirical data


In this section, we use the popular LSAT data set that can be readily obtained by downloading the R-package "ltm" (Rizopoulos, 2007; the data set can also be obtained from the author upon request). The data consist of the true/false responses (coded 1 and 0, respectively) of 1000 cases on 5 items denoted item1 through item5, with no guessing assumed on them.
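Assuming the data set has been exported from R to a comma-delimited text file named, say, lsat.csv (a hypothetical file name; the export step is not shown here), it could be read into Stata with the command

.import delimited using lsat.csv, clear

Once the data set is read into Stata, we fit the 1PL-model as indicated above, with the following command: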

.irt 1pl item1-item5

which yields the following item parameter estimates:

              Coefficient  Std. err.      z   P>|z|   [95% conf. interval]
Discrim          .7551283   .0694206   10.88  0.000    .6190666    .8911901
item1  Diff     -3.615293   .3265991  -11.07  0.000   -4.255416   -2.975171
item2  Diff     -1.322434   .1421673   -9.30  0.000   -1.601077   -1.043792
item3  Diff     -.3176353   .0976766   -3.25  0.001   -.5090779   -.1261928
item4  Diff     -1.730106   .1691149  -10.23  0.000   -2.061565   -1.398647
item5  Diff     -2.780193    .251015  -11.08  0.000   -3.272174   -2.288213

These results indicate that the common item discrimination parameter is estimated at .755 (rounded to the third decimal place, as in the remainder), with a standard error (SE) of .069 and a 95%-confidence interval of (.619, .891). The item difficulty parameters are all associated with negative estimates and confidence intervals entirely below 0, indicating negative item difficulty (relative to the usual scale origin fixing by setting the latent mean at 0). To examine the assumption of equal item discrimination parameters, which is characteristic of the Rasch model, we fit the 2PL-model to the same data as indicated above with the following command:

.irt 2pl item1-item5

The resulting item parameter estimates are as follows:

              Coefficient  Std. err.     z   P>|z|   [95% conf. interval]
item1
  Discrim        .8256703   .2581376   3.20  0.001    .3197299    1.331611
  Diff          -3.358777   .8665242  -3.88  0.000   -5.057133   -1.660421
item2
  Discrim        .7227513   .1866698   3.87  0.000    .3568852    1.088618
  Diff          -1.370049    .307467  -4.46  0.000   -1.972673   -.7674249
item3
  Discrim        .8907338   .2326049   3.83  0.000    .4348366    1.346631
  Diff          -.2796988   .0996259  -2.81  0.005   -.4749621   -.0844356
item4
  Discrim        .6883831   .1851495   3.72  0.000    .3254968     1.05127
  Diff          -1.866349   .4343093  -4.30  0.000    -2.71758   -1.015118
item5
  Discrim        .6568946   .2099182   3.13  0.002    .2454624    1.068327
  Diff          -3.125751   .8711505  -3.59  0.000   -4.833174   -1.418327

We readily note that the 2PL-model item discrimination parameter estimates do not differ
markedly, and that their confidence intervals overlap to a considerable degree. This suggests that it
is worth testing the nested 1PL-model against the 2PL-model (a test that may be considered relevant in general, regardless of whether such a finding is obtained; Raykov & DiStefano, 2021). This test is achieved by first storing the estimates into newly created objects upon refitting each
of these models (without needing the model outputs once again), and then carrying out the LRT of the
former against the latter model. All this is accomplished with the following sequence of Stata
commands:

.quietly irt 1pl item1-item5

.estimate store m1pl

.quietly irt 2pl item1-item5

.estimate store m2pl

.lrtest m1pl m2pl

Only the last command produces output printed to the screen (due to the others invoking
operations not yielding output of relevance here, via use of the prefix “quietly”). The LRT output is
as follows:

Likelihood-ratio test

Assumption: m1pl nested within m2pl

LR chi2(4) = 0.57

Prob > chi2 = 0.9666

These results suggest that the null hypothesis of the item discrimination parameters being the same
is tenable, and hence that the 1PL-model is preferable to the 2PL-model. This is consistent with an
examination of the BICs of both models, which are obtained with the command

.estat ic

that is submitted to Stata after fitting each of the two models.


As a result, the following two pieces of output are obtained:

For the 1PL-model:


Akaike's information criterion and Bayesian information criterion

Model       N    ll(null)   ll(model)   df        AIC        BIC
    .   1,000           .   -2466.938    6   4945.875   4975.322

For the 2PL-model:


Akaike's information criterion and Bayesian information criterion

Model       N    ll(null)   ll(model)   df        AIC        BIC
    .   1,000           .   -2466.654   10   4953.307   5002.385

As seen from these results, both the AIC and BIC of the 1PL-model are lower than those of the 2PL-model, with the BIC of the 1PL-model being lower by more than 27 units (Raftery, 1995). This finding indicates that the 1PL-model is preferable to the 2PL-model on the basis of information criteria as well. Together with the nested model test above, these comparisons suggest selecting the 1PL-model as a means of data explanation and description.
In this 1PL-model, we can readily graph the ICCs of each of the items, using the next commands
(keeping in mind that the items have markedly negative difficulty parameter estimates, as seen above)

.quietly irt 1pl item1-item5

.irtgraph icc, range(-8 4)

The resulting ICCs are as follows:

[Figure: item characteristic curves for item1 through item5 under the fitted 1PL-model.]

From this figure, we readily see the characteristic feature of the 1PL-model, viz. the "parallelism" of the ICCs (apart from the extremely large or small latent continuum ranges). We note in passing that the gray horizontal line drawn at probability .5 intersects with the ICCs precisely at the item difficulty parameter estimates (e.g., Stata Item Response Theory Reference Manual, 2019).
Given our preference for the 1PL-model, we now adopt it for the estimation (prediction) of the 1000
individual ability levels, which as mentioned above we furnish with this command:

.predict Theta, latent se(ThetaSE)

The last command does not produce output printed to the screen containing these individual ability predictions (estimates); instead, it adds these predictions and their SEs as the last two columns of the original data file. This is seen by using the command

.describe

As a result of the last command, we obtain this data file statement by Stata:

Variables: 10

Variable   Storage   Display   Value
name       type      format    label   Variable label

id         float     %9.0g
item1      float     %9.0g
item2      float     %9.0g
item3      float     %9.0g
item4      float     %9.0g
item5      float     %9.0g
ThetaSE    float     %9.0g             S.E. of empirical Bayes means for Theta
Theta      float     %9.0g             empirical Bayes means for Theta

Sorted by:
Note: Dataset has changed since last saved.
In addition to the description of the individual ability predictions (estimates) as empirical Bayes
means (Stata Item Response Theory Reference Manual, 2019) indicated in the right margin of the last 2
lines, we observe the statement that the data set was changed relative to its original version. The change
is thereby merely the addition of the last two columns, consisting of the 1000 SEs for the individual ability predictions (estimates) and of these predictions themselves.
These individual ability predictions/estimates (occasionally also called "theta-hats" in the applied literature) can now be used as the results of the scoring of each of the cases in the analyzed data set and can be employed by the empirical scholar as they see fit according to their research question.
That is, the entries in the last column of the final version of the data set discussed can be treated as
(predicted or estimated, informally speaking) scores on the underlying latent dimension of interest.
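For instance, a quick inspection of the scoring results could be obtained as follows (a minimal sketch):

.summarize Theta ThetaSE

.histogram Theta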
For the sake of completeness of this section, we next present these estimates of the first 20 cases in the
data set (prior to its publication, the data set has been sorted in terms of number correct score):

.list in 1/20

id item1 item2 item3 item4 item5 ThetaSE Theta


1. 1 0 0 0 0 0 .7973221 -1.910115
2. 2 0 0 0 0 0 .7973221 -1.910115
3. 3 0 0 0 0 0 .7973221 -1.910115
4. 4 0 0 0 0 1 .8003169 -1.428806
5. 5 0 0 0 0 1 .8003169 -1.428806
6. 6 0 0 0 0 1 .8003169 -1.428806
7. 7 0 0 0 0 1 .8003169 -1.428806
8. 8 0 0 0 0 1 .8003169 -1.428806
9. 9 0 0 0 0 1 .8003169 -1.428806
10. 10 0 0 0 1 0 .8003169 -1.428806
11. 11 0 0 0 1 0 .8003169 -1.428806
12. 12 0 0 0 1 1 .8087799 -.9405584
13. 13 0 0 0 1 1 .8087799 -.9405584
14. 14 0 0 0 1 1 .8087799 -.9405584
15. 15 0 0 0 1 1 .8087799 -.9405584
16. 16 0 0 0 1 1 .8087799 -.9405584
17. 17 0 0 0 1 1 .8087799 -.9405584
18. 18 0 0 0 1 1 .8087799 -.9405584
19. 19 0 0 0 1 1 .8087799 -.9405584
20. 20 0 0 0 1 1 .8087799 -.9405584

We can see from this output section (the remaining 980 cases can be viewed at once by using the command -list- alone) the characteristic feature of the 1PL-model of yielding the same individual ability
predictions/estimates for subjects with the same number correct score, due to the fact that this score is
a sufficient statistic under the 1PL-model for the underlying ability level (e.g., de Ayala, 2022).
To conclude this section, the above application of Stata's IRT module demonstrates a number of
its highly useful empirical features. Further details and applications can be found in the pertinent
Manual (Stata Item Response Theory Reference Manual, 2019) as well as in Raykov and Marcoulides
(2018).

Conclusion
During the past couple of decades, Stata has affirmed its well-deserved place among the most widely
used and popular statistical analysis and modeling software in the social, behavioral, and educational
sciences as well as in the business, economics, and life science-related disciplines. Over the last decade,
Stata has also developed its own item response theory-based modeling module that offers a large
number of beneficial features for the researcher and instructor in these and cognate fields. The
algorithmic implementation of the module may be seen as closely connected to, and in part capitalizing on, relevant aspects of the comprehensive latent variable modeling methodology
(e.g., Stata Item Response Theory Reference Manual, 2019; Muthén, 2002). This is a particular property
of Stata that permits conceptual as well as more rigorous connections with classical measurement
approaches, especially those based on classical test theory and factor analysis (viz., non-linear factor
analysis; e.g., McDonald, 1999; see also Raykov & Marcoulides, 2016; Takane & de Leeuw, 1987). With
the Stata IRT module, a variety of popular item response models can be readily fitted as well as
compared (in terms of relative fit to a given data set). The aims of this review were therefore to provide
a discussion of a selection of the many noteworthy features of this module, namely those most likely to be of interest to, and utilized by, behavioral, social, and educational scientists as well as marketing and
business researchers. Extensive online resources are also easily accessed, in particular from within the
Help menu of the software where its detailed and elaborate Manual is also available. Numerous
internet-based materials, incl. informative videos, are similarly readily accessible by the scientist or
instructor interested in using item response modeling with Stata.
In conclusion, with all its highly beneficial features for the empirically engaged as well as the
theoretically oriented researcher, instructor, or user, Stata’s IRT module offers a comprehensive state-
of-the-art approach to fitting and evaluating IRT models as well as scoring of studied units of analysis,
which holds a strong promise to become even more popular in the coming years across the social,
behavioral, educational and management sciences as well as cognate disciplines.

Notes
1. Latent variables (traits, constructs, continua, dimensions), or generally hidden (i.e., not directly observable)
variables, are referred to generically as ‘abilities’ throughout this discussion.
2. We refer to the predictions of individual ability levels, oftentimes colloquially called “theta hats” in the IRT and
related literature, also as ‘estimates’ (mostly parenthetically) throughout this discussion. This is due to the wide-
spread reference to these predictions as estimates, particularly in empirical research. In strict terms, however,
these predictions cannot really be estimates, since they are individual realizations of a latent variable(s), viz. the
studied ability (abilities), and hence they can only be predicted rather than estimated (e.g., Rabe-Hesketh &
Skrondal, 2022).
3. The currently substantially less popular four-parameter and five-parameter logistic models can also be fitted with Stata, specifically using the command -bayesmh-. This review does not discuss those models or their utilization with the reviewed software in any detail, due to their rather limited number of applications in the educational, behavioral, and social sciences at present (especially as far as the five-parameter logistic model is concerned). Details on these two models and their estimation with Stata can be found at https://tinyurl.com/StataBayesianIRT.

Acknowledgments
I am indebted to G. A. Marcoulides for valuable discussions on item response theory and modeling. I am grateful to
C. Huber, R. Raciborski, J. Pitblado, and K. MacDonald for helpful comments on software implementation. I am also
thankful to C. Huber and R. Schumacker for critical comments on an earlier version of the paper, which contributed
considerably to its improvement.

Disclosure statement
No potential conflict of interest was reported by the author(s).

References
Acock, A. C. (2016). A gentle introduction to Stata. Stata Press.
Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43(4), 561–573. https://doi.org/10.1007/BF02293814
Balov, N. (2016). Bayesian binary item response theory models using bayesmh. The Stata Blog: Not elsewhere classified. http://blog.stata.com/2016/01/18/bayesian-binary-item-response-theory-models-using-bayesmh/
Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37(1), 29–51. https://doi.org/10.1007/BF02291411
de Ayala, R. J. (2022). The theory and practice of item response theory (2nd ed.). Guilford.
McDonald, R. P. (1999). Test theory: A unified treatment. Erlbaum.
Muthén, B. O. (2002). Beyond SEM: General latent variable modeling. Behaviormetrika, 29(1), 81–117. https://doi.org/10.2333/bhmk.29.81
Nering, M. L., & Ostini, R. (2010). Handbook of polytomous item response theory models. Taylor & Francis.
Rabe-Hesketh, S., & Skrondal, A. (2022). Multilevel and longitudinal modeling with Stata (4th ed.). Stata Press.
Raftery, A. (1995). Bayesian model selection in social research. Sociological Methodology, 25, 111–163. https://doi.org/10.2307/271063
Raykov, T., & DiStefano, C. (2021). Evaluating restrictive models in educational and behavioral research: Local misfit overrides model tenability. Educational and Psychological Measurement, 81(5), 980–995. https://doi.org/10.1177/0013164420944566
Raykov, T., & Marcoulides, G. A. (2016). On the relationship between classical test theory and item response theory: From one to the other and back. Educational and Psychological Measurement, 76(2), 325–338. https://doi.org/10.1177/0013164415576958
Raykov, T., & Marcoulides, G. A. (2018). A course in item response theory and modeling using Stata. Stata Press.
Raykov, T., & Marcoulides, G. A. (2020). A note on the presence of spurious pseudo-guessing parameters for three-parameter logistic models in heterogeneous populations. Educational and Psychological Measurement, 80(3), 604–612. https://doi.org/10.1177/0013164419850882
Reckase, M. D. (2009). Multidimensional item response theory. Springer.
Rizopoulos, D. (2007). ltm: An R package for latent variable modeling and item response analysis. Journal of Statistical Software, 17(5), 1–25. https://doi.org/10.18637/jss.v017.i05
Stata Item Response Theory Reference Manual. (2019). Stata Press.
Takane, Y., & de Leeuw, J. (1987). On the relationship between item response theory and factor analysis of discretized variables. Psychometrika, 52(3), 393–408.
Zhang, S., & Petersen, J. H. (2018). Quantifying rater variation for ordinal data using a rating scale model. Statistics in Medicine, 37(14), 2223–2237. https://doi.org/10.1002/sim.7639
