
Item Response Theory

(LATENT TRAIT THEORY)


Lec 10b
Introduction
Imagine that you're teaching math to fourth graders.
Example:
You administer a test with 10 questions.
2 questions are trivial, 2 are incredibly hard, and the rest are
equally difficult.
Imagine that two of your students take this test and each answer
nine of the 10 questions correctly.
S1 answers an easy question incorrectly, while S2 answers
a hard question incorrectly.
How would you try to identify the student with higher ability?
Under a traditional grading approach, S1 and S2 both
score 90 out of 100, so we grant both of them an A.
This approach illustrates a key problem with
measuring student ability via testing
instruments: test questions do not have
uniform characteristics.
So how can we measure student ability while
accounting for differences in questions?
What is IRT?
Item Response Theory is the study of test and
item scores based on assumptions concerning the
mathematical relationship between abilities (or
other hypothesized traits) and item responses.
Other names and subsets include Item
Characteristic Curve Theory, Latent Trait Theory,
Rasch Model, 2PL Model, 3PL model and the
Birnbaum model.
Lawrence M. Rudner (2001) http://echo.edres.org:8080/irt/
The figure: the x-axis shows student ability; the y-axis shows the probability of a
correct response to one test item; and the S-shaped curve traces the
probability of a correct response for students at different ability
(theta) levels.
Estimating Item Parameters
Because the actual values of the parameters of the items in a test
are unknown, one of the tasks performed when a test is analyzed
under item response theory is to estimate these parameters.
The obtained item parameter estimates then provide information
as to the technical properties of the test items.
So, the parameters of a single item will be estimated under the
assumption that the examinees' ability scores are known. In
reality, these scores are not known, but it is easier to explain how
item parameter estimation is accomplished if this assumption is
made.
http://info.worldbank.org/etools/docs/library/117765/Item%20Response%20Theory%20-%20F%20Baker.pdf
In the case of a typical test, a sample of M examinees responds to
the N items in the test. The ability scores of these examinees will
be distributed over a range of ability levels on the ability scale. For
present purposes, these examinees will be divided into, say, J
groups along the scale so that all the examinees within a given
group have the same ability level θj and there will be mj examinees
within group j, where j = 1, 2, 3, . . . J. Within a particular ability
score group, rj examinees answer the given item correctly. Thus, at
an ability level of θj, the observed proportion of correct response is
p(θj) = rj/mj, which is an estimate of the probability of correct
response at that ability level. Now the value of rj can be obtained
and p(θj) computed for each of the J ability levels established along
the ability scale. If the observed proportions of correct response in
each ability group are plotted, the result will be something like that
shown in Figure 3-1 of Baker (2001).
Diagrammatic summary:
i. M examinees respond to the N items in the test.
ii. The examinees are divided into J ability groups.
Group j:
- same ability level θj
- mj examinees within group j, where j = 1, 2, 3, . . . J
- rj examinees answer the given item correctly
- observed proportion of correct response is p(θj) = rj/mj, which is an
estimate of the probability of correct response at that ability level
Example:
M = 30 examinees; J = 6 groups of sizes {7, 5, 6, 3, 6, 3}
For the 3rd group: m3 = 6, r3 = 4
p(θ3) = 4/6 ≈ 0.67
Observed proportion of correct response as a
function of ability (Baker, 2001)
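To make the bookkeeping concrete, here is a minimal Python sketch of the grouped-proportion calculation above. The group sizes mirror the worked example (M = 30, J = 6); the correct counts for groups other than the third are invented purely for illustration.

```python
# Hypothetical data: group sizes follow the example above (sums to M = 30);
# the r_j values other than group 3 are made up for illustration.
m = [7, 5, 6, 3, 6, 3]   # m_j: examinees in each ability group
r = [2, 2, 4, 2, 5, 3]   # r_j: examinees in each group answering the item correctly

# p(theta_j) = r_j / m_j, the observed proportion correct per group
p_obs = [r_j / m_j for r_j, m_j in zip(r, m)]

for j, p in enumerate(p_obs, start=1):
    print(f"group {j}: p(theta_{j}) = {r[j-1]}/{m[j-1]} = {p:.2f}")
# group 3 prints 4/6 = 0.67, matching the worked example
```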
Under this approach, initial values for the item
parameters, such as b = 0.0, a = 1.0, are
established a priori. Then, using these estimates,
the value of P(θj) is computed at each ability
level via the equation for the item characteristic
curve model. The agreement between the observed
values p(θj) and the computed values P(θj) is then
determined across all ability groups.
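A minimal Python sketch of this fitting idea follows. It is not Baker's actual maximum-likelihood procedure, just a least-squares agreement measure over the ability groups; the group ability levels are assumed, and the observed proportions reuse the illustrative data from the earlier sketch.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative data: assumed ability level of each group, group sizes,
# and observed proportions correct from the earlier sketch.
theta = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])   # group ability levels
m = np.array([7, 5, 6, 3, 6, 3])                      # examinees per group
p_obs = np.array([2/7, 2/5, 4/6, 2/3, 5/6, 3/3])      # observed p(theta_j)

def icc(theta, a, b):
    """Two-parameter logistic item characteristic curve."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def lack_of_fit(params):
    a, b = params
    p_model = icc(theta, a, b)
    # squared disagreement between observed and model probabilities,
    # weighting each ability group by its size
    return np.sum(m * (p_obs - p_model) ** 2)

result = minimize(lack_of_fit, x0=[1.0, 0.0])   # a priori values a = 1.0, b = 0.0
a_hat, b_hat = result.x
print(f"estimated discrimination a = {a_hat:.2f}, difficulty b = {b_hat:.2f}")
```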
Item response theory (IRT) - Model
IRT attempts to model student ability using
question-level performance instead of
aggregate test-level performance.
Instead of assuming all questions contribute
equally to our understanding of a student's
abilities, IRT provides a more nuanced view of
the information each question provides about a
student.
What are its features? Let's look at some examples.
First, think back to the previous example.
In the traditional grading paradigm, a correct
answer on an easy question counts just as
much as a correct answer on a hard question,
despite the difference in difficulty!
In other words, the traditional grading scheme
completely ignores each question's difficulty
when grading students.
The one-parameter logistic (1PL) IRT model
attempts to address this by allowing each
question to have an independent difficulty
variable. It models the probability of a correct
answer using the following logistic function:

P_j(θ) = 1 / (1 + e^(−(θ − β_j)))

where j represents the question of interest, θ is the
current student's ability, and β_j is item j's difficulty. This
function is also known as the item response function. We
can examine its plot (with different values of β) below.
http://www.knewton.com/tech/blog/2012/06/understanding-student-performance-with-item-response-theory/
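A small sketch (assuming matplotlib is available) that reproduces the kind of plot referred to above, with a few illustrative β values:

```python
import numpy as np
import matplotlib.pyplot as plt

# 1PL item response function: P_j(theta) = 1 / (1 + exp(-(theta - beta_j)))
def irf_1pl(theta, beta):
    return 1.0 / (1.0 + np.exp(-(theta - beta)))

theta = np.linspace(-4, 4, 200)
for beta in (-1.0, 0.0, 1.0):          # easier, average, harder items
    plt.plot(theta, irf_1pl(theta, beta), label=f"beta = {beta}")

plt.xlabel("student ability (theta)")
plt.ylabel("P(correct answer)")
plt.legend()
plt.show()
```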
The plot confirms a couple of things:
1. For a given ability level, the probability of a correct answer
increases as item difficulty decreases. Between two
questions, the question with the lower β value is easier.
2. Similarly, for a given question difficulty level, the probability
of a correct answer increases as student ability increases. In
fact, the curves displayed above take a sigmoidal form, thus
implying that the probability of a correct answer increases
monotonically as student ability increases.
Now, using the 1PL model: if one student answers one question,
we can only draw information about that student's ability from
that one question.
Now imagine a second student answers the same question as
well as a second question, as illustrated below.
Additional information about both students and both test
questions:
We now know more about S2's ability relative to S1's based
on their answers to the first question. For example, if S1
answered correctly and S2 answered incorrectly, we know
that S1's ability is likely greater than S2's ability.
We also know more about the first question's difficulty
after S2 answers the second question. If S2 answers the
second question correctly, we know that Q1 likely has a
higher difficulty than Q2 does.
So: Q1 is more difficult than initially expected; S1 has
greater ability than we initially estimated.
This form of message passing via item
parameters is the key distinction between IRT's
estimates of student ability and other naive
approaches (like the grading scheme described
earlier).
Interestingly, it also suggests that one could
develop an online version of IRT that updates
ability estimates as more questions and
answers arrive!
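As a toy illustration of what such an online update might look like (a sketch, not an established algorithm; the item difficulties and responses below are invented), one could nudge the ability estimate with a gradient step on the 1PL log-likelihood after each answer:

```python
import math

# 1PL probability of a correct response
def p_correct(theta, beta):
    return 1.0 / (1.0 + math.exp(-(theta - beta)))

def update_ability(theta, beta, answered_correctly, lr=0.5):
    # d/d(theta) of the log-likelihood y*log(p) + (1-y)*log(1-p) is (y - p)
    y = 1.0 if answered_correctly else 0.0
    return theta + lr * (y - p_correct(theta, beta))

theta = 0.0                                 # start from an average ability
responses = [(-1.0, True), (0.5, True), (1.5, False), (2.0, False)]
for beta, correct in responses:             # (item difficulty, answered correctly)
    theta = update_ability(theta, beta, correct)
    print(f"after item with beta={beta:+.1f}: theta = {theta:.2f}")
```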
However, some questions fail to separate
students well. When discussing IRT models, we say that
these questions have a low discrimination
value, since they do not discriminate between
students of high or low ability. Ideally, a good
question (i.e. one with a high discrimination)
will maximally separate students into two
groups: those with the ability to answer
correctly, and those without.
An important point: some questions do a better job than
others of distinguishing between students of similar
abilities.
The two-parameter logistic (2PL) IRT model incorporates
this idea by attempting to model each item's level of
discrimination between high- and low-ability students.
This can be expressed as a simple tweak to the 1PL:

P_j(θ) = 1 / (1 + e^(−α_j(θ − β_j)))

How does the addition of α (the item discrimination
parameter) affect the model?
As above, we can take a look at the item response function while
changing alpha a bit:
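A small numeric sketch of the same point, using arbitrarily chosen α values and β = 0:

```python
import numpy as np

# 2PL item response function
def irf_2pl(theta, alpha, beta):
    return 1.0 / (1.0 + np.exp(-alpha * (theta - beta)))

for alpha in (0.5, 1.0, 2.5):
    below = irf_2pl(-0.5, alpha, beta=0.0)   # student slightly below beta
    above = irf_2pl(+0.5, alpha, beta=0.0)   # student slightly above beta
    print(f"alpha = {alpha}: P(theta=-0.5) = {below:.2f}, P(theta=+0.5) = {above:.2f}")
# higher alpha pushes the two probabilities further apart,
# i.e. the item discriminates better near theta = 0
```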
As previously stated, items with high
discrimination values can distinguish between
students of similar ability.
If we're attempting to compare students with
abilities near zero, a higher discrimination
sharply decreases the probability that a
student with ability < 0 will answer correctly,
and increases the probability that a student
with ability > 0 will answer correctly.
We can even go a step further here, and state that an
adaptive test could use a bank of high-discrimination
questions of varying difficulty to optimally identify a
student's abilities.
As a student answers each of these high-discrimination
questions, we could choose a harder question if the
student answers correctly (and vice versa). In fact, one
could even identify the student's exact ability level via
binary search, if the student is willing to work through a
test bank with an infinite number of high-discrimination
questions of varying difficulty!
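A toy sketch of that binary-search idea, under the idealized assumption that a high-discrimination item is answered correctly exactly when the student's ability exceeds its difficulty (real adaptive tests use probabilistic responses and proper ability estimators, not this bisection):

```python
# Idealized adaptive test: serve the item whose difficulty bisects the
# current ability interval, then narrow the interval based on the answer.
def adaptive_estimate(true_ability, lo=-4.0, hi=4.0, n_items=10):
    for _ in range(n_items):
        difficulty = (lo + hi) / 2.0          # item that splits the interval
        correct = true_ability > difficulty   # idealized deterministic response
        if correct:
            lo = difficulty                   # student is at least this able
        else:
            hi = difficulty                   # student is below this level
    return (lo + hi) / 2.0

print(adaptive_estimate(true_ability=1.3))    # converges toward 1.3
```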
Of course, the above scenario is not completely true
to reality.
Sometimes students will identify the correct answer
by simply guessing! Additionally, students can
increase their odds of guessing a question correctly
by ignoring answers that are obviously wrong.
We can thus model each question's guess-ability
with the three-parameter logistic (3PL) IRT model.
The 3PL's item response function looks like this:

P_j(θ) = χ_j + (1 − χ_j) / (1 + e^(−α_j(θ − β_j)))

where χ_j represents the item's pseudo-guess value. χ is not
considered a pure guessing value, since students can use some strategy
or knowledge to eliminate bad guesses.
Thus, while a pure guess would be the reciprocal of the number of
options (i.e. a student has a one-in-four chance of guessing the answer
to a multiple-choice question with four options), those odds may
increase if the student manages to eliminate an answer (i.e. that same
student increases her guessing odds to one-in-three if she knows one
option isn't correct).
As before, let's take a look at how the pseudo-guess parameter affects the
item response function curve:
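A small numeric sketch of the effect, with arbitrarily chosen parameter values:

```python
import numpy as np

# 3PL item response function:
# P_j(theta) = chi_j + (1 - chi_j) / (1 + exp(-alpha_j * (theta - beta_j)))
def irf_3pl(theta, alpha, beta, chi):
    return chi + (1.0 - chi) / (1.0 + np.exp(-alpha * (theta - beta)))

for chi in (0.0, 0.25, 0.33):
    p_low = irf_3pl(theta=-4.0, alpha=1.0, beta=0.0, chi=chi)
    print(f"chi = {chi:.2f}: P(correct | theta = -4) = {p_low:.2f}")
# even a very low-ability student answers correctly with probability ~chi,
# i.e. chi sets the floor of the curve
```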
Note that students of low ability now have a higher
probability of guessing the question's answer. This is also
clear from the 3PL's item response function (χ is an
additive term and the second term is non-negative, so the
probability of answering correctly is at least as high as χ).
Note that there are a few general concerns in the IRT
literature regarding the 3PL, especially regarding whether an
item's guessability is instead a part of a student's test-taking
wisdom, which arguably represents some kind of student
ability.
Regardless, at Knewton we've found IRT models to be
extremely helpful when trying to understand our students'
abilities by examining their test performance.
