How well did my test distinguish among students according to how well they met my learning goals?
How well did my students perform on each item on my test?
Which questions on the test were easy, average, or difficult?
Session Objectives:
At the end of the session, the participants should be able
to:
• define item analysis, difficulty index,
discrimination index and reliability;
• conduct item analysis using the second
summative assessment; and
• use SPSS to conduct reliability testing.
What is Item Analysis?
•Item Difficulty
•Item Discrimination
•Distracter Analysis
•Reliability
Item Difficulty
• The proportion of students who answer an item correctly.
• Item difficulty is relevant for determining whether
students have learned the concept being tested. It also
plays an important role in the ability of an item to
discriminate between students who know the tested
material and those who do not.
(University of Washington – Office of Educational
Assessment, 2020)
Item Difficulty (p)

p = (R_U + R_L) / T

R_U = the number in the upper group who answered the item correctly
R_L = the number in the lower group who answered the item correctly
T = the total number of examinees
Difficulty Index Interpretation
(Hopkins and Antes)
Difficulty Index Description
.86 to 1.00 Very Easy
.71 to .85 Easy
.40 to .70 Desirable
.15 to .39 Difficult
0 to .14 Very Difficult
Items with a difficulty index above 0.85 are very easy and should be carefully reviewed in light of the instructor's purpose.
Items with a p-value of 0.14 or below are very difficult and should be reviewed for possible confusion.
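As a minimal sketch, the difficulty formula and the Hopkins and Antes bands above can be expressed in Python (the function names are my own, not part of the session materials):

```python
def difficulty_index(upper_correct, lower_correct, total_examinees):
    """p = (R_U + R_L) / T: the proportion of examinees answering the item correctly."""
    return (upper_correct + lower_correct) / total_examinees

def describe_difficulty(p):
    """Hopkins and Antes interpretation bands from the table above."""
    if p >= 0.86:
        return "Very Easy"
    if p >= 0.71:
        return "Easy"
    if p >= 0.40:
        return "Desirable"
    if p >= 0.15:
        return "Difficult"
    return "Very Difficult"

p = difficulty_index(10, 8, 20)   # 18 of 20 examinees answered correctly
print(p, describe_difficulty(p))  # 0.9 Very Easy
```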
Item Discrimination
• It is the difference between the proportion of the top scorers
who got an item correct and the proportion of the bottom
scorers who got the item right (Anu, 2019).
• It refers to the ability of an item to differentiate among
students based on how well they know the material being
tested. (University of Washington – Office of Educational
Assessment, 2020)
Item Discrimination
• When an item is discriminating negatively, the most knowledgeable examinees are, overall, getting the item wrong while the least knowledgeable examinees are getting it right. A negative discrimination index may indicate that the item is measuring something other than what the rest of the test is measuring. More often, it is a sign that the item has been mis-keyed (Anu, 2019).
Item Discrimination (D)

D = (R_U / n_U) - (R_L / n_L)

R_U = the number in the upper group who answered the item correctly
R_L = the number in the lower group who answered the item correctly
n_U = the number of examinees in the upper group
n_L = the number of examinees in the lower group
Discrimination Index Interpretation
(Penn, 2009; McGahee & Ball, 2009)
Discrimination Index   Description
0.40 and above         Very Good
0.30 – 0.39            Good (Subject to Improvement)
0.20 – 0.29            Marginal Item (Needs Improvement)
Below 0.20             Poor
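The discrimination formula and the interpretation bands above can likewise be sketched in Python (function names are my own; the example numbers are 10 of 10 upper-group and 8 of 10 lower-group examinees correct):

```python
def discrimination_index(upper_correct, lower_correct, n_upper, n_lower):
    """D = R_U/n_U - R_L/n_L: proportion correct in the upper group minus the lower group."""
    return upper_correct / n_upper - lower_correct / n_lower

def describe_discrimination(d):
    """Interpretation bands from Penn (2009) and McGahee & Ball (2009)."""
    if d >= 0.40:
        return "Very Good"
    if d >= 0.30:
        return "Good (Subject to Improvement)"
    if d >= 0.20:
        return "Marginal Item (Needs Improvement)"
    return "Poor"

d = round(discrimination_index(10, 8, 10, 10), 2)  # round to avoid float edge effects
print(d, describe_discrimination(d))  # 0.2 Marginal Item (Needs Improvement)
```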
Example 1

Objective     Item   Upper   Lower   Total   Decision
Remembering   1      10      8       18      Improve Item
Total                10      8       18

The p-value of 0.9 indicates that the item is very easy, and the D-value of 0.2 indicates marginal discrimination. The item therefore needs improvement.
Example 2

Objective       Item   Upper   Lower   Total   Decision
Remembering     1      10      10      20      Revise
Applying        1      9       4       13      Retained
Analyzing       1      7       7       14      Revise
Understanding   1      4       8       12      Change

For the Understanding item, the p-value of 0.60 is in the desirable range, but the D-value of -0.40 indicates that the lower achievers are scoring better on the item than the higher achievers. A negative discrimination index may indicate that the item is measuring something other than what the rest of the test is measuring. More often, it is a sign that the item has been mis-keyed. The item is therefore very poor and needs to be changed.
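Assuming 10 examinees per group (20 in total, as the Remembering row's total suggests), the table's p- and D-values can be reproduced with a short Python sketch:

```python
# (upper correct, lower correct) per objective, taken from the Example 2 table.
items = {
    "Remembering":   (10, 10),
    "Applying":      (9, 4),
    "Analyzing":     (7, 7),
    "Understanding": (4, 8),
}

results = {}
for objective, (upper, lower) in items.items():
    p = (upper + lower) / 20     # difficulty: proportion correct overall
    d = upper / 10 - lower / 10  # discrimination: upper minus lower proportion
    results[objective] = (p, d)
    print(f"{objective}: p={p:.2f}, D={d:+.2f}")
```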
Distracter Analysis
• Item quality depends partly on the effective functioning of the selected distracters. How well the distracters actually function is identified by a logical check of the options and of how examinees select them. Most importantly, every distracter must attract at least one examinee. In general, a good distracter attracts more examinees from the lower group than from the upper group (Tamakloe, Atta & Amedahe, 1996).
Distracter Analysis
• If no one selects a distracter, it is important to revise that option and attempt to make it a more plausible choice.
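The tallying behind a distracter analysis can be sketched in Python, assuming each examinee's chosen option is recorded as a letter (function names are illustrative; the data mirrors Example 1 below):

```python
from collections import Counter

def distracter_table(upper_answers, lower_answers, options="ABCD"):
    """Tally how many examinees in each group chose each option."""
    up, low = Counter(upper_answers), Counter(lower_answers)
    return {opt: (up[opt], low[opt], up[opt] + low[opt]) for opt in options}

def dead_distracters(table, key):
    """Options (other than the key) that attracted no examinees at all."""
    return [opt for opt, (_, _, total) in table.items()
            if opt != key and total == 0]

# The key is C; options B and D attract no one.
upper = ["C"] * 10
lower = ["C"] * 7 + ["A"] * 3
table = distracter_table(upper, lower)
print(dead_distracters(table, "C"))  # ['B', 'D']
```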
Example 1
Options Upper Lower Total
A 0 3 3
B 0 0 0
C** 10 7 17
D 0 0 0
Total 10 10 20
From the table above, it is clear that none of the examinees selected options “B” and “D”, not even the lower group. These two purported distracters are not functioning, because they did not distract any examinee. They are therefore not good distracters and must be changed.
Example 2
Options Upper Lower Total
A** 9 8 17
B 0 2 2
C 1 0 1
D 0 0 0
Total 10 10 20
Option “D” did not function as a good distracter, because none of the examinees selected it, not even the lower group. That distracter cannot be retained; it needs to be revised or changed.
Example 3
Options Upper Lower Total
A 1 4 5
B 0 2 2
C 1 0 1
D** 8 4 12
Total 10 10 20
The table above clearly shows that option “A” is a good distracter: it attracted 5 examinees, representing 25% of the total, most of them from the lower group. It is therefore appropriate to retain option “A”.
Reliability
• Reliability is a measure of consistency. It is the degree to
which student results are the same when they take the same
test on different occasions, when different scorers score the
same item or task, and when different but equivalent tests
are taken at the same time or at different times (The Center
on Standards & Assessment Implementation, 2018).
Types of Reliability (Trochim, nd)
• Stability
1. Test-Retest
2. Inter-rater/Observer/Scorer
- applicable to essay questions
• Equivalence
3. Parallel-Forms/Equivalent
- used to assess the consistency of results of two tests constructed in the same way from the same content domain
• Internal Consistency
- used to assess the consistency of results across items within a test
4. Split-Half
• Cronbach's Alpha
- equivalent to the average of all possible split-half correlations, although we would never actually calculate it that way
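SPSS computes alpha in the steps later in this session; as a cross-check, the coefficient-alpha formula can be sketched in Python (the 0/1 score matrix here is hypothetical, made up for illustration):

```python
import statistics

def cronbach_alpha(scores):
    """alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores)),
    using population variances; rows are examinees, columns are items."""
    k = len(scores[0])
    item_vars = [statistics.pvariance([row[i] for row in scores]) for i in range(k)]
    total_var = statistics.pvariance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical scores: 5 examinees by 4 items (1 = correct, 0 = incorrect)
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
print(round(cronbach_alpha(scores), 3))  # 0.747
```

By Taber's table below, an alpha near 0.75 would be good for a classroom test, with a few items that could probably be improved.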
Reliability Index Interpretation (Taber, 2017)

Reliability Index   Description
0.91 and above      Excellent; at the level of a standardized test
0.81 – 0.90         Very good for a classroom test
0.71 – 0.80         Good for a classroom test; a few items could probably be improved
0.61 – 0.70         Somewhat low; needs to be supplemented
0.51 – 0.60         Needs revision of the test
0.50 or below       Questionable reliability
Item Analysis Process (NORSU, 2019)
Prepare the Table of Specifications (TOS)
Steps in Item Analysis
6. For reliability: in SPSS, open Variable View and name each item (Q1, Q2, and so on).
Note: There should be no spaces in the Name column.
Steps in Item Analysis
9. Click Analyze > Scale > Reliability Analysis > select all the variables > drag them to the Items box > Model: Alpha > Statistics: Descriptives ("Scale if item deleted"), ANOVA Table (F test) > OK
ITEM ANALYSIS is not an END in itself
“Exam scores are not just supposed to
separate students who pass or fail an exam.
Through the scrutiny of the exam questions
and scores, teachers can have insight into
how much the students have learned”.
-Matthew Lynch