School of Education
OUTLINE
Part 1
Item Difficulty
(and Distractor Analyses)
Reviewing the Concept of Item Difficulty
• Low difficulty items provide
information about individuals
having locations that are low on
the trait continuum.
Measuring Item Difficulty
The Scale of Item Difficulty: Dichotomous Items
• For the case of dichotomously scored items (i.e., multiple-choice items scored
as 0/1 for incorrect/correct) the mean item score is equal to the proportion
correct.
• We denote the proportion correct “p”, and call this the item’s p-value.
• Thus, the scale of difficulty for dichotomous items is:
0 (High Difficulty) <--------------------> 1 (Low Difficulty)
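The link between the mean item score and the proportion correct can be sketched in a few lines of Python. The response matrix and the function name `p_values` are made up for illustration; they are not part of the course materials.

```python
# Rows are examinees, columns are items; 1 = correct, 0 = incorrect.
# The response data below are invented for illustration only.
responses = [
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]

def p_values(scored):
    """Return the proportion correct (the mean item score) for each item."""
    n_examinees = len(scored)
    n_items = len(scored[0])
    return [sum(row[j] for row in scored) / n_examinees for j in range(n_items)]

print(p_values(responses))  # [0.75, 0.25, 0.5]
```

Here the first item (p = 0.75) is the easiest and the second item (p = 0.25) is the most difficult.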
Interpreting Item Difficulty: Dichotomous Items
• p-value near 1 reflects very low difficulty (provides info about individuals
with very low trait level)
• p-value near 0 reflects very high difficulty (provides info about individuals
with very high trait level)
• p-value near 0.5 reflects moderate difficulty (provides info about individuals
with medium trait level)
Difficulty and Item Quality: Dichotomous Items
• If the p-value is too extreme, the item may not be providing information about
individuals in a relevant range of the trait continuum.
• Values near p = 0.9 reflect items that provide information about very low levels of the
target trait, and thus may not be a good use of an item.
• Typically, items in the 0.3 – 0.7 range generate information for a useful range of the trait
continuum (optimal level range).
• But sometimes you need a few items with more extreme difficulty values, so it’s a bit
of a judgment call that depends on the range of the trait continuum about which you
intend to make inferences.
Exercises
Item Difficulty for Multiple-Choice Items
• Because the potential act of guessing on multiple-choice items can cause
complications, we anticipate that the p-value associated with the highest possible
difficulty item should not actually be zero.
• Rather, it should equal (roughly speaking) the chance of guessing (g) on the item.
Exercises
• Consider a multiple-choice item with four response options (A, B, C, D). For what trait level does an
item with p = 0.3 provide information?
Answer: Somewhat difficult
• Consider a multiple-choice item with two response options (e.g., a true/false item). What would you
expect the lowest possible p-value to be?
Answer: p = 0.50
• Consider a multiple-choice item with three response options (A, B, C) that has a p-value of 0.15. What
does this tell you about the item quality?
Answer: The p-value is below the chance level of 1/3 = 0.33, which is roughly the lowest p-value
expected even for a very difficult three-option item. This item should be flagged for further
analyses (start with distractor analyses).
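The reasoning in these exercises can be sketched as a small check of an observed p-value against the chance level. The function names are hypothetical, and treating g as exactly 1 divided by the number of options is the rough approximation used on the slide, not a model of real guessing behavior.

```python
def chance_level(n_options):
    """Approximate chance of a correct blind guess: g = 1 / number of options."""
    return 1 / n_options

def below_chance(p, n_options):
    """True when the observed p-value is lower than the chance level."""
    return p < chance_level(n_options)

print(chance_level(2))        # 0.5  (true/false item)
print(below_chance(0.15, 3))  # True (flag the item for distractor analysis)
print(below_chance(0.30, 4))  # False
```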
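Distractor Analyses
Abang Zaidi ke Kuala Lumpur untuk menghadiri temu duga dengan _____________ segulung ijazah yang dimilikinya.
(English gloss: Abang Zaidi went to Kuala Lumpur to attend an interview _____________ the degree he holds.)
A. berbekalkan (equipped with) C. berlandaskan (based on)
B. bersandarkan (relying on) D. berpandukan (guided by)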
KATEGORI (category) by PILIHAN JAWAPAN (response options)

            *A    B    C    D
KT           3    3    1    4
KR           1    2    1    5
JUMLAH       4    5    2    9

*Jawapan (answer key); JUMLAH = total

Difficulty Index (p)
p = (number of students who answered correctly) / (total number of students)
p = (KT* + KR*) / N
p = 4 / 20
p = 0.20

Chance level (g)
g = 1 / 4
g = 0.25

Optimal level
opt = (1 + 0.25) / 2
opt = 0.625
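The same arithmetic can be reproduced from the response counts in the table. This is only a sketch: the dictionary layout and variable names are illustrative. It assumes that KT and KR are the two examinee groups in the table and that option A is the keyed answer, as marked by the asterisk.

```python
# Response counts per option, copied from the table above.
counts = {
    "KT": {"A": 3, "B": 3, "C": 1, "D": 4},
    "KR": {"A": 1, "B": 2, "C": 1, "D": 5},
}
key = "A"  # keyed (correct) option, marked with * in the table

n_total = sum(sum(group.values()) for group in counts.values())   # N = 20
n_correct = sum(group[key] for group in counts.values())          # KT* + KR* = 4

p = n_correct / n_total          # difficulty index
g = 1 / len(counts["KT"])        # chance level with four options
optimal = (1 + g) / 2            # optimal level, as defined on the slide

print(p, g, optimal)  # 0.2 0.25 0.625
```

The observed p = 0.20 is below both the chance level (0.25) and the optimal level (0.625), which is why the item deserves a closer look at its distractors.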
The Scale of Item Difficulty: Polytomous Items
• For the case of polytomously scored items (e.g., rating scale items, essay items,
performance-based items) the mean item score is used.
• Thus, the scale of difficulty for polytomous items is:
0 (High Difficulty) <--------------------> J (Low Difficulty)
Interpreting Item Difficulty: Polytomous Items
Of Course, It’s a Little More Complicated than that.
Difficulty and Item Quality: Polytomous Items
• If the item mean is too extreme, the item may not be providing information
about individuals in a relevant range of the trait continuum.
• Values near 0 or J reflect items that provide information about very extreme
levels of the target trait, and thus may not be a good use of an item.
• But sometimes you need a few items in more extreme difficulty values… so,
it’s a bit of a judgment call that depends on the range of the trait continuum
about which you intend to make inferences.
Exercises
For each situation, specify whether the item difficulty is low,
moderate, or high:
1. A rating scale item with five score levels (0,1,2,3,4) has a mean of 3.8.
Answer: low difficulty (an easy item); the scale midpoint is 4/2 = 2.00, and the mean is well above it.
2. A rating scale item with three score levels has a mean of 0.5.
Answer: high difficulty (a difficult item); with score levels 0, 1, 2 the scale midpoint is 2/2 = 1.00, and the mean is below it.
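The comparison used in these answers can be sketched as a check of the item mean against the midpoint of the 0 to J score scale. The function names are hypothetical. Two assumptions are built in: the three-level item is scored 0, 1, 2 (so J = 2), and comparing only against the midpoint is a rough rule of thumb, not a formal cut-off.

```python
def scale_midpoint(max_score):
    """Midpoint of a 0..J score scale, used as a rough reference point."""
    return max_score / 2

def relative_difficulty(item_mean, max_score):
    """Label an item by comparing its mean with the scale midpoint."""
    midpoint = scale_midpoint(max_score)
    if item_mean > midpoint:
        return "lower difficulty (easier item)"
    if item_mean < midpoint:
        return "higher difficulty (harder item)"
    return "moderate difficulty"

print(relative_difficulty(3.8, 4))  # lower difficulty (easier item)
print(relative_difficulty(0.5, 2))  # higher difficulty (harder item)
```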
Consideration for Instrument Development
• If you want to generate good information across a very wide range of trait
levels, you would want to have a very good range of item difficulties.
• If you want to generate very high information at a specific trait level (a cut-score
or a standard), then you would want to have lots of items with a
difficulty that differentiates between individuals at the location of the trait
level of interest.
Part 2
Item Discrimination
Measuring Item Discrimination
Item Analysis: Example (Item Discrimination)
Abang Zaidi ke Kuala Lumpur untuk menghadiri temu duga dengan _____________ segulung ijazah yang dimilikinya.
A. berbekalkan C. berlandaskan
B. bersandarkan D. berpandukan
D = (3 - 1) / 11
D = 0.182
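The calculation above can be written as a short function. This is a sketch under stated assumptions: the function name is hypothetical, the values 3 and 1 are the numbers of correct answers in the KT and KR groups, and the divisor 11 is taken directly from the slide (it matches the number of students in the KT row of the earlier table, 3 + 3 + 1 + 4).

```python
def discrimination_index(upper_correct, lower_correct, group_size):
    """D = (correct in upper group - correct in lower group) / group size."""
    return (upper_correct - lower_correct) / group_size

d = discrimination_index(upper_correct=3, lower_correct=1, group_size=11)
print(round(d, 3))  # 0.182
```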
Distractors (and Item) Analyses 3 (cont…): Example of Interpretations
18. Correlation Coefficient: Pearson’s r :
Polytomous ITC (cont…)
Notice again: the corrected ITC is usually smaller than the non-corrected ITC, and it is the one usually reported.
[Annotated output labels: non-corrected ITC; corrected ITC; bivariate correlation]
Item Discrimination Guidelines
• Guidelines:
• ITC < .2: very low (especially for polytomous items)
• ITC > .7: very high (especially for dichotomous items)
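To make the item-total correlation (ITC) concrete, the sketch below computes both the non-corrected ITC (item versus total score) and the corrected ITC (item versus total score with that item removed), then applies the two guideline thresholds above. The data and all function names are made up for illustration, and the label "within guidelines" for values between .2 and .7 is an assumption added here.

```python
from math import sqrt

# Invented item scores: rows are respondents, columns are items.
scores = [
    [0, 1, 0],
    [1, 1, 0],
    [1, 2, 1],
    [2, 2, 1],
    [3, 3, 2],
]

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def item_total_correlations(data, item):
    """Return (non-corrected ITC, corrected ITC) for the item in column `item`."""
    item_scores = [row[item] for row in data]
    totals = [sum(row) for row in data]
    rest = [t - i for t, i in zip(totals, item_scores)]  # total without the item
    return pearson_r(item_scores, totals), pearson_r(item_scores, rest)

def flag(itc):
    """Apply the guideline thresholds from the slide."""
    if itc < 0.2:
        return "very low"
    if itc > 0.7:
        return "very high"
    return "within guidelines"

uncorrected, corrected = item_total_correlations(scores, item=0)
print(round(uncorrected, 3), round(corrected, 3), flag(corrected))
```

With this toy data the corrected value is smaller than the non-corrected one, because the item no longer correlates with itself as part of the total.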
[Figure: response options A, B, C, D. Item discrimination: height of arrow; item difficulty: location of arrow. *Jawapan (answer key)]
Suggested Revision
Distractor Analyses
[Figure labeled with the dimensions 3 and 5]
A. 3² = 9
B. 3 x 5 = 15 (correct answer)
C. 3 + 5 + 3 + 5 = 16
D. 5² = 25
[Options shown as values only: 9, 15, 16, 25]
Another Exercise
• The difficulty index falls within the 0.41 - 0.60 range, at
0.556, so the item is moderately difficult.
• The discrimination index exceeds 0.40, at 0.444, which indicates good
discrimination.
• Distractor analysis: distractor A functions fairly well
because one student from the low group
thought it was the answer. Distractor C functions
well because more students from the low group
thought it was the answer. However,
distractor D is a weak distractor because students
from both groups chose it.
This may have happened because of confusion.
• So this item can be categorized as a quality item
because no more than 60% of students could
answer it correctly, namely only 56%.
Meanwhile, the item discrimination is very high at 0.444,
which shows that the item is very good because more
high-ability students answered it correctly
than low-ability students. This
shows that the distractors succeeded in
confusing the low-ability students.
Examining Item Difficulty and Discrimination Using SPSS
1. Go to the “Scale” option of the Analyze menu, and then select “Reliability
Analysis…”.
2. Once in the Reliability Analysis window, select the items of the instrument.
3. Click on “Statistics…”, and then check “Item”, “Scale”, and “Scale if item
deleted” under “Descriptives for”.
4. Click on “Continue” to close the “Statistics” window, and then “OK” to run
the analysis.
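For readers without SPSS, the quantities this procedure reports (item means, item standard deviations, Cronbach's alpha, and alpha if an item is deleted) can be approximated with the Python standard library. This is a minimal sketch on invented data, not a reproduction of the SPSS routine. The function names are hypothetical, and sample (n - 1) variances are assumed.

```python
from statistics import mean, stdev, variance

# Invented item scores: rows are respondents, columns are items.
scores = [
    [0, 1, 0],
    [1, 1, 0],
    [1, 2, 1],
    [2, 2, 1],
    [3, 3, 2],
]

def column(data, j):
    """Return the scores of item j across all respondents."""
    return [row[j] for row in data]

def cronbach_alpha(data):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total score)."""
    k = len(data[0])
    item_variances = [variance(column(data, j)) for j in range(k)]
    total_variance = variance([sum(row) for row in data])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

def alpha_if_item_deleted(data, item):
    """Cronbach's alpha recomputed after dropping one item."""
    reduced = [[v for j, v in enumerate(row) if j != item] for row in data]
    return cronbach_alpha(reduced)

for j in range(len(scores[0])):
    item = column(scores, j)
    print(f"item {j}: mean={mean(item):.2f} sd={stdev(item):.2f} "
          f"alpha if deleted={alpha_if_item_deleted(scores, j):.3f}")
print(f"scale alpha={cronbach_alpha(scores):.3f}")
```

An item whose "alpha if deleted" is higher than the scale alpha is a candidate for review, since removing it would raise the reliability estimate.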
Part 3a
Case Study 1
Case Study 1: SPSS
Case Study 1: SPSS (cont…)
Based on this output, does everything look O.K. with respect to any error in data
entry (e.g., are there any values that fall outside of the acceptable range of 0-3)?
What is the minimum and maximum value of the observed scale assigned to the
anxiety continuum as defined by the SAS?
The mean represents the item difficulty of each item.
Case Study 1: SPSS (cont…)
Compute the observed summated SAS score for each individual in the data file
(Xp for each person). Do this using the “Sum” function in SPSS. You can call this
variable “SAS_Score”.
In SPSS, go to Transform > Compute Variable… > Target Variable: “SAS_Score” >
Function group: Statistical > Functions and Special Variables: Sum
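The two steps discussed for this case study, checking for out-of-range entries and computing the summated score, can be sketched outside SPSS as follows. The data are invented, the function names are hypothetical, and the item count of 20 is taken from the SAS items V1 to V20 listed in Table 1.

```python
# Invented responses: rows are respondents, columns are items scored 0-3.
responses = [
    [0, 3, 2, 1],
    [1, 1, 7, 0],  # the 7 imitates a data-entry error
    [3, 3, 3, 2],
]

def out_of_range(data, low=0, high=3):
    """Return (row, column, value) for every entry outside low..high."""
    return [
        (r, c, value)
        for r, row in enumerate(data)
        for c, value in enumerate(row)
        if not low <= value <= high
    ]

def summated_scores(data):
    """Sum each respondent's item scores (the role of SAS_Score in SPSS)."""
    return [sum(row) for row in data]

print(out_of_range(responses))  # [(1, 2, 7)]

n_items = 20  # SAS items V1 to V20
print(0 * n_items, 3 * n_items)  # possible summated-score range: 0 60
```

Summated scores should only be computed after any flagged entries have been corrected.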
In SPSS, go to Analyze > Scale > Reliability Analysis. Click Statistics and check all
boxes under “Descriptives for”:
• Item
• Scale
• Scale if item deleted
We are doing this now to introduce you to some basic concepts for item analyses.
Examine the output, especially the table with the (Corrected) Item-Total Correlation.
Table 1. Descriptive Statistics and Item Analyses for Social Anxiety Scale (SAS)

Item    Mean     SD       Corrected Item-Total    Cronbach’s alpha
                          Correlation             if item deleted
V1      1.79     .975     .455                    .838
V2      1.50     1.419    .081                    .859
V3      1.01     1.226    .289                    .846
V4      1.74     .993     .507                    .836
V5      1.39     .864     .720                    .829
V6      1.73     1.004    .530                    .835
V7      1.03     1.191    .492                    .836
V8      1.71     1.146    .514                    .835
V9      2.64     .698     .421                    .840
V10     1.37     .942     .747                    .826
V11     2.44     .798     .537                    .836
V12     1.54     1.449    .075                    .860
V13     1.48     1.068    .629                    .830
V14     1.73     1.013    .482                    .836
V15     2.24     .893     .526                    .835
V16     .75      .942     .718                    .828
V17     1.50     1.230    .410                    .840
V18     .93      1.129    .368                    .841
V19     .32      .669     .469                    .839
V20     1.49     1.309    .235                    .849

Scale Reliability (Cronbach’s alpha for SAS): .846
Scale Mean: 30.330
Scale SD: 10.780
N: 500

Note: The mean represents the item difficulty of each item. The corrected item-total
correlation represents the item discrimination of each item.
Part 3b
Case Study 2
Case Study 2
A researcher is developing a 36-item test of heart health awareness (HHA test) that
will be used to evaluate whether patients having heart disease are aware of various
issues associated with improving the health of their heart. Each item is a multiple-
choice item. The researcher administers the HHA test to a sample of 2,000
individuals. The data are contained in “CaseStudy2_Data.sav”.
Case Study 2 (cont…)
Compute the observed summated score for each individual in the data file.
Examine the distribution of the summated score for the sample of 2,000
respondents by creating a histogram in SPSS.
How is the data in Case Study 2 different from the data in Case Study 1?
Part 3c
Case Study 3
Case Study 3: SPSS
Download the file “CaseStudy3_Data.sav”. The file contains responses to 35 test items by 565
test takers. There are 36 variables in the data file. PersonID is the subject identifier. Variables
A1 to D9 are the response data on the 35 test items.
The 35 items were written to measure four constructs, conveniently labeled A, B, C, and D. Items
(variables) A1 to A10 are intended to measure Construct A. Items B1 to B8 were written to
measure Construct B. Items C1 to C8 ideally measure Construct C and Items D1 to D9
supposedly measure Construct D.
Statistically analyze and evaluate these 35 items. That is, conduct an appropriate item analysis of
the items and of the four intended scales (A, B, C, and D). Do all of the items appear to work well
in measuring the four intended constructs? And, if not, which items might be discarded and
why? How do you know that throwing out those items would improve the scale properties?
How well is each of the constructs measured in terms of its reliability?
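Case Study 3: SPSS (cont…)
Selected Results
[Matrix for items B1 to B8 (apparently inter-item correlations); the only values recoverable here are B1 with B1 = 1, B2 with B1 = .458, and B2 with B2 = 1.]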
Links
SPSS & Excel Demo Case Study 1 (English)
Part 1 https://www.youtube.com/watch?v=AG5nCX8_Nxk&t=4s
Part 2 https://www.youtube.com/watch?v=4xRlLj8plkc&t=607s
Part 3 https://www.youtube.com/watch?v=hB3d0dpzXm0&t=97s
Part 4 https://www.youtube.com/watch?v=qpsox5xr6hc&t=253s
Part 5 https://www.youtube.com/watch?v=3zwfQYkr6fk&t=5s
Part 4
Excel Calculations
Folder & Files for Part 4
FOLDER: Data4ItemAnalyses_Excel_SPSS
A172_SGDY_GroupB_Quiz1_DataEntry_Scoring_Organization_v2.xlsx
CaseStudy1_Data.xlsx
A172_SGDY_GroupB_Quiz1_ItemAnalyses_v2.xlsx
ItemAnalyses_inSPSS_CopiedFromExcel.sav
Links
Videos explaining the Excel files (English & Bahasa Malaysia)
Part 1 https://www.youtube.com/watch?v=1UWgODuRpYE&t=2s
Part 2 https://www.youtube.com/watch?v=VD_Nd6OxgJc
Additional links
SPSS & Excel Demo Data Entry (Bahasa Malaysia)
These videos provide materials on how to do item analysis calculations
(by hand or in Excel) based on formulas.
Part 1 https://www.youtube.com/watch?v=ez38bR7rvrM&t=625s
Part 2 https://www.youtube.com/watch?v=n4YSG8sX5io&t=16s
Interpreting Item Difficulty: Polytomous Items
• An item mean near J reflects very low difficulty (provides info
about individuals with very low trait level).
Consideration for Instrument Development
• Ultimately, you want your item difficulties to align with the intended uses of
the instrument.
• If you want to generate good information across a very wide range of trait
levels, you would want to have a very good range of item difficulties.
• If you want to generate very high information at a specific trait level (a cut-score
or a standard), then you would want to have lots of items with a
difficulty that differentiates between individuals at the location of the trait
level of interest.
RECAP
nurliyana@uum.edu.my
nurliyana.bukhari@alumni.uncg.edu