
ASSESSMENT OF LEARNING 1

CHAPTER 6:
ITEM ANALYSIS AND VALIDATION

6.1. Item Analysis: Difficulty Index and Discrimination Index

There are two important characteristics of an item that will be of interest to the teacher.
These are: (a) item difficulty and (b) discrimination index. We shall learn how to
measure these characteristics and apply our knowledge in making a decision about
the item in question.

The difficulty of an item, or item difficulty, is defined as the number of students who are able to answer the item correctly divided by the total number of students. Thus:

Item difficulty = number of students with correct answer / total number of students

The item difficulty is usually expressed as a percentage.

Example: What is the item difficulty index of an item if 25 students are unable to answer it
correctly while 75 answered it correctly?

Here, the total number of students is 100, hence the item difficulty index is 75/100 or 75%.

Another example: 25 students answered the item correctly while 75 did not. The total number of students is 100, so the difficulty index is 25/100 = 0.25, or 25%. This is a more difficult item than one with a difficulty index of 75%.

A high percentage indicates an easy item/question while a low percentage indicates a difficult item.
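The two worked examples above can be checked with a short sketch (plain Python; the function name is ours):

```python
def difficulty_index(correct: int, total: int) -> float:
    """Item difficulty: proportion of students answering the item correctly."""
    return correct / total

# First example: 75 of 100 students answered correctly.
print(difficulty_index(75, 100))  # 0.75, i.e. 75% (an easy item)

# Second example: only 25 of 100 answered correctly.
print(difficulty_index(25, 100))  # 0.25, i.e. 25% (a difficult item)
```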

Range of Difficulty Index    Interpretation       Action

0 - 0.25                     Difficult            Revise or discard
0.26 - 0.75                  Right difficulty     Retain
0.76 - 1.00                  Easy                 Revise or discard

Difficult items tend to discriminate between those who know and those who do not know
the answer. Conversely, easy items cannot discriminate between these two groups of students. We
are therefore interested in deriving a measure that will tell us whether an item can discriminate
between these two groups of students. Such a measure is called an index of discrimination.

An easy way to derive such a measure is to measure how difficult an item is with respect to
those in the upper 25% of the class and how difficult it is with respect to those in the lower 25% of the
class. If the upper 25% of the class found the item easy yet the lower 25% found it difficult, then the
item can discriminate properly between these two groups. Thus:

Index of discrimination = DU - DL (U = upper group; L = lower group)

Example: Obtain the index of discrimination of an item if the upper 25% of the class had a
difficulty index of 0.60 (i.e. 60% of the upper 25% got the correct answer) while the lower 25% of the
class had a difficulty index of 0.20.

Here, DU = 0.60 while DL = 0.20, thus index of discrimination = .60 - .20 = .40.
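The computation DU - DL can be sketched as follows (the function name is ours):

```python
def discrimination_index(du: float, dl: float) -> float:
    """Index of discrimination: upper-group difficulty minus lower-group difficulty."""
    return du - dl

# Example from the text: DU = 0.60, DL = 0.20.
print(round(discrimination_index(0.60, 0.20), 2))  # 0.4
```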

The discrimination index is the difference between the proportion of the top scorers who got an item correct and the proportion of the lowest scorers who got it correct. The discrimination index ranges between -1 and +1. The closer the discrimination index is to +1, the more effectively the item distinguishes between the two groups of students. A negative discrimination index means that more students from the lower group got the item correct; such an item is not good and should be discarded.

Theoretically, the index of discrimination can range from -1.0 (when DU = 0 and DL = 1) to 1.0 (when DU = 1 and DL = 0). When the index of discrimination is equal to -1, all of the lower 25% of the students got the correct answer while all of the upper 25% got the wrong answer. In a sense, such an index discriminates correctly between the two groups, but the item itself is highly questionable.

Index Range      Interpretation                               Action

-1.0 - -0.50     Can discriminate but item is questionable    Discard
-0.49 - 0.45     Non-discriminating                           Revise
0.46 - 1.0       Discriminating item                          Include

Example: Consider a multiple-choice test item for which the following data were obtained (B* marks the correct answer):

Item 1       A      B*     C      D

Total        0      40     20     20
Upper 25%    0      15      5      0
Lower 25%    0       5     10      5
The correct response is B. Let us compute the difficulty index and index of discrimination:

Difficulty index = no. of students getting correct response / total = 40/100 = 40%, which is within the range of a "good item".

The discrimination index can similarly be computed:

DU = no. of students in upper 25% with correct response / no. of students in the upper 25% = 15/20 = 0.75 or 75%

DL = no. of students in lower 25% with correct response / no. of students in the lower 25% = 5/20 = 0.25 or 25%

Discrimination index = DU - DL = 0.75 - 0.25 = 0.50 or 50%. Thus, the item also has "good discriminating power."

It is also instructive to note that distracter A is not an effective distracter, since it was never selected by the students; it is an implausible distracter. Distracters C and D appear to have good appeal as distracters; they are plausible distracters.
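The whole worked example, including the distracter check, can be reproduced in a short sketch (variable names are ours; the counts come from the table above):

```python
# Response counts for item 1; B is the keyed answer.
total = {"A": 0, "B": 40, "C": 20, "D": 20}
upper = {"A": 0, "B": 15, "C": 5, "D": 0}
lower = {"A": 0, "B": 5, "C": 10, "D": 5}
key, n_students = "B", 100

difficulty = total[key] / n_students      # 40/100 = 0.40
du = upper[key] / sum(upper.values())     # 15/20 = 0.75
dl = lower[key] / sum(lower.values())     # 5/20  = 0.25
discrimination = du - dl                  # 0.75 - 0.25 = 0.50

# A distracter nobody chooses is implausible and should be replaced.
implausible = [o for o, n in total.items() if o != key and n == 0]
print(difficulty, discrimination, implausible)  # 0.4 0.5 ['A']
```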
Index of Difficulty

P = (Ru + RL) / T x 100

Where:
Ru - the number in the upper group who answered the item correctly.
RL - the number in the lower group who answered the item correctly.
T - the total number who tried the item.

Index of Item Discriminating Power

D = (Ru - RL) / (½T)

Where:
Ru - the number in the upper group who answered the item correctly.
RL - the number in the lower group who answered the item correctly.
T - the total number who tried the item.

Example: Suppose Ru = 6, RL = 2, and T = 20. The index of difficulty is

P = (Ru + RL) / T x 100 = 8/20 x 100 = 40%

The smaller the percentage figure, the more difficult the item.

The item discriminating power is

D = (Ru - RL) / (½T) = (6 - 2) / 10 = 0.40
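Both U-L formulas can be put into one sketch (function names are ours):

```python
def index_of_difficulty(ru: int, rl: int, t: int) -> float:
    """P = (Ru + RL) / T x 100, expressed as a percentage."""
    return (ru + rl) / t * 100

def discriminating_power(ru: int, rl: int, t: int) -> float:
    """D = (Ru - RL) / (T/2)."""
    return (ru - rl) / (t / 2)

# Worked example: Ru = 6, RL = 2, T = 20.
print(index_of_difficulty(6, 2, 20))   # 40.0 (percent)
print(discriminating_power(6, 2, 20))  # 0.4
```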

The discriminating power of an item is reported as a decimal fraction; maximum discriminating power is indicated by an index of 1.00. Maximum discrimination is usually found at the 50 percent level of difficulty.

0.00 – 0.20 = Very difficult

0.21 – 0.80 = Moderately difficult

0.81 – 1.00 = Very easy

For classroom achievement tests, most test constructors desire items with indices of difficulty no
lower than 20 nor higher than 80, with an average index of difficulty from 30 or 40 to a maximum of
60.

The INDEX OF DISCRIMINATION is the difference between the proportion of the upper group who got an item right and the proportion of the lower group who got the item right. This index is dependent upon the difficulty of an item. It may reach a maximum value of 100 for an item with an index of difficulty of 50, that is, when 100% of the upper group and none of the lower group answer the item correctly. For items of less than or greater than 50% difficulty, the index of discrimination has a maximum value of less than 100.
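This ceiling can be illustrated with a small sketch. Assuming equal-sized upper and lower groups, the largest attainable index of discrimination for an item whose overall proportion correct is p works out to 2 x min(p, 1 - p) (our derivation, stated in proportions rather than the 0-100 scale used in the text):

```python
def max_discrimination(p: float) -> float:
    """Largest possible discrimination index for overall difficulty p
    (proportion correct), with equal-sized upper and lower groups."""
    return 2 * min(p, 1 - p)

print(max_discrimination(0.5))            # 1.0 -> 100 on the text's 0-100 scale
print(round(max_discrimination(0.8), 2))  # 0.4
```

At p = 0.5, every correct answer can sit in the upper group and every wrong answer in the lower group, so D can reach 1.0; at any other difficulty one group must contain a mix, capping D below 1.0.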

6.2 VALIDATION AND VALIDITY

After performing the item analysis and revising the items which need revision, the next step is to
validate the instrument. The purpose of validation is to determine the characteristics of the whole
test itself, namely, the validity and reliability of the test. Validation is the process of collecting and
analyzing evidence to support the meaningfulness and usefulness of the test.

Validity. Validity is the extent to which a test measures what it purports to measure, or the appropriateness, correctness, meaningfulness and usefulness of the specific decisions a teacher makes based on the test results. These two definitions differ in that the first refers to the test itself while the second refers to the decisions the teacher makes based on the test. A test is valid when it is aligned with the learning outcome.

Criterion-related evidence of validity refers to the relationship between scores obtained using the instrument and scores obtained using one or more other tests (often called the criterion). How strong is this relationship? How well do such scores estimate present performance or predict future performance of a certain type?

Construct-related evidence of validity refers to the nature of the psychological construct or characteristic being measured by the test. How well does a measure of the construct explain differences in the behavior of individuals or their performance on a certain task?

When the criterion scores are obtained at the same time as the test scores, this type of criterion-related validity is called concurrent validity. Another type of criterion-related validity is predictive validity, wherein the test scores are correlated with scores on a later performance (the criterion measure) of the students.

Criterion-related validity is also known as concrete validity because criterion validity refers to a test's correlation with a concrete outcome.

There are two main types of criterion validity: concurrent validity and predictive validity. Concurrent validity refers to a comparison between the measure in question and an outcome assessed at the same time.
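Criterion-related evidence is usually quantified with a correlation coefficient. A minimal sketch, using hypothetical paired scores (the numbers and the function name are ours, not from the text): the stronger the positive correlation between the new measure and a criterion taken at the same time, the stronger the concurrent-validity evidence.

```python
import statistics

def pearson_r(x, y):
    """Pearson correlation between two paired lists of scores."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical data: scores on a new test vs. an established test
# administered at the same time (the concurrent criterion).
new_test    = [12, 15, 9, 18, 14, 10]
established = [60, 72, 50, 85, 70, 55]
print(pearson_r(new_test, established))  # close to +1: strong evidence
```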


6.3. Reliability

Reliability refers to the consistency of the scores obtained — how consistent they are
for each individual from one administration of an instrument to another and from one set of
items to another.

Reliability and validity are related concepts. If an instrument is unreliable, it cannot yield valid results. As reliability improves, validity may improve (or it may not). However, if an instrument is shown scientifically to be valid, then it is almost certain that it is also reliable.

Predictive validity compares the measure in question with an outcome assessed at a later time. An example of predictive validity is a comparison of scores on the National Achievement Test (NAT) with first-semester grade point average (GPA) in college: do NAT scores predict college performance? Construct validity refers to the ability of a test to measure what it is supposed to measure. If, as a researcher, you intend to measure depression but actually measure anxiety, your research is compromised.

The following table is a standard followed almost universally in educational test and
measurement.

Reliability      Interpretation

.90 and above    Excellent reliability; at the level of the best standardized tests

.80 - .90        Very good for a classroom test

.70 - .80        Good for a classroom test; in the range of most. There are probably a few items which could be improved.

.60 - .70        Somewhat low. This test needs to be supplemented by other measures (e.g., more tests) to determine grades. There are probably some items which could be improved.

.50 - .60        Suggests need for revision of the test, unless it is quite short (ten or fewer items). The test definitely needs to be supplemented by other measures (e.g., more tests) for grading.

.50 or below     Questionable reliability. This test should not contribute heavily to the course grade, and it needs revision.
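The table above can be turned into a small lookup for grading workflows (a sketch; the cutoffs and wording follow the table, the function name is ours):

```python
def interpret_reliability(r: float) -> str:
    """Map a reliability coefficient (0 to 1) to the interpretation above."""
    if r >= 0.90:
        return "Excellent reliability; at the level of the best standardized tests"
    if r >= 0.80:
        return "Very good for a classroom test"
    if r >= 0.70:
        return "Good for a classroom test; a few items could be improved"
    if r >= 0.60:
        return "Somewhat low; supplement with other measures to determine grades"
    if r >= 0.50:
        return "Suggests need for revision, unless the test is quite short"
    return "Questionable reliability; should not contribute heavily to the grade"

print(interpret_reliability(0.85))  # Very good for a classroom test
```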
