
Chapter 6

Item Analysis and Validation


Learning outcomes:

• Explain the meaning of item analysis, item validity, reliability, item difficulty, and discrimination index
• Determine the validity and reliability of a test

Introduction

The teacher normally prepares a draft of the test. This draft is subjected to item analysis and
validation in order to ensure that the final version of the test will be useful and functional. First, the
teacher tries out the draft test on a group of students with characteristics similar to those of the intended
test takers (try-out phase). From the try-out group, each item is analyzed in terms of its ability to
discriminate between those who know and those who do not know the material, and in terms of its level
of difficulty (item analysis phase). The item analysis provides information that allows the teacher to decide
whether to revise or replace an item (item revision phase). Finally, the final draft of the test is subjected to
validation if the intent is to use the test as a standard test for the particular unit or grading period.

Item Analysis

There are two important characteristics of an item that will be of interest to the teacher. These
are: (a) item difficulty and (b) discrimination index. We shall learn how to measure these characteristics
and apply our knowledge in making a decision about the item in question.

The difficulty of an item or item difficulty is defined as the number of students who are able to
answer the item correctly divided by the total number of students.

Item difficulty is usually expressed as a percentage.

Example: What is the item difficulty index of an item if 25 students are unable to answer it
correctly while 75 answered it correctly?

Here, the total number of students is 100, hence, the item difficulty index is 75/100 or 75%.
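
This computation is simple enough to express in a few lines of code. The sketch below is only illustrative; the function and argument names are ours, not part of any standard package.

```python
def item_difficulty(num_correct, num_students):
    """Proportion of students answering the item correctly, as a percentage."""
    return 100.0 * num_correct / num_students

# Example from the text: 75 of 100 students answered the item correctly.
print(item_difficulty(75, 100))  # 75.0 -> difficulty index of 75%
```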

One problem with this type of difficulty index is that it may not actually indicate that the item is
difficult (or easy). A student who does not know the subject matter will naturally be unable to answer
the item correctly even if the question is easy. How do we decide on the basis of this index whether the
item is too difficult or too easy? The following arbitrary rule is often used in the literature:

Range of Difficulty Index     Interpretation        Action

0 - 0.25                      Difficult             Revise or discard

0.26 - 0.75                   Right difficulty      Retain

0.76 and above                Easy                  Revise or discard
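
The rule in the table can be encoded directly. A minimal sketch, assuming the difficulty index is given as a proportion between 0 and 1 (the function name is ours):

```python
def classify_difficulty(p):
    """Interpret a difficulty index expressed as a proportion (0.0 - 1.0)."""
    if p <= 0.25:
        return "Difficult - revise or discard"
    elif p <= 0.75:
        return "Right difficulty - retain"
    else:
        return "Easy - revise or discard"

print(classify_difficulty(0.75))  # Right difficulty - retain
```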

Difficult items tend to discriminate between those who know and those who do not know the
answer. Conversely, easy items cannot discriminate between these two groups of students. We are
therefore interested in deriving a measure that will tell us whether an item can discriminate between
those two groups of students. Such a measure is called an index of discrimination.
Example: Consider a multiple-choice test item for which the following data were obtained:

Item 1          A      B+     C      D
Total           0      40     20     20
Upper 25%       0      15      5      0
Lower 25%       0       5     10      5

The correct response is B. Let us compute the difficulty index and the index of discrimination:

Difficulty index = no. of students getting the correct response / total number of students

= 40/100 = 40%, which is within the range of a ‘good item’

The discrimination index can similarly be computed:

DU = no. of students in the upper 25% with correct response / no. of students in the upper 25%

= 15/20 = .75 or 75%

DL = no. of students in the lower 25% with correct response / no. of students in the lower 25%

= 5/20 = .25 or 25%

Discrimination Index = DU – DL = .75 – .25 = .50 or 50%.

Thus, the item also has a “good discriminating power”.

It is also instructive to note that distracter A is not an effective distracter since it was
never selected by any student. Distracters C and D appear to have good appeal as distracters.
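
For the item above, the whole analysis (difficulty, discrimination, and a check on the distracters) can be sketched as follows. The counts are those in the table, the stated total of 100 examinees is used for the difficulty index, and the variable names are ours.

```python
# Response counts for item 1 (options A, B, C, D); B is the correct answer.
total_group = {"A": 0, "B": 40, "C": 20, "D": 20}   # full-group counts from the table
upper_25    = {"A": 0, "B": 15, "C": 5,  "D": 0}    # upper 25% group (20 students)
lower_25    = {"A": 0, "B": 5,  "C": 10, "D": 5}    # lower 25% group (20 students)
total_examinees = 100                               # total stated in the text

difficulty = total_group["B"] / total_examinees      # 40/100 = 0.40
DU = upper_25["B"] / sum(upper_25.values())          # 15/20  = 0.75
DL = lower_25["B"] / sum(lower_25.values())          # 5/20   = 0.25
discrimination = DU - DL                             # 0.50

print(f"difficulty = {difficulty:.2f}, discrimination = {discrimination:.2f}")

# A distracter that nobody chooses (option A here) is not doing its job.
for option in ("A", "C", "D"):
    if total_group[option] == 0:
        print(f"Option {option} is an ineffective distracter")
```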

Basic Item Analysis Statistics

The Michigan State University Measurement and Evaluation Department reports a number of
item statistics which aid in evaluating the effectiveness of an item. The first of these is the index of
difficulty, which MSU defines as the proportion of the total group who got the item wrong. Thus, a high
index indicates a difficult item and a low index indicates an easy item. Some item analysts prefer an
index of difficulty which is the proportion of the total group who got an item right. This index may be
obtained by marking the PROPORTION RIGHT option on the item analysis header sheet. Whichever
index is selected is shown as the INDEX OF DIFFICULTY on the item analysis print-out. For classroom
achievement tests, most test constructors prefer items with indices of difficulty ranging from 30 or 40 to a
maximum of 60.

The INDEX OF DISCRIMINATION is the difference between the proportion of the upper group
who got an item right and the proportion of the lower group who got the item right. This index is
dependent upon the difficulty of an item. It may reach a maximum value of 100 for an item with an
index of difficulty of 50, that is, when 100% of the upper group and none of the lower group answer the
item correctly. For items of less than or greater than 50 difficulty, the index of discrimination has a
maximum value of less than 100. The document Interpreting the Index of Discrimination contains a more
detailed discussion of this index.
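
How the ceiling on the index of discrimination varies with difficulty can be illustrated numerically. The sketch below assumes equal-sized upper and lower groups; the formula is a simple consequence of that assumption, not something quoted from the MSU material.

```python
def max_discrimination(difficulty):
    """Largest attainable index of discrimination (0-100 scale) for an item
    whose index of difficulty is `difficulty` (0-100 scale), assuming the
    upper and lower groups are of equal size."""
    return 2 * min(difficulty, 100 - difficulty)

for d in (10, 30, 50, 70, 90):
    print(d, max_discrimination(d))   # reaches 100 only when difficulty is 50
```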

More Sophisticated Discrimination Index

Item discrimination refers to the ability of an item to differentiate among students on the basis
of how well they know the material being tested. Various hand calculation procedures have traditionally
been used to compare item responses to total test scores using high and low scoring groups of students.
Computerized analyses provide more accurate assessment of the discrimination power of items because
they take into account responses of all students rather than just high and low scoring groups.
The item discrimination index provided by ScorePak® is a Pearson Product Moment correlation
between student responses to a particular item and total scores on all other items on the test. This
index is the equivalent of a point-biserial coefficient in this application. It provides an estimate of the
degree to which an individual item is measuring the same thing as the rest of the items.
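
The idea can be sketched as follows: score each item 0/1 and correlate it with the total of the remaining items. This is not ScorePak's own code, only an illustration of a point-biserial style discrimination index.

```python
import numpy as np

def item_discrimination(responses):
    """Point-biserial style discrimination: Pearson correlation between each
    item (scored 0/1) and the total score on all *other* items.

    responses: 2-D array-like, one row per student, one column per item (0 or 1).
    """
    responses = np.asarray(responses, dtype=float)
    indices = []
    for j in range(responses.shape[1]):
        item = responses[:, j]
        rest = responses.sum(axis=1) - item   # total score excluding this item
        indices.append(np.corrcoef(item, rest)[0, 1])
    return indices

# Tiny illustrative data set: 6 students, 4 items (invented for illustration).
data = [[1, 1, 1, 0],
        [1, 1, 0, 1],
        [1, 0, 1, 0],
        [0, 1, 0, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 0]]
print([round(r, 2) for r in item_discrimination(data)])
```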

Because the discrimination index reflects the degree to which an item and the test as a whole
are measuring a unitary ability or attribute, values of the coefficient will tend to be lower for tests
measuring a wide range of content areas than for more homogeneous tests. Item discrimination indices
must always be interpreted in the context of the type of test which is being analyzed. Items with low
discrimination indices are often ambiguously worded and should be examined. Items with negative
indices should be examined to determine why a negative value was obtained. For example, a negative
value may indicate that the item was mis-keyed, so that students who knew the material tended to
choose an unkeyed, but correct response option.

Tests with high internal consistency consist of items with mostly positive relationships with total
test score. In practice, values of the discrimination index will seldom exceed .50 because of the differing
shapes of item and total score distributions. ScorePak® classifies item discrimination as “good” if the
index is above .30; “fair” if it is between .10 and .30; and “poor” if it is below .10.

A good item is one that has good discriminating ability and a sufficient level of difficulty (neither
too difficult nor too easy). In the two tables presented for the levels of difficulty and discrimination, there
is a small area of intersection where the two indices coincide (between 0.56 and 0.67), which represents
the good items in a test.

At the end of the Item Analysis report, test items are listed according to their degrees of
difficulty (easy, medium, hard) and discrimination (good, fair, poor). These distributions provide a quick
overview of the test, and can be used to identify items which are not performing well and which can
perhaps be improved or discarded.
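
A listing of that kind can be produced by tagging each item with its two categories. The discrimination cut-offs below are the ScorePak values quoted earlier; the easy/medium/hard cut-offs are assumptions for illustration, since the report's exact bands are not given in the text.

```python
def discrimination_category(r):
    """ScorePak-style labels quoted in the text."""
    if r > 0.30:
        return "good"
    elif r >= 0.10:
        return "fair"
    return "poor"

def difficulty_category(p, easy=0.76, hard=0.25):
    """Illustrative easy/medium/hard bands (cut-offs assumed, not from ScorePak)."""
    if p >= easy:
        return "easy"
    elif p <= hard:
        return "hard"
    return "medium"

items = [("Item 1", 0.40, 0.50), ("Item 2", 0.85, 0.05)]
for name, p, r in items:
    print(name, difficulty_category(p), discrimination_category(r))
```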

Index of Difficulty

P = (Ru + RL) / T x 100

Where:
Ru ---- the number in the upper group who answered the item correctly
RL ---- the number in the lower group who answered the item correctly
T ---- the total number who tried the item

Index of Item Discriminating Power

D = (Ru – RL) / (½T)

where Ru, RL, and T are as defined above, and P (the index of difficulty) is the percentage who answered the item correctly.

Example: Suppose an item was tried by 20 students (10 in the upper group and 10 in the lower group), and it was answered correctly by 6 students in the upper group and 2 students in the lower group. Then

P = (6 + 2) / 20 x 100 = 40%

The smaller the percentage figure, the more difficult the item.

Estimate the item discriminating power using the formula above:

D = (6 – 2) / (½ x 20) = 4/10 = 0.40

The discriminating power of an item is reported as a decimal fraction; maximum discriminating
power is indicated by an index of 1.00. Maximum discrimination is usually found at the 50 percent level
of difficulty.
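
The two formulas and the worked figures can be checked with a short sketch (Ru = 6, RL = 2, and T = 20, as in the example; the function names are ours):

```python
def index_of_difficulty(Ru, RL, T):
    """P = (Ru + RL) / T x 100"""
    return 100.0 * (Ru + RL) / T

def discriminating_power(Ru, RL, T):
    """D = (Ru - RL) / (T / 2)"""
    return (Ru - RL) / (T / 2.0)

print(index_of_difficulty(6, 2, 20))    # 40.0 -> 40%
print(discriminating_power(6, 2, 20))   # 0.4
```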

The difficulty index may be interpreted as follows:

0.00 – 0.20 = Very difficult
0.21 – 0.80 = Moderately difficult
0.81 – 1.00 = Very easy
Validation
After performing the item analysis and revising the items which need revision, the next step is to
validate the instrument. The purpose of validation is to determine the characteristics of the whole test
itself, namely, the validity and reliability of the test. Validation is the process of collecting and analyzing
evidence to support the meaningfulness and usefulness of the test.

Validity. Validity is the extent to which a test measures what it purports to measure; alternatively, it
refers to the appropriateness, correctness, meaningfulness, and usefulness of the specific decisions a
teacher makes based on the test results. These two definitions of validity differ in the sense that the first
refers to the test itself while the second refers to the decisions the teacher makes based on
the test. A test is valid when it is aligned with the learning outcome.

A teacher who conducts test validation might want to gather different kinds of evidence. There
are essentially three types of evidence that may be collected: content-related evidence of validity,
criterion-related evidence of validity and construct-related evidence of validity. Content-related
evidence of validity refers to the content and format of the instrument. How appropriate is the content?
How comprehensive? Does it logically get at the intended variable? How adequately does the sample of
items or questions represent the content to be assessed?

Criterion-related evidence of validity refers to the relationship between scores obtained using
the instrument and scores obtained using one or more other tests (often called criterion). How strong is
this relationship? How well do such scores estimate present or predict future performance of a certain
type?

Construct-related evidence of validity refers to the nature of the psychological construct or
characteristic being measured by the test. How well does a measure of the construct explain differences
in the behavior of individuals or their performance on a certain task?

The usual procedure for determining content validity may be described as follows: The teacher
writes out the objectives of the test based on the table of specifications and then gives these together
with the test to at least two (2) experts along with a description of the intended test takers. The experts
look at the objectives, read over the items in the test and place a check mark in front of each question or
item that they feel does not measure one or more objectives. They also place a check mark in front of
each objective not assessed by any item in the test. The teacher then rewrites any item so checked and
resubmits to the experts and/or writes new items to cover those objectives not heretofore covered by
the existing test. This continues until the experts approve of all items and also until the experts agree
that all of the objectives are sufficiently covered by the test.

In order to obtain evidence of criterion-related validity, the teacher usually compares scores on
the test in question with scores on some other independent criterion test which presumably already has
high validity. For example, if a test is designed to measure the mathematics ability of students and it
correlates highly with a standardized mathematics achievement test (external criterion), then we say we
have high criterion-related evidence of validity. In particular, this type of criterion-related evidence is
called concurrent validity. Another type of criterion-related validity is called predictive validity,
wherein the test scores in the instrument are correlated with scores on a later performance (criterion
measure) of the students. For example, the mathematics ability test constructed by the teacher may be
correlated with the students' later performance in a Division-wide mathematics achievement test.
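
Criterion-related evidence of this kind is usually summarized with a correlation coefficient between the two sets of scores. A minimal sketch, using invented scores purely for illustration:

```python
import numpy as np

# Scores on the teacher-made test and on an external criterion test
# (hypothetical numbers, for illustration only).
teacher_test = [35, 42, 28, 50, 39, 31, 45, 38]
criterion    = [60, 72, 55, 88, 70, 58, 80, 66]

r = np.corrcoef(teacher_test, criterion)[0, 1]
print(f"criterion-related validity coefficient: {r:.2f}")
```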

Apart from the use of correlation coefficient in measuring criterion-related validity, Gronlund
suggested using the so-called expectancy table. This table is easy to construct and consists of the test
(predictor) categories listed on the left hand side and the criterion categories listed horizontally along
the top of the chart. For example, suppose that a mathematics achievement test is constructed and the
scores are categorized as high, average, and low. The criterion measure used is the final average grades
of the students in high school: Very Good, Good, and Needs Improvement. The two-way table lists
the number of students falling under each of the possible (test score, grade) pairs, as shown below.
                         Grade Point Average
Test Score     Very Good     Good     Needs Improvement
High              20          10              5
Average           10          25              5
Low                1          10             14
The expectancy table shows that there were 20 students who obtained high test scores and were
subsequently rated Very Good in terms of their final grades; 25 students got average scores and were
subsequently rated Good in their finals; and finally, 14 students obtained low test scores and were later
graded as Needs Improvement. The evidence for this particular test tends to indicate that students
getting high scores on it would later be rated Very Good, students getting average scores would later be
rated Good, and students getting low scores on the test would later be graded as Needs Improvement.
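
An expectancy table is simply a cross-tabulation of test-score categories against grade categories, so it can be built directly from paired records. A minimal sketch, with hypothetical records for illustration:

```python
from collections import Counter

# Each record pairs a student's test-score category with the grade category
# later obtained (hypothetical records, for illustration only).
records = [("High", "Very Good"), ("High", "Good"), ("Average", "Good"),
           ("Average", "Good"), ("Low", "Needs Improvement"), ("Low", "Good")]

counts = Counter(records)
rows = ["High", "Average", "Low"]
cols = ["Very Good", "Good", "Needs Improvement"]

print("Test Score".ljust(12) + "".join(c.ljust(20) for c in cols))
for r in rows:
    print(r.ljust(12) + "".join(str(counts[(r, c)]).ljust(20) for c in cols))
```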

Reliability

