Introduction
The teacher normally prepares a draft of the test. Such a draft is subjected to item
analysis and validation in order to ensure that the final version of the test will be useful and
functional. First, the teacher tries out the draft test on a group of students of similar characteristics
as the intended test takers (try-out phase). From the try-out group, each item will be analyzed in
terms of its ability to discriminate between those who know and those who do not know and also
its level of difficulty (item analysis phase). The item analysis will provide information that will
allow the teacher to decide whether to revise or replace an item (item revision phase). Then,
finally, the final draft of the test is subjected to validation if the intent is to make use of the test
as a standard test for the particular unit or grading period.
What is Item Difficulty?
Item difficulty, or the difficulty of an item, is defined as the number of students who are
able to answer the item correctly divided by the total number of students. Thus:

Item Difficulty = (Number of students with correct answer) / (Total number of students)
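As a quick illustration of the formula, here is a minimal sketch in Python; the counts (30 correct answers out of 40 students) are made-up figures, not data from this module.

```python
# Item difficulty: proportion of students answering the item correctly.
# The counts below are hypothetical illustration data.

def item_difficulty(num_correct, num_students):
    """Number of students with the correct answer divided by the total."""
    return num_correct / num_students

print(item_difficulty(30, 40))  # 0.75
```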
One problem with this type of difficulty index is that it may not actually indicate that the
item is difficult (or easy). A student who does not know the subject matter will naturally be
unable to answer the item correctly even if the question is easy. How do we decide on the basis
of this index whether the item is too difficult or too easy?
An easy way to derive such a measure is to compare how difficult an item is with
respect to those in the upper 25% of the class and how difficult it is with respect to those in the
lower 25% of the class. If the upper 25% of the class found the item easy yet the lower 25% of
the class found it difficult, then the item can discriminate properly between these two groups.
Index of Discrimination = DU – DL

where DU is the proportion of the upper 25% of the class who got the item right and DL is the
proportion of the lower 25% who got the item right.

Theoretically, the index of discrimination can range from -1.0 (when DU = 0 and DL =
1) to 1.0 (when DU = 1 and DL = 0). When the index of discrimination is equal to -1, then this
means that all of the lower 25% of the students got the correct answer while all of the upper 25%
got the wrong answer. In a sense, such an index discriminates correctly between the two groups
but the item itself is highly questionable. Why should the bright ones get the wrong answer and
the poor ones get the right answer? On the other hand, if the index of discrimination is 1.0, then
this means that all of the lower 25% failed to get the correct answer while all of the upper 25%
got the correct answer. This is a perfectly discriminating item and is the ideal item that should be
included in the test.

ED 108 – Assessment of Learning 1
The correct response is B. Let us compute the index of discrimination:
Discrimination Index = DU – DL
= 0.75 – 0.25
= 0.50 or 50%.
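The same arithmetic can be sketched in Python; the values DU = 0.75 and DL = 0.25 are taken from the computation above.

```python
# Index of discrimination: upper-group proportion correct (DU) minus
# lower-group proportion correct (DL).

def discrimination_index(du, dl):
    """Difference between the upper and lower groups' proportions correct."""
    return du - dl

print(discrimination_index(0.75, 0.25))  # 0.5
```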
Thus, the item has “good discriminating power”.
It is also instructive to note that distracter A is not an effective distracter since it
was never selected by the students. Distracters C and D appear to have good appeal as distracters.
“…PROPORTION RIGHT option on the item analysis header sheet. Whichever index is selected is
shown as the INDEX OF DIFFICULTY on the item analysis print-out. For classroom
achievement tests, most test constructors desire items with indices of difficulty no lower than 20
nor higher than 80, with an average index of difficulty from 30 or 40 to a maximum of 60.
The INDEX OF DISCRIMINATION is the difference between the proportion of the
upper group who got an item right and the proportion of the lower group who got the item right.
This index is dependent upon the difficulty of an item. It may reach a maximum value of 100 for
an item with an index of difficulty of 50, that is, when 100% of the upper group and none of the
lower group answer the item correctly. For items of less than or greater than 50 difficulty, the
index of discrimination has a maximum value of less than 100. The “Interpreting the Index of
Discrimination” document contains a more detailed discussion of the index of discrimination.”
(http://www.msu.edu/dept).
Validation
After performing the item analysis and revising the items which need revision, the next
step is to validate the instrument. The purpose of validation is to determine the characteristics of
the whole test itself, namely, the validity and reliability of the test. Validation is the process of
collecting and analyzing evidence to support the meaningfulness and usefulness of the test.
What is Validity?
Validity is the extent to which a test measures what it purports to measure or, alternatively,
the appropriateness, correctness, meaningfulness and usefulness of the specific decisions a
teacher makes based on the test results. These two definitions of validity differ in the sense that
the first definition refers to the test itself while the second refers to the decisions made by the
teacher based on the test. A test is valid when it is aligned to the learning outcome.
Content-Related Evidence of Validity: Refers to the content and the format of the instrument.
How appropriate is the content? How comprehensive? Does it logically get at the intended
variable? How adequately does the sample of items or questions represent the content to be
assessed?

Criterion-Related Evidence of Validity: Refers to the relationship between scores obtained
using the instrument and scores obtained using one or more other tests (often called the
criterion). How strong is this relationship? How well do such scores estimate present
performance or predict future performance of a certain type?

Construct-Related Evidence of Validity: Refers to the nature of the psychological construct or
characteristic being measured by the test. How well does a measure of the construct explain
differences in the behavior of individuals or their performance on certain tasks?
The usual procedure for determining content validity may be described as follows: The
teacher writes out the objectives of the test based on the table of specifications and then gives
these, together with the test, to at least two (2) experts along with a description of the intended test
takers. The experts look at the objectives, read over the items in the test, and place a check mark
in front of each question or item that they feel does not measure one or more of the objectives.
In order to obtain evidence of criterion-related validity, the teacher usually compares
scores on the test in question with scores on some other independent criterion test which
presumably already has high validity. For example, if a test is designed to measure the mathematics
ability of students and it correlates highly with a standardized mathematics achievement test
(external criterion), then we say we have high criterion-related evidence of validity. In particular,
this type of criterion-related validity is called concurrent validity. Another type of criterion-
related validity is called predictive validity, wherein the test scores in the instrument are
correlated with scores on a later performance (criterion measure) of the students. For example,
the mathematics ability test constructed by the teacher may be correlated with the students' later
performance in a Division-wide mathematics achievement test.
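For concreteness, here is a minimal sketch of checking concurrent validity with a Pearson correlation coefficient; the two score lists are made-up illustrative data, not scores from this module.

```python
# Pearson correlation between a teacher-made test and an external
# criterion test. The score lists below are hypothetical.

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

teacher_test = [35, 28, 40, 22, 31, 38]    # hypothetical teacher-made test scores
criterion_test = [70, 55, 82, 48, 60, 78]  # hypothetical standardized test scores

r = pearson_r(teacher_test, criterion_test)
print(round(r, 2))  # a value close to 1 indicates strong concurrent validity
```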
Apart from the use of the correlation coefficient in measuring criterion-related validity,
Gronlund suggested using the so-called expectancy table. This table is easy to construct and
consists of test (predictor) categories listed on the left-hand side and criterion categories listed
horizontally along the top of the chart. For example, suppose that a mathematics achievement
test is constructed and the scores are categorized as high, average, and low. The criterion
measure is the final average grade of the students in high school: Very Good, Good, and Needs
Improvement. The two-way table lists down the number of students falling under each of the
possible pairs (test, grade) as shown below:
                 Grade Point Average
Test Score    Very Good    Good    Needs Improvement
High              20         10            5
Average           10         25            5
Low                1         10           14
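An expectancy table of this kind is simply a frequency count over (test category, grade) pairs, which can be sketched as follows; the pairs here are a small made-up sample, not the 100 students tabulated above.

```python
# Build expectancy-table counts from (test category, grade category)
# pairs. The sample below is hypothetical.
from collections import Counter

pairs = [
    ("High", "Very Good"), ("High", "Very Good"), ("High", "Good"),
    ("Average", "Good"), ("Average", "Good"), ("Average", "Needs Improvement"),
    ("Low", "Needs Improvement"),
]

table = Counter(pairs)  # (test category, grade) -> number of students
for (test_cat, grade), count in sorted(table.items()):
    print(f"{test_cat:8} {grade:18} {count}")
```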
The expectancy table shows that 20 students obtained high test scores and were
subsequently rated Very Good in terms of their final grades; 25 students got average test scores
and were rated Good in their finals; and 14 students obtained low test scores and were later
graded as Needs Improvement. The evidence for this particular test tends to indicate that students
getting high scores on it would later be graded Very Good, students getting average scores would
later be rated Good, and students getting low scores would later be graded as Needs Improvement.
Reliability
Reliability refers to the consistency of the scores obtained – how
consistent they are for each individual from one administration of an
instrument to another and from one set of items to another.
For internal consistency, for instance, we could use the split-half
method or the Kuder-Richardson formulas (KR-20 or KR-21).
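As an illustration of how KR-20 is computed, here is a minimal sketch for dichotomously (0/1) scored items; the response matrix is made-up data, and the use of the population variance of total scores is a common convention assumed here.

```python
# Kuder-Richardson formula 20 (KR-20) for dichotomously scored items:
# r = k/(k-1) * (1 - sum(p*q) / variance of total scores)

def kr20(responses):
    """KR-20 reliability; responses is a list of equal-length 0/1 rows,
    one row per student, one column per item."""
    n = len(responses)        # number of students
    k = len(responses[0])     # number of items
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n  # population variance
    pq = 0.0
    for item in range(k):
        p = sum(row[item] for row in responses) / n  # proportion correct on this item
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / var_total)

matrix = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]  # hypothetical responses
print(round(kr20(matrix), 2))  # 0.75
```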
Reliability and validity are related concepts. If an instrument is unreliable, it cannot yield
valid results. As reliability improves, validity may improve (or it may not). However, if an
instrument is shown scientifically to be valid, then it is almost certain that it is also reliable.
The following table is a standard followed almost universally in educational tests and
measurement.
Reliability       Interpretation
0.90 and above    Excellent reliability; at the level of the best standardized tests.
0.80 – 0.90       Very good for a classroom test.
0.70 – 0.80       Good for a classroom test; in the range of most classroom tests. There are
                  probably a few items which could be improved.
0.60 – 0.70       Somewhat low. This test needs to be supplemented by other measures (e.g.,
                  more tests) to determine grades. There are probably some items which could
                  be improved.
0.50 – 0.60       Suggests need for revision of the test, unless it is quite short (ten or fewer
                  items). The test definitely needs to be supplemented by other measures (e.g.,
                  more tests) for grading.
Below 0.50        Questionable reliability. This test should not contribute heavily to the course
                  grade, and it needs revision.
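The bands above can be translated directly into a small helper; note that treating each lower bound as inclusive (so that, e.g., exactly 0.90 counts as excellent) is an assumption made here to resolve the shared band edges in the table.

```python
# Map a reliability coefficient to the interpretation bands above.
# Lower bounds are treated as inclusive (an assumption).

def interpret_reliability(r):
    if r >= 0.90:
        return "Excellent reliability"
    elif r >= 0.80:
        return "Very good for a classroom test"
    elif r >= 0.70:
        return "Good for a classroom test"
    elif r >= 0.60:
        return "Somewhat low"
    elif r >= 0.50:
        return "Suggests need for revision"
    else:
        return "Questionable reliability"

print(interpret_reliability(0.85))  # Very good for a classroom test
```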
Additional Readings:
https://assessment.tki.org.nz/Using-evidence-for-learning/Working-with-data/Concepts/Reliability-and-validity
https://fcit.usf.edu/assessment/basic/basicc.html
https://www.washington.edu/assessment/scanning-scoring/scoring/reports/item-analysis/
References:
1. http://www.specialconnections.ku.edu/?q=assessment/quality_test_construction/teacher_tools/item_analysis
2. Navarro, R.L., Santos, R.G., & Corpus, B.B. (2017). Assessment of Learning 1. LORIMAR Publishing Inc.