

Basic Concepts in Item and Test Analysis

Susan Matlock-Hetzel
Texas A&M University, January 1997
When norm-referenced tests are developed for instructional purposes, to assess the
effects of educational programs, or for educational research purposes, it can be very
important to conduct item and test analyses. These analyses evaluate the quality of the
items and of the test as a whole. Such analyses can also be employed to revise and
improve both items and the test as a whole. However, some best practices in item and
test analysis are too infrequently used in actual practice. The purpose of the present
paper is to summarize the recommendations for item and test analysis practices, as
these are reported in commonly-used measurement textbooks (Crocker & Algina,
1986; Gronlund & Linn, 1990; Pedhazur & Schmelkin, 1991; Sax, 1989; Thorndike,
Cunningham, Thorndike, & Hagen, 1991).
Paper presented at the annual meeting of the Southwest Educational Research
Association, Austin, January, 1997.

Basic Concepts in Item and Test Analysis
Making fair and systematic evaluations of others' performance can be a challenging
task. Judgments cannot be made solely on the basis of intuition, haphazard guessing,
or custom (Sax, 1989). Teachers, employers, and others in evaluative positions use a
variety of tools to assist them in their evaluations. Tests are tools that are frequently
used to facilitate the evaluation process. When norm-referenced tests are developed
for instructional purposes, to assess the effects of educational programs, or for
educational research purposes, it can be very important to conduct item and test analyses.
Test analysis examines how the test items perform as a set. Item analysis "investigates
the performance of items considered individually either in relation to some external
criterion or in relation to the remaining items on the test" (Thompson & Levitov,
1985, p. 163). These analyses evaluate the quality of items and of the test as a whole.
Such analyses can also be employed to revise and improve both items and the test as a whole.
However, some best practices in item and test analysis are too infrequently used in
actual practice. The purpose of the present paper is to summarize the
recommendations for item and test analysis practices, as these are reported in
commonly-used measurement textbooks (Crocker & Algina, 1986; Gronlund & Linn,
1990; Pedhazur & Schmelkin, 1991; Sax, 1989; Thorndike, Cunningham, Thorndike,
& Hagen, 1991). These tools include item difficulty, item discrimination, and item distractors.
Item Difficulty
Item difficulty is simply the percentage of students taking the test who answered the
item correctly. The larger the percentage getting an item right, the easier the item. The
higher the difficulty index, the easier the item is understood to be (Wood, 1960). To
compute the item difficulty, divide the number of people answering the item correctly
by the total number of people answering the item. The proportion for the item is usually
denoted as p and is called item difficulty (Crocker & Algina, 1986). An item answered
correctly by 85% of the examinees would have an item difficulty, or p value, of .85,
whereas an item answered correctly by 50% of the examinees would have a lower
item difficulty, or p value, of .50.
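The computation described above can be sketched in a few lines of Python. This is only an illustration: the function name `item_difficulty` and the layout of the scored responses (one row per examinee, 1 for a correct answer and 0 otherwise) are assumptions, not part of the original paper.

```python
def item_difficulty(responses):
    """Return the p value (proportion correct) for each item.

    `responses` is a list of examinee rows; each row holds 1 (correct)
    or 0 (incorrect) for every item. This layout is an assumption made
    for illustration only.
    """
    n_examinees = len(responses)
    n_items = len(responses[0])
    return [
        sum(row[i] for row in responses) / n_examinees
        for i in range(n_items)
    ]


# Hypothetical scored responses: 4 examinees, 3 items.
scores = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
]
p_values = item_difficulty(scores)
print(p_values)  # [1.0, 0.5, 0.25]
```

In this made-up data set the first item is the easiest (p = 1.0) and the third is the hardest (p = .25), which matches the convention that a higher p value means an easier item.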
A p value is basically a behavioral measure. Rather than defining difficulty in terms of
some intrinsic characteristic of the item, difficulty is defined in terms of the relative
frequency with which those taking the test choose the correct response (Thorndike et
al., 1991). For instance, in the example below, which item is more difficult?
1. Who was Boliver Scagnasty?
2. Who was Martin Luther King?
One cannot determine which item is more difficult simply by reading the questions.
One can recognize the name in the second question more readily than that in the first.
But saying that the first question is more difficult than the second, simply because the
name in the second question is easily recognized, would be to compute the difficulty
of the item using an intrinsic characteristic. This method determines the difficulty of
the item in a much more subjective manner than that of a p value.
Another implication of a p value is that the difficulty is a characteristic of both the
item and the sample taking the test. For example, an English test item that is very
difficult for an elementary student will be very easy for a high school student.
A p value also provides a common measure of the difficulty of test items that measure
completely different domains. It is very difficult to determine whether answering a
history question involves knowledge that is more obscure, complex, or specialized
than that needed to answer a math problem. When p values are used to define
difficulty, it is very simple to determine whether an item on a history test is more
difficult than a specific item on a math test taken by the same group of students.
To make this more concrete, take into consideration the following examples. When
the correct answer is not chosen (p = 0), there are no individual differences in the
"score" on that item. As shown in Table 1, the correct answer C was not chosen by
either the upper group or the lower group. (The upper group and lower group will be
explained later.) The same is true when everyone taking the test chooses the correct
response, as is seen in Table 2. An item with a p value of .0 or a p value of 1.0 does
not contribute to measuring individual differences, and thus is almost certain to be
useless. Item difficulty has a profound effect on both the variability of test scores and
the precision with which test scores discriminate among different groups of examinees
(Thorndike et al., 1991). When all of the test items are extremely difficult, the great
majority of the test scores will be very low. When all items are extremely easy, most
test scores will be extremely high. In either case, test scores will show very little
variability. Thus, extreme p values directly restrict the variability of test scores.
Table 1
Minimum Item Difficulty Example Illustrating No Individual Differences

                  Item Response
Group            A     B     C*    D
Upper group      4     5     0     6
Lower group      2     6     0     7

Note. * denotes correct response.
Item difficulty: p = (0 + 0)/30 = .00
Discrimination index: D = (0 - 0)/15 = .00

Table 2
Maximum Item Difficulty Example Illustrating No Individual Differences

                  Item Response
Group            A     B     C*    D
Upper group      0     0    15     0
Lower group      0     0    15     0

Note. * denotes correct response.
Item difficulty: p = (15 + 15)/30 = 1.00
Discrimination index: D = (15 - 15)/15 = .00
In discussing the procedure for determining the minimum and maximum score on a
test, Thompson and Levitov (1985) stated that
items tend to improve test reliability when the percentage of students
who correctly answer the item is halfway between the percentage
expected to correctly answer if pure guessing governed responses and
the percentage (100%) who would correctly answer if everyone knew the
answer. (pp. 164-165)
For example, many teachers may think that the minimum score on a test consisting of
100 items with four alternatives each is 0, when in actuality the theoretical floor on
such a test is 25. This is the score that would be most likely if a student answered
every item by guessing (e.g., without even being given the test booklet containing the items).
Similarly, the ideal percentage of correct answers on a four-choice multiple-choice
test is not 70-90%. According to Thompson and Levitov (1985), the ideal difficulty
for such an item would be halfway between the percentage of pure guess (25%) and
100%, that is, 25% + (100% - 25%)/2 = 62.5%. Therefore, for a test with 100 items with four
alternatives each, the ideal mean percentage of correct items, for the purpose of
maximizing score reliability, is roughly 63%. Tables 3, 4, and 5 show examples of
items with p values of roughly 63%.
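The two calculations in this passage, the chance-level floor and the ideal difficulty, can be checked with a short Python sketch. The function names `chance_score` and `ideal_difficulty` are illustrative assumptions; the arithmetic follows the rule quoted from Thompson and Levitov (1985).

```python
def chance_score(n_items, n_options):
    """Expected number correct if every item is answered by pure guessing."""
    return n_items / n_options


def ideal_difficulty(n_options):
    """Midpoint between the chance proportion and 1.0 (everyone correct)."""
    chance = 1 / n_options
    return chance + (1 - chance) / 2


print(chance_score(100, 4))  # 25.0
print(ideal_difficulty(4))   # 0.625
```

The same function gives .75 for true-false items (two options), which shows how the target difficulty shifts with the number of alternatives.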
Table 3
Maximum Item Difficulty Example Illustrating Individual Differences

                  Item Response
Group            A     B     C*    D
Upper group      1     0    13     3
Lower group      2     5     5     6

Note. * denotes correct response.
Item difficulty: p = (13 + 5)/30 = .60
Discrimination index: D = (13 - 5)/15 = .53

Table 4
Maximum Item Difficulty Example Illustrating Individual Differences

                  Item Response
Group            A     B     C*    D
Upper group      1     0    11     3
Lower group      2     0     7     6

Note. * denotes correct response.
Item difficulty: p = (11 + 7)/30 = .60
Discrimination index: D = (11 - 7)/15 = .267

Table 5
Maximum Item Difficulty Example Illustrating Individual Differences

                  Item Response
Group            A     B     C*    D
Upper group      1     0     7     3
Lower group      2     0    11     6

Note. * denotes correct response.
Item difficulty: p = (7 + 11)/30 = .60
Discrimination index: D = (7 - 11)/15 = -.267
Item Discrimination
If the test and a single item measure the same thing, one would expect people who do
well on the test to answer that item correctly, and those who do poorly to answer the
item incorrectly. A good item discriminates between those who do well on the test and
those who do poorly. Two indices can be computed to determine the discriminating
power of an item: the item discrimination index, D, and discrimination coefficients.
Item Discrimination Index, D
The method of extreme groups can be applied to compute a very simple measure of
the discriminating power of a test item. If a test is given to a large group of people, the
discriminating power of an item can be measured by comparing the number of people
with high test scores who answered that item correctly with the number of people with
low scores who answered the same item correctly. If a particular item is doing a good
job of discriminating between those who score high and those who score low, more
people in the top-scoring group will have answered the item correctly.
In computing the discrimination index, D, first score each student's test and rank order
the test scores. Next, the 27% of the students at the top and the 27% at the bottom are
separated for the analysis. Wiersma and Jurs (1990) stated that "27% is used because
it has shown that this value will maximize differences in normal distributions while
providing enough cases for analysis" (p. 145). There need to be as many students as
possible in each group to promote stability; at the same time, it is desirable to have the
two groups be as different as possible to make the discriminations clearer. According
to Kelly (as cited in Popham, 1981) the use of 27% maximizes these two
characteristics. Nunnally (1972) suggested using 25%.
The discrimination index, D, is the number of people in the upper group who
answered the item correctly minus the number of people in the lower group who
answered the item correctly, divided by the number of people in the larger of the two
groups. Wood (1960) stated that
when more students in the lower group than in the upper group select the
right answer to an item, the item actually has negative validity.
Assuming that the criterion itself has validity, the item is not only
useless but is actually serving to decrease the validity of the test. (p. 87)
The higher the discrimination index, the better the item because such a value indicates
that the item discriminates in favor of the upper group, which should get more items
correct, as shown in Table 6. An item that everyone gets correct or that everyone gets
incorrect, as shown in Tables 1 and 2, will have a discrimination index equal to zero.
Table 7 illustrates that if more students in the lower group get an item correct than in
the upper group, the item will have a negative D value and is probably flawed.
Table 6
Positive Item Discrimination Index D

                  Item Response
Group            A     B     C*    D
Upper group      3     2    15     0
Lower group     12     3     3     2

Note. * denotes correct response.
74 students took the test.
27% = 20 (N)
Item difficulty: p = (15 + 3)/40 = .45
Discrimination index: D = (15 - 3)/20 = .60

Table 7
Negative Item Discrimination Index D

                  Item Response
Group            A     B     C*    D
Upper group      0     0     0     0
Lower group      0     0    15     0

Note. * denotes correct response.
Item difficulty: p = (0 + 15)/30 = .50
Discrimination index: D = (0 - 15)/15 = -1.0
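The extreme-groups procedure and the D computation illustrated in Tables 6 and 7 might be sketched as follows. The function names are hypothetical, and the handling of tied scores at the cut points (left to sort order here) is an assumption, since the paper does not address ties.

```python
def extreme_groups(total_scores, fraction=0.27):
    """Split examinee indices into top and bottom groups by total score.

    Ties at the cut points are broken by sort order, which is an
    assumption of this sketch; the source does not say how to handle them.
    """
    n_group = max(1, round(len(total_scores) * fraction))
    ranked = sorted(range(len(total_scores)),
                    key=lambda i: total_scores[i], reverse=True)
    return ranked[:n_group], ranked[-n_group:]


def discrimination_index(upper_correct, lower_correct, upper_n, lower_n):
    """D = (upper correct - lower correct) / size of the larger group."""
    return (upper_correct - lower_correct) / max(upper_n, lower_n)


# 74 examinees, as in Table 6: each extreme group holds 20 people.
upper, lower = extreme_groups(list(range(74)))
print(len(upper), len(lower))               # 20 20

# Counts of correct answers taken from Tables 6 and 7.
print(discrimination_index(15, 3, 20, 20))  # 0.6
print(discrimination_index(0, 15, 15, 15))  # -1.0
```

Passing `fraction=0.25` would give the split suggested by Nunnally (1972) instead of the 27% groups.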

A negative discrimination index is most likely to occur with an item that covers complex
material written in such a way that it is possible to select the correct response without
any real understanding of what is being assessed. A poor student may make a guess,
select that response, and come up with the correct answer. Good students may be
suspicious of a question that looks too easy, may take the harder path to solving the
problem, read too much into the question, and may end up being less successful than
those who guess. As a rule of thumb, in terms of discrimination index, items at .40 and
greater are very good; items from .30 to .39 are reasonably good but possibly subject to
improvement; items from .20 to .29 are marginal and need some revision; and items below
.19 are considered poor and need major revision or should be eliminated (Ebel &
Frisbie, 1986).
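This rule of thumb translates directly into a small lookup. The sketch below is illustrative only: the function name `rate_discrimination` is hypothetical, and because the cited bands leave a gap between .19 and .20, the code treats every value under .20 as poor, which is an assumption rather than something the source states.

```python
def rate_discrimination(d):
    """Label an item by the rule of thumb cited from Ebel and Frisbie (1986)."""
    if d >= 0.40:
        return "very good"
    if d >= 0.30:
        return "reasonably good"
    if d >= 0.20:
        return "marginal"
    # The source says "below .19"; treating everything under .20 as poor
    # is an assumption made here to avoid a gap between .19 and .20.
    return "poor"


# D values taken from Tables 3, 4, 6, and 7.
for d in (0.53, 0.267, 0.60, -1.0):
    print(d, rate_discrimination(d))
```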
Discrimination Coefficients
Two indicators of the item's discrimination effectiveness are the point biserial correlation
and the biserial correlation coefficient. The choice of correlation depends upon what kind
of question we want to answer. The advantage of using discrimination coefficients
over the discrimination index (D) is that every person taking the test is used to
compute the discrimination coefficients, whereas only 54% (27% upper + 27% lower) are
used to compute the discrimination index, D.
Point biserial. The point biserial (rpbis) correlation is used to find out if the right people
are getting the items right, and how much predictive power the item has and how it
would contribute to predictions. Henrysson (1971) suggests that the rpbis tells more
about the predictive validity of the total test than does the biserial r, in that it tends to
favor items of average difficulty. It is further suggested that the rpbis is a combined
measure of item-criterion relationship and of difficulty level.
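A minimal sketch of the point biserial follows, together with the biserial estimate (described in the next paragraph) that can be derived from it. The formulas are the standard textbook ones rather than anything printed in this paper, the function names and sample data are hypothetical, and both functions assume that at least one examinee answered correctly and at least one answered incorrectly.

```python
from math import sqrt
from statistics import NormalDist, mean, pstdev


def point_biserial(item, totals):
    """Pearson correlation between a 0/1 item score and the total score."""
    p = mean(item)  # proportion answering correctly
    q = 1 - p
    mean_right = mean(t for x, t in zip(item, totals) if x == 1)
    mean_wrong = mean(t for x, t in zip(item, totals) if x == 0)
    return (mean_right - mean_wrong) / pstdev(totals) * sqrt(p * q)


def biserial(item, totals):
    """Biserial estimate, assuming a normal trait underlies the 0/1 item."""
    p = mean(item)
    q = 1 - p
    std_normal = NormalDist()
    # Height of the standard normal curve at the point that cuts off p.
    y = std_normal.pdf(std_normal.inv_cdf(p))
    return point_biserial(item, totals) * sqrt(p * q) / y


# Hypothetical data: 8 examinees, one item, total test scores.
item = [1, 1, 0, 1, 0, 1, 0, 0]
totals = [48, 45, 42, 38, 35, 30, 25, 20]
print(round(point_biserial(item, totals), 2))  # 0.53
print(round(biserial(item, totals), 2))        # 0.66
```

For the same data the biserial estimate is larger than the point biserial, which is the usual pattern when the item is treated as a dichotomized continuous trait.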
Biserial correlation. Biserial correlation coefficients (rbis) are computed to determine
whether the attribute or attributes measured by the criterion are also measured by the
item and the extent to which the item measures them. The rbis gives an estimate of the
well-known Pearson product-moment correlation between the criterion score and the
hypothesized item continuum when the item is dichotomized into right and wrong
(Henrysson, 1971). Ebel and Frisbie (1986) state that the rbis simply describes the
relationship between scores on a test item (e.g., "0" or "1") and scores (e.g., "0",
"1",..."50") on the total test for all examinees.
Analyzing the distractors (i.e., incorrect alternatives) is useful in determining the
relative usefulness of the decoys in each item. Items should be modified if students
consistently fail to select certain multiple choice alternatives. Such alternatives are
probably totally implausible and therefore of little use as decoys in multiple choice
items. A discrimination index or discrimination coefficient should be obtained for
each option in order to determine each distractor's usefulness (Millman & Greene,
1993). Whereas the discrimination value of the correct answer should be positive, the
discrimination values for the distractors should be lower and, preferably, negative.
Distractors should be carefully examined when items show large positive D values.
When one or more of the distractors looks extremely plausible to the informed reader
and when recognition of the correct response depends on some extremely subtle point,
it is possible that examinees will be penalized for partial knowledge.
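Computing a discrimination index for every option, as recommended above, might look like the following sketch. It reuses the response counts from Table 6; the option labels A to D and the function name `option_discrimination` are assumptions made for illustration.

```python
def option_discrimination(upper_counts, lower_counts):
    """Discrimination index for every option, correct answer and distractors."""
    group_size = max(sum(upper_counts.values()), sum(lower_counts.values()))
    return {
        option: (upper_counts[option] - lower_counts[option]) / group_size
        for option in upper_counts
    }


# Response counts from Table 6; C is the correct answer.
upper_counts = {"A": 3, "B": 2, "C": 15, "D": 0}
lower_counts = {"A": 12, "B": 3, "C": 3, "D": 2}
d_by_option = option_discrimination(upper_counts, lower_counts)
for option, d in d_by_option.items():
    print(option, round(d, 2))
```

Here the correct option has a positive value (.60) and all three distractors have negative values, which is the desired pattern. Option A draws far more of the lower group than options B or D, so it is doing most of the work as a decoy.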
Thompson and Levitov (1985) suggested computing reliability estimates for test
scores to determine an item's usefulness to the test as a whole. The authors stated,
"The total test reliability is reported first and then each item is removed from the test
and the reliability for the test less that item is calculated" (Thompson & Levitov,
1985, p. 167). From this the test developer deletes the indicated items so that the test
scores have the greatest possible reliability.
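The item-deletion procedure can be sketched as below. The passage does not say which reliability estimate is used, so coefficient alpha is assumed here purely for illustration; the function names and the response data are hypothetical as well.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """Coefficient alpha for a list of examinee rows of item scores."""
    k = len(responses[0])
    item_vars = [pvariance([row[i] for row in responses]) for i in range(k)]
    total_var = pvariance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def alpha_if_deleted(responses):
    """Reliability of the test with each item removed in turn."""
    k = len(responses[0])
    return [
        cronbach_alpha([row[:i] + row[i + 1:] for row in responses])
        for i in range(k)
    ]


# Hypothetical scored responses: 6 examinees, 4 items.
data = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
full = cronbach_alpha(data)
deleted = alpha_if_deleted(data)
print(round(full, 3))                   # 0.389
print([round(a, 3) for a in deleted])
```

In this made-up example, removing the fourth item raises the estimate from about .39 to about .84, so that item would be the first candidate for deletion or revision.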
Developing the perfect test is the unattainable goal for anyone in an evaluative
position. Even when guidelines for constructing fair and systematic tests are followed,
a plethora of factors may enter into a student's perception of the test items. Looking at
an item's difficulty and discrimination will assist the test developer in determining
what is wrong with individual items. Item and test analysis provide empirical data
about how individual items and whole tests are performing in real test situations.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test
theory. New York: Holt, Rinehart and Winston.
Ebel, R.L., & Frisbie, D.A. (1986). Essentials of educational
measurement. Englewood Cliffs, NJ: Prentice-Hall.
Gronlund, N.E., & Linn, R.L. (1990). Measurement and evaluation in teaching (6th
ed.). New York: MacMillan.
Henrysson, S. (1971). Gathering, analyzing, and using data on test items. In R.L.
Thorndike (Ed.), Educational Measurement (p. 141). Washington DC: American
Council on Education.
Millman, J., & Greene, J. (1993). The specification and development of tests of
achievement and ability. In R.L. Linn (Ed.), Educational measurement (pp. 335-366).
Phoenix, AZ: Oryx Press.
Nunnally, J.C. (1972). Educational measurement and evaluation (2nd ed.). New York:
Pedhazur, E.J., & Schmelkin, L.P. (1991). Measurement, design, and analysis: An
integrated approach. Hillsdale, NJ: Erlbaum.
Popham, W.J. (1981). Modern educational measurement. Englewood Cliffs, NJ:
Sax, G. (1989). Principles of educational and psychological measurement and
evaluation (3rd ed.). Belmont, CA: Wadsworth.
Thompson, B., & Levitov, J.E. (1985). Using microcomputers to score and evaluate
test items. Collegiate Microcomputer, 3, 163-168.
Thorndike, R.M., Cunningham, G.K., Thorndike, R.L., & Hagen, E.P.
(1991). Measurement and evaluation in psychology and education (5th ed.). New
York: MacMillan.
Wiersma, W. & Jurs, S.G. (1990). Educational measurement and testing (2nd ed.).
Boston, MA: Allyn and Bacon.
Wood, D.A. (1960). Test construction: Development and interpretation of
achievement tests. Columbus, OH: Charles E. Merrill Books, Inc.


We assess the quality of tests that are implemented and carry out analysis of tests for the
various organizations involved in testing, such as qualifying examination bodies, educational
institutions, and education services companies to ensure that the abilities of examinees are
understood correctly.
Is the difficulty of the questions within an appropriate range? Is the number of
questions appropriate?
Do the choices function well? Does each question distinguish between examinees with
high level and low level ability?
Is the pass line setting and method of dividing levels appropriate?
Is it possible to make comparisons with previous test scores and ascertain changes in
continuous test scores?
What is the relationship between examinee attributes, grouping and test scores?

Item analysis - Analysis using classical test theory

Output index name        Explanation of the index
Percentage of correct    Difficulty of questions in the test population
Point biserial           How well does a question item discriminate between high and low
Biserial correlation     The correlation when the questions and test scores are assumed to
                         follow a bivariate normal distribution
Choice selection rate    Choice selection status, function status
Actual choice rate       The actual number of choices as seen from the test results
Fundamental statistics   Fundamental statistical information about the test
Reliability factor       An index of the reliability of the test score
Standard error           The standard deviation for score assuming a certain examinee takes
                         the test repeatedly
GP analysis table        A table and graph showing how the choices function for each level
Analysis using IRT

Output index             Explanation of the index
Parameter a              An index of the sensitivity with which ability groups around value b are
Parameter b              The difficulty of a particular question
Parameter c              An index of the possibility of guessing the correct answer
Standard error a         The standard error for value a assuming repeated data acquisition and
Standard error b         The standard error for value b assuming repeated data acquisition and
Standard error c         The standard error for value c assuming repeated data acquisition and
chi-square               The deviation between the percentage of correct answers obtained from
                         the model and the percentage of correct answers obtained from the data
df                       Degrees of freedom (the number of division categories for ability score
                         used to calculate the chi-square value)
p                        The occurrence ratio of relevant data assuming equivalence between the
                         model and the data

Average, median, standard deviation, variance, range, minimum, maximum, sample size

Test characteristic      A graph of the correspondence between ability score and test score
Test information         A diagram showing the reliability of each ability score point
Test analysis: identifying test conditions
Test analysis is the process of looking at something that can be used to derive test
information. This basis for the tests is called the test basis.
The test basis is the information we need in order to start the test analysis and create
our own test cases. Basically, it is the documentation on which test cases are based, such
as requirements, design specifications, product risk analysis, architecture and
We can use the test basis documents to understand what the system should do once
built. The test basis includes whatever the tests are based on. Sometimes tests can be
based on an experienced user's knowledge of the system, which may not be documented.
From a testing perspective, we look at the test basis in order to see what could be tested.
These are the test conditions. A test condition is simply something that we could test.
While identifying the test conditions, we want to identify as many conditions as we can
and then select which ones to take forward and combine into test cases. We
could call them test possibilities.
Testing everything, which is known as exhaustive testing, is an impractical goal.
We cannot test everything; we have to select a subset of all possible tests. In
practice the subset we select may be a very small subset and yet it has to have a high
probability of finding most of the defects in a system. Hence we need some intelligent
thought process, called test techniques, to guide our selection. The test conditions that
are chosen will depend on the test strategy or detailed test approach. For example, they
might be based on risk, models of the system, etc.
Once we have identified a list of test conditions, it is important to prioritize them, so that
the most important test conditions are identified. Test conditions can be identified
for test data as well as for test inputs and test outcomes, for example, different types of
record, different sizes of records or fields in a record. Test conditions are documented
in the IEEE 829 document called a Test Design Specification.

Tests: Post Test Analysis
Sometimes you can get valuable study clues for upcoming tests by examining old tests you have
already taken. This method works best if the instructor gives many examinations. Obviously it would
not work on the first test. This method is based on the premise that people tend to be consistent.
Here's what you do:
1. Gather all your notes, texts, and test answer sheets and visit your instructor during office hours.
Ask to look over the test that was previously given in your class.
2. As you look over the test, answer two basic questions:
1. Where did this test come from?
Did the test come mostly from lecture notes, the textbook, or the homework? Did your
instructor lecture hard on Chapter 4 and then test hard on Chapter 4? Does he like lots of
little specifics, or just test on broad, general areas?
2. What kinds of questions were asked?
Were there factual questions, application questions, definition questions? If factual, then
know names, dates, places; if application, then study theory; if definitions, then be familiar
with terms.
For example, one student discovered that her instructor made up exams by selecting only the major
paragraphs in the chapter and then using the topic sentence of each paragraph as the exam question.
It was then a simple matter to study for the forthcoming examinations. This knowledge came only
after carefully examining the old test.
Tips to COMBAT Test Panic
1. Sleep. Get a good night's rest.
2. Diet. Eat breakfast or lunch. This may help calm your nervous stomach and give you energy.
Avoid greasy or acidic foods, and avoid overeating. Avoid caffeine pills.
3. Exercise. Nothing reduces stress more than exercise. An hour or two before an examination, stop
studying and go work out: swimming, jogging, cycling, or aerobics.
4. Allow yourself enough time to get to the test without hurrying.
5. Don't swap questions at the door. Hearing anything you don't know may weaken your
confidence and send you into a state of anxiety.
6. Leave your books at home. Flipping pages at the last minute may only upset you. If you must
take something, take a brief outline that you know well.
7. Take a watch with you, as well as extra pencils, scantron sheets, and blue books.
8. Answer the easy questions first. This will relax you and help build your confidence, plus give you
some assured points.
9. Sit apart from your classmates to reduce being distracted by their movements.
10. Don't panic if others are writing and you aren't. Your thinking may be more profitable than their
11. Don't be upset if others finish their tests before you do. Use as much time as you are allowed.
Students who leave early don't always get the highest grades.
12. If you still feel nervous during the test, try some emergency first aid: inhale deeply, close eyes,
hold, then exhale slowly. Repeat as needed.

There are many benefits that can be gained by using tools to support testing. They are:
Reduction of repetitive work: Repetitive work is very boring if it is done manually.
People tend to make mistakes when doing the same task over and over. Examples of
this type of repetitive work include running regression tests, entering the same test data
again and again (can be done by a test execution tool), checking against coding
standards (which can be done by a static analysis tool) or creating a specific test
database (which can be done by a test data preparation tool).
Greater consistency and repeatability: People have a tendency to do the same task in
a slightly different way even when they think they are repeating something exactly. A
tool will exactly reproduce what it did before, so each time it is run the result is
Objective assessment: If a person calculates a value from the software or incident
reports, by mistake they may omit something, or their own one-sided preconceived
judgments or convictions may lead them to interpret that data incorrectly. Using a tool
means that subjective preconceived notion is removed and the assessment is more
repeatable and consistently calculated. Examples include assessing the cyclomatic
complexity or nesting levels of a component (which can be done by a static analysis
tool), coverage (coverage measurement tool), system behavior (monitoring tools) and
incident statistics (test management tool).
Ease of access to information about tests or testing: Information presented visually
is much easier for the human mind to understand and interpret. For example, a chart or
graph is a better way to show information than a long list of numbers; this is why
charts and graphs in spreadsheets are so useful. Special purpose tools give these
features directly for the information they process. Examples include statistics and
graphs about test progress (test execution or test management tool), incident rates
(incident management or test management tool) and performance (performance testing

Item Analysis
Table of Contents
Major Uses of Item Analysis
Item Analysis Reports
Item Analysis Response Patterns
Basic Item Analysis Statistics
Interpretation of Basic Statistics
Other Item Statistics
Summary Data
Report Options
Item Analysis Guidelines

Major Uses of Item Analysis
Item analysis can be a powerful technique available to instructors for the guidance and
improvement of instruction. For this to be so, the items to be analyzed must be valid
measures of instructional objectives. Further, the items must be diagnostic, that is,
knowledge of which incorrect options students select must be a clue to the nature of
the misunderstanding, and thus prescriptive of appropriate remediation.
In addition, instructors who construct their own examinations may greatly improve the
effectiveness of test items and the validity of test scores if they select and rewrite their
items on the basis of item performance data. Such data is available to instructors who
have their examination answer sheets scored at the Computer Laboratory Scoring

Item Analysis Reports
As the answer sheets are scored, records are written which contain each student's
score and his or her response to each item on the test. These records are then
processed and an item analysis report file is generated. An instructor may obtain test
score distributions and a list of students' scores, in alphabetic order, in student number
order, in percentile rank order, and/or in order of percentage of total points. Instructors
are sent their item analysis reports as e-mail attachments. The item analysis report
is contained in the file IRPT####.RPT, where the four digits indicate the instructor's
GRADER III file. A sample of an individual long form item analysis listing is shown below.
Item 10 of 125. The correct option is 5.

                         Item Response Pattern
                 1      2      3      4      5    Omit   Error   Total
Upper 27%        2      8      0      1     19      0       0      30
                 7%    27%     0%     3%    63%     0%      0%    100%
Middle 46%       3     20      3      3     23      0       0      52
                 6%    38%     6%     6%    44%     0%      0%    100%
Lower 27%        6      5      8      2      9      0       0      30
                20%    17%    27%     7%    30%     0%      0%    101%
Total           11     33     11      6     51      0       0     112
                10%    29%    10%     5%    46%     0%      0%    100%

Item Analysis Response Patterns
Each item is identified by number and the correct option is indicated. The group of
students taking the test is divided into upper, middle and lower groups on the basis of
students' scores on the test. This division is essential if information is to be provided
concerning the operation of distracters (incorrect options) and to compute an easily
interpretable index of discrimination. It has long been accepted that optimal item
discrimination is obtained when the upper and lower groups each contain twenty-
seven percent of the total group.
The number of students who selected each option or omitted the item is shown for
each of the upper, middle, lower and total groups. The number of students who
marked more than one option to the item is indicated under the "error" heading. The
percentage of each group who selected each of the options, omitted the item, or erred,
is also listed. Note that the total percentage for each group may be other than 100%,
since the percentages are rounded to the nearest whole number before totaling.
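The rounding effect mentioned above is easy to reproduce. The sketch below uses the Lower 27% row from the sample listing; the function name `row_percentages` is a hypothetical helper, not part of the scoring software.

```python
def row_percentages(counts):
    """Whole-number percentages for one group, rounded before totaling."""
    total = sum(counts)
    percents = [round(100 * c / total) for c in counts]
    return percents, sum(percents)


# Lower 27% row from the sample listing: options 1-5, omit, error.
lower_row = [6, 5, 8, 2, 9, 0, 0]
percents, percent_total = row_percentages(lower_row)
print(percents)       # [20, 17, 27, 7, 30, 0, 0]
print(percent_total)  # 101
```

Three of the five option percentages round upward (16.7, 26.7, and 6.7), which is why the row totals 101% rather than 100%.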
The sample item listed above appears to be performing well. About two-thirds of the
upper group but only one-third of the lower group answered the item correctly.
Ideally, the students who answered the item incorrectly should select each incorrect
response in roughly equal proportions, rather than concentrating on a single incorrect
option. Option two seems to be the most attractive incorrect option, especially to the
upper and middle groups. It is most undesirable for a greater proportion of the upper
group than of the lower group to select an incorrect option. The item writer should
examine such an option for possible ambiguity. For the sample item on the previous
page, option four was selected by only five percent of the total group. An attempt
might be made to make this option more attractive.
Item analysis provides the item writer with a record of student reaction to items. It
gives us little information about the appropriateness of an item for a course of
instruction. The appropriateness or content validity of an item must be determined by
comparing the content of the item with the instructional objectives.

Basic Item Analysis Statistics
A number of item statistics are reported which aid in evaluating the effectiveness of
an item. The first of these is the index of difficulty which is the proportion of the total
group who got the item wrong. Thus a high index indicates a difficult item and a low
index indicates an easy item. Some item analysts prefer an index of difficulty which is
the proportion of the total group who got an item right. This index may be obtained by
marking the PROPORTION RIGHT option on the item analysis header sheet.
Whichever index is selected is shown as the INDEX OF DIFFICULTY on the item
analysis print-out. For classroom achievement tests, most test constructors desire
items with indices of difficulty no lower than 20 nor higher than 80, with an average
index of difficulty from 30 or 40 to a maximum of 60.
The INDEX OF DISCRIMINATION is the difference between the proportion of the
upper group who got an item right and the proportion of the lower group who got the
item right. This index is dependent upon the difficulty of an item. It may reach a
maximum value of 100 for an item with an index of difficulty of 50, that is, when
100% of the upper group and none of the lower group answer the item correctly. For
items of less than or greater than 50 difficulty, the index of discrimination has a
maximum value of less than 100. The Interpreting the Index of
Discrimination document contains a more detailed discussion of the index of
discrimination.
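As a rough sketch of the two basic statistics, the following Python functions compute them from raw counts. The function and parameter names are hypothetical, and the indices are reported as whole-number percentages, which is an assumption based on the sample print-outs.

```python
def index_of_difficulty(n_right, n_total, proportion_right=False):
    """Percentage of the total group who got the item wrong
    (or right, if proportion_right is True)."""
    p_right = n_right / n_total
    return round(100 * (p_right if proportion_right else 1 - p_right))

def index_of_discrimination(upper_right, upper_n, lower_right, lower_n):
    """Percentage of the upper group who got the item right
    minus the percentage of the lower group who got it right."""
    return round(100 * (upper_right / upper_n - lower_right / lower_n))

# Sample item 10: 51 of 112 students chose the correct option;
# 19 of 30 in the upper group and 9 of 30 in the lower group did so.
difficulty = index_of_difficulty(51, 112)
discrimination = index_of_discrimination(19, 30, 9, 30)
```

For the sample item this gives an index of difficulty of 54 (proportion wrong) and an index of discrimination of 33.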

Interpretation of Basic Statistics
To aid in interpreting the index of discrimination, the maximum discrimination value
and the discriminating efficiency are given for each item. The maximum
discrimination is the highest possible index of discrimination for an item at a given
level of difficulty. For example, an item answered correctly by 60% of the group
would have an index of difficulty of 40 and a maximum discrimination of 80. This
would occur when 100% of the upper group and 20% of the lower group answered the
item correctly. The discriminating efficiency is the index of discrimination divided by
the maximum discrimination. For example, an item with an index of discrimination of
40 and a maximum discrimination of 50 would have a discriminating efficiency of 80.
This may be interpreted to mean that the item is discriminating at 80% of the potential
of an item of its difficulty. For a more detailed discussion of the maximum
discrimination and discriminating efficiency concepts, see the Interpreting the Index
of Discrimination document.
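The two worked examples above are consistent with a simple closed form, sketched here in Python. The formula is an inference from those examples, not a documented GRADER computation: it assumes equal-sized upper and lower groups whose average proportion right equals that of the total group. The function names are hypothetical.

```python
def maximum_discrimination(difficulty):
    """Highest possible index of discrimination (0-100) for an item
    with the given index of difficulty (0-100)."""
    return 2 * min(difficulty, 100 - difficulty)

def discriminating_efficiency(discrimination, max_discrimination):
    """Index of discrimination as a percentage of the maximum possible."""
    if max_discrimination == 0:
        return 0  # an item everyone gets right (or wrong) cannot discriminate
    return round(100 * discrimination / max_discrimination)
```

Here `maximum_discrimination(40)` returns 80 and `discriminating_efficiency(40, 50)` returns 80, matching the examples in the text.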

Other Item Statistics
Some test analysts may desire more complex item statistics. Two correlations which
are commonly used as indicators of item discrimination are shown on the item
analysis report. The first is the biserial correlation, which is the correlation between a
student's performance on an item (right or wrong) and his or her total score on the test.
This correlation assumes that the distribution of test scores is normal and that there is
a normal distribution underlying the right/wrong dichotomy. The biserial correlation
has the characteristic, disconcerting to some, of having maximum values greater than
unity. There is no exact test for the statistical significance of the biserial correlation.
The point biserial correlation is also a correlation between student performance on an
item (right or wrong) and test score. It assumes that the test score distribution is
normal and that the division on item performance is a natural dichotomy. The possible
range of values for the point biserial correlation is +1 to -1. The Student's t test for the
statistical significance of the point biserial correlation is given on the item analysis
report. Enter a table of Student's t values with N - 2 degrees of freedom at the desired
percentile point. N, in this case, is the total number of students appearing in the item
analysis.
The mean scores for students who got an item right and for those who got it wrong are
also shown. These values are used in computing the biserial and point biserial
coefficients of correlation and are not generally used as item analysis statistics.
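A minimal Python sketch of the point biserial correlation and its t statistic follows, using the mean scores of the right and wrong groups as described above. The function names are hypothetical, and the use of the population standard deviation is an assumption; the report's exact computation is not given in the text.

```python
import math
from statistics import mean, pstdev

def point_biserial(item_right, total_scores):
    """Correlation between a right/wrong item (1 or 0) and the total test score.
    Undefined (raises an error) if everyone or no one got the item right."""
    right = [s for r, s in zip(item_right, total_scores) if r]
    wrong = [s for r, s in zip(item_right, total_scores) if not r]
    p = len(right) / len(total_scores)  # proportion who got the item right
    spread = pstdev(total_scores)       # assumption: population standard deviation
    return (mean(right) - mean(wrong)) / spread * math.sqrt(p * (1 - p))

def t_statistic(r, n):
    """Student's t for a correlation r based on n students, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))

# Six students: the three highest scorers got the item right.
r = point_biserial([1, 1, 1, 0, 0, 0], [6, 5, 4, 3, 2, 1])
t = t_statistic(r, 6)
```

The resulting t value is then compared with a table of Student's t at N - 2 degrees of freedom.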
Generally, item statistics will be somewhat unstable for small groups of students.
Perhaps fifty students might be considered a minimum number if item statistics are to
be stable. Note that for a group of fifty students, the upper and lower groups would
contain only thirteen students each. The stability of item analysis results will improve
as the group of students is increased to one hundred or more. An item analysis for
very small groups must not be considered a stable indication of the performance of a
set of items.

Summary Data
The item analysis data are summarized on the last page of the item analysis report.
The distribution of item difficulty indices is a tabulation showing the number and
percentage of items whose difficulties are in each of ten categories, ranging from a
very easy category (00-10) to a very difficult category (91-100). The distribution of
discrimination indices is tabulated in the same manner, except that a category is
included for negatively discriminating items.
The mean item difficulty is determined by adding all of the item difficulty indices and
dividing the total by the number of items. The mean item discrimination is determined
in a similar manner.
Test reliability, estimated by the Kuder-Richardson formula number 20, is given. If
the test is speeded, that is, if some of the students did not have time to consider each
test item, the reliability estimate may be spuriously high.
The final test statistic is the standard error of measurement. This statistic is a common
device for interpreting the absolute accuracy of the test scores. The size of the
standard error of measurement depends on the standard deviation of the test scores as
well as on the estimated reliability of the test.
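The reliability estimate and the standard error of measurement can be sketched as follows. This is an illustrative Python version of the standard Kuder-Richardson formula 20 and of the usual standard error formula, not the program's own code; the function names are hypothetical, and the population variance of total scores is used (an assumption, since some texts use the sample variance).

```python
import math
from statistics import pvariance

def kr20(item_matrix):
    """Kuder-Richardson formula 20.
    item_matrix has one row per student; each entry is 1 (right) or 0 (wrong)."""
    n_students = len(item_matrix)
    n_items = len(item_matrix[0])
    totals = [sum(row) for row in item_matrix]
    total_variance = pvariance(totals)  # assumption: population variance
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in item_matrix) / n_students
        sum_pq += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - sum_pq / total_variance)

def standard_error_of_measurement(standard_deviation, reliability):
    """Standard deviation of test scores times the square root of (1 - reliability)."""
    return standard_deviation * math.sqrt(1 - reliability)

# Four students, three items.
answers = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
reliability = kr20(answers)
```

The second function shows why the standard error depends on both quantities: it shrinks as reliability rises and grows with the spread of scores.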
Occasionally, a test writer may wish to omit certain items from the analysis although
these items were included in the test as it was administered. Such items may be
omitted by leaving them blank on the test key. The response patterns for omitted items
will be shown but the keyed options will be listed as OMIT. The statistics for these
items will be omitted from the Summary Data.

Report Options
A number of report options are available for item analysis data. The long-form item
analysis report contains three items per page. A standard-form item analysis report is
available where data on each item is summarized on one line. A sample report is
shown below.
ITEM ANALYSIS   Test 4482   125 Items   112 Students
Percentages: Upper 27% - Middle - Lower 27%
Item  Key         1         2         3         4         5      Omit     Error  Diff  Disc
   1    4   7-23-57   0- 4- 7  28- 8-36  64-62- 0     0-0-0     0-0-0     0-0-0    54    64
   2    2   7-12- 7  64-42-29  14- 4-21  14-42-36     0-0-0     0-0-0     0-0-0    56    35
The standard form shows the item number, key (number of the correct option), the
percentage of the upper, middle, and lower groups who selected each option, omitted
the item or erred, the index of difficulty, and the index of discrimination. For example,
in item 1 above, option 4 was the correct answer and it was selected by 64% of the
upper group, 62% of the middle group and 0% of the lower group. The index of
difficulty, based on the total group, was 54 and the index of discrimination was 64.

Item Analysis Guidelines
Item analysis is a completely futile process unless the results help instructors improve
their classroom practices and item writers improve their tests. Let us suggest a number
of points of departure in the application of item analysis data.
1. Item analysis gives necessary but not sufficient information concerning the
appropriateness of an item as a measure of intended outcomes of instruction.
An item may perform beautifully with respect to item analysis statistics and
yet be quite irrelevant to the instruction whose results it was intended to
measure. A most common error is to teach for behavioral objectives such as
analysis of data or situations, ability to discover trends, ability to infer
meaning, etc., and then to construct an objective test measuring mainly
recognition of facts. Clearly, the objectives of instruction must be kept in mind
when selecting test items.
2. An item must be of appropriate difficulty for the students to whom it is
administered. If possible, items should have indices of difficulty no less than
20 and no greater than 80. It is desirable to have most items in the 30 to 50
range of difficulty. Very hard or very easy items contribute little to the
discriminating power of a test.
3. An item should discriminate between upper and lower groups. These groups
are usually based on total test score but they could be based on some other
criterion such as grade-point average, scores on other tests, etc. Sometimes
an item will discriminate negatively, that is, a larger proportion of the lower
group than of the upper group selected the correct option. This often means
that the students in the upper group were misled by an ambiguity that the
students in the lower group, and the item writer, failed to discover. Such an
item should be revised or discarded.
4. All of the incorrect options, or distracters, should actually be distracting.
Preferably, each distracter should be selected by a greater proportion of the
lower group than of the upper group. If, in a five-option multiple-choice item,
only one distracter is effective, the item is, for all practical purposes, a two-
option item. Existence of five options does not automatically guarantee that
the item will operate as a five-choice item.

Item Analysis of Classroom Tests: Aims and Simplified Procedures

How well did my test distinguish among students according to how well they met
my learning goals?
Recall that each item on your test is intended to sample performance on a particular
learning outcome. The test as a whole is meant to estimate performance across the full
domain of learning outcomes you have targeted.
Unless your learning goals are minimal or low (as they might be, for instance, on a
readiness test), you can expect students to differ in how well they have met those
goals. (Students are not peas in a pod!). Your aim is not to differentiate students just
for the fun of it, but to be able to measure the differences in mastery that occur.
One way to assess how well your test is functioning for this purpose is to look at how
well the individual items do so. The basic idea is that a good item is one that good
students get correct more often than do poor students. You might end up with a big
spread in scores, but what if the good students are no more likely than poor students to
get a high score? If we assume that you have actually given them proper instruction,
then your test has not really assessed what they have learned. That is, it is "not
An item analysis gets at the question of whether your test is working by asking the
same question of all individual items: how well does it discriminate? If you have lots
of items that didn't discriminate much if at all, you may want to replace them with
better ones. If you find ones that worked in the wrong direction (where good students
did worse) and therefore lowered test reliability, then you will definitely want to get
rid of them.
In short, item analysis gives you a way to exercise additional quality control over your
tests. Well-specified learning objectives and well-constructed items give you a
headstart in that process, but item analyses can give you feedback on how successful
you actually were.
Item analyses can also help you diagnose why some items did not work especially
well, and thus suggest ways to improve them (for example, if you find distracters that
attracted no one, try developing better ones).
Item analyses are intended to assess and improve the reliability of your tests. If test
reliability is low, test validity will necessarily also be low. This is the ultimate reason
you do item analyses: to improve the validity of a test by improving its reliability.
Higher reliability will not necessarily raise validity (you can be more consistent in
hitting the wrong target), but it is a prerequisite. That is, high reliability is necessary
but not sufficient for high validity (do you remember this point on Exam 1?).
However, when you examine the properties of each item, you will often discover how
they may or may not actually have assessed the learning outcome you intended,
which is a validity issue. When you change items to correct these problems, it means
the item analysis has helped you to improve the likely validity of the test the next time
you give it.
The procedure (apply it to the sample results I gave you)
1. Identify the upper 10 scorers and lowest 10 scorers on the test. Set aside the
rest.
2. Construct an empty chart for recording their scores, following the sample I
gave you in class. This chart lists the students down the left, by name. It arrays
each item number across the top. For a 20-item test, you will have 20 columns
for recording the answers for each student. Underneath the item number, write
in the correct answer (A, B, etc.)
3. Enter the student data into the chart you have just constructed.
a. Take the top 10 scorers, and write each student's name down the left,
one row for each student. If there is a tie for 10th place, pick one student
randomly from those who are tied.
b. Skip a couple rows, then write the names of the 10 lowest-scoring
students, one row for each.
c. Going student by student, enter each student's answers into the cells of
the chart. However, enter only the wrong answers (A, B, etc.). Any
empty cell will therefore signal a correct answer.
d. Go back to the upper 10 students. Count how many of them got Item 1
correct (this would be all the empty cells). Write that number at the
bottom of the column for those 10. Do the same for the other 19
questions. We will call these sums $R_U$, where U stands for "upper."
e. Repeat the process for the 10 lowest students. Write those sums under
their 20 columns. We will call these $R_L$, where L stands for "lower."
4. Now you are ready to calculate the two important indices of item functioning.
These are actually only estimates of what you would get if you had a computer
program to calculate the indices for everyone who took the test (some schools
do). But they are pretty good.
a. Difficulty index. This is just the proportion of people who passed the
item. Calculate it for each item by adding the number correct in the top
group ($R_U$) to the number correct in the bottom group ($R_L$) and then
dividing this sum by the total number of students in the top and bottom
groups (20).
$(R_U + R_L) / 20$

Record these 20 numbers in a row near the bottom of the chart.
b. Discrimination index. This index is designed to highlight to what extent
students in the upper group were more likely than students in the lower
group to get the item correct. That is, it is designed to get at the
differences between the two groups. Calculate the index by subtracting $R_L$
from $R_U$, and then dividing by half the number of students involved (10).
$(R_U - R_L) / 10$

Record these 20 numbers in the last row of the chart.
5. You are now ready to enter these discrimination indexes into a second chart.
6. Construct the second chart, based on the model I gave you in class. (This is the
smaller chart that contains no student names.)
a. Note that there are two rows of column headings in the sample. The first
row of headings contains the maximum possible discrimination indexes
for each item difficulty level (more on that in a moment). The second
row contains possible difficulty indexes. Let's begin with that second
row of headings (labeled "p"). As your sample shows, the entries range
on the far left from "1.0" (for 100%) to ".4-0" (40%-0%) for a final
catch-all column. Just copy the numbers from the sample onto your
chart.
b. Now copy the numbers from the first row of headings in the sample
(labeled "Md").
7. Now is the time to pick up your first chart again, where you will find
the discrimination indexes you need to enter into your second chart.
a. You will be entering its last row of numbers into the body of the second
chart.
b. List each of these discrimination indexes in one and only one of the 20
columns. But which one? List each in the column corresponding to
its difficulty level. For instance, if item 4's difficulty level is .85 and its
discrimination index is .10, put the .10 in the difficulty column labeled
".85." This number is entered, of course, into the row for the fourth item.
8. Study this second chart.
a. How many of the items are of medium difficulty? These are the best,
because they provide the most opportunity to discriminate (to see this,
look at their maximum discrimination indexes in the first row of
headings). Items that most everybody gets right or gets wrong simply
can't discriminate much.
b. The important test for an item's discriminability is to compare it to the
maximum possible. How well did each item discriminate relative to the
maximum possible for an item of its particular difficulty level? Here is a
rough rule of thumb.
Discrimination index is near the maximum possible = very discriminating item
Discrimination index is about half the maximum possible = moderately discriminating item
Discrimination index is about a quarter the maximum possible = weak item
Discrimination index is near zero = non-discriminating item
Discrimination index is negative = bad item (delete it if worse than -.10)
9. Go back to the first chart and study it.
a. Look at whether all the distracters attracted someone. If some did not
attract any, then the distracter may not be very useful. Normally you
might want to examine it and consider how it might be improved or replaced.
b. Look also for distractors that tended to pull your best students and
thereby lower discriminability. Consider whether the discrimination you
are asking them to make is educationally significant (or even clear). You
can't do this kind of examination for the sample data I have given you,
but keep it in mind for real-life item analyses.
10. There is much more you can do to mine these data for ideas about your items,
but this is the core of an item analysis.
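The hand calculation above can be sketched in a few lines of Python. The difficulty and discrimination formulas follow steps 4a and 4b, and the maximum possible discrimination follows from them (the best case is when the upper group supplies as many of the correct answers as it can). The function names are hypothetical, and the numeric cutoffs used to turn the rule of thumb into labels are assumptions made only for this illustration.

```python
def upper_lower_indices(r_upper, r_lower, group_size=10):
    """Difficulty and discrimination indexes from the number correct
    in the upper group (r_upper) and in the lower group (r_lower)."""
    difficulty = (r_upper + r_lower) / (2 * group_size)
    discrimination = (r_upper - r_lower) / group_size
    return difficulty, discrimination

def max_discrimination(difficulty):
    """Largest discrimination index possible at a given difficulty index."""
    return 2 * min(difficulty, 1 - difficulty)

def rate_item(difficulty, discrimination):
    """Apply the rule of thumb; the cutoffs below are illustrative assumptions."""
    if discrimination < 0:
        return "bad item"
    ceiling = max_discrimination(difficulty)
    if ceiling == 0:
        return "non-discriminating item"  # everyone right or everyone wrong
    share = discrimination / ceiling
    if share >= 0.75:
        return "very discriminating item"
    if share >= 0.375:
        return "moderately discriminating item"
    if share >= 0.125:
        return "weak item"
    return "non-discriminating item"

# An item answered correctly by 9 of the upper 10 and 8 of the lower 10.
p, d = upper_lower_indices(9, 8)
label = rate_item(p, d)
```

This reproduces the example in step 7b: a difficulty of .85 with a discrimination index of .10, which is about a third of the maximum possible (.30) and so rates as a weak item under these cutoffs.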
If you are lucky
If you use scantron sheets for grading exams, ask your school whether it can calculate
item statistics when it processes the scantrons. If it can, those statistics probably
include what you need: the (a) difficulty indexes for each item, (b) correlations of
each item with total scores for each student on the test, and (c) the number of students
who responded to each distracter. The item-total correlation is comparable to (and
more accurate than) your discrimination index.
If your school has this software, then you won't have to calculate any item statistics,
which makes your item analyses faster and easier. It is important that you have
calculated the indexes once on your own, however, so that you know what they mean.

Improve multiple choice tests using item analysis

Item analysis report
An item analysis includes two statistics that can help you analyze the effectiveness of your test
questions. The question difficulty is the percentage of students who selected the correct response.
The discrimination (item effectiveness) indicates how well the question separates the students
who know the material well from those who don't.

Question difficulty
Question difficulty is defined as the proportion of students selecting the correct answer. The most
effective questions in terms of distinguishing between high and low scoring students will be
answered correctly by about half of the students. In practical terms, questions in most classroom
tests will have a range of difficulties from low or easy (.90) to high or very difficult (.40). Questions
having difficulty estimates outside of these ranges may not contribute much to the effective
evaluation of student performance.
Very easy questions may not sufficiently challenge the most able students. However, having
a few relatively easy questions in a test may be important to verify the mastery of some
course objectives. Keep tests balanced in terms of question difficulty.
Very difficult questions, if they form most of a test, may produce frustration among students.
Some very difficult questions are needed to challenge the best students.

Question discrimination
The discrimination index (item effectiveness) is a kind of correlation that describes the
relationship between a student's response to a single question and his or her total score on the test.
This statistic can tell you how well each question was able to differentiate among students in terms
of their ability and preparation.
As a correlation, question discrimination can theoretically take values between -1.00 and
+1.00. In practical terms values for most classroom tests range between near 0.00 to values
near .90.
If a question is very easy so that nearly all students answered correctly, the question's
discrimination will be near zero. Extremely easy questions cannot distinguish among
students in terms of their performance.
If a question is extremely difficult so that nearly all students answered incorrectly, the
discrimination will be near zero.
The most effective questions will have moderate difficulty and high discrimination values.
The higher the discrimination value, the more effective the question is in discriminating between
students who perform well on the test and those who don't.
Questions having low or negative values of discrimination need to be reviewed very carefully
for confusing language or an incorrect key. If no confusing language is found then the course
design for the topic of the question needs to be critically reviewed.
A high level of student guessing on questions will result in a question discrimination value
near zero.

Steps in a review of an item analysis report
1. Review the difficulty and discrimination of each question.
2. For each question having low values of discrimination review the distribution of responses
along with the question text to determine what might be causing a response pattern that
suggests student confusion.
3. If the text of the question is confusing, change the text or remove the question from the
course database. If the question text is not confusing or faulty, then try to identify the
instructional component that may be leading to student confusion.
4. Carefully examine the questions that discriminate well between high and low scoring
students to fully understand the role that instructional design played in leading to these
results. Ask yourself what aspects of the instructional process appear to be most effective.
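The first step of such a review can be sketched as a small Python helper that flags questions for a closer look. The difficulty limits (.40 and .90) come from the discussion of question difficulty above; the cutoff for "low" discrimination is not given in the text, so the default of 0.20 is purely an assumption, as are the function name and flag wording.

```python
def review_flags(difficulty, discrimination, low_discrimination=0.20):
    """Return a list of reasons to review a question.
    difficulty is the proportion answering correctly;
    low_discrimination is a hypothetical cutoff, not taken from the text."""
    flags = []
    if difficulty > 0.90:
        flags.append("very easy")
    elif difficulty < 0.40:
        flags.append("very difficult")
    if discrimination < 0:
        flags.append("negative discrimination")
    elif discrimination < low_discrimination:
        flags.append("low discrimination")
    return flags
```

A flagged question is only a candidate for review; the response distribution and the question text still have to be read to decide what, if anything, is wrong.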
Test Item Performance: The Item Analysis
Table of Contents
Summary of Test Statistics
Test Frequency Distribution
Item Difficulty and Discrimination: Quintile Table
Interpreting Item Statistics
MERMAC - Test Analysis and Questionnaire Package

The ITEM ANALYSIS output consists of four parts: A summary of test statistics, a test frequency
distribution, an item quintile table, and item statistics. This analysis can be processed for an entire class.
If it is of interest to compare the item analysis for different test forms, then the analysis can be processed
by test form. The Division of Measurement and Evaluation staff is available to help instructors interpret
their item analysis data.

Summary of Test Statistics

Part I of the ITEM ANALYSIS consists of a summary of the following statistics:


Number of Items (Number of items on the test.)
Mean Score (Arithmetic average; the sum of all scores divided by the number of
scores.)
Median Score (The raw score point that divides the raw score distribution in half;
50% of the scores fall above the median and 50% fall below.)
Standard Deviation (Measure of the spread or variability of the score distribution. The
higher the value of the standard deviation, the better the test is
discriminating among student performance levels.)
Reliability (KR-20) (An estimate of test reliability indicating the internal consistency of
the test. The range of the reliability is from 0.00 to 1.00. A reliability of
.70 or better is desirable for classroom tests.)
Reliability (KR-21) (When item difficulties are approximately equal, an estimate of test
reliability indicating the internal consistency of the test. The range of
the reliability is from 0.00 to 1.00. A reliability of .70 or better is
desirable for classroom tests.)
Standard Error of Measurement (The accuracy of measurement expressed in the test score scale.
The larger the standard error, the less precise the measure of student
achievement. Two-thirds of the time, test takers' obtained scores fall
within one standard error of measurement of their true score.)
Possible Low Score (The possible low score.)
Possible High Score (The possible high score.)
Obtained Low Score (The obtained low score.)
Obtained High Score (The obtained high score.)
Number of Scores (The number of answer sheets submitted
for scoring.)
Blank Scores (Number of test scores that could not be computed.)
Invalid Scores (Number of test scores out of range specified by the user.)
Valid Scores (Only those scores that fall within the range specified by the user are
included in the analysis so that
the user has the option of disregarding certain scores.)
Blank and invalid scores (those falling outside the specified range) are counted and are omitted from the
analysis.
Test Frequency Distribution

Part II of the ITEM ANALYSIS program displays a test frequency distribution. The raw scores are ordered
from high to low with corresponding statistics:
1. Standard score--a linear transformation of the raw score that sets the mean equal to 500 and the
standard deviation equal to 100; in normal score distributions for classes of 500 students or more
the standard score range usually falls between 200 and 800 (plus or minus three standard
deviations of the mean); for classes with fewer than 30 students the standard score range usually
falls within two standard deviations of the mean, i.e., a range of 300 to 700.
2. Percentile rank--the percentage of individuals who received a score lower than the given score
plus the percentage of half the individuals who received the given score. This measure indicates a
person's relative position within a group.
3. Percentage of people in the total group who received the given score.
4. Frequency--in a test analysis, the number of individuals who receive a given score.
5. Cumulative frequency--in a test analysis, the number of individuals who score at or below a given
score value.
Raw Score | Standard Score | Percentile Rank | Percentage | Frequency | Cumulative Frequency | Histogram
92 717 99 0.2 1 603 *
91 708 99 0.3 2 602 **
90 700 99 0.0 0 600
89 691 99 0.2 1 600 *
88 683 99 0.8 5 599 *****
87 675 99 0.3 2 594 **
86 666 98 1.0 6 592 ******
85 658 97 1.3 8 586 ********
84 649 96 1.2 7 578 *******
83 641 95 2.0 12 571 ************
82 632 93 1.7 10 559 **********
81 624 91 1.5 9 549 *********
80 615 90 1.5 9 540 *********
79 607 88 2.8 17 531 *****************
78 598 85 4.1 25 514 *************************
77 590 81 2.3 14 489 **************
76 562 79 4.0 24 475 ************************
75 573 75 2.2 13 451 *************
74 565 73 3.3 20 438 ********************
73 556 69 2.0 12 418 ************
72 548 67 3.8 23 406 ***********************
71 539 64 2.8 17 383 *****************
70 531 61 3.0 18 366 ******************
69 522 58 3.2 19 326 *******************
67 505 51 3.6 22 307 **********************
66 497 47 3.8 23 285 ***********************
65 489 43 2.7 16 262 ****************
64 480 41 3.2 19 246 *******************
63 472 38 2.5 15 227 ***************
62 463 35 3.2 19 212 *******************
61 455 32 2.5 15 193 ***************
60 446 30 1.8 11 178 ***********
59 438 28 2.3 14 167 **************
58 429 25 3.0 18 153 ******************
57 421 22 1.7 10 135 **********
56 413 21 3.2 12 106 ************
54 396 16 1.7 10 94 **********
53 387 14 1.5 9 84 *********
52 379 12 1.2 7 75 *******
51 370 11 2.0 12 68 ************
50 362 9 1.2 7 56 *******
49 353 8 1.3 8 49 ********
48 345 7 1.7 10 41 **********
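The standard score and percentile rank defined above can be sketched in Python as follows. The function names are hypothetical, and the use of the population standard deviation and of rounding to a whole number for the standard score are assumptions; the text does not say which the program uses.

```python
from statistics import mean, pstdev

def standard_score(raw, scores):
    """Linear transformation of a raw score to a scale with mean 500 and
    standard deviation 100 (assumption: population standard deviation)."""
    return round(500 + 100 * (raw - mean(scores)) / pstdev(scores))

def percentile_rank(raw, scores):
    """Percentage scoring lower than raw, plus half the percentage scoring exactly raw."""
    below = sum(1 for s in scores if s < raw)
    same = sum(1 for s in scores if s == raw)
    return 100 * (below + 0.5 * same) / len(scores)

scores = [1, 2, 3, 4, 5]
middle_standard = standard_score(3, scores)   # the mean maps to 500
middle_rank = percentile_rank(3, scores)      # two below, one tied: (2 + 0.5) / 5
```

For this toy distribution the middle score has a standard score of 500 and a percentile rank of 50.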
Item Difficulty and Discrimination: Quintile Table

Part III of the ITEM ANALYSIS output, an item quintile table, can aid in the interpretation of Part IV of the
output. Part IV compares the item responses versus the total score distribution for each item. A good item
discriminates between students who scored high or low on the examination as a whole. In order to
compare different student performance levels on the examination, the score distribution is divided into
fifths, or quintiles. The first fifth includes students who scored between the 81st and 100th percentiles; the
second fifth includes students who scored between the 61st and 80th percentiles, and so forth. When the
score distribution is skewed, more than one-fifth of the students may have scores within a given quintile
and as a result, less than one-fifth of the students may score within another quintile. The table indicates
the sample size, the proportion of the distribution, and the score ranges within each fifth.
Fifth | Sample Size | Proportion | Score Range
1ST 128 0.21 77 - 92
2ND 127 0.21 70 - 76
3RD 121 0.20 64 - 69
4TH 121 0.20 56 - 63
5TH 106 0.18 24 - 55

Interpreting Item Statistics

Part IV of ITEM ANALYSIS portrays item statistics which can help determine which items are good and
which need improvement or deletion from the examination. The quintile graph on the left side of the
output indicates the percent of students within each fifth who answered the item correctly. A good,
discriminating item is one in which students who scored well on the examination chose the correct
alternative more frequently than students who did not score well on the examination. Therefore, the
scattergram graph should form a line going from the bottom left-hand corner to the top right-hand corner
of the graph. Item 1 in the sample output shows an example of this type of positive linear relationship.
Item 2 in the sample output also portrays a discriminating item; although few students correctly answered
the item, the students in the first fifth answered it correctly more frequently than the students in the rest of
the score distribution. Item 3 indicates a poor item; the graph indicates no relationship between the fifths
of the score distribution and the percentage of correct responses by fifths. However, it is likely that this
item was miskeyed by the instructor--note the response pattern for alternative B.

A. Evaluating Item Distractors: Matrix of Responses

On the right-hand side of the output, a matrix of responses by fifths shows the frequency of students
within each fifth who answered each alternative and who omitted the item. This information can help point
out what distractors, or incorrect alternatives, are not successful because: (a) they are not plausible
answers and few or no students chose the alternative (see alternatives D and E, item 2), or (b) too many
students, especially students in the top fifths of the distribution, chose the incorrect alternative instead of
the correct response (see alternative B, item 3). A good item will result in students in the top fifths
answering the correct response more frequently than students in the lower fifths, and students in the
lower fifths answering the incorrect alternative more frequently than students in the top fifths. The matrix
of responses prints the correct response of the item on the right-hand side and encloses the correct
response in the matrix in parentheses.
B. Item Difficulty: The PROP Statistic

The proportion (PROP) of students who answer each alternative and who omit the item is printed in the
first row below the matrix. The item difficulty is the proportion of subjects in a sample who correctly
answer the item. In order to obtain maximum spread of student scores it is best to use items with
moderate difficulties. Moderate difficulty can be defined as the point halfway between perfect score and
chance score. For a five choice item, moderate difficulty level is .60, or a range between .50 and .70
(because 100% correct is perfect and we would expect 20% of the group to answer the item correctly by
blind guessing).
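
The two definitions above can be sketched in a few lines of Python. This is only an illustration: the counts are the ones shown for item 1 in the sample output (alternatives A to E, then omits), and the function names are made up for this sketch rather than taken from any scoring program.

```python
# Matrix of responses by fifths for item 1 in the sample output.
# Columns: alternatives A, B, C, D, E, then omits. The keyed answer is E.
ITEM_1 = [
    [0, 25, 1, 0, 102, 0],   # 1st fifth
    [1, 45, 6, 0, 75, 0],    # 2nd fifth
    [1, 63, 5, 3, 49, 0],    # 3rd fifth
    [2, 76, 9, 0, 34, 0],    # 4th fifth
    [11, 73, 13, 4, 5, 0],   # 5th fifth
]


def prop_row(matrix):
    """Proportion of all examinees who chose each column."""
    total = sum(sum(row) for row in matrix)
    return [sum(column) / total for column in zip(*matrix)]


def moderate_difficulty(n_choices):
    """Point halfway between the chance score and a perfect score."""
    chance = 1 / n_choices
    return (chance + 1.0) / 2


props = prop_row(ITEM_1)
print([round(p, 2) for p in props])        # PROP row for item 1
print(round(moderate_difficulty(5), 3))    # target for a five-choice item
```

The rounded proportions reproduce the PROP row printed for item 1, and the five-choice target comes out at .60 as stated above.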

Evaluating Item Difficulty. For the most part, items which are too easy or too difficult cannot discriminate
adequately between student performance levels. Item 2 in the sample output is an exception; although
the item difficulty is .23, the item is a good, discriminating one. In item 4, everyone correctly answered the
item; the item difficulty is 1.00. Such an item does not discriminate at all between good and poor students,
and therefore does not contribute statistically to the effectiveness of the examination. However, if one of
the instructor's goals is to check that all students grasp certain basic concepts and if the examination is
long enough to contain a sufficient number of discriminating items, then such an item may remain on the
examination.

C. Item Discrimination: Point Biserial Correlation (RPBI)

Interpreting the RPBI Statistic. The point biserial correlation (RPBI) for each alternative and omit is
printed below the PROP row. It indicates the relationship between the item response and the total test
score within the group tested, i.e., it measures the discriminating power of an item. It is interpreted
similarly to other correlation coefficients. Assuming that the total test score accurately discriminates
among individuals in the group tested, then high positive RPBI's for the correct responses would
represent the most discriminating items. That is, students who answered the correct response scored well
on the examination, whereas students who did not answer the correct response did not score well on the
examination. It is also interesting to check the RPBI's for the item distractors, or incorrect alternatives.
The opposite correlation between total score and choice of alternative is expected for the incorrect vs. the
correct alternative. Where a high positive correlation is desired for the RPBI of a correct alternative,
a high negative correlation is good for the RPBI of a distractor, i.e., students who answer with an
incorrect alternative did not score well on the total examination. Due to restrictions incurred when
correlating a continuous variable (total examination score) with a dichotomous variable (response vs
nonresponse of an alternative), the highest possible RPBI is .80 instead of the usual maximum value of
1.00 for a correlation. This maximum RPBI is directly influenced by the item difficulty level. The maximum
RPBI value of .80 occurs with items of moderate difficulty level; the further the difficulty level deviates
from the moderate difficulty level in either direction, the lower the ceiling on RPBI. For example, the
maximum RPBI is about .58 for difficulty levels of .10 or .90. Therefore, in order to maximize item
discrimination, items of moderate difficulty level are preferred, although easy and difficult items still can be
discriminating (see item 2 in the sample output).
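
For readers who want to see how such a coefficient is obtained, the following is a minimal sketch of the standard point biserial formula. The four scores and responses are hypothetical, the function name is illustrative, and returning 0.0 when nobody (or everybody) chose the alternative is an assumption made here to mirror the 0.00 entries in the sample output.

```python
from math import sqrt


def point_biserial(chose, totals):
    """Correlation between choosing an alternative (1/0) and total test score."""
    n = len(totals)
    k = sum(chose)                                   # number who chose the alternative
    mean_all = sum(totals) / n
    sd_all = sqrt(sum((t - mean_all) ** 2 for t in totals) / n)
    if k == 0 or k == n or sd_all == 0:
        return 0.0                                   # no variation to correlate (assumed convention)
    mean_chose = sum(t for c, t in zip(chose, totals) if c) / k
    mean_other = sum(t for c, t in zip(chose, totals) if not c) / (n - k)
    p = k / n
    return (mean_chose - mean_other) / sd_all * sqrt(p * (1 - p))


# Hypothetical data: four students' total scores and their responses to one item.
TOTALS = [1, 2, 3, 4]
KEYED = [0, 0, 1, 1]        # 1 = chose the correct alternative
DISTRACTOR = [1, 1, 0, 0]   # 1 = chose a particular incorrect alternative

print(round(point_biserial(KEYED, TOTALS), 2))
print(round(point_biserial(DISTRACTOR, TOTALS), 2))
```

In this toy data the keyed alternative gets a positive coefficient and the distractor the same value with the opposite sign, which is the pattern described above for a discriminating item.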

Evaluating Item Discrimination. When an instructor examines the item analysis data, the RPBI is an
important indicator in deciding which items are discriminating and should be retained, and which items
are not discriminating and should be revised or replaced by a better item (other content considerations
aside). The quintile graph also illustrates this same relationship between item response and total scores.
However, the RPBI is a more accurate representation of this relationship. An item with a RPBI of .25 or
below should be examined critically for revision or deletion; items with RPBIs of .40 and above are good
discriminators. Note that all items, not only those with RPBIs lower than .25, can be improved. An
examination of the matrix of responses by fifths for all items may point out weaknesses, such as
implausible distractors, that can be reduced by modifying the item.
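
The cut-offs in this paragraph can be written as a small rule, sketched below. The .25 and .40 thresholds come from the text; the wording of the middle category is an assumption added for this sketch, since the text does not name it.

```python
def review_flag(rpbi):
    """Classify an item by the RPBI of its keyed answer."""
    if rpbi <= 0.25:
        return "examine for revision or deletion"
    if rpbi >= 0.40:
        return "good discriminator"
    return "acceptable, may still be improved"   # label assumed for this sketch


# RPBI of the keyed answer for items 1 to 4 in the sample output.
for item, rpbi in enumerate([0.51, 0.43, 0.13, 0.00], start=1):
    print(item, review_flag(rpbi))
```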

It is important to keep in mind that the statistical functioning of an item should not be the sole basis for
deleting or retaining an item. The most important quality of a classroom test is its validity, the extent to
which items measure relevant tasks. Items that perform poorly statistically might be retained (and perhaps
revised) if they correspond to specific instructional objectives in the course. Items that perform well
statistically but are not related to specific instructional objectives should be reviewed carefully before
being reused.


Ebel, R. L. & Frisbee, D. A. (1986). Essentials of educational measurement (4th ed.). Eaglewood Cliffs,
NJ: New Jersey: Prentice-Hall, Inc.

Guilford, J. P. Pshychometric method. New York: McGraw-Hill, 1954.

Gronlund, N. E. & Linn, R. L. (1990). Measurement and evaluation in teaching (6th ed.). NY: MacMillan.

Osterlind, S. J. Constructing test items Norwell, MA: Kluwer Academic Publishers, 1989.

Thorndike, Robert L. & Hagen, Elizabeth. Measurement and evaluation in psychology and education (3rd
ed.). New York: John Wiley & Sons, 1969, Chapters 4, 6.


[Quintile graphs not recoverable. Horizontal axis: percentage of correct responses, 0 to 100. Rows: fifths of the score distribution.]

Item 1
          A      B      C      D      E    OMIT
1ST       0     25      1      0    102      0
2ND       1     45      6      0     75      0
3RD       1     63      5      3     49      0
4TH       2     76      9      0     34      0
5TH      11     73     13      4      5      0
PROP   0.02   0.47   0.06   0.01 (0.44)   0.00
RPBI  -0.20  -0.33  -0.20  -0.13 (0.51)   0.00

Item 2
          A      B      C      D      E    OMIT
1ST      83     35     10      0      0      0
2ND      19     85     23      0      0      0
3RD      17     67     37      0      0      0
4TH      13     78     30      0      0      0
5TH       6     84     16      0      0      0
PROP (0.23)   0.57   0.19   0.00   0.00   0.00
RPBI (0.43)  -0.33  -0.05   0.00   0.00   0.00

Item 3
          A      B      C      D      E    OMIT
1ST       2    125      0      1      0
2ND       6    109      0      8      4
3RD      14     86      4      7     10
4TH      23     71      2     19      6      0
5TH      29     45      8     15      8
PROP   0.12   0.72   0.02   0.08 (0.05)   0.00
RPBI  -0.24   0.45  -0.16  -0.17 (0.13)  -0.14

Item 4
          A      B      C      D      E    OMIT
1ST       0      0      0      0    128      0
2ND       0      0      0      0    127      0
3RD       0      0      0      0    121      0
4TH       0      0      0      0    121      0
5TH       0      0      0      0    106      0
PROP   0.00   0.00   0.00   0.00 (1.00)   0.00
RPBI   0.00   0.00   0.00   0.00 (0.00)   0.00

Purpose of Item Analysis
OK, you now know how to plan a test and build a test
Now you need to know how to do ITEM ANALYSIS
--> looks complicated at first glance, but actually quite simple

-->even I can do this and I'm a mathematical idiot

Talking about norm-referenced, objective tests
mostly multiple-choice but same principles for true-false, matching and short answer
by analyzing results you can refine your testing

1. Fix marks for current class that just wrote the test
o find flaws in the test so that you can adjust the mark before you return it to students
o can find questions with two right answers, or that were too hard, etc., that
you may want to drop from the exam
even had to do that occasionally on Diploma exams, even after 36
months in development, maybe 20 different reviewers, extensive field
tests, still occasionally have a question whose problems only become
apparent after you give the test
more common on classroom tests -- but instead of getting defensive,
or making these decisions at random on basis of which of your
students can argue with you, do it scientifically
2. More diagnostic information on students
o another immediate payoff of item analysis
Classroom level:
o will tell which questions they were all guessing on, or if you find a
question which most of them found very difficult, you can reteach that
o CAN do item analysis on pretests to:
so if you find a question they all got right, don't waste more time on
this area
find the wrong answers they are choosing to identify common misconceptions
can't tell this just from score on total test, or class average
Individual level:
o isolate specific errors this child made
o after you've planned these tests, written perfect questions, and now analyzed
the results, you're going to know more about these kids than they know
3. Build future tests, revise test items to make them better
o REALLY pays off second time you teach the same course
by now you know how much work writing good questions is
studies have shown us that it is FIVE times faster to revise items that
didn't work, using item analysis, than trying to replace it with a
completely new question
new item which would just have new problems anyway

--> this way you eventually get perfect items, the envy of your
o SHOULD NOT REUSE WHOLE TESTS --> diagnostic teaching means that
you are responding to needs of your students, so after a few years you build
up a bank of test items from which you can custom make tests for your class
know what class average will be before you even give the test because
you will know approximately how difficult each item is before you use it
can spread difficulty levels across your blueprint too...
4. Part of your continuing professional development
o doing the occasional item analysis will help teach you how to become a better
test writer
o and you're also documenting just how good your evaluation is
o useful for dealing with parents or principals if there's ever a dispute
o once you start bringing out all these impressive looking stats parents and
administrators will believe that maybe you do know what you're talking about
when you fail students...
o parent says, "I think your question stinks",

well, "according to the item analysis, this question appears to have worked
well -- it's your son that stinks"

(just kidding! --actually, face validity takes priority over stats any day!)
o and if the analysis shows that the question does stink, you've already
dropped it before you've handed it back to the student, let alone the parent
seeing it...
5. Before and After Pictures
o long term payoff
o collect this data over ten years, not only get great item bank, but if you
change how you teach the course, you can find out if innovation is working
o if you have a strong class (as compared to provincial baseline) but they do
badly on same test you used five years ago, the new textbook stinks.

ITEM ANALYSIS is one area where even a lot of otherwise very good classroom teachers
fall down
they think they're doing a good job; they think they're doing good evaluation, but
without doing item analysis, they can't really know
part of being a professional is going beyond the illusion of doing a good job to
finding out whether you really are
but it's something a lot of teachers just don't know HOW to do
do it indirectly when kids argue with them...wait for complaints from students,
students' parents and maybe other teachers...
I do realize that I am advocating here more work for you in the short term, but, it
will pay off in the long term
But realistically:

*Probably only doing it for your most important tests
end of unit tests, final exams --> summative evaluation
especially if you're using common exams with other teachers
common exams give you bigger sample to work with, which is good
makes sure that questions other teacher wrote are working for YOUR class
maybe they taught different stuff in a different way
impress the daylights out of your colleagues
*Probably only doing it for test questions you are likely going to reuse next year

*Spend less time on item analysis than on revising items
item analysis is not an end in itself,
no point unless you use it to revise items,
and help students on basis of information you get out of it
I also find that, if you get into it, it is kind of fascinating. When stats turn out well, it's
objective, external validation of your work. When stats turn out differently than you
expect, it becomes a detective mystery as you figure out what went wrong.
But you'll have to take my word on this until you try it on your own stuff.

Eight Simple Steps to Item Analysis

1. Score each answer sheet, write score total on the corner
o obviously have to do this anyway
2. Sort the pile into rank order from top to bottom score
(1 minute, 30 seconds tops)
3. If normal class of 30 students, divide class in half
o same number in top and bottom group:
o toss middle paper if odd number (put aside)
4. Take 'top' pile, count number of students who responded to each alternative
o fast way is simply to sort piles into "A", "B", "C", "D" // or true/false or type
of error you get for short answer, fill-in-the-blank

OR set up on spread sheet if you're familiar with computers


         Upper

 1.  A       0
    *B       4
     C       1
     D       1

*=Keyed Answer
o repeat for lower group


         Upper  Low

 1.  A       0
    *B       4    2
     C       1
     D       1

*=Keyed Answer
o this is the time consuming part --> but not that bad, can do it while watching
TV, because you're just sorting piles

(A) If you have a large sample of around 100 or more, you can cut down the sample
you work with
o take top 27% (27 out of 100); bottom 27% (so only dealing with 54, not all 100)
o put middle 46 aside for the moment
larger the sample, more accurate, but have to trade off against labour;
using top 1/3 or so is probably good enough by the time you get to
100; --27% magic figure statisticians tell us to use
o I'd use halves at 30, but you could just use a sample of top 10 and bottom 10
if you're pressed for time
but it means a single student changes stats by 10%
trading off speed for accuracy...
o but I'd rather have you doing ten and ten than nothing
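
If the scores are already typed into a computer, the sorting and splitting in steps 2 and 3 and shortcut (A) can be sketched as below. This is just one way to do it; the function name and the made-up scores are assumptions for illustration.

```python
def split_groups(scores, percent=27):
    """Rank scores from top to bottom and return (upper, middle, lower) groups.

    percent=27 gives the top and bottom 27%; percent=50 gives halves,
    with the middle paper set aside when the class size is odd.
    """
    ranked = sorted(scores, reverse=True)
    k = len(ranked) * percent // 100          # same number in top and bottom group
    return ranked[:k], ranked[k:len(ranked) - k], ranked[len(ranked) - k:]


# Hypothetical class of 100 with scores 0..99.
upper, middle, lower = split_groups(list(range(100)))
print(len(upper), len(middle), len(lower))

# Hypothetical class of 31 split in halves: one middle paper is put aside.
upper31, middle31, lower31 = split_groups(list(range(31)), percent=50)
print(len(upper31), len(middle31), len(lower31))
```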
(B) Second short cut, if you have access to photocopier (budgets)
o photocopy answer sheets, cut off identifying info
(can't use if handwriting is distinctive)
o colour code high and low groups --> dab of marker pen color
o distribute randomly to students in your class so they don't know whose
answer sheet they have
o get them to raise their hands
for #6, how many have "A" on blue sheet?
how many have "B"; how many "C"
for #6, how many have "A" on red sheet....
o some reservations because they can screw you up if they don't take it seriously
o another version of this would be to hire kid who cuts your lawn to do the
counting, provided you've removed all identifying information
I actually did this for a bunch of teachers at one high school in
Edmonton when I was in university for pocket money
(C) Third shortcut, IF you can't use separate answer sheet, sometimes faster to type
than to sort


ITEM # 1 2 3 4 5 6 7 8 9 10



Kay T T T F F A D D A C

Jane T T T F T A D C A D

John F F T F T A D C A B

o type name; then T or F, or A,B,C,D == all left hand on typewriter, leaving
right hand free to turn pages (from Sax)
o IF you have a computer program -- some kicking around -- will give you all
stats you need, plus bunches more you don't-- automatically after this stage
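
As a rough idea of what such a program does with the typed rows, here is a minimal sketch that counts the responses to each item. It uses the three rows typed above (Kay, Jane, John); the function name is an assumption, and a real analysis would run it separately for the upper and lower groups.

```python
from collections import Counter

# One string per student: responses to items 1-10, as typed above.
RESPONSES = {
    "Kay":  "TTTFFADDAC",
    "Jane": "TTTFTADCAD",
    "John": "FFTFTADCAB",
}


def tally(responses):
    """Return one Counter per item: how many students gave each response."""
    rows = list(responses.values())
    return [Counter(column) for column in zip(*rows)]


counts = tally(RESPONSES)
print(counts[0])   # item 1
print(counts[7])   # item 8
```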

5. Subtract the number of students in lower group who got question
right from number of high group students who got it right
o quite possible to get a negative number


         Upper  Low  Difference

 1.  A       0
    *B       4    2        2
     C       1
     D       1

*=Keyed Answer
6. Divide the difference by number of students in upper or lower group
o in this case, divide by 15
o this gives you the "discrimination index" (D)


         Upper  Low  Difference    D

 1.  A       0
    *B       4    2        2      .333
     C       1
     D       1

*=Keyed Answer
7. Total number who got it right


         Upper  Low  Difference    D    Total

 1.  A       0
    *B       4    2        2      .333     6
     C       1
     D       1

*=Keyed Answer
8. If you have a large class and were only using the 1/3 sample for top and
bottom groups, then you have to NOW count number of middle group who
got each question right (not each alternative this time, just right answers)
9. Sample Form Class Size= 100.
o if class of 30, upper and lower half, no other column here
10. Divide total by total number of students
o difficulty = (proportion who got it right (p) )


         Upper  Low  Difference    D    Total  Difficulty

 1.  A       0
    *B       4    2        2      .333     6      .42
     C       1
     D       1

*=Keyed Answer
11. You will NOTE the complete lack of complicated statistics --> counting,
adding, dividing --> no tricky formulas required for this
o not going to worry about corrected point biserials etc.
o one of the advantages of using fixed number of alternatives
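
The counting, subtracting and dividing above boils down to two one-line calculations. A minimal sketch, using illustrative figures for a class of 30 split into halves of 15 (the function names are assumptions):

```python
def discrimination_index(upper_right, lower_right, group_size):
    """D = (number right in upper group - number right in lower group) / group size."""
    return (upper_right - lower_right) / group_size


def difficulty(total_right, n_students):
    """p = proportion of all students who got the question right."""
    return total_right / n_students


# Illustrative item: 7 of the top 15 and 5 of the bottom 15 got it right.
d = discrimination_index(7, 5, 15)
p = difficulty(7 + 5, 30)
print(round(d, 2), round(p, 2))
```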

Interpreting Item Analysis

Let's look at what we have and see what we can see
90% of item analysis is just common sense...
1. Potential Miskey
2. Identifying Ambiguous Items
3. Equal distribution to all alternatives.
4. Alternatives are not working
5. Distracter too attractive.
6. Question not discriminating.
7. Negative discrimination.
8. Too Easy.
9. Omit.
10. & 11. Relationship between D index and Difficulty (p).

o Item Analysis of Computer Printouts

1. What do we see looking at this first one? [Potential Miskey]

         Upper  Low  Difference    D    Total  Difficulty

 1. *A       1    4       -3      -.20     5      .17
     B       1    3
     C      10    5
     D       3    3
     O   <----means omit or no answer
o #1, more high group students chose C than A, even though A is supposedly
the correct answer
o more low group students chose A than high group so got negative
o only 17% of class got it right
o most likely you just wrote the wrong answer key down
--> this is an easy and very common mistake for you to make
better you find out now before you hand back than when kids complain
OR WORSE, they don't complain, and teach themselves your
miskey as the "correct" answer
o so check it out and rescore that question on all the papers before handing
them back
o Makes it 10-5 Difference = 5; D=.33; Total = 15; difficulty=.50
--> nice item
o but suppose you check and find that you didn't miskey it --> that is the answer you intended

two possibilities:
1. one possibility is that you made slip of the tongue and taught them the
wrong answer
anything you say in class can be taken down and used against
you on an examination....
2. more likely means even "good" students are being tricked by a
common misconception -->
You're not supposed to have trick questions, so may want to dump it
--> give those who got it right their point, but total rest of the
marks out of 24 instead of 25
If scores are high, or you want to make a point, might let it stand, and then teach to
it --> sometimes if they get caught, will help them to remember better in future
such as:
o very fine distinctions
o crucial steps which are often overlooked
REVISE it for next time to weaken "C"
-- alternatives are not supposed to draw more than the keyed answer
-- almost always an item flaw, rather than useful distinction
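
The miskey check on item #1 can be sketched in code: recompute D and difficulty under each candidate key and see which one behaves like a proper item. The counts are the ones in the table for #1; the function names and the exact flagging rule are assumptions for this sketch.

```python
# Counts for item #1 above (upper and lower groups of 15 each).
UPPER = {"A": 1, "B": 1, "C": 10, "D": 3}
LOWER = {"A": 4, "B": 3, "C": 5, "D": 3}


def item_stats(upper, lower, key, group_size=15):
    """Return (D, difficulty) for the item if `key` is treated as the right answer."""
    d = (upper[key] - lower[key]) / group_size
    p = (upper[key] + lower[key]) / (2 * group_size)
    return round(d, 2), round(p, 2)


def possible_miskey(upper, lower, key):
    """Alternatives that beat the key in the upper group and favour the upper group."""
    return [alt for alt in upper
            if alt != key and upper[alt] > upper[key] and upper[alt] > lower[alt]]


print(item_stats(UPPER, LOWER, "A"))       # as keyed
print(possible_miskey(UPPER, LOWER, "A"))  # candidates to check against your notes
print(item_stats(UPPER, LOWER, "C"))       # if C turns out to be the real answer
```

The flag only tells you where to look; whether the item was miskeyed, mistaught or is a trick question is still your call.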

2. What can we see with #2: [Can identify ambiguous items]

         Upper  Low  Difference    D    Total  Difficulty

 2.  A       6    5
     B       1    2
    *C       7    5        2       .13    12      .40
     D       1    3

o #2, about equal numbers of top students went for A and C.
Suggests they couldn't tell which was correct
either, students didn't know this material (in which case you can
reteach it)
or the item was defective --->
o look at their favorite alternative again, and see if you can find any reason
they could be choosing it
o often items that look perfectly straightforward to adults are ambiguous to students
Favorite examples of ambiguous items.
o if you NOW realize that A was a defensible answer, rescore before you hand
it back to give everyone credit for either A or C -- avoids arguing with you in class
o if it's clearly a wrong answer, then you now know which error most of your
students are making to get wrong answer
o useful diagnostic information on their learning, your teaching

3. Equal distribution to all alternatives

         Upper  Low  Difference    D    Total  Difficulty

 3.  A       4    3
     B       3    4
    *C       5    4        1       .07     9      .30
     D       3    4

o item #3, students respond about equally to all alternatives
o usually means they are guessing
Three possibilities:
1. may be material you didn't actually get to yet
you designed test in advance (because I've convinced you to
plan ahead) but didn't actually get everything covered before the test
or item on a common exam that you didn't stress in your class
2. item so badly written students have no idea what you're asking
3. item so difficult students just completely baffled
o review the item:
if badly written ( by other teacher) or on material your class hasn't
taken, toss it out, rescore the exam out of lower total
BUT give credit to those that got it, to a total of 100%

if seems well written, but too hard, then you know to (re)teach this
material for rest of class....
maybe the 3 who got it are top three students,
tough but valid item:
OK, if item tests valid objective
want to provide occasional challenging question for top students
but make sure you haven't defined "top 3 students" as "those
able to figure out what the heck I'm talking about"

4. Alternatives aren't working

         Upper  Low  Difference    D    Total  Difficulty

 4.  A       1    5
    *B      14    7        7       .47    21      .70
     C       0    2
     D       0    0

o example #4 --> no one fell for D --> so it is not a plausible alternative
o question is fine for this administration, but revise item for next time
o toss alternative D, replace it with something more realistic
o each distracter has to attract at least 5% of the students
class of 30, should get at least two students

o or might accept one if you positively can't think of another fourth alternative --
otherwise, do not reuse the item
if two alternatives don't draw any students --> might consider redoing as

5. Distracter too attractive

         Upper  Low  Difference    D    Total  Difficulty

 5.  A       7   10
     B       1    2
     C       1    1
    *D       5    2        3       .20     7      .23

o sample #5 --> too many going for A
--> no ONE distracter should get more than key

--> no one distracter should pull more than about half of students

-- doesn't leave enough for correct answer and five percent for each
o keep for this time
o weaken it for next time
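
The two distracter rules in #4 and #5 (each distracter should draw at least 5% of students, and no distracter should draw more than the key) can be checked mechanically. A minimal sketch using the counts from those two tables; the function name and message wording are assumptions:

```python
def check_distracters(upper, lower, key, min_share=0.05):
    """Flag distracters that drew too few students or more than the keyed answer."""
    total = sum(upper.values()) + sum(lower.values())
    key_count = upper[key] + lower[key]
    notes = {}
    for alt in upper:
        if alt == key:
            continue
        chosen = upper[alt] + lower[alt]
        if chosen / total < min_share:
            notes[alt] = "not plausible: drew under 5% of students"
        elif chosen > key_count:
            notes[alt] = "too attractive: drew more than the key"
    return notes


# Item #4 (key B) and item #5 (key D) from the tables above.
ITEM_4 = ({"A": 1, "B": 14, "C": 0, "D": 0}, {"A": 5, "B": 7, "C": 2, "D": 0})
ITEM_5 = ({"A": 7, "B": 1, "C": 1, "D": 5}, {"A": 10, "B": 2, "C": 1, "D": 2})

print(check_distracters(*ITEM_4, key="B"))
print(check_distracters(*ITEM_5, key="D"))
```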

6. Question not discriminating

         Upper  Low  Difference    D    Total  Difficulty

 6. *A       7    7        0       .00    14      .47
     B       3    2
     C       2    1
     D       3    5

o sample #6: low group gets it as often as high group
o on norm-referenced tests, point is to rank students from best to worst
o so individual test items should have good students get question right, poor
students get it wrong
o test overall decides who is a good or poor student on this particular topic
those who do well have more information, skills than those who do
less well
so if on a particular question those with more skills and knowledge do
NOT do better, something may be wrong with the question
o question may be VALID, but off topic
E.G.: rest of test tests thinking skill, but this is a memorization
question, skilled and unskilled equally as likely to recall the answer
should have homogeneous test --> don't have a math item in with
social studies
if wanted to get really fancy, should do separate item analysis for each
cell of your blueprint, as long as you had six items per cell

o question is VALID, on topic, but not RELIABLE
addresses the specified objective, but isn't a useful measure of
individual differences
asking Grade 10s Capital of Canada is on topic, but since they will all
get it right, won't show individual differences -- give you low D

7. Negative Discrimination

         Upper  Low  Difference    D    Total  Difficulty

 7. *A       7   10       -3      -.20    17      .57
     B       3    3
     C       2    1
     D       3    1

o D (discrimination) index is just upper group minus lower group
o varies from +1.0 to -1.0
o if all top got it right, all lower got it wrong = 100% = +1
o if more of the bottom group get it right than the top group, you get a
negative D index
o if you have a negative D, means that students with less skills and knowledge
overall, are getting it right more often than those who the test says are better
o in other words, the better you are, the more likely you are to get it wrong

Two possibilities:
o usually means an ambiguous question
that is confusing good students, but weak students too weak to see
the problem
look at question again, look at alternatives good students are going
for, to see if you've missed something

o or it might be off topic

--> something weaker students are better at (like rote memorization) than
good students

--> not part of same set of skills as rest of test--> suggests design flaw with
table of specifications perhaps
((-if you end up with a whole bunch of -D indices on the same test, must mean you
actually have two different distinct skills, because by definition, the low group is the
high group on that bunch of questions
--> end up treating them as two separate tests))
o if you have a large enough sample (like the provincial exams) then we toss
the item and either don't count it or give everyone credit for it
o with sample of 100 students or less, could just be random chance, so
basically ignore it in terms of THIS administration
kids wrote it, give them mark they got
o furthermore, if you keep dropping questions, may find that you're starting to
develop serious holes in your blueprint coverage -- problem for sampling
but you want to track this stuff FOR NEXT TIME
o if it's negative on administration after administration, consistently, likely not
random chance, it's screwing up in some way
o want to build your future tests out of those items with high positive D indices
o the higher the average D indices on the test, the more RELIABLE the
test as a whole will be
o revise items to increase D
-->if good students are selecting one particular wrong alternative,
make it less attractive

-->or increase probability of their selecting right answer by making it
more attractive
o may have to include some items with negative Ds if those are the only items
you have for that specification, and it's an important specification
what this means is that there are some skills/knowledge in this unit
which are unrelated to rest of the skills/knowledge
--> but may still be important

o e.g., statistics part of this course may be terrible for those students who are
the best item writers, since writing tends to be associated with the opposite
hemisphere in the brain than math, right... but still important objective in this course
may lower reliability of test, but increases content validity

8. Too Easy

         Upper  Low  Difference    D    Total  Difficulty

 8.  A       0    1
    *B      14   13        1       .07    27      .90
     C       0    1
     D       1    1

o too easy or too difficult won't discriminate well either
o difficulty (p) (for proportion) varies from +1.0 (everybody got it right) to 0
o if the item is NOT miskeyed or some other glaring problem, it's too late to
change after administered --> everybody got it right, OK, give them the mark
TOO DIFFICULT = 30 to 35% (used to be rule in Branch, now not...)
o if the item is too difficult, don't drop it, just because everybody missed it -->
you must have thought it was an important objective or it wouldn't have been
on there;
o and unless literally EVERYONE missed it, what do you do with the students
who got it right?
o give them bonus marks?
o cheat them of a mark they got?
furthermore, if you drop too many questions, lose content validity (specs)
--> if two or three got it right may just be random chance,
so why should they get a bonus mark

o however, DO NOT REUSE questions with too high or low difficulty (p) values
in future
if difficulty is over 85%, you're wasting space on limited item test
o asking Grade 10s the Capital of Canada is probably waste of their time and
yours --> unless this is a particularly vital objective
o same applies to items which are too difficult --> no use asking Grade 3s to
solve quadratic equation
o but you may want to revise question to make it easier or harder rather than
just toss it out cold

You may have consciously decided to develop a "Mastery" style test
--> will often have very easy questions & expect everyone to get
everything; trying to identify only those who are not ready to go on

--> in which case, don't use any question which DOES NOT have a
difficulty level above 85% or whatever
Or you may want a test to identify the top people in class, the reach for the
top team, and design a whole test of really tough questions
--> have low difficulty values (i.e., very hard)

o so depends a bit on what you intend to do with the test in question
o this is what makes the difficulty index (proportion) so handy
14. you create a bank of items over the years
--> using item analysis you get better questions all the time, until you
have a whole bunch that work great

-->can then tailor-make a test for your class

you want to create an easier test this year, you pick questions with
higher difficulty (p) values;

you want to make a challenging test for your gifted kids, choose items
with low difficulty (p) values

--> for most applications will want to set difficulty level so that it gives
you average marks, nice bell curve
government uses 62.5 --> four item multiple choice, middle of
bell curve,

15. start tests with an easy question or two to give students a running start
16. make sure that the difficulty levels are spread out over examination
not all hard geography questions, easy history
unfair to kids who are better at geography, worse at history
turns class off geography if they equate it with tough questions
-->REMEMBER here that difficulty is different than complexity,
so can have difficult recall knowledge question, easy synthesis
synthesis and evaluation items will tend to be harder than recall
questions so if find higher levels are more difficult, OK, but try to
balance cells as much as possible
certainly content cells should be roughly the same
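
Tailor-making a test from your item bank can be sketched as picking items by their difficulty (p) values and averaging them to estimate the class average in advance. The bank below is entirely hypothetical, the function names are assumptions, and the estimate only holds if this year's class resembles the classes the p values came from.

```python
# Hypothetical item bank: item label -> difficulty (p) from earlier item analyses.
BANK = {"Q1": 0.90, "Q2": 0.80, "Q3": 0.60, "Q4": 0.50, "Q5": 0.40, "Q6": 0.25}


def pick_items(bank, low, high):
    """Items whose difficulty (p) falls between low and high, inclusive."""
    return [item for item, p in bank.items() if low <= p <= high]


def predicted_average(bank, items):
    """Expected class average in percent: the mean of the chosen items' p values."""
    return 100 * sum(bank[item] for item in items) / len(items)


easier = pick_items(BANK, 0.50, 0.90)    # easier test: higher p values
harder = pick_items(BANK, 0.00, 0.45)    # challenging test: lower p values
print(easier, round(predicted_average(BANK, easier), 1))
print(harder, round(predicted_average(BANK, harder), 1))
```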


9. Omit

         Upper  Low  Difference    D    Total  Difficulty

 9.  A       2    1
     B       3    4
    *C       7    3        4       .27    10      .33
     D       1    1
     O       2    4

If near end of the test
--> they didn't find it because it was on the next page
--format problem

--> your test is too long, 6 of them (20%) didn't get to it

OR, if middle of the test:
--> totally baffled them because:
way too difficult for these guys
or because also 2 from high group too: ambiguous wording

10. & 11. Relationship between D index and Difficulty (p)

         Upper  Low  Difference    D    Total  Difficulty

10.  A       0    5
    *B      15    0       15      1.00    15      .50
     C       0    5
     D       0    5
11.  A       3    2
    *B       8    7        1       .07    15      .50
     C       2    3
     D       2    3

o 10 is a perfect item --> each distracter gets at least 5
discrimination index is +1.0
o high discrimination D indices require optimal levels of difficulty
o but optimal levels of difficulty do not assure high levels of D
o 11 has same difficulty level, different D
on four item multiple-choice, student doing totally by chance will get 25%
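
Why high D needs middling difficulty can be shown with a little arithmetic. Assuming the whole class is split into equal upper and lower halves, the biggest D you can possibly get at a given difficulty p is 2 x the smaller of p and (1 - p). The sketch below (function names are assumptions) checks this against items 10 and 11:

```python
def d_index(upper_right, lower_right, group_size=15):
    """Discrimination index from counts of correct answers in each half."""
    return (upper_right - lower_right) / group_size


def max_d(p):
    """Largest possible D at difficulty p, assuming equal upper and lower halves."""
    return 2 * min(p, 1 - p)


# Items 10 and 11: same difficulty (.50), very different D.
print(round(d_index(15, 0), 2), round(d_index(8, 7), 2))

# Ceiling on D at a few difficulty levels.
for p in (0.50, 0.70, 0.90):
    print(p, round(max_d(p), 2))
```

So an item at .50 can reach D = 1.0 but is not guaranteed to, while an item at .90 can never do better than .20.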

Program Evaluation

When your kids write the Diploma or Achievement Test, the Department sends out a printout of
how your class did compared to everybody else in the province
Three types of report:
o how are kids doing today in terms of meeting the standards?
o how are they doing compared to four years ago? eight years ago?
(monitor over time)
o format keeps changing --> some years all tests in one book to save on paper
and mailing costs; other years each exam gets its own report
o tons of technical information (gender stuff, etc.)
(up to superintendent what happens to these after that
--> can publish in newspaper, keep secret central office only, etc.)
o get your hands on and interpret
o either you do it or someone else will do it for/to you
o better teachers take responsibility rather than top down
o new table formats are so easy to interpret no reason not to
o this means you can compare their responses to the responses of 30,000
students across the province
will help you calibrate your expectations for this class
is your particular class high or low one?
have you set your standards too high or too low?
giving everyone C's because you think they ought to do better than
this, but they all ace the provincial tests?
Who Knows Where This is? OVERHEAD: SCHOOL TABLE 2 (June 92 GRADE 9 Math)
o check table 2 for meeting standard of excellence
o Standards set by elaborate committee structure
This example (overhead): Your class had 17 students
Total test out of 49 (means test of 50, but dropped one after item analysis)
standard setting procedures decided that 42/49 is standard of EXCELLENCE
for grade 9s in Alberta
next column shows they expect 15% to reach this standard

standard setting procedure decided that 23 out of 49 is the Acceptable
standard; next column says expect 85% to reach that standard

columns at end of table show that actually, only 8.9% made standard of
excellence, and only 67.4% made acceptable standard

(bad news!)

but looking at YOUR class, 5.9, almost 6% made standard of excellence (so
fewer than province as a whole) but on the other hand, 76.5% meeting
acceptable standard.
Need comparison -- otherwise, fact that only get 6% to excellence might sound
Interpretation: either you only have one excellent math student in your class,
or you are teaching to acceptable standard, but not encouraging excellence?
BUT can use tables to look deeper,
o use tables to identify strengths and weaknesses in student learning
o and therefore identify your own strength and weaknesses
Problem solving & knowledge/skills broken down --> table of specs topics

Interestingly, though, above provincial on problem solving at excellence...

ASK: -how do you explain % meeting knowledge and % meeting problems both
higher than % meeting standard on whole test?

Answer: low correlation between performance on the two types of questions
(i.e., those who met standard on the one often did not on the other)
which means (a) can't assume that easy/hard = Bloom's taxonomy

and (b) that you have to give students both kinds of questions on your test
or you are being unfair to group who is better at the other stuff
Don't know where this is OVERHEAD: SCHOOL TABLE 5.1 (GRADE 9 MATH, JUNE 92)
o check tables 5.1 to 5.6 for particular areas of strengths and weaknesses
o look for every question where students in your school were 5% different on
keyed answer from those in provincial test
if 5% or more higher, is a particular strength of your program
if 5% or more lower, is a particular weakness
note that score on question irrelevant, only difference from rest of province
--> e.g., if you only got 45% but province only got 35%, then that's a
significant strength

--> the fact that less than 50% just means it was a really tough question, too
hard for this grade
o similarly, just because got 80% doesn't make your class good if province is
if find all strengths or all weaknesses, where is the gap lowest?
least awful = strengths; least strong = weakness to work on
THIS EXAMPLE? all above provincial scores on these skills

converts a decimal into a fraction 76.5-60.9 = 15.6% above provincial norm

so decimal to fraction is a strength

all of these are good, but find least good --> that's the area to concentrate on

question 10: only 4.5% difference -- so your weak spot, one area you aren't
significantly above rest of province
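
The 5% rule above can be sketched in a few lines of Python. Only the decimal-to-fraction figures (76.5 vs 60.9) come from the example; the second row's percentages are hypothetical placeholders chosen to give the 4.5-point gap mentioned for question 10, and the function name is an illustration rather than part of any provincial report.

```python
# Percent choosing the keyed answer: (school, province).
# Only the first row comes from the overhead example; the second is made up.
results = {
    "converts a decimal into a fraction": (76.5, 60.9),
    "hypothetical skill (question 10)": (64.5, 60.0),
}

THRESHOLD = 5.0  # percentage points, the cut-off used in the notes


def classify(school, province, threshold=THRESHOLD):
    """Label an item by how far the school is from the provincial result."""
    gap = round(school - province, 1)
    if gap >= threshold:
        return gap, "strength"
    if gap <= -threshold:
        return gap, "weakness"
    return gap, "no clear difference"


report = {skill: classify(*pcts) for skill, pcts in results.items()}
for skill, (gap, label) in report.items():
    print(f"{skill}: {gap:+.1f} points -> {label}")
```

Note that the raw percentage never enters the label; only the gap from the province does.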

You can even begin to set standards in your class as province does
--> i.e., ask yourself BEFORE the test how many of these questions should
your class be able to do on this test?
Then look at actual performance.

How did my students do? Compared to what?
my classroom expectations
school's expectations
jurisdiction's expectations
provincial expectations
the last/previous test administered
community expectations
(each jurisdiction now has its own public advisory committee)
You can even create your own statistics to compare with provincial standard
o lots of teachers recycle Diploma and Achievement test questions, but they
only do it to prep kids for actual exam --> losing all that diagnostic info
HOWEVER:-avoid comparisons between schools
o serves no useful purpose, has no logic since taken out of context
o e.g., comparing cancer clinic and walk in clinic --> higher death rate in cancer
clinic doesn't mean it's worse; may be best cancer clinic in the world, be doing
a great job given more serious nature of problems it faces
o invidious comparisons like this become "blaming" exercise
o self-fulfilling prophecy: parents pull kids from that school
Provincial authorities consider such comparisons a misuse of results
o school report = your class if only one class;
but if two or more classes, then we are talking about your school's program

--> forces you to get together with other teachers to find out what they're

--> pool resources, techniques, strategies to address problem areas....

Reliability and Item Analysis
General Introduction
Basic Ideas
Classical Testing Model
Sum Scales
Cronbach's Alpha
Split-Half Reliability
Correction for Attenuation
Designing a Reliable Scale
This topic discusses the concept of reliability of measurement as used in social sciences (but not in
industrial statistics or biomedical research). The term reliability used in industrial statistics denotes a
function describing the probability of failure (as a function of time). For a discussion of the concept
of reliability as applied to product quality (e.g., in industrial statistics), please refer to the section
on Reliability/Failure Time Analysis in the Process Analysis topic (see also the section Repeatability and
Reproducibility and the topic Survival/Failure Time Analysis). For a comparison between these two (very
different) concepts of reliability, see Reliability.
General Introduction
In many areas of research, the precise measurement of hypothesized processes or variables
(theoretical constructs) poses a challenge by itself. For example, in psychology, the precise
measurement of personality variables or attitudes is usually a necessary first step before any theories
of personality or attitudes can be considered. In general, in all social sciences, unreliable
measurements of people's beliefs or intentions will obviously hamper efforts to predict their behavior.
The issue of precision of measurement will also come up in applied research, whenever variables are
difficult to observe. For example, reliable measurement of employee performance is usually a difficult
task; yet, it is obviously a necessary precursor to any performance-based compensation system.
In all of these cases, Reliability & Item Analysis may be used to construct reliable measurement scales, to
improve existing scales, and to evaluate the reliability of scales already in use. Specifically, Reliability
& Item Analysis will aid in the design and evaluation of sum scales, that is, scales that are made up of
multiple individual measurements (e.g., different items, repeated measurements, different
measurement devices, etc.). You can compute numerous statistics that allow you to build and
evaluate scales following the so-called classical testing theory model.
The assessment of scale reliability is based on the correlations between the individual items or
measurements that make up the scale, relative to the variances of the items. If you are not familiar
with the correlation coefficient or the variance statistic, we recommend that you review the respective
discussions provided in the Basic Statistics section.
The classical testing theory model of scale construction has a long history, and there are many
textbooks available on the subject. For additional detailed discussions, you may refer to, for example,
Carmines and Zeller (1980), De Gruijter and Van Der Kamp (1976), Kline (1979, 1986), or Thorndike
and Hagen (1977). A widely acclaimed "classic" in this area, with an emphasis on psychological and
educational testing, is Nunnally (1970).
Testing hypotheses about relationships between items and tests. Using Structural Equation Modeling and
Path Analysis (SEPATH), you can test specific hypotheses about the relationship between sets of items
or different tests (e.g., test whether two sets of items measure the same construct, analyze multi-
trait, multi-method matrices, etc.).

Basic Ideas
Suppose we want to construct a questionnaire to measure people's prejudices against foreign-made
cars. We could start out by generating a number of items such as: "Foreign cars lack personality,"
"Foreign cars all look the same," etc. We could then submit those questionnaire items to a group of
subjects (for example, people who have never owned a foreign-made car). We could ask subjects to
indicate their agreement with these statements on 9-point scales, anchored at 1=disagree and 9=agree.
True scores and error. Let us now consider more closely what we mean by precise measurement in this
case. We hypothesize that there is such a thing (theoretical construct) as "prejudice against foreign
cars," and that each item "taps" into this concept to some extent. Therefore, we may say that a
subject's response to a particular item reflects two aspects: first, the response reflects the prejudice
against foreign cars, and second, it will reflect some esoteric aspect of the respective question. For
example, consider the item "Foreign cars all look the same." A subject's agreement or disagreement
with that statement will partially depend on his or her general prejudices, and partially on some other
aspects of the question or person. For example, the subject may have a friend who just bought a very
different looking foreign car.
Testing hypotheses about relationships between items and tests. To test specific hypotheses about the
relationship between sets of items or different tests (e.g., whether two sets of items measure the
same construct, analyze multi-trait, multi-method matrices, etc.) use Structural Equation
Modeling (SEPATH).

Classical Testing Model
To summarize, each measurement (response to an item) reflects to some extent the true score for the
intended concept (prejudice against foreign cars), and to some extent esoteric, random error. We can
express this in an equation as:
X = tau + error
In this equation, X refers to the respective actual measurement, that is, subject's response to a
particular item; tau is commonly used to refer to the true score, and error refers to the random error
component in the measurement.

In this context the definition of reliability is straightforward: a measurement is reliable if it reflects
mostly true score, relative to the error. For example, an item such as "Red foreign cars are particularly
ugly" would likely provide an unreliable measurement of prejudices against foreign-made cars. This is
because there probably are ample individual differences concerning the likes and dislikes of colors.
Thus, this item would "capture" not only a person's prejudice but also his or her color preference.
Therefore, the proportion of true score (for prejudice) in subjects' response to that item would be
relatively small.
Measures of reliability. From the above discussion, one can easily infer a measure or statistic to describe
the reliability of an item or scale. Specifically, we may define an index of reliability in terms of the
proportion of true score variability that is captured across subjects or respondents, relative to the total
observed variability. In equation form, we can say:
$$\text{Reliability} = \frac{\sigma^2_{\text{true score}}}{\sigma^2_{\text{total observed}}}$$
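
A small simulation can make this ratio concrete. The sketch below assumes normally distributed true scores and errors with made-up standard deviations (2.0 and 1.0); it builds X = tau + error for many simulated subjects and compares true-score variance with total observed variance.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

N_SUBJECTS = 5000
TRUE_SD = 2.0   # assumed spread of true scores (tau)
ERROR_SD = 1.0  # assumed spread of random error

tau = [random.gauss(5.0, TRUE_SD) for _ in range(N_SUBJECTS)]
error = [random.gauss(0.0, ERROR_SD) for _ in range(N_SUBJECTS)]
x = [t + e for t, e in zip(tau, error)]  # X = tau + error

# Reliability = true score variance / total observed variance
reliability = statistics.pvariance(tau) / statistics.pvariance(x)

# Value implied by the assumed standard deviations: 4 / (4 + 1) = 0.8
theoretical = TRUE_SD**2 / (TRUE_SD**2 + ERROR_SD**2)
print(round(reliability, 3), theoretical)
```

With real data the true scores are not observable, which is why the statistics discussed in the following sections estimate this ratio indirectly.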

Sum Scales
What will happen when we sum up several more or less reliable items designed to measure prejudice
against foreign-made cars? Suppose the items were written so as to cover a wide range of possible
prejudices against foreign-made cars. If the error component in subjects' responses to each question is
truly random, then we may expect that the different components will cancel each other out across
items. In slightly more technical terms, the expected value or mean of the error component across
items will be zero. The true score component remains the same when summing across items.
Therefore, the more items are added, the more true score (relative to the error score) will be
reflected in the sum scale.
Number of items and reliability. This conclusion describes a basic principle of test design. Namely, the
more items there are in a scale designed to measure a particular concept, the more reliable will the
measurement (sum scale) be. Perhaps a somewhat more practical example will further clarify this
point. Suppose you want to measure the height of 10 persons, using only a crude stick as the
measurement device. Note that we are not interested in this example in the absolute correctness of
measurement (i.e., in inches or centimeters), but rather in the ability to distinguish reliably between
the 10 individuals in terms of their height. If you measure each person only once in terms of multiples
of lengths of your crude measurement stick, the resultant measurement may not be very reliable.
However, if you measure each person 100 times, and then take the average of those 100 measurements
as the summary of the respective person's height, then you will be able to make very precise and
reliable distinctions between people (based solely on the crude measurement stick).
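
The height example can be put in numbers under the classical model. The sketch below assumes independent errors, so that averaging n measurements divides the error variance by n; the two standard deviations are invented for illustration.

```python
def reliability_of_mean(true_sd, error_sd, n_measurements):
    """Reliability of the average of n independent measurements.

    True-score variance is unchanged by averaging, while the error
    variance of the mean shrinks to error_sd**2 / n.
    """
    true_var = true_sd ** 2
    error_var_of_mean = error_sd ** 2 / n_measurements
    return true_var / (true_var + error_var_of_mean)


TRUE_SD = 8.0    # assumed spread of real heights among the persons (cm)
ERROR_SD = 15.0  # assumed error of one reading with the crude stick (cm)

single = reliability_of_mean(TRUE_SD, ERROR_SD, 1)
averaged = reliability_of_mean(TRUE_SD, ERROR_SD, 100)
print(f"1 measurement: {single:.2f}, mean of 100: {averaged:.2f}")
```

Under these assumed values a single reading is mostly error, while the average of 100 readings is almost entirely true score.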
Let's now look at some of the common statistics that are used to estimate the reliability of a
sum scale.


Cronbach's Alpha
To return to the prejudice example, if there are several subjects who respond to our items, then we
can compute the variance for each item, and the variance for the sum scale. The variance of the sum
scale will be larger than the sum of item variances if the items measure the same variability between
subjects, that is, if they measure some true score. Technically, the variance of the sum of two items is
equal to the sum of the two variances plus (two times) the covariance, that is, the amount of true
score variance common to the two items.
We can estimate the proportion of true score variance that is captured by the items by comparing the
sum of item variances with the variance of the sum scale. Specifically, we can compute:
$$\alpha = \frac{k}{k-1}\left[1 - \frac{\sum_{i=1}^{k} s_i^2}{s_{sum}^2}\right]$$
This is the formula for the most common index of reliability, namely, Cronbach's coefficient alpha ($\alpha$).
In this formula, the $s_i^2$'s denote the variances for the k individual items; $s_{sum}^2$ denotes the variance
for the sum of all items. If there is no true score but only error in the items (which is esoteric and
unique, and, therefore, uncorrelated across subjects), then the variance of the sum will be the same as
the sum of variances of the individual items. Therefore, coefficient alpha will be equal to zero. If all
items are perfectly reliable and measure the same thing (true score), then coefficient alpha is equal to
1. (Specifically, $1 - \sum s_i^2 / s_{sum}^2$ will become equal to (k-1)/k; if we multiply this by k/(k-1) we obtain 1.)
Alternative terminology. Cronbach's alpha, when computed for binary (e.g., true/false) items, is
identical to the so-called Kuder-Richardson-20 formula of reliability for sum scales. In either case,
because the reliability is actually estimated from the consistency of all items in the sum scales, the
reliability coefficient computed in this manner is also referred to as the internal-consistency
reliability.
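
As a rough illustration, coefficient alpha can be computed from item variances and the variance of the sum scale in a few lines of Python. The response data below are hypothetical (5 subjects, 3 items on a 9-point scale), and the function name is just a label for this sketch.

```python
import statistics


def cronbach_alpha(items):
    """items: list of k lists, each holding one item's scores for the same subjects."""
    k = len(items)
    item_var_sum = sum(statistics.variance(scores) for scores in items)
    totals = [sum(subject) for subject in zip(*items)]  # sum scale per subject
    return (k / (k - 1)) * (1 - item_var_sum / statistics.variance(totals))


# Hypothetical responses of 5 subjects to 3 items.
items = [
    [2, 4, 5, 7, 9],
    [3, 4, 6, 6, 8],
    [1, 5, 5, 8, 9],
]
alpha = cronbach_alpha(items)

# Limiting case from the text: identical (perfectly consistent) items give alpha = 1.
identical = [[2, 4, 5, 7, 9]] * 3
alpha_identical = cronbach_alpha(identical)
print(round(alpha, 3), round(alpha_identical, 3))
```

For binary (0/1) items the same function applies; the text identifies that case with the Kuder-Richardson-20 formula, although KR-20 is usually written with population variances, so results computed with the sample variance used here may differ slightly.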

Split-Half Reliability
An alternative way of computing the reliability of a sum scale is to divide it in some random manner
into two halves. If the sum scale is perfectly reliable, we would expect that the two halves are
perfectly correlated (i.e., r = 1.0). Less than perfect reliability will lead to less than perfect
correlations. We can estimate the reliability of the sum scale via the Spearman-Brown split-half
coefficient:
rsb = 2rxy /(1+rxy)
In this formula, rsb is the split-half reliability coefficient, and rxy represents the correlation between the
two halves of the scale.
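
A minimal Python sketch of the split-half computation follows. The two lists of half-scale totals are hypothetical, and the Pearson correlation is written out by hand so the block has no dependencies.

```python
import statistics


def pearson(a, b):
    """Plain Pearson correlation between two equally long lists."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (ss_a * ss_b) ** 0.5


def spearman_brown(r_xy):
    """Split-half reliability: r_sb = 2 * r_xy / (1 + r_xy)."""
    return 2 * r_xy / (1 + r_xy)


# Hypothetical totals on the two halves of a scale for five subjects.
half_x = [10, 12, 15, 18, 20]
half_y = [11, 11, 16, 17, 21]

r_xy = pearson(half_x, half_y)
r_sb = spearman_brown(r_xy)
print(round(r_xy, 3), round(r_sb, 3))
```

The coefficient is larger than the raw correlation between halves because each half is only half as long as the full scale.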

Correction for Attenuation
Let us now consider some of the consequences of less than perfect reliability. Suppose we use our scale
of prejudice against foreign-made cars to predict some other criterion, such as subsequent actual
purchase of a car. If our scale correlates with such a criterion, it would raise our confidence in
the validity of the scale, that is, that it really measures prejudices against foreign-made cars, and not
something completely different. In actual test design, the validation of a scale is a lengthy process that
requires the researcher to correlate the scale with various external criteria that, in theory, should be
related to the concept that is supposedly being measured by the scale.
How will validity be affected by less than perfect scale reliability? The random error portion of the
scale is unlikely to correlate with some external criterion. Therefore, if the proportion of true score in
a scale is only 60% (that is, the reliability is only .60), then the correlation between the scale and the
criterion variable will be attenuated, that is, it will be smaller than the actual correlation of true scores.
In fact, the validity of a scale is always limited by its reliability.
Given the reliability of the two measures in a correlation (i.e., the scale and the criterion variable), we
can estimate the actual correlation of true scores in both measures. Put another way, we
can correct the correlation for attenuation:
rxy,corrected = rxy / sqrt(rxx*ryy)

In this formula, rxy,corrected stands for the corrected correlation coefficient, that is, it is the estimate of the
correlation between the true scores in the two measures x and y. The term rxy denotes the uncorrected
correlation, and rxx and ryy denote the reliability of measures (scales) x and y. You can compute the
attenuation correction based on specific values, or based on actual raw data (in which case the
reliabilities of the two measures are estimated from the data).
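
A short worked example in Python, using the standard form of the correction, which divides the observed correlation by the square root of the product of the two reliabilities. The three input values are hypothetical.

```python
import math


def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Estimated correlation between true scores of x and y."""
    return r_xy / math.sqrt(r_xx * r_yy)


# Hypothetical values: observed correlation .30, reliabilities .60 and .80.
corrected = correct_for_attenuation(0.30, 0.60, 0.80)
print(round(corrected, 3))
```

With perfectly reliable measures (both reliabilities equal to 1) the correction leaves the correlation unchanged.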

Designing a Reliable Scale
After the discussion so far, it should be clear that, the more reliable a scale, the better (e.g., more
valid) the scale. As mentioned earlier, one way to make a sum scale more reliable is by adding items. You
can compute how many items would have to be added in order to achieve a particular reliability, or
how reliable the scale would be if a certain number of items were added. However, in practice, the
number of items on a questionnaire is usually limited by various other factors (e.g., respondents get
tired, overall space is limited, etc.). Let us return to our prejudice example, and outline the steps that
one would generally follow in order to design the scale so that it will be reliable:
Step 1: Generating items. The first step is to write the items. This is essentially a creative process where
the researcher makes up as many items as possible that seem to relate to prejudices against foreign-
made cars. In theory, one should "sample items" from the domain defined by the concept. In practice,
for example in marketing research, focus groups are often utilized to illuminate as many aspects of the
concept as possible. For example, we could ask a small group of highly committed American car buyers
to express their general thoughts and feelings about foreign-made cars. In educational and
psychological testing, one commonly looks at other similar questionnaires at this stage of the scale
design, again, in order to gain as wide a perspective on the concept as possible.
Step 2: Choosing items of optimum difficulty. In the first draft of our prejudice questionnaire, we will
include as many items as possible. We then administer this questionnaire to an initial sample of typical
respondents, and examine the results for each item. First, we would look at various characteristics of
the items, for example, in order to identify floor or ceiling effects. If all respondents agree or disagree
with an item, then it obviously does not help us discriminate between respondents, and thus, it is
useless for the design of a reliable scale. In test construction, the proportion of respondents who agree
or disagree with an item, or who answer a test item correctly, is often referred to as the item difficulty.
In essence, we would look at the item means and standard deviations and eliminate those items that
show extreme means, and zero or nearly zero variances.
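
Step 2 can be sketched as a simple screen on item means and standard deviations. The responses and both cut-offs below are assumptions made for illustration; the text does not prescribe numeric thresholds.

```python
import statistics

# Hypothetical responses (rows = respondents, columns = items) on a 9-point scale.
responses = [
    [9, 3, 5, 1],
    [9, 7, 4, 1],
    [9, 2, 6, 2],
    [9, 8, 5, 1],
    [9, 5, 3, 1],
]
SCALE_MIN, SCALE_MAX = 1, 9
MIN_SD = 0.5       # assumed cut-off for "nearly zero" spread
EDGE_MARGIN = 1.0  # assumed distance from the scale ends that counts as "extreme"

keep, drop = [], []
for index, scores in enumerate(zip(*responses), start=1):
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    extreme = mean <= SCALE_MIN + EDGE_MARGIN or mean >= SCALE_MAX - EDGE_MARGIN
    # Items with floor/ceiling means or almost no spread cannot discriminate.
    (drop if extreme or sd < MIN_SD else keep).append(index)

print("keep:", keep, "drop:", drop)
```

Here item 1 (everyone answers 9) and item 4 (almost everyone answers 1) are dropped, while items 2 and 3 are kept.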
Step 3: Choosing internally consistent items. Remember that a reliable scale is made up of items that
proportionately measure mostly true score; in our example, we would like to select items that measure
mostly prejudice against foreign-made cars, and few esoteric aspects we consider random error. To do
so, we would look at the following:
Summary for scale: Mean=46.1100 Std.Dv.=8.26444 Valid n:100
Cronbach alpha: .794313 Standardized alpha: .800491
Average inter-item corr.: .297818

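[Table: item statistics for the 10 items. Columns: Mean if deleted, Var. if deleted, StDv. if deleted, Itm-Totl Correl., Squared Multp. R, Alpha if deleted]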
Shown above are the results for 10 items. Of most interest to us are the three right-most columns.
They show us the correlation between the respective item and the total sum score (without the
respective item), the squared multiple correlation between the respective item and all others, and the
internal consistency of the scale (coefficient alpha) if the respective item would be deleted. Clearly,
items 5 and 6 "stick out," in that they are not consistent with the rest of the scale. Their correlations
with the sum scale are .05 and .12, respectively, while all other items correlate at .45 or better. In the
right-most column, we can see that the reliability of the scale would be about .82 if either of the two
items were to be deleted. Thus, we would probably delete the two items from this scale.
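
Two of the statistics described above, the corrected item-total correlation and alpha if the item is deleted, can be sketched as follows. The four-item data set is hypothetical and built so that the last item is unrelated to the others; the squared multiple correlation is omitted because it needs a regression routine.

```python
import statistics


def pearson(a, b):
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (ss_a * ss_b) ** 0.5


def cronbach_alpha(items):
    k = len(items)
    totals = [sum(subject) for subject in zip(*items)]
    item_var_sum = sum(statistics.variance(scores) for scores in items)
    return (k / (k - 1)) * (1 - item_var_sum / statistics.variance(totals))


# Hypothetical data: 6 subjects, 4 items; item 4 is deliberately noisy.
items = [
    [1, 2, 3, 4, 5, 6],
    [2, 2, 4, 4, 6, 6],
    [1, 3, 3, 5, 5, 6],
    [4, 1, 5, 2, 3, 3],
]

alpha_all = cronbach_alpha(items)
item_total, alpha_if_deleted = [], []
for i, item in enumerate(items):
    others = items[:i] + items[i + 1:]
    rest_total = [sum(subject) for subject in zip(*others)]  # sum score without item i
    item_total.append(pearson(item, rest_total))
    alpha_if_deleted.append(cronbach_alpha(others))

for i, (r, a) in enumerate(zip(item_total, alpha_if_deleted), start=1):
    print(f"item {i}: item-total r = {r:.2f}, alpha if deleted = {a:.2f}")
```

In this made-up example the noisy item has a near-zero correlation with the rest of the scale, and deleting it raises alpha, which is the pattern the text describes for items 5 and 6.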
Step 4: Returning to Step 1. After deleting all items that are not consistent with the scale, we may not
be left with enough items to make up an overall reliable scale (remember that, the fewer items, the
less reliable the scale). In practice, one often goes through several rounds of generating items and
eliminating items, until one arrives at a final set that makes up a reliable scale.
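
The trade-off between number of items and reliability can be estimated with the Spearman-Brown prophecy formula, a standard result that is not spelled out in the text above and that assumes any added items are comparable in quality to the existing ones. The sketch uses the alpha of .794313 reported for the 10-item example; the target of .90 is an arbitrary choice.

```python
import math


def projected_reliability(r, factor):
    """Reliability after changing scale length by `factor` (2.0 = twice as many items)."""
    return factor * r / (1 + (factor - 1) * r)


def length_factor_needed(r, target):
    """Factor by which the scale must be lengthened to reach `target` reliability."""
    return target * (1 - r) / (r * (1 - target))


CURRENT_ALPHA = 0.794313  # from the 10-item summary above
CURRENT_ITEMS = 10
TARGET = 0.90             # assumed target, for illustration only

factor = length_factor_needed(CURRENT_ALPHA, TARGET)
items_needed = math.ceil(CURRENT_ITEMS * factor)
print(f"lengthen by {factor:.2f}x -> about {items_needed} items")
```

The same function shows the cost of deleting items: a factor below 1 gives the expected reliability of a shortened scale.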
Tetrachoric correlations. In educational and psychological testing, it is common to use yes/no type items,
that is, to prompt the respondent to answer either yes or no to a question. An alternative to the
regular correlation coefficient in that case is the so-called tetrachoric correlation coefficient. Usually,
the tetrachoric correlation coefficient is larger than the standard correlation coefficient; therefore,
Nunnally (1970, p. 102) discourages the use of this coefficient for estimating reliabilities. However, it
is a widely used statistic (e.g., in mathematical modeling).

Test Item Analysis Using Microsoft Excel Spreadsheet Program
by Chris Elvin
This article is written for teachers and researchers whose budgets are limited
and who do not have access to purposely designed item analysis software such
as Iteman (2003). It describes how to organize a computer spreadsheet such
as Microsoft Excel in order to obtain statistical information about a test and
the students who took it.
Using a fictitious example for clarity, and also a real example of a personally
written University placement test, I will show how the information in a
spreadsheet can be used to refine test items and make judicious placement
decisions. Included is the web address for accessing the sample Excel files for
the class of fictitious students (Elvin, 2003a, 2003b).
I had been teaching in high schools in Japan for many years, and upon
receiving my first University appointment, I was eager to make a good
impression. My first task was to prepare a norm-referenced placement test to
separate approximately one hundred first year medical students into ten
relative levels of proficiency and place them into appropriate classes. This
would allow teachers to determine appropriate curricular goals and adjust the
teaching methodology based more closely on students' personal needs. It was
also hoped that a more congenial classroom atmosphere, with less frustration
or boredom, would enhance motivation and engender a true learning
The course was called Oral English Communication, so the placement test
needed to measure this construct. However, since time restricted us to no
more than half an hour for administering the test, a spoken component for the
test was ruled out. It had to be listening only, and in order to ensure reliability,
the questions had to be as many as possible. I decided I could only achieve this
by having as many rapid-fire questions as possible within the time constraint.
In order for the test to be valid, I focused on points that one might expect to
cover in an oral English course for "clever" first year university students. It
was not possible to meet the students beforehand, so I estimated their level
based on my experience of teaching advanced-level senior high school
Organizing The Spreadsheet Part A
To show briefly how I compiled my students' test scores, I have provided here
the results of a fabricated ten-item test taken by nine fictitious students. (see
Table 1; to download a copy of this file, see Elvin, 2003a.) The purpose of this
section of the spreadsheet is primarily to determine what proportion of the
students answered each item, how many answered correctly, and how efficient
the distractors were. It also helps the instructor prepare for item
discrimination analysis in a separate part of the spreadsheet.
Table 1. Fabricated 10-Item Test - Part A: Actual letter choices

1   ID      ITEM NUMBER  1     2     3     4     5     6     7     8     9     10
2   200201  Arisa        D     A     A     B     C     D     A     A     B     D
3   200202  Kana         A     C     D     A     D     C     B     C     A     A
4   200203  Saki         D     B     D     B     C     D     D     A     B     A
5   200204  Tomomi       A     B     B     A     C     C     C     A     D     D
6   200205  Natsumi      C     B     A     B     C     D     B     C     C     D
7   200206  Haruka       C     B     A     B     C     D     A     A     B     C
8   200207  Momo
9   200208  Yuka         B     D     B     C     C     D     D     B     D     B
10  200209  Rie          C     B     A     B     C     B     A     A     B     C

            A            0.22  0.11  0.44  0.22  0.00  0.00  0.44  0.67  0.11  0.22
            B            0.11  0.56  0.33  0.56  0.00  0.22  0.22  0.11  0.44  0.11
            C            0.33  0.22  0.00  0.11  0.78  0.22  0.11  0.22  0.22  0.22
            D            0.22  0.11  0.22  0.11  0.22  0.56  0.22  0.00  0.22  0.44
            TOTAL        0.89  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00

What proportion of students answered the question?

It may be expected that for a multiple-choice test, all of the students would
answer all of the questions. In the real world, this is rarely true. The quality of
a test item may be poor, or there may be psychological, environmental, or
administrative factors to take into consideration. To try to identify these
potential sources of measurement error, I calculate the ratio of students
answering each question to students taking the test.
In cell C16 of the Excel file spreadsheet, the formula bar reads
=SUM(C12:C15), which adds the proportion of students answering A, B, C and
D respectively. One student didn't answer question 1 (cell C8), so the total
for this item is eight out of nine, which is 0.89. Perhaps she was feeling a little
nervous, or she couldn't hear well because of low speaker volume or noise in
the classroom. The point is, if it is possible to determine what was responsible
for a student or students not answering, then it may also be possible to rectify
it. In some cases, a breakdown of questions on a spreadsheet can contribute to
the discovery of such problems.

What proportion of students answered the question correctly?

For question 1, the correct answer is C, as shown in cell C11. The proportion
of students who chose C is shown in cell C14. To calculate this value, we use
the COUNTIF function. In cell C14, the formula bar reads
=COUNTIF(C2:C10,"C")/9, which means that any cell from C2 to C10 which
has the answer C is counted, and then divided by the number of test takers,
which is nine for this test. This value is also the item facility for the question,
which will be discussed in more detail later in this paper.
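
For readers without Excel, the same two calculations for question 1 can be sketched in Python. The responses are the item 1 column of Table 1 in sheet order, with None standing for the one unanswered cell (C8); the variable names are illustrative only.

```python
from collections import Counter

# Item 1 responses, rows 2-10 of the sheet; None marks the blank cell C8.
item1 = ["D", "A", "D", "A", "C", "C", None, "B", "C"]
KEY = "C"               # correct answer for item 1 (cell C11)
n_takers = len(item1)   # nine test takers

counts = Counter(r for r in item1 if r is not None)

# Equivalent of =COUNTIF(C2:C10,"A")/9 and so on for each option.
proportions = {option: counts[option] / n_takers for option in "ABCD"}

answered = sum(proportions.values())  # equivalent of =SUM(C12:C15)
item_facility = proportions[KEY]      # proportion choosing the keyed answer

print({k: round(v, 2) for k, v in proportions.items()})
print(round(answered, 2), round(item_facility, 2))
```

Dividing by the number of test takers rather than the number of responses is what lets the column total fall below 1.00 when someone skips the question.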
How efficient were the distractors?

We use the same function, COUNTIF, for finding the proportion of students
who answered incorrectly. For item 5, cell G12 reads
=COUNTIF(G2:G10,"A")/9 in the formula bar. The two other distractors are
B, which is shown in cell G13 (=COUNTIF(G2:G10,"B")/9) and D, which
is shown in cell G15 (=COUNTIF(G2:G10,"D")/9). For this question, seven
students answered correctly (answer C), and two students answered
incorrectly by choosing D. If this test were real, and had many more test
takers, I would want to find out why A and B were not chosen, and I would
consider rewriting this question to make all three distractors equally
Preparing for item discrimination analysis in a separate part of the spreadsheet
Part A of the spreadsheet shows the letter choices students made in answering
each question. In Part B of the spreadsheet, I score and rank the students, and
analyze the test and test items numerically.
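
Returning briefly to the distractor check for item 5, a Python sketch of the same COUNTIF logic is given below. The response list follows the text's counts (seven students chose C and two chose D); the helper that flags unchosen distractors is an illustrative addition, not part of the spreadsheet.

```python
from collections import Counter

# Item 5 responses, rows 2-10: seven C (the key) and two D, as described in the text.
item5 = ["C", "D", "C", "C", "C", "C", "D", "C", "C"]
KEY = "C"
OPTIONS = "ABCD"

counts = Counter(item5)
proportions = {option: counts[option] / len(item5) for option in OPTIONS}

# Distractors nobody chose are candidates for rewriting.
unused = [option for option in OPTIONS if option != KEY and counts[option] == 0]

print({k: round(v, 2) for k, v in proportions.items()})
print("unused distractors:", unused)
```

With only nine test takers such a flag is weak evidence; it becomes more informative with a larger group.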
Organizing The Spreadsheet Part B
The area C22:L30 in part B of the spreadsheet (see Table 2; also Elvin, 2003a)
refers, through absolute cell references, to C2:L10 in part A of the spreadsheet in
Table 1. This means that even after sorting the students by total score in part B
of the spreadsheet, the new positions of the ranked students will still refer to
their actual letter choices in part A of the spreadsheet.
Absolute cell references, unlike relative cell references, however, do not adjust when they are
copied and pasted, so each one has to be typed in manually. It is therefore much
quicker to make a linked worksheet with copied and pasted relative cell
references. The linked spreadsheet can then be sorted without fear of
automatic recalculation, as would happen if working within the same
spreadsheet using relative references. For the actual test, I used a linked
spreadsheet. If you would like to see a linked file, a copy of one is available for
download from my website (see Elvin, 2003b).
The purposes of part B of the spreadsheet are to
a. convert students' multiple-choice options to numerals
b. calculate students' total scores
c. sort students by total score
d. compute item facility and item discrimination values
e. calculate the average score and standard deviation of the test
f. determine the test's reliability
g. estimate the standard error of measurement of the test
Table 2. Fabricated 10-Item Test - Part B: Scoring and Ranking of Students

21  ID      ITEM NUMBER  1     2     3     4     5     6     7     8     9     10    TOTAL
22  200206  Haruka       1     1     1     1     1     1     1     1     1     1     10
23  200209  Rie          1     1     1     1     1     0     1     1     1     1     9
24  200201  Arisa        0     0     1     1     1     1     1     1     1     0     7
25  200203  Saki         0     1     0     1     1     1     0     1     1     0     6
26  200205  Natsumi      1     1     1     1     1     1     0     0     0     0     6
27  200204  Tomomi       0     1     0     0     1     0     0     1     0     0     3
28  200207  Momo         0     0     0     0     0     0     1     1     0     0     2
29  200208  Yuka         0     0     0     0     1     1     0     0     0     0     2
30  200202  Kana         0     0     0     0     0     0     0     0     0     0     0

            IF total     0.33  0.56  0.44  0.56  0.78  0.56  0.44  0.67  0.44  0.22  Reliability  0.87
            IF upper     0.67  0.67  1.00  1.00  1.00  0.67  1.00  1.00  1.00  0.67  Average      5.00
            IF lower     0.00  0.00  0.00  0.00  0.33  0.33  0.33  0.33  0.00  0.00  SD           3.43
            ID           0.67  0.67  1.00  1.00  0.67  0.33  0.67  0.67  1.00  0.67  SEM          1.21

a) Converting students' multiple-choice options to numerals
Cell C22, in this previously sorted part of the spreadsheet, reads
=IF($C$7="C",1,0) in the formula bar. This means that Haruka has
answered C for item 1 in cell C7 of part A of the spreadsheet, so she will score
one point. If there is anything else in cell C7, she will score zero. (The dollar
signs before C and 7 indicate absolute reference.)
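
The same scoring rule can be sketched in Python. One assumption is flagged here: the full answer key is taken from Haruka's responses in Table 1, on the grounds that she scored 10 out of 10 in Table 2. Arisa's row is used as the worked example.

```python
# Assumed answer key for items 1-10: Haruka's responses (she scored 10/10 in Table 2).
KEY = list("CBABCDAABC")

# Arisa's letter choices, row 2 of Table 1.
arisa = list("DAABCDAABD")


def score(responses, key):
    """1 where the response matches the key, otherwise 0 (wrong or blank).

    This mirrors the spreadsheet formula =IF(cell="C",1,0) applied per item.
    """
    return [1 if r == k else 0 for r, k in zip(responses, key)]


arisa_scores = score(arisa, KEY)
arisa_total = sum(arisa_scores)  # equivalent of =SUM(C24:L24) for her row
print(arisa_scores, arisa_total)
```

The result matches Arisa's row and total in Table 2.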
b) Calculating total scores
Cell M22 reads =SUM(C22:L22). This calculates one student's total score by
adding up her ones and zeros for all the items on the test from C22 to L22.
c) Sorting students by total scores
The area A22:M30 is selected. Sort is then chosen from the data menu in the
menu bar, which brings up a pop-up menu and a choice of two radio
buttons. Column M is selected from the pop-up menu, and
the descending radio button is clicked. Finally, the OK button is selected. This
sorts the students by test score from highest to lowest.
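
Outside Excel, the same descending sort is a one-liner. The sketch below uses the totals from Table 2, listed in the original sheet order of Table 1; Python's sort is stable, so students with equal totals keep their original relative order.

```python
# Total scores from Table 2, keyed by name, in the sheet order of Table 1.
totals = {
    "Arisa": 7, "Kana": 0, "Saki": 6, "Tomomi": 3, "Natsumi": 6,
    "Haruka": 10, "Momo": 2, "Yuka": 2, "Rie": 9,
}

# Sort by total score, highest first (the "descending" radio button).
ranked = sorted(totals.items(), key=lambda pair: pair[1], reverse=True)
ranked_names = [name for name, _ in ranked]
print(ranked_names)
```

The resulting order is the row order shown in Table 2.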
d) Computing item facility and item discrimination values
Item facility (IF) refers to the proportion of students who answered the
question correctly. In part A of the spreadsheet, we calculated the IF using the
COUNTIF function for the letter corresponding to the correct answer. With
these letter answers now converted numerically, we can also calculate the IF
using the SUM function. For example, in cell C31, the formula bar reads
=SUM(C22:C30)/9, which gives us the IF for item 1 by adding all the ones
and dividing by the number of test-takers.
The item discrimination (ID) is usually the difference between the IF for the
top third of test takers and the IF for the bottom third of test takers for each
item on a test (some prefer to use the top and bottom quarters). The IF for the
top third is given in cell C32 and reads =SUM(C22:C24)/3. Similarly, the IF
for the bottom third of test takers is given in cell C33, and reads
=SUM(C28:C30)/3. The difference between these two scores, shown in cell
C34 (=C32-C33), gives us the ID. This value is useful in norm-referenced tests
such as placement tests because it is an indication of how well the test-takers
are being purposefully spread for each item of the test.
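
The IF and ID rows of Table 2 can be reproduced with a short Python sketch. The 0/1 matrix is copied from rows 22 to 30 of Table 2 (already sorted from highest to lowest total); the top and bottom thirds are three students each.

```python
# Scored responses, rows 22-30 of Table 2, sorted by total score (high to low).
scores = [
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],  # Haruka
    [1, 1, 1, 1, 1, 0, 1, 1, 1, 1],  # Rie
    [0, 0, 1, 1, 1, 1, 1, 1, 1, 0],  # Arisa
    [0, 1, 0, 1, 1, 1, 0, 1, 1, 0],  # Saki
    [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],  # Natsumi
    [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],  # Tomomi
    [0, 0, 0, 0, 0, 0, 1, 1, 0, 0],  # Momo
    [0, 0, 0, 0, 1, 1, 0, 0, 0, 0],  # Yuka
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # Kana
]


def facility(rows):
    """Proportion of the given students answering each item correctly."""
    return [sum(column) / len(rows) for column in zip(*rows)]


third = len(scores) // 3
if_total = facility(scores)            # like =SUM(C22:C30)/9
if_upper = facility(scores[:third])    # like =SUM(C22:C24)/3
if_lower = facility(scores[-third:])   # like =SUM(C28:C30)/3
item_discrimination = [u - l for u, l in zip(if_upper, if_lower)]

print([round(v, 2) for v in if_total])
print([round(v, 2) for v in item_discrimination])
```

Because the upper and lower groups depend on the ranking, the matrix has to be sorted by total score before the thirds are sliced off.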
e) Calculating the average score and standard deviation of the test
The Excel program has functions for average score and standard deviation, so
they are both easy to calculate. Cell N32 reads =AVERAGE(M22:M30) in the
formula bar, which gives us the average score. The standard deviation is
shown in cell N33 and reads =STDEV(M22:M30) in the formula bar.
f) Determining the test's reliability
I use the Kuder-Richardson 21 formula for calculating reliability because it is
easy to compute, relying only on the number of test items, and the average and
variance of the test scores.
The formula is

$$\text{KR-21} = \frac{n}{n-1}\left[1 - \frac{\bar{X} - \bar{X}^2/n}{S^2}\right],$$

where $n$ is the number of test items, $\bar{X}$ is the average score, and $S$
the standard deviation. (See Hatch and Lazaraton, 1991, p. 538 for information
on the Kuder-Richardson 21 formula.)
In cell N31, the formula bar reads =(10/9)*(1-(N32-N32*N32/10)/(N33*N33)),
which will give us a conservative estimate of the test's reliability, compared
to the more accurate but more complex KR-20 formula.
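As a cross-check on the spreadsheet formula, KR-21 can be sketched in a few lines of Python. This is only an illustration: the total scores below are hypothetical, and the sample variance is used on the assumption that it should match Excel's STDEV, which divides by n - 1.

```python
import statistics


def kr21(total_scores, n_items):
    """Kuder-Richardson 21 from total scores and the number of items."""
    mean = statistics.mean(total_scores)
    # Sample variance (n - 1 denominator), i.e. STDEV squared in Excel.
    variance = statistics.variance(total_scores)
    return (n_items / (n_items - 1)) * (1 - (mean - mean ** 2 / n_items) / variance)


# Hypothetical total scores of nine students on a 10-item test.
scores = [9, 8, 8, 7, 6, 5, 4, 3, 2]
reliability = kr21(scores, n_items=10)
print(round(reliability, 3))
```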
g) Estimating the standard error of measurement of the test
The true-score model, which was proposed by Spearman (1904), states that an
individual's observed test score is made up of two components, a true
component and an error component. The standard error of measurement,
according to Dudek (1979), is an estimate of the standard deviation expected
for observed scores when the true score is held constant. It is therefore an
indication of how much a student's observed test score would be expected to
fluctuate either side of her true test score because of extraneous
circumstances. This uncertainty means that, for a student whose observed
score is within one SEM of a cut-off point, it is not possible to say for sure
on which side of that cut-off point the true score lies. However,
since the placement of students into streamed classes within our university is
not likely to affect the students' lives critically, I calculate SEM not so much to
determine the borderline students, who in some circumstances may need
further deliberation, but more to give myself a concrete indication of how
confident I am that the process of streaming is being done equitably.
To calculate the SEM, we type =SQRT(1-N31)*N33 in the formula bar for cell
N34, which gives us a value of 1.21. We can therefore say that
students' true scores will normally be within 1.21 points of
their observed scores.
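The SEM formula and the cut-off reasoning above can be sketched as follows. The reliability, standard deviation and cut-off used here are hypothetical inputs chosen for easy arithmetic; they are not the values behind the 1.21 reported in the text.

```python
import math


def standard_error_of_measurement(reliability, sd):
    """SEM = SD * sqrt(1 - reliability), like =SQRT(1-N31)*N33."""
    return math.sqrt(1 - reliability) * sd


def within_one_sem(observed, cutoff, sem):
    """True when an observed score lies within one SEM of the cut-off."""
    return abs(observed - cutoff) <= sem


# Hypothetical inputs (assumed for illustration only).
sem = standard_error_of_measurement(reliability=0.75, sd=2.0)
print(sem)
print(within_one_sem(6, 6.5, sem))  # borderline: placement is uncertain
print(within_one_sem(9, 6.5, sem))  # clearly above the cut-off
```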
The 2002 Placement Test
A 50-item placement test was administered to 102 first year medical students
in April, 2002. To the teachers and students present, it may have appeared to
be a typical test, in a pretty booklet, with a nice font face, and the name of the
college in bold. But this face value was its only redeeming feature. After
statistical analysis, it was clear that its inherent weakness was that it was
unacceptably difficult and therefore wholly inappropriate. If the degree to
which a test is effective in spreading students out is directly related to the
degree to which that test fits the ability levels of the students (Brown, 1996),
then my placement test was ineffective because I had greatly overestimated
the students' level based on the naive assumption that they would be similar to
students in my high school teaching experience.
It had a very low average score, not much higher than guesswork, and such
low reliability, and therefore such a large SEM, that many students
could not be definitively placed. In short, I was resigned to the fact that I'd be
teaching mixed ability classes for the next two semesters.
Pilot Procedures for the 2003 Placement Test
Statistical analysis of the 2002 test meant that I had to discard almost all of
the items. The good news was that at least I now had the opportunity to pilot
some questions with my new students. I discovered that nearly all of them
could read and write well, and many had impressive vocabularies. Most had
been taught English almost entirely in Japanese, however, and very few of
them had had much opportunity to practice English orally. Fewer still had had
contact with a native-English speaker on a regular basis.
According to Brown (1996), ideal items of a norm-referenced language test
should have an average IF of 0.50, and be in the range of 0.3 to 0.7 to be
considered acceptable. Ebel's guidelines (1979) for item discrimination
consider an ID of greater than 0.2 to be satisfactory. These are the criteria I
generally abide by when piloting test questions, after, of course, first
confirming that these items are valid and also devoid of redundancy.
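The screening rule described above can be sketched as a small filter. The pilot items and their IF and ID values below are hypothetical, and treating the 0.3 to 0.7 range as inclusive at both ends is an assumption, since the text does not say how boundary values are handled.

```python
def meets_criteria(if_value, id_value, if_range=(0.3, 0.7), min_id=0.2):
    """IF within the acceptable range and ID greater than the minimum."""
    low, high = if_range
    return low <= if_value <= high and id_value > min_id


# Hypothetical pilot items: name -> (IF, ID).
pilot_items = {
    "Q1": (0.50, 0.45),  # acceptable on both criteria
    "Q2": (0.85, 0.30),  # too easy
    "Q3": (0.40, 0.10),  # discriminates poorly
    "Q4": (0.25, 0.35),  # too difficult
}

accepted = [name for name, (f, d) in pilot_items.items() if meets_criteria(f, d)]
print(accepted)
```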
A Comparison Of The 2002 And 2003 Placement Tests
Table 3: A Comparison of the 2002 and 2003 placement tests

Year  Reliability  Average  SD    SEM   IF<0.3  0.3<=IF<0.7  IF>0.7  ID>0.2
2002  0.57         16.09    4.95  3.26  27      23           0       2
2003  0.74         24.8     6.71  3.44  3       40           7       38
A statistical analysis of the 2003 50-item test showed a great improvement
compared to the previous year (see Table 3), with just ten items now falling
outside the criteria guidelines. The average score of the 2003 test was very
close to the ideal, but the reliability was still not as good as it should have
been. Despite this, we were still able to identify and make provision for the
highest and lowest scoring students, and feedback from all classes, thus far,
has generally been very positive.
I plan to extend my database of acceptable test items to employ in developing
the test for 2004. The reliability should improve once the bad items are
replaced with acceptable ones, and distractor efficiency analysis may help to
pinpoint which acceptable items can be modified further. My main concern,
however, is the very small standard deviation. If it remains stubbornly small,
we may have to conclude that our students are simply too homogeneous to be
streamed effectively, and that may ultimately force us to reconsider
establishing mixed-ability classes.
Brown, J.D. (1996). Testing in language programs. Upper Saddle River, NJ:
Prentice Hall.
Dudek, F.J. (1979). The continuing misinterpretation of the standard error of
measurement. Psychological Bulletin, 86, 335-337.
Ebel, R.L. (1979). Essentials of educational measurement (3rd ed.).
Englewood Cliffs, NJ: Prentice Hall.
Elvin, C. (2003a). Elvinsdata.xls [Online]. Retrieved September 26, 2003,
Elvin, C. (2003b). Elvinsoutput.xls [Online]. Retrieved September 26, 2003,
from <>.
Hatch, E. & Lazaraton, A. (1991). The research manual: Design and statistics
for applied linguists. Boston, MA: Heinle & Heinle.
Iteman (Version 3.6). (1997). [Computer software]. St. Paul, MN: Assessment
Systems Corporation.
Spearman, C. (1904). General intelligence, objectively determined and
measured. American Journal of Psychology, 15, 201-293.
Chris Elvin has a Master's degree in education from Temple University,
Japan. He is the current programs chair of the JALT Materials Writers
special interest group, and former editor of The School House, the JALT
junior and senior high school SIG newsletter. He is the author of Now You're
Talking, an oral communication coursebook published by EFL Press, and the
owner and webmaster of, an English language learning
website dedicated to young learners. He currently teaches at Tokyo Women's
Medical University, Soka University, Caritas Gakuen and St. Dominic's
Institute. His research interests include materials writing, classroom
language acquisition and learner autonomy.

ITEM ANALYSIS Technique to improve test items and instruction
TEST DEVELOPMENT PROCESS
1. Review National and Professional Standards
2. Convene National Advisory Committee
3. Develop Domain, Knowledge and Skills Statements
4. Conduct Needs Analysis
5. Construct Table of Specifications
6. Develop Test Design
7. Develop New Test Questions
8. Review Test Questions
9. Assemble Operational Test Forms
10. Produce Printed Tests Mat.
11. Administer Tests
12. Conduct Item Analysis
13. Standard Setting Study
14. Set Passing Standard
again in later tests

SEVERAL PURPOSES
1. More diagnostic information on students
- Classroom level:
  - determine questions most students found very difficult or were guessing on --> reteach that concept
  - questions all got right --> don't waste more time on this area
  - find wrong answers students are choosing --> identify common misconceptions
- Individual level: isolate specific errors the students made
2. Build future tests, revise test items to make them better
- know how much work goes into writing good questions
- SHOULD NOT REUSE WHOLE TESTS --> diagnostic teaching means responding to the needs of
  students, so after a few years a test bank is built up and you can choose test items for the class
- can spread difficulty levels across your blueprint (TOS)
3. Part of continuing professional development
- doing occasional item analysis will help you become a better test writer
- documenting just how good your evaluation is
- useful for dealing with parents or administrators if there's ever a dispute
- once you start bringing out all these impressive looking statistics, parents and
  administrators will believe why some students failed.
CLASSICAL ITEM ANALYSIS STATISTICS
- Reliability (test level statistic)
- Difficulty (item level statistic)
- Discrimination (item level statistic)
TEST LEVEL STATISTIC
Quality of the Test: Reliability and Validity
- Reliability: consistency of measurement
- Validity: truthfulness of response
- Overall Test Quality
- Individual Item Quality
RELIABILITY refers to the extent to which the test is likely to produce consistent scores.
Characteristics:
1. The intercorrelations among the items: the greater/stronger the relative number of positive
relationships, the greater the reliability.
2. The length of the test: a test with more items will have a higher reliability, all other things
being equal.
3. The content of the test: generally, the more diverse the subject matter tested and the testing
techniques used, the lower the reliability.
4. Heterogeneous groups of test takers
TYPES OF RELIABILITY
Stability
1. Test-Retest
2. Inter-rater / Observer / Scorer: applicable mostly to essay questions. Use Cohen's Kappa
statistic.
Equivalence
3. Parallel-Forms / Equivalent: used to assess the consistency of the results of two tests
constructed in the same way from the same content domain.
Internal Consistency: used to assess the consistency of results across items within a test.
4. Split
5. Kuder-Richardson Formula 20 / 21: correlation is determined from a single administration of a
test through a study of score variances.
6. Cronbach's Alpha (α)
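To make the internal-consistency idea concrete, Cronbach's alpha can be sketched directly from a matrix of item scores. The response matrix below is hypothetical. For items scored 0/1, the same calculation gives KR-20, provided one variance convention is used throughout (population variance is assumed here).

```python
import statistics


def cronbach_alpha(rows):
    """Cronbach's alpha; rows are respondents, columns are items."""
    k = len(rows[0])
    item_variances = [statistics.pvariance([r[j] for r in rows]) for j in range(k)]
    total_variance = statistics.pvariance([sum(r) for r in rows])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical 0/1 responses: 5 respondents x 4 items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

alpha = cronbach_alpha(responses)
print(round(alpha, 2))
```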
Reliability Indices and Interpretation
- .91 and above: Excellent reliability; at the level of the best standardized tests.
- .81 - .90: Very good for a classroom test.
- .71 - .80: Good for a classroom test; in the range of most. There are probably a few items
  which could be improved.
- .61 - .70: Somewhat low. This test needs to be supplemented by other measures (e.g., more
  tests) to determine grades. There are probably some items which could be improved.
- .51 - .60: Suggests need for revision of test, unless it is quite short (ten or fewer items). The
  test definitely needs to be supplemented by other measures (e.g., more tests) for grading.
- .50 or below: Questionable reliability. This test should not contribute heavily to the course
  grade, and it needs revision.
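The bands above can be applied mechanically with a small lookup. In this sketch the labels are shortened versions of the interpretations listed above, and rounding the coefficient to two decimals before comparing is an assumption made so that values such as .905 fall into a band. The two sample coefficients are the 2002 and 2003 reliabilities from Table 3.

```python
def interpret_reliability(coefficient):
    """Map a reliability coefficient to the interpretation bands above."""
    r = round(coefficient, 2)  # assumed: bands are defined to two decimals
    if r >= 0.91:
        return "Excellent reliability"
    if r >= 0.81:
        return "Very good for a classroom test"
    if r >= 0.71:
        return "Good for a classroom test"
    if r >= 0.61:
        return "Somewhat low"
    if r >= 0.51:
        return "Suggests need for revision of test"
    return "Questionable reliability"


print(interpret_reliability(0.57))
print(interpret_reliability(0.74))
```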

item "functions How valid the item is based on the total test score criterion
WHAT IS A WELL-FUNCTIONING TEST ITEM?
- How many students got it correct? (DIFFICULTY)
- Which students got it correct? (DISCRIMINATION)
- Item difficulty: measures whether an item was too easy or too hard.
- Item discrimination: measures whether an item discriminated between students who knew the
  material well and students who did not.
- Effectiveness of alternatives: determines whether distracters (incorrect but plausible answers)
  tend to be marked by the less able students and not by the more able students.
ITEM DIFFICULTY
Item difficulty is simply the percentage of students who answer an item correctly. In
this case, it is also equal to the item mean.

$$\text{Diff} = \frac{\text{number of students choosing correctly}}{\text{total number of students}}$$

The item difficulty index ranges from 0 to 100; the higher the value, the easier the question.
ITEM DIFFICULTY LEVEL: DEFINITION
The percentage of students who answered the item correctly.
- High (Difficult): <= 30%
- Medium (Moderate): > 30% AND < 80%
- Low (Easy): >= 80%
ITEM DIFFICULTY LEVEL: SAMPLE
Number of students who answered each item = 50

Item No.  No. Correct Answers  % Correct  Difficulty Level
1         15                   30         High
2         25                   50         Medium
3         35                   70         Medium
4         45                   90         Low
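The sample table can be reproduced with a short calculation. This sketch uses the counts from the sample above and the thresholds from the definition; the function names are illustrative only.

```python
def difficulty_percent(n_correct, n_students):
    """Percentage of students who answered the item correctly."""
    return 100 * n_correct / n_students


def difficulty_level(percent):
    """High (difficult) <= 30%, Low (easy) >= 80%, otherwise Medium."""
    if percent <= 30:
        return "High"
    if percent < 80:
        return "Medium"
    return "Low"


n_students = 50
correct_counts = {1: 15, 2: 25, 3: 35, 4: 45}  # item number -> number correct

levels = {
    item: difficulty_level(difficulty_percent(count, n_students))
    for item, count in correct_counts.items()
}
print(levels)
```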
ITEM DIFFICULTY LEVEL: QUESTIONS/DISCUSSION Is a test that nobody failed too easy? Is a test on
which nobody got 100% too difficult? Should items that are too easy or too difficult be thrown
ITEM DISCRIMINATION
Traditionally computed using high and low scoring groups (upper 27% and lower 27%).
Computerized analyses provide a more accurate assessment of the discrimination power of items
since they account for all responses rather than just the high and low scoring groups. This is
equivalent to the point-biserial correlation. It provides an estimate of the degree to which an
individual item is measuring the same thing as the rest of the items.
WHAT IS ITEM DISCRIMINATION? Generally, students who did well on the exam should select the
correct answer to any given item on the exam. The Discrimination Index distinguishes for each item
between the performance of students who did well on the exam and students who did poorly.
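Both approaches to discrimination mentioned above, the traditional upper/lower 27% comparison and the point-biserial correlation, can be sketched as follows. The item responses and total scores are hypothetical, and the helper names are illustrative rather than taken from any statistics package.

```python
import math


def upper_lower_index(item, totals, fraction=0.27):
    """Proportion correct in the top group minus that in the bottom group."""
    order = sorted(range(len(item)), key=lambda i: totals[i], reverse=True)
    k = max(1, round(len(item) * fraction))
    upper = [item[i] for i in order[:k]]
    lower = [item[i] for i in order[-k:]]
    return sum(upper) / k - sum(lower) / k


def point_biserial(item, totals):
    """Pearson correlation between a 0/1 item and the total score."""
    n = len(item)
    mean_i = sum(item) / n
    mean_t = sum(totals) / n
    cov = sum((x - mean_i) * (t - mean_t) for x, t in zip(item, totals))
    ss_i = sum((x - mean_i) ** 2 for x in item)
    ss_t = sum((t - mean_t) ** 2 for t in totals)
    return cov / math.sqrt(ss_i * ss_t)


# Hypothetical data for ten students: total scores and one item's 0/1 scores.
totals = [48, 45, 41, 38, 35, 30, 27, 22, 18, 12]
item = [1, 1, 1, 1, 0, 1, 0, 0, 1, 0]

d_index = upper_lower_index(item, totals)
r_pb = point_biserial(item, totals)
print(round(d_index, 2), round(r_pb, 2))
```

A negative value from either function would mean that weaker students answered the item correctly more often than stronger students.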
Index          Level           Decision
0.86 above     Very Easy       To be discarded
0.71 - 0.85    Easy            To be revised
0.30 - 0.70    Moderate        Very Good items
0.15 - 0.29    Difficult       To be revised
0.14 below     Very Difficult  To be discarded
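The decision table above lends itself to a simple lookup. This sketch assumes that the index is a proportion between 0 and 1 and that values are rounded to two decimals before being compared with the band limits; both points are assumptions, since the slide does not state them.

```python
def evaluate_item(index):
    """Return (level, decision) for an index value using the bands above."""
    p = round(index, 2)  # assumed: bands are defined to two decimals
    if p >= 0.86:
        return ("Very Easy", "To be discarded")
    if p >= 0.71:
        return ("Easy", "To be revised")
    if p >= 0.30:
        return ("Moderate", "Very Good item")
    if p >= 0.15:
        return ("Difficult", "To be revised")
    return ("Very Difficult", "To be discarded")


for value in (0.90, 0.75, 0.50, 0.20, 0.10):
    print(value, evaluate_item(value))
```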
ITEM DISCRIMINATION: QUESTIONS / DISCUSSION What factors could contribute to low item
discrimination between the two groups of students? What is a likely cause for a negative
discrimination index?
SAMPLE TOS

           Remember          Understand  Apply  Total
Section A  4 (1,3,7,9)       6           10     20
Section B  5 (2,5,8,11,15)   5           4      14
Section C  3 (6,17,21)       7           6      16
Total      12                18          20     50
STEPS IN ITEM ANALYSIS
1. Code the test items:
- 1 for correct and 0 for incorrect
- Vertical columns (item numbers)
- Horizontal rows (respondents/students)

                            TEST ITEMS
No.   1  2  3  4  5  6  7  8  9 10 11 12 13 14  .  .  .  . 50
1     1  0  1  1  1  0  0  0  0  1  1  1  0  0  0  0  0  1  1
2     1  1  0  1  1  1  0  1  1  1  0  1  1  1  1  0  1  1  1
3     0  0  0  1  0  0  0  1  0  0  0  1  1  1  1  1  1  1  0
4     0  1  0  0  0  1  0  0  0  1  0  0  1  0  0  0  1  0  0
5     1  0  1  1  1  0  1  1  1  0  1  1  0  1  1  1  0  1  0
6     1  1  1  1  1  1  1  1  1  1  1  1  1  0  1  1  1  0  1
7     0  0  1  0  0  0  1  0  0  0  1  0  0  0  1  0  0  0  1
8     1  1  0  1  1  1  0  1  1  1  0  1  1  0  0  0  1  0  0
2. In SPSS: Analyze > Scale > Reliability Analysis (drag/place the variables into the Items box) >
Statistics > Scale if item deleted > OK.
****** Method 1 (space saver) will be used for this analysis ******

R E L I A B I L I T Y   A N A L Y S I S   -   S C A L E (A L P H A)

Item-total Statistics

            Scale Mean  Scale Variance  Corrected     Alpha
            if Item     if Item         Item-Total    if Item
            Deleted     Deleted         Correlation   Deleted

VAR00001    14.4211     127.1053         .9401        .9502
VAR00002    14.6316     136.8440         .7332        .9542
VAR00022    14.4211     129.1410         .7311        .9513
VAR00023    14.4211     127.1053         .4401        .9502
VAR00024    14.6316     136.8440        -.0332        .9542
VAR00047    14.4737     128.6109         .8511        .9508
VAR00048    14.4737     128.8252         .8274        .9509
VAR00049    14.0526     130.6579         .5236        .9525
VAR00050    14.2105     127.8835         .7533        .9511

Reliability Coefficients
N of Cases = 57.0    N of Items = 50
Alpha = .9533
3. In the output window: Alpha is placed at the bottom. The corrected item-total correlation is the
point-biserial correlation, which serves as the basis for the index of test reliability.
4. Count the number of items discarded and fill in the summary item analysis table.
TEST ITEM RELIABILITY ANALYSIS SUMMARY (SAMPLE)
Test: Math (50 items)

Level of Difficulty  Number of Items  %   Item Number
Very Easy            1                2   1
Easy                 2                4   2,5
Moderate             10               20  3,4,10,15
Difficult            30               60  6,7,8,9,11,
Very Difficult       7                14  16,24,32
5. Count the number of items retained based on the cognitive domains in the TOS. Compute the
percentage per level of difficulty.
          Remember     Understand   Apply
          N     Ret    N     Ret    N     Ret
A         4     1      6     3      10    3
B         5     3      5     3      4     2
C         3     2      7     4      6     3
Total     12    6      18    10     20    8
%         50%          56%          40%
Overall   24/50 = 48%
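For readers without SPSS, the item-total statistics used in the steps above (corrected item-total correlation and alpha if item deleted) can be sketched in plain Python. This is an illustration under stated assumptions, not a reimplementation of the SPSS procedure: the response matrix is hypothetical, and the function names are made up for this example.

```python
import math
import statistics


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


def cronbach_alpha(rows):
    """Cronbach's alpha; rows are respondents, columns are items."""
    k = len(rows[0])
    item_variances = [statistics.pvariance([r[j] for r in rows]) for j in range(k)]
    total_variance = statistics.pvariance([sum(r) for r in rows])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


def item_total_statistics(rows):
    """Per item: correlation with the rest of the test, and alpha without it."""
    results = []
    for j in range(len(rows[0])):
        item = [r[j] for r in rows]
        rest_rows = [r[:j] + r[j + 1:] for r in rows]  # drop item j
        rest_totals = [sum(r) for r in rest_rows]
        results.append({
            "item": j + 1,
            "corrected_item_total_r": pearson(item, rest_totals),
            "alpha_if_deleted": cronbach_alpha(rest_rows),
        })
    return results


# Hypothetical 0/1 responses: 5 respondents x 4 items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

stats = item_total_statistics(responses)
for row in stats:
    print(row["item"], round(row["corrected_item_total_r"], 4),
          round(row["alpha_if_deleted"], 4))
```

An item whose corrected correlation is low or negative, or whose removal raises alpha, is a candidate for revision or discarding. Note that pearson() would divide by zero for an item that every respondent answered identically, so such items need to be screened out first.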
Realistically: do item analysis on your most important tests
- end of unit tests, final exams --> summative evaluation
- common exams with other teachers (departmentalized exams)
  - common exams give a bigger sample to work with, which is good
  - this makes sure that questions other teachers prepared are working for your class
ITEM ANALYSIS is one area where even a lot of otherwise very good classroom teachers fall down:
they think they're doing a good job; they think they're doing good evaluation; but without doing
item analysis, they don't really know.
ITEM ANALYSIS is not an end in itself. There is no point in doing it unless you use it to revise
items and to help students on the basis of the information you get out of it.