Susan Matlock-Hetzel

Texas A&M University, January 1997

Abstract

When norm-referenced tests are developed for instructional purposes, to assess the

effects of educational programs, or for educational research purposes, it can be very

important to conduct item and test analyses. These analyses evaluate the quality of the

items and of the test as a whole. Such analyses can also be employed to revise and

improve both items and the test as a whole. However, some best practices in item and

test analysis are too infrequently used in actual practice. The purpose of the present

paper is to summarize the recommendations for item and test analysis practices, as

these are reported in commonly-used measurement textbooks (Crocker & Algina,

1986; Gronlund & Linn, 1990; Pedhazur & Schmelkin, 1991; Sax, 1989; Thorndike,

Cunningham, Thorndike, & Hagen, 1991).

Paper presented at the annual meeting of the Southwest Educational Research

Association, Austin, January, 1997.

Basic Concepts in Item and Test Analysis

Making fair and systematic evaluations of others' performance can be a challenging

task. Judgments cannot be made solely on the basis of intuition, haphazard guessing,

or custom (Sax, 1989). Teachers, employers, and others in evaluative positions use a

variety of tools to assist them in their evaluations. Tests are tools that are frequently

used to facilitate the evaluation process. When norm-referenced tests are developed

for instructional purposes, to assess the effects of educational programs, or for

educational research purposes, it can be very important to conduct item and test

analyses.

Test analysis examines how the test items perform as a set. Item analysis "investigates

the performance of items considered individually either in relation to some external

criterion or in relation to the remaining items on the test" (Thompson & Levitov,

1985, p. 163). These analyses evaluate the quality of items and of the test as a whole.

Such analyses can also be employed to revise and improve both items and the test as a

whole.

However, some best practices in item and test analysis are too infrequently used in

actual practice. The purpose of the present paper is to summarize the

recommendations for item and test analysis practices, as these are reported in

commonly-used measurement textbooks (Crocker & Algina, 1986; Gronlund & Linn,

1990; Pedhazur & Schmelkin, 1991; Sax, 1989; Thorndike, Cunningham, Thorndike,

& Hagen, 1991). The tools reviewed here are item difficulty, item discrimination, and item

distractors.

Item Difficulty

Item difficulty is simply the percentage of students taking the test who answered the

item correctly. The larger the percentage getting an item right, the easier the item. The

higher the difficulty index, the easier the item is understood to be (Wood, 1960). To

compute the item difficulty, divide the number of people answering the item correctly

by the total number of people answering the item. The proportion for the item is usually

denoted as p and is called item difficulty (Crocker & Algina, 1986). An item answered

correctly by 85% of the examinees would have an item difficulty, or p value, of .85,

whereas an item answered correctly by 50% of the examinees would have a lower

item difficulty, or p value, of .50.
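
The computation just described can be sketched in a few lines of Python. This is only an illustration: the function name `item_difficulty` and the response lists are hypothetical, and the sketch assumes each examinee's response is recorded as the letter of the option chosen.

```python
def item_difficulty(responses, key):
    """Return p: the proportion of examinees who chose the keyed answer."""
    if not responses:
        raise ValueError("at least one response is required")
    correct = sum(1 for r in responses if r == key)
    return correct / len(responses)


# Hypothetical data: 20 examinees per item, keyed option "C".
easy_item = ["C"] * 17 + ["A", "B", "D"]
hard_item = ["C"] * 10 + ["A"] * 4 + ["B"] * 3 + ["D"] * 3

print(item_difficulty(easy_item, "C"))  # 0.85
print(item_difficulty(hard_item, "C"))  # 0.5
```

The higher value belongs to the easier item, which matches the convention that a larger p means an easier item.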

A p value is basically a behavioral measure. Rather than defining difficulty in terms of

some intrinsic characteristic of the item, difficulty is defined in terms of the relative

frequency with which those taking the test choose the correct response (Thorndike et

al., 1991). For instance, in the example below, which item is more difficult?

1. Who was Boliver Scagnasty?

2. Who was Martin Luther King?

One cannot determine which item is more difficult simply by reading the questions.

One can recognize the name in the second question more readily than that in the first.

But saying that the first question is more difficult than the second, simply because the

name in the second question is easily recognized, would be to compute the difficulty

of the item using an intrinsic characteristic. This method determines the difficulty of

the item in a much more subjective manner than that of a p value.

Another implication of a p value is that the difficulty is a characteristic of both the

item and the sample taking the test. For example, an English test item that is very

difficult for an elementary student will be very easy for a high school student.

A p value also provides a common measure of the difficulty of test items that measure

completely different domains. It is very difficult to determine whether answering a

history question involves knowledge that is more obscure, complex, or specialized

than that needed to answer a math problem. When p values are used to define

difficulty, it is very simple to determine whether an item on a history test is more

difficult than a specific item on a math test taken by the same group of students.

To make this more concrete, take into consideration the following examples. When

the correct answer is not chosen (p = 0), there are no individual differences in the

"score" on that item. As shown in Table 1, the correct answer C was not chosen by

either the upper group or the lower group. (The upper group and lower group will be

explained later.) The same is true when everyone taking the test chooses the correct

response as is seen in Table 2. An item with a p value of .0 or a p value of 1.0 does

not contribute to measuring individual differences, and thus is almost certain to be

useless. Item difficulty has a profound effect on both the variability of test scores and

the precision with which test scores discriminate among different groups of examinees

(Thorndike et al., 1991). When all of the test items are extremely difficult, the great

majority of the test scores will be very low. When all items are extremely easy, most

test scores will be extremely high. In either case, test scores will show very little

variability. Thus, extreme p values directly restrict the variability of test scores.

Table 1

Minimum Item Difficulty Example Illustrating No Individual Differences

                  Item Response
Group          A     B     C*    D
Upper group    4     5     0     6
Lower group    2     6     0     7

Note. * denotes correct response

Item difficulty: p = (0 + 0)/30 = .00

Discrimination Index: D = (0 - 0)/15 = .00

Table 2

Maximum Item Difficulty Example Illustrating No Individual Differences

                  Item Response
Group          A     B     C*    D
Upper group    0     0    15     0
Lower group    0     0    15     0

Note. * denotes correct response

Item difficulty: p = (15 + 15)/30 = 1.00

Discrimination Index: D = (15 - 15)/15 = .00

In discussing the procedure for determining the minimum and maximum score on a

test, Thompson and Levitov (1985) stated that

items tend to improve test reliability when the percentage of students

who correctly answer the item is halfway between the percentage

expected to correctly answer if pure guessing governed responses and

the percentage (100%) who would correctly answer if everyone knew the

answer. (pp. 164-165)

For example, many teachers may think that the minimum score on a test consisting of

100 items with four alternatives each is 0, when in actuality the theoretical floor on

such a test is 25. This is the score that would be most likely if a student answered

every item by guessing (e.g., without even being given the test booklet containing the

items).

Similarly, the ideal percentage of correct answers on a four-choice multiple-choice

test is not 70-90%. According to Thompson and Levitov (1985), the ideal difficulty

for such an item would be halfway between the percentage of pure guess (25%) and

100%, that is, 25% + (100% - 25%)/2 = 62.5%. Therefore, for a test with 100 items with four

alternatives each, the ideal mean percentage of correct items, for the purpose of

maximizing score reliability, is roughly 63%. Tables 3, 4, and 5 show examples of

items with p values of roughly 63%.
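
The midpoint rule can be written as a one-line calculation. The sketch below generalizes it to any number of alternatives by assuming that pure guessing succeeds with probability 1/k on a k-choice item; that generalization and the function name `ideal_difficulty` are assumptions made for illustration, since the text works only the four-choice case.

```python
def ideal_difficulty(n_options):
    """Midpoint between the chance-level proportion correct and 1.0."""
    if n_options < 2:
        raise ValueError("an item needs at least two alternatives")
    chance = 1 / n_options  # assumed success rate under pure guessing
    return chance + (1 - chance) / 2


print(ideal_difficulty(4))  # 0.625, i.e. roughly 63%
print(ideal_difficulty(2))  # 0.75 for a true-false item
```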

Table 3

Maximum Item Difficulty Example Illustrating Individual Differences

                  Item Response
Group          A     B     C*    D
Upper group    1     0    13     3
Lower group    2     5     5     6

Note. * denotes correct response

Item difficulty: p = (13 + 5)/30 = .60

Discrimination Index: D = (13 - 5)/15 = .53

Table 4

Maximum Item Difficulty Example Illustrating Individual Differences

                  Item Response
Group          A     B     C*    D
Upper group    1     0    11     3
Lower group    2     0     7     6

Note. * denotes correct response

Item difficulty: p = (11 + 7)/30 = .60

Discrimination Index: D = (11 - 7)/15 = .267

Table 5

Maximum Item Difficulty Example Illustrating Individual Differences

                  Item Response
Group          A     B     C*    D
Upper group    1     0     7     3
Lower group    2     0    11     6

Note. * denotes correct response

Item difficulty: p = (7 + 11)/30 = .60

Discrimination Index: D = (7 - 11)/15 = -.267

Item Discrimination

If the test and a single item measure the same thing, one would expect people who do

well on the test to answer that item correctly, and those who do poorly to answer the

item incorrectly. A good item discriminates between those who do well on the test and

those who do poorly. Two indices can be computed to determine the discriminating

power of an item, the item discrimination index, D, and discrimination coefficients.

Item Discrimination Index, D

The method of extreme groups can be applied to compute a very simple measure of

the discriminating power of a test item. If a test is given to a large group of people, the

discriminating power of an item can be measured by comparing the number of people

with high test scores who answered that item correctly with the number of people with

low scores who answered the same item correctly. If a particular item is doing a good

job of discriminating between those who score high and those who score low, more

people in the top-scoring group will have answered the item correctly.

In computing the discrimination index, D, first score each student's test and rank order

the test scores. Next, the 27% of the students at the top and the 27% at the bottom are

separated for the analysis. Wiersma and Jurs (1990) stated that "27% is used because

it has shown that this value will maximize differences in normal distributions while

providing enough cases for analysis" (p. 145). There need to be as many students as

possible in each group to promote stability; at the same time, it is desirable to have the

two groups be as different as possible to make the discriminations clearer. According

to Kelly (as cited in Popham, 1981) the use of 27% maximizes these two

characteristics. Nunnally (1972) suggested using 25%.
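
A minimal Python sketch of this grouping step follows. It assumes total scores are held in a dictionary keyed by a hypothetical student id. The function name `extreme_groups`, the rounding of the group size to the nearest whole student, and the handling of tied scores (ties keep their input order) are illustrative choices, not part of the cited procedure.

```python
def extreme_groups(total_scores, fraction=0.27):
    """Return (upper, lower) lists of examinee ids ranked by total score."""
    if not 0 < fraction <= 0.5:
        raise ValueError("fraction must be in (0, 0.5] so the groups do not overlap")
    ranked = sorted(total_scores, key=total_scores.get, reverse=True)
    n = max(1, round(len(ranked) * fraction))  # assumed rounding rule
    return ranked[:n], ranked[-n:]


# Hypothetical class of 74 students whose scores happen to run from 0 to 73.
scores = {f"s{i:02d}": i for i in range(74)}
upper, lower = extreme_groups(scores)
print(len(upper), len(lower))  # 20 20
```

Passing `fraction=0.25` gives the split suggested by Nunnally (1972).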

The discrimination index, D, is the number of people in the upper group who

answered the item correctly minus the number of people in the lower group who

answered the item correctly, divided by the number of people in the larger of the two

groups. Wood (1960) stated that

when more students in the lower group than in the upper group select the

right answer to an item, the item actually has negative validity.

Assuming that the criterion itself has validity, the item is not only

useless but is actually serving to decrease the validity of the test. (p. 87)

The higher the discrimination index, the better the item because such a value indicates

that the item discriminates in favor of the upper group, which should get more items

correct, as shown in Table 6. An item that everyone gets correct or that everyone gets

incorrect, as shown in Tables 1 and 2, will have a discrimination index equal to zero.

Table 7 illustrates that if more students in the lower group get an item correct than in

the upper group, the item will have a negative D value and is probably flawed.

Table 6

Positive Item Discrimination Index D

                  Item Response
Group          A     B     C*    D
Upper group    3     2    15     0
Lower group   12     3     3     2

Note. * denotes correct response

74 students took the test

27% = 20 (N)

Item difficulty: p = (15 + 3)/40 = .45

Discrimination Index: D = (15 - 3)/20 = .60

Table 7

Negative Item Discrimination Index D

                  Item Response
Group          A     B     C*    D
Upper group    0     0     0     0
Lower group    0     0    15     0

Note. * denotes correct response

Item difficulty: p = (0 + 15)/30 = .50

Discrimination Index: D = (0 - 15)/15 = -1.0
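
The calculations reported under Tables 1, 6, and 7 can be reproduced with a short Python sketch. The function name `discrimination_index` and its argument names are hypothetical; the group sizes are the denominators shown under each table.

```python
def discrimination_index(upper_correct, lower_correct, upper_n, lower_n):
    """D = (upper correct - lower correct) / size of the larger group."""
    return (upper_correct - lower_correct) / max(upper_n, lower_n)


d_table6 = discrimination_index(15, 3, 20, 20)
d_table7 = discrimination_index(0, 15, 15, 15)
d_table1 = discrimination_index(0, 0, 15, 15)

print(d_table6)  # 0.6
print(d_table7)  # -1.0
print(d_table1)  # 0.0
```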

A negative discrimination index is most likely to occur with an item that covers complex

material written in such a way that it is possible to select the correct response without

any real understanding of what is being assessed. A poor student may make a guess,

select that response, and come up with the correct answer. Good students may be

suspicious of a question that looks too easy, may take the harder path to solving the

problem, read too much into the question, and may end up being less successful than

those who guess. As a rule of thumb, in terms of discrimination index, .40 and greater

are very good items, .30 to .39 are reasonably good but possibly subject to

improvement, .20 to .29 are marginal items and need some revision, and those below .19 are

considered poor items and need major revision or should be eliminated (Ebel &

Frisbie, 1986).
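
The rule of thumb can be expressed as a small lookup, sketched below in Python. The function name `rate_item` and the short labels are hypothetical. The rule as quoted leaves values between .19 and .20 unassigned; this sketch assumes they count as poor.

```python
def rate_item(d):
    """Label a discrimination index using the Ebel and Frisbie bands."""
    if d >= 0.40:
        return "very good"
    if d >= 0.30:
        return "reasonably good"
    if d >= 0.20:
        return "marginal"
    return "poor"  # assumption: anything under .20, including negatives


print(rate_item(0.60))   # very good (Table 6)
print(rate_item(0.267))  # marginal (Table 4)
print(rate_item(-1.0))   # poor (Table 7)
```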

Discrimination Coefficients

Two indicators of the item's discrimination effectiveness are point biserial correlation

and biserial correlation coefficient. The choice of correlation depends upon what kind

of question we want to answer. The advantage of using discrimination coefficients

over the discrimination index (D) is that every person taking the test is used to

compute the discrimination coefficients, whereas only 54% (27% upper + 27% lower) are

used to compute the discrimination index, D.

Point biserial. The point biserial (rpbis) correlation is used to find out if the right people

are getting the items right, and how much predictive power the item has and how it

would contribute to predictions. Henrysson (1971) suggests that the rpbis tells more

about the predictive validity of the total test than does the biserial r, in that it tends to

favor items of average difficulty. It is further suggested that the rpbis is a combined

measure of item-criterion relationship and of difficulty level.

Biserial correlation. Biserial correlation coefficients (rbis) are computed to determine

whether the attribute or attributes measured by the criterion are also measured by the

item and the extent to which the item measures them. The rbis gives an estimate of the

well-known Pearson product-moment correlation between the criterion score and the

hypothesized item continuum when the item is dichotomized into right and wrong

(Henrysson, 1971). Ebel and Frisbie (1986) state that the rbis simply describes the

relationship between scores on a test item (e.g., "0" or "1") and scores (e.g., "0",

"1",..."50") on the total test for all examinees.
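
Both coefficients can be computed from item scores coded 0 or 1 and the total test scores. The sketch below uses the usual textbook formulas, which are assumed here because the paper does not print them: the point biserial is the difference between the mean total score of those who got the item right and of those who got it wrong, divided by the standard deviation of the total scores and multiplied by the square root of pq; the biserial rescales that value by the square root of pq over the height of the standard normal curve at the point that separates p from q. The function names and the data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, pstdev


def point_biserial(item_scores, total_scores):
    """Correlation between a 0/1 item score and the total test score."""
    right = [t for i, t in zip(item_scores, total_scores) if i == 1]
    wrong = [t for i, t in zip(item_scores, total_scores) if i == 0]
    spread = pstdev(total_scores)  # population SD of the total scores
    if not right or not wrong or spread == 0:
        raise ValueError("need both right and wrong answers and varying totals")
    p = len(right) / len(item_scores)
    return (mean(right) - mean(wrong)) / spread * sqrt(p * (1 - p))


def biserial(item_scores, total_scores):
    """Biserial estimate, assuming a normal trait underlies the item."""
    r_pbis = point_biserial(item_scores, total_scores)
    p = sum(item_scores) / len(item_scores)
    # Height of the standard normal curve at the point splitting p from 1 - p.
    ordinate = NormalDist().pdf(NormalDist().inv_cdf(p))
    return r_pbis * sqrt(p * (1 - p)) / ordinate


# Hypothetical data for eight examinees.
item = [1, 1, 1, 1, 0, 1, 0, 0]
total = [9, 8, 8, 7, 6, 5, 4, 3]
r_pb = point_biserial(item, total)
r_b = biserial(item, total)
print(round(r_pb, 3), round(r_b, 3))
```

For the same data the biserial is always larger in absolute value than the point biserial, because the rescaling factor exceeds 1 for every p between 0 and 1.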

Distractors

Analyzing the distractors (i.e., incorrect alternatives) is useful in determining the

relative usefulness of the decoys in each item. Items should be modified if students

consistently fail to select certain multiple choice alternatives. The alternatives are

probably totally implausible and therefore of little use as decoys in multiple choice

items. A discrimination index or discrimination coefficient should be obtained for

each option in order to determine each distractor's usefulness (Millman & Greene,

1993). Whereas the discrimination value of the correct answer should be positive, the

discrimination values for the distractors should be lower and, preferably, negative.

Distractors should be carefully examined when items show large positive D values.

When one or more of the distractors looks extremely plausible to the informed reader

and when recognition of the correct response depends on some extremely subtle point,

it is possible that examinees will be penalized for partial knowledge.
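
One way to carry out this check is to compute a discrimination index for every option, not only the keyed one. The Python sketch below does so for the counts in Table 4. The function names and the flagging rule (a distractor is flagged if nobody in either group chose it or if its index is positive) are assumptions made for illustration.

```python
def option_discrimination(upper_counts, lower_counts):
    """Return {option: D} computed from upper- and lower-group counts."""
    options = sorted(set(upper_counts) | set(lower_counts))
    group_size = max(sum(upper_counts.values()), sum(lower_counts.values()))
    return {
        opt: (upper_counts.get(opt, 0) - lower_counts.get(opt, 0)) / group_size
        for opt in options
    }


def flag_distractors(upper_counts, lower_counts, key):
    """List distractors that were never chosen or that favor the upper group."""
    flagged = []
    for option, d in option_discrimination(upper_counts, lower_counts).items():
        if option == key:
            continue
        chosen = upper_counts.get(option, 0) + lower_counts.get(option, 0)
        if chosen == 0 or d > 0:
            flagged.append(option)
    return flagged


# Counts from Table 4; option C is the keyed response.
upper = {"A": 1, "B": 0, "C": 11, "D": 3}
lower = {"A": 2, "B": 0, "C": 7, "D": 6}
print(option_discrimination(upper, lower))
print(flag_distractors(upper, lower, "C"))  # ['B']
```

Option B is flagged because no examinee in either group selected it, so it does no work as a decoy.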

Thompson and Levitov (1985) suggested computing reliability estimates for test

scores to determine an item's usefulness to the test as a whole. The authors stated,

"The total test reliability is reported first and then each item is removed from the test

and the reliability for the test less that item is calculated" (Thompson & Levitov,

1985, p. 167). From this the test developer deletes the indicated items so that the test

scores have the greatest possible reliability.
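
The procedure can be sketched in Python as follows. The quoted passage does not say which reliability estimate is used, so this sketch assumes KR-20 for items scored 0 or 1, with the population variance of the total scores. The function names and the small score matrix are hypothetical.

```python
from statistics import pvariance


def kr20(matrix):
    """KR-20 for a list of examinee rows, each a list of 0/1 item scores."""
    n_items = len(matrix[0])
    n_people = len(matrix)
    total_variance = pvariance([sum(row) for row in matrix])
    if n_items < 2 or total_variance == 0:
        raise ValueError("need at least two items and varying total scores")
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in matrix) / n_people
        sum_pq += p * (1 - p)
    return n_items / (n_items - 1) * (1 - sum_pq / total_variance)


def reliability_if_deleted(matrix):
    """Reliability of the test with each item removed in turn."""
    n_items = len(matrix[0])
    return [kr20([row[:j] + row[j + 1:] for row in matrix]) for j in range(n_items)]


# Hypothetical scores: six examinees, four items; the last item is noisy.
data = [
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
full = kr20(data)
without = reliability_if_deleted(data)
print(round(full, 3))
print([round(r, 3) for r in without])
```

In this made-up example, removing the fourth item raises the estimate above the full-test value, which is the pattern that would mark an item for deletion.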

Summary

Developing the perfect test is the unattainable goal for anyone in an evaluative

position. Even when guidelines for constructing fair and systematic tests are followed,

a plethora of factors may enter into a student's perception of the test items. Looking at

an item's difficulty and discrimination will assist the test developer in determining

what is wrong with individual items. Item and test analysis provide empirical data

about how individual items and whole tests are performing in real test situations.

References

Crocker, L., & Algina, J. (1986). Introduction to classical and modern test

theory. New York: Holt, Rinehart and Winston.

Ebel, R.L., & Frisbie, D.A. (1986). Essentials of educational

measurement. Englewood Cliffs, NJ: Prentice-Hall.

Gronlund, N.E., & Linn, R.L. (1990). Measurement and evaluation in teaching (6th

ed.). New York: MacMillan.

Henrysson, S. (1971). Gathering, analyzing, and using data on test items. In R.L.

Thorndike (Ed.), Educational Measurement (p. 141). Washington DC: American

Council on Education.

Millman, J., & Greene, J. (1993). The specification and development of tests of

achievement and ability. In R.L. Linn (Ed.), Educational measurement (pp. 335-366).

Phoenix, AZ: Oryx Press.

Nunnally, J.C. (1972). Educational measurement and evaluation (2nd ed.). New York:

McGraw-Hill.

Pedhazur, E.J., & Schmelkin, L.P. (1991). Measurement, design, and analysis: An

integrated approach. Hillsdale, NJ: Erlbaum.

Popham, W.J. (1981). Modern educational measurement. Englewood Cliffs, NJ:

Prentice-Hall.

Sax, G. (1989). Principles of educational and psychological measurement and

evaluation (3rd ed.). Belmont, CA: Wadsworth.

Thompson, B., & Levitov, J.E. (1985). Using microcomputers to score and evaluate

test items. Collegiate Microcomputer, 3, 163-168.

Thorndike, R.M., Cunningham, G.K., Thorndike, R.L., & Hagen, E.P.

(1991). Measurement and evaluation in psychology and education (5th ed.). New

York: MacMillan.

Wiersma, W., & Jurs, S.G. (1990). Educational measurement and testing (2nd ed.).

Boston, MA: Allyn and Bacon.

Wood, D.A. (1960). Test construction: Development and interpretation of

achievement tests. Columbus, OH: Charles E. Merrill Books, Inc.

We assess the quality of tests that are implemented and carry out analysis of tests for the

various organizations involved in testing, such as qualifying examination bodies, educational

institutions, and education services companies to ensure that the abilities of examinees are

understood correctly.

Is the difficulty of the questions within an appropriate range? Is the number of

questions appropriate?

Do the choices function well? Does each question distinguish between examinees with

high level and low level ability?

Is the pass line setting and method of dividing levels appropriate?

Is it possible to make comparisons with previous test scores and ascertain changes in

continuous test scores?

What is the relationship between examinee attributes, grouping and test scores?

Item analysis - Analysis using classical test theory

Output index name              Explanation of the index
Percentage of correct answers  Difficulty of questions in the test population
Point biserial correlation     How well does a question item discriminate between high
                               and low abilities?
Biserial correlation           The correlation when the questions and test scores are
                               assumed to follow a bivariate normal distribution
Choice selection rate          Choice selection status, function status
Actual choice rate             The actual number of choices as seen from the test results
Fundamental statistics         Fundamental statistical information about the test
Reliability factor             An index of the reliability of the test score
Standard error                 The standard deviation for score assuming a certain
                               examinee takes the test repeatedly
GP analysis table              A table and graph showing how the choices function for
                               each level

Analysis using IRT

Output index name          Explanation of the index
Parameter a                An index of the sensitivity with which ability groups
                           around value b are identified
Parameter b                The difficulty of a particular question
Parameter c                An index of the possibility of guessing the correct answer
Standard error a           The standard error for value a assuming repeated data
                           acquisition and calculation
Standard error b           The standard error for value b assuming repeated data
                           acquisition and calculation
Standard error c           The standard error for value c assuming repeated data
                           acquisition and calculation
chi-square                 The deviation between the percentage of correct answers
                           obtained from the model and the percentage of correct
                           answers obtained from the data
df                         Degree of freedom (the number of division categories for
                           ability score used to calculate the squared value)
p                          The occurrence ratio of relevant data assuming equivalence
                           between the model and the data
Fundamental statistics     Average, median, standard deviation, variance, range,
                           minimum, maximum, sample size
Test characteristic curve  A graph of the correspondence between ability score and
                           test score
Test information curve     A diagram showing the reliability of each ability score
                           point
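
The list above does not name the IRT model, but parameters a, b, and c match the three-parameter logistic model, so that model is assumed in the Python sketch below. The function name `prob_correct` and the parameter values are hypothetical, and the scaling constant of 1.7 that some programs place in the exponent is left out.

```python
from math import exp


def prob_correct(theta, a, b, c):
    """Probability of a correct answer under an assumed 3PL model.

    theta: examinee ability; a: discrimination; b: difficulty;
    c: lower asymptote (chance of guessing the correct answer).
    """
    return c + (1 - c) / (1 + exp(-a * (theta - b)))


# An examinee whose ability equals the item difficulty, four-choice item.
print(prob_correct(theta=0.0, a=1.2, b=0.0, c=0.25))  # 0.625
```

When ability equals b, the probability sits halfway between the guessing level c and 1.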

Test analysis: identifying test conditions

Test analysis is the process of looking at something that can be used to derive test

information. This basis for the tests is called the test basis.

The test basis is the information we need in order to start the test analysis and create

our own test cases. Basically, it is the documentation on which test cases are based, such

as requirements, design specifications, product risk analysis, architecture and

interfaces.

We can use the test basis documents to understand what the system should do once

built. The test basis includes whatever the tests are based on. Sometimes tests can be

based on experienced users' knowledge of the system, which may not be documented.

From a testing perspective, we look at the test basis in order to see what could be tested.

These are the test conditions. A test condition is simply something that we could test.

While identifying the test conditions, we want to identify as many conditions as we can

and then select which ones to take forward and combine into test cases. We

could call them test possibilities.

As we know, testing everything, which is known as exhaustive

testing, is an impractical goal. We cannot test everything; we have to select a subset of all possible tests. In

practice the subset we select may be a very small subset and yet it has to have a high

probability of finding most of the defects in a system. Hence we need some intelligent

thought process to guide our selection; such processes are called test techniques. The test conditions that

are chosen will depend on the test strategy or detailed test approach. For example, they

might be based on risk, models of the system, etc.

Once we have identified a list of test conditions, it is important to prioritize them, so that

the most important test conditions are identified. Test conditions can be identified

for test data as well as for test inputs and test outcomes, for example, different types of

record, different sizes of records or fields in a record. Test conditions are documented

in the IEEE 829 document called a Test Design Specification.

Tests: Post Test Analysis

Sometimes you can get valuable study clues for upcoming tests by examining old tests you have

already taken. This method works best if the instructor gives many examinations. Obviously it would

not work on the first test. This method is based on the premise that people tend to be consistent.

Here's what you do:

1. Gather all your notes, texts, and test answer sheets and visit your instructor during office hours.

Ask to look over the test that was previously given in your class.

2. As you look over the test, answer two basic questions:

1. Where did this test come from?

Did the test come mostly from lecture notes, the textbook, or the homework? Did your

instructor lecture hard on Chapter 4 and then test hard on Chapter 4? Does he like lots of

little specifics, or just test on broad, general areas?

2. What kinds of questions were asked?

Were there factual questions, application questions, definition questions? If factual, then

know names, dates, places; if application, then study theory; if definitions, then be familiar

with terms.

For example, one student discovered that her instructor made up exams by selecting only the major

paragraphs in the chapter and then using the topic sentence of each paragraph as the exam question.

It was then a simple matter to study for the forthcoming examinations. This knowledge came only

after carefully examining the old test.

Tips to COMBAT Test Panic

1. Sleep. Get a good night's rest.

2. Diet. Eat breakfast or lunch. This may help calm your nervous stomach and give you energy.

Avoid greasy or acidic foods, and avoid overeating. Avoid caffeine pills.

3. Exercise. Nothing reduces stress more than exercise. An hour or two before an examination, stop

studying and go work out: swimming, jogging, cycling, or aerobics.

4. Allow yourself enough time to get to the test without hurrying.

5. Don't swap questions at the door. Hearing anything you don't know may weaken your

confidence and send you into a state of anxiety.

6. Leave your books at home. Flipping pages at the last minute may only upset you. If you must

take something, take a brief outline that you know well.

7. Take a watch with you, as well as extra pencils, scantron sheets, and blue books.

8. Answer the easy questions first. This will relax you and help build your confidence, plus give you

some assured points.

9. Sit apart from your classmates to reduce being distracted by their movements.

10. Don't panic if others are writing and you aren't. Your thinking may be more profitable than their

writing.

11. Don't be upset if others finish their tests before you do. Use as much time as you are allowed.

Students who leave early don't always get the highest grades.

12. If you still feel nervous during the test, try some emergency first aid: inhale deeply, close eyes,

hold, then exhale slowly. Repeat as needed.

There are many benefits that can be gained by using tools to support testing. They are:

Reduction of repetitive work: Repetitive work is very boring if it is done manually.

People tend to make mistakes when doing the same task over and over. Examples of

this type of repetitive work include running regression tests, entering the same test data

again and again (can be done by a test execution tool), checking against coding

standards (which can be done by a static analysis tool) or creating a specific test

database (which can be done by a test data preparation tool).

Greater consistency and repeatability: People have a tendency to do the same task in

a slightly different way even when they think they are repeating something exactly. A

tool will exactly reproduce what it did before, so each time it is run the result is

consistent.

Objective assessment: If a person calculates a value from the software or incident

reports, by mistake they may omit something, or their own one-sided preconceived

judgments or convictions may lead them to interpret that data incorrectly. Using a tool

means that subjective preconceptions are removed and the assessment is more

repeatable and consistently calculated. Examples include assessing the cyclomatic

complexity or nesting levels of a component (which can be done by a static analysis

tool), coverage (coverage measurement tool), system behavior (monitoring tools) and

incident statistics (test management tool).

Ease of access to information about tests or testing: Information presented visually

is much easier for the human mind to understand and interpret. For example, a chart or

graph is a better way to show information than a long list of numbers; this is why

charts and graphs in spreadsheets are so useful. Special purpose tools give these

features directly for the information they process. Examples include statistics and

graphs about test progress (test execution or test management tool), incident rates

(incident management or test management tool) and performance (performance testing

tool).

Item Analysis

Table of Contents

Major Uses of Item Analysis

Item Analysis Reports

Item Analysis Response Patterns

Basic Item Analysis Statistics

Interpretation of Basic Statistics

Other Item Statistics

Summary Data

Report Options

Item Analysis Guidelines

Major Uses of Item Analysis

Item analysis can be a powerful technique available to instructors for the guidance and

improvement of instruction. For this to be so, the items to be analyzed must be valid

measures of instructional objectives. Further, the items must be diagnostic, that is,

knowledge of which incorrect options students select must be a clue to the nature of

the misunderstanding, and thus prescriptive of appropriate remediation.

In addition, instructors who construct their own examinations may greatly improve the

effectiveness of test items and the validity of test scores if they select and rewrite their

items on the basis of item performance data. Such data is available to instructors who

have their examination answer sheets scored at the Computer Laboratory Scoring

Office.

Item Analysis Reports

As the answer sheets are scored, records are written which contain each student's

score and his or her response to each item on the test. These records are then

processed and an item analysis report file is generated. An instructor may obtain test

score distributions and a list of students' scores, in alphabetic order, in student number

order, in percentile rank order, and/or in order of percentage of total points. Instructors

are sent their item analysis reports as e-mail attachments. The item analysis report

is contained in the file IRPT####.RPT, where the four digits indicate the instructor's

GRADER III file. A sample of an individual long form item analysis listing is shown

below.

Item 10 of 125. The correct option is 5.

Item Response Pattern

                 1      2      3      4      5   Omit  Error  Total
Upper 27%        2      8      0      1     19      0      0     30
                 7%    27%     0%     3%    63%     0%     0%   100%
Middle 46%       3     20      3      3     23      0      0     52
                 6%    38%     6%     6%    44%     0%     0%   100%
Lower 27%        6      5      8      2      9      0      0     30
                20%    17%    27%     7%    30%     0%     0%   101%
Total           11     33     11      6     51      0      0    112
                10%    29%    10%     5%    46%     0%     0%   100%

Item Analysis Response Patterns

Each item is identified by number and the correct option is indicated. The group of

students taking the test is divided into upper, middle and lower groups on the basis of

students' scores on the test. This division is essential if information is to be provided

concerning the operation of distracters (incorrect options) and to compute an easily

interpretable index of discrimination. It has long been accepted that optimal item

discrimination is obtained when the upper and lower groups each contain twenty-

seven percent of the total group.

The number of students who selected each option or omitted the item is shown for

each of the upper, middle, lower and total groups. The number of students who

marked more than one option to the item is indicated under the "error" heading. The

percentage of each group who selected each of the options, omitted the item, or erred,

is also listed. Note that the total percentage for each group may be other than 100%,

since the percentages are rounded to the nearest whole number before totaling.
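As a rough illustration of how such a response pattern could be tabulated, the sketch below ranks students by total score, takes the top and bottom 27% as the upper and lower groups, and counts the options chosen for one item. The function name `response_pattern`, the data layout, and the rounding of the group size are assumptions made for illustration; they are not the scoring office's actual program.

```python
from collections import Counter

def response_pattern(records, item, fraction=0.27):
    """Tabulate option counts for one item by upper/middle/lower score group.

    records: list of (total_score, responses), where responses maps
             item number -> chosen option (None = omitted).
    """
    ranked = sorted(records, key=lambda r: r[0], reverse=True)
    n_group = round(len(ranked) * fraction)  # assumed rounding rule
    groups = {
        "upper": ranked[:n_group],
        "middle": ranked[n_group:len(ranked) - n_group],
        "lower": ranked[len(ranked) - n_group:],
    }
    table = {}
    for name, members in groups.items():
        counts = Counter(resp.get(item) for _, resp in members)
        size = len(members)
        # Each percentage is rounded separately, so a row may not sum to 100.
        percents = {opt: round(100 * c / size) for opt, c in counts.items()}
        table[name] = {"n": size, "counts": dict(counts), "percents": percents}
    return table
```

With 112 students and a fraction of 0.27, this rule yields groups of 30, 52 and 30, the same sizes as in the sample listing.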

The sample item listed above appears to be performing well. About two-thirds of the

upper group but only one-third of the lower group answered the item correctly.

Ideally, the students who answered the item incorrectly should select each incorrect

response in roughly equal proportions, rather than concentrating on a single incorrect

option. Option two seems to be the most attractive incorrect option, especially to the

upper and middle groups. It is most undesirable for a greater proportion of the upper

group than of the lower group to select an incorrect option. The item writer should

examine such an option for possible ambiguity. For the sample item shown above,

option four was selected by only five percent of the total group. An attempt

might be made to make this option more attractive.

Item analysis provides the item writer with a record of student reaction to items. It

gives us little information about the appropriateness of an item for a course of

instruction. The appropriateness or content validity of an item must be determined by

comparing the content of the item with the instructional objectives.


Basic Item Analysis Statistics

A number of item statistics are reported which aid in evaluating the effectiveness of

an item. The first of these is the index of difficulty which is the proportion of the total

group who got the item wrong. Thus a high index indicates a difficult item and a low

index indicates an easy item. Some item analysts prefer an index of difficulty which is

the proportion of the total group who got an item right. This index may be obtained by

marking the PROPORTION RIGHT option on the item analysis header sheet.

Whichever index is selected is shown as the INDEX OF DIFFICULTY on the item

analysis print-out. For classroom achievement tests, most test constructors desire

items with indices of difficulty no lower than 20 nor higher than 80, with an average

index of difficulty from 30 or 40 to a maximum of 60.

The INDEX OF DISCRIMINATION is the difference between the proportion of the

upper group who got an item right and the proportion of the lower group who got the

item right. This index is dependent upon the difficulty of an item. It may reach a

maximum value of 100 for an item with an index of difficulty of 50, that is, when

100% of the upper group and none of the lower group answer the item correctly. For

items of less than or greater than 50 difficulty, the index of discrimination has a

maximum value of less than 100. The Interpreting the Index of

Discrimination document contains a more detailed discussion of the index of

discrimination.
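The two indices can be checked by hand against the sample listing for item 10. The sketch below is a minimal illustration, assuming the default definition of difficulty (percentage wrong) and assuming that group percentages are rounded before subtraction; the function names are hypothetical.

```python
def index_of_difficulty(n_wrong, n_total):
    """Percentage of the total group who got the item wrong (the default index)."""
    return round(100 * n_wrong / n_total)

def index_of_discrimination(upper_right, upper_n, lower_right, lower_n):
    """Upper-group percent correct minus lower-group percent correct."""
    upper_pct = round(100 * upper_right / upper_n)
    lower_pct = round(100 * lower_right / lower_n)
    return upper_pct - lower_pct

# Sample item 10: 51 of 112 students chose the keyed option overall;
# 19 of 30 in the upper group and 9 of 30 in the lower group did so.
difficulty = index_of_difficulty(112 - 51, 112)
discrimination = index_of_discrimination(19, 30, 9, 30)
print(difficulty, discrimination)  # prints: 54 33
```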


Interpretation of Basic Statistics

To aid in interpreting the index of discrimination, the maximum discrimination value

and the discriminating efficiency are given for each item. The maximum

discrimination is the highest possible index of discrimination for an item at a given

level of difficulty. For example, an item answered correctly by 60% of the group

would have an index of difficulty of 40 and a maximum discrimination of 80. This

would occur when 100% of the upper group and 20% of the lower group answered the

item correctly. The discriminating efficiency is the index of discrimination divided by

the maximum discrimination. For example, an item with an index of discrimination of

40 and a maximum discrimination of 50 would have a discriminating efficiency of 80.

This may be interpreted to mean that the item is discriminating at 80% of the potential

of an item of its difficulty. For a more detailed discussion of the maximum

discrimination and discriminating efficiency concepts, see the Interpreting the Index

of Discrimination document.
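The two worked examples in this section can be reproduced with a short sketch. It assumes the simplified rule implied by the first example (the maximum is reached when correct answers are concentrated in the upper group, giving twice the smaller of the difficulty and its complement); the report program itself may compute the ceiling differently, and the function names are hypothetical.

```python
def max_discrimination(difficulty):
    """Highest possible index of discrimination (0-100) at a given
    index of difficulty (0-100), under the simplified rule assumed here."""
    return 2 * min(difficulty, 100 - difficulty)

def discriminating_efficiency(discrimination, difficulty):
    """Index of discrimination as a percentage of its maximum."""
    ceiling = max_discrimination(difficulty)
    if ceiling == 0:
        return None  # undefined when everyone is right or everyone is wrong
    return round(100 * discrimination / ceiling)

print(max_discrimination(40))             # difficulty 40 -> 80
print(discriminating_efficiency(40, 25))  # 40 out of a possible 50 -> 80
```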


Other Item Statistics

Some test analysts may desire more complex item statistics. Two correlations which

are commonly used as indicators of item discrimination are shown on the item

analysis report. The first is the biserial correlation, which is the correlation between a

student's performance on an item (right or wrong) and his or her total score on the test.

This correlation assumes that the distribution of test scores is normal and that there is

a normal distribution underlying the right/wrong dichotomy. The biserial correlation

has the characteristic, disconcerting to some, of having maximum values greater than

unity. There is no exact test for the statistical significance of the biserial correlation

coefficient.

The point biserial correlation is also a correlation between student performance on an

item (right or wrong) and test score. It assumes that the test score distribution is

normal and that the division on item performance is a natural dichotomy. The possible

range of values for the point biserial correlation is +1 to -1. The Student's t test for the

statistical significance of the point biserial correlation is given on the item analysis

report. Enter a table of Student's t values with N - 2 degrees of freedom at the desired

percentile point. N, in this case, is the total number of students appearing in the item

analysis.
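For readers who want to see the arithmetic, the following sketch computes a point biserial correlation as an ordinary Pearson correlation between a 0/1 item score and the total score, and then the t statistic with N - 2 degrees of freedom. It is an illustration only; the function names are assumptions, and the biserial correlation (which needs the normal curve ordinate) is not covered.

```python
import math

def point_biserial(item_right, totals):
    """Pearson correlation between a 0/1 item score and total test score.

    Raises ZeroDivisionError if everyone got the item right (or wrong)
    or if all total scores are identical.
    """
    n = len(totals)
    mean_x = sum(item_right) / n
    mean_y = sum(totals) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(item_right, totals))
    var_x = sum((x - mean_x) ** 2 for x in item_right)
    var_y = sum((y - mean_y) ** 2 for y in totals)
    return cov / math.sqrt(var_x * var_y)

def t_statistic(r, n):
    """Student's t for a correlation r based on n students (n - 2 df).

    Undefined when r is exactly +1 or -1.
    """
    return r * math.sqrt((n - 2) / (1 - r * r))
```

The resulting t value would then be compared with a tabled Student's t at N - 2 degrees of freedom, as described above.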

The mean scores for students who got an item right and for those who got it wrong are

also shown. These values are used in computing the biserial and point biserial

coefficients of correlation and are not generally used as item analysis statistics.

Generally, item statistics will be somewhat unstable for small groups of students.

Perhaps fifty students might be considered a minimum number if item statistics are to

be stable. Note that for a group of fifty students, the upper and lower groups would

contain only thirteen students each. The stability of item analysis results will improve

as the group of students is increased to one hundred or more. An item analysis for

very small groups must not be considered a stable indication of the performance of a

set of items.


Summary Data

The item analysis data are summarized on the last page of the item analysis report.

The distribution of item difficulty indices is a tabulation showing the number and

percentage of items whose difficulties are in each of ten categories, ranging from a

very easy category (00-10) to a very difficult category (91-100). The distribution of

discrimination indices is tabulated in the same manner, except that a category is

included for negatively discriminating items.

The mean item difficulty is determined by adding all of the item difficulty indices and

dividing the total by the number of items. The mean item discrimination is determined

in a similar manner.

Test reliability, estimated by the Kuder-Richardson formula number 20, is given. If

the test is speeded, that is, if some of the students did not have time to consider each

test item, the reliability estimate may be spuriously high.

The final test statistic is the standard error of measurement. This statistic is a common

device for interpreting the absolute accuracy of the test scores. The size of the

standard error of measurement depends on the standard deviation of the test scores as

well as on the estimated reliability of the test.
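A minimal sketch of both statistics follows, assuming the textbook forms KR-20 = (k / (k - 1)) * (1 - sum(pq) / variance) and SEM = SD * sqrt(1 - reliability). The use of the population variance and the function names are assumptions; the report program may use a different convention.

```python
import math

def kr20(score_matrix):
    """Kuder-Richardson formula 20.

    score_matrix: one row per student, each row a list of 0/1 item scores.
    """
    n_students = len(score_matrix)
    k = len(score_matrix[0])
    totals = [sum(row) for row in score_matrix]
    mean_total = sum(totals) / n_students
    # Population variance of total scores (an assumed convention).
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_students
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in score_matrix) / n_students
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)

def standard_error_of_measurement(sd, reliability):
    """SEM grows with the score SD and shrinks as reliability rises."""
    return sd * math.sqrt(1 - reliability)
```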

Occasionally, a test writer may wish to omit certain items from the analysis although

these items were included in the test as it was administered. Such items may be

omitted by leaving them blank on the test key. The response patterns for omitted items

will be shown but the keyed options will be listed as OMIT. The statistics for these

items will be omitted from the Summary Data.


Report Options

A number of report options are available for item analysis data. The long-form item

analysis report contains three items per page. A standard-form item analysis report is

available where data on each item is summarized on one line. A sample report is

shown below.

ITEM ANALYSIS    Test 4482    125 Items    112 Students

Percentages: Upper 27% - Middle - Lower 27%

Item  Key  1         2         3         4         5      Omit   Error    Diff  Disc
1     4     7-23-57   0- 4- 7  28- 8-36  64-62- 0  0-0-0  0-0-0  0-0-0      54    64
2     2     7-12- 7  64-42-29  14- 4-21  14-42-36  0-0-0  0-0-0  0-0-0      56    35

The standard form shows the item number, key (number of the correct option), the

percentage of the upper, middle, and lower groups who selected each option, omitted

the item or erred, the index of difficulty, and the index of discrimination. For example,

in item 1 above, option 4 was the correct answer and it was selected by 64% of the

upper group, 62% of the middle group and 0% of the lower group. The index of

difficulty, based on the total group, was 54 and the index of discrimination was 64.


Item Analysis Guidelines

Item analysis is a completely futile process unless the results help instructors improve

their classroom practices and item writers improve their tests. Let us suggest a number

of points of departure in the application of item analysis data.

1. Item analysis gives necessary but not sufficient information concerning the

appropriateness of an item as a measure of intended outcomes of instruction.

An item may perform beautifully with respect to item analysis statistics and

yet be quite irrelevant to the instruction whose results it was intended to

measure. A most common error is to teach for behavioral objectives such as

analysis of data or situations, ability to discover trends, ability to infer

meaning, etc., and then to construct an objective test measuring mainly

recognition of facts. Clearly, the objectives of instruction must be kept in mind

when selecting test items.

2. An item must be of appropriate difficulty for the students to whom it is

administered. If possible, items should have indices of difficulty no less than

20 and no greater than 80. It is desirable to have most items in the 30 to 50

range of difficulty. Very hard or very easy items contribute little to the

discriminating power of a test.

3. An item should discriminate between upper and lower groups. These groups

are usually based on total test score but they could be based on some other

criterion such as grade-point average, scores on other tests, etc. Sometimes

an item will discriminate negatively, that is, a larger proportion of the lower

group than of the upper group selected the correct option. This often means

that the students in the upper group were misled by an ambiguity that the

students in the lower group, and the item writer, failed to discover. Such an

item should be revised or discarded.

4. All of the incorrect options, or distracters, should actually be distracting.

Preferably, each distracter should be selected by a greater proportion of the

lower group than of the upper group. If, in a five-option multiple-choice item,

only one distracter is effective, the item is, for all practical purposes, a two-

option item. Existence of five options does not automatically guarantee that

the item will operate as a five-choice item.

Item Analysis of Classroom Tests: Aims and Simplified Procedures

Aim:

How well did my test distinguish among students according to how well they met

my learning goals?

Recall that each item on your test is intended to sample performance on a particular

learning outcome. The test as a whole is meant to estimate performance across the full

domain of learning outcomes you have targeted.

Unless your learning goals are minimal or low (as they might be, for instance, on a

readiness test), you can expect students to differ in how well they have met those

goals. (Students are not peas in a pod!). Your aim is not to differentiate students just

for the fun of it, but to be able to measure the differences in mastery that occur.

One way to assess how well your test is functioning for this purpose is to look at how

well the individual items do so. The basic idea is that a good item is one that good

students get correct more often than do poor students. You might end up with a big

spread in scores, but what if the good students are no more likely than poor students to

get a high score? If we assume that you have actually given them proper instruction,

then your test has not really assessed what they have learned. That is, it is "not

working."

An item analysis gets at the question of whether your test is working by asking the

same question of all individual items: how well does it discriminate? If you have lots

of items that didn't discriminate much if at all, you may want to replace them with

better ones. If you find ones that worked in the wrong direction (where good students

did worse) and therefore lowered test reliability, then you will definitely want to get

rid of them.

In short, item analysis gives you a way to exercise additional quality control over your

tests. Well-specified learning objectives and well-constructed items give you a

headstart in that process, but item analyses can give you feedback on how successful

you actually were.

Item analyses can also help you diagnose why some items did not work especially

well, and thus suggest ways to improve them (for example, if you find distracters that

attracted no one, try developing better ones).

Reminder

Item analyses are intended to assess and improve the reliability of your tests. If test

reliability is low, test validity will necessarily also be low. This is the ultimate reason

you do item analyses: to improve the validity of a test by improving its reliability.

Higher reliability will not necessarily raise validity (you can be more consistent in

hitting the wrong target), but it is a prerequisite. That is, high reliability is necessary

but not sufficient for high validity (do you remember this point on Exam 1?).

However, when you examine the properties of each item, you will often discover how

they may or may not actually have assessed the learning outcome you intended,

which is a validity issue. When you change items to correct these problems, it means

the item analysis has helped you to improve the likely validity of the test the next time

you give it.

The procedure (apply it to the sample results I gave you)

1. Identify the upper 10 scorers and lowest 10 scorers on the test. Set aside the

remainder.

2. Construct an empty chart for recording their scores, following the sample I

gave you in class. This chart lists the students down the left, by name. It arrays

each item number across the top. For a 20-item test, you will have 20 columns

for recording the answers for each student. Underneath the item number, write

in the correct answer (A, B, etc.)

3. Enter the student data into the chart you have just constructed.

a. Take the top 10 scorers, and write each student's name down the left,

one row for each student. If there is a tie for 10th place, pick one student

randomly from those who are tied.

b. Skip a couple rows, then write the names of the 10 lowest-scoring

students, one row for each.

c. Going student by student, enter each student's answers into the cells of

the chart. However, enter only the wrong answers (A, B, etc.). Any

empty cell will therefore signal a correct answer.

d. Go back to the upper 10 students. Count how many of them got Item 1

correct (this would be all the empty cells). Write that number at the

bottom of the column for those 10. Do the same for the other 19

questions. We will call these sums R_U, where U stands for "upper."

e. Repeat the process for the 10 lowest students. Write those sums under

their 20 columns. We will call these R_L, where L stands for "lower."

4. Now you are ready to calculate the two important indices of item functioning.

These are actually only estimates of what you would get if you had a computer

program to calculate the indices for everyone who took the test (some schools

do). But they are pretty good.

a. Difficulty index. This is just the proportion of people who passed the

item. Calculate it for each item by adding the number correct in the top

group (R_U) to the number correct in the bottom group (R_L) and then

dividing this sum by the total number of students in the top and bottom

groups (20):

(R_U + R_L) / 20

Record these 20 numbers in a row near the bottom of the chart.

b. Discrimination index. This index is designed to highlight to what extent

students in the upper group were more likely than students in the lower

group to get the item correct. That is, it is designed to get at the

differences between the two groups. Calculate the index by subtracting

R_L from R_U, and then dividing by half the number of students involved

(10):

(R_U - R_L) / 10

Record these 20 numbers in the last row of the chart.

5. You are now ready to enter these discrimination indexes into a second chart.

6. Construct the second chart, based on the model I gave you in class. (This is the

smaller chart that contains no student names.)

a. Note that there are two rows of column headings in the sample. The first

row of headings contains the maximum possible discrimination indexes

for each item difficulty level (more on that in a moment). The second

row contains possible difficulty indexes. Let's begin with that second

row of headings (labeled "p"). As your sample shows, the entries range

on the far left from "1.0" (for 100%) to ".4-0" (40%-0%) for a final

catch-all column. Just copy the numbers from the sample onto your

chart.

b. Now copy the numbers from the first row of headings in the sample

(labeled "Md").

7. Now is the time to pick up your first chart again, where you will find

the discrimination indexes you need to enter into your second chart.

a. You will be entering its last row of numbers into the body of the second

chart.

b. List each of these discrimination indexes in one and only one of the 20

columns. But which one? List each in the column corresponding to

its difficulty level. For instance, if item 4's difficulty level is .85 and its

discrimination index is .10, put the .10 in the difficulty column labeled

".85." This number is entered, of course, into the row for the fourth item.

8. Study this second chart.

a. How many of the items are of medium difficulty? These are the best,

because they provide the most opportunity to discriminate (to see this,

look at their maximum discrimination indexes in the first row of

headings). Items that most everybody gets right or gets wrong simply

can't discriminate much.

b. The important test for an item's discriminability is to compare it to the

maximum possible. How well did each item discriminate relative to the

maximum possible for an item of its particular difficulty level? Here is a

rough rule of thumb.

Discrimination index is near the maximum possible = very

discriminating item

Discrimination index is about half the maximum possible =

moderately discriminating item

Discrimination index is about a quarter the maximum possible =

weak item

Discrimination index is near zero = non-discriminating item

Discrimination index is negative = bad item (delete it if worse

than -.10)

9. Go back to the first chart and study it.

a. Look at whether all the distracters attracted someone. If some did not

attract any, then the distracter may not be very useful. Normally you

might want to examine it and consider how it might be improved or

replaced.

b. Look also for distractors that tended to pull your best students and

thereby lower discriminability. Consider whether the discrimination you

are asking them to make is educationally significant (or even clear). You

can't do this kind of examination for the sample data I have given you,

but keep it in mind for real-life item analyses.

10. There is much more you can do to mine these data for ideas about your items,

but this is the core of an item analysis.
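The arithmetic in steps 4 and 8 can be sketched in a few lines. The formulas for the two indices come straight from the procedure; the formula used for the maximum possible discrimination (twice the smaller of p and 1 - p, which follows from equal-sized upper and lower groups) and the numeric cut-offs used to turn the rule of thumb into labels are assumptions chosen for illustration, as are the function names.

```python
def simplified_indices(r_upper, r_lower, group_size=10):
    """Difficulty and discrimination from the counts correct in each group."""
    p = (r_upper + r_lower) / (2 * group_size)   # proportion passing the item
    d = (r_upper - r_lower) / group_size         # upper minus lower
    return p, d

def rate_item(p, d):
    """Label an item by comparing d with the maximum possible at difficulty p.

    The cut-offs (0.75, 0.375, 0.125 of the maximum) are hypothetical
    midpoints between the rule-of-thumb anchor points.
    """
    max_d = 2 * min(p, 1 - p)
    if d < 0:
        return "bad item"
    if max_d == 0 or d < 0.125 * max_d:
        return "non-discriminating item"
    ratio = d / max_d
    if ratio >= 0.75:
        return "very discriminating item"
    if ratio >= 0.375:
        return "moderately discriminating item"
    return "weak item"

# The example from step 7: difficulty .85, discrimination .10.
p, d = simplified_indices(9, 8)
print(p, d, rate_item(p, d))
```

Here an item that 9 of the upper 10 and 8 of the lower 10 answered correctly has a maximum possible discrimination of about .30, so a value of .10 is about a third of the ceiling.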

If you are lucky

If you use scantron sheets for grading exams, ask your school whether it can calculate

item statistics when it processes the scantrons. If it can, those statistics probably

include what you need: the (a) difficulty indexes for each item, (b) correlations of

each item with total scores for each student on the test, and (c) the number of students

who responded to each distracter. The item-total correlation is comparable to (and

more accurate than) your discrimination index.

If your school has this software, then you won't have to calculate any item statistics,

which makes your item analyses faster and easier. It is important that you have

calculated the indexes once on your own, however, so that you know what they mean.

Improve multiple choice tests using item analysis

Item analysis report

An item analysis includes two statistics that can help you analyze the effectiveness of your test

questions. The question difficulty is the percentage of students who selected the correct response.

The discrimination (item effectiveness) indicates how well the question separates the students

who know the material well from those who don't.

Question difficulty

Question difficulty is defined as the proportion of students selecting the correct answer. The most

effective questions in terms of distinguishing between high and low scoring students will be

answered correctly by about half of the students. In practical terms, questions in most classroom

tests will have a range of difficulties from low or easy (.90) to high or very difficult (.40). Questions

having difficulty estimates outside of these ranges may not contribute much to the effective

evaluation of student performance.

Very easy questions may not sufficiently challenge the most able students. However, having

a few relatively easy questions in a test may be important to verify the mastery of some

course objectives. Keep tests balanced in terms of question difficulty.

Very difficult questions, if they form most of a test, may produce frustration among students.

Some very difficult questions are needed to challenge the best students.

Question discrimination

The discrimination index (item effectiveness) is a kind of correlation that describes the

relationship between a student's response to a single question and his or her total score on the test.

This statistic can tell you how well each question was able to differentiate among students in terms

of their ability and preparation.

As a correlation, question discrimination can theoretically take values between -1.00 and

+1.00. In practical terms values for most classroom tests range between near 0.00 to values

near .90.

If a question is very easy so that nearly all students answered correctly, the question's

discrimination will be near zero. Extremely easy questions cannot distinguish among

students in terms of their performance.

If a question is extremely difficult so that nearly all students answered incorrectly, the

discrimination will be near zero.

The most effective questions will have moderate difficulty and high discrimination values.

The higher the discrimination value, the more effective the question is at discriminating between

students who perform well on the test and those who don't.

Questions having low or negative values of discrimination need to be reviewed very carefully

for confusing language or an incorrect key. If no confusing language is found then the course

design for the topic of the question needs to be critically reviewed.

A high level of student guessing on questions will result in a question discrimination value

near zero.

Steps in a review of an item analysis report

1. Review the difficulty and discrimination of each question.

2. For each question having low values of discrimination review the distribution of responses

along with the question text to determine what might be causing a response pattern that

suggests student confusion.

3. If the text of the question is confusing, change the text or remove the question from the

course database. If the question text is not confusing or faulty, then try to identify the

instructional component that may be leading to student confusion.

4. Carefully examine the questions that discriminate well between high and low scoring

students to fully understand the role that instructional design played in leading to these

results. Ask yourself what aspects of the instructional process appear to be most effective.

Test Item Performance: The Item Analysis

Table of Contents

Summary of Test Statistics

Test Frequency Distribution

Item Difficulty and Discrimination: Quintile Table

Interpreting Item Statistics

MERMAC - Test Analysis and Questionnaire Package

The ITEM ANALYSIS output consists of four parts: A summary of test statistics, a test frequency

distribution, an item quintile table, and item statistics. This analysis can be processed for an entire class.

If it is of interest to compare the item analysis for different test forms, then the analysis can be processed

by test form. The Division of Measurement and Evaluation staff is available to help instructors interpret

their item analysis data.

Summary of Test Statistics

Part I of the ITEM ANALYSIS consists of a summary of the following statistics:

* * * MERMAC -- TEST ANALYSIS AND QUESTIONNAIRE PACKAGE * * *

SAMPLE ITEM ANALYSIS

SUMMARY OF TEST STATISTICS

NUMBER OF ITEMS:

(Number of items on the test.)

80

MEAN SCORE:

(Arithmetic average; the sum of all scores divided by the number of

scores.)

60.92

MEDIAN SCORE:

(The raw score point that divides the raw score distribution in half;

50% of the scores fall above the median and 50% fall below.)

63.15

STANDARD DEVIATION:

(Measure of the spread or variability of the score distribution. The

higher the value of the standard deviation, the better the test is

discriminating among student performance levels.)

12.24

RELIABILITY (KR-20):

(Is an estimate of test reliability indicating the internal consistency of

the test. The range of the reliability is from 0.00 to 1.00. A reliability of

.70 or better is desirable for classroom tests.)

0.915

RELIABILITY (KR-21):

(When item difficulties are approximately equal, is an estimate of test

reliability indicating the internal consistency of the test. The range of

the reliability is from 0.00 to 1.00. A reliability of .70 or better is

desirable for classroom tests.)

0.915

S.E. OF MEASUREMENT:

(The accuracy of measurement expressed in the test score scale.

The larger the standard error, the less precise the measure of student

achievement. Two-thirds of the time test takers' obtained scores fall

within one standard error of measurement of their true score.)

3.58

POSSIBLE LOW SCORE:

(The possible low score.)

0

POSSIBLE HIGH SCORE:

(The possible high score.)

80

OBTAINED LOW SCORE:

(The obtained low score.)

0

OBTAINED HIGH SCORE:

(The obtained high score.)

80

NUMBER OF SCORES:

(The number of answer sheets submitted

for scoring.)

603

BLANK SCORES [1]:

(Number of test scores that could not be computed.)

0

INVALID SCORES:

(Number of test scores out of range specified by the user.)

0

VALID SCORES:

(Only those scores that fall within the range specified by the user are

included in the analysis so that

the user has the option of disregarding certain scores.)

603

[1] Blank and invalid scores (those falling outside the specified range) are counted and are omitted from the analysis.
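KR-21 needs only the number of items, the mean and the standard deviation, so it can be recomputed from the summary values as a sanity check. The sketch below assumes the textbook form of the formula; the function name is hypothetical, and because the printed mean and standard deviation are rounded, the result should be expected to land near, rather than exactly on, the reported value.

```python
def kr21(n_items, mean, sd):
    """Kuder-Richardson formula 21, which treats all items as equally difficult."""
    variance = sd ** 2
    return (n_items / (n_items - 1)) * (
        1 - mean * (n_items - mean) / (n_items * variance)
    )

# Values taken from the sample summary: 80 items, mean 60.92, SD 12.24.
print(round(kr21(80, 60.92, 12.24), 3))
```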


Test Frequency Distribution

Part II of the ITEM ANALYSIS program displays a test frequency distribution. The raw scores are ordered

from high to low with corresponding statistics:

1. Standard score--a linear transformation of the raw score that sets the mean equal to 500 and the

standard deviation equal to 100; in normal score distributions for classes of 500 students or more

the standard score range usually falls between 200 and 800 (plus or minus three standard

deviations of the mean); for classes with fewer than 30 students the standard score range usually

falls within two standard deviations of the mean, i.e., a range of 300 to 700.

2. Percentile rank--the percentage of individuals who received a score lower than the given score

plus the percentage of half the individuals who received the given score. This measure indicates a

person's relative position within a group.

3. Percentage of people in the total group who received the given score.

4. Frequency--in a test analysis, the number of individuals who receive a given score.

5. Cumulative frequency--in a test analysis, the number of individuals who score at or below a given

score value.
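The definitions above translate directly into code. The following is a sketch under the stated definitions only; the function names are assumptions, and the rounding of the standard score to a whole number is inferred from the sample output rather than documented.

```python
def standard_score(raw, mean, sd):
    """Linear transformation to a scale with mean 500 and SD 100."""
    return round(500 + 100 * (raw - mean) / sd)

def percentile_rank(raw, scores):
    """Percent scoring below `raw` plus half the percent scoring exactly `raw`."""
    below = sum(1 for s in scores if s < raw)
    equal = sum(1 for s in scores if s == raw)
    return 100 * (below + 0.5 * equal) / len(scores)

def cumulative_frequency(raw, scores):
    """Number of individuals scoring at or below `raw`."""
    return sum(1 for s in scores if s <= raw)

scores = [1, 2, 2, 3]
print(percentile_rank(2, scores))       # 1 below + half of 2 tied, out of 4 -> 50.0
print(cumulative_frequency(2, scores))  # three scores are at or below 2 -> 3
```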


* * * MERMAC -- TEST ANALYSIS AND QUESTIONNAIRE PACKAGE * * *

SAMPLE ITEM ANALYSIS

TEST FREQUENCY DISTRIBUTION

RAW    STANDARD  PER-                    CUM
SCORE  SCORE     CENTILE  PERCENT  FREQ  FREQ  EACH * REPRESENTS 1 PERSON(S)

92 717 99 0.2 1 603 *

91 708 99 0.3 2 602 **

90 700 99 0.0 0 600

89 691 99 0.2 1 600 *

88 683 99 0.8 5 599 *****

87 675 99 0.3 2 594 **

86 666 98 1.0 6 592 ******

85 658 97 1.3 8 586 ********

84 649 96 1.2 7 578 *******

83 641 95 2.0 12 571 ************

82 632 93 1.7 10 559 **********

81 624 91 1.5 9 549 *********

80 615 90 1.5 9 540 *********

79 607 88 2.8 17 531 *****************

78 598 85 4.1 25 514 *************************

77 590 81 2.3 14 489 **************

76 562 79 4.0 24 475 ************************

75 573 75 2.2 13 451 *************

74 565 73 3.3 20 438 ********************

73 556 69 2.0 12 418 ************

72 548 67 3.8 23 406 ***********************

71 539 64 2.8 17 383 *****************

70 531 61 3.0 18 366 ******************

69 522 58 3.2 19 326 *******************

67 505 51 3.6 22 307 **********************

66 497 47 3.8 23 285 ***********************

65 489 43 2.7 16 262 ****************

64 480 41 3.2 19 246 *******************

63 472 38 2.5 15 227 ***************

62 463 35 3.2 19 212 *******************

61 455 32 2.5 15 193 ***************

60 446 30 1.8 11 178 ***********

59 438 28 2.3 14 167 **************

58 429 25 3.0 18 153 ******************

57 421 22 1.7 10 135 **********

56 413 21 3.2 12 106 ************

54 396 16 1.7 10 94 **********

53 387 14 1.5 9 84 *********

52 379 12 1.2 7 75 *******

51 370 11 2.0 12 68 ************

50 362 9 1.2 7 56 *******

49 353 8 1.3 8 49 ********

48 345 7 1.7 10 41 **********


Item Difficulty and Discrimination: Quintile Table

Part III of the ITEM ANALYSIS output, an item quintile table, can aid in the interpretation of Part IV of the

output. Part IV compares the item responses versus the total score distribution for each item. A good item

discriminates between students who scored high or low on the examination as a whole. In order to

compare different student performance levels on the examination, the score distribution is divided into

fifths, or quintiles. The first fifth includes students who scored between the 81st and 100th percentiles; the

second fifth includes students who scored between the 61st and 80th percentiles, and so forth. When the

score distribution is skewed, more than one-fifth of the students may have scores within a given quintile

and as a result, less than one-fifth of the students may score within another quintile. The table indicates

the sample size, the proportion of the distribution, and the score ranges within each fifth.
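As a rough illustration of how a score distribution might be cut into fifths, the sketch below assigns each student to a quintile by percentile rank. The function name `assign_quintiles` and the tie-handling rule (students with the same score always land in the same fifth) are assumptions made for this example; the package's actual algorithm is not described in this guide.

```python
from collections import Counter

def assign_quintiles(scores):
    """Return a quintile (1 = top fifth ... 5 = bottom fifth) for each score.

    Assumed rule: a student's percentile is the share of the group scoring
    at or below that student's score, so tied scores share one quintile.
    """
    n = len(scores)
    freq = Counter(scores)
    at_or_below = {}
    running = 0
    for score in sorted(freq):
        running += freq[score]
        at_or_below[score] = running / n  # cumulative proportion
    quintiles = []
    for s in scores:
        pct = at_or_below[s]
        # pct in (0.8, 1.0] -> 1st fifth, (0.6, 0.8] -> 2nd fifth, and so on
        quintiles.append(5 - min(4, int((pct - 1e-9) * 5)))
    return quintiles

# Hypothetical scores with ties: the fifths come out unequal in size.
example = [90, 85, 85, 70, 60, 60, 60, 50, 40, 30]
print(Counter(assign_quintiles(example)))
```

With tied scores, as in the hypothetical list above, some fifths hold more than one-fifth of the students and others fewer, which is the effect the paragraph describes for skewed distributions.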

* * * MERMAC -- TEST ANALYSIS AND QUESTIONNAIRE PACKAGE * * *

THE QUINTILE GRAPH AND MATRIX OF RESPONSES

APPEARING WITH EACH ITEM ARE BASED ON THE

STATISTICS INDICATED IN THE TABLE BELOW:

QUINTILE   SAMPLE SIZE   PROPORTION   SCORE RANGE

1ST        128           0.21         77 - 92
2ND        127           0.21         70 - 76
3RD        121           0.20         64 - 69
4TH        121           0.20         56 - 63
5TH        106           0.18         24 - 55

Interpreting Item Statistics

Part IV of ITEM ANALYSIS portrays item statistics which can help determine which items are good and

which need improvement or deletion from the examination. The quintile graph on the left side of the

output indicates the percent of students within each fifth who answered the item correctly. A good,

discriminating item is one in which students who scored well on the examination answered the correct

alternative more frequently than students who did not score well on the examination. Therefore, the

scattergram graph should form a line going from the bottom left-hand corner to the top right-hand corner

of the graph. Item 1 in the sample output shows an example of this type of positive linear relationship.

Item 2 in the sample output also portrays a discriminating item; although few students correctly answered

the item, the students in the first fifth answered it correctly more frequently than the students in the rest of

the score distribution. Item 3 indicates a poor item; the graph shows no relationship between the fifths

of the score distribution and the percentage of correct responses by fifths. However, it is likely that this

item was miskeyed by the instructor--note the response pattern for alternative B.

A. Evaluating Item Distractors: Matrix of Responses

On the right-hand side of the output, a matrix of responses by fifths shows the frequency of students

within each fifth who answered each alternative and who omitted the item. This information can help point

out what distractors, or incorrect alternatives, are not successful because: (a) they are not plausible

answers and few or no students chose the alternative (see alternatives D and E, item 2), or (b) too many

students, especially students in the top fifths of the distribution, chose the incorrect alternative instead of

the correct response (see alternative B, item 3). A good item will result in students in the top fifths

answering the correct response more frequently than students in the lower fifths, and students in the

lower fifths answering the incorrect alternative more frequently than students in the top fifths. The matrix

of responses prints the correct response of the item on the right-hand side and encloses the correct

response in the matrix in parentheses.

B. Item Difficulty: The PROP Statistic

The proportion (PROP) of students who answer each alternative and who omit the item is printed in the

first row below the matrix. The item difficulty is the proportion of subjects in a sample who correctly

answer the item. In order to obtain maximum spread of student scores it is best to use items with

moderate difficulties. Moderate difficulty can be defined as the point halfway between perfect score and

chance score. For a five choice item, moderate difficulty level is .60, or a range between .50 and .70

(because 100% correct is perfect and we would expect 20% of the group to answer the item correctly by

blind guessing).
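The midpoint rule above can be written as a one-line calculation. This is a minimal sketch; the function name `moderate_difficulty` is an assumption made for illustration and is not part of the ITEM ANALYSIS output.

```python
def moderate_difficulty(n_choices):
    """Midpoint between a perfect score (1.0) and the chance score (1/n_choices)."""
    chance = 1.0 / n_choices
    return (1.0 + chance) / 2.0

# Two-, four-, and five-choice items
for k in (2, 4, 5):
    print(k, round(moderate_difficulty(k), 3))
```

For a five-choice item this gives the .60 quoted above; the fewer the alternatives, the higher the chance score and therefore the higher the target difficulty.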

Evaluating Item Difficulty. For the most part, items which are too easy or too difficult cannot discriminate

adequately between student performance levels. Item 2 in the sample output is an exception; although

the item difficulty is .23, the item is a good, discriminating one. In item 4, everyone correctly answered the

item; the item difficulty is 1.00. Such an item does not discriminate at all between good and poor students,

and therefore does not contribute statistically to the effectiveness of the examination. However, if one of

the instructor's goals is to check that all students grasp certain basic concepts and if the examination is

long enough to contain a sufficient number of discriminating items, then such an item may remain on the

examination.

C. Item Discrimination: Point Biserial Correlation (RPBI)

Interpreting the RPBI Statistic. The point biserial correlation (RPBI) for each alternative and omit is

printed below the PROP row. It indicates the relationship between the item response and the total test

score within the group tested, i.e., it measures the discriminating power of an item. It is interpreted

similarly to other correlation coefficients. Assuming that the total test score accurately discriminates

among individuals in the group tested, then high positive RPBI's for the correct responses would

represent the most discriminating items. That is, students who answered the correct response scored well

on the examination, whereas students who did not answer the correct response did not score well on the

examination. It is also interesting to check the RPBI's for the item distractors, or incorrect alternatives.

The opposite correlation between total score and choice of alternative is expected for the incorrect vs. the

correct alternative. Where a high positive correlation is desired for the RPBI of a correct alternative,

a high negative correlation is good for the RPBI of a distractor, i.e., students who answer with an

incorrect alternative did not score well on the total examination. Due to restrictions incurred when

correlating a continuous variable (total examination score) with a dichotomous variable (response vs

nonresponse of an alternative), the highest possible RPBI is .80 instead of the usual maximum value of

1.00 for a correlation. This maximum RPBI is directly influenced by the item difficulty level. The maximum

RPBI value of .80 occurs with items of moderate difficulty level; the further the difficulty level deviates

from the moderate difficulty level in either direction, the lower the ceiling on RPBI. For example, the

maximum RPBI is about .58 for difficulty levels of .10 or .90. Therefore, in order to maximize item

discrimination, items of moderate difficulty level are preferred, although easy and difficult items still can be

discriminating (see item 2 in the sample output).
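The sketch below shows one way the RPBI for a single alternative could be computed, together with the ceiling it can reach at a given difficulty level. The function names are hypothetical, and the ceiling formula assumes that total scores are normally distributed; neither is taken from the package itself.

```python
from math import sqrt
from statistics import NormalDist, mean, pstdev

def point_biserial(chose, totals):
    """Correlation between a 0/1 response indicator and total test scores.

    chose  -- list of 1 (student picked this alternative) or 0 (did not)
    totals -- list of total test scores, in the same order as `chose`
    """
    p = mean(chose)
    if p in (0, 1) or pstdev(totals) == 0:
        return 0.0  # no variation, so the correlation is reported as zero here
    mean_chose = mean(t for c, t in zip(chose, totals) if c == 1)
    mean_other = mean(t for c, t in zip(chose, totals) if c == 0)
    return (mean_chose - mean_other) / pstdev(totals) * sqrt(p * (1 - p))

def rpbi_ceiling(p):
    """Largest point biserial attainable at difficulty p.

    Assumption of this sketch: total scores follow a normal distribution.
    """
    z = NormalDist().inv_cdf(p)
    return NormalDist().pdf(z) / sqrt(p * (1 - p))

# Hypothetical data: six students, the three highest scorers chose the alternative.
print(round(point_biserial([1, 1, 1, 0, 0, 0], [9, 8, 7, 3, 2, 1]), 2))
print(round(rpbi_ceiling(0.50), 2), round(rpbi_ceiling(0.10), 2))
```

Under the normality assumption the ceiling works out to roughly .80 at a difficulty of .50 and roughly .58 at .10 or .90, which agrees with the values quoted above.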

Evaluating Item Discrimination. When an instructor examines the item analysis data, the RPBI is an

important indicator in deciding which items are discriminating and should be retained, and which items

are not discriminating and should be revised or replaced by a better item (other content considerations

aside). The quintile graph also illustrates this same relationship between item response and total scores.

However, the RPBI is a more accurate representation of this relationship. An item with a RPBI of .25 or

below should be examined critically for revision or deletion; items with RPBIs of .40 and above are good

discriminators. Note that all items, not only those with RPBIs lower than .25, can be improved. An

examination of the matrix of responses by fifths for all items may point out weaknesses, such as

implausible distractors, that can be reduced by modifying the item.

It is important to keep in mind that the statistical functioning of an item should not be the sole basis for

deleting or retaining an item. The most important quality of a classroom test is its validity, the extent to

which items measure relevant tasks. Items that perform poorly statistically might be retained (and perhaps

revised) if they correspond to specific instructional objectives in the course. Items that perform well

statistically but are not related to specific instructional objectives should be reviewed carefully before

being reused.

References

Ebel, R. L., & Frisbie, D. A. (1986). Essentials of educational measurement (4th ed.). Englewood Cliffs,

NJ: Prentice-Hall.

Guilford, J. P. (1954). Psychometric methods. New York: McGraw-Hill.

Gronlund, N. E., & Linn, R. L. (1990). Measurement and evaluation in teaching (6th ed.). New York: Macmillan.

Osterlind, S. J. (1989). Constructing test items. Norwell, MA: Kluwer Academic Publishers.

Thorndike, R. L., & Hagen, E. (1969). Measurement and evaluation in psychology and education (3rd

ed., Chapters 4, 6). New York: John Wiley & Sons.

* * * MERMAC -- TEST ANALYSIS AND QUESTIONNAIRE PACKAGE * * *

ITEM 1 PERCENT OF CORRECT RESPONSE BY FIFTHS MATRIX OF RESPONSES BY

FIFTHS E IS CORRECT RESPONSE

A B C D (E) OMIT

1ST + * 1ST 0 25 1 0 102 0

2ND + * 2ND 1 45 6 0 75 0

3RD + * 3RD 1 63 5 3 49 0

4TH + * 4TH 2 76 9 0 34 0

5TH + * 5TH 11 73 13 4 5 0

+----+----+----+----+----+----+----+----+----+

0 10 20 30 40 50 60 70 80 90 100 PROP 0.02 0.47 0.06 0.01 (0.44) 0.00

RPBI -0.20 -0.33 -0.20 -0.13 (0.51) 0.00

ITEM 2 PERCENT OF CORRECT RESPONSE BY FIFTHS MATRIX OF RESPONSES BY

FIFTHS A IS CORRECT RESPONSE

(A) B C D E OMIT

1ST + * 1ST 83 35 10 0 0 0

2ND + * 2ND 19 85 23 0 0 0

3RD + * 3RD 17 67 37 0 0 0

4TH + * 4TH 13 78 30 0 0 0

5TH + * 5TH 6 84 16 0 0 0

+----+----+----+----+----+----+----+----+----+

0 10 20 30 40 50 60 70 80 90 100 PROP (0.23) 0.57 0.19 0.00 0.00 0.00

RPBI (0.43) -0.33 -0.05 0.00 0.00 0.00

ITEM 3 PERCENT OF CORRECT RESPONSE BY FIFTHS MATRIX OF RESPONSES BY

FIFTHS E IS CORRECT RESPONSE

A B C D (E) OMIT

1ST * 1ST 2 125 0 1 0 0

2ND +* 2ND 6 109 0 8 4 0

3RD + * 3RD 14 86 4 7 10 0

4TH + * 4TH 23 71 2 19 6 0

5TH + * 5TH 29 45 8 15 8 1

+----+----+----+----+----+----+----+----+----+

0 10 20 30 40 50 60 70 80 90 100 PROP 0.12 0.72 0.02 0.08 (0.05) 0.00

RPBI -0.24 0.45 -0.16 -0.17 (0.13) -0.14

ITEM 4 PERCENT OF CORRECT RESPONSE BY FIFTHS MATRIX OF RESPONSES BY

FIFTHS E IS CORRECT RESPONSE

A B C D (E) OMIT

1ST + * 1ST 0 0 0 0 128 0

2ND + * 2ND 0 0 0 0 127 0

3RD + * 3RD 0 0 0 0 121 0

4TH + * 4TH 0 0 0 0 121 0

5TH + * 5TH 0 0 0 0 106 0

+----+----+----+----+----+----+----+----+----+

0 10 20 30 40 50 60 70 80 90 100 PROP 0.00 0.00 0.00 0.00 (1.00) 0.00

RPBI 0.00 0.00 0.00 0.00 (0.00) 0.00
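To show how the PROP row and the quintile graph relate to the matrix of responses, the following sketch recomputes both for Item 1 from the counts printed in the sample output. The function names are assumptions made for this illustration; the counts are copied from the output.

```python
# Item 1 matrix of responses by fifths, copied from the sample output.
# Columns: A, B, C, D, E, OMIT; E is the keyed response.
ALTERNATIVES = ["A", "B", "C", "D", "E", "OMIT"]
ITEM_1 = {
    "1ST": [0, 25, 1, 0, 102, 0],
    "2ND": [1, 45, 6, 0, 75, 0],
    "3RD": [1, 63, 5, 3, 49, 0],
    "4TH": [2, 76, 9, 0, 34, 0],
    "5TH": [11, 73, 13, 4, 5, 0],
}

def prop_row(matrix):
    """Proportion of all students choosing each alternative (the PROP row)."""
    total = sum(sum(row) for row in matrix.values())
    column_sums = [sum(col) for col in zip(*matrix.values())]
    return [round(c / total, 2) for c in column_sums]

def percent_correct_by_fifth(matrix, key):
    """Percent of each fifth choosing the keyed alternative (the quintile graph)."""
    k = ALTERNATIVES.index(key)
    return {fifth: round(100 * row[k] / sum(row)) for fifth, row in matrix.items()}

print(prop_row(ITEM_1))
print(percent_correct_by_fifth(ITEM_1, "E"))
```

The computed proportions reproduce the PROP row printed for Item 1, and the percent correct falls steadily from the first fifth to the fifth, which is the pattern expected of a discriminating item.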

Purpose of Item Analysis

OK, you now know how to plan a test and build a test

Now you need to know how to do ITEM ANALYSIS

--> looks complicated at first glance, but actually quite simple

-->even I can do this and I'm a mathematical idiot

Talking about norm-referenced, objective tests

mostly multiple-choice but same principles for true-false, matching and short answer

by analyzing results you can refine your testing

SERVES SEVERAL PURPOSES

1. Fix marks for current class that just wrote the test

o find flaws in the test so that you can adjust the mark before return to

students

o can find questions with two right answers, or that were too hard, etc., that

you may want to drop from the exam

even had to do that occasionally on Diploma exams, even after 36

months in development, maybe 20 different reviewers, extensive field

tests, still occasionally have a question whose problems only become

apparent after you give the test

more common on classroom tests -- but instead of getting defensive,

or making these decisions at random on basis of which of your

students can argue with you, do it scientifically

2. More diagnostic information on students

o another immediate payoff of item analysis

Classroom level:

o will tell which questions they are all guessing on, or if you find a

question which most of them found very difficult, you can reteach that

concept

o CAN do item analysis on pretests to:

so if you find a question they all got right, don't waste more time on

this area

find the wrong answers they are choosing to identify common

misconceptions

can't tell this just from score on total test, or class average

Individual level:

o isolate specific errors this child made

o after you've planned these tests, written perfect questions, and now analyzed

the results, you're going to know more about these kids than they know

themselves

3. Build future tests, revise test items to make them better

o REALLY pays off second time you teach the same course

by now you know how much work writing good questions is

studies have shown us that it is FIVE times faster to revise items that

didn't work, using item analysis, than trying to replace it with a

completely new question

new item which would just have new problems anyway

--> this way you eventually get perfect items, the envy of your

neighbours

o SHOULD NOT REUSE WHOLE TESTS --> diagnostic teaching means that

you are responding to needs of your students, so after a few years you build

up a bank of test items you can custom make tests for your class

know what class average will be before you even give the test because

you will know approximately how difficult each item is before you use

it;

can spread difficulty levels across your blueprint too...

4. Part of your continuing professional development

o doing the occasional item analysis will help teach you how to become a better

test writer

o and you're also documenting just how good your evaluation is

o useful for dealing with parents or principals if there's ever a dispute

o once you start bringing out all these impressive looking stats parents and

administrators will believe that maybe you do know what you're talking about

when you fail students...

o parent says, "I think your question stinks",

well, "according to the item analysis, this question appears to have worked

well -- it's your son that stinks"

(just kidding! --actually, face validity takes priority over stats any day!)

o and if the analysis shows that the question does stink, you've already

dropped it before you've handed it back to the student, let alone the parent

seeing it...

5. Before and After Pictures

o long term payoff

o collect this data over ten years, not only get great item bank, but if you

change how you teach the course, you can find out if innovation is working

o if you have a strong class (as compared to provincial baseline) but they do

badly on same test you used five years ago, the new textbook stinks.

ITEM ANALYSIS is one area where even a lot of otherwise very good classroom teachers

fall down

they think they're doing a good job; they think they're doing good evaluation, but

without doing item analysis, they can't really know

part of being a professional is going beyond the illusion of doing a good job to

finding out whether you really are

but it's something a lot of teachers just don't know HOW to do

do it indirectly when kids argue with them...wait for complaints from students,

students' parents and maybe other teachers...

ON THE OTHER HAND....

I do realize that I am advocating here more work for you in the short term, but, it

will pay off in the long term

But realistically:

*Probably only doing it for your most important tests

end of unit tests, final exams --> summative evaluation

especially if you're using common exams with other teachers

common exams give you bigger sample to work with, which is good

makes sure that questions other teacher wrote are working for YOUR class

maybe they taught different stuff in a different way

impress the daylights out of your colleagues

*Probably only doing it for test questions you are likely going to reuse next year

*Spend less time on item analysis than on revising items

item analysis is not an end in itself,

no point unless you use it to revise items,

and help students on basis of information you get out of it

I also find that, if you get into it, it is kind of fascinating. When stats turn out well, it's

objective, external validation of your work. When stats turn out differently than you

expect, it becomes a detective mystery as you figure out what went wrong.

But you'll have to take my word on this until you try it on your own stuff.

Eight Simple Steps to Item Analysis

1. Score each answer sheet, write score total on the corner

o obviously have to do this anyway

2. Sort the pile into rank order from top to bottom score

(1 minute, 30 seconds tops)

3. If normal class of 30 students, divide class in half

o same number in top and bottom group:

o toss middle paper if odd number (put aside)

4. Take 'top' pile, count number of students who responded to each alternative

o fast way is simply to sort piles into "A", "B", "C", "D" // or true/false or type

of error you get for short answer, fill-in-the-blank

OR set up on spread sheet if you're familiar with computers

ITEM ANALYSIS FORM TEACHER CONSTRUCTED TESTS

CLASS SIZE = 30

ITEM UPPER LOWER DIFFERENCE D TOTAL DIFFICULTY

1. A 0

*B 4

C 1

D 1

O

*=Keyed Answer

o repeat for lower group

ITEM ANALYSIS FORM TEACHER CONSTRUCTED TESTS

CLASS SIZE = 30

ITEM UPPER LOWER DIFFERENCE D TOTAL DIFFICULTY

1. A 0

*B 4 2

C 1

D 1

O

*=Keyed Answer

o this is the time consuming part --> but not that bad, can do it while watching

TV, because you're just sorting piles

THREE POSSIBLE SHORT CUTS HERE (STEP 4)

(A) If you have a large sample of around 100 or more, you can cut down the sample

you work with

o take top 27% (27 out of 100); bottom 27% (so only dealing with 54, not all

100)

o put middle 46 aside for the moment

larger the sample, more accurate, but have to trade off against labour;

using top 1/3 or so is probably good enough by the time you get to

100; --27% magic figure statisticians tell us to use

o I'd use halves at 30, but you could just use a sample of top 10 and bottom 10

if you're pressed for time

but it means a single student changes stats by 10%

trading off speed for accuracy...

o but I'd rather have you doing ten and ten than nothing
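A quick sketch of short cut (A), assuming the papers are already sorted from top to bottom score. The function names and the rounding rule for the group size are my own assumptions for illustration.

```python
def group_size(n_students, fraction=0.27):
    """Number of papers in each of the upper and lower groups."""
    return max(1, round(n_students * fraction))

def split_groups(sorted_scores, fraction=0.27):
    """Split scores (already ranked high to low) into upper, middle, lower."""
    k = group_size(len(sorted_scores), fraction)
    return sorted_scores[:k], sorted_scores[k:-k], sorted_scores[-k:]

def effect_of_one_student(group_n):
    """Change in a group's proportion correct when one student's answer flips."""
    return 1 / group_n

# Hypothetical class of 100, scores ranked from 100 down to 1
upper, middle, lower = split_groups(list(range(100, 0, -1)))
print(len(upper), len(middle), len(lower))
print(effect_of_one_student(10))
```

With 100 papers this leaves 27 in each outer group and 46 set aside; with groups of only ten, one student moves a group's proportion by a tenth, which is the speed-for-accuracy trade-off mentioned above.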

(B) Second short cut, if you have access to photocopier (budgets)

o photocopy answer sheets, cut off identifying info

(can't use if handwriting is distinctive)

o colour code high and low groups --> dab of marker pen color

o distribute randomly to students in your class so they don't know whose

answer sheet they have

o get them to raise their hands

for #6, how many have "A" on blue sheet?

how many have "B"; how many "C"

for #6, how many have "A" on red sheet....

o some reservations because they can screw you up if they don't take it

seriously

o another version of this would be to hire kid who cuts your lawn to do the

counting, provided you've removed all identifying information

I actually did this for a bunch of teachers at one high school in

Edmonton when I was in university for pocket money

(C) Third shortcut, IF you can't use separate answer sheet, sometimes faster to type

than to sort

SAMPLE OF TYPING FORMAT

FOR ITEM ANALYSIS

ITEM # 1 2 3 4 5 6 7 8 9 10

KEY T F T F T A D C A B

STUDENT

Kay T T T F F A D D A C

Jane T T T F T A D C A D

John F F T F T A D C A B

o type name; then T or F, or A,B,C,D == all left hand on typewriter, leaving

right hand free to turn pages (from Sax)

o IF you have a computer program -- some kicking around -- will give you all

stats you need, plus bunches more you don't-- automatically after this stage
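If you do type the answers in, a few lines of code can score them for you. This is only a sketch: it assumes the typed format shown above (one row per student, name first, answers separated by spaces), and the function names are made up for this example.

```python
TYPED = """\
KEY  T F T F T A D C A B
Kay  T T T F F A D D A C
Jane T T T F T A D C A D
John F F T F T A D C A B
"""

def parse_typed(text):
    """Return (key, {student: answers}) from rows of 'name answer answer ...'."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    key = rows[0][1:]
    students = {row[0]: row[1:] for row in rows[1:]}
    return key, students

def score(key, answers):
    """Count the answers that match the key."""
    return sum(a == k for a, k in zip(answers, key))

key, students = parse_typed(TYPED)
totals = {name: score(key, answers) for name, answers in students.items()}
print(totals)
```

Once the totals exist you can sort on them to form the upper and lower groups without shuffling any paper.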

OVERHEAD: SAMPLE ITEM ANALYSIS FOR CLASS OF 30 (PAGE #1) (in text)

5. Subtract the number of students in lower group who got question

right from number of high group students who got it right

o quite possible to get a negative number

ITEM ANALYSIS FORM TEACHER CONSTRUCTED TESTS

CLASS SIZE = 30

ITEM UPPER LOWER DIFFERENCE D TOTAL DIFFICULTY

1. A 0

*B 4 2 2

C 1

D 1

O

*=Keyed Answer

6. Divide the difference by number of students in upper or lower group

o in this case, divide by 15

o this gives you the "discrimination index" (D)

ITEM ANALYSIS FORM TEACHER CONSTRUCTED TESTS

CLASS SIZE = 30

ITEM UPPER LOWER DIFFERENCE D TOTAL DIFFICULTY

1. A 0

*B 4 2 2 0.333

C 1

D 1

O

*=Keyed Answer

7. Total number who got it right

ITEM ANALYSIS FORM TEACHER CONSTRUCTED TESTS

CLASS SIZE = 30

ITEM UPPER LOWER DIFFERENCE D TOTAL DIFFICULTY

1. A 0

*B 4 2 2 0.333 6

C 1

D 1

O

*=Keyed Answer

8. If you have a large class and were only using the 1/3 sample for top and

bottom groups, then you have to NOW count number of middle group who

got each question right (not each alternative this time, just right answers)

9. Sample Form Class Size= 100.

o if class of 30, upper and lower half, no other column here

10. Divide total by total number of students

o difficulty = (proportion who got it right (p) )

ITEM ANALYSIS FORM TEACHER CONSTRUCTED TESTS

CLASS SIZE = 30

ITEM UPPER LOWER DIFFERENCE D TOTAL DIFFICULTY

1. A 0

*B 4 2 2 0.333 6 .42

C 1

D 1

O

*=Keyed Answer

11. You will NOTE the complete lack of complicated statistics --> counting,

adding, dividing --> no tricky formulas required for this

o not going to worry about corrected point biserials etc.

o one of the advantages of using fixed number of alternatives
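The whole procedure (sort, split in half, count, subtract, divide) can be sketched in a few lines for anyone who would rather use a computer than sort piles. The function name, the data layout, and the made-up class of 30 below are all assumptions for illustration; the difficulty here is taken over the papers actually counted, so a tossed middle paper is left out.

```python
def item_analysis(papers, keyed):
    """Upper/lower-half analysis of one item.

    papers -- list of (total_score, answer_to_this_item) pairs, one per student
    keyed  -- the keyed (correct) answer for this item
    """
    ranked = sorted(papers, key=lambda paper: paper[0], reverse=True)  # sort the pile
    half = len(ranked) // 2                          # middle paper tossed if odd
    upper = ranked[:half]
    lower = ranked[len(ranked) - half:]
    upper_right = sum(1 for _, answer in upper if answer == keyed)
    lower_right = sum(1 for _, answer in lower if answer == keyed)
    difference = upper_right - lower_right           # can be negative
    total_right = upper_right + lower_right
    return {
        "upper": upper_right,
        "lower": lower_right,
        "difference": difference,
        "D": difference / half,                      # discrimination index
        "total": total_right,
        "difficulty": total_right / (2 * half),      # proportion correct (p)
    }

# Hypothetical class of 30: 10 of the top 15 and 5 of the bottom 15 answer "C".
papers = [(30 - i, "C" if i < 10 else "A") for i in range(15)]
papers += [(15 - i, "C" if i < 5 else "A") for i in range(15)]
print(item_analysis(papers, "C"))
```

For a large class using only the top and bottom groups, the difficulty line would also need the middle group's right answers added to the total and the whole class in the divisor.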

Interpreting Item Analysis

Let's look at what we have and see what we can see

90% of item analysis is just common sense...

1. Potential Miskey

2. Identifying Ambiguous Items

3. Equal distribution to all alternatives.

4. Alternatives are not working

5. Distracter too attractive.

6. Question not discriminating.

7. Negative discrimination.

8. Too Easy.

9. Omit.

10. &11. Relationship between D index and Difficulty (p).

o Item Analysis of Computer Printouts

.

1. What do we see looking at this first one? [Potential Miskey]

Upper Low Difference D Total Difficulty

1. *A 1 4 -3 -.2 5 .17

B 1 3

C 10 5

D 3 3

O <----means omit or no answer

o #1, more high group students chose C than A, even though A is supposedly

the correct answer

o more low group students chose A than high group so got negative

discrimination;

o only 17% of class got it right

o most likely you just wrote the wrong answer key down

--> this is an easy and very common mistake for you to make

better you find out now before you hand back than when kids complain

OR WORSE, they don't complain, and teach themselves your

miskey as the "correct" answer

o so check it out and rescore that question on all the papers before handing

them back

o Makes it 10-5 Difference = 5; D=.33; Total = 15; difficulty=.50

--> nice item

OR:

o you check and find that you didn't miskey it --> that is the answer you

thought

two possibilities:

1. one possibility is that you made slip of the tongue and taught them the

wrong answer

anything you say in class can be taken down and used against

you on an examination....

2. more likely means even "good" students are being tricked by a

common misconception -->

You're not supposed to have trick questions, so may want to dump it

--> give those who got it right their point, but total rest of the

marks out of 24 instead of 25

If scores are high, or you want to make a point, might let it stand, and then teach to

it --> sometimes if they get caught, will help them to remember better in future

such as:

o very fine distinctions

o crucial steps which are often overlooked

REVISE it for next time to weaken "C"

-- alternatives are not supposed to draw more than the keyed answer

-- almost always an item flaw, rather than useful distinction
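To check a suspected miskey on item #1, you can recompute the statistics under each candidate key from the same counts. A minimal sketch, using the upper and lower counts from the table above; the function name and the dictionary layout are assumptions for illustration.

```python
# Counts for item #1 from the table above (15 students in each group).
UPPER = {"A": 1, "B": 1, "C": 10, "D": 3}
LOWER = {"A": 4, "B": 3, "C": 5, "D": 3}
GROUP_N = 15

def stats_for_key(upper, lower, key, group_n):
    """Difference, D, total and difficulty if `key` were the keyed answer."""
    difference = upper[key] - lower[key]
    total = upper[key] + lower[key]
    return {
        "difference": difference,
        "D": round(difference / group_n, 2),
        "total": total,
        "difficulty": round(total / (2 * group_n), 2),
    }

print("keyed as A:", stats_for_key(UPPER, LOWER, "A", GROUP_N))
print("keyed as C:", stats_for_key(UPPER, LOWER, "C", GROUP_N))
```

A negative D under the written key alongside a healthy positive D under another alternative is the pattern that suggests checking the answer key first.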

2. What can we see with #2: [Can identify ambiguous items]

Upper Low Difference D Total Difficulty

2. A 6 5

B 1 2

*C 7 5 2 .13 12 .40

D 1 3

O

o #2, about equal numbers of top students went for A and the keyed answer, C.

Suggests they couldn't tell which was correct

either, students didn't know this material (in which case you can

reteach it)

or the item was defective --->

o look at their favorite alternative again, and see if you can find any reason

they could be choosing it

o often items that look perfectly straight forward to adults are ambiguous to

students

Favorite examples of ambiguous items.

o if you NOW realize that A was a defensible answer, rescore before you hand

it back to give everyone credit for either A or C -- avoids arguing with you in

class

o if it's clearly a wrong answer, then you now know which error most of your

students are making to get wrong answer

o useful diagnostic information on their learning, your teaching

3. Equally to all alternatives

Upper Low Difference D Total Difficulty

3. A 4 3

B 3 4

*C 5 4 1 .06 9 .30

D 3 4

O

o item #3, students respond about equally to all alternatives

o usually means they are guessing

Three possibilities:

0. may be material you didn't actually get to yet

you designed test in advance (because I've convinced you to

plan ahead) but didn't actually get everything covered before

holidays....

or item on a common exam that you didn't stress in your class

1. item so badly written students have no idea what you're asking

2. item so difficult students just completely baffled

o review the item:

if badly written ( by other teacher) or on material your class hasn't

taken, toss it out, rescore the exam out of lower total

BUT give credit to those that got it, to a total of 100%

if seems well written, but too hard, then you know to (re)teach this

material for rest of class....

maybe the 3 who got it are top three students,

tough but valid item:

OK, if item tests valid objective

want to provide occasional challenging question for top

students

but make sure you haven't defined "top 3 students" as "those

able to figure out what the heck I'm talking about"

4. Alternatives aren't working

Upper Low Difference D Total Difficulty

4. A 1 5

*B 14 7 7 .47 21 .70

C 0 2

D 0 0

O

o example #4 --> no one fell for D --> so it is not a plausible alternative

o question is fine for this administration, but revise item for next time

o toss alternative D, replace it with something more realistic

o each distracter has to attract at least 5% of the students

class of 30, should get at least two students

o or might accept one if you positively can't think of another fourth alternative -

- otherwise, do not reuse the item

if two alternatives don't draw any students --> might consider redoing as

true/false
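The five percent rule is easy to check mechanically. A sketch using the counts for item #4 above; the function name and the choice to round the threshold up to a whole student are assumptions for illustration.

```python
import math

def weak_distracters(upper, lower, key, class_size, min_share=0.05):
    """Distracters chosen by fewer than min_share of the class."""
    needed = math.ceil(min_share * class_size)  # 5% of 30 -> at least 2 students
    return [alt for alt in upper
            if alt != key and upper[alt] + lower[alt] < needed]

# Item #4 from the table above
upper = {"A": 1, "B": 14, "C": 0, "D": 0}
lower = {"A": 5, "B": 7, "C": 2, "D": 0}
print(weak_distracters(upper, lower, key="B", class_size=30))
```

Here only D is flagged; C just scrapes by with two students, all of them from the lower group.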

5. Distracter too attractive

Upper Low Difference D Total Difficulty

5. A 7 10

B 1 2

C 1 1

*D 5 2 3 .20 7 .23

O

o sample #5 --> too many going for A

--> no ONE distracter should get more than key

--> no one distracter should pull more than about half of students

-- doesn't leave enough for correct answer and five percent for each

alternative

o keep for this time

o weaken it for next time
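The two rules of thumb for an over-attractive distracter (it should not outdraw the key, and it should not pull more than about half the class) can be sketched the same way. The function name and data layout are hypothetical; the counts are those of item #5 above.

```python
def overly_attractive(upper, lower, key, class_size):
    """Distracters that outdraw the key or pull more than half the class."""
    key_total = upper[key] + lower[key]
    flagged = []
    for alt in upper:
        if alt == key:
            continue
        drawn = upper[alt] + lower[alt]
        if drawn > key_total or drawn > class_size / 2:
            flagged.append(alt)
    return flagged

# Item #5 from the table above
upper = {"A": 7, "B": 1, "C": 1, "D": 5}
lower = {"A": 10, "B": 2, "C": 1, "D": 2}
print(overly_attractive(upper, lower, key="D", class_size=30))
```

Alternative A is flagged on both counts: it draws more students than the key and more than half the class.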

6. Question not discriminating

Upper Low Difference D Total Difficulty

6. *A 7 7 0 .00 14 .47

B 3 2

C 2 1

D 3 5

O

o sample #6: low group gets it as often as high group

o on norm-referenced tests, point is to rank students from best to worst

o so individual test items should have good students get question right, poor

students get it wrong

o test overall decides who is a good or poor student on this particular topic

those who do well have more information, skills than those who do

less well

so if on a particular question those with more skills and knowledge do

NOT do better, something may be wrong with the question

o question may be VALID, but off topic

E.G.: rest of test tests thinking skill, but this is a memorization

question, skilled and unskilled equally as likely to recall the answer

should have homogeneous test --> don't have a math item in with

social studies

if wanted to get really fancy, should do separate item analysis for each

cell of your blueprint...as long as you had six items per cell

o question is VALID, on topic, but not RELIABLE

addresses the specified objective, but isn't a useful measure of

individual differences

asking Grade 10s Capital of Canada is on topic, but since they will all

get it right, won't show individual differences -- give you low D

7. Negative Discrimination

Upper Low Difference D Total Difficulty

7. *A 7 10 -3 -.20 17 .57

B 3 3

C 2 1

D 3 1

O

o D (discrimination) index is just upper group minus lower group

o varies from +1.0 to -1.0

o if all top got it right, all lower got it wrong = 100% = +1

o if more of the bottom group get it right than the top group, you get a

negative D index

o if you have a negative D, means that students with less skills and knowledge

overall, are getting it right more often than those who the test says are better

overall

o in other words, the better you are, the more likely you are to get it wrong

WHAT COULD ACCOUNT FOR THAT?

Two possibilities:

o usually means an ambiguous question

that is confusing good students, but weak students too weak to see

the problem

look at question again, look at alternatives good students are going

for, to see if you've missed something

OR:

o or it might be off topic

--> something weaker students are better at (like rote memorization) than

good students

--> not part of same set of skills as rest of test--> suggests design flaw with

table of specifications perhaps

((-if you end up with a whole bunch of -D indices on the same test, must mean you

actually have two different distinct skills, because by definition, the low group is the

high group on that bunch of questions

--> end up treating them as two separate tests))

o if you have a large enough sample (like the provincial exams) then we toss

the item and either don't count it or give everyone credit for it

o with sample of 100 students or less, could just be random chance, so

basically ignore it in terms of THIS administration

kids wrote it, give them mark they got

o furthermore, if you keep dropping questions, may find that you're starting to

develop serious holes in your blueprint coverage -- problem for sampling

but you want to track this stuff FOR NEXT TIME

o if it's negative on administration after administration, consistently, likely not

random chance, it's screwing up in some way

o want to build your future tests out of those items with high positive D indices

o the higher the average D indices on the test, the more RELIABLE the

test as a whole will be

o revise items to increase D

-->if good students are selecting one particular wrong alternative,

make it less attractive

-->or increase probability of their selecting right answer by making it

more attractive

o may have to include some items with negative Ds if those are the only items

you have for that specification, and it's an important specification

what this means is that there are some skills/knowledge in this unit

which are unrelated to rest of the skills/knowledge

--> but may still be important

o e.g., statistics part of this course may be terrible on those students who are

the best item writers, since writing tends to be associated with the opposite

hemisphere in the brain than math, right... but still important objective in this

course

may lower reliability of test, but increases content validity

8. Too Easy

Upper Low Difference D Total Difficulty

8. A 0 1

*B 14 13 1 .06 27 .90

C 0 1

D 1 1

O

o too easy or too difficult won't discriminate well either

o difficulty (p) (for proportion) varies from +1.0 (everybody got it right) to 0

(nobody)

REMEMBER: THE HIGHER THE DIFFICULTY INDEX, THE EASIER THE

QUESTION

o if the item is NOT miskeyed or some other glaring problem, it's too late to

change after administered --> everybody got it right, OK, give them the mark

TOO DIFFICULT = 30 to 35% (used to be rule in Branch, now not...)

o if the item is too difficult, don't drop it, just because everybody missed it -->

you must have thought it was an important objective or it wouldn't have been

on there;

o and unless literally EVERYONE missed it, what do you do with the students

who got it right?

o give them bonus marks?

o cheat them of a mark they got?

furthermore, if you drop too many questions, lose content validity (specs)

--> if two or three got it right may just be random chance,

so why should they get a bonus mark

o however, DO NOT REUSE questions with too high or low difficulty (p) values

in future

if difficulty is over 85%, you're wasting space on limited item test

o asking Grade 10s the Capital of Canada is probably waste of their time and

yours --> unless this is a particularly vital objective

o same applies to items which are too difficult --> no use asking Grade 3s to

solve quadratic equation

o but you may want to revise question to make it easier or harder rather than

just toss it out cold

OR SOME EXCEPTIONS HERE:

You may have consciously decided to develop a "Mastery" style test

--> will often have very easy questions and expect everyone to get

everything; trying to identify only those who are not ready to go on

--> in which case, don't use any question which DOES NOT have a

difficulty level above 85% or whatever

Or you may want a test to identify the top people in class, the reach for the

top team, and design a whole test of really tough questions

--> have low difficulty values (i.e., very hard)

o so depends a bit on what you intend to do with the test in question

o this is what makes the difficulty index (proportion) so handy

14. you create a bank of items over the years

--> using item analysis you get better questions all the time, until you

have a whole bunch that work great

-->can then tailor-make a test for your class

you want to create an easier test this year, you pick questions with

higher difficulty (p) values;

you want to make a challenging test for your gifted kids, choose items

with low difficulty (p) values

--> for most applications will want to set difficulty level so that it gives

you average marks, nice bell curve

government uses 62.5 --> four item multiple choice, middle of

bell curve,

15. start tests with an easy question or two to give students a running

start

16. make sure that the difficulty levels are spread out over examination

blueprint

not all hard geography questions, easy history

unfair to kids who are better at geography, worse at history

turns class off geography if they equate it with tough questions

-->REMEMBER here that difficulty is different than complexity,

Bloom

so can have difficult recall knowledge question, easy synthesis

synthesis and evaluation items will tend to be harder than recall

questions so if find higher levels are more difficult, OK, but try to

balance cells as much as possible

certainly content cells should be roughly the same
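The 62.5 figure quoted above for four-option multiple-choice items can be reproduced with a little arithmetic. The sketch below assumes the figure is the midpoint between the chance score and a perfect score; that reading is an interpretation of the notes, and the function name is made up for illustration.

```python
def target_difficulty(n_options: int) -> float:
    """Midpoint between the chance score and a perfect score (assumed rule)."""
    chance = 1.0 / n_options              # expected p from blind guessing
    return chance + (1.0 - chance) / 2.0  # halfway between chance and 1.0


print(target_difficulty(4))  # four-option item -> 0.625
```

Under the same assumption a true/false item would have a target of 0.75, since guessing alone already yields 0.50.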

9. OMIT

       Upper   Low   Difference    D     Total   Difficulty

9.   A    2      1
     B    3      4
    *C    7      3       4        .26     10        .33
     D    1      1
     O    2      4

If near end of the test

--> they didn't find it because it was on the next page

--format problem

OR

--> your test is too long, 6 of them (20%) didn't get to it

OR, if middle of the test:

--> totally baffled them because:

way too difficult for these guys

or because also 2 from high group too: ambiguous wording

10 & 11. RELATIONSHIP BETWEEN D INDEX AND DIFFICULTY (p)

       Upper   Low   Difference    D     Total   Difficulty

10.  A    0      5
    *B   15      0      15       1.0      15        .50
     C    0      5
     D    0      5
     O

---------------------------------------------------

11.  A    3      2
    *B    8      7       1        .06     15        .50
     C    2      3
     D    2      3
     O

o 10 is a perfect item --> each distracter gets at least 5

discrimination index is +1.0

(ACTUALLY PERFECT ITEM WOULD HAVE DIFFICULTY OF 65% TO ALLOW

FOR GUESSING)

o high discrimination D indices require optimal levels of difficulty

o but optimal levels of difficulty do not assure high levels of D

o 11 has same difficulty level, different D

on four item multiple-choice, student doing totally by chance will get

25%
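The D and difficulty columns in the item tables above can be recomputed from the Upper and Low counts. This is a minimal sketch, assuming 15 students in each of the upper and lower groups (inferred from the tables, where 27 correct out of 30 gives .90); the function and variable names are illustrative only.

```python
def item_stats(upper_correct, lower_correct, group_size):
    """Return (D, p) for one item from upper/lower group counts."""
    # D: difference in correct answers between groups, as a share of one group
    d_index = (upper_correct - lower_correct) / group_size
    # p: proportion of all students (both groups) who answered correctly
    difficulty = (upper_correct + lower_correct) / (2 * group_size)
    return d_index, difficulty


GROUP_SIZE = 15  # assumed: 30 students split into two groups of 15

# (item number, upper-group correct, lower-group correct) from the tables above
items = [(8, 14, 13), (9, 7, 3), (10, 15, 0), (11, 8, 7)]

for number, upper, lower in items:
    d_index, difficulty = item_stats(upper, lower, GROUP_SIZE)
    print(f"item {number}: D = {d_index:.2f}, p = {difficulty:.2f}")
```

Items 10 and 11 come out with the same p (.50) but very different D values, which is the point of the comparison. The tables appear to truncate rather than round, so a D of 1/15 shows there as .06 while this sketch prints 0.07.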

Program Evaluation

When your kids write the Diploma or Achievement Test, the Department sends out a printout of

how your class did compared to everybody else in the province

Three types of report:

1. ASSESSMENT HIGHLIGHTS (pamphlet)

o how are kids doing today in terms of meeting the standards?

o how are they doing compared to four years ago? eight years ago?

(monitor over time)

2. PROVINCIAL REPORT

o format keeps changing --> some years all tests in one book to save on paper

and mailing costs; other years each exam gets its own report

o tons of technical information (gender stuff, etc.)

3. JURISDICTION & SCHOOL REPORTS

(up to superintendent what happens to these after that

--> can publish in newspaper, keep secret central office only, etc.)

o get your hands on and interpret

o either you do it or someone else will do it for/to you

o better teachers take responsibility rather than top down

o new table formats are so easy to interpret no reason not to

o this means you can compare their responses to the responses of 30,000

students across the province

will help you calibrate your expectations for this class

is your particular class high or low one?

have you set your standards too high or too low?

giving everyone Cs because you think they ought to do better than

this, but they all ace the provincial tests?

Who Knows Where This is? OVERHEAD: SCHOOL TABLE 2 (June 92 GRADE 9 Math

Achievement)

o check table 2 for meeting standard of excellence

o Standards set by elaborate committee structure

This example (overhead): Your class had 17 students

Total test out of 49 (means test of 50, but dropped one after item analysis)

standard setting procedures decided that 42/49 is standard of EXCELLENCE

for grade 9s in Alberta

next column shows they expect 15% to reach this standard

standard setting procedure decided that 23 out of 49 is the Acceptable

standard; next column says expect 85% to reach that standard

columns at end of table show that actually, only 8.9% made standard of

excellence, and only 67.4% made acceptable standard

(bad news!)

but looking at YOUR class, 5.9, almost 6% made standard of excellence (so

fewer than province as a whole) but on the other hand, 76.5% meeting

acceptable standard.

Need comparison -- otherwise, fact that only get 6% to excellence might sound

bad...

Interpretation: either you only have one excellent math student in your class,

or you are teaching to acceptable standard, but not encouraging excellence?

BUT can use tables to look deeper,

o use tables to identify strengths and weaknesses in student learning

o and therefore identify your own strength and weaknesses

Problem solving & knowledge/skills broken down --> table of specs topics

Interestingly, though, above provincial on problem solving at excellence...

ASK: -how do you explain % meeting knowledge and % meeting problems both

higher than % meeting standard on whole test?

Answer: low correlation between performance on the two types of questions

(i.e., those who met standard on the one often did not on the other)

which means (a) can't assume that easy/hard = Bloom's taxonomy

and (b) that you have to give students both kinds of questions on your test

or you are being unfair to group who is better at the other stuff

Don't know where this is OVERHEAD: SCHOOL TABLE 5.1 (GRADE 9 MATH, JUNE 92)

o check tables 5.1 to 5.6 for particular areas of strengths and weaknesses

o look for every question where students in your school were 5% different on

keyed answer from those in provincial test

if 5% or more higher, is a particular strength of your program

if 5% or more lower, is a particular weakness

note that score on question irrelevant, only difference from rest of province

--> e.g., if you only got 45% but province only got 35%, then that's a

significant strength

--> the fact that less than 50% just means it was a really tough question, too

hard for this grade

o similarly, just because got 80% doesn't make your class good if province is

98%

if find all strengths or all weaknesses, where is the gap lowest?

least awful = strengths; least strong = weakness to work on

THIS EXAMPLE? all above provincial scores on these skills

converts a decimal into a fraction 76.5-60.9 = 15.6% above provincial norm

so decimal to fraction is a strength

all of these are good, but find least good --> that's the area to concentrate on

question 10: only 4.5% difference -- so your weak spot, one area you aren't

significantly above rest of province
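The 5% comparison rule described above can be sketched in a few lines. The percentages are the ones quoted in these notes (76.5 vs 60.9, 45 vs 35, 80 vs 98); the labels, function name and dictionary layout are assumptions made for illustration, not the format of the provincial tables.

```python
def classify_items(school_pct, province_pct, threshold=5.0):
    """Label each item by how far the school is from the provincial figure."""
    result = {}
    for item, school in school_pct.items():
        # Only the gap matters, not the raw score on the question
        diff = round(school - province_pct[item], 1)
        if diff >= threshold:
            label = "strength"
        elif diff <= -threshold:
            label = "weakness"
        else:
            label = "no clear difference"
        result[item] = (diff, label)
    return result


school = {"decimal to fraction": 76.5, "tough question": 45.0, "easy question": 80.0}
province = {"decimal to fraction": 60.9, "tough question": 35.0, "easy question": 98.0}

for item, (diff, label) in classify_items(school, province).items():
    print(f"{item}: {diff:+.1f} points -> {label}")
```

Note that the "tough question" counts as a strength at 45% while the "easy question" counts as a weakness at 80%, which matches the point that the raw score on a question is irrelevant.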

You can even begin to set standards in your class as province does

--> i.e., ask yourself BEFORE the test how many of these questions should

your class be able to do on this test?

Then look at actual performance.

How did my students do? Compared to what?

my classroom expectations

school's expectations

jurisdiction's expectations

provincial expectations

the last/previous test administered

community expectations

(each jurisdiction now has its own public advisory committee)

You can even create your own statistics to compare with provincial standard

o lots of teachers recycle Diploma and Achievement test questions, but they

only do it to prep kids for actual exam --> losing all that diagnostic info

HOWEVER: avoid comparisons between schools

o serves no useful purpose, has no logic since taken out of context

o e.g., comparing cancer clinic and walk in clinic --> higher death rate in cancer

clinic doesn't mean it's worse; may be best cancer clinic in the world, be doing

a great job given more serious nature of problems it faces

o invidious comparisons like this become "blaming" exercise

o self-fulfilling prophecy: parents pull kids from that school

Provincial authorities consider such comparisons a misuse of results

o school report = your class if only one class;

but if two or more classes, then we are talking about your school's program

--> forces you to get together with other teachers to find out what they're

doing

--> pool resources, techniques, strategies to address problem areas....

Reliability and Item Analysis

General Introduction

Basic Ideas

Classical Testing Model

Reliability

Sum Scales

Cronbach's Alpha

Split-Half Reliability

Correction for Attenuation

Designing a Reliable Scale

This topic discusses the concept of reliability of measurement as used in social sciences (but not in

industrial statistics or biomedical research). The term reliability used in industrial statistics denotes a

function describing the probability of failure (as a function of time). For a discussion of the concept

of reliability as applied to product quality (e.g., in industrial statistics), please refer to the section

on Reliability/Failure Time Analysis in the Process Analysis topic (see also the section Repeatability and

Reproducibility and the topic Survival/Failure Time Analysis). For a comparison between these two (very

different) concepts of reliability, see Reliability.

General Introduction

In many areas of research, the precise measurement of hypothesized processes or variables

(theoretical constructs) poses a challenge by itself. For example, in psychology, the precise

measurement of personality variables or attitudes is usually a necessary first step before any theories

of personality or attitudes can be considered. In general, in all social sciences, unreliable

measurements of people's beliefs or intentions will obviously hamper efforts to predict their behavior.

The issue of precision of measurement will also come up in applied research, whenever variables are

difficult to observe. For example, reliable measurement of employee performance is usually a difficult

task; yet, it is obviously a necessary precursor to any performance-based compensation system.

In all of these cases, Reliability & Item Analysis may be used to construct reliable measurement scales, to

improve existing scales, and to evaluate the reliability of scales already in use. Specifically, Reliability

& Item Analysis will aid in the design and evaluation of sum scales, that is, scales that are made up of

multiple individual measurements (e.g., different items, repeated measurements, different

measurement devices, etc.). You can compute numerous statistics that allow you to build and

evaluate scales following the so-called classical testing theory model.

The assessment of scale reliability is based on the correlations between the individual items or

measurements that make up the scale, relative to the variances of the items. If you are not familiar

with the correlation coefficient or the variance statistic, we recommend that you review the respective

discussions provided in the Basic Statistics section.

The classical testing theory model of scale construction has a long history, and there are many

textbooks available on the subject. For additional detailed discussions, you may refer to, for example,

Carmines and Zeller (1980), De Gruijter and Van Der Kamp (1976), Kline (1979, 1986), or Thorndike

and Hagen (1977). A widely acclaimed "classic" in this area, with an emphasis on psychological and

educational testing, is Nunnally (1970).

Testing hypotheses about relationships between items and tests. Using Structural Equation Modeling and

Path Analysis (SEPATH), you can test specific hypotheses about the relationship between sets of items

or different tests (e.g., test whether two sets of items measure the same construct, analyze multi-

trait, multi-method matrices, etc.).

Basic Ideas

Suppose we want to construct a questionnaire to measure people's prejudices against foreign-made

cars. We could start out by generating a number of items such as: "Foreign cars lack personality,"

"Foreign cars all look the same," etc. We could then submit those questionnaire items to a group of

subjects (for example, people who have never owned a foreign-made car). We could ask subjects to

indicate their agreement with these statements on 9-point scales, anchored at 1=disagree and 9=agree.

True scores and error. Let us now consider more closely what we mean by precise measurement in this

case. We hypothesize that there is such a thing (theoretical construct) as "prejudice against foreign

cars," and that each item "taps" into this concept to some extent. Therefore, we may say that a

subject's response to a particular item reflects two aspects: first, the response reflects the prejudice

against foreign cars, and second, it will reflect some esoteric aspect of the respective question. For

example, consider the item "Foreign cars all look the same." A subject's agreement or disagreement

with that statement will partially depend on his or her general prejudices, and partially on some other

aspects of the question or person. For example, the subject may have a friend who just bought a very

different looking foreign car.

Testing hypotheses about relationships between items and tests. To test specific hypotheses about the

relationship between sets of items or different tests (e.g., whether two sets of items measure the

same construct, analyze multi-trait, multi-method matrices, etc.) use Structural Equation

Modeling (SEPATH).

Classical Testing Model

To summarize, each measurement (response to an item) reflects to some extent the true score for the

intended concept (prejudice against foreign cars), and to some extent esoteric, random error. We can

express this in an equation as:

X = tau + error

In this equation, X refers to the respective actual measurement, that is, subject's response to a

particular item; tau is commonly used to refer to the true score, and error refers to the random error

component in the measurement.


Reliability

In this context the definition of reliability is straightforward: a measurement is reliable if it reflects

mostly true score, relative to the error. For example, an item such as "Red foreign cars are particularly

ugly" would likely provide an unreliable measurement of prejudices against foreign-made cars. This is

because there probably are ample individual differences concerning the likes and dislikes of colors.

Thus, this item would "capture" not only a person's prejudice but also his or her color preference.

Therefore, the proportion of true score (for prejudice) in subjects' response to that item would be

relatively small.

Measures of reliability. From the above discussion, one can easily infer a measure or statistic to describe

the reliability of an item or scale. Specifically, we may define an index of reliability in terms of the

proportion of true score variability that is captured across subjects or respondents, relative to the total

observed variability. In equation form, we can say:

$$\text{Reliability} = \frac{\sigma^2_{\text{true score}}}{\sigma^2_{\text{total observed}}}$$

Sum Scales

What will happen when we sum up several more or less reliable items designed to measure prejudice

against foreign-made cars? Suppose the items were written so as to cover a wide range of possible

prejudices against foreign-made cars. If the error component in subjects' responses to each question is

truly random, then we may expect that the different components will cancel each other out across

items. In slightly more technical terms, the expected value or mean of the error component across

items will be zero. The true score component remains the same when summing across items.

Therefore, the more items are added, the more true score (relative to the error score) will be

reflected in the sum scale.

Number of items and reliability. This conclusion describes a basic principle of test design. Namely, the

more items there are in a scale designed to measure a particular concept, the more reliable will the

measurement (sum scale) be. Perhaps a somewhat more practical example will further clarify this

point. Suppose you want to measure the height of 10 persons, using only a crude stick as the

measurement device. Note that we are not interested in this example in the absolute correctness of

measurement (i.e., in inches or centimeters), but rather in the ability to distinguish reliably between

the 10 individuals in terms of their height. If you measure each person only once in terms of multiples

of lengths of your crude measurement stick, the resultant measurement may not be very reliable.

However, if you measure each person 100 times, and then take the average of those 100 measurements

as the summary of the respective person's height, then you will be able to make very precise and

reliable distinctions between people (based solely on the crude measurement stick).
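The principle that more items yield a more reliable sum scale can be quantified. The sketch below uses the general Spearman-Brown formula, a standard classical-test-theory result that is not stated in this text; it assumes the added items are comparable in quality to the existing ones, and the function names are illustrative.

```python
def lengthened_reliability(r, factor):
    """Predicted reliability when a scale is lengthened by `factor`.

    r      -- reliability of the current scale
    factor -- new length divided by current length (2.0 = twice as many items)
    """
    return factor * r / (1.0 + (factor - 1.0) * r)


def factor_needed(r, target):
    """Lengthening factor needed to raise reliability from r to target."""
    return target * (1.0 - r) / (r * (1.0 - target))


for factor in (1, 2, 4, 10):
    predicted = lengthened_reliability(0.50, factor)
    print(f"{factor:>2} x length: predicted reliability = {predicted:.2f}")

print(f"factor needed to go from .50 to .80: {factor_needed(0.50, 0.80):.1f}")
```

With a lengthening factor of 2 the formula reduces to the familiar split-half correction, 2r/(1+r).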

Let's now look at some of the common statistics that are used to estimate the reliability of a

sum scale.


Cronbach's Alpha

To return to the prejudice example, if there are several subjects who respond to our items, then we

can compute the variance for each item, and the variance for the sum scale. The variance of the sum

scale will be larger than the sum of item variances if the items measure the same variability between

subjects, that is, if they measure some true score. Technically, the variance of the sum of two items is

equal to the sum of the two variances plus (two times) the covariance, that is, the amount of true

score variance common to the two items.

We can estimate the proportion of true score variance that is captured by the items by comparing the

sum of item variances with the variance of the sum scale. Specifically, we can compute:

$$\alpha = \frac{k}{k-1}\left[1 - \frac{\sum s_i^2}{s_{sum}^2}\right]$$

This is the formula for the most common index of reliability, namely, Cronbach's coefficient alpha ($\alpha$).

In this formula, the $s_i^2$ denote the variances for the k individual items; $s_{sum}^2$ denotes the variance

for the sum of all items. If there is no true score but only error in the items (which is esoteric and

unique, and, therefore, uncorrelated across subjects), then the variance of the sum will be the same as

the sum of variances of the individual items. Therefore, coefficient alpha will be equal to zero. If all

items are perfectly reliable and measure the same thing (true score), then coefficient alpha is equal to

1. (Specifically, $1 - \sum s_i^2 / s_{sum}^2$ will become equal to (k-1)/k; if we multiply this by k/(k-1) we obtain

1.)

Alternative terminology. Cronbach's alpha, when computed for binary (e.g., true/false) items, is

identical to the so-called Kuder-Richardson-20 formula of reliability for sum scales. In either case,

because the reliability is actually estimated from the consistency of all items in the sum scales, the

reliability coefficient computed in this manner is also referred to as the internal-consistency

reliability.
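Coefficient alpha as defined above can be computed directly from a respondents-by-items table of scores. This is a minimal sketch using only the Python standard library; the function name and the two tiny data sets are invented to show the limiting cases described in the text (alpha of 1 and alpha of 0).

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of rows, one row of k item scores per respondent."""
    k = len(scores[0])
    # Variance of each item across respondents
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Variance of the sum scale across respondents
    sum_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1.0 - sum(item_vars) / sum_var)


# Three items that all measure exactly the same thing
identical = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
# Two items with zero covariance across four respondents
unrelated = [[1, 1], [2, 1], [1, 2], [2, 2]]

print(round(cronbach_alpha(identical), 3))
print(round(cronbach_alpha(unrelated), 3))
```

The sample variance (n-1 denominator) is used for both the items and the sum; since alpha depends only on the ratio of the two, the population variance would give the same result.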

Split-Half Reliability

An alternative way of computing the reliability of a sum scale is to divide it in some random manner

into two halves. If the sum scale is perfectly reliable, we would expect that the two halves are

perfectly correlated (i.e., r = 1.0). Less than perfect reliability will lead to less than perfect

correlations. We can estimate the reliability of the sum scale via the Spearman-Brown split-half

coefficient:

rsb = 2rxy /(1+rxy)

In this formula, rsb is the split-half reliability coefficient, and rxy represents the correlation between the

two halves of the scale.
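The split-half procedure can be sketched end to end. The example below splits the items into odd- and even-numbered halves rather than at random (a simplifying assumption), correlates the two half scores, and applies the formula above; the helper names are illustrative.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_brown(r_xy):
    """Split-half reliability from the correlation between the two halves."""
    return 2.0 * r_xy / (1.0 + r_xy)


def split_half_reliability(scores):
    """scores: one row of item scores per respondent."""
    half_a = [sum(row[0::2]) for row in scores]  # items 1, 3, 5, ...
    half_b = [sum(row[1::2]) for row in scores]  # items 2, 4, 6, ...
    return spearman_brown(pearson_r(half_a, half_b))


print(round(spearman_brown(0.60), 2))  # halves correlating at .60
```

Different splits of the same items will generally give somewhat different estimates, which is one reason coefficient alpha is often reported instead.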


Correction for Attenuation

Let us now consider some of the consequences of less than perfect reliability. Suppose we use our scale

of prejudice against foreign-made cars to predict some other criterion, such as subsequent actual

purchase of a car. If our scale correlates with such a criterion, it would raise our confidence in

the validity of the scale, that is, that it really measures prejudices against foreign-made cars, and not

something completely different. In actual test design, the validation of a scale is a lengthy process that

requires the researcher to correlate the scale with various external criteria that, in theory, should be

related to the concept that is supposedly being measured by the scale.

How will validity be affected by less than perfect scale reliability? The random error portion of the

scale is unlikely to correlate with some external criterion. Therefore, if the proportion of true score in

a scale is only 60% (that is, the reliability is only .60), then the correlation between the scale and the

criterion variable will be attenuated, that is, it will be smaller than the actual correlation of true scores.

In fact, the validity of a scale is always limited by its reliability.

Given the reliability of the two measures in a correlation (i.e., the scale and the criterion variable), we

can estimate the actual correlation of true scores in both measures. Put another way, we

can correct the correlation for attenuation:

rxy,corrected = rxy / sqrt(rxx*ryy)

In this formula, rxy,corrected stands for the corrected correlation coefficient, that is, it is the estimate of the

correlation between the true scores in the two measures x and y. The term rxy denotes the uncorrected

correlation, and rxx and ryy denote the reliability of measures (scales) x and y. You can compute the

attenuation correction based on specific values, or based on actual raw data (in which case the

reliabilities of the two measures are estimated from the data).
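A short worked example may help here. The sketch assumes the standard form of the correction, in which the observed correlation is divided by the square root of the product of the two reliabilities; the .60 reliability echoes the example above, while the observed correlation of .30 and the criterion reliability are made-up values.

```python
import math


def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Estimate the true-score correlation from an observed correlation.

    r_xy -- observed correlation between scale x and criterion y
    r_xx -- reliability of scale x
    r_yy -- reliability of criterion y
    """
    return r_xy / math.sqrt(r_xx * r_yy)


observed = 0.30  # hypothetical observed correlation
corrected = correct_for_attenuation(observed, r_xx=0.60, r_yy=0.60)
print(f"observed {observed:.2f} -> corrected {corrected:.2f}")
```

With perfectly reliable measures (both reliabilities equal to 1) the corrected and observed correlations coincide, so the correction only ever moves the estimate away from zero.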

Designing a Reliable Scale

After the discussion so far, it should be clear that, the more reliable a scale, the better (e.g., more

valid) the scale. As mentioned earlier, one way to make a sum scale more valid is by adding items. You

can compute how many items would have to be added in order to achieve a particular reliability, or

how reliable the scale would be if a certain number of items were added. However, in practice, the

number of items on a questionnaire is usually limited by various other factors (e.g., respondents get

tired, overall space is limited, etc.). Let us return to our prejudice example, and outline the steps that

one would generally follow in order to design the scale so that it will be reliable:

Step 1: Generating items. The first step is to write the items. This is essentially a creative process where

the researcher makes up as many items as possible that seem to relate to prejudices against foreign-

made cars. In theory, one should "sample items" from the domain defined by the concept. In practice,

for example in marketing research, focus groups are often utilized to illuminate as many aspects of the

concept as possible. For example, we could ask a small group of highly committed American car buyers

to express their general thoughts and feelings about foreign-made cars. In educational and


psychological testing, one commonly looks at other similar questionnaires at this stage of the scale

design, again, in order to gain as wide a perspective on the concept as possible.

Step 2: Choosing items of optimum difficulty. In the first draft of our prejudice questionnaire, we will

include as many items as possible. We then administer this questionnaire to an initial sample of typical

respondents, and examine the results for each item. First, we would look at various characteristics of

the items, for example, in order to identify floor or ceiling effects. If all respondents agree or disagree

with an item, then it obviously does not help us discriminate between respondents, and thus, it is

useless for the design of a reliable scale. In test construction, the proportion of respondents who agree

or disagree with an item, or who answer a test item correctly, is often referred to as the item difficulty.

In essence, we would look at the item means and standard deviations and eliminate those items that

show extreme means, and zero or nearly zero variances.

Step 3: Choosing internally consistent items. Remember that a reliable scale is made up of items that

proportionately measure mostly true score; in our example, we would like to select items that measure

mostly prejudice against foreign-made cars, and few esoteric aspects we consider random error. To do

so, we would look at the following:

STATISTICA RELIABL. ANALYSIS

Summary for scale: Mean=46.1100 Std.Dv.=8.26444 Valid n:100
Cronbach alpha: .794313 Standardized alpha: .800491
Average inter-item corr.: .297818

            Mean if    Var. if    StDv. if   Itm-Totl   Squared    Alpha if
variable    deleted    deleted    deleted    Correl.    Multp. R   deleted
ITEM1       41.61000   51.93790   7.206795   .656298    .507160    .752243
ITEM2       41.37000   53.79310   7.334378   .666111    .533015    .754692
ITEM3       41.41000   54.86190   7.406882   .549226    .363895    .766778
ITEM4       41.63000   56.57310   7.521509   .470852    .305573    .776015
ITEM5       41.52000   64.16961   8.010593   .054609    .057399    .824907
ITEM6       41.56000   62.68640   7.917474   .118561    .045653    .817907
ITEM7       41.46000   54.02840   7.350401   .587637    .443563    .762033
ITEM8       41.33000   53.32110   7.302130   .609204    .446298    .758992
ITEM9       41.44000   55.06640   7.420674   .502529    .328149    .772013
ITEM10      41.66000   53.78440   7.333785   .572875    .410561    .763314

Shown above are the results for 10 items. Of most interest to us are the three right-most columns.

They show us the correlation between the respective item and the total sum score (without the

respective item), the squared multiple correlation between the respective item and all others, and the

internal consistency of the scale (coefficient alpha) if the respective item would be deleted. Clearly,

items 5 and 6 "stick out," in that they are not consistent with the rest of the scale. Their correlations

with the sum scale are .05 and .12, respectively, while all other items correlate at .45 or better. In the

right-most column, we can see that the reliability of the scale would be about .82 if either of the two

items were to be deleted. Thus, we would probably delete the two items from this scale.
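The screening step just described can be expressed as a simple filter over the "Alpha if deleted" column. The values below are copied from the results table; the rule of flagging any item whose removal would raise alpha above the current value is one reasonable reading of the text, not a fixed standard.

```python
OVERALL_ALPHA = 0.794313  # Cronbach alpha for the full 10-item scale

# "Alpha if deleted" column from the results table
alpha_if_deleted = {
    "ITEM1": 0.752243, "ITEM2": 0.754692, "ITEM3": 0.766778,
    "ITEM4": 0.776015, "ITEM5": 0.824907, "ITEM6": 0.817907,
    "ITEM7": 0.762033, "ITEM8": 0.758992, "ITEM9": 0.772013,
    "ITEM10": 0.763314,
}

# Flag items whose removal would make the scale more internally consistent
flagged = [item for item, alpha in alpha_if_deleted.items()
           if alpha > OVERALL_ALPHA]

print(flagged)  # ['ITEM5', 'ITEM6']
```

Each "if deleted" value assumes only that one item is removed, so after dropping an item the statistics would need to be recomputed before deciding on the next one.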

Step 4: Returning to Step 1. After deleting all items that are not consistent with the scale, we may not

be left with enough items to make up an overall reliable scale (remember that, the fewer items, the

less reliable the scale). In practice, one often goes through several rounds of generating items and

eliminating items, until one arrives at a final set that makes up a reliable scale.

Tetrachoric correlations. In educational and psychological testing, it is common to use yes/no type items,

that is, to prompt the respondent to answer either yes or no to a question. An alternative to the

regular correlation coefficient in that case is the so-called tetrachoric correlation coefficient. Usually,

the tetrachoric correlation coefficient is larger than the standard correlation coefficient; therefore,

Nunnally (1970, p. 102) discourages the use of this coefficient for estimating reliabilities. However, it

is a widely used statistic (e.g., in mathematical modeling).

Test Item Analysis Using Microsoft Excel Spreadsheet Program

by Chris Elvin

Introduction

This article is written for teachers and researchers whose budgets are limited

and who do not have access to purposely designed item analysis software such

as Iteman (2003). It describes how to organize a computer spreadsheet such

as Microsoft Excel in order to obtain statistical information about a test and

the students who took it.

Using a fictitious example for clarity, and also a real example of a personally

written University placement test, I will show how the information in a

spreadsheet can be used to refine test items and make judicious placement

decisions. Included is the web address for accessing the sample Excel files for

the class of fictitious students (Elvin, 2003a , 2003b).

Background

I had been teaching in high schools in Japan for many years, and upon

receiving my first University appointment, I was eager to make a good

impression. My first task was to prepare a norm-referenced placement test to

separate approximately one hundred first year medical students into ten

relative levels of proficiency and place them into appropriate classes. This

would allow teachers to determine appropriate curricular goals and adjust the

teaching methodology based more closely on students' personal needs. It was

also hoped that a more congenial classroom atmosphere, with less frustration

or boredom, would enhance motivation and engender a true learning

environment.

The course was called Oral English Communication, so the placement test

needed to measure this construct. However, since time restricted us to no

more than half an hour for administering the test, a spoken component for the

test was ruled out. It had to be listening only, and in order to ensure reliability,

the questions had to be as many as possible. I decided I could only achieve this

by having as many rapid-fire questions as possible within the time constraint.

In order for the test to be valid, I focused on points that one might expect to

cover in an oral English course for "clever" first year university students. It

was not possible to meet the students beforehand, so I estimated their level


based on my experience of teaching advanced-level senior high school

students.

Organizing The Spreadsheet Part A

To show briefly how I compiled my students' test scores, I have provided here

the results of a fabricated ten-item test taken by nine fictitious students. (See

Table 1; to download a copy of this file, see Elvin, 2003a.) The purpose of this

section of the spreadsheet is primarily to determine what proportion of the

students answered each item, how many answered correctly, and how efficient

the distractors were. It also helps the instructor prepare for item

discrimination analysis in a separate part of the spreadsheet.

Table 1. Fabricated 10-Item Test - Part A: Actual letter choices

     A        B                C      D      E      F      G      H      I      J      K      L
 1   ID       ITEM NUMBER      1      2      3      4      5      6      7      8      9      10
 2   200201   Arisa            D      A      A      B      C      D      A      A      B      D
 3   200202   Kana             A      C      D      A      D      C      B      C      A      A
 4   200203   Saki             D      B      D      B      C      D      D      A      B      A
 5   200204   Tomomi           A      B      B      A      C      C      C      A      D      D
 6   200205   Natsumi          C      B      A      B      C      D      B      C      C      D
 7   200206   Haruka           C      B      A      B      C      D      A      A      B      C
 8   200207   Momo             *      C      B      D      D      B      A      A      C      B
 9   200208   Yuka             B      D      B      C      C      D      D      B      D      B
10   200209   Rie              C      B      A      B      C      B      A      A      B      C
11            CORRECT ANSWER   C      B      A      B      C      D      A      A      B      C
12            A                0.22   0.11   0.44   0.22   0.00   0.00   0.44   0.67   0.11   0.22
13            B                0.11   0.56   0.33   0.56   0.00   0.22   0.22   0.11   0.44   0.11
14            C                0.33   0.22   0.00   0.11   0.78   0.22   0.11   0.22   0.22   0.22
15            D                0.22   0.11   0.22   0.11   0.22   0.56   0.22   0.00   0.22   0.44
16            TOTAL            0.89   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00

What proportion of students answered the question?

It may be expected that for a multiple-choice test, all of the students would

answer all of the questions. In the real world, this is rarely true. The quality of

a test item may be poor, or there may be psychological, environmental, or

administrative factors to take into consideration. To try to identify these

potential sources of measurement error, I calculate the ratio of students

answering each question to students taking the test.

In cell C16 of the Excel spreadsheet, the formula bar reads =SUM(C12:C15),
which adds the proportions of students answering A, B, C and D
respectively. One student didn't answer question 1 (cell C8), so the total

for this item is eight out of nine, which is 0.89. Perhaps she was feeling a little

nervous, or she couldn't hear well because of low speaker volume or noise in

the classroom. The point is, if it is possible to determine what was responsible

for a student or students not answering, then it may also be possible to rectify

it. In some cases, a breakdown of questions on a spreadsheet can contribute to

the discovery of such problems.
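
The same check can be reproduced outside Excel. The following is a minimal Python sketch, not part of the original spreadsheet: the names `RESPONSES`, `OPTIONS`, and `response_rate` are hypothetical, and the data are the letter choices from Table 1, with "*" standing for the unanswered item.

```python
# Letter choices from Table 1, one string per student; "*" marks no answer.
RESPONSES = {
    "Arisa": "DAABCDAABD",
    "Kana": "ACDADCBCAA",
    "Saki": "DBDBCDDABA",
    "Tomomi": "ABBACCCADD",
    "Natsumi": "CBABCDBCCD",
    "Haruka": "CBABCDAABC",
    "Momo": "*CBDDBAACB",
    "Yuka": "BDBCCDDBDB",
    "Rie": "CBABCBAABC",
}
OPTIONS = "ABCD"


def response_rate(responses, item):
    """Proportion of test takers who marked a valid option (0-based item index)."""
    answered = sum(1 for choices in responses.values() if choices[item] in OPTIONS)
    return answered / len(responses)


# One rate per item, rounded to two places like row 16 of the spreadsheet.
rates = [round(response_rate(RESPONSES, i), 2) for i in range(10)]
print(rates)
```

Any item whose rate falls below 1.00, as item 1 does here, is a candidate for the kind of follow-up described above.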

What proportion of students answered the question correctly?

For question 1, the correct answer is C, as shown in cell C11. The proportion

of students who chose C is shown in cell C14. To calculate this value, we use

the COUNTIF function. In cell C14, the formula bar reads

=COUNTIF(C2:C10,"C")/9, which means that any cell from C2 to C10 which

has the answer C is counted, and then divided by the number of test takers,

which is nine for this test. This value is also the item facility for the question,

which will be discussed in more detail later in this paper.

How efficient were the distractors?

We use the same function, COUNTIF, for finding the proportion of students

who answered incorrectly. For item 5, cell G12 reads

=COUNTIF(G2:G10,"A")/9 in the formula bar. The two other distractors are
B, which is shown in cell G13 (=COUNTIF(G2:G10,"B")/9), and D, which
is shown in cell G15 (=COUNTIF(G2:G10,"D")/9). For this question, seven

students answered correctly (answer C), and two students answered

incorrectly by choosing D. If this test were real, and had many more test

takers, I would want to find out why A and B were not chosen, and I would

consider rewriting this question to make all three distractors equally

attractive.
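
As a rough illustration of the same distractor tally, here is a short Python sketch using `collections.Counter` in place of COUNTIF. It is an assumption-laden example rather than the spreadsheet itself: `ITEM5`, `option_proportions`, and `unused_distractors` are hypothetical names, and the data are the nine responses to item 5 (column G of Table 1).

```python
from collections import Counter

# Responses to item 5 (column G, rows 2 to 10 of Table 1); the key is C.
ITEM5 = ["C", "D", "C", "C", "C", "C", "D", "C", "C"]


def option_proportions(choices, options="ABCD"):
    """Proportion of test takers choosing each option, like COUNTIF(...)/9."""
    counts = Counter(choices)
    return {opt: round(counts[opt] / len(choices), 2) for opt in options}


def unused_distractors(choices, key, options="ABCD"):
    """Distractors (wrong options) that nobody selected."""
    counts = Counter(choices)
    return [opt for opt in options if opt != key and counts[opt] == 0]


proportions = option_proportions(ITEM5)
unused = unused_distractors(ITEM5, key="C")
print(proportions, unused)
```

A distractor that appears in the unused list attracted no one, which is the signal for considering a rewrite of the item.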

Preparing for item discrimination analysis in a separate part of the

spreadsheet

Part A of the spreadsheet shows the letter choices students made in answering

each question. In Part B of the spreadsheet, I score and rank the students, and

analyze the test and test items numerically.

Organizing The Spreadsheet Part B

The area C22:L30 in part B of the spreadsheet (see Table 2; also Elvin, 2003a)
refers, through absolute cell references, to the values of C2:L10 in part A of the spreadsheet in

Table 1. This means that even after sorting the students by total score in part B

of the spreadsheet, the new positions of the ranked students will still refer to

their actual letter choices in part A of the spreadsheet.

Absolute cell references, unlike relative cell references, however, do not
adjust when copied and pasted, so each one has to be typed in manually. It is therefore much

quicker to make a linked worksheet with copied and pasted relative cell

references. The linked spreadsheet can then be sorted without fear of

automatic recalculation, as would happen if working within the same

spreadsheet using relative references. For the actual test, I used a linked

spreadsheet. If you would like to see a linked file, a copy of one is available for

download from my website (see Elvin, 2003b).

The purposes of part B of the spreadsheet are to

a. convert students' multiple-choice options to numerals
b. calculate students' total scores
c. sort students by total score
d. compute item facility and item discrimination values
e. calculate the average score and standard deviation of the test
f. determine the test's reliability
g. estimate the standard error of measurement of the test

Table 2. Fabricated 10-Item Test - Part B: Scoring and Ranking of

Students

    A       B            C     D     E     F     G     H     I     J     K     L     M            N
21  ID      ITEM NUMBER  1     2     3     4     5     6     7     8     9     10    TOTAL
22  200206  Haruka       1     1     1     1     1     1     1     1     1     1     10
23  200209  Rie          1     1     1     1     1     0     1     1     1     1     9
24  200201  Arisa        0     0     1     1     1     1     1     1     1     0     7
25  200203  Saki         0     1     0     1     1     1     0     1     1     0     6
26  200205  Natsumi      1     1     1     1     1     1     0     0     0     0     6
27  200204  Tomomi       0     1     0     0     1     0     0     1     0     0     3
28  200207  Momo         0     0     0     0     0     0     1     1     0     0     2
29  200208  Yuka         0     0     0     0     1     1     0     0     0     0     2
30  200202  Kana         0     0     0     0     0     0     0     0     0     0     0
31          IF total     0.33  0.56  0.44  0.56  0.78  0.56  0.44  0.67  0.44  0.22  Reliability  0.87
32          IF upper     0.67  0.67  1.00  1.00  1.00  0.67  1.00  1.00  1.00  0.67  Average      5.00
33          IF lower     0.00  0.00  0.00  0.00  0.33  0.33  0.33  0.33  0.00  0.00  SD           3.43
34          ID           0.67  0.67  1.00  1.00  0.67  0.33  0.67  0.67  1.00  0.67  SEM          1.21

a) Converting students' multiple-choice options to numerals

Cell C22, in this previously sorted part of the spreadsheet, reads

=IF($C$7="C",1,0) in the formula bar. This means that Haruka has

answered C for item 1 in cell C7 of part A of the spreadsheet, so she will score

one point. If there is anything else in cell C7, she will score zero. (The dollar

signs before C and 7 indicate absolute reference.)

b) Calculating total scores

Cell M22 reads =SUM(C22:L22). This calculates one student's total score by

adding up her ones and zeros for all the items on the test from C22 to L22.

c) Sorting students by total scores

The area A22:M30 is selected. Sort is then chosen from the data menu in the

menu bar, which brings up a pop-up menu and a choice of two radio

buttons. Column M is selected from the pop-up menu, and

the descending radio button is clicked. Finally, the OK button is selected. This

sorts the students by test score from highest to lowest.
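
Steps a) to c) can also be sketched in a few lines of Python, for readers who prefer a script to a worksheet. This is only an illustration under stated assumptions: `ANSWER_KEY`, `RESPONSES`, and `score` are hypothetical names, and the data are copied from Table 1. Python's `sorted` is stable, so students with equal totals keep their original relative order.

```python
ANSWER_KEY = "CBABCDAABC"  # row 11 of Table 1
RESPONSES = {
    "Arisa": "DAABCDAABD",
    "Kana": "ACDADCBCAA",
    "Saki": "DBDBCDDABA",
    "Tomomi": "ABBACCCADD",
    "Natsumi": "CBABCDBCCD",
    "Haruka": "CBABCDAABC",
    "Momo": "*CBDDBAACB",
    "Yuka": "BDBCCDDBDB",
    "Rie": "CBABCBAABC",
}


def score(choices, key):
    """a) Convert letter choices to 1 (correct) or 0 (anything else)."""
    return [1 if choice == correct else 0 for choice, correct in zip(choices, key)]


scored = {name: score(choices, ANSWER_KEY) for name, choices in RESPONSES.items()}

# b) Total score per student, like =SUM(C22:L22).
totals = {name: sum(row) for name, row in scored.items()}

# c) Sort from highest to lowest total.
ranking = sorted(totals.items(), key=lambda pair: pair[1], reverse=True)
for name, total in ranking:
    print(name, total)
```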

d) Computing item facility and item discrimination values

Item facility (IF) refers to the proportion of students who answered the

question correctly. In part A of the spreadsheet, we calculated the IF using the

COUNTIF function for the letter corresponding to the correct answer. With

these letter answers now converted numerically, we can also calculate the IF

using the SUM function. For example, in cell C31, the formula bar reads

=SUM(C22:C30)/9, which gives us the IF for item 1 by adding all the ones

and dividing by the number of test-takers.

The item discrimination (ID) is usually the difference between the IF for the

top third of test takers and the IF for the bottom third of test takers for each

item on a test (some prefer to use the top and bottom quarters). The IF for the

top third is given in cell C32 and reads =SUM(C22:C24)/3. Similarly, the IF

for the bottom third of test takers is given in cell C33, and reads

=SUM(C28:C30)/3. The difference between these two scores, shown in cell
C34 (=C32-C33), gives us the ID. This value is useful in norm-referenced tests

such as placement tests because it is an indication of how well the test-takers

are being purposefully spread for each item of the test.
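
For comparison, the IF and ID rows of Table 2 can be recomputed with a short Python sketch. The names `SCORES`, `item_facility`, and `item_discrimination` are hypothetical, the matrix is the scored data from Table 2, and the `groups=3` default simply encodes the top-third and bottom-third convention described above.

```python
# Scored responses from Table 2 (rows 22 to 30), already ranked by total.
SCORES = [
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],  # Haruka
    [1, 1, 1, 1, 1, 0, 1, 1, 1, 1],  # Rie
    [0, 0, 1, 1, 1, 1, 1, 1, 1, 0],  # Arisa
    [0, 1, 0, 1, 1, 1, 0, 1, 1, 0],  # Saki
    [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],  # Natsumi
    [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],  # Tomomi
    [0, 0, 0, 0, 0, 0, 1, 1, 0, 0],  # Momo
    [0, 0, 0, 0, 1, 1, 0, 0, 0, 0],  # Yuka
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # Kana
]


def item_facility(rows, item):
    """Proportion of the given students who answered the item correctly."""
    return sum(row[item] for row in rows) / len(rows)


def item_discrimination(rows, item, groups=3):
    """IF of the top group minus IF of the bottom group (thirds by default)."""
    ranked = sorted(rows, key=sum, reverse=True)
    size = len(ranked) // groups
    return item_facility(ranked[:size], item) - item_facility(ranked[-size:], item)


facility = [round(item_facility(SCORES, i), 2) for i in range(10)]
discrimination = [round(item_discrimination(SCORES, i), 2) for i in range(10)]
print(facility)
print(discrimination)
```

Passing `groups=4` would switch to the top and bottom quarters that some analysts prefer.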

e) Calculating the average score and standard deviation of the test

The Excel program has functions for average score and standard deviation, so

they are both easy to calculate. Cell N32 reads =AVERAGE(M22:M30) in the

formula bar, which gives us the average score. The standard deviation is

shown in cell N33 and reads =STDEV(M22:M30) in the formula bar.
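
A rough Python equivalent uses the standard `statistics` module. Note that Excel's STDEV is the sample standard deviation (dividing by n - 1), so the matching call is `statistics.stdev`, not `statistics.pstdev`. The name `TOTALS` is hypothetical; the values are the total scores from column M of Table 2.

```python
import statistics

# Total scores from column M of Table 2 (rows 22 to 30).
TOTALS = [10, 9, 7, 6, 6, 3, 2, 2, 0]

average = statistics.mean(TOTALS)  # like =AVERAGE(M22:M30)
sd = statistics.stdev(TOTALS)      # like =STDEV(M22:M30), sample SD (n - 1)

print(round(average, 2), round(sd, 2))
```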

f) Determining the test's reliability

I use the Kuder-Richardson 21 formula for calculating reliability because it is

easy to compute, relying only on the number of test items, and the average and

variance of the test scores.

The formula is $\text{KR-21} = \frac{n}{n-1}\left[1 - \frac{X - X^2/n}{S^2}\right]$, where n is the number of test
items, X is the average score, and S the standard deviation. (See Hatch and

Lazaraton, 1991, p. 538 for information on the Kuder-Richardson 21 formula.)

In cell N31, the formula bar reads =(10/9)*(1-(N32-N32*N32/10)/(N33*N33)),
which will give us a conservative estimate of the
test's reliability, compared to the more accurate but more complex KR-20

formula.
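
The KR-21 calculation in cell N31 can be written as a small Python function. This is a sketch under the same inputs as the spreadsheet (number of items, average, and sample standard deviation); the names `kr21` and `TOTALS` are hypothetical.

```python
import statistics

TOTALS = [10, 9, 7, 6, 6, 3, 2, 2, 0]  # column M of Table 2


def kr21(n_items, mean, sd):
    """Kuder-Richardson 21: (n/(n-1)) * (1 - (mean - mean^2/n) / sd^2)."""
    variance = sd ** 2
    return (n_items / (n_items - 1)) * (1 - (mean - mean ** 2 / n_items) / variance)


reliability = kr21(10, statistics.mean(TOTALS), statistics.stdev(TOTALS))
print(round(reliability, 2))
```

Keeping the unrounded mean and standard deviation as inputs mirrors the way the spreadsheet formula refers to cells N32 and N33 rather than to their displayed values.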

g) Estimating the standard error of measurement of the test

The true-score model, which was proposed by Spearman (1904), states that an

individual's observed test score is made up of two components, a true

component and an error component. The standard error of measurement,

according to Dudek (1979), is an estimate of the standard deviation expected

for observed scores when the true score is held constant. It is therefore an

indication of how much a student's observed test score would be expected to
fluctuate either side of her true test score because of extraneous
circumstances. This uncertainty means that it is not possible to

say for sure which side of a cut-off point the true score of a student whose

observed score is within one SEM of that cut-off point truly lies. However,

since the placement of students into streamed classes within our university is

not likely to affect the students' lives critically, I calculate SEM not so much to

determine the borderline students, who in some circumstances may need

further deliberation, but more to give myself a concrete indication of how

confident I am that the process of streaming is being done equitably.

To measure SEM, we type in the formula bar for cell N34, =SQRT(1-N31)*N33,
which gives us a value of 1.21. We can therefore say that
students' true scores will normally be within 1.21 points of

their observed scores.
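
The SEM formula can be sketched in Python as follows. The names `kr21`, `sem`, and `TOTALS` are hypothetical, and the reliability is recomputed from the raw totals rather than taken from the rounded 0.87, since rounding the inputs first would shift the result slightly.

```python
import math
import statistics

TOTALS = [10, 9, 7, 6, 6, 3, 2, 2, 0]  # column M of Table 2


def kr21(n_items, mean, sd):
    """KR-21 reliability, as in cell N31."""
    return (n_items / (n_items - 1)) * (1 - (mean - mean ** 2 / n_items) / sd ** 2)


def sem(reliability, sd):
    """Standard error of measurement, like =SQRT(1-N31)*N33."""
    return math.sqrt(1 - reliability) * sd


sd = statistics.stdev(TOTALS)
error = sem(kr21(10, statistics.mean(TOTALS), sd), sd)

# Band of one SEM either side of an observed score of 6.
band = (round(6 - error, 2), round(6 + error, 2))
print(round(error, 2), band)
```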

The 2002 Placement Test

A 50-item placement test was administered to 102 first year medical students

in April, 2002. To the teachers and students present, it may have appeared to

be a typical test, in a pretty booklet, with a nice font face, and the name of the

college in bold. But this face value was its only redeeming feature. After

statistical analysis, it was clear that its inherent weakness was that it was

unacceptably difficult and therefore wholly inappropriate. If the degree to

which a test is effective in spreading students out is directly related to the

degree to which that test fits the ability levels of the students (Brown, 1996),

then my placement test was ineffective because I had greatly overestimated

the students' level based on the naive assumption that they would be similar to

students in my high school teaching experience.

It had a very low average score, not much higher than guesswork, and such a
low reliability, and therefore such a large SEM, that many students
could not be definitively placed. In short, I was resigned to the fact that I'd be

teaching mixed ability classes for the next two semesters.

Pilot Procedures for the 2003 Placement Test

Statistical analysis of the 2002 test meant that I had to discard almost all of

the items. The good news was that at least I now had the opportunity to pilot

some questions with my new students. I discovered that nearly all of them

could read and write well, and many had impressive vocabularies. Most had

been taught English almost entirely in Japanese, however, and very few of

them had had much opportunity to practice English orally. Fewer still had had

contact with a native-English speaker on a regular basis.

According to Brown (1996), ideal items of a norm-referenced language test

should have an average IF of 0.50, and be in the range of 0.3 to 0.7 to be

considered acceptable. Ebel's guidelines (1979) for item discrimination

consider an ID of greater than 0.2 to be satisfactory. These are the criteria I

generally abide by when piloting test questions, after, of course, first

confirming that these items are valid and also devoid of redundancy.
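
As an illustration of how these criteria might be applied mechanically, the sketch below flags items whose IF lies outside 0.3 to 0.7 or whose ID is not above 0.2. The function `flag_items` and its parameters are hypothetical, and the input values are the IF and ID rows of the fabricated test in Table 2, not the pilot data.

```python
# IF total (row 31) and ID (row 34) from Table 2.
ITEM_FACILITY = [0.33, 0.56, 0.44, 0.56, 0.78, 0.56, 0.44, 0.67, 0.44, 0.22]
ITEM_DISCRIMINATION = [0.67, 0.67, 1.00, 1.00, 0.67, 0.33, 0.67, 0.67, 1.00, 0.67]


def flag_items(facility, discrimination, if_range=(0.3, 0.7), min_id=0.2):
    """Return 1-based numbers of items that fail either criterion."""
    flagged = []
    for number, (fac, disc) in enumerate(zip(facility, discrimination), start=1):
        in_range = if_range[0] <= fac <= if_range[1]
        if not in_range or disc <= min_id:
            flagged.append(number)
    return flagged


flagged = flag_items(ITEM_FACILITY, ITEM_DISCRIMINATION)
print(flagged)
```

A flagged item is only a candidate for review; validity and redundancy still have to be judged by reading the item itself.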

A Comparison Of The 2002 And 2003 Placement Tests

Table 3: A Comparison of the 2002 and 2003 placement tests

Year  Reliability  Average  SD    SEM   IF<0.3  0.3<=IF<=0.7  IF>0.7  ID>0.2
2002  0.57         16.09    4.95  3.26  27      23            0       2
2003  0.74         24.8     6.71  3.44  3       40            7       38

A statistical analysis of the 2003 50-item test showed a great improvement

compared to the previous year (see Table 3), with just ten items now falling

outside the criteria guidelines. The average score of the 2003 test was very

close to the ideal, but the reliability was still not as good as it should have

been. Despite this, we were still able to identify and make provision for the

highest and lowest scoring students, and feedback from all classes, thus far,

has generally been very positive.

Conclusion

I plan to extend my database of acceptable test items to employ in developing

the test for 2004. The reliability should improve once the bad items are

replaced with acceptable ones, and distractor efficiency analysis may help to

pinpoint which acceptable items can be modified further. My main concern,

however, is the very small standard deviation. If it remains stubbornly small,

we may have to conclude that our students are simply too homogeneous to be

streamed effectively, and that may ultimately force us to reconsider

establishing mixed-ability classes.

References

Brown, J.D. (1996). Testing in language programs. Upper Saddle River, NJ:

Prentice Hall.

Dudek, F.J. (1979). The continuing misinterpretation of the standard error of

measurement. Psychological Bulletin, 86, 335-337.

Ebel, R.L. (1979). Essentials of educational measurement (3rd ed.).

Englewood Cliffs, NJ: Prentice Hall.

Elvin, C. (2003a). Elvinsdata.xls [Online]. Retrieved September 26, 2003,

from

<www.eflclub.com/elvin/publications/2003/Elvinsdata.xls>.

Elvin, C. (2003b). Elvinsoutput.xls [Online]. Retrieved September 26, 2003,

from <www.eflclub.com/elvin/publications/2003/Elvinsoutput.xls>.

Hatch, E. & Lazaraton, A. (1991). The research manual: Design and statistics

for applied linguists. Boston, MA: Heinle & Heinle.

Iteman (Version 3.6). (1997). [Computer software]. St. Paul, MN: Assessment

Systems Corporation.

Spearman, C. (1904). General intelligence, objectively determined and

measured. American Journal of Psychology, 15, 201-293.

Chris Elvin has a Master's degree in education from Temple University,
Japan. He is the current programs chair of the JALT Materials Writers
special interest group, and former editor of The School House, the JALT
junior and senior high school SIG newsletter. He is the author of Now You're
Talking, an oral communication coursebook published by EFL Press, and the
owner and webmaster of www.eflclub.com, an English language learning
website dedicated to young learners. He currently teaches at Tokyo Women's
Medical University, Soka University, Caritas Gakuen and St. Dominic's

Institute. His research interests include materials writing, classroom

language acquisition and learner autonomy.


ITEM ANALYSIS Technique to improve test items and instruction

TEST DEVELOPMENT PROCESS
1. Review National and Professional Standards
2. Convene National Advisory Committee
3. Develop Domain, Knowledge and Skills Statements
4. Conduct Needs Analysis
5. Construct Table of Specifications
6. Develop Test Design
7. Develop New Test Questions
8. Review Test Questions
9. Assemble Operational Test Forms
10. Produce Printed Tests Mat.
11. Administer Tests
12. Conduct Item Analysis
13. Standard Setting Study
14. Set Passing Standard

again in later tests

SEVERAL PURPOSES
1. More diagnostic information on students
Classroom level:
- determine questions most found very difficult / were guessing on: reteach that concept
- questions all got right: don't waste more time on this area
- find wrong answers students are choosing: identify common misconceptions
Individual level:
- isolate specific errors the students made

2. Build future tests, revise test items to make them better
- know how much work goes into writing good questions
- SHOULD NOT REUSE WHOLE TESTS --> diagnostic teaching means responding to the needs of students, so after a few years a test bank is built up and you can choose tests for the class
- can spread difficulty levels across your blueprint (TOS)

3. Part of continuing professional development
- doing occasional item analysis will help you become a better test writer
- documenting just how good your evaluation is
- useful for dealing with parents or administrators if there's ever a dispute
- once you start bringing out all these impressive looking statistics, parents and administrators will believe why some students failed.

CLASSICAL ITEM ANALYSIS STATISTICS
- Reliability (test level statistic)
- Difficulty (item level statistic)
- Discrimination (item level statistic)

TEST LEVEL STATISTIC: Quality of the Test
- Reliability and Validity
- Reliability: consistency of measurement
- Validity: truthfulness of response
- Overall Test Quality
- Individual Item Quality

RELIABILITY refers to the extent to which the test is likely to produce consistent scores.
Characteristics:
1. The intercorrelations among the items: the greater/stronger the relative number of positive relationships are, the greater the reliability.
2. The length of the test: a test with more items will have a higher reliability, all other things being equal.
3. The content of the test: generally, the more diverse the subject matter tested and the testing techniques used, the lower the reliability.
4. Heterogeneous groups of test takers

TYPES OF RELIABILITY
Stability
1. Test-Retest
2. Inter-rater / Observer / Scorer: applicable mostly for essay questions; use Cohen's Kappa statistic
Equivalence
3. Parallel-Forms / Equivalent: used to assess the consistency of the results of two tests constructed in the same way from the same content domain.
Internal Consistency: used to assess the consistency of results across items within a test.
4. Split-Half
5. Kuder-Richardson Formula 20 / 21: correlation is determined from a single administration of a test through a study of score variances
6. Cronbach's Alpha (α)
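
To make the internal-consistency indices concrete, here is a minimal Python sketch of KR-20 and Cronbach's alpha. It is an illustration only: `DATA`, `kr20`, and `cronbach_alpha` are hypothetical names, the 5-by-4 response matrix is invented, and population variances are used throughout. For items scored 0 or 1 the two coefficients coincide, because each item's variance is p(1 - p).

```python
from statistics import pvariance

# Invented data: 5 students (rows) by 4 items (columns), 1 = correct.
DATA = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]


def kr20(rows):
    """KR-20: k/(k-1) * (1 - sum(p*q) / variance of total scores)."""
    n, k = len(rows), len(rows[0])
    p = [sum(row[i] for row in rows) / n for i in range(k)]
    total_var = pvariance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(pi * (1 - pi) for pi in p) / total_var)


def cronbach_alpha(rows):
    """Alpha: k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = len(rows[0])
    item_vars = [pvariance([row[i] for row in rows]) for i in range(k)]
    total_var = pvariance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


print(round(kr20(DATA), 2), round(cronbach_alpha(DATA), 2))
```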

Reliability Indices and Interpretation
- .91 and above: Excellent reliability; at the level of the best standardized tests.
- .81 - .90: Very good for a classroom test.
- .71 - .80: Good for a classroom test; in the range of most. There are probably a few items which could be improved.
- .61 - .70: Somewhat low. This test needs to be supplemented by other measures (e.g., more tests) to determine grades. There are probably some items which could be improved.
- .51 - .60: Suggests need for revision of test, unless it is quite short (ten or fewer items). The test definitely needs to be supplemented by other measures (e.g., more tests) for grading.
- .50 or below: Questionable reliability. This test should not contribute heavily to the course grade, and it needs revision.

item "functions How valid the item is based on the total test score criterion

WHAT IS A WELL-FUNCTIONING TEST ITEM?
- How many students got it correct? (DIFFICULTY)
- Which students got it correct? (DISCRIMINATION)

THREE IMPORTANT PIECES OF INFORMATION ON THE QUALITY OF TEST ITEMS
- Item difficulty: measures whether an item was too easy or too hard.
- Item discrimination: measures whether an item discriminated between students who knew the material well and students who did not.
- Effectiveness of alternatives: determines whether distracters (incorrect but plausible answers) tend to be marked by the less able students and not by the more able students.

ITEM DIFFICULTY
Item difficulty is simply the percentage of students who answer an item correctly. In this case, it is also equal to the item mean, expressed as a percentage.
$\text{Diff} = \frac{\text{\# of students choosing correctly}}{\text{total \# of students}} \times 100$
The item difficulty index ranges from 0 to 100; the higher the value, the easier the question.

ITEM DIFFICULTY LEVEL: DEFINITION
The percentage of students who answered the item correctly (scale 0 to 100).
- High (Difficult): <= 30%
- Medium (Moderate): > 30% AND < 80%
- Low (Easy): >= 80%

ITEM DIFFICULTY LEVEL: SAMPLE
Number of students who answered each item = 50
Item No.  No. Correct Answers  % Correct  Difficulty Level
1         15                   30         High
2         25                   50         Medium
3         35                   70         Medium
4         45                   90         Low
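
The cut-offs above can be expressed as a small Python function. This is a sketch, with `difficulty_level` a hypothetical name; the thresholds (30% and 80%) and the four sample items are the ones given on the slides.

```python
def difficulty_level(n_correct, n_students):
    """Classify an item by the percentage of students answering correctly."""
    percent = 100 * n_correct / n_students
    if percent <= 30:
        return "High"    # difficult item
    if percent < 80:
        return "Medium"  # moderate item
    return "Low"         # easy item


# Sample from the slide: 50 students, items 1 to 4.
levels = [difficulty_level(correct, 50) for correct in (15, 25, 35, 45)]
print(levels)
```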

ITEM DIFFICULTY LEVEL: QUESTIONS/DISCUSSION
- Is a test that nobody failed too easy?
- Is a test on which nobody got 100% too difficult?
- Should items that are too easy or too difficult be thrown out?

ITEM DISCRIMINATION
- Traditionally computed using high and low scoring groups (upper 27% and lower 27%).
- Computerized analyses provide a more accurate assessment of the discrimination power of items, since they account for all responses rather than just the high and low scoring groups.
- Equivalent to the point-biserial correlation. It provides an estimate of the degree to which an individual item is measuring the same thing as the rest of the items.
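
A point-biserial correlation can be sketched in Python as below. The data are invented, and `pearson`, `point_biserial`, `ITEM`, and `TOTALS` are hypothetical names. The function uses the textbook form (mean of those who got the item right minus mean of those who got it wrong, divided by the population SD of total scores, times the square root of pq), which gives the same value as an ordinary Pearson correlation between the 0/1 item scores and the totals.

```python
import math

# Invented data: one item's 0/1 scores and the same students' total scores.
ITEM = [1, 1, 1, 0, 1, 0, 0, 0]
TOTALS = [9, 8, 7, 6, 5, 4, 3, 2]


def pearson(x, y):
    """Ordinary Pearson correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def point_biserial(item, totals):
    """(M_right - M_wrong) / SD_population * sqrt(p * q)."""
    n = len(item)
    p = sum(item) / n
    right = [t for i, t in zip(item, totals) if i == 1]
    wrong = [t for i, t in zip(item, totals) if i == 0]
    mean_all = sum(totals) / n
    sd = math.sqrt(sum((t - mean_all) ** 2 for t in totals) / n)
    return (sum(right) / len(right) - sum(wrong) / len(wrong)) / sd * math.sqrt(p * (1 - p))


r_pb = point_biserial(ITEM, TOTALS)
print(round(r_pb, 2), round(pearson(ITEM, TOTALS), 2))
```

The sketch assumes at least one student got the item right and at least one got it wrong; otherwise the correlation is undefined.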

WHAT IS ITEM DISCRIMINATION? Generally, students who did well on the exam should select the

correct answer to any given item on the exam. The Discrimination Index distinguishes for each item

between the performance of students who did well on the exam and students who did poorly.

INDICES OF DIFFICULTY AND DISCRIMINATION (BY HOPKINS AND ANTES)
Index         Difficulty      Discrimination
0.86 above    Very Easy       To be discarded
0.71 - 0.85   Easy            To be revised
0.30 - 0.70   Moderate        Very Good items
0.15 - 0.29   Difficult       To be revised
0.14 below    Very Difficult  To be discarded

ITEM DISCRIMINATION: QUESTIONS / DISCUSSION What factors could contribute to low item

discrimination between the two groups of students? What is a likely cause for a negative

discrimination index?

ITEM ANALYSIS PROCESS

SAMPLE TOS
           Remember            Understand  Apply  Total
Section A  4 (1,3,7,9)         6           10     20
Section B  5 (2,5,8,11,15)     5           4      14
Section C  3 (6,17,21)         7           6      16
Total      12                  18          20     50

STEPS IN ITEM ANALYSIS
1. Code the test items:
- 1 for correct and 0 for incorrect
- Vertical columns (item numbers)
- Horizontal rows (respondents/students)

TEST ITEMS
[Table: sample coded response matrix. Columns are test items 1 to 50, rows are respondents (No. 1 to 8 shown), and each cell is 1 (correct) or 0 (incorrect).]

2. In SPSS: Analyze > Scale > Reliability Analysis (drag/place variables to the Items box) > Statistics > Scale if item deleted > OK.

****** Method 1 (space saver) will be used for this analysis ******
RELIABILITY ANALYSIS - SCALE (ALPHA)
Item-total Statistics
           Scale Mean       Scale Variance   Corrected Item-    Alpha
           if Item Deleted  if Item Deleted  Total Correlation  if Item Deleted
VAR00001   14.4211          127.1053         .9401              .9502
VAR00002   14.6316          136.8440         .7332              .9542
VAR00022   14.4211          129.1410         .7311              .9513
VAR00023   14.4211          127.1053         .4401              .9502
VAR00024   14.6316          136.8440         -.0332             .9542
VAR00047   14.4737          128.6109         .8511              .9508
VAR00048   14.4737          128.8252         .8274              .9509
VAR00049   14.0526          130.6579         .5236              .9525
VAR00050   14.2105          127.8835         .7533              .9511
Reliability Coefficients
N of Cases = 57.0    N of Items = 50
Alpha = .9533

3. In the output: Alpha is placed at the bottom; the corrected item-total correlation is the point-biserial correlation, used as the basis for the index of test reliability.

4. Count the number of items discarded and fill in the summary item analysis table.

TEST ITEM RELIABILITY ANALYSIS SUMMARY (SAMPLE)
Test: Math (50 items)
Level of Difficulty  Number of Items  %   Item Number
Very Easy            1                2   1
Easy                 2                4   2,5
Moderate             10               20  3,4,10,15
Difficult            30               60  6,7,8,9,11,
Very Difficult       7                14  16,24,32

5. Count the number of items retained based on the cognitive domains in the TOS. Compute the percentage per level of difficulty.

         Remember   Understand  Apply
Section  N    Ret   N    Ret    N    Ret
A        4    1     6    3      10   3
B        5    3     5    3      4    2
C        3    2     7    4      6    3
Total    12   6     18   10     20   8
%        50%        56%         40%
Overall: 24/50 = 48%
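
The retention percentages in step 5 can be checked with a short Python sketch. The structure `TOS` and the function `retention_percent` are hypothetical; each pair holds (N, retained) for sections A, B, and C, using the counts shown on the slide.

```python
# (number of items, number retained) per section, by cognitive domain.
TOS = {
    "Remember": [(4, 1), (5, 3), (3, 2)],
    "Understand": [(6, 3), (5, 3), (7, 4)],
    "Apply": [(10, 3), (4, 2), (6, 3)],
}


def retention_percent(pairs):
    """Percentage of items retained across the given (N, retained) pairs."""
    total = sum(n for n, _ in pairs)
    retained = sum(r for _, r in pairs)
    return round(100 * retained / total)


by_domain = {domain: retention_percent(pairs) for domain, pairs in TOS.items()}
overall = retention_percent([pair for pairs in TOS.values() for pair in pairs])
print(by_domain, overall)
```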

Realistically: do item analysis on your most important tests
- end of unit tests, final exams --> summative evaluation
- common exams with other teachers (departmentalized exam)
- common exams give a bigger sample to work with, which is good
- makes sure that questions other teachers prepared are working for your class

ITEM ANALYSIS is one area where even a lot of otherwise very good classroom teachers fall down:
they think they're doing a good job; they think they're doing good evaluation; but without doing
item analysis, they don't really know.

ITEM ANALYSIS is not an end in itself; there is no point unless you use it to revise items and help students on
the basis of the information you get out of it.

