
UNIT THREE

Introduction
This unit discusses the importance of tests in educational assessment. It then describes the
steps in test construction, followed by the various types of tests, their strengths
and their weaknesses.
Specific Objectives
After studying this unit you should be able to:
1. Give reasons for testing in the classroom.
2. Describe the steps in test development.
3. Explain the various types of tests.
4. Construct various test items.
TEST CONSTRUCTION
A test is a collection of items developed to measure human educational or psychological
attributes. It can also be used to make predictions.
Bean (1953) defines a test as an organized succession of stimuli designed to measure
quantitatively or to evaluate qualitatively some mental process, trait or characteristic. For
example the reading ability of a child may be measured with the help of a test specially designed
for the purpose. His/her reading ability score may be evaluated with respect to the average
performance of the reading ability of other children of his/her age or class.
Why Teacher-Made Tests?
It is important for teachers to know how to construct their own tests because of the following:
 Teacher-made tests can be closely related to a teacher’s particular objectives and pupils,
since he/she knows the needs, strengths and weaknesses of his/her students.
 The teacher can tailor the test to fit his/her particular objectives, a class, or even
individual pupils.
 Classroom tests may be used by the teacher to help him/her develop more efficient
teaching strategies, e.g. a teacher may develop his/her own tests, administer them to the
students as pre-tests and then:
(a) Re-teach some of the information assumed known by the students.
(b) Omit some of the material planned to be taught because the students already know it.
(c) Provide some of the students with remedial instruction while giving other students
enriching experiences.
 They can be used for diagnosis, where the teacher identifies the pupils’ strengths and
weaknesses.
Achievement Tests
Achievement refers to what a person has acquired or achieved after the specific training or
instruction has been imparted. Hence achievement tests are designed to measure the effects of a
specific program of instruction or training i.e. the extent to which students have learned the
intended curriculum. Examples of achievement tests are Kenya Certificate of Secondary
Education (KCSE) examination and the Kenya Certificate of Primary Education (KCPE).
Aptitude Test
Tuckman (1975) defines aptitude as “a combination of abilities and other characteristics,
whether native or acquired, known or believed to be indicative of an individual’s ability to
acquire skill or knowledge in a particular area.” On the basis of such abilities, the future
performance of a child can be predicted. Aptitude tests are tests or examinations that measure
the degree to which students have the capacity to acquire knowledge, skills and attitudes. They
are future-oriented: the primary purpose of an aptitude test is to predict what a person can learn.
Criterion-referenced Test
Criterion-referenced tests are tests that measure the extent to which prescribed standards have
been met. They examine students’ mastery of educational objectives and are used to determine
whether a student has learned specific knowledge or skills. Criterion-referenced assessment
asks the question: can student X do Z?
Norm-Referenced Tests
These are tests which compare a student’s performance on the test with that of other students in
his/her cohort. For example, scores on a test given to Form II students can be compared to the
scores of other students in Form II. Unlike criterion-referenced tests, norm-referenced tests are
not concerned with determining how proficient a student is in a particular subject or skill. The
main problem with making normative comparisons is that they do not indicate what students
know or do not know.

Steps in Test Development


For a teacher to develop a good test he/she needs to follow these steps:
1. Planning
Good tests do not just happen; they require adequate and extensive planning so that the
instructional objectives, the teaching strategy to be employed, the teaching materials and the
evaluative procedures are all related in some meaningful fashion.
At this stage the teacher specifies the:
a. Purpose of the Test
To be helpful, classroom tests must be related to the teacher’s instructional objectives,
which in turn must be related to the methods used, and eventually to the use of the test results.
Purpose of Classroom Tests
Classroom achievement tests serve a variety of purposes such as:
a. Judging the pupil’s mastery of certain essential skills and knowledge.
b. Measuring growth over time.
c. Ranking pupils in terms of their achievement of particular instructional objectives.
d. Diagnosing pupils’ difficulties.
e. Evaluating the teacher’s instructional methods.
f. Ascertaining the effectiveness of the curriculum.
g. Encouraging good study habits.
h. Motivating students.
The teacher should not hope that because a test can serve many masters it will automatically
serve his/her intended purpose. The teacher must plan for this in advance.
b) What is to be tested?
The next major question the teacher needs to ask himself or herself is what knowledge,
skills and attitudes do I want to measure? Should I test for factual knowledge or should I
test the extent to which my students are able to apply their factual knowledge?
c) Decide on the nature of the content or items to be included

2. Prepare the Test Items: In so doing, consider Bloom’s taxonomy of instructional
objectives. All the different levels of abilities and skills should be tested.

Prepare a Table of Specification

A table of specification is a blueprint which defines as clearly as possible the scope and emphasis
of the test, relates the objectives to the content, and helps in constructing a balanced test.
A Table of Specifications is a blueprint for an objective selected response assessment.  The
purpose is to coordinate the assessment questions with the time spent on any particular content
area, the objectives of the unit being taught, and the level of critical thinking required by the
objectives or state standards.  The use of a Table of Specifications is to increase the validity and
quality of objective type assessments.  The teacher should know in advance specifically what is
being assessed as well as the level of critical thinking required of the students.  Tables of
Specifications are created as part of the preparation for the unit, not as an afterthought the night
before the test.  Knowing what is contained in the assessment and that the content matches the
standards and benchmarks in level of critical thinking will guide learning experiences presented
to students. Students appreciate knowing what is being assessed and what level of mastery is
required.
            Any question on an assessment should require students to do three things: first, access
information on the topic of the question. Second, use that knowledge to complete critical
thinking about the information. Third, determine the best answer to the question asked on the
assessment.  A Table of Specifications is a two-way chart which describes the topics to be
covered in a test and the number of items or points which will be associated with each topic.
Sometimes the types of items are described as well.  The purpose of a Table of Specifications is
to identify the achievement domains being measured and to ensure that a fair and
representative sample of questions appear on the test.
    As it is impossible, in a test, to assess every topic from every aspect, a Table of Specifications
allows us to ensure that our test focuses on the most important areas and weights different areas
based on their importance / time spent teaching. A Table of Specifications also gives us the
proof we need to make sure our test has content validity.   Tables of Specifications are designed
based on:

o course objectives
o topics covered in class
o amount of time spent on those topics
o textbook chapter topics
o emphasis and space provided in the text

 A Table of Specification could be designed in 3 simple steps:

1. Identify the domain that is to be assessed


2. Break the domain into levels (e.g. knowledge, comprehension, application …)
3. Construct the table
The more detailed a table of specifications is, the easier it is to construct the test.
A table of specification has two dimensions. The first dimension represents the different abilities that the
teacher wants the pupil to display and the second represents the specific content and skills to be
measured.

A Table of Specification for a 20-Item C.R.E Test for Form 3

Major Content Areas   Knowledge  Comprehension  Application  Analysis  Synthesis  Evaluation  Total
Call of Abraham           2            1             1           -         1           -         5
Passion of Jesus          2            -             1           1         -           1         5
Apostleship               -            2             1           -         -           2         5
Covenant at Sinai         1            -             -           2         1           1         5
Total                     5            3             3           3         2           4        20
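The arithmetic behind such a table can be sketched in code. Below is a minimal Python illustration; the function name and the weighting scheme are assumptions made up for this example, and the rounded cell counts may still need manual adjustment by the teacher so that the grand total comes out exactly right.

def allocate_items(total_items, content_weights, level_weights):
    """Spread total_items over content areas x cognitive levels in
    proportion to the two sets of weights (rounded to whole items).
    Rounding can leave the grand total slightly off; adjust by hand."""
    table = {}
    for area, area_wt in content_weights.items():
        table[area] = {level: round(total_items * area_wt * level_wt)
                       for level, level_wt in level_weights.items()}
    return table

# Illustrative weights: equal teaching time per topic, with the emphasis
# per Bloom level roughly matching the table above.
content_weights = {"Call of Abraham": 0.25, "Passion of Jesus": 0.25,
                   "Apostleship": 0.25, "Covenant at Sinai": 0.25}
level_weights = {"Knowledge": 0.25, "Comprehension": 0.15,
                 "Application": 0.15, "Analysis": 0.15,
                 "Synthesis": 0.10, "Evaluation": 0.20}

for area, row in allocate_items(20, content_weights, level_weights).items():
    print(area, row, "row total:", sum(row.values()))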

Moderation of the test: This means examining the items set by a group of teachers in the
same subject area. Questions that are ambiguous are identified and feedback is given about the
test prepared. When moderating, attention should be paid to the following:
 Ensure all the questions are within the syllabus
 Ensure that the test has face validity, i.e. that by looking at the questions one can tell that
the items are clear and not ambiguous.
 Ensure items are free of bias
 Check that appropriate action verbs are used.
 Check whether the questions can be answered in the time given
 Check whether the items sample the syllabus (content validity)
 Check whether the marks in the question paper and marking scheme tally.
3. Administering the Test
During administration you need to inform the students in advance about the test and the time
that the test will take place. Prepare a register of attendance. Write down the start and end
times of the test. Ensure there is enough room and adequate spacing in the seating
arrangement. Also make any corrections to the paper before the test starts.
4. Marking and Scoring of the Scripts
Here ensure you follow the marking scheme to avoid being biased.
5. Interpreting the Scores
Interpret the scores; you can use statistical tools e.g. the mean, mode and standard
deviation.
DEVELOPING TEST ITEMS
Developing Test Items for Different Cognitive Abilities.
Knowledge
Define the term ………………….
Describe how…………………….
List the ………………………….
State the …………………………
Name the ………………………..
Comprehension
Explain why ………………………….
Illustrate ………………………………..
Give five reasons for ……………………
Classify …………………………………
Compare ……………………………….
Application
What methods should be used to solve a problem …………………………………
Solve ………………………………….
Predict
Using ………………. Demonstrate how
Construct …………………..
Perform mouth to mouth resuscitation
Analysis
Analyze ………………………..
Differentiate …………………..
Criteria for arriving at a conclusion ……………………….
Synthesis
Summarize …………………………
Prepare a plan …………………….
Organize …………………………
Describe reasons for selection of …………………………
Argue for and against …………………………………….
Evaluation
Make an ethical judgement ………………………………
Justify ………………………………..
Critically assess this statement …………………………..
N.B.: The table of specification can aid immensely in the preparation of test items, in the
production of a valid and well balanced test, and in the clarification of objectives to both
teacher and students. The table is only a guide; it is not designed to be adhered to strictly.
Relating the Test Items to the Instructional Objectives. It is important to obtain a match between
a test’s items and the test’s instructional objectives, which is not guaranteed by the table of
specification. The table of specification only indicates the number or proportion of test items to
be allocated to each of the instructional objectives specified.
Example 1
Objective:
The student will be able to differentiate between assessment and measurement.
Test Item
Distinguish between assessment and measurement.
Example 2
Objective
Students can perform mouth-to-mouth resuscitation on a drowning victim.
Test Item:
Describe the correct procedure for administering mouth-to-mouth resuscitation on a drowning
victim.
In the first example there is a match between the learning outcome and the test item. In the
second there is no match, because describing something is not valid evidence that the person
can do it.
Writing down the Items
After preparing a table of specification, you should construct test items. There are two types of
test items: free response type and choice type.
Free Response (Essay Questions)
They require students to provide answers. They are of two types
a) Restricted type
b) Extended type

The restricted type is one that requires the examinee to supply the answer in one or two lines
and is concerned with one central concept (Marshall & Hales, 1972).
An extended essay is one where the examinee’s answer comprises several sentences and is
usually concerned with more than one central concept.
Examples
Restricted Type
Describe the meaning of reliability of an educational test.
Extended Type
Describe five characteristics of a good test
Advantages of Restricted Type of Question
- They cover a wide content
- They are easy to construct
- They are easy to mark
Disadvantage
They do not test pupils’ ability to organize or criticize.
Advantages of Extended Type
- Easy to set
- Improves students’ writing skills
- Helps develop the student’s ability to organize and express his/her ideas in a logical and
coherent manner.
- Allows for creativity since the examinee is required to give a coherent and organized answer
rather than recognize the answer.
- Leaves no room for guesswork.
- Tests the student’s ability to supply rather than select the correct answer.
Disadvantages
- Inadequate sampling of the syllabus.
- It is disadvantageous to pupils with difficulties in expressing themselves.
- Marking is highly unreliable and varies from scorer to scorer, and sometimes within the same
scorer when asked to evaluate the answer at different time intervals.
- Scoring takes a longer time because of the length of the answer.
- Open to subjectivity, i.e. the halo effect: what influences the marks may not be what was
being tested, e.g. handwriting or language.

Guidelines to Writing Good Essay Items.


1. Have clearly in mind what mental processes you want the students to use before starting to
write the question. If you want students to judge, analyze or think critically, determine what
tasks involve judgment, analysis or critical thinking. Once you determine this, use the
appropriate verbs in your question.
2. Write the questions in such a way that the task is clearly and unambiguously defined for the
student.
i) In the overall instructions preceding test items.
ii) In the test items themselves.
3. Start essay questions with phrases or words such as compare, contrast, describe, explain,
assess etc. Do not begin with such words as what, who, when and list since these words
generally lead to tasks that require only recall of information.
4. Avoid using optional items, i.e. require all students to complete the same items. Allowing
students three of five, four of seven and so forth decreases test validity as well as your
basis for comparison among students.
5. Be sure each question relates to an instructional objective.
6. The learner should be guided on how many points or factors are required by the examiner.
This guards against vague or rambling writing, e.g. poor item: Describe the struggle for
independence in Kenya. Better item: Describe five methods used in the struggle for
independence in Kenya.
Suggestions for Grading Essay Items
1. Check your marking scheme against actual responses. Before actually beginning to mark
the exam papers, it is recommended that a few papers are selected at random to ascertain
the appropriateness of the marking scheme. This also helps in updating the marking
scheme.
2. Be consistent in your grading. Graders are human and may be influenced by the first few
papers they read and thereby grade them either too leniently or too harshly depending
on their initial mind set (Hales & Tokar, 1975). For this reason it is important that once
grading has started teachers should occasionally refer to the first few papers graded to
satisfy themselves that standards are being applied consistently. This may be especially true
for those papers read near the end of the day when the reader might be physically and
mentally tired.
3. Randomly shuffle the papers before grading them. Research shows that a student’s essay
grade will be influenced by the position of his/her paper especially if the preceding
answers were either very good or very poor.
It is hence recommended that the examiner shuffles the papers prior to grading to reduce
the bias introduced by paper position. This is especially significant if the teacher is working
with high- and low-level classes and reads the best papers first or last.
4. Grade only one question at a time for all papers. To reduce the “halo” effect it is
recommended that teachers grade one question at a time rather than one paper
(containing several responses) at a time. This also makes it possible for the examiner to
concentrate and become thoroughly familiar with one set of scoring criteria and not be
distracted by moving from one question to another.
5. Try to grade all responses to a particular question without interruption. One source of
unreliability is that the grader’s standards may vary markedly from one day to the next
and even from morning to afternoon of the same day. If a lengthy break is taken the
reader should re-read some of the first few papers to re-familiarize him/herself with
his/her grading standards so that she/he will not change them mid-stream.
6. The mechanics of expression should be judged separately from the content. For those
teachers who feel that the mechanics of expression are very important, it is
recommended that they assign a proportion of the question’s value to such factors as
legibility, spelling, handwriting, punctuation, and grammar.
The proportion assigned to these factors should be spelt out in the grading criteria and
the students should be informed in advance.
7. Provide comments and correct errors. Although it is time consuming to write comments
and correct errors, it should be done if we are to help the student become a better
learner.
This also helps the teacher in explaining his/her method of assigning a particular grade.
Oral Questions
The oral question is a variation of the essay question. It is well suited for testing students who
are unable to write because of physical disabilities, or for assessing oral competence in
languages.
Advantages
1. They permit the examiner to determine how well the student can synthesize and organize
his ideas and express himself.
2. They require the pupil to know and be able to supply the answer.
3. They allow students to demonstrate their oral competence in mastery of a language.
4. They permit free response by the student.
5. Permits detailed probing by the examiner
6. Students may ask for clarification.
Limitation
1. They provide for a limited sampling of content
2. They have low rater reliability.
3. They are time consuming (only one student can be tested at a time).
4. Do not permit or provide for any record of the examinee’s response to be used for future
action by the teacher and pupil unless the examination process is recorded.
Choice Items
They require controlled response from the candidate and are at times referred to as objective
type. Examples include
 Multiple choice
 True/false
 Matching
 Completion
Multiple Choice Item
It consists of two parts: (1) the stem, which contains the problem, and
(2) a list of suggested answers (responses or options).
The incorrect responses are often called distracters.
The correct response is called the Key.
The stem may be stated as a direct question or an incomplete statement.
There are five variations of the multiple-choice item:
 Correct answer type
 Best answer
 Incomplete statement
 Multiple response
 Negative variety
a) Correct Answer Type:
This is the simplest type of multiple choice item. The student is told to select the one
correct answer listed among several plausible but incorrect options.

Example
When a test item and the objective it is intended to measure match the item
a. Is called an objective item
b. Has content validity
c. Is too easy
d. Should be discarded
b) Best Answer Type:
The directions are similar to those of the single correct answer except the student is told
to select the best answer.
Example
Which of the following is the most important agent of curriculum implementation
a. Teacher
b. Inspectorate
c. Curriculum development center
d. Parents Teachers Associations
c) Incomplete Statement
The stem is an incomplete statement rather than a question. It is best for lower cognitive levels.
For example: The first president of Uganda was ______________________
a. Kabaka Mutesa I
b. Obote
c. Museveni
d. Idi Amin
d) Multiple Response Type
The candidate is required to endorse more than one response. Example:
Which of the following reasons explain why a teacher needs to prepare a lesson plan in
advance?
i. To enable him/her collect the necessary material in good time.
ii. To enable him/her focus on the questions that pupils are likely to ask.
iii. To keep a meaningful record of what has been taught to a given class.
iv. To visualize and organize a complete learning situation in advance.
a) i &ii
b) iii & iv
c) i, iii & iv
d) i, ii & iii
e) Negative Variety Type
Here all the responses are correct except one. Example:
Which of the following is not a good reason for organizing an educational visit?
a. Correlate several school subjects
b. Broaden students experiences beyond the classroom
c. Break the monotony of the class.
d. Arouse the students’ curiosity and develop an inquiring mind.
How to Construct Effective Multiple Choice Items
1. The stem should contain the central problem so that the student will have some idea as to
what is expected of him/her and some tentative answer in mind before he/she begins to read
the options.
Poor Stem: A criterion reference test. This example is poor because it does not ask a question
or set a task. It is essential that the intent of the item be stated clearly in the stem.
2. Avoid repetition of words in the options. The stem should be written so that the key words
are incorporated in the stem and will not have to be repeated in each option.
Poor: According to Engel’s Law
a. Family expenditures for food increase in accordance with the size of the
family.
b. Family expenditures for food decrease as income increases.
c. Family expenditures for food require a smaller percentage of an increasing
income.
d. Family expenditures for food rise in proportion to income.
Better: According to Engel’s Law, family expenditures for food:
a. Increase in accordance with the size of the family
b. Decrease as income increases
c. Require a smaller percentage of an increasing income
d. Rise in proportion to income
3. Avoid making the key consistently longer or shorter than the distracters.
4. Avoid giving irrelevant clues to the correct answer. The length of the answer may be a clue
others may be of a grammatical nature such as the use of “a” or “an” at the end of a
statement and using a singular or plural subject and/ or verb in the stem with just one or
two singular or plural options.

Poor: Roosevelt was an
a. President
b. Man
c. Alcoholic
d. General
(Here the article “an” points directly to the answer.)
5. There should be only one key.
6. An item in the test should not reveal the answer to another item.
Example:
Item 1: The “halo” effect is pronounced in essay tests. The best way to minimize its effects is
to:
a. Provide optional questions
b. “Aim” the student to the desired response
c. Read all responses to one question before reading the responses to the other questions
d. Permit students to write essays at home
Item 10: In what type of test is the “halo” effect most operative?
a. Essay
b. Matching
c. True-false
d. Short-answer
The student can obtain the correct answer to item 10 from item 1.
7. The distracters should be plausible and homogeneous. The student should be forced to read
and consider all options. No distracter should be automatically eliminated by the student
because it is irrelevant or a stupid answer.
Advantages of Multiple Choice Tests
 Easy to mark/score
 Objective marking
 Covers a wide content area
 Can lend itself to machine scoring
 Provides a level playing field for students who are strong in language and those who are weak
in it
Limitations of Multiple Choice Tests
 Susceptible to guess-work
 Not fit for measuring arguments, opinions and creativity
 Encourages rote learning
 Difficult to set and takes a long time to prepare
 Do not test ability to communicate or organize ideas
True – False Items
These are items expressed in the form of a declarative statement which is either entirely
true or entirely false.
Advantages
 Easy to set and mark
 More items can be answered
 Samples a wide content area
 Favors both the linguistically gifted and poor candidates.
 A sure way of detecting misconceptions in learners.
Disadvantages
 Susceptible to guess work
 Tests trivial facts
 Does not test higher cognitive abilities i.e. analysis, evaluation etc.
 Many statements are relative rather than absolutely true or false
 Susceptible to ambiguity and misinterpretation
Suggestions for Writing True-False Items
1 Construct statements that are definitely true or definitely false.
2 Keep true and false statements at approximately the same length and be sure that there
are approximately equal numbers of true and false items.
3 Keys should not fall in a pattern
4 The statement should include only a single issue.
Matching Items
The candidates are asked to match up items in two columns.
The matching exercise is well suited to those situations where one is interested in testing the
knowledge of terms, definitions, dates, events and other matters involving simple relationships.
Example:
For each definition below, select the most appropriate term from the set of terms at the right.
Mark your answer in the blank before each definition.
Definitions                                            Terms
1. A professional judgement of the adequacy            1. Behavioral objective
   of test scores
2. Determination of the amount of some skill           2. Criterion-referenced test
   or trait
3. Specification of what a child must do to            3. Evaluation
   indicate mastery of a skill
4. A series of tasks or problems                       4. Measurement
5. Tests used to compare individuals                   5. Norm-referenced test
Advantages of Matching Exercises
 Easy to mark and score
 Easy to set
 Provide a level playing field for those with good or poor language skills
Disadvantages
 Limited to measuring factual information
 Limited to only parts of the content
 Susceptible to guess work
Completion Items
They are also referred to as supply items. The examinee is expected to complete or fill in the
blank spaces so that the sentence is complete.
Example
The longest river in Africa is _____________________.
Guidelines for writing completion items:
 The item should be clear and unambiguous. Word each item in specific terms with a clear
meaning so that the intended answer is the only one possible.

Note: It is important that you prepare a table of specification when constructing your tests. It
ensures balance and comprehensiveness in a test.

The Marking scheme and its relevance to teachers

Principles for developing marking guidelines in a classroom test

1. Marking guidelines are developed in the context of relevant syllabus outcomes
and content.
2. Marks are awarded for demonstrating achievement of aspects of the syllabus
outcomes addressed by the question.
3. Marking guidelines reflect the nature and intention of the question and are
expressed in terms of the knowledge and skills demanded by the task.
4. Marking guidelines indicate the initial criteria that will be used to award marks.
5. Marking guidelines allow for less predictable and less defined responses, for
example, characteristics such as flair, originality and creativity, or the provision
of alternative solutions where appropriate.
6. Marking guidelines for extended responses use language that is consistent with
the outcomes and the band descriptions for the subject.
7. Marking guidelines incorporate the generic rubric provided in the
examination paper as well as aspects specifically related to the question.
8. The language of marking guidelines should be clear, unambiguous and accessible
to ensure consistency in marking.
9. Where a question is designed to test higher-order outcomes, the marking
guidelines will allow for differentiation between responses, with more marks
being awarded for the demonstration of higher-order outcomes.
10. Marking guidelines will indicate the quality of response required to gain a mark
or a sub-range of marks.
11. High achievement will not be defined solely in terms of the quantity of
information provided.
12. Optional questions within a paper will be marked using comparable marking
criteria.
13. Marking guidelines for questions that can be answered using a range of contexts
and/or content will have a common marking guideline exemplified using
appropriate contexts and/or content.

The importance of preparing a marking scheme in assessment of learning in the classroom

 As part of a teaching and learning strategy, it can aid learners in creating meaning from
previously learned content
 It provides an opportunity for students to be a part of the thinking process
around judging performance and deepens their understanding of what is
required
 It can also allow for discussion and agreement to be reached about the
meanings of certain words and phrases in the context of the assessment task
 It provides an opportunity for students to learn how marks are allocated in relation to the
task and content, which in turn aids them to prepare for assessment more effectively
 It forms a framework for decision making
Merits of preparing a marking scheme in assessment of learning in the classroom
 Marks available for each part of the question are distributed so it is easy to
justify awarding the marks
 The content of the answer is specified and this guides the scoring
 It clearly indicates how other areas may be scored, for instance the grammar and
expression of ideas
 Extra information to help the examiner make his or her judgement is provided
 It specifies what is acceptable or not worthy of credit and, in discursive answers,
gives an overview of the area in which a mark or marks may be awarded
 It serves as feedback to the teacher in regard to the depth of content coverage
Major Uses of Item Analysis

Item analysis can be a powerful technique available to instructors for the guidance and
improvement of instruction. For this to be so, the items to be analyzed must be valid
measures of instructional objectives. Further, the items must be diagnostic, that is,
knowledge of which incorrect options students select must be a clue to the nature of the
misunderstanding, and thus prescriptive of appropriate remediation.

In addition, instructors who construct their own examinations may greatly improve the
effectiveness of test items and the validity of test scores if they select and rewrite their
items on the basis of item performance data.

Item Analysis Guidelines

Item analysis is a completely futile process unless the results help instructors improve
their classroom practices and item writers improve their tests. Let us suggest a number
of points of departure in the application of item analysis data.

1. Item analysis gives necessary but not sufficient information concerning the
appropriateness of an item as a measure of intended outcomes of instruction. An
item may perform beautifully with respect to item analysis statistics and yet be
quite irrelevant to the instruction whose results it was intended to measure. A
most common error is to teach for behavioral objectives such as analysis of data
or situations, ability to discover trends, ability to infer meaning, etc., and then to
construct an objective test measuring mainly recognition of facts. Clearly, the
objectives of instruction must be kept in mind when selecting test items.
2. An item must be of appropriate difficulty for the students to whom it is
administered. If possible, items should have indices of difficulty no less than 20
and no greater than 80. It is desirable to have most items in the 30 to 50 range of
difficulty. Very hard or very easy items contribute little to the discriminating
power of a test.
3. An item should discriminate between upper and lower groups. These groups are
usually based on total test score but they could be based on some other criterion
such as grade-point average, scores on other tests, etc. Sometimes an item will
discriminate negatively, that is, a larger proportion of the lower group than of the
upper group selected the correct option. This often means that the students in the
upper group were misled by an ambiguity that the students in the lower group,
and the item writer, failed to discover. Such an item should be revised or
discarded.
4. All of the incorrect options, or distracters, should actually be distracting.
Preferably, each distracter should be selected by a greater proportion of the lower
group than of the upper group. If, in a five-option multiple-choice item, only one
distracter is effective, the item is, for all practical purposes, a two-option item.
Existence of five options does not automatically guarantee that the item will
operate as a five-choice item.
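The check described in point 4 can be automated. The following Python sketch is illustrative only (the function name and data are invented): it splits examinees at the median total score and counts how often each option was chosen by the upper and lower halves, so the teacher can see whether each distracter pulls more from the lower group.

from collections import Counter

def distractor_counts(responses, key):
    """responses: list of (total_score, chosen_option) pairs for one item.
    Counts option choices in the upper and lower halves of the class;
    a distracter is working when the lower group picks it more often."""
    cutoff = sorted(s for s, _ in responses)[len(responses) // 2]
    upper = Counter(opt for s, opt in responses if s >= cutoff)
    lower = Counter(opt for s, opt in responses if s < cutoff)
    for opt in sorted(set(upper) | set(lower)):
        role = "key" if opt == key else "distracter"
        print(f"{opt} ({role}): upper = {upper[opt]}, lower = {lower[opt]}")

# Made-up data; option 'c' is the key. Options never chosen (here 'd'
# and 'e') simply do not appear in the counts, which is itself a warning.
data = [(90, "c"), (85, "c"), (80, "b"), (75, "c"),
        (60, "a"), (55, "b"), (50, "a"), (40, "c")]
distractor_counts(data, key="c")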

Aim of Item analysis

 How well did my test distinguish among students according to how well they
met my learning goals?

Recall that each item on your test is intended to sample performance on a particular
learning outcome. The test as a whole is meant to estimate performance across the full
domain of learning outcomes targeted.

One way to assess how well your test is functioning for this purpose is to look at how
well the individual items do so. The basic idea is that a good item is one that good
students get correct more often than do poor students. An item analysis gets at the
question of whether your test is working by asking the same question of all individual
items—how well does it discriminate? In short, item analysis gives the teacher a way to
exercise additional quality control over his/her tests. Well-specified learning objectives
and well-constructed items give teachers a head-start in that process, but item analyses
can give you feedback on how successful you actually were. Item analyses can also help
you diagnose why some items did not work especially well, and thus suggest ways to
improve them (for example, if you find distracters that attracted no one, try developing
better ones). The important test for an item’s discriminability is to compare it to the
maximum possible. How well did each item discriminate relative to the maximum
possible for an item of its particular difficulty level? Here is a rough rule of thumb:

 Discrimination index is near the maximum possible = very discriminating item
 Discrimination index is about half the maximum possible = moderately discriminating item
 Discrimination index is about a quarter the maximum possible = weak item
 Discrimination index is near zero = non-discriminating item
 Discrimination index is negative = bad item (delete it if worse than -.10)
In addition to these and other qualitative procedures, a thorough item analysis also
includes a number of quantitative procedures. Specifically, three numerical indicators
are often derived during an item analysis: item difficulty, item discrimination, and
distractor power statistics.

Item Difficulty Index (p)

The item difficulty statistic is an appropriate choice for achievement or aptitude tests
when the items are scored dichotomously (i.e., correct vs. incorrect). Thus, it can be
derived for true-false, multiple-choice, and matching items, and even for essay items,
where the instructor can convert the range of possible point values into the categories
“passing” and “failing.”

The item difficulty index, symbolized p, can be computed simply by dividing the number
of test takers who answered the item correctly by the total number of students who
answered the item. As a proportion, p can range between 0.00, obtained when no
examinees answered the item correctly, and 1.00, obtained when all examinees
answered the item correctly. Notice that no test item need have only one p value. Not
only may the p value vary with each class group that takes the test, an instructor may
gain insight by computing the item difficulty level for a number of different subgroups
within a class, such as those who did well on the exam overall and those who performed
more poorly.
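As a concrete illustration of this computation, here is a minimal Python sketch (the function name is ours, not a standard library call) for a dichotomously scored item:

def item_difficulty(item_scores):
    """p = number who answered correctly / number who answered.
    item_scores holds 1 for each correct and 0 for each incorrect answer."""
    return sum(item_scores) / len(item_scores)

# Two of ten examinees answered correctly, so p = 0.20.
print(item_difficulty([1, 0, 0, 1, 0, 0, 0, 0, 0, 0]))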

Although the computation of the item difficulty index p is quite straightforward, the
interpretation of this statistic is not. To illustrate, consider an item with a difficulty level
of 0.20. We do know that 20% of the examinees answered the item correctly, but we
cannot be certain why they did so. Does this item difficulty level mean that the item was
challenging for all but the best prepared of the examinees? Does it mean that the
instructor failed in his or her attempt to teach the concept assessed by the item? Does it
mean that the students failed to learn the material? Does it mean that the item was
poorly written? To answer these questions, we must rely on other item analysis
procedures, both qualitative and quantitative ones.

Item Discrimination Index (D)

Item discrimination analysis deals with the fact that often different test takers will
answer a test item in different ways. As such, it addresses questions of considerable
interest to most faculty, such as, “does the test item differentiate those who did well on
the exam overall from those who did not?” or “does the test item differentiate those who
know the material from those who do not?” In a more technical sense then, item
discrimination analysis addresses the validity of the items on a test, that is, the extent to
which the items tap the attributes they were intended to assess. As with item difficulty,
item discrimination analysis involves a family of techniques. Which one to use depends
on the type of testing situation and the nature of the items. I’m going to look at only one
of those, the item discrimination index, symbolized D. The index parallels the difficulty
index in that it can be used whenever items can be scored dichotomously, as correct or
incorrect, and hence it is most appropriate for true-false, multiple-choice, and matching
items, and for those essay items which the instructor can score as “pass” or “fail.”

We test because we want to find out if students know the material, but all we learn for
certain is how they did on the exam we gave them. The item discrimination index tests
the test in the hope of keeping the correlation between knowledge and exam
performance as close as it can be in an admittedly imperfect system.

The item discrimination index is calculated in the following way:

1. Divide the group of test takers into two groups, high scoring and low scoring.
Ordinarily, this is done by dividing the examinees into those scoring above and
those scoring below the median. (Alternatively, one could create groups made up
of the top and bottom quintiles or quartiles or even deciles.)
2. Compute the item difficulty levels separately for the upper (p_upper) and lower
(p_lower) scoring groups.
3. Subtract the two difficulty levels such that D = p_upper − p_lower.
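The three steps can be expressed compactly in code. This Python sketch uses the median split described above; the scores are invented so that the item comes out as the perfect positive discriminator discussed next.

def discrimination_index(total_scores, item_scores):
    """D = p_upper - p_lower, using a median split on total test score.
    total_scores[i] and item_scores[i] (1 = correct, 0 = incorrect)
    belong to the same examinee."""
    paired = sorted(zip(total_scores, item_scores))  # order by total score
    half = len(paired) // 2
    lower, upper = paired[:half], paired[half:]
    p_lower = sum(item for _, item in lower) / len(lower)
    p_upper = sum(item for _, item in upper) / len(upper)
    return p_upper - p_lower

# Everyone above the median passes the item, everyone below fails it.
totals = [95, 90, 85, 80, 40, 35, 30, 25]
item = [1, 1, 1, 1, 0, 0, 0, 0]
print(discrimination_index(totals, item))  # prints 1.0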

How is the item discrimination index interpreted? Unlike the item difficulty level p , the
item discrimination index can take on negative values and can range between -1.00 and
1.00. Consider the following situation: suppose that overall, half of the examinees
answered a particular item correctly, and that all of the examinees who scored above the
median on the exam answered the item correctly and all of the examinees who scored
below the median answered incorrectly. In such a situation p_upper = 1.00 and p_lower =
0.00. As such, the value of the item discrimination index D is 1.00 and the item is said
to be a perfect positive discriminator. Many would regard this outcome as ideal. It
suggests that those who knew the material and were well-prepared passed the item
while all others failed it.

Though it’s not as unlikely as winning a million-dollar lottery, finding a perfect positive
discriminator on an exam is relatively rare. Most psychometricians would say that items
yielding positive discrimination index values of 0.30 and above are quite good
discriminators and worthy of retention for future exams.

Finally, notice that the difficulty and discrimination are not independent. If all the
students in both the upper and lower levels either pass or fail an item, there’s nothing in
the data to indicate whether the item itself was good or not. Indeed, the value of the item
discrimination index will be maximized when only half of the test takers overall answer
an item correctly; that is, when p = 0.50. Once again, the ideal situation is one in which
the half who passed the item were students who all did well on the exam overall.

Does this mean that it is never appropriate to retain items on an exam that are passed by
all examinees, or by none of the examinees? Not at all. There are many reasons to
include at least some such items. Very easy items can reflect the fact that some relatively
straightforward concepts were taught well and mastered by all students. Similarly, an
instructor may choose to include some very difficult items on an exam to challenge even
the best-prepared students.

The Difficulty Index

 The difficulty index is a measure of the probability of passing individual test items
 Very useful in assessing whether the students have learned the concept or have
acquired the cognitive skill that the question calls for
 Serves as a feedback mechanism to the teacher and student
 Can also be used by teachers to improve the quality of test items
 Helps in identifying specific areas of course content which need greater emphasis
or clarity
 Very useful in identifying specific areas of strength and weakness of learners
The discrimination index
 Useful in establishing item effectiveness
 Provides evidence of learning difficulties in relation to cognitive tasks
 A useful technique in decision making
 Creates an opportunity for teachers to build their skills in test development
 Provides evidence of internal consistency
 Very useful in identifying specific areas of strength and weakness of learners
 An effective strategy in analysis of the effectiveness of teaching methods
 Useful in identifying learning misconceptions among learners
 Very useful in assessing transfer of training

Measures of Central Tendency

The mean is the arithmetic average of the scores. The median is the number that divides the
scores into two equal groups, and the mode is the score that occurs most frequently.
The shape of a distribution of your test scores can provide useful clues about your test
and your students’ performance. When representing students’ scores on a graph, the
scores often will be positively or negatively skewed. When the distribution is positively
skewed, that implies that the most frequent scores (the mode) and the median are below
the mean. If your test is very difficult, there may be many low scores and few high ones.
The distribution of scores would then be positively skewed, with its tail pointing to the
right.

When the tail points to the left, the distribution is negatively skewed. In this distribution
there are high scores and relatively few low scores. Notice that the mean is influenced by
the skewing.
 

The mean can be distorted if there are some scores that are extremely different (outliers)
from the mean of the majority of scores for the group. Consequently, when outliers are
present the median is often the more descriptive measure of central tendency.
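A quick way to see these relationships is to compute all three measures on a skewed set of scores. The Python sketch below uses made-up scores: a few high outliers pull the mean above the median, which is consistent with positive skew.

from statistics import mean, median, mode

# Made-up scores from a difficult test: many low marks, two very high ones.
scores = [22, 35, 38, 40, 40, 42, 45, 48, 55, 90, 95]

print("mean:", mean(scores))      # 50, pulled up by the outliers 90 and 95
print("median:", median(scores))  # 42, unaffected by the outliers
print("mode:", mode(scores))      # 40, the most frequent score
# mean > median here, the usual signature of a positively skewed distribution.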

Indicators of Variability

Variability is the dispersion of the scores within a distribution. Given a test, a group of
students with a similar level of performance on a specific skill tend to have scores close
to the mean. Another group with varying levels of performance will have scores widely
spread and further from the mean. In other words, how varied are the scores? Two
common measures of variability are the range and standard deviation.

Range

The range, R, is the difference between the lowest and the highest scores in a
distribution. The range is easy to compute and interpret, but it only indicates the
difference between the two extreme scores in a set.

If we use the scores from Mr. Walker’s class (listed below), we would calculate the range as:
Range (R) = the highest score – the lowest score in the distribution.

95 91 100 96 92 91 87 84 70 65 96 65 56 86 43 65 22 40 93

R = 100 - 22 = 78, so the range is 78.

Standard Deviation

A more useful statistic than simply knowing the range of scores would be to see how
widely dispersed different scores are from the mean. The most common measure of
variability is the standard deviation (SD). The standard deviation is defined as the
numeric index that describes how far away from the mean the scores in the distribution
are located. The formula for the standard deviation is:

SD = √( Σ(X − M)² / N )

where X = the test score, M = the mean, and N = the number of scores.


The higher the standard deviation, the wider the distribution of the scores is around the
mean. This indicates a more heterogeneous or dissimilar spread of raw scores on a scale.
A lower value of the standard deviation indicates a narrower distribution (more similar
or homogeneous) of the raw scores around the mean.
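To make the computation concrete, here is a short Python sketch that computes the range and the standard deviation for Mr. Walker’s scores listed above, dividing by N as in the formula given:

from math import sqrt

# Mr. Walker's class scores, as listed in the Range section above.
scores = [95, 91, 100, 96, 92, 91, 87, 84, 70, 65,
          96, 65, 56, 86, 43, 65, 22, 40, 93]

R = max(scores) - min(scores)  # range: 100 - 22 = 78

M = sum(scores) / len(scores)  # the mean, M
SD = sqrt(sum((x - M) ** 2 for x in scores) / len(scores))  # SD = sqrt(sum((X - M)^2) / N)

print("range:", R)
print("mean:", round(M, 2), "SD:", round(SD, 2))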

The characteristics of the mean


 It is an interval statistic, which is a superior level of measurement compared to
lower levels
 More precise than median or mode
 Takes into account every score in the distribution
 Most stable measure of central tendency
 Best indicator of combined performance in the classroom
The demerits of the mean as an effective measure of central tendency in the classroom
tests
 It is an interval statistic and therefore inappropriate for ordinal level of
measurement
 Sensitive to extreme scores or even outliers
 Affected by skewed data
 It is inadequate on its own as a measure of central tendency; sometimes the value
calculated is not meaningful
 Cannot be computed for qualitative data
 Cannot be determined graphically
The median as a measure of central tendency
 It is the middle score in a distribution
 Usually represents the ordinal level of measurement
 Very useful in calculation of the skewness of a distribution
 Ranking of performance can be done using the median score
 Not sensitive to extreme scores
Properties of the Standard deviation
 The standard deviation is only used to measure spread or dispersion around the
mean of a data set.
 Standard deviation is never negative.
 Standard deviation is sensitive to outliers. A single outlier can raise the standard
deviation and in turn, distort the picture of spread.
 For data with approximately the same mean, the greater the spread, the greater
the standard deviation.
 If all values of a data set are the same, the standard deviation is zero (because
each value is equal to the mean).

Activities
1. Construct five extended essay questions and five restricted essay questions in your
subject area
2. Prepare a table of specification for a 20 item multiple choice test.

Written Exercises
1. Distinguish between testing and assessment
2. Describe the main steps in the development of tests.
3. Outline the main considerations you would bear in mind in the construction of tests.
