Introduction
This unit discusses the importance of tests in educational assessment. It is followed by a
description of the steps in test construction. It further describes types of tests, their strengths
and their weaknesses.
Specific Objectives
After studying this unit you should be able to:
1. Give reasons for testing in the classroom.
2. Describe the steps in test development.
3. Explain the various types of tests.
4. Construct various test items.
TEST CONSTRUCTION
A test is a collection of items developed to measure human educational or psychological
attributes. It can also be used to make predictions.
Bean (1953) defines a test as an organized succession of stimuli designed to measure
quantitatively or to evaluate qualitatively some mental process, trait or characteristic. For
example the reading ability of a child may be measured with the help of a test specially designed
for the purpose. His/her reading ability score may be evaluated with respect to the average
performance of the reading ability of other children of his/her age or class.
Why Teacher Made Tests
It is important for teachers to know how to construct their own tests because of the following:
Teacher-made tests can be closely related to a teacher’s particular objectives and pupils,
since he/she knows the needs, strengths and weaknesses of his/her students.
The teacher can tailor the test to fit his/her particular objectives, to fit a class, or even to fit
individual pupils.
Classroom tests may be used by the teacher to help him/her develop more efficient
teaching strategies, e.g. a teacher may develop his/her own tests, administer them to the
students as pre-tests, and then:
(a) Re-teach some of the information assumed known by the students.
(b) Omit some of the material planned to be taught because the students already know it.
(c) Provide some of the students with remedial instruction while giving other students some
enriching experience.
Teacher-made tests are also used for diagnosis, where the teacher identifies the pupil’s
strengths and weaknesses.
Achievement Tests
Achievement refers to what a person has acquired or achieved after the specific training or
instruction has been imparted. Hence achievement tests are designed to measure the effects of a
specific program of instruction or training i.e. the extent to which students have learned the
intended curriculum. Examples of achievement tests are Kenya Certificate of Secondary
Education (KCSE) examination and the Kenya Certificate of Primary Education (KCPE).
Aptitude Test
Tuckman (1975) defines aptitude as “a combination of abilities and other characteristics,
whether native or acquired, known or believed to be indicative of an individual’s ability to
acquire skill or knowledge in a particular area.” On the basis of such abilities, the future
performance of a child can be predicted. Aptitude tests are tests or examinations that measure
the degree to which students have the capacity to acquire knowledge, skills and attitudes. The
primary purpose of an aptitude test is to predict what a person can learn; such tests are
future oriented.
Criterion-referenced Test
Criterion-referenced tests are tests that measure the extent to which prescribed standards have
been met. They examine students’ mastery of educational objectives and are used to determine
whether a student has learned specific knowledge or skills. Criterion-referenced assessment
asks the question: can student X do Z?
Norm Referenced Tests
These are tests which compare a student’s performance on the test with that of other students in
his/her cohort. For example, scores on a test given to Form II students can be compared to the
scores of other students in Form II. Unlike criterion-referenced tests, norm referenced tests are
not concerned with determining how proficient a student is in a particular subject or skill. The
main problem with making normative comparisons is that they do not indicate what students
know or do not know.
1. Planning the Test: The content of the test should be sampled according to:
o course objectives
o topics covered in class
o amount of time spent on those topics
o textbook chapter topics
o emphasis and space provided in the text
2. Moderation of the Test: This means examining the items set by a group of teachers in the
same subject area. Questions that are ambiguous are identified and feedback is given about the
test prepared. When moderating, attention should be paid to the following:
Ensure all the questions are within the syllabus
Ensure that the test has face validity, i.e. by looking at the questions one can tell whether
the items are clear and not ambiguous.
Ensure items are free of bias
Check that appropriate action verbs are used.
Check whether the questions can be answered in the time given
Check whether the items sample the syllabus (content validity)
Check whether the marks in the question paper and marking scheme tally.
3. Administering of the Test
During administration you need to inform the students in advance about the test and the time
that the test will take place. Prepare a register of attendance. Write the start and end time of
the test. Ensure there is enough room and adequate spacing in the seating arrangement.
Also make any corrections to the question paper before the test starts.
4. Marking and Scoring of the Scripts
Ensure you follow the marking scheme to avoid bias.
5. Interpretation of the Scores
Interpret the scores using statistical tools, e.g. the mean, mode and standard deviation.
DEVELOPING TEST ITEMS
Developing Test Items for Different Cognitive Abilities.
Knowledge
Define the term ………………….
Describe how…………………….
List the ………………………….
State the …………………………
Name the ………………………..
Comprehension
Explain why ………………………….
Illustrate ………………………………..
Give five reasons for ……………………
Classify …………………………………
Compare ……………………………….
Application
What methods should be used to solve a problem …………………………………
Solve ………………………………….
Predict ………………………………
Using ………………., demonstrate how ………………
Construct …………………..
Perform mouth-to-mouth resuscitation
Analysis
Analyze ………………………..
Differentiate …………………..
Criteria for arriving at a conclusion ……………………….
Synthesis
Summarize …………………………
Prepare a plan …………………….
Organize …………………………
Describe reasons for selection of …………………………
Argue for and against …………………………………….
Evaluation
Make an ethical judgement ………………………………
Justify ………………………………..
Critically assess this statement …………………………..
N.B.: The table of specification can aid immensely in the preparation of test items and in the
production of a valid and well-balanced test. It also helps clarify objectives to both teacher
and students. The table is only a guide; it is not designed to be adhered to strictly.
Relating the Test Items to the Instructional Objectives: It is important to obtain a match
between a test’s items and its instructional objectives, which is not guaranteed by the table of
specification. The table of specification only indicates the number or proportion of test items to
be allocated to each of the instructional objectives specified.
Example 1
Objective:
The student will be able to differentiate between assessment and measurement.
Test Item
Distinguish between assessment and measurement.
Example 2
Objective
Students can perform mouth-to-mouth resuscitation on a drowning victim.
Test Item:
Describe the correct procedure for administering mouth-to-mouth resuscitation on a drowning
victim.
In the first example there is a match between the learning outcome and the test item. In the
second there is no match, because describing a procedure is not valid evidence that the person
can perform it.
Writing down the Items
After preparing a table of specification, you should construct test items. There are two types of
test items: the free-response type and the choice type.
Free Response (Essay Questions)
These require students to supply their own answers. They are of two types:
a) Restricted type
b) Extended type
The restricted type is one that requires the examinee to supply the answer in one or two lines
and is concerned with one central concept (Marshall & Hales, 1972).
An extended essay is one where the examinee’s answer comprises several sentences and is
usually concerned with more than one central concept.
Examples
Restricted Type
Describe the meaning of reliability of an educational test.
Extended Type
Describe five characteristics of a good test
Advantages of Restricted Type of Question
- They cover a wide range of content
- They are easy to construct
- They are easy to mark
Disadvantage
They do not test pupils’ ability to organize or criticize.
Advantages of Extended Type
- Easy to set.
- Improve students’ writing skills.
- Help develop the student’s ability to organize and express his/her ideas in a logical and
coherent manner.
- Allow for creativity, since the examinee is required to give a coherent and organized answer
rather than recognize the answer.
- Leave no room for guesswork.
- Test the student’s ability to supply rather than select the correct answer.
Disadvantages
- Inadequate sampling of the syllabus.
- They are disadvantageous to pupils with difficulties in expressing themselves.
- Marking is highly unreliable and varies from scorer to scorer, and sometimes within the same
scorer when asked to evaluate the answer at different time intervals.
- Scoring takes a longer time because of the length of the answers.
- Marking is open to subjectivity, i.e. the halo effect: what influences the marks may not be
what was being tested, e.g. handwriting or language.
a) Single Correct Answer Type
The student is directed to select the one correct answer from the options given.
Example
When a test item and the objective it is intended to measure match, the item
a. Is called an objective item
b. Has content validity
c. Is too easy
d. Should be discarded
b) Best Answer Type:
The directions are similar to those of the single correct answer except the student is told
to select the best answer.
Example
Which of the following is the most important agent of curriculum implementation?
a. Teacher
b. Inspectorate
c. Curriculum development center
d. Parents Teachers Associations
c) Incomplete Statement
The stem is an incomplete statement rather than a question. It is best for lower levels.
For example the first president of Uganda was ______________________
a. Kabaka Mutesa I
b. Obote
c. Museveni
d. Idi Amin
d) Multiple Response Type
The candidate is required to endorse more than one response.
Which of the following reasons explain why a teacher needs to prepare a lesson plan in
advance?
i. To enable him/her collect the necessary material in good time.
ii. To enable him/her focus on the questions that pupils are likely to ask.
iii. To keep a meaningful record of what has been taught to a given class.
iv. To visualize and organize a complete learning situation in advance.
a) i & ii
b) iii & iv
c) i, iii & iv
d) i, ii & iii
Negative Variety Item
Here all the responses are correct except one. Example:
Which of the following is not a good reason for organizing an educational visit?
a. Correlate several school subjects
b. Broaden students experiences beyond the classroom
c. Break the monotony of the class.
d. Arouse the students’ curiosity and develop an inquiring mind.
How to Construct Effective Multiple Choice Items
1. The stem should contain the central problem so that the student will have some idea as to
what is expected of him/her and some tentative answer in mind before he/she begins to read
the options.
Poor Stem: A criterion-referenced test. This example is poor because it does not ask a question
or set a task. It is essential that the intent of the item be stated clearly in the stem.
2. Avoid repetition of words in the options. The stem should be written so that the key words
are incorporated into the stem and will not have to be repeated in each option.
Poor: According to Engel’s law
a. Family expenditures for food increase in accordance with the size of the
family.
b. Family expenditures for food decrease as income increases.
c. Family expenditures for food require a smaller percentage of an increasing
income.
d. Family expenditures for food rise in proportion to income.
Better: According to Engel’s law, family expenditures for food
a. Increase in accordance with the size of the family
b. Decrease as income increases
c. Require a smaller percentage of an increasing income
d. Rise in proportion to income
3. Avoid making the key consistently longer or shorter than the distracters.
4. Avoid giving irrelevant clues to the correct answer. The length of the answer may be one
clue; others may be of a grammatical nature, such as the use of “a” or “an” at the end of a
statement, or the use of a singular or plural subject and/or verb in the stem with just one
or two singular or plural options.
Note: It is important that you prepare a table of specification when constructing your tests. It
ensures balance and comprehensiveness in a test.
4. Marking guidelines indicate the initial criteria that will be used to award marks.
5. Marking guidelines allow for less predictable and less defined responses, for
example, characteristics such as flair, originality and creativity, or the provision
of alternative solutions where appropriate.
6. Marking guidelines for extended responses use language that is consistent with
the outcomes and the band descriptions for the subject.
7. Marking guidelines are to incorporate the generic rubric provided in the
examination paper as well as aspects specifically related to the question.
12. Optional questions within a paper will be marked using comparable marking
criteria.
13. Marking guidelines for questions that can be answered using a range of contexts
and/or content will have a common marking guideline exemplified using
appropriate contexts and/or content.
Item analysis can be a powerful technique available to instructors for the guidance and
improvement of instruction. For this to be so, the items to be analyzed must be valid
measures of instructional objectives. Further, the items must be diagnostic, that is,
knowledge of which incorrect options students select must be a clue to the nature of the
misunderstanding, and thus prescriptive of appropriate remediation.
In addition, instructors who construct their own examinations may greatly improve the
effectiveness of test items and the validity of test scores if they select and rewrite their
items on the basis of item performance data.
Item analysis is a completely futile process unless the results help instructors improve
their classroom practices and item writers improve their tests. Let us suggest a number
of points of departure in the application of item analysis data.
1. Item analysis gives necessary but not sufficient information concerning the
appropriateness of an item as a measure of intended outcomes of instruction. An
item may perform beautifully with respect to item analysis statistics and yet be
quite irrelevant to the instruction whose results it was intended to measure. A
most common error is to teach for behavioral objectives such as analysis of data
or situations, ability to discover trends, ability to infer meaning, etc., and then to
construct an objective test measuring mainly recognition of facts. Clearly, the
objectives of instruction must be kept in mind when selecting test items.
2. An item must be of appropriate difficulty for the students to whom it is
administered. If possible, items should have indices of difficulty no less than 20
and no greater than 80. It is desirable to have most items in the 30 to 50 range of
difficulty. Very hard or very easy items contribute little to the discriminating
power of a test.
3. An item should discriminate between upper and lower groups. These groups are
usually based on total test score but they could be based on some other criterion
such as grade-point average, scores on other tests, etc. Sometimes an item will
discriminate negatively, that is, a larger proportion of the lower group than of the
upper group selected the correct option. This often means that the students in the
upper group were misled by an ambiguity that the students in the lower group,
and the item writer, failed to discover. Such an item should be revised or
discarded.
4. All of the incorrect options, or distracters, should actually be distracting.
Preferably, each distracter should be selected by a greater proportion of the lower
group than of the upper group. If, in a five-option multiple-choice item, only one
distracter is effective, the item is, for all practical purposes, a two-option item.
Existence of five options does not automatically guarantee that the item will
operate as a five-choice item.
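One simple way to check whether the distracters are working is to tally how often each option was chosen in the upper and lower groups. The sketch below is illustrative only: the data and the helper name are invented, not taken from this unit.

```python
from collections import Counter

def option_counts(upper_responses, lower_responses):
    """Tally how often each option was chosen in the upper and lower groups.
    A distracter is working if the lower group picks it more often."""
    return Counter(upper_responses), Counter(lower_responses)

# Hypothetical five-option item with key 'c'
upper = ['c', 'c', 'c', 'b', 'c', 'c', 'a', 'c', 'c', 'c']
lower = ['c', 'b', 'a', 'c', 'b', 'd', 'a', 'c', 'b', 'c']
up, low = option_counts(upper, lower)
# 'b' drew 1 upper-group vs 3 lower-group students: an effective distracter.
# 'e' drew no one in either group, so in practice this is a four-option item.
```

A distracter that attracts no one contributes nothing to the item and is a candidate for rewriting.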
How well did my test distinguish among students according to how well they
met my learning goals?
Recall that each item on your test is intended to sample performance on a particular
learning outcome. The test as a whole is meant to estimate performance across the full
domain of learning outcomes targeted.
One way to assess how well your test is functioning for this purpose is to look at how
well the individual items do so. The basic idea is that a good item is one that good
students get correct more often than do poor students. An item analysis gets at the
question of whether your test is working by asking the same question of all individual
items—how well does it discriminate? In short, item analysis gives teachers a way to
exercise additional quality control over their tests. Well-specified learning objectives
and well-constructed items give teachers a head start in that process, but item analyses
can give you feedback on how successful you actually were. Item analyses can also help
you diagnose why some items did not work especially well, and thus suggest ways to
improve them (for example, if you find distracters that attracted no one, try developing
better ones). The important test for an item’s discriminability is to compare it to the
maximum possible. How well did each item discriminate relative to the maximum
possible for an item of its particular difficulty level? Here is a rough rule of thumb.
The item difficulty statistic is an appropriate choice for achievement or aptitude tests
when the items are scored dichotomously (i.e., correct vs. incorrect). Thus, it can be
derived for true-false, multiple-choice, and matching items, and even for essay items,
where the instructor can convert the range of possible point values into the categories
“passing” and “failing.”
The item difficulty index, symbolized p, can be computed simply by dividing the number
of test takers who answered the item correctly by the total number of students who
answered the item. As a proportion, p can range between 0.00, obtained when no
examinees answered the item correctly, and 1.00, obtained when all examinees
answered the item correctly. Notice that a test item need not have only one p value. Not
only may the p value vary with each class group that takes the test, but an instructor may
gain insight by computing the item difficulty level for a number of different subgroups
within a class, such as those who did well on the exam overall and those who performed
more poorly.
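The computation of p described above can be sketched as follows (the helper function and response data are illustrative assumptions, not part of the text):

```python
def item_difficulty(responses, key):
    """p = number answering the item correctly / number answering the item."""
    answered = [r for r in responses if r is not None]  # skip omitted items
    correct = sum(1 for r in answered if r == key)
    return correct / len(answered)

# Hypothetical item: 14 of 20 students chose the key 'b'
responses = ['b'] * 14 + ['a'] * 3 + ['c'] * 2 + ['d']
p = item_difficulty(responses, 'b')  # p = 0.70
```

The same function can be run on any subgroup of the class to obtain subgroup difficulty levels.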
Although the computation of the item difficulty index p is quite straightforward, the
interpretation of this statistic is not. To illustrate, consider an item with a difficulty level
of 0.20. We do know that 20% of the examinees answered the item correctly, but we
cannot be certain why they did so. Does this item difficulty level mean that the item was
challenging for all but the best prepared of the examinees? Does it mean that the
instructor failed in his or her attempt to teach the concept assessed by the item? Does it
mean that the students failed to learn the material? Does it mean that the item was
poorly written? To answer these questions, we must rely on other item analysis
procedures, both qualitative and quantitative ones.
Item discrimination analysis deals with the fact that often different test takers will
answer a test item in different ways. As such, it addresses questions of considerable
interest to most faculty, such as, “does the test item differentiate those who did well on
the exam overall from those who did not?” or “does the test item differentiate those who
know the material from those who do not?” In a more technical sense then, item
discrimination analysis addresses the validity of the items on a test, that is, the extent to
which the items tap the attributes they were intended to assess. As with item difficulty,
item discrimination analysis involves a family of techniques. Which one to use depends
on the type of testing situation and the nature of the items. I’m going to look at only one
of those, the item discrimination index, symbolized D. The index parallels the difficulty
index in that it can be used whenever items can be scored dichotomously, as correct or
incorrect, and hence it is most appropriate for true-false, multiple-choice, and matching
items, and for those essay items which the instructor can score as “pass” or “fail.”
We test because we want to find out if students know the material, but all we learn for
certain is how they did on the exam we gave them. The item discrimination index tests
the test in the hope of keeping the correlation between knowledge and exam
performance as close as it can be in an admittedly imperfect system.
1. Divide the group of test takers into two groups, high scoring and low scoring.
Ordinarily, this is done by dividing the examinees into those scoring above and
those scoring below the median. (Alternatively, one could create groups made up
of the top and bottom quintiles or quartiles or even deciles.)
2. Compute the item difficulty levels separately for the upper (p_upper) and lower
(p_lower) scoring groups.
3. Subtract the two difficulty levels such that D = p_upper − p_lower.
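The three steps above can be sketched in Python (the function name and response data are illustrative assumptions):

```python
def discrimination_index(upper_responses, lower_responses, key):
    """D = p_upper - p_lower: the difference in item difficulty between
    the high-scoring and low-scoring groups."""
    p_upper = sum(1 for r in upper_responses if r == key) / len(upper_responses)
    p_lower = sum(1 for r in lower_responses if r == key) / len(lower_responses)
    return p_upper - p_lower

# Hypothetical item: 9 of 10 upper-group students and 4 of 10
# lower-group students chose the key 'a'
upper = ['a'] * 9 + ['b']
lower = ['a'] * 4 + ['b'] * 6
D = discrimination_index(upper, lower, 'a')  # 0.9 - 0.4 = 0.5
```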
How is the item discrimination index interpreted? Unlike the item difficulty level p , the
item discrimination index can take on negative values and can range between -1.00 and
1.00. Consider the following situation: suppose that overall, half of the examinees
answered a particular item correctly, and that all of the examinees who scored above the
median on the exam answered the item correctly and all of the examinees who scored
below the median answered incorrectly. In such a situation p_upper = 1.00 and p_lower =
0.00. As such, the value of the item discrimination index D is 1.00 and the item is said
to be a perfect positive discriminator. Many would regard this outcome as ideal. It
suggests that those who knew the material and were well-prepared passed the item
while all others failed it.
Though it’s not as unlikely as winning a million-dollar lottery, finding a perfect positive
discriminator on an exam is relatively rare. Most psychometricians would say that items
yielding positive discrimination index values of 0.30 and above are quite good
discriminators and worthy of retention for future exams.
Finally, notice that the difficulty and discrimination are not independent. If all the
students in both the upper and lower levels either pass or fail an item, there’s nothing in
the data to indicate whether the item itself was good or not. Indeed, the value of the item
discrimination index will be maximized when only half of the test takers overall answer
an item correctly; that is, when p = 0.50. Once again, the ideal situation is one in which
the half who passed the item were students who all did well on the exam overall.
Does this mean that it is never appropriate to retain items on an exam that are passed by
all examinees, or by none of the examinees? Not at all. There are many reasons to
include at least some such items. Very easy items can reflect the fact that some relatively
straightforward concepts were taught well and mastered by all students. Similarly, an
instructor may choose to include some very difficult items on an exam to challenge even
the best-prepared students.
The median is the number that divides the scores into two equal halves, and the mode is the
score that occurs most frequently.
The shape of a distribution of your test scores can provide useful clues about your test
and your students’ performance. When representing students’ scores on a graph, the
scores often will be positively or negatively skewed. When the distribution is positively
skewed, that implies that the most frequent scores (the mode) and the median are below
the mean. If your test is very difficult, there may be many low scores and few high ones.
The distribution of scores would have a shape similar to the one depicted below that is
positively skewed.
When the tail points to the left, the distribution is negatively skewed. In this distribution
there are high scores and relatively few low scores. Notice that the mean is influenced by
the skewing.
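This influence of skew on the mean is easy to verify with Python's statistics module on a small invented score set:

```python
import statistics

# Hypothetical positively skewed scores: a few high scores form the right tail
scores = [40, 42, 45, 45, 47, 50, 52, 55, 90, 95]
mean = statistics.mean(scores)      # 56.1, pulled toward the tail
median = statistics.median(scores)  # 48.5
# mean > median: the signature of a positive (right) skew
```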
The mean can be distorted if there are some scores that are extremely different (outliers)
from the majority of scores for the group. In such cases, the median is the more descriptive
measure of central tendency.
Indicators of Variability
Variability is the dispersion of the scores within a distribution. Given a test, a group of
students with a similar level of performance on a specific skill tend to have scores close
to the mean. Another group with varying levels of performance will have scores widely
spread and further from the mean. In other words, how varied are the scores? Two
common measures of variability are the range and standard deviation.
Range
The range, R, is the difference between the lowest and the highest scores in a
distribution. The range is easy to compute and interpret, but it only indicates the
difference between the two extreme scores in a set.
If we use the scores from Mr. Walker’s class (above), we would calculate the range as:
Range (R) = the highest score – the lowest score in the distribution
95 91 100 96 92 91 87 84 70 65 96 65 56 86 43 65 22 40 93
R = 100 – 22 = 78
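The same calculation for these scores can be sketched in Python:

```python
# Scores from Mr. Walker's class, as listed above
scores = [95, 91, 100, 96, 92, 91, 87, 84, 70, 65,
          96, 65, 56, 86, 43, 65, 22, 40, 93]
R = max(scores) - min(scores)  # 100 - 22 = 78
```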
Standard Deviation
A more useful statistic than simply knowing the range of scores would be to see how
widely dispersed different scores are from the mean. The most common measure of
variability is the standard deviation (SD). The standard deviation is defined as the
numeric index that describes how far away from the mean the scores in the distribution
are located. The formula for the (population) standard deviation is:
SD = √( Σ(X − M)² / N )
where X is each score, M is the mean of the scores and N is the number of scores.
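A minimal sketch of the population standard deviation in Python, applied to the same class scores (the function itself is illustrative):

```python
import math

def standard_deviation(scores):
    """Population SD: square root of the mean squared deviation from the mean."""
    mean = sum(scores) / len(scores)
    variance = sum((x - mean) ** 2 for x in scores) / len(scores)
    return math.sqrt(variance)

scores = [95, 91, 100, 96, 92, 91, 87, 84, 70, 65,
          96, 65, 56, 86, 43, 65, 22, 40, 93]
sd = standard_deviation(scores)
```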
Activities
1. Construct five extended essay questions and five restricted essay questions in your
subject area
2. Prepare a table of specification for a 20 item multiple choice test.
Written Exercises
1. Distinguish between testing and assessment
2. Describe the main steps in the development of tests.
3. Outline the main considerations you would bear in mind in the construction of tests.