Module 6: Overview of test construction
John Izard

Quantitative research methods in educational planning
Series editor: Kenneth N. Ross
These modules were prepared by IIEP staff and consultants to be used in training
workshops presented for the National Research Coordinators who are responsible
for the educational policy research programme conducted by the Southern and
Eastern Africa Consortium for Monitoring Educational Quality (SACMEQ).
The designations employed and the presentation of material throughout the publication do not imply the expression of
any opinion whatsoever on the part of UNESCO concerning the legal status of any country, territory, city or area or of
its authorities, or concerning its frontiers or boundaries.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any
form or by any means: electronic, magnetic tape, mechanical, photocopying, recording or otherwise, without permission
in writing from UNESCO (International Institute for Educational Planning).
1. Assessment needs at different levels of an education system
Those who are taking action should also know the likely direct and
indirect effects of various action options, and the costs associated
with those options. They will include politicians, high level advisors,
senior administrators, and those responsible for curriculum,
assessment, teacher training (pre-service and in-service), and other
educational planners.
That is, those taking action need to be able to provide evidence that
their actions do ‘pay off’. For example, politicians have to be able to
convince their constituents that the actions taken were wise, and
senior administrators need to be able to show that programmes
have been implemented as intended and to show the effectiveness
of those programmes. It is important for such officials to realise
that effecting change requires more than issuing new regulations.
At the national level, action will probably be needed to train those
responsible for implementing change.
2. What is a test?
One valid approach to assessment is to observe everything that is
taught. In most situations this is not possible, because there is so
much information to be recorded. Instead, one has to select a valid
sample from the achievements of interest. Since school learning
programmes are expected to provide students with the capability
to complete various tasks successfully, one way of assessing each
student’s learning is to give a number of these tasks to be done
under specified conditions. Conventional pencil-and-paper test
items (which may be posed as questions) are examples of these
specially selected tasks. However, other tasks may be necessary as
well to give a comprehensive, valid and meaningful picture of the
learning. For example, in the learning of science subjects practical
skills are generally considered to be important, so the assessment
of science subjects should include some practical tasks.
Similarly, the student learning music may be required to give a
musical performance to demonstrate what has been learned. Test
items or tasks are samples of intended achievement, and a test is a
collection of such assessment tasks or items.
Clearly, the answer to one item should not depend on information
in another item or on the answer to another item. Otherwise a single
gap in knowledge can be penalised more than once, and the items do
not provide independent evidence of achievement.
The students and their parents will focus on the total scores at
the foot of the columns. High scores will be taken as evidence
of high achievement, and low scores will be taken as evidence of
low achievement. However, in summarising achievement, these
scores have lost their meaning in terms of particular strengths and
weaknesses. They give no information about which aspects of the
curriculum students knew and which they did not understand.
Teachers, subject specialists, curriculum planners, and national
policy advisors need to focus on the total scores shown to the right
of the matrix. These scores show how well the various content areas
have been covered. Low scores show substantial gaps in knowledge
where the intentions of the curriculum have not been met. High
scores show where curriculum intentions have been met (at least for
those questions that appeared on the test).
Figure 1. Matrix of student data on a twenty-item test

                               Students
Item    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18   Total
  1     0  1  0  1  1  1  0  1  1  1  1  1  1  1  1  1  1  1     15
  2     0  0  0  0  1  1  1  1  1  1  1  1  1  1  1  1  1  1     14
  3     0  0  1  1  0  1  1  1  1  1  1  1  1  1  1  1  0  1     14
  4     0  0  0  1  1  0  1  0  1  1  1  1  1  1  1  1  0  0     11
  5     1  0  0  0  1  1  1  0  1  1  1  1  1  0  1  1  1  1     13
  6     0  0  0  0  0  0  1  1  0  1  1  1  1  1  1  1  1  1     11
  7     0  0  0  0  0  0  0  1  0  1  0  1  1  1  1  0  1  1      8
  8     0  0  0  1  0  1  1  1  1  1  1  0  1  1  1  1  1  1     13
  9     0  0  0  0  1  0  0  1  0  1  1  0  0  1  1  1  1  1      9
 10     0  1  0  0  0  0  1  1  0  0  0  0  1  1  1  0  0  1      7
 11     0  0  0  0  1  0  0  1  0  1  0  0  1  1  1  0  1  0      7
 12     0  0  0  0  0  0  0  0  0  0  0  0  1  1  1  1  1  1      6
 13     0  0  0  0  0  1  1  0  0  0  1  1  0  0  0  1  1  1      7
 14     0  0  1  0  0  0  1  1  1  1  1  1  0  1  0  1  1  1     11
 15     1  0  1  1  0  0  0  0  1  1  1  1  1  0  1  1  1  1     12
 16     0  0  0  1  0  0  0  0  1  0  0  1  1  0  1  1  1  1      8
 17     0  0  0  0  0  1  0  0  1  0  1  1  0  1  0  1  1  1      8
 18     0  1  1  0  0  1  0  0  0  0  0  0  0  0  1  0  1  1      6
 19     0  0  0  0  0  1  0  0  1  0  0  1  0  0  0  0  1  0      4
 20     0  0  0  0  0  0  0  0  1  0  1  0  0  0  0  1  1  1      5
 21     1  1  1  1  1  0  1  0  0  0  1  1  1  1  0  0  0  0     10
Total   3  4  5  7  7  9 10 10 12 12 14 14 14 14 15 15 17 17    199
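The two kinds of totals in such a matrix can be computed directly from the 0/1 response data. A minimal sketch in Python, using a small invented matrix (not the data from Figure 1), with items as rows and students as columns:

```python
# Invented 0/1 response matrix: rows are items, columns are students.
matrix = [
    [1, 0, 1, 1],  # item 1: responses from four students
    [0, 0, 1, 1],  # item 2
    [1, 1, 1, 1],  # item 3: every student correct
]

# Row totals (right of the matrix): how many students got each item
# right -- the figures curriculum planners examine for coverage gaps.
item_totals = [sum(row) for row in matrix]

# Column totals (foot of the columns): each student's test score --
# the figures students and parents focus on.
student_totals = [sum(col) for col in zip(*matrix)]

print(item_totals)     # [3, 2, 4]
print(student_totals)  # [2, 1, 3, 3]
```

The same score of, say, 2 can arise from quite different response patterns, which is exactly why the totals alone say nothing about particular strengths and weaknesses.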
Interpreting test data
For example, we could weigh the students and give the highest
scores to those with the largest mass. This assessment could be very
consistent, particularly if our scales were accurate. However, this
assessment information is not meaningful when trying to judge
whether learning has occurred. To be meaningful in this context,
our assessment tasks (items) have to relate to what students learn.
The choice of what to assess, the strategies of assessment, and the
modes of reporting depend upon the intentions of the curriculum,
the importance of different parts of the curriculum, and the
audiences needing the information that assessment provides.
• all important parts of the curriculum are addressed;
• achievement over the full range is assessed (not just the narrow
band where a particular selection decision might be required on
a single occasion).
Tasks may also be too easy. If all students can do all of the tasks
then the most able students will not be able to provide evidence of
their advanced achievements. If two such assessments are made
it will appear that these able students have not learned anything
(because their scores cannot improve) even though they may have
learned a great deal of important knowledge and skills. Such
assessments are also faulty in that they fail to recognise learning
that has occurred. (A test with many easy items which do not
allow more able students to show evidence of their learning may be
referred to by saying that the test has a ‘ceiling’ effect.)
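A quick numerical screen for a possible ceiling effect is the share of candidates at the maximum score. A sketch with invented scores; the 50 per cent threshold is an arbitrary illustration, not a standard:

```python
# Invented scores on a 20-item test.
scores = [17, 20, 20, 19, 20, 18, 20, 20, 16, 20]
max_score = 20

# Proportion of candidates already at the top of the scale.
at_ceiling = sum(1 for s in scores if s == max_score) / len(scores)
print(at_ceiling)  # 0.6

if at_ceiling > 0.5:
    print("warning: possible ceiling effect -- consider harder items")
```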
Inferring range of achievement from samples of tasks
When assessment results have high stakes (as in the case where
results are used to select a small proportion for the next stage
of schooling or for employment), the chosen assessment tasks
have a high degree of influence on curriculum, teaching practice,
and student behaviour. When public examination papers are
published, teachers and students expend a great deal of effort
analysing these papers, practising test-taking skills, and
attempting to predict which topics will be examined so that the
whole curriculum does not have to be studied. These practices of
restricting learning to examinable topics may lead to high scores
being obtained without the associated (and expected) coverage of
the intended curriculum.
This may cost more at first because the level of skill required to
write such tests is much higher. Generally teams of item-writers are
required rather than depending upon a very limited number of
individuals to write all of the questions. The pool of experienced
teachers with such skills will not increase in size if teachers are
not encouraged by the assessment procedures to prepare students
for higher quality assessments. Further, item writing skills develop
in part from extensive experience in writing items (especially
those that are improvements on previous items). Such experience
of item writing, of exploring the ways that students think, and
of considering the ways in which students interpret a variety
of evidence, is gained gradually. Many good item writers are
experienced and sensitive classroom teachers who have developed
the capacity to construct items which reveal (correct and incorrect)
thought processes.
Results used to compare students
Norm-referenced tests: These tests are accompanied by the results
of a reference group on a representative test, and therefore scores
on the test are normally presented in terms of comparisons with
this reference group. If the reference group serves as a baseline
group, norm-referenced scores can provide evidence of learning
improvement (or decline) for the student population although this
is in terms of a change in score rather than an indication of what
students can now do (or not do) compared with what they could do
before. If changes in score are reported (for example, a difference
in average score), administrators have little evidence about the
strengths and weakness reflected in the results for particular
topics and may rely on rather limited experience (such as their
own experiences as a student) to interpret the changes. This could
result in increased expenditure on the wrong topic, wasting scarce
resources, and not addressing the real problems.
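Norm-referenced reporting can be sketched as follows. The norm-group scores are invented, and published tests derive their norms from much larger, representative samples:

```python
from statistics import mean, stdev

# Invented scores from a baseline (norm) group on the same test.
norm_group = [9, 11, 12, 12, 13, 14, 14, 15, 16, 18]

def z_score(raw, norms):
    """Standardised score: distance from the norm-group mean in SD units."""
    return (raw - mean(norms)) / stdev(norms)

def percentile_rank(raw, norms):
    """Percentage of the norm group scoring below the raw score."""
    return 100.0 * sum(1 for s in norms if s < raw) / len(norms)

# A raw score of 16 means little on its own; against the norm group it
# is about one standard deviation above the mean.
print(round(z_score(16, norm_group), 2))
print(percentile_rank(16, norm_group))  # 80.0
```

Note that such comparative scores still carry no information about which topics were mastered, which is precisely the limitation described above.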
Very few tests provide the user with a strategy for making such
adjustments for themselves, although some tests prepared using
Item Response Theory (also called Latent Trait Theory) do enable qualified
and experienced users to estimate new norm tables for particular
sub-sets of items.
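Why latent-trait scaling helps with sub-sets of items can be illustrated with the simplest such model, the Rasch model. The item difficulties and person ability below are invented logit values, not a calibrated test:

```python
import math

def p_correct(ability, difficulty):
    """Rasch model: probability that a person answers an item correctly."""
    return 1.0 / (1.0 + math.exp(difficulty - ability))

difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]  # full test, in logits
subset = [-0.5, 0.5]                        # a chosen sub-set of items
ability = 0.0                               # one person's estimated ability

# Because items and persons share one scale, an expected score can be
# computed for ANY sub-set -- the basis for new norm tables on sub-sets.
full_expected = sum(p_correct(ability, d) for d in difficulties)
subset_expected = sum(p_correct(ability, d) for d in subset)
print(round(full_expected, 2), round(subset_expected, 2))  # 2.5 1.0
```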
What purposes will the test serve?
Finally, the term can refer (loosely) to a published test which was
prepared by standard (or conventional) procedures. The usage of
‘standardised’ has become somewhat confused because published
tests often present scores interpreted in terms of deviation from the
mean (or average) and have a standard procedure for administering
tests and interpreting results.
[Figure: a rectangle with sides labelled 5 metres and 3 metres]
There are potential difficulties in scoring such prose, oral, drawn
and manipulative responses. An expert judge is required because
each response requires interpretation to be scored. Judges vary
in their expertise, vary over time in the way they score responses
(due to fatigue, difficulty in making an objective judgment without
being influenced by the previous candidate’s response, or by
giving varying credit for some correct responses over other correct
responses), and vary in the notice they take of handwriting,
neatness, grammatical usage and spelling.
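The agreement between two judges can be monitored numerically. A sketch using Cohen's kappa, one common index; the ratings below are invented:

```python
from collections import Counter

def cohens_kappa(ratings_1, ratings_2):
    """Agreement between two judges, corrected for chance agreement."""
    n = len(ratings_1)
    observed = sum(a == b for a, b in zip(ratings_1, ratings_2)) / n
    c1, c2 = Counter(ratings_1), Counter(ratings_2)
    expected = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / (n * n)
    return (observed - expected) / (1 - expected)

judge_a = ["pass", "pass", "fail", "pass", "fail", "pass"]
judge_b = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(round(cohens_kappa(judge_a, judge_b), 2))  # 0.67
```

Kappa of 1 means perfect agreement and 0 means no better than chance, so a value around 0.67 would prompt further training or clearer scoring guides.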
What types of tasks?
A. Platypus
B. Marmosets
C. Flatworms
D. Plankton
Figure 4. Stages in test construction

Decision to allocate resources
  ↓
Content analysis and test blueprint
  ↓
Item writing  ←──────────────┐
  ↓                          │
Item review 1                │
  ↓                          │
Planning item scoring        │
  ↓                          │
Production of trial tests    │
  ↓                          │
Trial testing                │
  ↓                          │
Item review 2                │
  ↓                          │
Amendment                    │
(revise/replace/discard)     │
  ↓                          │
More items needed? ── Yes ───┘
  ↓ No
Assembly of final tests
                                Objectives
Content        Recall of facts   Computational skills   Understanding   Total
Total                14                  12                  28           54
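A test specification of this kind can be checked mechanically against the items actually written. In the sketch below the objective totals (14, 12, 28) come from the blueprint, but the tagged item pool is invented:

```python
from collections import Counter

# Planned number of items per objective (totals from the blueprint).
blueprint_totals = {
    "recall of facts": 14,
    "computational skills": 12,
    "understanding": 28,
}

# Hypothetical pool of items, each tagged with the objective it tests.
items_written = (["recall of facts"] * 14
                 + ["computational skills"] * 10   # two items short
                 + ["understanding"] * 28)

written = Counter(items_written)
shortfall = {obj: planned - written[obj]
             for obj, planned in blueprint_totals.items()
             if written[obj] < planned}
print(shortfall)  # {'computational skills': 2}
```

A check like this guards against the trap described below of filling quotas with whatever is easiest to write.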
The test construction steps
Item writing
Item writing is the preparation of assessment tasks which can reveal
the knowledge and skill of students when their responses to these
tasks are inspected. Tasks which confuse, which do not engage the
students, or which offend, always obscure important evidence by
either failing to gather appropriate information or by distracting
the student from the intended task. Sound assessment tasks
will be those which students want to tackle, those which make
clear what is required of the students, and those which provide
evidence of the intellectual capabilities of the students. Remember,
items are needed for each important aspect as reflected in the test
specification. Some item writers fall into the trap of measuring what
is easy to measure rather than what is important to measure. This
enables superficial question quotas to be met but at the expense
of validity – using questions that are easy to write rather than
those which are important distorts the assessment process, and
therefore conveys inappropriate information about the curriculum
to students, teachers, and school communities.
Item review
The first form of item analysis: Checking intended
against actual
Writing assessment tasks for use in tests requires skill. Sometimes
an item seems clear to the person who wrote it but is not clear
to others. Before empirical trial, assessment tasks need to be
reviewed by a panel of several people, with questions like:
• Is there a single clearly correct (or best) answer for each item?
This review, before the items are tried, should ensure that we avoid
tasks expressed in language too complex for the idea being tested,
redundant words, multiple negatives, and distractors which are not
plausible. The review should also identify items with no correct
(or best) answer and items with multiple correct answers. Such items
may be discarded or re-written.
• Will the students be told how the items are to be scored? Will
they be told the relative importance of each item? Will they be
given advice on how to do their best on the test?
• Has the layout of the test (and answer sheet if appropriate) been
arranged for efficient scoring of responses? Are distractors for
multiple-choice tests shown as capital letters (easier to score
than lower case letters)?
as informing those attempting the trial tests that their results will
be used to validate the items and will not have any effect on their
current course work), and gather all test materials before candidates
leave the room.
Item analysis
The second form of item analysis: using responses from
real candidates
Empirical trial can identify instances of confused meaning,
alternative explanations not already considered by the test
constructors, and (for multiple-choice questions) options which are
popular amongst those lacking knowledge, and ‘incorrect’ options
which are chosen for some reason by very able students.
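These option-level checks can be sketched by tallying choices separately for low and high scorers. The responses, score groups and key below are invented:

```python
def option_counts(responses):
    """Tally how often each option was chosen."""
    counts = {}
    for choice in responses:
        counts[choice] = counts.get(choice, 0) + 1
    return counts

key = "A"  # the intended correct answer
low_scorers = ["B", "B", "C", "B", "D", "B", "A"]   # weakest candidates
high_scorers = ["A", "A", "A", "C", "A", "A", "A"]  # ablest candidates

# Option B attracts those lacking knowledge: a working distractor.
print(option_counts(low_scorers))   # {'B': 4, 'C': 1, 'D': 1, 'A': 1}
# If able students choose a 'wrong' option (here one chose C), the
# item wording or the key deserves review.
print(option_counts(high_scorers))  # {'A': 6, 'C': 1}
```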
A test with high content validity for one curriculum may not be
as valid for another curriculum. This is an issue which bedevils
international comparisons where the same test is administered in
several countries. Interpretation of the results in each country has
to take account of the extent to which the comparison test is content
valid for each country. If two countries have curricula that are only
partly represented in the test, then comparisons between the results
of those countries are only valid for part of the data.
When tests have high construct validity we may argue that this
is evidence of dimensionality. When we add scores on different
parts of a test to give a score on the whole test, we are assuming
dimensionality without checking whether our assumption is
justified. Similarly, when item analysis is done using the total score
on the same test as the criterion, we are assuming that the test
as a whole is measuring a single dimension or construct, and the
analysis seeks to identify items which contradict this assumption.
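This check can be sketched as an item-total correlation: items that correlate negatively with the total contradict the single-dimension assumption. The data are invented, and real analyses usually remove the item from the total before correlating (the 'corrected' item-total correlation):

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

items = [                # invented 0/1 data, columns are six students
    [0, 0, 1, 1, 1, 1],
    [0, 1, 0, 1, 1, 1],
    [1, 1, 0, 0, 0, 0],  # only the weakest students get this one right
]
totals = [sum(col) for col in zip(*items)]

for number, row in enumerate(items, start=1):
    r = pearson(row, totals)
    flag = "  <- contradicts the assumed dimension" if r < 0 else ""
    print(f"item {number}: item-total r = {r:+.2f}{flag}")
```

Here the third item is flagged: it may be mis-keyed, ambiguous, or measuring something quite different from the rest of the test.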
8. Resources required to construct and produce a test
teachers (either set aside from classroom work to join the test-
construction team, or paid additional fees to work outside school
hours), class teacher and student time for trials, the paper on
which the test (and answer sheet if appropriate) is to be printed
or photocopied, the production of copies of the test materials,
distribution to schools, retrieval from schools (if teachers are
not to score the tests), and scoring and analysis costs. Figure 6
shows a possible time scale for developing two parallel forms of
an achievement test of 50 items for use during the sixth year of
schooling. Figure 6 also shows the resources that would need to be
assembled to ensure that the tests were produced on time. (Note
that this schedule assumes that the test construction team has
had test development training prior to commencing work on the
project.)
Figure 6. Time scale and resources for developing the tests (columns: Task, Time (weeks), Resources)
Preparation of final forms of a test is not the end of the work. The
data gathered from the use of final versions should be monitored
as a quality control check on their performance. Such analyses
can also be used to fix a standard against which the performance of
future candidates may be compared. It is important to do this as
candidates in one year may vary in quality from those in another
year.
It is customary to develop more trial forms so that some forms of
the final test can be retired from use (where there is a possibility of
candidates having prior knowledge of the items through continued
use of the same test).
The trial forms should include acceptable items from the original
trials (not necessarily items which were used on the final forms
but similar in design to the pattern of item types used in the final
forms) to serve as a link between the new items and the old items.
The process of linking tests using such items is referred to as
anchoring. Surplus items can be retained for future use in similar
test papers.
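Anchoring can be sketched at its simplest: compare performance on the shared anchor items across cohorts and use the difference to adjust the new form's scale. All figures are invented, and operational programmes use IRT-based equating rather than this mean-difference shortcut:

```python
from statistics import mean

# Facility (proportion correct) of the four shared anchor items.
anchor_old = [0.80, 0.65, 0.50, 0.40]   # old form's cohort
anchor_new = [0.74, 0.59, 0.44, 0.35]   # new cohort, same items

# The new cohort did worse on identical items, so raw new-form results
# understate achievement on the old scale by roughly this much:
shift = mean(o - n for o, n in zip(anchor_old, anchor_new))
print(f"apply a shift of about {shift:.3f} to place new scores on the old scale")
```

Because the anchor items are identical on both forms, the difference can be attributed to the cohorts rather than to the tests, which is what makes the link defensible.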
10. References
General measurement and evaluation
1. Hopkins, C.D. and Antes, R.L. (1990). Classroom measurement
and evaluation. Itasca, Illinois: Peacock.
Item writing
1. Withers, G. (1997). Item Writing for Tests and Examinations.
Paris: International Institute for Educational Planning.
Testing applications
1. Adams, R.J., Doig, B.A. & Rosier, M.J. (1991). Science learning
in Victorian schools: 1990. (ACER Research Monograph No. 41).
Hawthorn, Vic.: Australian Council for Educational Research.
11. Exercises

1. CONSTRUCTION OF A TEST PLAN

2. TEXTBOOK ANALYSIS AND ITEM WRITING
(b) Choose one cell of the test plan and write some items for this
cell.
Since 1992 UNESCO’s International Institute for Educational Planning (IIEP) has been
working with Ministries of Education in Southern and Eastern Africa in order to undertake
integrated research and training activities that will expand opportunities for educational
planners to gain the technical skills required for monitoring and evaluating the quality
of basic education, and to generate information that can be used by decision-makers to
plan and improve the quality of education. These activities have been conducted under
the auspices of the Southern and Eastern Africa Consortium for Monitoring Educational
Quality (SACMEQ).
In 2004 SACMEQ was awarded the prestigious Jan Amos Comenius Medal in recognition
of its “outstanding achievements in the field of educational research, capacity building,
and innovation”.
These modules were prepared by IIEP staff and consultants to be used in training
workshops presented for the National Research Coordinators who are responsible for
SACMEQ’s educational policy research programme. All modules may be found on two
Internet Websites: http://www.sacmeq.org and http://www.unesco.org/iiep.