Topic 2: Principles of Tests and Measurements
LEARNING OUTCOMES
By the end of this topic, you should be able to:
1. Describe the different kinds of reliability in relation to testing;
2. Describe how reliability relates to testing and measurement;
3. Describe the different kinds of validity; and
4. Identify the relationship between validity, reliability, and practicality.
INTRODUCTION
A test needs to be reliable, valid, and practical in order to be considered a good test capable of measuring the ability or knowledge we are interested in. Each of these requirements makes a test a more useful tool and the information it provides more precise. What do these three terms – reliability, validity, and practicality – refer to? This topic will answer this question.
2.1 RELIABILITY
Let’s say that a student scores 35 points in a 50-point test of listening
comprehension. How sure are we that this is actually the score that the student
should receive? One way to be more confident of this is to look at the reliability
of the test. Reliability has to do with the consistency and accuracy of the
measurement. It is reflected in several possible ways including the obtaining of
similar results when the measurement is repeated on different occasions or by
different persons. Reliability is computed using some kind of correlation. Some of
the more common ways for computing reliability are illustrated in Figure 2.1.
(a) Test-retest
In test-retest reliability, the same test is re-administered to the same people. The scores obtained on the first administration of the test are correlated with the scores obtained on the second administration. It is expected that the correlation between the two scores would be high. However, a test-retest situation is somewhat difficult to conduct as it is unlikely that students will take the same test twice. The effects of practice and memory may also influence the correlation value.
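As a hypothetical illustration, the test-retest procedure can be sketched in Python. The score lists below are invented and simply stand in for two administrations of the same test to the same students; the Pearson correlation between them serves as the reliability estimate.

```python
# Test-retest reliability sketch: correlate scores from two
# administrations of the same test to the same students.

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two score lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

first_admin  = [35, 42, 28, 47, 31, 39]   # invented scores, administration 1
second_admin = [37, 44, 30, 45, 33, 41]   # same students, administration 2

r = pearson_r(first_admin, second_admin)
print(round(r, 3))   # a value close to 1 suggests consistent measurement
```

With these invented scores the correlation is high, which is what we would hope to see if the test measures consistently across occasions.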
Table 2.1 describes the focus of each reliability measure, with the emphasis given to the first listed when there are two emphases. We can see from Table 2.1 that, of the five types of reliability measures, the split-half is perhaps the most efficient to perform because it requires only one test administration, one grader, and one grading session.
This error may come from various sources: within the test takers, within the test, in the test administration or even during scoring. Fatigue, illness, copying or even the unintentional noticing of another student’s answer all contribute to error from within the test taker. Some of these will reduce the obtained score relative to the true score while others will increase it.
For example, fatigue will cause the obtained score to be lower than the true score
(i.e. Obtained score = True score – Error) while copying will cause the obtained
score to be higher than the true score (i.e. Obtained score = True score + Error).
Just as errors within the test taker affect the value of the obtained scores with respect to the true score, errors within the test, such as the use of faulty test items, a reading level in the test that is too high and faulty instructions, can do the same. Errors in test administration include the level of physical comfort during
the test, the test administrator’s attitude as well as the use of faulty equipment
such as a cassette recorder with poor sound quality in a listening test. Finally,
errors in scoring are quite obvious as graders can contribute to the error element
if they lack adequate qualifications, do not follow instructions or are themselves
fatigued.
Copyright © Open University Malaysia (OUM)
TOPIC 2 PRINCIPLES OF TESTS AND MEASUREMENTS 19
Sm = SD × √(1 − r)
where Sm is the Standard Error of Measurement, SD is the standard deviation of the scores and r is the reliability of the test.
Using the normal curve as presented earlier, you can estimate the student’s
true score to some degree of certainty based on the observed score and the
Standard Error of Measurement. For example, let us take the obtained score of 75. Assuming that the standard deviation (SD) = 2.5 and the reliability is 0.7, then the Standard Error of Measurement will be:
Sm = 2.5 × √(1 − 0.7) = 2.5 × √0.3 ≈ 2.5 × 0.55 = 1.375
Therefore, based on the normal distribution curve, the student’s true score is
between
75 – 1.375 and 75 + 1.375, or 73.625 and 76.375 (68% of the time): X ± 1 Sm
75 – 2.75 and 75 + 2.75, or 72.25 and 77.75 (95% of the time): X ± 2 Sm
75 – 4.125 and 75 + 4.125, or 70.875 and 79.125 (99% of the time): X ± 3 Sm
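The band arithmetic above can be reproduced with a short Python sketch. Note that the exact value of 2.5 × √0.3 is about 1.369; the figure of 1.375 used in the text comes from rounding √0.3 to 0.55.

```python
import math

# Standard Error of Measurement (Sm) for the worked example above:
# SD = 2.5, reliability r = 0.7, observed score = 75.
sd, r, observed = 2.5, 0.7, 75

sm = sd * math.sqrt(1 - r)   # 2.5 * sqrt(0.3), about 1.369
for k, pct in [(1, 68), (2, 95), (3, 99)]:   # 3 Sm is strictly about 99.7%
    low, high = observed - k * sm, observed + k * sm
    print(f"{pct}% of the time: {low:.3f} to {high:.3f}")
```

A higher reliability shrinks Sm and therefore narrows the band around the observed score, which is exactly why reliability matters for interpreting an individual result.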
2.2 VALIDITY
The second characteristic of good tests is validity which refers to whether the test
is actually measuring what it claims to measure. This is important for us as we
do not want to make claims concerning what a student can or cannot do based
on a test when the test is actually measuring something else. Validity is usually
determined logically although several types of validity may use correlation
coefficients.
(c) Concurrent validity is the use of another more reputable and recognised test
to validate one’s own test. As it uses an external measure as a reference,
concurrent validity is sometimes also referred to as criterion validity. For
example, suppose you come up with your own new test and would like to
determine its validity. You would look for a reputable test and compare
your students’ performance on your test with their performance on the
reputable and acknowledged test. In concurrent validity, a correlation
coefficient is obtained, giving an actual numerical value. A high positive
correlation of 0.7 to 1 indicates that the learners’ scores are relatively
similar on the two tests or measures.
(e) Content validity “is concerned with whether or not the content of the test is
sufficiently representative and comprehensive for the test to be a valid measure of
what it is supposed to measure” (Henning, 1987). We can quite easily imagine
taking a test after going through an entire language course. How would you
feel if at the end of the course, your final exam consists of only one question
that covers one element of language from the many that were introduced in
the course? If the language course was a conversational course focusing on
the different social situations that one may encounter, how valid is a final
examination that requires you to demonstrate your ability to place an order
at a fast food restaurant?
These different types of validity represent different concerns that many educators feel the test should address. However, it should be mentioned that in most situations, each of the kinds of validity described previously is independent of the others.
• A test may have face validity but lack construct validity.
• Similarly, it may have predictive validity but not content validity.
• What kind of validity we need to focus on depends on what our test is supposed to measure.
• A second issue is that validity is often a matter of degree. We will seldom find
a language test that is completely an invalid measure of language ability.
• Similarly, it is also extremely unlikely that we will find a completely valid
test of language.
A test can therefore be reliable but not valid. If a test is said to measure
reading ability but actually measures grammatical knowledge, then reliable
scores on this test will not indicate that it is a valid test of reading ability.
2.3 PRACTICALITY
Although an important characteristic of tests, practicality is actually a limiting
factor in testing. There will be situations in which after we have already determined
what we consider to be the most valid test, we need to reconsider the format simply
because of practicality issues. A valid test of spoken interaction, for example, would
require that the candidates be relaxed, interact with peers and speak on topics that
they are familiar and comfortable with. This sounds like the kind of conversation that people have with their friends while drinking coffee at roadside stalls. Of course such a situation would be a highly valid measure of spoken interaction – if we can set it up. Imagine trying to do so: it would require hidden cameras as well as a lot of telephone calls and money.
2.4.1 Washback
Washback, also known as backwash (e.g. Hughes, 1989), refers to the extent to
which a test affects teaching and learning. According to Alderson and Wall (1993),
washback is seen in the actions that teachers and students perform which they
would otherwise not necessarily do if there were no test. Washback can influence
how a teacher teaches or what a teacher teaches. It can also influence how a student
learns or what a student learns. Bailey (1996) discusses a framework suggested by Hughes in which the washback effect is addressed with respect to the participants, processes, and products involved. Washback affects participants involved in the test, such as students, teachers, materials writers and curriculum developers, as well as researchers, by influencing them affectively or even cognitively. It can also
affect processes by way of how the participants act or do their work in relation
to the test. Finally, the products are also affected as materials and teaching
programmes are clearly influenced by the processes. Washback is especially
apparent when the tests or examinations in question are regarded as being very
important and having a definite impact on the test-taker’s future. We would
expect, for example, that national standardised examinations would have strong
washback effects compared to a school-based or classroom-based test.
How can such a situation occur? Let us take one possible scenario. If the test is
a multiple choice format test and the teacher spends countless hours preparing
students for the test by drilling them with multiple choice type questions, then
although the teaching and learning may be effective in terms of performance on
the test, it may not actually promote language development. On the other hand, if
a language test involves performance in group interaction, test preparation would
also use discussion techniques. Most language learning theorists would approve
of such a washback effect from tests as interaction is often seen as providing
opportunities for language development. Based on these two examples, we can conclude that the nature of the test itself determines whether there will be positive or negative washback. Good and valid tests will have a positive
washback effect while tests that are not valid measures of the construct concerned
will promote negative washback.
As tests are in the service of teaching and learning, we must strive to achieve
positive washback. Hughes (1989: 44-47) makes several suggestions on how to
promote beneficial washback:
• First, he suggests that we test the abilities whose development we want
to encourage. Essentially, this refers to a very obvious fact – if we want to
develop a particular skill such as speaking, then we should test that skill.
Although this seems fairly obvious, Hughes points out that this is oftentimes not done for reasons (or rather excuses) such as the impracticality of some tests. As such, instead of testing the abilities that we want to develop, we end up testing a small portion or a poor representation of the skill.
• His second suggestion is to use direct testing. We are all aware of the
difference between direct tests and indirect tests. By using tests that directly
assess the abilities that we are interested in, we will be able to encourage
students to develop those abilities. Therefore, positive washback occurs as
students work towards developing abilities relevant to the testing construct
rather than on skills that reflect “testwiseness” more than anything else.
• Hughes also suggests that teachers should sample widely and unpredictably.
He argues that if the test is too predictable, the students will prepare for the
test by performing only the kinds of tasks expected in the test. Students
have often been observed to show disinterest in performing tasks and
activities that they immediately recognise as not part of a test. This action is
taken regardless of the benefits of the task or activity. Therefore, by sampling
widely and unpredictably, students are expected to be more willing to
prepare for the test by performing a variety of tasks related to the objectives
of the teaching-learning situation.
• Two other suggestions are to make testing criterion-referenced and to base achievement tests on teaching-learning objectives. Both these steps will
help students become aware of what is expected of them and what kind of
abilities they should be able to demonstrate. Descriptions of criterial levels
help students match their current abilities to the stated criteria and develop
their abilities accordingly.
Beneficial backwash can also be achieved if the students and teachers are familiar
with the test, its objectives as well as format. By becoming aware of the objectives
of the test, both students and teachers can prepare for the test in an organised
and more directed manner. Hughes also mentions the importance of assisting
teachers as they prepare their students for tests. He argues that whenever a test
is intended to create positive backwash in teaching methodology, some teachers
may find it difficult to adapt their teaching techniques to the demands of the test.
In such situations, it becomes imperative that these teachers are assisted in order
for the test to have a positive backwash effect.
Several others have also suggested ways and means to promote positive
washback. Bailey (1993), for example, suggests that attention should be given
to the language learning goals of the programme and the learners and that
as much authenticity as possible should be built into the tests. The importance
of authenticity is echoed by Doye (1991) who advocates absolute congruence
between tests and real-life situations. An authentic test is therefore one that
reproduces a real-life situation in order to examine the student’s ability to
cope with it (p. 104). Other suggestions include the need to focus on learner
autonomy and self-assessment. These suggestions imply that tests should
provide sufficiently informative score reporting so that learners can assess their
own performance and determine their own strengths and weaknesses. It should
also be noted here that high stakes tests tend to have a higher washback effect
than tests that are not considered important. Teachers often observe students not
performing well on classroom tests. Students are not too concerned about the
outcome of these tests as they are aware that the results would have little impact
on them.
Hughes stresses the need to count the cost. He believes that in many situations,
the cost of not achieving beneficial backwash may be higher than the cost to
develop a test and testing situation that promote beneficial backwash. He points
out that “when we compare the cost of the test with the waste of effort and time on the
part of the teachers and students in activities quite inappropriate to their true learning
goals … , we are likely to decide that we cannot afford not to introduce a test with a
powerful backwash effect” (p. 47).
(a) Consequences
This issue deals with the effect tests can have on teaching and learning. The
question that needs to be asked is whether the test has positive consequences or whether there are negative and unintended washback effects, such as the narrowing of the curriculum and the use of inappropriate teaching and learning strategies.
(b) Fairness
A contemporary concern in any kind of evaluation or assessment is fairness.
As the cultural background of students will colour the way they use language and communicate, it is important to ask whether the ethnic and cultural backgrounds of the students have been taken into consideration in the assessment process.
(g) Meaningfulness
Another concern involves the meaningfulness of the task in tests and
assessments. Only if the tasks are meaningful will we be able to get students
to respond with honesty and motivation. A measure of performance when
students are honest and motivated will be more accurate and trustworthy
than one in which the students are not concerned. Therefore, we need to ask
ourselves whether the students feel that the tasks are realistic and worthwhile.
activity 2.2
(a) What are reliability and validity? What determines the reliability of
a test?
(b) What are the different types of validity? Describe any three types
and cite examples.
SUMMARY
• The concepts of reliability, validity and practicality are central to testing and
measurement as they help determine the worth of the measurement.
• If a student is awarded 90 marks but the test is not reliable and not valid, then the marks awarded may not be worth the paper they are written on.
• Therefore, this topic on reliability, validity, and practicality is an important
one for us to better understand the importance of good testing and
measurement.