Topic 2: Principles of Tests and Measurements
LEARNING OUTCOMES
By the end of this topic, you should be able to:
1. Describe the different kinds of reliability in relation to testing;
2. Describe how reliability relates to testing and measurement;
3. Describe the different kinds of validity; and
4. Identify the relationship between validity, reliability, and practicality.
INTRODUCTION
A test needs to be reliable, valid, and practical in order to be considered a good test capable of measuring the ability or knowledge we are interested in. Each of these requirements makes a test a more useful tool and the information it provides more precise. What do these three terms – reliability, validity, and practicality – refer to? This topic will answer this question.
2.1 RELIABILITY
Let’s say that a student scores 35 points in a 50-point test of listening
comprehension. How sure are we that this is actually the score that the student
should receive? One way to be more confident of this is to look at the reliability
of the test. Reliability has to do with the consistency and accuracy of the
measurement. It is reflected in several possible ways including the obtaining of
similar results when the measurement is repeated on different occasions or by
different persons. Reliability is computed using some kind of correlation. Some of
the more common ways for computing reliability are illustrated in Figure 2.1.
(a) Test-retest
In test-retest reliability, the same test is re-administered to the same people. The scores obtained on the first administration of the test are correlated with the scores obtained on the second administration. It is expected that the correlation between the two scores would be high. However, a test-retest situation is somewhat difficult to conduct as it is unlikely that students will take the same test twice. The effects of practice and memory may also influence the correlation value.
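As a hypothetical illustration, the test-retest procedure can be sketched in Python. The score lists below are invented and simply stand in for two administrations of the same test to the same students; the Pearson correlation between them serves as the reliability estimate.

```python
# Test-retest reliability sketch: correlate scores from two
# administrations of the same test to the same students.

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two score lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

first_admin  = [35, 42, 28, 47, 31, 39]   # invented scores, administration 1
second_admin = [37, 44, 30, 45, 33, 41]   # same students, administration 2

r = pearson_r(first_admin, second_admin)
print(round(r, 3))   # a value close to 1 suggests consistent measurement
```

With these invented scores the correlation is high, which is what we would hope to see if the test measures consistently across occasions.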
Table 2.1 describes the focus of each reliability measure, with the emphasis given to the first listed when there are two emphases. We can see from Table 2.1 that, of the five types of reliability measures, the split-half is perhaps the most efficient to perform because it requires only one test administration, one grader, and one grading session.
This error may come from various sources: within the test takers, within the test, in the test administration or even during scoring. Fatigue, illness, copying or even the unintentional noticing of another student’s answer all contribute to error from within the test taker. Some of these will reduce the obtained score relative to the true score while others will increase it.
For example, fatigue will cause the obtained score to be lower than the true score
(i.e. Obtained score = True score – Error) while copying will cause the obtained
score to be higher than the true score (i.e. Obtained score = True score + Error).
Just as errors within the test taker affect the value of the obtained scores with respect to the true score, errors within the test, such as the use of faulty test items, a reading level in the test that is too high and faulty instructions, can do the same. Errors in test administration include the level of physical comfort during
the test, the test administrator’s attitude as well as the use of faulty equipment
such as a cassette recorder with poor sound quality in a listening test. Finally,
errors in scoring are quite obvious as graders can contribute to the error element
if they lack adequate qualifications, do not follow instructions or are themselves
fatigued.
Copyright © Open University Malaysia (OUM)
TOPIC 2 PRINCIPLES OF TESTS AND MEASUREMENTS 19
Sm = SD × √(1 − r)
where Sm is the Standard Error of Measurement, SD is the standard deviation of the scores and r is the reliability of the test.
Using the normal curve as presented earlier, you can estimate the student’s
true score to some degree of certainty based on the observed score and the
Standard Error of Measurement. For example, let us take the obtained score of 75. Assuming that the standard deviation (SD) = 2.5 and the reliability is 0.7, then the Standard Error of Measurement will be:
Sm = 2.5 × √(1 − 0.7) = 2.5 × √0.3 ≈ 2.5 × 0.55 = 1.375
Therefore, based on the normal distribution curve, the student’s true score is
between
75 – 1.375 and 75 + 1.375, or 73.625 and 76.375 (68% of the time): X ± 1 Sm
75 – 2.75 and 75 + 2.75, or 72.25 and 77.75 (95% of the time): X ± 2 Sm
75 – 4.125 and 75 + 4.125, or 70.875 and 79.125 (99% of the time): X ± 3 Sm
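The band arithmetic above can be reproduced with a short Python sketch. Note that the exact value of 2.5 × √0.3 is about 1.369; the figure of 1.375 used in the text comes from rounding √0.3 to 0.55.

```python
import math

# Standard Error of Measurement (Sm) for the worked example above:
# SD = 2.5, reliability r = 0.7, observed score = 75.
sd, r, observed = 2.5, 0.7, 75

sm = sd * math.sqrt(1 - r)   # 2.5 * sqrt(0.3), about 1.369
for k, pct in [(1, 68), (2, 95), (3, 99)]:   # 3 Sm is strictly about 99.7%
    low, high = observed - k * sm, observed + k * sm
    print(f"{pct}% of the time: {low:.3f} to {high:.3f}")
```

A higher reliability shrinks Sm and therefore narrows the band around the observed score, which is exactly why reliability matters for interpreting an individual result.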
2.2 VALIDITY
The second characteristic of good tests is validity which refers to whether the test
is actually measuring what it claims to measure. This is important for us as we
do not want to make claims concerning what a student can or cannot do based
on a test when the test is actually measuring something else. Validity is usually
determined logically although several types of validity may use correlation
coefficients.
(c) Concurrent validity is the use of another more reputable and recognised test
to validate one’s own test. As it uses an external measure as a reference,
concurrent validity is sometimes also referred to as criterion validity. For
example, suppose you come up with your own new test and would like to
determine its validity. You would look for a reputable test and compare
your students’ performance on your test with their performance on the
reputable and acknowledged test. In concurrent validity, a correlation
coefficient is obtained, giving an actual numerical value. A high positive
correlation of 0.7 to 1 indicates that the learners’ scores are relatively
similar on the two tests or measures.
(e) Content validity “is concerned with whether or not the content of the test is
sufficiently representative and comprehensive for the test to be a valid measure of
what it is supposed to measure” (Henning, 1987). We can quite easily imagine
taking a test after going through an entire language course. How would you
feel if at the end of the course, your final exam consists of only one question
that covers one element of language from the many that were introduced in
the course? If the language course was a conversational course focusing on
the different social situations that one may encounter, how valid is a final
examination that requires you to demonstrate your ability to place an order
at a fast food restaurant?
These different types of validity represent different concerns that many educators feel the test should address. However, it should be mentioned that in most situations, each of the kinds of validity described previously is independent of the others.
• A test may have face validity but lack construct validity.
• Similarly, it may have predictive validity but not content validity.
• What kind of validity we need to focus on depends on what our test is supposed to measure.
• A second issue is that validity is often a matter of degree. We will seldom find
a language test that is completely an invalid measure of language ability.
• Similarly, it is also extremely unlikely that we will find a completely valid
test of language.
A test can therefore be reliable but not valid. If a test is said to measure
reading ability but actually measures grammatical knowledge, then reliable
scores on this test will not indicate that it is a valid test of reading ability.
2.3 PRACTICALITY
Although an important characteristic of tests, practicality is actually a limiting
factor in testing. There will be situations in which after we have already determined
what we consider to be the most valid test, we need to reconsider the format simply
because of practicality issues. A valid test of spoken interaction, for example, would
require that the candidates be relaxed, interact with peers and speak on topics that
they are familiar and comfortable with. This sounds like the kind of conversation that people have with their friends while drinking coffee at roadside stalls. Of course such a situation would be a highly valid measure of spoken interaction – if we can set it up. Imagine trying to do so: it would require hidden cameras as well as a lot of telephone calls and money.
2.4.1 Washback
Washback, also known as backwash (e.g. Hughes, 1989), refers to the extent to
which a test affects teaching and learning. According to Alderson and Wall (1993),
washback is seen in the actions that teachers and students perform which they
would otherwise not necessarily do if there were no test. Washback can influence
how a teacher teaches or what a teacher teaches. It can also influence how a student
learns or what a student learns. Bailey (1996) discusses a framework suggested by Hughes in which the washback effect is addressed with respect to the participants, processes, and products involved. Washback affects participants involved in the test, such as students, teachers, materials writers and curriculum developers, as well as researchers, by influencing them affectively or even cognitively. It can also
affect processes by way of how the participants act or do their work in relation
to the test. Finally, the products are also affected as materials and teaching
programmes are clearly influenced by the processes. Washback is especially
apparent when the tests or examinations in question are regarded as being very
important and having a definite impact on the test-taker’s future. We would
expect, for example, that national standardised examinations would have strong
washback effects compared to a school-based or classroom-based test.
How can such a situation occur? Let us take one possible scenario. If the test is
a multiple choice format test and the teacher spends countless hours preparing
students for the test by drilling them with multiple choice type questions, then
although the teaching and learning may be effective in terms of performance on
the test, it may not actually promote language development. On the other hand, if
a language test involves performance in group interaction, test preparation would
also use discussion techniques. Most language learning theorists would approve
of such a washback effect from tests as interaction is often seen as providing
opportunities for language development. Based on these two examples, we can conclude that the nature of the test itself determines whether there will be positive or negative washback. Good and valid tests will have a positive
washback effect while tests that are not valid measures of the construct concerned
will promote negative washback.
As tests are in the service of teaching and learning, we must strive to achieve
positive washback. Hughes (1989: 44-47) makes several suggestions on how to
promote beneficial washback:
• First, he suggests that we test the abilities whose development we want
to encourage. Essentially, this refers to a very obvious fact – if we want to
develop a particular skill such as speaking, then we should test that skill.
Although this seems fairly obvious, Hughes points out that this is oftentimes not done for reasons (or rather excuses) such as the impracticality of some tests. As such, instead of testing the abilities that we want to develop, we end up testing a small portion or a poor representation of the skill.
• His second suggestion is to use direct testing. We are all aware of the
difference between direct tests and indirect tests. By using tests that directly
assess the abilities that we are interested in, we will be able to encourage
students to develop those abilities. Therefore, positive washback occurs as
students work towards developing abilities relevant to the testing construct
rather than on skills that reflect “testwiseness” more than anything else.
• Hughes also suggests that teachers should sample widely and unpredictably.
He argues that if the test is too predictable, the students will prepare for the
test by performing only the kinds of tasks expected in the test. Students
have often been observed to show disinterest in performing tasks and
activities that they immediately recognise as not part of a test. This action is
taken regardless of the benefits of the task or activity. Therefore, by sampling
widely and unpredictably, students are expected to be more willing to
prepare for the test by performing a variety of tasks related to the objectives
of the teaching-learning situation.
• Two other suggestions are to make testing criterion-referenced and to base achievement tests on teaching-learning objectives. Both these steps will
help students become aware of what is expected of them and what kind of
abilities they should be able to demonstrate. Descriptions of criterial levels
help students match their current abilities to the stated criteria and develop
their abilities accordingly.
Beneficial backwash can also be achieved if the students and teachers are familiar
with the test, its objectives as well as format. By becoming aware of the objectives
of the test, both students and teachers can prepare for the test in an organised
and more directed manner. Hughes also mentions the importance of assisting
teachers as they prepare their students for tests. He argues that whenever a test
is intended to create positive backwash in teaching methodology, some teachers
may find it difficult to adapt their teaching techniques to the demands of the test.
In such situations, it becomes imperative that these teachers are assisted in order
for the test to have a positive backwash effect.
Several others have also suggested ways and means to promote positive
washback. Bailey (1993), for example, suggests that attention should be given
to the language learning goals of the programme and the learners and that
as much authenticity as possible should be built into the tests. The importance
of authenticity is echoed by Doye (1991) who advocates absolute congruence
between tests and real-life situations. An authentic test is therefore one that
reproduces a real-life situation in order to examine the student’s ability to
cope with it (p. 104). Other suggestions include the need to focus on learner
autonomy and self-assessment. These suggestions imply that tests should
provide sufficiently informative score reporting so that learners can assess their
own performance and determine their own strengths and weaknesses. It should
also be noted here that high stakes tests tend to have a higher washback effect
than tests that are not considered important. Teachers often observe students not
performing well on classroom tests. Students are not too concerned about the
outcome of these tests as they are aware that the results would have little impact
on them.
Hughes stresses the need to count the cost. He believes that in many situations,
the cost of not achieving beneficial backwash may be higher than the cost to
develop a test and testing situation that promote beneficial backwash. He points
out that “when we compare the cost of the test with the waste of effort and time on the
part of the teachers and students in activities quite inappropriate to their true learning
goals … , we are likely to decide that we cannot afford not to introduce a test with a
powerful backwash effect” (p. 47).
(a) Consequences
This issue deals with the effect tests can have on teaching and learning. The
question that needs to be asked is whether the test has positive consequences or whether there are negative and unintended washback effects, such as the narrowing of the curriculum and the use of inappropriate teaching and learning strategies.
(b) Fairness
A contemporary concern in any kind of evaluation or assessment is fairness.
As the cultural background of students will colour the way they use language and communicate, it is important to ask whether the ethnic and cultural backgrounds of the students have been taken into consideration in the assessment process.
(g) Meaningfulness
Another concern involves the meaningfulness of the task in tests and
assessments. Only if the tasks are meaningful will we be able to get students
to respond with honesty and motivation. A measure of performance when
students are honest and motivated will be more accurate and trustworthy
than one in which the students are not concerned. Therefore, we need to ask
ourselves whether the students feel that the tasks are realistic and worthwhile.
activity 2.2
(a) What are reliability and validity? What determines the reliability of
a test?
(b) What are the different types of validity? Describe any three types
and cite examples.
SUMMARY
• The concepts of reliability, validity and practicality are central to testing and
measurement as they help determine the worth of the measurement.
• If a student is awarded 90 marks but the test is not reliable and not valid, then the marks awarded may not be worth the paper they are written on.
• Therefore, this topic on reliability, validity, and practicality is an important
one for us to better understand the importance of good testing and
measurement.