Topic 8: Validity of Assessment Techniques
INTRODUCTION
We have discussed the various methods of assessing student performance using
objective tests, essay tests, authentic assessments, project assessments and
portfolio assessments. In this topic, we will address two important issues, namely,
the reliability and validity of these assessment methods. How do we ensure that
the techniques we use for assessing the knowledge, skills and values of students
are reliable and valid? We are making important decisions about the abilities and
capabilities of the future generation and obviously we want to ensure that we are
making the right decisions.
Let's say you gave a Geometry test to a group of Form Five students and one of your
students, Swee Leong, obtained a score of 66 per cent in the test. How sure
are you that this is actually the score Swee Leong should receive? Is that his true
score? When you develop a test and administer it to your students, you are
attempting to measure as far as possible the true score of each student. The true
score is a hypothetical concept with regard to the actual ability, competency and
capacity of an individual. A test attempts to measure the true score of a person.
When measuring human abilities, it is practically impossible to develop an error-free
test; there will always be some error. However, the presence of error does not
mean that the test is not good; what matters is the size of the error.
Formally, an observed test score, X, is conceived as the sum of a true score, T, and
an error term, E. The true score is defined as the average of the scores a student
would obtain if the test were administered repeatedly (and the student could be
made to forget the content of the test between administrations). Because the true
score is defined as this average, the observed score in any single administration
departs from the true score, and the difference is called measurement error. This
statement can be expressed as follows:

X = T + E

where X is the observed score, T is the true score and E is the measurement error.
This departure is not caused by blatant mistakes on the part of test writers; it
arises from chance elements in students' performance on a test. Measurement
error comes mostly from the fact that we have sampled only a small portion of a
student's capabilities. Ambiguous questions and incorrect marking can
contribute to measurement error, but they account for only a small part of it.
Imagine that there are 10,000 items and that a student would obtain 60 per cent if
all 10,000 items were administered (which is not practically feasible). Then, 60 per
cent is the true score. Now, assume that you sample only 40 items to put in a test.
The expected score for the student is 24 items correct. However, the student may
get 20, 26, 30 and so on, depending on which items are in the test. This is the main
source of measurement error. That is, measurement error is due to the sampling
of items, rather than to poorly written items.
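To make the item-sampling idea concrete, here is a minimal Python sketch (not from the source module) which assumes, for simplicity, that every item has the same 0.6 probability of being answered correctly; repeated 40-item forms then produce observed scores that scatter around the expected 24.

```python
# A minimal sketch of measurement error arising from item sampling:
# a student whose true score is 60 per cent answers a random sample
# of 40 items, and the observed score varies from form to form.
import random

random.seed(1)

TRUE_PROPORTION = 0.6   # the student's true score: 60 per cent
NUM_ITEMS = 40          # items sampled into one test form

def administer_test() -> int:
    """Return the number of items answered correctly on one 40-item form."""
    return sum(1 for _ in range(NUM_ITEMS) if random.random() < TRUE_PROPORTION)

scores = [administer_test() for _ in range(10)]
print(scores)                       # e.g. scores such as 22, 25, 27, ...
print(NUM_ITEMS * TRUE_PROPORTION)  # expected score: 24.0
```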
Generally, the smaller the error, the greater the likelihood that you are closer to
measuring the true score of a student. If you are confident that your Geometry test
(observed score) has a small error, then you can confidently infer that Swee
Leong's score of 66 per cent is close to his true score or his actual ability in solving
geometry problems; i.e. what he actually knows. To reduce the error in a test, you
must ensure that your test is both reliable and valid. The higher the reliability and
validity of your test, the greater the likelihood that you will be measuring the true
score of your students. We will first examine the reliability of a test.
Would your students get the same scores if they took your test on two different
occasions? Would they get approximately the same scores if they took two
different forms of your test? These questions have to do with the consistency of
your classroom tests in measuring students' abilities, skills and attitudes or values.
The generic name for consistency is reliability. Reliability is an essential
characteristic of a good test because if a test does not measure consistently
(reliably), then you cannot count on the scores resulting from the administration
of the test.
If there is relatively little error, the ratio of the true score variance to the observed
score variance (reliability = σ_T² / σ_X²) approaches 1.00, which is perfect
reliability. If there is a relatively large amount of error, the ratio approaches 0.00,
which is total unreliability (refer to Figure 8.1).
High reliability means that the questions of a test tended to "pull together".
Students who answered a given question correctly were more likely to answer
other questions correctly as well. If an equivalent or parallel test was developed
by using similar items, the relative scores of students would show little change.
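The variance-ratio definition above can be illustrated with a short simulation; the sketch below assumes normally distributed true scores (SD 10) and errors (SD 5), so the expected reliability is 100/(100 + 25) = 0.80.

```python
# A minimal sketch of reliability as the ratio of true-score variance
# to observed-score variance, using simulated students.
import random
import statistics

random.seed(2)

true_scores = [random.gauss(60, 10) for _ in range(1000)]  # var(T) near 100
errors      = [random.gauss(0, 5)  for _ in range(1000)]   # var(E) near 25
observed    = [t + e for t, e in zip(true_scores, errors)]

reliability = statistics.variance(true_scores) / statistics.variance(observed)
print(round(reliability, 2))   # close to 100 / 125 = 0.80
```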
Reliability       Interpretation
0.90 and above    Excellent reliability; comparable to the best standardised tests.
0.80–0.90         Very good for a classroom test.
0.70–0.80         Good for a classroom test; there are probably a few items which could be improved.
0.60–0.70         Somewhat low; there are probably some items which could be removed or improved.
0.50–0.60         The test needs to be revised.
Below 0.50        Questionable reliability; the test should be replaced or needs major revision.
If you know the reliability coefficient of a test, can you estimate the true score of
a student on the test? In testing, we use the standard error of measurement (SEM)
for this purpose. It is computed from the standard deviation (SD) of the test scores
and the reliability coefficient r:

SEM = SD × √(1 − r)

Using the normal curve, you can then estimate a student's true score with some
degree of certainty based on the observed score and the standard error of
measurement.
Example 8.1:
You gave a History test to a group of 40 students. Khairul obtained a score of 75 in
the test, which is his observed score. The standard deviation of your test is 2.0.
Earlier, you had established that your History test had a reliability coefficient of
0.7. You are interested in finding out Khairul's true score. First, compute the
standard error of measurement:

SEM = 2.0 × √(1 − 0.7) = 2.0 × 0.548 ≈ 1.1

Therefore, based on the normal distribution curve (refer to Figure 8.2), Khairul's
true score should be:

(a) Between 75 − 1.1 and 75 + 1.1, or between 73.9 and 76.1, 68 per cent of the
time.

(b) Between 75 − 2.2 and 75 + 2.2, or between 72.8 and 77.2, 95 per cent of the
time.

(c) Between 75 − 3.3 and 75 + 3.3, or between 71.7 and 78.3, 99.7 per cent of the
time.
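The calculation in Example 8.1 can be checked with a few lines of Python; the 68/95/99.7 coverage values are the usual normal-curve figures.

```python
# Checking Example 8.1: the standard error of measurement,
# SEM = SD * sqrt(1 - reliability), and the resulting true-score bands.
import math

sd, reliability, observed = 2.0, 0.7, 75
sem = sd * math.sqrt(1 - reliability)
print(round(sem, 1))   # 1.1

for k, coverage in [(1, "68%"), (2, "95%"), (3, "99.7%")]:
    low, high = observed - k * sem, observed + k * sem
    print(f"{coverage}: {low:.1f} to {high:.1f}")
```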
There are several techniques for estimating the reliability of a test:
(a) Test-retest
Using the test-retest technique, the same test is administered again to the
same group of students. The scores obtained in the first administration of the
test are correlated to the scores obtained on the second administration of the
test. If the correlation between the two scores is high, then the test can be
considered to have high reliability. However, a test-retest situation is
somewhat difficult to conduct as it is unlikely that students will be prepared
to take the same test twice.
There is also the effect of practice and memory that may influence the
correlation: the shorter the time gap between the two administrations, the higher
the correlation; the longer the time gap, the lower the correlation, because the two
sets of observations become less related as time passes. Since this correlation is
the test-retest estimate of reliability, you can obtain considerably different
estimates depending on the interval.
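A minimal sketch of the computation follows, using made-up score pairs; any Pearson correlation routine would do (statistics.correlation requires Python 3.10 or later).

```python
# A minimal sketch of the test-retest estimate: correlate two administrations
# of the same test to the same students (scores below are made up).
from statistics import correlation

first_administration  = [55, 62, 70, 48, 80, 66, 59, 73]
second_administration = [58, 60, 72, 50, 78, 69, 57, 75]

print(round(correlation(first_administration, second_administration), 2))
# a value near 1.0 indicates high test-retest reliability
```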
(b) Parallel or equivalent forms
Using the parallel or equivalent forms technique, two equivalent tests (or
forms) covering the same content are administered to the same group of
students and the two sets of scores are correlated. Because students do not
answer the same items twice, the correlation is less affected by the influence
of memory. One major problem with this approach is that you have to be able
to generate a lot of items that reflect the same construct. This is often not an
easy feat.
(c) Internal consistency
Internal consistency is determined using only one test administered once to
the students. Two common techniques are the split-half method and
Cronbach's alpha.

(i) Split-half
To solve the problem of having to administer the same test twice, the
split-half technique is used. In this technique, a test is administered
once to a group of students. After the students have completed the test,
it is divided into two equal halves. This technique is most appropriate
for tests which include multiple-choice items, true-false items and
perhaps short-answer essays. The items are split using the odd-even
method, whereby one half of the test consists of the odd-numbered
items while the other half consists of the even-numbered items. The
scores obtained on the two halves are then correlated, and the
reliability of the whole test is estimated using the Spearman-Brown
formula:

r_sb = 2r_xy / (1 + r_xy)

In this formula, r_sb is the split-half reliability coefficient of the whole
test and r_xy is the correlation between the two halves. Say, for
example, you have established that the correlation between the two
halves is 0.65. The reliability of the whole test is then:

r_sb = (2 × 0.65) / (1 + 0.65) = 1.30 / 1.65 ≈ 0.79
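The whole procedure, from the odd-even split to the Spearman-Brown step-up, can be sketched in a few lines; the item responses below are made up for illustration.

```python
# A minimal sketch of the split-half procedure: split one test into odd- and
# even-numbered items, correlate the half scores, then step up with the
# Spearman-Brown formula (1 = correct answer, 0 = wrong answer).
from statistics import correlation

# each row is one student's item scores on a 6-item test
responses = [
    [1, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
]

odd_half  = [sum(row[0::2]) for row in responses]  # items 1, 3, 5
even_half = [sum(row[1::2]) for row in responses]  # items 2, 4, 6

r_xy = correlation(odd_half, even_half)
r_sb = 2 * r_xy / (1 + r_xy)   # Spearman-Brown: reliability of the full test
print(round(r_xy, 2), round(r_sb, 2))
```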
(ii) Cronbach's alpha
Cronbach's alpha estimates the internal consistency of a test from a
single administration. For a test of k items scored right/wrong, it is
computed as:

α = [k / (k − 1)] × [1 − Σ p_i(1 − p_i) / σ_x²]

where k is the number of items, p_i is the difficulty index of item i (the
proportion of students who answered it correctly) and σ_x² is the
variance of the students' total scores.
Example 8.2:
Suppose that in a multiple-choice test consisting of five items or
questions, the following difficulty index for each item was observed:
p_1 = 0.4, p_2 = 0.5, p_3 = 0.6, p_4 = 0.75 and p_5 = 0.85, and the
sample variance of the total scores is σ_x² = 1.84. Cronbach's alpha
would be calculated as follows:

Σ p_i(1 − p_i) = 0.24 + 0.25 + 0.24 + 0.1875 + 0.1275 = 1.045

α = [5 / (5 − 1)] × [1 − 1.045 / 1.840] ≈ 0.54
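Example 8.2 can be verified directly from the difficulty indices and the given variance:

```python
# Checking Example 8.2: Cronbach's alpha from the item difficulty indices
# and the total-score variance given in the text.
p = [0.4, 0.5, 0.6, 0.75, 0.85]   # difficulty index of each item
variance_x = 1.84                  # variance of total test scores
k = len(p)

item_variance_sum = sum(pi * (1 - pi) for pi in p)   # 1.045
alpha = (k / (k - 1)) * (1 - item_variance_sum / variance_x)
print(round(item_variance_sum, 3), round(alpha, 2))  # 1.045 0.54
```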
A Word of Caution!
When you get a low alpha, be careful not to conclude immediately that the test is
a bad test. Instead, check whether the test measures several attributes or
dimensions rather than one. If it does, Cronbach's alpha for the whole test is
likely to be deflated.

For example, an aptitude test may measure three attributes or dimensions, such
as quantitative ability, language ability and analytical ability. Hence, it is not
surprising that Cronbach's alpha for the whole test may be low, as the questions
may not correlate with each other. Why? Because the items are measuring three
different types of human abilities. The solution is to compute three different
Cronbach's alphas: one for quantitative ability, one for language ability and one
for analytical ability.
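The deflation effect can be demonstrated with simulated, made-up data; in the sketch below, two independent abilities each drive four items, so alpha for the whole eight-item test comes out lower than alpha for either four-item subscale.

```python
# A minimal sketch of the caution above: when a test mixes two unrelated
# dimensions, alpha for the whole test is deflated relative to alpha
# computed separately for each subscale.
import random
from statistics import variance

random.seed(3)

def cronbach_alpha(rows):
    """Cronbach's alpha for a students-by-items matrix of scores."""
    k = len(rows[0])
    item_vars = sum(variance([row[i] for row in rows]) for i in range(k))
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - item_vars / total_var)

# two independent abilities; four items measure each
students = []
for _ in range(200):
    a, b = random.gauss(0, 1), random.gauss(0, 1)
    items_a = [a + random.gauss(0, 0.7) for _ in range(4)]
    items_b = [b + random.gauss(0, 0.7) for _ in range(4)]
    students.append(items_a + items_b)

print(round(cronbach_alpha(students), 2))                    # whole test: deflated
print(round(cronbach_alpha([s[:4] for s in students]), 2))   # subscale A: higher
print(round(cronbach_alpha([s[4:] for s in students]), 2))   # subscale B: higher
```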
SELF-CHECK 8.2
1. What is the main advantage of the split-half technique over the test-
retest technique in determining the reliability of a test?
SELF-CHECK 8.3
List the steps that may be taken to enhance inter-rater reliability in the
grading of essay answer scripts.
ACTIVITY 8.2
In the myINSPIRE online forum, suggest other steps you would take to
enhance intra-rater reliability in the grading of projects.
Validity refers to the extent to which a test measures what it is intended to
measure. Validity will vary from test to test, depending on what the test is used
for. For example, a test may have high validity in testing the recall of facts in
economics, but that same test may be low in validity with regard to testing the
application of concepts in economics.
Messick (1989) was most concerned about the inferences a teacher draws from the
test score, the interpretation the teacher makes about his or her students and the
consequences of such inferences and interpretations. You can imagine the power
an educator holds in his or her hand when designing a test. Your test could
determine the future of many thousands of students. Inferences based on a test of
low validity could give a completely different picture of the actual abilities and
competencies of students.
Three types of validity have been identified: construct validity, content validity
and criterion-related validity, which comprises predictive and concurrent
validity (refer to Figure 8.5).
(a) Construct validity
Construct validity refers to the extent to which a test measures the
theoretical construct or trait it is intended to measure. Thus, to ensure high
construct validity, you must be clear about the definition of the construct
you intend to measure. For example, a construct such as reading
comprehension would include vocabulary development, reading for literal
meaning and reading for inferential meaning. Some
experts in educational measurement have argued that construct validity is
the most critical type of validity. You could establish the construct validity
of an instrument by correlating it with another test that measures the same
construct. For example, you could compare the scores obtained on your
reading comprehension test with the scores obtained on another well-known
reading comprehension test administered to the same sample of students. If
the scores for the two tests are highly correlated, then you may conclude that
your reading comprehension test has high construct validity.
(b) Content validity
Content validity refers to the extent to which the questions in a test
adequately sample the content domain or syllabus the test is supposed to
cover. For example, the science unit on "energy and forces" may include
facts, concepts, principles and skills on light, sound, heat, magnetism and
electricity. However, it is difficult, if not impossible, to administer a two- to
three-hour paper that tests all aspects of the syllabus on "energy and forces"
(refer to Figure 8.6).
Figure 8.6: Sample of content tested for the unit on "energy and forces"
Therefore, only selected facts, concepts, principles and skills from the
syllabus (or domain) are sampled. The content selected will be determined
by content experts who will judge the relevance of the content in the test to
the content in the syllabus or a particular domain.
Content validity will be low if the questions in the test include questions
testing content not included in the domain or syllabus. To ensure content
validity and coverage, most teachers use the table of specifications (as
discussed in Topic 3). Table 8.2 is an example of a table of specifications
which specifies the knowledge and skills to be measured and the topics
covered for the unit on "energy and forces".
Table 8.2: Table of Specifications for the Unit on "Energy and Forces"

Topics        Understanding of Concepts   Application of Concepts   Total
Light         7                           4                         11 (22%)
Sound         7                           4                         11 (22%)
Heat          7                           4                         11 (22%)
Magnetism     3                           3                         6 (12%)
Electricity   8                           3                         11 (22%)
TOTAL         32 (64%)                    18 (36%)                  50 (100%)
Since you cannot measure all the content of a topic, you will have to focus on
the key areas and give due weighting to those areas that are important. For
example, the teacher has decided that 64 per cent of questions will emphasise
the understanding of concepts, while the remaining 36 per cent will focus on
the application of concepts for the five topics. A table of specifications
provides teachers with evidence that a test has high content validity, that is,
that it covers what should be covered.
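A table of specifications is also easy to keep as a small data structure so that the weights can be checked automatically; the sketch below encodes Table 8.2 (keys and structure are our own choice) and recomputes the totals and percentages.

```python
# A minimal sketch of storing a table of specifications and checking
# that item counts and topic weights add up.
spec = {
    "Light":       {"understanding": 7, "application": 4},
    "Sound":       {"understanding": 7, "application": 4},
    "Heat":        {"understanding": 7, "application": 4},
    "Magnetism":   {"understanding": 3, "application": 3},
    "Electricity": {"understanding": 8, "application": 3},
}

total = sum(sum(cells.values()) for cells in spec.values())
print(total)   # 50 items in all

for topic, cells in spec.items():
    weight = 100 * sum(cells.values()) / total
    print(f"{topic}: {sum(cells.values())} items ({weight:.0f}%)")
```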
Content validity is different from face validity, which refers not to what the
test actually measures but to what it superficially appears to measure. Face
validity assesses whether the test "looks valid" to the examinees who take it,
the administrative personnel who decide on its use and other technically
untrained observers. Face validity is a weak form of validity, and caution is
necessary when relying on it; nevertheless, its importance should not be
underestimated.
(c) Criterion-related validity
(i) Predictive validity relates to whether the test accurately predicts some
future performance or ability. Is STPM a good predictor of performance
in a university? One difficulty in calculating the predictive validity of
STPM is that only those who pass the exam go on to university
(generally speaking), and we do not know how well students who did
not pass might have done. Also, only a small proportion of the
population takes the STPM; this restriction of range tends to lower the
observed correlation between STPM grades and performance at the
degree level (a simulation of this effect appears after this list).
(ii) Concurrent validity is concerned with whether the test correlates with,
or gives substantially the same results as, another test of the same skill.
For example, does your end-of-year language test correlate with the
Malaysian University English Test (MUET)? In other words, if your
language test correlates highly with MUET, then your language test has
high concurrent validity.
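The restriction-of-range problem mentioned under predictive validity can be demonstrated with simulated, made-up data: the sketch below correlates "exam" and "degree" scores for a full population and again for only those above a pass cut-off.

```python
# A minimal sketch of restriction of range: the exam-to-degree correlation
# computed only on students above a cut-off is lower than the correlation
# in the full population.
import random
from statistics import correlation

random.seed(4)

exam, degree = [], []
for _ in range(2000):
    ability = random.gauss(0, 1)
    exam.append(ability + random.gauss(0, 0.6))
    degree.append(ability + random.gauss(0, 0.6))

print(round(correlation(exam, degree), 2))   # full population

# keep only those who "passed" the exam (scorers above the cut-off)
selected = [(e, d) for e, d in zip(exam, degree) if e > 0.5]
sel_exam = [e for e, _ in selected]
sel_degree = [d for _, d in selected]
print(round(correlation(sel_exam, sel_degree), 2))   # attenuated
```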
A test may also be reliable yet not valid because the test does not measure the
intended learning outcomes; there is no constructive alignment between
instruction and assessment.
Figure 8.7: Graphical representations of the relationship between reliability and validity
The centre, or the bull's-eye, is the concept that we are trying to measure. Say, for
example, in trying to measure the concept of "inductive reasoning", you are likely
to hit the centre (or the bull's-eye) if your inductive reasoning test is both reliable
and valid, which is what all test developers aim to achieve (refer to Figure 8.7(d)).
On the other hand, your inductive reasoning test can be "reliable but not valid".
How is that possible? Your test may not measure inductive reasoning, but the
scores you obtain each time you administer the test are approximately the same
(refer to Figure 8.7(b)). In other words, the test is consistently and systematically
measuring the wrong construct (i.e. something other than inductive reasoning).
Imagine the consequences of making judgements about the inductive reasoning
of students using such a test!
The worst-case scenario is when the test is neither reliable nor valid (refer to
Figure 8.7(a)). In this scenario, the scores obtained by students are scattered
around the top and left of the target, missing the centre. Your measure in this
case is neither reliable nor valid and the test should be rejected or improved.
•  The true score is a hypothetical concept with regard to the actual ability, competency and capacity of an individual.

•  The higher the reliability and validity of your test, the greater the likelihood that you will be measuring the true scores of your students.

•  Using the test-retest technique, the same test is administered again to the same group of students.

•  For the parallel or equivalent forms technique, two equivalent tests (or forms) are administered to the same group of students.

•  Internal consistency is determined using only one test administered once to the students.

•  When two or more people mark essay questions, the extent to which there is agreement in the marks allotted is called inter-rater reliability.

•  Some people may think of reliability and validity as two separate concepts. In reality, reliability and validity are related.