Topic 8: Validity of Assessment Techniques
INTRODUCTION
We have discussed the various methods of assessing student performance using
objective tests, essay tests, authentic assessments, project assessments and
portfolio assessments. In this topic, we will address two important issues, namely,
the reliability and validity of these assessment methods. How do we ensure that
the techniques we use for assessing the knowledge, skills and values of students
are reliable and valid? We are making important decisions about the abilities and
capabilities of the future generation and obviously we want to ensure that we are
making the right decisions.
Let's say you gave a Geometry test to a group of Form Five students and one of your
students, Swee Leong, obtained a score of 66 per cent in the test. How sure
are you that this is actually the score Swee Leong should receive? Is that his true
score? When you develop a test and administer it to your students, you are
attempting to measure as far as possible the true score of each student. The true
score is a hypothetical concept with regard to the actual ability, competency and
capacity of an individual. A test attempts to measure the true score of a person.
When measuring human abilities, it is practically impossible to develop an error-free
test; there will always be some error. However, the presence of error does not
mean that the test is not good; what matters is the size of the error.
Formally, an observed test score, X, is conceived as the sum of a true score, T, and
an error term, E. The true score is defined as the average of the scores a student
would obtain if the test were administered repeatedly (and the student could be
made to forget the content of the test between administrations). Because the true
score is defined as this average, the observed score in any single administration
departs from the true score, and the difference is called measurement error. This
statement can be expressed as follows:

X = T + E

where X is the observed score, T is the true score and E is the measurement error.
This departure is not caused by blatant mistakes on the part of test writers; it
arises from chance elements in students' performance on a test. Measurement
error comes mostly from the fact that we have sampled only a small portion of a
student's capabilities. Ambiguous questions and incorrect marking can
contribute to measurement error, but they account for only a small part of it.
Imagine that there are 10,000 items and that a student would obtain 60 per cent if
all 10,000 items were administered (which is not practically feasible). Then, 60 per
cent is the true score. Now, assume that you sample only 40 items to put in a test.
The expected score for the student is 24 items correct. However, the student may
get 20, 26, 30 and so on, depending on which items are in the test. This is the main
source of measurement error. That is, measurement error is due to the sampling
of items, rather than to poorly written items.
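To make the item-sampling idea concrete, here is a minimal Python sketch (not from the source module) which assumes, for simplicity, that every item has the same 0.6 probability of being answered correctly; repeated 40-item forms then produce observed scores that scatter around the expected 24.

```python
# A minimal sketch of measurement error arising from item sampling:
# a student whose true score is 60 per cent answers a random sample
# of 40 items, and the observed score varies from form to form.
import random

random.seed(1)

TRUE_PROPORTION = 0.6   # the student's true score: 60 per cent
NUM_ITEMS = 40          # items sampled into one test form

def administer_test() -> int:
    """Return the number of items answered correctly on one 40-item form."""
    return sum(1 for _ in range(NUM_ITEMS) if random.random() < TRUE_PROPORTION)

scores = [administer_test() for _ in range(10)]
print(scores)                       # e.g. scores such as 22, 25, 27, ...
print(NUM_ITEMS * TRUE_PROPORTION)  # expected score: 24.0
```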
Generally, the smaller the error, the greater the likelihood that you are closer to
measuring the true score of a student. If you are confident that your Geometry test
(observed score) has a small error, then you can confidently infer that Swee
Leong's score of 66 per cent is close to his true score or his actual ability in solving
geometry problems; i.e. what he actually knows. To reduce the error in a test, you
must ensure that your test is both reliable and valid. The higher the reliability and
validity of your test, the greater the likelihood that you will be measuring the true
score of your students. We will first examine the reliability of a test.
Would your students get the same scores if they took your test on two different
occasions? Would they get approximately the same scores if they took two
different forms of your test? These questions have to do with the consistency of
your classroom tests in measuring students' abilities, skills and attitudes or values.
The generic name for consistency is reliability. Reliability is an essential
characteristic of a good test because if a test does not measure consistently
(reliably), then you cannot count on the scores resulting from the administration
of the test.
If there is relatively little error, the ratio of the true score variance to the observed
score variance (reliability = σ_T² / σ_X²) approaches 1.00, which is perfect
reliability. If there is a relatively large amount of error, the ratio approaches 0.00,
which is total unreliability (refer to Figure 8.1).
High reliability means that the questions of a test tended to "pull together".
Students who answered a given question correctly were more likely to answer
other questions correctly as well. If an equivalent or parallel test was developed
by using similar items, the relative scores of students would show little change.
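The variance-ratio definition above can be illustrated with a short simulation; the sketch below assumes normally distributed true scores (SD 10) and errors (SD 5), so the expected reliability is 100/(100 + 25) = 0.80.

```python
# A minimal sketch of reliability as the ratio of true-score variance
# to observed-score variance, using simulated students.
import random
import statistics

random.seed(2)

true_scores = [random.gauss(60, 10) for _ in range(1000)]  # var(T) near 100
errors      = [random.gauss(0, 5)  for _ in range(1000)]   # var(E) near 25
observed    = [t + e for t, e in zip(true_scores, errors)]

reliability = statistics.variance(true_scores) / statistics.variance(observed)
print(round(reliability, 2))   # close to 100 / 125 = 0.80
```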
Reliability       Interpretation
0.90 and above    Excellent reliability; comparable to the best standardised tests.
0.80–0.90         Very good for a classroom test.
0.70–0.80         Good for a classroom test; there are probably a few items which could be improved.
0.60–0.70         Somewhat low; there are probably some items which could be removed or improved.
0.50–0.60         The test needs to be revised.
Below 0.50        Questionable reliability; the test should be replaced or needs major revision.
If you know the reliability coefficient of a test, can you estimate the true score of
a student on the test? In testing, we use the standard error of measurement (SEM)
for this purpose. It is computed from the standard deviation (SD) of the test scores
and the reliability coefficient r:

SEM = SD × √(1 − r)

Using the normal curve, you can then estimate a student's true score with some
degree of certainty based on the observed score and the standard error of
measurement.
Example 8.1:
You gave a History test to a group of 40 students. Khairul obtained a score of 75 in
the test, which is his observed score. The standard deviation of your test is 2.0.
Earlier, you had established that your History test had a reliability coefficient of
0.7. You are interested in finding out Khairul's true score. First, compute the
standard error of measurement:

SEM = 2.0 × √(1 − 0.7) = 2.0 × 0.548 ≈ 1.1

Therefore, based on the normal distribution curve (refer to Figure 8.2), Khairul's
true score should be:

(a) Between 75 − 1.1 and 75 + 1.1, or between 73.9 and 76.1, 68 per cent of the
time.

(b) Between 75 − 2.2 and 75 + 2.2, or between 72.8 and 77.2, 95 per cent of the
time.

(c) Between 75 − 3.3 and 75 + 3.3, or between 71.7 and 78.3, 99.7 per cent of the
time.
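The calculation in Example 8.1 can be checked with a few lines of Python; the 68/95/99.7 coverage values are the usual normal-curve figures.

```python
# Checking Example 8.1: the standard error of measurement,
# SEM = SD * sqrt(1 - reliability), and the resulting true-score bands.
import math

sd, reliability, observed = 2.0, 0.7, 75
sem = sd * math.sqrt(1 - reliability)
print(round(sem, 1))   # 1.1

for k, coverage in [(1, "68%"), (2, "95%"), (3, "99.7%")]:
    low, high = observed - k * sem, observed + k * sem
    print(f"{coverage}: {low:.1f} to {high:.1f}")
```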
There are several techniques for estimating the reliability of a test:
(a) Test-retest
Using the test-retest technique, the same test is administered again to the
same group of students. The scores obtained in the first administration of the
test are correlated to the scores obtained on the second administration of the
test. If the correlation between the two scores is high, then the test can be
considered to have high reliability. However, a test-retest situation is
somewhat difficult to conduct as it is unlikely that students will be prepared
to take the same test twice.
There is also the effect of practice and memory that may influence the
correlation: the shorter the time gap between the two administrations, the higher
the correlation; the longer the time gap, the lower the correlation, because the two
sets of observations become less related as time passes. Since this correlation is
the test-retest estimate of reliability, you can obtain considerably different
estimates depending on the interval.
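A minimal sketch of the computation follows, using made-up score pairs; any Pearson correlation routine would do (statistics.correlation requires Python 3.10 or later).

```python
# A minimal sketch of the test-retest estimate: correlate two administrations
# of the same test to the same students (scores below are made up).
from statistics import correlation

first_administration  = [55, 62, 70, 48, 80, 66, 59, 73]
second_administration = [58, 60, 72, 50, 78, 69, 57, 75]

print(round(correlation(first_administration, second_administration), 2))
# a value near 1.0 indicates high test-retest reliability
```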
(b) Parallel or equivalent forms
Using the parallel or equivalent forms technique, two equivalent tests (or
forms) covering the same content are administered to the same group of
students and the two sets of scores are correlated. Because students do not
answer the same items twice, the correlation is less affected by the influence
of memory. One major problem with this approach is that you have to be able
to generate a lot of items that reflect the same construct. This is often not an
easy feat.
(c) Internal consistency
Internal consistency is determined using only one test administered once to
the students. Two common techniques are the split-half method and
Cronbach's alpha.

(i) Split-half
To solve the problem of having to administer the same test twice, the
split-half technique is used. In this technique, a test is administered
once to a group of students. After the students have completed the test,
it is divided into two equal halves. This technique is most appropriate
for tests which include multiple-choice items, true-false items and
perhaps short-answer essays. The items are split using the odd-even
method, whereby one half of the test consists of the odd-numbered
items while the other half consists of the even-numbered items. The
scores obtained on the two halves are then correlated, and the
reliability of the whole test is estimated using the Spearman-Brown
formula:

r_sb = 2r_xy / (1 + r_xy)

In this formula, r_sb is the split-half reliability coefficient of the whole
test and r_xy is the correlation between the two halves. Say, for
example, you have established that the correlation between the two
halves is 0.65. The reliability of the whole test is then:

r_sb = (2 × 0.65) / (1 + 0.65) = 1.30 / 1.65 ≈ 0.79
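The whole procedure, from the odd-even split to the Spearman-Brown step-up, can be sketched in a few lines; the item responses below are made up for illustration.

```python
# A minimal sketch of the split-half procedure: split one test into odd- and
# even-numbered items, correlate the half scores, then step up with the
# Spearman-Brown formula (1 = correct answer, 0 = wrong answer).
from statistics import correlation

# each row is one student's item scores on a 6-item test
responses = [
    [1, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
]

odd_half  = [sum(row[0::2]) for row in responses]  # items 1, 3, 5
even_half = [sum(row[1::2]) for row in responses]  # items 2, 4, 6

r_xy = correlation(odd_half, even_half)
r_sb = 2 * r_xy / (1 + r_xy)   # Spearman-Brown: reliability of the full test
print(round(r_xy, 2), round(r_sb, 2))
```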
(ii) Cronbach's alpha
Cronbach's alpha estimates the internal consistency of a test from a
single administration. For a test of k items scored right/wrong, it is
computed as:

α = [k / (k − 1)] × [1 − Σ p_i(1 − p_i) / σ_x²]

where k is the number of items, p_i is the difficulty index of item i (the
proportion of students who answered it correctly) and σ_x² is the
variance of the students' total scores.
Example 8.2:
Suppose that in a multiple-choice test consisting of five items or
questions, the following difficulty index for each item was observed:
p_1 = 0.4, p_2 = 0.5, p_3 = 0.6, p_4 = 0.75 and p_5 = 0.85, and the
sample variance of the total scores is σ_x² = 1.84. Cronbach's alpha
would be calculated as follows:

Σ p_i(1 − p_i) = 0.24 + 0.25 + 0.24 + 0.1875 + 0.1275 = 1.045

α = [5 / (5 − 1)] × [1 − 1.045 / 1.840] ≈ 0.54
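Example 8.2 can be verified directly from the difficulty indices and the given variance:

```python
# Checking Example 8.2: Cronbach's alpha from the item difficulty indices
# and the total-score variance given in the text.
p = [0.4, 0.5, 0.6, 0.75, 0.85]   # difficulty index of each item
variance_x = 1.84                  # variance of total test scores
k = len(p)

item_variance_sum = sum(pi * (1 - pi) for pi in p)   # 1.045
alpha = (k / (k - 1)) * (1 - item_variance_sum / variance_x)
print(round(item_variance_sum, 3), round(alpha, 2))  # 1.045 0.54
```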
A Word of Caution!
When you get a low alpha, be careful not to conclude immediately that the test is
a bad test. Instead, check whether the test measures several attributes or
dimensions rather than one. If it does, Cronbach's alpha for the whole test is
likely to be deflated.

For example, an aptitude test may measure three attributes or dimensions, such
as quantitative ability, language ability and analytical ability. Hence, it is not
surprising that Cronbach's alpha for the whole test may be low, as the questions
may not correlate with each other. Why? Because the items are measuring three
different types of human abilities. The solution is to compute three different
Cronbach's alphas: one for quantitative ability, one for language ability and one
for analytical ability.
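The deflation effect can be demonstrated with simulated, made-up data; in the sketch below, two independent abilities each drive four items, so alpha for the whole eight-item test comes out lower than alpha for either four-item subscale.

```python
# A minimal sketch of the caution above: when a test mixes two unrelated
# dimensions, alpha for the whole test is deflated relative to alpha
# computed separately for each subscale.
import random
from statistics import variance

random.seed(3)

def cronbach_alpha(rows):
    """Cronbach's alpha for a students-by-items matrix of scores."""
    k = len(rows[0])
    item_vars = sum(variance([row[i] for row in rows]) for i in range(k))
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - item_vars / total_var)

# two independent abilities; four items measure each
students = []
for _ in range(200):
    a, b = random.gauss(0, 1), random.gauss(0, 1)
    items_a = [a + random.gauss(0, 0.7) for _ in range(4)]
    items_b = [b + random.gauss(0, 0.7) for _ in range(4)]
    students.append(items_a + items_b)

print(round(cronbach_alpha(students), 2))                    # whole test: deflated
print(round(cronbach_alpha([s[:4] for s in students]), 2))   # subscale A: higher
print(round(cronbach_alpha([s[4:] for s in students]), 2))   # subscale B: higher
```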
SELF-CHECK 8.2
1. What is the main advantage of the split-half technique over the test-
retest technique in determining the reliability of a test?
SELF-CHECK 8.3
List the steps that may be taken to enhance inter-rater reliability in the
grading of essay answer scripts.
ACTIVITY 8.2
In the myINSPIRE online forum, suggest other steps you would take to
enhance intra-rater reliability in the grading of projects.
Validity refers to the extent to which a test measures what it is intended to
measure. Validity will vary from test to test, depending on what the test is used
for. For example, a test may have high validity in testing the recall of facts in
economics, but that same test may be low in validity with regard to testing the
application of concepts in economics.
Messick (1989) was most concerned about the inferences a teacher draws from the
test score, the interpretation the teacher makes about his or her students and the
consequences of such inferences and interpretations. You can imagine the power
an educator holds in his or her hand when designing a test. Your test could
determine the future of many thousands of students. Inferences based on a test of
low validity could give a completely different picture of the actual abilities and
competencies of students.
Three types of validity have been identified: construct validity, content validity
and criterion-related validity, which comprises predictive and concurrent
validity (refer to Figure 8.5).
(a) Construct validity
Construct validity refers to the extent to which a test measures the
theoretical construct or trait it is intended to measure. Thus, to ensure high
construct validity, you must be clear about the definition of the construct
you intend to measure. For example, a construct such as reading
comprehension would include vocabulary development, reading for literal
meaning and reading for inferential meaning. Some
experts in educational measurement have argued that construct validity is
the most critical type of validity. You could establish the construct validity
of an instrument by correlating it with another test that measures the same
construct. For example, you could compare the scores obtained on your
reading comprehension test with the scores obtained on another well-known
reading comprehension test administered to the same sample of students. If
the scores for the two tests are highly correlated, then you may conclude that
your reading comprehension test has high construct validity.
(b) Content validity
Content validity refers to the extent to which the questions in a test
adequately sample the content domain or syllabus the test is supposed to
cover. For example, the science unit on "energy and forces" may include
facts, concepts, principles and skills on light, sound, heat, magnetism and
electricity. However, it is difficult, if not impossible, to administer a two- to
three-hour paper that tests all aspects of the syllabus on "energy and forces"
(refer to Figure 8.6).
Figure 8.6: Sample of content tested for the unit on "energy and forces"
Therefore, only selected facts, concepts, principles and skills from the
syllabus (or domain) are sampled. The content selected will be determined
by content experts who will judge the relevance of the content in the test to
the content in the syllabus or a particular domain.
Content validity will be low if the questions in the test include questions
testing content not included in the domain or syllabus. To ensure content
validity and coverage, most teachers use the table of specifications (as
discussed in Topic 3). Table 8.2 is an example of a table of specifications
which specifies the knowledge and skills to be measured and the topics
covered for the unit on "energy and forces".
Table 8.2: Table of Specifications for the Unit on "Energy and Forces"

Topics        Understanding of Concepts   Application of Concepts   Total
Light         7                           4                         11 (22%)
Sound         7                           4                         11 (22%)
Heat          7                           4                         11 (22%)
Magnetism     3                           3                         6 (12%)
Electricity   8                           3                         11 (22%)
TOTAL         32 (64%)                    18 (36%)                  50 (100%)
Since you cannot measure all the content of a topic, you will have to focus on
the key areas and give due weighting to those areas that are important. For
example, the teacher has decided that 64 per cent of questions will emphasise
the understanding of concepts, while the remaining 36 per cent will focus on
the application of concepts for the five topics. A table of specifications
provides teachers with evidence that a test has high content validity, that is,
that it covers what should be covered.
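A table of specifications is also easy to keep as a small data structure so that the weights can be checked automatically; the sketch below encodes Table 8.2 (keys and structure are our own choice) and recomputes the totals and percentages.

```python
# A minimal sketch of storing a table of specifications and checking
# that item counts and topic weights add up.
spec = {
    "Light":       {"understanding": 7, "application": 4},
    "Sound":       {"understanding": 7, "application": 4},
    "Heat":        {"understanding": 7, "application": 4},
    "Magnetism":   {"understanding": 3, "application": 3},
    "Electricity": {"understanding": 8, "application": 3},
}

total = sum(sum(cells.values()) for cells in spec.values())
print(total)   # 50 items in all

for topic, cells in spec.items():
    weight = 100 * sum(cells.values()) / total
    print(f"{topic}: {sum(cells.values())} items ({weight:.0f}%)")
```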
Content validity is different from face validity, which refers not to what the
test actually measures but to what it superficially appears to measure. Face
validity assesses whether the test "looks valid" to the examinees who take it,
the administrative personnel who decide on its use and other technically
untrained observers. Face validity is a weak form of validity, and caution is
necessary when relying on it; nevertheless, its importance should not be
underestimated.
(c) Criterion-related validity
(i) Predictive validity relates to whether the test accurately predicts some
future performance or ability. Is STPM a good predictor of performance
in a university? One difficulty in calculating the predictive validity of
STPM is that only those who pass the exam go on to university
(generally speaking), and we do not know how well students who did
not pass might have done. Also, only a small proportion of the
population takes the STPM; this restriction of range tends to lower the
observed correlation between STPM grades and performance at the
degree level (a simulation of this effect appears after this list).
(ii) Concurrent validity is concerned with whether the test correlates with,
or gives substantially the same results as, another test of the same skill.
For example, does your end-of-year language test correlate with the
Malaysian University English Test (MUET)? In other words, if your
language test correlates highly with MUET, then your language test has
high concurrent validity.
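The restriction-of-range problem mentioned under predictive validity can be demonstrated with simulated, made-up data: the sketch below correlates "exam" and "degree" scores for a full population and again for only those above a pass cut-off.

```python
# A minimal sketch of restriction of range: the exam-to-degree correlation
# computed only on students above a cut-off is lower than the correlation
# in the full population.
import random
from statistics import correlation

random.seed(4)

exam, degree = [], []
for _ in range(2000):
    ability = random.gauss(0, 1)
    exam.append(ability + random.gauss(0, 0.6))
    degree.append(ability + random.gauss(0, 0.6))

print(round(correlation(exam, degree), 2))   # full population

# keep only those who "passed" the exam (scorers above the cut-off)
selected = [(e, d) for e, d in zip(exam, degree) if e > 0.5]
sel_exam = [e for e, _ in selected]
sel_degree = [d for _, d in selected]
print(round(correlation(sel_exam, sel_degree), 2))   # attenuated
```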
A test may also be reliable yet not valid because the test does not measure the
intended learning outcomes; there is no constructive alignment between
instruction and assessment.
Figure 8.7: Graphical representations of the relationship between reliability and validity
The centre, or the bull's-eye, is the concept that we are trying to measure. Say, for
example, in trying to measure the concept of "inductive reasoning", you are likely
to hit the centre (or the bull's-eye) if your inductive reasoning test is both reliable
and valid, which is what all test developers aim to achieve (refer to Figure 8.7(d)).
On the other hand, your inductive reasoning test can be "reliable but not valid".
How is that possible? Your test may not measure inductive reasoning, but the
scores you obtain each time you administer the test are approximately the same
(refer to Figure 8.7(b)). In other words, the test is consistently and systematically
measuring the wrong construct (i.e. something other than inductive reasoning).
Imagine the consequences of making judgements about the inductive reasoning
of students using such a test!
The worst-case scenario is when the test is neither reliable nor valid (refer to
Figure 8.7(a)). In this scenario, the scores obtained by students are scattered
around the top and left of the target, missing the centre. Your measure in this
case is neither reliable nor valid and the test should be rejected or improved.
•  The true score is a hypothetical concept with regard to the actual ability, competency and capacity of an individual.

•  The higher the reliability and validity of your test, the greater the likelihood that you will be measuring the true scores of your students.

•  Using the test-retest technique, the same test is administered again to the same group of students.

•  For the parallel or equivalent forms technique, two equivalent tests (or forms) are administered to the same group of students.

•  Internal consistency is determined using only one test administered once to the students.

•  When two or more people mark essay questions, the extent to which there is agreement in the marks allotted is called inter-rater reliability.

•  Some people may think of reliability and validity as two separate concepts. In reality, reliability and validity are related.