
Chapter 6

Establishing Validity and Reliability of Tests


Assessment tools, whether traditional or authentic, should possess good qualities. They should be carefully developed so that they serve their intended purpose. The most common concepts are validity, reliability, and accuracy.
Lesson No. 1: Validity of a Test
A test is useful only when it measures the skill, trait, or attribute it is supposed to measure. This lesson will deal with one important concept in assessment: the validity of a test.
Intended Learning Outcomes:
1. Define the following terms: validity, content validity, construct validity, concurrent validity, predictive validity, and validity coefficient;
2. Discuss the different approaches to validity;
3. Identify the factors affecting the validity of a test;
4. Compute the validity coefficient; and
5. Interpret the validity coefficient of a test.
Take off:
If a test shows that 60% of the students in a third-grade class are reading below their grade level, should you be seriously concerned? Your initial response might be an unqualified YES, but that is not necessarily so. The following lessons will tell you why.
Content Focus:
A. Validity of a Test
A test has validity evidence if we can demonstrate that it measures what it says it measures. For instance, if it is supposed to be a test of third-grade arithmetic ability, it should measure third-grade arithmetic skills, not fifth-grade arithmetic skills and not reading ability. If it is supposed to measure the ability to write behavioral objectives, it should measure that ability, not the ability to recognize bad objectives. Clearly, if a test is to have any use at all, we should be able to identify evidence for the test's validity.

B. Types of Validity
B.1. Content Validity. A type of validation that refers to the relationship between the test items and the instructional objectives, so that the test measures what it is supposed to measure. Things to remember about this type of validity:
B.1.a. The evidence of the content validity of a test is found in the Table of Specifications.
B.1.b. This is the most important type of validity for a classroom teacher.
B.1.c. There is no coefficient for content validity. It is determined by experts judgmentally, not empirically.

B.2. Criterion-related Validity. A type of validation that refers to the extent to which the scores from a test relate to theoretically similar measures. It is a measure of how accurately a student's current test score can be used to estimate a score on a criterion measure, such as performance in courses, classes, or another measurement instrument. For example, classroom reading grades should indicate levels of performance similar to Standardized Reading Test scores.
B.2.a. Concurrent validity. The criterion and the predictor data are collected at the same time. It is established by correlating the criterion and the predictor using the Pearson product-moment correlation coefficient or other statistical tools. This type of validity is appropriate for tests designed to assess a student's current status on a criterion, or when you want to diagnose a student's present standing.
B.2.b. Predictive validity. A type of validation that refers to a measure of the extent to which a student's current test result can be used to estimate accurately the outcome of the student's performance at a later time. It is appropriate for a test designed to assess a student's future status on a criterion. Regression analysis can be used to predict the criterion from a single predictor or from multiple predictors, as in the sketch below.
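To make the idea of prediction concrete, here is a minimal sketch of simple linear regression in Python. The score lists are hypothetical, invented only for illustration, and the numpy library is assumed to be available.

```python
import numpy as np

# Hypothetical data: entrance-test scores (predictor) and later
# first-semester grades (criterion) for the same students.
predictor = np.array([78, 85, 62, 90, 70, 88, 75, 95])
criterion = np.array([2.4, 2.9, 1.8, 3.2, 2.1, 3.0, 2.3, 3.6])

# Fit a least-squares line: criterion ≈ slope * predictor + intercept.
slope, intercept = np.polyfit(predictor, criterion, 1)

# Estimate the future criterion score of a new student who scored 80.
predicted = slope * 80 + intercept
print(f"predicted grade for a score of 80: {predicted:.2f}")
```

How well such predictions hold up against students' actual later performance is what the predictive validity coefficient summarizes.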

B.3. Construct Validity. A type of validation that refers to the measure of the extent to which a test measures theoretical and unobservable qualities such as intelligence, math achievement, test anxiety, etc., over a period of time, on the basis of gathered evidence. It is established through intensive study of the test or measuring instrument using convergent/divergent validation and factor analysis.
B.3.a. Convergent validity is a type of construct validation wherein a test has a high correlation with another test that measures the same construct.
B.3.b. Divergent validity is a type of construct validation wherein a test has a low correlation with another test that measures a different construct.
B.3.c. Factor analysis is a complex statistical procedure used in establishing test validity.

The following table summarizes our discussion of content, concurrent, and predictive validity. The actual process used to assess construct validity is lengthy and complex and need not concern us here.

Type of Validity             | Ask the Question                                 | To Answer the Question
-----------------------------|--------------------------------------------------|---------------------------------------------
Content Validity             | Do test items match and measure objectives?      | Match the items with objectives
Concurrent Criterion-Related | How well does performance on the new test match  | Correlate the new test with an accepted
Validity                     | performance on an established test?              | criterion, for example, a well-established
                             |                                                  | test measuring the same behavior
Predictive Criterion-Related | Can the test predict successful performance, for | Correlate scores from the new test with a
Validity                     | example, success or failure in the next grade?   | measure of some future performance

Remember, predictive validity involves a time interval. For example, the SAT
score before college is correlated with the GPA in college. Concurrent validity does not
involve a time interval. A test is administered and its relationship to a well-established
test measuring the same behavior is determined.
C. Factors Affecting Validity of a Test
C.1. The test itself: poor construction, unclear directions, ambiguous test items, vocabulary that is too difficult, inadequate time limits, inappropriate level of difficulty, and unintended clues.
C.2. Personal factors influencing students' responses to the test.
C.3. Validity is always specific to a particular group.
D. Computing and Interpreting Validity Coefficient
The validity coefficient is the computed value of rxy. In theory, the validity coefficient, like the correlation coefficient, has values that range from 0 to 1. In practice, most validity coefficients are small, ranging from 0.3 to 0.5; few exceed 0.6 to 0.7.

Another way of interpreting the findings is to consider the squared correlation coefficient, (rxy)², called the coefficient of determination. The coefficient of determination indicates how much of the variation in the criterion can be accounted for by the predictor (the teacher-made test). Example: If the computed value of rxy = 0.75, the coefficient of determination is (0.75)² = 0.5625; that is, 56.25% of the variance in student performance can be attributed to the test, and the remaining 43.75% cannot be attributed to the test results.

Example: Teacher A develops a 45-item test and he wants to determine if his test is valid. He takes another test that is already acknowledged for its validity and uses it as a criterion. He administers these two tests to his 15 students. The following table shows the results of the two tests. Is the test developed by Teacher A valid? Find the validity coefficient using the Pearson r and the coefficient of determination.
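Since both this example and the exercise below call for the Pearson r and the coefficient of determination, here is a minimal Python sketch of the computation, using only the standard library; the function names are ours, not part of any prescribed procedure.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between paired score lists x and y."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

def coefficient_of_determination(r):
    """Proportion of criterion variance accounted for by the predictor."""
    return r ** 2
```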
Take Action: Exercises
1. What type of validity goes with each of the following procedures?
a. Matching test items with objectives.
b. Correlating a test of mechanical skills after training with on-the-job performance
ratings.
c. Correlating the short form of an IQ test with the long form.
d. Correlating a paper-pencil test of musical talent with ratings from a live audition
completed after the test.
e. Correlating a test of reading ability with a test of mathematical ability.
2. Compute the validity coefficient using the Pearson r and the coefficient of
determination. Interpret the results.

Scores of students on a teacher-made test (X): 25, 22, 18, 18, 16, 14, 12, 8, 6, 6
Scores of the same students on the criterion test (Y): 22, 23, 25, 28, 31, 32, 34, 42, 44, 48
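Applied to the scores above, the pearson_r sketch from earlier would run as follows; this is a worked check under those assumptions, not an official answer key.

```python
x = [25, 22, 18, 18, 16, 14, 12, 8, 6, 6]     # teacher-made test (X)
y = [22, 23, 25, 28, 31, 32, 34, 42, 44, 48]  # criterion test (Y)

r = pearson_r(x, y)
print(f"validity coefficient rxy = {r:.2f}")         # about -0.97
print(f"coefficient of determination = {r**2:.2f}")  # about 0.94
```

A large negative coefficient would mean the teacher-made test ranks students roughly in reverse order of the criterion, which is evidence against its validity.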

Lesson 2: Reliability of a Test


If any use is to be made of the information from a test, it is desirable that the results be reliable. If the test is going to be used to make placement decisions about individual students, you wouldn't want it to provide different data about students if it were given again the next day. If the test is going to be used to make decisions about the difficulty level of your instructional materials, you wouldn't want it to indicate that the same materials are difficult one day and easy the next. In testing, as in our everyday lives, if we have use for some piece of information, we would like that information to be stable, consistent, and dependable.
Intended Learning Outcomes:
1. Define the following terms: reliability, test-retest method, equivalent/parallel forms method, split-half method, Kuder-Richardson formula, and reliability coefficient;
2. Identify the factors affecting the reliability of a test; and
3. Compute and interpret the reliability coefficient of a test.
Content focus:
A. Reliability of a test refers to the consistency with which it yields the same rank for individuals who take the test more than once, that is, how consistent test results or other measurement results are from one measurement to another. A test is reliable when it can be used to predict practically the same scores when the same test is administered twice to the same group of students, with a reliability index of 0.60 or above. The reliability of a test can be determined by means of the Pearson product-moment correlation coefficient, the Spearman-Brown formula, or the Kuder-Richardson formula (KR). The KR formula will not be considered here.
B. Factors affecting Reliability of a Test
B.1. Length of the test. The longer the test, the higher the test reliability
B.2. Moderate item difficulty. Reliability tends to decrease as tests become too
easy or too difficult.
B.3. Objective scoring. The more objective the scoring the higher the test
reliability.
B.4. Heterogeneity of the group. The more heterogeneous the group, the higher
the test reliability.
B.5. Limited time
C. Methods of Estimating Reliability
C.1. Test-Retest or Stability. It is a method of estimating reliability that is exactly what its name implies. The test is given twice, and the correlation between the first set of scores and the second set of scores is determined. Generally, the longer the interval between test administrations, the lower the correlation. Since students change with the passage of time, an especially long interval between testings will produce a "reliability" coefficient that is more a reflection of student changes on the attribute being measured than a reflection of the reliability of the test. The formula used is the Pearson product-moment correlation coefficient:

rxy = [n∑xy − (∑x)(∑y)] / √{[n∑x² − (∑x)²][n∑y² − (∑y)²]}

Exercise: 1. Solve for the Reliability Coefficient using the Pearson r formula.

Student   1st Test (x)   2nd Test (y)      xy       x²       y²
   1           36             38          1,368    1,296    1,444
   2           26             34            884      676    1,156
   3           38             38          1,444    1,444    1,444
   4           15             27            405      225      729
   5           17             25            425      289      625
   6           28             26            728      784      676
   7           32             35          1,120    1,024    1,225
   8           35             36          1,260    1,225    1,296
   9           12             19            228      144      361
  10           35             38          1,330    1,225    1,444
n = 10      ∑x = 274       ∑y = 316    ∑xy = 9,192   ∑x² = 8,332   ∑y² = 10,400
2. Interpret the Reliability Coefficient using the Table that follows:

Levels of Reliability Coefficient

Reliability Coefficient   Interpretation

Above 0.90                Excellent reliability.
0.81 – 0.90               Very good for a classroom test.
0.71 – 0.80               Good for a classroom test. There are probably a few
                          items that need to be improved.
0.61 – 0.70               Somewhat low. The test needs to be supplemented by
                          other measures (more tests) to determine grades.
0.51 – 0.60               Suggests a need for revision of the test, unless it is
                          quite short (ten items or fewer). The test needs to be
                          supplemented by other measures (more tests) to
                          determine grades.
0.50 and below            Questionable reliability. The test should not contribute
                          heavily to the course grade, and it needs revision.
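For checking the arithmetic, here is a small Python sketch that computes the Pearson r directly from the summary statistics in the exercise table and looks up the interpretation band from the table above; the helper names are ours, invented for illustration.

```python
import math

def pearson_r_from_sums(n, sx, sy, sxy, sx2, sy2):
    """Pearson r from n, ∑x, ∑y, ∑xy, ∑x², and ∑y²."""
    return (n * sxy - sx * sy) / math.sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))

def interpret(r):
    """Map a reliability coefficient to the bands in the table above."""
    if r > 0.90:
        return "Excellent reliability."
    elif r >= 0.81:
        return "Very good for a classroom test."
    elif r >= 0.71:
        return "Good for a classroom test."
    elif r >= 0.61:
        return "Somewhat low."
    elif r >= 0.51:
        return "Suggests a need for revision of the test."
    else:
        return "Questionable reliability."

r = pearson_r_from_sums(10, 274, 316, 9192, 8332, 10400)
print(f"r = {r:.2f}: {interpret(r)}")  # about 0.91: excellent reliability
```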

C.2. Alternate Forms or Equivalence. If there are two equivalent forms of a test, these forms can be used to estimate the reliability of the test. Both forms are given to a group of students, and the correlation between the two sets of scores is determined. This estimate eliminates the problems of memory and practice involved in test-retest estimates. Large differences in a student's scores on two forms that supposedly measure the same behavior or trait would indicate an unreliable test. To use this method, two equivalent forms of the test must be available, and they must be administered under conditions as nearly equivalent as possible. One major problem with this method is that it takes a great deal of time and effort to develop one good test, let alone two. Hence, this method is often used by test publishers, who create two forms of their test for other reasons (e.g., to maintain test security).
C.3. Internal Consistency. If the test in question is designed to measure a single basic concept, it is reasonable to assume that students who get one item right will be more likely to get other, similar items right; in that case, the test has internal consistency. One approach to determining a test's internal consistency, called split halves, involves splitting the test into equivalent halves and determining the correlation between them. This can be done by assigning all items in the first half of the test to one form and all items in the second half of the test to the other form. However, this method is appropriate only when items of varying difficulty are spread evenly across the test. Frequently, they are not. In these cases, the best approach is to divide the test items by placing all odd-numbered items into one half and all even-numbered items into the other half. When this latter approach is used, the reliability is more commonly called the odd-even reliability. To compute the internal consistency of the whole test, the Spearman-Brown prophecy formula (a correction formula) is used. It is
rw = 2rb / (1 + rb)

Where: rw is the correlation for the whole test
       rb is the correlation between the two halves of the test
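Here is a minimal Python sketch of the odd-even split-half procedure, reusing a Pearson r helper like the one sketched in Lesson 1; the item-response matrix is hypothetical, invented only for illustration.

```python
def spearman_brown(rb):
    """Step up the half-test correlation rb to the whole-test reliability rw."""
    return (2 * rb) / (1 + rb)

# Hypothetical 0/1 item responses: one row per student, one column per item.
responses = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 0, 1],
    [0, 0, 1, 0, 0, 1, 0, 0],
]

# Score each student on the odd-numbered and even-numbered items.
odd_scores = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
even_scores = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...

rb = pearson_r(odd_scores, even_scores)  # correlation between the halves
rw = spearman_brown(rb)                  # reliability of the whole test
print(f"rb = {rb:.2f}, rw = {rw:.2f}")
```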

Exercise: The test was given once. The scores of the students on the odd-numbered and even-numbered items are gathered. Using the split-half method, is the test reliable? Show the complete solution. To find the correlation between the odd and even test items, solve for ∑x, ∑y, ∑xy, ∑x², and ∑y². Then use the formula rw = 2rb / (1 + rb) to solve for the reliability of the whole test.

Take action: Answer the following:


1. For each of the following statements, indicate which type of reliability is being referred to from among the four alternatives that follow:
Test-retest
Alternate forms
Alternate forms (long interval)
Split-half
a. "Practice effect" could most seriously affect this type of reliability.
b. This procedure would yield the lowest reliability coefficient.
c. This procedure requires the use of a correction formula to adjust the estimate of the reliability for the total test.
d. This should be used by teachers who want to give comparable (but different) tests to students.
e. In this method, changes caused by item sampling will not be reflected.

2. All things being equal, which test – A, B, or C would you use?


Type of Reliability Test A Test B Test C
Split-half Reliability Coefficient .80 .90 .85
Test-retest Reliability Coefficient .60 .60 .75
Alternate-forms Reliability .30 .60 .60
