
Establishing Test Validity

and Reliability
LESSON 6 / CHAPTER 2
What is test reliability?

Reliability is the consistency of responses on a measure under three conditions:

(1) when the same person is retested – a consistent response is expected when the same test is given to the same set of participants;
(2) when retested on the same measure – responses are expected to be consistent when the same test, or another test that measures the same characteristic, is administered at a different time;
(3) when similarity of responses across items that measure the same characteristic is established – there is reliability when the person responds in the same way, or consistently, across items that measure the same characteristic.
Factors Affecting the Reliability of a Measure

1. The number of items in a test – The more items a test has, the higher its reliability is likely to be, because a large pool of items raises the probability of obtaining consistent scores.

2. Individual differences among participants – Every participant possesses characteristics that affect test performance, such as fatigue, concentration, innate ability, perseverance, and motivation. These individual factors change over time and affect the consistency of answers on a test.

3. External environment – The external environment includes room temperature, noise level, depth of instruction, exposure to materials, and quality of instruction, all of which can change examinees' responses on a test.
What are the different ways to establish test reliability?

There are different ways of determining the reliability of a test. The specific kind of reliability depends on (1) the variable you are measuring, (2) the type of test, and (3) the number of versions the test has.

The different types of reliability, and how each is established, are indicated below. Notice that statistical analysis is needed to determine test reliability.
Different Ways to Establish Test Reliability
Basis of Statistical Analysis to Determine Reliability

1. Linear regression

Linear regression is demonstrated when two measured variables are compared, such as two sets of scores on a test taken at two different times by the same participants. When the two sets of scores are plotted on a graph (with an X and a Y axis), they tend to form a straight line. The straight line formed by the two sets of scores produces a linear regression. When a straight line is formed, we can say that there is a correlation between the two sets of scores.
Example: The graph below is called a scatterplot. Each point in a scatterplot is a respondent with two scores (one for each test).

[Scatterplot: Score 1 on the x-axis (2–18) plotted against Score 2 on the y-axis (6–22), with the fitted regression line Score 2 = 4.8493 + 1.0403·x]
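The fitted line shown in the scatterplot can be reproduced with a short script. Below is a minimal sketch using numpy; the scores used are the ten pairs from the spelling-test example that follows.

```python
import numpy as np

# Ten respondents' scores on the two administrations of the test
# (the Monday/Tuesday spelling scores used later in this lesson)
score1 = np.array([10, 9, 6, 10, 12, 4, 5, 7, 16, 8])      # X axis
score2 = np.array([20, 15, 12, 18, 19, 8, 7, 10, 17, 13])  # Y axis

# Fit a straight line (degree-1 polynomial) through the points
slope, intercept = np.polyfit(score1, score2, 1)
print(f"Score 2 = {intercept:.4f} + {slope:.4f} * Score 1")
# -> Score 2 = 4.8493 + 1.0403 * Score 1, the line in the figure
```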

2. Computation of Pearson r correlation

The index of the linear relationship between two variables is called a correlation coefficient. The closer the points in a scatterplot fall to the regression line, the stronger the correlation. When the trend of the scatterplot is directly proportional, the correlation coefficient has a positive value; when the trend is inverse, the correlation coefficient has a negative value. The statistical analysis used to determine the correlation coefficient is called the Pearson r. The following illustrates how the Pearson r is obtained.

Suppose a teacher gave a 20-item spelling test of two-syllable words on Monday and again on Tuesday. The teacher wanted to determine the reliability of the two sets of scores by computing the Pearson r.
Formula:

    r = \frac{N\sum XY - (\sum X)(\sum Y)}{\sqrt{[N\sum X^2 - (\sum X)^2][N\sum Y^2 - (\sum Y)^2]}}
Monday Test   Tuesday Test
     X             Y           X²          Y²          XY
    10            20          100         400         200
     9            15           81         225         135
     6            12           36         144          72
    10            18          100         324         180
    12            19          144         361         228
     4             8           16          64          32
     5             7           25          49          35
     7            10           49         100          70
    16            17          256         289         272
     8            13           64         169         104
 ΣX = 87      ΣY = 139    ΣX² = 871   ΣY² = 2125   ΣXY = 1328


ΣX – Add all the X scores (Monday scores)
ΣY – Add all the Y scores (Tuesday scores)
X² – Square each X score (Monday scores)
Y² – Square each Y score (Tuesday scores)
XY – Multiply each X score by its paired Y score
ΣX² – Add all the squared values of X
ΣY² – Add all the squared values of Y
ΣXY – Add all the products of X and Y
Substitute the values in the formula:

    r = \frac{10(1328) - (87)(139)}{\sqrt{[10(871) - (87)^2][10(2125) - (139)^2]}}

    r = 0.80

The value of a correlation coefficient does not exceed 1.00 or −1.00. A value of 1.00 or −1.00 indicates a perfect correlation.
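The same computation can be verified with a short script. Below is a minimal Python sketch that applies the Pearson r formula above to the Monday and Tuesday spelling scores.

```python
import math

x = [10, 9, 6, 10, 12, 4, 5, 7, 16, 8]      # Monday scores (X)
y = [20, 15, 12, 18, 19, 8, 7, 10, 17, 13]  # Tuesday scores (Y)
n = len(x)

sum_x, sum_y = sum(x), sum(y)              # 87, 139
sum_x2 = sum(v * v for v in x)             # 871
sum_y2 = sum(v * v for v in y)             # 2125
sum_xy = sum(a * b for a, b in zip(x, y))  # 1328

# Pearson r formula from this lesson
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
print(round(r, 2))  # 0.8
```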

3. Difference between a positive and a negative correlation

When the value of the correlation coefficient is positive, higher scores on X go with higher scores on Y. This is called a positive correlation. In the case of the two spelling scores, a positive correlation is obtained. When the value of the correlation coefficient is negative, higher scores on X go with lower scores on Y, or vice versa. This is called a negative correlation. When the same test is administered to the same group of participants, a positive correlation usually indicates reliability or consistency of the scores.

4. Determining the strength of a correlation

The strength of the correlation is determined by the value of the correlation coefficient. The closer the value is to 1.00 or −1.00, the stronger the correlation. Below is a guide:

0.80 – 1.00   Very high relationship
0.60 – 0.79   High relationship
0.40 – 0.59   Substantial/marked relationship
0.20 – 0.39   Low relationship
0.00 – 0.19   Negligible relationship
5. Determining the significance of the correlation

An obtained correlation between two variables may be due to chance. To determine whether the correlation is not due to chance, it is tested for significance. When a correlation is significant, the probability that the relationship between the two variables arose by chance is small.

To determine whether a computed correlation coefficient is significant, it is compared with the appropriate entry in a table of critical values for the correlation coefficient. When the computed value is greater than the critical value, we can be at least 95% confident that the two variables are correlated; the correlation is significant.
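In practice, the significance test is usually done with statistical software rather than a critical-value table. A minimal sketch, assuming SciPy is available, applied to the spelling scores:

```python
from scipy import stats

x = [10, 9, 6, 10, 12, 4, 5, 7, 16, 8]      # Monday scores
y = [20, 15, 12, 18, 19, 8, 7, 10, 17, 13]  # Tuesday scores

# pearsonr returns the correlation coefficient and its p-value
r, p = stats.pearsonr(x, y)
print(f"r = {r:.2f}, p = {p:.4f}")

# A p-value below .05 means we can be at least 95% confident
# that the correlation is not due to chance
if p < 0.05:
    print("The correlation is significant.")
```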
Another statistical analysis used to determine the internal consistency of a test is Cronbach's alpha. Follow the procedure below to determine internal consistency.

Suppose that five students answered a checklist about their hygiene on a scale of 1 to 5, where the scores correspond to:

5 – always, 4 – often, 3 – sometimes, 2 – rarely, 1 – never

The checklist has five items. The teacher wanted to determine whether the items have internal consistency.
Student    Item 1   Item 2   Item 3   Item 4   Item 5   Total (X)   Score − Mean   (Score − Mean)²
A            5        5        4        4        1         19            2.8             7.84
B            3        4        3        3        2         15           −1.2             1.44
C            2        5        3        3        3         16           −0.2             0.04
D            1        4        2        3        3         13           −3.2            10.24
E            3        3        4        4        4         18            1.8             3.24
                                              Mean = 16.2               Σ(Score − Mean)² = 22.8

Total for each item (ΣX):    14       21       16       17       13
ΣX²:                         48       91       54       59       39
Item variance (σ²ᵢ):        2.2      0.7      0.7      0.3      1.3      Σσ²ᵢ = 5.2

Each item variance is the sample variance of the five students' ratings on that item. The variance of the total scores is Σ(Score − Mean)² divided by the number of students minus one: σ²t = 22.8 / 4 = 5.7.

The values are substituted in the Cronbach's alpha formula, where n is the number of items, σ²t is the variance of the total scores, and Σσ²ᵢ is the sum of the item variances:

    \alpha = \left(\frac{n}{n-1}\right)\left(\frac{\sigma_t^2 - \sum \sigma_i^2}{\sigma_t^2}\right) = \left(\frac{5}{5-1}\right)\left(\frac{5.7 - 5.2}{5.7}\right)

    Cronbach's α = .11

The internal consistency of the responses on the hygiene checklist is .11, indicating low internal consistency.
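The alpha computation can also be scripted. A minimal Python sketch for the hygiene checklist data, using the sample variance as in the table above:

```python
# Rows are the five students (A-E); columns are the five checklist items
scores = [
    [5, 5, 4, 4, 1],  # A
    [3, 4, 3, 3, 2],  # B
    [2, 5, 3, 3, 3],  # C
    [1, 4, 2, 3, 3],  # D
    [3, 3, 4, 4, 4],  # E
]

def sample_variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

n_items = len(scores[0])
items = list(zip(*scores))             # transpose: one tuple per item
totals = [sum(row) for row in scores]  # each student's total score

sum_item_var = sum(sample_variance(item) for item in items)  # 5.2
total_var = sample_variance(totals)                          # 5.7

alpha = (n_items / (n_items - 1)) * (total_var - sum_item_var) / total_var
print(round(alpha, 2))  # 0.11
```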
The consistency of ratings can be obtained using a coefficient of concordance. Kendall's W coefficient of concordance is used to test the agreement among raters.

Suppose a performance task was demonstrated by five students and scored by three raters. The rubric used a scale of 1 to 4, where 4 is the highest and 1 is the lowest.

Five demonstrations   Rater 1   Rater 2   Rater 3   Sum of Ratings      D        D²
A                        4         4         3           11            2.6      6.76
B                        3         2         3            8           −0.4      0.16
C                        3         4         4           11            2.6      6.76
D                        3         3         2            8           −0.4      0.16
E                        1         1         2            4           −4.4     19.36
                                                    Mean = 8.4              ΣD² = 33.2
The ratings given by the three raters are first summed for each demonstration. The mean of the sums of ratings is then obtained (mean = 8.4). The mean is subtracted from each sum of ratings to get D. Each difference is squared (D²), and the sum of the squares is computed (ΣD² = 33.2). The mean and the sum of squared differences are substituted in the Kendall's W formula, where m is the number of raters and N is the number of demonstrations.
    W = \frac{12\sum D^2}{m^2 N(N^2 - 1)}

    W = \frac{12(33.2)}{3^2 (5)(5^2 - 1)}

    W = 0.37

A Kendall's W of 0.37 estimates the agreement of the three raters across the five demonstrations. There is moderate concordance among the three raters because the value is far from 1.00.
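A minimal Python sketch of the Kendall's W computation for the ratings above:

```python
# Rows are the five demonstrations (A-E); columns are the three raters
ratings = [
    [4, 4, 3],  # A
    [3, 2, 3],  # B
    [3, 4, 4],  # C
    [3, 3, 2],  # D
    [1, 1, 2],  # E
]

m = len(ratings[0])  # number of raters: 3
n = len(ratings)     # number of demonstrations: 5

sums = [sum(row) for row in ratings]         # sum of ratings per demonstration
mean = sum(sums) / n                         # 8.4
sum_d2 = sum((s - mean) ** 2 for s in sums)  # 33.2

w = 12 * sum_d2 / (m ** 2 * n * (n ** 2 - 1))
print(round(w, 2))  # 0.37
```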
What is test validity?

A measure is valid when it measures what it is supposed to measure. If a quarterly exam is valid, its contents should directly measure the objectives of the curriculum. If a scale that measures personality is composed of five factors, each of the five factor scores should come from items that are highly correlated. If an entrance exam is valid, it should predict students' grades after the first semester.
Different Ways to Establish Test Validity

Content validity
Definition: The items represent the domain being measured.
Procedure: The items are compared with the objectives of the program. The items need to directly measure the objectives (for achievement tests) or the definition (for scales). A reviewer conducts the checking.

Face validity
Definition: The test is presented well, free of errors, and administered well.
Procedure: The test items and their layout are reviewed and tried out on a small group of respondents. A manual for administration can be made as a guide for the test administrator.

Predictive validity
Definition: The measure predicts a future criterion. An example is an entrance exam predicting students' grades after the first semester.
Procedure: A correlation is computed in which the X variable is used as the predictor and the Y variable as the criterion.

Concurrent validity
Definition: Two or more measures taken at the same time describe the same characteristic.
Procedure: The scores on the measures are correlated.

Construct validity
Definition: The components or factors of the test contain items that are strongly correlated.
Procedure: The Pearson r can be used to correlate the items within each factor. More commonly, a technique called factor analysis is used to determine which items are highly correlated enough to form a factor.

Convergent validity
Definition: Components or factors of a test that are hypothesized to correlate positively with each other do so.
Procedure: The factors of the test are correlated.

Divergent validity
Definition: Components or factors of a test that are hypothesized to correlate negatively with each other do so. An example is items on intrinsic and extrinsic motivation.
Procedure: The factors of the test are correlated.
Cases to Illustrate the Types of Validity

1. Content Validity

A coordinator in science is checking the science test paper for grade 4. She asked the
grade 4 science teacher to submit the table of specifications containing the objectives
of the lesson and the corresponding items. The coordinator checked whether each
item is aligned with the objectives.

• How are the objectives used when creating test items?


• How is content validity determined when given the objectives and the items in a
test?
• What should be present in a test table of specifications when determining content
validity?
• Who checks the content validity of items?
2. Face Validity

The assistant principal browsed the test paper made by the math teacher. She checked whether the contents of the items are about mathematics and whether the instructions are clear. She browsed through the items to see whether the grammar is correct and whether the vocabulary is within the students' level of understanding.

• What can be done in order to ensure that the assessment appears to be effective?
• What practices are done in conducting face validity?
• Why is face validity the weakest form of validity?
3. Concurrent Validity

A school guidance counselor administered a math achievement test to the grade 6 students. She also had a copy of the students' grades in math. She wanted to verify whether the students' math grades measure the same competencies as the math achievement test, so she correlated the math achievement scores with the math grades.

• What needs to be available when conducting concurrent validity?


• At least how many tests need to be present for conducting concurrent validity?
• What statistical analysis can be used to establish concurrent validity?
• How are the results of a correlation coefficient interpreted for concurrent validity?
4. Predictive Validity

The school admissions office developed an entrance examination. The officials wanted to determine whether the results of the entrance examination are accurate in identifying students who will perform well. They took the first-quarter grades of the accepted students and correlated the entrance exam results with those grades. They found a significant and positive correlation between the entrance examination scores and the grades: the entrance examination results predicted the grades of students after the first quarter. There is predictive validity.

• Why are two measures needed in predictive validity?


• What is the assumed connection between these two measures?
• How can we determine if a measure has predictive validity?
• What statistical analysis is done to determine predictive validity?
• How are the results of predictive validity interpreted?
5. Construct Validity

A grade 10 teacher made a science test composed of four domains: matter, living things, force and motion, and earth and space. There are 10 items under each domain. The teacher wanted to determine whether the 10 items written under each domain really belong to that domain. The teacher consulted an expert in measurement, and together they conducted a procedure called factor analysis. Factor analysis is a statistical procedure for determining whether the items written for a domain load on that domain.
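As an illustration only, the sketch below shows how a factor analysis might be run in Python with scikit-learn's FactorAnalysis. The response matrix here is randomly generated placeholder data, so its loadings are meaningless; the teacher's actual item responses would replace it.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Placeholder item-response matrix: 200 hypothetical students (rows)
# answering 40 items (columns), 10 written for each of the four domains
rng = np.random.default_rng(0)
responses = rng.integers(0, 2, size=(200, 40)).astype(float)

# Extract four factors, one hoped-for factor per domain
fa = FactorAnalysis(n_components=4, random_state=0)
fa.fit(responses)

# Loadings show how strongly each item relates to each factor; items
# written for the same domain should load highly on the same factor
loadings = fa.components_.T  # shape: (40 items, 4 factors)
print(loadings.round(2))
```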

• For what types of tests can construct validity be used?


• What should the test have in order to verify its constructs?
• What are constructs and factors in a test?
• How are these factors verified if they are appropriate for the test?
• What results come out in construct validity?
• How are the results in construct validity interpreted?
The construct validity of a measure is reported in journal articles. The following guide questions can be used when searching reports for the construct validity of a measure:

• What was the purpose for doing construct validity?


• What type of test was used?
• What are the dimensions or factors that were studied using construct
validity?
• What procedure was used to establish the construct validity?
• What statistics was used for the construct validity?
• What were the results of the test’s construct validity?
6. Convergent Validity

A math teacher developed a math test, to be administered at the end of the school year, that measures number sense, patterns and algebra, measurement, geometry, and statistics. The teacher assumed that students' competencies in number sense help them learn patterns and algebra and the other areas better. After administering the test, the scores were separated by area, and the five domains were inter-correlated using the Pearson r. The positive correlation between number sense and patterns and algebra indicates that when number sense scores increase, patterns and algebra scores also increase. This shows that students' learning in number sense scaffolds their patterns and algebra competencies.

• What should a test have to conduct convergent validity?


• What is done with the domains of a test in convergent validity?
• What analysis is used to determine convergent validity?
• How are the results in convergent validity interpreted?
7. Divergent Validity

An English teacher taught a metacognitive awareness strategy for comprehending a paragraph to her grade 11 students. She wanted to determine whether her students' performance in reading comprehension would be reflected in a reading comprehension test. She administered the same reading comprehension test to another class that was not taught the metacognitive awareness strategy. She compared the results using a t-test for independent samples and found that the class taught the metacognitive awareness strategy performed significantly higher than the other group. The test has divergent validity.
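A minimal sketch of the t-test for independent samples, assuming SciPy; the scores below are hypothetical, invented only to illustrate the call:

```python
from scipy import stats

# Hypothetical reading comprehension scores for the two classes
taught_strategy = [25, 27, 22, 28, 24, 26, 23, 27, 25, 29]
not_taught = [18, 20, 17, 22, 19, 16, 21, 18, 20, 17]

t_stat, p = stats.ttest_ind(taught_strategy, not_taught)
print(f"t = {t_stat:.2f}, p = {p:.4f}")
# A p-value below .05, with a higher mean for the taught class,
# indicates it performed significantly higher than the other group
```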

• What conditions need to be present to conduct divergent validity?


• What assumption is being proved in divergent validity?
• What statistical analysis can be used to establish divergent validity?
• How are the results of divergent validity interpreted?
How to determine if an item is easy or difficult?

An item is difficult if the majority of students are not able to give the correct answer; it is easy if the majority of students answer it correctly.

An item can discriminate if examinees who scored high on the whole test answer it correctly more often than examinees who scored low.

Below is a data set for 5 items on addition and subtraction of integers. Follow the procedure to determine the difficulty and discrimination of each item.

             Item 1   Item 2   Item 3   Item 4   Item 5
Student 1      0        0        1        1        1
Student 2      1        1        1        0        1
Student 3      0        0        0        1        1
Student 4      0        0        0        0        1
Student 5      0        1        1        1        1
Student 6      1        0        1        1        0
Student 7      0        0        1        1        0
Student 8      0        1        1        0        0
Student 9      1        0        1        1        1
Student 10     1        0        1        1        0
1. Get the total score of each student and arrange the students from highest to lowest.

             Item 1   Item 2   Item 3   Item 4   Item 5   Total score
Student 2      1        1        1        0        1           4
Student 5      0        1        1        1        1           4
Student 9      1        0        1        1        1           4
Student 1      0        0        1        1        1           3
Student 6      1        0        1        1        0           3
Student 10     1        0        1        1        0           3
Student 3      0        0        0        1        1           2
Student 7      0        0        1        1        0           2
Student 8      0        1        1        0        0           2
Student 4      0        0        0        0        1           1
2. Obtain the upper and lower 27% of the group. Multiplying 0.27 by the total number of students gives 2.7, which rounds to 3. Take the top 3 and the bottom 3 students based on their total scores. The top 3 students are Students 2, 5, and 9; the bottom 3 students are Students 7, 8, and 4. The rest of the students are not included in the item analysis.
3. Obtain the proportion correct for each item, computed separately for the upper 27% group and the lower 27% group. This is done by summing the correct answers per item and dividing by the number of students in the group.

                        Item 1   Item 2   Item 3   Item 4   Item 5   Total score
Student 2                  1        1        1        0        1          4
Student 5                  0        1        1        1        1          4
Student 9                  1        0        1        1        1          4
Total                      2        2        3        2        3
Proportion of the
high group (pH)          0.67     0.67     1.00     0.67     1.00

Student 7                  0        0        1        1        0          2
Student 8                  0        1        1        0        0          2
Student 4                  0        0        0        0        1          1
Total                      0        1        2        1        1
Proportion of the
low group (pL)           0.00     0.33     0.67     0.33     0.33
4. The item difficulty is obtained using the following formula:

    Item difficulty = \frac{p_H + p_L}{2}

The difficulty is interpreted using the table:

Difficulty Index   Remark
.76 or higher      Easy Item
.25 to .75         Average Item
.24 or lower       Difficult Item
Computation

                      Item 1    Item 2    Item 3    Item 4    Item 5
Index of difficulty    0.33      0.50      0.83      0.50      0.67
Item difficulty        Average   Average   Easy      Average   Average
5. The index of discrimination is obtained using the formula:

    Item discrimination = p_H − p_L

The value is interpreted using the table:

Index of discrimination   Remark
.40 and above             Very good item
.30 – .39                 Good item
.20 – .29                 Reasonably good item
.10 – .19                 Marginal item
Below .10                 Poor item
                   Item 1         Item 2         Item 3         Item 4         Item 5
                 = 0.67 − 0.00  = 0.67 − 0.33  = 1.00 − 0.67  = 0.67 − 0.33  = 1.00 − 0.33
Index of
discrimination     0.67           0.33           0.33           0.33           0.67
Discrimination     Very good      Good item      Good item      Good item      Very good
                   item                                                        item
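The whole item-analysis procedure can be scripted. Below is a minimal Python sketch that reproduces the difficulty and discrimination indices for the five items; the tie-breaking among equal total scores follows the lesson's grouping (Students 2, 5, and 9 on top; 7, 8, and 4 at the bottom).

```python
# 1 = correct, 0 = wrong; one row of five items per student
answers = {
    1: [0, 0, 1, 1, 1], 2: [1, 1, 1, 0, 1], 3: [0, 0, 0, 1, 1],
    4: [0, 0, 0, 0, 1], 5: [0, 1, 1, 1, 1], 6: [1, 0, 1, 1, 0],
    7: [0, 0, 1, 1, 0], 8: [0, 1, 1, 0, 0], 9: [1, 0, 1, 1, 1],
    10: [1, 0, 1, 1, 0],
}

# Rank students by total score and take the upper and lower 27%
ranked = sorted(answers, key=lambda s: sum(answers[s]), reverse=True)
k = round(0.27 * len(ranked))           # 2.7 rounds to 3 students per group
upper, lower = ranked[:k], ranked[-k:]  # [2, 5, 9] and [7, 8, 4]

for item in range(5):
    p_high = sum(answers[s][item] for s in upper) / k
    p_low = sum(answers[s][item] for s in lower) / k
    difficulty = (p_high + p_low) / 2
    discrimination = p_high - p_low
    print(f"Item {item + 1}: difficulty = {difficulty:.2f}, "
          f"discrimination = {discrimination:.2f}")
```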
Thank You!!!
☺☺☺
