
Lesson 6 Item Analysis and Validation

Course Outcomes:

- distinguish the uses of item analysis, validity, reliability, difficulty index, and discrimination index
(PO-Cc)
- determine the validity and reliability of given test items (PO-Dd) (PO-Ec)

LESSON 6.2: Validation, Validity, and Reliability

Intended Learning Outcomes (ILOs)

The students will be able to…

- distinguish the concepts and uses of validity and reliability (PO-Cc)


- determine the validity and reliability of the given test items (CLO-S.2)

INTRODUCTION

This lesson covers the purpose of validation, which is to determine the characteristics of the
whole test itself, namely its validity and reliability (Navarro et al., 2019).

Preliminary Questions:

Share your observations about the pictures.


I. CAPTIVATE
1) Share your ideas about the picture:

- When and for what are the instruments used?


- Are the instruments important in our daily
activities?
- How important are the instruments to your
life? Explain.

II. CONNECT

After performing the item analysis and revising the items which need revision, the next step
is to validate the instrument. The purpose of validation is to determine the characteristics of the
whole test itself, namely validity and reliability of the test.
What is validation?

Validation is the process of collecting and analyzing evidence to support the
meaningfulness of the test (Navarro et al., 2019).

What is validity?

Validity refers to how well a test measures what it is purported to measure. It refers also to
the appropriateness, correctness, meaningfulness and usefulness of the specific decisions a
teacher makes based on the test results.
Why is it necessary?

While reliability is necessary, it alone is not sufficient: a test can be reliable without being
valid. For example, if your scale is off by 5 lbs, it reads your weight every day with an excess
of 5 lbs. The scale is reliable because it consistently reports the same weight every day, but it
is not valid because it adds 5 lbs to your true weight; it is not a valid measure of your weight.

A teacher who conducts test validation might want to gather different kinds of evidence.
There are essentially three main types of evidence that may be collected:
1. Construct Validity is used to ensure that the measure actually measures what it is
intended to measure (i.e. the construct), and not other variables. Using a panel of “experts” familiar
with the construct is a way in which this type of validity can be assessed. The experts can examine
the items and decide what that specific item is intended to measure. Students can be involved in
this process to obtain their feedback.
Example: A women’s studies program may design a cumulative assessment of learning
throughout the major. If the questions are written with complicated wording and phrasing, the
test can inadvertently become a test of reading comprehension, rather than a test of
women’s studies. It is important that the measure is actually assessing the intended construct,
rather than an extraneous factor.
2. Criterion-Related Validity is used to predict future or current performance - it correlates
test results with another criterion of interest.
Example: If a physics program designed a measure to assess cumulative student learning
throughout the major, the new measure could be correlated with a standardized measure of
ability in this discipline, such as an ETS field test or the GRE subject test. The higher the
correlation between the established measure and new measure, the more faith stakeholders can
have in the new assessment tool.
3. Sampling Validity (similar to content validity) ensures that the measure covers the broad range
of areas within the concept under study. Not everything can be covered, so items need to be
sampled from all of the domains. This may need to be completed using a panel of “experts” to
ensure that the content area is adequately sampled. Additionally, a panel can help limit “expert”
bias (i.e. a test reflecting what an individual personally feels are the most important or relevant
areas).
Example: When designing an assessment of learning in the theatre department, it would not
be sufficient to only cover issues related to acting. Other areas of theatre such as lighting, sound,
functions of stage managers should all be included. The assessment should reflect the content
area in its entirety.

What are some ways to improve validity?


1. Make sure your goals and objectives are clearly defined and operationalized. Expectations
of students should be written down.
2. Match your assessment measure to your goals and objectives. Additionally, have the test
reviewed by faculty at other schools to obtain feedback from an outside party who is less
invested in the instrument.
3. Get students involved; have the students look over the assessment for troublesome
wording, or other difficulties.
4. If possible, compare your measure with other measures, or data that may be available.

What is reliability?
Reliability is the degree to which an assessment tool produces stable and consistent
results or the consistency of the scores obtained.

Types of Reliability
1. Test-retest reliability is a measure of reliability obtained by administering the same test
twice over a period of time to a group of individuals. The scores from Time 1 and Time 2 can then
be correlated in order to evaluate the test for stability over time.
Example: A test designed to assess student learning in psychology could be given to a group
of students twice, with the second administration perhaps coming a week after the first. The
obtained correlation coefficient would indicate the stability of the scores.
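
As a quick illustration (a minimal sketch, not part of the source text), the two sets of scores
can be correlated in a few lines of Python; the scores below are hypothetical:

    import numpy as np

    # Hypothetical scores from two administrations of the same test
    time1 = [90, 43, 84, 86, 55, 77, 84, 91]
    time2 = [88, 40, 80, 85, 50, 75, 82, 90]

    # Pearson correlation between the Time 1 and Time 2 scores
    r = np.corrcoef(time1, time2)[0, 1]
    print(f"test-retest coefficient r = {r:.2f}")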

2. Parallel forms reliability is a measure of reliability obtained by administering different
versions of an assessment tool (both versions must contain items that probe the same construct,
skill, knowledge base, etc.) to the same group of individuals. The scores from the two versions
can then be correlated in order to evaluate the consistency of results across alternate versions.
Example: If you wanted to evaluate the reliability of a critical thinking assessment, you might
create a large set of items that all pertain to critical thinking and then randomly split the questions
up into two sets, which would represent the parallel forms.
3. Inter-rater reliability is a measure of reliability used to assess the degree to which different
judges or raters agree in their assessment decisions. Inter-rater reliability is useful because
human observers will not necessarily interpret answers the same way; raters may disagree as to
how well certain responses or material demonstrate knowledge of the construct or skill being
assessed.
Example: Inter-rater reliability might be employed when different judges are evaluating the
degree to which art portfolios meet certain standards. Inter-rater reliability is especially useful
when judgments can be considered relatively subjective. Thus, the use of this type of reliability
would probably be more likely when evaluating artwork as opposed to math problems.
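
One simple index of rater agreement is the proportion of cases on which two judges give
exactly the same rating. The sketch below (hypothetical ratings, not from the source text)
computes this in Python; more formal agreement statistics build on the same idea:

    # Hypothetical ratings given by two judges to six art portfolios
    rater_a = [4, 3, 5, 2, 4, 3]
    rater_b = [4, 3, 4, 2, 5, 3]

    # Proportion of portfolios on which the two judges agree exactly
    agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
    print(f"exact agreement = {agreement:.0%}")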

4. Internal consistency reliability is a measure of reliability used to evaluate the degree to
which different test items that probe the same construct produce similar results.

A) Average inter-item correlation is a subtype of internal consistency reliability. It is
obtained by taking all of the items on a test that probe the same construct (e.g., reading
comprehension), determining the correlation coefficient for each pair of items, and finally
taking the average of all of these correlation coefficients. This final step yields the average
inter-item correlation.
B) Split-half reliability is another subtype of internal consistency reliability. The process of
obtaining split-half reliability is begun by “splitting in half” all items of a test that are intended
to probe the same area of knowledge (e.g., World War II) in order to form two “sets” of
items. The entire test is administered to a group of individuals, the total score for each “set”
is computed, and finally the split-half reliability is obtained by determining the correlation
between the two total “set” scores (https://chfasoa.uni.edu/reliabilityandvalidity.htm).
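
To make both subtypes concrete, the following minimal Python sketch (hypothetical 0/1 item
data, not from the source text) computes the average inter-item correlation and an odd-even
split-half reliability, correcting the latter to full test length with the Spearman-Brown
formula that appears in the unit test below:

    import numpy as np

    # Hypothetical responses: rows = students, columns = items (1 = correct)
    X = np.array([
        [1, 1, 1, 0, 1, 0],
        [1, 1, 0, 1, 1, 1],
        [0, 1, 1, 0, 0, 0],
        [1, 0, 1, 1, 1, 1],
        [0, 0, 0, 1, 0, 0],
        [1, 1, 1, 1, 1, 0],
    ])

    # A) Average inter-item correlation: mean of all pairwise item correlations
    item_corr = np.corrcoef(X, rowvar=False)
    k = item_corr.shape[0]
    avg_inter_item = item_corr[np.triu_indices(k, 1)].mean()

    # B) Split-half: correlate odd-item totals with even-item totals, then
    #    step up to full test length with the Spearman-Brown formula
    odd, even = X[:, 0::2].sum(axis=1), X[:, 1::2].sum(axis=1)
    r_half = np.corrcoef(odd, even)[0, 1]
    r_full = 2 * r_half / (1 + r_half)

    print(f"average inter-item r = {avg_inter_item:.2f}")
    print(f"split-half reliability (corrected) = {r_full:.2f}")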

The following table is a standard followed almost universally in educational test and
measurement.
Reliability                        Interpretation
0.90 and above                     Excellent reliability; at the level of the best
(very high reliability)            standardized tests.
0.80 – 0.90                        Very good for a classroom test.
(high reliability)
0.70 – 0.80                        Good for a classroom test; in the range of most. There
(average/moderate reliability)     are probably a few items which could be improved.
0.60 – 0.70                        Somewhat low. The test needs to be supplemented by
(low reliability)                  other measures (e.g., more tests) to determine grades.
                                   There are probably some items which could be improved.
0.50 – 0.60                        Suggests need for revision of the test, unless it is
(very low reliability)             quite short (ten or fewer items). The test definitely
                                   needs to be supplemented by other measures (e.g., more
                                   tests) for grading.
0.50 or below                      Questionable reliability. This test should not
                                   contribute heavily to the course grade, and it needs
                                   revision.

Navarro, et.al., 2019


To determine the reliability of the test:
1) Test-retest method. The same measuring instrument is administered twice to the same
group of students and the correlation coefficient is determined.
A Spearman rank correlation coefficient or Spearman rho is a statistical tool used to
measure the relationship between paired ranks assigned to individual scores on two
variables, X and Y. Thus, this is used to correlate the scores in a test-retest method.
To obtain the value of Spearman rho (rs), use the formula:

rs = 1 - (6∑D²) / (N³ - N)

where rs = Spearman rho
∑D² = sum of the squared differences between ranks
N = total number of cases or students

To apply the formula, the steps are as follows:


Step 1. Rank the scores of respondents/students from highest to lowest in the first
set of administration (X) and mark this rank as Rx. The highest score receives a
rank of 1; the second highest score, 2; the third highest score, 3; and so on.
Step 2. Rank the second set of scores (Y) in the same manner as in Step 1 and mark as
Ry.
Step 3. Determine the difference in ranks for every pair of ranks.
Step 4. Square each difference to get D².
Step 5. Sum the squared differences to find ∑D².
Step 6. Compute Spearman rho (rs ) by applying the formula.

Example. Spearman rho Computation of the First and Second Administration of Achievement
Test in English (Artificial Data)

                         Ranks           Difference
Students    X     Y     Rx     Ry       D       D²
1           90    70    2.0    7.5     -5.5    30.25
2           43    31    13.0   12.5     0.5     0.25
3           84    79    6.5    3.0      3.5    12.25
4           86    70    4.5    7.5     -3.0     9.00
5           55    43    11.0   10.5     0.5     0.25
6           77    70    8.5    7.5      1.0     1.00
7           84    75    6.5    4.5      2.0     4.00
8           91    88    1.0    1.0      0.0     0.00
9           40    31    14.0   12.5     1.5     2.25
10          75    70    10.0   7.5      2.5     6.25
11          86    80    4.5    2.0      2.5     6.25
12          89    75    3.0    4.5     -1.5     2.25
13          48    30    12.0   14.0    -2.0     4.00
14          77    43    8.5    10.5    -2.0     4.00
Total                                  ∑D² = 82.00
Calmorin, 2004

Solution:

rs = 1 - (6∑D²) / (N³ - N)

   = 1 - 6(82) / (14³ - 14)

   = 1 - 492 / (2744 - 14)

   = 1 - 492 / 2730

   = 1 - 0.1802

rs = 0.82 (high reliability/relationship)
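
The computation can be checked with software. Here is a minimal Python sketch (assuming
scipy is installed) that applies the same formula to the fourteen score pairs in the table
above; scipy's rankdata reproduces the averaged ranks used for tied scores:

    from scipy.stats import rankdata

    X = [90, 43, 84, 86, 55, 77, 84, 91, 40, 75, 86, 89, 48, 77]
    Y = [70, 31, 79, 70, 43, 70, 75, 88, 31, 70, 80, 75, 30, 43]

    # rankdata assigns rank 1 to the smallest value, so negate the scores
    # to give the highest score rank 1; ties share the average rank
    rx = rankdata([-x for x in X])
    ry = rankdata([-y for y in Y])

    n = len(X)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))  # ∑D² = 82.0
    rs = 1 - (6 * sum_d2) / (n**3 - n)                  # 1 - 492/2730
    print(f"∑D² = {sum_d2}, rs = {rs:.2f}")             # rs = 0.82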

Further readings.
https://study.com/academy/lesson/test-retest-reliability-coefficient-examples-lesson-quiz.html

2) Internal-consistency method. Internal consistency is determined using the
Kuder-Richardson Formula 20 (KR-20).
The Kuder and Richardson Formula 20 test checks the internal consistency of measurements
with dichotomous choices. It is equivalent to performing the split half methodology on all
combinations of questions and is applicable when each question is either right or wrong. A correct
question scores 1 and an incorrect question scores 0. The test statistic is

ρKR20 = [k / (k - 1)] × [1 - (∑ pj qj) / σ²]

where

k = number of questions
pj = proportion of people in the sample who answered question j correctly
qj = proportion of people in the sample who did not answer question j correctly (qj = 1 - pj)
σ² = variance of the total scores of all the people taking the test = VARP(R1), where R1 = the
array containing the total scores of all the people taking the test.
Values range from 0 to 1. A high value indicates reliability, while too high a value (in excess of .90)
indicates a homogeneous test.

Example 1: A questionnaire with 11 questions is administered to 12 students. The results are
listed in the upper portion of Figure 1. Determine the reliability of the questionnaire using the
Kuder and Richardson Formula 20.
Figure 1 – Kuder and Richardson Formula 20 for Example 1

The values of p in row 18 are the percentage of students who answered that question correctly –
e.g. the formula in cell B18 is =B16/COUNT(B4:B15). The values of q in row 19 are the percentage
of students who answered that question incorrectly – e.g. the formula in cell B19 is =1-B18. The
values of pq are simply the product of the p and q values, with the sum given in cell M20.
We can calculate ρKR20 as described in Figure 2.

Figure 2 – Key formulas for worksheet in Figure 1

The value ρKR20 = 0.738 shows that the test has high reliability.
http://www.real-statistics.com/reliability/internal-consistency
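
The same statistic is easy to compute outside a spreadsheet. Below is a minimal Python sketch
of KR-20 following the definitions above; the 0/1 response matrix is hypothetical (it is not
the Figure 1 data, so it will not reproduce the 0.738 value):

    import numpy as np

    def kr20(responses):
        # responses: 0/1 matrix with rows = examinees, columns = items
        X = np.asarray(responses, dtype=float)
        k = X.shape[1]                    # number of questions
        p = X.mean(axis=0)                # proportion correct on each item
        q = 1 - p                         # proportion incorrect on each item
        var_total = X.sum(axis=1).var()   # population variance of totals, like VARP
        return (k / (k - 1)) * (1 - (p * q).sum() / var_total)

    # Hypothetical data: 6 examinees answering 5 dichotomous items
    data = [
        [1, 1, 1, 1, 0],
        [1, 1, 1, 0, 0],
        [1, 1, 0, 0, 0],
        [1, 0, 1, 1, 1],
        [0, 1, 1, 0, 0],
        [1, 1, 1, 1, 1],
    ]
    print(f"KR-20 = {kr20(data):.3f}")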

Further readings.
http://korbedpsych.com/LinkedFiles/CalculatingReliability.pdf
III. COLLABORATE
Practice exercise. Test-retest Method
Solve for the reliability applying the Spearman rank correlation coefficient or Spearman rho.

                         Ranks           Difference
Students    X     Y     Rx     Ry       D       D²
1 85 70
2 43 38
3 55 43
4 77 70
5 84 75
6 88 88
7 45 40
8 75 70
9 86 80
10 89 75
Total                                  ∑D² =

IV. CREATE

Unit Test on Reliability


Ed 216 – Assessment in Learning 1

Name : ________________________________________ Date : _______________


Curr.Year : ____________________________ Score: _______________

I. Multiple Choice.
Direction: Read each problem carefully and determine the correct answer from the given
options. Check the box beside your answer.

1. Spearman-Brown formula is used to determine the reliability of


[ ] test-retest method [ ] parallel-forms method
[ ] split-half method [ ] internal consistency method
2. Kuder-Richardson Formula 20 is used to determine the reliability of
[ ] parallel-forms
[ ] split half
[ ] internal-consistency
[ ] test-retest
3. The correlation value of 0.83 is interpreted as
[ ] high relationship
[ ] slight correlation
[ ] moderate relationship
[ ] very high relationship
4. Which of the following correlation values indicates high reliability?
[ ] 0.70
[ ] 0.95
[ ] 0.92
[ ] 0.72
5. Which of the following correlation (r) values indicates low reliability?
[ ] 0.20
[ ] 0.23
[ ] 0.47
[ ] 0.18
6. Which of the following correlation values indicates a moderate relationship?
[ ] 0.91
[ ] 0.40
[ ] 0.69
[ ] 0.35
7. Test-retest method is determined by
[ ] Kuder-Richardson Formula 20
[ ] Spearman rho
[ ] Pearson Product-Moment Correlation Coefficient
[ ] Spearman-Brown Formula
8. A method of testing the reliability of a test in which the examinee receives a score of one or
zero for each item is
[ ] internal-consistency
[ ] split-half
[ ] test-retest
[ ] parallel-forms
9. A method of estimating the reliability of a testing instrument in which the test is administered
twice to the same group of students and the correlation coefficient is determined is
[ ] split-half
[ ] internal-consistency
[ ] test-retest
[ ] parallel-forms
10. Split-half method is determined by
[ ] Kuder-Richardson Formula 20
[ ] Spearman rho
[ ] Pearson Product-Moment Correlation Coefficient
[ ] Spearman-Brown Formula

Calmorin, 2004
II. Problem Solving.

Using the data below, find out if the test is reliable using the test-retest method administered to
15 students as a pilot sample. Use extra paper for the solutions.

                         Ranks           Difference
Students    X     Y     Rx     Ry       D       D²
1 28 29
2 83 83
3 44 45
4 77 49
5 80 79
6 95 95
7 88 87
8 45 45
9 83 84
10 79 80
11 82 83
12 25 25
13 77 79
14 38 39
15 70 72
Total                                  ∑D² =

V. ASSIGNMENT

Find out if there is internal consistency in the responses of the 12 students as a pilot sample in
a 10-item test in Mathematics. Show your solutions. Use extra paper.

Students
Items    1  2  3  4  5  6  7  8  9  10  11  12    f    pi    qi    piqi
1 1 1 1 1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1 1 1 1 1
3 1 1 1 1 1 1 1 1 1 1 1 1
4 0 1 1 1 1 1 1 1 1 1 1 0
5 0 1 1 1 1 1 1 1 1 1 1 0
6 1 0 1 1 1 1 1 1 0 1 1 0
7 1 0 1 0 1 1 1 1 0 1 0 0
8 1 0 1 0 1 1 1 1 0 0 0 0
9 0 0 0 0 1 1 1 1 0 0 0 0
10 0 0 0 0 0 0 1 1 0 0 0 0
Total
References:
Books:
Navarro, Rosita L., Santos, Rosita G., and Corpuz, Brenda B. (2019). Assessment of Learning 1.
OBE and PPST Based. Fourth Edition. Quezon City: Lorimar Publishing, Inc.
Navarro, Rosita L., Santos, Rosita G., and Corpuz, Brenda B. (2017). Assessment of Learning 1.
OBE and K12 Based. Third Edition. Quezon City: Lorimar Publishing, Inc.
De Guzman, Estefania S. and Adamos, Joel L. (2015). Assessment of Learning 1. Quezon City:
Adrian Publishing Co., Inc.
Gonzales, Jacobo O. and Nocon, Rizaldi C. (2015). Essential Statistics. 2015 Edition. Quezon
City: MaxCor Publishing House, Inc.
Corpuz, Brenda B. and Salandanan, Gloria G. (2015). Principles of Teaching 2 with TLE. Lorimar
Publishing, Inc.
Navarro, Rosita L. et al. (2012). Assessment of Learning Outcomes. Second Edition. Lorimar
Publishing, Inc.
Buendicho, Flordeliza C. (2011). Assessment of Learning 1. Rex Bookstore.
Santos, Rosita de Guzman (2007). Assessment of Learning 2. Quezon City: Lorimar Publishing,
Inc.
Calmorin, Laurentina P. (2004). Educational Research Measurement & Evaluation. Third Edition.
National Bookstore.
Padua, Roberto N. et al. (1997). Educational Evaluation and Measurement. Katha Publishing Co.
Calmorin, Laurentina P. (1994). Educational Research Measurement & Evaluation. National
Bookstore.
Calderon and Gonzales. (1993). Measurement and Evaluation. National Bookstore.

Website:
https://chfasoa.uni.edu/reliabilityandvalidity.htm
https://study.com/academy/lesson/test-retest-reliability-coefficient-examples-lesson-quiz.html
http://korbedpsych.com/LinkedFiles/CalculatingReliability.pdf
http://www.real-statistics.com/reliability/internal-consistency-reliability/kuder-richardson-formula-20/

Prepared by: DR. LUZVIMINDA T. PATINDOL
