I. Preliminaries
A. Greetings
B. Prayer
C. Motivational Activity
In order to establish the validity and reliability of an assessment tool, you need to know the
different ways of establishing test validity and reliability. You are expected to read this before
you can analyze your items.
What is test reliability?
1. Reliability. It refers to the consistency of scores obtained by the same person when retested using the same instrument or its parallel form, or when compared with other students who took the same test. Reliability can be demonstrated under three conditions: (1) when the same person is retested; (2) when retested on the same measure or its equivalent; and (3) when responses are similar across items that measure the same characteristic.
In the first condition, consistent responses are expected when the test is given again to the same participants. In the second condition, reliability is attained if the responses to the test are consistent with the responses to its equivalent, another test that measures the same characteristic, when administered at a different time. In the third condition, there is reliability when the person responds consistently across items that measure the same characteristic.
There are different factors that affect the reliability of a measure. The reliability of a measure
can be high or low, depending on the following factors:
1. The number of items in a test – The more items a test has, the more likely it is to be reliable: the probability of obtaining consistent scores is higher because of the larger pool of items.
2. External environment – The external environment may include room temperature, noise level, depth of instruction, exposure to materials, and quality of instruction, all of which could produce changes in the responses of examinees in a test.
What are the different ways to establish test reliability?
There are different ways of determining the reliability of a test. The specific kind of reliability will depend on:
1. the variable you are measuring,
2. the type of test, and
3. the number of versions of the test.
The different types of reliability are described below, indicating how each is done and what statistics are used.
1. Test-Retest
How is this reliability done? You have a test, and you administer it at one time to a group of examinees. Administer it again at another time to the same group of examinees. Use a time interval of not more than 6 months between the first and second administration for tests that measure stable characteristics, such as standardized aptitude tests; the post-test can be given with a minimum time interval of 30 minutes. The responses in the test should be more or less the same across the two points in time. Test-retest is applicable for tests that measure stable variables, such as aptitude and psychomotor measures (e.g., a typing test, tasks in physical education).
What statistics are used? Correlate the test scores from the first and the second administration. A significant and positive correlation indicates that the test has temporal stability. Correlation refers to a statistical procedure that tests whether a linear relationship holds between two variables. You may use the Pearson Product Moment Correlation (Pearson r) because test data are usually on an interval scale (refer to a statistics book for Pearson r).
2. Parallel Forms
How is this reliability done? There are two versions of a test, and the items in both versions need to measure exactly the same skill. Each version is called a "form." Administer one form at one time and the other form at another time to the same group of participants. The responses on the two forms should be more or less the same. Parallel forms are applicable when there are two versions of the test. This is usually done when the test is used repeatedly for different groups, such as entrance examinations and licensure examinations, where different versions of the test are given to different groups of examinees.
What statistics are used? Correlate the test results from the first form and the second form. A significant and positive correlation coefficient is expected; it indicates that the responses in the two forms are consistent. Pearson r is usually used for this analysis.
3. Split-Half
How is this reliability done? Administer a test to a group of examinees. Split the items into halves, usually using the odd-even technique: get the sum of the points on the odd-numbered items and correlate it with the sum of the points on the even-numbered items. Each examinee thus has two scores coming from the same test, and the two scores should be close or consistent. Split-half is applicable when the test has a large number of items.
What statistics are used? Correlate the two sets of scores using Pearson r. After the correlation, apply another formula called the Spearman-Brown coefficient. The coefficients obtained using Pearson r and Spearman-Brown should be significant and positive to indicate that the test has internal consistency reliability.
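The odd-even split and the Spearman-Brown step-up (full-test reliability = 2r / (1 + r)) can be sketched as follows. The half-test scores below are invented for illustration, not taken from this lesson:

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_brown(r_half):
    """Step up a half-test correlation to the full-test reliability."""
    return 2 * r_half / (1 + r_half)

# Hypothetical sums of points on odd- and even-numbered items for five examinees
odd_half = [2, 1, 2, 0, 1]
even_half = [2, 1, 1, 0, 2]
r = pearson_r(odd_half, even_half)
print(round(r, 2), round(spearman_brown(r), 2))
```

The Spearman-Brown correction is needed because the raw Pearson r describes only half-length tests; the full test is longer and therefore more reliable than either half alone.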
4. Test of Internal Consistency Using Kuder-Richardson and Cronbach's Alpha
How is this reliability done? This involves determining whether the examinees answer each item consistently. After administering the test to a group of examinees, determine and record the score for each item. The idea is to see whether the responses per item are consistent with one another. This technique works well when the assessment tool has a large number of items. It is also applicable for scales and inventories (e.g., a Likert scale from "strongly agree" to "strongly disagree").
What statistics are used? A statistical analysis called Cronbach's alpha or the Kuder-Richardson formula is used to determine the internal consistency of the items. A Cronbach's alpha value of 0.60 and above indicates that the test items have internal consistency.
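Cronbach's alpha is commonly computed as α = k/(k−1) × (1 − Σs²ᵢ / s²ₜ), where k is the number of items, s²ᵢ is the variance of each item, and s²ₜ is the variance of the total scores. A minimal sketch with a tiny invented right/wrong dataset (a real analysis would use many more items and examinees):

```python
def variance(values):
    """Population variance of a list of scores."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def cronbach_alpha(rows):
    """rows: one list of item scores per examinee."""
    k = len(rows[0])                          # number of items
    items = list(zip(*rows))                  # scores grouped per item
    totals = [sum(r) for r in rows]           # total score per examinee
    sum_item_var = sum(variance(list(col)) for col in items)
    return k / (k - 1) * (1 - sum_item_var / variance(totals))

# Hypothetical 1/0 (correct/incorrect) responses of four examinees to three items
data = [[1, 1, 1],
        [1, 1, 0],
        [0, 1, 1],
        [0, 0, 0]]
print(round(cronbach_alpha(data), 2))
```

For dichotomous (1/0) items, this same formula reduces to the Kuder-Richardson 20 coefficient.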
5. Inter-rater Reliability
How is this reliability done? This is used to determine the consistency of multiple raters when using rating scales and rubrics to judge performance. Reliability here refers to similar or consistent ratings provided by more than one rater or judge using the same assessment tool. Inter-rater reliability is applicable when the assessment requires the use of multiple raters.
What statistics are used? A statistical analysis called Kendall's coefficient of concordance (Kendall's 𝜔) is used to determine whether the ratings provided by multiple raters agree with each other. A significant Kendall's 𝜔 value indicates that the raters concur or agree with each other in their ratings.
1. Linear Regression
Linear regression is demonstrated when you have two measured variables, such as two sets of scores on a test taken at two different times by the same participants. When the two scores are plotted on a graph (with an X- and a Y-axis), they tend to form a straight line. This correlation is shown in the graph, which is called a scatterplot. Each point in the scatterplot is a respondent with two scores (one for each test).
The statistical analysis used to determine the correlation coefficient is called the Pearson r.
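The Pearson r can be computed directly from its definitional formula (the ratio of the sum of cross-products of deviations to the product of the deviation norms). A minimal sketch with hypothetical test-retest scores (the ten numbers are invented for demonstration):

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Cross-products of deviations and sums of squared deviations
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical scores of five examinees on a first and second administration
first = [10, 12, 8, 15, 9]
second = [11, 13, 9, 14, 10]
print(round(pearson_r(first, second), 2))
```

A value close to +1 supports the claim of temporal stability; statistical packages additionally report a significance test for the coefficient.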
The scores given by the three raters are first computed by summing the ratings for each demonstration. The mean of these sums of ratings is obtained, the mean is subtracted from each sum of ratings (D), each difference is squared (D²), and the sum of squares is computed (ΣD² = 33.2). The mean and the sum of squared differences are substituted in the formula for Kendall's 𝜔:

𝜔 = 12ΣD² / [m²(n³ − n)]

where m is the number of raters and n is the number of performances rated.
A Kendall's 𝜔 coefficient of 0.37 indicates the degree of agreement of the three raters on the five demonstrations. There is only moderate concordance among the three raters because the value is far from 1.00.
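The worked computation can be reproduced with the values reported in the text (m = 3 raters, n = 5 demonstrations, ΣD² = 33.2). Since the raw ratings are not shown in this lesson, the sketch starts from the sum of squared deviations:

```python
def kendalls_w(sum_sq_dev, m, n):
    """Kendall's coefficient of concordance from the sum of squared
    deviations of the rating sums, m raters, and n rated performances."""
    return 12 * sum_sq_dev / (m ** 2 * (n ** 3 - n))

# Values from the worked example in the text: 3 raters, 5 demonstrations
w = kendalls_w(33.2, m=3, n=5)
print(round(w, 2))
```

Substituting the values gives 12(33.2) / [9(125 − 5)] = 398.4 / 1080 ≈ 0.37, matching the coefficient reported above.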
What is test validity?
A measure is valid when it measures what it is supposed to measure. If a quarterly exam is valid, then its contents should directly measure the objectives of the curriculum. If a scale that measures personality is composed of five factors, then each of the five factors should contain items that are highly correlated. If an entrance exam is valid, it should predict students' grades after the first semester.
The types of validity, their definitions, and the procedures for establishing them are as follows.

Content Validity
Definition: The items represent the domain being measured.
Procedure: The items are matched with the objectives of the program. The items need to directly measure the objective (for achievement tests) or the definition (for scales). A reviewer conducts the checking.

Face Validity
Definition: The test is presented well, free of errors, and administered well.
Procedure: The test items and layout are reviewed and tried out on a small group of respondents. A manual for administration can be made as a guide for the test administrator.

Predictive Validity
Definition: A measure should predict a future criterion. An example is an entrance exam predicting the grades of the students after the first semester.
Procedure: A correlation coefficient is obtained where the X-variable is used as the predictor and the Y-variable as the criterion.

Construct Validity
Definition: The components or factors of the test should contain items that are strongly correlated.
Procedure: The Pearson r can be used to correlate the items within each factor. There is also a technique called factor analysis that determines which items are highly correlated and so form a factor.

Concurrent Validity
Definition: Two or more measures of the same characteristic are available for each examinee.
Procedure: The scores on the measures should be correlated.

Convergent Validity
Definition: The components or factors of a test are hypothesized to have a positive correlation.
Procedure: Correlation is done for the factors of the test.

Divergent Validity
Definition: The components or factors of a test are hypothesized to have a negative correlation. An example is the scores in a test on intrinsic and extrinsic motivation.
Procedure: Correlation is done for the factors of the test.
1. Content Validity
A coordinator in science is checking the science test paper for grade 4. She asked the grade 4
science teacher to submit the table of specifications containing the objectives of the lesson and
the corresponding items. The coordinator checked whether each item is aligned with the
objectives.
2. Face Validity
The assistant principal browsed the test paper made by the math teacher. She checked whether the contents of the items are about mathematics and whether the instructions are clear. She browsed through the items to see whether the grammar is correct and the vocabulary is within the students' level of understanding.
• What can be done in order to ensure that the assessment appears to be effective?
• What practices are done in conducting face validity?
• Why is face validity the weakest form of validity?
3. Predictive Validity
The school admissions office developed an entrance examination. The officials wanted to determine whether the results of the entrance examination are accurate in identifying good students. They took the grades of the accepted students for the first quarter and correlated the entrance exam results with the first-quarter grades. They found a significant and positive correlation between the entrance examination scores and the grades: the entrance examination results predicted the grades of students after the first quarter. Thus, there was predictive validity.
5. Construct Validity
A grade 10 teacher made a science test composed of four domains: matter, living things, force and motion, and earth and space, with 10 items under each domain. The teacher wanted to determine whether the 10 items written under each domain really belonged to that domain. The teacher consulted an expert in test measurement. They conducted a procedure called
factor analysis. Factor analysis is a statistical procedure done to determine whether the items written load on the domains to which they belong.
The construct validity of a measure is reported in journal articles. The following are guide
questions used when searching for the construct validity of a measure from reports:
7. Divergent Validity
An English teacher taught grade 11 students a metacognitive awareness strategy for comprehending a paragraph. She wanted to determine whether her students' performance in reading comprehension would be reflected well in the reading comprehension test.
2. Obtain the upper and lower 27% of the group. Multiply 0.27 by the total number of students; with 10 students this gives 2.7, which rounds to 3. Get the top three students and the bottom three students based on their total scores. The top three students are students 2, 5, and 9. The bottom three students are students 7, 8, and 4. The rest of the students are not included in the item analysis.
            Item 1  Item 2  Item 3  Item 4  Item 5  Total Score
Student 2     1       1       1       0       1         4
Student 5     0       1       1       1       1         4
Student 9     1       0       1       1       1         4
Student 1     0       0       1       1       1         3
Student 6     1       0       1       1       0         3
Student 10    1       0       1       1       0         3
Student 3     0       0       0       1       1         2
Student 7     0       0       1       1       0         2
Student 8     0       1       1       0       0         2
Student 4     0       0       0       0       1         1
3. Obtain the proportion correct for each item. This is computed for the upper 27% group and the lower 27% group by summing the correct answers per item and dividing by the number of students in that group.
                                    Item 1  Item 2  Item 3  Item 4  Item 5  Total Score
Student 2                             1       1       1       0       1         4
Student 5                             0       1       1       1       1         4
Student 9                             1       0       1       1       1         4
Total                                 2       2       3       2       3
Proportion of the high group (PH)     0.67    0.67    1.00    0.67    1.00
Student 7                             0       0       1       1       0         2
Student 8                             0       1       1       0       0         2
Student 4                             0       0       0       0       1         1
Total                                 0       1       2       1       1
Proportion of the low group (PL)      0.00    0.33    0.67    0.33    0.33
Item difficulty = (PH + PL) / 2, where PH is the proportion correct in the upper group and PL is the proportion correct in the lower group. Item discrimination = PH − PL.
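The two indices can be computed from the upper- and lower-group responses tabulated above; a minimal sketch:

```python
def proportions(group):
    """Proportion of correct answers per item for one group."""
    n = len(group)
    return [sum(col) / n for col in zip(*group)]

# Item responses (items 1-5) of the upper 27% (students 2, 5, 9)
upper = [[1, 1, 1, 0, 1],
         [0, 1, 1, 1, 1],
         [1, 0, 1, 1, 1]]
# Item responses (items 1-5) of the lower 27% (students 7, 8, 4)
lower = [[0, 0, 1, 1, 0],
         [0, 1, 1, 0, 0],
         [0, 0, 0, 0, 1]]

ph, pl = proportions(upper), proportions(lower)
difficulty = [round((h + l) / 2, 2) for h, l in zip(ph, pl)]       # (PH + PL) / 2
discrimination = [round(h - l, 2) for h, l in zip(ph, pl)]         # PH - PL
print(difficulty)
print(discrimination)
```

For instance, item 1 has a difficulty of (0.67 + 0.00) / 2 ≈ 0.33 and a discrimination of 0.67 − 0.00 = 0.67, meaning it is answered correctly far more often by high scorers than by low scorers.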
Get the results of your previous exam in the class and conduct an item analysis. Determine the difficulty and discrimination of each item and tabulate the results below. Indicate the index of difficulty, then write whether the item is very difficult, difficult, average, easy, or very easy. In the last column, indicate the index of discrimination and write whether it is a very good item, good item, fair item, or poor item.
Item    Index of Difficulty    Interpretation    Index of Discrimination    Interpretation
Item 1
Item 2
Item 3
Item 4
Item 5
When developing a teacher-made test, it is good to have items that are easy, average, and difficult, with positive discrimination indices. If you are developing a standardized test, the rule is more stringent: it aims for items of average difficulty (neither too easy nor too difficult) whose discrimination index is at least 0.3.
III. Abstraction
A. Indicate the type of reliability applicable for each case. Write the type of reliability on the
space before the number.
1. Mr. Perez conducted a survey of his students to determine their study
habits. Each item is answered using a five-point scale (always, often,
sometimes, rarely, never). He wanted to determine if the responses for
each item are consistent. What reliability technique is recommended?
2. A teacher administered a spelling test to her students. After a day, another
spelling test was given with the same length and stress of words. What
reliability can be used for the two spelling tests?
3. A PE teacher requested two judges to rate the dance performance of her
students in physical education. What reliability can be used to determine
the reliability of the judgments?
4. An English teacher administered a 20-item test on students' use of verbs given a subject. The scores were divided into items 1 to 10 and items 11 to 20, and the teacher correlated the two sets of scores from the same test. What reliability is done here?
5. A computer teacher gave a set of typing tests on Wednesday and gave the
same set the following week. The teacher wanted to know if the students’
typing skills are consistent. What reliability can be used?
B. Indicate the type of validity applicable for each case. Write the type of validity on the blank
before the number.
1. The science coordinator developed a science test to determine who among
the students will be placed in an advanced science section. The students
who scored high in the science test were selected. After two quarters, the
grades of the students in the advanced science were determined. The
scores in the science test were correlated with the science grades to check
if the science test was accurate in the selection of students. What type of
validity was used?
2. A test composed of listening comprehension, reading comprehension, and visual comprehension items was administered to students. The researcher determined whether the scores in each area refer to the same comprehension skill and hypothesized a significant and positive relationship among these factors. What validity was established?
3. The guidance counselor conducted an interest inventory that measured the
following factors: realistic, investigative, artistic, scientific, enterprising,
and conventional. The guidance counselor wanted to provide evidence that
the items constructed really belong to the factor proposed. After her
analysis, the proposed items had high factor loadings on the domain they
belong to. What validity was conducted?
4. The technology and livelihood education teacher developed a performance
task to determine student competency in preparing a dessert. The students
were tasked with selecting a dessert, preparing the ingredients, and
making the dessert in the kitchen. The teacher developed a set of criteria
to assess the dessert. What type of validity is shown here?
5. The teacher in a robotics class taught students how to create a program to
make the arms of a robot move. The assessment was a performance task of
making a program to make three kinds of robot arm movements. The same
assessment task was given to students with no robotics class. The
programming performance of the two classes was compared. What
validity was established?
IV. Assessment
An English teacher administered a spelling test to 15 students. The spelling test is composed of 10 items. Each item is encoded, wherein a correct answer is marked "1" and an incorrect answer is marked "0". The grade in English is provided in the last column. The first five items are words with two stresses, and the next five are words with a single stress. The recording is shown in the table.
Your task is to determine whether the spelling test is reliable and valid using the data, by computing the following: (1) split-half reliability, (2) Cronbach's alpha, (3) predictive validity with the English grade, (4) convergent validity between words with single and two stresses, and (5) the difficulty index of each item.
Student Item Item Item Item Item Item Item Item Item Item English
no. 1 2 3 4 5 6 7 8 9 10 grade
1 1 0 0 1 1 1 0 1 1 0 80
2 0 0 0 1 1 1 1 1 0 0 81
3 1 1 0 0 1 0 1 0 1 1 83
4 0 1 0 0 1 1 1 1 1 0 85
5 0 1 1 0 1 1 1 0 1 1 84
6 1 0 1 0 1 1 1 1 1 1 89
7 1 0 1 1 1 1 1 1 0 1 87
8 1 1 1 0 1 1 1 1 1 1 87
9 1 1 1 1 1 1 1 1 0 1 89
10 1 1 1 1 0 0 1 1 1 1 90
11 0 1 1 1 0 1 1 1 1 0 90
12 1 0 1 1 1 1 1 1 1 1 87
13 1 1 1 1 1 1 1 0 1 1 88
14 1 1 0 1 1 1 1 1 1 1 88
15 1 1 1 1 1 0 1 1 0 1 85
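As a starting point for task (5), the difficulty index of each item can be read off the table as the proportion of the 15 students answering it correctly. The sketch below transcribes the 1/0 responses from the table; the odd-even scores at the end are the inputs you would correlate (and step up with Spearman-Brown) for task (1):

```python
# Responses of the 15 students to items 1-10, transcribed from the table above
responses = [
    [1, 0, 0, 1, 1, 1, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 0, 0, 1, 0, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 0, 1, 1, 1, 0, 1, 1],
    [1, 0, 1, 0, 1, 1, 1, 1, 1, 1],
    [1, 0, 1, 1, 1, 1, 1, 1, 0, 1],
    [1, 1, 1, 0, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 1],
    [1, 1, 1, 1, 0, 0, 1, 1, 1, 1],
    [0, 1, 1, 1, 0, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0, 1, 1, 0, 1],
]

# Difficulty index = proportion of students answering the item correctly
difficulty = [round(sum(col) / len(responses), 2) for col in zip(*responses)]
print(difficulty)

# Odd-even half scores per student for the split-half analysis (task 1):
# correlate these two lists, then apply the Spearman-Brown formula
odd_scores = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, 7, 9
even_scores = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, 8, 10
```

The remaining analyses (Cronbach's alpha, predictive validity against the English grade, convergent validity between the two stress groups) follow the same pattern using the alpha and Pearson r computations shown earlier in this lesson.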
Create a short test and report its validity and reliability. Select a grade level and a subject. Choose one or two learning competencies and write 10-20 items for them. Consult your teacher on the items and the table of specifications.
1. Have your items checked by experts to see whether they are aligned with the selected competencies.
2. Revise your items based on the reviews provided by the experts.
3. Lay out your test and administer it to about 100 students.
4. Encode your data; you may use an application to compute the needed statistical analyses.
5. Determine the following:
• Split-half reliability
• Cronbach’s alpha
• Construct validity of the items with the underlying factors
• Convergent validity of the domains
• Item difficulty and discrimination
Write a report on your procedure. The report will contain the following parts:
Introduction. Give the purpose of the study. Describe the test measures, its components, the
competencies selected, and the kind of items. Rationalize the need to determine the validity and
reliability of the test.
Method. Describe the participants who took the test. Describe what the test measures, number
of items, test format, and how content validity was established. Describe the procedure on how
data was collected or how the test was administered. Describe what statistical analysis was used.
Results. Present the results in a table and provide the necessary interpretations. Make sure to
show the results of the split-half reliability, Cronbach’s alpha, construct validity of the items with
the underlying factors, convergent validity of the domains, and item difficulty and
discrimination.
Discussion. Provide implications about the test validity and reliability.
V. Evaluation
Use the rubric to rate students’ work on the previous task.
Each part of the report is rated as Very Good, Good, Fair, or Needs Improvement.

Introduction
• Very Good: All the parts, such as the purpose, characteristics of the measure, and rationale, are indicated. The rationale justifies the purpose of the study well, and adequate details about the test are described and supported.
• Good: One of the parts is not sufficiently explained. The rationale justifies the purpose, but some details of the test are not found.
• Fair: Two of the parts are not sufficiently explained. The rationale somehow justifies the purpose. Several details about the test are not indicated.
• Needs Improvement: All parts of the report are not sufficiently explained. The connection between the purpose and rationale is difficult to follow. The features of the test are not described well.

Method
• Very Good: All the parts, such as participants, test description, validity and reliability, procedure, and analysis, are present. All the parts sufficiently describe how the data was gathered and analyzed.
• Good: One of the parts is not sufficiently explained. One part lacks adequate information on how data was gathered and analyzed.
• Fair: Two of the parts are not sufficiently explained. Two parts lack adequate information about the data gathering and analysis.
• Needs Improvement: All parts of the report are not sufficiently explained. Two or more parts are missing.

Results
• Very Good: The necessary tables and interpretations are all present. All the required analyses are complete and accurately interpreted.
• Good: One table and interpretation is missing. One table and/or interpretation does not have accurate content.
• Fair: Two tables and interpretations are missing. Two tables and interpretations have inaccurate information.
• Needs Improvement: More than two tables and interpretations are missing. Three or more tables and interpretations have inaccurate information.

Discussion
• Very Good: Implications of the test's validity and reliability are well explained with three or more supporting reviews. A detailed discussion of the reliability and validity results is provided with explanation.
• Good: Implications of the test's validity and reliability are explained with two supporting reviews. One of the results for reliability and validity is not provided with explanation.
• Fair: Implications of the test's validity and reliability are explained with no supporting review. Two of the results for validity and reliability are not provided with explanation.
• Needs Improvement: Implications of the test's validity and reliability are not explained, and there is no supporting review. Three or more of the validity and reliability results are not provided with explanation.
VI. References:
Detailed description of one IRT model can be read in the work of Magno (2009) titled,
“Demonstrating the Difference between Classical Test Theory and Item Response Theory
Using Derived Test Data” published in the International Journal of Educational and
Psychological Assessment Volume 1. It can be accessed at
https://files.eric.ed.gov/fulltext/ED506058.pdf