I. Preliminaries
A. Greetings
B. Prayer
C. Motivational Activity
In order to establish the validity and reliability of an assessment tool, you need to know the
different ways of establishing test validity and reliability. You are expected to read this before
you can analyze your items.
What is test reliability?
1. Reliability. It refers to the consistency of scores obtained by the same person when retested using the same instrument or its parallel form, or when compared with other students who took the same test. Reliability can be demonstrated under three conditions: (1) when the same person is retested; (2) when retested on the same measure or its equivalent; and (3) when responses are similar across items that measure the same characteristic.
In the first condition, consistent responses are expected when the test is given again to the same participants. In the second condition, reliability is attained if the responses to the test are consistent with the responses to its equivalent, another test that measures the same characteristic, when administered at a different time. In the third condition, there is reliability when the person responds consistently across items that measure the same characteristic.
There are different factors that affect the reliability of a measure. The reliability of a measure
can be high or low, depending on the following factors:
1. The number of items in a test – The more items a test has, the more likely it is to be reliable: the probability of obtaining consistent scores is higher because of the larger pool of items.
2. External environment – The external environment may include room temperature, noise level, depth of instruction, exposure to materials, and quality of instruction, all of which could produce changes in the responses of examinees in a test.
What are the different ways to establish test reliability?
There are different ways of determining the reliability of a test. The specific kind of reliability will depend on:
1. the variable you are measuring,
2. the type of test, and
3. the number of versions of the test.
The different types of reliability are described below, indicating how each is done and what statistics are used.
1. Test-Retest
How is this reliability done? You have a test, and you administer it at one time to a group of examinees. Administer it again at another time to the same group of examinees. Use a time interval of not more than 6 months between the first and second administration for tests that measure stable characteristics, such as standardized aptitude tests; the post-test can be given with a minimum time interval of 30 minutes. The responses in the test should be more or less the same across the two points in time. Test-retest is applicable for tests that measure stable variables, such as aptitude and psychomotor measures (e.g., a typing test, tasks in physical education).
What statistics are used? Correlate the test scores from the first and the second administration. A significant and positive correlation indicates that the test has temporal stability. Correlation refers to a statistical procedure that tests whether a linear relationship holds between two variables. You may use the Pearson Product Moment Correlation (Pearson r) because test data are usually on an interval scale (refer to a statistics book for Pearson r).
2. Parallel Forms
How is this reliability done? There are two versions of a test, and the items in both versions need to measure exactly the same skill. Each version is called a "form." Administer one form at one time and the other form at another time to the same group of participants. The responses on the two forms should be more or less the same. Parallel forms are applicable when there are two versions of the test. This is usually done when the test is used repeatedly for different groups, such as entrance examinations and licensure examinations, where different versions of the test are given to different groups of examinees.
What statistics are used? Correlate the test results from the first form and the second form. A significant and positive correlation coefficient is expected; it indicates that the responses in the two forms are consistent. Pearson r is usually used for this analysis.
3. Split-Half
How is this reliability done? Administer a test to a group of examinees. Split the items into halves, usually using the odd-even technique: get the sum of the points on the odd-numbered items and correlate it with the sum of the points on the even-numbered items. Each examinee thus has two scores coming from the same test, and the two scores should be close or consistent. Split-half is applicable when the test has a large number of items.
What statistics are used? Correlate the two sets of scores using Pearson r. After the correlation, apply another formula called the Spearman-Brown coefficient. The coefficients obtained using Pearson r and Spearman-Brown should be significant and positive to indicate that the test has internal consistency reliability.
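The odd-even split and the Spearman-Brown step-up (full-test reliability = 2r / (1 + r)) can be sketched as follows. The half-test scores below are invented for illustration, not taken from this lesson:

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_brown(r_half):
    """Step up a half-test correlation to the full-test reliability."""
    return 2 * r_half / (1 + r_half)

# Hypothetical sums of points on odd- and even-numbered items for five examinees
odd_half = [2, 1, 2, 0, 1]
even_half = [2, 1, 1, 0, 2]
r = pearson_r(odd_half, even_half)
print(round(r, 2), round(spearman_brown(r), 2))
```

The Spearman-Brown correction is needed because the raw Pearson r describes only half-length tests; the full test is longer and therefore more reliable than either half alone.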
4. Test of Internal Consistency Using Kuder-Richardson and Cronbach's Alpha
How is this reliability done? This involves determining whether the examinees answer each item consistently. After administering the test to a group of examinees, determine and record the score for each item. The idea is to see whether the responses per item are consistent with one another. This technique works well when the assessment tool has a large number of items. It is also applicable for scales and inventories (e.g., a Likert scale from "strongly agree" to "strongly disagree").
What statistics are used? A statistical analysis called Cronbach's alpha or the Kuder-Richardson formula is used to determine the internal consistency of the items. A Cronbach's alpha value of 0.60 and above indicates that the test items have internal consistency.
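Cronbach's alpha is commonly computed as α = k/(k−1) × (1 − Σs²ᵢ / s²ₜ), where k is the number of items, s²ᵢ is the variance of each item, and s²ₜ is the variance of the total scores. A minimal sketch with a tiny invented right/wrong dataset (a real analysis would use many more items and examinees):

```python
def variance(values):
    """Population variance of a list of scores."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def cronbach_alpha(rows):
    """rows: one list of item scores per examinee."""
    k = len(rows[0])                          # number of items
    items = list(zip(*rows))                  # scores grouped per item
    totals = [sum(r) for r in rows]           # total score per examinee
    sum_item_var = sum(variance(list(col)) for col in items)
    return k / (k - 1) * (1 - sum_item_var / variance(totals))

# Hypothetical 1/0 (correct/incorrect) responses of four examinees to three items
data = [[1, 1, 1],
        [1, 1, 0],
        [0, 1, 1],
        [0, 0, 0]]
print(round(cronbach_alpha(data), 2))
```

For dichotomous (1/0) items, this same formula reduces to the Kuder-Richardson 20 coefficient.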
5. Inter-rater Reliability
How is this reliability done? This is used to determine the consistency of multiple raters when using rating scales and rubrics to judge performance. Reliability here refers to similar or consistent ratings provided by more than one rater or judge using the same assessment tool. Inter-rater reliability is applicable when the assessment requires the use of multiple raters.
What statistics are used? A statistical analysis called Kendall's coefficient of concordance (Kendall's 𝜔) is used to determine whether the ratings provided by multiple raters agree with each other. A significant Kendall's 𝜔 value indicates that the raters concur or agree with each other in their ratings.
1. Linear Regression
Linear regression is demonstrated when you have two measured variables, such as two sets of scores on a test taken at two different times by the same participants. When the two scores are plotted on a graph (with an X- and a Y-axis), they tend to form a straight line. This correlation is shown in the graph, which is called a scatterplot. Each point in the scatterplot is a respondent with two scores (one for each test).
The statistical analysis used to determine the correlation coefficient is called the Pearson r.
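The Pearson r can be computed directly from its definitional formula (the ratio of the sum of cross-products of deviations to the product of the deviation norms). A minimal sketch with hypothetical test-retest scores (the ten numbers are invented for demonstration):

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Cross-products of deviations and sums of squared deviations
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical scores of five examinees on a first and second administration
first = [10, 12, 8, 15, 9]
second = [11, 13, 9, 14, 10]
print(round(pearson_r(first, second), 2))
```

A value close to +1 supports the claim of temporal stability; statistical packages additionally report a significance test for the coefficient.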
The scores given by the three raters are first computed by summing the ratings for each demonstration. The mean of these sums of ratings is obtained, the mean is subtracted from each sum of ratings (D), each difference is squared (D²), and the sum of squares is computed (ΣD² = 33.2). The mean and the sum of squared differences are substituted in the formula for Kendall's 𝜔:

𝜔 = 12ΣD² / [m²(n³ − n)]

where m is the number of raters and n is the number of performances rated.
A Kendall's 𝜔 coefficient of 0.37 indicates the degree of agreement of the three raters on the five demonstrations. There is only moderate concordance among the three raters because the value is far from 1.00.
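The worked computation can be reproduced with the values reported in the text (m = 3 raters, n = 5 demonstrations, ΣD² = 33.2). Since the raw ratings are not shown in this lesson, the sketch starts from the sum of squared deviations:

```python
def kendalls_w(sum_sq_dev, m, n):
    """Kendall's coefficient of concordance from the sum of squared
    deviations of the rating sums, m raters, and n rated performances."""
    return 12 * sum_sq_dev / (m ** 2 * (n ** 3 - n))

# Values from the worked example in the text: 3 raters, 5 demonstrations
w = kendalls_w(33.2, m=3, n=5)
print(round(w, 2))
```

Substituting the values gives 12(33.2) / [9(125 − 5)] = 398.4 / 1080 ≈ 0.37, matching the coefficient reported above.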
What is test validity?
A measure is valid when it measures what it is supposed to measure. If a quarterly exam is valid, then its contents should directly measure the objectives of the curriculum. If a scale that measures personality is composed of five factors, then each of the five factors should contain items that are highly correlated. If an entrance exam is valid, it should predict students' grades after the first semester.
The types of validity, their definitions, and the procedures for establishing them are as follows.

Content Validity
Definition: The items represent the domain being measured.
Procedure: The items are matched with the objectives of the program. The items need to directly measure the objective (for achievement tests) or the definition (for scales). A reviewer conducts the checking.

Face Validity
Definition: The test is presented well, free of errors, and administered well.
Procedure: The test items and layout are reviewed and tried out on a small group of respondents. A manual for administration can be made as a guide for the test administrator.

Predictive Validity
Definition: A measure should predict a future criterion. An example is an entrance exam predicting the grades of the students after the first semester.
Procedure: A correlation coefficient is obtained where the X-variable is used as the predictor and the Y-variable as the criterion.

Construct Validity
Definition: The components or factors of the test should contain items that are strongly correlated.
Procedure: The Pearson r can be used to correlate the items within each factor. There is also a technique called factor analysis that determines which items are highly correlated and so form a factor.

Concurrent Validity
Definition: Two or more measures of the same characteristic are available for each examinee.
Procedure: The scores on the measures should be correlated.

Convergent Validity
Definition: The components or factors of a test are hypothesized to have a positive correlation.
Procedure: Correlation is done for the factors of the test.

Divergent Validity
Definition: The components or factors of a test are hypothesized to have a negative correlation. An example is the scores in a test on intrinsic and extrinsic motivation.
Procedure: Correlation is done for the factors of the test.
1. Content Validity
A coordinator in science is checking the science test paper for grade 4. She asked the grade 4
science teacher to submit the table of specifications containing the objectives of the lesson and
the corresponding items. The coordinator checked whether each item is aligned with the
objectives.
2. Face Validity
The assistant principal browsed the test paper made by the math teacher. She checked whether the contents of the items are about mathematics and whether the instructions are clear. She browsed through the items to see whether the grammar is correct and the vocabulary is within the students' level of understanding.
• What can be done in order to ensure that the assessment appears to be effective?
• What practices are done in conducting face validity?
• Why is face validity the weakest form of validity?
3. Predictive Validity
The school admissions office developed an entrance examination. The officials wanted to determine whether the results of the entrance examination are accurate in identifying good students. They took the grades of the accepted students for the first quarter and correlated the entrance exam results with the first-quarter grades. They found a significant and positive correlation between the entrance examination scores and the grades: the entrance examination results predicted the grades of students after the first quarter. Thus, there was predictive validity.
5. Construct Validity
A grade 10 teacher made a science test composed of four domains: matter, living things, force and motion, and earth and space, with 10 items under each domain. The teacher wanted to determine whether the 10 items written under each domain really belonged to that domain. The teacher consulted an expert in test measurement. They conducted a procedure called
factor analysis. Factor analysis is a statistical procedure done to determine whether the items written load on the domains to which they belong.
The construct validity of a measure is reported in journal articles. The following are guide
questions used when searching for the construct validity of a measure from reports:
7. Divergent Validity
An English teacher taught grade 11 students a metacognitive awareness strategy for comprehending a paragraph. She wanted to determine whether her students' performance in reading comprehension would be reflected well in the reading comprehension test.
2. Obtain the upper and lower 27% of the group. Multiply 0.27 by the total number of students; with 10 students this gives 2.7, which rounds to 3. Get the top three students and the bottom three students based on their total scores. The top three students are students 2, 5, and 9. The bottom three students are students 7, 8, and 4. The rest of the students are not included in the item analysis.
            Item 1  Item 2  Item 3  Item 4  Item 5  Total Score
Student 2     1       1       1       0       1         4
Student 5     0       1       1       1       1         4
Student 9     1       0       1       1       1         4
Student 1     0       0       1       1       1         3
Student 6     1       0       1       1       0         3
Student 10    1       0       1       1       0         3
Student 3     0       0       0       1       1         2
Student 7     0       0       1       1       0         2
Student 8     0       1       1       0       0         2
Student 4     0       0       0       0       1         1
3. Obtain the proportion correct for each item. This is computed for the upper 27% group and the lower 27% group by summing the correct answers per item and dividing by the number of students in that group.
                                    Item 1  Item 2  Item 3  Item 4  Item 5  Total Score
Student 2                             1       1       1       0       1         4
Student 5                             0       1       1       1       1         4
Student 9                             1       0       1       1       1         4
Total                                 2       2       3       2       3
Proportion of the high group (PH)     0.67    0.67    1.00    0.67    1.00
Student 7                             0       0       1       1       0         2
Student 8                             0       1       1       0       0         2
Student 4                             0       0       0       0       1         1
Total                                 0       1       2       1       1
Proportion of the low group (PL)      0.00    0.33    0.67    0.33    0.33
Item difficulty = (PH + PL) / 2, where PH is the proportion correct in the upper group and PL is the proportion correct in the lower group. Item discrimination = PH − PL.
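The two indices can be computed from the upper- and lower-group responses tabulated above; a minimal sketch:

```python
def proportions(group):
    """Proportion of correct answers per item for one group."""
    n = len(group)
    return [sum(col) / n for col in zip(*group)]

# Item responses (items 1-5) of the upper 27% (students 2, 5, 9)
upper = [[1, 1, 1, 0, 1],
         [0, 1, 1, 1, 1],
         [1, 0, 1, 1, 1]]
# Item responses (items 1-5) of the lower 27% (students 7, 8, 4)
lower = [[0, 0, 1, 1, 0],
         [0, 1, 1, 0, 0],
         [0, 0, 0, 0, 1]]

ph, pl = proportions(upper), proportions(lower)
difficulty = [round((h + l) / 2, 2) for h, l in zip(ph, pl)]       # (PH + PL) / 2
discrimination = [round(h - l, 2) for h, l in zip(ph, pl)]         # PH - PL
print(difficulty)
print(discrimination)
```

For instance, item 1 has a difficulty of (0.67 + 0.00) / 2 ≈ 0.33 and a discrimination of 0.67 − 0.00 = 0.67, meaning it is answered correctly far more often by high scorers than by low scorers.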
Get the results of your previous exam in the class and conduct an item analysis. Determine the difficulty and discrimination of each item and tabulate the results below. Indicate the index of difficulty, then write whether the item is very difficult, difficult, average, easy, or very easy. In the last column, indicate the index of discrimination and write whether it is a very good item, good item, fair item, or poor item.
Item    Index of Difficulty    Interpretation    Index of Discrimination    Interpretation
Item 1
Item 2
Item 3
Item 4
Item 5
When developing a teacher-made test, it is good to have items that are easy, average, and difficult, with positive discrimination indices. If you are developing a standardized test, the rule is more stringent: it aims for items of average difficulty (neither too easy nor too difficult) whose discrimination index is at least 0.3.
III. Abstraction
A. Indicate the type of reliability applicable for each case. Write the type of reliability on the
space before the number.
1. Mr. Perez conducted a survey of his students to determine their study
habits. Each item is answered using a five-point scale (always, often,
sometimes, rarely, never). He wanted to determine if the responses for
each item are consistent. What reliability technique is recommended?
2. A teacher administered a spelling test to her students. After a day, another
spelling test was given with the same length and stress of words. What
reliability can be used for the two spelling tests?
3. A PE teacher requested two judges to rate the dance performance of her
students in physical education. What reliability can be used to determine
the reliability of the judgments?
4. An English teacher administered a 20-item test on students' use of verbs given a subject. The scores were divided into items 1 to 10 and items 11 to 20, and the teacher correlated the two sets of scores from the same test. What reliability is done here?
5. A computer teacher gave a set of typing tests on Wednesday and gave the
same set the following week. The teacher wanted to know if the students’
typing skills are consistent. What reliability can be used?
B. Indicate the type of validity applicable for each case. Write the type of validity on the blank
before the number.
1. The science coordinator developed a science test to determine who among
the students will be placed in an advanced science section. The students
who scored high in the science test were selected. After two quarters, the
grades of the students in the advanced science were determined. The
scores in the science test were correlated with the science grades to check
if the science test was accurate in the selection of students. What type of
validity was used?
2. A test composed of listening comprehension, reading comprehension, and visual comprehension items was administered to students. The researcher determined whether the scores in each area refer to the same comprehension skill and hypothesized a significant and positive relationship among these factors. What validity was established?
3. The guidance counselor conducted an interest inventory that measured the
following factors: realistic, investigative, artistic, scientific, enterprising,
and conventional. The guidance counselor wanted to provide evidence that
the items constructed really belong to the factor proposed. After her
analysis, the proposed items had high factor loadings on the domain they
belong to. What validity was conducted?
4. The technology and livelihood education teacher developed a performance
task to determine student competency in preparing a dessert. The students
were tasked with selecting a dessert, preparing the ingredients, and
making the dessert in the kitchen. The teacher developed a set of criteria
to assess the dessert. What type of validity is shown here?
5. The teacher in a robotics class taught students how to create a program to
make the arms of a robot move. The assessment was a performance task of
making a program to make three kinds of robot arm movements. The same
assessment task was given to students with no robotics class. The
programming performance of the two classes was compared. What
validity was established?
IV. Assessment
An English teacher administered a spelling test to 15 students. The spelling test is composed of 10 items. Each item is encoded, wherein a correct answer is marked "1" and an incorrect answer is marked "0". The grade in English is provided in the last column. The first five items are words with two stresses, and the next five are words with a single stress. The recording is shown in the table.
Your task is to determine whether the spelling test is reliable and valid using the data, by computing the following: (1) split-half reliability, (2) Cronbach's alpha, (3) predictive validity with the English grade, (4) convergent validity between words with single and two stresses, and (5) the difficulty index of each item.
Student Item Item Item Item Item Item Item Item Item Item English
no. 1 2 3 4 5 6 7 8 9 10 grade
1 1 0 0 1 1 1 0 1 1 0 80
2 0 0 0 1 1 1 1 1 0 0 81
3 1 1 0 0 1 0 1 0 1 1 83
4 0 1 0 0 1 1 1 1 1 0 85
5 0 1 1 0 1 1 1 0 1 1 84
6 1 0 1 0 1 1 1 1 1 1 89
7 1 0 1 1 1 1 1 1 0 1 87
8 1 1 1 0 1 1 1 1 1 1 87
9 1 1 1 1 1 1 1 1 0 1 89
10 1 1 1 1 0 0 1 1 1 1 90
11 0 1 1 1 0 1 1 1 1 0 90
12 1 0 1 1 1 1 1 1 1 1 87
13 1 1 1 1 1 1 1 0 1 1 88
14 1 1 0 1 1 1 1 1 1 1 88
15 1 1 1 1 1 0 1 1 0 1 85
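As a starting point for task (5), the difficulty index of each item can be read off the table as the proportion of the 15 students answering it correctly. The sketch below transcribes the 1/0 responses from the table; the odd-even scores at the end are the inputs you would correlate (and step up with Spearman-Brown) for task (1):

```python
# Responses of the 15 students to items 1-10, transcribed from the table above
responses = [
    [1, 0, 0, 1, 1, 1, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 0, 0, 1, 0, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 0, 1, 1, 1, 0, 1, 1],
    [1, 0, 1, 0, 1, 1, 1, 1, 1, 1],
    [1, 0, 1, 1, 1, 1, 1, 1, 0, 1],
    [1, 1, 1, 0, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 1],
    [1, 1, 1, 1, 0, 0, 1, 1, 1, 1],
    [0, 1, 1, 1, 0, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0, 1, 1, 0, 1],
]

# Difficulty index = proportion of students answering the item correctly
difficulty = [round(sum(col) / len(responses), 2) for col in zip(*responses)]
print(difficulty)

# Odd-even half scores per student for the split-half analysis (task 1):
# correlate these two lists, then apply the Spearman-Brown formula
odd_scores = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, 7, 9
even_scores = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, 8, 10
```

The remaining analyses (Cronbach's alpha, predictive validity against the English grade, convergent validity between the two stress groups) follow the same pattern using the alpha and Pearson r computations shown earlier in this lesson.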
Create a short test and report its validity and reliability. Select a grade level and a subject. Choose one or two learning competencies and write 10-20 items for them. Consult your teacher on the items and the table of specifications.
1. Have your items checked by experts to see whether they are aligned with the selected competencies.
2. Revise your items based on the reviews provided by the experts.
3. Lay out your test and administer it to about 100 students.
4. Encode your data; you may use an application to compute the needed statistical analyses.
5. Determine the following:
• Split-half reliability
• Cronbach’s alpha
• Construct validity of the items with the underlying factors
• Convergent validity of the domains
• Item difficulty and discrimination
Write a report on your procedure. The report will contain the following parts:
Introduction. Give the purpose of the study. Describe the test measures, its components, the
competencies selected, and the kind of items. Rationalize the need to determine the validity and
reliability of the test.
Method. Describe the participants who took the test. Describe what the test measures, number
of items, test format, and how content validity was established. Describe the procedure on how
data was collected or how the test was administered. Describe what statistical analysis was used.
Results. Present the results in a table and provide the necessary interpretations. Make sure to
show the results of the split-half reliability, Cronbach’s alpha, construct validity of the items with
the underlying factors, convergent validity of the domains, and item difficulty and
discrimination.
Discussion. Provide implications about the test validity and reliability.
V. Evaluation
Use the rubric to rate students’ work on the previous task.
Each part of the report is rated as Very Good, Good, Fair, or Needs Improvement.

Introduction
• Very Good: All the parts, such as the purpose, characteristics of the measure, and rationale, are indicated. The rationale justifies the purpose of the study well, and adequate details about the test are described and supported.
• Good: One of the parts is not sufficiently explained. The rationale justifies the purpose, but some details of the test are not found.
• Fair: Two of the parts are not sufficiently explained. The rationale somehow justifies the purpose. Several details about the test are not indicated.
• Needs Improvement: All parts of the report are not sufficiently explained. The connection between the purpose and rationale is difficult to follow. The features of the test are not described well.

Method
• Very Good: All the parts, such as participants, test description, validity and reliability, procedure, and analysis, are present. All the parts sufficiently describe how the data was gathered and analyzed.
• Good: One of the parts is not sufficiently explained. One part lacks adequate information on how data was gathered and analyzed.
• Fair: Two of the parts are not sufficiently explained. Two parts lack adequate information about the data gathering and analysis.
• Needs Improvement: All parts of the report are not sufficiently explained. Two or more parts are missing.

Results
• Very Good: The necessary tables and interpretations are all present. All the required analyses are complete and accurately interpreted.
• Good: One table and interpretation is missing. One table and/or interpretation does not have accurate content.
• Fair: Two tables and interpretations are missing. Two tables and interpretations have inaccurate information.
• Needs Improvement: More than two tables and interpretations are missing. Three or more tables and interpretations have inaccurate information.

Discussion
• Very Good: Implications of the test's validity and reliability are well explained with three or more supporting reviews. A detailed discussion of the reliability and validity results is provided with explanation.
• Good: Implications of the test's validity and reliability are explained with two supporting reviews. One of the results for reliability and validity is not provided with explanation.
• Fair: Implications of the test's validity and reliability are explained with no supporting review. Two of the results for validity and reliability are not provided with explanation.
• Needs Improvement: Implications of the test's validity and reliability are not explained, and there is no supporting review. Three or more of the validity and reliability results are not provided with explanation.
VI. References:
Detailed description of one IRT model can be read in the work of Magno (2009) titled,
“Demonstrating the Difference between Classical Test Theory and Item Response Theory
Using Derived Test Data” published in the International Journal of Educational and
Psychological Assessment Volume 1. It can be accessed at
https://files.eric.ed.gov/fulltext/ED506058.pdf