There are different factors that affect the reliability of a measure. The reliability of a measure can
be high or low depending on the following factors:
1. The number of items in a test - The more items a test has, the higher the likelihood of reliability. The probability of obtaining consistent scores is high because of the large pool of items.
2. Individual differences of participants - Every participant possesses characteristics that affect their performance in a test, such as fatigue, concentration, innate ability, perseverance, and motivation. These individual factors change over time and affect the consistency of the answers in a test.
3. External environment - The external environment may include room temperature, noise level, depth of instruction, exposure to materials, and quality of instruction, which could affect changes in the responses of examinees in a test.
The different types of reliability are indicated below, along with how each is done. Notice that statistical analysis is needed to determine test reliability.
1. Test-retest
How is this reliability done? Test-retest is applicable for tests that measure stable variables, such as aptitude and psychomotor measures (e.g., a typing test, tasks in physical education).
What statistics is used? Correlate the test scores from the first and the next administration. A significant and positive correlation indicates that the test has temporal stability over time. Correlation refers to a statistical procedure in which a linear relationship is expected between two variables. You may use the Pearson Product Moment Correlation, or Pearson r, because test data are usually in an interval scale (refer to a statistics book for Pearson r).

2. Parallel Forms
How is this reliability done? Parallel forms are applicable if there are two versions of the test. This is usually done when the test is repeatedly used for different groups, such as entrance examinations and licensure examinations. Different versions of the test are given to different groups of examinees.
What statistics is used? Correlate the test results for the first form and the second form. A significant and positive correlation coefficient is expected; it indicates that the responses in the two forms are the same or consistent. Pearson r is usually used for this analysis.

3. Split-Half
How is this reliability done? Split-half is applicable when the test has a large number of items.
What statistics is used? Correlate the two sets of scores using Pearson r. After the correlation, apply another formula called the Spearman-Brown coefficient. The coefficients obtained using Pearson r and Spearman-Brown should be significant and positive to mean that the test has internal consistency reliability.

4. Test of Internal Consistency Using Kuder-Richardson and Cronbach's Alpha
How is this reliability done? This technique works well when the assessment tool has a large number of items. It is also applicable to scales and inventories (e.g., a Likert scale from "strongly agree" to "strongly disagree").
What statistics is used? A statistical analysis called Cronbach's alpha or the Kuder-Richardson is used to determine the internal consistency of the items. A Cronbach's alpha value of 0.60 and above indicates that the test items have internal consistency.

5. Inter-rater Reliability
How is this reliability done? Inter-rater reliability is applicable when the assessment requires multiple raters.
What statistics is used? A statistical analysis called Kendall's coefficient of concordance (Kendall's W) is used to determine whether the ratings provided by multiple raters agree with each other. A significant coefficient indicates that the raters concur or agree with each other in their ratings.
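As a rough sketch (not from the source), the Pearson r and Spearman-Brown computations mentioned above can be illustrated in Python. The score lists are invented sample data for two administrations of the same test.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    num = n * sxy - sx * sy
    den = sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return num / den

def spearman_brown(r_half):
    """Full-test reliability estimated from a half-test correlation."""
    return 2 * r_half / (1 + r_half)

first = [12, 15, 9, 20, 17, 11]    # invented: first administration
second = [13, 14, 10, 19, 18, 12]  # invented: second administration

r = pearson_r(first, second)
print(round(r, 2))                 # a high positive r suggests temporal stability
print(round(spearman_brown(r), 2)) # Spearman-Brown step used in split-half analysis
```

A correlation near 1.0 here would be read as evidence of consistency across the two administrations; in practice the coefficient must also be tested for statistical significance.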
You will notice in the table that statistical analysis is required to determine the reliability of a measure.
The very basis of statistical analysis to determine reliability is the use of linear regression.
1. Linear regression
Linear regression is demonstrated when you have two measured variables, such as two sets of scores on a test taken at two different times by the same participants. When the two sets of scores are plotted in a graph (with an X- and a Y-axis), they tend to form a straight line. The straight line formed by the two sets of scores can produce a linear regression. When a straight line is formed, we can say that there is a correlation between the two sets of scores. This correlation can be seen in the graph shown, which is called a scatterplot. Each point in the scatterplot is a respondent with two scores (one for each test).
Sample graph
Formula:
3. Difference between a positive and a negative correlation
When the value of the correlation coefficient is positive, it means that the higher the scores in X, the higher the scores in Y. This is called a positive correlation. In the case of the two spelling scores, a positive correlation is obtained. When the value of the correlation coefficient is negative, it means that the higher the scores in X, the lower the scores in Y, and vice versa. This is called a negative correlation. When the same test is administered to the same group of participants, a positive correlation usually indicates reliability or consistency of the scores.
Formula:
What is test validity?
A measure is valid when it measures what it is supposed to measure. If a quarterly exam is valid, then its contents should directly measure the objectives of the curriculum. If a scale that measures personality is composed of five factors, then each of the five factors should have items that are highly correlated. If an entrance exam is valid, it should predict students' grades after the first semester.
Content Validity
What it means: The items represent the domain being measured.
How it is established: The items are compared with the objectives of the program. The items need to directly measure the objectives (for achievement tests) or the definition (for scales). A reviewer conducts the checking.

Face Validity
What it means: The test is presented well, free of errors, and administered well.
How it is established: The test items and layout are reviewed and tried out on a small group of respondents. A manual for administration can be made as a guide for the test administrator.

Predictive Validity
What it means: A measure should predict a future criterion. An example is an entrance exam predicting the grades of students after the first semester.
How it is established: A correlation coefficient is obtained where the X-variable is used as the predictor and the Y-variable as the criterion.

Construct Validity
What it means: The components or factors of the test should contain items that are strongly correlated.
How it is established: Pearson r can be used to correlate the items for each factor. However, there is also a technique called factor analysis to determine which items are highly correlated to form a factor.

Concurrent Validity
What it means: Two or more measures that assess the same characteristic are present for each examinee.
How it is established: The scores on the measures should be correlated.

Convergent Validity
What it means: The components or factors of a test are hypothesized to have a positive correlation.
How it is established: Correlation is done for the factors of the test.

Divergent Validity
What it means: The components or factors of a test are hypothesized to have a negative correlation. An example is correlating the scores in a test on intrinsic and extrinsic motivation.
How it is established: Correlation is done for the factors of the test.
How to determine if an item is easy or difficult?
An item is difficult if the majority of students are unable to provide the correct answer. An item is easy if the majority of students are able to answer it correctly. An item can discriminate if the examinees who score high in the test answer more items correctly than the examinees who got low scores.
Below is a dataset of five items on the addition and subtraction of integers.
Follow the procedure to determine the difficulty and discrimination of each item.
1. Get the total score of each student and arrange scores from highest to lowest.
2. Obtain the upper and lower 27% of the group. Multiply 0.27 by the total number of students, and you will get a value of 2.7, which rounds to 3. Get the top three students and the bottom three students based on their total scores. The top three students are students 2, 5, and 5. The bottom three students are students 7, 8, and 4. The rest of the students are not included in the item analysis.
3. Obtain the proportion correct for each item. This is computed for the upper 27% group and the lower 27% group. It is done by summing the correct answers per item and dividing by the number of students in the group.
4. The item difficulty is obtained using the following formula:

Item difficulty = (pH + pL) / 2

where pH is the proportion correct in the upper group and pL is the proportion correct in the lower group.
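The steps above can be sketched in Python. The answer matrix below is invented (1 = correct, 0 = wrong, ten students, five items), and the discrimination index used, pH minus pL, is the commonly taught one rather than a formula stated in the source.

```python
# Invented data: each row is one student's answers to five items (1 = correct)
answers = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
]

# Steps 1-2: rank students by total score, take the upper and lower 27%
k = round(0.27 * len(answers))            # 0.27 * 10 = 2.7, rounds to 3
ranked = sorted(answers, key=sum, reverse=True)
upper, lower = ranked[:k], ranked[-k:]

# Steps 3-4: proportion correct per item in each group, then the indices
for item in range(5):
    pH = sum(s[item] for s in upper) / k
    pL = sum(s[item] for s in lower) / k
    difficulty = (pH + pL) / 2            # formula from the text
    discrimination = pH - pL              # common discrimination index
    print(item + 1, round(difficulty, 2), round(discrimination, 2))
```

A difficulty near 1.0 marks an easy item (most answered it correctly), a value near 0 a difficult one, and a large positive discrimination means the high scorers outperformed the low scorers on that item.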
As you can see in Table 7.1, the test scores are presented as a simple list of raw scores. Raw scores are easy to get because they are the scores obtained from administering a test, a questionnaire, or any inventory or rating scale that measures knowledge, skills, or other attributes of interest. But as presented in the above table, how do these numbers appeal to you? Most likely, they look neither interesting nor meaningful.
Following are some conventions in presenting test data grouped in frequency distribution:
1. As much as possible, the sizes of the class intervals should be equal. Class intervals that are multiples of 5, 10, 100, etc. are often desirable. At times, when large gaps exist in the data and unequal class intervals are used, such intervals may cause inconvenience in the preparation of graphs and the computation of certain descriptive statistical measures. The following formula can be useful in estimating the necessary class interval:

i = (H - L) / C

where
i = size of the class intervals
H = highest test score
L = lowest test score
C = number of classes
2. Start the class interval at a value which is a multiple of the class width. In Table 7.3, we used a class interval of 5, so we start with the class value of 20, which is a multiple of 5 and where the interval 20-24 includes the lowest test score of 21, as seen in Table 7.1.
3. As much as possible, open-ended class intervals should be avoided, e.g., "100 and below" or "150 and above." These will cause some problems in graphing and in the computation of descriptive statistical measures.
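The conventions above can be sketched in Python with invented scores: estimate the class width with i = (H - L) / C, round it up to a convenient value, start the first interval at a multiple of the width, and tally the frequencies.

```python
import math

# Invented raw test scores
scores = [21, 24, 28, 30, 33, 35, 35, 38, 41, 44, 45, 47, 50, 52, 55]

H, L, C = max(scores), min(scores), 7     # aim for about 7 classes
i = math.ceil((H - L) / C)                # (55 - 21) / 7 -> width of 5
start = (L // i) * i                      # 20, a multiple of the width

# Build the grouped frequency distribution
freq = {}
lower = start
while lower <= H:
    upper = lower + i - 1                 # intervals 20-24, 25-29, ...
    freq[f"{lower}-{upper}"] = sum(lower <= s <= upper for s in scores)
    lower += i

print(freq)
```

The resulting dictionary is the frequency table itself; cumulative frequencies (for a cumulative frequency polygon) would just be a running sum over these counts.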
How do we present test data graphically?
You must be familiar with the saying, "A picture is worth a thousand words." In a similar vein, a graph can be worth a hundred or a thousand numbers. The use of tables may not be enough to give a clear picture of the properties of a group of test scores.
There are many types of graphs, but the more common methods of graphing a frequency distribution are the following:
1. Histogram. A histogram is a type of graph appropriate for quantitative data such as test scores. This graph consists of columns: each has a base that represents one class interval, and its height represents the number of observations, or simply the frequency, in that class interval.
Sample graph
2. Frequency Polygon. This is also used for quantitative data, and it is one of the most commonly used graphs for presenting test scores. It is the line-graph counterpart of the histogram.
Sample graph
3. Cumulative Frequency Polygon. This graph is quite different from a frequency polygon because the cumulative frequencies are plotted. In addition, each point is plotted above the exact upper limit of its interval. As such, a cumulative frequency polygon gives a picture of the number of observations that fall below a certain score instead of the frequency within a class interval.
Sample graph
4. Bar Graph. This graph is often used to present frequencies in categories of a qualitative variable. It looks very similar to a histogram and is constructed in the same manner, but spaces are placed between consecutive bars. The columns represent the categories, and the height of each bar, as in a histogram, represents the frequency. If experimental data are graphed, the independent variable in categories is usually plotted on the x-axis while the dependent variable, the test score, is on the y-axis.
Sample graph
5. Box-and-Whisker Plots. This is a very useful graph depicting the distribution of test scores through their quartiles. The first quartile, Q1, is the point on the test scale below which 25% of the scores lie. The second quartile is the median, which divides the scores into the upper 50% and lower 50%. The third quartile, Q3, is the point above which 25% of the scores lie. The data on the test scores of 100 college students produced this image using the box-plot approach.
Sample graph
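As a brief sketch, the three quartiles that a box-and-whisker plot depicts can be computed with Python's standard `statistics` module; the score list below is invented.

```python
import statistics

# Invented test scores for 16 students
scores = [55, 62, 64, 68, 70, 71, 73, 75, 78, 80, 82, 85, 88, 90, 95, 98]

# quantiles(..., n=4) returns the three cut points Q1, Q2, Q3
q1, q2, q3 = statistics.quantiles(scores, n=4)
print(q1, q2, q3)  # 25% of scores lie below Q1; Q2 is the median
```

The box in the plot spans Q1 to Q3 (the interquartile range), with a line at the median Q2 and whiskers extending toward the extreme scores.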
6. Pie Graph. One commonly used method to represent categorical data is the use of a circle graph. You have learned in basic mathematics that there are 360° in a full circle. As such, the categories can be represented by slices of the circle that appear like a pie; thus, the name pie graph. The size of each slice is determined by the percentage of students who belong in each category.
Sample graph
What are the purposes of grading and reporting learners' test performance?
There are various reasons why we assign grades and report learners' test performance. Grades are alphabetical or numerical symbols/marks that indicate the degree to which learners are able to achieve the intended learning outcomes. Grades do not exist in a vacuum but are part of the instructional process and serve as a feedback loop between the teacher and learners.
Grades also give the parents, who have the greatest stake in learners' education, information about their children's achievements. They provide teachers some bases for improving their teaching and learning practices and for identifying learners who need further educational intervention.
Number Right Scoring (NR) - entails assigning positive values only to correct answers while giving a score of zero to incorrect answers. The test score is the sum of the scores for correct responses.
Negative Marking (NM) - entails assigning positive values to correct answers while penalizing learners for incorrect responses (i.e., the right-minus-wrong correction method). In this model, a fraction of the number of wrong answers is subtracted from the number of correct answers.
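The contrast between the two methods can be sketched in Python. The answer key and responses are invented, and the 1/(k-1) penalty for a k-option item is the usual right-minus-wrong correction, assumed here rather than taken from the source.

```python
def number_right(key, responses):
    """NR: one point per correct answer, zero otherwise."""
    return sum(r == k for k, r in zip(key, responses))

def negative_marking(key, responses, options=4):
    """NM: subtract a fraction of the wrong answers (omits are not penalized)."""
    right = number_right(key, responses)
    wrong = sum(r is not None and r != k for k, r in zip(key, responses))
    return right - wrong / (options - 1)

key = ["a", "c", "b", "d", "a"]          # invented answer key
responses = ["a", "c", "d", "d", None]   # None = omitted item

print(number_right(key, responses))      # 3 correct answers
print(negative_marking(key, responses))  # 3 - 1/3 for the one wrong answer
```

Note that under NM an omitted item costs nothing, while a wrong guess does; this is precisely what discourages blind guessing.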
Partial Credit Scoring Methods - attempt to determine a learner's degree or level of knowledge with respect to each response option given. This method of scoring takes into account learners' partial knowledge or mastery; it acknowledges that while learners cannot always recognize the correct response, they can discern that some response options are clearly incorrect. There are three formats of the partial credit scoring method.
•Liberal Multiple-Choice Test - It allows learners to select more than one answer to a question if they feel uncertain which option or alternative is correct.
•Elimination Testing (ET)- It instructs learners to cross out all alternatives they consider to be incorrect.
•Confidence Weighting (CW)- It asks learners to indicate what they believe is the correct answer and
how confident they are about their choice.
Multiple Answers Scoring Method - allows learners to give multiple answers for each item. With this method, learners are instructed that each item has at least one correct answer, or are told how many answers are correct. An item can be scored as solved only if all the correct response options are marked and none of the incorrect ones. Incorrect options that are marked can lead to a negative score.
Retrospective Correcting for Guessing - considers omitted or unanswered items as incorrect, forcing learners to give an answer for every item even if they do not know the correct answer. The correction for guessing is implemented later, or retroactively.
Standard-Setting - entails using standards when scoring multiple-choice items, particularly standards set through norm-referenced or criterion-referenced assessment. Standards based on norm-referenced assessment are derived from the test performance of a certain group of learners, while standards from criterion-referenced assessment are based on preset standards specified from the very start by the teacher or the school in general.
Holistic Scoring - involves giving a single, overall assessment for an essay, writing composition, or other performance-type assessment as a whole.
Analytic Scoring - involves assessing each aspect of a performance task and assigning a score for each criterion.
Primary Trait Scoring - focuses on only one aspect or criterion of a task, and a learner's performance is evaluated on only one trait. This scoring system defines a primary trait in the task that will then be scored.
Multiple-Trait Scoring - requires that an essay test or performance task is scored on more than one
aspect, with scoring criteria in place so that they are consistent with the prompt.
3.1 Pass or Fail Grade. This type of score is most appropriate if the test or assessment is used primarily or entirely to make a pass-or-fail decision. In this type of scoring, a standard or cut-off score is preset, and a learner is given a score of Pass if he or she surpasses the expected level of performance or the cut-off score. Pass or Fail scoring is most appropriate for comprehensive or licensure exams because there is no limit to the number of examinees who can pass or fail.
3.2 Letter Grade. This is one of the most commonly used grading systems. Letter grades are usually composed of a five-level grading scale labeled from A to E or F, with A representing the highest level of achievement or performance and E or F, the lowest grade, representing a failing grade. These are often used for all forms of learners' work, such as quizzes, essays, projects, and assignments.
3.3 Plus (+) and Minus (-) Letter Grades. This grading system provides a more detailed description of the level of learners' achievement or task performance by dividing each grade category into three levels, such that a grade of A can be assigned as A+, A, or A-; a grade of B as B+, B, or B-; and so on. Plus (+) and minus (-) grades provide a finer discrimination between achievement or performance levels.
3.4 Categorical Grades. This system of grading is generally more descriptive than letter grades, especially if coupled with verbal labels. Verbal labels eliminate the need for a key or legend to explain what each grade category means.
4. Norm-Referenced Grading System. In this method of grading, learners' scores are compared with those of their peers. Norm-referenced grading involves rank-ordering learners and expressing a learner's score in relation to the achievement of the rest of the group (class, grade level, etc.).
4.1 Developmental Score. This is a score that has been transformed from a raw score and reflects the average performance at a given age or grade level. There are two kinds of developmental scores: (1) grade-equivalent and (2) age-equivalent scores.
4.1.1 Grade-Equivalent Score is described as both a growth score and a status score.
4.1.2 Age-Equivalent Score indicates the age level at which a learner typically obtains such a raw score. It reflects a learner's performance in terms of chronological age as compared with those in the norm group.
4.2 Percentile Rank. This indicates the percentage of scores that fall at or below a given score.
4.3 Stanine Score. This system expresses test results in nine equal steps, which range from one (lowest) to nine (highest).
4.4 Standard Scores. These are raw scores converted into a common scale of measurement that provides a meaningful description of the individual scores within the distribution.
4.4.1 Z-score is one type of standard score. Z-scores have a mean of 0 and a standard deviation of 1.
4.4.2 T-score is another type of standard score, wherein the mean is equal to 50 and the standard deviation is equal to 10.
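A minimal sketch of these norm-referenced conversions, using invented scores: z = (X - mean) / SD, T = 50 + 10z, and, as an assumed convention not stated in the source, stanine approximated as 5 + 2z rounded and clipped to the 1-9 range.

```python
import statistics

scores = [60, 65, 70, 75, 80, 85, 90]   # invented raw scores

def z_score(x, mean, sd):
    """Standard score: distance from the mean in SD units."""
    return (x - mean) / sd

mean = statistics.mean(scores)
sd = statistics.pstdev(scores)          # population SD of this group

for x in scores:
    z = z_score(x, mean, sd)
    t = 50 + 10 * z                     # T-score: mean 50, SD 10
    stanine = min(9, max(1, round(5 + 2 * z)))  # assumed approximation
    print(x, round(z, 2), round(t, 1), stanine)
```

A raw score equal to the group mean thus maps to z = 0, T = 50, and stanine 5, which is what makes these scales directly comparable across different tests.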
1. Stick to the purpose of the assessment. Before coming up with an assessment, it is first important to determine the purpose of the test.
2. Be guided by the desired learning outcomes. The learners should be informed early on what is expected of them insofar as learning outcomes are concerned, as well as how they will be assessed and graded in the test.
3. Develop grading criteria. Grading criteria to be used in traditional tests, and performance tasks should
be made clear to the students.
4. Inform learners what scoring methods are to be used. Learners should be made aware, before the start of testing, whether their test responses are to be scored based on the number right, negative marking, or through non-conventional scoring methods. As such, the learners will be guided in marking their responses during the test.
5. Decide on what type of test scores to use. As discussed earlier, there are different ways by which students' learning can be measured and presented.
1. Identify the criteria for rating the essay. The criteria or standards for evaluating the essay should be
predetermined.
2. Determine the type of rubric to use. There are two basic types of rubric: the holistic and the analytic scoring systems. Holistic rubrics require evaluating the essay while taking into consideration all the criteria.
3. Prepare the rubric. In developing a rubric, the skills and competencies related to essay writing should first be identified.
4. Evaluate essays anonymously. Checking essays should be done anonymously; it is important that the rater does not identify whose paper he/she is rating.
5. Score one essay question at a time. This is to ensure that the same thinking and standards are applied for all learners in the class.
6. Be conscious of your own biases when evaluating a paper. The rater should not be affected by learners' handwriting, writing style, length of responses, and other factors. He/she should stick to the criteria included in the rubric when evaluating the essay.
7. Review initial scores and comments before giving the final rating. This is important especially for
essays that were initially given a barely passing or failing grade.
8. Get two or more raters for essays that are high-stakes, such as those used for admission, placement, or scholarship screening purposes. The final grade will be the average of all the ratings given.
9. Write comments next to the learner's responses to provide feedback on how well one has performed in
the essay test.