Course Description
Chapter 1 – Basic Concepts in Assessment
Chapter 2 – Principles of High Quality Assessment
Chapter 1 – Basic Concepts in Assessment
At the end of this chapter, the students will be able to:
b. Assessment OF learning
It is usually given at the end of a unit, grading period or a term like
a semester. It is meant to assess learning for grading purposes,
thus the term assessment of learning.
c. Assessment AS learning
It is associated with self-assessment. As the term implies,
assessment by itself is already a form of learning for the students.
As students assess their own work (e.g. a paragraph) and/or with
their peers with the use of scoring rubrics, they learn on their own
what a good paragraph is. At the same time, as they are engaged in
self-assessment, they learn about themselves as learners and become
aware of how they learn. In short, in assessment AS learning,
students set their targets, actively monitor and evaluate their own
learning in relation to their set target. As a consequence, they
become self-directed or independent learners. By assessing their
own learning, they are learning at the same time.
The relationship among the three can be summarized as follows:

ASSESSMENT
• Assessment FOR learning – placement assessment, diagnostic
  assessment, formative assessment
• Assessment OF learning – summative assessment
• Assessment AS learning – self-assessment
Other terms in assessment include:
Level 4. Analysis refers to the breaking down of a concept or idea into its
components and explaining the concept as a composition of these components.
2.2.3 Performance Tests
One of the most frequently used measurement instruments is the
checklist. A performance checklist consists of a list of behaviors that
make up a certain type of performance. It is used to determine whether or
not an individual behaves in a certain way when asked to complete a
particular task. If a particular behavior is present when an individual is
observed, the teacher places a check opposite it on the list.
diagnose or to appraise the performance of students from the point of view
of the students themselves.
2.3.1 Validity
Another type of criterion-related validity is called predictive validity,
wherein the test scores in the instrument are correlated with scores
on a later performance (criterion measure) of the students. For
example, scores on a mathematics ability test constructed by the teacher
may be correlated with the students' later performance in a division-wide
mathematics achievement test.
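To make the computation concrete, here is a minimal sketch of estimating a
predictive validity coefficient as a Pearson correlation. The scores are
hypothetical, invented only for illustration.

import numpy as np

# Hypothetical data: scores on a teacher-made mathematics ability test
# (predictor) and on a later division-wide achievement test (criterion).
ability   = [70, 72, 75, 77, 78, 80, 84, 87, 90, 92]
criterion = [68, 74, 71, 80, 75, 83, 82, 88, 87, 94]

# The predictive validity coefficient is the correlation between the two.
r = np.corrcoef(ability, criterion)[0, 1]
print(f"Predictive validity coefficient: r = {r:.2f}")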
Another type of validity is face validity, the extent to which a test is
subjectively viewed as covering the concept it tries to measure.
2.3.2 Reliability
Reliability refers to the consistency of the scores obtained – how
consistent they are for each individual from one administration of an
instrument to another and from one set of items to another.
Reliability and validity are related concepts. If an instrument is
unreliable, it cannot yield valid outcomes. As reliability improves,
validity may or may not improve; however, if an instrument is shown
scientifically to be valid, then it is almost certain that it is also
reliable.
Something reliable is something that works well and that you can trust.
A reliable test is a consistent measure of what it is supposed to
measure.
The following table is a standard followed almost universally in educational
test and measurement.
Reliability         Interpretation
0.90 and above      Excellent reliability; at the level of the best
                    standardized tests
0.80 – 0.90         Very good for a classroom test
0.70 – 0.80         Good for a classroom test; in the range of most.
                    There are probably a few items which could be improved.
0.60 – 0.70         Somewhat low. This test needs to be supplemented by
                    other measures (more tests) to determine grades. There
                    are probably some items which could be improved.
Methods for Estimating Reliability
The selection of a method for estimating reliability depends on the nature
of the test. Each method not only entails different procedures but is also
affected by different sources of error. For many tests, more than one
method should be used.
a. Test-retest Reliability - The test-retest method for estimating
reliability involves administering the same test to the same group of
examinees on two different occasions and then correlating the two sets of
scores.
When using this method, the reliability coefficient indicates the degree of
stability (consistency) of examinees' scores over time and is also known
as the coefficient of stability.
The primary sources of measurement error for test-retest reliability are any
random factors related to the time that passes between the two
administrations of the test. These time sampling factors include random
fluctuations in examinees over time (e.g., changes in anxiety or
motivation) and random variations in the testing situation.
Memory and practice also contribute to error when they have random
carryover effects; i.e., when they affect many or all examinees but not in
the same way.
Test-retest reliability is appropriate for determining the reliability of tests
designed to measure attributes that are relatively stable over time and that
are not affected by repeated measurement. It would be appropriate for a
test of aptitude, which is a stable characteristic, but not for a test of mood,
since mood fluctuates over time, or a test of creativity, which might be
affected by previous exposure to test items.
b. Alternate (Equivalent, Parallel) Forms Reliability
To assess a test's alternate forms reliability, two equivalent forms of the
test are administered to the same group of examinees and the two sets of
scores are correlated.
Alternate forms reliability indicates the consistency of responding to
different item samples
(the two test forms) and, when the forms are administered at different
times, the consistency of responding over time.
The alternate forms reliability coefficient is also called the coefficient of
equivalence when the two forms are administered at about the same time,
and the coefficient of equivalence and stability when a relatively long
period of time separates administration of the two forms.
The primary source of measurement error for alternate forms reliability is
content sampling, or error introduced by an interaction between different
examinees' knowledge and the different content assessed by the items
included in the two forms (e.g., Form A and Form B). The items in Form A
might be a better match of one examinee's knowledge than items in Form
B, while the opposite is true for another examinee.
In this situation, the two scores obtained by each examinee will differ,
which will lower the alternate forms reliability coefficient. When
administration of the two forms is separated by a period of time, time
sampling factors also contribute to error.
Like test-retest reliability, alternate forms reliability is not appropriate when
the attribute measured by the test is likely to fluctuate over time (and the
forms will be administered at different times) or when scores are likely to
be affected by repeated measurement.
If the same strategies required to solve problems on Form A are used to
solve problems on Form B, even if the problems on the two forms are not
identical, there are likely to be practice effects. When these effects differ
for different examinees (i.e., are random), practice will serve as a source
of measurement error.
Although alternate forms reliability is considered by some experts to be the
most rigorous (and best) method for estimating reliability, it is not often
assessed due to the difficulty in developing forms that are truly equivalent.
c. Internal Consistency Reliability
Split-half reliability and coefficient alpha are two methods for evaluating
internal consistency. Both involve administering the test once to a single
group of examinees, and both yield a reliability coefficient that is also
known as the coefficient of internal consistency.
To determine a test's split-half reliability, the test is split into equal halves
so that each examinee has two scores (one for each half of the test).
Scores on the two halves are then correlated. Tests can be split in several
ways, but probably the most common way is to divide the test on the basis
of odd- versus even-numbered items.
A problem with the split-half method is that it produces a reliability
coefficient that is based on test scores that were derived from one-half of
the entire length of the test. If a test contains 30 items, each score is
based on 15 items. Because reliability tends to decrease as the length of a
test decreases, the split-half reliability coefficient usually underestimates a
test's true reliability.
For this reason, the split-half reliability coefficient is ordinarily corrected
using the
Spearman-Brown prophecy formula, which provides an estimate of
what the reliability coefficient would have been had it been based on
the full length of the test.
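As an illustration of the procedure, here is a minimal sketch in Python. It
assumes item scores are held in a NumPy array with one row per examinee; the
function name and the odd-versus-even splitting rule follow the description
above.

import numpy as np

def split_half_reliability(item_scores):
    """item_scores: one row per examinee, one column per item (e.g., 0/1).
    Returns the Spearman-Brown corrected split-half reliability."""
    odd_half  = item_scores[:, 0::2].sum(axis=1)   # items 1, 3, 5, ...
    even_half = item_scores[:, 1::2].sum(axis=1)   # items 2, 4, 6, ...
    r_half = np.corrcoef(odd_half, even_half)[0, 1]
    # Correct for the fact that each half is only half the test's length.
    return 2 * r_half / (1 + r_half)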
Cronbach's coefficient alpha also involves administering the test
once to a single group of examinees. However, rather than splitting the
test in half, a special formula is used to determine the average degree of
inter-item consistency.
One way to interpret coefficient alpha is as the average reliability that
would be obtained from all possible splits of the test. Coefficient alpha
tends to be conservative and can be considered the lower boundary of a
test's reliability (Novick and Lewis, 1967).
When test items are scored dichotomously (right or wrong), a variation of
coefficient alpha known as the Kuder-Richardson Formula 20 (KR-20)
can be used.
The Kuder-Richardson formulas are the more frequently employed formulas
for determining internal consistency, particularly KR20 and KR21. We
present the latter formula (KR21), since KR20 is more difficult to
calculate and requires a computer program:
r_{KR21} = \frac{K}{K-1}\left(1 - \frac{M(K - M)}{K s^2}\right)

where
K = the number of items in the test
M = the mean of the test scores
s^2 = the variance of the test scores
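Because KR-21 needs only the number of items, the mean, and the variance of
the total scores, it can be computed without item-level data. Below is a
minimal sketch; the function name and the summary statistics are made up for
illustration.

def kr21(k, mean, variance):
    """Kuder-Richardson Formula 21 for a test of k dichotomous items."""
    return (k / (k - 1)) * (1 - mean * (k - mean) / (k * variance))

# Illustrative values: a 40-item test with mean 27 and variance 36.
print(round(kr21(40, 27, 36), 2))  # about 0.78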
Coefficient alpha could be expected to be smaller for a 200-item test that
contains items assessing knowledge of test construction, statistics, ethics,
epidemiology, environmental health, social and behavioral sciences,
rehabilitation counseling, etc. than for a 200-item test that contains
questions on test construction only.
The methods for assessing internal consistency reliability are useful when
a test is designed to measure a single characteristic, when the
characteristic measured by the test fluctuates over time, or when scores
are likely to be affected by repeated exposure to the test. They are not
appropriate for assessing the reliability of speed tests because, for these
tests, they tend to produce spuriously high coefficients. (For speed tests,
alternate forms reliability is usually the best choice.)
d. Inter-Rater (Inter-scorer, Inter-Observer) Reliability
Inter-rater reliability is of concern whenever test scores depend on a
rater's judgment.
A test constructor would want to make sure that an essay test, a
behavioral observation scale, or a projective personality test has
adequate inter-rater reliability. This type of reliability is assessed either by
calculating a correlation coefficient (e.g., a kappa coefficient or coefficient
of concordance) or by determining the percent agreement between two or
more raters.
Although the latter technique is frequently used, it can lead to erroneous
conclusions since it does not take into account the level of agreement that
would have occurred by chance alone. This is a particular problem for
behavioral observation scales that require raters to record the frequency
of a specific behavior.
In this situation, the degree of chance agreement is high whenever the
behavior has a high rate of occurrence, and percent agreement will
provide an inflated estimate of the measure's reliability.
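The contrast between percent agreement and a chance-corrected index can be
seen in a short sketch. In the hypothetical ratings below, the behavior is
recorded as present on almost every interval, so the two raters agree 80% of
the time, yet Cohen's kappa comes out near zero (in fact slightly negative)
because that much agreement is expected by chance alone. The function names
and data are illustrative assumptions.

def percent_agreement(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohen_kappa(a, b):
    """Chance-corrected agreement between two raters."""
    n = len(a)
    p_o = percent_agreement(a, b)                   # observed agreement
    p_e = sum((a.count(c) / n) * (b.count(c) / n)   # agreement expected
              for c in set(a) | set(b))             # by chance alone
    return (p_o - p_e) / (1 - p_e)

# 1 = behavior observed in an interval, 0 = not observed (hypothetical).
rater1 = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
rater2 = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]
print(percent_agreement(rater1, rater2))      # 0.8
print(round(cohen_kappa(rater1, rater2), 2))  # -0.11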
Sources of error for inter-rater reliability include factors related to the
raters such as lack of motivation and rater biases and characteristics of
the measuring device.
An inter-rater reliability coefficient is likely to be low, for instance, when
rating categories are not exhaustive (i.e., don't include all possible
responses or behaviors) and/or are not mutually exclusive.
The inter-rater reliability of a behavioral rating scale can also be affected
by consensual observer drift, which occurs when two (or more) observers
working together influence each other's ratings so that they both assign
ratings in a similarly idiosyncratic way. (Observer drift can also affect a
single observer's ratings when he or she assigns ratings in a consistently
deviant way.) Unlike other sources of error, consensual observer drift
tends to artificially inflate inter-rater reliability.
The reliability (and validity) of ratings can be improved in several ways:
• Consensual observer drift can be eliminated by having raters
work independently or by alternating raters.
• Rating accuracy is also improved when raters are told that
their ratings will be checked.
• Overall, the best way to improve both inter- and intra-rater
accuracy is to provide raters with training that emphasizes
the distinction between observation and interpretation.
Factors that affect the Reliability Coefficient
The magnitude of the reliability coefficient is affected not only by the
sources of error discussed earlier, but also by the length of the test, the
range of the test scores, and the probability that the correct response to
items can be selected by guessing.
a. Test Length - The larger the sample of the attribute being
measured by a test, the less the relative effects of measurement
error and the more likely the sample will provide dependable,
consistent information.
Consequently, a general rule is that the longer the test, the larger the test's
reliability coefficient.
The Spearman-Brown prophecy formula is most associated with split-half
reliability but can actually be used whenever a test developer wants to
estimate the effects of lengthening or shortening a test on its reliability
coefficient.
For instance, if a 100-item test has a reliability coefficient of .84, the
Spearman-Brown formula could be used to estimate the effects of
increasing the number of items to 150 or reducing the number to 50. A
problem with the Spearman-Brown formula is that it does not always yield
an accurate estimate of reliability: In general, it tends to overestimate a
test's true reliability. This is most likely to be the case when the added
items do not measure the same content domain as the original items
and/or are more susceptible to the effects of measurement error.
Note that, when used to correct the split-half reliability coefficient, the
situation is more complex, and this generalization does not always apply:
When the two halves are not equivalent in terms of their means and
standard deviations, the Spearman-Brown formula may either over- or
underestimate the test's actual reliability.
The Spearman-Brown prophecy formula is:

r_{kk} = \frac{k \, r_{11}}{1 + (k - 1) \, r_{11}}

where
r_{kk} = reliability of a test "k" times as long as the original test
r_{11} = reliability of the original test
k = ratio of the new test length to the original test length

Example:
A test made up of 12 items has reliability (r_{11}) of 0.68. If the number of
items is doubled to 24, will the reliability of the test improve?

Solution: r_{11} = 0.68, k = 24 / 12 = 2

So,

r_{kk} = \frac{2(0.68)}{1 + (2 - 1)(0.68)} = \frac{1.36}{1.68} \approx 0.81

Yes, the estimated reliability improves from 0.68 to about 0.81.
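The computation is simple enough to script. A small sketch, reusing the
example above and the earlier 100-item scenario:

def spearman_brown(r11, k):
    """Estimated reliability of a test k times as long as the original."""
    return k * r11 / (1 + (k - 1) * r11)

print(round(spearman_brown(0.68, 2.0), 2))  # 0.81: doubling the 12-item test
print(round(spearman_brown(0.84, 1.5), 2))  # 0.89: 100 items lengthened to 150
print(round(spearman_brown(0.84, 0.5), 2))  # 0.72: 100 items shortened to 50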
2.3.3 Fairness
An assessment procedure needs to be fair. This means many things:
First, students need to know exactly what the learning targets are and
what method of assessment will be used. If students do not know what
they are supposed to be achieving, then they could get lost in the maze of
concepts being discussed in class. Likewise, students have to be informed
how their progress will be assessed in order to allow them to strategize
and optimize their performance.
result of the assessment or testing?" Naturally, no teacher would want this
to happen to any of his/her students.
Test results and assessment results are confidential. They should be
known only by the student concerned and the teacher. Results should
be communicated to the students in such a way that other students would
not be in possession of information pertaining to any specific member of
the class.
Chapter 3 – Measures of Central Tendency and Variability
3.1 Introduction
A measure of central tendency is a single value that attempts to
describe a set of data (like scores) by identifying the central position
within that set of data or scores. As such, measures of central tendency
are sometimes called measures of central location.
Central Tendency refers to the center of a distribution of observations.
Where do scores tend to congregate? In a test of 100 items, where are
most of the scores? Do they tend to group around a score of 50 or
around 80?
The mean, median and mode are all valid measures of central tendency,
but under different conditions, one measure becomes more appropriate
than the others. For example, if there are extremely high or extremely
low scores, the median is a better measure of central tendency, since
the mean is affected by extremely high and extremely low scores.
Mean
The mean or average or arithmetic mean is the most popular and most
well-known measure of central tendency. The mean is equal to the sum
of all the values in the data set divided by the number of values in the data
set. For example, 10 students in a Graduate School class got the following
scores in a 100-item test: 70, 72, 75, 77, 78, 80, 84, 87, 90 and 92.
The mean score of the group of 10 students is the sum of all their scores
divided by 10. The mean, therefore, is 805/10 equals 80.5.
80.5 is the average score of the group. There are 6 scores below the
average score (mean) of the group (70, 72, 75, 77, 78 and 80) and there
are 4 scores above the mean of the group (84, 87, 90 and 92).
Now consider the following scores of ten Grade 12 students:

5 38 56 60 67 70 73 78 79 95

The mean score for these ten Grade 12 students is 62.1. However,
inspecting the raw data suggests that this mean may not be the best way
to accurately reflect the score of the typical Grade 12 student, as most
students have scores in the 56 to 79 range. The mean is being skewed by
the extremely low and extremely high scores. Therefore, in this situation,
we would like to have a better measure of central tendency. As we will find
out later, taking the median would be a better measure of central tendency
in this situation.
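A quick sketch with Python's statistics module makes the contrast concrete:

from statistics import mean, median

scores = [5, 38, 56, 60, 67, 70, 73, 78, 79, 95]
print(mean(scores))    # 62.1 -- pulled toward the extreme scores
print(median(scores))  # 68.5 -- closer to the typical student's score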
Median
The median is the middle score for a set of scores arranged from lowest
to highest. The median is less affected by extremely low and extremely
high scores. Consider the following eleven scores:
65 55 89 56 35 14 56 55 87 45 92
To determine the median, first we have to rearrange the scores into order
of magnitude (from smallest to largest).
14 35 45 55 55 56 56 65 87 89 92
Our median is the score at the middle of the distribution. In this case 56 is
the middle score. There are 5 scores before it and 5 scores after it. This
works fine when you have an odd number of scores, but what happens
when you have an even number of scores? What if you have 10 scores
like the scores below?

65 55 89 56 35 14 56 55 87 45

In that case, we arrange the scores in order (14 35 45 55 55 56 56 65
87 89) and take the average of the two middle scores, 55 and 56, giving
a median of 55.5.
Mode

The mode is the score that occurs most frequently in a distribution.
Consider again the ordered scores:

14 35 45 55 55 56 56 65 87 89 92

There are two most frequent scores, 55 and 56, so we have a score
distribution with two modes, hence a bimodal distribution.
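Python's statistics.multimode returns every value tied for the highest
frequency, which makes bimodal distributions easy to spot; a quick sketch:

from statistics import multimode

scores = [14, 35, 45, 55, 55, 56, 56, 65, 87, 89, 92]
print(multimode(scores))  # [55, 56] -- two modes: a bimodal distribution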
Mean

For a frequency distribution (grouped data), the computing formula for
the mean is:

\bar{x} = \frac{\sum f x_m}{n}

where
f = the frequency of each class
x_m = the midpoint of each class
n = the sample size (total frequency)

To be able to apply the formula for the mean of grouped data, we shall
follow the steps below:

Step 1. Get the midpoint of each class.
Step 2. Multiply each frequency by the corresponding midpoint.
Step 3. Add the products obtained in Step 2.
Step 4. Divide the result in Step 3 by the sample size.
Median
Just like the mean, the computation of the value of the median is done
through interpolation. The procedure requires the construction of the less
than cumulative frequency column (<cf). The first step in finding the value
of the median is to divide the total number of frequencies by 2. This is
consistent with the definition of the median. The value n/2 shall be used
to determine the cumulative frequency before the median class, denoted by
cf_b. cf_b refers to the highest value under the <cf column that is less
than n/2. The median class refers to the interval that contains the median,
that is, where the (n/2)-th value is located. Hence, among the entries
under the <cf column which are greater than cf_b, the smallest belongs to
the median class. If a distribution contains an interval where the
cumulative frequency is exactly n/2, the upper boundary of that class will
be the median and no interpolation is needed.

After identifying the median class, we shall approximate the position of
the median within the median class. This approximation is done by
subtracting the value of cf_b from n/2. The difference is then divided by
the frequency of the median class and multiplied by the size of the class
interval. The result is added to the lower boundary of the median class
to get the median of the distribution.

The computing formula for the median for grouped data is given below:

\tilde{x} = L_m + \left(\frac{n/2 - cf_b}{f_m}\right) i

where
L_m = the lower boundary of the median class
cf_b = the cumulative frequency before the median class
f_m = the frequency of the median class
i = the size of the class interval
To be able to apply the formula for the median for grouped data, we shall
follow the steps below:
Step 1. Get n/2.
Step 2. Determine the value of cf_b.
Step 3. Determine the median class.
Step 4. Determine the lower boundary and the frequency of the median
class and the size of the class interval.
Step 5. Substitute the values obtained in Steps 1 – 4 into the formula.
Round off the final result to two decimal places.
Mode
The mode for grouped data is found by interpolating within the modal
class, the interval with the highest frequency. The computing formula
for the value of the mode for grouped data is:

\hat{x} = L_{mo} + \left(\frac{d_1}{d_1 + d_2}\right) i

where
L_{mo} = the lower boundary of the modal class
d_1 = the difference between the frequency of the modal class and the
      frequency of the class preceding it
d_2 = the difference between the frequency of the modal class and the
      frequency of the class following it
i = the size of the class interval
To be able to apply the formula for the mode for grouped data, we shall
consider the following steps:

Step 1. Determine the modal class (the class with the highest frequency).
Step 2. Compute d_1 and d_2.
Step 3. Determine the lower boundary of the modal class and the size of
the class interval.
Step 4. Substitute the values obtained in Steps 1 – 3 into the formula.
Try this!
To be able to compute the value of the mean, we shall follow the steps
discussed earlier.
Step 1. Get the midpoint of each class. The midpoints are shown in the 3rd
column.
Scores     f     x_m
11–22      3    16.5
23–34      5    28.5
35–46     11    40.5
47–58     19    52.5
59–70     14    64.5
71–82      6    76.5
83–94      2    88.5
Step 2. Multiply each frequency by the corresponding midpoint. The
products (fx_m) are shown in the extended table below.
Step 3. Add the products: \sum f x_m = 3174.
Step 4. Divide the result in Step 3 by the sample size. The result is the
mean of the distribution. Hence,

\bar{x} = \frac{3174}{60} = 52.90
To compute for the median, we shall construct the less than cumulative
frequency (<cf) column. We can use the existing table from when we
solved for the mean.
Scores     f     x_m     fx_m     <cf
11–22      3    16.5     49.5       3
23–34      5    28.5    142.5       8
35–46     11    40.5    445.5      19
47–58     19    52.5    997.5      38   (Median class)
59–70     14    64.5    903.0      52
71–82      6    76.5    459.0      58
83–94      2    88.5    177.0      60
Step 1. n/2 = 60/2 = 30.
Step 2. cf_b = 19, the highest <cf value that is less than 30.
Step 3. The median class is 47–58, the interval that contains the 30th
score.
Step 4. L_m = 46.5, f_m = 19, and i = 12.
Step 5.

\tilde{x} = 46.5 + \left(\frac{30 - 19}{19}\right)(12) = 46.5 + 6.95 = 53.45
To compute for the mode, we can still use the existing table.
Scores     f     x_m     fx_m     <cf
11–22      3    16.5     49.5       3
23–34      5    28.5    142.5       8
35–46     11    40.5    445.5      19
47–58     19    52.5    997.5      38   (Modal class)
59–70     14    64.5    903.0      52
71–82      6    76.5    459.0      58
83–94      2    88.5    177.0      60

The modal class is 47–58, the class with the highest frequency (19).
Hence, d_1 = 19 - 11 = 8, d_2 = 19 - 14 = 5, L_{mo} = 46.5, and i = 12, so

\hat{x} = 46.5 + \left(\frac{8}{8 + 5}\right)(12) = 46.5 + 7.38 = 53.88
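To check the three grouped-data results at once, the sketch below recomputes
them from the frequency table. It assumes, as in the solutions above, that
class boundaries sit 0.5 below and above the stated class limits (e.g.,
47–58 has lower boundary 46.5).

# Frequency table: (lower limit, upper limit, frequency)
classes = [(11, 22, 3), (23, 34, 5), (35, 46, 11), (47, 58, 19),
           (59, 70, 14), (71, 82, 6), (83, 94, 2)]
n = sum(f for _, _, f in classes)              # 60
i = classes[0][1] - classes[0][0] + 1          # class size = 12

# Mean: sum of frequency times midpoint, divided by n.
mean = sum(f * (lo + hi) / 2 for lo, hi, f in classes) / n

# Median: interpolate within the class containing the (n/2)-th score.
cf = 0
for lo, hi, f in classes:
    if cf + f >= n / 2:
        median = (lo - 0.5) + (n / 2 - cf) / f * i
        break
    cf += f

# Mode: interpolate within the modal class (highest frequency).
# (Assumes the modal class has a neighboring class on each side.)
m = max(range(len(classes)), key=lambda j: classes[j][2])
d1 = classes[m][2] - classes[m - 1][2]
d2 = classes[m][2] - classes[m + 1][2]
mode = (classes[m][0] - 0.5) + d1 / (d1 + d2) * i

print(round(mean, 2), round(median, 2), round(mode, 2))  # 52.9 53.45 53.88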
3.2.3 Comparison
Although there are many types of averages, the three measures that were
discussed are considered the simplest and the most important of all.
In the case of the mean, the following are some of the observations that
can be made.
a) The mean always exists in any distribution. This implies that
for any set of data, the mean can always be computed
b) The value of the mean in any distribution is unique. This
implies that for any distribution, there is only one possible value of
the mean
c) Its computation takes into consideration all the values in the
distribution
In the case of the median, we have the following observations.
a) Like the mean, the median also exists in any distribution
b) The value of the median is also unique
c) This is a positional measure
For the third measure, the mode has the following characteristics:
a) It does not always exist
b) If the mode exists, it is not always unique
c) In determining the value of the mode, it does not take into account
all the values in the distribution
Skewness
consider the variation of the items relative to the central value of the
distribution.
3.3.2 Variance
Variability can also be defined in terms of how close the scores in the
distribution are to the middle of the distribution. Using the mean as the
measure of the middle of the distribution, the variance is defined as the
average squared difference of the scores from the mean. The formula for
variance (s^2) for grouped data is given below:

s^2 = \frac{\sum f (x_m - \bar{x})^2}{n}

where
x_m = the midpoint of each class interval
\bar{x} = the mean
f = the frequency of each class
n = the sample size
To be able to apply the formula for the variance, we shall consider the
steps below
Step 1. Compute the value of the mean
Step 2. Determine the deviation by subtracting the mean from
the midpoint of each class interval
Step 3. Square the deviations obtained in Step 2
Step 4. Multiply the frequencies by their corresponding squared
deviations
Step 5. Add the results in Step 4
Step 6. Divide the result in Step 5 by the sample size
The standard deviation is then simply the square root of the variance.
Try this!
Compute the Range, Variance and Standard Deviation of the example
given earlier (Computation of Measures of Central Tendency).
Range
R = H – L = 94 – 11 = 83
Variance
First, we will reproduce the frequency distribution. Applying the steps
stated before, we have
(Here d = x_m - \bar{x}, the deviation of each midpoint from the mean.)

Scores     f     x_m     fx_m        d        d^2       f·d^2
11–22      3    16.5     49.5     -36.4    1324.96     3974.88
23–34      5    28.5    142.5     -24.4     595.36     2976.80
35–46     11    40.5    445.5     -12.4     153.76     1691.36
47–58     19    52.5    997.5      -0.4       0.16        3.04
59–70     14    64.5    903.0      11.6     134.56     1883.84
71–82      6    76.5    459.0      23.6     556.96     3341.76
83–94      2    88.5    177.0      35.6    1267.36     2534.72

Total: \sum f(x_m - \bar{x})^2 = 16406.40

Hence, s^2 = \frac{16406.40}{60} = 273.44
Standard Deviation

It is just the square root of the variance, so

\sigma = \sqrt{273.44} = 16.54
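A short sketch confirming the variance and standard deviation from the same
table, written as (midpoint, frequency) pairs:

# (midpoint, frequency) pairs from the frequency table above
table = [(16.5, 3), (28.5, 5), (40.5, 11), (52.5, 19),
         (64.5, 14), (76.5, 6), (88.5, 2)]
n = sum(f for _, f in table)                               # 60
mean = sum(f * x for x, f in table) / n                    # 52.9
variance = sum(f * (x - mean) ** 2 for x, f in table) / n  # 273.44
sd = variance ** 0.5
print(round(variance, 2), round(sd, 2))                    # 273.44 16.54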
For a sample, the standard deviation formula is

s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}
Just remember that Standard Deviation will always be the square root of
the Variance.
The important change in the formula is "n - 1" instead of "n" (which is
called Bessel's correction); the steps of the calculation are otherwise
the same. The symbol will also change to reflect that we are working on
a sample instead of the whole population (σ is changed to s when using
the sample SD). Why take a sample?
Mostly because it is easier and cheaper. Imagine you want to know what
the whole university thinks. You cannot ask thousands of people, so
instead you may ask only 300. As Samuel Johnson once said, "You don't
have to eat the whole ox to know that the meat is tough."
More notes on Standard Deviation
The Standard Deviation is simply the square root of the variance. It is an
especially useful measure of variability when the distribution is normal or
approximately normal because the proportion of the distribution within a
given number of standard deviations from the mean can be calculated.
For example, approximately 68% of the distribution is within one standard deviation of the
mean and approximately 95% of the distribution is within two standard
deviations of the mean. Therefore, if you have a normal distribution with a
mean of 50 and a standard deviation of 10, then 68% of the distribution
would be between 50 – 10 = 40 and 50 + 10 = 60. Similarly, about 95% of
the distribution would be between 50 – (2 x 10) = 30 and 50 + (2 x 10) =
70. The symbol for the population standard deviation is σ.
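These proportions can be checked with Python's statistics.NormalDist; a
quick sketch for the distribution above (mean 50, SD 10):

from statistics import NormalDist

dist = NormalDist(mu=50, sigma=10)
print(round(dist.cdf(60) - dist.cdf(40), 4))  # 0.6827: within 1 SD of the mean
print(round(dist.cdf(70) - dist.cdf(30), 4))  # 0.9545: within 2 SD of the mean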
Standard deviation is a measure of dispersion: the more dispersed the
data, the less consistent the data are. A lower standard deviation means
that the data are more clustered around the mean and hence the data set
is more consistent.